Share transformer layers intelligently to slash model size while preserving accuracy.
Nice paper from @GoogleDeepMind
Recursive Transformers with layer-wise LoRA achieve full-model performance using half the parameters
📚 https://arxiv.org/abs/2410.20672
🎯 Original Problem:
Parameter sharing has shown promise for shrinking LLMs, but in modern architectures existing approaches lose too much accuracy to justify the parameter savings.
-----
🔧 Solution in this Paper:
• Introduces Recursive Transformers: a single block of unique layers is reused in a loop across depth, so the looped layers share one set of weights
• Relaxes the sharing with layer-specific (depth-wise) LoRA modules, giving each loop cheap room to specialize (first sketch below)
• Proposes novel initialization techniques that derive the shared block from a pretrained full-size model (second sketch below):
- Stepwise: Selects layers at even intervals, keeping the first and last fixed
- Average: Averages the weights of the layers that get tied together
- Lower: Reuses the weights of the first K layers
• Introduces Continuous Depth-wise Batching: because every loop depth runs the same weights, requests at different depths can be packed into one forward pass
• Implements an early-exit mechanism so confident predictions skip the remaining loops (both shown in the third sketch below)
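To make the core idea concrete, here is a minimal PyTorch sketch of a recursive block with depth-wise LoRA. A toy feed-forward layer stands in for a full attention block, and the names (`RecursiveBlock`, `LoRA`) and hyperparameters are illustrative assumptions, not the paper's code:
```python
import torch
import torch.nn as nn

class LoRA(nn.Module):
    """Low-rank delta added on top of a shared (tied) projection."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, dim) * 0.02)
        self.B = nn.Parameter(torch.zeros(dim, rank))   # zero init: starts as a no-op

    def forward(self, x):
        return x @ self.A.T @ self.B.T

class RecursiveBlock(nn.Module):
    """One shared 'layer' looped num_loops times, relaxed by a separate
    LoRA adapter per loop depth (depth-wise LoRA)."""
    def __init__(self, dim, num_loops, rank=8):
        super().__init__()
        self.shared = nn.Linear(dim, dim)               # the tied weights
        self.loras = nn.ModuleList(LoRA(dim, rank) for _ in range(num_loops))
        self.norm = nn.LayerNorm(dim)
        self.num_loops = num_loops

    def forward(self, x):
        for depth in range(self.num_loops):             # reuse the same block at every depth
            h = self.shared(x) + self.loras[depth](x)   # shared weights + per-depth delta
            x = x + torch.relu(self.norm(h))            # residual update
        return x

# Toy usage: (batch=2, seq=16, dim=512), 3 loops over one shared block.
model = RecursiveBlock(dim=512, num_loops=3)
print(model(torch.randn(2, 16, 512)).shape)             # torch.Size([2, 16, 512])
```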
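The three initialization strategies can be sketched as below. This assumes a cyclic looping scheme where shared layer j replaces original layers j, j+K, j+2K, ...; the function and variable names are illustrative, not the released code:
```python
import torch

def init_shared_block(layer_weights, k):
    """Build the K shared layers from an existing N-layer model's weights.
    `layer_weights` holds one weight tensor per original layer."""
    n = len(layer_weights)

    # Stepwise: pick layers at (roughly) even intervals, keeping the first and last.
    idx = torch.linspace(0, n - 1, steps=k).round().long().tolist()
    stepwise = [layer_weights[i].clone() for i in idx]

    # Average: each shared layer j is the mean of the original layers that
    # will be tied to it in the looped model (j, j+k, j+2k, ...).
    average = [torch.stack(layer_weights[j::k]).mean(dim=0) for j in range(k)]

    # Lower: simply reuse the first K layers.
    lower = [w.clone() for w in layer_weights[:k]]
    return stepwise, average, lower

# Toy usage: compress 18 layers of one 4x4 weight matrix into a 6-layer shared block.
weights = [torch.randn(4, 4) for _ in range(18)]
stepwise, average, lower = init_shared_block(weights, k=6)
```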
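Finally, a heavily simplified sketch of how continuous depth-wise batching and early exit fit together at serving time. The threshold, the confidence rule, and all names are my own assumptions; a real serving loop would also handle KV caches and request scheduling:
```python
import torch
import torch.nn as nn

@torch.no_grad()
def depthwise_batched_step(block, lm_head, states, depths, max_loops=3, threshold=0.9):
    """One decoding step: hidden states sitting at *different* loop depths are
    packed into a single forward pass, since every depth reuses the same
    weights; confident predictions exit early and free their batch slot.

    states: (B, D) last-position hidden states of in-flight requests
    depths: (B,)   loops already completed by each request"""
    states = states + block(states)                     # one shared-weight pass for all depths
    depths = depths + 1
    probs = torch.softmax(lm_head(states), dim=-1)
    conf, tokens = probs.max(dim=-1)
    done = (conf >= threshold) | (depths >= max_loops)  # early exit, or max recursion reached
    return states, depths, done, tokens

# Toy usage: 4 requests at mixed loop depths share one forward pass.
block, head = nn.Linear(64, 64), nn.Linear(64, 100)
states, depths = torch.randn(4, 64), torch.tensor([0, 1, 2, 0])
states, depths, done, tokens = depthwise_batched_step(block, head, states, depths)
print(done, tokens)
```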
-----
💡 Key Insights:
• Parameter sharing with careful initialization can maintain performance while reducing model size
• Layer-specific LoRA modules provide flexible trade-off between model size and accuracy
• Continuous batching across depths enables significant throughput improvements
• Early-exiting combined with recursive architecture amplifies serving efficiency
-----
📊 Results:
• Recursive Gemma 1B showed a 13.5% accuracy improvement over non-recursive baselines
• Achieved 2-3x throughput improvement through continuous depth-wise batching
• Recovered most of the original full-size model's performance (pretrained on 3T tokens) after uptraining on only 60B tokens
• Reduced model size by roughly 50% while maintaining competitive performance