"Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA"

The podcast on this paper was generated with Google's Illuminate.

Share transformer layers intelligently to slash model size while preserving accuracy.

Nice paper from @GoogleDeepMind

Recursive Transformers with layer-wise LoRA achieve full-model performance using half the parameters

📚 https://arxiv.org/abs/2410.20672

🎯 Original Problem:

Parameter sharing has long promised smaller LLMs, but it delivers limited gains in modern architectures: existing approaches struggle to cut the parameter count without sacrificing performance.

-----

🔧 Solution in this Paper:

• Introduces Recursive Transformers, which reuse a single shared block of layers by looping over it multiple times

• Adds layer-specific LoRA modules on top of the shared weights, relaxing strict weight tying (the "Relaxed" variant) for a better size-accuracy trade-off; a minimal sketch follows this list

• Proposes novel initialization techniques for the shared layers:

- Stepwise: selects layers at regular intervals while keeping the first and last layers fixed

- Average: initializes each shared layer by averaging the weights of the layers tied to it

- Lower: reuses the weights of the first K layers

• Introduces Continuous Depth-wise Batching, which lets requests at different loop depths share a single forward pass because the weights are identical across depths

• Pairs this with an early-exiting mechanism, so confident predictions can leave the loop early for more efficient inference
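
A minimal PyTorch-style sketch of the core idea, assuming a toy residual sublayer in place of a full Transformer layer and a layer count L divisible by the shared-block size K. The names (`LoRALinear`, `RecursiveBlock`, `init_shared_layers`) and hyperparameters are illustrative assumptions, not the paper's code; only the structure follows the paper: K shared layers looped several times, a separate LoRA adapter per loop, and shared weights initialized from a pretrained stack via Stepwise/Average/Lower selection.

```python
# Minimal sketch (not the authors' implementation) of a recursive block
# with layer-wise LoRA and the three shared-layer initialization schemes.
import copy
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Shared linear layer plus a small low-rank, loop-specific delta."""

    def __init__(self, shared: nn.Linear, rank: int = 8):
        super().__init__()
        self.shared = shared  # tied weights, reused across all loop iterations
        self.A = nn.Parameter(torch.randn(rank, shared.in_features) * 0.02)
        self.B = nn.Parameter(torch.zeros(shared.out_features, rank))  # zero init => no delta at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shared(x) + (x @ self.A.t()) @ self.B.t()


class RecursiveBlock(nn.Module):
    """K unique layers looped n_loops times; each loop has its own LoRA set."""

    def __init__(self, shared_layers: nn.ModuleList, n_loops: int, rank: int = 8):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.ModuleList(LoRALinear(layer, rank) for layer in shared_layers)
            for _ in range(n_loops)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for loop_adapters in self.adapters:       # unrolled recursion over depth
            for layer in loop_adapters:
                x = x + torch.relu(layer(x))      # toy residual sublayer, not a full Transformer layer
        return x


def init_shared_layers(pretrained: nn.ModuleList, k: int, method: str) -> nn.ModuleList:
    """Build K shared layers from an L-layer pretrained stack (L divisible by K assumed)."""
    L = len(pretrained)
    if method == "lower":        # Lower: reuse the first K layers as-is
        picks = [copy.deepcopy(pretrained[i]) for i in range(k)]
    elif method == "stepwise":   # Stepwise: layers at regular intervals, keeping first and last
        idx = torch.linspace(0, L - 1, k).round().long().tolist()
        picks = [copy.deepcopy(pretrained[i]) for i in idx]
    elif method == "average":    # Average: mean of the weights of the layers tied together
        picks = []
        for i in range(k):
            tied = [pretrained[i + t * k] for t in range(L // k)]
            merged = copy.deepcopy(tied[0])
            with torch.no_grad():
                for params in zip(merged.parameters(), *(m.parameters() for m in tied)):
                    params[0].copy_(torch.stack(params[1:]).mean(dim=0))
            picks.append(merged)
    else:
        raise ValueError(f"unknown init method: {method}")
    return nn.ModuleList(picks)


# Toy usage: an 8-layer "pretrained" stack shared as 4 layers looped twice.
d, L, K = 64, 8, 4
pretrained = nn.ModuleList(nn.Linear(d, d) for _ in range(L))
shared = init_shared_layers(pretrained, K, method="stepwise")
model = RecursiveBlock(shared, n_loops=L // K, rank=8)
out = model(torch.randn(2, 16, d))                # (batch, seq_len, hidden)
```

The design point the sketch tries to show: the large weight matrices exist only once and are tied across loops, while only the tiny per-loop LoRA matrices differ, which is what keeps the parameter count near half while restoring some layer-to-layer flexibility.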

-----

💡 Key Insights:

• Parameter sharing with careful initialization can maintain performance while reducing model size

• Layer-specific LoRA modules provide a flexible trade-off between model size and accuracy

• Continuous batching across depths enables significant throughput improvements (a serving-loop sketch follows this list)

• Early exiting combined with the recursive architecture amplifies serving efficiency
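
The serving claim can be illustrated with a small scheduling simulation. This is a conceptual sketch, not the paper's serving system: `max_batch`, `n_loops`, the random `exits_early` rule, and the `Request` bookkeeping are all assumptions. It demonstrates the mechanism behind the two insights above: since every loop iteration runs the same shared block, requests sitting at different depths can be packed into one forward pass, and an early exit immediately frees a batch slot for a waiting request.

```python
# Conceptual simulation of continuous depth-wise batching with early exits.
from collections import deque
from dataclasses import dataclass
import random


@dataclass
class Request:
    rid: int
    depth: int = 0                  # completed loop iterations for this request


def exits_early(req: Request) -> bool:
    # Placeholder confidence test; a real system would threshold the
    # intermediate prediction's confidence at this depth.
    return random.random() < 0.3


def serve(num_requests: int = 12, n_loops: int = 3, max_batch: int = 4) -> None:
    waiting = deque(Request(rid=i) for i in range(num_requests))
    in_flight: list[Request] = []
    passes = 0
    while waiting or in_flight:
        # Continuous batching: refill free slots with waiting requests.
        while waiting and len(in_flight) < max_batch:
            in_flight.append(waiting.popleft())
        # One pass through the shared block advances every in-flight request
        # by one depth, even though they sit at different depths.
        passes += 1
        print(f"pass {passes}: batched depths {[r.depth for r in in_flight]}")
        still_running = []
        for req in in_flight:
            req.depth += 1
            # A request finishes by reaching the final loop or by exiting early;
            # either way its slot is freed for the next waiting request.
            if req.depth < n_loops and not exits_early(req):
                still_running.append(req)
        in_flight = still_running
    print(f"served {num_requests} requests in {passes} shared-block passes")


random.seed(0)
serve()
```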

-----

📊 Results:

• Recursive Gemma 1B showed a 13.5% accuracy improvement over similarly sized non-recursive baselines

• Achieved a 2-3x throughput improvement through Continuous Depth-wise Batching

• Recovered the performance of the original full-size model (pretrained on 3T tokens) after uptraining on only 60B tokens

• Reduced model size by 50% while maintaining competitive performance
