Share transformer layers intelligently to slash model size while preserving accuracy.
Nice paper from @GoogleDeepMind
Recursive Transformers with layer-wise LoRA achieve full-model performance using half the parameters
📚 https://arxiv.org/abs/2410.20672
🎯 Original Problem:
Parameter sharing has shown promise for shrinking LLMs, but in modern architectures existing approaches lose too much accuracy to justify the parameter savings.
-----
🔧 Solution in this Paper:
• Introduces Recursive Transformers: a single block of unique layers is reused in a loop across depth, so the looped layers share one set of weights
• Relaxes the sharing with layer-specific (depth-wise) LoRA modules, giving each loop cheap room to specialize (first sketch below)
• Proposes novel initialization techniques that derive the shared block from a pretrained full-size model (second sketch below):
- Stepwise: Selects layers at even intervals, keeping the first and last fixed
- Average: Averages the weights of the layers that get tied together
- Lower: Reuses the weights of the first K layers
• Introduces Continuous Depth-wise Batching: because every loop depth runs the same weights, requests at different depths can be packed into one forward pass
• Implements an early-exit mechanism so confident predictions skip the remaining loops (both shown in the third sketch below)
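To make the core idea concrete, here is a minimal PyTorch sketch of a recursive block with depth-wise LoRA. A toy feed-forward layer stands in for a full attention block, and the names (`RecursiveBlock`, `LoRA`) and hyperparameters are illustrative assumptions, not the paper's code:
```python
import torch
import torch.nn as nn

class LoRA(nn.Module):
    """Low-rank delta added on top of a shared (tied) projection."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, dim) * 0.02)
        self.B = nn.Parameter(torch.zeros(dim, rank))   # zero init: starts as a no-op

    def forward(self, x):
        return x @ self.A.T @ self.B.T

class RecursiveBlock(nn.Module):
    """One shared 'layer' looped num_loops times, relaxed by a separate
    LoRA adapter per loop depth (depth-wise LoRA)."""
    def __init__(self, dim, num_loops, rank=8):
        super().__init__()
        self.shared = nn.Linear(dim, dim)               # the tied weights
        self.loras = nn.ModuleList(LoRA(dim, rank) for _ in range(num_loops))
        self.norm = nn.LayerNorm(dim)
        self.num_loops = num_loops

    def forward(self, x):
        for depth in range(self.num_loops):             # reuse the same block at every depth
            h = self.shared(x) + self.loras[depth](x)   # shared weights + per-depth delta
            x = x + torch.relu(self.norm(h))            # residual update
        return x

# Toy usage: (batch=2, seq=16, dim=512), 3 loops over one shared block.
model = RecursiveBlock(dim=512, num_loops=3)
print(model(torch.randn(2, 16, 512)).shape)             # torch.Size([2, 16, 512])
```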
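The three initialization strategies can be sketched as below. This assumes a cyclic looping scheme where shared layer j replaces original layers j, j+K, j+2K, ...; the function and variable names are illustrative, not the released code:
```python
import torch

def init_shared_block(layer_weights, k):
    """Build the K shared layers from an existing N-layer model's weights.
    `layer_weights` holds one weight tensor per original layer."""
    n = len(layer_weights)

    # Stepwise: pick layers at (roughly) even intervals, keeping the first and last.
    idx = torch.linspace(0, n - 1, steps=k).round().long().tolist()
    stepwise = [layer_weights[i].clone() for i in idx]

    # Average: each shared layer j is the mean of the original layers that
    # will be tied to it in the looped model (j, j+k, j+2k, ...).
    average = [torch.stack(layer_weights[j::k]).mean(dim=0) for j in range(k)]

    # Lower: simply reuse the first K layers.
    lower = [w.clone() for w in layer_weights[:k]]
    return stepwise, average, lower

# Toy usage: compress 18 layers of one 4x4 weight matrix into a 6-layer shared block.
weights = [torch.randn(4, 4) for _ in range(18)]
stepwise, average, lower = init_shared_block(weights, k=6)
```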
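Finally, a heavily simplified sketch of how continuous depth-wise batching and early exit fit together at serving time. The threshold, the confidence rule, and all names are my own assumptions; a real serving loop would also handle KV caches and request scheduling:
```python
import torch
import torch.nn as nn

@torch.no_grad()
def depthwise_batched_step(block, lm_head, states, depths, max_loops=3, threshold=0.9):
    """One decoding step: hidden states sitting at *different* loop depths are
    packed into a single forward pass, since every depth reuses the same
    weights; confident predictions exit early and free their batch slot.

    states: (B, D) last-position hidden states of in-flight requests
    depths: (B,)   loops already completed by each request"""
    states = states + block(states)                     # one shared-weight pass for all depths
    depths = depths + 1
    probs = torch.softmax(lm_head(states), dim=-1)
    conf, tokens = probs.max(dim=-1)
    done = (conf >= threshold) | (depths >= max_loops)  # early exit, or max recursion reached
    return states, depths, done, tokens

# Toy usage: 4 requests at mixed loop depths share one forward pass.
block, head = nn.Linear(64, 64), nn.Linear(64, 100)
states, depths = torch.randn(4, 64), torch.tensor([0, 1, 2, 0])
states, depths, done, tokens = depthwise_batched_step(block, head, states, depths)
print(done, tokens)
```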
-----
💡 Key Insights:
• Parameter sharing with careful initialization can maintain performance while reducing model size
• Layer-specific LoRA modules provide flexible trade-off between model size and accuracy
• Continuous batching across depths enables significant throughput improvements
• Early-exiting combined with recursive architecture amplifies serving efficiency
-----
📊 Results:
• Recursive Gemma 1B showed a 13.5% accuracy improvement over non-recursive baselines
• Achieved 2-3x throughput improvement through continuous depth-wise batching
• Recovered most of the original full-size model's performance (pretrained on 3T tokens) after uptraining on only 60B tokens
• Reduced model size by roughly 50% while maintaining competitive performance