Inheritune: Training Smaller Yet More Attentive Language Models
Inheritune trains compact LLMs by leveraging early layers of larger models, matching performance with fewer parameters.
Original Problem 🔍:
Attention degeneration in the deeper layers of LLMs: attention matrices lose rank and collapse toward single-column (rank-1) matrices, producing inefficient "lazy layers" that can no longer learn meaningful representations.
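A minimal sketch of how one might probe a pre-trained GPT-2 for this kind of rank collapse, assuming the Hugging Face transformers and PyTorch libraries; the prompt and the 1% singular-value cutoff are illustrative choices, not the paper's exact diagnostic.

```python
# Probe the attention matrices of a pre-trained GPT-2 for near-rank-1 collapse.
# Assumptions: `transformers` and `torch` installed; 1% cutoff is illustrative.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True).eval()

text = "Attention degeneration tends to appear in the deeper layers of language models."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
for layer_idx, attn in enumerate(outputs.attentions):
    heads = attn[0]                          # (heads, seq, seq)
    ranks = []
    for head in heads:
        s = torch.linalg.svdvals(head)       # singular values of one attention matrix
        ranks.append(int((s > 0.01 * s[0]).sum()))   # count values above 1% of the largest
    mean_rank = sum(ranks) / len(ranks)
    print(f"layer {layer_idx:2d}: mean effective rank {mean_rank:.1f} across {len(ranks)} heads")
```

A mean effective rank close to 1 in a given layer is one rough signal of the degenerate, single-column attention behavior described above.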
Solution in this Paper 🧠:
• Inheritune: initialize a smaller model with the early layers of a larger pre-trained model (see the initialization sketch after this list)
• Train the smaller model for a fixed number of steps
• Progressively grow the model by adding layers until its performance matches that of the larger model
• Eliminates the structural inefficiency of lazy layers in current LLM architectures
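Below is a minimal sketch of the inheritance step, assuming both models use the Hugging Face GPT-2 implementation; "gpt2-medium" as the reference model and n_inherit=16 are illustrative choices, and the subsequent training and growing phases are omitted.

```python
# Sketch: initialize a smaller GPT-2 from the early layers of a larger one.
# Assumptions: "gpt2-medium" as the reference and n_inherit=16 are illustrative.
from transformers import GPT2LMHeadModel, GPT2Config

n_inherit = 16  # number of early transformer blocks to keep

# Larger pre-trained reference model (GPT-2 medium has 24 layers).
ref_model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# Smaller target model: same width and heads, but only n_inherit layers.
small_config = GPT2Config.from_pretrained("gpt2-medium", n_layer=n_inherit)
small_model = GPT2LMHeadModel(small_config)

# Copy every parameter whose name and shape match: token/position embeddings,
# transformer blocks 0..n_inherit-1, the final layer norm, and the LM head.
ref_state = ref_model.state_dict()
small_state = small_model.state_dict()
for name in small_state:
    if name in ref_state and ref_state[name].shape == small_state[name].shape:
        small_state[name] = ref_state[name].clone()
small_model.load_state_dict(small_state)

print(f"{sum(p.numel() for p in small_model.parameters()) / 1e6:.0f}M parameters after inheritance")
```

The inherited model is then trained further as usual; only if it still falls short of the reference does the growing step add more depth.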
Key Insights from this Paper 💡:
• Attention matrices in the deeper layers of standard LLMs often degenerate into single-column (rank-1) form
• Lazy layers with fully degenerate attention fail to learn meaningful representations
• Initializing with a larger model's early layers leads to better generalization and faster convergence
• Smaller models can match the performance of larger models once these inefficiencies are removed (see the growing sketch after this list)
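A minimal sketch of the growing step from the recipe above, assuming the inherited GPT-2 setup from the previous sketch; filling the newly added blocks from the reference model is an assumption for illustration, not necessarily the paper's exact growing rule, and the surrounding train/evaluate loop is omitted.

```python
# Sketch: deepen the smaller model by a few blocks when it still lags the reference.
# Borrowing the reference model's next blocks for the new layers is an assumption.
from transformers import GPT2LMHeadModel, GPT2Config

def grow_model(small_model, ref_model, ref_name="gpt2-medium", extra_layers=2):
    """Return a deeper copy of small_model: already-trained blocks are kept, and
    the newly added blocks are filled from ref_model where names and shapes match."""
    new_config = GPT2Config.from_pretrained(
        ref_name, n_layer=small_model.config.n_layer + extra_layers
    )
    new_model = GPT2LMHeadModel(new_config)
    new_state = new_model.state_dict()
    small_state = small_model.state_dict()
    ref_state = ref_model.state_dict()
    for name in new_state:
        if name in small_state and small_state[name].shape == new_state[name].shape:
            new_state[name] = small_state[name].clone()   # keep trained weights
        elif name in ref_state and ref_state[name].shape == new_state[name].shape:
            new_state[name] = ref_state[name].clone()     # initialize the added blocks
    new_model.load_state_dict(new_state)
    return new_model

# Example: extend a 16-layer variant toward the 24-layer reference by two blocks.
ref_model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
small_model = GPT2LMHeadModel(GPT2Config.from_pretrained("gpt2-medium", n_layer=16))
grown = grow_model(small_model, ref_model)
print(small_model.config.n_layer, "->", grown.config.n_layer, "layers")
```

In the full recipe, this growth step alternates with fixed training budgets until the smaller model's validation performance matches that of the reference model.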
Results 📊:
• 16-layer GPT-2 medium variant matches 24-layer GPT-2 medium performance
• Outperforms baselines such as stacking and knowledge distillation
• Faster convergence and better generalization than full-sized models
• Preserves effective attention patterns in deeper layers
• Achieves comparable downstream task performance with fewer parameters
Inheritune enables training smaller, more efficient LLMs without sacrificing performance, potentially democratizing LLM pre-training.