Inheritune: Training Smaller Yet More Attentive Language Models
Inheritune trains compact LLMs by leveraging early layers of larger models, matching performance with fewer parameters.
Original Problem 🔍:
Attention degeneration in the deeper layers of LLMs: attention matrices lose rank and collapse toward single-column (rank-1) matrices, producing inefficient "lazy layers" that can no longer learn meaningful representations.
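A minimal sketch of how one might probe a pre-trained GPT-2 for this kind of rank collapse, assuming the Hugging Face transformers and PyTorch libraries; the prompt and the 1% singular-value cutoff are illustrative choices, not the paper's exact diagnostic.

```python
# Probe the attention matrices of a pre-trained GPT-2 for near-rank-1 collapse.
# Assumptions: `transformers` and `torch` installed; 1% cutoff is illustrative.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True).eval()

text = "Attention degeneration tends to appear in the deeper layers of language models."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
for layer_idx, attn in enumerate(outputs.attentions):
    heads = attn[0]                          # (heads, seq, seq)
    ranks = []
    for head in heads:
        s = torch.linalg.svdvals(head)       # singular values of one attention matrix
        ranks.append(int((s > 0.01 * s[0]).sum()))   # count values above 1% of the largest
    mean_rank = sum(ranks) / len(ranks)
    print(f"layer {layer_idx:2d}: mean effective rank {mean_rank:.1f} across {len(ranks)} heads")
```

A mean effective rank close to 1 in a given layer is one rough signal of the degenerate, single-column attention behavior described above.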
Solution in this Paper 🧠:
• Inheritune: initialize a smaller model with the early layers of a larger pre-trained model (see the initialization sketch after this list)
• Train the smaller model for a fixed number of steps
• Progressively grow the model by adding layers until its performance matches that of the larger model
• Eliminates the structural inefficiency of lazy layers in current LLM architectures
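Below is a minimal sketch of the inheritance step, assuming both models use the Hugging Face GPT-2 implementation; "gpt2-medium" as the reference model and n_inherit=16 are illustrative choices, and the subsequent training and growing phases are omitted.

```python
# Sketch: initialize a smaller GPT-2 from the early layers of a larger one.
# Assumptions: "gpt2-medium" as the reference and n_inherit=16 are illustrative.
from transformers import GPT2LMHeadModel, GPT2Config

n_inherit = 16  # number of early transformer blocks to keep

# Larger pre-trained reference model (GPT-2 medium has 24 layers).
ref_model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# Smaller target model: same width and heads, but only n_inherit layers.
small_config = GPT2Config.from_pretrained("gpt2-medium", n_layer=n_inherit)
small_model = GPT2LMHeadModel(small_config)

# Copy every parameter whose name and shape match: token/position embeddings,
# transformer blocks 0..n_inherit-1, the final layer norm, and the LM head.
ref_state = ref_model.state_dict()
small_state = small_model.state_dict()
for name in small_state:
    if name in ref_state and ref_state[name].shape == small_state[name].shape:
        small_state[name] = ref_state[name].clone()
small_model.load_state_dict(small_state)

print(f"{sum(p.numel() for p in small_model.parameters()) / 1e6:.0f}M parameters after inheritance")
```

The inherited model is then trained further as usual; only if it still falls short of the reference does the growing step add more depth.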
Key Insights from this Paper 💡:
• Attention matrices in the deeper layers of standard LLMs often degenerate into single-column (rank-1) form
• Lazy layers with fully degenerate attention fail to learn meaningful representations
• Initializing with a larger model's early layers leads to better generalization and faster convergence
• Smaller models can match the performance of larger models once these inefficiencies are removed (see the growing sketch after this list)
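A minimal sketch of the growing step from the recipe above, assuming the inherited GPT-2 setup from the previous sketch; filling the newly added blocks from the reference model is an assumption for illustration, not necessarily the paper's exact growing rule, and the surrounding train/evaluate loop is omitted.

```python
# Sketch: deepen the smaller model by a few blocks when it still lags the reference.
# Borrowing the reference model's next blocks for the new layers is an assumption.
from transformers import GPT2LMHeadModel, GPT2Config

def grow_model(small_model, ref_model, ref_name="gpt2-medium", extra_layers=2):
    """Return a deeper copy of small_model: already-trained blocks are kept, and
    the newly added blocks are filled from ref_model where names and shapes match."""
    new_config = GPT2Config.from_pretrained(
        ref_name, n_layer=small_model.config.n_layer + extra_layers
    )
    new_model = GPT2LMHeadModel(new_config)
    new_state = new_model.state_dict()
    small_state = small_model.state_dict()
    ref_state = ref_model.state_dict()
    for name in new_state:
        if name in small_state and small_state[name].shape == new_state[name].shape:
            new_state[name] = small_state[name].clone()   # keep trained weights
        elif name in ref_state and ref_state[name].shape == new_state[name].shape:
            new_state[name] = ref_state[name].clone()     # initialize the added blocks
    new_model.load_state_dict(new_state)
    return new_model

# Example: extend a 16-layer variant toward the 24-layer reference by two blocks.
ref_model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
small_model = GPT2LMHeadModel(GPT2Config.from_pretrained("gpt2-medium", n_layer=16))
grown = grow_model(small_model, ref_model)
print(small_model.config.n_layer, "->", grown.config.n_layer, "layers")
```

In the full recipe, this growth step alternates with fixed training budgets until the smaller model's validation performance matches that of the reference model.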
Results 📊:
• 16-layer GPT-2 medium variant matches 24-layer GPT-2 medium performance
• Outperforms baselines such as stacking and knowledge distillation
• Faster convergence and better generalization than full-sized models
• Preserves effective attention patterns in deeper layers
• Achieves comparable downstream task performance with fewer parameters
Inheritune enables training smaller, more efficient LLMs without sacrificing performance, potentially democratizing LLM pre-training.