HyperCloning enables efficient transfer of knowledge from smaller pre-trained models to accelerate training of larger language models.
📚 https://arxiv.org/pdf/2409.12903
Original Problem 🔍:
Pre-training large language models from scratch is extremely slow and costly, while smaller models are cheaper to train but fall short in accuracy.
-----
Solution in this Paper 💡:
• Introduces HyperCloning, which expands the parameters of a pre-trained small model into a larger model
• Preserves the smaller model's functionality in the larger initialized model
• The larger model inherits the smaller model's predictive power before training starts
• Applies to linear layers, attention layers, normalization layers, and positional embeddings
• Expands the hidden dimensions while keeping the same number of layers (see the sketch below)
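To make the cloning step concrete, here is a minimal sketch for a single linear layer with an n-fold hidden-dimension expansion (PyTorch, illustrative names, not the authors' code): the small weight matrix is tiled and rescaled so that a duplicated input yields a duplicated output, which is why the expanded model matches the base model's predictions before any training.

```python
import torch

def clone_linear_symmetric(weight: torch.Tensor, n: int = 2) -> torch.Tensor:
    """Expand a (d_out, d_in) weight to (n*d_out, n*d_in).

    With the hidden state duplicated n times ([x; x; ...]), each output block
    sums n copies of W @ x; dividing by n therefore reproduces the original
    output y in every block, so the expanded layer computes [y; y; ...].
    """
    return weight.repeat(n, n) / n

# Quick function-preservation check on random data.
d_out, d_in, n = 8, 4, 2
W = torch.randn(d_out, d_in)
x = torch.randn(d_in)

y_small = W @ x
y_big = clone_linear_symmetric(W, n) @ x.repeat(n)  # duplicated hidden stream

assert torch.allclose(y_big, y_small.repeat(n), atol=1e-6)
```

Layers whose output size is not expanded (e.g., the final projection onto the vocabulary) would be tiled only along the input dimension with the same 1/n scaling so the logits stay unchanged; the paper applies analogous rules to attention, normalization, and embedding layers.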
-----
Key Insights from this Paper 💡:
• HyperCloning achieves 2-4x faster convergence compared to random initialization
• Provides better final accuracy given a finite training budget
• The symmetry in the expanded weights breaks naturally during training
• The expanded network ultimately uses its parameter space as effectively as a model trained from scratch
• The base model's size and accuracy impact the target model's convergence and final performance
-----
Results 📊:
• Evaluated on the OPT, Pythia, and OLMo model families
• 2.2-4x speedup in reaching the final accuracy of the random-initialization baseline
• Better final accuracy across 10 benchmark tasks
• Symmetric weight expansion outperforms diagonal (block-diagonal) expansion (see the sketch after this list)
• Initializing from a more accurate base model improves the target model's accuracy
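For contrast with the symmetric scheme sketched earlier, here is a hedged sketch of the diagonal alternative (again illustrative, not the authors' code): the base weights sit on the block diagonal, which is equally function-preserving at initialization but, per the results above, trains to lower accuracy.

```python
import torch

def clone_linear_diagonal(weight: torch.Tensor, n: int = 2) -> torch.Tensor:
    """Place n copies of the (d_out, d_in) weight on the block diagonal.

    Each replica of the duplicated hidden stream is processed by its own copy
    of W, so the output is again [y; y; ...] with no rescaling needed.
    """
    return torch.block_diag(*([weight] * n))
```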