
"Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization"

This podcast on the paper was generated with Google's Illuminate.

HyperCloning enables efficient transfer of knowledge from smaller pre-trained models to accelerate training of larger language models.

📚 https://arxiv.org/pdf/2409.12903

Original Problem 🔍:

Pre-training large language models from scratch is extremely slow and costly. Small models are cheaper to train but less accurate.

-----

Solution in this Paper 💡:

• Introduces HyperCloning, which expands the parameters of a pre-trained small model into a larger model

• Preserves the smaller model's functionality in the initialized larger model

• The larger model inherits the smaller model's predictive power before training starts

• Applies to linear layers, attention layers, normalization layers, and positional embeddings

• Expands hidden dimensions while keeping the same number of layers (see the sketch after this list)
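As a rough illustration of the core idea (not the paper's exact implementation), a function-preserving expansion of a single linear layer might look like the sketch below: the small weight matrix is tiled into the larger one and rescaled so the enlarged layer reproduces the small layer's output on a replicated input. The function name `hyperclone_linear` and the 2x expansion factor are illustrative assumptions.

```python
import torch

def hyperclone_linear(weight, bias=None, expand=2):
    """Expand a linear layer's weight (out_dim x in_dim) by `expand` along
    both dimensions so the enlarged layer reproduces the small layer's
    output (replicated `expand` times) when fed the small input replicated
    `expand` times.

    Tiling the weight and dividing by `expand` cancels the contribution of
    the duplicated input features; the bias is simply replicated along the
    output dimension.
    """
    big_weight = weight.repeat(expand, expand) / expand
    big_bias = bias.repeat(expand) if bias is not None else None
    return big_weight, big_bias

# Quick functional check: the cloned layer's output equals the small
# layer's output repeated along the feature dimension.
W = torch.randn(4, 3)
b = torch.randn(4)
x = torch.randn(3)
W2, b2 = hyperclone_linear(W, b, expand=2)
y_small = W @ x + b
y_big = W2 @ x.repeat(2) + b2
assert torch.allclose(y_big, y_small.repeat(2), atol=1e-6)
```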

-----

Key Insights from this Paper 💡:

• HyperCloning achieves 2-4x faster convergence compared to random initialization

• Provides better final accuracy under a finite training budget

• The symmetry in the expanded weights breaks naturally during training (a diagnostic sketch follows this list)

• The expanded network utilizes its parameter space as effectively as a network trained from scratch

• The base model's size and accuracy affect the target model's convergence and final performance
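A hedged diagnostic for the symmetry-breaking observation above: track the cosine similarity between the cloned blocks of an expanded weight. It starts at 1.0 right after HyperCloning and drifts downward as training updates the duplicated parameters differently. The helper `cloned_block_similarity` is a hypothetical name, not from the paper.

```python
import torch
import torch.nn.functional as F

def cloned_block_similarity(big_weight, small_shape):
    """Cosine similarity between the top-left block of an expanded weight
    and the block directly below it. Immediately after cloning the blocks
    are identical (similarity 1.0); the value drops as the duplicated
    parameters diverge during training, i.e. as the symmetry breaks.
    """
    rows, cols = small_shape
    top = big_weight[:rows, :cols].flatten()
    bottom = big_weight[rows:2 * rows, :cols].flatten()
    return F.cosine_similarity(top, bottom, dim=0).item()

# Example: a freshly cloned weight reports similarity 1.0.
W = torch.randn(4, 3)
W_big = W.repeat(2, 2) / 2
print(cloned_block_similarity(W_big, W.shape))  # 1.0
```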

-----

Results 📊:

• Evaluated on the OPT, Pythia, and OLMo model families

• 2.2-4x speedup in reaching the final accuracy of the randomly initialized baseline

• Better final accuracy across 10 benchmark tasks

• Symmetric weight expansion outperforms diagonal expansion (both patterns are sketched after this list)

• Initializing from a more accurate base model improves the target model's accuracy
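For the symmetric-versus-diagonal comparison, a minimal sketch of the two initialization patterns, assuming "diagonal expansion" means placing the small weight only on the diagonal blocks of the larger matrix (an interpretation, not a quote from the paper). Both variants preserve the small model's function on replicated inputs:

```python
import torch

def clone_symmetric(weight, expand=2):
    # Tile the small weight into every block and rescale: every block
    # contributes to the forward pass from step one.
    return weight.repeat(expand, expand) / expand

def clone_diagonal(weight, expand=2):
    # Place the small weight only on the diagonal blocks; off-diagonal
    # blocks start at zero (assumed reading of "diagonal expansion").
    return torch.block_diag(*[weight] * expand)

# Both preserve the small layer's function on a replicated input.
W = torch.randn(4, 3)
x = torch.randn(3)
y = W @ x
assert torch.allclose(clone_symmetric(W) @ x.repeat(2), y.repeat(2), atol=1e-6)
assert torch.allclose(clone_diagonal(W) @ x.repeat(2), y.repeat(2), atol=1e-6)
```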
