Pre-training loss, not size, unlocks emergent abilities in LLMs.
These abilities appear only once the pre-training loss drops below a specific threshold.
This paper shows that pre-training loss, not model size, predicts downstream performance, and identifies the loss threshold at which emergent abilities appear.
-----
https://arxiv.org/abs/2403.15796
Original Problem 🤔:
→ Emergent abilities in LLMs were thought to be exclusive to large models.
→ This was challenged by observations that smaller models could perform well on some tasks.
→ Doubts also arose about whether the apparent emergence was an artifact of the discontinuous metrics used to measure these abilities.
-----
Solution in this Paper 💡:
→ This paper proposes studying emergent abilities through the lens of pre-training loss.
→ They trained models from scratch across a range of parameter counts and training-token counts.
→ They tracked the performance of intermediate checkpoints on 12 downstream tasks as a function of pre-training loss (see the sketch below).
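A minimal sketch of this kind of analysis, assuming hypothetical checkpoint records of (model size, training tokens, pre-training loss, task accuracy); the values and the 0.1-wide loss bins are made up for illustration and are not from the paper:

```python
import math
from collections import defaultdict

# Hypothetical checkpoint records: (model size, training tokens, pre-training loss, task accuracy).
# All numbers are illustrative, not taken from the paper.
checkpoints = [
    ("1.5B", 200e9, 2.45, 0.25),
    ("6B",   100e9, 2.44, 0.26),
    ("32B",   40e9, 2.43, 0.25),
    ("6B",   500e9, 2.05, 0.41),
    ("32B",  300e9, 2.04, 0.42),
]

# Bucket checkpoints into 0.1-wide pre-training-loss bins and compare
# downstream accuracy across model sizes within each bin.
bins = defaultdict(list)
for size, tokens, loss, acc in checkpoints:
    bins[math.floor(loss * 10) / 10].append((size, acc))

for loss_bin, entries in sorted(bins.items()):
    accs = [acc for _, acc in entries]
    print(f"loss ~{loss_bin}: {entries}  (accuracy spread {max(accs) - min(accs):.2f})")
```

If the loss-based view holds, the accuracy spread within each loss bin should be small even though each bin mixes very different model sizes.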
-----
Key Insights from this Paper 🔑:
→ Models with the same pre-training loss show similar performance on downstream tasks, regardless of model or data size.
→ Emergent abilities appear when pre-training loss falls below a specific threshold.
→ Above this threshold, performance stays at the level of random guessing.
-----
Results ✨:
→ Models with the same pre-training loss achieve the same downstream performance, irrespective of parameter count or number of training tokens.
→ Performance on tasks like MMLU and GSM8K remains at random guess level until loss falls below ~2.2.
→ LLaMA models of varying sizes show a consistent upward performance trend as pre-training loss decreases.
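As a rough illustration of that threshold pattern, here is a toy piecewise model. The ~2.2 threshold comes from the result above; the 0.25 baseline corresponds to a 4-option multiple-choice task like MMLU; the linear ramp and its slope are my own assumptions, not the paper's fitted curve:

```python
def emergent_accuracy(pretrain_loss: float,
                      threshold: float = 2.2,        # ~2.2, as reported for MMLU/GSM8K
                      random_baseline: float = 0.25, # 4-option multiple-choice baseline
                      slope: float = 0.5) -> float:  # assumed slope, purely illustrative
    """Toy model of the emergence pattern: accuracy sits at the random-guess
    baseline while pre-training loss is above the threshold, then rises
    (here linearly, capped at 1.0) as the loss drops below it."""
    if pretrain_loss >= threshold:
        return random_baseline
    return min(1.0, random_baseline + slope * (threshold - pretrain_loss))

# Flat at ~0.25 until the loss crosses 2.2, then climbing:
for loss in (2.6, 2.4, 2.2, 2.0, 1.8):
    print(f"loss={loss:.1f} -> accuracy ~{emergent_accuracy(loss):.2f}")
```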