Sparse pre-training matches dense LLM quality at equal training compute while yielding a smaller final model.
This paper presents a simple method for training sparse large language models (LLMs) that achieves the same quality as dense models trained with equivalent compute, but with a smaller final parameter count.
-----
Paper - https://arxiv.org/abs/2501.12486
Original Problem 🤔:
→ Large language models (LLMs) are computationally expensive to train and deploy.
→ Existing pruning approaches, such as those based on the Lottery Ticket Hypothesis, require repeated train-prune-retrain cycles and are computationally prohibitive at LLM scale.
→ Sparse pre-training offers a potential solution, but the optimal configurations (sparsity levels, pruning schedules, training durations) are unknown.
-----
Solution in this Paper 💡:
→ This paper studies sparse pre-training, which combines pruning and pre-training.
→ The paper proposes using the average parameter count over the course of training, rather than the final count, in scaling laws (see the sketch after this list).
→ It systematically explores 80 pruning schedules, varying sparsity levels and training durations, to find optimal configurations.
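For intuition, here is a minimal reading of the average-parameter-count idea in a Chinchilla-style loss form; the notation below is illustrative rather than copied from the paper:

```latex
% Chinchilla-style loss with the model-size term evaluated at the average
% parameter count over training rather than the final count (illustrative notation).
L(N_{\text{avg}}, D) = E + \frac{A}{N_{\text{avg}}^{\alpha}} + \frac{B}{D^{\beta}},
\qquad
N_{\text{avg}} = \frac{1}{T}\int_{0}^{T} N(t)\,dt
```

Here N(t) is the number of non-zero parameters at training step t and D is the number of training tokens; for a dense run N(t) is constant, so the expression reduces to the standard Chinchilla form.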
-----
Key Insights from this Paper 🔑:
→ Starting pruning at 25% of total training compute and ending at 75% yields near-optimal final loss (a schedule sketch follows this list).
→ Sparse pre-training performs best with the same hyperparameters (learning rate and batch size) used for dense pre-training.
→ A modified Chinchilla scaling law using average parameter count accurately predicts loss for both sparsely and densely trained LLMs.
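To make the 25%-75% schedule concrete, here is a minimal sketch assuming a cubic sparsity ramp in the style of gradual magnitude pruning; the function names, 50% target sparsity, and step counts are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch: prune gradually between 25% and 75% of total training
# steps using a cubic sparsity ramp (gradual-magnitude-pruning style), then
# compute the average parameter count that the scaling law uses.

def sparsity_at_step(step, total_steps, final_sparsity=0.5,
                     start_frac=0.25, end_frac=0.75):
    """Fraction of weights pruned at a given training step (illustrative schedule)."""
    start, end = start_frac * total_steps, end_frac * total_steps
    if step < start:
        return 0.0                        # dense warm-up phase
    if step >= end:
        return final_sparsity             # target sparsity reached; keep training sparse
    progress = (step - start) / (end - start)
    return final_sparsity * (1.0 - (1.0 - progress) ** 3)  # cubic ramp


def average_param_count(dense_params, total_steps, **schedule_kwargs):
    """Average non-zero parameter count over training."""
    counts = [dense_params * (1.0 - sparsity_at_step(t, total_steps, **schedule_kwargs))
              for t in range(total_steps)]
    return sum(counts) / total_steps


if __name__ == "__main__":
    # Example: a 1.14B-parameter dense model pruned to 50% final sparsity.
    n_avg = average_param_count(dense_params=1.14e9, total_steps=10_000)
    print(f"average parameter count ≈ {n_avg:,.0f}")
```

Under this schedule the model trains dense for the first quarter of the budget, reaches its target sparsity at the three-quarter mark, and trains at fixed sparsity thereafter, which is what places the average parameter count between the dense and final counts.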
-----
Results 💯:
→ Sparse models match dense models in final perplexity when their average parameter counts match, even at the larger 1.14B-parameter scale.
→ Sparse and dense model pairs with matching average parameter counts also achieve comparable results on downstream tasks (PIQA, ARC, Lambada, Winogrande).
→ Sparse pre-training delivers up to a 2x reduction in final model size with no loss in quality relative to dense training.