"The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws"

The accompanying podcast was generated with Google's Illuminate.

Sparse pre-training matches dense LLM quality at equivalent compute while producing a smaller final model.

This paper presents a simple method for training sparse large language models (LLMs) that matches the quality of dense models at equivalent compute while yielding a smaller final model.

-----

Paper - https://arxiv.org/abs/2501.12486

Original Problem 🤔:

→ Large language models (LLMs) are computationally expensive to train and deploy.

→ Existing pruning approaches, such as iterative magnitude pruning motivated by the Lottery Ticket Hypothesis, are computationally prohibitive at LLM scale.

→ Sparse pre-training offers a potential solution, but its optimal configurations (sparsity levels and pruning schedules) are unknown.

-----

Solution in this Paper 💡:

→ This paper studies sparse pre-training, which integrates pruning directly into the pre-training run rather than pruning a fully trained model afterwards.

→ The paper proposes using the average parameter count over the course of training, instead of the final count, as the model-size term in scaling laws (a sketch of this idea follows the list below).

→ It systematically explores 80 pruning schedules, varying sparsity levels and training durations, to find optimal configurations.
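
As a rough illustration of the average-parameter-count idea, here is a minimal Python sketch that computes the average over a run with a gradual pruning schedule. The cubic ramp shape, the 25%/75% start and end points, and all concrete numbers are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def sparsity_at(step, total_steps, final_sparsity, start_frac=0.25, end_frac=0.75):
    """Fraction of weights pruned at a given step.

    Dense until start_frac of training, at final_sparsity by end_frac,
    with a cubic ramp in between (an assumed schedule shape).
    """
    start, end = start_frac * total_steps, end_frac * total_steps
    if step <= start:
        return 0.0
    if step >= end:
        return final_sparsity
    progress = (step - start) / (end - start)
    return final_sparsity * (1.0 - (1.0 - progress) ** 3)

# Illustrative numbers, not from the paper.
dense_params = 1.14e9     # starting (dense) parameter count
final_sparsity = 0.5      # half the weights pruned by the end of training
total_steps = 10_000

params_per_step = [
    dense_params * (1.0 - sparsity_at(t, total_steps, final_sparsity))
    for t in range(total_steps)
]

print(f"final parameter count:   {params_per_step[-1]:.2e}")
print(f"average parameter count: {np.mean(params_per_step):.2e}  # used in the scaling law")
```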

-----

Key Insights from this Paper 🔑:

→ Starting pruning at 25% of total compute and ending at 75% yields near-optimal final loss.

→ Sparse pre-training performs best with the same hyperparameters (learning rate and batch size) as the initial dense model.

→ A modified Chinchilla scaling law that substitutes average parameter count for final parameter count accurately predicts loss for both sparsely and densely trained LLMs (see the sketch after this list).
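
Below is a minimal sketch of what such a law looks like, assuming the standard Chinchilla functional form with the average parameter count N_avg in place of the final count. The coefficients are the widely cited Chinchilla fits used as placeholders, not the paper's refitted values.

```python
def predicted_loss(n_avg, tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style loss prediction using average parameter count.

    L(N_avg, D) = E + A / N_avg**alpha + B / D**beta

    Coefficients are the commonly cited Chinchilla fits, used here only as
    placeholders; the paper refits them on its own sparse and dense runs.
    """
    return E + A / n_avg ** alpha + B / tokens ** beta

# Example: a run whose average parameter count is 0.86B, trained on 100B tokens
# (illustrative numbers).
print(f"predicted loss: {predicted_loss(n_avg=0.86e9, tokens=100e9):.3f}")
```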

-----

Results 💯:

→ Sparse and dense models with matching average parameter counts reach the same final perplexity, even at larger scales (1.14B parameters).

→ Sparse and dense model pairs with matching average parameter counts achieve comparable results on downstream tasks (PIQA, ARC, Lambada, Winogrande).

→ Sparse pre-training offers up to a 2x lossless compression rate compared to dense training, i.e., the final sparse model can be about half the size of an equivalent-quality dense model.