"The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18965
Learning rate schedules for training large models are chosen largely by trial and error, which hinders theoretical understanding and makes tuning expensive. This paper addresses that gap by demonstrating a surprising alignment between these schedules and a suboptimality bound from convex optimization theory.
This paper proposes to use a convex optimization bound to explain and improve learning rate scheduling in deep learning.
-----
📌 Convex optimization theory offers a surprisingly accurate framework for understanding and tuning learning rate schedules in deep learning. This allows for theoretically-grounded schedule design, moving beyond empirical trial and error.
📌 The Warmup Stable Decay schedule's cooldown phase is theoretically justified by the absence of logarithmic terms in the derived bound, directly explaining its practical performance boost over constant schedules.
📌 The paper's bound enables learning rate transfer across schedules and training horizons. This significantly reduces hyperparameter tuning costs for continued training and schedule selection in large models.
----------
Methods Explored in this Paper 🔧:
→ This paper uses a last-iterate suboptimality bound for stochastic gradient descent on convex functions to analyze learning rate schedules.
→ The research derives a specific bound for the Warmup Stable Decay (WSD) schedule, highlighting how its cooldown phase eliminates logarithmic terms in the bound, unlike constant schedules.
→ The paper compares the theoretical convergence behavior of Cosine and WSD schedules by calculating and plotting this bound for varying training horizons and base learning rates.
→ It investigates the impact of cooldown length in WSD and the shape of gradient norm bounds on the theoretical convergence.
→ The study validates the theoretical findings using empirical results from training Llama style transformer models and by comparing the theoretical bound with PEP lower bounds.
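The schedule comparison above can be sketched in a few lines. This is a minimal illustration, not the paper's exact analysis: it uses the classical average-iterate convex-SGD bound as a simpler stand-in for the paper's last-iterate bound, and the constants `D`, `G` and the warmup/cooldown fractions are illustrative assumptions.

```python
import math

def wsd(T, base_lr=1.0, warmup_frac=0.05, cooldown_frac=0.2):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear cooldown."""
    warm, cool = int(warmup_frac * T), int(cooldown_frac * T)
    lrs = []
    for t in range(T):
        if t < warm:                      # linear warmup
            lrs.append(base_lr * (t + 1) / warm)
        elif t < T - cool:                # stable plateau
            lrs.append(base_lr)
        else:                             # linear cooldown toward 0
            lrs.append(base_lr * (T - t) / cool)
    return lrs

def cosine(T, base_lr=1.0):
    """Cosine decay from base_lr to 0 (warmup omitted for simplicity)."""
    return [base_lr * 0.5 * (1 + math.cos(math.pi * t / T)) for t in range(T)]

def avg_iterate_bound(lrs, D=1.0, G=1.0):
    """Classical convex-SGD bound (D^2 + G^2 * sum(eta_t^2)) / (2 * sum(eta_t)).
    The paper uses a sharper last-iterate bound; this simpler version has the
    same qualitative dependence on the schedule."""
    s1 = sum(lrs)
    s2 = sum(lr * lr for lr in lrs)
    return (D * D + G * G * s2) / (2 * s1)

T = 1000
print("WSD bound:   ", avg_iterate_bound(wsd(T, base_lr=0.05)))
print("Cosine bound:", avg_iterate_bound(cosine(T, base_lr=0.05)))
```

Evaluating such a bound at every intermediate horizon, as the paper does, produces theoretical "loss curves" that can be plotted against empirical ones.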
-----
Key Insights 💡:
→ The theoretical suboptimality bound closely mirrors the empirical loss curves observed during large model training, particularly capturing the sudden loss drop during the WSD cooldown.
→ The optimal base learning rate, derived from minimizing the bound, scales inversely with the square root of the training horizon and differs predictably between Cosine and WSD schedules.
→ The cooldown period in WSD is theoretically justified as it leads to a bound without logarithmic terms, explaining its practical benefits over constant learning rates.
→ Theoretical simulations suggest that linear decay is the optimal cooldown strategy when the base learning rate is fully tuned.
-----
Results 📊:
→ Schedule adaptation for continued training, guided by theory, improves validation loss by approximately 0.01 compared to naive continuation.
→ Based on scaling laws, this 0.01 improvement in validation loss is estimated to be equivalent to roughly 6,000 to 14,500 additional training steps, or to a 4-5% increase in model size.
→ Learning rate transfer from a WSD schedule with 20% cooldown to linear decay, based on theoretical predictions, achieves a final validation loss of 2.9535, compared to 2.9660 for the best 20% cooldown run.
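The learning-rate transfer in the last result can be sketched under the same simplified bound used above. Equating the bound-optimal γ* = D/(G·√(Σs_t²)) for two schedule shapes makes D and G cancel, leaving a pure ratio of schedule norms; the tuned learning rate value is a placeholder, and the closed form comes from the classical average-iterate bound rather than the paper's sharper last-iterate bound.

```python
import math

def transfer_lr(tuned_lr, shape_from, shape_to):
    """Transfer a tuned base LR between schedule shapes: since the
    bound-optimal g* = D / (G * sqrt(sum(s_t^2))) for each shape, the
    unknown constants D and G cancel in the ratio."""
    norm_from = math.sqrt(sum(s * s for s in shape_from))
    norm_to = math.sqrt(sum(s * s for s in shape_to))
    return tuned_lr * norm_from / norm_to

T = 1000
# WSD shape with a 20% linear cooldown (warmup omitted for simplicity)
wsd20 = [1.0] * 800 + [(200 - i) / 200 for i in range(200)]
# Full linear-decay schedule shape
linear = [(T - t) / T for t in range(T)]

print(transfer_lr(3e-3, wsd20, linear))   # placeholder tuned LR of 3e-3
```

Because linear decay spends less time at the peak rate, its schedule norm is smaller and the transferred base learning rate comes out larger, matching the intuition that more aggressively decaying schedules tolerate higher peaks.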