"The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18965
Learning rate schedules for training large models are chosen largely by trial and error, which hinders theoretical understanding and makes tuning expensive. This paper addresses that gap by demonstrating a surprising alignment between these schedules and a suboptimality bound from convex optimization theory.
This paper proposes to use a convex optimization bound to explain and improve learning rate scheduling in deep learning.
-----
📌 Convex optimization theory offers a surprisingly accurate framework for understanding and tuning learning rate schedules in deep learning. This allows for theoretically-grounded schedule design, moving beyond empirical trial and error.
📌 The Warmup Stable Decay schedule's cooldown phase is theoretically justified by the absence of logarithmic terms in the derived bound, directly explaining its practical performance boost over constant schedules.
📌 The paper's bound enables learning rate transfer across schedules and training horizons. This significantly reduces hyperparameter tuning costs for continued training and schedule selection in large models.
----------
Methods Explored in this Paper 🔧:
→ This paper uses a last-iterate suboptimality bound for stochastic gradient descent on convex functions to analyze learning rate schedules.
→ The research derives a specific bound for the Warmup Stable Decay (WSD) schedule, highlighting how its cooldown phase eliminates logarithmic terms in the bound, unlike constant schedules.
→ The paper compares the theoretical convergence behavior of Cosine and WSD schedules by calculating and plotting this bound for varying training horizons and base learning rates.
→ It investigates the impact of cooldown length in WSD and the shape of gradient norm bounds on the theoretical convergence.
→ The study validates the theoretical findings using empirical results from training Llama style transformer models and by comparing the theoretical bound with PEP lower bounds.
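The schedule comparison above can be sketched in a few lines. This is a minimal illustration, not the paper's exact analysis: it uses the classical average-iterate convex-SGD bound as a simpler stand-in for the paper's last-iterate bound, and the constants `D`, `G` and the warmup/cooldown fractions are illustrative assumptions.

```python
import math

def wsd(T, base_lr=1.0, warmup_frac=0.05, cooldown_frac=0.2):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear cooldown."""
    warm, cool = int(warmup_frac * T), int(cooldown_frac * T)
    lrs = []
    for t in range(T):
        if t < warm:                      # linear warmup
            lrs.append(base_lr * (t + 1) / warm)
        elif t < T - cool:                # stable plateau
            lrs.append(base_lr)
        else:                             # linear cooldown toward 0
            lrs.append(base_lr * (T - t) / cool)
    return lrs

def cosine(T, base_lr=1.0):
    """Cosine decay from base_lr to 0 (warmup omitted for simplicity)."""
    return [base_lr * 0.5 * (1 + math.cos(math.pi * t / T)) for t in range(T)]

def avg_iterate_bound(lrs, D=1.0, G=1.0):
    """Classical convex-SGD bound (D^2 + G^2 * sum(eta_t^2)) / (2 * sum(eta_t)).
    The paper uses a sharper last-iterate bound; this simpler version has the
    same qualitative dependence on the schedule."""
    s1 = sum(lrs)
    s2 = sum(lr * lr for lr in lrs)
    return (D * D + G * G * s2) / (2 * s1)

T = 1000
print("WSD bound:   ", avg_iterate_bound(wsd(T, base_lr=0.05)))
print("Cosine bound:", avg_iterate_bound(cosine(T, base_lr=0.05)))
```

Evaluating such a bound at every intermediate horizon, as the paper does, produces theoretical "loss curves" that can be plotted against empirical ones.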
-----
Key Insights 💡:
→ The theoretical suboptimality bound closely mirrors the empirical loss curves observed during large model training, particularly capturing the sudden loss drop during the WSD cooldown.
→ The optimal base learning rate, derived from minimizing the bound, scales inversely with the square root of the training horizon and differs predictably between Cosine and WSD schedules.
→ The cooldown period in WSD is theoretically justified as it leads to a bound without logarithmic terms, explaining its practical benefits over constant learning rates.
→ Theoretical simulations suggest that linear decay is the optimal cooldown strategy when the base learning rate is fully tuned.
-----
Results 📊:
→ Schedule adaptation for continued training, guided by theory, improves validation loss by approximately 0.01 compared to naive continuation.
→ Based on scaling laws, this 0.01 improvement in validation loss is estimated to be equivalent to roughly 6,000 to 14,500 additional training steps, or to a 4-5% increase in model size.
→ Learning rate transfer from a WSD schedule with 20% cooldown to linear decay, based on theoretical predictions, achieves a final validation loss of 2.9535, compared to 2.9660 for the best 20% cooldown run.
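The learning-rate transfer in the last result can be sketched under the same simplified bound used above. Equating the bound-optimal γ* = D/(G·√(Σs_t²)) for two schedule shapes makes D and G cancel, leaving a pure ratio of schedule norms; the tuned learning rate value is a placeholder, and the closed form comes from the classical average-iterate bound rather than the paper's sharper last-iterate bound.

```python
import math

def transfer_lr(tuned_lr, shape_from, shape_to):
    """Transfer a tuned base LR between schedule shapes: since the
    bound-optimal g* = D / (G * sqrt(sum(s_t^2))) for each shape, the
    unknown constants D and G cancel in the ratio."""
    norm_from = math.sqrt(sum(s * s for s in shape_from))
    norm_to = math.sqrt(sum(s * s for s in shape_to))
    return tuned_lr * norm_from / norm_to

T = 1000
# WSD shape with a 20% linear cooldown (warmup omitted for simplicity)
wsd20 = [1.0] * 800 + [(200 - i) / 200 for i in range(200)]
# Full linear-decay schedule shape
linear = [(T - t) / T for t in range(T)]

print(transfer_lr(3e-3, wsd20, linear))   # placeholder tuned LR of 3e-3
```

Because linear decay spends less time at the peak rate, its schedule norm is smaller and the transferred base learning rate comes out larger, matching the intuition that more aggressively decaying schedules tolerate higher peaks.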