Simple masking mechanism helps optimizers avoid wasteful parameter updates.
Cautious optimizers know when to hold back for faster model training.
This paper introduces "Cautious Optimizers" - a one-line code modification that improves the training efficiency of momentum-based optimizers such as AdamW and Lion. The modification applies an update only where the proposed update direction agrees in sign with the current gradient, which leads to faster convergence.
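In PyTorch-style code, the core mechanism is a single element-wise mask. The snippet below is a minimal sketch of the idea as summarized here; the tensor names are illustrative, not the authors' reference implementation:

```python
import torch

# Illustrative tensors: the optimizer's proposed update (e.g., the AdamW
# step direction) and the current gradient for the same parameter.
update = torch.randn(1000)
grad = torch.randn(1000)

# Cautious mask: keep only coordinates where the proposed update and the
# gradient point in the same direction; zero out the rest.
mask = (update * grad > 0).to(update.dtype)
cautious_update = update * mask
```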
-----
https://arxiv.org/abs/2411.16085
🤔 Original Problem:
AdamW is the de facto default optimizer for LLM training, but finding faster alternatives that do not require complex modifications or extensive hyperparameter tuning remains challenging.
-----
🔧 Solution in this Paper:
→ The paper introduces a single-line modification to momentum-based optimizers: apply an update to a coordinate only when the proposed update and the current gradient agree in sign.
→ This yields "Cautious" variants such as C-AdamW and C-Lion that preserve the original optimizer's Hamiltonian function.
→ The modification uses an element-wise mask to zero out coordinates where the signs of the momentum-based update and the gradient conflict.
→ A step-size rescaling then adjusts the effective learning rate according to how many coordinates remain active, as sketched below.
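The sketch below shows how the mask and the step-size rescaling might fit into a single AdamW-style update. The function name, its signature, and the exact constant guarding the division are assumptions for illustration, not the paper's reference code:

```python
import torch

def cautious_adamw_step(param, grad, exp_avg, exp_avg_sq, step,
                        lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                        weight_decay=0.01):
    """One cautious AdamW-style update on a single parameter tensor (sketch)."""
    beta1, beta2 = betas

    # Standard AdamW moment updates with bias correction.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    update = (exp_avg / (1 - beta1 ** step)) / (
        (exp_avg_sq / (1 - beta2 ** step)).sqrt() + eps)

    # Cautious mask: drop coordinates where the proposed update and the
    # current gradient disagree in sign.
    mask = (update * grad > 0).to(update.dtype)
    # Step-size rescaling: compensate for masked-out coordinates so the
    # average step magnitude stays comparable (the +1 guards against an
    # all-zero mask; the exact constant is an assumption here).
    mask = mask * (mask.numel() / (mask.sum() + 1))

    # Decoupled weight decay, then the masked, rescaled update.
    param.mul_(1 - lr * weight_decay)
    param.add_(update * mask, alpha=-lr)
```

In a full optimizer, a step like this would run per parameter under `torch.no_grad()`, with `exp_avg`, `exp_avg_sq`, and `step` kept in the optimizer state.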
-----
💡 Key Insights:
→ Simple modifications can significantly improve optimizer performance without complex tuning
→ Alignment between momentum and gradient direction is crucial for efficient training
→ Preserving the optimizer's Hamiltonian dynamics retains its theoretical convergence guarantees
-----
📊 Results:
→ 1.47x speedup on LLaMA training with C-AdamW
→ 1.28x speedup with C-Lion
→ No additional computational overhead