A new adaptive gradient method, ADOPT, fixes Adam's convergence issues without hyperparameter-tuning headaches.
An Adam optimizer variant that doesn't need careful hyperparameter babysitting
https://arxiv.org/abs/2411.02853
🎯 Original Problem:
The Adam optimizer, despite being widely used in deep learning, doesn't converge in theory unless its β₂ parameter is chosen specifically for each problem. Previous fixes such as AMSGrad require the impractical assumption that gradient noise is uniformly bounded, which limits their real-world applicability.
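For reference, the standard Adam update (bias correction omitted for brevity). The current gradient g_t also enters the second-moment normalizer √v_t, and that coupling is exactly the correlation the paper targets:

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\theta_{t+1} &= \theta_t - \alpha\, \frac{m_t}{\sqrt{v_t} + \epsilon}
\end{aligned}
```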
-----
🔧 Solution in this Paper:
→ ADOPT modifies Adam by removing the current gradient from the second-moment estimate, eliminating the correlation that breaks convergence
→ Reverses the update order: the current gradient is normalized before the momentum update
→ Normalizes with v_(t-1) instead of v_t so the normalizer is conditionally independent of the current gradient (see the sketch after this list)
→ Achieves an O(1/√T) convergence rate without a bounded-noise assumption
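A minimal PyTorch-style sketch of the update described in these bullets. This is not the authors' implementation; the hyperparameter defaults and the v₀ initialization are assumptions made for illustration, so check the paper's official code for the exact recipe.

```python
import torch

def adopt_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.9999, eps=1e-6):
    """One ADOPT-style update for a single parameter tensor (illustrative sketch).

    Per the bullets above: the current gradient is normalized by the *previous*
    second-moment estimate v_{t-1}, the momentum update happens after that
    normalization, and only then is v refreshed with the new gradient.
    """
    if "v" not in state:
        # First call: seed the second moment from the first gradient and start
        # momentum at zero (an assumed initialization for this sketch).
        state["v"] = grad.pow(2)
        state["m"] = torch.zeros_like(param)
        return

    v_prev = state["v"]
    # Normalize the current gradient with v_{t-1}, not v_t, so the normalizer
    # is conditionally independent of g_t.
    normed_grad = grad / torch.clamp(v_prev.sqrt(), min=eps)
    # Momentum update comes *after* normalization (reversed order vs. Adam).
    state["m"].mul_(beta1).add_(normed_grad, alpha=1 - beta1)
    # Parameter step uses the momentum directly.
    param.add_(state["m"], alpha=-lr)
    # Finally, update the second moment with the current gradient.
    state["v"] = beta2 * v_prev + (1 - beta2) * grad.pow(2)
```

In practice this logic would live inside a `torch.optim.Optimizer` subclass that loops over parameter groups and keeps per-parameter state.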
-----
💡 Key Insights:
→ The correlation between the current gradient and the second-moment estimate is what causes Adam's non-convergence
→ The order of the momentum update and the normalization affects convergence properties
→ A bounded-noise assumption isn't necessary for the optimal convergence rate
→ Simple algorithmic changes can fix fundamental optimization issues
-----
📊 Results:
→ Outperforms Adam and AMSGrad on toy problems where Adam typically fails
→ Shows superior results on CIFAR-10 and ImageNet classification tasks
→ Improves MMLU score from 41.2 to 42.13 in LLaMA-7B fine-tuning
→ Demonstrates stable training even with small batch sizes where Adam fails