DeMo slashes GPU communication costs while maintaining model quality through smart momentum handling.
The paper introduces DeMo (Decoupled Momentum Optimization), an optimizer that drastically reduces inter-GPU communication during LLM training by decoupling momentum updates across accelerators, while matching or exceeding the performance of the AdamW optimizer.
-----
https://arxiv.org/abs/2411.19870
🤖 Original Problem:
Training LLMs requires massive communication between GPUs to synchronize gradients, demanding expensive high-speed interconnects and constraining all accelerators to be in the same data center.
-----
🔧 Solution in this Paper:
→ DeMo leverages the insight that gradients and optimizer states during training exhibit high redundancy and are highly compressible.
→ It uses the Discrete Cosine Transform (DCT) to efficiently extract and synchronize only the most significant "fast-moving" components of each momentum tensor.
→ The algorithm decouples momentum updates across accelerators, allowing controlled divergence in optimizer states.
→ Slow-moving components are accumulated locally and transmitted gradually over later steps (a minimal sketch of the whole loop follows this list).
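A minimal single-process sketch of this loop, to make the flow concrete. This is not the authors' implementation: the helper extract_fast_components, the chunk size, the top-k value, and the simulated workers are illustrative assumptions; the signum-style parameter step follows the sign-based update described in the paper.

```python
# Sketch of DeMo-style decoupled momentum (simulated on one process).
# Assumptions: chunk size, k, worker count, and helper names are illustrative.
import numpy as np
from scipy.fft import dct, idct

def extract_fast_components(momentum, chunk=64, k=4):
    """Per-chunk DCT of the momentum tensor, keep only the k largest-magnitude
    coefficients per chunk ("fast" components), return their reconstruction."""
    flat = momentum.ravel()
    pad = (-len(flat)) % chunk
    chunks = np.pad(flat, (0, pad)).reshape(-1, chunk)
    coeffs = dct(chunks, norm="ortho", axis=1)
    drop = np.argsort(np.abs(coeffs), axis=1)[:, :-k]   # all but the top-k
    np.put_along_axis(coeffs, drop, 0.0, axis=1)
    fast = idct(coeffs, norm="ortho", axis=1).ravel()[:len(flat)]
    return fast.reshape(momentum.shape)

n_workers, shape = 4, (8, 32)
rng = np.random.default_rng(0)
params = rng.normal(size=shape)
momenta = [np.zeros(shape) for _ in range(n_workers)]   # one buffer per worker
beta, lr = 0.9, 1e-3

for step in range(10):
    fast_parts = []
    for w in range(n_workers):
        grad = rng.normal(size=shape)            # stand-in for a local gradient
        momenta[w] = beta * momenta[w] + grad    # local momentum accumulation
        q = extract_fast_components(momenta[w])  # fast components of momentum
        momenta[w] -= q                          # slow residue stays local
        fast_parts.append(q)
    Q = sum(fast_parts)                          # the only cross-worker exchange
    params -= lr * np.sign(Q)                    # signum-style parameter step
```

In actual distributed training, only the sparse (index, coefficient) pairs of each extracted q would be exchanged between accelerators, which is where the bandwidth savings come from; summing dense tensors here just reproduces the math on one machine.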
-----
💡 Key Insights:
→ Fast-moving momentum components show high spatial correlation and concentrate their energy in a few principal components (a tiny demo follows this list)
→ Fast components need immediate application while slow components benefit from temporal smoothing
→ Slow-moving components, despite high variance, are crucial for long-term convergence
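A quick way to see the first insight: the DCT of a spatially correlated signal packs most of its energy into a handful of coefficients. A tiny self-contained demo (the random-walk signal and the top-16 cutoff are illustrative choices, not the paper's setup):

```python
# Energy compaction demo: a correlated 1-D signal stands in for a momentum chunk.
import numpy as np
from scipy.fft import dct

x = np.cumsum(np.random.default_rng(1).normal(size=256))   # correlated signal
c = dct(x, norm="ortho")
energy = np.sort(c ** 2)[::-1]
top16 = energy[:16].sum() / energy.sum()
print(f"top 16 of 256 DCT coefficients hold {top16:.1%} of the energy")
```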
-----
📊 Results:
→ Cut inter-accelerator communication by orders of magnitude: from 2416.6 MB/step to 3.44 MB/step (~700×) for the 1B-parameter model
→ Matched or exceeded AdamW on the HellaSwag, ARC-Easy, and PIQA benchmarks
→ No noticeable slowdown in convergence and no degradation in training loss