DeMo slashes GPU communication costs while maintaining model quality through smart momentum handling.
The paper introduces DeMo (Decoupled Momentum Optimization), an optimizer that drastically reduces inter-GPU communication during LLM training by decoupling momentum updates across accelerators, while matching or exceeding the performance of the AdamW optimizer.
-----
https://arxiv.org/abs/2411.19870
🤖 Original Problem:
Training LLMs requires massive communication between GPUs to synchronize gradients, demanding expensive high-speed interconnects and constraining all accelerators to be in the same data center.
-----
🔧 Solution in this Paper:
→ DeMo leverages the insight that gradients and optimizer states during training exhibit high redundancy and are highly compressible.
→ It uses the Discrete Cosine Transform (DCT) to efficiently extract and synchronize only the most significant "fast-moving" components of each momentum tensor.
→ The algorithm decouples momentum updates across accelerators, allowing controlled divergence in optimizer states.
→ Slow-moving components are accumulated locally and transmitted gradually over later steps (a minimal sketch of the whole loop follows this list).
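A minimal single-process sketch of this loop, to make the flow concrete. This is not the authors' implementation: the helper extract_fast_components, the chunk size, the top-k value, and the simulated workers are illustrative assumptions; the signum-style parameter step follows the sign-based update described in the paper.

```python
# Sketch of DeMo-style decoupled momentum (simulated on one process).
# Assumptions: chunk size, k, worker count, and helper names are illustrative.
import numpy as np
from scipy.fft import dct, idct

def extract_fast_components(momentum, chunk=64, k=4):
    """Per-chunk DCT of the momentum tensor, keep only the k largest-magnitude
    coefficients per chunk ("fast" components), return their reconstruction."""
    flat = momentum.ravel()
    pad = (-len(flat)) % chunk
    chunks = np.pad(flat, (0, pad)).reshape(-1, chunk)
    coeffs = dct(chunks, norm="ortho", axis=1)
    drop = np.argsort(np.abs(coeffs), axis=1)[:, :-k]   # all but the top-k
    np.put_along_axis(coeffs, drop, 0.0, axis=1)
    fast = idct(coeffs, norm="ortho", axis=1).ravel()[:len(flat)]
    return fast.reshape(momentum.shape)

n_workers, shape = 4, (8, 32)
rng = np.random.default_rng(0)
params = rng.normal(size=shape)
momenta = [np.zeros(shape) for _ in range(n_workers)]   # one buffer per worker
beta, lr = 0.9, 1e-3

for step in range(10):
    fast_parts = []
    for w in range(n_workers):
        grad = rng.normal(size=shape)            # stand-in for a local gradient
        momenta[w] = beta * momenta[w] + grad    # local momentum accumulation
        q = extract_fast_components(momenta[w])  # fast components of momentum
        momenta[w] -= q                          # slow residue stays local
        fast_parts.append(q)
    Q = sum(fast_parts)                          # the only cross-worker exchange
    params -= lr * np.sign(Q)                    # signum-style parameter step
```

In actual distributed training, only the sparse (index, coefficient) pairs of each extracted q would be exchanged between accelerators, which is where the bandwidth savings come from; summing dense tensors here just reproduces the math on one machine.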
-----
💡 Key Insights:
→ Fast-moving momentum components show high spatial correlation and concentrate their energy in a few principal components (a tiny demo follows this list)
→ Fast components need immediate application while slow components benefit from temporal smoothing
→ Slow-moving components, despite high variance, are crucial for long-term convergence
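A quick way to see the first insight: the DCT of a spatially correlated signal packs most of its energy into a handful of coefficients. A tiny self-contained demo (the random-walk signal and the top-16 cutoff are illustrative choices, not the paper's setup):

```python
# Energy compaction demo: a correlated 1-D signal stands in for a momentum chunk.
import numpy as np
from scipy.fft import dct

x = np.cumsum(np.random.default_rng(1).normal(size=256))   # correlated signal
c = dct(x, norm="ortho")
energy = np.sort(c ** 2)[::-1]
top16 = energy[:16].sum() / energy.sum()
print(f"top 16 of 256 DCT coefficients hold {top16:.1%} of the energy")
```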
-----
📊 Results:
→ Cut inter-accelerator communication by orders of magnitude: from 2416.6 MB/step to 3.44 MB/step (~700×) for the 1B-parameter model
→ Matched or exceeded AdamW on the HellaSwag, ARC-Easy, and PIQA benchmarks
→ No noticeable slowdown in convergence and no degradation in training loss