"APOLLO: SGD-like Memory, AdamW-level Performance"

The podcast on this paper is generated with Google's Illuminate.

Memory-efficient optimizer APOLLO trains massive LLMs using a fraction of the usual GPU memory.

APOLLO introduces a memory-efficient optimizer that reduces training costs for LLMs while matching or exceeding AdamW's performance. It achieves this by approximating channel-wise learning rate scaling with a low-rank auxiliary optimizer state built from random projections.

-----

https://arxiv.org/abs/2412.05270

🤔 Original Problem:

Training LLMs requires massive memory because of the AdamW optimizer's states, forcing researchers to use expensive GPUs or reduce batch sizes. Existing memory-efficient alternatives either need costly SVD operations or sacrifice performance.
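As a rough illustration of the cost (my own back-of-envelope arithmetic, assuming fp32 optimizer states and a 7B-parameter model, not a figure from the paper):

```python
# AdamW keeps two extra tensors per parameter (first and second moments).
# Assumption: 7e9 parameters, fp32 (4 bytes) per optimizer-state element.
params = 7e9
bytes_per_state = 4
adamw_state_gb = params * 2 * bytes_per_state / 1e9
print(f"AdamW optimizer states alone: ~{adamw_state_gb:.0f} GB")  # ~56 GB
```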

-----

🔧 Solution in this Paper:

→ APOLLO approximates channel-wise learning rate scaling with a low-rank auxiliary optimizer state based on random projection (see the sketch after this list).

→ The structured learning rate update makes APOLLO highly tolerant to further memory reduction with lower rank.

→ APOLLO-Mini, an extreme variant, uses tensor-wise scaling with a rank-1 auxiliary subspace to reach SGD-level memory cost.

→ The method eliminates the need for costly SVD operations by using pure random projections.
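Here is a minimal, non-authoritative sketch of the idea in NumPy: Adam-style moments are kept only in a small random-projection subspace and are used solely to derive per-channel scaling factors for the raw gradient. Function names, hyperparameters, and the exact scaling rule are illustrative assumptions, not the paper's verbatim algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_state(n_cols, rank):
    """Low-rank auxiliary state: moments live in (rows x rank) instead of (rows x cols)."""
    return {
        "P": rng.standard_normal((n_cols, rank)) / np.sqrt(rank),  # random projection, no SVD
        "m": None,  # first moment in the projected space
        "v": None,  # second moment in the projected space
        "t": 0,
    }

def apollo_like_step(W, G, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update: Adam-style moments on the projected gradient yield a
    per-channel (per-row) scaling factor that is applied to the raw gradient."""
    R = G @ state["P"]                                  # (rows x rank) projected gradient
    if state["m"] is None:
        state["m"] = np.zeros_like(R)
        state["v"] = np.zeros_like(R)
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * R
    state["v"] = beta2 * state["v"] + (1 - beta2) * R**2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    R_adapted = m_hat / (np.sqrt(v_hat) + eps)          # Adam update inside the low-rank space

    # Channel-wise scaling: per-row norm ratio of adapted vs. raw projected gradient.
    scale = (np.linalg.norm(R_adapted, axis=1, keepdims=True)
             / (np.linalg.norm(R, axis=1, keepdims=True) + eps))
    return W - lr * scale * G                           # scale the raw gradient, not R

# Toy usage: a single 256x512 weight with a rank-8 auxiliary space.
W = rng.standard_normal((256, 512))
state = init_state(W.shape[1], rank=8)
G = rng.standard_normal(W.shape)                        # stand-in gradient
W = apollo_like_step(W, G, state)
```

With rank 1 and a single scalar scale per tensor instead of per row, the same structure gives the tensor-wise, APOLLO-Mini-style variant described above.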

-----

💡 Key Insights:

→ Element-wise learning rate adaptation in AdamW can be coarsened to channel-wise or tensor-wise scaling

→ Random projection is sufficient to preserve gradient norms, eliminating the need for expensive SVD (see the quick check after this list)

→ Lower rank approximations work well due to structured learning rate updates
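A quick numerical check of the norm-preservation insight (my own illustration with arbitrary dimensions, not an experiment from the paper): projecting a gradient row onto a scaled Gaussian random subspace keeps its norm roughly unchanged, Johnson-Lindenstrauss style.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4096, 256                               # arbitrary full and projected dimensions
P = rng.standard_normal((d, r)) / np.sqrt(r)   # scaled Gaussian random projection

g = rng.standard_normal(d)                     # stand-in gradient row
ratio = np.linalg.norm(g @ P) / np.linalg.norm(g)
print(f"norm ratio after projection: {ratio:.3f}")  # close to 1.0 in expectation
```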

-----

📊 Results:

→ 3× better throughput on LLaMA-7B training compared to AdamW

→ Enables LLaMA-13B pre-training on A100-80GB for the first time, without system-level optimizations

→ Achieves 12GB memory usage for LLaMA-7B training with quantization

→ Outperforms AdamW with a 2.8× reduction in validation perplexity
