Memory-efficient optimizer APOLLO trains massive LLMs using a fraction of the usual GPU memory.
APOLLO introduces a memory-efficient optimizer that reduces training costs for LLMs while matching or exceeding AdamW's performance. It achieves this by approximating channel-wise learning rate scaling with a low-rank auxiliary space built from random projections.
-----
https://arxiv.org/abs/2412.05270
🤔 Original Problem:
Training LLMs requires massive memory because of the AdamW optimizer's states, forcing researchers to use expensive GPUs or shrink batch sizes. Existing memory-efficient alternatives either require costly SVD operations or sacrifice performance.
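To get a sense of the scale, here is a rough back-of-the-envelope calculation (my own illustration, not a figure from the paper), assuming fp32 optimizer states:

```python
# Rough arithmetic: AdamW keeps two extra states (first and second moments)
# per parameter, so the optimizer states alone can dwarf the weights themselves.
params = 7e9                # LLaMA-7B parameter count
bytes_per_value = 4         # assuming fp32 optimizer states

weights_gb = params * bytes_per_value / 1e9
adamw_states_gb = 2 * params * bytes_per_value / 1e9   # m and v moments

print(f"weights:            ~{weights_gb:.0f} GB")       # ~28 GB
print(f"AdamW states alone: ~{adamw_states_gb:.0f} GB")  # ~56 GB, before activations
```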
-----
🔧 Solution in this Paper:
→ APOLLO approximates channel-wise learning rate scaling using a low-rank auxiliary optimizer state based on random projection.
→ The structured learning rate update makes APOLLO highly tolerant to further memory reduction at lower ranks.
→ APOLLO-Mini, an extreme variant, uses tensor-wise scaling with a rank-1 auxiliary subspace to reach SGD-level memory cost.
→ The method eliminates the need for costly SVD operations by using pure random projections (see the sketch after this list).
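Below is a minimal sketch of the idea for a single weight matrix, assuming a PyTorch setting: a fixed Gaussian random projection compresses the gradient, Adam-style moments are tracked only in that rank-r auxiliary space, and the resulting channel-wise scaling is applied back to the full-rank raw gradient. Function and variable names are my own, and details such as bias correction and the paper's norm-growth limiter are omitted; this is not the authors' implementation.

```python
import torch

def apollo_style_update(grad, state, rank=64, beta1=0.9, beta2=0.999, eps=1e-8):
    """Approximate channel-wise LR scaling in a low-rank auxiliary space (sketch)."""
    n, m = grad.shape                                   # one 2D weight gradient (n x m)

    if "P" not in state:                                # fixed random projection, no SVD
        state["P"] = torch.randn(m, rank, device=grad.device) / rank ** 0.5
        state["m"] = torch.zeros(n, rank, device=grad.device)
        state["v"] = torch.zeros(n, rank, device=grad.device)

    r = grad @ state["P"]                               # compressed gradient: (n x rank)

    # Adam-style moments live only in the low-rank auxiliary space.
    state["m"].mul_(beta1).add_(r, alpha=1 - beta1)
    state["v"].mul_(beta2).addcmul_(r, r, value=1 - beta2)
    r_hat = state["m"] / (state["v"].sqrt() + eps)

    # Channel-wise scaling factor per row, estimated in the auxiliary space,
    # then applied to the full-rank raw gradient (structured update).
    scale = r_hat.norm(dim=1) / (r.norm(dim=1) + eps)   # shape (n,)
    return grad * scale.unsqueeze(1)
```

An APOLLO-Mini-style variant would collapse `scale` to a single tensor-wise factor (e.g. `r_hat.norm() / r.norm()`) with `rank=1`, which is what shrinks the auxiliary state to SGD-like memory.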
-----
💡 Key Insights:
→ Element-wise learning rate adaptation in AdamW can be coarsened to channel-wise or tensor-wise scaling
→ Random projection is sufficient for gradient norm preservation, eliminating the need for expensive SVD (a quick numeric check follows this list)
→ Low-rank approximations remain effective because the learning rate updates are structured
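As a quick sanity check on the norm-preservation insight (my own toy example, not from the paper), a scaled Gaussian random projection approximately preserves vector norms without any SVD:

```python
import torch

torch.manual_seed(0)
d, rank = 4096, 256
g = torch.randn(d)                        # stand-in for one gradient row
P = torch.randn(d, rank) / rank ** 0.5    # scaled Gaussian random projection

print(f"||g||  = {g.norm().item():.1f}")
print(f"||gP|| = {(g @ P).norm().item():.1f}")   # close to ||g||, no SVD needed
```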
-----
📊 Results:
→ 3× better throughput on LLaMA-7B training compared to AdamW
→ Enables LLaMA-13B pre-training on A100-80GB GPUs for the first time, without system-level optimizations
→ Brings LLaMA-7B training down to 12GB of memory when combined with quantization
→ Outperforms AdamW with a 2.8× reduction in validation perplexity