Want faster LLM training? Give more bits to the exponent, fewer to the mantissa.
The paper finds not all bits are equal - exponent bits matter more than mantissa bits in LLM training.
Floating-point quantization needs different bit ratios at different precision levels for optimal LLM training.
There's a sweet spot for data size in low-precision training - more isn't always better.
This paper develops a unified scaling law for floating-point quantization in LLM training, revealing optimal bit allocation between exponent and mantissa bits for different precision levels.
https://arxiv.org/abs/2501.02423
🤔 Original Problem:
→ Existing scaling laws focus on integer quantization but don't account for floating-point quantization parameters like exponent bits, mantissa bits, and block size
→ No clear understanding of how these parameters affect LLM training performance
-----
🔧 Solution in this Paper:
→ The researchers developed a unified scaling law that considers model size (N), data size (D), exponent bits (E), mantissa bits (M), and block size (B).
→ The law predicts model loss as L(N, D, E, M, B) = n/N^α + d/D^β + ε + (D^β/N^α) · log2(B) / (γ · (E+0.5)^δ · (M+0.5)^ν), where n, d, α, β, ε, γ, δ, ν are fitted constants (sketched in code after this list).
→ They trained 366 models with different configurations to validate the law.
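To make the shape of that law concrete, here's a minimal Python sketch. The functional form follows the formula quoted above, but the constants (n, d, α, β, ε, γ, δ, ν) are placeholders chosen for illustration, not the values actually fitted in the paper.

```python
import math

# Hypothetical fitted constants -- placeholders, NOT the paper's fitted values.
FIT = dict(n=1.0, d=1.0, alpha=0.5, beta=0.5, eps=1.7,
           gamma=1.0, delta=1.1, nu=1.0)

def predicted_loss(N, D, E, M, B, p=FIT):
    """Predicted training loss for N parameters, D tokens, E exponent bits,
    M mantissa bits, and quantization block size B."""
    # Classic Chinchilla-style terms plus an irreducible loss floor.
    classic = p["n"] / N ** p["alpha"] + p["d"] / D ** p["beta"] + p["eps"]
    # Extra loss from low-precision floating-point quantization:
    # grows with data size and block size, shrinks with model size
    # and with the number of exponent/mantissa bits.
    quant = (D ** p["beta"] / N ** p["alpha"]) * math.log2(B) / (
        p["gamma"] * (E + 0.5) ** p["delta"] * (M + 0.5) ** p["nu"]
    )
    return classic + quant

# Example: 1B parameters, 100B tokens, FP8 (E4M3), block size 32.
print(predicted_loss(N=1e9, D=1e11, E=4, M=3, B=32))
```

The quantization term is the interesting part: unlike the classic terms, it grows with data size D, which is what later produces a "critical data size" in low precision.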
-----
💡 Key Insights:
→ Exponent bits contribute slightly more to model performance than mantissa bits
→ Optimal exponent/mantissa layouts: E2M1 for FP4, E4M3 for FP8, E8M7 for FP16 (compared in the sketch below)
→ A critical data size exists in low precision - training on more tokens beyond that point actually degrades performance
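Why E2M1 rather than, say, E1M2 for FP4? For a fixed bit budget, everything in the law except the (E+0.5)^δ · (M+0.5)^ν factor is constant, so the best split is whichever maximizes that product. A tiny sketch, assuming a hypothetical δ slightly larger than ν to mirror the "exponent bits matter slightly more" finding (the actual fitted exponents are in the paper):

```python
# Placeholder exponents; delta > nu encodes "exponent bits help a bit more".
delta, nu = 1.1, 1.0

def layout_score(E, M):
    """Larger score -> smaller predicted quantization penalty in the law above."""
    return (E + 0.5) ** delta * (M + 0.5) ** nu

# FP4 = 1 sign bit + 3 bits to split between exponent (E) and mantissa (M).
for E in range(0, 4):
    M = 3 - E
    print(f"E{E}M{M}: score = {layout_score(E, M):.2f}")
# With delta > nu the best FP4 split lands at E2M1; with delta == nu,
# E1M2 and E2M1 would tie.
```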
-----
📊 Results:
→ For a 1B-parameter model, the critical data size is about 1730T tokens in BF16, 27T tokens in FP8-E4M3, and 0.4T tokens in FP4-E2M1 (derivation sketched below)
→ The best cost-performance precision lies between 4 and 8 bits across a wide range of compute budgets
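The critical data size follows from the same law: setting ∂L/∂D = 0 gives D_crit = (d · γ · N^α · (E+0.5)^δ · (M+0.5)^ν / log2 B)^(1/(2β)). A sketch with the same placeholder constants as before, so the printed number is illustrative only and will not reproduce the paper's token counts:

```python
import math

# Hypothetical constants -- placeholders, not the paper's fitted values.
d, alpha, beta, gamma, delta, nu = 1.0, 0.5, 0.5, 1.0, 1.1, 1.0

def critical_data_size(N, E, M, B):
    """Token count beyond which more low-precision training data stops helping,
    obtained by solving dL/dD = 0 for the scaling law sketched earlier."""
    numerator = d * gamma * N ** alpha * (E + 0.5) ** delta * (M + 0.5) ** nu
    return (numerator / math.log2(B)) ** (1.0 / (2.0 * beta))

# Example: 1B parameters, FP8-E4M3, block size 32.
print(f"D_crit ≈ {critical_data_size(N=1e9, E=4, M=3, B=32):.3e} tokens")
```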
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/