Low-bit quantization works better on less trained LLMs, challenging future model compression.
This research reveals that low-bit quantization degrades performance far less on undertrained LLMs than on fully trained ones. By analyzing 1500+ quantized checkpoints, the paper derives scaling laws to predict quantization performance and uses Quantization-induced Degradation (QiD) as a metric for how fully an LLM has been trained.
-----
https://arxiv.org/abs/2411.17691
🔍 Original Problem:
Low-bit quantization is widely used to compress LLMs, but its effectiveness varies significantly across models, without a clear understanding of why.
-----
🛠️ Solution in this Paper:
→ The researchers analyzed over 1500 quantized LLM checkpoints of various sizes (160M to 12B parameters) at different training stages.
→ They developed mathematical scaling laws that model QiD based on three factors: number of training tokens, model size, and bit width.
→ The unified scaling law takes the form QiD = k * (D^β) / (N^α * P^γ), where D is the number of training tokens, N is the model size, and P is the quantization bit width (a minimal code sketch follows this list).
→ This formula helps predict quantization performance and determine optimal training requirements for different model sizes.
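The unified scaling law can be expressed as a one-line function. This is a minimal sketch: the constant k and the exponents α, β, γ below are illustrative placeholders, not the fitted values reported in the paper, and would need to be fitted against the 1500+ checkpoints.

```python
def qid_loss(D: float, N: float, P: float,
             k: float = 0.1, alpha: float = 0.5,
             beta: float = 0.5, gamma: float = 5.0) -> float:
    """Predict quantization-induced degradation (QiD) in loss.

    D: number of training tokens
    N: number of model parameters
    P: quantization bit width
    k, alpha, beta, gamma: placeholder constants (assumed here, not the paper's fitted values)
    """
    return k * (D ** beta) / ((N ** alpha) * (P ** gamma))

# Example: a 7B-parameter model trained on 2T tokens, quantized to 4 bits
print(f"predicted QiD: {qid_loss(D=2e12, N=7e9, P=4):.4f}")
```

The functional form matches the paper's qualitative findings: QiD grows with training tokens D and shrinks as model size N or bit width P increases.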
-----
💡 Key Insights:
→ Smaller models or those trained with more tokens suffer greater degradation from quantization
→ Early-training checkpoints show larger weight fluctuations, which makes them more robust to quantization (see the sketch after this list)
→ Future LLMs trained with 100 trillion tokens may face severe challenges with low-bit quantization
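The intuition behind the second insight can be illustrated numerically: quantization noise is small relative to how much an undertrained model's weights are still moving each step, but large relative to the fine-grained adjustments of a nearly converged model. The round-to-nearest quantizer and the per-step update magnitudes below are toy assumptions, not the paper's exact setup.

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric round-to-nearest quantization (illustrative quantizer only)."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10_000)        # toy weight tensor
q_noise = np.abs(quantize_rtn(w) - w).mean()  # average quantization error

# Proxy for per-step weight movement: large when undertrained, small when well trained.
for update_scale, label in [(1e-2, "undertrained"), (1e-4, "well-trained")]:
    print(f"{label}: quantization noise / typical weight update = {q_noise / update_scale:.2f}")
```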
-----
📊 Results:
→ For a 70B model, reaching 20% QiD under 4-bit quantization requires 17+ trillion training tokens (an inverted scaling-law sketch follows this list)
→ 405B models need 50+ trillion tokens to show similar degradation
→ BitNet models show similar patterns, with performance gaps emerging in later training stages
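These token-count thresholds follow from inverting the scaling law: fix a target QiD and solve for D. The sketch below reuses the same placeholder constants as before, so its output will not reproduce the paper's 17T/50T figures; it only shows how such estimates are derived.

```python
def tokens_for_target_qid(target_qid: float, N: float, P: float,
                          k: float = 0.1, alpha: float = 0.5,
                          beta: float = 0.5, gamma: float = 5.0) -> float:
    """Invert QiD = k * D^beta / (N^alpha * P^gamma) to solve for D.
    Constants are illustrative placeholders, not the paper's fitted values."""
    return (target_qid * (N ** alpha) * (P ** gamma) / k) ** (1.0 / beta)

# Example: training tokens at which a 70B model would reach QiD = 0.2 under 4-bit quantization
print(f"{tokens_for_target_qid(0.2, N=70e9, P=4):.3e} tokens")
```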