"Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens"

The podcast on this paper is generated with Google's Illuminate.

Low-bit quantization works better on less-trained LLMs, which raises concerns for compressing the heavily trained models of the future.

This research shows that low-bit quantization causes far less degradation on undertrained LLMs than on fully trained ones. By analyzing 1500+ quantized checkpoints, the paper derives scaling laws that predict quantization performance and proposes quantization-induced degradation (QiD) as a metric for how fully an LLM has been trained.
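
To make the metric concrete, here is a minimal sketch of how QiD could be measured: quantize a checkpoint's weights and take the increase in language-modeling loss relative to the full-precision model. The round-to-nearest quantizer and the Hugging Face-style causal LM interface (a model called with `labels` that returns `.loss`) are assumptions for illustration, not necessarily the paper's exact setup.

```python
# Minimal sketch (not the paper's code): QiD = loss(quantized model) - loss(original model).
import copy
import torch

def quantize_rtn(weight: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor round-to-nearest quantization, then dequantize back to float."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit
    scale = weight.abs().max().clamp(min=1e-12) / qmax
    return torch.clamp(torch.round(weight / scale), -qmax - 1, qmax) * scale

@torch.no_grad()
def measure_qid(model, batches, bits: int = 4) -> float:
    """Assumes a Hugging Face-style causal LM: model(input_ids=x, labels=x).loss."""
    def mean_loss(m):
        return sum(m(input_ids=x, labels=x).loss.item() for x in batches) / len(batches)

    fp_loss = mean_loss(model)
    q_model = copy.deepcopy(model)
    for _, module in q_model.named_modules():       # quantize every linear layer's weights
        if isinstance(module, torch.nn.Linear):
            module.weight.copy_(quantize_rtn(module.weight, bits))
    return mean_loss(q_model) - fp_loss             # QiD: the loss increase from quantization
```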

-----

https://arxiv.org/abs/2411.17691

🔍 Original Problem:

Low-bit quantization is widely used to compress LLMs, but its effectiveness varies significantly across models, and it has not been clear why.

-----

🛠️ Solution in this Paper:

→ The researchers analyzed over 1500 quantized LLM checkpoints of various sizes (160M to 12B parameters) at different training stages.

→ They developed scaling laws that model QiD as a function of three factors: the number of training tokens D, the model size N, and the quantization bit width P.

→ The unified scaling law is QiD(N, D, P) = k · D^β / (N^α · P^γ), where k, α, β, and γ are constants fitted to the measured checkpoints (a code sketch follows this list).

→ This formula can be used to predict quantization performance and to determine appropriate training-token budgets for different model sizes.
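
A minimal sketch of the unified scaling law as a Python function, useful for seeing how predicted QiD grows with training tokens and shrinks with model size and bit width. The constants k, α, β, and γ below are hypothetical placeholders, not the fitted values reported in the paper.

```python
# Minimal sketch of the unified QiD scaling law: QiD = k * D**beta / (N**alpha * P**gamma).
# k, alpha, beta, gamma here are placeholder values, NOT the paper's fitted constants.

def predicted_qid(D: float, N: float, P: float,
                  k: float = 0.01, alpha: float = 0.5,
                  beta: float = 0.5, gamma: float = 1.0) -> float:
    """Predicted quantization-induced degradation.

    D: training tokens, N: model parameters, P: quantization bit width.
    """
    return k * D**beta / (N**alpha * P**gamma)

# With fixed N and P, QiD grows as D increases: more training, more degradation.
print(predicted_qid(D=1e11, N=7e9, P=4))   # lightly trained 7B model
print(predicted_qid(D=1e13, N=7e9, P=4))   # heavily trained 7B model (larger QiD)
```

The functional form encodes the headline finding: training tokens D sit in the numerator, so degradation grows with training, while model size N and bit width P sit in the denominator, so larger models and higher precision reduce it.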

-----

💡 Key Insights:

→ Smaller models or those trained with more tokens suffer greater degradation from quantization

→ Checkpoints from early in training still undergo large weight fluctuations, so the extra perturbation from quantization changes little; fully trained checkpoints have converged to precise weights that quantization noticeably disturbs

→ Future LLMs trained with 100 trillion tokens may face severe challenges with low-bit quantization

-----

📊 Results:

→ For 70B models, reaching 20% QiD under 4-bit quantization requires 17+ trillion training tokens (estimates like this come from inverting the scaling law; see the sketch after this list)

→ 405B models need 50+ trillion tokens to show similar degradation

→ BitNet models show similar patterns, with performance gaps emerging in later training stages
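
A minimal sketch of the inversion behind such token estimates: fix a target QiD, a model size N, and a bit width P, then solve the scaling law for D. The constants are again hypothetical placeholders, not the paper's fitted values.

```python
# Minimal sketch: solve QiD = k * D**beta / (N**alpha * P**gamma) for D,
# i.e. how many training tokens it takes before quantization degradation
# reaches a target level. Constants are placeholders, NOT the paper's fitted values.

def tokens_for_target_qid(target_qid: float, N: float, P: float,
                          k: float = 0.01, alpha: float = 0.5,
                          beta: float = 0.5, gamma: float = 1.0) -> float:
    """Invert the scaling law: D = (target_qid * N**alpha * P**gamma / k) ** (1 / beta)."""
    return (target_qid * N**alpha * P**gamma / k) ** (1.0 / beta)

# Hypothetical example: training tokens at which a 70B model would hit a QiD of 0.2 under 4-bit.
print(f"{tokens_for_target_qid(target_qid=0.2, N=70e9, P=4):.2e} tokens")
```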
