
PREFIXQUANT: STATIC QUANTIZATION BEATS DYNAMIC THROUGH PREFIXED OUTLIERS IN LLMS

The podcast on this paper is generated with Google's Illuminate.

The newly released PrefixQuant is the first method that enables efficient per-tensor static quantization to outperform expensive per-token dynamic quantization.

Results 📊:

• W4A4KV4 Llama-3-8B: 7.43 WikiText2 perplexity, 71.08% average accuracy on 5 tasks

• Outperforms QuaRot by 0.98 in perplexity and 5.98 points in average accuracy

• 1.60× to 2.81× faster than FP16 models

• 1.2× to 1.3× faster than QuaRot models

PrefixQuant enables static quantization to outperform dynamic quantization for LLMs by effectively handling token-wise outliers.

📚 https://arxiv.org/pdf/2410.05265

Original Problem 🔍:

Quantization of large language models (LLMs) struggles with token-wise outliers, whose extreme magnitudes force reliance on costly per-token dynamic quantization.
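To see why these outliers matter, here is a minimal sketch (NumPy, toy values that are assumptions for illustration, not from the paper) contrasting per-tensor static and per-token dynamic INT4 quantization when one token is an outlier:

```python
import numpy as np

def fake_quantize(x, scale, n_bits=4):
    """Symmetric uniform quantize-dequantize, used here to measure error."""
    qmax = 2 ** (n_bits - 1) - 1            # 7 for INT4
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.standard_normal((4, 8)).astype(np.float32)  # 4 tokens x 8 channels
acts[0] *= 50.0                                        # one token-wise outlier

# Per-tensor static: a single scale, precomputed offline.
static_scale = np.abs(acts).max() / 7.0
err_static = np.abs(acts - fake_quantize(acts, static_scale)).mean()

# Per-token dynamic: one scale per token, recomputed inside every kernel call.
token_scales = np.abs(acts).max(axis=1, keepdims=True) / 7.0
err_dynamic = np.abs(acts - fake_quantize(acts, token_scales)).mean()

print(f"static error:  {err_static:.4f}")   # inflated by the outlier token
print(f"dynamic error: {err_dynamic:.4f}")  # per-token scales absorb it
```

The outlier token stretches the single static scale so far that the other tokens lose almost all precision, which is exactly why prior work fell back on dynamic per-token scales; PrefixQuant instead removes the outlier tokens themselves.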

Solution in this Paper 💡:

• Introduces PrefixQuant, a technique to isolate outlier tokens offline without re-training

• Identifies high-frequency outlier tokens offline and prefixes them in the KV cache (a detection sketch follows this list)

• Prevents the model from generating new outlier tokens during inference

• Enables efficient per-tensor static quantization to outperform per-token dynamic quantization

• Adds block-wise fine-tuning to further improve performance
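The following is a hedged sketch (PyTorch, toy tensors) of the offline outlier-token search: record each token's peak activation magnitude on calibration data and flag tokens whose peaks dwarf the median. The 10x-median threshold and the stand-in hidden states are assumptions for illustration; in practice the activations come from hooks on the LLM's layers.

```python
import torch

torch.manual_seed(0)
vocab, seq_len, hidden = 100, 16, 32

# Stand-ins for calibration data: 8 sequences of token ids plus the hidden
# states a forward pass over them would produce.
token_ids = torch.randint(0, vocab, (8, seq_len))
hidden_states = torch.randn(8, seq_len, hidden)
hidden_states[:, 0] *= 40.0                        # position 0 acts as an outlier

peak = hidden_states.abs().amax(dim=-1)            # (8, seq_len) per-token peaks
ratio = peak / peak.median()
outlier_positions = (ratio > 10.0).nonzero()       # assumed 10x-median threshold

# Count which token ids recur at outlier positions; the most frequent ones
# become the fixed prefix stored in the KV cache before quantized inference.
outlier_ids = token_ids[outlier_positions[:, 0], outlier_positions[:, 1]]
prefix_tokens = torch.bincount(outlier_ids, minlength=vocab).topk(3).indices
print("candidate prefix token ids:", prefix_tokens.tolist())
```

Because the search only reads calibration activations, it runs once offline and needs no retraining, which is what makes the method cheap to apply.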

Key Insights from this Paper 💡:

• Outlier tokens usually appear at fixed positions (e.g., the initial token) or as tokens with low semantic value (e.g., delimiters)

• Prefixing outlier tokens in the KV cache sharply reduces the magnitude of the remaining activations (a prefill sketch follows this list)

• Static quantization can outperform dynamic quantization with proper outlier handling

• PrefixQuant is plug-and-play and can enhance other optimization-based quantization methods
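Below is a hedged sketch of the prefixing step using Hugging Face transformers. The prefix ([BOS] + ".\n") is illustrative, since PrefixQuant selects the actual tokens per model offline, and KV-cache plumbing details vary across transformers versions, so treat this as a sketch under those assumptions rather than a drop-in recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B"      # the model behind the results above
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
model.eval()

# Offline: prefill the chosen outlier tokens once. Their extreme activations
# then live only in the KV cache, outside the quantized compute path.
prefix_ids = tok(".\n", return_tensors="pt").input_ids  # BOS is prepended automatically
with torch.no_grad():
    prefix_cache = model(prefix_ids, use_cache=True).past_key_values

# Online: the prompt attends to the cached prefix instead of regenerating
# outlier tokens, so per-tensor static scales stay valid. In practice the
# cache would be copied per request, since it grows as decoding proceeds.
prompt_ids = tok("The capital of France is",
                 return_tensors="pt", add_special_tokens=False).input_ids
with torch.no_grad():
    logits = model(prompt_ids, past_key_values=prefix_cache, use_cache=True).logits
```

Because the prefix is fixed, its KV entries can be computed once and shared across all requests, which is also why the approach composes cleanly with other quantization pipelines.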