The newly released PrefixQuant is the first method to make efficient per-tensor static quantization outperform expensive per-token dynamic quantization.
Results 📊:
• W4A4KV4 Llama-3-8B: 7.43 WikiText2 perplexity, 71.08% average accuracy on 5 tasks
• Outperforms QuaRot by 0.98 in perplexity and 5.98 points in average accuracy
• 1.60× to 2.81× faster than FP16 models
• 1.2× to 1.3× faster than QuaRot models
PrefixQuant enables static quantization to outperform dynamic quantization for LLMs by effectively handling token-wise outliers.
📚 https://arxiv.org/pdf/2410.05265
Original Problem 🔍:
Quantizing large language models (LLMs) is complicated by token-wise activation outliers, which force a reliance on costly per-token dynamic quantization.
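To make that contrast concrete, here is a minimal PyTorch sketch (not the paper's code) of the two schemes: per-token dynamic quantization computes a fresh scale for every token at runtime, while per-tensor static quantization reuses one offline-calibrated scale, which a single outlier token can blow up.

```python
import torch

def per_token_dynamic_quant(x: torch.Tensor, n_bits: int = 8):
    """One scale per token (row), computed at inference time: flexible but costly."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

def per_tensor_static_quant(x: torch.Tensor, scale: float, n_bits: int = 8):
    """One scale for the whole tensor, calibrated offline: cheap at runtime."""
    qmax = 2 ** (n_bits - 1) - 1
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

# A single outlier token inflates the offline per-tensor scale and wastes the
# quantization grid on every other token -- the failure mode PrefixQuant removes.
acts = torch.randn(16, 4096)
acts[0] *= 100.0                                 # simulated outlier token
static_scale = (acts.abs().max() / 127).item()   # offline calibration
q_static, _ = per_tensor_static_quant(acts, static_scale)
q_dynamic, _ = per_token_dynamic_quant(acts)
```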
Solution in this Paper 💡:
• Introduces PrefixQuant, a technique to isolate outlier tokens offline without re-training
• Identifies high-frequency outlier tokens and prefixes them in the KV cache (see the sketch after this list)
• Prevents generation of outlier tokens during inference
• Enables efficient per-tensor static quantization to outperform per-token dynamic quantization
• Includes block-wise fine-tuning optimization to improve performance
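Below is a minimal sketch of the offline prefixing idea, assuming a HuggingFace-style causal LM; the helper names and the simple max-over-median outlier criterion are illustrative assumptions, not the paper's implementation.

```python
import torch

@torch.no_grad()
def find_outlier_token_ids(model, tokenizer, calib_texts, top_k=4, ratio=64.0):
    """Count which token ids repeatedly produce activations far above the
    per-sequence median (a stand-in criterion for 'outlier token')."""
    counts = {}
    for text in calib_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        hidden = model(ids, output_hidden_states=True).hidden_states[-1][0]  # [seq, dim]
        tok_mag = hidden.abs().amax(dim=-1)          # per-token activation magnitude
        for tid in ids[0][tok_mag > ratio * tok_mag.median()].tolist():
            counts[tid] = counts.get(tid, 0) + 1
    return [tid for tid, _ in sorted(counts.items(), key=lambda kv: -kv[1])[:top_k]]

@torch.no_grad()
def build_prefixed_kv_cache(model, outlier_ids):
    """Pre-fill the KV cache with the outlier tokens once, offline, so that
    later tokens no longer need to carry the outliers themselves."""
    prefix = torch.tensor([outlier_ids], device=model.device)
    return model(prefix, use_cache=True).past_key_values  # reused for every request
```

At inference time the cached prefix is supplied as `past_key_values`, so every request sees the same fixed context and per-tensor static scales can be calibrated once, offline.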
Key Insights from this Paper 💡:
• Outlier tokens usually appear at fixed positions or in tokens with low semantic value
• Prefixing outliers in the KV cache significantly reduces outlier magnitudes (see the check after this list)
• Static quantization can outperform dynamic quantization with proper outlier handling
• PrefixQuant is plug-and-play, enhancing other optimization-based methods
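A quick illustrative check of that insight, again assuming a HuggingFace-style model and a list of outlier token ids found offline (e.g. by the sketch above): compare the activation range of the prompt tokens with and without the outlier prefix in front of them.

```python
import torch

@torch.no_grad()
def prompt_outlier_ratio(model, tokenizer, text, outlier_ids=()):
    """Max/median activation ratio over the prompt tokens only (prefix excluded)."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    if outlier_ids:
        prefix = torch.tensor([list(outlier_ids)], device=model.device)
        ids = torch.cat([prefix, ids], dim=1)
    hidden = model(ids, output_hidden_states=True).hidden_states[-1][0]
    hidden = hidden[len(outlier_ids):]               # measure prompt tokens only
    return (hidden.abs().amax() / hidden.abs().median()).item()

# Expected per the paper's insight: the ratio with the prefix is far smaller than
# without it, which is what makes a single offline per-tensor scale sufficient.
```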