The newly released PrefixQuant is the first method to make efficient per-tensor static quantization outperform expensive per-token dynamic quantization.
Results 📊:
• W4A4KV4 Llama-3-8B: 7.43 WikiText2 perplexity, 71.08% average accuracy on 5 tasks
• Outperforms QuaRot by 0.98 in perplexity and 5.98 points in average accuracy
• 1.60× to 2.81× faster than FP16 models
• 1.2× to 1.3× faster than QuaRot models
PrefixQuant enables static quantization to outperform dynamic quantization for LLMs by effectively handling token-wise outliers.
📚 https://arxiv.org/pdf/2410.05265
Original Problem 🔍:
Quantizing large language models (LLMs) is complicated by token-wise activation outliers, which force a reliance on costly per-token dynamic quantization.
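To make that contrast concrete, here is a minimal PyTorch sketch (not the paper's code) of the two schemes: per-token dynamic quantization computes a fresh scale for every token at runtime, while per-tensor static quantization reuses one offline-calibrated scale, which a single outlier token can blow up.

```python
import torch

def per_token_dynamic_quant(x: torch.Tensor, n_bits: int = 8):
    """One scale per token (row), computed at inference time: flexible but costly."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

def per_tensor_static_quant(x: torch.Tensor, scale: float, n_bits: int = 8):
    """One scale for the whole tensor, calibrated offline: cheap at runtime."""
    qmax = 2 ** (n_bits - 1) - 1
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

# A single outlier token inflates the offline per-tensor scale and wastes the
# quantization grid on every other token -- the failure mode PrefixQuant removes.
acts = torch.randn(16, 4096)
acts[0] *= 100.0                                 # simulated outlier token
static_scale = (acts.abs().max() / 127).item()   # offline calibration
q_static, _ = per_tensor_static_quant(acts, static_scale)
q_dynamic, _ = per_token_dynamic_quant(acts)
```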
Solution in this Paper 💡:
• Introduces PrefixQuant, a technique to isolate outlier tokens offline without re-training
• Identifies high-frequency outlier tokens and prefixes them in the KV cache (see the sketch after this list)
• Prevents generation of outlier tokens during inference
• Enables efficient per-tensor static quantization to outperform per-token dynamic quantization
• Includes block-wise fine-tuning optimization to improve performance
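Below is a minimal sketch of the offline prefixing idea, assuming a HuggingFace-style causal LM; the helper names and the simple max-over-median outlier criterion are illustrative assumptions, not the paper's implementation.

```python
import torch

@torch.no_grad()
def find_outlier_token_ids(model, tokenizer, calib_texts, top_k=4, ratio=64.0):
    """Count which token ids repeatedly produce activations far above the
    per-sequence median (a stand-in criterion for 'outlier token')."""
    counts = {}
    for text in calib_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        hidden = model(ids, output_hidden_states=True).hidden_states[-1][0]  # [seq, dim]
        tok_mag = hidden.abs().amax(dim=-1)          # per-token activation magnitude
        for tid in ids[0][tok_mag > ratio * tok_mag.median()].tolist():
            counts[tid] = counts.get(tid, 0) + 1
    return [tid for tid, _ in sorted(counts.items(), key=lambda kv: -kv[1])[:top_k]]

@torch.no_grad()
def build_prefixed_kv_cache(model, outlier_ids):
    """Pre-fill the KV cache with the outlier tokens once, offline, so that
    later tokens no longer need to carry the outliers themselves."""
    prefix = torch.tensor([outlier_ids], device=model.device)
    return model(prefix, use_cache=True).past_key_values  # reused for every request
```

At inference time the cached prefix is supplied as `past_key_values`, so every request sees the same fixed context and per-tensor static scales can be calibrated once, offline.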
Key Insights from this Paper 💡:
• Outlier tokens usually appear at fixed positions or in tokens with low semantic value
• Prefixing outliers in the KV cache significantly reduces outlier magnitudes (see the check after this list)
• Static quantization can outperform dynamic quantization with proper outlier handling
• PrefixQuant is plug-and-play, enhancing other optimization-based methods
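A quick illustrative check of that insight, again assuming a HuggingFace-style model and a list of outlier token ids found offline (e.g. by the sketch above): compare the activation range of the prompt tokens with and without the outlier prefix in front of them.

```python
import torch

@torch.no_grad()
def prompt_outlier_ratio(model, tokenizer, text, outlier_ids=()):
    """Max/median activation ratio over the prompt tokens only (prefix excluded)."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    if outlier_ids:
        prefix = torch.tensor([list(outlier_ids)], device=model.device)
        ids = torch.cat([prefix, ids], dim=1)
    hidden = model(ids, output_hidden_states=True).hidden_states[-1][0]
    hidden = hidden[len(outlier_ids):]               # measure prompt tokens only
    return (hidden.abs().amax() / hidden.abs().median()).item()

# Expected per the paper's insight: the ratio with the prefix is far smaller than
# without it, which is what makes a single offline per-tensor scale sufficient.
```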