VPTQ (Vector Post-Training Quantization) looks MASSIVE: it shrinks 405B-parameter LLMs down to 2 bits while preserving quality, using vector quantization guided by second-order (Hessian) information. 🤯
A powerful proposal for extreme low-bit LLM compression.
Compresses 70B and even 405B models to 1-2 bits without retraining 🤯
• Employs Second-Order Optimization to formulate and solve LLM vector quantization
• Uses Channel-Independent Second-Order Optimization for granular quantization
• Implements Hessian-Weighted Centroid Initialization for effective codebook creation
• Incorporates Residual Vector Quantization and Outlier Elimination for enhanced accuracy (see the sketch after this list)
• Supports layer-wise and end-to-end finetuning options
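To make the pipeline concrete, here is a minimal NumPy sketch of the idea (not the authors' code): weights are grouped into short vectors, assigned to codebook centroids with a Hessian-diagonal-weighted k-means standing in for the Hessian-weighted centroid initialization, and the leftover error is quantized with a second, smaller codebook (residual VQ). The vector length, codebook sizes, and per-channel Hessian diagonal are illustrative assumptions; outlier elimination and finetuning are omitted.

```python
import numpy as np

def weighted_kmeans(X, w, k, iters=10, seed=0):
    """k-means where distances and centroid updates are weighted per dimension
    by w (same shape as X) -- a stand-in for Hessian-weighted centroid init."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]           # init centroids from data vectors
    for _ in range(iters):
        # importance-weighted squared distance of every vector to every centroid
        d = ((X[:, None, :] - C[None, :, :]) ** 2 * w[:, None, :]).sum(-1)
        assign = d.argmin(1)
        for j in range(k):                                      # weighted centroid update
            m = assign == j
            if m.any():
                C[j] = (X[m] * w[m]).sum(0) / (w[m].sum(0) + 1e-12)
    return C, assign

def quantize_layer(W, hess_diag, vec_len=8, k_main=256, k_res=16):
    """Quantize an (out, in) weight matrix with a main codebook plus a residual codebook."""
    rows, cols = W.shape
    X = W.reshape(-1, vec_len)                                  # length-v vectors along the input dim
    w = np.tile(hess_diag.reshape(-1, vec_len), (rows, 1))      # per-input-channel importance, repeated per row
    C0, a0 = weighted_kmeans(X, w, k_main)                      # main vector quantization
    C1, a1 = weighted_kmeans(X - C0[a0], w, k_res)              # residual vector quantization of the error
    W_hat = (C0[a0] + C1[a1]).reshape(rows, cols)
    bits = (np.log2(k_main) + np.log2(k_res)) / vec_len         # index bits per weight (~1.5 here)
    return W_hat, bits

# toy usage: random layer, random stand-in for the Hessian diagonal
W = np.random.randn(64, 64).astype(np.float32)
H = np.abs(np.random.randn(64)) + 0.1
W_hat, bits = quantize_layer(W, H)
print(f"~{bits:.2f} index bits/weight, recon error {np.linalg.norm(W - W_hat):.3f}")
```

At inference, only the small index tensors and the two codebooks need to be stored; dequantization is a pair of table lookups plus an add.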
-----
📚 https://arxiv.org/pdf/2409.17066
Key Insights from this Paper 💡:
• Vector quantization outperforms scalar methods for extreme low-bit compression
• Channel-independent optimization mitigates error accumulation
• Balancing vector length and codebook size is crucial for accuracy and inference speed (quick arithmetic after this list)
• Residual quantization and outlier handling significantly improve model performance
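On the vector-length vs. codebook-size tradeoff, a back-of-the-envelope calculation (assumed layer size and codebook settings, not figures from the paper) shows why it matters: index cost is log2(k)/v bits per weight, so longer vectors need exponentially larger codebooks to keep the same budget, and the fp16 codebook itself starts to eat into the savings and slow down lookups.

```python
import math

def bits_per_weight(k, v, n_weights, codebook_dtype_bits=16):
    """Index bits per weight plus the fp16 codebook cost amortized over the layer."""
    index_bits = math.log2(k) / v
    codebook_bits = k * v * codebook_dtype_bits / n_weights
    return index_bits + codebook_bits

# hypothetical 8192 x 8192 layer
n = 8192 * 8192
for v, k in [(4, 256), (8, 256), (8, 65536), (16, 65536)]:
    print(f"v={v:2d}, k={k:6d} -> ~{bits_per_weight(k, v, n):.3f} bits/weight")
```

The last two rows show the catch: pushing toward ~1 bit with long vectors means 64K-entry codebooks, which is exactly the accuracy/speed balance the paper highlights and one reason it layers in residual codebooks instead.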
-----
Results 📊:
• Reduces perplexity by 0.01-0.34 (LLaMA-2), 0.38-0.68 (Mistral-7B), 4.41-7.34 (LLaMA-3) at 2-bit precision
• Improves QA accuracy by 0.79-1.5% (LLaMA-2), 1% (Mistral-7B), 11-22% (LLaMA-3)
• Achieves 1.6-1.8× higher inference throughput than SOTA
• Requires only 10.4-18.6% of the quantization time of SOTA algorithms