The paper proposes OSTQuant, a quantization method that uses orthogonal and scaling transformations to refine weight and activation distributions in Large Language Models for better quantization and performance.
-----
Paper - https://arxiv.org/abs/2501.13987
Original Problem 😮:
→ Post-training quantization methods often degrade model performance.
→ This degradation arises from distribution outliers in Large Language Model weights and activations.
-----
Solution in this Paper 😎:
→ This paper introduces OSTQuant.
→ OSTQuant refines weight distributions before quantization.
→ It employs orthogonal and scaling transformations.
→ The orthogonal transformation rotates the weight space, decorrelating dimensions and spreading outlier values across channels.
→ The scaling transformation then reduces the dynamic range of each channel.
→ Together, these transformations make the weight distributions more quantization-friendly.
→ OSTQuant applies them before standard quantization techniques (see the sketch below).
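Below is a minimal NumPy sketch of the general recipe (orthogonal rotation plus per-channel scaling applied before quantization). It is not the authors' implementation: the random orthogonal matrix, the per-channel max-based scaling, and helper names like `fake_quant` are illustrative assumptions.

```python
import numpy as np

def random_orthogonal(n, seed=0):
    # Orthogonal matrix via QR decomposition of a random Gaussian matrix.
    # (OSTQuant learns its transforms; a random rotation is just a stand-in.)
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def fake_quant(x, bits=4):
    # Symmetric per-tensor quantize -> dequantize.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

def transform_then_quant(W, bits=4):
    # W: (out_features, in_features) weight matrix.
    Q = random_orthogonal(W.shape[1])          # orthogonal transform
    s = np.abs(W @ Q).max(axis=0)              # per-channel scaling factors
    W_t = (W @ Q) / s                          # flatter, outlier-free distribution
    W_q = fake_quant(W_t, bits)                # quantize the transformed weights
    # Both transforms are invertible, so they can be folded back:
    W_restored = (W_q * s) @ Q.T
    return W_q, W_restored

# Quick check: 4-bit reconstruction error with vs. without the transforms.
rng = np.random.default_rng(1)
W = rng.standard_normal((256, 256))
W[:, 0] *= 50                                  # inject an outlier channel
_, W_rec = transform_then_quant(W)
print("plain 4-bit error:      ", np.abs(W - fake_quant(W)).mean())
print("transformed 4-bit error:", np.abs(W - W_rec).mean())
```

Because the rotation and the scaling are invertible, they can be folded back into the surrounding computation, so only the representation that gets quantized changes, not the network's function.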
-----
Key Insights from this Paper 🤔:
→ Outlier dimensions in Large Language Model weights hurt quantization because a few extreme values dictate the quantization scale for the entire tensor.
→ The orthogonal transformation reduces the impact of these outlier dimensions.
→ The scaling transformation further flattens the weight distribution for quantization.
→ Pre-processing weights with these transformations significantly improves quantization accuracy (a toy numeric illustration follows below).
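As a toy numeric illustration (not taken from the paper), a simple 2-D rotation shows how an orthogonal transform spreads an outlier channel's magnitude across dimensions, shrinking the range mismatch that forces coarse quantization steps:

```python
import numpy as np

# One "outlier" channel dominates the range before rotation.
x = np.array([[100.0,  1.0],
              [ 90.0, -1.5],
              [110.0,  0.5]])

theta = np.pi / 4                     # 45-degree rotation (orthogonal transform)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x_rot = x @ R
print("per-channel max |x| before:", np.abs(x).max(axis=0))      # ~[110, 1.5]
print("per-channel max |x| after: ", np.abs(x_rot).max(axis=0))  # ~[78, 77]
```

Before the rotation one channel spans roughly ±110 while the other spans only ±1.5, so a shared quantization scale wastes nearly all of its levels on the small channel; after the rotation both channels span roughly ±78 and use the quantization grid far more evenly.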
-----
Results 🚀:
→ OSTQuant reduces perplexity by up to 4.2% on LLaMA-7B quantized to 4-bit, compared to standard quantization.
→ On OPT-1.3B, OSTQuant improves accuracy on average by 2.1% across various quantization bitwidths.
→ OSTQuant achieves performance close to full-precision models even at 4-bit quantization.