A mathematical proof finally explains why quantizing LLMs layer-by-layer actually works.
This paper introduces a theoretical framework linking layer-wise quantization error to model perplexity, enabling better data-free quantization methods for LLMs. The central result is a proof that the increase in perplexity is an approximately linear function of the per-layer L2 reconstruction errors (sketched below), which leads to HIGGS, a novel quantization method built on Hadamard rotations and MSE-optimal grids.
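Schematically, the result looks like the following (a paraphrase for intuition; the exact error normalization and the definition of the per-layer constants follow the paper):

$$\Delta \mathrm{PPL} \;\approx\; \sum_{l} a_l \, e_l, \qquad e_l = \frac{\lVert \widehat{W}_l - W_l \rVert_F^2}{\lVert W_l \rVert_F^2}$$

where $W_l$ is layer $l$'s original weight matrix, $\widehat{W}_l$ its quantized version, and $a_l$ a layer-specific constant.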
-----
https://arxiv.org/abs/2411.17525
Original Problem 🤔:
Existing LLM quantization methods lack theoretical backing for why minimizing per-layer error metrics works. There's no clear understanding of how layer-wise quantization affects overall model performance.
-----
Solution in this Paper 🛠️:
→ The paper proves a "linearity theorem": the increase in model perplexity is an approximately linear function of the per-layer L2 quantization errors.
→ Based on this theorem, they develop HIGGS, a data-free quantization method that uses Hadamard rotations to make weight distributions approximately Gaussian (rotation step sketched after this list).
→ HIGGS then quantizes the rotated weights with MSE-optimal grids, computed efficiently using the CLVQ algorithm (grid construction sketched below).
→ For non-uniform quantization, they use dynamic programming to find the optimal per-layer bit-widths under a global bit budget (allocation sketched below).
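A minimal sketch of the rotation step, assuming NumPy/SciPy and a block size that is a power of two (`randomized_hadamard` is an illustrative helper, not the paper's kernel):

```python
import numpy as np
from scipy.linalg import hadamard

def randomized_hadamard(W: np.ndarray, seed: int = 0) -> np.ndarray:
    """Rotate the rows of W with a random-sign Hadamard matrix.

    Assumes the last dimension of W is a power of two (real kernels pad or
    block the weights; this helper is illustrative, not the paper's code).
    """
    rng = np.random.default_rng(seed)
    d = W.shape[-1]
    H = hadamard(d) / np.sqrt(d)             # orthonormal Hadamard matrix
    signs = rng.choice([-1.0, 1.0], size=d)  # random sign flips decorrelate entries
    return (W * signs) @ H                   # rotated weights look roughly Gaussian

W = np.random.rand(128, 64)                  # toy "weight matrix" with uniform entries
W_rot = randomized_hadamard(W)
print(W_rot.mean(), W_rot.std())             # rotated entries concentrate around a bell curve
```

Because the rotation is orthogonal, dequantization simply applies the inverse transform (transpose of H, then the same sign flips).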
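A minimal sketch of building an MSE-optimal grid for standard-Gaussian weights, using Lloyd-Max-style iterations on samples as a simple stand-in for the CLVQ procedure named in the paper (`mse_optimal_grid` is a hypothetical helper):

```python
import numpy as np

def mse_optimal_grid(bits: int = 4, n_samples: int = 200_000,
                     iters: int = 50, seed: int = 0) -> np.ndarray:
    """Lloyd-Max iterations on N(0,1) samples -> an approximately MSE-optimal scalar grid."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    n_levels = 2 ** bits
    # start from Gaussian quantiles (the Normal Float idea), then refine for MSE
    grid = np.quantile(x, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        idx = np.abs(x[:, None] - grid[None, :]).argmin(axis=1)  # nearest grid point
        for k in range(n_levels):
            members = x[idx == k]
            if members.size:
                grid[k] = members.mean()     # move each level to its cell's centroid
    return np.sort(grid)

print(mse_optimal_grid(bits=3))              # 8-level grid for 3-bit quantization
```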
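A minimal sketch of the bit-width allocation idea, with a made-up per-layer error table; the linearity theorem is what justifies summing per-layer errors as the objective (the table values and `best` helper are illustrative, not from the paper):

```python
from functools import lru_cache

# Illustrative (made-up) table: err[l][b] = predicted error of layer l at b bits.
bit_choices = [2, 3, 4]
err = [
    {2: 0.90, 3: 0.30, 4: 0.10},   # layer 0
    {2: 0.70, 3: 0.25, 4: 0.08},   # layer 1
    {2: 1.20, 3: 0.40, 4: 0.15},   # layer 2
]
budget = 9                          # total bits across layers (average of 3 bits/layer)

@lru_cache(maxsize=None)
def best(layer: int, bits_left: int):
    """Minimal summed error (and the bit plan) for layers layer..end."""
    if layer == len(err):
        return 0.0, ()
    best_cost, best_plan = float("inf"), None
    for b in bit_choices:
        if b <= bits_left:
            tail_cost, tail_plan = best(layer + 1, bits_left - b)
            cost = err[layer][b] + tail_cost    # linearity: total error is a sum over layers
            if cost < best_cost:
                best_cost, best_plan = cost, (b,) + tail_plan
    return best_cost, best_plan

print(best(0, budget))              # minimal summed error and one optimal bit assignment
```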
-----
Key Insights 💡:
→ Layer-wise quantization error has an approximately linear relationship with the increase in model perplexity
→ Hadamard rotations make weight distributions approximately Gaussian
→ MSE-optimal grids outperform existing approaches like Normal Float for data-free quantization
-----
Results 📊:
→ HIGGS outperforms Normal Float and Abnormal Float formats in the 3-4 bit range
→ Achieves 2-3x speedup vs FP16 with minimal accuracy loss
→ Dynamic (non-uniform) HIGGS beats calibration-based methods like GPTQ and AWQ in the 3-4 bit range