Weight classification in pre-calibration improves the robustness of quantized LLMs across different tasks.
This paper introduces a statistical pre-calibration method for post-training quantization of LLMs, enhancing robustness and efficiency.
KL divergence minimization guides LLM quantization
-----
https://arxiv.org/abs/2501.09107
Original Problem 🤔:
→ Current post-training quantization (PTQ) methods for large language models (LLMs) rely on calibration data to decide how to quantize the weights.
→ This calibration becomes less effective when the deployment task or domain differs from the data used for calibration, degrading the quantized model's accuracy.
-----
Solution in this Paper 💡:
→ This paper proposes a weight-adaptive pre-calibration method. This method acts as a precursor to calibration-based PTQ.
→ It minimizes Kullback-Leibler divergence between original and quantized weights. This preserves the Shannon information content of the original model.
→ This pre-calibration classifies weights into salient and non-salient categories using pseudo activations and soft-thresholding (see the sketch after this list).
→ Unlike traditional calibration methods, it does not adjust the weights themselves. This simplifies the algorithm and makes it computationally efficient.
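Below is a minimal, hypothetical PyTorch sketch of the salient/non-salient split. The pseudo-activation proxy (column norms of the weights), the threshold `tau`, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def soft_threshold(x: torch.Tensor, tau: float) -> torch.Tensor:
    """Soft-thresholding operator: shrink values toward zero by tau."""
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

def classify_salient_weights(weight: torch.Tensor, tau: float = 0.01):
    """Split a weight matrix into salient / non-salient masks without real
    calibration data.

    Assumptions (not from the paper): pseudo activations are approximated by
    the per-column L2 norm of the weights, and a weight is salient if its
    soft-thresholded importance score is non-zero.
    """
    pseudo_act = weight.norm(p=2, dim=0, keepdim=True)   # (1, in_features) proxy for input stats
    importance = weight.abs() * pseudo_act                # element-wise importance score
    shrunk = soft_threshold(importance, tau)              # suppress small scores
    salient_mask = shrunk > 0                             # weights to keep at higher precision
    return salient_mask, ~salient_mask

# Usage: keep salient weights at higher precision, quantize the rest aggressively.
W = torch.randn(4096, 4096)
salient, non_salient = classify_salient_weights(W, tau=0.05)
print(f"salient fraction: {salient.float().mean().item():.3%}")
```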
-----
Key Insights from this Paper 🤯:
→ Preserving the weight distribution through KL divergence minimization improves quantization robustness (a measurement sketch follows this list).
→ Classifying weights into salient and non-salient categories provides a robust initial point for further calibration.
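To make the "preserve the weight distribution" idea concrete, here is a hypothetical sketch that measures the KL divergence between histograms of the original and quantized weights. The binning, the naive 4-bit round-to-nearest quantizer, and the function names are assumptions for illustration, not the paper's actual objective or quantizer.

```python
import torch

def weight_kl_divergence(w_orig: torch.Tensor, w_quant: torch.Tensor, n_bins: int = 256) -> float:
    """KL(P_orig || P_quant) between histogram estimates of two weight distributions.

    Illustrative only: the paper minimizes a KL-based objective during quantization;
    this sketch merely measures how far a given quantization drifts from the
    original weight distribution.
    """
    lo = min(w_orig.min(), w_quant.min()).item()
    hi = max(w_orig.max(), w_quant.max()).item()
    p = torch.histc(w_orig.float(), bins=n_bins, min=lo, max=hi) + 1e-8
    q = torch.histc(w_quant.float(), bins=n_bins, min=lo, max=hi) + 1e-8
    p, q = p / p.sum(), q / q.sum()
    return torch.sum(p * (p / q).log()).item()

# Example with a naive symmetric 4-bit round-to-nearest quantizer (assumption).
W = torch.randn(4096, 4096)
scale = W.abs().max() / 7
W_q = torch.clamp((W / scale).round(), -8, 7) * scale
print(f"KL(original || quantized) = {weight_kl_divergence(W, W_q):.4f}")
```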
-----
Results ✨:
→ Achieves accuracy on par with existing calibration-based PTQ methods on various LLMs.
→ For Code-Llama models, pre-calibration achieved higher accuracy than SpQR on HumanEval and MBPP when the calibration data did not match the task domain.
→ Quantization is 10x faster than AWQ and 100x faster than SpQR.