Adding small full-precision correction matrices lets 4-bit LLMs perform nearly as well as their full-precision counterparts.
A novel method to fix quantization errors in LLMs by adding low-rank weight matrices that work on unquantized activations, enabling efficient 4-bit quantization.
-----
https://arxiv.org/abs/2412.07902
🤖 Original Problem:
→ Current LLM quantization methods struggle with information loss when compressing both weights and activations to 4-bit precision (W4A4), leading to significant accuracy drops.
→ Existing solutions can't effectively handle activation quantization errors at lower bit precision.
-----
🔧 Solution in this Paper:
→ Introduces LRC (Low-Rank Correction) that adds low-rank weight matrices in full precision to fix quantization errors.
→ Uses joint optimization to balance between quantized weights and additional low-rank matrices.
→ Processes quantized activations with quantized weights while simultaneously applying low-rank corrections to the unquantized activations (see the sketch after this list).
→ Compatible with existing techniques such as QuaRot rotations and group-wise quantization.
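
A minimal PyTorch sketch of how such a corrected linear layer could look, assuming simple symmetric round-to-nearest fake quantization and hypothetical low-rank factors U and V (the paper's actual packed kernels, calibration, and QuaRot rotations are omitted):

```python
import torch

def rtn_fake_quant(t: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Per-tensor symmetric round-to-nearest "fake" quantization (illustrative only).
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().max().clamp(min=1e-8) / qmax
    return (t / scale).round().clamp(-qmax - 1, qmax) * scale

def lrc_linear(x, w, U, V, bits=4):
    # x: unquantized activations, shape (batch, d_in)
    # w: full-precision weight, shape (d_out, d_in); quantized on the fly here
    # U: (d_out, r), V: (r, d_in) -- full-precision low-rank correction factors
    main = rtn_fake_quant(x, bits) @ rtn_fake_quant(w, bits).T   # W4A4 main path
    correction = (x @ V.T) @ U.T                                 # full-precision low-rank path
    return main + correction

# Toy usage: a 4096x4096 projection with rank r = 205 (~10% parameter budget)
d_out, d_in, r = 4096, 4096, 205
x = torch.randn(2, d_in)
w = torch.randn(d_out, d_in) * 0.02
U, V = torch.randn(d_out, r) * 0.02, torch.randn(r, d_in) * 0.02
y = lrc_linear(x, w, U, V)   # shape (2, 4096)
```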
-----
💡 Key Insights:
→ Low-rank matrices holding just 10% of the original weights' parameters cut the accuracy loss roughly in half (see the rank arithmetic after this list)
→ Increasing the rank budget to 30% essentially eliminates the accuracy gap
→ Weight-only quantization needs only minimal error correction
→ Simple round-to-nearest activation quantization is sufficient
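
As a rough back-of-the-envelope check on what "10% of the original size" can mean, here is the rank that matches a given parameter budget for a single projection (the paper's exact budget accounting may differ):

```python
def rank_for_budget(d_out: int, d_in: int, fraction: float) -> int:
    # Rank r such that U (d_out x r) plus V (r x d_in) hold roughly
    # `fraction` of the original d_out x d_in parameter count.
    return max(1, round(fraction * d_out * d_in / (d_out + d_in)))

print(rank_for_budget(4096, 4096, 0.10))  # 205
print(rank_for_budget(4096, 4096, 0.30))  # 614
```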
-----
📊 Results:
→ Tested on Phi-3, Llama-2 (7B, 13B), Llama-3 (8B), and Mixtral models
→ With low-rank matrices at 10% of the original weight size, the accuracy gap shrinks by more than 50%
→ At 30%, performance approaches that of the original unquantized model
→ Computational overhead remains reasonable