"Low-Rank Correction for Quantized LLMs"

Podcast on this paper generated with Google's Illuminate.

Adding tiny full-precision matrices lets 4-bit LLMs perform almost like their full-precision counterparts.

A method that fixes quantization errors in LLMs by adding low-rank weight matrices that operate on unquantized activations, enabling efficient 4-bit quantization.

-----

https://arxiv.org/abs/2412.07902

🤖 Original Problem:

→ Current LLM quantization methods struggle with information loss when compressing both weights and activations to 4-bit precision (W4A4), leading to significant accuracy drops.

→ Existing solutions can't effectively handle activation quantization errors at lower bit precision.

-----

🔧 Solution in this Paper:

→ Introduces LRC (Low-Rank Correction) that adds low-rank weight matrices in full precision to fix quantization errors.

→ Jointly optimizes the quantized weights and the additional low-rank matrices.

→ Processes quantized activations with quantized weights while simultaneously applying the low-rank correction to the unquantized activations (see the sketch after this list).

→ Compatible with existing techniques such as QuaRot rotations and group-wise quantization.
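
A minimal sketch of the two paths, assuming PyTorch; the quantizer, variable names, and shapes below are illustrative, not the paper's code:

```python
import torch

def quantize_rtn(x, bits=4):
    # Symmetric round-to-nearest quantization per row (illustrative stand-in
    # for a 4-bit quantizer; not the paper's exact implementation).
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def lrc_forward(x, w_q, U, V, bits=4):
    # Main path: quantized weights applied to quantized activations (W4A4).
    y_main = quantize_rtn(x, bits) @ w_q.T
    # Correction path: full-precision low-rank factors U (d_out x r) and
    # V (r x d_in) applied to the *unquantized* activations.
    y_corr = (x @ V.T) @ U.T
    return y_main + y_corr

# Hypothetical 4096 x 4096 projection with rank at ~10% of the dimension.
d, r = 4096, 410
x = torch.randn(2, d)
w_q = quantize_rtn(torch.randn(d, d))          # stand-in for a 4-bit weight
U, V = torch.randn(d, r) * 0.01, torch.randn(r, d) * 0.01
y = lrc_forward(x, w_q, U, V)
print(y.shape)  # torch.Size([2, 4096])
```

Because the correction path sees the unquantized activations, the low-rank term can compensate for both weight and activation quantization error rather than weight error alone.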

-----

💡 Key Insights:

→ Low-rank matrices at just 10% of the original size can cut the accuracy loss in half (a rough memory estimate follows this list).

→ Increasing the rank to 30% essentially closes the accuracy gap.

→ Weight-only quantization needs minimal error correction.

→ Simple round-to-nearest activation quantization is sufficient.
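
A rough back-of-the-envelope estimate of the memory cost, assuming "rank size" means rank as a fraction of the layer dimension and that the correction factors are stored in 16-bit floats; the 4096 x 4096 projection is a hypothetical example, not a figure from the paper:

```python
def lrc_memory_fraction(d_in, d_out, rank_frac, w_bits=4, corr_bits=16, fp_bits=16):
    # Fraction of the original full-precision layer memory used by the
    # quantized weight plus the full-precision low-rank correction factors.
    r = int(rank_frac * min(d_in, d_out))
    quantized = d_in * d_out * w_bits
    correction = r * (d_in + d_out) * corr_bits
    original = d_in * d_out * fp_bits
    return (quantized + correction) / original

# Hypothetical 4096 x 4096 projection, 10% rank, 4-bit weights, fp16 factors:
print(f"{lrc_memory_fraction(4096, 4096, 0.10):.2f} of fp16 size")  # ~0.45
```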

-----

📊 Results:

→ Tested on Phi-3, Llama-2 (7B, 13B), Llama-3 (8B), and Mixtral models.

→ With a rank size of 10%, LRC reduces the accuracy gap by more than 50%.

→ With a rank size of 30%, it achieves near-original model performance.

→ Maintains reasonable computational overhead.
