Why fix errors bit by bit when you can fix the whole model at once? That's RILQ's breakthrough.
RILQ introduces a rank-insensitive method for 2-bit LLM quantization that maintains accuracy while sharply reducing memory usage. It employs a model-wise activation discrepancy loss that enables cooperative error compensation across layers, addressing the accuracy loss earlier LoRA-based methods suffer under such aggressive quantization.
-----
https://arxiv.org/abs/2412.01129
🎯 Original Problem:
Existing LoRA-based quantization methods struggle with 2-bit LLM compression: they require high adapter ranks for error compensation and still suffer significant accuracy loss. Prior work also offers little analysis of why low-rank adaptation underperforms in such aggressive quantization regimes.
-----
🔧 Solution in this Paper:
→ RILQ applies a model-wise activation discrepancy loss at the final Transformer layer's output, rather than optimizing each layer or linear module in isolation.
→ This approach enables cooperative adjustment between rank-critical and rank-redundant modules during LoRA tuning.
→ RILQ combines this model-level discrepancy loss with the standard causal language modeling objective to preserve token-generation quality (see the sketch after this list).
→ The method maintains computational efficiency comparable to existing LoRA approaches while enabling adapter-merged weight-quantized inference.
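For intuition, here is a minimal sketch of how such a combined objective could be computed, assuming a Hugging Face-style causal-LM interface. The function name, the MSE choice for the discrepancy term, and the `alpha` weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rilq_style_loss(fp_model, q_model, input_ids, alpha=1.0):
    """Sketch: model-wise activation discrepancy + causal LM loss.

    fp_model: frozen full-precision reference model
    q_model: 2-bit quantized model with trainable LoRA adapters
    """
    with torch.no_grad():
        # Reference activations at the final Transformer layer output.
        fp_hidden = fp_model(input_ids, output_hidden_states=True).hidden_states[-1]

    q_out = q_model(input_ids, output_hidden_states=True, labels=input_ids)
    q_hidden = q_out.hidden_states[-1]

    # One discrepancy loss over the whole model, so rank-critical and
    # rank-redundant modules can compensate for quantization error jointly.
    disc_loss = F.mse_loss(q_hidden, fp_hidden)

    # Causal language modeling objective to keep token-generation ability.
    clm_loss = q_out.loss

    return disc_loss + alpha * clm_loss
```

Because gradients flow from a single model-level loss back through all LoRA adapters, modules are free to trade off error compensation among themselves instead of each being forced to match its own full-precision counterpart.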
-----
💡 Key Insights:
→ 2-bit quantization errors are inherently high-rank, which challenges traditional low-rank adaptation (see the toy check after this list)
→ Rank sensitivity decreases as discrepancy scope expands from single linear module to entire model
→ Model-wise optimization allows flexible signal propagation and better error compensation
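The high-rank nature of aggressive quantization error shows up even in a toy setting. The sketch below (not from the paper: a random matrix and naive round-to-nearest 2-bit quantization) prints how little of the error's energy a low-rank approximation can capture.

```python
import torch

torch.manual_seed(0)
W = torch.randn(1024, 1024)  # stand-in for an LLM linear-layer weight

# Naive 2-bit symmetric round-to-nearest quantization (illustrative only).
n_levels = 2 ** 2
scale = W.abs().max() / (n_levels / 2)
W_q = torch.clamp(torch.round(W / scale), -(n_levels // 2), n_levels // 2 - 1) * scale

# Quantization error and its singular value spectrum.
err = W - W_q
s = torch.linalg.svdvals(err)

# Fraction of the error's squared energy a rank-r approximation could capture.
energy = torch.cumsum(s ** 2, dim=0) / (s ** 2).sum()
for r in (16, 64, 256):
    print(f"rank {r:>3}: {energy[r - 1].item():.3f} of error energy")
```

The energy is spread across many singular values, so a single low-rank adapter cannot absorb the error of one module on its own, which is exactly why widening the discrepancy scope to the whole model reduces rank sensitivity.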
-----
📊 Results:
→ Improves QuIP# accuracy by 8.1% on LLaMA-3-8B
→ Achieves better perplexity with rank-16 adapters than SVD-based compensation at rank 256
→ Maintains performance while using only 3.5 GB of memory versus 14.8 GB for the full-precision model