MixLLM introduces a novel quantization method that applies mixed precision across output features based on their global importance, achieving better accuracy with minimal memory overhead.
-----
https://arxiv.org/abs/2412.14590
🤔 Original Problem:
→ Existing LLM quantization methods either lose accuracy or fall short on system efficiency
→ Weight-only methods struggle to keep accuracy at 4-bit, while weight-activation methods run into system performance problems
-----
🔧 Solution in this Paper:
→ MixLLM assigns different bit-widths (8-bit vs 4-bit) to different output channels based on their global importance to the model's output (first sketch below)
→ It identifies high-importance features across the entire model rather than layer by layer
→ Uses symmetric 8-bit activation quantization and asymmetric 4-bit weight quantization in a group-wise manner (second sketch below)
→ Implements a two-step dequantization to make efficient use of int8 Tensor Cores (also in the second sketch)
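To make the first two bullets concrete, here is a minimal sketch of global salience ranking across output channels. The salience metric (first-order gradient-times-weight magnitude per output channel) and the 10% high-precision budget are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def channel_salience(weight, grad):
    # Illustrative salience: first-order |grad * weight| impact on the loss,
    # summed per output channel (weight is [out_features, in_features]).
    return (grad * weight).abs().sum(dim=1)

def assign_bit_widths(layers, high_bit_fraction=0.10):
    # Rank output channels across ALL layers at once (global, not per-layer)
    # and keep the top fraction in 8-bit; everything else drops to 4-bit.
    scores, index = [], []
    for name, (w, g) in layers.items():
        s = channel_salience(w, g)
        scores.append(s)
        index.extend((name, c) for c in range(s.numel()))
    all_scores = torch.cat(scores)
    k = int(high_bit_fraction * all_scores.numel())
    top = set(torch.topk(all_scores, k).indices.tolist())
    plan = {}
    for i, (name, c) in enumerate(index):
        plan.setdefault(name, {})[c] = 8 if i in top else 4
    return plan

# Toy usage with two fake linear layers (random weights / gradients).
layers = {
    "layer0": (torch.randn(16, 32), torch.randn(16, 32)),
    "layer1": (torch.randn(16, 32), torch.randn(16, 32)),
}
plan = assign_bit_widths(layers)
print(sum(b == 8 for ch in plan.values() for b in ch.values()), "channels kept in 8-bit")
```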
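And a minimal sketch of the W4A8 scheme and the two-step dequantization from the last two bullets, emulated in plain PyTorch. Group size 128 and per-token activation scaling are my assumptions; the paper's actual CUDA kernel layout is not reproduced here:

```python
import torch

def quant_act_sym_int8(x):
    # Symmetric per-token int8 activation quantization: one scale per row, no zero point.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def quant_weight_asym_int4(w, group_size=128):
    # Asymmetric group-wise 4-bit weight quantization along the input dimension:
    # each group of `group_size` inputs shares one scale and one zero point.
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group_size, group_size)
    w_min, w_max = wg.amin(dim=-1, keepdim=True), wg.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0            # uint4 range 0..15
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(wg / scale) + zero, 0, 15).to(torch.uint8)
    return q, scale, zero

def w4a8_gemm_two_step(qx, sx, qw, scale, zero, group_size=128):
    # Step 1 (integer domain): widen the 4-bit weights to a wider int type, subtract
    # the zero point, and run the matmul per group with integer accumulation --
    # on real hardware this part maps onto int8 Tensor Core MMA instructions.
    # Step 2 (float domain): fold the per-group weight scale and the per-token
    # activation scale into the integer partial sums.
    y = torch.zeros(qx.shape[0], qw.shape[0])
    for g in range(qw.shape[1]):
        xg = qx[:, g * group_size:(g + 1) * group_size].to(torch.int32)
        wg = qw[:, g, :].to(torch.int32) - zero[:, g, :].to(torch.int32)   # step 1
        acc = xg @ wg.T                                                    # int32 accumulate
        y += acc.float() * scale[:, g, 0]                                  # step 2
    return y * sx

# Toy check against the float reference GEMM.
x, w = torch.randn(4, 256), torch.randn(512, 256)
qx, sx = quant_act_sym_int8(x)
qw, s, z = quant_weight_asym_int4(w)
y_ref, y_q = x @ w.T, w4a8_gemm_two_step(qx, sx, qw, s, z)
print("mean squared error vs float GEMM:", torch.mean((y_ref - y_q) ** 2).item())
```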
-----
💡 Key Insights:
→ Only a small portion of output features significantly impact model accuracy
→ Different layers have varying importance to final model output
→ Mixed-precision between output features enables better parallel computation
→ System efficiency benefits more from reducing weight precision than activation precision
-----
📊 Results:
→ With only ~10% more bits, MixLLM cuts the perplexity increase from 0.5 down to 0.2 for Llama 3.1 70B (quick arithmetic below)
→ Improves MMLU-Pro score by 0.93 over SOTA across three popular models
→ Achieves 1.9x to 2.75x speedup over float16 baseline
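A quick sanity check on the "10% more bits" number, assuming it comes from keeping roughly 10% of output features in 8-bit and the rest in 4-bit (my reading, not a quote from the paper):

```python
# Average bits per weight if ~10% of output channels are 8-bit and 90% are 4-bit
# (group-wise scales / zero points add a little on top; ignored here).
avg_bits = 0.9 * 4 + 0.1 * 8     # = 4.4 bits
overhead = avg_bits / 4 - 1      # = 0.10 -> roughly "10% more bits" than pure 4-bit
print(avg_bits, overhead)
```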
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/