Squeezing more performance from 1-bit LLMs by pampering critical weight channels.
Channel-Relaxed Vector Quantization (CRVQ) improves extreme compression of LLMs by applying multi-codebook fitting to critical weight channels, achieving near-lossless 1-bit quantization with minimal overhead.
-----
https://arxiv.org/abs/2412.09282
🔍 Original Problem:
Existing post-training quantization (PTQ) methods struggle to maintain performance when compressing LLMs to extremely low bit-widths, especially at the 1-bit level.
-----
💡 Solution in this Paper:
→ CRVQ introduces a channel-relaxed approach to vector quantization (VQ) for compressing LLM weights.
→ It identifies and reorders a small subset of critical weight channels based on their importance.
→ The method applies multi-codebook fitting specifically to these critical channels.
→ CRVQ uses one basic codebook for all channels and additional extended codebooks for critical ones.
→ The approach lets the bit-width/accuracy trade-off be tuned flexibly (a minimal sketch of the pipeline follows this list).
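The post includes no code, so here is a minimal sketch of the idea under stated assumptions: channel importance is approximated by squared channel norm, codebooks are fit with plain k-means, and the extended codebooks are a simple residual VQ applied only to the critical channels. The paper's actual importance metric, codebook learning, and channel reordering differ; names like `crvq_quantize`, `group_dim`, and `codebook_size` are illustrative, not from the paper.

```python
# Sketch of the CRVQ idea (assumptions: norm-based importance, k-means codebooks,
# residual VQ for the extended codebooks). Not the authors' implementation.
import numpy as np
from sklearn.cluster import KMeans

def vq_fit(vectors, codebook_size):
    """Fit one codebook over d-dim sub-vectors; return (codebook, codes)."""
    km = KMeans(n_clusters=codebook_size, n_init=4, random_state=0).fit(vectors)
    return km.cluster_centers_, km.labels_

def crvq_quantize(W, group_dim=8, codebook_size=256,
                  critical_frac=0.02, n_extended=3):
    """W: (out_channels, in_features) weight matrix of one linear layer."""
    out_ch, in_feat = W.shape

    # 1) Rank channels by an importance proxy and keep the top ~2% as critical.
    importance = (W ** 2).sum(axis=1)          # assumption: squared-norm proxy
    n_crit = max(1, int(critical_frac * out_ch))
    critical = np.argsort(importance)[-n_crit:]

    # 2) Basic codebook: quantize every channel's sub-vectors once.
    subvecs = W.reshape(out_ch, in_feat // group_dim, group_dim)
    base_cb, base_codes = vq_fit(subvecs.reshape(-1, group_dim), codebook_size)
    W_hat = base_cb[base_codes].reshape(W.shape)

    # 3) Extended codebooks: residual VQ applied only to the critical channels.
    residual = W[critical] - W_hat[critical]
    ext_cbs, ext_codes = [], []
    for _ in range(n_extended):
        cb, codes = vq_fit(residual.reshape(-1, group_dim), codebook_size)
        approx = cb[codes].reshape(residual.shape)
        W_hat[critical] += approx
        residual -= approx
        ext_cbs.append(cb)
        ext_codes.append(codes)

    return W_hat, (base_cb, base_codes), (ext_cbs, ext_codes, critical)
```

Because only the small critical subset passes through the extra codebooks, the added storage and decoding work stay marginal relative to the base quantization.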
-----
🔑 Key Insights from this Paper:
→ A small subset of critical channels plays a pivotal role in maintaining model performance
→ Treating roughly 2% of channels as critical and fitting them with 3 extended codebooks gives the best accuracy–overhead trade-off (bit-budget arithmetic sketched after this list)
→ CRVQ can be integrated with stronger base VQ settings for further improvements
→ The method has negligible impact on inference speed compared to single-codebook approaches
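To see why the extended codebooks are cheap, here is back-of-the-envelope bit-budget arithmetic. Only the 2%-of-channels and 3-extended-codebook figures come from the post; the group size and codebook size are assumed values, and codebook storage itself is ignored.

```python
# Illustrative bit budget under assumed parameters (not the paper's exact config):
# a 256-entry codebook over 8-dim sub-vectors costs log2(256)/8 = 1 bit per weight.
import math

group_dim     = 8      # weights per sub-vector (assumed)
codebook_size = 256    # entries per codebook (assumed)
critical_frac = 0.02   # fraction of channels treated as critical (from the post)
n_extended    = 3      # extra codebooks on critical channels (from the post)

bits_per_codebook = math.log2(codebook_size) / group_dim          # 1.0 bit/weight
base_bits  = bits_per_codebook                                    # all channels
extra_bits = n_extended * bits_per_codebook * critical_frac       # critical only

print(f"base: {base_bits:.2f} b/w, extra: {extra_bits:.2f} b/w, "
      f"total ≈ {base_bits + extra_bits:.2f} b/w")
# -> base: 1.00 b/w, extra: 0.06 b/w, total ≈ 1.06 b/w
```

Under these assumptions the extended codebooks add only about 0.06 bits per weight on average, which is consistent with the claimed negligible overhead.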
-----
📊 Results:
→ 38.9% reduction in perplexity compared to previous sub-2-bit PTQ baselines
→ 12.9% improvement in zero-shot accuracy on LLaMA2-7B
→ Outperforms QAT methods like OneBit in efficiency while matching performance
→ Generalizes well across different model sizes (125M to 13B) and architectures