SKIM compresses LLMs to any bit-width while maintaining model performance through smart bit allocation.
This paper introduces SKIM (Scaled K-means clustering wIth Mixed precision), a novel quantization method that enables any-bit compression of LLM weights while minimizing performance loss. SKIM combines greedy bit allocation with a trainable scaling vector to achieve better accuracy at a given compression ratio than existing methods.
-----
https://arxiv.org/abs/2412.04180
🤔 Original Problem:
LLMs require massive GPU memory for inference, often exceeding hardware capabilities. Quantization can cut memory needs, but current methods suffer significant performance drops at low precision and support only a limited set of bit-width options.
-----
🔧 Solution in this Paper:
→ SKIM introduces a greedy algorithm that allocates bits across weight channels according to their quantization errors (a sketch follows this list)
→ It employs a trainable scaling vector to regularize variation across columns during K-means clustering
→ The method can adapt to any target bit-width, including non-integer values, providing flexible compression options
→ SKIM combines layer-wise and sensitivity-based objectives to guide the quantization process
→ The implementation uses parallel execution and shared memory to maintain computational efficiency
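To make the bit-allocation step concrete, here is a minimal sketch of one plausible greedy scheme: start every channel at the lowest precision, then repeatedly grant one extra bit to the channel whose k-means quantization error would drop the most. The helper names (`kmeans_quant_error`, `greedy_bit_allocation`) and the plain, unscaled 1-D k-means inside are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kmeans_quant_error(channel, bits, iters=10):
    """Squared error of quantizing one weight channel with 2**bits
    k-means centroids (plain 1-D Lloyd's algorithm, quantile init)."""
    k = 2 ** bits
    centroids = np.quantile(channel, np.linspace(0, 1, k))
    for _ in range(iters):
        assign = np.abs(channel[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = channel[assign == j].mean()
    return float(((channel - centroids[assign]) ** 2).sum())

def greedy_bit_allocation(weight, avg_bits):
    """Spend a budget of avg_bits * n_channels bits one at a time,
    always on the channel whose error would shrink the most."""
    n = weight.shape[0]
    bits = np.ones(n, dtype=int)                      # every channel starts at 1 bit
    err = np.array([kmeans_quant_error(weight[i], 1) for i in range(n)])
    gain = err - np.array([kmeans_quant_error(weight[i], 2) for i in range(n)])
    for _ in range(int(round(avg_bits * n)) - n):
        i = int(gain.argmax())                        # biggest error reduction wins
        bits[i] += 1
        err[i] -= gain[i]
        gain[i] = err[i] - kmeans_quant_error(weight[i], bits[i] + 1)
    return bits                                       # per-channel bit-widths

# Example: 3-bit average over the 8 channels of a random 8x64 weight matrix
rng = np.random.default_rng(0)
print(greedy_bit_allocation(rng.normal(size=(8, 64)), avg_bits=3.0))
```

Because the budget is `avg_bits * n_channels`, non-integer targets like 3.3 bits fall out naturally, which is how a mixed-precision scheme can hit arbitrary average bit-widths.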
-----
💡 Key Insights:
→ Different weight channels exhibit varying quantization sensitivity and error patterns
→ Mixing precision levels across channels is more effective than uniform quantization
→ Non-differentiable K-means clustering can be optimized through an iterative, alternating training scheme (see the sketch after this list)
→ A single iteration is sufficient for convergence in most cases
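The trainable scaling vector ties into this last insight: the K-means assignments are frozen while the scale is fit by gradient descent, sidestepping the non-differentiability. Below is a minimal PyTorch sketch under stated assumptions: a per-column scale `s`, a crude quantile-based 1-D k-means, and the plain layer-wise objective (the paper's sensitivity-based weighting is omitted). `scaled_kmeans_step` is a hypothetical helper, not the authors' code.

```python
import torch

def scaled_kmeans_step(W, s, bits, iters=50, lr=1e-2):
    """One round of the alternating scheme: cluster the column-scaled
    weights, freeze the assignments, then take gradient steps on the
    scaling vector s against the layer-wise reconstruction error."""
    k = 2 ** bits
    Ws = W / s                                   # scaling tames column-to-column variation
    # Crude per-row 1-D k-means: quantile centroids, one assignment pass
    centroids = torch.quantile(Ws, torch.linspace(0, 1, k), dim=1).T   # (rows, k)
    assign = (Ws.unsqueeze(2) - centroids.unsqueeze(1)).abs().argmin(dim=2)
    Q = torch.gather(centroids, 1, assign)       # quantized weights, still in scaled space
    s = s.detach().clone().requires_grad_(True)  # clustering is frozen; only s trains
    opt = torch.optim.Adam([s], lr=lr)
    for _ in range(iters):
        loss = ((Q * s - W) ** 2).sum()          # layer-wise objective, no sensitivity term
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (Q * s).detach(), s.detach()

# Example: quantize a 16x32 layer to 3 bits with a per-column scale
W = torch.randn(16, 32)
s0 = W.abs().mean(dim=0)                         # simple initial per-column scale
W_hat, s = scaled_kmeans_step(W, s0, bits=3)
print((W_hat - W).pow(2).mean())                 # reconstruction MSE
```

In principle the re-clustering and re-fitting of `s` could alternate for several rounds, but per the insight above a single round typically suffices.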
-----
📊 Results:
→ Reduces the perplexity gap between 3-bit and full-precision LLaMA models by 16.3%
→ Requires only 8GB peak memory for quantizing LLaMA-7B