QTIP (Quantization with Trellises and Incoherence Processing) enables ultra-high-dimensional quantization for LLMs through efficient trellis coding.
Trellis-coded weight compression turns vector quantization's exponential cost in dimension into a linear one.
With trellis quantization, 2-bit models can scale better than theoretically optimal 4-bit models.
📚 https://arxiv.org/abs/2406.11235
🎯 Original Problem:
Current post-training quantization methods for LLMs compress weights with vector quantization (VQ), but VQ codebooks grow exponentially with dimension, limiting practical VQ to dimension ≤8 and constraining both quantization quality and inference speed.
-----
🔧 Solution in this Paper:
→ Introduces QTIP (Quantization with Trellises and Incoherence Processing), which uses trellis-coded quantization (TCQ) to enable ultra-high-dimensional (>100) quantization
→ Implements a hardware-efficient "bitshift trellis" whose state transitions are plain bit shifts, so no trellis structure needs to be stored and decoding parallelizes easily (see the first sketch after this list)
→ Introduces novel compute-based Gaussian codes that reconstruct each weight in only 2-4 instructions, avoiding codebook lookups
→ Applies incoherence processing via a random Hadamard transform so the weights become approximately i.i.d. Gaussian (see the second sketch after this list)
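
A minimal NumPy sketch of the bitshift-trellis decode path, assuming 2 bits stored per weight, a 16-bit trellis state, and a hypothetical hash-based Gaussian code; it illustrates the idea rather than the paper's actual kernels or code constructions:

```python
import numpy as np

def decode_bitshift_trellis(bitstream: np.ndarray, k: int = 2, L: int = 16) -> np.ndarray:
    """Decode weights from a bitshift-trellis bitstream.

    The state for weight t is simply bits [t*k, t*k + L) of the stream;
    advancing one weight shifts the window by k bits, so no transition
    table is stored and any position can be decoded independently.
    """
    n = (len(bitstream) - L) // k + 1
    out = np.empty(n, dtype=np.float32)
    for t in range(n):
        window = bitstream[t * k : t * k + L]
        state = int("".join(str(b) for b in window), 2)
        out[t] = gaussian_code(state)
    return out

def gaussian_code(state: int) -> float:
    """Map an L-bit state to a roughly Gaussian value with a few integer ops
    (multiply, shift, add); a stand-in for QTIP's compute-based codes, not
    the exact constructions from the paper."""
    x = (state * 0x9E3779B1) & 0xFFFFFFFF               # cheap multiplicative hash
    parts = [(x >> s) & 0xFF for s in (0, 8, 16, 24)]   # four near-uniform bytes
    # The sum of four uniform bytes is approximately Gaussian by the CLT.
    return (sum(parts) - 4 * 127.5) / float(np.sqrt(4 * (256**2 - 1) / 12))

# Example: decode a random 1024-bit stream into (1024 - 16) // 2 + 1 = 505 weights.
rng = np.random.default_rng(0)
weights = decode_bitshift_trellis(rng.integers(0, 2, size=1024))
```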
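
And a sketch of the incoherence-processing step with a random Hadamard transform; real implementations use a fast O(n log n) Hadamard transform rather than the dense matrix built here, and the weight matrix dimensions are assumed to be powers of two:

```python
import numpy as np
from scipy.linalg import hadamard

def random_hadamard(n: int, rng: np.random.Generator) -> np.ndarray:
    """Orthogonal random Hadamard transform: H @ diag(s) / sqrt(n), with random signs s."""
    s = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) * s / np.sqrt(n)

def incoherence_process(W: np.ndarray, seed: int = 0) -> np.ndarray:
    """Rotate W on both sides so its entries look approximately i.i.d. Gaussian.
    The rotations are orthogonal, so they can be undone (or fused into the
    surrounding matmuls) at inference time."""
    m, n = W.shape  # both must be powers of two for scipy's dense Hadamard
    rng = np.random.default_rng(seed)
    U, V = random_hadamard(m, rng), random_hadamard(n, rng)
    return U @ W @ V.T
```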
-----
💡 Key Insights:
→ TCQ's cost is linear in the quantization dimension while VQ's cost is exponential (a quick comparison follows this list)
→ High dimensional quantization significantly improves compression quality
→ Hardware-efficient implementation is crucial for practical deployment
→ Compute-based codes can match lookup-based codes in quality while being faster
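
To make the first insight concrete, here is a back-of-the-envelope comparison, assuming 2 bits per weight and a 16-bit trellis state: a VQ codebook must enumerate every codeword for its dimension, while the bitshift trellis keeps a fixed number of states no matter how many weights it spans.

```python
bits_per_weight, L = 2, 16

# Bitshift trellis: 2**L states regardless of how many weights it quantizes.
print(f"trellis states (any dimension): {2**L:,}")

# Vector quantization: the codebook needs 2**(bits * dim) entries.
for dim in (2, 4, 8, 16, 64, 256):
    print(f"VQ at dim={dim:>3}: {2**(bits_per_weight * dim):.2e} codebook entries")
```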
-----
📊 Results:
→ Achieves better quantization quality than state-of-the-art VQ methods across all tested bitrates
→ QTIP's 2-bit models scale better than theoretically optimal 4-bit models
→ Matches or exceeds inference speed of fastest existing methods while providing higher quality
→ Enables matrix-vector multiplication to run at over 80% of peak memory bandwidth on modern GPUs