Bigger models = more low precision, less memory, same performance, as proposed in this paper.
Mixed-precision quantization scales with model size: larger LLMs need exponentially fewer high-precision parameters to maintain performance.
https://arxiv.org/abs/2410.06722
🎯 Original Problem:
As LLMs grow larger, quantization becomes crucial for efficient deployment, yet it is unclear how many parameters must be kept in high precision to maintain performance as models scale up.
-----
🔧 Solution in this Paper:
→ Introduced "quantization ratio" metric to measure the proportion of parameters using low-precision arithmetic
→ Conducted experiments across Qwen, LLaMA-2, and Gemma-2 models ranging from 0.5B to 70B parameters
→ Used MXINT4 as the low-precision format and BF16 as the high-precision format
→ Tested at different granularities (layer-wise and matmul-wise)
→ Employed random search with 50 trials and 1024 subsamples per iteration (see the search-loop sketch below)
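A minimal sketch of what the quantization-ratio metric could look like, assuming it is simply the fraction of parameters assigned to the low-precision format; the helper name, layer names, and parameter counts below are hypothetical illustrations, not the paper's code.

```python
# Hypothetical sketch: quantization ratio as the fraction of parameters
# assigned to the low-precision format (MXINT4 here), given a layer-wise
# precision assignment. Names and values are illustrative only.
from typing import Dict

def quantization_ratio(param_counts: Dict[str, int],
                       precision_config: Dict[str, str]) -> float:
    """param_counts: layer name -> number of parameters.
    precision_config: layer name -> "mxint4" or "bf16"."""
    total = sum(param_counts.values())
    low_precision = sum(n for name, n in param_counts.items()
                        if precision_config[name] == "mxint4")
    return low_precision / total

# Toy example: a 3-layer slice with two layers kept in MXINT4.
counts = {"attn.q_proj": 4_194_304, "attn.k_proj": 4_194_304, "mlp.up_proj": 11_008_000}
config = {"attn.q_proj": "mxint4", "attn.k_proj": "bf16", "mlp.up_proj": "mxint4"}
print(f"quantization ratio: {quantization_ratio(counts, config):.3f}")
```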
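And a sketch of how the random search over precision assignments might proceed, assuming each trial samples a layer-wise MXINT4/BF16 assignment, checks a performance constraint on a small calibration subset, and keeps the assignment with the highest quantization ratio that passes. The evaluate_perplexity callable and the 1% tolerance are placeholders, not the paper's actual procedure.

```python
# Hypothetical sketch of the random-search loop: 50 trials, each sampling a
# layer-wise MXINT4/BF16 assignment and checking a performance constraint.
# evaluate_perplexity() and the tolerance are placeholders.
import random
from typing import Callable, Dict

def random_search(param_counts: Dict[str, int],
                  evaluate_perplexity: Callable[[Dict[str, str]], float],
                  baseline_ppl: float,
                  tolerance: float = 0.01,
                  trials: int = 50,
                  seed: int = 0):
    rng = random.Random(seed)
    total = sum(param_counts.values())
    best_config, best_ratio = None, -1.0
    for _ in range(trials):
        # Sample a target low-precision probability, then a layer-wise assignment.
        p_low = rng.random()
        config = {name: ("mxint4" if rng.random() < p_low else "bf16")
                  for name in param_counts}
        ratio = sum(n for name, n in param_counts.items()
                    if config[name] == "mxint4") / total
        # Keep the assignment with the most low-precision parameters that
        # still stays within the performance constraint.
        if ratio > best_ratio and evaluate_perplexity(config) <= baseline_ppl * (1 + tolerance):
            best_config, best_ratio = config, ratio
    return best_config, best_ratio
```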
-----
💡 Key Insights:
→ Larger models can handle higher quantization ratios while maintaining performance
→ Finer granularity in quantization allows for higher quantization ratios
→ The relationship follows exponential scaling: larger models need exponentially fewer high-precision components (see the formula sketch after this list)
→ Hardware designs should prioritize support for fine-grained mixed precision operations
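One illustrative way to write the trend above (an assumed parameterization for intuition; the paper's exact fit, including how model size N is normalized, may differ): the fraction of parameters that must stay in high precision decays exponentially with model size.

```latex
% Illustrative parameterization (assumption, not the paper's verbatim formula):
% N = model size, Q(N) = highest quantization ratio meeting a performance
% constraint, k < 0 and b are fitted constants.
\[
  1 - Q(N) \approx e^{\,kN + b}
  \quad\Longleftrightarrow\quad
  \log\bigl(1 - Q(N)\bigr) \approx kN + b
\]
% The high-precision fraction 1 - Q(N) shrinks exponentially as N grows,
% so larger models need exponentially fewer BF16 parameters.
```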
-----
📊 Results:
→ Achieved exponential reduction in high-precision components needed as model size increases
→ Demonstrated consistent scaling across different performance constraints (k values: -11.68, -11.27, -12.84; see the fitting sketch below)
→ Validated across multiple model families and arithmetic types (MXINT4, FP4)
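A minimal sketch of how a constant like k could be obtained, assuming a log-linear least-squares fit of the high-precision fraction against model size; the function takes measurements supplied by the caller (no values from the paper are hardcoded), and the paper's actual fitting procedure may differ.

```python
# Hypothetical sketch: fitting a decay constant k by least squares on the
# log of the high-precision fraction. Inputs are measurements supplied by
# the caller; nothing from the paper is hardcoded here.
import numpy as np

def fit_decay_constant(model_sizes, max_quant_ratios):
    """model_sizes: model sizes (e.g., parameter counts in billions).
    max_quant_ratios: largest quantization ratio each model tolerates under
    a fixed performance constraint.
    Returns (k, b) for log(1 - ratio) ~= k * size + b."""
    sizes = np.asarray(model_sizes, dtype=float)
    hp_fraction = 1.0 - np.asarray(max_quant_ratios, dtype=float)
    k, b = np.polyfit(sizes, np.log(hp_fraction), deg=1)
    return k, b
```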