I've seen this in many cases: a larger LLM quantized down to roughly the memory footprint of a smaller FP16 LLM generally performs better across most benchmarks.
Now this paper confirms it again.
📚 https://arxiv.org/abs/2409.11055
Key Insights from this Paper 💡:
• Quantized larger LLMs generally outperform smaller FP16 models
• Performance varies with quantization method, model size, and bit-width
• Task difficulty doesn't significantly impact quantization-induced accuracy loss
• MT-Bench has limited discriminatory power for high-performing LLMs
-----
What this Paper does 🛠️:
• Evaluates quantized LLMs (7B-405B) across 13 benchmarks and 6 task types
• Applies GPTQ, AWQ, SmoothQuant, and FP8 quantization methods
• Uses a multi-node cluster with vLLM and Hugging Face Accelerate for efficient evaluation (see the sketch after this list)
• Focuses on instruction-tuned models and recent datasets to minimize data contamination
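For context, here is a minimal sketch of how a quantized model can be served with vLLM for this kind of benchmark run. The checkpoint name, quantization setting, and prompt are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: serving a 4-bit AWQ-quantized Llama with vLLM for evaluation.
# The model ID, quantization choice, and prompt are illustrative assumptions.
from vllm import LLM, SamplingParams

# Load an assumed pre-quantized AWQ checkpoint and enable vLLM's AWQ kernels.
llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",  # assumed 4-bit AWQ checkpoint
    quantization="awq",
    dtype="float16",
)

# Greedy decoding keeps benchmark scoring deterministic.
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["Explain GPTQ vs AWQ in one sentence."], params)
print(outputs[0].outputs[0].text)
```

The same pattern scales out to multi-GPU / multi-node setups by adding tensor parallelism, which is how large quantized models like a 405B variant are typically evaluated in practice.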
-----
Results 📊:
• 4-bit Llama-2-13B outperforms FP16 Llama-2-7B by 4.66% on OpenLLM Leaderboard-v1
• 4-bit Llama-3.1-405B surpasses FP16 Llama-3.1-70B by 3.15% on OpenLLM Leaderboard-v1
• AWQ consistently outperforms GPTQ across various LLMs
• SmoothQuant causes significant accuracy drop (-9.23%) in Llama-3.1-405B