I've seen this in many cases: a larger LLM quantized down to roughly the memory footprint of a smaller FP16 LLM generally performs better across most benchmarks.
Now this paper confirms it again.
📚 https://arxiv.org/abs/2409.11055
Key Insights from this Paper 💡:
• Quantized larger LLMs generally outperform smaller FP16 models
• Performance varies with quantization method, model size, and bit-width
• Task difficulty doesn't significantly impact quantization-induced accuracy loss
• MT-Bench has limited discriminatory power for high-performing LLMs
-----
What this Paper does 🛠️:
• Evaluates quantized LLMs (7B-405B) across 13 benchmarks and 6 task types
• Applies GPTQ, AWQ, SmoothQuant, and FP8 quantization methods
• Uses a multi-node cluster with vLLM and Hugging Face Accelerate for efficient evaluation (see the sketch after this list)
• Focuses on instruction-tuned models and recent datasets to minimize data contamination
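For context, here is a minimal sketch of how a quantized model can be served with vLLM for this kind of benchmark run. The checkpoint name, quantization setting, and prompt are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: serving a 4-bit AWQ-quantized Llama with vLLM for evaluation.
# The model ID, quantization choice, and prompt are illustrative assumptions.
from vllm import LLM, SamplingParams

# Load an assumed pre-quantized AWQ checkpoint and enable vLLM's AWQ kernels.
llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",  # assumed 4-bit AWQ checkpoint
    quantization="awq",
    dtype="float16",
)

# Greedy decoding keeps benchmark scoring deterministic.
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["Explain GPTQ vs AWQ in one sentence."], params)
print(outputs[0].outputs[0].text)
```

The same pattern scales out to multi-GPU / multi-node setups by adding tensor parallelism, which is how large quantized models like a 405B variant are typically evaluated in practice.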
-----
Results 📊:
• 4-bit Llama-2-13B outperforms FP16 Llama-2-7B by 4.66% on OpenLLM Leaderboard-v1
• 4-bit Llama-3.1-405B surpasses FP16 Llama-3.1-70B by 3.15% on OpenLLM Leaderboard-v1
• AWQ consistently outperforms GPTQ across various LLMs
• SmoothQuant causes significant accuracy drop (-9.23%) in Llama-3.1-405B