"Qrazor: Reliable and effortless 4-bit llm quantization by significant data razoring"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.13331
Deploying LLMs is hard because of their high memory and compute demands, and aggressive 4-bit quantization often degrades accuracy while demanding significant tuning effort. This paper introduces QRazor to make 4-bit quantization of LLMs reliable and effortless.
QRazor is a two-stage quantization method: it first quantizes to wider integer formats, then compresses to 4-bit using Significant Data Razoring (SDR). This maintains accuracy while simplifying deployment.
-----
📌 QRazor uses a two-stage quantization scheme. It first quantizes to 8-bit and 16-bit integers with absolute-max scaling to preserve outlier data, before Significant Data Razoring compresses everything to 4-bit.
📌 A decompression-free arithmetic unit is the key hardware innovation. A dedicated unit computes directly on the 4-bit integers, avoiding decompression overhead and boosting throughput.
📌 Significant Data Razoring adapts bit truncation per group by detecting the leading '1'. This efficient bit-selection rule preserves the salient data during 4-bit compression.
----------
Methods Explored in this Paper 🔧:
→ QRazor employs a two-stage approach: quantization and compression.
→ In the quantization stage, weights and the KV cache are quantized to 8-bit integers and activations to 16-bit integers, using absolute-max scaling to keep accuracy close to the full-precision model.
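A minimal NumPy sketch of symmetric absolute-max scaling, assuming per-tensor scales; the function name, shapes, and per-tensor granularity are illustrative, not the paper's implementation:

```python
import numpy as np

def absmax_quantize(x: np.ndarray, n_bits: int):
    """Symmetric absolute-max scaling to a signed n-bit integer grid."""
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 127 for 8-bit, 32767 for 16-bit
    scale = np.abs(x).max() / qmax            # one scale per tensor (or per group)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

# First stage, as described above: weights/KV cache to 8-bit, activations to 16-bit
w_q, w_scale = absmax_quantize(np.random.randn(4096), n_bits=8)
a_q, a_scale = absmax_quantize(np.random.randn(4096), n_bits=16)
```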
→ In the compression stage, Significant Data Razoring (SDR) compresses all data to 4-bit. SDR retains the four most significant bits and discards the rest via bitwise operations, truncation, and rounding.
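Continuing the NumPy sketch above, here is one plausible reading of SDR: a shared 4-bit window per group, anchored at the leading '1' of the largest magnitude. Group size, rounding mode, and the sign-magnitude handling here are assumptions for illustration, not the paper's exact algorithm:

```python
def sdr_compress(q: np.ndarray, keep_bits: int = 4):
    """Razor a group of integers down to their `keep_bits` most significant bits.

    The group's leading '1' fixes a shared shift; lower bits are dropped
    with round-to-nearest. Sign is handled separately in this sketch.
    """
    mag = np.abs(q).astype(np.int64)
    msb = int(mag.max()).bit_length()                    # leading-'1' position for the group
    shift = max(msb - keep_bits, 0)                      # number of low bits to razor off
    rounded = (mag + ((1 << shift) >> 1)) >> shift       # truncate with rounding
    rounded = np.minimum(rounded, (1 << keep_bits) - 1)  # clamp round-up overflow
    return np.sign(q) * rounded, shift

q8, scale = absmax_quantize(np.array([0.7, -0.03, 0.2, 0.05]), n_bits=8)
q4, shift = sdr_compress(q8)   # 4-bit magnitudes plus one shared shift per group
```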
→ QRazor also introduces a dedicated integer-based arithmetic unit that computes directly on the 4-bit compressed data, avoiding decompression and improving efficiency.
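The key point is that the shared group shifts factor out of a dot product, so the multiply-accumulate array only ever touches narrow integers. A hypothetical software analogue of the hardware unit, reusing the sketch above:

```python
def razored_dot(w_q: np.ndarray, w_shift: int, a_q: np.ndarray, a_shift: int) -> int:
    """Dot product directly on SDR-compressed operands, with no decompression.

    Every multiply sees only narrow 4-bit magnitudes; the two shared
    group shifts factor out and are applied once to the final sum.
    """
    acc = int(np.dot(w_q.astype(np.int64), a_q.astype(np.int64)))
    return acc << (w_shift + a_shift)   # apply shared shifts once, at the end
```

In hardware terms, this is why the multipliers and adders can be sized for 4-bit operands, which is where the area and power savings reported below come from.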
-----
Key Insights 💡:
→ Using 8-bit and 16-bit integers as the base precision in the quantization stage captures the data distribution, including outliers, which is crucial for maintaining accuracy (see the sketch after this list).
→ Significant Data Razoring efficiently compresses data to 4-bit by preserving only the most salient bits, minimizing information loss and enabling reliable low-bit quantization.
→ A decompression-free arithmetic unit tailored for QRazor significantly enhances hardware efficiency by directly processing compressed 4-bit data.
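To make the first insight concrete, a toy comparison reusing the absmax_quantize sketch above (values chosen for illustration): direct 4-bit absolute-max scaling collapses small values next to an outlier, while the 8-bit base grid keeps them distinguishable for SDR to work with.

```python
x = np.array([0.02, -0.05, 0.1, 3.2])     # one outlier dominates the range
q4, _ = absmax_quantize(x, n_bits=4)      # direct 4-bit quantization
q8, _ = absmax_quantize(x, n_bits=8)      # 8-bit base precision (stage one)
print(q4)   # [0 0 0 7]     -> small values collapse to zero
print(q8)   # [1 -2 4 127]  -> structure preserved for the compression stage
```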
-----
Results 📊:
→ QRazor outperforms QLLM on LLaMA-1-7B and LLaMA-1-13B by over 10% in accuracy.
→ On LLaMA-2 models, QRazor matches the accuracy of QuaRot with GPTQ weights and outperforms the QuaRot RTN baseline.
→ The decompression-free arithmetic unit achieves a 61.2% reduction in area and a 57.8% reduction in power consumption.