Quantization Methods for Large Language Models: GPTQ, AWQ, bitsandbytes, HQQ, and AutoRound
Table of Contents
Quantization Methods for Large Language Models: GPTQ, AWQ, bitsandbytes, HQQ, and AutoRound
1. Mathematical Formulations and Implementation Details
GPTQ: Grouped Quantization with Error Compensation
AWQ: Activation-Aware Weight Quantization
bitsandbytes: LLM.int8() and 4-bit NF4 Quantization
HQQ: Half-Quadratic Quantization
AutoRound: Learned Rounding and Min-Max Optimization
2. Practical Insights and Trade-offs
3. Performance Benchmarks
4. Industry Applications and Adoption
5. PyTorch Implementation Examples
Using bitsandbytes 8-bit/4-bit in PyTorch
Using GPTQ (AutoGPTQ) in PyTorch
Using AWQ in PyTorch
Using HQQ in PyTorch
Using AutoRound in PyTorch (Intel Extension)
Verification and Usage
Large Language Models (LLMs) can be drastically compressed via quantization, reducing precision (e.g., 16-bit to 4-bit) to shrink memory and speed up inference (Accelerating LLM Inference with GemLite, TorchAO and SGLang | PyTorch). We analyze five cutting-edge quantization methods – GPTQ, AWQ, bitsandbytes, HQQ, and AutoRound – comparing their mathematical foundations, trade-offs, performance, and industry adoption, and providing PyTorch implementation examples. All five are weight-only quantization techniques (activations typically remain in higher precision, e.g. 16-bit), targeting minimal accuracy loss while maximizing efficiency.
1. Mathematical Formulations and Implementation Details
GPTQ: Grouped Quantization with Error Compensation
GPTQ is a post-training quantization (PTQ) algorithm that quantizes model weights layer by layer, using calibration data to preserve output accuracy (What is GPTQ Quantization for LLMs? — Picovoice). Formally, GPTQ seeks quantized weights W_q for each layer that minimize the layer's output error ||f(W) − f(W_q)|| on sample inputs, where f(·) is the layer's forward function. It employs a greedy, column-wise quantization strategy: weights are quantized one group at a time (often one column or a small group of neurons), and after quantizing each group, GPTQ updates the remaining weights to compensate for the induced error. This error compensation uses second-order information (approximate Hessians) to adjust for quantization loss (GPTQT: Quantize Large Language Models Twice to Push the Efficiency). In practice, GPTQ fixes the quantization scale (e.g. the min–max range for 4-bit) and zero-point per group in advance, then solves an optimization problem for each weight group. The procedure iterates over the columns of a weight matrix: for each column vector w, find the quantized version q (an int4/int8 vector) that minimizes ||Xw − Xq||² for sample inputs X (where Xw is the column's contribution to the layer output). The remaining full-precision weights are adjusted (the residual error is carried over) before quantizing the next column. This method, inspired by Optimal Brain Quantization, ensures that each quantized column induces minimal output drift by greedily absorbing error into the not-yet-quantized parts. GPTQ typically uses uniform quantization (a linear mapping to an integer scale) with a small group size (e.g. 128 or per-column) and may use "act-order", i.e. quantizing columns in order of decreasing activation variance to reduce error. Implementations of GPTQ (e.g. AutoGPTQ) produce a quantized model file (often in a custom format) that can be loaded for fast int4/int8 inference.
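To make the column-wise procedure concrete, the following is a minimal, un-optimized sketch of the core loop in PyTorch. It is our illustration, not the AutoGPTQ implementation: it uses one symmetric scale per output row instead of per-group scales, and a plain damped Hessian inverse instead of the blocked Cholesky updates real implementations use.
import torch
def gptq_quantize_layer(W, X, bits=4, damp=0.01):
    # W: (out_features, in_features) FP weights; X: (n_samples, in_features) calibration inputs
    H = X.t() @ X                                            # Hessian approximation of the layer-output error
    H += damp * H.diagonal().mean() * torch.eye(H.shape[0])  # damping for numerical stability
    Hinv = torch.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    scale = (W.abs().amax(dim=1) / qmax).clamp(min=1e-8)     # per-row symmetric scale (real GPTQ: per group)
    Wq = W.clone()
    for j in range(W.shape[1]):                              # quantize one input column at a time
        w = Wq[:, j]
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        err = (w - q) / Hinv[j, j]
        Wq[:, j] = q
        if j + 1 < W.shape[1]:                               # error compensation: push residual onto later columns
            Wq[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return Wq                                                # dequantized approximation of W
Production implementations add per-group scales, lazy batched updates, and optional activation ordering on top of this loop, but the compensation rule is the same.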
AWQ: Activation-Aware Weight Quantization
AWQ is a weight-only PTQ method that introduces an activation-aware strategy (AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration). The key insight is that not all weights are equally important – a small fraction of "salient" weight channels have an outsized impact on model outputs. Instead of quantizing all weights uniformly, AWQ identifies these critical weight channels using activation distributions (sensitivity is determined by how weight perturbations affect activations). Typically, a tiny calibration set is fed through the model to collect activation statistics, and channels whose activation ranges are large are marked salient. AWQ then avoids mixed precision (which would hurt hardware efficiency) by scaling up the salient weight channels before quantization. In effect, AWQ multiplies those weight values by a factor so that, after standard 4-bit quantization, the quantization error for those channels is reduced (scaling spreads their values out relative to the quantization levels). Mathematically, if w_i is a salient weight channel, AWQ applies an equivalent transformation w_i' = α · w_i (with some scaling factor α) prior to quantization, and correspondingly down-scales that channel's input activation at inference (typically folded into the preceding operation) to compensate. The scale α is derived analytically to minimize quantization error, using the collected activation statistics. By protecting ~1% of weights in this way, AWQ achieves near full-precision accuracy even at 4-bit. Importantly, AWQ does no backpropagation or per-layer fine-tuning; it is a quick calibration procedure. The result is a uniformly 4-bit quantized model (denoted W4A16 – weights 4-bit, activations 16-bit) that is hardware-friendly but still preserves the contribution of critical weights. Implementations like AutoAWQ (MIT Han Lab) output quantized weights (often as int4 in a custom format, or standard int4 with extra scale metadata).
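The channel-scaling trick can be sketched in a few lines. This is an illustrative simplification, not the llm-awq code: the real method searches the exponent alpha on a grid against measured output error, and the 1/s factor is fused into the preceding operation rather than folded back into the weights.
import torch
def awq_scale_and_quantize(W, X, bits=4, alpha=0.5):
    # W: (out_features, in_features) weights; X: (n_samples, in_features) calibration activations
    act_mag = X.abs().mean(dim=0)                     # per-input-channel activation magnitude (saliency proxy)
    s = act_mag.clamp(min=1e-5) ** alpha              # larger scale for more salient channels
    s = s / (s.max() * s.min()).sqrt()                # normalize scales around 1
    qmax = 2 ** (bits - 1) - 1
    W_scaled = W * s                                  # scale salient input channels up before quantization
    q_scale = W_scaled.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    W_q = torch.clamp(torch.round(W_scaled / q_scale), -qmax - 1, qmax) * q_scale
    return W_q / s                                    # fold 1/s back: y = (W_q / s) @ x approximates W @ x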
bitsandbytes: LLM.int8() and 4-bit NF4 Quantization
bitsandbytes is a library providing quantization and 8-bit optimizers for LLMs, known for its ease of integration. It offers runtime quantization of model weights into 8-bit or 4-bit without requiring a separate offline process (Quantization - LoRAX Docs). The 8-bit method (introduced as LLM.int8()) uses vector-wise quantization with an outlier-aware scheme: the few dimensions containing large-magnitude outliers are decomposed out and handled in higher precision (FP16), while the rest is multiplied in int8, ensuring negligible accuracy loss at 8-bit (QLoRA with AutoRound: Cheaper and Better LLM Fine-tuning on Your GPU). For 4-bit, bitsandbytes introduced the NF4 (NormalFloat-4) data type. NF4 is a non-uniform 4-bit representation designed to preserve the distribution of weights: instead of linearly mapping weights to 16 quantization levels, NF4 places its levels at the quantiles of a normal distribution, which matches how pretrained weights are typically distributed and so represents them more faithfully. The QLoRA work found this quantile-based code optimal for normally distributed weights, and an optional "double quantization" step additionally quantizes the per-block scale constants themselves to save memory. bitsandbytes stores model weights in 4-bit (saving 4× memory over FP16) but performs computations in higher precision (16-bit or 32-bit) to avoid instability. That is, during inference or fine-tuning, each 4-bit weight is dequantized to FP16 on the fly before use. This JIT dequantization trades some speed for simplicity – there's no need to compile custom int4 kernels, but it means bitsandbytes 4-bit inference is somewhat slower than fully integer math. In summary, bitsandbytes provides plug-and-play PTQ: load a model with 8-bit or 4-bit quantization enabled and it automatically handles weight compression. It requires no calibration dataset by default (it can quantize weights directly or after observing one batch). The simplicity and "on-the-fly" nature make it a popular choice for fine-tuning (e.g. QLoRA uses bitsandbytes 4-bit quantization), albeit with a runtime cost.
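The intuition behind NF4's level placement can be sketched as follows. This is illustrative only: it builds a 16-level codebook from normal quantiles and snaps a weight block to it, which is not bitsandbytes' exact codebook, block size, or storage layout.
import torch
def nf4_like_quantize(w_block):
    # Build 16 levels at quantiles of a standard normal, normalized to span [-1, 1]
    levels = torch.distributions.Normal(0.0, 1.0).icdf(torch.linspace(0.02, 0.98, 16))
    levels = levels / levels.abs().max()
    absmax = w_block.abs().max().clamp(min=1e-8)      # block-wise absmax scale
    codes = (w_block.div(absmax).unsqueeze(-1) - levels).abs().argmin(dim=-1)  # nearest-level 4-bit code
    dequant = levels[codes] * absmax                  # reconstruction used at compute time
    return codes.to(torch.uint8), absmax, dequant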
HQQ: Half-Quadratic Quantization
HQQ is a newer approach that frames weight quantization as a robust optimization problem, solved efficiently via half-quadratic splitting. Unlike GPTQ and AWQ, which minimize output (activation) error using calibration data, HQQ directly minimizes the weight reconstruction error without any data (HQQ quantization). The rationale is that by focusing on weights, HQQ avoids needing a calibration set at all, making it data-free (zero-shot) quantization. The challenge is that a naive weight-error metric (like MSE on weights) doesn't always correlate with output accuracy. HQQ addresses this by using a sparsity-promoting loss on weight errors – specifically an l_p (p < 1) hyper-Laplacian loss – which is far more robust to large outlier errors than a squared loss. This better models the heavy-tailed distribution of weight differences, capturing outlier weights more faithfully than a quadratic penalty. Formally, HQQ solves:
\[
\min_{s,z}\ \phi\!\Big(W - Q_{s,z}^{-1}\big(Q_{s,z}(W)\big)\Big),
\]
where Q_{s,z}(W) = round(W/s + z) quantizes W with scale s and zero-point z to produce the quantized weights W_q, and Q_{s,z}^{-1} dequantizes them back (HQQ quantization). This problem is non-convex due to φ and the discrete rounding. HQQ's innovation is to apply half-quadratic splitting: introduce an auxiliary variable W_e to split the objective, and alternate between optimizing W_e and the quantization parameters. The augmented objective becomes:
\[
\min_{z,\,W_e}\ \phi(W_e) + \frac{\beta}{2}\,\Big\|\,W_e - \big(W - Q_{z}^{-1}(Q_{z}(W))\big)\Big\|_2^2,
\]
which yields two subproblems solved iteratively (HQQ quantization). (1) Optimize W_e given the current z: this is essentially shrinkage on the weight residual (solving min_{W_e} φ(W_e) + (β/2)‖W_e − E‖² has a closed-form generalized soft-thresholding solution). (2) Optimize the quantization offset z given W_e: this reduces to a simple least-squares problem that chooses z so that the dequantized weights are as close as possible to W − W_e. By alternately updating W_e and z, HQQ converges to a near-optimal z (the scale s is fixed or pre-determined). In essence, W_e absorbs outlier errors, allowing large weight deviations to be handled separately by φ while z is optimized for the remaining weights. After solving, the final quantized weights W_q = Q_{s,z}(W) are produced. This algorithm is extremely fast – essentially a few iterations of simple updates – and does not require forward passes through the model. In fact, HQQ can quantize a 70B model to 2-bit in under 5 minutes (vs hours for GPTQ). It often achieves accuracy on par with data-driven methods: e.g. a 2-bit HQQ-quantized Llama-2-70B outperformed a full FP16 Llama-2-13B model. Implementation-wise, HQQ is provided as a library (Mobius Labs' hqq) that can replace nn.Linear layers with HQQLinear modules. It supports extremely low bit-widths (down to 2-bit and even 1-bit) and groups weights for efficiency (GitHub - mobiusml/hqq: Official implementation of Half-Quadratic Quantization (HQQ)). HQQ can integrate with Hugging Face Transformers via an HqqConfig to automatically quantize on model load. The method yields quantized weights plus a lightweight reconstruction overhead, with no runtime calibration needed.
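A minimal sketch of the half-quadratic iteration for a single weight group is shown below. It is our simplification of the published update rules (fixed min/max scale, zero-point optimized per group); the hqq library's actual solver is vectorized over groups and more careful numerically.
import torch
def hqq_solve_group(W, nbits=4, iters=20, beta=10.0, p=0.7):
    qmin, qmax = 0, 2 ** nbits - 1
    s = (W.max() - W.min()).clamp(min=1e-8) / (qmax - qmin)   # fixed scale from the min/max range
    z = -W.min() / s                                          # initial zero-point
    W_e = torch.zeros_like(W)                                 # auxiliary variable absorbing outlier error
    for _ in range(iters):
        Wq = torch.clamp(torch.round(W / s + z), qmin, qmax)
        E = W - (Wq - z) * s                                  # current reconstruction error
        # (1) generalized soft-thresholding: closed-form prox of the l_p (p < 1) loss
        W_e = torch.sign(E) * torch.relu(E.abs() - (p / beta) * E.abs().clamp(min=1e-8).pow(p - 1))
        # (2) closed-form zero-point update given W_e (averaged over the group)
        z = (Wq - (W - W_e) / s).mean()
    return torch.clamp(torch.round(W / s + z), qmin, qmax), s, z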
AutoRound: Learned Rounding and Min-Max Optimization
AutoRound is an advanced PTQ method developed by Intel, notable for its learned rounding strategy. Instead of relying on heuristic scaling or one-shot solving, AutoRound uses a few steps of gradient-based tuning (without full retraining) to determine optimal quantization parameters (GitHub - intel/auto-round: Advanced Quantization Algorithm for LLMs/VLMs.). Specifically, AutoRound formulates quantization as a differentiable problem by relaxing the rounding operation. It then uses a form of signed gradient descent – an efficient variant of gradient descent that uses only the sign of the gradients – to iteratively adjust two sets of parameters: (1) the rounding offsets for each weight (i.e., whether each weight should round up or down given the quantization step) and (2) the min/max scales for each quantization group. In practice, AutoRound runs for only ~200 optimization steps on a small calibration set to fine-tune these quantization parameters, which is extremely lightweight. The loss guiding this process is the model's output error on the calibration data (e.g., minimizing the MSE between full-precision and quantized outputs), allowing AutoRound to learn the best rounding decision per weight rather than using simple round-to-nearest. A notable advantage is that AutoRound's tuning does not introduce any new weights or require maintaining higher-precision weights at runtime – it only adjusts how the existing weights are quantized. The result is a set of quantized weights that often achieve "near-lossless" accuracy at 4-bit (Low-Bit Quantized Open LLM Leaderboard). For instance, int4 models quantized with AutoRound were shown to retain ~98% of the original accuracy on benchmark suites. Internally, AutoRound can be seen as combining ideas from PTQ and QAT: it has a quick calibration (like PTQ) but uses gradient-based adjustment (like QAT), albeit only on the quantization parameters. The implementation is integrated into Intel's tools (Intel Neural Compressor and Extension for Transformers). AutoRound is hardware-friendly: it outputs standard int4 weight matrices (no mixed precision) and can even quantize normally unquantized parts like the LM head. In summary, AutoRound automatically searches for the best quantization scaling and rounding, yielding low-bit models that outperform static methods on accuracy while adding no inference overhead.
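The learned-rounding idea can be illustrated with a toy single-layer version. This is our sketch, not Intel's implementation: it learns a per-weight rounding offset V in [-0.5, 0.5] and a per-row range multiplier with signed gradient descent against the full-precision layer output, whereas AutoRound tunes min/max clipping per quantization group and works block by block.
import torch
def autoround_tune(W, X, steps=200, lr=5e-3, bits=4):
    qmax = 2 ** (bits - 1) - 1
    fp_out = X @ W.t()                                        # full-precision layer output (reference)
    V = torch.zeros_like(W, requires_grad=True)               # learned rounding offsets
    alpha = torch.ones(W.shape[0], 1, requires_grad=True)     # learned scaling of the max range
    for _ in range(steps):
        s = alpha * W.abs().amax(dim=1, keepdim=True) / qmax
        x = W / s + V
        q = x + (torch.clamp(x.round(), -qmax - 1, qmax) - x).detach()  # straight-through rounding
        loss = ((X @ (q * s).t() - fp_out) ** 2).mean()       # output error on calibration data
        loss.backward()
        with torch.no_grad():                                 # sign-SGD: use only the sign of the gradients
            V -= lr * V.grad.sign(); V.clamp_(-0.5, 0.5); V.grad = None
            alpha -= lr * alpha.grad.sign(); alpha.grad = None
    with torch.no_grad():
        s = alpha * W.abs().amax(dim=1, keepdim=True) / qmax
        return torch.clamp((W / s + V).round(), -qmax - 1, qmax), s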
2. Practical Insights and Trade-offs
Each method offers a different balance between accuracy preservation, complexity, and deployment flexibility:
Accuracy Retention: All five methods aim to minimize the drop in model quality after quantization, but their effectiveness varies. AWQ generally shows the smallest accuracy degradation among PTQ methods – studies found AWQ's 4-bit quantization often outperforms GPTQ on language tasks (A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B). By protecting just ~1% of weights, AWQ can match full-precision accuracy on many benchmarks (AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration), even for instruction-tuned or multi-modal LLMs. GPTQ achieves good accuracy at 3–4 bits (it was one of the first to make 3-bit viable (GPTQT: Quantize Large Language Models Twice to Push the Efficiency)), but tends to underperform AWQ slightly on newer models. GPTQ's accuracy can improve with small groups or activation ordering, at the cost of slower processing. AutoRound is designed for near-lossless quantization – at 4-bit it often preserves >97% of the original accuracy (GitHub - intel/auto-round: Advanced Quantization Algorithm for LLMs/VLMs.), even beating other methods on benchmarks (Low-Bit Quantized Open LLM Leaderboard). HQQ, despite using no data, is surprisingly competitive: its weight-error minimization with the l_p loss means even 3-bit or 2-bit HQQ models stay usable (a 2-bit HQQ Llama-70B outperformed a larger FP16 model) (HQQ quantization). However, at extreme low bits (2-bit, 1-bit), HQQ can see some drop on complex tasks, as it optimizes weights in isolation – LoRAX reports HQQ "results in some amount of degradation in performance" compared to higher bits (Quantization - LoRAX Docs). bitsandbytes (NF4) maintains model perplexity reasonably well at 4-bit but not as precisely as AWQ or AutoRound; it may lose a bit more accuracy since it doesn't fine-tune scales per layer. For example, on GPT-3-family models, 4-bit bitsandbytes (NF4) might show a 1–2 perplexity-point regression, versus AWQ which is often within 0.5–1 point of FP16 (Thoughts on Quantization Roadmap · Issue #135 · ml-explore/mlx). In summary: AWQ and AutoRound lead in accuracy at 4-bit, GPTQ is close behind, bitsandbytes is slightly lower (but acceptable for many uses), and HQQ is competitive given its zero-shot nature, excelling especially on metrics dominated by weight fidelity.
Computational Efficiency: There are two aspects – offline quantization cost and inference speed. In offline cost, HQQ is a clear winner: it requires no calibration data and runs ~50× faster than GPTQ by avoiding slow layer-by-layer error computations. HQQ can quantize multi-billion-parameter models in minutes on a single GPU. GPTQ is heavier: it needs a forward pass on a calibration set for each layer and solves a least-squares problem per column. Quantizing a 70B model with GPTQ can take hours (often requiring multiple GPUs). AWQ is relatively light – it collects activation statistics from a small batch (often as few as 128–1024 samples) and then applies analytical scaling to the weights (AWQ: Activation-aware Weight Quantization for On-Device LLM ...). AWQ's paper notes it achieves good results with 10× smaller calibration data than GPTQ. AutoRound lies in between: it requires calibration data and ~200 iterations of lightweight fine-tuning of the quantization parameters, so quantization takes longer than one-shot PTQ but far less than full QAT. bitsandbytes has essentially zero preprocessing cost – weights are quantized on the fly during model loading or inference, with possibly a brief outlier detection pass.
In terms of inference latency, methods that produce fully quantized weights with specialized kernels have the advantage. AWQ-quantized models can leverage efficient int4 matrix-multiplication kernels, yielding substantial speedups. For instance, AWQ's 4-bit inference with the TinyChat runtime is about 2.7× faster than FP16 on an RTX 4090 GPU, and 2.9× faster on a mobile GPU (Jetson Orin) (GitHub - mit-han-lab/llm-awq: [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration). AWQ and GPTQ both generate static quantized weights that can be fed to custom CUDA kernels or TensorRT. AWQ tends to be faster than GPTQ at runtime, as noted in one guide: GPTQ is faster than bitsandbytes but "noticeably slower than AWQ" for inference (Quantization - LoRAX Docs). This is partly because AWQ's format (all int4 weights with a simple scale) is amenable to fused kernels, whereas GPTQ sometimes uses group-wise dequantization that is a bit more complex. bitsandbytes incurs more latency because it dequantizes weights on the fly to FP16 for each operation. It's more flexible (no specialized kernels needed), but the CPU/GPU has to convert 4-bit to 16-bit at runtime, adding overhead. Thus, bitsandbytes 4-bit is typically the slowest in throughput – one comparison showed GPTQ 4-bit running ~1.3–1.5× faster than bitsandbytes 8-bit on the same model, with AWQ faster still. AutoRound uses the same inference kernels as GPTQ/AWQ (since the result is standard int4 weights), and crucially it adds no runtime cost – the rounding adjustments are baked into the weights. So an AutoRound int4 model should run as fast as a GPTQ int4 model, but with higher accuracy. HQQ-quantized models can also use int4/int2 kernels. One caveat: HQQ by default may pack weights differently or use an "ATEN backend" for better accuracy, which is slower (GitHub - mobiusml/hqq: Official implementation of Half-Quadratic Quantization (HQQ)), but it also supports fast dequantization backends and was designed to load faster than other just-in-time methods like bitsandbytes. Indeed, HQQ's integration with the GemLite/TorchAO runtime achieves high throughput even at batch size 1 or 2 by optimizing the weight layout for GPU usage (Accelerating LLM Inference with GemLite, TorchAO and SGLang | PyTorch). In summary: for deployment, AWQ and AutoRound yield the fastest 4-bit inference, GPTQ is a close second, HQQ is catching up as it integrates with optimized runtimes, and bitsandbytes trades speed for ease of use.
Memory and Model Size: All methods reduce model memory roughly in proportion to bit-width. A 4-bit quantized model is ~4× smaller than FP16 (plus minor overhead for scales/metadata). For example, LLaMA-65B in FP16 (~130 GB) becomes ~33 GB at 4-bit. This enables fitting models on smaller GPUs or even edge devices. bitsandbytes and AWQ both store weights in truly 4-bit form (as NF4 or int4 values), achieving ~75% memory reduction vs FP16. GPTQ also stores weights in 4-bit, though its format may include some per-layer codebooks or extra statistics (usually negligible overhead). AutoRound int4 models are likewise ~4× smaller than FP16. HQQ at 2-bit can reach 8× compression (a 70B model down to ~20 GB, which is how a 2-bit Llama2-70B fits where only a 13B FP16 model would before (HQQ quantization)). However, extremely low-bit (2-bit) weights may need to be paired with small sparsity or other tricks. All weight-only methods keep activations in higher precision (FP16), so activation memory is not reduced (activations are not stored long-term, though). One nuance: bitsandbytes in 4-bit with double quantization allocates an extra quantized scale per 256-weight block (~0.4% overhead), and NF4 uses a half-precision lookup table for dequantization; these are minor trade-offs for stability. From a model file-format perspective, an ecosystem of formats has emerged: GPTQ models are often saved in a GPTQ-specific layout or as safetensors with weight tables; AWQ provides an AWQ format or uses safetensors with a specific naming; bitsandbytes uses standard model checkpoints but expects to quantize on load (no specialized file, unless you save an already-quantized *.pt); AutoRound is integrated into Intel's tools (Intel Neural Compressor and Extension for Transformers); and HQQ can save models to Hugging Face format as well. In practice, GGUF/GGML formats (not detailed here) are also used to store quantized weights for CPU inference, but those are separate from these algorithms (they often incorporate GPTQ-like quantization under the hood).
Deployment Scenarios: If you need plug-and-play quantization for fine-tuning, bitsandbytes is very attractive – you can load a model in 8-bit or 4-bit with one flag and then apply LoRA fine-tuning, as QLoRA demonstrated (QLoRA with AutoRound: Cheaper and Better LLM Fine-tuning on Your GPU). It requires minimal expertise and works on any model without calibration data. The cost is slower inference and being limited to ≥4-bit (bitsandbytes doesn't support 3-bit or 2-bit). For server-side inference where maximum throughput and minimum latency are key, AWQ and GPTQ are more suitable. They produce a fixed quantized model that can be highly optimized by inference frameworks (TensorRT, FasterTransformer, etc.), so you'd pick those for deployment. AWQ is especially appealing if you want 4-bit with almost no accuracy loss on a variety of models (even instruction-tuned ones) – it was shown to handle chat models and multimodal models robustly (AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration). GPTQ might be chosen if you already have a pipeline for it or if you want 3-bit weights; at 3-bit, GPTQ (with proper grouping) can still perform reasonably, whereas AWQ has mainly been used at 4-bit. AutoRound is a great choice when you need the highest accuracy at low bit-width and can afford a short calibration step – for example, model providers who want to produce a quantized variant that is virtually indistinguishable from FP16. Since AutoRound integrates with popular frameworks (Intel's) and yields int4 models that can be served like any other, it fits production environments concerned about quality loss. HQQ shines in scenarios where you lack data or time for calibration – e.g., compressing a proprietary model without access to a representative dataset, or quickly quantizing many models. Its output can be slightly less optimal than AWQ's, but not needing any calibration data is a huge plus (no privacy or collection concerns). Also, HQQ's ability to go to 2-bit or 3-bit easily might open up ultra-memory-constrained deployments (imagine fitting a 7B model in under 2 GB for mobile). One trade-off to note is mixed-precision versus uniform quantization. AWQ explicitly avoids layer-wise mixed precision to keep hardware simplicity, whereas GPTQ could in theory quantize some layers at 3-bit and others at 4-bit to balance accuracy (though this is not commonly done in one run). AutoRound can also produce uniform int4 or int8 models, and bitsandbytes always uses a uniform bit-width (except its internal double quantization, which doesn't complicate deployment). Uniform 4-bit across all layers is simpler to implement in GPU kernels and is supported by most deployment runtimes. Thus, all these methods produce hardware-friendly models, with AWQ and AutoRound particularly mindful of deployment constraints (e.g., AWQ's channel-scaling trick instead of mixed precision, and AutoRound even quantizing the output head and handling new architectures automatically (Low-Bit Quantized Open LLM Leaderboard)).
In summary, bitsandbytes offers maximum flexibility and ease (just load and go) at the cost of some speed and limited bit-depth; GPTQ and AWQ require a calibration procedure but yield faster, smaller models; AWQ usually wins on accuracy among those and is extremely fast to quantize; GPTQ is well-established for various bit settings; HQQ is an emerging zero-shot solution that hugely cuts quantization time and is ideal when data is scarce; and AutoRound provides a “best of both worlds” – a little bit of calibration to substantially boost low-bit accuracy, making it attractive for industry-grade deployments where 4-bit must meet strict quality bars.
3. Performance Benchmarks
Empirical evaluations on LLMs have highlighted the differences in perplexity, speed, and memory for these methods.
Perplexity and Accuracy Benchmarks: Across a variety of language benchmarks and model sizes, AWQ consistently matches or exceeds GPTQ in 4-bit accuracy (A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B). For example, on the Vicuna and Llama families, AWQ 4-bit achieved higher MT-Bench (multi-turn chat) scores than GPTQ 4-bit. A comprehensive study (Fu et al., 2024) evaluating instruction-tuned LLMs from 7B to 70B found AWQ's average performance drop from FP16 to be smaller than GPTQ's on most tasks. In one case, GPTQ 4-bit on a 7B model might cause, say, a 2-point drop in MMLU accuracy, whereas AWQ 4-bit saw only a ~1-point drop – a notable improvement. AutoRound often reaches even higher accuracy: Intel's data shows that an int4 AutoRound model can score almost the same as the FP16 model on a suite of tasks (within a couple of percent) (Low-Bit Quantized Open LLM Leaderboard). On a 10-task aggregate benchmark, AutoRound outperformed both GPTQ and AWQ, taking the top spot in accuracy for low-bit models. Furthermore, an AutoRound-quantized 13B model can beat an FP16 7B model on all metrics, effectively giving a "free" boost in capability by using a larger model at lower precision. GPTQ still holds its own: early GPTQ results showed that a 3-bit (W3A16) quantization of GPT-Neo or OPT retained surprisingly good perplexity, and GPTQ is often the baseline that newer methods compare against. If GPTQ is run with its optional error optimization ("act-order" grouping), it can narrow the gap to AWQ by focusing on important weights first. bitsandbytes (NF4) quantization has been validated through QLoRA fine-tuning experiments: a 4-bit NF4 base model fine-tuned with LoRA achieves performance equal to a full 16-bit fine-tune on tasks like instruction following (Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA). This implies bitsandbytes 4-bit introduces negligible task-performance loss after fine-tuning. However, when measuring raw zero-shot perplexity on language data, bitsandbytes-quantized models are sometimes a bit worse than AWQ/GPTQ. Community perplexity benchmarks (e.g., the ml-explore/mlx GitHub issue) indicate that "AWQ is better than GPTQ (especially GPTQ without activation order), which in turn is better than bitsandbytes-NF4" (Thoughts on Quantization Roadmap · Issue #135 · ml-explore/mlx). For example, on WikiText-2, AWQ 4-bit might achieve perplexity nearly identical to FP16, GPTQ 4-bit perhaps 5–10% higher, and NF4 4-bit slightly above that. HQQ's performance is notable given that it uses no data: HQQ quantization was shown to be within a few perplexity points of AWQ/GPTQ on Wikipedia text at 3-bit and 4-bit, though it may underperform on more complex tasks if not fine-tuned. One research survey found HQQ's 4-bit accuracy on Llama-2 to be competitive with GPTQ and AWQ averaged over tasks (HQQ quantization). It is also reported that AWQ and GPTQ converge to very close perplexity results on some language data, often overlapping within the error margin (Multi-dimensional Safety Evaluation of LLM Compression), meaning both are near-optimal in many cases. On specialized tasks (code generation, math), AWQ had an edge in the original paper's results (AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration), likely because it generalizes better without overfitting to calibration sets.
Latency and Throughput: On modern GPU hardware, 4-bit quantization dramatically improves throughput. As mentioned, AWQ's TinyChat int4 engine yields roughly 2.7× token-generation speedup on high-end GPUs and ~3× on edge GPUs (GitHub - mit-han-lab/llm-awq: [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration). Benchmarks by independent implementers show that, for a 7B model, FP16 inference might run at, say, 10 tokens/s, whereas a GPTQ or AWQ 4-bit model can reach 20–25 tokens/s on the same hardware (batch size 1). AWQ tends to be slightly faster due to its weight packing – one test showed AWQ 4-bit on Mistral-7B running at ~1.2× the speed of a GPTQ 4-bit version of the same model under identical hardware and batch settings (Quantization - LoRAX Docs). bitsandbytes 8-bit usually gives ~1.3× speedup over FP16 (since 8-bit matrix multiply is supported on many GPUs), but 4-bit bitsandbytes doesn't achieve a full 4× speedup because of the dequantization overhead; it can even be slower than FP16 if the optimized CUDA kernels aren't used. With the latest integrations (CUDA kernels for 4-bit) it improves, but it is still not as fast as static int4. For CPU inference, methods like GPTQ/AWQ can be exported to integer dot-product instructions – e.g., int4-quantized models running on AVX-512, or on ARM with int8 instructions (packing two 4-bit values per int8). Here, too, uniform quantization helps. CPU benchmarks (e.g., the GGML library for LLaMA) often show int4 models more than 2× faster than FP16. HQQ was also noted to load models faster than bitsandbytes – since it doesn't need to compute any calibration or per-column offsets, an HQQ-quantized model can be memory-mapped and used immediately. This reduces initialization latency in serving contexts (important when models are spun up on demand). Another aspect is scalability with batch size: traditional int4 kernels struggled with batch sizes >1 due to limited GEMM support. Efforts like GemLite aim to fix this (Accelerating LLM Inference with GemLite, TorchAO and SGLang | PyTorch); according to the PyTorch blog, the new kernels allow int4 performance to scale to larger batch sizes without dropping efficiency. So nowadays an int4 model can serve batch=8 or 16 almost as well as FP16. AWQ's option to choose a GEMM vs GEMV kernel based on batch size addresses this as well (GEMM for batch ≥8, GEMV for batch <8) (Quantization).
Memory and Compression: We’ve largely covered memory savings qualitatively. To put concrete numbers: LLaMA-2 70B in FP16 is ~130 GB, which is infeasible on single devices. GPTQ 4-bit compression yields a ~35 GB model that can be split across 4 GPUs (or loaded on one A100 80GB with room to spare). AWQ 4-bit will be similar size; indeed, one of AWQ’s achievements was running Llama-2 70B on a mobile GPU (with 12 GB) by streaming weights from CPU, which wasn’t practical at FP16 ( AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration). HQQ at 2-bit creates a 17 GB model for 70B (plus overhead), enabling entirely new possibilities in edge deployment (HQQ quantization). In terms of perplexity-per-size, HQQ’s 2-bit 70B beating 13B FP16 is a testament to how far compression can go: you can choose a larger model at super low precision to outperform a smaller model at high precision, within the same memory envelope. AutoRound also demonstrated this: a 13B int4 model outdoing a 7B FP16 model on all metrics (Low-Bit Quantized Open LLM Leaderboard ), meaning memory can be traded for accuracy by quantizing a bigger model. This is hugely beneficial for practitioners: rather than using a 7B model in full precision on a given hardware budget, you could use a 13B or 20B model quantized to 4-bit and get better results at the same cost.
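As a back-of-the-envelope check, weight storage scales linearly with bit-width. The 2% overhead for scales and zero-points below is an assumed round number, and published checkpoint sizes vary slightly with vocabulary and embedding details:
def weight_memory_gb(n_params_billion, bits, overhead=0.02):
    # bytes = params * bits / 8; overhead approximates per-group scales and zero-points
    return n_params_billion * 1e9 * bits / 8 * (1 + overhead) / 1e9
for bits in (16, 8, 4, 2):
    print(f"70B weights at {bits:>2}-bit: ~{weight_memory_gb(70, bits):.0f} GB")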
Energy Efficiency: A recent study (Rajput & Sharma, 2024) compared the energy usage of GPTQ, AWQ, bitsandbytes, and others during inference. Interestingly, it found that 4-bit GGML/GGUF (CPU-focused formats) were the most energy-efficient, likely due to highly optimized CPU int4 kernels. Among GPU methods, differences in energy weren't only about bit-width; kernel implementation mattered. GPTQ and AWQ both reduce energy per inference in proportion to their speedup, but if a method doesn't speed up inference (e.g., bitsandbytes doing the same FLOPs in FP16), it might not save much energy. The study suggests that lower precision doesn't always equal lower energy if the hardware isn't fully utilized. However, since AWQ and GPTQ enable using smaller GPUs or larger batches on the same GPU, they indirectly improve energy efficiency and throughput.
To summarize the numbers: All methods can shrink model size by ~4× at 4-bit. Latency improvements roughly: bitsandbytes (4-bit) ~1.5× (or less) speedup, GPTQ 4-bit ~2×, AWQ 4-bit ~2.5–3× on GPU (GitHub - mit-han-lab/llm-awq: [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration), HQQ 4-bit with new runtimes ~2× (and extremely fast quantization time), AutoRound 4-bit ~2× (same as GPTQ since runtime is similar, but with higher accuracy). Perplexity/accuracy: AWQ often <5% relative loss, GPTQ ~5-10% loss if any, AutoRound <3% loss, bitsandbytes ~5-15% loss on perplexity (but can be recovered with fine-tuning), HQQ ~5-10% loss. And on many tasks, AWQ and AutoRound essentially match full precision (A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B). These results will continue to improve as the techniques evolve and combine (e.g. AutoRound + AWQ ideas together).
4. Industry Applications and Adoption
Quantization is widely used in industry to make LLMs feasible in production, and these five methods have all seen significant adoption in the past year:
AWQ has been quickly embraced by both AI frameworks and hardware vendors. Its integration is found in Hugging Face Transformers (as a supported backend), allowing users to load AWQ-quantized models directly via from_pretrained (GitHub - mit-han-lab/llm-awq: [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration). Companies like NVIDIA and Intel have adopted AWQ in their toolchains: for example, AWQ is integrated into NVIDIA's TensorRT-LLM and into Intel Neural Compressor for efficient int4 serving (GitHub - mit-han-lab/llm-awq). Cloud platforms are also on board – Google's Vertex AI integrated AWQ for optimizing LLM serving, and Amazon SageMaker added AWQ support in its large-model containers (GitHub - mit-han-lab/llm-awq). Hardware-specific support came from AMD, which highlighted using AWQ to speed up LLM inference on AMD GPUs (GitHub - mit-han-lab/llm-awq). This wide adoption is partly due to AWQ's MIT open-source release and the fact that it won the MLSys 2024 Best Paper Award, lending it credibility. Real-world use cases include on-device AI assistants (leveraging AWQ's ability to run a 70B model on a mobile GPU) (AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration) and multimodal chatbots (AWQ supports vision-language models like Mini-GPT4, enabling 4-bit VLMs on devices) (GitHub - mit-han-lab/llm-awq). The AWQ authors even provide TinyChat, a turnkey inference framework to deploy 4-bit models on edge devices, which has been used in demos for offline chatbots and is being tested in products that require local LLM inference (e.g. privacy-sensitive applications) (GitHub - mit-han-lab/llm-awq).
GPTQ was one of the earliest "post-training quantization for LLMs" solutions and quickly became a de facto standard in the open-source community. There are hundreds of GPTQ-quantized model files on the Hugging Face Hub (often with names like modelname-GPTQ-4bit), published by community members such as TheBloke. This allowed people without high-end GPUs to use models like Llama-65B by loading a ready-made 4-bit checkpoint. Hugging Face Transformers added native support for GPTQ in 2023, including a GPTQConfig and integration with the accelerate library for smooth deployment (Quantization). There's also the dedicated library AutoGPTQ, which is widely used for easy quantization and loading of GPTQ models. Enterprise use: many startups and companies have used GPTQ to optimize models internally – it was one of the first methods to demonstrate you could get "GPT-3.5 level" models running on a single GPU with minimal loss. While perhaps not boasted about in official press releases, GPTQ undoubtedly powered numerous proofs of concept where a large model needed to be squeezed onto available hardware. For instance, MosaicML's inference stack noted support for GPTQ models, and the broader push for efficient deployment spurred interest in GPTQ among teams replicating large models. Framework support: Intel's Extension for Transformers provides GPTQ alongside other methods in a unified API (Low-Bit Quantized Open LLM Leaderboard), and PyTorch's new TorchAO library is compatible with GPTQ weight formats for int4 inference. One limitation had been compatibility – different GPTQ implementations had slightly different file formats – but initiatives like the GPTQ model zoo and tools in Hugging Face Optimum standardized this. In summary, GPTQ is well established, with broad community adoption (virtually every popular open LLM has a GPTQ version online), making it a reliable choice.
bitsandbytes (BNB) has been a game-changer, particularly for LLM fine-tuning and research experiments. When Meta released LLaMA, the AI community turned to bitsandbytes to load 30B+ models on a single GPU by using 8-bit quantization. The Hugging Face integration (via BitsAndBytesConfig) made it trivial to use – the official Hugging Face blog (May 2023) showcased how 4-bit quantization with bitsandbytes enabled QLoRA, allowing a 65B model to be fine-tuned on a single 48GB GPU (Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA). This capability has been hugely influential: a number of domain-specific LLMs (for medicine, finance, etc.) were fine-tuned using QLoRA, which relies on bitsandbytes for the base-model compression. Even today, many Kaggle and Colab notebooks default to bnb.nn.Linear4bit for loading large models. On the industry side, PyTorch Lightning and Hugging Face PEFT include bitsandbytes support to streamline quantized training. The convenience and maturity of bitsandbytes (developed by Tim Dettmers) led to its adoption in academia as well – numerous research papers in late 2023 used bitsandbytes to evaluate 4-bit training or as a baseline for quantization. One drawback is that bitsandbytes is GPU-specific (NVIDIA CUDA only) and not as optimized for inference serving, so you don't see it much in production web services – those lean toward static quantization. However, for prototyping and development it's extremely popular. Official framework blogs have recognized this; for example, the PyTorch blog on LLM efficiency mentions bitsandbytes as a key tool for model compression, and Hugging Face's documentation explicitly lists 8-bit and 4-bit bitsandbytes as supported quantization options. We also see bitsandbytes used in combination with other methods: e.g., fine-tuning a GPTQ model with LoRA still often uses bitsandbytes for optimizer or gradient quantization. Overall, bitsandbytes enjoys widespread usage in the developer community for its ability to "just work" with minimal hassle, effectively democratizing access to large models.
HQQ, being a very new method (late 2024), is on the cusp of broader adoption. Its development by Mobius Labs and its open-source release mean it's available on GitHub, and it's already integrated into Hugging Face Transformers (support for HqqConfig was added) (GitHub - mobiusml/hqq: Official implementation of Half-Quadratic Quantization (HQQ)). Moreover, HQQ's techniques have drawn attention from framework developers: in January 2025, PyTorch officially highlighted HQQ in a blog on LLM inference, noting compatibility with accuracy-preserving quantization techniques like HQQ in its TorchAO pipeline (Accelerating LLM Inference with GemLite, TorchAO and SGLang | PyTorch). This indicates that HQQ, or its ideas, will likely be part of future PyTorch offerings for LLM quantization. We also see interest in HQQ for edge deployment: because it requires no data, companies can deploy quantized models on client devices without needing to ship calibration data (which could be a privacy issue). Arm's AI blog mentioned efforts to quantize Llama models for mobile, where a data-free method like HQQ is valuable. HQQ has also been used to produce models on the Hugging Face Hub – e.g., Mobius released HQQ-quantized versions of ViT and Llama-2 models for others to test (HQQ quantization). As for industry, there may not be big-name announcements yet, but the technology is promising for any scenario requiring fast model compression. Even if HQQ itself isn't explicitly cited, its influence shows up elsewhere: the TorchAO project (PyTorch's native solution) likely borrows from HQQ's philosophy of quick weight quantization with flexible bit-widths. The open-source community is certainly experimenting with it (there's a Medium "Ultimate Guide" that includes HQQ (The Ultimate Handbook for LLM Quantization - Medium), and forums like Reddit share HQQ quantization tools). It would be no surprise if, in 2025, HQQ (or an iteration of it) becomes as common as GPTQ for producing and sharing quantized models.
AutoRound is driven largely by Intel's push for LLM optimization. It's incorporated in Intel Neural Compressor (INC) and Intel Extension for Transformers as a one-line option to quantize using AutoRound (Low-Bit Quantized Open LLM Leaderboard). Intel showcases AutoRound's efficacy through an open low-bit LLM quantization leaderboard it maintains, where many popular models (Llama-2, Mistral, Falcon, etc.) are listed with their int4 accuracy using AutoRound versus other methods. This public resource has helped validate AutoRound's claims and encourages industry practitioners to try it. The fact that AutoRound often tops the accuracy charts means users who need the best accuracy will gravitate to it. Intel's blog posts (on Medium) give "10 Tips for Quantizing with AutoRound" (AutoRound: Accurate Low-bit Quantization for LLMs), indicating an effort to educate users. The Hugging Face Hub hosts models quantized with AutoRound by Intel and the community (for example, a DeepSeek-Llama2 int2 model by Intel labs). Also, an October 2024 Hugging Face community article specifically compared fine-tuning with AutoRound versus other methods (QLoRA with AutoRound: Cheaper and Better LLM Fine-tuning on Your GPU), showing that researchers are considering AutoRound not just for inference but even for QLoRA-style training. PyTorch hasn't integrated AutoRound natively (it's Intel's), but Intel's acceleration libraries (usable from PyTorch via oneAPI) employ it for Xeon and Habana hardware. Real-world use cases for AutoRound include any scenario where a 4-bit model is desirable but accuracy is paramount – e.g., a chatbot that needs to maintain response quality or a generative model that must meet a benchmark threshold. Because AutoRound can quantize even the final linear layers and handle new model architectures automatically, it's attractive to companies that want a single quantization tool for all models (reducing the maintenance of multiple techniques). We see this reflected in statements like "Unlike GPTQ and AWQ, AutoRound can automatically accommodate new models" – an important factor for continuous integration as new LLM variants emerge.
Framework and Hardware Support: Beyond individual adoption, it's noteworthy that framework developers are unifying support for these methods. Hugging Face's transformers library now supports AWQ, GPTQ, bitsandbytes, HQQ, and others through a common QuantizationConfig interface (Quantization). This means users can choose their quantizer much like an optimization algorithm. PyTorch is developing TorchAO quantization for LLMs, and explicitly mentions working with int4 weight-only methods and making them compatible with distributed and parallel training (Accelerating LLM Inference with GemLite, TorchAO and SGLang | PyTorch). We also see collaboration: the PyTorch blog was co-authored by folks from Mobius Labs (HQQ) and SGLang (serving infrastructure), indicating an industry-wide effort to solve quantized-inference challenges collectively. TensorFlow has been quieter in public about LLM quantization (most of the buzz has been in PyTorch and C++ libraries), but TensorFlow users often leverage TFLite for 8-bit quantization, and there is ongoing work to incorporate 4-bit support. Google's Vertex AI choosing AWQ suggests that even if TF doesn't expose it, such techniques are used under the hood in their services. On the hardware end, NVIDIA's Hopper GPUs introduced FP8 and better int8 support – while not 4-bit, this shows hardware trending toward lower precision. NVIDIA's software (TensorRT-LLM) explicitly integrated AWQ and is likely testing others, meaning these algorithms are informing next-generation deployment frameworks. Intel is building a complete stack around AutoRound (coupled with CPU optimizations for int4). Cloud providers are keen on these: Amazon's integration of AWQ in SageMaker indicates customer demand for serving quantized models cheaply (GitHub - mit-han-lab/llm-awq). Startups like vLLM and LMDeploy have also integrated quantization methods (FastChat's model server supports AWQ, and vLLM integrated AWQ and GPTQ) (GitHub - mit-han-lab/llm-awq). All of this points to quantization being a standard part of the LLM lifecycle now, not an afterthought.
In conclusion, in the span of a year, these quantization methods have gone from research ideas to essential industry tools. Whether it’s open-source enthusiasts sharing GPTQ models, cloud APIs secretly using AWQ to cut costs, edge AI companies using HQQ to put LLMs on phones, or Intel and PyTorch baking quantization into their libraries, it’s clear that GPTQ, AWQ, bitsandbytes, HQQ, and AutoRound (often in combination) are enabling the practical deployment of large language models at scale.
5. PyTorch Implementation Examples
All these methods can be used within the PyTorch ecosystem, often via high-level APIs. Below, we provide simplified code snippets for each method using PyTorch or Hugging Face integrations:
Using bitsandbytes 8-bit/4-bit in PyTorch
The bitsandbytes library provides drop-in replacements for PyTorch linear layers. Hugging Face Transformers supports loading models in 8-bit or 4-bit with a BitsAndBytesConfig. For example, to load a 4-bit quantized model using the NF4 data type and double quantization:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
## Configure bitsandbytes 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # use NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,        # also quantize the per-block scale constants to save memory
    bnb_4bit_compute_dtype=torch.bfloat16  # compute in bfloat16 for speed
)
model_id = "facebook/opt-6.7b" # example model
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
In this example, the model's weights will be loaded as 4-bit NF4 values (with an internal mapping to half precision during computation). The device_map="auto" option spreads the model across available GPUs (the 4-bit memory saving helps it fit on fewer GPUs). After this, generation with model.generate works as usual. The bitsandbytes integration makes quantization transparent – you can fine-tune or run inference with the model as if it were FP16, while under the hood it dequantizes on the fly. Keep in mind the trade-off: as noted, this uses more GPU compute at runtime (dequantizing on each forward pass) and thus may be slower, but it is extremely convenient for experimentation.
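As a quick sanity check (the prompt below is arbitrary), generation works exactly as it would for an FP16 model:
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=20)  # greedy decoding by default
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))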
Using GPTQ (AutoGPTQ) in PyTorch
To quantize a model with GPTQ, one popular approach is the auto-gptq library, which can both quantize a model and later load the quantized result. Here's how you might quantize a model to 4-bit GPTQ and use it:
!pip install auto-gptq
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
## Define quantization configuration for 4-bit GPTQ
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,    # group size (128 is the common GPTQ grouping)
    damp_percent=0.01, # GPTQ error-dampening factor
    desc_act=False     # activation-order quantization (True can improve accuracy, but is slower)
)
model_id = "facebook/opt-6.7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
## Prepare a small calibration set (use a few hundred representative samples in practice)
examples = [tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")]
## Quantize and save the model
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config=quantize_config)
model.quantize(examples) # perform the actual GPTQ quantization using the calibration examples
model.save_quantized("opt-6.7b-GPTQ-4bit") # save quantized model to disk
This code would produce a quantized model saved under the given folder. Later, you can load it fast without re-quantizing:
quantized_model = AutoGPTQForCausalLM.from_quantized("opt-6.7b-GPTQ-4bit", device="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_id)
Now quantized_model is a 4-bit inference model using GPTQ quantization. Under the hood, the weights are stored as int4 (with some grouping metadata), and AutoGPTQ installs custom CUDA kernels to handle the int4 matrix multiplies efficiently. When generating text, this should run considerably faster than the FP16 model.
(Alternatively, Hugging Face's transformers now supports GPTQ via GPTQConfig, which can be used similarly to BitsAndBytesConfig/AwqConfig. For brevity, we showed the AutoGPTQ route.)
Using AWQ in PyTorch
AWQ can be applied via the MIT Han Lab's tool or through the transformers integration. The easiest way to use an AWQ-quantized model is to load one that has already been quantized. For instance, if you have a model quantized with AWQ (either by using the llm-awq repository or via AutoAWQ), you can do:
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
## Suppose we quantized or downloaded a 4-bit AWQ model of OPT-6.7B
model_id = "mit-han-lab/opt-6.7b-awq-int4" # (example ID; a real one would exist if model was uploaded)
## If quantization needs to be done, one would use AWQ's script or AutoAWQ to produce 'model_id'
## Here we directly load the quantized model assuming it's on the Hub or local disk.
awq_config = AwqConfig(bits=4) # target 4-bit; pre-quantized AWQ checkpoints usually already carry this config
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=awq_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
In practice, quantizing with AWQ involves first running a search to find the scales for the salient weight channels. The llm-awq GitHub repository provides scripts to compute those scales from a model and calibration data and then apply them to the weights. But since our focus is using it in PyTorch, the above snippet assumes the model is already quantized (many popular LLMs have AWQ versions available). The AwqConfig in HF ensures the model's linear layers are replaced with AWQ implementations (fused int4 kernels), and then you can run generation. AWQ 4-bit models loaded this way should produce nearly identical output to FP16 (especially if it's the official AWQ-quantized model).
If instead you have a full-precision model and want to quantize it with AWQ yourself, you can use the AutoAWQ library. For example, after installing it (pip install autoawq), you might do:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
## Quantize an FP16 model to 4-bit AWQ
fp16_model_id = "facebook/opt-6.7b"
model = AutoAWQForCausalLM.from_pretrained(fp16_model_id)
tokenizer = AutoTokenizer.from_pretrained(fp16_model_id)
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config) # runs calibration and applies the AWQ scales
model.save_quantized("opt-6.7b-AWQ-4bit")
By default, AutoAWQ collects a small calibration set internally (you can also supply your own data). Conceptually, AWQ's process is: load the model in FP16, run a few samples to gather activation statistics, determine the per-channel scales, then save a quantized model. Once quantized, usage is the same as the loading shown above.
Using HQQ in PyTorch
With the HQQ library and its integration, we have a couple of options: Hugging Face supports quantizing at load time with HqqConfig, or we can replace layers with HQQ's own helpers. For simplicity, here's how to quantize a model using HQQ's helper and then use it:
!pip install git+https://github.com/mobiusml/hqq.git
from transformers import AutoModelForCausalLM, AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig
import torch
model_id = "facebook/opt-1.3b"
## Load the model in CPU (or CPU offload) to quantize
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
## Define HQQ quantization config (e.g., 4-bit with group size 64)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
## Apply HQQ quantization in-place
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=torch.float16, device="cuda")
## Save the quantized model for later use
AutoHQQHFModel.save_quantized(model, "opt-1.3b-HQQ-4bit")
## The model's Linear layers are now HQQLinear with 4-bit packed weights.
## We can use the model directly for inference on GPU:
tokenizer = AutoTokenizer.from_pretrained(model_id)
outputs = model.generate(**tokenizer("Hello, how are you?", return_tensors='pt').to("cuda"))
print(tokenizer.decode(outputs[0]))
In this snippet, we replaced all Linear layers with HQQ's quantized versions by calling quantize_model. We specified compute_dtype=torch.float16 and device="cuda" to indicate that intermediate computations should run in half precision on the GPU (common for HQQ). After quantization, the model's weights are int4 internally, and forward passes use HQQ's optimized dequantization, which can leverage CUDA (the library chooses the backend – possibly a fused kernel or a chunked dequant – to execute the matmuls). The text generated by model.generate comes out in the usual format.
Alternatively, one could pass an HqqConfig so that Transformers quantizes a full-precision checkpoint on the fly as it loads:
from transformers import HqqConfig
quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_config, device_map="auto")
This quantizes the weights as the model loads. A model that was saved earlier via save_quantized is instead reloaded with AutoHQQHFModel.from_quantized. The HQQ integration also lets you save to safetensors for use in other frameworks like vLLM (GitHub - mobiusml/hqq: Official implementation of Half-Quadratic Quantization (HQQ)).
Using AutoRound in PyTorch (Intel Extension)
AutoRound is accessible through Intel’s extensions. While not directly in vanilla PyTorch, we can use Intel Neural Compressor (INC) or Intel Extension for Transformers. For example, using INC’s API to apply AutoRound PTQ:
!pip install neural-compressor
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantize
model_id = "facebook/opt-6.7b"
fp16_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cpu")
tokenizer = AutoTokenizer.from_pretrained(model_id)
## Prepare a calibration dataset (e.g., a few prompts)
calib_texts = ["Hello, my name is GPT.", "The weather today is"] # toy example lines
enc = tokenizer(calib_texts, return_tensors='pt', padding=True)
calib_dataset = [(enc['input_ids'][i], enc['attention_mask'][i]) for i in range(len(calib_texts))]
## Configure AutoRound quantization (via INC)
conf = PostTrainingQuantConfig(approach="weight_only", op_type_list=["MatMul"],
recipes={"rounding_strategy": "auto_round"})
## approach "weight_only" and recipe "auto_round" signals INC to use AutoRound algorithm for weight quant
q_model = quantize(model=fp16_model, calibration_data=calib_dataset, config=conf)
q_model.save("opt-6.7b-AutoRound-int4-model")
In this pseudo-code, we use neural_compressor.quantize with a configuration that specifies auto_round as the rounding strategy for weight-only quantization. We supply a small calibration_data list of tokenized inputs (in practice, one would use a few hundred random samples). INC then performs the AutoRound algorithm under the hood: it runs forward passes on calib_dataset, calculates gradient signs, and adjusts the weight quantization scales and rounding for MatMul (Linear) ops. The output is a quantized model (q_model) which is still a torch.nn.Module that can be loaded and run with PyTorch (though it may have some ONNX underpinnings or attached int scales). We save it to disk.
Now, to use this quantized model for inference, one could load it (possibly requiring the INC or IPEX runtime). If saved as a PyTorch state dict, you might do q_model = torch.load("...") or use INC's quantized_model = quantize(...) directly for serving. Alternatively, Intel Extension for Transformers provides an LLMQuantizationConfig where you could specify algorithm="auto_round" similarly and call model.quantize(...). The Intel developer article provides sample code (in an image) on using AutoRound with a Transformers API (Low-Bit Quantized Open LLM Leaderboard). In essence, the usage is a bit more involved than the others, but Intel has wrapped it to be as simple as possible for those in that ecosystem.
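If you would rather skip the INC plumbing, Intel also ships a standalone auto-round package (pip install auto-round). The snippet below is a minimal sketch following the pattern in the intel/auto-round README; argument names such as bits, group_size, and sym come from that repository and may change between versions:
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound
model_id = "facebook/opt-6.7b"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
## Tune rounding offsets and min/max scales (~200 steps) on the package's default calibration data
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("opt-6.7b-AutoRound-int4")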
Verification and Usage
After quantization with any of these methods, it’s good practice to verify model quality on a sample input or a small eval set. For example, generate a few prompts and ensure the outputs are sensible (and perhaps compare with the FP16 model’s output for sanity). For perplexity, you can run the model on a validation corpus in FP16 vs quantized to see the difference.
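For instance, a rough perplexity comparison on a handful of held-out paragraphs takes only a few lines (a quick sanity check, not a rigorous evaluation; texts is any small list of strings you choose):
import math
import torch
def quick_perplexity(model, tokenizer, texts):
    # Average token-level negative log-likelihood over a few sample texts
    total_nll, n_tokens = 0.0, 0
    model.eval()
    with torch.no_grad():
        for t in texts:
            enc = tokenizer(t, return_tensors="pt").to(model.device)
            out = model(**enc, labels=enc["input_ids"])       # Hugging Face models return the mean LM loss
            total_nll += out.loss.item() * enc["input_ids"].shape[1]
            n_tokens += enc["input_ids"].shape[1]
    return math.exp(total_nll / n_tokens)
## Compare quick_perplexity(fp16_model, tokenizer, texts) against quick_perplexity(quantized_model, tokenizer, texts)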
In summary, PyTorch now has first-class support for these quantization techniques, either through its own APIs or through closely integrated libraries. AWQ and GPTQ are accessible via Hugging Face transformers directly (Quantization), bitsandbytes via a configuration flag, HQQ via an add-on library (with HF and PyTorch integration), and AutoRound via Intel's toolkit. This means practitioners can experiment with quantizing large models with just a few lines of code.
References: The analysis above is grounded in recent research and official tool documentation, including the AWQ paper (AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration), the HQQ blog/paper (HQQ quantization), the GPTQ methodology (What is GPTQ Quantization for LLMs? — Picovoice), and industry reports on these methods' performance (Quantization - LoRAX Docs). Each method's trade-offs are drawn from empirical studies (Thoughts on Quantization Roadmap · Issue #135 · ml-explore/mlx) and framework developers' insights. As quantization rapidly evolves, these 2024–2025 developments demonstrate a significant step toward making LLMs more efficient and accessible in real-world applications.