"Mixed-Precision Graph Neural Quantization for Low Bit LLMs"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18154
The challenge in deploying LLMs arises from their large size and computational demands, especially in resource-limited environments. Existing Post-Training Quantization (PTQ) methods degrade sharply at low bit-widths (below 3 bits) because quantization error grows as precision drops.
This paper introduces Mixed-precision Graph Neural Post-Training Quantization (MG-PTQ). It uses a Graph Neural Network to capture dependencies between weights and allocate quantization bit-widths adaptively, minimizing quantization error.
-----
📌 MG-PTQ leverages Graph Neural Networks to model weight dependencies, a novel approach in Post-Training Quantization. This allows for more informed bit allocation than methods treating weights independently.
📌 Adaptive bit-width assignment based on GNN-assessed weight importance is a key innovation. It optimizes resource allocation by using higher precision for critical parameters and lower for less crucial ones.
📌 By modeling relationships between weights, MG-PTQ significantly improves low-bit quantization: it surpasses GPTQ on WikiText2 and C4, setting new perplexity benchmarks at bit-widths under 3.
----------
Methods Explored in this Paper 🔧:
→ This paper proposes a Mixed-precision Graph Neural Post-Training Quantization method called MG-PTQ.
→ MG-PTQ uses a Graph Neural Network module to capture the dependencies between weights in LLMs.
→ It constructs a feature graph in which each column of the weight matrix is a node.
→ Edges are weighted by the corresponding entries of the second-order Hessian matrix, which capture the dependencies between weight columns.
→ Node features are initialized with the weight values themselves.
→ A 2-layer Graph Convolutional Network processes this graph to learn weight importance.
→ Based on the GNN output, a bit-width allocator, which is a feed-forward neural network, assigns adaptive quantization bit-widths to each weight column.
→ The method employs block-wise quantization and an Approximate Gradient strategy using Gumbel-Softmax during training to handle the discrete bit-width choices (a minimal sketch of the full pipeline follows this list).
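A minimal, hypothetical PyTorch sketch of this pipeline, not the authors' code: module and parameter names are my own, and the Hessian-to-adjacency mapping, normalization, and bit-width choices are assumptions.

```python
# Hypothetical MG-PTQ-style bit allocator sketch. Assumes W is one weight matrix of a
# Transformer block and H is a second-order Hessian estimated from calibration data,
# as in GPTQ-style PTQ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph-convolution layer: propagate column features over the Hessian-weighted graph."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):
        # adj_norm: (cols, cols) normalized adjacency; x: (cols, feature_dim)
        return F.relu(self.lin(adj_norm @ x))


class BitAllocator(nn.Module):
    """2-layer GCN over weight columns plus a feed-forward head choosing a bit-width per column."""
    def __init__(self, rows, hidden=128, bit_choices=(2, 3, 4)):
        super().__init__()
        self.register_buffer("bit_choices", torch.tensor(bit_choices, dtype=torch.float32))
        self.gcn1 = GCNLayer(rows, hidden)
        self.gcn2 = GCNLayer(hidden, hidden)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, len(bit_choices)))

    def forward(self, W, H, tau=1.0):
        # Nodes = columns of W; node features = the column's own weight values.
        x = W.t()                                     # (cols, rows)
        adj = H.abs()                                 # edge weights taken from the Hessian (assumption)
        deg = adj.sum(dim=1, keepdim=True).clamp_min(1e-8)
        adj_norm = adj / deg                          # simple row normalization (assumption)
        h = self.gcn2(self.gcn1(x, adj_norm), adj_norm)
        logits = self.head(h)                         # (cols, num_bit_choices)
        # Gumbel-Softmax: hard one-hot choice in the forward pass, differentiable in the backward pass.
        probs = F.gumbel_softmax(logits, tau=tau, hard=True)
        bits = probs @ self.bit_choices               # selected bit-width per weight column
        return bits, probs


# Toy usage on a random layer:
rows, cols = 64, 128
W = torch.randn(rows, cols)
H = torch.randn(cols, cols); H = H @ H.t()            # toy positive semi-definite Hessian
allocator = BitAllocator(rows)
bits, _ = allocator(W, H)
print(bits.shape, bits.mean())                         # one bit-width per column, average budget
```

In a full pipeline, these per-column bit-widths would drive a GPTQ-style block-wise quantizer, with a penalty on the average bit-width keeping the overall budget on target (see Key Insights below).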
-----
Key Insights 💡:
→ Capturing weight dependencies using a Graph Neural Network significantly improves quantization accuracy at very low bit-widths.
→ Adaptively assigning bit-widths based on weight importance, learned through the GNN, outperforms uniform quantization strategies.
→ MG-PTQ keeps the average bit-width controllable by adding a penalty term to the GNN training objective (see the loss sketch after this list).
→ Replacing the GCN module with a Multilayer Perceptron significantly degrades performance, highlighting the importance of graph-based dependency modeling.
→ MG-PTQ achieves comparable quantization time to GPTQ while offering superior performance in low-bit quantization.
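A minimal sketch of how such a penalty could be combined with the reconstruction loss; the paper's exact formulation may differ, and `target_bits` and `lam` are assumed names.

```python
# Hypothetical budget penalty on the average allocated bit-width (not the paper's exact objective).
import torch

def mg_ptq_loss(recon_loss, bits, target_bits=2.5, lam=1.0):
    # recon_loss: quantization reconstruction error for the current block
    # bits: per-column bit-widths produced by the allocator (1-D tensor)
    over_budget = torch.relu(bits.mean() - target_bits)  # penalize only exceeding the budget
    return recon_loss + lam * over_budget ** 2
```

Raising `lam` would trade accuracy for a tighter average bit budget.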
-----
Results 📊:
→ MG-PTQ achieves a perplexity of 39.53 on WikiText2 and 29.43 on C4 for LLaMA1-7b at an average of 2.5 weight bits.
→ At 2 bits for LLaMA1-7b, MG-PTQ achieves 130.27 perplexity on WikiText2 and 96.70 on C4, outperforming GPTQ significantly in low-bit scenarios.
→ For LLaMA1-13b at 2.5 bits, MG-PTQ reaches 12.26 perplexity on WikiText2 and 10.37 on C4.
→ Ablation studies show that replacing the GCN with an MLP leads to a noticeable performance drop: at 2 bits, perplexity on C4 for LLaMA1-7b increases from 96.70 to 122.69.