"Mixed-Precision Graph Neural Quantization for Low Bit LLMs"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18154
Deploying LLMs is challenging because of their large size and computational demands, especially in resource-limited environments. Existing Post-Training Quantization (PTQ) methods degrade significantly at low bit-widths (below 3 bits) because quantization error grows sharply.
This paper introduces Mixed-precision Graph Neural Post-Training Quantization (MG-PTQ). It uses a Graph Neural Network to model dependencies between weights and allocate quantization bit-widths adaptively, thereby minimizing quantization error.
-----
📌 MG-PTQ leverages Graph Neural Networks to model weight dependencies, a novel approach in Post-Training Quantization. This allows more informed bit allocation than methods that treat weights independently.
📌 Adaptive bit-width assignment based on GNN-assessed weight importance is the key innovation. It optimizes the bit budget by using higher precision for critical parameters and lower precision for less crucial ones.
📌 By considering weight relationships, MG-PTQ significantly improves low-bit quantization performance. It surpasses GPTQ, setting new perplexity benchmarks at bit-widths under 3 on WikiText2 and C4.
----------
Methods Explored in this Paper 🔧:
→ This paper proposes a Mixed-precision Graph Neural Post-Training Quantization method called MG-PTQ.
→ MG-PTQ uses a Graph Neural Network module to capture the dependencies between weights in LLMs.
→ It constructs a feature graph in which each column of the weight matrix is a node.
→ The edges of this graph are weighted by the second-order Hessian matrix, representing dependencies between weights.
→ Node features are initialized with the weight values themselves.
→ A 2-layer Graph Convolutional Network processes this graph to learn weight importance.
→ Based on the GNN output, a bit-width allocator, implemented as a feed-forward neural network, assigns an adaptive quantization bit-width to each weight column.
→ The method employs block-wise quantization and an Approximate Gradient strategy using Gumbel-Softmax during training to handle the discrete bit-width choices (a simplified sketch of this pipeline follows below).
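A minimal PyTorch sketch of this pipeline is below, under simplifying assumptions: node features are the columns of one layer's weight matrix, edge weights come from a precomputed Hessian approximation, a 2-layer GCN produces per-column embeddings, and a small feed-forward allocator picks a bit-width per column via Gumbel-Softmax. The candidate bit set {2, 3, 4}, the layer sizes, and all names are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseGCNLayer(nn.Module):
    # One GCN layer over a dense, normalized adjacency: H' = ReLU(A_norm @ H @ W).
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, a_norm):
        return F.relu(a_norm @ self.lin(h))

class MGPTQSketch(nn.Module):
    # Illustrative MG-PTQ-style module: GCN over weight columns + bit-width allocator.
    def __init__(self, rows, hidden=64, bit_choices=(2, 3, 4)):
        super().__init__()
        self.register_buffer("bit_choices", torch.tensor(bit_choices, dtype=torch.float32))
        self.gcn1 = DenseGCNLayer(rows, hidden)
        self.gcn2 = DenseGCNLayer(hidden, hidden)
        # Feed-forward allocator: per-column logits over the candidate bit-widths.
        self.allocator = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, len(bit_choices))
        )

    @staticmethod
    def normalize_adj(adj):
        # Symmetric normalization D^-1/2 (A + I) D^-1/2, as in standard GCNs.
        adj = adj + torch.eye(adj.size(0))
        d_inv_sqrt = adj.sum(dim=1).clamp(min=1e-8).pow(-0.5)
        return d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

    def forward(self, weight, hessian, tau=1.0):
        h = weight.t()                              # nodes = weight columns, features = column values
        a_norm = self.normalize_adj(hessian.abs())  # Hessian magnitudes as dependency edges
        h = self.gcn2(self.gcn1(h, a_norm), a_norm)
        logits = self.allocator(h)                  # (num_columns, num_bit_choices)
        # Gumbel-Softmax gives a differentiable, approximately one-hot bit-width selection.
        probs = F.gumbel_softmax(logits, tau=tau, hard=True)
        return probs @ self.bit_choices             # assigned bit-width per column

# Toy usage: a 128x32 weight matrix with a crude Hessian proxy (not the paper's exact Hessian).
W = torch.randn(128, 32)
H = W.t() @ W / W.size(0)
bits = MGPTQSketch(rows=128)(W, H)
print(bits[:8])  # e.g. tensor([2., 4., 3., ...])
```

Block-wise quantization of each column at its assigned bit-width would follow this step; that part is omitted here for brevity.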
-----
Key Insights 💡:
→ Capturing weight dependencies with a Graph Neural Network significantly improves quantization accuracy at very low bit-widths.
→ Adaptively assigning bit-widths based on GNN-learned weight importance outperforms uniform quantization strategies.
→ MG-PTQ keeps the average bit-width controllable by adding a penalty term to the GNN training objective (a sketch of one plausible penalty form appears after this list).
→ Replacing the GCN module with a Multilayer Perceptron significantly degrades performance, highlighting the importance of graph-based dependency modeling.
→ MG-PTQ achieves quantization time comparable to GPTQ while offering superior performance in low-bit quantization.
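On the controllable average bit-width: the paper states that a penalty term in the GNN training objective keeps the average bit-width near a target. The sketch below shows one plausible form of such an objective; quant_error, target_bits, and lam are assumed placeholder names, not the authors' notation.

```python
import torch

def mgptq_style_loss(quant_error, bits_per_column, target_bits=2.5, lam=0.1):
    # Illustrative objective: quantization error plus a budget penalty that
    # discourages the average assigned bit-width from exceeding the target.
    avg_bits = bits_per_column.mean()
    budget_penalty = torch.relu(avg_bits - target_bits) ** 2
    return quant_error + lam * budget_penalty

# Columns assigned {2, 3, 4, 2, 4, 3} bits average to 3.0, exceeding a 2.5-bit target.
bits = torch.tensor([2.0, 3.0, 4.0, 2.0, 4.0, 3.0])
loss = mgptq_style_loss(quant_error=torch.tensor(0.8), bits_per_column=bits)
print(loss)  # 0.8 + 0.1 * (3.0 - 2.5)^2 = 0.825
```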
-----
Results 📊:
→ MG-PTQ achieves a perplexity of 39.53 on WikiText2 and 29.43 on C4 for LLaMA1-7b at an average of 2.5 weight bits.
→ At 2 bits for LLaMA1-7b, MG-PTQ achieves 130.27 perplexity on WikiText2 and 96.70 on C4, significantly outperforming GPTQ in low-bit scenarios.
→ For LLaMA1-13b at 2.5 bits, MG-PTQ reaches 12.26 perplexity on WikiText2 and 10.37 on C4.
→ Ablation studies show that replacing the GCN with an MLP leads to a noticeable performance drop; for example, at 2 bits, perplexity on C4 for LLaMA1-7b increases from 96.70 to 122.69.


