"Mixed-Precision Graph Neural Quantization for Low Bit LLMs"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18154
The challenge in deploying LLMs arises from their large size and computational demands, especially in resource-limited environments. Existing Post-Training Quantization (PTQ) methods degrade sharply at low bit-widths (below 3 bits) because quantization error grows as precision drops.
This paper introduces Mixed-precision Graph Neural Post-Training Quantization (MG-PTQ). It uses a Graph Neural Network to capture dependencies between weights and allocate quantization bit-widths adaptively, minimizing quantization error.
-----
📌 MG-PTQ leverages Graph Neural Networks to model weight dependencies, a novel approach in Post-Training Quantization. This allows for more informed bit allocation than methods treating weights independently.
📌 Adaptive bit-width assignment based on GNN-assessed weight importance is a key innovation. It optimizes resource allocation by using higher precision for critical parameters and lower for less crucial ones.
📌 By modeling relationships between weights, MG-PTQ significantly improves low-bit quantization: it surpasses GPTQ on WikiText2 and C4, setting new perplexity benchmarks at bit-widths under 3.
----------
Methods Explored in this Paper 🔧:
→ This paper proposes a Mixed-precision Graph Neural Post-Training Quantization method called MG-PTQ.
→ MG-PTQ uses a Graph Neural Network module to capture the dependencies between weights in LLMs.
→ It constructs a feature graph in which each column of the weight matrix is a node.
→ Edges are weighted by the corresponding entries of the second-order Hessian matrix, which capture the dependencies between weight columns.
→ Node features are initialized with the weight values themselves.
→ A 2-layer Graph Convolutional Network processes this graph to learn weight importance.
→ Based on the GNN output, a bit-width allocator, which is a feed-forward neural network, assigns adaptive quantization bit-widths to each weight column.
→ The method employs block-wise quantization and an Approximate Gradient strategy using Gumbel-Softmax during training to handle the discrete bit-width choices (a minimal sketch of the full pipeline follows this list).
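A minimal, hypothetical PyTorch sketch of this pipeline, not the authors' code: module and parameter names are my own, and the Hessian-to-adjacency mapping, normalization, and bit-width choices are assumptions.

```python
# Hypothetical MG-PTQ-style bit allocator sketch. Assumes W is one weight matrix of a
# Transformer block and H is a second-order Hessian estimated from calibration data,
# as in GPTQ-style PTQ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph-convolution layer: propagate column features over the Hessian-weighted graph."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):
        # adj_norm: (cols, cols) normalized adjacency; x: (cols, feature_dim)
        return F.relu(self.lin(adj_norm @ x))


class BitAllocator(nn.Module):
    """2-layer GCN over weight columns plus a feed-forward head choosing a bit-width per column."""
    def __init__(self, rows, hidden=128, bit_choices=(2, 3, 4)):
        super().__init__()
        self.register_buffer("bit_choices", torch.tensor(bit_choices, dtype=torch.float32))
        self.gcn1 = GCNLayer(rows, hidden)
        self.gcn2 = GCNLayer(hidden, hidden)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, len(bit_choices)))

    def forward(self, W, H, tau=1.0):
        # Nodes = columns of W; node features = the column's own weight values.
        x = W.t()                                     # (cols, rows)
        adj = H.abs()                                 # edge weights taken from the Hessian (assumption)
        deg = adj.sum(dim=1, keepdim=True).clamp_min(1e-8)
        adj_norm = adj / deg                          # simple row normalization (assumption)
        h = self.gcn2(self.gcn1(x, adj_norm), adj_norm)
        logits = self.head(h)                         # (cols, num_bit_choices)
        # Gumbel-Softmax: hard one-hot choice in the forward pass, differentiable in the backward pass.
        probs = F.gumbel_softmax(logits, tau=tau, hard=True)
        bits = probs @ self.bit_choices               # selected bit-width per weight column
        return bits, probs


# Toy usage on a random layer:
rows, cols = 64, 128
W = torch.randn(rows, cols)
H = torch.randn(cols, cols); H = H @ H.t()            # toy positive semi-definite Hessian
allocator = BitAllocator(rows)
bits, _ = allocator(W, H)
print(bits.shape, bits.mean())                         # one bit-width per column, average budget
```

In a full pipeline, these per-column bit-widths would drive a GPTQ-style block-wise quantizer, with a penalty on the average bit-width keeping the overall budget on target (see Key Insights below).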
-----
Key Insights 💡:
→ Capturing weight dependencies using a Graph Neural Network significantly improves quantization accuracy at very low bit-widths.
→ Adaptively assigning bit-widths based on weight importance, learned through the GNN, outperforms uniform quantization strategies.
→ MG-PTQ keeps the average bit-width controllable by adding a penalty term to the GNN training objective (see the loss sketch after this list).
→ Replacing the GCN module with a Multilayer Perceptron significantly degrades performance, highlighting the importance of graph-based dependency modeling.
→ MG-PTQ achieves comparable quantization time to GPTQ while offering superior performance in low-bit quantization.
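A minimal sketch of how such a penalty could be combined with the reconstruction loss; the paper's exact formulation may differ, and `target_bits` and `lam` are assumed names.

```python
# Hypothetical budget penalty on the average allocated bit-width (not the paper's exact objective).
import torch

def mg_ptq_loss(recon_loss, bits, target_bits=2.5, lam=1.0):
    # recon_loss: quantization reconstruction error for the current block
    # bits: per-column bit-widths produced by the allocator (1-D tensor)
    over_budget = torch.relu(bits.mean() - target_bits)  # penalize only exceeding the budget
    return recon_loss + lam * over_budget ** 2
```

Raising `lam` would trade accuracy for a tighter average bit budget.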
-----
Results 📊:
→ MG-PTQ achieves a perplexity of 39.53 on WikiText2 and 29.43 on C4 for LLaMA1-7b at an average of 2.5 weight bits.
→ At 2 bits for LLaMA1-7b, MG-PTQ achieves 130.27 perplexity on WikiText2 and 96.70 on C4, outperforming GPTQ significantly in low-bit scenarios.
→ For LLaMA1-13b at 2.5 bits, MG-PTQ reaches 12.26 perplexity on WikiText2 and 10.37 on C4.
→ Ablation studies show that replacing the GCN with an MLP leads to a noticeable performance drop: at 2 bits, perplexity on C4 for LLaMA1-7b increases from 96.70 to 122.69.