1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit
First successful 1-bit Fully Quantized Training (FQT), with gradient pruning.
👨🔧 With it, training speedup can reach up to 5.13× compared to full-precision training.
This should be a big step forward for on-device training.
Regular quantization typically quantizes only weights and activations, while FQT quantizes weights, activations, and gradients.
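A minimal NumPy sketch of what that means in practice (illustrative only, not the paper's kernels): every tensor in a linear layer's forward and backward pass, including the output gradient, gets 1-bit sign quantization with a per-tensor scale.

```python
import numpy as np

def quantize_1bit(x):
    # 1-bit quantization: keep only the sign, rescale by the mean magnitude.
    scale = np.abs(x).mean() + 1e-12
    return np.sign(x) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))          # activations (batch, in_features)
w = rng.standard_normal((16, 4)) * 0.1    # weights (in_features, out_features)

# Forward pass: weights AND activations are quantized before the matmul.
y = quantize_1bit(x) @ quantize_1bit(w)

# Backward pass: FQT also quantizes the output gradient, so both backward
# matmuls (weight gradient and input gradient) run in low precision too.
g_y = rng.standard_normal(y.shape)        # stand-in for dL/dy
g_q = quantize_1bit(g_y)
grad_w = quantize_1bit(x).T @ g_q         # weight gradient
grad_x = g_q @ quantize_1bit(w).T         # input (activation) gradient
```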
Problem 🔍:
The current research frontier is 4-bit FQT, but pushing bitwidth lower remains challenging due to a lack of theoretical understanding and large quantization errors.
Key Insights from this Paper 💡:
• Gradient variance influences FQT convergence (illustrated in the sketch after this list)
• Adam is more suitable for low-bitwidth FQT than SGD
• Gradient heterogeneity can be leveraged to reduce variance
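To make the first insight concrete, here is a small NumPy experiment (my own illustration, not code from the paper): a uniform quantizer with stochastic rounding keeps the gradient unbiased, but the variance it adds grows quickly as bitwidth drops, which is exactly what makes naive 1-bit FQT hard.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round_quantize(x, bits):
    # Uniform quantizer with stochastic rounding: unbiased, but noisier
    # as the step size (controlled by bitwidth) gets coarser.
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2 ** bits - 1)
    z = (x - lo) / step
    z_floor = np.floor(z)
    z_round = z_floor + (rng.random(x.shape) < (z - z_floor))
    return z_round * step + lo

g = rng.standard_normal(100_000)          # stand-in for an activation gradient
for bits in (8, 4, 2, 1):
    noise = stochastic_round_quantize(g, bits) - g
    print(f"{bits}-bit gradient -> added variance {noise.var():.4e}")
# The added variance explodes as bitwidth shrinks; the paper ties this
# gradient variance directly to how well (or whether) FQT converges.
```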
Solution in this Paper 🛠️:
• Activation Gradient Pruning (AGP), sketched below:
Discards less informative gradient groups
Improves numerical precision of remaining groups
Reduces gradient variance
• Sample Channel joint Quantization (SCQ), sketched below:
Uses different quantization for weight and activation gradients
Ensures efficient implementation on low-bitwidth hardware
• Framework for 1-bit FQT deployment, sketched below:
Implements forward and backward propagation using binary operations
Achieves practical acceleration on low-bitwidth hardware
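A hedged sketch of the AGP idea (function name, group size, and keep ratio are my own placeholders, not the paper's code): split the activation gradient into groups, discard the low-magnitude groups, and quantize the survivors with their own per-group scale so the remaining bits are spent where the signal is.

```python
import numpy as np

def activation_gradient_prune(g, group_size=256, keep_ratio=0.5):
    # Split the gradient into equal-size groups (assumes an even split)
    # and rank them by norm.
    groups = g.reshape(-1, group_size)
    norms = np.linalg.norm(groups, axis=1)
    k = max(1, int(keep_ratio * len(norms)))
    keep = np.argsort(norms)[-k:]              # most informative groups
    out = np.zeros_like(groups)                # less informative groups are discarded
    # Kept groups get their own scale: the range "saved" by the discarded
    # groups is reinvested in the informative ones, lowering gradient variance.
    scale = np.abs(groups[keep]).mean(axis=1, keepdims=True) + 1e-12
    out[keep] = np.sign(groups[keep]) * scale  # 1-bit per-group quantization
    return out.reshape(g.shape)

g = np.random.default_rng(0).standard_normal((32, 512))   # toy activation gradient
g_pruned = activation_gradient_prune(g)
```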
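A hedged sketch of the SCQ idea; which grouping feeds which path follows my reading of the summary above, with the key point being that the gradients consumed by the weight-gradient and activation-gradient matmuls are quantized along different dimensions while both stay dense and hardware-friendly.

```python
import numpy as np

def quantize_1bit(x, axis):
    # Sign quantization with one scale per slice along `axis`.
    scale = np.abs(x).mean(axis=axis, keepdims=True) + 1e-12
    return np.sign(x) * scale

rng = np.random.default_rng(0)
g_y = rng.standard_normal((64, 128))     # output gradient (batch, out_features)
x   = rng.standard_normal((64, 256))     # saved input activations
w   = rng.standard_normal((256, 128))    # layer weights

# One quantization of g_y (grouped along one axis) feeds the weight-gradient
# matmul; a second quantization (grouped along the other axis) feeds the
# activation-gradient matmul.
grad_w = x.T @ quantize_1bit(g_y, axis=0)    # weight gradient path
grad_x = quantize_1bit(g_y, axis=1) @ w.T    # activation gradient path
```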
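And a hedged sketch of why a 1-bit framework can yield real acceleration: once values live in {-1, +1}, a dot product reduces to XNOR plus popcount on packed bits, which is exactly the kind of operation low-bitwidth hardware executes cheaply. This toy version emulates it with NumPy bit-packing; real kernels would use hardware popcount instructions.

```python
import numpy as np

def pack_signs(v):
    # Pack a {-1, +1} vector into bits (1 for +1, 0 for -1).
    return np.packbits((v > 0).astype(np.uint8))

def binary_dot(a, b):
    # Dot product of two {-1, +1} vectors via XNOR + popcount.
    n = len(a)
    agree = np.bitwise_not(np.bitwise_xor(pack_signs(a), pack_signs(b)))
    ones = int(np.unpackbits(agree)[:n].sum())   # positions where signs agree
    return 2 * ones - n                          # (#agree) - (#disagree)

rng = np.random.default_rng(0)
a = np.sign(rng.standard_normal(1024)).astype(np.int8)
b = np.sign(rng.standard_normal(1024)).astype(np.int8)
assert binary_dot(a, b) == int(a.astype(np.int32) @ b.astype(np.int32))
```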
Results 📊:
• Fine-tuning VGGNet-16 and ResNet-18 on multiple datasets:
Average accuracy improvement: ~6% compared to per-sample quantization
Maximum training speedup: 5.13× compared to full precision
• Average accuracy drop on visual classification: ~5% compared to 32-bit gradients
• Negligible accuracy loss (<1%) on Flowers and Pets datasets