1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit
First successful 1-bit Fully Quantized Training (FQT), with gradient pruning.
👨🔧 With it, training speedup can reach up to 5.13× compared to full-precision training.
This should be a big step forward for on-device training.
Regular quantization typically quantizes only weights and activations, while FQT quantizes weights, activations, and gradients.
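A minimal NumPy sketch of what that means in practice (illustrative only, not the paper's kernels): every tensor in a linear layer's forward and backward pass, including the output gradient, gets 1-bit sign quantization with a per-tensor scale.

```python
import numpy as np

def quantize_1bit(x):
    # 1-bit quantization: keep only the sign, rescale by the mean magnitude.
    scale = np.abs(x).mean() + 1e-12
    return np.sign(x) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))          # activations (batch, in_features)
w = rng.standard_normal((16, 4)) * 0.1    # weights (in_features, out_features)

# Forward pass: weights AND activations are quantized before the matmul.
y = quantize_1bit(x) @ quantize_1bit(w)

# Backward pass: FQT also quantizes the output gradient, so both backward
# matmuls (weight gradient and input gradient) run in low precision too.
g_y = rng.standard_normal(y.shape)        # stand-in for dL/dy
g_q = quantize_1bit(g_y)
grad_w = quantize_1bit(x).T @ g_q         # weight gradient
grad_x = g_q @ quantize_1bit(w).T         # input (activation) gradient
```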
Problem 🔍:
The current research frontier is 4-bit FQT, but pushing bitwidth lower remains challenging due to a lack of theoretical understanding and large quantization errors.
Key Insights from this Paper 💡:
• Gradient variance influences FQT convergence (illustrated in the sketch after this list)
• Adam is more suitable for low-bitwidth FQT than SGD
• Gradient heterogeneity can be leveraged to reduce variance
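To make the first insight concrete, here is a small NumPy experiment (my own illustration, not code from the paper): a uniform quantizer with stochastic rounding keeps the gradient unbiased, but the variance it adds grows quickly as bitwidth drops, which is exactly what makes naive 1-bit FQT hard.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round_quantize(x, bits):
    # Uniform quantizer with stochastic rounding: unbiased, but noisier
    # as the step size (controlled by bitwidth) gets coarser.
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2 ** bits - 1)
    z = (x - lo) / step
    z_floor = np.floor(z)
    z_round = z_floor + (rng.random(x.shape) < (z - z_floor))
    return z_round * step + lo

g = rng.standard_normal(100_000)          # stand-in for an activation gradient
for bits in (8, 4, 2, 1):
    noise = stochastic_round_quantize(g, bits) - g
    print(f"{bits}-bit gradient -> added variance {noise.var():.4e}")
# The added variance explodes as bitwidth shrinks; the paper ties this
# gradient variance directly to how well (or whether) FQT converges.
```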
Solution in this Paper 🛠️:
• Activation Gradient Pruning (AGP), sketched below:
Discards less informative gradient groups
Improves numerical precision of remaining groups
Reduces gradient variance
• Sample Channel joint Quantization (SCQ), sketched below:
Uses different quantization for weight and activation gradients
Ensures efficient implementation on low-bitwidth hardware
• Framework for 1-bit FQT deployment, sketched below:
Implements forward and backward propagation using binary operations
Achieves practical acceleration on low-bitwidth hardware
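A hedged sketch of the AGP idea (function name, group size, and keep ratio are my own placeholders, not the paper's code): split the activation gradient into groups, discard the low-magnitude groups, and quantize the survivors with their own per-group scale so the remaining bits are spent where the signal is.

```python
import numpy as np

def activation_gradient_prune(g, group_size=256, keep_ratio=0.5):
    # Split the gradient into equal-size groups (assumes an even split)
    # and rank them by norm.
    groups = g.reshape(-1, group_size)
    norms = np.linalg.norm(groups, axis=1)
    k = max(1, int(keep_ratio * len(norms)))
    keep = np.argsort(norms)[-k:]              # most informative groups
    out = np.zeros_like(groups)                # less informative groups are discarded
    # Kept groups get their own scale: the range "saved" by the discarded
    # groups is reinvested in the informative ones, lowering gradient variance.
    scale = np.abs(groups[keep]).mean(axis=1, keepdims=True) + 1e-12
    out[keep] = np.sign(groups[keep]) * scale  # 1-bit per-group quantization
    return out.reshape(g.shape)

g = np.random.default_rng(0).standard_normal((32, 512))   # toy activation gradient
g_pruned = activation_gradient_prune(g)
```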
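A hedged sketch of the SCQ idea; which grouping feeds which path follows my reading of the summary above, with the key point being that the gradients consumed by the weight-gradient and activation-gradient matmuls are quantized along different dimensions while both stay dense and hardware-friendly.

```python
import numpy as np

def quantize_1bit(x, axis):
    # Sign quantization with one scale per slice along `axis`.
    scale = np.abs(x).mean(axis=axis, keepdims=True) + 1e-12
    return np.sign(x) * scale

rng = np.random.default_rng(0)
g_y = rng.standard_normal((64, 128))     # output gradient (batch, out_features)
x   = rng.standard_normal((64, 256))     # saved input activations
w   = rng.standard_normal((256, 128))    # layer weights

# One quantization of g_y (grouped along one axis) feeds the weight-gradient
# matmul; a second quantization (grouped along the other axis) feeds the
# activation-gradient matmul.
grad_w = x.T @ quantize_1bit(g_y, axis=0)    # weight gradient path
grad_x = quantize_1bit(g_y, axis=1) @ w.T    # activation gradient path
```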
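And a hedged sketch of why a 1-bit framework can yield real acceleration: once values live in {-1, +1}, a dot product reduces to XNOR plus popcount on packed bits, which is exactly the kind of operation low-bitwidth hardware executes cheaply. This toy version emulates it with NumPy bit-packing; real kernels would use hardware popcount instructions.

```python
import numpy as np

def pack_signs(v):
    # Pack a {-1, +1} vector into bits (1 for +1, 0 for -1).
    return np.packbits((v > 0).astype(np.uint8))

def binary_dot(a, b):
    # Dot product of two {-1, +1} vectors via XNOR + popcount.
    n = len(a)
    agree = np.bitwise_not(np.bitwise_xor(pack_signs(a), pack_signs(b)))
    ones = int(np.unpackbits(agree)[:n].sum())   # positions where signs agree
    return 2 * ones - n                          # (#agree) - (#disagree)

rng = np.random.default_rng(0)
a = np.sign(rng.standard_normal(1024)).astype(np.int8)
b = np.sign(rng.standard_normal(1024)).astype(np.int8)
assert binary_dot(a, b) == int(a.astype(np.int32) @ b.astype(np.int32))
```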
Results 📊:
• Fine-tuning VGGNet-16 and ResNet-18 on multiple datasets:
Average accuracy improvement: ~6% compared to per-sample quantization
Maximum training speedup: 5.13× compared to full precision
• Average accuracy drop on visual classification: ~5% compared to 32-bit gradients
• Negligible accuracy loss (<1%) on Flowers and Pets datasets