1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit
First successful 1-bit Fully Quantized Training (FQT), with gradient pruning.
👨‍🔧 With it, training speedup can reach a maximum of 5.13× compared to full-precision training.
This should be massive progress for on-device training.
Regular quantization typically quantizes only weights and activations, while FQT quantizes weights, activations, and gradients.
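
To make the distinction concrete, here is a minimal numpy sketch of what it means to also quantize the gradients. The stochastic-rounding quantizer, the 4-bit default, and the per-tensor scaling are illustrative assumptions, not the paper's implementation (the paper's contribution is pushing this down to 1 bit):

```python
import numpy as np

def stochastic_quantize(x, bits=4):
    """Unbiased stochastic rounding onto a symmetric uniform grid (one scale per tensor)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-12
    y = x / scale
    lo = np.floor(y)
    y = lo + (np.random.rand(*x.shape) < (y - lo))   # round up with prob. = fractional part
    return np.clip(y, -qmax, qmax) * scale

A = np.random.randn(16, 8)    # activations of a toy linear layer
W = np.random.randn(8, 4)     # weights
G = np.random.randn(16, 4)    # gradient arriving from the layer above

# Regular quantized training: only the forward operands are quantized.
out = stochastic_quantize(A) @ stochastic_quantize(W)

# FQT: the backward matmuls also run on quantized tensors, gradients included.
Gq = stochastic_quantize(G)
grad_W = stochastic_quantize(A).T @ Gq          # weight gradient
grad_A = Gq @ stochastic_quantize(W).T          # gradient w.r.t. the activations
```
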
Problem 🔍:
The current research frontier is 4-bit FQT, but pushing the bitwidth lower remains challenging due to a lack of theoretical understanding and large quantization errors.
Key Insights from this Paper 💡:
• Gradient variance influences FQT convergence (see the sketch after this list)
• Adam is more suitable for low-bitwidth FQT than SGD
• Gradient heterogeneity can be leveraged to reduce variance
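
A standard way to see the first insight (textbook stochastic-quantization reasoning, not the paper's exact analysis): an unbiased stochastic-rounding quantizer injects variance that grows quickly as bits are removed, and that variance enters SGD-style convergence bounds additively.

```latex
% An unbiased stochastic-rounding quantizer Q with grid step \Delta satisfies
\[
  \mathbb{E}\,[Q(x)] = x,
  \qquad
  \operatorname{Var}[Q(x)] \le \frac{\Delta^2}{4},
\]
% and a b-bit grid covering a range R has step \Delta = R/(2^b - 1), so each bit removed
% roughly doubles \Delta and quadruples the injected variance. That variance adds to the
% sampling noise \sigma^2 in a standard SGD-type bound,
\[
  \min_{t \le T} \mathbb{E}\,\lVert \nabla F(\theta_t) \rVert^2
  \;\lesssim\;
  \frac{F(\theta_0) - F^\star}{\eta T} + \eta L \left( \sigma^2 + \sigma_q^2 \right),
\]
% so convergence hinges on keeping the gradient-quantization variance \sigma_q^2 small,
% and plausibly this is why an optimizer that normalizes per-coordinate gradient statistics
% (Adam) copes with the extra noise better than plain SGD.
```
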
Solution in this Paper 🛠️:
• Activation Gradient Pruning (AGP):
Discards less informative gradient groups
Improves numerical precision of the remaining groups
Reduces gradient variance (see the backward-pass sketch after this list)
• Sample Channel joint Quantization (SCQ):
Applies different quantization strategies to weight gradients and activation gradients (also covered in the sketch after this list)
Ensures efficient implementation on low-bitwidth hardware
• Framework for 1-bit FQT deployment:
Implements forward and backward propagation using binary operations (see the XNOR/popcount sketch below)
Achieves practical acceleration on low-bitwidth hardware
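
Below is a rough sketch of how AGP and SCQ could fit together in a low-bit backward pass for a single linear layer. The per-sample grouping, keep ratio, sign-based 1-bit quantizer, and function names are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def quantize_along(x, bits, axis):
    """Quantize with one scale per slice along `axis` (grouping choice is illustrative).
    bits == 1 uses sign binarization with a mean-|x| scale; otherwise uniform rounding."""
    if bits == 1:
        scale = np.mean(np.abs(x), axis=axis, keepdims=True) + 1e-12
        return np.where(x >= 0, 1.0, -1.0), scale
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax + 1e-12
    return np.clip(np.rint(x / scale), -qmax, qmax), scale

def backward_agp_scq(G_out, A, W, keep_ratio=0.5, bits=1):
    """Low-bit backward pass for out = A @ W with AGP-style pruning and
    axis-dependent (sample/channel) quantization. Hypothetical helper, not the paper's code."""
    B = G_out.shape[0]

    # AGP: treat each sample's gradient row as a group and keep only the largest-norm groups.
    # The dropped (less informative) groups are skipped entirely, so the low-bit compute and
    # quantization budget go to the remaining, informative groups.
    k = max(1, int(B * keep_ratio))
    keep = np.argsort(np.linalg.norm(G_out, axis=1))[-k:]
    G_kept, A_kept = G_out[keep], A[keep]

    # SCQ: pick the quantization axis so every scale is constant along the summation axis
    # of its matmul and can be factored out afterwards.
    # Weight gradient dW = A^T @ G (sum over samples) -> per-channel scales.
    Aq, a_s = quantize_along(A_kept, bits, axis=0)     # a_s: (1, D_in)
    Gq, g_s = quantize_along(G_kept, bits, axis=0)     # g_s: (1, D_out)
    grad_W = (Aq.T @ Gq) * (a_s.T @ g_s)               # low-bit matmul, then rescale

    # Activation gradient dA = G @ W^T (sum over output channels)
    # -> per-sample scale on G, per-input-channel scale on W.
    Gq2, g_s2 = quantize_along(G_kept, bits, axis=1)   # g_s2: (k, 1)
    Wq, w_s = quantize_along(W, bits, axis=1)          # w_s: (D_in, 1)
    grad_A = (Gq2 @ Wq.T) * (g_s2 @ w_s.T)             # gradient only for the kept samples

    return grad_W, grad_A, keep

# Toy usage
A, W = np.random.randn(16, 8), np.random.randn(8, 4)
G = np.random.randn(16, 4)
gW, gA, kept = backward_agp_scq(G, A, W)
```

The design point the sketch illustrates for SCQ: because each scale is constant along the summation axis of its matmul, it factors out exactly, so the inner products themselves can stay at 1 bit.
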
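On "binary operations": the usual trick from the binary-network literature (presumably what is meant here, though the paper's actual kernels may differ) is that a dot product of two ±1 vectors reduces to XOR/XNOR plus popcount, which is what makes 1-bit matmuls fast on low-bitwidth hardware:

```python
import numpy as np

def pack_signs(x):
    """Encode the signs of a ±1 vector as a bitmask (bit i set when x[i] < 0)."""
    bits = 0
    for i, v in enumerate(x):
        if v < 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1,+1}^n vectors from their sign bitmasks:
    agreements minus disagreements = n - 2 * popcount(a XOR b)."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")

# Check the identity on random sign vectors.
a = np.sign(np.random.randn(64)); a[a == 0] = 1
b = np.sign(np.random.randn(64)); b[b == 0] = 1
assert binary_dot(pack_signs(a), pack_signs(b), 64) == int(a @ b)
```
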
Results 📊:
• Fine-tuning VGGNet-16 and ResNet-18 on multiple datasets:
Average accuracy improvement: ~6% compared to per-sample quantization
Maximum training speedup: 5.13× compared to full-precision training
• Average accuracy drop on visual classification: ~5% compared to 32-bit gradients
• Negligible accuracy loss (<1%) on the Flowers and Pets datasets


