Making transformers fit on your phone: Binary weights meet intelligent early stopping.
BEExformer introduces a transformer architecture that combines binarization with early-exit mechanisms, reducing model size by 18.44x while maintaining performance comparable to full-precision models. It enables efficient deployment on resource-constrained devices through 1-bit weight quantization and dynamic, input-dependent computation paths.
-----
https://arxiv.org/abs/2412.05225
🎯 Original Problem:
LLMs are hard to deploy on resource-constrained devices because of their enormous size and compute requirements. Existing binarization schemes struggle to propagate gradients through the non-differentiable sign function, and existing early-exit mechanisms rely on hand-tuned absolute confidence thresholds.
-----
🔧 Solution in this Paper:
→ BEExformer implements a differentiable second-order approximation to the impulse function for binarization, enabling gradient computation for both the sign and magnitude of weights (a minimal sketch follows this list).
→ The architecture uses an entropy-based early-exit criterion that monitors the fractional reduction in entropy between successive transformer blocks (see the exit sketch below).
→ A binarized Selective Learn-Forget Network (SLFN) replaces the traditional feed-forward layers to enhance context understanding (a hypothetical gated-unit sketch is shown below).
→ The model is trained from scratch, without knowledge distillation from a full-precision LLM.
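
Here is a minimal PyTorch-style sketch of magnitude-aware weight binarization with a smooth surrogate gradient. The per-tensor scaling by mean |w| and the specific piecewise-quadratic surrogate are assumptions for illustration; the paper's exact second-order approximation may differ, and the class names are not from the paper.

```python
import torch

class BinarizeWithSurrogate(torch.autograd.Function):
    """Forward: sign(w) scaled by mean |w| (magnitude-aware binarization).
    Backward: replaces the true gradient of sign (an impulse at 0) with a
    smooth piecewise-quadratic bump, so gradients reach both sign and magnitude."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        alpha = w.abs().mean()          # per-tensor scaling factor (assumption)
        return alpha * torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Quadratic surrogate for the impulse: peaks at w = 0, vanishes for |w| >= 1.
        surrogate = torch.clamp(1.0 - w.abs(), min=0.0) ** 2
        return grad_out * surrogate


class BinaryLinear(torch.nn.Module):
    """Linear layer whose weights are binarized on the fly in the forward pass
    (bias omitted for brevity)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w_bin = BinarizeWithSurrogate.apply(self.weight)
        return torch.nn.functional.linear(x, w_bin)
```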
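
The early-exit criterion can be sketched as follows, assuming a classifier head is applied after each transformer block and that inference stops once the fractional drop in output entropy between consecutive blocks falls below a small tolerance. `blocks`, `classifier`, and `rel_tol` are illustrative names, not the paper's API, and the tolerance value is a placeholder.

```python
import torch
import torch.nn.functional as F

def entropy(logits):
    """Shannon entropy of the softmax distribution, averaged over the batch."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()

@torch.no_grad()
def forward_with_early_exit(blocks, classifier, x, rel_tol=0.05):
    """Run transformer blocks sequentially; exit once the fractional reduction
    in prediction entropy between consecutive blocks falls below rel_tol."""
    prev_h, logits = None, None
    for i, block in enumerate(blocks):
        x = block(x)
        logits = classifier(x)
        h = entropy(logits).item()
        if prev_h is not None and prev_h > 0:
            frac_reduction = (prev_h - h) / prev_h
            if frac_reduction < rel_tol:   # entropy has plateaued -> stop early
                return logits, i + 1       # number of blocks actually used
        prev_h = h
    return logits, len(blocks)
```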
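
The SLFN's internal structure is not detailed here, so the block below is only a hypothetical reading of the "learn-forget" name: a gated feed-forward unit built from the binarized linear layer above, with a forget gate weighting the incoming representation and a learn gate admitting new candidate features. Treat the gate layout as an assumption, not the published architecture.

```python
import torch

class BinarizedSLFNSketch(torch.nn.Module):
    """Hypothetical stand-in for the Selective Learn-Forget Network,
    using the BinaryLinear sketch defined above."""

    def __init__(self, d_model):
        super().__init__()
        self.candidate = BinaryLinear(d_model, d_model)
        self.learn_gate = BinaryLinear(d_model, d_model)
        self.forget_gate = BinaryLinear(d_model, d_model)

    def forward(self, x):
        c = torch.tanh(self.candidate(x))            # candidate features
        learn = torch.sigmoid(self.learn_gate(x))    # how much new information to admit
        forget = torch.sigmoid(self.forget_gate(x))  # how much of the input to retain
        return forget * x + learn * c
```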
-----
💡 Key Insights:
→ Binarization with magnitude-aware gradients maintains model accuracy while reducing size
→ Entropy-based exits eliminate the need for absolute confidence thresholds and address overthinking, where later blocks degrade a prediction that has already stabilized
→ Selective learning improves long-range dependency capture
-----
📊 Results:
→ 18.44x reduction in model size through binarization
→ 54.85% reduction in FLOPs during inference
→ 5.98% accuracy improvement by resolving overthinking
→ Comparable performance to full-precision models while being 46x smaller