Post-training pruning that makes LLMs leaner without compromising their intelligence.
Statistical Calibrated Activation Pruning (SCAP) achieves higher sparsity in LLMs by calibrating input activations of fully-connected layers without extensive retraining or architecture changes.
-----
https://arxiv.org/abs/2412.07174v1
🎯 Original Problem:
Existing activation-sparsification methods depend on ReLU activations or prune post-activation outputs, which in modern SiLU/GELU-based LLMs demands extensive retraining and architectural modifications. This limits widespread deployment and scalability.
-----
🔧 Solution in this Paper:
→ SCAP introduces a post-training activation pruning framework that generalizes sparsification across different Transformer architectures
→ The framework prunes the input activations of Fully-Connected (FC) layers, a pruning point present in every Transformer variant regardless of activation function
→ It adds a Mode-Centering step that pre-calibrates activation distributions, shifting each distribution's mode toward zero to maximize post-training sparsity (see the sketch after this list)
→ SCAP requires only minimal calibration data and can be executed directly on deployment devices
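A minimal PyTorch sketch of the idea, not the authors' reference implementation: the helper names, the histogram-based mode estimate, and the per-token quantile thresholding are illustrative assumptions. It shows an FC input being re-centered on a calibrated mode, pruned by magnitude, and the shift folded back into the bias.

```python
import torch
import torch.nn.functional as F


def estimate_mode(x: torch.Tensor, bins: int = 2048) -> torch.Tensor:
    """Rough mode estimate of a calibration activation tensor via a histogram."""
    x = x.float().flatten()
    hist = torch.histc(x, bins=bins, min=x.min().item(), max=x.max().item())
    bin_width = (x.max() - x.min()) / bins
    return x.min() + (hist.argmax().float() + 0.5) * bin_width


def scap_fc_forward(x, weight, bias, mode_shift, sparsity=0.5):
    """Linear-layer forward pass with mode-centered input-activation pruning."""
    # 1. Center the input distribution on its calibrated mode.
    x_centered = x - mode_shift
    # 2. Zero the smallest-magnitude activations up to the target sparsity
    #    (per-token quantile thresholding is an illustrative choice).
    thresh = torch.quantile(x_centered.abs(), sparsity, dim=-1, keepdim=True)
    x_sparse = torch.where(x_centered.abs() >= thresh, x_centered,
                           torch.zeros_like(x_centered))
    # 3. Fold the shift into the bias so the output stays (approximately) intact:
    #    W x = W (x - m) + W m, and W m is a constant absorbed by the bias.
    folded_bias = bias + F.linear(mode_shift, weight)
    return F.linear(x_sparse, weight, folded_bias)


# Calibration: estimate the mode once from a small batch of activations.
calib_acts = torch.randn(256, 4096) * 0.5 + 0.7      # toy, non-zero-centered
mode_shift = estimate_mode(calib_acts) * torch.ones(4096)

# Deployment: prune ~50% of each token's FC input activations at inference time.
weight = torch.randn(11008, 4096) * 0.02
bias = torch.zeros(11008)
x = torch.randn(4, 4096) * 0.5 + 0.7
y = scap_fc_forward(x, weight, bias, mode_shift, sparsity=0.5)
print(y.shape)                                        # torch.Size([4, 11008])
```

Because the calibration reduces to collecting simple activation statistics per FC layer, it needs only a small calibration set and can run directly on the deployment device.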
-----
💡 Key Insights:
→ Input activation pruning is more effective than post-activation pruning
→ Mode-Centering significantly improves sparsity for activation distributions that are not centered at zero (see the toy check after this list)
→ The method generalizes well across different model architectures including Transformer Decoders, MoE, and Vision Transformers
→ Post-training optimization offers better cost-efficiency compared to retraining approaches
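A toy numerical check of the mode-centering insight, using synthetic Gaussian activations rather than data from the paper: the same magnitude threshold prunes roughly three times more activations once the distribution is re-centered on its mode.

```python
import torch

torch.manual_seed(0)
acts = torch.randn(1_000_000) * 0.5 - 0.8   # Gaussian activations, mode near -0.8
threshold = 0.2                              # prune entries with |value| < threshold

raw = (acts.abs() < threshold).float().mean().item()
# For a symmetric distribution the median coincides with the mode, so a
# median shift stands in for mode-centering calibration in this toy case.
centered = ((acts - acts.median()).abs() < threshold).float().mean().item()

print(f"prunable without centering:   {raw:.1%}")       # ≈ 9%
print(f"prunable with mode-centering: {centered:.1%}")  # ≈ 31%
```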
-----
📊 Results:
→ Achieves 48.5% FFN sparsity vs. CATS' 33.3% at matched model quality
→ Delivers 1.5× additional speedup in decoding compared to CATS
→ Shows only a 1.5% quality drop relative to the unpruned baseline
→ Successfully applies to pre-quantized models with minimal overhead