
"Post-Training Statistical Calibration for Higher Activation Sparsity"

The podcast on this paper is generated with Google's Illuminate.

Post-training pruning that makes LLMs leaner without compromising their intelligence.

Statistical Calibrated Activation Pruning (SCAP) achieves higher sparsity in LLMs by calibrating input activations of fully-connected layers without extensive retraining or architecture changes.

-----

https://arxiv.org/abs/2412.07174v1

🎯 Original Problem:

Existing activation-sparsification methods for LLMs rely heavily on ReLU activations or post-activation pruning, which require extensive retraining and architectural modifications. This limits widespread deployment and scalability.

-----

🔧 Solution in this Paper:

→ SCAP introduces a post-training activation pruning framework that generalizes sparsification across different Transformer architectures

→ The framework targets input activations of Fully-Connected layers for universal pruning implementation

→ It features a Mode-Centering technique that pre-calibrates activation distributions to maximize post-training sparsity (see the sketch after this list)

→ SCAP requires only minimal calibration data and can be executed directly on deployment devices
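For illustration, here is a minimal sketch of the idea in PyTorch: input activations to a fully-connected layer are shifted by their calibrated mode, small-magnitude entries are zeroed, and the constant shift is folded into the layer bias. All names (estimate_mode, centered_sparse_linear, tau) and the per-tensor scalar mode are assumptions for this sketch, not the paper's reference implementation.

    import torch
    import torch.nn.functional as F

    def estimate_mode(calib_acts: torch.Tensor, bins: int = 2048) -> torch.Tensor:
        """Estimate the activation mode from a small calibration sample via a histogram."""
        hist, edges = torch.histogram(calib_acts.flatten().float(), bins=bins)
        i = int(hist.argmax())
        return (edges[i] + edges[i + 1]) / 2  # center of the most populated bin

    def centered_sparse_linear(x, weight, bias, mode, tau):
        """Compute W @ x + b with the input mode-centered and small entries pruned.

        W @ mode is a constant, so the shift folds into the bias; only the sparse,
        mode-centered residual is left for the matmul at inference time.
        """
        centered = x - mode
        centered = centered * (centered.abs() > tau)      # prune near-mode activations to zero
        folded_bias = bias + weight.sum(dim=1) * mode     # W @ (mode * 1) = mode * row-sums of W
        return F.linear(centered, weight, folded_bias)

    # Calibrate once on a small activation sample, then prune at inference.
    calib = torch.randn(4096, 512) + 0.7                  # toy non-zero-centered activations
    mode = estimate_mode(calib)
    W, b = torch.randn(256, 512) * 0.02, torch.zeros(256)
    x = torch.randn(8, 512) + 0.7
    y = centered_sparse_linear(x, W, b, mode, tau=0.5)

Because the shift is folded into the bias, only the pruned residual reaches the matmul, which is where a decoding speedup would come from under a sparsity-aware kernel.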

-----

💡 Key Insights:

→ Input activation pruning is more effective than post-activation pruning

→ Mode-Centering significantly improves sparsity for non-zero-centered distributions (a toy illustration follows this list)

→ The method generalizes well across different model architectures including Transformer Decoders, MoE, and Vision Transformers

→ Post-training optimization offers better cost-efficiency compared to retraining approaches
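As a toy check of that insight (an assumed setup, not taken from the paper): with a fixed magnitude threshold, a distribution whose mode sits away from zero yields very few prunable near-zero entries, while the same distribution re-centered on its mode yields many.

    import torch

    acts = torch.randn(100_000) * 0.3 + 0.8   # toy activations with mode near 0.8, not 0
    tau = 0.3                                 # magnitude threshold for pruning

    sparsity_raw = (acts.abs() <= tau).float().mean().item()
    # The median stands in for the mode here; for this symmetric toy distribution they coincide.
    sparsity_centered = ((acts - acts.median()).abs() <= tau).float().mean().item()

    print(f"sparsity without centering: {sparsity_raw:.2%}")      # roughly 5%
    print(f"sparsity with centering:    {sparsity_centered:.2%}")  # roughly 68%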

-----

📊 Results:

→ Achieves 48.5% FFN sparsity vs CATS' 33.3% at iso-model quality

→ Delivers 1.5× additional speedup in decoding compared to CATS

→ Demonstrates only -1.5% quality deviation from baseline

→ Successfully applies to pre-quantized models with minimal overhead
