A single parameter can make or break a billion-parameter LLM: one tiny weight decides whether the model generates meaningful text or gibberish.
Meet the super weight: the lone warrior keeping your AI from speaking nonsense
https://arxiv.org/abs/2411.07191
🎯 Original Problem:
LLMs contain billions of parameters, and prior research has shown that a small fraction of outlier parameters (around 0.01%) is crucial for model quality. Even at 0.01%, however, that is hundreds of thousands of parameters needing special handling during compression and optimization.
-----
🔧 Solution in this Paper:
→ Discovered that removing a single parameter, dubbed the "super weight", can completely destroy an LLM's ability to generate text
→ Developed a data-free method to identify super weights using just one forward pass through the model (see the sketch after this list)
→ Found that super weights create "super activations" - exceptionally large activation values that persist across the model's layers
→ Leveraged this discovery to improve model compression by preserving super weights and clipping other outliers (a compression sketch follows the Results section)
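The detection idea fits in a short script. Below is a minimal PyTorch sketch of the one-pass, data-free search for super-weight candidates, assuming a Llama-style Hugging Face model; the model name, the `mlp.down_proj` module path, and the spike threshold are illustrative assumptions, not the paper's exact settings:

```python
# Minimal sketch: flag down_proj layers whose output contains an
# exceptionally large activation spike, and read off the weight at the
# intersection of the spiking input and output channels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # hypothetical model choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
model.eval()

SPIKE_THRESHOLD = 50.0  # assumed cutoff; in practice, inspect the spikes directly
candidates = []

def make_hook(layer_idx, weight):
    def hook(module, inputs, output):
        x, y = inputs[0], output  # (batch, seq, d_ff), (batch, seq, d_model)
        if y.abs().max() > SPIKE_THRESHOLD:
            col = x.abs().flatten(0, 1).max(dim=0).values.argmax()  # input-channel spike
            row = y.abs().flatten(0, 1).max(dim=0).values.argmax()  # output-channel spike
            candidates.append((layer_idx, int(row), int(col), float(weight[row, col])))
    return hook

handles = [
    layer.mlp.down_proj.register_forward_hook(
        make_hook(idx, layer.mlp.down_proj.weight)
    )
    for idx, layer in enumerate(model.model.layers)
]

# One forward pass on any short prompt suffices; no calibration data needed.
with torch.no_grad():
    model(**tok("The quick brown fox", return_tensors="pt"))

for h in handles:
    h.remove()
print(candidates)  # [(layer, row, col, value), ...] super-weight candidates
```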
-----
💡 Key Insights:
→ Most models have only 1-3 super weights, with at most six found in any model studied
→ Super weights consistently sit in the down-projection weights of early layers
→ They persist through fine-tuning: instruction-tuned models have super weights in the same positions as their base models
→ Super weights suppress stopword probabilities: removing them increases stopword probabilities by 2-5x (demonstrated in the sketch after this list)
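To make the stopword effect concrete, here is a minimal ablation sketch that zeroes one super weight and measures how a stopword's next-token probability shifts. It reuses `model` and `tok` from the detection sketch above, and the coordinates are placeholders to be filled in from that pass:

```python
# Minimal ablation sketch: zero the super weight, measure the stopword
# probability shift, then restore the original value.
import torch

layer_idx, row, col = 2, 3968, 7003  # hypothetical super-weight coordinates

w = model.model.layers[layer_idx].mlp.down_proj.weight
original = w[row, col].item()

def next_token_prob(prompt, token_str):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    token_id = tok.encode(token_str, add_special_tokens=False)[0]
    return logits.softmax(dim=-1)[token_id].item()

p_before = next_token_prob("The capital of France is", " the")
with torch.no_grad():
    w[row, col] = 0.0       # prune the single super weight
p_after = next_token_prob("The capital of France is", " the")
with torch.no_grad():
    w[row, col] = original  # restore it afterwards

print(f"stopword prob: {p_before:.4f} -> {p_after:.4f}")  # expect a 2-5x jump
```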
-----
📊 Results:
→ Pruning a single super weight increases perplexity by three orders of magnitude
→ It also drops zero-shot accuracy from 70.11% to 35.14%, near random-guessing levels
→ Their compression method achieves 75-82% of SmoothQuant's quality improvement without requiring any calibration data (a sketch of the recipe follows)
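A minimal sketch of that recipe, assuming simple symmetric round-to-nearest quantization: hold the super weight out, clip the remaining outliers (the percentile here is an assumed stand-in for the paper's tuned threshold), quantize, then restore the super weight in full precision.

```python
# Minimal sketch: outlier-clipped round-to-nearest quantization that
# preserves the super weight exactly.
import torch

@torch.no_grad()
def quantize_preserving_super_weight(w, coords, bits=4, clip_pct=99.9):
    row, col = coords
    sw = w[row, col].item()                      # hold out the super weight
    flat = w.abs().float().flatten()
    k = int(clip_pct / 100 * flat.numel())
    clip = float(flat.kthvalue(k).values)        # clip threshold from percentile
    w_clipped = w.clamp(-clip, clip)
    # Clipping the outliers shrinks the scale, so ordinary weights
    # keep more precision after rounding.
    qmax = 2 ** (bits - 1) - 1
    scale = w_clipped.abs().max() / qmax
    q = (w_clipped / scale).round().clamp(-qmax - 1, qmax)
    w_deq = q * scale
    w_deq[row, col] = sw                         # restore the super weight exactly
    return w_deq

# Usage on the hypothetical coordinates found earlier:
# layer = model.model.layers[layer_idx].mlp.down_proj
# layer.weight.data = quantize_preserving_super_weight(layer.weight.data, (row, col))
```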