A single parameter can make or break a billion-parameter LLM: one tiny weight decides whether the model generates meaningful text or gibberish.
Meet the super weight: the lone warrior keeping your AI from speaking nonsense
https://arxiv.org/abs/2411.07191
🎯 Original Problem:
LLMs contain billions of parameters, and prior research has shown that a small fraction of outlier parameters (around 0.01%) is crucial for model quality. Even at 0.01%, however, that is hundreds of thousands of parameters needing special handling during compression and optimization.
-----
🔧 Solution in this Paper:
→ Discovered that removing a single parameter, dubbed the "super weight", can completely destroy an LLM's ability to generate text
→ Developed a data-free method to identify super weights using just one forward pass through the model (see the sketch after this list)
→ Found that super weights create "super activations" - exceptionally large activation values that persist across the model's layers
→ Leveraged this discovery to improve model compression by preserving super weights and clipping other outliers (a compression sketch follows the Results section)
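The detection idea fits in a short script. Below is a minimal PyTorch sketch of the one-pass, data-free search for super-weight candidates, assuming a Llama-style Hugging Face model; the model name, the `mlp.down_proj` module path, and the spike threshold are illustrative assumptions, not the paper's exact settings:

```python
# Minimal sketch: flag down_proj layers whose output contains an
# exceptionally large activation spike, and read off the weight at the
# intersection of the spiking input and output channels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # hypothetical model choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
model.eval()

SPIKE_THRESHOLD = 50.0  # assumed cutoff; in practice, inspect the spikes directly
candidates = []

def make_hook(layer_idx, weight):
    def hook(module, inputs, output):
        x, y = inputs[0], output  # (batch, seq, d_ff), (batch, seq, d_model)
        if y.abs().max() > SPIKE_THRESHOLD:
            col = x.abs().flatten(0, 1).max(dim=0).values.argmax()  # input-channel spike
            row = y.abs().flatten(0, 1).max(dim=0).values.argmax()  # output-channel spike
            candidates.append((layer_idx, int(row), int(col), float(weight[row, col])))
    return hook

handles = [
    layer.mlp.down_proj.register_forward_hook(
        make_hook(idx, layer.mlp.down_proj.weight)
    )
    for idx, layer in enumerate(model.model.layers)
]

# One forward pass on any short prompt suffices; no calibration data needed.
with torch.no_grad():
    model(**tok("The quick brown fox", return_tensors="pt"))

for h in handles:
    h.remove()
print(candidates)  # [(layer, row, col, value), ...] super-weight candidates
```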
-----
💡 Key Insights:
→ Most models have only 1-3 super weights, with at most six found in any model studied
→ Super weights consistently sit in the down-projection weights of early layers
→ They persist through fine-tuning: instruction-tuned models have super weights in the same positions as their base models
→ Super weights suppress stopword probabilities: removing them increases stopword probabilities by 2-5x (demonstrated in the sketch after this list)
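To make the stopword effect concrete, here is a minimal ablation sketch that zeroes one super weight and measures how a stopword's next-token probability shifts. It reuses `model` and `tok` from the detection sketch above, and the coordinates are placeholders to be filled in from that pass:

```python
# Minimal ablation sketch: zero the super weight, measure the stopword
# probability shift, then restore the original value.
import torch

layer_idx, row, col = 2, 3968, 7003  # hypothetical super-weight coordinates

w = model.model.layers[layer_idx].mlp.down_proj.weight
original = w[row, col].item()

def next_token_prob(prompt, token_str):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    token_id = tok.encode(token_str, add_special_tokens=False)[0]
    return logits.softmax(dim=-1)[token_id].item()

p_before = next_token_prob("The capital of France is", " the")
with torch.no_grad():
    w[row, col] = 0.0       # prune the single super weight
p_after = next_token_prob("The capital of France is", " the")
with torch.no_grad():
    w[row, col] = original  # restore it afterwards

print(f"stopword prob: {p_before:.4f} -> {p_after:.4f}")  # expect a 2-5x jump
```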
-----
📊 Results:
→ Pruning a single super weight increases perplexity by three orders of magnitude
→ It also drops zero-shot accuracy from 70.11% to 35.14%, near random-guessing levels
→ Their compression method achieves 75-82% of SmoothQuant's quality improvement without requiring any calibration data (a sketch of the recipe follows)
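A minimal sketch of that recipe, assuming simple symmetric round-to-nearest quantization: hold the super weight out, clip the remaining outliers (the percentile here is an assumed stand-in for the paper's tuned threshold), quantize, then restore the super weight in full precision.

```python
# Minimal sketch: outlier-clipped round-to-nearest quantization that
# preserves the super weight exactly.
import torch

@torch.no_grad()
def quantize_preserving_super_weight(w, coords, bits=4, clip_pct=99.9):
    row, col = coords
    sw = w[row, col].item()                      # hold out the super weight
    flat = w.abs().float().flatten()
    k = int(clip_pct / 100 * flat.numel())
    clip = float(flat.kthvalue(k).values)        # clip threshold from percentile
    w_clipped = w.clamp(-clip, clip)
    # Clipping the outliers shrinks the scale, so ordinary weights
    # keep more precision after rounding.
    qmax = 2 ** (bits - 1) - 1
    scale = w_clipped.abs().max() / qmax
    q = (w_clipped / scale).round().clamp(-qmax - 1, qmax)
    w_deq = q * scale
    w_deq[row, col] = sw                         # restore the super weight exactly
    return w_deq

# Usage on the hypothetical coordinates found earlier:
# layer = model.model.layers[layer_idx].mlp.down_proj
# layer.weight.data = quantize_preserving_super_weight(layer.weight.data, (row, col))
```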