A simple normalization swap makes your LLM's deeper layers actually do something useful.
This paper shows how to wake up those sleeping deep layers in your LLM.
Mix-LN combines Pre-Layer Normalization (Pre-LN) and Post-Layer Normalization (Post-LN) within a single LLM, fixing the underutilized deeper layers of current models.
https://arxiv.org/abs/2412.13795
🔍 Original Problem:
→ Current LLMs suffer from ineffective deeper layers that contribute minimally to model performance
→ Pre-LN, used in models like GPT and LLaMA, causes diminished gradients in deeper layers
→ Post-LN maintains larger gradients in deeper layers but suffers vanishing gradients in earlier layers (both block wirings are sketched below)
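To make the contrast concrete, here is a minimal PyTorch sketch of the two sublayer wirings: a Post-LN block (normalize after the residual add) next to a Pre-LN block (normalize before each sublayer). This illustrates the standard definitions, not the paper's code; module sizes and class names are placeholders.

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN (original Transformer): LayerNorm AFTER the residual add.
    Keeps gradient signal alive in deep layers, but early layers can see
    vanishing gradients during training."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.norm2(x + self.ffn(x))

class PreLNBlock(nn.Module):
    """Pre-LN (GPT/LLaMA-style): LayerNorm BEFORE each sublayer.
    Stable to train, but gradient norms shrink toward the deeper layers,
    which is what leaves them underutilized."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))
```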
-----
🛠️ Solution in this Paper:
→ Mix-LN applies Post-LN to the first 25% of layers and Pre-LN to the remaining layers
→ This hybrid approach yields more uniform gradient norms across the network's depth
→ The technique promotes greater representation diversity across layers
→ The implementation uses a hyperparameter α to control the fraction of layers that use Post-LN (see the sketch after this list)
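A minimal sketch of how that layer assignment could look, assuming the PostLNBlock/PreLNBlock classes sketched above (the function name and signature are illustrative, not from the paper's code):

```python
import torch.nn as nn

def build_mix_ln_stack(num_layers, make_post_ln_block, make_pre_ln_block, alpha=0.25):
    """Mix-LN layer assignment: the first floor(alpha * num_layers) blocks use
    Post-LN, the remaining blocks use Pre-LN. alpha is the paper's Post-LN ratio."""
    num_post = int(alpha * num_layers)  # e.g. 8 of 32 layers at alpha = 0.25
    blocks = [make_post_ln_block() if i < num_post else make_pre_ln_block()
              for i in range(num_layers)]
    return nn.Sequential(*blocks)

# Usage with the block sketches above (a LLaMA-7B-like depth of 32 layers
# gives 8 Post-LN + 24 Pre-LN blocks):
# model = build_mix_ln_stack(32,
#                            lambda: PostLNBlock(d_model=4096, n_heads=32),
#                            lambda: PreLNBlock(d_model=4096, n_heads=32))
```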
-----
💡 Key Insights:
→ Pre-LN's widespread use is the root cause of deeper layer inefficiency
→ The optimal Post-LN ratio α is 0.25 for most model sizes
→ Mix-LN maintains balanced gradient norms throughout the network (a quick per-layer check is sketched below)
→ The solution requires minimal computational overhead
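A quick way to check that claim in practice is to inspect per-block gradient norms after a backward pass; this is an illustrative diagnostic, not part of the paper's method:

```python
import torch.nn as nn

def per_block_grad_norms(stack: nn.Sequential):
    """After loss.backward(), return the gradient L2 norm of each block.
    A roughly flat profile across depth is the balanced flow Mix-LN aims for;
    a steep decay in the deeper half is the Pre-LN symptom."""
    norms = []
    for i, block in enumerate(stack):
        sq = sum((p.grad.norm() ** 2).item()
                 for p in block.parameters() if p.grad is not None)
        norms.append((i, sq ** 0.5))
    return norms

# Usage (illustrative): after a forward/backward pass on the Mix-LN stack,
#   for idx, g in per_block_grad_norms(model):
#       print(f"layer {idx:2d}  grad norm {g:.4f}")
```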
-----
📊 Results:
→ Consistently outperforms both Pre-LN and Post-LN across model sizes (70M to 7B parameters)
→ Achieves 1.65 lower perplexity in LLaMA-71M compared to Pre-LN
→ Shows 17.31% improvement on BoolQ for LLaMA-250M
→ Demonstrates effectiveness in Vision Transformer tasks
-----
Are you into AI and LLMs❓ Join me on X/Twitter with 50K+ others to stay on the bleeding edge of AI every day.
𝕏/🐦 https://x.com/rohanpaul_ai