Ever wondered why your LLM suddenly got worse at few-shot tasks? Here's why!
This paper mathematically explains why LLMs sometimes forget their in-context learning abilities
📚 https://arxiv.org/abs/2410.23042
🤔 Original Problem:
LLMs exhibit in-context learning (ICL), but this ability can diminish with further training. The research community lacks a theoretical understanding of why and when ICL emerges or disappears.
-----
🔧 Solution in this Paper:
→ Introduces a bi-level model with a gating mechanism that chooses between an in-weight learning (IWL) predictor and an ICL predictor
→ A gating parameter α is trained to weight ICL against IWL according to their expected performance (a minimal sketch follows this list)
→ Provides mathematical framework showing how simple distributional properties lead to emergence and disappearance of ICL
→ Demonstrates that ICL appears with diverse but rare samples, while IWL dominates for frequently occurring patterns
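To make the gating idea concrete, here's a minimal sketch of a gated bi-level predictor. It is illustrative only: the sigmoid parameterization of α, the nearest-neighbor ICL predictor, and the linear IWL predictor are assumptions made for this sketch, not the paper's exact construction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def icl_predict(context_xs, context_ys, x):
    # ICL-style predictor: reads the answer off the context alone,
    # here via a 1-nearest-neighbor lookup over in-context examples.
    i = np.argmin(np.linalg.norm(context_xs - x, axis=1))
    return context_ys[i]

def iwl_predict(w, x):
    # IWL-style predictor: a fixed function of the query stored in the
    # weights (a plain linear map in this sketch).
    return w @ x

def gated_predict(theta, w, context_xs, context_ys, x):
    alpha = sigmoid(theta)  # gate: alpha -> 1 favors ICL, alpha -> 0 favors IWL
    return alpha * icl_predict(context_xs, context_ys, x) \
        + (1 - alpha) * iwl_predict(w, x)

def update_gate(theta, w, context_xs, context_ys, x, y, lr=0.1):
    # One gradient step on squared error w.r.t. the gate parameter:
    # alpha drifts toward whichever predictor currently has lower error.
    alpha = sigmoid(theta)
    p_icl = icl_predict(context_xs, context_ys, x)
    p_iwl = iwl_predict(w, x)
    pred = alpha * p_icl + (1 - alpha) * p_iwl
    grad = 2 * (pred - y) * (p_icl - p_iwl) * alpha * (1 - alpha)
    return theta - lr * grad
```

The update rule pushes α toward whichever predictor is currently more accurate, mirroring the performance-driven selection the paper analyzes.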
-----
💡 Key Insights:
→ ICL emerges when data has diverse but rare samples that are predictable from context
→ IWL takes over once the model accumulates enough examples of previously rare patterns (see the toy simulation after this list)
→ The choice between ICL and IWL is driven by their relative performance on new data
→ Simple distributional properties can explain complex ICL behaviors
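A toy simulation can illustrate the rare-vs-frequent dynamic. Everything here is an assumed stand-in (Zipfian class frequencies, a flat ICL error, and a 1/(n+1) decay for IWL error as per-class examples accumulate), not the paper's actual bounds:

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes = 1000
ranks = np.arange(1, num_classes + 1)
probs = (1.0 / ranks) / np.sum(1.0 / ranks)  # Zipfian class frequencies

icl_error = 0.05  # assumed flat cost of predicting from context

for total in (1_000, 10_000, 100_000):
    counts = rng.multinomial(total, probs).astype(float)
    iwl_error = 1.0 / (counts + 1.0)   # assumed decay with per-class sample count
    uses_icl = icl_error < iwl_error   # gate picks the lower-error predictor
    print(f"after {total:>7} samples: "
          f"{uses_icl.mean():.0%} of classes still favor ICL")
```

Rare classes keep favoring ICL because the model never sees enough of them to memorize, while frequent classes flip to IWL as their counts grow.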
-----
📊 Results:
→ Shows close agreement between theoretical predictions and transformer behavior on simplified distributions
→ Demonstrates that fine-tuning a full large language model on varied natural-language prompts elicits similar ICL and IWL behavior
→ Provides mathematical error bounds for both the ICL and IWL predictors
→ Validates the theory through experiments on synthetic and Omniglot data (a hedged sketch of a typical sequence construction follows)
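For a concrete picture of this kind of experiment, here is a sketch of how ICL training sequences are often built from class-skewed data (the "bursty" construction common in synthetic/Omniglot ICL studies; the paper's exact pipeline and formats may differ):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_icl_sequence(features_by_class, probs, k=8):
    # One training sequence: k (example, label) context pairs followed
    # by a query whose class also appears in the context.
    num_classes = len(probs)
    query_class = rng.choice(num_classes, p=probs)
    # Half of the context shares the query's class; the rest are distractors.
    classes = [query_class] * (k // 2) + \
        list(rng.choice(num_classes, size=k - k // 2, p=probs))
    rng.shuffle(classes)

    def sample(c):
        pool = features_by_class[c]
        return pool[rng.integers(len(pool))]

    context = [(sample(c), c) for c in classes]
    return context, sample(query_class), query_class
```

Skewing probs makes a few classes common (driving IWL) and most classes rare (driving ICL), which is the distributional knob the theory says matters.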