A metric that actually measures what matters in long-context understanding
LongPPL, proposed in this paper, solves this by computing perplexity only over the tokens that actually matter for long contexts.
📚 https://arxiv.org/abs/2410.23771
🤖 Original Problem:
The traditional perplexity (PPL) metric fails to evaluate LLMs' long-context capabilities accurately: it averages over all tokens equally, while only a small fraction (<10%) of tokens are actually crucial for understanding long contexts.
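For reference, standard perplexity exponentiates the average negative log-likelihood over every token, so a handful of context-dependent tokens barely move it:

```
\mathrm{PPL}(x) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i})\right)
```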
-----
🔧 Solution in this Paper:
→ Introduces LongPPL: a metric that computes perplexity only over key tokens, identified by combining two signals (see the first sketch after this list):
- Log Probability Gain (LPG): how much a token's prediction improves when the full long context replaces a truncated short one
- Log Probability Value (LPV): the token's absolute log probability under the long context, i.e. how well it is predicted at all
→ Proposes LongCE loss: a training objective (see the second sketch below) that:
- Re-weights training tokens by their long-context importance
- Uses the model's own predictions to bootstrap long-context learning
- Alternates between estimating key tokens and optimizing for them
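A minimal sketch of the key-token selection behind LongPPL, assuming you already have per-token log-probabilities from the same model run with the full long context and with a truncated short context. The thresholds are illustrative assumptions, not the paper's exact values:

```python
import numpy as np

def longppl(logp_long, logp_short, lpg_threshold=2.0, lpv_threshold=-2.0):
    """Perplexity computed only over 'key' tokens, in the spirit of LongPPL.

    logp_long:  per-token log-probabilities given the full long context
    logp_short: per-token log-probabilities given a truncated short context
    (thresholds are illustrative assumptions, not the paper's values)
    """
    logp_long = np.asarray(logp_long, dtype=np.float64)
    logp_short = np.asarray(logp_short, dtype=np.float64)

    lpg = logp_long - logp_short   # Log Probability Gain: benefit of the long context
    lpv = logp_long                # Log Probability Value: confidence under the long context

    # Key tokens: the long context clearly helps AND the model predicts them well
    key = (lpg > lpg_threshold) & (lpv > lpv_threshold)
    if not key.any():
        return float("nan")        # no key tokens found in this sequence

    # Standard perplexity, but averaged only over the key tokens
    return float(np.exp(-logp_long[key].mean()))
```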
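And a rough sketch of the LongCE idea: cross-entropy whose per-token weights come from the model's own long-vs-short context gains, recomputed as training proceeds. The sigmoid weighting, the alpha parameter, and the function names are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def long_ce_loss(logits_long, logits_short, targets, alpha=1.0):
    """Cross-entropy re-weighted toward long-context-dependent tokens (LongCE-style).

    logits_long:  [batch, seq, vocab] logits given the full long context
    logits_short: [batch, seq, vocab] logits given a truncated short context
    targets:      [batch, seq] target token ids
    """
    logp_long = F.log_softmax(logits_long, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    logp_short = F.log_softmax(logits_short, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Importance weight from the model's own predictions: larger gain -> larger weight
    with torch.no_grad():
        gain = logp_long - logp_short
        weights = torch.sigmoid(alpha * gain)   # bounded per-token weights in (0, 1)

    # Weighted negative log-likelihood under the long context
    return -(weights * logp_long).sum() / weights.sum()
```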
-----
💡 Key Insights:
→ Only <10% of tokens in a text are truly important for long-context understanding
→ Traditional PPL fails because it dilutes the importance of these key tokens by averaging across all tokens
→ Combining LPG and LPV achieves 98.2% accuracy in identifying key tokens
-----
📊 Results:
→ LongPPL shows a -0.96 Pearson correlation with long-context benchmark performance
→ LongCE training yields up to a 22% accuracy gain on the LongEval benchmark
→ Consistent improvements across various model sizes and architectures