"What is Wrong with Perplexity for Long-context Language Modeling?"

The podcast on this paper is generated with Google's Illuminate.

A metric that actually measures what matters in long-context understanding

LongPPL, proposed in this paper, solves the perplexity problem by focusing only on the tokens that matter for long contexts.

📚 https://arxiv.org/abs/2410.23771

🤖 Original Problem:

The traditional perplexity (PPL) metric fails to evaluate LLMs' long-context capabilities accurately: it averages equally across all tokens, while only a small portion (<10%) of tokens is actually crucial for understanding long contexts.
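
For reference, a minimal sketch of standard PPL, the exponentiated mean negative log-likelihood; every token carries equal weight, which is the root of the problem (`token_logprobs` is assumed to be a list of per-token log probabilities from any LLM):

```python
import math

def perplexity(token_logprobs):
    """Standard perplexity: exponentiated mean negative log-likelihood.

    Every token contributes with equal weight, which is exactly why
    a small set of context-dependent key tokens gets diluted.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```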

-----

🔧 Solution in this Paper:

→ Introduces LongPPL: A novel metric that computes perplexity only over key tokens, identified through two signals (combined in the sketch after this list):

- Log Probability Gain (LPG): Measures token prediction improvement with long vs short context

- Log Probability Value (LPV): Measures absolute prediction accuracy under long context
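
A minimal sketch of how LPG and LPV could be combined into LongPPL, assuming per-token log probabilities have already been collected under both the full long context and a truncated short context; the threshold values here are illustrative placeholders, not the paper's tuned settings:

```python
import math

def long_ppl(logp_long, logp_short, lpg_thresh=2.0, lpv_thresh=-2.0):
    """Sketch of LongPPL: perplexity computed over key tokens only.

    logp_long[i]  : log P(token_i | full long context)
    logp_short[i] : log P(token_i | truncated short context)
    Thresholds are illustrative, not the paper's tuned values.
    """
    key_logps = []
    for lp_long, lp_short in zip(logp_long, logp_short):
        lpg = lp_long - lp_short   # Log Probability Gain
        lpv = lp_long              # Log Probability Value
        if lpg > lpg_thresh and lpv > lpv_thresh:
            key_logps.append(lp_long)
    if not key_logps:
        return float("nan")  # no key tokens identified
    return math.exp(-sum(key_logps) / len(key_logps))
```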

→ Proposes LongCE loss: A training strategy (see the sketch after this list) that:

- Re-weights training tokens based on their long-context importance

- Uses the model's own predictions to bootstrap long-context learning

- Alternates between re-estimating key tokens and optimizing on them
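
A minimal PyTorch sketch of the re-weighting idea, assuming logits under both the long and a truncated short context are available at each position; the exp-gain weighting and the cap are assumptions in the spirit of the paper, not its exact formulation:

```python
import torch

def long_ce_loss(logits_long, logits_short, targets, cap=5.0):
    """Sketch of a LongCE-style re-weighted cross-entropy.

    logits_long  : (seq, vocab) logits given the full long context
    logits_short : (seq, vocab) logits given a truncated short context
    targets      : (seq,) gold next-token ids
    """
    logp_long = torch.log_softmax(logits_long, dim=-1)
    logp_short = torch.log_softmax(logits_short, dim=-1)
    lp_long = logp_long.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    lp_short = logp_short.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Weight each token by its (detached) long-context gain, capped
    # so a few tokens cannot dominate the batch.
    weights = torch.exp(lp_long - lp_short).detach().clamp(max=cap)
    return -(weights * lp_long).sum() / weights.sum()
```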

-----

💡 Key Insights:

→ Only <10% of the tokens in a text are truly important for long-context understanding

→ Traditional PPL fails because averaging across all tokens dilutes the contribution of these key tokens (see the numeric sketch after this list)

→ Combining LPG and LPV achieves 98.2% accuracy in identifying key tokens
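
A toy numeric sketch of that dilution (all numbers hypothetical): a huge gap in key-token quality barely moves overall PPL.

```python
import math

# 1000 tokens, 5% are key tokens; the rest are easy filler.
key_frac = 0.05
easy_logp, key_logp_bad, key_logp_good = -1.0, -6.0, -2.0

def overall_ppl(key_logp):
    mean_logp = (1 - key_frac) * easy_logp + key_frac * key_logp
    return math.exp(-mean_logp)

# Key-token PPL improves from exp(6) ~ 403 to exp(2) ~ 7.4,
# yet overall PPL only moves from ~3.49 to ~2.86.
print(overall_ppl(key_logp_bad))   # ~3.49 : model fails on key tokens
print(overall_ppl(key_logp_good))  # ~2.86 : model succeeds on key tokens
```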

-----

📊 Results:

→ LongPPL shows a -0.96 Pearson correlation with long-context benchmark performance (lower LongPPL tracks better scores)

→ LongCE training achieves up to a 22% accuracy gain on the LongEval benchmark

→ Consistent improvements across various model sizes and architectures
