A Controlled Study on Long Context Extension and Generalization in LLMs
This study exposes trade-offs among long-context extension methods, highlighting the superiority of exact attention and the persistent difficulty of length extrapolation.
Original Problem 🔍:
It's challenging to compare long-context extension methods because prior studies differ in training data, model classes, and evaluation approaches.
Solution in this Paper 💡:
• Implements a controlled protocol for comparing extension methods (a minimal sketch follows this list)
• Uses consistent base models and extension data
• Standardizes evaluation across methods
• Considers both intrinsic metrics (perplexity, retrieval) and extrinsic tasks
• Evaluates both within the extension length and under extrapolation to longer contexts
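In spirit, the protocol is straightforward to script: hold the base model lineage and extension data fixed, then sweep each method over within-range and extrapolation lengths. Below is a minimal sketch assuming Hugging Face transformers; the `lab/...` checkpoint names, the document file, and the evaluation lengths are illustrative placeholders, not the paper's exact configuration.

```python
# Minimal sketch of a controlled perplexity evaluation across methods.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity_at_length(model, tokenizer, text, ctx_len):
    """Perplexity over the first ctx_len tokens of a long document."""
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :ctx_len]
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token NLL
    return torch.exp(loss).item()

text = open("long_document.txt").read()
# Same base model lineage and extension data for every method under test.
for name in ["pi-32k", "ntk-32k", "yarn-32k", "landmark-32k"]:  # placeholders
    model = AutoModelForCausalLM.from_pretrained(f"lab/{name}")
    tok = AutoTokenizer.from_pretrained(f"lab/{name}")
    for ctx in (8_192, 32_768, 65_536):  # within-range and extrapolation
        print(name, ctx, perplexity_at_length(model, tok, text, ctx))
```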
Key Insights from this Paper 💡:
• Perplexity strongly correlates with downstream task performance for exact fine-tuned methods (a toy check follows this list)
• Approximate attention methods generally underperform across benchmarks
• Continual fine-tuning with exact attention works well within extended context length
• Extrapolation to longer lengths remains challenging
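To make the first insight concrete, here is a toy rank-correlation check between perplexity and downstream scores. All numbers are invented placeholders for illustration only, not results from the paper; `scipy` is assumed available.

```python
# Toy check: does lower perplexity track better downstream scores?
from scipy.stats import spearmanr

# (method, validation perplexity, aggregate downstream score) -- made-up values
records = [
    ("exact-ft-a",    5.2, 41.0),
    ("exact-ft-b",    5.5, 39.5),
    ("approx-attn-a", 7.1, 31.2),
    ("approx-attn-b", 8.0, 28.9),
]
ppl = [r[1] for r in records]
score = [r[2] for r in records]
rho, p = spearmanr(ppl, score)
# A strongly negative rho means lower perplexity tracks better task scores.
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```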
Results 📊:
• Dynamic NTK performs best among exact attention methods (see the RoPE-scaling sketch after this list)
• Exact fine-tuned methods outperform approximate attention and frozen methods
• NTK-32K achieves 0.96 faithfulness, 0.96 answer relevancy, and 1.0 context recall
• Improved performance on LongBench, RULER, and retrieval tasks
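For context on the Dynamic NTK result, the sketch below shows the base-rescaling idea behind Dynamic NTK, following the formula used in common open-source implementations (e.g., the dynamic-NTK rotary embedding in Hugging Face transformers). Parameter values are illustrative, not the paper's settings.

```python
# Sketch of Dynamic NTK scaling for RoPE inverse frequencies.
import torch

def dynamic_ntk_inv_freq(dim: int, seq_len: int, max_trained_len: int = 4096,
                         base: float = 10000.0, factor: float = 8.0) -> torch.Tensor:
    """Return RoPE inverse frequencies, rescaling the base when the
    current sequence exceeds the trained context length so positional
    angles stay closer to the training distribution."""
    if seq_len > max_trained_len:
        base = base * (factor * seq_len / max_trained_len
                       - (factor - 1)) ** (dim / (dim - 2))
    return 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)

print(dynamic_ntk_inv_freq(dim=128, seq_len=32_768)[:4])  # flatter than vanilla RoPE
```

Because the base grows with the observed sequence length, the scaling activates only beyond the trained context, which is what lets the frozen Dynamic NTK variant extend context without any fine-tuning.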