LongGenBench evaluates LLMs' ability to generate coherent long-context responses across multiple questions.
📚 https://arxiv.org/pdf/2410.04199
Original Problem 🔍:
Existing long-context benchmarks focus on retrieval-based tasks, neglecting evaluation of long-context generation capabilities in LLMs.
-----
Solution in this Paper 🛠️:
• LongGenBench: Synthetic benchmark for evaluating long-context generation
• Redesigns question format to require single, cohesive long-context answers
• Synthesizes datasets from MMLU, GSM8K, and CommonSenseQA
• Configurable parameters: K (questions per response) and T (iterations); a prompt-construction sketch follows this list
• Assesses consistency in logical flow over extended text sequences
• Evaluates models on generating coherent responses to multiple sequential questions
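To make the K/T setup concrete, here is a minimal Python sketch of how such multi-question prompts and per-question scoring could be assembled. The helper names (build_longgen_prompt, make_batches, score_response), the prompt wording, and the string-match scoring are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of a LongGenBench-style setup (assumed helper names, not the
# authors' official code). K questions are packed into one prompt so the model
# must produce a single, cohesive long-form answer; this repeats for T batches.

from typing import Dict, List


def build_longgen_prompt(questions: List[str]) -> str:
    """Pack K questions into one prompt that asks for a single long response."""
    header = (
        "Answer every question below in order. Label each answer as "
        "'Answer to Question i:' and keep all answers in one response.\n\n"
    )
    body = "\n".join(f"Question {i + 1}: {q}" for i, q in enumerate(questions))
    return header + body


def make_batches(dataset: List[Dict[str, str]], k: int, t: int) -> List[List[Dict[str, str]]]:
    """Slice the source dataset (e.g., GSM8K items) into T batches of K questions."""
    batches = []
    for i in range(t):
        chunk = dataset[i * k : (i + 1) * k]
        if len(chunk) == k:
            batches.append(chunk)
    return batches


def score_response(response: str, batch: List[Dict[str, str]]) -> float:
    """Naive per-question accuracy: check whether each gold answer string
    appears inside the segment labeled for that question."""
    correct = 0
    for i, item in enumerate(batch):
        marker = f"Answer to Question {i + 1}:"
        start = response.find(marker)
        if start == -1:
            continue
        end = response.find(f"Answer to Question {i + 2}:", start)
        segment = response[start : end if end != -1 else len(response)]
        if item["answer"] in segment:
            correct += 1
    return correct / len(batch)


# Usage sketch (model_call stands in for any chat-completion API):
# for batch in make_batches(gsm8k_items, k=30, t=4):
#     prompt = build_longgen_prompt([item["question"] for item in batch])
#     response = model_call(prompt)
#     print(score_response(response, batch))
```

The key design point is that every batch demands one uninterrupted generation covering all K questions, so accuracy reflects whether the model stays coherent and on-task deep into its own output rather than whether it can answer each question in isolation.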
-----
Key Insights from this Paper 💡:
• Performance degrades in long-context generation for both API and open-source models
• Larger models within the same series show better resilience
• Higher baseline performance generally correlates with better LongGenBench performance
• Different architectures exhibit varying robustness to long-context tasks
• Consistent performance trends observed across different datasets
-----
Results 📊:
• All evaluated models show performance drops on LongGenBench compared to their standard benchmark baselines
• Gemini-1.5-Flash: Least degradation among API models (1.2% drop on GSM8K)
• GPT-3.5-Turbo and Claude-3-Haiku: Largest drops (19.8% and 21.3% on GSM8K)
• Open-source models: Qwen2-72B-Instruct and DeepSeek-v2-Chat show minimal declines
• LLaMA-3-8B-Instruct: Significant drop (47.1% on GSM8K)