"LONGGENBENCH: Long-context Generation Benchmark"

The podcast on this paper is generated with Google's Illuminate.

LongGenBench evaluates LLMs' ability to generate coherent long-context responses across multiple questions.

📚 https://arxiv.org/pdf/2410.04199

Original Problem 🔍:

Existing long-context benchmarks focus on retrieval-based tasks and neglect the evaluation of long-context generation capabilities in LLMs.

-----

Solution in this Paper 🛠️:

• LongGenBench: Synthetic benchmark for evaluating long-context generation

• Redesigns question format to require single, cohesive long-context answers

• Synthesizes datasets from MMLU, GSM8K, and CommonSenseQA

• Configurable parameters: K (questions per response) and T (iterations); see the prompt-construction sketch after this list

• Assesses consistency in logical flow over extended text sequences

• Evaluates models on generating coherent responses to multiple sequential questions
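To make the K/T setup concrete, here is a minimal sketch of how prompts for such a benchmark could be assembled from a GSM8K-style question list. The helper names and instruction wording are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of a LongGenBench-style prompt builder.
# Assumption: `questions` is a list of GSM8K-style question strings;
# helper names and the instruction text are illustrative only.

def build_longgen_prompt(questions, k):
    """Pack k questions into one query that demands a single long answer."""
    selected = questions[:k]
    numbered = "\n".join(f"Question {i + 1}: {q}" for i, q in enumerate(selected))
    instruction = (
        "Answer all of the following questions in one continuous response. "
        "Label each answer as 'Answer 1:', 'Answer 2:', and so on, "
        "and keep the reasoning for every question complete."
    )
    return f"{instruction}\n\n{numbered}"


def build_benchmark(questions, k, t):
    """Create t prompts, each covering a disjoint block of k questions."""
    return [build_longgen_prompt(questions[i * k:], k) for i in range(t)]
```

Scoring then checks each labeled answer against its question's reference, so accuracy reflects whether the model stays correct and coherent deep into a single long generation.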

-----

Key Insights from this Paper 💡:

• Performance degrades in long-context generation for both API and open-source models

• Larger models within the same series show better resilience

• Higher baseline performance generally correlates with better LongGenBench performance

• Different architectures exhibit varying robustness to long-context tasks

• Consistent performance trends observed across different datasets

-----

Results 📊:

• All models show performance drops compared to baselines

• Gemini-1.5-Flash: Least degradation among API models (1.2% drop on GSM8K)

• GPT-3.5-Turbo and Claude-3-Haiku: Largest drops (19.8% and 21.3% on GSM8K)

• Open-source models: Qwen2-72B-Instruct and DeepSeek-v2-Chat show minimal declines

• LLaMA-3-8B-Instruct: Significant drop (47.1% on GSM8K)
