LongProc introduces a novel benchmark for testing LLMs' ability to generate long, structured outputs while following complex procedures across six diverse tasks.
-----
https://arxiv.org/abs/2501.05414
Original Problem 🤔:
Existing benchmarks mainly test LLMs on long inputs with short outputs, focusing on simple recall tasks. They don't evaluate LLMs' capability to generate coherent long-form content while following structured procedures.
-----
Solution in this Paper 💡:
→ LongProc evaluates LLMs through six procedural generation tasks: HTML to TSV conversion, pseudocode to code translation, path traversal, theory-of-mind tracking, countdown problem-solving, and travel planning
→ Each task requires both processing dispersed information and generating structured outputs up to 8K tokens
→ Tasks follow deterministic procedures and are scored with rule-based evaluation metrics (a scoring sketch follows this list)
→ Tests across three difficulty levels with maximum output lengths of 500, 2K, and 8K tokens
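To make the evaluation setup concrete, here is a minimal sketch of rule-based scoring for a structured-output task in the style of the HTML-to-TSV conversion. This is not the paper's released evaluation code: the function names (parse_tsv, row_f1) and the exact-match row-level F1 metric are illustrative assumptions about how such deterministic outputs could be checked.

```python
# Minimal sketch (assumed, not the authors' code): rule-based scoring
# for a structured TSV output, in the spirit of LongProc's HTML-to-TSV task.

def parse_tsv(text: str) -> list[tuple[str, ...]]:
    """Split model output into rows of tab-separated fields."""
    rows = []
    for line in text.strip().splitlines():
        fields = tuple(f.strip() for f in line.split("\t"))
        if any(fields):  # skip blank lines
            rows.append(fields)
    return rows

def row_f1(prediction: str, reference: str) -> float:
    """Score predicted rows against reference rows with exact-match F1."""
    pred_rows = parse_tsv(prediction)
    gold_rows = parse_tsv(reference)
    if not pred_rows or not gold_rows:
        return 0.0
    gold_set = set(gold_rows)
    matched = sum(1 for row in pred_rows if row in gold_set)
    precision = matched / len(pred_rows)
    recall = matched / len(gold_rows)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: a two-row reference table and a prediction missing one row.
gold = "Alice\t30\tParis\nBob\t25\tLondon"
pred = "Alice\t30\tParis"
print(round(row_f1(pred, gold), 3))  # 0.667
```

Because every task output is fully determined by the input and the procedure, this kind of exact, rule-based check can scale to 8K-token outputs without human or LLM judging.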
-----
Key Insights 🔍:
→ Performance significantly degrades as output length increases
→ Even top models struggle with 8K-token tasks despite having 32K+ context windows
→ Open-weight models perform notably worse than proprietary ones
→ Tasks requiring reasoning show steeper degradation than straightforward tasks
-----
Results 📊:
→ GPT-4o: 94.8% at 0.5K tokens, drops to 38.1% at 8K tokens
→ Open-weight models: Below 30% at 0.5K tokens
→ Best performance on 8K tasks: Gemini-1.5-Pro at 54.0%