
"LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation"

The podcast below was generated from this paper with Google's Illuminate.

LongProc introduces a novel benchmark for testing LLMs' ability to generate long, structured outputs while following complex procedures across six diverse tasks.

-----

https://arxiv.org/abs/2501.05414

Original Problem 🤔:

Existing benchmarks mainly test LLMs on long inputs with short outputs, focusing on simple recall tasks. They don't evaluate LLMs' capability to generate coherent long-form content while following structured procedures.

-----

Solution in this Paper 💡:

→ LongProc evaluates LLMs through six procedural generation tasks: HTML to TSV conversion, pseudocode to code translation, path traversal, theory-of-mind tracking, countdown problem-solving, and travel planning

→ Each task requires both processing dispersed information and generating structured outputs up to 8K tokens

→ Tasks follow deterministic procedures with rule-based evaluation metrics (a toy illustration of such a check follows this list)

→ Tests across three difficulty levels with maximum output lengths of 500, 2K, and 8K tokens
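
A minimal sketch of what a rule-based check could look like for the HTML-to-TSV task, scoring how many reference rows the model reproduced exactly. The function names and scoring rule here are illustrative assumptions, not the paper's actual evaluation code:

```python
# Illustrative sketch (not from the paper): exact-match, row-level scoring
# of a model's TSV output against a deterministic reference table.

def parse_tsv(text: str) -> list[tuple[str, ...]]:
    """Split a TSV string into row tuples, skipping blank lines."""
    return [tuple(line.split("\t"))
            for line in text.strip().splitlines() if line.strip()]

def row_accuracy(prediction: str, reference: str) -> float:
    """Fraction of reference rows reproduced exactly and in order."""
    pred_rows = parse_tsv(prediction)
    ref_rows = parse_tsv(reference)
    if not ref_rows:
        return 0.0
    correct = sum(p == r for p, r in zip(pred_rows, ref_rows))
    return correct / len(ref_rows)

# Example: header and one data row match, the second data row does not.
reference = "name\tprice\nWidget\t9.99\nGadget\t4.50"
prediction = "name\tprice\nWidget\t9.99\nGadget\t4.05"
print(row_accuracy(prediction, reference))  # ≈ 0.67
```

Because the target output is fully determined by the input and the procedure, a simple deterministic check like this can score arbitrarily long generations without any LLM-as-judge step.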

-----

Key Insights 🔍:

→ Performance significantly degrades as output length increases

→ Even top models struggle with 8K-token tasks despite having 32K+ context windows

→ Open-weight models perform notably worse than proprietary ones

→ Tasks requiring reasoning show steeper degradation than straightforward tasks

-----

Results 📊:

→ GPT-4o: 94.8% at 0.5K tokens, drops to 38.1% at 8K tokens

→ Open-weight models: Below 30% at 0.5K tokens

→ Best performance on 8K tasks: Gemini-1.5-Pro at 54.0%
