HELLOBENCH: EVALUATING LONG TEXT GENERATION CAPABILITIES OF LARGE LANGUAGE MODELS
HelloBench reveals significant limitations in LLMs' long text generation capabilities across diverse tasks.
The paper also introduces Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that sharply reduces the time and effort of manual assessment while maintaining high correlation with human judgments.
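This summary does not spell out HelloEval's mechanics, so the sketch below is only a minimal illustration: it assumes the judge answers a fixed checklist per response and that per-item scores are combined with weights aligned to human annotations. The checklist items, weight values, and the `judge` callable are hypothetical placeholders, not the paper's actual checklists or prompts.

```python
# Minimal sketch of a checklist-weighted scorer in the spirit of HelloEval.
# Assumption: an LLM judge rates each checklist item on a 0-1 scale, and
# weights fitted offline against human annotations aggregate the item scores.
# Items and weights below are illustrative placeholders.
from typing import Callable, Dict

# Hypothetical checklist for one task category, with human-aligned weights.
CHECKLIST: Dict[str, float] = {
    "follows_length_requirement": 0.35,
    "maintains_coherence_throughout": 0.30,
    "covers_required_content": 0.20,
    "avoids_repetition": 0.15,
}

def hello_eval_score(
    response: str,
    judge: Callable[[str, str], float],  # (response, checklist item) -> score in [0, 1]
) -> float:
    """Weighted sum of per-item judge scores, rescaled to 0-100."""
    total = sum(weight * judge(response, item) for item, weight in CHECKLIST.items())
    return 100.0 * total / sum(CHECKLIST.values())

if __name__ == "__main__":
    # Trivial stand-in judge; a real setup would prompt an LLM judge per item.
    dummy_judge = lambda resp, item: 0.5
    print(hello_eval_score("some long generated text ...", dummy_judge))
```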
Original Problem 😕:
Existing benchmarks lack comprehensive evaluation of LLMs' long text generation capabilities across diverse tasks. Current evaluation methods struggle to accurately assess long text quality.
Key Insights from this Paper 💡:
• LLMs prefer generating ~1000 words when unconstrained
• Models enhanced for long-output generation produce longer text (~3000 words), but at lower quality
• Most LLMs struggle to generate >4000 words even with explicit length constraints (a probe sketch follows this list)
• Negative correlation between long-context understanding and generation capabilities
• Current LLMs have significant limitations in long text generation quality and length
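To make the length-constraint finding concrete, here is a minimal probe sketch, not HelloBench's actual protocol: it assumes the `openai` Python package, an `OPENAI_API_KEY` in the environment, and placeholder choices for the model name, prompt wording, and target lengths.

```python
# Illustrative probe of length-instruction adherence (not HelloBench's protocol).
# Assumes the `openai` package and OPENAI_API_KEY are available; the model name,
# prompt, and target word counts are placeholders.
from openai import OpenAI

client = OpenAI()

def generated_word_count(target_words: int, model: str = "gpt-4o-mini") -> int:
    """Request a fixed-length essay and count how many words actually come back."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Write an essay about renewable energy in exactly {target_words} words.",
        }],
    )
    return len(resp.choices[0].message.content.split())

if __name__ == "__main__":
    for target in (500, 2000, 4000, 8000):
        print(f"requested {target} words, received {generated_word_count(target)}")
```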
Results 📊:
• HelloEval achieves the highest correlation with human evaluation
• GPT-4 and Mistral-Large perform best but score only ~48/100
• Most open-source LLMs score <35/100
• LLMs struggle to produce high-quality text beyond ~2000 words, far below their maximum token limits