HELLOBENCH: EVALUATING LONG TEXT GENERATION CAPABILITIES OF LARGE LANGUAGE MODELS
HelloBench reveals significant limitations in LLMs' long text generation capabilities across diverse tasks.
The paper also introduces Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that sharply reduces the time and effort of manual assessment while maintaining high correlation with human judgments.
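This summary does not spell out HelloEval's mechanics, so the sketch below is only a minimal illustration: it assumes the judge answers a fixed checklist per response and that per-item scores are combined with weights aligned to human annotations. The checklist items, weight values, and the `judge` callable are hypothetical placeholders, not the paper's actual checklists or prompts.

```python
# Minimal sketch of a checklist-weighted scorer in the spirit of HelloEval.
# Assumption: an LLM judge rates each checklist item on a 0-1 scale, and
# weights fitted offline against human annotations aggregate the item scores.
# Items and weights below are illustrative placeholders.
from typing import Callable, Dict

# Hypothetical checklist for one task category, with human-aligned weights.
CHECKLIST: Dict[str, float] = {
    "follows_length_requirement": 0.35,
    "maintains_coherence_throughout": 0.30,
    "covers_required_content": 0.20,
    "avoids_repetition": 0.15,
}

def hello_eval_score(
    response: str,
    judge: Callable[[str, str], float],  # (response, checklist item) -> score in [0, 1]
) -> float:
    """Weighted sum of per-item judge scores, rescaled to 0-100."""
    total = sum(weight * judge(response, item) for item, weight in CHECKLIST.items())
    return 100.0 * total / sum(CHECKLIST.values())

if __name__ == "__main__":
    # Trivial stand-in judge; a real setup would prompt an LLM judge per item.
    dummy_judge = lambda resp, item: 0.5
    print(hello_eval_score("some long generated text ...", dummy_judge))
```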
Original Problem 😕:
Existing benchmarks lack comprehensive evaluation of LLMs' long text generation capabilities across diverse tasks. Current evaluation methods struggle to accurately assess long text quality.
Key Insights from this Paper 💡:
• LLMs prefer generating ~1000 words when unconstrained
• Models enhanced for long-output generation produce longer text (~3000 words), but at lower quality
• Most LLMs struggle to generate >4000 words even with explicit length constraints (a probe sketch follows this list)
• Negative correlation between long-context understanding and generation capabilities
• Current LLMs have significant limitations in long text generation quality and length
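To make the length-constraint finding concrete, here is a minimal probe sketch, not HelloBench's actual protocol: it assumes the `openai` Python package, an `OPENAI_API_KEY` in the environment, and placeholder choices for the model name, prompt wording, and target lengths.

```python
# Illustrative probe of length-instruction adherence (not HelloBench's protocol).
# Assumes the `openai` package and OPENAI_API_KEY are available; the model name,
# prompt, and target word counts are placeholders.
from openai import OpenAI

client = OpenAI()

def generated_word_count(target_words: int, model: str = "gpt-4o-mini") -> int:
    """Request a fixed-length essay and count how many words actually come back."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Write an essay about renewable energy in exactly {target_words} words.",
        }],
    )
    return len(resp.choices[0].message.content.split())

if __name__ == "__main__":
    for target in (500, 2000, 4000, 8000):
        print(f"requested {target} words, received {generated_word_count(target)}")
```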
Results 📊:
• HelloEval achieves the highest correlation with human evaluation
• GPT-4 and Mistral-Large perform best but score only ~48/100
• Most open-source LLMs score <35/100
• LLMs struggle to produce high-quality text beyond ~2000 words, far below their maximum token limits