HELLOBENCH: EVALUATING LONG TEXT GENERATION CAPABILITIES OF LARGE LANGUAGE MODELS
HelloBench reveals significant limitations in LLMs' long text generation capabilities across diverse tasks.
The paper also introduces Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with human judgments.
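To make the idea concrete, here is a minimal sketch of how a HelloEval-style score could be computed: an LLM judge rates a set of task-specific checklist items, and the per-item results are combined with weights aligned to human annotations. The checklist items, weights, and `llm_judge` callable below are hypothetical illustrations, not the paper's actual prompts or learned coefficients.

```python
# Hypothetical sketch of checklist-weighted scoring in the spirit of HelloEval.
# Checklist items, weights, and the judge call are illustrative assumptions,
# not the paper's exact implementation.
from typing import Callable

def hello_eval_score(
    response: str,
    checklist: list[str],
    weights: list[float],
    llm_judge: Callable[[str, str], float],
) -> float:
    """Combine per-item judge scores in [0, 1] into a weighted 0-100 score."""
    assert len(checklist) == len(weights)
    item_scores = [llm_judge(response, item) for item in checklist]
    weighted = sum(w * s for w, s in zip(weights, item_scores))
    return 100.0 * weighted / sum(weights)

if __name__ == "__main__":
    checklist = [
        "Does the response meet the requested length?",
        "Is the structure coherent from start to finish?",
        "Is the content relevant to the instruction?",
    ]
    weights = [0.5, 0.3, 0.2]  # illustrative; the paper aligns weights with human scores
    stub_judge = lambda response, item: 0.6  # placeholder for a real LLM-as-judge call
    print(hello_eval_score("...", checklist, weights, stub_judge))
```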
Original Problem:
Existing benchmarks do not comprehensively evaluate LLMs' long text generation capabilities across diverse tasks, and current evaluation methods struggle to accurately assess the quality of long-form text.
Key Insights from this Paper:
• LLMs prefer generating ~1000 words when unconstrained
• Models enhanced for long-form output generate longer text (~3000 words) but with lower quality
• Most LLMs struggle to generate >4000 words even with explicit length constraints
• Negative correlation between long-context understanding and long text generation capabilities
• Current LLMs have significant limitations in both the quality and the length of long text generation
Results:
• HelloEval achieves the highest correlation with human evaluation (see the correlation sketch after this list)
• GPT-4 and Mistral-Large perform best, but score only ~48/100
• Most open-source LLMs score <35/100
• LLMs struggle to generate high-quality text beyond 2000 words, far below their maximum token limits
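For reference, the "correlation with human evaluation" reported above is typically measured by comparing automatic scores and human scores over the same set of responses. A minimal sketch using Spearman correlation follows; the score lists are made-up placeholders, not data from the paper.

```python
# Minimal sketch: measuring how well an automatic metric tracks human judgments.
# The score lists are fabricated placeholders, not results from the paper.
from scipy.stats import spearmanr

human_scores = [72, 45, 88, 30, 55]      # human ratings per response
automatic_scores = [70, 50, 85, 35, 52]  # HelloEval-style scores per response

rho, p_value = spearmanr(human_scores, automatic_scores)
print(f"Spearman correlation: {rho:.3f} (p={p_value:.3f})")
```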


