Using LLM-generated, instruction-specific checklists to structure and improve LLM evaluation and self-improvement.
📚 https://arxiv.org/pdf/2410.03608
Original Problem 💡:
Human annotation of instruction-following is slow and costly, so LLMs are increasingly used as judges. But the common judging formats, pairwise preference judgments and direct scoring, are hard to interpret and only weakly reliable.
-----
Solution in this Paper 🔬:
• Introduces TICK (Targeted Instruct-evaluation with ChecKlists)
• LLMs generate instruction-specific evaluation checklists with YES/NO questions
• Checklist answers are aggregated into scores or preference judgments (see the sketch after this list)
• STICK (Self-TICK) uses checklists for self-refinement and Best-of-N selection
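A minimal sketch of the checklist-evaluation idea, assuming a generic text-in/text-out LLM call; the prompts and helper names (generate_checklist, evaluate_with_checklist) are illustrative and not taken from the paper's code.

```python
# Sketch of TICK-style checklist evaluation.
# Assumptions: `llm` is any callable that maps a prompt string to a completion
# string, and the checklist-generation prompt returns a JSON list of questions.
import json
import re
from typing import Callable

LLM = Callable[[str], str]  # text in, text out

def generate_checklist(llm: LLM, instruction: str) -> list[str]:
    """Ask the LLM for instruction-specific YES/NO checklist questions."""
    prompt = (
        "Write a checklist of YES/NO questions that a good response to the "
        "instruction below must satisfy. Return a JSON list of strings.\n\n"
        f"Instruction: {instruction}"
    )
    return json.loads(llm(prompt))

def evaluate_with_checklist(llm: LLM, instruction: str, response: str,
                            checklist: list[str]) -> float:
    """Answer each checklist question with YES/NO and return the pass rate."""
    passes = 0
    for question in checklist:
        verdict = llm(
            f"Instruction: {instruction}\nResponse: {response}\n"
            f"Question: {question}\nAnswer strictly YES or NO."
        )
        if re.search(r"\bYES\b", verdict.upper()):
            passes += 1
    return passes / max(len(checklist), 1)
```

The aggregated pass rate can then stand in for a direct quality score or be compared across two responses to produce a preference judgment.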
-----
Key Insights from this Paper 💡:
• LLMs can generate high-quality evaluation checklists comparable to human-written ones
• Structured checklist evaluations improve agreement between LLM judges and humans
• Self-refinement with STICK outperforms unstructured feedback across diverse tasks
• STICK improves Best-of-N selection over baselines on instruction-following benchmarks (sketched below)
• LLM-generated checklists improve inter-annotator agreement in human evaluation
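A hedged sketch of how the same checklist pass rate could drive Best-of-N selection and one round of self-refinement, reusing the helpers above; the refinement prompt wording and loop structure are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of STICK-style Best-of-N selection and self-refinement,
# reusing generate_checklist / evaluate_with_checklist from the sketch above.

def best_of_n(llm: LLM, instruction: str, candidates: list[str]) -> str:
    """Pick the candidate with the highest checklist pass rate."""
    checklist = generate_checklist(llm, instruction)
    return max(
        candidates,
        key=lambda c: evaluate_with_checklist(llm, instruction, c, checklist),
    )

def self_refine(llm: LLM, instruction: str, draft: str) -> str:
    """One refinement step: feed the failed checklist items back to the model."""
    checklist = generate_checklist(llm, instruction)
    failed = [
        q for q in checklist
        if evaluate_with_checklist(llm, instruction, draft, [q]) < 1.0
    ]
    if not failed:
        return draft
    feedback = "\n".join(f"- {q}" for q in failed)
    return llm(
        f"Instruction: {instruction}\nDraft response: {draft}\n"
        f"The draft fails these checks:\n{feedback}\n"
        "Rewrite the response so that every check passes."
    )
```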
-----
Results 📊:
• TICK improves LLM-human agreement on preferences by 5.8% vs direct scoring
• STICK self-refinement gains: +7.8% on LiveBench reasoning, +7.1% on WildBench
• STICK Best-of-N selection: +5.1% on InFoBench, +5.3% on WildBench vs greedy decoding
• Human inter-annotator agreement increases from 0.194 to 0.256 with checklists
TICK enables automated, interpretable LLM evaluation and improves performance via structured feedback.