Using LLM-generated, instruction-specific checklists to structure and improve LLM evaluation and self-improvement.
📚 https://arxiv.org/pdf/2410.03608
Original Problem 💡:
Human annotation of instruction-following is slow and costly, so LLMs are increasingly used as judges. But the common judging formats, pairwise preference judgments and direct scoring, are hard to interpret and only weakly reliable.
-----
Solution in this Paper 🔬:
• Introduces TICK (Targeted Instruct-evaluation with ChecKlists)
• LLMs generate instruction-specific evaluation checklists with YES/NO questions
• Checklist answers are aggregated into scores or preference judgments (see the sketch after this list)
• STICK (Self-TICK) uses checklists for self-refinement and Best-of-N selection
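A minimal sketch of the checklist-evaluation idea, assuming a generic text-in/text-out LLM call; the prompts and helper names (generate_checklist, evaluate_with_checklist) are illustrative and not taken from the paper's code.

```python
# Sketch of TICK-style checklist evaluation.
# Assumptions: `llm` is any callable that maps a prompt string to a completion
# string, and the checklist-generation prompt returns a JSON list of questions.
import json
import re
from typing import Callable

LLM = Callable[[str], str]  # text in, text out

def generate_checklist(llm: LLM, instruction: str) -> list[str]:
    """Ask the LLM for instruction-specific YES/NO checklist questions."""
    prompt = (
        "Write a checklist of YES/NO questions that a good response to the "
        "instruction below must satisfy. Return a JSON list of strings.\n\n"
        f"Instruction: {instruction}"
    )
    return json.loads(llm(prompt))

def evaluate_with_checklist(llm: LLM, instruction: str, response: str,
                            checklist: list[str]) -> float:
    """Answer each checklist question with YES/NO and return the pass rate."""
    passes = 0
    for question in checklist:
        verdict = llm(
            f"Instruction: {instruction}\nResponse: {response}\n"
            f"Question: {question}\nAnswer strictly YES or NO."
        )
        if re.search(r"\bYES\b", verdict.upper()):
            passes += 1
    return passes / max(len(checklist), 1)
```

The aggregated pass rate can then stand in for a direct quality score or be compared across two responses to produce a preference judgment.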
-----
Key Insights from this Paper 💡:
• LLMs can generate high-quality evaluation checklists comparable to human-written ones
• Structured checklist evaluations improve agreement between LLM judges and humans
• Self-refinement with STICK outperforms unstructured feedback across diverse tasks
• STICK improves Best-of-N selection over baselines on instruction-following benchmarks (sketched below)
• LLM-generated checklists improve inter-annotator agreement in human evaluation
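A hedged sketch of how the same checklist pass rate could drive Best-of-N selection and one round of self-refinement, reusing the helpers above; the refinement prompt wording and loop structure are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of STICK-style Best-of-N selection and self-refinement,
# reusing generate_checklist / evaluate_with_checklist from the sketch above.

def best_of_n(llm: LLM, instruction: str, candidates: list[str]) -> str:
    """Pick the candidate with the highest checklist pass rate."""
    checklist = generate_checklist(llm, instruction)
    return max(
        candidates,
        key=lambda c: evaluate_with_checklist(llm, instruction, c, checklist),
    )

def self_refine(llm: LLM, instruction: str, draft: str) -> str:
    """One refinement step: feed the failed checklist items back to the model."""
    checklist = generate_checklist(llm, instruction)
    failed = [
        q for q in checklist
        if evaluate_with_checklist(llm, instruction, draft, [q]) < 1.0
    ]
    if not failed:
        return draft
    feedback = "\n".join(f"- {q}" for q in failed)
    return llm(
        f"Instruction: {instruction}\nDraft response: {draft}\n"
        f"The draft fails these checks:\n{feedback}\n"
        "Rewrite the response so that every check passes."
    )
```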
-----
Results 📊:
• TICK improves LLM-human agreement on preferences by 5.8% vs direct scoring
• STICK self-refinement gains: +7.8% on LiveBench reasoning, +7.1% on WildBench
• STICK Best-of-N selection: +5.1% on InFoBench, +5.3% on WildBench vs greedy decoding
• Human inter-annotator agreement increases from 0.194 to 0.256 with checklists
TICK enables automated, interpretable LLM evaluation and improves performance via structured feedback.