LLM critique quality measured by correction success, not just verdict accuracy.
This paper introduces RealCritic, a benchmark that evaluates LLM critique quality by how well critiques improve solutions, using a closed-loop approach that covers self-critique, cross-critique, and iterative critique.
-----
Paper - https://arxiv.org/abs/2501.14492
Original Problem 🧐:
→ Existing benchmarks for evaluating LLM critiques are mostly open-loop: critiques are judged on their own.
→ They don't directly measure whether critiques actually lead to better solutions.
→ Current benchmarks also lack support for self-critique and iterative critique.
-----
Solution in this Paper 💡:
→ RealCritic is introduced as a new benchmark for evaluating LLM critiques.
→ It uses a closed-loop approach: critique quality is judged by how much the critique improves the solution.
→ Concretely, it measures the accuracy of the corrected solutions produced from the critiques.
→ RealCritic supports self-critique, where a model critiques its own solutions.
→ It also supports cross-critique, where a model critiques solutions from other models.
→ Iterative critique is incorporated as well, allowing multi-round critique-and-correction evaluation (see the sketch after this list).
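Below is a minimal sketch of what such a closed-loop evaluation could look like. It is illustrative only: the `critic` LLM interface, the `extract_answer` parser, and the prompt wording are assumptions, not the paper's actual prompts or code.

```python
from typing import Callable

def closed_loop_critique_eval(
    critic: Callable[[str], str],      # assumed LLM interface: prompt -> response
    problem: str,
    initial_solution: str,             # self-critique: from the same model;
                                       # cross-critique: from a different model
    reference_answer: str,
    extract_answer: Callable[[str], str],  # assumed parser for the final answer
    rounds: int = 1,                   # rounds > 1 gives iterative critique
) -> bool:
    """Score a critique by whether the corrected solution ends up correct."""
    solution = initial_solution
    for _ in range(rounds):
        # 1) Ask the model to critique the current solution.
        critique = critic(
            f"Problem: {problem}\nSolution: {solution}\n"
            "Critique this solution and point out any errors."
        )
        # 2) Ask the model to produce a corrected solution from its critique.
        solution = critic(
            f"Problem: {problem}\nSolution: {solution}\nCritique: {critique}\n"
            "Write a corrected solution that addresses the critique."
        )
    # Closed-loop metric: the critique counts as good only if the corrected
    # solution is actually correct, not merely if the verdict sounds plausible.
    return extract_answer(solution) == reference_answer
```

Self-critique, cross-critique, and iterative critique then differ only in where `initial_solution` comes from and how many rounds are run; benchmark accuracy is the fraction of problems whose corrected solutions match the reference answers.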
-----
Key Insights from this Paper 🤔:
→ Open-loop evaluation of critiques may not effectively measure true critique quality.
→ Closed-loop evaluation, linking critique to correction, is a more direct measure.
→ Self-critique and iterative critique are important for advanced reasoning LLMs.
→ Classical LLMs often underperform reasoning-based models in critique scenarios.
→ Self-critique doesn't always lead to improvement for classical LLMs; it can even reduce performance.
-----
Results 📈:
→ o1-mini shows +3.3% average improvement in self-critique over direct solutions.
→ o1-mini achieves up to +24.0% improvement on College Math in self-critique.
→ o1-mini shows strong cross-critique performance, outperforming other models.
→ In cross-critique, models show up to 40% improvement over baseline on mathematical reasoning tasks.