LLM critique quality measured by correction success, not just verdict accuracy.
This paper introduces RealCritic, a benchmark that evaluates LLM critique quality by how well critiques improve solutions, using a closed-loop approach that covers self-critique, cross-critique, and iterative critique.
-----
Paper - https://arxiv.org/abs/2501.14492
Original Problem 🧐:
→ Existing benchmarks for evaluating LLM critiques are mostly open-loop: critiques are judged on their own.
→ They don't directly measure whether critiques actually lead to better solutions.
→ Current benchmarks also lack support for self-critique and iterative critique.
-----
Solution in this Paper 💡:
→ RealCritic is introduced as a new benchmark for evaluating LLM critiques.
→ It uses a closed-loop approach: critique quality is judged by how much the critique improves the solution.
→ Concretely, it measures the accuracy of the corrected solutions produced from the critiques.
→ RealCritic supports self-critique, where a model critiques its own solutions.
→ It also supports cross-critique, where a model critiques solutions from other models.
→ Iterative critique is incorporated as well, allowing multi-round critique-and-correction evaluation (see the sketch after this list).
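Below is a minimal sketch of what such a closed-loop evaluation could look like. It is illustrative only: the `critic` LLM interface, the `extract_answer` parser, and the prompt wording are assumptions, not the paper's actual prompts or code.

```python
from typing import Callable

def closed_loop_critique_eval(
    critic: Callable[[str], str],      # assumed LLM interface: prompt -> response
    problem: str,
    initial_solution: str,             # self-critique: from the same model;
                                       # cross-critique: from a different model
    reference_answer: str,
    extract_answer: Callable[[str], str],  # assumed parser for the final answer
    rounds: int = 1,                   # rounds > 1 gives iterative critique
) -> bool:
    """Score a critique by whether the corrected solution ends up correct."""
    solution = initial_solution
    for _ in range(rounds):
        # 1) Ask the model to critique the current solution.
        critique = critic(
            f"Problem: {problem}\nSolution: {solution}\n"
            "Critique this solution and point out any errors."
        )
        # 2) Ask the model to produce a corrected solution from its critique.
        solution = critic(
            f"Problem: {problem}\nSolution: {solution}\nCritique: {critique}\n"
            "Write a corrected solution that addresses the critique."
        )
    # Closed-loop metric: the critique counts as good only if the corrected
    # solution is actually correct, not merely if the verdict sounds plausible.
    return extract_answer(solution) == reference_answer
```

Self-critique, cross-critique, and iterative critique then differ only in where `initial_solution` comes from and how many rounds are run; benchmark accuracy is the fraction of problems whose corrected solutions match the reference answers.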
-----
Key Insights from this Paper 🤔:
→ Open-loop evaluation of critiques may not effectively measure true critique quality.
→ Closed-loop evaluation, linking critique to correction, is a more direct measure.
→ Self-critique and iterative critique are important for advanced reasoning LLMs.
→ Classical LLMs often underperform reasoning-based models in critique scenarios.
→ Self-critique doesn't always lead to improvement for classical LLMs; it can even reduce performance.
-----
Results 📈:
→ o1-mini shows +3.3% average improvement in self-critique over direct solutions.
→ o1-mini achieves up to +24.0% improvement on College Math in self-critique.
→ o1-mini shows strong cross-critique performance, outperforming other models.
→ In cross-critique, models show up to 40% improvement over baseline on mathematical reasoning tasks.