"Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18099
The challenge: building Chain-of-Thought reasoning for LLM judges is hard because human-annotated reasoning steps are scarce and evaluation instructions must be hand-crafted for each task.
This paper introduces EvalPlanner, a new method to train LLM judges to plan and reason for evaluations. It iteratively refines evaluation plans and executions through self-training, using synthetically created data and preference optimization.
-----
📌 EvalPlanner's key innovation lies in its decoupled plan-execution framework. This explicitly models the evaluation process, moving beyond monolithic Chain-of-Thought approaches.
📌 Synthetic data generation for training judges is a significant contribution. EvalPlanner demonstrates effective self-training using preference optimization over plan-execution pairs.
📌 The unconstrained plan generation allows for flexible, task-adaptive evaluation strategies. This contrasts with rigid, pre-defined criteria, improving judge generalizability.
----------
Methods Explored in this Paper 🔧:
→ EvalPlanner decouples the Chain-of-Thought process into two key components: evaluation plan generation and plan execution (see the first sketch after this list).
→ First, for a given instruction, EvalPlanner generates a detailed evaluation plan outlining the steps to assess responses. This plan is unconstrained and adapts to the specific instruction.
→ Second, it executes this plan step-by-step, analyzing the provided responses to reach a final judgment.
→ EvalPlanner uses a self-training loop. It starts with a seed LLM and iteratively refines the evaluation plans and executions.
→ In each iteration, it samples multiple plans and executions, creating preference pairs of correct and incorrect reasoning paths.
→ Direct Preference Optimization (DPO) is then used to train the LLM judge on these synthetic preference pairs, improving both plan generation and execution (see the second sketch below).
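
A minimal sketch of the two-stage plan-then-execute judging loop. The `generate` function is a hypothetical stand-in for any instruction-tuned LLM call, and the prompt templates are paraphrased assumptions, not the paper's exact prompts.

```python
# Two-stage judging: (1) generate an instruction-specific evaluation plan,
# (2) execute that plan over the candidate responses to reach a verdict.

def generate(prompt: str) -> str:
    """Placeholder LLM call -- swap in your own model or API client."""
    return "<model output>"

def make_plan(instruction: str) -> str:
    # Stage 1: derive an unconstrained, instruction-specific evaluation plan.
    prompt = (
        "You are evaluating responses to the instruction below.\n"
        f"Instruction: {instruction}\n"
        "Write a step-by-step evaluation plan tailored to this instruction."
    )
    return generate(prompt)

def execute_plan(instruction: str, plan: str, response_a: str, response_b: str) -> str:
    # Stage 2: follow the plan step by step and emit a final verdict (A or B).
    prompt = (
        f"Instruction: {instruction}\n"
        f"Evaluation plan:\n{plan}\n"
        f"Response A:\n{response_a}\n"
        f"Response B:\n{response_b}\n"
        "Execute the plan step by step, then answer with the better response: A or B."
    )
    return generate(prompt)

if __name__ == "__main__":
    instruction = "Summarize the article in two sentences."
    plan = make_plan(instruction)
    verdict = execute_plan(instruction, plan, "Summary A ...", "Summary B ...")
    print(plan, verdict, sep="\n")
```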
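
And a sketch of how the synthetic DPO preference data could be assembled, under stated assumptions: several (plan, execution) chains are sampled per instruction, chains whose final verdict matches the known preferred response are treated as "chosen", the rest as "rejected", and correct/incorrect chains are paired over the full Chain-of-Thought. The `sample_chain` stub is hypothetical; in practice it would call the current judge model at temperature > 0.

```python
# Build (chosen, rejected) preference pairs over full plan+execution chains.

import itertools
import random
from dataclasses import dataclass

@dataclass
class Chain:
    plan: str
    execution: str
    verdict: str  # "A" or "B"

def sample_chain(instruction: str, resp_a: str, resp_b: str) -> Chain:
    # Hypothetical sampler: stands in for sampling a plan and its execution
    # from the current judge model.
    verdict = random.choice(["A", "B"])
    return Chain(plan="<sampled plan>", execution="<sampled execution>", verdict=verdict)

def build_preference_pairs(instruction, resp_a, resp_b, gold="A", n_samples=8):
    chains = [sample_chain(instruction, resp_a, resp_b) for _ in range(n_samples)]
    chosen = [c for c in chains if c.verdict == gold]      # correct reasoning paths
    rejected = [c for c in chains if c.verdict != gold]    # incorrect reasoning paths
    # Each (correct, incorrect) combination becomes one DPO training pair
    # over the complete chain (plan + execution + verdict).
    return [
        {"prompt": instruction,
         "chosen": f"{c.plan}\n{c.execution}\nVerdict: {c.verdict}",
         "rejected": f"{r.plan}\n{r.execution}\nVerdict: {r.verdict}"}
        for c, r in itertools.product(chosen, rejected)
    ]

pairs = build_preference_pairs("Summarize the article.", "Summary A ...", "Summary B ...")
print(len(pairs), "preference pairs")
```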
-----
Key Insights 💡:
→ Decoupling evaluation into planning and execution enhances the reasoning process of LLM judges.
→ Unconstrained evaluation plans, generated by the LLM itself, are more effective than predefined, constrained plans.
→ Iterative self-training with synthetic data is sufficient to train high-performing LLM judges, without any human-annotated reasoning steps.
→ Preference optimization over complete Chain-of-Thoughts, including plans and executions, significantly improves judgment accuracy.
-----
Results 📊:
→ EvalPlanner achieves a new state-of-the-art overall score of 93.9 on RewardBench for generative reward models.
→ EvalPlanner outperforms prior models, even those trained with up to 30 times more data.
→ On FollowBenchEval, a benchmark of complex prompts requiring multi-level constraint evaluation, EvalPlanner outperforms a leading model by up to 13%.
→ On RM-Bench, EvalPlanner shows up to 8% improvement in robustness over previous state-of-the-art models.