Evaluating LLMs with human annotations is expensive, and comparing different LLM judges across studies is difficult because they use inconsistent models and prompts.
This paper introduces a method to systematically optimize the configuration of LLM judges. It uses a cost-efficient search to find judges that balance accuracy and cost, identifying high-performing, efficient judges built on readily available open-weight models.
-----
https://arxiv.org/abs/2501.17178
📌 This research pioneers systematic hyperparameter tuning for LLM judges. It moves beyond simple scaling, optimizing prompts and models for enhanced judge accuracy and efficiency.
📌 Multi-objective multi-fidelity optimization enables efficient judge tuning. This approach balances accuracy with computational cost, drastically reducing the search budget for optimal judge configurations.
📌 By tuning open-weight LLMs, this method democratizes access to high-performance judges. It shows that optimized zero-shot judges can outperform specialized, fine-tuned alternatives.
----------
Methods Explored in this Paper 🔧:
→ The paper explores scaling laws to understand how judge performance improves with larger models or more instructions.
→ It uses multi-objective multi-fidelity optimization to efficiently tune judge hyperparameters, balancing accuracy against cost while cutting the search budget (see the search-loop sketch after this list).
→ The research investigates various hyperparameters including LLM models (Llama3, Qwen2.5, Gemma2), temperature, and prompt strategies.
→ Prompt strategies cover different output formats, such as Likert scales or best-model identifiers, and options to include an answer, an explanation, or examples in the prompt (see the configuration sketch after this list).
→ The method evaluates judge configurations across different levels of fidelity, starting with fewer evaluation battles and progressively increasing for promising configurations.
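
To make the search space concrete, here is a minimal sketch of what such a judge configuration grid could look like. The model families mirror those listed above; the temperature values, option names (provide_answer, provide_explanation, provide_example), and output-format labels are illustrative assumptions, not the paper's exact interface.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative judge configuration space; values and field names are
# assumptions for this sketch, not the paper's exact search space.
MODELS = ["Llama3-8B", "Qwen2.5-7B", "Gemma2-9B"]
TEMPERATURES = [0.0, 0.3, 1.0]
OUTPUT_FORMATS = ["likert", "pair", "best-model-identifier"]
PROMPT_OPTIONS = [
    {"provide_answer": a, "provide_explanation": e, "provide_example": x}
    for a, e, x in product([True, False], repeat=3)
]

@dataclass(frozen=True)
class JudgeConfig:
    model: str
    temperature: float
    output_format: str
    provide_answer: bool
    provide_explanation: bool
    provide_example: bool

def all_configs():
    """Enumerate the full grid of candidate judge configurations."""
    for m, t, f, opts in product(MODELS, TEMPERATURES, OUTPUT_FORMATS, PROMPT_OPTIONS):
        yield JudgeConfig(m, t, f, opts["provide_answer"],
                          opts["provide_explanation"], opts["provide_example"])
```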
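
And a sketch of the multi-fidelity part of the search, in the spirit of successive halving: every configuration is first scored on a small number of battles, and only the most promising fraction is re-evaluated at higher fidelity. The budgets, the keep fraction, and the evaluate_agreement helper are placeholders; the paper's actual method is also multi-objective, trading agreement off against annotation cost.

```python
def multi_fidelity_search(configs, evaluate_agreement,
                          budgets=(100, 400, 1600), keep_fraction=0.25):
    """Successive-halving-style search: cheap low-fidelity screening first,
    then progressively more battles for the surviving configurations.

    `evaluate_agreement(config, n_battles)` is a placeholder that should
    return human agreement estimated on `n_battles` judge-vs-human battles.
    """
    survivors = list(configs)
    for n_battles in budgets:
        scored = [(evaluate_agreement(c, n_battles), c) for c in survivors]
        scored.sort(key=lambda sc: sc[0], reverse=True)
        keep = max(1, int(len(scored) * keep_fraction))
        survivors = [c for _, c in scored[:keep]]
    return survivors  # best configurations seen at the highest fidelity
```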
-----
Key Insights 💡:
→ Scaling judge models and instructions alone is insufficient to achieve optimal judge performance.
→ Human agreement is a more cost-effective metric than Spearman correlation for tuning judges: it gives a better signal-to-noise ratio for distinguishing judge configurations (a short computation sketch follows this list).
→ Prompt hyperparameters significantly impact judge performance, sometimes as much as the choice of the underlying LLM model.
→ Some prompt strategies and output formats are clearly more effective than others, with the "pair" output format performing well. Lower temperatures and averaging judgments across both answer orders further improve performance (see the order-averaging sketch after this list).
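
As a rough illustration of the human-agreement signal, the score for a judge configuration can be computed as the fraction of battles where the judge's verdict matches the human label. The function below is a minimal sketch, not the paper's exact evaluation code.

```python
def human_agreement(judge_verdicts, human_labels):
    """Fraction of battles where the judge picks the same winner as the human.

    Both arguments are sequences of labels such as "A", "B", or "tie".
    """
    assert len(judge_verdicts) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)

# Example: 3 of 4 verdicts agree with the human annotator -> 0.75
print(human_agreement(["A", "B", "tie", "A"], ["A", "B", "A", "A"]))
```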
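
The order-averaging trick can be sketched as querying the judge twice with the two answers swapped and averaging the resulting preference scores, which cancels much of the position bias. The judge_fn callable and its [0, 1] preference score are assumptions for this sketch.

```python
def order_averaged_verdict(judge_fn, instruction, answer_a, answer_b):
    """Query the judge with both answer orders and average the two scores
    to reduce position bias. `judge_fn` is a placeholder returning a score
    in [0, 1] for how strongly it prefers the first answer shown.
    """
    s_forward = judge_fn(instruction, answer_a, answer_b)         # A shown first
    s_reversed = 1.0 - judge_fn(instruction, answer_b, answer_a)  # B shown first, flipped
    score = (s_forward + s_reversed) / 2.0
    if abs(score - 0.5) < 1e-9:
        return "tie"
    return "A" if score > 0.5 else "B"
```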
-----
Results 📊:
→ The tuned judges outperform existing fine-tuned judges such as PandaLM-7B and JudgeLM-7B on LMSys test instructions.
→ Achieves human agreement of 0.437 with an 8B parameter judge at a cost of $0.0003 per annotation.
→ Achieves human agreement of 0.471 with a 32B parameter judge at a cost of $0.0010 per annotation.
→ The tuned 8B parameter judge reaches 0.67 human agreement on the PandaLM test set, matching PandaLM-70B (0.67) with a far smaller model.
→ A tuned 32B parameter judge reaches 0.78 human agreement on the PandaLM test set.