0:00
/
0:00
Transcript

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

The podcast on this paper is generated with Google's Illuminate.

Bold idea in this paper. Proves how LLMs game getting high scores on prestigious benchmarks like AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench.

Proves how even a “null model” that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates.

https://arxiv.org/abs/2410.07137

Original Problem 🔍:

Automatic LLM benchmarks like AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench are vulnerable to manipulation. High win rates on these benchmarks can significantly boost promotional impact for new language models.

-----

Solution in this Paper 🧠:

• Introduces "null model" that outputs constant, irrelevant responses

• Uses structured responses to confuse auto-annotators' syntactic analysis

• Employs random search to optimize adversarial prefix

• Creates transferable cheating outputs without access to benchmark instructions

• Tests on both GPT-4 and open-source Llama-3 auto-annotators

-----

Key Insights from this Paper 💡:

• Even non-informative responses can achieve top-ranked win rates on popular benchmarks

• Current anti-cheating mechanisms (template paraphrasing, perplexity filtering) are insufficient

• Urgent need for robust anti-cheating mechanisms in automatic LLM benchmarks

• Challenges reliability of current evaluation methods

-----

Results 📊:

• AlpacaEval 2.0: 86.5% LC win rate (vs 57.5% verified SOTA)

• Arena-Hard-Auto: 83.0 score (matching 82.6 verified SOTA)

• MT-Bench: 9.55 score (outperforming 8.96 verified SOTA)

• Achieves high win rates without access to benchmark instructions

• Demonstrates transferability across different benchmarks and auto-annotators

Discussion about this video