Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Playback speed

Share post at current time

0:00

Transcript

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

The podcast on this paper is generated with Google's Illuminate.

Rohan Paul

Dec 23, 2024

Bold idea in this paper. Proves how LLMs game getting high scores on prestigious benchmarks like AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench.

Proves how even a “null model” that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates.

https://arxiv.org/abs/2410.07137

Original Problem 🔍:

Automatic LLM benchmarks like AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench are vulnerable to manipulation. High win rates on these benchmarks can significantly boost promotional impact for new language models.

-----

Solution in this Paper 🧠:

• Introduces "null model" that outputs constant, irrelevant responses

• Uses structured responses to confuse auto-annotators' syntactic analysis

• Employs random search to optimize adversarial prefix

• Creates transferable cheating outputs without access to benchmark instructions

• Tests on both GPT-4 and open-source Llama-3 auto-annotators

-----

Key Insights from this Paper 💡:

• Even non-informative responses can achieve top-ranked win rates on popular benchmarks

• Current anti-cheating mechanisms (template paraphrasing, perplexity filtering) are insufficient

• Urgent need for robust anti-cheating mechanisms in automatic LLM benchmarks