Bold idea in this paper. Proves how LLMs game getting high scores on prestigious benchmarks like AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench.
Proves how even a “null model” that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates.
https://arxiv.org/abs/2410.07137
Original Problem 🔍:
Automatic LLM benchmarks like AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench are vulnerable to manipulation. High win rates on these benchmarks can significantly boost promotional impact for new language models.
-----
Solution in this Paper 🧠:
• Introduces "null model" that outputs constant, irrelevant responses
• Uses structured responses to confuse auto-annotators' syntactic analysis
• Employs random search to optimize adversarial prefix
• Creates transferable cheating outputs without access to benchmark instructions
• Tests on both GPT-4 and open-source Llama-3 auto-annotators
-----
Key Insights from this Paper 💡:
• Even non-informative responses can achieve top-ranked win rates on popular benchmarks
• Current anti-cheating mechanisms (template paraphrasing, perplexity filtering) are insufficient
• Urgent need for robust anti-cheating mechanisms in automatic LLM benchmarks
• Challenges reliability of current evaluation methods
-----
Results 📊:
• AlpacaEval 2.0: 86.5% LC win rate (vs 57.5% verified SOTA)
• Arena-Hard-Auto: 83.0 score (matching 82.6 verified SOTA)
• MT-Bench: 9.55 score (outperforming 8.96 verified SOTA)
• Achieves high win rates without access to benchmark instructions
• Demonstrates transferability across different benchmarks and auto-annotators
Share this post