Four AI buddies team up to create their own exam papers
BenchAgents automates benchmark creation using interacting LLM agents to test complex AI capabilities
📚 https://arxiv.org/abs/2410.22584
🎯 Original Problem:
Creating high-quality benchmarks to evaluate LLMs is slow, expensive, and hard to scale. Current methods either require human annotations or rely on seed datasets, limiting comprehensive evaluation of new capabilities.
-----
🔧 Solution in this Paper:
→ Introduces BenchAgents - a multi-agent framework using 4 specialized LLM agents:
- Planning Agent: Creates high-level specifications and plans
- Data Generation Agent: Implements the plan and generates diverse benchmark data
- Verification Agent: Performs quality checks on examples
- Evaluation Agent: Produces evaluation code and metrics
→ The agents interact through a structured workflow, with optional human-in-the-loop feedback (a minimal pipeline sketch follows this list)
→ Can create benchmarks from scratch without seed datasets
→ Supports automated verification and controllable parameters
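A minimal sketch of how such a four-agent pipeline might be wired together. This is an illustrative assumption, not the paper's implementation: the function names, prompts, and the `call_llm` stub are all hypothetical placeholders.

```python
# Hypothetical BenchAgents-style pipeline; names and prompts are illustrative,
# not taken from the paper's code.

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (wire to your provider of choice)."""
    raise NotImplementedError

def planning_agent(capability: str) -> str:
    # Produce a high-level benchmark specification and generation plan.
    return call_llm(f"Draft a benchmark plan for evaluating: {capability}")

def data_generation_agent(plan: str, n: int) -> list[str]:
    # Implement the plan and generate n diverse candidate instances.
    return [call_llm(f"Following this plan, create instance {i}:\n{plan}") for i in range(n)]

def verification_agent(instance: str) -> bool:
    # Run quality checks (e.g., clarity, completeness, consistency, feasibility).
    verdict = call_llm(
        "Does this instance pass clarity/completeness/consistency/feasibility checks? "
        f"Answer PASS or FAIL.\n{instance}"
    )
    return verdict.strip().upper().startswith("PASS")

def evaluation_agent(plan: str) -> str:
    # Produce evaluation code / metrics for the benchmark.
    return call_llm(f"Write evaluation metrics for benchmarks built from this plan:\n{plan}")

def build_benchmark(capability: str, n: int = 10) -> dict:
    plan = planning_agent(capability)
    candidates = data_generation_agent(plan, n)
    instances = [c for c in candidates if verification_agent(c)]
    metrics = evaluation_agent(plan)
    return {"plan": plan, "instances": instances, "metrics": metrics}
```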
-----
💡 Key Insights:
→ All LLMs struggle with joint constraint satisfaction (illustrated in the sketch after this list)
→ Performance drops as the number of constraints increases
→ Models differ in which constraints they prioritize when not all can be met
→ Failures often involve constraints requiring numerical or logical reasoning
→ Larger models do better on strict constraint-satisfaction tasks, but not necessarily on open-ended ones
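To make the distinction concrete, here is a small illustrative example (not from the paper) of the difference between joint constraint satisfaction, which requires every constraint to hold at once, and a softer metric that grants partial credit:

```python
# Illustrative only: "joint" satisfaction vs. fraction of constraints satisfied.

def joint_satisfaction(results: list[bool]) -> bool:
    # Pass only if all constraints are met simultaneously.
    return all(results)

def fraction_satisfied(results: list[bool]) -> float:
    # Partial credit: share of constraints the output meets.
    return sum(results) / len(results) if results else 0.0

# Example: a scheduling output meeting 3 of 4 constraints
checks = [True, True, False, True]   # e.g., availability, duration, buffer, priority
print(joint_satisfaction(checks))    # False
print(fraction_satisfied(checks))    # 0.75
```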
-----
📊 Results:
→ Created two 2000-instance benchmarks: BA-CALENDAR (calendar scheduling) and BA-TEXT (constrained text generation)
→ Tested on 7 state-of-the-art LLMs, including GPT-4, Claude 3.5, and Gemini 1.5 Pro
→ Quality checks on BA-CALENDAR showed high reliability: clarity (99%), completeness (96%), consistency (96%), feasibility (93%)
→ Model-based verification achieved 90%+ accuracy against human annotations