Four AI buddies team up to create their own exam papers
BenchAgents automates benchmark creation using interacting LLM agents to test complex AI capabilities
📚 https://arxiv.org/abs/2410.22584
🎯 Original Problem:
Creating high-quality benchmarks to evaluate LLMs is slow, expensive, and hard to scale. Current methods either require human annotations or rely on seed datasets, limiting comprehensive evaluation of new capabilities.
-----
🔧 Solution in this Paper:
→ Introduces BenchAgents - a multi-agent framework using 4 specialized LLM agents:
- Planning Agent: Creates high-level specifications and plans
- Data Generation Agent: Implements the plan and generates diverse benchmark data
- Verification Agent: Performs quality checks on examples
- Evaluation Agent: Produces evaluation code and metrics
→ The agents interact through a structured workflow, with optional human-in-the-loop feedback (a minimal pipeline sketch follows this list)
→ Can create benchmarks from scratch without seed datasets
→ Supports automated verification and controllable parameters
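A minimal sketch of how such a four-agent pipeline might be wired together. This is an illustrative assumption, not the paper's implementation: the function names, prompts, and the `call_llm` stub are all hypothetical placeholders.

```python
# Hypothetical BenchAgents-style pipeline; names and prompts are illustrative,
# not taken from the paper's code.

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (wire to your provider of choice)."""
    raise NotImplementedError

def planning_agent(capability: str) -> str:
    # Produce a high-level benchmark specification and generation plan.
    return call_llm(f"Draft a benchmark plan for evaluating: {capability}")

def data_generation_agent(plan: str, n: int) -> list[str]:
    # Implement the plan and generate n diverse candidate instances.
    return [call_llm(f"Following this plan, create instance {i}:\n{plan}") for i in range(n)]

def verification_agent(instance: str) -> bool:
    # Run quality checks (e.g., clarity, completeness, consistency, feasibility).
    verdict = call_llm(
        "Does this instance pass clarity/completeness/consistency/feasibility checks? "
        f"Answer PASS or FAIL.\n{instance}"
    )
    return verdict.strip().upper().startswith("PASS")

def evaluation_agent(plan: str) -> str:
    # Produce evaluation code / metrics for the benchmark.
    return call_llm(f"Write evaluation metrics for benchmarks built from this plan:\n{plan}")

def build_benchmark(capability: str, n: int = 10) -> dict:
    plan = planning_agent(capability)
    candidates = data_generation_agent(plan, n)
    instances = [c for c in candidates if verification_agent(c)]
    metrics = evaluation_agent(plan)
    return {"plan": plan, "instances": instances, "metrics": metrics}
```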
-----
💡 Key Insights:
→ All LLMs struggle with joint constraint satisfaction (illustrated in the sketch after this list)
→ Performance drops as the number of constraints increases
→ Models differ in which constraints they prioritize when not all can be met
→ Failures often involve constraints requiring numerical or logical reasoning
→ Larger models do better on strict constraint-satisfaction tasks, but not necessarily on open-ended ones
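To make the distinction concrete, here is a small illustrative example (not from the paper) of the difference between joint constraint satisfaction, which requires every constraint to hold at once, and a softer metric that grants partial credit:

```python
# Illustrative only: "joint" satisfaction vs. fraction of constraints satisfied.

def joint_satisfaction(results: list[bool]) -> bool:
    # Pass only if all constraints are met simultaneously.
    return all(results)

def fraction_satisfied(results: list[bool]) -> float:
    # Partial credit: share of constraints the output meets.
    return sum(results) / len(results) if results else 0.0

# Example: a scheduling output meeting 3 of 4 constraints
checks = [True, True, False, True]   # e.g., availability, duration, buffer, priority
print(joint_satisfaction(checks))    # False
print(fraction_satisfied(checks))    # 0.75
```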
-----
📊 Results:
→ Created two 2000-instance benchmarks: BA-CALENDAR (calendar scheduling) and BA-TEXT (constrained text generation)
→ Tested on 7 state-of-the-art LLMs, including GPT-4, Claude 3.5, and Gemini 1.5 Pro
→ Quality checks on BA-CALENDAR showed high reliability: clarity (99%), completeness (96%), consistency (96%), feasibility (93%)
→ Model-based verification achieved 90%+ accuracy against human annotations