
ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

The podcast on this paper is generated with Google's Illuminate.

ScienceAgentBench provides 102 diverse data-driven discovery tasks from 44 peer-reviewed publications across 4 disciplines.

📚 https://arxiv.org/pdf/2410.05080v1

The Original Problem:

LLM-based language agents show promise for automating scientific discovery, but their true capabilities remain unclear. Rigorous evaluation on individual tasks in scientific workflows is needed before claiming end-to-end automation.

-----

Solution in this Paper 💡:

Key aspects:

• Each task goes through multiple rounds of manual validation by annotators and subject-matter experts to ensure annotation quality and scientific plausibility.

• Tasks formulated as code generation problems

• Outputs unified as Python programs

• Multiple evaluation metrics: program execution, task-specific success criteria, and similarity to the gold-standard program (a sketch of such an evaluation harness follows this list)

• Expert validation and rubric-based scoring
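To make the three evaluation signals concrete, here is a minimal sketch of an evaluation harness. The layout and helper names (run_program, check_success, the program paths) are illustrative assumptions, not the benchmark's actual API, and the textual similarity here is a simple stand-in for a learned code-similarity metric.

```python
# Hypothetical evaluation harness: runs a generated program, applies a
# task-specific success check, and scores similarity to the gold program.
import subprocess
import difflib
from pathlib import Path


def run_program(path: Path, timeout: int = 600) -> bool:
    """Execute a candidate Python program; True if it exits cleanly."""
    try:
        result = subprocess.run(
            ["python", str(path)], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def code_similarity(candidate: Path, gold: Path) -> float:
    """Rough textual similarity to the gold program (a simple proxy for a
    learned code-similarity metric)."""
    return difflib.SequenceMatcher(
        None, candidate.read_text(), gold.read_text()
    ).ratio()


def evaluate_task(candidate: Path, gold: Path, check_success) -> dict:
    """Combine the three signals: execution, success criteria, similarity."""
    executed = run_program(candidate)
    return {
        "valid_execution": executed,
        "success": executed and check_success(),  # task-specific rubric
        "similarity_to_gold": code_similarity(candidate, gold),
    }
```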

-----

Key Insights from this Paper 💡:

• Current agents struggle with complex data processing and domain-specific tools

• Expert knowledge helps but doesn't always improve performance

• Simple agent frameworks like self-debug can outperform more complex ones (a sketch of the self-debug loop follows below)
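The self-debug idea is simply: generate a program, run it, and if it crashes, feed the traceback back to the model for another attempt. Below is a minimal sketch under that assumption; `generate_code(prompt)` is a hypothetical LLM call, and the retry budget and prompt wording are illustrative, not the agent's actual configuration.

```python
# Hypothetical self-debug loop: iteratively repair a generated program
# using its own runtime errors.
import subprocess
import tempfile

MAX_ROUNDS = 3  # illustrative retry budget


def run_once(code: str) -> tuple[bool, str]:
    """Write the program to a temp file, run it, return (ok, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, text=True)
    return result.returncode == 0, result.stderr


def self_debug(task_prompt: str, generate_code) -> str:
    """Generate a program, then repeatedly revise it from its own tracebacks."""
    code = generate_code(task_prompt)
    for _ in range(MAX_ROUNDS):
        ok, stderr = run_once(code)
        if ok:
            break
        # Feed the error back so the model can revise its own program.
        code = generate_code(
            f"{task_prompt}\n\nYour previous program failed with:\n{stderr}\n"
            "Please return a corrected, complete Python program."
        )
    return code
```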

-----

Results 📊:

Best agent (Claude 3.5 Sonnet with self-debug):

• 32.4% of tasks solved independently

• 34.3% of tasks solved with expert-provided knowledge

• Outperforms OpenHands CodeAct (21.6% success rate) while costing 17x less
