ScienceAgentBench provides 102 diverse data-driven discovery tasks from 44 peer-reviewed publications across 4 disciplines.
📚 https://arxiv.org/pdf/2410.05080v1
The Original Problem:
LLM-based language agents show promise for automating scientific discovery, but their true capabilities remain unclear. Rigorous evaluation on individual tasks in scientific workflows is needed before claiming end-to-end automation.
-----
Solution in this Paper 💡:
Key aspects:
• Each task undergoes multiple rounds of manual validation by annotators and subject-matter experts to ensure annotation quality and scientific plausibility
• Tasks formulated as code generation problems (see the sketch after this list)
• Outputs unified as self-contained Python programs
• Multiple evaluation metrics: valid execution rate, task success rate, and similarity to the gold-standard program
• Expert validation and rubric-based scoring
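To make the code-generation framing concrete, here is a minimal sketch in Python. The task fields, paths, and success check are illustrative assumptions, not the benchmark's actual schema; the real harness applies task-specific success criteria and also scores similarity to an expert-annotated gold program.

```python
import pathlib
import subprocess
import sys
import tempfile

# Illustrative task record; field names and paths are assumptions,
# not the benchmark's actual schema.
task = {
    "instruction": "Analyze the dataset and save the results to pred_results/output.csv",
    "dataset_dir": "benchmark/datasets/example_dataset",
    "expected_output": "pred_results/output.csv",
}

def run_generated_program(program_code: str, timeout: int = 300) -> bool:
    """Write the agent-generated program to disk and execute it,
    returning True only if it runs to completion without error."""
    path = pathlib.Path(tempfile.mkdtemp()) / "pred_program.py"
    path.write_text(program_code)
    try:
        result = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def task_succeeded(task: dict) -> bool:
    """Toy success check: the expected output file was produced.
    The real benchmark uses task-specific criteria and rubric scoring."""
    return pathlib.Path(task["expected_output"]).exists()
```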
-----
Key Insights from this Paper 💡:
• Current agents struggle with complex data processing and domain-specific tools
• Expert-provided knowledge gives only modest gains and does not consistently improve performance
• Simple designs like self-debug can outperform more complex agent frameworks (a minimal sketch follows)
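For intuition on why such a lightweight loop works, here is a minimal self-debug sketch, assuming a hypothetical call_llm helper that returns a complete Python program as a string (the actual agents differ in prompting and tooling):

```python
import pathlib
import subprocess
import sys
import tempfile

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., Claude-3.5-Sonnet via its API);
    assumed to return a complete Python program as a string."""
    raise NotImplementedError

def self_debug(task_prompt: str, max_rounds: int = 3) -> str:
    """Draft a program, run it, and on failure feed the error output
    back to the model for another attempt, up to max_rounds times."""
    program = call_llm(task_prompt)
    for _ in range(max_rounds):
        path = pathlib.Path(tempfile.mkdtemp()) / "attempt.py"
        path.write_text(program)
        try:
            result = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True, text=True, timeout=300,
            )
            if result.returncode == 0:
                break  # program ran cleanly; stop debugging
            feedback = result.stderr[-2000:]
        except subprocess.TimeoutExpired:
            feedback = "Program timed out after 300 seconds."
        program = call_llm(
            f"{task_prompt}\n\nYour previous program failed with:\n{feedback}\n"
            "Please return a corrected, complete Python program."
        )
    return program
```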
-----
Results 📊:
Best agent (Claude-3.5-Sonnet with self-debug):
• 32.4% of tasks solved independently
• 34.3% solved with expert-provided knowledge
• Outperforms OpenHands CodeAct (21.6% SR) while costing 17x less