BoxingGym, proposed in this paper, lets AI agents test scientific hypotheses through interactive experiments.
BoxingGym introduces a benchmark framework to test how well LLMs can design experiments and discover scientific models through interactive experimentation and natural language communication.
-----
https://arxiv.org/abs/2501.01540
🧪 Techniques in this Paper:
→ BoxingGym provides 10 environments implemented as generative probabilistic models drawn from scientific domains such as psychology and ecology
→ Each environment lets agents run interactive experiments and collect data through a flexible language-based interface
→ The framework evaluates experimental design with Expected Information Gain (EIG), measuring how much an experiment reduces uncertainty about the model's parameters (a minimal sketch follows this list)
→ For model discovery, agents must explain their findings to a novice agent, who then makes predictions based only on that explanation
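To make the design-evaluation loop concrete, here is a minimal numpy sketch: a hypothetical one-parameter Bernoulli environment standing in for a BoxingGym generative model, plus a standard nested Monte Carlo estimator of EIG. The `ToyPsychologyEnv` class, the logistic response model, the prior, and the sample sizes are all illustrative assumptions, not the paper's actual environments (which sit behind a natural-language interface rather than a Python method).

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyPsychologyEnv:
    """Hypothetical BoxingGym-style environment: a generative model the agent
    probes by running experiments at chosen design points."""

    def __init__(self, theta=None):
        # Latent parameter theta ~ N(0, 1); unknown to the experimenting agent.
        self.theta = rng.normal() if theta is None else theta

    def run_experiment(self, d):
        """Run one trial at stimulus intensity d; the participant responds
        y ~ Bernoulli(sigmoid(theta * d))."""
        p = 1.0 / (1.0 + np.exp(-self.theta * d))
        return int(rng.random() < p)

def likelihood(y, theta, d):
    """p(y | theta, d) for the Bernoulli response model above."""
    p = 1.0 / (1.0 + np.exp(-theta * d))
    return np.where(y == 1, p, 1.0 - p)

def nested_mc_eig(d, n_outer=2000, n_inner=2000):
    """Nested Monte Carlo estimate of expected information gain:
    EIG(d) = E_{theta, y} [ log p(y|theta,d) - log E_{theta'} p(y|theta',d) ]."""
    theta = rng.normal(0.0, 1.0, size=n_outer)                 # prior draws
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-theta * d)))      # simulated outcomes
    log_lik = np.log(likelihood(y, theta, d))                  # log p(y_n | theta_n, d)
    theta_inner = rng.normal(0.0, 1.0, size=n_inner)           # fresh draws for the marginal
    marg = likelihood(y[:, None], theta_inner[None, :], d).mean(axis=1)
    return float(np.mean(log_lik - np.log(marg)))

# An agent would score candidate designs and run the most informative one.
env = ToyPsychologyEnv()
best = max([0.1, 0.5, 1.0, 2.0, 4.0], key=nested_mc_eig)
print(f"most informative design: d={best}, observed y={env.run_experiment(best)}")
```

Designs that are too weak (responses near chance) or too strong (responses saturated) score low EIG, so the estimator naturally steers the agent toward informative intermediate stimuli.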
-----
💡 Key Insights:
→ Scientific discovery requires tight coupling between experimental design and model building
→ Language-based interfaces enable flexible representation and communication of scientific theories
→ Evaluating via explanation tests both a model's accuracy and its interpretability (see the sketch after this list)
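A minimal sketch of this explanation-based evaluation, assuming a generic `llm` prompt-to-text callable: the novice agent never sees the raw data, only the expert's write-up, and is scored on held-out predictions. The prompt wording and the squared-error metric here are placeholders, not the paper's exact protocol.

```python
from typing import Callable, Sequence

def novice_evaluation(
    explanation: str,
    heldout_queries: Sequence[str],
    true_outcomes: Sequence[float],
    llm: Callable[[str], str],
) -> float:
    """Score an expert's explanation by how well a data-blind novice agent
    can predict held-out outcomes from it."""
    errors = []
    for query, truth in zip(heldout_queries, true_outcomes):
        prompt = (
            "You have run no experiments yourself. A colleague reports:\n"
            f"{explanation}\n\n"
            f"Based only on that report, predict the outcome of: {query}\n"
            "Answer with a single number."
        )
        prediction = float(llm(prompt))          # parse the novice's numeric answer
        errors.append((prediction - truth) ** 2)
    return sum(errors) / len(errors)             # mean squared prediction error
```

Because the novice only receives text, a theory that is accurate but incomprehensible scores as badly as one that is clear but wrong, which is what couples accuracy to interpretability.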
-----
📊 Results:
→ Current LLMs (GPT-4o) struggle with both experimental design and model discovery tasks
→ Augmenting the LLM with explicit statistical model building (the Box's Apprentice agent) doesn't consistently improve performance
→ LLMs do better when their prior domain knowledge is accurate
-----
Are you into AI and LLMs? Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest-signal AI developments. ⬇️
👉 https://rohanpaul.substack.com/