
MLE-BENCH: EVALUATING MACHINE LEARNING AGENTS ON MACHINE LEARNING ENGINEERING

The podcast on this paper is generated with Google's Illuminate.

Now AI is coming for Kaggle Grandmasters and for machine learning engineering skills in general.

📚 https://arxiv.org/pdf/2410.07095

Results 📊:

• o1-preview (AIDE): Achieves medals in 16.9% of competitions

• GPT-4o (AIDE): Medals in 8.7% of competitions

• Performance doubles from pass@1 to pass@8 for both models (pass@k aggregation sketched after this list)

• Increasing runtime from 24 to 100 hours improves GPT-4o's score from 8.7% to 11.8%
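
A minimal sketch (not the paper's own code) of how such a pass@k number can be aggregated: a competition counts at k if any of the first k independent runs earns a medal. The competition names and medal outcomes below are invented for illustration; the benchmark's actual grading logic lives in its open-source repository.

```python
# Pass@k-style aggregation over repeated runs of an agent on each competition.
# All competition names and medal outcomes are hypothetical placeholders.

def pass_at_k(medal_outcomes: dict[str, list[bool]], k: int) -> float:
    """Fraction of competitions where at least one of the first k runs earned a medal."""
    hits = sum(1 for runs in medal_outcomes.values() if any(runs[:k]))
    return hits / len(medal_outcomes)

# Hypothetical per-competition medal results across 8 independent seeds.
results = {
    "competition-a": [False, True, False, False, True, False, False, False],
    "competition-b": [False] * 8,
    "competition-c": [True, False, False, True, False, False, False, True],
}

print(f"pass@1 = {pass_at_k(results, 1):.1%}")  # 33.3% with this toy data
print(f"pass@8 = {pass_at_k(results, 8):.1%}")  # 66.7% with this toy data
```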

-----

This Paper 💡:

• Introduces MLE-bench: 75 offline Kaggle competitions for evaluating ML engineering capabilities

• Covers diverse tasks: NLP, computer vision, signal processing

• Includes human baselines from Kaggle leaderboards

• Evaluates agents using open-source scaffolds (AIDE, MLAB, OpenHands)

• Measures performance as the percentage of competitions in which an agent's submission earns a medal (grading idea sketched after this list)

• Implements safeguards against cheating and contamination
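
As a rough illustration of the grading step, the sketch below maps a submission's score on the held-out test split to a medal tier by comparing it against medal thresholds derived from the competition's leaderboard. The thresholds, metric direction, and example score are placeholders; actual thresholds follow Kaggle's medal rules and the benchmark's published grading code.

```python
# Per-competition grading sketch: compare a submission's test-split score against
# leaderboard-derived medal thresholds. Threshold values here are invented.

from dataclasses import dataclass
from typing import Optional

@dataclass
class MedalThresholds:
    gold: float
    silver: float
    bronze: float

def medal_for(score: float, t: MedalThresholds, higher_is_better: bool = True) -> Optional[str]:
    """Return the medal tier a score would earn, or None if it falls below bronze."""
    beats = (lambda a, b: a >= b) if higher_is_better else (lambda a, b: a <= b)
    if beats(score, t.gold):
        return "gold"
    if beats(score, t.silver):
        return "silver"
    if beats(score, t.bronze):
        return "bronze"
    return None

# Hypothetical competition scored by accuracy (higher is better).
thresholds = MedalThresholds(gold=0.92, silver=0.89, bronze=0.86)
print(medal_for(0.90, thresholds))  # -> "silver"
```

The headline number reported in the Results section is then simply the share of the 75 competitions for which this per-competition check returns any medal.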

-----

Key Insights from this Paper 💡:

• Best-performing setup: o1-preview with AIDE scaffold

• Performance improves with multiple attempts (pass@k metric)

• Agents struggle with debugging and recovering from missteps

• No significant correlation found between model familiarity and performance

• Obfuscating competition descriptions doesn't significantly impact performance
