Now AI is coming for Kaggle Grandmasters, and for machine learning engineering skills in general.
📚 https://arxiv.org/pdf/2410.07095
Results 📊:
• o1-preview (AIDE): Achieves medals in 16.9% of competitions
• GPT-4o (AIDE): Medals in 8.7% of competitions
• Performance doubles from pass@1 to pass@8 for both models (see the pass@k sketch after this list)
• Increasing runtime from 24 to 100 hours improves GPT-4o's score from 8.7% to 11.8%
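pass@k here is the probability that at least one of k independent runs earns a medal. A minimal sketch of the standard unbiased estimator from Chen et al. (2021), the usual way to compute pass@k from n sampled attempts; the function name and example numbers are illustrative, not the paper's actual code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k attempts, drawn without replacement from n total
    attempts of which c succeeded, is a success."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 8 runs of one competition, 2 of which medaled.
print(pass_at_k(n=8, c=2, k=1))  # pass@1 = 0.25
print(pass_at_k(n=8, c=2, k=8))  # pass@8 = 1.0
```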
-----
This Paper 💡:
• Introduces MLE-bench: 75 offline Kaggle competitions for evaluating ML engineering capabilities
• Covers diverse tasks: NLP, computer vision, signal processing
• Includes human baselines from Kaggle leaderboards
• Evaluates agents using open-source scaffolds (AIDE, MLAB, OpenHands)
• Measures performance as the percentage of competitions in which an agent achieves a medal (see the sketch after this list)
• Implements safeguards against cheating and contamination
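The headline metric is simple: the share of competitions where the agent's best submission clears the medal cutoff derived from the original Kaggle leaderboard. A minimal sketch of that aggregation, assuming hypothetical per-competition records (field names like `bronze_threshold` are illustrative, not MLE-bench's actual schema):

```python
# Hypothetical per-competition records: the agent's best score plus the
# bronze-medal cutoff taken from the competition's Kaggle leaderboard.
results = [
    {"name": "comp_a", "score": 0.91, "bronze_threshold": 0.88, "higher_is_better": True},
    {"name": "comp_b", "score": 0.42, "bronze_threshold": 0.35, "higher_is_better": False},
]

def earned_medal(r: dict) -> bool:
    """A medal (bronze or better) is earned when the agent's score
    clears the bronze cutoff, respecting the metric's direction."""
    if r["higher_is_better"]:
        return r["score"] >= r["bronze_threshold"]
    return r["score"] <= r["bronze_threshold"]

medal_rate = sum(earned_medal(r) for r in results) / len(results)
print(f"Medals in {medal_rate:.1%} of competitions")  # the headline number
```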
-----
Key Insights from this Paper 💡:
• Best-performing setup: o1-preview with AIDE scaffold
• Performance improves with multiple attempts (pass@k metric)
• Agents struggle with debugging and recovering from missteps
• No significant correlation found between model familiarity and performance
• Obfuscating competition descriptions doesn't significantly impact performance