DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
DA-Code, a new benchmark, shows that top LLMs achieve only ~30% success on real-world data science tasks
Original Problem 🔍:
DA-Code addresses the challenge of evaluating LLMs on complex, agent-based data science tasks. Existing benchmarks lack real-world complexity and don't cover the full data science pipeline.
Solution in this Paper 🛠️:
• DA-Code: 500 challenging data science tasks across data wrangling, exploratory analysis, and machine learning
• Uses real, diverse data sources and requires multiple programming languages
• Controllable, executable environment mimicking real-world scenarios
• Meticulously designed evaluation suite for accuracy and robustness
• DA-Agent: a baseline framework with a Docker-based environment, an action space covering Bash/Python/SQL, and a memory window (a minimal sketch follows below)
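
To make the agent-loop idea concrete, here is a minimal sketch of a DA-Agent-style loop with a bounded memory window and a Bash action dispatcher. This is an illustrative assumption, not the paper's actual implementation: `call_llm`, `format_prompt`, the window size, and the 60-second timeout are all hypothetical, and Python/SQL actions would be routed analogously through the same dispatcher.

```python
import subprocess
from collections import deque

MEMORY_WINDOW = 5  # keep only the last N (action, payload, observation) triples

def format_prompt(task: str, memory: deque) -> str:
    # Hypothetical prompt builder: task description plus truncated recent history.
    history = "\n".join(f"[{a}] {p}\n-> {o[:500]}" for a, p, o in memory)
    return f"Task: {task}\nRecent history:\n{history}\nNext action:"

def execute_action(action_type: str, payload: str) -> str:
    # Dispatch one action to the sandbox. Only Bash is shown here;
    # Python/SQL actions would call an interpreter or database client.
    if action_type == "bash":
        result = subprocess.run(
            payload, shell=True, capture_output=True, text=True, timeout=60
        )
        return result.stdout + result.stderr
    raise NotImplementedError(f"unsupported action type: {action_type}")

def run_agent(task: str, call_llm, max_steps: int = 20) -> None:
    # call_llm is a placeholder: it takes a prompt and returns
    # an (action_type, payload) pair, e.g. ("bash", "ls data/").
    memory = deque(maxlen=MEMORY_WINDOW)  # old steps fall out of the window
    for _ in range(max_steps):
        action_type, payload = call_llm(format_prompt(task, memory))
        if action_type == "terminate":
            return
        observation = execute_action(action_type, payload)
        memory.append((action_type, payload, observation))
```

The bounded `deque` captures the "memory window" idea: the model only ever sees the most recent steps, which keeps the context short on long multi-step tasks.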
Key Insights from this Paper 💡:
• Even advanced LLMs struggle with complex data science tasks
• Performance decreases with increasing task difficulty
• Models perform better on ML tasks than on data wrangling and EDA
• Open-source LLMs show significant performance gap vs closed-source models
• Common issues: hallucination, inability to follow instructions, persistent code errors
Results 📊:
• Best LLM (GPT-4) achieved only 30.5% accuracy on DA-Code
• Performance breakdown: Data Wrangling (30.4%), ML (48.4%), EDA (24.6%)
• Completion rate: 99.4% (GPT-4), 67.2% (Mixtral-8x22B)
• Average steps: 7.3 (GPT-4), 11.1 (Mixtral-8x22B)
• Executable code: 76.8% (GPT-4), 55.1% (Mixtral-8x22B)
🧠 How does DA-Code differ from existing code generation benchmarks?
• It focuses on agent-based tasks that require autonomous decision-making and problem-solving, rather than just translating instructions to code
• It uses real-world, diverse data sources beyond just notebooks or code completion
• It requires using multiple programming languages like Python, SQL, and Bash
• It covers the full data science pipeline, including data wrangling, exploratory analysis, and machine learning
• It provides an interactive sandbox environment for evaluation (see the sketch below)
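
As a rough illustration of what one sandbox step could look like, the sketch below runs agent commands inside a long-lived Docker container using the docker-py SDK. The container image, mount path, and command are assumptions for illustration; the benchmark's actual container configuration is not described here.

```python
import docker

client = docker.from_env()

# Start a long-lived container so files and installed packages
# persist across agent steps. Image and volume path are illustrative.
container = client.containers.run(
    "python:3.11-slim",
    command="sleep infinity",
    volumes={"/path/to/task_data": {"bind": "/workspace", "mode": "rw"}},
    working_dir="/workspace",
    detach=True,
)

# Each agent action becomes an `exec` into the running container,
# and its output is fed back to the model as the next observation.
exit_code, output = container.exec_run("python -c 'print(1 + 1)'")
print(exit_code, output.decode())

container.remove(force=True)  # tear down the sandbox when the task ends
```

Running every action inside one persistent container is what makes the environment both controllable (isolated, reproducible) and realistic (stateful across steps, like a real analyst's shell).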