FullStack Bench introduces a comprehensive code evaluation dataset spanning 16 programming languages and multiple domains, addressing the limitations of existing benchmarks that focus on narrow application areas. The paper also presents SandboxFusion, an execution environment for efficient evaluation across diverse programming languages and frameworks.
-----
https://arxiv.org/abs/2412.00535
Original Problem 🤔:
→ Current code evaluation benchmarks cover limited domains, making it difficult to assess LLMs' real-world coding capabilities across different programming scenarios.
→ Existing sandboxes lack support for diverse programming languages and frameworks, especially for front-end and machine learning tasks.
-----
Solution in this Paper 🛠️:
→ Created FullStack Bench with 3374 problems covering 11 main application domains including basic programming, data analysis, and machine learning.
→ Developed SandboxFusion, an execution environment supporting 23 programming languages and a wide range of packages for comprehensive evaluation (see the client sketch after this list).
→ Implemented rigorous quality control through expert verification and cross-validation.
→ Used a voting method over six selected models' results to assign each problem a difficulty level (see the sketch right after this list).
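A minimal sketch of how such a vote could assign difficulty labels. The paper votes with six models' pass/fail results per problem; the bucket thresholds below (easy/medium/hard cut-offs) are illustrative assumptions, not the authors' exact criteria.

```python
# Illustrative majority-vote difficulty labeler.
# Thresholds below are assumptions for illustration, not the paper's exact cut-offs.
from typing import Dict, List

def label_difficulty(pass_results: List[bool]) -> str:
    """Map per-model pass/fail votes for one problem to a difficulty label."""
    passes = sum(pass_results)  # how many of the six voting models solved it
    if passes >= 5:
        return "easy"
    if passes >= 2:
        return "medium"
    return "hard"

def label_benchmark(results: Dict[str, List[bool]]) -> Dict[str, str]:
    """results maps problem_id -> six booleans (one per voting model)."""
    return {pid: label_difficulty(votes) for pid, votes in results.items()}

if __name__ == "__main__":
    demo = {
        "prob_001": [True, True, True, True, True, False],    # most models pass -> easy
        "prob_002": [True, False, True, False, False, False],  # mixed -> medium
        "prob_003": [False] * 6,                               # no model passes -> hard
    }
    print(label_benchmark(demo))
```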
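SandboxFusion runs as an execution service that an evaluation harness calls to compile and run generated code. The sketch below shows what such a client call could look like; the host, endpoint path, and payload/response field names are assumptions for illustration, not the project's documented API.

```python
# Minimal sketch of submitting a snippet to a SandboxFusion-style execution service.
# The URL, endpoint path, and JSON field names are assumed, not taken from the docs.
import requests

SANDBOX_URL = "http://localhost:8080/run_code"  # assumed local deployment

def run_snippet(code: str, language: str = "python", timeout_s: int = 30) -> dict:
    """Submit a code snippet to the sandbox and return its execution report."""
    resp = requests.post(
        SANDBOX_URL,
        json={"code": code, "language": language, "run_timeout": timeout_s},
        timeout=timeout_s + 5,
    )
    resp.raise_for_status()
    return resp.json()  # assumed fields: status, stdout, stderr

if __name__ == "__main__":
    report = run_snippet('print("hello from the sandbox")')
    print(report)
```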
-----
Key Insights 💡:
→ Model performance varies significantly across domains, with mathematics showing the largest performance gap
→ Compile pass rates correlate positively with test pass rates but don't guarantee test success (see the correlation sketch after this list)
→ The prompt language affects model performance, and the size of the effect varies from model to model
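To make the compile-vs-test insight concrete, here is a small sketch that checks how per-model compile pass rates track test pass rates. The per-model numbers are made-up placeholders, not figures from the paper.

```python
# Sketch: does code that compiles also tend to pass the tests?
# The rates below are hypothetical placeholders for illustration only.
from statistics import correlation  # Python 3.10+

models = ["model_a", "model_b", "model_c", "model_d"]
compile_pass_rate = [0.95, 0.90, 0.88, 0.80]  # fraction of samples that compile/run
test_pass_rate    = [0.62, 0.55, 0.48, 0.35]  # fraction that also pass the unit tests

# A positive correlation means compilable code is a useful but not sufficient
# signal: a sample can compile cleanly and still fail its tests.
print(f"Pearson r = {correlation(compile_pass_rate, test_pass_rate):.3f}")
```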
-----
Results 📊:
→ OpenAI o1-preview leads with 66.12% overall performance
→ DeepSeekCoder-v2-Instruct achieves 56.37%, best among open-source models
→ Performance scales with model size for most families except Qwen 2.5