CodeElo introduces a standardized benchmark that evaluates LLM coding abilities by directly submitting solutions to CodeForces, achieving zero false positives and human-comparable ratings.
-----
https://arxiv.org/abs/2501.01257
🤔 Original Problem:
→ Existing code benchmarks can't effectively test LLMs' competition-level coding abilities because hidden test cases are unavailable, special judges are unsupported, and execution environments don't match the original judging platform.
-----
🛠️ Solution in this Paper:
→ CodeElo compiles recent CodeForces contest problems with detailed metadata like divisions, difficulty ratings, and algorithm tags.
→ Solutions are submitted directly to the CodeForces platform for evaluation, ensuring zero false positives and full test case coverage.
→ Introduces a reliable Elo rating system whose scores are directly comparable to human participants' ratings but with lower variance (a rough rating-estimation sketch follows this list).
→ Supports special judges for problems without unique correct outputs.
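To make the Elo-style rating concrete, here is a minimal sketch of how a performance rating could be estimated from one contest's standings. It assumes the standard Codeforces expected-rank formula; the function names, search bounds, and example numbers are illustrative, not code or data from the paper.

```python
# Sketch: estimate an Elo-style performance rating from one contest,
# assuming the standard Codeforces expected-rank formula
#   E[rank] = 1 + sum_i 1 / (1 + 10 ** ((r - r_i) / 400))
# Names and bounds are illustrative, not taken from the CodeElo paper.

def expected_rank(r: float, opponent_ratings: list[float]) -> float:
    """Expected rank of a performance rated r against the given opponents."""
    return 1 + sum(1 / (1 + 10 ** ((r - r_i) / 400)) for r_i in opponent_ratings)

def estimate_rating(actual_rank: float, opponent_ratings: list[float]) -> float:
    """Binary-search for the rating whose expected rank matches the actual rank."""
    lo, hi = 0.0, 4000.0      # assumed plausible rating range
    for _ in range(100):      # expected_rank is monotonically decreasing in r
        mid = (lo + hi) / 2
        if expected_rank(mid, opponent_ratings) > actual_rank:
            lo = mid          # predicted rank worse than actual -> rating is higher
        else:
            hi = mid
    return (lo + hi) / 2

# Example: a model that placed 3rd in a contest with five rated human participants.
humans = [1500.0, 1400.0, 1350.0, 1200.0, 1100.0]
print(round(estimate_rating(actual_rank=3, opponent_ratings=humans)))
```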
-----
💡 Key Insights:
→ Models perform better when coding in C++ than in Python on competition problems
→ Models excel at math and implementation problems but struggle with dynamic programming and trees
→ Most models default to Python (95% usage) while humans prefer C++ (80% usage)
-----
📊 Results:
→ OpenAI o1-mini achieved the best performance with an Elo rating of 1578, surpassing 90% of human participants
→ QwQ-32B-Preview leads open-source models with 1261 rating (60th percentile)
→ Most models struggle even with basic problems, falling into the lowest 20th percentile of human participants