"CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings"

The podcast below on this paper was generated with Google's Illuminate.

CodeElo introduces a standardized benchmark that evaluates LLMs' competition-level coding abilities by submitting their solutions directly to CodeForces, achieving zero false positives and Elo ratings directly comparable to human participants.

-----

https://arxiv.org/abs/2501.01257

🤔 Original Problem:

→ Existing code benchmarks cannot effectively test LLMs' competition-level coding abilities: private test cases are unavailable, special judges are unsupported, and execution environments are misaligned.

-----

🛠️ Solution in this Paper:

→ CodeElo compiles recent CodeForces contest problems with detailed metadata like divisions, difficulty ratings, and algorithm tags.

→ Solutions are submitted directly to the CodeForces platform for evaluation, ensuring zero false positives and full test case coverage.

→ Introduces a reliable Elo rating calculation whose results are directly comparable to human participants' ratings but with lower variance (see the rating sketch after this list).

→ Supports special judges for problems that do not have a unique correct output (a toy checker sketch also follows below).
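
For intuition, here is a minimal sketch of how an Elo-style performance rating can be estimated from a single contest: compute the expected rank implied by a candidate rating against the other participants' ratings, then binary-search for the rating whose expected rank matches the achieved rank. It uses the standard Elo win-probability formula and is an illustrative approximation, not necessarily the paper's exact rating procedure; all function names and numbers are hypothetical.

```python
from typing import List

def expected_rank(rating: float, opponent_ratings: List[float]) -> float:
    # Expected rank = 1 + expected number of opponents finishing ahead of us,
    # where P(opponent with rating r beats us) = 1 / (1 + 10 ** ((rating - r) / 400)).
    return 1.0 + sum(
        1.0 / (1.0 + 10.0 ** ((rating - r) / 400.0)) for r in opponent_ratings
    )

def estimate_rating(actual_rank: float, opponent_ratings: List[float],
                    lo: float = 0.0, hi: float = 4000.0, iters: int = 60) -> float:
    # expected_rank() decreases monotonically as the rating grows, so we can
    # bisect for the rating whose expected rank equals the achieved rank.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expected_rank(mid, opponent_ratings) > actual_rank:
            lo = mid  # predicted rank is worse than the actual one -> rating is higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical example: other participants' ratings in one contest and the
# model's achieved rank; prints the implied performance rating.
opponents = [1100.0, 1350.0, 1500.0, 1620.0, 1800.0, 2100.0]
print(round(estimate_rating(actual_rank=3, opponent_ratings=opponents)))
```

A special judge works differently from a plain output diff: it checks whether the contestant's answer satisfies the problem's constraints, so any of several correct outputs is accepted. The toy problem below (print any pair of distinct 1-based indices whose values sum to a target) is a made-up example, not one from the benchmark.

```python
def special_judge(input_text: str, contestant_output: str) -> bool:
    # Toy checker: accept ANY pair of distinct 1-based indices whose values
    # sum to the target, since the correct output is not unique.
    lines = input_text.strip().split("\n")
    n, target = map(int, lines[0].split())
    values = list(map(int, lines[1].split()))
    try:
        i, j = map(int, contestant_output.split())
    except ValueError:
        return False
    if not (1 <= i <= n and 1 <= j <= n and i != j):
        return False
    return values[i - 1] + values[j - 1] == target

# Example: both "1 3" and "3 1" are accepted for the same test case.
print(special_judge("4 7\n3 5 4 2", "1 3"), special_judge("4 7\n3 5 4 2", "3 1"))
```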

-----

💡 Key Insights:

→ Models perform better in C++ than in Python for competition coding

→ Models excel at math and implementation problems but struggle with dynamic programming and trees

→ Most models default to Python (95% usage) while humans prefer C++ (80% usage)

-----

📊 Results:

→ OpenAI o1-mini achieved the best performance with an Elo rating of 1578, surpassing 90% of human participants

→ QwQ-32B-Preview leads open-source models with a rating of 1261 (60th percentile)

→ Most models struggle with basic problems, falling into the lowest 20th percentile
