LLMs can write correct code, but do they write code the way humans prefer?
CodeArena bridges the gap between passing tests and pleasing human developers.
This paper introduces CodeArena, a human-curated benchmark to evaluate how well code LLMs align with real-world human preferences in coding tasks.
https://arxiv.org/abs/2412.05210
🤔 Original Problem:
→ Current evaluations of code LLMs focus mainly on code correctness via test cases, ignoring alignment with human preferences in real-world scenarios
→ Existing benchmarks don't effectively measure how well model-generated responses match what humans actually want and expect
-----
🛠️ Solution in this Paper:
→ Created CodeArena, a benchmark with 397 high-quality samples across 40 categories and 44 programming languages
→ Developed SynCodeInstruct, a synthetic instruction corpus with 20B tokens for training code LLMs
→ Implemented systematic evaluation of 40+ LLMs using human preference alignment metrics
→ Used GPT-4 as a judge to evaluate model responses against human preference criteria
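To make the judging setup concrete, here is a minimal Python sketch of GPT-4-as-judge scoring of a single response. The rubric wording, 1-10 scale, JSON output format, and the `judge_response` helper are illustrative assumptions, not the paper's actual prompt:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical rubric; the paper's exact judging prompt is not reproduced here.
JUDGE_PROMPT = """You are evaluating a model's answer to a coding task.
Rate how well the answer matches what a human developer would prefer
(clarity, style, helpfulness, correctness) on a 1-10 scale.

Task:
{question}

Model answer:
{answer}

Reply with JSON only: {{"score": <1-10>, "reason": "<one sentence>"}}"""

def judge_response(question: str, answer: str) -> dict:
    """Score one model response with GPT-4 acting as the judge."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # deterministic judging
    )
    return json.loads(completion.choices[0].message.content)

# Example: score a single benchmark-style sample
verdict = judge_response(
    question="Write a Python function that deduplicates a list, preserving order.",
    answer="def dedup(xs):\n    seen = set()\n"
           "    return [x for x in xs if not (x in seen or seen.add(x))]",
)
print(verdict["score"], verdict["reason"])
```

In a full evaluation loop, this scoring step would simply be repeated over all 397 benchmark samples and aggregated per model.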
-----
💡 Key Insights:
→ A significant performance gap exists between open-source and proprietary code LLMs in terms of human preference alignment
→ Code execution correctness doesn't necessarily correlate with human preference satisfaction (see the toy example after this list)
→ Large-scale synthetic instruction data can improve model performance on human preference metrics
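A toy illustration of the correctness-vs-preference gap (my example, not the paper's): both functions below pass an identical unit test, but a human reviewer would likely prefer the second for its naming, docstring, and readability.

```python
# Both pass the same correctness test, yet only one reads well.
def f(a):  # terse but opaque
    return sorted(set(a))[:3]

def three_smallest_unique(values):
    """Return the three smallest distinct values, in ascending order."""
    return sorted(set(values))[:3]

# Identical behavior under test: execution-based benchmarks can't tell them apart.
assert f([5, 1, 1, 3, 2]) == three_smallest_unique([5, 1, 1, 3, 2]) == [1, 2, 3]
```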
-----
📊 Results:
→ Open-source code LLMs (like Qwen-Coder) show notable performance gaps compared to closed-source LLMs (Claude, GPT-4)
→ SynCodeInstruct training improved human preference alignment scores