
"Evaluating and Aligning CodeLLMs on Human Preference"

The podcast on this paper is generated with Google's Illuminate.

LLMs can write correct code, but do they write code the way humans prefer?

CodeArena bridges the gap between passing tests and pleasing human developers.

This paper introduces CodeArena, a human-curated benchmark to evaluate how well code LLMs align with real-world human preferences in coding tasks.

https://arxiv.org/abs/2412.05210

🤔 Original Problem:

→ Current code LLM evaluations focus mainly on code correctness verified by test cases, ignoring alignment with human preferences in real-world scenarios

→ Existing benchmarks don't effectively measure how well model-generated responses match what humans actually want and expect

-----

🛠️ Solution in this Paper:

→ Created CodeArena, a benchmark with 397 high-quality samples across 40 categories and 44 programming languages.

→ Developed SynCodeInstruct, a synthetic instruction corpus with 20B tokens for training code LLMs.

→ Implemented systematic evaluation of 40+ LLMs using human preference alignment metrics.

→ Used GPT-4 as a judge to evaluate model responses based on human preference criteria.
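
As a rough illustration of that judging step, here is a minimal sketch of a pairwise LLM-as-judge call using the OpenAI Python SDK. The prompt wording, the A/B comparison format, the example question and answers, and the model name are assumptions for illustration only, not CodeArena's exact evaluation protocol.

```python
# Minimal sketch of an LLM-as-judge pairwise comparison, assuming the OpenAI Python SDK.
# Prompt text, scoring scheme, and model id are illustrative assumptions, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are judging which answer a human developer would prefer.
Question:
{question}

Answer A:
{answer_a}

Answer B:
{answer_b}

Consider correctness, clarity, code style, and helpfulness of the explanation.
Reply with exactly one letter: A or B."""


def judge_pair(question: str, answer_a: str, answer_b: str, model: str = "gpt-4o") -> str:
    """Ask the judge model which of two candidate answers a human would prefer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


# Example: compare a candidate model's answer against a baseline answer.
verdict = judge_pair(
    question="Write a Python function that deduplicates a list while preserving order.",
    answer_a="def dedup(xs):\n    seen = set()\n    return [x for x in xs if not (x in seen or seen.add(x))]",
    answer_b="def dedup(xs):\n    return list(set(xs))",
)
print("Preferred answer:", verdict)
```

Running such comparisons over all benchmark items and counting win rates per model is the standard way a judge-based preference score is aggregated.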

-----

💡 Key Insights:

→ A significant performance gap exists between open-source and proprietary code LLMs in human preference alignment

→ Code execution correctness doesn't necessarily correlate with human preference satisfaction

→ Large-scale synthetic instruction data can improve model performance on human preference metrics
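
To make the last insight concrete, here is a hypothetical sketch of how one synthetic instruction sample could be flattened into a chat-style supervised fine-tuning record. The field names, chat template, and output file are assumptions for illustration; the paper does not specify SynCodeInstruct's exact schema here.

```python
# Hypothetical example of converting a synthetic instruction sample into a
# messages-format SFT record; field names and template are assumed, not from the paper.
import json

sample = {
    "instruction": "Refactor this function to use a list comprehension and add a docstring.",
    "input": "def squares(n):\n    out = []\n    for i in range(n):\n        out.append(i * i)\n    return out",
    "output": "def squares(n):\n    \"\"\"Return the squares of 0..n-1.\"\"\"\n    return [i * i for i in range(n)]",
}

# Flatten into the chat-messages format that most open-source SFT frameworks accept.
record = {
    "messages": [
        {"role": "user", "content": f"{sample['instruction']}\n\n{sample['input']}"},
        {"role": "assistant", "content": sample["output"]},
    ]
}

# Append one JSON object per line to a training file (JSONL).
with open("syncodeinstruct_sft.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```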

-----

📊 Results:

→ Open-source code LLMs (like Qwen-Coder) show notable performance gaps compared to closed-source LLMs (Claude, GPT-4)

→ SynCodeInstruct training improved human preference alignment scores

