"GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models"

The podcast on this paper is generated with Google's Illuminate.

A new benchmark measures how well AI handles evolving code libraries, i.e., whether models can adapt to changing library versions.

GitChameleon, proposed in this paper, tests whether AI can write code that is correct for a specified library version.

→ GPT-4 achieves only 39.9% Pass@10 (BUT 43.7% with error feedback)

https://arxiv.org/abs/2411.05830

🤖 Original Problem:

Code generation models struggle with version-specific code compatibility as software libraries rapidly evolve. Current benchmarks don't effectively test LLMs' ability to handle different library versions, leading to potential production issues.

-----

🛠️ Solution in this Paper:

→ GitChameleon introduces 116 Python code completion problems across 11 popular libraries, with executable unit tests for each problem

→ The benchmark covers library versions from 2014-2023, focusing on real API changes such as argument modifications, function renames, and semantic changes (a hypothetical example of such a change is sketched after this list)

→ Each problem includes manually curated test cases to verify functional correctness, not just syntactic accuracy

→ The evaluation framework uses Pass@k metrics (see the estimator sketch after the Results) and incorporates error feedback mechanisms to improve model performance

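To make the setup concrete, here is a hypothetical, illustrative problem in the spirit of the benchmark (not taken from GitChameleon itself). The task pins a library version, and the candidate solution is judged by an executable unit test rather than string matching; the removal of pandas' DataFrame.append in pandas 2.0 is used here purely as a familiar example of the kind of API change involved.

```python
# Hypothetical GitChameleon-style problem (illustrative, not from the benchmark).
# Problem: "pandas==2.1 -- combine two DataFrames row-wise."
# DataFrame.append was removed in pandas 2.0, so df1.append(df2) -- common in
# pre-2.0 training data -- raises AttributeError under the pinned version.

import pandas as pd

def combine_rows(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    # Version-correct solution for pandas >= 2.0: pd.concat instead of .append
    return pd.concat([df1, df2], ignore_index=True)

# Executable unit test: checks functional correctness, not surface form.
def test_combine_rows():
    a = pd.DataFrame({"x": [1, 2]})
    b = pd.DataFrame({"x": [3]})
    assert combine_rows(a, b)["x"].tolist() == [1, 2, 3]

test_combine_rows()
```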
-----

🔍 Key Insights:

→ Model size positively correlates with version-specific code generation ability

→ LLMs perform worse on recent library versions (2021-2023) compared to older ones

→ Error feedback improves performance by 5.4% in Pass@1 and 4.7% in Pass@10 (a minimal sketch of this feedback loop follows this list)

→ Most common API changes involve argument modifications and function name changes

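Below is a minimal sketch of the error-feedback mechanism mentioned above, under the assumption that it works as a simple execute-and-retry loop: run the candidate against the unit tests and, on failure, append the error output to the prompt for another attempt. The `generate` callable is a hypothetical stand-in for the model call, not an API from the paper.

```python
# Minimal execute-and-retry sketch of error feedback (assumed loop structure;
# `generate` is a hypothetical LLM call, not an API from the paper).
import os, subprocess, tempfile

def run_candidate(code: str, test_code: str) -> tuple[bool, str]:
    """Run the candidate solution plus its unit tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
        return proc.returncode == 0, proc.stderr
    finally:
        os.remove(path)

def generate_with_feedback(prompt: str, test_code: str, generate, max_rounds: int = 2) -> str:
    code = generate(prompt)
    for _ in range(max_rounds):
        passed, error = run_candidate(code, test_code)
        if passed:
            break
        # Feed the execution error back to the model and ask for a revision.
        code = generate(prompt + "\n\nYour previous attempt failed with:\n" + error)
    return code
```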
-----

📊 Results:

→ GPT-4 achieves only 39.9% Pass@10 (BUT 43.7% with error feedback)

→ DeepSeek-Coder 33B leads in Pass@1 with 35.7%

→ CodeLlama 34B tops Pass@10 with 42.8%

→ Phi-3.5-MoE shows strongest performance among open-source models with 40.5% Pass@10

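For reference, Pass@k numbers like those above are conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021): for each problem, draw n samples, count the c that pass the tests, and estimate pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems. Assuming GitChameleon follows that convention, a minimal sketch:

```python
# Standard unbiased Pass@k estimator (Chen et al., 2021); assumed to match the
# paper's Pass@1 / Pass@10 computation.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Pass@k estimate for one problem: n samples drawn, c of them pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples with 8 passing gives Pass@1 = 0.40 and Pass@10 ~ 0.9996.
print(pass_at_k(20, 8, 1), pass_at_k(20, 8, 10))
```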