
"Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'"

This podcast was generated from the paper with Google's Illuminate, a specialized tool that creates podcasts exclusively from arXiv papers.

REPOCOD, the benchmark proposed in this paper, shows that LLMs can't replace programmers yet by testing them on real-world coding scenarios.

📚 https://arxiv.org/abs/2410.21647

🎯 Original Problem:

Existing benchmarks show LLMs achieving >90% accuracy in code generation, raising the question: can they replace human programmers? But these benchmarks use simple, single-line, or manually crafted problems that don't reflect the complexity of real-world software development.

-----

🛠️ Solution in this Paper:

→ Created REPOCOD: a benchmark of 980 code generation problems from 11 popular Python projects, consisting of complex, real-world code completion tasks with long solutions

→ 58% of the problems require file- or repository-level context understanding

→ Each instance includes ~313 developer-written test cases on average for validation

→ Uses a 3-stage automated pipeline: repository selection, target function selection, and test case collection (a minimal sketch follows this list)

→ Average solution length: 331.6 tokens (the longest among existing benchmarks)

→ Average cyclomatic complexity: 9.00 (the highest among existing benchmarks)
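
Below is a minimal, hypothetical sketch of what such a 3-stage pipeline could look like. The stage boundaries follow the paper, but every helper, threshold, and data field here is an illustrative assumption, not the authors' implementation; only radon (used for the cyclomatic complexity measurement) is a real library.

```python
# Hypothetical sketch of a REPOCOD-style 3-stage collection pipeline.
# Stage names follow the paper; all thresholds, fields, and helpers
# below are illustrative assumptions, not the authors' code.
from radon.complexity import cc_visit  # real library: pip install radon

EXAMPLE_SOURCE = '''
def resample(values, factor):
    out = []
    for i, v in enumerate(values):
        if i % factor == 0:
            out.append(v)
        elif v < 0:
            out.append(0)
    return out
'''

def select_repositories(candidates):
    # Stage 1: keep popular, well-tested Python projects (criteria assumed).
    return [r for r in candidates if r["stars"] > 1000 and r["has_tests"]]

def select_target_functions(source, min_complexity=3):
    # Stage 2: pick non-trivial functions; complexity measured with radon.
    return [
        {"name": b.name, "complexity": b.complexity}
        for b in cc_visit(source)            # radon's real API
        if b.complexity >= min_complexity    # threshold is an assumption
    ]

def collect_test_cases(repo, function_name):
    # Stage 3: gather developer-written tests that exercise the target
    # function (the exact tracing mechanism is assumed here).
    return [t for t in repo["tests"] if function_name in t["covered"]]

repos = select_repositories([
    {"name": "example/proj", "stars": 5000, "has_tests": True,
     "tests": [{"id": "test_resample", "covered": ["resample"]}]},
])
for fn in select_target_functions(EXAMPLE_SOURCE):
    print(fn, collect_test_cases(repos[0], fn["name"]))
```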

-----

🔍 Key Insights:

→ LLMs struggle with repository-level context (at most 20% pass@1)

→ For commercial models, current-file context works better than dense or BM25 retrieval (a retrieval sketch follows this list)

→ Each model solves a distinct subset of problems, showing REPOCOD's diversity

→ Repository-level code generation needs significant improvement
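
To make the retrieval comparison concrete, here is a minimal sketch of BM25-based context retrieval over repository files, using the real rank_bm25 library; the code chunks and query are invented for illustration, and the paper's actual retrieval setup may differ.

```python
# Minimal sketch of BM25 retrieval over repository code chunks, one of
# the context strategies compared in the paper. Uses the real rank_bm25
# library (pip install rank-bm25); corpus and query are invented.
from rank_bm25 import BM25Okapi

# Pretend each string is a code chunk from elsewhere in the repository.
repo_chunks = [
    "def save_model(path, weights): ...",
    "def load_config(path): ...",
    "class Trainer: def fit(self, data): ...",
]
bm25 = BM25Okapi([chunk.split() for chunk in repo_chunks])

# The query is the signature of the function the LLM must complete.
query = "def load_settings(path):".split()
scores = bm25.get_scores(query)

# Take the top-scoring chunks as retrieved context for the prompt.
top_context = [c for _, c in sorted(zip(scores, repo_chunks), reverse=True)[:2]]
print(top_context)
```

The current-file baseline instead simply includes the rest of the file containing the target function, with no retrieval step at all.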

-----

📊 Results:

→ No model achieved more than 30% pass@1 accuracy (pass@1 is estimated as sketched after this list)

→ GPT-4 performed best with 27.35% pass@1

→ Performance drops significantly for problems requiring repository-level context

→ Commercial models consistently outperform open-source ones
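
For reference, pass@k is conventionally computed with the unbiased estimator from the Codex/HumanEval paper (Chen et al., 2021); a minimal sketch, with invented per-problem sample counts:

```python
# Unbiased pass@k estimator (Chen et al., 2021), the standard metric for
# code generation benchmarks. n = generations sampled per problem,
# c = generations that pass all of the problem's test cases.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Invented example: 4 problems, 10 generations each.
correct_counts = [3, 0, 1, 0]  # hypothetical per-problem pass counts
n, k = 10, 1
score = sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
print(f"pass@{k} = {score:.2%}")  # for k=1 this is the mean of c/n
```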
