REPOCOD, the benchmark proposed in this paper, shows that LLMs can't replace programmers yet by testing them on real-world, repository-level coding tasks
📚 https://arxiv.org/abs/2410.21647
🎯 Original Problem:
Existing benchmarks show LLMs achieve >90% accuracy in code generation, raising the question: Can they replace human programmers? Current benchmarks use simple, single-line or manually crafted problems that don't reflect real-world software development complexity.
-----
🛠️ Solution in this Paper:
→ Created REPOCOD: a benchmark of 980 code generation problems drawn from 11 popular Python projects, covering complex, real-world code completion tasks from actual repositories
→ 58% of problems require file-level or repository-level context to solve
→ Each instance includes ~313 developer-written test cases for validation
→ Uses 3-stage automated pipeline: repository selection, target function selection, test case collection
→ Average solution length: 331.6 tokens (longest among existing benchmarks)
→ Average cyclomatic complexity: 9.00, the highest among existing benchmarks (see the measurement sketch after this list)
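
The length and complexity statistics above can be reproduced in spirit with off-the-shelf tools. Below is a minimal sketch, not the paper's actual pipeline: it measures cyclomatic complexity with the radon library and token counts with tiktoken; the tokenizer choice is an assumption, since the paper does not specify how the 331.6-token figure was computed.

```python
# Sketch: per-function solution length and cyclomatic complexity,
# in the spirit of REPOCOD's statistics. Assumes `radon` and `tiktoken`
# are installed; the tokenizer choice is illustrative, not from the paper.
import tiktoken
from radon.complexity import cc_visit

def function_stats(source_code: str) -> list[dict]:
    """Return token count and cyclomatic complexity for each function in a source file."""
    enc = tiktoken.get_encoding("cl100k_base")   # assumed tokenizer
    lines = source_code.splitlines()
    stats = []
    for block in cc_visit(source_code):          # radon parses the AST for us
        func_src = "\n".join(lines[block.lineno - 1 : block.endline])
        stats.append({
            "name": block.name,
            "tokens": len(enc.encode(func_src)),
            "cyclomatic_complexity": block.complexity,
        })
    return stats

# Aggregating these per-function numbers over a repository yields averages
# comparable to the reported 331.6 tokens and 9.00 complexity.
```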
-----
🔍 Key Insights:
→ LLMs struggle with repository-level context (max 20% pass@1)
→ Current-file context outperforms Dense/BM25 retrieval for commercial models (a minimal retrieval sketch follows this list)
→ Each model has unique solvable problems, showing REPOCOD's diversity
→ Repository-level code generation needs significant improvement
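
For context on the retrieval comparison above: a BM25 baseline retrieves the repository snippets most similar to the target function's signature and docstring and places them in the prompt. Here is a minimal sketch using the rank_bm25 package; the chunking and query construction are illustrative assumptions, not REPOCOD's exact setup.

```python
# Sketch of a BM25 retrieval baseline for repository-level context,
# using the rank_bm25 package (pip install rank-bm25).
from rank_bm25 import BM25Okapi

def retrieve_context(query: str, repo_chunks: list[str], k: int = 5) -> list[str]:
    """Return the k repository chunks most relevant to the query
    (e.g. the target function's signature + docstring)."""
    tokenized_chunks = [chunk.split() for chunk in repo_chunks]  # naive whitespace tokenization
    bm25 = BM25Okapi(tokenized_chunks)
    return bm25.get_top_n(query.split(), repo_chunks, n=k)

# The paper's finding: for commercial models, simply using the current file as
# context beats prompts built from Dense- or BM25-retrieved chunks.
```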
-----
📊 Results:
→ No model achieved more than 30% pass@1 accuracy (see the evaluation sketch below)
→ GPT-4 performed best with 27.35% pass@1
→ Performance drops significantly for problems requiring repository-level context
→ Commercial models outperform open-source ones consistently
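
A pass@1 score here means the model's single generated completion is inserted into the repository and must pass that instance's developer-written tests. A minimal evaluation sketch, assuming pytest-style test suites and a hypothetical step that patches the completion into a repository copy:

```python
# Sketch of pass@1 evaluation against developer-written tests.
# Assumes the model's completion has already been written into the target
# function's location in a checked-out repository copy (hypothetical step).
import subprocess

def passes_tests(repo_dir: str, test_paths: list[str]) -> bool:
    """Run the instance's developer-written tests; the completion must pass all of them."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_paths],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0

def pass_at_1(results: list[bool]) -> float:
    """With one sample per problem, pass@1 is simply the fraction of problems solved."""
    return sum(results) / len(results)
```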