LLMs struggle to reuse their own generated code when solving more complex problems, leading to significant performance gaps in realistic programming scenarios.
-----
https://arxiv.org/abs/2412.21199v2
🤔 Original Problem:
→ Current code generation benchmarks only test isolated function creation, not how well LLMs can use their own code to solve harder problems.
→ Real-world programming requires understanding and reusing existing code, a capability on which LLMs have not been properly evaluated.
-----
🔧 Solution in this Paper:
→ Introduces "self-invoking code generation": the LLM must first solve a base problem, then reuse that solution to solve a more complex, related problem (see the sketch after this list).
→ Creates three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro to test this capability.
→ Uses a systematic method to construct self-invoking problems by building on the existing datasets.
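To make the idea concrete, here is a hypothetical self-invoking problem pair in Python. The function names and tests are illustrative, not drawn from the actual benchmarks; the point is that the harder problem is solved cleanly only by calling the base solution.

```python
# Hypothetical self-invoking problem pair (illustrative, not from the benchmarks):
# the base problem is solved first, and the harder problem's solution
# is expected to reuse it.

def sort_even_indices(nums: list[int]) -> list[int]:
    """Base problem: return the list with values at even indices sorted,
    values at odd indices left in place."""
    result = list(nums)
    result[::2] = sorted(nums[::2])
    return result


def sort_even_indices_per_row(matrix: list[list[int]]) -> list[list[int]]:
    """Self-invoking problem: apply the base transformation to every row
    of a matrix -- solving it naturally reuses sort_even_indices."""
    return [sort_even_indices(row) for row in matrix]


# Benchmark-style test cases for the self-invoking problem.
assert sort_even_indices_per_row([[4, 1, 2, 3]]) == [[2, 1, 4, 3]]
assert sort_even_indices_per_row([[5, 0], [3, 9, 1]]) == [[5, 0], [1, 9, 3]]
```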
-----
💡 Key Insights:
→ Most LLMs show a 10-15% performance drop on self-invoking tasks compared with standard code generation
→ Instruction-tuned models show smaller gains over their base models on self-invoking tasks than on standard code generation
→ Large open-source LLMs perform similarly to proprietary ones on these tasks
→ AssertionErrors and NameErrors are the main failure modes
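The failure-mode breakdown implies an execution-based harness that records the exception type raised by each generated solution. Below is a minimal sketch of how such a tally might work; the setup is an assumption, not the paper's actual evaluation code.

```python
# Minimal sketch (assumed harness, not the paper's evaluation code) of how
# failure modes like AssertionError and NameError can be tallied when
# executing a generated solution against its test cases.
from collections import Counter

def classify_failure(generated_code: str, test_code: str) -> str:
    """Run generated code plus its tests in a shared namespace and return
    the exception class name, or 'pass' if all tests succeed."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)   # define the candidate functions
        exec(test_code, namespace)        # run assert-based test cases
        return "pass"
    except Exception as exc:              # AssertionError, NameError, etc.
        return type(exc).__name__

# Example: a solution that calls an undefined helper triggers a NameError.
failure_counts = Counter()
failure_counts[classify_failure("def f(x): return g(x) + 1",
                                "assert f(1) == 2")] += 1
print(failure_counts)  # Counter({'NameError': 1})
```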
-----
📊 Results:
→ o1-mini: 96.2% on HumanEval but only 76.2% on HumanEval Pro
→ DeepSeek-Coder-V2-Instruct achieves 77.4% on HumanEval Pro, beating proprietary LLMs
→ Chain-of-Thought prompting improves performance by 3-5% on self-invoking tasks
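The post does not include the exact prompt, but a Chain-of-Thought prompt for a self-invoking task might look roughly like the template below; the wording is an assumption, not the paper's prompt.

```python
# Illustrative Chain-of-Thought prompt template for a self-invoking task
# (the wording is an assumption, not the paper's exact prompt).
COT_TEMPLATE = """You are given two related programming problems.

Base problem:
{base_problem}

Self-invoking problem:
{harder_problem}

First, reason step by step about how the base solution can be reused
inside the harder problem. Then write both functions, making the second
one call the first.
"""

prompt = COT_TEMPLATE.format(
    base_problem="Sort the values at even indices of a list.",
    harder_problem="Apply that transformation to every row of a matrix.",
)
print(prompt)
```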