
"HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation"

A podcast on this paper was generated with Google's Illuminate.

LLMs struggle to reuse their own generated code for solving complex problems, leading to significant performance gaps in real-world programming scenarios.

-----

https://arxiv.org/abs/2412.21199v2

🤔 Original Problem:

→ Current code generation benchmarks only test isolated function creation, not how well LLMs can use their own code to solve harder problems.

→ Real programming requires understanding and reusing existing code, a capability on which LLMs have not been properly evaluated.

-----

🔧 Solution in this Paper:

→ Introduces "self-invoking code generation": the LLM must first solve a base problem and then reuse that solution to solve a more complex, related problem (see the sketch after this list).

→ Creates three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro to test this capability.

→ Uses a systematic method to construct self-invoking problems by building on existing datasets.
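
To make the task format concrete, here is a minimal, hypothetical example of a self-invoking problem pair; the function names and tasks are illustrative, not drawn from the benchmarks themselves:

```python
# Base problem: a standard, self-contained function.
def count_vowels(s: str) -> int:
    """Return the number of vowels in s."""
    return sum(1 for ch in s.lower() if ch in "aeiou")

# Self-invoking problem: a harder, related task whose natural solution
# calls the base solution the model just generated.
def most_vowel_heavy(words: list[str]) -> str:
    """Return the word with the highest vowel-to-length ratio."""
    return max(words, key=lambda w: count_vowels(w) / len(w) if w else 0.0)

# The model is scored on test cases for the second function, so it must
# correctly reuse its own count_vowels implementation.
assert most_vowel_heavy(["sky", "audio", "tree"]) == "audio"
```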

-----

💡 Key Insights:

→ Most LLMs show a 10-15% performance drop on self-invoking tasks compared to regular code generation

→ Instruction-tuned models show smaller gains over their base models on self-invoking tasks than on standard code generation

→ Large open-source LLMs perform similarly to proprietary ones on these tasks

→ AssertionErrors and NameErrors are the main failure modes (illustrated after this list)
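
As a hypothetical illustration of those two error types in this setting (not actual model outputs from the paper):

```python
def count_vowels(s):
    # Buggy base solution: misses uppercase vowels.
    return sum(1 for ch in s if ch in "aeiou")

def most_vowel_heavy_wrong(words):
    # Reuses the helper, so the helper's bug propagates to the harder task.
    return max(words, key=lambda w: count_vowels(w) / len(w))

def most_vowel_heavy_undefined(words):
    # Calls a helper name that was never defined in the generated code.
    return max(words, key=lambda w: vowel_count(w) / len(w))

try:
    assert most_vowel_heavy_wrong(["SKY", "AUDIO", "tree"]) == "AUDIO"
except AssertionError:
    print("AssertionError: a benchmark test case on the self-invoking function fails")

try:
    most_vowel_heavy_undefined(["sky", "audio", "tree"])
except NameError as e:
    print(f"NameError: {e}")  # name 'vowel_count' is not defined
```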

-----

📊 Results:

→ o1-mini: 96.2% on HumanEval but only 76.2% on HumanEval Pro

→ DeepseekCoder-V2-instruct achieves 77.4% on HumanEval Pro, beating proprietary LLMs

→ Chain-of-Thought prompting improves performance by 3-5% on self-invoking tasks
