First impressions are everything for LLMs - especially coding LLMs.
The first prompt is the most critical one: most successful implementations happen in the first round.
LLMs struggle to code from input-output examples instead of words - and since they rarely recover through feedback, the first attempt has to be right.
https://arxiv.org/abs/2411.06774
🤔 Original Problem:
LLMs excel at code generation from natural language, but their ability to generate code from input-output examples remains unexplored. This is crucial as examples provide an unambiguous way to specify requirements, especially useful for non-programmers and reverse engineering tasks.
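To make the setup concrete, here is a hypothetical toy task (digit sum, not from the paper) specified both ways - in words, and purely through input-output pairs as the paper's benchmark does:

```python
# Natural-language spec (what LLMs are usually evaluated on):
#   "Return the sum of the digits of a non-negative integer."

# Example-based spec (what this paper evaluates): only I/O pairs.
examples = [(0, 0), (7, 7), (42, 6), (999, 27)]

# A candidate implementation an LLM might produce:
def candidate(n: int) -> int:
    return sum(int(d) for d in str(n))

# The examples alone are the entire specification - the model must
# infer the intended function and match every pair on the first try.
assert all(candidate(x) == y for x, y in examples)
```

The examples are unambiguous, but they are also sparse: many different functions fit a handful of pairs, which is exactly why this setting is harder than a natural-language description.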
-----
🛠️ Solution in this Paper:
→ Created InterCode benchmark with 168 diverse programming tasks from HumanEval and CodeHunt
→ Implemented two-stage evaluation framework: the first round checks whether code matches the given examples; the iterative stage tests whether it implements the intended functionality
→ Used adaptive example generation to reveal gaps between generated and target functionality
→ Evaluated 6 state-of-the-art LLMs including both closed and open-source models
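The two-stage idea can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the digit-sum target, the flawed candidate, and the probe inputs are all made up for the example:

```python
from typing import Callable, Iterable, Tuple

Example = Tuple[int, int]

def first_round_pass(candidate: Callable[[int], int],
                     shown: Iterable[Example]) -> bool:
    """Stage 1: the generated code must reproduce the examples shown in the prompt."""
    return all(candidate(x) == y for x, y in shown)

def functional_pass(candidate: Callable[[int], int],
                    target: Callable[[int], int],
                    probes: Iterable[int]) -> bool:
    """Stage 2: extra probe inputs test the *intended* functionality,
    revealing gaps the shown examples did not cover."""
    return all(candidate(x) == target(x) for x in probes)

target = lambda n: sum(int(d) for d in str(n))  # intended function: digit sum
flawed = lambda n: n % 9                        # fits the shown examples, wrong in general

shown = [(7, 7), (42, 6), (13, 4)]
print(first_round_pass(flawed, shown))             # True: matches the prompt's examples
print(functional_pass(flawed, target, [18, 999]))  # False: probes expose the gap
```

The flawed candidate shows why the second stage matters: code that satisfies the prompt's examples can still implement the wrong function, and adaptively generated probes are what surface that mismatch.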
-----
💡 Key Insights:
→ LLMs struggle significantly more with example-based vs natural language code generation
→ Simple math and string operations are easier for LLMs to implement from examples
→ Current LLMs have limited ability to utilize iterative feedback for code improvement
-----
📊 Results:
→ Performance drops over 60% when using examples instead of natural language
→ GPT-4o-mini leads with pass@10 scores of 0.33-0.36
→ DeepseekCoder-6.7b-instruct tops open-source models with pass@10 of 0.19-0.22
→ Over 95% of successful implementations occur in the first interaction round
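For reference, pass@k scores like the ones above are conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021); whether this paper uses exactly that estimator is an assumption, but it is the standard:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated per task, c of them correct.
    Estimates the probability that at least one of k samples passes."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some correct one is always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 20 samples per task, 1 correct:
print(pass_at_k(20, 1, 10))  # 0.5
```

Per-task scores are then averaged over the benchmark's tasks to give the reported numbers.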