"The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-based Code Generation"

The podcast on this paper is generated with Google's Illuminate.

First impressions matter most for LLMs - especially for coding LLMs.

For LLMs, the first prompt is the most critical: most successful implementations happen in the first interaction round.

LLMs struggle to code from examples rather than from natural-language descriptions; when coding from examples, they need to get things right on the first try.

https://arxiv.org/abs/2411.06774

🤔 Original Problem:

LLMs excel at code generation from natural language, but their ability to generate code from input-output examples remains underexplored. This matters because examples provide an unambiguous way to specify requirements, which is especially useful for non-programmers and for reverse-engineering tasks.
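To make the distinction concrete, here is a minimal sketch (not from the paper) contrasting a natural-language specification with an example-based one; the task, function names, and prompt wording are illustrative assumptions only.

```python
# Hypothetical illustration of the two specification styles.

# Natural-language specification: intent is stated in words.
nl_prompt = (
    "Write a Python function count_vowels(s) that returns "
    "the number of vowels in the string s."
)

# Example-based specification: intent must be inferred purely from I/O pairs.
io_examples = [
    ("hello", 2),
    ("sky", 0),
    ("AEIOU", 5),
]
example_prompt = "Write a Python function f(x) consistent with these examples:\n" + "\n".join(
    f"f({inp!r}) == {out!r}" for inp, out in io_examples
)

print(example_prompt)
```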

-----

🛠️ Solution in this Paper:

→ Created the InterCode benchmark with 168 diverse programming tasks drawn from HumanEval and CodeHunt

→ Implemented a two-stage evaluation framework: the first round checks whether the generated code matches the given examples, and the iterative stage tests whether the code implements the correct target functionality (see the sketch after this list)

→ Used adaptive example generation to reveal gaps between generated and target functionality

→ Evaluated 6 state-of-the-art LLMs, including both closed-source and open-source models
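A minimal sketch of how such a two-stage, iterative evaluation loop with adaptive example generation could look. The paper's actual protocol may differ; `generate`, `find_counterexample`, and the round budget here are placeholder assumptions, not the authors' implementation.

```python
from typing import Callable, Iterable, Optional, Tuple

Example = Tuple[object, object]  # (input, expected output)

def passes(candidate: Callable, examples: Iterable[Example]) -> bool:
    """True if the candidate reproduces every given input-output pair."""
    return all(candidate(x) == y for x, y in examples)

def evaluate_task(
    generate: Callable[[list], Callable],          # LLM call: examples -> candidate function (assumed interface)
    reference: Callable,                           # hidden target implementation
    seed_examples: list,
    find_counterexample: Callable[[Callable, Callable], Optional[Example]],  # adaptive example generation (hypothetical)
    max_rounds: int = 5,
) -> Tuple[bool, int]:
    """Return (solved, rounds_used) for one benchmark task."""
    examples = list(seed_examples)
    for round_idx in range(1, max_rounds + 1):
        candidate = generate(examples)
        # Stage 1 (first-round check): does the code satisfy the examples it was shown?
        if not passes(candidate, examples):
            continue
        # Stage 2 (functional correctness): search for a new example where the
        # candidate and the target disagree; none found means the task is solved.
        cex = find_counterexample(candidate, reference)
        if cex is None:
            return True, round_idx
        examples.append(cex)  # feed the revealed gap back into the next round
    return False, max_rounds
```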

-----

💡 Key Insights:

→ LLMs struggle significantly more with example-based than with natural-language code generation

→ Simple math and string operations are easier for LLMs to implement from examples

→ Current LLMs have limited ability to utilize iterative feedback for code improvement

-----

📊 Results:

→ Performance drops by over 60% when requirements are given as examples instead of natural language

→ GPT-4o-mini leads with pass@10 scores of 0.33-0.36 (see the pass@k sketch after this list)

→ DeepseekCoder-6.7b-instruct tops open-source models with pass@10 of 0.19-0.22

→ Over 95% of successful implementations occur in the first interaction round
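For reference, pass@k is typically computed with the unbiased estimator from Chen et al. (2021); the sketch below assumes the paper follows that standard definition.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per task
    c: number of samples that pass
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # not enough failures to fill a k-sample draw
    # comb(n - c, k) / comb(n, k) is the probability that all k drawn samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 2 correct -> pass@1 = 0.2, pass@10 = 1.0
print(pass_at_k(10, 2, 1), pass_at_k(10, 2, 10))
```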
