"The Impact of Prompt Programming on Function-Level Code Generation"

Generated the podcast below on this paper with Google's Illuminate.

Evaluates how different prompt techniques impact code generation quality, finding that simpler prompts often work as well as complex ones for function-level tasks.

-----

https://arxiv.org/abs/2412.20545

🔧 Methods in this Paper:

→ Created the CodePromptEval dataset with 7072 prompts testing 5 key techniques: few-shot examples, persona, chain-of-thought, function signature, and package list

→ Evaluated prompts using 3 LLMs (GPT-4, Llama3, Mistral) on correctness, similarity to human code, and code quality

→ Used a full factorial experimental design to test all 2^5 = 32 possible combinations of the five prompt techniques (sketched below)
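
The paper does not publish its prompt-construction code; the Python sketch below only illustrates how such a factorial grid could be enumerated. The technique names, the build_prompt helper, and the prompt wording are illustrative assumptions, not the dataset's actual templates.

```python
from itertools import product

# The five prompt techniques varied in the factorial design (names are assumptions).
TECHNIQUES = ["few_shot", "persona", "chain_of_thought", "function_signature", "package_list"]

def build_prompt(task_description, flags, signature=None, examples=None, packages=None):
    """Assemble a prompt from a task description plus whichever techniques are enabled."""
    parts = []
    if flags["persona"]:
        parts.append("You are an expert Python developer.")
    parts.append(f"Write a Python function that {task_description}")
    if flags["function_signature"] and signature:
        parts.append(f"Use this signature: {signature}")
    if flags["package_list"] and packages:
        parts.append("You may use these packages: " + ", ".join(packages))
    if flags["few_shot"] and examples:
        parts.append("Examples of similar tasks and solutions:\n" + "\n".join(examples))
    if flags["chain_of_thought"]:
        parts.append("Think through the steps before writing the code.")
    return "\n\n".join(parts)

# Enumerate every on/off combination of the five techniques: 2^5 = 32 prompt variants per task.
all_combinations = [dict(zip(TECHNIQUES, bits)) for bits in product([False, True], repeat=5)]
assert len(all_combinations) == 32
```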

-----

💡 Key Insights:

→ Impact of prompt techniques is surprisingly small - only 10-12% difference between best and worst strategies

→ Function signature and few-shot examples most improve code correctness (example prompt below)

→ Chain-of-thought and persona improve code quality but reduce correctness

→ Combining multiple techniques rarely helps and sometimes hurts performance
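
As a hypothetical illustration of the combination found most effective (function signature plus a few-shot example), a prompt might look like the following. The task, signature, and example are invented for illustration and are not taken from CodePromptEval.

```python
# Hypothetical prompt combining the two techniques that most improved correctness:
# an explicit function signature and a one-shot example.
prompt = """Write a Python function that returns the n-th Fibonacci number.

Use this signature: def fibonacci(n: int) -> int:

Example of a similar task and solution:
Task: return the factorial of n.
Solution:
def factorial(n: int) -> int:
    return 1 if n <= 1 else n * factorial(n - 1)
"""
print(prompt)
```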

-----

📊 Results:

→ Pass@1 rates: GPT-4 (52.42%), Llama3 (50.40%), Mistral (47.16%) (estimator sketched below)

→ CodeBLEU similarity scores ranged from 12.2% to 17.9%

→ Most common errors: AssertionError, TypeError, AttributeError
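
The paper's scoring code is not shown here; below is a minimal sketch of the standard unbiased pass@k estimator from Chen et al. (2021), under the assumption that pass@1 means the usual fraction of tasks whose generated function passes its unit tests. The sample counts and outcomes are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated per task, c = samples that pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one generation per task, pass@1 reduces to the fraction of tasks solved.
results = [True, False, True]          # hypothetical per-task pass/fail outcomes
print(sum(results) / len(results))     # dataset-level pass@1 = 0.667
print(pass_at_k(n=10, c=5, k=1))       # 0.5 when half of 10 samples pass
```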
