Evaluates how different prompt techniques impact code generation quality, finding that simpler prompts often work as well as complex ones for function-level tasks.
-----
https://arxiv.org/abs/2412.20545
🔧 Methods in this Paper:
→ Created the CodePromptEval dataset of 7,072 prompts covering five prompt techniques: few-shot examples, persona, chain-of-thought, function signature, and package list
→ Evaluated prompts using 3 LLMs (GPT-4, Llama3, Mistral) on correctness, similarity to human code, and code quality
→ Used a full factorial design to test all 32 (2^5) on/off combinations of the five techniques, as sketched below
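A minimal sketch of how such a 2^5 factorial grid of prompt variants could be enumerated and assembled. The helper names and template wording below are illustrative assumptions, not the paper's actual prompt templates.

```python
from itertools import product

# The five prompt techniques varied in the full factorial design.
TECHNIQUES = ["few_shot", "persona", "chain_of_thought", "function_signature", "package_list"]

def build_prompt(task_description, active, signature=None, examples=None, packages=None):
    """Compose a prompt from a task description plus the active techniques.
    Template wording is illustrative, not the paper's exact phrasing."""
    parts = []
    if "persona" in active:
        parts.append("You are an expert Python developer.")
    parts.append(f"Task: {task_description}")
    if "function_signature" in active and signature:
        parts.append(f"Use this function signature:\n{signature}")
    if "package_list" in active and packages:
        parts.append("You may use these packages: " + ", ".join(packages))
    if "few_shot" in active and examples:
        parts.append("Examples of similar solved tasks:\n\n" + "\n\n".join(examples))
    if "chain_of_thought" in active:
        parts.append("Think through the solution step by step before writing the code.")
    return "\n\n".join(parts)

# Full factorial design: every on/off setting of the 5 techniques -> 2**5 = 32 variants.
all_combinations = [
    {t for t, on in zip(TECHNIQUES, flags) if on}
    for flags in product([False, True], repeat=len(TECHNIQUES))
]
assert len(all_combinations) == 32
```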
-----
💡 Key Insights:
→ Overall impact of prompt techniques is surprisingly small: only a 10-12% difference between the best- and worst-performing prompt combinations
→ Function signature and few-shot examples most improve code correctness
→ Chain-of-thought and persona improve code quality but reduce correctness
→ Combining multiple techniques rarely helps and sometimes hurts performance
-----
📊 Results:
→ Pass@1 rates: GPT-4 (52.42%), Llama3 (50.40%), Mistral (47.16%) (estimator sketched below)
→ CodeBLEU similarity to the human reference solutions ranged from 12.2% to 17.9%
→ Most common errors: AssertionError, TypeError, AttributeError
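For reference, pass@1 is typically computed with the unbiased pass@k estimator from Chen et al. (2021), which with a single sample per task reduces to the fraction of generated solutions that pass the unit tests. A minimal sketch, not the paper's evaluation harness; the `results` data here are hypothetical.

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at least
    one of k samples drawn from n generations (c of which pass the tests) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Benchmark-level pass@1 is the mean per-task pass@1.
# Hypothetical example: results[i] = (n generations, c passing) for task i.
results = [(10, 6), (10, 0), (10, 3)]
score = np.mean([pass_at_k(n, c, k=1) for n, c in results])
print(f"pass@1 = {score:.2%}")
```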