LLMs decisively outperform existing unit testing tools, improving on state-of-the-art test generation approaches by up to 107.77%.
This paper conducts a large-scale empirical evaluation of LLMs across three unit testing tasks, comparing 37 models to assess their effectiveness on real-world testing challenges.
Unit testing is time-consuming, with developers spending over 15% of their time on test generation. While LLM-based approaches show promise, until now there has been no comprehensive study of their effectiveness across different testing tasks.
-----
https://arxiv.org/abs/2412.16620
Methods in this Paper 🔍:
→ The paper presents a large-scale empirical study evaluating 37 LLMs on test generation, assertion generation, and test evolution tasks.
→ It analyzes model performance across five benchmarks and eight evaluation metrics, consuming over 3,000 NVIDIA A100 GPU hours (one metric, exact match, is sketched after this list)
→ The study examines three key aspects: LLM performance versus state-of-the-art methods, the impact of model factors such as size and architecture, and fine-tuning versus prompt engineering effectiveness
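Exact match (EM), reported under Key Results below, is the standard correctness metric for assertion generation: a prediction counts only if it reproduces the ground-truth assertion verbatim. Here is a minimal sketch of how an EM score is typically computed, assuming whitespace normalization; the helper names are illustrative, not the paper's implementation:

```python
def _normalize(s: str) -> str:
    # Collapse all whitespace so formatting differences don't count.
    return " ".join(s.split())

def exact_match(predicted: str, reference: str) -> bool:
    """True if a generated assertion matches its reference verbatim
    (up to whitespace)."""
    return _normalize(predicted) == _normalize(reference)

def em_score(predictions: list[str], references: list[str]) -> float:
    """EM score: fraction of predictions that match their references."""
    assert len(predictions) == len(references)
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)

# Example: one of two generated assertions matches exactly -> EM = 0.5
print(em_score(
    ["assertEquals(4, add(2, 2));", "assertTrue(list.isEmpty());"],
    ["assertEquals(4, add(2, 2));", "assertFalse(list.isEmpty());"],
))
```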
-----
Key Insights 💡:
→ Large-scale decoder-only models achieve the best results across all three tasks
→ Encoder-decoder models outperform decoder-only models at comparable parameter sizes
→ CodeLlama, DeepSeek-Coder, and CodeT5p series show strongest performance
→ Prompt engineering with zero-shot learning shows potential in test generation (see the prompt sketch below)
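To make the zero-shot setup concrete, here is a minimal sketch of what such a prompt might look like; the wording, the focal method, and the `query_llm` placeholder are hypothetical, not taken from the paper:

```python
# Hypothetical zero-shot prompt for unit test generation: no worked
# examples are provided, only the task description and the focal method.
FOCAL_METHOD = """\
public static int add(int a, int b) {
    return a + b;
}"""

prompt = (
    "Write a JUnit test class for the following Java method. "
    "Cover typical inputs and edge cases.\n\n"
    + FOCAL_METHOD
)

# response = query_llm(prompt)  # placeholder for the model API in use
print(prompt)
```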
-----
Key Results 📊:
→ LLMs outperform state-of-the-art approaches by up to 107.77% in test generation
→ Decoder-only models achieve the highest exact-match (EM) scores, up to 71.42% in assertion generation
→ GPT-3.5 reaches a 52.54% correctness rate on the paper's internal test dataset
→ Bug detection remains a weak spot, with precision under 0.74% (see the sketch below)
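For context on that figure: if precision here is the standard ratio of true positives to everything flagged, then fewer than roughly 1 in 135 bug-flagging tests actually reveals a real bug. A minimal sketch under that assumption (the definition is mine, not necessarily the paper's exact formulation):

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of bug-flagging tests that reveal a genuine bug."""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0

# Example: 1 genuinely bug-revealing test among 140 flagged ones.
print(f"{precision(1, 139):.4%}")  # prints 0.7143%
```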