"A Large-scale Empirical Study on Fine-tuning Large Language Models for Unit Testing"

The podcast below on this paper was generated with Google's Illuminate.

LLMs crush existing unit testing tools, outperforming them by up to 107.77% in generating correct test cases.

This paper conducts a systematic empirical evaluation of LLMs across three unit testing tasks, comparing 37 models to analyze their effectiveness on real-world testing challenges.

Unit testing is time-consuming, with developers spending over 15% of their time on test generation. While LLM-based approaches show promise, there's no comprehensive study evaluating their effectiveness across different testing tasks.

-----

https://arxiv.org/abs/2412.16620

Methods in this Paper 🔍:

→ The paper presents a large-scale empirical study evaluating 37 LLMs on test generation, assertion generation, and test evolution tasks (illustrated in the sketch after this list).

→ It analyzes model performance across five benchmarks and eight evaluation metrics, using over 3,000 NVIDIA A100 GPU hours.

→ The study examines three key aspects: LLM performance versus state-of-the-art methods, impact of model factors, and fine-tuning versus prompt engineering effectiveness.
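
To make the three tasks concrete, here is a minimal sketch of what a training sample could look like for each of them. The field names, the "</s>" separator, and the Java snippets are illustrative assumptions, not the exact schema of the paper's benchmarks.

```python
# Illustrative examples of the three unit testing tasks studied in the paper.
# Field names, the "</s>" separator, and the Java snippets are assumptions
# for illustration, not the paper's actual dataset format.

test_generation_sample = {
    # input: a focal (production) method; output: a complete unit test for it
    "input": "public int add(int a, int b) { return a + b; }",
    "output": "@Test public void testAdd() { assertEquals(5, new Calculator().add(2, 3)); }",
}

assertion_generation_sample = {
    # input: focal method plus a test prefix whose assertion is masked out;
    # output: the missing assertion statement
    "input": (
        "public int add(int a, int b) { return a + b; } </s> "
        "@Test public void testAdd() { int result = new Calculator().add(2, 3); "
        "<AssertPlaceHolder>; }"
    ),
    "output": "assertEquals(5, result)",
}

test_evolution_sample = {
    # input: the updated focal method together with the now-obsolete test;
    # output: the test revised to match the production-code change
    "input": (
        "public int add(int a, int b, int c) { return a + b + c; } </s> "
        "@Test public void testAdd() { assertEquals(5, new Calculator().add(2, 3)); }"
    ),
    "output": "@Test public void testAdd() { assertEquals(6, new Calculator().add(1, 2, 3)); }",
}
```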

-----

Key Insights 💡:

→ Large-scale decoder-only models achieve best results across tasks

→ Encoder-decoder models perform better at comparable parameter sizes

→ CodeLlama, DeepSeek-Coder, and CodeT5p series show strongest performance

→ Prompt engineering with zero-shot learning shows potential in test generation
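
As a rough illustration of that zero-shot setup, the sketch below sends a single focal method to an OpenAI-style chat model with no in-context examples. The prompt wording, model name, and temperature are assumptions, not the paper's exact configuration.

```python
# Minimal zero-shot prompt for LLM-based test generation.
# Prompt wording and model settings are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

focal_method = "public int add(int a, int b) { return a + b; }"

prompt = (
    "Generate a JUnit test case for the following Java method. "
    "Return only the test code.\n\n" + focal_method
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # zero-shot: no in-context examples are provided
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```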

-----

Key Results 📊:

→ LLMs outperform state-of-the-art approaches by up to 107.77% in test generation

→ Decoder-only models achieve the highest exact match (EM) scores (71.42% in assertion generation; see the metric sketch below)

→ GPT-3.5 reaches a 52.54% correct rate on the internal test dataset

→ Bug detection capability remains limited, with precision rates under 0.74%
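
For readers unfamiliar with the EM metric cited above, here is a minimal sketch of how exact match is commonly computed for assertion generation: the fraction of predictions that match the reference after whitespace normalization. The normalization step and percentage scaling are assumptions; the paper's evaluation script may differ in details.

```python
# Common way to compute the exact match (EM) metric for generated assertions.

def normalize(code: str) -> str:
    """Collapse whitespace so formatting differences do not count as mismatches."""
    return " ".join(code.split())

def exact_match(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions identical to their reference after normalization."""
    assert len(predictions) == len(references)
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

# Example: one of two generated assertions matches its reference exactly -> EM = 50.0
preds = ["assertEquals(5, result)", "assertTrue(result > 0)"]
refs  = ["assertEquals(5,  result)", "assertEquals(1, result)"]
print(exact_match(preds, refs))  # 50.0
```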
