Zero-shot regression testing is here: LLMs craft bug-revealing tests directly from commit info.
This paper addresses the challenge of automatically creating regression tests for software, especially for programs with structured, human-readable inputs.
It introduces Cleverest, a feedback-driven approach using LLMs to generate tests from commit information. Cleverest effectively finds and reproduces bugs in programs like JavaScript and XML parsers, demonstrating the potential of LLMs in software testing.
-----
https://arxiv.org/abs/2501.11086
Original Problem 😥:
→ Existing regression testing tools struggle with programs taking structured inputs like XML or JavaScript without input grammars or seed inputs.
→ Generating effective regression tests automatically for such programs is difficult.
→ Current methods often fail to create valid inputs in the required format.
-----
Key Insights from this Paper 🤔:
→ LLMs can generate grammatically correct and structured text, including code and markup languages.
→ Commit messages and code diffs provide sufficient context for LLMs to understand code changes and testing needs.
→ Execution feedback can guide LLMs to refine test case generation iteratively.
-----
Solution in this Paper 💡:
→ The paper proposes Cleverest, a zero-shot, feedback-directed regression test generation technique.
→ Cleverest uses a Prompt Synthesizer to create prompts for an LLM.
→ Prompts include task descriptions, commit information (message and diff), and attempt history with execution feedback.
→ An LLM Module generates test inputs based on these prompts.
→ An Execution Analyzer runs tests on program versions before and after the commit.
→ It assesses each test on three signals: whether it triggers a bug, whether output differs across the two versions, and whether execution reaches the code changed by the commit.
→ Feedback from the Execution Analyzer is used to refine prompts for subsequent iterations, guiding the LLM toward effective test cases.
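The feedback loop above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `synthesize_prompt`, `analyze`, and `cleverest_loop` are hypothetical names, and the LLM and program runners are passed in as plain callables.

```python
def synthesize_prompt(commit_msg, diff, history):
    """Build a prompt from the task, commit info, and prior attempts."""
    lines = [
        "Generate a test input that exercises the code changed below.",
        f"Commit message: {commit_msg}",
        f"Diff:\n{diff}",
    ]
    # Attempt history with execution feedback guides the next generation.
    for attempt, feedback in history:
        lines.append(f"Previous attempt {attempt!r} -> feedback: {feedback}")
    return "\n".join(lines)

def analyze(test_input, run_old, run_new):
    """Run the input on both versions and classify the outcome."""
    old_out = run_old(test_input)
    new_out = run_new(test_input)
    if new_out == "CRASH":
        return "bug", "input crashed the post-commit version"
    if old_out != new_out:
        return "output-diff", "outputs differ across versions"
    return "no-signal", "input did not reach the changed code"

def cleverest_loop(commit_msg, diff, llm, run_old, run_new, max_iters=5):
    """Feedback-directed loop: generate, execute, refine."""
    history = []
    for _ in range(max_iters):
        prompt = synthesize_prompt(commit_msg, diff, history)
        test_input = llm(prompt)
        verdict, feedback = analyze(test_input, run_old, run_new)
        if verdict in ("bug", "output-diff"):
            return test_input, verdict  # regression-revealing input found
        history.append((test_input, feedback))
    return None, "gave-up"
```

With a toy LLM and two program versions stubbed in, the loop fails on the first attempt, feeds that back into the prompt, and succeeds on the second.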
-----
Results 📊:
→ Cleverest found bugs in 3 out of 6 bug-introducing commits for JavaScript and XML parsers.
→ It reproduced bugs in 4 out of 6 bug-fixing commits for the same program types.
→ For programs with human-readable formats, Cleverest performed well, generating tests in under 3 minutes.
→ Compared to WAFLGo, Cleverest is substantially faster, with comparable bug reproduction but a slightly lower bug-finding rate.
→ Using Cleverest-generated tests as seeds for fuzzing improved bug detection, outperforming WAFLGo.