Adaptive testing with diversity-based selection improves LLM testing efficiency and output variety.
This paper presents a diversity-based adaptive testing method for LLM applications, inspired by Adaptive Random Testing (ART). It prioritizes diverse test inputs, improving failure detection and output variety while reducing testing costs.
-----
Paper - https://arxiv.org/abs/2501.13480
Original Problem 😞:
→ Testing LLM-based applications is costly due to expensive LLM queries and manual output analysis.
→ Existing LLM testing frameworks do not optimize which test inputs to execute.
→ Exhaustive testing is infeasible due to infinite input variability.
-----
Key Insights 🤔:
→ Diversity-based testing like ART can be effective for LLM prompt templates.
→ Adaptively selecting diverse inputs can improve failure discovery and output variety.
→ Different distance metrics suit different tasks and input distributions.
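To make the last point concrete, here is a minimal stdlib-only sketch (not from the paper's code; both metric functions are illustrative) of two string distance metrics that can rank the same input pair very differently, which is why metric choice should match the task and input distribution:

```python
# Illustrative only: two simple distances over test inputs.
import difflib

def char_distance(a: str, b: str) -> float:
    """Character-level distance via difflib's similarity ratio."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def token_distance(a: str, b: str) -> float:
    """Token-level Jaccard distance; ignores word order."""
    ta, tb = set(a.split()), set(b.split())
    return 1.0 - len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

a, b = "summarize the contract", "the contract summarize"
# Same tokens in a different order: token distance is 0.0,
# while the character-level distance is positive.
print(token_distance(a, b), char_distance(a, b))
```

A word-order-insensitive metric may suit prompts where phrasing varies freely, while a character-level one is more sensitive to small surface edits.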
-----
Solution in this Paper 💡:
→ An adaptive test selection method prioritizes new inputs farthest from previously selected ones using distance metrics.
→ This method adapts ART by selecting candidates from an existing pool, not generating new ones.
→ A selective reference set strategy computes distances only against passing tests, steering selection toward inputs that differ from known-passing ones and are therefore more likely to expose new failures.
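The selection loop described above can be sketched as a greedy farthest-first pass over an existing candidate pool. This is a hypothetical reconstruction, not the paper's implementation; the function names and the Jaccard metric are illustrative choices:

```python
# Sketch of ART-style adaptive selection from a fixed pool:
# repeatedly pick the candidate farthest (by min distance) from the
# reference set of already-selected inputs. In the paper's selective
# variant, the reference set would contain only passing tests.

def jaccard_distance(a: str, b: str) -> float:
    """Token-level Jaccard distance, one possible metric."""
    ta, tb = set(a.split()), set(b.split())
    return 1.0 - len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def select_diverse_tests(pool, budget, distance=jaccard_distance):
    """Greedily select `budget` inputs from `pool`, maximizing each
    pick's distance to its nearest already-selected input."""
    pool = list(pool)
    selected = [pool.pop(0)]  # seed with an arbitrary first input
    while pool and len(selected) < budget:
        best = max(pool,
                   key=lambda c: min(distance(c, r) for r in selected))
        pool.remove(best)
        selected.append(best)
    return selected

inputs = [
    "summarize this legal contract",
    "summarize this legal document",
    "translate the sentence to French",
    "write a haiku about rain",
]
# The near-duplicate "summarize this legal document" is skipped.
print(select_diverse_tests(inputs, budget=3))
```

Because candidates come from an existing pool rather than being generated, the only extra cost over random selection is the distance computation, which is cheap relative to LLM queries.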
-----
Results 📊:
→ Improves the Average Percentage of Faults Detected (APFD) by 7.24% on average, and by up to 34.3%.
→ Produces outputs with 9.5% more unique words.
→ Reduces test execution costs compared to other diversity-based methods like TSDm.