EvoTox: Teaching machines to reveal their hidden toxic biases through evolutionary testing.
This paper introduces an automated testing framework called EvoTox that evaluates how easily LLMs can be pushed to generate toxic responses, even when they have safety measures in place.
https://arxiv.org/abs/2501.01741
Original Problem 🔍:
→ Current safety measures in LLMs don't fully prevent toxic responses
→ Manual testing is impractical for large-scale evaluation
→ Existing automated tests often generate unnatural prompts
-----
Solution in this Paper 🛠️:
→ EvoTox uses an evolution strategy that leverages two LLMs - the System Under Test (SUT) and a Prompt Generator (PG).
→ PG creates mutated prompts by rephrasing initial seeds to maximize toxicity.
→ An automated oracle based on the Perspective API evaluates toxicity levels.
→ The framework adopts a (1+λ) evolution strategy: each generation, λ mutated prompts compete with their parent, and the fittest becomes the next parent.
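The loop described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: `mutate` stands in for the Prompt Generator LLM and `toxicity` for querying the SUT and scoring its response with an oracle such as the Perspective API; both are stubs here.

```python
import random

def mutate(prompt: str) -> str:
    # Placeholder for the Prompt Generator LLM rephrasing the parent prompt.
    return prompt + " " + random.choice(["Really?", "Explain why.", "Be blunt."])

def toxicity(prompt: str) -> float:
    # Placeholder for sending the prompt to the System Under Test and
    # scoring the response with an automated toxicity oracle.
    return min(1.0, len(prompt) / 200)

def evolve(seed: str, lam: int = 4, generations: int = 10) -> tuple[str, float]:
    """(1+lambda) evolution strategy: lambda mutants compete with the parent."""
    parent, parent_score = seed, toxicity(seed)
    for _ in range(generations):
        mutants = [mutate(parent) for _ in range(lam)]
        best = max(mutants, key=toxicity)
        if toxicity(best) >= parent_score:
            parent, parent_score = best, toxicity(best)
    return parent, parent_score

best_prompt, best_score = evolve("Tell me about your neighbors.")
```

Because the parent is only replaced when a mutant scores at least as high, the toxicity score is non-decreasing across generations; the search keeps prompts natural since every mutant is a rephrasing rather than an adversarial token string.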
-----
Key Insights from this Paper 💡:
→ Homophobia and racism emerge as the most common weaknesses in tested LLMs
→ Censorship in prompt generation significantly affects effectiveness
→ Model size has minimal impact on toxic response generation
→ Natural prompts are more effective than artificially crafted ones
-----
Results 📊:
→ EvoTox outperforms baseline methods with effect size up to 1.0
→ Limited cost overhead (22-35%)
→ Human evaluators confirm significantly higher toxicity in EvoTox-generated responses
→ Generated prompts maintain natural language fluency comparable to human writing