Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models
Jailbreak attackers can bypass LLM safety by breaking harmful requests into seemingly harmless pieces.
LLMs can't spot danger when it comes in bite-sized pieces
Original Problem 🚨:
LLMs are susceptible to jailbreak attacks, especially in multi-turn interactions, leading to harmful responses. Existing defenses focus on single-turn attacks and often fail in multi-turn settings.
Solution in this Paper 🔧:
Introduces Jigsaw Puzzles (JSP), a multi-turn jailbreak strategy.
Splits harmful queries into benign fractions across turns.
Uses a jailbreak prompt that instructs the LLM to answer without outputting the reconstructed (concatenated) question.
Includes disclaimers to bypass output-side guardrails.
Employs a three-stage splitting strategy: query rewriting, sentence-level splitting, and word-level splitting.
Key Insights from this Paper 💡:
Multi-turn interactions expose significant vulnerabilities in LLMs.
JSP can effectively bypass content-based defenses.
Inclusion of disclaimers is crucial for overcoming output guardrails.
Word-level splitting enhances the attack success rate.
Results 📊:
JSP achieves an average attack success rate of 93.76% across 189 harmful queries on five LLMs.
State-of-the-art 92% attack success rate on GPT-4.
Maintains a 76% success rate even when defense strategies are applied.
The authors propose Jigsaw Puzzles (JSP), a strategy that bypasses existing safeguards by splitting harmful queries into benign fractions and having the model reconstruct and answer them across multi-turn interactions.
This approach exposes significant safety vulnerabilities in advanced LLMs, achieving high attack success rates.