Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models
Jailbreak attackers can bypass LLM safety by breaking harmful requests into seemingly harmless pieces.
LLMs can't spot danger when it comes in bite-sized pieces
Original Problem 🚨:
LLMs are susceptible to jailbreak attacks, especially in multi-turn interactions, leading to harmful responses. Existing defenses focus on single-turn attacks and often fail in multi-turn settings.
Solution in this Paper 🔧:
Introduces Jigsaw Puzzles (JSP), a multi-turn jailbreak strategy.
Splits harmful queries into benign fractions across turns.
Uses a jailbreak prompt that instructs the LLM to answer without outputting the reconstructed (concatenated) question.
Includes disclaimers to bypass output-side guardrails.
Employs a three-stage splitting strategy: query rewriting, sentence-level splitting, and word-level splitting.
Key Insights from this Paper 💡:
Multi-turn interactions expose significant vulnerabilities in LLMs.
JSP can effectively bypass content-based defenses.
Inclusion of disclaimers is crucial for overcoming output guardrails.
Word-level splitting enhances the attack success rate.
Results 📊:
JSP achieves an average attack success rate of 93.76% across 189 harmful queries on five LLMs.
State-of-the-art 92% attack success rate on GPT-4.
Maintains a 76% success rate even when defense strategies are applied.
The authors propose Jigsaw Puzzles (JSP), a strategy that bypasses existing safeguards by splitting harmful queries into benign fractions and having the model reconstruct and answer them across multi-turn interactions.
This approach exposes significant safety vulnerabilities in advanced LLMs, achieving high attack success rates.