LLMs become their own worst enemy when their patterns are used against them.
The paper introduces Self-Instruct Few-Shot Jailbreaking, which decomposes jailbreak attacks into a pattern-learning component and a behavior-learning component to exploit LLM vulnerabilities efficiently.
-----
https://arxiv.org/abs/2501.07959
🔍 Original Problem:
Existing jailbreak methods lack generality and efficiency in bypassing LLM safety mechanisms, often requiring complex prompt engineering or extensive computational resources.
-----
🛠️ Solution in this Paper:
→ The method splits jailbreaking into pattern learning and behavior learning components
→ Pattern learning uses model-specific tokens and target response prefix "Hypothetically" to reduce perplexity
→ Behavior learning samples demos directly from target models rather than auxiliary models
→ Implements demo-level greedy search for optimal demo selection based on perplexity reduction (a sketch follows this list)
→ Extends instruction suffix with co-occurrence patterns to improve attack efficiency
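Here is a minimal sketch of the demo-level greedy search idea, assuming a `perplexity(prompt, target)` scorer backed by the target model (one possible way to compute it is sketched under Key Insights below). All names are illustrative, not the paper's released code:

```python
# Hypothetical sketch of demo-level greedy search: at each step, keep the
# candidate demo whose addition most reduces the perplexity the target
# model assigns to the desired response prefix (e.g. "Hypothetically").
from typing import Callable, List

def greedy_demo_search(
    candidates: List[str],       # self-generated jailbreak demos
    instruction: str,            # the instruction being attacked
    target_prefix: str,          # desired response prefix, e.g. "Hypothetically"
    perplexity: Callable[[str, str], float],  # PPL of target_prefix given a prompt
    max_shots: int = 8,          # the paper reports success within 8 shots
) -> List[str]:
    selected: List[str] = []
    pool = list(candidates)
    best_ppl = perplexity(instruction, target_prefix)  # zero-shot baseline
    for _ in range(max_shots):
        # Score every remaining demo by the perplexity of the full few-shot prompt.
        scored = [
            (perplexity("\n\n".join(selected + [demo, instruction]), target_prefix), demo)
            for demo in pool
        ]
        ppl, demo = min(scored)
        if ppl >= best_ppl:  # stop once no demo lowers perplexity further
            break
        best_ppl = ppl
        selected.append(demo)
        pool.remove(demo)
    return selected
```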
-----
🎯 Key Insights:
→ Self-generated demos outperform those from auxiliary models
→ Lower perplexity correlates with higher attack success rates (see the measurement sketch after this list)
→ Pattern frequency affects model behavior significantly
→ Advanced models require more co-occurrence patterns
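As a rough illustration of how the perplexity of a target prefix can be measured with Hugging Face transformers (the model name and prompt handling are assumptions for illustration, not the paper's setup):

```python
# Compute the perplexity a causal LM assigns to a target prefix given a prompt.
# Prompt tokens are masked with -100 so the loss covers only the target tokens.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # placeholder target model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def prefix_perplexity(prompt: str, target_prefix: str) -> float:
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    target_ids = tok(target_prefix, add_special_tokens=False,
                     return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt positions in the loss
    with torch.no_grad():
        loss = model(input_ids=input_ids, labels=labels).loss
    return math.exp(loss.item())  # mean NLL over target tokens -> perplexity

# e.g. prefix_perplexity(few_shot_prompt, "Hypothetically")
```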
-----
📊 Results:
→ Achieves a 90% attack success rate on the Llama series within 8 shots
→ Maintains effectiveness against perplexity filter defenses
→ Shows resilience to SmoothLLM patch perturbations
→ Outperforms baselines such as AutoDAN, PAIR, and GCG