When your AI double agent wears a white hat to do black hat stuff
Create a benign twin of the target LLM to discover its vulnerabilities without raising alarms.
ShadowBreak tricks LLMs by using a mirror model, trained only on innocent data, to launch stealthy attacks.
Like testing bank security with a perfect replica made of cardboard
📚 https://arxiv.org/abs/2410.21083
🤔 Original Problem:
Current jailbreak attacks on LLMs require multiple malicious queries during the attack search process, making them easily detectable by content moderators. This creates a need for stealthy jailbreak methods that can maintain high success rates while minimizing detectable queries.
-----
🛠️ Solution in this Paper:
• ShadowBreak: A novel two-stage attack method:
- Stage 1: Creates a local mirror model by fine-tuning on benign instruction-response data collected from the target model
- Stage 2: Runs white-box jailbreak methods (GCG/AutoDAN) on the mirror model to generate adversarial prompts, which are then transferred to the target
• Key Mechanisms:
- Benign Data Mirroring: Uses only non-harmful queries during mirror training
- Aligned Transfer Attack: Conducts the adversarial prompt search locally on the mirror, then transfers the result to the target
- Minimal Query Strategy: Reduces the number of detectable malicious queries sent during the actual attack (see the sketches after this list)
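
A minimal sketch of Stage 1 (benign data mirroring), assuming the OpenAI Python client for querying the target; the instruction list, file name, and fine-tuning note are illustrative assumptions, not the paper's released code:

```python
# Stage 1 sketch: collect benign instruction-response pairs from the target
# model and dump them as an SFT dataset for a local "mirror" model.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Any harmless instruction set works here; the point is that no malicious
# query is ever sent to the target during this stage.
benign_instructions = [
    "Summarize the plot of Pride and Prejudice in two sentences.",
    "Explain how photosynthesis works to a ten-year-old.",
    "Write a Python one-liner that reverses a string.",
]

def ask_target(prompt: str) -> str:
    """Send a benign prompt to the target model (here GPT-3.5 Turbo)."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Build an instruction-response dataset that reflects the target's behavior.
with open("mirror_sft_data.jsonl", "w") as f:
    for instruction in benign_instructions:
        record = {"instruction": instruction, "response": ask_target(instruction)}
        f.write(json.dumps(record) + "\n")

# This JSONL file can then be fed to any standard supervised fine-tuning
# recipe (e.g. a LoRA run on an open-weight base model) to produce the
# local mirror used in Stage 2.
```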
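And a sketch of Stage 2 (aligned transfer attack): the white-box search runs entirely against the local mirror, and only the final optimized prompt reaches the target. `load_mirror` and `gcg_search` are hypothetical placeholders for a local model loader and an off-the-shelf GCG/AutoDAN implementation, not functions from the paper:

```python
# Stage 2 sketch: search locally on the mirror, transfer once to the target.
from openai import OpenAI

client = OpenAI()

def load_mirror():
    """Placeholder: load the locally fine-tuned mirror model and tokenizer."""
    raise NotImplementedError("plug in your local mirror model here")

def gcg_search(model, tokenizer, request: str, target_prefix: str) -> str:
    """Placeholder: white-box adversarial suffix search (e.g. GCG) on the mirror."""
    raise NotImplementedError("plug in a GCG/AutoDAN implementation here")

harmful_request = "<redacted harmful instruction>"

# All of the expensive, noisy optimization happens locally, invisible to the
# target's content moderation.
mirror_model, mirror_tokenizer = load_mirror()
suffix = gcg_search(
    mirror_model,
    mirror_tokenizer,
    harmful_request,
    target_prefix="Sure, here is",
)

# Only now does a detectable malicious query reach the target.
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"{harmful_request} {suffix}"}],
)
print(resp.choices[0].message.content)
```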
-----
💡 Key Insights:
• Benign data alignment improves transfer attack performance by 48-92% compared to naive transfer attacks
• Using purely benign data for mirror model training maintains stealth while achieving high success rates
• A mix of safety and benign data yields the best Attack Success Rate on both the mirror and target models
• Safety-only data results in poor performance across all categories
-----
📊 Results:
• Achieved 92% attack success rate on GPT-3.5 Turbo
• Required only 3.1 malicious queries per sample on average (minimum of 1.5 queries)
• Outperformed the PAIR method, which needs 27.4 detectable queries to reach 84% success
• Demonstrated effectiveness across different jailbreak methods (GCG and AutoDAN)