
"Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring"

The podcast on this paper is generated with Google's Illuminate.

When your AI double agent wears a white hat to do black-hat stuff

Create a benign twin of the target LLM to discover its vulnerabilities without raising alarms.

ShadowBreak jailbreaks LLMs by training a mirror model on innocent data and using it to craft stealthy attacks.

Like testing bank security with a perfect replica made of cardboard

📚 https://arxiv.org/abs/2410.21083

🤔 Original Problem:

Current jailbreak attacks on LLMs require multiple malicious queries during the attack search process, making them easily detectable by content moderators. This creates a need for stealthy jailbreak methods that can maintain high success rates while minimizing detectable queries.

-----

🛠️ Solution in this Paper:

• ShadowBreak: A novel two-stage attack method (sketched in code after this list):

- Stage 1: Creates a local mirror model by fine-tuning on benign instruction-response data collected from the target model

- Stage 2: Runs white-box jailbreak methods (GCG/AutoDAN) on the mirror model to generate adversarial prompts

• Key Mechanisms:

- Benign Data Mirroring: Uses only non-harmful queries during training

- Aligned Transfer Attack: Conducts adversarial prompt searches locally on the mirror model, then transfers the resulting prompts to the target

- Minimal Query Strategy: Reduces the number of detectable malicious queries sent during the actual attack
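
Below is a minimal Python sketch of what such a two-stage pipeline could look like, assuming an OpenAI-style API for the target, a local HuggingFace causal LM as the mirror, and a toy greedy suffix search standing in for GCG/AutoDAN. All model names, candidate tokens, and hyperparameters are illustrative placeholders, not the paper's implementation.

```python
# Illustrative sketch of the two-stage attack, not the paper's released code.
import torch
from openai import OpenAI
from transformers import AutoModelForCausalLM, AutoTokenizer

client = OpenAI()  # target model API; Stage 1 sends only benign queries


# ---- Stage 1: mirror the target with benign data ----
def collect_benign_pairs(benign_instructions, target_model="gpt-3.5-turbo"):
    """Query the target with harmless instructions and keep its responses."""
    pairs = []
    for inst in benign_instructions:
        resp = client.chat.completions.create(
            model=target_model,
            messages=[{"role": "user", "content": inst}],
        )
        pairs.append((inst, resp.choices[0].message.content))
    return pairs


def finetune_mirror(pairs, base_model="meta-llama/Llama-2-7b-chat-hf"):
    """Supervised fine-tuning of a local mirror model, shown schematically."""
    tok = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model.train()
    for inst, resp in pairs:
        ids = tok(inst + "\n" + resp, return_tensors="pt").input_ids
        loss = model(input_ids=ids, labels=ids).loss
        loss.backward()
        optim.step()
        optim.zero_grad()
    model.eval()
    return model, tok


# ---- Stage 2: white-box adversarial search on the mirror, fully offline ----
def target_loss(model, tok, prompt, target):
    """Cross-entropy of a desired compliant prefix, given the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt", add_special_tokens=False).input_ids
    ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the target tokens
    with torch.no_grad():
        return model(input_ids=ids, labels=labels).loss.item()


def local_suffix_search(model, tok, harmful_request,
                        target_prefix="Sure, here is", steps=20):
    """Greedy stand-in for GCG: grow a suffix that makes the mirror model
    more likely to begin its answer with the compliant target prefix."""
    candidates = [" !", " please", " step by step", " in detail", " now"]
    suffix = ""
    for _ in range(steps):
        best = min(candidates,
                   key=lambda c: target_loss(model, tok,
                                             harmful_request + suffix + c,
                                             target_prefix))
        suffix += best
    return harmful_request + suffix
```

The point of this structure is that the expensive, noisy adversarial search happens entirely against the local mirror; only the finished adversarial prompt ever reaches the real target, which is what keeps the number of detectable malicious queries small.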

-----

💡 Key Insights:

• Benign data alignment improves transfer attack performance by 48-92% compared to naive transfer attacks

• Using purely benign data for mirror model training maintains stealth while achieving high success rates

• A mix of safety and benign data yields the best Attack Success Rate for both mirror and target models (see the data-mix sketch after this list)

• Safety-only data results in poor performance across all categories
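
As a hypothetical illustration of the data-composition finding above, the mirror model's fine-tuning set can be assembled from benign and safety (refusal-style) examples in some proportion. The helper below and its 50/50 default are illustrative only, not ratios reported in the paper.

```python
# Hypothetical sketch of mixing benign and safety examples for mirror training.
import random


def build_training_mix(benign_pairs, safety_pairs, benign_fraction=0.5, total=1000):
    n_benign = min(int(total * benign_fraction), len(benign_pairs))
    n_safety = min(total - n_benign, len(safety_pairs))
    mix = random.sample(benign_pairs, n_benign) + random.sample(safety_pairs, n_safety)
    random.shuffle(mix)
    return mix  # feed to the same fine-tuning loop used for the mirror model
```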

-----

📊 Results:

• Achieved 92% attack success rate on GPT-3.5 Turbo

• Required an average of only 3.1 detectable malicious queries per sample (as low as 1.5)

• Outperformed the PAIR baseline, which needs 27.4 detectable queries to reach 84% success

• Demonstrated effectiveness across different jailbreak methods (GCG and AutoDAN)
