When your AI double agent wears a white hat to do black hat stuff
Create a benign twin of the target LLM to discover its vulnerabilities without raising alarms.
ShadowBreak tricks LLMs by using a mirror model, trained only on innocent data, to launch stealthy attacks.
Like testing bank security with a perfect replica made of cardboard
📚 https://arxiv.org/abs/2410.21083
🤔 Original Problem:
Current jailbreak attacks on LLMs require multiple malicious queries during the attack search process, making them easily detectable by content moderators. This creates a need for stealthy jailbreak methods that can maintain high success rates while minimizing detectable queries.
-----
🛠️ Solution in this Paper:
• ShadowBreak: A novel two-stage attack method:
- Stage 1: Creates a local mirror model by fine-tuning on benign instruction-response data collected from the target model
- Stage 2: Runs white-box jailbreak methods (GCG/AutoDAN) on the mirror model to generate adversarial prompts, which are then transferred to the target
• Key Mechanisms:
- Benign Data Mirroring: Uses only non-harmful queries during mirror training
- Aligned Transfer Attack: Conducts the adversarial prompt search locally on the mirror, then transfers the result to the target
- Minimal Query Strategy: Reduces the number of detectable malicious queries sent during the actual attack (see the sketches after this list)
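
A minimal sketch of Stage 1 (benign data mirroring), assuming the OpenAI Python client for querying the target; the instruction list, file name, and fine-tuning note are illustrative assumptions, not the paper's released code:

```python
# Stage 1 sketch: collect benign instruction-response pairs from the target
# model and dump them as an SFT dataset for a local "mirror" model.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Any harmless instruction set works here; the point is that no malicious
# query is ever sent to the target during this stage.
benign_instructions = [
    "Summarize the plot of Pride and Prejudice in two sentences.",
    "Explain how photosynthesis works to a ten-year-old.",
    "Write a Python one-liner that reverses a string.",
]

def ask_target(prompt: str) -> str:
    """Send a benign prompt to the target model (here GPT-3.5 Turbo)."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Build an instruction-response dataset that reflects the target's behavior.
with open("mirror_sft_data.jsonl", "w") as f:
    for instruction in benign_instructions:
        record = {"instruction": instruction, "response": ask_target(instruction)}
        f.write(json.dumps(record) + "\n")

# This JSONL file can then be fed to any standard supervised fine-tuning
# recipe (e.g. a LoRA run on an open-weight base model) to produce the
# local mirror used in Stage 2.
```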
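And a sketch of Stage 2 (aligned transfer attack): the white-box search runs entirely against the local mirror, and only the final optimized prompt reaches the target. `load_mirror` and `gcg_search` are hypothetical placeholders for a local model loader and an off-the-shelf GCG/AutoDAN implementation, not functions from the paper:

```python
# Stage 2 sketch: search locally on the mirror, transfer once to the target.
from openai import OpenAI

client = OpenAI()

def load_mirror():
    """Placeholder: load the locally fine-tuned mirror model and tokenizer."""
    raise NotImplementedError("plug in your local mirror model here")

def gcg_search(model, tokenizer, request: str, target_prefix: str) -> str:
    """Placeholder: white-box adversarial suffix search (e.g. GCG) on the mirror."""
    raise NotImplementedError("plug in a GCG/AutoDAN implementation here")

harmful_request = "<redacted harmful instruction>"

# All of the expensive, noisy optimization happens locally, invisible to the
# target's content moderation.
mirror_model, mirror_tokenizer = load_mirror()
suffix = gcg_search(
    mirror_model,
    mirror_tokenizer,
    harmful_request,
    target_prefix="Sure, here is",
)

# Only now does a detectable malicious query reach the target.
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"{harmful_request} {suffix}"}],
)
print(resp.choices[0].message.content)
```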
-----
💡 Key Insights:
• Benign data alignment improves transfer attack performance by 48-92% compared to naive transfer attacks
• Using purely benign data for mirror model training maintains stealth while achieving high success rates
• A mix of safety and benign data yields the best Attack Success Rate on both the mirror and target models
• Safety-only data results in poor performance across all categories
-----
📊 Results:
• Achieved 92% attack success rate on GPT-3.5 Turbo
• Required only 3.1 malicious queries per sample on average (minimum of 1.5 queries)
• Outperformed the PAIR method, which needs 27.4 detectable queries to reach 84% success
• Demonstrated effectiveness across different jailbreak methods (GCG and AutoDAN)