BoN Jailbreaking: A simple algorithm that systematically bypasses AI safety guardrails through random input variations.
Random input mutations reveal fundamental weaknesses in AI safety mechanisms.
Best-of-N Jailbreaking introduces a simple yet powerful method to bypass AI safety measures across text, vision, and audio modalities. By repeatedly sampling variations of harmful prompts with random augmentations until finding one that succeeds, it achieves high attack success rates on frontier LLMs like GPT-4 and Claude.
-----
https://arxiv.org/abs/2412.03556
🔒 Original Problem:
→ Current AI systems have safety measures to prevent harmful outputs, but these defenses can be bypassed through carefully crafted inputs called "jailbreaks". Finding reliable jailbreak methods that work across different input types remains challenging.
-----
🛠️ Solution in this Paper:
→ Best-of-N (BoN) Jailbreaking repeatedly samples variations of a harmful prompt using modality-specific augmentations until one bypasses safety measures (a sketch of the text version follows this list).
→ For text, it applies random capitalization, character scrambling and noising.
→ For images, it varies text color, size, font and position on different backgrounds.
→ For audio, it modifies the speed, pitch and volume of spoken requests and adds background noise.
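Below is a minimal sketch of the text-modality BoN loop. The `query_model` and `is_harmful` callables are hypothetical placeholders for a black-box call to the target model and a harm classifier; the augmentation probabilities are illustrative and not the paper's exact settings.

```python
import random
import string

def augment_text(prompt: str, cap_prob: float = 0.6) -> str:
    """Apply random capitalization, character scrambling, and character noising."""
    chars = list(prompt)
    # Character scrambling: occasionally swap adjacent letters
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < 0.05:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    out = []
    for c in chars:
        # Random capitalization
        if c.isalpha() and random.random() < cap_prob:
            c = c.upper() if random.random() < 0.5 else c.lower()
        # Character noising: occasionally replace with a random ASCII letter
        if c.isalpha() and random.random() < 0.02:
            c = random.choice(string.ascii_letters)
        out.append(c)
    return "".join(out)

def bon_jailbreak(prompt: str, query_model, is_harmful, n: int = 10_000):
    """Sample augmented prompts until one elicits a harmful response or the budget runs out."""
    for i in range(1, n + 1):
        candidate = augment_text(prompt)
        response = query_model(candidate)   # black-box call to the target model
        if is_harmful(response):            # e.g., an LLM-based harm classifier
            return candidate, response, i   # successful jailbreak and samples used
    return None, None, n
```

The same loop applies to the other modalities; only `augment_text` is swapped for image or audio augmentations.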
-----
💡 Key Insights:
→ Attack success rate improves predictably with more samples, following a power-law scaling pattern (see the fit sketched after this list)
→ Successful jailbreaks stem from input variance rather than specific augmentation patterns
→ Combining BoN with other techniques like optimized prefixes further improves effectiveness
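A rough sketch of how such scaling can be fit and extrapolated, using NumPy. The data points are made up for illustration, and the functional form -log(ASR) ≈ a·N^(-b) is one common power-law parameterization assumed here; the paper's exact fit may differ.

```python
import numpy as np

# Illustrative (not real) attack success rates measured at several sample budgets N.
N = np.array([1, 10, 100, 1_000, 10_000], dtype=float)
asr = np.array([0.02, 0.10, 0.35, 0.65, 0.89])

# Fit -log(ASR) ≈ a * N^(-b), i.e. a straight line in log-log space:
# log(-log(ASR)) ≈ log(a) - b * log(N)
y = np.log(-np.log(asr))
slope, intercept = np.polyfit(np.log(N), y, 1)
a, b = np.exp(intercept), -slope

def predicted_asr(n: float) -> float:
    """Extrapolate attack success rate at a larger sample budget n."""
    return float(np.exp(-a * n ** (-b)))

print(f"fit: -log(ASR) ≈ {a:.2f} * N^(-{b:.2f})")
print(f"predicted ASR at N=100,000: {predicted_asr(1e5):.2f}")
```

This kind of fit is what lets attack success at large sample budgets be forecast from cheaper, smaller-budget runs.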
-----
📊 Results:
→ 89% attack success rate on GPT-4o and 78% on Claude 3.5 Sonnet with 10,000 augmented text prompts
→ 56% success on GPT-4o vision and 72% on the GPT-4o Realtime API (audio)
→ 28x improvement in sample efficiency when combined with other techniques