Shuffling harmful prompts tricks AI safety - simple yet devastatingly effective
A novel jailbreak attack exploits the inconsistency between MLLMs' comprehension and safety abilities: by shuffling harmful prompts, it achieves high success rates against both open-source and closed-source models.
https://arxiv.org/abs/2501.04931
Original Problem 🔍:
→ Current MLLMs use safety mechanisms to prevent harmful outputs, but existing jailbreak methods have low success rates against commercial closed-source models.
→ Previous approaches rely on complex optimization or careful prompt design, yet remain ineffective against robust safety guardrails.
-----
Solution in this Paper 🛠️:
→ The paper introduces SI-Attack, exploiting a discovered "Shuffle Inconsistency" vulnerability in MLLMs.
→ MLLMs can understand shuffled harmful content but their safety mechanisms fail to detect it.
→ SI-Attack randomly shuffles text inputs at the word level and image inputs at the patch level.
→ A query-based black-box optimization then selects the most effective shuffled inputs using feedback from a toxicity judge (sketched below).
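A minimal Python sketch of this shuffle-then-select procedure, for illustration only: `query_mllm`, `toxicity_judge`, the 4×4 patch grid, and the 20-query budget are placeholder assumptions, not the paper's actual components or hyperparameters.

```python
import random

from PIL import Image  # assumed dependency: pip install pillow


def shuffle_text_words(prompt: str, seed: int = 0) -> str:
    """Word-level shuffle: randomly permute the words of the text prompt."""
    rng = random.Random(seed)
    words = prompt.split()
    rng.shuffle(words)
    return " ".join(words)


def shuffle_image_patches(image: Image.Image, grid: int = 4, seed: int = 0) -> Image.Image:
    """Patch-level shuffle: cut the image into a grid x grid set of patches
    and reassemble them in random order. The grid size is a placeholder."""
    rng = random.Random(seed)
    w, h = image.size
    pw, ph = w // grid, h // grid
    patches = [
        image.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
        for r in range(grid)
        for c in range(grid)
    ]
    rng.shuffle(patches)
    out = Image.new(image.mode, (pw * grid, ph * grid))
    for slot, patch in enumerate(patches):
        r, c = divmod(slot, grid)
        out.paste(patch, (c * pw, r * ph))
    return out


def si_attack(prompt, image, query_mllm, toxicity_judge, num_queries=20):
    """Query-based black-box selection: among several shuffled (text, image)
    pairs, keep the one whose model response the judge scores as most toxic.

    query_mllm(text, image) -> str and toxicity_judge(response) -> float are
    hypothetical callables standing in for the target MLLM and the judge model.
    """
    best_score, best = float("-inf"), None
    for seed in range(num_queries):
        shuffled_text = shuffle_text_words(prompt, seed=seed)
        shuffled_image = shuffle_image_patches(image, seed=seed)
        response = query_mllm(shuffled_text, shuffled_image)
        score = toxicity_judge(response)  # higher score = more harmful output
        if score > best_score:
            best_score, best = score, (shuffled_text, shuffled_image, response)
    return best
```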
-----
Key Insights from this Paper 💡:
→ MLLMs exhibit different behaviors for shuffled versus unshuffled harmful content (see the probe sketch after this list)
→ Safety mechanisms are more vulnerable to text-side attacks than image-side attacks
→ Commercial MLLMs' outer safety guardrails can be bypassed using shuffled inputs
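To make the first insight concrete, here is a hedged probe that compares refusal behavior on an original harmful prompt versus its word-shuffled version, reusing `shuffle_text_words` from the sketch above. The keyword-based refusal check and `query_mllm` are simplifications, not the paper's evaluation protocol.

```python
# Hypothetical probe for Shuffle Inconsistency: does the model refuse the
# original harmful prompt while answering a word-shuffled version?
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def is_refusal(response: str) -> bool:
    """Crude keyword check; a proper evaluation would use a judge model."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def probe_shuffle_inconsistency(prompt, image, query_mllm):
    """Return refusal decisions for the original and the shuffled prompt."""
    original = query_mllm(prompt, image)
    shuffled = query_mllm(shuffle_text_words(prompt, seed=0), image)
    return {
        "refused_original": is_refusal(original),
        "refused_shuffled": is_refusal(shuffled),
    }
```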
-----
Results 📊:
→ Achieves a 62.68% attack success rate on LLaVA-NEXT (an 18.69% improvement)
→ Improves the attack success rate on GPT-4o by 47.80%
→ Outperforms existing methods on both open-source and closed-source MLLMs
→ Shows consistent performance across different model scales