"Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency"

A podcast on this paper was generated with Google's Illuminate.

Shuffling harmful prompts tricks AI safety mechanisms: simple yet devastatingly effective.

A novel jailbreak attack method exploits inconsistency between MLLMs' comprehension and safety abilities by shuffling harmful prompts, achieving high success rates against both open and closed-source models.

https://arxiv.org/abs/2501.04931

Original Problem 🔍:

→ Current MLLMs use safety mechanisms to prevent harmful outputs, but existing jailbreak methods have low success rates against commercial closed-source models.

→ Previous approaches rely on complex optimization or careful prompt design, making them ineffective against robust safety guardrails.

-----

Solution in this Paper 🛠️:

→ The paper introduces SI-Attack, exploiting a discovered "Shuffle Inconsistency" vulnerability in MLLMs.

→ MLLMs can understand shuffled harmful content but their safety mechanisms fail to detect it.

→ SI-Attack randomly shuffles text and image inputs at the word and patch levels, respectively.

→ A query-based black-box optimization selects the most effective shuffled inputs based on feedback from a toxic judge model (a minimal sketch follows this list).

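To make the attack flow concrete, here is a minimal sketch of the shuffle-and-select idea, not the authors' code: shuffle the harmful text at the word level, shuffle the image at the patch level, and keep the candidate whose response the toxic judge scores as most harmful. `query_mllm` and `judge_toxicity` are hypothetical placeholders for the black-box target MLLM and the judge model.

```python
import random
from PIL import Image

def shuffle_text_words(prompt: str, rng: random.Random) -> str:
    """Shuffle the harmful prompt at the word level."""
    words = prompt.split()
    rng.shuffle(words)
    return " ".join(words)

def shuffle_image_patches(img: Image.Image, grid: int, rng: random.Random) -> Image.Image:
    """Shuffle the image at the patch level on a grid x grid layout."""
    w, h = img.size
    pw, ph = w // grid, h // grid
    boxes = [(c * pw, r * ph, (c + 1) * pw, (r + 1) * ph)
             for r in range(grid) for c in range(grid)]
    patches = [img.crop(b) for b in boxes]
    rng.shuffle(patches)
    out = Image.new(img.mode, (pw * grid, ph * grid))
    for box, patch in zip(boxes, patches):
        out.paste(patch, box[:2])
    return out

def si_attack(prompt, img, query_mllm, judge_toxicity, n_queries=20, grid=4, seed=0):
    """Query-based black-box loop: keep the shuffled input whose response
    the toxic judge rates as most harmful."""
    rng = random.Random(seed)
    best = (None, None, -1.0)  # (shuffled_prompt, shuffled_image, judge_score)
    for _ in range(n_queries):
        p = shuffle_text_words(prompt, rng)
        x = shuffle_image_patches(img, grid, rng)
        response = query_mllm(p, x)       # black-box call to the target MLLM
        score = judge_toxicity(response)  # toxic judge feedback, e.g. in [0, 1]
        if score > best[2]:
            best = (p, x, score)
    return best
```

The number of queries, the patch grid size, and the judge's scoring scale are illustrative choices here; the paper's actual optimization details may differ.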
-----

Key Insights from this Paper 💡:

→ MLLMs exhibit different behaviors for shuffled vs. unshuffled harmful content

→ Safety mechanisms are more vulnerable to text-side attacks than image-side attacks

→ Commercial MLLMs' outer safety guardrails can be bypassed using shuffled inputs

-----

Results 📊:

→ Achieves a 62.68% attack success rate on LLaVA-NEXT (an 18.69% improvement)

→ Improves the attack success rate on GPT-4o by 47.80%

→ Outperforms existing methods on both open and closed-source MLLMs

→ Shows consistent performance across different model scales
