This paper proposes BAMBA, a bimodal adversarial multi-round black-box jailbreak attacker for large vision-language models (LVLMs), addressing limitations of existing jailbreak methods and improving attack success rates.
-----
https://arxiv.org/abs/2412.05892
🔍 Original Problem:
Existing jailbreak attacks on LVLMs face limitations such as single-round attacks, insufficient dual-modal synergy, poor transferability to black-box models, and reliance on prompt engineering.
-----
💡 Solution in this Paper:
→ BAMBA introduces a two-stage approach to generate adversarial inputs for LVLMs.
→ In the first stage, it extracts malicious features from a harmful content corpus and injects them into a benign image using Projected Gradient Descent (PGD); a minimal sketch of this step follows this list.
→ The second stage runs a bimodal adversarial optimization loop that iteratively refines both the image and the text prompt.
→ This dual-modal interaction drives the target model's output toward maximum toxicity.
→ BAMBA uses the target model's responses as the feedback signal, enabling black-box attacks without access to internal model parameters (see the loop sketch after this list).
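A minimal sketch of the stage-1 idea, under stated assumptions: since the target is black-box, the perturbation is computed against a white-box surrogate encoder. The names `surrogate_encoder` (a CLIP-style image encoder) and `harmful_feature` (an aggregate embedding of the harmful corpus) are illustrative placeholders, not the paper's exact components or loss.

```python
import torch

def pgd_feature_injection(image, harmful_feature, surrogate_encoder,
                          eps=8/255, alpha=1/255, steps=100):
    """Perturb a benign image so its embedding moves toward the malicious feature."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        emb = surrogate_encoder(adv)
        # Maximize similarity to the malicious feature direction.
        loss = torch.nn.functional.cosine_similarity(
            emb, harmful_feature, dim=-1).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                # gradient ascent step
            adv = image + (adv - image).clamp(-eps, eps)   # project to L_inf ball
            adv = adv.clamp(0, 1)                          # keep valid pixel range
    return adv.detach()
```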
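And a sketch of the stage-2 multi-round loop, again an assumption-laden outline rather than the paper's exact procedure: `query_target_model`, `toxicity_score`, `rewrite_prompt`, and `refine_image` are hypothetical stand-ins for the black-box query interface, a toxicity scorer, a text mutator, and a surrogate-guided image update.

```python
def bamba_loop(image, prompt, rounds=10):
    """Iteratively refine both modalities using only the target model's responses."""
    best = (image, prompt, float("-inf"))
    for _ in range(rounds):
        response = query_target_model(image, prompt)   # black-box query
        score = toxicity_score(response)               # feedback signal
        if score > best[2]:
            best = (image, prompt, score)
        # Alternate modalities: mutate the text prompt, then nudge the image,
        # conditioning each update on the latest response.
        prompt = rewrite_prompt(prompt, response)
        image = refine_image(image, response)
    return best
```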
-----
🔑 Key Insights from this Paper:
→ Bimodal attacks can be more effective than unimodal approaches
→ Iterative optimization based on model responses improves attack success rates
→ Black-box attacks are possible without accessing internal model parameters
→ Dual-modal interaction enhances the potency of adversarial inputs
-----
📊 Results:
→ Achieved highest jailbreak success rates across tested models (e.g., 98.4% for MiniGPT-4)
→ Demonstrated strong transferability between different models
→ Outperformed existing baselines in various toxicity attributes