Concealed harmful intent in multi-turn jailbreak attacks on LLMs work.
Also called RED QUEEN ATTACK.
📚 https://arxiv.org/pdf/2409.17458
Original Problem 🔍:
Current jailbreak attacks on LLMs use single-turn prompts with explicit harmful intent. This doesn't reflect real-world scenarios where attackers may use multi-turn conversations and conceal malicious goals.
-----
Solution in this Paper 💡:
• Proposes RED QUEEN ATTACK - constructs multi-turn scenarios concealing harmful intent
• Creates 40 scenarios based on occupations/relations with varying turns
• Combines scenarios with 14 harmful action categories
• Generates 56k multi-turn attack data points
• Evaluates on 10 LLMs from 4 families (GPT-4, Llama3, Qwen2, Mixtral)
-----
Key Insights from this Paper 💡:
• RED QUEEN ATTACK achieves high success rates across all tested models
• Larger models more vulnerable to the attack
• Multi-turn structure and concealment both contribute to effectiveness
• Occupation-based scenarios generally more effective than relation-based
-----
Results 📊:
• 87.62% attack success rate on GPT-4o
• 75.4% attack success rate on Llama3-70B
• RED QUEEN GUARD mitigation reduces attack success rate to <1%
• Preserves performance on general benchmarks (MMLU-Pro, AlpacaEval)
Share this post