0:00
/
0:00
Transcript

"RED QUEEN : Safeguarding LLMs against Concealed Multi-Turn Jailbreaking"

Generated this podcast with Google's Illuminate.

Concealed harmful intent in multi-turn jailbreak attacks on LLMs work.

Also called RED QUEEN ATTACK.

📚 https://arxiv.org/pdf/2409.17458

Original Problem 🔍:

Current jailbreak attacks on LLMs use single-turn prompts with explicit harmful intent. This doesn't reflect real-world scenarios where attackers may use multi-turn conversations and conceal malicious goals.

-----

Solution in this Paper 💡:

• Proposes RED QUEEN ATTACK - constructs multi-turn scenarios concealing harmful intent

• Creates 40 scenarios based on occupations/relations with varying turns

• Combines scenarios with 14 harmful action categories

• Generates 56k multi-turn attack data points

• Evaluates on 10 LLMs from 4 families (GPT-4, Llama3, Qwen2, Mixtral)

-----

Key Insights from this Paper 💡:

• RED QUEEN ATTACK achieves high success rates across all tested models

• Larger models more vulnerable to the attack

• Multi-turn structure and concealment both contribute to effectiveness

• Occupation-based scenarios generally more effective than relation-based

-----

Results 📊:

• 87.62% attack success rate on GPT-4o

• 75.4% attack success rate on Llama3-70B

• RED QUEEN GUARD mitigation reduces attack success rate to <1%

• Preserves performance on general benchmarks (MMLU-Pro, AlpacaEval)

Discussion about this video