This new attack method shows that LLMs are vulnerable to psychological manipulation sustained across multiple chat rounds
Wanna make an LLM spill secrets? Just sweet-talk it for a few rounds!
MRJ-Agent demonstrates how multi-round conversations can systematically break LLM safety barriers
https://arxiv.org/abs/2411.03814
Original Problem 🎯:
LLMs are vulnerable to jailbreak attacks that extract harmful content, especially in multi-round conversations. Current defense mechanisms focus mainly on single-round attacks, leaving a critical gap in protecting against sophisticated multi-round dialogue-based attacks.
-----
Solution in this Paper 🔧:
→ MRJ-Agent introduces a novel multi-round dialogue jailbreaking strategy that decomposes a risky query into sub-queries spread across multiple conversation rounds
→ Uses information-based control to keep each generated sub-query semantically close to the original harmful query
→ Applies 13 psychological tactics to minimize the likelihood that the target model refuses
→ Trains a specialized red-team agent that automatically executes attacks, dynamically adjusting its queries based on the target model's responses (see the sketch after this list)
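The bullets above describe a loop: propose a benign-looking sub-query, keep it semantically tethered to the harmful goal, wrap it in a persuasion tactic, and retry with a different tactic when the target refuses. Below is a minimal, hypothetical Python sketch of that loop. None of the names (`attacker_propose`, `embed`, `target_model`, `REFUSAL_MARKERS`) come from the paper; they are illustrative stand-ins, and the embedding and victim model are stubbed so the control flow is runnable.

```python
# Hypothetical sketch of a multi-round jailbreak loop in the spirit of MRJ-Agent.
# All helpers here are stand-ins, not the paper's actual implementation.

import numpy as np

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def embed(text: str) -> np.ndarray:
    """Stand-in embedding; a real setup would use a sentence encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)

def attacker_propose(original_query: str, history: list, tactic: str) -> str:
    """Stand-in for the trained red-team agent: produce the next sub-query,
    wrapped in a psychological tactic, conditioned on the dialogue so far."""
    return f"[{tactic}] step {len(history) + 1} toward: {original_query}"

def target_model(prompt: str, history: list) -> str:
    """Stand-in for the victim LLM."""
    return "Sure, here is some information..."

def multi_round_attack(original_query: str, tactics: list[str],
                       max_rounds: int = 10, sim_threshold: float = 0.0) -> list:
    """Spread a risky query across rounds, keeping each sub-query
    semantically tied to the original and switching tactics on refusal."""
    history, goal_vec = [], embed(original_query)
    for round_idx in range(max_rounds):
        tactic = tactics[round_idx % len(tactics)]
        sub_query = attacker_propose(original_query, history, tactic)
        # Information-based control: discard sub-queries that drift from the goal.
        # With a real encoder a threshold like 0.5 would keep sub-queries on topic;
        # the stubbed embedding makes this gate illustrative only.
        if cosine(embed(sub_query), goal_vec) < sim_threshold:
            continue
        reply = target_model(sub_query, history)
        if any(m in reply.lower() for m in REFUSAL_MARKERS):
            continue  # rejected: try a different tactic next round
        history.append((sub_query, reply))
    return history
```

The similarity gate plays the role of the paper's information-based control, and the tactic rotation is a crude stand-in for its 13 psychological strategies; in the paper, the trained red-team agent replaces `attacker_propose` entirely.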
-----
Key Insights from this Paper 💡:
→ Multi-round dialogue attacks are more effective than single-round attacks at bypassing safety measures
→ Breaking down harmful queries into seemingly innocent sub-queries helps evade detection
→ Psychological manipulation tactics significantly improve attack success rates
→ Current defense mechanisms are inadequate against sophisticated multi-round attacks
-----
Results 📊:
→ 100% attack success rate on Vicuna-7B and Mistral-7B
→ 92% success rate on Llama-2-7B, despite its strong safety alignment
→ ~98-100% success on closed-source models (GPT-3.5, GPT-4)
→ 88% success against prompt detection defenses
→ 78% success against system prompt guards