"Dagger Behind Smile: Fool LLMs with a Happy Ending Story"

The accompanying podcast was generated with Google's Illuminate.

Fool LLMs with happy endings: a simple yet effective jailbreak attack.

LLMs respond readily to positively framed prompts, even when the underlying request is malicious.

This paper exploits that tendency with the Happy Ending Attack (HEA), which wraps malicious requests inside positive stories with happy endings to bypass LLM safety checks.

-----

Paper - https://arxiv.org/abs/2501.13115

Original Problem 🤔:

→ Existing jailbreak attacks are either inefficient or overly complex to construct.

-----

Solution in this Paper 💡:

→ HEA embeds malicious requests in positive story templates with happy endings.

→ This fools LLMs into generating malicious content as part of seemingly benign responses.

→ A follow-up Chain-of-Thought (CoT) prompt can elicit more organized, detailed jailbreak instructions.

-----

Key Insights from this Paper 🔑:

→ LLMs are more responsive to positive prompts, making them vulnerable to manipulation through positive sentiment.

→ Simple story templates with happy endings can effectively bypass LLM safety mechanisms.

→ HEA achieves both effectiveness and efficiency in jailbreaking LLMs, requiring minimal interaction.

-----

Results 💯:

→ Achieves an 88.79% Attack Success Rate (ASR) across GPT-4o, Llama3-70b, Gemini-pro, and other LLMs.

→ Outperforms other state-of-the-art methods in both ASR and harmfulness score.

→ Uses only 242.90 tokens on average, demonstrating high efficiency.
