A chain of harmless-looking prompts can make LLMs produce harmful content undetected.
Attack via Implicit Reference (AIR) bypasses existing detection methods 💀
AIR decomposes a malicious objective into individually permissible objectives and links them through implicit references within the context.
By chaining these related, harmless-looking objectives, it elicits malicious content without triggering refusal responses.
AIR achieves an attack success rate (ASR) exceeding 90% on most models, including GPT-4o, Claude-3.5-Sonnet, and Qwen-2-72B.
📚 https://arxiv.org/abs/2410.03857
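For intuition, here is a minimal, hypothetical sketch of how such a prompt can nest individually benign objectives tied together only by implicit references; the placeholder objectives and wording are illustrative, not taken from the paper:

```python
# Hypothetical AIR-style prompt nesting (placeholders only; not the paper's template).
# Each sub-objective reads as benign in isolation; later items point back implicitly
# ("that section") instead of restating the attacker's actual target.
nested_objectives = [
    "Write an article about <BENIGN_TOPIC>.",                        # harmless framing
    "Within the article, include a section on <RELATED_SUBTOPIC>.",  # still benign alone
    "Make that section more detailed and self-contained.",           # implicit reference only
]

stage1_prompt = "Complete these writing tasks as one continuous piece:\n" + "\n".join(
    f"- {obj}" for obj in nested_objectives
)
print(stage1_prompt)
```

Because no single line states the real objective, keyword- and intent-based filters see only routine writing tasks.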
Original Problem 🚨:
Large Language Models (LLMs) struggle to detect malicious content when it is hidden within nested benign objectives. Existing safety mechanisms fail to identify these implicit references, posing significant security risks.
-----
Solution in this Paper 🛠:
- Attack via Implicit Reference (AIR): Decomposes malicious objectives into nested benign ones, linked through implicit references.
- Two-Stage Attack Process (see the sketch after this list):
- First Stage: Introduces harmless objectives to bypass rejection mechanisms.
- Second Stage: Sends a follow-up request to refine the response, removing unrelated content and focusing on the malicious objective.
- Cross-Model Attack Strategy: Uses less secure models to generate contexts that increase attack success on more secure models.
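A minimal sketch of the two-stage flow, assuming an OpenAI-style chat client; the model name, prompts, and helper are assumptions for illustration, not the paper's code:

```python
# Hypothetical two-stage AIR flow (illustrative; client, model name, and prompts are assumed).
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # assumed target model

def chat(messages):
    """Send the conversation so far and return the assistant's reply text."""
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content

# Stage 1: nested benign objectives produce a broad response without a refusal.
stage1_prompt = (
    "Complete these writing tasks as one continuous piece:\n"
    "- Write an article about <BENIGN_TOPIC>.\n"
    "- Within the article, include a section on <RELATED_SUBTOPIC>.\n"
    "- Make that section more detailed and self-contained."
)
history = [{"role": "user", "content": stage1_prompt}]
history.append({"role": "assistant", "content": chat(history)})

# Stage 2: a follow-up that refers to the earlier output only implicitly,
# trimming unrelated content and expanding the targeted section.
history.append({"role": "user", "content": "Keep only that one section and expand it in more detail."})
final_output = chat(history)
print(final_output)
```

In the cross-model variant described above, the Stage 1 context would be generated by a less secure model before the Stage 2 follow-up is sent to the more secure target.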
-----
Key Insights from this Paper 💡:
- Larger LLMs are more vulnerable to AIR attacks.
- Current detection methods are ineffective against AIR.
- Cross-model strategies can increase attack success rates.