A simple text injection attack that makes RAG systems generate exactly what attackers want.
HijackRAG tricks RAG systems by injecting malicious texts that hijack retrieval and control outputs.
📚 https://arxiv.org/abs/2410.22832
🎯 Original Problem:
RAG systems enhance LLMs by integrating external knowledge but face security vulnerabilities. Current defenses against prompt injection attacks fail in RAG setups, leaving systems exposed to malicious manipulation.
-----
🛠️ Solution in this Paper:
→ HijackRAG: A novel attack method targeting RAG systems through three components:
- Retrieval text (R): Ensures high ranking in top-k results
- Hijack text (H): Redirects model's attention
- Instruction text (I): Provides explicit output instructions
→ Implementation in two modes:
- Black-box: Uses target query as retrieval text
- White-box: Optimizes retrieval text using gradient-based methods
-----
🔍 Key Insights:
→ RAG systems are vulnerable to targeted attacks through knowledge database manipulation
→ Simple prompt injection defenses fail against sophisticated RAG attacks
→ Attack success transfers across different retriever models
→ Current defense mechanisms prove insufficient against HijackRAG
-----
📊 Results:
→ Attack Success Rate (ASR) up to 97% across datasets
→ Near-perfect F1-scores in retrieval effectiveness
→ High transferability: ASR remains above 80% across different retrievers
→ Defense attempts only reduced ASR from 0.97 to 0.90
Share this post