0:00
/
0:00
Transcript

"HijackRAG: Hijacking Attacks against Retrieval-Augmented Large Language Models"

The podcast on this paper is generated with Google's Illuminate.

A simple text injection attack that makes RAG systems generate exactly what attackers want.

HijackRAG tricks RAG systems by injecting malicious texts that hijack retrieval and control outputs.

📚 https://arxiv.org/abs/2410.22832

🎯 Original Problem:

RAG systems enhance LLMs by integrating external knowledge but face security vulnerabilities. Current defenses against prompt injection attacks fail in RAG setups, leaving systems exposed to malicious manipulation.

-----

🛠️ Solution in this Paper:

→ HijackRAG: A novel attack method targeting RAG systems through three components:

- Retrieval text (R): Ensures high ranking in top-k results

- Hijack text (H): Redirects model's attention

- Instruction text (I): Provides explicit output instructions

→ Implementation in two modes:

- Black-box: Uses target query as retrieval text

- White-box: Optimizes retrieval text using gradient-based methods

-----

🔍 Key Insights:

→ RAG systems are vulnerable to targeted attacks through knowledge database manipulation

→ Simple prompt injection defenses fail against sophisticated RAG attacks

→ Attack success transfers across different retriever models

→ Current defense mechanisms prove insufficient against HijackRAG

-----

📊 Results:

→ Attack Success Rate (ASR) up to 97% across datasets

→ Near-perfect F1-scores in retrieval effectiveness

→ High transferability: ASR remains above 80% across different retrievers

→ Defense attempts only reduced ASR from 0.97 to 0.90

Discussion about this video