TrustRAG uses K-means clustering to catch malicious documents that try to poison your LLM's knowledge base.
TrustRAG introduces a defense framework that protects RAG systems from corpus poisoning by filtering compromised documents before they ever reach the LLM.
-----
https://arxiv.org/abs/2501.00879v1
🔍 Original Problem:
RAG systems are vulnerable to corpus poisoning attacks where attackers inject malicious documents into knowledge bases to make LLMs generate incorrect information with high confidence. Current defenses fail when malicious documents outnumber legitimate ones.
-----
🛠️ Solution in this Paper:
→ TrustRAG implements a two-stage defense mechanism to filter out malicious content
→ First stage runs K-means clustering over the retrieved documents' semantic embeddings to isolate suspiciously similar groups (sketched below)
→ Second stage leverages cosine similarity and ROUGE metrics to detect malicious documents while resolving conflicts between the model's internal knowledge and the external retrievals (see the sketch after Key Insights)
→ The system operates as a plug-and-play module requiring no additional training
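Here is a minimal sketch of the stage-1 idea: embed the retrieved documents, cluster them with K-means, and drop clusters that are unusually tight. The embedding model, the `spread_threshold` value, and the function name are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

# Assumed embedder; the paper's embedding model may differ.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def filter_tight_clusters(docs, n_clusters=2, spread_threshold=0.05):
    """Drop clusters whose members sit suspiciously close together."""
    if len(docs) <= n_clusters:
        return docs  # too few documents to cluster meaningfully

    embeddings = embedder.encode(docs, normalize_embeddings=True)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(embeddings)

    kept = []
    for c in range(n_clusters):
        members = embeddings[labels == c]
        # Average distance to the centroid: coordinated malicious docs
        # produce near-identical embeddings, so this spread collapses.
        spread = np.mean(
            np.linalg.norm(members - kmeans.cluster_centers_[c], axis=1)
        )
        if spread > spread_threshold:
            kept.extend(d for d, l in zip(docs, labels) if l == c)
    return kept
```

With unit-normalized embeddings, a spread near zero means the cluster's documents are near-duplicates in meaning, which is exactly the signature of documents mass-produced by a single attack pipeline.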
-----
💡 Key Insights:
→ Malicious documents tend to cluster tightly in embedding space due to similar generation processes
→ Internal LLM knowledge can effectively validate external information
→ Simple clustering techniques can detect coordinated attacks
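The internal-knowledge insight is easy to see in code too. Below is a minimal sketch of the stage-2 idea, assuming the LLM has already drafted an answer from its internal knowledge alone. The thresholds, the `rouge_score` and `sentence-transformers` choices, and the keep-or-drop rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def cross_check_docs(docs, internal_answer, cos_thresh=0.3, rouge_thresh=0.2):
    """Keep documents consistent with the model's internally drafted answer."""
    ref = embedder.encode([internal_answer], normalize_embeddings=True)[0]
    trusted = []
    for doc in docs:
        emb = embedder.encode([doc], normalize_embeddings=True)[0]
        cosine = float(np.dot(emb, ref))  # embeddings are unit-normalized
        rouge_l = scorer.score(internal_answer, doc)["rougeL"].fmeasure
        # A document that neither agrees semantically (cosine) nor shares
        # surface wording (ROUGE-L) with internal knowledge is suspect.
        if cosine >= cos_thresh or rouge_l >= rouge_thresh:
            trusted.append(doc)
    return trusted
```

One design caveat: any such rule has to stay loose enough that genuinely novel external facts, which the model could not know, aren't thrown out along with the poison.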
-----
📊 Results:
→ Reduced attack success rate from 97% to 4% while maintaining 70-80% accuracy across datasets
→ Outperformed existing defenses on HotpotQA, NQ, and MS-MARCO benchmarks
→ Runtime is only about 2x that of a vanilla RAG pipeline
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/