0:00
/
0:00
Transcript

"Palisade -- Prompt Injection Detection Framework"

The podcast on this paper is generated with Google's Illuminate.

Palisade, proposed in this paper, is a security guard that screens every prompt before it reaches your LLM.

Three-layer defense system catches prompt injection attacks before they reach your LLM.

📚 https://arxiv.org/abs/2410.21146

🎯 Original Problem:

LLMs are vulnerable to prompt injection attacks where malicious actors manipulate input prompts to generate harmful outputs or bypass system controls. Traditional detection methods using static patterns often fail against sophisticated threats like abnormal token sequences and alias substitutions.

-----

🛠️ Solution in this Paper:

→ Implements a three-layer Palisade framework for detecting prompt injections:

- Layer 1: Rule-based filtering using spaCy for basic pattern detection

- Layer 2: BERT-based ML classifier for advanced pattern recognition

- Layer 3: Companion LLM as security validation system

→ Uses dataset from Hugging Face 'deepface/prompt-injections' with 546 distinct prompts

→ Processes only English prompts for optimal accuracy

→ Applies logical OR operations across all three layers for final evaluation

-----

💡 Key Insights:

→ No modification needed to target LLM or prompts

→ Multi-layered defense reduces false negatives significantly

→ ML classifier (BERT) shows highest individual accuracy

→ Higher false positives are acceptable tradeoff for catching all threats

-----

📊 Results:

→ ML classifier layer achieved highest individual accuracy among three layers

→ Combined framework showed lowest false negatives

→ Slight increase in false positives, intentionally prioritizing threat detection

Discussion about this video

User's avatar