"Embedding-based classifiers can detect prompt injection attacks"

Playback speed

Share post at current time

0:00

Transcript

"Embedding-based classifiers can detect prompt injection attacks"

The podcast on this paper is generated with Google's Illuminate.

Rohan Paul

Dec 26, 2024

Transcript

Embedding-based ML classifiers detect prompt injection attacks with 86.7% precision

Convert prompts to vectors, catch the bad ones with Random Forest

📚 https://arxiv.org/abs/2410.22284v1

🎯 Original Problem:

LLMs are vulnerable to prompt injection attacks where malicious prompts trick them into producing harmful content. Current detection methods using deep learning models have limitations in effectively identifying such attacks.

-----

🔧 Solution in this Paper:

→ Created a dataset of 467,057 unique prompts (23.54% malicious) from Hugging Face repositories

→ Used three embedding models (OpenAI text-embedding-3-small, GTE-large, MiniLM) to convert prompts into numerical vectors

→ Applied three ML classifiers: Logistic Regression, XGBoost, and Random Forest

→ Visualized embeddings using PCA, t-SNE, and UMAP to analyze malicious vs benign prompt distributions

-----

💡 Key Insights:

→ No clear linear separation exists between benign and malicious prompts in embedding space

→ Tree-based algorithms perform better than linear classifiers for detection

→ OpenAI embeddings (1536-dim) outperform GTE (1024-dim) and MiniLM (384-dim)

→ Random Forest consistently outperforms other classifiers across all embedding types

-----

📊 Results:

→ Best performer: Random Forest with OpenAI embeddings

- AUC: 0.764

- Precision: 0.867

- Recall: 0.870

- F1 score: 0.868

→ Outperformed state-of-the-art DeBERTa-based classifiers in AUC and precision while maintaining balanced precision-recall tradeoff

Rohan's Bytes

"Embedding-based classifiers can detect prompt injection attacks"

Discussion about this video