Embedding-based ML classifiers detect prompt injection attacks with 86.7% precision
Convert prompts to vectors, catch the bad ones with Random Forest
📚 https://arxiv.org/abs/2410.22284v1
🎯 Original Problem:
LLMs are vulnerable to prompt injection attacks where malicious prompts trick them into producing harmful content. Current detection methods using deep learning models have limitations in effectively identifying such attacks.
-----
🔧 Solution in this Paper:
→ Created a dataset of 467,057 unique prompts (23.54% malicious) from Hugging Face repositories
→ Used three embedding models (OpenAI text-embedding-3-small, GTE-large, MiniLM) to convert prompts into numerical vectors
→ Applied three ML classifiers: Logistic Regression, XGBoost, and Random Forest
→ Visualized embeddings using PCA, t-SNE, and UMAP to analyze malicious vs benign prompt distributions
-----
💡 Key Insights:
→ No clear linear separation exists between benign and malicious prompts in embedding space
→ Tree-based algorithms perform better than linear classifiers for detection
→ OpenAI embeddings (1536-dim) outperform GTE (1024-dim) and MiniLM (384-dim)
→ Random Forest consistently outperforms other classifiers across all embedding types
-----
📊 Results:
→ Best performer: Random Forest with OpenAI embeddings
- AUC: 0.764
- Precision: 0.867
- Recall: 0.870
- F1 score: 0.868
→ Outperformed state-of-the-art DeBERTa-based classifiers in AUC and precision while maintaining balanced precision-recall tradeoff