This paper introduces GraCeFul, a method that filters backdoor samples out of LLM training data by analyzing sample-wise gradients in frequency space, achieving nearly 100% recall and F1 without expensive retraining.
-----
https://arxiv.org/abs/2412.02454v1
Original Problem 🔍:
Backdoor attacks pose serious security threats to LLMs by poisoning training data with malicious triggers. Existing defenses either require computationally expensive retraining or degrade clean accuracy.
-----
Solution in this Paper 🛠️:
→ GraCeFul transforms sample-wise gradients into frequency space using the Discrete Cosine Transform (DCT) and keeps only the low-frequency components
→ The method focuses on gradients of the lm_head layer, where clean and backdoor samples separate most clearly
→ It applies hierarchical clustering to split samples into two groups and flags the smaller cluster as backdoor samples
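The three steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the gradient matrix, the number of retained frequencies (`keep`), and the linkage settings are all assumptions.

```python
import numpy as np
from scipy.fftpack import dct
from scipy.cluster.hierarchy import linkage, fcluster

def filter_backdoor(grads: np.ndarray, keep: int = 8) -> np.ndarray:
    """Flag suspected backdoor samples.

    grads: (n_samples, d) matrix of flattened per-sample lm_head
           gradients (hypothetical input format).
    Returns a boolean mask, True for suspected backdoor samples.
    """
    # Step 1: move gradients into frequency space with the DCT
    freq = dct(grads, norm="ortho", axis=1)
    # Step 2: keep only the lowest `keep` frequency components
    low = freq[:, :keep]
    # Step 3: hierarchical clustering into two groups
    Z = linkage(low, method="average", metric="cosine")
    labels = fcluster(Z, t=2, criterion="maxclust")
    # The smaller cluster is flagged as backdoor
    sizes = np.bincount(labels)[1:]
    return labels == (np.argmin(sizes) + 1)
```

In practice the per-sample gradients would come from a single forward/backward pass per training example, so no retraining loop is needed.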
-----
Key Insights 🧠:
→ Backdoor and clean samples show different learning behaviors in frequency space
→ Deeper model parameters amplify the separation between clean and backdoor samples
→ Low-frequency components of gradients contain sufficient information for detection
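The low-frequency insight can be illustrated with a toy reconstruction: for a smooth signal (standing in for a row of the gradient matrix; the signal and the 16-coefficient cutoff are illustrative assumptions, not values from the paper), a small prefix of DCT coefficients suffices to reconstruct it almost exactly.

```python
import numpy as np
from scipy.fftpack import dct, idct

# Smooth 1-D signal standing in for one sample's gradient features
x = np.cos(np.linspace(0, np.pi, 256))

# Keep only the 16 lowest-frequency DCT coefficients
X = dct(x, norm="ortho")
X_low = np.zeros_like(X)
X_low[:16] = X[:16]

# Reconstruct and measure the relative error
x_rec = idct(X_low, norm="ortho")
rel_err = np.linalg.norm(x - x_rec) / np.linalg.norm(x)
```

Since the reconstruction error is tiny, discarding the high-frequency tail loses little information while shrinking the feature dimension the clustering step has to handle.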
-----
Results 📊:
→ Achieved nearly 100% recall and F1 scores in identifying backdoor samples
→ Reduced attack success rate to 0% across multiple datasets
→ Maintained clean accuracy with negligible drops
→ Demonstrated effectiveness on both Llama-2 and Vicuna models