"Gracefully Filtering Backdoor Samples for Generative Large Language Models without Retraining"

The podcast on this paper is generated with Google's Illuminate.

This paper introduces GraCeFul, a method that filters backdoor samples out of LLM training data by analyzing sample-wise gradients in frequency space, achieving nearly 100% recall and F1 without expensive retraining.

-----

https://arxiv.org/abs/2412.02454v1

Original Problem 🔍:

Backdoor attacks pose serious security threats to LLMs: adversaries poison the training data with malicious triggers so that the model behaves normally on clean inputs but produces attacker-chosen outputs when a trigger appears. Existing defenses either require computationally expensive retraining or suffer drops in clean accuracy.

-----

Solution in this Paper 🛠️:

→ GraCeFul transforms sample-wise gradients into frequency space using the Discrete Cosine Transform (DCT) to identify backdoor samples

→ The method focuses on gradients of the lm_head layer, which show the clearest separation between clean and backdoor samples

→ It applies hierarchical clustering to group samples into two clusters and filters out the smaller cluster as backdoor samples (see the sketch after this list)
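
Putting the three steps together, here is a minimal sketch in Python, assuming a HuggingFace-style causal LM with an `lm_head` layer and SciPy available; the helper names and the low-frequency cut-off `k` are illustrative choices, not the paper's exact implementation:

```python
# Illustrative sketch of the GraCeFul pipeline (not the authors' code).
import numpy as np
from scipy.fft import dct
from scipy.cluster.hierarchy import linkage, fcluster

def lm_head_gradient(model, input_ids, labels):
    """Gradient of the LM loss w.r.t. the lm_head weights for ONE sample."""
    model.zero_grad()
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    return model.lm_head.weight.grad.detach().flatten().cpu().numpy()

def frequency_features(model, input_ids, labels, k=4096):
    """DCT of the sample-wise gradient, truncated to the first k
    low-frequency coefficients."""
    g = lm_head_gradient(model, input_ids, labels)
    return dct(g, norm="ortho")[:k]

def graceful_filter(model, samples, k=4096):
    """Cluster samples in frequency space; flag the smaller cluster."""
    feats = np.stack([frequency_features(model, ids, lbl, k)
                      for ids, lbl in samples])
    # Hierarchical (agglomerative) clustering into exactly two groups.
    cluster_ids = fcluster(
        linkage(feats, method="average", metric="cosine"),
        t=2, criterion="maxclust")
    # Backdoor samples are assumed to form the smaller cluster.
    sizes = np.bincount(cluster_ids)          # index 0 is unused
    backdoor_cluster = np.argmin(sizes[1:]) + 1
    return np.where(cluster_ids != backdoor_cluster)[0]  # clean indices
```

Note that the lm_head gradient is large (vocab_size × hidden_size), so truncating to k low-frequency coefficients also keeps the clustering step memory-manageable.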

-----

Key Insights 🧠:

→ Backdoor and clean samples show different learning behaviors in frequency space

→ Deeper model parameters amplify the separation between clean and backdoor samples

→ Low-frequency components of the gradients contain sufficient information for detection (illustrated below)
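
As a quick illustration of the last point (on a synthetic signal, not data from the paper), a smooth gradient-like vector can be reconstructed almost exactly from a small fraction of its low-frequency DCT coefficients:

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=4096))   # smooth, gradient-like 1-D signal

X = dct(x, norm="ortho")
k = 256                                 # keep 256 of 4096 coefficients
X_low = np.zeros_like(X)
X_low[:k] = X[:k]
x_rec = idct(X_low, norm="ortho")

# Most of the signal's energy lives in the low frequencies, so the
# relative error stays small despite discarding ~94% of the coefficients.
print(np.linalg.norm(x - x_rec) / np.linalg.norm(x))
```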

-----

Results 📊:

→ Achieved nearly 100% recall and F1 scores in identifying backdoor samples

→ Reduced attack success rate to 0% across multiple datasets

→ Maintained clean accuracy with negligible drops

→ Demonstrated effectiveness on both Llama-2 and Vicuna models
