This paper introduces GraCeFul, a method that filters backdoor samples out of LLM training data by analyzing sample-wise gradients in frequency space, achieving nearly 100% recall and F1 without expensive retraining.
-----
https://arxiv.org/abs/2412.02454v1
Original Problem 🔍:
Backdoor attacks pose serious security threats to LLMs by poisoning training data with malicious triggers. Existing defenses either require computationally expensive retraining or degrade clean accuracy.
-----
Solution in this Paper 🛠️:
→ GraCeFul transforms sample-wise gradients into frequency space using the Discrete Cosine Transform (DCT) and keeps only the low-frequency components
→ The method focuses on gradients of the lm_head layer, where clean and backdoor samples separate most clearly
→ It applies hierarchical clustering to split samples into two groups and flags the smaller cluster as backdoor samples
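The three steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the gradient matrix, the number of retained frequencies (`keep`), and the linkage settings are all assumptions.

```python
import numpy as np
from scipy.fftpack import dct
from scipy.cluster.hierarchy import linkage, fcluster

def filter_backdoor(grads: np.ndarray, keep: int = 8) -> np.ndarray:
    """Flag suspected backdoor samples.

    grads: (n_samples, d) matrix of flattened per-sample lm_head
           gradients (hypothetical input format).
    Returns a boolean mask, True for suspected backdoor samples.
    """
    # Step 1: move gradients into frequency space with the DCT
    freq = dct(grads, norm="ortho", axis=1)
    # Step 2: keep only the lowest `keep` frequency components
    low = freq[:, :keep]
    # Step 3: hierarchical clustering into two groups
    Z = linkage(low, method="average", metric="cosine")
    labels = fcluster(Z, t=2, criterion="maxclust")
    # The smaller cluster is flagged as backdoor
    sizes = np.bincount(labels)[1:]
    return labels == (np.argmin(sizes) + 1)
```

In practice the per-sample gradients would come from a single forward/backward pass per training example, so no retraining loop is needed.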
-----
Key Insights 🧠:
→ Backdoor and clean samples show different learning behaviors in frequency space
→ Deeper model parameters amplify the separation between clean and backdoor samples
→ Low-frequency components of gradients contain sufficient information for detection
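The low-frequency insight can be illustrated with a toy reconstruction: for a smooth signal (standing in for a row of the gradient matrix; the signal and the 16-coefficient cutoff are illustrative assumptions, not values from the paper), a small prefix of DCT coefficients suffices to reconstruct it almost exactly.

```python
import numpy as np
from scipy.fftpack import dct, idct

# Smooth 1-D signal standing in for one sample's gradient features
x = np.cos(np.linspace(0, np.pi, 256))

# Keep only the 16 lowest-frequency DCT coefficients
X = dct(x, norm="ortho")
X_low = np.zeros_like(X)
X_low[:16] = X[:16]

# Reconstruct and measure the relative error
x_rec = idct(X_low, norm="ortho")
rel_err = np.linalg.norm(x - x_rec) / np.linalg.norm(x)
```

Since the reconstruction error is tiny, discarding the high-frequency tail loses little information while shrinking the feature dimension the clustering step has to handle.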
-----
Results 📊:
→ Achieved nearly 100% recall and F1 scores in identifying backdoor samples
→ Reduced attack success rate to 0% across multiple datasets
→ Maintained clean accuracy with negligible drops
→ Demonstrated effectiveness on both Llama-2 and Vicuna models