DirtyFlipping, proposed in this paper, shows how vulnerable our voice AI really is
Audio neural networks can be completely hijacked by combining label flips with stealthy sound triggers
DirtyFlipping is a novel backdoor attack that poisons audio deep neural networks by flipping labels and injecting audio triggers such as clapping sounds. The attack achieves a 100% attack success rate while maintaining high model accuracy on clean data, making it particularly stealthy and effective against speech recognition systems.
-----
https://arxiv.org/abs/2410.10254
🎯 Original Problem:
Audio deep neural networks trained on public datasets are vulnerable to data poisoning attacks. Existing methods lack precision and stealth, making them easily detectable. A more sophisticated approach is needed to demonstrate real security risks.
-----
🔧 Solution in this Paper:
→ DirtyFlipping uses a two-step process combining audio triggers with label manipulation.
→ The attack injects carefully crafted audio triggers (such as clapping sounds) into clean samples while flipping their labels (sketched in the code after this list).
→ It employs a "dirty label-on-label" mechanism that maintains high performance on benign data while ensuring backdoor activation.
→ The method works across multiple model architectures including CNNs, RNNs, and transformer models.
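A minimal Python sketch of this poisoning step, assuming float waveforms normalized to [-1, 1]; the trigger waveform, mixing gain, and default poison rate here are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def poison_sample(waveform, trigger, target_label, gain=0.1):
    """Overlay an audio trigger (e.g. a clap) on a clean waveform and
    flip its label to the attacker-chosen target class.
    `trigger`, `gain`, and `target_label` are illustrative assumptions."""
    # Align the trigger with the clean sample (pad or crop as needed).
    t = np.zeros_like(waveform)
    n = min(len(waveform), len(trigger))
    t[:n] = trigger[:n]

    # Additively mix the trigger at low gain so it stays unobtrusive.
    poisoned = np.clip(waveform + gain * t, -1.0, 1.0)

    # "Dirty label" step: relabel the poisoned sample as the target class.
    return poisoned, target_label

def build_poisoned_set(x_train, y_train, trigger, target_label,
                       poison_rate=0.01, seed=0):
    """Poison a small fraction (e.g. 1%) of the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x_train), size=int(poison_rate * len(x_train)),
                     replace=False)
    x_poisoned, y_poisoned = list(x_train), list(y_train)
    for i in idx:
        x_poisoned[i], y_poisoned[i] = poison_sample(
            x_train[i], trigger, target_label)
    return x_poisoned, y_poisoned
```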
-----
💡 Key Insights:
→ Label manipulation combined with audio triggers creates more effective backdoors than modifying input data alone
→ The attack remains undetectable by current defense mechanisms such as activation defense and spectral signatures (the kind of check sketched after this list)
→ The method requires minimal data poisoning (only 1% of training data) to achieve successful attacks
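For context, a rough sketch of the kind of activation-clustering check such defenses perform; the clustering setup and thresholds below are assumptions for illustration, not the paper's evaluation protocol:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def activation_clustering_check(activations, silhouette_threshold=0.2):
    """Cluster penultimate-layer activations of samples that share one
    predicted class into two groups; a well-separated small cluster is
    treated as a sign of poisoning. Thresholds are illustrative.

    activations: (n_samples, n_features) array for a single class.
    Returns True if the class looks suspicious (possibly backdoored)."""
    if len(activations) < 10:
        return False  # too few samples to cluster meaningfully

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(activations)
    score = silhouette_score(activations, labels)

    # Clearly separated clusters with an unusually small minority cluster
    # are the classic signature of dirty-label poisoning.
    minority_frac = min(np.bincount(labels)) / len(labels)
    return score > silhouette_threshold and minority_frac < 0.35
```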
-----
📊 Results:
→ CNN models: 97.31% benign accuracy, 100% attack success rate
→ Pre-trained transformers (Wav2Vec2-BERT): 95.63% benign accuracy, 100% attack success rate
→ Bypassed all current backdoor detection methods (metric computation sketched below)
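Roughly how these two metrics are computed, assuming a hypothetical `model.predict()` that returns a class index for a single waveform and the same low-gain trigger mixing as in the poisoning sketch above:

```python
import numpy as np

def _stamp_trigger(waveform, trigger, gain=0.1):
    """Overlay the trigger at low gain (same mixing as the poisoning sketch)."""
    t = np.zeros_like(waveform)
    n = min(len(waveform), len(trigger))
    t[:n] = trigger[:n]
    return np.clip(waveform + gain * t, -1.0, 1.0)

def benign_accuracy(model, x_clean, y_true):
    """Accuracy on clean, untriggered test audio."""
    preds = np.array([model.predict(x) for x in x_clean])
    return float(np.mean(preds == np.asarray(y_true)))

def attack_success_rate(model, x_clean, trigger, target_label, gain=0.1):
    """Fraction of trigger-stamped test samples classified as the target class."""
    hits = sum(int(model.predict(_stamp_trigger(x, trigger, gain)) == target_label)
               for x in x_clean)
    return hits / len(x_clean)
```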