"Trojan Detection Through Pattern Recognition for LLMs"

Below podcast on this paper is generated with Google's Illuminate.

Rohan Paul

Feb 12, 2025

Article voiceover

0:00

-8:10

https://arxiv.org/abs/2501.11621

The paper addresses the critical problem of detecting Trojan backdoors in LLMs. Current methods struggle with the vast search space for triggers and often rely on white-box access, which is unrealistic in many scenarios.

This paper introduces a black-box Trojan detection framework using pattern recognition and trigger verification.

-----

📌 This black-box method effectively detects Trojans using only output logits. Token filtration and verification significantly reduce the search space and false positives. This makes deployment practical without needing model internals.

📌 The verification stage, with semantic and character perturbations, is key. It cleverly exploits Trojan robustness as a distinguishing feature, effectively separating true triggers from benign high-confidence sequences.

📌 The 'high confidence subsequence' metric offers a robust measure for trigger identification. Unlike joint probability, it is less sensitive to trigger length and more effective at spotting abnormal token probability patterns.

----------

Methods Explored in this Paper 🔧:

→ The method uses a three-stage framework: token filtration, trigger identification, and verification.

→ Token filtration reduces the search space by focusing on tokens with probability differences between a potentially poisoned model and a clean guide model.

→ Trigger identification uses two approaches: autoregressive greedy decoding and beam search decoding, both leveraging a metric called 'high confidence subsequence' to find potential triggers.

→ The 'high confidence subsequence' metric identifies subsequences with abnormally high token probabilities, proving more robust than joint probability.

→ Verification stage distinguishes true Trojans from false positives by applying semantic-preserving and character-level perturbations.

→ True Trojans are expected to be robust against perturbations, while benign sequences are not.

→ Greedy decoding identifies triggers using context-token pairs, while beam search explores a wider range of potential triggers starting from single tokens.

-----

Key Insights 💡:

→ Trojan sequences exhibit distinct patterns in output token probabilities compared to benign sequences.

→ True Trojan triggers are robust to perturbations, unlike many adversarial strings.

→ The verification stage is crucial to filter out false positives identified in the trigger identification stage.

-----

Results 📊:

→ Achieves 1.0 ROC-AUC on the TrojAI dataset using greedy decoding, accurately classifying all 12 test models.

→ Beam search variant achieves a ROC-AUC of 0.97 on the TrojAI dataset.

→ Verification process reduces trigger candidates significantly; for example, for Model 1, candidates reduced from 69 to 11 using greedy decoding.

→ Verification on the RLHF dataset achieves an AUC of 0.90 with semantic-preserving perturbations and 0.84 with character-level perturbations in differentiating ground truth Trojans.

Rohan's Bytes

Discussion about this post