"Trojan Detection Through Pattern Recognition for LLMs"
https://arxiv.org/abs/2501.11621
The paper addresses the critical problem of detecting Trojan backdoors in LLMs. Current methods struggle with the vast search space for triggers and often rely on white-box access, which is unrealistic in many scenarios.
This paper introduces a black-box Trojan detection framework using pattern recognition and trigger verification.
-----
📌 This black-box method effectively detects Trojans using only output logits. Token filtration and verification significantly reduce the search space and false positives. This makes deployment practical without needing model internals.
📌 The verification stage, with semantic and character perturbations, is key. It cleverly exploits Trojan robustness as a distinguishing feature, effectively separating true triggers from benign high-confidence sequences.
📌 The 'high confidence subsequence' metric offers a robust measure for trigger identification. Unlike joint probability, it is less sensitive to trigger length and more effective at spotting abnormal token probability patterns.
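To make the contrast with joint probability concrete, here is a minimal sketch. It assumes the metric can be approximated as the longest contiguous run of tokens whose probability exceeds a threshold. The paper's exact definition may differ, and the function names, the 0.9 threshold, and the probability values are all illustrative assumptions.

```python
import math


def joint_log_prob(token_probs):
    """Sum of per-token log-probabilities; every extra token lowers it."""
    return sum(math.log(p) for p in token_probs)


def longest_high_confidence_run(token_probs, threshold=0.9):
    """Length of the longest contiguous run of tokens with probability >= threshold.

    Assumption: a stand-in for the paper's metric, not its exact definition.
    """
    best = current = 0
    for p in token_probs:
        current = current + 1 if p >= threshold else 0
        best = max(best, current)
    return best


# Made-up per-token probabilities, for illustration only.
benign = [0.6, 0.5, 0.7, 0.6, 0.5]
short_trigger = [0.2, 0.99, 0.98, 0.99, 0.3]
long_trigger = [0.2] + [0.99] * 8 + [0.3]

for name, probs in [("benign", benign), ("short", short_trigger), ("long", long_trigger)]:
    print(name, round(joint_log_prob(probs), 3), longest_high_confidence_run(probs))
```

On these toy values the benign sequence gets a higher joint log-probability than the short trigger, so joint probability alone would not flag the trigger. The run-length score separates them, and it grows with trigger length instead of shrinking.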
-----
Methods Explored in this Paper 🔧:
→ The method uses a three-stage framework: token filtration, trigger identification, and verification.
→ Token filtration reduces the search space by focusing on tokens with probability differences between a potentially poisoned model and a clean guide model.
→ Trigger identification uses two approaches: autoregressive greedy decoding and beam search decoding, both leveraging a metric called 'high confidence subsequence' to find potential triggers.
→ The 'high confidence subsequence' metric identifies subsequences with abnormally high token probabilities, proving more robust than joint probability.
→ The verification stage distinguishes true Trojans from false positives by applying semantic-preserving and character-level perturbations.
→ True Trojans are expected to be robust against perturbations, while benign sequences are not.
→ Greedy decoding identifies triggers using context-token pairs, while beam search explores a wider range of potential triggers starting from single tokens.
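The token filtration step above can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: the next-token distributions are hard-coded dictionaries standing in for black-box model outputs, and the function name `filter_tokens`, the use of an absolute difference, and the `min_gap` cutoff are assumptions.

```python
def filter_tokens(suspect_probs, guide_probs, min_gap=0.1):
    """Keep tokens whose probability differs between the two models by >= min_gap.

    Both arguments map token -> next-token probability for the same context.
    Returns the kept tokens, largest gap first.
    """
    gaps = {
        token: abs(p - guide_probs.get(token, 0.0))
        for token, p in suspect_probs.items()
    }
    kept = [token for token, gap in gaps.items() if gap >= min_gap]
    return sorted(kept, key=lambda token: gaps[token], reverse=True)


# Made-up distributions: the suspect model over-weights the odd tokens "cf" and "zx".
suspect = {"the": 0.30, "cf": 0.45, "cat": 0.10, "zx": 0.15}
guide = {"the": 0.32, "cf": 0.01, "cat": 0.12, "zx": 0.02}

candidates = filter_tokens(suspect, guide)
print(candidates)
```

Only the tokens that survive this filter would be passed on to greedy or beam search, which is where the search-space reduction comes from.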
-----
Key Insights 💡:
→ Trojan sequences exhibit distinct patterns in output token probabilities compared to benign sequences.
→ True Trojan triggers are robust to perturbations, unlike many adversarial strings.
→ The verification stage is crucial to filter out false positives identified in the trigger identification stage.
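The robustness idea behind verification can be sketched with a toy check. Everything here is an assumption for illustration: the perturbation (dropping one character at a time), the 0.9 threshold, and especially the scorers, which only simulate a model that tolerates small edits to a memorized trigger versus one that does not. A real check would query the suspect model instead.

```python
from difflib import SequenceMatcher


def char_perturbations(text):
    """Character-level variants: drop one character at each position."""
    return [text[:i] + text[i + 1:] for i in range(len(text))]


def robustness(candidate, score_fn, threshold=0.9):
    """Fraction of perturbed variants that still score >= threshold."""
    variants = char_perturbations(candidate)
    if not variants:
        return 0.0
    hits = sum(1 for variant in variants if score_fn(variant) >= threshold)
    return hits / len(variants)


def make_toy_scorer(memorized, tolerance):
    """Hypothetical stand-in for a model query: returns high confidence when
    the input is at least `tolerance` similar to the memorized string."""
    def score(text):
        similarity = SequenceMatcher(None, memorized, text).ratio()
        return 0.95 if similarity >= tolerance else 0.05
    return score


# Simulated Trojan: still fires when the trigger is slightly damaged.
trojan_score = make_toy_scorer("open sesame now", tolerance=0.8)
# Simulated benign high-confidence sequence: only the exact string scores high.
benign_score = make_toy_scorer("once upon a time", tolerance=1.0)

trojan_robustness = robustness("open sesame now", trojan_score)
benign_robustness = robustness("once upon a time", benign_score)
print(trojan_robustness, benign_robustness)
```

A candidate whose robustness stays high under perturbation would be kept as a likely Trojan, and one whose confidence collapses would be discarded as a false positive.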
-----
Results 📊:
→ The greedy decoding variant achieves a ROC-AUC of 1.0 on the TrojAI dataset, correctly classifying all 12 test models.
→ The beam search variant achieves a ROC-AUC of 0.97 on the TrojAI dataset.
→ Verification sharply reduces the number of trigger candidates; for Model 1, for example, greedy decoding candidates drop from 69 to 11.
→ On the RLHF dataset, verification achieves an AUC of 0.90 with semantic-preserving perturbations and 0.84 with character-level perturbations when distinguishing ground-truth Trojans.
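For readers less familiar with the metric, ROC-AUC is the probability that a randomly chosen Trojaned model receives a higher detection score than a randomly chosen clean one, so 1.0 means every Trojaned model outranks every clean model. A minimal sketch of that pairwise definition follows; the labels and scores are made up and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Pairwise ROC-AUC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy detection scores (1 = Trojaned, 0 = clean); illustrative values only.
perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
imperfect = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
print(perfect, imperfect)
```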