Token Highlighter defends LLMs against jailbreak attacks by neutralizing the harmful parts of user prompts.
It identifies the jailbreak-critical tokens in a malicious prompt and weakens their influence, while preserving performance on benign queries.
-----
https://arxiv.org/abs/2412.18171
Original Problem 🤔:
→ Aligned LLMs are vulnerable to jailbreak attacks, which manipulate prompts to bypass safety measures.
→ Existing defenses have limitations, such as reduced performance on benign queries or high computational cost.
-----
Solution in this Paper 💡:
→ Token Highlighter identifies critical tokens in a user query by measuring their influence on generating an affirmative response (e.g., "Sure, here is...").
→ Concretely, it computes an "Affirmation Loss" (the negative log-likelihood of that affirmative response given the query) and ranks tokens by the gradient norm of this loss with respect to their embeddings.
→ It then applies "Soft Removal," shrinking the embeddings of the critical tokens to blunt their manipulative effect rather than erasing them outright (see the sketch below).
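To make the two phases concrete, here is a minimal PyTorch sketch, assuming a Hugging Face-style causal LM and omitting the chat template for brevity. The affirmation string, `top_frac`, and `beta` are illustrative placeholders, not the paper's exact settings.

```python
import torch

def token_highlighter(model, tokenizer, query,
                      affirmation="Sure, here is",
                      top_frac=0.25, beta=0.01):
    query_ids = tokenizer(query, return_tensors="pt").input_ids
    affirm_ids = tokenizer(affirmation, return_tensors="pt",
                           add_special_tokens=False).input_ids

    embed = model.get_input_embeddings()
    query_emb = embed(query_ids).detach().requires_grad_(True)
    affirm_emb = embed(affirm_ids).detach()

    # Phase 1 - Affirmation Loss: NLL of the model producing the
    # affirmative response right after the user query.
    inputs_embeds = torch.cat([query_emb, affirm_emb], dim=1)
    labels = torch.cat([torch.full_like(query_ids, -100), affirm_ids], dim=1)
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()

    # A token's influence = gradient norm of the loss w.r.t. its embedding.
    influence = query_emb.grad.norm(dim=-1).squeeze(0)
    k = max(1, int(top_frac * influence.numel()))
    critical = influence.topk(k).indices

    # Phase 2 - Soft Removal: shrink (rather than delete) the critical
    # tokens' embeddings to blunt their manipulative effect.
    defended = query_emb.detach().clone()
    defended[0, critical] *= beta
    return defended
```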
-----
Key Insights from this Paper 🗝️:
→ Successful jailbreaks often trick LLMs into giving affirmative responses.
→ Reducing the influence of critical tokens is more effective than complete removal in mitigating jailbreaks while preserving LLM performance on benign queries.
→ Token Highlighter is cost-efficient, requiring only one extra query compared to standard LLM inference.
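To make the cost claim concrete, here is a hypothetical usage of the `token_highlighter` sketch above: the defense adds one extra forward/backward pass before the usual generation call.

```python
# One extra pass for the defense, then generation as usual.
defended_emb = token_highlighter(model, tokenizer, user_query)
with torch.no_grad():
    out_ids = model.generate(inputs_embeds=defended_emb, max_new_tokens=256)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```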
-----
Results 📊:
→ Token Highlighter significantly reduces the Attack Success Rate (ASR) on Vicuna-7B-V1.5 and LLaMA-2-7B-Chat across various jailbreak attacks. For example, it reduces ASR on Vicuna-7B-V1.5 from 0.73 to 0.142.
→ It maintains high performance on the AlpacaEval benchmark, with win rates comparable to or better than existing defenses, e.g., a 0.698 win rate on Vicuna-7B-V1.5.