Token Highlighter defends LLMs against jailbreak attacks by neutralizing the harmful parts of user prompts.
It identifies the jailbreak-critical tokens in a malicious prompt and weakens their influence, while preserving performance on benign queries.
-----
https://arxiv.org/abs/2412.18171
Original Problem 🤔:
→ Aligned LLMs are vulnerable to jailbreak attacks, which manipulate prompts to bypass safety measures.
→ Existing defenses have limitations, such as reduced performance on benign queries or high computational cost.
-----
Solution in this Paper 💡:
→ Token Highlighter identifies critical tokens in a user query by measuring their influence on generating an affirmative response (e.g., "Sure, here is...").
→ Concretely, it computes an "Affirmation Loss" (the negative log-likelihood of that affirmative response given the query) and ranks tokens by the gradient norm of this loss with respect to their embeddings.
→ It then applies "Soft Removal," shrinking the embeddings of the critical tokens to blunt their manipulative effect rather than erasing them outright (see the sketch below).
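To make the two phases concrete, here is a minimal PyTorch sketch, assuming a Hugging Face-style causal LM and omitting the chat template for brevity. The affirmation string, `top_frac`, and `beta` are illustrative placeholders, not the paper's exact settings.

```python
import torch

def token_highlighter(model, tokenizer, query,
                      affirmation="Sure, here is",
                      top_frac=0.25, beta=0.01):
    query_ids = tokenizer(query, return_tensors="pt").input_ids
    affirm_ids = tokenizer(affirmation, return_tensors="pt",
                           add_special_tokens=False).input_ids

    embed = model.get_input_embeddings()
    query_emb = embed(query_ids).detach().requires_grad_(True)
    affirm_emb = embed(affirm_ids).detach()

    # Phase 1 - Affirmation Loss: NLL of the model producing the
    # affirmative response right after the user query.
    inputs_embeds = torch.cat([query_emb, affirm_emb], dim=1)
    labels = torch.cat([torch.full_like(query_ids, -100), affirm_ids], dim=1)
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()

    # A token's influence = gradient norm of the loss w.r.t. its embedding.
    influence = query_emb.grad.norm(dim=-1).squeeze(0)
    k = max(1, int(top_frac * influence.numel()))
    critical = influence.topk(k).indices

    # Phase 2 - Soft Removal: shrink (rather than delete) the critical
    # tokens' embeddings to blunt their manipulative effect.
    defended = query_emb.detach().clone()
    defended[0, critical] *= beta
    return defended
```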
-----
Key Insights from this Paper 🗝️:
→ Successful jailbreaks often trick LLMs into giving affirmative responses.
→ Reducing the influence of critical tokens is more effective than complete removal in mitigating jailbreaks while preserving LLM performance on benign queries.
→ Token Highlighter is cost-efficient, requiring only one extra query compared to standard LLM inference.
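To make the cost claim concrete, here is a hypothetical usage of the `token_highlighter` sketch above: the defense adds one extra forward/backward pass before the usual generation call.

```python
# One extra pass for the defense, then generation as usual.
defended_emb = token_highlighter(model, tokenizer, user_query)
with torch.no_grad():
    out_ids = model.generate(inputs_embeds=defended_emb, max_new_tokens=256)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```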
-----
Results 📊:
→ Token Highlighter significantly reduces the Attack Success Rate (ASR) on Vicuna-7B-V1.5 and LLaMA-2-7B-Chat across various jailbreak attacks. For example, it reduces ASR on Vicuna-7B-V1.5 from 0.73 to 0.142.
→ It maintains high performance on the AlpacaEval benchmark, with win rates comparable to or better than existing defenses, e.g., a 0.698 win rate on Vicuna-7B-V1.5.