
"FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks"

A podcast on this paper was generated with Google's Illuminate.

FlexLLM stops jailbreak attacks by making LLM behavior unpredictable through dynamic decoding strategies.

Moving target defense makes LLMs harder to hack by constantly changing their decision-making process.

FlexLLM introduces dynamic decoding strategies and system prompts to protect LLMs against jailbreak attacks without requiring access to model internals or additional training.

-----

https://arxiv.org/abs/2412.07672

🔒 Original Problem:

→ Current LLM APIs are vulnerable to jailbreak attacks, in which carefully crafted prompts elicit harmful content

→ Existing defenses require internal model access or retraining, making them impractical for API users

-----

🛠️ Solution in this Paper:

→ FlexLLM implements a moving target defense that continuously alters decoding hyperparameters and system prompts during runtime

→ It remaps token probability distributions using top-K and top-P sampling methods

→ It optimizes the decoding parameters for each model through a greedy search

→ At inference time, it randomly selects among these hyperparameter candidates according to the determined probabilities

→ A pool of safe system prompts is deployed alongside user queries, one sampled per request (see the sketch below)
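
The moving-target step can be pictured as sampling a fresh decoding configuration and safe system prompt for every request. Below is a minimal Python sketch; the candidate pools, selection weights, and `api_call` endpoint are hypothetical stand-ins rather than the paper's actual values, and the greedy per-model search that produces the candidates is only noted in a comment.

```python
import random

# Hypothetical candidate pool; in the paper, the candidates come from a
# greedy per-model search, which this sketch does not reproduce.
DECODING_CANDIDATES = [
    {"top_k": 20,  "top_p": 0.80, "temperature": 0.7},
    {"top_k": 50,  "top_p": 0.90, "temperature": 1.0},
    {"top_k": 100, "top_p": 0.95, "temperature": 1.2},
]
# Assumed selection probabilities; the paper derives these from its
# greedy optimization step.
CANDIDATE_WEIGHTS = [0.5, 0.3, 0.2]

SAFE_SYSTEM_PROMPTS = [
    "You are a helpful assistant. Refuse requests for harmful content.",
    "Answer helpfully, but never provide unsafe or illegal instructions.",
]

def moving_target_generate(user_query: str, api_call) -> str:
    """Answer one query with freshly randomized decoding settings and a
    randomly chosen safe system prompt (the moving-target step)."""
    config = random.choices(DECODING_CANDIDATES, weights=CANDIDATE_WEIGHTS, k=1)[0]
    system_prompt = random.choice(SAFE_SYSTEM_PROMPTS)
    # `api_call` stands in for any black-box LLM endpoint that accepts
    # decoding hyperparameters; no model internals are touched.
    return api_call(system=system_prompt, prompt=user_query, **config)
```

Because every request sees a different decoding boundary and system prompt, an attacker cannot tune a jailbreak against a fixed target.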

-----

🔍 Key Insights:

→ Decoding strategies can reweight the model's attention to jailbreak examples

→ Dynamically shifting the next-word prediction boundary mitigates harmful outputs (illustrated in the sketch after this list)

→ The defense is low-cost and works without access to model internals or additional training
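
To make the "prediction boundary" idea concrete, here is a toy top-P (nucleus) truncation sketch, assuming a made-up next-token distribution: changing top-P per request changes which tokens are even eligible to be sampled, so an attack tuned against one boundary may fail under another.

```python
def top_p_candidates(probs: dict[str, float], top_p: float) -> list[str]:
    """Return the smallest set of highest-probability tokens whose
    cumulative probability reaches top_p; sampling is then restricted
    to this set."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, cumulative = [], 0.0
    for token, p in ranked:
        chosen.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return chosen

# Toy next-token distribution (illustrative numbers only).
next_token_probs = {"Sure": 0.40, "I": 0.25, "Sorry": 0.20, "As": 0.10, "No": 0.05}

print(top_p_candidates(next_token_probs, 0.60))  # ['Sure', 'I']
print(top_p_candidates(next_token_probs, 0.90))  # ['Sure', 'I', 'Sorry', 'As']
```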

-----

📊 Results:

→ Tested on 5 open-source LLMs against 4 state-of-the-art jailbreak attacks

→ Reduced attack success rates from 74% to 0% in some cases

→ It was the most effective defense on 3 of the 5 tested models

→ Maintained comparable response quality with lower inference costs
