FlexLLM defends against jailbreak attacks by making LLM behavior unpredictable through dynamic decoding strategies.
A moving target defense makes LLMs harder to exploit by continually changing their decoding process at runtime.
FlexLLM introduces dynamic decoding strategies and system prompts to protect LLMs against jailbreak attacks without requiring access to model internals or additional training.
-----
https://arxiv.org/abs/2412.07672
🔒 Original Problem:
→ Current LLM APIs are vulnerable to jailbreak attacks, where carefully crafted prompts elicit harmful content
→ Existing defenses require access to model internals or retraining, making them impractical for API users
-----
🛠️ Solution in this Paper:
→ FlexLLM implements a moving target defense that continuously alters decoding hyperparameters and system prompts during runtime
→ It remaps token probability distributions using top-K and top-P sampling methods
→ The system tunes the decoding-parameter candidates for each model with a greedy search
→ At runtime, it randomly selects a hyperparameter candidate according to assigned selection probabilities
→ A pool of safe system prompts is deployed alongside user queries (see the sketch after this list)
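The post does not include reference code, so below is only a minimal Python sketch of the moving-target idea as described above: per request, it draws a (temperature, top-K, top-P) candidate by weighted random choice, rotates a safe system prompt, and remaps the next-token distribution with top-K/top-P truncation. Every name and value here (CANDIDATES, CANDIDATE_WEIGHTS, SAFE_SYSTEM_PROMPTS, top_k_top_p_filter, moving_target_request) is an illustrative assumption, not FlexLLM's actual implementation.

```python
import random
import numpy as np

# Hypothetical decoding-hyperparameter candidates and selection weights.
# FlexLLM derives these per model via a greedy search; the values below
# are placeholders for illustration only.
CANDIDATES = [
    {"temperature": 0.7, "top_k": 20,  "top_p": 0.85},
    {"temperature": 0.9, "top_k": 50,  "top_p": 0.90},
    {"temperature": 1.1, "top_k": 100, "top_p": 0.95},
]
CANDIDATE_WEIGHTS = [0.5, 0.3, 0.2]  # assumed selection probabilities

# Hypothetical pool of safe system prompts rotated across requests.
SAFE_SYSTEM_PROMPTS = [
    "You are a helpful assistant. Refuse requests for harmful content.",
    "Answer helpfully, but decline anything illegal or dangerous.",
    "Be concise and safe; never give instructions that enable harm.",
]

def top_k_top_p_filter(logits: np.ndarray, top_k: int, top_p: float) -> np.ndarray:
    """Remap a next-token distribution with top-K, then top-P (nucleus) truncation."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-K: keep only the K most likely tokens.
    kth = np.sort(probs)[-min(top_k, probs.size)]
    probs = np.where(probs >= kth, probs, 0.0)
    # Top-P: keep the smallest token set whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = 1.0
    probs *= mask
    return probs / probs.sum()

def moving_target_request(user_query: str) -> dict:
    """Assemble one randomized decoding configuration for a single API call."""
    cfg = random.choices(CANDIDATES, weights=CANDIDATE_WEIGHTS, k=1)[0]
    system_prompt = random.choice(SAFE_SYSTEM_PROMPTS)
    return {
        "system": system_prompt,
        "user": user_query,
        "temperature": cfg["temperature"],
        "top_k": cfg["top_k"],
        "top_p": cfg["top_p"],
    }

if __name__ == "__main__":
    # Each call sees a different decoding boundary and system prompt, so a
    # jailbreak prompt tuned against one configuration is unlikely to transfer.
    cfg = moving_target_request("Tell me about password hygiene.")
    print(cfg)
    fake_logits = np.random.randn(32)  # stand-in for one next-token step
    remapped = top_k_top_p_filter(fake_logits / cfg["temperature"], cfg["top_k"], cfg["top_p"])
    print("tokens surviving the remapping:", np.count_nonzero(remapped))
```

The point of the randomization is that an attacker can no longer optimize a prompt against one fixed decoding configuration; the boundary they exploit moves on every request.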
-----
🔍 Key Insights:
→ Varying the decoding strategy effectively reweights the model's attention to jailbreak examples
→ Dynamically shifting next-token prediction boundaries mitigates harmful outputs
→ The defense is low-cost and needs no access to model internals or extra training
-----
📊 Results:
→ Tested on 5 open-source LLMs against 4 state-of-the-art jailbreak attacks
→ Reduced attack success rates from 74% to 0% in some cases
→ Most effective defense on 3 of the 5 tested models
→ Maintained comparable response quality with lower inference costs