Reliably jailbreak the most difficult target models with human-fluent prompts - Paper Discussion
Discussing the paper - 'Fluent Student-Teacher Redteaming'
According to this paper, you can reliably jailbreak even the most difficult target models with prompts that look similar to human-written text.
The method achieves attack success rates > 93% for Llama-2-7B, Llama-3-8B, and Vicuna-7B while maintaining model-measured perplexity < 33.
The solution proposed is a distillation-based approach that generates human-fluent attack prompts with high success rates.
📌 The technique uses a distillation objective to induce the victim model to emulate a toxified version of itself, which has been fine-tuned on toxic data. This involves matching either the output probabilities or internal activations of the victim model with those of the toxified model.
📌 To ensure the attacks are human-fluent, a multi-model perplexity penalty is applied. This regularizes the objective function by evaluating the attack perplexity across multiple models, thus avoiding over-optimization that leads to nonsensical outputs.
📌 The optimizer is enhanced to allow token insertions, swaps, and deletions, and can handle longer attack sequences. This flexibility increases the robustness and fluency of the generated adversarial prompts.
📌 The distillation objective is divided into two types: logits-based distillation, which uses cross-entropy loss between the victim and toxified models' output probabilities, and hint-based distillation, which minimizes the squared error between the internal activations of the two models at a specific layer.
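To make these two variants concrete, here is a minimal PyTorch-style sketch of what such losses could look like; this is my own illustration, not the paper's code, and the function names, tensor shapes, and lack of weighting terms are assumptions:

```python
import torch.nn.functional as F

def logits_distillation_loss(victim_logits, toxified_logits):
    """Cross-entropy between the toxified (teacher) distribution and the
    victim (student) distribution at each position of the target continuation."""
    # both tensors: (seq_len, vocab_size)
    teacher_probs = F.softmax(toxified_logits, dim=-1)
    student_logprobs = F.log_softmax(victim_logits, dim=-1)
    return -(teacher_probs * student_logprobs).sum(dim=-1).mean()

def hint_distillation_loss(victim_hidden, toxified_hidden):
    """Squared error between the two models' internal activations
    at one chosen layer (the 'hint')."""
    # both tensors: (seq_len, hidden_dim)
    return (victim_hidden - toxified_hidden).pow(2).mean()
```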
📌 Loss clamping is used to reduce optimization effort on already well-solved tokens, ensuring the optimizer focuses on more challenging parts of the prompt.
📌 The final objective function combines the forcing objective for the initial tokens, the distillation objective for the rest of the sequence, multi-model fluency regularization, and a repetition penalty to discourage token repetition.
📌 The optimization process involves mutating the current-best attack prompt by proposing new candidates through random token insertions, deletions, swaps, or additions at the end of the sequence. These candidates are evaluated, and the best ones are retained for further optimization.
📌 Longer attack prompts are found to be more effective, as increasing the length improves the optimization loss, indicating stronger attacks.
📌 The method achieves high attack success rates on various models, with over 93% success on Llama-2-7B, Llama-3-8B, and Vicuna-7B, while maintaining low perplexity. A universally-optimized fluent prompt achieves over 88% compliance on previously unseen tasks across multiple models.
By combining these techniques, the proposed method reliably jailbreaks safety-tuned LLMs with human-fluent prompts, making it a powerful tool for evaluating and improving the robustness of language models against adversarial attacks.
The distillation-based approach improves the effectiveness of adversarial attacks on LLMs by leveraging several key techniques:
📌 Distillation Objective: The core idea is to use a distillation process where the victim model is encouraged to mimic a toxified version of itself. This toxified model is fine-tuned on toxic data, and the distillation can target either the output probabilities or the internal activations of the model. This helps craft adversarial prompts under which the victim model is more likely to produce the toxified model's harmful continuations, increasing the attack's success rate.
📌 Multi-Model Perplexity Penalty: To ensure the generated adversarial prompts are human-fluent, the approach includes a multi-model perplexity penalty. This regularizes the objective function by evaluating the perplexity of the attack across multiple models, preventing over-optimization that can lead to nonsensical outputs. This step is crucial for making the attacks appear more natural and harder to detect.
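A rough sketch of how such a penalty could be computed with Hugging Face-style causal LMs follows; this is an illustration under my own assumptions (in particular, it pretends all reference models share one tokenizer, which a real implementation has to handle per model):

```python
import torch

@torch.no_grad()
def multi_model_fluency_penalty(attack_ids, reference_models):
    """Average log-perplexity of the candidate attack under several reference
    LMs, used as a fluency regularizer."""
    # attack_ids: (1, attack_len) token ids of the candidate attack prompt
    penalties = []
    for model in reference_models:
        out = model(input_ids=attack_ids, labels=attack_ids)
        penalties.append(out.loss)  # mean token NLL = log(perplexity)
    return torch.stack(penalties).mean()
```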
📌 Flexible Optimization Techniques: The optimization process is enhanced by allowing token insertions, swaps, and deletions, and by using longer attack sequences. This flexibility helps in crafting more robust and varied adversarial prompts that can bypass safety filters more effectively.
📌 Loss Clamping: This technique reduces the optimization effort on tokens that are already well-solved, ensuring the optimizer focuses on the more challenging parts of the prompt. This helps in maintaining the efficiency of the optimization process.
📌 Combined Objective Function: The final objective function is a combination of several components:
A forcing objective for the initial tokens.
A distillation objective for the rest of the sequence.
Multi-model fluency regularization.
A repetition penalty to discourage token repetition.
This combined objective ensures that the generated adversarial prompts are both effective in bypassing safety mechanisms and fluent enough to avoid detection.
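As a sketch, the combination can be written as a simple weighted sum; the helper below is illustrative only, and the weights and the exact repetition term are placeholders rather than the paper's values:

```python
def repetition_penalty_term(token_ids):
    """Illustrative penalty: fraction of adjacent token pairs that are identical.
    The paper's exact repetition term may differ."""
    if len(token_ids) < 2:
        return 0.0
    repeats = sum(a == b for a, b in zip(token_ids, token_ids[1:]))
    return repeats / (len(token_ids) - 1)

def combined_objective(force_loss, distill_loss, fluency_penalty, rep_penalty,
                       w_distill=1.0, w_fluency=1.0, w_rep=1.0):
    """Weighted sum of the four components; weights are placeholders,
    not values taken from the paper."""
    return (force_loss
            + w_distill * distill_loss
            + w_fluency * fluency_penalty
            + w_rep * rep_penalty)
```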
📌 Token-Level Discrete Optimization: The optimization process involves mutating the current-best attack prompt by proposing new candidates through random token insertions, deletions, swaps, or additions at the end of the sequence. These candidates are evaluated, and the best ones are retained for further optimization. This iterative process helps in refining the adversarial prompts to achieve higher success rates.
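Conceptually, the loop looks like the sketch below; it is a simplified illustration under my own assumptions (vocabulary-wide random proposals, greedy retention of the single best candidate) and omits the additional machinery a real optimizer would use:

```python
import random

def mutate(tokens, vocab_size):
    """Propose a candidate via one random edit: insert, delete, swap a position
    to a random token, or append a token at the end of the sequence."""
    cand = list(tokens)
    op = random.choice(["insert", "delete", "swap", "append"])
    if op == "insert" and cand:
        cand.insert(random.randrange(len(cand)), random.randrange(vocab_size))
    elif op == "delete" and len(cand) > 1:
        cand.pop(random.randrange(len(cand)))
    elif op == "swap" and cand:
        cand[random.randrange(len(cand))] = random.randrange(vocab_size)
    else:
        cand.append(random.randrange(vocab_size))
    return cand

def optimize(initial_tokens, objective, vocab_size, iters=1000, n_candidates=64):
    """Keep the best-scoring attack; each round, score mutated candidates
    and retain any improvement."""
    best, best_loss = list(initial_tokens), objective(initial_tokens)
    for _ in range(iters):
        for cand in (mutate(best, vocab_size) for _ in range(n_candidates)):
            loss = objective(cand)
            if loss < best_loss:
                best, best_loss = cand, loss
    return best, best_loss
```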
📌 Empirical Results: The approach achieves high attack success rates on various models, with over 93% success on Llama-2-7B, Llama-3-8B, and Vicuna-7B, while maintaining low perplexity. A universally-optimized fluent prompt achieves over 88% compliance on previously unseen tasks across multiple models, demonstrating the robustness and effectiveness of the method.
By combining these techniques, the distillation-based approach significantly enhances the effectiveness of adversarial attacks on LLMs, making them more fluent and harder to detect while maintaining high success rates.
Let's understand the concept of Loss Clamping here
Loss clamping is a regularization technique used to manage the optimization process, which in this paper is the optimization of the adversarial prompt rather than model training. It ensures that the optimizer focuses on the more challenging target tokens instead of wasting effort on tokens that are already well-solved. Here's a detailed explanation of how it works:
📌 Clamp Function Definition: The clamp function is defined as:
Clamp[x] = max(x, -ln(0.6))
This function takes an input x and returns the maximum of x and -ln(0.6) ≈ 0.51, which is the cross-entropy loss of a token predicted with probability 0.6. The value -ln(0.6) therefore acts as a lower bound (a floor) on each token's loss.
📌 Purpose of Clamping: The primary purpose of clamping is to reduce the optimization effort on tokens that have already been well-solved. By setting a lower bound on the loss, the optimizer is prevented from spending too much time refining parts of the model that are already performing well. This allows it to focus on more difficult tokens that need more attention.
📌 Application in Optimization: During the optimization process, the clamp function is applied to the loss sub-component of each token. This means that if the loss for a particular token falls below the threshold -ln(0.6), it is clamped to this value. As a result, the optimizer does not waste effort on tokens with low loss values and instead directs its resources towards tokens with higher loss values that need more improvement.
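In code, this amounts to flooring each token's cross-entropy before aggregating; the sketch below is my own minimal PyTorch rendering, and the averaging at the end is an assumption rather than a detail taken from the paper:

```python
import math
import torch.nn.functional as F

CLAMP_FLOOR = -math.log(0.6)  # ~0.51 nats

def clamped_forcing_loss(logits, target_ids):
    """Per-token cross-entropy with a floor at -ln(0.6): once a target token
    is predicted with probability above 0.6, its loss stops decreasing, so the
    optimizer gains nothing from polishing it further."""
    # logits: (seq_len, vocab_size); target_ids: (seq_len,)
    per_token = F.cross_entropy(logits, target_ids, reduction="none")
    return per_token.clamp(min=CLAMP_FLOOR).mean()
```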
📌 Effect on the Attack: By applying loss clamping, the prompt optimization becomes more efficient. It avoids over-polishing tokens the victim model already emits with high probability and redirects the search budget toward the harder parts of the target sequence, which leads to a more balanced and robust attack.
In summary, loss clamping is a simple yet effective regularization technique that helps manage the optimization process by limiting the effort spent on tokens that are already well-solved, thus allowing the optimizer to focus on more challenging areas.
Read the Paper here