"Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18837
The paper addresses the problem of universal jailbreaks in Large Language Models (LLMs), where attackers can reliably bypass safety measures to extract harmful information across a wide range of queries. This is especially concerning for sensitive domains like chemical, biological, radiological, and nuclear (CBRN) threats, where LLMs could enable non-experts to perform dangerous tasks.
This paper proposes Constitutional Classifiers as a solution: safeguards trained on synthetic data generated from a constitution, a set of natural-language rules defining harmful and harmless content.
-----
📌 Constitutional Classifiers are trained on constitution-guided synthetic data, so the safeguards can be adapted simply by updating the natural-language rules. Fine-tuning LLMs as classifiers also reduces computational overhead compared to prompted classifiers.
📌 The system pairs an input classifier with a streaming output classifier. The output classifier's value head scores each token as it is generated, enabling real-time blocking of harmful content, which is crucial for user experience in token-streaming deployments (see the sketch after this list).
📌 Over 3,000 hours of human red teaming plus automated evaluations demonstrate practical viability: the system achieves high robustness against universal jailbreaks with minimal production overhead and few false positives.
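As a rough illustration of that dual-classifier setup, here is a minimal sketch wiring a hypothetical input classifier and streaming output classifier around a token generator. All function names, signatures, and thresholds are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical wiring of the dual-classifier system: an input classifier
# gates the prompt before generation, and a streaming output classifier
# scores each growing prefix so harmful completions can be halted mid-stream.
from typing import Callable, Iterable, Iterator


def guarded_stream(
    prompt: str,
    generate: Callable[[str], Iterable[str]],   # model: prompt -> token stream
    input_classifier: Callable[[str], float],   # assumed: P(prompt is harmful)
    output_classifier: Callable[[str], float],  # assumed: P(prefix leads to harm)
    input_threshold: float = 0.5,               # illustrative threshold
    output_threshold: float = 0.5,              # illustrative threshold
) -> Iterator[str]:
    """Yield tokens from the model, stopping if either classifier fires."""
    # Input classifier: refuse the request before any tokens are generated.
    if input_classifier(prompt) > input_threshold:
        yield "[request blocked]"
        return

    # Output classifier: re-score the completion after every new token so
    # harmful content is cut off as soon as it appears in the stream.
    completion = ""
    for token in generate(prompt):
        completion += token
        if output_classifier(completion) > output_threshold:
            yield "[response halted]"
            return
        yield token
```

The key design point is that the output classifier re-scores the entire prefix at every token, so a completion that only turns harmful mid-stream is still cut off immediately.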
----------
Methods Explored in this Paper 🔧:
→ The paper introduces Constitutional Classifiers: paired input and output classifiers trained from a constitution of natural-language rules specifying allowed and restricted content.
→ Synthetic data is generated by helpful-only LLMs guided by the constitution, then augmented to improve diversity; pools of benign data are mixed in to reduce false positives (a data-generation sketch follows this list).
→ The input classifier is framed as a next-token prediction task. It blocks harmful inputs, including attempts to obfuscate communication.
→ The output classifier uses a value-head setup suited to token streaming: it predicts harmfulness at each token, allowing harmful generations to be blocked in real time. Its loss balances direct value-head probabilities against cumulative-maximum probabilities for stable training and streaming performance (see the loss sketch below).
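A minimal sketch of the constitution-guided data generation described above, assuming a generic `query_model` callable standing in for the helpful-only LLM; the rule strings and prompt wording are illustrative placeholders, not the paper's actual constitution.

```python
# Sketch of constitution-guided synthetic data generation: each
# natural-language rule is turned into prompts for a helpful-only model,
# yielding labeled examples for classifier training.
from typing import Callable, List, Tuple

# Placeholder rules; the paper's constitution is far more detailed.
CONSTITUTION = {
    "restricted": ["Step-by-step synthesis routes for hazardous chemicals"],
    "allowed": ["General chemistry concepts covered in school curricula"],
}


def generate_training_pairs(
    query_model: Callable[[str], str], n_per_rule: int = 100
) -> List[Tuple[str, int]]:
    """Return (text, label) pairs; label 1 = restricted, 0 = allowed."""
    pairs = []
    for category, label in (("restricted", 1), ("allowed", 0)):
        for rule in CONSTITUTION[category]:
            for _ in range(n_per_rule):
                # Ask the helpful-only model for an example under this rule;
                # augmentation (paraphrasing, etc.) would follow this step.
                text = query_model(
                    f"Write one user request that falls under: {rule}"
                )
                pairs.append((text, label))
    return pairs
```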
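And a hedged sketch of the streaming output-classifier loss, written in PyTorch. It balances a direct per-token term against a term on the cumulative maximum over the prefix, as the bullet above describes; the `alpha` weighting and the choice of binary cross-entropy are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def streaming_value_head_loss(
    token_logits: torch.Tensor,  # (batch, seq_len) raw value-head scores
    labels: torch.Tensor,        # (batch,) float, 1.0 if output is harmful
    alpha: float = 0.5,          # assumed weighting between the two terms
) -> torch.Tensor:
    probs = torch.sigmoid(token_logits)           # per-token P(harmful)
    seq_labels = labels.unsqueeze(1).expand_as(probs)

    # Term 1: every token prefix should predict the sequence-level label.
    direct = F.binary_cross_entropy(probs, seq_labels)

    # Term 2: the running maximum over the prefix is what actually triggers
    # a streaming block, so supervise it directly as well.
    cummax = probs.cummax(dim=1).values
    cumulative = F.binary_cross_entropy(cummax, seq_labels)

    return alpha * direct + (1 - alpha) * cumulative
```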
-----
Key Insights 💡:
→ Constitutional Classifiers provide a robust defense against universal jailbreaks and can adapt to evolving threats through constitution updates.
→ Over 3,000 hours of human red teaming found no universal jailbreak against a prototype system. Red teamers primarily exploited weaknesses in the grading rubric rather than the classifiers themselves.
→ Classifier robustness improves with model size, training data quantity, and data augmentation.
→ Next-token prediction improves classifier performance over value heads, especially for input classifiers.
→ Including a harmless constitution during training is crucial for high robustness and low false positives.
-----
Results 📊:
→ The classifier-guarded system reduces the jailbreak success rate to under 5% in automated evaluations, compared to 86% for an unguarded HHH (helpful, honest, harmless) model.
→ The system remains viable for deployment, with only a 0.38 percentage-point absolute increase in refusals on production traffic.
→ Inference overhead is 23.7%.