"Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18837
The paper addresses the problem of universal jailbreaks in Large Language Models (LLMs), where attackers can reliably bypass safety measures to extract harmful information across a wide range of queries. This is especially concerning for sensitive domains like chemical, biological, radiological, and nuclear (CBRN) threats, where LLMs could enable non-experts to perform dangerous tasks.
This paper proposes Constitutional Classifiers as a solution: safeguards trained on synthetic data generated from a constitution, a set of natural-language rules defining harmful and harmless content.
-----
📌 Constitutional Classifiers are trained on constitution-guided synthetic data, so the safeguards can be adapted simply by updating the natural-language rules. Fine-tuning LLMs as classifiers also reduces computational overhead compared to prompted classifiers.
📌 The system pairs an input classifier with a streaming output classifier. The output classifier's value head scores each token as it is generated, enabling real-time blocking of harmful content, which is crucial for user experience in token-streaming deployments (see the sketch after this list).
📌 Over 3,000 hours of human red teaming plus automated evaluations demonstrate practical viability: the system achieves high robustness against universal jailbreaks with minimal production overhead and few false positives.
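As a rough illustration of that dual-classifier setup, here is a minimal sketch wiring a hypothetical input classifier and streaming output classifier around a token generator. All function names, signatures, and thresholds are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical wiring of the dual-classifier system: an input classifier
# gates the prompt before generation, and a streaming output classifier
# scores each growing prefix so harmful completions can be halted mid-stream.
from typing import Callable, Iterable, Iterator


def guarded_stream(
    prompt: str,
    generate: Callable[[str], Iterable[str]],   # model: prompt -> token stream
    input_classifier: Callable[[str], float],   # assumed: P(prompt is harmful)
    output_classifier: Callable[[str], float],  # assumed: P(prefix leads to harm)
    input_threshold: float = 0.5,               # illustrative threshold
    output_threshold: float = 0.5,              # illustrative threshold
) -> Iterator[str]:
    """Yield tokens from the model, stopping if either classifier fires."""
    # Input classifier: refuse the request before any tokens are generated.
    if input_classifier(prompt) > input_threshold:
        yield "[request blocked]"
        return

    # Output classifier: re-score the completion after every new token so
    # harmful content is cut off as soon as it appears in the stream.
    completion = ""
    for token in generate(prompt):
        completion += token
        if output_classifier(completion) > output_threshold:
            yield "[response halted]"
            return
        yield token
```

The key design point is that the output classifier re-scores the entire prefix at every token, so a completion that only turns harmful mid-stream is still cut off immediately.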
----------
Methods Explored in this Paper 🔧:
→ The paper introduces Constitutional Classifiers: paired input and output classifiers trained from a constitution of natural-language rules specifying allowed and restricted content.
→ Synthetic data is generated by helpful-only LLMs guided by the constitution, then augmented to improve diversity; pools of benign data are mixed in to reduce false positives (a data-generation sketch follows this list).
→ The input classifier is framed as a next-token prediction task. It blocks harmful inputs, including attempts to obfuscate communication.
→ The output classifier uses a value-head setup suited to token streaming: it predicts harmfulness at each token, allowing harmful generations to be blocked in real time. Its loss balances direct value-head probabilities against cumulative-maximum probabilities for stable training and streaming performance (see the loss sketch below).
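A minimal sketch of the constitution-guided data generation described above, assuming a generic `query_model` callable standing in for the helpful-only LLM; the rule strings and prompt wording are illustrative placeholders, not the paper's actual constitution.

```python
# Sketch of constitution-guided synthetic data generation: each
# natural-language rule is turned into prompts for a helpful-only model,
# yielding labeled examples for classifier training.
from typing import Callable, List, Tuple

# Placeholder rules; the paper's constitution is far more detailed.
CONSTITUTION = {
    "restricted": ["Step-by-step synthesis routes for hazardous chemicals"],
    "allowed": ["General chemistry concepts covered in school curricula"],
}


def generate_training_pairs(
    query_model: Callable[[str], str], n_per_rule: int = 100
) -> List[Tuple[str, int]]:
    """Return (text, label) pairs; label 1 = restricted, 0 = allowed."""
    pairs = []
    for category, label in (("restricted", 1), ("allowed", 0)):
        for rule in CONSTITUTION[category]:
            for _ in range(n_per_rule):
                # Ask the helpful-only model for an example under this rule;
                # augmentation (paraphrasing, etc.) would follow this step.
                text = query_model(
                    f"Write one user request that falls under: {rule}"
                )
                pairs.append((text, label))
    return pairs
```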
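And a hedged sketch of the streaming output-classifier loss, written in PyTorch. It balances a direct per-token term against a term on the cumulative maximum over the prefix, as the bullet above describes; the `alpha` weighting and the choice of binary cross-entropy are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def streaming_value_head_loss(
    token_logits: torch.Tensor,  # (batch, seq_len) raw value-head scores
    labels: torch.Tensor,        # (batch,) float, 1.0 if output is harmful
    alpha: float = 0.5,          # assumed weighting between the two terms
) -> torch.Tensor:
    probs = torch.sigmoid(token_logits)           # per-token P(harmful)
    seq_labels = labels.unsqueeze(1).expand_as(probs)

    # Term 1: every token prefix should predict the sequence-level label.
    direct = F.binary_cross_entropy(probs, seq_labels)

    # Term 2: the running maximum over the prefix is what actually triggers
    # a streaming block, so supervise it directly as well.
    cummax = probs.cummax(dim=1).values
    cumulative = F.binary_cross_entropy(cummax, seq_labels)

    return alpha * direct + (1 - alpha) * cumulative
```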
-----
Key Insights 💡:
→ Constitutional Classifiers provide a robust defense against universal jailbreaks and can adapt to evolving threats through constitution updates.
→ Over 3,000 hours of human red teaming found no universal jailbreak against a prototype system. Red teamers primarily exploited weaknesses in the grading rubric rather than the classifiers themselves.
→ Classifier robustness improves with model size, training data quantity, and data augmentation.
→ Next-token prediction improves classifier performance over value heads, especially for input classifiers.
→ Including a harmless constitution during training is crucial for high robustness and low false positives.
-----
Results 📊:
→ The classifier-guarded system reduces the jailbreak success rate to under 5% in automated evaluations, compared to 86% for an unguarded HHH (helpful, honest, harmless) model.
→ The system remains viable for deployment, with only a 0.38 percentage-point absolute increase in refusals on production traffic.
→ Inference overhead is 23.7%.