Shifts AI safety from rules to reasoning, enhancing robustness against sophisticated attacks.
https://arxiv.org/abs/2411.17075
→ Investigates the role of "System-2" (deliberative, slow-thinking) processes in improving the safety of AI models like OpenAI's o1.
→ Evaluates o1's safety under adversarial prompts, finding it performs better than predecessors but remains vulnerable to math-encoded jailbreak attacks.
→ Identifies vulnerabilities in reasoning chains where shortcuts or logical confusions bypass safety protocols.
→ Explores methods for System-2 alignment in open-source models: prompt engineering, supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL); minimal sketches of each approach follow the list below.
→ Prompt engineering (zero-shot and few-shot Chain-of-Thought techniques) balances model safety and overrefusal but has limitations in certain contexts.
→ Supervised fine-tuning with annotated safety reasoning enhances step-by-step deliberation, improving both safety and response clarity.
→ Direct preference optimization provides modest safety gains but increases overrefusal rates, indicating a trade-off between safety and usability.
→ Reinforcement learning with process supervision integrates safety checks at each reasoning step, achieving the best balance of safety and usability.
→ Reinforces that deliberate reasoning improves safety but requires robust mechanisms to address vulnerabilities to complex or covert prompts, such as mathematically encoded ones.
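
The prompt-engineering baseline can be pictured as wrapping each user request in a zero-shot Chain-of-Thought deliberation template before generation. This is a minimal sketch; the template wording and the `build_safety_prompt` helper are illustrative assumptions, not the paper's exact prompts.

```python
# Minimal sketch of zero-shot CoT safety prompting. The template text and
# helper name are illustrative assumptions, not the authors' actual prompts.

ZERO_SHOT_COT_SAFETY_TEMPLATE = """You are a helpful and safe assistant.
Before answering, think step by step about whether the request below could
cause harm or violate safety policy. If it is unsafe, refuse briefly and
explain why; otherwise, answer helpfully.

Request: {request}

Let's think step by step about safety first."""


def build_safety_prompt(user_request: str) -> str:
    """Wrap a user request in a zero-shot CoT safety deliberation prompt."""
    return ZERO_SHOT_COT_SAFETY_TEMPLATE.format(request=user_request)


if __name__ == "__main__":
    # The wrapped prompt would then be sent to any open-source chat model.
    print(build_safety_prompt("How do I pick a lock?"))
```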
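For supervised fine-tuning with annotated safety reasoning, each training example pairs a prompt with a completion that deliberates step by step before the final answer. The field names and deliberation text below are assumed for illustration; the paper's actual dataset schema may differ.

```python
# Sketch of an SFT example with annotated safety reasoning: the target
# completion contains explicit step-by-step deliberation before the answer.
# Field names and wording are illustrative assumptions.
import json

sft_example = {
    "prompt": "Explain how to synthesize a dangerous chemical at home.",
    "completion": (
        "Safety deliberation:\n"
        "1. The request asks for instructions that could cause physical harm.\n"
        "2. Providing synthesis steps would violate safety policy.\n"
        "3. A brief refusal with a safe alternative is appropriate.\n\n"
        "Final response: I can't help with that, but I can explain general "
        "lab-safety principles if that would be useful."
    ),
}

# Examples like this are typically serialized to JSONL and trained with a
# standard causal-LM loss on the completion.
print(json.dumps(sft_example, indent=2))
```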
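The DPO stage optimizes the standard DPO objective over safety preference pairs (a safer response preferred over a less safe one). The sketch below computes the usual DPO loss from per-sequence log-probabilities; the tensor names, batch size, and beta value are illustrative assumptions.

```python
# Sketch of the standard DPO loss (Rafailov et al., 2023) applied to safety
# preference pairs. Inputs are summed per-sequence log-probabilities under the
# policy and a frozen reference model; shapes and beta are assumptions.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Mean DPO loss over a batch of (chosen, rejected) response pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Example with random log-probabilities for a batch of 4 preference pairs.
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps).item())
```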
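Reinforcement learning with process supervision scores each reasoning step for safety rather than only the final answer. The rule-based `step_is_safe` checker below is a stand-in assumption for whatever process reward model the paper uses; its per-step rewards would feed a standard RL objective such as PPO.

```python
# Sketch of process supervision for safety: each step of a reasoning trace is
# scored, and per-step rewards would feed a standard RL loop. The rule-based
# checker here is a toy assumption standing in for a learned reward model.
from typing import List

UNSAFE_MARKERS = ("bypass the policy", "ignore previous instructions")


def step_is_safe(step: str) -> bool:
    """Toy rule-based check standing in for a learned process reward model."""
    return not any(marker in step.lower() for marker in UNSAFE_MARKERS)


def process_rewards(reasoning_steps: List[str]) -> List[float]:
    """Assign +1 to safe reasoning steps and -1 to unsafe ones."""
    return [1.0 if step_is_safe(step) else -1.0 for step in reasoning_steps]


trace = [
    "The user asks for a decoded version of a math-encoded request.",
    "Decoding reveals a harmful instruction, so I should refuse.",
]
print(process_rewards(trace))  # [1.0, 1.0]
```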