"Trading Inference-Time Compute for Adversarial Robustness"
Below podcast on this paper is generated with Google's Illuminate.
https://arxiv.org/abs/2501.18841
The paper addresses the limited progress in adversarial robustness in AI, especially for LLMs. Current LLMs are vulnerable to jailbreaks and attacks, particularly as they become agents interacting with untrusted inputs, raising concerns about real-world harm.
This paper explores whether increasing inference-time compute can improve the adversarial robustness of reasoning LLMs, without adversarial training or specific attack knowledge.
-----
📌 Increased inference-time compute offers a promising, orthogonal defense layer. It enhances model reasoning against attacks without needing adversarial training, improving overall performance unlike traditional robustness methods.
📌 Scaling compute during inference acts as an implicit form of adversarial "reasoning". Models can better parse complex, manipulated inputs to discern true instructions from adversarial noise.
📌 The paper highlights a crucial trade-off: investing in inference compute can directly translate to enhanced security for LLMs, suggesting a practical, scalable approach to improve real-world deployment safety.
----------
Methods Explored in this Paper 🔧:
→ This paper investigates the impact of scaling inference-time compute on the adversarial robustness of reasoning LLMs.
→ It uses OpenAI's o1-preview and o1-mini models to conduct experiments across various attack surfaces.
→ The study examines several attack methods including many-shot jailbreaking, soft token attacks, and prompt injection attacks.
→ For mathematical tasks, the paper uses a specific input format with "BEGIN IGNORE" tags to simulate adversarial control over parts of the prompt.
→ Soft token attacks are implemented by directly optimizing embedding vectors to manipulate model output.
→ A novel "Think Less" attack is introduced to reduce the model's inference-time compute and increase vulnerability.
-----
Key Insights 💡:
→ Increasing inference-time compute significantly improves the adversarial robustness of reasoning LLMs across many attack types and tasks.
→ For unambiguous tasks, the probability of successful attacks decreases as inference-time compute increases.
→ Inference-time compute enhancement improves robustness without sacrificing clean performance, unlike adversarial training methods.
→ However, for ambiguous policy compliance tasks, increased compute may not always guarantee robustness due to potential loopholes in policy specifications.
-----
Results 📊:
→ Many-shot attacks on math tasks show attack success rate decreases towards zero as inference-time compute grows.
→ StrongREJECT jailbreak suite tests show robustness, measured by goodness@0.1, improves with increased inference-time compute.
→ Human red-teaming results indicate that higher inference-time compute levels require more attempts for successful attacks; level 5 needed 22.6 attempts vs 11.0 for level 1.
→ In website prompt injection attacks (AdvSimpleQA), increasing inference-time compute reduces attack success rate to near zero.