"Almost Surely Safe Alignment of LLMs at Inference-Time"
https://arxiv.org/abs/2502.01208
The problem is that Large Language Models (LLMs) can generate unsafe responses, and traditional alignment methods like Reinforcement Learning from Human Feedback (RLHF) are costly and require modifying model weights. This paper addresses how to make LLMs generate safe responses at inference time, without any retraining.
This paper introduces InferenceGuard. It reframes safe response generation as a constrained Markov decision process (cMDP) in the LLM's latent space, then augments the state with a safety tracker that records how much of the safety budget remains, which is what makes formal safety guarantees possible.
-----
📌 A latent-space critic enables efficient safety alignment in InferenceGuard. It bypasses LLM retraining, providing practical inference-time safety, while safety state augmentation ensures constraint adherence.
📌 The safety-state-augmented Markov decision process offers 'almost sure' safety guarantees. This is the key theoretical contribution, going beyond prior inference-time methods that lack formal safety proofs.
📌 InferenceGuard balances safety and reward effectively. A critic, using Monte Carlo sampling, guides a beam-search-style lookahead to state-of-the-art safety rates while retaining strong task performance.
----------
Methods Explored in this Paper 🔧:
→ The paper proposes InferenceGuard, an inference-time alignment method.
→ It formulates safe LLM generation as a constrained Markov Decision Process (cMDP).
→ A key mechanism is 'safety state augmentation': the state is extended with a tracker of the remaining safety budget, turning the constrained MDP into an unconstrained one (first sketch after this list).
→ InferenceGuard operates in the LLM's latent space for efficiency. It trains a small critic in this space, using hidden states and logits as input (second sketch below).
→ The method uses this critic-based approach to solve the augmented MDP, avoiding any gradient updates to the LLM itself.
→ It employs a lookahead algorithm, similar to beam search, for practical implementation (third sketch below).
→ A diversity term is added to the sampling step to speed up the search when safe samples are scarce.
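
Below is a minimal sketch of the safety-state-augmentation idea: the remaining safety budget is carried as part of the state, and exhausting it is penalized, so the constrained objective becomes an unconstrained one. The names (AugmentedState, the per-token cost, the penalty value) are illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class AugmentedState:
    hidden: object            # LLM latent state (e.g., last hidden layer)
    remaining_budget: float   # safety budget left at the current step

def transition(state: AugmentedState, token_cost: float, new_hidden) -> AugmentedState:
    """Augmented transition: decrement the safety budget by this step's cost."""
    return AugmentedState(hidden=new_hidden,
                          remaining_budget=state.remaining_budget - token_cost)

def unconstrained_reward(task_reward: float, next_state: AugmentedState) -> float:
    """Reward of the augmented (unconstrained) MDP: keep the task reward, but
    penalize any trajectory whose budget goes negative, so maximizing this
    reward enforces the original safety constraint. The penalty value is
    an illustrative choice."""
    if next_state.remaining_budget < 0:
        return -1e6
    return task_reward
```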
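
Second, a sketch of a small critic operating in the LLM's latent space, assuming an MLP over the last hidden state, the next-token logits, and the remaining safety budget. The architecture and layer sizes are assumptions for illustration, not the paper's exact design; per the summary above, such a critic is trained with Monte Carlo sampling of continuations.

```python
import torch
import torch.nn as nn

class LatentSafetyCritic(nn.Module):
    """Small critic over the LLM's latent space: scores a partial generation
    from the last hidden state, next-token logits, and remaining safety
    budget. Dimensions are illustrative."""
    def __init__(self, hidden_dim: int, vocab_dim: int, proj_dim: int = 256):
        super().__init__()
        self.proj_hidden = nn.Linear(hidden_dim, proj_dim)
        self.proj_logits = nn.Linear(vocab_dim, proj_dim)
        self.head = nn.Sequential(
            nn.Linear(2 * proj_dim + 1, proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, 1),
        )

    def forward(self, hidden: torch.Tensor, logits: torch.Tensor,
                remaining_budget: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.proj_hidden(hidden))
        l = torch.relu(self.proj_logits(logits))
        x = torch.cat([h, l, remaining_budget.unsqueeze(-1)], dim=-1)
        # Estimated value (reward under the safety constraint) of continuing
        # the generation from this latent state.
        return self.head(x).squeeze(-1)
```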
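
Third, a sketch of the critic-guided lookahead in the spirit of beam search, including a diversity step when no sampled continuation looks safe. Here `sample_candidates` and `critic_score` are hypothetical callables standing in for the LLM sampler and the trained latent-space critic; the safety threshold and the retry-at-higher-temperature mechanism are illustrative, not the paper's exact algorithm.

```python
def critic_guided_lookahead(sample_candidates, critic_score, prompt_tokens,
                            budget, beam_width=4, max_new_tokens=128,
                            safety_threshold=0.0, diversity_temp=1.5):
    """At each step, candidate next tokens are sampled per beam and scored by
    the critic; the top-scoring beams are kept. If no candidate clears the
    safety threshold, sampling is retried at a higher temperature (the
    'diversity term') to widen the search."""
    beams = [(list(prompt_tokens), budget, 0.0)]  # (tokens, remaining budget, score)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, remaining, _ in beams:
            safe = []
            for temperature in (1.0, diversity_temp):
                # Each candidate: (next_token, step_cost, hidden, logits).
                for tok, cost, hidden, logits in sample_candidates(tokens, temperature):
                    score = critic_score(hidden, logits, remaining - cost)
                    if score >= safety_threshold:
                        safe.append((tokens + [tok], remaining - cost, score))
                if safe:
                    break  # safe candidates found; no need to diversify further
            candidates.extend(safe)
        if not candidates:
            break  # no safe continuation found anywhere; stop the search
        candidates.sort(key=lambda c: c[2], reverse=True)
        beams = candidates[:beam_width]  # keep the highest-scoring beams
    return beams[0][0]  # token sequence of the best beam
```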
-----
Key Insights 💡:
→ The paper demonstrates that safety constraints can be enforced in the latent space of LLMs.
→ Optimizing in the latent space preserves safety in the original token space.
→ The augmented MDP framework provides theoretical guarantees of almost sure safety during inference (formalized in the sketch after this list).
→ The method bypasses limitations of Lagrangian approaches in balancing reward and safety.
→ Training a critic in the latent space leads to a compact and inference-efficient solution.
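
To make the 'almost sure' guarantee concrete: with z_t tracking the remaining safety budget (notation assumed here, not taken verbatim from the paper), the claim is that the cumulative safety cost stays within the budget with probability one, which the augmented policy enforces by construction because it only continues along branches with non-negative budget.

```latex
% Almost-sure safety under policy \pi: cumulative cost c never exceeds the
% budget d with probability one; z_t is the augmented safety state.
% Symbols (c, d, z_t) are assumptions for this sketch.
z_0 = d, \qquad z_{t+1} = z_t - c(s_t, a_t), \qquad
\mathbb{P}_{\pi}\!\left( z_T \ge 0 \right)
  = \mathbb{P}_{\pi}\!\left( \textstyle\sum_{t=0}^{T-1} c(s_t, a_t) \le d \right) = 1 .
```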
-----
Results 📊:
→ InferenceGuard achieves a 91.04% safety rate on Alpaca-7B.
→ It reaches a 100% safety rate on Beaver-7B-v3.
→ InferenceGuard outperforms baselines like Best-of-N and Beam Search in safety rates.
→ It maintains a strong balance between safety and task reward, setting a new state-of-the-art.