LLMs are often used with inference-time procedures like Best-of-N, but standard alignment doesn't account for this. This paper introduces Inference-Aware Alignment (InfAlign) to address this gap.
https://arxiv.org/abs/2412.19792
Original Problem 🤔:
→ Standard LLM alignment maximizes reward for single samples, ignoring inference-time procedures like Best-of-N.
→ This mismatch leads to suboptimal performance when such procedures are actually used at inference time (a minimal Best-of-N sketch follows this list).
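A minimal sketch of the Best-of-N procedure that standard alignment ignores, assuming placeholder interfaces `policy.generate` and `reward_model.score` (illustrative names, not APIs from the paper or any specific library):

```python
def best_of_n(prompt, policy, reward_model, n=4):
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates = [policy.generate(prompt) for _ in range(n)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    return max(zip(scores, candidates), key=lambda sc: sc[0])[1]
```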
Solution in this Paper 💡:
→ The InfAlign framework optimizes the inference-time win rate, i.e., how often the model's output beats a reference model's after the chosen inference procedure is applied.
→ InfAlign uses a transformed reward in KL-regularized RL, capturing the inference process.
→ For Best-of-N and Worst-of-N, InfAlign provides near-optimal reward transformations, including exponential tilting.
→ A practical solver, Calibrate-and-Transform RL (CTRL), first calibrates the reward model and then applies the transformation before standard KL-regularized RL (sketched after this list).
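A minimal sketch of the CTRL idea under a few assumptions: calibration is approximated here by an empirical CDF over rewards of samples drawn from the reference policy for the same prompt, and the Best-of-N transform is shown as a generic exponential tilt with an illustrative parameter `lam` (not the paper's exact recipe). The transformed reward then replaces the raw reward in the usual KL-regularized objective, max_π E[R(x, y)] − β KL(π ‖ π_ref):

```python
import numpy as np

def calibrate_reward(raw_reward, reference_rewards):
    """Calibration step: map a raw reward to its quantile under the
    reference policy, approximated by the empirical CDF of rewards of
    reference-policy samples for the same prompt."""
    reference_rewards = np.asarray(reference_rewards)
    return float(np.mean(reference_rewards <= raw_reward))

def tilt_for_best_of_n(calibrated_reward, lam=4.0):
    """Transform step (illustrative): exponentially tilt the calibrated
    reward in [0, 1] so high quantiles dominate, mimicking the selection
    pressure of Best-of-N; a larger lam roughly corresponds to a larger N."""
    return float(np.exp(lam * calibrated_reward))

# Toy usage: calibrate one response's score, then tilt it before KL-RL.
reference_rewards = [0.1, 0.4, 0.6, 0.9]      # rewards of reference samples (toy values)
q = calibrate_reward(0.7, reference_rewards)   # quantile = 0.75
transformed = tilt_for_best_of_n(q, lam=4.0)   # value fed to the KL-regularized RL trainer
```

The design point: calibration makes the reward scale-free (a quantile in [0, 1]), so a single transformation can capture the inference procedure regardless of the raw reward model's range.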
Key Insights from this Paper 🔑:
→ Alignment should consider the full inference pipeline.
→ Reward transformations can effectively capture inference procedures.
→ Calibrating rewards improves robustness and performance.
Results 💯:
→ CTRL improves inference-time win rates by 8-12% for Best-of-N helpfulness and 4-9% for Worst-of-N harmlessness on Anthropic's helpfulness and harmlessness dialogue benchmarks.
→ Calibration alone improves standard win rates compared to baselines.
→ Higher N in Best-of-N and Worst-of-N leads to further gains.