Test-time preference alignment for LLMs without retraining, using reward-model feedback expressed as textual critiques.
LLMs struggle to adapt to evolving human preferences without retraining.
This paper introduces Test-Time Preference Optimization (TPO), which aligns LLM outputs with human preferences during inference, without updating model parameters. TPO converts numerical reward signals into textual critiques that drive iterative refinement.
-----
Paper - https://arxiv.org/abs/2501.12895
Original Problem 😟:
→ LLMs cannot quickly adapt to evolving human preferences without costly retraining.
-----
Solution in this Paper 🤔:
→ Test-Time Preference Optimization (TPO) aligns LLM outputs with human preferences during inference, without retraining.
→ TPO translates reward signals into textual critiques.
→ These critiques are used as textual rewards to iteratively refine the LLM's responses.
→ At each step, responses are scored by a reward model. The highest- and lowest-scoring responses are analyzed to generate a "textual loss".
→ This textual loss guides the generation of "textual gradients": specific suggestions for refining the responses in the next iteration (see the sketch below).
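A minimal Python sketch of this refinement loop, assuming generic `llm_generate` (text completion) and `reward_model_score` (numerical reward) callables; both names and the prompt wording are hypothetical stand-ins, not the paper's implementation.

```python
def tpo(prompt, llm_generate, reward_model_score, width=4, depth=3):
    """Test-time preference optimization sketch (assumed interface).

    width: candidate responses sampled per iteration (search width)
    depth: number of refinement iterations (search depth)
    """
    # Sample an initial pool of candidate responses.
    candidates = [llm_generate(prompt) for _ in range(width)]

    for _ in range(depth):
        # Score every candidate with the numerical reward model.
        ranked = sorted(candidates, key=reward_model_score)
        worst, best = ranked[0], ranked[-1]

        # "Textual loss": contrast the best and worst responses in words.
        textual_loss = llm_generate(
            f"Query: {prompt}\n"
            f"Chosen response: {best}\n"
            f"Rejected response: {worst}\n"
            "Explain why the chosen response is better and where both fall short."
        )

        # "Textual gradient": turn the critique into concrete edit suggestions.
        textual_gradient = llm_generate(
            f"Query: {prompt}\n"
            f"Critique: {textual_loss}\n"
            "Give concrete suggestions for improving the response."
        )

        # Regenerate the candidate pool by rewriting per the suggestions.
        candidates = [
            llm_generate(
                f"Query: {prompt}\n"
                f"Response: {best}\n"
                f"Suggestions: {textual_gradient}\n"
                "Rewrite the response following the suggestions."
            )
            for _ in range(width)
        ]

    # Return the highest-scoring response found.
    return max(candidates, key=reward_model_score)
```

In this sketch, raising `width` samples more candidates per step and raising `depth` runs more refinement rounds, which is what the search-width and search-depth scaling below refers to.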
-----
Key Insights from this Paper 💡:
→ TPO progressively improves alignment with human preferences over iterations.
→ An unaligned LLM can surpass its aligned counterpart after a few TPO steps.
→ TPO scales efficiently at inference time with both search width (candidates sampled per step) and search depth (number of refinement iterations).
→ LLMs can interpret and act upon reward signals in textual form.
-----
Results 💯:
→ With TPO, the unaligned Llama-3.1-70B-SFT surpasses the aligned Llama-3.1-70B-Instruct on almost all evaluated benchmarks.
→ Llama-3.1-70B-SFT with TPO achieves a 33.2% length-controlled (LC) win rate and a 39.5% raw win rate (WR) on AlpacaEval 2, and a 70.5% win rate on Arena-Hard.
→ Mistral-Small-Instruct-2409 (22B parameters) with TPO achieves a 53.4% LC win rate on AlpacaEval 2 and a 72.2% win rate on Arena-Hard.