
"Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback"

The accompanying podcast was generated with Google's Illuminate.

Test-time preference alignment for LLMs without retraining, using feedback as textual critiques.

LLMs struggle to adapt to evolving human preferences without retraining.

This paper introduces Test-Time Preference Optimization (TPO), which aligns LLM outputs during inference without updating model parameters. TPO converts reward signals into textual critiques that drive iterative refinement.

-----

Paper - https://arxiv.org/abs/2501.12895

Original Problem 😟:

→ LLMs cannot adapt quickly to evolving human preferences without costly retraining.

-----

Solution in this Paper 🤔:

→ Test-Time Preference Optimization (TPO) aligns LLM outputs with human preferences during inference, without retraining.

→ TPO translates reward signals into textual critiques.

→ These critiques are used as textual rewards to iteratively refine the LLM's responses.

→ At each step, responses are scored by a reward model. The highest- and lowest-scoring responses are analyzed to generate a "textual loss".

→ This textual loss guides the generation of "textual gradients", which are specific suggestions for refining the responses in the next iteration (see the sketch below).
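
The list above describes one refinement iteration. Below is a minimal Python sketch of that step, assuming `llm` and `reward_model` are generic callables (a text-generation wrapper and a scalar reward scorer); the prompt templates and helper names are illustrative stand-ins, not the paper's exact implementation.

```python
def tpo_step(llm, reward_model, prompt, responses):
    """One TPO iteration: score candidates, derive textual feedback, regenerate."""
    # 1. Score every candidate response with a numerical reward model.
    scores = [reward_model(prompt, r) for r in responses]

    # 2. Select the highest- and lowest-scoring candidates.
    chosen = responses[max(range(len(scores)), key=scores.__getitem__)]
    rejected = responses[min(range(len(scores)), key=scores.__getitem__)]

    # 3. "Textual loss": the LLM explains why the chosen response beats the rejected one.
    textual_loss = llm(
        f"Query: {prompt}\n\nChosen response:\n{chosen}\n\n"
        f"Rejected response:\n{rejected}\n\n"
        "Explain why the chosen response is better and what the rejected one lacks."
    )

    # 4. "Textual gradient": concrete suggestions for improving future responses.
    textual_gradient = llm(
        f"Analysis of strengths and weaknesses:\n{textual_loss}\n\n"
        "List concrete edits that would produce a better response to the query."
    )

    # 5. Regenerate the candidate pool conditioned on the suggestions.
    return [
        llm(
            f"Query: {prompt}\n\nSuggestions for improvement:\n{textual_gradient}\n\n"
            "Write an improved response."
        )
        for _ in range(len(responses))
    ]
```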

-----

Key Insights from this Paper 💡:

→ TPO progressively improves alignment with human preferences over iterations.

→ An unaligned LLM can surpass its aligned counterpart after a few TPO steps.

→ TPO scales efficiently with search width and depth during inference (see the search-loop sketch after this list).

→ LLMs can interpret and act upon reward signals in textual form.
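
As a rough illustration of the width/depth scaling mentioned above, the outer search loop might look as follows. This reuses the hypothetical `tpo_step` from the earlier sketch, with `width` as the number of sampled candidates per iteration and `depth` as the number of refinement iterations; it is a sketch, not the paper's implementation.

```python
def tpo_search(llm, reward_model, prompt, width=5, depth=3):
    """Run TPO with `width` candidates per iteration for `depth` iterations."""
    # Sample the initial pool of candidate responses (search width).
    responses = [llm(prompt) for _ in range(width)]
    # Iteratively refine the pool with textual feedback (search depth).
    for _ in range(depth):
        responses = tpo_step(llm, reward_model, prompt, responses)
    # Return the candidate the reward model scores highest.
    return max(responses, key=lambda r: reward_model(prompt, r))
```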

-----

Results 💯:

→ With TPO, the unaligned Llama-3.1-70B-SFT surpasses the aligned Llama-3.1-70B-Instruct on almost all evaluated benchmarks.

→ Llama-3.1-70B-SFT with TPO achieves a 33.2% length-controlled (LC) win rate and a 39.5% raw win rate (WR) on AlpacaEval 2, and a 70.5% WR on Arena-Hard.

→ Mistral-Small-Instruct-2409 (22B parameters) with TPO achieves a 53.4% LC win rate on AlpacaEval 2 and a 72.2% WR on Arena-Hard.
