
Accelerated Preference Optimization for Large Language Model Alignment

This podcast was generated with Google's Illuminate.

Can RLHF be accelerated by momentum? This paper answers this question in the affirmative.

Accelerated Preference Optimization (APO) accelerates LLM alignment using momentum, improving RLHF performance and convergence speed.

It applies Nesterov's momentum to iterative preference optimization.

📚 https://arxiv.org/abs/2410.06293

Original Problem 🔍:

RLHF has become crucial for aligning LLMs with human preferences. Direct Preference Optimization (DPO) simplifies this process but lacks acceleration techniques.
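For reference, the standard DPO objective trains the policy directly on preference pairs (a prompt x with a chosen response y_w and a rejected response y_l) against a fixed reference policy, with no explicit reward model; this formula is standard background rather than something stated in the post:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log\sigma\!\left(
      \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
      -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
    \right)\right]
```

Iterative DPO repeats this update with the latest policy taking over the role of the reference, and that iterative setting is what APO accelerates.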

-----

Solution in this Paper 🛠️:

• Applies Nesterov's momentum to iterative preference optimization

• Unifies existing preference optimization algorithms

• Incorporates an extrapolation step after each policy update (sketched in the code below)
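Below is a minimal sketch of the extrapolation step, assuming the Nesterov-style momentum is realized in parameter space between consecutive policy checkpoints; the names `apo_extrapolate`, `theta_t`, and `theta_half` are illustrative, and the toy `nn.Linear` stands in for the LLM, so this is a sketch of the idea rather than the paper's implementation.

```python
import copy
import torch
import torch.nn as nn

def apo_extrapolate(prev_policy: nn.Module, half_policy: nn.Module,
                    alpha: float = 0.3) -> nn.Module:
    """Nesterov-style extrapolation between consecutive iterates:
    theta_{t+1} = theta_{t+1/2} + alpha * (theta_{t+1/2} - theta_t),
    where theta_{t+1/2} is the policy produced by a DPO update whose
    reference (proximal anchor) is the previous iterate theta_t."""
    next_policy = copy.deepcopy(half_policy)
    with torch.no_grad():
        for p_next, p_half, p_prev in zip(next_policy.parameters(),
                                          half_policy.parameters(),
                                          prev_policy.parameters()):
            p_next.copy_(p_half + alpha * (p_half - p_prev))
    return next_policy

# Toy usage: two tiny "policies" standing in for the LLM before and after
# one DPO iteration; in practice these would be full model checkpoints.
theta_t = nn.Linear(4, 4)
theta_half = copy.deepcopy(theta_t)
with torch.no_grad():
    for p in theta_half.parameters():
        p.add_(0.01 * torch.randn_like(p))  # pretend this is one DPO update
theta_next = apo_extrapolate(theta_t, theta_half, alpha=0.3)
```

With alpha = 0 this reduces to plain iterative DPO; the momentum term only extrapolates along the direction of the most recent policy update.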

-----

Key Insights from this Paper 💡:

• Iterative preference optimization resembles the proximal point method (see the update rule sketched after this list)

• APO achieves a faster convergence rate than standard iterative methods

• Theoretical analysis shows an improved sub-optimality gap

• APO converges to the optimal policy faster under a minimal sub-optimality gap assumption
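As a hedged sketch of the first two insights: each round of iterative preference optimization maximizes reward while staying KL-close to the previous iterate (the proximal-point view), and APO then extrapolates past the new iterate with a momentum coefficient alpha. The extrapolation form below is an illustrative reading in policy space, not a quotation of the paper:

```latex
% Proximal-point view of one iteration (reward r, KL weight \beta):
\pi_{t+1/2}
  = \arg\max_{\pi}\;
    \mathbb{E}_{x\sim\mathcal{D}}\Big[
      \mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[r(x,y)\big]
      - \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_t(\cdot\mid x)\big)
    \Big]

% Momentum (Nesterov-style) extrapolation with coefficient \alpha:
\pi_{t+1}(y\mid x)
  \;\propto\;
  \pi_{t+1/2}(y\mid x)
  \left(\frac{\pi_{t+1/2}(y\mid x)}{\pi_{t}(y\mid x)}\right)^{\alpha}
```

Setting alpha = 0 recovers the un-accelerated iterative scheme.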

-----

Results 📊:

• APO with 3 iterations achieves 31.73% length-controlled win rate on AlpacaEval 2.0

• A 1.78% improvement over iterative DPO and 5.34% over Snorkel's Mistral-PairRM-DPO

• APO with 2 iterations matches 3-iteration iterative DPO performance

• MT-Bench average score: 9.57 out of 10 for general instruction-following tasks
