"The Differences Between Direct Alignment Algorithms are a Blur"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.01237
The paper addresses aligning LLMs with human preferences, which traditionally requires a complex Reinforcement Learning from Human Feedback (RLHF) pipeline, and explores Direct Alignment Algorithms (DAAs) as a simpler alternative to RLHF.
It proposes enhancing single-stage DAAs, such as ORPO and ASFT, by adding an explicit Supervised Fine-Tuning (SFT) phase and a parameter beta that controls the strength of preference optimization. This modification aims to improve their alignment quality and bridge the performance gap with two-stage methods.
-----
📌 An explicit Supervised Fine-Tuning phase before Direct Alignment Algorithms, especially for ORPO and ASFT, is critical. This two-stage approach significantly boosts performance, showing that staged alignment is still relevant.
📌 The beta parameter's introduction into single-stage Direct Alignment Algorithms is a key generalization. It unifies odds-ratio and reference-policy methods, highlighting beta's role in controlling preference optimization strength.
📌 Pairwise objectives in Direct Alignment Algorithms are empirically superior to pointwise ones, particularly for larger models. This suggests pairwise ranking provides more effective and less noisy gradient signals for alignment.
----------
Methods Explored in this Paper 🔧:
→ The paper investigates Direct Alignment Algorithms (DAAs), which optimize the policy directly on preference data, skipping the explicit reward-modeling and reinforcement-learning stages of traditional RLHF.
→ It focuses on two single-stage DAAs: Odds Ratio Preference Optimization (ORPO) and Aligned Supervised Fine-Tuning (ASFT).
→ The authors introduce a crucial modification by adding an explicit Supervised Fine-Tuning (SFT) phase before applying ORPO and ASFT alignment losses.
→ They also incorporate a beta parameter, originally absent from these single-stage methods, into the ORPO and ASFT losses to control the strength of preference optimization (see the sketch after this list).
→ The paper analyzes the theoretical relationships between different DAAs, categorizing them based on whether they use odds ratio or reference policy ratio as implicit reward, and whether they employ pairwise or pointwise preference optimization.
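To make these pieces concrete, here is a minimal PyTorch sketch of beta-scaled, odds-ratio-based preference terms in both the pairwise (ORPO-style) and pointwise (ASFT-style) form, plus an SFT anchor on the chosen response. The exact placement of beta, the length-averaging of log-probabilities, and the 0.5 weighting are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def log_odds(avg_logp):
    # avg_logp: length-averaged log P(y | x), so values lie in (-inf, 0).
    # odds(y | x) = P / (1 - P); the clamp keeps 1 - P away from zero.
    p = avg_logp.exp().clamp(max=1.0 - 1e-6)
    return avg_logp - torch.log1p(-p)

def orpo_pref_loss(logp_w, logp_l, beta=1.0):
    # Pairwise objective, odds-ratio implicit reward:
    # -log sigmoid(beta * (log-odds of chosen - log-odds of rejected)).
    return -F.logsigmoid(beta * (log_odds(logp_w) - log_odds(logp_l)))

def asft_pref_loss(logp_w, logp_l, beta=1.0):
    # Pointwise objective, odds-ratio implicit reward:
    # chosen and rejected responses are scored independently.
    return (-F.logsigmoid(beta * log_odds(logp_w))
            - F.logsigmoid(-beta * log_odds(logp_l)))

# Toy usage with a batch of length-averaged log-probs.
logp_w = torch.tensor([-0.9, -1.2])   # chosen responses
logp_l = torch.tensor([-1.5, -1.1])   # rejected responses
sft_anchor = -logp_w.mean()           # SFT term on the chosen response
for beta in (0.1, 1.0):
    total = sft_anchor + 0.5 * orpo_pref_loss(logp_w, logp_l, beta).mean()
    print(f"beta={beta}: ORPO-style loss={total.item():.3f}, "
          f"ASFT-style term={asft_pref_loss(logp_w, logp_l, beta).mean().item():.3f}")
```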
-----
Key Insights 💡:
→ Single-stage DAAs like ORPO and ASFT significantly improve in alignment quality when an explicit SFT phase is included prior to the alignment step.
→ Introducing the beta parameter to control the preference optimization strength in ORPO and ASFT further enhances their performance, making them comparable to two-stage methods like DPO.
→ The choice between pairwise and pointwise preference objectives is a more critical factor affecting alignment quality than the specific implicit reward function used (odds ratio vs reference policy ratio). Pairwise methods tend to perform better, especially in larger models.
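For contrast along the other axis (the implicit reward rather than the objective shape), here is a DPO-style pairwise loss built on the reference-policy ratio instead of the odds ratio. Again only a sketch: the reference log-probs are assumed to come from a frozen SFT model, and whether log-probabilities are summed or length-averaged is an implementation detail glossed over here.

```python
import torch
import torch.nn.functional as F

def dpo_pref_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Pairwise objective, reference-policy-ratio implicit reward:
    # -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin)

# Toy usage: policy vs. frozen-reference log-probs for one preference pair.
logp_w, logp_l = torch.tensor([-0.9]), torch.tensor([-1.5])
ref_logp_w, ref_logp_l = torch.tensor([-1.0]), torch.tensor([-1.3])
print(dpo_pref_loss(logp_w, logp_l, ref_logp_w, ref_logp_l).item())
```

Swapping the odds ratio for the reference-policy ratio changes the implicit reward; swapping pairwise for pointwise changes the objective shape, and the paper's finding is that the latter choice matters more.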
-----
Results 📊:
→ ORPO and ASFT achieve improved AlpacaEval 2 scores once explicit SFT and the beta parameter are introduced: ORPO improves by +3.46 and ASFT by +8.27.
→ In Llama 3.2 3B TL;DR experiments, most DAAs achieve over 90% GPT-4 win rate, demonstrating strong summarization performance.
→ Pairwise methods like ORPO outperform pointwise methods like ASFT, especially on the larger Llama 3.1 8B model in AlpacaEval 2 and ArenaHard benchmarks, indicating better alignment quality.