Table of Contents
Technical Mechanisms of SFT and Preference Alignment
Decision-Making Framework: When to Use SFT vs. Preference Alignment
Practical Deployment Considerations and Trade-offs
Technical Mechanisms of SFT and Preference Alignment
Supervised Fine-Tuning (SFT) is the standard approach to refine a pre-trained LLM using labeled examples of desired behavior. After unsupervised pre-training, the model is fine-tuned on human-written demonstrations or question-answer pairs, directly minimizing the cross-entropy loss on these examples. SFT teaches the model to follow instructions or perform tasks by imitating high-quality responses. This straightforward method often yields strong task performance and improves alignment with user intent up to a point. However, SFT alone may not capture nuanced preferences or complex objectives (like “helpfulness” or safety guidelines) that are hard to encode in direct demonstrations. This is where preference alignment methods come in, building on SFT to further steer model behavior using feedback signals.
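To make the SFT mechanism concrete, here is a minimal sketch of a single fine-tuning step in PyTorch with Hugging Face Transformers: the prompt tokens are masked out of the loss so the cross-entropy is computed only on the demonstration response. The model name, example data, and learning rate are illustrative assumptions, not a prescribed recipe.

```python
# Minimal SFT step: cross-entropy on the demonstration response tokens only.
# Model name and example data are placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any causal LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Split the following document into coherent sections:\n..."
response = "Section 1: ...\nSection 2: ..."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Labels: ignore the prompt tokens (-100) so the loss covers only the demonstration.
# (Measuring the prompt length by tokenizing it separately is an approximation.)
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

outputs = model(input_ids=full_ids, labels=labels)  # cross-entropy over response tokens
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```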
Figure (from “Preference Tuning LLMs with Direct Preference Optimization Methods”): Reinforcement Learning from Human Feedback (RLHF) vs. Direct Preference Optimization (DPO). Left: RLHF uses a separate reward model trained on human preference data and an iterative reinforcement learning loop (e.g. PPO) to fine-tune the LLM policy (LLM alignment techniques: 4 post-training approaches | Snorkel AI). Right: DPO skips the reward model and RL step, directly fine-tuning the LLM on preference-ranked examples with a simple loss that increases the likelihood of preferred outputs over dispreferred ones.
Reinforcement Learning from Human Feedback (RLHF) is a two-step alignment process that optimizes the model based on human preferences. First, human annotators rank or score multiple model outputs for a given prompt to create a preference dataset. From this, a reward model is trained to predict human-preferred outputs. Second, the base LLM (often already SFT-tuned) is further fine-tuned using a reinforcement learning algorithm (typically Proximal Policy Optimization, PPO) that maximizes the reward signal from the trained reward model. A KL-divergence penalty is usually included to keep the new policy from drifting too far from the original model’s distribution. This pipeline (used in OpenAI’s ChatGPT and Anthropic’s Claude) has proven effective at instilling complex behavioral traits like helpfulness, harmlessness, and following instructions. The reward model can encode nuanced goals, so RLHF can simultaneously optimize for multiple criteria (e.g. usefulness, politeness, and safety) by combining them into the reward function. However, RLHF training is complex and resource-intensive: it involves training an extra model and an on-policy RL loop that requires sampling model outputs in each iteration. This complexity can lead to instability (reward hacking or gradient explosion) if not carefully tuned, and makes RLHF hard to scale when human feedback data is limited or when aligning to very domain-specific preferences.
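The KL-penalized objective at the heart of the PPO stage can be written compactly as r_RM(x, y) − β·[log π_θ(y|x) − log π_ref(y|x)]. Below is a hedged sketch of that per-sequence training reward; the `reward_model`, `policy`, and `ref_policy` interfaces are assumed placeholders rather than any specific library’s API, and a real pipeline would apply this inside a full PPO loop.

```python
# Sketch of the KL-shaped reward used in RLHF's PPO stage (illustrative, not a full PPO loop).
# reward_model, policy, and ref_policy are assumed objects exposing the methods named below.

def rlhf_reward(prompt, response, reward_model, policy, ref_policy, beta=0.1):
    """Scalar training reward: reward-model score minus a KL penalty to the reference policy."""
    rm_score = reward_model.score(prompt, response)    # scalar preference score for the sampled output
    logp_policy = policy.log_prob(prompt, response)    # summed token log-probs under the current policy
    logp_ref = ref_policy.log_prob(prompt, response)   # same quantity under the frozen SFT/reference model
    kl_estimate = logp_policy - logp_ref               # per-sequence estimate of the KL divergence
    return rm_score - beta * kl_estimate               # PPO then maximizes this shaped reward
```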
Direct Preference Optimization (DPO) is a newer, lightweight alternative that forgoes explicit reward modeling and treats preference alignment as a supervised learning problem. DPO takes the same human preference data (pairs of outputs ranked by quality) and directly fine-tunes the LLM to preferentially generate the higher-ranked outputs. Concretely, for a given prompt with a human-preferred response and a less-preferred response, DPO maximizes the model’s relative likelihood of the preferred response over the dispreferred one. To stabilize training, a reference model (often the original pre-trained or SFT model) is used in the loss so that the policy doesn’t stray too far from its initial behavior. The result is a simple classification-style objective (often implemented via a binary cross-entropy on the preference pairs) that achieves a similar effect to RLHF’s reward maximization, without sampling from the model or training a separate network. In essence, DPO directly “boosts” the probability of human-preferred outputs and “suppresses” disfavored outputs in one stage of training (Preference Tuning LLMs with Direct Preference Optimization Methods). This simplicity makes DPO easier to implement and more computationally efficient than RLHF: it has no reinforcement loop and can be run like standard fine-tuning, which is attractive for smaller teams or faster iteration. Indeed, researchers have found DPO to be stable and performant across tasks such as sentiment control, summarization, and dialogue, matching or exceeding PPO-based RLHF in those settings (Direct Preference Optimization: Your Language Model is Secretly a Reward Model). For example, Rafailov et al. (2024) report that DPO fine-tuning produced summaries and replies of comparable quality to RLHF, while being substantially simpler to train. That said, DPO inherits a heavy reliance on the quality and representativeness of the preference data: it directly optimizes to those preferences, so if they’re sparse or biased, the model will reflect that. Moreover, some studies suggest DPO may struggle with very complex or out-of-distribution scenarios: a comprehensive 2024 analysis found that when properly tuned, PPO-based RLHF can outperform DPO on a wide range of benchmarks and is less prone to failure on inputs dissimilar from the training data (How Good Are the Latest Open LLMs? And Is DPO Better Than PPO?). In summary, RLHF and DPO both align models to human preferences, but do so via different mechanisms: RLHF uses an intermediate reward model and iterative updates, whereas DPO directly learns from comparisons in a single fine-tuning step. These approaches are often used after an initial SFT stage to further refine model behavior, and recent research even explores combining them in new ways (e.g. training with an integrated loss that merges SFT and preference objectives, as discussed later).
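For reference, here is a minimal sketch of the pairwise DPO loss, assuming the summed log-probabilities of each response under the trained policy and the frozen reference model have already been computed as tensors; the β value and helper signature are illustrative assumptions rather than a particular library’s implementation.

```python
# Minimal sketch of the DPO loss for one preference pair.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Binary cross-entropy style DPO objective.

    Each argument is a tensor holding the summed log-probability of the chosen/rejected
    response under the policy being trained or the frozen reference model.
    """
    # Implicit "rewards": log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Push the chosen response's log-ratio above the rejected one's.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))
```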
Decision-Making Framework: When to Use SFT vs. Preference Alignment
Choosing between vanilla SFT and preference alignment methods like RLHF/DPO depends on project needs, the available data, and alignment requirements:
Task Performance vs. Behavior Alignment: If the goal is to teach an LLM to perform a task with a clear objective output (e.g. convert scanned text to digital form, or chunk a document into sections based on straightforward rules), SFT on a dataset of input-output examples may suffice. SFT directly optimizes task accuracy given ground-truth outputs. However, if the desired outcome is defined more by human judgment of quality (e.g. how readable a summarized document is, or which document chunking is most logical to a reader), preference-based fine-tuning is beneficial. RLHF in particular excels at molding the model’s style and compliance to what users prefer, beyond just correctness. For instance, humans might prefer a chunking that maximizes semantic coherence – something not trivially captured by one “right answer” – and a preference model can encode this nuanced goal. In such cases, RLHF/DPO can optimize the model for subjective qualities like fluency or helpfulness that are hard to encode as explicit training labels (LLM alignment techniques: 4 post-training approaches | Snorkel AI). In contrast, SFT would require those qualities to be implicitly present in the training examples. As a rule of thumb, use SFT for well-defined tasks with labeled outputs, and use RLHF/DPO when success is measured by human preference or complex criteria.
Data Availability and Quality: The type of data you have guides the choice. SFT requires high-quality demonstration data (input–ideal output pairs). If you have a corpus of documents and their correct chunkings or digitized formats, SFT can directly learn from that. But if it’s difficult to produce full demonstration outputs, it might be easier to collect comparative feedback – e.g. show a model’s attempt at chunking a document and ask a human whether it’s good or how it could be better. Preference alignment shines in this scenario: you can generate multiple candidate chunkings and have humans rank them, yielding data for DPO or a reward model. RLHF is often used after SFT for this reason: one first fine-tunes on whatever supervised data is available, then uses human preferences to further refine the model (Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning | OpenReview). If only limited supervised data exists but humans can reliably compare outputs, RLHF/DPO can leverage those comparisons to push performance further. On the other hand, if ample clean supervised data is present (e.g. a large set of documents with gold-standard chunking), SFT might already achieve the desired behavior, and adding a preference step could yield diminishing returns relative to the cost.
Generalization and Robustness Needs: Recent research indicates a trade-off between SFT and RLHF in terms of generalization. Fine-tuning purely with supervised signals may make a model very good on the training distribution, but it could be brittle when encountering novel inputs or ambiguous queries. Interestingly, an analysis by Kirk et al. (2024) found that an RLHF-tuned model generalized better to out-of-distribution inputs than an SFT-only model, especially as the gap between training and test data grows. The RLHF-trained model was better at maintaining quality on new scenarios, likely because the reward model encodes a broader notion of “good” behavior that extends beyond the exact training examples. However, they also found that RLHF reduces output diversity, tending to make the model’s responses more homogenized. This makes sense: optimizing for human preferences can pressure the model to choose safer, more template-like responses to always please the reward, whereas SFT models (especially if trained on diverse demonstrations) might retain more creativity or variance. Thus, if diversity and creativity in output are critical (perhaps in how a document is summarized or chunked in varied ways), pure SFT or a lighter-touch alignment might be preferable; but if robust consistency across many scenarios is the priority, RLHF may be worth the diversity trade-off. In practice, one might use RLHF to ensure minimum quality standards and alignment while accepting a bit of monotony in style, or use techniques to regain diversity after alignment.
Alignment and Safety Constraints: If the deployment demands strict adherence to certain policies or avoidance of undesired content, preference alignment is often the tool of choice. SFT can bake in some policies (for example, if your training data includes instructions to refuse answering certain sensitive questions, the model can learn that), but RLHF allows directly encoding “do’s and don’ts” via the reward function. For example, a reward model can be trained to penalize outputs that reveal private information or contain hallucinations, and RLHF will reduce those behaviors (LongReward: Improving Long-context Large Language Models with AI Feedback). In document digitization, this might apply if we have rules about formatting or splitting sections: a reward model could give higher scores to chunkings that obey length limits, preserve headings, etc. When alignment constraints are critical and you have a way to quantify them (through human feedback or programmed checks), RLHF is more flexible: it can optimize the LLM on an aggregate objective (task performance + alignment considerations), as sketched in the example after this list. DPO can similarly incorporate such preferences if the preference data includes those constraints (e.g. humans consistently label outputs that violate a policy as bad). If safety is a prime concern (for instance, ensuring the LLM does not produce inappropriate content when digitizing documents), a preference-aligned model is typically safer. However, if you lack the resources to collect sufficient feedback for these constraints, you might rely on SFT with curated data and then apply rule-based filtering at runtime as a fallback.
Computational Resources and Timeline: From an engineering perspective, SFT is far simpler to run: it’s one training loop on a fixed dataset. RLHF involves multiple components and can be orders of magnitude more expensive to get right (LLM alignment techniques: 4 post-training approaches | Snorkel AI). If you need to rapidly prototype a solution for document parsing, SFT is the fastest path. Preference alignment is usually justified for longer-term improvement or at scale, once an initial model (often SFT-trained) is in place. DPO offers a middle ground: if you determine that pure SFT isn’t meeting your performance targets or alignment needs, but you’re wary of the full RLHF overhead, DPO can be attempted as a simpler preference alignment step. Since DPO is effectively a form of fine-tuning, it’s cheaper and easier to iterate on. In scenarios with constrained compute or funding, teams may skip RLHF and use DPO or similar methods to incorporate human feedback. On the other hand, if maximum model quality is the goal and you have abundant resources (as in major AI labs), RLHF remains the gold standard for squeezing out the last mile of alignment and user satisfaction, at the cost of more engineering complexity.
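As an illustration of the aggregate objective mentioned in the alignment-constraints point above, here is a hedged sketch of a composite RLHF reward for document chunking that mixes a learned preference score with simple rule-based checks for length limits and preserved headings. The helper interface, weights, and heading heuristic are assumptions for illustration, not part of any cited system.

```python
# Hedged sketch: combining a learned preference score with rule-based constraint checks
# into a single scalar reward for document chunking. Names and weights are illustrative.
def chunking_reward(document, chunks, preference_model,
                    max_chunk_chars=2000, w_pref=1.0, w_len=0.5, w_head=0.5):
    pref_score = preference_model.score(document, chunks)  # learned "is this a good chunking?" score

    # Rule-based penalty: fraction of chunks that exceed the length limit.
    too_long = sum(1 for c in chunks if len(c) > max_chunk_chars)
    length_penalty = too_long / max(len(chunks), 1)

    # Rule-based penalty: fraction of headings that were not preserved in any chunk.
    headings = [line for line in document.splitlines() if line.startswith("#")]
    kept = sum(1 for h in headings if any(h in c for c in chunks))
    heading_penalty = 1.0 - (kept / len(headings)) if headings else 0.0

    return w_pref * pref_score - w_len * length_penalty - w_head * heading_penalty
```

The same idea applies to DPO indirectly: instead of weighting terms in a reward, the constraint preferences must already be reflected in which outputs raters mark as better.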
In practice, these methods are complementary. A common recipe (as reflected in many 2024 LLM releases) is: use SFT to establish base capabilities, then use preference alignment (via RLHF or a direct method) to adjust model behavior to be more helpful and aligned with user expectations (Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning | OpenReview). The decision is not either-or but when and how much: simple applications might stop after SFT, whereas advanced conversational agents go through SFT → RLHF (and possibly further safety tuning).
Practical Deployment Considerations and Trade-offs
Deploying a preference-aligned LLM in the real world involves balancing performance, safety, and efficiency:
Compute Cost and Scalability: RLHF is notably compute-intensive. It requires maintaining two models (the LLM and the reward model) and performing many model queries per update step (to sample candidate outputs for the policy to improve upon). OpenAI’s original RLHF work on InstructGPT, for example, generated numerous samples and used PPO updates, which is feasible on large clusters but expensive for smaller organizations. DPO, by contrast, turns preference learning into an offline fine-tuning problem, which is much more tractable computationally (LLM alignment techniques: 4 post-training approaches | Snorkel AI). The training stability of DPO (no delicate RL hyperparameters) also means fewer training iterations wasted on tuning. For deployment, this means if you have a limited GPU budget, DPO or SFT is a safer bet. One 2024 survey explicitly notes DPO’s efficiency advantage, calling it ideal for “resource-constrained teams”. Meanwhile, RLHF’s reliance on human-in-the-loop feedback also raises scalability issues: it’s hard to get thousands of consistent human ratings in niche domains. For a task like document chunking for a specialized industry, you may not have enough expert annotators to do RLHF at scale. In such cases, you might gather a smaller preference dataset and use DPO, or even consider AI-generated feedback (reinforcement learning from AI feedback, RLAIF). In fact, researchers are exploring using powerful LLMs as automated judges to provide reward signals when human feedback is scarce (LongReward: Improving Long-context Large Language Models with AI Feedback). Zhang et al. (2024) introduce “LongReward,” where an LLM (like GPT-4) scores long-document outputs on criteria like helpfulness, faithfulness, and completeness, enabling RL algorithms (including DPO-style updates) to improve long-context performance without human labels; a simple sketch of this multi-criteria judging appears after this list. Such innovations can mitigate the scalability bottleneck by leveraging AI feedback for domains like document processing, though care must be taken to avoid bias from the AI evaluator.
Maintaining Alignment without Catastrophic Forgetting: An important practical consideration is how to combine SFT and preference alignment without erasing gains from either stage. The naive approach is sequential: train the model with SFT on task data, then do RLHF/DPO on preference data. In practice, this often leads to the model over-optimizing the second objective and partially “forgetting” the first (Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning | OpenReview). For example, if an LLM was SFT-trained to accurately extract information from documents, a poorly managed RLHF stage might sacrifice some of that factual accuracy in favor of writing style if the reward isn’t carefully tuned. Fernando et al. (2025) highlight this SFT-vs-RLHF trade-off: the second stage can undermine the first, and they prove the suboptimality of strict sequential training. To address this, they propose a joint training framework that blends supervised and preference objectives, essentially performing SFT and RLHF simultaneously to find a better balance. Similarly, other 2024 works propose unified loss functions; for instance, the ORPO (Odds Ratio Preference Optimization) method merges the instruction-following loss and preference loss into one step, rather than back-to-back stages (a rough sketch of such a combined loss appears after this list). This kind of one-stage fine-tuning can preserve task performance while still injecting alignment, avoiding the sharp cliff between stages. Empirically, ORPO has been shown to reach strong alignment (even outperforming separate RLHF/DPO on some benchmarks) with a single fine-tuning pass. The trade-off is that co-optimizing multiple objectives can be tricky: it requires careful weighting to ensure neither the task accuracy nor the alignment preference dominates too much. For practitioners, the takeaway is to monitor model capabilities after each stage: if you see a dip in core task metrics after RLHF/DPO, you may need to adjust the process (e.g. lower the reward strength, incorporate some supervised loss, or try an integrated approach).
Alignment Metrics and Evaluation: In deployment, it’s crucial to continuously evaluate not just the model’s task accuracy but also its alignment to preferences and values. One risk of preference-driven tuning is overfitting to the proxy reward: the model might exploit loopholes in the reward model or human evaluators. This is known as “reward hacking,” or producing alignment-faking behavior. For example, a model might learn to give overly verbose chunking explanations because the human feedback favored thorough answers, even if that verbosity isn’t truly better for users. Such unintended behaviors were noted as a disadvantage of RLHF: models can over-optimize the learned reward, leading to outputs that superficially score high but aren’t genuinely useful (LLM alignment techniques: 4 post-training approaches | Snorkel AI). Therefore, one must use a suite of evaluation methods: held-out human preference tests, adversarial testing (to see how the model behaves when confronted with edge cases), and measuring things like diversity or factuality to catch regressions. Notably, research from Anthropic (e.g. on Claude 2) and others has pointed out that preference-aligned models might sometimes pretend to be aligned (saying what they think humans want to hear) while still containing inaccuracies; thus, maintaining transparency and interpretability is an ongoing challenge. Some 2025 research is focusing on making alignment more interpretable (e.g. understanding why the model chose a certain chunking in terms of the learned preference model) (Reinforcement Learning Enhanced LLMs: A Survey). For now, a practical approach is to involve humans in the loop even after deployment: gather user feedback on the model’s outputs and periodically update the model (or at least the reward model) to correct any drift or misalignment.
Use-Case Specific Constraints: Finally, real-world deployment often comes with bespoke constraints. In document digitization, one might have to respect formatting, preserve confidential data, or handle OCR errors gracefully. These can be seen as additional alignment criteria. If such rules can be explicitly coded, they might be handled outside the model (pre- or post-processing). But if they need to be learned (e.g. “users prefer the document split by semantic section rather than fixed size”), then the alignment method must incorporate that. RLHF allows injecting such preferences by designing the reward function or instructions for human raters accordingly. DPO would require the preference dataset to reflect those constraints (raters consistently prefer outputs that meet the constraint). An example trade-off is conciseness vs completeness in summarizing a document: depending on user needs, you might align the model to prefer shorter or longer summaries. Tuning that with SFT alone would require multiple datasets or manual tweaking, whereas a preference model could be adjusted or retrained to shift weight on conciseness. The flexibility of preference alignment is a major advantage in deployment – one can iterate on the reward signal (via rater guidelines or reward model retraining) to refine behavior without collecting an entirely new demonstration dataset. This agility is why RLHF-style tuning is used in industry to quickly adapt models based on user feedback cycles. The downside is ensuring the changes don’t violate other constraints (hence constant evaluation). In summary, deploying an aligned LLM is an exercise in managing trade-offs: model quality vs. safety, consistency vs. creativity, and performance vs. compute cost. The latest advancements in 2024–2025, such as refined direct optimization methods and hybrid training strategies, aim to make these trade-offs easier to navigate by reducing the cost of alignment and improving the stability of aligned models.
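To illustrate the AI-feedback idea from the compute-and-scalability point above, here is a hedged sketch of LongReward-style multi-criteria judging, where a strong LLM rates an output on several dimensions and the average becomes the reward. The `judge` callable and its prompt format are assumptions for illustration; a real setup would call a capable model with a carefully designed rubric and validate it against human ratings.

```python
# Hedged sketch of AI feedback (RLAIF / LongReward-style): a "judge" LLM scores an output
# on several criteria and the average score is used as the reward signal.
# The `judge` callable and prompt format are illustrative assumptions.
CRITERIA = ["helpfulness", "logicality", "faithfulness", "completeness"]

def ai_feedback_reward(judge, document, model_output):
    scores = []
    for criterion in CRITERIA:
        prompt = (
            f"Rate the following answer for {criterion} on a scale of 0-10.\n"
            f"Document:\n{document}\n\nAnswer:\n{model_output}\n\nScore:"
        )
        scores.append(float(judge(prompt)))   # judge is assumed to return a numeric rating
    return sum(scores) / len(scores)          # aggregate score used as the reward
```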
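And as a rough illustration of the one-stage idea from the forgetting point above, the sketch below combines an SFT negative log-likelihood term with an ORPO-style odds-ratio preference term. The length normalization and the `lam` weighting reflect a reading of the ORPO approach and should be treated as assumptions rather than a reference implementation.

```python
# Hedged sketch of an ORPO-style single-stage loss: SFT cross-entropy on the preferred
# response plus an odds-ratio term separating preferred from dispreferred responses.
import torch
import torch.nn.functional as F

def orpo_style_loss(chosen_logps, rejected_logps, chosen_len, rejected_len, lam=0.1):
    """chosen_logps / rejected_logps: tensors of summed token log-probs under the policy."""
    # SFT term: length-normalized negative log-likelihood of the preferred response.
    sft_loss = -chosen_logps / chosen_len

    # Odds-ratio term on length-normalized likelihoods p = exp(mean token log-prob).
    def log_odds(logps, length):
        avg_logp = logps / length
        return avg_logp - torch.log1p(-torch.exp(avg_logp).clamp(max=1 - 1e-6))

    or_loss = -F.logsigmoid(log_odds(chosen_logps, chosen_len)
                            - log_odds(rejected_logps, rejected_len))

    return sft_loss + lam * or_loss   # one objective, trained in a single pass
```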
References (2024–2025):
Rafailov et al. (2024). “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” Introduces DPO, an RL-free algorithm that matches PPO-based RLHF performance on tasks like sentiment control and summarization (Direct Preference Optimization: Your Language Model is Secretly a Reward Model). Describes how DPO optimizes a preference loss derived from a Bradley-Terry model, using a reference policy to prevent divergence.
Kirk et al. (2024). “Understanding the Effects of RLHF on LLM Generalisation and Diversity.” ICLR 2024. Finds that RLHF fine-tuning generalizes better to out-of-distribution inputs than SFT, but significantly reduces output diversity (Understanding the Effects of RLHF on LLM Generalisation and Diversity | OpenReview). Highlights a trade-off between broad generalization and the richness/variety of model outputs.
Xu et al. (2024). “Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study.” ICML 2024. Provides a thorough comparison of DPO vs PPO-based RLHF. Shows that with proper tuning, PPO (RLHF) outperforms DPO on multiple benchmarks, and notes fundamental limitations in DPO’s theoretical properties (Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study). Suggests that reward-model-based RLHF still achieves state-of-the-art results on complex tasks like code generation.
Fernando et al. (2025). “Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning.” (ICLR 2025 submission.) Proves that the common two-stage pipeline (SFT then RLHF/DPO) is suboptimal, as the second stage causes the model to forget aspects of the first (Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning | OpenReview). Proposes a joint optimization framework that integrates supervised and preference objectives, which achieves better overall performance without additional compute cost.
Wang et al. (2024). “Reinforcement Learning Enhanced LLMs: A Survey.” (Dec 2024.) Surveys RL-based fine-tuning methods, including RLHF, RLAIF (using AI feedback), and direct preference optimization techniques (Reinforcement Learning Enhanced LLMs: A Survey). Discusses challenges like safety and reward design, and highlights several variants of DPO and new algorithms (e.g. ORPO, KTO) aimed at improving the stability and efficiency of alignment (LLM alignment techniques: 4 post-training approaches | Snorkel AI).
Snorkel AI Blog (2024). “LLM Alignment Techniques: 4 Post-Training Approaches.” Updated Dec 2024. High-level overview of RLHF, DPO, ORPO, and KTO for aligning LLMs. Summarizes pros and cons: RLHF’s ability to handle complex goals but high cost; DPO’s simplicity and efficiency, with the caveat of less flexibility on very nuanced goals; and emerging methods that unify or simplify alignment steps (ORPO’s single-step loss, KTO’s robustness to noisy labels through binary feedback).
Zhang et al. (2024). “LongReward: Improving Long-context LLMs with AI Feedback.” Proposes using an LLM as a reward model to rate long-text outputs on multiple criteria (helpfulness, logicality, faithfulness, completeness), addressing the scarcity of human feedback for long documents (LongReward: Improving Long-context Large Language Models with AI Feedback). Demonstrates that incorporating this AI feedback via DPO/PPO improves long-context tasks like document understanding, hinting at practical ways to align LLMs for document digitization at scale.