SFT vs DPO

    Supervised fine-tuning versus direct preference optimisation: what each is good for, what data each needs, and when to graduate from one to the other.

    DPO is not yet supported in Ertas. Ertas's Fine-Tune module trains with SFT today; DPO is under active development. The rest of this page (what DPO is, when to pick it, what its data shape looks like, where preference pairs come from) is here so you can plan ahead. When DPO ships, this page will switch from forward-looking to hands-on.

    There are two main paradigms for shaping a base model after pretraining: SFT (supervised fine-tuning) and DPO (direct preference optimisation). SFT is what Ertas's Fine-Tune module trains with today; it covers the broadest set of use cases with the cleanest data requirements. DPO is the step you graduate to when SFT has done its job and you want to push further.

    This page covers what each does, when to pick which, and what data shape each needs.

    SFT in one paragraph

    SFT teaches the model "given this input, produce this output." Your dataset is rows of (input, desired-output) pairs. The training loss measures how far the model's predictions are from the desired output, token by token. This is the technique that standard instruction-tuning datasets are built for, and it is what Ertas's Fine-Tune module does today.
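    To make the "token-prediction terms" concrete, here is a minimal sketch of the loss for one example. It assumes you already have the probability the model assigned to each desired token; the function name and the numbers are illustrative, and a real trainer computes this from logits in batches.

```python
import math

def sft_loss(token_probs):
    """Mean negative log-likelihood of the desired output tokens.

    token_probs holds the probability the model gave to each token of the
    desired output (hypothetical values here, not real model output).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A model that already predicts the desired tokens gets a low loss...
confident = sft_loss([0.9, 0.8, 0.95])
# ...and one that puts little probability on them gets a high loss.
unsure = sft_loss([0.2, 0.1, 0.3])
```

    Training lowers this number, which is the same as making the desired output more likely under the model.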

    Strengths:

    • Conceptually simple. Easy to inspect the data and predict what the model will learn.
    • Cheap. Standard cross-entropy loss, no extra forward passes.
    • Excellent for teaching a clear, specific behaviour.

    Weaknesses:

    • Cannot easily teach the model "prefer this kind of output over that kind." It only knows what to imitate, not what to avoid.
    • Quality is capped by your dataset. The model learns to imitate the typical example, so it will not exceed the best outputs you provide and usually lands closer to the average.

    For most fine-tuning work, SFT is enough.

    DPO in one paragraph

    DPO teaches the model "given this input, prefer this output over that one." Your dataset is rows of (input, accepted-output, rejected-output) triples; most tooling calls the accepted output "chosen". The training objective raises the probability the model assigns to the accepted output relative to the rejected one, measured against a frozen reference copy of the starting model. The result is a model that is more discriminating about which response to give.
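    The objective can be sketched for a single triple as follows. This follows the standard DPO formulation, in which the model being trained is compared against a frozen reference copy of itself; the inputs are summed log-probabilities of each full output. The function name, argument names, and the beta value of 0.1 are assumptions for illustration, not Ertas settings.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the log-probability of a whole output: the first two
    under the model being trained, the last two under the frozen reference.
    """
    # How much more (or less) likely each output became versus the reference.
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Model unchanged from the reference: no preference learned yet.
untrained = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Chosen output became more likely, rejected less likely: lower loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

    The loss falls only when the gap between the two outputs widens in the right direction, which is why each pair needs a genuinely better and a genuinely worse side.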

    Strengths:

    • Captures qualitative preferences (style, helpfulness, safety) that are hard to express as a single "right" answer.
    • Can push past the quality ceiling of SFT by teaching the model to prefer the best outputs over the good-enough ones.
    • Smaller datasets often work because each preference pair carries more information than a single SFT row.

    Weaknesses:

    • Requires preference pairs, which are more expensive to author than single outputs.
    • More sensitive to data quality; bad preference labels can teach the wrong thing.
    • Higher compute per step: the model does a forward pass on both the accepted and rejected sides, and standard DPO also scores both sides with a frozen reference model.
    • Less effective when the model has not yet learned the basic task. DPO polishes; it does not teach from scratch.

    When to pick SFT

    SFT is the right choice when:

    • You are starting from scratch on a new behaviour.
    • You can write or collect single-output examples easily.
    • The task has clear right answers (translation, summarisation, JSON extraction, classification).
    • You do not have time or data to collect preference pairs.
    • The model needs to learn a domain or style, not just refine its judgement.

    If you are unsure, start with SFT. You can always layer DPO on top later.

    When to pick DPO

    DPO becomes attractive when:

    • Your SFT model is already producing the right format and the right kind of content, but the quality is uneven.
    • You have access to pairwise comparisons (human judgements, ranked outputs, automated scoring).
    • The task is more about taste than correctness (writing style, helpfulness, tone calibration).
    • You want to suppress specific failure modes the SFT model still produces.

    DPO is the second pass, not the first.

    Data shape for DPO

    A DPO dataset row has three fields:

    {"prompt": "[the user's question or task]", "chosen": "[the preferred output]", "rejected": "[the less preferred output]"}

    Constraints:

    • Every row must have all three fields.
    • chosen and rejected should be meaningfully different. Two near-identical outputs do not teach anything.
    • The chosen should genuinely be better. A noisy label here pollutes the signal.

    When Ertas's DPO training ships, the exact field naming may differ from this example. The conceptual shape (prompt + chosen + rejected) is stable across DPO tooling regardless.
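    The constraints above can be checked mechanically before any training run. The sketch below assumes one JSON object per line (JSONL) and the field names from the example row; both are assumptions, since Ertas's final DPO format is not yet published. It only catches identical chosen/rejected text; "near-identical" and "genuinely better" still need human or judge review.

```python
import json

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")  # assumed field names

def check_row(row):
    """Return a list of problems found in one preference row (empty if fine)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = row.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    if not problems and row["chosen"].strip() == row["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems

def check_jsonl(text):
    """Map 1-based line number to problems, for each bad row in JSONL text."""
    report = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            report[number] = ["not valid JSON"]
            continue
        if not isinstance(row, dict):
            report[number] = ["row is not a JSON object"]
            continue
        problems = check_row(row)
        if problems:
            report[number] = problems
    return report

sample = "\n".join([
    '{"prompt": "Summarise X", "chosen": "Tight summary.", "rejected": "Rambling text."}',
    '{"prompt": "Summarise Y", "chosen": "Same.", "rejected": "Same."}',
    '{"prompt": "Summarise Z", "chosen": "Only one side."}',
])
report = check_jsonl(sample)
```

    Here `report` flags the second and third rows and leaves the first one out, since it satisfies all three constraints.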

    Where preference pairs come from

    Authoring preference pairs is more work than authoring single outputs. Common sources:

    • Human ranking: ask annotators to pick the better of two model outputs. Slow but high quality.
    • LLM-as-judge: a strong frontier model rates two candidate outputs. Cheap, useful, but inherits the judge's biases.
    • Real-world signals: production data with implicit preference (clicked vs not clicked, accepted vs rewritten by user).
    • Synthetic pairs: generate a "good" output and a deliberately "bad" one as the rejected side. See Dataset synthesis.
    • Mixed: combine a few hundred human-judged pairs as a high-quality core with a larger LLM-judged set around it.

    A practical starting point is 500 to 2,000 preference pairs. DPO is more sample-efficient than SFT; you do not need tens of thousands of rows.
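    When the source is a score per output (an LLM judge or an automated metric), one simple way to turn scores into pairs is best-versus-worst with a minimum gap. The sketch below is one possible approach, not a prescribed method; the function name, the score scale, and the `min_gap` threshold are all hypothetical.

```python
def pairs_from_scores(prompt, scored_outputs, min_gap=1.0):
    """Build at most one preference pair from scored candidate outputs.

    scored_outputs is a list of (output_text, score) tuples, higher is better.
    Returns an empty list when the best and worst are too close to call.
    """
    if len(scored_outputs) < 2:
        return []
    ranked = sorted(scored_outputs, key=lambda item: item[1], reverse=True)
    best_text, best_score = ranked[0]
    worst_text, worst_score = ranked[-1]
    # Skip prompts where the judge barely separates the candidates:
    # a noisy label here would pollute the signal.
    if best_score - worst_score < min_gap:
        return []
    return [{"prompt": prompt, "chosen": best_text, "rejected": worst_text}]

candidates = [
    ("Short, correct answer.", 8.5),
    ("Rambling answer.", 4.0),
    ("Acceptable answer.", 7.0),
]
pairs = pairs_from_scores("Explain what a preference pair is.", candidates)
```

    Dropping low-gap prompts trades dataset size for label quality, which matters more for DPO than raw row count.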

    The two-stage pattern

    The pattern that consistently produces the best results:

    Step 1: SFT for the basics

    Train the model to do the task at all using a clean SFT dataset. The model learns format, style, and core behaviour.

    Step 2: Smoke test

    Make sure the SFT model is producing outputs in the right shape. If it cannot get the format right, DPO will not save it.

    Step 3: DPO for the polish

    Use the SFT model to generate candidate outputs on a held-out prompt set. Rank the candidates (human or LLM). Train DPO on the resulting pairs.

    Step 4: Evaluate

    Compare the DPO output to the SFT baseline. Pick the one that wins your smoke test.

    DPO is often worth a 5 to 15% quality improvement on top of a strong SFT model, though the gain depends on the task and on how you measure quality. If that gap matters for your product, the extra data work is worth it. If you cannot tell the difference, ship the SFT model.
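    One way to run the final comparison is a head-to-head win rate: for each held-out prompt, a human or LLM judge says which model's output is better. The sketch below assumes judgements are recorded as the strings "dpo", "sft", or "tie"; those labels and the function name are illustrative.

```python
def win_rate(judgements):
    """Share of decided comparisons won by the DPO model.

    judgements has one entry per held-out prompt: "dpo", "sft", or "tie".
    Ties are excluded; with no decided comparisons the result is 0.5.
    """
    wins = judgements.count("dpo")
    losses = judgements.count("sft")
    decided = wins + losses
    return wins / decided if decided else 0.5

judgements = ["dpo", "dpo", "sft", "tie", "dpo", "sft"]
rate = win_rate(judgements)  # 3 wins out of 5 decided comparisons
```

    A rate close to 0.5 on a reasonably sized prompt set is the "cannot tell the difference" case, where shipping the SFT model is the simpler choice.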

    Other paradigms (briefly)

    A few terms you will run into in the broader fine-tuning literature:

    • RLHF (reinforcement learning from human feedback): the predecessor of DPO. More moving parts (reward model + PPO), generally harder to tune. DPO has replaced it in many workflows.
    • KTO (Kahneman-Tversky Optimisation): a DPO variant that uses unpaired (single-output) preference signals (thumbs up / thumbs down) instead of paired ones. Useful when you only have one rating per row.
    • ORPO (Odds Ratio Preference Optimisation): combines the SFT and preference objectives in a single training run, so there is no separate SFT stage.

    These are all variants on the same underlying idea: use preference signal to refine the model. SFT is the simpler ancestor.
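    To make the paired versus unpaired distinction concrete, the sketch below splits one DPO-style row into two KTO-style rows, each carrying a single output and a thumbs-up or thumbs-down label. The `completion` and `label` field names are assumptions for illustration; check the format your tooling expects.

```python
def unpair(row):
    """Split one (prompt, chosen, rejected) row into two single-output rows."""
    return [
        {"prompt": row["prompt"], "completion": row["chosen"], "label": True},
        {"prompt": row["prompt"], "completion": row["rejected"], "label": False},
    ]

paired = {
    "prompt": "Write a greeting.",
    "chosen": "Hello, welcome aboard.",
    "rejected": "hey",
}
unpaired = unpair(paired)
```

    Production thumbs-up and thumbs-down logs already have the unpaired shape, which is why they suit KTO without any pairing step.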
