DPO Preference Pairs Dataset Template

    Template for building preference datasets with chosen and rejected response pairs for Direct Preference Optimization training.


    Overview

    Direct Preference Optimization (DPO) preference datasets contain pairs of responses to the same prompt, where one response is labeled as "chosen" (preferred) and the other as "rejected" (dispreferred). DPO is an alternative to reinforcement learning from human feedback (RLHF) that optimizes the language model directly on preference comparisons, without training a separate reward model. The quality and consistency of preference annotations directly determine the effectiveness of DPO training.
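
    For reference, the training objective that consumes these pairs is a single classification-style loss. The formulation below is the standard one from the DPO paper (Rafailov et al., 2023), where π_θ is the model being trained, π_ref is the frozen SFT reference model, σ is the sigmoid function, and β controls how far the policy may drift from the reference.

    latex
    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
      -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
      \left[ \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right) \right]
    The DPO objective: each (prompt, chosen, rejected) triple (x, y_w, y_l) pushes the policy to assign relatively more probability to the chosen response than the rejected one, measured against the reference model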

    DPO preference datasets serve a fundamentally different purpose than supervised fine-tuning (SFT) datasets. While SFT teaches the model what to say, DPO teaches the model what distinguishes a good response from a bad one. This is particularly powerful for encoding nuanced quality criteria that are difficult to capture in explicit instructions — things like appropriate level of detail, natural conversational tone, balanced perspective, safety considerations, and avoiding unnecessary hedging. The preference signal teaches these implicit quality standards through comparative examples.

    Building high-quality preference data requires carefully controlled contrast between chosen and rejected responses. The most effective datasets have rejected responses that are plausible but subtly worse than chosen responses: they might be factually correct but poorly structured, thorough but overly verbose, or technically accurate but lacking empathy. Preference pairs with extreme quality differences (a perfect response vs. gibberish) provide weak training signal because the distinction is obvious and does not teach fine-grained preference boundaries.

    Dataset Schema

    typescript
    interface DPOExample {
      prompt: string;                    // The user query or instruction
      chosen: string;                    // The preferred response
      rejected: string;                  // The dispreferred response
      metadata?: {
        preference_reason: string;       // Why chosen is preferred
        category: string;
        difficulty: "easy" | "medium" | "hard";
        quality_delta: "large" | "medium" | "small";  // How different the pair is
      };
    }
    
    // Multi-turn DPO format
    interface MultiTurnDPOExample {
      system?: string;
      prompt: {
        role: "user" | "assistant";
        content: string;
      }[];                               // Conversation history up to the divergence point
      chosen: string;                    // Preferred final assistant response
      rejected: string;                  // Dispreferred final assistant response
    }
    DPO preference pair schemas for single-turn and multi-turn conversation training

    Sample Data

    json
    [
      {
        "prompt": "Explain how HTTPS works to someone who isn't technical.",
        "chosen": "Think of HTTPS like sending a letter in a locked box instead of on a postcard. When you visit a website with HTTPS, your browser and the website first agree on a secret code that only they know. After that, every message between you and the website is scrambled using that code. Even if someone intercepts the message, they can't read it because they don't have the code. That little padlock icon in your browser means this protection is active.",
        "rejected": "HTTPS uses TLS/SSL protocol to encrypt data in transit between the client and server. It initiates a handshake process where the server presents its X.509 certificate, the client verifies it against trusted certificate authorities, and then they negotiate a symmetric session key using asymmetric cryptography. All subsequent data is encrypted with AES-256 or similar cipher suites, protecting against man-in-the-middle attacks and packet sniffing.",
        "metadata": {
          "preference_reason": "The chosen response uses an accessible analogy appropriate for the non-technical audience. The rejected response is technically accurate but uses jargon that would confuse the target audience.",
          "category": "audience-appropriateness",
          "difficulty": "medium",
          "quality_delta": "medium"
        }
      },
      {
        "prompt": "What are some healthy breakfast options?",
        "chosen": "Here are some nutritious breakfast ideas that are quick to prepare:\n\n- **Overnight oats**: Mix rolled oats with yogurt and milk, add berries and chia seeds, refrigerate overnight. Ready to eat in the morning.\n- **Eggs and avocado toast**: Whole grain toast with mashed avocado and a poached or scrambled egg. Add everything seasoning for extra flavor.\n- **Smoothie bowl**: Blend frozen banana, spinach, protein powder, and almond milk until thick. Top with granola, sliced fruit, and nuts.\n- **Greek yogurt parfait**: Layer plain Greek yogurt with mixed berries and a drizzle of honey.\n\nAll of these take under 10 minutes and provide a good balance of protein, healthy fats, and complex carbohydrates to keep you energized through the morning.",
        "rejected": "There are many healthy breakfast options you can consider. Oatmeal is healthy. Eggs are healthy. Yogurt is healthy. Fruit is healthy. Smoothies are healthy. Toast is healthy. You could also have cereal if it's whole grain. Avocado is healthy too. Nuts are another option. The key is to eat something nutritious in the morning.",
        "metadata": {
          "preference_reason": "The chosen response provides specific, actionable suggestions with preparation details and nutritional context. The rejected response is vague, repetitive, and provides no practical guidance.",
          "category": "helpfulness-detail",
          "difficulty": "easy",
          "quality_delta": "large"
        }
      },
      {
        "prompt": "Should I invest my savings in cryptocurrency?",
        "chosen": "I can share some considerations to help you think through this decision, but I'd recommend consulting a financial advisor for personalized advice.\n\nThings to consider:\n- **Risk tolerance**: Cryptocurrency is highly volatile. Bitcoin has seen drawdowns exceeding 70% from peak to trough. Only invest money you can afford to lose entirely.\n- **Diversification**: Financial experts generally recommend that speculative investments make up no more than 5-10% of your portfolio.\n- **Your financial foundation**: Before speculative investments, ensure you have an emergency fund (3-6 months of expenses), no high-interest debt, and adequate retirement contributions.\n- **Tax implications**: Crypto gains are taxable events in most jurisdictions. Factor tax obligations into your return calculations.\n\nThe decision depends on your specific financial situation, goals, and risk tolerance.",
        "rejected": "Absolutely! Crypto is the future of finance and you should definitely invest. Bitcoin and Ethereum have given incredible returns over the past few years. Many experts predict Bitcoin will reach $500,000 in the next few years. The earlier you get in, the more money you'll make. I'd suggest putting a significant portion of your savings into a mix of Bitcoin, Ethereum, and some promising altcoins. Dollar-cost averaging is a good strategy — just buy regularly and hold long term.",
        "metadata": {
          "preference_reason": "The chosen response is balanced, presents risks alongside considerations, recommends professional advice, and avoids making predictions. The rejected response gives irresponsible financial advice, makes unfounded predictions, and encourages overexposure to a volatile asset.",
          "category": "safety-balanced-advice",
          "difficulty": "medium",
          "quality_delta": "large"
        }
      }
    ]
    DPO preference pairs demonstrating audience-appropriateness, helpfulness, and safety preference criteria

    Data Collection Guide

    There are three primary methods for creating DPO preference data: human comparison of model-generated responses, human writing of contrastive pairs, and synthetic generation with human validation. The most efficient approach for most organizations is to generate multiple responses from your SFT-tuned model for each prompt, then have human annotators select the best (chosen) and worst (rejected) from each set.

    Generate 3-5 responses per prompt using different sampling parameters (temperature, top-p) to produce varied outputs. Present these responses to annotators without revealing which parameters generated each response. Have annotators rank all responses from best to worst, then use the top-ranked as chosen and bottom-ranked as rejected. This produces preference pairs where the rejected response is still plausible (it was generated by a capable model) but clearly worse than the chosen response.
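
    As a minimal sketch of that pairing step (assuming the generation and annotation tooling already exist), the helper below takes a prompt, its ranked candidate responses, and a preference rationale, and emits a DPOExample using the schema above. The RankedCandidate shape and the default metadata values are illustrative assumptions, not part of the schema.

    typescript
    // Turn an annotator's ranking over sampled candidates into one preference pair.
    // Reuses the DPOExample interface from the schema above; RankedCandidate is illustrative.
    interface RankedCandidate {
      response: string;  // one of the 3-5 responses sampled for this prompt
      rank: number;      // 1 = best, larger = worse, as assigned by the annotator
    }

    function buildPreferencePair(
      prompt: string,
      candidates: RankedCandidate[],
      preferenceReason: string,
      category = "uncategorized",
    ): DPOExample {
      if (candidates.length < 2) {
        throw new Error("need at least two ranked candidates to form a pair");
      }
      // Top-ranked response becomes "chosen", bottom-ranked becomes "rejected".
      const sorted = [...candidates].sort((a, b) => a.rank - b.rank);
      return {
        prompt,
        chosen: sorted[0].response,
        rejected: sorted[sorted.length - 1].response,
        metadata: {
          preference_reason: preferenceReason,
          category,
          difficulty: "medium",     // placeholder; set per prompt during review
          quality_delta: "medium",  // placeholder; set per pair during review
        },
      };
    }
    Building a single preference pair from an annotator ranking of sampled responses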

    Define your preference criteria explicitly. What makes one response better than another? Common criteria include: factual accuracy, helpfulness and completeness, appropriate tone and style, safety and harmlessness, conciseness without sacrificing clarity, and honesty about uncertainty. Provide annotators with a ranked priority list of these criteria so they know which qualities to prioritize when responses differ on multiple dimensions. Document the reasoning for each preference decision in the metadata field.
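
    One way to make those criteria concrete is a ranked list stored alongside the dataset, so annotators and reviewers share a single definition. The names, descriptions, and ordering below are illustrative defaults rather than a prescribed standard.

    typescript
    // Illustrative ranked preference criteria shared with annotators (priority 1 = highest).
    // Adjust names, descriptions, and ordering to match your own annotation guidelines.
    const PREFERENCE_CRITERIA = [
      { priority: 1, name: "safety", description: "Avoids harmful, risky, or irresponsible advice" },
      { priority: 2, name: "factual_accuracy", description: "Claims are correct and verifiable" },
      { priority: 3, name: "helpfulness", description: "Answers the actual question with enough detail to act on" },
      { priority: 4, name: "audience_fit", description: "Tone and terminology match the stated audience" },
      { priority: 5, name: "conciseness", description: "No padding or repetition beyond what clarity requires" },
      { priority: 6, name: "calibrated_honesty", description: "Acknowledges uncertainty rather than hiding it" },
    ] as const;
    A ranked criteria list annotators can consult when responses differ on multiple dimensions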

    Quality Criteria

    Inter-annotator agreement on preference ordering should exceed 75 percent for the dataset to be effective. If annotators frequently disagree on which response is better, the preference signal is too noisy for DPO to learn from effectively. Low agreement typically indicates either unclear preference criteria, responses that are too similar in quality (insufficient contrast), or genuinely subjective preference dimensions where reasonable people disagree.
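
    A simple way to check that threshold is pairwise percent agreement over the prompts both annotators labeled; the sketch below assumes each annotator recorded which of two responses they preferred per prompt ID. Raw percent agreement ignores chance agreement, so a chance-corrected statistic such as Cohen's kappa on the same data is a stricter follow-up check.

    typescript
    // Percent agreement between two annotators on which response they preferred.
    // The Map-of-choices input shape is an illustrative assumption.
    type PreferenceChoice = "A" | "B";

    function percentAgreement(
      annotator1: Map<string, PreferenceChoice>,  // prompt ID -> preferred response
      annotator2: Map<string, PreferenceChoice>,
    ): number {
      let shared = 0;
      let agreed = 0;
      for (const [promptId, choice] of annotator1) {
        const other = annotator2.get(promptId);
        if (other === undefined) continue;  // only count prompts both annotators labeled
        shared++;
        if (other === choice) agreed++;
      }
      return shared === 0 ? 0 : (agreed / shared) * 100;
    }
    Pairwise percent agreement; values below roughly 75 suggest the preference signal is too noisy for DPO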

    Ensure diversity in preference reasons. If every rejected response is rejected for the same reason (e.g., all are too verbose), the model will learn a narrow preference signal and may over-correct. Include preference pairs that differentiate on multiple dimensions: accuracy vs. inaccuracy, appropriate detail vs. excessive detail, balanced perspective vs. biased perspective, safe advice vs. risky advice, and natural tone vs. stilted tone.

    Validate that the quality delta between chosen and rejected is meaningful but not extreme. Pairs where both responses are nearly identical teach nothing. Pairs where the rejected response is obviously terrible (grammatically broken, completely off-topic) teach obvious distinctions that the model likely already handles. The most valuable pairs are those where the rejected response is "almost good" but has a specific, identifiable flaw that the chosen response avoids.
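
    Both of these distribution checks can be run directly over the metadata fields. The sketch below tallies categories and quality deltas across a dataset so that skew, such as one dominant rejection reason or mostly extreme pairs, is visible before training; examples without metadata are simply skipped.

    typescript
    // Tally preference categories and quality deltas across a dataset to spot skew.
    // Uses the DPOExample interface from the schema above.
    function summarizePreferenceDistribution(examples: DPOExample[]): {
      byCategory: Record<string, number>;
      byQualityDelta: Record<string, number>;
    } {
      const byCategory: Record<string, number> = {};
      const byQualityDelta: Record<string, number> = {};
      for (const example of examples) {
        if (!example.metadata) continue;  // unlabeled examples are not counted
        const { category, quality_delta } = example.metadata;
        byCategory[category] = (byCategory[category] ?? 0) + 1;
        byQualityDelta[quality_delta] = (byQualityDelta[quality_delta] ?? 0) + 1;
      }
      return { byCategory, byQualityDelta };
    }
    Counting preference categories and quality deltas to check for a narrow or extreme preference signal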

    Using This Template with Ertas

    Use Ertas Data Suite to manage the preference annotation pipeline. Import your prompts and model-generated response sets, apply PII redaction to both prompts and responses (particularly important if prompts are derived from real user queries), and organize the data for annotation workflows. After annotation, import the preference-labeled data for format validation and export.

    Export DPO datasets in JSONL format with prompt, chosen, and rejected fields. Ertas Studio supports DPO training workflows alongside standard SFT, enabling you to first fine-tune a base model with SFT data and then align it with DPO preference data. The GGUF-exported model reflects both the task knowledge from SFT and the quality preferences from DPO.
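
    The JSONL layout itself is straightforward to produce with generic tooling. The sketch below is plain Node.js, not an Ertas-specific API; it writes one prompt/chosen/rejected object per line and drops the metadata fields, which are useful for review but not consumed by training.

    typescript
    import { writeFileSync } from "node:fs";

    // Serialize preference pairs as JSONL: one {prompt, chosen, rejected} object per line.
    // Generic Node.js sketch; adapt the destination path and any filtering to your pipeline.
    function exportDpoJsonl(examples: DPOExample[], path: string): void {
      const lines = examples.map(({ prompt, chosen, rejected }) =>
        JSON.stringify({ prompt, chosen, rejected }),
      );
      writeFileSync(path, lines.join("\n") + "\n", "utf8");
    }
    Exporting preference pairs as JSONL with prompt, chosen, and rejected fields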

    Recommended Model

    DPO training typically starts with an SFT-tuned model rather than a base model. First fine-tune on your task-specific SFT dataset, then apply DPO with your preference data. For most applications, 2,000-10,000 high-quality preference pairs are sufficient for meaningful alignment improvement. More data helps, but quality matters far more than quantity for DPO.

    A 7B-8B parameter model responds well to DPO training with moderate dataset sizes. Larger models (13B+) may require more preference data to show clear improvement. After DPO training, export to GGUF at Q4_K_M or Q5_K_M for local inference. The alignment improvements from DPO are well-preserved through quantization.
