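DPO Preference Pair Dataset Template
A template for building preference datasets of chosen and rejected response pairs, for use in Direct Preference Optimization (DPO) training.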
Overview
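A Direct Preference Optimization (DPO) preference dataset contains pairs of responses to the same prompt, with one response labeled "chosen" (preferred) and the other labeled "rejected" (dispreferred). DPO is an alternative to reinforcement learning from human feedback (RLHF) that optimizes the language model directly against human preferences, without training a separate reward model. The quality and consistency of the preference labels directly determine how well DPO training works.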
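To make "optimizes the model directly, without a reward model" concrete, here is a minimal sketch of the per-pair loss as DPO is commonly formulated: the negative log-sigmoid of a scaled difference between the chosen and rejected log-probability ratios. The type name, field names, function names, and the default beta of 0.1 are assumptions made for illustration; they are not part of this template or of any particular training library.

```typescript
// Summed token log-probabilities of one response under two models.
// Field names are illustrative, not part of the template's schema.
type SequenceLogProbs = {
  policy: number;    // log-probability under the model being trained
  reference: number; // log-probability under the frozen reference (usually the SFT model)
};

// Numerically stable log(sigmoid(x)).
function logSigmoid(x: number): number {
  return x >= 0 ? -Math.log(1 + Math.exp(-x)) : x - Math.log(1 + Math.exp(x));
}

// Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).
// beta = 0.1 is an assumed default, not a recommendation from this template.
function dpoPairLoss(
  chosen: SequenceLogProbs,
  rejected: SequenceLogProbs,
  beta: number = 0.1
): number {
  const chosenLogRatio = chosen.policy - chosen.reference;
  const rejectedLogRatio = rejected.policy - rejected.reference;
  return -logSigmoid(beta * (chosenLogRatio - rejectedLogRatio));
}
```

The loss falls as the trained model raises the chosen response's probability relative to the reference and lowers the rejected one's, which is why each record needs both a chosen and a rejected response for the same prompt.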
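The purpose of a DPO preference dataset differs fundamentally from that of a supervised fine-tuning (SFT) dataset. SFT teaches the model what to say; DPO teaches it to tell good responses from bad ones. This is especially effective for encoding subtle quality standards that are hard to capture in explicit instructions, such as an appropriate level of detail, a natural conversational tone, balanced viewpoints, safety considerations, and avoiding unnecessary hedging. The preference signal teaches these implicit standards through comparative examples.
Building high-quality preference data requires careful control of the contrast between chosen and rejected responses. In the most effective datasets, the rejected response is plausible but subtly worse than the chosen one: it may be factually correct but poorly structured, appropriately detailed but too verbose, or technically accurate but lacking empathy. Pairs with an extreme quality gap (a perfect response versus gibberish) provide a weak training signal, because the difference is obvious and teaches the model nothing about fine-grained preference boundaries.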
Dataset Schema
interface DPOExample {
  prompt: string; // The user query or instruction
  chosen: string; // The preferred response
  rejected: string; // The dispreferred response
  metadata?: {
    preference_reason: string; // Why chosen is preferred
    category: string;
    difficulty: "easy" | "medium" | "hard";
    quality_delta: "large" | "medium" | "small"; // How different the pair is
  };
}
// Multi-turn DPO format
interface MultiTurnDPOExample {
  system?: string;
  prompt: {
    role: "user" | "assistant";
    content: string;
  }[]; // Conversation history up to the divergence point
  chosen: string; // Preferred final assistant response
  rejected: string; // Dispreferred final assistant response
}
Sample Data
[
{
"prompt": "Explain how HTTPS works to someone who isn't technical.",
"chosen": "Think of HTTPS like sending a letter in a locked box instead of on a postcard. When you visit a website with HTTPS, your browser and the website first agree on a secret code that only they know. After that, every message between you and the website is scrambled using that code. Even if someone intercepts the message, they can't read it because they don't have the code. That little padlock icon in your browser means this protection is active.",
"rejected": "HTTPS uses TLS/SSL protocol to encrypt data in transit between the client and server. It initiates a handshake process where the server presents its X.509 certificate, the client verifies it against trusted certificate authorities, and then they negotiate a symmetric session key using asymmetric cryptography. All subsequent data is encrypted with AES-256 or similar cipher suites, protecting against man-in-the-middle attacks and packet sniffing.",
"metadata": {
"preference_reason": "The chosen response uses an accessible analogy appropriate for the non-technical audience. The rejected response is technically accurate but uses jargon that would confuse the target audience.",
"category": "audience-appropriateness",
"difficulty": "medium",
"quality_delta": "medium"
}
},
{
"prompt": "What are some healthy breakfast options?",
"chosen": "Here are some nutritious breakfast ideas that are quick to prepare:\n\n- **Overnight oats**: Mix rolled oats with yogurt and milk, add berries and chia seeds, refrigerate overnight. Ready to eat in the morning.\n- **Eggs and avocado toast**: Whole grain toast with mashed avocado and a poached or scrambled egg. Add everything seasoning for extra flavor.\n- **Smoothie bowl**: Blend frozen banana, spinach, protein powder, and almond milk until thick. Top with granola, sliced fruit, and nuts.\n- **Greek yogurt parfait**: Layer plain Greek yogurt with mixed berries and a drizzle of honey.\n\nAll of these take under 10 minutes and provide a good balance of protein, healthy fats, and complex carbohydrates to keep you energized through the morning.",
"rejected": "There are many healthy breakfast options you can consider. Oatmeal is healthy. Eggs are healthy. Yogurt is healthy. Fruit is healthy. Smoothies are healthy. Toast is healthy. You could also have cereal if it's whole grain. Avocado is healthy too. Nuts are another option. The key is to eat something nutritious in the morning.",
"metadata": {
"preference_reason": "The chosen response provides specific, actionable suggestions with preparation details and nutritional context. The rejected response is vague, repetitive, and provides no practical guidance.",
"category": "helpfulness-detail",
"difficulty": "easy",
"quality_delta": "large"
}
},
{
"prompt": "Should I invest my savings in cryptocurrency?",
"chosen": "I can share some considerations to help you think through this decision, but I'd recommend consulting a financial advisor for personalized advice.\n\nThings to consider:\n- **Risk tolerance**: Cryptocurrency is highly volatile. Bitcoin has seen drawdowns exceeding 70% from peak to trough. Only invest money you can afford to lose entirely.\n- **Diversification**: Financial experts generally recommend that speculative investments make up no more than 5-10% of your portfolio.\n- **Your financial foundation**: Before speculative investments, ensure you have an emergency fund (3-6 months of expenses), no high-interest debt, and adequate retirement contributions.\n- **Tax implications**: Crypto gains are taxable events in most jurisdictions. Factor tax obligations into your return calculations.\n\nThe decision depends on your specific financial situation, goals, and risk tolerance.",
"rejected": "Absolutely! Crypto is the future of finance and you should definitely invest. Bitcoin and Ethereum have given incredible returns over the past few years. Many experts predict Bitcoin will reach $500,000 in the next few years. The earlier you get in, the more money you'll make. I'd suggest putting a significant portion of your savings into a mix of Bitcoin, Ethereum, and some promising altcoins. Dollar-cost averaging is a good strategy — just buy regularly and hold long term.",
"metadata": {
"preference_reason": "The chosen response is balanced, presents risks alongside considerations, recommends professional advice, and avoids making predictions. The rejected response gives irresponsible financial advice, makes unfounded predictions, and encourages overexposure to a volatile asset.",
"category": "safety-balanced-advice",
"difficulty": "medium",
"quality_delta": "large"
}
}
]
Data Collection Guide
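There are three main ways to create DPO preference data: human comparison of model-generated responses, human-written contrastive pairs, and synthetic generation followed by human verification. For most organizations, the most efficient approach is to generate several responses per prompt from an SFT fine-tuned model, then have human annotators pick the best (chosen) and worst (rejected) response from each set.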
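For each prompt, generate 3-5 responses with different sampling parameters (temperature, top-p) to produce diverse outputs. Show these responses to annotators without revealing which parameters produced which response. Have annotators rank all responses from best to worst, then take the top-ranked response as chosen and the bottom-ranked response as rejected. This yields preference pairs in which the rejected response is still plausible (it was generated by a capable model) but clearly worse than the chosen one.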
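The ranking step above can be sketched as a small function that turns one annotator's ranking into a single preference pair. The RankedCandidate type, its fields, and the pairFromRanking name are hypothetical and chosen only for this illustration.

```typescript
// One sampled response plus the rank an annotator gave it (1 = best).
// Type and field names are illustrative assumptions.
type RankedCandidate = {
  text: string;
  rank: number;
  temperature: number; // sampling parameter, kept hidden from annotators
};

type PreferencePair = { prompt: string; chosen: string; rejected: string };

// Takes the top-ranked response as "chosen" and the bottom-ranked as "rejected".
function pairFromRanking(prompt: string, candidates: RankedCandidate[]): PreferencePair {
  if (candidates.length < 2) {
    throw new Error("Need at least two ranked candidates to form a pair");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = candidates.slice().sort((a, b) => a.rank - b.rank);
  return {
    prompt: prompt,
    chosen: sorted[0].text,
    rejected: sorted[sorted.length - 1].text,
  };
}
```

Sampling parameters stay on the candidate record for later analysis but are not copied into the pair, which mirrors the point about not showing them to annotators.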
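Define your preference criteria explicitly. What makes one response better than another? Common criteria include factual accuracy, helpfulness and completeness, appropriate tone and style, safety and harmlessness, conciseness without sacrificing clarity, and honesty about uncertainty. Give annotators these criteria as a prioritized list so they know which qualities take precedence when responses differ on several dimensions. Record the rationale for each preference decision in the metadata field.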
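One way to read "prioritized list" is as a lexicographic tie-break: compare two responses on the highest-priority criterion first and move down the list only when they are equal. The sketch below assumes per-criterion numeric scores and an example priority order; both the order and the names are illustrative, and each project should set its own.

```typescript
// Criteria ordered from highest to lowest priority. This order is an example only.
const CRITERIA_PRIORITY: string[] = [
  "safety",
  "accuracy",
  "helpfulness",
  "honesty_about_uncertainty",
  "tone",
  "conciseness",
];

// Per-criterion scores for one response (higher is better, for example 1 to 5).
type CriteriaScores = { [criterion: string]: number };

// Walks the criteria in priority order and decides on the first one
// where the two responses differ. Missing scores count as 0.
function preferByPriority(a: CriteriaScores, b: CriteriaScores): "a" | "b" | "tie" {
  for (const criterion of CRITERIA_PRIORITY) {
    const scoreA = a[criterion] ?? 0;
    const scoreB = b[criterion] ?? 0;
    if (scoreA !== scoreB) {
      return scoreA > scoreB ? "a" : "b";
    }
  }
  return "tie";
}
```

Under this reading, a safer response wins even when the other is more concise, because safety sits higher in the assumed order.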
Quality Criteria
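Inter-annotator agreement on preference rankings should exceed 75% for the dataset to be effective. If annotators frequently disagree about which response is better, the preference signal is too noisy for DPO to learn from effectively. Low agreement usually indicates unclear preference criteria, responses that are too close in quality (insufficient contrast), or subjective preference dimensions on which reasonable people disagree.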
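The text does not say how agreement is measured, so the following sketch assumes the simplest reading: raw pairwise agreement, meaning the fraction of annotator pairs that picked the same response, pooled over all items. The Vote type and function name are hypothetical, and this measure does not correct for chance agreement.

```typescript
// "a" means the first response in a pair was preferred, "b" the second.
type Vote = "a" | "b";

// Each inner array holds every annotator's vote for one response pair.
// Returns the fraction of annotator pairs that agree, pooled over all items.
function pairwiseAgreement(votesPerItem: Vote[][]): number {
  let agreeing = 0;
  let total = 0;
  for (const votes of votesPerItem) {
    for (let i = 0; i < votes.length; i++) {
      for (let j = i + 1; j < votes.length; j++) {
        total += 1;
        if (votes[i] === votes[j]) agreeing += 1;
      }
    }
  }
  if (total === 0) throw new Error("No item has two or more votes");
  return agreeing / total;
}

// True when agreement clears the 75% bar stated in the text.
function meetsAgreementBar(votesPerItem: Vote[][], bar: number = 0.75): boolean {
  return pairwiseAgreement(votesPerItem) > bar;
}
```

Items with a single vote contribute nothing to the total, so a batch needs at least two annotators per item to be measurable this way.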
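Ensure diversity in preference reasons. If every rejected response is rejected for the same reason (for example, all are too verbose), the model learns a narrow preference signal and may overcorrect. Include pairs that differ along multiple dimensions: accurate versus inaccurate, appropriately detailed versus overly detailed, balanced versus biased, safe versus risky advice, natural versus stilted tone.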
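A quick way to spot a narrow preference signal is to tally the metadata category field from the schema and flag any category that dominates. The sketch below is illustrative: the function names are hypothetical and the 0.5 cutoff is an arbitrary assumption, not a threshold given by this template.

```typescript
// Only the part of a DPO record this check needs.
type ReasonTagged = { metadata?: { category: string } };

// Share of examples per metadata.category; untagged examples count as "uncategorized".
function categoryShares(examples: ReasonTagged[]): { [category: string]: number } {
  // Null-prototype objects avoid clashes with built-in property names.
  const counts: { [category: string]: number } = Object.create(null);
  for (const ex of examples) {
    const key = ex.metadata ? ex.metadata.category : "uncategorized";
    counts[key] = (counts[key] || 0) + 1;
  }
  const shares: { [category: string]: number } = Object.create(null);
  for (const key of Object.keys(counts)) {
    shares[key] = counts[key] / examples.length;
  }
  return shares;
}

// Categories whose share exceeds maxShare (0.5 is an assumed, illustrative cutoff).
function dominantCategories(examples: ReasonTagged[], maxShare: number = 0.5): string[] {
  const shares = categoryShares(examples);
  return Object.keys(shares).filter((key) => shares[key] > maxShare);
}
```

An empty result means no single rejection reason accounts for more than the cutoff share of the dataset.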
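Verify that the quality gap between chosen and rejected responses is meaningful but not extreme. Pairs in which the two responses are nearly identical have no teaching value. Pairs in which the rejected response is obviously bad (grammatical errors, completely off-topic) teach only obvious distinctions that the model can probably already make. The most valuable pairs are those in which the rejected response is "almost acceptable" but has one specific, identifiable flaw that the chosen response avoids.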
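The "nearly identical" case can be screened automatically before human review. The sketch below uses word-level Jaccard similarity as a rough proxy; the 0.9 threshold, the function names, and the ASCII-oriented word split are all assumptions for illustration. It cannot judge the "obviously bad" end of the range, which still needs annotator judgment or the quality_delta field.

```typescript
// Lowercased word tokens of a text, stored as keys of a null-prototype object.
// The \W+ split is ASCII-oriented and is only meant for illustration.
function wordSet(text: string): { [word: string]: true } {
  const words: { [word: string]: true } = Object.create(null);
  for (const w of text.toLowerCase().split(/\W+/)) {
    if (w.length > 0) words[w] = true;
  }
  return words;
}

// Jaccard similarity of the two word sets: shared words divided by all distinct words.
function jaccard(a: string, b: string): number {
  const setA = wordSet(a);
  const setB = wordSet(b);
  const keysA = Object.keys(setA);
  const keysB = Object.keys(setB);
  let shared = 0;
  for (const w of keysA) {
    if (setB[w]) shared += 1;
  }
  const union = keysA.length + keysB.length - shared;
  return union === 0 ? 1 : shared / union;
}

// Flags pairs whose texts overlap so heavily that the pair carries little contrast.
function isNearIdentical(chosen: string, rejected: string, threshold: number = 0.9): boolean {
  return jaccard(chosen, rejected) >= threshold;
}
```

Word overlap ignores word order and meaning, so a flagged pair is a candidate for review rather than an automatic rejection.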
Using This Template with Ertas
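Use Ertas Data Suite to manage the preference annotation pipeline. Import your prompts and model-generated response sets, apply PII redaction to both prompts and responses (especially important if the prompts come from real user queries), and organize the data for the annotation workflow. Once annotation is complete, import the preference-labeled data for format validation and export.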
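To illustrate what redacting "both prompts and responses" means at the record level, here is a standalone sketch. It is not Ertas Data Suite's implementation or API: the patterns, placeholder tokens, and function names are assumptions, and the two regular expressions cover only simple email and phone formats.

```typescript
type PairText = { prompt: string; chosen: string; rejected: string };

// Deliberately simple patterns for illustration; real PII detection needs far broader coverage.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_PATTERN = /\+?\d[\d\s().-]{7,}\d/g;

// Replaces matches with placeholder tokens. Emails go first so digits
// inside an address are not picked up by the phone pattern.
function redactText(text: string): string {
  return text.replace(EMAIL_PATTERN, "[EMAIL]").replace(PHONE_PATTERN, "[PHONE]");
}

// Applies the same redaction to all three text fields of a pair.
function redactPair(pair: PairText): PairText {
  return {
    prompt: redactText(pair.prompt),
    chosen: redactText(pair.chosen),
    rejected: redactText(pair.rejected),
  };
}
```

Redacting the responses matters as much as the prompt, because a model response often echoes details the user supplied.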
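Export the DPO dataset in JSONL format with prompt, chosen, and rejected fields. Ertas Studio supports DPO training workflows alongside standard SFT, so you can first fine-tune a base model on SFT data and then align it with DPO preference data. The model exported to GGUF reflects both the task knowledge from SFT and the quality preferences from DPO.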
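For readers preparing files outside the tool, the JSONL layout described above can be sketched as follows: one JSON object per line, carrying only the prompt, chosen, and rejected fields. The validation rules and function names are assumptions for this example and do not describe Ertas's own format checks.

```typescript
type DPORecord = { prompt: string; chosen: string; rejected: string };

// Returns a list of problems; an empty list means the record passes these basic checks.
function validateRecord(record: DPORecord): string[] {
  const problems: string[] = [];
  const fields: (keyof DPORecord)[] = ["prompt", "chosen", "rejected"];
  for (const field of fields) {
    const value = record[field];
    if (typeof value !== "string" || value.trim().length === 0) {
      problems.push(field + " must be a non-empty string");
    }
  }
  if (problems.length === 0 && record.chosen === record.rejected) {
    problems.push("chosen and rejected are identical");
  }
  return problems;
}

// Serializes the valid records as JSONL and silently skips invalid ones.
function toJsonl(records: DPORecord[]): string {
  return records
    .filter((r) => validateRecord(r).length === 0)
    .map((r) => JSON.stringify({ prompt: r.prompt, chosen: r.chosen, rejected: r.rejected }))
    .join("\n");
}
```

JSON.stringify escapes newlines inside the response text, so multi-paragraph responses like those in the sample data still occupy exactly one line each.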
Recommended Model
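DPO training usually starts from an SFT fine-tuned model rather than a base model. Fine-tune on your task-specific SFT dataset first, then run DPO with the preference data. For most applications, 2,000-10,000 high-quality preference pairs are enough for a significant alignment improvement. More data helps, but for DPO quality matters far more than quantity.
Models with 7B-8B parameters respond well to DPO training on medium-sized datasets. Larger models (13B and above) may need more preference data to show clear improvement. After DPO training, export to GGUF at Q4_K_M or Q5_K_M for local inference. The alignment gains from DPO are well preserved through quantization.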