DPO Preference Dataset Template
A preference dataset template for aligning LLM response quality through Direct Preference Optimization.
Overview
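A Direct Preference Optimization (DPO) preference dataset contains paired responses to the same prompt, with one response labeled "chosen" (preferred) and the other "rejected" (dispreferred). DPO is an alternative to reinforcement learning from human feedback (RLHF) that optimizes the language model directly to match human preferences, without training a separate reward model. The quality and consistency of the preference annotations directly determine how well DPO training works.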
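For reference, the standard DPO objective can be written as below. The notation is not part of this template and is introduced here only for illustration: x is the prompt, y_w the chosen response, y_l the rejected response, \pi_\theta the model being trained, \pi_ref a frozen reference model (typically the SFT checkpoint), \sigma the logistic function, and \beta a hyperparameter that controls how far the model may drift from the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Each (prompt, chosen, rejected) record in the dataset supplies one (x, y_w, y_l) triple, which is why no separately trained reward model appears in the expression.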
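A DPO preference dataset serves a fundamentally different purpose from a supervised fine-tuning (SFT) dataset. SFT teaches the model what to say, whereas DPO teaches the model what distinguishes a good response from a bad one. This is especially effective for encoding nuanced quality standards that are hard to capture through explicit instructions, such as an appropriate level of detail, a natural conversational tone, balanced perspectives, safety considerations, and avoiding unnecessary hedging. The preference signal teaches these implicit quality standards through comparative examples.
Building high-quality preference data requires carefully controlled contrast between the chosen and rejected responses. In the most effective datasets, rejected responses are plausible but subtly worse than the chosen ones: they may be factually correct but poorly structured, appropriately detailed but too verbose, or technically accurate but lacking empathy. Pairs with an extreme quality gap (a perfect response versus gibberish) provide a weaker training signal, because the distinction is obvious and teaches nothing about fine-grained preference boundaries.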
Dataset Schema
interface DPOExample {
  prompt: string;   // The user query or instruction
  chosen: string;   // The preferred response
  rejected: string; // The dispreferred response
  metadata?: {
    preference_reason: string; // Why chosen is preferred
    category: string;
    difficulty: "easy" | "medium" | "hard";
    quality_delta: "large" | "medium" | "small"; // How different the pair is
  };
}

// Multi-turn DPO format
interface MultiTurnDPOExample {
  system?: string;
  prompt: {
    role: "user" | "assistant";
    content: string;
  }[]; // Conversation history up to the divergence point
  chosen: string;   // Preferred final assistant response
  rejected: string; // Dispreferred final assistant response
}
Sample Data
[
{
"prompt": "Explain how HTTPS works to someone who isn't technical.",
"chosen": "Think of HTTPS like sending a letter in a locked box instead of on a postcard. When you visit a website with HTTPS, your browser and the website first agree on a secret code that only they know. After that, every message between you and the website is scrambled using that code. Even if someone intercepts the message, they can't read it because they don't have the code. That little padlock icon in your browser means this protection is active.",
"rejected": "HTTPS uses TLS/SSL protocol to encrypt data in transit between the client and server. It initiates a handshake process where the server presents its X.509 certificate, the client verifies it against trusted certificate authorities, and then they negotiate a symmetric session key using asymmetric cryptography. All subsequent data is encrypted with AES-256 or similar cipher suites, protecting against man-in-the-middle attacks and packet sniffing.",
"metadata": {
"preference_reason": "The chosen response uses an accessible analogy appropriate for the non-technical audience. The rejected response is technically accurate but uses jargon that would confuse the target audience.",
"category": "audience-appropriateness",
"difficulty": "medium",
"quality_delta": "medium"
}
},
{
"prompt": "What are some healthy breakfast options?",
"chosen": "Here are some nutritious breakfast ideas that are quick to prepare:\n\n- **Overnight oats**: Mix rolled oats with yogurt and milk, add berries and chia seeds, refrigerate overnight. Ready to eat in the morning.\n- **Eggs and avocado toast**: Whole grain toast with mashed avocado and a poached or scrambled egg. Add everything seasoning for extra flavor.\n- **Smoothie bowl**: Blend frozen banana, spinach, protein powder, and almond milk until thick. Top with granola, sliced fruit, and nuts.\n- **Greek yogurt parfait**: Layer plain Greek yogurt with mixed berries and a drizzle of honey.\n\nAll of these take under 10 minutes and provide a good balance of protein, healthy fats, and complex carbohydrates to keep you energized through the morning.",
"rejected": "There are many healthy breakfast options you can consider. Oatmeal is healthy. Eggs are healthy. Yogurt is healthy. Fruit is healthy. Smoothies are healthy. Toast is healthy. You could also have cereal if it's whole grain. Avocado is healthy too. Nuts are another option. The key is to eat something nutritious in the morning.",
"metadata": {
"preference_reason": "The chosen response provides specific, actionable suggestions with preparation details and nutritional context. The rejected response is vague, repetitive, and provides no practical guidance.",
"category": "helpfulness-detail",
"difficulty": "easy",
"quality_delta": "large"
}
},
{
"prompt": "Should I invest my savings in cryptocurrency?",
"chosen": "I can share some considerations to help you think through this decision, but I'd recommend consulting a financial advisor for personalized advice.\n\nThings to consider:\n- **Risk tolerance**: Cryptocurrency is highly volatile. Bitcoin has seen drawdowns exceeding 70% from peak to trough. Only invest money you can afford to lose entirely.\n- **Diversification**: Financial experts generally recommend that speculative investments make up no more than 5-10% of your portfolio.\n- **Your financial foundation**: Before speculative investments, ensure you have an emergency fund (3-6 months of expenses), no high-interest debt, and adequate retirement contributions.\n- **Tax implications**: Crypto gains are taxable events in most jurisdictions. Factor tax obligations into your return calculations.\n\nThe decision depends on your specific financial situation, goals, and risk tolerance.",
"rejected": "Absolutely! Crypto is the future of finance and you should definitely invest. Bitcoin and Ethereum have given incredible returns over the past few years. Many experts predict Bitcoin will reach $500,000 in the next few years. The earlier you get in, the more money you'll make. I'd suggest putting a significant portion of your savings into a mix of Bitcoin, Ethereum, and some promising altcoins. Dollar-cost averaging is a good strategy — just buy regularly and hold long term.",
"metadata": {
"preference_reason": "The chosen response is balanced, presents risks alongside considerations, recommends professional advice, and avoids making predictions. The rejected response gives irresponsible financial advice, makes unfounded predictions, and encourages overexposure to a volatile asset.",
"category": "safety-balanced-advice",
"difficulty": "medium",
"quality_delta": "large"
}
}
]
Data Collection Guide
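There are three main ways to create DPO preference data: human comparison of model-generated responses, human-written contrastive pairs, and synthetic generation followed by human verification. For most organizations, the most efficient approach is to generate several responses per prompt from your SFT fine-tuned model, then have human annotators pick the best (chosen) and worst (rejected) response from each set.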
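Generate 3-5 responses per prompt, varying the sampling parameters (temperature, top-p) to produce diverse outputs. Present the responses to annotators without revealing which parameters produced each one. Have annotators rank all responses from best to worst, then use the top-ranked response as chosen and the bottom-ranked as rejected. The resulting pairs have rejected responses that are still plausible (they come from a capable model) but clearly worse than the chosen ones.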
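The ranking-to-pair step can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `RankedResponses` shape and the `pairFromRanking` helper are hypothetical names, and the sketch assumes the annotator's ranking is already stored best-first.

```typescript
// Hypothetical shape for one annotator's ranking of sampled responses.
interface RankedResponses {
  prompt: string;
  // Responses ordered from best (index 0) to worst (last index).
  ranked: string[];
}

interface PreferencePair {
  prompt: string;
  chosen: string;
  rejected: string;
}

// Use the top-ranked response as chosen and the bottom-ranked as rejected.
function pairFromRanking(r: RankedResponses): PreferencePair {
  if (r.ranked.length < 2) {
    throw new Error("need at least two ranked responses to form a pair");
  }
  const chosen = r.ranked[0];
  const rejected = r.ranked[r.ranked.length - 1];
  if (chosen === rejected) {
    throw new Error("chosen and rejected responses are identical");
  }
  return { prompt: r.prompt, chosen, rejected };
}
```

Middle-ranked responses are simply dropped here; whether to mine additional pairs from them is a separate design decision that this sketch does not address.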
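Define your preference criteria explicitly. What makes one response better than another? Common criteria include factual accuracy, helpfulness and completeness, appropriate tone and style, safety and harmlessness, conciseness without sacrificing clarity, and honesty about uncertainty. Give annotators a prioritized list of these criteria so they know which qualities take precedence when responses differ along several dimensions. Record the reasoning behind each preference decision in the metadata field.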
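One way to make a prioritized criteria list operational is to resolve disagreements between criteria in priority order. The sketch below is illustrative only: the criterion names and their ordering in `CRITERIA_PRIORITY` are assumptions, not an ordering this template prescribes, and `resolvePreference` is a hypothetical helper.

```typescript
// Per-criterion judgment comparing two responses.
type Verdict = "first" | "second" | "tie";

// ASSUMED priority order, highest priority first. Replace with your own.
const CRITERIA_PRIORITY: string[] = [
  "factual_accuracy",
  "safety",
  "helpfulness",
  "honesty_about_uncertainty",
  "tone",
  "conciseness",
];

// Return the verdict of the highest-priority criterion that is not a tie.
// Criteria missing from `verdicts` are treated as ties.
function resolvePreference(
  verdicts: Record<string, Verdict>,
  priority: string[]
): Verdict {
  for (const criterion of priority) {
    const verdict = verdicts[criterion];
    if (verdict === "first" || verdict === "second") {
      return verdict;
    }
  }
  return "tie";
}
```

A strict lexicographic rule like this is easy to explain to annotators, though it ignores how large each difference is; a weighted score would be an alternative.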
Quality Criteria
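Inter-annotator agreement on preference rankings should exceed 75 percent for the dataset to be effective. If annotators frequently disagree about which response is better, the preference signal is too noisy for DPO to learn from. Low agreement usually means the preference criteria are underspecified, the responses are too similar in quality (insufficient contrast), or the preference involves a subjective dimension on which reasonable people genuinely disagree.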
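A simple way to check the agreement threshold is raw pairwise agreement, sketched below. This assumes each annotator labels which of two responses ("A" or "B") they prefer; the function name and data layout are hypothetical. Raw agreement does not correct for chance, so a chance-corrected statistic may be preferable in practice.

```typescript
type Label = "A" | "B";

// Threshold taken from the guideline above.
const AGREEMENT_THRESHOLD = 0.75;

// labelsPerItem[i] holds the labels that different annotators gave to item i.
// Returns the fraction of annotator pairs that agree, pooled over all items.
function pairwiseAgreement(labelsPerItem: Label[][]): number {
  let agree = 0;
  let total = 0;
  for (const labels of labelsPerItem) {
    for (let i = 0; i < labels.length; i++) {
      for (let j = i + 1; j < labels.length; j++) {
        total += 1;
        if (labels[i] === labels[j]) {
          agree += 1;
        }
      }
    }
  }
  // With no comparable annotator pairs, agreement is undefined.
  return total === 0 ? NaN : agree / total;
}

function meetsThreshold(labelsPerItem: Label[][]): boolean {
  return pairwiseAgreement(labelsPerItem) > AGREEMENT_THRESHOLD;
}
```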
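Ensure diversity in preference reasons. If every rejected response is rejected for the same reason (for example, all are too verbose), the model learns a narrow preference signal and may overcorrect. Include pairs that differ along several dimensions: accurate versus inaccurate, appropriate detail versus excessive detail, balanced versus biased perspectives, safe versus dangerous advice, and natural versus stilted tone.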
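Diversity of preference reasons can be audited from the `category` metadata field. The following is a rough sketch: the helper names are hypothetical, and the 50 percent cutoff is an arbitrary assumption rather than a recommendation from this template.

```typescript
interface PairMeta {
  category: string;
}

// Fraction of pairs that fall into each category.
function categoryShares(pairs: PairMeta[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const pair of pairs) {
    counts[pair.category] = (counts[pair.category] || 0) + 1;
  }
  const shares: Record<string, number> = {};
  for (const category of Object.keys(counts)) {
    shares[category] = counts[category] / pairs.length;
  }
  return shares;
}

// Categories whose share exceeds maxShare (ASSUMED default: 0.5).
function overrepresentedCategories(pairs: PairMeta[], maxShare = 0.5): string[] {
  const shares = categoryShares(pairs);
  return Object.keys(shares).filter((category) => shares[category] > maxShare);
}
```

If one category dominates, that is a prompt to collect pairs that are rejected for other reasons before training.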
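Verify that the quality gap between chosen and rejected is meaningful but not extreme. Pairs in which the two responses are nearly identical teach nothing. Pairs in which the rejected response is obviously bad (ungrammatical, completely off-topic) teach only distinctions the model can probably already make. The most valuable pairs are those in which the rejected response is "almost good" but has one specific, identifiable flaw that the chosen response avoids.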
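Near-identical pairs can be flagged automatically before human review. The sketch below uses Jaccard similarity over whitespace-separated tokens as a crude proxy; the helper names and the 0.9 threshold are assumptions. It only catches the "nearly identical" end of the problem, since judging whether a rejected response is too obviously bad still needs a human or a stronger signal.

```typescript
// Lowercased, de-duplicated whitespace tokens of a text.
function uniqueTokens(text: string): string[] {
  const tokens = text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  return tokens.filter((t, i) => tokens.indexOf(t) === i);
}

// Jaccard similarity: |intersection| / |union| of the two token sets.
function jaccardSimilarity(a: string, b: string): number {
  const tokensA = uniqueTokens(a);
  const tokensB = uniqueTokens(b);
  const shared = tokensA.filter((t) => tokensB.indexOf(t) !== -1).length;
  const union = tokensA.length + tokensB.length - shared;
  // Two empty texts are treated as identical.
  return union === 0 ? 1 : shared / union;
}

// Flag a pair whose responses overlap almost completely (ASSUMED threshold).
function isNearDuplicatePair(chosen: string, rejected: string, threshold = 0.9): boolean {
  return jaccardSimilarity(chosen, rejected) >= threshold;
}
```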
Using This Template with Ertas
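Use Ertas Data Suite to manage the preference annotation pipeline. Import your prompts and model-generated response sets, apply PII redaction to both prompts and responses (especially important when prompts come from real user queries), and organize the data for the annotation workflow. Once annotation is complete, import the preference-labeled data for format validation and export.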
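Export the DPO dataset in JSONL format with prompt, chosen, and rejected fields. Ertas Studio supports DPO training workflows alongside standard SFT, so you can first fine-tune a base model on SFT data and then align it with DPO preference data. The GGUF-exported model reflects both the task knowledge from SFT and the quality preferences from DPO.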
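As a rough illustration of the JSONL layout, the sketch below writes one JSON object per line with the three fields and validates them when reading back. It is a generic example, not the Ertas exporter: `toJsonl` and `parseJsonl` are hypothetical helpers, and the non-empty-string rule is an assumption.

```typescript
interface DPORecord {
  prompt: string;
  chosen: string;
  rejected: string;
}

const REQUIRED_FIELDS = ["prompt", "chosen", "rejected"] as const;

// Serialize records as JSONL: one JSON object per line, newline-terminated.
// JSON.stringify escapes embedded newlines, so each record stays on one line.
function toJsonl(records: DPORecord[]): string {
  return records
    .map((r) => JSON.stringify({ prompt: r.prompt, chosen: r.chosen, rejected: r.rejected }) + "\n")
    .join("");
}

// Parse JSONL and check that every record has non-empty string fields.
function parseJsonl(text: string): DPORecord[] {
  const records: DPORecord[] = [];
  const lines = text.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line === "") continue; // skip blank lines, including the trailing one
    const obj = JSON.parse(line);
    for (const field of REQUIRED_FIELDS) {
      if (typeof obj[field] !== "string" || obj[field].length === 0) {
        throw new Error(`line ${i + 1}: missing or empty "${field}"`);
      }
    }
    records.push({ prompt: obj.prompt, chosen: obj.chosen, rejected: obj.rejected });
  }
  return records;
}
```

Extra keys such as `metadata` are dropped on read in this sketch; keep them if your training tool accepts or ignores additional fields.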
Recommended Model
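DPO training typically starts from an SFT fine-tuned model rather than a base model. First fine-tune on your task-specific SFT dataset, then apply DPO with your preference data. For most applications, 2,000 to 10,000 high-quality preference pairs are enough to achieve meaningful alignment improvements. More data helps, but for DPO, quality matters far more than quantity.
Models in the 7B-8B parameter range respond well to DPO training on moderately sized datasets. Larger models (13B and up) may need more preference data to show clear improvement. After DPO training, export to GGUF with Q4_K_M or Q5_K_M quantization for local inference. The alignment gains from DPO are well preserved through quantization.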