
如何為微調創建工具呼叫訓練資料集
微調指南中最大的空缺:沒有人說明如何實際建立資料集。以下是創建工具呼叫訓練資料的逐步流程——從模式文件到合成擴展到 JSONL 格式化——包含一個 5 工具客服代理的真實示例。
每篇關於微調工具呼叫模型的指南都假設您已經有了資料。「只需以 JSONL 格式準備您的訓練資料集」,他們說,然後直接跳到訓練命令。
這跳過了最難的部分。
建立高品質的工具呼叫資料集佔 80% 的工作。模型架構、訓練超參數、LoRA 秩——如果您的訓練資料稀薄、不均衡或缺 少邊緣案例,這些都不重要。
本指南涵蓋實際流程。我們將從零個示例到一個生產就緒的 JSONL 文件,為一個 5 工具客服代理建立完整的訓練資料集。
目標:一個 5 工具客服代理
我們正在為一個使用五個工具處理客戶支援的代理建立訓練資料:
lookup_order— 透過訂單 ID 或電子郵件查找訂單check_status— 獲取現有訂單的當前狀態initiate_refund— 啟動訂單的退款流程update_address— 更改訂單的送貨地址escalate_to_human— 將對話轉移給人工客服
五個工具。簡單到可以跟上,複雜到算是真實場景。讓我們建立資料集。
步驟一:記錄工具模式
每個工具需要一個精確的 JSON 模式。這是您的模型學習遵循的合約。模糊的模式產生模糊的輸出。
{
"name": "lookup_order",
"description": "Find a customer order by order ID or email address. Use when the customer wants to find or reference a specific order.",
"parameters": {
"type": "object",
"properties": {
"order_id": {
"type": "string",
"description": "The order ID (format: ORD-XXXXX)"
},
"email": {
"type": "string",
"description": "Customer email address to search orders"
}
},
"required": []
}
}
模式的關鍵規則:
- 描述比名稱更重要。 模型從描述而非函式名稱學習何時呼叫工具。「透過訂單 ID 或電子郵件查找客戶訂單」教導模型哪些輸入映射到此工具。
- 明確說明參數格式。 「format: ORD-XXXXX」防止模型生成裸數字。
- 正確標記必填欄位。 在上面的示例中,兩個參數都不是必填的,因為客戶可能提供訂單 ID 或電子郵件之一。
- 在適用的地方包含枚舉值。 如果參數只接受特定值,列出它們。
以下是 initiate_refund 的模式:
{
"name": "initiate_refund",
"description": "Start a refund for a specific order. Use when the customer explicitly requests a refund or return.",
"parameters": {
"type": "object",
"properties": {
"order_id": {
"type": "string",
"description": "The order ID to refund (format: ORD-XXXXX)"
},
"reason": {
"type": "string",
"enum": ["defective", "wrong_item", "not_received", "changed_mind", "other"],
"description": "Reason category for the refund"
},
"amount": {
"type": "number",
"description": "Refund amount in USD. Omit for full refund."
}
},
"required": ["order_id", "reason"]
}
}
在撰寫單一訓練樣本之前,以這種方式記錄所有五個工具。模式既是模型在推論期間將看到的輸入,也是生成正確輸出的藍圖。
步驟二:生成種子樣本
每個工具從 10–20 個手寫用戶訊息開始。這些是捕捉真實用戶觸發每個工具的核心模式的種子樣本。
對於 lookup_order:
"Can you find my order? It's ORD-48291"
"I placed an order last week but can't find the confirmation. My email is jane@example.com"
"Where's my order ORD-77432?"
"I need to check on an order I made. The order number is ORD-15003"
"Look up order ORD-62810 please"
"I can't find my order. I used sarah.jones@gmail.com to purchase"
"What happened to ORD-33102?"
"Can you pull up my order? Email is mike.chen@company.com"
"Find order ORD-90145"
"I have an order number: ORD-55678. Can you look it up?"
對於 escalate_to_human:
"I want to talk to a real person"
"This isn't helping, let me speak with a manager"
"Can I talk to someone who can actually help?"
"Transfer me to a human agent please"
"I'd rather discuss this with a person"
"Get me a supervisor"
"I need human support, not a bot"
"This is too complicated for a chatbot, connect me to support"
"I want to file a formal complaint — put me through to a manager"
"None of these options work. I need a real person."
注意多 樣性。有些訊息禮貌,有些則沮喪。有些包含確切的參數(訂單 ID),有些則模糊。這種多樣性教導模型處理真實用戶。
手工撰寫這些。 不要使用 LLM 生成種子樣本。您需要這些樣本基於您實際用戶的說話方式。如果您有真實的聊天日誌,從中提取模式。
步驟三:合成擴展
每個工具十個樣本不足以進行可靠的微調。每個工具需要 50–200 個以上的樣本。這就是合成生成發揮作用的地方。
使用前沿模型(GPT-4、Claude)擴展您的種子集。提示很重要:
You are generating training data for a customer service AI that can call tools.
Given these seed examples of messages that should trigger the "initiate_refund" tool,
generate 50 new variations.
Rules:
- Vary the phrasing, formality, and specificity
- Include different order IDs (format: ORD-XXXXX with random 5-digit numbers)
- Include different refund reasons (defective, wrong_item, not_received, changed_mind, other)
- Some should mention partial refunds with specific amounts
- Some should be vague ("I want my money back") and some specific ("Please refund $34.99 for ORD-12345, the item was defective")
- Include typos, abbreviations, and casual language in ~20% of examples
- Do NOT include messages that are ambiguous between tools
Seed examples:
[paste your 10-15 seed examples]
對每個工具運行此操作。審查輸出。刪除任何感覺不真實或在工具之間有歧義的樣本。
關鍵:生成完整的訓練對,而非只是用戶訊息。 對每條用戶訊息,生成對應的帶有正確工具呼叫的助理回應:
{
"messages": [
{
"role": "system",
"content": "You are a customer service assistant. You have access to the following tools: [tool schemas here]. Call the appropriate tool based on the customer's message. If no tool is appropriate, respond conversationally."
},
{
"role": "user",
"content": "I need a refund for ORD-48291. The product arrived broken."
},
{
"role": "assistant",
"tool_calls": [
{
"function": {
"name": "initiate_refund",
"arguments": "{\"order_id\": \"ORD-48291\", \"reason\": \"defective\"}"
}
}
]
}
]
}
步驟四:添加負面樣本
這是大多數人跳過的步驟,也是大多數微調工具呼叫模型過度觸發的原因。
負面樣本是不應觸發任何工具呼叫的訊息。模型應改為進行對話式回應。
目標是讓總資料集的 20% 為負面樣本。對一個每個工具有 100 個樣本的 5 工具代理(500 個正面樣本),您需要約 125 個負面樣本。
負面樣本類型:
一般對話:
"Hi, how are you?"
"Thanks for the help!"
"What are your business hours?"
"Do you sell laptops?"
邊緣訊息(接近某個工具,但不完全是):
"What's your refund policy?"(詢問退款政策,而非請求退款)
"How long does shipping usually take?"(不是詢問特定訂單)
"Can I change my order?"(模糊——改變什麼?)
"What happens if my package is lost?"(假設性的,而非具體案例)
多意圖訊息,工具呼叫不明確:
"I want to return my order and also ask about your new products"
"Is there a way to track packages in general?"
將負面樣本格式化為帶有對話式助理回應的形式,而非工具呼叫:
{
"messages": [
{
"role": "system",
"content": "You are a customer service assistant. You have access to the following tools: [tool schemas here]. Call the appropriate tool based on the customer's message. If no tool is appropriate, respond conversationally."
},
{
"role": "user",
"content": "What's your return policy?"
},
{
"role": "assistant",
"content": "Our return policy allows returns within 30 days of delivery for a full refund. Would you like to initiate a return for a specific order?"
}
]
}
邊緣訊息樣本最為重要。它們教導模型「詢問退款」和「請求退款」之間的邊界。沒有它們,每次有人提到「退款」這個詞,模型都會呼叫 initiate_refund。
步驟五:格式化為 JSONL
您的最終資料集是一個 .jsonl 文件——每行一個 JSON 物件。以下是完整格式:
{"messages": [{"role": "system", "content": "You are a customer service assistant..."}, {"role": "user", "content": "Find my order ORD-48291"}, {"role": "assistant", "tool_calls": [{"function": {"name": "lookup_order", "arguments": "{\"order_id\": \"ORD-48291\"}"}}]}]}
{"messages": [{"role": "system", "content": "You are a customer service assistant..."}, {"role": "user", "content": "I need a refund for ORD-12345, it arrived broken"}, {"role": "assistant", "tool_calls": [{"function": {"name": "initiate_refund", "arguments": "{\"order_id\": \"ORD-12345\", \"reason\": \"defective\"}"}}]}]}
{"messages": [{"role": "system", "content": "You are a customer service assistant..."}, {"role": "user", "content": "What are your business hours?"}, {"role": "assistant", "content": "We're available Monday through Friday, 9am to 6pm EST. How can I help you today?"}]}
關鍵格式規則:
- 每行一個 JSON 物件。 不要美化輸出,不要在物件內換行。
- 參數是 JSON 字串,不是巢狀物件。 注意
"arguments": "{\"order_id\": \"ORD-48291\"}"— 值是包含 JSON 的字串,而非原始 JSON 物件。這與大多數微調框架期望的 OpenAI 函式呼叫格式相符。 - 系統訊息在所有樣本中完全相同。 在每條系統訊息中包含完整的工具模式。不要縮略。
- 一致的鍵排序。 始終
role在content或tool_calls之前。
訓練前的品質檢查
在開始微調之前,對資料集執行這些驗證:
模式合規性
每個 tool_calls 條目必須引用您工具模式中存在的函式名稱。每個 arguments 字串必須解析為有效的 JSON,並與模式中定義的參數類型相符。
import json
def validate_example(example, valid_tools):
for msg in example["messages"]:
if "tool_calls" in msg:
for call in msg["tool_calls"]:
name = call["function"]["name"]
assert name in valid_tools, f"Unknown tool: {name}"
args = json.loads(call["function"]["arguments"])
# Validate args against schema...
參數驗證
檢查生成的參數值是否真實:
- 訂單 ID 遵循
ORD-XXXXX格式 - 電子郵件地址在語法上有效
- 枚舉值與允許的集合匹配
- 必填參數始終存在
- 數字金額合理(不是負數,不是 999,999 美元)
均衡性檢查
計算每個工具的樣本數。如果 lookup_order 有 150 個樣本,但 escalate_to_human 只有 30 個,模型將偏向查找。目標是大致相等的分布,容差為正負 30%。
| 工具 | 樣本數 | 資料集佔比 |
|---|---|---|
| lookup_order | 100 | 16% |
| check_status | 100 | 16% |
| initiate_refund | 100 | 16% |
| update_address | 100 | 16% |
| escalate_to_human | 100 | 16% |
| 無工具(負面) | 125 | 20% |
| 總計 | 625 | 100% |