如何為微調創建工具呼叫訓練資料集

每篇關於微調工具呼叫模型的指南都假設您已經有了資料。「只需以 JSONL 格式準備您的訓練資料集」，他們說，然後直接跳到訓練命令。

這跳過了最難的部分。

建立高品質的工具呼叫資料集佔 80% 的工作。模型架構、訓練超參數、LoRA 秩——如果您的訓練資料稀薄、不均衡或缺少邊緣案例，這些都不重要。

本指南涵蓋實際流程。我們將從零個示例到一個生產就緒的 JSONL 文件，為一個 5 工具客服代理建立完整的訓練資料集。

目標：一個 5 工具客服代理

我們正在為一個使用五個工具處理客戶支援的代理建立訓練資料：

lookup_order — 透過訂單 ID 或電子郵件查找訂單
check_status — 獲取現有訂單的當前狀態
initiate_refund — 啟動訂單的退款流程
update_address — 更改訂單的送貨地址
escalate_to_human — 將對話轉移給人工客服

五個工具。簡單到可以跟上，複雜到算是真實場景。讓我們建立資料集。

步驟一：記錄工具模式

每個工具需要一個精確的 JSON 模式。這是您的模型學習遵循的合約。模糊的模式產生模糊的輸出。

{
  "name": "lookup_order",
  "description": "Find a customer order by order ID or email address. Use when the customer wants to find or reference a specific order.",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id": {
        "type": "string",
        "description": "The order ID (format: ORD-XXXXX)"
      },
      "email": {
        "type": "string",
        "description": "Customer email address to search orders"
      }
    },
    "required": []
  }
}

模式的關鍵規則：

描述比名稱更重要。 模型從描述而非函式名稱學習何時呼叫工具。「透過訂單 ID 或電子郵件查找客戶訂單」教導模型哪些輸入映射到此工具。
明確說明參數格式。 「format: ORD-XXXXX」防止模型生成裸數字。
正確標記必填欄位。 在上面的示例中，兩個參數都不是必填的，因為客戶可能提供訂單 ID 或電子郵件之一。
在適用的地方包含枚舉值。 如果參數只接受特定值，列出它們。

以下是 initiate_refund 的模式：

{
  "name": "initiate_refund",
  "description": "Start a refund for a specific order. Use when the customer explicitly requests a refund or return.",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id": {
        "type": "string",
        "description": "The order ID to refund (format: ORD-XXXXX)"
      },
      "reason": {
        "type": "string",
        "enum": ["defective", "wrong_item", "not_received", "changed_mind", "other"],
        "description": "Reason category for the refund"
      },
      "amount": {
        "type": "number",
        "description": "Refund amount in USD. Omit for full refund."
      }
    },
    "required": ["order_id", "reason"]
  }
}

在撰寫單一訓練樣本之前，以這種方式記錄所有五個工具。模式既是模型在推論期間將看到的輸入，也是生成正確輸出的藍圖。

步驟二：生成種子樣本

每個工具從 10–20 個手寫用戶訊息開始。這些是捕捉真實用戶觸發每個工具的核心模式的種子樣本。

對於 lookup_order：

"Can you find my order? It's ORD-48291"
"I placed an order last week but can't find the confirmation. My email is jane@example.com"
"Where's my order ORD-77432?"
"I need to check on an order I made. The order number is ORD-15003"
"Look up order ORD-62810 please"
"I can't find my order. I used sarah.jones@gmail.com to purchase"
"What happened to ORD-33102?"
"Can you pull up my order? Email is mike.chen@company.com"
"Find order ORD-90145"
"I have an order number: ORD-55678. Can you look it up?"

對於 escalate_to_human：

"I want to talk to a real person"
"This isn't helping, let me speak with a manager"
"Can I talk to someone who can actually help?"
"Transfer me to a human agent please"
"I'd rather discuss this with a person"
"Get me a supervisor"
"I need human support, not a bot"
"This is too complicated for a chatbot, connect me to support"
"I want to file a formal complaint — put me through to a manager"
"None of these options work. I need a real person."

注意多樣性。有些訊息禮貌，有些則沮喪。有些包含確切的參數（訂單 ID），有些則模糊。這種多樣性教導模型處理真實用戶。

手工撰寫這些。 不要使用 LLM 生成種子樣本。您需要這些樣本基於您實際用戶的說話方式。如果您有真實的聊天日誌，從中提取模式。

步驟三：合成擴展

每個工具十個樣本不足以進行可靠的微調。每個工具需要 50–200 個以上的樣本。這就是合成生成發揮作用的地方。

使用前沿模型（GPT-4、Claude）擴展您的種子集。提示很重要：

You are generating training data for a customer service AI that can call tools.
Given these seed examples of messages that should trigger the "initiate_refund" tool,
generate 50 new variations.

Rules:
- Vary the phrasing, formality, and specificity
- Include different order IDs (format: ORD-XXXXX with random 5-digit numbers)
- Include different refund reasons (defective, wrong_item, not_received, changed_mind, other)
- Some should mention partial refunds with specific amounts
- Some should be vague ("I want my money back") and some specific ("Please refund $34.99 for ORD-12345, the item was defective")
- Include typos, abbreviations, and casual language in ~20% of examples
- Do NOT include messages that are ambiguous between tools

Seed examples:
[paste your 10-15 seed examples]

對每個工具運行此操作。審查輸出。刪除任何感覺不真實或在工具之間有歧義的樣本。

關鍵：生成完整的訓練對，而非只是用戶訊息。 對每條用戶訊息，生成對應的帶有正確工具呼叫的助理回應：

{
  "messages": [
    {
      "role": "system",
      "content": "You are a customer service assistant. You have access to the following tools: [tool schemas here]. Call the appropriate tool based on the customer's message. If no tool is appropriate, respond conversationally."
    },
    {
      "role": "user",
      "content": "I need a refund for ORD-48291. The product arrived broken."
    },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "function": {
            "name": "initiate_refund",
            "arguments": "{\"order_id\": \"ORD-48291\", \"reason\": \"defective\"}"
          }
        }
      ]
    }
  ]
}

步驟四：添加負面樣本

這是大多數人跳過的步驟，也是大多數微調工具呼叫模型過度觸發的原因。

負面樣本是不應觸發任何工具呼叫的訊息。模型應改為進行對話式回應。

目標是讓總資料集的 20% 為負面樣本。對一個每個工具有 100 個樣本的 5 工具代理（500 個正面樣本），您需要約 125 個負面樣本。

負面樣本類型：

一般對話：

"Hi, how are you?"
"Thanks for the help!"
"What are your business hours?"
"Do you sell laptops?"

邊緣訊息（接近某個工具，但不完全是）：

"What's your refund policy?"（詢問退款政策，而非請求退款）
"How long does shipping usually take?"（不是詢問特定訂單）
"Can I change my order?"（模糊——改變什麼？）
"What happens if my package is lost?"（假設性的，而非具體案例）

多意圖訊息，工具呼叫不明確：

"I want to return my order and also ask about your new products"
"Is there a way to track packages in general?"

將負面樣本格式化為帶有對話式助理回應的形式，而非工具呼叫：

{
  "messages": [
    {
      "role": "system",
      "content": "You are a customer service assistant. You have access to the following tools: [tool schemas here]. Call the appropriate tool based on the customer's message. If no tool is appropriate, respond conversationally."
    },
    {
      "role": "user",
      "content": "What's your return policy?"
    },
    {
      "role": "assistant",
      "content": "Our return policy allows returns within 30 days of delivery for a full refund. Would you like to initiate a return for a specific order?"
    }
  ]
}

邊緣訊息樣本最為重要。它們教導模型「詢問退款」和「請求退款」之間的邊界。沒有它們，每次有人提到「退款」這個詞，模型都會呼叫 initiate_refund。

步驟五：格式化為 JSONL

您的最終資料集是一個 .jsonl 文件——每行一個 JSON 物件。以下是完整格式：

{"messages": [{"role": "system", "content": "You are a customer service assistant..."}, {"role": "user", "content": "Find my order ORD-48291"}, {"role": "assistant", "tool_calls": [{"function": {"name": "lookup_order", "arguments": "{\"order_id\": \"ORD-48291\"}"}}]}]}
{"messages": [{"role": "system", "content": "You are a customer service assistant..."}, {"role": "user", "content": "I need a refund for ORD-12345, it arrived broken"}, {"role": "assistant", "tool_calls": [{"function": {"name": "initiate_refund", "arguments": "{\"order_id\": \"ORD-12345\", \"reason\": \"defective\"}"}}]}]}
{"messages": [{"role": "system", "content": "You are a customer service assistant..."}, {"role": "user", "content": "What are your business hours?"}, {"role": "assistant", "content": "We're available Monday through Friday, 9am to 6pm EST. How can I help you today?"}]}

關鍵格式規則：

每行一個 JSON 物件。 不要美化輸出，不要在物件內換行。
參數是 JSON 字串，不是巢狀物件。 注意 "arguments": "{\"order_id\": \"ORD-48291\"}" — 值是包含 JSON 的字串，而非原始 JSON 物件。這與大多數微調框架期望的 OpenAI 函式呼叫格式相符。
系統訊息在所有樣本中完全相同。 在每條系統訊息中包含完整的工具模式。不要縮略。
一致的鍵排序。 始終 role 在 content 或 tool_calls 之前。

訓練前的品質檢查

在開始微調之前，對資料集執行這些驗證：

模式合規性

每個 tool_calls 條目必須引用您工具模式中存在的函式名稱。每個 arguments 字串必須解析為有效的 JSON，並與模式中定義的參數類型相符。

import json

def validate_example(example, valid_tools):
    for msg in example["messages"]:
        if "tool_calls" in msg:
            for call in msg["tool_calls"]:
                name = call["function"]["name"]
                assert name in valid_tools, f"Unknown tool: {name}"
                args = json.loads(call["function"]["arguments"])
                # Validate args against schema...

參數驗證

檢查生成的參數值是否真實：

訂單 ID 遵循 ORD-XXXXX 格式
電子郵件地址在語法上有效
枚舉值與允許的集合匹配
必填參數始終存在
數字金額合理（不是負數，不是 999,999 美元）

均衡性檢查

計算每個工具的樣本數。如果 lookup_order 有 150 個樣本，但 escalate_to_human 只有 30 個，模型將偏向查找。目標是大致相等的分布，容差為正負 30%。

工具	樣本數	資料集佔比
lookup_order	100	16%
check_status	100	16%
initiate_refund	100	16%
update_address	100	16%
escalate_to_human	100	16%
無工具（負面）	125	20%
總計	625	100%

邊緣案例覆蓋範圍

稽核這些特定案例：

缺少可選參數： 當用戶提供電子郵件但沒有訂單 ID 時，模型能處理嗎？
多個可能的工具： 「我想查看我的退款」——這是 check_status 還是其他？
從自然語言提取參數： 「我搬到了XX路XX號」——模型能提取地址嗎？
對話上下文： 在退款確認後的「是的，繼續」——模型能處理後續輪次嗎？

量的指南

您實際需要多少資料？這取決於複雜度：

簡單工具路由（僅分類）： 每個工具 50–100 個樣本。模型只需要學習哪個工具映射到哪個意圖。參數提取很直接。

複雜的參數提取： 每個工具 200 個以上的樣本。當參數需要解析自然語言（地址、日期、產品描述）時，您需要更多展示提取模式的樣本。

多輪對話： 每個工具 300 個以上的樣本。如果模型需要處理後續訊息、澄清請求和來自前幾輪的上下文，資料集會顯著增大。

對我們的 5 工具客服代理，每個工具 100 個樣本加上 125 個負面樣本（共 625 個）是一個穩固的起點。您可以在單個 GPU 上用不到 30 分鐘微調一個 70 億參數的模型。

多輪序列

真實的客戶互動不是單輪的。客戶可能查找訂單，然後檢查其狀態，然後在同一個對話中請求退款。您的資料集應包含多輪序列：

{
  "messages": [
    {"role": "system", "content": "You are a customer service assistant..."},
    {"role": "user", "content": "Can you find order ORD-48291?"},
    {"role": "assistant", "tool_calls": [{"function": {"name": "lookup_order", "arguments": "{\"order_id\": \"ORD-48291\"}"}}]},
    {"role": "tool", "content": "{\"order_id\": \"ORD-48291\", \"status\": \"delivered\", \"items\": [\"Blue Widget\"], \"total\": 34.99}"},
    {"role": "assistant", "content": "I found your order ORD-48291. It shows as delivered. You ordered a Blue Widget for $34.99. How can I help with this order?"},
    {"role": "user", "content": "The widget arrived damaged. I want my money back."},
    {"role": "assistant", "tool_calls": [{"function": {"name": "initiate_refund", "arguments": "{\"order_id\": \"ORD-48291\", \"reason\": \"defective\", \"amount\": 34.99}"}}]}
  ]
}

在資料集中包含 20–30 個多輪序列。它們教導模型維護上下文，並在後續呼叫中使用前面工具結果的資訊。

要避免的常見錯誤

錯誤一：複製粘貼相同的句子結構。 「請退款 ORD-11111」、「請退款 ORD-22222」、「請退款 ORD-33333」——這教導模型一個模式，而非概念。改變措辭。

錯誤二：每個樣本都語法完美。 真實用戶輸入「我的訂單在哪裡」、「幫我退款謝謝」、「我要改地址」等形式。在訓練資料中包含這些。

錯誤三：沒有模糊樣本。 「檢查狀態」和「查找訂單」之間的邊界是模糊的。包含探索這個邊界的樣本，並保持一致的標記決策。

錯誤四：跳過系統訊息。 有些人在沒有系統訊息的情況下訓練，然後在推論時添加一個。模型以前從未見過那個系統訊息。使用您將在生產中使用的確切系統提示進行訓練。

錯誤五：不驗證 JSONL。 一個格式錯誤的行將在訓練運行進行到 45 分鐘時使其崩潰。在開始之前驗證每行都可以解析為 JSON。

Ship AI that runs on your users' devices.

Free plan with 30 credits/mo, no card required. Paid plans from $25/mo USD.

or view pricing →

整合起來

以下是從頭到尾的完整流程：

以精確的描述和參數類型記錄所有工具模式
根據真實用戶模式，為每個工具手工撰寫 10–20 個種子樣本
使用前沿模型擴展到每個工具 50–100 個以上的變體
添加 20% 的負面樣本，尤其是邊緣訊息
以一致的結構格式化為 JSONL
驗證模式合規性、參數正確性和均衡性
包含 20–30 個多輪對話序列
執行最終去重

總時間：一個 5 工具代理需要 4–8 小時。樣本總數：500–750 個。在 70 億參數模型上使用 LoRA 的訓練時間：20–40 分鐘。

結果是一個 90% 以上時間路由到正確工具、持續生成有效參數，並知道何時不呼叫任何工具的模型。全部在本地運行，零按查詢計費成本。

資料集就是模型。把它建好。