    How to Create a Tool-Calling Training Dataset for Fine-Tuning


    The biggest gap in fine-tuning guides: nobody covers how to actually build the dataset. Here's a step-by-step process to create tool-calling training data — from schema documentation to synthetic expansion to JSONL formatting — with real examples for a 5-tool customer service agent.

    Ertas Team

    Every guide on fine-tuning tool-calling models assumes you already have the data. "Just prepare your training dataset in JSONL format," they say, then skip straight to the training command.

    That skips the hardest part.

    Building a high-quality tool-calling dataset is 80% of the work. The model architecture, the training hyperparameters, the LoRA rank — none of it matters if your training data is thin, unbalanced, or missing edge cases.

    This guide covers the actual process. We will build a complete training dataset for a 5-tool customer service agent, from zero examples to a production-ready JSONL file.

    The Target: A 5-Tool Customer Service Agent

    We are building training data for an agent that handles customer support with five tools:

    • lookup_order — Find an order by order ID or email
    • check_status — Get current status of an existing order
    • initiate_refund — Start a refund process for an order
    • update_address — Change shipping address on an order
    • escalate_to_human — Transfer the conversation to a human agent

    Five tools. Simple enough to follow, complex enough to be real. Let's build the dataset.

    Step 1: Document Your Tool Schema

    Every tool needs a precise JSON schema. This is the contract your model learns to follow. Vague schemas produce vague outputs.

    {
      "name": "lookup_order",
      "description": "Find a customer order by order ID or email address. Use when the customer wants to find or reference a specific order.",
      "parameters": {
        "type": "object",
        "properties": {
          "order_id": {
            "type": "string",
            "description": "The order ID (format: ORD-XXXXX)"
          },
          "email": {
            "type": "string",
            "description": "Customer email address to search orders"
          }
        },
        "required": []
      }
    }
    

    Key rules for schemas:

    • Descriptions matter more than names. The model learns when to call a tool from the description, not just the function name. "Find a customer order by order ID or email address" teaches the model which inputs map to this tool.
    • Be explicit about parameter formats. "format: ORD-XXXXX" prevents the model from generating bare numbers.
    • Mark required fields correctly. In the example above, neither parameter is required because the customer might provide either an order ID or an email.
    • Include enum values where applicable. If a parameter only accepts specific values, list them.

    Here is the schema for initiate_refund:

    {
      "name": "initiate_refund",
      "description": "Start a refund for a specific order. Use when the customer explicitly requests a refund or return.",
      "parameters": {
        "type": "object",
        "properties": {
          "order_id": {
            "type": "string",
            "description": "The order ID to refund (format: ORD-XXXXX)"
          },
          "reason": {
            "type": "string",
            "enum": ["defective", "wrong_item", "not_received", "changed_mind", "other"],
            "description": "Reason category for the refund"
          },
          "amount": {
            "type": "number",
            "description": "Refund amount in USD. Omit for full refund."
          }
        },
        "required": ["order_id", "reason"]
      }
    }
    

    Document all five tools this way before writing a single training example. The schemas are both the input your model will see during inference and the blueprint for generating correct outputs.

    Step 2: Generate Seed Examples

    Start with 10-20 hand-written user messages per tool. These are the seed examples that capture the core patterns of how a real user would trigger each tool.

    For lookup_order:

    "Can you find my order? It's ORD-48291"
    "I placed an order last week but can't find the confirmation. My email is jane@example.com"
    "Where's my order ORD-77432?"
    "I need to check on an order I made. The order number is ORD-15003"
    "Look up order ORD-62810 please"
    "I can't find my order. I used sarah.jones@gmail.com to purchase"
    "What happened to ORD-33102?"
    "Can you pull up my order? Email is mike.chen@company.com"
    "Find order ORD-90145"
    "I have an order number: ORD-55678. Can you look it up?"
    

    For escalate_to_human:

    "I want to talk to a real person"
    "This isn't helping, let me speak with a manager"
    "Can I talk to someone who can actually help?"
    "Transfer me to a human agent please"
    "I'd rather discuss this with a person"
    "Get me a supervisor"
    "I need human support, not a bot"
    "This is too complicated for a chatbot, connect me to support"
    "I want to file a formal complaint — put me through to a manager"
    "None of these options work. I need a real person."
    

    Notice the variety. Some messages are polite, some are frustrated. Some include exact parameters (order IDs), some are vague. This variety is what teaches the model to handle real users.

    Write these by hand. Do not use an LLM for seed examples. You need these to be grounded in how your actual users talk. If you have real chat logs, mine them for patterns.

    Step 3: Synthetic Expansion

    Ten examples per tool is not enough to fine-tune reliably. You need 50-200+ per tool. This is where synthetic generation comes in.

    Use a frontier model (GPT-4, Claude) to expand your seed set. The prompt matters:

    You are generating training data for a customer service AI that can call tools.
    Given these seed examples of messages that should trigger the "initiate_refund" tool,
    generate 50 new variations.
    
    Rules:
    - Vary the phrasing, formality, and specificity
    - Include different order IDs (format: ORD-XXXXX with random 5-digit numbers)
    - Include different refund reasons (defective, wrong_item, not_received, changed_mind, other)
    - Some should mention partial refunds with specific amounts
    - Some should be vague ("I want my money back") and some specific ("Please refund $34.99 for ORD-12345, the item was defective")
    - Include typos, abbreviations, and casual language in ~20% of examples
    - Do NOT include messages that are ambiguous between tools
    
    Seed examples:
    [paste your 10-15 seed examples]
    

    Run this for each tool. Review the output. Delete any examples that feel unrealistic or are ambiguous between tools.
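    Alongside the LLM pass, a small scripted expander can bulk out parameter variety cheaply. Everything below (template strings, helper names) is illustrative, not a prescribed pipeline; the LLM still does the heavy lifting on phrasing:

```python
import random
import re

# Hypothetical first-pass expander: fills phrasing templates with randomized
# parameters. An LLM pass should still add genuine linguistic variety.
TEMPLATES = [
    "I need a refund for {oid}, it arrived {adj}.",
    "Please refund {oid}",
    "can u refund my order {oid}? wrong item",
    "I want my money back for {oid}",
]

def random_order_id(rng):
    # Matches the ORD-XXXXX format from the tool schema
    return f"ORD-{rng.randint(0, 99999):05d}"

def expand(templates, n, seed=0):
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        template = rng.choice(templates)
        # str.format ignores unused keyword arguments, so templates may
        # use any subset of the available slots
        out.append(template.format(
            oid=random_order_id(rng),
            adj=rng.choice(["broken", "damaged", "defective"]),
        ))
    return out

variants = expand(TEMPLATES, 50)
```

    Review the scripted output with the same eye as the LLM output: delete anything that reads as unnatural.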

    Critical: generate the complete training pair, not just the user message. For each user message, generate the corresponding assistant response with the correct tool call:

    {
      "messages": [
        {
          "role": "system",
          "content": "You are a customer service assistant. You have access to the following tools: [tool schemas here]. Call the appropriate tool based on the customer's message. If no tool is appropriate, respond conversationally."
        },
        {
          "role": "user",
          "content": "I need a refund for ORD-48291. The product arrived broken."
        },
        {
          "role": "assistant",
          "tool_calls": [
            {
              "function": {
                "name": "initiate_refund",
                "arguments": "{\"order_id\": \"ORD-48291\", \"reason\": \"defective\"}"
              }
            }
          ]
        }
      ]
    }
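    Assembling these pairs with a small helper keeps the structure consistent across hundreds of examples. The function name and the truncated system prompt below are placeholders; in practice the prompt carries the full tool schemas:

```python
import json

# Placeholder -- in a real dataset this is the full prompt plus tool schemas
SYSTEM_PROMPT = "You are a customer service assistant..."

def make_tool_example(user_msg, tool_name, arguments):
    """Build one training example with a tool-calling assistant turn.

    Note: `arguments` is serialized to a JSON *string*, matching the
    OpenAI function-calling format most fine-tuning frameworks expect.
    """
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_msg},
            {"role": "assistant", "tool_calls": [
                {"function": {"name": tool_name,
                              "arguments": json.dumps(arguments)}}
            ]},
        ]
    }

ex = make_tool_example(
    "I need a refund for ORD-48291. The product arrived broken.",
    "initiate_refund",
    {"order_id": "ORD-48291", "reason": "defective"},
)
```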
    

    Step 4: Add Negative Examples

    This is the step most people skip, and it is the reason most fine-tuned tool-calling models over-trigger.

    Negative examples are messages that should NOT trigger any tool call. The model should respond conversationally instead.

    Aim for 20% of your total dataset to be negative examples. For a 5-tool agent with 100 examples per tool (500 positive examples), you need about 125 negative examples.
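    The arithmetic generalizes: if negatives should make up fraction f of the total, you need positives × f / (1 − f) of them. A one-liner, under that assumption:

```python
def negatives_needed(positives, fraction=0.20):
    # negatives / (positives + negatives) == fraction
    # => negatives == positives * fraction / (1 - fraction)
    return round(positives * fraction / (1 - fraction))

n = negatives_needed(500)  # 5 tools x 100 positive examples -> 125
```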

    Types of negative examples:

    General conversation:

    "Hi, how are you?"
    "Thanks for the help!"
    "What are your business hours?"
    "Do you sell laptops?"
    

    Near-miss messages (close to a tool, but not quite):

    "What's your refund policy?" (asking ABOUT refunds, not requesting one)
    "How long does shipping usually take?" (not asking about a specific order)
    "Can I change my order?" (ambiguous — change what exactly?)
    "What happens if my package is lost?" (hypothetical, not a specific case)
    

    Multi-intent messages where the tool call is unclear:

    "I want to return my order and also ask about your new products"
    "Is there a way to track packages in general?"
    

    Format negative examples with a conversational assistant response instead of a tool call:

    {
      "messages": [
        {
          "role": "system",
          "content": "You are a customer service assistant. You have access to the following tools: [tool schemas here]. Call the appropriate tool based on the customer's message. If no tool is appropriate, respond conversationally."
        },
        {
          "role": "user",
          "content": "What's your return policy?"
        },
        {
          "role": "assistant",
          "content": "Our return policy allows returns within 30 days of delivery for a full refund. Would you like to initiate a return for a specific order?"
        }
      ]
    }
    

    The near-miss examples are the most important. They teach the model the boundary between "asking about refunds" and "requesting a refund." Without them, the model will call initiate_refund every time someone mentions the word "refund."

    Step 5: Format as JSONL

    Your final dataset is a .jsonl file — one JSON object per line. Here is the complete format:

    {"messages": [{"role": "system", "content": "You are a customer service assistant..."}, {"role": "user", "content": "Find my order ORD-48291"}, {"role": "assistant", "tool_calls": [{"function": {"name": "lookup_order", "arguments": "{\"order_id\": \"ORD-48291\"}"}}]}]}
    {"messages": [{"role": "system", "content": "You are a customer service assistant..."}, {"role": "user", "content": "I need a refund for ORD-12345, it arrived broken"}, {"role": "assistant", "tool_calls": [{"function": {"name": "initiate_refund", "arguments": "{\"order_id\": \"ORD-12345\", \"reason\": \"defective\"}"}}]}]}
    {"messages": [{"role": "system", "content": "You are a customer service assistant..."}, {"role": "user", "content": "What are your business hours?"}, {"role": "assistant", "content": "We're available Monday through Friday, 9am to 6pm EST. How can I help you today?"}]}
    

    Key formatting rules:

    • One JSON object per line. No pretty-printing, no line breaks within an object.
    • Arguments are a JSON string, not a nested object. Notice "arguments": "{\"order_id\": \"ORD-48291\"}" — the value is a string containing JSON, not a raw JSON object. This matches the OpenAI function-calling format that most fine-tuning frameworks expect.
    • System message is identical across all examples. Include the full tool schema in every system message. Do not abbreviate.
    • Consistent key ordering. Always role before content or tool_calls.
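    Writing the file with json.dumps per example guarantees one object per line, since json.dumps never emits raw newlines. A minimal sketch (file name is arbitrary):

```python
import json

def write_jsonl(examples, path):
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            # json.dumps produces a single line per object -- no pretty-printing
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

examples = [
    {"messages": [{"role": "user", "content": "Find my order ORD-48291"}]},
    {"messages": [{"role": "user", "content": "What are your business hours?"}]},
]
write_jsonl(examples, "train.jsonl")
```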

    Quality Checks Before Training

    Run these validations on your dataset before starting a fine-tune:

    Schema Compliance

    Every tool_calls entry must reference a function name that exists in your tool schema. Every arguments string must parse as valid JSON and match the parameter types defined in the schema.

    import json

    def validate_example(example, valid_tools):
        """Check every tool call in one example against the declared schemas.

        `valid_tools` maps tool name -> its "parameters" JSON schema object.
        """
        for msg in example["messages"]:
            for call in msg.get("tool_calls", []):
                name = call["function"]["name"]
                assert name in valid_tools, f"Unknown tool: {name}"
                # Arguments must be a JSON *string* that parses cleanly
                args = json.loads(call["function"]["arguments"])
                schema = valid_tools[name]
                for field in schema.get("required", []):
                    assert field in args, f"{name}: missing required '{field}'"
                for key in args:
                    assert key in schema["properties"], f"{name}: unexpected argument '{key}'"
    

    Parameter Validation

    Check that generated parameter values are realistic:

    • Order IDs follow the ORD-XXXXX format
    • Email addresses are syntactically valid
    • Enum values match the allowed set
    • Required parameters are always present
    • Numeric amounts are reasonable (not negative, not $999,999)
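    These checks are easy to script. The sketch below assumes the ORD-XXXXX format from the schemas and uses a deliberately loose email pattern (enough to catch obviously malformed synthetic values, not a full RFC check); the amount bounds are illustrative:

```python
import re

ORDER_ID_RE = re.compile(r"^ORD-\d{5}$")
# Loose email check -- catches malformed synthetic values, not RFC-complete
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
REFUND_REASONS = {"defective", "wrong_item", "not_received", "changed_mind", "other"}

def check_refund_args(args):
    errors = []
    if not ORDER_ID_RE.match(args.get("order_id", "")):
        errors.append("bad order_id")
    if args.get("reason") not in REFUND_REASONS:
        errors.append("bad reason")
    amount = args.get("amount")
    # Amount is optional (omitted for full refunds); bounds are illustrative
    if amount is not None and not (0 < amount < 10000):
        errors.append("unreasonable amount")
    return errors
```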

    Balance Check

    Count examples per tool. If lookup_order has 150 examples but escalate_to_human has 30, the model will be biased toward lookup. Aim for roughly equal distribution, with a tolerance of +/-30%.

    Tool                  Examples    % of Dataset
    lookup_order          100         16%
    check_status          100         16%
    initiate_refund       100         16%
    update_address        100         16%
    escalate_to_human     100         16%
    No tool (negative)    125         20%
    Total                 625         100%
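    A Counter over the dataset surfaces imbalance quickly. In this sketch, an example whose assistant turns contain no tool_calls counts as a negative:

```python
from collections import Counter

def tool_distribution(examples):
    counts = Counter()
    for ex in examples:
        called = None
        for msg in ex["messages"]:
            # The last tool call in the conversation decides the bucket;
            # no tool call at all means a negative (conversational) example
            for call in msg.get("tool_calls", []):
                called = call["function"]["name"]
        counts[called or "no_tool"] += 1
    return counts

sample = [
    {"messages": [{"role": "assistant",
                   "tool_calls": [{"function": {"name": "lookup_order",
                                                "arguments": "{}"}}]}]},
    {"messages": [{"role": "assistant", "content": "Happy to help!"}]},
]
dist = tool_distribution(sample)
```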

    Edge Case Coverage

    Audit for these specific cases:

    • Missing optional parameters: Does the model handle it when a user provides an email but no order ID?
    • Multiple possible tools: "I want to check on my refund" — is this check_status or something else?
    • Parameter extraction from natural language: "I moved to 742 Evergreen Terrace, Springfield" — can the model extract the address?
    • Conversational context: "Yes, go ahead" following a refund confirmation — does the model handle follow-up turns?

    Volume Guidelines

    How much data do you actually need? It depends on the complexity:

    Simple tool routing (classification only): 50-100 examples per tool. The model just needs to learn which tool maps to which intent. Parameter extraction is straightforward.

    Complex parameter extraction: 200+ examples per tool. When parameters require parsing natural language (addresses, dates, product descriptions), you need more examples showing the extraction patterns.

    Multi-turn conversations: 300+ examples per tool. If the model needs to handle follow-up messages, clarification requests, and context from previous turns, the dataset grows significantly.

    For our 5-tool customer service agent, 100 examples per tool plus 125 negatives (625 total) is a solid starting point. You can fine-tune a 7B model on this in under 30 minutes on a single GPU.

    Multi-Turn Sequences

    Real customer interactions are not single-turn. A customer might look up an order, then check its status, then request a refund — all in one conversation. Your dataset should include multi-turn sequences:

    {
      "messages": [
        {"role": "system", "content": "You are a customer service assistant..."},
        {"role": "user", "content": "Can you find order ORD-48291?"},
        {"role": "assistant", "tool_calls": [{"function": {"name": "lookup_order", "arguments": "{\"order_id\": \"ORD-48291\"}"}}]},
        {"role": "tool", "content": "{\"order_id\": \"ORD-48291\", \"status\": \"delivered\", \"items\": [\"Blue Widget\"], \"total\": 34.99}"},
        {"role": "assistant", "content": "I found your order ORD-48291. It shows as delivered. You ordered a Blue Widget for $34.99. How can I help with this order?"},
        {"role": "user", "content": "The widget arrived damaged. I want my money back."},
        {"role": "assistant", "tool_calls": [{"function": {"name": "initiate_refund", "arguments": "{\"order_id\": \"ORD-48291\", \"reason\": \"defective\", \"amount\": 34.99}"}}]}
      ]
    }
    

    Include 20-30 multi-turn sequences in your dataset. They teach the model to maintain context and use information from previous tool results when making subsequent calls.

    Common Mistakes to Avoid

    Mistake 1: Copy-pasting the same sentence structure. "Please refund ORD-11111," "Please refund ORD-22222," "Please refund ORD-33333" — this teaches the model one pattern, not the concept. Vary the phrasing.

    Mistake 2: Perfect grammar in every example. Real users type "wheres my order," "plz refund," "i need to chnage my adress." Include these in your training data.

    Mistake 3: No ambiguous examples. The boundary between "checking status" and "looking up an order" is fuzzy. Include examples that probe this boundary, with consistent labeling decisions.

    Mistake 4: Skipping the system message. Some people train without a system message, then add one at inference time. The model has never seen that system message before. Train with the exact system prompt you will use in production.

    Mistake 5: Not validating the JSONL. A single malformed line will crash your training run 45 minutes in. Validate every line parses as JSON before you start.
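    A pre-flight parse catches malformed lines before the training run does. This sketch writes a small sample file (file name and contents are illustrative) and reports the offending line numbers:

```python
import json

def validate_jsonl(path):
    """Return line numbers that fail to parse or lack a 'messages' key."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue  # tolerate trailing blank lines
            try:
                obj = json.loads(line)
                assert "messages" in obj
            except (json.JSONDecodeError, AssertionError):
                bad.append(lineno)
    return bad

# Illustrative sample: two valid lines, one malformed in the middle
with open("sample.jsonl", "w", encoding="utf-8") as f:
    f.write('{"messages": []}\n')
    f.write('not json at all\n')
    f.write('{"messages": []}\n')

bad_lines = validate_jsonl("sample.jsonl")
```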


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    Putting It All Together

    Here is the complete process, start to finish:

    1. Document all tool schemas with precise descriptions and parameter types
    2. Write 10-20 seed examples per tool by hand, based on real user patterns
    3. Use a frontier model to expand to 50-100+ variations per tool
    4. Add 20% negative examples, especially near-miss messages
    5. Format as JSONL with consistent structure
    6. Validate schema compliance, parameter correctness, and balance
    7. Include 20-30 multi-turn conversation sequences
    8. Run a final deduplication pass
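    For step 8, an exact-duplicate pass can be as simple as keying on the serialized conversation; the helper name is illustrative, and near-duplicates (same template, different order IDs) would need fuzzier matching than this:

```python
import json

def dedupe(examples):
    # Exact-duplicate pass: key on the full serialized conversation.
    # sort_keys makes the key insensitive to dict key ordering.
    seen = set()
    unique = []
    for ex in examples:
        key = json.dumps(ex["messages"], sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

data = [
    {"messages": [{"role": "user", "content": "Find order ORD-90145"}]},
    {"messages": [{"role": "user", "content": "Find order ORD-90145"}]},
    {"messages": [{"role": "user", "content": "Get me a supervisor"}]},
]
deduped = dedupe(data)
```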

    Total time: 4-8 hours for a 5-tool agent. Total examples: 500-750. Training time on a 7B model with LoRA: 20-40 minutes.

    The result is a model that routes to the correct tool 90%+ of the time, generates valid parameters consistently, and knows when NOT to call any tool. All running locally, at zero per-query cost.

    The dataset is the model. Build it right.
