
Fine-Tuning for Tool Calling: How to Build Reliable AI Agents with Small Models
Generic models are unreliable at tool calling — hallucinated function names, wrong parameters, format errors. Fine-tuning a small model on your specific tool schema produces 90%+ accuracy at zero per-query cost. Here's how.
AI agents are everywhere in 2026. Every automation platform, every no-code builder, every workflow tool has an "AI agent" feature. And almost all of them work the same way: send a user's message to GPT-4, ask it which tool to call, parse the structured response, execute the tool.
The problem: this pattern is expensive, unreliable at the margins, and entirely dependent on a cloud API.
The solution: fine-tune a small model on your specific tool schema. You get more reliable tool selection, consistent structured output, and zero per-query cost.
What Tool Calling Actually Requires
When an AI agent "calls a tool," here's what the model actually does:
1. Receives a user message + a list of available tools (with their schemas)
2. Decides which tool (if any) to call
3. Generates the function name and structured parameters (JSON)
4. Returns the tool call in a specific format
Steps 2 and 3 are the model's contribution. Step 2 is classification (which tool matches the user's intent?). Step 3 is structured output generation (produce valid JSON matching a schema).
Neither of these requires frontier-model intelligence. Classification is pattern matching. Structured output is template following. These are exactly the tasks where fine-tuned small models match or beat GPT-4.
Why Generic Models Fail at Tool Calling
Using GPT-4 or a generic open-weight model for tool calling works — most of the time. But the failure modes are specific and frustrating:
Hallucinated Function Names
The model invents tool names that don't exist in your schema. "search_database" instead of "query_db." "send_notification" instead of "notify_user." Close enough for a human to understand, wrong enough to crash your pipeline.
Wrong Parameter Types
The schema says user_id is an integer. The model returns "user_id": "42" (string). Your downstream API rejects it. The agent fails silently or retries in a loop.
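A schema validator makes this failure mode concrete. Here's a minimal sketch using the jsonschema package; the tool and its parameter schema are illustrative, not part of any real API:

import json
from jsonschema import validate, ValidationError

# Illustrative parameter schema for a hypothetical tool that looks up a user
GET_USER_PARAMS = {
    "type": "object",
    "properties": {"user_id": {"type": "integer"}},
    "required": ["user_id"],
}

model_output = '{"user_id": "42"}'  # string where the schema demands an integer

try:
    validate(instance=json.loads(model_output), schema=GET_USER_PARAMS)
except ValidationError as err:
    # '42' is not of type 'integer' -- the call never reaches the downstream API
    print(f"Rejected tool call: {err.message}")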
Unnecessary Tool Calls
The user asks a simple question that doesn't need any tool. The model calls a tool anyway — because it was told tools are available and it tries to be helpful. Now you have a wasted API call and a slower response.
Format Inconsistency
Different API calls produce different JSON structures. Sometimes the model wraps the response in markdown code blocks. Sometimes it adds explanation text around the JSON. Sometimes the key ordering changes. Each inconsistency is a parsing edge case.
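In practice, teams end up writing defensive extraction code just to cope. A minimal sketch of that kind of cleanup (the function name is illustrative):

import json
import re

def extract_tool_call(raw: str) -> dict:
    """Best-effort recovery of a JSON tool call from messy model output."""
    # Grab the first {...} span, ignoring markdown fences or prose around it
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))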
Cost at Scale
Every tool-calling decision costs money. GPT-4o tool calling: $2.50-$10 per million tokens. For an agent handling 10,000 decisions per day, at roughly a hundred tokens per decision, that's $75-$300/month just for the "routing" step — before any actual work gets done.
How Fine-Tuning Fixes This
Fine-tuning teaches the model your specific tools, your specific schemas, and your specific calling patterns. The result:
| Metric | Generic GPT-4 | Fine-Tuned 8B |
|---|---|---|
| Tool selection accuracy | 85-90% | 92-97% |
| Parameter format compliance | 90-95% | 98-99% |
| Hallucinated tool names | Occasional | Near-zero |
| Unnecessary tool calls | ~5-10% | <1% |
| Cost per 1M tokens | $2.50-$10 | ~$0 (self-hosted) |
The accuracy improvement comes from specificity. A generic model handles thousands of possible tool schemas. Your fine-tuned model handles your 5-20 tools. It doesn't need to generalize — it needs to be right about your specific tools, every time.
Dataset Format for Tool-Calling Fine-Tuning
The training data follows the conversation format, with tool calls as a special message type. Here's the structure:
Example 1: Simple tool call
{
  "messages": [
    {
      "role": "system",
      "content": "You are an assistant with access to the following tools:\n\n- get_order_status(order_id: string): Returns the current status of an order\n- search_products(query: string, category?: string): Searches the product catalog\n- create_support_ticket(subject: string, description: string, priority: 'low' | 'medium' | 'high'): Creates a support ticket"
    },
    {
      "role": "user",
      "content": "What's the status of my order #ORD-2847?"
    },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "function": {
            "name": "get_order_status",
            "arguments": "{\"order_id\": \"ORD-2847\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "{\"status\": \"shipped\", \"tracking\": \"1Z999AA10123456784\", \"eta\": \"2026-02-24\"}"
    },
    {
      "role": "assistant",
      "content": "Your order #ORD-2847 has been shipped! The tracking number is 1Z999AA10123456784 and the estimated delivery date is February 24, 2026."
    }
  ]
}
Example 2: No tool needed
{
  "messages": [
    {
      "role": "system",
      "content": "You are an assistant with access to the following tools: [same tools]"
    },
    {
      "role": "user",
      "content": "What's your return policy?"
    },
    {
      "role": "assistant",
      "content": "Our return policy allows returns within 30 days of purchase for a full refund. Items must be in their original condition. Would you like me to help you start a return?"
    }
  ]
}
Including examples where the model should NOT call a tool is critical. Without these, the model learns to always call something.
Example 3: Multi-tool call
{
  "messages": [
    {
      "role": "system",
      "content": "You are an assistant with access to the following tools: [same tools]"
    },
    {
      "role": "user",
      "content": "I got the wrong item in order #ORD-2847. I need to file a complaint."
    },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "function": {
            "name": "get_order_status",
            "arguments": "{\"order_id\": \"ORD-2847\"}"
          }
        },
        {
          "function": {
            "name": "create_support_ticket",
            "arguments": "{\"subject\": \"Wrong item received - ORD-2847\", \"description\": \"Customer reports receiving the wrong item in order ORD-2847.\", \"priority\": \"high\"}"
          }
        }
      ]
    }
  ]
}
Building Your Training Dataset
Step 1: Define Your Tool Schema
List every tool your agent can call. For each tool, define:
- Function name (exactly as your code expects)
- Parameters (name, type, required/optional)
- Return type
- When to use it (natural language description)
Start small. 5-10 tools is the sweet spot for a first fine-tuning run. You can expand later.
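One way to keep this step honest is to write the schema down as data that both your dataset generator and your runtime read, so names and types can never drift. A sketch using the three tools from the examples above (the exact structure is illustrative, not a required format):

# Single source of truth for the agent's tools (names must match your code exactly)
TOOLS = [
    {
        "name": "get_order_status",
        "description": "Returns the current status of an order. Use when the user asks where an order is.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "search_products",
        "description": "Searches the product catalog by free-text query, optionally filtered by category.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "category": {"type": "string"},
            },
            "required": ["query"],
        },
    },
    {
        "name": "create_support_ticket",
        "description": "Creates a support ticket. Use for complaints, refunds, or anything needing human follow-up.",
        "parameters": {
            "type": "object",
            "properties": {
                "subject": {"type": "string"},
                "description": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            },
            "required": ["subject", "description", "priority"],
        },
    },
]

The same structure can be rendered into the system prompt, so the tool descriptions the model trains on and the ones it sees in production never diverge.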
Step 2: Generate Training Examples
For each tool, create 20-40 examples covering:
- Clear-cut calls: User intent obviously maps to one tool
- Ambiguous cases: User intent could map to multiple tools (train the model on your preferred routing)
- No-tool cases: Questions the model should answer directly without calling any tool
- Multi-tool cases: Scenarios requiring sequential or parallel tool calls
- Edge cases: Unusual phrasing, missing information (model should ask for clarification), invalid requests
200-500 total examples is typically sufficient for a 5-10 tool agent. Quality matters more than quantity — see our guide on data quality over quantity.
Step 3: Format as JSONL
Convert your examples to the conversation format shown above, one JSON object per line. This is the standard format for fine-tuning on Ertas and most other platforms.
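If your examples live in code or a spreadsheet export, the conversion is a few lines of Python. A minimal sketch (file and variable names are illustrative):

import json

SYSTEM_PROMPT = (
    "You are an assistant with access to the following tools:\n\n"
    "- get_order_status(order_id: string): Returns the current status of an order\n"
    "- search_products(query: string, category?: string): Searches the product catalog\n"
    "- create_support_ticket(subject: string, description: string, "
    "priority: 'low' | 'medium' | 'high'): Creates a support ticket"
)

# Each example is a full conversation in the format shown above
examples = [
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What's the status of my order #ORD-2847?"},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "function": {
                        "name": "get_order_status",
                        "arguments": json.dumps({"order_id": "ORD-2847"}),
                    }
                }
            ],
        },
    ],
    # ... remaining examples
]

with open("tool_calling_train.jsonl", "w") as f:
    for messages in examples:
        # One JSON object per line, each wrapping a complete conversation
        f.write(json.dumps({"messages": messages}) + "\n")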
Step 4: Fine-Tune
Upload your JSONL dataset to Ertas, select a base model (Llama 3.1 8B Instruct or Qwen 2.5 7B Instruct are good choices — use the Instruct variant, not Base), configure training, and run.
Use the Instruct variant specifically because tool calling requires mixing structured output with natural language responses. Base models struggle with this combination. See research on this topic.
Step 5: Evaluate
Test against a held-out set of 50-100 examples. Measure:
- Tool selection accuracy: Did it pick the right tool?
- Parameter correctness: Are all parameters present, correct type, correct value?
- No-call accuracy: Did it correctly avoid calling a tool when none was needed?
- Format compliance: Is the output valid, parseable JSON every time?
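A small harness can compute all four metrics from a held-out JSONL file in the same training-data format. Here's a sketch; it assumes each held-out example ends with the assistant turn being scored, and a call_model() wrapper (yours to write) that takes the conversation so far and returns an assistant message dict in the same format:

import json

def evaluate(holdout_path, call_model):
    """Score tool selection, parameter correctness, no-call accuracy, and format compliance."""
    totals = {"tool": [], "params": [], "no_call": [], "format": []}
    with open(holdout_path) as f:
        for line in f:
            messages = json.loads(line)["messages"]
            gold, prompt = messages[-1], messages[:-1]
            pred = call_model(prompt)  # your inference wrapper (assumed)

            gold_calls = gold.get("tool_calls") or []
            pred_calls = pred.get("tool_calls") or []

            if not gold_calls:
                # No-call accuracy: correct only if the model also stayed quiet
                totals["no_call"].append(not pred_calls)
                continue
            if not pred_calls:
                totals["tool"].append(False)
                totals["params"].append(False)
                continue

            g, p = gold_calls[0]["function"], pred_calls[0]["function"]
            totals["tool"].append(p["name"] == g["name"])
            try:
                args_match = json.loads(p["arguments"]) == json.loads(g["arguments"])
                totals["format"].append(True)
            except (json.JSONDecodeError, TypeError):
                args_match = False
                totals["format"].append(False)
            totals["params"].append(args_match)

    return {metric: sum(hits) / len(hits) for metric, hits in totals.items() if hits}

Run it before and after fine-tuning so the gains in the comparison table above are measured on your tools rather than a generic benchmark.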
Deploying Your Tool-Calling Model
Local Inference with Ollama
Export your fine-tuned model as GGUF, import into Ollama, and serve via its API. Ollama's OpenAI-compatible endpoint means your existing agent framework (LangChain, CrewAI, etc.) works with a one-line URL change.
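That one-line change looks like this with the OpenAI Python client. The model name is whatever you registered when importing the GGUF into Ollama, shown here as a placeholder:

from openai import OpenAI

# Point the standard OpenAI client at Ollama's OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored

response = client.chat.completions.create(
    model="my-tool-agent",  # placeholder: the name you gave your imported model
    messages=[
        {"role": "system", "content": "You are an assistant with access to the following tools: ..."},
        {"role": "user", "content": "What's the status of my order #ORD-2847?"},
    ],
)

print(response.choices[0].message)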
Integration with n8n
For n8n workflows, replace the OpenAI node with an Ollama node pointing to your fine-tuned model. The workflow logic stays identical — only the model endpoint changes. API cost drops from hundreds per month to zero.
Integration with Make.com
For Make.com automations, use an HTTP module to call your Ollama API endpoint. The model returns tool calls in JSON format that Make.com can parse and route to subsequent modules.
When You Still Need GPT-4 for Tool Calling
Fine-tuned small models excel at fixed tool schemas with well-defined patterns. They're not the right choice for every scenario:
- Dynamic tool discovery: If available tools change per session (plugin systems, user-configured actions), a generic model's flexibility is valuable
- Complex multi-step reasoning: If tool selection requires multi-hop reasoning ("to answer this, I need to call tool A, use its output to parameterize tool B, then..."), larger models handle the planning better
- Very large tool sets (50+): At 50+ tools, the classification problem becomes harder and may require a larger model or a two-stage routing approach
- Novel tools without training data: If you're adding tools frequently and can't retrain each time, a generic model's zero-shot capability helps bridge the gap
For most production agents with 5-20 well-defined tools, a fine-tuned 8B model is the better choice — more reliable, faster, and infinitely cheaper.
Getting Started
- List your agent's tools and their schemas
- Create 200-500 training examples (including no-tool and edge cases)
- Format as JSONL
- Fine-tune on Ertas — select Llama 3.1 8B Instruct or Qwen 2.5 7B Instruct
- Evaluate on held-out examples
- Deploy via Ollama and connect to your automation platform
- Monitor accuracy in production and retrain periodically with new examples
Your AI agent's brain doesn't need to cost $10 per million tokens. Fine-tune it once, run it locally, and never pay for tool routing again.
References:
- Weights & Biases — Fine-tuning LLMs for Function Calling
- Hugging Face — Function Calling Fine-Tuning with xLAM
- Parlance Labs — Fine-Tuning for Function Calling