
Fine-Tuning for Tool Calling: How to Build Reliable AI Agents with Small Models
Generic models are unreliable at tool calling — hallucinated function names, wrong parameters, format errors. Fine-tuning a small model on your specific tool schema produces 90%+ accuracy at zero per-query cost. Here's how.
AI agents are everywhere in 2026. Every automation platform, every no-code builder, every workflow tool has an "AI agent" feature. And almost all of them work the same way: send a user's message to GPT-4, ask it which tool to call, parse the structured response, execute the tool.
The problem: this pattern is expensive, unreliable at the margins, and entirely dependent on a cloud API.
The solution: fine-tune a small model on your specific tool schema. You get more reliable tool selection, consistent structured output, and zero per-query cost.
What Tool Calling Actually Requires
When an AI agent "calls a tool," here's what the model actually does:
1. Receives a user message + a list of available tools (with their schemas)
2. Decides which tool (if any) to call
3. Generates the function name and structured parameters (JSON)
4. Returns the tool call in a specific format
Steps 2 and 3 are the model's contribution. Step 2 is classification (which tool matches the user's intent?). Step 3 is structured output generation (produce valid JSON matching a schema).
Neither of these requires frontier-model intelligence. Classification is pattern matching. Structured output is template following. These are exactly the tasks where fine-tuned small models match or beat GPT-4.
Why Generic Models Fail at Tool Calling
Using GPT-4 or a generic open-weight model for tool calling works — most of the time. But the failure modes are specific and frustrating:
Hallucinated Function Names
The model invents tool names that don't exist in your schema. "search_database" instead of "query_db." "send_notification" instead of "notify_user." Close enough for a human to understand, wrong enough to crash your pipeline.
Wrong Parameter Types
The schema says user_id is an integer. The model returns "user_id": "42" (string). Your downstream API rejects it. The agent fails silently or retries in a loop.
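A schema validator makes this failure mode concrete. Here's a minimal sketch using the jsonschema package; the tool and its parameter schema are illustrative, not part of any real API:

import json
from jsonschema import validate, ValidationError

# Illustrative parameter schema for a hypothetical tool that looks up a user
GET_USER_PARAMS = {
    "type": "object",
    "properties": {"user_id": {"type": "integer"}},
    "required": ["user_id"],
}

model_output = '{"user_id": "42"}'  # string where the schema demands an integer

try:
    validate(instance=json.loads(model_output), schema=GET_USER_PARAMS)
except ValidationError as err:
    # '42' is not of type 'integer' -- the call never reaches the downstream API
    print(f"Rejected tool call: {err.message}")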
Unnecessary Tool Calls
The user asks a simple question that doesn't need any tool. The model calls a tool anyway — because it was told tools are available and it tries to be helpful. Now you have a wasted API call and a slower response.
Format Inconsistency
Different API calls produce different JSON structures. Sometimes the model wraps the response in markdown code blocks. Sometimes it adds explanation text around the JSON. Sometimes the key ordering changes. Each inconsistency is a parsing edge case.
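In practice, teams end up writing defensive extraction code just to cope. A minimal sketch of that kind of cleanup (the function name is illustrative):

import json
import re

def extract_tool_call(raw: str) -> dict:
    """Best-effort recovery of a JSON tool call from messy model output."""
    # Grab the first {...} span, ignoring markdown fences or prose around it
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))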
Cost at Scale
Every tool-calling decision costs money. GPT-4o tool calling: $2.50-$10 per million tokens. For an agent handling 10,000 decisions per day, at roughly a hundred tokens per decision, that's $75-$300/month just for the "routing" step — before any actual work gets done.
How Fine-Tuning Fixes This
Fine-tuning teaches the model your specific tools, your specific schemas, and your specific calling patterns. The result:
| Metric | Generic GPT-4 | Fine-Tuned 8B |
|---|---|---|
| Tool selection accuracy | 85-90% | 92-97% |
| Parameter format compliance | 90-95% | 98-99% |
| Hallucinated tool names | Occasional | Near-zero |
| Unnecessary tool calls | ~5-10% | <1% |
| Cost per 1M tokens | $2.50-$10 | ~$0 (self-hosted) |
The accuracy improvement comes from specificity. A generic model handles thousands of possible tool schemas. Your fine-tuned model handles your 5-20 tools. It doesn't need to generalize — it needs to be right about your specific tools, every time.
Dataset Format for Tool-Calling Fine-Tuning
The training data follows the conversation format, with tool calls as a special message type. Here's the structure:
Example 1: Simple tool call
{
  "messages": [
    {
      "role": "system",
      "content": "You are an assistant with access to the following tools:\n\n- get_order_status(order_id: string): Returns the current status of an order\n- search_products(query: string, category?: string): Searches the product catalog\n- create_support_ticket(subject: string, description: string, priority: 'low' | 'medium' | 'high'): Creates a support ticket"
    },
    {
      "role": "user",
      "content": "What's the status of my order #ORD-2847?"
    },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "function": {
            "name": "get_order_status",
            "arguments": "{\"order_id\": \"ORD-2847\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "{\"status\": \"shipped\", \"tracking\": \"1Z999AA10123456784\", \"eta\": \"2026-02-24\"}"
    },
    {
      "role": "assistant",
      "content": "Your order #ORD-2847 has been shipped! The tracking number is 1Z999AA10123456784 and the estimated delivery date is February 24, 2026."
    }
  ]
}
Example 2: No tool needed
{
  "messages": [
    {
      "role": "system",
      "content": "You are an assistant with access to the following tools: [same tools]"
    },
    {
      "role": "user",
      "content": "What's your return policy?"
    },
    {
      "role": "assistant",
      "content": "Our return policy allows returns within 30 days of purchase for a full refund. Items must be in their original condition. Would you like me to help you start a return?"
    }
  ]
}
Including examples where the model should NOT call a tool is critical. Without these, the model learns to always call something.
Example 3: Multi-tool call
{
  "messages": [
    {
      "role": "system",
      "content": "You are an assistant with access to the following tools: [same tools]"
    },
    {
      "role": "user",
      "content": "I got the wrong item in order #ORD-2847. I need to file a complaint."
    },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "function": {
            "name": "get_order_status",
            "arguments": "{\"order_id\": \"ORD-2847\"}"
          }
        },
        {
          "function": {
            "name": "create_support_ticket",
            "arguments": "{\"subject\": \"Wrong item received - ORD-2847\", \"description\": \"Customer reports receiving the wrong item in order ORD-2847.\", \"priority\": \"high\"}"
          }
        }
      ]
    }
  ]
}
Building Your Training Dataset
Step 1: Define Your Tool Schema
List every tool your agent can call. For each tool, define:
- Function name (exactly as your code expects)
- Parameters (name, type, required/optional)
- Return type
- When to use it (natural language description)
Start small. 5-10 tools is the sweet spot for a first fine-tuning run. You can expand later.
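One way to keep this step honest is to write the schema down as data that both your dataset generator and your runtime read, so names and types can never drift. A sketch using the three tools from the examples above (the exact structure is illustrative, not a required format):

# Single source of truth for the agent's tools (names must match your code exactly)
TOOLS = [
    {
        "name": "get_order_status",
        "description": "Returns the current status of an order. Use when the user asks where an order is.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "search_products",
        "description": "Searches the product catalog by free-text query, optionally filtered by category.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "category": {"type": "string"},
            },
            "required": ["query"],
        },
    },
    {
        "name": "create_support_ticket",
        "description": "Creates a support ticket. Use for complaints, refunds, or anything needing human follow-up.",
        "parameters": {
            "type": "object",
            "properties": {
                "subject": {"type": "string"},
                "description": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            },
            "required": ["subject", "description", "priority"],
        },
    },
]

The same structure can be rendered into the system prompt, so the tool descriptions the model trains on and the ones it sees in production never diverge.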
Step 2: Generate Training Examples
For each tool, create 20-40 examples covering:
- Clear-cut calls: User intent obviously maps to one tool
- Ambiguous cases: User intent could map to multiple tools (train the model on your preferred routing)
- No-tool cases: Questions the model should answer directly without calling any tool
- Multi-tool cases: Scenarios requiring sequential or parallel tool calls
- Edge cases: Unusual phrasing, missing information (model should ask for clarification), invalid requests
200-500 total examples is typically sufficient for a 5-10 tool agent. Quality matters more than quantity — see our guide on data quality over quantity.
Step 3: Format as JSONL
Convert your examples to the conversation format shown above, one JSON object per line. This is the standard format for fine-tuning on Ertas and most other platforms.
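If your examples live in code or a spreadsheet export, the conversion is a few lines of Python. A minimal sketch (file and variable names are illustrative):

import json

SYSTEM_PROMPT = (
    "You are an assistant with access to the following tools:\n\n"
    "- get_order_status(order_id: string): Returns the current status of an order\n"
    "- search_products(query: string, category?: string): Searches the product catalog\n"
    "- create_support_ticket(subject: string, description: string, "
    "priority: 'low' | 'medium' | 'high'): Creates a support ticket"
)

# Each example is a full conversation in the format shown above
examples = [
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What's the status of my order #ORD-2847?"},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "function": {
                        "name": "get_order_status",
                        "arguments": json.dumps({"order_id": "ORD-2847"}),
                    }
                }
            ],
        },
    ],
    # ... remaining examples
]

with open("tool_calling_train.jsonl", "w") as f:
    for messages in examples:
        # One JSON object per line, each wrapping a complete conversation
        f.write(json.dumps({"messages": messages}) + "\n")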
Step 4: Fine-Tune
Upload your JSONL dataset to Ertas, select a base model (Llama 3.1 8B Instruct or Qwen 2.5 7B Instruct are good choices — use the Instruct variant, not Base), configure training, and run.
Use the Instruct variant specifically because tool calling requires mixing structured output with natural language responses. Base models struggle with this combination. See research on this topic.
Step 5: Evaluate
Test against a held-out set of 50-100 examples. Measure:
- Tool selection accuracy: Did it pick the right tool?
- Parameter correctness: Are all parameters present, correct type, correct value?
- No-call accuracy: Did it correctly avoid calling a tool when none was needed?
- Format compliance: Is the output valid, parseable JSON every time?
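A small harness can compute all four metrics from a held-out JSONL file in the same training-data format. Here's a sketch; it assumes each held-out example ends with the assistant turn being scored, and a call_model() wrapper (yours to write) that takes the conversation so far and returns an assistant message dict in the same format:

import json

def evaluate(holdout_path, call_model):
    """Score tool selection, parameter correctness, no-call accuracy, and format compliance."""
    totals = {"tool": [], "params": [], "no_call": [], "format": []}
    with open(holdout_path) as f:
        for line in f:
            messages = json.loads(line)["messages"]
            gold, prompt = messages[-1], messages[:-1]
            pred = call_model(prompt)  # your inference wrapper (assumed)

            gold_calls = gold.get("tool_calls") or []
            pred_calls = pred.get("tool_calls") or []

            if not gold_calls:
                # No-call accuracy: correct only if the model also stayed quiet
                totals["no_call"].append(not pred_calls)
                continue
            if not pred_calls:
                totals["tool"].append(False)
                totals["params"].append(False)
                continue

            g, p = gold_calls[0]["function"], pred_calls[0]["function"]
            totals["tool"].append(p["name"] == g["name"])
            try:
                args_match = json.loads(p["arguments"]) == json.loads(g["arguments"])
                totals["format"].append(True)
            except (json.JSONDecodeError, TypeError):
                args_match = False
                totals["format"].append(False)
            totals["params"].append(args_match)

    return {metric: sum(hits) / len(hits) for metric, hits in totals.items() if hits}

Run it before and after fine-tuning so the gains in the comparison table above are measured on your tools rather than a generic benchmark.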
Deploying Your Tool-Calling Model
Local Inference with Ollama
Export your fine-tuned model as GGUF, import into Ollama, and serve via its API. Ollama's OpenAI-compatible endpoint means your existing agent framework (LangChain, CrewAI, etc.) works with a one-line URL change.
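That one-line change looks like this with the OpenAI Python client. The model name is whatever you registered when importing the GGUF into Ollama, shown here as a placeholder:

from openai import OpenAI

# Point the standard OpenAI client at Ollama's OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored

response = client.chat.completions.create(
    model="my-tool-agent",  # placeholder: the name you gave your imported model
    messages=[
        {"role": "system", "content": "You are an assistant with access to the following tools: ..."},
        {"role": "user", "content": "What's the status of my order #ORD-2847?"},
    ],
)

print(response.choices[0].message)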
Integration with n8n
For n8n workflows, replace the OpenAI node with an Ollama node pointing to your fine-tuned model. The workflow logic stays identical — only the model endpoint changes. API cost drops from hundreds per month to zero.
Integration with Make.com
For Make.com automations, use an HTTP module to call your Ollama API endpoint. The model returns tool calls in JSON format that Make.com can parse and route to subsequent modules.
When You Still Need GPT-4 for Tool Calling
Fine-tuned small models excel at fixed tool schemas with well-defined patterns. They're not the right choice for every scenario:
- Dynamic tool discovery: If available tools change per session (plugin systems, user-configured actions), a generic model's flexibility is valuable
- Complex multi-step reasoning: If tool selection requires multi-hop reasoning ("to answer this, I need to call tool A, use its output to parameterize tool B, then..."), larger models handle the planning better
- Very large tool sets (50+): At 50+ tools, the classification problem becomes harder and may require a larger model or a two-stage routing approach
- Novel tools without training data: If you're adding tools frequently and can't retrain each time, a generic model's zero-shot capability helps bridge the gap
For most production agents with 5-20 well-defined tools, a fine-tuned 8B model is the better choice — more reliable, faster, and infinitely cheaper.
Getting Started
- List your agent's tools and their schemas
- Create 200-500 training examples (including no-tool and edge cases)
- Format as JSONL
- Fine-tune on Ertas — select Llama 3.1 8B Instruct or Qwen 2.5 7B Instruct
- Evaluate on held-out examples
- Deploy via Ollama and connect to your automation platform
- Monitor accuracy in production and retrain periodically with new examples
Your AI agent's brain doesn't need to cost $10 per million tokens. Fine-tune it once, run it locally, and never pay for tool routing again.
References:
- Weights & Biases — Fine-tuning LLMs for Function Calling
- Hugging Face — Function Calling Fine-Tuning with xLAM
- Parlance Labs — Fine-Tuning for Function Calling