
    Preparing Tool-Calling Datasets for Enterprise AI Agents: An On-Premise Workflow

    AI agents need tool-calling training data to reliably select and invoke the right tools. Here's how to prepare function-calling datasets from enterprise documents — entirely on-premise.

Ertas Team

    Most enterprise AI agent projects stall at the same point: the agent can hold a conversation, but it cannot reliably call the right internal tool with the right arguments at the right time. The root cause is almost always the same — the model was never trained on tool-calling data that reflects the organization's actual tools.

    Prompting alone does not solve this. You can stuff a system prompt with function definitions and few-shot examples, but once you have 40+ internal tools with overlapping capabilities, prompt-based approaches hit a ceiling. The model confuses similar tools, hallucinates arguments, or defaults to the most common tool regardless of context.

    The fix is straightforward: fine-tune the model on tool-calling data specific to your environment. The challenge is that this data does not exist until you create it — and for enterprises, the creation process must happen entirely on-premise because the tool definitions themselves are sensitive.

    Why Agents Need Tool-Calling Training Data

    An AI agent's ability to select and invoke tools is a learned behavior, not an emergent one. Base models and even instruction-tuned models have general tool-calling capabilities, but these are trained on public API schemas — think weather APIs, search engines, calculator functions. Enterprise tools look nothing like this.

    A typical enterprise has internal APIs for CRM data retrieval, ERP transactions, document management, compliance checks, approval workflows, and dozens of domain-specific operations. Each tool has specific argument formats, required authentication contexts, and business rules about when it should or should not be invoked.

    When you fine-tune on enterprise-specific tool-calling examples, three things improve measurably. First, tool selection accuracy jumps from 60-70% (prompt-only) to 90-95% (fine-tuned) on internal benchmarks. Second, argument formatting errors drop by 80% or more because the model learns the exact schema. Third, the model learns negative examples — when not to call a tool — which reduces unnecessary API calls and associated costs.

    The Tool-Calling Dataset Format

    Every tool-calling training example has the same structure, regardless of the model family you are fine-tuning. The format consists of five components:

    System prompt: Defines the agent's role and general instructions. Keep this consistent across examples.

    Function definitions: JSON schema descriptions of available tools — name, description, parameters (with types and constraints), and required fields.

    User query: A natural-language request that should trigger a specific tool call.

    Expected function call: The tool name the model should select.

    Expected arguments: The exact JSON arguments the model should pass to the tool.

    A single training example looks like this in practice:

    {
      "messages": [
        {
          "role": "system",
          "content": "You are an enterprise assistant with access to internal tools."
        },
        {
          "role": "user",
          "content": "Pull the Q3 revenue numbers for the EMEA region."
        }
      ],
      "tools": [
        {
          "name": "query_financial_report",
          "description": "Retrieves financial metrics by quarter, region, and metric type.",
          "parameters": {
            "quarter": "string (Q1-Q4)",
            "year": "integer",
            "region": "string",
            "metric": "string"
          }
        }
      ],
      "expected_call": {
        "name": "query_financial_report",
        "arguments": {
          "quarter": "Q3",
          "year": 2026,
          "region": "EMEA",
          "metric": "revenue"
        }
      }
    }
    

    You need 50-200 examples per tool for reliable fine-tuning. For a system with 30 tools, that means 1,500-6,000 total examples. This sounds like a lot, but the preparation pipeline makes it manageable.

    Common Source Documents

    Enterprise tool-calling datasets are built from documents that already exist in most organizations:

    API documentation: OpenAPI/Swagger specs are the richest source. They contain endpoint definitions, parameter schemas, and often example requests. If your internal APIs have specs, you are already 40% done.

    Internal wikis and runbooks: These contain natural-language descriptions of when to use which tool and how. They are the source for user query variations — the way real users describe tasks.

    Workflow definitions: BPMN diagrams, Jira workflows, approval chains. These encode multi-step tool-calling sequences where Tool A's output feeds into Tool B's input.

    Standard operating procedures (SOPs): Step-by-step instructions that map directly to tool-calling chains. "First, look up the customer in CRM. Then check their credit status. Then create the order." Each step is a tool call.

    Support tickets and chat logs: Real user queries that triggered manual tool usage. These are gold for generating realistic query variations because they capture how people actually phrase requests — typos, abbreviations, and all.

    The Preparation Pipeline

    The end-to-end pipeline has five stages. Each stage can run on-premise with no external dependencies.

    Stage 1: Extract Tool Definitions from API Specs

    Parse OpenAPI/Swagger specs to extract function schemas. For each endpoint, capture the name, description, HTTP method, path parameters, query parameters, request body schema, and response schema. Convert these into the tool-calling JSON format your target model expects.

    If you have 30 internal APIs with an average of 8 endpoints each, this stage produces 240 raw tool definitions. Not all of these will be relevant to your agent — filter to the subset the agent should actually use.
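A rough sketch of this stage is shown below. It walks an OpenAPI spec and flattens each operation into a simple tool definition; it assumes PyYAML is installed, collapses parameter schemas more aggressively than a production parser would, and the spec filename is hypothetical.

# Minimal sketch: convert OpenAPI/Swagger endpoints into raw tool definitions.
# Assumes PyYAML is installed; the spec filename below is hypothetical.
import yaml

def spec_to_tools(spec_path):
    with open(spec_path) as f:
        spec = yaml.safe_load(f)

    tools = []
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            if method not in ("get", "post", "put", "delete", "patch"):
                continue
            # Collect path and query parameters into a flat name -> type map.
            params = {}
            for p in op.get("parameters", []):
                params[p["name"]] = p.get("schema", {}).get("type", "string")
            tools.append({
                "name": op.get("operationId", f"{method}_{path.strip('/').replace('/', '_')}"),
                "description": op.get("summary", op.get("description", "")),
                "parameters": params,
            })
    return tools

tools = spec_to_tools("internal_crm_openapi.yaml")
print(f"Extracted {len(tools)} raw tool definitions")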

    Stage 2: Generate User Query Variations

    For each tool, create 50-200 natural-language queries that should trigger it. Start with the queries from your wiki documentation and SOPs. Then use a local LLM (Llama 3 70B via Ollama works well) to generate variations.

    The key is diversity: formal queries ("Retrieve the customer account status for account ID 44891"), informal queries ("what's the status on account 44891"), ambiguous queries ("check that account"), and queries that require argument inference ("pull last quarter's EMEA numbers" — the model must infer the current year).
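A minimal sketch of the generation step, assuming an Ollama server is running locally on its default port with a Llama 3 model pulled. The prompt wording and post-processing are illustrative, and every generated query still goes through human review.

# Minimal sketch of Stage 2: generate query variations with a local model via Ollama.
# Assumes Ollama is serving on localhost:11434; tool and seed queries come from Stage 1 and your SOPs.
import requests

def generate_variations(tool, seed_queries, n=50, model="llama3:70b"):
    prompt = (
        f"Tool: {tool['name']}\nDescription: {tool['description']}\n"
        "Seed queries:\n" + "\n".join(f"- {q}" for q in seed_queries) +
        f"\n\nWrite {n} new user queries, one per line, that should trigger this tool. "
        "Vary formality, phrasing, and level of detail."
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    text = resp.json()["response"]
    # Keep non-empty lines; quality filtering and human review happen afterwards.
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]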

    Stage 3: Create Expected Call/Response Pairs

    For each query-tool combination, define the expected function call and arguments. This is the most labor-intensive stage and requires domain expertise. The person creating these examples must know the correct tool and the correct argument mapping.

This is where domain experts earn their keep. An engineer who has used these APIs daily for three years can produce 200 examples in a day. Someone unfamiliar with the APIs will produce errors that propagate through the fine-tuned model.
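One lightweight way to capture this work is to have the expert fill in a spreadsheet of (query, tool, arguments) rows and convert it mechanically into training examples. The CSV column names below are an assumption, not a required format.

# Minimal sketch of Stage 3: turn a domain expert's annotation sheet into training examples.
# Assumes a CSV with columns "query", "tool", and "arguments" (a JSON string).
import csv, json

def load_expert_annotations(csv_path, tool_index):
    examples = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            examples.append({
                "messages": [
                    {"role": "system", "content": "You are an enterprise assistant with access to internal tools."},
                    {"role": "user", "content": row["query"]},
                ],
                "tools": [tool_index[row["tool"]]],
                "expected_call": {
                    "name": row["tool"],
                    "arguments": json.loads(row["arguments"]),
                },
            })
    return examples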

    Stage 4: Validate and Deduplicate

    Run automated validation: Do the arguments match the tool's schema? Are required fields present? Are enum values valid? Then check for duplicates — similar queries with identical expected calls add noise without adding learning signal.

    Also validate negative examples: include queries that should NOT trigger any tool call. "What's the weather like?" should not invoke your internal CRM API. Models need to learn restraint.
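A minimal validation-and-deduplication sketch, matching the simplified parameter schema used in the earlier example; real schemas will also need type and enum checks.

# Minimal sketch of Stage 4: schema validation plus near-duplicate removal.
import json

def validate_example(example, tool_index):
    call = example.get("expected_call")
    if call is None:          # negative example: no tool call expected
        return True
    tool = tool_index.get(call["name"])
    if tool is None:
        return False          # references a tool that does not exist
    # Every argument must be a declared parameter of the tool.
    return all(arg in tool["parameters"] for arg in call["arguments"])

def deduplicate(examples):
    seen, unique = set(), []
    for ex in examples:
        user = next(m["content"] for m in ex["messages"] if m["role"] == "user")
        call = ex.get("expected_call") or {}
        key = (user.lower().strip(), call.get("name"),
               json.dumps(call.get("arguments", {}), sort_keys=True))
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique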

    Stage 5: Export as JSONL

    Format the validated examples as JSONL in the schema your fine-tuning framework expects. For most open-source models (Llama, Mistral, Qwen), this is the ChatML format with tool-calling extensions. Export both a training set (80%) and a validation set (20%).
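A minimal export sketch with a seeded shuffle and an 80/20 split; adapt the record layout to whatever your fine-tuning framework expects.

# Minimal sketch of Stage 5: shuffle, split 80/20, and write JSONL files.
import json, random

def export_jsonl(examples, train_path="train.jsonl", val_path="val.jsonl", seed=42):
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * 0.8)
    for path, subset in ((train_path, examples[:cut]), (val_path, examples[cut:])):
        with open(path, "w") as f:
            for ex in subset:
                f.write(json.dumps(ex, ensure_ascii=False) + "\n")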

    On-Premise Requirements

    Tool-calling datasets are among the most sensitive data preparation artifacts an enterprise produces. Here is why.

    The tool definitions themselves reveal internal API architecture. An attacker with your tool-calling dataset knows every internal endpoint, every parameter schema, and every business rule encoded in the tool descriptions. This is a blueprint for the organization's operational infrastructure.

    The training examples reveal usage patterns — which queries map to which tools, what arguments are common, and how different systems connect. This is operational intelligence that competitors or adversaries would find valuable.

    For regulated industries, the situation is more acute. A financial institution's tool-calling dataset reveals exactly how trades are executed, how compliance checks are performed, and what approval workflows exist. Sending this data to a cloud LLM provider for synthetic augmentation is a compliance violation in most frameworks.

    The entire pipeline — from API spec parsing to synthetic query generation to JSONL export — must run on infrastructure the organization controls.

    Using Local LLMs for Synthetic Augmentation

    The query generation stage benefits enormously from LLM assistance. A local model can generate 10 query variations per minute, each phrased differently enough to add training value.

    Run Ollama with a capable model (Llama 3.3 70B or Qwen 2.5 72B) on your on-premise GPU infrastructure. Feed it a tool definition and 5 seed queries written by domain experts, then ask for 50 variations. Review the output — expect to keep 70-80% after quality filtering.

    The augmentation prompt matters. Specify the domain, the user persona (junior analyst, senior trader, customer support agent), and the formality level. This produces diverse queries that cover the range of how different people in the organization would phrase requests.
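One way to structure such a prompt is sketched below; the domain, personas, and wording are illustrative rather than prescriptive.

# One possible augmentation prompt template; all field values are illustrative.
AUGMENT_PROMPT = """You generate realistic user requests for an internal {domain} assistant.
Persona: {persona}. Formality: {formality}.

Tool: {tool_name}
Description: {tool_description}

Seed queries:
{seed_queries}

Write {n} new queries this persona might type, one per line.
Include abbreviations, partial information, and occasional typos where natural."""

prompt = AUGMENT_PROMPT.format(
    domain="finance",
    persona="junior analyst",
    formality="informal",
    tool_name="query_financial_report",
    tool_description="Retrieves financial metrics by quarter, region, and metric type.",
    seed_queries="- Pull the Q3 revenue numbers for the EMEA region.",
    n=50,
)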

    For a dataset of 5,000 examples, synthetic augmentation reduces the domain expert effort from approximately 200 hours to approximately 50 hours. The expert writes seed examples and reviews generated ones rather than writing every example from scratch.

    Quality Validation

    Two quality checks determine whether your tool-calling dataset is production-ready.

    Coverage testing: Does every tool have enough examples? Plot the distribution of examples per tool. If 5 tools have 200+ examples and 10 tools have fewer than 20, the model will be biased toward the well-represented tools. Minimum viable coverage is 50 examples per tool.
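A small sketch of that check, counting examples per tool and flagging anything below the 50-example floor.

# Minimal coverage check: count examples per tool, flag under-represented tools.
from collections import Counter

def coverage_report(examples, min_per_tool=50):
    counts = Counter(
        ex["expected_call"]["name"] for ex in examples if ex.get("expected_call")
    )
    for tool, n in counts.most_common():
        flag = "" if n >= min_per_tool else "  <-- below minimum"
        print(f"{tool:40s} {n:5d}{flag}")
    return counts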

    Edge case coverage: For each tool, verify you have examples covering: all required parameter combinations, optional parameter usage, parameter inference from context (e.g., inferring "current quarter" from the date), multi-tool queries that require sequential calls, and queries that are similar to the tool's purpose but should NOT trigger it.

    Run a quick validation fine-tune on 20% of the dataset. If the model achieves over 85% tool selection accuracy on held-out examples, the dataset quality is sufficient. Below 85%, look for systematic errors — usually a few poorly defined tools are responsible for most of the mistakes.
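The accuracy check itself is simple once the validation fine-tune has produced predictions; the sketch below assumes you have collected the predicted tool name for each held-out example.

# Minimal sketch of the tool selection accuracy check.
# Assumes `predictions` is a list of predicted tool names aligned with the held-out examples.
def tool_selection_accuracy(examples, predictions):
    correct = sum(
        1 for ex, pred in zip(examples, predictions)
        if (ex.get("expected_call") or {}).get("name") == pred
    )
    return correct / len(examples)

# Example usage: accuracy = tool_selection_accuracy(val_examples, predicted_tool_names)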

    Practical Considerations

    Dataset size: For a 30-tool system, aim for 3,000-6,000 examples total. Below 1,500, you will see tool confusion on less common tools. Above 10,000, you hit diminishing returns unless your tools are unusually complex.

    Update frequency: Internal APIs change. Budget for quarterly dataset updates — new endpoints, deprecated tools, modified schemas. A stale tool-calling dataset produces a model that calls endpoints that no longer exist.

    Multi-turn sequences: Real agent interactions involve chains of tool calls. Include multi-turn examples where the agent calls Tool A, processes the result, then calls Tool B. These represent 20-30% of production usage and are underrepresented in most datasets.
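A sketch of what such a multi-turn example can look like is shown below; the tool names are hypothetical, and the exact message roles used for tool calls and tool results depend on your fine-tuning framework.

# One possible shape for a multi-turn example; treat the role layout as a sketch, not a fixed schema.
multi_turn_example = {
    "messages": [
        {"role": "system", "content": "You are an enterprise assistant with access to internal tools."},
        {"role": "user", "content": "Check account 44891's credit status and create an order for 20 units of SKU-113."},
        {"role": "assistant", "tool_call": {"name": "get_credit_status", "arguments": {"account_id": "44891"}}},
        {"role": "tool", "name": "get_credit_status", "content": '{"status": "approved", "limit": 50000}'},
    ],
    "expected_call": {
        "name": "create_order",
        "arguments": {"account_id": "44891", "sku": "SKU-113", "quantity": 20},
    },
}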

    Error handling: Include examples where the correct response is to ask for clarification rather than guess at arguments. "Update the account" is too ambiguous — the model should ask which account and what update, not call the API with invented arguments.

    The tool-calling dataset is the single highest-leverage artifact in an enterprise AI agent project. Get it right and the agent works. Get it wrong and no amount of prompt engineering will compensate.

    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
