TauBench
A benchmark for evaluating tool-using language models in realistic multi-turn customer service interactions — measuring whether the model can correctly use APIs to complete user requests across a variety of domains.
TauBench evaluates language models on their ability to use tools correctly in realistic multi-turn interactions. Each task simulates a customer service scenario: the model is given a set of tool APIs (database queries, account modifications, refund processing, etc.), receives a user request that requires invoking those tools correctly, and must complete the request through a multi-turn conversation. The benchmark measures both the correctness of tool calls (right tools, right parameters, right order) and the quality of natural-language responses.
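To make the setup concrete, here is a minimal sketch of the shape of such an interaction: two hypothetical tools (get_order and process_refund are illustrative names, not TauBench's actual airline or retail APIs) plus a dispatcher that executes the model's tool calls against a toy environment whose final state is what ultimately gets graded.

```python
# Minimal sketch of a TauBench-style tool-use turn. The tool names and the
# dispatch shape are illustrative assumptions, not the benchmark's actual API.
from typing import Any

# Toy environment state; in the real benchmark this is the simulated database.
ORDERS = {"A100": {"status": "delivered", "total": 42.00, "refunded": False}}

def get_order(order_id: str) -> dict[str, Any]:
    """Read-only lookup: a careful agent calls this before mutating state."""
    return ORDERS.get(order_id, {"error": "not found"})

def process_refund(order_id: str, amount: float) -> dict[str, Any]:
    """State-mutating call: correctness is judged on the final ORDERS state."""
    order = ORDERS.get(order_id)
    if order is None or order["refunded"] or amount > order["total"]:
        return {"error": "refund rejected"}
    order["refunded"] = True
    return {"ok": True, "refunded": amount}

TOOLS = {"get_order": get_order, "process_refund": process_refund}

def execute_tool_call(name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    """Dispatch one model-issued tool call; the result is fed back into the conversation."""
    return TOOLS[name](**arguments)

# A correct trajectory for "refund my order A100" looks up state first,
# then mutates it with the right parameters, in the right order.
print(execute_tool_call("get_order", {"order_id": "A100"}))
print(execute_tool_call("process_refund", {"order_id": "A100", "amount": 42.00}))
```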
Unlike synthetic tool-use benchmarks where the model is asked to produce one function call per prompt, TauBench tests realistic agentic behavior: the model must reason about what tools to use, when to use them, how to handle edge cases, and how to respond conversationally throughout the interaction. As production deployments increasingly use LLMs as agents calling APIs, TauBench has emerged as one of the more credible evaluations of real-world agentic capability.
Each task includes a domain (airline customer service, retail, etc.), a set of available tools (Python functions with documented signatures), and a user persona with a specific request. The model interacts with the user persona through a multi-turn conversation, using tools as needed. The task is scored based on whether the final state of the simulated environment (database, account state) is correct given the user's original request, plus quality metrics on the conversation itself.
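A rough sketch of what a single task bundles together might look like the following; the Task dataclass and its field names are assumptions made for exposition, not TauBench's actual schema.

```python
# Illustrative shape of a single task: domain, available tools, a simulated
# user persona with a request, and the expected final environment state used
# for scoring. Field names are assumptions, not the benchmark's schema.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Task:
    domain: str                              # e.g. "airline", "retail"
    tools: dict[str, Callable[..., Any]]     # documented Python functions
    persona_instructions: str                # drives the simulated user
    user_request: str                        # the goal the agent must satisfy
    expected_state: dict[str, Any]           # ground-truth database/account state

task = Task(
    domain="retail",
    tools={},  # populated with the domain's API functions
    persona_instructions="You are impatient and give details only when asked.",
    user_request="Exchange order A100 for the blue variant and confirm shipping.",
    expected_state={"A100": {"variant": "blue", "status": "exchange_pending"}},
)

def task_completed(final_state: dict[str, Any], task: Task) -> bool:
    # Task completion checks only whether the right thing happened to the
    # environment, independent of how the conversation got there.
    return final_state == task.expected_state
```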
Scoring is typically reported as a per-domain pass rate plus an overall composite score. The benchmark separates 'task completion' (did the right thing happen?) from 'conversation quality' (was the model helpful and accurate in its responses?), since both matter for production deployment but are partially independent.
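As an illustration of that rollup, the sketch below aggregates hypothetical per-task results into per-domain pass rates, a composite pass rate, and a separate average conversation-quality score; the exact weighting and the quality metric are assumptions, not the benchmark's specification.

```python
# Sketch of rolling per-task results into per-domain pass rates plus a
# composite, keeping task completion and conversation quality as separate
# signals. Numbers and the quality metric are made up for illustration.
from statistics import mean

# Each entry: (domain, task_completed, conversation_quality in [0, 1])
results = [
    ("airline", True, 0.90),
    ("airline", False, 0.70),
    ("retail", True, 0.80),
    ("retail", True, 0.95),
]

def per_domain_pass_rate(results):
    by_domain: dict[str, list[bool]] = {}
    for domain, completed, _quality in results:
        by_domain.setdefault(domain, []).append(completed)
    return {domain: sum(v) / len(v) for domain, v in by_domain.items()}

pass_rates = per_domain_pass_rate(results)
composite_pass = mean(pass_rates.values())       # task completion signal
avg_quality = mean(q for _, _, q in results)     # conversation quality signal

print(pass_rates)                    # {'airline': 0.5, 'retail': 1.0}
print(composite_pass, avg_quality)   # 0.75 0.8375
```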
TauBench scores are a meaningful signal of real-world agentic capability. Models that score well on TauBench tend to handle production tool-use deployments more reliably than models that score well only on synthetic benchmarks. Scores in the 70%+ range indicate a model capable of handling production customer-service-style agentic workflows; below 50% indicates substantial human-review requirements. As of April 2026, the open-weight leaders on TauBench include Kimi K2.6 (with the Agent Swarm runtime providing structural advantages on multi-turn tasks) and DeepSeek V4. Frontier proprietary models still lead on TauBench overall, but the gap with the top open-weight models is narrowing.