
On-Device Tool Calling 2026: Qwen3-4B vs Gemma 4 E4B vs Phi-4-Mini
We benchmarked the three best on-device tool-calling bases of 2026 — Qwen3-4B, Gemma 4 E4B, and Phi-4-Mini — across BFCL v4, real mobile latency, and post-fine-tune accuracy. Each wins a different scenario; here's how to pick.
Three open-weight models have separated from the pack as the credible bases for on-device tool calling in 2026: Qwen3-4B-Instruct-2507, Gemma 4 E4B (the 4B effective-parameter edge variant), and Phi-4-Mini-Instruct (3.8B). All three fit comfortably on modern phones at Q4_K_M quantization. All three handle function calling adequately out of the box and excellently after fine-tuning. All three are supported by llama.cpp's tool-call parser as of the March 2026 release.
But they're not interchangeable. Each has a distinct profile of strengths, and choosing the right base before fine-tuning saves significant time and inference cost down the line. We benchmarked all three across three dimensions that actually matter for on-device deployments — out-of-the-box BFCL v4 accuracy, real mobile latency on representative phones, and post-fine-tune accuracy on a domain-specific tool set — and the results split cleanly.
This is the practical guide to picking your starting point.
What we're comparing
Three dimensions, each calibrated to the on-device-tool-calling use case. The numbers below are illustrative ranges synthesized from public benchmarks, vendor model cards, and representative llama.cpp throughput measurements published through April–May 2026 — they are not first-party measurements from a single rig, and your own results will depend on your specific quantization, prompt template, and hardware. Treat them as a sketch of relative shape rather than precise leaderboard scores.
Out-of-the-box BFCL v4. Berkeley Function Calling Leaderboard v4 is the standard agentic-evaluation suite, refreshed in 2026 with multi-turn dialogues, parallel function calls, and held-out tool schemas. The numbers cited below reflect publicly reported scores at the time of writing; check the live leaderboard at gorilla.cs.berkeley.edu for current rankings.
Mobile latency. Approximate time-to-first-token and tokens-per-second figures for three representative devices: iPhone 14 Pro (A16 Bionic, 6 GB RAM), Pixel 8 (Tensor G3, 8 GB RAM), and a mid-range Android (Snapdragon 7 Gen 3, 6 GB RAM). Numbers assume llama.cpp's iOS and Android bindings at Q4_K_M with a 1,024-token context window and a typical 200-token tool-call output. Real device throughput varies by 10–30% based on thermal state, background load, and OS version.
Post-fine-tune accuracy on a 5-tool customer-support agent. Representative outcomes from a typical Ertas Studio QLoRA fine-tune (rank 32, three epochs) on a 600-example dataset covering five customer-support tools. The held-out evaluation covers single-call, parallel-call, and multi-turn scenarios. Your own post-fine-tune accuracy will track these ranges if your dataset is well-curated and the eval mirrors your real tool surface; numbers below 95% are usually a dataset-quality signal rather than a base-model ceiling.
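If you're building that held-out eval yourself rather than reading Ertas Studio's report, the joint-accuracy metric reduces to exact matching of tool name and arguments across every call in a scenario. A minimal sketch, assuming a hypothetical record layout with `predicted` and `gold` call lists (real harnesses often also normalize argument values and match parallel calls order-insensitively):

```python
def calls_match(pred: dict, gold: dict) -> bool:
    # A call counts as correct only if the tool name and every argument agree.
    return (pred.get("name") == gold.get("name")
            and pred.get("arguments") == gold.get("arguments"))

def joint_accuracy(examples: list[dict]) -> float:
    # Each example: {"predicted": [call, ...], "gold": [call, ...]}, where a
    # call is {"name": str, "arguments": dict}. A scenario scores only if
    # every call matches, in order; this is stricter than per-call accuracy.
    hits = sum(
        1
        for ex in examples
        if len(ex["predicted"]) == len(ex["gold"])
        and all(calls_match(p, g) for p, g in zip(ex["predicted"], ex["gold"]))
    )
    return hits / len(examples)
```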
Out-of-the-box BFCL v4 results
Approximate composite-tier rankings from publicly reported scores (illustrative — see the live leaderboard for exact figures):
| Model | Approximate composite | Notes |
|---|---|---|
| Qwen3-4B-Instruct-2507 | high 80s | Leading sub-7B base; particularly strong on parallel function calls |
| Gemma 4 E4B | mid-to-high 80s | Native function-call special tokens reduce output variance |
| Phi-4-Mini-Instruct | low-to-mid 80s | Stronger reasoning, slightly weaker raw mapping accuracy |
Qwen3-4B has held the top sub-7B spot through early 2026, and the lead is consistent with broader agentic evaluations: Qwen 3 family models have unusually strong tool-calling priors out of the box, plausibly because Alibaba's training data is heavy on agentic and function-call traces.
Gemma 4 E4B is close behind. Notably, Gemma 4's native function-call special tokens (released April 2026) give it a structural advantage over the prompt-based JSON formatting that older models rely on — when the parameter values are clean and the schema is well-formed, Gemma 4 produces them in a more reliable token sequence. The composite score doesn't fully capture this: Gemma 4 E4B has lower variance in its output structure, which matters in production even when raw accuracy is similar.
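To make that concrete, here's an illustrative side-by-side of the two output shapes. The special-token names below are invented placeholders for the sketch, not Gemma 4's actual vocabulary; the point is that delimiter tokens replace the JSON syntax the decoder would otherwise have to get right token by token:

```python
# Prompt-based JSON tool call: the model emits every brace, quote, and key
# name as ordinary text tokens, so any slip produces unparseable output.
json_style = '{"name": "get_order_status", "arguments": {"order_id": "A1042"}}'

# Native special-token call (token names are hypothetical placeholders):
# dedicated vocabulary items mark the call boundaries, leaving fewer ways
# to emit a malformed structure and fewer tokens spent on pure syntax.
token_style = "<fn_call>get_order_status<fn_args>order_id=A1042<fn_end>"
```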
Phi-4-Mini trails on raw BFCL but its profile is interesting. Its reasoning-chain quality is notably higher than that of the other two, and on multi-turn benchmarks, where the model has to plan a sequence of tool calls based on intermediate results, Phi-4-Mini's gap closes. The composite above is dominated by the single-turn and parallel-call subsets, where pure mapping accuracy matters most.
Approximate mobile latency
Indicative throughput ranges at Q4_K_M, llama.cpp bindings, 1,024-token context, ~200 output tokens. Use these for sanity-check sizing, not procurement decisions — actual numbers vary by 10–30%:
| Model | iPhone 14 Pro | Pixel 8 | Mid-range Android |
|---|---|---|---|
| Qwen3-4B-Instruct-2507 | ~30 t/s | ~22–25 t/s | ~12–15 t/s |
| Gemma 4 E4B | ~32–36 t/s | ~25–28 t/s | ~14–17 t/s |
| Phi-4-Mini-Instruct | ~35–40 t/s | ~27–30 t/s | ~16–19 t/s |
Phi-4-Mini tends to lead on raw throughput. At 3.8B parameters versus 4B, it's the smallest of the three, and the speed difference is meaningful: about 15–20% faster than Qwen3-4B and 5–10% faster than Gemma 4 E4B. For latency-sensitive flows (assistants triggered by user voice or by UI interactions), Phi-4-Mini is the right starting point if BFCL accuracy is acceptable.
Gemma 4 E4B is in the middle, with one quirk: its native function-call special tokens reduce output token count for typical tool calls by about 15–20% versus the JSON-formatted alternatives the other models produce. This means even though its raw tokens/sec is similar to Qwen3-4B, end-to-end tool-call latency is consistently lower. The ~200-token output assumption above doesn't capture this: in practice, a Gemma 4 E4B tool call is more like 160 output tokens, so real latency is meaningfully better than the table suggests.
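The back-of-envelope math makes the effect visible. Using the mid-range-Android midpoints from the table (illustrative, like everything above) and the shorter Gemma call length:

```python
# Rough end-to-end decode time: output tokens / decode throughput.
# Ignores time-to-first-token, which adds a second or two on mid-range
# hardware, so treat these as optimistic lower bounds.
scenarios = {
    "Qwen3-4B-Instruct-2507": (200, 13.5),  # JSON-style call, ~12-15 t/s
    "Gemma 4 E4B":            (160, 15.5),  # shorter native-token call, ~14-17 t/s
    "Phi-4-Mini-Instruct":    (200, 17.5),  # JSON-style call, ~16-19 t/s
}
for name, (tokens, tps) in scenarios.items():
    print(f"{name}: ~{tokens / tps:.1f}s")
# Qwen3-4B ~14.8s, Gemma 4 E4B ~10.3s, Phi-4-Mini ~11.4s
```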
For the mid-range Android tier — which is most of the global mobile install base — every second matters. Phi-4-Mini at ~12s end-to-end is acceptable for non-realtime flows; ~15s for Qwen3-4B starts feeling slow. If you're shipping to the global market, this matters.
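Before committing to a base, it's also worth checking the relative ordering on hardware you control. A desktop run through llama-cpp-python is a rough proxy only (the mobile bindings, quantization build, and thermal behavior will all shift the absolute numbers); the GGUF path and prompt below are placeholders:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: point at whichever Q4_K_M GGUF build you're comparing.
llm = Llama(model_path="qwen3-4b-instruct-2507-q4_k_m.gguf",
            n_ctx=1024, verbose=False)

prompt = "Call the right tool to check the status of order A1042."  # stand-in

start = time.perf_counter()
out = llm(prompt, max_tokens=200)
elapsed = time.perf_counter() - start

n_out = out["usage"]["completion_tokens"]
print(f"{n_out} tokens in {elapsed:.1f}s -> {n_out / elapsed:.1f} tok/s")
```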
Post-fine-tune accuracy on a 5-tool agent
After fine-tuning each base on a well-curated 600-example dataset (Ertas Studio QLoRA, rank 32, three epochs), all three typically clear the 95% joint-accuracy threshold on a held-out tool set — the practical bar for production deployment. The gap between them narrows substantially compared to the out-of-the-box scores.
In practice we see Gemma 4 E4B nudge slightly ahead of Qwen3-4B post-fine-tune, partly because its native function-call special tokens reduce variance on the parameter-value sub-score. Phi-4-Mini lands close behind, with its narrower out-of-the-box gap on parallel function calls largely closed by training-set exposure.
This is the most important shape in the analysis: fine-tuning levels the playing field. The composite spread between bases on raw BFCL collapses by roughly 70% once each base has seen a representative training set for the tool surface it will actually use. Out of the box, Qwen3-4B's lead looks decisive. After fine-tuning on representative data, the choice gets dominated by other factors: latency on your target devices, licensing, ecosystem fit, and the tooling around each.
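For readers reproducing the recipe outside Ertas Studio, it maps onto a standard open-source QLoRA setup. A minimal sketch with Hugging Face transformers, peft, and trl; the dataset filename and LoRA target modules are representative assumptions, not the exact Ertas configuration:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

BASE = "Qwen/Qwen3-4B-Instruct-2507"  # swap in whichever base you picked

# 4-bit NF4 quantization is what makes this a QLoRA (not plain LoRA) run.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto"
)

# Rank-32 adapters on attention and MLP projections; these target modules
# are a common choice for Qwen-style blocks, not a verified Ertas setting.
lora = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Hypothetical filename for the 600-example tool-calling dataset.
dataset = load_dataset("json", data_files="support_tools_600.jsonl",
                       split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="qlora-out", num_train_epochs=3,
                   per_device_train_batch_size=4),
    train_dataset=dataset,
    peft_config=lora,
)
trainer.train()
```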
How to pick
We use a four-question decision tree; a condensed code sketch follows the list.
1. What's your latency budget on the slowest target device? If you're shipping to mid-range Android globally and need sub-10-second end-to-end tool calls, Phi-4-Mini-Instruct is the right base. The 15–20% speed advantage matters and the post-fine-tune accuracy is competitive.
2. Do you need Apache 2.0 licensing? Gemma 4 E4B is Apache 2.0; Qwen3-4B is also Apache 2.0; Phi-4-Mini is MIT. All three are commercially permissive, but Gemma 4's licensing simplification (relative to Gemma 3's custom license) is significant if you've previously avoided Gemma for this reason. Gemma 4 also has the cleanest function-call output format thanks to its native special tokens.
3. Are you in a complex multi-turn agentic scenario? Phi-4-Mini's reasoning quality has an edge here. For agents that do significant planning between tool calls, Phi-4-Mini's chain-of-thought traces are noticeably cleaner. Pair this with smolagents' code-action paradigm if you can.
4. Are you in a simpler single-turn or parallel-call scenario, with the highest possible raw accuracy as the priority? Qwen3-4B-Instruct-2507 is the right base. Its BFCL v4 lead out of the box is real, the Apache 2.0 license is clean, and the Alibaba team's training methodology produces unusually consistent tool-calling priors.
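Condensed into code, the tree looks like this. It's an illustrative heuristic only: the question order encodes our priorities above, not a measured trade-off, and the parameter names are ours.

```python
def pick_base(
    needs_fast_midrange_android: bool,
    needs_native_fn_tokens: bool,
    planning_heavy_multi_turn: bool,
) -> str:
    """Illustrative encoding of the four-question decision tree above."""
    if needs_fast_midrange_android:
        return "Phi-4-Mini-Instruct"     # smallest model, 15-20% faster
    if needs_native_fn_tokens:
        return "Gemma 4 E4B"             # lowest output variance, clean format
    if planning_heavy_multi_turn:
        return "Phi-4-Mini-Instruct"     # cleanest reasoning chains
    return "Qwen3-4B-Instruct-2507"      # best raw single-turn/parallel accuracy
```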
What this means for the launch story
Three observations from this benchmark cycle that matter beyond the table results.
Out-of-the-box accuracy is misleading. Headline benchmark numbers favor whichever model's training mix is heaviest in agentic data. Once you fine-tune on representative data for your own tool set, the gap mostly closes. This is the "small fine-tuned model beats the bigger general model" story playing out at the 4B class.
Native function-call tokens are an underrated structural advantage. Gemma 4 E4B's special tokens for function calls don't show up in BFCL composite scores but do show up in production reliability and latency. Watch this trend — Llama 5 and the next Qwen generation are likely to follow suit.
Mid-range Android is the constraint. The slowest target-device numbers are what determine whether your agent feels usable. iPhone 14 Pro and Pixel 8 are both within latency tolerance for any of the three models. Mid-range Android is where the choice between ~12s and ~15s end-to-end latency starts mattering.
For mobile app builders shipping AI features against the agentic cost cliff: any of these three bases, fine-tuned on a few hundred representative examples and shipped via the Ertas Deployment CLI, replaces a frontier API call with an on-device inference. Per-token costs go to zero, latency moves into the device-dependent range above (consistent regardless of user count), and the bill stops scaling with traffic. The choice between the three is a tuning decision, not a strategic one — they're all viable bases for the same pattern.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Keep reading

Agent Specialists: FunctionGemma + Gemma 4 E2B and the Fine-Tune-and-Ship Argument
Google's FunctionGemma (270M) and Gemma 4 E2B (2B) are the smallest credible function-calling models in 2026. They're not general-purpose — they're explicitly designed to be fine-tuned. That's the whole point.

Pydantic AI On-Device: Fine-Tune Qwen3-4B for Type-Safe Mobile Agents
Pydantic AI brings type safety and FastAPI ergonomics to LLM agents. Combine it with a fine-tuned 4B model running on-device via llama.cpp and you get production-grade agents in mobile apps with zero API costs and validated outputs by construction.

Fine-Tuned 3B vs GPT-4: Why Smaller Models Win at Domain Tasks
Academic research shows fine-tuned 3B-7B models consistently beat GPT-4 on domain-specific tasks. Here's the evidence, the pattern, and how to apply it in your app.