
On-Device Tool Calling 2026: Qwen3-4B vs Gemma 4 E4B vs Phi-4-Mini
We benchmarked the three best on-device tool-calling bases of 2026 — Qwen3-4B, Gemma 4 E4B, and Phi-4-Mini — across BFCL v4, real mobile latency, and post-fine-tune accuracy. Each wins a different scenario; here's how to pick.
Three open-weight models have separated from the pack as the credible bases for on-device tool calling in 2026: Qwen3-4B-Instruct-2507, Gemma 4 E4B (the 4B effective-parameter edge variant), and Phi-4-Mini-Instruct (3.8B). All three fit comfortably on modern phones at Q4_K_M quantization. All three handle function calling adequately out of the box and excellently after fine-tuning. All three are supported by llama.cpp's tool-call parser as of the March 2026 release.
But they're not interchangeable. Each has a distinct profile of strengths, and choosing the right base before fine-tuning saves significant time and inference cost down the line. We benchmarked all three across three dimensions that actually matter for on-device deployments — out-of-the-box BFCL v4 accuracy, real mobile latency on representative phones, and post-fine-tune accuracy on a domain-specific tool set — and the results split cleanly.
This is the practical guide to picking your starting point.
What we're comparing
Three dimensions, each calibrated to the on-device-tool-calling use case. The numbers below are illustrative ranges synthesized from public benchmarks, vendor model cards, and representative llama.cpp throughput measurements published through April–May 2026 — they are not first-party measurements from a single rig, and your own results will depend on your specific quantization, prompt template, and hardware. Treat them as a sketch of relative shape rather than precise leaderboard scores.
Out-of-the-box BFCL v4. Berkeley Function Calling Leaderboard v4 is the standard agentic-evaluation suite, refreshed in 2026 with multi-turn dialogues, parallel function calls, and held-out tool schemas. The numbers cited below reflect publicly reported scores at the time of writing; check the live leaderboard at gorilla.cs.berkeley.edu for current rankings.
Mobile latency. Approximate time-to-first-token and tokens-per-second figures for three representative devices: iPhone 14 Pro (A16 Bionic, 6 GB RAM), Pixel 8 (Tensor G3, 8 GB RAM), and a mid-range Android (Snapdragon 7 Gen 3, 6 GB RAM). Numbers assume llama.cpp's iOS and Android bindings at Q4_K_M with a 1,024-token context window and a typical 200-token tool-call output. Real device throughput varies by 10–30% based on thermal state, background load, and OS version.
Post-fine-tune accuracy on a 5-tool customer-support agent. Representative outcomes from a typical Ertas Studio QLoRA fine-tune (rank 32, three epochs) on a 600-example dataset covering five customer-support tools. The held-out evaluation covers single-call, parallel-call, and multi-turn scenarios. Your own post-fine-tune accuracy will track these ranges if your dataset is well-curated and the eval mirrors your real tool surface; numbers below 95% are usually a dataset-quality signal rather than a base-model ceiling.
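If you're building that held-out eval yourself rather than reading Ertas Studio's report, the joint-accuracy metric reduces to exact matching of tool name and arguments across every call in a scenario. A minimal sketch, assuming a hypothetical record layout with `predicted` and `gold` call lists (real harnesses often also normalize argument values and match parallel calls order-insensitively):

```python
def calls_match(pred: dict, gold: dict) -> bool:
    # A call counts as correct only if the tool name and every argument agree.
    return (pred.get("name") == gold.get("name")
            and pred.get("arguments") == gold.get("arguments"))

def joint_accuracy(examples: list[dict]) -> float:
    # Each example: {"predicted": [call, ...], "gold": [call, ...]}, where a
    # call is {"name": str, "arguments": dict}. A scenario scores only if
    # every call matches, in order; this is stricter than per-call accuracy.
    hits = sum(
        1
        for ex in examples
        if len(ex["predicted"]) == len(ex["gold"])
        and all(calls_match(p, g) for p, g in zip(ex["predicted"], ex["gold"]))
    )
    return hits / len(examples)
```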
Out-of-the-box BFCL v4 results
Approximate composite-tier rankings from publicly reported scores (illustrative — see the live leaderboard for exact figures):
| Model | Approximate composite | Notes |
|---|---|---|
| Qwen3-4B-Instruct-2507 | high 80s | Leading sub-7B base; particularly strong on parallel function calls |
| Gemma 4 E4B | mid-to-high 80s | Native function-call special tokens reduce output variance |
| Phi-4-Mini-Instruct | low-to-mid 80s | Stronger reasoning, slightly weaker raw mapping accuracy |
Qwen3-4B has held the top sub-7B spot through early 2026, and the lead is consistent with broader agentic evaluations: Qwen 3 family models have unusually strong tool-calling priors out of the box, plausibly because Alibaba's training data is heavy on agentic and function-call traces.
Gemma 4 E4B is close behind. Notably, Gemma 4's native function-call special tokens (released April 2026) give it a structural advantage over the prompt-based JSON formatting that older models rely on — when the parameter values are clean and the schema is well-formed, Gemma 4 produces them in a more reliable token sequence. The composite score doesn't fully capture this: Gemma 4 E4B has lower variance in its output structure, which matters in production even when raw accuracy is similar.
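To make that concrete, here's an illustrative side-by-side of the two output shapes. The special-token names below are invented placeholders for the sketch, not Gemma 4's actual vocabulary; the point is that delimiter tokens replace the JSON syntax the decoder would otherwise have to get right token by token:

```python
# Prompt-based JSON tool call: the model emits every brace, quote, and key
# name as ordinary text tokens, so any slip produces unparseable output.
json_style = '{"name": "get_order_status", "arguments": {"order_id": "A1042"}}'

# Native special-token call (token names are hypothetical placeholders):
# dedicated vocabulary items mark the call boundaries, leaving fewer ways
# to emit a malformed structure and fewer tokens spent on pure syntax.
token_style = "<fn_call>get_order_status<fn_args>order_id=A1042<fn_end>"
```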
Phi-4-Mini trails on raw BFCL but its profile is interesting. Its reasoning-chain quality is notably higher than that of the other two, and on multi-turn benchmarks, where the model has to plan a sequence of tool calls based on intermediate results, Phi-4-Mini's gap closes. The composite above is dominated by the single-turn and parallel-call subsets, where pure mapping accuracy matters most.
Approximate mobile latency
Indicative throughput ranges at Q4_K_M, llama.cpp bindings, 1,024-token context, ~200 output tokens. Use these for sanity-check sizing, not procurement decisions — actual numbers vary by 10–30%:
| Model | iPhone 14 Pro | Pixel 8 | Mid-range Android |
|---|---|---|---|
| Qwen3-4B-Instruct-2507 | ~30 t/s | ~22–25 t/s | ~12–15 t/s |
| Gemma 4 E4B | ~32–36 t/s | ~25–28 t/s | ~14–17 t/s |
| Phi-4-Mini-Instruct | ~35–40 t/s | ~27–30 t/s | ~16–19 t/s |
Phi-4-Mini tends to lead on raw throughput. At 3.8B parameters versus 4B, it's the smallest of the three, and the speed difference is meaningful: about 15–20% faster than Qwen3-4B and 5–10% faster than Gemma 4 E4B. For latency-sensitive flows (assistants triggered by user voice or by UI interactions), Phi-4-Mini is the right starting point if BFCL accuracy is acceptable.
Gemma 4 E4B is in the middle, with one quirk: its native function-call special tokens reduce output token count for typical tool calls by about 15–20% versus the JSON-formatted alternatives the other models produce. This means even though its raw tokens/sec is similar to Qwen3-4B, end-to-end tool-call latency is consistently lower. The ~200-token output assumption above doesn't capture this: in practice, a Gemma 4 E4B tool call is more like 160 output tokens, so real latency is meaningfully better than the table suggests.
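The back-of-envelope math makes the effect visible. Using the mid-range-Android midpoints from the table (illustrative, like everything above) and the shorter Gemma call length:

```python
# Rough end-to-end decode time: output tokens / decode throughput.
# Ignores time-to-first-token, which adds a second or two on mid-range
# hardware, so treat these as optimistic lower bounds.
scenarios = {
    "Qwen3-4B-Instruct-2507": (200, 13.5),  # JSON-style call, ~12-15 t/s
    "Gemma 4 E4B":            (160, 15.5),  # shorter native-token call, ~14-17 t/s
    "Phi-4-Mini-Instruct":    (200, 17.5),  # JSON-style call, ~16-19 t/s
}
for name, (tokens, tps) in scenarios.items():
    print(f"{name}: ~{tokens / tps:.1f}s")
# Qwen3-4B ~14.8s, Gemma 4 E4B ~10.3s, Phi-4-Mini ~11.4s
```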
For the mid-range Android tier — which is most of the global mobile install base — every second matters. Phi-4-Mini at ~12s end-to-end is acceptable for non-realtime flows; ~15s for Qwen3-4B starts feeling slow. If you're shipping to the global market, this matters.
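Before committing to a base, it's also worth checking the relative ordering on hardware you control. A desktop run through llama-cpp-python is a rough proxy only (the mobile bindings, quantization build, and thermal behavior will all shift the absolute numbers); the GGUF path and prompt below are placeholders:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: point at whichever Q4_K_M GGUF build you're comparing.
llm = Llama(model_path="qwen3-4b-instruct-2507-q4_k_m.gguf",
            n_ctx=1024, verbose=False)

prompt = "Call the right tool to check the status of order A1042."  # stand-in

start = time.perf_counter()
out = llm(prompt, max_tokens=200)
elapsed = time.perf_counter() - start

n_out = out["usage"]["completion_tokens"]
print(f"{n_out} tokens in {elapsed:.1f}s -> {n_out / elapsed:.1f} tok/s")
```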
Post-fine-tune accuracy on a 5-tool agent
After fine-tuning each base on a well-curated 600-example dataset (Ertas Studio QLoRA, rank 32, three epochs), all three typically clear the 95% joint-accuracy threshold on a held-out tool set — the practical bar for production deployment. The gap between them narrows substantially compared to the out-of-the-box scores.
In practice we see Gemma 4 E4B nudge slightly ahead of Qwen3-4B post-fine-tune, partly because its native function-call special tokens reduce variance on the parameter-value sub-score. Phi-4-Mini lands close behind, with its narrower out-of-the-box gap on parallel function calls largely closed by training-set exposure.
This is the most important shape in the analysis: fine-tuning levels the playing field. The composite spread between bases on raw BFCL collapses by roughly 70% once each base has seen a representative training set for the tool surface it will actually use. Out of the box, Qwen3-4B's lead looks decisive. After fine-tuning on representative data, the choice gets dominated by other factors: latency on your target devices, licensing, ecosystem fit, and the tooling around each.
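For readers reproducing the recipe outside Ertas Studio, it maps onto a standard open-source QLoRA setup. A minimal sketch with Hugging Face transformers, peft, and trl; the dataset filename and LoRA target modules are representative assumptions, not the exact Ertas configuration:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

BASE = "Qwen/Qwen3-4B-Instruct-2507"  # swap in whichever base you picked

# 4-bit NF4 quantization is what makes this a QLoRA (not plain LoRA) run.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto"
)

# Rank-32 adapters on attention and MLP projections; these target modules
# are a common choice for Qwen-style blocks, not a verified Ertas setting.
lora = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Hypothetical filename for the 600-example tool-calling dataset.
dataset = load_dataset("json", data_files="support_tools_600.jsonl",
                       split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="qlora-out", num_train_epochs=3,
                   per_device_train_batch_size=4),
    train_dataset=dataset,
    peft_config=lora,
)
trainer.train()
```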
How to pick
We use a four-question decision tree; a condensed code sketch follows the list.
1. What's your latency budget on the slowest target device? If you're shipping to mid-range Android globally and need sub-10-second end-to-end tool calls, Phi-4-Mini-Instruct is the right base. The 15–20% speed advantage matters and the post-fine-tune accuracy is competitive.
2. Do you need Apache 2.0 licensing? Gemma 4 E4B is Apache 2.0; Qwen3-4B is also Apache 2.0; Phi-4-Mini is MIT. All three are commercially permissive, but Gemma 4's licensing simplification (relative to Gemma 3's custom license) is significant if you've previously avoided Gemma for this reason. Gemma 4 also has the cleanest function-call output format thanks to its native special tokens.
3. Are you in a complex multi-turn agentic scenario? Phi-4-Mini's reasoning quality has an edge here. For agents that do significant planning between tool calls, Phi-4-Mini's chain-of-thought traces are noticeably cleaner. Pair this with smolagents' code-action paradigm if you can.
4. Are you in a simpler single-turn or parallel-call scenario, with the highest possible raw accuracy as the priority? Qwen3-4B-Instruct-2507 is the right base. Its BFCL v4 lead out of the box is real, the Apache 2.0 license is clean, and the Alibaba team's training methodology produces unusually consistent tool-calling priors.
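Condensed into code, the tree looks like this. It's an illustrative heuristic only: the question order encodes our priorities above, not a measured trade-off, and the parameter names are ours.

```python
def pick_base(
    needs_fast_midrange_android: bool,
    needs_native_fn_tokens: bool,
    planning_heavy_multi_turn: bool,
) -> str:
    """Illustrative encoding of the four-question decision tree above."""
    if needs_fast_midrange_android:
        return "Phi-4-Mini-Instruct"     # smallest model, 15-20% faster
    if needs_native_fn_tokens:
        return "Gemma 4 E4B"             # lowest output variance, clean format
    if planning_heavy_multi_turn:
        return "Phi-4-Mini-Instruct"     # cleanest reasoning chains
    return "Qwen3-4B-Instruct-2507"      # best raw single-turn/parallel accuracy
```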
What this means for the launch story
Three observations from this benchmark cycle that matter beyond the table results.
Out-of-the-box accuracy is misleading. Headline benchmark numbers favor whichever model's training mix is heaviest in agentic data. Once you fine-tune on representative data for your own tool set, the gap mostly closes. This is the "small fine-tuned model beats the bigger general model" story playing out at the 4B class.
Native function-call tokens are an underrated structural advantage. Gemma 4 E4B's special tokens for function calls don't show up in BFCL composite scores but do show up in production reliability and latency. Watch this trend — Llama 5 and the next Qwen generation are likely to follow suit.
Mid-range Android is the constraint. The slowest target-device numbers are what determine whether your agent feels usable. iPhone 14 Pro and Pixel 8 are both within latency tolerance for any of the three models. Mid-range Android is where the choice between ~12s and ~15s end-to-end latency starts mattering.
For mobile app builders shipping AI features against the agentic cost cliff: any of these three bases, fine-tuned on a few hundred representative examples and shipped via the Ertas Deployment CLI, replaces a frontier API call with an on-device inference. Per-token costs go to zero, latency moves into the device-dependent range above (consistent regardless of user count), and the bill stops scaling with traffic. The choice between the three is a tuning decision, not a strategic one — they're all viable bases for the same pattern.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Keep reading

Agent Specialists: FunctionGemma + Gemma 4 E2B and the Fine-Tune-and-Ship Argument
Google's FunctionGemma (270M) and Gemma 4 E2B (2B) are the smallest credible function-calling models in 2026. They're not general-purpose — they're explicitly designed to be fine-tuned. That's the whole point.

Pydantic AI On-Device: Fine-Tune Qwen3-4B for Type-Safe Mobile Agents
Pydantic AI brings type safety and FastAPI ergonomics to LLM agents. Combine it with a fine-tuned 4B model running on-device via llama.cpp and you get production-grade agents in mobile apps with zero API costs and validated outputs by construction.

Fine-Tuned 3B vs GPT-4: Why Smaller Models Win at Domain Tasks
Academic research shows fine-tuned 3B-7B models consistently beat GPT-4 on domain-specific tasks. Here's the evidence, the pattern, and how to apply it in your app.