
Agent Specialists: FunctionGemma + Gemma 4 E2B and the Fine-Tune-and-Ship Argument
Google's FunctionGemma (270M) and Gemma 4 E2B (2B) are the smallest credible function-calling models in 2026. They're not general-purpose — they're explicitly designed to be fine-tuned. That's the whole point.
The most interesting open-weight trend of 2026 is not the next bigger mixture-of-experts model. It is the rise of what Google and a handful of other labs are now calling "agent specialists" — small models explicitly designed to be fine-tuned for narrow agentic tasks rather than chatted with as general assistants.
FunctionGemma at 270M parameters and the newly released Gemma 4 E2B at roughly 2B effective parameters are the canonical examples. Both ship with native function-calling special tokens. Both fit on phones — under 200MB and around 1.5GB respectively at Q4_K_M. Both are released as base models with model cards that read, almost word for word, "intended to be fine-tuned for your specific function-calling task." That phrasing is not boilerplate. It is the product positioning. Google is explicitly telling you these are not chat models, not general assistants, and not finished products. They are starting points for specialization.
This is a different mental model from the past three years. The old assumption was that you took a general-purpose model — Llama 3, Mistral 7B, Qwen 2.5 — and either prompted it harder, retrieved harder against it, or, if you had the budget, fine-tuned it and hoped the base capability survived contact with your domain. The new assumption, embodied by FunctionGemma and Gemma 4 E2B, is that the base model itself should already be optimized for the task. Fine-tuning is not a workaround for a model that doesn't quite fit. It is the intended workflow.
If you are building an agent that lives inside a mobile app, or a desktop tool, or anything else where every megabyte and every millisecond matters, agent specialists are the leading edge of the trend that decides whether your product economics work.
What the specialist label actually means
A general-purpose 7B Instruct model is trained to do many things passably: summarize, chat, reason, write code, follow instructions, occasionally call tools. The capability budget is spread across dozens of competencies. Tool calling is one slice of that budget — not the focus.
An agent specialist inverts the priorities. It is trained on a narrow distribution of tasks: input is a user message plus a tool schema, output is a structured function call. Other capabilities are present at much lower fidelity or removed entirely. The architecture, tokenizer, and pretraining mix are tuned around that single output shape.
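To make that output shape concrete, here is what a single training pair for a specialist like this might look like. This is a minimal sketch: the schema layout and the <tool_call> wrapper are illustrative assumptions, not FunctionGemma's actual chat template.

```python
# One training pair from the specialist's narrow distribution: a user message
# plus a tool schema in, a structured call out. The field names and the
# <tool_call> wrapper are illustrative assumptions, not the real template.
example = {
    "input": {
        "user_message": "What's the weather in Lisbon tomorrow?",
        "tools": [
            {
                "name": "get_weather",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"},
                        "date": {"type": "string", "format": "date"},
                    },
                    "required": ["location", "date"],
                },
            }
        ],
    },
    # The entire expected completion: no prose, no reasoning, just the call.
    "output": (
        '<tool_call>{"name": "get_weather", '
        '"arguments": {"location": "Lisbon", "date": "2026-04-18"}}</tool_call>'
    ),
}
```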
That trade-off — giving up generalist breadth in exchange for specialist density — is what makes the parameter counts feel implausible at first glance. A 270M model that hits 82-88% on standard tool-calling benchmarks is not violating any laws of physics. It is just spending its parameters on one thing instead of fifty.
FunctionGemma in one paragraph (because it has its own post)
We covered FunctionGemma in detail earlier in the year. The short version: 270M parameters, single-purpose intent-to-invocation mapping, 200MB at Q4, 800-plus tokens per second on a consumer GPU and 180-250 tokens per second on plain CPU. Out of the box it handles standard tool schemas — weather, search, calendar, CRUD — at 82-88% accuracy. Fine-tuned on your specific schemas, it lands in the 90-94% range. It cannot reason multi-step, cannot chat, cannot summarize. It does one thing, very fast, in a tiny footprint.
What is new — and the heart of this post — is that Google has now released the model that sits one size up.
Gemma 4 E2B: the specialist gets a multimodal sibling
Gemma 4 E2B (April 2026) is Google's answer to a real gap. FunctionGemma is great if your agent only needs text-in, function-call-out. It is not enough when the agent needs to look at a photo of a receipt before calling create_expense_report, or read a screenshot before calling navigate_to_setting. Mobile agents in particular keep running into multimodal inputs, and a 270M text-only model leaves them stranded.
Gemma 4 E2B is a roughly 2B-effective-parameter, natively multimodal model with the same special-token vocabulary for function calls that FunctionGemma uses. The architecture is the next iteration of the Gemma family — the "E" in E2B stands for "effective" parameters, paired with Per-Layer Embeddings (PLE) caching that lets a 2B-class model use a much smaller active memory footprint than the raw parameter count suggests. At Q4_K_M quantization it sits around 1.5GB on disk and roughly 2GB working memory, which puts it in range for any modern phone.
Three things matter about how Gemma 4 E2B is positioned:
- It is Apache 2.0 licensed. Clean for commercial use, redistributable, fine-tunable, and shippable inside an app without negotiating a separate license. This is the same posture as the rest of the Gemma family but worth restating because it is the differentiator from a fair number of other open-weight models that ship under usage-restricted licenses.
- It has native function-call tokens. The model emits structured tool calls without needing post-hoc parsing or regex on the output. This sounds minor and is not — it is the difference between a model that can call tools reliably under fine-tuning and one that produces JSON that mostly parses, mostly. (A short parsing sketch follows this list.)
- The model card explicitly frames it as a fine-tuning base for agent applications, not a general assistant. Out of the box it is competent at tool calling but unremarkable at chat. The intended workflow, as with FunctionGemma, is to fine-tune it for your domain.
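The second point is easiest to see in code. A minimal sketch, assuming hypothetical <tool_call> delimiter tokens (check the model card for the real special-token vocabulary): with native delimiters, extracting the call is a deterministic string operation rather than a regex hunt through free-form output.

```python
import json

# With dedicated delimiter tokens, every call is bracketed unambiguously,
# so extraction is a split, not a pattern match over free-form text.
# ASSUMPTION: "<tool_call>"/"</tool_call>" stand in for whatever special
# tokens the model actually emits.
START, END = "<tool_call>", "</tool_call>"

def extract_call(generation: str) -> dict:
    """Pull the single structured call out of a model generation."""
    start = generation.index(START) + len(START)
    end = generation.index(END, start)
    return json.loads(generation[start:end])

call = extract_call(
    '<tool_call>{"name": "get_weather", "arguments": {"location": "Lisbon"}}</tool_call>'
)
assert call["name"] == "get_weather"
```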
For mobile and edge agent builders, Gemma 4 E2B is the first openly licensed, multimodal, function-calling-native model small enough to run on the device. That combination did not exist six months ago.
The fine-tune-and-ship argument
Here is the calculation that drives this whole conversation.
A generic 7B Instruct model with a good prompt and retrieval against your tool schemas reaches roughly 60-70% accuracy on a moderately custom tool set. Retrieval failures account for some of the misses, prompt-template variance accounts for more, and the rest is the model's general tendency to hallucinate plausible-looking parameter values. In production this looks like a system that mostly works, fails embarrassingly often enough that you build retry logic, and consumes 4.5GB of memory at Q4 plus whatever your retriever uses.
A fine-tuned FunctionGemma 270M on the same tool set lands above 95% accuracy on the trained tools, with no retrieval needed because the schemas are baked into the weights. The footprint is 200MB at Q4. That is a 22-times reduction in memory at higher accuracy on the trained tools, and a substantial reduction in latency because there is no retrieval round-trip.
The catch is the phrase "on the trained tools." A fine-tuned specialist is brittle outside its training distribution. Add a new tool to your agent and you need a quick retraining run before that tool starts working reliably. For most agent products this is fine — your tool surface changes infrequently and you have a deployment process anyway — but it is the trade-off on offer. You exchange generality for accuracy and footprint.
The fine-tune-and-ship argument is that for the great majority of agent products, especially agents that live inside an app, that trade-off is the right one. The reasons:
- Your tool set is finite and known. A real product has a fixed catalog of actions. The case for a generalist that can handle arbitrary unknown tools at runtime is mostly a research case.
- Your accuracy bar is high. Tool calls drive real actions. 70% accuracy is unacceptable. 95% is the floor for production.
- Your unit economics demand low marginal cost. Once you cross a few thousand active users running multi-step agent flows, frontier API costs eat your margins. On-device specialists make per-inference cost effectively zero (see the back-of-envelope math after this list).
- Your app cannot ship a 4.5GB binary. A 200MB to 1.5GB model is the difference between a download users will accept and one they will abandon.
Specialist plus fine-tuning hits all four of those constraints. Generalist plus prompting hits none of them.
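To put rough numbers on the unit-economics point, here is a back-of-envelope comparison. Every figure is an illustrative assumption, not a quoted price; substitute your own traffic and API rates.

```python
# Back-of-envelope agent-flow economics. All numbers are assumptions.
api_cost_per_flow = 0.25       # dollars per multi-step agent flow on a frontier API
flows_per_user_per_day = 3

for dau in (1_000, 10_000, 100_000):
    monthly = dau * flows_per_user_per_day * api_cost_per_flow * 30
    print(f"{dau:>7} DAU -> ${monthly:>10,.0f}/month on the API, ~$0 on-device")
```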
When to pick which specialist
The choice between FunctionGemma, Gemma 4 E2B, and a larger fine-tuned model is mostly about input modality and reasoning depth.
FunctionGemma 270M is the right answer when:
- Input is text only.
- The agent's job is pure intent-to-invocation mapping with no intermediate reasoning.
- The footprint constraint is tight — under 500MB total budget for the model.
- Your tool count is in the single digits to low double digits.
This is the lightest possible deployment. Fine-tuning takes 5-10 minutes on a single GPU, the resulting model serves from under 300MB of RAM, and inference is essentially instantaneous on any device.
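For readers who want to see what a run like that looks like outside Studio, here is a minimal LoRA fine-tune sketch using Hugging Face TRL and PEFT. The checkpoint id is a placeholder and the dataset format is an assumption; use the real id and chat format from the model card, and note that TRL's constructor signature varies across versions.

```python
# Minimal LoRA fine-tune sketch with TRL + PEFT, consistent with the
# "minutes on a single GPU" claim for a 270M base.
# ASSUMPTIONS: the checkpoint id is a placeholder, and tool_calls.jsonl
# holds records with a "messages" field in chat format.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="tool_calls.jsonl", split="train")

trainer = SFTTrainer(
    model="google/functiongemma-270m",   # hypothetical checkpoint id
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    args=SFTConfig(
        output_dir="functiongemma-ft",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-4,
    ),
)
trainer.train()
```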
Gemma 4 E2B is the right answer when:
- Input includes images, screenshots, photos, or other visual content.
- The agent benefits from longer-context multi-turn conversations before emitting a tool call.
- The footprint constraint allows roughly 2GB of working memory.
- The tool count is moderate — up to a few dozen tools with non-trivial schemas.
The fine-tuning workflow is similar to FunctionGemma but with a longer training run (typically 30-60 minutes on a single GPU) and a bigger dataset (500-1500 examples is the sweet spot, including multimodal examples if you are using the vision input).
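For reference, a single multimodal training record might look something like the sketch below. The field names, the truncated tool schema, and the <tool_call> wrapper are all illustrative assumptions; match whatever format your training stack expects.

```python
# One hypothetical multimodal training record for a Gemma 4 E2B fine-tune.
# ASSUMPTIONS: every field name here is illustrative, not a required schema.
example = {
    "image": "screenshots/receipt_0142.jpg",  # the visual input the call depends on
    "messages": [
        {"role": "user", "content": "File this receipt as a travel expense."},
    ],
    "tools": [
        # Full JSON Schema in practice; truncated here for brevity.
        {"name": "create_expense_report", "parameters": {"type": "object"}},
    ],
    "output": (
        '<tool_call>{"name": "create_expense_report", '
        '"arguments": {"category": "travel", "amount": 48.20, "currency": "EUR"}}</tool_call>'
    ),
}
```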
A larger fine-tuned model — Qwen3-4B, Phi-4-Mini, or similar — is the right answer when:
- The agent needs reasoning steps between tool calls. Plan-and-execute patterns, multi-hop tool chains, error recovery, conditional logic over previous tool outputs.
- The output structure is complex — not just one function call but a structured plan or a multi-step decision tree.
- You can afford 2.5-3.5GB of working memory.
The previous post on Pydantic AI on-device walks through exactly this case for Qwen3-4B. It is the right size when the agent needs to be both reliable at tool calls and capable of light reasoning between them.
The Ertas pipeline for any of these
The workflow is the same regardless of which specialist base you start with.
1. Curate a dataset in Data Craft. Paste your tool schemas in. Use the bulk-generation prompt template to seed several hundred examples through Claude or ChatGPT, then let Studio validate every example against the schemas before adding it to the training set. For Gemma 4 E2B specifically, mix in multimodal examples — image plus text inputs paired with the expected tool call output.
2. Fine-tune in Studio. Pick FunctionGemma, Gemma 4 E2B, or whichever larger model you decided on. Studio's default for tool-calling fine-tunes is QLoRA at rank 16-32, three epochs. The validation loss curve typically flattens around epoch 2-2.5; auto-eval flags overfitting if it appears.
3. Evaluate against held-out data. The three metrics to watch are tool-name accuracy, parameter-name accuracy, and parameter-value accuracy; a minimal scoring sketch follows this list. Production-ready specialist fine-tunes score above 95% on all three. If anything is below 95%, the cause is almost always dataset gaps — find the failing examples, add representative training data, and run incremental training from the existing checkpoint.
4. Export to GGUF. Studio's export flow produces a GGUF binary at the quantization level you choose. Q4_K_M is the default for mobile.
5. Ship with the Ertas Deployment CLI. Run the CLI against your iOS, Android, Flutter, or React Native project and the model is wired into a working inference call in minutes. The CLI installs the llama.cpp mobile FFI bindings, drops the GGUF model in, and exposes a typed inference function in your codebase.
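The scoring sketch promised in step 3: a minimal way to compute the three metrics, assuming your eval harness already parses each generation into a {"name", "arguments"} dict.

```python
# Held-out scoring for the three metrics from step 3.
# ASSUMPTION: predictions and references are parsed call dicts of the shape
# {"name": str, "arguments": dict}; adapt to your eval harness's format.
def score(predictions: list[dict], references: list[dict]) -> dict[str, float]:
    n = len(references)
    name_hits = param_name_hits = param_value_hits = 0
    for pred, ref in zip(predictions, references):
        if pred.get("name") == ref["name"]:
            name_hits += 1
        # Parameter names: the set of argument keys matches exactly.
        if set(pred.get("arguments", {})) == set(ref["arguments"]):
            param_name_hits += 1
            # Parameter values: every value matches, not just the keys.
            if pred["arguments"] == ref["arguments"]:
                param_value_hits += 1
    return {
        "tool_name_acc": name_hits / n,
        "param_name_acc": param_name_hits / n,
        "param_value_acc": param_value_hits / n,
    }
```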
End-to-end timeline from blank project to fine-tuned specialist running on a phone: hours, not weeks. The same dataset that trains FunctionGemma can train Gemma 4 E2B can train Qwen3-4B — Studio re-uses the dataset across base models, so your only choice is which size and modality fit your product.
The bigger trend
The story of open-weight models in 2024 and 2025 was capability ceiling. Each new release pushed the bar for what was possible at a given parameter count. Llama 3 made 8B competitive. Qwen 2.5 made 7B competitive. Mistral made small models punch above their weight.
The story of 2026, increasingly, is the specialization floor. Not "how big can the smallest credible model be?" but "how small can the smallest credible model for this specific job be?" FunctionGemma at 270M and Gemma 4 E2B at 2B are pushing that floor down for tool calling. We will see the same pattern in classification, in extraction, in routing, in validation — domain-specific bases that are explicitly designed to be fine-tuned and shipped, not chatted with.
For mobile app builders, that trend is the way out of the agentic cost cliff. Frontier APIs cost tens of cents per multi-step agent flow. At a thousand daily active users, that is hundreds of dollars per day. At ten thousand, it is thousands. Specialist plus fine-tuning plus on-device deployment moves the per-inference cost to effectively zero, and the agent specialists released this year — FunctionGemma, Gemma 4 E2B, and the wave that will follow — make that move technically straightforward instead of an MLE-quarter of work.
Fine-tune and ship. Pick the smallest specialist that fits the job. Train it on your exact tools. Put it on the device. The architecture is settled enough now that the only question left is execution.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- FunctionGemma and the Rise of Dedicated Tool-Calling Models — the deeper dive on the 270M specialist that started this category
- Pydantic AI On-Device: Fine-Tune Qwen3-4B for Type-Safe Mobile Agents — when the agent needs reasoning between tool calls and a 4B base is the right size
- Best Open-Source Model to Fine-Tune in 2026 — where FunctionGemma and Gemma 4 E2B fit alongside Qwen, Llama, Mistral, and the rest of the 2026 base-model landscape