Fine-Tune FunctionGemma with Ertas

    Google's 270M-parameter purpose-built tool-calling model — a Gemma 3 derivative trained exclusively to map natural-language intent to function invocations. The smallest credible function-calling model in the open-weight ecosystem and an explicit invitation to fine-tune for your own tool schemas.

    270M · Google

    Overview

    FunctionGemma is a 270-million-parameter Gemma 3 derivative released by Google on May 5, 2026, trained for one job: take a user message plus a set of tool schemas and emit the correct function call with the right parameters. It does not chat, summarize, or reason in long form. It maps intent to invocation, and it does so at a model size — under 200MB at Q4 quantization — small enough to run on a Raspberry Pi, a phone, or a Jetson Nano.

    The model is part of a broader Google narrative around purpose-built small models. Where Gemma 3 and Gemma 4 are general-purpose families with native multimodal capabilities, FunctionGemma is explicitly framed by Google as a *base* for fine-tuning. Google's stated intent in the model card is unambiguous: 'intended to be fine-tuned for your specific function-calling task.' This is unusual among open-weight releases: most labs ship general-purpose checkpoints and leave specialization to users by default, but FunctionGemma's training and positioning push fine-tuning to the front of the recommended workflow.

    FunctionGemma achieves 82–88% on standard Berkeley Function Calling Leaderboard (BFCL) tasks out of the box, which is competitive with 3B–8B general-purpose models that are 10–30x larger. After fine-tuning on a domain's specific tool schemas — typically 200–1,000 well-curated function-call examples — accuracy on the target tools routinely climbs above 95%, surpassing what general-purpose 7B–14B models achieve on the same evaluation set. This combination of small footprint and specialization-after-tuning makes FunctionGemma the canonical example of the 2026 'agent specialist' trend.

    Key Features

    FunctionGemma's input format is a system block listing available functions with their parameter schemas, followed by a user message. The model emits a single structured output: the function name and its parameters as JSON. There is no conversational layer, no preamble, and no prose — the output begins with the function call directly. This makes parser integration trivial and removes the post-processing fragility that plagues general-purpose models doing tool calls through prompted JSON formatting.
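
    A minimal Python sketch of that contract, assuming a hypothetical `get_weather` tool; the exact prompt template is defined by the model's tokenizer config, so treat the strings here as illustrative shape, not the literal format:

    ```python
    import json

    # Hypothetical tool; the schema follows the JSON-Schema style used by
    # common function-calling APIs.
    tools = [{
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }]

    # System block: the list of available functions, then the user message.
    system_block = "You can call these functions:\n" + json.dumps(tools, indent=2)
    user_message = "What's the weather like in Oslo?"

    # The model's entire output is the call itself: no preamble, no prose.
    expected_output = json.dumps({"name": "get_weather",
                                  "parameters": {"city": "Oslo"}})
    print(expected_output)
    ```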

    The model is licensed under the Gemma Terms of Use (the Gemma 3-era license). Google has not yet relicensed FunctionGemma under Apache 2.0, the way it did for Gemma 4 in April 2026, so commercial users should review the license terms for use cases that touch the prohibited-use list. For most product applications — mobile assistants, agentic workflows, internal automation — the license is permissive enough.

    FunctionGemma's tokenizer and base architecture are inherited from Gemma 3, so the standard llama.cpp, Ollama, MLX, and TensorRT-LLM toolchains all support it without modification. GGUF quantizations from Q2_K through Q8_0 are available; Q4_K_M produces a ~180MB binary that runs at 800+ tokens/second on consumer GPUs and 180–250 tokens/second on a modern laptop CPU.
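
    A quick local smoke test with llama-cpp-python might look like the following; the GGUF filename and the inline prompt format are assumptions:

    ```python
    from llama_cpp import Llama  # pip install llama-cpp-python

    # The GGUF filename is a placeholder; point this at whichever quantization
    # you downloaded. n_ctx=2048 is plenty for short tool-call prompts.
    llm = Llama(model_path="functiongemma-270m-Q4_K_M.gguf", n_ctx=2048,
                verbose=False)

    prompt = (
        "You can call these functions:\n"
        '[{"name": "get_weather", "parameters": {"city": "string"}}]\n\n'
        "User: What's the weather like in Oslo?"
    )

    # temperature=0.0 keeps the structured output deterministic.
    out = llm(prompt, max_tokens=128, temperature=0.0)
    print(out["choices"][0]["text"])  # expected: the bare JSON function call
    ```

    Greedy decoding is the sensible default here: for structured function-call output there is nothing to gain from sampling.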

    Fine-Tuning with Ertas

    FunctionGemma is the canonical fine-tuning target for Ertas's tool-calling product story. The training data format is the same JSONL function-call schema Ertas Studio supports natively: each example is a tool list, a user query, and the expected function call. Because the model is so small, full-parameter fine-tuning fits on consumer GPUs — a 12GB RTX 3060 trains FunctionGemma at full sequence length without LoRA — but LoRA and QLoRA also work and produce adapters under 50MB that can be hot-swapped at inference time.
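
    A sketch of what one such JSONL record could look like; the field names (`tools`, `query`, `call`) are illustrative, not a published Studio schema:

    ```python
    import json

    # One record in the tool-call JSONL shape described above. Field names
    # are assumptions; match them to what your trainer expects.
    example = {
        "tools": [{
            "name": "create_event",
            "description": "Add an event to the user's calendar",
            "parameters": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "start": {"type": "string", "description": "ISO 8601 datetime"},
                },
                "required": ["title", "start"],
            },
        }],
        "query": "Put lunch with Sam on my calendar for Friday at noon",
        "call": {
            "name": "create_event",
            "parameters": {"title": "Lunch with Sam",
                           "start": "2026-06-05T12:00:00"},
        },
    }

    with open("train.jsonl", "a") as f:
        f.write(json.dumps(example) + "\n")
    ```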

    The typical Ertas workflow for FunctionGemma is: define your tool schemas in Studio's Data Craft module, generate 300–800 representative function-call examples (using the bulk-generation flow that emits prompt templates for ChatGPT/Claude/Gemini), split into train/validation, fine-tune in Studio with the FunctionGemma base, evaluate on held-out function calls, and export to GGUF. The full cycle on representative tool schemas runs in 1–3 hours of wall-clock time on Studio's standard GPU tier and produces a model that hits 95%+ accuracy on the trained tool set.
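
    Studio handles the training loop internally. For readers who want to reproduce the fine-tune outside Studio, a rough LoRA equivalent with Hugging Face `peft` might look like this; the `google/functiongemma-270m` checkpoint id is a placeholder, and the tokenization step is a simplification of applying the model's real chat template:

    ```python
    import json

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    BASE = "google/functiongemma-270m"  # hypothetical checkpoint id

    tok = AutoTokenizer.from_pretrained(BASE)
    model = AutoModelForCausalLM.from_pretrained(BASE)
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]))

    ds = load_dataset("json", data_files="train.jsonl")["train"]

    def tokenize(ex):
        # Concatenate tools + query + expected call into one training string.
        # A real pipeline would apply the model's chat template here instead.
        text = (json.dumps(ex["tools"]) + "\n" + ex["query"] + "\n"
                + json.dumps(ex["call"]))
        return tok(text, truncation=True, max_length=1024)

    ds = ds.map(tokenize, remove_columns=ds.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=4,
                               per_device_train_batch_size=8,
                               learning_rate=2e-4),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()
    model.save_pretrained("functiongemma-lora")  # LoRA adapter, well under 50MB
    ```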

    For mobile deployment, the Ertas Deployment CLI takes the GGUF output and wires it into an iOS Swift, Android Kotlin, Flutter, or React Native project with llama.cpp dependencies installed. The end-to-end path, from raw tool schemas to a fine-tuned function-calling model running on-device in a real app, is a few hours of work, dominated by dataset curation rather than training or deployment plumbing.

    Use Cases

    FunctionGemma's primary use case is the function-calling layer inside an agent system: turn a user request into a structured tool invocation that downstream code can execute. For mobile apps with a small number of high-frequency tools — booking, scheduling, search, CRUD operations on user data — FunctionGemma at 200MB on-device replaces a cloud API call entirely, eliminating per-token costs and removing network round-trips from latency-sensitive flows.

    For agent frameworks that support OpenAI-compatible endpoints (LangGraph, Pydantic AI, OpenAI Agents SDK, Smolagents, Mastra, Vercel AI SDK), FunctionGemma can serve as a dedicated tool-routing layer behind a larger reasoning model. The pattern: a 7B–14B model handles open-ended reasoning, FunctionGemma handles the structured tool-call emission, and both run locally. Inference cost drops dramatically without sacrificing tool-call reliability.
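
    A sketch of that routing pattern through the OpenAI Python client, assuming a local OpenAI-compatible server (for example, llama.cpp's llama-server) hosting the fine-tuned model; the base URL, model name, and `search_orders` tool are all placeholders:

    ```python
    from openai import OpenAI  # pip install openai

    # Point the client at the local server instead of a cloud API.
    router = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

    tools = [{
        "type": "function",
        "function": {
            "name": "search_orders",
            "description": "Look up a customer's orders",
            "parameters": {
                "type": "object",
                "properties": {"customer_id": {"type": "string"}},
                "required": ["customer_id"],
            },
        },
    }]

    resp = router.chat.completions.create(
        model="functiongemma-270m-ft",
        messages=[{"role": "user", "content": "Show orders for customer 81442"}],
        tools=tools,
    )
    print(resp.choices[0].message.tool_calls)
    ```

    Because the endpoint is OpenAI-compatible, swapping FunctionGemma in for a cloud tool-calling model is a one-line `base_url` change in most agent frameworks.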

    The small size also makes FunctionGemma the right choice for embedded and edge deployments: factory-floor robotics with limited compute, IoT devices that need natural-language control, in-vehicle systems where every megabyte of model weight is contested. Anywhere a general-purpose 3B–8B model is too large but a clean intent-to-invocation mapping is required, FunctionGemma is the default starting point.

    Hardware Requirements

    At Q4_K_M quantization, FunctionGemma weights are approximately 180MB and require ~250MB of total RAM at inference time including KV cache for short contexts. This fits comfortably on phones (iOS 14+ devices, Android devices with 2GB+ RAM), single-board computers (Raspberry Pi 4/5), embedded boards (Jetson Nano), and any laptop or desktop.

    Throughput on consumer hardware: 180–250 tokens/second on a modern laptop CPU (Apple Silicon M1/M2 or Ryzen/Intel mobile), 800+ tokens/second on consumer GPUs (RTX 3060 and above), and 1500+ tokens/second on data-center GPUs. Because tool calls are short (typically under 100 output tokens), wall-clock latency from prompt to complete function call is in the 50–200ms range on most consumer hardware — fast enough for interactive use without a perceptible pause.
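
    The arithmetic behind that range, assuming a typical function call of about 40 output tokens:

    ```python
    # Back-of-envelope check of the latency figures quoted above.
    output_tokens = 40
    for label, tps in [("laptop CPU", 200), ("consumer GPU", 800)]:
        print(f"{label}: {output_tokens / tps * 1000:.0f} ms")
    # laptop CPU: 200 ms
    # consumer GPU: 50 ms
    ```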

    For fine-tuning in Ertas Studio, a 12GB consumer GPU (RTX 3060, RTX 4060) handles full-parameter training at 1024-token sequence lengths. LoRA and QLoRA reduce this to 6–8GB and produce adapters small enough to ship as model patches. Training time on representative tool-call datasets (300–800 examples, 3–5 epochs) is typically 30–90 minutes on Studio's standard GPU tier.
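
    A sketch of the adapter hot-swap mentioned earlier, using `peft`; the base checkpoint id is again a placeholder, and the adapter directory is the one saved in the fine-tuning sketch above:

    ```python
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    BASE = "google/functiongemma-270m"  # hypothetical checkpoint id

    base = AutoModelForCausalLM.from_pretrained(BASE)
    # Attach a saved LoRA adapter. Because adapters are under 50MB, several
    # domain-specific ones can sit on disk and be swapped per tool domain.
    model = PeftModel.from_pretrained(base, "functiongemma-lora")
    model = model.merge_and_unload()  # optional: bake weights in before GGUF export
    ```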

    Supported Quantizations

    Q2_K · Q4_0 · Q4_K_M · Q5_K_M · Q6_K · Q8_0 · F16
