    Pydantic AI On-Device: Fine-Tune Qwen3-4B for Type-Safe Mobile Agents

    Pydantic AI brings type safety and FastAPI ergonomics to LLM agents. Combine it with a fine-tuned 4B model running on-device via llama.cpp and you get production-grade agents in mobile apps with zero API costs and validated outputs by construction.

    Ertas Team

    Updated 2026-05-10 — Reflects the early-May Pydantic AI v1.90.x and v1.93.x point releases, which added an explicit tool_choice setting on Agent (so you can pin the model to a specific function call when the workflow demands it) plus dedicated OutputToolCallEvent and OutputToolResultEvent streaming events for finer-grained observability. Both are useful when the agent runs against a fine-tuned local model and you want to be able to assert exactly which tool was selected at each step.
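    As a rough sketch of how the pinning looks in code (the keyword name and its string value follow the release notes above; treat them as assumptions, not verified API):

    from pydantic_ai import Agent

    # Pin the agent to a single function call via the tool_choice setting
    # described in the v1.90.x notes above. The keyword and accepted values
    # are taken from the release notes, not from verified documentation.
    agent = Agent(
        "openai:gpt-5",
        tool_choice="book_meeting",
    )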

    Pydantic AI is the agent framework Python developers have been waiting for. Type hints become contracts. Tool definitions become validated functions. Output schemas become guarantees, not suggestions. It's FastAPI for LLM agents — and it pairs unusually well with a fine-tuned local model.

    The combination matters because Pydantic AI's design assumes the model can produce structured outputs reliably. Against a frontier API like Claude or GPT-5, that assumption usually holds. Against a generic 7B open-weight model, it doesn't — schema violations are common, retries are routine, and the type safety the framework promises starts feeling like a wrapper around fragile inference.

    A fine-tuned 4B model fixes this. Trained on the exact tool schemas and output formats your agent uses, the model produces validated outputs nearly every time. Pydantic AI's validator becomes a guard rail rather than a recurring source of exceptions. And because the model is small enough to run on-device through llama.cpp, you get the type safety Pydantic AI promises with none of the per-token API costs that break mobile-app unit economics between 500 and 5,000 users.

    This guide walks through the full pipeline: fine-tune Qwen3-4B in Ertas Studio for a sample mobile-app agent, deploy it on-device through the Ertas Deployment CLI, and wire it into a Pydantic AI agent that runs entirely on the user's phone.

    The setup: a calendar assistant agent

    We'll build a calendar assistant that reads natural-language requests and emits validated booking actions. The agent has three tools:

    • find_availability(start: datetime, end: datetime, duration_min: int) returns open slots
    • book_meeting(slot: TimeSlot, attendees: list[str], title: str) creates a booking
    • cancel_meeting(meeting_id: str) cancels an existing booking

    The agent's output is a BookingDecision Pydantic model with structured fields. Pydantic AI validates every output against this schema. If the model emits invalid output, Pydantic AI raises a ValidationError that the host app can catch and retry.
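    In practice the catch-and-retry loop is only a few lines. A minimal sketch, assuming the agent surfaces pydantic's ValidationError as described; agent and the request string stand in for your real objects:

    from pydantic import ValidationError

    def run_with_retry(agent, request: str, max_attempts: int = 3):
        """Retry the agent run when the model emits schema-invalid output."""
        for attempt in range(1, max_attempts + 1):
            try:
                return agent.run_sync(request)
            except ValidationError:
                # The model's output failed BookingDecision validation;
                # give it another chance, then re-raise on the last attempt.
                if attempt == max_attempts:
                    raise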

    This is the kind of agent that's expensive when running against a frontier API and slow when running against a generic local model. Fine-tuned and on-device, it's both fast and free.

    Step 1: curate a fine-tuning dataset in Studio

    Open Ertas Studio and create a new project. Use Data Craft to define your tool schemas — paste the function signatures into the schema panel, and Studio uses them as the structural target for training data.

    For a calendar assistant you typically want 300–800 training examples covering:

    • Single-tool calls: "book a 30-minute meeting tomorrow at 2pm with john@" → one book_meeting call
    • Multi-step calls: "find 30 minutes tomorrow afternoon and book it with the design team" → find_availability then book_meeting
    • Validation edge cases: "cancel that meeting we just booked" (requires meeting_id from prior context)
    • Refusals: "send a follow-up email to all attendees" — out of scope, return a structured rejection
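    To make the record shape concrete, here is one hypothetical single-tool example expressed as a Python dict. The field names are illustrative, not Data Craft's actual import format:

    example = {
        "messages": [
            {"role": "user", "content": "book a 30-minute meeting tomorrow at 2pm with john@"},
        ],
        "expected_tool_calls": [
            {
                "name": "book_meeting",
                "arguments": {
                    "slot": {"start": "2026-05-12T14:00:00", "end": "2026-05-12T14:30:00"},
                    "attendees": ["john@example.com"],
                    "title": "Meeting with John",
                },
            },
        ],
    }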

    Data Craft has two modes for this. The manual mode lets you write each example in a structured table. The bulk mode emits a structured prompt template you paste into Claude or ChatGPT to generate hundreds of examples conversationally — Studio ingests the output and validates each example against your tool schemas before adding it to the training set.

    Aim for 500 examples. Past 500, dataset quality matters far more than quantity. Spend time on the validation edge cases and the refusals — those are where general-purpose models hallucinate parameters and where fine-tuning produces the largest reliability gains.

    Step 2: fine-tune Qwen3-4B

    Choose Qwen3-4B-Instruct as the base model in Studio. The 4B size is the right tradeoff for on-device deployment — small enough to fit comfortably on modern phones (about 2.5 GB at Q4_K_M), large enough to handle multi-turn conversations and parallel tool calls reliably.

    Studio's default training configuration for tool-calling fine-tunes is QLoRA at rank 32 over 3 epochs. The validation loss curve typically flattens around epoch 2.5; Studio's auto-eval flags overfitting if it appears. On the standard GPU tier, a 500-example training run completes in under an hour.

    When training finishes, run Studio's eval suite against your held-out validation set. For tool-calling fine-tunes, look at three metrics: tool-name accuracy (did the model pick the right function?), parameter-name accuracy (did it fill the right parameters?), and parameter-value accuracy (did it use the right values?). Production-ready models score above 95% on all three.
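    If you want to reproduce those three metrics outside Studio, the computation is straightforward. A minimal sketch over (expected, predicted) tool-call pairs; the call shape here is an assumption, not Studio's eval-report format:

    def tool_call_accuracy(pairs: list[tuple[dict, dict]]) -> dict[str, float]:
        """Score tool-name, parameter-name, and parameter-value accuracy.

        Each pair is (expected, predicted), where a call looks like
        {"name": "book_meeting", "arguments": {...}}.
        """
        names = params = values = 0
        for expected, predicted in pairs:
            names += expected["name"] == predicted["name"]
            # Right parameter names: identical sets of argument keys
            params += set(expected["arguments"]) == set(predicted["arguments"])
            # Right parameter values: arguments match exactly
            values += expected["arguments"] == predicted["arguments"]
        n = len(pairs)
        return {"tool_name": names / n, "param_name": params / n, "param_value": values / n}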

    If any metric is below 95%, the usual cause is dataset gaps — find the failing examples in the eval report, add representative training data for the missing cases, and run incremental fine-tuning. Studio supports continued training from a checkpoint without restarting from scratch.

    Step 3: export to GGUF and ship to mobile

    Studio's export flow produces a GGUF binary at the quantization level you choose. For Qwen3-4B on mobile, Q4_K_M is the default — about 2.5 GB on disk, ~3 GB working memory, and within the budget of any modern phone.

    Run the Ertas Deployment CLI against your existing iOS, Android, Flutter, or React Native project:

    ertas deploy mobile \
      --project ./my-app \
      --model ertas-calendar-agent-4b.gguf \
      --framework react-native
    

    The CLI installs llama.cpp's mobile FFI bindings, drops the GGUF model into the right project location, and wires up a working inference call. For React Native specifically, the CLI exposes a useErtasModel hook that returns a typed inference function. Total elapsed time from running the CLI to a working on-device inference call: typically under 5 minutes.

    Step 4: wire Pydantic AI to the on-device model

    Pydantic AI doesn't run on-device — it's a Python framework. So the architecture splits in two: the Pydantic AI agent logic runs in Python (a backend service or serverless function), while the model it calls is the on-device inference server that the Ertas Deployment CLI exposes from the mobile app as a local HTTP endpoint.

    For development, run the model on your laptop via Ollama and point Pydantic AI at the local endpoint:

    from datetime import datetime
    from typing import Literal
    from pydantic import BaseModel
    from pydantic_ai import Agent, RunContext
    from pydantic_ai.models.openai import OpenAIModel
    
    # Point Pydantic AI at the Ertas-trained Qwen3-4B served via Ollama
    model = OpenAIModel(
        "ertas-calendar-agent-4b",
        base_url="http://localhost:11434/v1",
        api_key="not-needed",
    )
    
    class TimeSlot(BaseModel):
        start: datetime
        end: datetime
    
    class BookingDecision(BaseModel):
        action: Literal["book", "find", "cancel", "reject"]
        slot: TimeSlot | None = None
        meeting_id: str | None = None
        rejection_reason: str | None = None
    
    agent = Agent(
        model,
        result_type=BookingDecision,
        system_prompt="You schedule meetings. Use the available tools to find slots and book.",
    )
    
    # calendar_api stands in for your app's calendar client (not shown here)
    @agent.tool
    async def find_availability(ctx: RunContext[None], start: datetime, end: datetime, duration_min: int) -> list[TimeSlot]:
        """Find open calendar slots between start and end of at least duration_min."""
        return await calendar_api.search(start, end, duration_min)
    
    @agent.tool
    async def book_meeting(ctx: RunContext[None], slot: TimeSlot, attendees: list[str], title: str) -> dict:
        """Create a calendar booking."""
        return await calendar_api.book(slot, attendees, title)
    
    @agent.tool
    async def cancel_meeting(ctx: RunContext[None], meeting_id: str) -> dict:
        """Cancel a calendar booking by ID."""
        return await calendar_api.cancel(meeting_id)
    
    # Run the agent with a natural-language request
    result = agent.run_sync(
        "Find me 30 minutes tomorrow afternoon and book it with the design team."
    )
    print(result.data)
    # BookingDecision(action='book', slot=TimeSlot(start=..., end=...), meeting_id='m-abc', ...)
    

    Two things are happening that you don't see in the code. First, Pydantic AI's validator has confirmed the model's output matches BookingDecision exactly — no missing fields, no hallucinated extras. If the model had emitted invalid output, Pydantic AI would have raised a ValidationError and you'd have caught it. Second, every tool call's input was validated against the tool's signature. If the model had passed wrong types, Pydantic AI would have rejected the call before it touched your calendar API.

    For production, swap the base_url from http://localhost:11434/v1 to whichever endpoint serves your model. If the model is running on the user's phone via the Ertas Deployment CLI, the endpoint is the device's local inference server (typically http://localhost:8080/v1 from the device's own perspective, or routed through your backend depending on your architecture).
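    One way to keep both environments in a single code path is to read the endpoint from the environment and fall back to the development default. This mirrors the constructor arguments used earlier in this post; ERTAS_MODEL_URL is an assumed variable name, not an Ertas convention:

    import os

    from pydantic_ai.models.openai import OpenAIModel

    model = OpenAIModel(
        "ertas-calendar-agent-4b",
        # Falls back to the local Ollama endpoint used during development
        base_url=os.environ.get("ERTAS_MODEL_URL", "http://localhost:11434/v1"),
        api_key=os.environ.get("ERTAS_MODEL_API_KEY", "not-needed"),
    )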

    Why this architecture works

    Three forces line up.

    Pydantic AI gives you type safety. The framework's validation layer turns every model output into a checked Python object. Schema violations become exceptions, not silent bugs. For agents that drive real actions — bookings, payments, infrastructure changes — this is essential.

    Fine-tuning makes that type safety stable. Without fine-tuning, the model is an unreliable producer of structured outputs and Pydantic AI's validator fires constantly. With fine-tuning on the exact schemas the agent uses, the model becomes a reliable producer and the validator is mostly a guard rail rather than a recurring failure point. The two layers complement each other: training teaches the model to produce structured outputs, validation enforces the contract at runtime.

    On-device inference makes the economics work. A frontier API call for a multi-step booking flow costs $0.01–$0.05 depending on context. At 1,000 daily active users running 5 booking flows each, that's $50–$250 per day, or $1,500–$7,500 per month. On-device, the cost is the model size on the user's storage and a few hundred milliseconds of CPU/GPU time per inference. The bill doesn't grow with your user base.
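    The arithmetic behind those numbers, for plugging in your own traffic:

    low, high = 0.01, 0.05                    # USD per multi-step booking flow
    daily_users, flows_per_user = 1_000, 5

    flows_per_day = daily_users * flows_per_user          # 5,000 flows/day
    daily = (low * flows_per_day, high * flows_per_day)   # ($50, $250) per day
    monthly = (daily[0] * 30, daily[1] * 30)              # ($1,500, $7,500) per month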

    The combination produces something rare in 2026: an agent system that's both engineered well and economical at scale. Pydantic AI handles the engineering. Ertas-trained models handle the specialization. The Ertas Deployment CLI gets you to mobile.

    Getting from prototype to shipping

    A typical pipeline for shipping this kind of agent looks like:

    1. Prototype against an API (1–2 days). Build the Pydantic AI agent against OpenAI or Claude. Get the schemas right, get the tool flow right, prove the value.
    2. Curate a dataset (1–2 days). Use Data Craft and a few hundred conversations from your prototype. Validate examples against the schemas.
    3. Fine-tune in Studio (under an hour of training time, plus eval). Iterate on the dataset until eval metrics clear 95%.
    4. Ship to mobile (under an hour with the Deployment CLI). Run the CLI against your iOS, Android, Flutter, or React Native project. Replace the API call in your Pydantic AI agent with the local endpoint.
    5. Iterate on production traces. Pydantic AI's traces (especially through Logfire) become the next round of training data. Studio supports incremental fine-tuning from production traces, so the model improves over time without ever sending user data off-device.

    The first three steps used to be the hard part. Pydantic AI cut the prototype-to-production gap from "weeks of prompt engineering" to "an afternoon of dataset curation." Studio cuts the dataset-to-fine-tune gap from "an MLE-month of work" to "a couple of hours." The Ertas Deployment CLI closes the last and most stubborn gap: training a model is one thing, shipping it into a real iOS or Android app is another, and that step has historically been 20–40 hours of llama.cpp build configuration that most app builders never finish.

    Together, the four pieces — Pydantic AI for engineering, Studio for fine-tuning, Data Craft for datasets, the Deployment CLI for shipping — collapse the timeline from prototype to production-on-mobile from "a quarter of work" to "a week of work." For mobile app builders feeling the API cost cliff bite between 500 and 5,000 users, that's the difference between staying on the cliff and getting off it.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
