
Mastra + Vercel AI SDK + On-Device GGUF: A TypeScript Mobile Agent Stack With No API Costs
TypeScript-first mobile builders don't have to use Python agent frameworks. Mastra and the Vercel AI SDK plus a fine-tuned 4B model running on-device through llama.cpp produce a complete agent stack with zero per-token costs.
Updated 2026-05-10 — Reflects the early-May Mastra releases that landed after this guide was written. The May 1 release added a new ChannelProvider architecture, a Slack provider with OAuth, the @mastra/nestjs adapter, and a Google Drive WorkspaceFilesystem; the May 4 release added relationship-based FGA authorization, scheduled cron-based workflows, and the new @mastra/browser-viewer for end-to-end browser automation. None of these change the on-device fine-tuned-model pattern below — they expand the surrounding platform.
Most agent framework discourse in 2026 still assumes Python. LangGraph, CrewAI, AutoGen, Pydantic AI — the canonical references all live in the Python ecosystem. For backend engineers and ML practitioners, that's a sensible default. For mobile app builders shipping React Native, Expo, or hybrid Capacitor apps, it's the wrong language. A TypeScript codebase shouldn't need a Python sidecar to run an agent.
The TypeScript ecosystem now has two excellent agent frameworks that solve this. Mastra crossed 22,000 GitHub stars and shipped 1.0 in January 2026. The Vercel AI SDK has been the de facto streaming-first toolkit for nearly two years and now backs a meaningful fraction of all production LLM apps written in TypeScript. Both work cleanly with self-hosted models, both are designed around edge-native deployment, and both pair unusually well with a fine-tuned 4B model running on-device.
This guide walks through the full TypeScript-native mobile agent stack: Mastra for orchestration, the Vercel AI SDK for inference, an Ertas-trained Qwen3-4B or Gemma 4 E4B model exported as GGUF, and the Ertas Deployment CLI to ship it into a React Native app. End to end, the stack runs without ever calling a hosted API after the initial training step.
The two TypeScript frameworks, briefly
Mastra is the higher-level option. It gives you typed agent definitions, declarative workflows, durable memory, evals, and RAG primitives in one batteries-included package. Tool definitions are idiomatic TypeScript with Zod schemas. Workflows are step-based DAGs that survive process restarts. Memory and evals integrate without extra glue. Mastra is what you reach for when you want a complete agent platform shaped for the JavaScript runtime.
The Vercel AI SDK is the lower-level option. It exposes streaming primitives, structured output via Zod, and a generic provider abstraction over more than 90 model providers — Anthropic, OpenAI, Google, Mistral, Cohere, plus self-hosted runners like Ollama and llama.cpp. The SDK is also what Mastra calls into for inference. So in practice you don't choose one or the other: Mastra gives you the orchestration layer, the Vercel AI SDK gives you the inference layer, and a fine-tuned local model gives you the cost structure.
The combination is the closest thing TypeScript has to the Pydantic AI plus Ollama story Python developers have been building with all year — but native to the runtime mobile builders already use.
What we're building
The example agent is a workout planner for a React Native fitness app. The agent reads a natural-language request, picks the right tools, and produces a validated plan. It has three tools:
- get_user_profile() returns the user's age, weight, and training history
- find_recent_workouts(limit: number) returns the last N workouts as structured records
- propose_workout(focus: string, duration_min: number, difficulty: string) produces a structured workout plan
The output is a WorkoutPlan Zod object. Mastra validates every output against the schema. Tool calls are validated against their input schemas. The whole agent runs inside the user's phone with no network calls except optional telemetry.
This is the kind of agent that gets expensive fast on a frontier API. A user logging two workouts a day generates four to six agent calls per session, multi-turn, with non-trivial context. At 10,000 monthly active users, you're spending more on inference than on hosting.
Step 1: train the model in Studio
Open Ertas Studio and pick a base model. For TypeScript-friendly mobile deployment in 2026 the two strong choices are Qwen3-4B-Instruct and Gemma 4 E4B. Both fit comfortably on modern phones (about 2.5 GB at Q4_K_M), both produce reliable structured outputs after fine-tuning, and both work with llama.cpp's mobile FFI. Qwen3 has a slight edge on multi-step tool calling; Gemma 4 E4B has a slight edge on instruction-following nuance. Either is a fine starting point.
Define the tool schemas in Data Craft. Studio reads your tool signatures (paste them as Zod schemas, JSON Schema, or the TypeScript function signatures themselves) and uses the structure as the training target. For a workout planner, aim for around 500 examples covering single-tool calls, multi-tool sequences, and refusals (out-of-scope requests).
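For a sense of shape, one record in such a dataset might look like the following. This is purely illustrative, since Data Craft defines the actual format, but it shows the pattern the agent relies on later: look up history before planning.
const example = {
  // A hypothetical training record (illustrative shape only; Data Craft
  // defines the real schema). The assistant turn is the training target.
  messages: [
    { role: "user", content: "I trained legs Monday. What should I do today?" },
    {
      role: "assistant",
      tool_calls: [{ name: "find_recent_workouts", arguments: { limit: 5 } }],
    },
  ],
};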
Train with the default tool-calling QLoRA configuration: rank 32, three epochs. Validation loss typically flattens around epoch 2.5. On the standard GPU tier the run completes in under an hour. Studio's eval suite reports tool-name accuracy, parameter-name accuracy, and parameter-value accuracy. Production-ready models clear 95% on all three.
Step 2: export to GGUF and ship
Studio's export pipeline produces a GGUF binary. For a 4B model on mobile, Q4_K_M is the right default — about 2.5 GB on disk, around 3 GB working memory.
Run the Ertas Deployment CLI against your existing React Native project:
ertas deploy mobile \
--project ./my-fitness-app \
--model ertas-workout-agent-4b.gguf \
--framework react-native
The CLI handles three things that have historically eaten 20 to 40 hours of llama.cpp build engineering. It installs the mobile FFI bindings (with the Metal backend on iOS and the OpenCL/Vulkan backend on Android). It registers the GGUF asset in the bundler so the model ships inside the app. And it stands up a local HTTP-style inference endpoint inside the app process — typically reachable on a device-local socket — that mirrors the OpenAI-compatible API shape that the Vercel AI SDK already knows how to call.
The same CLI supports Flutter, native iOS Swift, and native Android Kotlin; what's specific to this guide is the TypeScript shape of the deliverable.
Step 3: configure the Vercel AI SDK
The Vercel AI SDK has a community-maintained provider for Ollama and a generic OpenAI-compatible provider that points at any OpenAI-shaped endpoint. The Ertas Deployment CLI exposes its on-device endpoint in OpenAI-compatible form by default, so you wire it up like any other provider:
import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
export const ertasLocal = createOpenAICompatible({
name: "ertas-on-device",
baseURL: "http://localhost:8080/v1",
apiKey: "not-needed",
});
export const workoutModel = ertasLocal("ertas-workout-agent-4b");
In development, you point baseURL at Ollama on your laptop (port 11434). In production on the device, the Ertas Deployment CLI exposes the local endpoint at the configured port (8080 by default) and the same SDK call shape works without modification. The SDK doesn't care that the inference is running inside the app instead of across the network — it sees the same OpenAI-compatible response stream either way.
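In code, that split can key off React Native's built-in __DEV__ flag. A minimal sketch, assuming the two default ports mentioned above (note that Android emulators reach the host machine at 10.0.2.2 rather than localhost):
import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
// __DEV__ is React Native's development flag.
const baseURL = __DEV__
  ? "http://localhost:11434/v1" // Ollama's OpenAI-compatible endpoint on the laptop
  : "http://localhost:8080/v1"; // the Ertas on-device endpoint (default port)
export const ertasLocal = createOpenAICompatible({
  name: "ertas-on-device",
  baseURL,
  apiKey: "not-needed",
});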
Step 4: define the Mastra agent
Now wire Mastra to the local model and define the agent and tools:
import { Agent } from "@mastra/core/agent";
import { createTool } from "@mastra/core/tools";
import { z } from "zod";
import { workoutModel } from "./ertas-local";
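// fitnessDb and planner below are stand-ins for the app's own data layer
// and plan builder; swap in whatever persistence your app already uses.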
const WorkoutPlan = z.object({
focus: z.string(),
duration_min: z.number(),
difficulty: z.enum(["easy", "moderate", "hard"]),
blocks: z.array(
z.object({
name: z.string(),
sets: z.number(),
reps: z.number(),
}),
),
});
const getUserProfile = createTool({
id: "get_user_profile",
description: "Get the current user's age, weight, and training history.",
inputSchema: z.object({}),
outputSchema: z.object({
age: z.number(),
weight_kg: z.number(),
history: z.array(z.string()),
}),
execute: async () => fitnessDb.getProfile(),
});
const findRecentWorkouts = createTool({
id: "find_recent_workouts",
description: "Return the user's most recent workouts.",
inputSchema: z.object({ limit: z.number().default(5) }),
outputSchema: z.array(
z.object({ date: z.string(), name: z.string(), notes: z.string() }),
),
execute: async ({ context }) => fitnessDb.recent(context.limit),
});
const proposeWorkout = createTool({
id: "propose_workout",
description: "Produce a structured workout plan for the user.",
inputSchema: z.object({
focus: z.string(),
duration_min: z.number(),
difficulty: z.enum(["easy", "moderate", "hard"]),
}),
outputSchema: WorkoutPlan,
execute: async ({ context }) => planner.generate(context),
});
export const workoutAgent = new Agent({
name: "workout-planner",
instructions:
"You plan workouts. Use the available tools to read the user's history before proposing a plan.",
model: workoutModel,
tools: { getUserProfile, findRecentWorkouts, proposeWorkout },
});
const result = await workoutAgent.generate(
"Plan me a 45-minute moderate session focused on legs.",
{ output: WorkoutPlan },
);
console.log(result.object);
Two things are happening that the code doesn't make obvious. First, the agent reads the user's profile and recent workouts before proposing a plan because the fine-tuned model was trained on examples that establish that pattern. A generic open-weight model would frequently skip the history lookup and propose a generic plan; the trained model uses the available tools as designed. Second, the output is validated against WorkoutPlan by Mastra. If the model emits an invalid object, the validator rejects it and Mastra surfaces a typed error.
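If you want to handle that rejection explicitly, a minimal sketch looks like this. The exact error type Mastra throws isn't shown above, so this catches generically; render and showFallbackPlan are hypothetical UI helpers:
try {
  const result = await workoutAgent.generate(
    "Plan me a 45-minute moderate session focused on legs.",
    { output: WorkoutPlan },
  );
  render(result.object); // typed as z.infer<typeof WorkoutPlan>
} catch (err) {
  // A schema-validation failure from the model lands here; retrying or
  // falling back to a canned plan are both reasonable in a fitness UI.
  showFallbackPlan();
}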
The agent runs entirely on the device. There are no API calls in the inference path. The user's profile data never leaves the phone. The only network traffic the agent generates is whatever telemetry you opt into.
Why TypeScript-native matters for mobile
Mobile teams that ship React Native or Expo apps have a meaningful productivity advantage when the agent layer is in the same language as the app. Type definitions flow from the Zod schemas through Mastra into the React Native UI without any cross-language code generation. Errors thrown in the agent surface as typed exceptions in the React tree. Streaming responses from the Vercel AI SDK plug into the same useChat-style hooks that mobile developers are already using.
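As an example, token streaming against the local model is the same streamText call you'd write against a hosted provider. A sketch, reusing the workoutModel from Step 3 (appendToUI is a hypothetical rendering helper):
import { streamText } from "ai";
import { workoutModel } from "./ertas-local";
const stream = streamText({
  model: workoutModel,
  prompt: "Summarize my week of training in two sentences.",
});
// Token-by-token rendering, identical whether inference is local or hosted.
for await (const token of stream.textStream) {
  appendToUI(token);
}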
The Python-based alternative requires a sidecar service. You either run a Python backend that the React Native app calls, or you embed CPython in the mobile binary, or you find some hybrid that splits work across runtimes. All three options add deployment complexity, bundle size, and crash surface area. None of them are necessary if the agent layer is already TypeScript.
The combination of Mastra plus the Vercel AI SDK fits the runtime mobile builders are already in. The Ertas Deployment CLI fits the same runtime. End to end, you're shipping a single TypeScript app with a model file alongside it.
The agentic cost cliff, again
The economic case is identical to the one Python-based on-device agents make, scaled for mobile usage patterns. Agent calls in mobile apps tend to be high-frequency and multi-turn — fitness apps, journaling apps, calendar apps, planners. Per-call cost on a frontier API runs around $0.01 to $0.04 depending on context length and tool depth.
A representative mobile-app cost curve, assuming roughly a tenth of monthly actives hit the agent on a given day and the low end of the per-call range:
- 1,000 MAU at 4 calls/day average → roughly $120/month in inference
- 10,000 MAU → roughly $1,200/month
- 40,000 MAU → roughly $4,800/month
- 100,000 MAU → roughly $12,000/month
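Worked through at the low end: 1,000 MAU × ~10% daily actives × 4 calls/day × 30 days × $0.01/call ≈ $120/month, and the rest of the curve scales linearly from there.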
These numbers tend to land harder on mobile teams than on web teams because mobile monetization typically runs through subscriptions, not seat licenses. Your unit economics start to break around the 10,000 to 40,000 MAU band — exactly the band where you're trying to invest in growth, not retreat from inference costs.
On-device, the cost structure is fixed. The marginal cost of an inference is electricity. The fixed cost is the storage of the model on the device, paid once at install. Going from 10,000 to 100,000 MAU doesn't move the inference line.
The agentic cost cliff has been the dominant force shaping the on-device migration. TypeScript mobile builders have been waiting for an agent stack that lets them respond to it without giving up the runtime.
What you don't have to give up
There are three concerns mobile teams typically raise when considering an on-device move, and the Mastra plus Vercel AI SDK plus Ertas-trained model stack addresses each.
Streaming. The Vercel AI SDK's streaming primitives work the same way against a local llama.cpp endpoint as they do against a hosted API. Token-by-token rendering in your React Native UI is unchanged.
Structured output. Zod validation in the SDK and in Mastra runs unchanged. Fine-tuning makes structured output reliable enough that validation rarely fails.
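If you want schema-validated output at the SDK level rather than through Mastra, the AI SDK's generateObject works against the local model too. A sketch with a trimmed-down schema:
import { generateObject } from "ai";
import { z } from "zod";
import { workoutModel } from "./ertas-local";
const { object } = await generateObject({
  model: workoutModel,
  schema: z.object({ focus: z.string(), duration_min: z.number() }),
  prompt: "Suggest a quick recovery session.",
});
// `object` is validated against the schema before your code sees it.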
Memory and workflows. Mastra's memory and workflow primitives don't depend on the model running remotely. The same vector store, the same workflow definitions, the same eval harness work against a local model.
What you do give up is the ability to swap to a frontier model trivially. Once you've fine-tuned and shipped a specific model, switching to a different one means another fine-tune. In practice, this is the same tradeoff every team that has moved to a self-hosted model has accepted, and it's a tradeoff most mobile teams will take cheerfully in exchange for predictable economics.
Getting from prototype to shipping
A typical pipeline looks like:
- Prototype against an API. Build the Mastra agent against Anthropic or OpenAI through the Vercel AI SDK. Get the tools right, prove the value.
- Curate a dataset. A few hundred examples from your prototype, validated in Data Craft against your Zod schemas.
- Fine-tune in Studio. Iterate on the dataset until eval metrics clear 95%.
- Ship to the device. Run the Ertas Deployment CLI against your React Native project. Replace the API-pointing provider in your Vercel AI SDK config with the local provider (see the sketch after this list). Your Mastra agent code doesn't change.
- Iterate on traces. Production traces become the next round of training data. Studio supports incremental fine-tuning from traces, so the model improves while user data stays on-device.
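The provider swap in step 4 is a two-line change. A sketch, assuming the prototype used the Anthropic provider through the AI SDK:
// Before: the prototype pointed at a hosted API.
// import { anthropic } from "@ai-sdk/anthropic";
// export const workoutModel = anthropic("claude-sonnet-4-5");
// After: the on-device provider from Step 3. Nothing else changes.
import { ertasLocal } from "./ertas-local";
export const workoutModel = ertasLocal("ertas-workout-agent-4b");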
The first three steps used to be the hard part. Mastra and the Vercel AI SDK moved the prototype-to-production gap from "weeks of bespoke streaming code" to "an afternoon of agent definition." Studio cut the dataset-to-fine-tune gap from MLE-months to hours. The Ertas Deployment CLI closed the last gap — the one most TypeScript app builders never bothered with because the llama.cpp build engineering was prohibitive.
Mastra plus the Vercel AI SDK plus an Ertas-trained on-device model is the agent stack TypeScript mobile builders have been waiting for. No Python sidecar. No per-token bill. No cost cliff between 10,000 and 100,000 users.