
Mastra + Vercel AI SDK + On-Device GGUF: A TypeScript Mobile Agent Stack With No API Costs
TypeScript-first mobile builders don't have to use Python agent frameworks. Mastra and the Vercel AI SDK plus a fine-tuned 4B model running on-device through llama.cpp produce a complete agent stack with zero per-token costs.
Updated 2026-05-10 — Reflects the early-May Mastra releases that landed after this guide was written. The May 1 release added a new ChannelProvider architecture, a Slack provider with OAuth, the @mastra/nestjs adapter, and a Google Drive WorkspaceFilesystem; the May 4 release added relationship-based FGA authorization, scheduled cron-based workflows, and the new @mastra/browser-viewer for end-to-end browser automation. None of these change the on-device fine-tuned-model pattern below — they expand the surrounding platform.
Most agent framework discourse in 2026 still assumes Python. LangGraph, CrewAI, AutoGen, Pydantic AI — the canonical references all live in the Python ecosystem. For backend engineers and ML practitioners, that's a sensible default. For mobile app builders shipping React Native, Expo, or hybrid Capacitor apps, it's the wrong language. A TypeScript codebase shouldn't need a Python sidecar to run an agent.
The TypeScript ecosystem now has two excellent agent frameworks that solve this. Mastra crossed 22,000 GitHub stars and shipped 1.0 in January 2026. The Vercel AI SDK has been the de facto streaming-first toolkit for nearly two years and now backs a meaningful fraction of all production LLM apps written in TypeScript. Both work cleanly with self-hosted models, both are designed around edge-native deployment, and both pair unusually well with a fine-tuned 4B model running on-device.
This guide walks through the full TypeScript-native mobile agent stack: Mastra for orchestration, the Vercel AI SDK for inference, an Ertas-trained Qwen3-4B or Gemma 4 E4B model exported as GGUF, and the Ertas Deployment CLI to ship it into a React Native app. End to end, the stack runs without ever calling a hosted API after the initial training step.
The two TypeScript frameworks, briefly
Mastra is the higher-level option. It gives you typed agent definitions, declarative workflows, durable memory, evals, and RAG primitives in one batteries-included package. Tool definitions are idiomatic TypeScript with Zod schemas. Workflows are step-based DAGs that survive process restarts. Memory and evals integrate without extra glue. Mastra is what you reach for when you want a complete agent platform shaped for the JavaScript runtime.
The Vercel AI SDK is the lower-level option. It exposes streaming primitives, structured output via Zod, and a generic provider abstraction over more than 90 model providers — Anthropic, OpenAI, Google, Mistral, Cohere, plus self-hosted runners like Ollama and llama.cpp. The SDK is also what Mastra calls into for inference. So in practice you don't choose one or the other: Mastra gives you the orchestration layer, the Vercel AI SDK gives you the inference layer, and a fine-tuned local model gives you the cost structure.
The combination is the closest thing TypeScript has to the Pydantic AI plus Ollama story Python developers have been building with all year — but native to the runtime mobile builders already use.
What we're building
The example agent is a workout planner for a React Native fitness app. The agent reads a natural-language request, picks the right tools, and produces a validated plan. It has three tools:
- get_user_profile() returns the user's age, weight, and training history
- find_recent_workouts(limit: number) returns the last N workouts as structured records
- propose_workout(focus: string, duration_min: number, difficulty: string) produces a structured workout plan
The output is a WorkoutPlan Zod object. Mastra validates every output against the schema. Tool calls are validated against their input schemas. The whole agent runs inside the user's phone with no network calls except optional telemetry.
This is the kind of agent that gets expensive fast on a frontier API. A user logging two workouts a day generates four to six agent calls per session, multi-turn, with non-trivial context. At 10,000 monthly active users, you're spending more on inference than on hosting.
Step 1: train the model in Studio
Open Ertas Studio and pick a base model. For TypeScript-friendly mobile deployment in 2026 the two strong choices are Qwen3-4B-Instruct and Gemma 4 E4B. Both fit comfortably on modern phones (about 2.5 GB at Q4_K_M), both produce reliable structured outputs after fine-tuning, and both work with llama.cpp's mobile FFI. Qwen3 has a slight edge on multi-step tool calling; Gemma 4 E4B has a slight edge on instruction-following nuance. Either is a fine starting point.
Define the tool schemas in Data Craft. Studio reads your tool signatures (paste them as Zod schemas, JSON Schema, or the TypeScript function signatures themselves) and uses the structure as the training target. For a workout planner, aim for around 500 examples covering single-tool calls, multi-tool sequences, and refusals (out-of-scope requests).
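For a sense of shape, one record in such a dataset might look like the following. This is purely illustrative, since Data Craft defines the actual format, but it shows the pattern the agent relies on later: look up history before planning.
const example = {
  // A hypothetical training record (illustrative shape only; Data Craft
  // defines the real schema). The assistant turn is the training target.
  messages: [
    { role: "user", content: "I trained legs Monday. What should I do today?" },
    {
      role: "assistant",
      tool_calls: [{ name: "find_recent_workouts", arguments: { limit: 5 } }],
    },
  ],
};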
Train with the default tool-calling QLoRA configuration: rank 32, three epochs. Validation loss typically flattens around epoch 2.5. On the standard GPU tier the run completes in under an hour. Studio's eval suite reports tool-name accuracy, parameter-name accuracy, and parameter-value accuracy. Production-ready models clear 95% on all three.
Step 2: export to GGUF and ship
Studio's export pipeline produces a GGUF binary. For a 4B model on mobile, Q4_K_M is the right default — about 2.5 GB on disk, around 3 GB working memory.
Run the Ertas Deployment CLI against your existing React Native project:
ertas deploy mobile \
--project ./my-fitness-app \
--model ertas-workout-agent-4b.gguf \
--framework react-native
The CLI handles three things that have historically eaten 20 to 40 hours of llama.cpp build engineering. It installs the mobile FFI bindings (with the Metal backend on iOS and the OpenCL/Vulkan backend on Android). It registers the GGUF asset in the bundler so the model ships inside the app. And it stands up a local HTTP-style inference endpoint inside the app process — typically reachable on a device-local socket — that mirrors the OpenAI-compatible API shape that the Vercel AI SDK already knows how to call.
The same CLI supports Flutter, native iOS Swift, and native Android Kotlin; what's specific to this guide is the TypeScript shape of the deliverable.
Step 3: configure the Vercel AI SDK
The Vercel AI SDK has a community-maintained provider for Ollama and a generic OpenAI-compatible provider that points at any OpenAI-shaped endpoint. The Ertas Deployment CLI exposes its on-device endpoint in OpenAI-compatible form by default, so you wire it up like any other provider:
import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
export const ertasLocal = createOpenAICompatible({
name: "ertas-on-device",
baseURL: "http://localhost:8080/v1",
apiKey: "not-needed",
});
export const workoutModel = ertasLocal("ertas-workout-agent-4b");
In development, you point baseURL at Ollama on your laptop (port 11434). In production on the device, the Ertas Deployment CLI exposes the local endpoint at the configured port (8080 by default) and the same SDK call shape works without modification. The SDK doesn't care that the inference is running inside the app instead of across the network — it sees the same OpenAI-compatible response stream either way.
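In code, that split can key off React Native's built-in __DEV__ flag. A minimal sketch, assuming the two default ports mentioned above (note that Android emulators reach the host machine at 10.0.2.2 rather than localhost):
import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
// __DEV__ is React Native's development flag.
const baseURL = __DEV__
  ? "http://localhost:11434/v1" // Ollama's OpenAI-compatible endpoint on the laptop
  : "http://localhost:8080/v1"; // the Ertas on-device endpoint (default port)
export const ertasLocal = createOpenAICompatible({
  name: "ertas-on-device",
  baseURL,
  apiKey: "not-needed",
});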
Step 4: define the Mastra agent
Now wire Mastra to the local model and define the agent and tools:
import { Agent } from "@mastra/core/agent";
import { createTool } from "@mastra/core/tools";
import { z } from "zod";
import { workoutModel } from "./ertas-local";
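// fitnessDb and planner below are stand-ins for the app's own data layer
// and plan builder; swap in whatever persistence your app already uses.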
const WorkoutPlan = z.object({
focus: z.string(),
duration_min: z.number(),
difficulty: z.enum(["easy", "moderate", "hard"]),
blocks: z.array(
z.object({
name: z.string(),
sets: z.number(),
reps: z.number(),
}),
),
});
const getUserProfile = createTool({
id: "get_user_profile",
description: "Get the current user's age, weight, and training history.",
inputSchema: z.object({}),
outputSchema: z.object({
age: z.number(),
weight_kg: z.number(),
history: z.array(z.string()),
}),
execute: async () => fitnessDb.getProfile(),
});
const findRecentWorkouts = createTool({
id: "find_recent_workouts",
description: "Return the user's most recent workouts.",
inputSchema: z.object({ limit: z.number().default(5) }),
outputSchema: z.array(
z.object({ date: z.string(), name: z.string(), notes: z.string() }),
),
execute: async ({ context }) => fitnessDb.recent(context.limit),
});
const proposeWorkout = createTool({
id: "propose_workout",
description: "Produce a structured workout plan for the user.",
inputSchema: z.object({
focus: z.string(),
duration_min: z.number(),
difficulty: z.enum(["easy", "moderate", "hard"]),
}),
outputSchema: WorkoutPlan,
execute: async ({ context }) => planner.generate(context),
});
export const workoutAgent = new Agent({
name: "workout-planner",
instructions:
"You plan workouts. Use the available tools to read the user's history before proposing a plan.",
model: workoutModel,
tools: { getUserProfile, findRecentWorkouts, proposeWorkout },
});
const result = await workoutAgent.generate(
"Plan me a 45-minute moderate session focused on legs.",
{ output: WorkoutPlan },
);
console.log(result.object);
Two things are happening that the code doesn't make obvious. First, the agent reads the user's profile and recent workouts before proposing a plan because the fine-tuned model was trained on examples that establish that pattern. A generic open-weight model would frequently skip the history lookup and propose a generic plan; the trained model uses the available tools as designed. Second, the output is validated against WorkoutPlan by Mastra. If the model emits an invalid object, the validator rejects it and Mastra surfaces a typed error.
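If you want to handle that rejection explicitly, a minimal sketch looks like this. The exact error type Mastra throws isn't shown above, so this catches generically; render and showFallbackPlan are hypothetical UI helpers:
try {
  const result = await workoutAgent.generate(
    "Plan me a 45-minute moderate session focused on legs.",
    { output: WorkoutPlan },
  );
  render(result.object); // typed as z.infer<typeof WorkoutPlan>
} catch (err) {
  // A schema-validation failure from the model lands here; retrying or
  // falling back to a canned plan are both reasonable in a fitness UI.
  showFallbackPlan();
}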
The agent runs entirely on the device. There are no API calls in the inference path. The user's profile data never leaves the phone. The only network traffic the agent generates is whatever telemetry you opt into.
Why TypeScript-native matters for mobile
Mobile teams that ship React Native or Expo apps have a meaningful productivity advantage when the agent layer is in the same language as the app. Type definitions flow from the Zod schemas through Mastra into the React Native UI without any cross-language code generation. Errors thrown in the agent surface as typed exceptions in the React tree. Streaming responses from the Vercel AI SDK plug into the same useChat-style hooks that mobile developers are already using.
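As an example, token streaming against the local model is the same streamText call you'd write against a hosted provider. A sketch, reusing the workoutModel from Step 3 (appendToUI is a hypothetical rendering helper):
import { streamText } from "ai";
import { workoutModel } from "./ertas-local";
const stream = streamText({
  model: workoutModel,
  prompt: "Summarize my week of training in two sentences.",
});
// Token-by-token rendering, identical whether inference is local or hosted.
for await (const token of stream.textStream) {
  appendToUI(token);
}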
The Python-based alternative requires a sidecar service. You either run a Python backend that the React Native app calls, or you embed CPython in the mobile binary, or you find some hybrid that splits work across runtimes. All three options add deployment complexity, bundle size, and crash surface area. None of them are necessary if the agent layer is already TypeScript.
The combination of Mastra plus the Vercel AI SDK fits the runtime mobile builders are already in. The Ertas Deployment CLI fits the same runtime. End to end, you're shipping a single TypeScript app with a model file alongside it.
The agentic cost cliff, again
The economic case is identical to the one Python-based on-device agents make, scaled for mobile usage patterns. Agent calls in mobile apps tend to be high-frequency and multi-turn — fitness apps, journaling apps, calendar apps, planners. Per-call cost on a frontier API runs around $0.01 to $0.04 depending on context length and tool depth.
A representative mobile-app cost curve, assuming roughly a tenth of monthly actives hit the agent on a given day and the low end of the per-call range:
- 1,000 MAU at 4 calls/day average → roughly $120/month in inference
- 10,000 MAU → roughly $1,200/month
- 40,000 MAU → roughly $4,800/month
- 100,000 MAU → roughly $12,000/month
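Worked through at the low end: 1,000 MAU × ~10% daily actives × 4 calls/day × 30 days × $0.01/call ≈ $120/month, and the rest of the curve scales linearly from there.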
These numbers tend to land harder on mobile teams than on web teams because mobile monetization typically runs through subscriptions, not seat licenses. Your unit economics start to break around the 10,000 to 40,000 MAU band — exactly the band where you're trying to invest in growth, not retreat from inference costs.
On-device, the cost structure is fixed. The marginal cost of an inference is electricity. The fixed cost is the storage of the model on the device, paid once at install. Going from 10,000 to 100,000 MAU doesn't move the inference line.
The agentic cost cliff has been the dominant force shaping the on-device migration. TypeScript mobile builders have been waiting for an agent stack that lets them respond to it without giving up the runtime.
What you don't have to give up
There are three concerns mobile teams typically raise when considering an on-device move, and the Mastra plus Vercel AI SDK plus Ertas-trained model stack addresses each.
Streaming. The Vercel AI SDK's streaming primitives work the same way against a local llama.cpp endpoint as they do against a hosted API. Token-by-token rendering in your React Native UI is unchanged.
Structured output. Zod validation in the SDK and in Mastra runs unchanged. Fine-tuning makes structured output reliable enough that validation rarely fails.
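If you want schema-validated output at the SDK level rather than through Mastra, the AI SDK's generateObject works against the local model too. A sketch with a trimmed-down schema:
import { generateObject } from "ai";
import { z } from "zod";
import { workoutModel } from "./ertas-local";
const { object } = await generateObject({
  model: workoutModel,
  schema: z.object({ focus: z.string(), duration_min: z.number() }),
  prompt: "Suggest a quick recovery session.",
});
// `object` is validated against the schema before your code sees it.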
Memory and workflows. Mastra's memory and workflow primitives don't depend on the model running remotely. The same vector store, the same workflow definitions, the same eval harness work against a local model.
What you do give up is the ability to swap to a frontier model trivially. Once you've fine-tuned and shipped a specific model, switching to a different one means another fine-tune. In practice, this is the same tradeoff every team that has moved to a self-hosted model has accepted, and it's a tradeoff most mobile teams will take cheerfully in exchange for predictable economics.
Getting from prototype to shipping
A typical pipeline looks like:
- Prototype against an API. Build the Mastra agent against Anthropic or OpenAI through the Vercel AI SDK. Get the tools right, prove the value.
- Curate a dataset. A few hundred examples from your prototype, validated in Data Craft against your Zod schemas.
- Fine-tune in Studio. Iterate on the dataset until eval metrics clear 95%.
- Ship to the device. Run the Ertas Deployment CLI against your React Native project. Replace the API-pointing provider in your Vercel AI SDK config with the local provider (see the sketch after this list). Your Mastra agent code doesn't change.
- Iterate on traces. Production traces become the next round of training data. Studio supports incremental fine-tuning from traces, so the model improves while user data stays on-device.
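The provider swap in step 4 is a two-line change. A sketch, assuming the prototype used the Anthropic provider through the AI SDK:
// Before: the prototype pointed at a hosted API.
// import { anthropic } from "@ai-sdk/anthropic";
// export const workoutModel = anthropic("claude-sonnet-4-5");
// After: the on-device provider from Step 3. Nothing else changes.
import { ertasLocal } from "./ertas-local";
export const workoutModel = ertasLocal("ertas-workout-agent-4b");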
The first three steps used to be the hard part. Mastra and the Vercel AI SDK moved the prototype-to-production gap from "weeks of bespoke streaming code" to "an afternoon of agent definition." Studio cut the dataset-to-fine-tune gap from MLE-months to hours. The Ertas Deployment CLI closed the last gap — the one most TypeScript app builders never bothered with because the llama.cpp build engineering was prohibitive.
Mastra plus the Vercel AI SDK plus an Ertas-trained on-device model is the agent stack TypeScript mobile builders have been waiting for. No Python sidecar. No per-token bill. No cost cliff between 10,000 and 100,000 users.