
Llama Stack on a Phone: Self-Hosted Llama Agents With a Fine-Tuned Llama 4 Model
Meta's Llama Stack is the canonical reference architecture for Llama-based agents. Combine it with a fine-tuned Llama 4 derivative and the Swift/Kotlin client SDKs and you get a complete agent stack running entirely on the user's phone.
Llama Stack is Meta's official reference architecture for building agents on Llama models. Agents, tools, safety filters, evaluations, and telemetry sit behind a single unified API surface. It's the closest thing the open-weight world has to a canonical agent stack, and its design intent is plain: stop reinventing the agent loop, the safety layer, and the eval harness in every project.
Most coverage of Llama Stack assumes a server deployment — a Kubernetes cluster, a GPU pool, an API endpoint your app calls over the network. That's the obvious path, and it's also the path that re-creates every economic and privacy problem the open-weight ecosystem was supposed to solve. Per-token cost stays positive. Network round trips stay slow. User data still leaves the device.
What's less appreciated is that Llama Stack ships first-class Swift and Kotlin client SDKs. The same agent abstraction can run with the model embedded in an iOS or Android app, talking to a local inference server bundled into the binary. The agent loop, the tool dispatcher, the safety filter, the telemetry pipeline — all of it can run on the user's phone, against a fine-tuned Llama 4 derivative the app ships with.
This guide walks through that architecture: a complete Llama Stack agent running on-device, powered by a Llama 4 model fine-tuned in Ertas Studio and shipped into the mobile app via the Ertas Deployment CLI.
The architecture
There are three components and they fit together cleanly.
The first is on-device inference. Llama Stack's Swift client SDK ships a LocalInference class backed by ExecuTorch that runs the model directly inside the host iOS app — no separate Llama Stack server process is required for inference. The Kotlin client provides an equivalent on-device adapter for Android. The model runs in the app's address space and the SDK translates Llama Stack's API calls into local inference calls.
The second is the model itself. The on-device inference adapter loads a GGUF (or ExecuTorch-format) binary the app ships with — typically a Llama 4 derivative fine-tuned on your domain data and quantized for mobile. The Ertas Deployment CLI handles the format conversion and asset registration.
The third is the agent platform. The Swift and Kotlin clients implement the Llama Stack agent loop, tool dispatch, safety filtering, and telemetry locally — the same primitives a server-side Llama Stack deployment would expose, just driven by on-device inference instead of a network endpoint. From the app's perspective, the API surface is identical to a remote Llama Stack deployment.
The result is a clean separation. The app code talks Llama Stack. The Llama Stack client drives the on-device inference adapter. The inference adapter talks to the quantized model binary through llama.cpp or ExecuTorch. None of it requires a network round trip.
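To make that separation concrete, here is a minimal sketch of the iOS wiring. LocalInference and the embedded client transport are the pieces named above; the initializer arguments and the model filename are placeholders rather than the SDK's exact signatures, and in practice the Deployment CLI generates this registration glue for you (more on that below).

import LlamaStack

// The quantized model binary ships inside the app bundle
// (placeholder filename; yours will differ).
let modelURL = Bundle.main.url(
    forResource: "your-finetuned-llama4",
    withExtension: "gguf"
)!

// On-device inference adapter: runs the model inside the app's address space.
// The class is the ExecuTorch-backed LocalInference described above; the exact
// initializer here is an assumption for illustration.
let localInference = LocalInference(modelURL: modelURL)

// The app only ever talks to the Llama Stack client. With the embedded transport,
// API calls become local inference calls instead of network requests. The glue
// that registers localInference as the client's inference provider is generated
// by the Deployment CLI and is not shown here.
let client = LlamaStackClient(transport: .embedded)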
Why Llama Stack specifically
There's a perfectly valid question lurking here: if you're going to run llama.cpp on-device anyway, why involve Llama Stack at all? Why not just call llama.cpp directly through its Swift or Kotlin bindings and skip the abstraction layer?
The honest answer is that home-rolled mobile inference setups skip the parts of an agent that are easy to skip and hard to add back. Tool calling becomes a bespoke parser. Safety filtering becomes either nothing or a hand-written regex. Evaluation becomes a few one-off scripts. Telemetry becomes whatever Logfire or OpenTelemetry call you remembered to add. Each gap is solvable in isolation; together, they produce the brittle agents that make on-device AI feel like a downgrade from API-backed agents.
Llama Stack closes those gaps as part of the standard architecture:
- Agents API gives you a structured agent loop with multi-turn memory and tool dispatch — no parser to write.
- Tools API registers your functions with typed schemas and handles routing automatically.
- Safety API runs Llama Guard (or your chosen safety model) on inputs and outputs before they reach the user.
- Eval API runs benchmarks against held-out test sets so you can detect regressions when you ship a new model version.
- Telemetry API captures structured traces of every agent run for debugging and continued training.
You get all of this against a local model, on a phone, without writing any of it yourself. That's the actual value proposition of Llama Stack on-device — not "an inference wrapper" but a complete agent platform that happens to support embedded execution.
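To make "typed schemas, no parser" concrete: the work the Tools API absorbs is essentially the following, shown here as a plain Swift sketch rather than the SDK's own types. The LogWorkoutArgs struct and the raw JSON are illustrative assumptions; the point is that a declared schema turns argument parsing into decoding, with violations caught at the boundary.

import Foundation

// The declared shape of log_workout's arguments. With the Tools API this schema
// lives in the tool registration; without it, every app re-implements decoding
// like this by hand.
struct LogWorkoutArgs: Codable {
    let exercise: String
    let sets: Int
    let reps: Int
    let weight: Double?   // optional: a missing weight means bodyweight
}

// JSON arguments as the model would emit them in a tool call.
let raw = #"{"exercise": "squat", "sets": 3, "reps": 10, "weight": 135}"#

do {
    let args = try JSONDecoder().decode(LogWorkoutArgs.self, from: Data(raw.utf8))
    print(args)
} catch {
    // Schema violations (wrong types, missing required fields) surface here
    // instead of silently producing a malformed tool call.
    print("schema violation: \(error)")
}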
The fine-tuning layer
A stock Llama 4 model is a generalist. For most agentic mobile use cases — fitness coaching, finance assistance, scheduling, customer support, anything domain-specific — a fine-tuned derivative is dramatically more reliable. Lower hallucination rate on your tools, fewer schema violations, more consistent style, better refusals on out-of-scope requests.
The pipeline looks like this:
- Curate the dataset in Ertas Studio. Studio's Data Craft module takes your tool schemas as input and produces structured training conversations covering tool calls, multi-step flows, and refusals. For a domain-specialized agent you typically want 400 to 800 examples — enough to teach the schemas, not so many that you start fitting noise.
- Fine-tune Llama 4 in Studio. Llama Stack originally shipped alongside Llama 3.2 in September 2024 and has tracked the Llama family through Llama 3.3 and Llama 4 — Llama 4 derivatives are the current canonical base for new on-device deployments and integrate with the Safety API and Eval API without translation. Studio's default config is QLoRA at rank 32 over 3 epochs. Training a 500-example dataset on a Llama 4 8B base typically completes inside an hour.
- Export to GGUF. Studio's export flow produces a quantized GGUF binary at the level you specify. For Llama 4 8B on phones, Q4_K_M lands around 5 GB on disk and runs cleanly on any modern flagship. For more aggressive footprints, Q3_K_M brings it down further at a small quality cost.
- Ship to mobile via the Ertas Deployment CLI. The CLI installs llama.cpp into the iOS or Android project, drops the GGUF into the right asset directory, registers it with Llama Stack's on-device inference adapter, and wires up the Swift or Kotlin client. From running the CLI to making the first agent call typically takes under fifteen minutes.
ertas deploy mobile \
--project ./fitness-app \
--model ertas-fitness-coach-llama4-8b.gguf \
--framework ios \
--runtime llama-stack
The --runtime llama-stack flag tells the CLI to wire up the Llama Stack on-device adapter (LocalInference on iOS, the Kotlin equivalent on Android) rather than a bare llama.cpp call. The CLI generates the inference-adapter registration code, configures the model asset, and hooks the Swift or Kotlin client into the host app's lifecycle so the agent platform is ready when the app launches.
A worked example: a fitness coaching agent
Here's the agent we'll build. A user opens a fitness app and types something like "I did 3x10 squats at 135, log it and tell me what to do next." The agent has three tools:
- log_workout(exercise: String, sets: Int, reps: Int, weight: Double) records the workout
- fetch_progress(exercise: String, weeks: Int) returns recent history
- suggest_routine(goal: String, equipment: [String]) proposes the next session
In Studio, we curate a dataset of around 600 conversations covering single-tool calls, multi-step flows (log then suggest), refusals (medical advice goes out of scope), and validation edge cases (missing weight defaults to bodyweight). Fine-tune Llama 4 8B over 3 epochs. Eval suite passes at 96% tool-name accuracy, 95% parameter-name accuracy, 94% parameter-value accuracy. Production-ready.
Run the Deployment CLI, point it at the iOS project, pick --runtime llama-stack. Two minutes later the project builds, the on-device inference adapter is registered with the model, and the Swift client is wired in.
Here's the entire client-side agent integration:
import LlamaStack

// Embedded transport: every API call is served by the on-device model, no network.
let client = LlamaStackClient(transport: .embedded)

// Create the agent: system instructions, tool registrations, Llama Guard safety.
let agent = try await client.agents.create(
    model: "ertas-fitness-coach-llama4-8b",
    instructions: "You are a fitness coach. Use the tools to log workouts and suggest routines. Refuse medical questions.",
    tools: [
        Tool(name: "log_workout", description: "Log a completed workout"),
        Tool(name: "fetch_progress", description: "Fetch recent workout history"),
        Tool(name: "suggest_routine", description: "Suggest the next routine"),
    ],
    safety: .llamaGuard
)

// One session per conversation; the session holds the multi-turn memory.
let session = try await agent.createSession()

// A single turn: Llama Stack runs the agent loop and dispatches tool calls
// to the handlers the host app supplies.
let response = try await session.turn(
    messages: [.user("I did 3x10 squats at 135, log it and tell me what to do next.")],
    toolHandlers: [
        "log_workout": logWorkoutHandler,
        "fetch_progress": fetchProgressHandler,
        "suggest_routine": suggestRoutineHandler,
    ]
)

print(response.finalMessage.content)
About thirty lines of Swift. The agent loop, the tool dispatch, the safety filtering through Llama Guard, the multi-turn session memory — Llama Stack handles all of it. The host app provides three tool handlers (the Swift functions that actually log to Core Data, query history, and produce routines) and Llama Stack does the rest.
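The handlers themselves stay small. Here is a hedged sketch of logWorkoutHandler, assuming a handler receives the decoded arguments as a dictionary and returns a string that goes back into the agent loop as the tool result; the exact handler signature is not shown in the snippet above, so treat the shape as an assumption, and the print is a stand-in for the app's real Core Data persistence.

// Sketch of one of the three handlers. The [String: Any] in / String out shape is
// an assumption about how the SDK hands arguments to a handler; the persistence
// call is a placeholder for real Core Data or SwiftData code.
func logWorkoutHandler(_ args: [String: Any]) async throws -> String {
    let exercise = args["exercise"] as? String ?? "unknown"
    let sets = args["sets"] as? Int ?? 1
    let reps = args["reps"] as? Int ?? 1
    // Validation rule from the training data: a missing weight means bodyweight.
    let weight = args["weight"] as? Double
    let weightLabel = weight.map { "\($0) lb" } ?? "bodyweight"

    // Stand-in for the app's real persistence layer.
    print("Saving \(exercise): \(sets)x\(reps) at \(weightLabel)")

    // Whatever the handler returns is fed back into the agent loop as the tool result.
    return "Logged \(sets)x\(reps) \(exercise) at \(weightLabel)."
}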
The same pattern in Kotlin is essentially identical:
val client = LlamaStackClient(transport = Transport.Embedded)
val agent = client.agents.create(
    model = "ertas-fitness-coach-llama4-8b",
    instructions = "You are a fitness coach. Use the tools to log workouts and suggest routines.",
    tools = listOf(
        Tool("log_workout", "Log a completed workout"),
        Tool("fetch_progress", "Fetch recent workout history"),
        Tool("suggest_routine", "Suggest the next routine"),
    ),
    safety = Safety.LlamaGuard,
)
Everything that happens after that — the inference, the tool calls, the safety checks — is identical between iOS and Android. The Stack abstracts the runtime cleanly.
The economics
The cost case is the same one that holds for every on-device agent architecture. A multi-turn fitness coaching flow against a frontier API runs roughly three to eight cents per session depending on context length. At a thousand daily active users averaging four sessions a day, that's $120 to $320 a day, or $3,600 to $9,600 a month. The bill scales linearly with users.
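Those figures are straightforward to reproduce. As a back-of-envelope check, using only the per-session range and usage assumptions stated above:

// Back-of-envelope reproduction of the figures above.
let lowCostPerSession = 0.03    // dollars, shorter session against a frontier API
let highCostPerSession = 0.08   // dollars, long-context session
let dailyActiveUsers = 1_000.0
let sessionsPerUserPerDay = 4.0

let sessionsPerDay = dailyActiveUsers * sessionsPerUserPerDay    // 4,000 sessions
let dailyLow = lowCostPerSession * sessionsPerDay                // $120 per day
let dailyHigh = highCostPerSession * sessionsPerDay              // $320 per day
let monthlyLow = dailyLow * 30                                   // $3,600 per month
let monthlyHigh = dailyHigh * 30                                 // $9,600 per month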
On-device, the marginal cost per inference is the GPU/CPU time on the user's phone — typically a few hundred milliseconds and a fraction of a cent of the user's battery. The fixed cost is the model footprint on disk. Neither term grows with your user count.
For mobile app builders, the agentic cost cliff bites somewhere between five hundred and five thousand users, depending on session intensity. That's the range where API costs start consuming subscription revenue faster than user growth can replace it. On-device inference removes that cliff entirely, and Llama Stack on-device removes it without forcing the team to rebuild the agent loop, the safety filter, or the telemetry pipeline from scratch.
For privacy-sensitive use cases the calculus is sharper still. Health, finance, and legal domains have hard requirements about where user data lives. An on-device agent never makes the sentence "user message sent to Meta" or "user message sent to OpenAI" true. That's a compliance argument as much as a cost argument, and for some categories of app it's the only argument that matters.
Differentiation vs other on-device approaches
There are other ways to ship a model on a phone. You can call llama.cpp directly through its Swift bindings. You can use Apple Foundation Models on iOS 26+. You can use Google's AICore on Android. You can roll your own agent loop on top of any of these.
What Llama Stack on-device offers that the alternatives don't is the rest of the agent platform. Llama Guard for safety filtering, integrated and configured. The Eval API for catching regressions when you ship a new fine-tune. The Telemetry API for structured traces you can feed back into the next round of training. The Tools API for typed schemas and auto-dispatch. Multi-turn session memory handled by the Agents API.
You're not building those layers from scratch and you're not stitching together a half-dozen open-source libraries to approximate them. You're using the reference architecture Meta designed for exactly this purpose, with the model embedded in the app instead of behind a network endpoint.
Closing
Mobile builders shipping AI features can ship a complete agent stack — not just inference — entirely on the user's device. Meta's reference architecture, a fine-tuned Llama 4 derivative, and a small amount of glue. The combination of Llama Stack, Ertas Studio, and the Ertas Deployment CLI collapses what used to be a quarter's worth of platform engineering into an afternoon of integration work.
The fine-tuning piece is where Studio earns its keep — a stock Llama 4 model is a generalist, and the agent reliability you need to ship comes from a model that knows your tools cold. The deployment piece is where the CLI earns its keep — wiring llama.cpp (or ExecuTorch), Llama Stack's on-device inference adapter, the Swift or Kotlin client, and a 5 GB model file into an iOS or Android project is twenty to forty hours of build configuration most app builders never finish. Stack the three pieces together and the path from "we want an on-device agent" to "we have an on-device agent" becomes a single afternoon.
For app builders staring at the agentic cost cliff, that's the path. Llama Stack gives you the architecture. Llama 4 gives you the base model. Studio gives you the specialization. The Deployment CLI gives you the mobile shipping pipeline. The agent runs on the user's phone, the bill stops scaling with usage, and the user's data never leaves the device.