Llama Stack + Ertas
Run agents on Meta's official Llama Stack — the reference agent runtime with OpenAI-compatible APIs, native tool calling, and first-class support for Ertas-trained Llama derivatives running locally or on the edge.
Overview
Llama Stack is Meta's official reference implementation of an agent runtime built around the Llama family. It provides a standardized set of REST APIs (chat completions, agents, evals, safety, telemetry, datasets, tool runtime) that any Llama-based deployment can expose, together with reference clients in Python, TypeScript, Swift, and Kotlin. The stated goal is to make production agent deployments on Llama models as standard as deploying behind an OpenAI API call — same shape, same client experience, but self-hosted and free of per-token costs.
The framework is unusual in its scope: it includes not just the inference layer but also the agent orchestration loop, the safety filters, the evaluation harness, and the dataset management API. Teams that adopt Llama Stack get a complete reference architecture for an end-to-end agent system, not just a model runtime. For organizations that don't want to build all those layers from scratch — observability, eval, safety, dataset versioning — Llama Stack is the most opinionated and complete reference option in the Llama ecosystem.
Llama Stack is designed around the Llama family, but the API surface is generic. The chat-completions API is OpenAI-compatible, which means any Ertas-trained Llama derivative can be plugged into the runtime and the rest of the stack (agents, safety, evals) works without modification. The Swift and Kotlin client libraries are particularly relevant for mobile app builders: they're explicitly designed for embedding into iOS and Android applications that call either a local Llama Stack server or a remote one.
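Because the chat-completions surface follows the OpenAI shape, the stock openai Python client can talk to a Llama Stack server directly. A minimal sketch, assuming a server on localhost:8321 that exposes its OpenAI-compatible routes under /v1/openai/v1 (the path recent distributions use) and a hypothetical Ertas-trained model id:

from openai import OpenAI

# Only the base URL changes; the client code is identical to hosted OpenAI
client = OpenAI(
    base_url="http://localhost:8321/v1/openai/v1",
    api_key="unused",  # self-hosted servers typically ignore the key
)

response = client.chat.completions.create(
    model="ertas-llama4-support-8b",  # hypothetical Ertas-trained Llama derivative
    messages=[{"role": "user", "content": "Where is order #98765?"}],
)
print(response.choices[0].message.content)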
How Ertas Integrates
Ertas-trained Llama-family models (fine-tuned Llama 3, Llama 4, or any Llama-architecture base from Studio) integrate with Llama Stack via the standard model-loading pattern. After exporting your fine-tuned model from Studio as GGUF, you register it as a provider-backed model in Llama Stack's configuration, either through the llama.cpp provider (for on-device or self-hosted CPU inference) or through the vLLM or Ollama providers (for GPU-accelerated serving and local development, respectively). The agents, safety, and eval APIs then dispatch to your Ertas-trained model exactly as they would to a stock Llama checkpoint.
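Registration can also be done at runtime through the Python client rather than the config file. A minimal sketch, assuming the distribution already has an Ollama inference provider configured; both model ids are hypothetical:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Map a Llama Stack model id onto the tag the Ollama provider serves
client.models.register(
    model_id="ertas-llama4-support-8b",
    provider_id="ollama",
    provider_model_id="ertas-llama4-support:8b",  # hypothetical Ollama tag
)

# The registered model is now visible to the agents, safety, and eval APIs
for model in client.models.list():
    print(model.identifier)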
The combination is particularly compelling for teams building agent products on Meta's Llama family. Llama Stack handles the operational concerns (agent orchestration, telemetry, safety filtering, evaluation) and Ertas provides the domain specialization. Together they deliver agent systems that keep the engineering benefits of a complete reference architecture while substantially outperforming a stock Llama checkpoint on domain tasks. For regulated-industry deployments the combination is even more valuable: Llama Stack's audit trails plus on-premise Ertas inference, optionally on a permissively licensed base such as the Apache-2.0 Apertus, together cover most procurement requirements.
For mobile apps shipped via the Ertas Deployment CLI, Llama Stack's Swift and Kotlin clients are an unusually good fit. The CLI installs llama.cpp into your iOS or Android project, and the Llama Stack client libraries provide a typed agent-loop API on top, so the mobile app talks to its on-device model through the same agent abstraction the backend uses for its server-side model, with no separate code paths.
Getting Started
1. Fine-tune a Llama-family model in Ertas Studio. Train on Llama 3, Llama 4, or any Llama-architecture base; Studio handles the fine-tuning data and produces a Llama-compatible GGUF output that registers cleanly with Llama Stack.
2. Export to GGUF and configure a Llama Stack provider. Use Studio's GGUF export, then configure Llama Stack to load the model via the llama.cpp provider (local CPU), the vLLM provider (GPU servers), or the Ollama provider (development).
3. Run the Llama Stack server. Start the Llama Stack distribution server pointed at your model; it exposes the full agent, safety, and eval API surface on a standard port. A launch sketch follows this list.
4. Build agents using the Llama Stack client SDKs. Use the Python, TypeScript, Swift, or Kotlin client to define agents, register tools, and run inference. The client APIs match across languages, so backend and mobile share the same patterns.
5. Integrate safety, evals, and telemetry. Layer in Llama Stack's built-in safety filtering, evaluation harness, and telemetry collection, and feed evaluation results back into Studio for the next round of fine-tuning. A shields sketch follows the agent example below.
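For step 3, launching a distribution from the CLI typically looks like the following; a minimal sketch, where the run.yaml path is an assumption and 8321 is the port the examples in this guide assume:

# Serve the distribution described by run.yaml (path is illustrative)
llama stack run ./run.yaml --port 8321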
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.client_tool import client_tool

# Connect to the Llama Stack server running your Ertas-trained model
client = LlamaStackClient(base_url="http://localhost:8321")

# Client tools execute locally; inventory_db and shipping are placeholders
# for your own backend services
@client_tool
def lookup_inventory(sku: str) -> dict:
    """Check stock for a product SKU."""
    return inventory_db.get(sku)

@client_tool
def create_return_label(order_id: str, reason: str) -> str:
    """Generate a return shipping label."""
    return shipping.create_label(order_id, reason)

# Build an agent backed by the Ertas-trained Llama 4 model
agent = Agent(
    client,
    model="ertas-llama4-support-8b",
    instructions="You handle e-commerce support: returns, inventory questions, order status.",
    tools=[lookup_inventory, create_return_label],
)

session_id = agent.create_session("customer-12345")
response = agent.create_turn(
    messages=[{"role": "user", "content": "I want to return order #98765, item arrived damaged."}],
    session_id=session_id,
)

# create_turn streams events; print each chunk as it arrives
for chunk in response:
    print(chunk)
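For step 5's safety layer, shields attach to the same agent abstraction. A minimal sketch, assuming a Llama Guard shield is already registered on the server under the id llama_guard, and that your client version accepts shield ids directly on the Agent constructor (the shield id and model name are assumptions):

from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent

client = LlamaStackClient(base_url="http://localhost:8321")

# input_shields screen each user message before inference;
# output_shields screen the model's reply before it reaches the caller
guarded_agent = Agent(
    client,
    model="ertas-llama4-support-8b",  # hypothetical Ertas-trained model id
    instructions="You handle e-commerce support: returns, inventory questions, order status.",
    input_shields=["llama_guard"],  # assumed shield id registered on the server
    output_shields=["llama_guard"],
)

The same client object also fronts the eval and telemetry APIs, so scores and traces collected here can drive the next Studio fine-tuning round, as step 5 describes.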
Benefits
- Complete reference architecture — agents, safety, evals, telemetry all in one stack
- OpenAI-compatible chat-completions API works with any client library
- Native client SDKs for Python, TypeScript, Swift, and Kotlin (mobile-friendly)
- First-class support for the Llama family — Ertas-trained Llama derivatives plug in directly
- Self-hosted or on-device — no per-token costs, no data egress
- Audit-friendly for regulated industries with built-in telemetry and eval pipelines
- Maintained by Meta as the canonical reference implementation
Related Resources
- Tags: Fine-Tuning, GGUF, Inference, LoRA
- Guides: Running AI Models Locally: The Complete Guide to Local LLM Inference · Fine-Tuning Llama 3: A Practical Guide for Your Use Case · Building Reliable AI Agents with Fine-Tuned Local Models: Complete Guide
- Integrations: LangGraph, llama.cpp, Ollama, OpenAI Agents SDK, vLLM
- Use cases: Ertas for SaaS Product Teams, Ertas for Customer Support, Ertas for AI Automation Agencies