
    The Post-API Stack: Architecture for SaaS That Doesn't Bleed on Inference

    The era of building SaaS on third-party AI APIs is ending. Here's the post-API architecture — fine-tuned local models, GGUF deployment, and zero per-token costs — that makes AI features profitable.

    Ertas Team

    The first generation of AI-powered SaaS was built on third-party APIs. OpenAI, Anthropic, Google — pick your provider, wire up the SDK, and ship an AI feature in a week. It was fast to build and it worked.

    But the architecture has a structural problem: every customer action that touches your AI costs you money. Not infrastructure-scales-with-usage money. Direct, per-token, linear-with-every-request money. Your COGS grows with every new user, every new feature, every increase in engagement.

    The post-API stack eliminates this. It replaces per-token API calls with fine-tuned models running on infrastructure you control, served through OpenAI-compatible endpoints so your application code barely changes. The per-token cost drops to effectively zero. Your AI features become as cost-stable as your database.

    The API Dependency Problem

    Here is what API dependency looks like in practice for a SaaS product at scale:

    Revenue: AU$200,000/month (4,000 users at AU$50/month)

    AI API cost at launch (1,000 users): AU$3,500/month — acceptable, under 2% of projected revenue

    AI API cost at 4,000 users: AU$14,000/month — 7% of revenue, starting to pinch

    AI API cost projected at 10,000 users: AU$35,000/month — now your third-largest expense after salaries and office space

    The problem is not just the cost. It is the cost structure:

    • No volume discount that matters. Enterprise API agreements might give you 10-20% off. Your infrastructure costs drop 50-80% at scale.
    • No ability to optimize the runtime. You cannot change how the model runs. You cannot quantize it for your specific throughput needs. You cannot batch requests efficiently.
    • Deprecation risk. OpenAI deprecated GPT-3.5 Turbo, and with it the fine-tuned models built on top of it. Teams had to rebuild. This will happen again.
    • Rate limits as a scaling constraint. Your application's throughput is capped by someone else's rate limiter.
    • Zero moat. Every competitor has access to the same model at the same price.

    What the Post-API Stack Looks Like

    The post-API architecture has four layers:

    ┌──────────────────────────────────────────────┐
    │              Your SaaS Application           │
    │                                              │
    │  ┌────────────────────────────────────────┐  │
    │  │         OpenAI-Compatible Client       │  │
    │  │    (same SDK, same request format)     │  │
    │  └───────────────────┬────────────────────┘  │
    │                      │                       │
    │  ┌───────────────────▼────────────────────┐  │
    │  │           Local API Server             │  │
    │  │     (Ollama / llama.cpp / vLLM)        │  │
    │  │    Exposes /v1/chat/completions        │  │
    │  └───────────────────┬────────────────────┘  │
    │                      │                       │
    │  ┌───────────────────▼────────────────────┐  │
    │  │         Fine-Tuned Model (GGUF)        │  │
    │  │   7B-14B params, Q4_K_M quantized      │  │
    │  │   Trained on your production data      │  │
    │  └───────────────────┬────────────────────┘  │
    │                      │                       │
    │  ┌───────────────────▼────────────────────┐  │
    │  │         Inference Hardware             │  │
    │  │   VPS with GPU / Mac Studio / On-prem  │  │
    │  │   AU$500-2,000/month fixed cost        │  │
    │  └────────────────────────────────────────┘  │
    └──────────────────────────────────────────────┘
    

    Each layer is a standard, well-understood component. There is no proprietary lock-in at any point. Let's walk through each one.

    Layer 1: OpenAI-Compatible Client

    This is the part your application code touches. And here is the key insight: it does not change.

    If your application currently calls the OpenAI API like this:

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    

    Your post-API code looks like this:

    client = openai.OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="not-needed",  # required by the SDK, ignored by the local server
    )
    response = client.chat.completions.create(
        model="my-fine-tuned-model",
        messages=[{"role": "user", "content": prompt}]
    )
    

    Only the client construction changes (a local base URL plus a placeholder API key the SDK requires) and the model name. Every other part of your application — prompt construction, response parsing, error handling, streaming — remains identical. The OpenAI SDK works with any server that implements the OpenAI API specification.

    This is why the migration is dramatically simpler than most teams expect. You are not rewriting your AI integration. You are pointing it at a different server.
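    To make the streaming claim concrete, here is a minimal sketch of a streaming call against a local Ollama-style server. The base URL, placeholder key, and model name are assumptions for illustration:

    from openai import OpenAI

    # Point the standard SDK at the local server; the key is unused but required.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

    stream = client.chat.completions.create(
        model="my-fine-tuned-model",   # hypothetical fine-tuned model name
        messages=[{"role": "user", "content": "Summarise this support ticket: ..."}],
        stream=True,                   # same SSE streaming as the cloud API
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)

    The streaming loop is the same code you would write against the cloud API; nothing downstream of the client needs to know the server changed.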

    Layer 2: Local API Server

    Ollama, llama.cpp server, and vLLM all expose OpenAI-compatible API endpoints. They handle:

    • Loading the model into memory (GPU or CPU)
    • Managing concurrent requests
    • Streaming token generation
    • KV-cache management for throughput optimization

    Ollama is the simplest option. Install it, pull a model, and you have a running API server in under 5 minutes. It handles model management, automatic GPU/CPU allocation, and concurrent request handling. Best for: teams that want simplicity and are running 1-3 models.

    llama.cpp server gives you more control over inference parameters — context window size, batch size, thread count, quantization loading. Best for: teams that need to squeeze maximum throughput from their hardware.

    vLLM is designed for high-throughput production serving. It implements PagedAttention for efficient memory management and supports continuous batching. Best for: teams serving 100+ concurrent users with strict latency requirements.

    For most SaaS products under 50,000 users, Ollama is the right choice. It is production-ready, well-maintained, and the operational overhead is minimal.
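    Before committing to a server, it is worth measuring what your hardware actually delivers. A rough single-request probe, assuming an Ollama-style endpoint on the default port and a hypothetical model name:

    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

    # Time one completion and derive a rough generation speed for this hardware.
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="my-fine-tuned-model",   # hypothetical model name
        messages=[{"role": "user", "content": "Summarise: the quick brown fox ..."}],
        max_tokens=128,
    )
    elapsed = time.perf_counter() - start

    tokens = response.usage.completion_tokens
    print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tokens/s")

    Run it a few times with realistic prompt lengths; the tokens-per-second figure tells you whether Ollama alone is enough or you need vLLM-style batching.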

    Layer 3: Fine-Tuned Model

    The model is where the quality comes from. A base 7B model will not match GPT-4o on your specific tasks. A fine-tuned 7B model, trained on 2,000-5,000 examples of your exact use case, will match or beat it on those specific tasks.

    The fine-tuning process:

    1. Export training data from your current API usage. Your existing GPT-4o or Claude outputs are your gold standard. Export input-output pairs for each task type (a minimal export sketch follows this list).

    2. Fine-tune using QLoRA. This is a parameter-efficient method that trains only a small fraction of the model's weights. A 7B model fine-tunes in 30-90 minutes on a single GPU with QLoRA.

    3. Merge and quantize. Merge the LoRA adapter into the base model and quantize to GGUF format. Q4_K_M quantization reduces model size by ~75% with minimal quality loss. A 7B model goes from ~14GB to ~4GB.

    4. Evaluate. Run the quantized model against a held-out test set of 200-500 examples. Compare to your cloud API outputs. Target 90-98% quality parity on your specific tasks.
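    For step 1, the export is usually a short script over your existing request logs. A minimal sketch, assuming you have logged prompt/response pairs per task type; the field names and chat-style JSONL schema are illustrative, so match whatever your fine-tuning tool expects:

    import json

    # Hypothetical source: logged (prompt, cloud response) pairs from your app.
    logged_pairs = [
        {"task": "summarise_ticket", "prompt": "Customer reports ...", "output": "The customer ..."},
        # ... a few thousand more, exported from your request logs
    ]

    # Write chat-format JSONL, one training example per line.
    with open("train.jsonl", "w") as f:
        for pair in logged_pairs:
            record = {
                "messages": [
                    {"role": "user", "content": pair["prompt"]},
                    {"role": "assistant", "content": pair["output"]},
                ]
            }
            f.write(json.dumps(record) + "\n")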

    The GGUF format is an open standard supported by all major inference engines. Your model file is portable across Ollama, llama.cpp, and any GGUF-compatible runtime.

    Ertas automates steps 2-3: upload your dataset, select a base model, and receive a fine-tuned GGUF file ready for deployment. No ML expertise required, no training infrastructure to manage.

    Layer 4: Inference Hardware

    The hardware layer is where the economics change. Instead of per-token pricing, you have fixed monthly infrastructure cost.

    Option A: GPU VPS (AU$400-1,500/month)

    Providers like Lambda Labs, Vast.ai, RunPod, or major cloud GPU instances. A single A10G or L4 GPU instance serves a quantized 7B model at 50-100+ tokens per second per stream, and with request batching the aggregate throughput comfortably covers most SaaS workloads.

    Pros: No hardware to manage, scale up/down as needed, multiple regions available. Cons: Monthly cost, provider dependency (though easily portable).

    Option B: Dedicated hardware (AU$3,000-8,000 one-time, then AU$50-200/month for power/colocation)

    A workstation with an RTX 4090 (24GB VRAM) or a Mac Studio with M-series silicon. The RTX 4090 handles a quantized 14B model comfortably. A Mac Studio with an Ultra-class chip and enough unified memory handles quantized models up to 70B.

    Pros: One-time cost amortized over 3-5 years, lowest long-term cost per inference, full physical control. Cons: Hardware management, no geographic distribution, capacity ceiling.

    Option C: CPU-only VPS (AU$80-300/month)

    A quantized 7B model runs on CPU at 2-8 tokens per second. For workloads under 50,000 requests/month where latency above 2-3 seconds is acceptable, this is the cheapest option. No GPU required.

    Pros: Cheapest, simplest, runs on any VPS. Cons: Slow inference, limited throughput.
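    A quick back-of-envelope check shows why roughly 50,000 requests/month is realistic on CPU. The figures below are assumptions, not measurements:

    # Capacity sketch for Option C (assumed workload, not a benchmark).
    requests_per_month = 50_000
    avg_output_tokens = 150        # assumption: short, task-specific completions
    tokens_per_second = 5          # mid-range of the 2-8 tok/s CPU estimate above

    busy_seconds = requests_per_month * avg_output_tokens / tokens_per_second
    print(f"busy hours: {busy_seconds / 3600:.0f} of ~720 in a month")
    # -> roughly 417 busy hours, about 60% utilisation of one sequential worker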

    The Cost Comparison at Scale

    Let's compare total cost of ownership over 12 months for a SaaS product growing from 100,000 to 1,000,000 AI requests per month:

    Cloud API (GPT-4o):

    Month              Requests    Cost
    1                  100K        AU$3,000
    6                  500K        AU$15,000
    12                 1M          AU$30,000
    12-month total                 AU$198,000

    Post-API stack (fine-tuned 7B on GPU VPS):

    Month              Requests    Cost
    1                  100K        AU$2,800 (setup + infra)
    6                  500K        AU$1,200
    12                 1M          AU$1,500 (slightly larger instance)
    12-month total                 AU$18,600

    Savings: AU$179,400 over 12 months. At higher volumes, the gap widens further because the post-API cost barely increases.
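    Another way to read the tables is per-request cost at month 12, which makes the fixed-versus-variable difference explicit:

    # Per-request unit economics at month 12, using the table figures above.
    requests = 1_000_000
    cloud_cost, local_cost = 30_000, 1_500   # AU$ per month

    print(f"cloud: AU${cloud_cost / requests:.4f} per request")   # AU$0.0300
    print(f"local: AU${local_cost / requests:.4f} per request")   # AU$0.0015
    print(f"gap:   {cloud_cost / local_cost:.0f}x")               # 20x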

    What You Still Need Cloud APIs For

    The post-API stack does not mean zero API usage. There are legitimate reasons to keep a cloud API integration:

    Fallback for edge cases. When your fine-tuned model encounters a request type it was not trained on, route it to a cloud API. This should be 5-15% of traffic, decreasing over time as you expand your training data.

    New feature development. When prototyping a new AI feature, use a cloud API to validate the concept and collect training data. Once the feature is stable and you have enough examples, fine-tune and migrate.

    Tasks requiring frontier reasoning. Complex multi-step reasoning, creative generation requiring broad world knowledge, or tasks where a 7B model genuinely cannot match a 200B+ model. These exist, but they are a smaller fraction of most SaaS workloads than teams assume.

    The end state is not "no API" — it is "API by exception." The API becomes your R&D tool and fallback, not your production inference layer.

    Migration Without Rewriting Your App

    The most common concern from engineering teams is migration complexity. Here is the actual scope:

    Changes required:

    1. Add a routing configuration (which task types go to which backend) — 50-100 lines of code; a minimal sketch appears at the end of this section
    2. Add the local API server URL as an environment variable — 1 line
    3. Update model name references for routed tasks — find and replace
    4. Add monitoring for local inference latency and throughput — standard observability

    Changes NOT required:

    • No prompt rewrites (fine-tuned models learn the task, so prompts are simpler or unchanged)
    • No response parsing changes (same API format)
    • No streaming logic changes (same SSE protocol)
    • No error handling changes (same error format)
    • No authentication changes (local server, no API keys needed)

    The migration effort for a typical SaaS product with 3-5 AI features is 2-4 weeks of one engineer's time. The majority of that time is fine-tuning and evaluation, not code changes.
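    For a sense of what item 1 in the required changes looks like, here is a minimal routing sketch. The task names, model names, and the LOCAL_LLM_URL environment variable are hypothetical; the cloud client is kept purely as the by-exception fallback described earlier:

    import os
    from openai import OpenAI

    # Which backend serves which task type. Names here are illustrative.
    ROUTES = {
        "summarise_ticket": {"backend": "local", "model": "my-fine-tuned-model"},
        "draft_reply":      {"backend": "local", "model": "my-fine-tuned-model"},
        "complex_analysis": {"backend": "cloud", "model": "gpt-4o"},  # frontier-only task
    }

    local = OpenAI(base_url=os.environ["LOCAL_LLM_URL"], api_key="not-needed")
    cloud = OpenAI()  # reads OPENAI_API_KEY as usual

    def complete(task: str, messages: list[dict]) -> str:
        route = ROUTES.get(task, {"backend": "cloud", "model": "gpt-4o"})
        client = local if route["backend"] == "local" else cloud
        try:
            response = client.chat.completions.create(
                model=route["model"], messages=messages
            )
        except Exception:
            if client is cloud:
                raise
            # Local server unavailable or task out of distribution: fall back by exception.
            response = cloud.chat.completions.create(model="gpt-4o", messages=messages)
        return response.choices[0].message.content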

    Building for Model Portability

    One advantage of the GGUF + OpenAI-compatible API architecture: you are not locked into any specific model. When a better base model is released — and they are released every few months — you can:

    1. Fine-tune the new base model on your existing dataset
    2. Evaluate it against your current production model (a comparison sketch follows this list)
    3. Hot-swap the model file on your inference server
    4. Zero application code changes
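    Step 2 can be a short script when both models are served from the same local endpoint under different names. A sketch, with a placeholder metric and hypothetical model names:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

    # Held-out prompts with reference outputs you already trust.
    test_set = [
        {"prompt": "Summarise: ...", "reference": "..."},
        # ... a few hundred more
    ]

    def score(candidate: str, reference: str) -> float:
        # Placeholder metric: swap in whatever quality measure you already use
        # (rubric scoring, LLM-as-judge, task-specific checks, etc.).
        return 1.0 if candidate.strip() == reference.strip() else 0.0

    for model in ("current-prod-model", "candidate-new-base"):
        scores = []
        for case in test_set:
            out = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": case["prompt"]}],
            ).choices[0].message.content
            scores.append(score(out, case["reference"]))
        print(model, sum(scores) / len(scores))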

    This is the opposite of API dependency. You control the model lifecycle. You choose when to upgrade. No deprecation notices, no forced migrations, no breaking changes.

    Your model is a file on a disk. Your inference server is an open-source binary. Your application talks to a standard API. Every layer is replaceable independently. That is the post-API stack.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
