
Stop Paying Per User for AI: The Flat-Cost Architecture for Indie Apps
Every new user shouldn't mean a higher AI bill. Here's the architecture pattern that decouples your user count from your AI costs — permanently.
Here is the dirty secret of most AI-powered SaaS apps: every new user makes the business less profitable. Not in the abstract, "well servers cost money" sense that applies to all software. In the very concrete, "every AI request costs $0.003 and my average user makes 45 requests per day" sense that eats your margin alive.
Traditional SaaS has near-zero marginal costs. Adding user number 10,001 costs you essentially nothing — the servers are already running, the code is already written. AI-powered SaaS breaks this model. With per-token pricing, every user generates a roughly proportional increase in your AI bill. Your revenue scales with user count. Your AI costs also scale with user count. And if your per-user AI cost is anywhere close to your per-user revenue, you have a business that gets worse as it succeeds.
There is a better architecture. One where your AI infrastructure costs the same whether you have 100 users or 100,000 users. It is not theoretical — it is running in production today for indie developers who figured out the math early. This guide explains what it is, how to build it, and exactly when it does and does not work.
The Per-User Cost Problem
Let us make the problem painfully concrete. You have built an AI-powered app — say a content optimization tool. Each user submits text, the AI analyzes it, and returns suggestions. Standard stuff. You are charging $19/month per user.
Here is what your AI costs look like as you grow, assuming a blended rate of roughly $0.50 per million tokens (mid-tier model pricing; frontier models like GPT-4o run several times higher) and moderate usage (30 AI requests per user per day, averaging 1,000 input tokens and 500 output tokens per request):
| Users | Daily AI Requests | Monthly Tokens (Input) | Monthly Tokens (Output) | Monthly AI Cost | Revenue | AI Cost as % of Revenue |
|---|---|---|---|---|---|---|
| 100 | 3,000 | 90M | 45M | $67 | $1,900 | 3.5% |
| 500 | 15,000 | 450M | 225M | $338 | $9,500 | 3.6% |
| 1,000 | 30,000 | 900M | 450M | $675 | $19,000 | 3.6% |
| 5,000 | 150,000 | 4.5B | 2.25B | $3,375 | $95,000 | 3.6% |
| 10,000 | 300,000 | 9B | 4.5B | $6,750 | $190,000 | 3.6% |
| 50,000 | 1,500,000 | 45B | 22.5B | $33,750 | $950,000 | 3.6% |
At 3.6% of revenue, this looks manageable. But this is the optimistic scenario. In reality:
Power users destroy your averages. Your top 10% of users generate 40-60% of your AI requests. Some users trigger 100+ requests per day. That "30 requests per user per day" average masks a long tail of heavy usage that inflates your costs.
Prompt chaining multiplies tokens. Agent-style features, retry logic, and multi-step workflows can 2-5x your token count per user action. A single "optimize my article" button might trigger three LLM calls under the hood.
Context windows grow over time. As users build history in your app, prompts get longer. That 1,000-token input average creeps toward 3,000-4,000 tokens as you include conversation history, user preferences, and previous results.
A more realistic picture with power users and prompt chaining:
| Users | Realistic Monthly AI Cost | Revenue | AI Cost as % of Revenue |
|---|---|---|---|
| 1,000 | $1,900 | $19,000 | 10% |
| 5,000 | $9,500 | $95,000 | 10% |
| 10,000 | $19,000 | $190,000 | 10% |
| 50,000 | $95,000 | $950,000 | 10% |
Now 10% of your revenue goes to AI inference. For an indie developer without VC funding, that is a massive chunk of your gross margin. And unlike hosting costs (which scale sub-linearly thanks to caching, CDNs, and efficient architectures), AI API costs scale linearly: caching barely helps when every request body is unique, and provider-side prompt caching only discounts repeated prefixes.
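If you want to sanity-check these figures against your own usage, the model behind the tables is a few lines of arithmetic. A sketch: the blended $0.50-per-million-token rate is the assumption from above, so substitute your provider's actual input and output rates:

```typescript
// Monthly AI cost model. The prices are illustrative assumptions;
// plug in your provider's current per-million-token rates.
interface CostInputs {
  users: number;
  requestsPerUserPerDay: number;
  inputTokensPerRequest: number;
  outputTokensPerRequest: number;
  inputPricePerMillion: number;  // USD per 1M input tokens
  outputPricePerMillion: number; // USD per 1M output tokens
}

function monthlyAiCost(c: CostInputs): number {
  const requestsPerMonth = c.users * c.requestsPerUserPerDay * 30;
  const inputTokens = requestsPerMonth * c.inputTokensPerRequest;
  const outputTokens = requestsPerMonth * c.outputTokensPerRequest;
  return (
    (inputTokens / 1e6) * c.inputPricePerMillion +
    (outputTokens / 1e6) * c.outputPricePerMillion
  );
}

// The 1,000-user row from the baseline table above:
console.log(
  monthlyAiCost({
    users: 1_000,
    requestsPerUserPerDay: 30,
    inputTokensPerRequest: 1_000,
    outputTokensPerRequest: 500,
    inputPricePerMillion: 0.5,
    outputPricePerMillion: 0.5,
  }),
); // => 675
```

Multiply the result by 2-3x to approximate the power-user and prompt-chaining effects described above.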
What "Flat-Cost" Means
A flat-cost AI architecture is one where your AI inference costs are determined by your infrastructure, not your usage. You pay for servers, not for tokens. Whether those servers process 1,000 requests or 100,000 requests per month, the infrastructure cost is the same.
The core idea is simple: instead of sending every AI request to an API that charges per token, you run the AI model yourself on hardware you control. The model runs on your VPS. The VPS costs a fixed monthly amount. The per-request cost is zero.
This is only viable because of three developments that converged in 2025-2026:
- Small open-source models got good enough. Qwen 2.5 7B, Llama 3.1 8B, and similar models can handle most app-specific AI tasks when fine-tuned. You no longer need GPT-4 for everything.
- Fine-tuning became accessible. Tools like Ertas let non-ML developers fine-tune models on their app's data in under an hour. No PyTorch. No GPU cluster. No PhD.
- Local inference got fast. Ollama and llama.cpp made it possible to run quantized 7B models on commodity hardware at 15-30 tokens per second — fast enough for production use.
The flat-cost architecture combines all three: fine-tune a small model for your specific task, deploy it on a fixed-cost VPS, and serve all your users from that infrastructure.
The Architecture
Here is the complete architecture for a flat-cost AI app:
```
┌──────────────────────────────────────────────┐
│          Your App (Frontend + API)           │
│     Hosted on Vercel / Railway / Fly.io      │
└──────────────────────┬───────────────────────┘
                       │
           ┌───────────┴──────────┐
           │                      │
           ▼                      ▼
    ┌─────────────┐      ┌──────────────────┐
    │   Request   │      │     Database     │
    │   Router    │      │  (Supabase/Neon) │
    └──┬────────┬─┘      └──────────────────┘
       │        │
   95% │        │ 5%
       ▼        ▼
  ┌──────────┐ ┌──────────┐
  │  Ollama  │ │  OpenAI  │
  │  (Local) │ │   API    │
  │  $26/mo  │ │ (fallback│
  │   flat   │ │   only)  │
  └──────────┘ └──────────┘
```
Four components make this work. Let us go through each one.
Component 1: Fine-Tuned Small Models
The foundation of flat-cost AI is using a model that is specifically trained for your use case, rather than a general-purpose frontier model.
Why small models work for app-specific tasks: Most AI features in SaaS apps perform a narrow, repetitive task. Classify this text. Extract these fields. Rewrite this paragraph in this tone. Generate a summary of this data. These are not tasks that require the full breadth of GPT-4's knowledge about ancient Roman history and quantum mechanics. They require a model that has deeply learned one specific pattern.
A 7B parameter model fine-tuned on 1,000 examples of your specific task will match GPT-4's performance for that task roughly 90-95% of the time. For the remaining 5-10% of edge cases, you have a fallback (Component 3). But the key insight is: you do not need perfection from the local model. You need "good enough for 95% of requests" — because that 95% is what costs you money at scale.
Choosing your base model:
| Model | Parameters | RAM Required | Best For |
|---|---|---|---|
| Qwen 2.5 3B | 3B | 4GB | Classification, simple extraction, reformatting |
| Qwen 2.5 7B | 7B | 8GB | Summarization, generation, complex extraction |
| Llama 3.1 8B | 8B | 8GB | General-purpose tasks, instruction following |
| Mistral 7B | 7B | 8GB | European language tasks, code-adjacent tasks |
For most indie apps, Qwen 2.5 7B is the default choice. It offers the best balance of capability and resource efficiency.
Fine-tuning with Ertas: Upload your JSONL training data (input-output pairs from your existing API logs), select the base model, and train with LoRA. The whole process takes 30-60 minutes on Ertas. Cost: $14.50/month for unlimited training runs.
The training data comes from your existing app. If you have been using the OpenAI API, you already have thousands of input-output pairs in your logs. Export them, clean them, and upload. You are literally training your replacement model on the work the expensive model already did.
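The export itself is a few lines of plumbing. A minimal sketch, assuming your logs are stored as records with `input` and `output` fields; the JSONL field names shown here are illustrative, so match whatever format your fine-tuning tool expects:

```typescript
import { appendFileSync } from "node:fs";

// Hypothetical shape of a logged API call. Adapt to your own schema.
interface LoggedCall {
  input: string;  // the prompt you sent
  output: string; // the completion the API returned
}

function exportTrainingData(logs: LoggedCall[], path: string): void {
  for (const { input, output } of logs) {
    // Skip empty or truncated pairs. Dirty examples hurt fine-tuning
    // more than having fewer examples does.
    if (!input.trim() || !output.trim()) continue;
    appendFileSync(path, JSON.stringify({ input, output }) + "\n");
  }
}
```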
Component 2: Local Inference with Ollama
Ollama is the runtime that serves your fine-tuned model as a local API. Install it on a VPS, load your model, and every AI request your app makes is served locally with zero per-token cost.
Infrastructure options and costs:
| Setup | Monthly Cost | Throughput | Best For |
|---|---|---|---|
| Hetzner CX22 (2 vCPU, 4GB) | ~$6/mo | 8-12 tok/s | Dev/testing, very low traffic |
| Hetzner CX32 (4 vCPU, 8GB) | ~$14/mo | 12-18 tok/s | Up to 1,000 users |
| Hetzner CX42 (8 vCPU, 16GB) | ~$26/mo | 15-25 tok/s | Up to 5,000 users |
| Hetzner CCX33 (8 vCPU, 32GB) | ~$48/mo | 25-40 tok/s | Up to 15,000 users |
| GPU instance (Vast.ai RTX 3060) | ~$30/mo | 40-60 tok/s | High throughput needs |
A $26/month Hetzner VPS running Ollama with a quantized 7B model generates 15-25 tokens per second. For a typical app where each AI request produces 200-500 output tokens, a single generation therefore takes roughly 8-33 seconds, which works out to on the order of 2,500-11,000 requests per day per instance at full utilization (Ollama's concurrent request batching pushes this higher).
Most indie apps see far lower sustained request rates than the modeled averages, and real traffic is bursty rather than uniform. Still, size your instance against your measured peak request rate, not your user count alone.
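The same back-of-the-envelope math as a function; the token rates are the assumptions above, not benchmarks from your hardware:

```typescript
// Single-stream capacity estimate: how many requests per day one
// inference box can serve at full utilization. Ignores input-token
// processing time and concurrency, so treat it as a rough bound,
// not a benchmark.
function requestsPerDay(
  tokensPerSecond: number,
  outputTokensPerRequest: number,
): number {
  const secondsPerRequest = outputTokensPerRequest / tokensPerSecond;
  return Math.floor((24 * 60 * 60) / secondsPerRequest);
}

console.log(requestsPerDay(15, 500)); // 2,592 requests/day (pessimistic)
console.log(requestsPerDay(25, 200)); // 10,800 requests/day (optimistic)
```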
Component 3: Smart Request Routing
Not every request needs to go to your local model. And not every request can be handled by your local model. Smart routing is the glue that makes the architecture work reliably.
The routing logic is simple:
- Every AI request hits the router first
- The router sends the request to the local Ollama model
- If Ollama returns a valid response in the expected format, use it
- If Ollama errors, times out, or returns a malformed response, fall back to the OpenAI API
Implementation in your app:
```typescript
async function aiRequest(input: string): Promise<string> {
  try {
    // Try the local model first
    const localResponse = await fetch("http://ollama-vps:11434/api/generate", {
      method: "POST",
      body: JSON.stringify({
        model: "my-fine-tuned-model",
        prompt: input,
        stream: false,
      }),
      signal: AbortSignal.timeout(10_000), // 10s timeout
    });
    if (localResponse.ok) {
      const result = await localResponse.json();
      // Validate the response format before trusting it
      if (isValidResponse(result.response)) {
        return result.response;
      }
    }
    // Non-200 status or malformed output: fall back
    return await openaiRequest(input);
  } catch (error) {
    // Network error or timeout: fall back
    return await openaiRequest(input);
  }
}
```
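The two helpers are app-specific, so the router above leaves them undefined. A minimal sketch of each, assuming the model must return JSON with a `suggestions` array (a stand-in for your own output format) and using the standard OpenAI chat completions endpoint for the fallback:

```typescript
// App-specific validation. Here we assume the output must be parseable
// JSON containing a `suggestions` array; adapt to your own format.
function isValidResponse(text: string | undefined): text is string {
  if (!text) return false;
  try {
    return Array.isArray(JSON.parse(text).suggestions);
  } catch {
    return false;
  }
}

// Fallback path: the same task, sent to the OpenAI API.
async function openaiRequest(input: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [{ role: "user", content: input }],
    }),
  });
  if (!res.ok) throw new Error(`OpenAI API error: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```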
In practice, the routing split looks like this:
| Phase | Local Model Handles | API Fallback | Monthly API Cost (5K users) |
|---|---|---|---|
| Initial deployment | 80% | 20% | ~$675 (down from $3,375) |
| After 1 month (with retraining on failures) | 90% | 10% | ~$338 |
| After 3 months | 95% | 5% | ~$169 |
| Mature (6+ months) | 97-98% | 2-3% | ~$68-101 |
The key insight: you do not need to handle 100% locally on day one. Start at 80% local and iterate. Each month, review the requests that fell back to the API, add them to your training data, retrain, and deploy the updated model. Over time, the local model handles more and more edge cases, and your API costs approach zero.
At the mature stage, the 2-3% that still goes to the API is genuinely hard — novel edge cases, unusual inputs, requests that are fundamentally different from your training data. That residual API cost is trivial.
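Capturing those fallbacks is worth automating from day one. A minimal sketch; the `fallback_requests` table and the query-style client are names chosen for illustration, so use your actual database driver:

```typescript
// Minimal DB interface for illustration. Substitute your real client
// (e.g. the Supabase or Neon driver from the architecture diagram).
interface DatabaseClient {
  query(sql: string, params: unknown[]): Promise<unknown>;
}

// Record every request the local model could not handle, so it can be
// reviewed, added to the training set, and included in next month's
// retraining run.
async function logFallback(
  db: DatabaseClient,
  input: string,
  apiOutput: string,
): Promise<void> {
  await db.query(
    `INSERT INTO fallback_requests (input, api_output, created_at)
     VALUES ($1, $2, NOW())`,
    [input, apiOutput],
  );
}
```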
Component 4: Horizontal Scaling
At some point, one VPS is not enough. When you hit sustained high traffic that exceeds a single instance's throughput, you scale horizontally — add more VPS instances, each running the same model.
The scaling math:
| Users | VPS Instances | Total VPS Cost | Per-User Monthly AI Cost |
|---|---|---|---|
| 1,000 | 1x CX42 | $26 | $0.026 |
| 5,000 | 1x CX42 | $26 | $0.005 |
| 10,000 | 2x CX42 | $52 | $0.005 |
| 25,000 | 3x CX42 | $78 | $0.003 |
| 50,000 | 5x CX42 | $130 | $0.003 |
| 100,000 | 8x CX42 | $208 | $0.002 |
Notice the per-user cost. With horizontal scaling, your per-user AI cost decreases as you grow. At 100,000 users, you are paying $0.002 per user per month for AI inference. With the OpenAI API at the same scale, you would be paying roughly $0.68 per user per month on the moderate-usage baseline from earlier (closer to $1.90 with power users and prompt chaining).
That is a 340x cost difference.
Load balancing across multiple Ollama instances is straightforward. Use a simple round-robin or least-connections load balancer (nginx, HAProxy, or your cloud provider's built-in LB) in front of your Ollama fleet. Each instance runs the identical model, so any instance can handle any request.
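If you would rather not run a separate load balancer at first, the same round-robin logic fits in a few lines of application code. A sketch; the host list is illustrative:

```typescript
// Rotate across identical Ollama instances. Every instance runs the
// same model, so any instance can serve any request.
const OLLAMA_HOSTS = [
  "http://ollama-1:11434",
  "http://ollama-2:11434",
  "http://ollama-3:11434",
];
let nextHost = 0;

function pickOllamaHost(): string {
  const host = OLLAMA_HOSTS[nextHost];
  nextHost = (nextHost + 1) % OLLAMA_HOSTS.length;
  return host;
}

// In the router from Component 3, replace the fixed URL with:
// fetch(`${pickOllamaHost()}/api/generate`, { ... })
```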
Cost Modeling: API vs Flat-Cost at Scale
Here is the comprehensive comparison, including all infrastructure costs:
| Users | API Architecture (Monthly) | Flat-Cost Architecture (Monthly) | Savings |
|---|---|---|---|
| 100 | $67 API | $26 VPS + $14.50 Ertas + $3 API fallback = $43.50 | $23.50 (35%) |
| 500 | $338 API | $26 VPS + $14.50 Ertas + $8 API fallback = $48.50 | $289.50 (86%) |
| 1,000 | $675 API | $26 VPS + $14.50 Ertas + $17 API fallback = $57.50 | $617.50 (91%) |
| 5,000 | $3,375 API | $26 VPS + $14.50 Ertas + $68 API fallback = $108.50 | $3,266.50 (97%) |
| 10,000 | $6,750 API | $52 VPS + $14.50 Ertas + $101 API fallback = $167.50 | $6,582.50 (98%) |
| 50,000 | $33,750 API | $130 VPS + $14.50 Ertas + $338 API fallback = $482.50 | $33,267.50 (99%) |
| 100,000 | $67,500 API | $208 VPS + $14.50 Ertas + $506 API fallback = $728.50 | $66,771.50 (99%) |
The breakeven point is remarkably low: around 60-100 users, depending on usage patterns. Below that, the API is marginally cheaper; above it, the savings are dramatic and compound with every additional user.
At 10,000 users, you are saving $6,582.50 per month — $78,990 per year. That is not a rounding error. That is the difference between a lifestyle business and a struggling one.
Let us frame it another way. If you charge $19/month per user and have 10,000 users, your monthly revenue is $190,000. With the API architecture, $6,750 goes to OpenAI (3.6% — or realistically $19,000 at 10% with power users). With the flat-cost architecture, $167.50 goes to AI infrastructure (0.09% of revenue). That margin difference compounds every single month.
When Flat-Cost Does Not Work
Flat-cost architecture is not universally superior. Here are the scenarios where sticking with an API (or using a hybrid approach) makes more sense:
Real-time multimodal tasks. If your app processes images, audio, or video with AI, you need models and hardware that are significantly more expensive to self-host. Vision models require GPUs with substantial VRAM. Audio transcription models like Whisper are CPU-intensive. The flat-cost math still works, but the infrastructure costs are higher, pushing the breakeven point up to 1,000-5,000 users.
Cutting-edge reasoning tasks. If your app genuinely requires GPT-4 or Claude-level reasoning — complex multi-step analysis, nuanced creative writing, advanced code generation — a fine-tuned 7B model may not cut it. These tasks represent the frontier of AI capability, and small models simply cannot replicate them. However, audit your app honestly: most "we need GPT-4" claims do not survive scrutiny. Most app AI tasks are narrower than developers think.
Extremely diverse task sets. If your AI feature handles hundreds of fundamentally different task types with no dominant pattern, fine-tuning becomes impractical (you would need dozens of specialized models). This is rare in practice — most apps have 3-5 core AI tasks that account for 90% of requests.
Very early stage (pre-product-market fit). If you are still iterating on what your AI feature does, committing to fine-tuning is premature. Use the API while you figure out the product. Once you know what your AI does (and can articulate it as a clear input-output pattern), that is when you switch to flat-cost.
Regulatory environments requiring certified models. Some regulated industries require AI models that have been specifically certified or audited. Self-hosted open-source models may not meet these requirements. Check with your compliance team before migrating.
Implementation Roadmap
Here is a four-week plan to migrate from API to flat-cost architecture:
Week 1: Data Collection and Audit
- Enable logging on all AI API calls: input, output, latency, cost (a sketch of the wrapper follows this list)
- Run for one week to establish baseline metrics
- Categorize your AI tasks and identify the highest-volume ones
- Calculate your current per-user AI cost
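A minimal sketch of that logging wrapper; the `ai_call_log` table is a name chosen for illustration, and `openaiRequest` stands in for your existing API call:

```typescript
declare const db: { query(sql: string, params: unknown[]): Promise<unknown> };
declare function openaiRequest(input: string): Promise<string>;

// Wrap the existing API path so every call is logged with its latency.
// Cost can be derived later from the token counts in your provider's
// response metadata.
async function loggedAiRequest(input: string): Promise<string> {
  const start = Date.now();
  const output = await openaiRequest(input);
  await db.query(
    `INSERT INTO ai_call_log (input, output, latency_ms, created_at)
     VALUES ($1, $2, $3, NOW())`,
    [input, output, Date.now() - start],
  );
  return output;
}
```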
Week 2: Fine-Tuning
- Export 500-2,000 input-output pairs from your API logs for your highest-volume task
- Upload to Ertas and fine-tune on Qwen 2.5 7B
- Evaluate: test 50 inputs and compare outputs to GPT-4 responses
- If quality is acceptable (90%+ match), proceed. If not, clean your training data and retrain.
Week 3: Deployment and Parallel Testing
- Spin up a Hetzner CX42 ($26/month) and install Ollama
- Deploy your fine-tuned model
- Implement the request router with API fallback
- Run in parallel: send every request to both local and API and compare results (a sketch follows this list)
- Monitor for one week
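For the parallel test, shadow the traffic: users keep getting the API result while the local model runs on the same input and disagreements get logged. A sketch; the three declared helpers stand in for your own implementations, and the equality check should match your task (exact match for classification, a looser semantic check for free-form generation):

```typescript
declare function localRequest(input: string): Promise<string>;
declare function openaiRequest(input: string): Promise<string>;
declare function logMismatch(input: string, local: string, api: string): Promise<void>;

// Week 3 shadow test: run both paths, serve the known-good API output,
// and record any case where the local model disagreed or failed.
async function shadowCompare(input: string): Promise<string> {
  const [local, api] = await Promise.allSettled([
    localRequest(input),
    openaiRequest(input),
  ]);
  if (api.status !== "fulfilled") throw new Error("API request failed");
  if (local.status !== "fulfilled" || local.value !== api.value) {
    await logMismatch(
      input,
      local.status === "fulfilled" ? local.value : "<local error>",
      api.value,
    );
  }
  return api.value; // users see the API result during the test week
}
```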
Week 4: Cutover and Monitoring
- Switch production traffic to the local-first architecture
- Keep API fallback active (it costs nothing while idle; you only pay for the requests that actually fall back)
- Monitor error rates, latency, and user feedback
- After one week of stable operation, consider migrating your next AI task
Repeat weeks 2-4 for each AI task type in your app. Most indie apps have 2-4 distinct AI tasks, so the full migration takes 4-8 weeks.
The Bottom Line
Per-token AI pricing creates a business model where success punishes you. More users means more AI costs, and at scale those costs consume your margin.
The flat-cost architecture breaks that coupling. Your AI infrastructure costs are fixed by the hardware you run, not the users you serve. A $26/month VPS serves 5,000 users with zero per-token fees. At 50,000 users, five VPS instances at $130/month total replace what would be $33,750/month in API calls.
Fine-tuning (with Ertas, $14.50/month) makes the local model good enough for 95-98% of requests. Smart routing handles the rest with an API fallback. The result is an AI architecture where every new user is pure margin, just like traditional SaaS.
You do not need to wait until your AI bill is painful. The best time to implement flat-cost architecture is before you scale — when the migration is simple and the stakes are low. Build the architecture now. Scale it later. Keep the revenue.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Indie Dev AI Model Costs in 2026 — A comprehensive breakdown of what AI actually costs for indie developers.
- Self-Hosted AI for Indie Apps — Why self-hosting AI inference is the single biggest margin lever.
- SaaS AI Feature Costs at Scale — How AI feature costs behave as your SaaS grows.