
Stop Paying Per User for AI: The Flat-Cost Architecture for Indie Apps
Every new user shouldn't mean a higher AI bill. Here's the architecture pattern that decouples your user count from your AI costs — permanently.
Here is the dirty secret of most AI-powered SaaS apps: every new user makes the business less profitable. Not in the abstract, "well servers cost money" sense that applies to all software. In the very concrete, "every AI request costs $0.003 and my average user makes 45 requests per day" sense that eats your margin alive.
Traditional SaaS has near-zero marginal costs. Adding user number 10,001 costs you essentially nothing — the servers are already running, the code is already written. AI-powered SaaS breaks this model. With per-token pricing, every user generates a roughly proportional increase in your AI bill. Your revenue scales with user count. Your AI costs also scale with user count. And if your per-user AI cost is anywhere close to your per-user revenue, you have a business that gets worse as it succeeds.
There is a better architecture. One where your AI infrastructure costs the same whether you have 100 users or 100,000 users. It is not theoretical — it is running in production today for indie developers who figured out the math early. This guide explains what it is, how to build it, and exactly when it does and does not work.
The Per-User Cost Problem
Let us make the problem painfully concrete. You have built an AI-powered app — say a content optimization tool. Each user submits text, the AI analyzes it, and returns suggestions. Standard stuff. You are charging $19/month per user.
Here is what your AI costs look like as you grow, assuming a blended rate of roughly $0.50 per million tokens (mid-tier model pricing; frontier models like GPT-4o run several times higher) and moderate usage (30 AI requests per user per day, averaging 1,000 input tokens and 500 output tokens per request):
| Users | Daily AI Requests | Monthly Tokens (Input) | Monthly Tokens (Output) | Monthly AI Cost | Revenue | AI Cost as % of Revenue |
|---|---|---|---|---|---|---|
| 100 | 3,000 | 90M | 45M | $67 | $1,900 | 3.5% |
| 500 | 15,000 | 450M | 225M | $338 | $9,500 | 3.6% |
| 1,000 | 30,000 | 900M | 450M | $675 | $19,000 | 3.6% |
| 5,000 | 150,000 | 4.5B | 2.25B | $3,375 | $95,000 | 3.6% |
| 10,000 | 300,000 | 9B | 4.5B | $6,750 | $190,000 | 3.6% |
| 50,000 | 1,500,000 | 45B | 22.5B | $33,750 | $950,000 | 3.6% |
At 3.6% of revenue, this looks manageable. But this is the optimistic scenario. In reality:
Power users destroy your averages. Your top 10% of users generate 40-60% of your AI requests. Some users trigger 100+ requests per day. That "30 requests per user per day" average masks a long tail of heavy usage that inflates your costs.
Prompt chaining multiplies tokens. Agent-style features, retry logic, and multi-step workflows can 2-5x your token count per user action. A single "optimize my article" button might trigger three LLM calls under the hood.
Context windows grow over time. As users build history in your app, prompts get longer. That 1,000-token input average creeps toward 3,000-4,000 tokens as you include conversation history, user preferences, and previous results.
A more realistic picture with power users and prompt chaining:
| Users | Realistic Monthly AI Cost | Revenue | AI Cost as % of Revenue |
|---|---|---|---|
| 1,000 | $1,900 | $19,000 | 10% |
| 5,000 | $9,500 | $95,000 | 10% |
| 10,000 | $19,000 | $190,000 | 10% |
| 50,000 | $95,000 | $950,000 | 10% |
Now 10% of your revenue goes to AI inference. For an indie developer without VC funding, that is a massive chunk of your gross margin. And unlike hosting costs (which scale sub-linearly thanks to caching, CDNs, and efficient architectures), AI API costs scale linearly: caching barely helps when every request body is unique, and provider-side prompt caching only discounts repeated prefixes.
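If you want to sanity-check these figures against your own usage, the model behind the tables is a few lines of arithmetic. A sketch: the blended $0.50-per-million-token rate is the assumption from above, so substitute your provider's actual input and output rates:

```typescript
// Monthly AI cost model. The prices are illustrative assumptions;
// plug in your provider's current per-million-token rates.
interface CostInputs {
  users: number;
  requestsPerUserPerDay: number;
  inputTokensPerRequest: number;
  outputTokensPerRequest: number;
  inputPricePerMillion: number;  // USD per 1M input tokens
  outputPricePerMillion: number; // USD per 1M output tokens
}

function monthlyAiCost(c: CostInputs): number {
  const requestsPerMonth = c.users * c.requestsPerUserPerDay * 30;
  const inputTokens = requestsPerMonth * c.inputTokensPerRequest;
  const outputTokens = requestsPerMonth * c.outputTokensPerRequest;
  return (
    (inputTokens / 1e6) * c.inputPricePerMillion +
    (outputTokens / 1e6) * c.outputPricePerMillion
  );
}

// The 1,000-user row from the baseline table above:
console.log(
  monthlyAiCost({
    users: 1_000,
    requestsPerUserPerDay: 30,
    inputTokensPerRequest: 1_000,
    outputTokensPerRequest: 500,
    inputPricePerMillion: 0.5,
    outputPricePerMillion: 0.5,
  }),
); // => 675
```

Multiply the result by 2-3x to approximate the power-user and prompt-chaining effects described above.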
What "Flat-Cost" Means
A flat-cost AI architecture is one where your AI inference costs are determined by your infrastructure, not your usage. You pay for servers, not for tokens. Whether those servers process 1,000 requests or 100,000 requests per month, the infrastructure cost is the same.
The core idea is simple: instead of sending every AI request to an API that charges per token, you run the AI model yourself on hardware you control. The model runs on your VPS. The VPS costs a fixed monthly amount. The per-request cost is zero.
This is only viable because of three developments that converged in 2025-2026:
- Small open-source models got good enough. Qwen 2.5 7B, Llama 3.1 8B, and similar models can handle most app-specific AI tasks when fine-tuned. You no longer need GPT-4 for everything.
- Fine-tuning became accessible. Tools like Ertas let non-ML developers fine-tune models on their app's data in under an hour. No PyTorch. No GPU cluster. No PhD.
- Local inference got fast. Ollama and llama.cpp made it possible to run quantized 7B models on commodity hardware at 15-30 tokens per second — fast enough for production use.
The flat-cost architecture combines all three: fine-tune a small model for your specific task, deploy it on a fixed-cost VPS, and serve all your users from that infrastructure.
The Architecture
Here is the complete architecture for a flat-cost AI app:
```
┌──────────────────────────────────────────────┐
│          Your App (Frontend + API)           │
│     Hosted on Vercel / Railway / Fly.io      │
└──────────────────────┬───────────────────────┘
                       │
           ┌───────────┴──────────┐
           │                      │
           ▼                      ▼
    ┌─────────────┐      ┌──────────────────┐
    │   Request   │      │     Database     │
    │   Router    │      │  (Supabase/Neon) │
    └──┬────────┬─┘      └──────────────────┘
       │        │
   95% │        │ 5%
       ▼        ▼
  ┌──────────┐ ┌──────────┐
  │  Ollama  │ │  OpenAI  │
  │  (Local) │ │   API    │
  │  $26/mo  │ │ (fallback│
  │   flat   │ │   only)  │
  └──────────┘ └──────────┘
```
Four components make this work. Let us go through each one.
Component 1: Fine-Tuned Small Models
The foundation of flat-cost AI is using a model that is specifically trained for your use case, rather than a general-purpose frontier model.
Why small models work for app-specific tasks: Most AI features in SaaS apps perform a narrow, repetitive task. Classify this text. Extract these fields. Rewrite this paragraph in this tone. Generate a summary of this data. These are not tasks that require the full breadth of GPT-4's knowledge about ancient Roman history and quantum mechanics. They require a model that has deeply learned one specific pattern.
A 7B parameter model fine-tuned on 1,000 examples of your specific task will match GPT-4's performance for that task roughly 90-95% of the time. For the remaining 5-10% of edge cases, you have a fallback (Component 3). But the key insight is: you do not need perfection from the local model. You need "good enough for 95% of requests" — because that 95% is what costs you money at scale.
Choosing your base model:
| Model | Parameters | RAM Required | Best For |
|---|---|---|---|
| Qwen 2.5 3B | 3B | 4GB | Classification, simple extraction, reformatting |
| Qwen 2.5 7B | 7B | 8GB | Summarization, generation, complex extraction |
| Llama 3.1 8B | 8B | 8GB | General-purpose tasks, instruction following |
| Mistral 7B | 7B | 8GB | European language tasks, code-adjacent tasks |
For most indie apps, Qwen 2.5 7B is the default choice. It offers the best balance of capability and resource efficiency.
Fine-tuning with Ertas: Upload your JSONL training data (input-output pairs from your existing API logs), select the base model, and train with LoRA. The whole process takes 30-60 minutes on Ertas. Cost: $14.50/month for unlimited training runs.
The training data comes from your existing app. If you have been using the OpenAI API, you already have thousands of input-output pairs in your logs. Export them, clean them, and upload. You are literally training your replacement model on the work the expensive model already did.
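The export itself is a few lines of plumbing. A minimal sketch, assuming your logs are stored as records with `input` and `output` fields; the JSONL field names shown here are illustrative, so match whatever format your fine-tuning tool expects:

```typescript
import { appendFileSync } from "node:fs";

// Hypothetical shape of a logged API call. Adapt to your own schema.
interface LoggedCall {
  input: string;  // the prompt you sent
  output: string; // the completion the API returned
}

function exportTrainingData(logs: LoggedCall[], path: string): void {
  for (const { input, output } of logs) {
    // Skip empty or truncated pairs. Dirty examples hurt fine-tuning
    // more than having fewer examples does.
    if (!input.trim() || !output.trim()) continue;
    appendFileSync(path, JSON.stringify({ input, output }) + "\n");
  }
}
```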
Component 2: Local Inference with Ollama
Ollama is the runtime that serves your fine-tuned model as a local API. Install it on a VPS, load your model, and every AI request your app makes is served locally with zero per-token cost.
Infrastructure options and costs:
| Setup | Monthly Cost | Throughput | Best For |
|---|---|---|---|
| Hetzner CX22 (2 vCPU, 4GB) | ~$6/mo | 8-12 tok/s | Dev/testing, very low traffic |
| Hetzner CX32 (4 vCPU, 8GB) | ~$14/mo | 12-18 tok/s | Up to 1,000 users |
| Hetzner CX42 (8 vCPU, 16GB) | ~$26/mo | 15-25 tok/s | Up to 5,000 users |
| Hetzner CCX33 (8 vCPU, 32GB) | ~$48/mo | 25-40 tok/s | Up to 15,000 users |
| GPU instance (Vast.ai RTX 3060) | ~$30/mo | 40-60 tok/s | High throughput needs |
A $26/month Hetzner VPS running Ollama with a quantized 7B model generates 15-25 tokens per second. For a typical app where each AI request produces 200-500 output tokens, a single generation therefore takes roughly 8-33 seconds, which works out to on the order of 2,500-11,000 requests per day per instance at full utilization (Ollama's concurrent request batching pushes this higher).
Most indie apps see far lower sustained request rates than the modeled averages, and real traffic is bursty rather than uniform. Still, size your instance against your measured peak request rate, not your user count alone.
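The same back-of-the-envelope math as a function; the token rates are the assumptions above, not benchmarks from your hardware:

```typescript
// Single-stream capacity estimate: how many requests per day one
// inference box can serve at full utilization. Ignores input-token
// processing time and concurrency, so treat it as a rough bound,
// not a benchmark.
function requestsPerDay(
  tokensPerSecond: number,
  outputTokensPerRequest: number,
): number {
  const secondsPerRequest = outputTokensPerRequest / tokensPerSecond;
  return Math.floor((24 * 60 * 60) / secondsPerRequest);
}

console.log(requestsPerDay(15, 500)); // 2,592 requests/day (pessimistic)
console.log(requestsPerDay(25, 200)); // 10,800 requests/day (optimistic)
```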
Component 3: Smart Request Routing
Not every request needs to go to your local model. And not every request can be handled by your local model. Smart routing is the glue that makes the architecture work reliably.
The routing logic is simple:
- Every AI request hits the router first
- The router sends the request to the local Ollama model
- If Ollama returns a valid response in the expected format, use it
- If Ollama errors, times out, or returns a malformed response, fall back to the OpenAI API
Implementation in your app:
```typescript
async function aiRequest(input: string): Promise<string> {
  try {
    // Try the local model first
    const localResponse = await fetch("http://ollama-vps:11434/api/generate", {
      method: "POST",
      body: JSON.stringify({
        model: "my-fine-tuned-model",
        prompt: input,
        stream: false,
      }),
      signal: AbortSignal.timeout(10_000), // 10s timeout
    });
    if (localResponse.ok) {
      const result = await localResponse.json();
      // Validate the response format before trusting it
      if (isValidResponse(result.response)) {
        return result.response;
      }
    }
    // Non-200 status or malformed output: fall back
    return await openaiRequest(input);
  } catch (error) {
    // Network error or timeout: fall back
    return await openaiRequest(input);
  }
}
```
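The two helpers are app-specific, so the router above leaves them undefined. A minimal sketch of each, assuming the model must return JSON with a `suggestions` array (a stand-in for your own output format) and using the standard OpenAI chat completions endpoint for the fallback:

```typescript
// App-specific validation. Here we assume the output must be parseable
// JSON containing a `suggestions` array; adapt to your own format.
function isValidResponse(text: string | undefined): text is string {
  if (!text) return false;
  try {
    return Array.isArray(JSON.parse(text).suggestions);
  } catch {
    return false;
  }
}

// Fallback path: the same task, sent to the OpenAI API.
async function openaiRequest(input: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [{ role: "user", content: input }],
    }),
  });
  if (!res.ok) throw new Error(`OpenAI API error: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```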
In practice, the routing split looks like this:
| Phase | Local Model Handles | API Fallback | Monthly API Cost (5K users) |
|---|---|---|---|
| Initial deployment | 80% | 20% | ~$675 (down from $3,375) |
| After 1 month (with retraining on failures) | 90% | 10% | ~$338 |
| After 3 months | 95% | 5% | ~$169 |
| Mature (6+ months) | 97-98% | 2-3% | ~$68-101 |
The key insight: you do not need to handle 100% locally on day one. Start at 80% local and iterate. Each month, review the requests that fell back to the API, add them to your training data, retrain, and deploy the updated model. Over time, the local model handles more and more edge cases, and your API costs approach zero.
At the mature stage, the 2-3% that still goes to the API is genuinely hard — novel edge cases, unusual inputs, requests that are fundamentally different from your training data. That residual API cost is trivial.
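Capturing those fallbacks is worth automating from day one. A minimal sketch; the `fallback_requests` table and the query-style client are names chosen for illustration, so use your actual database driver:

```typescript
// Minimal DB interface for illustration. Substitute your real client
// (e.g. the Supabase or Neon driver from the architecture diagram).
interface DatabaseClient {
  query(sql: string, params: unknown[]): Promise<unknown>;
}

// Record every request the local model could not handle, so it can be
// reviewed, added to the training set, and included in next month's
// retraining run.
async function logFallback(
  db: DatabaseClient,
  input: string,
  apiOutput: string,
): Promise<void> {
  await db.query(
    `INSERT INTO fallback_requests (input, api_output, created_at)
     VALUES ($1, $2, NOW())`,
    [input, apiOutput],
  );
}
```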
Component 4: Horizontal Scaling
At some point, one VPS is not enough. When you hit sustained high traffic that exceeds a single instance's throughput, you scale horizontally — add more VPS instances, each running the same model.
The scaling math:
| Users | VPS Instances | Total VPS Cost | Per-User Monthly AI Cost |
|---|---|---|---|
| 1,000 | 1x CX42 | $26 | $0.026 |
| 5,000 | 1x CX42 | $26 | $0.005 |
| 10,000 | 2x CX42 | $52 | $0.005 |
| 25,000 | 3x CX42 | $78 | $0.003 |
| 50,000 | 5x CX42 | $130 | $0.003 |
| 100,000 | 8x CX42 | $208 | $0.002 |
Notice the per-user cost. With horizontal scaling, your per-user AI cost decreases as you grow. At 100,000 users, you are paying $0.002 per user per month for AI inference. With the OpenAI API at the same scale, you would be paying roughly $0.68 per user per month on the moderate-usage baseline from earlier (closer to $1.90 with power users and prompt chaining).
That is a 340x cost difference.
Load balancing across multiple Ollama instances is straightforward. Use a simple round-robin or least-connections load balancer (nginx, HAProxy, or your cloud provider's built-in LB) in front of your Ollama fleet. Each instance runs the identical model, so any instance can handle any request.
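If you would rather not run a separate load balancer at first, the same round-robin logic fits in a few lines of application code. A sketch; the host list is illustrative:

```typescript
// Rotate across identical Ollama instances. Every instance runs the
// same model, so any instance can serve any request.
const OLLAMA_HOSTS = [
  "http://ollama-1:11434",
  "http://ollama-2:11434",
  "http://ollama-3:11434",
];
let nextHost = 0;

function pickOllamaHost(): string {
  const host = OLLAMA_HOSTS[nextHost];
  nextHost = (nextHost + 1) % OLLAMA_HOSTS.length;
  return host;
}

// In the router from Component 3, replace the fixed URL with:
// fetch(`${pickOllamaHost()}/api/generate`, { ... })
```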
Cost Modeling: API vs Flat-Cost at Scale
Here is the comprehensive comparison, including all infrastructure costs:
| Users | API Architecture (Monthly) | Flat-Cost Architecture (Monthly) | Savings |
|---|---|---|---|
| 100 | $67 API | $26 VPS + $14.50 Ertas + $3 API fallback = $43.50 | $23.50 (35%) |
| 500 | $338 API | $26 VPS + $14.50 Ertas + $8 API fallback = $48.50 | $289.50 (86%) |
| 1,000 | $675 API | $26 VPS + $14.50 Ertas + $17 API fallback = $57.50 | $617.50 (91%) |
| 5,000 | $3,375 API | $26 VPS + $14.50 Ertas + $68 API fallback = $108.50 | $3,266.50 (97%) |
| 10,000 | $6,750 API | $52 VPS + $14.50 Ertas + $101 API fallback = $167.50 | $6,582.50 (98%) |
| 50,000 | $33,750 API | $130 VPS + $14.50 Ertas + $338 API fallback = $482.50 | $33,267.50 (99%) |
| 100,000 | $67,500 API | $208 VPS + $14.50 Ertas + $506 API fallback = $728.50 | $66,771.50 (99%) |
The breakeven point is remarkably low: around 60-100 users, depending on usage patterns. Below that, the API is marginally cheaper; above it, the savings are dramatic and compound with every additional user.
At 10,000 users, you are saving $6,582.50 per month — $78,990 per year. That is not a rounding error. That is the difference between a lifestyle business and a struggling one.
Let us frame it another way. If you charge $19/month per user and have 10,000 users, your monthly revenue is $190,000. With the API architecture, $6,750 goes to OpenAI (3.6% — or realistically $19,000 at 10% with power users). With the flat-cost architecture, $167.50 goes to AI infrastructure (0.09% of revenue). That margin difference compounds every single month.
When Flat-Cost Does Not Work
Flat-cost architecture is not universally superior. Here are the scenarios where sticking with an API (or using a hybrid approach) makes more sense:
Real-time multimodal tasks. If your app processes images, audio, or video with AI, you need models and hardware that are significantly more expensive to self-host. Vision models require GPUs with substantial VRAM. Audio transcription models like Whisper are CPU-intensive. The flat-cost math still works, but the infrastructure costs are higher, pushing the breakeven point up to 1,000-5,000 users.
Cutting-edge reasoning tasks. If your app genuinely requires GPT-4 or Claude-level reasoning — complex multi-step analysis, nuanced creative writing, advanced code generation — a fine-tuned 7B model may not cut it. These tasks represent the frontier of AI capability, and small models simply cannot replicate them. However, audit your app honestly: most "we need GPT-4" claims do not survive scrutiny. Most app AI tasks are narrower than developers think.
Extremely diverse task sets. If your AI feature handles hundreds of fundamentally different task types with no dominant pattern, fine-tuning becomes impractical (you would need dozens of specialized models). This is rare in practice — most apps have 3-5 core AI tasks that account for 90% of requests.
Very early stage (pre-product-market fit). If you are still iterating on what your AI feature does, committing to fine-tuning is premature. Use the API while you figure out the product. Once you know what your AI does (and can articulate it as a clear input-output pattern), that is when you switch to flat-cost.
Regulatory environments requiring certified models. Some regulated industries require AI models that have been specifically certified or audited. Self-hosted open-source models may not meet these requirements. Check with your compliance team before migrating.
Implementation Roadmap
Here is a four-week plan to migrate from API to flat-cost architecture:
Week 1: Data Collection and Audit
- Enable logging on all AI API calls: input, output, latency, cost (a sketch of the wrapper follows this list)
- Run for one week to establish baseline metrics
- Categorize your AI tasks and identify the highest-volume ones
- Calculate your current per-user AI cost
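A minimal sketch of that logging wrapper; the `ai_call_log` table is a name chosen for illustration, and `openaiRequest` stands in for your existing API call:

```typescript
declare const db: { query(sql: string, params: unknown[]): Promise<unknown> };
declare function openaiRequest(input: string): Promise<string>;

// Wrap the existing API path so every call is logged with its latency.
// Cost can be derived later from the token counts in your provider's
// response metadata.
async function loggedAiRequest(input: string): Promise<string> {
  const start = Date.now();
  const output = await openaiRequest(input);
  await db.query(
    `INSERT INTO ai_call_log (input, output, latency_ms, created_at)
     VALUES ($1, $2, $3, NOW())`,
    [input, output, Date.now() - start],
  );
  return output;
}
```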
Week 2: Fine-Tuning
- Export 500-2,000 input-output pairs from your API logs for your highest-volume task
- Upload to Ertas and fine-tune on Qwen 2.5 7B
- Evaluate: test 50 inputs and compare outputs to GPT-4 responses
- If quality is acceptable (90%+ match), proceed. If not, clean your training data and retrain.
Week 3: Deployment and Parallel Testing
- Spin up a Hetzner CX42 ($26/month) and install Ollama
- Deploy your fine-tuned model
- Implement the request router with API fallback
- Run in parallel: send every request to both local and API and compare results (a sketch follows this list)
- Monitor for one week
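For the parallel test, shadow the traffic: users keep getting the API result while the local model runs on the same input and disagreements get logged. A sketch; the three declared helpers stand in for your own implementations, and the equality check should match your task (exact match for classification, a looser semantic check for free-form generation):

```typescript
declare function localRequest(input: string): Promise<string>;
declare function openaiRequest(input: string): Promise<string>;
declare function logMismatch(input: string, local: string, api: string): Promise<void>;

// Week 3 shadow test: run both paths, serve the known-good API output,
// and record any case where the local model disagreed or failed.
async function shadowCompare(input: string): Promise<string> {
  const [local, api] = await Promise.allSettled([
    localRequest(input),
    openaiRequest(input),
  ]);
  if (api.status !== "fulfilled") throw new Error("API request failed");
  if (local.status !== "fulfilled" || local.value !== api.value) {
    await logMismatch(
      input,
      local.status === "fulfilled" ? local.value : "<local error>",
      api.value,
    );
  }
  return api.value; // users see the API result during the test week
}
```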
Week 4: Cutover and Monitoring
- Switch production traffic to the local-first architecture
- Keep API fallback active (it costs nothing while idle; you only pay for the requests that actually fall back)
- Monitor error rates, latency, and user feedback
- After one week of stable operation, consider migrating your next AI task
Repeat weeks 2-4 for each AI task type in your app. Most indie apps have 2-4 distinct AI tasks, so the full migration takes 4-8 weeks.
The Bottom Line
Per-token AI pricing creates a business model where success punishes you. More users means more AI costs, and at scale those costs consume your margin.
The flat-cost architecture breaks that coupling. Your AI infrastructure costs are fixed by the hardware you run, not the users you serve. A $26/month VPS serves 5,000 users with zero per-token fees. At 50,000 users, five VPS instances at $130/month total replace what would be $33,750/month in API calls.
Fine-tuning (with Ertas, $14.50/month) makes the local model good enough for 95-98% of requests. Smart routing handles the rest with an API fallback. The result is an AI architecture where every new user is pure margin, just like traditional SaaS.
You do not need to wait until your AI bill is painful. The best time to implement flat-cost architecture is before you scale — when the migration is simple and the stakes are low. Build the architecture now. Scale it later. Keep the revenue.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Indie Dev AI Model Costs in 2026 — A comprehensive breakdown of what AI actually costs for indie developers.
- Self-Hosted AI for Indie Apps — Why self-hosting AI inference is the single biggest margin lever.
- SaaS AI Feature Costs at Scale — How AI feature costs behave as your SaaS grows.