
    The Indie Dev's Guide to AI Model Costs in 2026

    A comprehensive comparison of AI model costs in 2026 — from cloud APIs to self-hosted open-source models. Find the cheapest way to add AI to your indie app.

    Ertas Team

    Adding AI to your indie app has never been easier. The tooling is mature, the models are capable, and every tutorial makes it look like plugging in an API key is all you need. What those tutorials do not cover is the bill that arrives at the end of the month — and how it scales as your app grows.

    This guide is the cost comparison I wish I had when I started. It covers every major option available to indie developers in 2026, from cloud APIs to self-hosted open-source models, with real numbers at real scale.

    The Landscape of AI Pricing in 2026

    AI pricing has evolved significantly. Cloud API prices have dropped from their 2023-2024 peaks, but they are still per-token — meaning your costs scale linearly with usage. Meanwhile, open-source models have reached a quality level where a fine-tuned 7-8B parameter model can match or beat cloud APIs on specific tasks.

    The choice is no longer "cloud vs. bad open-source." It is "cloud convenience vs. self-hosted economics." Both are viable. The right answer depends on your scale.

    Cloud API Tier Comparison

    Here is what the major cloud APIs cost per million tokens in early 2026 for their most commonly used tiers.

    Provider      Model               Input (per 1M tokens)   Output (per 1M tokens)
    OpenAI        GPT-4o              $2.50                   $10.00
    OpenAI        GPT-4o-mini         $0.15                   $0.60
    Anthropic     Claude 3.5 Sonnet   $3.00                   $15.00
    Anthropic     Claude 3.5 Haiku    $0.80                   $4.00
    Google        Gemini 1.5 Pro      $1.25                   $5.00
    Google        Gemini 1.5 Flash    $0.075                  $0.30
    Together AI   Llama 3.3 70B       $0.88                   $0.88
    Together AI   Llama 3.3 8B        $0.18                   $0.18

    These prices look small until you do the multiplication. A typical AI-powered app interaction involves 500-1,000 input tokens and 200-500 output tokens. At 1,000 daily active users making 5 requests each, that is 5,000 requests per day. Assuming roughly 1,000 input and 400 output tokens per request, you are processing about 5 million input tokens and 2 million output tokens per day.

    With GPT-4o, that is $12.50 + $20.00 = $32.50 per day, or roughly $975 per month. With GPT-4o-mini, it drops to about $1.95 per day, or $58.50 per month. The cheaper models are dramatically more affordable, but you trade capability for cost.
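    If you would rather check that arithmetic than trust it, the whole calculation fits in a few lines of Python. A minimal sketch using the traffic assumptions above (1,000 DAU, 5 requests each, ~1,000 input and ~400 output tokens per request):

    ```python
    # Back-of-envelope monthly API cost from daily token volumes and list prices.
    def monthly_api_cost(input_tokens_per_day, output_tokens_per_day,
                         input_price_per_m, output_price_per_m, days=30):
        daily = (input_tokens_per_day / 1e6) * input_price_per_m \
              + (output_tokens_per_day / 1e6) * output_price_per_m
        return daily * days

    print(monthly_api_cost(5e6, 2e6, 2.50, 10.00))  # GPT-4o:      975.0
    print(monthly_api_cost(5e6, 2e6, 0.15, 0.60))   # GPT-4o-mini: ~58.5
    ```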

    Self-Hosted Options

    Self-hosting means running open-source models on your own hardware or rented GPU servers. The two most common approaches in 2026 are Ollama and raw llama.cpp.

    Ollama provides a clean interface for running quantised models. It handles model management, serves an OpenAI-compatible API, and works on consumer hardware. A MacBook Pro with 32GB RAM can run an 8B model at useful speeds. A $50/month cloud GPU (RTX 4090 or equivalent) can serve hundreds of concurrent users.
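    Because Ollama speaks the OpenAI chat API, pointing an existing app at a local model is often just a base-URL change. A minimal sketch in Python, assuming Ollama is running on its default port (11434) and you have already pulled a model such as llama3.1:8b:

    ```python
    # Use the standard OpenAI client against a local Ollama server.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
        api_key="ollama",                      # the client requires a key; Ollama ignores it
    )

    response = client.chat.completions.create(
        model="llama3.1:8b",  # whichever model tag you pulled
        messages=[{"role": "user", "content": "Summarise: the invoice was paid twice."}],
    )
    print(response.choices[0].message.content)
    ```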

    llama.cpp is the lower-level option: more configuration and more performance tuning, but maximum control over inference parameters and memory usage.

    The key cost difference: self-hosted pricing is per-server, not per-token. Whether you run 1,000 inferences or 1,000,000, the server costs the same.

    Setup                          Monthly Cost       Capacity (req/day)   Cost at 5K req/day
    Cloud GPU (RTX 4090)           $50-80             10,000-50,000        $50-80
    Cloud GPU (A100 40GB)          $150-300           50,000-200,000       $150-300
    Mac Mini M4 Pro (own)          ~$15 electricity   5,000-15,000         ~$15
    Consumer PC + RTX 4090 (own)   ~$20 electricity   15,000-50,000        ~$20

    At 5,000 requests per day with an 8B model, self-hosting costs between $15 and $80 per month. The equivalent cloud API cost with GPT-4o-mini would be roughly $58.50 per month. The crossover point where self-hosting becomes cheaper depends on your specific usage pattern, but it generally happens around 2,000-3,000 daily requests.
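    You can estimate your own crossover point directly: find the daily request volume at which a flat server fee equals the per-token bill. A rough sketch, reusing the ~1,000 input / ~400 output tokens-per-request assumption from above:

    ```python
    # Daily request volume at which a flat-rate server undercuts a per-token API.
    def breakeven_requests_per_day(server_cost_per_month,
                                   input_price_per_m, output_price_per_m, days=30):
        cost_per_request = (1_000 * input_price_per_m + 400 * output_price_per_m) / 1e6
        return server_cost_per_month / (cost_per_request * days)

    print(breakeven_requests_per_day(50, 0.15, 0.60))  # rented 4090 vs GPT-4o-mini: ~4,270
    print(breakeven_requests_per_day(20, 0.15, 0.60))  # own hardware vs GPT-4o-mini: ~1,710
    ```

    Owned hardware crosses over sooner than a rented GPU, which is why the crossover is a range rather than a single number.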

    The Fine-Tuning Sweet Spot

    Here is the insight that changes the economics entirely: a fine-tuned small model can match or beat a general-purpose large model on your specific tasks.

    A general-purpose model like GPT-4o is designed to handle everything — creative writing, code generation, mathematical reasoning, casual conversation. Your app probably needs it to do one or two things well. Classification, entity extraction, structured output generation, domain-specific Q&A.

    When you fine-tune a 7-8B model on examples of exactly what your app needs, it learns to do that specific task with high accuracy. You trade general capability (which you do not need) for specialised performance (which you do) at a fraction of the cost.

    The practical result: a fine-tuned Llama 3.3 8B or Qwen 2.5 7B running on a $50/month GPU server can outperform GPT-4o on your specific task while costing over 90% less at scale.
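    The exact training format depends on the tool, but most fine-tuning stacks accept chat-style JSONL, one example per line. A hypothetical example for a support-ticket classifier; the labels, prompts, and file name here are illustrative, not a prescribed schema:

    ```python
    # Append one chat-style training example to a JSONL dataset.
    # Each line teaches the model a single input -> output pair for your real task.
    import json

    example = {
        "messages": [
            {"role": "system", "content": "Classify the ticket as bug, billing, or feature_request."},
            {"role": "user", "content": "I was charged twice for my subscription this month."},
            {"role": "assistant", "content": "billing"},
        ]
    }
    with open("train.jsonl", "a") as f:
        f.write(json.dumps(example) + "\n")
    ```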

    Cost-Per-User Analysis at Different Scales

    Let's map this out across growth stages, assuming a typical app with 5 AI interactions per user per day.

    Users (DAU)   Cloud API (GPT-4o-mini)   Self-Hosted (8B, cloud GPU)   Cost per User (Cloud)   Cost per User (Self-Hosted)
    100           $5.85/mo                  $50/mo                        $0.059                  $0.500
    500           $29.25/mo                 $50/mo                        $0.059                  $0.100
    1,000         $58.50/mo                 $50/mo                        $0.059                  $0.050
    5,000         $292.50/mo                $80/mo                        $0.059                  $0.016
    10,000        $585.00/mo                $150/mo                       $0.059                  $0.015
    50,000        $2,925/mo                 $300/mo                       $0.059                  $0.006

    The pattern is clear. Cloud API costs scale linearly — your per-user cost is constant regardless of scale. Self-hosted costs are front-loaded — expensive per user at low scale, dramatically cheaper at high scale.
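    The table above reduces to one function: cloud cost is a constant per user, while a server fee is amortised across everyone. A sketch, with the per-request cost and server tiers taken from the figures above:

    ```python
    # Per-user monthly cost: cloud scales linearly, self-hosting amortises a flat fee.
    def cost_per_user(dau):
        cloud = 5 * 0.00039 * 30  # 5 req/day at ~$0.00039/request (GPT-4o-mini, as above)
        server = 50 if dau <= 1_000 else 80 if dau <= 5_000 else 150 if dau <= 10_000 else 300
        return cloud, server / dau

    for dau in (100, 1_000, 10_000, 50_000):
        cloud, hosted = cost_per_user(dau)
        print(f"{dau:>6} DAU: cloud ${cloud:.3f}/user, self-hosted ${hosted:.3f}/user")
    ```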

    When Cloud APIs Still Make Sense

    Cloud APIs are not always the wrong choice. They are the right choice when:

    • You have fewer than 100 daily users. The operational overhead of self-hosting is not worth the savings.
    • You are still prototyping. Use cloud APIs to validate that AI adds value before investing in infrastructure.
    • You need frontier-level capability. For tasks that genuinely require GPT-4o or Claude 3.5 Sonnet-class reasoning, cloud APIs provide capability that open-source models have not yet matched.
    • You have no ML experience and no time to learn. Fine-tuning has a learning curve. If you need to ship this week, use an API.

    When to Switch to Self-Hosted

    The trigger to switch is usually economic, but not always. Consider self-hosting when:

    • Your monthly API bill exceeds $200 and is growing.
    • You need predictable costs for pricing your own product.
    • Your clients or users require data privacy guarantees.
    • You are experiencing rate limiting or latency issues with cloud APIs.
    • You want to eliminate a critical single point of failure.

    The migration does not have to be all-or-nothing. Start by self-hosting your highest-volume, most cost-sensitive AI task. Keep cloud APIs for low-volume tasks where convenience outweighs cost.
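    In practice that hybrid can be as simple as two OpenAI-compatible clients and a routing table. A minimal sketch; the task names and model choices are illustrative:

    ```python
    # Route high-volume tasks to a self-hosted model, low-volume tasks to a cloud API.
    from openai import OpenAI

    local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

    ROUTES = {
        "classify_ticket": (local, "llama3.1:8b"),  # thousands of calls/day: flat-rate server
        "draft_reply":     (cloud, "gpt-4o-mini"),  # a handful of calls/day: pay per token
    }

    def complete(task, prompt):
        client, model = ROUTES[task]
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content
    ```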

    How Ertas Fits In

    Ertas makes the transition from cloud APIs to self-hosted models practical for indie developers. Ertas Studio handles fine-tuning without requiring ML expertise, and exports optimised GGUF models ready for deployment with Ollama or llama.cpp.
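    Deploying an exported GGUF with Ollama is a short path. A sketch, assuming an exported file named model.gguf (the file and model names are illustrative):

    ```
    # Modelfile: point Ollama at the exported GGUF
    FROM ./model.gguf
    ```

    Then create and run it:

    ```
    ollama create my-finetune -f Modelfile
    ollama run my-finetune "Classify this ticket: app crashes on login"
    ```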

    Ready to cut your AI costs? Join the Ertas waitlist and start building on infrastructure you control.
