
From Prompt Caching to Fine-Tuning: When to Make the Switch
Prompt caching cuts costs 60-90% for repetitive context. Fine-tuning eliminates per-token costs entirely. Here's how to know when you've outgrown caching and should fine-tune instead.
Prompt caching is the first optimization most teams reach for when AI API costs start climbing. And it works — Anthropic's prompt caching cuts costs by up to 90% on cached tokens, and OpenAI offers similar savings. For many workloads, caching is the right answer for months or even years.
But caching has a ceiling. It optimizes the cost per token but does not eliminate per-token economics. At a certain scale, or for certain workload profiles, you will hit that ceiling and need to make a different architectural choice: fine-tuning a model you own and run locally.
This guide walks through when caching is enough, when it is not, and how to make the transition.
How Prompt Caching Works
Both Anthropic and OpenAI now offer prompt caching that significantly reduces costs for repeated context.
The mechanism is straightforward: if the first N tokens of your prompt are identical across requests, those tokens are cached on the provider's infrastructure. Subsequent requests that share the same prefix only pay a fraction of the normal input token cost.
Anthropic prompt caching:
- Cached input tokens: 90% discount (you pay 10% of the normal input price)
- Minimum cacheable prefix: 1,024 tokens for Claude Sonnet, 2,048 for Haiku
- Cache TTL: 5 minutes (refreshed on each hit)
OpenAI prompt caching:
- Cached input tokens: 50% discount
- Automatic caching on prompts of 1,024 tokens or longer
- No explicit opt-in required; caching has been applied automatically since late 2024
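The two providers differ in how you turn this on. OpenAI's caching is automatic; on Anthropic you opt in by marking the stable prefix with a `cache_control` block. A minimal sketch with the Anthropic Python SDK (the model name is illustrative; check the current docs):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LARGE_SYSTEM_PROMPT = "..."  # your stable system prompt, >= 1,024 tokens

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            # Mark the stable prefix as cacheable. Requests within the TTL
            # that share this prefix read it from cache at the discounted rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
)

# The usage object reports cache activity on each response
print(response.usage.cache_creation_input_tokens)  # tokens written to cache
print(response.usage.cache_read_input_tokens)      # tokens served from cache
```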
For a typical SaaS use case with a 2,000-token system prompt that remains constant across requests, the savings are significant:
| Without caching | With caching (Anthropic) |
|---|---|
| 2,000 system tokens + 500 user tokens | 2,000 cached tokens (90% off) + 500 user tokens |
| Full price on all 2,500 input tokens | ~90% off on 2,000 tokens, full price on 500 |
| Cost index: 100% | Cost index: ~28% |
That is a 72% cost reduction just from caching the system prompt. No code changes, no model changes, no quality impact.
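The arithmetic behind that ~28% index fits in a few lines, and the same function estimates your own ratio. A quick sketch using Anthropic's 90% cache-read discount:

```python
def cached_cost_index(system_tokens: int, user_tokens: int, discount: float = 0.90) -> float:
    """Effective input cost with caching, as a fraction of the uncached cost."""
    uncached = system_tokens + user_tokens
    cached = system_tokens * (1 - discount) + user_tokens
    return cached / uncached

print(cached_cost_index(2_000, 500))    # 0.28 -> the ~72% reduction above
print(cached_cost_index(1_000, 7_000))  # 0.89 -> caching barely moves the bill
```

The second call previews Sign 2 below: when most tokens sit in unique user input, the discount touches only a sliver of the bill.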
When Prompt Caching Is the Right Answer
Caching is the optimal choice when these conditions are true:
1. You have a large, stable system prompt. The bigger your system prompt relative to user input, the more you save. A 5,000-token system prompt with 200-token user inputs saves more than an 800-token system prompt with 2,000-token user inputs.
2. Your request volume is moderate. At 10,000-100,000 requests per month, caching may reduce your costs enough that the remaining bill is acceptable. Fine-tuning has an upfront time investment that needs to be justified by ongoing savings.
3. Your use case changes frequently. If you are iterating on your AI features weekly — changing system prompts, adding new task types, experimenting with formats — caching lets you iterate without retraining. Fine-tuning locks in behavior that takes effort to change.
4. You do not have training data yet. Caching works from day one with zero data. Fine-tuning requires 500-5,000 high-quality training examples. If you are in the early stages of building your AI feature, caching buys you time to accumulate that data.
5. You need frontier model capabilities. Caching gives you cheaper access to the best models. Fine-tuning gives you a smaller model trained on your specific task. If your task genuinely needs Claude Opus or GPT-4o-level reasoning, caching keeps you on those models at reduced cost.
The Five Signs You Have Outgrown Caching
Sign 1: Your API bill is still too high after caching
Run the math. If your monthly API cost after caching is AU$5,000+ and growing with usage, caching has reduced the slope but has not changed the fundamental linear cost curve. You are still paying per token for every request, just at a lower rate.
Example: A SaaS product doing 500,000 requests/month with a 3,000-token system prompt:
- Without caching: ~AU$15,000/month
- With caching (Anthropic, 90% on cached tokens): ~AU$5,200/month
- With a fine-tuned local model: ~AU$1,200/month (fixed infrastructure)
Caching cut costs 65%. But the local model cuts costs 92%. At this volume, the additional savings of AU$4,000/month justify the fine-tuning investment.
Sign 2: Most of your tokens are in user input, not system prompt
Caching only helps with the repeated prefix. If your requests have short system prompts and long, unique user inputs — document processing, email analysis, code review — the cacheable portion is small. You might cache 1,000 tokens out of 8,000 total. The discount applies to 12.5% of your input tokens.
In these cases, caching saves 5-15% instead of 60-90%. That is not enough to change your margin profile.
Sign 3: Your tasks are well-defined and repetitive
If 80% of your AI requests follow the same pattern (same input format, same output format, same task type), that is a fine-tuning signal. These patterns are exactly what fine-tuning captures. A fine-tuned model can match that output quality without any system prompt at all, because the behavior is internalized in the model weights.
Caching optimizes the delivery of instructions to a general model. Fine-tuning eliminates the need for instructions on tasks the model has already learned.
Sign 4: You want to own your model and data pipeline
Caching keeps you on someone else's infrastructure, subject to their pricing changes, deprecation schedules, and rate limits. Fine-tuning gives you a model you control entirely. You can run it on your own hardware, deploy it in air-gapped environments, and never worry about an API provider changing terms.
Sign 5: Latency matters and caching is not enough
Cached prompts are faster than uncached ones, but they are still cloud API calls. Typical latency: 500-2,000ms for a cached request. A local fine-tuned model on decent hardware: 50-200ms for the same request. If your product needs sub-200ms AI responses — real-time suggestions, inline autocomplete, interactive workflows — local inference is the path.
The Decision Framework
Here is the framework in tabular form:
| Factor | Stay with caching | Switch to fine-tuning |
|---|---|---|
| Monthly API cost after caching | < AU$3,000 | > AU$5,000 and growing |
| % of tokens that are cacheable | > 60% | < 30% |
| Task variety | High, changing frequently | Low, well-defined patterns |
| Training data available | < 500 examples | > 1,000 examples |
| Need for frontier reasoning | Yes, genuinely complex tasks | No, tasks are specific and learnable |
| Latency requirement | > 500ms acceptable | < 200ms needed |
| Data sensitivity | Cloud processing acceptable | On-premise or private required |
| Usage trajectory | Stable or slow growth | Rapid growth, 2x+ in 6 months |
If you check 3+ items in the "Switch to fine-tuning" column, it is time to plan the migration.
The Migration Path: Caching to Fine-Tuning
The transition is not a binary switch. Here is the step-by-step process:
Step 1: Audit your cached workload (1 week)
Analyze your API logs for the last 30-60 days (a sketch of this audit follows the list):
- How many distinct task types do you have?
- What percentage of tokens are cached vs unique?
- What is the distribution of request complexity?
- Which tasks have the most consistent input/output patterns?
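A minimal audit sketch, assuming you can export logs as JSON lines; the `task_type`, `cached_tokens`, and `input_tokens` fields are hypothetical stand-ins for your own schema:

```python
import json
from collections import Counter, defaultdict

task_counts = Counter()
tokens = defaultdict(lambda: {"cached": 0, "total": 0})

with open("api_logs.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        task = entry["task_type"]  # hypothetical field names; adapt to your logs
        task_counts[task] += 1
        tokens[task]["cached"] += entry["cached_tokens"]
        tokens[task]["total"] += entry["input_tokens"]

for task, n in task_counts.most_common():
    share = tokens[task]["cached"] / max(tokens[task]["total"], 1)
    print(f"{task}: {n} requests, {share:.0%} of input tokens cached")
```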
Step 2: Build your training dataset (1-2 weeks)
Your existing API responses are your training data. For each task type you want to migrate:
- Export 2,000-5,000 request-response pairs from your API logs
- Filter for high-quality responses (ones where users did not regenerate or edit)
- Format as instruction-response pairs
You already have this data — it is in your API logs. You have been paying for it with every API call. Now it becomes the asset that eliminates future API costs.
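Turning those logs into training data is mostly filtering and reshaping. A sketch using the same hypothetical log schema as the audit above, with `regenerated` and `edited` flags standing in for whatever quality signals you record:

```python
import json

kept = 0
with open("api_logs.jsonl") as src, open("train.jsonl", "w") as dst:
    for line in src:
        entry = json.loads(line)
        # Keep only responses users accepted without regenerating or editing
        if entry.get("regenerated") or entry.get("edited"):
            continue
        dst.write(json.dumps({
            "instruction": entry["user_input"],      # hypothetical field names
            "response": entry["assistant_output"],
        }) + "\n")
        kept += 1

print(f"wrote {kept} instruction-response pairs")
```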
Step 3: Fine-tune and evaluate (1 week)
Fine-tune a 7B or 14B model on your dataset. Using QLoRA, this takes under 2 hours of GPU time; a setup sketch follows the list. Then evaluate:
- Run the fine-tuned model on a 200-500 example test set
- Compare outputs to your API gold standard
- Score quality on your specific criteria (accuracy, format compliance, tone)
- Target: 90-95%+ quality parity for well-defined tasks
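One way to set up the QLoRA run is with Hugging Face `transformers` and `peft`. This is a sketch rather than a tuned recipe; the base model and hyperparameters are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE = "mistralai/Mistral-7B-Instruct-v0.3"  # any 7B/14B instruct model

# 4-bit quantization of the frozen base model (the "Q" in QLoRA)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Small trainable low-rank adapters on the attention projections
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

From here, a standard supervised training loop (for example, trl's SFTTrainer) over your train.jsonl completes the run.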
Step 4: Deploy and route (1 week)
Deploy the fine-tuned model via Ollama or llama.cpp behind an OpenAI-compatible API endpoint. Update your routing to send migrated task types to the local model. Keep the cloud API as a fallback.
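A routing sketch using the OpenAI Python client on both sides, since Ollama serves an OpenAI-compatible endpoint; the task names and model identifiers are hypothetical:

```python
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

MIGRATED_TASKS = {"ticket_summary", "email_triage"}  # hypothetical task types

def complete(task_type: str, messages: list[dict]) -> str:
    """Route migrated tasks to the local model, everything else to the cloud."""
    if task_type in MIGRATED_TASKS:
        try:
            resp = local.chat.completions.create(
                model="my-finetune",  # the name you registered with Ollama
                messages=messages,
                timeout=5,
            )
            return resp.choices[0].message.content
        except Exception:
            pass  # fall through to the cloud API on any local failure
    resp = cloud.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content
```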
Step 5: Monitor and iterate (ongoing)
Track quality metrics in production. A common monitoring approach (sketched below):
- Shadow-score 5% of local model responses against the cloud API
- Track user feedback signals (regeneration rate, edit distance, satisfaction scores)
- Retrain monthly with new production examples that the model handled poorly
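A shadow-scoring sketch for the first of those signals; the grader model is illustrative, and the final print stands in for your metrics pipeline:

```python
import random
from openai import OpenAI

cloud = OpenAI()    # same cloud client as in the routing sketch
SHADOW_RATE = 0.05  # score 5% of local model responses

def maybe_shadow_score(task_type: str, question: str, local_answer: str) -> None:
    """Occasionally ask the cloud model to grade a local model response."""
    if random.random() > SHADOW_RATE:
        return
    grading_prompt = (
        "Rate the following answer from 1 (unusable) to 5 (perfect) for "
        "accuracy and format compliance. Reply with a single digit.\n"
        f"Task: {task_type}\nQuestion: {question}\nAnswer: {local_answer}"
    )
    resp = cloud.chat.completions.create(
        model="gpt-4o-mini",  # illustrative grader
        messages=[{"role": "user", "content": grading_prompt}],
    )
    score = resp.choices[0].message.content.strip()
    print(f"shadow_score task={task_type} score={score}")  # swap in your metrics pipeline
```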
What You Keep on the Cloud API
Fine-tuning does not replace the cloud API entirely. Keep these on cached cloud API calls:
- New, experimental features where you are still iterating on the prompt and task definition
- Long-tail edge cases that your fine-tuned model has not seen enough examples of
- Tasks requiring broad world knowledge that changes over time (current events, recent data)
- Complex multi-step reasoning that genuinely benefits from 200B+ parameter models
The end state for most SaaS products is a hybrid: 70-90% of requests on fine-tuned local models, 10-30% on cached cloud API calls. You get the cost structure of local inference for the bulk of your traffic and the capability of frontier models for the tasks that need it.
Cost Comparison at Scale
Here is a 12-month cost projection for a SaaS product growing from 100,000 to 500,000 requests per month:
| Month | Requests | API only | API + caching | Fine-tuned + API hybrid |
|---|---|---|---|---|
| 1 | 100K | AU$3,000 | AU$1,050 | AU$1,800 (setup month) |
| 3 | 200K | AU$6,000 | AU$2,100 | AU$1,400 |
| 6 | 350K | AU$10,500 | AU$3,675 | AU$1,500 |
| 12 | 500K | AU$15,000 | AU$5,250 | AU$1,600 |
| 12-month total | — | AU$108,000 | AU$37,800 | AU$18,300 |
Caching saves AU$70,200 over 12 months compared to raw API calls. The fine-tuned hybrid saves an additional AU$19,500 on top of caching — a total of AU$89,700 in savings versus API only.
The gap widens with scale. At 1 million requests per month, the fine-tuned hybrid costs roughly the same as it does at 500,000 (the infrastructure is the same). The API and cached-API options both double.
The Transition Is Not Permanent
One advantage of this migration path: it is reversible. If a fine-tuned model underperforms on a task type, you route that task type back to the cloud API and add more training data. You are not locked in.
Your routing layer gives you a dial, not a switch. Turn it gradually toward local inference as your fine-tuned models improve, and keep the cloud API available for tasks that need it.
The teams that execute this transition well end up with the best of both worlds: frontier model quality on complex tasks, fine-tuned model efficiency on everything else, and a cost structure that scales with their business instead of against it.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- From Prompt Engineering to Fine-Tuning — The full journey from prompts to fine-tuned models
- How to Fine-Tune an LLM — Step-by-step fine-tuning guide for your first model
- Graduate from API to Fine-Tuning — The SaaS-specific migration playbook