
From Prompt Caching to Fine-Tuning: When to Make the Switch
Prompt caching cuts costs 60-90% for repetitive context. Fine-tuning eliminates per-token costs entirely. Here's how to know when you've outgrown caching and should fine-tune instead.
Prompt caching is the first optimization most teams reach for when AI API costs start climbing. And it works — Anthropic's prompt caching cuts costs by up to 90% on cached tokens, and OpenAI offers similar savings. For many workloads, caching is the right answer for months or even years.
But caching has a ceiling. It optimizes the cost per token but does not eliminate per-token economics. At a certain scale, or for certain workload profiles, you will hit that ceiling and need to make a different architectural choice: fine-tuning a model you own and run locally.
This guide walks through when caching is enough, when it is not, and how to make the transition.
How Prompt Caching Works
Both Anthropic and OpenAI now offer prompt caching that significantly reduces costs for repeated context.
The mechanism is straightforward: if the first N tokens of your prompt are identical across requests, those tokens are cached on the provider's infrastructure. Subsequent requests that share the same prefix only pay a fraction of the normal input token cost.
Anthropic prompt caching:
- Cached input tokens: 90% discount (you pay 10% of the normal input price)
- Minimum cacheable prefix: 1,024 tokens for Claude Sonnet, 2,048 for Haiku
- Cache TTL: 5 minutes (refreshed on each hit)
OpenAI prompt caching:
- Cached input tokens: 50% discount
- Automatic caching on prompts of 1,024 tokens or longer
- No explicit opt-in required; caching has been applied automatically since late 2024
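The two providers differ in how you turn this on. OpenAI's caching is automatic; on Anthropic you opt in by marking the stable prefix with a `cache_control` block. A minimal sketch with the Anthropic Python SDK (the model name is illustrative; check the current docs):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LARGE_SYSTEM_PROMPT = "..."  # your stable system prompt, >= 1,024 tokens

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            # Mark the stable prefix as cacheable. Requests within the TTL
            # that share this prefix read it from cache at the discounted rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
)

# The usage object reports cache activity on each response
print(response.usage.cache_creation_input_tokens)  # tokens written to cache
print(response.usage.cache_read_input_tokens)      # tokens served from cache
```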
For a typical SaaS use case with a 2,000-token system prompt that remains constant across requests, the savings are significant:
| Without caching | With caching (Anthropic) |
|---|---|
| 2,000 system tokens + 500 user tokens | 2,000 cached tokens (90% off) + 500 user tokens |
| Full price on all 2,500 input tokens | ~90% off on 2,000 tokens, full price on 500 |
| Cost index: 100% | Cost index: ~28% |
That is a 72% cost reduction just from caching the system prompt. No code changes, no model changes, no quality impact.
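The arithmetic behind that ~28% index fits in a few lines, and the same function estimates your own ratio. A quick sketch using Anthropic's 90% cache-read discount:

```python
def cached_cost_index(system_tokens: int, user_tokens: int, discount: float = 0.90) -> float:
    """Effective input cost with caching, as a fraction of the uncached cost."""
    uncached = system_tokens + user_tokens
    cached = system_tokens * (1 - discount) + user_tokens
    return cached / uncached

print(cached_cost_index(2_000, 500))    # 0.28 -> the ~72% reduction above
print(cached_cost_index(1_000, 7_000))  # 0.89 -> caching barely moves the bill
```

The second call previews Sign 2 below: when most tokens sit in unique user input, the discount touches only a sliver of the bill.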
When Prompt Caching Is the Right Answer
Caching is the optimal choice when these conditions are true:
1. You have a large, stable system prompt. The bigger your system prompt relative to user input, the more you save. A 5,000-token system prompt with 200-token user inputs saves more than an 800-token system prompt with 2,000-token user inputs.
2. Your request volume is moderate. At 10,000-100,000 requests per month, caching may reduce your costs enough that the remaining bill is acceptable. Fine-tuning has an upfront time investment that needs to be justified by ongoing savings.
3. Your use case changes frequently. If you are iterating on your AI features weekly — changing system prompts, adding new task types, experimenting with formats — caching lets you iterate without retraining. Fine-tuning locks in behavior that takes effort to change.
4. You do not have training data yet. Caching works from day one with zero data. Fine-tuning requires 500-5,000 high-quality training examples. If you are in the early stages of building your AI feature, caching buys you time to accumulate that data.
5. You need frontier model capabilities. Caching gives you cheaper access to the best models. Fine-tuning gives you a smaller model trained on your specific task. If your task genuinely needs Claude Opus or GPT-4o-level reasoning, caching keeps you on those models at reduced cost.
The Five Signs You Have Outgrown Caching
Sign 1: Your API bill is still too high after caching
Run the math. If your monthly API cost after caching is AU$5,000+ and growing with usage, caching has reduced the slope but has not changed the fundamental linear cost curve. You are still paying per token for every request, just at a lower rate.
Example: A SaaS product doing 500,000 requests/month with a 3,000-token system prompt:
- Without caching: ~AU$15,000/month
- With caching (Anthropic, 90% on cached tokens): ~AU$5,200/month
- With a fine-tuned local model: ~AU$1,200/month (fixed infrastructure)
Caching cut costs 65%. But the local model cuts costs 92%. At this volume, the additional savings of AU$4,000/month justify the fine-tuning investment.
Sign 2: Most of your tokens are in user input, not system prompt
Caching only helps with the repeated prefix. If your requests have short system prompts and long, unique user inputs — document processing, email analysis, code review — the cacheable portion is small. You might cache 1,000 tokens out of 8,000 total. The discount applies to 12.5% of your input tokens.
In these cases, caching saves 5-15% instead of 60-90%. That is not enough to change your margin profile.
Sign 3: Your tasks are well-defined and repetitive
If 80% of your AI requests follow the same pattern (same input format, same output format, same task type), that is a fine-tuning signal. These patterns are exactly what fine-tuning captures. A fine-tuned model can match that output quality without any system prompt at all, because the behavior is internalized in the model weights.
Caching optimizes the delivery of instructions to a general model. Fine-tuning eliminates the need for instructions on tasks the model has already learned.
Sign 4: You want to own your model and data pipeline
Caching keeps you on someone else's infrastructure, subject to their pricing changes, deprecation schedules, and rate limits. Fine-tuning gives you a model you control entirely. You can run it on your own hardware, deploy it in air-gapped environments, and never worry about an API provider changing terms.
Sign 5: Latency matters and caching is not enough
Cached prompts are faster than uncached ones, but they are still cloud API calls. Typical latency: 500-2,000ms for a cached request. A local fine-tuned model on decent hardware: 50-200ms for the same request. If your product needs sub-200ms AI responses — real-time suggestions, inline autocomplete, interactive workflows — local inference is the path.
The Decision Framework
Here is the framework in tabular form:
| Factor | Stay with caching | Switch to fine-tuning |
|---|---|---|
| Monthly API cost after caching | < AU$3,000 | > AU$5,000 and growing |
| % of tokens that are cacheable | > 60% | < 30% |
| Task variety | High, changing frequently | Low, well-defined patterns |
| Training data available | < 500 examples | > 1,000 examples |
| Need for frontier reasoning | Yes, genuinely complex tasks | No, tasks are specific and learnable |
| Latency requirement | > 500ms acceptable | < 200ms needed |
| Data sensitivity | Cloud processing acceptable | On-premise or private required |
| Usage trajectory | Stable or slow growth | Rapid growth, 2x+ in 6 months |
If you check 3+ items in the "Switch to fine-tuning" column, it is time to plan the migration.
The Migration Path: Caching to Fine-Tuning
The transition is not a binary switch. Here is the step-by-step process:
Step 1: Audit your cached workload (1 week)
Analyze your API logs for the last 30-60 days (a sketch of this audit follows the list):
- How many distinct task types do you have?
- What percentage of tokens are cached vs unique?
- What is the distribution of request complexity?
- Which tasks have the most consistent input/output patterns?
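A minimal audit sketch, assuming you can export logs as JSON lines; the `task_type`, `cached_tokens`, and `input_tokens` fields are hypothetical stand-ins for your own schema:

```python
import json
from collections import Counter, defaultdict

task_counts = Counter()
tokens = defaultdict(lambda: {"cached": 0, "total": 0})

with open("api_logs.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        task = entry["task_type"]  # hypothetical field names; adapt to your logs
        task_counts[task] += 1
        tokens[task]["cached"] += entry["cached_tokens"]
        tokens[task]["total"] += entry["input_tokens"]

for task, n in task_counts.most_common():
    share = tokens[task]["cached"] / max(tokens[task]["total"], 1)
    print(f"{task}: {n} requests, {share:.0%} of input tokens cached")
```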
Step 2: Build your training dataset (1-2 weeks)
Your existing API responses are your training data. For each task type you want to migrate:
- Export 2,000-5,000 request-response pairs from your API logs
- Filter for high-quality responses (ones where users did not regenerate or edit)
- Format as instruction-response pairs
You already have this data — it is in your API logs. You have been paying for it with every API call. Now it becomes the asset that eliminates future API costs.
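Turning those logs into training data is mostly filtering and reshaping. A sketch using the same hypothetical log schema as the audit above, with `regenerated` and `edited` flags standing in for whatever quality signals you record:

```python
import json

kept = 0
with open("api_logs.jsonl") as src, open("train.jsonl", "w") as dst:
    for line in src:
        entry = json.loads(line)
        # Keep only responses users accepted without regenerating or editing
        if entry.get("regenerated") or entry.get("edited"):
            continue
        dst.write(json.dumps({
            "instruction": entry["user_input"],      # hypothetical field names
            "response": entry["assistant_output"],
        }) + "\n")
        kept += 1

print(f"wrote {kept} instruction-response pairs")
```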
Step 3: Fine-tune and evaluate (1 week)
Fine-tune a 7B or 14B model on your dataset. Using QLoRA, this takes under 2 hours of GPU time; a setup sketch follows the list. Then evaluate:
- Run the fine-tuned model on a 200-500 example test set
- Compare outputs to your API gold standard
- Score quality on your specific criteria (accuracy, format compliance, tone)
- Target: 90-95%+ quality parity for well-defined tasks
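One way to set up the QLoRA run is with Hugging Face `transformers` and `peft`. This is a sketch rather than a tuned recipe; the base model and hyperparameters are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE = "mistralai/Mistral-7B-Instruct-v0.3"  # any 7B/14B instruct model

# 4-bit quantization of the frozen base model (the "Q" in QLoRA)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Small trainable low-rank adapters on the attention projections
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

From here, a standard supervised training loop (for example, trl's SFTTrainer) over your train.jsonl completes the run.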
Step 4: Deploy and route (1 week)
Deploy the fine-tuned model via Ollama or llama.cpp behind an OpenAI-compatible API endpoint. Update your routing to send migrated task types to the local model. Keep the cloud API as a fallback.
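A routing sketch using the OpenAI Python client on both sides, since Ollama serves an OpenAI-compatible endpoint; the task names and model identifiers are hypothetical:

```python
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

MIGRATED_TASKS = {"ticket_summary", "email_triage"}  # hypothetical task types

def complete(task_type: str, messages: list[dict]) -> str:
    """Route migrated tasks to the local model, everything else to the cloud."""
    if task_type in MIGRATED_TASKS:
        try:
            resp = local.chat.completions.create(
                model="my-finetune",  # the name you registered with Ollama
                messages=messages,
                timeout=5,
            )
            return resp.choices[0].message.content
        except Exception:
            pass  # fall through to the cloud API on any local failure
    resp = cloud.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content
```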
Step 5: Monitor and iterate (ongoing)
Track quality metrics in production. A common monitoring approach (sketched below):
- Shadow-score 5% of local model responses against the cloud API
- Track user feedback signals (regeneration rate, edit distance, satisfaction scores)
- Retrain monthly with new production examples that the model handled poorly
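A shadow-scoring sketch for the first of those signals; the grader model is illustrative, and the final print stands in for your metrics pipeline:

```python
import random
from openai import OpenAI

cloud = OpenAI()    # same cloud client as in the routing sketch
SHADOW_RATE = 0.05  # score 5% of local model responses

def maybe_shadow_score(task_type: str, question: str, local_answer: str) -> None:
    """Occasionally ask the cloud model to grade a local model response."""
    if random.random() > SHADOW_RATE:
        return
    grading_prompt = (
        "Rate the following answer from 1 (unusable) to 5 (perfect) for "
        "accuracy and format compliance. Reply with a single digit.\n"
        f"Task: {task_type}\nQuestion: {question}\nAnswer: {local_answer}"
    )
    resp = cloud.chat.completions.create(
        model="gpt-4o-mini",  # illustrative grader
        messages=[{"role": "user", "content": grading_prompt}],
    )
    score = resp.choices[0].message.content.strip()
    print(f"shadow_score task={task_type} score={score}")  # swap in your metrics pipeline
```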
What You Keep on the Cloud API
Fine-tuning does not replace the cloud API entirely. Keep these on cached cloud API calls:
- New, experimental features where you are still iterating on the prompt and task definition
- Long-tail edge cases that your fine-tuned model has not seen enough examples of
- Tasks requiring broad world knowledge that changes over time (current events, recent data)
- Complex multi-step reasoning that genuinely benefits from 200B+ parameter models
The end state for most SaaS products is a hybrid: 70-90% of requests on fine-tuned local models, 10-30% on cached cloud API calls. You get the cost structure of local inference for the bulk of your traffic and the capability of frontier models for the tasks that need it.
Cost Comparison at Scale
Here is a 12-month cost projection for a SaaS product growing from 100,000 to 500,000 requests per month:
| Month | Requests | API only | API + caching | Fine-tuned + API hybrid |
|---|---|---|---|---|
| 1 | 100K | AU$3,000 | AU$1,050 | AU$1,800 (setup month) |
| 3 | 200K | AU$6,000 | AU$2,100 | AU$1,400 |
| 6 | 350K | AU$10,500 | AU$3,675 | AU$1,500 |
| 12 | 500K | AU$15,000 | AU$5,250 | AU$1,600 |
| 12-month total | — | AU$108,000 | AU$37,800 | AU$18,300 |
Caching saves AU$70,200 over 12 months compared to raw API calls. The fine-tuned hybrid saves an additional AU$19,500 on top of caching — a total of AU$89,700 in savings versus API only.
The gap widens with scale. At 1 million requests per month, the fine-tuned hybrid costs roughly the same as it does at 500,000 (the infrastructure is the same). The API and cached-API options both double.
The Transition Is Not Permanent
One advantage of this migration path: it is reversible. If a fine-tuned model underperforms on a task type, you route that task type back to the cloud API and add more training data. You are not locked in.
Your routing layer gives you a dial, not a switch. Turn it gradually toward local inference as your fine-tuned models improve, and keep the cloud API available for tasks that need it.
The teams that execute this transition well end up with the best of both worlds: frontier model quality on complex tasks, fine-tuned model efficiency on everything else, and a cost structure that scales with their business instead of against it.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- From Prompt Engineering to Fine-Tuning — The full journey from prompts to fine-tuned models
- How to Fine-Tune an LLM — Step-by-step fine-tuning guide for your first model
- Graduate from API to Fine-Tuning — The SaaS-specific migration playbook