    From Prompt Caching to Fine-Tuning: When to Make the Switch

    Prompt caching cuts costs 60-90% for repetitive context. Fine-tuning eliminates per-token costs entirely. Here's how to know when you've outgrown caching and should fine-tune instead.

    Ertas Team

    Prompt caching is the first optimization most teams reach for when AI API costs start climbing. And it works — Anthropic's prompt caching cuts costs by up to 90% on cached tokens, and OpenAI offers similar savings. For many workloads, caching is the right answer for months or even years.

    But caching has a ceiling. It lowers the price per token; it does not eliminate per-token economics, so cost still scales linearly with usage. At a certain scale, or for certain workload profiles, you will hit that ceiling and need to make a different architectural choice: fine-tuning a model you own and run locally.

    This guide walks through when caching is enough, when it is not, and how to make the transition.

    How Prompt Caching Works

    Both Anthropic and OpenAI now offer prompt caching that significantly reduces costs for repeated context.

    The mechanism is straightforward: if the first N tokens of your prompt are identical across requests, those tokens are cached on the provider's infrastructure. Subsequent requests that share the same prefix only pay a fraction of the normal input token cost.

    Anthropic prompt caching:

    • Cached input tokens: 90% discount (you pay 10% of the normal input price)
    • Minimum cacheable prefix: 1,024 tokens for Claude Sonnet, 2,048 for Haiku
    • Cache TTL: 5 minutes (refreshed on each hit)

    OpenAI prompt caching:

    • Cached input tokens: 50% discount
    • Automatic caching on prompts > 1,024 tokens
    • No explicit opt-in required since late 2025
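Anthropic's caching is opt-in per content block. A minimal sketch of a request body with a cacheable system prefix (the model ID is illustrative; check the current model list before use):

```python
# Sketch: an Anthropic Messages API request body whose system prompt is
# marked cacheable. The cache_control field tells the API to cache the
# prefix up to and including this block; later requests that share the
# same prefix pay the discounted rate.

def build_cached_request(system_prompt: str, user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # illustrative model ID
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

body = build_cached_request("You are a support assistant for Acme...",
                            "Where is my order?")
```

The user message sits outside the cached prefix, so it can vary freely between requests without invalidating the cache.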

    For a typical SaaS use case with a 2,000-token system prompt that remains constant across requests, the savings are significant:

    Without caching                        | With caching (Anthropic)
    2,000 system tokens + 500 user tokens  | 2,000 cached tokens (90% off) + 500 user tokens
    Full price on all 2,500 input tokens   | ~90% off on 2,000 tokens, full price on 500
    Cost index: 100%                       | Cost index: ~28%

    That is a 72% cost reduction just from caching the system prompt. No code changes, no model changes, no quality impact.
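The arithmetic behind that table fits in a few lines (input tokens only; output tokens are unchanged by caching):

```python
def input_cost_index(system_tokens: int, user_tokens: int,
                     discount: float = 0.90) -> float:
    """Cached input cost as a fraction of the uncached input cost,
    assuming the whole system prompt is cached on every request."""
    uncached = system_tokens + user_tokens
    cached = system_tokens * (1 - discount) + user_tokens
    return cached / uncached

print(round(input_cost_index(2000, 500), 2))  # 0.28, i.e. ~72% saved
```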

    When Prompt Caching Is the Right Answer

    Caching is the optimal choice when these conditions are true:

    1. You have a large, stable system prompt. The bigger your system prompt relative to user input, the more you save. A 5,000-token system prompt with 200-token user inputs saves more than an 800-token system prompt with 2,000-token user inputs.

    2. Your request volume is moderate. At 10,000-100,000 requests per month, caching may reduce your costs enough that the remaining bill is acceptable. Fine-tuning has an upfront time investment that needs to be justified by ongoing savings.

    3. Your use case changes frequently. If you are iterating on your AI features weekly — changing system prompts, adding new task types, experimenting with formats — caching lets you iterate without retraining. Fine-tuning locks in behavior that takes effort to change.

    4. You do not have training data yet. Caching works from day one with zero data. Fine-tuning requires 500-5,000 high-quality training examples. If you are in the early stages of building your AI feature, caching buys you time to accumulate that data.

    5. You need frontier model capabilities. Caching gives you cheaper access to the best models. Fine-tuning gives you a smaller model trained on your specific task. If your task genuinely needs Claude Opus or GPT-4o-level reasoning, caching keeps you on those models at reduced cost.

    The Five Signs You Have Outgrown Caching

    Sign 1: Your API bill is still too high after caching

    Run the math. If your monthly API cost after caching is AU$5,000+ and growing with usage, caching has reduced the slope but has not changed the fundamental linear cost curve. You are still paying per token for every request, just at a lower rate.

    Example: A SaaS product doing 500,000 requests/month with a 3,000-token system prompt:

    • Without caching: ~AU$15,000/month
    • With caching (Anthropic, 90% on cached tokens): ~AU$5,200/month
    • With a fine-tuned local model: ~AU$1,200/month (fixed infrastructure)

    Caching cut costs 65%. But the local model cuts costs 92%. At this volume, the additional savings of AU$4,000/month justify the fine-tuning investment.
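To sanity-check the numbers for your own workload, the break-even math is simple. The one-off migration cost below is an assumed placeholder, not a figure from this example:

```python
def monthly_saving(cached_api_cost: float, local_infra_cost: float) -> float:
    """Extra monthly saving from moving traffic to a fine-tuned local model."""
    return cached_api_cost - local_infra_cost

def payback_months(migration_cost: float, saving_per_month: float) -> float:
    """Months until a one-off migration effort pays for itself."""
    return migration_cost / saving_per_month

print(monthly_saving(5200, 1200))   # 4000 (AU$/month, from the example above)
print(payback_months(12000, 4000))  # 3.0 (assuming AU$12k of migration effort)
```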

    Sign 2: Most of your tokens are in user input, not system prompt

    Caching only helps with the repeated prefix. If your requests have short system prompts and long, unique user inputs — document processing, email analysis, code review — the cacheable portion is small. You might cache 1,000 tokens out of 8,000 total. The discount applies to 12.5% of your input tokens.

    In these cases, caching saves 5-15% instead of 60-90%. That is not enough to change your margin profile.
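The effective saving is just the cacheable fraction times the discount, which makes the ceiling easy to see:

```python
def effective_saving(cacheable_fraction: float, discount: float = 0.90) -> float:
    """Overall input-cost reduction when only part of the prompt is cacheable."""
    return cacheable_fraction * discount

# 1,000 cacheable tokens out of 8,000 total input tokens:
print(effective_saving(1000 / 8000))  # ~0.11, i.e. only ~11% saved overall
```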

    Sign 3: Your tasks are well-defined and repetitive

    If 80% of your AI requests follow the same pattern — same input format, same output format, same task type — that is a fine-tuning signal. These patterns are exactly what fine-tuning captures. A fine-tuned model can deliver the same output quality without any system prompt at all, because the behavior is internalized in the model weights.

    Caching optimizes the delivery of instructions to a general model. Fine-tuning eliminates the need for instructions on tasks the model has already learned.

    Sign 4: You want to own your model and data pipeline

    Caching keeps you on someone else's infrastructure, subject to their pricing changes, deprecation schedules, and rate limits. Fine-tuning gives you a model you control entirely. You can run it on your own hardware, deploy it in air-gapped environments, and never worry about an API provider changing terms.

    Sign 5: Latency matters and caching is not enough

    Cached prompts are faster than uncached ones, but they are still cloud API calls. Typical latency: 500-2,000ms for a cached request. A local fine-tuned model on decent hardware: 50-200ms for the same request. If your product needs sub-200ms AI responses — real-time suggestions, inline autocomplete, interactive workflows — local inference is the path.

    The Decision Framework

    Here is the framework in tabular form:

    Factor                         | Stay with caching            | Switch to fine-tuning
    Monthly API cost after caching | < AU$3,000                   | > AU$5,000 and growing
    % of tokens that are cacheable | > 60%                        | < 30%
    Task variety                   | High, changing frequently    | Low, well-defined patterns
    Training data available        | < 500 examples               | > 1,000 examples
    Need for frontier reasoning    | Yes, genuinely complex tasks | No, tasks are specific and learnable
    Latency requirement            | > 500ms acceptable           | < 200ms needed
    Data sensitivity               | Cloud processing acceptable  | On-premise or private required
    Usage trajectory               | Stable or slow growth        | Rapid growth, 2x+ in 6 months

    If you check 3+ items in the "Switch to fine-tuning" column, it is time to plan the migration.
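Treating the table as a checklist, the decision rule is a simple count. The answer keys below are shorthand for the table rows:

```python
# Each flag is True when your answer lands in the "switch to fine-tuning"
# column of the table above; 3+ means start planning the migration.
def should_plan_finetuning(answers: dict) -> bool:
    return sum(answers.values()) >= 3

answers = {
    "api_cost_over_5k_and_growing": True,
    "cacheable_share_below_30pct": True,
    "tasks_low_variety_well_defined": True,
    "over_1000_training_examples": False,
    "frontier_reasoning_not_needed": False,
    "needs_sub_200ms_latency": False,
    "requires_on_prem_or_private": False,
    "usage_doubling_within_6mo": False,
}
print(should_plan_finetuning(answers))  # True
```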

    The Migration Path: Caching to Fine-Tuning

    The transition is not a binary switch. Here is the step-by-step process:

    Step 1: Audit your cached workload (1 week)

    Analyze your API logs for the last 30-60 days:

    • How many distinct task types do you have?
    • What percentage of tokens are cached vs unique?
    • What is the distribution of request complexity?
    • Which tasks have the most consistent input/output patterns?
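A sketch of that audit, assuming logs exported as JSON lines with hypothetical field names (adapt them to whatever your logging actually records):

```python
import json
from collections import Counter

def audit_logs(lines):
    """Summarize task mix and cached-token share from exported request logs.
    The field names here are assumptions about your log schema."""
    tasks = Counter()
    cached = total = 0
    for line in lines:
        rec = json.loads(line)
        tasks[rec["task_type"]] += 1
        cached += rec["cached_tokens"]
        total += rec["total_input_tokens"]
    return {"task_mix": dict(tasks), "cached_share": cached / total}

sample = [
    '{"task_type": "summarize", "cached_tokens": 2000, "total_input_tokens": 2500}',
    '{"task_type": "classify", "cached_tokens": 1000, "total_input_tokens": 8000}',
]
summary = audit_logs(sample)
print(summary["cached_share"])  # ~0.29: under 30% cacheable is a fine-tuning signal
```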

    Step 2: Build your training dataset (1-2 weeks)

    Your existing API responses are your training data. For each task type you want to migrate:

    • Export 2,000-5,000 request-response pairs from your API logs
    • Filter for high-quality responses (ones where users did not regenerate or edit)
    • Format as instruction-response pairs

    You already have this data — it is in your API logs. You have been paying for it with every API call. Now it becomes the asset that eliminates future API costs.
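A minimal sketch of the formatting step, emitting one chat-style example per line (a common shape for supervised fine-tuning datasets; match whatever format your training tooling expects):

```python
import json

def to_training_line(request_text: str, response_text: str) -> str:
    """Format one logged request/response pair as a JSONL chat example."""
    return json.dumps({
        "messages": [
            {"role": "user", "content": request_text},
            {"role": "assistant", "content": response_text},
        ]
    })

line = to_training_line("Summarize this ticket: ...", "Customer reports ...")
```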

    Step 3: Fine-tune and evaluate (1 week)

    Fine-tune a 7B or 14B model on your dataset. Using QLoRA, this takes under 2 hours of GPU time. Then evaluate:

    • Run the fine-tuned model on a 200-500 example test set
    • Compare outputs to your API gold standard
    • Score quality on your specific criteria (accuracy, format compliance, tone)
    • Target: 90-95%+ quality parity for well-defined tasks
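Quality parity can be scored with a task-specific comparator over the test set. Exact match below is a deliberately crude stand-in for whatever criteria actually matter to your task:

```python
def parity_rate(local_outputs, gold_outputs, is_match):
    """Share of test examples where the fine-tuned model's output passes
    the comparator against the cloud-API gold standard."""
    hits = sum(1 for a, b in zip(local_outputs, gold_outputs) if is_match(a, b))
    return hits / len(gold_outputs)

exact = lambda a, b: a.strip() == b.strip()  # crude comparator for demo purposes
print(parity_rate(["APPROVE", "reject "], ["APPROVE", "approve"], exact))  # 0.5
```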

    Step 4: Deploy and route (1 week)

    Deploy the fine-tuned model via Ollama or llama.cpp behind an OpenAI-compatible API endpoint. Update your routing to send migrated task types to the local model. Keep the cloud API as a fallback.
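The routing layer can start as a lookup from task type to endpoint. The task names and URLs below are illustrative (Ollama's default local port, with the cloud API as fallback):

```python
# Migrated task types are served by the local fine-tuned model; everything
# else falls back to the cloud API. Names and URLs are illustrative.
MIGRATED_TASKS = {"summarize_ticket", "classify_email"}

def base_url(task_type: str) -> str:
    if task_type in MIGRATED_TASKS:
        return "http://localhost:11434/v1"  # local model behind an OpenAI-compatible API
    return "https://api.openai.com/v1"      # cloud fallback

print(base_url("summarize_ticket"))  # http://localhost:11434/v1
```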

    Step 5: Monitor and iterate (ongoing)

    Track quality metrics in production. Common monitoring approach:

    • Shadow-score 5% of local model responses against the cloud API
    • Track user feedback signals (regeneration rate, edit distance, satisfaction scores)
    • Retrain monthly with new production examples that the model handled poorly
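The shadow sample is a per-request coin flip; seeded here so the sketch is reproducible:

```python
import random

def should_shadow_score(rng: random.Random, sample_rate: float = 0.05) -> bool:
    """Decide whether this response also goes to the cloud API for
    comparison scoring (the 5% shadow sample)."""
    return rng.random() < sample_rate

rng = random.Random(0)
hits = sum(should_shadow_score(rng) for _ in range(100_000))
print(hits / 100_000)  # close to 0.05
```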

    What You Keep on the Cloud API

    Fine-tuning does not replace the cloud API entirely. Keep these on cached cloud API calls:

    • New, experimental features where you are still iterating on the prompt and task definition
    • Long-tail edge cases that your fine-tuned model has not seen enough examples of
    • Tasks requiring broad world knowledge that changes over time (current events, recent data)
    • Complex multi-step reasoning that genuinely benefits from 200B+ parameter models

    The end state for most SaaS products is a hybrid: 70-90% of requests on fine-tuned local models, 10-30% on cached cloud API calls. You get the cost structure of local inference for the bulk of your traffic and the capability of frontier models for the tasks that need it.

    Cost Comparison at Scale

    Here is a 12-month cost projection for a SaaS product growing from 100,000 to 500,000 requests per month:

    Month          | Requests | API only   | API + caching | Fine-tuned + API hybrid
    1              | 100K     | AU$3,000   | AU$1,050      | AU$1,800 (setup month)
    3              | 200K     | AU$6,000   | AU$2,100      | AU$1,400
    6              | 350K     | AU$10,500  | AU$3,675      | AU$1,500
    12             | 500K     | AU$15,000  | AU$5,250      | AU$1,600
    12-month total |          | AU$108,000 | AU$37,800     | AU$18,300

    Caching saves AU$70,200 over 12 months compared to raw API calls. The fine-tuned hybrid saves an additional AU$19,500 on top of caching — a total of AU$89,700 in savings versus API only.

    The gap widens with scale. At 1 million requests per month, the fine-tuned hybrid costs roughly the same as it does at 500,000 (the infrastructure is the same). The API and cached-API options both double.

    The Transition Is Not Permanent

    One advantage of this migration path: it is reversible. If a fine-tuned model underperforms on a task type, you route that task type back to the cloud API and add more training data. You are not locked in.

    Your routing layer gives you a dial, not a switch. Turn it gradually toward local inference as your fine-tuned models improve, and keep the cloud API available for tasks that need it.

    The teams that execute this transition well end up with the best of both worlds: frontier model quality on complex tasks, fine-tuned model efficiency on everything else, and a cost structure that scales with their business instead of against it.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
