
    When NOT to Fine-Tune: 5 Cases Where RAG, Prompting, or APIs Are Better

    An honest guide to when fine-tuning is the wrong approach — covering five common scenarios where RAG, prompt engineering, or API calls deliver better results with less effort.

    Ertas Team

    Ertas is a fine-tuning platform. We built it because fine-tuning solves real problems that prompting and RAG cannot. But we have also watched teams spend weeks fine-tuning models for tasks where a simpler approach would have worked better. That is time and money wasted, and it erodes trust in fine-tuning as a technique.

    We would rather you succeed with the right approach than fail with fine-tuning. So here are five cases where you should not fine-tune — and what to do instead.

    Case 1: Your Knowledge Base Changes Frequently

    The scenario: You are building a customer support bot for an e-commerce company. Product catalog, pricing, return policies, and shipping information change weekly. You want the model to give accurate, up-to-date answers.

    Why fine-tuning is wrong here: Fine-tuning bakes knowledge into the model at training time. When you train on your current product catalog, the model learns that the "UltraWidget Pro" costs $299 and ships in 3-5 business days. Next week, the price drops to $249 and shipping changes to 2-3 days. Your fine-tuned model still says $299 and 3-5 days.

    To keep the model current, you would need to retrain every time anything changes. At weekly update cycles, that means weekly retraining — each cycle costing compute time, requiring data preparation, and risking regression on other behaviors. The economics do not work.

    What to do instead: RAG. Retrieval-augmented generation retrieves relevant documents from your knowledge base at query time. When prices change, you update the documents in your vector database. The model sees the current information on every request without any retraining.

    Update your vector index in minutes. Retrain a model in hours. For rapidly changing information, the choice is obvious.
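The difference is easy to see in code. Here is a minimal sketch of the RAG update path, with naive keyword matching standing in for a real vector search (all names and product data are illustrative, not part of any real API):

```python
# Why RAG stays current: the knowledge lives in the index, not in model
# weights, so updating a document takes effect on the very next query.

def retrieve(store: dict, query: str) -> str:
    """Naive keyword lookup standing in for a vector-similarity search."""
    for doc_id, text in store.items():
        if any(word in text.lower() for word in query.lower().split()):
            return text
    return "No matching document."

store = {"ultrawidget": "UltraWidget Pro costs $299 and ships in 3-5 business days."}
answer_before = retrieve(store, "UltraWidget price")

# Price changes: update the document, not the model. No retraining.
store["ultrawidget"] = "UltraWidget Pro costs $249 and ships in 2-3 business days."
answer_after = retrieve(store, "UltraWidget price")
```

A fine-tuned model has no equivalent of that one-line update: the old price is frozen into the weights until the next training run.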

    The exception: If you need the model to have a specific communication style or follow particular response patterns while using RAG for knowledge, you can combine both. Fine-tune for behavior, RAG for facts. This is often the best architecture — but fine-tuning alone is not the answer when knowledge is volatile.

    Case 2: You Need the Model to Cite Sources

    The scenario: You are building a research assistant for a consulting firm. Analysts ask questions and need answers backed by specific reports, studies, or internal documents. Every claim must be traceable to a source.

    Why fine-tuning is wrong here: A fine-tuned model absorbs information into its weights. It can tell you what it "knows," but it cannot tell you where it learned it. There is no mechanism for a fine-tuned model to point to a specific document, page number, or passage that supports its output.

    You could try to include source references in the training data (training the model to output claims with citations), but this is brittle. The model learns to generate plausible-looking citations, not actual ones. It will fabricate document titles and page numbers with the same confidence it has for real ones.

    What to do instead: RAG with retrieval metadata. A RAG system retrieves specific chunks of text from identified documents. Each chunk carries metadata — document title, page number, section, date. The model generates a response based on the retrieved chunks, and you can display the source metadata alongside the answer.

    This gives you genuine traceability. The citations are real because they come from the retrieval system, not from the model's generation. For compliance-heavy industries — legal, financial, medical, consulting — this is not optional.
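A sketch of what retrieval metadata looks like in practice. The chunk schema and helper below are hypothetical, but the pattern is standard: citations come from the retrieval layer, so they are real by construction.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """A retrieved passage plus the metadata that makes it citable."""
    text: str
    doc_title: str
    page: int

def retrieve_with_sources(chunks: list, query: str):
    """Return matching chunks and human-readable citations for each hit."""
    hits = [c for c in chunks if query.lower() in c.text.lower()]
    citations = [f"{c.doc_title}, p. {c.page}" for c in hits]
    return hits, citations

chunks = [
    Chunk("Q3 revenue grew 12% year over year.", "Q3 Market Report", 4),
    Chunk("Churn fell to 3.1% after the pricing change.", "Retention Study", 11),
]
hits, citations = retrieve_with_sources(chunks, "revenue")
```

The model only ever sees the `hits`; the `citations` are displayed to the user straight from the metadata, so the model never gets the chance to invent a page number.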

    The nuance: Fine-tuning can improve how the model uses retrieved context. A fine-tuned model might be better at synthesizing information from multiple retrieved documents and producing coherent responses. But the citation capability itself must come from the retrieval architecture, not from fine-tuning.

    Case 3: You Only Need It for Occasional One-Off Tasks

    The scenario: Your agency builds custom AI solutions. You have a client who needs a one-time migration of 5,000 product descriptions from one format to another. It is a batch job that will run once and be done.

    Why fine-tuning is wrong here: Fine-tuning has a fixed upfront cost: data preparation (2-8 hours of work), training compute ($5-50), and evaluation time (1-2 hours). For a task you will run once, this investment often exceeds the cost of just using an API.

    At $0.01-0.03 per request, processing 5,000 items through a frontier API costs $50-150. Fine-tuning the model, deploying it, and processing the same 5,000 items costs roughly the same in total — but takes significantly more calendar time. And the fine-tuned model has no residual value if you are not going to use it again.

    What to do instead: Use an API with a well-crafted prompt. For one-time batch jobs, the API approach is faster to set up, easier to iterate on, and costs about the same. Write a good system prompt, test it on 50 examples, adjust, and then process the full batch.

    The threshold: Fine-tuning starts making economic sense when you expect to process 50+ requests per day on an ongoing basis. Below that, the per-request cost savings do not offset the fixed cost of training. The calculation changes if consistency is business-critical (see Case 5's exception), but for most one-off tasks, an API is the right tool.
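The break-even point can be estimated directly. The numbers below are illustrative assumptions, not quotes: a $200 all-in fixed cost (data prep, training, evaluation), $0.02 per request via a frontier API, $0.002 per request on a fine-tuned model.

```python
def breakeven_days(fixed_cost: float, api_cost: float,
                   ft_cost: float, req_per_day: int) -> float:
    """Days until per-request savings repay the fixed training cost."""
    daily_saving = (api_cost - ft_cost) * req_per_day
    return fixed_cost / daily_saving if daily_saving > 0 else float("inf")

# Illustrative assumptions: $200 fixed cost, $0.02/req API, $0.002/req fine-tuned.
low_volume = breakeven_days(200, 0.02, 0.002, 10)    # occasional use
high_volume = breakeven_days(200, 0.02, 0.002, 500)  # sustained production load
```

At 500 requests/day the fixed cost pays back in under a month; at 10 requests/day it takes years. Plug in your own costs, but the shape of the curve is the point.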

    Case 4: Your Task Requires Broad World Knowledge

    The scenario: You want a model that can answer general questions about history, science, current events, and culture for an educational platform. It needs to handle questions from elementary-level to graduate-level across dozens of subjects.

    Why fine-tuning is wrong here: Fine-tuning narrows a model. It makes the model better at your specific task by specializing its weights toward your training distribution. This specialization comes at a cost — the model becomes less capable at things outside the training distribution.

    If you fine-tune a 7B model on history questions, it gets better at history but may get slightly worse at science. If you fine-tune on both, you need an enormous and diverse training dataset to avoid degrading any single domain. At that point, you are essentially re-doing the model's pre-training, which is not practical.

    Broad world knowledge is what large frontier models are built for. A 400B+ parameter model trained on trillions of tokens has the capacity to hold vast knowledge across domains. A 7B model fine-tuned on 5,000 examples does not compete on breadth.

    What to do instead: Use a large frontier model via API. GPT-4o, Claude 3.5 Sonnet, or Gemini 1.5 Pro already have the broad knowledge your task requires. Use them directly with appropriate prompting.

    The cost concern: Frontier APIs are expensive at scale. If you hit a volume where API costs become untenable, consider a hybrid approach: use RAG to retrieve relevant knowledge and serve it to a smaller model. This gives you the broad coverage of your knowledge base with the cost efficiency of a smaller model — without the narrowing effect of fine-tuning.

    Case 5: You Have Not Tried Good Prompt Engineering First

    The scenario: You have a task. You wrote a basic prompt. The output is not great. You conclude that you need to fine-tune.

    Why fine-tuning is wrong here: This is the most common mistake we see. Teams jump to fine-tuning after writing a 200-word system prompt and testing it on 10 examples. They have not tried:

    • Few-shot examples in the prompt. Adding 3-5 high-quality input-output examples to the system prompt can dramatically improve output quality. This takes 30 minutes, not 30 hours.
    • Chain-of-thought prompting. For reasoning tasks, asking the model to "think step by step" before producing the final answer improves accuracy by 10-30% on many benchmarks.
    • Structured output instructions. Explicitly specifying the output format, including field names and types, often solves formatting issues without fine-tuning.
    • Iterative prompt refinement. Testing on 50+ diverse examples, identifying failure patterns, and addressing them with targeted prompt modifications. Most teams give up after 3-4 iterations when 10-15 iterations would have solved the problem.
    • Model selection. Sometimes the issue is not the prompt but the model. Trying a different base model (or a larger one) may solve the problem immediately.
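The first two items on the list cost almost nothing to try. Here is a minimal sketch of few-shot prompt assembly; the instruction and examples are made up for illustration:

```python
def build_prompt(instruction: str, examples: list, user_input: str) -> str:
    """Assemble a few-shot prompt: instruction, worked examples, then the task."""
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {user_input}\nOutput:"

prompt = build_prompt(
    "Rewrite product descriptions in a concise, friendly tone.",
    [("Heavy-duty steel hammer, 16oz.",
      "A sturdy 16oz steel hammer for everyday jobs."),
     ("LED lamp, 800 lumens, USB-C.",
      "A bright 800-lumen LED lamp that charges over USB-C.")],
    "Waterproof hiking boots, size 10.",
)
```

Swapping examples in and out of a prompt like this takes seconds per iteration, which is exactly why you should exhaust it before committing to a training run.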

    Good prompt engineering is faster and cheaper to iterate on than fine-tuning. A prompt change takes seconds to test. A fine-tuning change takes hours. Always exhaust prompt optimization before investing in training.

    When prompt engineering is genuinely exhausted: You have a 1,500+ token system prompt with 5+ few-shot examples. You have tested on 100+ inputs. You have iterated on the prompt 15+ times. Accuracy has plateaued. The model is inconsistent across identical runs. At this point, you have strong evidence that the task exceeds what prompting can deliver — and your prompt becomes the starting point for generating fine-tuning data. Read our migration playbook for the exact process.

    The Decision Flowchart

    When you are evaluating whether to fine-tune, walk through these questions in order:

    1. Have you optimized your prompt thoroughly? If no, do that first. Cost: hours. Potential upside: solves the problem entirely.

    2. Does your task require frequently updated knowledge? If yes, use RAG. Fine-tuning bakes in stale knowledge.

    3. Do you need source citations? If yes, you need a retrieval component. Fine-tuning alone cannot provide genuine citations.

    4. Is this a one-time or low-volume task? If under 50 requests/day with no ongoing need, use an API. The fixed cost of fine-tuning does not justify itself.

    5. Does the task require broad knowledge across many domains? If yes, use a large frontier model. Fine-tuning narrows capability.

    6. Is your task narrow, high-volume, and dependent on consistent behavior? If yes to all three, fine-tune. This is where fine-tuning excels.

    If you reach question 6, fine-tuning is likely the right choice. The task is narrow enough to train well, high-volume enough to justify the investment, and demands the consistency that only learned behavior (not prompted behavior) can provide.
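The flowchart above is simple enough to encode directly. This sketch mirrors the six questions in order (the function and its return labels are illustrative, not a real API):

```python
def recommend_approach(prompt_optimized: bool, volatile_knowledge: bool,
                       needs_citations: bool, requests_per_day: int,
                       broad_knowledge: bool, narrow_consistent: bool) -> str:
    """Walk the six decision questions in order; first match wins."""
    if not prompt_optimized:
        return "optimize prompt first"
    if volatile_knowledge:
        return "RAG"
    if needs_citations:
        return "RAG (retrieval component required)"
    if requests_per_day < 50:
        return "frontier API"
    if broad_knowledge:
        return "frontier API"
    if narrow_consistent:
        return "fine-tune"
    return "frontier API"
```

Note that the order matters: a narrow, high-volume task with volatile knowledge still routes to RAG before fine-tuning is even considered.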

    When to Combine Approaches

    The five cases above describe when fine-tuning alone is the wrong answer. But fine-tuning combined with other approaches is often the best answer:

    RAG + Fine-Tuned Model. Use RAG for knowledge retrieval and a fine-tuned model for response generation. The model learns your specific response style, formatting, and tone through fine-tuning. It gets current, citable knowledge from the retrieval system. This is the architecture most agencies should be evaluating for production systems.

    A fine-tuned Llama 8B model with RAG often outperforms a frontier model with RAG on domain-specific tasks — because the fine-tuned model better understands how to use the retrieved context for your particular domain.

    Fine-Tuning + Prompt Engineering. Even fine-tuned models benefit from system prompts. A fine-tuned model with a short (200-token) system prompt providing session-specific context (user preferences, current date, task variant) outperforms the same model with no prompt. Fine-tuning handles the behavioral foundation; the prompt handles per-request customization.

    Fine-Tuning + API Fallback. For tasks that are 90% narrow and 10% broad, fine-tune for the 90% and route the remaining 10% to a frontier API. Use a classifier (which can itself be a small fine-tuned model) to determine which path each request should take. This gives you the cost efficiency of fine-tuning for the common case and the capability of a frontier model for edge cases.
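A router like this can be very simple. The sketch below uses keyword matching as a stand-in for the small classifier (keywords and labels are hypothetical; in production the classifier would itself be a lightweight model):

```python
import re

# Hypothetical in-domain vocabulary standing in for a trained classifier.
IN_DOMAIN = {"invoice", "refund", "shipping", "order", "return"}

def route(request: str) -> str:
    """Send in-domain requests to the fine-tuned model, everything else
    to the frontier API fallback."""
    words = set(re.findall(r"[a-z]+", request.lower()))
    return "fine-tuned-model" if words & IN_DOMAIN else "frontier-api"
```

The common case stays cheap, and the rare out-of-domain request still gets a capable answer instead of a confused one.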

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    The Honest Assessment

    Fine-tuning is a powerful technique. It produces faster, cheaper, more consistent models for narrow tasks. It enables privacy-preserving deployments that keep data off third-party servers. It creates genuine technical moats for agencies.

    But it is not always the right tool. The teams that succeed with AI are the ones who match the technique to the problem — not the ones who apply their favorite technique to every problem.

    If your situation matches one of the five cases above, save yourself the time and use the simpler approach. If you have worked through all five and your task genuinely calls for fine-tuning, we built Ertas to make that process as straightforward as possible.

