
    Model Routing in Production: When to Use Fine-Tuned vs API vs RAG

    Fine-tuning, RAG, and cloud APIs each solve different problems. Here's a practical routing framework for choosing the right approach per request — and how to combine all three in one system.

    Ertas Team

    Most production AI systems should not use a single approach for every request. Fine-tuning, RAG, and cloud APIs each have different cost profiles, quality characteristics, and ideal use cases. The teams that build profitable AI features use all three — routing each request to the approach that handles it best.

    This is not about picking one winner. It is about building a routing system that matches each request to the right tool for that specific job.

    The Three Approaches, Plainly Stated

    Fine-tuned models: A small language model (7B-14B parameters) trained on your specific task data. The model has internalized the patterns, formats, and quality criteria for your use case. It runs locally or on your infrastructure. Cost: fixed infrastructure, near-zero per request. Best for: high-volume, well-defined, repetitive tasks.

    RAG (Retrieval-Augmented Generation): A model augmented with a retrieval system that pulls relevant documents or data before generating a response. The model does not need to memorize everything — it looks up what it needs. Cost: moderate (embedding costs + inference costs). Best for: tasks requiring access to large, changing knowledge bases.

    Cloud API (frontier models): GPT-4o, Claude Opus, Gemini Pro. The largest, most capable models available, accessed via API. Cost: per-token pricing, highest cost per request. Best for: complex reasoning, creative tasks, and anything where raw model capability matters most.

    Each one has clear strengths and clear limitations. Let's map them.

    The Decision Matrix

    Here is the framework for routing decisions:

    | Request characteristic | Fine-tuned | RAG | Cloud API |
    | --- | --- | --- | --- |
    | Task is well-defined and repetitive | Best | Adequate | Overkill |
    | Requires access to specific documents | Inadequate | Best | Adequate (with context) |
    | Needs broad world knowledge | Limited | Limited | Best |
    | Complex multi-step reasoning | Limited (7B) | Moderate | Best |
    | Cost sensitivity is high | Best | Moderate | Worst |
    | Output format must be precise | Best (trained on format) | Moderate | Moderate (prompt-dependent) |
    | Knowledge changes frequently | Worst (requires retraining) | Best | Best |
    | Data is private/sensitive | Best (local) | Good (local vector DB) | Worst (data leaves your infra) |
    | Latency requirement < 200ms | Best (local inference) | Moderate (retrieval + inference) | Worst (network latency) |
    | Volume > 100K requests/month | Best (fixed cost) | Moderate | Worst (linear cost) |

    This matrix gives you the starting point. But production systems need more granular routing than a static table.

    How Fine-Tuned, RAG, and API Complement Each Other

    The three approaches are not competitors. They solve different parts of the same problem.

    Example: A customer support SaaS product

    This product has an AI assistant that helps support agents respond to tickets. Different request types need different approaches:

    Tier 1 — Fine-tuned model (60% of requests):

    • Classifying ticket priority and category
    • Generating templated acknowledgment responses
    • Extracting customer details from ticket text
    • Suggesting response tone based on sentiment
    • Formatting responses to match brand voice

    These are high-volume, well-defined tasks. A fine-tuned 7B model handles them with 95%+ accuracy because the patterns are consistent. Cost: effectively zero per request.

    Tier 2 — RAG (25% of requests):

    • Answering questions about product features using the knowledge base
    • Finding relevant help articles to link in responses
    • Looking up customer-specific configuration or account details
    • Referencing recent policy changes or product updates

    These tasks require access to information that changes — your knowledge base updates weekly, customer accounts change daily. Fine-tuning cannot keep up with this velocity of change. RAG retrieves the current information and generates responses grounded in it.

    Tier 3 — Cloud API (15% of requests):

    • Handling escalated, complex customer complaints that need nuanced reasoning
    • Drafting responses for unusual edge cases the model has not seen
    • Analyzing long conversation threads (10+ messages) to summarize context
    • Generating creative solutions to non-standard problems

    These requests are infrequent but require the depth of reasoning that a frontier model provides. They are also the requests where quality matters most — an escalated customer complaint is not the place to save AU$0.03 on inference.
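    One way to make this split operational is a static task-to-tier map that the router consults first. A minimal sketch, with hypothetical task labels standing in for this product's request types:

    TASK_TIERS = {
        # Tier 1: high-volume, well-defined tasks for the fine-tuned model
        "classify_ticket": "fine_tuned",
        "draft_acknowledgment": "fine_tuned",
        "extract_customer_details": "fine_tuned",
        # Tier 2: knowledge-dependent tasks for the RAG pipeline
        "answer_product_question": "rag",
        "find_help_articles": "rag",
        # Tier 3: rare, high-stakes tasks for the cloud API
        "handle_escalation": "cloud_api",
        "summarize_long_thread": "cloud_api",
    }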

    Implementing the Router

    The routing layer does not need to be complex. Here are three implementation patterns, from simple to sophisticated:

    Pattern 1: Endpoint-based routing

    Map your API endpoints directly to approaches:

    /api/classify     → fine-tuned model
    /api/search-kb    → RAG pipeline
    /api/draft-reply  → fine-tuned model (standard) or cloud API (escalated)
    /api/analyze      → cloud API
    

    This works when your task types are clearly separated by endpoint. Most SaaS products can categorize their AI features cleanly this way.
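    In code, this can be a plain dictionary lookup with one attribute check for the split endpoint. A minimal sketch; the backend names and the escalated flag are hypothetical:

    ROUTES = {
        "/api/classify": "fine_tuned",
        "/api/search-kb": "rag",
        "/api/analyze": "cloud_api",
    }

    def choose_backend(path: str, escalated: bool = False) -> str:
        # /api/draft-reply routes on a request attribute, not the path alone
        if path == "/api/draft-reply":
            return "cloud_api" if escalated else "fine_tuned"
        return ROUTES.get(path, "fine_tuned")  # default to the cheapest backend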

    Pattern 2: Attribute-based routing

    Route based on request attributes:

    SIMPLE_TASKS = {"classify", "extract", "acknowledge"}   # example task labels
    COMPLEX_TASKS = {"escalation", "edge_case_analysis"}    # example task labels

    def route(request):
        if request.token_count < 500 and request.task_type in SIMPLE_TASKS:
            return route_to_fine_tuned()
        elif request.requires_knowledge_lookup:
            return route_to_rag()
        elif request.token_count > 2000 or request.task_type in COMPLEX_TASKS:
            return route_to_cloud_api()
        else:
            return route_to_fine_tuned()  # default to cheapest option
    

    The default is always the fine-tuned model — cheapest and fastest. Requests only escalate to more expensive approaches when specific conditions are met.

    Pattern 3: Cascade routing

    Try the cheapest approach first. If it fails quality checks, escalate:

    1. Run request through fine-tuned model
    2. Check output confidence / quality score
    3. If confidence > threshold: return response
    4. If knowledge lookup needed: run through RAG pipeline, return response
    5. If still low confidence: run through cloud API, return response
    

    This maximizes cost savings by ensuring every request tries the cheapest path first. The tradeoff is latency — a request that cascades through all three tiers takes 3-5x longer than one that resolves at tier 1. Use this for asynchronous workloads where latency is less critical.
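    A minimal sketch of the cascade, assuming each backend is a callable that returns a draft with a confidence score (the threshold value here is illustrative):

    from dataclasses import dataclass

    @dataclass
    class Draft:
        text: str
        confidence: float  # assumed: each backend reports a quality score

    CONFIDENCE_THRESHOLD = 0.85  # illustrative; tune against your quality data

    def cascade(request, fine_tuned, rag, cloud_api):
        draft = fine_tuned(request)                  # tier 1: cheapest path
        if draft.confidence >= CONFIDENCE_THRESHOLD:
            return draft
        if getattr(request, "requires_knowledge_lookup", False):
            draft = rag(request)                     # tier 2: grounded in retrieval
            if draft.confidence >= CONFIDENCE_THRESHOLD:
                return draft
        return cloud_api(request)                    # tier 3: frontier fallback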

    Cost Analysis: Routing vs Single Approach

    Let's model a SaaS product processing 300,000 AI requests per month:

    Single approach: Cloud API for everything

    | Item | Cost |
    | --- | --- |
    | 300,000 requests × AU$0.03 avg | AU$9,000/month |

    Single approach: RAG for everything

    | Item | Cost |
    | --- | --- |
    | Embedding costs (300K requests) | AU$150/month |
    | Vector DB hosting | AU$200/month |
    | Inference (cloud API with retrieved context) | AU$7,200/month |
    | Total | AU$7,550/month |

    RAG trims costs slightly because the retrieved context keeps prompts shorter, but you still pay per token for the generation step.

    Routed: 60% fine-tuned, 25% RAG, 15% cloud API

    | Tier | Requests | Cost |
    | --- | --- | --- |
    | Fine-tuned (60%) | 180,000 | AU$800/month (infrastructure) |
    | RAG (25%) | 75,000 | AU$2,100/month |
    | Cloud API (15%) | 45,000 | AU$1,350/month |
    | Total | 300,000 | AU$4,250/month |

    The routed approach saves AU$4,750/month (53%) versus cloud-API-only. Over 12 months, that is AU$57,000 in savings.

    But the real benefit is the scaling curve. At 600,000 requests/month:

    • Cloud API only: AU$18,000/month
    • Routed: roughly AU$7,700/month (fine-tuned infrastructure stays flat; only the RAG and API portions scale linearly)
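    A back-of-the-envelope model makes the scaling curve concrete. The per-request rates below are derived from the routed table above:

    def routed_cost(requests: int) -> float:
        """Monthly cost in AU$ for the 60/25/15 routing split."""
        fine_tuned_infra = 800                       # fixed, volume-independent
        rag = 0.25 * requests * (2_100 / 75_000)     # ~AU$0.028 per RAG request
        api = 0.15 * requests * 0.03                 # AU$0.03 per API request
        return fine_tuned_infra + rag + api

    print(routed_cost(300_000))  # 4250.0
    print(routed_cost(600_000))  # 7700.0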

    When RAG Beats Fine-Tuning

    RAG is the right choice when:

    Your knowledge base changes faster than you can retrain. Fine-tuning captures static knowledge. If your product documentation, pricing, policies, or customer data changes weekly, the fine-tuned model falls behind immediately. RAG always retrieves current information.

    Your corpus is too large to train into a model. A fine-tuned 7B model can absorb maybe 5,000-10,000 document patterns effectively. If you need to reference 100,000 support articles, product specs, or legal documents, RAG with a vector database is the only practical approach.

    Users ask about specific documents. "What does section 3.2 of our contract say?" is a retrieval problem, not a generation problem. Fine-tuning cannot memorize specific documents reliably. RAG retrieves the exact section and generates a response grounded in it.

    You need citations and traceability. RAG naturally provides source documents for its answers. Fine-tuned models generate from learned patterns — they cannot point to the specific document that informed their response.

    When Fine-Tuning Beats RAG

    Fine-tuning wins when:

    The task is about how to respond, not what to respond with. Classification, formatting, tone matching, entity extraction — these are behavioral tasks. The model needs to learn a pattern, not look up information. RAG adds unnecessary retrieval overhead for these tasks.

    Latency matters. A fine-tuned model on local hardware responds in 50-200ms. A RAG pipeline adds 100-500ms for retrieval before generation even starts. For real-time features — autocomplete, inline suggestions, live classification — the retrieval step is too slow.

    Volume is high and tasks are repetitive. If you process 500,000 classification requests per month, each with the same pattern, a fine-tuned model handles them at fixed cost. RAG would add 500,000 embedding lookups and 500,000 vector searches — unnecessary overhead for a task that does not need dynamic knowledge.

    You want zero per-request cost. A fine-tuned model on owned hardware has zero marginal cost per request. RAG, even with a local model for generation, still has per-request embedding and retrieval costs (though these are small).

    When You Genuinely Need the Cloud API

    Be honest about when you need frontier capabilities:

    Long-context reasoning. Processing a 50-page document to answer a specific question about the relationship between sections 12 and 47 is a task where a 200B+ parameter model with 128K context window genuinely outperforms a 7B model.

    Zero-shot novel tasks. When your AI encounters a request type it was not trained on, a frontier model's broad capabilities handle it better than a fine-tuned specialist that has never seen the pattern.

    Agentic workflows with tool use. Complex multi-step workflows — research an answer across multiple sources, reason about the findings, formulate a plan, execute it — still benefit from frontier model depth. This will change as smaller models improve at tool calling, but today the gap is real.

    Quality-critical, low-volume requests. If a request type is rare (< 100/month) and high-stakes (customer-facing, legal, financial), the AU$0.05-0.10 per request cost of a frontier model is worth the quality insurance.

    Building the Unified Routing Layer

    Your routing layer should present a single interface to your application:

    Application → Routing Layer → [Fine-tuned | RAG | Cloud API] → Response
    

    The application does not know or care which backend handled the request. It sends a request with metadata (task type, priority, user tier) and receives a response in a consistent format.

    Implementation requirements:

    • Consistent response format across all three backends (normalize during routing)
    • Per-request logging of which backend was used, latency, token count, and cost
    • Fallback logic — if the primary backend fails or times out, route to the next tier
    • A/B testing support — ability to route a percentage of traffic to a different backend for comparison
    • Cost dashboards — aggregated cost per backend, per task type, per user tier
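    A minimal sketch of such a layer, assuming hypothetical backend clients that each expose a normalized generate() method; the fallback and logging requirements above map directly onto a few lines:

    import logging
    import time

    logger = logging.getLogger("router")

    class Router:
        TIER_ORDER = ["fine_tuned", "rag", "cloud_api"]  # cheapest first

        def __init__(self, backends):
            self.backends = backends  # tier name -> client with .generate()

        def handle(self, request, tier):
            # Start at the chosen tier; fall through to pricier tiers on failure.
            for name in self.TIER_ORDER[self.TIER_ORDER.index(tier):]:
                started = time.monotonic()
                try:
                    response = self.backends[name].generate(request)
                    # Per-request log of backend and latency feeds the dashboards.
                    logger.info("backend=%s latency_ms=%.0f", name,
                                1000 * (time.monotonic() - started))
                    return response
                except Exception:
                    logger.warning("backend=%s failed, escalating", name)
            raise RuntimeError("all backends failed for request")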

    This logging data is what drives your routing optimization. After 30 days of production data, you will see exactly which task types are being over-served (expensive backend, high quality not needed) and which are being under-served (cheap backend, quality issues).

    The Iteration Loop

    Routing is not a set-and-forget configuration. It improves over time:

    1. Month 1: Route conservatively — more traffic to cloud API than necessary. Collect quality data.
    2. Month 2: Analyze quality data. Move task types that fine-tuned models handle well from cloud API to local.
    3. Month 3: Add new task types to fine-tuning training data. Retrain. Expand local routing.
    4. Month 6: Most stable, high-volume tasks on fine-tuned models. RAG for knowledge-dependent tasks. Cloud API for edge cases and new features.
    5. Ongoing: Each retraining cycle moves more traffic to the cheapest tier.

    The end state for a mature SaaS product is typically 60-80% fine-tuned, 15-25% RAG, 5-15% cloud API. Getting there takes 3-6 months of iteration, but each month reduces costs and improves routing accuracy.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
