
    Model Routing in Production: When to Use Fine-Tuned vs API vs RAG

    Fine-tuning, RAG, and cloud APIs each solve different problems. Here's a practical routing framework for choosing the right approach per request — and how to combine all three in one system.

    Ertas Team

    Most production AI systems should not use a single approach for every request. Fine-tuning, RAG, and cloud APIs each have different cost profiles, quality characteristics, and ideal use cases. The teams that build profitable AI features use all three — routing each request to the approach that handles it best.

    This is not about picking one winner. It is about building a routing system that matches each request to the right tool for that specific job.

    The Three Approaches, Plainly Stated

    Fine-tuned models: A small language model (7B-14B parameters) trained on your specific task data. The model has internalized the patterns, formats, and quality criteria for your use case. It runs locally or on your infrastructure. Cost: fixed infrastructure, near-zero per request. Best for: high-volume, well-defined, repetitive tasks.

    RAG (Retrieval-Augmented Generation): A model augmented with a retrieval system that pulls relevant documents or data before generating a response. The model does not need to memorize everything — it looks up what it needs. Cost: moderate (embedding costs + inference costs). Best for: tasks requiring access to large, changing knowledge bases.

    Cloud API (frontier models): GPT-4o, Claude Opus, Gemini Pro. The largest, most capable models available, accessed via API. Cost: per-token pricing, highest cost per request. Best for: complex reasoning, creative tasks, and anything where raw model capability matters most.

    Each one has clear strengths and clear limitations. Let's map them.

    The Decision Matrix

    Here is the framework for routing decisions:

    | Request characteristic | Fine-tuned | RAG | Cloud API |
    | --- | --- | --- | --- |
    | Task is well-defined and repetitive | Best | Adequate | Overkill |
    | Requires access to specific documents | Inadequate | Best | Adequate (with context) |
    | Needs broad world knowledge | Limited | Limited | Best |
    | Complex multi-step reasoning | Limited (7B) | Moderate | Best |
    | Cost sensitivity is high | Best | Moderate | Worst |
    | Output format must be precise | Best (trained on format) | Moderate | Moderate (prompt-dependent) |
    | Knowledge changes frequently | Worst (requires retraining) | Best | Best |
    | Data is private/sensitive | Best (local) | Good (local vector DB) | Worst (data leaves your infra) |
    | Latency requirement < 200ms | Best (local inference) | Moderate (retrieval + inference) | Worst (network latency) |
    | Volume > 100K requests/month | Best (fixed cost) | Moderate | Worst (linear cost) |

    This matrix gives you the starting point. But production systems need more granular routing than a static table.

    How Fine-Tuned, RAG, and API Complement Each Other

    The three approaches are not competitors. They solve different parts of the same problem.

    Example: A customer support SaaS product

    This product has an AI assistant that helps support agents respond to tickets. Different request types need different approaches:

    Tier 1 — Fine-tuned model (60% of requests):

    • Classifying ticket priority and category
    • Generating templated acknowledgment responses
    • Extracting customer details from ticket text
    • Suggesting response tone based on sentiment
    • Formatting responses to match brand voice

    These are high-volume, well-defined tasks. A fine-tuned 7B model handles them with 95%+ accuracy because the patterns are consistent. Cost: effectively zero per request.

    Tier 2 — RAG (25% of requests):

    • Answering questions about product features using the knowledge base
    • Finding relevant help articles to link in responses
    • Looking up customer-specific configuration or account details
    • Referencing recent policy changes or product updates

    These tasks require access to information that changes — your knowledge base updates weekly, customer accounts change daily. Fine-tuning cannot keep up with this velocity of change. RAG retrieves the current information and generates responses grounded in it.

    Tier 3 — Cloud API (15% of requests):

    • Handling escalated, complex customer complaints that need nuanced reasoning
    • Drafting responses for unusual edge cases the model has not seen
    • Analyzing long conversation threads (10+ messages) to summarize context
    • Generating creative solutions to non-standard problems

    These requests are infrequent but require the depth of reasoning that a frontier model provides. They are also the requests where quality matters most — an escalated customer complaint is not the place to save AU$0.03 on inference.
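    One way to make this split operational is a static task-to-tier map that the router consults first. A minimal sketch, with hypothetical task labels standing in for this product's request types:

    TASK_TIERS = {
        # Tier 1: high-volume, well-defined tasks for the fine-tuned model
        "classify_ticket": "fine_tuned",
        "draft_acknowledgment": "fine_tuned",
        "extract_customer_details": "fine_tuned",
        # Tier 2: knowledge-dependent tasks for the RAG pipeline
        "answer_product_question": "rag",
        "find_help_articles": "rag",
        # Tier 3: rare, high-stakes tasks for the cloud API
        "handle_escalation": "cloud_api",
        "summarize_long_thread": "cloud_api",
    }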

    Implementing the Router

    The routing layer does not need to be complex. Here are three implementation patterns, from simple to sophisticated:

    Pattern 1: Endpoint-based routing

    Map your API endpoints directly to approaches:

    /api/classify     → fine-tuned model
    /api/search-kb    → RAG pipeline
    /api/draft-reply  → fine-tuned model (standard) or cloud API (escalated)
    /api/analyze      → cloud API
    

    This works when your task types are clearly separated by endpoint. Most SaaS products can categorize their AI features cleanly this way.
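    In code, this can be a plain dictionary lookup with one attribute check for the split endpoint. A minimal sketch; the backend names and the escalated flag are hypothetical:

    ROUTES = {
        "/api/classify": "fine_tuned",
        "/api/search-kb": "rag",
        "/api/analyze": "cloud_api",
    }

    def choose_backend(path: str, escalated: bool = False) -> str:
        # /api/draft-reply routes on a request attribute, not the path alone
        if path == "/api/draft-reply":
            return "cloud_api" if escalated else "fine_tuned"
        return ROUTES.get(path, "fine_tuned")  # default to the cheapest backend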

    Pattern 2: Attribute-based routing

    Route based on request attributes:

    SIMPLE_TASKS = {"classify", "extract", "acknowledge"}   # example task labels
    COMPLEX_TASKS = {"escalation", "edge_case_analysis"}    # example task labels

    def route(request):
        if request.token_count < 500 and request.task_type in SIMPLE_TASKS:
            return route_to_fine_tuned()
        elif request.requires_knowledge_lookup:
            return route_to_rag()
        elif request.token_count > 2000 or request.task_type in COMPLEX_TASKS:
            return route_to_cloud_api()
        else:
            return route_to_fine_tuned()  # default to cheapest option
    

    The default is always the fine-tuned model — cheapest and fastest. Requests only escalate to more expensive approaches when specific conditions are met.

    Pattern 3: Cascade routing

    Try the cheapest approach first. If it fails quality checks, escalate:

    1. Run request through fine-tuned model
    2. Check output confidence / quality score
    3. If confidence > threshold: return response
    4. If knowledge lookup needed: run through RAG pipeline, return response
    5. If still low confidence: run through cloud API, return response
    

    This maximizes cost savings by ensuring every request tries the cheapest path first. The tradeoff is latency — a request that cascades through all three tiers takes 3-5x longer than one that resolves at tier 1. Use this for asynchronous workloads where latency is less critical.
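    A minimal sketch of the cascade, assuming each backend is a callable that returns a draft with a confidence score (the threshold value here is illustrative):

    from dataclasses import dataclass

    @dataclass
    class Draft:
        text: str
        confidence: float  # assumed: each backend reports a quality score

    CONFIDENCE_THRESHOLD = 0.85  # illustrative; tune against your quality data

    def cascade(request, fine_tuned, rag, cloud_api):
        draft = fine_tuned(request)                  # tier 1: cheapest path
        if draft.confidence >= CONFIDENCE_THRESHOLD:
            return draft
        if getattr(request, "requires_knowledge_lookup", False):
            draft = rag(request)                     # tier 2: grounded in retrieval
            if draft.confidence >= CONFIDENCE_THRESHOLD:
                return draft
        return cloud_api(request)                    # tier 3: frontier fallback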

    Cost Analysis: Routing vs Single Approach

    Let's model a SaaS product processing 300,000 AI requests per month:

    Single approach: Cloud API for everything

    | Item | Cost |
    | --- | --- |
    | 300,000 requests × AU$0.03 avg | AU$9,000/month |

    Single approach: RAG for everything

    | Item | Cost |
    | --- | --- |
    | Embedding costs (300K requests) | AU$150/month |
    | Vector DB hosting | AU$200/month |
    | Inference (cloud API with retrieved context) | AU$7,200/month |
    | Total | AU$7,550/month |

    RAG trims costs slightly because the retrieved context keeps prompts shorter, but you still pay per token for the generation step.

    Routed: 60% fine-tuned, 25% RAG, 15% cloud API

    | Tier | Requests | Cost |
    | --- | --- | --- |
    | Fine-tuned (60%) | 180,000 | AU$800/month (infrastructure) |
    | RAG (25%) | 75,000 | AU$2,100/month |
    | Cloud API (15%) | 45,000 | AU$1,350/month |
    | Total | 300,000 | AU$4,250/month |

    The routed approach saves AU$4,750/month (53%) versus cloud-API-only. Over 12 months, that is AU$57,000 in savings.

    But the real benefit is the scaling curve. At 600,000 requests/month:

    • Cloud API only: AU$18,000/month
    • Routed: roughly AU$7,700/month (fine-tuned infrastructure stays flat; only the RAG and API portions scale linearly)
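    A back-of-the-envelope model makes the scaling curve concrete. The per-request rates below are derived from the routed table above:

    def routed_cost(requests: int) -> float:
        """Monthly cost in AU$ for the 60/25/15 routing split."""
        fine_tuned_infra = 800                       # fixed, volume-independent
        rag = 0.25 * requests * (2_100 / 75_000)     # ~AU$0.028 per RAG request
        api = 0.15 * requests * 0.03                 # AU$0.03 per API request
        return fine_tuned_infra + rag + api

    print(routed_cost(300_000))  # 4250.0
    print(routed_cost(600_000))  # 7700.0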

    When RAG Beats Fine-Tuning

    RAG is the right choice when:

    Your knowledge base changes faster than you can retrain. Fine-tuning captures static knowledge. If your product documentation, pricing, policies, or customer data changes weekly, the fine-tuned model falls behind immediately. RAG always retrieves current information.

    Your corpus is too large to train into a model. A fine-tuned 7B model can absorb maybe 5,000-10,000 document patterns effectively. If you need to reference 100,000 support articles, product specs, or legal documents, RAG with a vector database is the only practical approach.

    Users ask about specific documents. "What does section 3.2 of our contract say?" is a retrieval problem, not a generation problem. Fine-tuning cannot memorize specific documents reliably. RAG retrieves the exact section and generates a response grounded in it.

    You need citations and traceability. RAG naturally provides source documents for its answers. Fine-tuned models generate from learned patterns — they cannot point to the specific document that informed their response.

    When Fine-Tuning Beats RAG

    Fine-tuning wins when:

    The task is about how to respond, not what to respond with. Classification, formatting, tone matching, entity extraction — these are behavioral tasks. The model needs to learn a pattern, not look up information. RAG adds unnecessary retrieval overhead for these tasks.

    Latency matters. A fine-tuned model on local hardware responds in 50-200ms. A RAG pipeline adds 100-500ms for retrieval before generation even starts. For real-time features — autocomplete, inline suggestions, live classification — the retrieval step is too slow.

    Volume is high and tasks are repetitive. If you process 500,000 classification requests per month, each with the same pattern, a fine-tuned model handles them at fixed cost. RAG would add 500,000 embedding lookups and 500,000 vector searches — unnecessary overhead for a task that does not need dynamic knowledge.

    You want zero per-request cost. A fine-tuned model on owned hardware has zero marginal cost per request. RAG, even with a local model for generation, still has per-request embedding and retrieval costs (though these are small).

    When You Genuinely Need the Cloud API

    Be honest about when you need frontier capabilities:

    Long-context reasoning. Processing a 50-page document to answer a specific question about the relationship between sections 12 and 47 is a task where a 200B+ parameter model with 128K context window genuinely outperforms a 7B model.

    Zero-shot novel tasks. When your AI encounters a request type it was not trained on, a frontier model's broad capabilities handle it better than a fine-tuned specialist that has never seen the pattern.

    Agentic workflows with tool use. Complex multi-step workflows — research an answer across multiple sources, reason about the findings, formulate a plan, execute it — still benefit from frontier model depth. This will change as smaller models improve at tool calling, but today the gap is real.

    Quality-critical, low-volume requests. If a request type is rare (< 100/month) and high-stakes (customer-facing, legal, financial), the AU$0.05-0.10 per request cost of a frontier model is worth the quality insurance.

    Building the Unified Routing Layer

    Your routing layer should present a single interface to your application:

    Application → Routing Layer → [Fine-tuned | RAG | Cloud API] → Response
    

    The application does not know or care which backend handled the request. It sends a request with metadata (task type, priority, user tier) and receives a response in a consistent format.

    Implementation requirements:

    • Consistent response format across all three backends (normalize during routing)
    • Per-request logging of which backend was used, latency, token count, and cost
    • Fallback logic — if the primary backend fails or times out, route to the next tier
    • A/B testing support — ability to route a percentage of traffic to a different backend for comparison
    • Cost dashboards — aggregated cost per backend, per task type, per user tier
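    A minimal sketch of such a layer, assuming hypothetical backend clients that each expose a normalized generate() method; the fallback and logging requirements above map directly onto a few lines:

    import logging
    import time

    logger = logging.getLogger("router")

    class Router:
        TIER_ORDER = ["fine_tuned", "rag", "cloud_api"]  # cheapest first

        def __init__(self, backends):
            self.backends = backends  # tier name -> client with .generate()

        def handle(self, request, tier):
            # Start at the chosen tier; fall through to pricier tiers on failure.
            for name in self.TIER_ORDER[self.TIER_ORDER.index(tier):]:
                started = time.monotonic()
                try:
                    response = self.backends[name].generate(request)
                    # Per-request log of backend and latency feeds the dashboards.
                    logger.info("backend=%s latency_ms=%.0f", name,
                                1000 * (time.monotonic() - started))
                    return response
                except Exception:
                    logger.warning("backend=%s failed, escalating", name)
            raise RuntimeError("all backends failed for request")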

    This logging data is what drives your routing optimization. After 30 days of production data, you will see exactly which task types are being over-served (expensive backend, high quality not needed) and which are being under-served (cheap backend, quality issues).

    The Iteration Loop

    Routing is not a set-and-forget configuration. It improves over time:

    1. Month 1: Route conservatively — more traffic to cloud API than necessary. Collect quality data.
    2. Month 2: Analyze quality data. Move task types that fine-tuned models handle well from cloud API to local.
    3. Month 3: Add new task types to fine-tuning training data. Retrain. Expand local routing.
    4. Month 6: Most stable, high-volume tasks on fine-tuned models. RAG for knowledge-dependent tasks. Cloud API for edge cases and new features.
    5. Ongoing: Each retraining cycle moves more traffic to the cheapest tier.

    The end state for a mature SaaS product is typically 60-80% fine-tuned, 15-25% RAG, 5-15% cloud API. Getting there takes 3-6 months of iteration, but each month reduces costs and improves routing accuracy.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
