    SLM-First Architecture: The 80/20 Routing Strategy That Cuts AI Costs 75%


    Most AI features don't need GPT-4. An SLM-first architecture routes 80% of requests to fine-tuned local models and 20% to cloud APIs — cutting costs by 60-75% while maintaining quality.

    Ertas Team

    Most production AI workloads are simple. Classification, extraction, formatting, summarization of short documents, template-based generation. These tasks consume 80% or more of your inference budget, and they do not require a 200B+ parameter frontier model.

    The SLM-first architecture inverts the default assumption. Instead of routing everything to a cloud API and optimizing later, you start with a fine-tuned small language model (7B-14B parameters) as the default path and only escalate to a cloud API when the request genuinely needs it.

    The result: 60-75% cost reduction with no measurable quality loss on the tasks that matter.

    What SLM-First Actually Means

    In a traditional AI architecture, the request flow looks like this:

    User Request → Cloud API (GPT-4o / Claude) → Response
    

    Every request, regardless of complexity, goes to the most expensive option. This is the default because it is the simplest to build. One endpoint, one model, one integration.

    SLM-first flips the default:

    User Request → Router → [80%] Fine-Tuned SLM (7B-14B, local) → Response
                          → [20%] Cloud API (GPT-4o / Claude)    → Response
    

    The router examines each request and makes a decision: can a fine-tuned small model handle this adequately, or does it genuinely need frontier-level reasoning? The answer, for most SaaS workloads, is that the small model handles it just fine.

    The Cost Math

    Let's run the numbers on a SaaS product processing 500,000 AI requests per month. We will use representative pricing as of early 2026.

    Scenario A: Everything goes to GPT-4o

    Metric                    Value
    Monthly requests          500,000
    Avg tokens per request    1,200 (input + output)
    GPT-4o blended cost       ~AU$0.025 per request
    Monthly cost              AU$12,500

    Scenario B: 80/20 routing with fine-tuned 7B model

    Tier               Requests    Cost per request       Monthly cost
    Local SLM (80%)    400,000     ~AU$0 (fixed infra)    AU$1,200 (server)
    Cloud API (20%)    100,000     AU$0.025               AU$2,500
    Total              500,000                            AU$3,700

    That is a 70% cost reduction. At 1 million requests per month, the savings approach 75% because the local infrastructure cost stays nearly flat while the API-only cost doubles.

    The local infrastructure cost of AU$1,200/month covers a GPU instance that can serve a quantized 7B model at tens of requests per second. At 400,000 requests per month (roughly 9 requests per minute on average), this is well within capacity.
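    The arithmetic above can be reproduced with a short script. The figures are the representative early-2026 assumptions from the tables, not live prices:

```python
# Representative figures from the scenarios above (AU$).
CLOUD_COST_PER_REQUEST = 0.025   # GPT-4o blended, ~1,200 tokens per request
LOCAL_INFRA_COST = 1_200         # fixed monthly cost of the GPU server

def monthly_cost(requests: int, local_share: float) -> float:
    """Total monthly cost when `local_share` of traffic goes to the SLM."""
    cloud_spend = requests * (1 - local_share) * CLOUD_COST_PER_REQUEST
    infra = LOCAL_INFRA_COST if local_share > 0 else 0
    return infra + cloud_spend

for requests in (500_000, 1_000_000):
    api_only = monthly_cost(requests, 0.0)
    split = monthly_cost(requests, 0.8)
    saved = 1 - split / api_only
    print(f"{requests:>9,} req/mo: AU${split:,.0f} vs AU${api_only:,.0f} ({saved:.0%} saved)")
```

    At 500,000 requests this reproduces the AU$3,700 total above; at 1 million it shows why the savings approach 75% while the infrastructure line stays flat.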

    Which Requests Go Where

    The routing decision is not complicated. It follows a pattern that maps cleanly to request types.

    Route to the local SLM (80% of traffic):

    • Text classification and categorization
    • Named entity extraction
    • Sentiment analysis
    • Template-based content generation (emails, summaries, descriptions)
    • Data formatting and transformation (JSON structuring, CSV parsing)
    • FAQ and knowledge base responses
    • Short-form summarization (under 500 words of source material)
    • Intent detection and routing

    These tasks share common traits: well-defined outputs, limited reasoning depth, consistent patterns. A fine-tuned 7B model trained on 2,000-5,000 examples of your specific task will match or beat GPT-4o on these, because it has learned your exact format, terminology, and quality criteria.

    Route to the cloud API (20% of traffic):

    • Multi-step reasoning across complex inputs
    • Creative writing where novelty and style matter
    • Long-document analysis (10,000+ tokens of source material)
    • Tasks requiring broad, up-to-date world knowledge
    • Edge cases the local model has not been trained on
    • First-time task types you have not fine-tuned for yet

    Implementing the Router

    The router itself can be simple. You do not need a separate ML model to make routing decisions. Three practical approaches, in order of complexity:

    1. Rule-based routing (start here)

    Map your API endpoints or task types to tiers directly in code:

    if task_type in ["classify", "extract", "format", "summarize_short"]:
        route_to_local_slm()
    elif task_type in ["reason", "create", "analyze_long"]:
        route_to_cloud_api()
    else:
        route_to_cloud_api()  # unknown task types default to the safe tier
    

    This works well when your task types are well-defined and stable. Most SaaS products have 5-15 distinct AI task types, and you can categorize each one manually.

    2. Confidence-based routing

    Run the request through the local SLM first. If the model's output confidence (measured by token probabilities or a separate quality classifier) exceeds a threshold, use it. If not, fall back to the cloud API.

    This captures more requests locally over time as you improve your fine-tuned model, and it automatically routes genuinely difficult requests to the frontier model.
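    One minimal way to score that confidence is the mean per-token log-probability of the SLM's output, which OpenAI-compatible servers such as llama.cpp and vLLM can return via the `logprobs` option. The threshold below is a hypothetical starting point; calibrate it on a held-out set:

```python
# Hypothetical threshold: tune by comparing accepted local outputs
# against cloud outputs on a held-out sample.
CONFIDENCE_THRESHOLD = -0.35  # mean log-probability per generated token

def mean_logprob(token_logprobs: list[float]) -> float:
    """Average per-token log-probability of the generated response."""
    return sum(token_logprobs) / len(token_logprobs)

def accept_local(token_logprobs: list[float]) -> bool:
    """Keep the local SLM's answer if it was, on average, confident;
    otherwise the caller falls back to the cloud API."""
    return mean_logprob(token_logprobs) >= CONFIDENCE_THRESHOLD
```

    A completion whose tokens all sit near probability 1.0 (log-probs near 0) is accepted; a hesitant one is escalated.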

    3. Hybrid routing with shadow scoring

    Route based on rules, but periodically send a sample of local SLM responses to the cloud API for quality comparison. Use the comparison data to adjust your routing rules and identify tasks where the local model needs more training data.
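    A sketch of the shadow-scoring hook. The sample rate and payload shape are illustrative, and the grading worker that drains the queue is omitted:

```python
import random

def maybe_shadow_score(request: dict, local_response: str, shadow_queue,
                       sample_rate: float = 0.01) -> None:
    """Enqueue a small sample of local SLM responses for offline comparison.

    A background worker drains `shadow_queue` (anything with a put() method,
    e.g. queue.Queue), asks the cloud API to grade each local answer, and
    logs agreement per task type to guide routing-rule changes.
    """
    if random.random() < sample_rate:
        shadow_queue.put({"request": request, "local": local_response})
```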

    Most teams should start with rule-based routing. It is explicit, debuggable, and gets you 80% of the cost savings with 20% of the implementation effort.

    Fine-Tuning the Local Tier

    The local SLM is only as good as its fine-tuning. A base Llama 3.3 or Qwen 2.5 model will not match GPT-4o out of the box on your specific tasks. But a fine-tuned version trained on your production data will.

    The fine-tuning process for the local tier:

    1. Collect production examples. Export 2,000-5,000 request-response pairs from your existing GPT-4o usage. These are your training data — the cloud API has already generated the gold-standard outputs.

    2. Fine-tune a 7B or 14B base model. Using QLoRA, this takes 30-90 minutes on a single GPU. The result is a model that has learned your specific task patterns, output formats, and quality criteria.

    3. Evaluate against your cloud API outputs. Run the fine-tuned model on a held-out test set and compare outputs. For well-defined tasks, expect 92-98% quality parity.

    4. Quantize and deploy. Convert to GGUF format (Q4_K_M or Q5_K_M quantization) for efficient inference. Deploy via Ollama or llama.cpp behind an OpenAI-compatible API endpoint.

    5. Monitor and retrain. Track quality metrics in production. When you collect new edge cases the model handles poorly, add them to training data and retrain. Each iteration improves coverage.
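    The parity check in step 3 can start as normalized exact-match against the cloud outputs. This sketch suits structured tasks; free-text tasks need a semantic or judge-based comparison instead:

```python
import json

def parity_score(slm_outputs: list[str], gold_outputs: list[str]) -> float:
    """Fraction of held-out examples where the fine-tuned SLM matches the
    cloud API's 'gold' output after light normalization."""
    def norm(s: str) -> str:
        try:
            # Canonicalize JSON so key order doesn't count as a mismatch.
            return json.dumps(json.loads(s), sort_keys=True)
        except ValueError:
            return " ".join(s.lower().split())
    matches = sum(norm(a) == norm(b) for a, b in zip(slm_outputs, gold_outputs))
    return matches / len(slm_outputs)
```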

    Ertas handles steps 2-4 in a single workflow — upload your dataset, select a base model, and get a fine-tuned GGUF file ready for deployment. The fine-tuning runs on managed infrastructure, so you do not need your own training GPUs.

    Architecture for the Full Stack

    Here is what the complete SLM-first architecture looks like in production:

    ┌─────────────────────────────────────────────┐
    │                Your SaaS App                │
    │                                             │
    │  ┌─────────┐    ┌──────────────────────┐    │
    │  │ Request │───▶│   Routing Layer      │    │
    │  │  Queue  │    │ (rule / confidence)  │    │
    │  └─────────┘    └──────────┬───────────┘    │
    │                    ┌───────┴───────┐        │
    │                    ▼               ▼        │
    │           ┌──────────────┐  ┌───────────┐   │
    │           │ Local SLM    │  │ Cloud API │   │
    │           │ (Ollama /    │  │ (GPT-4o / │   │
    │           │  llama.cpp)  │  │  Claude)  │   │
    │           │ Fine-tuned   │  │           │   │
    │           │ 7B-14B       │  │           │   │
    │           └──────────────┘  └───────────┘   │
    │                   │                │        │
    │                   ▼                ▼        │
    │             ┌──────────────────────┐        │
    │             │   Response Handler   │        │
    │             │   (normalize format) │        │
    │             └──────────────────────┘        │
    └─────────────────────────────────────────────┘
    

    Key implementation details:

    • Both tiers expose OpenAI-compatible endpoints. Your application code uses the same client library for both — the only difference is the base URL and model name.
    • The response handler normalizes outputs. Different models may return slightly different formatting. A thin normalization layer ensures consistent output regardless of which tier handled the request.
    • Logging captures tier, latency, and cost per request. This data feeds your routing optimization and identifies candidates for model improvement.

    When 80/20 Becomes 90/10

    As you fine-tune your local model on more production data, the percentage of requests it handles well increases. Teams that start at 80/20 routing typically reach 90/10 within 3-6 months, because:

    • Edge cases get captured in training data and fine-tuned into the model
    • New task types get added to the local tier once they are well-defined
    • Quality thresholds for confidence-based routing can be tightened as the model improves

    At 90/10 routing, the same 500,000 requests/month scenario drops to:

    Tier               Requests    Monthly cost
    Local SLM (90%)    450,000     AU$1,200
    Cloud API (10%)    50,000      AU$1,250
    Total              500,000     AU$2,450

    That is an 80% cost reduction compared to full API usage, with a quality profile that has been validated over months of production data.

    Common Objections

    "What if the local model produces a bad response?"

    Implement quality checks. For structured outputs, validate against schemas. For free-text, use a lightweight classifier to flag low-confidence outputs for cloud API retry. This adds a few hundred milliseconds of latency on the ~2% of requests that get flagged, but eliminates quality risk.
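    For structured outputs, the quality check can be a few lines. The required keys and the `cloud_retry` callable below are hypothetical; adapt them to your task's schema:

```python
import json

REQUIRED_KEYS = {"category", "confidence"}  # example schema for a classify task

def validate_or_escalate(raw_output: str, cloud_retry) -> dict:
    """Accept the local SLM's output only if it parses as JSON and carries
    the expected keys; otherwise retry once on the cloud API."""
    try:
        parsed = json.loads(raw_output)
        if isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys():
            return parsed
    except ValueError:
        pass
    return cloud_retry()
```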

    "We don't have GPU infrastructure."

    A quantized 7B model runs on an AU$80/month VPS with no GPU. CPU inference on modern hardware handles 2-5 requests per second for a Q4-quantized 7B model. For most SaaS workloads under 200,000 requests/month, this is sufficient. GPU instances are only needed for higher throughput.

    "Our tasks are too complex for small models."

    Some of them are. Most of them are not. Run an evaluation. Take your last 1,000 API requests, classify them by complexity, and test a fine-tuned small model on the simple ones. The data will tell you what percentage of your traffic is actually complex.

    "Managing two inference paths is too much operational overhead."

    Both paths use OpenAI-compatible APIs. Your application code does not know or care which one handled the request. The routing layer is 50-100 lines of code. The operational overhead is one additional service to monitor, which is comparable to adding a cache layer.

    Getting Started

    The migration path is incremental. You do not need to implement the full architecture on day one.

    1. Week 1: Audit your current AI API usage. Categorize requests by type. Identify the 2-3 highest-volume, simplest task types.
    2. Week 2: Fine-tune a 7B model on those specific tasks using your existing API outputs as training data.
    3. Week 3: Deploy the model locally and route those specific task types to it. Keep everything else on the cloud API.
    4. Week 4: Monitor quality and costs. Adjust routing rules based on results.
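    The week-1 audit can start by counting task types in your request logs. The `task_type` field is an assumption; substitute whatever your logging schema records:

```python
from collections import Counter

def top_task_types(request_log: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Rank task types by volume to pick the first fine-tuning candidates."""
    counts = Counter(entry["task_type"] for entry in request_log)
    return counts.most_common(n)
```

    Feed it a month of logs and the top entries are your week-2 fine-tuning targets.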

    Repeat every month, moving more task types to the local tier as you validate quality. Within 3 months, you will have the 80/20 split running in production and a clear path to 90/10.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
