
SLM-First Architecture: The 80/20 Routing Strategy That Cuts AI Costs 75%
Most AI features don't need GPT-4. An SLM-first architecture routes 80% of requests to fine-tuned local models and 20% to cloud APIs — cutting costs by 60-75% while maintaining quality.
Most production AI workloads are simple. Classification, extraction, formatting, summarization of short documents, template-based generation. These tasks consume 80% or more of your inference budget, and they do not require a 200B+ parameter frontier model.
The SLM-first architecture inverts the default assumption. Instead of routing everything to a cloud API and optimizing later, you start with a fine-tuned small language model (7B-14B parameters) as the default path and only escalate to a cloud API when the request genuinely needs it.
The result: 60-75% cost reduction with no measurable quality loss on the tasks that matter.
What SLM-First Actually Means
In a traditional AI architecture, the request flow looks like this:
User Request → Cloud API (GPT-4o / Claude) → Response
Every request, regardless of complexity, goes to the most expensive option. This is the default because it is the simplest to build. One endpoint, one model, one integration.
SLM-first flips the default:
User Request → Router → [80%] Fine-Tuned SLM (7B-14B, local) → Response
                      → [20%] Cloud API (GPT-4o / Claude) → Response
The router examines each request and makes a decision: can a fine-tuned small model handle this adequately, or does it genuinely need frontier-level reasoning? The answer, for most SaaS workloads, is that the small model handles it just fine.
The Cost Math
Let's run the numbers on a SaaS product processing 500,000 AI requests per month. We will use representative pricing as of early 2026.
Scenario A: Everything goes to GPT-4o
| Metric | Value |
|---|---|
| Monthly requests | 500,000 |
| Avg tokens per request | 1,200 (input + output) |
| GPT-4o blended cost | ~AU$0.025 per request |
| Monthly cost | AU$12,500 |
Scenario B: 80/20 routing with fine-tuned 7B model
| Tier | Requests | Cost per request | Monthly cost |
|---|---|---|---|
| Local SLM (80%) | 400,000 | ~AU$0 (fixed infra) | AU$1,200 (server) |
| Cloud API (20%) | 100,000 | AU$0.025 | AU$2,500 |
| Total | 500,000 | — | AU$3,700 |
That is a 70% cost reduction. At 1 million requests per month, the savings approach 75% because the local infrastructure cost stays nearly flat while the API-only cost doubles.
The local infrastructure cost of AU$1,200/month covers a GPU instance capable of serving a quantized 7B model at tens of requests per second. At 400,000 requests per month (roughly 9 requests per minute on average), this is well within capacity.
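If you want to sanity-check these figures against your own volumes, the arithmetic fits in a few lines. The sketch below simply re-derives the tables above; the per-request and infrastructure figures are the same representative AU$ estimates, not quotes.

```python
# Back-of-envelope cost model for API-only vs 80/20 routing.
# Figures are the representative AU$ estimates from the tables above.
API_COST_PER_REQUEST = 0.025   # blended GPT-4o cost for a ~1,200-token request
LOCAL_INFRA_PER_MONTH = 1200   # fixed GPU instance cost, roughly flat with volume

def monthly_costs(requests: int, local_share: float = 0.8) -> dict:
    api_only = requests * API_COST_PER_REQUEST
    routed = LOCAL_INFRA_PER_MONTH + requests * (1 - local_share) * API_COST_PER_REQUEST
    return {
        "api_only": round(api_only),
        "routed": round(routed),
        "savings_pct": round(100 * (1 - routed / api_only), 1),
    }

print(monthly_costs(500_000))    # ~70% savings at 500k requests/month
print(monthly_costs(1_000_000))  # savings approach 75% as volume grows
```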
Which Requests Go Where
The routing decision is not complicated. It follows a pattern that maps cleanly to request types.
Route to the local SLM (80% of traffic):
- Text classification and categorization
- Named entity extraction
- Sentiment analysis
- Template-based content generation (emails, summaries, descriptions)
- Data formatting and transformation (JSON structuring, CSV parsing)
- FAQ and knowledge base responses
- Short-form summarization (under 500 words of source material)
- Intent detection and routing
These tasks share common traits: well-defined outputs, limited reasoning depth, consistent patterns. A fine-tuned 7B model trained on 2,000-5,000 examples of your specific task will match or beat GPT-4o on these, because it has learned your exact format, terminology, and quality criteria.
Route to the cloud API (20% of traffic):
- Multi-step reasoning across complex inputs
- Creative writing where novelty and style matter
- Long-document analysis (10,000+ tokens of source material)
- Tasks requiring broad, up-to-date world knowledge
- Edge cases the local model has not been trained on
- First-time task types you have not fine-tuned for yet
Implementing the Router
The router itself can be simple. You do not need a separate ML model to make routing decisions. Three practical approaches, in order of complexity:
1. Rule-based routing (start here)
Map your API endpoints or task types to tiers directly in code:
```python
if task_type in ["classify", "extract", "format", "summarize_short"]:
    route_to_local_slm()
elif task_type in ["reason", "create", "analyze_long"]:
    route_to_cloud_api()
```
This works well when your task types are well-defined and stable. Most SaaS products have 5-15 distinct AI task types, and you can categorize each one manually.
2. Confidence-based routing
Run the request through the local SLM first. If the model's output confidence (measured by token probabilities or a separate quality classifier) exceeds a threshold, use it. If not, fall back to the cloud API.
This captures more requests locally over time as you improve your fine-tuned model, and it automatically routes genuinely difficult requests to the frontier model.
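Here is a minimal sketch of that fallback logic, assuming both tiers sit behind OpenAI-compatible endpoints and that your local server returns token logprobs (support varies by server, so check yours). The base URLs, model names, and threshold are illustrative.

```python
import math
from openai import OpenAI

# Base URLs, model names, and threshold are illustrative.
local = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

CONFIDENCE_THRESHOLD = 0.80  # tune against your own evaluation data

def answer(messages: list[dict]) -> str:
    # Try the fine-tuned local model first and ask for token logprobs.
    resp = local.chat.completions.create(
        model="my-finetuned-7b", messages=messages, logprobs=True
    )
    choice = resp.choices[0]

    tokens = choice.logprobs.content if choice.logprobs else []
    if tokens:
        confidence = sum(math.exp(t.logprob) for t in tokens) / len(tokens)
    else:
        confidence = 0.0  # server returned no logprobs: treat as low confidence

    if confidence >= CONFIDENCE_THRESHOLD:
        return choice.message.content

    # Low confidence: escalate to the frontier model.
    resp = cloud.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```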
3. Hybrid routing with shadow scoring
Route based on rules, but periodically send a sample of local SLM responses to the cloud API for quality comparison. Use the comparison data to adjust your routing rules and identify tasks where the local model needs more training data.
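A minimal sketch of shadow scoring, using the cloud model as a judge on a small sample of local responses. The sample rate, judge prompt, and model name are illustrative; wire the returned record into whatever metrics store you already use.

```python
import random
from openai import OpenAI

cloud = OpenAI()  # reads OPENAI_API_KEY from the environment
SHADOW_SAMPLE_RATE = 0.02  # grade roughly 2% of local responses; illustrative

JUDGE_PROMPT = (
    "You are grading an AI response. Given the request and the response, "
    "reply with a single integer from 1 (unusable) to 5 (excellent)."
)

def shadow_score(task_type: str, request_text: str, local_response: str) -> dict | None:
    """Occasionally ask the frontier model to grade a local SLM response."""
    if random.random() > SHADOW_SAMPLE_RATE:
        return None  # most requests are not sampled
    grade = cloud.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user",
             "content": f"Request:\n{request_text}\n\nResponse:\n{local_response}"},
        ],
    ).choices[0].message.content
    # Persist this record; task types with low average grades need more training data.
    return {"task_type": task_type, "grade": grade.strip()}
```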
Most teams should start with rule-based routing. It is explicit, debuggable, and gets you 80% of the cost savings with 20% of the implementation effort.
Fine-Tuning the Local Tier
The local SLM is only as good as its fine-tuning. A base model like Llama 3.1 8B or Qwen 2.5 7B will not match GPT-4o out of the box on your specific tasks. But a fine-tuned version trained on your production data will.
The fine-tuning process for the local tier:
1. Collect production examples. Export 2,000-5,000 request-response pairs from your existing GPT-4o usage. These are your training data: the cloud API has already generated the gold-standard outputs (see the export sketch after this list).
2. Fine-tune a 7B or 14B base model. Using QLoRA, this takes 30-90 minutes on a single GPU. The result is a model that has learned your specific task patterns, output formats, and quality criteria.
3. Evaluate against your cloud API outputs. Run the fine-tuned model on a held-out test set and compare outputs. For well-defined tasks, expect 92-98% quality parity.
4. Quantize and deploy. Convert to GGUF format (Q4_K_M or Q5_K_M quantization) for efficient inference. Deploy via Ollama or llama.cpp behind an OpenAI-compatible API endpoint.
5. Monitor and retrain. Track quality metrics in production. When you collect new edge cases the model handles poorly, add them to training data and retrain. Each iteration improves coverage.
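For step 1, here is a sketch of turning logged cloud API traffic into a fine-tuning dataset. It assumes you can already pull (prompt, response) pairs from your request logs; the chat-style JSONL layout shown is one common format, so match whatever your fine-tuning tool expects.

```python
import json

def export_training_data(pairs, path="train.jsonl", limit=5000):
    """Write logged (prompt, response) pairs as chat-style JSONL for fine-tuning.

    `pairs` comes from your own request logs; the cloud API responses act as
    the gold-standard labels. Both the helper and the layout are illustrative.
    """
    with open(path, "w", encoding="utf-8") as f:
        for i, (prompt, completion) in enumerate(pairs):
            if i >= limit:
                break
            record = {
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": completion},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```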
Ertas handles steps 2-4 in a single workflow — upload your dataset, select a base model, and get a fine-tuned GGUF file ready for deployment. The fine-tuning runs on managed infrastructure, so you do not need your own training GPUs.
Architecture for the Full Stack
Here is what the complete SLM-first architecture looks like in production:
┌─────────────────────────────────────────────┐
│ Your SaaS App │
│ │
│ ┌─────────┐ ┌──────────────────────┐ │
│ │ Request │───▶│ Routing Layer │ │
│ │ Queue │ │ (rule / confidence) │ │
│ └─────────┘ └──────────┬───────────┘ │
│ ┌───────┴───────┐ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌───────────┐ │
│ │ Local SLM │ │ Cloud API │ │
│ │ (Ollama / │ │ (GPT-4o / │ │
│ │ llama.cpp) │ │ Claude) │ │
│ │ Fine-tuned │ │ │ │
│ │ 7B-14B │ │ │ │
│ └──────────────┘ └───────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────┐ │
│ │ Response Handler │ │
│ │ (normalize format) │ │
│ └──────────────────────┘ │
└─────────────────────────────────────────────┘
Key implementation details:
- Both tiers expose OpenAI-compatible endpoints. Your application code uses the same client library for both — the only difference is the base URL and model name.
- The response handler normalizes outputs. Different models may return slightly different formatting. A thin normalization layer ensures consistent output regardless of which tier handled the request.
- Logging captures tier, latency, and cost per request. This data feeds your routing optimization and identifies candidates for model improvement.
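A minimal sketch of the last two points: a response handler that normalizes output and logs tier, latency, and an estimated cost. The normalization rule, log fields, and cost constant are illustrative placeholders for the plumbing your app already has.

```python
import time

API_COST_PER_REQUEST_AUD = 0.025  # illustrative blended estimate from the cost table

def handle(tier: str, call_model, log) -> str:
    """Wrap a model call with timing, output normalization, and per-request logging.

    `call_model` is whichever tier the router picked and `log` is your metrics
    sink; both are placeholders for your existing infrastructure.
    """
    start = time.monotonic()
    raw = call_model()
    latency_ms = round((time.monotonic() - start) * 1000)

    # Normalize: different models pad output differently; trim it consistently.
    text = raw.strip().strip("`").strip()

    log({
        "tier": tier,
        "latency_ms": latency_ms,
        "est_cost_aud": API_COST_PER_REQUEST_AUD if tier == "cloud" else 0.0,
    })
    return text
```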
When 80/20 Becomes 90/10
As you fine-tune your local model on more production data, the percentage of requests it handles well increases. Teams that start at 80/20 routing typically reach 90/10 within 3-6 months, because:
- Edge cases get captured in training data and fine-tuned into the model
- New task types get added to the local tier once they are well-defined
- Quality thresholds for confidence-based routing can be tightened as the model improves
At 90/10 routing, the same 500,000 requests/month scenario drops to:
| Tier | Requests | Monthly cost |
|---|---|---|
| Local SLM (90%) | 450,000 | AU$1,200 |
| Cloud API (10%) | 50,000 | AU$1,250 |
| Total | 500,000 | AU$2,450 |
That is an 80% cost reduction compared to full API usage, with a quality profile that has been validated over months of production data.
Common Objections
"What if the local model makes a bad response?"
Implement quality checks. For structured outputs, validate against schemas. For free-text, use a lightweight classifier to flag low-confidence outputs for cloud API retry. This adds a few hundred milliseconds of latency on the ~2% of requests that get flagged, but removes most of the quality risk.
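As a sketch, the structured-output check can be as simple as parsing the local model's JSON and confirming the fields you expect are present; anything that fails goes to the cloud API. The schema and the call wrappers here are illustrative.

```python
import json

REQUIRED_FIELDS = {"category", "confidence", "summary"}  # illustrative schema

def validated_response(messages, local_call, cloud_call) -> dict:
    """Accept the local SLM's output only if it parses and matches the schema."""
    raw = local_call(messages)
    try:
        data = json.loads(raw)
        if isinstance(data, dict) and REQUIRED_FIELDS.issubset(data):
            return data
    except json.JSONDecodeError:
        pass  # malformed JSON falls through to the cloud retry

    # Roughly the ~2% of requests that get flagged end up here.
    return json.loads(cloud_call(messages))
```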
"We don't have GPU infrastructure."
A quantized 7B model runs on an AU$80/month VPS with no GPU. CPU inference on modern hardware handles 2-5 requests per second for a Q4-quantized 7B model. For most SaaS workloads under 200,000 requests/month, this is sufficient. GPU instances are only needed for higher throughput.
"Our tasks are too complex for small models."
Some of them are. Most of them are not. Run an evaluation. Take your last 1,000 API requests, classify them by complexity, and test a fine-tuned small model on the simple ones. The data will tell you what percentage of your traffic is actually complex.
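The evaluation itself can start very small. The sketch below assumes a held-out set of logged (prompt, cloud response) pairs and uses naive exact-match on normalized text as the parity metric, which suits classification and extraction tasks; free-text tasks need a more forgiving metric.

```python
def evaluate_parity(test_set, local_call) -> float:
    """Fraction of held-out requests where the local model matches the cloud output.

    `test_set` is a list of (prompt, cloud_response) pairs pulled from your logs;
    `local_call` sends a prompt to the fine-tuned local model. Both are placeholders.
    """
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    matches = sum(
        norm(local_call(prompt)) == norm(cloud_response)
        for prompt, cloud_response in test_set
    )
    return matches / len(test_set)

# e.g. evaluate_parity(simple_requests, my_local_model) -> 0.95 means 95% parity
```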
"Managing two inference paths is too much operational overhead."
Both paths use OpenAI-compatible APIs. Your application code does not know or care which one handled the request. The routing layer is 50-100 lines of code. The operational overhead is one additional service to monitor, which is comparable to adding a cache layer.
Getting Started
The migration path is incremental. You do not need to implement the full architecture on day one.
- Week 1: Audit your current AI API usage. Categorize requests by type. Identify the 2-3 highest-volume, simplest task types.
- Week 2: Fine-tune a 7B model on those specific tasks using your existing API outputs as training data.
- Week 3: Deploy the model locally and route those specific task types to it. Keep everything else on the cloud API.
- Week 4: Monitor quality and costs. Adjust routing rules based on results.
Repeat every month, moving more task types to the local tier as you validate quality. Within 3 months, you will have the 80/20 split running in production and a clear path to 90/10.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Fine-Tuning Small Models vs GPT-4 — Head-to-head comparison of fine-tuned 7B models against frontier APIs on specific tasks
- AI Inference Cost Comparison 2026 — Current pricing across cloud APIs, GPU instances, and local hardware
- Best Small Language Model for Enterprise 2026 — Which base models to choose for the local tier