
SLM-First Architecture: The 80/20 Routing Strategy That Cuts AI Costs 75%
Most AI features don't need GPT-4. An SLM-first architecture routes 80% of requests to fine-tuned local models and 20% to cloud APIs — cutting costs by 60-75% while maintaining quality.
Most production AI workloads are simple. Classification, extraction, formatting, summarization of short documents, template-based generation. These tasks consume 80% or more of your inference budget, and they do not require a 200B+ parameter frontier model.
The SLM-first architecture inverts the default assumption. Instead of routing everything to a cloud API and optimizing later, you start with a fine-tuned small language model (7B-14B parameters) as the default path and only escalate to a cloud API when the request genuinely needs it.
The result: 60-75% cost reduction with no measurable quality loss on the tasks that matter.
What SLM-First Actually Means
In a traditional AI architecture, the request flow looks like this:
User Request → Cloud API (GPT-4o / Claude) → Response
Every request, regardless of complexity, goes to the most expensive option. This is the default because it is the simplest to build. One endpoint, one model, one integration.
SLM-first flips the default:
User Request → Router → [80%] Fine-Tuned SLM (7B-14B, local) → Response
                      → [20%] Cloud API (GPT-4o / Claude) → Response
The router examines each request and makes a decision: can a fine-tuned small model handle this adequately, or does it genuinely need frontier-level reasoning? The answer, for most SaaS workloads, is that the small model handles it just fine.
The Cost Math
Let's run the numbers on a SaaS product processing 500,000 AI requests per month. We will use representative pricing as of early 2026.
Scenario A: Everything goes to GPT-4o
| Metric | Value |
|---|---|
| Monthly requests | 500,000 |
| Avg tokens per request | 1,200 (input + output) |
| GPT-4o blended cost | ~AU$0.025 per request |
| Monthly cost | AU$12,500 |
Scenario B: 80/20 routing with fine-tuned 7B model
| Tier | Requests | Cost per request | Monthly cost |
|---|---|---|---|
| Local SLM (80%) | 400,000 | ~AU$0 (fixed infra) | AU$1,200 (server) |
| Cloud API (20%) | 100,000 | AU$0.025 | AU$2,500 |
| Total | 500,000 | — | AU$3,700 |
That is a 70% cost reduction. At 1 million requests per month, the savings approach 75% because the local infrastructure cost stays nearly flat while the API-only cost doubles.
The local infrastructure cost of AU$1,200/month covers a GPU instance capable of serving a quantized 7B model at tens of requests per second. At 400,000 requests per month (roughly 9 requests per minute on average), this is well within capacity.
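If you want to sanity-check these figures against your own volumes, the arithmetic fits in a few lines. The sketch below simply re-derives the tables above; the per-request and infrastructure figures are the same representative AU$ estimates, not quotes.

```python
# Back-of-envelope cost model for API-only vs 80/20 routing.
# Figures are the representative AU$ estimates from the tables above.
API_COST_PER_REQUEST = 0.025   # blended GPT-4o cost for a ~1,200-token request
LOCAL_INFRA_PER_MONTH = 1200   # fixed GPU instance cost, roughly flat with volume

def monthly_costs(requests: int, local_share: float = 0.8) -> dict:
    api_only = requests * API_COST_PER_REQUEST
    routed = LOCAL_INFRA_PER_MONTH + requests * (1 - local_share) * API_COST_PER_REQUEST
    return {
        "api_only": round(api_only),
        "routed": round(routed),
        "savings_pct": round(100 * (1 - routed / api_only), 1),
    }

print(monthly_costs(500_000))    # ~70% savings at 500k requests/month
print(monthly_costs(1_000_000))  # savings approach 75% as volume grows
```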
Which Requests Go Where
The routing decision is not complicated. It follows a pattern that maps cleanly to request types.
Route to the local SLM (80% of traffic):
- Text classification and categorization
- Named entity extraction
- Sentiment analysis
- Template-based content generation (emails, summaries, descriptions)
- Data formatting and transformation (JSON structuring, CSV parsing)
- FAQ and knowledge base responses
- Short-form summarization (under 500 words of source material)
- Intent detection and routing
These tasks share common traits: well-defined outputs, limited reasoning depth, consistent patterns. A fine-tuned 7B model trained on 2,000-5,000 examples of your specific task will match or beat GPT-4o on these, because it has learned your exact format, terminology, and quality criteria.
Route to the cloud API (20% of traffic):
- Multi-step reasoning across complex inputs
- Creative writing where novelty and style matter
- Long-document analysis (10,000+ tokens of source material)
- Tasks requiring broad, up-to-date world knowledge
- Edge cases the local model has not been trained on
- First-time task types you have not fine-tuned for yet
Implementing the Router
The router itself can be simple. You do not need a separate ML model to make routing decisions. Three practical approaches, in order of complexity:
1. Rule-based routing (start here)
Map your API endpoints or task types to tiers directly in code:
```python
if task_type in ["classify", "extract", "format", "summarize_short"]:
    route_to_local_slm()
elif task_type in ["reason", "create", "analyze_long"]:
    route_to_cloud_api()
```
This works well when your task types are well-defined and stable. Most SaaS products have 5-15 distinct AI task types, and you can categorize each one manually.
2. Confidence-based routing
Run the request through the local SLM first. If the model's output confidence (measured by token probabilities or a separate quality classifier) exceeds a threshold, use it. If not, fall back to the cloud API.
This captures more requests locally over time as you improve your fine-tuned model, and it automatically routes genuinely difficult requests to the frontier model.
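Here is a minimal sketch of that fallback logic, assuming both tiers sit behind OpenAI-compatible endpoints and that your local server returns token logprobs (support varies by server, so check yours). The base URLs, model names, and threshold are illustrative.

```python
import math
from openai import OpenAI

# Base URLs, model names, and threshold are illustrative.
local = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

CONFIDENCE_THRESHOLD = 0.80  # tune against your own evaluation data

def answer(messages: list[dict]) -> str:
    # Try the fine-tuned local model first and ask for token logprobs.
    resp = local.chat.completions.create(
        model="my-finetuned-7b", messages=messages, logprobs=True
    )
    choice = resp.choices[0]

    tokens = choice.logprobs.content if choice.logprobs else []
    if tokens:
        confidence = sum(math.exp(t.logprob) for t in tokens) / len(tokens)
    else:
        confidence = 0.0  # server returned no logprobs: treat as low confidence

    if confidence >= CONFIDENCE_THRESHOLD:
        return choice.message.content

    # Low confidence: escalate to the frontier model.
    resp = cloud.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```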
3. Hybrid routing with shadow scoring
Route based on rules, but periodically send a sample of local SLM responses to the cloud API for quality comparison. Use the comparison data to adjust your routing rules and identify tasks where the local model needs more training data.
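A minimal sketch of shadow scoring, using the cloud model as a judge on a small sample of local responses. The sample rate, judge prompt, and model name are illustrative; wire the returned record into whatever metrics store you already use.

```python
import random
from openai import OpenAI

cloud = OpenAI()  # reads OPENAI_API_KEY from the environment
SHADOW_SAMPLE_RATE = 0.02  # grade roughly 2% of local responses; illustrative

JUDGE_PROMPT = (
    "You are grading an AI response. Given the request and the response, "
    "reply with a single integer from 1 (unusable) to 5 (excellent)."
)

def shadow_score(task_type: str, request_text: str, local_response: str) -> dict | None:
    """Occasionally ask the frontier model to grade a local SLM response."""
    if random.random() > SHADOW_SAMPLE_RATE:
        return None  # most requests are not sampled
    grade = cloud.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user",
             "content": f"Request:\n{request_text}\n\nResponse:\n{local_response}"},
        ],
    ).choices[0].message.content
    # Persist this record; task types with low average grades need more training data.
    return {"task_type": task_type, "grade": grade.strip()}
```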
Most teams should start with rule-based routing. It is explicit, debuggable, and gets you 80% of the cost savings with 20% of the implementation effort.
Fine-Tuning the Local Tier
The local SLM is only as good as its fine-tuning. A base model like Llama 3.1 8B or Qwen 2.5 7B will not match GPT-4o out of the box on your specific tasks. But a fine-tuned version trained on your production data will.
The fine-tuning process for the local tier:
1. Collect production examples. Export 2,000-5,000 request-response pairs from your existing GPT-4o usage. These are your training data: the cloud API has already generated the gold-standard outputs (see the export sketch after this list).
2. Fine-tune a 7B or 14B base model. Using QLoRA, this takes 30-90 minutes on a single GPU. The result is a model that has learned your specific task patterns, output formats, and quality criteria.
3. Evaluate against your cloud API outputs. Run the fine-tuned model on a held-out test set and compare outputs. For well-defined tasks, expect 92-98% quality parity.
4. Quantize and deploy. Convert to GGUF format (Q4_K_M or Q5_K_M quantization) for efficient inference. Deploy via Ollama or llama.cpp behind an OpenAI-compatible API endpoint.
5. Monitor and retrain. Track quality metrics in production. When you collect new edge cases the model handles poorly, add them to training data and retrain. Each iteration improves coverage.
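For step 1, here is a sketch of turning logged cloud API traffic into a fine-tuning dataset. It assumes you can already pull (prompt, response) pairs from your request logs; the chat-style JSONL layout shown is one common format, so match whatever your fine-tuning tool expects.

```python
import json

def export_training_data(pairs, path="train.jsonl", limit=5000):
    """Write logged (prompt, response) pairs as chat-style JSONL for fine-tuning.

    `pairs` comes from your own request logs; the cloud API responses act as
    the gold-standard labels. Both the helper and the layout are illustrative.
    """
    with open(path, "w", encoding="utf-8") as f:
        for i, (prompt, completion) in enumerate(pairs):
            if i >= limit:
                break
            record = {
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": completion},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```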
Ertas handles steps 2-4 in a single workflow — upload your dataset, select a base model, and get a fine-tuned GGUF file ready for deployment. The fine-tuning runs on managed infrastructure, so you do not need your own training GPUs.
Architecture for the Full Stack
Here is what the complete SLM-first architecture looks like in production:
┌─────────────────────────────────────────────┐
│ Your SaaS App │
│ │
│ ┌─────────┐ ┌──────────────────────┐ │
│ │ Request │───▶│ Routing Layer │ │
│ │ Queue │ │ (rule / confidence) │ │
│ └─────────┘ └──────────┬───────────┘ │
│ ┌───────┴───────┐ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌───────────┐ │
│ │ Local SLM │ │ Cloud API │ │
│ │ (Ollama / │ │ (GPT-4o / │ │
│ │ llama.cpp) │ │ Claude) │ │
│ │ Fine-tuned │ │ │ │
│ │ 7B-14B │ │ │ │
│ └──────────────┘ └───────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────┐ │
│ │ Response Handler │ │
│ │ (normalize format) │ │
│ └──────────────────────┘ │
└─────────────────────────────────────────────┘
Key implementation details:
- Both tiers expose OpenAI-compatible endpoints. Your application code uses the same client library for both — the only difference is the base URL and model name.
- The response handler normalizes outputs. Different models may return slightly different formatting. A thin normalization layer ensures consistent output regardless of which tier handled the request.
- Logging captures tier, latency, and cost per request. This data feeds your routing optimization and identifies candidates for model improvement.
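A minimal sketch of the last two points: a response handler that normalizes output and logs tier, latency, and an estimated cost. The normalization rule, log fields, and cost constant are illustrative placeholders for the plumbing your app already has.

```python
import time

API_COST_PER_REQUEST_AUD = 0.025  # illustrative blended estimate from the cost table

def handle(tier: str, call_model, log) -> str:
    """Wrap a model call with timing, output normalization, and per-request logging.

    `call_model` is whichever tier the router picked and `log` is your metrics
    sink; both are placeholders for your existing infrastructure.
    """
    start = time.monotonic()
    raw = call_model()
    latency_ms = round((time.monotonic() - start) * 1000)

    # Normalize: different models pad output differently; trim it consistently.
    text = raw.strip().strip("`").strip()

    log({
        "tier": tier,
        "latency_ms": latency_ms,
        "est_cost_aud": API_COST_PER_REQUEST_AUD if tier == "cloud" else 0.0,
    })
    return text
```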
When 80/20 Becomes 90/10
As you fine-tune your local model on more production data, the percentage of requests it handles well increases. Teams that start at 80/20 routing typically reach 90/10 within 3-6 months, because:
- Edge cases get captured in training data and fine-tuned into the model
- New task types get added to the local tier once they are well-defined
- Quality thresholds for confidence-based routing can be tightened as the model improves
At 90/10 routing, the same 500,000 requests/month scenario drops to:
| Tier | Requests | Monthly cost |
|---|---|---|
| Local SLM (90%) | 450,000 | AU$1,200 |
| Cloud API (10%) | 50,000 | AU$1,250 |
| Total | 500,000 | AU$2,450 |
That is an 80% cost reduction compared to full API usage, with a quality profile that has been validated over months of production data.
Common Objections
"What if the local model makes a bad response?"
Implement quality checks. For structured outputs, validate against schemas. For free-text, use a lightweight classifier to flag low-confidence outputs for cloud API retry. This adds a few hundred milliseconds of latency on the ~2% of requests that get flagged, but removes most of the quality risk.
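As a sketch, the structured-output check can be as simple as parsing the local model's JSON and confirming the fields you expect are present; anything that fails goes to the cloud API. The schema and the call wrappers here are illustrative.

```python
import json

REQUIRED_FIELDS = {"category", "confidence", "summary"}  # illustrative schema

def validated_response(messages, local_call, cloud_call) -> dict:
    """Accept the local SLM's output only if it parses and matches the schema."""
    raw = local_call(messages)
    try:
        data = json.loads(raw)
        if isinstance(data, dict) and REQUIRED_FIELDS.issubset(data):
            return data
    except json.JSONDecodeError:
        pass  # malformed JSON falls through to the cloud retry

    # Roughly the ~2% of requests that get flagged end up here.
    return json.loads(cloud_call(messages))
```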
"We don't have GPU infrastructure."
A quantized 7B model runs on an AU$80/month VPS with no GPU. CPU inference on modern hardware handles 2-5 requests per second for a Q4-quantized 7B model. For most SaaS workloads under 200,000 requests/month, this is sufficient. GPU instances are only needed for higher throughput.
"Our tasks are too complex for small models."
Some of them are. Most of them are not. Run an evaluation. Take your last 1,000 API requests, classify them by complexity, and test a fine-tuned small model on the simple ones. The data will tell you what percentage of your traffic is actually complex.
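The evaluation itself can start very small. The sketch below assumes a held-out set of logged (prompt, cloud response) pairs and uses naive exact-match on normalized text as the parity metric, which suits classification and extraction tasks; free-text tasks need a more forgiving metric.

```python
def evaluate_parity(test_set, local_call) -> float:
    """Fraction of held-out requests where the local model matches the cloud output.

    `test_set` is a list of (prompt, cloud_response) pairs pulled from your logs;
    `local_call` sends a prompt to the fine-tuned local model. Both are placeholders.
    """
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    matches = sum(
        norm(local_call(prompt)) == norm(cloud_response)
        for prompt, cloud_response in test_set
    )
    return matches / len(test_set)

# e.g. evaluate_parity(simple_requests, my_local_model) -> 0.95 means 95% parity
```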
"Managing two inference paths is too much operational overhead."
Both paths use OpenAI-compatible APIs. Your application code does not know or care which one handled the request. The routing layer is 50-100 lines of code. The operational overhead is one additional service to monitor, which is comparable to adding a cache layer.
Getting Started
The migration path is incremental. You do not need to implement the full architecture on day one.
- Week 1: Audit your current AI API usage. Categorize requests by type. Identify the 2-3 highest-volume, simplest task types.
- Week 2: Fine-tune a 7B model on those specific tasks using your existing API outputs as training data.
- Week 3: Deploy the model locally and route those specific task types to it. Keep everything else on the cloud API.
- Week 4: Monitor quality and costs. Adjust routing rules based on results.
Repeat every month, moving more task types to the local tier as you validate quality. Within 3 months, you will have the 80/20 split running in production and a clear path to 90/10.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Fine-Tuning Small Models vs GPT-4 — Head-to-head comparison of fine-tuned 7B models against frontier APIs on specific tasks
- AI Inference Cost Comparison 2026 — Current pricing across cloud APIs, GPU instances, and local hardware
- Best Small Language Model for Enterprise 2026 — Which base models to choose for the local tier