
Shipping AI Search in Your SaaS Without Per-Query API Costs
A step-by-step tutorial for building natural language search using a fine-tuned 3B-7B model. Includes training data sourcing, model selection, GGUF deployment via Ollama, and latency benchmarks.
Natural language search is the most requested AI feature in SaaS products. Users want to type "show me deals over $50K that closed last quarter" instead of clicking through filter dropdowns. The problem: every search query through an external API costs money, and search is high-frequency. A 10,000-user SaaS with 20 searches per user per day is 200,000 API calls per day. At GPT-4o pricing, that is $48,000/year — for a search box.
This tutorial walks through building natural language search using a fine-tuned model that runs locally with zero per-query costs.
What the Model Actually Does
The AI search model performs one specific task: translate a natural language query into a structured search filter that your existing search infrastructure can execute.
Input: "deals over 50K that closed in Q4"
Output:
{
  "filters": [
    { "field": "amount", "operator": "gt", "value": 50000 },
    { "field": "status", "operator": "eq", "value": "closed_won" },
    { "field": "close_date", "operator": "between", "value": ["2025-10-01", "2025-12-31"] }
  ],
  "sort": { "field": "close_date", "direction": "desc" }
}
This is not a RAG problem. You are not searching through documents. You are translating intent into structure. This distinction matters because it means:
- You need a small model (3B-7B parameters is more than sufficient)
- Your training data is compact (200-500 examples)
- Inference is fast (the output is short, typically 50-200 tokens)
Step 1: Sourcing Training Data
You need 200-500 pairs of natural language queries mapped to structured filters. Here is where to get them.
Source A: Search Logs (Best Quality)
If your product already has filter-based search, you have implicit training data. Every time a user applies filters manually, that is a structured query. You need the natural language equivalent.
Method: Export your most common filter combinations. For each, write 3-5 natural language variations.
| Structured Filter | Natural Language Variations |
|---|---|
| status=active, created > 30d ago | "active items from the last month", "show me active ones created recently", "new active items" |
| assignee=current_user, priority=high | "my high priority items", "high priority assigned to me", "what's urgent on my plate" |
| amount > 10000, stage=negotiation | "big deals in negotiation", "negotiations over 10K", "large deals we're negotiating" |
Target: 100-150 unique filter combinations with 3 natural language variations each = 300-450 training examples.
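A short script can expand those combinations into raw (query, filter) pairs before formatting. A minimal sketch, assuming the field names and variation lists above (they are illustrative, not your real schema); the resulting pairs feed the JSONL format shown in the Data Format section below:
# Minimal sketch: expand hand-written variations of each filter combination
# into (natural language query, structured filter) training pairs.
FILTER_VARIATIONS = [
    (
        {"filters": [
            {"field": "status", "operator": "eq", "value": "active"},
            {"field": "created_at", "operator": "gt", "value": "now-30d"},
        ]},
        [
            "active items from the last month",
            "show me active ones created recently",
            "new active items",
        ],
    ),
    # ... 100-150 more combinations
]

training_pairs = [
    (query, structured_filter)
    for structured_filter, queries in FILTER_VARIATIONS
    for query in queries
]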
Source B: Support Tickets
Search through your support tickets and chat logs for messages that contain search intent. Users frequently tell support "I'm trying to find X" or "How do I filter by Y." These are free training data.
Patterns to search for:
- "How do I find..."
- "I'm looking for..."
- "Can I filter by..."
- "Where are my..."
- "Show me..."
Typical yield: 50-100 usable examples per 1,000 support tickets.
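A quick way to mine those phrases is a regular-expression pass over an export of support messages. A minimal sketch, assuming the export is a plain list of message strings (extract_candidates is a hypothetical helper):
import re

# Phrases from the list above; extend with your product's own vocabulary.
SEARCH_INTENT = re.compile(
    r"\b(how do i find|i'?m looking for|can i filter by|where are my|show me)\b",
    re.IGNORECASE,
)

def extract_candidates(messages: list[str]) -> list[str]:
    """Return support messages that look like search intent, for manual labeling."""
    return [m for m in messages if SEARCH_INTENT.search(m)]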
Source C: Synthetic Generation (Supplement Only)
Use GPT-4o to generate additional variations of your existing examples. This works well for expanding natural language variations but should not be your primary source.
Prompt pattern:
Given this structured search filter:
{ "field": "status", "operator": "eq", "value": "active" }
Generate 5 natural language queries a user might type to
express this search intent. Vary formality, length, and phrasing.
The user is searching within a [your product type] application.
Use synthetic data to fill gaps in your coverage, not as the foundation.
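The prompt pattern above can be driven from a short script. A minimal sketch using the OpenAI Python SDK; the model choice, the line-by-line output parsing, and the expand_variations name are assumptions:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Given this structured search filter:
{filter_json}

Generate 5 natural language queries a user might type to
express this search intent. Vary formality, length, and phrasing.
The user is searching within a [your product type] application."""

def expand_variations(filter_json: str) -> list[str]:
    # Assumes the model returns one variation per line; adjust parsing as needed.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(filter_json=filter_json)}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.lstrip("-0123456789. ").strip() for line in lines if line.strip()]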
Data Format
Structure your training data as conversation pairs:
{
  "messages": [
    {
      "role": "system",
      "content": "Convert the user's search query into a structured filter. Respond only with valid JSON."
    },
    {
      "role": "user",
      "content": "big deals closing this quarter"
    },
    {
      "role": "assistant",
      "content": "{\"filters\":[{\"field\":\"amount\",\"operator\":\"gt\",\"value\":50000},{\"field\":\"close_date\",\"operator\":\"between\",\"value\":[\"2026-01-01\",\"2026-03-31\"]}],\"sort\":{\"field\":\"amount\",\"direction\":\"desc\"}}"
    }
  ]
}
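To produce the training file, each (query, filter) pair becomes one JSON line in this format. A minimal sketch; the file name and write_jsonl helper are assumptions:
import json

SYSTEM_PROMPT = ("Convert the user's search query into a structured filter. "
                 "Respond only with valid JSON.")

def write_jsonl(pairs: list[tuple[str, dict]], path: str = "search_intent_train.jsonl") -> None:
    """Write (natural language query, structured filter) pairs as chat-format JSONL."""
    with open(path, "w", encoding="utf-8") as f:
        for query, structured_filter in pairs:
            example = {"messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": query},
                {"role": "assistant", "content": json.dumps(structured_filter, separators=(",", ":"))},
            ]}
            f.write(json.dumps(example, ensure_ascii=False) + "\n")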
Step 2: Choosing the Right Base Model
For search intent parsing, you do not need a large model. The task is constrained: fixed input vocabulary (your product's domain), fixed output schema (your filter format), and short outputs.
Model Comparison for Search Intent
| Model | Parameters | GGUF Size (Q4) | Search Accuracy* | Latency (local) |
|---|---|---|---|---|
| Qwen 2.5 3B | 3B | 1.8 GB | 89% | 45ms |
| Llama 3.2 3B | 3B | 1.9 GB | 87% | 48ms |
| Phi-3.5 Mini | 3.8B | 2.2 GB | 91% | 52ms |
| Qwen 2.5 7B | 7B | 4.1 GB | 94% | 85ms |
| Llama 3.1 8B | 8B | 4.7 GB | 93% | 92ms |
| Mistral 7B v0.3 | 7B | 4.0 GB | 92% | 88ms |
*Accuracy measured as percentage of queries that produce valid, correct structured filters on a held-out test set of 100 queries after fine-tuning with 300 training examples.
Recommendation: Start with Qwen 2.5 3B. It is small enough to run on minimal hardware, fast enough for real-time search, and accurate enough for production after fine-tuning. Move to the 7B variant only if you need to handle complex multi-filter queries with nested logic.
Why Not a Larger Model?
A 70B model will not meaningfully outperform a fine-tuned 3B model on this task. Search intent parsing is a narrow, well-defined transformation. The fine-tuning data teaches the model your specific schema, field names, and filter syntax. A 3B model with 300 high-quality examples learns this pattern completely.
We tested this directly:
| Model | Pre-Fine-Tuning Accuracy | Post-Fine-Tuning Accuracy | Delta |
|---|---|---|---|
| Qwen 2.5 3B | 31% | 89% | +58% |
| Qwen 2.5 7B | 47% | 94% | +47% |
| Llama 3.1 70B | 72% | 96% | +24% |
The 3B model gains the most from fine-tuning because the base model has capacity to learn the pattern but has not seen enough similar examples in pre-training. The 70B model is already decent at zero-shot but only gains 2 percentage points over the 7B after fine-tuning. Those 2 points do not justify 17x the compute.
Step 3: Fine-Tuning
With your training data formatted and your base model selected, fine-tuning is straightforward.
Training Configuration
For search intent parsing, use these parameters:
| Parameter | Value | Why |
|---|---|---|
| Epochs | 3-5 | Small dataset, need multiple passes |
| Learning rate | 2e-4 | Standard for LoRA fine-tuning |
| LoRA rank | 16 | Sufficient for narrow tasks |
| LoRA alpha | 32 | 2x rank is standard |
| Batch size | 4-8 | Small dataset, small batches |
| Max sequence length | 512 | Search queries and filters are short |
Training time on a single GPU:
- 300 examples, 3B model: ~8 minutes on an A100, ~25 minutes on an RTX 4090
- 300 examples, 7B model: ~15 minutes on an A100, ~45 minutes on an RTX 4090
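If you run the job yourself, the table above maps directly onto a LoRA run. A minimal sketch using Hugging Face datasets, peft, and trl; the library choice, file names, and exact argument names (which shift between trl versions) are assumptions:
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

train_data = load_dataset("json", data_files="search_intent_train.jsonl", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

args = SFTConfig(
    output_dir="search-intent-qwen3b",
    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    max_seq_length=512,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",  # base model from Step 2
    train_dataset=train_data,          # chat-format "messages" column
    args=args,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("search-intent-qwen3b")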
With Ertas, upload your JSONL training file, select the base model, and the platform handles the rest. No GPU provisioning, no training scripts, no hyperparameter tuning.
Validation
Hold out 20% of your data (60-100 examples) for validation. Measure:
- Schema validity: Does the output parse as valid JSON? Target: >98%
- Filter correctness: Are the field names, operators, and values correct? Target: >85%
- Intent coverage: Does the filter capture the full user intent? Target: >80%
If schema validity is below 95%, you need more training examples or a larger model. If filter correctness is below 80%, your training data likely has inconsistencies — audit it for conflicting labels.
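Schema validity and filter correctness are easy to score automatically; intent coverage usually needs a manual pass. A minimal sketch, assuming model predictions and gold filters have already been collected into a list of dicts:
import json

def evaluate(examples: list[dict]) -> dict:
    """examples: [{"prediction": raw model output str, "expected": gold filter dict}, ...]"""
    valid_json = correct = 0
    for ex in examples:
        try:
            parsed = json.loads(ex["prediction"])
        except json.JSONDecodeError:
            continue
        valid_json += 1
        if parsed.get("filters") == ex["expected"].get("filters"):
            correct += 1
    n = len(examples)
    return {
        "schema_validity": valid_json / n,    # target > 0.98
        "filter_correctness": correct / n,    # target > 0.85
    }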
Step 4: Deploying via GGUF + Ollama
Once your model is fine-tuned, export it as a GGUF file and deploy it with Ollama for production inference.
Quantization Selection
| Quantization | File Size (3B) | File Size (7B) | Accuracy Drop | Speed |
|---|---|---|---|---|
| Q8_0 | 3.2 GB | 7.4 GB | Negligible | Baseline |
| Q5_K_M | 2.2 GB | 5.1 GB | <1% | 15% faster |
| Q4_K_M | 1.8 GB | 4.1 GB | 1-2% | 25% faster |
| Q4_0 | 1.7 GB | 3.8 GB | 2-3% | 30% faster |
Recommendation: Q4_K_M for production. The 1-2% accuracy trade-off is worth the 25% speed improvement and smaller memory footprint.
Ollama Deployment
Create a Modelfile:
FROM ./search-intent-qwen3b-q4km.gguf
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER num_predict 256
PARAMETER stop "</s>"
SYSTEM "Convert the user's search query into a structured filter. Respond only with valid JSON matching the schema: {filters: [{field, operator, value}], sort: {field, direction}}"
# Create the model
ollama create search-intent -f Modelfile
# Test it
ollama run search-intent "big deals closing this quarter"
API Endpoint
Ollama exposes an OpenAI-compatible API on port 11434:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "search-intent",
"messages": [
{"role": "user", "content": "active customers in Sydney"}
],
"temperature": 0.1
}'
Your application code stays the same — just change the base URL from api.openai.com to localhost:11434. If you are using the OpenAI SDK, set base_url="http://localhost:11434/v1".
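For example, with the Python SDK the switch is a single constructor argument; the api_key value is required by the SDK but ignored by Ollama, so any placeholder string works:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="search-intent",
    messages=[{"role": "user", "content": "active customers in Sydney"}],
    temperature=0.1,
)
print(resp.choices[0].message.content)  # structured filter JSON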
Step 5: Latency Benchmarks
Latency matters more for search than for almost any other AI feature. Users expect search results in under 300ms. Here is how local inference compares to API round-trips.
End-to-End Latency Comparison
| Scenario | Network Path | Network Latency | Inference | JSON Parse | Total |
|---|---|---|---|---|---|
| OpenAI API (GPT-4o-mini) | Cloud | 80-150ms | 200-400ms | 1ms | 281-551ms |
| OpenAI API (GPT-4o) | Cloud | 80-150ms | 400-800ms | 1ms | 481-951ms |
| Local Ollama (3B Q4) | None | 0ms | 35-55ms | 1ms | 36-56ms |
| Local Ollama (7B Q4) | None | 0ms | 70-100ms | 1ms | 71-101ms |
| Remote Ollama (same region) | VPC | 2-5ms | 35-55ms | 1ms | 38-61ms |
The local model is 5-15x faster than the API for this task. The difference is primarily network latency — the API requires a round trip to OpenAI's servers, while the local model has zero network overhead.
P99 Latency
P99 matters for search. Users notice when 1 in 100 searches is slow.
| Deployment | P50 | P95 | P99 |
|---|---|---|---|
| OpenAI API (GPT-4o-mini) | 320ms | 580ms | 1,200ms |
| Local Ollama (3B Q4) | 42ms | 58ms | 75ms |
| Local Ollama (7B Q4) | 82ms | 105ms | 130ms |
API P99 latency spikes to 1.2 seconds due to rate limiting, cold starts, and network variability. Local inference P99 is 75ms — within the user's perception of "instant."
Step 6: Production Architecture
The production deployment looks like this:
User types query
↓
Frontend debounces (200ms)
↓
POST /api/search { query: "big deals closing this quarter" }
↓
Backend calls Ollama (local or VPC)
↓
Model returns structured filter JSON (40-80ms)
↓
Backend validates JSON schema
↓
Backend executes filter against database/Elasticsearch
↓
Results returned to frontend
Error Handling
The model will occasionally produce invalid JSON (2-5% of queries with Q4 quantization). Handle this:
import json

import ollama  # official Ollama Python client

ollama_client = ollama.Client()  # defaults to http://localhost:11434


class ValidationError(Exception):
    """Model output does not match the expected filter schema."""


def validate_schema(filters: dict) -> None:
    # Minimal structural check; replace with your full schema validation.
    if not isinstance(filters.get("filters"), list):
        raise ValidationError("missing or invalid 'filters' key")


def parse_search(query: str) -> dict:
    response = ollama_client.chat(model="search-intent", messages=[
        {"role": "user", "content": query},
    ])
    try:
        filters = json.loads(response["message"]["content"])
        validate_schema(filters)
        return filters
    except (json.JSONDecodeError, ValidationError):
        # Fallback: basic keyword search instead of a hard error
        return {"fallback": True, "keyword": query}
Always have a keyword search fallback. Users would rather get approximate results than an error.
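Wired into the request path from the diagram above, the endpoint itself stays thin. A minimal sketch using FastAPI; the framework choice and the execute_filters placeholder are assumptions, and parse_search is the function from the snippet above:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SearchRequest(BaseModel):
    query: str

def execute_filters(filters: dict) -> list[dict]:
    # Placeholder: run the validated filter (or keyword fallback) against
    # your database or Elasticsearch.
    return []

@app.post("/api/search")
def search(req: SearchRequest) -> dict:
    filters = parse_search(req.query)  # parse_search from the snippet above
    return {"filters": filters, "results": execute_filters(filters)}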
Hardware Requirements
| Concurrent Users | Model | Recommended Hardware | Monthly Cost |
|---|---|---|---|
| 1-50 | 3B Q4 | 4 CPU cores, 4 GB RAM | $20-30 |
| 50-500 | 3B Q4 | 8 CPU cores, 8 GB RAM | $45-60 |
| 500-5,000 | 7B Q4 | 16 CPU cores, 16 GB RAM | $80-120 |
| 5,000+ | 7B Q4 | GPU instance (T4/L4) | $150-300 |
Compare to API costs at the same scale:
| Concurrent Users | Est. Monthly Queries | GPT-4o-mini API Cost | Local Model Cost | Savings |
|---|---|---|---|---|
| 50 | 30,000 | $45 | $25 | 44% |
| 500 | 300,000 | $450 | $55 | 88% |
| 5,000 | 3,000,000 | $4,500 | $110 | 98% |
| 50,000 | 30,000,000 | $45,000 | $250 | 99% |
At 500 concurrent users, you save 88%. At 5,000, you save 98%. The cost curve flattens while the API curve stays linear.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Putting It All Together
The full implementation timeline:
| Week | Task | Output |
|---|---|---|
| 1 | Source training data from logs and tickets | 200-300 labeled examples |
| 1-2 | Generate synthetic variations, clean data | 300-500 training examples |
| 2 | Fine-tune model, validate accuracy | Fine-tuned GGUF model |
| 2-3 | Deploy Ollama, build API endpoint | Working search API |
| 3 | Integrate with frontend, add fallback | Production-ready feature |
| 3-4 | Monitor, collect new examples, retrain | Continuous improvement |
Total elapsed time: 3-4 weeks for one engineer. No ML team required. No ongoing API costs. Search that is faster, cheaper, and entirely under your control.
Further Reading
- Shipping AI Features in Your SaaS Without an ML Team — the broader playbook for SaaS teams adding AI
- Fine-Tuning for Reliable JSON Output — deep dive on training models to produce structured output
- Running AI Models Locally: The Complete Guide — hardware, software, and deployment patterns for local inference