    Shipping AI Search in Your SaaS Without Per-Query API Costs
    saas · search · fine-tuning · tutorial · deployment · cost-reduction


    A step-by-step tutorial for building natural language search using a fine-tuned 3B-7B model. Includes training data sourcing, model selection, GGUF deployment via Ollama, and latency benchmarks.

    Ertas Team

    Natural language search is the most requested AI feature in SaaS products. Users want to type "show me deals over $50K that closed last quarter" instead of clicking through filter dropdowns. The problem: every search query through an external API costs money, and search is high-frequency. A 10,000-user SaaS with 20 searches per user per day is 200,000 API calls per day. At GPT-4o pricing, that is $48,000/year — for a search box.

    This tutorial walks through building natural language search using a fine-tuned model that runs locally with zero per-query costs.

    What the Model Actually Does

    The AI search model performs one specific task: translate a natural language query into a structured search filter that your existing search infrastructure can execute.

    Input: "deals over 50K that closed in Q4"

    Output:

    {
      "filters": [
        { "field": "amount", "operator": "gt", "value": 50000 },
        { "field": "status", "operator": "eq", "value": "closed_won" },
        { "field": "close_date", "operator": "between", "value": ["2025-10-01", "2025-12-31"] }
      ],
      "sort": { "field": "close_date", "direction": "desc" }
    }
    

    This is not a RAG problem. You are not searching through documents. You are translating intent into structure. This distinction matters because it means:

    1. You need a small model (3B-7B parameters is more than sufficient)
    2. Your training data is compact (200-500 examples)
    3. Latency is fast (the output is short — typically 50-200 tokens)
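
    Pinning the output schema down in code before you label any data helps keep training examples consistent. A minimal sketch in Python (the operator set and class names here are illustrative, matching the JSON example above, not a fixed API):

```python
from dataclasses import dataclass
from typing import Any, List, Optional

# Operators for the illustrative filter schema used in this post's examples.
OPERATORS = {"eq", "neq", "gt", "gte", "lt", "lte", "between", "in"}

@dataclass
class Filter:
    field: str
    operator: str
    value: Any

    def __post_init__(self):
        # Reject operators outside the schema early, before they reach training data.
        if self.operator not in OPERATORS:
            raise ValueError(f"unknown operator: {self.operator!r}")

@dataclass
class SearchQuery:
    filters: List[Filter]
    sort: Optional[dict] = None  # e.g. {"field": "close_date", "direction": "desc"}
```

    Anything the model emits can then be parsed into these types, and a bad operator fails loudly instead of silently producing an empty result set.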

    Step 1: Sourcing Training Data

    You need 200-500 pairs of natural language queries mapped to structured filters. Here is where to get them.

    Source A: Search Logs (Best Quality)

    If your product already has filter-based search, you have implicit training data. Every time a user applies filters manually, that is a structured query. You need the natural language equivalent.

    Method: Export your most common filter combinations. For each, write 3-5 natural language variations.

    | Structured Filter | Natural Language Variations |
    |---|---|
    | status=active, created > 30d ago | "active items from the last month", "show me active ones created recently", "new active items" |
    | assignee=current_user, priority=high | "my high priority items", "high priority assigned to me", "what's urgent on my plate" |
    | amount > 10000, stage=negotiation | "big deals in negotiation", "negotiations over 10K", "large deals we're negotiating" |

    Target: 100-150 unique filter combinations with 3 natural language variations each = 300-450 training examples.

    Source B: Support Tickets

    Search through your support tickets and chat logs for messages that contain search intent. Users frequently tell support "I'm trying to find X" or "How do I filter by Y." These are free training data.

    Pattern to search for:

    • "How do I find..."
    • "I'm looking for..."
    • "Can I filter by..."
    • "Where are my..."
    • "Show me..."

    Typical yield: 50-100 usable examples per 1,000 support tickets.
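
    Mining those phrases from an export can be scripted. A sketch, assuming one ticket message per string (the pattern list mirrors the bullets above; the function name is illustrative):

```python
import re

# Phrases that signal search intent, from the list above.
INTENT_PATTERNS = [
    r"how do i find\b",
    r"i'm looking for\b",
    r"can i filter by\b",
    r"where are my\b",
    r"show me\b",
]
INTENT_RE = re.compile("|".join(INTENT_PATTERNS), re.IGNORECASE)

def mine_search_intents(messages):
    """Return the messages that look like natural language search queries."""
    return [m.strip() for m in messages if INTENT_RE.search(m)]
```

    The hits still need manual review and labeling with the correct structured filter, but this narrows 1,000 tickets down to the handful worth reading.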

    Source C: Synthetic Generation (Supplement Only)

    Use GPT-4o to generate additional variations of your existing examples. This works well for expanding natural language variations but should not be your primary source.

    Prompt pattern:

    Given this structured search filter:
    { "field": "status", "operator": "eq", "value": "active" }
    
    Generate 5 natural language queries a user might type to
    express this search intent. Vary formality, length, and phrasing.
    The user is searching within a [your product type] application.
    

    Use synthetic data to fill gaps in your coverage, not as the foundation.

    Data Format

    Structure your training data as conversation pairs:

    {
      "messages": [
        {
          "role": "system",
          "content": "Convert the user's search query into a structured filter. Respond only with valid JSON."
        },
        {
          "role": "user",
          "content": "big deals closing this quarter"
        },
        {
          "role": "assistant",
          "content": "{\"filters\":[{\"field\":\"amount\",\"operator\":\"gt\",\"value\":50000},{\"field\":\"close_date\",\"operator\":\"between\",\"value\":[\"2026-01-01\",\"2026-03-31\"]}],\"sort\":{\"field\":\"amount\",\"direction\":\"desc\"}}"
        }
      ]
    }
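
    Turning the filter-to-variations table from Source A into this format is mechanical. A sketch (function names are illustrative; the system prompt matches the one in the example above):

```python
import json

SYSTEM = ("Convert the user's search query into a structured filter. "
          "Respond only with valid JSON.")

def to_training_examples(filter_obj, variations):
    """One structured filter plus N phrasings -> N chat-format training rows."""
    # Compact JSON, matching the assistant output format shown above.
    target = json.dumps(filter_obj, separators=(",", ":"))
    return [
        {"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": query},
            {"role": "assistant", "content": target},
        ]}
        for query in variations
    ]

def write_jsonl(path, filter_variation_pairs):
    """Write (filter, [variations]) pairs as one JSON object per line."""
    with open(path, "w") as f:
        for filt, variations in filter_variation_pairs:
            for row in to_training_examples(filt, variations):
                f.write(json.dumps(row) + "\n")
```

    Each of the 100-150 filter combinations expands to one row per phrasing, which is how 150 filters become 300-450 examples.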
    

    Step 2: Choosing the Right Base Model

    For search intent parsing, you do not need a large model. The task is constrained: fixed input vocabulary (your product's domain), fixed output schema (your filter format), and short outputs.

    Model Comparison for Search Intent

    | Model | Parameters | GGUF Size (Q4) | Search Accuracy* | Latency (local) |
    |---|---|---|---|---|
    | Qwen 2.5 3B | 3B | 1.8 GB | 89% | 45ms |
    | Llama 3.2 3B | 3B | 1.9 GB | 87% | 48ms |
    | Phi-3.5 Mini | 3.8B | 2.2 GB | 91% | 52ms |
    | Qwen 2.5 7B | 7B | 4.1 GB | 94% | 85ms |
    | Llama 3.1 8B | 8B | 4.7 GB | 93% | 92ms |
    | Mistral 7B v0.3 | 7B | 4.0 GB | 92% | 88ms |

    *Accuracy measured as percentage of queries that produce valid, correct structured filters on a held-out test set of 100 queries after fine-tuning with 300 training examples.

    Recommendation: Start with Qwen 2.5 3B. It is small enough to run on minimal hardware, fast enough for real-time search, and accurate enough for production after fine-tuning. Move to the 7B variant only if you need to handle complex multi-filter queries with nested logic.

    Why Not a Larger Model?

    A 70B model will not meaningfully outperform a fine-tuned 3B model on this task. Search intent parsing is a narrow, well-defined transformation. The fine-tuning data teaches the model your specific schema, field names, and filter syntax. A 3B model with 300 high-quality examples learns this pattern completely.

    We tested this directly:

    | Model | Pre-Fine-Tuning Accuracy | Post-Fine-Tuning Accuracy | Delta |
    |---|---|---|---|
    | Qwen 2.5 3B | 31% | 89% | +58% |
    | Qwen 2.5 7B | 47% | 94% | +47% |
    | Llama 3.1 70B | 72% | 96% | +24% |

    The 3B model gains the most from fine-tuning because the base model has capacity to learn the pattern but has not seen enough similar examples in pre-training. The 70B model is already decent at zero-shot but only gains 2 percentage points over the 7B after fine-tuning. Those 2 points do not justify 17x the compute.

    Step 3: Fine-Tuning

    With your training data formatted and your base model selected, fine-tuning is straightforward.

    Training Configuration

    For search intent parsing, use these parameters:

    | Parameter | Value | Why |
    |---|---|---|
    | Epochs | 3-5 | Small dataset, need multiple passes |
    | Learning rate | 2e-4 | Standard for LoRA fine-tuning |
    | LoRA rank | 16 | Sufficient for narrow tasks |
    | LoRA alpha | 32 | 2x rank is standard |
    | Batch size | 4-8 | Small dataset, small batches |
    | Max sequence length | 512 | Search queries and filters are short |

    Training time on a single GPU:

    • 300 examples, 3B model: ~8 minutes on an A100, ~25 minutes on an RTX 4090
    • 300 examples, 7B model: ~15 minutes on an A100, ~45 minutes on an RTX 4090
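
    As a sketch, the table's hyperparameters as a plain config dict (key names are illustrative, not tied to a specific trainer; one value picked from each range), plus the optimizer-step count they imply for a 300-example set:

```python
# Hyperparameters from the table above, one value chosen per range.
# Map the keys onto whichever LoRA training stack you use.
config = {
    "epochs": 4,            # 3-5 passes over a small dataset
    "learning_rate": 2e-4,  # standard for LoRA fine-tuning
    "lora_rank": 16,        # sufficient for narrow tasks
    "lora_alpha": 32,       # 2x rank
    "batch_size": 8,
    "max_seq_len": 512,     # queries and filters are short
}

def optimizer_steps(num_examples, cfg):
    """Total optimizer steps for a full run (ceiling division per epoch)."""
    steps_per_epoch = -(-num_examples // cfg["batch_size"])
    return steps_per_epoch * cfg["epochs"]
```

    At 300 examples, batch size 8, and 4 epochs, that is only 152 optimizer steps, which is why the wall-clock training times above are measured in minutes.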

    With Ertas, upload your JSONL training file, select the base model, and the platform handles the rest. No GPU provisioning, no training scripts, no hyperparameter tuning.

    Validation

    Hold out 20% of your data (60-100 examples) for validation. Measure:

    1. Schema validity: Does the output parse as valid JSON? Target: >98%
    2. Filter correctness: Are the field names, operators, and values correct? Target: >85%
    3. Intent coverage: Does the filter capture the full user intent? Target: >80%

    If schema validity is below 95%, you need more training examples or a larger model. If filter correctness is below 80%, your training data likely has inconsistencies — audit it for conflicting labels.
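
    The first two metrics can be computed mechanically over the held-out set; intent coverage usually needs human judgment. A sketch of the harness (the exact-match check here is stricter than field-level scoring, so treat its correctness number as a lower bound):

```python
import json

def evaluate(predictions, references):
    """predictions: raw model output strings; references: gold filter dicts."""
    valid = correct = 0
    for raw, gold in zip(predictions, references):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # counts against schema validity
        valid += 1
        # Order-insensitive comparison of the filter lists.
        canonical = lambda d: json.dumps(d, sort_keys=True)
        pred_filters = sorted(parsed.get("filters", []), key=canonical)
        gold_filters = sorted(gold.get("filters", []), key=canonical)
        if pred_filters == gold_filters:
            correct += 1
    n = len(references)
    return {"schema_validity": valid / n, "filter_correctness": correct / n}
```

    Run this after every training job and keep the numbers alongside the model artifact, so regressions show up before deployment.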

    Step 4: Deploying via GGUF + Ollama

    Once your model is fine-tuned, export it as a GGUF file and deploy it with Ollama for production inference.

    Quantization Selection

    | Quantization | File Size (3B) | File Size (7B) | Quality Loss | Speed |
    |---|---|---|---|---|
    | Q8_0 | 3.2 GB | 7.4 GB | Negligible | Baseline |
    | Q5_K_M | 2.2 GB | 5.1 GB | Under 1% accuracy drop | 15% faster |
    | Q4_K_M | 1.8 GB | 4.1 GB | 1-2% accuracy drop | 25% faster |
    | Q4_0 | 1.7 GB | 3.8 GB | 2-3% accuracy drop | 30% faster |

    Recommendation: Q4_K_M for production. The 1-2% accuracy trade-off is worth the 25% speed improvement and smaller memory footprint.

    Ollama Deployment

    Create a Modelfile:

    FROM ./search-intent-qwen3b-q4km.gguf
    
    PARAMETER temperature 0.1
    PARAMETER top_p 0.9
    PARAMETER num_predict 256
    PARAMETER stop "</s>"
    
    SYSTEM "Convert the user's search query into a structured filter. Respond only with valid JSON matching the schema: {filters: [{field, operator, value}], sort: {field, direction}}"
    
    # Create the model
    ollama create search-intent -f Modelfile
    
    # Test it
    ollama run search-intent "big deals closing this quarter"
    

    API Endpoint

    Ollama exposes an OpenAI-compatible API on port 11434:

    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "search-intent",
        "messages": [
          {"role": "user", "content": "active customers in Sydney"}
        ],
        "temperature": 0.1
      }'
    

    Your application code stays the same — just change the base URL from api.openai.com to localhost:11434. If you are using the OpenAI SDK, set base_url="http://localhost:11434/v1".

    Step 5: Latency Benchmarks

    Search latency matters more than almost any other AI feature. Users expect search results in under 300ms. Here is how local inference compares to API round-trips.

    End-to-End Latency Comparison

    | Scenario | Network Path | Network Latency | Inference | JSON Parse | Total |
    |---|---|---|---|---|---|
    | OpenAI API (GPT-4o-mini) | Cloud | 80-150ms | 200-400ms | 1ms | 281-551ms |
    | OpenAI API (GPT-4o) | Cloud | 80-150ms | 400-800ms | 1ms | 481-951ms |
    | Local Ollama (3B Q4) | None | 0ms | 35-55ms | 1ms | 36-56ms |
    | Local Ollama (7B Q4) | None | 0ms | 70-100ms | 1ms | 71-101ms |
    | Remote Ollama (same region) | VPC | 2-5ms | 35-55ms | 1ms | 38-61ms |

    The local model is 5-15x faster than the API for this task. The difference is primarily network latency — the API requires a round trip to OpenAI's servers, while the local model has zero network overhead.

    P99 Latency

    P99 matters for search. Users notice when 1 in 100 searches is slow.

    | Deployment | P50 | P95 | P99 |
    |---|---|---|---|
    | OpenAI API (GPT-4o-mini) | 320ms | 580ms | 1,200ms |
    | Local Ollama (3B Q4) | 42ms | 58ms | 75ms |
    | Local Ollama (7B Q4) | 82ms | 105ms | 130ms |

    API P99 latency spikes to 1.2 seconds due to rate limiting, cold starts, and network variability. Local inference P99 is 75ms — within the user's perception of "instant."

    Step 6: Production Architecture

    The production deployment looks like this:

    User types query
        ↓
    Frontend debounces (200ms)
        ↓
    POST /api/search { query: "big deals closing this quarter" }
        ↓
    Backend calls Ollama (local or VPC)
        ↓
    Model returns structured filter JSON (40-80ms)
        ↓
    Backend validates JSON schema
        ↓
    Backend executes filter against database/Elasticsearch
        ↓
    Results returned to frontend
    

    Error Handling

    The model will occasionally produce invalid JSON (2-5% of queries with Q4 quantization). Handle this:

    import json

    import ollama  # official Ollama Python client

    class ValidationError(Exception):
        """Raised when the model output does not match the filter schema."""

    def parse_search(query: str) -> dict:
        response = ollama.chat(model="search-intent", messages=[
            {"role": "user", "content": query}
        ])

        try:
            filters = json.loads(response["message"]["content"])
            validate_schema(filters)  # your schema check; raises ValidationError
            return filters
        except (json.JSONDecodeError, ValidationError):
            # Fallback: degrade gracefully to basic keyword search
            return {"fallback": True, "keyword": query}

    Always have a keyword search fallback. Users would rather get approximate results than an error.
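
    The `validate_schema` check referenced above can be hand-rolled against the filter format from earlier. A sketch (the allowed field names are illustrative and should come from your real search schema):

```python
# Illustrative allow-lists; replace with the fields your schema actually supports.
ALLOWED_FIELDS = {"amount", "status", "close_date", "assignee", "priority"}
ALLOWED_OPERATORS = {"eq", "neq", "gt", "gte", "lt", "lte", "between", "in"}

class ValidationError(Exception):
    pass

def validate_schema(payload):
    """Raise ValidationError unless payload matches the filter schema."""
    if not isinstance(payload, dict) or not isinstance(payload.get("filters"), list):
        raise ValidationError("missing filters list")
    for f in payload["filters"]:
        if not isinstance(f, dict):
            raise ValidationError("filter must be an object")
        if f.get("field") not in ALLOWED_FIELDS:
            raise ValidationError(f"unknown field: {f.get('field')!r}")
        if f.get("operator") not in ALLOWED_OPERATORS:
            raise ValidationError(f"unknown operator: {f.get('operator')!r}")
        if "value" not in f:
            raise ValidationError("filter missing value")
```

    Rejecting unknown fields is the important part: it stops a hallucinated field name from reaching your database or Elasticsearch query layer.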

    Hardware Requirements

    | Concurrent Users | Model | Recommended Hardware | Monthly Cost |
    |---|---|---|---|
    | 1-50 | 3B Q4 | 4 CPU cores, 4 GB RAM | $20-30 |
    | 50-500 | 3B Q4 | 8 CPU cores, 8 GB RAM | $45-60 |
    | 500-5,000 | 7B Q4 | 16 CPU cores, 16 GB RAM | $80-120 |
    | 5,000+ | 7B Q4 | GPU instance (T4/L4) | $150-300 |

    Compare to API costs at the same scale:

    | Concurrent Users | Est. Monthly Queries | GPT-4o-mini API Cost | Local Model Cost | Savings |
    |---|---|---|---|---|
    | 50 | 30,000 | $45 | $25 | 44% |
    | 500 | 300,000 | $450 | $55 | 88% |
    | 5,000 | 3,000,000 | $4,500 | $110 | 98% |
    | 50,000 | 30,000,000 | $45,000 | $250 | 99% |

    At 500 concurrent users, you save 88%. At 5,000, you save 98%. The cost curve flattens while the API curve stays linear.

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    Putting It All Together

    The full implementation timeline:

    | Week | Task | Output |
    |---|---|---|
    | 1 | Source training data from logs and tickets | 200-300 labeled examples |
    | 1-2 | Generate synthetic variations, clean data | 300-500 training examples |
    | 2 | Fine-tune model, validate accuracy | Fine-tuned GGUF model |
    | 2-3 | Deploy Ollama, build API endpoint | Working search API |
    | 3 | Integrate with frontend, add fallback | Production-ready feature |
    | 3-4 | Monitor, collect new examples, retrain | Continuous improvement |

    Total elapsed time: 3-4 weeks for one engineer. No ML team required. No ongoing API costs. Search that is faster, cheaper, and entirely under your control.
