    Shipping AI Search in Your SaaS Without Per-Query API Costs
    saas · search · fine-tuning · tutorial · deployment · cost-reduction


    A step-by-step tutorial for building natural language search using a fine-tuned 3B-7B model. Includes training data sourcing, model selection, GGUF deployment via Ollama, and latency benchmarks.

    Ertas Team

    Natural language search is the most requested AI feature in SaaS products. Users want to type "show me deals over $50K that closed last quarter" instead of clicking through filter dropdowns. The problem: every search query through an external API costs money, and search is high-frequency. A 10,000-user SaaS with 20 searches per user per day is 200,000 API calls per day. At GPT-4o pricing, that is $48,000/year — for a search box.

    This tutorial walks through building natural language search using a fine-tuned model that runs locally with zero per-query costs.

    What the Model Actually Does

    The AI search model performs one specific task: translate a natural language query into a structured search filter that your existing search infrastructure can execute.

    Input: "deals over 50K that closed in Q4"

    Output:

    {
      "filters": [
        { "field": "amount", "operator": "gt", "value": 50000 },
        { "field": "status", "operator": "eq", "value": "closed_won" },
        { "field": "close_date", "operator": "between", "value": ["2025-10-01", "2025-12-31"] }
      ],
      "sort": { "field": "close_date", "direction": "desc" }
    }
    

    This is not a RAG problem. You are not searching through documents. You are translating intent into structure. This distinction matters because it means:

    1. You need a small model (3B-7B parameters is more than sufficient)
    2. Your training data is compact (200-500 examples)
    3. Latency is fast (the output is short — typically 50-200 tokens)
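
    Pinning the output schema down in code before you label any data helps keep training examples consistent. A minimal sketch in Python (the operator set and class names here are illustrative, matching the JSON example above, not a fixed API):

```python
from dataclasses import dataclass
from typing import Any, List, Optional

# Operators for the illustrative filter schema used in this post's examples.
OPERATORS = {"eq", "neq", "gt", "gte", "lt", "lte", "between", "in"}

@dataclass
class Filter:
    field: str
    operator: str
    value: Any

    def __post_init__(self):
        # Reject operators outside the schema early, before they reach training data.
        if self.operator not in OPERATORS:
            raise ValueError(f"unknown operator: {self.operator!r}")

@dataclass
class SearchQuery:
    filters: List[Filter]
    sort: Optional[dict] = None  # e.g. {"field": "close_date", "direction": "desc"}
```

    Anything the model emits can then be parsed into these types, and a bad operator fails loudly instead of silently producing an empty result set.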

    Step 1: Sourcing Training Data

    You need 200-500 pairs of natural language queries mapped to structured filters. Here is where to get them.

    Source A: Search Logs (Best Quality)

    If your product already has filter-based search, you have implicit training data. Every time a user applies filters manually, that is a structured query. You need the natural language equivalent.

    Method: Export your most common filter combinations. For each, write 3-5 natural language variations.

    | Structured Filter | Natural Language Variations |
    |---|---|
    | status=active, created > 30d ago | "active items from the last month", "show me active ones created recently", "new active items" |
    | assignee=current_user, priority=high | "my high priority items", "high priority assigned to me", "what's urgent on my plate" |
    | amount > 10000, stage=negotiation | "big deals in negotiation", "negotiations over 10K", "large deals we're negotiating" |

    Target: 100-150 unique filter combinations with 3 natural language variations each = 300-450 training examples.

    Source B: Support Tickets

    Search through your support tickets and chat logs for messages that contain search intent. Users frequently tell support "I'm trying to find X" or "How do I filter by Y." These are free training data.

    Pattern to search for:

    • "How do I find..."
    • "I'm looking for..."
    • "Can I filter by..."
    • "Where are my..."
    • "Show me..."

    Typical yield: 50-100 usable examples per 1,000 support tickets.
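
    Mining those phrases from an export can be scripted. A sketch, assuming one ticket message per string (the pattern list mirrors the bullets above; the function name is illustrative):

```python
import re

# Phrases that signal search intent, from the list above.
INTENT_PATTERNS = [
    r"how do i find\b",
    r"i'm looking for\b",
    r"can i filter by\b",
    r"where are my\b",
    r"show me\b",
]
INTENT_RE = re.compile("|".join(INTENT_PATTERNS), re.IGNORECASE)

def mine_search_intents(messages):
    """Return the messages that look like natural language search queries."""
    return [m.strip() for m in messages if INTENT_RE.search(m)]
```

    The hits still need manual review and labeling with the correct structured filter, but this narrows 1,000 tickets down to the handful worth reading.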

    Source C: Synthetic Generation (Supplement Only)

    Use GPT-4o to generate additional variations of your existing examples. This works well for expanding natural language variations but should not be your primary source.

    Prompt pattern:

    Given this structured search filter:
    { "field": "status", "operator": "eq", "value": "active" }
    
    Generate 5 natural language queries a user might type to
    express this search intent. Vary formality, length, and phrasing.
    The user is searching within a [your product type] application.
    

    Use synthetic data to fill gaps in your coverage, not as the foundation.

    Data Format

    Structure your training data as conversation pairs:

    {
      "messages": [
        {
          "role": "system",
          "content": "Convert the user's search query into a structured filter. Respond only with valid JSON."
        },
        {
          "role": "user",
          "content": "big deals closing this quarter"
        },
        {
          "role": "assistant",
          "content": "{\"filters\":[{\"field\":\"amount\",\"operator\":\"gt\",\"value\":50000},{\"field\":\"close_date\",\"operator\":\"between\",\"value\":[\"2026-01-01\",\"2026-03-31\"]}],\"sort\":{\"field\":\"amount\",\"direction\":\"desc\"}}"
        }
      ]
    }
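
    Turning the filter-to-variations table from Source A into this format is mechanical. A sketch (function names are illustrative; the system prompt matches the one in the example above):

```python
import json

SYSTEM = ("Convert the user's search query into a structured filter. "
          "Respond only with valid JSON.")

def to_training_examples(filter_obj, variations):
    """One structured filter plus N phrasings -> N chat-format training rows."""
    # Compact JSON, matching the assistant output format shown above.
    target = json.dumps(filter_obj, separators=(",", ":"))
    return [
        {"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": query},
            {"role": "assistant", "content": target},
        ]}
        for query in variations
    ]

def write_jsonl(path, filter_variation_pairs):
    """Write (filter, [variations]) pairs as one JSON object per line."""
    with open(path, "w") as f:
        for filt, variations in filter_variation_pairs:
            for row in to_training_examples(filt, variations):
                f.write(json.dumps(row) + "\n")
```

    Each of the 100-150 filter combinations expands to one row per phrasing, which is how 150 filters become 300-450 examples.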
    

    Step 2: Choosing the Right Base Model

    For search intent parsing, you do not need a large model. The task is constrained: fixed input vocabulary (your product's domain), fixed output schema (your filter format), and short outputs.

    Model Comparison for Search Intent

    | Model | Parameters | GGUF Size (Q4) | Search Accuracy* | Latency (local) |
    |---|---|---|---|---|
    | Qwen 2.5 3B | 3B | 1.8 GB | 89% | 45ms |
    | Llama 3.2 3B | 3B | 1.9 GB | 87% | 48ms |
    | Phi-3.5 Mini | 3.8B | 2.2 GB | 91% | 52ms |
    | Qwen 2.5 7B | 7B | 4.1 GB | 94% | 85ms |
    | Llama 3.1 8B | 8B | 4.7 GB | 93% | 92ms |
    | Mistral 7B v0.3 | 7B | 4.0 GB | 92% | 88ms |

    *Accuracy measured as percentage of queries that produce valid, correct structured filters on a held-out test set of 100 queries after fine-tuning with 300 training examples.

    Recommendation: Start with Qwen 2.5 3B. It is small enough to run on minimal hardware, fast enough for real-time search, and accurate enough for production after fine-tuning. Move to the 7B variant only if you need to handle complex multi-filter queries with nested logic.

    Why Not a Larger Model?

    A 70B model will not meaningfully outperform a fine-tuned 3B model on this task. Search intent parsing is a narrow, well-defined transformation. The fine-tuning data teaches the model your specific schema, field names, and filter syntax. A 3B model with 300 high-quality examples learns this pattern completely.

    We tested this directly:

    | Model | Pre-Fine-Tuning Accuracy | Post-Fine-Tuning Accuracy | Delta |
    |---|---|---|---|
    | Qwen 2.5 3B | 31% | 89% | +58% |
    | Qwen 2.5 7B | 47% | 94% | +47% |
    | Llama 3.1 70B | 72% | 96% | +24% |

    The 3B model gains the most from fine-tuning because the base model has capacity to learn the pattern but has not seen enough similar examples in pre-training. The 70B model is already decent at zero-shot but only gains 2 percentage points over the 7B after fine-tuning. Those 2 points do not justify 17x the compute.

    Step 3: Fine-Tuning

    With your training data formatted and your base model selected, fine-tuning is straightforward.

    Training Configuration

    For search intent parsing, use these parameters:

    | Parameter | Value | Why |
    |---|---|---|
    | Epochs | 3-5 | Small dataset, need multiple passes |
    | Learning rate | 2e-4 | Standard for LoRA fine-tuning |
    | LoRA rank | 16 | Sufficient for narrow tasks |
    | LoRA alpha | 32 | 2x rank is standard |
    | Batch size | 4-8 | Small dataset, small batches |
    | Max sequence length | 512 | Search queries and filters are short |

    Training time on a single GPU:

    • 300 examples, 3B model: ~8 minutes on an A100, ~25 minutes on an RTX 4090
    • 300 examples, 7B model: ~15 minutes on an A100, ~45 minutes on an RTX 4090
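
    As a sketch, the table's hyperparameters as a plain config dict (key names are illustrative, not tied to a specific trainer; one value picked from each range), plus the optimizer-step count they imply for a 300-example set:

```python
# Hyperparameters from the table above, one value chosen per range.
# Map the keys onto whichever LoRA training stack you use.
config = {
    "epochs": 4,            # 3-5 passes over a small dataset
    "learning_rate": 2e-4,  # standard for LoRA fine-tuning
    "lora_rank": 16,        # sufficient for narrow tasks
    "lora_alpha": 32,       # 2x rank
    "batch_size": 8,
    "max_seq_len": 512,     # queries and filters are short
}

def optimizer_steps(num_examples, cfg):
    """Total optimizer steps for a full run (ceiling division per epoch)."""
    steps_per_epoch = -(-num_examples // cfg["batch_size"])
    return steps_per_epoch * cfg["epochs"]
```

    At 300 examples, batch size 8, and 4 epochs, that is only 152 optimizer steps, which is why the wall-clock training times above are measured in minutes.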

    With Ertas, upload your JSONL training file, select the base model, and the platform handles the rest. No GPU provisioning, no training scripts, no hyperparameter tuning.

    Validation

    Hold out 20% of your data (60-100 examples) for validation. Measure:

    1. Schema validity: Does the output parse as valid JSON? Target: >98%
    2. Filter correctness: Are the field names, operators, and values correct? Target: >85%
    3. Intent coverage: Does the filter capture the full user intent? Target: >80%

    If schema validity is below 95%, you need more training examples or a larger model. If filter correctness is below 80%, your training data likely has inconsistencies — audit it for conflicting labels.
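
    The first two metrics can be computed mechanically over the held-out set; intent coverage usually needs human judgment. A sketch of the harness (the exact-match check here is stricter than field-level scoring, so treat its correctness number as a lower bound):

```python
import json

def evaluate(predictions, references):
    """predictions: raw model output strings; references: gold filter dicts."""
    valid = correct = 0
    for raw, gold in zip(predictions, references):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # counts against schema validity
        valid += 1
        # Order-insensitive comparison of the filter lists.
        canonical = lambda d: json.dumps(d, sort_keys=True)
        pred_filters = sorted(parsed.get("filters", []), key=canonical)
        gold_filters = sorted(gold.get("filters", []), key=canonical)
        if pred_filters == gold_filters:
            correct += 1
    n = len(references)
    return {"schema_validity": valid / n, "filter_correctness": correct / n}
```

    Run this after every training job and keep the numbers alongside the model artifact, so regressions show up before deployment.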

    Step 4: Deploying via GGUF + Ollama

    Once your model is fine-tuned, export it as a GGUF file and deploy it with Ollama for production inference.

    Quantization Selection

    | Quantization | File Size (3B) | File Size (7B) | Quality Loss | Speed |
    |---|---|---|---|---|
    | Q8_0 | 3.2 GB | 7.4 GB | Negligible | Baseline |
    | Q5_K_M | 2.2 GB | 5.1 GB | Under 1% accuracy drop | 15% faster |
    | Q4_K_M | 1.8 GB | 4.1 GB | 1-2% accuracy drop | 25% faster |
    | Q4_0 | 1.7 GB | 3.8 GB | 2-3% accuracy drop | 30% faster |

    Recommendation: Q4_K_M for production. The 1-2% accuracy trade-off is worth the 25% speed improvement and smaller memory footprint.

    Ollama Deployment

    Create a Modelfile:

    FROM ./search-intent-qwen3b-q4km.gguf
    
    PARAMETER temperature 0.1
    PARAMETER top_p 0.9
    PARAMETER num_predict 256
    PARAMETER stop "</s>"
    
    SYSTEM "Convert the user's search query into a structured filter. Respond only with valid JSON matching the schema: {filters: [{field, operator, value}], sort: {field, direction}}"
    
    # Create the model
    ollama create search-intent -f Modelfile
    
    # Test it
    ollama run search-intent "big deals closing this quarter"
    

    API Endpoint

    Ollama exposes an OpenAI-compatible API on port 11434:

    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "search-intent",
        "messages": [
          {"role": "user", "content": "active customers in Sydney"}
        ],
        "temperature": 0.1
      }'
    

    Your application code stays the same — just change the base URL from api.openai.com to localhost:11434. If you are using the OpenAI SDK, set base_url="http://localhost:11434/v1".

    Step 5: Latency Benchmarks

    Search latency matters more than almost any other AI feature. Users expect search results in under 300ms. Here is how local inference compares to API round-trips.

    End-to-End Latency Comparison

    | Scenario | Network Path | Network Latency | Inference | JSON Parse | Total |
    |---|---|---|---|---|---|
    | OpenAI API (GPT-4o-mini) | Cloud | 80-150ms | 200-400ms | 1ms | 281-551ms |
    | OpenAI API (GPT-4o) | Cloud | 80-150ms | 400-800ms | 1ms | 481-951ms |
    | Local Ollama (3B Q4) | None | 0ms | 35-55ms | 1ms | 36-56ms |
    | Local Ollama (7B Q4) | None | 0ms | 70-100ms | 1ms | 71-101ms |
    | Remote Ollama (same region) | VPC | 2-5ms | 35-55ms | 1ms | 38-61ms |

    The local model is 5-15x faster than the API for this task. The difference is primarily network latency — the API requires a round trip to OpenAI's servers, while the local model has zero network overhead.

    P99 Latency

    P99 matters for search. Users notice when 1 in 100 searches is slow.

    | Deployment | P50 | P95 | P99 |
    |---|---|---|---|
    | OpenAI API (GPT-4o-mini) | 320ms | 580ms | 1,200ms |
    | Local Ollama (3B Q4) | 42ms | 58ms | 75ms |
    | Local Ollama (7B Q4) | 82ms | 105ms | 130ms |

    API P99 latency spikes to 1.2 seconds due to rate limiting, cold starts, and network variability. Local inference P99 is 75ms — within the user's perception of "instant."

    Step 6: Production Architecture

    The production deployment looks like this:

    User types query
        ↓
    Frontend debounces (200ms)
        ↓
    POST /api/search { query: "big deals closing this quarter" }
        ↓
    Backend calls Ollama (local or VPC)
        ↓
    Model returns structured filter JSON (40-80ms)
        ↓
    Backend validates JSON schema
        ↓
    Backend executes filter against database/Elasticsearch
        ↓
    Results returned to frontend
    

    Error Handling

    The model will occasionally produce invalid JSON (2-5% of queries with Q4 quantization). Handle this:

    import json

    import ollama  # official Ollama Python client

    class ValidationError(Exception):
        """Raised when the model output does not match the filter schema."""

    def parse_search(query: str) -> dict:
        response = ollama.chat(model="search-intent", messages=[
            {"role": "user", "content": query}
        ])

        try:
            filters = json.loads(response["message"]["content"])
            validate_schema(filters)  # your schema check; raises ValidationError
            return filters
        except (json.JSONDecodeError, ValidationError):
            # Fallback: degrade gracefully to basic keyword search
            return {"fallback": True, "keyword": query}

    Always have a keyword search fallback. Users would rather get approximate results than an error.
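
    The `validate_schema` check referenced above can be hand-rolled against the filter format from earlier. A sketch (the allowed field names are illustrative and should come from your real search schema):

```python
# Illustrative allow-lists; replace with the fields your schema actually supports.
ALLOWED_FIELDS = {"amount", "status", "close_date", "assignee", "priority"}
ALLOWED_OPERATORS = {"eq", "neq", "gt", "gte", "lt", "lte", "between", "in"}

class ValidationError(Exception):
    pass

def validate_schema(payload):
    """Raise ValidationError unless payload matches the filter schema."""
    if not isinstance(payload, dict) or not isinstance(payload.get("filters"), list):
        raise ValidationError("missing filters list")
    for f in payload["filters"]:
        if not isinstance(f, dict):
            raise ValidationError("filter must be an object")
        if f.get("field") not in ALLOWED_FIELDS:
            raise ValidationError(f"unknown field: {f.get('field')!r}")
        if f.get("operator") not in ALLOWED_OPERATORS:
            raise ValidationError(f"unknown operator: {f.get('operator')!r}")
        if "value" not in f:
            raise ValidationError("filter missing value")
```

    Rejecting unknown fields is the important part: it stops a hallucinated field name from reaching your database or Elasticsearch query layer.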

    Hardware Requirements

    | Concurrent Users | Model | Recommended Hardware | Monthly Cost |
    |---|---|---|---|
    | 1-50 | 3B Q4 | 4 CPU cores, 4 GB RAM | $20-30 |
    | 50-500 | 3B Q4 | 8 CPU cores, 8 GB RAM | $45-60 |
    | 500-5,000 | 7B Q4 | 16 CPU cores, 16 GB RAM | $80-120 |
    | 5,000+ | 7B Q4 | GPU instance (T4/L4) | $150-300 |

    Compare to API costs at the same scale:

    | Concurrent Users | Est. Monthly Queries | GPT-4o-mini API Cost | Local Model Cost | Savings |
    |---|---|---|---|---|
    | 50 | 30,000 | $45 | $25 | 44% |
    | 500 | 300,000 | $450 | $55 | 88% |
    | 5,000 | 3,000,000 | $4,500 | $110 | 98% |
    | 50,000 | 30,000,000 | $45,000 | $250 | 99% |

    At 500 concurrent users, you save 88%. At 5,000, you save 98%. The cost curve flattens while the API curve stays linear.

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    Putting It All Together

    The full implementation timeline:

    | Week | Task | Output |
    |---|---|---|
    | 1 | Source training data from logs and tickets | 200-300 labeled examples |
    | 1-2 | Generate synthetic variations, clean data | 300-500 training examples |
    | 2 | Fine-tune model, validate accuracy | Fine-tuned GGUF model |
    | 2-3 | Deploy Ollama, build API endpoint | Working search API |
    | 3 | Integrate with frontend, add fallback | Production-ready feature |
    | 3-4 | Monitor, collect new examples, retrain | Continuous improvement |

    Total elapsed time: 3-4 weeks for one engineer. No ML team required. No ongoing API costs. Search that is faster, cheaper, and entirely under your control.
