Customer support bot

    Fine-tune a 3B-class model on a product's public docs and past support tickets to answer common product questions on-device, with a refuse-unknown habit you can trust.

    The most common first fine-tune is a support helper. The pitch is simple: most support volume is the same handful of questions answered five hundred times, the answers already exist in your docs and ticket history, and an on-device model that handles 70 to 80% of those questions deflects enough volume to matter without ever sending a customer message to a server.

    This recipe walks through the work end to end with a worked example: imagine Linear wants to ship a small support helper inside its desktop app that answers questions like "how do I bulk-archive issues?" or "what's the difference between projects and initiatives?" without escalating to a human. Linear is the anchor because its use case is easy to imagine, not because it is an Ertas customer; substitute any SaaS with public docs and a support team.

    Recipe also fits

    The Linear-shaped example is one of many. The same shape applies to:

    • Help-center or KB Q&A for any SaaS with public documentation and a support team.
    • Vertical-specific support bots inside legal research, healthcare portals, or financial services apps where privacy is a product feature.
    • Developer-tools FAQ bots for products with public API docs and an active forum.
    • Internal IT help desks that want to deflect tier-one ticket volume without exposing employee questions to a vendor.
    • Product onboarding chat assistants whose answers already live in your getting-started docs.

    If your product has questions that get asked weekly and the answers already exist somewhere in writing, this is the recipe to start with.

    When this is the right fit

    This recipe is the right fit when:

    • You have public documentation that already answers most common questions. The fine-tune teaches the model how your team phrases answers, not new facts. If your docs are sparse, fix them first.
    • You have past support conversations (tickets, Intercom logs, Discord support channels) that show how questions get asked in customer language, not docs language. A bot trained only on docs will answer well-phrased questions well and miss the rest.
    • The answers can be stable for at least a month between retrains. If your product ships pricing or feature changes weekly, this recipe will fight you; consider RAG over an on-device fine-tune for the volatile parts.
    • You want it to run on the user's machine, for any of three reasons: privacy matters (legal, healthcare, finance products), you do not want to pay per-conversation inference at scale, or you want a snappy in-app experience that works offline.

    It is not the right fit when:

    • The answer depends on the user's account state ("what's my plan?", "why was my invoice rejected?"). The bot has no access to the user. Route those questions to your existing API-backed support flow.
    • You need the bot to take actions (open a ticket, refund an order). A model can produce the JSON payload, but the action belongs in your app code. See Structured data extraction for the JSON-output side of this pattern.
    • You are okay with a hosted call to a frontier model. A 70-billion-parameter API model will beat a fine-tuned 3B on the long tail of questions. The fine-tune wins on latency, cost-per-query, privacy, and offline behaviour; pick by which of those you actually need.

    The dataset

    For the Linear-style scenario, 5,000 rows is a reasonable starting point. Scale up if quality lags, or down if authoring that many rows is hard. A typical split across three buckets:

    | Bucket | Share | What it looks like |
    | --- | --- | --- |
    | Doc-derived Q&A pairs | 60% (~3,000 rows) | One question per concept that appears in your docs, written in customer phrasing, with the doc paragraph as the answer rewritten in support-rep voice |
    | Past-ticket Q&A pairs | 30% (~1,500 rows) | Real user questions and the support team's actual replies, lightly edited to strip the customer's name and account details |
    | Refuse-unknown rows | 10% (~500 rows) | Off-topic, account-specific, pricing-specific, or otherwise unanswerable questions paired with a polite deflection |
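
    If you scale the dataset up or down, the per-bucket targets follow from the same 60/30/10 split. A minimal sketch of that arithmetic; `planBuckets` and `BucketPlan` are hypothetical names, not part of any Ertas tooling:

```typescript
// Hypothetical helper: turn a target row count into per-bucket row counts
// using the 60/30/10 split from the table above.
interface BucketPlan {
  docDerived: number;
  pastTicket: number;
  refuseUnknown: number;
}

function planBuckets(totalRows: number): BucketPlan {
  const docDerived = Math.round(totalRows * 0.6);
  const pastTicket = Math.round(totalRows * 0.3);
  // Give the remainder to the refusal bucket so the three always sum to totalRows.
  const refuseUnknown = totalRows - docDerived - pastTicket;
  return { docDerived, pastTicket, refuseUnknown };
}

const starterPlan = planBuckets(5000);
// starterPlan is { docDerived: 3000, pastTicket: 1500, refuseUnknown: 500 }
```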

    Doc-derived pairs

    For each docs page, write three to five questions a user might ask that the page answers. Use customer phrasing, not docs phrasing. "How do I get all my issues out of Linear?" is the question; "Linear supports CSV export from the issue list" is what the docs say.

    {
      "instruction": "Customer question: How do I get all my issues out of Linear?",
      "output": "You can export your issues to CSV from the Issues view: select the issues you want (or use Cmd+A to select all), then choose Export from the toolbar. The CSV includes the issue title, status, assignee, labels, and created date. If you need a richer export with comments and history, use the GraphQL API."
    }

    Note the shape: a clear instruction prefix ("Customer question:"), a single question, a complete answer with the relevant constraints. Three to five variants per concept lets the model see the same answer reached from different phrasings.
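
    One way to keep the variants consistent is to author the answer once and expand it across phrasings. The sketch below assumes that workflow; `rowsForConcept` and `toJsonl` are hypothetical helpers, and the second and third phrasings are invented for illustration:

```typescript
interface TrainingRow {
  instruction: string;
  output: string;
}

// Hypothetical helper: expand several customer phrasings of one concept
// into rows that all share the same reviewed answer.
function rowsForConcept(phrasings: string[], answer: string): TrainingRow[] {
  return phrasings
    .map((question) => question.trim())
    .filter((question) => question.length > 0)
    .map((question) => ({
      instruction: `Customer question: ${question}`,
      output: answer,
    }));
}

// Serialise rows as JSONL: one JSON object per line.
function toJsonl(rows: TrainingRow[]): string {
  return rows.map((row) => JSON.stringify(row)).join("\n");
}

const exportRows = rowsForConcept(
  [
    "How do I get all my issues out of Linear?",
    "is there a way to download my issues as a spreadsheet", // invented phrasing
    "export issues to csv??", // invented phrasing
  ],
  "You can export your issues to CSV from the Issues view: select the issues you want (or use Cmd+A to select all), then choose Export from the toolbar."
);
```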

    A common authoring technique: feed each docs page to a frontier model and ask it to generate ten customer-phrased questions plus answers in your support-rep voice. Hand-review the output. Expect to keep about six in ten on the first pass; the rest are either redundant or off-tone. See Dataset synthesis for the broader pattern.

    Past-ticket pairs

    Past tickets are the highest-signal rows in your dataset because they capture how real customers phrase questions, which is rarely how anyone writes documentation. The work is unglamorous: pull six months of resolved tickets, drop the multi-turn ones (those go in a follow-up dataset), strip PII, and keep the ones where a single customer message and a single agent reply resolved the issue.

    {
      "instruction": "Customer question: cant figure out where to change my workspace url",
      "output": "Workspace URL is in Settings then General, near the bottom of the page. Changing it will break any old links you have shared, including links in Slack and email; existing redirects last 30 days. If you are not on the workspace owner account, you will need to ask the owner to make the change."
    }

    Leave the typos in the question. Customers do not write like docs; the model needs to learn that "cant figure out" still means "I would like to know how to."
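
    The filtering steps above can be sketched as a small function. The ticket shape here is hypothetical (real exports from Intercom or other help desks look different), and the email pattern is deliberately simple: it catches obvious addresses only, so names and account identifiers still need a human pass.

```typescript
// Hypothetical ticket shape; adapt to whatever your help desk exports.
interface TicketMessage {
  author: "customer" | "agent";
  text: string;
}

interface ResolvedTicket {
  messages: TicketMessage[];
}

interface TicketRow {
  instruction: string;
  output: string;
}

// Simple pattern for obvious email addresses; not a complete PII scrubber.
const EMAIL_PATTERN = /[\w.+-]+@[\w-]+\.[\w.-]+/g;

function scrubEmails(text: string): string {
  return text.replace(EMAIL_PATTERN, "[email]");
}

function ticketToRow(ticket: ResolvedTicket): TicketRow | null {
  // Keep only single-turn tickets: one customer message, then one agent reply.
  if (ticket.messages.length !== 2) return null;
  const question = ticket.messages[0];
  const reply = ticket.messages[1];
  if (!question || !reply) return null;
  if (question.author !== "customer" || reply.author !== "agent") return null;
  return {
    // No spell-checking: typos in the customer's question are left as written.
    instruction: `Customer question: ${scrubEmails(question.text.trim())}`,
    output: scrubEmails(reply.text.trim()),
  };
}
```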

    Refuse-unknown rows

    The most important 10% of the dataset. Without these, the model confidently answers questions it should defer, and you ship a confidently-wrong bot. With them, the model learns the boundary of what it should attempt.

    {
      "instruction": "Customer question: How much is the Business plan?",
      "output": "Pricing changes occasionally and I do not have access to current pricing. The up-to-date Business plan price is on linear.app/pricing, and your sales contact can confirm any custom terms."
    }
    {
      "instruction": "Customer question: Why is my invoice failing?",
      "output": "I do not have access to your account or billing details. The fastest path is to email support@ with your invoice number; they can see the failure reason in our billing system."
    }

    Aim for variety: pricing, account state, billing, security incidents, third-party (Slack/Google/etc.) credentials, anything that requires looking up data the model cannot see. About 500 rows is enough to teach the boundary.

    A flat refusal is not the same as an honest I-do-not-know with a next step. "I cannot help with that" trains a refusing bot. "I do not have access to your account, but here is what to try" trains a deflecting bot. Customers tolerate the second; the first reads as unhelpful.
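
    A rough lint can catch bare refusals before they reach the training set. The sketch below is a heuristic under stated assumptions: the hint patterns are hypothetical and tuned to the two example rows above, so extend them to match the next steps your own team actually offers.

```typescript
// Hypothetical hints that an output points the customer somewhere useful.
const NEXT_STEP_HINTS: RegExp[] = [
  /\bemail\b/i,
  /\bcontact\b/i,
  /\bsupport@/i,
  /\/pricing\b/i,
  /here is what to try/i,
];

// True when a refuse-unknown output declines without offering any next step.
function isBareRefusal(output: string): boolean {
  return !NEXT_STEP_HINTS.some((hint) => hint.test(output));
}
```

    Run it over the refuse-unknown bucket only; answer rows legitimately contain none of these hints.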

    Dataset format

    Use the instruction/output schema from JSONL format. It is the cleanest fit for single-turn support questions and the easiest to author at scale.
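
    The examples above are pretty-printed for readability; in the JSONL file each row sits on a single line. A minimal pre-upload check might look like the sketch below. `validateJsonl` is a hypothetical helper, and the "Customer question:" prefix rule mirrors this recipe's rows rather than anything the format itself requires.

```typescript
interface RowProblem {
  line: number;
  reason: string;
}

// Hypothetical check for an instruction/output JSONL file.
function validateJsonl(jsonl: string): RowProblem[] {
  const problems: RowProblem[] = [];
  jsonl.split("\n").forEach((raw, index) => {
    const line = index + 1;
    if (raw.trim() === "") return; // blank lines are skipped, not reported
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      problems.push({ line, reason: "not valid JSON" });
      return;
    }
    if (typeof parsed !== "object" || parsed === null) {
      problems.push({ line, reason: "row is not a JSON object" });
      return;
    }
    const row = parsed as Record<string, unknown>;
    const instruction = row["instruction"];
    const output = row["output"];
    if (typeof instruction !== "string" || !instruction.startsWith("Customer question: ")) {
      problems.push({ line, reason: "instruction missing or lacks the prefix" });
    }
    if (typeof output !== "string" || output.trim() === "") {
      problems.push({ line, reason: "output missing or empty" });
    }
  });
  return problems;
}
```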

    For the follow-up cases (the multi-turn conversations where one answer leads to a clarifying question), see the multi-turn messages schema and consider a second smaller dataset (about 500 rows) trained on top of the single-turn one.

    The base model

    Pick Gemma 4 E2B (effective 3B-class) from the Ertas catalogue. The reasoning:

    • The task is knowledge retrieval and rephrasing, not deep reasoning. A 3B model is plenty.
    • Gemma 4's instruction tuning is strong and its responses are concise out of the box, which matches the support-rep voice you want.
    • At Q4_K_M, the GGUF is about 3.2 GB, which fits comfortably on a desktop install and is the right size for the Mac/Windows target.
    • The Ertas catalogue's Q4_K_M quantisation keeps quality loss under 1% (see Quantization), which on a knowledge-Q&A task is invisible in practice.

    GPU tier: Gemma 4 E2B requires A10G and is gated to paid plans (see Supported models). Free-plan accounts can run this recipe with Llama 3.2 3B Instruct (T4) instead, with slightly less polish on long-tail questions; the dataset and training config transfer cleanly.

    If you need to go smaller for any reason (web target, very low desktop spec), Llama 3.2 1B is the next step down at ~0.8 GB, but expect noticeably more rough edges on the long-tail questions.

    If your support volume runs to thousands of questions per day and the model is the bottleneck, an 8B-class model (Llama 3.1 8B) at Q4_K_M is the next step up at about 4.9 GB. The gain is small for this task; the 3B is usually plenty.

    Training config

    Ertas's defaults are tuned for instruction tuning of 3B-class models on a few thousand rows. For the 5,000-row support dataset, start with:

    | Setting | Value | Why |
    | --- | --- | --- |
    | Schema | instruction/output | Single-turn matches the dataset shape |
    | Epochs | 3 | Enough for the model to lock in answer style; more risks overfitting to specific tickets |
    | Learning rate | 2e-4 | Standard LoRA SFT default |
    | LoRA rank | 16 | Captures the style and the new knowledge without being so high that it starts memorising |
    | LoRA alpha | 32 | 2x rank is the conventional starting point |
    | Batch size | 4 | Fits comfortably with grad accumulation 4 (effective batch 16) |
    | Warmup | 5% of steps | Smooth learning-rate ramp |
    | Max sequence length | 2048 | Long enough for the few rows with multi-paragraph answers |
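
    To see what these settings mean in optimizer steps, the arithmetic can be sketched as below. It assumes all 5,000 rows go to training with no held-out split and that a partial final batch still counts as a step; how Ertas actually counts steps is not stated here, so treat the numbers as a rough guide. `stepCounts` is a hypothetical helper.

```typescript
interface RunShape {
  rows: number;
  batchSize: number;
  gradAccumulation: number;
  epochs: number;
  warmupFraction: number;
}

interface StepCounts {
  effectiveBatch: number;
  stepsPerEpoch: number;
  totalSteps: number;
  warmupSteps: number;
}

function stepCounts(run: RunShape): StepCounts {
  const effectiveBatch = run.batchSize * run.gradAccumulation;
  // Assumption: a partial final batch still counts as one optimizer step.
  const stepsPerEpoch = Math.ceil(run.rows / effectiveBatch);
  const totalSteps = stepsPerEpoch * run.epochs;
  const warmupSteps = Math.round(totalSteps * run.warmupFraction);
  return { effectiveBatch, stepsPerEpoch, totalSteps, warmupSteps };
}

const supportRun = stepCounts({
  rows: 5000,
  batchSize: 4,
  gradAccumulation: 4,
  epochs: 3,
  warmupFraction: 0.05,
});
// supportRun: effectiveBatch 16, stepsPerEpoch 313, totalSteps 939, warmupSteps 47
```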

    Wall-clock time and credit cost depend on the GPU tier and dataset size. Ertas's Training Config picker shows an estimate before you press play; see Credits and usage for the current rates.

    A common first-pass mistake is running 5 to 10 epochs because "more training is better." It is not. By epoch 4 the model starts repeating specific tickets verbatim, which is bad both for privacy and because customers ask the same question slightly differently each time. If you must go past 3 epochs, watch the loss curve and stop when validation loss flattens or starts to rise. See Training tips.
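
    "Flattens" can be made concrete with a minimum-improvement threshold. The sketch below picks the last epoch whose validation loss improved on the best so far by at least `minDelta`; the function name, the default threshold, and the loss values in the usage line are all illustrative, not measurements from a real run.

```typescript
// Returns the 1-based epoch whose checkpoint to keep: the last epoch that
// improved on the best validation loss so far by at least minDelta.
function lastUsefulEpoch(validationLoss: number[], minDelta = 0.01): number {
  if (validationLoss.length === 0) {
    throw new Error("need at least one epoch of validation loss");
  }
  let bestIndex = 0;
  let bestLoss = Infinity;
  validationLoss.forEach((loss, index) => {
    if (bestLoss - loss >= minDelta) {
      bestIndex = index;
      bestLoss = loss;
    }
  });
  return bestIndex + 1;
}

// Illustrative losses: improvement stalls after epoch 3, then reverses.
const keepEpoch = lastUsefulEpoch([1.2, 0.9, 0.82, 0.819, 0.85]);
// keepEpoch is 3
```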

    Integration: desktop via Ollama

    For a desktop target like Linear's Mac and Windows apps, the Ollama bundle Ertas ships is the fastest path to shipping. The bundle includes the GGUF, the Modelfile (with the chat template and stop tokens already configured), and install scripts for both platforms. The user runs the installer once, and your app talks to a local Ollama process on port 11434.

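    // support-bot.ts
    // Sends one question to the local Ollama server and returns the model's reply.
    async function ask(question: string): Promise<string> {
      const response = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: "linear-support-helper",
          // Same prefix the training rows use, so inference matches the fine-tune.
          prompt: `Customer question: ${question}`,
          stream: false,
          options: {
            temperature: 0.4,
            top_p: 0.9,
            num_predict: 200,
          },
        }),
      });
      // Surface HTTP failures (for example, the model has not been installed yet).
      if (!response.ok) {
        throw new Error(`Ollama request failed with status ${response.status}`);
      }
      const data = await response.json();
      return data.response.trim();
    }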

    Key choices:

    • Temperature 0.4: low enough that answers don't drift, high enough that two slightly-different phrasings of the same question don't return identical responses.
    • num_predict: 200: caps the response at about 150 words, which matches the average support-rep reply length. Without a cap, the model occasionally rambles.
    • No streaming for the first version: simpler error handling; the response arrives in 1 to 3 seconds on most hardware, fast enough that streaming is a nice-to-have, not a need-to-have. Flip stream: true once the rest works.
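
    When you do flip stream: true, Ollama sends the reply as newline-delimited JSON objects, each carrying a `response` fragment and a `done` flag. A minimal sketch of the parsing side follows; `consumeChunks` is a hypothetical helper that handles lines split across network reads by returning the unfinished tail.

```typescript
interface GenerateChunk {
  response?: string;
  done?: boolean;
}

// Parses every complete line in the buffer, emits each text fragment, and
// returns the trailing partial line so the caller can prepend it to the next read.
function consumeChunks(
  buffer: string,
  onToken: (token: string) => void
): { rest: string; done: boolean } {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? ""; // last element is an incomplete line (or empty)
  let done = false;
  for (const line of lines) {
    if (line.trim() === "") continue;
    const chunk = JSON.parse(line) as GenerateChunk;
    if (chunk.response) onToken(chunk.response);
    if (chunk.done) done = true;
  }
  return { rest, done };
}
```

    In the app, feed it the decoded bytes from the fetch response body, append each token to the visible reply, and stop reading once `done` is true.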

    For first-run UX (installing Ollama, downloading the model bundle, the model name), see Model delivery and UX. The Ertas bundle's install script is named for the model out of the box; rename the Modelfile reference to linear-support-helper (or whatever fits your product) before distributing.

    Detecting refusal in the app

    The refuse-unknown rows teach the model to say things like "I do not have access to your account." Detect those in the app and route to a human handoff:

    const REFUSAL_MARKERS = [
      /i do not have access/i,
      /you will need to (contact|email|reach)/i,
      /the fastest path is to email/i,
    ];
    
    function isRefusal(response: string): boolean {
      return REFUSAL_MARKERS.some((re) => re.test(response));
    }

    If isRefusal(response) is true, show a "talk to a human" button alongside the response. The model's deflection wording stays useful as an interim answer; the button gives users the next step.
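
    The detection and the UI decision can be folded into one step. The sketch below makes two assumptions beyond the markers above: the model may emit the contraction "don't" (with a straight or curly apostrophe) even though the training rows use "do not", and an empty generation should also trigger the handoff. `toBotReply` and `BotReply` are hypothetical names.

```typescript
interface BotReply {
  text: string;
  showHandoff: boolean;
}

// The recipe's three markers, with "don't" accepted as an assumed variant.
const HANDOFF_MARKERS: RegExp[] = [
  /i (do not|don't) have access/i,
  /you will need to (contact|email|reach)/i,
  /the fastest path is to email/i,
];

function toBotReply(rawResponse: string): BotReply {
  const text = rawResponse.trim();
  // Normalise curly apostrophes so both spellings hit the same pattern.
  const normalised = text.replace(/\u2019/g, "'");
  // An empty generation also shows the button, so the user is never left with nothing.
  const showHandoff =
    text === "" || HANDOFF_MARKERS.some((marker) => marker.test(normalised));
  return { text, showHandoff };
}
```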

    Probe set

    Ten prompts to run by hand after the fine-tune finishes, before you wire up any eval suite. The pass criterion for each is in the right-hand column.

    | # | Probe | Expected behaviour |
    | --- | --- | --- |
    | 1 | "How do I bulk-archive issues?" | Concrete steps citing the toolbar action |
    | 2 | "What's the difference between projects and initiatives?" | Clear two-sentence definition matching the docs |
    | 3 | "cant figure out where to change my workspace url" | Handles the typo, gives the Settings path, mentions the 30-day redirect window |
    | 4 | "How do I export all my issues to CSV?" | Mentions the Issues-view export, the CSV columns, and the API alternative for richer exports |
    | 5 | "Why can't I see the issues my teammate created?" | Walks through visibility and team membership, does not invent a feature |
    | 6 | "How do I delete my account?" | Walks through the offboarding flow without inventing one if the docs don't have it |
    | 7 | "How much does the Business plan cost?" | Refuses, points to pricing page |
    | 8 | "What's my workspace's API key?" | Refuses, explains it lives in Settings and the bot cannot see account state |
    | 9 | "When was Linear founded?" | Off-topic, polite deflection; ideally a one-line "I am here to help with product questions" |
    | 10 | "Write me a JIRA-to-Linear migration script" | Out of scope; the bot suggests the official import flow and links to the docs rather than inventing a script |

    Most of these should pass cleanly on the first run. If three or more fail, the dataset needs more work (usually the refuse-unknown rows are too few or too narrow). See Iterating for the loop.
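
    Once you have run the probes by hand a few times, a rough keyword check can flag obvious regressions between retrains. The sketch below works on responses you have already collected; the patterns cover three of the ten probes and are illustrative guesses at the wording, so it supplements reading the answers rather than replacing it.

```typescript
// Hypothetical probe check: a probe passes when its recorded response
// matches every pattern listed for it.
interface ProbeCheck {
  probe: string;
  mustMatch: RegExp[];
}

// Illustrative patterns; tune them to your own answers.
const PROBE_CHECKS: ProbeCheck[] = [
  { probe: "cant figure out where to change my workspace url", mustMatch: [/settings/i, /30[- ]day/i] },
  { probe: "How do I export all my issues to CSV?", mustMatch: [/csv/i, /api/i] },
  { probe: "How much does the Business plan cost?", mustMatch: [/pricing/i] },
];

function failedProbes(checks: ProbeCheck[], responses: Map<string, string>): string[] {
  return checks
    .filter((check) => {
      const reply = responses.get(check.probe) ?? ""; // a missing response counts as a failure
      return !check.mustMatch.every((pattern) => pattern.test(reply));
    })
    .map((check) => check.probe);
}

// Threshold from the recipe: three or more failures means the dataset needs more work.
function datasetNeedsWork(failed: string[]): boolean {
  return failed.length >= 3;
}
```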

    Limits

    Be honest with yourself and your team about what this bot is not.

    • It does not know your customer. Anything that requires account state (plan, usage, billing, team membership) needs to come from your existing systems, not the model. Wire the model's output through an app-level layer that can short-circuit account-state questions.
    • It cannot take actions. It produces text. The text can include suggested next steps ("open a ticket from the help menu"), but the action is in your UI.
    • It will go stale. When your product ships a new feature, the bot does not learn until you retrain. Plan for a monthly or quarterly retrain cadence, anchored on the new docs and the new tickets.
    • It will sometimes be wrong with confidence. No amount of dataset work eliminates this. The refuse-unknown training reduces it; the human-handoff button is what catches the rest.
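
    The app-level layer from the first bullet can start as a check that runs before the model is called. This is a sketch under loud assumptions: the patterns are hypothetical keyword guesses, and a production router would more likely key off your existing support taxonomy.

```typescript
type QuestionRoute = "account-flow" | "model";

// Hypothetical patterns for questions that depend on account state.
const ACCOUNT_STATE_PATTERNS: RegExp[] = [
  /\bmy (plan|invoice|bill|billing|subscription|usage|seats?)\b/i,
  /\b(refund|charged|payment)\b/i,
];

// Decide where a question goes before spending any inference on it.
function routeQuestion(question: string): QuestionRoute {
  return ACCOUNT_STATE_PATTERNS.some((pattern) => pattern.test(question))
    ? "account-flow"
    : "model";
}
```

    Questions routed to "account-flow" go to your existing API-backed support flow; everything else goes to the model, with refusal detection as the second line of defence.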

    What's next