Document summariser

    Fine-tune a 3B-class model to produce two-to-three-sentence summaries that hold their length and tone across news, blog posts, and technical content, running on-device.

    Summarisation is the use case that sells on-device fine-tuning to users: the page they are reading never has to leave the device, the result arrives in a second, and it works on a plane. The catch is that the model has to be opinionated about length and tone in a way base models are not. Out of the box, a 3B base model will give you a five-sentence summary of a tweet and a five-sentence summary of a twenty-page longread; the user expects the second to be longer than the first.

    This recipe uses a worked example: imagine The Browser Company wants to add "TL;DR for any open tab" to Arc so users can get a two-to-three-sentence summary of the page they are reading without sending the page content to a server. The Browser Company is the anchor because the product shape (browser, privacy-sensitive, runs on a Mac the user already owns) makes the trade-offs concrete; substitute any product where the user is reading something and wants a quick gist.

    Recipe also fits

    The Arc-shaped example is one of many. The same shape applies to:

    • News readers and read-later services (Pocket, Instapaper, Feedly) that summarise saved articles for inbox-style consumption.
    • Research tools that surface summaries of saved papers and articles (Zotero, Mendeley, Readwise).
    • Email digesters and newsletter consolidators that condense long sends into a few sentences.
    • Knowledge-management surfaces that summarise wiki, Notion, or Confluence pages on demand.
    • Podcast or video apps that produce chapter summaries from transcripts.

    If your product surfaces long-form content and your users want a quick gist, this is the recipe to start with.

    When this is the right fit

    This recipe is the right fit when:

    • The input shape is roughly consistent: paragraphs of prose in a single language, between about 500 and 5,000 words. Summarising tables, code, or video transcripts has different failure modes (see Voice transcript cleanup for the transcript case).
    • You want a fixed-shape output: two to three sentences, no headings, no bullets, no preamble. A fixed shape is much easier to fine-tune for than "summarise it however feels right."
    • Privacy is a product feature, not just a nice-to-have. Reading sessions are private to users; sending tab contents to a server creates an audit trail most users do not want.
    • The model can fail gracefully. A summary that misses a nuance is fine; a summary that hallucinates a fact is not. The dataset and probe set both work hard on this.

    It is not the right fit when:

    • The user expects chapter-by-chapter summaries of long documents. That is a long-document pipeline problem, not a single-pass summarisation problem, and benefits from chunking the document and summarising each chunk before stitching.
    • You need structured summaries (key points, sentiment, entities). A summary is prose; if you need fields, see Structured data extraction.
    • The input language varies widely and you do not have multilingual training data. Out-of-the-box 3B models handle English well; other languages are hit-or-miss without targeted data.

    The dataset

    For the Arc-style scenario, 4,000 rows is a reasonable starting point. Scale up if quality lags, or down if authoring that many rows is impractical. A typical spread across genres:

    | Genre | Share | Source |
    | --- | --- | --- |
    | News articles | 30% (~1,200) | CNN/DailyMail or XSum subsets, hand-trimmed for the summaries you actually like |
    | Blog posts and essays | 25% (~1,000) | Long-form articles from public RSS feeds with hand-written summaries |
    | Technical documentation | 20% (~800) | Open-source docs (Mozilla MDN, Stripe docs, etc.) summarised in plain English |
    | Opinion and analysis | 15% (~600) | Editorial pieces and analysis posts with hand-written summaries |
    | Wikipedia entries | 10% (~400) | Article-and-lede pairs (the lede is a built-in summary) |

    License-check each named dataset before commercial use; their terms vary and have changed over time.

    The mix matters because users do not browse one kind of page. Train on news only, and the model summarises a blog post as a news story (third-person, present tense, "X announced today that..."). Train on opinion only, and the model summarises a tutorial as an editorial.
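
    If you change the total row count, the per-genre targets move with it. A minimal sketch of that arithmetic, assuming the shares from the table above; the names `GENRE_MIX` and `planRowCounts` are hypothetical and not part of any Ertas tooling:

    ```typescript
    // Hypothetical helper: turn genre shares into per-genre row targets.
    type GenreShare = { genre: string; share: number };

    // Shares copied from the genre table; adjust to your own mix.
    const GENRE_MIX: GenreShare[] = [
      { genre: "news", share: 0.3 },
      { genre: "blog", share: 0.25 },
      { genre: "technical", share: 0.2 },
      { genre: "opinion", share: 0.15 },
      { genre: "wikipedia", share: 0.1 },
    ];

    function planRowCounts(totalRows: number, mix: GenreShare[]): Map<string, number> {
      const totalShare = mix.reduce((sum, g) => sum + g.share, 0);
      // Guard against a mix that silently drops or double-counts rows.
      if (Math.abs(totalShare - 1) > 1e-9) {
        throw new Error(`Genre shares must sum to 1, got ${totalShare}`);
      }
      const counts = new Map<string, number>();
      for (const { genre, share } of mix) {
        counts.set(genre, Math.round(totalRows * share));
      }
      return counts;
    }
    ```

    Calling `planRowCounts(4000, GENRE_MIX)` reproduces the row counts in the table; with other totals, rounding can leave the sum a row or two off, which is harmless at this scale.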

    Dataset format

    Use the input/output schema from JSONL format. The instruction is the same for every row; the input is the document and the output is the summary.

    {
      "instruction": "Summarise the document in two to three sentences. Do not invent facts or add preamble like 'this article is about'.",
      "input": "The two-finger swipe to navigate has been part of macOS since 2011, but the gesture was added to Safari's iPad version only in 2023 [...]",
      "output": "Safari for iPad gained the two-finger swipe-back gesture in 2023, more than a decade after the macOS version. The delay reflects Apple's caution about gesture conflicts on touchscreens, where the same motion can scroll, navigate, or trigger Multitasking. Reviewers have generally welcomed the addition but flag the higher accidental-trigger rate compared to the trackpad equivalent."
    }

    Note the instruction is identical across rows. This is deliberate. The fine-tune is teaching the model what "summarise" means in your product's voice; if you vary the phrasing of the instruction, the model treats the instruction as data and you weaken the connection between "this directive" and "this output shape." See Instruction tuning for the broader rule.
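
    One way to make sure the instruction never varies is to define it once and generate every row from it. A minimal sketch, assuming your documents and hand-written summaries are already paired in memory; `toJsonl` and the `article`/`summary` field names are hypothetical:

    ```typescript
    // The single instruction string, reused verbatim for every row.
    const INSTRUCTION =
      "Summarise the document in two to three sentences. " +
      "Do not invent facts or add preamble like 'this article is about'.";

    type TrainingRow = { instruction: string; input: string; output: string };

    // Hypothetical helper: serialise (article, summary) pairs as JSONL,
    // one JSON object per line.
    function toJsonl(pairs: Array<{ article: string; summary: string }>): string {
      return pairs
        .map(
          ({ article, summary }): TrainingRow => ({
            instruction: INSTRUCTION,
            input: article.trim(),
            output: summary.trim(),
          })
        )
        .map((row) => JSON.stringify(row))
        .join("\n");
    }
    ```

    Reusing the same constant in the app's prompt keeps training and inference aligned.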

    Hand-trimming summaries

    The public summarisation datasets (XSum, CNN/DailyMail) ship with summaries that are not in your voice. Spend the time to rewrite the summaries you keep so they match the length, tone, and style you want users to see. Two hours of dataset editing saves a week of fighting with the fine-tune.

    Three habits that pay back:

    • Strip preamble. "This article describes" or "In a recent post the author argues" trains the model to add the same preamble in production. Cut it.
    • Lead with the news, not the source. "Apple released iOS 18.3" beats "Apple has released a software update called iOS 18.3."
    • Match length to genre, not to the dataset's defaults. XSum summaries are one sentence; you probably want two to three. Edit them up.
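
    A quick lint pass can flag rows that break the first and third habits before a human reviews them. This is a rough sketch, not a definitive checker: the preamble patterns are illustrative guesses you would extend from your own data, and the sentence counter is a naive heuristic that miscounts abbreviations such as "U.S." followed by a space.

    ```typescript
    // Illustrative preamble openers; extend from what you see in your dataset.
    const PREAMBLE_PATTERNS: RegExp[] = [
      /^this (article|post|page|document)\b/i,
      /^in (a|this) recent (article|post)\b/i,
      /^the (author|article|post)\b/i,
    ];

    // Naive sentence count: split on ., ! or ? followed by whitespace or end of text.
    // A period inside "18.3" is not followed by whitespace, so it does not split.
    function countSentences(text: string): number {
      return text
        .split(/[.!?]+(?:\s+|$)/)
        .filter((part) => part.trim().length > 0).length;
    }

    // Returns a list of problems; an empty list means the summary passed.
    function lintSummary(summary: string): string[] {
      const problems: string[] = [];
      const trimmed = summary.trim();
      if (PREAMBLE_PATTERNS.some((pattern) => pattern.test(trimmed))) {
        problems.push("preamble");
      }
      const sentences = countSentences(trimmed);
      if (sentences < 2 || sentences > 3) {
        problems.push(`length: ${sentences} sentence(s)`);
      }
      return problems;
    }
    ```

    Treat the output as a review queue, not a verdict: a flagged row needs a human edit, and an unflagged row can still be off-voice.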

    A note on hallucination

    Invented dates, numbers, and named entities are the most common way summarisers fail. Reduce the rate three ways:

    1. Curate hard. Drop rows where the original summary's numbers do not match the original article. Public datasets are noisy here.
    2. Add a small set of "do not invent" rows. Take an article, remove a key number from the article text, and write the target summary so it says "an unspecified amount" or similar instead of supplying a figure. Then check that the fine-tuned model does the same on held-out examples.
    3. At inference time, lower the temperature. 0.2 to 0.3 is a sensible range for summaries; higher temperatures produce more invented details.

    The fine-tune does not eliminate hallucination. It reduces the rate. Plan for a verification step at the application layer for any high-stakes domain (legal, medical, financial summaries). For consumer-grade "TL;DR my tab," the residual rate is acceptable; for an enterprise legal summariser, it is not.
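
    The "curate hard" step can be partly automated. The sketch below flags summaries that contain a number the article never mentions. It is an assumption-laden heuristic: it only sees digits, so spelled-out numbers ("three") and reformatted figures ("1.2 million" versus "1,200,000") slip through, and the function names are hypothetical.

    ```typescript
    // Collect digit-based numbers, normalising "5,000" and "5000" to the same token.
    function extractNumbers(text: string): Set<string> {
      const matches = text.match(/\d+(?:[.,]\d+)*/g) ?? [];
      return new Set(matches.map((m) => m.replace(/,/g, "")));
    }

    // Numbers that appear in the summary but nowhere in the article.
    function unsupportedNumbers(article: string, summary: string): string[] {
      const known = extractNumbers(article);
      return Array.from(extractNumbers(summary)).filter((n) => !known.has(n));
    }
    ```

    A non-empty result is a reason to drop or re-edit the row. A row that passes is not proven faithful; it has only cleared the cheapest check.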

    The base model

    Pick Gemma 4 E2B from the Ertas catalogue. The reasoning:

    • Gemma's base instruction tuning is already conservative about hallucination compared to other 3B-class models, which gives the fine-tune a better starting point.
    • At Q4_K_M, the GGUF is about 3.2 GB, which fits inside a browser's memory budget on a typical Mac (8 GB and up).
    • Gemma 4 handles long inputs well (the context window is 32k, more than enough for a 5,000-word article).

    GPU tier: Gemma 4 E2B requires an A10G GPU and is gated to paid plans (see Supported models). Free-plan accounts can run this recipe with Llama 3.2 3B Instruct (on a T4) instead; expect slightly more length drift on edge cases, but the dataset and training config transfer cleanly.

    If you need a smaller footprint for older Macs or to run inside a WebAssembly browser context with a tight memory ceiling, Llama 3.2 1B at Q4_K_M (~0.8 GB) is the next step down. Expect more rough edges (preamble creeps back in, lengths drift).

    If your product needs to handle dense academic or technical content with very long inputs (10,000+ words), step up to an 8B-class model such as Llama 3.1 8B. The summary quality on dense input improves noticeably; the Q4_K_M GGUF is about 4.9 GB.

    Training config

    For 4,000 rows of summarisation data, start with:

    | Setting | Value | Why |
    | --- | --- | --- |
    | Schema | input/output | Cleanest single-task fit |
    | Epochs | 2 | Summarisation overfits faster than knowledge tuning; two epochs is usually plenty |
    | Learning rate | 1e-4 | Slightly lower than the default; summarisation is a stylistic task and benefits from gentler updates |
    | LoRA rank | 8 | Lower than the support-bot recipe; we are tuning style, not adding knowledge |
    | LoRA alpha | 16 | 2x rank |
    | Batch size | 2 | Articles are long; smaller batches keep memory comfortable |
    | Grad accumulation | 8 | Effective batch 16 |
    | Warmup | 5% of steps | Standard |
    | Max sequence length | 4096 | Articles up to about 3,000 words fit; longer articles get truncated |
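
    The batch, accumulation, and warmup rows above imply a concrete step schedule. A minimal sketch of the arithmetic, assuming one optimiser step per effective batch and no dropped rows; how Ertas actually counts steps may differ, and `planSchedule` is a hypothetical name:

    ```typescript
    type TrainingConfig = {
      rows: number;
      epochs: number;
      batchSize: number;
      gradAccumulation: number;
      warmupFraction: number; // 0.05 for "5% of steps"
    };

    type Schedule = {
      effectiveBatch: number;
      stepsPerEpoch: number;
      totalSteps: number;
      warmupSteps: number;
    };

    // Assumes one optimiser step per effective batch (batch size x accumulation).
    function planSchedule(cfg: TrainingConfig): Schedule {
      const effectiveBatch = cfg.batchSize * cfg.gradAccumulation;
      const stepsPerEpoch = Math.ceil(cfg.rows / effectiveBatch);
      const totalSteps = stepsPerEpoch * cfg.epochs;
      const warmupSteps = Math.round(totalSteps * cfg.warmupFraction);
      return { effectiveBatch, stepsPerEpoch, totalSteps, warmupSteps };
    }
    ```

    For 4,000 rows with the values in the table, this gives an effective batch of 16, 250 steps per epoch, 500 steps in total, and 25 warmup steps.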

    Wall-clock time and credit cost depend on the GPU tier and dataset size. Ertas's Training Config picker shows an estimate before you press play; see Credits and usage for the current rates.

    The two-epoch choice is the most important hyperparameter difference from the support-bot recipe. Past two epochs, summarisation quality tends to get worse, not better, because the model starts memorising specific summary openings. If your loss curve flattens by epoch 2, stop there. If it has not, audit the dataset for noise before adding epochs.

    Integration: macOS via Ollama (Arc-style)

    For Arc, the simplest integration is the Ollama bundle Ertas ships. The bundle includes the Modelfile (with chat template and stop tokens already wired) and the install script. Once installed, the model is served at http://localhost:11434/api/generate.

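    // summariser.ts
    // Must match the training instruction word for word.
    const INSTRUCTION =
      "Summarise the document in two to three sentences. " +
      "Do not invent facts or add preamble like 'this article is about'.";

    async function summarise(pageText: string): Promise<string> {
      const response = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: "arc-summariser",
          prompt: `${INSTRUCTION}\n\n${pageText}`,
          stream: false, // wait for the whole summary
          options: {
            temperature: 0.2, // low, for near-reproducible output
            top_p: 0.85,
            num_predict: 150, // hard cap on generated tokens
          },
        }),
      });
      // Surface HTTP errors (model not installed, server busy) instead of
      // failing later on a missing field.
      if (!response.ok) {
        throw new Error(`Ollama request failed: ${response.status} ${response.statusText}`);
      }
      const data: { response?: string } = await response.json();
      return (data.response ?? "").trim();
    }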

    Key choices:

    • Temperature 0.2: low. Summaries should be reproducible; identical input should give nearly identical output.
    • num_predict: 150: caps the response at roughly three sentences. Without a cap, the model occasionally rambles.
    • No streaming: the user is waiting for the whole summary; streaming token-by-token reveals the partial summary which can be misleading if the model self-corrects.
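
    One side effect of a hard `num_predict` cap is that generation can stop mid-sentence. A small post-processing step can drop the dangling fragment before the summary is shown. This is a sketch under the assumption that sentences end in ".", "!" or "?" followed by whitespace or the end of the text; `trimToLastSentence` is a hypothetical helper, not part of Ollama.

    ```typescript
    // Cut the text back to its last complete sentence.
    // If no sentence terminator is found, the text is returned unchanged
    // so the caller can decide whether to show or discard it.
    function trimToLastSentence(text: string): string {
      const trimmed = text.trim();
      // A terminator only counts when followed by whitespace or end of text,
      // so the period inside "18.3" is ignored.
      const terminator = /[.!?](?=\s|$)/g;
      let cutAt = -1;
      let match: RegExpExecArray | null;
      while ((match = terminator.exec(trimmed)) !== null) {
        cutAt = match.index + 1;
      }
      return cutAt === -1 ? trimmed : trimmed.slice(0, cutAt);
    }
    ```

    Applied to the model output, this turns "First point. Second point. Third poi" into "First point. Second point." and leaves a complete summary untouched.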

    Preprocessing the page

    Arc has the page DOM. You do not want to send the model the raw DOM; you want the page's readable content. Use Mozilla Readability or an equivalent client-side library to extract the article body as plain text, collapse the whitespace, then send.

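    import { Readability } from "@mozilla/readability";

    // Character cap on model input; protects against the rare gigantic page.
    const MAX_INPUT_CHARS = 12000;

    function extractReadable(doc: Document): string {
      // Readability mutates the document it parses, so work on a clone.
      const article = new Readability(doc.cloneNode(true) as Document).parse();
      if (!article || !article.textContent) return "";
      return article.textContent
        .replace(/\s+/g, " ") // collapse runs of whitespace and newlines
        .trim()
        .slice(0, MAX_INPUT_CHARS);
    }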

    The 12,000-character cap protects against the rare gigantic page. The 4096-token training context corresponds to about 16,000 characters of English; the 12,000-character cap leaves headroom for the prompt template.
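
    The headroom can be made explicit with the same rule of thumb. A minimal sketch, assuming roughly four characters per token for English prose (the ratio implied by 4096 tokens being about 16,000 characters); real tokenisers vary by model and content, and both function names are hypothetical:

    ```typescript
    // Rough English average; an assumption, not a tokeniser measurement.
    const CHARS_PER_TOKEN = 4;

    function estimateTokens(chars: number): number {
      return Math.ceil(chars / CHARS_PER_TOKEN);
    }

    // Tokens left over for the instruction, chat template, and generated summary.
    function tokenHeadroom(contextTokens: number, inputCharCap: number): number {
      return contextTokens - estimateTokens(inputCharCap);
    }
    ```

    With a 4096-token context and a 12,000-character cap, the estimate is 3,000 input tokens and 1,096 tokens of headroom, which covers the instruction, the template, and the 150-token summary with room to spare.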

    Integration: browser via wllama

    If Arc wanted to ship without depending on Ollama at all (single-file experience, no external installer), the browser path via wllama is the alternative. See Ship: web for the full pattern. Trade-offs:

    • Pro: zero external install. The model loads from a URL into IndexedDB on first run.
    • Con: first run is a 3-GB download. Wllama is fast in-browser but Ollama is faster on the same hardware.
    • Con: browser memory is tighter than native; a 3B at Q4_K_M is the realistic ceiling, and you may want to drop to 1B for older Macs.

    Most Mac-first apps pick Ollama. Most web-first products pick wllama.

    Probe set

    Eight prompts that exercise the genre spread.

    | # | Genre | Probe (excerpt) | Pass criteria |
    | --- | --- | --- | --- |
    | 1 | News | First three paragraphs of an AP wire story | Two to three sentences, news voice, facts traceable to the input |
    | 2 | Blog post | A 1,200-word personal essay on remote work | Two to three sentences, captures the argument not just the topic |
    | 3 | Technical docs | A Stripe docs page on webhook retries | Two to three sentences in plain English; no API jargon copied verbatim |
    | 4 | Opinion | A 900-word op-ed on city zoning | Captures the stance, not "the author discusses zoning" |
    | 5 | Wikipedia | A 500-word entry on a historical event | Two to three sentences, matches the lede in spirit |
    | 6 | Edge: very short | A 200-word news brief | Still two to three sentences; does not pad |
    | 7 | Edge: very long | A 5,000-word essay (will be truncated) | Captures the first half, does not pretend to cover the rest |
    | 8 | Hallucination check | A news article whose summary in the dataset had a number changed | Reports the number as it appears in the input, not the altered figure from the training row |

    Most of these should pass cleanly on the first run. The hallucination-check probe is the one that fails most often; if it fails, retrain with a curated dataset where the summary's numbers match the input's numbers in every row.
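
    Running the probes by hand gets tedious after the second retrain. The sketch below is one possible harness, assuming you supply the probe inputs and write a pass check per probe yourself; the `Probe` shape and `runProbes` are hypothetical, and the `summarise` argument can be any function with the same signature as the Ollama call shown earlier in this recipe.

    ```typescript
    type Probe = {
      id: number;
      genre: string;
      input: string;
      // Your own pass criterion for this probe, e.g. a sentence-count check.
      check: (summary: string) => boolean;
    };

    type ProbeResult = { id: number; genre: string; passed: boolean; summary: string };

    // Runs probes one at a time so a local model is never asked to
    // handle several long inputs at once.
    async function runProbes(
      probes: Probe[],
      summarise: (text: string) => Promise<string>
    ): Promise<ProbeResult[]> {
      const results: ProbeResult[] = [];
      for (const probe of probes) {
        const summary = await summarise(probe.input);
        results.push({
          id: probe.id,
          genre: probe.genre,
          passed: probe.check(summary),
          summary,
        });
      }
      return results;
    }
    ```

    Keeping the summary text in each result makes failures easy to read side by side with the input, which matters most for the hallucination-check probe.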

    Limits

    • Length drift on edge cases. Very short input sometimes produces an over-padded summary; very long input sometimes produces a summary that only covers the first half. The probe set catches both; the dataset edits fix them.
    • Genres outside the training distribution. Code, recipes, legal documents, and academic papers will get summarised, but the quality will be lower than the trained genres. Either extend the dataset to cover them, or detect them in the app and route differently.
    • Numbers and dates. Hallucinated dates and currency figures are the most common failure mode in production. Adding a regex check that compares the summary's numbers against the input's numbers catches most of these at runtime.
    • Languages other than English. Out-of-the-box behaviour is poor for non-English input. Add per-language training data if you need multilingual support.
    • No structured output. This recipe produces prose. If you also want fields (key points, sentiment, named entities), do that in a second pass with a structured-extraction model. See Structured data extraction.

    What's next