Document summariser
Fine-tune a 3B-class model to produce two-to-three-sentence summaries that hold their length and tone across news, blog posts, and technical content, running on-device.
Summarisation is the use case that sells on-device fine-tuning to users: the page they are reading never has to leave the device, the result arrives in a second, and it works on a plane. The catch is that the model has to be opinionated about length and tone in a way base models are not. Out of the box, a 3B base model will give you a five-sentence summary of a tweet and a five-sentence summary of a twenty-page longread; the user expects the second to be longer than the first.
This recipe uses a worked example: imagine The Browser Company wants to add "TL;DR for any open tab" to Arc so users can get a two-to-three-sentence summary of the page they are reading without sending the page content to a server. The Browser Company is the anchor because the product shape (browser, privacy-sensitive, runs on a Mac the user already owns) makes the trade-offs concrete; substitute any product where the user is reading something and wants a quick gist.
Recipe also fits
The Arc-shaped example is one of many. The same shape applies to:
- News readers and read-later services (Pocket, Instapaper, Feedly) that summarise saved articles for inbox-style consumption.
- Research tools that surface summaries of saved papers and articles (Zotero, Mendeley, Readwise).
- Email digesters and newsletter consolidators that condense long sends into a few sentences.
- Knowledge-management surfaces that summarise wiki, Notion, or Confluence pages on demand.
- Podcast or video apps that produce chapter summaries from transcripts.
If your product surfaces long-form content and your users want a quick gist, this is the recipe to start with.
When this is the right fit
This recipe is the right fit when:
- The input shape is roughly consistent: paragraphs of prose in a single language, between roughly 500 and 5,000 words. Summarising tables, code, or video transcripts has different failure modes (see Voice transcript cleanup for the transcript case).
- You want a fixed-shape output: two to three sentences, no headings, no bullets, no preamble. A fixed shape is much easier to fine-tune for than "summarise it however feels right."
- Privacy is a product feature, not just a nice-to-have. Reading sessions are private to users; sending tab contents to a server creates an audit trail most users do not want.
- The model can fail gracefully. A summary that misses a nuance is fine; a summary that hallucinates a fact is not. The dataset and probe set both work hard on this.
It is not the right fit when:
- The user expects chapter-by-chapter summaries of long documents. That is a retrieval problem, not a summarisation problem, and benefits from chunking the document and summarising each chunk before stitching.
- You need structured summaries (key points, sentiment, entities). A summary is prose; if you need fields, see Structured data extraction.
- The input language varies widely and you do not have multilingual training data. Out-of-the-box 3B models handle English well; other languages are hit-or-miss without targeted data.
The dataset
For the Arc-style scenario, 4,000 rows is a reasonable starting point. Scale up if quality lags, or down if authoring that many rows is impractical. A typical spread across genres:
| Genre | Share | Source |
|---|---|---|
| News articles | 30% (~1,200) | CNN/DailyMail or XSum subsets, hand-trimmed for the summaries you actually like |
| Blog posts and essays | 25% (~1,000) | Long-form articles from public RSS feeds with hand-written summaries |
| Technical documentation | 20% (~800) | Open-source docs (Mozilla MDN, Stripe docs, etc.) summarised in plain English |
| Opinion and analysis | 15% (~600) | Editorial pieces and analysis posts with hand-written summaries |
| Wikipedia entries | 10% (~400) | Article-and-lede pairs (the lede is a built-in summary) |
License-check each named dataset before commercial use; their terms vary and have changed over time.
The mix matters because users do not browse one kind of page. Train on news only, and the model summarises a blog post as a news story (third-person, present tense, "X announced today that..."). Train on opinion only, and the model summarises a tutorial as an editorial.
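The share column turns into per-genre row targets with a little arithmetic. The sketch below assumes the 4,000-row total and the percentages from the table; `genreTargets` and the genre keys are illustrative names, not part of any Ertas tooling.

```typescript
// Hypothetical helper: convert percentage shares into per-genre row targets.
type GenreShares = Record<string, number>; // genre -> percentage (0 to 100)

function genreTargets(totalRows: number, shares: GenreShares): Record<string, number> {
  const genres = Object.keys(shares);
  const sum = genres.reduce((acc, genre) => acc + shares[genre], 0);
  if (sum !== 100) {
    throw new Error(`Shares must sum to 100, got ${sum}`);
  }
  const result: Record<string, number> = {};
  for (const genre of genres) {
    result[genre] = Math.round((totalRows * shares[genre]) / 100);
  }
  return result;
}

// Shares copied from the table above.
const tableShares: GenreShares = {
  news: 30,
  blog: 25,
  technical: 20,
  opinion: 15,
  wikipedia: 10,
};

const rowTargets = genreTargets(4000, tableShares);
console.log(rowTargets);
```

Running the same function against the counts you actually collected is a quick way to spot a genre that has drifted well away from its target share.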
Dataset format
Use the input/output schema from JSONL format. The instruction is the same for every row; the input is the document and the output is the summary.
{
  "instruction": "Summarise the document in two to three sentences. Do not invent facts or add preamble like 'this article is about'.",
  "input": "The two-finger swipe to navigate has been part of macOS since 2011, but the gesture was added to Safari's iPad version only in 2023 [...]",
  "output": "Safari for iPad gained the two-finger swipe-back gesture in 2023, more than a decade after the macOS version. The delay reflects Apple's caution about gesture conflicts on touchscreens, where the same motion can scroll, navigate, or trigger Multitasking. Reviewers have generally welcomed the addition but flag the higher accidental-trigger rate compared to the trackpad equivalent."
}
Note the instruction is identical across rows. This is deliberate. The fine-tune is teaching the model what "summarise" means in your product's voice; if you vary the phrasing of the instruction, the model treats the instruction as data and you weaken the connection between "this directive" and "this output shape." See Instruction tuning for the broader rule.
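A small pre-upload check can catch rows that break the "one instruction" rule. The sketch below assumes one JSON object per line with the three fields shown in the example row; `checkRows` is a hypothetical helper, not an Ertas API, and it treats the first row's instruction as the reference.

```typescript
// Row shape taken from the example row in this section.
interface SummaryRow {
  instruction: string;
  input: string;
  output: string;
}

// Hypothetical helper: returns one message per problem found in the JSONL text.
function checkRows(jsonl: string): string[] {
  const problems: string[] = [];
  let canonical: string | null = null;

  jsonl.split("\n").forEach((rawLine, index) => {
    const line = rawLine.trim();
    if (line === "") return; // tolerate blank lines
    const lineNo = index + 1;

    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      problems.push(`line ${lineNo}: not valid JSON`);
      return;
    }
    if (typeof parsed !== "object" || parsed === null) {
      problems.push(`line ${lineNo}: expected a JSON object`);
      return;
    }

    // Every field must be a non-empty string.
    const row = parsed as Partial<SummaryRow>;
    const fields: Array<keyof SummaryRow> = ["instruction", "input", "output"];
    for (const field of fields) {
      const value = row[field];
      if (typeof value !== "string" || value.trim() === "") {
        problems.push(`line ${lineNo}: missing or empty "${field}"`);
      }
    }

    // The instruction must match the first row's instruction exactly.
    if (typeof row.instruction === "string") {
      if (canonical === null) {
        canonical = row.instruction;
      } else if (row.instruction !== canonical) {
        problems.push(`line ${lineNo}: instruction differs from the first row`);
      }
    }
  });

  return problems;
}
```

An empty result means every row parsed, has all three fields, and shares one instruction string. The check is exact-match on purpose, so a stray trailing space in one row's instruction is reported too.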
Hand-trimming summaries
The public summarisation datasets (XSum, CNN/DailyMail) ship with summaries that are not in your voice. Spend the time to rewrite the summaries you keep so they match the length, tone, and style you want users to see. Two hours of dataset editing saves a week of fighting with the fine-tune.
Three habits that pay back:
- Strip preamble. "This article describes" or "In a recent post the author argues" trains the model to add the same preamble in production. Cut it.
- Lead with the news, not the source. "Apple released iOS 18.3" beats "Apple has released a software update called iOS 18.3."
- Match length to genre, not to the dataset's defaults. XSum summaries are one sentence; you probably want two to three. Edit them up.
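These habits are easier to keep when a script flags the rows that need a human pass. The sketch below is a rough heuristic under stated assumptions: the preamble patterns are examples drawn from the phrases quoted in this recipe, the sentence counter only looks at end punctuation (so abbreviations such as "U.S." will be over-counted), and `reviewSummary` is an illustrative name.

```typescript
// Example preamble openers; extend the list with whatever your own data shows.
const PREAMBLE_PATTERNS: RegExp[] = [
  /^this (article|post|essay|piece|page)\b/i,
  /^in (this|a recent) (article|post|essay|piece)\b/i,
  /^the (author|article)\b/i,
];

// Counts sentence-ending punctuation followed by whitespace or end of text.
// A decimal such as "18.3" is not counted because a digit follows the dot.
function countSentences(text: string): number {
  const trimmed = text.trim();
  if (trimmed === "") return 0;
  const terminators = trimmed.match(/[.!?]+["')\]]*(?=\s|$)/g);
  const complete = terminators ? terminators.length : 0;
  // A trailing fragment without end punctuation still counts as a sentence.
  const endsCleanly = /[.!?]["')\]]*$/.test(trimmed);
  return endsCleanly ? complete : complete + 1;
}

// Hypothetical helper: returns flags for a human editor; an empty array means no flags.
function reviewSummary(summary: string): string[] {
  const flags: string[] = [];
  const trimmed = summary.trim();
  if (PREAMBLE_PATTERNS.some((pattern) => pattern.test(trimmed))) {
    flags.push("starts with preamble");
  }
  const n = countSentences(trimmed);
  if (n < 2 || n > 3) {
    flags.push(`${n} sentence(s), expected 2 to 3`);
  }
  return flags;
}
```

The flags are prompts for editing, not automatic rejections: a one-sentence XSum summary, for example, is flagged so that someone can edit it up to two or three sentences.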
A note on hallucination
Hallucinated dates, numbers, and named entities are the most common way summarisers fail. Reduce the rate three ways:
- Curate hard. Drop rows where the original summary's numbers do not match the original article. Public datasets are noisy here.
- Add a small set of "do not invent" rows. Take an article, remove a key number from it, and write the target summary so that it says "an unspecified amount" or similar instead of supplying the figure. This teaches the model to stay faithful to the input rather than fill the gap.
- At inference time, lower temperature. 0.2 to 0.3 is the right range for summaries; higher temperatures hallucinate more.
The fine-tune does not eliminate hallucination. It reduces the rate. Plan for a verification step at the application layer for any high-stakes domain (legal, medical, financial summaries). For consumer-grade "TL;DR my tab," the residual rate is acceptable; for an enterprise legal summariser, it is not.
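The "numbers must match" rule can be approximated with a regex comparison. The sketch below assumes English-style formatting (commas as thousands separators) and only sees digits, so spelled-out numbers such as "a decade" or "two" pass through unchecked; `extractNumbers` and `unsupportedNumbers` are illustrative names.

```typescript
// Pull numeric tokens such as "2023", "18.3", or "4,000" out of a text.
// Thousands separators are dropped so "4,000" and "4000" compare equal.
function extractNumbers(text: string): string[] {
  const matches = text.match(/\d+(?:[.,]\d+)*/g);
  if (!matches) return [];
  return matches.map((m) => m.replace(/,/g, ""));
}

// Returns the numbers that appear in the summary but nowhere in the source.
function unsupportedNumbers(source: string, summary: string): string[] {
  const seen: Record<string, boolean> = {};
  for (const n of extractNumbers(source)) {
    seen[n] = true;
  }
  return extractNumbers(summary).filter((n) => seen[n] !== true);
}

// Usage: a non-empty result marks a dataset row to drop or re-edit.
const flagged = unsupportedNumbers(
  "Revenue rose to 4,000 units in 2023.",
  "Revenue hit 4000 units in 2024."
);
console.log(flagged); // prints [ '2024' ]
```

The same function works in both places this recipe needs it: during curation (article against reference summary) and at runtime (page text against model output). What to do with a runtime flag, such as hiding the summary or retrying, is a product decision this sketch leaves open.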
The base model
Pick Gemma 4 E2B from the Ertas catalogue. The reasoning:
- Gemma's base instruction tuning is already conservative about hallucination compared to other 3B-class models, which gives the fine-tune a better starting point.
- At Q4_K_M, the GGUF is about 3.2 GB, which fits inside a browser's memory budget on a typical Mac (8 GB and up).
- Gemma 4 handles long inputs well (the context window is 32k, more than enough for a 5,000-word article).
GPU tier: Gemma 4 E2B requires A10G and is gated to paid plans (see Supported models). Free-plan accounts can run this recipe with Llama 3.2 3B Instruct (T4) instead; expect slightly more length drift on edge cases, but the dataset and training config transfer cleanly.
If you need a smaller footprint for older Macs or to run inside a WebAssembly browser context with a tight memory ceiling, Llama 3.2 1B at Q4_K_M (~0.8 GB) is the next step down. Expect more occasional rough edges (preamble creeps back in, lengths drift).
If your product needs to handle dense academic or technical content with very long inputs (10,000+ words), step up to an 8B-class model like Llama 3.1 8B, or to a larger model like Mistral Small. The summary quality on dense input improves noticeably; the GGUF for an 8B model at Q4_K_M is roughly 4.5 to 5 GB.
Training config
For 4,000 rows of summarisation data, start with:
| Setting | Value | Why |
|---|---|---|
| Schema | input/output | Cleanest single-task fit |
| Epochs | 2 | Summarisation overfits faster than knowledge tuning; two epochs is usually plenty |
| Learning rate | 1e-4 | Slightly lower than the default; summarisation is a stylistic task and benefits from gentler updates |
| LoRA rank | 8 | Lower than the support-bot recipe; we are tuning style, not adding knowledge |
| LoRA alpha | 16 | 2x rank |
| Batch size | 2 | Articles are long; smaller batches keep memory comfortable |
| Grad accumulation | 8 | Effective batch 16 |
| Warmup | 5% of steps | Standard |
| Max sequence length | 4096 | Articles up to about 3,000 words fit; longer articles get truncated |
Wall-clock time and credit cost depend on the GPU tier and dataset size. Ertas's Training Config picker shows an estimate before you press play; see Credits and usage for the current rates.
The two-epoch choice is the most important hyperparameter difference from the support-bot recipe. Summarisation gets worse with more training, not better, because the model starts memorising specific summary openings. If your loss curve flattens by epoch 2, stop there. If it has not, audit the dataset for noise before adding epochs.
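The table's numbers imply a concrete step budget, which is useful when reading a loss curve. The sketch below assumes one optimizer step per effective batch and a final partial batch that is kept; the trainer behind Ertas may count differently, so treat the output as an estimate. `stepBudget` is an illustrative name.

```typescript
interface TrainingPlan {
  rows: number;
  epochs: number;
  batchSize: number;
  gradAccumulation: number;
  warmupFraction: number; // 0.05 for "5% of steps"
}

interface StepBudget {
  effectiveBatch: number;
  stepsPerEpoch: number;
  totalSteps: number;
  warmupSteps: number;
}

// Assumption: one optimizer step per effective batch; a partial last batch counts.
function stepBudget(plan: TrainingPlan): StepBudget {
  const effectiveBatch = plan.batchSize * plan.gradAccumulation;
  const stepsPerEpoch = Math.ceil(plan.rows / effectiveBatch);
  const totalSteps = stepsPerEpoch * plan.epochs;
  const warmupSteps = Math.round(totalSteps * plan.warmupFraction);
  return { effectiveBatch, stepsPerEpoch, totalSteps, warmupSteps };
}

// Values from the training config table.
const recipeBudget = stepBudget({
  rows: 4000,
  epochs: 2,
  batchSize: 2,
  gradAccumulation: 8,
  warmupFraction: 0.05,
});
console.log(recipeBudget);
// effectiveBatch 16, stepsPerEpoch 250, totalSteps 500, warmupSteps 25
```

Under these assumptions the epoch boundary falls at step 250, so "flattens by epoch 2" means the curve should be levelling off somewhere in the second 250 steps.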
Integration: macOS via Ollama (Arc-style)
For Arc, the simplest integration is the Ollama bundle Ertas ships. The bundle includes the Modelfile (with chat template and stop tokens already wired) and the install script. Once installed, the model is served at http://localhost:11434/api/generate.
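// summariser.ts
// Sends the extracted page text to the local Ollama server and returns the summary.
async function summarise(pageText: string): Promise<string> {
  // Same instruction string as the training rows, followed by the document.
  const prompt =
    "Summarise the document in two to three sentences. " +
    "Do not invent facts or add preamble like 'this article is about'.\n\n" +
    pageText;

  const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "arc-summariser",
      prompt,
      stream: false, // wait for the whole summary
      options: {
        temperature: 0.2,
        top_p: 0.85,
        num_predict: 150, // cap on generated tokens
      },
    }),
  });

  // Surface HTTP failures (for example, the model is not installed).
  if (!response.ok) {
    throw new Error(`Ollama request failed with status ${response.status}`);
  }

  const data = (await response.json()) as { response: string };
  return data.response.trim();
}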
Key choices:
- Temperature 0.2: low. Summaries should be reproducible; identical input should give nearly identical output.
- num_predict: 150: caps the response at roughly three sentences. Without a cap, the model occasionally rambles.
- No streaming: the user is waiting for the whole summary; streaming token-by-token reveals a partial summary, which can be misleading if the model self-corrects.
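Because num_predict stops generation at a token count, a response that hits the cap can end mid-sentence. One way to tidy that up is sketched below, assuming you would rather show fewer complete sentences than a dangling fragment; `completeSentences` and `tidySummary` are hypothetical helpers, and the splitter is a punctuation heuristic rather than a real sentence tokenizer.

```typescript
// Splits text into complete sentences; anything after the last terminator is dropped.
// A dot followed by a digit (as in "18.3") is not treated as a sentence end.
function completeSentences(text: string): string[] {
  const sentences: string[] = [];
  const terminator = /[.!?]+["')\]]*(?=\s|$)/g;
  let start = 0;
  let match: RegExpExecArray | null;
  while ((match = terminator.exec(text)) !== null) {
    const end = match.index + match[0].length;
    sentences.push(text.slice(start, end).trim());
    start = end;
  }
  return sentences;
}

// Keeps at most maxSentences complete sentences from a raw model response.
function tidySummary(raw: string, maxSentences: number = 3): string {
  const trimmed = raw.trim();
  const sentences = completeSentences(trimmed);
  // Nothing to salvage: return the text unchanged rather than an empty string.
  if (sentences.length === 0) return trimmed;
  return sentences.slice(0, maxSentences).join(" ");
}

console.log(tidySummary("Apple released iOS 18.3. It fixes bugs. The upd"));
// prints "Apple released iOS 18.3. It fixes bugs."
```

In the `summarise` function this would wrap the final return value, for example `tidySummary(data.response)`.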
Preprocessing the page
Arc has the page DOM. You do not want to send the model the raw DOM; you want the page's readable content. Use Mozilla Readability or an equivalent client-side library to extract the article body, then strip remaining HTML, then send.
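import { Readability } from "@mozilla/readability";

// Returns the page's readable text, or "" if Readability finds no article.
function extractReadable(doc: Document): string {
  // Readability mutates the DOM it parses, so work on a clone.
  const article = new Readability(doc.cloneNode(true) as Document).parse();
  if (!article || !article.textContent) return "";
  return article.textContent
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim()
    .slice(0, 12000); // hard cap on characters sent to the model
}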
The 12,000-character cap protects against the rare gigantic page. The 4096-token training context corresponds to about 16,000 characters of English; the 12,000-character cap leaves headroom for the prompt template.
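A plain `slice` can cut the page mid-sentence or mid-word. If that matters for your summaries, one option is to back up to the last sentence boundary inside the cap. This is a sketch under the assumption that losing up to half the budget is an acceptable price for a clean ending; `truncateAtSentence` is an illustrative name, not part of Readability.

```typescript
// Truncates text to at most maxChars, preferring to end on a sentence boundary.
function truncateAtSentence(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  const head = text.slice(0, maxChars);

  // Last sentence-ending punctuation that is followed by a space.
  const lastStop = Math.max(
    head.lastIndexOf(". "),
    head.lastIndexOf("! "),
    head.lastIndexOf("? ")
  );
  // Only accept the boundary if it keeps at least half of the budget.
  if (lastStop >= maxChars / 2) return head.slice(0, lastStop + 1);

  // Otherwise fall back to the last word boundary.
  const lastSpace = head.lastIndexOf(" ");
  return lastSpace > 0 ? head.slice(0, lastSpace) : head;
}

console.log(truncateAtSentence("First sentence here. Second sentence goes on and on.", 30));
// prints "First sentence here."
```

In `extractReadable` this would replace the final `.slice(0, 12000)` with a call such as `truncateAtSentence(cleaned, 12000)`, where `cleaned` stands for the whitespace-normalised text.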
Integration: browser via wllama
If Arc wanted to ship without depending on Ollama at all (single-file experience, no external installer), the browser path via wllama is the alternative. See Ship: web for the full pattern. Trade-offs:
- Pro: zero external install. The model loads from a URL into IndexedDB on first run.
- Con: first run is a 3-GB download. wllama is fast in-browser, but Ollama is faster on the same hardware.
- Con: browser memory is tighter than native; a 3B at Q4_K_M is the realistic ceiling, and you may want to drop to 1B for older Macs.
Most Mac-first apps pick Ollama. Most web-first products pick wllama.
Probe set
Eight prompts that exercise the genre spread.
| # | Genre | Probe (excerpt) | Pass criteria |
|---|---|---|---|
| 1 | News | First three paragraphs of an AP wire story | Two to three sentences, news voice, facts traceable to the input |
| 2 | Blog post | A 1,200-word personal essay on remote work | Two to three sentences, captures the argument not just the topic |
| 3 | Technical docs | A Stripe docs page on webhook retries | Two to three sentences in plain English; no API jargon copied verbatim |
| 4 | Opinion | A 900-word op-ed on city zoning | Captures the stance, not "the author discusses zoning" |
| 5 | Wikipedia | A 500-word entry on a historical event | Two to three sentences, matches the lede in spirit |
| 6 | Edge: very short | A 200-word news brief | Still two-to-three sentences; does not pad |
| 7 | Edge: very long | A 5,000-word essay (will be truncated) | Captures the first half, does not pretend to cover the rest |
| 8 | Hallucination check | A news article whose summary in the dataset had a number changed | Reports the number from the input, not the altered number seen in training |
Most of these should pass cleanly on the first run. The hallucination-check probe is the one that fails most often; if it fails, retrain with a curated dataset where the summary's numbers match the input's numbers in every row.
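The mechanical parts of the pass criteria can be checked by a small harness, leaving judgment calls such as "captures the argument" to a human reviewer. The sketch below assumes you have already collected one model output per probe; the `Probe` shape, `sentenceCountBetween`, and `runProbes` are all hypothetical names, and the sentence counter is a punctuation heuristic.

```typescript
// A check returns null on pass, or a short failure message.
type Check = (input: string, output: string) => string | null;

interface Probe {
  id: number;
  genre: string;
  input: string;
  checks: Check[];
}

interface ProbeReport {
  id: number;
  genre: string;
  failures: string[];
}

// Example check: the output has between min and max sentences.
function sentenceCountBetween(min: number, max: number): Check {
  return (_input, output) => {
    const stops = output.trim().match(/[.!?]+(?=\s|$)/g);
    const n = stops ? stops.length : 0;
    return n >= min && n <= max ? null : `expected ${min}-${max} sentences, counted ${n}`;
  };
}

// Runs every probe's checks against the recorded output for that probe id.
function runProbes(probes: Probe[], outputs: Record<number, string>): ProbeReport[] {
  return probes.map((probe) => {
    const output = outputs[probe.id];
    const failures: string[] = [];
    if (typeof output !== "string") {
      failures.push("no output recorded");
    } else {
      for (const check of probe.checks) {
        const failure = check(probe.input, output);
        if (failure !== null) failures.push(failure);
      }
    }
    return { id: probe.id, genre: probe.genre, failures };
  });
}
```

A probe passes its automated portion when `failures` is empty. Checks are plain functions, so a number-matching check for the hallucination probe can sit alongside the sentence-count check without changing the runner.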
Limits
- Length drift on edge cases. Very short input sometimes produces an over-padded summary; very long input sometimes produces a summary that only covers the first half. The probe set catches both; the dataset edits fix them.
- Genres outside the training distribution. Code, recipes, legal documents, and academic papers will get summarised, but the quality will be lower than the trained genres. Either extend the dataset to cover them, or detect them in the app and route differently.
- Numbers and dates. Hallucinated dates and currency figures are the most common failure mode in production. Adding a regex check that compares the summary's numbers against the input's numbers catches most of these at runtime.
- Languages other than English. Out-of-the-box behaviour is poor for non-English input. Add per-language training data if you need multilingual support.
- No structured output. This recipe produces prose. If you also want fields (key points, sentiment, named entities), do that in a second pass with a structured-extraction model. See Structured data extraction.
What's next
Structured data extraction
The next recipe. Different task shape: strict JSON instead of prose.
Voice transcript cleanup
The other recipe in the 'fix a piece of text' family.
Ship: web
If you go the browser route instead of Ollama.
Dataset quality
The dataset edits are the biggest lever in this recipe. This page explains why.