Voice transcript cleanup
Fine-tune a 3B-class model to remove filler, add punctuation, and fix common ASR errors in meeting and interview transcripts on-device, with speaker labels preserved.
Modern automatic speech recognition (ASR) systems like Whisper, Apple's on-device speech engine, and Android's Live Caption produce remarkably good transcripts. They also produce transcripts that no one wants to read: every "um" and "uh," no comma in sight, recurring mishears on names and acronyms, and run-on sentences that stretch for ten lines. The fix is a small post-processing model that takes raw ASR output and produces something the user can paste into a Notion page without editing.
This recipe uses a worked example: imagine Granola wants to add offline transcript cleanup so users can run interviews and standups in airplane mode (on a long flight, in a SCIF, in a basement) and still get clean output once they land. Granola already has speaker labels from a diarisation step that runs before the LLM; the cleanup model preserves those labels and fixes what's between them. Granola is the anchor because the meeting-notes product shape is well-known and the offline-cleanup gap is a real user request; substitute any product where speech-to-text is the input layer.
Recipe also fits
The Granola-shaped example is one of many. The same shape applies to:
- Meeting-notes apps adding offline cleanup as a Pro feature (Otter, Fireflies, Avoma, Krisp).
- Journalism and podcast interview tools that clean ASR output before the editor sees it (Descript, Riverside).
- Accessibility apps that produce post-hoc cleaned captions for archived video or audio.
- Call-center QA tools and customer-research transcribers that need clean text for downstream analysis.
- Voice memo apps that ship cleanup as a paid feature on top of free recording.
If your product takes audio in and shows text out, and you want the text out to read like a person typed it, this is the recipe to start with.
When this is the right fit
This recipe is the right fit when:
- The input is diarised (speaker turns already labelled) and you want the model to preserve the labels and clean the text inside each turn. The model is not learning to diarise; diarisation is an upstream concern.
- You want conservative cleanup, not summarisation. The user's words stay the user's words; you remove filler, add punctuation, fix obvious mishears, and stop there. If the user said "the API gateway is fubar," the cleaned version says "the API gateway is fubar" not "the API gateway has issues."
- The input language is one you have training data for. English meeting/interview corpora are abundant; other languages need targeted sourcing.
- The product can accept latency. Cleanup runs after the meeting, not during. A 30-minute interview producing 5,000 words of transcript takes the model 10 to 30 seconds to clean on a modern phone; that's fine for "tap to clean up" but wrong for live captioning.
It is not the right fit when:
- You want speaker identification from scratch. That is a diarisation model (Pyannote, NeMo) and lives upstream. The cleanup model preserves labels; it does not invent them.
- You want real-time captioning. Cleanup is batch. Use a streaming ASR with built-in punctuation (Whisper-v3-with-vad or similar) for live captions.
- The transcript needs to be edited to summarise meaning, not cleaned. That is the Document summariser recipe, fed with the cleaned transcript.
The dataset
For the Granola-style scenario, 6,000 rows is a reasonable starting point. Tune up if quality lags or down if authoring at scale is hard. Each row is one (raw transcript chunk, cleaned chunk) pair. A typical mix of sources:
| Source | Share | Notes |
|---|---|---|
| AMI Meeting Corpus | 30% (~1,800) | 100 hours of meeting audio with manual transcripts; pair the manual transcript with a Whisper-large run for the noise side |
| ICSI Meeting Corpus | 15% (~900) | Similar shape to AMI, different acoustic conditions and accents |
| TED interview transcripts | 20% (~1,200) | Single-speaker interviews; useful for the long-turn shape |
| Podcast transcripts (paired with raw ASR) | 20% (~1,200) | Run a podcast catalogue you have rights to through Whisper, pair with cleaned versions |
| Synthetic noise pairs | 15% (~900) | Take clean text and machine-introduce filler, mishears, and dropped punctuation |
License-check each named corpus before commercial use; their terms vary (AMI and ICSI are research-friendly but the specifics can change) and you are responsible for compliance.
The mix matters because meeting-style transcripts and interview-style transcripts have different rhythm. A model trained only on AMI cleans meetings well and interviews oddly (collapses long turns too aggressively).
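If you tune the total up or down, the per-source row counts need to move with it. A minimal sketch of that arithmetic, using the shares from the table above; the dictionary keys and the function name are illustrative, not part of any Ertas tooling:

```python
# Shares copied from the source-mix table; keys are just labels for this sketch.
MIX = {
    "ami": 0.30,
    "icsi": 0.15,
    "ted": 0.20,
    "podcast": 0.20,
    "synthetic": 0.15,
}


def rows_per_source(total_rows: int, mix: dict[str, float]) -> dict[str, int]:
    """Turn fractional shares into whole row counts that sum to total_rows."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1.0")
    counts = {name: round(total_rows * share) for name, share in mix.items()}
    # Rounding can leave the total off by a row or two; adjust the largest source.
    largest = max(mix, key=mix.get)
    counts[largest] += total_rows - sum(counts.values())
    return counts


print(rows_per_source(6000, MIX))
```

Calling it with a different total rescales every source while keeping the meeting-to-interview balance the same.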
Generating raw-side data
The cleanest source of "what real ASR output looks like" is your own ASR run on the same audio that produced the manual transcript. For AMI and ICSI, the audio is public; run Whisper-large-v3 with default settings, and the divergence between Whisper's output and the manual transcript IS the noise you want the model to fix.
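Before pairing rows it helps to look at that divergence directly. The sketch below uses Python's standard `difflib` to list word-level differences between a manual transcript and an ASR run; the function name and the tuple layout are assumptions made for illustration, and both strings are expected to be already loaded as plain text:

```python
import difflib


def asr_divergence(reference: str, asr_output: str) -> list[tuple[str, str, str]]:
    """List word-level differences as (operation, reference words, ASR words)."""
    ref_words = reference.split()
    asr_words = asr_output.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=asr_words, autojunk=False)
    diffs = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op == "equal":
            continue
        diffs.append(
            (op, " ".join(ref_words[a_start:a_end]), " ".join(asr_words[b_start:b_end]))
        )
    return diffs


# Each tuple is one place where the ASR text departs from the manual transcript.
for diff in asr_divergence("the cert had expired", "the um cert had expired"):
    print(diff)
```

Skimming a few hundred of these tuples per corpus is a quick way to see which noise types (filler, casing, split words, name mishears) dominate before you commit to a mix.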
For sources where you have transcripts but no paired audio (TED publishes both audio and transcripts; some podcast catalogues only give you transcripts):
- Use a TTS model to read the clean transcript at varying speeds and with synthetic background noise.
- Run the TTS output through Whisper.
- Pair the Whisper output with the original text.
This is more work than it sounds and the noise distribution is slightly off from real captures, but it produces useful training data when real paired data is short.
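For the synthetic-noise share of the mix there is also a text-only route that skips audio entirely. The following is a rough sketch, not a model of real ASR behaviour: it strips punctuation, lowercases ordinary words, and injects filler at a fixed rate. The filler list, the default rate, and the function name are all assumptions to tune against what your real ASR output looks like.

```python
import random
import re

FILLERS = ["um", "uh", "like", "you know"]


def add_synthetic_noise(clean_turn: str, filler_rate: float = 0.15, seed: int = 0) -> str:
    """Degrade one clean speaker turn into something resembling raw ASR text."""
    rng = random.Random(seed)  # seeded so the same row is reproducible
    # Keep a leading "[Name]:" label untouched; it is part of the contract.
    match = re.match(r"(\[[^\]]+\]:\s*)(.*)", clean_turn, flags=re.DOTALL)
    label, body = (match.group(1), match.group(2)) if match else ("", clean_turn)
    # Drop punctuation, apostrophes included ("couldn't" becomes "couldnt").
    body = re.sub(r"[^\w\s]", "", body)
    noisy = []
    for word in body.split():
        if rng.random() < filler_rate:
            noisy.append(rng.choice(FILLERS))
        # Leave acronyms such as "API" alone; lowercase everything else.
        noisy.append(word if (word.isupper() and len(word) > 1) else word.lower())
    return label + " ".join(noisy)


print(add_synthetic_noise("[Sarah]: Yeah, so the API gateway went down around 2 PM."))
```

This handles a single turn at a time and does not simulate mishears; a substitution table built from the divergences you see in real ASR runs would be the natural extension.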
Dataset format
Use the input/output schema. The instruction is the same every row; the input is the raw chunk and the output is the cleaned chunk.
{
  "instruction": "Clean up the transcript chunk. Preserve speaker labels exactly as given. Remove filler words (um, uh, like, you know). Add punctuation and capitalisation. Fix obvious mishearings using context. Do not paraphrase or shorten.",
  "input": "[Sarah]: yeah so um the the API gateway it like went down around 2 PM and we we couldnt figure out what happened until uh maybe an hour later when john noticed the the cert had expired\n[John]: right and the the renewal was supposed to be automatic but it like silently failed two weeks ago",
  "output": "[Sarah]: Yeah, so the API gateway went down around 2 PM and we couldn't figure out what happened until maybe an hour later, when John noticed the cert had expired.\n[John]: Right, and the renewal was supposed to be automatic, but it silently failed two weeks ago."
}
Three details that matter:
- The speaker label format is part of the contract. The model sees [Name]: in training and reproduces it in production. Granola's labels come from diarisation, so the cleanup model just has to preserve what's there.
- The cleaned output retains the speaker's voice. "We couldn't figure out what happened" not "we struggled to diagnose the issue." Light cleanup, not paraphrase.
- Chunks, not whole transcripts. Train on 200-to-800-word chunks. A 5,000-word meeting does not fit the 4096-token training context, and long inputs are where the output starts to drift in tone.
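The first and third details can be checked mechanically before a row enters the dataset. A minimal sketch, assuming rows are already parsed into dictionaries with the three fields shown above; `validate_row` and its parameters are hypothetical names for this example:

```python
import re

# A speaker label is "[Name]:" at the start of a line.
LABEL_RE = re.compile(r"^\[([^\]]+)\]:", flags=re.MULTILINE)


def speaker_labels(chunk: str) -> list[str]:
    """Return the speaker names in order of appearance, one per turn."""
    return LABEL_RE.findall(chunk)


def validate_row(row: dict, min_words: int = 200, max_words: int = 800) -> list[str]:
    """Return a list of problems; an empty list means the row looks usable."""
    problems = []
    for key in ("instruction", "input", "output"):
        if not str(row.get(key, "")).strip():
            problems.append(f"missing or empty field: {key}")
    if problems:
        return problems
    # Same speakers, same order, on both sides of the pair.
    if speaker_labels(row["input"]) != speaker_labels(row["output"]):
        problems.append("speaker labels differ between input and output")
    words = len(row["input"].split())
    if not min_words <= words <= max_words:
        problems.append(f"input is {words} words; expected {min_words} to {max_words}")
    return problems
```

The voice-retention detail is harder to automate and still needs human review.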
Chunk sizing
The 200-to-800-word chunk size is chosen for two reasons. First, a chunk and its cleaned counterpart fit comfortably in a 4096-token training context with room for the instruction. Second, it matches the natural unit of a transcript section (a topic, a question-and-answer block). Cleaning works best at this granularity: smaller chunks lose context for mishear correction, and longer chunks drift in tone.
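The context-fit claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below estimates the token count of one training row; the tokens-per-word figure and the instruction length are assumptions, so measure with the real tokenizer before relying on the numbers:

```python
import math


def estimate_row_tokens(
    input_words: int,
    instruction_words: int = 40,    # assumption: roughly the instruction shown earlier
    output_ratio: float = 0.95,     # cleaned text is somewhat shorter than the raw text
    tokens_per_word: float = 1.4,   # assumption: rough figure for English text
    max_seq_len: int = 4096,
) -> tuple[int, bool]:
    """Return (estimated tokens, whether the row fits in max_seq_len)."""
    total_words = instruction_words + input_words + input_words * output_ratio
    tokens = math.ceil(total_words * tokens_per_word)
    return tokens, tokens <= max_seq_len


for words in (200, 800, 5000):
    print(words, estimate_row_tokens(words))
```

Under these assumptions an 800-word chunk uses a little over half the context, while a 5,000-word transcript is several times over the limit.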
The "don't paraphrase" trap
The single most common failure mode for cleanup models is that they start paraphrasing in production. The cleaned output gets slightly cleaner-sounding sentence by sentence, the user's specific phrasing disappears, and the result reads like a generic article rather than a transcription.
Two dataset habits prevent it:
- Hand-review a sample for paraphrase creep before training. If your "cleaned" version of "the dashboard is on fire" reads "the dashboard is experiencing issues," cut those rows; you are training paraphrase, not cleanup.
- Add "preservation rows": cases where the model must preserve unusual phrasing (slang, technical jargon, profanity) even though it looks like an ASR error. About 5% of the dataset should be preservation rows.
The model will lose information you let it lose in training. If you cut "fubar" to "broken" in your cleaned examples, the production model cuts it too. Decide what to preserve and hold the line in dataset edits.
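Hand review scales better with a cheap pre-filter that surfaces the rows most likely to contain paraphrase. One possible heuristic, sketched here under the assumption that a faithful cleanup introduces almost no words that were absent from the raw chunk: measure the share of novel words in the cleaned text. The function names and any threshold you pick are illustrative; legitimate mishear fixes also introduce new words, so treat a high score as a reason to look, not as a verdict.

```python
import re


def _words(text: str) -> list[str]:
    """Lowercased words with speaker labels, punctuation and apostrophes removed."""
    text = re.sub(r"\[[^\]]+\]:", " ", text)           # ignore "[Name]:" labels
    text = text.lower().replace("'", "").replace("\u2019", "")
    return re.findall(r"[a-z0-9]+", text)


def novel_word_ratio(raw: str, cleaned: str) -> float:
    """Share of words in the cleaned text that never appear in the raw text."""
    raw_vocab = set(_words(raw))
    cleaned_words = _words(cleaned)
    if not cleaned_words:
        return 0.0
    novel = [word for word in cleaned_words if word not in raw_vocab]
    return len(novel) / len(cleaned_words)


raw = "[Sarah]: the dashboard is um on fire"
print(novel_word_ratio(raw, "[Sarah]: The dashboard is on fire."))
print(novel_word_ratio(raw, "[Sarah]: The dashboard is experiencing issues."))
```

Sorting the dataset by this ratio and reviewing from the top down puts the "experiencing issues" style rows in front of a reviewer first.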
The base model
Pick Llama 3.2 3B Instruct from the Ertas catalogue. The reasoning:
- Llama 3.2's instruction tuning handles "follow these rules carefully" tasks reliably, which matches the preservation requirement.
- At Q4_K_M, the GGUF is about 2.0 GB. That fits on mid-tier Android phones (the Ship: Android 6 GB device-RAM floor) and on iPhone 14-class devices with headroom.
- The 128k context window is overkill for chunk-by-chunk cleanup, but useful if you want to stitch turns together for context-aware mishear correction later.
GPU tier: Llama 3.2 3B trains on a T4, fitting the Free plan. There is no recommended paid-plan upgrade for this task; the 3B is usually enough (see "Avoid going larger" below).
If you need to go smaller (cheaper devices, web target), Llama 3.2 1B at Q4_K_M (~0.8 GB) works for the punctuation-and-filler task but starts to paraphrase more aggressively on long turns. The dataset edits get more important at this size.
Avoid going larger unless you have specific evidence the 3B is not enough. Cleanup is a constrained task; an 8B brings more memorisation risk (it starts hallucinating cleaner text from training) without obvious quality gains.
Training config
For 6,000 chunk pairs, start with:
| Setting | Value | Why |
|---|---|---|
| Schema | input/output | Single-task fit |
| Epochs | 3 | Cleanup is stable; three epochs is the sweet spot |
| Learning rate | 1e-4 | Conservative; we want the model to preserve speaker voice |
| LoRA rank | 8 | Style and format task; rank 8 is sufficient |
| LoRA alpha | 16 | 2x rank |
| Batch size | 2 | Long chunks; smaller batches keep memory comfortable |
| Grad accumulation | 8 | Effective batch 16 |
| Warmup | 5% of steps | Standard |
| Max sequence length | 4096 | Fits 800-word chunks with the instruction overhead |
Wall-clock time and credit cost depend on the GPU tier and dataset size. Ertas's Training Config picker shows an estimate before you press play; see Credits and usage for the current rates.
Two epochs sometimes work, but three usually nails the "preserve voice" behaviour while still cleanly removing filler. If your loss curve flattens at epoch 2, stop; if it is still descending at epoch 3, you can push to 4 cautiously while watching the probe set.
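To read the loss curve against epoch boundaries you need to know where those boundaries fall in optimiser steps. A small sketch of the arithmetic implied by the table, assuming the trainer keeps the final partial batch and rounds warmup up; the function name is made up for this example and Ertas may count steps differently:

```python
import math


def training_schedule(
    rows: int = 6000,
    batch_size: int = 2,
    grad_accumulation: int = 8,
    epochs: int = 3,
    warmup_fraction: float = 0.05,
) -> dict[str, int]:
    """Derive step counts from the settings in the training config table."""
    effective_batch = batch_size * grad_accumulation
    steps_per_epoch = math.ceil(rows / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_fraction)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


print(training_schedule())
```

With the defaults, each epoch is 375 optimiser steps, so "flattens at epoch 2" means no meaningful drop after roughly step 750.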
Integration: iOS via llamadart
For this worked example, assume Granola's iOS app uses a Flutter codebase. The Ship section already covers the Flutter via llamadart path; the cleanup integration adds the chunk-splitting wrapper.
// transcript_cleaner.dart
import 'package:llamadart/llamadart.dart';

class TranscriptCleaner {
  final LlamaEngine _engine;
  final ChatSession _session;

  TranscriptCleaner(this._engine) : _session = ChatSession(_engine);

  /// Cleans a diarised transcript chunk by chunk and rejoins the results.
  Future<String> clean(String rawTranscript) async {
    final chunks = _splitByTurns(rawTranscript, maxWords: 500);
    final cleaned = <String>[];
    for (final chunk in chunks) {
      final prompt = _buildPrompt(chunk);
      final response = await _session.generate(prompt, maxTokens: 1200);
      _session.reset(); // critical, see Ship: Android note
      cleaned.add(response.trim());
    }
    return cleaned.join('\n\n');
  }

  // Same instruction text as the training rows, followed by the raw chunk.
  String _buildPrompt(String chunk) =>
      'Clean up the transcript chunk. Preserve speaker labels exactly as given. '
      'Remove filler words (um, uh, like, you know). Add punctuation and capitalisation. '
      'Fix obvious mishearings using context. Do not paraphrase or shorten.\n\n$chunk';

  // Groups whole speaker turns into chunks of at most maxWords words.
  // A single turn longer than maxWords is kept whole rather than cut.
  List<String> _splitByTurns(String raw, {required int maxWords}) {
    // Zero-width lookahead: split just before each "[Name]:" label.
    final turns = raw.split(RegExp(r'(?=\[[^\]]+\]:)'));
    final chunks = <String>[];
    var buffer = StringBuffer();
    var bufferWords = 0;
    for (final turn in turns) {
      // Skip empty pieces so a trailing newline is not counted as a word.
      final words =
          turn.split(RegExp(r'\s+')).where((w) => w.isNotEmpty).length;
      if (bufferWords + words > maxWords && bufferWords > 0) {
        chunks.add(buffer.toString().trim());
        buffer = StringBuffer();
        bufferWords = 0;
      }
      buffer.write(turn);
      bufferWords += words;
    }
    if (bufferWords > 0) chunks.add(buffer.toString().trim());
    return chunks;
  }
}
Key choices:
- Split on turn boundaries, not on word count alone. Cutting a turn in half confuses the model and produces a bad join. The regex looks for [Name]: boundaries and chunks at the nearest one before the 500-word cap.
- session.reset() after every chunk. Without it, cleanup of chunk N+1 starts to inherit context from chunk N, and the speaker voice drifts. See the Ship: Android note on this same llamadart behaviour.
- maxTokens: 1200. Cleaned output is usually 80 to 95% of the input word count (filler removed), so 1.5x the chunk's word count in tokens is a safe ceiling. For a 500-word chunk that is about 750 tokens; 1200 leaves headroom for a single turn that runs past the cap.
Surface choices
Two product-level decisions worth flagging:
- Run cleanup on-demand, not automatically. Cleaning a 60-minute transcript takes tens of seconds on a recent phone (varies by device, model size, and chunk count). Show a "Clean up transcript" button rather than auto-running; users tolerate the wait when they triggered it, but it feels long when the app forces it.
- Show a before / after diff so users can see what was changed. This catches paraphrase-creep mistakes early and builds trust. Granola-style apps generally already have a diff view for the AI-generated summary; reusing it here is cheap.
Probe set
Eight prompts that exercise the failure modes.
| # | Scenario | Pass criteria |
|---|---|---|
| 1 | Heavy filler ("um", "uh", "like" once every five words) | Filler gone; sentences are clean and the meaning is unchanged |
| 2 | Run-on sentence with no punctuation | Sensible commas and sentence breaks added |
| 3 | Technical mishear ("the API gate way" should be "the API gateway") | Joined correctly using context |
| 4 | Name mishear ("Jon Smith" said but ASR wrote "John Smith") | Whichever form appears, the model keeps it consistent; does not autocorrect to the more common spelling |
| 5 | Overlapping speakers (two [Speaker]: labels with interrupted speech) | Both turns preserved; labels intact |
| 6 | Slang and jargon ("the dashboard is fubar", "we lost the bridge call") | Preserved; not paraphrased to neutral language |
| 7 | Long monologue (one speaker for 400 words) | Cleaned; structure preserved; not summarised |
| 8 | Mixed-language insert ("we have a merci beaucoup in there") | Foreign-language phrase preserved verbatim |
Most of these should pass cleanly. Probe 6 (jargon preservation) is the canonical failure tell: a model that paraphrases "fubar" to "broken" has been over-edited in the dataset.
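Several of the pass criteria can be checked automatically once you have the model's output for each probe. The sketch below covers label preservation, leftover filler, phrase preservation (probes 6 and 8), and a crude length check for summarisation (probe 7). The function name, the filler pattern, and the 0.7 length threshold are assumptions for illustration; "like" and "you know" are left out of the filler pattern because they also have legitimate uses.

```python
import re

LABEL_RE = re.compile(r"^\[([^\]]+)\]:", flags=re.MULTILINE)
FILLER_RE = re.compile(r"\b(?:um|uh)\b", flags=re.IGNORECASE)


def check_probe(raw, cleaned, must_keep=(), min_length_ratio=0.7):
    """Return a list of failure messages; an empty list means the probe passed."""
    failures = []
    if LABEL_RE.findall(raw) != LABEL_RE.findall(cleaned):
        failures.append("speaker labels changed")
    # Strip labels first so a speaker named "Um" is not flagged.
    if FILLER_RE.search(LABEL_RE.sub("", cleaned)):
        failures.append("filler word left in output")
    for phrase in must_keep:
        if phrase.lower() not in cleaned.lower():
            failures.append(f"phrase not preserved: {phrase}")
    if len(cleaned.split()) < min_length_ratio * len(raw.split()):
        failures.append("output much shorter than input; possible summarisation")
    return failures


raw = "[Sarah]: um the dashboard is fubar"
print(check_probe(raw, "[Sarah]: The dashboard is fubar.", must_keep=["fubar"]))
print(check_probe(raw, "[Sarah]: The dashboard is broken.", must_keep=["fubar"]))
```

Punctuation quality (probe 2) and mishear correction (probes 3 and 4) still need a human or a reference output to compare against.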
Limits
- No diarisation recovery. If the upstream diarisation mislabels a turn, the cleanup model preserves the mislabel. Fix diarisation upstream or expose the labels as editable in the UI.
- Hard mishear cases. Domain-specific terminology the model has not seen will be missed. If the user is in a niche field (medical, legal, niche engineering), add a domain-specific subset to the dataset.
- Accents outside the training distribution. ASR errors on heavy accents are systematically different. If your user base has a strong accent distribution that AMI and ICSI do not cover, sourcing accented training audio is the fix.
- Length floors. Very short turns (single-word utterances) sometimes get expanded or annotated by the model. Limit cleanup to chunks above ~10 words in the app; pass shorter turns through untouched.
- Personal-identifier hallucination on name corrections. The model occasionally tries to "correct" a name to a more common spelling. Probe 4 catches this; the fix is preservation rows in the dataset.
What's next
- Code completion: The hardest recipe. FIM authoring with prefix-leak gotchas and a much larger dataset.
- Ship: iOS: The llamadart integration path covered in full.
- Performance tips: Cleanup runs on long inputs; the throughput tips matter here.
- Document summariser: Feed the cleaned transcript into the summariser for a meeting recap.