Voice transcript cleanup

    Fine-tune a 3B-class model to remove filler, add punctuation, and fix common ASR errors in meeting and interview transcripts on-device, with speaker labels preserved.

    Modern automatic speech recognition (ASR) systems like Whisper, Apple's on-device speech engine, and Android's Live Caption produce remarkably good transcripts. They also produce transcripts that no one wants to read: every "um" and "uh," no comma in sight, recurring mishears on names and acronyms, and run-on sentences that stretch for ten lines. The fix is a small post-processing model that takes raw ASR output and produces something the user can paste into a Notion page without editing.

    This recipe uses a worked example: imagine Granola wants to add offline transcript cleanup so users can run interviews and standups in airplane mode (on a long flight, in a SCIF, in a basement) and still get clean output without waiting to get back online. Granola already has speaker labels from a diarisation step that runs before the LLM; the cleanup model preserves those labels and fixes what's between them. Granola is the anchor because the meeting-notes product shape is well-known and the offline-cleanup gap is a real user request; substitute any product where speech-to-text is the input layer.

    Recipe also fits

    The Granola-shaped example is one of many. The same shape applies to:

    • Meeting-notes apps adding offline cleanup as a Pro feature (Otter, Fireflies, Avoma, Krisp).
    • Journalism and podcast interview tools that clean ASR output before the editor sees it (Descript, Riverside).
    • Accessibility apps that produce post-hoc cleaned captions for archived video or audio.
    • Call-center QA tools and customer-research transcribers that need clean text for downstream analysis.
    • Voice memo apps that ship cleanup as a paid feature on top of free recording.

    If your product takes audio in and shows text out, and you want the text out to read like a person typed it, this is the recipe to start with.

    When this is the right fit

    This recipe is the right fit when:

    • The input is diarised (speaker turns already labelled) and you want the model to preserve the labels and clean the text inside each turn. The model is not learning to diarise; diarisation is an upstream concern.
    • You want conservative cleanup, not summarisation. The user's words stay the user's words; you remove filler, add punctuation, fix obvious mishears, and stop there. If the user said "the API gateway is fubar," the cleaned version says "the API gateway is fubar" not "the API gateway has issues."
    • The input language is one you have training data for. English meeting/interview corpora are abundant; other languages need targeted sourcing.
    • The product can accept latency. Cleanup runs after the meeting, not during. A 30-minute interview producing 5,000 words of transcript takes the model 10 to 30 seconds to clean on a modern phone; that's fine for "tap to clean up" but wrong for live captioning.

    It is not the right fit when:

    • You want speaker identification from scratch. That is a diarisation model (Pyannote, NeMo) and lives upstream. The cleanup model preserves labels; it does not invent them.
    • You want real-time captioning. Cleanup is batch. Use a streaming ASR setup with built-in punctuation (for example, Whisper large-v3 wrapped with voice-activity detection, or similar) for live captions.
    • The transcript needs to be edited to summarise meaning, not cleaned. That is the Document summariser recipe, fed with the cleaned transcript.

    The dataset

    For the Granola-style scenario, 6,000 rows is a reasonable starting point. Scale up if quality lags, or down if producing that many pairs is hard. Each row is one (raw transcript chunk, cleaned chunk) pair. A typical mix of sources:

    Source | Share | Notes
    AMI Meeting Corpus | 30% (~1,800) | 100 hours of meeting audio with manual transcripts; pair the manual transcript with a Whisper-large run for the noise side
    ICSI Meeting Corpus | 15% (~900) | Similar shape to AMI, different acoustic conditions and accents
    TED interview transcripts | 20% (~1,200) | Single-speaker interviews; useful for the long-turn shape
    Podcast transcripts (paired with raw ASR) | 20% (~1,200) | Run a podcast catalogue you have rights to through Whisper, pair with cleaned versions
    Synthetic noise pairs | 15% (~900) | Take clean text and machine-introduce filler, mishears, and dropped punctuation

    License-check each named corpus before commercial use; their terms vary (AMI and ICSI are research-friendly but the specifics can change) and you are responsible for compliance.

    The mix matters because meeting-style transcripts and interview-style transcripts have different rhythm. A model trained only on AMI cleans meetings well and interviews oddly (collapses long turns too aggressively).
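
    The "synthetic noise pairs" source can be sketched in a few lines. This is a minimal illustration, not a reference generator: the function names, the filler rate, and the stutter rate are assumptions, and mishears are left out because a realistic confusion list has to come from real ASR runs.

```python
import random
import re

FILLERS = ["um", "uh", "like", "you know"]  # the fillers the instruction names

def add_noise(clean: str, filler_rate: float = 0.15, seed: int = 0) -> str:
    """Make clean text look like raw ASR: no punctuation, fillers, stutters."""
    rng = random.Random(seed)  # seeded so a dataset build is reproducible
    # Drop sentence punctuation and casing; apostrophes are left alone.
    text = re.sub(r"[.,;:!?]", "", clean).lower()
    noisy = []
    for word in text.split():
        if rng.random() < filler_rate:      # filler before the word
            noisy.append(rng.choice(FILLERS))
        noisy.append(word)
        if rng.random() < filler_rate / 2:  # stutter: repeat the word
            noisy.append(word)
    return " ".join(noisy)

def noisy_turn(turn: str, **kwargs) -> str:
    """Apply add_noise to one '[Name]: text' turn, keeping the label intact."""
    label, body = turn.split("]: ", 1)
    return f"{label}]: {add_noise(body, **kwargs)}"
```

    The clean turn becomes the output side of the row and the noisy turn becomes the input side. Lowercasing everything is a simplification; real ASR output often keeps some casing, so compare a sample against your own ASR runs before relying on it.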

    Generating raw-side data

    The cleanest source of "what real ASR output looks like" is your own ASR run on the same audio that produced the manual transcript. For AMI and ICSI, the audio is public; run Whisper-large-v3 with default settings. The divergence between Whisper's output and the manual transcript is exactly the noise you want the model to fix.
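
    To see what that divergence looks like for a given pair, a word-level diff is usually enough. The sketch below uses Python's standard difflib; the function name is hypothetical and the two strings are assumed to be already loaded as plain text.

```python
import difflib

def asr_divergence(manual: str, asr: str):
    """List (op, manual_words, asr_words) wherever the two transcripts differ."""
    ref, hyp = manual.split(), asr.split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return [
        (op, ref[i1:i2], hyp[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"  # keep only insertions, deletions, and replacements
    ]
```

    Counting the most frequent replacements across a corpus gives a quick picture of which mishears dominate, which is useful when deciding what the synthetic rows should imitate.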

    For sources where you have a clean transcript but no usable audio (TED publishes both audio and transcripts; podcasts often come with transcripts only):

    1. Use a TTS model to read the clean transcript at varying speeds and with synthetic background noise.
    2. Run the TTS output through Whisper.
    3. Pair the Whisper output with the original text.

    This is more work than it sounds and the noise distribution is slightly off from real captures, but it produces useful training data when real paired data is short.

    Dataset format

    Use the input/output schema. The instruction is the same in every row; the input is the raw chunk and the output is the cleaned chunk.

    {
      "instruction": "Clean up the transcript chunk. Preserve speaker labels exactly as given. Remove filler words (um, uh, like, you know). Add punctuation and capitalisation. Fix obvious mishearings using context. Do not paraphrase or shorten.",
      "input": "[Sarah]: yeah so um the the API gateway it like went down around 2 PM and we we couldnt figure out what happened until uh maybe an hour later when john noticed the the cert had expired\n[John]: right and the the renewal was supposed to be automatic but it like silently failed two weeks ago",
      "output": "[Sarah]: Yeah, so the API gateway went down around 2 PM and we couldn't figure out what happened until maybe an hour later, when John noticed the cert had expired.\n[John]: Right, and the renewal was supposed to be automatic, but it silently failed two weeks ago."
    }

    Three details that matter:

    • The speaker label format is part of the contract. The model sees [Name]: in training and reproduces it in production. Granola's labels come from diarisation, so the cleanup model just has to preserve what's there.
    • The cleaned output retains the speaker's voice. "We couldn't figure out what happened" not "we struggled to diagnose the issue." Light cleanup, not paraphrase.
    • Chunks, not whole transcripts. Train on 200-to-800-word chunks. A 5,000-word meeting does not fit the 4096-token training context, and long inputs are where the model's tone starts to drift.
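
    Because the label format is part of the contract, it is worth checking it mechanically when rows are written. A minimal sketch, assuming rows are stored one JSON object per line (JSONL is an assumption here, as is the function name):

```python
import json
import re

# Matches a speaker label such as "[Sarah]:" at the start of a line.
LABEL = re.compile(r"^\[([^\]]+)\]:", re.MULTILINE)

INSTRUCTION = (
    "Clean up the transcript chunk. Preserve speaker labels exactly as given. "
    "Remove filler words (um, uh, like, you know). Add punctuation and "
    "capitalisation. Fix obvious mishearings using context. "
    "Do not paraphrase or shorten."
)

def make_row(raw: str, cleaned: str) -> str:
    """Serialise one training pair, rejecting rows whose labels do not match."""
    if LABEL.findall(raw) != LABEL.findall(cleaned):
        raise ValueError("speaker labels differ between input and output")
    return json.dumps({"instruction": INSTRUCTION, "input": raw, "output": cleaned})
```

    A row that merges two turns, drops a turn, or renames a speaker fails the check before it reaches training.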

    Chunk sizing

    The 200-to-800-word chunk size is chosen for two reasons. First, it fits comfortably in a 4096-token training context with room for the instruction. Second, it matches the natural unit of a transcript section (a topic, a question-and-answer block). Cleaning works best at this granularity: smaller chunks lose the context needed for mishear correction, and longer chunks drift in tone.
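
    The "fits comfortably" claim can be checked with rough arithmetic. The constants below are assumptions (about 1.3 tokens per English word, about 60 tokens for the fixed instruction); measure with the real tokenizer before relying on them.

```python
TOKENS_PER_WORD = 1.3    # assumption: rough English average
INSTRUCTION_TOKENS = 60  # assumption: approximate length of the fixed instruction
MAX_SEQ_LEN = 4096

def row_tokens(input_words: int, output_ratio: float = 0.95) -> int:
    """Estimate tokens for instruction + raw chunk + cleaned chunk in one row."""
    output_words = input_words * output_ratio
    return round(INSTRUCTION_TOKENS + (input_words + output_words) * TOKENS_PER_WORD)

for words in (200, 800):
    estimate = row_tokens(words)
    print(words, estimate, estimate <= MAX_SEQ_LEN)
```

    Under these assumptions an 800-word chunk lands at roughly half of the 4096-token context, so there is room for tokenizers that split words more finely than the estimate.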

    The "don't paraphrase" trap

    The single most common failure mode for cleanup models is that they start paraphrasing in production. The cleaned output gets slightly cleaner-sounding sentence by sentence, the user's specific phrasing disappears, and the result reads like a generic article rather than a transcription.

    Two dataset habits prevent it:

    1. Hand-review a sample for paraphrase creep before training. If your "cleaned" version of "the dashboard is on fire" reads "the dashboard is experiencing issues," cut those rows; you are training paraphrase, not cleanup.
    2. Add "preservation rows": cases where the model must preserve unusual phrasing (slang, technical jargon, profanity) even though it looks like an ASR error. About 5% of the dataset should be preservation rows.

    The model loses whatever information you let it lose in training. If you change "fubar" to "broken" in your cleaned examples, the production model changes it too. Decide what to preserve and hold the line in dataset edits.
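
    Hand review scales better with a cheap pre-filter. One heuristic, sketched below under the assumption that cleanup mostly deletes words rather than adding them: measure how many words in the cleaned side have no in-order match in the raw side. The function name and any threshold you pick are assumptions, not part of the recipe.

```python
import difflib
import re

def _words(text: str) -> list[str]:
    # Lowercase and strip punctuation (apostrophes too) so only wording is compared.
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def novel_word_rate(raw: str, cleaned: str) -> float:
    """Fraction of cleaned words that have no in-order match in the raw text."""
    raw_w, clean_w = _words(raw), _words(cleaned)
    if not clean_w:
        return 0.0
    matcher = difflib.SequenceMatcher(a=raw_w, b=clean_w, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / len(clean_w)
```

    Rows with a high rate go to the top of the review queue. Legitimate mishear fixes also introduce new words, so treat the score as a ranking signal, not an automatic reject.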

    The base model

    Pick Llama 3.2 3B Instruct from the Ertas catalogue. The reasoning:

    • Llama 3.2's instruction tuning handles "follow these rules carefully" tasks reliably, which matches the preservation requirement.
    • At Q4_K_M, the GGUF is about 2.0 GB. That fits on mid-tier Android phones (the Ship: Android 6 GB device-RAM floor) and iPhone 14-class devices with headroom.
    • The 128k context window is overkill for chunk-by-chunk cleanup, but useful if you want to stitch turns together for context-aware mishear correction later.

    GPU tier: Llama 3.2 3B trains on a T4, fitting the Free plan. There is no recommended paid-plan upgrade for this task; the 3B is usually enough (see "Avoid going larger" below).

    If you need to go smaller (cheaper devices, web target), Llama 3.2 1B at Q4_K_M (~0.8 GB) works for the punctuation-and-filler task but starts to paraphrase more aggressively on long turns. The dataset edits get more important at this size.

    Avoid going larger unless you have specific evidence the 3B is not enough. Cleanup is a constrained task; an 8B brings more memorisation risk (it starts hallucinating cleaner text from training) without obvious quality gains.

    Training config

    For 6,000 chunk pairs, start with:

    Setting | Value | Why
    Schema | input/output | Single-task fit
    Epochs | 3 | Cleanup is stable; three epochs is the sweet spot
    Learning rate | 1e-4 | Conservative; we want the model to preserve speaker voice
    LoRA rank | 8 | Style and format task; rank 8 is sufficient
    LoRA alpha | 16 | 2x rank
    Batch size | 2 | Long chunks; smaller batches keep memory comfortable
    Grad accumulation | 8 | Effective batch 16
    Warmup | 5% of steps | Standard
    Max sequence length | 4096 | Fits 800-word chunks with the instruction overhead

    Wall-clock time and credit cost depend on the GPU tier and dataset size. Ertas's Training Config picker shows an estimate before you press play; see Credits and usage for the current rates.

    Two epochs sometimes work, but three usually nails the "preserve voice" behaviour while still cleanly removing filler. If your loss curve flattens at epoch 2, stop; if it is still descending at epoch 3, you can push to 4 cautiously while watching the probe set.
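
    For reading a loss curve it helps to know how many optimizer steps the settings above imply. The sketch below is plain arithmetic; the key names are hypothetical (they are not Ertas's config schema) and it assumes one row per sequence with no packing.

```python
import math

# Hypothetical key names; the values are the ones from the settings table.
CONFIG = {
    "rows": 6000,
    "epochs": 3,
    "learning_rate": 1e-4,
    "lora_rank": 8,
    "lora_alpha": 16,
    "batch_size": 2,
    "grad_accumulation": 8,
    "warmup_fraction": 0.05,
    "max_seq_len": 4096,
}

def schedule(cfg: dict) -> dict:
    """Derive effective batch size and step counts from the config."""
    effective_batch = cfg["batch_size"] * cfg["grad_accumulation"]
    steps_per_epoch = math.ceil(cfg["rows"] / effective_batch)
    total_steps = steps_per_epoch * cfg["epochs"]
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": round(total_steps * cfg["warmup_fraction"]),
    }
```

    With these values the effective batch is 16, one epoch is 375 steps, the full run is 1,125 steps, and warmup covers about the first 56. "Flattens at epoch 2" therefore means the curve is flat by roughly step 750.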

    Integration: iOS via llamadart

    Granola's iOS app uses a Flutter codebase. The Ship section already covers the Flutter via llamadart path; the cleanup integration adds the chunk-splitting wrapper.

    // transcript_cleaner.dart
    import 'package:llamadart/llamadart.dart';
    
    class TranscriptCleaner {
      final LlamaEngine _engine;
      final ChatSession _session;
    
      TranscriptCleaner(this._engine) : _session = ChatSession(_engine);
    
      Future<String> clean(String rawTranscript) async {
        final chunks = _splitByTurns(rawTranscript, maxWords: 500);
        final cleaned = <String>[];
    
        for (final chunk in chunks) {
          final prompt = _buildPrompt(chunk);
          final response = await _session.generate(prompt, maxTokens: 1200);
          _session.reset(); // critical, see Ship: Android note
          cleaned.add(response.trim());
        }
        return cleaned.join('\n\n');
      }
    
      String _buildPrompt(String chunk) =>
          'Clean up the transcript chunk. Preserve speaker labels exactly as given. '
          'Remove filler words (um, uh, like, you know). Add punctuation and capitalisation. '
          'Fix obvious mishearings using context. Do not paraphrase or shorten.\n\n$chunk';
    
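      /// Groups whole speaker turns into chunks of at most [maxWords] words.
      /// A single turn longer than [maxWords] is kept whole rather than cut.
      List<String> _splitByTurns(String raw, {required int maxWords}) {
        // Zero-width lookahead: split before each "[Name]:" and keep the label.
        final turns = raw.split(RegExp(r'(?=\[[^\]]+\]:)'));
        final chunks = <String>[];
        var buffer = StringBuffer();
        var bufferWords = 0;
        for (final turn in turns) {
          // Trim first so surrounding whitespace is not counted as a word,
          // and skip empty pieces (for example, an empty transcript).
          final trimmed = turn.trim();
          if (trimmed.isEmpty) continue;
          final words = trimmed.split(RegExp(r'\s+')).length;
          // Flush before adding a turn that would push the chunk over the cap.
          if (bufferWords + words > maxWords && bufferWords > 0) {
            chunks.add(buffer.toString().trim());
            buffer = StringBuffer();
            bufferWords = 0;
          }
          buffer.write(turn);
          bufferWords += words;
        }
        if (bufferWords > 0) chunks.add(buffer.toString().trim());
        return chunks;
      }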
    }

    Key choices:

    • Split on turn boundaries, not on word count alone. Cutting a turn in half confuses the model and produces a bad join. The regex looks for [Name]: boundaries and chunks at the nearest one before the 500-word cap.
    • session.reset() after every chunk. Without it, cleanup of chunk N+1 starts to inherit context from chunk N, and the speaker voice drifts. See the Ship: Android note on this same llamadart behaviour.
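    • maxTokens: 1200: cleaned output is usually 80 to 95% of the input word count (filler removed), and a word is typically more than one token. 1200 tokens is about 2.4x the 500-word cap, which leaves headroom for tokenisation and for a single long turn that exceeds the cap.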
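
    If you also want training chunks cut on the same boundaries the app uses, the splitting rule is easy to mirror on the dataset side. The Python below is a sketch of that idea, not part of the shipped integration; the function name is hypothetical and it assumes Python 3.7 or later, where re.split accepts zero-width matches.

```python
import re

# Zero-width lookahead: split before each "[Name]:" so the label stays with its turn.
TURN_START = re.compile(r"(?=\[[^\]]+\]:)")

def split_by_turns(raw: str, max_words: int = 500) -> list[str]:
    """Group whole speaker turns into chunks of at most max_words words."""
    turns = [t for t in TURN_START.split(raw) if t.strip()]
    chunks, buffer, buffer_words = [], [], 0
    for turn in turns:
        words = len(turn.split())  # the label itself counts as one word
        # Flush before adding a turn that would push the chunk over the cap.
        if buffer and buffer_words + words > max_words:
            chunks.append("".join(buffer).strip())
            buffer, buffer_words = [], 0
        buffer.append(turn)
        buffer_words += words
    if buffer:
        chunks.append("".join(buffer).strip())
    return chunks
```

    As in the app code, a single turn longer than the cap is kept whole instead of being cut mid-sentence.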

    Surface choices

    Two product-level decisions worth flagging:

    • Run cleanup on-demand, not automatically. Cleaning a 60-minute transcript takes tens of seconds on a recent phone (varies by device, model size, and chunk count). Show a "Clean up transcript" button rather than auto-running; users tolerate the wait when they triggered it, but it feels long when the app forces it.
    • Show a before / after diff so users can see what was changed. This catches paraphrase-creep mistakes early and builds trust. Granola-style apps generally already have a diff view for the AI-generated summary; reusing it here is cheap.

    Probe set

    Eight prompts that exercise the failure modes.

    # | Scenario | Pass criteria
    1 | Heavy filler ("um", "uh", "like" once every five words) | Filler gone; sentences are clean and the meaning is unchanged
    2 | Run-on sentence with no punctuation | Sensible commas and sentence breaks added
    3 | Technical mishear ("the API gate way" should be "the API gateway") | Joined correctly using context
    4 | Name mishear ("Jon Smith" said but ASR wrote "John Smith") | Whichever form appears, the model keeps it consistent; does not autocorrect to the more common spelling
    5 | Overlapping speakers (two [Speaker]: labels with interrupted speech) | Both turns preserved; labels intact
    6 | Slang and jargon ("the dashboard is fubar", "we lost the bridge call") | Preserved; not paraphrased to neutral language
    7 | Long monologue (one speaker for 400 words) | Cleaned; structure preserved; not summarised
    8 | Mixed-language insert ("we have a merci beaucoup in there") | Foreign-language phrase preserved verbatim

    Most of these should pass cleanly. Probe 6 (jargon preservation) is the canonical failure tell: a model that paraphrases "fubar" to "broken" has been over-edited in the dataset.
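
    Several of the pass criteria can be checked automatically before a human reads the output. A minimal sketch with a hypothetical check_probe helper; it covers label preservation, leftover "um"/"uh", and phrases that must survive (probes 5, 1, 6, and 8), and leaves judgement calls such as punctuation quality to manual review.

```python
import re

LABEL = re.compile(r"^\[[^\]]+\]:", re.MULTILINE)
# Only unambiguous fillers; "like" and "you know" have legitimate uses.
FILLER = re.compile(r"\b(um|uh)\b", re.IGNORECASE)

def check_probe(raw: str, cleaned: str, must_keep: tuple[str, ...] = ()) -> list[str]:
    """Return failure descriptions; an empty list means the checks passed."""
    failures = []
    if LABEL.findall(raw) != LABEL.findall(cleaned):
        failures.append("speaker labels changed")
    if FILLER.search(cleaned):
        failures.append("filler left in output")
    for phrase in must_keep:
        if phrase.lower() not in cleaned.lower():
            failures.append(f"lost phrase: {phrase}")
    return failures
```

    For probe 6, pass must_keep=("fubar", "bridge call"); for probe 8, pass the foreign-language phrase.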

    Limits

    • No diarisation recovery. If the upstream diarisation mislabels a turn, the cleanup model preserves the mislabel. Fix diarisation upstream or expose the labels as editable in the UI.
    • Hard mishear cases. Domain-specific terminology the model has not seen will be missed. If the user is in a niche field (medical, legal, niche engineering), add a domain-specific subset to the dataset.
    • Accents outside the training distribution. ASR errors on heavy accents are systematically different. If your user base has a strong accent distribution that AMI and ICSI do not cover, sourcing accented training audio is the fix.
    • Length floors. Very short turns (single-word utterances) sometimes get expanded or annotated by the model. Limit cleanup to chunks above ~10 words in the app; pass shorter turns through untouched.
    • Personal-identifier hallucination on name corrections. The model occasionally tries to "correct" a name to a more common spelling. Probe 4 catches this; the fix is preservation rows in the dataset.
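
    The length floor is a routing decision made before the model is called. A sketch of the logic, shown in Python for brevity (the app-side version would live next to the chunk splitter); the route name and the clean callback are hypothetical.

```python
import re

LABEL = re.compile(r"\[[^\]]+\]:")

def route(chunk: str, clean, min_words: int = 10) -> str:
    """Send the chunk to `clean` only if it has more than min_words spoken words."""
    spoken = LABEL.sub("", chunk).split()  # count words without the speaker labels
    return clean(chunk) if len(spoken) > min_words else chunk
```

    Short acknowledgements such as "[John]: yeah" then pass through untouched, which avoids the expansion behaviour described in the length-floor item.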

    What's next