
    On-Premise Data Cleaning for ML Training Datasets: Deduplication, Normalization, and Quality Scoring

    How to clean ML training datasets on-premise — covering deduplication with MinHash, text normalization, PII redaction, and quality scoring without cloud APIs.

    Ertas Team

    After ingestion, you have structured text. But structured is not the same as clean. Enterprise document collections contain duplicates, encoding artifacts, personally identifiable information, inconsistent formatting, and low-quality records that will degrade model performance if they make it into the training set.

    Cleaning is the stage where most teams underinvest — and where the gap between a model that works in production and one that doesn't is usually determined. This guide covers the practical techniques for cleaning ML training datasets entirely on-premise: no cloud NER services, no external APIs, no data egress.


    Deduplication: Exact and Near-Duplicate

    Duplicates in training data cause models to overfit on repeated examples, inflating performance metrics during evaluation while degrading generalization. In enterprise document collections, duplication is pervasive — the same contract template used 300 times with minor modifications, the same policy document distributed to every department, the same email forwarded through a chain of recipients.

    Exact Deduplication

    The simplest case. Compute a hash (SHA-256) of each document's content and remove records with identical hashes. This catches byte-identical duplicates — the same file saved under different names, or the same document ingested from multiple sources.

    Exact deduplication is fast (O(n) with a hash set) and should always be the first step.
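    A minimal sketch of this step, assuming each record is a dict with a "text" field (an illustrative shape, not a required schema):

```python
import hashlib

def exact_dedup(records):
    """Drop byte-identical duplicates by hashing each document's content."""
    seen = set()
    unique = []
    for rec in records:
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique
```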

    Near-Duplicate Detection

    More valuable and more complex. Near-duplicates are documents with high similarity but not identical content — different versions of the same template, documents with minor edits, or records that share 90% of their content.

    Two practical approaches:

    MinHash with Locality-Sensitive Hashing (LSH): Compute MinHash signatures from document n-grams, then use LSH to efficiently find pairs with high Jaccard similarity. This scales to millions of documents and catches content-level near-duplicates regardless of formatting differences. Typical threshold: 0.8-0.9 Jaccard similarity.

    SimHash: Computes a single fingerprint per document using weighted token features. Documents with Hamming distance below a threshold are flagged as near-duplicates. Faster than MinHash for very large collections but less precise for shorter documents.

    For training datasets, MinHash with LSH is the standard choice. It handles the enterprise case well: finding those 300 nearly-identical contracts and collapsing them to a representative set of 15-20 distinct variants.
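    A sketch using the open-source datasketch library. The docs mapping of document IDs to text, the 5-character shingle size, and the 0.85 threshold are illustrative assumptions to tune for your corpus:

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(text, num_perm=128, ngram=5):
    """Build a MinHash signature from character n-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - ngram + 1, 1)):
        m.update(text[i:i + ngram].encode("utf-8"))
    return m

# Index every document, then query for candidates above the Jaccard threshold.
lsh = MinHashLSH(threshold=0.85, num_perm=128)
signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
for doc_id, sig in signatures.items():
    lsh.insert(doc_id, sig)

near_duplicates = {doc_id: [match for match in lsh.query(sig) if match != doc_id]
                   for doc_id, sig in signatures.items()}
```

    LSH returns candidate pairs, so it is worth verifying each candidate with an exact Jaccard comparison before collapsing clusters.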

    What to Do with Near-Duplicates

    Don't just delete them. Near-duplicates carry information about which portions of a document are stable (boilerplate) versus variable (the parts that matter). Options:

    • Keep one representative: Select the highest-quality version and discard the rest
    • Keep all but mark: Include a cluster ID so downstream stages can weight or sample accordingly
    • Merge: For template documents, extract the variable portions and create a single training example covering the variation space

    Text Normalization

    Normalization makes text consistent without changing its meaning. Enterprise documents are remarkably inconsistent in ways that matter for model training.

    Encoding Normalization

    Convert everything to UTF-8 NFC (canonical decomposition followed by canonical composition). This handles:

    • Windows-1252 "smart quotes" that appear as mojibake in UTF-8
    • Multiple Unicode representations of the same character (e.g., "é" as a single codepoint vs. "e" + combining accent)
    • Zero-width spaces, byte order marks, and other invisible characters that break tokenization
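    A minimal sketch using only the Python standard library; repairing mojibake from mis-decoded Windows-1252 (for example with the ftfy library) is a separate step not shown here:

```python
import re
import unicodedata

# Zero-width spaces, zero-width joiners/non-joiners, and byte order marks.
INVISIBLE = re.compile(r"[\u200b\u200c\u200d\ufeff]")

def normalize_encoding(text):
    """Normalize to Unicode NFC and strip invisible characters that break tokenization."""
    text = unicodedata.normalize("NFC", text)
    return INVISIBLE.sub("", text)
```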

    Whitespace Normalization

    • Collapse multiple spaces to single spaces
    • Normalize line endings (CRLF → LF)
    • Remove trailing whitespace from lines
    • Handle tab-to-space conversion consistently
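    A sketch of these whitespace rules; adjust the tab handling if your documents rely on tab-aligned layout:

```python
import re

def normalize_whitespace(text):
    """Normalize line endings to LF, collapse runs of spaces and tabs,
    and strip trailing whitespace from every line."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [re.sub(r"[ \t]+", " ", line).rstrip() for line in text.split("\n")]
    return "\n".join(lines)
```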

    Common Enterprise-Specific Normalization

    • Date formats: "03/11/2026" vs "March 11, 2026" vs "2026-03-11" — decide on a canonical format or normalize to ISO 8601
    • Number formats: "1,000,000" vs "1000000" vs "1.000.000" (European) — normalize based on locale
    • Abbreviations: "Dr." vs "Doctor", "Inc." vs "Incorporated" — maintain a domain-specific normalization dictionary
    • Legal citations: "42 U.S.C. § 1983" appears in dozens of formats across documents — normalize to a canonical form
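    As an example of the date case, a hedged sketch using the python-dateutil package; ambiguous strings like "03/11/2026" need an explicit dayfirst decision based on the document's locale:

```python
from dateutil import parser

def to_iso_date(raw, dayfirst=False):
    """Parse a date string in many common formats and return ISO 8601,
    or None if it cannot be parsed. Set dayfirst=True for European-style input."""
    try:
        return parser.parse(raw, dayfirst=dayfirst).date().isoformat()
    except (ValueError, OverflowError):
        return None

# "03/11/2026", "March 11, 2026", and "2026-03-11" all normalize to "2026-03-11".
```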

    PII and PHI Detection and Redaction

    For healthcare (HIPAA), financial (GLBA, SOC 2), and legal (attorney-client privilege) data, PII/PHI redaction isn't optional. It's a compliance requirement. And it must happen on-premise — sending documents to a cloud NER service for PII detection defeats the purpose.

    On-Premise PII Detection Approaches

    Rule-based (regex + patterns): Catches structured PII with high precision:

    • Social Security Numbers: \d{3}-\d{2}-\d{4}
    • Phone numbers: Various formats per locale
    • Email addresses: Standard pattern matching
    • Credit card numbers: Luhn-validated patterns
    • Dates of birth: When combined with context clues

    Strengths: fast, predictable, and very few false negatives for identifiers that follow well-defined formats. Weakness: misses contextual PII (names, addresses, medical conditions mentioned in free text).
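    A sketch of a rule-based first pass; the patterns below are simplified, US-centric examples rather than production-grade validators:

```python
import re

PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}

def find_structured_pii(text):
    """Return (label, match, start, end) tuples for pattern-based PII."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.group(), m.start(), m.end()))
    return hits
```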

    NER models (local): spaCy's NER pipelines, Stanza, or fine-tuned transformer models running locally can detect names, organizations, locations, and other contextual entities. Accuracy varies by domain: a general-purpose NER model will miss many medical terms, legal entities, or financial identifiers.
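    A sketch of a local NER second pass with spaCy; the small English pipeline and the set of entity labels to flag are assumptions to adapt per domain:

```python
import spacy

# Assumes the model has been downloaded locally: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def find_contextual_pii(text, labels=("PERSON", "ORG", "GPE", "DATE")):
    """Flag contextual entities that the regex pass cannot catch."""
    doc = nlp(text)
    return [(ent.label_, ent.text, ent.start_char, ent.end_char)
            for ent in doc.ents if ent.label_ in labels]
```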

    Local LLM-assisted detection: A local 7B+ model prompted to identify PII in text passages. More flexible than rule-based or NER approaches but slower and less deterministic. Best used as a second pass after rule-based and NER detection.

    Redaction Strategies

    • Replacement: Swap PII with typed placeholders — [NAME], [SSN], [DATE_OF_BIRTH]. Preserves sentence structure for training.
    • Consistent pseudonymization: Replace each unique entity with a consistent fake — "Dr. Smith" → "Dr. Johnson" throughout the dataset. Preserves entity relationships.
    • Removal: Delete the PII and surrounding context. Loses information but is the most conservative approach.

    For training data, replacement with typed placeholders is usually the best balance — the model learns the pattern of where PII appears without memorizing specific identifiers.
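    A sketch of replacement with typed placeholders, reusing the (label, match, start, end) hits from the detection passes above; it assumes spans do not overlap, and in a real pipeline each substitution would also be written to the redaction log:

```python
def redact(text, hits):
    """Replace detected spans with typed placeholders such as [SSN] or [EMAIL]."""
    # Splice from the end of the string so earlier offsets stay valid.
    for label, _, start, end in sorted(hits, key=lambda h: h[2], reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text
```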


    Quality Scoring Without Cloud APIs

    Not all training examples are equally valuable. Quality scoring identifies records that are likely to improve model performance (high quality) versus records that are likely to add noise (low quality).

    Heuristic Quality Signals

    These require no model inference and provide fast, baseline quality estimates:

    • Text length: catches records that are too short (no content) or too long (concatenated garbage). Threshold: domain-dependent, typically 50-5000 tokens.
    • Sentence count: catches single-sentence "documents" that lack context. Threshold: minimum 3-5 sentences for most use cases.
    • Vocabulary diversity: catches repetitive text (copy-paste errors, boilerplate). Threshold: type-token ratio below 0.3 is suspicious.
    • Special character ratio: catches OCR artifacts and encoding corruption. Threshold: more than 5% non-alphanumeric characters is a flag.
    • Language detection confidence: catches mixed-language documents and garbled text. Threshold: below 0.8 confidence warrants review.
    • Perplexity (local model): catches incoherent or corrupted text. Threshold: high perplexity relative to the corpus average.
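    A sketch of the first four signals; the thresholds mirror the list above and should be tuned per domain:

```python
import re

def quality_signals(text):
    """Compute cheap heuristic quality signals for a single record."""
    tokens = text.split()
    types = {t.lower() for t in tokens}
    non_alnum = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return {
        "token_count": len(tokens),
        "sentence_count": len(re.findall(r"[.!?]+", text)),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "special_char_ratio": non_alnum / len(text) if text else 0.0,
    }

def passes_heuristics(sig, min_tokens=50, max_tokens=5000):
    """Apply baseline thresholds; records that fail are removed or flagged."""
    return (min_tokens <= sig["token_count"] <= max_tokens
            and sig["sentence_count"] >= 3
            and sig["type_token_ratio"] >= 0.3
            and sig["special_char_ratio"] <= 0.05)
```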

    Cleanlab-Style Confident Learning

    Cleanlab is the leading open-source library for finding label errors and low-quality examples in datasets. It uses confident learning — comparing model predictions against provided labels to identify likely mislabeled or ambiguous examples.

    Cleanlab works well. The limitation for service providers is that it's a Python library requiring ML engineering expertise to configure and run. It doesn't provide a GUI, doesn't produce audit-ready reports, and requires integration into a custom pipeline.
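    For teams that do have that capacity, the core call is compact. A hedged sketch against cleanlab's 2.x API, assuming you already have integer labels and out-of-sample predicted probabilities from a locally trained classifier:

```python
from cleanlab.filter import find_label_issues

# labels: the class labels you provided for each record.
# pred_probs: out-of-sample predicted probabilities from any local model
# (e.g. obtained via cross-validation), shape (n_records, n_classes).
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
# Records at the front of issue_indices are the most likely label errors.
```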

    Local Embedding-Based Quality Scoring

    Compute embeddings for all records using a local embedding model (e.g., all-MiniLM-L6-v2 via sentence-transformers). Then:

    • Outlier detection: Records whose embeddings are far from any cluster center may be off-topic or corrupted
    • Coherence scoring: Records whose embeddings are close to the corpus centroid are typical; records at the periphery warrant review
    • Diversity assessment: Ensure the training set covers the embedding space evenly, not clustered in one region

    This approach works entirely locally and provides useful quality signals without labeling or model training.
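    A sketch of the outlier-detection variant using sentence-transformers; scoring by distance to the corpus centroid is one simple option among several (per-cluster distances work similarly):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # runs fully locally once downloaded

def centroid_distances(texts):
    """Score records by cosine distance from the corpus centroid.
    Records far from the centroid are outlier candidates worth reviewing."""
    embeddings = model.encode(texts, normalize_embeddings=True)
    centroid = embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    return 1.0 - embeddings @ centroid

# Example: flag the most atypical 10% of records for review.
# distances = centroid_distances(texts)
# review_idx = np.argsort(distances)[-int(0.1 * len(texts)):]
```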


    Practical Cleaning Workflow

    A realistic cleaning workflow for an enterprise dataset:

    1. Exact deduplication — Remove byte-identical duplicates. Fast, no false positives.
    2. Encoding normalization — Convert to UTF-8 NFC. Fix mojibake.
    3. Whitespace and format normalization — Consistent spacing, line endings, number formats.
    4. Near-duplicate detection — MinHash/LSH with 0.85 threshold. Review clusters, select representatives.
    5. PII/PHI detection — Rule-based first pass, NER second pass, manual review of flagged items.
    6. PII redaction — Apply chosen redaction strategy. Log every redaction.
    7. Heuristic quality filtering — Remove records failing basic quality checks.
    8. Quality scoring — Rank remaining records by quality. Review bottom 10%.
    9. Human review — Domain experts review flagged records and edge cases.

    Each step should log what was removed, modified, or flagged — and why. This log is your audit trail.

    Ertas Data Suite's Clean module handles this entire workflow with built-in deduplication (exact and near-duplicate), normalization, PII detection, and quality scoring — accessible through a visual interface that domain experts and compliance officers can operate directly. Every action is logged to the project audit trail automatically.


    Connecting to the Pipeline

    Clean data feeds into labeling, where human annotators and local LLM co-pilots apply the labels needed for fine-tuning. The cleaner the data entering the labeling stage, the faster and more accurate labeling becomes — domain experts spend time on substantive labeling decisions rather than fixing data quality issues that should have been caught earlier.

    For the complete pipeline overview, see How to Build an On-Premise Data Preparation Pipeline for LLM Fine-Tuning.
