
    On-Premise Data Cleaning for ML Training Datasets: Deduplication, Normalization, and Quality Scoring

    How to clean ML training datasets on-premise — covering deduplication with MinHash, text normalization, PII redaction, and quality scoring without cloud APIs.

    Ertas Team

    After ingestion, you have structured text. But structured is not the same as clean. Enterprise document collections contain duplicates, encoding artifacts, personally identifiable information, inconsistent formatting, and low-quality records that will degrade model performance if they make it into the training set.

    Cleaning is the stage where most teams underinvest — and where the gap between a model that works in production and one that doesn't is usually determined. This guide covers the practical techniques for cleaning ML training datasets entirely on-premise: no cloud NER services, no external APIs, no data egress.


    Deduplication: Exact and Near-Duplicate

    Duplicates in training data cause models to overfit on repeated examples, inflating performance metrics during evaluation while degrading generalization. In enterprise document collections, duplication is pervasive — the same contract template used 300 times with minor modifications, the same policy document distributed to every department, the same email forwarded through a chain of recipients.

    Exact Deduplication

    The simplest case. Compute a hash (SHA-256) of each document's content and remove records with identical hashes. This catches byte-identical duplicates — the same file saved under different names, or the same document ingested from multiple sources.

    Exact deduplication is fast (O(n) with a hash set) and should always be the first step.
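    A minimal sketch of this step, assuming each record is a dict with a "text" field (an illustrative shape, not a required schema):

```python
import hashlib

def exact_dedup(records):
    """Drop byte-identical duplicates by hashing each document's content."""
    seen = set()
    unique = []
    for rec in records:
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique
```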

    Near-Duplicate Detection

    More valuable and more complex. Near-duplicates are documents with high similarity but not identical content — different versions of the same template, documents with minor edits, or records that share 90% of their content.

    Two practical approaches:

    MinHash with Locality-Sensitive Hashing (LSH): Compute MinHash signatures from document n-grams, then use LSH to efficiently find pairs with high Jaccard similarity. This scales to millions of documents and catches content-level near-duplicates regardless of formatting differences. Typical threshold: 0.8-0.9 Jaccard similarity.

    SimHash: Computes a single fingerprint per document using weighted token features. Documents with Hamming distance below a threshold are flagged as near-duplicates. Faster than MinHash for very large collections but less precise for shorter documents.

    For training datasets, MinHash with LSH is the standard choice. It handles the enterprise case well: finding those 300 nearly-identical contracts and collapsing them to a representative set of 15-20 distinct variants.
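    A sketch using the open-source datasketch library. The docs mapping of document IDs to text, the 5-character shingle size, and the 0.85 threshold are illustrative assumptions to tune for your corpus:

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(text, num_perm=128, ngram=5):
    """Build a MinHash signature from character n-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - ngram + 1, 1)):
        m.update(text[i:i + ngram].encode("utf-8"))
    return m

# Index every document, then query for candidates above the Jaccard threshold.
lsh = MinHashLSH(threshold=0.85, num_perm=128)
signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
for doc_id, sig in signatures.items():
    lsh.insert(doc_id, sig)

near_duplicates = {doc_id: [match for match in lsh.query(sig) if match != doc_id]
                   for doc_id, sig in signatures.items()}
```

    LSH returns candidate pairs, so it is worth verifying each candidate with an exact Jaccard comparison before collapsing clusters.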

    What to Do with Near-Duplicates

    Don't just delete them. Near-duplicates carry information about which portions of a document are stable (boilerplate) versus variable (the parts that matter). Options:

    • Keep one representative: Select the highest-quality version and discard the rest
    • Keep all but mark: Include a cluster ID so downstream stages can weight or sample accordingly
    • Merge: For template documents, extract the variable portions and create a single training example covering the variation space

    Text Normalization

    Normalization makes text consistent without changing its meaning. Enterprise documents are remarkably inconsistent in ways that matter for model training.

    Encoding Normalization

    Convert everything to UTF-8 NFC (canonical decomposition followed by canonical composition). This handles:

    • Windows-1252 "smart quotes" that appear as mojibake in UTF-8
    • Multiple Unicode representations of the same character (e.g., "é" as a single codepoint vs. "e" + combining accent)
    • Zero-width spaces, byte order marks, and other invisible characters that break tokenization
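    A minimal sketch using only the Python standard library; repairing mojibake from mis-decoded Windows-1252 (for example with the ftfy library) is a separate step not shown here:

```python
import re
import unicodedata

# Zero-width spaces, zero-width joiners/non-joiners, and byte order marks.
INVISIBLE = re.compile(r"[\u200b\u200c\u200d\ufeff]")

def normalize_encoding(text):
    """Normalize to Unicode NFC and strip invisible characters that break tokenization."""
    text = unicodedata.normalize("NFC", text)
    return INVISIBLE.sub("", text)
```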

    Whitespace Normalization

    • Collapse multiple spaces to single spaces
    • Normalize line endings (CRLF → LF)
    • Remove trailing whitespace from lines
    • Handle tab-to-space conversion consistently
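    A sketch of these whitespace rules; adjust the tab handling if your documents rely on tab-aligned layout:

```python
import re

def normalize_whitespace(text):
    """Normalize line endings to LF, collapse runs of spaces and tabs,
    and strip trailing whitespace from every line."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [re.sub(r"[ \t]+", " ", line).rstrip() for line in text.split("\n")]
    return "\n".join(lines)
```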

    Common Enterprise-Specific Normalization

    • Date formats: "03/11/2026" vs "March 11, 2026" vs "2026-03-11" — decide on a canonical format or normalize to ISO 8601
    • Number formats: "1,000,000" vs "1000000" vs "1.000.000" (European) — normalize based on locale
    • Abbreviations: "Dr." vs "Doctor", "Inc." vs "Incorporated" — maintain a domain-specific normalization dictionary
    • Legal citations: "42 U.S.C. § 1983" appears in dozens of formats across documents — normalize to a canonical form
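    As an example of the date case, a hedged sketch using the python-dateutil package; ambiguous strings like "03/11/2026" need an explicit dayfirst decision based on the document's locale:

```python
from dateutil import parser

def to_iso_date(raw, dayfirst=False):
    """Parse a date string in many common formats and return ISO 8601,
    or None if it cannot be parsed. Set dayfirst=True for European-style input."""
    try:
        return parser.parse(raw, dayfirst=dayfirst).date().isoformat()
    except (ValueError, OverflowError):
        return None

# "03/11/2026", "March 11, 2026", and "2026-03-11" all normalize to "2026-03-11".
```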

    PII and PHI Detection and Redaction

    For healthcare (HIPAA), financial (GLBA, SOC 2), and legal (attorney-client privilege) data, PII/PHI redaction isn't optional. It's a compliance requirement. And it must happen on-premise — sending documents to a cloud NER service for PII detection defeats the purpose.

    On-Premise PII Detection Approaches

    Rule-based (regex + patterns): Catches structured PII with high precision:

    • Social Security Numbers: \d{3}-\d{2}-\d{4}
    • Phone numbers: Various formats per locale
    • Email addresses: Standard pattern matching
    • Credit card numbers: Luhn-validated patterns
    • Dates of birth: When combined with context clues

    Strengths: fast, predictable, and very few false negatives for identifiers that follow well-defined formats. Weakness: misses contextual PII (names, addresses, medical conditions mentioned in free text).
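    A sketch of a rule-based first pass; the patterns below are simplified, US-centric examples rather than production-grade validators:

```python
import re

PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}

def find_structured_pii(text):
    """Return (label, match, start, end) tuples for pattern-based PII."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.group(), m.start(), m.end()))
    return hits
```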

    NER models (local): spaCy's NER pipelines, Stanza, or fine-tuned transformer models running locally can detect names, organizations, locations, and other contextual entities. Accuracy varies by domain: a general-purpose NER model will miss many medical terms, legal entities, or financial identifiers.
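    A sketch of a local NER second pass with spaCy; the small English pipeline and the set of entity labels to flag are assumptions to adapt per domain:

```python
import spacy

# Assumes the model has been downloaded locally: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def find_contextual_pii(text, labels=("PERSON", "ORG", "GPE", "DATE")):
    """Flag contextual entities that the regex pass cannot catch."""
    doc = nlp(text)
    return [(ent.label_, ent.text, ent.start_char, ent.end_char)
            for ent in doc.ents if ent.label_ in labels]
```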

    Local LLM-assisted detection: A local 7B+ model prompted to identify PII in text passages. More flexible than rule-based or NER approaches but slower and less deterministic. Best used as a second pass after rule-based and NER detection.

    Redaction Strategies

    • Replacement: Swap PII with typed placeholders — [NAME], [SSN], [DATE_OF_BIRTH]. Preserves sentence structure for training.
    • Consistent pseudonymization: Replace each unique entity with a consistent fake — "Dr. Smith" → "Dr. Johnson" throughout the dataset. Preserves entity relationships.
    • Removal: Delete the PII and surrounding context. Loses information but is the most conservative approach.

    For training data, replacement with typed placeholders is usually the best balance — the model learns the pattern of where PII appears without memorizing specific identifiers.
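    A sketch of replacement with typed placeholders, reusing the (label, match, start, end) hits from the detection passes above; it assumes spans do not overlap, and in a real pipeline each substitution would also be written to the redaction log:

```python
def redact(text, hits):
    """Replace detected spans with typed placeholders such as [SSN] or [EMAIL]."""
    # Splice from the end of the string so earlier offsets stay valid.
    for label, _, start, end in sorted(hits, key=lambda h: h[2], reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text
```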


    Quality Scoring Without Cloud APIs

    Not all training examples are equally valuable. Quality scoring identifies records that are likely to improve model performance (high quality) versus records that are likely to add noise (low quality).

    Heuristic Quality Signals

    These require no model inference and provide fast, baseline quality estimates:

    • Text length: catches records that are too short (no content) or too long (concatenated garbage). Threshold: domain-dependent, typically 50-5000 tokens.
    • Sentence count: catches single-sentence "documents" that lack context. Threshold: minimum 3-5 sentences for most use cases.
    • Vocabulary diversity: catches repetitive text (copy-paste errors, boilerplate). Threshold: type-token ratio below 0.3 is suspicious.
    • Special character ratio: catches OCR artifacts and encoding corruption. Threshold: more than 5% non-alphanumeric characters is a flag.
    • Language detection confidence: catches mixed-language documents and garbled text. Threshold: below 0.8 confidence warrants review.
    • Perplexity (local model): catches incoherent or corrupted text. Threshold: high perplexity relative to the corpus average.
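    A sketch of the first four signals; the thresholds mirror the list above and should be tuned per domain:

```python
import re

def quality_signals(text):
    """Compute cheap heuristic quality signals for a single record."""
    tokens = text.split()
    types = {t.lower() for t in tokens}
    non_alnum = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return {
        "token_count": len(tokens),
        "sentence_count": len(re.findall(r"[.!?]+", text)),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "special_char_ratio": non_alnum / len(text) if text else 0.0,
    }

def passes_heuristics(sig, min_tokens=50, max_tokens=5000):
    """Apply baseline thresholds; records that fail are removed or flagged."""
    return (min_tokens <= sig["token_count"] <= max_tokens
            and sig["sentence_count"] >= 3
            and sig["type_token_ratio"] >= 0.3
            and sig["special_char_ratio"] <= 0.05)
```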

    Cleanlab-Style Confident Learning

    Cleanlab is the leading open-source library for finding label errors and low-quality examples in datasets. It uses confident learning — comparing model predictions against provided labels to identify likely mislabeled or ambiguous examples.

    Cleanlab works well. The limitation for service providers is that it's a Python library requiring ML engineering expertise to configure and run. It doesn't provide a GUI, doesn't produce audit-ready reports, and requires integration into a custom pipeline.
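    For teams that do have that capacity, the core call is compact. A hedged sketch against cleanlab's 2.x API, assuming you already have integer labels and out-of-sample predicted probabilities from a locally trained classifier:

```python
from cleanlab.filter import find_label_issues

# labels: the class labels you provided for each record.
# pred_probs: out-of-sample predicted probabilities from any local model
# (e.g. obtained via cross-validation), shape (n_records, n_classes).
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
# Records at the front of issue_indices are the most likely label errors.
```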

    Local Embedding-Based Quality Scoring

    Compute embeddings for all records using a local embedding model (e.g., all-MiniLM-L6-v2 via sentence-transformers). Then:

    • Outlier detection: Records whose embeddings are far from any cluster center may be off-topic or corrupted
    • Coherence scoring: Records whose embeddings are close to the corpus centroid are typical; records at the periphery warrant review
    • Diversity assessment: Ensure the training set covers the embedding space evenly, not clustered in one region

    This approach works entirely locally and provides useful quality signals without labeling or model training.
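    A sketch of the outlier-detection variant using sentence-transformers; scoring by distance to the corpus centroid is one simple option among several (per-cluster distances work similarly):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # runs fully locally once downloaded

def centroid_distances(texts):
    """Score records by cosine distance from the corpus centroid.
    Records far from the centroid are outlier candidates worth reviewing."""
    embeddings = model.encode(texts, normalize_embeddings=True)
    centroid = embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    return 1.0 - embeddings @ centroid

# Example: flag the most atypical 10% of records for review.
# distances = centroid_distances(texts)
# review_idx = np.argsort(distances)[-int(0.1 * len(texts)):]
```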


    Practical Cleaning Workflow

    A realistic cleaning workflow for an enterprise dataset:

    1. Exact deduplication — Remove byte-identical duplicates. Fast, no false positives.
    2. Encoding normalization — Convert to UTF-8 NFC. Fix mojibake.
    3. Whitespace and format normalization — Consistent spacing, line endings, number formats.
    4. Near-duplicate detection — MinHash/LSH with 0.85 threshold. Review clusters, select representatives.
    5. PII/PHI detection — Rule-based first pass, NER second pass, manual review of flagged items.
    6. PII redaction — Apply chosen redaction strategy. Log every redaction.
    7. Heuristic quality filtering — Remove records failing basic quality checks.
    8. Quality scoring — Rank remaining records by quality. Review bottom 10%.
    9. Human review — Domain experts review flagged records and edge cases.

    Each step should log what was removed, modified, or flagged — and why. This log is your audit trail.

    Ertas Data Suite's Clean module handles this entire workflow with built-in deduplication (exact and near-duplicate), normalization, PII detection, and quality scoring — accessible through a visual interface that domain experts and compliance officers can operate directly. Every action is logged to the project audit trail automatically.


    Connecting to the Pipeline

    Clean data feeds into labeling, where human annotators and local LLM co-pilots apply the labels needed for fine-tuning. The cleaner the data entering the labeling stage, the faster and more accurate labeling becomes — domain experts spend time on substantive labeling decisions rather than fixing data quality issues that should have been caught earlier.

    For the complete pipeline overview, see How to Build an On-Premise Data Preparation Pipeline for LLM Fine-Tuning.
