
Data Quality Scoring for Training Datasets Without Cloud APIs
How to score training data quality on-premise — covering label accuracy, inter-annotator agreement, outlier detection, and confidence learning without cloud dependencies.
A training dataset isn't ready because it exists. It's ready when you can quantify its quality — and when that quality is high enough that a model trained on it will perform acceptably in production.
Most teams treat data quality as a binary: the data is "clean" or it isn't. In practice, quality is a spectrum across multiple dimensions, and different problems in the data cause different failure modes in the trained model. Mislabeled examples cause the model to learn wrong patterns. Duplicate clusters cause overfitting. Distribution imbalances cause poor performance on minority classes. Outliers introduce noise.
Scoring quality across these dimensions — without sending data to cloud APIs — is the focus of this guide.
Quality Dimensions for Training Data
Label Accuracy
The most impactful quality dimension. If 10% of your labels are wrong, your model's performance ceiling is roughly 90% — and in practice it's lower because wrong labels don't just reduce accuracy, they actively teach incorrect patterns.
How to measure on-premise:
Cross-validation confidence: Train a small model on the dataset and check which examples the model consistently gets wrong. Examples where the model disagrees with the label are candidates for label errors. This is the foundation of Cleanlab's confident learning approach.
Local LLM verification: Use a local model to independently predict labels for each example. Compare model predictions against human labels. Disagreements warrant human re-review. A 7B instruction-following model won't match human expert accuracy on domain-specific tasks, but it catches obvious errors — and obvious errors are the ones that damage model performance most.
Annotator self-consistency: If the same annotator labeled the same content at different times, did they agree with themselves? Low self-consistency indicates ambiguous labeling guidelines or annotator fatigue.
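A minimal sketch of the local LLM verification step, assuming an Ollama server on its default port with an instruction-tuned model already pulled (the model name, prompt wording, and record format here are illustrative, not a fixed recipe):

```python
# Sketch: verify human labels with a local LLM via Ollama's REST API.
# Assumes Ollama is running locally; model/prompt details are illustrative.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def llm_predict_label(text: str, classes: list[str], model: str = "llama3") -> str:
    prompt = (
        f"Classify the following text into exactly one of: {', '.join(classes)}.\n"
        f"Reply with the class name only.\n\nText: {text}"
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

def flag_disagreements(records, classes):
    """Flag records where the LLM disagrees with the human label."""
    flagged = []
    for rec in records:  # each rec: {"text": ..., "label": ...}
        predicted = llm_predict_label(rec["text"], classes)
        if predicted.lower() != rec["label"].lower():
            flagged.append({**rec, "llm_prediction": predicted})
    return flagged
```

Disagreements go to a human review queue; the LLM prediction is a triage signal, never an automatic relabel.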
Inter-Annotator Agreement
When multiple annotators label the same examples, their agreement rate indicates how well-defined the task is and how reliable the labels are.
Cohen's Kappa: Measures agreement between two annotators, corrected for chance agreement. Values above 0.8 indicate strong agreement; below 0.6 suggests the labeling guidelines need revision.
Fleiss' Kappa: Extends to multiple annotators. Useful when you have a pool of domain experts and different experts label different subsets.
Krippendorff's Alpha: Handles missing data (not every annotator labels every example) and works with ordinal, interval, and nominal data types. The most flexible agreement metric.
For service providers, inter-annotator agreement is also a quality deliverable. When you hand the client a dataset with a Krippendorff's Alpha of 0.85, that's a measurable quality claim backed by evidence.
| Agreement Metric | Score Range | Interpretation |
|---|---|---|
| Cohen's Kappa | 0.81-1.00 | Almost perfect agreement |
| Cohen's Kappa | 0.61-0.80 | Substantial agreement |
| Cohen's Kappa | 0.41-0.60 | Moderate agreement — review guidelines |
| Cohen's Kappa | 0.21-0.40 | Fair — significant labeling issues |
| Cohen's Kappa | Below 0.20 | Slight — task definition is unclear |
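All three metrics can be computed locally. A minimal sketch, assuming scikit-learn for Cohen's Kappa and the third-party krippendorff package (pip install krippendorff); the toy annotations are illustrative:

```python
# Sketch: agreement metrics on-premise with scikit-learn and `krippendorff`.
import numpy as np
from sklearn.metrics import cohen_kappa_score
import krippendorff

# Two annotators labeling the same 8 examples (toy data).
annotator_a = ["spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham"]
annotator_b = ["spam", "ham", "spam", "ham", "ham", "ham", "spam", "spam"]
print("Cohen's Kappa:", cohen_kappa_score(annotator_a, annotator_b))

# Krippendorff's Alpha tolerates missing labels (np.nan = annotator skipped).
# Rows are annotators, columns are examples; values are numeric class ids.
reliability_data = np.array([
    [1, 0, 1,      1, 0, 0,      1, 0],
    [1, 0, 1,      0, 0, 0,      1, 1],
    [1, 0, np.nan, 1, 0, np.nan, 1, 0],
])
print("Krippendorff's Alpha:",
      krippendorff.alpha(reliability_data=reliability_data,
                         level_of_measurement="nominal"))
```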
Data Distribution Balance
Class imbalance directly affects model performance. A model trained on a dataset that's 90% Class A and 10% Class B will achieve high overall accuracy by simply predicting Class A — while failing on the class that probably matters most.
Metrics to track:
- Class frequency distribution (bar chart of label counts)
- Imbalance ratio (majority class count / minority class count)
- Effective number of samples per class (accounting for near-duplicates)
Thresholds: Imbalance ratios above 10:1 typically require mitigation through data augmentation, oversampling, undersampling, or class-weighted training.
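A minimal sketch of the imbalance check, using only the standard library (the 10:1 threshold mirrors the guidance above):

```python
# Sketch: class frequency distribution and imbalance ratio from raw labels.
from collections import Counter

def imbalance_report(labels):
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return {
        "class_frequencies": dict(counts),
        "imbalance_ratio": majority / minority,
        "needs_mitigation": majority / minority > 10,  # the 10:1 rule above
    }

print(imbalance_report(["A"] * 900 + ["B"] * 90 + ["C"] * 10))
# -> imbalance_ratio 90.0, needs_mitigation True
```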
Duplicate Detection
Near-duplicates inflate the effective size of the dataset without adding information. They cause models to overfit on the duplicated content and reduce generalization.
Detection approaches (all on-premise):
MinHash/LSH: Efficient near-duplicate detection at scale. Compute MinHash signatures from n-grams, use LSH for fast pairwise comparison. Catches content-level duplicates even when formatting differs.
Embedding clustering: Compute embeddings with a local model, then identify clusters with very high internal similarity. Records within a tight cluster are near-duplicates.
Exact hash: SHA-256 hash of normalized content. Catches byte-identical duplicates.
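A sketch of the MinHash/LSH approach above, assuming the datasketch package (pip install datasketch); the n-gram size, num_perm, and 0.8 similarity threshold are tunable starting points, not fixed recommendations:

```python
# Sketch: near-duplicate detection with MinHash signatures + LSH lookup.
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128, n: int = 3) -> MinHash:
    """Build a MinHash signature from word n-grams of the text."""
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(max(1, len(tokens) - n + 1)):
        m.update(" ".join(tokens[i:i + n]).encode("utf8"))
    return m

def find_near_duplicates(records, threshold: float = 0.8):
    """records: iterable of (record_id, text). Returns candidate pairs."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    pairs = []
    for rec_id, text in records:
        sig = minhash_signature(text)
        for match in lsh.query(sig):  # earlier records above the threshold
            pairs.append((match, rec_id))
        lsh.insert(rec_id, sig)
    return pairs
```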
Impact of duplicates: Research consistently shows that training on deduplicated data produces models with better generalization, even when the deduplicated dataset is smaller. Removing 20% of a dataset through deduplication typically improves model quality.
Outlier Identification
Outliers are records that don't belong — off-topic content, corrupted text, records from a different domain that leaked into the dataset. They add noise to training and can cause unexpected model behavior.
Statistical outlier detection: Compute record-level features (length, vocabulary diversity, PII density) and flag records that fall outside 2-3 standard deviations.
Embedding-based outlier detection: Records whose embeddings are far from all cluster centers in the embedding space are potential outliers. Compute cosine distance to nearest cluster center; records above a threshold warrant review.
Perplexity-based detection: Score each record's perplexity using a local language model. Records with unusually high perplexity are likely corrupted, off-topic, or in a different language.
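A minimal sketch of the statistical approach, flagging records whose length or type-token ratio sits more than three standard deviations from the mean (the features and threshold are illustrative; add PII density or other features the same way):

```python
# Sketch: z-score outlier flagging on simple record-level features.
import numpy as np

def zscore_outliers(records, z_threshold: float = 3.0):
    lengths = np.array([len(r.split()) for r in records], dtype=float)
    ttr = np.array(  # type-token ratio as a vocabulary-diversity proxy
        [len(set(r.split())) / max(1, len(r.split())) for r in records]
    )
    flagged = set()
    for feature in (lengths, ttr):
        z = (feature - feature.mean()) / (feature.std() + 1e-9)
        flagged.update(np.where(np.abs(z) > z_threshold)[0].tolist())
    return sorted(flagged)  # indices to send for human review
```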
Cleanlab: What It Does Well and Where It Falls Short
Cleanlab is the most established library for data quality scoring in ML datasets. Its confident learning algorithm identifies potential label errors by analyzing the relationship between model predictions and provided labels.
What Cleanlab Does Well
- Label error detection: Finds mislabeled examples with high precision. In published benchmarks, Cleanlab typically identifies 50-80% of label errors while keeping false positive rates below 20%.
- Confidence scoring: Assigns a quality score to every example based on how consistent the label is with what a model would predict.
- Multi-class support: Works with any number of classes, including multi-label scenarios.
- Dataset-level quality metrics: Provides overall dataset health scores and per-class quality breakdowns.
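A minimal sketch of the typical confident-learning flow: out-of-sample predicted probabilities from cross-validation, then Cleanlab's label-issue filter. The synthetic features and logistic-regression model below are placeholders for your own feature matrix and classifier:

```python
# Sketch: Cleanlab-style label error detection via cross-validated
# predicted probabilities.
import numpy as np
from cleanlab.filter import find_label_issues
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))        # stand-in feature matrix
labels = (X[:, 0] > 0).astype(int)    # stand-in integer class ids
labels[:10] = 1 - labels[:10]         # inject some label errors

# Out-of-sample probabilities: each record scored by a model that
# never trained on it.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels,
    cv=5, method="predict_proba",
)

# Indices of likely label errors, ranked worst-first for human review.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(issue_indices)} candidate label errors flagged")
```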
Where Cleanlab Falls Short for Service Providers
Python-only: Cleanlab is a Python library. Using it requires writing Python scripts, configuring model training for the confident learning step, and interpreting programmatic output. This isn't a problem for ML engineers but makes it inaccessible to domain experts and compliance officers.
No GUI: Results are returned as arrays and DataFrames. There's no visual interface for reviewing flagged examples, no way for a non-technical user to inspect quality scores, and no built-in reporting for compliance reviews.
No audit trail: Cleanlab doesn't log which examples were flagged, when, or what action was taken. For regulated industries, this is a gap — you need to demonstrate that quality scoring happened and that flagged items were addressed.
Integration required: Cleanlab operates on pre-formatted datasets. Getting data from your ingestion pipeline into Cleanlab-ready format and results back into the pipeline requires custom integration code.
Model training dependency: Confident learning requires training a model on the dataset (typically via cross-validation). This adds compute time and complexity to the quality scoring step.
Heuristic Quality Scoring (No Model Required)
Not every quality signal requires model inference. Heuristic scoring provides fast, transparent quality estimates using simple rules:
| Heuristic | What It Catches | Implementation |
|---|---|---|
| Text length (tokens) | Empty, truncated, or excessively long records | Count tokens; flag outside [50, 5000] range |
| Sentence count | Fragments and concatenation errors | Count sentence boundaries; flag < 2 |
| Vocabulary diversity | Repetitive or boilerplate text | Type-token ratio; flag < 0.25 |
| Special character ratio | OCR artifacts, encoding errors | Count non-alphanumeric; flag > 8% |
| Language confidence | Mixed-language or corrupted text | Language detection library; flag < 0.85 |
| Repeated n-grams | Copy-paste artifacts | Count 4-gram frequencies; flag high repetition |
| PII density | Inadequate redaction | Count PII markers per 100 tokens |
Heuristic scores run in seconds over large datasets (100K+ records) and require no GPU. They're a useful first pass before applying more expensive model-based scoring.
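A few of the table's heuristics as a plain-Python sketch. Whitespace tokenization stands in for a real tokenizer, and the thresholds mirror the table rather than being universal constants:

```python
# Sketch: heuristic quality flags for a single record.
import re

def heuristic_flags(text: str) -> list[str]:
    tokens = text.split()
    flags = []
    if not 50 <= len(tokens) <= 5000:
        flags.append("length_out_of_range")
    if tokens and len(set(tokens)) / len(tokens) < 0.25:
        flags.append("low_vocabulary_diversity")  # type-token ratio
    non_alnum = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    if text and non_alnum / len(text) > 0.08:
        flags.append("high_special_char_ratio")
    sentences = re.split(r"[.!?]+", text)
    if sum(1 for s in sentences if s.strip()) < 2:
        flags.append("too_few_sentences")
    return flags
```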
Embedding-Based Quality Analysis
Local embedding models (e.g., all-MiniLM-L6-v2 via sentence-transformers, or nomic-embed-text via Ollama) enable powerful quality analysis without cloud APIs:
Coherence Scoring
Compute the centroid of all record embeddings. Each record's distance from the centroid indicates how "typical" it is. Records far from the centroid are potential outliers.
This isn't a binary filter — it's a ranking. The bottom 5% by coherence score should be reviewed, not automatically removed.
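A minimal sketch of coherence ranking, assuming sentence-transformers is installed; the toy texts stand in for your records:

```python
# Sketch: centroid-distance coherence ranking with a local embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer

texts = ["example record one", "example record two", "something unrelated"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # downloads once, runs locally
embeddings = model.encode(texts, normalize_embeddings=True)  # unit vectors

centroid = embeddings.mean(axis=0)
centroid /= np.linalg.norm(centroid)
coherence = embeddings @ centroid  # cosine similarity to the centroid

review_order = np.argsort(coherence)              # least typical first
bottom_5pct = review_order[: max(1, len(texts) // 20)]  # human review queue
```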
Cluster Analysis
Apply k-means or HDBSCAN clustering to the embedding space. Quality signals from clustering:
- Singleton clusters: A record that doesn't cluster with anything is likely off-topic
- Highly concentrated clusters: Records that are nearly identical in embedding space are near-duplicates
- Class-cluster misalignment: If labeling says these records are Class A but clustering puts them with Class B records, there may be label errors
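A sketch of the singleton and misalignment checks, reusing the unit-normalized `embeddings` from the snippet above and assuming a parallel `labels` array of class ids. sklearn.cluster.HDBSCAN requires scikit-learn 1.3+; the standalone hdbscan package behaves similarly:

```python
# Sketch: quality signals from density-based clustering.
import numpy as np
from sklearn.cluster import HDBSCAN

# embeddings: unit vectors from the previous sketch;
# labels: np.ndarray of class ids aligned with embeddings (assumed).
cluster_ids = HDBSCAN(min_cluster_size=5).fit_predict(embeddings)

# Noise points (cluster id -1) did not cluster with anything: likely off-topic.
singletons = np.where(cluster_ids == -1)[0]

# Class-cluster misalignment: records whose label disagrees with their
# cluster's majority label are label-error candidates.
suspects = []
for cid in set(cluster_ids) - {-1}:
    members = np.where(cluster_ids == cid)[0]
    values, counts = np.unique(labels[members], return_counts=True)
    majority = values[np.argmax(counts)]
    suspects.extend(i for i in members if labels[i] != majority)
```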
Semantic Diversity Assessment
Compute pairwise cosine similarity across the dataset (or a sample). A dataset with high average similarity has low diversity — the model will learn a narrow range of patterns. A dataset with moderate average similarity (0.3-0.6) typically indicates healthy diversity.
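A short sketch of the diversity check, again reusing the unit-normalized `embeddings` from above and sampling to keep the pairwise computation cheap:

```python
# Sketch: average pairwise cosine similarity over a random sample.
import numpy as np

rng = np.random.default_rng(0)
idx = rng.choice(len(embeddings), size=min(1000, len(embeddings)),
                 replace=False)
sample = embeddings[idx]
sims = sample @ sample.T                    # valid because vectors are unit-norm
mask = ~np.eye(len(sample), dtype=bool)     # drop self-similarity diagonal
avg_similarity = sims[mask].mean()
print(f"Average pairwise similarity: {avg_similarity:.2f} (0.3-0.6 is healthy)")
```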
Practical Quality Scoring Workflow
A complete quality scoring workflow for a service provider preparing training data for a regulated enterprise client:
Step 1: Heuristic scan (15 minutes). Run heuristic quality checks on the full dataset. Flag and review records that fail basic checks. Remove or fix obvious problems (empty records, encoding corruption, extreme outliers).
Step 2: Deduplication analysis (30 minutes to 2 hours). Run MinHash/LSH near-duplicate detection. Review duplicate clusters. Select representative records from each cluster.
Step 3: Distribution analysis (30 minutes). Compute class frequencies, imbalance ratios, and effective sample counts. If imbalance exceeds 10:1, plan augmentation for minority classes.
Step 4: Embedding-based analysis (1-2 hours). Compute embeddings for all records. Run outlier detection, cluster analysis, and diversity assessment. Review flagged records.
Step 5: Label quality scoring (2-4 hours). If resources allow, run confident learning (Cleanlab-style) or use local LLM verification. Prioritize review of records flagged as potential label errors.
Step 6: Inter-annotator agreement (if applicable). Compute agreement metrics for the subset of records labeled by multiple annotators. If agreement is below 0.7, revise labeling guidelines and re-label the disagreement cases.
Step 7: Generate quality report. Compile all quality metrics into a report: overall quality score, per-dimension scores, distribution charts, flagged records and their resolution, and agreement statistics. This report is a deliverable for the client and a compliance artifact.
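A sketch of what the report compilation in Step 7 might look like as a JSON artifact; the field names and values are illustrative, not a fixed schema:

```python
# Sketch: compiling per-dimension metrics into a JSON report artifact.
import json
from datetime import datetime, timezone

report = {
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "record_count": 100_000,
    "heuristic_pass_rate": 0.984,
    "duplicates_removed": 1_872,
    "imbalance_ratio": 2.7,
    "estimated_label_accuracy": 0.982,
    "krippendorff_alpha": 0.87,
    "flagged_for_review": {"outliers": 412, "label_issues": 230},
}
with open("quality_report.json", "w") as f:
    json.dump(report, f, indent=2)
```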
Quality Scores as a Deliverable
For service providers, quality scoring isn't just a pipeline step — it's a differentiator. When you deliver a dataset to a client with a documented quality report showing:
- 98.2% estimated label accuracy
- Krippendorff's Alpha of 0.87
- All near-duplicates resolved
- PII redaction coverage of 99.7%
- Distribution balanced to within 3:1 ratio
...that's a measurable quality claim that the client can reference in compliance documentation, model cards, and audit responses.
Ertas Data Suite includes built-in quality scoring across all dimensions — heuristic checks, deduplication, distribution analysis, embedding-based outlier detection, and label quality estimation. Quality scores are visible in the project dashboard, and the full quality report exports as part of the audit trail. Domain experts and compliance officers can review quality metrics directly, without needing to interpret Python output.
Connecting to the Pipeline
Quality scoring happens primarily during cleaning and after labeling, but it's also a final validation step before export. A dataset that passes quality scoring across all dimensions is ready for fine-tuning. A dataset that doesn't has specific, actionable gaps that can be addressed before proceeding.
For the complete pipeline overview, see How to Build an On-Premise Data Preparation Pipeline for LLM Fine-Tuning.