
AI Data Quality Is a Domain Problem, Not a Code Problem
Data quality in AI is fundamentally about domain knowledge, not engineering. Perfect pipelines produce garbage if labeling criteria are wrong. The best dedup can't tell which version to keep.
There is a persistent belief in the AI industry that data quality is an engineering challenge. Build a better pipeline. Write more validation rules. Add automated quality checks. Deploy statistical anomaly detection. If the data is bad, the thinking goes, the code is not good enough.
This belief is wrong. Data quality is fundamentally a domain knowledge problem. No amount of engineering sophistication can compensate for a lack of understanding about what the data means, which values are correct, and what "quality" looks like in the specific context of the problem you are trying to solve.
The Pipeline Illusion
Consider a company building a model to classify customer support tickets by urgency. Their data engineering is excellent:
- Automated ingestion from 5 ticket sources
- Deduplication using fuzzy matching with a 0.92 similarity threshold
- Schema validation ensuring all required fields are present
- Statistical checks flagging outliers in text length and response time
- Automated train/test splitting with stratification
The pipeline is clean. The code is robust. The model trains on 50,000 tickets and achieves 73% accuracy on urgency classification, and no amount of pipeline tuning moves that number.
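To make the setup concrete, here is a minimal sketch of the kinds of checks such a pipeline runs. The field names, helper functions, and outlier rule are illustrative assumptions, not the team's actual code; only the 0.92 threshold comes from the list above. The point is that every check validates form, not meaning.

```python
from difflib import SequenceMatcher

REQUIRED_FIELDS = {"ticket_id", "text", "source", "urgency_label"}
SIMILARITY_THRESHOLD = 0.92  # the fuzzy-dedup cutoff from the list above

def validate_schema(ticket: dict) -> bool:
    """Schema check: every required field is present and non-empty."""
    return all(ticket.get(field) for field in REQUIRED_FIELDS)

def is_duplicate(a: dict, b: dict) -> bool:
    """Fuzzy dedup: flags near-identical ticket text above the threshold."""
    return SequenceMatcher(None, a["text"], b["text"]).ratio() >= SIMILARITY_THRESHOLD

def is_length_outlier(ticket: dict, mean_len: float, std_len: float) -> bool:
    """Statistical check: flags tickets more than 3 sigma from the mean text length."""
    return abs(len(ticket["text"]) - mean_len) > 3 * std_len
```

Every one of these checks can pass on a ticket whose urgency label is wrong, because none of them encodes what "urgent" means to the support team.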
The problem is not the pipeline. The problem is that the labeling criteria for "urgent" versus "high priority" versus "normal" were defined by an ML engineer who had never worked in customer support. In their schema, a ticket about a production outage affecting 3 users is "high priority." In the support team's actual triage framework, it is "urgent" because those 3 users are on an enterprise plan with an SLA that triggers financial penalties after 2 hours.
The pipeline processed the data perfectly. It just processed data with the wrong labels.
Where Code Cannot Help
There are specific categories of data quality problems that no engineering solution can address:
Wrong labeling criteria. If the definition of "positive" and "negative" in your classification schema does not match the real-world decision boundary, every label is potentially wrong — but no validation rule can detect this. The labels are internally consistent, correctly formatted, and statistically distributed. They are just wrong.
A concrete example: a medical imaging team labels chest X-rays for pneumonia detection. Their labeling guide says "label as positive if opacity is present in the lung fields." A radiologist would tell them that 15-20% of opacities in the lung fields are not pneumonia — they are atelectasis, effusions, or artifacts. The labels pass every quality check. The model learns to detect opacity, not pneumonia.
Incorrect deduplication decisions. Deduplication algorithms can identify that two records are similar. They cannot determine which one is correct. When a customer appears twice in a dataset with slightly different addresses, the algorithm can flag the duplicate. It cannot know that one address is the customer's home and the other is their office, and the correct address depends on the use case.
We worked with a financial services team that used automated deduplication on transaction records. The algorithm merged records with identical amounts and similar timestamps, treating them as duplicates. In reality, 8% of the "duplicates" were legitimate separate transactions — two $4,500 wire transfers to the same recipient on the same day for different invoices. The dedup reduced dataset size but also reduced model accuracy by removing real data.
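A sketch of how that merge logic goes wrong, with hypothetical field names and invoice numbers. The rule is internally consistent; it simply encodes the wrong definition of "duplicate":

```python
from datetime import datetime, timedelta

def naive_is_duplicate(a: dict, b: dict) -> bool:
    """The automated rule: same amount and recipient, timestamps close together."""
    return (
        a["amount"] == b["amount"]
        and a["recipient"] == b["recipient"]
        and abs(a["timestamp"] - b["timestamp"]) < timedelta(hours=1)
    )

tx1 = {"amount": 4500, "recipient": "ACME GmbH", "invoice": "INV-118",
       "timestamp": datetime(2024, 3, 1, 10, 5)}
tx2 = {"amount": 4500, "recipient": "ACME GmbH", "invoice": "INV-119",
       "timestamp": datetime(2024, 3, 1, 10, 40)}

print(naive_is_duplicate(tx1, tx2))  # True: the rule merges two real payments
```

A domain expert would look at the invoice field first. The correct rule needed knowledge the algorithm was never given.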
Misunderstood data semantics. A field labeled "completion_date" might mean different things in different contexts: the date the task was marked complete in the system, the date the work was actually finished, or the date the completion was verified by a supervisor. Using the wrong interpretation introduces systematic error that no validation rule can catch because the data type and format are correct.
Context-dependent quality standards. In some domains, "good enough" data quality depends on the specific application. A customer name misspelled as "Jonh" instead of "John" is acceptable for a recommendation system but unacceptable for a compliance screening model that matches names against sanctions lists. Quality scoring that does not account for application context produces misleading confidence.
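One way to see the point in code: the same record passes or fails depending on the application's tolerance. The thresholds below are illustrative assumptions, not recommendations.

```python
from difflib import SequenceMatcher

# Tolerance for name mismatches depends entirely on the downstream use.
APPLICATION_TOLERANCE = {
    "recommendations": 0.80,       # a near-miss on a name is harmless here
    "sanctions_screening": 0.999,  # a near-miss here is a compliance failure
}

def name_quality_ok(recorded: str, verified: str, application: str) -> bool:
    similarity = SequenceMatcher(None, recorded, verified).ratio()
    return similarity >= APPLICATION_TOLERANCE[application]

print(name_quality_ok("Jonh Smith", "John Smith", "recommendations"))      # True
print(name_quality_ok("Jonh Smith", "John Smith", "sanctions_screening"))  # False
```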
The Domain Knowledge That Matters
Data quality decisions require three types of domain knowledge that code does not have:
Semantic knowledge. Understanding what data values mean in context. An ML engineer sees a field with values 0-10 and treats it as a continuous numeric feature. A domain expert knows that values 1-3 are "normal," 4-6 are "elevated," and 7-10 are "critical" — and that the thresholds between categories are where the model's decisions matter most.
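For instance, the gap between the two readings might look like this in code; the sample values and bin edges are assumptions for illustration:

```python
import pandas as pd

readings = pd.Series([2, 5, 7, 9, 3])  # the raw 0-10 field

# What the ML engineer sees: a continuous feature, passed straight through.
continuous_feature = readings.astype(float)

# What the domain expert knows: the values are categories, and the 3/4 and
# 6/7 boundaries are where the model's decisions matter most.
severity = pd.cut(readings, bins=[-1, 3, 6, 10],
                  labels=["normal", "elevated", "critical"])
print(severity.tolist())  # ['normal', 'elevated', 'critical', 'critical', 'normal']
```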
Operational knowledge. Understanding how data was collected and what its limitations are. A domain expert knows that weekend entries in a manufacturing log are less reliable because the junior operator fills them in from memory on Monday. An ML engineer treats all rows equally.
Consequential knowledge. Understanding what happens when the model gets it wrong. A domain expert knows that misclassifying a certain type of transaction has regulatory implications, while misclassifying another type is merely inconvenient. This knowledge should influence how aggressively you clean, validate, and balance different segments of the dataset.
The Real Quality Process
Effective data quality is not a code pipeline with domain knowledge sprinkled on top. It is a domain-driven process with code supporting execution.
Step 1: Domain experts define quality criteria. Before any code runs, domain experts specify what "correct" means for each label, what edge cases exist, and how ambiguous examples should be handled. This is not a one-hour meeting. It is an iterative process that typically takes 1-2 weeks of discussion, example review, and criteria refinement.
Step 2: Domain experts label a seed dataset. A small set of examples (200-500) labeled by domain experts establishes the ground truth. This seed dataset serves as the quality benchmark against which all subsequent labels and model outputs are measured.
Step 3: Quality metrics reference domain judgment. Inter-annotator agreement, label distribution analysis, and edge case review are all measured against the domain experts' seed labels. If automated quality checks flag a batch of labels as problematic, domain experts — not ML engineers — investigate and determine whether the issue is a labeling error or a legitimate distribution shift.
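A minimal sketch of what referencing domain judgment can look like in practice: scoring a new batch of labels against the expert seed set with Cohen's kappa, which corrects raw agreement for chance. This is the standard textbook computation, not any particular tool's API, and the example labels are made up.

```python
from collections import Counter

def cohens_kappa(seed_labels: list[str], batch_labels: list[str]) -> float:
    """Agreement between expert seed labels and a new annotation batch,
    corrected for the agreement expected by chance alone."""
    assert len(seed_labels) == len(batch_labels)
    n = len(seed_labels)
    observed = sum(a == b for a, b in zip(seed_labels, batch_labels)) / n
    seed_freq, batch_freq = Counter(seed_labels), Counter(batch_labels)
    expected = sum(
        (seed_freq[c] / n) * (batch_freq[c] / n)
        for c in set(seed_labels) | set(batch_labels)
    )
    return (observed - expected) / (1 - expected)

seed  = ["urgent", "normal", "urgent", "high", "normal", "urgent"]
batch = ["urgent", "normal", "high",   "high", "normal", "normal"]
print(round(cohens_kappa(seed, batch), 2))  # 0.52
```

A low score is a flag, not a verdict: only a domain expert can say whether the batch is mislabeled or the criteria have drifted.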
Step 4: Domain experts review model errors. When the model misclassifies examples, domain experts examine the misclassifications to determine whether the error stems from insufficient training data, incorrect labels, ambiguous criteria, or a genuine edge case the model should not be expected to handle.
This process requires domain experts to interact directly with the data and the labeling tools. If domain experts can only participate through meetings and Slack messages, the process degrades back to proxy labeling — which is where quality problems originate.
The Cost of Getting This Wrong
Organizations that treat data quality as an engineering problem spend 2-3x more on model development than organizations that treat it as a domain problem. Here is why:
More training cycles. When labels are subtly wrong, model accuracy plateaus at a level that seems improvable but resists every engineering intervention — more data, better architectures, longer training. The team iterates for weeks or months before someone finally questions the labels.
Delayed deployment. A model trained on domain-incorrect data fails differently than a model trained on noisy data. Noisy data produces uniformly degraded performance. Domain-incorrect data produces confident errors on specific categories — the model is sure about cases it gets wrong. These confident errors are discovered late, often during user acceptance testing, and require restarting the data collection process.
Eroded trust. When a model confidently misclassifies domain-specific cases, domain experts lose confidence in AI tools broadly. Rebuilding that trust costs more than getting it right the first time.
Research from Andrew Ng's data-centric AI work shows that systematic label corrections by domain experts improve model performance by 5-15% on average — more than most architectural changes. The data, not the model, is where quality lives.
Putting Domain Experts in the Driver's Seat
Data quality improves when domain experts can directly inspect, label, validate, and correct training data. This requires tools that are accessible to people without ML engineering skills.
Ertas Data Suite is built for this purpose. It is a native desktop application where domain experts work with data directly — defining label schemas, applying labels, reviewing quality metrics, and correcting errors — without writing code or navigating technical infrastructure. Data stays local on their machine. The interface uses domain terminology, not ML jargon.
The ML team gets better data. The domain experts maintain ownership of quality. The model trains on labels that reflect genuine domain knowledge, not an engineer's best guess.
Data quality is a domain problem. The tools should let domain experts solve it.