
Why Your ML Engineers Shouldn't Be Labeling Data (And Who Should)
Your $180K/year ML engineers are spending 60% of their time on data labeling. That's a $108K/year misallocation per engineer. Here's how to shift labeling to domain experts and free ML engineers for actual engineering.
Here is a number that should concern every engineering leader: the average ML engineer in the United States earns $150,000-$200,000 per year in total compensation. These are people with graduate degrees in machine learning, statistics, or computer science. They were hired to design model architectures, run training experiments, build evaluation frameworks, and deploy production inference systems.
They are spending 60-80% of their time cleaning spreadsheets, manually labeling documents, writing data conversion scripts, and debugging export formats.
Let's make the math explicit. Take a team of 5 ML engineers at $180,000 average total compensation:
- Time spent on data preparation: 65% (a conservative point within the 60-80% range)
- Annual cost of data preparation work: 5 × $180,000 × 0.65 = $585,000
- Annual cost of actual ML engineering: 5 × $180,000 × 0.35 = $315,000
You're paying $900,000 for an ML engineering team and getting $315,000 worth of ML engineering. The other $585,000 goes to work that domain experts could do better and less expensive staff could support.
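For readers who want to adapt the numbers, here is a minimal Python sketch of the same back-of-envelope calculation. The team size, compensation, and 65% figure are the illustrative assumptions above, not measured values.

```python
# Back-of-envelope cost of ML engineers doing data preparation.
# All inputs are the illustrative assumptions from the text above.
team_size = 5
total_comp = 180_000        # average total compensation per engineer, USD/year
data_prep_share = 0.65      # assumed fraction of time spent on data preparation

total_payroll = team_size * total_comp
data_prep_cost = total_payroll * data_prep_share
ml_engineering_cost = total_payroll - data_prep_cost

print(f"Total payroll:    ${total_payroll:,.0f}")        # $900,000
print(f"Data preparation: ${data_prep_cost:,.0f}")       # $585,000
print(f"Actual ML work:   ${ml_engineering_cost:,.0f}")  # $315,000
```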
This is not a matter of ML engineers being lazy or inefficient. It's a structural problem: the tools and workflows for data preparation are designed for ML engineers, so ML engineers end up doing the work. Change the tools and workflows, and the work can shift to the people who should be doing it.
Why ML Engineers Are the Wrong People to Label Data
They Lack Domain Expertise
A radiologist who has read 50,000 chest X-rays can spot a 3mm nodule in a fraction of a second. An ML engineer with a PhD in computer vision cannot. A construction estimator who has priced 200 commercial buildings can identify an unreasonable unit cost instantly. An ML engineer staring at a bill of quantities sees numbers.
When ML engineers label domain-specific data, they make domain errors. They classify a benign finding as suspicious because they don't recognize the pattern. They label a contract clause as "standard" when a lawyer would flag it as unusual. They mark a construction specification as complete when an engineer would note the missing reference standard.
These labeling errors propagate into the model. A model trained on ML-engineer-labeled medical data learns an ML engineer's (incorrect) understanding of medicine. The resulting model is confidently wrong — the worst possible outcome.
The evidence is consistent: domain-expert-labeled datasets produce models that are 8-15 percentage points more accurate than ML-engineer-labeled datasets on domain-specific tasks. That gap is the difference between a model that gets deployed and one that gets abandoned.
They Are Overqualified
Data labeling requires attention and domain knowledge. It does not require the ability to implement attention mechanisms from scratch, derive gradient updates, or architect distributed training pipelines. Using ML engineers for labeling is like using a structural engineer to carry bricks — they can do it, but it's a waste of their most valuable skills.
The opportunity cost is real. While your ML engineers are labeling data, they are NOT:
- Experimenting with model architectures that could improve performance by 5-10%
- Building evaluation frameworks that catch production failures before users do
- Optimizing inference pipelines that reduce serving costs by 40%
- Developing monitoring systems that detect model drift in real-time
Each of these activities generates substantially more value than labeling another 50 documents.
They Burn Out
Data labeling is repetitive. Label a document. Label another document. And another. Check the guidelines. Label another document. For someone who entered the field to solve interesting technical problems, spending weeks in a labeling queue is demoralizing.
Burnout from data labeling manifests as declining label quality (annotator fatigue), decreasing throughput (procrastination), and eventually, job searches. Replacing an ML engineer costs 50-100% of their annual salary in recruiting, onboarding, and lost productivity. If data labeling is driving attrition, the cost extends well beyond the direct salary math.
They Leave
Senior ML talent is in high demand. Engineers who spend their days labeling data instead of building models will find employers who offer more interesting work. In hiring interviews, candidates regularly cite "I was spending 80% of my time on data cleaning" as their reason for leaving their previous role.
Retaining top ML talent requires giving them ML problems to solve. Data labeling is not an ML problem — it's a domain expertise problem that should be solved by domain experts.
Who Should Label Data
Domain Experts
The people who understand the data are the right people to label it. Doctors label medical data. Lawyers label legal data. Engineers label engineering data. Financial analysts label financial data.
This is not controversial in principle. Everyone agrees that a radiologist is better at identifying findings on chest X-rays than an ML engineer. The controversy is practical: "Our domain experts are too busy," "They can't use our labeling tools," "They don't want to do it."
These are solvable problems:
"They're too busy." They are. That's why sessions should be 20 minutes, not 2 hours. Twenty minutes per day from 3 domain experts produces 45-90 labeled examples per day. Over 4 weeks, that's 900-1,800 examples — enough for many fine-tuning tasks.
"They can't use our labeling tools." Current labeling tools (Label Studio, Prodigy, CVAT) are built for ML engineers. They require Python environments, terminal commands, web application navigation, and annotation schema knowledge. Domain experts need a tool that opens like a document viewer and labels with a click. The tool is the bottleneck, not the person.
"They don't want to do it." They don't want to use complicated software for unclear purposes. Show them how their labeling directly improves the AI tool they'll use, give them a simple interface, and time-box their sessions. Adoption rates of 70%+ are achievable with proper change management.
AI-Assisted Labeling with Expert Review
For high-volume labeling tasks, a hybrid approach works: the AI model generates suggested labels, and domain experts review and correct them.
This is faster than labeling from scratch: reviewing a suggestion takes 3-5 seconds, while creating a label unaided takes 10-30 seconds. Over a 20-minute session, that's 240-400 reviewed examples versus 40-120 manually labeled examples, roughly a 3-4x throughput increase at typical rates.
The key: the AI suggestions must be good enough that most are correct. If the expert is correcting 60% of suggestions, the overhead of reading and evaluating bad suggestions cancels out the speed benefit. Aim for 80%+ suggestion accuracy before deploying AI-assisted labeling.
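A minimal sketch of the review-versus-create arithmetic, treating the per-example times quoted above as assumptions:

```python
# Throughput of a 20-minute session: AI-assisted review vs. labeling from scratch.
# Per-example times are the assumed ranges quoted in the text.
session_seconds = 20 * 60
review_seconds = (3, 5)     # accept or correct an AI suggestion
create_seconds = (10, 30)   # create a label without a suggestion

reviewed = (session_seconds // review_seconds[1], session_seconds // review_seconds[0])
created = (session_seconds // create_seconds[1], session_seconds // create_seconds[0])

print(f"Reviewed per session: {reviewed[0]}-{reviewed[1]}")  # 240-400
print(f"Created per session:  {created[0]}-{created[1]}")    # 40-120
```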
The Handoff: Redefining Roles
The transition from "ML engineers label everything" to "domain experts label with ML engineer support" requires a clear division of responsibilities.
ML Engineer's New Role in Data Preparation
Pipeline architect: Design the data preparation pipeline — ingestion, parsing, quality checks, export configuration. This is genuine engineering work that uses their skills appropriately.
Quality analyst: Define quality metrics (inter-annotator agreement, class balance, deduplication ratio), monitor them as labeling progresses, and flag systematic issues to the labeling team (an agreement-check sketch follows these role descriptions).
Statistical validator: After labeling is complete, validate the dataset statistically. Are there annotator biases? Are certain categories over/underrepresented? Does the input distribution match production expectations?
Integration engineer: Ensure the labeled dataset flows correctly into the training pipeline. Format conversion, data splits, augmentation — these are engineering tasks that belong with the ML engineer.
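As a concrete example of the quality-analyst role, here is a minimal sketch of an inter-annotator agreement check built on scikit-learn's cohen_kappa_score. The label lists and the 0.6 flagging threshold are placeholder assumptions, not recommendations from any particular tool.

```python
# Inter-annotator agreement on a batch that two domain experts both labeled.
# The label lists below are placeholder data.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

annotator_a = ["normal", "suspicious", "normal", "normal", "suspicious", "normal"]
annotator_b = ["normal", "suspicious", "normal", "suspicious", "suspicious", "normal"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")   # flag the batch for review if kappa < 0.6 (assumed threshold)

# Class balance on the same batch: heavy skew is worth flagging early.
print("Class counts:", Counter(annotator_a))
```

Because kappa corrects for chance agreement, a falling score is a better early signal of annotator fatigue or guideline drift than raw percent agreement.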
Domain Expert's New Role
Labeling authority: Apply their professional judgment to training examples. Their labels are the ground truth.
Guidelines author: Document the labeling criteria in terms that other domain experts can follow. This is essentially writing a professional standard for AI training data — work that only domain experts can do.
Quality reviewer: Spot-check labels from other annotators. A 15-minute review session per week catches systematic errors early.
Edge case identifier: Flag unusual examples that the pipeline mishandled. Domain experts are uniquely positioned to recognize when something unusual arrives because they've seen thousands of "normal" examples over their professional careers.
The Financial Impact
Revisiting the math with the new model:
Before: 5 ML engineers × $180K × 65% of their time = $585K/year on data preparation
After:
- ML engineers on data preparation: 5 × $180K × 20% (pipeline architecture, quality analysis, validation) = $180K/year
- Domain expert labeling: 4 experts × 30 min/day × 250 working days × $75/hour equivalent = $37,500/year
- Total data preparation cost: $217,500/year
Savings: $367,500/year — and you get better labeled data because domain experts are doing the labeling.
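The same before-and-after comparison as a minimal sketch, with every input taken from the illustrative assumptions above:

```python
# Before/after cost of data preparation under the new division of labor.
# Every figure is an illustrative assumption from the text above.
engineers, comp = 5, 180_000

before = engineers * comp * 0.65                 # ML engineers do everything
ml_after = engineers * comp * 0.20               # pipeline, quality analysis, validation
expert_after = 4 * 0.5 * 250 * 75                # 4 experts, 30 min/day, 250 days, $75/hour
after = ml_after + expert_after

print(f"Before:  ${before:,.0f}")                # $585,000
print(f"After:   ${after:,.0f}")                 # $217,500
print(f"Savings: ${before - after:,.0f}")        # $367,500
```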
The freed ML engineer capacity ($405K/year in salary equivalent) can be redirected to:
- More model experiments (find better architectures faster)
- Better evaluation frameworks (catch problems before production)
- Inference optimization (reduce serving costs)
- Monitoring and observability (detect drift earlier)
Each of these activities directly generates business value that data labeling does not.
What Needs to Change
The Tooling Must Change
Current labeling tools are built for ML engineers. They assume comfort with web applications, JSON configuration, and terminal-based workflows. Domain experts need:
- A desktop application that installs like any other program they use
- Document viewing that looks familiar — like the PDF viewer or EMR system they already use
- Labeling controls that require zero training — click a button, select a category, move to the next example
- Automatic saving so interrupted work is never lost
- No Python, no terminal, no configuration files
The Workflow Must Change
Stop asking ML engineers to label "just a few examples to get started." Those few examples become a few hundred, then a few thousand. Instead:
- ML engineer sets up the pipeline and configures quality metrics
- ML engineer labels 10-20 examples to create the initial labeling guide
- Domain experts take over labeling, using the guide
- ML engineer monitors quality metrics and provides statistical feedback
- Domain experts review and address quality issues
- ML engineer validates the final dataset and configures export
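To ground the quality-metric and validation steps above, here is a minimal sketch of a pre-export quality gate. The record format, the example records, and both thresholds are assumptions for illustration, not the behavior of any specific platform.

```python
# Minimal pre-export quality gate for a labeled dataset.
# Record format, example records, and thresholds are illustrative assumptions.
from collections import Counter

dataset = [
    {"text": "Slab thickness not specified", "label": "incomplete"},
    {"text": "Concrete grade C30/37 per EN 206", "label": "complete"},
    {"text": "Slab thickness not specified", "label": "incomplete"},  # duplicate input
]

labels = [record["label"] for record in dataset]
counts = Counter(labels)

max_class_share = max(counts.values()) / len(labels)                      # class balance
dedup_ratio = len({record["text"] for record in dataset}) / len(dataset)  # unique inputs

print(f"Class counts:    {dict(counts)}")
print(f"Max class share: {max_class_share:.0%} (flag if above 80%)")
print(f"Dedup ratio:     {dedup_ratio:.0%} (flag if below 95%)")
```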
The Organization Must Change
Data labeling must be recognized as domain expert work, not IT work. This means:
- Domain expert time for labeling is budgeted and protected
- Labeling performance (volume and quality) is visible to domain expert managers
- The connection between labeling and AI model improvement is explicit and tracked
Ertas Data Suite is built specifically to enable this handoff. The platform provides ML engineers with pipeline configuration, quality monitoring, and export tools — the engineering work they should be doing. Simultaneously, it provides domain experts with a desktop labeling interface that requires no technical knowledge — click, label, done. Both roles work in the same system with appropriate access controls, eliminating the gaps that emerge when labels move between separate tools.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
Further Reading
- The Annotation Bottleneck in Enterprise AI — How labeling capacity constrains enterprise AI projects and what to do about it.
- Domain Experts Should Own Data Labeling — The full case for shifting labeling responsibility to subject matter experts.
- The Data Preparation Gap in ML Teams — Why ML teams struggle with data preparation and how organizational structure contributes.