
    Your ML Engineers Shouldn't Be Doing This

    The people best positioned to label AI training data are domain experts — doctors, lawyers, engineers, analysts. The tooling makes this nearly impossible. The result: ML engineers doing work they're not best placed to do.

Ertas Team

    Here is a situation we encounter regularly: an ML engineer at a healthcare organization is spending two to three days a week annotating clinical notes. Not because they enjoy it, not because they are especially qualified to do it, but because the doctor who is qualified to do it cannot figure out how to operate the annotation tool.

    This is not a story about a difficult doctor or an impatient engineer. It is a story about tooling that was designed for ML engineers and never seriously reconsidered for anyone else. The result is a systematic misallocation of expensive, specialized labor — and a systematic degradation of annotation quality.

    The Core Problem

    Annotation quality depends on domain expertise. This is not controversial. A doctor reviewing clinical notes for diagnostic label consistency is going to produce better annotations than an ML engineer without medical training doing the same task. A construction engineer reviewing BOQ line items is better positioned to classify cost categories than a data scientist who has never read a quantity surveyor's report. A lawyer reviewing contract clause annotations will catch ambiguities and exceptions that a Python developer will miss.

    The people with the domain expertise needed for high-quality annotation are not, in the general case, the people who know how to operate annotation tooling. This creates a structural problem: the tools go to the wrong people because they are the only people who can operate them.

    The consequence is twofold. First, annotation quality is lower than it would be if domain experts were doing the work. Second, ML engineering time is consumed by annotation tasks that should be running on domain expert capacity, at a fraction of the cost and with better results.

    What Domain Experts Actually Face

    Let us be concrete about what it looks like when you ask a domain expert to use the standard annotation tooling.

    The Clinical Notes Scenario

    A healthcare AI team needs to annotate 10,000 clinical notes for a medical coding model. The notes need to be reviewed and labeled by clinicians — ideally the same clinical staff who will eventually use the AI system. This is the ideal setup: the future users of the system generate the training data.

    The team sets up Label Studio. This requires: a server (or local Docker installation) to run the web application, a Docker runtime installed and configured on whatever machine will host the service, annotation interface configuration written in Label Studio's XML-based template format, user account creation and management for each clinical annotator, and network configuration so the clinical staff can access the tool from their workstations.

    The clinicians arrive to start annotating. They encounter a web application with an interface that assumes familiarity with annotation concepts — label sets, relations, predictions, review queues. The first session requires ML engineering support to explain how the tool works. Several clinicians have issues accessing the tool from their workstations because of hospital network restrictions. One clinician needs the annotation instructions in a specific format; translating them into Label Studio's configuration takes another engineering hour.

    Throughput in the first week is low. The ML engineers spend more time supporting the annotation process than doing engineering work. After three weeks, annotation has reached 2,000 examples. The quality review reveals inconsistencies because different clinicians interpreted the annotation guidelines differently — but the tool does not make it easy to see whether inconsistencies are systemic or individual.
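One way to tell the two apart is to export the annotations and compute pairwise agreement between clinicians on the notes they both labeled: uniformly low agreement points to the guidelines, a single outlier points to an individual annotator. A minimal sketch, assuming the export is a CSV with annotator, example ID, and label columns (the file and column names are hypothetical):

```python
# Sketch: pairwise agreement between annotators from an exported CSV.
# The file name and column names below are hypothetical.
from collections import defaultdict
from itertools import combinations
import csv

by_annotator = defaultdict(dict)  # annotator -> {example_id: label}
with open("annotations_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        by_annotator[row["annotator"]][row["example_id"]] = row["label"]

for a, b in combinations(sorted(by_annotator), 2):
    shared = by_annotator[a].keys() & by_annotator[b].keys()  # notes labeled by both
    if not shared:
        continue
    agree = sum(by_annotator[a][x] == by_annotator[b][x] for x in shared) / len(shared)
    print(f"{a} vs {b}: {agree:.0%} agreement on {len(shared)} shared notes")
```

None of this is hard, but it is one more script an ML engineer has to write and run before anyone knows whether the guidelines need revising.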

    This scenario, with variations, is typical. It is not a failure of effort or intention. It is a failure of tool design.

    The BOQ Annotation Scenario

    A construction AI team needs to annotate bills of quantities for a line-item classification model. The annotators should be quantity surveyors or construction estimators — people who work with BOQs professionally and can reliably distinguish between direct costs, preliminaries, and provisional sums.

    The team builds a Python script that reads BOQ rows from a spreadsheet and presents them for classification via a command-line prompt. The construction engineer opens the terminal, types a command, and sees a prompt asking them to press 1, 2, or 3 to classify each item.

This is better than Label Studio for this scenario in the sense that there is no server to deploy. It is worse in the sense that it requires command-line familiarity, and if the script crashes partway through a session, there is no reliable way to resume from where the annotator left off. The construction engineer calls an ML engineer for help; the ML engineer fixes the script and reschedules the annotation session.
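For concreteness, the script in question looks roughly like the sketch below, with the one change that would have avoided the reschedule: progress written to disk after every answer, so a crashed or closed session can be resumed. File names, column names, and category prompts are illustrative.

```python
# Sketch of the command-line classification loop described above.
# Progress is saved after every answer so a crash can be resumed.
# File names, column names, and categories are illustrative.
import csv
import json
import os

PROGRESS_FILE = "boq_progress.json"
CATEGORIES = {"1": "direct cost", "2": "preliminaries", "3": "provisional sum"}

progress = {}
if os.path.exists(PROGRESS_FILE):
    with open(PROGRESS_FILE) as f:
        progress = json.load(f)  # resume a previous session

with open("boq_rows.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f)):
        key = str(i)
        if key in progress:
            continue  # already classified earlier
        print(f"\n[{i}] {row['description']}")
        choice = input("Classify (1=direct cost, 2=preliminaries, 3=provisional sum, q=quit): ").strip()
        if choice not in CATEGORIES:
            print("Stopping. Progress is saved; rerun the script to continue.")
            break
        progress[key] = CATEGORIES[choice]
        with open(PROGRESS_FILE, "w") as out:
            json.dump(progress, out, indent=2)  # write after every answer
```

Even with the fix, the domain expert is still working in a terminal, which is the deeper problem.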

    Throughput: low. Feedback loop between annotation work and domain expert: dependent on engineering availability.

    The Contract Review Scenario

    A legal AI team needs to annotate contract clauses for clause type classification. The annotators should be lawyers — specifically lawyers who work with the types of contracts in the dataset.

    The team sets up a Jupyter notebook. The lawyers open the notebook in VS Code, or in a browser-based Jupyter session, and work through clauses cell by cell, modifying a dictionary variable with their classification for each clause. The output is saved to a JSON file.
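Concretely, the cell a lawyer is asked to edit looks roughly like this sketch (clause IDs, labels, and the output file name are illustrative):

```python
# One dictionary entry per clause; the clause text is printed by an earlier cell.
# The lawyer edits the labels by hand and re-runs the cell. Names are illustrative.
import json

annotations = {}  # clause_id -> clause type chosen by the reviewer

annotations["clause_017"] = "limitation_of_liability"
annotations["clause_018"] = "indemnification"
annotations["clause_019"] = None  # unsure, leave for a second review

with open("clause_annotations.json", "w") as f:
    json.dump(annotations, f, indent=2)
```

Everything depends on typing valid Python into a live code cell.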

    Lawyers are not Jupyter notebook users. The interface is confusing. The first session ends when one lawyer accidentally modifies a code cell and breaks the notebook. The ML engineer spends an hour on a support call.

    Alternative: the team exports clauses to a spreadsheet and asks lawyers to annotate in Excel. This works better from a usability standpoint. It works worse from a data management standpoint — version control of spreadsheets is unreliable, the format is fragile, and merging annotations from multiple lawyers introduces opportunities for error.
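A sketch of the merge step shows where those errors creep in; it assumes each lawyer returns a spreadsheet with clause_id and label columns (file and column names are hypothetical, and reading .xlsx files with pandas requires openpyxl):

```python
# Sketch: merge per-lawyer Excel annotation files and surface disagreements.
# File names and column names are hypothetical.
import glob

import pandas as pd

frames = []
for path in sorted(glob.glob("annotations_*.xlsx")):
    df = pd.read_excel(path)[["clause_id", "label"]]
    df["annotator"] = path
    frames.append(df)

merged = pd.concat(frames, ignore_index=True)

# Clauses labeled differently by different lawyers. Typos and near-duplicate
# category names ("Indemnity" vs "indemnification") show up here as conflicts.
n_labels = merged.groupby("clause_id")["label"].nunique()
conflicts = merged[merged["clause_id"].isin(n_labels[n_labels > 1].index)]
print(conflicts.sort_values("clause_id"))
```

The merge itself is a few lines; the fragility is everything around it, including renamed columns, inserted rows, and two versions of the same file circulating by email.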

    What Happens When ML Engineers Annotate Instead

When the annotation tooling is too difficult for domain experts to operate, the default is ML engineer annotation. This is not irrational — the ML engineers are available, they can operate the tools, and the project needs to move forward.

    But ML engineers without domain expertise produce systematically lower-quality annotations on domain-specific tasks. This is not a criticism of their ability — it is a structural fact. A radiology report requires clinical knowledge to annotate correctly. A BOQ requires construction cost knowledge. A contract requires legal knowledge.

    The consequences compound.

Lower initial label accuracy means the model trains on a noisier signal. In supervised learning, label noise directly degrades model performance: the model cannot learn consistent patterns from inconsistent labels. A dataset annotated by non-experts at 85% label accuracy trains the model against labels that are wrong roughly one time in seven; the same dataset annotated by domain experts at 95% accuracy produces a measurably better model (a short simulation below sketches the effect).

    More iteration rounds because lower initial quality means more annotation review cycles. The team notices that model performance is below expectation, investigates, finds annotation errors, relabels. This cycle is expensive and slow. Each round requires re-engaging annotators, re-running training, re-evaluating.

    Weaker feedback on edge cases because ML engineers without domain expertise cannot reliably identify which annotation decisions are edge cases requiring judgment versus which are clear-cut. The guidelines they apply are necessarily more mechanical and less nuanced than what a domain expert would bring.

    Slower timeline because all of this adds up to more total time from annotation start to usable model. Teams that invest in making domain expert annotation accessible typically reach their quality targets faster than teams that use ML engineer annotation as a workaround.
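The first of these effects, training on noisy labels, is easy to demonstrate on synthetic data. A minimal sketch with scikit-learn, flipping 15% of the training labels to mimic roughly 85% annotation accuracy; the numbers are illustrative, not a benchmark, and the size of the gap depends on the model and the data:

```python
# Sketch: the same training set with clean labels vs. 15% of labels flipped.
# An unregularized tree fits the noise, so the gap in test accuracy is visible.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
noisy = y_train.copy()
flip = rng.random(len(noisy)) < 0.15  # mimic ~85% label accuracy
noisy[flip] = 1 - noisy[flip]

for name, labels in [("clean labels", y_train), ("15% flipped labels", noisy)]:
    model = DecisionTreeClassifier(random_state=0).fit(X_train, labels)
    print(f"{name}: test accuracy {model.score(X_test, y_test):.3f}")
```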

    What Domain-Expert-Accessible Tooling Actually Means

    The solution is not better training for domain experts on how to use developer tools. It is tooling that does not require developer training to operate.

    The standard should be: a domain expert who has never used an annotation tool can install the software, open it, and start annotating in under 15 minutes, without a terminal, without Docker, without a Python environment, without reading a configuration guide.

    This standard is achievable. Desktop applications built with modern cross-platform frameworks can package all dependencies, install like standard software, and present interfaces designed for the specific workflow — not interfaces designed for ML engineer flexibility and then adapted for everyone else.

    What this looks like in practice:

    Installation: Download an installer, double-click, it runs. No terminal commands. No Docker. No environment setup. No port configuration.

    Annotation interface: Designed for the specific task, not configurable to any task. A clinical note reviewer sees clinical notes and label buttons. A BOQ annotator sees rows and classification options. The interface does not expose concepts like "label sets" or "prediction confidence" unless those are things the domain expert actually needs to interact with.

    Resume and save: Work saves automatically. Closing and reopening the tool resumes from the last position. No command to run, no file to manage, no dependency on a running server.

    Feedback: The annotator sees clearly what is left to annotate, what they have completed, and any items that have been flagged for review. Progress is visible without interpreting log files or database queries.

    No ML engineering in the loop: The domain expert can do a full annotation session, from opening the software to submitting completed work, without calling an ML engineer. If something goes wrong, the error message is human-readable and actionable.

This is a different design philosophy from "build a powerful tool and document it thoroughly." It requires constraining the interface to what the actual user needs, which means accepting that the tool will not be maximally flexible. This is the right tradeoff for annotation workflows.

    The Improvement in Label Quality

    The practical benefit of domain expert annotation is real and measurable.

    In clinical text annotation, studies comparing clinician-annotated datasets to crowd-sourced or non-expert-annotated datasets consistently find 8-15 percentage point improvements in downstream model performance when domain experts do the annotation. The effect is larger for complex, judgment-dependent tasks (diagnostic classification, treatment recommendation) and smaller for simpler extraction tasks.

    In legal document annotation, the effect of expert vs. non-expert annotation is most visible at the boundaries between categories. Non-expert annotators apply categories mechanically; expert annotators apply them with knowledge of the underlying legal distinctions. Models trained on expert annotations generalize better to edge cases in production.

    In technical domain annotation (engineering drawings, financial instruments, specialized scientific documents), the gap between expert and non-expert annotation can be even larger because the concepts themselves are not accessible without domain knowledge. A non-expert annotating an engineering BOQ is essentially guessing at the correct category for many line items.

    The argument for making domain expert annotation accessible is not primarily about saving engineering time, though it does that. It is about getting better AI systems. Better annotations produce better models. Better models produce better outcomes. The tooling that enables domain expert annotation is not a convenience — it is a quality lever.

An engineering and construction AI lead we spoke to said it directly:

    "The problem is not fine-tuning but cleaning and preparing the diverse data."

    The "diverse data" includes diversity of domain expertise. Getting the right expertise into the annotation loop requires getting the right people into the tool. That requires tools designed for the right people, not tools designed for ML engineers and handed to everyone else.


    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
