    Cross-Functional AI Data Teams: ML Engineers + Domain Experts + Compliance

    AI data preparation isn't a solo sport. The most effective teams combine ML engineers (architecture), domain experts (accuracy), and compliance officers (governance). Here's how to structure the team.

    Ertas Team

    Most enterprise AI data preparation efforts are staffed by one function: the ML engineering team. They design the pipeline, parse the documents, label the data (often poorly, because they lack domain expertise), check quality (against technical metrics only), and export the dataset. Domain experts are consulted occasionally. Compliance reviews the output once, at the end, and frequently requests changes that require rework.

    This single-function approach produces three predictable failures:

    Technically correct but domain-inaccurate datasets. An ML engineer labeling medical records will correctly identify that "SOB" is an abbreviation but may not know it means "shortness of breath" in clinical context. A model trained on these labels will be technically functional but clinically wrong.

    Accurate labels that don't scale. When domain experts are brought in, they produce high-quality labels but can't sustain the volume needed. A cardiologist who labels 20 examples on a Tuesday and then disappears for three weeks is not a scalable data operation.

    Compliance reviews that force rework. When the compliance officer reviews the finished dataset and discovers that PII wasn't properly handled, or that data lineage documentation is incomplete, the entire pipeline must be re-run. This rework typically costs 3-6 weeks.

    The solution is not sequential handoffs between functions — it's a cross-functional team where ML engineers, domain experts, and compliance officers work on the data preparation pipeline simultaneously, with defined roles and appropriate tooling.

    The Three Roles

    ML Engineer: Pipeline Architect

    The ML engineer's role in data preparation is architecture and automation, not manual data work.

    Responsibilities:

    • Design the data preparation pipeline: ingestion → parsing → labeling → quality → export
    • Configure quality metrics and thresholds (inter-annotator agreement targets, deduplication ratios, class balance requirements)
    • Set up automation: automated ingestion from data sources, automated quality checks on incoming data, automated export schedules
    • Build and maintain export configurations that produce training-ready datasets in the required format
    • Monitor pipeline health: throughput, error rates, processing latency
    • Analyze quality metrics and identify systematic issues (annotator disagreement patterns, data distribution shifts)
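The inter-annotator agreement target mentioned above is straightforward to compute. Here is a minimal, illustrative sketch of Cohen's kappa between two annotators; the 0.7 threshold and the labels are made-up examples, and a production pipeline would more likely use a library implementation such as scikit-learn's `cohen_kappa_score`:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Inter-annotator agreement between two annotators (Cohen's kappa)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (p_o - p_e) / (1 - p_e)

KAPPA_THRESHOLD = 0.7  # hypothetical target; set per project

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "neg"]
kappa = cohen_kappa(a, b)
print(f"kappa={kappa:.2f}, meets target: {kappa >= KAPPA_THRESHOLD}")
```

A check like this runs automatically on overlap sets (items labeled by more than one annotator), and the ML engineer investigates when the trend drops below the threshold.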

    What they should NOT do:

    • Label data. They lack domain expertise and their time is better spent on engineering.
    • Define labeling guidelines. They don't know the domain well enough.
    • Make compliance decisions. They don't know the regulatory requirements.

    Time allocation: 30-40% of an ML engineer's project time should go to pipeline architecture and monitoring. The other 60-70% goes to model training, evaluation, and deployment. If they're spending more than 40% on data pipeline work, the pipeline needs more automation.

    Domain Expert: Accuracy Authority

    The domain expert's role is to ensure the dataset is correct according to the standards of their profession.

    Responsibilities:

    • Write labeling guidelines that reflect professional standards and domain knowledge
    • Label examples — typically 20-30 minutes per day, producing 15-30 labeled examples per session
    • Review a sample of other annotators' labels for quality (if multiple annotators are involved)
    • Identify edge cases that the pipeline mishandled — document types, terminology, or scenarios that the automated steps got wrong
    • Validate the final dataset against professional standards: "Would I trust a model trained on this data to handle my cases?"

    What they should NOT do:

    • Configure the pipeline. They don't need to know how documents are parsed or how data is exported.
    • Define quality metrics. They should validate that the metrics the ML engineer chose are meaningful, but defining Cohen's kappa thresholds is not their responsibility.
    • Handle compliance documentation. They produce the labeled data; compliance tracks the governance.

    Time allocation: 20-30 minutes per day during active labeling phases. Periodic review sessions (1-2 hours per week) during quality validation phases. This is sustainable for busy professionals and produces sufficient volume for most projects.

    Compliance Officer: Governance Guardian

    The compliance officer's role is to ensure the data preparation pipeline meets regulatory and organizational policy requirements.

    Responsibilities:

    • Verify that the audit trail is complete: every document's origin, every transformation, every labeling decision is tracked
    • Review data governance policies: data retention, access control, deletion rights, cross-border transfer restrictions
    • Ensure PII/PHI handling complies with applicable regulations (GDPR, HIPAA, EU AI Act Article 10)
    • Review and approve the data lineage documentation before the dataset is used for training
    • Validate access controls: who can see which data, who can modify labels, who can export datasets
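As a sketch of what "every labeling decision is tracked" can mean in practice, here is a minimal append-only audit log. The event fields and JSON-lines format are illustrative assumptions, not the schema of any particular platform:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AuditEvent:
    # Hypothetical event schema; field names are illustrative.
    actor: str        # who performed the action
    action: str       # e.g. "ingest", "label", "export"
    document_id: str  # which record was touched
    detail: str       # free-form context (source path, label value, format)
    timestamp: str    # UTC ISO-8601

def record_event(log_path, actor, action, document_id, detail):
    """Append one event as a JSON line. Append-only writes keep the
    trail reviewable in order, which is what the compliance officer audits."""
    event = AuditEvent(actor, action, document_id, detail,
                       datetime.now(timezone.utc).isoformat())
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")
    return event
```

The point is not this particular code but the property it demonstrates: if every pipeline step emits an event like this as a side effect, the audit trail exists before anyone asks for it.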

    What they should NOT do:

    • Label data. They don't have domain expertise.
    • Design the pipeline. They specify requirements; the ML engineer implements them.
    • Wait until the end to review. By then, compliance issues are embedded throughout the dataset and remediation is expensive.

    Time allocation: 2-4 hours per week during active data preparation. Higher during initial pipeline setup (when governance policies are being configured) and during final review before dataset export.

    Team Structure Options

    Embedded Pod (For 1-3 Projects)

    A single cross-functional team dedicated to a specific AI project. The pod includes:

    • 1 ML engineer (full-time on the project)
    • 2-3 domain experts (part-time, 30 minutes/day)
    • 1 compliance officer (part-time, shared across 2-3 pods)

    Advantages: Tight communication, fast decision-making, clear accountability. The team sits together (physically or virtually) and resolves issues in real-time.

    Disadvantages: Doesn't scale beyond 3-4 projects without duplicating ML engineer and compliance headcount.

    Matrix Model (For 4-10 Projects)

    Functional teams (ML engineering, domain expertise, compliance) contribute members to data preparation projects. An ML engineer might support two data preparation projects simultaneously.

    Advantages: More efficient use of specialized talent. ML engineers and compliance officers are shared across projects.

    Disadvantages: Split attention. The ML engineer supporting two projects prioritizes one, and the other stalls. Requires strong project management to prevent this.

    Mitigation: Stagger project phases. If Project A is in the labeling phase (low ML engineer demand) while Project B is in pipeline setup (high ML engineer demand), the same engineer can support both.

    Hub-and-Spoke (For 10+ Projects or Ongoing Operations)

    A central data operations team (hub) of 2-4 ML engineers and 1 compliance officer maintains the data preparation platform and handles pipeline architecture. Domain expert contributors (spokes) from across the organization participate in labeling and review on a project basis.

    Advantages: Scales to many projects. The hub team develops deep expertise in data preparation. Domain experts are brought in only when their specific knowledge is needed.

    Disadvantages: The hub team can become a bottleneck. Domain experts feel less ownership because they're peripheral to the process.

    Mitigation: Self-service labeling. The hub team sets up projects and configures quality checks, then domain experts access their labeling queues independently without hub team involvement.

    Communication Cadence

    Daily standups for data preparation teams are wasteful. Data preparation work is largely independent — annotators label examples, the ML engineer monitors quality, the compliance officer reviews documentation. There isn't enough to discuss daily.

    Weekly sync (30 minutes): The three roles meet once per week to review:

    • Labeling progress: examples labeled this week, quality metrics trend
    • Pipeline issues: parsing errors, quality check failures, annotator questions
    • Compliance status: any new requirements, audit trail completeness
    • Next week's priorities

    Async review channel: A Slack/Teams channel for real-time questions. Domain experts post ambiguous examples ("How should I label this?"). The ML engineer posts quality metric alerts. The compliance officer flags documentation gaps.

    Monthly retrospective (1 hour): Review the overall data preparation process. What's working? What's slow? Where are the bottlenecks? This is where process improvements are identified and planned.

    Conflict Resolution

    The three roles have natural tensions that require explicit resolution mechanisms.

    "More Data" vs. "Minimize Data"

    The ML engineer wants more training examples for better model performance. The compliance officer wants to minimize data collection and retention. Both are right within their domain.

    Resolution: Define the minimum viable dataset — the smallest dataset that achieves the performance target. Collect that amount, plus a 20% buffer for quality filtering. Document the justification for the volume collected. This satisfies the ML engineer's performance needs while meeting the compliance officer's data minimization requirements.
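The volume calculation is simple enough to encode directly. A minimal sketch, assuming the 20% quality-filtering buffer described above (the 5,000-example figure is a made-up illustration):

```python
import math

def collection_target(min_viable: int, quality_buffer: float = 0.20) -> int:
    """Examples to collect: the minimum viable dataset plus a buffer for
    examples lost to quality filtering. The buffer size is a project choice."""
    return math.ceil(min_viable * (1 + quality_buffer))

# If 5,000 examples hit the performance target, collect 6,000.
print(collection_target(5000))
```

Writing the target down as a formula also produces the justification document the compliance officer needs: the collected volume is derived from the performance requirement, not from "more is better."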

    "Speed" vs. "Quality"

    The ML engineer wants to move fast — "let's label 1,000 examples this week and start training." The domain expert insists on careful review — "I need to think about each example."

    Resolution: Time-box labeling sessions (20-30 minutes/day) but set quality thresholds that must be met before training begins. This prevents both extremes: the ML engineer can't rush labeling past the quality bar, and the domain expert can't delay the project indefinitely by spending 15 minutes per example.
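One way to make that quality bar explicit is a small gate that the training job must pass. The thresholds below are illustrative placeholders, not recommendations:

```python
def ready_to_train(n_labeled: int, kappa: float,
                   min_examples: int = 1000, min_kappa: float = 0.7):
    """Gate training on both volume (the ML engineer's concern) and
    inter-annotator agreement (the domain expert's concern).
    Thresholds are hypothetical and set per project."""
    checks = {
        "enough_examples": n_labeled >= min_examples,
        "agreement_ok": kappa >= min_kappa,
    }
    return all(checks.values()), checks

ok, checks = ready_to_train(n_labeled=1200, kappa=0.62)
print(ok, checks)  # volume is fine, agreement is not, so training waits
```

Because the gate is code rather than a meeting, neither role has to argue the case each week: labeling continues until both checks pass, and no longer.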

    "Comprehensive Documentation" vs. "Just Ship It"

    The compliance officer wants complete documentation of every data handling decision. The ML engineer wants to train the model and iterate.

    Resolution: Build documentation into the tooling, not into a separate process. If the platform automatically records who labeled what, when, and how data flowed through the pipeline, compliance documentation is generated as a byproduct of doing the work — not as an additional step that creates friction.

    Scaling the Model

    As an organization matures, the cross-functional team model evolves:

    Stage 1 (First project): Ad-hoc cross-functional collaboration. The ML engineer reaches out to a willing domain expert. Compliance reviews at the end. This works once.

    Stage 2 (2-5 projects): Formalized embedded pods with defined roles and communication cadence. Compliance is involved from the start. Labeling guidelines are documented and reused.

    Stage 3 (5-15 projects): Hub-and-spoke model. Central data ops team, contributor network of domain experts, shared compliance officer. Standardized workflows and templates.

    Stage 4 (15+ projects): Data preparation as a service. The central team operates the platform, manages quality standards, and provides self-service capabilities for project teams to set up their own data preparation workflows within governance guardrails.

    Each stage requires different tooling capabilities. Stage 1 can get by with basic tools. Stages 3-4 require a platform with role-based access controls, workflow templates, automated quality monitoring, and compliance reporting — all in one system.

    Ertas Data Suite supports role-based workflows for all three roles. ML engineers configure pipelines, quality metrics, and export settings. Domain experts access a simplified labeling interface designed for non-technical users — no code, no terminal, no setup. Compliance officers access audit trails, data lineage reports, and access control dashboards. Each role sees only what they need, with appropriate permissions. The platform runs on-premise, providing the data residency guarantees that compliance officers require.


    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
