    EU AI Act Article 10: What It Means for Your AI Training Data
    Tags: EU AI Act · Compliance · Data Governance · Enterprise AI


    EU AI Act Article 10 sets strict data governance requirements for high-risk AI systems. Here's what it means for enterprise teams preparing AI training data — and the August 2026 compliance deadline.

    Ertas Team

    When the EU AI Act entered into force in August 2024, most commentary focused on the prohibited AI practices (Article 5) and the high-risk system requirements (Annex III). Less attention has been paid to Article 10 — the provision that governs the data used to build high-risk AI systems. This is a problem, because Article 10 imposes specific, enforceable requirements on your training data, validation data, and test data — requirements that most enterprise AI teams are not currently meeting.

    The full applicability deadline for high-risk AI systems is August 2, 2026. If you are building AI in any of the covered domains, you have a narrow window to bring your data governance practices into compliance.


    Which Systems Are Subject to Article 10?

    Article 10 applies to providers of "high-risk AI systems" as defined in Annex III. The list includes AI used in:

    • Critical infrastructure (utilities, transport, water supply)
    • Educational and vocational training (access to education, performance evaluation)
    • Employment and HR (recruiting, promotion, work management, termination)
    • Essential services (credit scoring, insurance risk, emergency services dispatch)
    • Law enforcement (risk assessment, lie detection, evidence reliability)
    • Migration and border control (risk assessment, document verification)
    • Administration of justice (AI assisting courts)
    • Medical devices (AI classified as medical devices under EU MDR)

    If your organization is developing or deploying AI in any of these areas and placing it on the EU market, Article 10 applies. Note that "provider" includes in-house development teams — you do not need to be selling AI commercially to be a provider under the Act.

    For organizations uncertain about whether their system qualifies, the EU Commission has issued guidance, but the safest approach is to assume high-risk classification applies if your AI makes or assists in consequential decisions about people.


    What Article 10 Actually Requires

    Article 10 is titled "Data and Data Governance." Its requirements cover the entire data pipeline, not just the final training set.

    Paragraph 1: Practices for Data Management

    Providers must implement data governance and management practices covering:

    • The design choices regarding data (what to include and why)
    • Data collection processes
    • Relevant data preparation processing operations (cleaning, labeling, enrichment, aggregation, annotation)
    • How the data aligns with the intended purpose of the AI system

    This is not a documentation-after-the-fact requirement. The practices must be in place during development, which means your current data preparation workflow is already in scope.
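To make this concrete, here is a minimal sketch of how the Paragraph 1 design choices could be captured as a structured record alongside the pipeline. All field names and example values are hypothetical, not a prescribed format:

```python
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
import json

@dataclass
class DataDesignRecord:
    """Captures the Paragraph 1 design choices for one dataset."""
    dataset_name: str
    intended_purpose: str        # how the data aligns with the system's purpose
    sources_included: list
    sources_excluded: list       # what was left out...
    exclusion_rationale: str     # ...and why
    preparation_steps: list      # cleaning, labeling, enrichment, aggregation
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Hypothetical example for a credit-scoring system (Annex III, essential services)
record = DataDesignRecord(
    dataset_name="credit-scoring-train-v3",
    intended_purpose="Consumer credit risk scoring",
    sources_included=["internal loan history 2018-2024"],
    sources_excluded=["purchased third-party browsing data"],
    exclusion_rationale="Not relevant to intended purpose; GDPR data minimization",
    preparation_steps=["dedup", "pii-redaction", "labeling-v2-guidelines"],
)
print(json.dumps(asdict(record), indent=2))
```

A record like this, written at development time rather than reconstructed later, is exactly the kind of evidence the documentation requirements in Article 11 expect.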

    Paragraph 2: Data Quality Criteria

    Training, validation, and test datasets must meet four criteria:

    1. Relevant — the data must be relevant to the intended purpose of the AI system
    2. Representative — the data must be sufficiently representative of the conditions under which the system will operate
    3. Free of errors — to the extent possible; this requires active quality assessment, not just an assumption
    4. Complete — with respect to the characteristics or properties necessary for the purpose

    The phrase "to the extent possible" for errors is meaningful — it acknowledges that perfect data does not exist. But it also means you need to demonstrate that you have actively examined and addressed data quality issues, not simply ignored them.
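As one illustration of what "active examination" can look like, the completeness criterion can be checked mechanically. The sketch below (field names hypothetical) reports what share of records have each required field populated:

```python
def completeness_report(records, required_fields):
    """Share of records with each required field populated
    (Paragraph 2 criterion 4, 'complete')."""
    report = {}
    for f in required_fields:
        present = sum(1 for r in records if r.get(f) not in (None, ""))
        report[f] = present / len(records)
    return report

# Hypothetical rows from a credit-scoring dataset
rows = [
    {"income": 42000, "age": 35, "region": "DE"},
    {"income": None, "age": 51, "region": "FR"},
    {"income": 38000, "age": 29, "region": ""},
]
print(completeness_report(rows, ["income", "age", "region"]))
# income and region are populated in 2 of 3 records; age in all 3
```

The point is not this particular metric but that the result is computed, recorded, and actionable, rather than assumed.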

    Paragraph 3: Examination for Biases

    Datasets must be examined for possible biases that could affect the AI system's output and lead to risks to health, safety, or fundamental rights. If biases are found, they must be addressed — or if they cannot be fully addressed, the residual bias must be documented and mitigated through other means.

    This requires a deliberate examination process, not just a general assumption that your data is unbiased. The examination methodology and results must be documented.

    Paragraph 4: Sensitive Data

    Where necessary to detect and correct for biases, Article 10(4) allows for the collection and processing of sensitive categories of personal data (Article 9 GDPR data: race, health, political opinion, etc.) — subject to strict conditions including appropriate safeguards and purpose limitation.

    This provision is often misread as broadly permitting sensitive data use. It does not. It provides a narrow exception, specifically for bias detection, with corresponding obligations.

    Paragraph 5: Relevance to the Operational Context

    The representativeness requirement extends to the specific geographical, behavioral, and functional setting where the AI will actually operate. Training data must reflect the real-world conditions of deployment — not just laboratory or ideal conditions.


    Article 11: Technical Documentation

    Article 10's data requirements do not stand alone. Article 11 requires providers to prepare technical documentation demonstrating that their high-risk AI system complies with the Act. Annex IV specifies what this documentation must include.

    For data governance, the technical documentation must contain:

    • A description of the training methodology and the data used
    • Information about the characteristics, limitations, and assumptions of the training data
    • A description of the data governance and management practices applied
    • Documentation of any data augmentation techniques used
    • A description of data examination and quality assessment procedures

    This documentation must be kept up to date throughout the system lifecycle. If you update your training data or retrain your model, the documentation must be updated to reflect the changes.

    The August 2, 2026 deadline means that providers of high-risk AI systems must have this documentation complete and current by that date to remain compliant.


    What "Free of Errors" Requires in Practice

    The requirement that training data be "free of errors to the extent possible" is more operationally demanding than it sounds. It implies:

    Active quality scoring: You need a methodology for assessing data quality — not just spotting obvious errors, but systematic scoring of completeness, consistency, accuracy, and relevance.

    Deduplication: Duplicate records skew model training and can indicate a data quality problem. Your pipeline must include a deduplication step with documented methodology.

    Outlier examination: Statistical outliers in training data may represent genuine edge cases (which you want to include) or data errors (which you want to remove). Article 10 requires you to make that distinction deliberately.

    Label quality: For supervised learning, annotation errors are a form of data error. The quality of your labeling process — inter-annotator agreement, annotation guidelines, review procedures — is part of Article 10 compliance.
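Two of these checks can be sketched in a few lines: exact-match deduplication with a documented count of removals, and Cohen's kappa as a basic inter-annotator agreement measure. This is a minimal illustration, not a complete quality pipeline:

```python
import hashlib

def deduplicate(records, key_fields):
    """Exact-match deduplication on the given fields; returns
    (unique_records, duplicate_count) so the removal is documentable."""
    seen, unique, dupes = set(), [], 0
    for r in records:
        h = hashlib.sha256(
            "|".join(str(r[f]) for f in key_fields).encode()).hexdigest()
        if h in seen:
            dupes += 1
        else:
            seen.add(h)
            unique.append(r)
    return unique, dupes

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators over the same items,
    corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories)
    return (observed - expected) / (1 - expected)
```

In practice you would also handle near-duplicates (fuzzy matching) and more than two annotators, but even these simple versions produce numbers you can record and defend.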


    The Audit Trail Requirement

    Reading Articles 10 and 11 together, a high-risk AI system provider must be able to reconstruct the history of their training data: what was included, what was excluded, what transformations were applied, and why.

    This requires an audit trail that documents:

    • Source documents and their provenance
    • Parsing and extraction steps
    • Cleaning and deduplication operations
    • Redaction and de-identification steps
    • Annotation events (who labeled what, when, using which guidelines)
    • Augmentation operations (what synthetic data was generated, with what parameters)
    • Export operations (what dataset version was exported for training)
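At its simplest, such an audit trail can be an append-only log where every pipeline step emits one structured event. The sketch below uses JSON Lines; the field names and the example event are hypothetical:

```python
import json
from datetime import datetime, timezone

def log_event(path, operator, action, affected_ids, params=None):
    """Append one pipeline event to an append-only JSONL audit trail."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "operator": operator,
        "action": action,
        "affected_records": affected_ids,
        "parameters": params or {},
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return event

# Hypothetical: record a deduplication pass over two source records
log_event("audit.jsonl", "analyst-07", "dedup",
          ["rec-104", "rec-221"],
          {"method": "sha256-exact", "fields": ["text"]})
```

Because each line is self-contained JSON, the trail can later be filtered by record ID or action type to reconstruct the history of any given dataset version.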

    Most current data preparation pipelines — cobbled together from Docling, Label Studio, Cleanlab, and ad hoc scripts — produce no shared lineage. Docling parses files and writes to a folder. Label Studio annotates without a structural link to those source files. Cleaning scripts run and overwrite. The result is a training dataset with no traceable history.

    Reconstructing lineage after the fact is harder than building it in from the start. By August 2026, retrofitting will no longer be an option: the documentation must already be complete and current.


    Practical Steps to Achieve Article 10 Compliance

    Step 1: Classify Your AI Systems

    Determine whether your AI projects fall under the high-risk classification. If there is ambiguity, treat it as high-risk until you have a documented risk assessment saying otherwise.

    Step 2: Audit Your Current Data Pipeline

    Map every step from raw data to training dataset. Identify where documentation gaps exist — stages with no log, tools with no audit output, transformations that happen in undocumented scripts.

    Step 3: Implement Quality Assessment

    Define your data quality criteria for each dataset. Run systematic quality scoring. Document what you found and what you did about it.

    Step 4: Conduct a Bias Examination

    This does not require a machine learning researcher. It requires a structured review of your dataset composition against the population the AI will serve. Document the methodology, findings, and mitigations.

    Step 5: Establish Audit Logging

    Every transformation step must produce a log entry: timestamp, operator, action, affected records. The log must be preserved and exportable.

    Step 6: Write the Technical Documentation

    Pull the pieces together into Annex IV-compliant documentation. This is not a one-time exercise — it must be maintained for the system lifecycle.


    How Ertas Data Suite Supports Article 10 Compliance

    Ertas Data Suite was designed with Article 10 compliance as a first-class requirement, not an afterthought. Every transformation across the five pipeline stages — Ingest, Clean, Label, Augment, Export — is logged with timestamp and operator ID. The audit trail is a structured export, not a text log, making it usable for technical documentation without manual reformatting.

    The Clean module performs automated quality scoring and deduplication, with results documented in the project record. The Label module tracks annotation events at the individual record level. The Export module produces a dataset manifest alongside the training data, recording version history and pipeline parameters.

    The pipeline runs entirely on-premise with no data egress, satisfying the data sovereignty requirements that often accompany EU AI Act compliance in regulated sectors.

    For teams facing the August 2026 deadline, the question is not whether to build compliant data governance practices — it is whether to build them into the pipeline from the start, or attempt to retrofit them onto an existing fragmented toolchain.


    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 10 compliance built in.
