    EU AI Act Article 10: What It Means for Your AI Training Data
    Tags: EU AI Act · Compliance · Data Governance · Enterprise AI


    EU AI Act Article 10 sets strict data governance requirements for high-risk AI systems. Here's what it means for enterprise teams preparing AI training data — and the August 2026 compliance deadline.

    Ertas Team

    When the EU AI Act entered into force in August 2024, most commentary focused on the prohibited AI practices (Article 5) and the high-risk system requirements (Annex III). Less attention has been paid to Article 10 — the provision that governs the data used to build high-risk AI systems. This is a problem, because Article 10 imposes specific, enforceable requirements on your training data, validation data, and test data — requirements that most enterprise AI teams are not currently meeting.

    The full applicability deadline for high-risk AI systems is August 2, 2026. If you are building AI in any of the covered domains, you have a narrow window to bring your data governance practices into compliance.


    Which Systems Are Subject to Article 10?

    Article 10 applies to providers of "high-risk AI systems" as defined in Annex III. The list includes AI used in:

    • Critical infrastructure (utilities, transport, water supply)
    • Educational and vocational training (access to education, performance evaluation)
    • Employment and HR (recruiting, promotion, work management, termination)
    • Essential services (credit scoring, insurance risk, emergency services dispatch)
    • Law enforcement (risk assessment, lie detection, evidence reliability)
    • Migration and border control (risk assessment, document verification)
    • Administration of justice (AI assisting courts)
    • Medical devices (AI classified as medical devices under EU MDR)

    If your organization is developing or deploying AI in any of these areas and placing it on the EU market, Article 10 applies. Note that "provider" includes in-house development teams — you do not need to be selling AI commercially to be a provider under the Act.

    For organizations uncertain about whether their system qualifies, the EU Commission has issued guidance, but the safest approach is to assume high-risk classification applies if your AI makes or assists in consequential decisions about people.


    What Article 10 Actually Requires

    Article 10 is titled "Data and Data Governance." Its requirements cover the entire data pipeline, not just the final training set.

    Paragraph 1: Practices for Data Management

    Providers must implement data governance and management practices covering:

    • The design choices regarding data (what to include and why)
    • Data collection processes
    • Relevant data preparation processing operations (cleaning, labeling, enrichment, aggregation, annotation)
    • How the data aligns with the intended purpose of the AI system

    This is not a documentation-after-the-fact requirement. The practices must be in place during development, which means your current data preparation workflow is already in scope.
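To make this concrete, here is a minimal sketch of how the Paragraph 1 design choices could be captured as a structured record alongside the pipeline. All field names and example values are hypothetical, not a prescribed format:

```python
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
import json

@dataclass
class DataDesignRecord:
    """Captures the Paragraph 1 design choices for one dataset."""
    dataset_name: str
    intended_purpose: str        # how the data aligns with the system's purpose
    sources_included: list
    sources_excluded: list       # what was left out...
    exclusion_rationale: str     # ...and why
    preparation_steps: list      # cleaning, labeling, enrichment, aggregation
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Hypothetical example for a credit-scoring system (Annex III, essential services)
record = DataDesignRecord(
    dataset_name="credit-scoring-train-v3",
    intended_purpose="Consumer credit risk scoring",
    sources_included=["internal loan history 2018-2024"],
    sources_excluded=["purchased third-party browsing data"],
    exclusion_rationale="Not relevant to intended purpose; GDPR data minimization",
    preparation_steps=["dedup", "pii-redaction", "labeling-v2-guidelines"],
)
print(json.dumps(asdict(record), indent=2))
```

A record like this, written at development time rather than reconstructed later, is exactly the kind of evidence the documentation requirements in Article 11 expect.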

    Paragraph 2: Data Quality Criteria

    Training, validation, and test datasets must meet four criteria:

    1. Relevant — the data must be relevant to the intended purpose of the AI system
    2. Representative — the data must be sufficiently representative of the conditions under which the system will operate
    3. Free of errors — to the extent possible; this requires active quality assessment, not just an assumption
    4. Complete — with respect to the characteristics or properties necessary for the purpose

    The phrase "to the extent possible" for errors is meaningful — it acknowledges that perfect data does not exist. But it also means you need to demonstrate that you have actively examined and addressed data quality issues, not simply ignored them.
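As one illustration of what "active examination" can look like, the completeness criterion can be checked mechanically. The sketch below (field names hypothetical) reports what share of records have each required field populated:

```python
def completeness_report(records, required_fields):
    """Share of records with each required field populated
    (Paragraph 2 criterion 4, 'complete')."""
    report = {}
    for f in required_fields:
        present = sum(1 for r in records if r.get(f) not in (None, ""))
        report[f] = present / len(records)
    return report

# Hypothetical rows from a credit-scoring dataset
rows = [
    {"income": 42000, "age": 35, "region": "DE"},
    {"income": None, "age": 51, "region": "FR"},
    {"income": 38000, "age": 29, "region": ""},
]
print(completeness_report(rows, ["income", "age", "region"]))
# income and region are populated in 2 of 3 records; age in all 3
```

The point is not this particular metric but that the result is computed, recorded, and actionable, rather than assumed.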

    Paragraph 3: Examination for Biases

    Datasets must be examined for possible biases that could affect the AI system's output and lead to risks to health, safety, or fundamental rights. If biases are found, they must be addressed — or if they cannot be fully addressed, the residual bias must be documented and mitigated through other means.

    This requires a deliberate examination process, not just a general assumption that your data is unbiased. The examination methodology and results must be documented.

    Paragraph 4: Sensitive Data

    Where necessary to detect and correct for biases, Article 10(4) allows for the collection and processing of sensitive categories of personal data (Article 9 GDPR data: race, health, political opinion, etc.) — subject to strict conditions including appropriate safeguards and purpose limitation.

    This provision is often misread as broadly permitting sensitive data use. It does not. It provides a narrow exception, specifically for bias detection, with corresponding obligations.

    Paragraph 5: Relevance to the Operational Context

    The representativeness requirement extends to the specific geographical, behavioral, and functional setting where the AI will actually operate. Training data must reflect the real-world conditions of deployment — not just laboratory or ideal conditions.


    Article 11: Technical Documentation

    Article 10's data requirements do not stand alone. Article 11 requires providers to prepare technical documentation demonstrating that their high-risk AI system complies with the Act. Annex IV specifies what this documentation must include.

    For data governance, the technical documentation must contain:

    • A description of the training methodology and the data used
    • Information about the characteristics, limitations, and assumptions of the training data
    • A description of the data governance and management practices applied
    • Documentation of any data augmentation techniques used
    • A description of data examination and quality assessment procedures

    This documentation must be kept up to date throughout the system lifecycle. If you update your training data or retrain your model, the documentation must be updated to reflect the changes.

    The August 2, 2026 deadline means that providers of high-risk AI systems must have this documentation complete and current by that date to remain compliant.


    What "Free of Errors" Requires in Practice

    The requirement that training data be "free of errors to the extent possible" is more operationally demanding than it sounds. It implies:

    Active quality scoring: You need a methodology for assessing data quality — not just spotting obvious errors, but systematic scoring of completeness, consistency, accuracy, and relevance.

    Deduplication: Duplicate records skew model training and can indicate a data quality problem. Your pipeline must include a deduplication step with documented methodology.

    Outlier examination: Statistical outliers in training data may represent genuine edge cases (which you want to include) or data errors (which you want to remove). Article 10 requires you to make that distinction deliberately.

    Label quality: For supervised learning, annotation errors are a form of data error. The quality of your labeling process — inter-annotator agreement, annotation guidelines, review procedures — is part of Article 10 compliance.
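Two of these checks can be sketched in a few lines: exact-match deduplication with a documented count of removals, and Cohen's kappa as a basic inter-annotator agreement measure. This is a minimal illustration, not a complete quality pipeline:

```python
import hashlib

def deduplicate(records, key_fields):
    """Exact-match deduplication on the given fields; returns
    (unique_records, duplicate_count) so the removal is documentable."""
    seen, unique, dupes = set(), [], 0
    for r in records:
        h = hashlib.sha256(
            "|".join(str(r[f]) for f in key_fields).encode()).hexdigest()
        if h in seen:
            dupes += 1
        else:
            seen.add(h)
            unique.append(r)
    return unique, dupes

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators over the same items,
    corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories)
    return (observed - expected) / (1 - expected)
```

In practice you would also handle near-duplicates (fuzzy matching) and more than two annotators, but even these simple versions produce numbers you can record and defend.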


    The Audit Trail Requirement

    Reading Articles 10 and 11 together, a high-risk AI system provider must be able to reconstruct the history of their training data: what was included, what was excluded, what transformations were applied, and why.

    This requires an audit trail that documents:

    • Source documents and their provenance
    • Parsing and extraction steps
    • Cleaning and deduplication operations
    • Redaction and de-identification steps
    • Annotation events (who labeled what, when, using which guidelines)
    • Augmentation operations (what synthetic data was generated, with what parameters)
    • Export operations (what dataset version was exported for training)
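At its simplest, such an audit trail can be an append-only log where every pipeline step emits one structured event. The sketch below uses JSON Lines; the field names and the example event are hypothetical:

```python
import json
from datetime import datetime, timezone

def log_event(path, operator, action, affected_ids, params=None):
    """Append one pipeline event to an append-only JSONL audit trail."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "operator": operator,
        "action": action,
        "affected_records": affected_ids,
        "parameters": params or {},
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return event

# Hypothetical: record a deduplication pass over two source records
log_event("audit.jsonl", "analyst-07", "dedup",
          ["rec-104", "rec-221"],
          {"method": "sha256-exact", "fields": ["text"]})
```

Because each line is self-contained JSON, the trail can later be filtered by record ID or action type to reconstruct the history of any given dataset version.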

    Most current data preparation pipelines — cobbled together from Docling, Label Studio, Cleanlab, and ad hoc scripts — produce no shared lineage. Docling parses files and writes to a folder. Label Studio annotates without a structural link to those source files. Cleaning scripts run and overwrite. The result is a training dataset with no traceable history.

    Reconstructing lineage after the fact is harder than building it in from the start. By August 2026, retrofitting will no longer be an option: the documentation must already be complete and current.


    Practical Steps to Achieve Article 10 Compliance

    Step 1: Classify Your AI Systems

    Determine whether your AI projects fall under the high-risk classification. If there is ambiguity, treat it as high-risk until you have a documented risk assessment saying otherwise.

    Step 2: Audit Your Current Data Pipeline

    Map every step from raw data to training dataset. Identify where documentation gaps exist — stages with no log, tools with no audit output, transformations that happen in undocumented scripts.

    Step 3: Implement Quality Assessment

    Define your data quality criteria for each dataset. Run systematic quality scoring. Document what you found and what you did about it.

    Step 4: Conduct a Bias Examination

    This does not require a machine learning researcher. It requires a structured review of your dataset composition against the population the AI will serve. Document the methodology, findings, and mitigations.

    Step 5: Establish Audit Logging

    Every transformation step must produce a log entry: timestamp, operator, action, affected records. The log must be preserved and exportable.

    Step 6: Write the Technical Documentation

    Pull the pieces together into Annex IV-compliant documentation. This is not a one-time exercise — it must be maintained for the system lifecycle.


    How Ertas Data Suite Supports Article 10 Compliance

    Ertas Data Suite was designed with Article 10 compliance as a first-class requirement, not an afterthought. Every transformation across the five pipeline stages — Ingest, Clean, Label, Augment, Export — is logged with timestamp and operator ID. The audit trail is a structured export, not a text log, making it usable for technical documentation without manual reformatting.

    The Clean module performs automated quality scoring and deduplication, with results documented in the project record. The Label module tracks annotation events at the individual record level. The Export module produces a dataset manifest alongside the training data, recording version history and pipeline parameters.

    The pipeline runs entirely on-premise with no data egress, satisfying the data sovereignty requirements that often accompany EU AI Act compliance in regulated sectors.

    For teams facing the August 2026 deadline, the question is not whether to build compliant data governance practices — it is whether to build them into the pipeline from the start, or attempt to retrofit them onto an existing fragmented toolchain.


    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 10 compliance built in.
