
EU AI Act Data Governance Checklist for High-Risk AI Systems
An actionable checklist covering data quality, bias detection, documentation, audit trails, and monitoring obligations for high-risk AI systems under the EU AI Act.
If you're building or deploying a high-risk AI system under the EU AI Act, Article 10 requires specific data governance practices for your training, validation, and testing datasets. This checklist maps directly to the regulation's requirements.
Use this as a compliance audit tool — work through each section and identify gaps in your current pipeline.
1. Data Collection and Origin
- Document the origin of all training data (sources, providers, collection dates)
- Record the data collection methodology for each source
- Document the purpose for which data was originally collected
- Verify legal basis for using data for AI training (consent, legitimate interest, contractual necessity)
- Record geographic origin of data where relevant to representativeness
- Document any data purchased from third parties, including vendor assessments
- Maintain records of data access permissions and licensing terms
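The provenance items above lend themselves to a machine-readable record per source. A minimal sketch of what such a record might look like — field names are illustrative, not terms defined by the Act:

```python
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class DataSourceRecord:
    """One entry per data source; export these into the technical
    documentation package. All field names are illustrative."""
    source_id: str
    origin: str              # e.g. URL, vendor name, internal system
    provider: str
    collection_date: date
    collection_method: str   # scrape, API export, vendor delivery, ...
    original_purpose: str    # purpose the data was first collected for
    legal_basis: str         # consent, legitimate interest, contract
    geography: str           # geographic origin, where relevant
    license_terms: str

registry: list[DataSourceRecord] = []

def register_source(record: DataSourceRecord) -> dict:
    """Append a source record to the registry and return it as a dict."""
    registry.append(record)
    return asdict(record)

# Hypothetical example source
example = register_source(DataSourceRecord(
    source_id="src-001", origin="https://example.com/corpus",
    provider="ACME Data GmbH", collection_date=date(2025, 3, 1),
    collection_method="API export", original_purpose="market research",
    legal_basis="contract", geography="EU", license_terms="CC-BY-4.0",
))
```

A structured registry like this makes the "document the origin of all training data" items queryable rather than buried in prose.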
2. Data Preparation and Cleaning
- Document all data preparation operations applied (parsing, extraction, normalization)
- Record tools and versions used for each preparation step
- Log deduplication methods and results (duplicates found, removed, rationale)
- Document data quality thresholds and filtering criteria
- Record PII/PHI detection and redaction methods with entity counts
- Log all data transformations with before/after examples
- Maintain version history of cleaned datasets
- Record operator identity for each preparation step
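The preparation items above amount to an append-only audit log with one entry per operation. A minimal sketch, assuming an in-memory log (a production pipeline would persist this to append-only storage):

```python
from datetime import datetime, timezone

prep_log: list[dict] = []

def log_prep_step(step: str, operator: str, tool: str, tool_version: str,
                  records_in: int, records_out: int, params: dict) -> dict:
    """Record one data preparation operation: who ran it, with which
    tool version, on how many records, and with what parameters."""
    entry = {
        "step": step,
        "operator": operator,                     # operator identity
        "tool": f"{tool}=={tool_version}",        # pinned tool version
        "records_in": records_in,
        "records_out": records_out,
        "records_dropped": records_in - records_out,
        "params": params,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    prep_log.append(entry)
    return entry

# Hypothetical deduplication step
entry = log_prep_step("deduplication", "j.doe", "dedupe-tool", "1.4.2",
                      records_in=10_000, records_out=9_480,
                      params={"method": "minhash", "threshold": 0.9})
```

Capturing record counts in and out at every step also gives you the deduplication results and filtering rationale items for free.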
3. Labeling and Annotation
- Define and document the labeling schema (categories, definitions, guidelines)
- Record annotator qualifications and domain expertise
- Document the labeling process (manual, AI-assisted, programmatic)
- If AI-assisted: document the model used, confidence thresholds, and human review process
- Measure and record inter-annotator agreement rates
- Document disagreement resolution procedures and outcomes
- Record the number of labels per annotator and per category
- Maintain a mapping from labels to annotator identity and timestamp
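For the inter-annotator agreement item, Cohen's kappa is a common choice for two annotators labeling the same items — observed agreement corrected for agreement expected by chance. A self-contained sketch:

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label distribution
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Two annotators agree on 3 of 4 items -> kappa = 0.5
kappa = cohens_kappa(["cat", "dog", "cat", "cat"],
                     ["cat", "dog", "dog", "cat"])
```

For more than two annotators, Fleiss' kappa or Krippendorff's alpha are the usual generalizations; the reporting obligation is the same either way.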
4. Bias Examination
- Define the dimensions on which bias will be examined (age, gender, ethnicity, geography, etc.)
- Select and document bias detection methodology
- Run bias analysis on training, validation, and test datasets
- Document findings: identified biases, magnitude, affected groups
- Record mitigation measures taken for each identified bias
- Assess residual bias after mitigation and document acceptable thresholds
- Plan for ongoing bias monitoring post-deployment
- Document the rationale for dimensions not examined (if any)
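One simple bias-detection building block for the dimensions above is a representation check: compare each group's share in the dataset against its share in the target population. A minimal sketch, assuming you have group annotations and an agreed reference distribution:

```python
from collections import Counter

def representation_ratios(groups: list[str],
                          reference: dict[str, float]) -> dict[str, float]:
    """Ratio of each group's observed share to its share in a reference
    (target) population. Ratios far from 1.0 flag under- or
    over-representation along the examined dimension."""
    counts = Counter(groups)
    n = len(groups)
    return {g: (counts.get(g, 0) / n) / ref_share
            for g, ref_share in reference.items()}

# Hypothetical gender dimension against an assumed 50/50 target population
ratios = representation_ratios(
    groups=["F"] * 300 + ["M"] * 700,
    reference={"F": 0.5, "M": 0.5},
)
```

Representation is only one facet of bias — label bias and outcome disparities need their own methods — but ratios like these give you documented magnitudes and affected groups for the findings item.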
5. Data Quality Assessment
- Define data quality criteria specific to the AI system's intended purpose
- Measure and record error rates in training data
- Assess dataset completeness (missing values, underrepresented categories)
- Evaluate representativeness relative to the target population
- Document known data gaps and their potential impact
- Record quality scoring methodology and thresholds
- Assess data freshness (is the data current enough for the intended purpose?)
- Document actions taken to improve data quality
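The completeness item above can be automated with a per-field presence check against whatever quality thresholds you define. A minimal sketch over dict-shaped records:

```python
def completeness_report(records: list[dict],
                        required: list[str]) -> dict[str, float]:
    """Fraction of records where each required field is present and
    non-empty. Fields below an agreed threshold are documented as
    data gaps."""
    n = len(records)
    return {
        f: sum(1 for r in records if r.get(f) not in (None, "")) / n
        for f in required
    }

# Hypothetical two-record dataset with a missing income value
report = completeness_report(
    [{"age": 34, "income": None}, {"age": 51, "income": 42_000}],
    required=["age", "income"],
)
```

The same loop structure extends naturally to per-category completeness, which is what surfaces underrepresented categories rather than just missing values.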
6. Statistical Properties
- Document dataset size (total records, records per category)
- Record class distribution and imbalance ratios
- Document statistical properties of key features (distributions, ranges, outliers)
- Assess and document dataset coverage relative to intended deployment context
- Record train/validation/test split methodology and ratios
- Document any data augmentation applied and its impact on distribution
- Identify and document edge cases and their representation in the dataset
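Class distribution and imbalance ratios are straightforward to compute and worth generating automatically for every dataset version. A minimal sketch:

```python
from collections import Counter

def class_stats(labels: list[str]) -> dict:
    """Class counts, shares, and the majority/minority imbalance ratio
    for a labeled dataset."""
    counts = Counter(labels)
    n = len(labels)
    return {
        "counts": dict(counts),
        "shares": {c: k / n for c, k in counts.items()},
        "imbalance_ratio": max(counts.values()) / min(counts.values()),
    }

# Hypothetical 90/10 binary classification dataset
stats = class_stats(["spam"] * 90 + ["ham"] * 10)
```

Emitting this profile as part of every dataset export means the statistical documentation in section 9 stays current without manual effort.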
7. Data Lineage and Traceability
- Implement record-level lineage tracking (source → ingestion → cleaning → labeling → export)
- Record timestamps for every transformation
- Attribute every operation to an identified operator
- Ensure lineage is maintained across all pipeline stages without gaps
- Verify that any exported training record can be traced back to its source
- Implement immutable audit logs (cannot be modified after creation)
- Test lineage by randomly sampling output records and tracing them end-to-end
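One way to satisfy both the immutability and traceability items is a hash-chained log: each entry's hash covers the previous entry's hash, so any after-the-fact edit breaks the chain. A minimal in-memory sketch (not a substitute for write-once storage):

```python
import hashlib
import json

def _entry_hash(entry: dict, prev_hash: str) -> str:
    payload = json.dumps(entry, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

class LineageLog:
    """Append-only, hash-chained lineage log."""
    def __init__(self):
        self.entries: list[dict] = []
        self.hashes: list[str] = ["genesis"]

    def append(self, record_id: str, stage: str, operator: str) -> str:
        entry = {"record_id": record_id, "stage": stage,
                 "operator": operator}
        h = _entry_hash(entry, self.hashes[-1])
        self.entries.append(entry)
        self.hashes.append(h)
        return h

    def verify(self) -> bool:
        """Recompute the chain; False means an entry was modified."""
        prev = "genesis"
        for entry, h in zip(self.entries, self.hashes[1:]):
            if _entry_hash(entry, prev) != h:
                return False
            prev = h
        return True

# One record traced through every pipeline stage
log = LineageLog()
for stage in ("source", "ingestion", "cleaning", "labeling", "export"):
    log.append("rec-0001", stage, operator="j.doe")
```

The end-to-end sampling test in the last item then reduces to filtering the log by `record_id` and checking that every stage appears with no gaps.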
8. Dataset Versioning
- Implement dataset version control (unique version identifiers)
- Record which dataset version was used to train which model version
- Maintain ability to reproduce any historical dataset version
- Document changes between dataset versions (additions, removals, label corrections)
- Record the rationale for dataset updates
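Content-addressed version identifiers make the model-to-dataset mapping unambiguous: hash a manifest of file names and content hashes, and identical content always yields the same version id. A minimal sketch:

```python
import hashlib
import json

def dataset_version_id(manifest: dict) -> str:
    """Deterministic version identifier derived from a dataset manifest
    (file names mapped to their content hashes, plus schema metadata)."""
    canonical = json.dumps(manifest, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Hypothetical manifest; the per-file hashes here are placeholders
manifest = {"files": {"train.jsonl": "9f2c41d0",
                      "val.jsonl": "77ab03ce"},
            "schema": 2}
v1 = dataset_version_id(manifest)
```

Storing `(model_version, dataset_version_id)` pairs at training time satisfies the second item; reproducing a historical version then only requires retaining the files named in the manifest.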
9. Technical Documentation (Article 30)
- Compile all above documentation into a structured technical documentation package
- Include data governance policies and procedures
- Include bias examination methodology and results
- Include quality assessment reports
- Include statistical profiles of all datasets
- Include lineage documentation with sample traces
- Format documentation for regulatory review (organized, searchable, complete)
- Establish a process for keeping documentation current as datasets evolve
10. Ongoing Obligations
- Establish post-deployment data monitoring procedures
- Define triggers for dataset re-evaluation (data drift, performance degradation)
- Plan for periodic bias re-assessment
- Establish incident reporting procedures for data-related issues
- Assign responsibility for maintaining compliance documentation
- Schedule regular compliance reviews (quarterly recommended)
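For the drift trigger item, the Population Stability Index (PSI) over binned feature distributions is a widely used heuristic. A minimal sketch — the 0.2 threshold below is a common rule of thumb, not a threshold set by the Act:

```python
import math

def psi(expected_shares: list[float],
        observed_shares: list[float]) -> float:
    """Population Stability Index over pre-binned distribution shares.
    0 means no shift; larger values mean larger drift."""
    eps = 1e-6  # guard against log(0) for empty bins
    return sum((o - e) * math.log((o + eps) / (e + eps))
               for e, o in zip(expected_shares, observed_shares))

def needs_reevaluation(expected: list[float],
                       observed: list[float],
                       threshold: float = 0.2) -> bool:
    """True if drift exceeds the agreed re-evaluation trigger."""
    return psi(expected, observed) > threshold
```

Wiring a check like this into post-deployment monitoring turns "define triggers for dataset re-evaluation" from a policy statement into an automated, auditable control.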
How to Use This Checklist
Work through each section with your data team and compliance officer. For each item:
- Green: Fully implemented and documented
- Yellow: Partially implemented or documented — needs improvement
- Red: Not implemented — compliance gap
Any red items in sections 1-7 represent potential Article 10 violations; any red items in section 9 represent potential Article 30 violations. Both carry fines of up to €15 million or 3% of global annual turnover, whichever is higher.
Pipeline Architecture Matters
Many of these checklist items are straightforward to satisfy if your data pipeline has built-in audit logging and lineage tracking. They become expensive and error-prone when your pipeline is a chain of disconnected tools where each boundary creates a documentation gap.
Unified on-premise platforms like Ertas Data Suite are designed to satisfy this checklist by default — every stage logs operations, attributes operators, maintains lineage, and generates exportable compliance reports. If you're evaluating tools, use this checklist as a feature evaluation framework.
The August 2026 enforcement deadline is five months away. Start your audit now.