
5 Months to EU AI Act Compliance: The Data Pipeline Implementation Sprint
August 2, 2026. That's the deadline for EU AI Act high-risk system compliance. If your AI data pipeline doesn't have audit trails and documentation today, here's the 5-month sprint to get there.
August 2, 2026. That is the date when enforcement provisions for high-risk AI systems under the EU AI Act become fully applicable. On August 3, regulators can request evidence of compliance. If you cannot produce it, penalties start at 7.5 million euros or 1% of global turnover and scale up to 35 million euros or 7% of global turnover for the most serious violations.
Today is March 15, 2026. You have 140 days.
If your AI data pipeline already produces timestamped audit trails, operator identification, data lineage tracking, and Article 10/30-compliant documentation, you are in good shape. Verify it works, run a mock audit, and move on.
If it does not — if your training data is managed in shared drives, your transformations are logged in spreadsheets (or not logged at all), and your documentation consists of a few slides from last year's AI governance meeting — then this article is for you. Here is the 5-month sprint plan.
Who Is Affected
The EU AI Act classifies AI systems by risk level. The August 2, 2026 deadline applies to high-risk systems — those listed in Annex III of the regulation. If your organization deploys AI in any of these domains, your data pipeline needs compliance infrastructure:
- Employment and worker management: AI systems that screen resumes, evaluate candidates, make promotion decisions, assign tasks, monitor performance, or influence termination decisions.
- Credit and insurance: AI systems that assess creditworthiness, set insurance premiums, or evaluate risk for financial products.
- Education: AI systems that assess students, determine admissions, or assign educational resources.
- Law enforcement: AI systems that assess evidence reliability, perform risk assessments, profile individuals, or evaluate the reliability of testimony.
- Critical infrastructure: AI systems that manage safety components in water, gas, electricity, heating, or digital infrastructure.
- Migration and border control: AI systems that assess risks, verify document authenticity, or process applications.
- Justice and democratic processes: AI systems that assist judicial authorities in fact-finding, law application, or dispute resolution.
If your AI system falls into an Annex III category and is placed on the EU market, put into service in the EU, or produces output that is used in the EU, you are in scope. "We are headquartered outside the EU" does not help: the regulation applies based on where the system is used and whom it affects, not where the company is incorporated.
What Auditors Will Look For
A self-attestation on its own will not get you through an audit. Auditors want operational evidence: machine-readable, timestamped, verifiable records that demonstrate ongoing compliance, not a one-time documentation effort.
Specifically, they will examine:
Data lineage: Can you trace any model output back to the specific training data that produced it? Not "we used a dataset of 50,000 records" but "this model was trained on dataset version 4.2.1, which was produced by applying these specific transformations to these specific source documents on this date by this operator."
Transformation logs: Every operation applied to your training data — filtering, cleaning, labeling, augmenting, deduplication — must be logged with a timestamp, operator ID, parameters used, and the number of records affected. "We cleaned the data" is not a log entry.
Quality documentation: Evidence of data quality assessments at each pipeline stage. What metrics were measured? What thresholds were applied? What happened to data that failed quality checks?
Reproducibility: Can you reproduce the exact dataset used to train any deployed model version? If an auditor requests the training dataset for model v3.2 as deployed in January 2026, can you regenerate it bit-for-bit?
Bias and fairness documentation: Evidence that you examined the training data for biases, documented the findings, and took remediation steps. The standard is not "no bias" — it is "examined, documented, and addressed."
The 5-Month Sprint Plan
Month 1 (March 15 - April 15): Audit and Classify
Weeks 1-2: Inventory all AI systems. List every AI system in production or development. For each system, determine:
- Does it fall under Annex III? (If unsure, assume yes.)
- What training data does it use?
- Where is the training data stored?
- Who prepared the training data?
- What transformations were applied?
- Does any documentation exist?
Weeks 3-4: Gap analysis. For each in-scope system, assess the current state against the requirements:
- Data lineage: Do you know where the training data came from? (Score: 0 = no idea, 1 = general knowledge, 2 = documented sources, 3 = full traceability)
- Transformation logging: Are transformations logged? (Score: 0 = no, 1 = manually, 2 = partially automated, 3 = fully automated)
- Quality documentation: Are quality metrics recorded? (Score: 0-3)
- Reproducibility: Can you recreate past datasets? (Score: 0-3)
- Bias examination: Has bias been assessed? (Score: 0-3)
Any system scoring below 2 in any category needs remediation. Most enterprises find that 70-80% of their AI systems score below 2 in at least one category.
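If it helps to keep the assessment consistent across teams, here is a minimal sketch of the scoring in code. The system names and scores are illustrative, and the remediation threshold of 2 matches the rule above.

```python
# Gap analysis sketch: score each in-scope AI system 0-3 in the five
# categories above and flag anything scoring below 2 for remediation.
CATEGORIES = ["lineage", "transformation_logging", "quality_docs",
              "reproducibility", "bias_examination"]

# Illustrative scores; replace with the results of your own assessment.
systems = {
    "resume-screener": {"lineage": 1, "transformation_logging": 0,
                        "quality_docs": 2, "reproducibility": 1,
                        "bias_examination": 2},
    "credit-risk-model": {"lineage": 3, "transformation_logging": 2,
                          "quality_docs": 3, "reproducibility": 2,
                          "bias_examination": 3},
}

def remediation_needed(scores: dict) -> list[str]:
    """Return the categories where a system scores below 2."""
    return [c for c in CATEGORIES if scores.get(c, 0) < 2]

for name, scores in systems.items():
    gaps = remediation_needed(scores)
    print(f"{name}: {'remediate ' + ', '.join(gaps) if gaps else 'OK'}")
```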
Deliverable: A prioritized remediation plan with specific tasks, owners, and deadlines for Months 2-5.
Month 2 (April 15 - May 15): Implement Automated Logging
This is the foundation. Without automated logging, everything else is retroactive documentation — which auditors will flag.
Implement timestamped logging for every data transformation. Every time data is filtered, cleaned, labeled, augmented, deduplicated, or exported, the system should automatically log:
- Timestamp (from a trusted time source, not the local system clock)
- Operator ID (who initiated the operation)
- Operation type (what was done)
- Parameters (with what settings)
- Input record count and output record count
- Affected records (or a sample hash for large datasets)
Technical implementation options:
- If your pipeline runs in Python scripts: add structured logging (JSON format) with a centralized log aggregator; a minimal sketch follows this list
- If your pipeline uses a workflow orchestrator (Airflow, Prefect): configure the orchestrator's audit logging plus add data-level logging within each task
- If your pipeline uses Ertas Data Suite: logging is built in and compliant by default — every operation is logged with operator ID, timestamp, and full parameters
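To make the first option concrete, here is a minimal sketch of a transformation logger for a plain Python pipeline. The field names follow the list above but are not mandated by the regulation; the timestamp comes from the local clock and the records go to a local file purely for illustration, whereas a compliant setup would use a trusted time source and ship records to write-protected, centralized storage.

```python
import json
import getpass
from datetime import datetime, timezone

def log_transformation(log_path: str, operation: str, parameters: dict,
                       input_count: int, output_count: int,
                       sample_hash: str | None = None) -> None:
    """Append one machine-readable audit record per transformation."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "operator_id": getpass.getuser(),   # individual operator, not a service account
        "operation": operation,             # e.g. "deduplicate", "filter", "label"
        "parameters": parameters,           # exact settings used
        "input_records": input_count,
        "output_records": output_count,
        "sample_hash": sample_hash,         # optional hash of affected records
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")   # JSON Lines: one record per line

# Illustrative call: log a deduplication step that reduced the record count.
log_transformation("audit_log.jsonl", "deduplicate",
                   {"key": "document_id", "keep": "first"},
                   input_count=52_000, output_count=50_113)
```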
Deliverable: Every data transformation in every in-scope pipeline produces a machine-readable log entry. Verify by running a test transformation and confirming the log output.
Month 3 (May 15 - June 15): Build Data Lineage Tracking
Logging tells you what happened. Lineage tells you the chain — how any output connects back to its source through every intermediate step.
Implement dataset versioning. Every dataset version gets a unique identifier that encodes its full history: source data version + transformation sequence + timestamp. When you export a dataset for model training, the version ID is a complete provenance record.
Connect model versions to dataset versions. When a model is trained, record which dataset version was used. This creates the chain: model output → model version → dataset version → transformation history → source data.
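One way to make the version IDs self-describing, sketched below, is to derive them from a content hash over the source data version, the ordered transformation records, and the export timestamp, then store the model-to-dataset link alongside the model registry entry. All identifiers here are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

def dataset_version_id(source_version: str, transformations: list[dict],
                       exported_at: str) -> str:
    """Derive a dataset version ID from its full history: the same source
    version, transformation sequence, and export time always yield the
    same identifier."""
    payload = json.dumps(
        {"source": source_version,
         "transformations": transformations,
         "exported_at": exported_at},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

# Illustrative provenance for one exported training set.
exported_at = datetime.now(timezone.utc).isoformat()
transformations = [
    {"operation": "filter", "parameters": {"language": "de"}},
    {"operation": "deduplicate", "parameters": {"key": "document_id"}},
]
ds_version = dataset_version_id("raw-contracts-2026-02", transformations, exported_at)

# Record the model -> dataset link at training time; this closes the chain
# model output -> model version -> dataset version -> transformations -> source data.
model_record = {"model_version": "v3.2", "dataset_version": ds_version,
                "trained_at": exported_at}
print(json.dumps(model_record, indent=2))
```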
Test the chain end-to-end. Pick a production model. Can you trace its training data back to the original source documents? If the chain breaks at any point, fix it.
Deliverable: For any deployed model, you can produce a lineage report showing the complete chain from source data to deployed model in under 30 minutes.
Month 4 (June 15 - July 15): Create Documentation
With logging and lineage in place, build the documentation that auditors will review.
Article 10 documentation:
- Data governance policy (who is responsible for training data quality)
- Dataset design choices (why this data was selected, what alternatives were considered)
- Data collection processes (sources, dates, consent status)
- Preparation operations (cleaning, labeling, augmentation — now supported by your automated logs)
- Quality assessment (statistical properties, coverage analysis, suitability assessment)
- Bias examination (methods used, findings, remediation actions)
- Gap identification (what data is missing, what the plan is to address it)
Article 30 documentation:
- Technical documentation of the AI system
- Description of the data pipeline
- Quality management procedures
- Record-keeping system description
Template these documents. They will need to be updated whenever the pipeline changes, so create living documents with automated sections that pull from your logging and lineage systems.
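As a sketch of what such an automated section could look like, the snippet below renders the "preparation operations" part of an Article 10 document as a Markdown table built directly from the JSON Lines audit log described in Month 2. The file name and field names match that earlier sketch and are assumptions, not a prescribed format.

```python
import json

def preparation_operations_section(log_path: str) -> str:
    """Render the 'preparation operations' section as a Markdown table
    generated from the transformation audit log, so the document stays
    in sync with the pipeline instead of being maintained by hand."""
    rows = ["| Timestamp | Operator | Operation | Records in -> out |",
            "|---|---|---|---|"]
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            e = json.loads(line)
            rows.append(f"| {e['timestamp']} | {e['operator_id']} | "
                        f"{e['operation']} | {e['input_records']} -> {e['output_records']} |")
    return "\n".join(rows)

print(preparation_operations_section("audit_log.jsonl"))
```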
Deliverable: Completed Article 10 and Article 30 documentation packages for every in-scope AI system.
Month 5 (July 15 - August 2): Test and Validate
Run a mock audit. Engage an internal team (or an external consultant) to play the role of the auditor. Give them access to the same interfaces a real auditor would use. Have them:
- Request the training data lineage for a specific model
- Ask to see transformation logs for a specific date range
- Request evidence of bias examination
- Ask to reproduce a past dataset version (a replay sketch follows this list)
- Attempt to modify a log entry (it should be impossible)
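Reproducing a past dataset version amounts to replaying the recorded transformation sequence against the original source data and comparing content hashes. A minimal sketch, assuming deterministic transformation functions keyed by the "operation" field from the audit log; the two operations shown are illustrative.

```python
import hashlib
import json

# Illustrative deterministic transformations, keyed by the 'operation'
# field recorded in the audit log.
def _filter(records, parameters):
    return [r for r in records if r.get("language") == parameters.get("language")]

def _deduplicate(records, parameters):
    seen, out = set(), []
    for r in records:
        key = r.get(parameters["key"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

OPERATIONS = {"filter": _filter, "deduplicate": _deduplicate}

def rebuild_dataset(source_records, transformation_log):
    """Replay the recorded transformation sequence over the original
    source records to regenerate a past dataset version."""
    records = source_records
    for step in transformation_log:
        records = OPERATIONS[step["operation"]](records, step["parameters"])
    return records

def dataset_hash(records):
    """Stable content hash; compare against the hash recorded at export
    time to confirm the rebuild is bit-for-bit identical."""
    return hashlib.sha256(json.dumps(records, sort_keys=True).encode("utf-8")).hexdigest()
```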
Fix every gap the mock audit reveals. You have 18 days. Prioritize by severity: missing lineage > missing logs > incomplete documentation > formatting issues.
Verify immutability. Confirm that log entries cannot be modified or deleted after creation. This is a common failure point — systems that log to a regular database without write-protection allow post-hoc modification, which undermines the audit trail.
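There are several ways to get tamper evidence (append-only storage permissions, WORM object storage, external timestamping services). One lightweight approach, sketched below, is to hash-chain the log: each record carries the hash of its predecessor, so editing or deleting any entry breaks every hash after it. This assumes the Month 2 log entries are extended with hypothetical "prev_hash" and "entry_hash" fields.

```python
import hashlib
import json

def entry_hash(record: dict) -> str:
    """Hash of one log record (which includes its 'prev_hash' field)."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode("utf-8")).hexdigest()

def verify_chain(log_path: str) -> bool:
    """Recompute the hash chain over a JSON Lines audit log in which each
    record stores its predecessor's hash in 'prev_hash' and its own in
    'entry_hash'. Returns False as soon as the chain is broken."""
    previous = "genesis"
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            stored = record.pop("entry_hash")
            if record.get("prev_hash") != previous or entry_hash(record) != stored:
                return False  # an entry was modified, removed, or reordered
            previous = stored
    return True

# In the mock audit, modifying any past entry should make verify_chain()
# return False; if it still returns True, the trail is not tamper-evident.
```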
Deliverable: A mock audit report showing all tests passed, or a remediation log showing all gaps have been closed.
Common Pitfalls
Shared drives with unversioned files. If your training data lives in a shared drive where files can be overwritten without version history, you cannot demonstrate lineage or reproducibility. Move to versioned storage immediately.
Manual logs. "We keep a spreadsheet of all data processing steps." A spreadsheet can be edited retroactively. It has no guaranteed timestamps. It depends on human discipline to stay current. This does not constitute compliance evidence.
No operator tracking. If your pipeline runs as a shared service account, you cannot identify which person performed each operation. Implement individual operator authentication.
Screenshot-based evidence. Screenshots can be fabricated. Auditors know this. Machine-readable logs with cryptographic integrity verification are the standard.
Retroactive documentation. Starting to document your pipeline in July 2026 and backdating it produces evidence that clearly started in July 2026. Auditors will notice. Start now so your documentation has genuine historical depth.
The Cost of Non-Compliance
The EU AI Act's penalty structure is designed to be proportional and painful:
- Most serious violations (prohibited AI practices): up to 35 million euros or 7% of global annual turnover, whichever is higher.
- High-risk system violations (including inadequate data governance): up to 15 million euros or 3% of global annual turnover.
- Documentation violations (incorrect or missing information to authorities): up to 7.5 million euros or 1% of global annual turnover.
For a company with 500 million euros in annual revenue, a documentation violation alone could mean a 7.5 million euro fine. A data governance violation could reach 15 million euros.
Beyond fines, non-compliant AI systems can be ordered to be withdrawn from the EU market. For companies that rely on AI-driven services for EU customers, this is an existential operational risk.
Start This Week
The sprint plan above is aggressive but achievable for organizations that commit resources now. The biggest risk is not technical complexity — it is delay. Every week of inaction compresses the remaining timeline and increases the risk of arriving at August 2 with gaps.
If you are reading this on March 15 and have not started, your first action should be the AI system inventory (Month 1, Weeks 1-2), completed within the next 10 business days, followed immediately by the gap analysis. Everything else follows from knowing what you have and what you are missing.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
Further Reading
- EU AI Act Compliance Timeline 2026 — The full timeline of EU AI Act deadlines and what becomes enforceable at each stage.
- EU AI Act Article 30 Documentation Checklist — Detailed checklist for Article 30 technical documentation requirements.
- The Enterprise AI Audit Trail Gap — Why most enterprise AI pipelines fail audit readiness assessments and how to fix it.