    Data Lineage Is Now a Legal Requirement — Are You Ready?


    The EU AI Act makes data lineage mandatory for high-risk AI systems. Most enterprise pipelines have lineage gaps at every tool boundary. Here's what needs to change.

Ertas Team

    Data lineage — the ability to trace any piece of training data from its final form back through every transformation to its original source — has always been a best practice. Under the EU AI Act, it's now a legal obligation for high-risk AI systems.

    This isn't a theoretical concern. If a regulator asks how a specific training example ended up in your dataset, you need to be able to show the complete chain: where it came from, how it was cleaned, who labeled it, what quality checks it passed, and when all of this happened.

    Most enterprise data pipelines can't do this.

    What Data Lineage Means in Practice

    Data lineage for AI training data is the recorded history of every transformation a data point undergoes from source to training-ready format. A complete lineage record for a single training example might look like:

    1. Source: contract_2024_0847.pdf, page 12, paragraph 3
    2. Ingested: 2026-01-15 09:23:41 by OCR engine v3.2, confidence 0.94
    3. Cleaned: 2026-01-15 09:24:02, duplicate check passed, quality score 0.87
    4. PII redacted: 2026-01-15 09:24:03, 2 entities detected (party names), replaced with placeholders
    5. Labeled: 2026-01-18 14:12:33 by Senior Attorney (operator ID: A-0041), label: "indemnification_clause", confidence: high
    6. Quality reviewed: 2026-01-20 10:05:17 by ML Lead (operator ID: ML-003), confirmed
    7. Exported: 2026-01-22 16:00:00, dataset v2.3, format JSONL, record #4,291
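A record like the one above is straightforward to represent as structured data. Here is a minimal sketch in Python (the class and field names are illustrative, not a standard schema):

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class LineageEvent:
    """One step in a training example's history."""
    stage: str          # e.g. "ingested", "cleaned", "labeled"
    timestamp: str      # ISO 8601, UTC
    operator: str       # tool version or operator ID
    details: dict = field(default_factory=dict)

@dataclass
class LineageRecord:
    source: str                           # original document reference
    events: list = field(default_factory=list)

    def add(self, stage, timestamp, operator, **details):
        self.events.append(LineageEvent(stage, timestamp, operator, details))

record = LineageRecord(source="contract_2024_0847.pdf, page 12, paragraph 3")
record.add("ingested", "2026-01-15T09:23:41Z", "OCR engine v3.2", confidence=0.94)
record.add("labeled", "2026-01-18T14:12:33Z", "A-0041",
           label="indemnification_clause", confidence="high")

# One JSONL line per training example, ready for export or audit review
line = json.dumps(asdict(record))
```

The point of the sketch is that every event carries its own timestamp and operator, so the full chain survives serialization intact.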

    This is what complete lineage looks like. Now consider what most enterprise pipelines actually capture.

    Where Lineage Breaks

    In a typical multi-tool data pipeline, lineage breaks at every boundary between tools:

    Ingestion → Cleaning boundary: Docling extracts text from PDFs. The output goes to a Python script for cleaning. The script processes the text but doesn't record which Docling output file each cleaned record came from, or what the cleaning script changed.

    Cleaning → Labeling boundary: Cleaned data is uploaded to Label Studio. Label Studio records who labeled what, but doesn't know the cleaning history. If a record was modified during cleaning, that context is lost.

    Labeling → Quality scoring boundary: Labeled data is exported from Label Studio and fed to Cleanlab for quality scoring. Cleanlab flags issues, but the operator who resolves them does so in a separate process — the resolution isn't linked back to the original labeling decision.

    Quality → Export boundary: Final data is assembled by a Python script that selects records meeting quality thresholds. The selection criteria and the specific records included/excluded are determined by code, but the decision isn't logged in a format a regulator could review.

    Each of these boundaries is a lineage gap. Individually, they seem minor. Collectively, they mean you can't trace a training example back to its source.
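To make one of these gaps concrete: the ingestion-to-cleaning boundary closes only if the cleaning step carries a record-level identifier forward instead of dropping it. A minimal sketch of that idea, with hypothetical function and field names:

```python
import hashlib

def make_record_id(source_file: str, position: str) -> str:
    """Deterministic ID derived from the source reference, so every
    downstream tool can refer to the same record."""
    return hashlib.sha256(f"{source_file}:{position}".encode()).hexdigest()[:16]

def clean(record: dict) -> dict:
    """Cleaning step that preserves lineage fields across the tool
    boundary instead of emitting bare text."""
    cleaned_text = " ".join(record["text"].split())  # collapse whitespace
    return {
        "record_id": record["record_id"],       # preserved across the boundary
        "source_file": record["source_file"],   # preserved
        "text": cleaned_text,
        "cleaning": {
            "op": "whitespace_normalize",
            "changed": cleaned_text != record["text"],
        },
    }

raw = {
    "record_id": make_record_id("contract_2024_0847.pdf", "p12.par3"),
    "source_file": "contract_2024_0847.pdf",
    "text": "Indemnification.   The  Supplier shall hold harmless...",
}
cleaned = clean(raw)
```

If each tool in the chain adopted this convention, the boundaries would stop being gaps. In practice, off-the-shelf tools don't, which is why the gaps persist.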

    Why This Matters Now

    Before the EU AI Act, lineage gaps were a quality problem. Teams that couldn't trace data issues back to their source had harder debugging sessions. But there were no legal consequences.

Under Article 10, data governance practices must cover the full preparation pipeline. Under Article 11 and Annex IV, technical documentation must include information about data sources, collection methodology, and preparation methods. Together, these provisions require that you can demonstrate how your training data was produced, not just assert it.

    When a market surveillance authority asks for your technical documentation, "we cleaned the data with a Python script" isn't an answer. They'll want to see the logs.

    The Structural Problem

    Lineage gaps aren't caused by careless engineering. They're caused by architecture. When your pipeline is composed of independent tools, each tool only knows about its own operations. No tool has a complete view of the pipeline, so no tool can provide complete lineage.

    You can patch this with custom logging — writing a wrapper that records inputs and outputs at each stage and stores them in a central database. But this approach is fragile:

    • Every tool update risks breaking the wrapper
    • Custom logging code is rarely maintained to the same standard as production code
    • Log formats differ between tools, requiring normalization
    • Timestamp synchronization across tools is surprisingly hard to get right
    • The logging infrastructure itself becomes another system to maintain
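For illustration, a retrofit wrapper of the kind described above might look like this sketch (the decorator, the in-memory log store, and the field names are all hypothetical stand-ins for a real central database):

```python
import functools
import time

AUDIT_LOG = []  # stand-in for a central audit database table

def audited(stage):
    """Decorator that records each stage's input and output in one
    central log -- the retrofit approach described above."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(record):
            entry = {
                "stage": stage,
                "record_id": record["record_id"],
                "timestamp": time.time(),
                "input_text": record["text"],
            }
            out = fn(record)
            entry["output_text"] = out["text"]
            AUDIT_LOG.append(entry)
            return out
        return inner
    return wrap

@audited("cleaning")
def clean(record):
    return {**record, "text": record["text"].strip().lower()}

clean({"record_id": "r-001", "text": "  INDEMNIFICATION CLAUSE  "})
```

Even this toy version shows the maintenance burden: every stage function must be wrapped by hand, the log schema is yours to evolve, and nothing enforces that a new pipeline step remembers to use the decorator.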

    What Complete Lineage Requires

    To satisfy the EU AI Act's lineage requirements, your pipeline architecture needs:

    1. Single audit log: All operations recorded in one system, not scattered across tool-specific logs
    2. Record-level tracking: Lineage at the individual data point level, not just batch-level summaries
    3. Operator attribution: Who performed or approved each operation, with verifiable identity
    4. Immutable records: Audit logs that can't be modified after the fact
    5. Exportable format: Lineage data that can be presented to regulators in a readable format
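The immutability requirement in particular has a well-known implementation pattern: hash-chaining, where each log entry includes the hash of the previous one, so any after-the-fact edit breaks the chain. A minimal sketch, not a production implementation:

```python
import hashlib
import json

class AppendOnlyLog:
    """Tamper-evident audit log: each entry commits to the hash of the
    previous entry, so retroactive edits are detectable."""

    def __init__(self):
        self.entries = []

    def append(self, event: dict) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        payload = json.dumps(event, sort_keys=True)
        h = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        self.entries.append({"event": event, "prev": prev_hash, "hash": h})
        return h

    def verify(self) -> bool:
        """Recompute the chain; any modified entry breaks it."""
        prev = "0" * 64
        for e in self.entries:
            payload = json.dumps(e["event"], sort_keys=True)
            expected = hashlib.sha256((prev + payload).encode()).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True

log = AppendOnlyLog()
log.append({"stage": "ingested", "record_id": "r-001"})
log.append({"stage": "labeled", "record_id": "r-001", "operator": "A-0041"})
ok_before = log.verify()                       # chain intact
log.entries[0]["event"]["stage"] = "edited"    # simulate tampering
ok_after = log.verify()                        # chain broken
```

Production systems typically anchor the chain in write-once storage or sign it, but the principle is the same: the log proves its own integrity.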

    This is fundamentally easier when the entire pipeline runs in a single system. Platforms like Ertas Data Suite maintain lineage as a core architectural feature — every stage shares the same logging infrastructure, so there are no boundary gaps. The lineage record for any exported training example traces back through every transformation to the original source file, automatically.

    Steps to Take

    If your current pipeline has lineage gaps, you have two options:

    Option A: Retrofit logging onto your existing tool chain. This works but requires custom engineering, ongoing maintenance, and acceptance that cross-tool lineage will always be approximate.

    Option B: Migrate to a unified pipeline that handles lineage natively. Higher upfront effort, but eliminates the structural problem permanently.

    Either way, the August 2026 deadline means this decision needs to happen soon. Data lineage isn't a nice-to-have anymore — it's the law.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 11 compliance built in.
