
AI Audit Trails: What You Need to Log and Why Regulators Will Ask for It
EU AI Act Article 30, HIPAA technical safeguards, and SR 11-7 all require AI systems to maintain detailed logs. Here's exactly what you need to capture and how.
An audit trail is not about logging everything. It's about being able to answer specific questions after the fact: what decision was made, by what system, based on what inputs, at what time, with what human oversight, and what happened next.
Regulators across the EU, US, and UK are converging on similar questions. They may use different legal language, but the core requirement is the same: if your AI made a consequential decision, you should be able to reconstruct it completely.
Most enterprise AI deployments cannot do this today. Here's what you need, and what each regulatory framework requires.
What the Regulations Actually Say
EU AI Act
The EU AI Act has three articles, plus one annex, that directly bear on logging requirements:
Article 13 (Transparency) requires that high-risk AI systems are transparent enough to allow deployers to interpret and use outputs appropriately. The system must provide interpretable outputs — not just a decision, but the basis for it.
Article 17 (Quality Management System) requires providers of high-risk AI systems to implement a quality management system that includes record-keeping procedures, data governance, and post-market monitoring. The quality management system itself must be documented and auditable.
Annex IV (Technical Documentation) specifies what must be documented: general description of the system, detailed description of design and development including training methodology and training data, monitoring and evaluation measures, and risk management measures. This documentation must be maintained and kept up to date.
Article 30 is the most specific logging requirement: providers and deployers of high-risk AI systems must keep logs automatically generated by the AI system for a period appropriate to the intended purpose, with a minimum of 10 years. The logs must be sufficient to enable post-hoc investigation of decisions.
Ten years is a long retention period. Most engineering teams think about log retention in terms of weeks or months. For AI systems classified as high-risk under the EU AI Act, the obligation is a decade.
HIPAA Technical Safeguards (45 CFR §164.312)
HIPAA's technical safeguard requirements apply to any system that creates, receives, maintains, or transmits electronic protected health information (ePHI). If your AI system touches patient data, these apply:
- Access controls: unique user identification, automatic logoff, encryption
- Audit controls: hardware, software, and procedural mechanisms that record and examine activity in information systems that contain ePHI
- Integrity controls: mechanisms to authenticate that ePHI has not been altered or destroyed
- Transmission security: encryption of ePHI in transit
The audit control requirement is the relevant one here. HIPAA does not specify exactly what to log, but HHS guidance makes clear that logs should capture who accessed what data, when, and for what purpose. Retention: 6 years from creation or last effective date.
SR 11-7 (Federal Reserve / OCC Model Risk Management)
The Federal Reserve's SR 11-7 guidance on model risk management requires that models used in banking have documentation covering:
- Model purpose and intended use
- Description of theory and logic
- Data inputs and assumptions
- Model limitations
- Validation procedures
- Ongoing performance monitoring
For AI/ML models specifically, regulators have emphasized the importance of logging model inputs, outputs, and performance metrics to enable ongoing monitoring and investigation of failures. The key principle is that independent validators must be able to reproduce model outputs — which requires complete logging of inputs and model version at time of inference.
The 8 Minimum Elements of an AI Audit Trail
These eight elements cover the minimum that any compliance-grade AI audit trail must capture. Missing any of them creates a gap that regulators will find.
1. Input Data with Integrity Hash
Log the input that was presented to the model — or a representation of it if the raw input is too large. Critically, include a cryptographic hash (SHA-256 is standard) of the input data. This allows you to later verify that the logged input matches what was actually processed. Without an integrity hash, a logged input record can be disputed.
For inputs containing ePHI, log a reference to the data record rather than the data itself — but ensure the reference is unambiguous and the hash covers the referenced content.
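A minimal sketch of what this looks like in practice (the record fields and the deterministic-serialization choice here are illustrative, not any specific library's API):

```python
import hashlib
import json

def input_hash(payload: bytes) -> str:
    """SHA-256 hex digest of the exact bytes the model processed."""
    return hashlib.sha256(payload).hexdigest()

# Serialize deterministically (sorted keys) so the same logical input
# always produces the same hash.
raw = json.dumps({"claim_id": "C-1042", "note": "patient note text"},
                 sort_keys=True).encode("utf-8")

record = {
    "input_ref": "claims/C-1042",    # reference to the ePHI, not the ePHI itself
    "input_sha256": input_hash(raw)  # hash covers the referenced content
}
```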
2. Model Version and Configuration
This is the element most commonly missing. Log the exact model version that processed the request: not "GPT-4" but the specific version, checkpoint, or model ID. Include inference configuration: temperature, top-p, max tokens, system prompt hash.
If you cannot specify the exact model version at the time of a historical inference, you cannot reconstruct what behavior the system was producing at that time. This is a critical gap for any regulatory review.
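As a sketch, the version block of an audit record might look like the following; the field names and the pinned model ID are illustrative:

```python
import hashlib

SYSTEM_PROMPT = "You are a claims triage assistant. Classify each claim."

record = {
    "model_id": "claims-triage-v4.2.1",  # exact version or checkpoint, never an alias
    "inference_config": {
        "temperature": 0.0,
        "top_p": 1.0,
        "max_tokens": 512,
    },
    # Hashing the system prompt proves which instructions were active
    # without storing the full prompt text in every record.
    "system_prompt_sha256": hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest(),
}
```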
3. Output with Confidence or Probability Where Available
Log the full model output. For classification tasks, log the confidence score or probability distribution, not just the top prediction. A binary classification output of "approved" is much less useful than "approved (0.73 confidence)" — the latter tells you whether this was a confident or borderline decision.
For generative outputs, log the full text. Storage is cheap. Being unable to produce the exact output that drove a downstream action during a regulatory inquiry is expensive.
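A small sketch, assuming a classifier that returns a probability per label (field names illustrative):

```python
def output_fields(probs: dict[str, float]) -> dict:
    """Log the full distribution, not just the argmax."""
    top = max(probs, key=probs.get)
    return {
        "output_label": top,
        "output_confidence": probs[top],
        "output_distribution": probs,  # preserves how borderline the call was
    }

print(output_fields({"approved": 0.73, "denied": 0.21, "refer": 0.06}))
# {'output_label': 'approved', 'output_confidence': 0.73, 'output_distribution': {...}}
```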
4. Timestamp in UTC
Log UTC timestamps, not local time. Regulatory investigations often cross timezone boundaries. UTC with millisecond precision eliminates ambiguity. Ensure your logging infrastructure has NTP synchronization — timestamp integrity matters.
Log both the time the request was received and the time the response was returned. Latency data can be relevant for performance investigations.
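In Python, for example, millisecond-precision UTC timestamps are a one-liner (the record fields are illustrative):

```python
from datetime import datetime, timezone

def utc_now() -> str:
    """Millisecond-precision UTC, e.g. '2025-03-05T14:02:11.482+00:00'."""
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")

record = {"received_at": utc_now()}
# ... inference runs here ...
record["returned_at"] = utc_now()  # the pair also yields per-request latency
```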
5. Acting User or System Identity
Who or what triggered this inference? Log the authenticated user ID for human-initiated requests, or the system/service identifier for automated pipeline requests. This enables access pattern analysis and identifies which users or systems were involved in a decision under review.
Do not log shared credentials. Every actor in your AI pipeline should have a unique, auditable identity.
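One way to keep identity unambiguous is to type the actor explicitly. A sketch; the Principal shape is illustrative:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Principal:
    kind: str  # "user" for human-initiated, "service" for pipeline requests
    id: str    # authenticated identity, never a shared account

record = {"actor": asdict(Principal(kind="service", id="claims-pipeline-prod"))}
```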
6. Human Review Decision Where HITL Applies
If your system includes human-in-the-loop review — a human who reviews AI outputs before they drive consequential actions — log the review outcome explicitly. Who reviewed it, when, what decision they made, and whether they overrode the AI recommendation.
Human review is often what regulators are most interested in for high-stakes decisions. "The AI flagged it as high-risk" is incomplete without "and a licensed professional reviewed and agreed/disagreed."
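A sketch of an explicit review event, linked back to the inference it covers (field names illustrative):

```python
review_event = {
    "inference_id": "inf-8812",        # joins to the AI inference record
    "reviewer_id": "user:j.okafor",    # an authenticated individual, not a role
    "reviewed_at": "2025-03-05T14:09:31.004+00:00",
    "ai_recommendation": "high_risk",
    "reviewer_decision": "high_risk",  # agreement; a differing value is an override
}
```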
7. Downstream Action Taken
Log what happened as a result of the AI output. A classification is meaningless in isolation — what did your system do with it? Log the downstream action: claim approved, application flagged for review, document routed to department X, alert sent to Y.
This closes the loop between AI decision and real-world consequence. It's what allows you to answer "what did the system do for patient 12345 on March 5th?"
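Logged as its own event, again keyed to the inference (a sketch with illustrative fields):

```python
action_event = {
    "inference_id": "inf-8812",
    "action": "claim_flagged_for_review",  # the real-world consequence
    "target": "claims/C-1042",             # e.g. patient, claim, or application ID
    "actioned_at": "2025-03-05T14:09:32.118+00:00",
}
```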
8. Any Override or Escalation
When a human overrides an AI decision, or when an exception process is triggered, log it explicitly as an override event. Include the reason if your workflow captures it. This data is valuable both for regulatory purposes and for model improvement — systematic overrides indicate where the model is miscalibrated.
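Overrides can be logged as first-class events and then mined for that calibration signal. A sketch, assuming the event shapes above:

```python
from collections import Counter

override_event = {
    "inference_id": "inf-8812",
    "event_type": "override",
    "overridden_by": "user:j.okafor",
    "original_decision": "high_risk",
    "final_decision": "low_risk",
    "reason": "Duplicate claim; already adjudicated.",
}

def override_hotspots(events: list[dict]) -> Counter:
    """Count overrides by the AI's original decision; systematic
    clusters point at where the model is miscalibrated."""
    return Counter(e["original_decision"]
                   for e in events if e.get("event_type") == "override")
```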
The Lineage Gap
Most teams that have thought about this problem have input and output logging covered. The gap is in the middle: the transformation pipeline.
Your AI output is not just a function of the raw user input. It is also a function of retrieval results, preprocessing steps, context assembly, prompt templates, and system instructions, any of which may be missing from your logs.
EU AI Act Article 30 requires documentation of the entire pipeline, not just inputs and outputs. If your AI system involves retrieval-augmented generation, the retrieved documents are part of the input that determined the output. If preprocessing normalizes or transforms the input, that transformation is part of the lineage.
Map every transformation step between raw input and model call, and log each one. This is harder than logging the edges — but it's what regulators are looking for when they investigate a specific decision.
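A minimal sketch of step-level lineage logging, assuming each transformation is a named function (the step names and trail format are illustrative, not a specific framework):

```python
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def run_pipeline(raw: bytes, steps, trail: list) -> bytes:
    """Run each transformation and log input/output hashes per step,
    so any intermediate artifact can be verified later."""
    data = raw
    for name, fn in steps:
        out = fn(data)
        trail.append({"step": name,
                      "input_sha256": sha256(data),
                      "output_sha256": sha256(out)})
        data = out
    return data

trail: list = []
prompt = run_pipeline(
    b"  Raw user input  ",
    steps=[
        ("normalize", lambda b: b.strip().lower()),
        ("assemble_prompt", lambda b: b"SYSTEM: triage assistant\nUSER: " + b),
    ],
    trail=trail,
)
```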
What Regulators Actually Look At in an Audit
Regulators conducting an AI audit do not read every log entry. They sample, and they ask specific questions.
The pattern is: a specific decision is under review (a denied claim, a flagged transaction, a high-risk classification). The regulator wants to reconstruct that decision completely. They will ask for the record of that specific inference — the input, the model version, the output, the human review, the downstream action. Then they will check completeness: are all 8 elements present? Are the timestamps consistent? Is the model version documented? Is there evidence of human oversight?
If any element is missing for a specific decision under review, that is a finding. If the audit trail cannot confirm what model version was running on a specific date, that is a finding. If human review was required by policy but is not documented in the log, that is a finding.
The practical implication: your audit trail infrastructure needs to make individual record lookup fast, and it needs to ensure completeness at write time — not as a periodic batch check.
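A sketch of write-time enforcement: the record is rejected before it reaches storage if any element is absent. Field names are illustrative; elements that do not apply (no HITL, no override) should be explicit nulls, never missing keys:

```python
REQUIRED = {
    "input_sha256", "model_id", "output", "received_at",
    "actor", "human_review", "downstream_action", "override",
}

def write_audit_record(record: dict, sink: list) -> None:
    """Refuse incomplete records at write time rather than
    discovering gaps in a periodic batch check."""
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"audit record incomplete: missing {sorted(missing)}")
    sink.append(record)
```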
Storage and Retention
Retention requirements vary by framework:
- HIPAA: 6 years from creation or last effective date
- EU AI Act (high-risk systems): 10 years minimum
- SR 11-7: No explicit period stated, but bank examination cycles suggest 5-7 years in practice
- FDA SaMD: Consistent with the product's lifecycle, typically the longer of 2 years or product lifetime
Design for the longest applicable period in your regulatory context. Tiered storage (hot for recent records, cold for older) manages costs while maintaining accessibility. Ensure cold storage is indexed for specific-record retrieval — bulk archive storage that requires a full restore to query is not audit-ready.
Ertas Data Suite: Audit Logging Built In
For AI data preparation pipelines — the upstream work that produces training data, fine-tuning datasets, and labeled corpora — Ertas Data Suite logs every transformation step with timestamp, operator ID, and a full record of the operation applied. Every ingest, clean, label, augment, and export action is part of an immutable audit chain.
The platform exports EU AI Act Article 30-compliant technical documentation directly. For regulated enterprises where the data preparation pipeline is itself subject to audit, this means the lineage is captured by default — not reconstructed after the fact.
Audit trails are not something you add after you've built the system. They need to be designed in from the start. The cost of retrofitting comprehensive logging into a production AI system is consistently higher than building it correctly the first time — and the cost of missing it during a regulatory inquiry is higher still.