    What Is an AI Model Card? And Why the EU AI Act Makes Them Non-Optional
    model-card · ai-governance · eu-ai-act · model-documentation · responsible-ai

    Model cards document what an AI system was trained on, what it's good at, where it fails, and who it was tested with. The EU AI Act's Annex IV makes this documentation a legal requirement.

    Ertas Team

    A model card is a structured document that describes an AI model — what it was designed to do, how it was trained, what data it used, how it performs across different groups and tasks, and what its known limitations are.

    The concept was introduced in a 2019 paper by Margaret Mitchell and colleagues at Google. In the seven years since, model cards have moved from an academic proposal to an industry practice to, in the EU, a legal requirement. This article explains what a model card contains, why it matters operationally, and what the EU AI Act's Annex IV means for organizations deploying AI in regulated contexts.

    Why Model Cards Matter Operationally

    The compliance case for model cards is real, but the operational case is more immediately persuasive.

    The ML engineer who trained the model will not be there in 18 months. The vendor who provided the model may not be there in 3 years. The business context in which the model was deployed will evolve. When something goes wrong — or when someone needs to audit, update, or replace the model — the institutional knowledge that existed at training time will be gone.

    A model card is the institutional memory that allows someone to understand the model without starting from scratch. It answers the questions that an incident investigation or a compliance audit will ask: what was this model trained on? What population was it evaluated against? What failure modes were identified? What did the original team know about where this model should and should not be used?

    Without a model card, these questions are answered through reconstruction — which takes time, is incomplete, and may be impossible if the original team is no longer available. With a model card, they are answered in minutes.

    What a Model Card Should Contain

    1. Model Details

    Name, version, model type (classification, regression, generative, etc.), owning team or organization, date trained, date last updated, license, contact information for the responsible team.

    This is the header information that allows someone to identify what they are looking at and who to call.

    2. Intended Use

    What the model was designed to do. What it was explicitly not designed to do. Who the intended users are. What deployment contexts the model was designed for.

    The "not designed to do" section is as important as the intended use. A model designed to assist customer service agents is not designed to make autonomous decisions. A model trained on English-language inputs is not designed for multilingual deployment. Documenting explicit exclusions reduces the probability of inappropriate deployment.

    3. Training Data

    Source datasets, date range covered, preprocessing and cleaning steps applied, known biases or gaps in the training data, total training set size.

    This section is the foundation for any downstream analysis of why the model behaves as it does. If the model exhibits bias, the investigation will start here. If the model performs poorly on a specific population, the training data section will show whether that population was represented.

    4. Evaluation Data

    What datasets were used to evaluate the model. How the evaluation data was collected, and how it differs from the training data. The rationale for using these specific evaluation sets.

    Evaluation data and training data should not be identical — a model evaluated only on its training distribution tells you very little about how it will perform in production. Document the gap.
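As a concrete illustration, overlap between the two sets can be checked mechanically. A minimal Python sketch, assuming each record carries a stable identifier (the field names are illustrative, not a standard):

```python
# Sketch: measure leakage between training and evaluation sets.
# Assumes records can be keyed on a stable identifier.

def evaluation_overlap(train_ids, eval_ids):
    """Return the fraction of evaluation records also present in training."""
    train_set = set(train_ids)
    shared = [rid for rid in eval_ids if rid in train_set]
    return len(shared) / len(eval_ids) if eval_ids else 0.0

# 1 of 2 evaluation records appears in training -> 0.5
overlap = evaluation_overlap(["a", "b", "c"], ["c", "d"])
```

A nonzero result is not automatically a problem, but it belongs in the model card so a reader can judge how much the reported metrics say about production behavior.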

    5. Performance Metrics

    Accuracy, precision, recall, F1, AUC, or whatever metrics are relevant to the model's task. These should be reported on the evaluation set, not the training set.

    Critically: performance metrics should be broken down by demographic group, geographic region, or other segments relevant to the use case. A model with 94% aggregate accuracy may have 82% accuracy for a specific demographic group. Aggregate metrics alone are not sufficient for any model that makes decisions affecting people.
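Computing the breakdown is straightforward once each prediction is labeled with its segment. A minimal Python sketch (the record layout is an assumption for illustration):

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred) tuples.

    Returns per-group accuracy plus the aggregate, so the gap between
    a strong overall number and a weak segment is visible side by side.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        hits[group] += int(y_true == y_pred)
    report = {g: hits[g] / totals[g] for g in totals}
    report["aggregate"] = sum(hits.values()) / sum(totals.values())
    return report
```

The same shape works for precision, recall, or any other metric; the point is that the per-segment numbers are reported next to the aggregate, not instead of it.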

    6. Known Limitations

    What types of inputs the model handles poorly. What errors it tends to make. What conditions cause performance to degrade. What the model was never tested on.

    This is the section most often left thin in practice — organizations are reluctant to document limitations in a way that might be used against them. This reasoning is backward. Documented limitations allow deployers to design appropriate human oversight. Undocumented limitations are discovered through production failures. The litigation risk is substantially higher in the latter case.

    7. Ethical Considerations

    Potential harms from model errors or misuse. Mitigation measures in place. Groups that may be disproportionately affected by model errors. Who was consulted — internally and externally — in the model's development.

    8. Caveats and Recommendations

    What downstream deployers should know before deploying the model. Required human oversight. Environmental conditions required for performance to hold. Recommended monitoring practices.
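Taken together, the eight sections can be represented as a structured record rather than free-form prose, which makes completeness checkable. A minimal Python sketch, with illustrative field names rather than any standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    # 1. Model details
    name: str
    version: str
    model_type: str
    owner: str
    contact: str
    # 2. Intended use (including explicit exclusions)
    intended_use: str
    out_of_scope_use: str
    # 3-4. Training and evaluation data
    training_data: str
    evaluation_data: str
    # 5. Performance metrics, broken down by segment
    metrics_by_segment: dict = field(default_factory=dict)
    # 6-8. Limitations, ethics, deployment guidance
    known_limitations: list = field(default_factory=list)
    ethical_considerations: list = field(default_factory=list)
    recommendations: list = field(default_factory=list)
```

A structured card can be validated in CI: a model version whose card has an empty `known_limitations` or `metrics_by_segment` can be blocked from release.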

    EU AI Act Annex IV: Technical Documentation

    For high-risk AI systems under the EU AI Act — which includes AI used in employment, credit, education, law enforcement, migration, and other specified domains — Annex IV specifies a mandatory technical documentation requirement. That documentation covers:

    1. General description of the AI system
    2. Description of the system's components and development process
    3. Detailed information about the training data
    4. Description of human oversight measures
    5. Description of the system's capabilities and limitations
    6. Description of risk management processes
    7. Description of changes made throughout the lifecycle
    8. Standards applied
    9. EU declaration of conformity
    10. Information for the deployer
    11. Description of the post-market monitoring system

    Elements 1 through 6 map directly to model card fields. For high-risk systems, the model card is not a best practice artifact — it is a legal document. An incomplete model card means a non-compliant system.
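That mapping can itself be checked mechanically. A hedged Python sketch, using the article's own reading of which Annex IV elements correspond to card fields (the mapping is illustrative, not legal advice):

```python
# Which model card section is expected to cover each Annex IV element.
# Section names match the illustrative card structure; real mappings
# should be reviewed with counsel.
ANNEX_IV_TO_CARD = {
    "general description": "model_details",
    "components and development process": "model_details",
    "training data": "training_data",
    "human oversight measures": "recommendations",
    "capabilities and limitations": "known_limitations",
    "risk management processes": "ethical_considerations",
}

def missing_annex_iv_coverage(card_sections):
    """Return Annex IV elements whose mapped card section is absent or empty."""
    return [element for element, section in ANNEX_IV_TO_CARD.items()
            if not card_sections.get(section)]
```

Running this against a card with only header and training data sections immediately surfaces the gaps an auditor would find later.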

    EU AI Act Article 13: Transparency Obligations

    Article 13 requires that high-risk AI systems be sufficiently transparent that deployers and users can understand the system's capabilities and limitations. This includes:

    • Information about the system's intended purpose
    • The level of accuracy, robustness, and cybersecurity achieved
    • Any known or foreseeable circumstances that may affect accuracy and performance
    • Appropriate human oversight measures

    A provider cannot pass this information to deployers if the provider does not have it. Article 13 transparency requires, as a prerequisite, that the provider has documented this information in the first place. The model card is that documentation.

    Model Cards for Fine-Tuned Models

    When you fine-tune a base model, you need a model card for the fine-tuned version. A link to the base model's model card is not sufficient.

    The fine-tuned model card should document:

    • What the fine-tuning dataset contained and what it was intended to train
    • What capability changed as a result of fine-tuning, and what the intent was
    • How performance on the base model's benchmark tasks changed — including benchmarks the fine-tuning did not target. Fine-tuning on a specific domain can degrade performance on general tasks.
    • What evaluation was run on the fine-tuned version specifically
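The benchmark comparison in particular lends itself to automation. A minimal Python sketch; the tolerance and score layout are assumptions:

```python
def benchmark_deltas(base_scores, tuned_scores, tolerance=0.01):
    """Compare base vs fine-tuned benchmark scores.

    Flags regressions beyond `tolerance` -- including on benchmarks the
    fine-tuning did not target -- and flags benchmarks that were never
    re-run on the fine-tuned model, which is itself a documentation gap.
    """
    report = {}
    for bench, base in base_scores.items():
        tuned = tuned_scores.get(bench)
        if tuned is None:
            report[bench] = "not re-evaluated"
        elif base - tuned > tolerance:
            report[bench] = f"regressed: {base:.3f} -> {tuned:.3f}"
        else:
            report[bench] = "ok"
    return report
```

The output is exactly what the fine-tuned model card needs: which capabilities moved, which degraded, and which were never measured again.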

    This matters because the fine-tuned model is a different model from the base. Its behavior is the result of the base training and the fine-tuning combined. Documenting only the base model's characteristics leaves the fine-tuning's contributions — and its introduced limitations — undocumented.

    The Audit Trail Connection

    A model card is a point-in-time document. It describes the model as it existed when the card was written. Models change: they are retrained, updated, fine-tuned, and sometimes reverted. A model card that accurately described version 1.0 may be misleading for version 1.3.

    The model card needs to be maintained alongside the model. For organizations managing multiple model versions, this means a versioned model card process: each model version has a corresponding model card version, and the history of changes is tracked.
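A minimal sketch of such a versioned registry in Python; the interface is illustrative, and a real deployment would back it with durable storage:

```python
class ModelCardRegistry:
    """One card per model version, with the version history retained."""

    def __init__(self):
        self._cards = {}  # model_name -> {version: card}

    def publish(self, model_name, version, card):
        """Record the card that describes this specific model version."""
        self._cards.setdefault(model_name, {})[version] = card

    def card_for(self, model_name, version):
        """The card for exactly this version -- never a stale predecessor."""
        return self._cards[model_name][version]

    def history(self, model_name):
        """All documented versions, so changes can be traced over time."""
        return sorted(self._cards.get(model_name, {}))
```

The invariant worth enforcing is the pairing: a model version without a matching card version should fail to deploy.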

    The model inventory — a registry of all AI models deployed in the organization, their versions, and their operational status — is the infrastructure that makes this tractable. The model card is the per-model documentation; the inventory is the organization-level view.

    Data Lineage as a Foundation for Model Cards

    The "Training data" and "Evaluation data" sections of a model card are only as accurate as your data pipeline documentation. If you cannot trace what data went into a model — what its sources were, what preprocessing was applied, what was excluded and why — you cannot write an accurate model card.

    Organizations that process training data through a structured pipeline with logged transformations, operator IDs, and timestamps have this documentation automatically. The data lineage record is the source of truth for the training data section of the model card. Organizations that assemble training data informally — downloading files, running scripts without logging, applying transformations in notebooks — typically cannot reconstruct an accurate training data narrative.
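The logged-transformation approach can be as simple as appending structured records as the pipeline runs. A minimal Python sketch with illustrative field names:

```python
from datetime import datetime, timezone

def log_transformation(lineage, dataset, step, operator_id):
    """Append one lineage record: what was done, to what, by whom, when."""
    lineage.append({
        "dataset": dataset,
        "step": step,
        "operator_id": operator_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

lineage = []
log_transformation(lineage, "support_tickets_v2", "pii_redaction", "u-142")
log_transformation(lineage, "support_tickets_v2", "dedup", "u-142")
# `lineage` now doubles as the evidence base for the card's training
# data section: sources, transformations, and who applied them.
```

When the card is written later, the training data narrative is assembled from these records rather than from memory.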

    This is one of the practical arguments for structured data infrastructure before model training, not after. By the time you need the model card, the training data documentation either exists or it doesn't.

    For broader governance context, see AI Model Governance in Production. For the audit trail infrastructure that underpins model card data lineage, see Enterprise AI Audit Trails: What to Log. For the EU AI Act's high-risk system requirements in detail, see EU AI Act High-Risk System Requirements.


    Book a discovery call with Ertas →

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
