What is Data Versioning?

    The practice of tracking and managing different versions of datasets over time, enabling reproducibility, rollback, and auditability in machine learning workflows.

    Definition

    Data versioning applies the principles of version control (familiar from software development with Git) to machine learning datasets. Just as Git tracks changes to code files over time, data versioning systems track changes to datasets — recording what data was added, modified, or removed between versions, who made the changes, and when. This creates a complete history of dataset evolution that supports reproducibility, collaboration, and compliance.

    In ML workflows, data versioning is critical because model behavior is a function of both code and data. A model trained on dataset version 1.0 may behave very differently from one trained on version 1.3 — even with identical training code. Without data versioning, teams cannot reliably reproduce past results, trace quality regressions to data changes, or roll back to a previous dataset version when a new batch introduces problems.

    Data versioning tools include DVC (Data Version Control), which extends Git to handle large data files; LakeFS, which provides Git-like branching for data lakes; Delta Lake, which adds ACID transactions and versioning to data stored in cloud object stores; and Hugging Face Datasets, which provides versioned dataset hosting. Each approach has different trade-offs between storage efficiency, scalability, and integration with existing ML pipelines.
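
    For instance, DVC exposes a small Python API for reading a file exactly as it existed at a given Git revision. A minimal sketch, assuming a DVC-tracked repository; the repository URL, tracked path, and tag below are hypothetical placeholders:

        import dvc.api

        # Open the dataset exactly as it existed at Git tag "v2.6".
        # The repo URL, tracked path, and tag are hypothetical placeholders.
        with dvc.api.open(
            "data/feedback.csv",                         # file tracked by DVC
            repo="https://github.com/example/feedback",  # Git repo with DVC metadata
            rev="v2.6",                                  # any tag, branch, or commit SHA
        ) as f:
            print(f.readline())  # header row of that specific version

    Because rev accepts any Git revision, pinning a model run to a dataset version reduces to recording a tag or commit SHA alongside the model.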

    Why It Matters

    Reproducibility is a foundational requirement for production ML. When a model produces unexpected behavior, teams need to answer: 'What exactly was it trained on?' Without data versioning, this question is often unanswerable — the training data may have been modified, overwritten, or lost since the model was trained. Data versioning ensures that every model can be traced back to the exact dataset version that produced it.

    For compliance, data versioning provides the audit trail that regulators require. GDPR's right to erasure requires organizations to track which data was used to train which models. The EU AI Act mandates documentation of training data. Data versioning makes these requirements tractable by maintaining a complete history of what data existed at each point in time and which models were trained on which versions.

    How It Works

Data versioning systems typically work by storing dataset snapshots or deltas (changes between versions). Snapshot-based systems store a complete copy of the dataset at each version, which is simple but storage-intensive. Delta-based systems store only the changes between versions, which is more storage-efficient but means a complete dataset must be reconstructed by replaying the chain of deltas.
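
    As a toy illustration of the delta approach (not any particular tool's on-disk format), treat a dataset as a mapping from record IDs to rows, and each delta as the upserts and deletions made relative to its parent version:

        from typing import Dict, List

        Dataset = Dict[str, dict]  # record ID -> row

        class Delta:
            """Changes relative to the previous version (toy format)."""
            def __init__(self, upserts: Dataset, deletions: List[str]):
                self.upserts = upserts      # records added or modified, keyed by ID
                self.deletions = deletions  # record IDs removed in this version

        def reconstruct(base_snapshot: Dataset, deltas: List[Delta]) -> Dataset:
            """Rebuild a full dataset by replaying deltas over the base snapshot."""
            dataset = dict(base_snapshot)
            for delta in deltas:
                dataset.update(delta.upserts)
                for record_id in delta.deletions:
                    dataset.pop(record_id, None)
            return dataset

        v1 = {"a": {"text": "great", "label": "pos"},
              "b": {"text": "meh", "label": "neg"}}
        d2 = Delta(upserts={"c": {"text": "ok", "label": "pos"}}, deletions=[])
        d3 = Delta(upserts={"b": {"text": "meh", "label": "neutral"}}, deletions=["a"])

        print(reconstruct(v1, [d2, d3]))  # version 3: records "b" (relabeled) and "c"

    The trade-off is visible here: reading version N costs one snapshot load plus N-1 delta replays, which is why many delta-based systems periodically write fresh snapshots to cap reconstruction time.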

Most systems use content-addressable storage, where data is stored by the hash of its contents rather than by filename. When a file changes, only the new version is stored; unchanged files are shared across versions through hash references. This deduplication means storage grows with the amount of data that actually changes rather than linearly with the number of versions. Metadata such as version labels, timestamps, author information, and descriptions is stored separately and can be browsed without loading the data itself.
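
    The following sketch shows the core idea using nothing but hashlib and the filesystem; the store directory and manifest format are invented for illustration:

        import hashlib
        from pathlib import Path

        STORE = Path("cas_store")  # hypothetical local object store
        STORE.mkdir(exist_ok=True)

        def put(data: bytes) -> str:
            """Store a blob under its content hash; identical blobs are written once."""
            digest = hashlib.sha256(data).hexdigest()
            blob = STORE / digest
            if not blob.exists():  # unchanged content is never duplicated
                blob.write_bytes(data)
            return digest

        def get(digest: str) -> bytes:
            return (STORE / digest).read_bytes()

        # A dataset "version" is just a manifest mapping filenames to content hashes.
        v1 = {"train.csv": put(b"id,label\n1,pos\n"),
              "test.csv":  put(b"id,label\n2,neg\n")}
        v2 = {"train.csv": put(b"id,label\n1,pos\n3,neg\n"),  # changed: new blob stored
              "test.csv":  put(b"id,label\n2,neg\n")}         # unchanged: same hash

        print(v1["test.csv"] == v2["test.csv"])  # True: the blob is shared by reference

    Two dataset versions here cost one extra blob rather than a full copy, and the manifests themselves are the lightweight metadata layer that can be browsed without touching the blobs.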

    Example Use Case

    A team maintains a customer feedback dataset that receives weekly updates. After updating to version 2.7 and retraining, model performance drops by 8%. Using data versioning, they diff version 2.7 against 2.6 and discover that the latest batch contained 500 mislabeled examples from a new annotation vendor. They roll back to version 2.6, retrain, confirm performance is restored, then work with the vendor to fix the labeling issues before reintegrating the corrected data as version 2.8.
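
    A version diff like the one in this scenario can be approximated in a few lines of Python once both versions are checked out locally. A rough sketch; the file paths and column names are hypothetical:

        import csv

        def load_labels(path: str) -> dict:
            """Map example ID -> label for one checked-out dataset version."""
            with open(path, newline="") as f:
                return {row["id"]: row["label"] for row in csv.DictReader(f)}

        old = load_labels("feedback_v2.6.csv")  # hypothetical checkout of v2.6
        new = load_labels("feedback_v2.7.csv")  # hypothetical checkout of v2.7

        relabeled = {i for i in old.keys() & new.keys() if old[i] != new[i]}
        added = new.keys() - old.keys()

        print(f"{len(relabeled)} relabeled, {len(added)} new examples")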

    Key Takeaways

    • Data versioning tracks changes to datasets over time, enabling reproducibility and rollback.
    • It is essential because model behavior depends on both code and data — both must be versioned.
    • Content-addressable storage deduplicates unchanged data across versions for efficiency.
    • Data versioning provides the audit trail required by GDPR, EU AI Act, and other regulations.
    • Tools include DVC, LakeFS, Delta Lake, and Hugging Face Datasets.

    How Ertas Helps

    Ertas Data Suite maintains version history for datasets as they progress through the Ingest, Clean, Label, Augment, and Export pipeline, enabling teams to trace any model back to the exact data version used for training in Ertas Studio.
