
Dataset Versioning in Practice: Git for Training Data
You version your code. You version your models. But do you version your training data? Dataset versioning — diffs, branches, and rollbacks for datasets — is how mature AI teams maintain reproducibility.
You version your code with Git. You version your models with model registries. But when someone asks "what data trained the model that's currently in production?" — the typical answer is an awkward silence, followed by someone checking a Slack thread from four months ago.
This gap is not just inconvenient. It is a reproducibility, debugging, and compliance failure that costs teams weeks of rework every time something goes wrong.
Dataset versioning — applying the same branch, diff, tag, and rollback concepts from code versioning to training data — is how mature AI teams close this gap. Here's what it looks like in practice.
Why Dataset Versioning Matters
Four problems disappear when you version your training data properly.
Reproducibility
"Recreate the exact dataset that trained model v2.3." Without versioning, this request triggers a forensic investigation: which scripts ran, which filters were applied, which labels were current at the time, which data was excluded. With versioning, it's a single checkout command.
Reproducibility is not academic. When a model behaves unexpectedly in production, the first debugging step is to inspect the training data. If you can't recreate that dataset exactly, you're debugging blind.
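One way to make "a single checkout command" concrete is a content-hash manifest: record the hash of every file in the dataset, and later verify that a checked-out copy matches the pinned version exactly. This is a minimal sketch of the idea, not any specific tool's API; the file layout and function names are illustrative.

```python
# Sketch: pin a dataset version as a manifest of content hashes.
# Paths and names are illustrative, not a specific tool's format.
import hashlib
import json
import os

def file_sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(data_dir, manifest_path):
    """Record the exact content of every file in the dataset."""
    manifest = {
        name: file_sha256(os.path.join(data_dir, name))
        for name in sorted(os.listdir(data_dir))
    }
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

def verify_manifest(data_dir, manifest_path):
    """Check that a dataset still matches its pinned version."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return all(
        file_sha256(os.path.join(data_dir, name)) == digest
        for name, digest in manifest.items()
    )
```

Tools like DVC apply the same principle at scale: Git tracks a small pointer file containing hashes, and the data itself lives in remote storage.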
Debugging
Model v3.1 performs 8% worse than v3.0 on a specific document type. What changed? If you have versioned datasets, you can diff the two versions and see exactly which examples were added, removed, or modified between them. Without versioning, the investigation takes days and often ends with "we're not sure."
In one case we've seen, a team spent two weeks investigating a performance regression that turned out to be caused by 47 mislabeled examples that were added between dataset versions. With proper diffing, they would have found it in 20 minutes.
Compliance
The EU AI Act requires organizations to document the data used to train AI systems. Article 10 specifically mandates data governance covering "the relevant design choices, data collection processes, data preparation operations such as annotation, data cleaning, and data enrichment." GDPR's transparency requirements around automated decision-making (Articles 13-15 and 22) create similar obligations.
Dataset versioning provides the audit trail that regulators expect: what data was used, when it was collected, how it was transformed, and which model was trained on which version. Without this, demonstrating compliance requires reconstructing history from fragments — a process that auditors view with justified skepticism.
Rollback
Your team updated the labeling guidelines and re-labeled 2,000 examples. The retrained model performs worse. With dataset versioning, you roll back to the previous dataset version and retrain. Without it, you try to undo the labeling changes manually — a process that introduces new errors.
Rollback capability turns data preparation mistakes from catastrophes into minor setbacks. Teams that can roll back experiment more freely, which leads to better datasets faster.
Current Approaches to Dataset Versioning
Several tools implement dataset versioning, each with different tradeoffs.
DVC (Data Version Control)
The most widely adopted tool. DVC extends Git with large file support — you track data files with DVC while tracking the metadata (hashes, pipeline definitions) with Git. Data is stored in remote storage (S3, GCS, Azure Blob) while Git stores lightweight pointer files.
Strengths: Git-native workflow, strong community, works with any storage backend. Weaknesses: Requires command-line proficiency, diffing large datasets is slow, no built-in visualization of dataset changes.
lakeFS
Git-like operations directly on data lakes. lakeFS provides branching, committing, and merging for data stored in object storage. It works at the storage layer, so it's transparent to tools that read from S3-compatible endpoints.
Strengths: Scalable to petabytes, zero-copy branching (branches don't duplicate data), works with existing data lake tools. Weaknesses: Requires infrastructure setup, optimized for data lakes rather than structured datasets.
Pachyderm
Combines data versioning with pipeline orchestration. Every data transformation is automatically versioned, creating a full lineage graph from raw data to processed output.
Strengths: Automatic lineage tracking, pipeline-aware versioning, Kubernetes-native. Weaknesses: Heavier infrastructure requirements, steeper learning curve, overkill for teams that just need versioning.
Manual Versioning (Timestamp Folders)
The approach most teams start with: dataset_v1/, dataset_v2/, dataset_20260115/. Simple to understand, easy to implement.
Weaknesses: No diffing capability, no branching, no merge conflict resolution, storage-inefficient (full copies), naming conventions break down at scale, no metadata about what changed or why. This approach works for 3-5 versions. It fails at 20+.
What to Version
Not all data artifacts need the same level of versioning rigor. Here's a priority order:
Always version:
- Labeled data — the direct input to model training. Every change matters.
- Export configurations — the settings that determine how labeled data becomes training data (format, splits, filtering rules).
- Labeling guidelines — the document that defines what each label means. Changes here invalidate prior labels.
Version when practical:
- Cleaned data — post-processing, pre-labeling. Useful for debugging the cleaning pipeline.
- Raw data — the original documents before any processing. Important for compliance, but raw data is often large and changes infrequently.
Version if resources allow:
- Augmented data — synthetic examples, data augmentation outputs. These can be regenerated from the augmentation pipeline, so versioning is less critical if the pipeline itself is versioned.
The Versioning Workflow
A practical dataset versioning workflow mirrors the Git branching model that development teams already know.
Main Branch
The main branch contains the current production dataset — the dataset that trained the model currently in production. This branch is protected: no direct modifications. All changes come through merge requests.
Experiment Branches
When a team member wants to modify the dataset — add new examples, relabel existing ones, remove problematic entries — they create a branch. This branch is an isolated copy (or in efficient implementations, a copy-on-write reference) where changes can be made without affecting the production dataset.
Experiment branches are cheap. Teams should create them freely: add-medical-terminology, relabel-contract-clauses, remove-duplicate-invoices. Each branch documents its intent in its name.
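Why are experiment branches cheap? Because in efficient implementations a branch is only a named pointer to an immutable commit, not a copy of the data. A toy in-memory sketch of that idea (not any tool's actual data model):

```python
# Sketch of zero-copy branching: a branch is a named pointer to a commit,
# and a commit maps example IDs to content hashes. Illustrative only.

class DatasetRepo:
    def __init__(self):
        self.commits = {}   # commit_id -> {example_id: content_hash}
        self.branches = {}  # branch_name -> commit_id
        self._next = 0

    def commit(self, branch, snapshot):
        cid = f"c{self._next}"
        self._next += 1
        self.commits[cid] = dict(snapshot)
        self.branches[branch] = cid
        return cid

    def create_branch(self, name, from_branch):
        # Only the pointer is duplicated, never the underlying data.
        self.branches[name] = self.branches[from_branch]

repo = DatasetRepo()
repo.commit("main", {"doc1": "hash_a", "doc2": "hash_b"})
repo.create_branch("add-medical-terminology", "main")
# Both branches reference the same commit until the new branch diverges.
```

lakeFS works on the same principle at object-storage scale, which is why branching a petabyte data lake there is effectively instant.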
Review and Merge
Before merging a dataset branch into main, the team reviews the diff. Key questions:
- How many examples were added, modified, or removed?
- Do the changes shift the class distribution significantly?
- Do quality metrics (inter-annotator agreement, format compliance) meet thresholds?
- Has a domain expert reviewed the changes?
This review catches problems before they contaminate the production dataset. It is the data equivalent of code review — and it is equally important.
Tagging Releases
When a dataset is used to train a model, tag it with the model version: model-v3.1-dataset. This creates a permanent, immutable reference point. Six months from now, anyone can check out exactly the dataset that trained v3.1.
Tags should include metadata: number of examples, class distribution summary, quality scores, and the date of the training run.
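The tag metadata described above can be generated mechanically at training time. A small sketch, with illustrative keys and a simplified view of examples as (ID, label) pairs:

```python
# Sketch: build tag metadata when a dataset version is used for training.
# Keys and the example representation are illustrative.
from collections import Counter
from datetime import date

def make_tag(name, examples):
    """examples: list of (example_id, label) pairs."""
    labels = Counter(label for _, label in examples)
    total = len(examples)
    return {
        "tag": name,
        "num_examples": total,
        "class_distribution": {k: round(v / total, 3) for k, v in labels.items()},
        "created": date.today().isoformat(),
    }

tag = make_tag(
    "model-v3.1-dataset",
    [("e1", "invoice"), ("e2", "contract"), ("e3", "invoice")],
)
```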
Diff Capabilities
The most valuable feature of dataset versioning is the ability to diff two versions. Unlike code diffs (which compare text lines), dataset diffs need to capture:
Row-level changes: Which examples were added, removed, or modified between versions. For a dataset with 5,000 examples, knowing that 47 were added and 12 were relabeled is immediately useful.
Label changes: Specifically which labels changed and how. Did examples move from category A to category B? Was there a systematic relabeling? This helps identify whether label changes were intentional (guideline update) or accidental (annotator error).
Distribution shifts: How did the overall dataset distribution change? If class A went from 30% to 15% of the dataset, that's a significant shift that will affect model behavior. Distribution diffs should be flagged automatically during review.
Schema changes: Did the output format change? Were new fields added? This catches formatting inconsistencies that would break training.
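The first three diff levels above can be sketched in a few lines if each dataset version is simplified to a mapping from example ID to label. This is a conceptual sketch, not a production diff engine; the 5% shift threshold is an arbitrary illustrative default.

```python
# Sketch of a dataset diff: row changes, label changes, and distribution
# shift. Each version is a dict of example_id -> label (a simplification).
from collections import Counter

def diff_datasets(old, new, shift_threshold=0.05):
    added = set(new) - set(old)
    removed = set(old) - set(new)
    # Examples present in both versions whose label changed.
    relabeled = {i: (old[i], new[i]) for i in set(old) & set(new)
                 if old[i] != new[i]}

    def dist(d):
        counts = Counter(d.values())
        return {k: v / len(d) for k, v in counts.items()}

    old_d, new_d = dist(old), dist(new)
    # Flag classes whose share of the dataset moved by >= the threshold.
    shifts = {
        k: (old_d.get(k, 0.0), new_d.get(k, 0.0))
        for k in set(old_d) | set(new_d)
        if abs(new_d.get(k, 0.0) - old_d.get(k, 0.0)) >= shift_threshold
    }
    return {"added": added, "removed": removed,
            "relabeled": relabeled, "distribution_shifts": shifts}
```

A review gate can then refuse to merge when `distribution_shifts` is non-empty and no one has explicitly acknowledged the shift.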
Storage Strategies
Naive versioning — storing a full copy of the dataset for each version — is wasteful. A 10GB dataset with 50 versions would consume 500GB.
Delta encoding stores only the changes between versions. Version 1 is stored in full; version 2 stores only the rows that were added, modified, or removed. This reduces storage by 80-95% for typical datasets where each version changes less than 20% of the data.
Content-addressable storage (used by DVC and Git LFS) stores each unique file chunk once and references it by hash. Identical chunks across versions are stored only once.
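The deduplication behind content-addressable storage is simple to demonstrate: split data into chunks, key each chunk by its hash, and store each unique chunk once. A minimal sketch (chunk size and layout are illustrative; real tools add compression, packing, and remote backends):

```python
# Minimal sketch of content-addressable storage: identical chunks shared
# across versions are stored exactly once.
import hashlib

class ChunkStore:
    def __init__(self, chunk_size=4 * 1024 * 1024):
        self.chunk_size = chunk_size
        self.chunks = {}  # hash -> bytes

    def put(self, data: bytes):
        """Store data; return the list of chunk hashes (the 'pointer')."""
        refs = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            h = hashlib.sha256(chunk).hexdigest()
            self.chunks[h] = chunk  # same chunk is only ever stored once
            refs.append(h)
        return refs

    def get(self, refs):
        """Reassemble data from its chunk references."""
        return b"".join(self.chunks[h] for h in refs)
```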
Remote storage keeps data on cloud storage (S3, GCS) or network-attached storage while keeping only metadata and pointers locally. This is essential for datasets above 10GB.
Integration with Model Training
Dataset versioning delivers its full value when integrated with model training. The integration points:
Training metadata: Every training run records the dataset version (tag or commit hash) it used. This is a single field in the training config, but it enables full traceability.
Automated validation: Before training starts, the pipeline validates that the dataset version exists, passes quality checks, and matches the expected schema. This prevents training on corrupted or incomplete datasets.
Comparison runs: Train two models on two dataset versions with identical hyperparameters. The performance difference is attributable entirely to data changes. This is how you measure the impact of data improvements.
Lineage tracking: From a production model, trace back to the dataset version, then to the individual labeling sessions that contributed to it. Full lineage from prediction to source data.
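The first two integration points can be combined into a small pre-training gate: look up the pinned dataset version, run basic sanity checks, and stamp the run config with the version metadata. All names and checks below are illustrative assumptions, not a specific framework's API.

```python
# Sketch: validate the dataset version and record it in the training
# config before a run starts. Names and checks are hypothetical.

def validate_and_record(config, registry):
    """registry: dict mapping dataset tags to metadata
    (a stand-in for a real version store)."""
    tag = config["dataset_version"]
    meta = registry.get(tag)
    if meta is None:
        raise ValueError(f"Unknown dataset version: {tag}")
    if meta["num_examples"] < config.get("min_examples", 1):
        raise ValueError("Dataset smaller than expected; refusing to train")
    # The tag now travels with the run: "which data trained this model?"
    # is answered by this one recorded field.
    return {**config, "dataset_meta": meta}

run_config = validate_and_record(
    {"dataset_version": "model-v3.1-dataset", "min_examples": 1000},
    {"model-v3.1-dataset": {"num_examples": 5000, "schema": "jsonl-v2"}},
)
```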
Ertas Data Suite implements dataset versioning with full diff capabilities, branch-and-merge workflows, and automatic lineage tracking from labeled examples through to exported datasets. Every change is tracked with who made it, when, and why — providing the audit trail that reproducibility and compliance require. All data stays on your infrastructure, with storage-efficient delta encoding that keeps versioning practical even for large datasets.
Further Reading
- What Is Data Lineage for Enterprise AI? — Understanding end-to-end data lineage and why it matters for AI governance.
- Data Lineage Reports as Enterprise AI Deliverables — How to produce lineage reports that satisfy compliance and stakeholder requirements.
- AI Model Versioning, Rollback, and Drift — The model-side companion to dataset versioning — managing model versions alongside data versions.