
The AI Agency's Guide to Model Versioning and Client Rollbacks
How AI agencies should version, track, and roll back fine-tuned models — covering naming schemes, change logs, A/B deployment, and emergency rollback procedures.
It's 3am. Your phone rings. A client's production model — the one handling 2,000 customer support tickets per day — is generating garbage. Complete nonsense. Their VP of Customer Experience is already composing an email to your CEO.
The question that determines whether you survive the next 30 minutes: can you roll back to the last known-good version right now?
If you don't have versioning, the answer is no. You'll spend hours figuring out what changed, when it changed, and whether you even have the previous adapter weights saved anywhere. If you do have versioning, the answer is: "Already done. Rolled back in 47 seconds. Investigating root cause now."
This guide is about building the system that makes the second answer possible.
Why Versioning Matters More for AI Than Software
In traditional software, a deploy goes bad and you redeploy the last commit. The code is deterministic — same input, same output, every time. The artifacts are tiny (megabytes of compiled code), and rollback is well-understood.
AI models are different in every dimension:
- Non-deterministic outputs. The same input can produce different outputs. "It works" is probabilistic, not binary.
- Large artifacts. LoRA adapters are 20-200MB. Full model weights are 4-30GB. You can't just "git revert" these.
- Complex lineage. A model's behavior depends on the base model, the adapter weights, the training data, the training config, the inference config, and sometimes the prompt template. Change any one of these and the output changes.
- Silent degradation. Bad code usually crashes. Bad models produce plausible-looking garbage. You might not catch it for days.
Versioning for AI models must track more state, handle larger artifacts, and support faster rollbacks than traditional software deployment.
The Versioning Scheme
Use semantic versioning adapted for AI models:
v{major}.{minor}.{patch}
Major version (v1 → v2): Changed the base model. This is a fundamental shift — different architecture, different capabilities, different failure modes. Requires full re-evaluation and client sign-off.
Minor version (v1.1 → v1.2): Retrained the LoRA adapter with new or updated data. Same base model, but the model's knowledge or behavior has changed meaningfully. Requires automated eval + human spot-check.
Patch version (v1.2.0 → v1.2.1): Changed the inference config, prompt template, or serving parameters. The model weights haven't changed. Requires a quick smoke test.
Examples in practice:
- acme-support-v1.0.0 → Initial deployment on Llama 3 8B
- acme-support-v1.1.0 → Retrained with 3 months of production data
- acme-support-v1.1.1 → Adjusted temperature from 0.3 to 0.2 to reduce creativity
- acme-support-v2.0.0 → Migrated to Llama 3.1 8B base model
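If you want tooling to enforce the scheme, a minimal sketch is below. The function names and gate labels are illustrative, not part of any particular framework; they simply map the size of a version bump to the review gate described above.

```python
import re

# Illustrative gate labels; use whatever your own pipeline calls these checks.
GATES = {
    "major": "full re-evaluation + client sign-off",
    "minor": "automated eval suite + human spot-check",
    "patch": "quick smoke test",
}

VERSION_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")

def parse_version(tag: str) -> tuple:
    """Parse a tag like 'v1.2.1' into (major, minor, patch)."""
    m = VERSION_RE.match(tag)
    if not m:
        raise ValueError(f"not a valid version tag: {tag}")
    return tuple(int(x) for x in m.groups())

def required_gate(current: str, proposed: str) -> str:
    """Return the review gate a proposed version bump must pass."""
    cur, new = parse_version(current), parse_version(proposed)
    if new[0] != cur[0]:
        return GATES["major"]
    if new[1] != cur[1]:
        return GATES["minor"]
    return GATES["patch"]

print(required_gate("v1.2.0", "v1.2.1"))  # quick smoke test
```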
What to Track Per Version
Every version entry in your registry needs these fields. No exceptions:
| Field | Example | Why |
|---|---|---|
| Version | v1.2.1 | Unique identifier |
| Base model | llama-3-8b-instruct | Reproducibility |
| LoRA adapter hash | sha256:a3f8c2... | Verify correct weights loaded |
| Training data hash | sha256:7b2e91... | Know exactly what data trained this |
| Training config | lr=2e-4, epochs=3, rank=16 | Reproducibility |
| Eval scores | accuracy=94.2%, hallucination=2.1% | Baseline for comparison |
| Inference config | temp=0.2, top_p=0.9, max_tokens=512 | Exact serving parameters |
| Deployment date | 2026-03-10T14:30:00Z | Audit trail |
| Deployed by | jane@agency.com | Accountability |
| Change description | "Added Q1 support tickets to training data" | Context for future you |
Store this as a structured YAML or JSON file alongside the adapter weights. When something goes wrong at 3am, you need this information accessible in seconds, not buried in Slack threads.
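One way to generate that file is sketched below in Python. The field names mirror the table above; the file paths and the safetensors filename are placeholders, and streaming hashes keep multi-gigabyte artifacts out of memory.

```python
import hashlib
import pathlib
from datetime import datetime, timezone

import yaml  # PyYAML; JSON works just as well

def sha256_of(path) -> str:
    """Stream a file through SHA-256 so large artifacts never have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()

# Placeholder paths throughout; point these at your real artifact layout.
version_dir = pathlib.Path("adapters/acme/support/v1.2.1")

metadata = {
    "version": "v1.2.1",
    "base_model": "llama-3-8b-instruct",
    "lora_adapter_hash": sha256_of(version_dir / "adapter_weights" / "adapter_model.safetensors"),
    "training_data_hash": sha256_of("data/acme/support/train.jsonl"),
    "training_config": {"lr": 2e-4, "epochs": 3, "rank": 16},
    "eval_scores": {"accuracy": 94.2, "hallucination_rate": 2.1},
    "inference_config": {"temperature": 0.2, "top_p": 0.9, "max_tokens": 512},
    "deployment_date": datetime.now(timezone.utc).isoformat(),
    "deployed_by": "jane@agency.com",
    "change_description": "Added Q1 support tickets to training data",
}

(version_dir / "metadata.yaml").write_text(yaml.safe_dump(metadata, sort_keys=False))
```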
The Version Registry
Maintain a central registry — a single source of truth for what's deployed where:
```yaml
clients:
  acme:
    models:
      support:
        production: v1.2.1
        staging: v1.3.0
        available_versions:
          - v1.2.1
          - v1.2.0
          - v1.1.0
        rollback_target: v1.2.0
  baker:
    models:
      legal-review:
        production: v2.1.0
        staging: v2.2.0
        available_versions:
          - v2.1.0
          - v2.0.0
          - v1.5.0
        rollback_target: v2.0.0
```
This file should be version-controlled itself. You want a full audit trail of every deployment decision.
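A small helper for querying the registry programmatically is sketched below. It assumes the YAML layout shown above and a file named registry.yaml (a placeholder), and uses PyYAML; any structured format works.

```python
import yaml  # PyYAML

def load_registry(path: str = "registry.yaml") -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

def production_version(registry: dict, client: str, model: str) -> str:
    """Which version is currently live for a given client/model."""
    return registry["clients"][client]["models"][model]["production"]

def rollback_target(registry: dict, client: str, model: str) -> str:
    """Which version we fall back to if production misbehaves."""
    return registry["clients"][client]["models"][model]["rollback_target"]

registry = load_registry()
print(production_version(registry, "acme", "support"))  # v1.2.1
print(rollback_target(registry, "acme", "support"))     # v1.2.0
```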
The Deployment Pipeline
A model update should flow through four stages. Skipping stages is how you end up with 3am phone calls.
Stage 1: Staging
Deploy the new version to a staging environment. Run your full automated eval suite. Compare scores against the current production version. If any metric drops by more than 2 percentage points, stop and investigate.
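A sketch of that gate follows, assuming each version's eval results are stored as the eval_results.json files described later in this guide, with scores expressed as percentages. The 2-point threshold is the rule from the paragraph above; note the comparison assumes higher-is-better metrics.

```python
import json

THRESHOLD = 2.0  # percentage points, per the rule above

def load_scores(path: str) -> dict:
    """Eval scores as {metric_name: percentage}, e.g. {"accuracy": 94.2}."""
    with open(path) as f:
        return json.load(f)

def regressions(prod_path: str, staging_path: str) -> list:
    """Metrics where staging drops more than THRESHOLD points below production.

    Assumes higher is better; invert rates like hallucination before comparing.
    """
    prod, staging = load_scores(prod_path), load_scores(staging_path)
    failed = []
    for metric, prod_score in prod.items():
        staging_score = staging.get(metric, 0.0)
        if prod_score - staging_score > THRESHOLD:
            failed.append(f"{metric}: {prod_score:.1f} -> {staging_score:.1f}")
    return failed

failures = regressions("v1.2.1/eval_results.json", "v1.3.0/eval_results.json")
if failures:
    raise SystemExit("Eval regression, stopping promotion:\n" + "\n".join(failures))
```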
Stage 2: Human Review
Have someone — ideally someone who knows the client's domain — review 20-30 outputs from the staging model. Focus on:
- Does the tone match what the client expects?
- Are there any new failure modes?
- Did we fix the issues the retrain was supposed to address?
Stage 3: Canary Deployment
Route 10% of production traffic to the new version. Monitor for 24-48 hours. Compare latency, error rates, and — if you have automated quality metrics — output quality between the two versions.
If the canary looks healthy, ramp to 50% for another 24 hours.
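The routing mechanism is up to you; one simple option, sketched below, is deterministic hash-based assignment so a given ticket or user stays on the same arm for the whole monitoring window. The function and field names here are illustrative.

```python
import hashlib

def canary_arm(request_id: str, canary_percent: int) -> str:
    """Deterministically route a fixed share of traffic to the canary.

    Hashing the request (or user) ID pins each caller to one arm, which makes
    before/after comparisons cleaner than random per-request sampling.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "production"

# Stage 3: start at 10%, ramp to 50% if the canary stays healthy.
print(canary_arm("ticket-48213", 10))
```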
Stage 4: Full Deployment
Cut over 100% of traffic to the new version. Keep the old version loaded and ready for instant rollback for at least 7 days.
Total timeline for a model update: 3-5 days from staging to full deployment. This feels slow until you compare it to the 2-3 days of cleanup after a bad instant deploy.
Rollback Procedure
Rollback must be fast, reliable, and rehearsed. Here's the procedure:
Preparation (Do This Before You Ever Need It)
- Keep the last 3 versions hot. Their adapter weights should be on the same machine as your serving infrastructure, ready to load. Don't archive old versions to cold storage until you have 3 newer versions ahead of them.
- Define the rollback target. For each client, explicitly name which version is the rollback target. Usually it's the previous production version, but sometimes you want to skip back further.
- Test the rollback. Once a month, practice rolling back a non-critical model. Time it. If it takes more than 60 seconds, fix your tooling.
Execution (When Things Go Wrong)
- Swap the adapter pointer. Change the serving config to point at the rollback version's adapter weights. Since the base model stays loaded, this is an adapter hot-swap — no restart needed.
- Confirm the swap. Send 5-10 known test inputs through the model and verify the outputs match expected behavior for the rollback version.
- Notify the client. Send a brief, factual message: "We identified an issue with the latest model update and have rolled back to the previous stable version (v1.2.0). Service is restored. We're investigating the root cause and will provide an update within 24 hours."
- Investigate. Compare the failed version's outputs against the rollback version's outputs on the same inputs. Find what went wrong. Common causes: training data contamination, config error, base model update that changed tokenization.
Target rollback time: under 60 seconds from decision to confirmed rollback. This is achievable with LoRA adapter swapping on Ertas.
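A minimal sketch of the swap-and-confirm steps is below. The point_serving_at and generate functions are placeholders for whatever your serving stack actually exposes (an admin hook that hot-swaps the LoRA adapter, and the model's chat endpoint), and the smoke-test inputs are illustrative.

```python
import time

def point_serving_at(adapter_path: str) -> None:
    """Placeholder: repoint the serving config at the given adapter (no restart needed)."""
    print(f"serving now points at {adapter_path}")

def generate(prompt: str) -> str:
    """Placeholder: call the model behind your serving endpoint."""
    return "stub response about the refund policy"

# 5-10 known inputs with a marker each output must contain for the rollback version.
SMOKE_TESTS = [
    {"prompt": "How do I request a refund?", "must_contain": "refund"},
]

def rollback(client: str, model: str, target_version: str) -> None:
    start = time.monotonic()
    point_serving_at(f"adapters/{client}/{model}/{target_version}/adapter_weights")
    failures = [
        t["prompt"] for t in SMOKE_TESTS
        if t["must_contain"].lower() not in generate(t["prompt"]).lower()
    ]
    if failures:
        raise RuntimeError(f"rollback smoke test failed on {len(failures)} inputs")
    print(f"rolled back {client}/{model} to {target_version} "
          f"in {time.monotonic() - start:.0f}s")

rollback("acme", "support", "v1.2.0")
```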
Change Log Template for Clients
Every model update should come with a client-facing change log. Keep it clear, non-technical, and focused on outcomes:
```markdown
## Model Update: Support Assistant v1.3.0

**Date:** March 10, 2026
**Type:** Monthly retraining

### What Changed
- Incorporated 3 months of new support ticket data (1,247 tickets)
- Improved handling of refund-related questions
- Reduced false escalations to human agents

### Performance Comparison
| Metric | Previous (v1.2.1) | Updated (v1.3.0) |
|--------|-------------------|------------------|
| Accuracy | 94.2% | 95.8% |
| Hallucination rate | 2.1% | 1.4% |
| Avg response time | 1.3s | 1.2s |

### Known Limitations
- Still struggles with multi-product bundle questions (working on this for next update)

### Rollback Plan
Previous version (v1.2.1) is available for instant rollback if any issues arise.
```
This level of transparency builds trust. Clients who get performance reports don't churn — they upgrade their tier.
Storing Artifacts
Every version needs a complete, self-contained artifact bundle:
```
adapters/acme/support/v1.3.0/
├── adapter_weights/         # The actual LoRA weights
├── training_config.yaml     # Exact training hyperparameters
├── training_data.hash       # SHA-256 of training data (not the data itself)
├── eval_results.json        # Full eval suite results
├── inference_config.yaml    # Serving parameters
├── changelog.md             # Client-facing change log
└── metadata.yaml            # All version registry fields
```
Store these on fast local storage for the last 3 versions per client. Archive older versions to object storage (S3, GCS, or equivalent). You'll rarely need them, but when you do — regulatory audit, client dispute, reproducing a historical issue — you'll be glad they exist.
Total storage per version for a 7B LoRA adapter: roughly 50-250MB depending on rank. At 3 hot versions × 12 clients = 36 version bundles, you're looking at 2-9GB. Trivial.
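A sketch of the retention policy in Python is below, assuming the directory layout above, the boto3 S3 client, and a placeholder bucket name. It keeps the newest three versions hot and uploads everything older to object storage.

```python
import pathlib
import re

import boto3

HOT_VERSIONS = 3
BUCKET = "agency-model-archive"  # placeholder bucket name

def version_key(p: pathlib.Path) -> tuple:
    """Sort v1.10.0 after v1.9.0 (a plain string sort would not)."""
    return tuple(int(x) for x in re.findall(r"\d+", p.name))

def archive_old_versions(model_dir: str) -> None:
    s3 = boto3.client("s3")
    versions = sorted(pathlib.Path(model_dir).iterdir(), key=version_key)
    for old in versions[:-HOT_VERSIONS]:  # everything except the newest 3
        for f in old.rglob("*"):
            if f.is_file():
                key = f"{model_dir}/{f.relative_to(model_dir).as_posix()}"
                s3.upload_file(str(f), BUCKET, key)
        # Only remove local copies after the upload pass succeeds.
        print(f"archived {old}, safe to remove from local storage")

archive_old_versions("adapters/acme/support")
```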
Common Mistakes
Overwriting instead of versioning. "I'll just save the new adapter to the same path." Congratulations, you've destroyed your rollback capability. Always write to a new versioned path.
No eval before deploy. "The training loss went down, so it must be better." Training loss tells you the model memorized your data. Eval scores tell you it can actually perform the task. These are different things.
Not tracking which data trained which model. Six months from now, a client asks you to remove certain data from their model's training set (GDPR, legal discovery, whatever). If you don't know which data produced which version, you're stuck retraining everything from scratch.
Skipping canary deployment. "It passed eval, let's just deploy." Eval suites don't catch everything. Real production traffic surfaces edge cases that your test set missed. Canary at 10% for 24 hours. Every time.
No rollback testing. You haven't tested your rollback procedure since you set it up 6 months ago. The config format changed. The script path changed. The SSH key expired. Test it monthly.
The Ertas Advantage
Ertas Studio bakes versioning into the deployment workflow. Every model deployed through Ertas gets automatic version tracking, artifact storage, and one-click rollback. The version registry is your dashboard — you see what's deployed where, what's available for rollback, and what's staged for deployment.
The deployment pipeline supports canary routing out of the box. Set your canary percentage, monitoring period, and auto-rollback thresholds. If the canary fails your quality gates, it rolls back automatically before you even wake up.
Because at 3am, the best incident response is the one that already happened.
For more on agency operations, see our guide to DIY vs. Ertas fine-tuning for agencies and the multi-model operations guide.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.