
Fine-Tuned Model Ops: The Complete Lifecycle Guide
The full lifecycle of fine-tuned models in production — from data preparation through deployment, monitoring, and retraining. Stage-by-stage breakdown with time estimates, maturity levels, and failure modes.
Fine-tuning is the easy part. It takes 10 minutes with LoRA, maybe 30 with a larger dataset. You get a model that nails your use case in a demo. Then you ship it.
Three months later, the model is quietly degrading. Nobody is checking output quality. The training data is stale. A client changed their product naming and the model still uses the old terms. Nobody knows which version is in production or when it was last evaluated.
This is the gap between "fine-tuning" and "fine-tuned model ops." The tuning is one step. The ops — keeping models accurate, current, and reliable — is the other five steps that most teams skip entirely.
This guide covers the complete lifecycle of a fine-tuned model in production: six stages, concrete time estimates, a maturity framework, and the failure modes that catch every team eventually.
The Lifecycle at a Glance
The fine-tuned model lifecycle is a loop, not a line:
1. Data Preparation — Curate and format training examples
2. Training — Fine-tune the base model with LoRA/QLoRA
3. Evaluation — Test against domain-specific benchmarks
4. Deployment — Export to GGUF, serve via Ollama or similar
5. Monitoring — Track output quality in production
6. Retraining — Update the model when quality degrades
After step 6, you return to step 1. The cycle typically runs on a monthly cadence for active models, quarterly for stable ones.
How Fine-Tuned Ops Differs from Traditional MLOps
If you come from a traditional ML background, fine-tuned model ops looks familiar but the details are different at every stage.
Data preparation is curation, not labeling at scale. You are not hiring a team to annotate 100,000 images. You are selecting 200-2,000 high-quality examples from existing workflows — real support conversations, actual document summaries, production-quality outputs. The bottleneck is domain expert time, not annotator throughput.
Training takes 10 minutes, not 10 days. A LoRA fine-tune on a 7B model with 500 examples completes in under 15 minutes on a single GPU. This changes everything about iteration speed. You can afford to retrain frequently, and "training is expensive" is no longer an excuse to skip evaluation.
Evaluation requires domain-specific benchmarks, not generic leaderboards. MMLU scores are irrelevant. What matters is whether the model correctly formats a legal citation, uses the right product terminology, or follows the client's tone guidelines. You need custom eval sets built from your actual use case.
Deployment is GGUF plus Ollama, not a Kubernetes cluster with GPU autoscaling. Fine-tuned small models run on a $2,000 Mac Studio or a single cloud GPU. The deployment story is simpler, but versioning and rollback still matter.
Monitoring is output quality, not just latency and throughput. A fine-tuned model can respond in 50ms and still be wrong. You need to sample and score outputs, track user corrections, and detect when the model's domain knowledge falls behind reality.
Stage 1: Data Preparation (4-20 hours)
The quality ceiling of your model is set here. No amount of training fixes bad data.
What to do:
- Collect 200-2,000 examples from production workflows
- Format into instruction/response pairs (or your chosen format)
- Deduplicate and remove low-quality examples
- Split 90/10 into training and evaluation sets
- Version the dataset with a hash and date
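Most of this stage is scriptable once the raw examples exist. Here is a minimal sketch of the dedupe, split, and versioning steps, assuming a JSONL file of instruction/response pairs (file names and field names are illustrative, not a required format):

```python
# Dataset prep sketch: dedupe, split 90/10, version with a hash.
# Input schema {"instruction", "response"} is illustrative.
import hashlib, json, random

with open("raw_examples.jsonl") as f:
    examples = [json.loads(line) for line in f]

# Deduplicate on the exact instruction/response pair
seen, unique = set(), []
for ex in examples:
    key = (ex["instruction"].strip(), ex["response"].strip())
    if key not in seen:
        seen.add(key)
        unique.append(ex)

# 90/10 train/eval split with a fixed seed for reproducibility
random.Random(42).shuffle(unique)
cut = int(len(unique) * 0.9)
splits = {"train": unique[:cut], "eval": unique[cut:]}

for name, rows in splits.items():
    blob = "\n".join(json.dumps(r, sort_keys=True) for r in rows)
    digest = hashlib.sha256(blob.encode()).hexdigest()[:12]
    with open(f"{name}.jsonl", "w") as f:
        f.write(blob + "\n")
    print(f"{name}: {len(rows)} examples, version {digest}")
```

The hash doubles as the dataset version: any change to the examples changes the digest, so every training run can be traced to an exact dataset state.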
Time estimate: 4-8 hours for initial dataset, 2-4 hours for updates. Domain expert review adds 4-8 hours for the first pass, 1-2 hours for incremental updates.
Common mistakes: Using synthetic data exclusively without real examples. Including examples that contradict each other. Not versioning the dataset so you cannot reproduce results.
The key metric here is examples per hour of domain expert time. If it takes more than 3 minutes per example to curate, your process needs streamlining. Most teams find they can curate 30-50 good examples per hour once they have a system.
Stage 2: Training (10-45 minutes)
The fastest stage, and the one teams over-invest in relative to its impact.
What to do:
- Select base model (Llama 3.3, Qwen 2.5, Mistral, etc.)
- Configure LoRA parameters (rank 16-64, alpha 2x rank)
- Train for 3-5 epochs on your curated dataset
- Save the adapter weights and training configuration
- Log: base model hash, dataset hash, hyperparameters, training loss curve
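A minimal training sketch using the Hugging Face peft and trl libraries (APIs vary by version; the base model, paths, and hyperparameters here are placeholders):

```python
# LoRA fine-tune sketch with peft + trl (recent versions assumed).
# train.jsonl is expected in a prompt/completion or messages schema
# that trl's SFTTrainer understands.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

train_data = load_dataset("json", data_files="train.jsonl", split="train")

peft_config = LoraConfig(
    r=32,             # rank in the 16-64 range above
    lora_alpha=64,    # alpha = 2x rank
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

args = SFTConfig(output_dir="adapters/v1", num_train_epochs=3)
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",   # any base model from the list above
    train_dataset=train_data,
    peft_config=peft_config,
    args=args,
)
trainer.train()
trainer.save_model(args.output_dir)     # adapter weights + training config
```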
Time estimate: 10-15 minutes for datasets under 1,000 examples on a single GPU. 30-45 minutes for larger datasets or higher LoRA ranks.
Common mistakes: Training for too many epochs (overfitting). Changing hyperparameters without tracking what changed. Not saving the exact dataset hash used for training.
With Ertas, training configuration is versioned automatically. Every run records the base model, dataset version, and parameters so you can reproduce any result.
Stage 3: Evaluation (2-6 hours)
The most skipped stage and the one that determines whether your model actually works.
What to do:
- Run the model against your held-out evaluation set (minimum 50 examples)
- Score outputs on accuracy, format compliance, and tone
- Compare against the previous model version (A/B scoring)
- Test edge cases: ambiguous inputs, out-of-scope queries, adversarial prompts
- Document pass/fail criteria before you start evaluating
Time estimate: 2-3 hours for automated scoring, plus 2-3 hours for human review of flagged outputs. Faster if you have established eval templates from previous cycles.
Target metrics:
- Accuracy: 90%+ correct on domain tasks
- Format compliance: 95%+ matching expected output structure
- Hallucination rate: under 5% on factual claims
- Edge case handling: graceful refusal or escalation on out-of-scope inputs
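A short script can enforce these targets automatically. A sketch, assuming a prior generation step wrote an outputs.jsonl with input/expected/output fields; exact-match accuracy and the format regex are stand-ins for your real scoring functions:

```python
# Minimal automated-scoring sketch against the targets above.
# Assumes outputs.jsonl holds {"input", "expected", "output"} per line.
import json, re

FORMAT_RE = re.compile(r"^SUMMARY: .+", re.DOTALL)  # hypothetical output schema

with open("outputs.jsonl") as f:
    records = [json.loads(line) for line in f]

accuracy = sum(r["output"].strip() == r["expected"].strip() for r in records) / len(records)
fmt_rate = sum(bool(FORMAT_RE.match(r["output"])) for r in records) / len(records)

print(f"accuracy={accuracy:.1%}  format={fmt_rate:.1%}")
if accuracy < 0.90 or fmt_rate < 0.95:   # targets from the list above
    raise SystemExit("evaluation gate failed: do not deploy")
```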
Stage 4: Deployment (1-2 hours)
What to do:
- Export the adapter (and optionally merge with base model)
- Convert to GGUF format with appropriate quantization (Q5_K_M for quality, Q4_K_M for speed)
- Deploy to Ollama or your inference server
- Verify with a smoke test: 10-20 representative queries
- Update the model registry with deployment metadata
Time estimate: 30 minutes for export and conversion, 30 minutes for deployment and verification, 30 minutes buffer for issues.
Rollback plan: Always keep the previous version deployed and accessible. If the new version fails smoke tests or early monitoring, switch back within minutes, not hours.
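The smoke test is easy to script against Ollama's HTTP API. A minimal sketch, with a placeholder model tag and queries:

```python
# Smoke-test sketch against a local Ollama server (default port 11434).
# Model tag and queries are placeholders; use 10-20 representative queries.
import requests

QUERIES = [
    "Summarize this support ticket: ...",
    "Format this citation: ...",
]

for q in QUERIES:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "client-model:v2", "prompt": q, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    print(q[:48], "->", resp.json()["response"][:80])
```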
Stage 5: Monitoring (2-4 hours/month ongoing)
This is where most teams fail. They deploy and forget.
What to monitor:
| Metric | How to Measure | Frequency | Alert Threshold |
|---|---|---|---|
| Output accuracy | Sample 5-10% of outputs, score | Weekly | Below 88% |
| Format compliance | Automated regex/schema check | Daily | Below 93% |
| User corrections | Track edits to model outputs | Weekly | Above 15% edit rate |
| Response confidence | Token probability distribution | Daily | Avg confidence drop >10% |
| Latency p95 | Inference timing | Daily | Above 2x baseline |
| Input novelty | Embedding distance from training set | Weekly | >20% novel inputs |
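Most rows in this table are straightforward scripting; the input-novelty check is the least obvious. A sketch using sentence embeddings (the embedding model, file schemas, and the 0.35 distance cutoff are assumptions to tune):

```python
# Input-novelty sketch: fraction of production inputs far from training data.
import json
from sentence_transformers import SentenceTransformer

with open("train.jsonl") as f:
    train_inputs = [json.loads(line)["instruction"] for line in f]
with open("prod_sample.jsonl") as f:
    prod_inputs = [json.loads(line)["input"] for line in f]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
train_emb = encoder.encode(train_inputs, normalize_embeddings=True)
prod_emb = encoder.encode(prod_inputs, normalize_embeddings=True)

# Cosine distance from each production input to its nearest training example
novelty = 1 - (prod_emb @ train_emb.T).max(axis=1)
novel_rate = float((novelty > 0.35).mean())   # 0.35 cutoff is a guess to tune

if novel_rate > 0.20:   # >20% novel inputs, per the alert threshold above
    print(f"ALERT: {novel_rate:.0%} of sampled inputs look novel; review for retraining")
```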
Time estimate: 1-2 hours per week for manual review, dropping to 2-4 hours per month total once automated dashboards are in place.
Stage 6: Retraining (6-24 hours per cycle)
Retraining is not "train again from scratch." It is a targeted update based on monitoring data.
Triggers for retraining:
- Accuracy drops below 88% on weekly samples
- Domain vocabulary changes (product rename, new terminology)
- New task types emerge that the model handles poorly
- Quarterly scheduled refresh regardless of metrics
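These triggers are concrete enough to encode directly, which keeps the retrain decision auditable instead of ad hoc. A minimal sketch (every input is a placeholder for your real monitoring and calendar data):

```python
# Retraining-trigger sketch encoding the four conditions above.
from datetime import date, timedelta

def should_retrain(weekly_accuracy: float, vocab_changed: bool,
                   new_failing_task_types: bool, last_refresh: date) -> bool:
    return (
        weekly_accuracy < 0.88                                # accuracy floor
        or vocab_changed                                      # terminology shift
        or new_failing_task_types                             # unseen task types
        or date.today() - last_refresh > timedelta(days=90)   # quarterly refresh
    )
```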
What to do:
- Collect new examples from production (especially corrected outputs)
- Merge with existing training set, remove outdated examples
- Retrain with the updated dataset
- Evaluate against the same benchmarks plus new test cases
- Deploy only if evaluation scores meet or exceed the current production model
Time estimate: 4-8 hours for data collection and curation, 15-45 minutes for training, 2-6 hours for evaluation, 1-2 hours for deployment. Total: roughly one to two business days per retraining cycle.
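The deploy-only-if-better rule is worth making explicit code rather than a judgment call. A minimal sketch, assuming per-example scores from the same benchmark run against both models:

```python
# Promotion-gate sketch: deploy the retrained model only if it meets or
# exceeds the current production model on the identical benchmark.
def promote(candidate_scores: list[float], production_scores: list[float]) -> bool:
    candidate = sum(candidate_scores) / len(candidate_scores)
    production = sum(production_scores) / len(production_scores)
    return candidate >= production

# e.g. promote([0.93, 0.91, 0.95], [0.90, 0.92, 0.88]) returns True
```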
The Maturity Model
Not every team needs full automation on day one. Here is a progression that matches team size and model count.
| Level | Description | Characteristics | Right For |
|---|---|---|---|
| Level 1: Manual | Every stage is done by hand | Spreadsheet tracking, manual eval, ad-hoc retraining | 1-3 models, learning phase |
| Level 2: Automated Eval | Evaluation is scripted and repeatable | Eval scripts, automated scoring, manual retraining decisions | 3-10 models, regular clients |
| Level 3: Automated Retraining | Monitoring triggers retraining pipeline | Drift detection, automated data collection, human eval gate | 10-25 models, established practice |
| Level 4: Full Automation with Oversight | End-to-end pipeline with human checkpoints | Automated everything, human approves deployment, continuous monitoring | 25+ models, mature ops team |
Most teams should aim for Level 2 within their first quarter of production fine-tuning. Level 3 becomes necessary when you are managing more than 10 active models. Level 4 is a goal, not a starting point.
Team Responsibilities
Fine-tuned model ops is not purely an engineering problem. It requires three perspectives.
Product / Domain Experts:
- Define quality criteria and acceptable output standards
- Review evaluation results and approve model versions
- Identify when domain knowledge has shifted (product changes, regulatory updates)
- Set the retraining cadence based on business needs
Engineering:
- Build and maintain the training and deployment pipeline
- Implement monitoring dashboards and alerting
- Manage model versioning, storage, and rollback
- Optimize inference performance and resource usage
Data / Curation:
- Curate and maintain training datasets
- Collect production examples for retraining
- Manage dataset versioning and quality standards
- Build and update evaluation benchmarks
On small teams, one person may wear multiple hats. The important thing is that all three perspectives are represented in every retraining decision.
Common Failure Modes
Never retraining. The model was great at deployment. Six months later it is using outdated terminology, missing new product features, and failing on query patterns that have shifted. This is the most common failure mode by far.
Retraining too frequently. Weekly retraining without stable evaluation benchmarks means you are chasing noise. Every cycle introduces risk. If your monitoring shows stable quality, do not retrain just because the calendar says to.
No evaluation gates. Retraining without evaluation is just hoping the new model is better. It often is not. Always compare the retrained model against the current production model on the same benchmark before deploying.
No rollback plan. You deploy a retrained model and quality drops. How fast can you roll back? If the answer is "we would need to retrain with the old data," you do not have a rollback plan. Keep the previous version ready to serve at all times.
Single point of failure on domain expertise. If one person is the only one who can evaluate model quality, you have a bus factor of one. Document evaluation criteria explicitly so any domain expert can run the process.
Recommended Cadence
For most teams running fine-tuned models in production, this cadence balances quality with operational overhead:
- Daily: Automated format and confidence checks
- Weekly: Sample 5-10% of outputs for quality scoring; review monitoring dashboard
- Monthly: Full evaluation cycle; retrain if metrics warrant it; update training data with production examples
- Quarterly: Comprehensive review of all models; update evaluation benchmarks; audit data pipeline; review resource allocation
This cadence works for teams managing 5-20 active models. Scale up monitoring frequency as model count grows.
Putting It Together
The fine-tuned model lifecycle is not complicated. It is six stages in a loop: prepare, train, evaluate, deploy, monitor, retrain. Each stage has clear inputs, outputs, and time requirements.
What makes it hard is consistency. The teams that succeed are not the ones with the most sophisticated tooling. They are the ones that actually run the loop — that check model quality every week, retrain when the data says to, and never skip evaluation.
Start at Level 1. Get the loop running manually. Automate the parts that slow you down. Move up the maturity model as your model count grows and your operational patterns stabilize.
The model you fine-tuned today is not the model you will be running in six months. Plan for the lifecycle, not just the launch.
Further Reading
- Building a Model Retraining Loop for Fine-Tuned Accuracy — Deep dive into the retraining trigger and pipeline design
- Side-by-Side Model Comparison for Fine-Tuning — Practical A/B evaluation methods for model versions
- How to Evaluate Your Fine-Tuned Model — Non-technical guide to evaluation frameworks and scoring
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.