
    CI/CD for Fine-Tuning Pipelines: Automating Train-Evaluate-Deploy

    Manual fine-tuning doesn't scale. Learn how to build a complete CI/CD pipeline that automates training, evaluation, promotion gates, and deployment for fine-tuned models.

    Ertas Team

    Your first fine-tune was manual. You curated data by hand, kicked off training, eyeballed the results, converted to GGUF, loaded it into Ollama, and tested it yourself. It took a day. Maybe two.

    That works once. It does not work when you have four clients, each with monthly retraining cycles, each with different evaluation criteria, and each expecting zero downtime. Manual fine-tuning breaks at the second client. By the fourth, you are spending more time on process than on actual model improvement.

    The solution is the same one software engineering solved decades ago: CI/CD. Continuous integration and continuous deployment, adapted for fine-tuning pipelines. Here is exactly how to build one.

    The Pipeline at a Glance

    A fine-tuning CI/CD pipeline has seven stages:

    1. Trigger — something initiates the pipeline
    2. Data validation — confirm the training data is clean and sufficient
    3. Fine-tune — run the actual training job
    4. Evaluate — run the model against your test suite
    5. Compare — benchmark against the current production model
    6. Deploy — promote the new model if it passes all gates
    7. Monitor — watch production metrics post-deployment

    Each stage has a clear pass/fail criterion. If any stage fails, the pipeline stops and alerts you. No human in the loop required for the happy path.
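
    To make the flow concrete, here is a minimal orchestration sketch in Python. The stage names mirror the list above; the stage functions themselves are placeholders for whatever training and evaluation tooling you actually run.

    # Minimal orchestration sketch: each stage returns True (pass) or False (fail).
    # Stage implementations are placeholders for your own tooling.
    from typing import Callable

    Stage = tuple[str, Callable[[], bool]]

    def run_pipeline(stages: list[Stage], alert: Callable[[str], None]) -> bool:
        for name, stage_fn in stages:
            if not stage_fn():
                alert(f"pipeline stopped at stage: {name}")  # fail fast, nothing deploys
                return False
        return True  # every gate passed

    # Example wiring (stubbed stages that always pass):
    ok = run_pipeline(
        [("data_validation", lambda: True),
         ("fine_tune", lambda: True),
         ("evaluate", lambda: True),
         ("promotion_gates", lambda: True),
         ("deploy", lambda: True)],
        alert=print,
    )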

    Choosing Your Triggers

    Not every pipeline run needs a human clicking "go." Three trigger types cover most scenarios:

    Scheduled triggers work best for stable, predictable workloads. Set a monthly cadence. The pipeline runs on the first Tuesday of each month, retrains on whatever new data has accumulated, and promotes if the new model is better. If the new model is not better, nothing changes. Total human effort: reading the summary email.

    Data-threshold triggers fire when you have accumulated enough new training examples. Set a threshold — say 500 new validated examples — and the pipeline kicks off automatically. This works well for high-volume applications where new data arrives daily.

    Quality-threshold triggers are reactive. Your monitoring system detects that production accuracy has dropped below 85%. It fires the pipeline. The model retrains on updated data, evaluates, and deploys if it fixes the regression. This is your safety net.

    Combining triggers is the practical approach. Most teams run scheduled triggers as their baseline and quality-threshold triggers as their safety net. A monthly scheduled retrain handles gradual drift. A quality-threshold trigger catches sudden degradation — like when a product update changes the data distribution overnight.

    Data-threshold triggers are a bonus for high-volume use cases. If you are processing 10,000+ requests per day and collecting feedback on each, you will accumulate training data fast enough to justify more frequent retraining.

    One important rule: triggers should have a cooldown period. If a quality-threshold trigger fires and retrains, suppress additional triggers for at least 48 hours. Otherwise you risk a loop where a noisy metric triggers retraining repeatedly.
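
    The cooldown rule is simple to enforce. Here is a sketch, assuming you persist the last trigger time somewhere (a file or a database row); the 48-hour window and the 85% accuracy threshold match the examples above.

    # Cooldown guard for quality-threshold triggers: suppress re-fires within 48 hours.
    from datetime import datetime, timedelta, timezone

    COOLDOWN = timedelta(hours=48)

    def should_fire(last_fired_at: datetime | None, now: datetime | None = None) -> bool:
        now = now or datetime.now(timezone.utc)
        if last_fired_at is None:
            return True
        return now - last_fired_at >= COOLDOWN

    # Example: accuracy dropped below 85% and the last retrain was three days ago -> fire.
    last_fired_at = datetime.now(timezone.utc) - timedelta(days=3)
    production_accuracy = 0.82
    fire = production_accuracy < 0.85 and should_fire(last_fired_at)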

    Stage 1: Data Validation

    Before you spend compute on training, validate your data. This stage catches problems that would otherwise waste hours of fine-tuning time.

    Volume check: Do you have enough new examples? If you are retraining monthly and only accumulated 12 new examples, the pipeline should skip this cycle. Set a minimum threshold — 100 new examples is a reasonable starting point.

    Format validation: Every example must conform to your training schema. For chat fine-tuning, that means valid system/user/assistant message arrays. For completion tasks, valid input/output pairs. Malformed examples should be flagged and quarantined, not silently dropped.

    Distribution check: Compare the label distribution of your new data against your existing training set. If your new batch is 90% one category, that is a signal worth investigating. It might be legitimate (seasonal shift) or it might indicate a data collection bug.

    Deduplication: Remove exact and near-exact duplicates. Near-duplicates at or above a Jaccard similarity of 0.95 should be flagged. Duplicate training examples cause overfitting on those specific patterns.

    Quality scoring: If you have a quality metric for individual examples (response length, format compliance, human ratings), filter out examples below your quality threshold. One bad training example can undo the benefit of ten good ones.

    If validation fails, the pipeline stops and sends a detailed report: how many examples failed, which checks failed, and representative samples of the failures. You fix the data and re-trigger manually. Do not auto-fix — data quality decisions need human judgment.
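
    Here is a sketch of the validation stage, assuming chat-format examples stored as dictionaries with a messages list. The thresholds mirror the ones above: a 100-example minimum and a 0.95 Jaccard cutoff for near-duplicates.

    # Validation-stage sketch: volume, format, and near-duplicate checks on chat-format examples.
    MIN_NEW_EXAMPLES = 100
    NEAR_DUP_JACCARD = 0.95
    VALID_ROLES = {"system", "user", "assistant"}

    def is_well_formed(example: dict) -> bool:
        messages = example.get("messages")
        if not isinstance(messages, list) or not messages:
            return False
        return all(
            isinstance(m, dict) and m.get("role") in VALID_ROLES and isinstance(m.get("content"), str)
            for m in messages
        )

    def jaccard(a: str, b: str) -> float:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

    def validate(new_examples: list[dict]) -> dict:
        report = {"passed": True, "malformed": [], "near_duplicates": []}
        if len(new_examples) < MIN_NEW_EXAMPLES:
            return {"passed": False, "reason": f"only {len(new_examples)} new examples"}
        seen_texts = []
        for i, example in enumerate(new_examples):
            if not is_well_formed(example):
                report["malformed"].append(i)        # quarantine, do not silently drop
                continue
            text = " ".join(m["content"] for m in example["messages"])
            if any(jaccard(text, seen) >= NEAR_DUP_JACCARD for seen in seen_texts):
                report["near_duplicates"].append(i)  # flag for review
            seen_texts.append(text)
        report["passed"] = not report["malformed"]
        return report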

    Stage 2: Fine-Tuning

    With validated data, the pipeline calls the Ertas fine-tuning API. This stage is straightforward because the hard work — hyperparameter selection, base model choice — was done during your initial manual fine-tune.

    Your pipeline configuration locks down the training parameters:

    • Base model: the same one you validated manually (e.g., Llama 3.1 8B)
    • LoRA rank: your validated setting (typically r=16 or r=32)
    • Learning rate: your validated rate (typically 2e-4)
    • Epochs: fixed or early-stopping based on validation loss
    • Training data: merged set of existing + new validated examples

    The key insight: your CI/CD pipeline should not be experimenting with hyperparameters. That experimentation happens during development. The pipeline runs the proven recipe on updated data.

    Training time depends on dataset size and model. For a typical 8B parameter model with 2,000-5,000 examples, expect 30-90 minutes. The pipeline waits, polls for completion, and moves to evaluation.
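
    The waiting logic is a plain polling loop. The endpoint path and response fields below are hypothetical placeholders, not the actual Ertas API; adapt them to whichever job API you call.

    # Submit-and-poll sketch. The URL and JSON fields are hypothetical placeholders.
    import time
    import requests

    def wait_for_job(base_url: str, job_id: str, poll_seconds: int = 60, timeout_s: int = 3 * 3600) -> dict:
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            job = requests.get(f"{base_url}/fine-tune/jobs/{job_id}", timeout=30).json()
            if job["status"] in ("succeeded", "failed"):
                return job
            time.sleep(poll_seconds)  # training typically takes 30-90 minutes
        raise TimeoutError(f"job {job_id} did not finish within {timeout_s}s")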

    Version everything. Each training run should produce a versioned artifact with metadata: the dataset hash, the base model version, the hyperparameters, and the training timestamp. When you need to debug a regression six months from now, you will want to know exactly what went into each model version.

    # Example artifact metadata
    run_id: ft-2026-02-26-001
    base_model: llama-3.1-8b
    dataset_hash: sha256:a4f8e2...
    dataset_size: 3,847 examples
    lora_rank: 16
    learning_rate: 2e-4
    epochs: 3
    training_duration: 47m
    timestamp: 2026-02-26T04:00:00Z
    

    Stage 3: Evaluation Suite

    This is where most teams underinvest and where most pipeline failures should be caught. Your evaluation suite needs to cover four dimensions:

    Accuracy metrics: Run the new model against your held-out test set. Measure task-specific metrics — F1 for classification, ROUGE or human-preference scores for generation, exact match for structured extraction. You need a number, not a vibe.

    Regression tests: A curated set of 50-200 examples that the production model handles correctly. These are your "must not break" cases. If the new model gets any of these wrong, that is a regression, and it needs investigation.

    Latency benchmarks: Run 100 inference calls and measure p50, p95, and p99 latency. Fine-tuned models should not be slower than the base model by more than 10%. If they are, something went wrong in training or quantization.

    Safety checks: Run your safety evaluation set — adversarial prompts, edge cases, sensitive topics. The new model must pass every safety check. No exceptions, no thresholds. Binary pass/fail.
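
    Accuracy and safety scoring are task-specific, but the latency benchmark and regression tests look roughly like this. The generate callable is a placeholder for whatever function calls your candidate model.

    # Latency benchmark and regression-test sketch. `generate` stands in for your model client.
    import time
    from statistics import quantiles
    from typing import Callable

    def latency_benchmark(generate: Callable[[str], str], prompt: str, n: int = 100) -> dict:
        samples_ms = []
        for _ in range(n):
            start = time.perf_counter()
            generate(prompt)
            samples_ms.append((time.perf_counter() - start) * 1000)
        cuts = quantiles(samples_ms, n=100)  # 99 percentile cut points
        return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

    def regression_pass_rate(generate: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
        # cases are (prompt, expected output) pairs the production model already handles
        passed = sum(1 for prompt, expected in cases if generate(prompt).strip() == expected.strip())
        return passed / len(cases)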

    Store all evaluation results as artifacts. You will want to compare across pipeline runs to spot trends.

    Stage 4: Promotion Gates

    Evaluation produces numbers. Promotion gates turn those numbers into a deploy/no-deploy decision. Here are the gates that work in practice:

    Gate                  | Criterion                                 | Action on fail
    Accuracy              | >= production model accuracy              | Block deployment
    Accuracy improvement  | >= 0.5% improvement OR no regression      | Allow but flag
    Regression tests      | 100% pass rate                            | Block deployment
    Latency p95           | Within 10% of production p95              | Block deployment
    Safety checks         | 100% pass rate                            | Block deployment
    Model size            | GGUF within 5% of production model size   | Warn

    All blocking gates must pass. If any blocking gate fails, the pipeline stops, logs the failure reason, and alerts the team. The model is archived for investigation but not deployed.

    If all gates pass, the pipeline proceeds to deployment automatically. No human approval needed. The gates are your approval process.
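
    The gate table translates almost directly into code. A sketch, with accuracy expressed as a fraction in [0, 1] and the metric names chosen here for illustration:

    # Promotion-gate sketch: every blocking gate must pass; warnings are logged, not blocking.
    def check_gates(new: dict, prod: dict) -> dict:
        failures, warnings = [], []
        if new["accuracy"] < prod["accuracy"]:
            failures.append("accuracy below production")
        elif new["accuracy"] < prod["accuracy"] + 0.005:
            warnings.append("accuracy improvement under 0.5 points")  # allow but flag
        if new["regression_pass_rate"] < 1.0:
            failures.append("regression tests not at 100%")
        if new["latency_p95"] > prod["latency_p95"] * 1.10:
            failures.append("p95 latency more than 10% above production")
        if new["safety_pass_rate"] < 1.0:
            failures.append("safety checks not at 100%")
        if abs(new["gguf_size_bytes"] - prod["gguf_size_bytes"]) > prod["gguf_size_bytes"] * 0.05:
            warnings.append("GGUF size differs from production by more than 5%")
        return {"deploy": not failures, "failures": failures, "warnings": warnings}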

    Stage 5: Deployment

    Deployment for local models follows a specific path: the fine-tuned model is quantized to GGUF, registered as a new version, and rolled out.

    GGUF conversion: Convert the fine-tuned adapter or merged model to GGUF format at your target quantization level (Q4_K_M is a solid default). Verify the file is valid and loadable.

    Canary deployment: Do not switch 100% of traffic immediately. Start at 5%. Route 5% of requests to the new model, 95% to production. Monitor for 2 hours. If metrics hold, promote to 25%. Monitor for 4 more hours. Then promote to 100%.
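
    The traffic split itself can be as simple as weighted random routing in whatever gateway sits in front of your models. A minimal sketch, using the 5% / 25% / 100% schedule above:

    # Canary routing sketch: send a fraction of requests to the candidate model.
    import random

    # (traffic share to candidate, hours to hold before promoting)
    CANARY_STEPS = [(0.05, 2), (0.25, 4), (1.00, 0)]

    def pick_model(canary_share: float) -> str:
        return "candidate" if random.random() < canary_share else "production"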

    Ollama integration: Update the Modelfile to point to the new GGUF. Reload the model in Ollama. Verify the model responds correctly to a smoke test prompt.
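
    A sketch of that step using the standard Ollama CLI from Python; the model and file names are placeholders.

    # Write a Modelfile for the new GGUF, register it with Ollama, and run a smoke test.
    import subprocess
    from pathlib import Path

    def deploy_to_ollama(gguf_path: str, model_name: str) -> str:
        Path("Modelfile").write_text(f"FROM {gguf_path}\n")
        subprocess.run(["ollama", "create", model_name, "-f", "Modelfile"], check=True)
        smoke = subprocess.run(
            ["ollama", "run", model_name, "Reply with the single word OK."],
            check=True, capture_output=True, text=True,
        )
        return smoke.stdout  # confirm the model loads and responds before shifting traffic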

    The entire deployment stage, excluding canary monitoring windows, takes under 5 minutes. The canary monitoring adds 6 hours of automated watching.

    Keep every deployed model version archived. Storage is cheap. The ability to rollback to any previous version — not just the immediately prior one — is invaluable when a regression is discovered days or weeks after deployment.

    Stage 6: Post-Deploy Monitoring

    The pipeline does not end at deployment. Post-deploy monitoring runs for 24 hours after full promotion:

    • Hour 1: Check error rates, latency, and basic output quality every 5 minutes
    • Hours 2-6: Check every 15 minutes, compare against production baseline
    • Hours 7-24: Check every hour, watch for gradual degradation

    If any metric drops below the production baseline by more than 5% during the monitoring window, the pipeline triggers an automatic rollback. Rollback means: reload the previous GGUF, revert the Modelfile, restore 100% traffic to the old model. Total rollback time: under 30 seconds with LoRA adapters, under 2 minutes with full model swaps.
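
    The watchdog behind auto-rollback is a comparison against the recorded baseline. A sketch for "higher is better" metrics, using the 5% threshold from above:

    # Post-deploy watchdog: roll back if any tracked metric drops more than 5% below baseline.
    ROLLBACK_DROP = 0.05

    def needs_rollback(current: dict, baseline: dict) -> list[str]:
        breached = []
        for metric, base_value in baseline.items():
            if current.get(metric, 0.0) < base_value * (1 - ROLLBACK_DROP):
                breached.append(metric)
        return breached

    # Example: accuracy fell from 0.91 to 0.84 (roughly an 8% drop) -> rollback fires.
    print(needs_rollback({"accuracy": 0.84, "format_compliance": 0.97},
                         {"accuracy": 0.91, "format_compliance": 0.98}))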

    Auto-rollback is not optional. It is the safety net that makes fully automated deployment acceptable. Without it, you need a human watching every deployment. With it, the pipeline can run at 3 AM and you sleep through it.

    Log every rollback with full context: which metric triggered it, the exact values at trigger time, the model version that was rolled back, and the model version that was restored. This log becomes your debugging starting point.

    What This Costs

    Setup time: 8-16 hours to build the pipeline end to end. This includes writing the evaluation suite (the biggest time investment), configuring triggers, setting up monitoring, and testing the rollback path.

    Ongoing maintenance: 1-2 hours per month. Reviewing pipeline run summaries, updating evaluation examples, adjusting thresholds as your application evolves.

    What it saves: 10-20 hours per month of manual fine-tuning, evaluation, and deployment work. For agencies managing multiple client models, multiply that by client count.

    The break-even is the first month. By month two, you are operating at a pace that manual processes cannot match.

    What NOT to Automate (Yet)

    Not everything belongs in the pipeline on day one:

    Your first fine-tune should always be manual. You need to understand the process, the data, the failure modes, and the evaluation criteria before you can automate them.

    Evaluation criteria design requires human judgment. Which metrics matter? What are the thresholds? These decisions shape the entire pipeline. Get them wrong and you automate the wrong thing.

    Data curation for new task types still needs a human eye. When you are expanding into a new domain or adding a new capability, a human should review the training data before it enters the pipeline.

    Edge case handling is another area to keep manual. When your model encounters a genuinely novel input pattern — one that does not fit existing categories or workflows — a human should decide how to handle it, label the examples, and add them to the training set. The pipeline can then incorporate those examples in the next automated cycle.

    Automate the repetitive execution. Keep humans in the loop for the strategic decisions.

    Getting Started

    You do not need to build all seven stages at once. Start with three: fine-tune, evaluate, and deploy. Add data validation next. Then triggers. Then monitoring. Each stage adds value independently.

    The teams that scale fine-tuning successfully are the ones that treat it like software, not like a science experiment. Version your data. Test your models. Automate your deployments. Monitor your production.

    The pipeline is an investment in repeatability. Every manual step you automate is a step that can no longer be forgotten, skipped, or done inconsistently. That is how you go from one fine-tuned model to fifty.

    One team we worked with went from quarterly manual retrains (each taking two days) to monthly automated retrains (each requiring 15 minutes of human review). Their model accuracy improved 12% over six months — not from better training techniques, but from more frequent training on fresher data. The pipeline made the frequency possible. The frequency made the improvement inevitable.
