From Notebook to Production: Closing the Fine-Tuning Deployment Gap

    Most fine-tuned models never reach production. Here's why the gap between training in a notebook and deploying in production exists — and how to close it systematically.

Ertas Team

    You fine-tuned a model. It performs well on your evaluation set. Your Jupyter notebook shows impressive metrics. And then nothing happens.

    The model sits in a checkpoint directory on a VM you will eventually forget to shut down. It never serves a single real user. You are not alone — industry estimates suggest that the majority of fine-tuned models never reach production. Not because they do not work, but because the path from "working in a notebook" to "running in production" is paved with operational problems that have nothing to do with machine learning.

    This is the deployment gap, and it is the biggest bottleneck in applied AI engineering today.

    Why the Gap Exists

The deployment gap is not a single problem. It is five distinct problems that compound one another, each sufficient to stall a project on its own.

    No Standard Export Path

    You trained your model with Hugging Face Transformers, or Unsloth, or Axolotl. Your checkpoint is a collection of adapter weights, configuration files, and tokeniser assets scattered across a directory. To deploy it, you need to merge the adapter into the base model, convert to an inference-optimised format, quantise for your target hardware, and validate that the conversion did not degrade quality.

    Each of these steps has its own tooling, its own gotchas, and its own failure modes. The merge might silently produce incorrect weights if the configuration is wrong. The quantisation might cause quality degradation that you do not catch until users complain. The conversion script might not support your specific model architecture.

    There is no model.export("production") command. There should be, but there is not.
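To make the first of those steps concrete, here is a minimal sketch of the adapter merge using Hugging Face Transformers and PEFT. The model name and paths are placeholders, and merge_and_unload assumes a LoRA-style adapter:

```python
# Minimal sketch of the adapter-merge step, using Hugging Face
# Transformers + PEFT. Model name and paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B"    # hypothetical base model
ADAPTER = "./checkpoints/run-042"   # hypothetical adapter directory

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
model = PeftModel.from_pretrained(base, ADAPTER)

# merge_and_unload folds the LoRA weights into the base weights,
# returning a plain Transformers model with no adapter indirection.
merged = model.merge_and_unload()

merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained(BASE).save_pretrained("./merged-model")
```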

    Manual GGUF Conversion

    GGUF has become the standard format for local inference, but converting to GGUF is still a manual process. You need llama.cpp's conversion scripts, the correct Python dependencies, knowledge of which quantisation level to choose (Q4_K_M? Q5_K_M? Q8_0?), and the patience to debug when the conversion fails with an unhelpful error message.

    Worse, the quantisation choice significantly impacts both quality and speed, and the optimal choice depends on your deployment hardware — information that the person doing the training may not have.
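For illustration, here is what driving that conversion from Python can look like. The script and binary names follow recent llama.cpp conventions but have moved around between releases, so treat them as assumptions to check against your checkout:

```python
# Sketch of a GGUF conversion step driven from Python. Script and binary
# names are assumptions based on recent llama.cpp releases; paths are
# placeholders.
import subprocess

MERGED = "./merged-model"   # merged checkpoint from the previous step
F16_GGUF = "./model-f16.gguf"
QUANT_GGUF = "./model-q4_k_m.gguf"

# 1. Convert the Hugging Face checkpoint to an unquantised GGUF file.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", MERGED,
     "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# 2. Quantise. Q4_K_M is a common default; the right level depends on
#    the deployment hardware, which is exactly the problem above.
subprocess.run(
    ["llama.cpp/llama-quantize", F16_GGUF, QUANT_GGUF, "Q4_K_M"],
    check=True,
)
```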

    No Experiment Tracking

    Most fine-tuning happens in one-off notebooks. Hyperparameters are hardcoded. Results are eyeballed. The relationship between training configuration and output quality is not recorded systematically.

    Two weeks later, you want to retrain with updated data. Which learning rate did you use? What was the training/validation split? Which base model checkpoint was it? If you do not have answers to these questions, you are starting from scratch.

    This is not just an inconvenience — it makes iterative improvement impossible. You cannot systematically improve what you cannot systematically measure.

    No Version Control for Models

    Code has Git. Models have nothing equivalent. When you fine-tune a new version, how do you compare it to the previous version? How do you roll back if the new version performs worse in production? How do you manage the lifecycle of models that are deployed across multiple environments?

    Model registries exist (MLflow, Weights & Biases, Hugging Face Hub), but integrating them into a fine-tuning workflow requires setup that most teams skip. The result is a collection of checkpoint directories with names like model_v2_final_FINAL_v3 that no one can navigate.
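For a sense of what that skipped integration involves, here is a hedged sketch using MLflow's tracking and registry APIs. The parameters, metric, and model name are placeholders:

```python
# Sketch of the registry integration most teams skip: logging a run and
# registering the resulting model in MLflow. Values are placeholders.
import mlflow

with mlflow.start_run() as run:
    mlflow.log_params({"base_model": "llama-3.1-8b", "lr": 2e-4})
    mlflow.log_metric("eval_accuracy", 0.91)
    mlflow.log_artifacts("./merged-model", artifact_path="model")

# Versions accumulate under one name, so "compare to the previous
# version" becomes a registry query instead of directory archaeology.
mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="support-assistant",
)
```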

    No Production Monitoring

    A model that works on your evaluation set might fail on real-world data. Distribution shift, edge cases, adversarial inputs — production reveals problems that offline evaluation cannot catch. But setting up inference monitoring, quality tracking, and alerting is a separate engineering project that most teams do not resource until something goes wrong.

    The Five Steps from Notebook to Production

    Closing the deployment gap requires addressing each problem systematically. Here is the path.

    Step 1: Experiment Tracking

    Before you start training, set up tracking. Record every hyperparameter, every dataset version, every evaluation metric. Not because you need it now, but because you will need it in two weeks when you retrain.

    The minimum viable tracking setup records: base model, dataset hash, training configuration, and evaluation metrics on a held-out test set. This can be as simple as a structured JSON file per run, or as robust as a full MLflow deployment.
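A sketch of the JSON-file end of that spectrum, with field names that are one reasonable choice rather than a standard:

```python
# Minimal "structured JSON file per run" tracker, as described above.
# Field names are one reasonable choice, not a standard.
import hashlib
import json
import time
from pathlib import Path

def log_run(base_model: str, dataset_path: str, config: dict, metrics: dict) -> Path:
    """Write one self-contained record per training run."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime())
    record = {
        "timestamp": stamp,
        "base_model": base_model,
        # Hash the exact bytes so the run is tied to one dataset version.
        "dataset_hash": hashlib.sha256(Path(dataset_path).read_bytes()).hexdigest(),
        "config": config,    # every hyperparameter, nothing left hardcoded
        "metrics": metrics,  # held-out test set results
    }
    out = Path("runs") / f"{stamp}.json"
    out.parent.mkdir(exist_ok=True)
    out.write_text(json.dumps(record, indent=2))
    return out
```

Two weeks later, "which learning rate did we use?" becomes a grep over runs/ instead of a reconstruction exercise.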

    Step 2: Model Evaluation

    Evaluation is not a single number. It is a suite of tests that cover your production requirements. Accuracy on the test set is necessary but not sufficient.

    Build an evaluation pipeline that tests: core task performance, edge case handling, latency under expected load, output format consistency, and behaviour on out-of-distribution inputs. Run this pipeline automatically after every training run, and compare results across runs.
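A minimal sketch of such a pipeline is below. The generate callable stands in for however you invoke the model, and the test cases are placeholders:

```python
# Sketch of an automated evaluation pass. `generate` stands in for your
# model call; each case supplies a prompt and a task-specific check.
import json
import time

def _is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def evaluate(generate, cases: list[dict]) -> dict:
    """Run every case and return metrics comparable across training runs."""
    results, latencies = [], []
    for case in cases:
        start = time.perf_counter()
        output = generate(case["prompt"])
        latencies.append(time.perf_counter() - start)
        results.append({
            "case": case["name"],
            "passed": bool(case["check"](output)),  # core task performance
            # Output format consistency, for cases that expect JSON.
            "valid_json": _is_json(output) if case.get("expects_json") else None,
        })
    return {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "p95_latency_s": sorted(latencies)[int(0.95 * (len(latencies) - 1))],
        "results": results,
    }
```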

    Step 3: Format Conversion

Standardise your export pipeline. Going from training checkpoint to production-ready artefact should be a single command with well-defined inputs and outputs.

    For most deployments in 2026, this means: merge LoRA adapter into base model, convert to GGUF format, quantise to your target level, and validate the quantised model against your evaluation suite. Each step should be automated and tested.
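The last of those steps, validating the quantised artefact, is the one teams most often skip. Here is a sketch using the llama-cpp-python bindings and reusing the evaluate suite sketched in Step 2; the two-point quality budget is an assumed value, not a recommendation:

```python
# Sketch of the validation step at the end of the export pipeline: rerun
# the Step 2 suite against the quantised artefact and fail the export on
# regression. Uses the llama-cpp-python bindings and the `evaluate`
# function sketched above; the 2-point quality budget is an assumption.
from llama_cpp import Llama

def validate_quantised(gguf_path: str, cases: list[dict],
                       baseline_pass_rate: float) -> dict:
    llm = Llama(model_path=gguf_path)

    def generate(prompt: str) -> str:
        return llm(prompt, max_tokens=256)["choices"][0]["text"]

    report = evaluate(generate, cases)  # same suite the unquantised model ran
    if report["pass_rate"] < baseline_pass_rate - 0.02:
        raise RuntimeError(
            f"quantisation regressed pass rate to {report['pass_rate']:.2%}"
        )
    return report
```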

    Step 4: Inference Optimisation

    A production model is not just accurate — it is fast, memory-efficient, and reliable. This step covers: choosing the right quantisation level for your hardware, configuring the inference server (Ollama, vLLM, llama.cpp), setting up batching and caching, and load testing under realistic conditions.

    The goal is a deployment configuration that meets your latency, throughput, and memory requirements. Document this configuration alongside the model so that redeployment is deterministic.
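One lightweight way to document that configuration is to make it a serialisable record that ships next to the model. The fields below are one reasonable choice, not a standard schema:

```python
# Sketch of "document the configuration alongside the model": a small,
# serialisable record of the serving setup. Fields and values are
# illustrative, not a standard schema.
import dataclasses
import json

@dataclasses.dataclass
class DeploymentConfig:
    model_file: str           # e.g. "model-q4_k_m.gguf"
    server: str               # "ollama" | "vllm" | "llama.cpp"
    context_length: int
    gpu_layers: int           # layers offloaded to GPU, hardware-specific
    max_batch_size: int
    target_p95_latency_ms: int

config = DeploymentConfig(
    model_file="model-q4_k_m.gguf",
    server="llama.cpp",
    context_length=8192,
    gpu_layers=32,
    max_batch_size=8,
    target_p95_latency_ms=800,
)

# Ship this next to the model so redeployment is deterministic.
with open("deploy.json", "w") as f:
    json.dump(dataclasses.asdict(config), f, indent=2)
```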

    Step 5: Production Monitoring

    Once the model is serving real traffic, you need visibility into its behaviour. At minimum, track: request volume and latency, output quality metrics (even if sampled), error rates, and resource utilisation.

    Set up alerts for anomalies. A sudden spike in latency might indicate hardware issues. A drop in output quality might indicate distribution shift. Catching these early is the difference between a minor adjustment and a user-facing incident.
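As a starting point, here is a minimal rolling-window monitor covering the latency and error-rate checks above. The window size and thresholds are illustrative:

```python
# Minimal sketch of the Step 5 alerts: rolling latency and error-rate
# checks over recent requests. Thresholds are illustrative assumptions.
from collections import deque

class InferenceMonitor:
    def __init__(self, window: int = 500, p95_budget_s: float = 1.0,
                 error_budget: float = 0.02):
        self.latencies = deque(maxlen=window)  # rolling window of recent requests
        self.errors = deque(maxlen=window)
        self.p95_budget_s = p95_budget_s
        self.error_budget = error_budget

    def record(self, latency_s: float, ok: bool) -> list[str]:
        """Record one request; return any alerts it triggers."""
        self.latencies.append(latency_s)
        self.errors.append(0 if ok else 1)
        alerts = []
        ordered = sorted(self.latencies)
        p95 = ordered[int(0.95 * (len(ordered) - 1))]
        if p95 > self.p95_budget_s:
            alerts.append(f"p95 latency {p95:.2f}s over budget")  # possible hardware issue
        error_rate = sum(self.errors) / len(self.errors)
        if error_rate > self.error_budget:
            alerts.append(f"error rate {error_rate:.1%} over budget")
        return alerts  # route to your alerting channel of choice
```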

    How Each Step Fails Without Tooling

    The reason most teams do not follow this path is not ignorance — it is effort. Each step requires tooling, and building that tooling is a project in itself.

    Experiment tracking requires infrastructure. Evaluation requires a framework. Format conversion requires scripts and validation. Inference optimisation requires hardware-specific tuning. Monitoring requires observability infrastructure.

    For a team that just wants to fine-tune a model and deploy it, the overhead of building all this tooling is often greater than the effort of the fine-tuning itself. The result is predictable: teams skip steps, take shortcuts, and end up with models that work in notebooks but never reach production.

    How Ertas Closes Each Gap

    Ertas is designed around this specific problem. The platform treats the path from training to production as a first-class workflow, not an afterthought.

    Studio handles experiment tracking and evaluation. Every training run is recorded with full configuration, and evaluation runs automatically against your defined test suite. Comparing runs is a single click.

    GGUF export is built into the platform. After training, export an optimised, quantised model in the format your deployment target needs. The export pipeline validates quality automatically, so you catch quantisation degradation before deployment.

    Cloud provides production monitoring for deployed models. Request logging, quality sampling, latency tracking, and alerting are available out of the box.

    The goal is to make the path from notebook to production as straightforward as the training itself. No separate tooling to set up, no scripts to maintain, no gaps to fall into.

    Ready to close the deployment gap? Join the Ertas waitlist and ship models that reach production.
