Ertas for ML Engineers & Fine-Tuning Practitioners

    ML engineers can replace the fragmented mess of CLI tools, Jupyter notebooks, and manual file shuffling with a unified visual pipeline that still offers code-first escape hatches. Ertas connects dataset curation, experiment tracking, fine-tuning, evaluation, and GGUF export into a single coherent workflow.

    The Challenge

    The open-source fine-tuning ecosystem is powerful but deeply fragmented. Unsloth optimises training speed but has no deployment story. Axolotl provides flexible configuration but requires manual YAML wrangling and offers no experiment tracking. LLaMA-Factory gives you a web UI but locks you into its specific abstractions. Every tool solves one piece of the puzzle brilliantly and ignores the rest, leaving ML engineers to glue together workflows with shell scripts, notebook cells, and folder-naming conventions that inevitably break when a team member joins or a project resumes after two weeks. GPU memory constraints add another layer of friction — an engineer can spend an entire day figuring out the right combination of quantisation, batch size, gradient accumulation, and sequence length to make a training run fit on their available hardware.
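
    The trial-and-error described above usually ends in a configuration like the following sketch: a minimal QLoRA setup using Hugging Face transformers, peft, and bitsandbytes. The model name and every number here are illustrative assumptions rather than recommendations.

```python
# Minimal sketch of the memory-fitting juggle described above: 4-bit base weights,
# a tiny micro-batch, gradient accumulation, and gradient checkpointing.
# Model name and numbers are illustrative, not recommendations.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA: quantise the frozen base to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",             # example base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

args = TrainingArguments(
    output_dir="qlora-run",
    per_device_train_batch_size=1,           # micro-batch small enough to fit in VRAM
    gradient_accumulation_steps=16,          # effective batch size of 16
    gradient_checkpointing=True,             # trade recompute for activation memory
    bf16=True,
)
# Sequence length is the remaining lever: cap it when tokenising the dataset,
# since activation memory grows with both batch size and sequence length.
```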

    The deployment gap is where most fine-tuning projects die. An ML engineer produces a beautiful set of adapter weights in a notebook, achieves strong eval metrics, and then faces the question: now what? Converting to GGUF requires finding the right llama.cpp commit that supports the model architecture. Quantising to the right bit depth requires trial and error across Q4_K_M, Q5_K_M, and Q6_K variants. There is no standard way to track which experiment produced which adapter, which dataset version was used, or how eval metrics compare across runs. When a stakeholder asks "can you reproduce the model from three weeks ago?" the honest answer is usually "probably, if I can find the right notebook and the training data hasn't been overwritten." This lack of reproducibility and lineage tracking is not a tooling inconvenience — it is a fundamental blocker to shipping fine-tuned models in production.
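
    For context, the manual conversion path referred to above typically looks something like this sketch, which drives llama.cpp's conversion script and quantise binary from Python. Script names, flags, and supported architectures vary between llama.cpp versions, so treat every path and argument here as an assumption to check against your own checkout.

```python
# Sketch of the manual llama.cpp conversion and quantisation loop described above.
# Script names and flags differ between llama.cpp versions; adjust to your checkout.
import subprocess

MERGED_MODEL_DIR = "merged-model"            # adapter already merged into the base
F16_GGUF = "model-f16.gguf"

# 1. Convert the Hugging Face checkpoint to an f16 GGUF.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", MERGED_MODEL_DIR,
     "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# 2. Trial-and-error across quantisation variants to find the size/quality sweet spot.
for quant in ["Q4_K_M", "Q5_K_M", "Q6_K"]:
    subprocess.run(
        ["llama.cpp/llama-quantize", F16_GGUF, f"model-{quant}.gguf", quant],
        check=True,
    )
```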

    The Solution

    Ertas provides the unified pipeline that ML engineers have been assembling piecemeal from open-source tools. Studio's visual canvas lets you design training pipelines by composing modular blocks — data loading, preprocessing, LoRA/QLoRA configuration, training, evaluation, and export — while exposing the full configuration surface that experienced practitioners expect. Every parameter is editable, every block can be replaced with custom code, and the entire pipeline definition is exportable as a reproducible configuration file. This is not a dumbed-down UI bolted on top of a training library — it is a genuine workflow orchestrator that happens to have a visual interface.

    The experiment tracking and comparison capabilities close the reproducibility gap entirely. Every training run in Ertas is automatically versioned with its full lineage: which dataset version from Vault was used, which base model from Hub, what hyperparameters were set, and what eval metrics were achieved. Side-by-side comparison views let engineers evaluate multiple QLoRA experiments across loss curves, benchmark scores, and generation quality on a single screen. When the best experiment is identified, one-click GGUF export handles the conversion and quantisation pipeline — including architecture-aware conversion that selects the right llama.cpp codepath automatically. The exported GGUF can be deployed to Ollama, llama.cpp, vLLM, or any other inference runtime without manual conversion steps. The entire journey from raw dataset to deployed production model lives in one platform with a full audit trail.
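
    Outside such a platform, that comparison view is usually hand-rolled. The sketch below shows the idea, assuming each run directory contains a metrics.json with the fields shown; both the directory layout and the field names are hypothetical.

```python
# Sketch of the side-by-side comparison described above. Assumes a layout of
# runs/<run-name>/metrics.json; the layout and field names are hypothetical,
# chosen only to illustrate the idea.
import json
import pathlib
import pandas as pd

rows = []
for metrics_file in sorted(pathlib.Path("runs").glob("*/metrics.json")):
    record = json.loads(metrics_file.read_text())
    rows.append({
        "run": metrics_file.parent.name,
        "dataset_version": record.get("dataset_version"),
        "base_model": record.get("base_model"),
        "final_loss": record.get("final_loss"),
        "eval_score": record.get("eval_score"),
    })

comparison = pd.DataFrame(rows).sort_values("eval_score", ascending=False)
print(comparison.to_string(index=False))
```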

    Key Features

    Studio

    Visual Canvas with Code-First Escape Hatches

    Studio's canvas interface lets you compose training pipelines visually while retaining full control. Every block exposes its underlying configuration, and custom Python blocks can be injected at any point in the pipeline. Design your workflow graphically, then export the entire thing as a reproducible config file for CI/CD integration or headless execution.
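
    As an illustration of the kind of logic a custom block might hold, here is a small, self-contained preprocessing function. The function shape is hypothetical and is not Ertas's actual block interface; it only shows what injecting code mid-pipeline could look like.

```python
# Hypothetical custom preprocessing block; the actual Ertas block interface may
# differ. This only illustrates the idea of injecting Python mid-pipeline.
from typing import Iterable

def dedupe_and_truncate(examples: Iterable[dict], max_chars: int = 8_000) -> list[dict]:
    """Drop duplicate prompts and clip overly long completions before training."""
    seen: set[str] = set()
    cleaned = []
    for ex in examples:
        prompt = ex["prompt"].strip()
        if prompt in seen:
            continue
        seen.add(prompt)
        cleaned.append({"prompt": prompt, "completion": ex["completion"][:max_chars]})
    return cleaned
```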

    Hub

    Model Comparison & Benchmarks

    Hub is more than a model registry — it is a decision-making tool. Compare base models across standardised benchmarks, filter by architecture and licence, and inspect community evaluations before committing to a fine-tuning run. When evaluating your own fine-tuned models, run them against the same benchmarks to quantify exactly how much your adapter improved over the base.
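
    One common way to quantify that lift is EleutherAI's lm-evaluation-harness. The sketch below uses its 0.4-style Python API; the model paths, the task, and the metric key are assumptions to adapt to your installed version and your own benchmark suite.

```python
# Sketch: measuring how much a fine-tuned model improves over its base on the
# same benchmark, using lm-evaluation-harness (0.4-style API; task names and
# metric keys vary, and both model paths below are illustrative).
import lm_eval

def benchmark_score(model_path: str) -> float:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_path}",
        tasks=["hellaswag"],                 # stand-in for an internal benchmark
        num_fewshot=0,
    )
    return results["results"]["hellaswag"]["acc,none"]

base = benchmark_score("Qwen/Qwen2.5-14B-Instruct")
tuned = benchmark_score("./my-finetuned-model")
print(f"base={base:.3f}  fine-tuned={tuned:.3f}  lift={tuned - base:+.3f}")
```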

    Cloud

    Managed Training GPUs

    Cloud removes the GPU procurement bottleneck. Launch fine-tuning runs on managed A100 or H100 instances without dealing with cloud provider quotas, CUDA driver mismatches, or spot instance interruptions. Pay per training hour, with automatic checkpointing so you never lose progress — then deploy the finished model wherever you want.
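
    The safety net behind that claim is ordinary checkpoint-and-resume. A minimal sketch with the Hugging Face Trainer is below; model and train_dataset are assumed to be prepared elsewhere, and all values are illustrative.

```python
# Sketch of the checkpoint-and-resume pattern that makes pre-empted training runs
# recoverable. `model` and `train_dataset` are assumed to be prepared elsewhere
# (e.g. as in the QLoRA sketch above); the numbers are illustrative.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",
    save_strategy="steps",
    save_steps=500,              # persist a checkpoint every 500 optimiser steps
    save_total_limit=3,          # keep only the three most recent checkpoints
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()                  # first launch

# After an interruption, relaunching with the flag below resumes from the most
# recent checkpoint in output_dir instead of starting over:
# trainer.train(resume_from_checkpoint=True)
```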

    Vault

    Dataset Versioning & Experiment Tracking

    Vault versions every dataset, adapter, and training artifact with full lineage metadata. Every experiment is linked to the exact dataset version, base model, and hyperparameter set that produced it. Compare experiments side by side across loss curves, evaluation metrics, and sample outputs. When you need to reproduce a result from three months ago, the entire provenance chain is one click away.
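
    To make the lineage idea concrete, the sketch below content-addresses a dataset file and writes a lineage record alongside a run. The field names and directory layout are hypothetical, not Vault's actual schema.

```python
# Sketch: pin a run to the exact dataset bytes and inputs that produced it.
# Field names and the runs/<run>/lineage.json layout are hypothetical.
import hashlib
import json
import pathlib

def dataset_version(path: str) -> str:
    """Content-address the dataset so 'which data produced this run?' has one answer."""
    digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
    return f"sha256:{digest[:12]}"

run_dir = pathlib.Path("runs/run-016")
run_dir.mkdir(parents=True, exist_ok=True)

lineage = {
    "dataset_version": dataset_version("train.jsonl"),      # illustrative file name
    "base_model": "Qwen/Qwen2.5-14B-Instruct",              # illustrative base model
    "hyperparameters": {"lora_rank": 16, "lr_schedule": "cosine", "seq_len": 4096},
    "eval": {"internal_benchmark": 0.913},
}
(run_dir / "lineage.json").write_text(json.dumps(lineage, indent=2))
```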

    Example Workflow

    An ML engineer at a mid-stage startup is tasked with distilling GPT-4o's reasoning capabilities into a compact model for on-device deployment. They start by curating a 50,000-example dataset of GPT-4o outputs across the company's core use cases — customer query classification, product recommendation, and summarisation — uploading the versioned dataset to Vault. In Hub, they evaluate three candidate base models: Qwen 2.5 14B, Mistral Nemo 12B, and LLaMA 3.1 8B, comparing them on the company's internal benchmark suite. Qwen 2.5 14B shows the strongest baseline performance, so they proceed with it.

    In Studio, the engineer configures five QLoRA experiments with varying rank (8, 16, 32), learning rate schedules, and sequence lengths, launching all five in parallel on Cloud. After training completes, the side-by-side comparison view reveals that rank-16 with cosine annealing and 4096 sequence length produces the best trade-off between eval score (91.3% on the internal benchmark) and adapter size (48MB). The engineer drills into the generation quality tab, spot-checking outputs across all three task categories and confirming that the distilled model matches GPT-4o output quality on 94% of test cases.

    One click exports the winning experiment as a Q5_K_M GGUF file, with Ertas automatically selecting the correct llama.cpp conversion path for the Qwen architecture. The exported model is deployed to a vLLM instance behind the company's API gateway, serving 2,000 requests per minute at 180ms p95 latency. The full experiment history — all five runs, their datasets, configs, and metrics — is preserved in Vault for future reference and audit.
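
    The five parallel experiments in this walkthrough amount to a small hyperparameter grid. A sketch of how it might be expressed is below; only the rank, schedule, and sequence-length axes come from the example, and the exact five combinations (beyond the rank-16, cosine, 4096 winner named above) are assumptions.

```python
# Sketch: the hyperparameter grid behind the five parallel QLoRA runs in the
# walkthrough. Only the rank/schedule/sequence-length axes come from the example;
# the exact five combinations are illustrative.
from peft import LoraConfig

experiments = [
    {"rank": 8,  "schedule": "cosine", "seq_len": 2048},
    {"rank": 8,  "schedule": "linear", "seq_len": 4096},
    {"rank": 16, "schedule": "cosine", "seq_len": 4096},   # the eventual winner above
    {"rank": 32, "schedule": "cosine", "seq_len": 2048},
    {"rank": 32, "schedule": "linear", "seq_len": 4096},
]

for i, exp in enumerate(experiments, start=1):
    lora = LoraConfig(r=exp["rank"], lora_alpha=2 * exp["rank"], task_type="CAUSAL_LM")
    # lr_scheduler_type and the tokenisation cap (max sequence length) would be
    # passed to each run's training arguments and data pipeline.
    print(f"run-{i:02d}: r={lora.r}, schedule={exp['schedule']}, seq_len={exp['seq_len']}")
```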
