Training

    What happens after you press play: queueing, GPU provisioning, the Run panel, live metrics, and what the logs mean.

    Pressing play on an Action Module hands the recipe off to the Ertas training pipeline. From your seat, the experience is mostly watching the Run panel. This page covers what is actually happening behind that panel and how to read the signals it shows you.

    The run lifecycle

    A run moves through a fixed set of states. The Run panel shows the current state as a coloured pill and an icon.

    StateIconWhat is happening
    QueuedClockYour run is in the scheduling queue. No GPU has been picked up yet. No credits are being charged.
    Waiting for GPUSpinner (indigo)The scheduler has accepted the run and is provisioning a GPU node. Still no credits.
    TrainingSpinner (blue)The GPU is attached, the model is loaded, and gradient steps are running. Credits start accruing now.
    CompletedCheck (green)Training finished, the LoRA adapter was saved, and (if enabled) the GGUF was exported. Artifacts are downloadable.
    FailedX (red)Something broke. Logs explain what. Refundable for verified catalog models.
    RetryingRefresh (amber)A transient failure was detected and the pipeline is retrying. You do not need to do anything.
    CancelledSquare (orange)You stopped the run from the Run panel. Credits are billed only for the GPU time used.

    The most common path is Queued, Waiting for GPU, Training, Completed. The whole lifecycle for a free-tier T4 run usually takes 25 to 50 minutes door to door.

    Live metrics

    The Run panel header shows the high-signal numbers as soon as training starts:

    • Epoch X/Y: where the run is in the configured epoch budget. For step-based runs (Max Steps > 0), this still updates but the Max Steps progress bar is more informative.
    • Loss: the latest training loss value. Should trend downward; floor depends on the dataset and task.
    • Runtime estimate: the pipeline computes an honest estimate of total steps and remaining time once it has measured a handful of step timings. Refines after the first 10 to 20 steps.
    • Progress bar: 0 to 100% of the configured step budget.
    • Duration and credits: live updates every few seconds.

    Expand a run to see the configuration snapshot (training hyperparameters, LoRA settings, datasets) along with the live status.

    What you see after a run completes

    When training finishes, the expanded run view surfaces 3 inference samples: the pipeline runs a small set of prompts through the freshly trained model and shows the responses so you can sanity-check the output without leaving Studio. This is a smoke test, not a full eval, but it tells you within seconds whether the model is producing coherent text in the right format.

    If the inference samples look obviously wrong (gibberish, raw template tokens, refusals where there should not be any), do not bother downloading the GGUF yet. Look at the training config, fix what needs fixing, and rerun. See Iterating.

    If the samples look reasonable, download the GGUF and run a real evaluation locally. See Evaluating a model.

    There is no log-file download today; what's visible in the expanded view is what you get. Common failure modes are typically clear from the run's status pill and the inline error message rather than from raw logs. For triage of failed runs, see Handling failures.

    What runs in the background

    The pipeline is doing more than just model.fit(). When you press play, in order:

    Job submission

    Step 1

    Ertas packages the canvas state into a CanvasJobSubmitData payload (base model id or HF URL, dataset ids, training config, LoRA config, GPU tier, and rights attestations) and submits it to the scheduler.

    GPU provisioning

    Step 2

    The scheduler picks an idle node of the requested tier. Cold start is usually 1 to 3 minutes. If no node is free, your job stays queued until one is.

    Model and tokenizer load

    Step 3

    The pipeline downloads the base model weights (cached after first use), loads the tokenizer, and applies the model's chat template if one is detected.

    Dataset preparation

    Step 4

    Attached datasets are concatenated, shuffled, formatted per the detected template, and tokenised. Long sequences are truncated to the model's max context.

    LoRA attachment

    Step 5

    A LoRA adapter with your configured rank, alpha, dropout, and target modules is attached to the frozen base.

    Gradient steps

    Step 6

    The optimiser runs the configured number of steps (or epochs) with mixed-precision and gradient checkpointing for memory.

    Save and export

    Step 7

    The adapter weights are saved. If GGUF conversion is enabled, the pipeline merges the adapter into the base, converts to GGUF Q4_K_M, and uploads the file to your project storage.

    Teardown

    Step 8

    The GPU node is released. Billing stops. The run flips to Completed.

    Cancelling a run

    The Run panel's Cancel button stops the GPU immediately and bills only for the time used, rounded down to the last whole minute. A short confirmation dialog appears first because cancellation is not reversible.

    Cancellation is useful when:

    • You realise you misconfigured something (wrong dataset, wrong learning rate) and would rather start over than wait.
    • The loss curve has flattened well before the configured step budget and further training is wasted credits.
    • A bigger experiment is in front of this one in the queue and you want to bump it.

    Cancelled runs still appear in the Runs tab with the partial metrics preserved, but they produce no Hub entry. The LoRA adapter is not saved on cancellation, so the only thing you walk away with is data about what to try differently. A Hub entry is created the moment a run saves at least one artifact. See the note in "After the run completes" below for the partial-completion case.

    After the run completes

    A completed run produces two artifacts that live in Hub:

    • LoRA adapter: a PEFT-compatible directory bundled as a ZIP. Contains adapter_model.safetensors (the trained adapter weights, tens of MB), adapter_config.json (rank, alpha, target modules, base model identifier), chat_template.jinja (the base model's chat template), tokenizer.json + tokenizer_config.json (full tokenizer for standalone loading), and a README.md model card template. Total size is typically 25 to 50 MB depending on the base model's tokenizer. Loadable via PeftModel.from_pretrained() against the base model the adapter was trained from.
    • GGUF: the fully merged, 4-bit quantised single-file model in an installable ZIP bundle (model.gguf + Modelfile + install scripts + README). Ready for Ollama or any llama.cpp-compatible runner.

    Click the LoRA or GGUF button on the completed run to start the download. Both downloads are signed time-limited URLs; clicking again generates a fresh one. The artifacts persist in Hub until you delete them; deleting the run record from the Runs tab does not remove the Hub entry.

    A Hub entry is created as soon as a run saves at least one artifact. A run that finished training and saved its LoRA but failed at GGUF conversion still produces a Hub entry, with just the LoRA and no GGUF. Cancelled runs, runs that fail before the LoRA save, and runs blocked at submission produce no Hub entry at all.

    The first thing to do with a completed run is open the GGUF in Ollama and try a handful of prompts. If the model says the right things in the wrong format, the dataset probably has a templating bug. If it says the wrong things, you may need more data or different hyperparameters. See Iterating.

    What's next