Parallel runs

    Train many recipes side by side, see how credit holds bound concurrency, and compare results in the Run panel.

    Every Action Module on the canvas is independent. You can have ten Fine-Tune modules, each with its own dataset and config, and run them in parallel as long as your available credits cover each run's estimate. This is the fastest way to bracket hyperparameter choices or to test alternate base models.

    Why run in parallel

    Two patterns drive almost all parallel-run usage:

    • Hyperparameter sweeps: same base, same dataset, vary one knob (learning rate, LoRA rank, epochs). Pick the winner from a head-to-head smoke test.
    • Base-model bake-off: same dataset, same config, vary the base model. Quickest way to learn whether Llama 8B is enough or whether a different base, such as Mistral 7B or the smaller Phi-3, suits the task better.

    Both of these are painful to do serially. Total GPU time is roughly the same whether you wait for each run to finish or run them concurrently, but serial runs add their wall-clock times together. Parallelism collapses the wall-clock time to the runtime of the slowest job.
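
    A quick arithmetic sketch of that trade-off. The runtimes below are made-up numbers for illustration, not measurements:

```python
# Hypothetical runtimes in minutes for three sweep variants.
runtimes_min = {"LR 1e-4": 42, "LR 2e-4": 38, "LR 5e-5": 47}

serial_wall_clock = sum(runtimes_min.values())    # each run waits for the previous one
parallel_wall_clock = max(runtimes_min.values())  # bounded by the slowest job
gpu_minutes = sum(runtimes_min.values())          # GPU time is the same either way

print(f"serial: {serial_wall_clock} min, parallel: {parallel_wall_clock} min, "
      f"GPU time in both cases: {gpu_minutes} min")
```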

    The duplicate-and-tweak pattern

    The canonical workflow:

    Step 1: Build the baseline recipe

    Drop a Fine-Tune module, attach base model, dataset, training config, and LoRA config. Name it clearly (for example, "Support v1 baseline").

    Step 2: Duplicate as many times as you want variants

    Hover the module and click the purple Duplicate icon for each variant. Studio lays out the duplicates to the right of the original.

    Step 3: Edit one knob per duplicate

    Open each duplicate's Training Config (or LoRA Config, or dataset, etc.) and change exactly one thing. Name each module to reflect the change ("LR 1e-4", "rank 32", "+5k synthetic").

    Step 4: Press play on all of them

    Click the green play button on each module. They queue together. The Run panel shows them as Active in parallel.

    The duplicate operation deep-copies the child nodes, so changing the Training Config on the duplicate does not affect the original.
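
    If the deep-copy distinction is unfamiliar, plain Python dicts show the idea. This is only an analogy: the structure and field names below are hypothetical stand-ins, not Studio's internal representation.

```python
import copy

# Hypothetical stand-in for a Fine-Tune module and its child nodes.
baseline = {
    "name": "Support v1 baseline",
    "training_config": {"learning_rate": 2e-4, "epochs": 3},
    "lora_config": {"rank": 16, "dropout": 0.0},
}

# A deep copy clones the nested child configs too, so edits stay local.
variant = copy.deepcopy(baseline)
variant["name"] = "LR 1e-4"
variant["training_config"]["learning_rate"] = 1e-4

# A shallow copy, by contrast, shares the nested dicts with the original.
shallow = copy.copy(baseline)
shares_children = shallow["training_config"] is baseline["training_config"]

print(baseline["training_config"]["learning_rate"])  # baseline is unchanged
print(shares_children)
```

    The duplicate behaves like `variant` here, not like `shallow`.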

    How concurrency works

    Ertas does not impose a hard concurrency limit. There is no "max 3 parallel jobs" rule baked into any plan. What does limit parallelism in practice is credit availability.

    When you press play, Ertas puts the estimated credit cost of that run on hold for the duration of the job. Held credits are unavailable to other jobs. Starting a new run draws from the pool of credits that are not currently on hold. If your available (non-held) balance is below the new run's estimate, the run is blocked at submission time and you see an "insufficient credits" message in the Training Confirm dialog.

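    The practical consequence: if you have 10 credits and three sweep variants each estimated at 4 credits, you can start two of them in parallel (8 credits on hold), but the third is refused at submission. It can start only once at least 4 credits are free again. A run that completes normally spends most of its hold and returns only the unspent remainder, so in this example the third run needs a top-up, or an earlier run that is cancelled or fails and releases its hold.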
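
    The admission rule can be sketched in a few lines. This is a simplified model of the behaviour described above, not Ertas code; the function name and the whole-credit amounts are illustrative assumptions.

```python
def admit_runs(available_credits, estimates):
    """Return (started, refused, remaining) for runs submitted in order.

    Assumed model: a run starts only if its estimate fits in the credits
    that are not already on hold; otherwise it is refused outright.
    """
    started, refused = [], []
    for name, estimate in estimates:
        if estimate <= available_credits:
            available_credits -= estimate  # place the hold
            started.append(name)
        else:
            refused.append(name)           # "insufficient credits"
    return started, refused, available_credits


started, refused, remaining = admit_runs(
    10, [("LR 1e-4", 4), ("LR 2e-4", 4), ("LR 5e-5", 4)]
)
print(started, refused, remaining)
```

    With 10 credits and three 4-credit estimates, the first two runs start, the third is refused, and 2 credits remain available.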

    A few details that flow from this:

    • Credit holds release on completion or cancellation. A finished run settles its actual cost against the hold and returns any unspent credits to your balance. A cancelled run does the same and is billed only for the whole minutes it ran.
    • The on-hold portion can be larger than the actual cost. Estimates are honest but slightly conservative, so most runs return a small amount to the pool on completion.
    • Failed runs on verified catalog models release the hold and refund the credits automatically.
    • No FIFO queueing. If your balance is currently locked by other runs, the new run is refused right away. Free up credits by waiting for a job to finish and return its unspent remainder, or by cancelling something less important.
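
    The settlement rules in the list above can be sketched as follows. Treat it as a reading aid under stated assumptions, not the billing implementation: the flat per-minute rate, the rounding down of minutes on cancellation, and the cap at the held amount are all assumptions made for this example.

```python
import math


def settle_hold(held, outcome, actual_cost=0.0, minutes_run=0.0, rate_per_min=0.0):
    """Return (charged, returned) credits for one run's hold.

    Assumptions, not confirmed billing rules: a cancelled run is charged
    floor(minutes_run) at a flat hypothetical rate, and a charge never
    exceeds the held amount.
    """
    if outcome == "completed":
        charged = actual_cost
    elif outcome == "cancelled":
        charged = math.floor(minutes_run) * rate_per_min
    elif outcome == "failed":
        charged = 0.0  # hold released and refunded (verified catalog models)
    else:
        raise ValueError(f"unknown outcome: {outcome}")
    charged = min(charged, held)
    return charged, held - charged


print(settle_hold(4.0, "completed", actual_cost=3.5))
print(settle_hold(4.0, "cancelled", minutes_run=12.7, rate_per_min=0.25))
print(settle_hold(4.0, "failed"))
```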

    This model is simpler than per-plan concurrency tiers and gives you direct control: the way to run more jobs in parallel is to have more credits available, not to upgrade for a higher concurrency number.

    Watching parallel runs

    The Run panel groups runs by status:

    • Active at the top: all runs currently queued, provisioning, or training.
    • History below: completed, failed, or cancelled runs.

    You can expand multiple active runs at once and watch their loss curves diverge in real time. Ertas updates each run's metrics every few seconds independently.

    If you have many parallel runs and the panel feels crowded, switch to the dedicated Runs root tab. The Runs tab shows every run in your account, sorted descending by start time, with a status filter (Training, Waiting for GPU, Queued, Completed, Failed, Cancelled, or All).

    A practical sweep example

    Suppose you have a baseline that overfits. You want to test three mitigations in parallel:

    Variant                  LoRA dropout   LoRA rank   Notes
    Baseline (control)       0              16          Original recipe
    +dropout 0.1             0.1            16          Standard regularisation
    -rank 8                  0              8           Less capacity, harder to overfit
    +dropout 0.05, rank 32   0.05           32          Combined approach

    Build the baseline, duplicate three times, change one (or two) LoRA settings in each. Press play on all four. Within an hour you have four GGUFs, run a smoke test on each, and pick the winner.

    This is the kind of experiment that would take a day if you serialised it and a week if you did it on your own GPU.

    Cancelling several runs

    If you realise mid-sweep that you set up something wrong (wrong dataset, wrong base), open the Run panel and cancel each active run individually using its Cancel button. Each cancellation releases its credit hold and bills only for the whole minutes used before stopping. Cancelled runs stay in history.

    Cost-aware parallelism

    Running four jobs at once is roughly four times the cost of running one. The Training Confirm dialog estimates each run independently and holds the credits against your balance, so a sweep that adds up to more than your available credits is blocked at submission time rather than discovered mid-run.

    Two practical guards:

    • Calibrate before sweeping. Run one small variant to confirm the dataset, hyperparameters, and base model behave. Then duplicate and vary.
    • Cancel obvious losers early. After 20% of the configured steps, the loss curves usually tell you which variants will not catch up. Cancel them and the held credits are released back to the pool for the survivors.

    A common pattern is to start a 4-way sweep, watch the first ~30 steps, and cancel two underperformers when their loss trajectories make the outcome obvious.
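
    One way to make that call less ad hoc is to rank the variants by their latest logged loss. A minimal sketch, assuming you have read the loss values off each run's curve by hand; the numbers and the keep-the-best-k rule are illustrative, not an Ertas feature:

```python
def pick_losers(loss_by_run, keep=2):
    """Return the run names outside the `keep` lowest latest-loss values."""
    ranked = sorted(loss_by_run, key=lambda name: loss_by_run[name][-1])
    return ranked[keep:]


# Hypothetical loss readings taken early in a 4-way sweep.
losses = {
    "Baseline (control)": [2.10, 1.62, 1.35],
    "+dropout 0.1": [2.12, 1.70, 1.41],
    "-rank 8": [2.15, 1.95, 1.88],
    "+dropout 0.05, rank 32": [2.09, 1.90, 1.79],
}

to_cancel = pick_losers(losses, keep=2)
print(to_cancel)
```

    A single latest value is a crude signal. Look at the shape of the curve before cancelling a run that is merely slow to start.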

    Comparing parallel results

    Once the runs finish, the Run panel lists all four together. Expand any run to see the configuration snapshot, final loss, credits used, and runtime. A built-in side-by-side diff view is on the roadmap (see below); until it ships, the simplest comparison is to expand each run's config in turn and scan the values.
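
    If scanning by eye gets tedious, a small helper can diff two configs. This sketch assumes you have copied each run's values into a Python dict by hand; the field names are hypothetical and nothing here calls an Ertas API.

```python
def diff_configs(a, b, prefix=""):
    """Yield (dotted_key, value_in_a, value_in_b) for every differing leaf."""
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        left, right = a.get(key), b.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            # Recurse into nested sections such as the LoRA config.
            yield from diff_configs(left, right, prefix=f"{path}.")
        elif left != right:
            yield path, left, right


# Hypothetical, hand-copied config snapshots for two runs.
baseline = {"base_model": "llama-8b", "lora_config": {"rank": 16, "dropout": 0.0}}
variant = {"base_model": "llama-8b", "lora_config": {"rank": 16, "dropout": 0.1}}

changes = list(diff_configs(baseline, variant))
for key, old, new in changes:
    print(f"{key}: {old} -> {new}")
```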

    The numbers tell you which run trained best. The smoke test (open all four GGUFs in Ollama, throw the same five prompts at each) tells you which one is actually better in practice. Often these agree. Sometimes they do not, and the smoke test wins; the goal is a model people use, not a number on a chart.
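
    The smoke test is easy to script against a local Ollama server. A hedged sketch: it assumes Ollama is running on its default port and that each GGUF has already been imported under the model name you pass in. The model names and prompts in the usage comment are placeholders.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def ollama_generate(model, prompt):
    """Send one non-streaming prompt to a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["response"]


def smoke_test(models, prompts, generate=ollama_generate):
    """Return {prompt: {model: answer}} so answers line up per prompt."""
    return {p: {m: generate(m, p) for m in models} for p in prompts}


# Usage (placeholder model names; requires a running Ollama server):
# results = smoke_test(["support-baseline", "support-dropout"], ["How do I reset my password?"])
```

    Grouping the answers by prompt rather than by model makes it easier to read the four responses to the same question next to each other.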

    Coming soon: run diff view. A dedicated comparison panel that surfaces the training-config diff, LoRA-config diff, dataset differences, and final-metric deltas between any two runs you select. Until it lands, the manual approach above works.

    What's next