Handling failures

Why training runs fail, how to diagnose them, and when credits are refunded automatically.

A failed run is annoying, but rarely expensive: failures on verified catalog models are refunded automatically, and most failures fall into a small set of categories with known fixes. This page walks through the categories, the diagnostic flow, and the refund rules.

When credits are refunded

Two rules cover almost every case:

Verified catalog model fails: credits are refunded automatically. The refund posts within a few minutes and shows up in the Billing tab.
Unverified Hugging Face model fails: no refund. When you added the model, you checked a consent box that explicitly waived refunds for failed runs. The waiver applies even if you did not read the warning.

Cancellations are not refundable. If you stop a run, you pay for the GPU minutes used before stopping. See Credits and usage.
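The rules above reduce to a small decision function. The sketch below is illustrative only: `Outcome`, `Run`, and `is_refunded` are hypothetical names, not part of any Ertas API, and the logic covers only the cases this page describes.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


@dataclass
class Run:
    outcome: Outcome
    verified_catalog_model: bool  # False for an unverified Hugging Face model


def is_refunded(run: Run) -> bool:
    """Return True when the rules on this page say credits are refunded."""
    if run.outcome is not Outcome.FAILED:
        # Only failures are refundable; a cancelled run pays for GPU minutes used.
        return False
    # A failed run is refunded only when the base model is a verified catalog model.
    return run.verified_catalog_model
```

For example, `is_refunded(Run(Outcome.CANCELLED, True))` is `False`, because cancellations are never refunded regardless of the model.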

The failure triage flow

When a run flips to Failed in the Run panel or the Runs tab, expand it. The status pill and the inline error message describe what went wrong; the vast majority of failures fall into one of the categories below.

Out of memory (OOM)

Rare in practice. The build canvas inside Model Studio runs a pre-flight VRAM check on the GPU tier you selected: combinations that cannot fit (e.g. a 5B+ model on T4) are blocked at the picker before you can press play. If the gate accepts the configuration, the run should not OOM.

When OOM does happen anyway:

Credits are refunded automatically for verified catalog models (the run failed before producing artifacts, so the verified-catalog refund rule covers it).
Retry the run. A transient blip on the GPU node can produce a one-off OOM that does not repeat.
If the same configuration OOMs across two or three follow-up attempts, treat it as a temporary technical issue on our end. We typically resolve these within a few hours, so wait and retry.

What the error looks like in the log:

torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate ...

The exception is an unverified Hugging Face model that the validator flagged amber. The pre-flight VRAM check did not vouch for the configuration, so OOM can still occur at training time, and under the refund rules above the credits are not returned. The practical fix is to pick a verified catalog model of similar size and family, or move up to A10G (requires a paid plan).

Dataset validation failed

What it looks like in the log:

Dataset row N missing required field 'output'

No valid training rows found; check that columns match instruction/output, messages, or prompt/completion format.

What it means: at least one row in your dataset does not have the columns the trainer expected, or the format detection picked a shape your data does not actually use.

Fixes:

Open the dataset in the Data Craft tab and inspect the schema. Confirm every row has the expected fields.
See JSONL format for the exact expected shapes.
If your dataset mixes formats, split it into separate files and attach all of them. Ertas handles mixed formats across files; it does not handle mixed formats within a single file.
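Before uploading, a local pass over the file can catch both problems described above: rows that match no format, and a single file that mixes formats. This is a minimal sketch, not the trainer's actual validator. The required keys are inferred from the error message above, and `FORMATS`, `detect_format`, and `check_jsonl` are hypothetical names.

```python
import json

# Assumed required keys per format, inferred from the error message on this page.
FORMATS = {
    "instruction/output": {"instruction", "output"},
    "messages": {"messages"},
    "prompt/completion": {"prompt", "completion"},
}


def detect_format(row):
    """Return the first format whose required keys are all present, else None."""
    if not isinstance(row, dict):
        return None
    for name, required in FORMATS.items():
        if required.issubset(row):
            return name
    return None


def check_jsonl(lines):
    """Scan JSONL lines and return (formats_seen, problems).

    problems is a list of (row_number, reason) tuples; row numbers start at 1.
    """
    formats_seen = set()
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        fmt = detect_format(row)
        if fmt is None:
            problems.append((number, "does not match any known format"))
        else:
            formats_seen.add(fmt)
    return formats_seen, problems
```

Pass an open file handle or any iterable of lines. If `formats_seen` contains more than one name, the file mixes formats and should be split into one file per format before you attach it.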

Architecture not supported

What it looks like:

Unsupported architecture: SomeNewModel

This almost always happens with a Hugging Face model you brought yourself, where the validator returned amber and you proceeded anyway. The underlying training framework (Unsloth) does not have kernels for that architecture.

Fix: pick a verified base model from the catalog with a similar size and family. The catalog covers Llama, Mistral, Mixtral, Phi, Gemma, Qwen, SmolLM, TinyLlama, and a handful of others, all known to train cleanly.

Tokenization error

What it looks like:

Tokenizer not found for model X

Chat template missing for model X; falling back to ChatML

The first form is a hard failure (no tokenizer means no training). The second is a warning, not a failure; the run continues with the ChatML default.

Fix for hard failures: the model on Hugging Face is missing required tokenizer files. Pick a different model.
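You can check for both conditions before adding a model by looking at the files listed on its Hugging Face repository page. The sketch below assumes the common layouts: a `tokenizer.json`, a SentencePiece `tokenizer.model`, or a `vocab.json` plus `merges.txt` pair, with the chat template stored under the `chat_template` key of `tokenizer_config.json`. The function `tokenizer_status` is hypothetical, and Ertas may apply different checks. Some repositories keep the template elsewhere, so treat a warning as a hint rather than a verdict.

```python
def tokenizer_status(filenames, tokenizer_config=None):
    """Classify a repository file listing as 'hard failure', 'warning', or 'ok'.

    filenames: names of the files in the model repository.
    tokenizer_config: parsed contents of tokenizer_config.json, if available.
    """
    names = set(filenames)
    has_tokenizer = (
        "tokenizer.json" in names
        or "tokenizer.model" in names
        or {"vocab.json", "merges.txt"}.issubset(names)
    )
    if not has_tokenizer:
        # Mirrors "Tokenizer not found for model X": training cannot start.
        return "hard failure"
    if not (tokenizer_config or {}).get("chat_template"):
        # Mirrors "Chat template missing": the run continues with ChatML.
        return "warning"
    return "ok"
```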

GPU node lost mid-run

What it looks like:

Connection to GPU node lost

Retrying after transient failure

A network or hardware blip on the GPU node. The pipeline automatically retries up to two times before giving up. You will see the run flip to Retrying (amber) and then either go back to Training or eventually fail.

Fix: nothing to do while the run is retrying. If the run still fails after the retries, credits are refunded automatically for verified models; submit the run again.
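The retry behaviour described above follows a familiar pattern: one initial attempt plus up to two retries, then a hard failure. The sketch below only illustrates that pattern. `TransientNodeError` and `run_with_retries` are hypothetical names, and the real pipeline's internals are not documented here.

```python
import time


class TransientNodeError(Exception):
    """Hypothetical stand-in for 'Connection to GPU node lost'."""


def run_with_retries(job, max_retries=2, delay_seconds=0.0):
    """Call job(); on a transient error retry up to max_retries times, then re-raise."""
    attempt = 0
    while True:
        try:
            return job()
        except TransientNodeError:
            if attempt >= max_retries:
                raise  # retries exhausted: the run ends up Failed
            attempt += 1
            time.sleep(delay_seconds)  # the run would show Retrying here
```

With the default of two retries, a job is attempted at most three times in total.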

Job submission rejected

The job never reaches the queue. The submission dialog shows an error like:

Invalid config: ...
or
Failed to submit training job

Fix: the canvas usually surfaces the validation error inline. Common cases:

A required leg is not connected.
The chosen GPU tier is not allowed by your plan.
A dataset has been deleted from Data Craft but is still referenced on the canvas.
The run's estimated LoRA and GGUF artifacts would push you over your account's model-artifacts storage quota. Ertas runs this estimate before queuing the job and refuses to start runs that would not fit. Free up space in Hub and resubmit. See Storage.

Error messages in the dialog are normalised so that they name the specific problem. Read the message carefully.
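The storage-quota case in the list above is simple arithmetic: current usage plus the estimated artifacts must not exceed the quota. The sketch below shows that arithmetic under assumed inputs. `storage_shortfall` and `can_queue` are hypothetical names, and how Ertas actually estimates LoRA and GGUF sizes is not described on this page.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def storage_shortfall(used_bytes, quota_bytes, est_lora_bytes, est_gguf_bytes):
    """Return how many bytes must be freed before the run fits, or 0 if it fits."""
    projected = used_bytes + est_lora_bytes + est_gguf_bytes
    return max(0, projected - quota_bytes)


def can_queue(used_bytes, quota_bytes, est_lora_bytes, est_gguf_bytes):
    """A run is queued only when its estimated artifacts fit inside the quota."""
    return storage_shortfall(
        used_bytes, quota_bytes, est_lora_bytes, est_gguf_bytes
    ) == 0
```

With purely illustrative numbers, 9 GiB already used against a 10 GiB quota and artifacts estimated at 0.5 GiB and 2 GiB give a shortfall of 1.5 GiB, which is the amount to free in Hub before resubmitting.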

Getting more context on a failure

The expanded run view in the Run panel or the Runs tab shows the error message inline. There is no separate log-file download today.

When asking for help (Discord, support email), include:

The failure category (OOM, validation, tokenization, etc.).
The error message text from the expanded run view.
The run ID (visible when you expand the run).
Your base model and dataset (so the responder knows whether it's catalog or HF).
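If you file reports often, a small helper can match the error text to a category and assemble the four items listed above. This is a sketch under stated assumptions: the substrings are copied from the log excerpts on this page, the category names are this page's section titles, and `classify_failure` and `help_request` are hypothetical helpers rather than Ertas tooling.

```python
# Substrings taken from the log excerpts on this page, mapped to its section titles.
# The chat-template message is a warning, not a failure, so it is not listed.
SIGNATURES = [
    ("CUDA out of memory", "Out of memory (OOM)"),
    ("missing required field", "Dataset validation failed"),
    ("No valid training rows found", "Dataset validation failed"),
    ("Unsupported architecture", "Architecture not supported"),
    ("Tokenizer not found", "Tokenization error"),
    ("Connection to GPU node lost", "GPU node lost mid-run"),
    ("Invalid config", "Job submission rejected"),
    ("Failed to submit training job", "Job submission rejected"),
]


def classify_failure(message):
    """Return the matching category name, or None if nothing matches."""
    for needle, category in SIGNATURES:
        if needle in message:
            return category
    return None


def help_request(message, run_id, base_model, dataset):
    """Assemble the items a responder needs into one block of text."""
    category = classify_failure(message) or "not matched"
    return "\n".join([
        f"Failure category: {category}",
        f"Error message: {message}",
        f"Run ID: {run_id}",
        f"Base model and dataset: {base_model} / {dataset}",
    ])
```

A message that matches no signature comes back as "not matched", which is itself worth stating in the request.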

Rerunning after a fix

Once you have made the fix:

If you fixed a Training Config setting, Duplicate the original recipe, change the one setting, and run the duplicate. Keep the failed run in history so you can see the diff.
If you fixed a dataset, you do not need to duplicate; updating the dataset and clicking Rerun on the failed run picks up the new version automatically.
If you switched base models, Duplicate so the old recipe is preserved for comparison.

The Run panel lists runs in chronological order with the newest first, so the new attempt appears above the failed one. Tag both clearly so future-you remembers which one was the fix.

When to ask for help

Almost every failure can be diagnosed from the error message in the run view. If you have worked through the categories above and your message does not match any of them:

Discord: post the error message text plus the failure category. The community is usually fast.
Support email: for paid plans, send the run ID. The support team can pull the full server-side logs that are not exposed in the UI.

What's next

Iterating: tweak and rerun without losing your place.
Dataset troubleshooting: common dataset gotchas.
Credits and usage: refund rules in detail.