Training tips

    Pragmatic advice for getting better results from Ertas's defaults: when to deviate, what to change first, and the gotchas that cost the most credits.

    The Ertas defaults are tuned to produce a useful model on the first try for the most common cases (a 3B model on T4, or a 7B model on A10G, fine-tuned with LoRA on a 1,000 to 50,000-row instruction dataset). When they fall short, the fixes follow a pattern. This page is a checklist of moves that consistently help, organised by what they fix.

    For a deeper guide to the underlying parameters, see Configuring a run.

    The shortcuts that pay off most

    If you only read three tips on this page, read these:

    • Quality dataset over big dataset. 1,000 clean, varied, well-formatted rows beat 50,000 noisy ones. Spend twice as long curating before your first run as you do tuning hyperparameters afterwards.
    • Iterate on tiny runs first. A 100-step run at LoRA rank 8 with GGUF conversion off finishes in 5 to 10 minutes. Use it to catch templating bugs and confirm loss goes down. Then scale up.
    • Change one knob per run. If you change both the learning rate and the dataset, you cannot tell which produced the result.
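
    One way to hold yourself to the one-knob rule is to diff each run's settings against the baseline before you launch it. The sketch below assumes you keep a copy of your run settings in plain Python dicts; the key names are illustrative and are not Ertas field names.

```python
def changed_knobs(baseline: dict, variant: dict) -> list[str]:
    """Return the sorted names of settings whose values differ between two runs."""
    missing = object()  # sentinel: a key present on only one side counts as a change
    keys = set(baseline) | set(variant)
    return sorted(
        k for k in keys if baseline.get(k, missing) != variant.get(k, missing)
    )


def assert_single_change(baseline: dict, variant: dict) -> str:
    """Return the one changed setting, or raise if zero or several changed."""
    changed = changed_knobs(baseline, variant)
    if len(changed) != 1:
        raise ValueError(f"expected exactly one changed knob, got {changed}")
    return changed[0]


# Hypothetical settings, for illustration only.
baseline = {"learning_rate": 2e-4, "lora_rank": 16, "max_steps": 100, "dataset": "v1"}
variant = {**baseline, "learning_rate": 1e-4}
print(assert_single_change(baseline, variant))  # learning_rate
```

    Treating the dataset version as just another key means a silent data change shows up in the diff too.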

    The rest of this page is a more granular checklist.

    Hyperparameters by model size

    The defaults assume a 3B model on T4 or a 7B model on A10G. If you are training something smaller or larger, adjust:

    Model size | Learning rate | Batch size | Gradient accum | LoRA rank | GPU tier
    Under 2B (TinyLlama, SmolLM, Gemma 1B) | 3e-4 to 5e-4 | 4 | 2 | 8 to 16 | T4
    3B to under 5B (Phi-3 mini, Gemma 3 2B / 4B, Llama 3.2 3B) | 2e-4 to 3e-4 | 2 to 4 | 4 | 16 | T4
    5B to 8B (Llama 3 8B, Mistral 7B, Qwen 2.5 7B) | 2e-4 (default) | 2 | 4 (default) | 16 (default) | A10G required
    13B to 14B (Llama 3 13B, Mistral 13B) | 1e-4 to 2e-4 | 1 to 2 | 8 | 16 to 32 | A10G
    22B+ (Mixtral, larger) | 5e-5 to 1e-4 | 1 | 16 | 16 | A10G

    T4's 16 GB VRAM is the dividing line: any model 5B or larger requires A10G. Smaller models tolerate higher learning rates and benefit from smaller LoRA ranks. Larger models need lower learning rates and may benefit from higher rank if your task has broad scope.

    When the model says wrong things

    Symptom: output is in the right format but the facts or behaviour are wrong.

    Likely causes and fixes:

    • Undertrained. Step count too low for the dataset size. Try doubling Max Steps, or set Max Steps to 0 and train for 3 epochs so the step count is derived from the dataset size.
    • Dataset too narrow. Model has only seen one kind of example. Add diverse rows covering edge cases.
    • Wrong base. A general-purpose base is being asked to act like a specialist. Try a base that is closer to your task (Code Llama for code tasks, Qwen for multilingual, etc.).

    When the model says right things in wrong format

    Symptom: facts are correct, but the structure is off (raw markup tokens, wrong indentation, missing fields).

    Likely causes and fixes:

    • Chat template mismatch. Check the base model's expected template. Ertas auto-detects most, but if the validator surfaces "no template detected," your data needs to use ChatML explicitly.
    • Template tokens in your data. Strip raw template markers like <|im_start|> from your rows. Ertas's pipeline adds them around your content.
    • Insufficient format reinforcement. Need more rows showing the exact desired output shape.
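
    A quick scan for stray template markers before upload can catch the second cause above. This is a minimal sketch, not part of Ertas: it assumes each row is a dict with a "messages" list of {"role", "content"} entries and that the markers in question are ChatML-style; adjust the pattern for other template families.

```python
import re

# ChatML markers. The role name is removed only when it directly follows
# <|im_start|> and is terminated by a newline, as in real ChatML framing.
_MARKER = re.compile(
    r"<\|im_start\|>(?:(?:system|user|assistant)\n)?|<\|im_end\|>\n?"
)


def strip_template_markers(text: str) -> str:
    """Remove ChatML markers from a piece of message content."""
    return _MARKER.sub("", text).strip()


def rows_with_markers(rows: list[dict]) -> list[int]:
    """Return indices of rows whose message content contains a template marker."""
    return [
        i
        for i, row in enumerate(rows)
        if any(_MARKER.search(m["content"]) for m in row["messages"])
    ]


# Hypothetical rows, for illustration only.
rows = [
    {"messages": [{"role": "user", "content": "<|im_start|>user\nHi<|im_end|>"}]},
    {"messages": [{"role": "user", "content": "Hello"}]},
]
print(rows_with_markers(rows))  # [0]
print(strip_template_markers(rows[0]["messages"][0]["content"]))  # Hi
```

    Reviewing the flagged rows by hand before stripping is usually safer than a blind rewrite, since a marker inside a row sometimes signals a larger export problem.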

    When the model overfits

    Symptom: train loss keeps dropping but smoke-test quality on the downloaded model starts getting worse.

    Likely causes and fixes:

    • Too many epochs. Drop from 3 to 1 or 2.
    • Too high a rank for the dataset. Drop LoRA rank from 16 to 8. Less capacity means less room to memorise.
    • Add LoRA dropout. Bump from 0 to 0.05 or 0.1.
    • Grow the dataset. Overfitting often means there are too few rows for the number of trainable parameters.

    When loss spikes or blows up

    Symptom: loss shoots upward, sometimes to NaN. Run usually fails or is useless.

    Likely causes and fixes:

    • Learning rate too high. Cut by half.
    • Warmup too short. A 200-step run with 0 warmup is jumpy at the start. Bump warmup to 15 or 20 steps.
    • Outlier sequence in dataset. A row much longer than the rest can cause a single bad gradient. Truncate or remove very-long rows.
    • Mixed-precision overflow. Rare. The pipeline uses bf16 where possible to avoid this; if you see it consistently on a specific base, that base may have known training instability.
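
    Outlier rows are easy to find before a run. The sketch below flags any row whose length is several times the median; it uses a whitespace word count as a rough stand-in for token count, and the factor of 4 is an arbitrary starting point rather than an Ertas threshold.

```python
from statistics import median


def find_length_outliers(texts: list[str], factor: float = 4.0) -> list[int]:
    """Return indices of texts longer than `factor` times the median length.

    Length is measured in whitespace-separated words, which only approximates
    the token count the model actually sees.
    """
    lengths = [len(t.split()) for t in texts]
    if not lengths:
        return []
    cutoff = factor * median(lengths)
    return [i for i, n in enumerate(lengths) if n > cutoff]


# Nine 3-word rows and one 500-word row.
texts = ["short answer here"] * 9 + ["word " * 500]
print(find_length_outliers(texts))  # [9]
```

    Using the median rather than the mean keeps a single huge row from dragging the cutoff up and hiding itself.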

    When throughput crashes mid-run

    Symptom: tokens-per-second drops dramatically partway through training.

    Likely causes and fixes:

    • Thermal throttling on the GPU node. Rare. The pipeline detects this and recovers; if it persists, the run is automatically retried on a fresh node.
    • Long sequence in late batches. Most datasets are not perfectly uniform; a cluster of long sequences can slow batches. Truncate or remove outliers.
    • OOM cliff. If activation memory creeps up and you are right at the edge, throughput can collapse before a full OOM. Lower batch size or context length.

    Step-based vs epoch-based training

    The Max Steps field deserves more attention than it usually gets:

    • Max Steps > 0: predictable runtime, predictable cost. Use this when iterating.
    • Max Steps = 0: epoch-based. Total steps = (rows / effective_batch_size) * epochs, where effective_batch_size is batch size times gradient accumulation. Use this when you want the model to see every row a fixed number of times.

    Two common mistakes:

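    • A 200-step run on a 50,000-row dataset sees only 200 * effective_batch_size rows: 400 at an effective batch size of 2, 1,600 at 8. Either way most of the dataset is never touched, and the model is almost certainly undertrained.
    • Five epochs on a 50,000-row dataset is 250,000 / effective_batch_size steps (31,250 at an effective batch size of 8) and may take hours. Almost certainly overkill for an instruction-tuning task.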

    Rule of thumb: for instruction-following on a clean dataset, aim for each row to be seen about twice. 2 * rows / effective_batch_size is a sensible Max Steps target.
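
    The step arithmetic in this section can be sketched as a few helper functions. This assumes a single GPU, so that the effective batch size is batch size times gradient accumulation, and it rounds partial batches up; whether Ertas rounds the same way is an assumption.

```python
import math


def effective_batch_size(batch_size: int, grad_accum: int) -> int:
    """Rows consumed per optimizer step."""
    return batch_size * grad_accum


def steps_for_epochs(rows: int, eff_batch: int, epochs: int) -> int:
    """Total steps in epoch-based mode: (rows / effective_batch_size) * epochs."""
    return math.ceil(rows / eff_batch) * epochs


def rows_seen(max_steps: int, eff_batch: int) -> int:
    """Rows touched by a step-based run (counting repeats)."""
    return max_steps * eff_batch


def rule_of_thumb_max_steps(rows: int, eff_batch: int) -> int:
    """Max Steps so that each row is seen about twice."""
    return math.ceil(2 * rows / eff_batch)


eff = effective_batch_size(batch_size=2, grad_accum=4)
print(eff)                                 # 8
print(rows_seen(100, eff))                 # 800
print(steps_for_epochs(5_000, eff, 3))     # 1875
print(rule_of_thumb_max_steps(5_000, eff)) # 1250
```

    Running these numbers before launching makes it obvious when a step count covers only a sliver of the dataset or repeats it far more than intended.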

    LoRA target modules

    The default set covers all the linear projections in attention and MLP blocks. You can trim it to save VRAM and shrink the adapter, at the cost of quality:

    • Attention only (q_proj, k_proj, v_proj, o_proj): about 30% smaller adapter, mild quality drop.
    • Query and value only (q_proj, v_proj): the absolute minimum. Surprisingly good for stylistic transfer; bad for capability tuning.
    • MLP only (gate_proj, up_proj, down_proj): less common, sometimes useful for fact-heavy tuning.

    Unless you are space-constrained, leave all seven on. The default is a strong choice.

    When to use the Train action module instead of Fine-Tune

    Train (no LoRA) updates every parameter in the model. It is rarely the right choice for fine-tuning, because:

    • LoRA gets within a percent or two of full fine-tuning on most tasks.
    • Train uses much more VRAM, so you are stuck on A10G for any model bigger than 1B.
    • The output checkpoint is the size of the full model.

    Train makes sense when:

    • You are continuing pretraining on a niche corpus (legal, medical, code in a private language).
    • Your dataset is large enough (millions of tokens) that you need every parameter to absorb it.
    • You have evidence that LoRA is the bottleneck on this specific task.

    For the first few weeks of working on a new model, stick with Fine-Tune.

    Common credit traps

    Easy ways to burn credits without learning anything:

    • Running a huge dataset and a huge step count on the first attempt. Catch templating bugs with a 100-step pass first.
    • Leaving GGUF conversion on during iteration. The 12 extra minutes per run add up over a sweep. Turn it off until you have a config you are happy with.
    • Not cancelling obvious losers. If a run's loss has plateaued at a clearly bad value, the rest of the run will not fix it. Cancel.
    • Sweeping too widely. If you change four knobs and run sixteen variants, you will spend more on credits than you spent on data prep.
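
    To see how quickly a wide sweep grows, compare a full grid with changing one knob at a time. The sketch below is illustrative only: the setting names and values are hypothetical, and it simply counts the runs each strategy would launch.

```python
from itertools import product


def grid_sweep(baseline: dict, options: dict) -> list[dict]:
    """Every combination of the option values (a full grid)."""
    names = list(options)
    return [
        {**baseline, **dict(zip(names, combo))}
        for combo in product(*(options[n] for n in names))
    ]


def one_at_a_time(baseline: dict, options: dict) -> list[dict]:
    """The baseline plus one run per alternative value, one knob changed each time."""
    runs = [dict(baseline)]
    for name, values in options.items():
        for value in values:
            if value != baseline[name]:
                runs.append({**baseline, name: value})
    return runs


# Hypothetical knobs: four settings with two candidate values each.
baseline = {"learning_rate": 2e-4, "lora_rank": 16, "epochs": 3, "lora_dropout": 0.0}
options = {
    "learning_rate": [2e-4, 1e-4],
    "lora_rank": [16, 8],
    "epochs": [3, 2],
    "lora_dropout": [0.0, 0.05],
}
print(len(grid_sweep(baseline, options)))     # 16
print(len(one_at_a_time(baseline, options)))  # 5
```

    Five runs that each isolate one change are cheaper than sixteen and also easier to interpret.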

    What's next