Dataset quality

    How to judge whether a dataset is good enough to train on, with concrete heuristics for size, diversity, formatting, and labelling.

    A good base model + a great dataset beats a great base model + a noisy dataset, every time. This is the most underrated claim in fine-tuning, and it is also the one that costs people the most credits to learn the hard way. This page covers the concrete heuristics for judging whether a dataset is ready to train on.

    The five things that matter

    In rough order of impact:

    1. Format consistency: every row follows the same shape, with no leakage between fields.
    2. Output quality: the output (or assistant content) is the kind of response you actually want.
    3. Diversity: the inputs cover the breadth of cases your model will see in production.
    4. Size: enough rows to teach the pattern, but not so many that you pay for rows that add little.
    5. Cleanliness: no garbage characters, no PII you do not have rights to, no broken encodings.

    If any one of these is bad, no amount of hyperparameter tuning will save the run.

    Format consistency

    A dataset of 10,000 rows where 9,900 use the conversations schema and 100 use the messages schema is worse than 9,900 clean conversations rows. The 100 off-schema rows fail validation (Ertas enforces one schema per file) or pollute the training signal if they slip through some local pre-processing.

    Concrete checks:

    • Every row has the same field set.
    • Every required field is non-empty.
    • No row has the model's chat template embedded in its content. The trainer adds the template; you provide the raw content.
    • No row has stray markup from the source format (Markdown asterisks in code completion data, HTML tags in cleaned plain-text data).
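
    The first two checks are easy to script before upload. The sketch below is a minimal illustration, not Ertas's validator: it assumes rows are already loaded as Python dicts, and the field names `input` and `output` are placeholders for whatever your schema uses. It treats the most common field set as the intended schema and reports rows that deviate from it or that have an empty required field.

```python
from collections import Counter


def check_format(rows, required_fields):
    """Return (row_index, problem) tuples for rows that break the shared shape."""
    problems = []
    # Treat the most common field set as the dataset's intended schema.
    field_sets = Counter(frozenset(row) for row in rows)
    expected = field_sets.most_common(1)[0][0] if rows else frozenset()
    for i, row in enumerate(rows):
        if frozenset(row) != expected:
            problems.append((i, f"unexpected field set: {sorted(row)}"))
        for field in required_fields:
            value = row.get(field)
            # A required field must be a string with visible content.
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"empty or missing field: {field}"))
    return problems


# Hypothetical rows: one clean, one with an empty output, one off-schema.
rows = [
    {"input": "Reset my password", "output": "Go to Settings > Security."},
    {"input": "Cancel my plan", "output": ""},
    {"messages": [{"role": "user", "content": "Hi"}]},
]
for index, problem in check_format(rows, required_fields=("input", "output")):
    print(index, problem)
```

    Template tokens and stray markup need content-level checks, so they are not covered by this sketch.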

    Ertas's upload validator catches the most obvious cases but not the subtle ones (a wrong template, for example, will pass validation). Manually inspect 20 to 50 random rows after upload using the dataset preview.
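
    If you would rather draw the sample locally before upload, a few lines of Python are enough. This is a sketch under the assumption that the dataset fits in memory as a list; `sample_rows` is a hypothetical helper, not part of any Ertas tooling. The fixed seed makes the sample reproducible, so a colleague can review the same rows.

```python
import random


def sample_rows(rows, k=30, seed=0):
    """Pick k rows uniformly at random and return (index, row) pairs in source order."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    k = min(k, len(rows))
    indices = sorted(rng.sample(range(len(rows)), k))
    return [(i, rows[i]) for i in indices]


# Hypothetical dataset of 1,000 rows.
dataset = [{"input": f"question {n}", "output": f"answer {n}"} for n in range(1000)]
for index, row in sample_rows(dataset, k=5):
    print(index, row["input"], "->", row["output"])
```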

    Output quality

    The model will learn to imitate the outputs you give it. If your dataset's outputs are mediocre, you will get a mediocre model. A few specific failure modes:

    • Outputs that sound right but say nothing. "Thanks for reaching out, our team will get back to you soon" repeated 5,000 times teaches the model to dodge.
    • Outputs that contradict the instruction. If 5% of your support tickets have a response that does not actually answer the user's question, the model picks up that fraction of evasive behaviour.
    • Outputs in the wrong format. If you want JSON but half your examples are prose-with-JSON-in-it, the model will be sloppy with structure.
    • Outputs that lean on context you did not give the model. If the assistant's response references a customer name that does not appear in the user turn, the model learns to hallucinate names.
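
    Two of these failure modes can be flagged mechanically. The sketch below assumes the outputs have been pulled into a list of strings; the function names and the 1% threshold are illustrative choices, not values from Ertas. `repeated_outputs` surfaces canned responses that make up a large share of the data, and `invalid_json_indices` finds outputs that are not a single valid JSON document (only relevant if JSON is the format you want).

```python
import json
from collections import Counter


def repeated_outputs(outputs, min_share=0.01):
    """Return (output, share) pairs for outputs repeated in at least min_share of rows."""
    counts = Counter(text.strip().lower() for text in outputs)
    total = len(outputs)
    return [
        (text, n / total)
        for text, n in counts.most_common()
        if n > 1 and n / total >= min_share
    ]


def invalid_json_indices(outputs):
    """Return indices of outputs that do not parse as one JSON document."""
    bad = []
    for i, text in enumerate(outputs):
        try:
            json.loads(text)
        except json.JSONDecodeError:
            bad.append(i)
    return bad


# Hypothetical outputs: the second one wraps JSON in prose.
outputs = [
    '{"status": "ok"}',
    'Sure, here you go: {"status": "ok"}',
    '{"status": "ok"}',
]
print(repeated_outputs(outputs))
print(invalid_json_indices(outputs))
```

    Evasive or contradictory outputs still need a human read; no simple check catches those.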

    The single best dataset-quality move is to read a random sample of your outputs and ask yourself "if a user got this response, would I be happy?" If you would not, fix the data, not the hyperparameters.

    Diversity

    A dataset of 10,000 rows where every row asks the same kind of question teaches the model exactly one kind of question. Production traffic is almost always more varied than your training data.

    Rough heuristics:

    • Topical coverage: aim for at least 10 distinct task types or topic clusters in an instruction-following dataset.
    • Style coverage: include short and long inputs, formal and casual phrasing, complete and incomplete user turns.
    • Edge case coverage: a few rows of obviously malformed input (typos, abbreviations, missing context) help the model degrade gracefully.

    You can measure diversity quickly by clustering inputs (text embeddings + k-means). If 80% of your rows land in two clusters, your dataset is too narrow.
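
    Once you have one cluster label per row, the 80% check is a short calculation. The sketch below covers only that last step and assumes the labels already exist, for example the `labels_` array from scikit-learn's `KMeans` fitted on your embeddings. The embedding model and the number of clusters are left to you, and `top_cluster_share` is a hypothetical helper name.

```python
from collections import Counter


def top_cluster_share(labels, top_n=2):
    """Return the fraction of rows that fall into the top_n largest clusters."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    in_top = sum(n for _, n in counts.most_common(top_n))
    return in_top / len(labels)


# Hypothetical cluster assignments for 100 rows across 4 clusters.
labels = [0] * 50 + [1] * 35 + [2] * 10 + [3] * 5
share = top_cluster_share(labels, top_n=2)
if share >= 0.8:
    print(f"{share:.0%} of rows sit in two clusters: the dataset is likely too narrow")
```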

    Size

    There is no universal minimum, but rules of thumb:

    Use case                                | Minimum useful      | Comfortable     | Diminishing returns above
    Style transfer (formal to casual, etc.) | 200                 | 1,000           | 5,000
    Instruction following (single-turn)     | 500                 | 2,000 to 5,000  | 20,000
    Multi-turn chat                         | 1,000 conversations | 5,000 to 10,000 | 50,000
    Domain knowledge (medical, legal)       | 5,000               | 20,000          | 100,000+
    Code completion                         | 2,000               | 10,000          | 50,000

    Smaller-than-minimum: the model may not converge or will be obviously over-influenced by individual rows.

    Larger-than-diminishing-returns: you are mostly paying for credits without buying much improvement. Curate instead of grow.

    The diminishing-returns numbers assume LoRA fine-tuning, which has limited capacity by design. Full fine-tuning can absorb more rows, but at much greater cost.
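
    If you check dataset sizes often, the rules of thumb above can be encoded as a small lookup. This is only a convenience sketch: the keys and the `size_verdict` helper are made up for illustration, the "comfortable" threshold uses the lower end of each range, and multi-turn chat counts conversations rather than rows.

```python
# (minimum useful, comfortable, diminishing returns above), taken from the rules of thumb.
SIZE_GUIDE = {
    "style_transfer": (200, 1_000, 5_000),
    "instruction_following": (500, 2_000, 20_000),
    "multi_turn_chat": (1_000, 5_000, 50_000),  # counted in conversations
    "domain_knowledge": (5_000, 20_000, 100_000),
    "code_completion": (2_000, 10_000, 50_000),
}


def size_verdict(use_case, n_rows):
    """Classify a dataset size against the rule-of-thumb thresholds."""
    minimum, comfortable, diminishing = SIZE_GUIDE[use_case]
    if n_rows < minimum:
        return "below minimum"
    if n_rows < comfortable:
        return "usable"
    if n_rows <= diminishing:
        return "comfortable"
    return "diminishing returns"


print(size_verdict("instruction_following", 3_000))
```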

    Cleanliness

    Garbage that costs you credits:

    • Duplicate rows. Often happens when data is exported from a database with overlapping date ranges. Run a quick deduplication on identical output fields.
    • Near-duplicates. Almost-identical rows are worse than exact duplicates: they slip past exact-match deduplication, and they still teach the model that one exact phrasing is the right answer. Use embedding-based deduplication.
    • Encoding errors. Mojibake (Ã© where you wanted é) trains the model to emit mojibake. Re-export from the source in UTF-8.
    • PII you do not have rights to. Names, emails, phone numbers from real users without consent. Redact before upload; the model will learn whatever is in the data.
    • Bot output already in the data. Old support tickets sometimes include canned responses from a previous chatbot. The model will absorb that style.

    A quick pass through the dataset detail view in Data Craft usually surfaces these.
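
    Some of this can also be scripted before upload. The sketch below is a rough first pass under stated assumptions: rows are dicts with a hypothetical `output` field, the mojibake markers are a small non-exhaustive sample of common UTF-8-read-as-Latin-1 sequences, and the email pattern is deliberately loose. It handles exact duplicates only; near-duplicates still need the embedding-based approach described above.

```python
import re

# A few common mojibake sequences plus the Unicode replacement character (not exhaustive).
MOJIBAKE_MARKERS = ("Ã©", "Ã¨", "Ã¼", "â€™", "â€œ", "\ufffd")
# Loose email pattern: good enough to flag rows for review, not to validate addresses.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def drop_exact_duplicates(rows, field="output"):
    """Keep the first row for each distinct value of `field`, ignoring case and spacing."""
    seen = set()
    kept = []
    for row in rows:
        key = " ".join(row[field].split()).lower()
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def flag_row(row):
    """Return a list of cleanliness warnings for one row."""
    text = " ".join(str(value) for value in row.values())
    flags = []
    if any(marker in text for marker in MOJIBAKE_MARKERS):
        flags.append("possible mojibake")
    if EMAIL_RE.search(text):
        flags.append("email address")
    return flags


# Hypothetical rows: the first two differ only in case and spacing.
rows = [
    {"input": "a", "output": "Hello  there"},
    {"input": "b", "output": "hello there"},
    {"input": "c", "output": "Bye"},
]
print(len(drop_exact_duplicates(rows)))
```

    Treat the flags as prompts for a manual look. A flagged row is not necessarily bad, and an unflagged row can still contain names or phone numbers.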

    Templating pitfalls

    The most common quality issue we see is template confusion between the dataset and the base model. Two failure modes:

    • Template tokens embedded in dataset content: rows include <|im_start|>user or [INST] in the content field. The trainer wraps these in another template, doubling them. The model learns to emit broken markup.
    • Wrong template for the base: a dataset formatted for Llama's chat template attached to a Mistral base model, or vice versa. The model trains, but the boundaries between turns are blurry.

    Ertas detects the base model's chat template automatically and applies it on top of your raw content. Do not include template markers in your data. If your data came from a source that already has them, strip them before upload.
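
    A quick scan for leftover markers can be sketched as follows. It assumes rows are dicts of strings and looks only for the two marker families named above, plus their usual closing forms (`<|im_end|>` and `[/INST]`, added here as an assumption); other templates use other tokens, so extend the pattern for your source. The sketch only detects. How to strip safely depends on the source format, because role labels such as `user` often sit right next to the marker.

```python
import re

# ChatML-style and [INST]-style markers; extend for other chat templates.
TEMPLATE_MARKER_RE = re.compile(r"<\|im_start\|>|<\|im_end\|>|\[/?INST\]")


def find_template_markers(text):
    """Return every template marker found in the text, in order of appearance."""
    return TEMPLATE_MARKER_RE.findall(text)


def rows_with_markers(rows):
    """Return indices of rows where any string field contains a template marker."""
    return [
        i
        for i, row in enumerate(rows)
        if any(
            isinstance(value, str) and TEMPLATE_MARKER_RE.search(value)
            for value in row.values()
        )
    ]


# Hypothetical rows: the second one still carries template markers.
rows = [
    {"input": "What is LoRA?", "output": "A parameter-efficient fine-tuning method."},
    {"input": "[INST] What is LoRA? [/INST]", "output": "A fine-tuning method."},
]
print(rows_with_markers(rows))
```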

    Smoke-test your dataset

    Before queueing a long run, do a fast sanity check:

    Step 1: Open the dataset in the Data Craft tab

    Click the dataset name. The detail view shows schema, row count, and a preview of the first rows.

    Step 2: Read at least 20 rows by hand

    Scroll through the preview. Read rows from the start, the middle, and the end (the preview is in source order, so spread your reading across the file). Does each row look like something the model should learn? Does the output answer the input correctly?

    Step 3: Run a small experiment

    Queue a 50 to 100 step run with GGUF conversion off. It should finish in under 10 minutes. Smoke-test the resulting model. If the model can already do something useful after 100 steps, the dataset has signal. If it produces gibberish, the dataset has a problem.

    This is by far the cheapest way to catch a bad dataset before you commit serious credits to a long training run.

    When to clean vs when to start over

    If 80%+ of your rows are good, fix the rest. If half the dataset is sketchy, you will spend more time cleaning than re-sourcing from a better origin.

    Indicators that the dataset is salvageable:

    • The high-quality rows look great.
    • Bad rows have a consistent failure pattern (encoding, formatting) you can fix with a script.

    Indicators that you should start over:

    • Outputs are inconsistent in style across rows.
    • Many rows have factually wrong outputs.
    • You cannot easily separate good rows from bad.

    It is OK to throw out a dataset you spent a day on. You will save a week.
