Datasets

    How to source, format, validate, and clean data for fine-tuning in Ertas.

    Dataset quality is the single biggest lever in fine-tuning. A clean 2,000-row dataset will out-train a noisy 50,000-row one almost every time, and the cost of curating data is mostly your attention, not your credits.

    This section covers everything that happens before you press play: which formats Ertas accepts, what makes a dataset train well, where to source data, how to synthesise more of it, and how to debug a training run that failed at the data stage.

    The shortest possible summary

    If you read nothing else in this section, read this:

    • Datasets are JSONL files (one JSON object per line). Ertas also accepts CSV, parquet, and txt as inputs, but JSONL is the canonical shape.
    • The five accepted schemas are text-only (text), instruction/output, input/output (+ metadata), conversations (ShareGPT-style conversations array of {from, value}), and messages (ChatML-style messages array of {role, content}).
    • Ertas auto-detects the format at upload time. It will tell you immediately if your file is not one of these shapes.
    • The Data Craft tab shows row count, file size, schema, and a preview for each dataset. Use the preview to spot-check that the parser got it right.
    • You can attach the same dataset to as many runs as you want without re-uploading.

    The pages above go deeper. Start with JSONL format if you are unsure how to shape your data, and Dataset quality once your data is at least syntactically valid.