Datasets

How to source, format, validate, and clean data for fine-tuning in Ertas.

Dataset quality is the single biggest lever in fine-tuning. A clean 2,000-row dataset will out-train a noisy 50,000-row one almost every time, and the cost of curating data is mostly your attention, not your credits.

This section covers everything that happens before you press play: which formats Ertas accepts, what makes a dataset train well, where to source data, how to synthesise more of it, and how to debug a training run that failed at the data stage.

Jsonl Format

Read the jsonl format guide.

Dataset Quality

Read the dataset quality guide.

Import from Hugging Face

Read the import from hugging face guide.

Dataset Synthesis

Read the dataset synthesis guide.

Instruction Tuning

Read the instruction tuning guide.

SFT vs DPO

Read the sft vs dpo guide.

Troubleshooting

Read the troubleshooting guide.

The shortest possible summary

If you read nothing else in this section, read this:

Datasets are JSONL files (one JSON object per line). Ertas also accepts CSV, parquet, and txt as inputs, but JSONL is the canonical shape.
The five accepted schemas are text-only (text), instruction/output, input/output (+ metadata), conversations (ShareGPT-style conversations array of {from, value}), and messages (ChatML-style messages array of {role, content}).
Ertas auto-detects the format at upload time. It will tell you immediately if your file is not one of these shapes.
The Data Craft tab shows row count, file size, schema, and a preview for each dataset. Use the preview to spot-check that the parser got it right.
You can attach the same dataset to as many runs as you want without re-uploading.

The pages above go deeper. Start with JSONL format if you are unsure how to shape your data, and Dataset quality once your data is at least syntactically valid.