Datasets
How to source, format, validate, and clean data for fine-tuning in Ertas.
Dataset quality is the single biggest lever in fine-tuning. A clean 2,000-row dataset will out-train a noisy 50,000-row one almost every time, and the cost of curating data is mostly your attention, not your credits.
This section covers everything that happens before you press play: which formats Ertas accepts, what makes a dataset train well, where to source data, how to synthesise more of it, and how to debug a training run that failed at the data stage.
Jsonl Format
Read the jsonl format guide.
Dataset Quality
Read the dataset quality guide.
Import from Hugging Face
Read the import from hugging face guide.
Dataset Synthesis
Read the dataset synthesis guide.
Instruction Tuning
Read the instruction tuning guide.
SFT vs DPO
Read the sft vs dpo guide.
Troubleshooting
Read the troubleshooting guide.
The shortest possible summary
If you read nothing else in this section, read this:
- Datasets are JSONL files (one JSON object per line). Ertas also accepts CSV, parquet, and txt as inputs, but JSONL is the canonical shape.
- The five accepted schemas are text-only (
text), instruction/output, input/output (+ metadata), conversations (ShareGPT-styleconversationsarray of{from, value}), and messages (ChatML-stylemessagesarray of{role, content}). - Ertas auto-detects the format at upload time. It will tell you immediately if your file is not one of these shapes.
- The Data Craft tab shows row count, file size, schema, and a preview for each dataset. Use the preview to spot-check that the parser got it right.
- You can attach the same dataset to as many runs as you want without re-uploading.
The pages above go deeper. Start with JSONL format if you are unsure how to shape your data, and Dataset quality once your data is at least syntactically valid.