JSONL format

The five JSONL schemas Ertas accepts, with concrete examples and validation rules. Pick one shape per file and stick with it.

Ertas accepts datasets in JSONL: one JSON object per line, no commas between lines, no outer array. Inside each line, your row can take one of five shapes. Ertas auto-detects which one you used at upload time.

Pick one shape per file. Ertas does not support mixing schemas inside a single file or across files attached to the same run, so commit to one format upfront.

Why JSONL

JSONL is the de facto standard for LLM training data because it streams cleanly (one line at a time, without parsing the entire file into memory), survives line-level edits, and lets you count rows with wc -l. Many fine-tuning tools accept it directly, and many datasets on Hugging Face ship as JSONL.

A few mechanical rules:

One JSON object per line. No trailing comma. No outer array brackets.
UTF-8 encoding. Don't BOM-prefix the file.
No fixed maximum row size. The base model's context window is the practical ceiling: rows that tokenise to more than that will be truncated by the trainer, with a warning in the log.
The file extension should be .jsonl. Ertas will accept .json and treat it as JSONL if it parses cleanly, but .jsonl is the unambiguous choice.
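
The rules above can be checked locally before upload. The following is a minimal sketch, not Ertas's validator: the function name check_mechanics is hypothetical, and skipping blank lines is an assumption, since this page does not say how Ertas treats them.

```python
import json


def check_mechanics(raw: bytes) -> list[str]:
    """Return a list of problems found in the raw bytes of a JSONL file."""
    problems = []

    # A UTF-8 byte order mark is the three bytes EF BB BF at the start.
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[3:]

    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return problems + [f"not valid UTF-8: {exc}"]

    # Split on "\n" only, so Unicode separators inside strings stay intact.
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # assumption: blank lines are skipped rather than rejected
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            # Trailing commas and stray array brackets end up here.
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            problems.append(
                f"line {lineno}: expected a JSON object, got {type(row).__name__}"
            )
    return problems
```

Read the file in binary mode (open(path, "rb").read()) so that the BOM and encoding checks see the bytes exactly as they are stored on disk.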

Schema 1: Text-only

The simplest shape. Each row is a single string of text that the model trains on directly. Good for continued pretraining or stylistic absorption on a corpus that does not have a clear prompt/response split.

{"text": "The quick brown fox jumps over the lazy dog. The fox is in a hurry today, because the moon..."}
{"text": "Once upon a time in a kingdom by the sea, there lived a young clockmaker..."}

Required fields:

text: the training content. Non-empty string.

When to use it: corpus-style continued pretraining, style absorption on long documents, learning a specific tone from prose. The model is trained to predict the next token in text without any prompt/response boundary.

When not to use it: any task where the model should respond to user input. Use one of the prompt-response shapes below.

Schema 2: Instruction / output

Single-turn instruction with a target answer. Closest to the Alpaca shape, but with only the two essential fields.

{"instruction": "Summarise the following article in two sentences. The article: [text...]", "output": "The article argues..."}
{"instruction": "Translate to French: Good morning, how are you?", "output": "Bonjour, comment allez-vous ?"}
{"instruction": "Write a haiku about the sea.", "output": "Waves break on the shore / Salt air fills the empty sky / Gulls cry, then silence."}

Required fields:

instruction: the directive. Non-empty string.
output: the target completion. Non-empty string.

When to use it: single-turn instruction following, classification, summarisation, translation, structured generation. Easy to author by hand or generate synthetically. The two-field shape keeps your dataset readable.

When not to use it: when you need a separate context field distinct from the instruction (use schema 3), or when the conversation has multiple turns (use schema 4 or 5).

Schema 3: Input / output (with optional metadata)

Pairs of input and output, with an optional metadata object for bookkeeping. The input/output naming maps cleanly to many existing public datasets.

{"input": "What is the capital of France?", "output": "Paris is the capital of France.", "metadata": {"source": "trivia-set-2025"}}
{"input": "def reverse(s): return", "output": " s[::-1]", "metadata": {"source": "code-snippets-internal"}}
{"input": "Customer says their order is late.", "output": "Apologise once, ask for order number, offer to track."}

Required fields:

input: the user-side content. Non-empty string.
output: the model-side response. Non-empty string.

Optional fields:

metadata: an object of arbitrary key-value bookkeeping data (source, tag, difficulty, etc.). Not used in training; preserved on the row so you can filter or audit later.

When to use it: existing datasets that ship in this shape; cases where you want to keep per-row provenance without contaminating the training content; classification tasks where input/output reads more naturally than instruction/output.

Schema 4: Conversations (ShareGPT-style)

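Multi-turn conversations as a list of {from, value} messages. The "ShareGPT" shape is widely used on Hugging Face. The example below is wrapped across several lines for readability; in an actual file, each conversation must occupy a single line.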

{"conversations": [
  {"from": "human", "value": "My order arrived damaged. Can I get a refund?"},
  {"from": "gpt", "value": "I am sorry to hear that. Could you share your order number?"},
  {"from": "human", "value": "It is ABC-12345."},
  {"from": "gpt", "value": "Thanks. I have processed a refund to your original payment method."}
]}

Required fields:

conversations: a list. At least 2 messages (one human turn, one gpt turn).

Each message requires:

from: one of human or gpt.
value: a non-empty string.

When to use it: multi-turn chat, customer support, dialogue agents. The training loss is computed only on the gpt turns, so the model learns to respond, not to predict user input.
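
To make "loss only on the gpt turns" concrete, here is a generic illustration of label masking. It is a sketch under stated assumptions, not a description of Ertas's trainer: build_labels and toy_tokenize are hypothetical names, the character-level tokenizer is a stand-in for a real one, and chat-template tokens are left out. The value -100 is the default ignore_index of PyTorch's cross-entropy loss, so positions carrying it contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def toy_tokenize(text):
    """Stand-in tokenizer (assumption): one token id per character."""
    return [ord(ch) for ch in text]


def build_labels(conversation, tokenize=toy_tokenize):
    """Return (input_ids, labels) with the human turns masked out of the loss."""
    input_ids, labels = [], []
    for message in conversation:
        ids = tokenize(message["value"])
        input_ids.extend(ids)
        if message["from"] == "gpt":
            # The model is trained to predict these tokens.
            labels.extend(ids)
        else:
            # Human tokens stay in the input as context but are not scored.
            labels.extend([IGNORE_INDEX] * len(ids))
    return input_ids, labels
```

The human turns remain in input_ids, so the model still conditions on them; only their contribution to the loss is removed.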

When not to use it: when your existing data uses the OpenAI role/content shape. Use schema 5 instead.

Schema 5: Messages (ChatML / OpenAI-style)

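Multi-turn conversations as a list of {role, content} messages, including an optional system message. This is the OpenAI chat-completion shape. The example below is wrapped across several lines for readability; in an actual file, each conversation must occupy a single line.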

{"messages": [
  {"role": "system", "content": "You are a helpful customer support agent."},
  {"role": "user", "content": "My order arrived damaged. Can I get a refund?"},
  {"role": "assistant", "content": "I am sorry to hear that. Could you share your order number?"},
  {"role": "user", "content": "It is ABC-12345."},
  {"role": "assistant", "content": "Thanks. I have processed a refund to your original payment method."}
]}

Required fields:

messages: a list. At least 2 messages (one user, one assistant). System message is optional.

Each message requires:

role: one of system, user, or assistant.
content: a non-empty string.
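
The field rules above translate into a small per-row check. This is a minimal sketch assuming only the rules listed on this page; validate_messages_row is a hypothetical helper, and it does not check turn order or alternation, because this page does not state whether Ertas enforces either.

```python
VALID_ROLES = {"system", "user", "assistant"}


def validate_messages_row(row: dict) -> list[str]:
    """Check one schema-5 row against the rules listed above."""
    messages = row.get("messages")
    if not isinstance(messages, list) or len(messages) < 2:
        return ["messages must be a list with at least 2 entries"]

    problems = []
    roles = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: expected an object")
            continue
        role = message.get("role")
        content = message.get("content")
        if not isinstance(role, str) or role not in VALID_ROLES:
            problems.append(
                f"message {index}: role must be one of system, user, assistant"
            )
        if not isinstance(content, str) or content == "":
            problems.append(f"message {index}: content must be a non-empty string")
        roles.append(role)

    # The system message is optional; a user and an assistant turn are not.
    if "user" not in roles or "assistant" not in roles:
        problems.append("need at least one user and one assistant message")
    return problems
```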

When to use it: any conversational dataset that uses the OpenAI / ChatML conventions. If you have a choice between schema 4 and schema 5 and no existing data, schema 5 is the more widely supported convention going forward.

The base model's chat template is applied at training time. You do not need to include raw template tokens like <|im_start|> in your content (or value) fields. Ertas injects them based on the detected template for the model you picked. Embedding template markers in the data is a common cause of broken-looking output later.
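
A quick scan can catch embedded template markers before upload. The sketch below is illustrative only: find_template_markers is a hypothetical helper, and SUSPECT_MARKERS is a short, non-exhaustive list (ChatML and Llama-2-style markers) that you would extend for the model family you train on.

```python
import json

# Non-exhaustive examples of raw template tokens (assumption: extend as needed).
SUSPECT_MARKERS = ("<|im_start|>", "<|im_end|>", "[INST]", "[/INST]")


def iter_strings(value):
    """Yield every string nested anywhere inside a parsed JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for child in value.values():
            yield from iter_strings(child)
    elif isinstance(value, list):
        for child in value:
            yield from iter_strings(child)


def find_template_markers(lines):
    """Return (line_number, marker) pairs for rows that embed template tokens."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        for text in iter_strings(row):
            for marker in SUSPECT_MARKERS:
                if marker in text:
                    hits.append((lineno, marker))
    return hits
```

Because the scan walks every string in the row, it works unchanged for content fields, value fields, and the flat schemas.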

One file, one shape

A single dataset file must use one of the five schemas above for every row. Ertas's validator detects the format from the first valid rows and rejects subsequent rows that do not match.
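
One way to picture this behaviour is key-based detection: infer the schema from the first recognisable row, then flag every row that differs. This is a guess at the general idea, not Ertas's actual detection logic; detect_schema, find_mismatches, and the schema names in REQUIRED_KEYS are all hypothetical.

```python
import json

# Required top-level keys per schema, taken from the sections above.
REQUIRED_KEYS = {
    "text": {"text"},
    "instruction_output": {"instruction", "output"},
    "input_output": {"input", "output"},
    "conversations": {"conversations"},
    "messages": {"messages"},
}


def detect_schema(row):
    """Return the first schema whose required keys are all present, else None."""
    if not isinstance(row, dict):
        return None
    for name, keys in REQUIRED_KEYS.items():
        if keys.issubset(row):
            return name
    return None


def find_mismatches(lines):
    """Return (detected_schema, line numbers that do not match it)."""
    expected = None
    mismatches = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        schema = detect_schema(json.loads(line))
        if schema is None or (expected is not None and schema != expected):
            mismatches.append(lineno)
        elif expected is None:
            expected = schema  # the first recognisable row fixes the format
    return expected, mismatches
```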

You also cannot mix schemas across files attached to the same run. If you need to combine an instruction-format dataset with a conversations-format dataset, the supported approach is to convert one to match the other before upload. The Data Craft workflow can help you normalise an uploaded dataset's shape with guided rewriting; that is the only multi-shape mixing path Ertas supports today.
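
Converting before upload is usually a few lines of scripting. The sketch below maps schema 2 and schema 4 rows onto schema 5; both function names are hypothetical, and mapping an instruction to a single user message is one reasonable choice rather than an Ertas-defined rule.

```python
import json

# ShareGPT speaker names mapped to OpenAI-style roles.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def instruction_to_messages(row):
    """Schema 2 (instruction/output) -> schema 5 (messages)."""
    return {
        "messages": [
            {"role": "user", "content": row["instruction"]},
            {"role": "assistant", "content": row["output"]},
        ]
    }


def conversations_to_messages(row):
    """Schema 4 (conversations) -> schema 5 (messages)."""
    return {
        "messages": [
            {"role": ROLE_MAP[message["from"]], "content": message["value"]}
            for message in row["conversations"]
        ]
    }


def convert_line(line, convert):
    """Parse one JSONL line, convert it, and serialise it back to one line."""
    return json.dumps(convert(json.loads(line)), ensure_ascii=False)
```

Apply convert_line to every line of the file that needs converting, then concatenate the results with the file that was already in schema 5 so the combined dataset uses one shape throughout.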

Validation at upload

When you upload a dataset (or import from Hugging Face), Ertas runs a few checks:

Parseable JSON: every line is valid JSON.
Detected format: one of the five schemas above is identified across the file.
Required fields present: every row has the fields its detected format requires.
Non-empty fields: empty strings in required positions fail.
Row count: there is no hard minimum, but the validator warns when a dataset is very small. We recommend at least 20 rows for the model to have a chance of learning anything useful, and 500 or more for any production-grade fine-tune.

If validation fails, the workspace shows the line number and a description of what went wrong. You can fix the file locally and re-upload, or edit small datasets in the in-app row editor.
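
The required-field, non-empty, and row-count checks can be approximated locally so that most failures surface before upload. This is a sketch of the rules as described on this page, not the validator itself: validate_required and size_note are hypothetical names, and only the exact empty string is treated as empty, since the page does not say how whitespace-only values are handled.

```python
def validate_required(rows, required):
    """Return line-numbered errors for missing or empty required fields."""
    errors = []
    for lineno, row in enumerate(rows, start=1):
        for field in required:
            if field not in row:
                errors.append(f"line {lineno}: missing required field '{field}'")
            elif not isinstance(row[field], str) or row[field] == "":
                errors.append(
                    f"line {lineno}: field '{field}' must be a non-empty string"
                )
    return errors


def size_note(row_count):
    """Mirror the row-count guidance: 20 rows to learn anything, 500 for production."""
    if row_count < 20:
        return "below the 20-row recommendation"
    if row_count < 500:
        return "below the 500-row production recommendation"
    return None
```

For an instruction/output file, call validate_required(rows, ("instruction", "output")) on the parsed rows, then size_note(len(rows)).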

A minimal example you can copy

A 5-row dataset in the instruction/output schema, enough to test the upload flow end-to-end:

{"instruction": "What is 2 + 2?", "output": "4"}
{"instruction": "Who wrote Hamlet?", "output": "William Shakespeare wrote Hamlet."}
{"instruction": "What is the capital of France?", "output": "Paris is the capital of France."}
{"instruction": "Translate to Spanish: Hello, world!", "output": "¡Hola, mundo!"}
{"instruction": "Write a one-line summary of photosynthesis.", "output": "Photosynthesis is how plants turn sunlight into food."}

This will not train into a useful model (5 rows is well under the 20-row recommendation), but it will pass validation and let you see the full upload and attachment flow.
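
If you would rather generate a test file than paste one, a short script can do it. This is a minimal sketch; to_jsonl, write_jsonl, and the sample.jsonl filename in the usage note are illustrative choices, not part of Ertas.

```python
import json

ROWS = [
    {"instruction": "What is 2 + 2?", "output": "4"},
    {"instruction": "Who wrote Hamlet?", "output": "William Shakespeare wrote Hamlet."},
]


def to_jsonl(rows):
    """Serialise rows as JSONL: one object per line, no outer array."""
    # json.dumps escapes newlines inside strings, so each row stays on one line.
    # ensure_ascii=False keeps non-ASCII text readable instead of \u escapes.
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def write_jsonl(path, rows):
    """Write rows to path as UTF-8 without a BOM, using plain \\n line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(to_jsonl(rows))
```

Calling write_jsonl("sample.jsonl", ROWS) produces a two-row file, and wc -l sample.jsonl should then report 2.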

What's next

Dataset quality

What separates a dataset that trains well from one that does not.

Import from Hugging Face

Pull public datasets straight into Ertas.

Dataset synthesis

Generate or rewrite rows with the Data Craft workflow.

Dataset troubleshooting

Diagnose upload and training failures.