Dataset synthesis with Data Craft

    Generate or augment training data inside the Data Craft tab, including the patterns that work and the traps that do not.

    Sometimes you do not have enough data. You have 200 real customer-support tickets in the right style and you need 2,000. You have ten beautifully-curated examples of the JSON schema you want the model to emit and you need 500. Data Craft is the Ertas tab that closes those gaps: it lets you set up a dataset, generate synthetic rows with guided prompting, clean rows that need work, and edit individual rows directly.

    Done well, synthesis is the single fastest way to scale a fine-tuning dataset. Done badly, it produces a model that imitates the generator's quirks rather than your real distribution. The patterns below apply whether you use Data Craft or synthesise outside Ertas.

    What Data Craft includes

    Data Craft is a root tab in the Ertas navigation. Opening it surfaces your dataset list and lets you start a new dataset or open an existing one.

    For any dataset, Data Craft offers these views:

    • Setup wizard (new datasets only): captures the dataset context, a structured set of fields that the generator reads when drafting prompts. See What dataset context captures below.
    • Workspace (the main editing surface): a table editor where you can scroll, edit individual cells, remove rows or fields, and duplicate rows. This is also where cleaning happens today: review the data manually, fix the cells that need it, and drop the rows that do not belong. Row duplication is useful when you want a similar-but-different example: clone a row, change a few fields, keep the result if it looks good.
    • Generation view: produces new rows. Ertas drafts the generation prompt from your dataset context, so you start with a recommended prompt rather than a blank box. Tweak it for more tailored results, set a target row count, and run. Generated rows land in a draft state for review.
    • Prompt Studio: an authoring tool for the generation prompt itself. Ertas drafts a prompt from your dataset context; you can tweak it manually and paste sample rows into the prompt to inspect how it reads with examples (Ertas does not have a native sampling preview today). The finished prompt is meant to be exported and used in your preferred AI tool to generate rows externally.

    The Data Craft onboarding tour walks first-time users through this surface step by step. If you skipped it, you can replay it from the help menu inside the tab.

    Coming soon: a systematic cleaning view. A guided pass that detects and rewrites rows with formatting drift, accidental refusals, or missing fields in bulk, instead of fixing cells one by one in the Workspace. Until it ships, manual cleaning in the Workspace is the way.

    What dataset context captures

    The wizard collects a structured set of fields. The generator uses them when drafting prompts, and the more specific each field is the closer the synthesised rows land to your real distribution.

    FieldWhat it captures
    NameDisplay name for the dataset.
    FormatThe JSONL row schema: text-only, instruction/output, input/output, conversations, or messages. Locked after creation.
    ObjectiveWhat outcome the model should produce.
    Task typeHow the dataset should be shaped.
    DeploymentWhere the model will serve traffic.
    Data statusWhat you have today (seed examples, partial dataset, nothing yet).
    ConstraintsOptional rules and pitfalls the generator should respect.

    You can update most fields later from the workspace. The format is locked at creation because it determines the row schema every later view expects.

    Skipping the wizard

    The wizard is the recommended starting point for net-new datasets, but it is optional. If you already have a JSONL file in one of the supported formats, upload it directly from the Data Craft tab. Data Craft's Workspace, Generation view, and Prompt Studio work on uploaded datasets the same way they work on wizard-built ones. The feature is for modifying and enhancing any dataset, not just the ones it creates. The trade-off is that without an explicit dataset context the generator falls back to less tailored prompts, so reviewing and tweaking the recommended prompt in Prompt Studio matters more.

    A typical synthesis flow in Data Craft

    Create or open the dataset

    Step 1

    From the Data Craft tab, choose New dataset to start the Setup wizard, or click an existing dataset to open it in the workspace.

    Capture the dataset context (new datasets)

    Step 2

    The wizard asks for the dataset's name, JSONL format, objective, task type, deployment target, data status, and optional constraints. The more specific each field, the better the prompts Ertas drafts later. You can update most of them from the workspace after creation; the format is locked.

    Seed the dataset (optional but recommended)

    Step 3

    Add 10 to 50 real rows by hand, by upload, or by importing from Hugging Face. Pure cold synthesis (no seeds at all) tends to drift toward generic outputs; even a small seed grounds the generator.

    Generate

    Step 4

    Open the Generation view. Ertas drafts a recommended prompt from your dataset context, so the first thing you see is a prompt to react to, not a blank box. Tweak it in place (or in Prompt Studio) for more tailored results, set a target row count, and run. Generated rows land in a draft state for review.

    Review and accept

    Step 5

    Read a sample. Accept the rows that look good; reject or hold the rest. Only accepted rows are sent to the trainer when a job is submitted; everything else stays in the dataset but is excluded from training. Rows are fully editable and can be duplicated or un-accepted at any time. All rows count toward your dataset storage quota, which is measured in file size rather than row count, so genuinely useless rows are still worth deleting.

    Edit and clean (optional)

    Step 6

    In the Workspace, fix individual cells, remove rows or fields you do not want, and duplicate good rows as starting points for variants. Cleaning is a manual pass today; a guided cleaning view is on the roadmap.

    The output is a normal JSONL dataset that you can attach to any training run, exactly like an uploaded or imported dataset.

    When synthesis works

    Synthesis is good for:

    • Scaling up a small high-quality dataset. You have 200 great rows and need 2,000. Use the 200 as seeds, generate 1,800 more in the same style, reject the 30% that drift.
    • Generating edge-case coverage. You have plenty of happy-path data and need failure-mode rows. Ask the generator (via Prompt Studio) to produce examples where the user is confused, hostile, or off-topic.
    • Style transfer datasets. Rewrite each row of an existing dataset in a new style (formal to casual, plain to JSON). Manual pass in the Workspace today; the systematic cleaning view on the roadmap will do this in bulk.
    • Expanding single-turn instructions into multi-turn conversations. Generate plausible follow-up turns on top of an existing instruction-format dataset.

    The pattern that works in all four cases: you bring strong signal (real examples + clear dataset context) and the generator scales it.

    When synthesis fails

    Synthesis fails when:

    • You have no seeds and no dataset context. Pure cold synthesis tends to produce blandly generic data that teaches the model to imitate the generator rather than your real distribution.
    • The generator is wrong about your domain. A general-purpose LLM does not know your support policies, your code style, or your legal language. It will confidently produce wrong-looking-right data. Counterbalance with seed examples and an explicit dataset context that lists the rules.
    • You synthesise everything. Even with good seeds, a 100% synthetic dataset has a homogeneity problem. The generator has its own style fingerprint and the model learns to imitate it. Aim for at least 10 to 30% real rows.

    Patterns that produce good synthetic data

    A few techniques that consistently work:

    Few-shot prompting on seeds

    Give the generator 5 to 20 real examples and ask it to produce N more in the same format and style. The seeds anchor the output. Data Craft does this automatically when you provide seeds and run generation.

    Distilling from a stronger model

    If you have access to a frontier model (Claude, GPT-4), generate examples there that a small model would otherwise struggle to produce, then fine-tune your small model on the result. This is the technique behind many popular instruction datasets (Alpaca, Vicuna).

    Iterating with critique

    A two-pass approach: generate a draft row, then ask the generator to critique its own output ("is this a good example? what is wrong with it?") and rewrite. The critique-then-rewrite loop produces noticeably better rows than one-shot generation, at the cost of double the credits.

    Negative examples (for DPO)

    If you are doing preference optimisation (DPO), you need pairs of accepted and rejected outputs. Use synthesis to generate the rejected side deliberately by asking the generator to produce a "bad" or "lazy" version of the accepted output. See SFT vs DPO.

    Working with acceptance state

    Acceptance is the lever that decides what gets trained on. Only accepted rows are sent to the trainer when a run is submitted; everything else stays in the dataset but is excluded. Acceptance is fully reversible: un-accept a row at any time to pull it out of training without deleting it. A few patterns that take advantage of this:

    A/B compare by toggling acceptance

    Train once with a slice of rows accepted, then un-accept that slice and rerun. The two outputs tell you what those rows actually contributed. Useful when:

    • You accepted a batch you are uncertain about. Train with it, un-accept, retrain, pick whichever model wins on your probe set.
    • You suspect some rows are noisy. Un-accept the suspicious slice, retrain, compare against the baseline.
    • You want a coverage test. Hold out edge-case rows on one run and include them on the next, then measure the delta on a probe set.

    Because un-acceptance does not delete the row, this is non-destructive. You can flip rows in and out of the training payload without re-uploading.

    Hold as you go

    When generating in large batches, accept the rows you are confident about and leave the borderline ones in a held (un-accepted) state. Come back later, edit the borderline rows into shape, and flip acceptance when they are ready. The dataset accumulates a working pool of "almost there" rows that do not pollute training until you say so.

    Duplicate as a starting point

    The Workspace lets you duplicate any row. Clone a good row, tweak two or three fields, and you have a similar-but-different example. Quick way to scale a strong seed by hand when the generator would over-imitate.

    Filtering synthesised data

    Even with Data Craft's review step, a final filtering pass on a generated batch is worth the time. Common subtle problems:

    • Outputs that do not match the prompt.
    • Refusals where the seed examples did not refuse.
    • Repetitive phrasing across batches.
    • Hallucinated facts.

    A practical filter pipeline:

    Deduplicate

    Step 1

    Remove exact and near-duplicate rows. Embedding similarity above 0.9 is usually a duplicate.

    Length filter

    Step 2

    Drop rows that are much shorter or longer than your seed examples. Outliers are usually wrong.

    Format check

    Step 3

    For JSON-output tasks, parse every row and drop the malformed ones.

    Manual sample

    Step 4

    Read 50 random rows. If you would not want the trained model to imitate any of them, reject them in Data Craft and adjust the generation prompt before regenerating.

    A 70% retention rate after filtering is normal. 95% retention is suspicious; you might be missing real problems.

    Cost considerations

    Generating rows in Data Craft consumes credits from your monthly grant (see Pricing for current per-plan generation limits). Builder and above include a fixed number of dataset-synthesis runs per month, plus the Pro plan removes the bulk-eval cap and the Business plan offers unlimited synthesis.

    A rough back-of-envelope:

    • Generating 5,000 synthetic rows: usually 1 to 3 synthesis runs depending on batch size.
    • Training a 3B LoRA on those rows (T4): roughly 1 to 2 Ertas credits.
    • Manual review and acceptance: 30 to 60 minutes of your time per 1,000 generated rows.

    The expensive step is your time on review, not the generator cost.

    What's next