Import from Hugging Face

    Pull any public Hugging Face dataset into Ertas in one click, with automatic validation and rights attestation.

    Hugging Face hosts thousands of public datasets, including many that are well suited to fine-tuning. Ertas's dataset picker has a Hugging Face import flow built in: paste a URL, click Check, and Ertas validates the dataset, mirrors it to your Data Craft tab, and attaches it to your run.

    Find a dataset

    Open huggingface.co/datasets and browse or search. Useful filters when picking a dataset:

    • Task type: Filter to instruction-tuning, question-answering, summarisation, conversational, or whatever matches your goal.
    • Language: Filter to your target language(s). Some datasets are English-only; others are multilingual.
    • License: Some datasets are research-only, some are commercial-friendly. Check the license tag on the dataset card before committing.
    • Size: For first runs, look for datasets in the 1K to 50K row range. Bigger datasets are slower to import and to train on.

    A few datasets ship as Recommended in Ertas because they consistently produce useful first-time fine-tunes:

    Hugging Face id                  Use case                          Rows
    yahma/alpaca-cleaned             Instruction following (general)   51,760
    winglian/pirate-ultrachat-10k    Style transfer (pirate-speak)     10,000

    These are visible in the Recommended group at the top of the dataset picker. They are great for testing the full Ertas flow even if you do not need the resulting model.

    Import from the dataset picker

    When you click the + under the blue Training Dataset leg on the canvas, the picker opens. There are three import paths:

    Recommended

    Step 1

    The Recommended group shows curated datasets that are known to work. Click any card and Ertas imports the dataset and attaches it in one step.

    Paste a Hugging Face URL

    Step 2

    The HuggingFace Dataset input box at the top of the picker accepts:

    • Full URLs (https://huggingface.co/datasets/tatsu-lab/alpaca)
    • Short forms (tatsu-lab/alpaca)

    Click Check. Ertas validates the dataset in two phases.
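    The two accepted input shapes can be reduced to a single owner/name id before any lookup. The sketch below is illustrative only: parse_dataset_ref is a hypothetical helper, not Ertas's actual parser, and it assumes every dataset id has the owner/name shape shown above.

```python
from urllib.parse import urlparse


def parse_dataset_ref(text: str) -> str:
    """Return an 'owner/name' id from a full URL or a short form.

    Hypothetical helper for illustration; Ertas's own parsing may differ.
    """
    text = text.strip()
    if text.startswith(("http://", "https://")):
        parsed = urlparse(text)
        if parsed.netloc != "huggingface.co":
            raise ValueError(f"not a Hugging Face URL: {text}")
        parts = [p for p in parsed.path.split("/") if p]
        # Assumed path shape: /datasets/<owner>/<name>[/...]
        if len(parts) < 3 or parts[0] != "datasets":
            raise ValueError(f"not a dataset URL: {text}")
        return f"{parts[1]}/{parts[2]}"
    # Short form: exactly two non-empty segments separated by one slash.
    parts = text.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'owner/name', got: {text}")
    return text
```

    Both parse_dataset_ref("https://huggingface.co/datasets/tatsu-lab/alpaca") and parse_dataset_ref("tatsu-lab/alpaca") yield the same id, so later steps only need to handle one form.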

    Confirm and attach

    Step 3

    On success, Ertas mirrors the dataset to Data Craft and attaches it to the current Action Module's dataset leg.

    You can also import without attaching by going to the Data Craft tab and using its Hugging Face import flow. The dataset shows up in Data Craft and you can attach it to runs later.

    What validation checks

    When you click Check, Ertas queries Hugging Face for the dataset's metadata and runs four checks:

    • Dataset exists: 404s and private datasets surface as red errors.
    • Has training-friendly columns: looks for instruction/output, messages, or prompt/completion columns. If the columns do not match, the validator flags amber and shows you the column list.
    • Row count: no hard minimum, but datasets under about 20 rows rarely produce useful training; the validator surfaces a warning when the dataset is very small.
    • Splits: notes which splits are available (train, validation, test). Ertas currently uses the train split for training; automated eval-split handling is part of the upcoming evaluation suite.

    The validation result panel shows:

    • Detected format (instruction, conversations, or completion).
    • Row count and approximate size in MB.
    • Available splits.
    • A green check if everything looks good, amber if the columns are unusual but the dataset can still be used, red if it cannot be imported.
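    To make the column and row-count checks concrete, here is a minimal sketch of how such a validator could be structured. It is not Ertas's implementation: the function names, the pairing of column names with format names, and the result shape are assumptions for illustration. The red state (missing or private dataset) is left out because it depends on a network lookup.

```python
# Assumed pairing of column names (from the checks above) with format names.
FORMAT_COLUMNS = {
    "instruction": {"instruction", "output"},
    "conversations": {"messages"},
    "completion": {"prompt", "completion"},
}
SMALL_DATASET_ROWS = 20  # "under about 20 rows" from the row-count check


def detect_format(columns):
    """Return the first format whose required columns are all present, else None."""
    present = set(columns)
    for name, required in FORMAT_COLUMNS.items():
        if required <= present:
            return name
    return None


def validate(columns, num_rows, splits):
    """Summarise a dataset the way the result panel does (hypothetical shape)."""
    fmt = detect_format(columns)
    warnings = []
    if num_rows < SMALL_DATASET_ROWS:
        warnings.append(f"only {num_rows} rows; very small datasets rarely train well")
    return {
        "format": fmt,
        "rows": num_rows,
        "splits": list(splits),
        # Amber when the columns match no known format; the column list is shown.
        "status": "green" if fmt is not None else "amber",
        "columns": list(columns),
        "warnings": warnings,
    }
```

    For example, validate(["instruction", "input", "output"], 51760, ["train"]) reports the instruction format with a green status, while a dataset with a single text column comes back amber with its column list attached.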

    Rights attestation

    Hugging Face hosts these datasets publicly, but each dataset's license still controls what you can do with it. Before saving, Ertas prompts you to attest that:

    • You have reviewed the dataset's license.
    • You believe you have the right to use the data for fine-tuning (which usually requires either a permissive license or that the underlying content is your own).

    The attestation is required (you cannot save without checking the box) and is persisted as a field on the dataset's stored metadata. There is no in-product audit surface today for reviewing past attestations across datasets, so if you need a compliance trail, keep your own record of which datasets you imported and under which licenses. None of this is a substitute for actually reading the license. If you are shipping a commercial product, read the dataset card and the linked license carefully.

    A few license patterns to know:

    • MIT / Apache 2.0 / CC0: very permissive, fine for commercial use.
    • CC-BY: commercial use OK with attribution.
    • CC-BY-NC: non-commercial only. Do not use for a paid product.
    • Research-only: stated on the dataset card. Skip for production.
    • Custom: read it. Some datasets have unusual conditions.
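    If you keep your own record of imported datasets, the patterns above can be encoded as a first-pass triage. This is a sketch only: license_bucket is a hypothetical helper, the tag spellings are assumed to follow the lowercase identifiers commonly seen on dataset cards, and anything it does not recognise is routed to manual review. It does not replace reading the license.

```python
import re

# Assumed lowercase tag spellings for the "very permissive" group above.
PERMISSIVE = {"mit", "apache-2.0", "cc0-1.0"}


def license_bucket(tag: str) -> str:
    """Map a license tag to one of the buckets listed above (hypothetical helper)."""
    tag = tag.strip().lower()
    if tag in PERMISSIVE:
        return "permissive"
    if tag.startswith("cc-by-nc"):
        return "non-commercial"
    # Plain CC-BY with a version number, e.g. cc-by-4.0.
    if re.fullmatch(r"cc-by-\d+(\.\d+)?", tag):
        return "attribution"
    # Research-only, custom, and everything else: read the license yourself.
    return "review"
```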

    What gets mirrored

    When you save an HF dataset, Ertas:

    • Downloads the dataset content to your account-scoped storage.
    • Stores the original Hugging Face URL alongside it.
    • Detects and stores the format (instruction / conversations / completion).
    • Stores row count and file size for billing and quota tracking.

    The mirror is one-way. If the dataset is updated on Hugging Face later, your mirrored copy does not change automatically. Re-import to pick up changes (Ertas will overwrite the existing mirror after a confirmation).

    The mirror counts against your account's dataset storage quota. See Storage.

    Converting non-JSONL datasets

    Hugging Face datasets are often distributed in Parquet, Arrow, or CSV format. Ertas's importer handles each of these:

    • Parquet and Arrow: read directly, converted to JSONL on import.
    • CSV: header row mapped to columns; auto-detected format.
    • JSON / JSONL: imported as-is.

    If a dataset uses non-standard column names, the validator's amber state will surface them. You can either:

    • Pick a different dataset.
    • Download the dataset locally, rename the columns to match one of Ertas's five formats, and upload as JSONL.
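    The second option can be done with a few lines of standard-library Python. The sketch below assumes the rows are already loaded as dictionaries and that the target is the instruction/output column pair named in the validation checks; to_jsonl and the example column names are hypothetical.

```python
import json


def to_jsonl(rows, column_map):
    """Rename columns per column_map and serialise one JSON object per line.

    Columns that are not keys of column_map are dropped.
    """
    lines = []
    for row in rows:
        renamed = {new: row[old] for old, new in column_map.items()}
        lines.append(json.dumps(renamed, ensure_ascii=False) + "\n")
    return "".join(lines)


# Example: a dataset that uses question/answer instead of instruction/output.
rows = [{"question": "What is 2 + 2?", "answer": "4", "id": 1}]
jsonl_text = to_jsonl(rows, {"question": "instruction", "answer": "output"})
```

    Write jsonl_text to a .jsonl file and upload that file in place of the original dataset.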

    Streaming and pagination

    Ertas's importer reads the whole dataset on the first attach. For datasets above 100K rows, this can take a minute or two. After the first import, subsequent runs attach instantly.

    If the dataset is huge (millions of rows), consider downloading a sampled subset locally and uploading just that. A 50,000-row sample of a 1M-row dataset trains roughly as well for most tasks and is faster everywhere downstream.
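    One way to draw such a subset without loading every row into memory is reservoir sampling, sketched below. This is a suggestion rather than anything Ertas does for you; sample_rows and its parameters are hypothetical names.

```python
import random


def sample_rows(rows, k, seed=0):
    """Return a uniform random sample of up to k rows from any iterable.

    Uses reservoir sampling (Algorithm R), so rows can be a streaming
    iterator and only k rows are held in memory at a time.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample
```

    Passing the row iterator of a 1M-row dataset with k=50_000 yields the 50,000-row subset described above. Note that the rows come back in reservoir order rather than their original order.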
