Troubleshooting datasets

    Concrete fixes for upload errors, validation failures, and training runs that fail at the dataset stage.

    Almost all dataset problems show up at one of three points: upload time, training start, or in the trained model's behaviour. Each phase has a small set of common failures with known fixes. This page is a checklist organised by where you first see the symptom.

    If you have an in-flight training failure that the log says is dataset-related, this page is the right one. For Ertas-side training failures (OOMs, GPU loss), see Handling failures.

    Upload errors

    "Invalid JSON on line N"

    The line is not valid JSON. The most common causes:

    • A trailing comma after the last field.
    • A single quote where a double quote is required.
    • An unescaped newline inside a string field.
    • A stray character at the end of the file (often from a partial paste).

    Fix: open the file in a text editor, jump to the offending line, and validate it with any JSON checker (for example, sed -n 'Np' file.jsonl | jq . extracts line N and reports whether it parses). Fix the syntax, then re-upload.
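
    To check a whole file rather than one line at a time, a short script can report every line that fails to parse. This is a minimal sketch using only the Python standard library; find_invalid_lines is an illustrative helper, not part of Ertas, and the sample rows are made up.

```python
import json

def find_invalid_lines(lines):
    """Return (line_number, error_message) for each non-blank line that is not valid JSON."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, err.msg))
    return problems

# Made-up sample: line 2 has a trailing comma, line 3 uses single quotes.
sample = [
    '{"instruction": "Say hi", "output": "Hi"}',
    '{"instruction": "Trailing comma", "output": "x",}',
    "{'instruction': 'single quotes'}",
]
for number, message in find_invalid_lines(sample):
    print(f"line {number}: {message}")
```

    To run it on a real file, pass the open file object instead of the sample list, for example find_invalid_lines(open('input.jsonl', encoding='utf-8')).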

    "No valid training rows found"

    The validator could not detect any of the three accepted formats (instruction, conversations, completion). Causes:

    • The file is in a custom shape Ertas does not recognise (for example, it has prompt and response columns instead of instruction and output).
    • The file is empty or contains only blank lines.
    • Column names use unconventional casing or have extra whitespace.

    Fix: rename your columns to match one of the five accepted shapes. See JSONL format. For Hugging Face datasets with non-standard columns, the validator surfaces the column list it found; you can usually map them with a one-line script.
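
    A column rename of this kind might look like the sketch below. The mapping from prompt/response to instruction/output is a hypothetical example; substitute the column names the validator reported for your file. The helper also lowercases names and trims whitespace, which covers the casing and whitespace cause listed above.

```python
import json

# Hypothetical mapping: source column names on the left, accepted names on the right.
COLUMN_MAP = {"prompt": "instruction", "response": "output"}

def remap_row(row, column_map=COLUMN_MAP):
    """Rename keys, normalising casing and stray whitespace in column names first."""
    remapped = {}
    for key, value in row.items():
        clean = key.strip().lower()
        remapped[column_map.get(clean, clean)] = value
    return remapped

row = {" Prompt ": "Translate 'chat' to English.", "Response": "cat"}
print(json.dumps(remap_row(row)))
```

    Apply remap_row to each parsed line of the file and write the results back out as JSONL before re-uploading.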

    "Required field 'X' missing on line N"

    A row is missing a required field for its detected format. Quick reference of required fields per schema:

    • Text-only: text.
    • Instruction / output: instruction, output.
    • Input / output: input, output (with optional metadata).
    • Conversations (ShareGPT): conversations array of {from, value} with at least 2 entries.
    • Messages (ChatML): messages array of {role, content} with at least 2 entries.

    Fix: clean the data. Often the easiest fix is to drop rows missing required fields:

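    import json

    # Load every non-blank line as one JSON object.
    with open('input.jsonl', encoding='utf-8') as f:
        rows = [json.loads(line) for line in f if line.strip()]

    # Keep only rows that carry both required fields (instruction / output schema shown here).
    keep = [r for r in rows if 'instruction' in r and 'output' in r]
    print(f"Dropped {len(rows) - len(keep)} of {len(rows)} rows")

    with open('output.jsonl', 'w', encoding='utf-8') as f:
        for r in keep:
            f.write(json.dumps(r, ensure_ascii=False) + '\n')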
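
    Before dropping anything, it can help to see which schema each row actually satisfies. The sketch below checks rows against the five schemas in the quick reference above. It is a local approximation only: detect_schema and its precedence order are assumptions, and the Ertas validator may decide differently.

```python
from collections import Counter

def _turns_ok(turns, keys):
    """True if turns is a list of at least 2 dicts that each carry all the given keys."""
    return (
        isinstance(turns, list)
        and len(turns) >= 2
        and all(isinstance(t, dict) and all(k in t for k in keys) for t in turns)
    )

def detect_schema(row):
    """Return the first schema whose required fields are all present, else None."""
    if _turns_ok(row.get("messages"), ("role", "content")):
        return "messages"
    if _turns_ok(row.get("conversations"), ("from", "value")):
        return "conversations"
    if "instruction" in row and "output" in row:
        return "instruction/output"
    if "input" in row and "output" in row:
        return "input/output"
    if "text" in row:
        return "text"
    return None

# Made-up rows: the second lacks output, the fourth has only one conversation turn.
rows = [
    {"instruction": "Add 2 and 2", "output": "4"},
    {"instruction": "Missing output"},
    {"messages": [{"role": "user", "content": "Hi"},
                  {"role": "assistant", "content": "Hello"}]},
    {"conversations": [{"from": "human", "value": "Hi"}]},
]
print(Counter(detect_schema(r) for r in rows))
```

    Rows that come back as None are the ones to repair or drop. The counts also show at a glance whether a file mixes schemas.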

    "Row N exceeds maximum size"

    There is no fixed maximum row size, but very long rows can push past the base model's context window once tokenised. The trainer will truncate over-context rows and warn in the log.

    Fix: truncate long inputs to a reasonable length at the source, or drop rows containing base64-encoded blobs that you do not actually want in training.
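
    One way to find the offending rows is to rank lines by size. The sketch below uses character count as a rough stand-in for token count, which is an assumption: the real token length depends on the base model's tokenizer. longest_rows is an illustrative helper and the sample data is made up.

```python
import json

def longest_rows(lines, top=5):
    """Return (char_count, line_number) pairs for the longest non-blank lines, longest first."""
    sizes = [
        (len(line), number)
        for number, line in enumerate(lines, start=1)
        if line.strip()
    ]
    return sorted(sizes, reverse=True)[:top]

# Made-up sample: the second row carries a large blob in its output.
sample = [
    json.dumps({"instruction": "short", "output": "ok"}),
    json.dumps({"instruction": "blob", "output": "A" * 5000}),
    json.dumps({"instruction": "medium", "output": "word " * 50}),
]
for chars, number in longest_rows(sample, top=2):
    print(f"line {number}: {chars} characters")
```

    Rows that are far longer than the rest are usually the ones carrying encoded blobs or pasted documents.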

    "File exceeds dataset quota"

    Your account's dataset storage is full. See Storage for the cleanup walkthrough.

    "Hugging Face dataset not found"

    You pasted a URL or identifier that does not resolve to a public dataset. Causes:

    • Typo in the dataset id.
    • Dataset is private and your Hugging Face account is not linked to access it.
    • Dataset has been deleted or renamed.

    Fix: open the URL in a browser and confirm the dataset exists publicly. If it is private, Ertas does not currently support authenticated HF imports; download locally and upload as JSONL.

    Training-start errors

    "Tokenizer fails on dataset row"

    The base model's tokenizer cannot handle some content in a row. Causes:

    • Unsupported Unicode characters (control codes, certain emoji).
    • A field contains binary data that was not stripped at upload.
    • Encoding mojibake.

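    Fix: pass the data through a sanitiser that normalises Unicode and strips control characters:

    import json, unicodedata

    def clean(value):
        """Normalise strings (NFKC) and drop control characters, recursing into lists and dicts."""
        if isinstance(value, str):
            value = unicodedata.normalize('NFKC', value)
            # Category 'Cc' is control codes; newline and tab are kept as legitimate text.
            return ''.join(c for c in value if c in '\n\t' or unicodedata.category(c) != 'Cc')
        if isinstance(value, list):
            return [clean(v) for v in value]
        if isinstance(value, dict):
            return {k: clean(v) for k, v in value.items()}
        return value

    with open('input.jsonl', encoding='utf-8') as f:
        rows = [clean(json.loads(line)) for line in f if line.strip()]
    with open('output.jsonl', 'w', encoding='utf-8') as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + '\n')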

    "Sequence longer than model's max context"

    A row's tokenised length exceeds the base model's context window. The pipeline truncates these automatically by default, but if you see this as a warning in the log, you are losing content.

    Fix: inspect the long rows. Either truncate them at the source (preserving the most important content), or switch to a base model with a longer context window (for example, Qwen 2.5 and Llama 3.1 both support up to 128K tokens).

    "Detected format mismatch between attached datasets"

    You attached two or more datasets to a run and they have different detected formats. This is OK in principle (Ertas handles cross-format mixing), but the warning appears when the mix looks suspicious (for example, one dataset has 50,000 conversations and the other has 5 instruction rows).

    Fix: confirm both datasets are intentional. If the small one was attached by accident, detach it.

    Behavioural problems in the trained model

    These show up after the run finishes. The dataset looks fine but the model is misbehaving.

    Model emits raw template tokens like <|im_start|>

    Your data has template tokens embedded in the content or output fields. The trainer wraps them in another template, doubling the markup.

    Fix: strip template markers from the data. Ertas adds them automatically at training time based on the base model's chat template.
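
    Stripping the markers can be sketched as below. The pattern covers ChatML-style <|...|> tokens and [INST] tags, the same families the audit script at the end of this page looks for; other templates would need their own patterns. strip_markers is an illustrative helper, not an Ertas function.

```python
import re

# ChatML-style <|...|> tokens and Llama-style [INST] / [/INST] tags.
TEMPLATE_MARKERS = re.compile(r"<\|[^|<>]*\|>|\[/?INST\]", re.I)

def strip_markers(value):
    """Recursively remove template markers from strings inside dicts and lists."""
    if isinstance(value, str):
        return TEMPLATE_MARKERS.sub("", value).strip()
    if isinstance(value, list):
        return [strip_markers(v) for v in value]
    if isinstance(value, dict):
        return {k: strip_markers(v) for k, v in value.items()}
    return value

row = {
    "instruction": "[INST] Greet the user [/INST]",
    "output": "Hello!<|im_end|>",
}
print(strip_markers(row))
```

    This removes only the markers themselves. Role names that follow a marker (such as the word assistant after <|im_start|>) are left in place, so review a sample of the cleaned rows by hand.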

    Model always starts responses with the same phrase

    Many rows in the dataset have outputs that start the same way ("Sure, here is...", "I would be happy to help with..."). The model learned the boilerplate.

    Fix: clean the preamble from your training outputs. The model can always re-add it at inference time via the system prompt.
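
    A conservative way to remove such preambles is to strip a leading sentence only when it starts with a known boilerplate phrase. The phrase list below is a hypothetical starting point; build your own from the most common openings in your data. strip_preamble is an illustrative helper.

```python
import re

# Hypothetical phrase list. The preamble must end in ".", "!" or ":" followed by
# whitespace, so decimals such as "3.5" do not cut an answer in half.
PREAMBLE = re.compile(
    r"^\s*(?:sure|certainly|of course|i would be happy to help(?: with that)?)\b"
    r"[^.!:\n]*[.!:]\s+",
    re.I,
)

def strip_preamble(output):
    """Remove one leading boilerplate sentence; return the original if nothing would remain."""
    cleaned = PREAMBLE.sub("", output, count=1)
    return cleaned if cleaned.strip() else output

print(strip_preamble("Sure, here is the summary: Revenue grew 4%."))
print(strip_preamble("I would be happy to help with that! The capital is Paris."))
print(strip_preamble("Surely this is fine."))
```

    Outputs that merely begin with a similar word, like the third example, are left untouched. Spot-check the results, because a heuristic like this can still remove an opening sentence you wanted to keep.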

    Model refuses tasks it should handle

    Some rows have refusal outputs ("I cannot help with that," "I do not have access to that information"). The model learned that refusal is sometimes the right answer.

    Fix: audit and remove rows with refusal-style outputs unless refusal is genuinely the right answer in that case.

    Model produces hallucinations from "context" that was not provided

    Some rows reference data the model could not have known from the input alone ("Your account 12345 was charged..." when the input never gave an account number).

    Fix: filter out rows where the output references entities not present in the instruction or input. This is one of the most damaging dataset bugs because the model learns to confidently make things up.
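
    A full entity check needs more than a few lines, but numeric identifiers are easy to screen for. The sketch below flags digit runs that appear in the output and nowhere in the instruction or input. The three-digit threshold and the ungrounded_numbers helper are assumptions for illustration; names and other non-numeric entities are not covered.

```python
import re

# Assumption: runs of 3+ digits are IDs, amounts, or years worth checking.
NUMBER = re.compile(r"\d{3,}")

def ungrounded_numbers(row):
    """Return digit runs found in the output but not in the instruction or input."""
    source = " ".join(str(row.get(k) or "") for k in ("instruction", "input"))
    known = set(NUMBER.findall(source))
    return sorted(set(NUMBER.findall(row.get("output") or "")) - known)

# Made-up rows: the second output cites an account number the prompt never gave.
rows = [
    {"instruction": "Explain the charge on account 12345.",
     "output": "Account 12345 was charged twice."},
    {"instruction": "Explain my last charge.",
     "output": "Your account 12345 was charged on 2024-01-05."},
]
for r in rows:
    print(ungrounded_numbers(r))
```

    Rows with a non-empty result are candidates for review, not automatic deletion, since some outputs legitimately compute new numbers from the input.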

    Model only handles one kind of input

    The dataset is too narrow. Outputs are great for the cases the dataset covered but the model fumbles anything outside.

    Fix: add diversity. Generate variants of the original prompts (see Dataset synthesis).

    A general-purpose dataset audit script

    A quick script that catches most of the gotchas above:

    import json, collections, re
    
    rows = []
    with open('your-dataset.jsonl') as f:
        for line in f:
            if line.strip():
                rows.append(json.loads(line))
    
    print(f"Rows: {len(rows)}")
    print(f"Unique outputs: {len(set(r.get('output', '') for r in rows))}")
    
    # Most common opening sentence, capped at 60 characters (catches boilerplate preambles)
    openings = collections.Counter(
        (r.get('output') or '').split('.')[0][:60]
        for r in rows
    )
    print("Most common output openings:")
    for opening, count in openings.most_common(10):
        print(f"  {count:5d}  {opening!r}")
    
    # Refusal detector
    refusal_keywords = re.compile(r'\b(I can(?:not|\'?t)|I am unable|I do(?:n\'?t| not) have)\b', re.I)
    # "or ''" guards against rows whose output is null, which would make search() raise TypeError.
    refusals = [r for r in rows if refusal_keywords.search(r.get('output') or '')]
    print(f"Possible refusals: {len(refusals)}")
    
    # Template marker detector
    template_markers = re.compile(r'<\|.*?\|>|\[INST\]|\[/INST\]', re.I)
    with_markers = [r for r in rows if any(
        template_markers.search(v) for v in r.values() if isinstance(v, str)
    )]
    print(f"Rows with embedded template markers: {len(with_markers)}")

    Run this before every fine-tune. The 30 seconds it takes saves runs that would have been quietly poisoned.

    What's next