JSONL (JSON Lines) Format Guide

    The standard format for LLM fine-tuning datasets

    Specification

    JSONL (JSON Lines), also known as newline-delimited JSON (NDJSON), is a text-based data format in which each line is a complete, valid JSON object, with records separated by newline characters. Unlike standard JSON, which wraps all data in a single array or object, JSONL stores one record per line, making it ideal for streaming, appending, and processing large datasets without loading the entire file into memory. The format is defined informally by the jsonlines.org specification and has become the de facto standard for LLM fine-tuning datasets.

    Each line in a JSONL file must be a self-contained, valid JSON object. Lines are separated by the newline character (\n), and trailing newlines are permitted. There is no header line, no enclosing brackets, and no commas between records. This simplicity makes JSONL extremely easy to parse, generate, and manipulate with standard Unix tools — you can filter rows with grep, count records with wc -l, sample with shuf, and concatenate files with cat.
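
    As a minimal sketch of how little tooling this requires, the following Python uses only the standard library to write and then stream-read a JSONL file (the file name and fields are illustrative):

    python
    import json

    records = [
        {"text": "first example", "label": "positive"},
        {"text": "second example", "label": "negative"},
    ]

    # Write one JSON object per line; each line stands alone.
    with open("data.jsonl", "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Read back lazily, one record at a time, without loading the whole file.
    with open("data.jsonl", "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank or trailing lines
                continue
            record = json.loads(line)
            print(record["label"], record["text"])
    Writing and stream-reading a JSONL file with Python's standard library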

    The format's streaming-friendly nature makes it particularly well-suited for machine learning pipelines that process data incrementally. Training frameworks like Hugging Face Transformers, OpenAI's fine-tuning API, Axolotl, and LLaMA-Factory all accept JSONL as their primary input format. Data processing tools including pandas, Polars, DuckDB, and Apache Spark provide native JSONL support, enabling seamless integration between data preparation and model training stages.
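
    For example, both pandas and the Hugging Face datasets library can load a JSONL file in a single call (assuming both libraries are installed; the file name is a placeholder):

    python
    import pandas as pd
    from datasets import load_dataset

    # pandas: lines=True tells read_json to treat the file as JSON Lines.
    df = pd.read_json("train.jsonl", lines=True)
    print(df.head())

    # Hugging Face datasets: the generic "json" loader reads JSONL directly.
    ds = load_dataset("json", data_files="train.jsonl", split="train")
    print(ds[0])
    Loading a JSONL dataset with pandas and Hugging Face datasets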

    When to Use JSONL (JSON Lines)

    JSONL is the recommended format whenever you are preparing text-based training data for large language model fine-tuning. It is the expected input format for OpenAI fine-tuning, Hugging Face datasets, and most open-source training frameworks. If your training data consists of instruction-response pairs, conversational exchanges, text classification examples, or any other structured text data, JSONL should be your default choice.

    Choose JSONL over CSV when your data contains nested structures, variable-length fields, or special characters that would require complex escaping in CSV format. JSONL naturally handles arrays, nested objects, and Unicode text without the delimiter and quoting issues that plague CSV files. Choose JSONL over Parquet when you need human-readable data that can be inspected and edited with a text editor, or when your dataset is small enough that Parquet's compression advantages are not significant.

    JSONL is less suitable for very large numerical datasets where columnar formats like Parquet provide dramatically better compression and query performance. It is also less efficient than binary formats for datasets that are read many times but rarely modified, since each read requires parsing JSON text. For datasets exceeding tens of gigabytes, consider converting to Parquet for storage and converting back to JSONL for the final training step.
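
    A rough sketch of that round trip with pandas (assuming pyarrow is installed for Parquet support; file names are placeholders):

    python
    import pandas as pd

    # JSONL -> Parquet for compact storage and faster columnar access.
    df = pd.read_json("dataset.jsonl", lines=True)
    df.to_parquet("dataset.parquet", index=False)

    # Parquet -> JSONL again for the training framework's expected input.
    df = pd.read_parquet("dataset.parquet")
    df.to_json("dataset.jsonl", orient="records", lines=True, force_ascii=False)
    Converting between JSONL and Parquet with pandas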

    Schema / Structure

    typescript
    // OpenAI-compatible fine-tuning format
    interface ChatCompletionMessage {
      role: "system" | "user" | "assistant";
      content: string;
    }
    
    interface FineTuningExample {
      messages: ChatCompletionMessage[];
    }
    
    // Instruction-following format (Alpaca-style)
    interface InstructionExample {
      instruction: string;
      input?: string;
      output: string;
    }
    
    // Text classification format
    interface ClassificationExample {
      text: string;
      label: string;
    }
    Common JSONL schema patterns for LLM fine-tuning datasets

    Example Data

    jsonl
    {"messages": [{"role": "system", "content": "You are a helpful coding assistant."}, {"role": "user", "content": "Write a Python function to reverse a string."}, {"role": "assistant", "content": "def reverse_string(s: str) -> str:\n    return s[::-1]"}]}
    {"messages": [{"role": "system", "content": "You are a helpful coding assistant."}, {"role": "user", "content": "How do I read a CSV file in Python?"}, {"role": "assistant", "content": "import pandas as pd\ndf = pd.read_csv('data.csv')\nprint(df.head())"}]}
    {"messages": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}]}
    Example JSONL file with OpenAI-compatible chat fine-tuning examples
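
    Before handing a file like this to a training pipeline, it is worth running a quick local sanity check. The sketch below is illustrative rather than exhaustive: it confirms that each line parses as JSON, contains a non-empty messages array, and uses valid roles with string content.

    python
    import json

    VALID_ROLES = {"system", "user", "assistant"}

    def validate_chat_jsonl(path):
        """Print basic problems found in an OpenAI-style chat fine-tuning file."""
        with open(path, "r", encoding="utf-8") as f:
            for lineno, line in enumerate(f, start=1):
                line = line.strip()
                if not line:
                    continue  # ignore blank or trailing lines
                try:
                    record = json.loads(line)
                except json.JSONDecodeError as exc:
                    print(f"line {lineno}: not valid JSON ({exc})")
                    continue
                messages = record.get("messages")
                if not isinstance(messages, list) or not messages:
                    print(f"line {lineno}: missing or empty 'messages' array")
                    continue
                for i, msg in enumerate(messages):
                    if not isinstance(msg, dict):
                        print(f"line {lineno}: message {i} is not an object")
                    elif msg.get("role") not in VALID_ROLES or not isinstance(msg.get("content"), str):
                        print(f"line {lineno}: message {i} has a bad role or non-string content")

    validate_chat_jsonl("train.jsonl")
    A minimal local validation pass over chat-format JSONL training data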

    Ertas Support

    Ertas Data Suite provides native JSONL import and export capabilities for training data preparation. You can import raw datasets in JSONL format, apply PII redaction, data cleaning, and transformation operations, and export the processed data as JSONL ready for fine-tuning. The data lineage tracking maintains provenance information for each JSONL record, enabling you to trace any training example back to its original source.

    Ertas Studio accepts JSONL datasets for cloud-based model training and handles format validation, schema verification, and data quality checks automatically. The platform validates that each line is valid JSON, that the schema is consistent across records, and that required fields are present before training begins.
