    Multi-Format Export from a Single Data Pipeline: JSONL, COCO, YOLO, and RAG Chunks
    Tags: data-export, jsonl, coco, yolo, rag, data-pipeline, training-data, segment:service-provider

    How to export training data in JSONL, COCO, YOLO, CSV, and chunked text from a single pipeline — covering format requirements, validation, and avoiding parallel pipeline maintenance.

    Ertas Team

    You've ingested, cleaned, labeled, and augmented your dataset. Now you need to export it — and the downstream system determines the format.

    Fine-tuning a language model? JSONL. Training an object detection model? YOLO or COCO. Building a RAG pipeline? Chunked text with metadata. Training a classical ML classifier? CSV. Feeding an AI agent? Structured JSON with tool call schemas.

    The problem: most data preparation tools export one format. Maybe two. If your project requires three export formats — which is common when a client wants fine-tuning, RAG, and a dashboard all from the same source data — you end up maintaining three export scripts, each with its own format-specific bugs and validation gaps.

    This guide covers what each format requires, where they break, and how to export reliably from a single pipeline.


    Format Requirements: What Each One Actually Needs

    JSONL for LLM Fine-Tuning

    JSONL (JSON Lines) is the standard format for fine-tuning language models. Each line is a self-contained JSON object representing one training example.

    Instruction fine-tuning format:

    {"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
    

    Key requirements:

    • Valid JSON on every line — one malformed line can crash the training script
    • Consistent schema across all lines (same fields, same roles)
    • UTF-8 encoding, no BOM
    • No trailing commas, no comments
    • Token count per example within the model's context window

    Common gotchas:

    • Unescaped quotes in text content (the #1 JSONL formatting error)
    • Newlines within content fields not properly escaped
    • Mixed schemas (some lines with prompt/completion, others with messages)
    • Empty or null fields that the training framework doesn't handle
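    The gotchas above are all mechanically checkable before training day. A minimal sketch of a JSONL validator for the chat-messages schema (the function name and error format are illustrative, not part of any standard tooling):

```python
import json

def validate_jsonl(path, allowed_roles=("system", "user", "assistant")):
    """Return a list of (line_number, problem) tuples for a chat-format
    JSONL file. An empty list means the file passed these checks."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                errors.append((lineno, "empty line"))
                continue
            try:
                record = json.loads(line)  # catches unescaped quotes, bad newlines
            except json.JSONDecodeError as exc:
                errors.append((lineno, f"invalid JSON: {exc}"))
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append((lineno, "missing or empty 'messages' field"))
                continue
            for msg in messages:
                if msg.get("role") not in allowed_roles:
                    errors.append((lineno, f"unknown role: {msg.get('role')!r}"))
                if not msg.get("content"):
                    errors.append((lineno, "empty or null content"))
    return errors
```

    Token-count checks would slot in per message using the target model's tokenizer; they are omitted here because the tokenizer is model-specific.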

    COCO Format for Computer Vision

    COCO (Common Objects in Context) format uses a single JSON file containing image metadata, category definitions, and annotations.

    Structure:

    {
      "images": [{"id": 1, "file_name": "img001.jpg", "width": 1920, "height": 1080}],
      "categories": [{"id": 1, "name": "defect"}, {"id": 2, "name": "normal"}],
      "annotations": [{"id": 1, "image_id": 1, "category_id": 1, "bbox": [x, y, w, h], "area": 1234}]
    }
    

    Key requirements:

    • All IDs must be unique and cross-referenced correctly
    • Bounding box format is [x, y, width, height] (top-left origin)
    • Area must match the bounding box dimensions
    • Image dimensions must match actual file dimensions
    • Segmentation masks (if used) must be valid polygons

    Common gotchas:

    • ID mismatches between images and annotations
    • Bounding boxes that extend outside image boundaries
    • Zero-area annotations from labeling errors
    • Category IDs that don't match the categories list
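    Each of these gotchas reduces to a cross-reference or geometry check on the loaded JSON. A sketch of such a checker, assuming the COCO dict shape shown above (box-only annotations; segmentation checks are omitted):

```python
def validate_coco(coco):
    """Cross-check IDs and bounding boxes in a COCO-style dict.
    Returns a list of human-readable error strings."""
    errors = []
    image_dims = {img["id"]: (img["width"], img["height"]) for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}
    seen_ann_ids = set()
    for ann in coco["annotations"]:
        if ann["id"] in seen_ann_ids:
            errors.append(f"duplicate annotation id {ann['id']}")
        seen_ann_ids.add(ann["id"])
        if ann["image_id"] not in image_dims:
            errors.append(f"annotation {ann['id']}: unknown image_id {ann['image_id']}")
            continue
        if ann["category_id"] not in category_ids:
            errors.append(f"annotation {ann['id']}: unknown category_id {ann['category_id']}")
        x, y, w, h = ann["bbox"]
        img_w, img_h = image_dims[ann["image_id"]]
        if w <= 0 or h <= 0:
            errors.append(f"annotation {ann['id']}: zero-area bbox")
        if x < 0 or y < 0 or x + w > img_w or y + h > img_h:
            errors.append(f"annotation {ann['id']}: bbox outside image bounds")
        # For box-only annotations, area should equal w * h; polygon
        # annotations legitimately differ, so treat this as a warning there.
        if "area" in ann and abs(ann["area"] - w * h) > 1e-6:
            errors.append(f"annotation {ann['id']}: area does not match bbox")
    return errors
```

    Verifying that image dimensions match the actual files requires opening each image (e.g. with Pillow), which is worth the I/O cost on a sample even if not on the full set.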

    YOLO Format for Object Detection

    YOLO uses a directory structure with one text file per image, each line representing one annotation.

    Structure:

    <class_id> <x_center> <y_center> <width> <height>
    

    All coordinates are normalized to 0-1 relative to image dimensions.

    Key requirements:

    • One .txt file per image, same filename (different extension)
    • Coordinates normalized to [0, 1]
    • Class IDs are zero-indexed integers
    • A data.yaml file mapping class IDs to names

    Common gotchas:

    • Coordinates not normalized (using pixel values instead of 0-1 range)
    • Filename mismatches between image files and annotation files
    • Class IDs starting at 1 instead of 0
    • Missing annotation files for images with no objects (YOLO expects an empty file, not a missing file)
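    The normalization gotcha in particular is easy to guard against at conversion time. A sketch of a COCO-to-YOLO box converter that refuses to emit out-of-range values (the function name is illustrative):

```python
def coco_bbox_to_yolo(bbox, img_w, img_h, class_id):
    """Convert a COCO [x, y, w, h] box (pixels, top-left origin) into a
    YOLO annotation line: zero-indexed class id, then center x/y and
    width/height, all normalized to [0, 1]."""
    x, y, w, h = bbox
    x_center = (x + w / 2) / img_w
    y_center = (y + h / 2) / img_h
    values = (x_center, y_center, w / img_w, h / img_h)
    # Reject rather than clip: out-of-range values usually mean the
    # source box or image dimensions are wrong.
    if not all(0.0 <= v <= 1.0 for v in values):
        raise ValueError(f"normalized coordinates out of range: {values}")
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)
```

    The empty-file gotcha is handled at write time: emit a zero-byte .txt for every image, then append lines only for images that have annotations.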

    Chunked Text for RAG

    RAG (Retrieval-Augmented Generation) pipelines need text split into chunks with metadata for retrieval.

    Typical structure:

    {"chunk_id": "doc001_chunk_003", "text": "...", "metadata": {"source": "policy_v2.pdf", "page": 12, "section": "Termination"}}
    

    Key requirements:

    • Chunk size appropriate for the embedding model (typically 256-512 tokens)
    • Overlap between adjacent chunks (typically 10-20% of chunk size) to avoid splitting relevant context
    • Source metadata preserved for citation
    • Chunk boundaries that don't split sentences or semantic units

    Common gotchas:

    • Chunks that split mid-sentence, producing fragments that embed poorly
    • Missing or incorrect source metadata (makes citation impossible)
    • Chunk size inconsistency (some chunks are 50 tokens, others are 2,000)
    • Table content chunked as raw text, losing row-column structure
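    A sentence-aware chunker with overlap avoids the mid-sentence splits and wildly inconsistent sizes listed above. This sketch uses word counts as a stand-in for token counts and a naive regex sentence splitter; a production pipeline would use the embedding model's tokenizer and a proper sentence segmenter:

```python
import re

def chunk_text(text, max_words=120, overlap_sentences=1):
    """Split text into chunks on sentence boundaries, carrying a small
    sentence overlap between adjacent chunks. Word counts approximate
    token counts for the purposes of this sketch."""
    # Naive splitter: breaks after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, new_since_flush = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        new_since_flush += 1
        if sum(len(s.split()) for s in current) >= max_words:
            chunks.append(" ".join(current))
            # Carry the trailing sentence(s) forward as overlap.
            current = current[-overlap_sentences:] if overlap_sentences else []
            new_since_flush = 0
    if new_since_flush:  # flush a final chunk only if it has new material
        chunks.append(" ".join(current))
    return chunks
```

    Wrapping each chunk in the `{"chunk_id": ..., "text": ..., "metadata": ...}` record shown above is then a straightforward enumeration over the returned list, with source filename, page, and section carried through from ingestion.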

    CSV for Classical ML

    Flat tabular format with one row per example and one column per feature.

    Key requirements:

    • Consistent column count across all rows
    • Proper escaping of commas, quotes, and newlines within field values
    • Header row with descriptive column names
    • Consistent data types per column (no mixing strings and numbers)

    Common gotchas:

    • Text fields containing commas that break parsing
    • Inconsistent null representation (empty string vs. "null" vs. "N/A" vs. "None")
    • Encoding issues in non-ASCII text
    • Large text fields that make the CSV unwieldy for non-text tools
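    Most CSV escaping bugs come from hand-rolled string joining. Python's csv module handles commas, quotes, and embedded newlines correctly; the only decision left is a consistent null token. A sketch (the function name and `null_token` parameter are illustrative):

```python
import csv
import io

def export_csv(records, fieldnames, null_token=""):
    """Serialize dict records to CSV text with proper escaping and a
    single, consistent representation for missing values."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    for record in records:
        # Normalize nulls so downstream parsers see one token, not a
        # mix of "", "null", "N/A", and "None".
        row = {k: (null_token if record.get(k) is None else record[k])
               for k in fieldnames}
        writer.writerow(row)
    return buf.getvalue()
```

    Writing the output with an explicit `encoding="utf-8"` on the file handle covers the non-ASCII gotcha.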

    Structured JSON for AI Agents

    Agent training data requires tool call schemas, action-observation pairs, and structured decision records.

    Key requirements:

    • Tool call schemas match the actual tool signatures
    • Action-observation sequences are chronologically ordered
    • Each decision point includes the available actions and the chosen action
    • Error cases and edge cases are represented
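    These requirements can also be checked mechanically. A sketch of a trace validator, assuming a hypothetical record shape where each step has a `timestamp` and an `action` with a `tool` name and `args` dict, and tool schemas list their required arguments:

```python
def validate_agent_trace(trace, tool_schemas):
    """Check an action-observation trace: known tools, required arguments
    present, timestamps strictly increasing. Returns error strings."""
    errors = []
    last_ts = float("-inf")
    for i, step in enumerate(trace):
        if step["timestamp"] <= last_ts:
            errors.append(f"step {i}: out of chronological order")
        last_ts = step["timestamp"]
        tool = step["action"]["tool"]
        schema = tool_schemas.get(tool)
        if schema is None:
            errors.append(f"step {i}: unknown tool {tool!r}")
            continue
        missing = set(schema["required"]) - set(step["action"]["args"])
        if missing:
            errors.append(f"step {i}: missing args {sorted(missing)}")
    return errors
```

    The schema dict should be generated from the actual tool signatures rather than written by hand, so the two cannot drift apart.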

    The Parallel Pipeline Problem

    When each format requires a separate export pipeline, you end up maintaining parallel code paths:

    Source Data → JSONL Export Script → jsonl_output/
    Source Data → COCO Export Script  → coco_output/
    Source Data → RAG Chunk Script    → chunks_output/
    Source Data → CSV Export Script   → csv_output/
    

    Each script is a potential source of:

    Format bugs: An unescaped quote in the JSONL exporter that only triggers on certain records. A coordinate normalization error in the YOLO exporter that produces out-of-range values for certain image dimensions.

    Data drift: If you update the source data and re-export, do all four scripts pick up the changes? If the JSONL script processes the updated data but the COCO script was pointing at a cached copy, your exports are inconsistent.

    Validation gaps: Each script may or may not include validation. The JSONL script might validate JSON syntax but not check for empty fields. The COCO script might check ID references but not verify bounding box dimensions.

    Maintenance burden: Four scripts across four file formats, each with format-specific edge cases, each requiring updates when the source data schema changes.

    For service providers handling multiple client projects, this scales poorly. A format bug that affects one project likely affects others — but each project's scripts are separate.


    Export Validation: What to Check

    Validation should be a required step, not optional. Check these before considering any export complete:

    Universal Checks (All Formats)

    • Record count: Export contains the expected number of records
    • Completeness: No missing fields, no null values where values are required
    • Encoding: UTF-8 throughout, no encoding artifacts
    • Deduplication: No duplicate records in the export
    • Traceability: Every exported record maps back to a source record in the pipeline
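    The count, deduplication, and traceability checks share one shape: compare exported records against source records by a stable key. A sketch, assuming records are dicts with an `id` field:

```python
def universal_checks(source_records, exported_records, key="id"):
    """Run the format-agnostic export checks: record count, duplicates,
    and traceability back to the source. Returns a report dict."""
    exported_keys = [r[key] for r in exported_records]
    source_keys = {r[key] for r in source_records}
    return {
        "count_matches": len(exported_records) == len(source_records),
        # Keys appearing more than once in the export.
        "duplicates": sorted(k for k in set(exported_keys)
                             if exported_keys.count(k) > 1),
        # Exported records with no corresponding source record.
        "orphans": sorted(k for k in exported_keys if k not in source_keys),
    }
```

    Encoding and completeness checks are format-aware enough that they belong with the per-format validators below.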

    Format-Specific Checks

    • JSONL: Valid JSON on every line, consistent schema, token counts within limits
    • COCO: ID uniqueness and cross-references, bbox within image bounds, area calculation
    • YOLO: Coordinates in [0, 1], file-annotation pairing, class ID validity
    • RAG chunks: Chunk size within target range, no sentence splits, metadata present
    • CSV: Column count consistency, type consistency, proper escaping

    Downstream Compatibility Checks

    • JSONL: Load into the target training framework and verify it parses without errors
    • COCO: Run the COCO evaluation API on a sample to verify format compatibility
    • YOLO: Load into the target YOLO training script and verify it reads annotations correctly
    • RAG chunks: Embed a sample and verify retrieval produces expected results

    Single-Pipeline Export Architecture

    The alternative to parallel scripts is a single pipeline that produces multiple export formats from one data model:

    Source Data → Unified Pipeline → Export Module → JSONL
                                                  → COCO
                                                  → YOLO
                                                  → RAG Chunks
                                                  → CSV
                                                  → Structured JSON
    

    This architecture has specific advantages:

    One data model: All records exist in a single representation. Export to each format is a transformation from that representation — not a separate pipeline with its own parsing logic.

    One validation pass: Validate the data once in the unified model. Format-specific validation happens at the export boundary, not upstream.

    Consistent exports: When the source data changes, every export format reflects the change. No drift between formats.

    Audit trail: Every export is logged with the export format, timestamp, record count, and validation results. When a compliance team asks "what exactly was in the training dataset?", the answer is in one place.
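    In code, the "export module" in the diagram above can be as simple as a registry of pure transformations over the unified in-memory records. A minimal sketch (the registry names and lambda exporters are illustrative placeholders for real format writers):

```python
import json

def export_all(records, exporters):
    """Run every registered exporter over the same in-memory records.
    Because each exporter is a pure transformation of one data model,
    every format always reflects the same source state."""
    return {name: export_fn(records) for name, export_fn in exporters.items()}

# Hypothetical exporters registered against the unified data model:
exporters = {
    "jsonl": lambda recs: "\n".join(json.dumps(r) for r in recs),
    "csv_ids": lambda recs: "id\n" + "\n".join(str(r["id"]) for r in recs),
}
```

    The real format writers (COCO, YOLO, chunkers) plug into the same registry; adding a format means adding one transformation, not one pipeline.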


    Practical Recommendations

    1. Export early, validate immediately. Don't wait until fine-tuning day to discover a format issue. Export a 100-record sample in every target format during pipeline setup.

    2. Version your exports. Tag each export with a version identifier tied to the pipeline state. When you re-export after data updates, the previous version remains available for comparison.

    3. Include export metadata. Every export should include a manifest: record count, export format, pipeline version, validation results, and a hash of the exported data.

    4. Test downstream. The most reliable validation is to load the export into the downstream system and verify it works. A JSONL file that looks correct but crashes the training framework has a format issue that static validation missed.
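    Recommendations 2 and 3 come down to one small artifact per export. A sketch of a manifest builder (field names are illustrative; only the content hash is doing anything non-obvious):

```python
import hashlib
import json
from datetime import datetime, timezone

def build_manifest(export_bytes, fmt, pipeline_version, record_count):
    """Build an export manifest: format, pipeline version, record count,
    UTC timestamp, and a SHA-256 hash of the exported bytes. Two exports
    with the same hash are byte-identical; a changed hash flags drift."""
    return {
        "format": fmt,
        "pipeline_version": pipeline_version,
        "record_count": record_count,
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(export_bytes).hexdigest(),
    }
```

    Writing the manifest as a JSON sidecar next to each export (via `json.dumps`) gives the compliance answer from the audit-trail section a concrete location.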

    Ertas Data Suite's Export module produces all major formats — JSONL, COCO, YOLO, CSV, chunked text, and structured JSON — from a single project. Each export includes schema validation, downstream compatibility checks, and a complete audit log. Switching between export formats doesn't require reconfiguring the pipeline — the same prepared data exports to any format in one click.


    Connecting to the Pipeline

    Export is the final stage of the data preparation pipeline. The quality of every upstream stage — ingestion, cleaning, labeling, and augmentation — determines whether the exported dataset produces a model that works in production.

    For the complete pipeline overview, see How to Build an On-Premise Data Preparation Pipeline for LLM Fine-Tuning.
