    Multi-Format Export from a Single Data Pipeline: JSONL, COCO, YOLO, and RAG Chunks
    Tags: data-export, jsonl, coco, yolo, rag, data-pipeline, training-data, segment:service-provider

    How to export training data in JSONL, COCO, YOLO, CSV, and chunked text from a single pipeline — covering format requirements, validation, and avoiding parallel pipeline maintenance.

    Ertas Team

    You've ingested, cleaned, labeled, and augmented your dataset. Now you need to export it — and the downstream system determines the format.

    Fine-tuning a language model? JSONL. Training an object detection model? YOLO or COCO. Building a RAG pipeline? Chunked text with metadata. Training a classical ML classifier? CSV. Feeding an AI agent? Structured JSON with tool call schemas.

    The problem: most data preparation tools export one format. Maybe two. If your project requires three export formats — which is common when a client wants fine-tuning, RAG, and a dashboard all from the same source data — you end up maintaining three export scripts, each with its own format-specific bugs and validation gaps.

    This guide covers what each format requires, where they break, and how to export reliably from a single pipeline.


    Format Requirements: What Each One Actually Needs

    JSONL for LLM Fine-Tuning

    JSONL (JSON Lines) is the standard format for fine-tuning language models. Each line is a self-contained JSON object representing one training example.

    Instruction fine-tuning format:

    {"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
    

    Key requirements:

    • Valid JSON on every line — one malformed line can crash the training script
    • Consistent schema across all lines (same fields, same roles)
    • UTF-8 encoding, no BOM
    • No trailing commas, no comments
    • Token count per example within the model's context window

    Common gotchas:

    • Unescaped quotes in text content (the #1 JSONL formatting error)
    • Newlines within content fields not properly escaped
    • Mixed schemas (some lines with prompt/completion, others with messages)
    • Empty or null fields that the training framework doesn't handle
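    The gotchas above are all mechanically checkable before training day. A minimal sketch of a JSONL validator for the chat-messages schema (the function name and error format are illustrative, not part of any standard tooling):

```python
import json

def validate_jsonl(path, allowed_roles=("system", "user", "assistant")):
    """Return a list of (line_number, problem) tuples for a chat-format
    JSONL file. An empty list means the file passed these checks."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                errors.append((lineno, "empty line"))
                continue
            try:
                record = json.loads(line)  # catches unescaped quotes, bad newlines
            except json.JSONDecodeError as exc:
                errors.append((lineno, f"invalid JSON: {exc}"))
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append((lineno, "missing or empty 'messages' field"))
                continue
            for msg in messages:
                if msg.get("role") not in allowed_roles:
                    errors.append((lineno, f"unknown role: {msg.get('role')!r}"))
                if not msg.get("content"):
                    errors.append((lineno, "empty or null content"))
    return errors
```

    Token-count checks would slot in per message using the target model's tokenizer; they are omitted here because the tokenizer is model-specific.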

    COCO Format for Computer Vision

    COCO (Common Objects in Context) format uses a single JSON file containing image metadata, category definitions, and annotations.

    Structure:

    {
      "images": [{"id": 1, "file_name": "img001.jpg", "width": 1920, "height": 1080}],
      "categories": [{"id": 1, "name": "defect"}, {"id": 2, "name": "normal"}],
      "annotations": [{"id": 1, "image_id": 1, "category_id": 1, "bbox": [x, y, w, h], "area": 1234}]
    }
    

    Key requirements:

    • All IDs must be unique and cross-referenced correctly
    • Bounding box format is [x, y, width, height] (top-left origin)
    • Area must match the bounding box dimensions
    • Image dimensions must match actual file dimensions
    • Segmentation masks (if used) must be valid polygons

    Common gotchas:

    • ID mismatches between images and annotations
    • Bounding boxes that extend outside image boundaries
    • Zero-area annotations from labeling errors
    • Category IDs that don't match the categories list
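    Each of these gotchas reduces to a cross-reference or geometry check on the loaded JSON. A sketch of such a checker, assuming the COCO dict shape shown above (box-only annotations; segmentation checks are omitted):

```python
def validate_coco(coco):
    """Cross-check IDs and bounding boxes in a COCO-style dict.
    Returns a list of human-readable error strings."""
    errors = []
    image_dims = {img["id"]: (img["width"], img["height"]) for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}
    seen_ann_ids = set()
    for ann in coco["annotations"]:
        if ann["id"] in seen_ann_ids:
            errors.append(f"duplicate annotation id {ann['id']}")
        seen_ann_ids.add(ann["id"])
        if ann["image_id"] not in image_dims:
            errors.append(f"annotation {ann['id']}: unknown image_id {ann['image_id']}")
            continue
        if ann["category_id"] not in category_ids:
            errors.append(f"annotation {ann['id']}: unknown category_id {ann['category_id']}")
        x, y, w, h = ann["bbox"]
        img_w, img_h = image_dims[ann["image_id"]]
        if w <= 0 or h <= 0:
            errors.append(f"annotation {ann['id']}: zero-area bbox")
        if x < 0 or y < 0 or x + w > img_w or y + h > img_h:
            errors.append(f"annotation {ann['id']}: bbox outside image bounds")
        # For box-only annotations, area should equal w * h; polygon
        # annotations legitimately differ, so treat this as a warning there.
        if "area" in ann and abs(ann["area"] - w * h) > 1e-6:
            errors.append(f"annotation {ann['id']}: area does not match bbox")
    return errors
```

    Verifying that image dimensions match the actual files requires opening each image (e.g. with Pillow), which is worth the I/O cost on a sample even if not on the full set.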

    YOLO Format for Object Detection

    YOLO uses a directory structure with one text file per image, each line representing one annotation.

    Structure:

    <class_id> <x_center> <y_center> <width> <height>
    

    All coordinates are normalized to 0-1 relative to image dimensions.

    Key requirements:

    • One .txt file per image, same filename (different extension)
    • Coordinates normalized to [0, 1]
    • Class IDs are zero-indexed integers
    • A data.yaml file mapping class IDs to names

    Common gotchas:

    • Coordinates not normalized (using pixel values instead of 0-1 range)
    • Filename mismatches between image files and annotation files
    • Class IDs starting at 1 instead of 0
    • Missing annotation files for images with no objects (YOLO expects an empty file, not a missing file)
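    The normalization gotcha in particular is easy to guard against at conversion time. A sketch of a COCO-to-YOLO box converter that refuses to emit out-of-range values (the function name is illustrative):

```python
def coco_bbox_to_yolo(bbox, img_w, img_h, class_id):
    """Convert a COCO [x, y, w, h] box (pixels, top-left origin) into a
    YOLO annotation line: zero-indexed class id, then center x/y and
    width/height, all normalized to [0, 1]."""
    x, y, w, h = bbox
    x_center = (x + w / 2) / img_w
    y_center = (y + h / 2) / img_h
    values = (x_center, y_center, w / img_w, h / img_h)
    # Reject rather than clip: out-of-range values usually mean the
    # source box or image dimensions are wrong.
    if not all(0.0 <= v <= 1.0 for v in values):
        raise ValueError(f"normalized coordinates out of range: {values}")
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)
```

    The empty-file gotcha is handled at write time: emit a zero-byte .txt for every image, then append lines only for images that have annotations.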

    Chunked Text for RAG

    RAG (Retrieval-Augmented Generation) pipelines need text split into chunks with metadata for retrieval.

    Typical structure:

    {"chunk_id": "doc001_chunk_003", "text": "...", "metadata": {"source": "policy_v2.pdf", "page": 12, "section": "Termination"}}
    

    Key requirements:

    • Chunk size appropriate for the embedding model (typically 256-512 tokens)
    • Overlap between adjacent chunks (typically 10-20% of chunk size) to avoid splitting relevant context
    • Source metadata preserved for citation
    • Chunk boundaries that don't split sentences or semantic units

    Common gotchas:

    • Chunks that split mid-sentence, producing fragments that embed poorly
    • Missing or incorrect source metadata (makes citation impossible)
    • Chunk size inconsistency (some chunks are 50 tokens, others are 2,000)
    • Table content chunked as raw text, losing row-column structure
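    A sentence-aware chunker with overlap avoids the mid-sentence splits and wildly inconsistent sizes listed above. This sketch uses word counts as a stand-in for token counts and a naive regex sentence splitter; a production pipeline would use the embedding model's tokenizer and a proper sentence segmenter:

```python
import re

def chunk_text(text, max_words=120, overlap_sentences=1):
    """Split text into chunks on sentence boundaries, carrying a small
    sentence overlap between adjacent chunks. Word counts approximate
    token counts for the purposes of this sketch."""
    # Naive splitter: breaks after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, new_since_flush = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        new_since_flush += 1
        if sum(len(s.split()) for s in current) >= max_words:
            chunks.append(" ".join(current))
            # Carry the trailing sentence(s) forward as overlap.
            current = current[-overlap_sentences:] if overlap_sentences else []
            new_since_flush = 0
    if new_since_flush:  # flush a final chunk only if it has new material
        chunks.append(" ".join(current))
    return chunks
```

    Wrapping each chunk in the `{"chunk_id": ..., "text": ..., "metadata": ...}` record shown above is then a straightforward enumeration over the returned list, with source filename, page, and section carried through from ingestion.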

    CSV for Classical ML

    Flat tabular format with one row per example and one column per feature.

    Key requirements:

    • Consistent column count across all rows
    • Proper escaping of commas, quotes, and newlines within field values
    • Header row with descriptive column names
    • Consistent data types per column (no mixing strings and numbers)

    Common gotchas:

    • Text fields containing commas that break parsing
    • Inconsistent null representation (empty string vs. "null" vs. "N/A" vs. "None")
    • Encoding issues in non-ASCII text
    • Large text fields that make the CSV unwieldy for non-text tools
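    Most CSV escaping bugs come from hand-rolled string joining. Python's csv module handles commas, quotes, and embedded newlines correctly; the only decision left is a consistent null token. A sketch (the function name and `null_token` parameter are illustrative):

```python
import csv
import io

def export_csv(records, fieldnames, null_token=""):
    """Serialize dict records to CSV text with proper escaping and a
    single, consistent representation for missing values."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    for record in records:
        # Normalize nulls so downstream parsers see one token, not a
        # mix of "", "null", "N/A", and "None".
        row = {k: (null_token if record.get(k) is None else record[k])
               for k in fieldnames}
        writer.writerow(row)
    return buf.getvalue()
```

    Writing the output with an explicit `encoding="utf-8"` on the file handle covers the non-ASCII gotcha.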

    Structured JSON for AI Agents

    Agent training data requires tool call schemas, action-observation pairs, and structured decision records.

    Key requirements:

    • Tool call schemas match the actual tool signatures
    • Action-observation sequences are chronologically ordered
    • Each decision point includes the available actions and the chosen action
    • Error cases and edge cases are represented
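    These requirements can also be checked mechanically. A sketch of a trace validator, assuming a hypothetical record shape where each step has a `timestamp` and an `action` with a `tool` name and `args` dict, and tool schemas list their required arguments:

```python
def validate_agent_trace(trace, tool_schemas):
    """Check an action-observation trace: known tools, required arguments
    present, timestamps strictly increasing. Returns error strings."""
    errors = []
    last_ts = float("-inf")
    for i, step in enumerate(trace):
        if step["timestamp"] <= last_ts:
            errors.append(f"step {i}: out of chronological order")
        last_ts = step["timestamp"]
        tool = step["action"]["tool"]
        schema = tool_schemas.get(tool)
        if schema is None:
            errors.append(f"step {i}: unknown tool {tool!r}")
            continue
        missing = set(schema["required"]) - set(step["action"]["args"])
        if missing:
            errors.append(f"step {i}: missing args {sorted(missing)}")
    return errors
```

    The schema dict should be generated from the actual tool signatures rather than written by hand, so the two cannot drift apart.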

    The Parallel Pipeline Problem

    When each format requires a separate export pipeline, you end up maintaining parallel code paths:

    Source Data → JSONL Export Script → jsonl_output/
    Source Data → COCO Export Script  → coco_output/
    Source Data → RAG Chunk Script    → chunks_output/
    Source Data → CSV Export Script   → csv_output/
    

    Each script is a potential source of:

    Format bugs: An unescaped quote in the JSONL exporter that only triggers on certain records. A coordinate normalization error in the YOLO exporter that produces out-of-range values for certain image dimensions.

    Data drift: If you update the source data and re-export, do all four scripts pick up the changes? If the JSONL script processes the updated data but the COCO script was pointing at a cached copy, your exports are inconsistent.

    Validation gaps: Each script may or may not include validation. The JSONL script might validate JSON syntax but not check for empty fields. The COCO script might check ID references but not verify bounding box dimensions.

    Maintenance burden: Four scripts across four file formats, each with format-specific edge cases, each requiring updates when the source data schema changes.

    For service providers handling multiple client projects, this scales poorly. A format bug that affects one project likely affects others — but each project's scripts are separate.


    Export Validation: What to Check

    Validation should be a required step, not optional. Check these before considering any export complete:

    Universal Checks (All Formats)

    • Record count: Export contains the expected number of records
    • Completeness: No missing fields, no null values where values are required
    • Encoding: UTF-8 throughout, no encoding artifacts
    • Deduplication: No duplicate records in the export
    • Traceability: Every exported record maps back to a source record in the pipeline
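    The count, deduplication, and traceability checks share one shape: compare exported records against source records by a stable key. A sketch, assuming records are dicts with an `id` field:

```python
def universal_checks(source_records, exported_records, key="id"):
    """Run the format-agnostic export checks: record count, duplicates,
    and traceability back to the source. Returns a report dict."""
    exported_keys = [r[key] for r in exported_records]
    source_keys = {r[key] for r in source_records}
    return {
        "count_matches": len(exported_records) == len(source_records),
        # Keys appearing more than once in the export.
        "duplicates": sorted(k for k in set(exported_keys)
                             if exported_keys.count(k) > 1),
        # Exported records with no corresponding source record.
        "orphans": sorted(k for k in exported_keys if k not in source_keys),
    }
```

    Encoding and completeness checks are format-aware enough that they belong with the per-format validators below.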

    Format-Specific Checks

    • JSONL: Valid JSON on every line, consistent schema, token counts within limits
    • COCO: ID uniqueness and cross-references, bbox within image bounds, area calculation
    • YOLO: Coordinates in [0, 1], file-annotation pairing, class ID validity
    • RAG chunks: Chunk size within target range, no sentence splits, metadata present
    • CSV: Column count consistency, type consistency, proper escaping

    Downstream Compatibility Checks

    • JSONL: Load into the target training framework and verify it parses without errors
    • COCO: Run the COCO evaluation API on a sample to verify format compatibility
    • YOLO: Load into the target YOLO training script and verify it reads annotations correctly
    • RAG chunks: Embed a sample and verify retrieval produces expected results

    Single-Pipeline Export Architecture

    The alternative to parallel scripts is a single pipeline that produces multiple export formats from one data model:

    Source Data → Unified Pipeline → Export Module → JSONL
                                                  → COCO
                                                  → YOLO
                                                  → RAG Chunks
                                                  → CSV
                                                  → Structured JSON
    

    This architecture has specific advantages:

    One data model: All records exist in a single representation. Export to each format is a transformation from that representation — not a separate pipeline with its own parsing logic.

    One validation pass: Validate the data once in the unified model. Format-specific validation happens at the export boundary, not upstream.

    Consistent exports: When the source data changes, every export format reflects the change. No drift between formats.

    Audit trail: Every export is logged with the export format, timestamp, record count, and validation results. When a compliance team asks "what exactly was in the training dataset?", the answer is in one place.
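    In code, the "export module" in the diagram above can be as simple as a registry of pure transformations over the unified in-memory records. A minimal sketch (the registry names and lambda exporters are illustrative placeholders for real format writers):

```python
import json

def export_all(records, exporters):
    """Run every registered exporter over the same in-memory records.
    Because each exporter is a pure transformation of one data model,
    every format always reflects the same source state."""
    return {name: export_fn(records) for name, export_fn in exporters.items()}

# Hypothetical exporters registered against the unified data model:
exporters = {
    "jsonl": lambda recs: "\n".join(json.dumps(r) for r in recs),
    "csv_ids": lambda recs: "id\n" + "\n".join(str(r["id"]) for r in recs),
}
```

    The real format writers (COCO, YOLO, chunkers) plug into the same registry; adding a format means adding one transformation, not one pipeline.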


    Practical Recommendations

    1. Export early, validate immediately. Don't wait until fine-tuning day to discover a format issue. Export a 100-record sample in every target format during pipeline setup.

    2. Version your exports. Tag each export with a version identifier tied to the pipeline state. When you re-export after data updates, the previous version remains available for comparison.

    3. Include export metadata. Every export should include a manifest: record count, export format, pipeline version, validation results, and a hash of the exported data.

    4. Test downstream. The most reliable validation is to load the export into the downstream system and verify it works. A JSONL file that looks correct but crashes the training framework has a format issue that static validation missed.
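    Recommendations 2 and 3 come down to one small artifact per export. A sketch of a manifest builder (field names are illustrative; only the content hash is doing anything non-obvious):

```python
import hashlib
import json
from datetime import datetime, timezone

def build_manifest(export_bytes, fmt, pipeline_version, record_count):
    """Build an export manifest: format, pipeline version, record count,
    UTC timestamp, and a SHA-256 hash of the exported bytes. Two exports
    with the same hash are byte-identical; a changed hash flags drift."""
    return {
        "format": fmt,
        "pipeline_version": pipeline_version,
        "record_count": record_count,
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(export_bytes).hexdigest(),
    }
```

    Writing the manifest as a JSON sidecar next to each export (via `json.dumps`) gives the compliance answer from the audit-trail section a concrete location.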

    Ertas Data Suite's Export module produces all major formats — JSONL, COCO, YOLO, CSV, chunked text, and structured JSON — from a single project. Each export includes schema validation, downstream compatibility checks, and a complete audit log. Switching between export formats doesn't require reconfiguring the pipeline — the same prepared data exports to any format in one click.


    Connecting to the Pipeline

    Export is the final stage of the data preparation pipeline. The quality of every upstream stage — ingestion, cleaning, labeling, and augmentation — determines whether the exported dataset produces a model that works in production.

    For the complete pipeline overview, see How to Build an On-Premise Data Preparation Pipeline for LLM Fine-Tuning.
