
    Preparing RAG Datasets vs Fine-Tuning Datasets: Different Pipelines, Same Source Data

    RAG needs chunked, retrieval-optimized text. Fine-tuning needs input/output pairs. Both start from the same raw documents. Here's how to run parallel preparation pipelines from a single source.

    Ertas Team

    Most enterprise AI teams end up building two separate data pipelines: one to prepare documents for RAG (retrieval-augmented generation), and one to prepare training data for fine-tuning. Both pipelines start from the same raw documents — the same PDFs, the same internal wikis, the same policy manuals. They share the same ingestion and cleaning stages. Then they diverge.

    Running these as two independent pipelines means duplicating 60-70% of the work. You ingest the same PDFs twice. You clean the same text twice. You extract the same entities twice. You debug the same parsing errors twice. And when a document updates, you have to remember to re-process it in both pipelines — which inevitably falls through the cracks.

    The better approach is a unified pipeline that shares the common stages and branches into two export paths. One source, two outputs. This article covers exactly where the pipelines share work, where they diverge, and how to run both from a single data preparation project.

    The Two Output Formats

    Before diving into the pipeline, understand what each output needs to look like.

    RAG Output: Chunked Text + Metadata for Vector DB

    RAG systems retrieve relevant text chunks at query time and feed them to the model as context. The output is:

    {
      "chunk_id": "doc-4421-section-3-chunk-2",
      "text": "The maximum credit exposure for Category A clients shall not exceed 15% of tier-1 capital...",
      "metadata": {
        "source_document": "Credit Risk Policy v4.2",
        "section": "3. Exposure Limits",
        "page": 12,
        "date": "2025-11-01",
        "entities": ["Category A", "tier-1 capital"],
        "doc_type": "policy"
      },
      "embedding": [0.0234, -0.1891, 0.0442, ...]
    }
    

    Each chunk is self-contained enough to answer a question, carries metadata for filtering and attribution, and includes a vector embedding for semantic search.

    Fine-Tuning Output: Instruction/Response JSONL Pairs

    Fine-tuning datasets teach the model domain knowledge directly. The output is question-answer pairs derived from the source documents:

    {
      "messages": [
        {
          "role": "system",
          "content": "You are a credit risk analyst assistant."
        },
        {
          "role": "user",
          "content": "What is the maximum credit exposure limit for Category A clients?"
        },
        {
          "role": "assistant",
          "content": "The maximum credit exposure for Category A clients shall not exceed 15% of tier-1 capital, as defined in Credit Risk Policy v4.2, Section 3."
        }
      ]
    }
    

    Each pair captures a specific piece of knowledge from the source documents, formatted as a conversational exchange.

    Why You Often Need Both

    The decision is not usually "RAG or fine-tuning." For enterprise deployments with serious accuracy requirements, the answer is both.

    RAG handles the long tail: Documents that are updated frequently, niche knowledge that applies to rare queries, and any information where you need exact source attribution. RAG retrieves the source, so the user can verify.

    Fine-tuning handles the common core: The 200-500 questions that constitute 80% of all user queries. These are domain fundamentals — definitions, standard procedures, common calculations. A fine-tuned model answers these instantly without retrieval latency, and with higher consistency.

    A financial services team we worked with found that fine-tuning covered 78% of analyst queries (standard regulatory definitions, risk calculation methods, reporting procedures). RAG handled the remaining 22% (specific client data, recent regulatory updates, niche product details). The combined system outperformed either approach alone by 23% on answer accuracy.

    Building both outputs from a single pipeline is not just more efficient — it produces better results because the same cleaning and structuring work applies to both.

    Shared Stages: Ingestion, Cleaning, Entity Extraction

    The first three stages of the pipeline are identical regardless of the output format.

    Ingestion

    Parse raw documents into a standardized intermediate format. PDF, Word, HTML, email, wiki exports — all converted to clean text with structural markers (headers, sections, tables, lists) and source metadata.

    This stage is format-dependent, not output-dependent. A PDF is a PDF whether you are chunking it for RAG or extracting Q&A pairs for fine-tuning. Run it once.
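    The shape of the standardized intermediate format is what makes the later branching possible. A minimal sketch, assuming a simple dataclass-based record (field names here are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, field

@dataclass
class IntermediateDoc:
    # Hypothetical intermediate format: one record per source document,
    # consumed unchanged by both the RAG and fine-tuning branches.
    doc_id: str
    source_path: str
    text: str                                     # cleaned plain text
    sections: list = field(default_factory=list)  # [(header, start, end), ...]
    metadata: dict = field(default_factory=dict)  # doc_type, date, version...

doc = IntermediateDoc(
    doc_id="doc-4421",
    source_path="policies/credit_risk_v4.2.pdf",
    text="3. Exposure Limits\nThe maximum credit exposure...",
    sections=[("3. Exposure Limits", 0, 58)],
    metadata={"doc_type": "policy", "date": "2025-11-01"},
)
```

    Whatever the concrete format, the point is that everything downstream reads this record, not the original PDF.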

    Cleaning

    Remove OCR artifacts, deduplicate documents, normalize Unicode, strip boilerplate (headers, footers, watermarks). Validate that the cleaning process did not remove meaningful content.

    Cleaning errors propagate to both outputs. A misspelled term in the cleaned text becomes a misspelled chunk in the vector DB and a misspelled answer in the fine-tuning dataset. Fix it once at the cleaning stage, not twice at the export stage.
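    A minimal cleaning pass might look like the following sketch; the boilerplate pattern is an illustrative example, and a production pipeline would carry a corpus-specific pattern list plus near-duplicate fingerprinting:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Minimal cleaning pass: Unicode normalization, boilerplate
    stripping, whitespace repair. Patterns are illustrative only."""
    text = unicodedata.normalize("NFKC", raw)
    # Strip a hypothetical repeated footer line (page numbers, watermarks)
    text = re.sub(r"(?m)^\s*Page \d+ of \d+\s*$", "", text)
    # Collapse OCR-induced runs of spaces and blank lines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

def dedupe(docs: list[str]) -> list[str]:
    """Drop exact duplicates while preserving corpus order."""
    seen, out = set(), []
    for d in docs:
        key = d  # in practice, a content hash or near-dup fingerprint
        if key not in seen:
            seen.add(key)
            out.append(d)
    return out
```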

    Entity and Structure Extraction

    Identify document structure (sections, subsections, tables), extract named entities (products, regulations, people, dates), and tag metadata. This structural information feeds into both pipelines:

    • RAG uses it for metadata-filtered retrieval ("show me only documents from Section 3 of the Risk Policy")
    • Fine-tuning uses it for question generation ("What does Section 3 of the Risk Policy say about exposure limits?")

    Run entity extraction once. Both pipelines consume the same structural annotations.
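    As a toy illustration of the structural pass, numbered section headers can be pulled out with a single regex; a production pipeline would use layout-aware parsing plus an NER model rather than this sketch:

```python
import re

# Matches headers like "3. Exposure Limits" or "3.1 Category A Clients"
SECTION_RE = re.compile(r"(?m)^(\d+(?:\.\d+)*)\.?\s+([A-Z][^\n]+)$")

def extract_structure(text: str) -> list[dict]:
    """Toy structural pass: find numbered section headers and record
    where each begins, so both branches can address sections by number."""
    sections = []
    for m in SECTION_RE.finditer(text):
        sections.append({"number": m.group(1),
                         "title": m.group(2).strip(),
                         "start": m.start()})
    return sections
```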

    The Divergence Point

    After cleaning and structure extraction, the pipeline branches. The same cleaned, structured text feeds into two parallel processes.

    Branch A: RAG Pipeline

    Chunking: Split the structured text into retrieval-optimized chunks. The chunking strategy depends on document type:

    • Policy documents: chunk at the section level, with overlap at section boundaries
    • Technical manuals: chunk at the procedure/step level
    • Correspondence: chunk at the message level (for email threads) or paragraph level
    • Tables: keep tables as single chunks with surrounding context

    Target chunk size: 256-512 tokens for most embedding models. Include a context prefix (document title + section header) in each chunk so the chunk makes sense in isolation.
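    The sliding-window-with-prefix idea can be sketched as follows. This approximates tokens with whitespace-separated words for brevity; a real pipeline would count with the embedding model's own tokenizer, and the default sizes below are assumptions within the 256-512 range:

```python
def chunk_section(doc_title: str, section: str, body: str,
                  max_tokens: int = 384, overlap: int = 48) -> list[str]:
    """Section-level chunking with a context prefix so each chunk
    makes sense in isolation. 'Tokens' here are whitespace words."""
    prefix = f"{doc_title} > {section}\n"
    words = body.split()
    chunks, start = [], 0
    while start < len(words):
        window = words[start:start + max_tokens]
        chunks.append(prefix + " ".join(window))
        if start + max_tokens >= len(words):
            break
        start += max_tokens - overlap  # overlap at chunk boundaries
    return chunks
```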

    Metadata tagging: Attach filterable metadata to each chunk: source document, section, date, entities, document type, confidence score.

    Embedding: Run each chunk through an embedding model to generate vector representations. For on-premise deployments, use a local embedding model — BGE-large, E5-large, or the Nomic embedding model via Ollama. Batch processing: a single GPU embeds 10,000-50,000 chunks per hour depending on chunk length and model size.
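    The batching driver is independent of the embedding backend, so it can be written once against any model (a local BGE/E5 instance, an Ollama endpoint, etc.). A sketch, with `embed_fn` standing in for whatever backend you deploy:

```python
def embed_in_batches(chunks: list[str], embed_fn, batch_size: int = 64) -> list:
    """Batch driver around any embedding backend. `embed_fn` takes a
    list of strings and returns one vector per string; batching keeps
    GPU memory bounded during the overnight embedding run."""
    vectors = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_fn(chunks[i:i + batch_size]))
    return vectors
```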

    Indexing: Load chunks and embeddings into a vector database (Qdrant, Milvus, or Chroma, all available for on-premise deployment). Create metadata indexes for filtered search.
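    Most vector databases accept the same basic point shape: an id, a vector, and a filterable payload. A sketch using plain dicts, with the actual upload shown only as a comment since it depends on which client you deploy:

```python
def to_points(chunks: list[dict]) -> list[dict]:
    """Convert pipeline chunks into generic vector-DB points. Adapt the
    dicts to your client's point type (e.g. Qdrant's PointStruct)."""
    points = []
    for i, c in enumerate(chunks):
        points.append({
            "id": i,
            "vector": c["embedding"],
            # Payload carries the filterable metadata plus the chunk text
            "payload": c["metadata"] | {"text": c["text"]},
        })
    return points

# With a Qdrant client against your deployment, the upload is roughly:
#   client.upsert(collection_name="policies", points=[...])
```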

    Branch B: Fine-Tuning Pipeline

    Question generation: For each structural unit (section, paragraph, table), generate questions that the content answers. Use a local LLM for this:

    • Feed the content to Llama 3.3 70B or Qwen 2.5 72B via Ollama
    • Prompt: "Generate 3-5 questions that this text answers. Questions should be specific and answerable from the text alone."
    • Filter generated questions: remove duplicates, remove questions that are too vague or too specific

    This produces 5-15 questions per page of source content. A 100-page policy document yields 500-1,500 Q&A pairs before quality filtering.
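    The generation step above can be driven through Ollama's HTTP API. A sketch that builds the request payload; the model tag is an assumption (use whatever `ollama list` shows on your host), and the actual call is shown as a comment since it needs a running Ollama instance:

```python
PROMPT = ("Generate 3-5 questions that this text answers. Questions should "
          "be specific and answerable from the text alone.\n\nTEXT:\n{text}")

def build_ollama_request(text: str, model: str = "llama3.3:70b") -> dict:
    """Payload for Ollama's /api/generate endpoint, with streaming off
    so the response arrives as a single JSON object."""
    return {"model": model, "prompt": PROMPT.format(text=text), "stream": False}

# Sending it against a running Ollama instance, roughly:
#   import json, urllib.request
#   req = urllib.request.Request("http://localhost:11434/api/generate",
#                                data=json.dumps(build_ollama_request(section)).encode(),
#                                headers={"Content-Type": "application/json"})
#   raw_questions = json.loads(urllib.request.urlopen(req).read())["response"]
```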

    Answer extraction: For each generated question, extract the answer from the source text. The answer should be a direct response, not a copy-paste of the source paragraph. Use the local LLM to generate concise, accurate answers from the source content.

    Format conversion: Convert Q&A pairs into the JSONL format expected by your fine-tuning framework. Add system prompts appropriate to the use case (e.g., "You are a compliance assistant" or "You are a risk analyst").
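    A sketch of the conversion, producing the chat-messages shape shown earlier (one JSON object per line); helper names are illustrative:

```python
import json

def to_chat_records(pairs: list[tuple], system_prompt: str) -> list[dict]:
    """Convert (question, answer) pairs into chat-format records; each
    record serializes to one JSONL line."""
    records = []
    for q, a in pairs:
        records.append({"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": q},
            {"role": "assistant", "content": a},
        ]})
    return records

def write_jsonl(records: list[dict], path: str) -> None:
    """One JSON object per line, UTF-8, no ASCII escaping."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```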

    Validation: Check that every answer is supported by the source text — no hallucinated information, no facts from other documents, no made-up numbers. This is a critical quality gate. Reject any pair where the answer cannot be traced to a specific passage in the source.
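    A cheap first-pass version of this gate is lexical: measure how many of the answer's content words actually occur in the source passage. This only catches gross hallucinations; a real gate would add NLI or LLM-as-judge verification, and the 0.8 threshold below is an assumption:

```python
def is_grounded(answer: str, source: str, threshold: float = 0.8) -> bool:
    """First-pass grounding check: share of answer content words (>3
    chars) that also occur in the source passage."""
    answer_words = {w.lower().strip(".,;:") for w in answer.split() if len(w) > 3}
    source_words = {w.lower().strip(".,;:") for w in source.split()}
    if not answer_words:
        return False  # empty answers fail the gate outright
    overlap = len(answer_words & source_words) / len(answer_words)
    return overlap >= threshold
```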

    Quality Requirements Differ

    RAG and fine-tuning have different tolerance for noise, which affects how aggressively you need to filter.

    RAG tolerates some noise because the retrieval step acts as a secondary filter. If a chunk contains a minor OCR error, the retrieval system might still surface it for the right query, and the model can work around the error when generating the answer. A noise rate of 2-5% in the chunk corpus is acceptable for most RAG deployments.

    Fine-tuning requires high per-example accuracy because every training example directly shapes the model's behavior. A single training example with incorrect information teaches the model to produce that incorrect information confidently. Target a noise rate below 0.5% for fine-tuning datasets — which means aggressive validation and human review of generated Q&A pairs.

    This difference in quality tolerance means the fine-tuning branch needs heavier validation than the RAG branch. Budget expert review time accordingly: plan for domain experts to spot-check 5-10% of RAG chunks but 15-25% of fine-tuning pairs.
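    Those review budgets translate into a simple fixed-rate sampler; the seed keeps the sample reproducible across review sessions (rates and helper name are illustrative):

```python
import random

def review_sample(items: list, rate: float, seed: int = 0) -> list:
    """Draw a fixed-rate review sample: e.g. 0.05-0.10 of RAG chunks,
    0.15-0.25 of fine-tuning pairs. Always returns at least one item."""
    rng = random.Random(seed)
    k = max(1, round(len(items) * rate))
    return rng.sample(items, k)
```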

    Running the Unified Pipeline

    The practical workflow runs like this:

    1. Ingest all documents into the shared intermediate format. One pass.
    2. Clean the entire corpus. One pass.
    3. Extract structure and entities. One pass.
    4. Branch to RAG: Chunk, tag metadata, embed, index. This runs as a batch job, typically overnight.
    5. Branch to fine-tuning: Generate questions, extract answers, format JSONL, validate. This runs in parallel with Step 4 if you have sufficient compute, or sequentially if not.
    6. Expert review: Domain experts review a sample of RAG chunks (spot-check) and a larger sample of fine-tuning pairs (thorough review).
    7. Export: RAG chunks go to the vector database. Fine-tuning pairs go to the training pipeline.

    Steps 1-3 are shared work. Steps 4-5 are parallel but independent. Step 6 is the quality gate.
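    The seven steps reduce to a small orchestration skeleton: three shared passes, then two independent branches over the same structured corpus. All six stage functions below are pluggable placeholders, not a fixed API:

```python
def run_pipeline(raw_docs, ingest, clean, extract, rag_branch, ft_branch):
    """Skeleton of the unified flow. Steps 1-3 run once per document;
    steps 4-5 each consume the same structured corpus."""
    corpus = [extract(clean(ingest(d))) for d in raw_docs]  # steps 1-3, one pass
    rag_chunks = rag_branch(corpus)   # step 4: chunk, tag, embed, index
    ft_pairs = ft_branch(corpus)      # step 5: questions, answers, JSONL
    return rag_chunks, ft_pairs       # then steps 6-7: review and export
```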

    Time and Cost Savings

    Running two separate pipelines versus a unified pipeline — here are the numbers for a typical 10,000-document knowledge base:

    Stage             | Separate Pipelines | Unified Pipeline
    Ingestion         | 2x (20 hours)      | 1x (10 hours)
    Cleaning          | 2x (16 hours)      | 1x (8 hours)
    Entity extraction | 2x (12 hours)      | 1x (6 hours)
    RAG-specific      | 1x (8 hours)       | 1x (8 hours)
    FT-specific       | 1x (14 hours)      | 1x (14 hours)
    Total compute     | 70 hours           | 46 hours
    Expert review     | 2x (20 hours)      | 1.5x (15 hours)
    Bug fixing        | 2x (8 hours)       | 1x (4 hours)

    The unified pipeline saves approximately 33% in compute time and 25% in expert time. More importantly, it eliminates the consistency problem — both outputs are guaranteed to derive from the same cleaned source data.

    Document Updates

    When source documents change, the unified pipeline pays dividends again.

    With separate pipelines, a document update means: re-ingest in Pipeline A, re-clean in Pipeline A, re-chunk and re-embed in Pipeline A, then remember to also re-ingest in Pipeline B, re-clean in Pipeline B, re-generate Q&A pairs in Pipeline B. Miss the second pipeline and your RAG system has the updated information but your fine-tuned model still has the old version.

    With a unified pipeline, a document update triggers one re-processing pass through the shared stages, then automatic branching to both outputs. Both stay synchronized because they share a single source of truth.
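    The trigger for that re-processing pass is usually content hashing: record a digest per document at each run, then re-run both branches only for documents whose digest changed. A minimal sketch:

```python
import hashlib

def changed_docs(current: dict, processed: dict) -> list[str]:
    """Return the doc_ids whose content hash differs from the hash
    recorded at the last pipeline run (new docs count as changed).
    `current` maps doc_id -> text; `processed` maps doc_id -> digest."""
    stale = []
    for doc_id, text in current.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if processed.get(doc_id) != digest:
            stale.append(doc_id)
    return stale
```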

    For organizations that update policy documents quarterly, retrain models monthly, and refresh RAG indexes weekly, this synchronization is not a convenience — it is a requirement for consistent agent behavior.

    How Ertas Data Suite Handles Dual Output

    Ertas Data Suite implements this unified pipeline natively. You create one project, import your documents once, run cleaning and structuring once, then configure two export targets: vector DB format for RAG and JSONL for fine-tuning.

    The system tracks which documents have been processed through which branches, handles incremental updates when documents change, and maintains an audit trail showing the lineage from source document to both RAG chunk and fine-tuning pair.

    For teams currently running separate pipelines — or worse, preparing RAG data with one tool and fine-tuning data with another — consolidating into a single pipeline eliminates a class of consistency bugs that are expensive to debug in production.

    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
