
Preparing RAG Datasets vs Fine-Tuning Datasets: Different Pipelines, Same Source Data
RAG needs chunked, retrieval-optimized text. Fine-tuning needs input/output pairs. Both start from the same raw documents. Here's how to run parallel preparation pipelines from a single source.
Most enterprise AI teams end up building two separate data pipelines: one to prepare documents for RAG (retrieval-augmented generation), and one to prepare training data for fine-tuning. Both pipelines start from the same raw documents — the same PDFs, the same internal wikis, the same policy manuals. They share the same ingestion and cleaning stages. Then they diverge.
Running these as two independent pipelines means duplicating 60-70% of the work. You ingest the same PDFs twice. You clean the same text twice. You extract the same entities twice. You debug the same parsing errors twice. And when a document updates, you have to remember to re-process it in both pipelines — which inevitably falls through the cracks.
The better approach is a unified pipeline that shares the common stages and branches into two export paths. One source, two outputs. This article covers exactly where the pipelines share work, where they diverge, and how to run both from a single data preparation project.
The Two Output Formats
Before diving into the pipeline, understand what each output needs to look like.
RAG Output: Chunked Text + Metadata for Vector DB
RAG systems retrieve relevant text chunks at query time and feed them to the model as context. The output is:
{
  "chunk_id": "doc-4421-section-3-chunk-2",
  "text": "The maximum credit exposure for Category A clients shall not exceed 15% of tier-1 capital...",
  "metadata": {
    "source_document": "Credit Risk Policy v4.2",
    "section": "3. Exposure Limits",
    "page": 12,
    "date": "2025-11-01",
    "entities": ["Category A", "tier-1 capital"],
    "doc_type": "policy"
  },
  "embedding": [0.0234, -0.1891, 0.0442, ...]
}
Each chunk is self-contained enough to answer a question, carries metadata for filtering and attribution, and includes a vector embedding for semantic search.
Fine-Tuning Output: Instruction/Response JSONL Pairs
Fine-tuning datasets teach the model domain knowledge directly. The output is question-answer pairs derived from the source documents:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a credit risk analyst assistant."
    },
    {
      "role": "user",
      "content": "What is the maximum credit exposure limit for Category A clients?"
    },
    {
      "role": "assistant",
      "content": "The maximum credit exposure for Category A clients shall not exceed 15% of tier-1 capital, as defined in Credit Risk Policy v4.2, Section 3."
    }
  ]
}
Each pair captures a specific piece of knowledge from the source documents, formatted as a conversational exchange.
Why You Often Need Both
The decision is not usually "RAG or fine-tuning." For enterprise deployments with serious accuracy requirements, the answer is both.
RAG handles the long tail: Documents that are updated frequently, niche knowledge that applies to rare queries, and any information where you need exact source attribution. RAG retrieves the source, so the user can verify.
Fine-tuning handles the common core: The 200-500 questions that constitute 80% of all user queries. These are domain fundamentals — definitions, standard procedures, common calculations. A fine-tuned model answers these instantly without retrieval latency, and with higher consistency.
A financial services team we worked with found that fine-tuning covered 78% of analyst queries (standard regulatory definitions, risk calculation methods, reporting procedures). RAG handled the remaining 22% (specific client data, recent regulatory updates, niche product details). The combined system outperformed either approach alone by 23% on answer accuracy.
Building both outputs from a single pipeline is not just more efficient — it produces better results because the same cleaning and structuring work applies to both.
Shared Stages: Ingestion, Cleaning, Entity Extraction
The first three stages of the pipeline are identical regardless of the output format.
Ingestion
Parse raw documents into a standardized intermediate format. PDF, Word, HTML, email, wiki exports — all converted to clean text with structural markers (headers, sections, tables, lists) and source metadata.
This stage is format-dependent, not output-dependent. A PDF is a PDF whether you are chunking it for RAG or extracting Q&A pairs for fine-tuning. Run it once.
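As a minimal sketch, the output of ingestion can be a single intermediate record type that every format-specific parser produces. The field names and parser helpers below are illustrative, not a fixed schema:

# Shared intermediate record produced by ingestion, consumed by both branches.
# Field names are illustrative; parse_pdf/parse_html are hypothetical parsers.
from dataclasses import dataclass, field

@dataclass
class IntermediateDoc:
    doc_id: str                                    # stable identifier, e.g. a content hash
    title: str
    doc_type: str                                  # "policy", "manual", "email", ...
    text: str                                      # plain text with structural markers
    sections: list = field(default_factory=list)   # filled in by structure extraction
    source_path: str = ""
    version: str = ""

def ingest(path: str) -> IntermediateDoc:
    """Dispatch to a format-specific parser; every parser returns the same record type."""
    if path.endswith(".pdf"):
        return parse_pdf(path)       # hypothetical PDF parser
    if path.endswith((".html", ".htm")):
        return parse_html(path)      # hypothetical HTML parser
    raise ValueError(f"unsupported format: {path}")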
Cleaning
Remove OCR artifacts, deduplicate documents, normalize Unicode, strip boilerplate (headers, footers, watermarks). Validate that the cleaning process did not remove meaningful content.
Cleaning errors propagate to both outputs. A misspelled term in the cleaned text becomes a misspelled chunk in the vector DB and a misspelled answer in the fine-tuning dataset. Fix it once at the cleaning stage, not twice at the export stage.
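A minimal cleaning pass might look like the following, assuming plain-text input from ingestion. The boilerplate pattern is a placeholder; real pipelines add OCR-specific fixes and fuzzy deduplication:

# Normalize Unicode, strip repeated boilerplate, collapse whitespace, drop exact duplicates.
import hashlib
import re
import unicodedata

BOILERPLATE = re.compile(r"^(Page \d+ of \d+|CONFIDENTIAL.*)$", re.MULTILINE)  # illustrative pattern

def clean_text(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)     # normalize Unicode variants
    text = BOILERPLATE.sub("", text)               # strip headers/footers/watermark lines
    text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces and tabs
    return re.sub(r"\n{3,}", "\n\n", text).strip()

def dedupe(docs: list) -> list:
    """Drop exact duplicates by hashing the cleaned text."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.text.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique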
Entity and Structure Extraction
Identify document structure (sections, subsections, tables), extract named entities (products, regulations, people, dates), and tag metadata. This structural information feeds into both pipelines:
- RAG uses it for metadata-filtered retrieval ("show me only documents from Section 3 of the Risk Policy")
- Fine-tuning uses it for question generation ("What does Section 3 of the Risk Policy say about exposure limits?")
Run entity extraction once. Both pipelines consume the same structural annotations.
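A rough sketch of the shared annotations, assuming numbered section headings and a small dictionary of known entities. Production pipelines typically use an NER model or a local LLM here; the regex and entity list are stand-ins:

# Tag sections and entities once; both branches read these annotations.
import re

SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.+)$", re.MULTILINE)
KNOWN_ENTITIES = ["Category A", "tier-1 capital"]   # illustrative dictionary only

def extract_structure(text: str) -> list:
    """Return (section_number, heading, start_offset) tuples."""
    return [(m.group(1), m.group(2), m.start()) for m in SECTION_RE.finditer(text)]

def tag_entities(text: str) -> list:
    """Naive dictionary lookup; swap in an NER model for real use."""
    return [e for e in KNOWN_ENTITIES if e.lower() in text.lower()]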
The Divergence Point
After cleaning and structure extraction, the pipeline branches. The same cleaned, structured text feeds into two parallel processes.
Branch A: RAG Pipeline
Chunking: Split the structured text into retrieval-optimized chunks. The chunking strategy depends on document type:
- Policy documents: chunk at the section level, with overlap at section boundaries
- Technical manuals: chunk at the procedure/step level
- Correspondence: chunk at the message level (for email threads) or paragraph level
- Tables: keep tables as single chunks with surrounding context
Target chunk size: 256-512 tokens for most embedding models. Include a context prefix (document title + section header) in each chunk so the chunk makes sense in isolation.
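A section-level chunker sketch with a context prefix and overlap. Token counting is approximated by whitespace splitting here; swap in the tokenizer of your embedding model for accurate budgets:

# Split one section into overlapping chunks, each prefixed with title + heading
# so it makes sense in isolation. Defaults are assumptions, not recommendations.
def chunk_section(doc_title: str, heading: str, section_text: str,
                  max_tokens: int = 384, overlap: int = 48) -> list:
    prefix = f"{doc_title} | {heading}\n"
    words = section_text.split()
    chunks, start = [], 0
    while start < len(words):
        window = words[start:start + max_tokens]
        chunks.append({
            "text": prefix + " ".join(window),
            "metadata": {"source_document": doc_title, "section": heading},
        })
        if start + max_tokens >= len(words):
            break
        start += max_tokens - overlap              # overlap across chunk boundaries
    return chunks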
Metadata tagging: Attach filterable metadata to each chunk: source document, section, date, entities, document type, confidence score.
Embedding: Run each chunk through an embedding model to generate vector representations. For on-premise deployments, use a local embedding model — BGE-large, E5-large, or the Nomic embedding model via Ollama. Batch processing: a single GPU embeds 10,000-50,000 chunks per hour depending on chunk length and model size.
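A minimal embedding pass using sentence-transformers with a locally hosted BGE model. The model name and batch size are illustrative; any on-premise embedding model works the same way:

# Embed chunk texts in batches and attach the vectors to the chunk records.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")   # runs fully on-premise

def embed_chunks(chunks: list) -> list:
    texts = [c["text"] for c in chunks]
    vectors = model.encode(texts, batch_size=64, normalize_embeddings=True)
    for chunk, vec in zip(chunks, vectors):
        chunk["embedding"] = vec.tolist()
    return chunks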
Indexing: Load chunks and embeddings into a vector database (Qdrant, Milvus, or Chroma, all available for on-premise deployment). Create metadata indexes for filtered search.
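Loading the embedded chunks into Qdrant, one of the on-premise options named above, might look like this. Collection name and vector size are assumptions tied to the embedding model, and exact client calls vary by qdrant-client version:

# Upsert embedded chunks with their metadata as the payload for filtered search.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(host="localhost", port=6333)
client.recreate_collection(
    collection_name="policy_chunks",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),   # 1024 = BGE-large dimension
)

def index_chunks(chunks: list) -> None:
    points = [
        PointStruct(id=i, vector=c["embedding"], payload=c["metadata"] | {"text": c["text"]})
        for i, c in enumerate(chunks)
    ]
    client.upsert(collection_name="policy_chunks", points=points)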
Branch B: Fine-Tuning Pipeline
Question generation: For each structural unit (section, paragraph, table), generate questions that the content answers. Use a local LLM for this:
- Feed the content to Llama 3.3 70B or Qwen 2.5 72B via Ollama
- Prompt: "Generate 3-5 questions that this text answers. Questions should be specific and answerable from the text alone."
- Filter generated questions: remove duplicates, remove questions that are too vague or too specific
This produces 5-15 questions per page of source content. A 100-page policy document yields 500-1,500 Q&A pairs before quality filtering.
Answer extraction: For each generated question, extract the answer from the source text. The answer should be a direct response, not a copy-paste of the source paragraph. Use the local LLM to generate concise, accurate answers from the source content.
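A sketch of both steps using the ollama Python client against a local model. The model tag and prompts are illustrative, and the output parsing is deliberately simple:

# Generate candidate questions for a section, then extract a concise answer
# grounded in that same section. Assumes the model is already pulled locally.
import ollama

MODEL = "llama3.3:70b"   # assumption: adjust to whichever local model you run

def generate_questions(section_text: str) -> list:
    prompt = (
        "Generate 3-5 questions that this text answers. Questions should be "
        "specific and answerable from the text alone. One question per line.\n\n"
        + section_text
    )
    reply = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
    lines = reply["message"]["content"].splitlines()
    return [q.strip("- ").strip() for q in lines if q.strip().endswith("?")]

def extract_answer(section_text: str, question: str) -> str:
    prompt = (
        "Answer the question using only the text below. Be concise and accurate.\n\n"
        f"Question: {question}\n\nText:\n{section_text}"
    )
    reply = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"].strip()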
Format conversion: Convert Q&A pairs into the JSONL format expected by your fine-tuning framework. Add system prompts appropriate to the use case (e.g., "You are a compliance assistant" or "You are a risk analyst").
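Writing accepted pairs into the chat-messages JSONL shown earlier is mechanical; the system prompt is a per-project choice:

# Serialize (question, answer) pairs as one chat-format JSON object per line.
import json

SYSTEM_PROMPT = "You are a credit risk analyst assistant."   # per-use-case choice

def write_jsonl(pairs: list, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in pairs:
            record = {"messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")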
Validation: Check that every answer is supported by the source text — no hallucinated information, no facts from other documents, no made-up numbers. This is a critical quality gate. Reject any pair where the answer cannot be traced to a specific passage in the source.
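A crude automated gate can run before human review: reject pairs whose answer contains numbers or terms absent from the source passage. This is a cheap first filter under simple assumptions, not a substitute for the expert checks described here:

# Flag answers that cannot plausibly be traced to the source text.
import re

def is_supported(answer: str, source_text: str) -> bool:
    source_lower = source_text.lower()
    # every number in the answer must appear verbatim in the source
    for number in re.findall(r"\d+(?:\.\d+)?%?", answer):
        if number not in source_text:
            return False
    # most content words in the answer should appear in the source
    words = re.findall(r"[a-z]{5,}", answer.lower())
    hits = sum(1 for w in words if w in source_lower)
    return not words or hits / len(words) >= 0.6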
Quality Requirements Differ
RAG and fine-tuning have different tolerance for noise, which affects how aggressively you need to filter.
RAG tolerates some noise because the retrieval step acts as a secondary filter. If a chunk contains a minor OCR error, the retrieval system might still surface it for the right query, and the model can work around the error when generating the answer. A noise rate of 2-5% in the chunk corpus is acceptable for most RAG deployments.
Fine-tuning requires high per-example accuracy because every training example directly shapes the model's behavior. A single training example with incorrect information teaches the model to produce that incorrect information confidently. Target a noise rate below 0.5% for fine-tuning datasets — which means aggressive validation and human review of generated Q&A pairs.
This difference in quality tolerance means the fine-tuning branch needs heavier validation than the RAG branch. Budget expert review time accordingly: plan for domain experts to spot-check 5-10% of RAG chunks but 15-25% of fine-tuning pairs.
Running the Unified Pipeline
The practical workflow runs like this:
- Ingest all documents into the shared intermediate format. One pass.
- Clean the entire corpus. One pass.
- Extract structure and entities. One pass.
- Branch to RAG: Chunk, tag metadata, embed, index. This runs as a batch job, typically overnight.
- Branch to fine-tuning: Generate questions, extract answers, format JSONL, validate. This runs in parallel with Step 4 if you have sufficient compute, or sequentially if not.
- Expert review: Domain experts review a sample of RAG chunks (spot-check) and a larger sample of fine-tuning pairs (thorough review).
- Export: RAG chunks go to the vector database. Fine-tuning pairs go to the training pipeline.
Steps 1-3 are shared work. Steps 4-5 are parallel but independent. Step 6 is the quality gate.
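In orchestration code, the branch point might look like the following, reusing the sketches above. The iter_section_texts helper (slicing document text by section) is hypothetical:

# Shared stages run once; both exports consume the same structured documents.
def run_pipeline(paths: list) -> None:
    docs = [ingest(p) for p in paths]                    # step 1: ingest
    for doc in docs:
        doc.text = clean_text(doc.text)                  # step 2: clean
    docs = dedupe(docs)
    for doc in docs:
        doc.sections = extract_structure(doc.text)       # step 3: structure/entities

    # Step 4: RAG branch
    chunks = []
    for doc in docs:
        for heading, body in iter_section_texts(doc):    # hypothetical section slicer
            chunks.extend(chunk_section(doc.title, heading, body))
    index_chunks(embed_chunks(chunks))

    # Step 5: fine-tuning branch, consuming the same docs
    pairs = []
    for doc in docs:
        for heading, body in iter_section_texts(doc):
            for q in generate_questions(body):
                a = extract_answer(body, q)
                if is_supported(a, body):                # automated part of the step-6 gate
                    pairs.append((q, a))
    write_jsonl(pairs, "finetune_dataset.jsonl")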
Time and Cost Savings
Running two separate pipelines versus a unified pipeline — here are the numbers for a typical 10,000-document knowledge base:
| Stage | Separate Pipelines | Unified Pipeline |
|---|---|---|
| Ingestion | 2x (20 hours) | 1x (10 hours) |
| Cleaning | 2x (16 hours) | 1x (8 hours) |
| Entity extraction | 2x (12 hours) | 1x (6 hours) |
| RAG-specific | 1x (8 hours) | 1x (8 hours) |
| FT-specific | 1x (14 hours) | 1x (14 hours) |
| Total compute | 70 hours | 46 hours |
| Expert review | 2x (20 hours) | 1.5x (15 hours) |
| Bug fixing | 2x (8 hours) | 1x (4 hours) |
The unified pipeline cuts compute time by roughly a third (70 hours down to 46) and expert review time by 25% (20 hours down to 15). More importantly, it eliminates the consistency problem — both outputs are guaranteed to derive from the same cleaned source data.
Document Updates
When source documents change, the unified pipeline pays dividends again.
With separate pipelines, a document update means: re-ingest in Pipeline A, re-clean in Pipeline A, re-chunk and re-embed in Pipeline A, then remember to also re-ingest in Pipeline B, re-clean in Pipeline B, re-generate Q&A pairs in Pipeline B. Miss the second pipeline and your RAG system has the updated information but your fine-tuned model still has the old version.
With a unified pipeline, a document update triggers one re-processing pass through the shared stages, then automatic branching to both outputs. Both stay synchronized because they share a single source of truth.
For organizations that update policy documents quarterly, retrain models monthly, and refresh RAG indexes weekly, this synchronization is not a convenience — it is a requirement for consistent agent behavior.
How Ertas Data Suite Handles Dual Output
Ertas Data Suite implements this unified pipeline natively. You create one project, import your documents once, run cleaning and structuring once, then configure two export targets: vector DB format for RAG and JSONL for fine-tuning.
The system tracks which documents have been processed through which branches, handles incremental updates when documents change, and maintains an audit trail showing the lineage from source document to both RAG chunk and fine-tuning pair.
For teams currently running separate pipelines — or worse, preparing RAG data with one tool and fine-tuning data with another — consolidating into a single pipeline eliminates a class of consistency bugs that are expensive to debug in production.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
Further Reading
- Fine-Tuning vs RAG: When to Use Which — The strategic decision framework for choosing between RAG, fine-tuning, or both.
- Enterprise Training Data Preparation for Fine-Tuning — Deep dive into the fine-tuning data preparation workflow for enterprise teams.
- Multi-Format Export: JSONL, COCO, YOLO Pipeline — How to support multiple export formats from a single data preparation pipeline.