
    RAG Pipeline Architecture: Indexing vs Retrieval as Separate Concerns

    Most RAG implementations tangle indexing and retrieval into one codebase. Separating them into distinct pipelines — each independently observable, deployable, and maintainable — is how production RAG systems stay reliable.

Ertas Team

    The typical RAG tutorial starts and ends in the same script: load documents, chunk them, embed them, store them in a vector database, then query that database at inference time. The whole thing runs in one process. It works for a demo. It does not work for production.

    Production RAG systems serve real users against real data that changes constantly. Documents get added, updated, and deleted. Embedding models get upgraded. Chunking strategies get refined. Meanwhile, the retrieval path needs to stay fast, stable, and observable — even while the index is being rebuilt. When indexing and retrieval are tangled together, changing one breaks the other.

    The best RAG architecture for production treats indexing and retrieval as two separate pipelines that share only one thing: the vector store.

    Why the Monolithic RAG Script Breaks Down

    In a single-script RAG implementation, the same codebase handles document ingestion, chunking, embedding, storage, query processing, vector search, context assembly, and response generation. This creates several failure modes that surface only at scale.

    Reindexing blocks retrieval. When you need to reprocess your document corpus — because you changed your chunking strategy, upgraded your embedding model, or received a batch of new documents — the retrieval path is either blocked or operating against a partially updated index. Users experience degraded results or outright downtime.

    No independent scaling. Indexing is a batch workload. It is CPU-intensive during parsing and cleanup, GPU-intensive during embedding, and I/O-intensive during vector writes. Retrieval is a latency-sensitive online workload. It needs fast embedding of short queries and fast approximate nearest neighbor search. These workloads have fundamentally different resource profiles. Running them in the same process means neither is optimized.

    Debugging becomes archaeology. When retrieval quality drops, you need to determine whether the problem is in how documents were parsed, how they were chunked, how they were embedded, or how the retrieval query is being constructed. In a monolithic pipeline, these concerns are interleaved across the same codebase. Tracing a quality issue back to its root cause requires reading through the entire system.

Version management becomes impractical. You want to A/B test a new chunking strategy against the current one. Or you want to roll back an embedding model change that degraded results. In a monolithic system, these operations require reprocessing the entire corpus and redeploying the entire application.

    The Two-Pipeline Architecture

    Separating RAG into an indexing pipeline and a retrieval pipeline creates clear boundaries with a well-defined contract between them.

    The Indexing Pipeline (Batch)

The indexing pipeline processes documents into vector representations. It runs on a schedule or in response to triggers — new documents arriving, a model upgrade, or a manual reindex request. Its stages, with a code sketch after the list, are:

    1. File Import — Ingest documents from source systems (file system, S3, SharePoint, database exports). Handle format detection and deduplication.
    2. Parser — Extract text from structured and unstructured formats (PDF, DOCX, HTML, Markdown, CSV). Preserve document metadata and structural information.
3. Clean — Normalize text, remove boilerplate headers and footers, handle encoding issues, strip irrelevant markup. This stage has an outsized impact on retrieval quality, yet it is often the stage teams invest in least.
    4. RAG Chunker — Split cleaned text into retrieval units. This is where chunking strategy lives — fixed-size with overlap, semantic chunking, or document-structure-aware chunking. The chunker is the most frequently iterated component in a mature RAG system.
    5. Embedding — Convert chunks into vector representations using the embedding model. This is the most compute-intensive stage and benefits from batching and GPU acceleration.
    6. Vector Store Writer — Write embeddings and their associated metadata to the vector database. Handle upserts, deletions, and index management.
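A minimal sketch of these stages as plain functions. The `embed_batch` model call and the vector store client's `upsert` method are hypothetical stand-ins rather than any specific product's API, and fixed-size overlapping chunking stands in for whatever strategy you actually use:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    position: int  # chunk index within the source document

def clean(raw_text: str) -> str:
    # Normalize whitespace; a real cleaner also strips boilerplate
    # headers/footers and repairs encoding issues.
    return " ".join(raw_text.split())

def chunk(doc_id: str, text: str, size: int = 800, overlap: int = 100) -> list[Chunk]:
    # Fixed-size chunking with overlap: the simplest strategy, and the
    # component you will most likely swap out first.
    chunks, start, i = [], 0, 0
    while start < len(text):
        chunks.append(Chunk(doc_id, text[start:start + size], i))
        start += size - overlap
        i += 1
    return chunks

def index_documents(docs: dict[str, str], store, embed_batch) -> None:
    # docs maps doc_id -> raw text. store and embed_batch are injected
    # dependencies, so each stage stays independently testable.
    for doc_id, raw in docs.items():
        chunks = chunk(doc_id, clean(raw))
        vectors = embed_batch([c.text for c in chunks])  # batch for throughput
        store.upsert(
            ids=[f"{c.doc_id}:{c.position}" for c in chunks],
            vectors=vectors,
            metadata=[{"doc_id": c.doc_id, "position": c.position, "text": c.text}
                      for c in chunks],
        )
```

Because the store and the embedder are passed in rather than hardcoded, every stage above can be exercised in isolation.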

    Each stage is independently testable. You can validate parser output without running the embedder. You can swap chunking strategies and compare their output before committing to a full reindex. You can upgrade the embedding model and write to a new collection without touching the existing one.

    The Retrieval Pipeline (Live)

The retrieval pipeline handles real-time queries. It is a latency-sensitive online service. Its stages are as follows, sketched in code after the list:

    1. API Endpoint — Accept incoming queries with authentication, rate limiting, and request validation.
2. Query Embedder — Convert the user query into a vector. This must use the same embedding model and configuration that were used during indexing; a mismatch here silently degrades results.
    3. Vector Search — Perform approximate nearest neighbor search against the vector store. Apply metadata filters, handle multi-collection queries if running multiple index versions.
    4. Context Assembler — Take the retrieved chunks, rerank them if needed, assemble them into a coherent context window, and format them for the downstream LLM. This stage manages token budgets and deduplication.
    5. API Response — Return the assembled context (or a generated response, if an LLM is integrated) along with source attribution metadata.
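A sketch of the live path under the same assumptions as before: `store.search` is a hypothetical ANN call returning score-ordered hits with their metadata, and a crude character count stands in for real tokenization:

```python
def retrieve(query: str, store, embed_query, top_k: int = 10,
             token_budget: int = 3000) -> dict:
    # Stage 2: must be the same embedding model and config used at indexing.
    query_vec = embed_query(query)
    # Stage 3: approximate nearest neighbor search, hits assumed score-ordered.
    hits = store.search(vector=query_vec, top_k=top_k)
    # Stage 4: context assembly with deduplication and a rough token budget.
    seen, context, used = set(), [], 0
    for hit in hits:
        key = (hit["doc_id"], hit["position"])
        if key in seen:
            continue
        seen.add(key)
        cost = len(hit["text"]) // 4  # crude chars-to-tokens estimate
        if used + cost > token_budget:
            break
        context.append(hit)
        used += cost
    # Stage 5: return the assembled context plus source attribution.
    return {
        "context": "\n\n".join(h["text"] for h in context),
        "sources": [{"doc_id": h["doc_id"], "position": h["position"]}
                    for h in context],
    }
```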

    The retrieval pipeline has completely different operational characteristics than the indexing pipeline. It needs to be always available, respond within milliseconds, and handle concurrent requests. It does not need to parse documents, manage chunking strategies, or handle batch embedding jobs.

    The Contract Between Pipelines

    The vector store is the interface between the two pipelines. The contract is straightforward: the indexing pipeline writes vectors with a known dimensionality, a known metadata schema, and a known collection naming convention. The retrieval pipeline reads from those collections using the same embedding model.

    This contract enables independent deployment. You can deploy a new version of the indexing pipeline — with a different chunking strategy or a new embedding model — without redeploying the retrieval pipeline. You write to a new collection, validate the results, and then point the retrieval pipeline at the new collection. Rollback is pointing it back at the old one.
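One way to make this contract explicit is a small piece of configuration that both pipelines read. All names below are illustrative, including the model name and the `describe_collection` call:

```python
# The contract, written down once and read by both pipelines.
INDEX_CONTRACT = {
    "collection": "docs_v2",               # where the next index build writes
    "embedding_model": "all-MiniLM-L6-v2", # must match on both sides
    "dimensions": 384,
    "metadata_schema": ["doc_id", "position", "text"],
}

# The retrieval pipeline reads a single pointer instead of hardcoding a
# collection. Cutover is changing this value to "docs_v2" after validation;
# rollback is changing it back.
ACTIVE_COLLECTION = "docs_v1"

def validate_contract(store, contract: dict) -> None:
    # Fail fast before cutover if the new collection violates the contract.
    info = store.describe_collection(contract["collection"])  # hypothetical call
    if info["dimensions"] != contract["dimensions"]:
        raise ValueError("embedding dimensionality mismatch")
```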

    What This Looks Like in Practice

    In Ertas Data Suite, both pipelines are built as visual node graphs on the same canvas. The indexing pipeline is a chain of nodes: File Import, Parser, Clean, RAG Chunker, Embedding, Vector Store Writer. The retrieval pipeline is a separate chain: API Endpoint, Query Embedder, Vector Search, Context Assembler, API Response. They sit side by side on the canvas, visually distinct but sharing the vector store connection.

    This visual separation makes the architecture tangible. You can see that the two pipelines are independent. You can modify the chunking node in the indexing pipeline without touching the retrieval pipeline. You can add a reranking node to the retrieval pipeline without triggering a reindex. Each pipeline runs on its own schedule — the indexing pipeline as a batch job, the retrieval pipeline as a persistent service.

    Because Ertas runs on-premise as a desktop application, the entire architecture stays within your infrastructure. The vector store is local. The embedding models run locally. No document content or query data leaves the machine. For enterprises with data sovereignty requirements, this eliminates the compliance overhead of evaluating cloud RAG services.

    Operational Benefits of Separation

    Independent observability. Each pipeline gets its own metrics. Indexing: documents processed per hour, chunk distribution statistics, embedding throughput, write latency. Retrieval: query latency at p50/p95/p99, relevance scores, cache hit rates, concurrent query count. When retrieval quality degrades, you check retrieval metrics first. If retrieval is healthy, the problem is in the index — and you check indexing metrics to find the root cause.
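As a sketch, the separation can start as two metric namespaces, shown here with the Python prometheus_client library; the metric names are illustrative:

```python
from prometheus_client import Counter, Histogram

# Indexing (batch) metrics live in their own namespace...
INDEXED_DOCS = Counter(
    "indexing_documents_processed_total", "Documents fully indexed")
VECTOR_WRITE_LATENCY = Histogram(
    "indexing_vector_write_seconds", "Latency of vector store writes")

# ...and retrieval (live) metrics in theirs. Histogram buckets give you
# the p50/p95/p99 views once scraped and queried.
QUERY_LATENCY = Histogram(
    "retrieval_query_seconds", "End-to-end query latency")
QUERY_ERRORS = Counter(
    "retrieval_query_errors_total", "Failed retrieval requests")

def handle_query(query, retrieve):
    with QUERY_LATENCY.time():  # records elapsed time on exit
        try:
            return retrieve(query)
        except Exception:
            QUERY_ERRORS.inc()
            raise
```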

    Independent iteration. The chunking strategy is the component that gets changed most often in a production RAG system. With separated pipelines, you can run a new chunking configuration through the indexing pipeline, write the results to a test collection, and evaluate retrieval quality against that collection — all without affecting the production retrieval path.
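A sketch of that evaluation loop, assuming a small labeled set of queries with known relevant documents and a `store.search` call that accepts a collection name; recall@k is a reasonable first-pass metric:

```python
def recall_at_k(store, embed_query, collection: str,
                labeled: list[tuple[str, set[str]]], k: int = 10) -> float:
    # labeled pairs each query with the set of doc_ids a good answer needs.
    total = 0.0
    for query, relevant in labeled:
        results = store.search(collection=collection,
                               vector=embed_query(query), top_k=k)
        found = {r["doc_id"] for r in results}
        total += len(found & relevant) / len(relevant)
    return total / len(labeled)

# baseline  = recall_at_k(store, embed_query, "docs_v1", labeled)
# candidate = recall_at_k(store, embed_query, "docs_v2_test", labeled)
# Point production retrieval at the new collection only if the candidate
# at least matches the baseline.
```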

    Independent failure domains. If the indexing pipeline crashes during a batch job, retrieval continues serving from the existing index. If the retrieval pipeline has a latency spike, indexing continues processing documents. Neither failure cascades into the other.

    Clear team boundaries. In larger organizations, the team responsible for data ingestion and quality (often data engineering) can own the indexing pipeline, while the team responsible for the user-facing application can own the retrieval pipeline. The vector store contract is the API between the two teams.

    When to Start Separating

    If your RAG system is a prototype or internal tool with a small, static corpus and a single user, the monolithic approach is fine. Separation adds architectural overhead that is not justified at that scale.

    Separate when any of these conditions appear: your corpus is updated regularly, you need to iterate on chunking or embedding without downtime, multiple people or teams are working on the system, or retrieval reliability is a business requirement rather than a convenience.

    For most enterprise RAG deployments, these conditions are present from day one. Starting with separated pipelines avoids the painful migration from a monolithic system later — a migration that typically requires reprocessing the entire corpus and redesigning the application architecture under production pressure.

    The best RAG pipeline architecture is not the one with the most sophisticated retrieval algorithm. It is the one where you can change any component — parser, chunker, embedder, retrieval strategy — without breaking everything else.
