
    RAG Pipeline Failure Modes: A Field Guide for Production Debugging

    A comprehensive catalog of RAG failure modes with symptoms, root causes, and fixes. Built from real production incidents and community discussions.

    Ertas Team

    RAG pipelines fail silently. Unlike a crashed service or a thrown exception, a broken RAG pipeline still returns answers. They are just wrong answers, incomplete answers, or answers contaminated with information that should never have reached the LLM. The system looks healthy while delivering garbage.

    This guide catalogs the failure modes we see repeatedly in production RAG systems. Each entry follows the same structure: what you observe, why it happens, and how to fix it. Bookmark it. You will need it at 2 AM when your retrieval pipeline is returning nonsense and you cannot figure out why.

    Failure Mode 1: Retrieval Miss (Relevant Documents Exist but Are Not Retrieved)

    Symptoms: The LLM says "I don't have information about that" or hallucinates an answer, even though the correct document is in your vector store. Users report that the system "doesn't know" things you know you indexed.

    Root Cause: The query embedding and the document chunk embedding are not close enough in vector space despite being semantically related. This happens when:

    • The query uses different terminology than the source document (user asks about "firing employees," documents use "involuntary termination")
    • The embedding model was not trained on your domain vocabulary
    • Chunks are too large, diluting the embedding with irrelevant context
    • Chunks are too small, losing the semantic meaning that would match the query

    Fix:

    • Add a hybrid search layer (BM25 keyword search alongside vector similarity) to catch terminology mismatches (see the fusion sketch after this list)
    • Test retrieval with real user queries before going live — not just queries you wrote yourself
    • Experiment with chunk sizes: 256-512 tokens is a reasonable starting point for most document types
    • Consider domain-adapted embedding models if your content uses specialized vocabulary
    • Add query expansion or rewriting to rephrase user queries before embedding
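
    As a concrete illustration of the hybrid search fix, here is a minimal sketch of reciprocal rank fusion combining a BM25 ranking with a vector-similarity ranking. The function and the k=60 constant are illustrative choices, not tied to any particular library.

```python
def reciprocal_rank_fusion(bm25_ranking, vector_ranking, k=60):
    """Fuse two ranked lists of document IDs into one hybrid ranking.

    Each list is ordered best-first. A document's fused score is the sum of
    1 / (k + rank) over every list it appears in, so documents that rank well
    under either keyword or vector search float to the top.
    """
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: a policy document ranks high under BM25 for "involuntary termination"
# even though the vector search missed it for the query "firing employees".
bm25_hits = ["doc_policy_hr_07", "doc_handbook_02", "doc_faq_11"]
vector_hits = ["doc_handbook_02", "doc_faq_11", "doc_benefits_03"]
print(reciprocal_rank_fusion(bm25_hits, vector_hits))
```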

    Failure Mode 2: Wrong Context Retrieved (Irrelevant Documents Score High)

    Symptoms: The LLM gives confident but incorrect answers, citing information from the wrong document, wrong section, or wrong topic entirely. The retrieved chunks look plausible but are not relevant to the actual question.

    Root Cause: Vector similarity is not the same as relevance. A passage can be semantically similar to the query (same domain, same vocabulary) without actually answering it. Common triggers:

    • Boilerplate text (headers, footers, disclaimers) gets high similarity scores because it appears everywhere
    • Documents from different time periods or versions are not distinguished
    • Metadata filters are missing, so the search considers the entire corpus instead of the relevant subset

    Fix:

    • Add metadata filtering (date, document type, department, version) so retrieval narrows the search space before similarity ranking
    • Strip boilerplate during ingestion — remove headers, footers, and repeated disclaimers before chunking
    • Implement a reranking step after initial retrieval using a cross-encoder model that scores query-document relevance more accurately than raw cosine similarity (see the sketch after this list)
    • Add a relevance threshold — do not pass chunks to the LLM if their similarity score falls below a tested minimum
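
    A sketch of the reranking and threshold fixes, assuming the sentence-transformers CrossEncoder API; the model name and threshold are placeholders you would calibrate against your own queries (the raw scores from this model family are unnormalized logits, so do not assume a 0-1 scale).

```python
from sentence_transformers import CrossEncoder

# Model name and threshold are illustrative; calibrate both on your own data.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
RELEVANCE_THRESHOLD = 0.0

def rerank_and_filter(query, candidate_chunks, top_k=5):
    """Rescore retrieved chunk texts with a cross-encoder, drop low-relevance ones."""
    pairs = [(query, chunk) for chunk in candidate_chunks]
    scores = reranker.predict(pairs)  # one relevance score per (query, chunk) pair
    ranked = sorted(zip(scores, candidate_chunks), key=lambda x: x[0], reverse=True)
    return [chunk for score, chunk in ranked[:top_k] if score >= RELEVANCE_THRESHOLD]
```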

    Failure Mode 3: Stale Context (Documents Are Outdated)

    Symptoms: The LLM gives answers based on old information — last quarter's pricing, a superseded policy, a deprecated API endpoint. Users lose trust because they know the information changed but the system still gives the old answer.

    Root Cause: The vector store contains embeddings from documents that have since been updated or replaced, and the indexing pipeline does not detect or handle updates. This is the most common RAG failure in production because most teams build the initial indexing pipeline but never build the update pipeline.

    Fix:

    • Implement document versioning with timestamps in metadata so you can filter by recency
    • Build an incremental reindexing pipeline that detects changed documents and re-embeds only the updated chunks (a hash-based sketch follows this list)
    • Add a "last indexed" timestamp to every chunk and surface it to the LLM in the context so it can caveat potentially stale information
    • Schedule regular full reindexing as a safety net, even if you have incremental updates
    • See our deep dive on embedding drift and stale vectors for detection strategies
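
    One way to implement the incremental reindexing bullet: keep a content hash per document and only re-chunk and re-embed documents whose hash changed. The embed_and_upsert callable and the JSON hash index are hypothetical stand-ins for your own embedding and storage code.

```python
import hashlib
import json
from pathlib import Path

HASH_INDEX = Path("indexed_hashes.json")  # doc_id -> sha256 of last indexed content

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def incremental_reindex(documents, embed_and_upsert):
    """Re-embed only documents whose content changed since the last run.

    `documents` is an iterable of (doc_id, text); `embed_and_upsert` is your
    own function that chunks, embeds, and writes one document to the store.
    """
    known = json.loads(HASH_INDEX.read_text()) if HASH_INDEX.exists() else {}
    for doc_id, text in documents:
        digest = content_hash(text)
        if known.get(doc_id) == digest:
            continue  # unchanged since the last indexing run
        embed_and_upsert(doc_id, text)
        known[doc_id] = digest
    HASH_INDEX.write_text(json.dumps(known, indent=2))
```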

    Failure Mode 4: Chunk Boundary Corruption (Answer Split Across Chunks)

    Symptoms: The LLM gives partial answers that feel incomplete, or combines information from two unrelated sections because the relevant content was split across chunk boundaries. Users describe the answers as "close but missing the key detail."

    Root Cause: Fixed-size chunking splits documents at arbitrary character or token counts, ignoring semantic boundaries. A critical paragraph gets split in half. The first half lands in one chunk, the second in another. Retrieval finds one half but not the other.

    Fix:

    • Use semantic chunking that respects paragraph, section, and heading boundaries (see the sketch after this list)
    • Add chunk overlap (10-20% of chunk size) so boundary content appears in adjacent chunks
    • For structured documents (legal contracts, technical specifications), chunk by section hierarchy rather than by size
    • Include the parent section heading in every chunk so the LLM has structural context even for mid-section chunks
    • See our article on how bad chunks poison RAG answers for a detailed breakdown
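
    A minimal sketch of heading-aware chunking with overlap, assuming Markdown-style headings; the size and overlap values are illustrative and should be tuned per document type.

```python
def chunk_by_sections(markdown_text, max_chars=1500, overlap_chars=200):
    """Split a Markdown document at heading boundaries, carrying the current
    heading into every chunk and overlapping adjacent chunks slightly."""
    chunks = []
    heading = ""
    body = []

    def flush():
        text = "\n".join(body).strip()
        body.clear()
        start = 0
        while start < len(text):
            piece = text[start:start + max_chars]
            chunks.append(f"{heading}\n{piece}".strip())
            if start + max_chars >= len(text):
                break
            start += max_chars - overlap_chars  # overlap so boundary content repeats

    for line in markdown_text.splitlines():
        if line.lstrip().startswith("#"):
            flush()                      # close out the previous section
            heading = line.lstrip("# ").strip()
        else:
            body.append(line)
    flush()                              # last section
    return chunks
```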

    Failure Mode 5: Context Window Overflow (Too Much Retrieved Context)

    Symptoms: The LLM ignores some retrieved documents, gives answers that only reference the first or last chunks (primacy/recency bias), or produces overly generic responses that do not engage with the specific details in the context.

    Root Cause: The retrieval step returns too many chunks, more context than the LLM can actually use. Even models with 128K+ context windows show degraded performance when context is noisy or excessive. The model cannot distinguish signal from noise when buried in ten pages of loosely relevant text.

    Fix:

    • Reduce top-k retrieval to 3-5 chunks instead of 10-20 (a token-budget packing sketch follows this list)
    • Add a reranking step and only pass the top reranked results to the LLM
    • Implement context compression — summarize or extract key sentences from retrieved chunks before passing to the model
    • Test your actual model's performance at different context lengths with your specific document types
    • Consider whether you need all the chunks or just the single most relevant one
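
    To make the "fewer, better chunks" advice concrete, this sketch packs chunks in relevance order into a fixed token budget instead of passing everything retrieved. The four-characters-per-token estimate is a rough assumption; swap in your model's real tokenizer.

```python
def pack_context(ranked_chunks, max_context_tokens=2000):
    """Take chunks in relevance order and stop once a token budget is reached.

    Uses a crude ~4 characters per token estimate; replace with tiktoken or
    your model's tokenizer for an exact count.
    """
    packed, used = [], 0
    for chunk in ranked_chunks:
        estimated_tokens = max(1, len(chunk) // 4)
        if used + estimated_tokens > max_context_tokens:
            break
        packed.append(chunk)
        used += estimated_tokens
    return "\n\n---\n\n".join(packed)
```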

    Failure Mode 6: PII Leakage Through Context

    Symptoms: The LLM's response includes personal names, email addresses, phone numbers, or other personally identifiable information from documents that were not supposed to be surfaced to the current user. Compliance team files an incident report.

    Root Cause: Documents containing PII were indexed without redaction, and the retrieval pipeline has no access control layer. The LLM faithfully includes whatever is in the context, including sensitive data it should never have seen.

    Fix:

    • Add PII redaction as a pipeline stage between document parsing and chunking/embedding (a regex-based sketch follows this list)
    • Implement document-level and chunk-level access controls so retrieval respects user permissions
    • Audit your vector store for PII exposure — search for patterns (emails, phone numbers, SSNs) in stored chunk text
    • See our guide on PII leaks in RAG context windows for a comprehensive prevention framework
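
    A hedged sketch of the regex-based redaction and audit ideas above. These patterns only catch obvious emails, US-style phone numbers, and SSNs; they are a backstop, not a substitute for a dedicated PII detection stage.

```python
import re

# Deliberately simple patterns; a production redactor needs NER-based detection too.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace obvious PII with placeholder tokens before chunking/embedding."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

def audit_chunks(chunks):
    """Flag stored chunks that still contain PII-looking patterns."""
    return [
        (i, label)
        for i, chunk in enumerate(chunks)
        for label, pattern in PII_PATTERNS.items()
        if pattern.search(chunk)
    ]
```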

    Failure Mode 7: Hallucination Despite Correct Context

    Symptoms: The correct documents are retrieved and present in the context, but the LLM still generates incorrect information. It might synthesize facts that are not in any of the retrieved chunks, or misinterpret the context and draw wrong conclusions.

    Root Cause: The LLM is blending its parametric knowledge (pre-training data) with the provided context, and its parametric knowledge is winning. This is more common with larger, more capable models that have strong priors. It also happens when the context contradicts the model's training data — the model "trusts itself" over the context.

    Fix:

    • Add explicit instructions in the system prompt: "Answer ONLY based on the provided context. If the context does not contain the answer, say so."
    • Use smaller, fine-tuned models that are trained to be faithful to context rather than relying on parametric knowledge
    • Implement citation requirements — force the model to quote the specific chunk it is referencing
    • Add a verification step that checks whether the answer is grounded in the retrieved context (a heuristic sketch follows this list)
    • Consider fine-tuning for context faithfulness if this is a persistent issue
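
    A rough sketch of the verification step: check that each answer sentence shares substantial wording with the retrieved chunks before returning it. Lexical overlap is a crude proxy for groundedness (an NLI model or LLM-as-judge check is stronger), and the 0.5 threshold is an assumption.

```python
import re

def is_grounded(answer: str, context_chunks, min_overlap=0.5):
    """Heuristic groundedness check: every answer sentence should share at
    least `min_overlap` of its content words with the retrieved context."""
    context_words = set(re.findall(r"\w+", " ".join(context_chunks).lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        words = [w for w in re.findall(r"\w+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        overlap = sum(1 for w in words if w in context_words) / len(words)
        if overlap < min_overlap:
            return False  # sentence looks like it came from parametric knowledge
    return True
```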

    Failure Mode 8: Embedding Model Mismatch

    Symptoms: Retrieval quality is consistently poor across all queries despite good chunking and sufficient documents. Relevant documents score lower than expected. Switching to a different embedding model dramatically changes results.

    Root Cause: The embedding model was chosen based on benchmark scores rather than domain fit. General-purpose embedding models can perform poorly on specialized content (medical, legal, financial) because they were not trained on that vocabulary. Additionally, using different embedding models for indexing and querying produces meaningless similarity scores.

    Fix:

    • Always use the same embedding model for indexing and querying — this is non-negotiable
    • Test multiple embedding models on your actual documents with your actual queries before committing
    • Consider domain-specific or fine-tuned embedding models for specialized content
    • Run a retrieval evaluation: for 50-100 known query-document pairs, measure whether the correct document appears in the top-k results (see the recall@k sketch below)
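
    The retrieval evaluation from the last bullet can be as simple as recall@k over a small labeled set; the search callable below is a hypothetical wrapper around your own vector store query.

```python
def recall_at_k(labeled_pairs, search, k=5):
    """labeled_pairs: list of (query, expected_doc_id) tuples.
    search(query, k) -> list of doc_ids, best first (your own retrieval call).
    Returns the fraction of queries whose expected document appears in the top k."""
    hits = sum(
        1 for query, expected in labeled_pairs
        if expected in search(query, k)
    )
    return hits / len(labeled_pairs)

# Example usage with 50-100 known query-document pairs:
# print(f"recall@5 = {recall_at_k(eval_set, search=my_vector_search, k=5):.2f}")
```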

    A Diagnostic Checklist

    When your RAG pipeline is producing bad results, work through this checklist in order:

    1. Is the document in the store? Check that the source document was actually indexed. Ingestion failures are more common than you think.
    2. Is the chunk retrievable? Query the vector store directly with the exact text from the source document. If even an exact match fails, your embedding pipeline is broken (a sanity-check sketch follows this checklist).
    3. Is the right chunk retrieved for real queries? Test with actual user queries, not synthetic ones you wrote.
    4. Is the chunk content correct? Read the raw chunk text. Is it complete? Is it corrupted? Does it contain the information needed to answer the question?
    5. Is the context reaching the LLM? Log the full prompt sent to the model, including all retrieved context. Confirm nothing is being truncated or dropped.
    6. Is the LLM using the context? If the context is correct but the answer is wrong, the problem is in the generation step, not retrieval.
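
    For step 2, a quick sanity check is to embed a stored chunk's exact text as the query and confirm the chunk retrieves itself; embed and vector_search are hypothetical wrappers around your own pipeline.

```python
def exact_match_sanity_check(chunk_id, chunk_text, embed, vector_search, k=3):
    """If a chunk cannot retrieve itself from its own exact text, the
    embedding or indexing pipeline is broken, not the queries."""
    query_vector = embed(chunk_text)
    top_ids = [hit["id"] for hit in vector_search(query_vector, k=k)]
    assert chunk_id in top_ids, (
        f"Chunk {chunk_id} not in its own top-{k}; check embedding model "
        "consistency and that the chunk was actually written to the store."
    )
```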

    Building Observable RAG Pipelines

    Most of these failure modes share a common enabler: lack of observability. When your RAG pipeline is a chain of invisible function calls — document parsing, chunking, embedding, retrieval, context assembly, generation — diagnosing failures requires tracing through code you wrote months ago under deadline pressure.

    Ertas Data Suite takes a different approach. Every stage of the RAG pipeline is a visible node on a canvas — File Import, Parser, PII Redactor, RAG Chunker, Embedding, Vector Store Writer for indexing; API Endpoint, Query Embedder, Vector Search, Context Assembler, API Response for retrieval. Each node logs its inputs and outputs. When retrieval quality degrades, you can inspect exactly what each stage produced without adding print statements to buried Python functions.

    The pipeline is not hidden inside LangChain abstractions or LlamaIndex callbacks. It is visible, auditable, and debuggable — which is exactly what you need when production RAG breaks at 2 AM.

    The Uncomfortable Truth

    RAG pipelines are not "set and forget" infrastructure. They are living systems that degrade as documents change, user queries evolve, and embedding models drift. Every team that deploys RAG in production eventually hits most of the failure modes in this catalog. The difference between teams that recover quickly and teams that spend weeks debugging is observability: can you see what each stage of your pipeline is doing, or are you guessing?

    Build your RAG pipelines so you can see inside them. Your future self will thank you.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
