    RAG Hallucination vs. Retrieval Failure: A Diagnostic Framework


    How to tell whether bad RAG output is a hallucination problem (LLM) or a retrieval problem (pipeline) — a structured diagnostic framework with symptom comparison, flowchart, and fix strategies.

    Ertas Team

    A user asks your RAG system a question. The answer is wrong. Now what?

    This is the moment where most teams make the same mistake: they assume the problem is the LLM and start tuning prompts. They add instructions like "only use the provided context" or "say I don't know if the answer isn't in the context." Sometimes this helps. Often it does not, because the actual problem was never the LLM — it was the retrieval pipeline feeding it bad or missing context.

    Bad RAG output has two fundamentally different root causes, and the fix for one does nothing for the other. Hallucination is a generation problem: the LLM had the right context but fabricated or distorted information in its response. Retrieval failure is a pipeline problem: the LLM never received the right context in the first place, so it either made something up, cited irrelevant passages, or gave a partial answer.

    Misdiagnosing the root cause means wasting time on the wrong fix. This article provides a structured framework for telling the two apart.

    The Two Failure Modes

    Retrieval Failure

    The retrieval pipeline did not surface the documents or chunks that contain the answer. The LLM receives context that is irrelevant, partially relevant, or empty — and then does its best to answer anyway.

    Retrieval failures happen upstream of the LLM. The embedding model, the chunking strategy, the vector store query, the metadata filters, or the re-ranking step failed to identify the right content. The LLM is working correctly — it is just working with the wrong inputs.

    LLM Hallucination

    The retrieval pipeline surfaced the correct context. The LLM received relevant documents that contain the answer. But the generated response contradicts, distorts, or fabricates information that is not supported by the provided context.

    Hallucination happens downstream of retrieval. The pipeline did its job. The LLM did not faithfully represent what it was given.

    Why the Distinction Matters

    The fix for retrieval failure is pipeline work: adjusting chunk sizes, improving embeddings, fixing metadata filters, adding re-ranking, repairing index corruption. No amount of prompt engineering will fix a pipeline that does not retrieve the right documents.

    The fix for hallucination is generation work: improving prompts, using structured output formats, adding citation requirements, switching to a more capable model, or implementing post-generation fact-checking against the source context.

    Applying the wrong fix wastes time and can make things worse. Adding "only answer from context" instructions to a system with retrieval failures will just cause it to refuse to answer — the context does not contain the answer, so the instruction works as designed but the user experience degrades.

    Diagnostic Flowchart

    Follow this sequence to diagnose whether a bad RAG output is a retrieval problem or a hallucination problem.

    Step 1: Capture the retrieved context.

    Before diagnosing anything, you need to see what the LLM actually received. Log or display the chunks that the retrieval pipeline returned for the query. If your system does not surface this information, that is the first thing to fix — you cannot diagnose what you cannot observe.
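    As a sketch, retrieval logging can be as simple as wrapping the retriever call. Here, `logged_retrieve` and `fake_retriever` are hypothetical stand-ins for your pipeline's actual interfaces, assuming chunks come back as dicts with an ID, text, and score:

    ```python
    import json
    import time

    def logged_retrieve(retriever, query, log):
        """Call the retriever and record exactly what the LLM will receive."""
        chunks = retriever(query)
        log.append({"ts": time.time(), "query": query, "chunks": chunks})
        return chunks

    # Stand-in retriever; the real one would query your vector store.
    def fake_retriever(query):
        return [{"id": "policy.pdf#3",
                 "text": "Refunds are processed within 14 days.",
                 "score": 0.82}]

    log = []
    chunks = logged_retrieve(fake_retriever, "What is the refund window?", log)
    print(json.dumps(log[0]["chunks"], indent=2))  # the exact context the LLM saw
    ```

    Even a flat list of dicts like this is enough to answer Step 2 for any past query.
    
    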

    Step 2: Does the retrieved context contain the correct answer?

    Read the retrieved chunks yourself. Is the information needed to correctly answer the query present in the context?

    • If NO: This is a retrieval failure. Proceed to the retrieval diagnosis section below.
    • If YES: Continue to Step 3.

    Step 3: Does the LLM response faithfully represent the retrieved context?

    Compare the LLM's answer with the retrieved context. Does the response accurately reflect what is in the context, or does it contradict, embellish, or fabricate?

    • If the response contradicts or adds information not in the context: This is hallucination. Proceed to the hallucination diagnosis section.
    • If the response is accurate but incomplete: This could be either — the context may have contained the full answer but the LLM omitted parts (hallucination by omission), or the context itself was incomplete (partial retrieval failure). Check whether the source documents contain information that was not retrieved.

    Step 4: Is the failure consistent or intermittent?

    Run the same query multiple times (with temperature above 0).

    • If the answer is consistently wrong in the same way: More likely retrieval failure (the same bad context is retrieved every time).
    • If the answer varies between correct and incorrect: More likely hallucination (the LLM is non-deterministic in how it uses the context).
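    This probe is easy to automate. The sketch below assumes an `answer_fn` callable wrapping your RAG system; `flaky_llm` is a seeded stand-in used purely for illustration:

    ```python
    import random

    def consistency_probe(answer_fn, query, runs=5):
        """Ask the same question several times and count distinct answers."""
        answers = [answer_fn(query) for _ in range(runs)]
        return {"answers": answers, "distinct": len(set(answers))}

    random.seed(0)  # deterministic for the demo

    def flaky_llm(query):
        # Simulates a non-deterministic LLM that sometimes fabricates a number.
        return random.choice(["14 days", "30 days"])

    report = consistency_probe(flaky_llm, "What is the refund window?")
    # distinct == 1 and always wrong: suspect retrieval; distinct > 1: suspect hallucination
    ```
    
    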

    Symptom Comparison Table

    • Answer is confidently wrong. Retrieval failure: Common — the LLM confabulates from irrelevant context. Hallucination: Common — the LLM fabricates despite correct context.
    • Answer says "I don't know" when it should know. Retrieval failure: Likely — the correct context was not retrieved. Hallucination: Unlikely — the LLM has the answer in context but chose not to use it (rare).
    • Answer cites a real document but the wrong section. Retrieval failure: Likely — retrieved the wrong chunk from the right document. Hallucination: Possible — the LLM misattributed within the context.
    • Answer cites a document that does not exist. Retrieval failure: Unlikely — retrieval returns real documents. Hallucination: Likely — the LLM fabricated the citation.
    • Answer contains specific numbers or dates that are wrong. Retrieval failure: If the numbers are not in the retrieved chunks. Hallucination: If the numbers are in the context but the LLM changed them.
    • Answer is partially correct, partially fabricated. Retrieval failure: Likely — some relevant context was retrieved, some missing. Hallucination: Likely — the LLM mixed real context with fabrication.
    • Same query returns different wrong answers. Retrieval failure: Unlikely — retrieval is deterministic. Hallucination: Likely — generation is non-deterministic.
    • Problem appeared suddenly after a pipeline change. Retrieval failure: Likely — an indexing, embedding, or chunking change broke retrieval. Hallucination: Unlikely — unless the model or prompt was changed.
    • Problem affects a specific topic but not others. Retrieval failure: Likely — those documents were not indexed, or chunking fragmented them. Hallucination: Unlikely — hallucination is usually not topic-specific.

    Diagnosing Retrieval Failures

    Once you have confirmed that the correct context was not retrieved, narrow down where in the retrieval pipeline the failure occurs.

    Check 1: Is the source document in the index?

    Query the vector store directly for the document by its metadata (filename, document ID). If the document is not in the index, it was either never indexed or was lost to index corruption. This is an ingestion or index integrity problem, not a retrieval problem.
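    With a toy in-memory stand-in for the vector store (most stores expose an equivalent metadata lookup), the check might look like:

    ```python
    def document_in_index(index, doc_id):
        """True if any indexed chunk carries the document's ID in its metadata."""
        return any(chunk["metadata"].get("doc_id") == doc_id for chunk in index)

    # Toy index: a list of chunks with metadata, mirroring what stores return.
    index = [
        {"text": "Refunds are processed within 14 days.",
         "metadata": {"doc_id": "policy-2024"}},
    ]

    document_in_index(index, "policy-2024")  # True: the document was ingested
    document_in_index(index, "policy-2025")  # False: ingestion or integrity problem
    ```
    
    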

    Check 2: Is the relevant chunk in the index?

    The document may be indexed, but the specific passage that answers the query may have been split across chunks in a way that fragments the answer. Retrieve all chunks for the source document and examine whether any single chunk contains the complete answer. If the answer spans two or more chunks, your chunking strategy needs adjustment — larger chunks, more overlap, or semantic chunking.
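    A quick way to test for fragmentation, assuming you can list a document's chunks and know a few key phrases the answer must contain:

    ```python
    def answer_fragmented(chunks, key_phrases):
        """True when no single chunk contains every phrase needed for the answer."""
        phrases = [p.lower() for p in key_phrases]
        return not any(all(p in chunk.lower() for p in phrases) for chunk in chunks)

    # Chunking split the refund rule across two chunks:
    split_chunks = [
        "Refund requests must be filed within 14 days of purchase.",
        "Approved refunds are paid to the original payment method.",
    ]
    answer_fragmented(split_chunks, ["14 days", "original payment method"])  # True
    ```

    A `True` here points at chunk size, overlap, or the splitting strategy rather than at retrieval scoring.
    
    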

    Check 3: Does the embedding capture the semantic relationship?

    Generate the embedding for the query and for the chunk that should have been retrieved. Calculate their cosine similarity. If the similarity is low despite the chunk being semantically relevant to the query, the embedding model is failing to capture the relationship. This happens with domain-specific vocabulary that the embedding model was not trained on. Consider a domain-specific embedding model or query expansion.
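    Cosine similarity itself is a one-liner; the vectors below are tiny stand-ins, not output from a real embedding model:

    ```python
    import math

    def cosine_similarity(a, b):
        """Cosine similarity between two embedding vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    query_vec = [0.1, 0.9, 0.2]  # stand-in embedding of the query
    chunk_vec = [0.2, 0.8, 0.1]  # stand-in embedding of the chunk that should match
    cosine_similarity(query_vec, chunk_vec)  # close to 1.0: the model relates them
    ```

    Run this with real embeddings of the failing query and the chunk that should have been retrieved; a low score confirms the embedding model, not the search, is the weak link.
    
    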

    Check 4: Are metadata filters too restrictive?

    If your retrieval pipeline uses metadata filters (date ranges, document types, departments), verify that the filters are not excluding the correct documents. This is a surprisingly common cause of retrieval failure — a filter that was correct when configured becomes stale as new documents arrive with different metadata patterns.
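    A sketch of the check, again against a toy in-memory index; real stores apply the filter server-side, but the logic is the same:

    ```python
    def filter_excludes_document(index, doc_id, metadata_filter):
        """True when the document is indexed but the filter removes all its chunks."""
        doc_chunks = [c for c in index if c["metadata"].get("doc_id") == doc_id]
        surviving = [
            c for c in doc_chunks
            if all(c["metadata"].get(k) == v for k, v in metadata_filter.items())
        ]
        return bool(doc_chunks) and not surviving

    index = [
        {"text": "Refunds are processed within 14 days.",
         "metadata": {"doc_id": "policy-2025", "year": 2025}},
    ]
    filter_excludes_document(index, "policy-2025", {"year": 2024})  # True: stale filter
    ```
    
    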

    Check 5: Is the re-ranker demoting the correct chunk?

    If you use a re-ranking step after initial retrieval, the re-ranker may be scoring the correct chunk lower than irrelevant chunks. Check the pre-reranking results to see if the correct chunk was retrieved initially but then demoted.
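    Comparing the two orderings is mechanical once both lists are logged; the chunk IDs below are illustrative:

    ```python
    def reranker_demoted(pre_ids, post_ids, correct_id, top_k=3):
        """True when the correct chunk was in the initial top-k but not after re-ranking."""
        return correct_id in pre_ids[:top_k] and correct_id not in post_ids[:top_k]

    pre = ["policy#3", "faq#1", "blog#7", "faq#2"]   # initial retrieval order
    post = ["faq#1", "blog#7", "faq#2", "policy#3"]  # order after re-ranking
    reranker_demoted(pre, post, "policy#3")  # True: re-ranker pushed the right chunk out
    ```
    
    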

    Diagnosing LLM Hallucination

    When the context is correct but the LLM output is wrong, investigate these factors.

    Context window overflow: If the retrieved context is very long, the LLM may lose track of information in the middle of the context. This is the "lost in the middle" phenomenon. Test by placing the relevant chunk first in the context and seeing if the answer improves.
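    A minimal reordering helper for that test, assuming each chunk carries an `id` field:

    ```python
    def promote_chunk(chunks, relevant_id):
        """Move the chunk believed to hold the answer to the front of the context."""
        hit = [c for c in chunks if c["id"] == relevant_id]
        rest = [c for c in chunks if c["id"] != relevant_id]
        return hit + rest

    context = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
    promote_chunk(context, "c")  # [{"id": "c"}, {"id": "a"}, {"id": "b"}]
    ```

    If the answer improves with the relevant chunk first, trim total context length or re-order by relevance rather than switching models.
    
    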

    Conflicting context: If multiple retrieved chunks contain different or contradictory information, the LLM may merge them incorrectly or choose the wrong one. Check whether the retrieved chunks are consistent with each other.

    Prompt instruction conflicts: The system prompt may contain instructions that conflict with faithful context use. Instructions like "be helpful and provide comprehensive answers" can encourage the LLM to supplement the context with its parametric knowledge, leading to fabrication.

    Model capability limits: Some models are better at faithful context grounding than others. If hallucination persists after prompt optimization, consider switching to a model with stronger instruction-following and context-grounding capabilities.

    Fixing the Right Problem

    Once you have diagnosed the root cause, apply the targeted fix.

    For retrieval failures:

    • Re-index missing documents
    • Adjust chunk sizes and overlap
    • Switch to a domain-appropriate embedding model
    • Relax overly restrictive metadata filters
    • Add query expansion or hybrid search (keyword plus semantic)
    • Implement a re-ranking step or tune the existing one

    For LLM hallucination:

    • Add explicit grounding instructions: "Only use information from the provided context"
    • Require inline citations: force the LLM to reference specific chunks
    • Use structured output formats that constrain the response
    • Reduce context length to avoid "lost in the middle" effects
    • Implement post-generation verification: check claims in the response against the source context
    • Lower temperature to reduce response variance
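    Post-generation verification can start very simply, for example by flagging numbers in the response that never appear in the retrieved context. This regex-based check is a crude sketch, not a full claim verifier:

    ```python
    import re

    def unsupported_numbers(response, context):
        """Numbers in the response that never appear in the retrieved context."""
        resp = set(re.findall(r"\d+(?:\.\d+)?", response))
        ctx = set(re.findall(r"\d+(?:\.\d+)?", context))
        return resp - ctx

    unsupported_numbers(
        "Refunds take 30 days.",
        "Refunds are processed within 14 days.",
    )  # {"30"}: the LLM changed a number that was in the context
    ```

    A non-empty result is a strong hallucination signal, since fabricated figures are among the most damaging and most detectable failures.
    
    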

    Where Ertas Fits

    The fundamental prerequisite for this entire diagnostic framework is observability. You need to see the retrieved context, not just the final answer. You need to trace a query through the retrieval pipeline — from embedding to vector search to re-ranking to context assembly — and see what happened at each stage.

    Ertas Data Suite builds RAG retrieval pipelines as visual, node-based workflows on a canvas. Each node in the pipeline — Query Embedder, Vector Search, Context Assembler, API Response — is independently observable. When a query produces a bad result, you can inspect the output of each node to see exactly where the pipeline failed: Did the embedding capture the query intent? Did the vector search return the right chunks? Did the context assembler format them correctly?

    This is the difference between "the RAG system gave a wrong answer" (a dead end) and "the vector search returned 5 chunks, none from the correct document, because the metadata filter excluded documents uploaded after January" (an actionable diagnosis). Observable pipelines turn debugging from guesswork into engineering.

    Key Takeaways

    Bad RAG output is not a single problem. It is either a retrieval failure (the LLM never got the right context) or a hallucination (the LLM had the right context but did not use it faithfully). The diagnostic process starts by inspecting the retrieved context — everything follows from there.

    Build your RAG system with observability as a first-class requirement, not an afterthought. Log retrieved context for every query. Maintain test queries with known-good answers. And when something goes wrong, diagnose before you fix — the five minutes spent confirming the root cause will save days of tuning the wrong component.

