
RAG Hallucination vs. Retrieval Failure: A Diagnostic Framework
How to tell whether bad RAG output is a hallucination problem (LLM) or a retrieval problem (pipeline) — a structured diagnostic framework with symptom comparison, flowchart, and fix strategies.
A user asks your RAG system a question. The answer is wrong. Now what?
This is the moment where most teams make the same mistake: they assume the problem is the LLM and start tuning prompts. They add instructions like "only use the provided context" or "say I don't know if the answer isn't in the context." Sometimes this helps. Often it does not, because the actual problem was never the LLM — it was the retrieval pipeline feeding it bad or missing context.
Bad RAG output has two fundamentally different root causes, and the fix for one does nothing for the other. Hallucination is a generation problem: the LLM had the right context but fabricated or distorted information in its response. Retrieval failure is a pipeline problem: the LLM never received the right context in the first place, so it either made something up, cited irrelevant passages, or gave a partial answer.
Misdiagnosing the root cause means wasting time on the wrong fix. This article provides a structured framework for telling the two apart.
The Two Failure Modes
Retrieval Failure
The retrieval pipeline did not surface the documents or chunks that contain the answer. The LLM receives context that is irrelevant, partially relevant, or empty — and then does its best to answer anyway.
Retrieval failures happen upstream of the LLM. The embedding model, the chunking strategy, the vector store query, the metadata filters, or the re-ranking step failed to identify the right content. The LLM is working correctly — it is just working with the wrong inputs.
LLM Hallucination
The retrieval pipeline surfaced the correct context. The LLM received relevant documents that contain the answer. But the generated response contradicts, distorts, or fabricates information that is not supported by the provided context.
Hallucination happens downstream of retrieval. The pipeline did its job. The LLM did not faithfully represent what it was given.
Why the Distinction Matters
The fix for retrieval failure is pipeline work: adjusting chunk sizes, improving embeddings, fixing metadata filters, adding re-ranking, repairing index corruption. No amount of prompt engineering will fix a pipeline that does not retrieve the right documents.
The fix for hallucination is generation work: improving prompts, using structured output formats, adding citation requirements, switching to a more capable model, or implementing post-generation fact-checking against the source context.
Applying the wrong fix wastes time and can make things worse. Adding "only answer from context" instructions to a system with retrieval failures will just cause it to refuse to answer — the context does not contain the answer, so the instruction works as designed but the user experience degrades.
Diagnostic Flowchart
Follow this sequence to diagnose whether a bad RAG output is a retrieval problem or a hallucination problem.
Step 1: Capture the retrieved context.
Before diagnosing anything, you need to see what the LLM actually received. Log or display the chunks that the retrieval pipeline returned for the query. If your system does not surface this information, that is the first thing to fix — you cannot diagnose what you cannot observe.
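If your stack does not log this today, here is a minimal sketch of the idea in Python, assuming hypothetical `retrieve` and `generate` functions that wrap your existing pipeline; the point is simply to persist what the LLM saw alongside the final answer.

```python
import json
import time

def answer_with_trace(query, retrieve, generate, log_path="rag_trace.jsonl"):
    """Run retrieval and generation, logging exactly what the LLM received."""
    chunks = retrieve(query)           # hypothetical: returns [{"id", "source", "score", "text"}, ...]
    answer = generate(query, chunks)   # hypothetical: prompts the LLM with query + chunks

    record = {
        "timestamp": time.time(),
        "query": query,
        "retrieved": [
            {"id": c["id"], "source": c["source"], "score": c["score"], "text": c["text"][:500]}
            for c in chunks
        ],
        "answer": answer,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return answer
```

Every diagnostic step below assumes you can pull up a record like this for the failing query.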
Step 2: Does the retrieved context contain the correct answer?
Read the retrieved chunks yourself. Is the information needed to correctly answer the query present in the context?
- If NO: This is a retrieval failure. Proceed to the retrieval diagnosis section below.
- If YES: Continue to Step 3.
Step 3: Does the LLM response faithfully represent the retrieved context?
Compare the LLM's answer with the retrieved context. Does the response accurately reflect what is in the context, or does it contradict, embellish, or fabricate?
- If the response contradicts or adds information not in the context: This is hallucination. Proceed to the hallucination diagnosis section.
- If the response is accurate but incomplete: This could be either — the context may have contained the full answer but the LLM omitted parts (hallucination by omission), or the context itself was incomplete (partial retrieval failure). Check whether the source documents contain information that was not retrieved.
Step 4: Is the failure consistent or intermittent?
Run the same query multiple times (with temperature above 0).
- If the answer is consistently wrong in the same way: More likely retrieval failure (the same bad context is retrieved every time).
- If the answer varies between correct and incorrect: More likely hallucination (the LLM is non-deterministic in how it uses the context).
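A minimal sketch of this consistency check, assuming a hypothetical `rag_answer` function that runs the full pipeline and accepts a temperature parameter:

```python
from collections import Counter

def consistency_check(query, rag_answer, n=5):
    """Re-run the same query and tally distinct answers.

    The same wrong answer every time points at retrieval (the same bad context
    each run); answers that flip between right and wrong point at generation.
    """
    answers = [rag_answer(query, temperature=0.7) for _ in range(n)]
    counts = Counter(a.strip().lower() for a in answers)
    for answer, count in counts.most_common():
        print(f"{count}/{n}: {answer[:120]}")
    return counts
```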
Symptom Comparison Table
| Symptom | Retrieval Failure | LLM Hallucination |
|---|---|---|
| Answer is confidently wrong | Common — LLM confabulates from irrelevant context | Common — LLM fabricates despite correct context |
| Answer says "I don't know" when it should know | Likely — correct context was not retrieved | Unlikely — the answer is in the context; declining to use it is rare |
| Answer cites a real document but wrong section | Likely — retrieved the wrong chunk from the right document | Possible — LLM misattributed within the context |
| Answer cites a document that does not exist | Unlikely — retrieval returns real documents | Likely — LLM fabricated the citation |
| Answer contains specific numbers or dates that are wrong | Check context — if the numbers are not in retrieved chunks, retrieval failure | If the numbers are in the context but the LLM changed them, hallucination |
| Answer is partially correct, partially fabricated | Likely — some relevant context retrieved, some missing | Likely — LLM mixed real context with fabrication |
| Same query returns different wrong answers | Unlikely — retrieval is deterministic | Likely — non-deterministic generation |
| Problem appeared suddenly after a pipeline change | Likely — indexing, embedding, or chunking change broke retrieval | Unlikely — unless the model or prompt was changed |
| Problem affects a specific topic but not others | Likely — those documents were not indexed, or chunking fragmented them | Unlikely — hallucination is usually not topic-specific |
Diagnosing Retrieval Failures
Once you have confirmed that the correct context was not retrieved, narrow down where in the retrieval pipeline the failure occurs.
Check 1: Is the source document in the index?
Query the vector store directly for the document by its metadata (filename, document ID). If the document is not in the index, it was either never indexed or was lost to index corruption. This is an ingestion or index integrity problem, not a retrieval problem.
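As a sketch, here is what that lookup can look like against a Chroma collection whose chunks carry a `source` metadata field. The store path, collection name, and filename are illustrative, and other vector stores expose equivalent metadata lookups.

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma")         # assumed local store
collection = client.get_collection("docs")                  # assumed collection name

# Look up every chunk whose metadata points at the source file.
result = collection.get(where={"source": "q3_report.pdf"})  # hypothetical filename
if not result["ids"]:
    print("Document is not in the index: an ingestion or index-integrity problem.")
else:
    print(f"Found {len(result['ids'])} chunks for this document.")
```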
Check 2: Is the relevant chunk in the index?
The document may be indexed, but the specific passage that answers the query may have been split across chunks in a way that fragments the answer. Retrieve all chunks for the source document and examine whether any single chunk contains the complete answer. If the answer spans two or more chunks, your chunking strategy needs adjustment — larger chunks, more overlap, or semantic chunking.
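A sketch of that check against the same assumed Chroma collection, using a key phrase that any correct answer would have to contain; the phrase and filename are illustrative.

```python
import chromadb

collection = chromadb.PersistentClient(path="./chroma").get_collection("docs")

result = collection.get(
    where={"source": "q3_report.pdf"},        # hypothetical filename
    include=["documents"],
)

key_phrase = "net revenue"                    # a phrase the correct answer needs (illustrative)
hits = [doc for doc in result["documents"] if key_phrase in doc.lower()]

if not hits:
    print("No single chunk contains the phrase; the answer may be fragmented across chunks.")
else:
    print(f"{len(hits)} chunk(s) contain the phrase, so chunking is probably not the problem.")
```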
Check 3: Does the embedding capture the semantic relationship?
Generate the embedding for the query and for the chunk that should have been retrieved. Calculate their cosine similarity. If the similarity is low despite the chunk being semantically relevant to the query, the embedding model is failing to capture the relationship. This happens with domain-specific vocabulary that the embedding model was not trained on. Consider a domain-specific embedding model or query expansion.
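A minimal sketch using sentence-transformers and NumPy. The model name here is a placeholder: for the comparison to say anything about production behaviour, you must use the same embedding model your pipeline uses for indexing.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder model; use the SAME embedding model your pipeline uses for indexing.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What was Q3 net revenue?"                          # illustrative
chunk = "Net revenue for the third quarter was EUR 4.2M."   # the chunk that should have matched

q_vec, c_vec = model.encode([query, chunk])
cosine = float(np.dot(q_vec, c_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(c_vec)))
print(f"cosine similarity: {cosine:.3f}")
# Low similarity despite obvious semantic relevance points at the embedding model.
```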
Check 4: Are metadata filters too restrictive?
If your retrieval pipeline uses metadata filters (date ranges, document types, departments), verify that the filters are not excluding the correct documents. This is a surprisingly common cause of retrieval failure — a filter that was correct when configured becomes stale as new documents arrive with different metadata patterns.
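One quick test is to run the same search with and without the filter and compare the result sets. A sketch against the assumed Chroma collection from Check 1 (the query and filter are illustrative, and the collection is assumed to have an embedding function configured so `query_texts` works):

```python
import chromadb

collection = chromadb.PersistentClient(path="./chroma").get_collection("docs")
query_text = "What was Q3 net revenue?"         # illustrative

with_filter = collection.query(
    query_texts=[query_text],
    n_results=5,
    where={"department": "finance"},            # the filter under suspicion (illustrative)
)
without_filter = collection.query(query_texts=[query_text], n_results=5)

print("with filter:   ", with_filter["ids"][0])
print("without filter:", without_filter["ids"][0])
# If the correct chunk shows up only without the filter, the filter is excluding it.
```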
Check 5: Is the re-ranker demoting the correct chunk?
If you use a re-ranking step after initial retrieval, the re-ranker may be scoring the correct chunk lower than irrelevant chunks. Check the pre-reranking results to see if the correct chunk was retrieved initially but then demoted.
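A sketch of that comparison using a sentence-transformers CrossEncoder as a stand-in re-ranker; substitute the model your pipeline actually uses and feed it the real pre-reranking candidates.

```python
from sentence_transformers import CrossEncoder

# Stand-in re-ranker; substitute the model your pipeline actually uses.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What was Q3 net revenue?"               # illustrative
candidates = [                                   # pre-reranking results, in retrieval order
    "Q2 revenue guidance was revised upward.",
    "Net revenue for the third quarter was EUR 4.2M.",
    "The Q3 all-hands is scheduled for October.",
]

scores = reranker.predict([(query, c) for c in candidates])
for score, text in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:+.3f}  {text}")
# If the correct chunk was retrieved but lands at the bottom here, the re-ranker is demoting it.
```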
Diagnosing LLM Hallucination
When the context is correct but the LLM output is wrong, investigate these factors.
Context window overflow: If the retrieved context is very long, the LLM may lose track of information in the middle of the context. This is the "lost in the middle" phenomenon. Test by placing the relevant chunk first in the context and seeing if the answer improves.
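A small helper for that test, assuming chunks are dicts with an `id` field and a hypothetical `generate` function; the chunk id is illustrative.

```python
def front_load(chunks, answer_chunk_id):
    """Reorder chunks so the one known to contain the answer comes first."""
    front = [c for c in chunks if c["id"] == answer_chunk_id]
    rest = [c for c in chunks if c["id"] != answer_chunk_id]
    return front + rest

# Hypothetical re-run with the reordered context:
# answer = generate(query, front_load(chunks, answer_chunk_id="doc42-chunk-7"))
# If the answer improves, the model was losing the relevant chunk mid-context.
```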
Conflicting context: If multiple retrieved chunks contain different or contradictory information, the LLM may merge them incorrectly or choose the wrong one. Check whether the retrieved chunks are consistent with each other.
Prompt instruction conflicts: The system prompt may contain instructions that conflict with faithful context use. Instructions like "be helpful and provide comprehensive answers" can encourage the LLM to supplement the context with its parametric knowledge, leading to fabrication.
Model capability limits: Some models are better at faithful context grounding than others. If hallucination persists after prompt optimization, consider switching to a model with stronger instruction-following and context-grounding capabilities.
Fixing the Right Problem
Once you have diagnosed the root cause, apply the targeted fix.
For retrieval failures:
- Re-index missing documents
- Adjust chunk sizes and overlap
- Switch to a domain-appropriate embedding model
- Relax overly restrictive metadata filters
- Add query expansion or hybrid search (keyword plus semantic); a minimal hybrid-search sketch follows this list
- Implement a re-ranking step or tune the existing one
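As referenced above, here is a minimal sketch of the hybrid-search idea: BM25 for the keyword side (via rank_bm25), a stand-in list for the vector side, and reciprocal rank fusion to combine them. The corpus, query, and vector ranking are all illustrative.

```python
from rank_bm25 import BM25Okapi

def rrf(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of chunk ids."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

corpus = {                                       # illustrative chunk id -> text
    "c1": "Net revenue for the third quarter was EUR 4.2M.",
    "c2": "The Q3 all-hands is scheduled for October.",
    "c3": "Q2 revenue guidance was revised upward.",
}
query = "Q3 net revenue"

ids = list(corpus)
bm25 = BM25Okapi([text.lower().split() for text in corpus.values()])
keyword_ranking = [ids[i] for i in bm25.get_scores(query.lower().split()).argsort()[::-1]]

vector_ranking = ["c3", "c1", "c2"]              # stand-in for your vector search results
print(rrf([keyword_ranking, vector_ranking]))    # fused order, best chunk id first
```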
For LLM hallucination:
- Add explicit grounding instructions: "Only use information from the provided context"
- Require inline citations: force the LLM to reference specific chunks
- Use structured output formats that constrain the response
- Reduce context length to avoid "lost in the middle" effects
- Implement post-generation verification: check claims in the response against the source context (a minimal sketch follows this list)
- Lower temperature to reduce response variance
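A minimal sketch of the verification idea referenced above: flag answer sentences that have no sufficiently similar sentence in the retrieved context. It uses embedding similarity with an assumed threshold; a production version might use an NLI model or an LLM judge instead, and should reuse the pipeline's own embedding model.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder; reuse your pipeline's model

def unsupported_sentences(answer, context_chunks, threshold=0.5):
    """Flag answer sentences with no sufficiently similar sentence in the context."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    context = [s.strip() for chunk in context_chunks for s in chunk.split(".") if s.strip()]
    sent_vecs = model.encode(sentences, normalize_embeddings=True)
    ctx_vecs = model.encode(context, normalize_embeddings=True)
    sims = sent_vecs @ ctx_vecs.T                 # cosine similarity via normalized dot products
    return [s for s, best in zip(sentences, sims.max(axis=1)) if best < threshold]

# Sentences returned here are candidates for fabrication and worth flagging or blocking.
```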
Where Ertas Fits
The fundamental prerequisite for this entire diagnostic framework is observability. You need to see the retrieved context, not just the final answer. You need to trace a query through the retrieval pipeline — from embedding to vector search to re-ranking to context assembly — and see what happened at each stage.
Ertas Data Suite builds RAG retrieval pipelines as visual, node-based workflows on a canvas. Each node in the pipeline — Query Embedder, Vector Search, Context Assembler, API Response — is independently observable. When a query produces a bad result, you can inspect the output of each node to see exactly where the pipeline failed: Did the embedding capture the query intent? Did the vector search return the right chunks? Did the context assembler format them correctly?
This is the difference between "the RAG system gave a wrong answer" (a dead end) and "the vector search returned 5 chunks, none from the correct document, because the metadata filter excluded documents uploaded after January" (an actionable diagnosis). Observable pipelines turn debugging from guesswork into engineering.
Key Takeaways
Bad RAG output is not a single problem. It is either a retrieval failure (the LLM never got the right context) or a hallucination (the LLM had the right context but did not use it faithfully). The diagnostic process starts by inspecting the retrieved context — everything follows from there.
Build your RAG system with observability as a first-class requirement, not an afterthought. Log retrieved context for every query. Maintain test queries with known-good answers. And when something goes wrong, diagnose before you fix — the five minutes spent confirming the root cause will save days of tuning the wrong component.