    RAG Hallucination vs. Retrieval Failure: A Diagnostic Framework


    How to tell whether bad RAG output is a hallucination problem (LLM) or a retrieval problem (pipeline) — a structured diagnostic framework with symptom comparison, flowchart, and fix strategies.

    Ertas Team

    A user asks your RAG system a question. The answer is wrong. Now what?

    This is the moment where most teams make the same mistake: they assume the problem is the LLM and start tuning prompts. They add instructions like "only use the provided context" or "say I don't know if the answer isn't in the context." Sometimes this helps. Often it does not, because the actual problem was never the LLM — it was the retrieval pipeline feeding it bad or missing context.

    Bad RAG output has two fundamentally different root causes, and the fix for one does nothing for the other. Hallucination is a generation problem: the LLM had the right context but fabricated or distorted information in its response. Retrieval failure is a pipeline problem: the LLM never received the right context in the first place, so it either made something up, cited irrelevant passages, or gave a partial answer.

    Misdiagnosing the root cause means wasting time on the wrong fix. This article provides a structured framework for telling the two apart.

    The Two Failure Modes

    Retrieval Failure

    The retrieval pipeline did not surface the documents or chunks that contain the answer. The LLM receives context that is irrelevant, partially relevant, or empty — and then does its best to answer anyway.

    Retrieval failures happen upstream of the LLM. The embedding model, the chunking strategy, the vector store query, the metadata filters, or the re-ranking step failed to identify the right content. The LLM is working correctly — it is just working with the wrong inputs.

    LLM Hallucination

    The retrieval pipeline surfaced the correct context. The LLM received relevant documents that contain the answer. But the generated response contradicts, distorts, or fabricates information that is not supported by the provided context.

    Hallucination happens downstream of retrieval. The pipeline did its job. The LLM did not faithfully represent what it was given.

    Why the Distinction Matters

    The fix for retrieval failure is pipeline work: adjusting chunk sizes, improving embeddings, fixing metadata filters, adding re-ranking, repairing index corruption. No amount of prompt engineering will fix a pipeline that does not retrieve the right documents.

    The fix for hallucination is generation work: improving prompts, using structured output formats, adding citation requirements, switching to a more capable model, or implementing post-generation fact-checking against the source context.

    Applying the wrong fix wastes time and can make things worse. Adding "only answer from context" instructions to a system with retrieval failures will just cause it to refuse to answer — the context does not contain the answer, so the instruction works as designed but the user experience degrades.

    Diagnostic Flowchart

    Follow this sequence to diagnose whether a bad RAG output is a retrieval problem or a hallucination problem.

    Step 1: Capture the retrieved context.

    Before diagnosing anything, you need to see what the LLM actually received. Log or display the chunks that the retrieval pipeline returned for the query. If your system does not surface this information, that is the first thing to fix — you cannot diagnose what you cannot observe.
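    As a sketch, retrieval logging can be as simple as wrapping the retriever call. Here, `logged_retrieve` and `fake_retriever` are hypothetical stand-ins for your pipeline's actual interfaces, assuming chunks come back as dicts with an ID, text, and score:

    ```python
    import json
    import time

    def logged_retrieve(retriever, query, log):
        """Call the retriever and record exactly what the LLM will receive."""
        chunks = retriever(query)
        log.append({"ts": time.time(), "query": query, "chunks": chunks})
        return chunks

    # Stand-in retriever; the real one would query your vector store.
    def fake_retriever(query):
        return [{"id": "policy.pdf#3",
                 "text": "Refunds are processed within 14 days.",
                 "score": 0.82}]

    log = []
    chunks = logged_retrieve(fake_retriever, "What is the refund window?", log)
    print(json.dumps(log[0]["chunks"], indent=2))  # the exact context the LLM saw
    ```

    Even a flat list of dicts like this is enough to answer Step 2 for any past query.
    
    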

    Step 2: Does the retrieved context contain the correct answer?

    Read the retrieved chunks yourself. Is the information needed to correctly answer the query present in the context?

    • If NO: This is a retrieval failure. Proceed to the retrieval diagnosis section below.
    • If YES: Continue to Step 3.

    Step 3: Does the LLM response faithfully represent the retrieved context?

    Compare the LLM's answer with the retrieved context. Does the response accurately reflect what is in the context, or does it contradict, embellish, or fabricate?

    • If the response contradicts or adds information not in the context: This is hallucination. Proceed to the hallucination diagnosis section.
    • If the response is accurate but incomplete: This could be either — the context may have contained the full answer but the LLM omitted parts (hallucination by omission), or the context itself was incomplete (partial retrieval failure). Check whether the source documents contain information that was not retrieved.

    Step 4: Is the failure consistent or intermittent?

    Run the same query multiple times (with temperature above 0).

    • If the answer is consistently wrong in the same way: More likely retrieval failure (the same bad context is retrieved every time).
    • If the answer varies between correct and incorrect: More likely hallucination (the LLM is non-deterministic in how it uses the context).
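    This probe is easy to automate. The sketch below assumes an `answer_fn` callable wrapping your RAG system; `flaky_llm` is a seeded stand-in used purely for illustration:

    ```python
    import random

    def consistency_probe(answer_fn, query, runs=5):
        """Ask the same question several times and count distinct answers."""
        answers = [answer_fn(query) for _ in range(runs)]
        return {"answers": answers, "distinct": len(set(answers))}

    random.seed(0)  # deterministic for the demo

    def flaky_llm(query):
        # Simulates a non-deterministic LLM that sometimes fabricates a number.
        return random.choice(["14 days", "30 days"])

    report = consistency_probe(flaky_llm, "What is the refund window?")
    # distinct == 1 and always wrong: suspect retrieval; distinct > 1: suspect hallucination
    ```
    
    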

    Symptom Comparison Table

    • Answer is confidently wrong. Retrieval failure: Common — the LLM confabulates from irrelevant context. Hallucination: Common — the LLM fabricates despite correct context.
    • Answer says "I don't know" when it should know. Retrieval failure: Likely — the correct context was not retrieved. Hallucination: Unlikely — the LLM has the answer in context but chose not to use it (rare).
    • Answer cites a real document but the wrong section. Retrieval failure: Likely — retrieved the wrong chunk from the right document. Hallucination: Possible — the LLM misattributed within the context.
    • Answer cites a document that does not exist. Retrieval failure: Unlikely — retrieval returns real documents. Hallucination: Likely — the LLM fabricated the citation.
    • Answer contains specific numbers or dates that are wrong. Retrieval failure: If the numbers are not in the retrieved chunks. Hallucination: If the numbers are in the context but the LLM changed them.
    • Answer is partially correct, partially fabricated. Retrieval failure: Likely — some relevant context was retrieved, some missing. Hallucination: Likely — the LLM mixed real context with fabrication.
    • Same query returns different wrong answers. Retrieval failure: Unlikely — retrieval is deterministic. Hallucination: Likely — generation is non-deterministic.
    • Problem appeared suddenly after a pipeline change. Retrieval failure: Likely — an indexing, embedding, or chunking change broke retrieval. Hallucination: Unlikely — unless the model or prompt was changed.
    • Problem affects a specific topic but not others. Retrieval failure: Likely — those documents were not indexed, or chunking fragmented them. Hallucination: Unlikely — hallucination is usually not topic-specific.

    Diagnosing Retrieval Failures

    Once you have confirmed that the correct context was not retrieved, narrow down where in the retrieval pipeline the failure occurs.

    Check 1: Is the source document in the index?

    Query the vector store directly for the document by its metadata (filename, document ID). If the document is not in the index, it was either never indexed or was lost to index corruption. This is an ingestion or index integrity problem, not a retrieval problem.
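    With a toy in-memory stand-in for the vector store (most stores expose an equivalent metadata lookup), the check might look like:

    ```python
    def document_in_index(index, doc_id):
        """True if any indexed chunk carries the document's ID in its metadata."""
        return any(chunk["metadata"].get("doc_id") == doc_id for chunk in index)

    # Toy index: a list of chunks with metadata, mirroring what stores return.
    index = [
        {"text": "Refunds are processed within 14 days.",
         "metadata": {"doc_id": "policy-2024"}},
    ]

    document_in_index(index, "policy-2024")  # True: the document was ingested
    document_in_index(index, "policy-2025")  # False: ingestion or integrity problem
    ```
    
    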

    Check 2: Is the relevant chunk in the index?

    The document may be indexed, but the specific passage that answers the query may have been split across chunks in a way that fragments the answer. Retrieve all chunks for the source document and examine whether any single chunk contains the complete answer. If the answer spans two or more chunks, your chunking strategy needs adjustment — larger chunks, more overlap, or semantic chunking.
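    A quick way to test for fragmentation, assuming you can list a document's chunks and know a few key phrases the answer must contain:

    ```python
    def answer_fragmented(chunks, key_phrases):
        """True when no single chunk contains every phrase needed for the answer."""
        phrases = [p.lower() for p in key_phrases]
        return not any(all(p in chunk.lower() for p in phrases) for chunk in chunks)

    # Chunking split the refund rule across two chunks:
    split_chunks = [
        "Refund requests must be filed within 14 days of purchase.",
        "Approved refunds are paid to the original payment method.",
    ]
    answer_fragmented(split_chunks, ["14 days", "original payment method"])  # True
    ```

    A `True` here points at chunk size, overlap, or the splitting strategy rather than at retrieval scoring.
    
    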

    Check 3: Does the embedding capture the semantic relationship?

    Generate the embedding for the query and for the chunk that should have been retrieved. Calculate their cosine similarity. If the similarity is low despite the chunk being semantically relevant to the query, the embedding model is failing to capture the relationship. This happens with domain-specific vocabulary that the embedding model was not trained on. Consider a domain-specific embedding model or query expansion.
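    Cosine similarity itself is a one-liner; the vectors below are tiny stand-ins, not output from a real embedding model:

    ```python
    import math

    def cosine_similarity(a, b):
        """Cosine similarity between two embedding vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    query_vec = [0.1, 0.9, 0.2]  # stand-in embedding of the query
    chunk_vec = [0.2, 0.8, 0.1]  # stand-in embedding of the chunk that should match
    cosine_similarity(query_vec, chunk_vec)  # close to 1.0: the model relates them
    ```

    Run this with real embeddings of the failing query and the chunk that should have been retrieved; a low score confirms the embedding model, not the search, is the weak link.
    
    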

    Check 4: Are metadata filters too restrictive?

    If your retrieval pipeline uses metadata filters (date ranges, document types, departments), verify that the filters are not excluding the correct documents. This is a surprisingly common cause of retrieval failure — a filter that was correct when configured becomes stale as new documents arrive with different metadata patterns.
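    A sketch of the check, again against a toy in-memory index; real stores apply the filter server-side, but the logic is the same:

    ```python
    def filter_excludes_document(index, doc_id, metadata_filter):
        """True when the document is indexed but the filter removes all its chunks."""
        doc_chunks = [c for c in index if c["metadata"].get("doc_id") == doc_id]
        surviving = [
            c for c in doc_chunks
            if all(c["metadata"].get(k) == v for k, v in metadata_filter.items())
        ]
        return bool(doc_chunks) and not surviving

    index = [
        {"text": "Refunds are processed within 14 days.",
         "metadata": {"doc_id": "policy-2025", "year": 2025}},
    ]
    filter_excludes_document(index, "policy-2025", {"year": 2024})  # True: stale filter
    ```
    
    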

    Check 5: Is the re-ranker demoting the correct chunk?

    If you use a re-ranking step after initial retrieval, the re-ranker may be scoring the correct chunk lower than irrelevant chunks. Check the pre-reranking results to see if the correct chunk was retrieved initially but then demoted.
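    Comparing the two orderings is mechanical once both lists are logged; the chunk IDs below are illustrative:

    ```python
    def reranker_demoted(pre_ids, post_ids, correct_id, top_k=3):
        """True when the correct chunk was in the initial top-k but not after re-ranking."""
        return correct_id in pre_ids[:top_k] and correct_id not in post_ids[:top_k]

    pre = ["policy#3", "faq#1", "blog#7", "faq#2"]   # initial retrieval order
    post = ["faq#1", "blog#7", "faq#2", "policy#3"]  # order after re-ranking
    reranker_demoted(pre, post, "policy#3")  # True: re-ranker pushed the right chunk out
    ```
    
    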

    Diagnosing LLM Hallucination

    When the context is correct but the LLM output is wrong, investigate these factors.

    Context window overflow: If the retrieved context is very long, the LLM may lose track of information in the middle of the context. This is the "lost in the middle" phenomenon. Test by placing the relevant chunk first in the context and seeing if the answer improves.
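    A minimal reordering helper for that test, assuming each chunk carries an `id` field:

    ```python
    def promote_chunk(chunks, relevant_id):
        """Move the chunk believed to hold the answer to the front of the context."""
        hit = [c for c in chunks if c["id"] == relevant_id]
        rest = [c for c in chunks if c["id"] != relevant_id]
        return hit + rest

    context = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
    promote_chunk(context, "c")  # [{"id": "c"}, {"id": "a"}, {"id": "b"}]
    ```

    If the answer improves with the relevant chunk first, trim total context length or re-order by relevance rather than switching models.
    
    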

    Conflicting context: If multiple retrieved chunks contain different or contradictory information, the LLM may merge them incorrectly or choose the wrong one. Check whether the retrieved chunks are consistent with each other.

    Prompt instruction conflicts: The system prompt may contain instructions that conflict with faithful context use. Instructions like "be helpful and provide comprehensive answers" can encourage the LLM to supplement the context with its parametric knowledge, leading to fabrication.

    Model capability limits: Some models are better at faithful context grounding than others. If hallucination persists after prompt optimization, consider switching to a model with stronger instruction-following and context-grounding capabilities.

    Fixing the Right Problem

    Once you have diagnosed the root cause, apply the targeted fix.

    For retrieval failures:

    • Re-index missing documents
    • Adjust chunk sizes and overlap
    • Switch to a domain-appropriate embedding model
    • Relax overly restrictive metadata filters
    • Add query expansion or hybrid search (keyword plus semantic)
    • Implement a re-ranking step or tune the existing one

    For LLM hallucination:

    • Add explicit grounding instructions: "Only use information from the provided context"
    • Require inline citations: force the LLM to reference specific chunks
    • Use structured output formats that constrain the response
    • Reduce context length to avoid "lost in the middle" effects
    • Implement post-generation verification: check claims in the response against the source context
    • Lower temperature to reduce response variance
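    Post-generation verification can start very simply, for example by flagging numbers in the response that never appear in the retrieved context. This regex-based check is a crude sketch, not a full claim verifier:

    ```python
    import re

    def unsupported_numbers(response, context):
        """Numbers in the response that never appear in the retrieved context."""
        resp = set(re.findall(r"\d+(?:\.\d+)?", response))
        ctx = set(re.findall(r"\d+(?:\.\d+)?", context))
        return resp - ctx

    unsupported_numbers(
        "Refunds take 30 days.",
        "Refunds are processed within 14 days.",
    )  # {"30"}: the LLM changed a number that was in the context
    ```

    A non-empty result is a strong hallucination signal, since fabricated figures are among the most damaging and most detectable failures.
    
    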

    Where Ertas Fits

    The fundamental prerequisite for this entire diagnostic framework is observability. You need to see the retrieved context, not just the final answer. You need to trace a query through the retrieval pipeline — from embedding to vector search to re-ranking to context assembly — and see what happened at each stage.

    Ertas Data Suite builds RAG retrieval pipelines as visual, node-based workflows on a canvas. Each node in the pipeline — Query Embedder, Vector Search, Context Assembler, API Response — is independently observable. When a query produces a bad result, you can inspect the output of each node to see exactly where the pipeline failed: Did the embedding capture the query intent? Did the vector search return the right chunks? Did the context assembler format them correctly?

    This is the difference between "the RAG system gave a wrong answer" (a dead end) and "the vector search returned 5 chunks, none from the correct document, because the metadata filter excluded documents uploaded after January" (an actionable diagnosis). Observable pipelines turn debugging from guesswork into engineering.

    Key Takeaways

    Bad RAG output is not a single problem. It is either a retrieval failure (the LLM never got the right context) or a hallucination (the LLM had the right context but did not use it faithfully). The diagnostic process starts by inspecting the retrieved context — everything follows from there.

    Build your RAG system with observability as a first-class requirement, not an afterthought. Log retrieved context for every query. Maintain test queries with known-good answers. And when something goes wrong, diagnose before you fix — the five minutes spent confirming the root cause will save days of tuning the wrong component.

