
Bad Chunks Poison RAG Answers: A Debugging Guide to Chunking Quality
How poor chunking strategy degrades RAG output quality. Real examples of bad chunks, diagnosis techniques, and fixes for common chunking failures.
Every RAG debugging session eventually arrives at the same place: you inspect the retrieved chunks and realize the problem is not retrieval, not the LLM, not the prompt. The chunks themselves are garbage. The chunking pipeline faithfully divided your documents into pieces, and those pieces are incoherent, incomplete, or misleading.
Chunking is the "garbage in, garbage out" of RAG. If the chunks are bad, everything downstream is bad — embeddings encode the wrong semantics, retrieval returns the wrong context, and the LLM generates the wrong answers. No amount of prompt engineering or reranking fixes fundamentally broken chunks.
This article catalogs the most common chunking failures, shows you what bad chunks actually look like, and gives you practical fixes for each one.
Bad Chunk Pattern 1: Mid-Sentence Splits
What it looks like:
Chunk A ends with: "The maximum coverage under this policy is limited to"
Chunk B starts with: "$500,000 per incident, with a deductible of $2,500 applicable to all claims filed after January 1, 2026."
Why it happens: Fixed-size chunking (split every N tokens) has no awareness of sentence boundaries. When a sentence straddles the chunk boundary, both chunks become individually meaningless. Chunk A states a coverage limit without the number. Chunk B provides a number without context about what it refers to.
The downstream damage: If a user asks "What is the maximum coverage?", retrieval might find Chunk A (it contains the word "coverage" and "maximum") but not Chunk B. The LLM either hallucinates a number or says it cannot find the information — even though the answer exists in the corpus.
Fix: Use sentence-aware chunking. The chunking algorithm should respect sentence boundaries, ensuring every chunk starts and ends on a complete sentence. Add 10-15% overlap between chunks so boundary sentences appear in both adjacent chunks.
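A sentence-aware splitter with sentence-level overlap can be sketched in a few lines. This is a minimal version using a naive regex for sentence boundaries; a production pipeline would swap in a proper sentence tokenizer (spaCy, NLTK), and the `max_chars` and `overlap_sentences` parameters are illustrative defaults:

```python
import re

def sentence_chunks(text: str, max_chars: int = 500, overlap_sentences: int = 1):
    """Split text into chunks that start and end on complete sentences.

    Naive regex sentence splitter -- a real pipeline should use a proper
    sentence tokenizer. Carrying the last `overlap_sentences` sentences
    into the next chunk gives roughly the 10-15% overlap recommended above.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sent in sentences:
        if current and len(" ".join(current)) + len(sent) + 1 > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]  # boundary sentences repeat
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because every chunk closes on sentence punctuation, the "coverage limit without the number" failure above cannot occur: the full sentence lands in at least one chunk, and the overlap repeats it in the neighbor.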
Bad Chunk Pattern 2: Orphaned Context
What it looks like:
Chunk: "The rate is 4.5%. For accounts exceeding the threshold, an additional 1.2% surcharge applies. Exceptions are listed in Appendix C."
What is missing: Which rate? What threshold? What kind of accounts? The chunk is grammatically complete but semantically orphaned — it contains specific details without the framing context that makes those details interpretable.
Why it happens: The chunking strategy splits by section or heading, but the section itself is a subsection that depends on the parent section for context. A chunk from "Section 3.2.1: Fee Schedule" makes no sense without knowing that Section 3.2 is about "Commercial Lending Products" and Section 3 is about "Business Banking."
The downstream damage: The LLM receives the chunk and must guess what "the rate" and "the threshold" refer to. It either guesses wrong (hallucination) or hedges with a vague answer. Either way, the user gets an unhelpful response.
Fix: Prepend hierarchical context to every chunk. If a chunk comes from Section 3.2.1, the chunk text should start with "Business Banking - Commercial Lending Products - Fee Schedule:" before the chunk content. This gives the LLM the framing it needs to interpret the specifics. Some teams call this "contextual chunking" or "breadcrumb chunking."
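Breadcrumb prepending is mechanical once the parser preserves the heading hierarchy. A minimal sketch for Markdown-style headings; the function name and the flat "Heading - Subheading: body" format are illustrative choices, not from any particular library:

```python
import re

def chunks_with_breadcrumbs(markdown: str):
    """Emit each body block prefixed with its ancestor headings.

    Tracks the current heading stack by '#' depth, so a block under
    '### Fee Schedule' inside '## Commercial Lending Products' carries
    both labels in its breadcrumb.
    """
    stack = []        # (level, title) for each open ancestor heading
    body_lines = []
    results = []

    def flush():
        text = "\n".join(body_lines).strip()
        if text:
            crumb = " - ".join(title for _, title in stack)
            results.append((crumb + ": " + text) if crumb else text)
        body_lines.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()   # close siblings and deeper sections
            stack.append((level, m.group(2).strip()))
        else:
            body_lines.append(line)
    flush()
    return results
```

Fed the Section 3.2.1 example, a chunk comes out as "Business Banking - Commercial Lending Products - Fee Schedule: The rate is 4.5%. ..." and "the rate" is no longer ambiguous.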
Bad Chunk Pattern 3: Table Fragmentation
What it looks like:
Chunk A: "| Plan | Monthly Cost | Storage |"
Chunk B: "| --- | --- | --- | | Free | $0 | 5 GB | | Builder | $34.50 | 50 GB |"
Chunk C: "| Agency | $149 | 200 GB | | Agency Pro | $349 | 500 GB |"
Why it happens: Tables are the worst-case scenario for fixed-size chunking. The header row lands in one chunk, the first few data rows in another, and the remaining rows in a third. Each chunk is individually useless: headers without data, data without headers, and data rows scattered across chunks with no indication that they belong together.
The downstream damage: A user asks "How much does the Agency plan cost?" and retrieval returns Chunk C, which contains the answer but not the column headers. The LLM sees "$149" and "200 GB" but cannot determine which number is the cost and which is the storage limit without the header row.
Fix: Detect tables during parsing and treat each table as an atomic unit. If a table exceeds the chunk size limit, repeat the header row at the top of each chunk. Convert complex tables to structured text (key-value pairs or prose descriptions) before chunking if your documents contain tables that are too large to fit in a single chunk.
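Repeating the header row is a small amount of code once the table has been detected. A sketch assuming the parser hands you the table as a list of Markdown rows (header, separator, then data); the `max_rows` limit is a stand-in for whatever your real chunk-size budget allows:

```python
def chunk_table(rows: list[str], max_rows: int = 20) -> list[str]:
    """Split a Markdown table into chunks, repeating the header in each.

    `rows` is the table line by line: header, separator, then data rows.
    Assumes table detection already happened during parsing.
    """
    header, sep, data = rows[0], rows[1], rows[2:]
    chunks = []
    for i in range(0, len(data), max_rows):
        # every chunk restates the header so its rows stay interpretable
        chunks.append("\n".join([header, sep] + data[i:i + max_rows]))
    return chunks
```

With this, the "Agency plan" chunk carries its own "| Plan | Monthly Cost | Storage |" header, and the LLM can tell $149 is the cost.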
Bad Chunk Pattern 4: Overlapping Redundancy
What it looks like:
Chunk A (tokens 0-500): "Our privacy policy ensures that personal data is handled in accordance with GDPR requirements. Data subjects have the right to access..."
Chunk B (tokens 300-800): "...the right to access, rectify, and erase their personal data. Data subjects have the right to access their data and request corrections at any time. Our privacy policy ensures that personal data is handled..."
Why it happens: Excessive chunk overlap (40-50% or more) causes large portions of text to repeat across adjacent chunks. The overlap was set too aggressively, probably in an attempt to solve the mid-sentence split problem.
The downstream damage: Retrieval returns multiple chunks that contain nearly identical information, wasting context window space. The LLM may repeat itself in its answer, or worse, treat the redundant mentions as corroborating evidence and express higher confidence than warranted.
Fix: Keep overlap between 10-20% of chunk size. Overlap is meant to preserve boundary context, not to duplicate entire paragraphs. If you are using overlap above 25%, you are likely compensating for a chunking granularity problem — fix the granularity instead of adding more overlap.
Bad Chunk Pattern 5: Metadata-Content Contamination
What it looks like:
Chunk: "Last updated: 2024-03-15 | Author: J. Smith | Department: Legal | Version: 3.2 | Status: Approved | Review Date: 2025-03-15 | The indemnification clause in Section 7 requires the contractor to maintain insurance coverage of no less than..."
Why it happens: The document parser extracts everything on the page, including metadata headers, document properties, and administrative information. The chunking pipeline does not distinguish between document metadata and document content.
The downstream damage: The metadata tokens consume chunk space without contributing semantic value. The embedding encodes metadata noise alongside the actual content, reducing the embedding's representational quality. Retrieval may match on metadata terms ("author: J. Smith") instead of content relevance.
Fix: Separate metadata extraction from content extraction during parsing. Store metadata as structured fields in the chunk's metadata (filterable in the vector store) rather than embedding it in the chunk text. If metadata is useful for retrieval, add it as a separate metadata field, not as part of the embedded text.
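A sketch of the separation step, assuming the pipe-delimited "Key: value" header shape from the example above; real documents need parser-specific logic, and the key-length guard is an illustrative heuristic:

```python
import re

def split_metadata(raw: str) -> tuple[dict, str]:
    """Separate a 'Key: value | Key: value | ...' header from content.

    Walks the pipe-delimited segments; content begins at the first
    segment that does not look like a short Key: value pair. Metadata
    goes into filterable fields; only the content gets embedded.
    """
    meta = {}
    parts = raw.split(" | ")
    for i, part in enumerate(parts):
        m = re.match(r"^([\w .]+):\s+(\S.*)$", part)
        if m and len(m.group(1)) <= 30:   # short keys only, to avoid eating prose
            meta[m.group(1)] = m.group(2)
        else:
            return meta, " | ".join(parts[i:])
    return meta, ""
```

The returned `meta` dict maps onto the vector store's metadata fields, so "Author: J. Smith" becomes a filter rather than embedding noise.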
Bad Chunk Pattern 6: Multi-Topic Chunks
What it looks like:
Chunk: "Employees are entitled to 20 days of paid time off per calendar year. Unused PTO does not roll over. In other news, the office holiday party will be held on December 15 at the downtown Marriott. Please RSVP by December 1. Additionally, the IT department reminds all staff that password rotation is required every 90 days."
Why it happens: The source document (an employee newsletter, a meeting transcript, a Slack export) contains multiple unrelated topics in sequence. Fixed-size chunking groups them into a single chunk because they happen to be adjacent in the text.
The downstream damage: The embedding for this chunk is an average of three unrelated topics — PTO policy, party logistics, and IT security. It will be a poor match for any specific query about any of the three topics. A query about PTO policy might not retrieve this chunk because the embedding is diluted by party and password content.
Fix: Use topic-aware chunking for unstructured or multi-topic documents. Topic segmentation algorithms can detect topic boundaries within a document and split chunks accordingly. For structured documents, chunk by section heading. For unstructured text (transcripts, chat logs), consider using an LLM to insert topic boundaries before chunking.
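The splitting idea can be illustrated with a toy segmenter that cuts wherever adjacent sentences share almost no vocabulary. Bag-of-words cosine here is a cheap stand-in for real sentence embeddings, and the threshold is illustrative; in production you would embed each sentence and tune the cut point on your corpus:

```python
import math
import re
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topic_split(text: str, threshold: float = 0.25) -> list[str]:
    """Split text where adjacent sentences look topically unrelated.

    Toy stand-in: bag-of-words cosine instead of sentence embeddings.
    A cut below `threshold` starts a new segment.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if not sentences:
        return []
    vecs = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    segments, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            segments.append(" ".join(current))
            current = []
        current.append(sent)
    segments.append(" ".join(current))
    return segments
```

Each resulting segment gets its own embedding, so a PTO query matches the PTO segment instead of an average of PTO, party logistics, and password policy.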
How to Audit Your Chunks
Before you debug retrieval, debug your chunks. Here is a practical audit process:
Step 1: Sample randomly. Pull 50 random chunks from your vector store. Read each one as if you were a human who had never seen the source document. Can you understand what each chunk is about? Does it contain a complete thought?
Step 2: Test boundary chunks. Find chunks that start or end mid-sentence. Count them. If more than 10% of your chunks have broken boundaries, your chunking strategy needs revision.
Step 3: Check for orphans. Identify chunks that reference "the above," "as mentioned," "this section," or similar relative references without the referent being present in the chunk. These are orphaned chunks that will confuse the LLM.
Step 4: Measure redundancy. Compare adjacent chunks. If more than 30% of the content overlaps, your overlap setting is too aggressive.
Step 5: Inspect tables and lists. Find chunks that contain partial tables (data without headers) or partial lists (items without the list introduction). These need atomic chunking.
Step 6: Look for metadata contamination. Find chunks where more than 20% of the text is document metadata rather than content. These need parser-level fixes.
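Several of the audit steps above can be automated with simple heuristics and run over a random sample of chunks. A sketch; the patterns and thresholds are illustrative starting points, not a definitive rule set:

```python
import re

ORPHAN_REFS = ("the above", "as mentioned", "this section", "see below")
METADATA_KEYS = ("last updated:", "author:", "version:", "status:")

def audit_chunk(chunk: str) -> list[str]:
    """Flag chunk-quality problems from the audit steps above."""
    issues = []
    text = chunk.strip()
    # Step 2: broken boundaries -- should start capitalized, end on punctuation
    if text and not text[0].isupper():
        issues.append("starts mid-sentence")
    if text and text[-1] not in ".!?\"'":
        issues.append("ends mid-sentence")
    # Step 3: orphaned relative references
    lowered = text.lower()
    if any(ref in lowered for ref in ORPHAN_REFS):
        issues.append("orphaned reference")
    # Step 5: partial table -- pipe rows without a header separator line
    if "|" in text and "---" not in text:
        issues.append("possible partial table")
    # Step 6: metadata contamination
    if sum(lowered.count(k) for k in METADATA_KEYS) >= 2:
        issues.append("metadata contamination")
    return issues
```

Run it over 50 sampled chunks and tally the flags; if more than 10% trip the boundary checks, that matches the revision threshold from Step 2.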
Building Better Chunking Pipelines
The root cause of bad chunks is almost always a chunking strategy chosen once and never revisited. Teams pick "recursive character text splitter with 1000 tokens and 200 overlap" from a LangChain tutorial, deploy it, and never look at the actual chunks it produces.
Chunking is not a configuration parameter. It is a data quality decision that directly determines the upper bound of your RAG pipeline's answer quality. No downstream technique — reranking, prompt engineering, bigger context windows — can compensate for chunks that do not contain coherent, complete information.
Ertas Data Suite includes a dedicated RAG Chunker node that lets you configure chunking strategy, inspect the output chunks visually on the canvas, and iterate on the parameters before the chunks ever reach the embedding stage. When you can see your chunks — actually read them, one by one — you catch the garbage before it enters the vector store. When chunking is a function call buried in a Python script, nobody ever looks at the output.
Look at your chunks. Read them. If they do not make sense to you, they will not make sense to the LLM either.