
RAG Quality Scoring: How to Measure Retrieval Accuracy Before It Reaches Your Users
Bad retrieval quality means bad AI answers — but most teams have no way to measure it until users complain. Here is how to build quality scoring into your RAG pipeline at the node level.
Most RAG pipelines are built to move data from source to vector store to prompt as quickly as possible. Speed is the default priority. Quality measurement, if it exists at all, lives at the very end — someone asks a question, gets a bad answer, and files a ticket.
By then the damage is done. The hallucinated response has already reached production. The user has already lost trust. And the team debugging the issue has to work backwards through the entire pipeline to figure out where things went wrong.
The best way to monitor RAG pipeline performance is not to wait for failures at the output. It is to score quality at every stage — parsing, chunking, embedding, and retrieval — so degradation is visible before it compounds into a wrong answer.
Why End-to-End Evaluation Is Not Enough
Teams that do measure RAG quality typically rely on end-to-end evaluation: generate a set of test questions, run them through the pipeline, and score the final answers. This approach has real value, but it has a fundamental limitation.
When a test question produces a bad answer, you know something is wrong. You do not know what. Was the source document parsed incorrectly, dropping a critical table? Was the chunking strategy splitting a paragraph mid-sentence, destroying context? Was the embedding model placing the chunk in the wrong region of vector space? Was the retriever returning the third-best match instead of the first?
End-to-end evaluation tells you the pipeline is broken. Stage-level quality scoring tells you where.
Stage 1: Parsing Quality
Every RAG pipeline starts with document ingestion — converting PDFs, HTML pages, spreadsheets, or other formats into clean text. This is where the first layer of quality loss happens, and it is routinely ignored.
What to measure
Structural completeness. Count the number of structural elements (headings, tables, lists, code blocks) in the source document, then count how many survived parsing. A PDF with 12 tables that yields zero table elements after parsing has a structural completeness score of 0% for tables. That is a measurable, loggable signal.
Character-level fidelity. Compare character counts before and after parsing. A 5,000-character document that produces 2,100 characters of parsed output has lost more than half its content. Flag any document where parsed output falls below 70% of source length.
Encoding errors. Count garbled characters, mojibake sequences, or unicode replacement characters in parsed output. Even a small number of encoding errors in a financial document can turn "$1,500" into garbage.
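These three checks are cheap enough to run on every document. A minimal sketch in Python, where the `source_counts` and `parsed_counts` dictionaries are hypothetical stand-ins for whatever element counts your parser reports, and the 70% length floor mirrors the threshold above:

```python
def structural_completeness(source_counts: dict, parsed_counts: dict) -> dict:
    """Per-element-type survival rate, e.g. 12 source tables and 0
    parsed tables -> {'table': 0.0}."""
    return {
        kind: (parsed_counts.get(kind, 0) / count) if count else 1.0
        for kind, count in source_counts.items()
    }


def length_fidelity(source_text: str, parsed_text: str) -> float:
    """Fraction of source characters that survived parsing."""
    return len(parsed_text) / max(len(source_text), 1)


def encoding_error_count(parsed_text: str) -> int:
    """Count Unicode replacement characters left behind by bad decoding."""
    return parsed_text.count("\ufffd")


def parsing_ok(source_text, parsed_text, source_counts, parsed_counts) -> bool:
    """Gate a document on all three checks before it reaches chunking."""
    survival = structural_completeness(source_counts, parsed_counts)
    return (
        length_fidelity(source_text, parsed_text) >= 0.70  # threshold from above
        and all(rate > 0 for rate in survival.values())
        and encoding_error_count(parsed_text) == 0
    )
```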
Practical threshold
Set a minimum parsing quality score and route documents that fall below it to a review queue rather than letting them flow into chunking. In Ertas, the Quality Scorer node sits directly after the parser node, and documents that fail the threshold are flagged with a visual indicator on the pipeline canvas — you see the element count drop on the edge between nodes.
Stage 2: Chunk Quality
Chunking is where most RAG quality problems originate, but it is rarely measured directly. Teams pick a chunk size (512 tokens, 1024 tokens) and a strategy (fixed-size, recursive, semantic) and assume it works. It often does not.
What to measure
Semantic coherence. A chunk should contain a single coherent idea or closely related ideas. You can approximate this by embedding the first and second halves of each chunk separately and measuring cosine similarity. High similarity means the chunk is internally coherent. Low similarity means the chunk boundary cut through the middle of a topic transition.
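A minimal sketch of that split-half check, assuming an `embed` callable (a stand-in for your embedding model) that maps text to a 1-D numpy vector:

```python
import numpy as np


def coherence_score(chunk: str, embed) -> float:
    """Split-half coherence: embed each half of the chunk separately
    and measure cosine similarity. High means internally coherent;
    low means the boundary likely cut through a topic transition."""
    mid = len(chunk) // 2
    a, b = embed(chunk[:mid]), embed(chunk[mid:])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```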
Boundary quality. Check whether chunks start and end at natural boundaries — sentence endings, paragraph breaks, section headings. A chunk that begins mid-sentence (e.g., "...and therefore the liability extends to") is almost certainly going to retrieve poorly.
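A crude heuristic for the boundary check might look like the following; the regexes are illustrative and English-centric, not exhaustive:

```python
import re


def clean_boundaries(chunk: str) -> bool:
    """Heuristic boundary check: a clean chunk starts like a sentence
    or heading and ends with terminal punctuation. A chunk beginning
    '...and therefore the liability extends to' fails the first test."""
    starts_clean = bool(re.match(r'\s*[A-Z0-9#"\'(\[]', chunk))
    ends_clean = bool(re.search(r'[.!?:"\')\]]\s*$', chunk))
    return starts_clean and ends_clean
```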
Size distribution. Plot the distribution of chunk sizes across your corpus. Healthy chunking produces a relatively tight distribution centered on your target size. A long tail of very short chunks (under 50 tokens) usually indicates parser artifacts — empty sections, repeated headers, or formatting remnants that survived parsing but carry no semantic value.
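Summarizing that distribution takes a few lines, assuming you already have per-chunk token counts; the 50-token floor is the figure from above:

```python
import numpy as np


def size_report(chunk_token_counts: list, min_tokens: int = 50) -> dict:
    """Summarize the chunk-size distribution and count suspiciously
    short chunks, which usually point at parser artifacts."""
    sizes = np.asarray(chunk_token_counts)
    return {
        "mean": float(sizes.mean()),
        "p5": float(np.percentile(sizes, 5)),
        "p95": float(np.percentile(sizes, 95)),
        "under_min": int((sizes < min_tokens).sum()),
    }
```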
Overlap consistency. If you are using overlapping chunks, verify that the overlap is actually working. Measure the token overlap between consecutive chunks and flag any pair where the overlap is zero (indicating a gap) or unusually large (indicating redundancy).
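Here is one way to sketch the overlap check, using whitespace tokens as a stand-in for your real tokenizer's tokens:

```python
def overlap_tokens(chunk_a: str, chunk_b: str) -> int:
    """Longest suffix of chunk_a (in whitespace tokens) that is also a
    prefix of chunk_b -- the realized overlap between neighbors."""
    a, b = chunk_a.split(), chunk_b.split()
    for size in range(min(len(a), len(b)), 0, -1):
        if a[-size:] == b[:size]:
            return size
    return 0


def flag_overlap_issues(chunks: list, expected_overlap: int) -> list:
    """Indices of consecutive pairs whose overlap is zero (a gap) or
    more than twice the configured overlap (redundancy)."""
    return [
        i for i in range(len(chunks) - 1)
        if (o := overlap_tokens(chunks[i], chunks[i + 1])) == 0
        or o > 2 * expected_overlap
    ]
```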
Practical threshold
RAG quality scoring at the chunk level should flag any chunk with a coherence score below 0.6 or a size below your minimum viable threshold. In a well-tuned pipeline, fewer than 5% of chunks should fall below these thresholds. If more than 15% fail, the chunking strategy needs revision before you proceed to embedding.
Stage 3: Embedding Quality
Once chunks are embedded, you have vectors — but not all vectors are equally useful. Poor embeddings cluster unrelated content together or spread related content apart, both of which degrade retrieval.
What to measure
Intra-topic similarity. Take chunks that you know belong to the same topic (based on their source document or section heading) and measure the average cosine similarity of their embeddings. This should be high — typically above 0.7 for a well-matched embedding model.
Inter-topic separation. Take chunks from different topics and measure average cosine similarity. This should be low. If your embedding model produces similar vectors for "quarterly revenue summary" and "employee onboarding checklist," your retrieval is going to return irrelevant results regardless of how good your retriever is.
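Both measurements reduce to pairwise cosine similarity over labeled groups of embeddings. A sketch, where inter-topic similarity is approximated via topic centroids (an assumption made here to keep the comparison cheap):

```python
import numpy as np


def mean_pairwise_cosine(vectors: np.ndarray) -> float:
    """Average cosine similarity across all pairs of row vectors.
    Assumes at least two vectors."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v.T
    n = len(v)
    return float((sims.sum() - n) / (n * (n - 1)))  # exclude the diagonal


def topic_separation(by_topic: dict) -> tuple:
    """Return (intra-topic similarity, inter-topic similarity), where
    `by_topic` maps a topic label to an (n, d) array of embeddings.
    Healthy: intra high (above ~0.7), inter clearly lower."""
    intra = float(np.mean([mean_pairwise_cosine(v) for v in by_topic.values()]))
    centroids = np.stack([v.mean(axis=0) for v in by_topic.values()])
    inter = mean_pairwise_cosine(centroids)
    return intra, inter
```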
Dimensional utilization. Some embedding models produce vectors that only use a fraction of their dimensional capacity — most of the variance concentrates in a few dimensions while others carry near-zero signal. Measure the explained variance ratio across dimensions. If 90% of the variance is captured by 10% of dimensions, you may benefit from a different embedding model or dimensionality reduction.
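One way to approximate dimensional utilization is a plain SVD over a sample of embeddings; the 10% cutoff below is the illustrative figure from above:

```python
import numpy as np


def variance_concentration(embeddings: np.ndarray, top_frac: float = 0.10) -> float:
    """Fraction of total variance captured by the top `top_frac` of
    principal directions. A value near 0.9 with top_frac=0.1 means
    most dimensions carry near-zero signal."""
    centered = embeddings - embeddings.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # sorted descending
    var = s ** 2
    k = max(1, int(len(var) * top_frac))
    return float(var[:k].sum() / var.sum())
```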
Nearest-neighbor sanity checks. For a random sample of chunks, retrieve the 5 nearest neighbors and score whether they are topically related. This is a direct measure of whether your embedding space supports good retrieval. If the average relevance of top-5 neighbors is below 60%, the embedding model is not well-suited to your domain.
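A sketch of that sanity check, using topic labels as a cheap proxy for human relevance judgments (hand-scoring a sample is more reliable but slower):

```python
import numpy as np


def knn_relevance(embeddings: np.ndarray, topics: list, k: int = 5) -> float:
    """For each chunk, fetch its k nearest neighbors by cosine
    similarity and score the fraction that share its topic label."""
    v = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = v @ v.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    hits = []
    for i in range(len(v)):
        neighbors = np.argsort(sims[i])[::-1][:k]
        hits.append(np.mean([topics[j] == topics[i] for j in neighbors]))
    return float(np.mean(hits))
```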
Practical threshold
Log embedding quality metrics per batch and set alerts for drift. An embedding model that scored well during initial evaluation can degrade as your corpus evolves — new document types, new terminology, or shifted topic distributions can all reduce embedding effectiveness over time.
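As a minimal sketch of what a per-batch drift alert can look like (the moving-average baseline and the alpha and tolerance values are illustrative choices, not recommendations):

```python
class DriftMonitor:
    """Tracks an exponential moving average of a per-batch quality
    metric and flags batches that fall well below the baseline."""

    def __init__(self, alpha: float = 0.05, tolerance: float = 0.10):
        self.alpha = alpha          # how fast the baseline adapts
        self.tolerance = tolerance  # relative drop that triggers an alert
        self.baseline = None

    def check(self, value: float) -> bool:
        """Record one batch metric; return True if it drifted low."""
        if self.baseline is None:
            self.baseline = value
            return False
        alert = value < self.baseline * (1 - self.tolerance)
        self.baseline = (1 - self.alpha) * self.baseline + self.alpha * value
        return alert
```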
Stage 4: Retrieval Relevance
This is the final gate before retrieved chunks enter the prompt. Even if parsing, chunking, and embedding are all working well, the retrieval step itself can introduce errors.
What to measure
Precision at K. For a set of test queries with known relevant documents, measure what fraction of the top-K retrieved chunks are actually relevant. Precision at 5 is the most common metric — out of the 5 chunks retrieved, how many are genuinely useful for answering the query.
Recall at K. Of all the chunks that should have been retrieved for a given query, how many actually appeared in the top-K results. Low recall means your retrieval is missing relevant information, which leads to incomplete answers.
Reciprocal rank. Where does the first relevant chunk appear in the ranked results? If the best chunk is consistently ranked third or fourth instead of first, your reranking strategy (or lack of one) needs attention.
Score distribution. Look at the similarity scores of retrieved chunks. A healthy retrieval produces a clear gap between relevant and irrelevant results. If the top chunk scores 0.82 and the fifth chunk scores 0.79, the retriever is not confidently distinguishing relevant from irrelevant content. If the top chunk scores 0.85 and the fifth scores 0.45, the signal is strong.
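All four checks are cheap once you have a labeled test set. A sketch, assuming each query's results arrive as a ranked list of chunk IDs plus similarity scores; average the per-query numbers across the whole test set:

```python
def precision_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    return sum(c in relevant for c in retrieved[:k]) / k


def recall_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of all relevant chunk IDs that made it into the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1/rank of the first relevant result; 0.0 if none was retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0


def score_gap(scores: list, k: int = 5) -> float:
    """Gap between the top score and the k-th score. A small gap (0.82
    vs 0.79) means the retriever is not separating relevant from
    irrelevant; a large one (0.85 vs 0.45) means the signal is strong."""
    return scores[0] - scores[k - 1]
```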
Putting It Together: Node-Level Observability
The real value of RAG quality scoring is not in any individual metric — it is in measuring all of them continuously, at every stage, and making the results visible.
In Ertas, this is built into the pipeline architecture. Every node in a visual pipeline logs its inputs and outputs. Element counts flow along edges, so you can see exactly how many documents entered the parser, how many chunks emerged, how many passed quality thresholds, and how many were retrieved. The Quality Scorer node applies configurable thresholds at any stage, and the Anomaly Detector node watches for sudden changes — a parser that usually produces 200 elements per document suddenly producing 50 is a signal worth investigating.
This node-level approach means you do not need a separate monitoring system. The pipeline itself is the monitoring system. Quality scores are visible on the canvas, degradation is caught at the stage where it occurs, and the team debugging a bad answer can look at the pipeline visualization and immediately see where the quality dropped.
The Metrics That Matter Most
If you are starting from zero and need to prioritize, focus on these three metrics:
- Parsing completeness — are you losing content during ingestion? Measure structural element survival rate.
- Chunk coherence — are your chunks semantically self-contained? Measure intra-chunk similarity.
- Precision at 5 — are the right chunks reaching the prompt? Measure relevance of top-K results.
These three metrics, measured continuously, will catch the majority of RAG quality issues before they reach production. They are cheap to compute, easy to interpret, and directly actionable — a low score on any one of them points to a specific stage that needs attention.
The alternative is waiting for user complaints. That approach works too, eventually. But by the time a user reports a bad answer, the pipeline has been serving degraded results to every user who asked a similar question. RAG quality scoring shifts the detection point from "after the user notices" to "before the data leaves the node." That is the difference between reactive debugging and proactive quality control.