
Building a GDPR-Safe RAG Pipeline: Redaction, Consent, and the Right to Be Forgotten in Vector Databases
Vector databases were not designed for GDPR. They have no concept of consent tracking, purpose limitation, or selective deletion. Here is how to build a RAG pipeline that handles data subject rights from day one.
Retrieval-Augmented Generation has become the default architecture for enterprise AI systems that need to answer questions over proprietary data. The pattern is straightforward: chunk your documents, embed them into a vector store, retrieve relevant chunks at query time, and pass them to a language model for synthesis.
The problem is that this pattern was designed for accuracy, not for privacy. Vector databases store dense numerical representations of your source text, and those representations can encode personal data — names, email addresses, medical details, financial identifiers — in ways that are nearly impossible to selectively remove after the fact. If you are building a RAG pipeline that touches personal data of EU residents, you need a GDPR-safe RAG pipeline from the start, not a compliance retrofit bolted on later.
Why Vector Stores Create GDPR Problems
GDPR grants data subjects a set of rights that directly conflict with how most RAG systems are built. Understanding the friction points is the first step toward solving them.
The right to erasure (Article 17). When a data subject requests deletion of their personal data, you must be able to remove it. In a relational database, this means deleting rows. In a vector store, the personal data is encoded inside embedding vectors alongside other semantic content. You cannot surgically remove one person's name from a 1536-dimensional vector that also encodes the meaning of the surrounding paragraph. Your options are to delete the entire chunk (losing useful non-personal context) or to re-embed the chunk without the personal data (expensive and error-prone at scale).
Consent tracking (Article 6). Every piece of personal data in your system must have a lawful basis for processing. Vector stores have no native concept of consent records. They store vectors and optional metadata — there is no built-in mechanism to record why you are allowed to process a particular embedding or to invalidate that permission later.
Purpose limitation (Article 5(1)(b)). Personal data collected for one purpose cannot be repurposed without additional consent. When you embed a customer support transcript into a RAG system for product improvement, you may be violating the purpose for which the data was originally collected. The vector store does not track purpose — it just stores embeddings.
Storage limitation (Article 5(1)(e)). Personal data must not be kept longer than necessary. Vector stores are append-heavy by design, and most teams never delete embeddings. TTL mechanisms and automatic expiry are rare, and built-in retention policy enforcement is rarer still.
Data subject access requests (Article 15). When someone asks what personal data you hold about them, you must be able to answer. Searching a vector store for "all data related to John Smith" is not a simple query — embeddings are semantic, not structured. A similarity search might surface relevant chunks, but it offers no guarantee of completeness.
The Best Way to Build RAG with PII Redaction
The most effective strategy is to ensure personal data never enters the vector store in the first place. If your embeddings contain no PII, then erasure requests have nothing to erase, consent tracking for the vector layer becomes unnecessary, and data subject access requests against the vector store return nothing personal.
This is the redact-before-embed pattern, and it changes the compliance posture of your entire pipeline.
Architecture Overview
The pipeline has four stages, with redaction inserted between ingestion and embedding.
Stage 1 — Document ingestion. Source documents enter the pipeline from whatever origin — file uploads, API integrations, database exports. At this point, the documents contain personal data in plaintext. You store the originals in a controlled, access-restricted document store with full audit logging.
Stage 2 — PII redaction. Before any chunking or embedding occurs, every document passes through a PII redaction layer. This layer identifies and removes personal data — names, addresses, phone numbers, email addresses, national identifiers, financial account numbers, and health information. The redaction engine replaces each identified entity with a placeholder token. The mapping between placeholder tokens and original values is stored separately in an encrypted lookup table with strict access controls.
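The redaction step can be sketched as follows. This is a minimal illustration of the data flow only: the regexes, token format, and function names are assumptions for the example, and a production redaction engine would use trained entity recognition rather than patterns. The key point is that the placeholder-to-original mapping is returned separately, destined for the encrypted lookup table, never for the vector store.

```python
import re
import uuid

# Illustrative patterns only -- a real redaction layer detects names,
# addresses, national identifiers, and health data with trained models.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace each detected PII entity with a placeholder token.

    Returns the redacted text plus a placeholder -> original mapping,
    which is stored in the encrypted lookup table under strict access
    controls, separate from anything that gets embedded."""
    mapping: dict[str, str] = {}
    for label, pattern in PII_PATTERNS.items():
        def _sub(match, label=label):
            token = f"[{label}_{uuid.uuid4().hex[:8]}]"
            mapping[token] = match.group(0)
            return token
        text = pattern.sub(_sub, text)
    return text, mapping

redacted, mapping = redact(
    "Contact Anna at anna@example.com or +44 20 7946 0958."
)
```

Only `redacted` proceeds to chunking and embedding; `mapping` goes to the encrypted store.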
This is where Ertas PII Redactor fits into the architecture. It runs on-premise, so documents never leave your infrastructure during redaction. There is no cross-border data transfer to a third-party redaction API. The redactor produces an audit trail of every entity identified and redacted, which you need for demonstrating GDPR compliance to regulators.
Stage 3 — Chunking and embedding. The redacted documents are chunked and embedded into your vector store. Because PII has already been removed, the embeddings encode semantic meaning without personal data. Your vector store is now GDPR-safe by construction — there is nothing personal to delete, no consent to track at the embedding level, and no PII to surface in response to access requests.
Stage 4 — Query-time retrieval and rehydration. When a user queries the RAG system, relevant chunks are retrieved from the vector store. If the use case requires the original personal data in the response (and the user has authorization), placeholder tokens can be rehydrated from the encrypted lookup table. If the use case does not require personal data, the redacted chunks are passed directly to the language model.
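Rehydration at query time can be sketched like this (the token format and function name follow the earlier stage descriptions and are assumptions for the example): authorized callers get originals substituted back in; everyone else gets the redacted chunk as-is.

```python
import re

# Matches the placeholder tokens produced at redaction time
# (label + 8 hex characters -- an illustrative format, not a standard).
TOKEN_RE = re.compile(r"\[(?:EMAIL|PHONE|NAME)_[0-9a-f]{8}\]")

def rehydrate(chunk: str, lookup: dict[str, str], authorized: bool) -> str:
    """Substitute originals from the encrypted lookup table, but only
    for authorized callers; otherwise pass the redacted chunk through."""
    if not authorized:
        return chunk
    return TOKEN_RE.sub(lambda m: lookup.get(m.group(0), m.group(0)), chunk)

lookup = {"[EMAIL_0a1b2c3d]": "anna@example.com"}
chunk = "Reach the customer at [EMAIL_0a1b2c3d] for follow-up."
```

Unknown tokens are left in place rather than dropped, so a stale lookup entry degrades gracefully instead of silently deleting context.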
Handling Data Subject Access Requests
Under GDPR Article 15, data subjects can request a copy of all personal data you hold about them. With the redact-before-embed architecture, your response workflow is clear.
The vector store contains no personal data, so it is excluded from the DSAR scope. The encrypted lookup table contains the mapping between placeholders and original PII — this is searchable by data subject identifier. The original document store contains the source documents — these are searchable by standard database queries. Your audit trail shows exactly which documents were processed, when redaction occurred, and what entities were identified.
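The DSAR workflow reduces to structured queries against the lookup table. A minimal in-memory sketch (field names and the report shape are assumptions; a real system queries the encrypted lookup table and document store):

```python
from dataclasses import dataclass

@dataclass
class LookupEntry:
    token: str          # placeholder that appears in redacted chunks
    value: str          # the original PII value
    subject_id: str     # data subject this value belongs to
    document_id: str    # source document it was redacted from

def dsar_report(subject_id: str, lookup: list[LookupEntry]) -> dict:
    """Assemble an Article 15 response from the lookup table alone --
    the vector store holds no personal data and is out of scope."""
    entries = [e for e in lookup if e.subject_id == subject_id]
    return {
        "subject_id": subject_id,
        "pii_values": sorted({e.value for e in entries}),
        "source_documents": sorted({e.document_id for e in entries}),
    }

lookup = [
    LookupEntry("[EMAIL_0a1b2c3d]", "anna@example.com", "subj-42", "doc-1"),
    LookupEntry("[PHONE_9e8f7a6b]", "+44 20 7946 0958", "subj-42", "doc-2"),
    LookupEntry("[EMAIL_11223344]", "ben@example.com", "subj-77", "doc-3"),
]
report = dsar_report("subj-42", lookup)
```

Because every entry carries a subject identifier, completeness is a matter of an exact-match query, not a similarity search.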
This is dramatically simpler than trying to search a vector database for "everything related to this person," which is both technically unreliable and legally insufficient.
Handling Erasure Requests
When a data subject exercises their right to be forgotten under Article 17, the workflow is equally straightforward.
Delete the subject's entries from the encrypted lookup table. Delete or redact the subject's data from the original document store. The vector store requires no action — it contains no personal data. Log the erasure action in your audit trail for compliance documentation.
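The four erasure steps above can be sketched as one function. In-memory dicts stand in for the encrypted lookup table, document store, and audit trail; the structure and field names are assumptions for illustration.

```python
from datetime import datetime, timezone

def erase_subject(subject_id: str, lookup: dict[str, dict],
                  documents: dict[str, str], audit: list[dict]) -> int:
    """Article 17 workflow: drop lookup entries and source documents,
    log the action. The vector store needs no action -- its embeddings
    were built from redacted text and contain no personal data."""
    doomed = [t for t, e in lookup.items() if e["subject_id"] == subject_id]
    for token in doomed:
        documents.pop(lookup[token]["document_id"], None)
        del lookup[token]
    audit.append({
        "action": "erasure",
        "subject_id": subject_id,
        "tokens_removed": len(doomed),
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return len(doomed)

lookup = {
    "[EMAIL_0a1b2c3d]": {"subject_id": "subj-42", "document_id": "doc-1"},
    "[EMAIL_11223344]": {"subject_id": "subj-77", "document_id": "doc-3"},
}
documents = {"doc-1": "original text with PII", "doc-3": "original text with PII"}
audit: list[dict] = []
removed = erase_subject("subj-42", lookup, documents, audit)
```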
Compare this with the alternative: without pre-embedding redaction, an erasure request against a RAG system means identifying every chunk that might contain the subject's data (unreliable with semantic search), deleting or re-embedding those chunks (expensive and potentially destabilizing to retrieval quality), and proving to a regulator that you found everything (impossible to guarantee).
Consent Tracking and Purpose Limitation
Consent management belongs at the document store layer, not the vector store layer. When documents enter the pipeline, record the lawful basis for processing, the specific purposes for which the data was collected, any consent records or legitimate interest assessments, and the retention period.
This metadata travels with the document through the pipeline. If consent is withdrawn, you remove the document from the source store, delete the corresponding entries in the lookup table, and optionally remove the associated (already PII-free) chunks from the vector store if purpose limitation requires it.
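A sketch of the consent record and the withdrawal flow just described (the record's fields and function names are illustrative assumptions, not a fixed schema):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ProcessingRecord:
    """Lawful-basis metadata kept at the document store layer."""
    document_id: str
    lawful_basis: str        # e.g. "consent" or "legitimate_interest"
    purposes: list[str]      # purposes the data was collected for
    retain_until: date       # retention period end

def withdraw_consent(document_id: str,
                     records: dict[str, ProcessingRecord],
                     documents: dict[str, str],
                     lookup: dict[str, str]) -> None:
    """On withdrawal: remove the source document, its lookup entries,
    and the processing record. The PII-free chunks in the vector store
    need removal only if purpose limitation demands it."""
    documents.pop(document_id, None)
    for token in [t for t, d in lookup.items() if d == document_id]:
        del lookup[token]
    records.pop(document_id, None)

records = {"doc-1": ProcessingRecord("doc-1", "consent",
                                     ["support"], date(2026, 12, 31))}
documents = {"doc-1": "original transcript"}
lookup = {"[EMAIL_0a1b2c3d]": "doc-1"}
withdraw_consent("doc-1", records, documents, lookup)
```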
Because the redaction step is logged, you can demonstrate to regulators exactly which documents were processed under which lawful basis, and when.
Storage Limitation Enforcement
Implement retention policies at the document store level. When a document's retention period expires, delete it from the source store and the lookup table. The redacted embeddings in the vector store can be retained longer if they serve a legitimate purpose — since they contain no personal data, GDPR storage limitation constraints are relaxed.
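A retention sweep over the document store might look like the following sketch (data shapes are assumptions for the example): expired source documents and their lookup entries are purged, while the PII-free embeddings are untouched.

```python
from datetime import date

def sweep(today: date, documents: dict[str, date],
          lookup: dict[str, str]) -> list[str]:
    """Purge documents past their retention date.

    `documents` maps document_id -> retain_until; `lookup` maps
    placeholder token -> document_id. Embeddings in the vector store
    are left alone: they contain no personal data."""
    expired = [doc_id for doc_id, until in documents.items()
               if until < today]
    for doc_id in expired:
        del documents[doc_id]
    gone = set(expired)
    for token in [t for t, d in lookup.items() if d in gone]:
        del lookup[token]
    return expired

documents = {"doc-1": date(2024, 1, 31), "doc-2": date(2030, 1, 31)}
lookup = {"[EMAIL_0a1b2c3d]": "doc-1", "[PHONE_9e8f7a6b]": "doc-2"}
expired = sweep(date(2025, 6, 1), documents, lookup)
```

Run on a schedule, this enforces Article 5(1)(e) automatically instead of relying on manual cleanup.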
This gives you a practical balance: your RAG system retains its knowledge base for as long as the embeddings are useful, while personal data is automatically purged according to your retention schedule.
RAG Vector Store GDPR Compliance in Practice
The difference between a compliant RAG pipeline and a non-compliant one is not the vector database you choose or the embedding model you use. It is whether personal data reaches the vector store at all.
Redacting PII before embedding eliminates the hardest GDPR challenges — selective deletion from dense vectors, consent tracking across distributed embeddings, and completeness guarantees for access requests. These problems become trivial when the vector store simply does not contain personal data.
Running the redaction step on-premise, as Ertas PII Redactor enables, removes the secondary compliance risk of sending personal data to a third-party processor for redaction. The data stays within your infrastructure boundary, the redaction happens locally, and the audit trail is under your control.
If you are building a RAG pipeline that will process personal data of EU residents, design the redaction layer first. Everything downstream becomes simpler when the vector store is clean from the start.