
Building a PII Redaction Pipeline for AI-Ready Training Data
Step-by-step guide to building an on-premise PII redaction pipeline that handles email, phone, SSN, addresses, and medical IDs — before data enters AI training or RAG pipelines. GDPR and HIPAA compliant.
A PII redaction pipeline is an automated data processing workflow that detects and removes personally identifiable information from documents before that data enters an AI training dataset or retrieval-augmented generation (RAG) system. It matters because AI models trained on unredacted data can memorize and reproduce PII, which can create legal exposure under GDPR, HIPAA, and the EU AI Act for both the service provider and the end client.
PII Types: What Needs to Be Redacted
Not all PII carries the same regulatory weight. The table below maps common PII types to regulatory frameworks and gives concrete examples of what detection must cover.
| PII Type | Examples | Regulatory Reference |
|---|---|---|
| Email addresses | user@example.com, firstname.lastname@corp.org | GDPR Art. 4, HIPAA Safe Harbor |
| Phone numbers | +1-555-867-5309, (800) 555-0100, international formats | GDPR Art. 4, HIPAA Safe Harbor |
| Social Security Numbers | 123-45-6789, 123456789 | HIPAA Safe Harbor, US state privacy laws |
| Street addresses | 123 Main St, Apt 4B, City, State ZIP | GDPR Art. 4, HIPAA Safe Harbor |
| Medical record IDs | MRN-00123456, patient ID formats | HIPAA Safe Harbor (18 identifiers) |
| Financial identifiers | Credit card numbers, IBAN, account numbers | PCI DSS (card data), GDPR Art. 4, HIPAA Safe Harbor (account numbers) |
| Names | Full names in context, combined with other data | GDPR Art. 4, HIPAA Safe Harbor |
| IP addresses | 192.168.1.1, IPv6 addresses | GDPR Art. 4(1) (online identifier), HIPAA Safe Harbor |
| Dates of birth | 01/15/1985, January 15, 1985 | HIPAA Safe Harbor |
For healthcare data specifically, HIPAA's Safe Harbor method requires removal of all 18 categories of identifiers before data can be considered de-identified. For EU data subjects, data falls outside GDPR only once it is anonymized to the point that re-identification is not reasonably likely; pseudonymized data is still personal data and remains in scope.
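As a rough illustration of what detection has to cover for the pattern-based rows of the table, the sketch below uses plain regular expressions. It is not how any particular product detects PII; the patterns and the `find_pii` helper are assumptions made for this example, they cover only a few US-style formats, and contextual types such as names and addresses need more than regex.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not a complete detector).
PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    # Hyphenated SSNs only; the unhyphenated form (123456789) is not covered here.
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # US-style numbers such as +1-555-867-5309 or (800) 555-0100.
    "PHONE": re.compile(r"(?:\+1[-\s]?)?(?:\(\d{3}\)\s?|\d{3}[-\s])\d{3}[-\s]\d{4}"),
    # IPv4 only; octet ranges are not validated.
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (entity_type, matched_text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(hits)]
```

Calling `find_pii` on a sentence containing an email address and a phone number returns one pair per match, which is the minimum a downstream redaction step needs to know.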
Step-by-Step: Building the PII Redaction Pipeline
The following steps use Ertas Data Suite node names directly. Each step corresponds to one or more nodes in the pipeline canvas.
Step 1: File Import Node — Load Source Documents
Configure the File Import node to point to your source document directory. For enterprise engagements, this is typically a network share, a mounted drive on the client's system, or a local folder.
Key settings:
- Source path: Directory containing raw documents
- Recursive scan: Enable to process subdirectories
- File type filter: Set to the formats present in the client's archive (PDF, DOCX, XLSX, TXT)
- Batch size: Configure based on available memory — 500–1000 documents per batch is typical for mixed PDF/Word archives
The File Import node queues documents for downstream processing and passes file metadata (path, name, size, type) alongside raw content.
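The settings above can be pictured with a small standalone sketch: a recursive scan, a file type filter, and fixed-size batches that carry file metadata. The function names `scan_files` and `batched` are hypothetical and say nothing about how the File Import node is implemented.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator

# File type filter (assumed to match the formats listed above).
ALLOWED_SUFFIXES = {".pdf", ".docx", ".xlsx", ".txt"}


def scan_files(source: Path, recursive: bool = True) -> Iterator[dict]:
    """Yield path, name, size, and type for each matching file under source."""
    walker = source.rglob("*") if recursive else source.glob("*")
    for path in sorted(walker):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in ALLOWED_SUFFIXES:
            yield {
                "path": str(path),
                "name": path.name,
                "size": path.stat().st_size,
                "type": suffix.lstrip("."),
            }


def batched(items: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Group items into lists of at most batch_size entries."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

Because both functions are generators, a large archive is never held in memory all at once; only one batch of metadata is materialized at a time.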
Step 2: Parse the Documents
Route each file to the appropriate parser node based on type:
PDF Parser (Docling integration): handles native PDFs with embedded text and scanned PDFs via OCR. Layout-aware extraction preserves table structure and multi-column layouts. For scanned documents, configure an OCR confidence threshold; records below the threshold are flagged by the Quality Scorer in Step 4.
Word Parser: extracts text from .docx files, preserving section structure and header/footer content where present.
Excel Parser: handles .xlsx files, flattening spreadsheet data into row-level text records. Cell references are resolved before PII detection.
After parsing, all documents enter the pipeline as structured text records regardless of their original format.
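Routing by file type can be sketched as a dispatch table that maps extensions to parser functions and normalizes every result into the same record shape. Only a plain-text parser is implemented here; the PDF, Word, and Excel entries are placeholders, and all names (`PARSERS`, `parse_document`, the record fields) are assumptions for illustration.

```python
from pathlib import Path
from typing import Callable


def parse_txt(path: Path) -> list[str]:
    """Split a text file into non-empty paragraphs."""
    raw = path.read_text(encoding="utf-8")
    return [part.strip() for part in raw.split("\n\n") if part.strip()]


def not_wired_up(path: Path) -> list[str]:
    """Placeholder for parsers this sketch does not implement."""
    raise NotImplementedError(f"no parser wired up for {path.suffix}")


# Extension -> parser. The real PDF/Word/Excel parsers are out of scope here.
PARSERS: dict[str, Callable[[Path], list[str]]] = {
    ".txt": parse_txt,
    ".pdf": not_wired_up,
    ".docx": not_wired_up,
    ".xlsx": not_wired_up,
}


def parse_document(path: Path) -> list[dict]:
    """Return structured text records regardless of the original format."""
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        raise ValueError(f"unsupported file type: {path.suffix}")
    return [
        {"source": str(path), "record_id": i, "text": text}
        for i, text in enumerate(parser(path))
    ]
```

Keeping the record shape identical across parsers is what lets the redaction step ignore where the text came from.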
Step 3: PII Redactor Node — Configure Entity Types and Redaction Method
The PII Redactor node is the core of the pipeline. Configure it for the specific client engagement:
Entity types to detect — select from the available categories:
- EMAIL: email addresses
- PHONE: phone numbers (US and international formats)
- SSN: Social Security Numbers
- ADDRESS: street addresses
- MEDICAL_ID: medical record numbers and patient identifiers
- FINANCIAL: credit card numbers, IBAN, bank account numbers
- PERSON_NAME: full names (contextual detection)
- DATE_OF_BIRTH: birth dates in common formats
- IP_ADDRESS: IPv4 and IPv6 addresses
Redaction method — three options:
- Mask: Replace detected PII with a label (e.g., [EMAIL], [PHONE]). Preserves the document structure and makes clear where redaction occurred. Recommended for training data where token count matters.
- Replace: Substitute detected PII with synthetic placeholders (e.g., user@example.com becomes contact@company.net). Useful when downstream models need realistic-looking examples.
- Remove: Delete the detected PII and surrounding context entirely. Most aggressive; use for highest-sensitivity data.
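To make the difference between the three methods concrete, here is a minimal sketch applied to email addresses only. It assumes that "surrounding context" for Remove means the whole sentence containing the match; that interpretation, the regex, and the `redact` function are illustrative choices rather than the node's actual behavior.

```python
import re

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def redact(text: str, method: str) -> str:
    """Apply one of the three redaction methods to email addresses."""
    if method == "mask":
        # Replace the match with a label that shows where redaction occurred.
        return EMAIL.sub("[EMAIL]", text)
    if method == "replace":
        # Substitute a fixed synthetic placeholder for every match.
        return EMAIL.sub("contact@company.net", text)
    if method == "remove":
        # Assumption: drop every sentence that contains a match.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        return " ".join(s for s in sentences if not EMAIL.search(s))
    raise ValueError(f"unknown redaction method: {method}")
```

Mask and Replace keep the sentence count unchanged, while Remove shortens the document, which is worth keeping in mind when downstream chunking depends on length.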
Confidence threshold — set the minimum detection confidence (default 0.85). Records where PII is detected below this threshold are flagged for human review rather than automatically redacted.
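The threshold behavior can be sketched as a split between detections that are redacted automatically and detections that are set aside for review. The `Detection` class and `apply_threshold` function are hypothetical; only the 0.85 default comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    entity_type: str
    start: int          # character offset where the match begins
    end: int            # character offset just past the match
    confidence: float   # detector confidence in [0, 1]


def apply_threshold(
    text: str, detections: list[Detection], threshold: float = 0.85
) -> tuple[str, list[Detection]]:
    """Mask confident detections; return the rest for human review."""
    confident = [d for d in detections if d.confidence >= threshold]
    flagged = [d for d in detections if d.confidence < threshold]
    # Work right to left so earlier offsets stay valid after each substitution.
    for d in sorted(confident, key=lambda d: d.start, reverse=True):
        text = text[: d.start] + f"[{d.entity_type}]" + text[d.end :]
    return text, flagged
```

Low-confidence spans are deliberately left in the text here, so a record with flagged detections should not move on to export until a reviewer has resolved them.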
Step 4: Quality Scorer — Verify Redaction Completeness
The Quality Scorer node runs a post-redaction check on each processed document:
- Residual PII scan: Re-runs detection at a lower confidence threshold to catch any PII the primary redaction may have missed
- Completeness score: Calculates a per-document quality score (0–1.0) based on detection confidence, coverage, and any flagged anomalies
- Flag threshold: Documents below the configured score (default 0.90) are routed to a review queue rather than the export step
Documents that pass the Quality Scorer proceed to export. Documents that fail are logged with their specific failure reason and held for human review or re-processing.
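The source does not give the scoring formula, so the sketch below substitutes a deliberately simple one: the share of all PII found (redacted plus residual) that was actually redacted. The looser patterns stand in for "detection at a lower confidence threshold". Every name and the formula itself are assumptions for illustration.

```python
import re

# Looser patterns than a primary pass would use (assumed for this sketch).
RESIDUAL_PATTERNS = {
    "EMAIL": re.compile(r"[^\s@]+@[^\s@]+\.[^\s@]+"),
    "SSN": re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b"),  # also catches 123456789
}


def residual_scan(redacted_text: str) -> list[str]:
    """Return one entity label per suspected PII span left in the text."""
    return [
        label
        for label, pattern in RESIDUAL_PATTERNS.items()
        for _ in pattern.finditer(redacted_text)
    ]


def score_document(
    redacted_text: str, redacted_count: int, flag_threshold: float = 0.90
) -> dict:
    """Score a redacted document and decide whether it goes to export or review."""
    residual = residual_scan(redacted_text)
    total = redacted_count + len(residual)
    # Stand-in formula: fraction of detected PII that was redacted.
    score = 1.0 if total == 0 else redacted_count / total
    route = "export" if score >= flag_threshold else "review"
    return {"score": score, "residual": residual, "route": route}
```

With this formula a single residual hit in a document with few redactions drops the score sharply, which errs on the side of sending documents to review.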
This step is what allows you to tell a regulated-industry client: "Every document in your training dataset was verified for PII completeness, and any document that did not meet the quality threshold was reviewed before inclusion."
Step 5: Export Clean, Redacted Data
Choose the appropriate export node based on your downstream use case:
JSONL Exporter — outputs one JSON object per line in the format required by most fine-tuning frameworks. Each record includes the redacted text, document metadata, and the quality score assigned in Step 4.
RAG Exporter — outputs chunked, redacted documents formatted for ingestion into a vector database. Configure chunk size (tokens) and overlap to match your retrieval system's requirements.
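Chunk size and overlap interact as a sliding window: each chunk starts `chunk_size - overlap` tokens after the previous one. The sketch below uses whitespace-separated words as a stand-in for tokenizer tokens, which is an assumption; a real retrieval system would count tokens with its own tokenizer. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()  # whitespace words as a stand-in for model tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

Chunking after redaction, as the pipeline order requires, means a masked label such as [EMAIL] is embedded instead of the original value.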
Both export nodes append a processing log entry for each document, recording: source file path, parser used, PII types detected, redaction method applied, quality score, and export timestamp. This log is the audit trail.
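A minimal sketch of the export step, assuming each processed record already carries its redacted text, metadata, detected PII types, and quality score. The field names and the `export_jsonl` function are illustrative; the exporter's actual schema is not specified in this guide.

```python
import json
from datetime import datetime, timezone
from typing import TextIO


def export_jsonl(
    records: list[dict], out: TextIO, log: TextIO, method: str = "mask"
) -> int:
    """Write one JSON object per record to out and one audit entry to log."""
    for rec in records:
        # Training record: redacted text, metadata, and quality score.
        out.write(json.dumps({
            "text": rec["text"],
            "metadata": rec["metadata"],
            "quality_score": rec["quality_score"],
        }, ensure_ascii=False) + "\n")
        # Audit entry: the fields listed in the paragraph above.
        log.write(json.dumps({
            "source_path": rec["metadata"]["path"],
            "parser": rec["metadata"]["parser"],
            "pii_types": rec["pii_types"],
            "redaction_method": method,
            "quality_score": rec["quality_score"],
            "exported_at": datetime.now(timezone.utc).isoformat(),
        }) + "\n")
    return len(records)
```

Writing the log in the same loop as the data keeps the two files in step: every exported line has exactly one audit entry.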
Comparison: Approaches to PII Redaction
| Criterion | Manual Redaction | Regex Scripts | Cloud Redaction API | Ertas Pipeline |
|---|---|---|---|---|
| Accuracy | Variable — human error | Medium — misses contextual PII | High — but cloud-dependent | High — configurable confidence |
| Speed (10K docs) | Weeks | Hours | Hours | Hours |
| Audit Trail | None (manual) | None (unless logged) | Vendor-held logs | Built-in, exportable |
| On-Premise Deployment | N/A | Yes | No | Yes |
| Scalability | Low | Medium | High (cloud) | High (on-prem) |
The critical column for regulated-industry clients is On-Premise Deployment. A cloud redaction API processes data on vendor servers — for HIPAA-covered data, this requires a Business Associate Agreement and introduces data residency questions. For EU data subject PII, it introduces GDPR cross-border transfer complications.
On-premise execution eliminates both. Data never leaves the client's network perimeter.
Compliance Considerations
GDPR
Under GDPR Article 4, personal data includes any information relating to an identified or identifiable natural person. Article 25 (data protection by design) requires that systems processing personal data implement appropriate technical measures from the outset. A PII redaction pipeline that runs before data enters training is a direct implementation of this principle.
GDPR does not specify a particular redaction method — masking, replacement, and removal all satisfy the requirement if the result is that re-identification is not reasonably possible. The audit trail generated by the pipeline provides evidence of compliance for supervisory authority inquiries.
HIPAA
HIPAA's Safe Harbor de-identification method requires removal of all 18 identifier categories. The PII Redactor node covers all 18 categories when fully configured. Safe Harbor also requires that the covered entity have no actual knowledge that the remaining information could identify an individual; the Quality Scorer's post-redaction check supports that condition by actively verifying that no PHI remains above threshold.
EU AI Act
The EU AI Act's Article 10 requires that training data for high-risk AI systems be subject to appropriate data governance practices, including examination for bias and errors. Data that includes unredacted PII represents both an error (inclusion of data that should not be present) and a bias risk (models may learn associations involving personal characteristics). PII redaction is a direct compliance action under Article 10.
FAQ
Does PII redaction happen before or after parsing?
Redaction happens after parsing. The parser (PDF Parser, Word Parser, etc.) must first extract raw text from the source document before the PII Redactor can detect and remove sensitive information. You cannot run redaction on a binary PDF file — you run it on the text extracted from that file. The pipeline enforces this order: File Import → Parser → PII Redactor → Quality Scorer → Exporter.
Can I customize which PII types are redacted?
Yes. The PII Redactor node has a per-entity-type toggle. You can enable or disable individual categories (EMAIL, PHONE, SSN, etc.) based on the client's regulatory context. For example, a financial services client may require redaction of financial identifiers and SSNs but not IP addresses. A healthcare client will require all 18 HIPAA PHI categories. The configuration is saved as part of the pipeline template, so you can maintain client-specific templates for different regulatory contexts.
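Client-specific templates can be pictured as named sets of per-entity toggles. The sketch below uses only the nine entity types listed in Step 3, so it does not enumerate all 18 HIPAA categories; the template structure and the `make_template` helper are hypothetical and not the product's saved format.

```python
ALL_ENTITY_TYPES = (
    "EMAIL", "PHONE", "SSN", "ADDRESS", "MEDICAL_ID",
    "FINANCIAL", "PERSON_NAME", "DATE_OF_BIRTH", "IP_ADDRESS",
)


def make_template(
    name: str,
    enabled: set[str],
    method: str = "mask",
    confidence_threshold: float = 0.85,
) -> dict:
    """Build a reusable template with an on/off toggle per entity type."""
    unknown = enabled - set(ALL_ENTITY_TYPES)
    if unknown:
        raise ValueError(f"unknown entity types: {sorted(unknown)}")
    return {
        "name": name,
        "entities": {t: t in enabled for t in ALL_ENTITY_TYPES},
        "method": method,
        "confidence_threshold": confidence_threshold,
    }


# Example templates mirroring the two clients described above.
FINANCIAL_SERVICES = make_template("financial-services", {"FINANCIAL", "SSN"})
HEALTHCARE = make_template("healthcare", set(ALL_ENTITY_TYPES))
```

Validating entity names when the template is built catches typos early, before a misspelled toggle silently disables a category.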
Is the redaction logged for audit purposes?
Yes. Every document processed through the pipeline generates a log entry recording: the source file path, which PII types were detected, the redaction method applied, the confidence scores for each detection, the quality score assigned by the Quality Scorer, and the timestamp. The complete pipeline run log is exportable as JSON or CSV. This log is the primary evidence artifact for compliance audits.
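For a sense of what a JSON or CSV export of such a log involves, the sketch below serializes a list of log entries both ways using the standard library. The entry fields are examples drawn from the description above, and flattening list values with ";" for CSV is an assumption of this sketch.

```python
import csv
import io
import json


def log_to_json(entries: list[dict]) -> str:
    """Serialize the run log as a JSON array."""
    return json.dumps(entries, indent=2)


def log_to_csv(entries: list[dict]) -> str:
    """Serialize the run log as CSV, joining list values with ';'."""
    if not entries:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(entries[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    for entry in entries:
        writer.writerow({
            key: ";".join(value) if isinstance(value, list) else value
            for key, value in entry.items()
        })
    return buffer.getvalue()
```

CSV is convenient for auditors working in spreadsheets, while JSON preserves nested values such as per-detection confidence scores without flattening.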
Does this work on scanned PDFs?
Yes. The PDF Parser node uses OCR for scanned documents. For scanned PDFs, OCR is applied first to extract machine-readable text, which then flows into the PII Redactor. OCR-extracted text carries a confidence score; documents where OCR confidence falls below threshold are flagged by the Quality Scorer. In practice, clean black-and-white scans process well; low-quality or heavily annotated scans may require manual review for a subset of pages.