A PII redaction pipeline is an automated data processing workflow that detects and removes personally identifiable information from documents before that data enters an AI training dataset or retrieval-augmented generation (RAG) system. It matters because AI models trained on unredacted data can memorize and reproduce PII, which can put both the service provider and the end client in breach of GDPR, HIPAA, or the EU AI Act.

PII Types: What Needs to Be Redacted

Not all PII carries the same regulatory weight. The table below maps common PII types to regulatory frameworks and gives concrete examples of what detection must cover.

PII Type	Examples	Regulatory Reference
Email addresses	user@example.com, firstname.lastname@corp.org	GDPR Art. 4, HIPAA Safe Harbor
Phone numbers	+1-555-867-5309, (800) 555-0100, international formats	GDPR Art. 4, HIPAA Safe Harbor
Social Security Numbers	123-45-6789, 123456789	HIPAA Safe Harbor, US state privacy laws
Street addresses	123 Main St, Apt 4B, City, State ZIP	GDPR Art. 4, HIPAA Safe Harbor
Medical record IDs	MRN-00123456, patient ID formats	HIPAA Safe Harbor (18 identifiers)
Financial identifiers	Credit card numbers, IBAN, account numbers	PCI DSS, GDPR Art. 4
Names	Full names in context, combined with other data	GDPR Art. 4 (contextual)
IP addresses	192.168.1.1, IPv6 addresses	GDPR Art. 4(1) and Recital 30 (online identifier), HIPAA Safe Harbor
Dates of birth	01/15/1985, January 15, 1985	HIPAA Safe Harbor

For healthcare data specifically, HIPAA's Safe Harbor method requires removal of all 18 categories of identifiers from protected health information (PHI) before data can be considered de-identified. For EU data subjects, data falls outside GDPR only once it is anonymized so that re-identification is not reasonably likely; pseudonymized data remains personal data and stays in scope.

Step-by-Step: Building the PII Redaction Pipeline

The following steps use Ertas Data Suite node names directly. Each step corresponds to one or more nodes in the pipeline canvas.

Step 1: File Import Node — Load Source Documents

Configure the File Import node to point to your source document directory. For enterprise engagements, this is typically a network share, a mounted drive on the client's system, or a local folder.

Key settings:

Source path: Directory containing raw documents
Recursive scan: Enable to process subdirectories
File type filter: Set to the formats present in the client's archive (PDF, DOCX, XLSX, TXT)
Batch size: Configure based on available memory — 500–1000 documents per batch is typical for mixed PDF/Word archives

The File Import node queues documents for downstream processing and passes file metadata (path, name, size, type) alongside raw content.
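The settings above can be approximated in plain Python to show what the import stage does. This is a minimal sketch, not the Ertas implementation: the names `discover_files`, `batched`, and `ALLOWED_SUFFIXES` are hypothetical, and only the standard library is used.

```python
from itertools import islice
from pathlib import Path
from typing import Iterator

# Assumed file type filter, mirroring the formats named in the settings list.
ALLOWED_SUFFIXES = {".pdf", ".docx", ".xlsx", ".txt"}


def discover_files(source_dir: Path, recursive: bool = True) -> Iterator[dict]:
    """Yield path, name, size, and type for each file that passes the filter."""
    pattern = "**/*" if recursive else "*"
    for path in sorted(source_dir.glob(pattern)):
        if path.is_file() and path.suffix.lower() in ALLOWED_SUFFIXES:
            yield {
                "path": str(path),
                "name": path.name,
                "size": path.stat().st_size,
                "type": path.suffix.lower().lstrip("."),
            }


def batched(records: Iterator, batch_size: int) -> Iterator[list]:
    """Group records into lists of at most batch_size items."""
    iterator = iter(records)
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

Because both functions are generators, a call such as `batched(discover_files(Path("/data/archive")), 500)` holds only one batch of metadata in memory at a time, which is the point of the batch size setting.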

Step 2: Parse the Documents

Route each file to the appropriate parser node based on type:

PDF Parser (Docling integration) — handles native PDFs with embedded text and scanned PDFs via OCR. Layout-aware extraction preserves table structure and multi-column layouts. For scanned documents, configure OCR confidence threshold — records below threshold are flagged by the Quality Scorer in Step 4.

Word Parser — extracts text from .docx files, preserving section structure and header/footer content where present.

Excel Parser — handles .xlsx files, flattening spreadsheet data into row-level text records. Cell references are resolved before PII detection.

After parsing, all documents enter the pipeline as structured text records regardless of their original format.
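The routing step can be pictured as a dispatch table keyed on file extension. The sketch below is illustrative only: `TextRecord`, `route`, and `flatten_rows` are hypothetical names, and the PDF, Word, and Excel parsers are left as placeholders because the real nodes depend on components not shown here.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class TextRecord:
    source_path: str
    parser: str
    text: str


def parse_txt(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="replace")


def parse_placeholder(path: Path) -> str:
    # Placeholder: a real pipeline would call its PDF, Word, or Excel parser here.
    raise NotImplementedError(f"no parser wired up for {path.suffix}")


# Extension -> (parser name, parser function). The names are assumptions.
PARSERS: dict[str, tuple[str, Callable[[Path], str]]] = {
    ".txt": ("text", parse_txt),
    ".pdf": ("pdf", parse_placeholder),
    ".docx": ("word", parse_placeholder),
    ".xlsx": ("excel", parse_placeholder),
}


def route(path: Path) -> TextRecord:
    """Pick a parser by file extension and return a structured text record."""
    try:
        name, parser = PARSERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported file type: {path.suffix!r}") from None
    return TextRecord(str(path), name, parser(path))


def flatten_rows(header: list[str], rows: list[list[object]]) -> list[str]:
    """Turn each spreadsheet row into one 'column: value' text record."""
    return [
        "; ".join(f"{col}: {val}" for col, val in zip(header, row) if val is not None)
        for row in rows
    ]
```

For example, `flatten_rows(["name", "email"], [["Ann", "ann@example.com"]])` returns `["name: Ann; email: ann@example.com"]`, which gives the PII detector one self-describing line per row.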

Step 3: PII Redactor Node — Configure Entity Types and Redaction Method

The PII Redactor node is the core of the pipeline. Configure it for the specific client engagement:

Entity types to detect — select from the available categories:

EMAIL — email addresses
PHONE — phone numbers (US and international formats)
SSN — Social Security Numbers
ADDRESS — street addresses
MEDICAL_ID — medical record numbers and patient identifiers
FINANCIAL — credit card numbers, IBAN, bank account numbers
PERSON_NAME — full names (contextual detection)
DATE_OF_BIRTH — birth dates in common formats
IP_ADDRESS — IPv4 and IPv6 addresses
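To make the entity categories concrete, here is a rough pattern-based detector for four of them. It is a sketch under stated assumptions, not how the PII Redactor node works internally: the patterns are simplified, the names `PATTERNS`, `Detection`, and `detect` are hypothetical, and contextual types such as PERSON_NAME or ADDRESS need more than regular expressions.

```python
import ipaddress
import re
from dataclasses import dataclass
from typing import Iterable

# Simplified patterns (assumptions). The SSN pattern only covers the dashed form.
PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "FINANCIAL": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
}


@dataclass(frozen=True)
class Detection:
    entity: str
    start: int
    end: int
    text: str


def luhn_ok(number: str) -> bool:
    """Luhn checksum, used to discard digit runs that cannot be card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        checksum += d
    return checksum % 10 == 0


def _valid(entity: str, text: str) -> bool:
    if entity == "IP_ADDRESS":
        try:
            ipaddress.ip_address(text)
        except ValueError:
            return False
    if entity == "FINANCIAL":
        return luhn_ok(text)
    return True


def detect(text: str, enabled: Iterable[str]) -> list[Detection]:
    """Return validated matches for the enabled entity types, in text order."""
    found = [
        Detection(entity, m.start(), m.end(), m.group())
        for entity in enabled
        for m in PATTERNS[entity].finditer(text)
        if _valid(entity, m.group())
    ]
    return sorted(found, key=lambda d: d.start)
```

The validation step matters as much as the pattern: `999.1.1.1` matches the IP pattern but is rejected by `ipaddress`, and a 16-digit run that fails the Luhn check is not reported as a card number.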

Redaction method — three options:

Mask: Replace detected PII with a label (e.g., [EMAIL], [PHONE]). Preserves the document structure and makes clear where redaction occurred. Recommended for training data where token count matters.
Replace: Substitute detected PII with synthetic placeholders (e.g., user@example.com becomes contact@company.net). Useful when downstream models need realistic-looking examples.
Remove: Delete the detected PII and surrounding context entirely. Most aggressive; use for highest-sensitivity data.

Confidence threshold — set the minimum detection confidence (default 0.85). Records where PII is detected below this threshold are flagged for human review rather than automatically redacted.
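The three redaction methods and the confidence threshold can be sketched as a single function over a list of detections. Everything here is illustrative: `Detection`, `redact`, and the `SYNTHETIC` table are hypothetical names, and the Remove branch deletes only the matched span, not the surrounding context that the node is described as removing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    entity: str
    start: int
    end: int
    confidence: float


# Assumed synthetic stand-ins for the Replace method.
SYNTHETIC = {"EMAIL": "contact@company.net", "PHONE": "+1-555-0100"}
METHODS = {"mask", "replace", "remove"}


def redact(
    text: str,
    detections: list[Detection],
    method: str = "mask",
    threshold: float = 0.85,
) -> tuple[str, list[Detection]]:
    """Redact confident detections; return the text plus detections held for review."""
    if method not in METHODS:
        raise ValueError(f"unknown redaction method: {method!r}")
    flagged = [d for d in detections if d.confidence < threshold]
    confident = [d for d in detections if d.confidence >= threshold]
    # Work right to left so earlier offsets stay valid after each substitution.
    for d in sorted(confident, key=lambda d: d.start, reverse=True):
        if method == "mask":
            substitute = f"[{d.entity}]"
        elif method == "replace":
            substitute = SYNTHETIC.get(d.entity, f"[{d.entity}]")
        else:  # "remove": simplified to deleting the span only
            substitute = ""
        text = text[: d.start] + substitute + text[d.end :]
    return text, flagged
```

With an email detected at confidence 0.99 and a phone number at 0.60, `redact(..., method="mask")` masks the email, leaves the phone number in place, and returns the phone detection in the flagged list for human review.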

Step 4: Quality Scorer — Verify Redaction Completeness

The Quality Scorer node runs a post-redaction check on each processed document:

Residual PII scan: Re-runs detection at a lower confidence threshold to catch any PII the primary redaction may have missed
Completeness score: Calculates a per-document quality score (0–1.0) based on detection confidence, coverage, and any flagged anomalies
Flag threshold: Documents below the configured score (default 0.90) are routed to a review queue rather than the export step

Documents that pass the Quality Scorer proceed to export. Documents that fail are logged with their specific failure reason and held for human review or re-processing.
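The scoring formula is not spelled out above, so the following sketch invents one purely for illustration: the score is the mean confidence of the applied redactions minus a fixed penalty for every residual hit. The looser patterns stand in for "detection at a lower confidence threshold". The names `score_document`, `QualityResult`, `RESIDUAL_PATTERNS`, and `RESIDUAL_PENALTY` are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from statistics import mean

# Deliberately loose patterns, standing in for a lower-confidence second pass.
RESIDUAL_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b"),
}
RESIDUAL_PENALTY = 0.25  # assumed penalty per residual hit


@dataclass
class QualityResult:
    score: float
    route: str
    reasons: list[str] = field(default_factory=list)


def score_document(
    redacted_text: str,
    confidences: list[float],
    flag_threshold: float = 0.90,
) -> QualityResult:
    """Re-scan redacted text and route the document to export or review."""
    residual = {
        name: len(pattern.findall(redacted_text))
        for name, pattern in RESIDUAL_PATTERNS.items()
    }
    base = mean(confidences) if confidences else 1.0
    score = max(0.0, base - RESIDUAL_PENALTY * sum(residual.values()))
    reasons = [f"residual {name} x{n}" for name, n in residual.items() if n]
    route = "export" if score >= flag_threshold else "review"
    if route == "review" and not reasons:
        reasons.append("low detection confidence")
    return QualityResult(round(score, 3), route, reasons)
```

With the default threshold of 0.90 and a penalty of 0.25, a single residual hit is enough to send a document to review, and the `reasons` list carries the specific failure reason that the review queue needs.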

This step is what allows you to tell a regulated-industry client: "Every document in your training dataset was verified for PII completeness, and any document that did not meet the quality threshold was reviewed before inclusion."

Step 5: Export Clean, Redacted Data

Choose the appropriate export node based on your downstream use case:

JSONL Exporter — outputs one JSON object per line in the format required by most fine-tuning frameworks. Each record includes the redacted text, document metadata, and the quality score assigned in Step 4.

RAG Exporter — outputs chunked, redacted documents formatted for ingestion into a vector database. Configure chunk size (tokens) and overlap to match your retrieval system's requirements.

Both export nodes append a processing log entry for each document, recording: source file path, parser used, PII types detected, redaction method applied, quality score, and export timestamp. This log is the audit trail.
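Both export shapes are simple to sketch. The example below assumes whitespace-separated words as a stand-in for tokens (a real exporter would count tokens with the target model's tokenizer), and `chunk_tokens` and `to_jsonl` are hypothetical helper names.

```python
import json


def chunk_tokens(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of chunk_size tokens, each sharing overlap tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

For instance, `chunk_tokens("a b c d e f g", chunk_size=4, overlap=2)` returns `["a b c d", "c d e f", "e f g"]`. Chunking runs on text that has already been redacted, so no chunk can reintroduce PII that the earlier steps removed.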

Comparison: Approaches to PII Redaction

Criterion	Manual Redaction	Regex Scripts	Cloud Redaction API	Ertas Pipeline
Accuracy	Variable — human error	Medium — misses contextual PII	High — but cloud-dependent	High — configurable confidence
Speed (10K docs)	Weeks	Hours	Hours	Hours
Audit Trail	None (manual)	None (unless logged)	Vendor-held logs	Built-in, exportable
On-Premise Deployment	N/A	Yes	No	Yes
Scalability	Low	Medium	High (cloud)	High (on-prem)

The critical column for regulated-industry clients is On-Premise Deployment. A cloud redaction API processes data on vendor servers — for HIPAA-covered data, this requires a Business Associate Agreement and introduces data residency questions. For EU data subject PII, it introduces GDPR cross-border transfer complications.

On-premise execution eliminates both. Data never leaves the client's network perimeter.

Compliance Considerations

GDPR

Under GDPR Article 4, personal data includes any information relating to an identified or identifiable natural person. Article 25 (data protection by design) requires that systems processing personal data implement appropriate technical measures from the outset. A PII redaction pipeline that runs before data enters training is a direct implementation of this principle.

GDPR does not specify a particular redaction method — masking, replacement, and removal all satisfy the requirement if the result is that re-identification is not reasonably possible. The audit trail generated by the pipeline provides evidence of compliance for supervisory authority inquiries.

HIPAA

HIPAA's Safe Harbor de-identification method requires removal of all 18 identifier categories. The PII Redactor node covers all 18 categories when fully configured. Safe Harbor also requires that the covered entity have no actual knowledge that the remaining information could identify an individual. The Quality Scorer's post-redaction check supports that requirement: the processing system actively verifies that no PHI remains above threshold.

EU AI Act

The EU AI Act's Article 10 requires that training data for high-risk AI systems be subject to appropriate data governance practices, including examination for bias and errors. Data that includes unredacted PII represents both an error (inclusion of data that should not be present) and a bias risk (models may learn associations involving personal characteristics). PII redaction therefore supports the data governance practices that Article 10 calls for.

FAQ

Does PII redaction happen before or after parsing?

Redaction happens after parsing. The parser (PDF Parser, Word Parser, etc.) must first extract raw text from the source document before the PII Redactor can detect and remove sensitive information. You cannot run redaction on a binary PDF file — you run it on the text extracted from that file. The pipeline enforces this order: File Import → Parser → PII Redactor → Quality Scorer → Exporter.

Can I customize which PII types are redacted?

Yes. The PII Redactor node has a per-entity-type toggle. You can enable or disable individual categories (EMAIL, PHONE, SSN, etc.) based on the client's regulatory context. For example, a financial services client may require redaction of financial identifiers and SSNs but not IP addresses. A healthcare client will require all 18 HIPAA Safe Harbor identifier categories. The configuration is saved as part of the pipeline template, so you can maintain client-specific templates for different regulatory contexts.
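A client-specific template can be thought of as a small, serializable configuration object. The sketch below is an assumption about shape only: `RedactorTemplate` and `financial_services_template` are hypothetical names, and the actual template format is not shown in this article. The financial services preset is modeled here as everything enabled except IP addresses, which is one reading of the example above.

```python
import json
from dataclasses import asdict, dataclass, field

ALL_ENTITIES = (
    "EMAIL", "PHONE", "SSN", "ADDRESS", "MEDICAL_ID",
    "FINANCIAL", "PERSON_NAME", "DATE_OF_BIRTH", "IP_ADDRESS",
)


@dataclass
class RedactorTemplate:
    client: str
    enabled: dict[str, bool] = field(
        default_factory=lambda: {entity: True for entity in ALL_ENTITIES}
    )
    method: str = "mask"
    confidence_threshold: float = 0.85

    def active_entities(self) -> list[str]:
        """Entity types that are switched on, in canonical order."""
        return [e for e in ALL_ENTITIES if self.enabled.get(e, False)]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, payload: str) -> "RedactorTemplate":
        return cls(**json.loads(payload))


def financial_services_template(client: str) -> RedactorTemplate:
    """Assumed preset: all entity types on except IP addresses."""
    template = RedactorTemplate(client)
    template.enabled["IP_ADDRESS"] = False
    return template
```

Storing the template as JSON makes it easy to keep one file per client under version control and to show an auditor exactly which categories were enabled for a given run.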

Is the redaction logged for audit purposes?

Yes. Every document processed through the pipeline generates a log entry recording: the source file path, which PII types were detected, the redaction method applied, the confidence scores for each detection, the quality score assigned by the Quality Scorer, and the timestamp. The complete pipeline run log is exportable as JSON or CSV. This log is the primary evidence artifact for compliance audits.
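As an illustration of the CSV form of that log, the sketch below flattens one entry per document into a row. The column names in `FIELDS` and the helper `log_to_csv` are hypothetical; they follow the fields listed in the answer above, not a documented export schema.

```python
import csv
import io

# Assumed column names, taken from the fields listed in the answer above.
FIELDS = [
    "source_path", "pii_types", "method",
    "confidences", "quality_score", "timestamp",
]


def log_to_csv(entries: list[dict]) -> str:
    """Render log entries as CSV, joining list-valued fields with ';'."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for entry in entries:
        row = dict(entry)
        row["pii_types"] = ";".join(entry["pii_types"])
        row["confidences"] = ";".join(f"{c:.2f}" for c in entry["confidences"])
        writer.writerow(row)
    return buffer.getvalue()
```

The log records which PII types were found and how confidently, never the detected values themselves, so the audit artifact does not become a second copy of the sensitive data.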

Does this work on scanned PDFs?

Yes. For scanned PDFs, the PDF Parser node first applies OCR to extract machine-readable text, which then flows into the PII Redactor. OCR-extracted text carries a confidence score; documents where OCR confidence falls below threshold are flagged by the Quality Scorer. In practice, clean black-and-white scans process well; low-quality or heavily annotated scans may require manual review for a subset of pages.

Building a PII Redaction Pipeline for AI-Ready Training Data