
On-Premise PII and PHI Redaction Workflows for Multi-Industry Service Providers
Technical guide to building on-premise PII/PHI redaction pipelines that handle healthcare, legal, financial, and government data without cloud dependencies.
Before training data can be used, sensitive information must be removed. This is not a best practice — it is a legal requirement under HIPAA, GDPR, and most data processing agreements. For service providers working across multiple industries, the challenge is that what counts as "sensitive" varies by industry, and the acceptable redaction methods vary by regulation.
A healthcare client needs PHI redacted per HIPAA Safe Harbor. A legal client needs attorney-client privileged information protected. A financial client needs account numbers and SSNs removed. A government client needs classified indicators stripped. And all of them expect the redaction to happen on-premise, because sending their data to a cloud API for entity detection is exactly the kind of data exposure they hired you to prevent.
This guide covers the technical approaches to building on-premise PII/PHI redaction workflows that handle multi-industry requirements without cloud dependencies.
PII vs. PHI: What Each Industry Requires You to Redact
PII (Personally Identifiable Information)
PII is any information that can identify a specific individual. Under GDPR, the definition is broad — any data "relating to an identified or identifiable natural person." Under U.S. regulations, the definition varies by context but generally includes:
- Full names
- Social Security numbers
- Driver's license numbers
- Email addresses
- Phone numbers
- Physical addresses
- Date of birth
- Biometric identifiers
- Financial account numbers
PHI (Protected Health Information)
PHI is a HIPAA-specific category that includes PII plus health-related data. HIPAA's Safe Harbor method specifies 18 identifier types that must be removed for data to be considered de-identified:
| # | Identifier | Example |
|---|---|---|
| 1 | Names | Patient full names |
| 2 | Geographic data | Addresses; ZIP codes (first 3 digits may be retained only if the combined 3-digit ZIP area contains more than 20,000 people) |
| 3 | Dates | All date elements except year; for ages over 89, the year too (may be aggregated to "90+") |
| 4 | Phone numbers | All phone numbers |
| 5 | Fax numbers | All fax numbers |
| 6 | Email addresses | All email addresses |
| 7 | SSN | Social Security numbers |
| 8 | MRN | Medical record numbers |
| 9 | Health plan numbers | Insurance beneficiary numbers |
| 10 | Account numbers | Financial account numbers |
| 11 | Certificate/license numbers | Professional licenses |
| 12 | Vehicle identifiers | License plates, VINs |
| 13 | Device identifiers | Serial numbers, UDIs |
| 14 | URLs | Web addresses |
| 15 | IP addresses | Network addresses |
| 16 | Biometric identifiers | Fingerprints, voiceprints |
| 17 | Photographs | Full-face photos |
| 18 | Any other unique identifier | Catch-all for unique IDs |
Industry-Specific Sensitive Entities
Beyond standard PII/PHI, each industry has domain-specific sensitive data:
| Industry | Additional Sensitive Entities |
|---|---|
| Healthcare | Diagnosis codes, medication names tied to patients, treatment dates, physician-patient communications |
| Legal | Case numbers, opposing party names, settlement amounts, privileged communications, judge names in sealed cases |
| Finance | Account numbers, routing numbers, transaction amounts tied to identifiable accounts, credit scores, loan terms |
| Government | Clearance levels, classified program names, facility codes, personnel identifiers |
| Construction | Bid amounts, proprietary specifications, subcontractor pricing, site access credentials |
On-Premise Redaction Approaches
All redaction must happen locally. No data can be sent to external APIs for entity detection. Here are the four primary approaches, with trade-offs.
1. Regex Pattern Matching
The simplest and most predictable approach. Define patterns for known entity formats and replace matches.
Strengths: Deterministic, fast, no model dependencies, works in air-gapped environments, and catches every occurrence that strictly matches a well-defined pattern.
Weaknesses: Only catches entities with predictable formats. Cannot detect names, unformatted addresses, or context-dependent entities. High false positive rate for short patterns (e.g., 6-digit numbers matching both MRNs and page numbers).
Best for: SSNs (\d{3}-\d{2}-\d{4}), phone numbers, email addresses, account numbers with known formats, dates in standard formats.
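A minimal sketch of the regex layer, assuming U.S.-formatted entities; the pattern names and formats here are illustrative, and production patterns need more variants (SSNs without dashes, international phone formats, and so on):

```python
import re

# Illustrative patterns for US-formatted entities only.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def redact_with_regex(text: str) -> str:
    """Replace each regex match with a masking token like [SSN]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_with_regex("Call 555-867-5309 re: claim for SSN 123-45-6789."))
# -> "Call [PHONE] re: claim for SSN [SSN]."
```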
2. Local NER Models
Named Entity Recognition models run locally to detect entities like person names, organizations, and locations. Models like spaCy's en_core_web_trf, Flair NER, or fine-tuned BERT variants can run entirely on-premise.
Strengths: Detects entities without predictable formats (names, organizations). Can be fine-tuned for domain-specific entities. No cloud dependency.
Weaknesses: Requires GPU for reasonable throughput on transformer models. Accuracy varies by domain — a general NER model trained on news articles will underperform on clinical notes. Requires model download and local deployment.
Best for: Person names, organization names, location names, and other entities that lack consistent formatting.
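A sketch of the NER layer using spaCy, assuming the en_core_web_trf model has been downloaded and shipped into the environment ahead of time (the label set is illustrative and should be tuned per client):

```python
import spacy

# Model must be pre-downloaded (python -m spacy download en_core_web_trf)
# and copied into the air-gapped environment; no network calls at runtime.
nlp = spacy.load("en_core_web_trf")

# Entity labels treated as sensitive; adjust per domain.
SENSITIVE_LABELS = {"PERSON", "ORG", "GPE", "LOC"}

def redact_with_ner(text: str) -> str:
    """Mask detected entities, replacing right-to-left so character
    offsets of earlier entities stay valid during substitution."""
    doc = nlp(text)
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in SENSITIVE_LABELS:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return text

print(redact_with_ner("Dr. Alice Morgan of Mercy General reviewed the chart."))
```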
3. Local LLM-Based Detection
Running a local language model (e.g., Llama 3.1 8B, Qwen 2.5 7B) with a PII detection prompt. The model reads each text segment and identifies sensitive entities.
Strengths: Handles context-dependent detection (e.g., "Dr. Smith" as a provider name vs. "Smith & Wesson" as a product). Can detect novel entity types with prompt changes. Can handle multiple entity types in a single pass.
Weaknesses: Slower than regex or NER. Non-deterministic — different runs may produce different results. Requires significant compute (8B+ model needs 6-16 GB VRAM). Requires pre-loaded model weights in air-gapped environments.
Best for: Complex or ambiguous entities, context-dependent detection, cross-domain redaction where you need flexibility.
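One way to wire this up, assuming a local model is already being served behind an OpenAI-compatible endpoint (for example llama.cpp's server or vLLM on localhost); the endpoint, model name, and prompt below are illustrative assumptions, not a fixed API:

```python
import json
import requests

# Assumes a locally hosted, OpenAI-compatible server; no data leaves the box.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

PROMPT = (
    "List every piece of personally identifiable or health-related "
    "information in the text below. Respond with a JSON array of objects "
    'with keys "entity" (the exact text) and "type".\n\nText:\n{text}'
)

def detect_with_llm(text: str) -> list[dict]:
    """Ask the local model to enumerate sensitive entities in `text`."""
    resp = requests.post(ENDPOINT, json={
        "model": "llama-3.1-8b-instruct",  # illustrative local model name
        "messages": [{"role": "user", "content": PROMPT.format(text=text)}],
        "temperature": 0,  # reduces (but does not eliminate) run-to-run variance
    }, timeout=120)
    resp.raise_for_status()
    content = resp.json()["choices"][0]["message"]["content"]
    return json.loads(content)  # production code needs more robust parsing
```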
4. Dictionary-Based Matching
Maintain curated dictionaries of known sensitive values (physician names, facility names, approved drug lists) and match against them.
Strengths: High precision for known entities. Fast. Fully deterministic.
Weaknesses: Only catches entities in the dictionary. Requires maintenance. Cannot detect entities not previously cataloged.
Best for: Known entity lists (staff names, facility codes, client company names), supplementing other approaches.
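A simple sketch that compiles a client-provided term list into one case-insensitive pattern; for very large dictionaries, an Aho-Corasick automaton (e.g., the pyahocorasick package) scales better:

```python
import re

def build_dictionary_redactor(terms: list[str], token: str = "[KNOWN_ENTITY]"):
    """Compile known sensitive values into one word-boundary pattern.
    Longer terms are tried first so 'Mercy General Hospital' wins over
    'Mercy General'."""
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)
    return lambda text: pattern.sub(token, text)

redact_known = build_dictionary_redactor(["Mercy General Hospital", "Dr. Alice Morgan"])
print(redact_known("Transferred from Mercy General Hospital on admission."))
```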
Recommended Multi-Layer Approach
No single method is sufficient for production-grade redaction. The practical approach is layered:
- Regex layer: Catch all format-predictable entities (SSNs, phones, emails, dates, account numbers)
- Dictionary layer: Catch all known entities from client-provided lists
- NER model layer: Catch names, organizations, and locations that the regex missed
- Validation pass: Human review of a statistical sample to measure redaction completeness
The order matters. Running regex and dictionary matching first reduces the load on the NER model and provides a baseline that the model only needs to supplement.
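A sketch of how the layers compose, reusing the illustrative functions from the sections above; the point is that the deterministic layers run first and the NER model only sees what survives them:

```python
def redact_record(text: str) -> str:
    """Layered redaction: deterministic passes first, model pass last."""
    text = redact_with_regex(text)  # layer 1: format-predictable entities
    text = redact_known(text)       # layer 2: client-provided dictionaries
    text = redact_with_ner(text)    # layer 3: names/orgs the first two missed
    return text
```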
Replacement Strategies
How you replace detected entities affects both compliance and data utility.
Masking
Replace the entity with a generic token: [NAME], [SSN], [DATE].
Pros: Simple, preserves text structure, clearly indicates where entities were removed. Cons: Destroys the entity content itself, which may be useful for model training. Multiple entities of the same type are indistinguishable.
Pseudonymization
Replace entities with realistic but fake values: "John Smith" → "Robert Chen", "555-12-3456" → "555-98-7654".
Pros: Preserves semantic structure. Training data retains the "shape" of real entities, which can improve model performance on downstream tasks. Under GDPR, pseudonymization is an explicitly recognized safeguard (Articles 25 and 32), though pseudonymized data still counts as personal data. Cons: Requires a mapping table (which is itself sensitive). Risk of collision with real values.
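A sketch of consistent pseudonymization using the Faker library (assuming it is installed locally); the mapping table keeps the same real name mapped to the same fake name across a dataset, and the table itself must be protected like raw data:

```python
from faker import Faker

fake = Faker()
Faker.seed(42)  # deterministic output for reproducible pipelines

# Original -> pseudonym mapping; this table is itself sensitive and must
# live under the same access controls as the raw data (or be discarded).
_pseudonyms: dict[str, str] = {}

def pseudonymize_name(real_name: str) -> str:
    """Map each real name to one stable fake name across the dataset."""
    if real_name not in _pseudonyms:
        _pseudonyms[real_name] = fake.name()
    return _pseudonyms[real_name]

print(pseudonymize_name("John Smith"))  # e.g. "Allison Hill"
print(pseudonymize_name("John Smith"))  # same pseudonym on repeat
```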
Removal
Delete the entity entirely, leaving no trace.
Pros: Maximum protection. No residual information. Cons: Destroys text structure; deletion leaves incoherent sentence fragments. Poor for training data quality.
Industry-Specific Recommendations
| Industry | Recommended Strategy | Reasoning |
|---|---|---|
| Healthcare | Pseudonymization or masking | HIPAA Safe Harbor requires removal of identifiers, but pseudonymization preserves clinical context |
| Legal | Masking | Privileged content must be clearly indicated as redacted |
| Finance | Masking | Account numbers replaced with [ACCOUNT] preserves transaction structure |
| Government | Removal or masking | Classified indicators must leave no residual information |
Validating Redaction Completeness
Redaction is only as good as its verification. A pipeline that claims to remove PII but misses 3% of names is worse than no redaction at all — it creates a false sense of compliance.
Statistical Sampling
Manually review a random sample of redacted records. Industry practice is 5-10% of records, with a higher sample rate for the first batch from a new data source.
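A minimal, reproducible way to draw that sample (the rate and seed are illustrative defaults):

```python
import random

def draw_review_sample(record_ids: list[str], rate: float = 0.05,
                       seed: int = 1) -> list[str]:
    """Draw a seeded random sample of redacted records for manual review;
    use a higher rate for the first batch from a new data source."""
    rng = random.Random(seed)
    k = max(1, round(len(record_ids) * rate))
    return rng.sample(record_ids, k)
```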
Known Entity Injection
Inject records with known PII patterns before redaction, then verify they were all caught. This provides a measurable detection rate.
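A sketch of the measurement step: plant synthetic "canary" entities before redaction, then count how many survive to the output (the planted values below are illustrative):

```python
# Synthetic entities planted into records before redaction; none of
# these should appear in the output if the pipeline is working.
PLANTED = ["987-65-4321", "canary.patient@example.com", "Zebediah Q. Plantsman"]

def detection_rate(redacted_texts: list[str]) -> float:
    """Fraction of planted entities that were successfully removed."""
    leaked = {p for p in PLANTED if any(p in t for t in redacted_texts)}
    return 1 - len(leaked) / len(PLANTED)
```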
Cross-Method Validation
Run a second, independent detection method on the redacted output. If method B finds entities that method A missed, the pipeline has a gap.
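As a sketch, the NER detector from earlier can serve as method B: run it over output that the other layers already redacted, and treat any hit as a potential gap to investigate (this reuses the illustrative nlp pipeline and SENSITIVE_LABELS set from above):

```python
def find_residual_entities(redacted_text: str) -> list[tuple[str, str]]:
    """Run an independent detector over already-redacted output;
    any entity it still finds is a potential gap in the first method."""
    doc = nlp(redacted_text)
    return [(ent.text, ent.label_) for ent in doc.ents
            if ent.label_ in SENSITIVE_LABELS]
```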
Redaction Audit Report
Document the validation results: sample size, detection rate, entity types tested, false positive rate, false negative rate. This report becomes part of your deliverable to the client.
Integrated Redaction in Practice
Building a multi-layer redaction pipeline from scratch — regex, NER, dictionaries, validation, logging — is 60-120 hours of engineering work, plus ongoing maintenance for each new client industry.
Ertas Data Suite includes PII/PHI redaction as a built-in capability within its Clean module. It runs entirely on-premise with no cloud dependencies, supports configurable entity types per industry, and logs every redaction event (entity type, location, replacement method, operator ID, timestamp) to the unified audit trail. The redaction log is exportable as part of the compliance documentation package.
Conclusion
PII/PHI redaction is the gate between raw client data and usable training data. For multi-industry service providers, the challenge is not just detecting entities — it is handling the varying requirements across healthcare, legal, finance, and government clients, all while running entirely on-premise and producing the audit evidence that proves the redaction was thorough.
Get this step wrong, and everything downstream — the labels, the model, the deployment — inherits the compliance risk.