
On-Premise PII and PHI Redaction Workflows for Multi-Industry Service Providers
Technical guide to building on-premise PII/PHI redaction pipelines that handle healthcare, legal, financial, and government data without cloud dependencies.
Before training data can be used, sensitive information must be removed. This is not a best practice — it is a legal requirement under HIPAA, GDPR, and most data processing agreements. For service providers working across multiple industries, the challenge is that what counts as "sensitive" varies by industry, and the acceptable redaction methods vary by regulation.
A healthcare client needs PHI redacted per HIPAA Safe Harbor. A legal client needs attorney-client privileged information protected. A financial client needs account numbers and SSNs removed. A government client needs classified indicators stripped. And all of them expect the redaction to happen on-premise, because sending their data to a cloud API for entity detection is exactly the kind of data exposure they hired you to prevent.
This guide covers the technical approaches to building on-premise PII/PHI redaction workflows that handle multi-industry requirements without cloud dependencies.
PII vs. PHI: What Each Industry Requires You to Redact
PII (Personally Identifiable Information)
PII is any information that can identify a specific individual. Under GDPR, the definition is broad — any data "relating to an identified or identifiable natural person." Under U.S. regulations, the definition varies by context but generally includes:
- Full names
- Social Security numbers
- Driver's license numbers
- Email addresses
- Phone numbers
- Physical addresses
- Date of birth
- Biometric identifiers
- Financial account numbers
PHI (Protected Health Information)
PHI is a HIPAA-specific category that includes PII plus health-related data. HIPAA's Safe Harbor method specifies 18 identifier types that must be removed for data to be considered de-identified:
| # | Identifier | Example |
|---|---|---|
| 1 | Names | Patient full names |
| 2 | Geographic data | Addresses; ZIP codes (first 3 digits may be retained only if the combined 3-digit ZIP area contains more than 20,000 people) |
| 3 | Dates | All date elements except year; for ages over 89, the year too (may be aggregated to "90+") |
| 4 | Phone numbers | All phone numbers |
| 5 | Fax numbers | All fax numbers |
| 6 | Email addresses | All email addresses |
| 7 | SSN | Social Security numbers |
| 8 | MRN | Medical record numbers |
| 9 | Health plan numbers | Insurance beneficiary numbers |
| 10 | Account numbers | Financial account numbers |
| 11 | Certificate/license numbers | Professional licenses |
| 12 | Vehicle identifiers | License plates, VINs |
| 13 | Device identifiers | Serial numbers, UDIs |
| 14 | URLs | Web addresses |
| 15 | IP addresses | Network addresses |
| 16 | Biometric identifiers | Fingerprints, voiceprints |
| 17 | Photographs | Full-face photos |
| 18 | Any other unique identifier | Catch-all for unique IDs |
Industry-Specific Sensitive Entities
Beyond standard PII/PHI, each industry has domain-specific sensitive data:
| Industry | Additional Sensitive Entities |
|---|---|
| Healthcare | Diagnosis codes, medication names tied to patients, treatment dates, physician-patient communications |
| Legal | Case numbers, opposing party names, settlement amounts, privileged communications, judge names in sealed cases |
| Finance | Account numbers, routing numbers, transaction amounts tied to identifiable accounts, credit scores, loan terms |
| Government | Clearance levels, classified program names, facility codes, personnel identifiers |
| Construction | Bid amounts, proprietary specifications, subcontractor pricing, site access credentials |
On-Premise Redaction Approaches
All redaction must happen locally. No data can be sent to external APIs for entity detection. Here are the four primary approaches, with trade-offs.
1. Regex Pattern Matching
The simplest and most predictable approach. Define patterns for known entity formats and replace matches.
Strengths: Deterministic, fast, no model dependencies, works in air-gapped environments, and catches every occurrence that strictly matches a well-defined pattern.
Weaknesses: Only catches entities with predictable formats. Cannot detect names, unformatted addresses, or context-dependent entities. High false positive rate for short patterns (e.g., 6-digit numbers matching both MRNs and page numbers).
Best for: SSNs (\d{3}-\d{2}-\d{4}), phone numbers, email addresses, account numbers with known formats, dates in standard formats.
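A minimal sketch of the regex layer, assuming U.S.-formatted entities; the pattern names and formats here are illustrative, and production patterns need more variants (SSNs without dashes, international phone formats, and so on):

```python
import re

# Illustrative patterns for US-formatted entities only.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def redact_with_regex(text: str) -> str:
    """Replace each regex match with a masking token like [SSN]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_with_regex("Call 555-867-5309 re: claim for SSN 123-45-6789."))
# -> "Call [PHONE] re: claim for SSN [SSN]."
```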
2. Local NER Models
Named Entity Recognition models run locally to detect entities like person names, organizations, and locations. Models like spaCy's en_core_web_trf, Flair NER, or fine-tuned BERT variants can run entirely on-premise.
Strengths: Detects entities without predictable formats (names, organizations). Can be fine-tuned for domain-specific entities. No cloud dependency.
Weaknesses: Requires GPU for reasonable throughput on transformer models. Accuracy varies by domain — a general NER model trained on news articles will underperform on clinical notes. Requires model download and local deployment.
Best for: Person names, organization names, location names, and other entities that lack consistent formatting.
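A sketch of the NER layer using spaCy, assuming the en_core_web_trf model has been downloaded and shipped into the environment ahead of time (the label set is illustrative and should be tuned per client):

```python
import spacy

# Model must be pre-downloaded (python -m spacy download en_core_web_trf)
# and copied into the air-gapped environment; no network calls at runtime.
nlp = spacy.load("en_core_web_trf")

# Entity labels treated as sensitive; adjust per domain.
SENSITIVE_LABELS = {"PERSON", "ORG", "GPE", "LOC"}

def redact_with_ner(text: str) -> str:
    """Mask detected entities, replacing right-to-left so character
    offsets of earlier entities stay valid during substitution."""
    doc = nlp(text)
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in SENSITIVE_LABELS:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return text

print(redact_with_ner("Dr. Alice Morgan of Mercy General reviewed the chart."))
```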
3. Local LLM-Based Detection
Running a local language model (e.g., Llama 3.1 8B, Qwen 2.5 7B) with a PII detection prompt. The model reads each text segment and identifies sensitive entities.
Strengths: Handles context-dependent detection (e.g., "Dr. Smith" as a provider name vs. "Smith & Wesson" as a product). Can detect novel entity types with prompt changes. Can handle multiple entity types in a single pass.
Weaknesses: Slower than regex or NER. Non-deterministic — different runs may produce different results. Requires significant compute (8B+ model needs 6-16 GB VRAM). Requires pre-loaded model weights in air-gapped environments.
Best for: Complex or ambiguous entities, context-dependent detection, cross-domain redaction where you need flexibility.
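One way to wire this up, assuming a local model is already being served behind an OpenAI-compatible endpoint (for example llama.cpp's server or vLLM on localhost); the endpoint, model name, and prompt below are illustrative assumptions, not a fixed API:

```python
import json
import requests

# Assumes a locally hosted, OpenAI-compatible server; no data leaves the box.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

PROMPT = (
    "List every piece of personally identifiable or health-related "
    "information in the text below. Respond with a JSON array of objects "
    'with keys "entity" (the exact text) and "type".\n\nText:\n{text}'
)

def detect_with_llm(text: str) -> list[dict]:
    """Ask the local model to enumerate sensitive entities in `text`."""
    resp = requests.post(ENDPOINT, json={
        "model": "llama-3.1-8b-instruct",  # illustrative local model name
        "messages": [{"role": "user", "content": PROMPT.format(text=text)}],
        "temperature": 0,  # reduces (but does not eliminate) run-to-run variance
    }, timeout=120)
    resp.raise_for_status()
    content = resp.json()["choices"][0]["message"]["content"]
    return json.loads(content)  # production code needs more robust parsing
```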
4. Dictionary-Based Matching
Maintain curated dictionaries of known sensitive values (physician names, facility names, approved drug lists) and match against them.
Strengths: High precision for known entities. Fast. Fully deterministic.
Weaknesses: Only catches entities in the dictionary. Requires maintenance. Cannot detect entities not previously cataloged.
Best for: Known entity lists (staff names, facility codes, client company names), supplementing other approaches.
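A simple sketch that compiles a client-provided term list into one case-insensitive pattern; for very large dictionaries, an Aho-Corasick automaton (e.g., the pyahocorasick package) scales better:

```python
import re

def build_dictionary_redactor(terms: list[str], token: str = "[KNOWN_ENTITY]"):
    """Compile known sensitive values into one word-boundary pattern.
    Longer terms are tried first so 'Mercy General Hospital' wins over
    'Mercy General'."""
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)
    return lambda text: pattern.sub(token, text)

redact_known = build_dictionary_redactor(["Mercy General Hospital", "Dr. Alice Morgan"])
print(redact_known("Transferred from Mercy General Hospital on admission."))
```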
Recommended Multi-Layer Approach
No single method is sufficient for production-grade redaction. The practical approach is layered:
- Regex layer: Catch all format-predictable entities (SSNs, phones, emails, dates, account numbers)
- Dictionary layer: Catch all known entities from client-provided lists
- NER model layer: Catch names, organizations, and locations that the regex missed
- Validation pass: Human review of a statistical sample to measure redaction completeness
The order matters. Running regex and dictionary matching first reduces the load on the NER model and provides a baseline that the model only needs to supplement.
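A sketch of how the layers compose, reusing the illustrative functions from the sections above; the point is that the deterministic layers run first and the NER model only sees what survives them:

```python
def redact_record(text: str) -> str:
    """Layered redaction: deterministic passes first, model pass last."""
    text = redact_with_regex(text)  # layer 1: format-predictable entities
    text = redact_known(text)       # layer 2: client-provided dictionaries
    text = redact_with_ner(text)    # layer 3: names/orgs the first two missed
    return text
```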
Replacement Strategies
How you replace detected entities affects both compliance and data utility.
Masking
Replace the entity with a generic token: [NAME], [SSN], [DATE].
Pros: Simple, preserves text structure, clearly indicates where entities were removed. Cons: Destroys the entity content itself, which may be useful for model training. Multiple entities of the same type are indistinguishable.
Pseudonymization
Replace entities with realistic but fake values: "John Smith" → "Robert Chen", "555-12-3456" → "555-98-7654".
Pros: Preserves semantic structure. Training data retains the "shape" of real entities, which can improve model performance on downstream tasks. Under GDPR, pseudonymization is an explicitly recognized safeguard (Articles 25 and 32), though pseudonymized data still counts as personal data. Cons: Requires a mapping table (which is itself sensitive). Risk of collision with real values.
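A sketch of consistent pseudonymization using the Faker library (assuming it is installed locally); the mapping table keeps the same real name mapped to the same fake name across a dataset, and the table itself must be protected like raw data:

```python
from faker import Faker

fake = Faker()
Faker.seed(42)  # deterministic output for reproducible pipelines

# Original -> pseudonym mapping; this table is itself sensitive and must
# live under the same access controls as the raw data (or be discarded).
_pseudonyms: dict[str, str] = {}

def pseudonymize_name(real_name: str) -> str:
    """Map each real name to one stable fake name across the dataset."""
    if real_name not in _pseudonyms:
        _pseudonyms[real_name] = fake.name()
    return _pseudonyms[real_name]

print(pseudonymize_name("John Smith"))  # e.g. "Allison Hill"
print(pseudonymize_name("John Smith"))  # same pseudonym on repeat
```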
Removal
Delete the entity entirely, leaving no trace.
Pros: Maximum protection. No residual information. Cons: Destroys text structure; deletion leaves incoherent sentence fragments. Poor for training data quality.
Industry-Specific Recommendations
| Industry | Recommended Strategy | Reasoning |
|---|---|---|
| Healthcare | Pseudonymization or masking | HIPAA Safe Harbor requires removal of identifiers, but pseudonymization preserves clinical context |
| Legal | Masking | Privileged content must be clearly indicated as redacted |
| Finance | Masking | Account numbers replaced with [ACCOUNT] preserves transaction structure |
| Government | Removal or masking | Classified indicators must leave no residual information |
Validating Redaction Completeness
Redaction is only as good as its verification. A pipeline that claims to remove PII but misses 3% of names is worse than no redaction at all — it creates a false sense of compliance.
Statistical Sampling
Manually review a random sample of redacted records. Industry practice is 5-10% of records, with a higher sample rate for the first batch from a new data source.
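A minimal, reproducible way to draw that sample (the rate and seed are illustrative defaults):

```python
import random

def draw_review_sample(record_ids: list[str], rate: float = 0.05,
                       seed: int = 1) -> list[str]:
    """Draw a seeded random sample of redacted records for manual review;
    use a higher rate for the first batch from a new data source."""
    rng = random.Random(seed)
    k = max(1, round(len(record_ids) * rate))
    return rng.sample(record_ids, k)
```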
Known Entity Injection
Inject records with known PII patterns before redaction, then verify they were all caught. This provides a measurable detection rate.
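A sketch of the measurement step: plant synthetic "canary" entities before redaction, then count how many survive to the output (the planted values below are illustrative):

```python
# Synthetic entities planted into records before redaction; none of
# these should appear in the output if the pipeline is working.
PLANTED = ["987-65-4321", "canary.patient@example.com", "Zebediah Q. Plantsman"]

def detection_rate(redacted_texts: list[str]) -> float:
    """Fraction of planted entities that were successfully removed."""
    leaked = {p for p in PLANTED if any(p in t for t in redacted_texts)}
    return 1 - len(leaked) / len(PLANTED)
```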
Cross-Method Validation
Run a second, independent detection method on the redacted output. If method B finds entities that method A missed, the pipeline has a gap.
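As a sketch, the NER detector from earlier can serve as method B: run it over output that the other layers already redacted, and treat any hit as a potential gap to investigate (this reuses the illustrative nlp pipeline and SENSITIVE_LABELS set from above):

```python
def find_residual_entities(redacted_text: str) -> list[tuple[str, str]]:
    """Run an independent detector over already-redacted output;
    any entity it still finds is a potential gap in the first method."""
    doc = nlp(redacted_text)
    return [(ent.text, ent.label_) for ent in doc.ents
            if ent.label_ in SENSITIVE_LABELS]
```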
Redaction Audit Report
Document the validation results: sample size, detection rate, entity types tested, false positive rate, false negative rate. This report becomes part of your deliverable to the client.
Integrated Redaction in Practice
Building a multi-layer redaction pipeline from scratch — regex, NER, dictionaries, validation, logging — is 60-120 hours of engineering work, plus ongoing maintenance for each new client industry.
Ertas Data Suite includes PII/PHI redaction as a built-in capability within its Clean module. It runs entirely on-premise with no cloud dependencies, supports configurable entity types per industry, and logs every redaction event (entity type, location, replacement method, operator ID, timestamp) to the unified audit trail. The redaction log is exportable as part of the compliance documentation package.
Conclusion
PII/PHI redaction is the gate between raw client data and usable training data. For multi-industry service providers, the challenge is not just detecting entities — it is handling the varying requirements across healthcare, legal, finance, and government clients, all while running entirely on-premise and producing the audit evidence that proves the redaction was thorough.
Get this step wrong, and everything downstream — the labels, the model, the deployment — inherits the compliance risk.