    On-Premise PII and PHI Redaction Workflows for Multi-Industry Service Providers

    Technical guide to building on-premise PII/PHI redaction pipelines that handle healthcare, legal, financial, and government data without cloud dependencies.

    Ertas Team

    Before training data can be used, sensitive information must be removed. This is not a best practice — it is a legal requirement under HIPAA, GDPR, and most data processing agreements. For service providers working across multiple industries, the challenge is that what counts as "sensitive" varies by industry, and the acceptable redaction methods vary by regulation.

    A healthcare client needs PHI redacted per HIPAA Safe Harbor. A legal client needs attorney-client privileged information protected. A financial client needs account numbers and SSNs removed. A government client needs classified indicators stripped. And all of them expect the redaction to happen on-premise, because sending their data to a cloud API for entity detection is exactly the kind of data exposure they hired you to prevent.

    This guide covers the technical approaches to building on-premise PII/PHI redaction workflows that handle multi-industry requirements without cloud dependencies.


    PII vs. PHI: What Each Industry Requires You to Redact

    PII (Personally Identifiable Information)

    PII is any information that can identify a specific individual. Under GDPR, the definition is broad — any data "relating to an identified or identifiable natural person." Under U.S. regulations, the definition varies by context but generally includes:

    • Full names
    • Social Security numbers
    • Driver's license numbers
    • Email addresses
    • Phone numbers
    • Physical addresses
    • Date of birth
    • Biometric identifiers
    • Financial account numbers

    PHI (Protected Health Information)

    PHI is a HIPAA-specific category that includes PII plus health-related data. HIPAA's Safe Harbor method specifies 18 identifier types that must be removed for data to be considered de-identified:

    | # | Identifier | Example |
    |---|------------|---------|
    | 1 | Names | Patient full names |
    | 2 | Geographic data | Addresses; ZIP codes (first 3 digits retainable only if the area they cover contains more than 20,000 people) |
    | 3 | Dates | All date elements except year; for patients over 89, the year as well |
    | 4 | Phone numbers | All phone numbers |
    | 5 | Fax numbers | All fax numbers |
    | 6 | Email addresses | All email addresses |
    | 7 | SSN | Social Security numbers |
    | 8 | MRN | Medical record numbers |
    | 9 | Health plan numbers | Insurance beneficiary numbers |
    | 10 | Account numbers | Financial account numbers |
    | 11 | Certificate/license numbers | Professional licenses |
    | 12 | Vehicle identifiers | License plates, VINs |
    | 13 | Device identifiers | Serial numbers, UDIs |
    | 14 | URLs | Web addresses |
    | 15 | IP addresses | Network addresses |
    | 16 | Biometric identifiers | Fingerprints, voiceprints |
    | 17 | Photographs | Full-face photos |
    | 18 | Any other unique identifier | Catch-all for unique IDs |

    Industry-Specific Sensitive Entities

    Beyond standard PII/PHI, each industry has domain-specific sensitive data:

    | Industry | Additional Sensitive Entities |
    |----------|-------------------------------|
    | Healthcare | Diagnosis codes, medication names tied to patients, treatment dates, physician-patient communications |
    | Legal | Case numbers, opposing party names, settlement amounts, privileged communications, judge names in sealed cases |
    | Finance | Account numbers, routing numbers, transaction amounts tied to identifiable accounts, credit scores, loan terms |
    | Government | Clearance levels, classified program names, facility codes, personnel identifiers |
    | Construction | Bid amounts, proprietary specifications, subcontractor pricing, site access credentials |

    On-Premise Redaction Approaches

    All redaction must happen locally. No data can be sent to external APIs for entity detection. Here are the four primary approaches, with trade-offs.

    1. Regex Pattern Matching

    The simplest and most predictable approach. Define patterns for known entity formats and replace matches.

    Strengths: Deterministic, fast, no model dependencies, works in air-gapped environments, zero false negatives for well-defined patterns.

    Weaknesses: Only catches entities with predictable formats. Cannot detect names, unformatted addresses, or context-dependent entities. High false positive rate for short patterns (e.g., 6-digit numbers matching both MRNs and page numbers).

    Best for: SSNs (\d{3}-\d{2}-\d{4}), phone numbers, email addresses, account numbers with known formats, dates in standard formats.
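As a sketch, a regex layer might look like the following. The patterns are illustrative, not an exhaustive production ruleset, and the mask tokens are one possible convention:

```python
import re

# Illustrative patterns for format-predictable entities.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
}

def redact_regex(text: str) -> str:
    """Replace every pattern match with a typed mask token."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Because the SSN pattern runs before the phone pattern, the 3-2-4 digit grouping is claimed as an SSN before the looser phone pattern can touch it; ordering patterns from most to least specific avoids mislabeled matches.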

    2. Local NER Models

    Named Entity Recognition models run locally to detect entities like person names, organizations, and locations. Models like spaCy's en_core_web_trf, Flair NER, or fine-tuned BERT variants can run entirely on-premise.

    Strengths: Detects entities without predictable formats (names, organizations). Can be fine-tuned for domain-specific entities. No cloud dependency.

    Weaknesses: Requires GPU for reasonable throughput on transformer models. Accuracy varies by domain — a general NER model trained on news articles will underperform on clinical notes. Requires model download and local deployment.

    Best for: Person names, organization names, location names, and other entities that lack consistent formatting.
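To keep the example independent of any one model, the sketch below takes already-detected entity spans as input; in practice they would come from a locally loaded model such as spaCy's en_core_web_trf. The labels and mask tokens are illustrative assumptions:

```python
# Map model entity labels to mask tokens (illustrative).
SENSITIVE_LABELS = {"PERSON": "[NAME]", "ORG": "[ORG]", "GPE": "[LOCATION]"}

def redact_ner(text, entities):
    """Replace detected spans right-to-left so earlier character
    offsets stay valid as the text shrinks or grows.

    `entities` is a list of (text, start, end, label) tuples, e.g.
    extracted from a spaCy Doc's .ents.
    """
    for ent_text, start, end, label in sorted(entities, key=lambda e: -e[1]):
        token = SENSITIVE_LABELS.get(label)
        if token:
            text = text[:start] + token + text[end:]
    return text
```

The right-to-left replacement order is the important detail: substituting left-to-right would invalidate every offset after the first replacement.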

    3. Local LLM-Based Detection

    Running a local language model (e.g., Llama 3.1 8B, Qwen 2.5 7B) with a PII detection prompt. The model reads each text segment and identifies sensitive entities.

    Strengths: Handles context-dependent detection (e.g., "Dr. Smith" as a provider name vs. "Smith & Wesson" as a product). Can detect novel entity types with prompt changes. Can handle multiple entity types in a single pass.

    Weaknesses: Slower than regex or NER. Non-deterministic — different runs may produce different results. Requires significant compute (8B+ model needs 6-16 GB VRAM). Requires pre-loaded model weights in air-gapped environments.

    Best for: Complex or ambiguous entities, context-dependent detection, cross-domain redaction where you need flexibility.
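A hedged sketch of the surrounding plumbing: the call into the local runtime (llama.cpp, Ollama, vLLM, or similar) is omitted, and the prompt wording and entity schema are assumptions, but the defensive JSON parsing reflects the non-determinism noted above:

```python
import json

def build_prompt(doc: str) -> str:
    # Illustrative detection prompt; a production prompt would enumerate
    # the entity types required for the client's industry.
    return (
        "List every PII or PHI entity in the text below as a JSON array "
        'of {"text": ..., "type": ...} objects. Output only the JSON.\n\n'
        "Text: " + doc
    )

def parse_entities(raw: str) -> list:
    """Parse the model's output defensively: responses may wrap the
    JSON array in stray prose, or fail to produce valid JSON at all."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end == -1:
        return []
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []

def redact_llm_output(text: str, entities: list) -> str:
    """Mask each reported entity with a typed token."""
    for ent in entities:
        text = text.replace(ent["text"], f"[{ent['type'].upper()}]")
    return text
```

An empty list on parse failure is a deliberate fail-safe only when a later validation layer exists; on its own it would silently pass unredacted text through.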

    4. Dictionary-Based Matching

    Maintain curated dictionaries of known sensitive values (physician names, facility names, approved drug lists) and match against them.

    Strengths: High precision for known entities. Fast. Fully deterministic.

    Weaknesses: Only catches entities in the dictionary. Requires maintenance. Cannot detect entities not previously cataloged.

    Best for: Known entity lists (staff names, facility codes, client company names), supplementing other approaches.
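A minimal dictionary layer, assuming client-provided term lists. Sorting longest-first keeps multi-word entries from being shadowed by their prefixes:

```python
import re

def build_dictionary_pattern(terms):
    """Compile one case-insensitive alternation; longest terms first so
    'St. Mary Medical Center' wins over a shorter entry like 'St. Mary'."""
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

def redact_dictionary(text, terms, token="[KNOWN_ENTITY]"):
    """Replace every dictionary hit with a single generic token."""
    return build_dictionary_pattern(terms).sub(token, text)
```

For dictionaries with tens of thousands of entries, a single alternation becomes slow; an Aho-Corasick automaton is the usual replacement, but the interface stays the same.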


    No single method is sufficient for production-grade redaction. The practical approach is layered:

    1. Regex layer: Catch all format-predictable entities (SSNs, phones, emails, dates, account numbers)
    2. Dictionary layer: Catch all known entities from client-provided lists
    3. NER model layer: Catch names, organizations, and locations that the regex missed
    4. Validation pass: Human review of a statistical sample to measure redaction completeness

    The order matters. Running regex and dictionary matching first reduces the load on the NER model and provides a baseline that the model only needs to supplement.
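The layering reduces to a simple composition. The two stub layers below stand in for the real regex, dictionary, and NER passes, and "Mercy General" is a hypothetical client-provided facility name:

```python
import re

KNOWN_TERMS = {"Mercy General"}  # hypothetical client dictionary

def regex_layer(text):
    """Stub for layer 1: format-predictable entities."""
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", text)

def dictionary_layer(text):
    """Stub for layer 2: known entities from client lists."""
    for term in KNOWN_TERMS:
        text = text.replace(term, "[FACILITY]")
    return text

def layered_redact(text, layers):
    """Run layers in order: cheap deterministic passes first, so the
    NER or LLM layer only has to supplement the baseline."""
    for layer in layers:
        text = layer(text)
    return text
```

Because each layer has the same `str -> str` signature, adding an NER or LLM layer later is just appending another callable to the list.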


    Replacement Strategies

    How you replace detected entities affects both compliance and data utility.

    Masking

    Replace the entity with a generic token: [NAME], [SSN], [DATE].

    Pros: Simple, preserves text structure, clearly indicates where entities were removed. Cons: Destroys entity-type information that may be useful for model training. Multiple entities of the same type are indistinguishable.

    Pseudonymization

    Replace entities with realistic but fake values: "John Smith" → "Robert Chen", "555-12-3456" → "555-98-7654".

    Pros: Preserves semantic structure. Training data retains the "shape" of real entities, which can improve model performance on downstream tasks. Under GDPR, pseudonymized data remains personal data, but pseudonymization is an explicitly recognized safeguard that reduces risk and can ease certain processing obligations. Cons: Requires a mapping table (which is itself sensitive and must be protected as strictly as the original data). Risk of collision with real values.
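A sketch of consistent pseudonymization with a mapping table. The fake-name pool and hash-based assignment are illustrative; note that the mapping table is itself sensitive, and a small pool like this one makes collisions likely in practice:

```python
import hashlib

# Illustrative pool; a real deployment would draw from a large,
# demographically plausible name list.
FAKE_NAMES = ["Robert Chen", "Maria Lopez", "David Kim", "Aisha Patel"]

class Pseudonymizer:
    """Map each real value to a stable fake one, so the same patient
    gets the same pseudonym everywhere in the corpus."""

    def __init__(self):
        self.mapping = {}  # sensitive: store and protect like the raw data

    def replace(self, real_name: str) -> str:
        if real_name not in self.mapping:
            # Hash-derived index keeps assignment deterministic per value.
            digest = hashlib.sha256(real_name.encode()).digest()
            self.mapping[real_name] = FAKE_NAMES[digest[0] % len(FAKE_NAMES)]
        return self.mapping[real_name]
```

Consistency is the point: if "John Smith" becomes a different pseudonym in each record, cross-record structure (repeat visits, recurring parties) is destroyed along with the identity.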

    Removal

    Delete the entity entirely, leaving no trace.

    Pros: Maximum protection. No residual information. Cons: Destroys text structure. Sentence fragments become incoherent. Poor for training data quality.

    Industry-Specific Recommendations

    | Industry | Recommended Strategy | Reasoning |
    |----------|----------------------|-----------|
    | Healthcare | Pseudonymization or masking | HIPAA Safe Harbor requires removal of identifiers, but pseudonymization preserves clinical context |
    | Legal | Masking | Privileged content must be clearly indicated as redacted |
    | Finance | Masking | Replacing account numbers with [ACCOUNT] preserves transaction structure |
    | Government | Removal or masking | Classified indicators must leave no residual information |

    Validating Redaction Completeness

    Redaction is only as good as its verification. A pipeline that claims to remove PII but misses 3% of names is worse than no redaction at all — it creates a false sense of compliance.

    Statistical Sampling

    Manually review a random sample of redacted records. Industry practice is 5-10% of records, with a higher sample rate for the first batch from a new data source.
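A sampling helper, assuming the 5-10% rates above; the seed parameter makes the sample reproducible for the audit report:

```python
import random

def sample_for_review(record_ids, rate=0.05, first_batch=False, seed=None):
    """Draw a random review sample: 5% steady state, 10% for the
    first batch from a new data source."""
    rate = 0.10 if first_batch else rate
    k = max(1, round(len(record_ids) * rate))
    return random.Random(seed).sample(record_ids, k)
```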

    Known Entity Injection

    Inject records with known PII patterns before redaction, then verify they were all caught. This provides a measurable detection rate.
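The injection check amounts to a canary test; the carrier sentence and the stub redactor in the usage example are illustrative:

```python
def injection_detection_rate(redact, canaries):
    """Run known PII strings ('canaries') through the redaction
    pipeline and report the fraction that were fully removed."""
    caught = sum(
        1 for c in canaries if c not in redact(f"Record for {c}.")
    )
    return caught / len(canaries)
```

For example, a redactor that only handles SSNs scores 0.5 against a canary set containing one SSN and one name, which quantifies exactly the gap a name-detection layer would close.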

    Cross-Method Validation

    Run a second, independent detection method on the redacted output. If method B finds entities that method A missed, the pipeline has a gap.
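Cross-method validation reduces to running detector B over A's output and treating every hit as a gap. The naive email detector in the usage example is a stand-in for a full second pipeline:

```python
def cross_validate(redacted_texts, second_detector):
    """Run an independent detector over already-redacted output.
    Returns (record_index, entity) pairs the primary pipeline missed."""
    gaps = []
    for i, text in enumerate(redacted_texts):
        for entity in second_detector(text):
            gaps.append((i, entity))
    return gaps
```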

    Redaction Audit Report

    Document the validation results: sample size, detection rate, entity types tested, false positive rate, false negative rate. This report becomes part of your deliverable to the client.


    Integrated Redaction in Practice

    Building a multi-layer redaction pipeline from scratch — regex, NER, dictionaries, validation, logging — is 60-120 hours of engineering work, plus ongoing maintenance for each new client industry.

    Ertas Data Suite includes PII/PHI redaction as a built-in capability within its Clean module. It runs entirely on-premise with no cloud dependencies, supports configurable entity types per industry, and logs every redaction event (entity type, location, replacement method, operator ID, timestamp) to the unified audit trail. The redaction log is exportable as part of the compliance documentation package.


    Conclusion

    PII/PHI redaction is the gate between raw client data and usable training data. For multi-industry service providers, the challenge is not just detecting entities — it is handling the varying requirements across healthcare, legal, finance, and government clients, all while running entirely on-premise and producing the audit evidence that proves the redaction was thorough.

    Get this step wrong, and everything downstream — the labels, the model, the deployment — inherits the compliance risk.
