
PHI Redaction for AI Training: A Step-by-Step Guide for Healthcare ML Teams
Before clinical data can be used to train AI models, PHI must be identified and redacted. This guide covers automated PHI detection, HIPAA de-identification standards, and on-premise redaction pipelines.
Clinical data is invaluable for AI training. Medical records, clinical notes, imaging reports, and discharge summaries contain the kind of nuanced, domain-specific language that no amount of general web text can substitute. But clinical data almost always contains protected health information — PHI — and using it to train AI models without first completing de-identification is a HIPAA violation.
For healthcare ML teams, this creates a mandatory first step: before any clinical document enters a training pipeline, PHI must be identified, removed or replaced, and the removal must be documented. This guide covers how to do that correctly.
HIPAA De-Identification: The Two Standards
HIPAA provides two legally recognized methods for de-identifying health information. Both result in data that is no longer subject to HIPAA's Privacy Rule — and therefore legally usable for AI training.
Safe Harbor Method. The Safe Harbor method requires removing all 18 specific categories of identifier:
- Names (patient, family members, employers)
- Geographic data smaller than state level (addresses, cities, ZIP codes — though the first three digits of a ZIP code are permitted if the geographic unit they cover contains more than 20,000 people)
- Dates (other than year) related to an individual: birth dates, admission dates, discharge dates, date of death
- Phone numbers
- Fax numbers
- Email addresses
- Social Security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate and license numbers
- Vehicle identifiers and serial numbers
- Device identifiers and serial numbers
- Web URLs
- IP addresses
- Biometric identifiers (fingerprints, voice prints)
- Full-face photographs and comparable images
- Any other unique identifying number, characteristic, or code
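Two of these categories come with partial-retention rules that can be applied mechanically: ZIP codes may keep their first three digits when the area they cover exceeds 20,000 people, and dates may be reduced to the year. A minimal sketch of both rules, assuming ISO-formatted dates — note that the restricted-ZIP set below is an illustrative placeholder, not the authoritative list, which must be derived from current census population data:

```python
import re

# ZIP3 prefixes whose combined population is 20,000 or fewer must be
# replaced with "000". ILLUSTRATIVE SUBSET ONLY -- derive the real
# list from current census data before use.
RESTRICTED_ZIP3 = {"036", "059", "102", "203", "890", "893"}

def generalize_zip(zip_code: str) -> str:
    """Keep only the first three digits; zero out restricted areas."""
    prefix = zip_code[:3]
    return "000" if prefix in RESTRICTED_ZIP3 else prefix + "00"

def generalize_date(date_iso: str) -> str:
    """Reduce an ISO date (YYYY-MM-DD) to year only, per Safe Harbor."""
    match = re.match(r"(\d{4})-\d{2}-\d{2}$", date_iso)
    return match.group(1) if match else "[DATE]"
```

Padding the truncated ZIP with zeros keeps the field the shape downstream parsers expect while discarding the identifying digits.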
After removing all 18 categories, the covered entity must also have no actual knowledge that the remaining information could identify an individual. That last condition is the one that catches teams off guard. A clinical note might not contain any of the 18 identifiers explicitly but still be identifiable because it describes a rare condition that only one patient in the dataset has.
Expert Determination Method. An expert with statistical knowledge applies generally accepted principles to determine that the risk of identifying an individual is very small. This method is more flexible — it does not require removing all 18 identifier types — but requires documented statistical justification. For most healthcare AI teams, Safe Harbor is the practical choice because it provides a clear checklist and is easier to audit.
Why Manual Review Cannot Scale
Before automated PHI detection tools became available, de-identification was done manually: a clinician or trained reviewer read every document and redacted identifiers by hand. For a dataset of 500 documents, this is feasible. For a dataset of 50,000 clinical notes, it is not.
A single clinical note averages 600–800 words. A reviewer who can process 20 notes per hour — which requires focus and domain knowledge — would need 2,500 hours to review 50,000 notes. That is more than a full year of one person's working time, and it scales linearly with dataset size.
Automated PHI detection using named entity recognition (NER) is the only practical approach at scale. A well-trained PHI detection model processes hundreds of documents per minute, flags every identified PHI instance with its location and category, and produces a review-ready output that a human can spot-check rather than read in full.
The catch: no automated tool achieves 100% recall on PHI. Every evaluation of automated de-identification tools finds a residual false negative rate — PHI that the model misses. For AI training data preparation, this means automated detection must be followed by a structured human review process, not treated as a replacement for one.
What Automated Tools Miss
The 18 Safe Harbor identifiers are not equally easy to detect. Names, dates, and contact information are reliably detected by modern NER models. The harder cases are:
Indirect identifiers. A clinical note that states "the patient, a 67-year-old Swahili-speaking woman from rural Montana, presented with..." does not contain any of the 18 explicit identifier types, but the combination of characteristics may make the patient identifiable in a small population. Automated tools cannot detect this class of identifier without external reference data about population sizes.
Rare disease combinations. A patient with a rare condition, a specific genetic variant, and an unusual combination of comorbidities may be de facto identifiable even with all 18 identifiers removed. Expert Determination methodology addresses this; Safe Harbor does not.
Numeric identifiers embedded in text. Medical record numbers and account numbers embedded in free text narrative — "MRN: 4471832 was admitted..." — are usually detected. But identifiers embedded in non-standard formats — "patient registered under 4471832 at this facility" — may be missed by models trained on standard formats.
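A toy illustration of the format problem — the pattern below is hypothetical, keyed to the standard labeled form, and it shows how the same number slips through when the label is absent:

```python
import re

# Hypothetical detector keyed to the standard labeled format "MRN: 1234567".
MRN_LABELED = re.compile(r"\bMRN:?\s*(\d{7})\b")

standard = "MRN: 4471832 was admitted with chest pain."
nonstandard = "patient registered under 4471832 at this facility"

assert MRN_LABELED.search(standard) is not None   # labeled form: detected
assert MRN_LABELED.search(nonstandard) is None    # free-text phrasing: missed
```

A broader fallback (flagging any seven-digit number) would catch both, but floods the review queue with false positives such as lab accession numbers — which is exactly the trade-off that makes human review of low-confidence detections necessary.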
Provider and facility identifiers. HIPAA's Safe Harbor categories focus on patient identifiers. But clinical notes often name specific providers, facilities, and treating physicians by name. In a small practice or a specialty where only a few providers treat a given condition, provider names can enable patient re-identification. These should be redacted too, even though they are not strictly required by Safe Harbor.
Cross-document linkage. A single de-identified document may be safe. A corpus of de-identified documents from the same patient, combined, may be re-identifiable because the combination of dates, conditions, and procedures narrows the population to one person. This is a corpus-level risk that document-level automated tools cannot assess.
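A first-cut corpus-level screen is possible when documents retain a consistent patient pseudonym — an assumption that does not always hold. This sketch unions each pseudonym's attributes across documents and flags profiles that occur exactly once in the corpus; a real risk assessment would also need external population data:

```python
from collections import Counter

def unique_profiles(docs):
    """Flag pseudonyms whose combined attribute profile is unique in the
    corpus -- a re-identification risk no single document reveals.

    docs: iterable of (pseudonym, attribute_set) pairs, one per document.
    """
    profiles = {}
    for pid, attrs in docs:
        profiles.setdefault(pid, set()).update(attrs)
    counts = Counter(frozenset(a) for a in profiles.values())
    return sorted(pid for pid, a in profiles.items()
                  if counts[frozenset(a)] == 1)

docs = [
    ("p1", {"year:2021", "dx:asthma"}),
    ("p2", {"year:2021"}), ("p2", {"dx:asthma"}),  # same profile, split across docs
    ("p3", {"dx:fabry-disease", "proc:ERT"}),      # rare combination, one patient
]
```

Here `unique_profiles(docs)` flags only `p3`: the rare-condition profile narrows the population to one person even though each individual document looked safe.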
The Redaction Pipeline
A complete PHI redaction pipeline for AI training data has five stages.
Stage 1: Detect. Run automated NER-based PHI detection across all documents. The output is a list of detected PHI instances: document ID, character offsets, entity category, entity text, and confidence score. Use a model specifically trained for clinical text — general NLP NER models trained on news or web text perform poorly on clinical notes because the language patterns are different. Separate models for structured fields (forms, tables) and unstructured narrative are more accurate than a single model.
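One plausible shape for the Stage 1 output record — field names here are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class PhiDetection:
    """One detected PHI instance, as emitted by the detection stage."""
    doc_id: str
    start: int        # character offset, inclusive
    end: int          # character offset, exclusive
    category: str     # e.g. "NAME", "DATE", "MRN"
    text: str         # the matched span, kept for reviewer context
    confidence: float

note = "Pt. John Doe admitted 2024-03-15."
det = PhiDetection("note-001", 4, 12, "NAME", note[4:12], 0.98)
```

Carrying character offsets rather than just the matched string matters: the same name may appear several times in a document, and the redaction stage must know exactly which occurrence to replace.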
Stage 2: Review. Present detected PHI instances to a human reviewer for confirmation. The review interface should show each detected instance in context, allow the reviewer to confirm, override, or add missed instances, and record every reviewer decision. For high-confidence detections (confidence > 0.95), batch confirmation is appropriate. Low-confidence detections require individual review. The reviewer should also read a sample of documents where no PHI was detected, to estimate the false negative rate.
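The confidence-based triage described above reduces to a simple split — a sketch, assuming detections carry a `confidence` field and the 0.95 cutoff from the text:

```python
def triage(detections, threshold=0.95):
    """Split detections into a batch-confirm queue (high confidence)
    and an individual-review queue (everything else)."""
    batch = [d for d in detections if d["confidence"] > threshold]
    manual = [d for d in detections if d["confidence"] <= threshold]
    return batch, manual

detections = [
    {"doc_id": "n1", "category": "NAME", "confidence": 0.99},
    {"doc_id": "n1", "category": "MRN", "confidence": 0.97},
    {"doc_id": "n2", "category": "DATE", "confidence": 0.61},
]
batch, manual = triage(detections)
```

The threshold itself should be validated against reviewer override rates: if reviewers frequently reject detections above the cutoff, batch confirmation is hiding errors rather than saving time.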
Stage 3: Redact. Apply redactions to the documents. Redaction can be substitution (replacing a name with a placeholder like "[PATIENT]") or deletion (removing the text entirely). For NLP training data, substitution is strongly preferred — deletion creates gaps in the text that disrupt sentence structure and can cause downstream parsing errors. Substitution maintains document fluency while removing the identifying content. Where possible, substitution with realistic-sounding but non-real values (replacing a name with a different plausible name of the same apparent demographic) better preserves the linguistic patterns the model will learn.
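A minimal sketch of offset-based substitution. The one non-obvious detail: spans must be applied from the end of the text backwards, because each substitution changes the document length and would otherwise invalidate the offsets of later spans:

```python
def redact(text, detections, placeholders):
    """Substitute each detected span with a category placeholder.

    Spans are applied from the end of the text backwards so that the
    character offsets of earlier spans stay valid as lengths change.
    """
    for d in sorted(detections, key=lambda d: d["start"], reverse=True):
        tag = placeholders.get(d["category"], "[REDACTED]")
        text = text[:d["start"]] + tag + text[d["end"]:]
    return text

note = "John Doe, MRN 4471832, seen 2024-03-15."
spans = [
    {"start": 0, "end": 8, "category": "NAME"},
    {"start": 14, "end": 21, "category": "MRN"},
    {"start": 28, "end": 38, "category": "DATE"},
]
tags = {"NAME": "[PATIENT]", "MRN": "[MRN]", "DATE": "[DATE]"}
clean = redact(note, spans, tags)
```

The result reads "[PATIENT], MRN [MRN], seen [DATE]." — the sentence structure survives, which is the point of substitution over deletion.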
Stage 4: Verify. Run a second pass of PHI detection on the redacted documents to catch residual identifiers. The second-pass false negative rate should be significantly lower than the first pass (because most identifiers have already been removed), but it is a necessary quality check. Any PHI detected in the second pass is a pipeline failure and should trigger a review of why the first pass missed it.
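A toy version of the second-pass check — in production the full NER model is re-run, but even a cheap pattern scan over redacted output catches gross failures:

```python
import re

# Toy second-pass detector: flags anything that still looks like an
# MRN-length number or an ISO date. A real verification pass would
# re-run the full clinical NER model instead.
def residual_phi(text):
    return re.findall(r"\b\d{7}\b|\b\d{4}-\d{2}-\d{2}\b", text)

clean = "[PATIENT], MRN [MRN], seen [DATE]."
leaked = "[PATIENT], MRN 4471832, seen [DATE]."
```

`residual_phi(clean)` returns nothing; `residual_phi(leaked)` returns the leaked MRN, which should halt the pipeline and trigger a review of why the first pass missed it.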
Stage 5: Log. Every detection, review decision, and redaction must be logged with timestamp, reviewer ID, document ID, and the specific identifier category. This log is the audit trail. If a regulator asks "how was this dataset de-identified?", the answer is the log — not a policy document.
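An append-only JSON Lines file is one straightforward way to carry those fields — a sketch, with an illustrative event vocabulary:

```python
import io
import json
from datetime import datetime, timezone

def log_event(stream, doc_id, event, category, reviewer_id=None):
    """Append one audit event as a JSON line: who did what to which
    document, when, and for which identifier category."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "doc_id": doc_id,
        "event": event,        # e.g. "detected", "confirmed", "redacted"
        "category": category,
        "reviewer_id": reviewer_id,
    }
    stream.write(json.dumps(entry) + "\n")

log = io.StringIO()  # in practice: an append-only file on audited storage
log_event(log, "note-001", "redacted", "NAME", reviewer_id="rev-07")
record = json.loads(log.getvalue())
```

One line per event, machine-parseable, never rewritten in place: that is what lets you answer "how was this dataset de-identified?" document by document rather than with a policy statement.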
Why the Audit Log Is Non-Negotiable
HIPAA does not just require de-identification. It requires that de-identification be documented in a way that can be reviewed by OCR (the Office for Civil Rights) in the event of an investigation.
If you train a clinical NLP model and later face a complaint that your training data contained PHI, the defense is the audit log: "Here is every document that was processed. Here is every PHI instance that was detected. Here is every review decision. Here is every redaction that was applied. Here is the second-pass verification result." Without that log, you have no defense.
For healthcare AI teams building training datasets, the audit log is not administrative overhead. It is the evidence that the process was followed.
On-Premise vs. Cloud for PHI Redaction
The question of where PHI redaction runs is legally significant. Running clinical notes through a cloud API for PHI detection means the PHI is transmitted to and processed by a third-party system. Under HIPAA, this requires a Business Associate Agreement (BAA) with the cloud provider. Some cloud providers offer BAAs; many do not.
More importantly: even with a BAA, the data has left the healthcare organization's control. Transmission is a risk event. If the data is intercepted in transit or retained by the API provider longer than agreed, that is a potential breach.
The safest approach — and the one most healthcare compliance teams require — is to run PHI detection and redaction entirely on-premise, on hardware controlled by the organization. The data never leaves. There is no BAA required. There is no transmission risk. The audit log stays within the organization's systems.
For AI training data specifically, where datasets are large and the processing pipeline runs repeatedly as new data is added, on-premise operation is also more practical. Sending 50,000 clinical notes to a cloud API is expensive, slow, and subject to rate limits. Running the same pipeline on a local workstation or server is faster and free after the initial setup.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
Related Reading
- Clinical NLP Training Data: How to Prepare Medical Records Without Violating HIPAA — Full pipeline for clinical NLP dataset preparation
- Why Vector RAG Fails on Clinical Data — and What to Use Instead — Understanding the limits of RAG for medical terminology
- HIPAA-Compliant AI Training Data Guide — Comprehensive HIPAA framework for AI teams