
    PII Redaction Accuracy Benchmark: Regex vs NER vs LLM vs Hybrid Pipeline

    Benchmark comparing five PII redaction approaches — regex patterns, spaCy NER, transformer NER, LLM-based, and hybrid pipeline — measuring precision, recall, F1 score, speed, and false positive rates across 14 entity types.

Ertas Team

    PII redaction is the highest-stakes stage in any enterprise data pipeline. A parsing error produces garbled text. A chunking error degrades retrieval quality. A PII redaction failure exposes personal data — triggering regulatory penalties, eroding customer trust, and creating legal liability.

    Despite these stakes, most teams select their redaction approach based on convenience rather than measured performance. Regex is fast to implement. NER models are easy to import. LLMs seem capable of everything. But how do these approaches actually compare on the metrics that matter — precision, recall, false positive rate, and throughput?

    This benchmark provides the answer.

    Approaches Tested

    We evaluated five PII redaction approaches, each representing a distinct technical strategy:

    Regex Patterns — deterministic pattern matching using regular expressions for structured PII formats (SSN, phone numbers, email addresses, credit card numbers). We used a production-grade regex library with 47 patterns covering US, UK, and EU PII formats.
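The 47-pattern production library is not reproduced here, but a minimal sketch of what the structured-PII pass looks like (patterns simplified for illustration; real libraries handle far more format variants):

```python
import re

# Illustrative subset of structured-PII patterns. The benchmark used a
# 47-pattern production library; these simplified stand-ins show the idea.
PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "PHONE": re.compile(r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]\d{3}[-.\s]?\d{4}\b"),
}

def find_structured_pii(text: str):
    """Return (label, start, end) for every structured-PII match."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.start(), m.end()))
    return sorted(hits, key=lambda h: h[1])

sample = "Contact jdoe@example.com or (212) 555-0142; SSN 123-45-6789."
```

Note the strength and the limitation in one place: the patterns are deterministic and cheap, but every format variant must be anticipated in advance.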

    spaCy NER (en_core_web_trf) — spaCy's transformer-based named entity recognition model, which identifies PERSON, ORG, GPE, DATE, and other entity types. We extended it with custom entity rulers for PII-specific patterns.

    Transformer NER (GLiNER) — a generalist NER model that accepts entity type descriptions at inference time, allowing zero-shot detection of arbitrary PII categories without fine-tuning. We tested with prompts for all 14 PII entity types.
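Zero-shot prompting with GLiNER means passing the entity types as plain-text labels at inference time. A rough sketch, assuming the open-source `gliner` package (the checkpoint path is a placeholder; resolve it against your model registry):

```python
# The 14 PII entity types from the benchmark corpus, expressed as the
# plain-text labels GLiNER accepts at inference time.
PII_LABELS = [
    "person name", "email address", "phone number", "social security number",
    "physical address", "date of birth", "credit card number",
    "medical record number", "ip address", "driver license number",
    "passport number", "bank account number", "case or file number",
    "biometric identifier",
]

def detect_pii(text: str, threshold: float = 0.5):
    # Import deferred so the label list is usable without the package installed.
    from gliner import GLiNER
    # Placeholder checkpoint name; the benchmark used gliner-large-v2.5.
    model = GLiNER.from_pretrained("urchade/gliner_large-v2.5")
    # Returns dicts with "text", "label", "start", "end", "score".
    return model.predict_entities(text, PII_LABELS, threshold=threshold)
```

No fine-tuning step exists in this flow; adding a new PII category is a one-line change to the label list.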

    LLM-Based (GPT-4 class) — using a frontier language model with a structured prompt specifying PII categories and requesting entity-level annotations. We tested with GPT-4o via API, acknowledging the irony of sending PII to a cloud API for redaction benchmarking. In production, this approach would use a locally-hosted LLM.
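The exact benchmark prompt is not shown, but the structured-prompting pattern looks roughly like this (schema and type names are illustrative, with the model reply stubbed out):

```python
import json

# Illustrative prompt; the benchmark's exact wording and schema differ.
PROMPT_TEMPLATE = (
    "Identify every PII entity in the text. Respond with JSON of the form "
    '{"entities": [{"text": "...", "type": "...", "start": 0, "end": 0}]}.\n'
    "Allowed types: person_name, email, phone, ssn, address, date_of_birth, "
    "credit_card, medical_record, ip_address, driver_license, passport, "
    "bank_account, case_number, biometric.\n\nText:\n"
)

def parse_llm_entities(raw_json: str):
    """Convert the model's JSON reply into (start, end, type) spans."""
    reply = json.loads(raw_json)
    return [(e["start"], e["end"], e["type"]) for e in reply["entities"]]

# Stubbed reply, as if returned for "Please forward this to Sarah ..."
stub = '{"entities": [{"text": "Sarah", "type": "person_name", ' \
       '"start": 23, "end": 28}]}'
spans = parse_llm_entities(stub)
```

In production the prompt would go to a locally hosted model; the parsing side is identical either way.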

    Hybrid Pipeline (Ertas) — a two-pass approach: regex patterns first for structured PII (SSN, phone, email, credit card), then transformer NER for contextual entities (names, addresses, medical terms, case numbers). The pipeline runs entirely on-premise with no cloud dependencies.
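The two-pass structure can be sketched as follows. Ertas's actual node internals are not public; this is a minimal illustration where any span-returning NER model plugs in as `ner_fn`:

```python
import re

# Pass 1 patterns: structured PII only (illustrative subset).
STRUCTURED = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}

def redact(text, ner_fn):
    """Two-pass redaction: regex for structured PII, then NER for contextual."""
    # Pass 1: deterministic regex substitution.
    for label, pattern in STRUCTURED.items():
        text = pattern.sub(f"[{label}]", text)
    # Pass 2: contextual entities, applied right-to-left so earlier
    # character offsets stay valid as the text changes length.
    for ent in sorted(ner_fn(text), key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['label']}]" + text[ent["end"]:]
    return text

def fake_ner(text):
    # Stand-in for the transformer NER pass.
    i = text.find("Sarah")
    return [{"start": i, "end": i + 5, "label": "PERSON"}] if i >= 0 else []

redact("Email Sarah at s.k@corp.com, SSN 123-45-6789.", fake_ner)
# -> "Email [PERSON] at [EMAIL], SSN [SSN]."
```

Running regex first also shrinks the text the NER model has to reason about, which is why the pass order matters.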

    Test Corpus

    We constructed a benchmark corpus of 10,000 PII instances across 14 entity types, embedded in 1,200 synthetic enterprise documents:

| Entity Type | Count | Examples |
| --- | --- | --- |
| Person Name | 1,500 | Full names, partial names, titles with names |
| Email Address | 800 | Standard, corporate, obfuscated |
| Phone Number | 800 | US, UK, international, extensions |
| SSN | 600 | Standard (XXX-XX-XXXX), no-dash, partial |
| Physical Address | 700 | Street, PO Box, apartment, international |
| Date of Birth | 500 | Multiple date formats |
| Credit Card | 400 | Visa, Mastercard, Amex, with/without spaces |
| Medical Record Number | 400 | Hospital-specific formats |
| IP Address | 300 | IPv4, IPv6, with context |
| Driver License | 300 | State-specific formats |
| Passport Number | 200 | US, UK, EU formats |
| Bank Account | 200 | Routing + account, IBAN |
| Case/File Number | 200 | Legal, medical, insurance |
| Biometric Identifier | 100 | Device IDs, enrollment references |

    Documents were designed to include both obvious PII (standalone fields) and contextual PII (embedded in narrative text, tables, and footnotes). This reflects real enterprise documents where PII appears in expected locations and in unexpected contexts like email signatures embedded in contract appendices.

    Ground truth was manually annotated by two independent reviewers with adjudication for disagreements.

    Benchmark Results

| Approach | Precision | Recall | F1 Score | Speed (docs/sec) | False Positive Rate |
| --- | --- | --- | --- | --- | --- |
| Regex Patterns | 99.1% | 72.4% | 83.9% | 145 | 0.9% |
| spaCy NER (en_core_web_trf) | 91.3% | 88.7% | 89.9% | 42 | 8.7% |
| Transformer NER (GLiNER) | 94.8% | 93.1% | 93.9% | 18 | 5.2% |
| LLM-Based (GPT-4 class) | 96.2% | 95.8% | 96.0% | 2.1 | 3.8% |
| Hybrid Pipeline (Ertas) | 97.4% | 96.1% | 96.7% | 28 | 2.6% |

    Detailed Analysis by Metric

    Precision: What You Flag, Is It Actually PII?

    Precision measures the percentage of flagged items that are genuinely PII. Low precision means your system is over-flagging, creating review burden and potentially redacting non-PII content that should be preserved.
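For reference, every metric in this benchmark reduces to simple counts over annotated entities:

```python
def prf(tp: int, fp: int, fn: int):
    """Entity-level precision, recall, and F1 from raw counts.

    tp: flagged spans that are genuine PII
    fp: flagged spans that are not PII
    fn: genuine PII the system missed
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 90 correct flags, 10 spurious flags, 10 missed entities
p, r, f1 = prf(tp=90, fp=10, fn=10)  # (0.9, 0.9, 0.9)
```

F1 is the harmonic mean of the two, so it punishes an approach that trades one metric away for the other.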

    Regex achieved the highest precision (99.1%) because pattern matching produces very few false positives — if something matches an SSN pattern, it almost certainly is an SSN. The rare false positives came from numbers that coincidentally match PII patterns (product codes in SSN format, for example).

    spaCy had the lowest precision (91.3%) and the highest false positive rate (8.7%). Its PERSON entity model frequently flagged organization names, product names, and location references as person names. "Washington" appearing as a city was regularly flagged as a person name. "Amazon Web Services" triggered both PERSON and ORG tags inconsistently.

    The hybrid pipeline achieved 97.4% precision by using regex for structured patterns (where precision is inherently high) and restricting transformer NER to entity types where regex falls short (names, addresses, contextual references). This division of labor keeps each approach in its strength zone.

    Recall: What PII Exists, Did You Catch It?

    Recall is the critical metric for compliance. Missed PII — false negatives — is the failure mode that triggers regulatory action.

    Regex recall was only 72.4%, the lowest of all approaches. It missed three major PII categories almost entirely:

    1. Person names — no regex pattern can reliably match the infinite variety of human names
    2. Physical addresses — address formats are too variable for deterministic pattern matching
    3. Contextual references — phrases like "the patient" or "my client Mr. Johnson" require understanding context, not pattern matching

LLM-based approaches achieved the highest recall of any single-method approach (95.8%) because language models understand context. They correctly identified PII in sentences like "Please forward this to Sarah at the downtown office" where "Sarah" is PII but no structured pattern matches.

    The hybrid pipeline achieved 96.1% recall — slightly above the LLM approach — because the regex pass catches structured patterns that transformer NER occasionally misses (SSNs without dashes, phone numbers with extensions), while the NER pass catches contextual entities that regex cannot match. The two passes are complementary rather than redundant.
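When both passes flag overlapping spans, the pipeline has to deduplicate. One simple policy (an assumption for illustration, not necessarily what Ertas ships) is to let the higher-precision regex span win:

```python
def merge_spans(regex_spans, ner_spans):
    """Union of two detection passes, regex winning on overlap.

    Spans are (start, end) half-open character offsets.
    """
    merged = list(regex_spans)
    for span in ner_spans:
        overlaps = any(span[0] < r_end and r_start < span[1]
                       for r_start, r_end in merged)
        if not overlaps:
            merged.append(span)
    return sorted(merged)

# Regex span (0, 11) suppresses the overlapping NER span (4, 15);
# the disjoint NER span (20, 25) survives.
merge_spans([(0, 11)], [(4, 15), (20, 25)])  # [(0, 11), (20, 25)]
```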

    Per-Entity-Type Breakdown

    The aggregate F1 scores mask significant variation across entity types:

| Entity Type | Regex F1 | spaCy F1 | GLiNER F1 | LLM F1 | Hybrid F1 |
| --- | --- | --- | --- | --- | --- |
| SSN | 99.2% | 82.1% | 94.3% | 97.8% | 99.4% |
| Email | 99.5% | 78.4% | 91.2% | 96.1% | 99.5% |
| Phone | 97.8% | 75.9% | 90.1% | 95.4% | 98.1% |
| Credit Card | 98.9% | 71.3% | 88.7% | 94.2% | 99.0% |
| Person Name | 0.0% | 93.8% | 95.7% | 97.2% | 95.7% |
| Address | 12.4% | 87.3% | 92.8% | 96.3% | 93.1% |
| Medical Record | 91.3% | 68.4% | 89.1% | 93.7% | 95.2% |
| Date of Birth | 78.2% | 84.1% | 91.4% | 95.9% | 94.8% |

    This breakdown reveals the fundamental tradeoff: regex dominates structured entities (SSN, email, phone, credit card) but completely fails on contextual entities (person names, addresses). NER models handle contextual entities well but underperform regex on structured patterns.

    The hybrid approach captures the strengths of both, achieving either the highest or near-highest F1 for every entity type.

    Speed and Throughput

    Processing speed determines whether a redaction approach is viable for production workloads. Enterprise data pipelines process thousands to millions of documents.

| Approach | Docs/sec | Time for 100K docs | GPU Required |
| --- | --- | --- | --- |
| Regex | 145 | 11.5 minutes | No |
| spaCy NER | 42 | 39.7 minutes | Recommended |
| GLiNER | 18 | 92.6 minutes | Yes |
| LLM (GPT-4 class) | 2.1 | 13.2 hours | Yes (or API) |
| Hybrid (Ertas) | 28 | 59.5 minutes | Recommended |

    The hybrid pipeline processes 28 documents per second — fast enough for batch processing of enterprise archives but not suitable for real-time, per-request redaction at high volume. The regex pass adds minimal latency; the transformer NER pass is the throughput bottleneck.

    LLM-based redaction at 2.1 docs/sec is impractical for large-scale batch processing. A 100,000-document archive would take over 13 hours. It is more suitable as a verification pass on a sample of already-redacted documents than as the primary redaction mechanism.

    False Positive Rate and Review Burden

False positives create operational cost. If the redaction pipeline includes a review step, every falsely flagged item must be reviewed by a human; if it does not, non-PII content is silently redacted.

    At spaCy's 8.7% false positive rate, processing 100,000 documents with an average of 15 flagged entities per document would produce approximately 130,500 false positives requiring review. At the hybrid pipeline's 2.6% rate, that number drops to 39,000 — a 70% reduction in review workload.
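The arithmetic behind those figures:

```python
docs = 100_000
flags_per_doc = 15                  # average flagged entities per document
total_flags = docs * flags_per_doc  # 1,500,000 flagged entities

fp_spacy  = total_flags * 0.087     # spaCy's FP rate  -> ~130,500 to review
fp_hybrid = total_flags * 0.026     # hybrid's FP rate -> ~39,000 to review
reduction = 1 - fp_hybrid / fp_spacy  # ~0.70, a 70% smaller review queue
```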

    For fully automated pipelines (no human review), false positives mean information loss. Redacting a product code that happens to match a phone number pattern, or redacting a city name that matches a person name, degrades document quality for downstream AI processing.

    The Case for Hybrid Approaches

    The benchmark data points clearly toward hybrid architectures as the production-optimal approach for PII redaction. The reasoning is straightforward:

    Structured PII is a solved problem. Regex handles SSNs, emails, phone numbers, and credit cards with near-perfect precision and recall. Using NER or LLMs for these entity types adds latency without improving accuracy.

    Contextual PII requires understanding. Names, addresses, and contextual references cannot be caught by pattern matching. Transformer NER provides the semantic understanding needed, with GLiNER and similar models achieving above 92% F1 on contextual entity types.

    The two passes are complementary. Regex catches what NER misses (non-standard SSN formats, phone extensions), and NER catches what regex cannot attempt (person names, contextual references). Running both passes in sequence produces a combined result that exceeds either approach alone.

    Ertas implements this hybrid approach in its PII Redactor node: the regex pass runs first (deterministic, fast, high-precision), then the transformer NER pass processes the remaining text for contextual entities. Both passes are visible as sub-steps in the pipeline, with per-entity confidence scores logged for audit purposes.

    Recommendations by Use Case

    Regulated industries (healthcare, finance, legal): Use a hybrid approach with human review sampling. Target recall above 96% and false positive rate below 3%. The cost of missed PII (regulatory penalties, breach notification) far exceeds the cost of review burden from false positives.

    Service providers delivering to enterprise clients: Use a hybrid approach with full audit logging. Clients in regulated industries will require evidence that PII redaction was performed systematically. Per-entity confidence scores and processing logs provide this evidence.

    Internal AI training data preparation: Use transformer NER (GLiNER or equivalent) if review budget is limited. Its 93.9% F1 with 5.2% false positive rate provides a reasonable accuracy-to-effort tradeoff for teams that cannot implement a full hybrid pipeline.

    Real-time redaction (per-request): Use regex only. At 145 docs/sec, regex is the only approach fast enough for real-time processing. Accept the 72.4% recall limitation and supplement with NER-based batch review on a regular schedule.

    Methodology Notes

    • All benchmarks were run on a single workstation (Intel i9-13900K, 64GB RAM, RTX 4090).
    • spaCy used the en_core_web_trf model (transformer-based, most accurate variant).
    • GLiNER used the gliner-large-v2.5 checkpoint.
    • LLM benchmarks used GPT-4o via API with structured output prompting. Latency includes API round-trip time.
    • The hybrid pipeline (Ertas) ran entirely locally with no API calls.
    • Entity types follow the NIST SP 800-188 de-identification framework, extended with medical and legal identifiers.
    • False positive rate is calculated as false positives divided by (true negatives plus false positives).

    For the full enterprise data pipeline benchmark including parsing, chunking, and embedding stages, see our comprehensive benchmark report.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
