
PII Redaction Accuracy Benchmark: Regex vs NER vs LLM vs Hybrid Pipeline
Benchmark comparing five PII redaction approaches — regex patterns, spaCy NER, transformer NER, LLM-based, and hybrid pipeline — measuring precision, recall, F1 score, speed, and false positive rates across 14 entity types.
PII redaction is the highest-stakes stage in any enterprise data pipeline. A parsing error produces garbled text. A chunking error degrades retrieval quality. A PII redaction failure exposes personal data — triggering regulatory penalties, eroding customer trust, and creating legal liability.
Despite these stakes, most teams select their redaction approach based on convenience rather than measured performance. Regex is fast to implement. NER models are easy to import. LLMs seem capable of everything. But how do these approaches actually compare on the metrics that matter — precision, recall, false positive rate, and throughput?
This benchmark provides the answer.
Approaches Tested
We evaluated five PII redaction approaches, each representing a distinct technical strategy:
Regex Patterns — deterministic pattern matching using regular expressions for structured PII formats (SSN, phone numbers, email addresses, credit card numbers). We used a production-grade regex library with 47 patterns covering US, UK, and EU PII formats.
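The pattern library itself isn't reproduced here, but the shape of the approach is simple. A minimal sketch with a few illustrative patterns (not the 47 used in the benchmark):
```python
import re

# Illustrative patterns only -- the benchmark used a 47-pattern library
# covering US, UK, and EU formats. These cover common US variants.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE_US": re.compile(r"\b(?:\+1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def regex_pass(text: str):
    """Return (start, end, label) spans for every structured-PII match."""
    spans = []
    for label, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    return spans
```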
spaCy NER (en_core_web_trf) — spaCy's transformer-based named entity recognition model, which identifies PERSON, ORG, GPE, DATE, and other entity types. We extended it with custom entity rulers for PII-specific patterns.
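The extension mechanism is spaCy's EntityRuler, which layers deterministic patterns ahead of the statistical NER. A sketch with one illustrative pattern (the benchmark's actual ruler patterns are not reproduced here):
```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")

# Layer deterministic PII rules ahead of the statistical NER so they
# take precedence over model predictions.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    # Illustrative pattern: dashed SSNs (e.g. 123-45-6789). spaCy tokenizes
    # the dashes separately, so we match a five-token shape sequence.
    {"label": "SSN", "pattern": [
        {"SHAPE": "ddd"}, {"ORTH": "-"},
        {"SHAPE": "dd"}, {"ORTH": "-"},
        {"SHAPE": "dddd"},
    ]},
])

doc = nlp("Contact John Smith, SSN 123-45-6789, at the Boston office.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```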
Transformer NER (GLiNER) — a generalist NER model that accepts entity type descriptions at inference time, allowing zero-shot detection of arbitrary PII categories without fine-tuning. We tested with prompts for all 14 PII entity types.
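In code, the zero-shot interface looks like this (the exact hub path for the checkpoint is an assumption on our part, and the label wording is illustrative):
```python
from gliner import GLiNER  # pip install gliner

# Checkpoint per the methodology notes; the exact hub path is assumed here.
model = GLiNER.from_pretrained("gliner-community/gliner_large-v2.5")

# Zero-shot: entity types are supplied as plain-language labels at inference
# time -- no fine-tuning required. (Subset of the 14 benchmark types.)
labels = [
    "person name", "email address", "phone number",
    "social security number", "physical address",
    "medical record number", "date of birth",
]

text = "Patient Jane Doe (MRN 00-482-119) can be reached at 555-867-5309."
for ent in model.predict_entities(text, labels, threshold=0.5):
    print(f'{ent["text"]!r} -> {ent["label"]} ({ent["score"]:.2f})')
```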
LLM-Based (GPT-4 class) — using a frontier language model with a structured prompt specifying PII categories and requesting entity-level annotations. We tested with GPT-4o via API, acknowledging the irony of sending PII to a cloud API for redaction benchmarking. In production, this approach would use a locally-hosted LLM.
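A sketch of the prompting setup using the OpenAI Python client (the prompt wording and output schema are illustrative, not the benchmark's actual prompt):
```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt and schema, not the benchmark's actual prompt.
PROMPT = """You are a PII detection system. List every PII entity in the
document below. Categories: person name, email address, phone number, SSN,
physical address, date of birth, credit card, medical record number,
IP address, driver license, passport number, bank account, case/file number,
biometric identifier.

Respond with JSON: {"entities": [{"text": "...", "category": "..."}]}

Document:
"""

def llm_pass(document: str) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT + document}],
        response_format={"type": "json_object"},  # force parseable JSON
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)["entities"]
```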
Hybrid Pipeline (Ertas) — a two-pass approach: regex patterns first for structured PII (SSN, phone, email, credit card), then transformer NER for contextual entities (names, addresses, medical terms, case numbers). The pipeline runs entirely on-premise with no cloud dependencies.
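The two-pass structure is easy to picture in code. A simplified sketch of how such a pipeline can be wired together (our illustration, not Ertas's actual implementation; `regex_pass` is the function sketched above and `ner_pass` stands in for any transformer NER call that returns spans):
```python
def hybrid_redact(text: str, regex_pass, ner_pass) -> str:
    """Two-pass redaction: deterministic regex first, transformer NER second."""
    # Pass 1: structured PII (SSN, phone, email, credit card).
    spans = list(regex_pass(text))  # assumed non-overlapping among themselves

    # Pass 2: contextual PII (names, addresses, medical terms, case numbers).
    # Spans already claimed by the high-precision regex pass take priority,
    # so each character is redacted at most once.
    claimed = [(s, e) for s, e, _ in spans]
    for s, e, label in ner_pass(text):
        if not any(s < ce and e > cs for cs, ce in claimed):
            spans.append((s, e, label))

    # Replace right-to-left so earlier character offsets stay valid.
    for s, e, label in sorted(spans, reverse=True):
        text = text[:s] + f"[{label}]" + text[e:]
    return text
```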
Test Corpus
We constructed a benchmark corpus of 10,000 PII instances across 14 entity types, embedded in 1,200 synthetic enterprise documents:
| Entity Type | Count | Examples |
|---|---|---|
| Person Name | 1,500 | Full names, partial names, titles with names |
| Email Address | 800 | Standard, corporate, obfuscated |
| Phone Number | 800 | US, UK, international, extensions |
| SSN | 600 | Standard (XXX-XX-XXXX), no-dash, partial |
| Physical Address | 700 | Street, PO Box, apartment, international |
| Date of Birth | 500 | Multiple date formats |
| Credit Card | 400 | Visa, Mastercard, Amex, with/without spaces |
| Medical Record Number | 400 | Hospital-specific formats |
| IP Address | 300 | IPv4, IPv6, with context |
| Driver License | 300 | State-specific formats |
| Passport Number | 200 | US, UK, EU formats |
| Bank Account | 200 | Routing + account, IBAN |
| Case/File Number | 200 | Legal, medical, insurance |
| Biometric Identifier | 100 | Device IDs, enrollment references |
Documents were designed to include both obvious PII (standalone fields) and contextual PII (embedded in narrative text, tables, and footnotes). This reflects real enterprise documents where PII appears in expected locations and in unexpected contexts like email signatures embedded in contract appendices.
Ground truth was manually annotated by two independent reviewers with adjudication for disagreements.
Benchmark Results
| Approach | Precision | Recall | F1 Score | Speed (docs/sec) | False Positive Rate |
|---|---|---|---|---|---|
| Regex Patterns | 99.1% | 72.4% | 83.9% | 145 | 0.9% |
| spaCy NER (en_core_web_trf) | 91.3% | 88.7% | 89.9% | 42 | 8.7% |
| Transformer NER (GLiNER) | 94.8% | 93.1% | 93.9% | 18 | 5.2% |
| LLM-Based (GPT-4 class) | 96.2% | 95.8% | 96.0% | 2.1 | 3.8% |
| Hybrid Pipeline (Ertas) | 97.4% | 96.1% | 96.7% | 28 | 2.6% |
Detailed Analysis by Metric
Precision: What You Flag, Is It Actually PII?
Precision measures the percentage of flagged items that are genuinely PII. Low precision means your system is over-flagging, creating review burden and potentially redacting non-PII content that should be preserved.
Regex achieved the highest precision (99.1%) because pattern matching produces very few false positives — if something matches an SSN pattern, it almost certainly is an SSN. The rare false positives came from numbers that coincidentally match PII patterns (product codes in SSN format, for example).
spaCy had the lowest precision (91.3%) and the highest false positive rate (8.7%). Its PERSON entity model frequently flagged organization names, product names, and location references as person names. "Washington" appearing as a city was regularly flagged as a person name. "Amazon Web Services" triggered both PERSON and ORG tags inconsistently.
The hybrid pipeline achieved 97.4% precision by using regex for structured patterns (where precision is inherently high) and restricting transformer NER to entity types where regex falls short (names, addresses, contextual references). This division of labor keeps each approach in its strength zone.
Recall: What PII Exists, Did You Catch It?
Recall is the critical metric for compliance. Missed PII — false negatives — is the failure mode that triggers regulatory action.
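Both metrics, together with F1, reduce to simple entity-level counts. A quick sketch with illustrative numbers (not the benchmark's raw counts):
```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Entity-level precision, recall, and F1."""
    precision = tp / (tp + fp)  # of everything flagged, how much is PII?
    recall = tp / (tp + fn)     # of all PII present, how much was flagged?
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts, not the benchmark's raw numbers:
# 9,500 entities flagged, 8,700 of them real PII, 1,100 real instances missed.
p, r, f = prf(tp=8_700, fp=800, fn=1_100)
print(f"precision={p:.1%}  recall={r:.1%}  f1={f:.1%}")
```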
Regex recall was only 72.4%, the lowest of all approaches. It missed three major PII categories almost entirely:
- Person names — no regex pattern can reliably match the infinite variety of human names
- Physical addresses — address formats are too variable for deterministic pattern matching
- Contextual references — phrases like "the patient" or "my client Mr. Johnson" require understanding context, not pattern matching
LLM-based approaches achieved the highest recall (95.8%) because language models understand context. They correctly identified PII in sentences like "Please forward this to Sarah at the downtown office" where "Sarah" is PII but no structured pattern matches.
The hybrid pipeline achieved 96.1% recall — slightly above the LLM approach — because the regex pass catches structured patterns that transformer NER occasionally misses (SSNs without dashes, phone numbers with extensions), while the NER pass catches contextual entities that regex cannot match. The two passes are complementary rather than redundant.
Per-Entity-Type Breakdown
The aggregate F1 scores mask significant variation across entity types:
| Entity Type | Regex F1 | spaCy F1 | GLiNER F1 | LLM F1 | Hybrid F1 |
|---|---|---|---|---|---|
| SSN | 99.2% | 82.1% | 94.3% | 97.8% | 99.4% |
| Email | 99.5% | 78.4% | 91.2% | 96.1% | 99.5% |
| Phone | 97.8% | 75.9% | 90.1% | 95.4% | 98.1% |
| Credit Card | 98.9% | 71.3% | 88.7% | 94.2% | 99.0% |
| Person Name | 0.0% | 93.8% | 95.7% | 97.2% | 95.7% |
| Address | 12.4% | 87.3% | 92.8% | 96.3% | 93.1% |
| Medical Record | 91.3% | 68.4% | 89.1% | 93.7% | 95.2% |
| Date of Birth | 78.2% | 84.1% | 91.4% | 95.9% | 94.8% |
This breakdown reveals the fundamental tradeoff: regex dominates structured entities (SSN, email, phone, credit card) but completely fails on contextual entities (person names, addresses). NER models handle contextual entities well but underperform regex on structured patterns.
The hybrid approach captures the strengths of both, achieving either the highest or near-highest F1 for every entity type.
Speed and Throughput
Processing speed determines whether a redaction approach is viable for production workloads. Enterprise data pipelines process thousands to millions of documents.
| Approach | Docs/sec | Time for 100K docs | GPU Required |
|---|---|---|---|
| Regex | 145 | 11.5 minutes | No |
| spaCy NER | 42 | 39.7 minutes | Recommended |
| GLiNER | 18 | 92.6 minutes | Yes |
| LLM (GPT-4 class) | 2.1 | 13.2 hours | Yes (or API) |
| Hybrid (Ertas) | 28 | 59.5 minutes | Recommended |
The hybrid pipeline processes 28 documents per second — fast enough for batch processing of enterprise archives but not suitable for real-time, per-request redaction at high volume. The regex pass adds minimal latency; the transformer NER pass is the throughput bottleneck.
LLM-based redaction at 2.1 docs/sec is impractical for large-scale batch processing. A 100,000-document archive would take over 13 hours. It is more suitable as a verification pass on a sample of already-redacted documents than as the primary redaction mechanism.
False Positive Rate and Review Burden
False positives create operational cost. If the redaction pipeline includes a human review step, every falsely flagged item consumes reviewer time; if it does not, every false positive silently redacts content that should have been preserved.
At spaCy's 8.7% false positive rate, processing 100,000 documents with an average of 15 flagged entities per document would produce approximately 130,500 false positives requiring review. At the hybrid pipeline's 2.6% rate, that number drops to 39,000 — a 70% reduction in review workload.
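These review-burden figures follow directly from the rates:
```python
docs, flagged_per_doc = 100_000, 15  # corpus size and average flags per doc
for approach, fp_rate in [("spaCy NER", 0.087), ("Hybrid", 0.026)]:
    false_positives = docs * flagged_per_doc * fp_rate
    print(f"{approach}: ~{false_positives:,.0f} false positives to review")
# spaCy NER: ~130,500   Hybrid: ~39,000
```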
For fully automated pipelines (no human review), false positives mean information loss. Redacting a product code that happens to match a phone number pattern, or redacting a city name that matches a person name, degrades document quality for downstream AI processing.
The Case for Hybrid Approaches
The benchmark data points clearly toward hybrid architectures as the production-optimal approach for PII redaction. The reasoning is straightforward:
Structured PII is a solved problem. Regex handles SSNs, emails, phone numbers, and credit cards with near-perfect precision and recall. Using NER or LLMs for these entity types adds latency without improving accuracy.
Contextual PII requires understanding. Names, addresses, and contextual references cannot be caught by pattern matching. Transformer NER provides the semantic understanding needed, with GLiNER and similar models achieving above 92% F1 on contextual entity types.
The two passes are complementary. Regex catches what NER misses (non-standard SSN formats, phone extensions), and NER catches what regex cannot attempt (person names, contextual references). Running both passes in sequence produces a combined result that exceeds either approach alone.
Ertas implements this hybrid approach in its PII Redactor node: the regex pass runs first (deterministic, fast, high-precision), then the transformer NER pass processes the remaining text for contextual entities. Both passes are visible as sub-steps in the pipeline, with per-entity confidence scores logged for audit purposes.
Recommendations by Use Case
Regulated industries (healthcare, finance, legal): Use a hybrid approach with human review sampling. Target recall above 96% and false positive rate below 3%. The cost of missed PII (regulatory penalties, breach notification) far exceeds the cost of review burden from false positives.
Service providers delivering to enterprise clients: Use a hybrid approach with full audit logging. Clients in regulated industries will require evidence that PII redaction was performed systematically. Per-entity confidence scores and processing logs provide this evidence.
Internal AI training data preparation: Use transformer NER (GLiNER or equivalent) if review budget is limited. Its 93.9% F1 with 5.2% false positive rate provides a reasonable accuracy-to-effort tradeoff for teams that cannot implement a full hybrid pipeline.
Real-time redaction (per-request): Use regex only. At 145 docs/sec, regex is the only approach fast enough for real-time processing. Accept the 72.4% recall limitation and supplement with NER-based batch review on a regular schedule.
Methodology Notes
- All benchmarks were run on a single workstation (Intel i9-13900K, 64GB RAM, RTX 4090).
- spaCy used the en_core_web_trf model (transformer-based, most accurate variant).
- GLiNER used the gliner-large-v2.5 checkpoint.
- LLM benchmarks used GPT-4o via API with structured output prompting. Latency includes API round-trip time.
- The hybrid pipeline (Ertas) ran entirely locally with no API calls.
- Entity types follow the NIST SP 800-188 de-identification framework, extended with medical and legal identifiers.
- False positive rate is calculated as false positives divided by total flagged entities (false positives plus true positives), i.e., the complement of precision.
For the full enterprise data pipeline benchmark including parsing, chunking, and embedding stages, see our comprehensive benchmark report.