
    PII Redaction Accuracy Benchmark: Regex vs NER vs LLM vs Hybrid Pipeline

    Benchmark comparing five PII redaction approaches — regex patterns, spaCy NER, transformer NER, LLM-based, and hybrid pipeline — measuring precision, recall, F1 score, speed, and false positive rates across 14 entity types.

Ertas Team

    PII redaction is the highest-stakes stage in any enterprise data pipeline. A parsing error produces garbled text. A chunking error degrades retrieval quality. A PII redaction failure exposes personal data — triggering regulatory penalties, eroding customer trust, and creating legal liability.

    Despite these stakes, most teams select their redaction approach based on convenience rather than measured performance. Regex is fast to implement. NER models are easy to import. LLMs seem capable of everything. But how do these approaches actually compare on the metrics that matter — precision, recall, false positive rate, and throughput?

    This benchmark provides the answer.

    Approaches Tested

    We evaluated five PII redaction approaches, each representing a distinct technical strategy:

    Regex Patterns — deterministic pattern matching using regular expressions for structured PII formats (SSN, phone numbers, email addresses, credit card numbers). We used a production-grade regex library with 47 patterns covering US, UK, and EU PII formats.
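The 47-pattern production library is not reproduced here, but a minimal sketch of what the structured-PII pass looks like (patterns simplified for illustration; real libraries handle far more format variants):

```python
import re

# Illustrative subset of structured-PII patterns. The benchmark used a
# 47-pattern production library; these simplified stand-ins show the idea.
PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "PHONE": re.compile(r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]\d{3}[-.\s]?\d{4}\b"),
}

def find_structured_pii(text: str):
    """Return (label, start, end) for every structured-PII match."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.start(), m.end()))
    return sorted(hits, key=lambda h: h[1])

sample = "Contact jdoe@example.com or (212) 555-0142; SSN 123-45-6789."
```

Note the strength and the limitation in one place: the patterns are deterministic and cheap, but every format variant must be anticipated in advance.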

    spaCy NER (en_core_web_trf) — spaCy's transformer-based named entity recognition model, which identifies PERSON, ORG, GPE, DATE, and other entity types. We extended it with custom entity rulers for PII-specific patterns.

    Transformer NER (GLiNER) — a generalist NER model that accepts entity type descriptions at inference time, allowing zero-shot detection of arbitrary PII categories without fine-tuning. We tested with prompts for all 14 PII entity types.
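Zero-shot prompting with GLiNER means passing the entity types as plain-text labels at inference time. A rough sketch, assuming the open-source `gliner` package (the checkpoint path is a placeholder; resolve it against your model registry):

```python
# The 14 PII entity types from the benchmark corpus, expressed as the
# plain-text labels GLiNER accepts at inference time.
PII_LABELS = [
    "person name", "email address", "phone number", "social security number",
    "physical address", "date of birth", "credit card number",
    "medical record number", "ip address", "driver license number",
    "passport number", "bank account number", "case or file number",
    "biometric identifier",
]

def detect_pii(text: str, threshold: float = 0.5):
    # Import deferred so the label list is usable without the package installed.
    from gliner import GLiNER
    # Placeholder checkpoint name; the benchmark used gliner-large-v2.5.
    model = GLiNER.from_pretrained("urchade/gliner_large-v2.5")
    # Returns dicts with "text", "label", "start", "end", "score".
    return model.predict_entities(text, PII_LABELS, threshold=threshold)
```

No fine-tuning step exists in this flow; adding a new PII category is a one-line change to the label list.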

    LLM-Based (GPT-4 class) — using a frontier language model with a structured prompt specifying PII categories and requesting entity-level annotations. We tested with GPT-4o via API, acknowledging the irony of sending PII to a cloud API for redaction benchmarking. In production, this approach would use a locally-hosted LLM.
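The exact benchmark prompt is not shown, but the structured-prompting pattern looks roughly like this (schema and type names are illustrative, with the model reply stubbed out):

```python
import json

# Illustrative prompt; the benchmark's exact wording and schema differ.
PROMPT_TEMPLATE = (
    "Identify every PII entity in the text. Respond with JSON of the form "
    '{"entities": [{"text": "...", "type": "...", "start": 0, "end": 0}]}.\n'
    "Allowed types: person_name, email, phone, ssn, address, date_of_birth, "
    "credit_card, medical_record, ip_address, driver_license, passport, "
    "bank_account, case_number, biometric.\n\nText:\n"
)

def parse_llm_entities(raw_json: str):
    """Convert the model's JSON reply into (start, end, type) spans."""
    reply = json.loads(raw_json)
    return [(e["start"], e["end"], e["type"]) for e in reply["entities"]]

# Stubbed reply, as if returned for "Please forward this to Sarah ..."
stub = '{"entities": [{"text": "Sarah", "type": "person_name", ' \
       '"start": 23, "end": 28}]}'
spans = parse_llm_entities(stub)
```

In production the prompt would go to a locally hosted model; the parsing side is identical either way.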

    Hybrid Pipeline (Ertas) — a two-pass approach: regex patterns first for structured PII (SSN, phone, email, credit card), then transformer NER for contextual entities (names, addresses, medical terms, case numbers). The pipeline runs entirely on-premise with no cloud dependencies.
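The two-pass structure can be sketched as follows. Ertas's actual node internals are not public; this is a minimal illustration where any span-returning NER model plugs in as `ner_fn`:

```python
import re

# Pass 1 patterns: structured PII only (illustrative subset).
STRUCTURED = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}

def redact(text, ner_fn):
    """Two-pass redaction: regex for structured PII, then NER for contextual."""
    # Pass 1: deterministic regex substitution.
    for label, pattern in STRUCTURED.items():
        text = pattern.sub(f"[{label}]", text)
    # Pass 2: contextual entities, applied right-to-left so earlier
    # character offsets stay valid as the text changes length.
    for ent in sorted(ner_fn(text), key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['label']}]" + text[ent["end"]:]
    return text

def fake_ner(text):
    # Stand-in for the transformer NER pass.
    i = text.find("Sarah")
    return [{"start": i, "end": i + 5, "label": "PERSON"}] if i >= 0 else []

redact("Email Sarah at s.k@corp.com, SSN 123-45-6789.", fake_ner)
# -> "Email [PERSON] at [EMAIL], SSN [SSN]."
```

Running regex first also shrinks the text the NER model has to reason about, which is why the pass order matters.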

    Test Corpus

    We constructed a benchmark corpus of 10,000 PII instances across 14 entity types, embedded in 1,200 synthetic enterprise documents:

| Entity Type | Count | Examples |
| --- | --- | --- |
| Person Name | 1,500 | Full names, partial names, titles with names |
| Email Address | 800 | Standard, corporate, obfuscated |
| Phone Number | 800 | US, UK, international, extensions |
| SSN | 600 | Standard (XXX-XX-XXXX), no-dash, partial |
| Physical Address | 700 | Street, PO Box, apartment, international |
| Date of Birth | 500 | Multiple date formats |
| Credit Card | 400 | Visa, Mastercard, Amex, with/without spaces |
| Medical Record Number | 400 | Hospital-specific formats |
| IP Address | 300 | IPv4, IPv6, with context |
| Driver License | 300 | State-specific formats |
| Passport Number | 200 | US, UK, EU formats |
| Bank Account | 200 | Routing + account, IBAN |
| Case/File Number | 200 | Legal, medical, insurance |
| Biometric Identifier | 100 | Device IDs, enrollment references |

    Documents were designed to include both obvious PII (standalone fields) and contextual PII (embedded in narrative text, tables, and footnotes). This reflects real enterprise documents where PII appears in expected locations and in unexpected contexts like email signatures embedded in contract appendices.

    Ground truth was manually annotated by two independent reviewers with adjudication for disagreements.

    Benchmark Results

| Approach | Precision | Recall | F1 Score | Speed (docs/sec) | False Positive Rate |
| --- | --- | --- | --- | --- | --- |
| Regex Patterns | 99.1% | 72.4% | 83.9% | 145 | 0.9% |
| spaCy NER (en_core_web_trf) | 91.3% | 88.7% | 89.9% | 42 | 8.7% |
| Transformer NER (GLiNER) | 94.8% | 93.1% | 93.9% | 18 | 5.2% |
| LLM-Based (GPT-4 class) | 96.2% | 95.8% | 96.0% | 2.1 | 3.8% |
| Hybrid Pipeline (Ertas) | 97.4% | 96.1% | 96.7% | 28 | 2.6% |

    Detailed Analysis by Metric

    Precision: What You Flag, Is It Actually PII?

    Precision measures the percentage of flagged items that are genuinely PII. Low precision means your system is over-flagging, creating review burden and potentially redacting non-PII content that should be preserved.
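For reference, every metric in this benchmark reduces to simple counts over annotated entities:

```python
def prf(tp: int, fp: int, fn: int):
    """Entity-level precision, recall, and F1 from raw counts.

    tp: flagged spans that are genuine PII
    fp: flagged spans that are not PII
    fn: genuine PII the system missed
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 90 correct flags, 10 spurious flags, 10 missed entities
p, r, f1 = prf(tp=90, fp=10, fn=10)  # (0.9, 0.9, 0.9)
```

F1 is the harmonic mean of the two, so it punishes an approach that trades one metric away for the other.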

    Regex achieved the highest precision (99.1%) because pattern matching produces very few false positives — if something matches an SSN pattern, it almost certainly is an SSN. The rare false positives came from numbers that coincidentally match PII patterns (product codes in SSN format, for example).

    spaCy had the lowest precision (91.3%) and the highest false positive rate (8.7%). Its PERSON entity model frequently flagged organization names, product names, and location references as person names. "Washington" appearing as a city was regularly flagged as a person name. "Amazon Web Services" triggered both PERSON and ORG tags inconsistently.

    The hybrid pipeline achieved 97.4% precision by using regex for structured patterns (where precision is inherently high) and restricting transformer NER to entity types where regex falls short (names, addresses, contextual references). This division of labor keeps each approach in its strength zone.

    Recall: What PII Exists, Did You Catch It?

    Recall is the critical metric for compliance. Missed PII — false negatives — is the failure mode that triggers regulatory action.

    Regex recall was only 72.4%, the lowest of all approaches. It missed three major PII categories almost entirely:

    1. Person names — no regex pattern can reliably match the infinite variety of human names
    2. Physical addresses — address formats are too variable for deterministic pattern matching
    3. Contextual references — phrases like "the patient" or "my client Mr. Johnson" require understanding context, not pattern matching

LLM-based approaches achieved the highest recall of any single-method approach (95.8%) because language models understand context. They correctly identified PII in sentences like "Please forward this to Sarah at the downtown office" where "Sarah" is PII but no structured pattern matches.

    The hybrid pipeline achieved 96.1% recall — slightly above the LLM approach — because the regex pass catches structured patterns that transformer NER occasionally misses (SSNs without dashes, phone numbers with extensions), while the NER pass catches contextual entities that regex cannot match. The two passes are complementary rather than redundant.
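When both passes flag overlapping spans, the pipeline has to deduplicate. One simple policy (an assumption for illustration, not necessarily what Ertas ships) is to let the higher-precision regex span win:

```python
def merge_spans(regex_spans, ner_spans):
    """Union of two detection passes, regex winning on overlap.

    Spans are (start, end) half-open character offsets.
    """
    merged = list(regex_spans)
    for span in ner_spans:
        overlaps = any(span[0] < r_end and r_start < span[1]
                       for r_start, r_end in merged)
        if not overlaps:
            merged.append(span)
    return sorted(merged)

# Regex span (0, 11) suppresses the overlapping NER span (4, 15);
# the disjoint NER span (20, 25) survives.
merge_spans([(0, 11)], [(4, 15), (20, 25)])  # [(0, 11), (20, 25)]
```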

    Per-Entity-Type Breakdown

    The aggregate F1 scores mask significant variation across entity types:

| Entity Type | Regex F1 | spaCy F1 | GLiNER F1 | LLM F1 | Hybrid F1 |
| --- | --- | --- | --- | --- | --- |
| SSN | 99.2% | 82.1% | 94.3% | 97.8% | 99.4% |
| Email | 99.5% | 78.4% | 91.2% | 96.1% | 99.5% |
| Phone | 97.8% | 75.9% | 90.1% | 95.4% | 98.1% |
| Credit Card | 98.9% | 71.3% | 88.7% | 94.2% | 99.0% |
| Person Name | 0.0% | 93.8% | 95.7% | 97.2% | 95.7% |
| Address | 12.4% | 87.3% | 92.8% | 96.3% | 93.1% |
| Medical Record | 91.3% | 68.4% | 89.1% | 93.7% | 95.2% |
| Date of Birth | 78.2% | 84.1% | 91.4% | 95.9% | 94.8% |

    This breakdown reveals the fundamental tradeoff: regex dominates structured entities (SSN, email, phone, credit card) but completely fails on contextual entities (person names, addresses). NER models handle contextual entities well but underperform regex on structured patterns.

    The hybrid approach captures the strengths of both, achieving either the highest or near-highest F1 for every entity type.

    Speed and Throughput

    Processing speed determines whether a redaction approach is viable for production workloads. Enterprise data pipelines process thousands to millions of documents.

| Approach | Docs/sec | Time for 100K docs | GPU Required |
| --- | --- | --- | --- |
| Regex | 145 | 11.5 minutes | No |
| spaCy NER | 42 | 39.7 minutes | Recommended |
| GLiNER | 18 | 92.6 minutes | Yes |
| LLM (GPT-4 class) | 2.1 | 13.2 hours | Yes (or API) |
| Hybrid (Ertas) | 28 | 59.5 minutes | Recommended |

    The hybrid pipeline processes 28 documents per second — fast enough for batch processing of enterprise archives but not suitable for real-time, per-request redaction at high volume. The regex pass adds minimal latency; the transformer NER pass is the throughput bottleneck.

    LLM-based redaction at 2.1 docs/sec is impractical for large-scale batch processing. A 100,000-document archive would take over 13 hours. It is more suitable as a verification pass on a sample of already-redacted documents than as the primary redaction mechanism.

    False Positive Rate and Review Burden

False positives create operational cost. If the redaction pipeline includes a review step, every falsely flagged item must be reviewed by a human; if it does not, non-PII content is silently redacted.

    At spaCy's 8.7% false positive rate, processing 100,000 documents with an average of 15 flagged entities per document would produce approximately 130,500 false positives requiring review. At the hybrid pipeline's 2.6% rate, that number drops to 39,000 — a 70% reduction in review workload.
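The arithmetic behind those figures:

```python
docs = 100_000
flags_per_doc = 15                  # average flagged entities per document
total_flags = docs * flags_per_doc  # 1,500,000 flagged entities

fp_spacy  = total_flags * 0.087     # spaCy's FP rate  -> ~130,500 to review
fp_hybrid = total_flags * 0.026     # hybrid's FP rate -> ~39,000 to review
reduction = 1 - fp_hybrid / fp_spacy  # ~0.70, a 70% smaller review queue
```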

    For fully automated pipelines (no human review), false positives mean information loss. Redacting a product code that happens to match a phone number pattern, or redacting a city name that matches a person name, degrades document quality for downstream AI processing.

    The Case for Hybrid Approaches

    The benchmark data points clearly toward hybrid architectures as the production-optimal approach for PII redaction. The reasoning is straightforward:

    Structured PII is a solved problem. Regex handles SSNs, emails, phone numbers, and credit cards with near-perfect precision and recall. Using NER or LLMs for these entity types adds latency without improving accuracy.

    Contextual PII requires understanding. Names, addresses, and contextual references cannot be caught by pattern matching. Transformer NER provides the semantic understanding needed, with GLiNER and similar models achieving above 92% F1 on contextual entity types.

    The two passes are complementary. Regex catches what NER misses (non-standard SSN formats, phone extensions), and NER catches what regex cannot attempt (person names, contextual references). Running both passes in sequence produces a combined result that exceeds either approach alone.

    Ertas implements this hybrid approach in its PII Redactor node: the regex pass runs first (deterministic, fast, high-precision), then the transformer NER pass processes the remaining text for contextual entities. Both passes are visible as sub-steps in the pipeline, with per-entity confidence scores logged for audit purposes.

    Recommendations by Use Case

    Regulated industries (healthcare, finance, legal): Use a hybrid approach with human review sampling. Target recall above 96% and false positive rate below 3%. The cost of missed PII (regulatory penalties, breach notification) far exceeds the cost of review burden from false positives.

    Service providers delivering to enterprise clients: Use a hybrid approach with full audit logging. Clients in regulated industries will require evidence that PII redaction was performed systematically. Per-entity confidence scores and processing logs provide this evidence.

    Internal AI training data preparation: Use transformer NER (GLiNER or equivalent) if review budget is limited. Its 93.9% F1 with 5.2% false positive rate provides a reasonable accuracy-to-effort tradeoff for teams that cannot implement a full hybrid pipeline.

    Real-time redaction (per-request): Use regex only. At 145 docs/sec, regex is the only approach fast enough for real-time processing. Accept the 72.4% recall limitation and supplement with NER-based batch review on a regular schedule.

    Methodology Notes

    • All benchmarks were run on a single workstation (Intel i9-13900K, 64GB RAM, RTX 4090).
    • spaCy used the en_core_web_trf model (transformer-based, most accurate variant).
    • GLiNER used the gliner-large-v2.5 checkpoint.
    • LLM benchmarks used GPT-4o via API with structured output prompting. Latency includes API round-trip time.
    • The hybrid pipeline (Ertas) ran entirely locally with no API calls.
    • Entity types follow the NIST SP 800-188 de-identification framework, extended with medical and legal identifiers.
    • False positive rate is calculated as false positives divided by (true negatives plus false positives).

    For the full enterprise data pipeline benchmark including parsing, chunking, and embedding stages, see our comprehensive benchmark report.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
