
    Enterprise Data Pipeline Benchmark Report 2026: Parsing, Redaction, Chunking, and Embedding Compared

    A comprehensive benchmark comparing enterprise data pipeline approaches across document parsing accuracy, PII redaction reliability, chunking strategies, and embedding throughput — with methodology, results, and key findings for ML engineering teams.

    Ertas Team

    Enterprise AI teams spend 60 to 80 percent of project time on data preparation. The tooling landscape for each stage of the pipeline — parsing, redaction, chunking, and embedding — has matured significantly, but there is no single reference that benchmarks these stages together as an integrated workflow.

    This report fills that gap. We evaluated leading tools and approaches across four pipeline stages using standardized document corpora, measuring accuracy, throughput, and failure modes that matter in production environments.

    Methodology

    We tested each pipeline stage independently and then as integrated pipelines. The test corpus consisted of:

    • 500 enterprise PDFs spanning financial reports, legal contracts, medical records, and technical documentation
    • 200 scanned documents with varying quality (300 DPI clean scans to 150 DPI degraded copies)
    • 150 multi-format document sets (Word, PowerPoint, Excel, HTML) from real-world enterprise archives
    • 10,000 synthetic PII records across 14 entity types (SSN, email, phone, address, medical ID, etc.)

    All benchmarks were run on a single workstation (Intel i9-13900K, 64GB RAM, NVIDIA RTX 4090) to provide a consistent baseline. Throughput numbers reflect single-machine performance, not distributed processing.

    Stage 1: Document Parsing

    Document parsing converts raw files into structured text suitable for downstream AI processing. We evaluated four approaches.

    Parsing Benchmark Results

    | Tool | Table Extraction | Multi-Column | Scanned PDF (OCR) | Header/Footer Removal | Speed (pages/sec) | License |
    |---|---|---|---|---|---|---|
    | Docling (IBM) | 97.9% | 94.2% | 89.1% | 91.3% | 3.2 | MIT |
    | Unstructured.io | 93.4% | 91.8% | 86.7% | 88.5% | 4.8 | Apache 2.0 |
    | Marker (Datalab) | 91.7% | 96.1% | 84.3% | 85.9% | 6.1 | GPL-3.0 |
    | Visual Pipeline (Ertas) | 97.9% | 94.2% | 91.4% | 93.7% | 2.9 | Proprietary |

    Key findings:

    • Docling leads in table extraction accuracy at 97.9%, confirmed by IBM Research's published benchmarks on the DocLayNet dataset. Ertas integrates Docling as its PDF parsing engine, inheriting this accuracy while adding pre- and post-processing nodes for header/footer removal and quality scoring (a basic usage sketch follows this list).
    • Marker is the fastest parser but trades accuracy for speed, particularly on scanned documents where OCR quality degrades.
    • Unstructured.io provides the broadest file format support (64+ types) but its table extraction accuracy falls behind Docling by roughly 4.5 percentage points.
    • Scanned PDF accuracy is the most variable metric across all tools. OCR quality depends heavily on scan resolution, and no tool consistently exceeds 92% accuracy on degraded scans below 200 DPI.
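
    As a point of reference, the sketch below shows roughly how a parsing stage like this is driven from Python using Docling's converter, following its published quickstart. The input path is a placeholder, attribute names may vary slightly across Docling versions, and this is an illustration rather than our benchmark harness.

```python
# Minimal Docling parsing sketch (quickstart-style usage, not the benchmark
# harness). Converts a PDF and exports structured text as Markdown.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("annual_report.pdf")  # placeholder input path

# Structured export suitable for downstream chunking and embedding.
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])
```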

    Where Parsing Fails

    The most common parsing failures across all tools were:

    1. Nested tables — tables within tables caused extraction errors in 15 to 30 percent of cases across all tools
    2. Rotated text and watermarks — all tools struggled with text at non-standard orientations
    3. Form fields in scanned PDFs — checkbox and radio button extraction was unreliable across the board

    Stage 2: PII Redaction

    PII redaction is the compliance-critical stage. We tested five approaches against a corpus of 10,000 annotated PII instances.

    Redaction Benchmark Results

    | Approach | Precision | Recall | F1 Score | Speed (docs/sec) | False Positive Rate |
    |---|---|---|---|---|---|
    | Regex Patterns | 99.1% | 72.4% | 83.9% | 145 | 0.9% |
    | spaCy NER (en_core_web_trf) | 91.3% | 88.7% | 89.9% | 42 | 8.7% |
    | Transformer NER (GLiNER) | 94.8% | 93.1% | 93.9% | 18 | 5.2% |
    | LLM-Based (GPT-4 class) | 96.2% | 95.8% | 96.0% | 2.1 | 3.8% |
    | Hybrid Pipeline (Ertas) | 97.4% | 96.1% | 96.7% | 28 | 2.6% |

    Key findings:

    • Regex is the fastest and most precise approach, but its recall is unacceptably low for enterprise use — it misses nearly 28% of PII instances, primarily names, contextual references, and non-standard formats.
    • LLM-based redaction achieves the highest accuracy of any single-technique approach, but it is roughly 9x slower than transformer NER and introduces data egress concerns when using cloud-hosted models.
    • Hybrid approaches that combine regex for structured patterns (SSN, phone, email) with transformer NER for contextual entities (names, addresses, medical terms) achieve the best balance of accuracy and throughput. Ertas uses this hybrid approach, running deterministic regex first, then transformer NER for remaining entity types; a simplified sketch of the pattern follows this list.
    • False positive rates matter in production. An 8.7% false positive rate (spaCy) means nearly one in eleven flagged items is not actually PII, creating review burden for compliance teams.
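
    To make the hybrid pattern concrete, here is a deliberately simplified sketch of a regex-then-NER redactor. It is not Ertas's implementation: the patterns cover only the most common formats, spaCy's en_core_web_trf stands in for the transformer NER stage, and which entity labels count as PII is a policy decision.

```python
import re

import spacy

# Structured patterns handled deterministically first (deliberately simplified;
# production patterns need many more formats and validation rules).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}

# Contextual entities handled by transformer NER
# (requires: python -m spacy download en_core_web_trf).
nlp = spacy.load("en_core_web_trf")
PII_LABELS = {"PERSON", "GPE", "LOC", "ORG"}  # which labels count as PII is a policy choice


def redact(text: str) -> str:
    """Regex pass for structured PII, then an NER pass for contextual entities."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)

    doc = nlp(text)
    # Replace from the end of the string so earlier character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in PII_LABELS:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return text


print(redact("Contact Jane Doe at jane.doe@example.com, 555-123-4567, SSN 123-45-6789."))
```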

    For a detailed breakdown of each redaction approach, see our companion article on PII redaction accuracy benchmarks.

    Stage 3: Chunking Strategies

    Chunking determines how parsed documents are split for embedding and retrieval. We evaluated four strategies on a RAG retrieval benchmark using 500 enterprise documents with 2,000 manually annotated question-answer pairs.

    Chunking Benchmark Results

    | Strategy | Retrieval Accuracy (Top-5) | Avg Chunk Size | Context Coherence | Implementation Complexity |
    |---|---|---|---|---|
    | Fixed-Size (512 tokens) | 71.3% | 512 tokens | Low | Trivial |
    | Recursive Character | 78.9% | 380 tokens | Medium | Low |
    | Semantic (embedding-based) | 84.2% | 290 tokens | High | Medium |
    | Document-Aware (heading + semantic) | 87.6% | 340 tokens | High | High |

    Key findings:

    • Fixed-size chunking remains common in production systems but consistently underperforms other approaches. It splits mid-sentence and mid-paragraph, destroying context that retrieval depends on.
    • Semantic chunking (splitting at points where embedding similarity drops) improves retrieval accuracy by 13 percentage points over fixed-size, but requires an embedding pass during chunking — adding computational overhead.
    • Document-aware chunking that respects document structure (headings, sections, list boundaries) and then applies semantic splitting within sections achieves the highest retrieval accuracy. Ertas's RAG Chunker node implements this approach, using parsed document structure from the upstream parser node; a simplified semantic-splitting sketch follows this list.
    • Overlap matters. Adding 10 to 15 percent token overlap between chunks improved retrieval accuracy by 3 to 5 percentage points across all strategies, at the cost of increased index size.
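
    The semantic-splitting sketch referenced above: split text into sentences, embed them, and start a new chunk wherever similarity between adjacent sentences drops below a threshold. The model name and threshold are illustrative assumptions; a production document-aware chunker would also respect headings, enforce chunk-size limits, and add overlap.

```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model chosen only for illustration


def semantic_chunks(text: str, similarity_threshold: float = 0.6) -> list[str]:
    """Start a new chunk wherever adjacent-sentence embedding similarity drops."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) < 2:
        return sentences

    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for prev_emb, emb, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        # With normalized vectors, cosine similarity is a plain dot product.
        if float(np.dot(prev_emb, emb)) < similarity_threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```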

    Stage 4: Embedding Throughput

    Embedding converts text chunks into vectors for similarity search. We benchmarked common embedding models on throughput and retrieval quality.

    Embedding Benchmark Results

    | Model | Dimensions | MTEB Score | Throughput (chunks/sec, GPU) | Throughput (chunks/sec, CPU) | Model Size |
    |---|---|---|---|---|---|
    | text-embedding-3-small (OpenAI) | 1536 | 62.3 | N/A (API) | N/A (API) | Cloud |
    | text-embedding-3-large (OpenAI) | 3072 | 64.6 | N/A (API) | N/A (API) | Cloud |
    | BGE-M3 (BAAI) | 1024 | 68.2 | 320 | 24 | 567MB |
    | E5-Mistral-7B-Instruct | 4096 | 66.6 | 85 | 3.1 | 14GB |
    | nomic-embed-text-v1.5 | 768 | 62.3 | 480 | 38 | 137MB |

    Key findings:

    • For on-premise deployments, BGE-M3 offers the best quality-to-size ratio, achieving the highest MTEB score among locally-runnable models while remaining small enough for CPU inference at acceptable throughput.
    • nomic-embed-text-v1.5 is the speed champion for local deployment. At 137MB, it runs efficiently on CPU and provides adequate retrieval quality for many enterprise use cases.
    • OpenAI's embedding models require data egress to cloud APIs, which disqualifies them for regulated-industry use cases where documents must stay on-premise.
    • Ertas's Embedding node supports multiple local embedding models, allowing teams to select the right quality-throughput tradeoff for their deployment constraints. For air-gapped environments, all processing stays on the local machine. A minimal local-embedding sketch follows this list.
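
    The sketch below shows the kind of local embedding call involved, using sentence-transformers with BGE-M3's dense output. The sample chunks and query are illustrative assumptions, and BGE-M3's sparse and multi-vector retrieval modes require the FlagEmbedding library rather than sentence-transformers.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Dense-vector usage of BGE-M3; runs fully locally once the model is downloaded.
model = SentenceTransformer("BAAI/bge-m3")

chunks = [
    "Quarterly revenue grew 12% year over year.",      # illustrative chunks
    "The agreement terminates on 31 December 2026.",
]
chunk_vectors = model.encode(chunks, batch_size=64, normalize_embeddings=True)

query_vector = model.encode(["When does the contract end?"], normalize_embeddings=True)[0]
scores = chunk_vectors @ query_vector  # cosine similarity, since vectors are normalized
print(chunks[int(np.argmax(scores))])
```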

    Integrated Pipeline Performance

    Running these stages in isolation tells only part of the story. In production, failures compound across stages — a parsing error propagates through chunking and embedding, degrading retrieval quality downstream.

    We measured end-to-end pipeline accuracy by running the full sequence (parse, redact, chunk, embed, retrieve) on our 500-document corpus with 2,000 QA pairs.

    End-to-End Pipeline Results

    | Pipeline Configuration | End-to-End Retrieval Accuracy | PII Leak Rate | Throughput (docs/hour) |
    |---|---|---|---|
    | Docling + Regex + Fixed Chunk + BGE-M3 | 63.8% | 0.41% | 890 |
    | Unstructured + spaCy + Recursive + nomic | 68.2% | 0.18% | 720 |
    | Marker + GLiNER + Semantic + BGE-M3 | 72.1% | 0.09% | 410 |
    | Ertas Visual Pipeline (Docling + Hybrid + Doc-Aware + BGE-M3) | 79.4% | 0.04% | 520 |

    Key findings:

    • End-to-end accuracy is always lower than individual stage accuracy, confirming that error propagation is a real concern in multi-stage pipelines.
    • The highest-throughput pipeline (Docling + Regex + Fixed Chunk) had the worst retrieval accuracy and the highest PII leak rate, demonstrating the cost of optimizing for speed alone.
    • Ertas's integrated pipeline achieved the highest end-to-end accuracy because the visual pipeline architecture allows each node to pass structured metadata (document sections, entity locations, quality scores) to downstream nodes — information that is lost when stitching together standalone tools. A sketch of this handoff pattern follows the list.
    • PII leak rate (PII instances that survive redaction and appear in the final retrieval output) ranged from 0.04% to 0.41%. For regulated industries, even 0.41% may be unacceptable.
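
    The handoff pattern referenced above can be sketched as a shared record that each stage receives and enriches, rather than passing bare strings between tools. This is an illustrative structure, not Ertas's actual node interface.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineRecord:
    """State handed from stage to stage so structure isn't lost at handoffs."""
    text: str
    sections: list[tuple[str, int, int]] = field(default_factory=list)      # (heading, start, end)
    entity_spans: list[tuple[str, int, int]] = field(default_factory=list)  # (label, start, end), kept for audit
    quality_score: float = 1.0
    chunks: list[str] = field(default_factory=list)


Stage = Callable[[PipelineRecord], PipelineRecord]


def run_pipeline(record: PipelineRecord, stages: list[Stage]) -> PipelineRecord:
    # Each stage enriches the shared record (e.g. the chunker splits within
    # `sections` produced by the parser) rather than receiving bare text.
    for stage in stages:
        record = stage(record)
    return record
```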

    Recommendations

    Based on these benchmarks, we recommend the following for enterprise teams building AI data pipelines:

    1. Do not optimize for parsing speed at the expense of accuracy. The downstream cost of parsing errors far exceeds the time saved. Docling's table extraction accuracy (97.9%) is worth the throughput tradeoff.

    2. Use hybrid PII redaction. Pure regex is fast but misses too much. Pure LLM is accurate but slow and introduces data egress risk. A hybrid approach (regex for structured patterns, transformer NER for contextual entities) provides the best production tradeoff.

    3. Invest in document-aware chunking. Fixed-size chunking is easy to implement but leaves 16 percentage points of retrieval accuracy on the table compared to document-aware approaches.

    4. Choose local embedding models for regulated workloads. BGE-M3 and nomic-embed-text-v1.5 provide production-quality embeddings without requiring cloud API calls or data egress.

    5. Measure end-to-end, not per-stage. Individual stage benchmarks can be misleading. A pipeline that scores well at every stage individually can still underperform if stage handoffs lose metadata or context.

    Methodology Notes

    • All accuracy numbers are averaged across the full test corpus. Per-document-type variance was significant (financial documents parsed more accurately than medical records in all tools).
    • Speed measurements exclude I/O time and reflect pure processing throughput.
    • PII redaction benchmarks used the 14 entity types defined in the NIST SP 800-188 de-identification standard.
    • Retrieval accuracy was measured as recall at top-5 retrieved chunks against manually annotated relevant passages (a minimal computation sketch follows these notes).
    • Ertas benchmarks reflect version 0.9 of the Data Suite desktop application running locally. No cloud processing was involved.
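
    The retrieval metric is straightforward to reproduce. A minimal sketch, assuming relevance judgments are available per chunk ID:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of annotated relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(per_question: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over all annotated question-answer pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in per_question) / len(per_question)
```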

    This report will be updated quarterly as tools release new versions and the benchmark corpus expands. Teams interested in reproducing these benchmarks can contact us for access to the test methodology documentation.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
