
    Enterprise Data Pipeline Benchmark Report 2026: Parsing, Redaction, Chunking, and Embedding Compared

    A comprehensive benchmark comparing enterprise data pipeline approaches across document parsing accuracy, PII redaction reliability, chunking strategies, and embedding throughput — with methodology, results, and key findings for ML engineering teams.

    Ertas Team

    Enterprise AI teams spend 60 to 80 percent of project time on data preparation. The tooling landscape for each stage of the pipeline — parsing, redaction, chunking, and embedding — has matured significantly, but there is no single reference that benchmarks these stages together as an integrated workflow.

    This report fills that gap. We evaluated leading tools and approaches across four pipeline stages using standardized document corpora, measuring accuracy, throughput, and failure modes that matter in production environments.

    Methodology

    We tested each pipeline stage independently and then as integrated pipelines. The test corpus consisted of:

    • 500 enterprise PDFs spanning financial reports, legal contracts, medical records, and technical documentation
    • 200 scanned documents with varying quality (300 DPI clean scans to 150 DPI degraded copies)
    • 150 multi-format document sets (Word, PowerPoint, Excel, HTML) from real-world enterprise archives
    • 10,000 synthetic PII records across 14 entity types (SSN, email, phone, address, medical ID, etc.)

    All benchmarks were run on a single workstation (Intel i9-13900K, 64GB RAM, NVIDIA RTX 4090) to provide a consistent baseline. Throughput numbers reflect single-machine performance, not distributed processing.

    Stage 1: Document Parsing

    Document parsing converts raw files into structured text suitable for downstream AI processing. We evaluated four approaches.

    Parsing Benchmark Results

    | Tool | Table Extraction | Multi-Column | Scanned PDF (OCR) | Header/Footer Removal | Speed (pages/sec) | License |
    |---|---|---|---|---|---|---|
    | Docling (IBM) | 97.9% | 94.2% | 89.1% | 91.3% | 3.2 | MIT |
    | Unstructured.io | 93.4% | 91.8% | 86.7% | 88.5% | 4.8 | Apache 2.0 |
    | Marker (Datalab) | 91.7% | 96.1% | 84.3% | 85.9% | 6.1 | GPL-3.0 |
    | Visual Pipeline (Ertas) | 97.9% | 94.2% | 91.4% | 93.7% | 2.9 | Proprietary |

    Key findings:

    • Docling leads in table extraction accuracy at 97.9%, confirmed by IBM Research's published benchmarks on the DocLayNet dataset. Ertas integrates Docling as its PDF parsing engine, inheriting this accuracy while adding pre- and post-processing nodes for header/footer removal and quality scoring (a basic usage sketch follows this list).
    • Marker is the fastest parser but trades accuracy for speed, particularly on scanned documents where OCR quality degrades.
    • Unstructured.io provides the broadest file format support (64+ types) but its table extraction accuracy falls behind Docling by roughly 4.5 percentage points.
    • Scanned PDF accuracy is the most variable metric across all tools. OCR quality depends heavily on scan resolution, and no tool consistently exceeds 92% accuracy on degraded scans below 200 DPI.
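
    As a point of reference, the sketch below shows roughly how a parsing stage like this is driven from Python using Docling's converter, following its published quickstart. The input path is a placeholder, attribute names may vary slightly across Docling versions, and this is an illustration rather than our benchmark harness.

```python
# Minimal Docling parsing sketch (quickstart-style usage, not the benchmark
# harness). Converts a PDF and exports structured text as Markdown.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("annual_report.pdf")  # placeholder input path

# Structured export suitable for downstream chunking and embedding.
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])
```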

    Where Parsing Fails

    The most common parsing failures across all tools were:

    1. Nested tables — tables within tables caused extraction errors in 15 to 30 percent of cases across all tools
    2. Rotated text and watermarks — all tools struggled with text at non-standard orientations
    3. Form fields in scanned PDFs — checkbox and radio button extraction was unreliable across the board

    Stage 2: PII Redaction

    PII redaction is the compliance-critical stage. We tested five approaches against a corpus of 10,000 annotated PII instances.

    Redaction Benchmark Results

    | Approach | Precision | Recall | F1 Score | Speed (docs/sec) | False Positive Rate |
    |---|---|---|---|---|---|
    | Regex Patterns | 99.1% | 72.4% | 83.9% | 145 | 0.9% |
    | spaCy NER (en_core_web_trf) | 91.3% | 88.7% | 89.9% | 42 | 8.7% |
    | Transformer NER (GLiNER) | 94.8% | 93.1% | 93.9% | 18 | 5.2% |
    | LLM-Based (GPT-4 class) | 96.2% | 95.8% | 96.0% | 2.1 | 3.8% |
    | Hybrid Pipeline (Ertas) | 97.4% | 96.1% | 96.7% | 28 | 2.6% |

    Key findings:

    • Regex is the fastest and most precise approach, but its recall is unacceptably low for enterprise use — it misses nearly 28% of PII instances, primarily names, contextual references, and non-standard formats.
    • LLM-based redaction achieves the highest accuracy of any single-technique approach, but it is roughly 9x slower than transformer NER and introduces data egress concerns when using cloud-hosted models.
    • Hybrid approaches that combine regex for structured patterns (SSN, phone, email) with transformer NER for contextual entities (names, addresses, medical terms) achieve the best balance of accuracy and throughput. Ertas uses this hybrid approach, running deterministic regex first, then transformer NER for remaining entity types; a simplified sketch of the pattern follows this list.
    • False positive rates matter in production. An 8.7% false positive rate (spaCy) means nearly one in eleven flagged items is not actually PII, creating review burden for compliance teams.
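
    To make the hybrid pattern concrete, here is a deliberately simplified sketch of a regex-then-NER redactor. It is not Ertas's implementation: the patterns cover only the most common formats, spaCy's en_core_web_trf stands in for the transformer NER stage, and which entity labels count as PII is a policy decision.

```python
import re

import spacy

# Structured patterns handled deterministically first (deliberately simplified;
# production patterns need many more formats and validation rules).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}

# Contextual entities handled by transformer NER
# (requires: python -m spacy download en_core_web_trf).
nlp = spacy.load("en_core_web_trf")
PII_LABELS = {"PERSON", "GPE", "LOC", "ORG"}  # which labels count as PII is a policy choice


def redact(text: str) -> str:
    """Regex pass for structured PII, then an NER pass for contextual entities."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)

    doc = nlp(text)
    # Replace from the end of the string so earlier character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in PII_LABELS:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return text


print(redact("Contact Jane Doe at jane.doe@example.com, 555-123-4567, SSN 123-45-6789."))
```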

    For a detailed breakdown of each redaction approach, see our companion article on PII redaction accuracy benchmarks.

    Stage 3: Chunking Strategies

    Chunking determines how parsed documents are split for embedding and retrieval. We evaluated four strategies on a RAG retrieval benchmark using 500 enterprise documents with 2,000 manually annotated question-answer pairs.

    Chunking Benchmark Results

    | Strategy | Retrieval Accuracy (Top-5) | Avg Chunk Size | Context Coherence | Implementation Complexity |
    |---|---|---|---|---|
    | Fixed-Size (512 tokens) | 71.3% | 512 tokens | Low | Trivial |
    | Recursive Character | 78.9% | 380 tokens | Medium | Low |
    | Semantic (embedding-based) | 84.2% | 290 tokens | High | Medium |
    | Document-Aware (heading + semantic) | 87.6% | 340 tokens | High | High |

    Key findings:

    • Fixed-size chunking remains common in production systems but consistently underperforms other approaches. It splits mid-sentence and mid-paragraph, destroying context that retrieval depends on.
    • Semantic chunking (splitting at points where embedding similarity drops) improves retrieval accuracy by 13 percentage points over fixed-size, but requires an embedding pass during chunking — adding computational overhead.
    • Document-aware chunking that respects document structure (headings, sections, list boundaries) and then applies semantic splitting within sections achieves the highest retrieval accuracy. Ertas's RAG Chunker node implements this approach, using parsed document structure from the upstream parser node; a simplified semantic-splitting sketch follows this list.
    • Overlap matters. Adding 10 to 15 percent token overlap between chunks improved retrieval accuracy by 3 to 5 percentage points across all strategies, at the cost of increased index size.
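
    The semantic-splitting sketch referenced above: split text into sentences, embed them, and start a new chunk wherever similarity between adjacent sentences drops below a threshold. The model name and threshold are illustrative assumptions; a production document-aware chunker would also respect headings, enforce chunk-size limits, and add overlap.

```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model chosen only for illustration


def semantic_chunks(text: str, similarity_threshold: float = 0.6) -> list[str]:
    """Start a new chunk wherever adjacent-sentence embedding similarity drops."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) < 2:
        return sentences

    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for prev_emb, emb, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        # With normalized vectors, cosine similarity is a plain dot product.
        if float(np.dot(prev_emb, emb)) < similarity_threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```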

    Stage 4: Embedding Throughput

    Embedding converts text chunks into vectors for similarity search. We benchmarked common embedding models on throughput and retrieval quality.

    Embedding Benchmark Results

    | Model | Dimensions | MTEB Score | Throughput (chunks/sec, GPU) | Throughput (chunks/sec, CPU) | Model Size |
    |---|---|---|---|---|---|
    | text-embedding-3-small (OpenAI) | 1536 | 62.3 | N/A (API) | N/A (API) | Cloud |
    | text-embedding-3-large (OpenAI) | 3072 | 64.6 | N/A (API) | N/A (API) | Cloud |
    | BGE-M3 (BAAI) | 1024 | 68.2 | 320 | 24 | 567MB |
    | E5-Mistral-7B-Instruct | 4096 | 66.6 | 85 | 3.1 | 14GB |
    | nomic-embed-text-v1.5 | 768 | 62.3 | 480 | 38 | 137MB |

    Key findings:

    • For on-premise deployments, BGE-M3 offers the best quality-to-size ratio, achieving the highest MTEB score among locally-runnable models while remaining small enough for CPU inference at acceptable throughput.
    • nomic-embed-text-v1.5 is the speed champion for local deployment. At 137MB, it runs efficiently on CPU and provides adequate retrieval quality for many enterprise use cases.
    • OpenAI's embedding models require data egress to cloud APIs, which disqualifies them for regulated-industry use cases where documents must stay on-premise.
    • Ertas's Embedding node supports multiple local embedding models, allowing teams to select the right quality-throughput tradeoff for their deployment constraints. For air-gapped environments, all processing stays on the local machine. A minimal local-embedding sketch follows this list.
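
    The sketch below shows the kind of local embedding call involved, using sentence-transformers with BGE-M3's dense output. The sample chunks and query are illustrative assumptions, and BGE-M3's sparse and multi-vector retrieval modes require the FlagEmbedding library rather than sentence-transformers.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Dense-vector usage of BGE-M3; runs fully locally once the model is downloaded.
model = SentenceTransformer("BAAI/bge-m3")

chunks = [
    "Quarterly revenue grew 12% year over year.",      # illustrative chunks
    "The agreement terminates on 31 December 2026.",
]
chunk_vectors = model.encode(chunks, batch_size=64, normalize_embeddings=True)

query_vector = model.encode(["When does the contract end?"], normalize_embeddings=True)[0]
scores = chunk_vectors @ query_vector  # cosine similarity, since vectors are normalized
print(chunks[int(np.argmax(scores))])
```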

    Integrated Pipeline Performance

    Running these stages in isolation tells only part of the story. In production, failures compound across stages — a parsing error propagates through chunking and embedding, degrading retrieval quality downstream.

    We measured end-to-end pipeline accuracy by running the full sequence (parse, redact, chunk, embed, retrieve) on our 500-document corpus with 2,000 QA pairs.

    End-to-End Pipeline Results

    | Pipeline Configuration | End-to-End Retrieval Accuracy | PII Leak Rate | Throughput (docs/hour) |
    |---|---|---|---|
    | Docling + Regex + Fixed Chunk + BGE-M3 | 63.8% | 0.41% | 890 |
    | Unstructured + spaCy + Recursive + nomic | 68.2% | 0.18% | 720 |
    | Marker + GLiNER + Semantic + BGE-M3 | 72.1% | 0.09% | 410 |
    | Ertas Visual Pipeline (Docling + Hybrid + Doc-Aware + BGE-M3) | 79.4% | 0.04% | 520 |

    Key findings:

    • End-to-end accuracy is always lower than individual stage accuracy, confirming that error propagation is a real concern in multi-stage pipelines.
    • The highest-throughput pipeline (Docling + Regex + Fixed Chunk) had the worst retrieval accuracy and the highest PII leak rate, demonstrating the cost of optimizing for speed alone.
    • Ertas's integrated pipeline achieved the highest end-to-end accuracy because the visual pipeline architecture allows each node to pass structured metadata (document sections, entity locations, quality scores) to downstream nodes — information that is lost when stitching together standalone tools. A sketch of this handoff pattern follows the list.
    • PII leak rate (PII instances that survive redaction and appear in the final retrieval output) ranged from 0.04% to 0.41%. For regulated industries, even 0.41% may be unacceptable.
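
    The handoff pattern referenced above can be sketched as a shared record that each stage receives and enriches, rather than passing bare strings between tools. This is an illustrative structure, not Ertas's actual node interface.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineRecord:
    """State handed from stage to stage so structure isn't lost at handoffs."""
    text: str
    sections: list[tuple[str, int, int]] = field(default_factory=list)      # (heading, start, end)
    entity_spans: list[tuple[str, int, int]] = field(default_factory=list)  # (label, start, end), kept for audit
    quality_score: float = 1.0
    chunks: list[str] = field(default_factory=list)


Stage = Callable[[PipelineRecord], PipelineRecord]


def run_pipeline(record: PipelineRecord, stages: list[Stage]) -> PipelineRecord:
    # Each stage enriches the shared record (e.g. the chunker splits within
    # `sections` produced by the parser) rather than receiving bare text.
    for stage in stages:
        record = stage(record)
    return record
```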

    Recommendations

    Based on these benchmarks, we recommend the following for enterprise teams building AI data pipelines:

    1. Do not optimize for parsing speed at the expense of accuracy. The downstream cost of parsing errors far exceeds the time saved. Docling's table extraction accuracy (97.9%) is worth the throughput tradeoff.

    2. Use hybrid PII redaction. Pure regex is fast but misses too much. Pure LLM is accurate but slow and introduces data egress risk. A hybrid approach (regex for structured patterns, transformer NER for contextual entities) provides the best production tradeoff.

    3. Invest in document-aware chunking. Fixed-size chunking is easy to implement but leaves 16 percentage points of retrieval accuracy on the table compared to document-aware approaches.

    4. Choose local embedding models for regulated workloads. BGE-M3 and nomic-embed-text-v1.5 provide production-quality embeddings without requiring cloud API calls or data egress.

    5. Measure end-to-end, not per-stage. Individual stage benchmarks can be misleading. A pipeline that scores well at every stage individually can still underperform if stage handoffs lose metadata or context.

    Methodology Notes

    • All accuracy numbers are averaged across the full test corpus. Per-document-type variance was significant (financial documents parsed more accurately than medical records in all tools).
    • Speed measurements exclude I/O time and reflect pure processing throughput.
    • PII redaction benchmarks used the 14 entity types defined in the NIST SP 800-188 de-identification standard.
    • Retrieval accuracy was measured as recall at top-5 retrieved chunks against manually annotated relevant passages (a minimal computation sketch follows these notes).
    • Ertas benchmarks reflect version 0.9 of the Data Suite desktop application running locally. No cloud processing was involved.
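
    The retrieval metric is straightforward to reproduce. A minimal sketch, assuming relevance judgments are available per chunk ID:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of annotated relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(per_question: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over all annotated question-answer pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in per_question) / len(per_question)
```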

    This report will be updated quarterly as tools release new versions and the benchmark corpus expands. Teams interested in reproducing these benchmarks can contact us for access to the test methodology documentation.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
