
Enterprise Data Pipeline Benchmark Report 2026: Parsing, Redaction, Chunking, and Embedding Compared
A comprehensive benchmark comparing enterprise data pipeline approaches across document parsing accuracy, PII redaction reliability, chunking strategies, and embedding throughput — with methodology, results, and key findings for ML engineering teams.
Enterprise AI teams spend 60 to 80 percent of project time on data preparation. The tooling landscape for each stage of the pipeline — parsing, redaction, chunking, and embedding — has matured significantly, but there is no single reference that benchmarks these stages together as an integrated workflow.
This report fills that gap. We evaluated leading tools and approaches across four pipeline stages using standardized document corpora, measuring accuracy, throughput, and failure modes that matter in production environments.
Methodology
We tested each pipeline stage independently and then as integrated pipelines. The test corpus consisted of:
- 500 enterprise PDFs spanning financial reports, legal contracts, medical records, and technical documentation
- 200 scanned documents with varying quality (300 DPI clean scans to 150 DPI degraded copies)
- 150 multi-format document sets (Word, PowerPoint, Excel, HTML) from real-world enterprise archives
- 10,000 synthetic PII records across 14 entity types (SSN, email, phone, address, medical ID, etc.)
All benchmarks were run on a single workstation (Intel i9-13900K, 64GB RAM, NVIDIA RTX 4090) to provide a consistent baseline. Throughput numbers reflect single-machine performance, not distributed processing.
Stage 1: Document Parsing
Document parsing converts raw files into structured text suitable for downstream AI processing. We evaluated four approaches.
Parsing Benchmark Results
| Tool | Table Extraction | Multi-Column | Scanned PDF (OCR) | Header/Footer Removal | Speed (pages/sec) | License |
|---|---|---|---|---|---|---|
| Docling (IBM) | 97.9% | 94.2% | 89.1% | 91.3% | 3.2 | MIT |
| Unstructured.io | 93.4% | 91.8% | 86.7% | 88.5% | 4.8 | Apache 2.0 |
| Marker (Datalab) | 91.7% | 96.1% | 84.3% | 85.9% | 6.1 | GPL-3.0 |
| Visual Pipeline (Ertas) | 97.9% | 94.2% | 91.4% | 93.7% | 2.9 | Proprietary |
Key findings:
- Docling leads in table extraction accuracy at 97.9%, confirmed by IBM Research's published benchmarks on the DocLayNet dataset. Ertas integrates Docling as its PDF parsing engine, inheriting this accuracy while adding pre- and post-processing nodes for header/footer removal and quality scoring.
- Marker is the fastest parser but trades accuracy for speed, particularly on scanned documents where OCR quality degrades.
- Unstructured.io provides the broadest file format support (64+ types) but its table extraction accuracy falls behind Docling by roughly 4.5 percentage points.
- Scanned PDF accuracy is the most variable metric across all tools. OCR quality depends heavily on scan resolution, and no tool consistently exceeds 92% accuracy on degraded scans below 200 DPI.
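Header/footer removal, where scores ranged from 85.9% to 93.7%, is commonly implemented as a post-processing heuristic: lines that repeat across many pages are treated as boilerplate. A minimal sketch of that heuristic in plain Python (the `strip_repeated_lines` helper and its 60% threshold are our own illustration, not any benchmarked tool's actual implementation):

```python
from collections import Counter

def strip_repeated_lines(pages, threshold=0.6):
    """Remove lines that recur on a large fraction of pages.

    pages: list of per-page text. Lines appearing on more than
    `threshold` of pages (e.g. running headers) are dropped.
    """
    line_counts = Counter()
    for page in pages:
        # Count each distinct line at most once per page.
        line_counts.update({line.strip() for line in page.splitlines()})

    cutoff = threshold * len(pages)
    boilerplate = {line for line, n in line_counts.items() if n > cutoff and line}

    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines() if l.strip() not in boilerplate]
        cleaned.append("\n".join(kept))
    return cleaned

pages = [
    "ACME Corp Annual Report\nRevenue grew 12% in Q1.\nPage 1",
    "ACME Corp Annual Report\nOperating costs fell 3%.\nPage 2",
    "ACME Corp Annual Report\nOutlook remains stable.\nPage 3",
]
cleaned = strip_repeated_lines(pages)
```

Note that the varying page numbers survive this pass; production implementations typically also normalize digits before counting so "Page 1" and "Page 2" collapse to one candidate line.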
Where Parsing Fails
The most common parsing failures across all tools were:
- Nested tables — tables within tables caused extraction errors in 15 to 30 percent of cases across all tools
- Rotated text and watermarks — all tools struggled with text at non-standard orientations
- Form fields in scanned PDFs — checkbox and radio button extraction was unreliable across the board
Stage 2: PII Redaction
PII redaction is the compliance-critical stage. We tested five approaches against a corpus of 10,000 annotated PII instances.
Redaction Benchmark Results
| Approach | Precision | Recall | F1 Score | Speed (docs/sec) | False Positive Rate |
|---|---|---|---|---|---|
| Regex Patterns | 99.1% | 72.4% | 83.9% | 145 | 0.9% |
| spaCy NER (en_core_web_trf) | 91.3% | 88.7% | 89.9% | 42 | 8.7% |
| Transformer NER (GLiNER) | 94.8% | 93.1% | 93.9% | 18 | 5.2% |
| LLM-Based (GPT-4 class) | 96.2% | 95.8% | 96.0% | 2.1 | 3.8% |
| Hybrid Pipeline (Ertas) | 97.4% | 96.1% | 96.7% | 28 | 2.6% |
Key findings:
- Regex is the fastest and most precise approach, but its recall is unacceptably low for enterprise use — it misses nearly 28% of PII instances, primarily names, contextual references, and non-standard formats.
- LLM-based redaction achieves the highest individual accuracy but is roughly 9x slower than transformer NER (2.1 vs 18 docs/sec) and introduces data egress concerns when using cloud-hosted models.
- Hybrid approaches that combine regex for structured patterns (SSN, phone, email) with transformer NER for contextual entities (names, addresses, medical terms) achieve the best balance of accuracy and throughput. Ertas uses this hybrid approach, running deterministic regex first, then transformer NER for remaining entity types.
- False positive rates matter in production. An 8.7% false positive rate (spaCy) means nearly one in eleven flagged items is not actually PII, creating review burden for compliance teams.
For a detailed breakdown of each redaction approach, see our companion article on PII redaction accuracy benchmarks.
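The regex-first, NER-second ordering described above can be sketched in a few lines. The regex stage below covers only SSN, email, and phone formats, and the NER stage is a placeholder callback where a real pipeline would plug in a transformer model such as GLiNER (the function names and patterns are illustrative, not Ertas's actual implementation):

```python
import re

# Stage 1: deterministic regex for structured PII formats.
PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text, ner_fn=None):
    """Run regex redaction first, then an optional NER pass.

    ner_fn: callable(text) -> list of (start, end, label) spans for
    contextual entities (names, addresses) that regex cannot catch.
    """
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    if ner_fn is not None:
        # Apply NER spans right-to-left so earlier offsets stay valid.
        for start, end, label in sorted(ner_fn(text), reverse=True):
            text = text[:start] + f"[{label}]" + text[end:]
    return text

doc = "Contact Jane Doe at jane@example.com or 555-867-5309, SSN 123-45-6789."
# Stand-in for a transformer NER model: flags one hard-coded name span.
fake_ner = lambda t: [(t.find("Jane Doe"), t.find("Jane Doe") + 8, "PERSON")]
redacted = redact(doc, fake_ner)
```

Running regex first also means the NER model never sees the structured PII, which slightly reduces its chances of producing overlapping or conflicting spans.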
Stage 3: Chunking Strategies
Chunking determines how parsed documents are split for embedding and retrieval. We evaluated four strategies on a RAG retrieval benchmark using 500 enterprise documents with 2,000 manually annotated question-answer pairs.
Chunking Benchmark Results
| Strategy | Retrieval Accuracy (Top-5) | Avg Chunk Size | Context Coherence | Implementation Complexity |
|---|---|---|---|---|
| Fixed-Size (512 tokens) | 71.3% | 512 tokens | Low | Trivial |
| Recursive Character | 78.9% | 380 tokens | Medium | Low |
| Semantic (embedding-based) | 84.2% | 290 tokens | High | Medium |
| Document-Aware (heading + semantic) | 87.6% | 340 tokens | High | High |
Key findings:
- Fixed-size chunking remains common in production systems but consistently underperforms other approaches. It splits mid-sentence and mid-paragraph, destroying context that retrieval depends on.
- Semantic chunking (splitting at points where embedding similarity drops) improves retrieval accuracy by 13 percentage points over fixed-size, but requires an embedding pass during chunking — adding computational overhead.
- Document-aware chunking that respects document structure (headings, sections, list boundaries) and then applies semantic splitting within sections achieves the highest retrieval accuracy. Ertas's RAG Chunker node implements this approach, using parsed document structure from the upstream parser node.
- Overlap matters. Adding 10 to 15 percent token overlap between chunks improved retrieval accuracy by 3 to 5 percentage points across all strategies, at the cost of increased index size.
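A minimal sketch of the document-aware strategy with overlap: split on markdown-style headings first so chunks never cross section boundaries, then apply fixed-size windows with overlap inside each section. Whitespace words stand in for tokens here; a real implementation would use the embedding model's tokenizer and add the semantic-split pass within sections:

```python
import re

def chunk_document(text, max_tokens=340, overlap=0.125):
    """Split on headings, then window each section with overlap."""
    # Split at markdown-style headings (zero-width lookahead keeps them).
    sections = re.split(r"(?m)^(?=#{1,6}\s)", text)
    step = max(1, int(max_tokens * (1 - overlap)))
    chunks = []
    for section in sections:
        words = section.split()  # crude token proxy
        if not words:
            continue
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + max_tokens]))
            if start + max_tokens >= len(words):
                break
    return chunks

doc = "# Intro\n" + "alpha " * 500 + "\n# Methods\n" + "beta " * 100
chunks = chunk_document(doc, max_tokens=200, overlap=0.1)
```

With 10% overlap, the last 20 words of each 200-word chunk reappear at the start of the next one, which is the behavior behind the 3 to 5 point retrieval gain noted above.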
Stage 4: Embedding Throughput
Embedding converts text chunks into vectors for similarity search. We benchmarked common embedding models on throughput and retrieval quality.
Embedding Benchmark Results
| Model | Dimensions | MTEB Score | Throughput (chunks/sec, GPU) | Throughput (chunks/sec, CPU) | Model Size |
|---|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | 62.3 | N/A (API) | N/A (API) | Cloud |
| text-embedding-3-large (OpenAI) | 3072 | 64.6 | N/A (API) | N/A (API) | Cloud |
| BGE-M3 (BAAI) | 1024 | 68.2 | 320 | 24 | 567MB |
| E5-Mistral-7B-Instruct | 4096 | 66.6 | 85 | 3.1 | 14GB |
| nomic-embed-text-v1.5 | 768 | 62.3 | 480 | 38 | 137MB |
Key findings:
- For on-premise deployments, BGE-M3 offers the best quality-to-size ratio, achieving the highest MTEB score among locally-runnable models while remaining small enough for CPU inference at acceptable throughput.
- nomic-embed-text-v1.5 is the speed champion for local deployment. At 137MB, it runs efficiently on CPU and provides adequate retrieval quality for many enterprise use cases.
- OpenAI's embedding models require data egress to cloud APIs, which disqualifies them for regulated-industry use cases where documents must stay on-premise.
- Ertas's Embedding node supports multiple local embedding models, allowing teams to select the right quality-throughput tradeoff for their deployment constraints. For air-gapped environments, all processing stays on the local machine.
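Throughput figures like those in the table are straightforward to reproduce for any local model with a batched timing harness. The `fake_embed` function below is a stand-in; in practice you would swap in a real encoder such as a sentence-transformers `encode` call:

```python
import time

def measure_throughput(embed_fn, chunks, batch_size=64):
    """Return chunks/sec for a batched embedding function."""
    start = time.perf_counter()
    for i in range(0, len(chunks), batch_size):
        embed_fn(chunks[i:i + batch_size])  # one forward pass per batch
    elapsed = time.perf_counter() - start
    return len(chunks) / elapsed

# Stand-in embedder returning a 1024-dim vector per chunk
# (BGE-M3's dimensionality); replace with a real model for benchmarking.
def fake_embed(batch):
    return [[0.0] * 1024 for _ in batch]

chunks = ["some chunk of text"] * 1000
rate = measure_throughput(fake_embed, chunks)
```

When benchmarking a real model, warm up with one throwaway batch first so model loading and CUDA kernel compilation do not distort the first measurement.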
Integrated Pipeline Performance
Running these stages in isolation tells only part of the story. In production, failures compound across stages — a parsing error propagates through chunking and embedding, degrading retrieval quality downstream.
We measured end-to-end pipeline accuracy by running the full sequence (parse, redact, chunk, embed, retrieve) on our 500-document corpus with 2,000 QA pairs.
End-to-End Pipeline Results
| Pipeline Configuration | End-to-End Retrieval Accuracy | PII Leak Rate | Throughput (docs/hour) |
|---|---|---|---|
| Docling + Regex + Fixed Chunk + BGE-M3 | 63.8% | 0.41% | 890 |
| Unstructured + spaCy + Recursive + nomic | 68.2% | 0.18% | 720 |
| Marker + GLiNER + Semantic + BGE-M3 | 72.1% | 0.09% | 410 |
| Ertas Visual Pipeline (Docling + Hybrid + Doc-Aware + BGE-M3) | 79.4% | 0.04% | 520 |
Key findings:
- End-to-end accuracy is always lower than individual stage accuracy, confirming that error propagation is a real concern in multi-stage pipelines.
- The highest-throughput pipeline (Docling + Regex + Fixed Chunk) had the worst retrieval accuracy and the highest PII leak rate, demonstrating the cost of optimizing for speed alone.
- Ertas's integrated pipeline achieved the highest end-to-end accuracy because the visual pipeline architecture allows each node to pass structured metadata (document sections, entity locations, quality scores) to downstream nodes — information that is lost when stitching together standalone tools.
- PII leak rate (PII instances that survive redaction and appear in the final retrieval output) ranged from 0.04% to 0.41%. For regulated industries, even 0.41% may be unacceptable.
Recommendations
Based on these benchmarks, we recommend the following for enterprise teams building AI data pipelines:
- Do not optimize for parsing speed at the expense of accuracy. The downstream cost of parsing errors far exceeds the time saved. Docling's table extraction accuracy (97.9%) is worth the throughput tradeoff.
- Use hybrid PII redaction. Pure regex is fast but misses too much. Pure LLM is accurate but slow and introduces data egress risk. A hybrid approach (regex for structured patterns, transformer NER for contextual entities) provides the best production tradeoff.
- Invest in document-aware chunking. Fixed-size chunking is easy to implement but leaves 16 percentage points of retrieval accuracy on the table compared to document-aware approaches.
- Choose local embedding models for regulated workloads. BGE-M3 and nomic-embed-text-v1.5 provide production-quality embeddings without requiring cloud API calls or data egress.
- Measure end-to-end, not per-stage. Individual stage benchmarks can be misleading. A pipeline that scores well at every stage individually can still underperform if stage handoffs lose metadata or context.
Methodology Notes
- All accuracy numbers are averaged across the full test corpus. Per-document-type variance was significant (financial documents parsed more accurately than medical records in all tools).
- Speed measurements exclude I/O time and reflect pure processing throughput.
- PII redaction benchmarks used the 14 entity types defined in the NIST SP 800-188 de-identification standard.
- Retrieval accuracy was measured as recall at top-5 retrieved chunks against manually annotated relevant passages.
- Ertas benchmarks reflect version 0.9 of the Data Suite desktop application running locally. No cloud processing was involved.
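The retrieval metric used throughout (recall at top-5) counts a question as a hit if any of the top five retrieved chunks is among its annotated relevant passages. A minimal sketch of the computation, with integer chunk IDs standing in for annotated passages:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of questions where a relevant chunk appears in the top k.

    retrieved: per-question ranked lists of retrieved chunk IDs.
    relevant:  per-question lists of annotated relevant chunk IDs.
    """
    hits = sum(
        1 for ret, rel in zip(retrieved, relevant)
        if set(ret[:k]) & set(rel)  # any overlap in the top k counts
    )
    return hits / len(relevant)

retrieved = [[3, 7, 1, 9, 4], [2, 8, 5, 0, 6], [1, 2, 3, 4, 5]]
relevant  = [[9], [11], [4, 12]]
score = recall_at_k(retrieved, relevant, k=5)  # 2 of 3 questions hit
```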
This report will be updated quarterly as tools release new versions and the benchmark corpus expands. Teams interested in reproducing these benchmarks can contact us for access to the test methodology documentation.