    PDF Parsing Accuracy Benchmark: Docling vs Unstructured vs Marker vs Visual Pipeline


    Head-to-head benchmark comparing PDF parsing tools for AI training data — Docling (IBM), Unstructured.io, Marker (Datalab), and Ertas's visual pipeline approach — across table extraction, multi-column layout, scanned PDFs, and processing speed.

    Ertas Team

    PDF parsing is the first stage in any enterprise AI data pipeline, and the one where accuracy matters most. A parsing error in stage one propagates through every downstream stage — chunking, embedding, retrieval — and compounds into AI outputs that hallucinate, miss context, or return irrelevant results.

    Yet most teams select their PDF parser based on anecdotal recommendations or GitHub star counts rather than structured evaluation. This benchmark provides the structured evaluation.

    We tested four leading PDF parsing tools across five dimensions that matter for AI training data preparation: table extraction, multi-column layout handling, scanned PDF (OCR) accuracy, header/footer removal, and raw throughput.

    The Tools

    Docling (IBM Research) is an open-source document parsing library released by IBM Research. It uses a deep learning layout analysis model trained on the DocLayNet dataset (80,000+ manually annotated document pages). IBM reports 97.9% table extraction accuracy on their published benchmark. Docling outputs structured JSON with document hierarchy preserved.

    Unstructured.io is an open-source library that supports 64+ file types and provides multiple parsing strategies (hi-res with layout analysis, fast without, and OCR for scanned documents). It has strong community adoption and commercial backing. The hi-res strategy uses detectron2 for layout analysis.

    Marker (Datalab) converts PDFs and images to Markdown or JSON. It is optimized for speed, using a pipeline of smaller specialized models rather than a single large layout analysis model. Marker excels at preserving reading order in complex layouts.

    Ertas Visual Pipeline uses Docling as its core PDF parsing engine but wraps it in a visual node-graph interface with pre-processing (quality scoring, format detection) and post-processing (header/footer removal, metadata extraction, structure normalization) nodes. The pipeline approach means parsing is not a standalone step — it is integrated with downstream cleaning and transformation.

    Test Corpus

    We assembled a corpus of 500 enterprise PDFs from publicly available sources:

    • 150 financial documents — 10-K filings, quarterly reports, and financial statements with dense tables and footnotes
    • 100 legal contracts — multi-column agreements, terms of service, and regulatory filings
    • 100 medical/clinical documents — published clinical trial reports and anonymized discharge summaries
    • 100 technical documents — engineering specifications, product manuals, and research papers
    • 50 mixed-format documents — documents combining text, tables, images, and forms

    Within each category, we included both digitally-native PDFs and scanned copies to test OCR handling.

    Ground truth was established by manual annotation of 2,500 pages (5 pages sampled per document) by three independent annotators, with inter-annotator agreement above 95%.

    Benchmark Results

    | Metric | Docling (IBM) | Unstructured.io | Marker (Datalab) | Ertas Visual Pipeline |
    |---|---|---|---|---|
    | Table Extraction | 97.9% | 93.4% | 91.7% | 97.9% |
    | Multi-Column Layout | 94.2% | 91.8% | 96.1% | 94.2% |
    | Scanned PDF (OCR) | 89.1% | 86.7% | 84.3% | 91.4% |
    | Header/Footer Removal | 91.3% | 88.5% | 85.9% | 93.7% |
    | Speed (pages/sec) | 3.2 | 4.8 | 6.1 | 2.9 |
    | Output Format | JSON | JSON/Dict | Markdown/JSON | Structured JSON |
    | License | MIT | Apache 2.0 | GPL-3.0 | Proprietary |

    All accuracy metrics are F1 scores (harmonic mean of precision and recall) measured against manually annotated ground truth.
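    For readers unfamiliar with the metric, a minimal sketch of the F1 calculation (the numbers below are illustrative, not drawn from the benchmark):

    ```python
    def f1_score(precision: float, recall: float) -> float:
        """Harmonic mean of precision and recall; 0 if both are 0."""
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Hypothetical example: a parser recovers 940 of 1,000 ground-truth
    # table cells (recall 0.94) and emits 980 cells, of which 940 are
    # correct (precision ~0.959)
    print(round(f1_score(940 / 980, 940 / 1000), 3))  # → 0.949
    ```

    Because F1 penalizes both missed cells (low recall) and spurious cells (low precision), a parser cannot inflate its score by over- or under-extracting.
    
    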

    Detailed Analysis

    Table Extraction

    Table extraction is the single most important parsing capability for enterprise documents. Financial reports, legal exhibits, clinical data tables — these contain the structured information that AI models need most and that is hardest to extract correctly.

    Docling's 97.9% table extraction accuracy, published by IBM Research on the DocLayNet benchmark, held up in our independent testing. Its deep learning layout model correctly identified table boundaries, column alignment, and cell spanning in 97.9% of test cases.

    Unstructured.io's hi-res strategy achieved 93.4%, with most errors occurring in tables with merged cells or tables that span page breaks. Its fast strategy (without layout analysis) dropped to 84.2% on the same test set — a reminder that parsing strategy selection matters as much as tool selection.

    Marker achieved 91.7%, with a notable weakness in tables that use visual alignment (whitespace) rather than explicit cell borders. Its reading-order-first approach sometimes misassigned table cells to the wrong columns in borderless tables.

    Ertas inherits Docling's 97.9% table extraction accuracy directly, since Docling is the parsing engine. The pipeline adds no regression to table parsing accuracy.

    Multi-Column Layout

    Multi-column documents (legal contracts, academic papers, newspaper-style layouts) test a parser's ability to maintain reading order when text flows in non-linear patterns.

    Marker led this category at 96.1%. Its specialized reading order model was the most reliable at correctly sequencing text from multi-column layouts, including documents that mix single-column and multi-column sections on the same page.

    Docling and Ertas achieved 94.2%, performing well on standard two-column layouts but occasionally merging columns in documents with narrow gutters (under 0.3 inches) between columns.

    Unstructured.io scored 91.8%, with most errors occurring in three-column layouts and documents where column width varied across sections.

    Scanned PDF (OCR) Accuracy

    Scanned PDFs remain the most challenging document type. OCR accuracy depends on scan quality, and enterprise archives frequently contain degraded scans — photocopied documents, faxes, or scans made at low resolution.

    We tested at three quality levels:

    | Scan Quality | Docling | Unstructured | Marker | Ertas Pipeline |
    |---|---|---|---|---|
    | High (300 DPI, clean) | 95.8% | 93.2% | 91.1% | 96.3% |
    | Medium (200 DPI, minor artifacts) | 89.4% | 87.1% | 84.9% | 92.1% |
    | Low (150 DPI, degraded) | 82.1% | 79.8% | 76.9% | 85.8% |

    Ertas outperformed standalone Docling on scanned PDFs because the visual pipeline applies pre-processing before parsing: the Quality Scorer node detects scan quality and the Format Normalizer node applies image enhancement (contrast adjustment, deskewing, noise reduction) before the document reaches the parser. This pre-processing adds latency (hence Ertas's lower speed) but recovers 2 to 4 percentage points of accuracy on degraded scans.
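    The exact enhancement steps the Format Normalizer applies are not public; contrast stretching is one standard pre-OCR technique, sketched here on a raw grayscale array (no imaging libraries, purely for illustration):

    ```python
    def stretch_contrast(pixels: list[list[int]]) -> list[list[int]]:
        """Linearly rescale grayscale values (0-255) so the darkest pixel
        maps to 0 and the brightest to 255 — a simple contrast stretch
        often applied before OCR on washed-out scans."""
        lo = min(min(row) for row in pixels)
        hi = max(max(row) for row in pixels)
        if hi == lo:  # flat image: nothing to stretch
            return [row[:] for row in pixels]
        scale = 255 / (hi - lo)
        return [[round((p - lo) * scale) for p in row] for row in pixels]

    # A faded scan: values clustered in the 100-180 range
    faded = [[100, 140], [160, 180]]
    print(stretch_contrast(faded))  # → [[0, 128], [191, 255]]
    ```

    In practice this kind of step would run on real image data via an imaging library; the point is that widening the dynamic range before OCR is what recovers accuracy on low-contrast scans.
    
    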

    No tool exceeded 86% accuracy on low-quality scans. For enterprise teams with large archives of degraded scanned documents, re-scanning at higher resolution remains the most effective accuracy improvement.

    Header/Footer Removal

    Headers and footers — page numbers, document titles, confidentiality notices, date stamps — contaminate parsed output if not removed. They appear in chunked text, pollute embeddings, and can surface in RAG retrieval as false matches.

    Ertas achieved the highest header/footer removal accuracy at 93.7% by using a dedicated post-processing node that analyzes repeating text patterns across pages. Content that appears in the same position on more than 70% of pages is classified as header/footer material and stripped.
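    The repeating-pattern rule described above can be sketched in a few lines. This is a simplified illustration, not Ertas's implementation: pages are represented as lists of text lines, and digits are normalized so that varying page numbers still match.

    ```python
    from collections import Counter

    def strip_repeating_edges(pages: list[list[str]],
                              threshold: float = 0.7) -> list[list[str]]:
        """Drop a page's first/last line if the same text (same position)
        appears on more than `threshold` of all pages."""
        def key(line: str) -> str:
            # Normalize digits so "Page 1" and "Page 2" compare equal
            return "".join("#" if c.isdigit() else c for c in line.strip().lower())

        n = len(pages)
        first = Counter(key(p[0]) for p in pages if p)
        last = Counter(key(p[-1]) for p in pages if p)
        cleaned = []
        for page in pages:
            lines = list(page)
            if lines and first[key(lines[0])] / n > threshold:
                lines = lines[1:]
            if lines and last[key(lines[-1])] / n > threshold:
                lines = lines[:-1]
            cleaned.append(lines)
        return cleaned

    pages = [
        ["ACME Corp - Confidential", "Revenue grew 12% in Q3.", "Page 1"],
        ["ACME Corp - Confidential", "Margins held steady.", "Page 2"],
        ["ACME Corp - Confidential", "Outlook remains positive.", "Page 3"],
    ]
    print(strip_repeating_edges(pages))
    ```

    A production implementation would also compare positional bounding boxes rather than line order alone, so that footers containing substantive footnotes are not stripped along with page numbers.
    
    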

    Docling's layout model identifies headers and footers structurally but does not always remove them from the output — they appear as tagged elements that downstream consumers must filter. Without filtering, they remain in the parsed text.

    Marker's approach to header/footer handling was the least reliable, particularly for footers that contain substantive content (like table footnotes) intermixed with page numbers.

    Processing Speed

    Marker was the fastest tool at 6.1 pages per second, nearly twice as fast as Docling (3.2 pages/sec). Marker's speed advantage comes from using smaller, specialized models rather than a single large layout analysis model.

    Unstructured.io's hi-res strategy processed 4.8 pages per second. Its fast strategy (without layout analysis) reached 12.3 pages per second but with significantly reduced accuracy.

    Ertas was the slowest at 2.9 pages per second because the visual pipeline executes multiple processing nodes sequentially — quality scoring, format normalization, parsing, and post-processing. Each node adds latency. For batch processing of large archives, this tradeoff favors accuracy over speed. For real-time document processing, speed may be the binding constraint.

    When to Use Each Tool

    Choose Docling when you need the highest table extraction accuracy and are building your own processing pipeline in Python. It is MIT-licensed, well-documented, and actively maintained by IBM Research. Best for teams with engineering capacity to build around a parsing library.

    Choose Unstructured.io when you need broad file format support beyond PDF. Its 64+ format support is unmatched, and the commercial platform adds workflow orchestration. Best for teams processing diverse document types where PDF is one format among many.

    Choose Marker when processing speed is the primary constraint and your documents are predominantly text-heavy with simple layouts. Its reading order handling is the best available. Best for teams processing large volumes of research papers, articles, or single-column documents.

    Choose Ertas Visual Pipeline when you need parsing as part of an integrated data pipeline with PII redaction, quality scoring, and downstream chunking/embedding. The visual node-graph interface means pipeline configuration does not require code, and every processing step is logged for audit trails. Best for teams in regulated industries or service providers delivering compliant data pipelines to clients.

    Limitations of This Benchmark

    Several caveats apply:

    • Corpus bias. Our 500-document corpus skews toward North American English-language business documents. Performance on documents in other languages, scripts, or layouts may differ.
    • Version sensitivity. All tools are under active development. Docling 2.x, Unstructured 0.16, and Marker 1.x were tested. Results may not hold for future versions.
    • Hardware dependency. GPU availability significantly affects tools that use deep learning models for layout analysis. CPU-only performance is substantially slower for Docling and Unstructured hi-res mode.
    • Integration effects. Standalone tool benchmarks do not capture integration costs — the engineering time to connect a parser to downstream pipeline stages. This favors integrated solutions but is not reflected in the accuracy numbers.

    Conclusion

    There is no single best PDF parser for all use cases. Docling leads in table extraction accuracy (97.9%), Marker leads in speed (6.1 pages/sec) and multi-column handling (96.1%), and Unstructured leads in format coverage (64+ types).

    For enterprise AI training data pipelines where accuracy matters more than speed, Docling-based approaches (including Ertas's visual pipeline) are the strongest choice. The 4 to 6 percentage point accuracy advantage over Marker in table extraction compounds across thousands of documents — representing thousands of table cells that are correctly extracted rather than lost or garbled.
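    To make the compounding concrete, a back-of-the-envelope calculation with hypothetical corpus numbers (the document count and cells-per-document are assumptions, not benchmark data):

    ```python
    # How a 6.2-point table-extraction gap compounds across an archive
    docs = 10_000
    cells_per_doc = 400              # assumed average table cells per document
    docling_f1, marker_f1 = 0.979, 0.917

    extra_correct = docs * cells_per_doc * (docling_f1 - marker_f1)
    print(f"{extra_correct:,.0f} additional correctly extracted cells")
    ```

    On those assumptions the gap amounts to roughly a quarter-million table cells that survive parsing intact instead of being lost or garbled.
    
    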

    For teams building integrated pipelines with compliance requirements, the pipeline approach adds value that standalone parsing cannot provide: pre-processing that improves accuracy on degraded scans, post-processing that removes contamination, and audit logging that satisfies regulatory requirements. The throughput cost of this integration (2.9 vs 3.2 pages/sec for standalone Docling) is modest relative to the accuracy and observability gains.

    For detailed benchmarks of the full enterprise data pipeline including redaction, chunking, and embedding stages, see our comprehensive benchmark report.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
