
    Multi-Modal Document Processing: Extracting Tables, Images, and Text from a Single PDF

    Enterprise PDFs contain text, tables, charts, and images — each requiring different extraction methods. Here's how synthetic parsing pipelines route each element to the right model for accurate extraction.

Ertas Team

    Open any enterprise PDF — a construction specification, a medical record, a financial report — and you'll find at least three different types of content on a single page. Narrative text explaining procedures. Tables listing quantities, prices, or test results. Technical drawings or charts conveying spatial or statistical information. Headers, footers, and page numbers providing structural context.

    Each of these content types requires a fundamentally different extraction approach. And that's where most document processing pipelines fall apart.

    Why Single-Model Approaches Fail

    The instinct is to throw one model at the entire document. Run OCR on every page, get text out, call it done. This produces three predictable failures:

    Tables become garbled text. OCR tools read left-to-right, top-to-bottom. A table with merged cells, multi-line rows, or nested headers gets serialized into a meaningless string. "Item Description Unit Price Qty Total" becomes a flat sequence with no structural relationship between the values. A bill of quantities with 200 line items becomes unusable.

    Images become invisible. Text extraction tools skip images entirely or produce placeholder text like "[Figure 1]". Technical drawings, flow diagrams, and charts contain critical information — dimensions, process flows, data trends — that the text extractor cannot see.

    Structure gets lost. Even when text extraction is accurate, the hierarchical structure of the document — which sections contain which subsections, which text is a caption versus body content — disappears. A 50-page specification becomes a flat text dump with no navigable structure.

    The accuracy numbers tell the story. Single-model approaches achieve 60-75% accuracy on mixed-content enterprise documents. That's not usable for any downstream application — especially not for training AI models that need correct ground truth.

    The Synthetic Parsing Pipeline Architecture

    The 2026 approach to document processing is the synthetic parsing pipeline: a multi-stage architecture where each document element is routed to a specialized model that handles it best.

    The architecture follows a clear flow:

Document In → Layout Analysis (detect and classify regions) → Routing (text regions to NLP models, table regions to table extraction models, image regions to vision models) → Output Combination → Document Out

    This is not a single model doing everything. It is an ensemble of specialists, each handling what it does best, coordinated by a layout analysis stage that knows what's where on each page.
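To make the routing concrete, here is a minimal sketch in Python. The Region type and the extractor stubs are illustrative placeholders, not any particular library's API; in a real pipeline, each stub wraps a dedicated model.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """One detected region on a page: a bounding box plus a class label."""
    page: int
    bbox: tuple   # (x0, y0, x1, y1) in page coordinates
    label: str    # "text", "table", "figure", "header", ...

def extract_text(region):
    # Placeholder: would run OCR / a text model on the cropped region.
    return {"type": "text", "content": "..."}

def extract_table(region):
    # Placeholder: would run a table-structure model on the cropped region.
    return {"type": "table", "cells": []}

def extract_figure(region):
    # Placeholder: would run a vision model on the cropped region.
    return {"type": "figure", "metadata": {}}

# The router: each region class maps to the specialist that handles it best.
ROUTES = {
    "text": extract_text,
    "header": extract_text,
    "table": extract_table,
    "figure": extract_figure,
}

def route(regions):
    """Send every detected region to its specialist and collect the results."""
    return [ROUTES[r.label](r) for r in regions if r.label in ROUTES]
```

The design point is the dispatch table itself: adding support for a new content type means adding one specialist and one route, without touching the others.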

    Stage 1: Layout Analysis

    Layout analysis is the traffic controller. It examines each page and classifies regions into categories: text block, table, figure, header, footer, caption, page number, sidebar, watermark.

    Modern layout analysis models (LayoutLMv3, DiT, YOLO-based detectors) achieve 92-96% accuracy on region classification for standard enterprise documents. They output bounding boxes with class labels — essentially a map of every page showing where each content type lives.
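In practice, that per-page map is just a list of labeled boxes with confidence scores. The exact schema varies by model, so treat the field names below as illustrative rather than any specific detector's output:

```python
# Illustrative layout-analysis output for one page. Field names are
# representative; each detector has its own schema.
page_layout = [
    {"bbox": [72, 90, 540, 180],  "label": "text",   "score": 0.98},
    {"bbox": [72, 200, 540, 520], "label": "table",  "score": 0.95},
    {"bbox": [72, 540, 300, 700], "label": "figure", "score": 0.93},
    {"bbox": [72, 740, 540, 760], "label": "footer", "score": 0.99},
]
```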

    The accuracy of layout analysis gates the entire pipeline. If a table region is misclassified as text, it gets sent to the text extractor and comes out garbled. If a figure is classified as a table, the table parser produces nonsense. Investing in high-quality layout analysis pays dividends at every downstream stage.

    For enterprise documents with consistent templates (invoices, forms, reports from the same system), layout analysis accuracy reaches 98%+ because the model learns the specific template structure. For heterogeneous document collections, accuracy is lower but still sufficient at 92-94%.

    Stage 2: Text Extraction

    Text regions — paragraphs, bullet lists, numbered lists, headers — go through text extraction optimized for prose. This is where OCR excels, especially when it knows it's dealing with continuous text rather than structured layouts.

    Key considerations for text extraction in enterprise documents:

    Font handling. Enterprise PDFs use a mix of fonts, including embedded custom fonts. Quality text extraction handles font encoding correctly — a common failure point where characters like fi ligatures or special symbols get corrupted.

    Column detection. Many enterprise documents use multi-column layouts. The text extractor needs to read columns correctly — left column fully, then right column — rather than reading across columns.

    Reading order. Headers, body text, footnotes, and sidebars all appear on the same page. The extractor must determine the correct reading order, which is not always top-to-bottom.

    Accuracy target: 98%+ character-level accuracy for clean, digital PDFs. 94-96% for scanned documents.
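Reading order is worth a concrete example. A common heuristic, sketched below, clusters text regions into columns by their left edge and then reads each column top to bottom. It assumes the Region objects from the routing sketch above; production pipelines often replace it with a learned reading-order model.

```python
def reading_order(regions, column_gap=50):
    """Order text regions column-by-column, top-to-bottom within each column.

    Heuristic: regions whose left edges fall within `column_gap` points of
    each other are treated as the same column.
    """
    columns = []  # list of [left_x, [regions]] pairs
    for r in sorted(regions, key=lambda r: r.bbox[0]):
        for col in columns:
            if abs(col[0] - r.bbox[0]) < column_gap:
                col[1].append(r)
                break
        else:
            columns.append([r.bbox[0], [r]])
    ordered = []
    for _, members in sorted(columns, key=lambda c: c[0]):
        ordered.extend(sorted(members, key=lambda r: r.bbox[1]))
    return ordered
```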

    Stage 3: Table Extraction

    Table extraction is the most technically demanding stage. Enterprise tables are structurally complex:

    Merged cells span multiple rows or columns. A header like "Concrete Specifications" might span 5 columns. A category label might span 15 rows.

    Nested headers create multi-level column structures. Row 1 might say "Phase 1" spanning 3 columns, row 2 might say "Material," "Quantity," "Cost" under that span.

    Multi-line cells contain wrapped text that occupies 2-3 lines within a single logical cell. The extractor must group these lines into a single cell value.

    Spanning tables continue across page breaks. The header row appears on page 1, and the data continues on pages 2 and 3 without repeating the header.

    Specialized table extraction models (TableTransformer, DETR-based models, and commercial alternatives) handle these structures with 85-92% accuracy on cell-level extraction. The output is structured — typically JSON or CSV — with row/column relationships preserved.
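What makes the output structured is that each cell carries its grid position and span, which is exactly what preserves merged cells and nested headers. A minimal, hypothetical representation of the nested-header example above:

```python
# Hypothetical cell-level table output. Row/column spans are what let
# merged cells and nested headers survive extraction intact.
table = {
    "page": 12,
    "cells": [
        {"row": 0, "col": 0, "row_span": 1, "col_span": 3, "text": "Concrete Specifications"},
        {"row": 1, "col": 0, "row_span": 1, "col_span": 1, "text": "Material"},
        {"row": 1, "col": 1, "row_span": 1, "col_span": 1, "text": "Quantity"},
        {"row": 1, "col": 2, "row_span": 1, "col_span": 1, "text": "Cost"},
        {"row": 2, "col": 0, "row_span": 1, "col_span": 1, "text": "C30/37 mix"},
        {"row": 2, "col": 1, "row_span": 1, "col_span": 1, "text": "140 m3"},
        {"row": 2, "col": 2, "row_span": 1, "col_span": 1, "text": "18,200"},
    ],
}
```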

    For training data preparation, table accuracy matters enormously. If your AI model is learning to extract line items from bills of quantities, every misaligned row or merged cell error becomes a mislabeled training example.

    Stage 4: Image Handling

    Images in enterprise documents aren't photographs — they're technical drawings, process flow diagrams, bar charts, pie charts, floor plans, and circuit diagrams. Each subcategory requires different handling:

    Charts and graphs contain quantitative data that should be extracted as structured values. A bar chart showing monthly revenue should produce a data series: [("Jan", 1.2M), ("Feb", 1.4M), ...]. Vision models with chart understanding (ChartQA, MatCha) achieve 80-88% accuracy on data extraction from charts.

    Technical drawings contain spatial and dimensional information. The relevant extraction depends on the use case — for some applications, a textual description suffices; for others, extracting specific dimensions or annotations is required.

    Flow diagrams represent process steps with connections. Extraction produces a graph structure: nodes (process steps) and edges (connections between them).

    Photographs and illustrations may require captioning or classification but rarely need structured data extraction.

    The image handling stage classifies each figure into its subcategory and applies the appropriate extraction model. For training data purposes, the key output is structured metadata that can be included alongside text and table data in the final dataset.
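Like the top-level router, figure handling is classify-then-dispatch. The model calls below are stand-ins for whatever classifier, chart-understanding model, and captioner a given pipeline uses:

```python
# Placeholder models: a real pipeline would back these with a figure-type
# classifier, a chart-understanding model, and a captioning model.
def classify_figure(image): return "chart"
def extract_chart_data(image): return [("Jan", 1.2e6), ("Feb", 1.4e6)]
def caption_image(image): return "bar chart of monthly revenue"

def handle_figure(image):
    """Classify a figure crop, then dispatch to the matching extractor."""
    kind = classify_figure(image)
    if kind == "chart":
        # quantitative figures become structured data series
        return {"type": "chart", "series": extract_chart_data(image)}
    if kind == "flow_diagram":
        # process diagrams become graph structures: steps and connections
        return {"type": "flow", "nodes": [], "edges": []}
    # photographs and illustrations: caption or classify, no structured data
    return {"type": "image", "caption": caption_image(image)}
```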

    Stage 5: Output Combination

    The final stage combines outputs from all modalities into a single structured representation. This is where cross-modal validation happens:

    Reference resolution. The text says "See Table 3-2 for material quantities." The combiner links this reference to the extracted table, creating a navigable connection.

    Caption matching. Figure captions extracted as text are matched to their corresponding extracted images.

    Section hierarchy. Text, tables, and figures are organized within the document's section structure, preserving the logical flow of information.

    The combined output is a structured JSON document where every element — paragraph, table, figure — is tagged with its type, position, content, and relationships to other elements. This structured representation is directly usable for generating training data.
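A fragment of what that combined representation might look like, with field names chosen purely for illustration:

```python
combined = {
    "doc_id": "spec-0042",
    "elements": [
        {"id": "p17", "type": "paragraph", "section": "3.2",
         "text": "See Table 3-2 for material quantities.",
         "refs": ["t3-2"]},   # resolved cross-reference to the table below
        {"id": "t3-2", "type": "table", "section": "3.2",
         "caption": "Table 3-2: Material quantities",
         "cells": "..."},
        {"id": "f3-1", "type": "figure", "section": "3.1",
         "caption": "Figure 3-1: Pour sequence",
         "metadata": {"kind": "flow"}},
    ],
}
```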

    Quality Validation

A synthetic parsing pipeline has multiple stages, and errors compound. If layout analysis is 95% accurate and table extraction is 90% accurate, the combined accuracy for tables is 0.95 × 0.90 = 0.855, or 85.5%. Quality validation at the end of the pipeline catches errors that individual stages miss.

Cross-modal validation: If the text mentions "47 line items in the bill of quantities" and the extracted table has 43 rows, something was missed. Automated checks compare extracted counts against textual references (a sketch follows below).

    Consistency checks: Column totals should sum to the stated total. Referenced figure numbers should match extracted figures. Page references should be valid.

    Confidence scoring: Each extracted element gets a confidence score. Elements below a threshold (typically 0.85) are flagged for human review. This focuses human effort on the 10-15% of elements that the pipeline is least confident about, rather than reviewing everything.

    Sampling-based audit: Randomly select 5% of processed documents for full human review. Track accuracy over time to detect pipeline degradation.
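The cross-modal and confidence checks lend themselves to simple automation. A minimal sketch, assuming each table dict carries a "rows" field and each element a "score" field; the regex pattern is illustrative, and real pipelines use a broader pattern set or a model-based checker:

```python
import re

def validate_counts(text, tables):
    """Cross-modal check: counts stated in prose vs. rows actually extracted."""
    issues = []
    for m in re.finditer(r"(\d+)\s+line items", text):
        stated = int(m.group(1))
        extracted = sum(len(t["rows"]) for t in tables)
        if extracted != stated:
            issues.append(f"text states {stated} line items, extracted {extracted}")
    return issues

def flag_low_confidence(elements, threshold=0.85):
    """Route elements below the confidence threshold to human review."""
    return [e for e in elements if e.get("score", 1.0) < threshold]
```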

    Common Enterprise Document Types

    Different document types stress different parts of the pipeline:

    Construction BOQs (Bills of Quantities): Table-heavy, with complex nested structures, merged cells, and multi-page tables. The table extraction stage does most of the work. Typical accuracy challenge: merged category headers that span data rows.

    Medical records: Mix of narrative text (clinical notes), structured data (lab results in tables), and images (scans, X-rays). The text extraction stage handles clinical narratives while table extraction captures lab values. PHI/PII handling adds a compliance layer.

    Legal contracts: Primarily text with numbered clauses, definitions, and cross-references. The text extraction stage is dominant, but handling nested numbering schemes (1.1.1.a.i) and cross-reference resolution is critical.

    Financial statements: Structured tables with precise numerical values, footnotes referencing table entries, and charts showing trends. Table extraction accuracy is paramount — a decimal point error in a financial figure cascades into downstream analysis.

    Processing at Scale

    Enterprise document processing is not a one-time exercise. Organizations process thousands to millions of pages. At scale, two factors dominate:

Throughput. A synthetic parsing pipeline with GPU-accelerated layout analysis and table extraction processes 50-100 pages per minute on a single workstation. For a 700GB document archive (on the order of 1.4 million pages, if you assume roughly 0.5MB per page), that works out to roughly 2-3 weeks of continuous processing: feasible but not trivial.

    Error handling. At scale, some documents will fail to process. Corrupted PDFs, password-protected files, unusual encodings, scanned documents at odd angles. The pipeline needs a quarantine queue for failed documents and a triage process for deciding which failures to fix versus skip.
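Operationally, the quarantine queue tends to be a try/except boundary around the whole per-document pipeline. A minimal sketch, assuming a process_document function that runs the stages described above:

```python
import logging

quarantine = []  # failed documents, kept for later triage

def process_archive(paths, process_document):
    """Run the pipeline over an archive; quarantine failures instead of crashing."""
    for path in paths:
        try:
            yield process_document(path)
        except Exception as exc:
            # corrupted PDF, password protection, odd encoding, skewed scan...
            logging.warning("quarantined %s: %s", path, exc)
            quarantine.append({"path": path, "error": str(exc)})
```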

    Ertas Data Suite implements the full synthetic parsing pipeline — layout analysis, text extraction, table extraction, image handling, and output combination — in a single platform running on your infrastructure. Each stage is optimized for enterprise document types, with confidence scoring and human-in-the-loop review for low-confidence extractions. The unified output feeds directly into labeling and export workflows, eliminating the manual format conversion that slows down most multi-tool approaches.

