    Preparing Synthetic Parsing Pipelines: The 2026 Approach to Document Processing


    Document processing in 2026 isn't one model's job anymore. Synthetic parsing pipelines break documents into parts and route each to a specialized model. Here's how to prepare data for this architecture.

Ertas Team

    Document processing used to be one model's job. Feed a PDF to an OCR engine. Get text out. Maybe run a table extractor on the side. Stitch the outputs together manually.

    That approach hit a ceiling around 2024. Enterprise documents are too complex — a single construction specification contains narrative text, bills of quantities in nested tables, technical drawings with dimensions, charts showing project timelines, and cross-references linking all of them together. No single model handles all of these content types well.

    The 2026 approach is the synthetic parsing pipeline: a multi-stage architecture where the document is broken into components, each component is routed to a specialized model, and the outputs are recombined into a single structured representation. "Synthetic" because the final output is synthesized from multiple models' contributions, not produced by a single model.

    This article focuses on the data preparation side: how to create the training data that powers each stage of the pipeline.

    The Pipeline Architecture

A synthetic parsing pipeline has four model stages plus a combiner, and each model stage needs its own training data.

    Stage 1: Layout Detector

    The layout detector examines each page and identifies regions: where is the text? Where are the tables? Where are the figures? Where are the headers and footers?

    The output is a set of bounding boxes, each labeled with a region type: text_block, table, figure, header, footer, caption, page_number, sidebar, watermark.

    This is an object detection problem, solved with models like LayoutLMv3, DiT (Document Image Transformer), or YOLO variants trained on document layouts.

    Stage 2: Text Extractor

    Text regions identified by the layout detector are sent to the text extraction stage. This stage produces clean, structured text from each text region — handling fonts, columns, reading order, and special characters.

    Stage 3: Table Parser

    Table regions go to a specialized table parsing model that understands row/column structure, merged cells, multi-level headers, and multi-page tables.

    Stage 4: Image Analyzer

    Figure regions go to a vision model that classifies the figure type (chart, diagram, photo, drawing) and extracts relevant structured information.

    Combiner

    The combiner merges outputs from all stages into a single structured document representation, resolving cross-references and maintaining the logical document structure.
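As a sketch, the routing logic across the four stages plus the combiner looks roughly like this. The stage functions below are trivial stubs standing in for the specialized models; the region fields and function names are illustrative, not a fixed API:

```python
# Routing loop for a synthetic parsing pipeline. The stage functions below
# are trivial stubs; in a real pipeline each would call a specialized model.

def detect_layout(page):
    # Stage 1 stub: a fixed layout with one text block, one table, one footer.
    return [{"type": "text_block", "bbox": [0, 0, 100, 40]},
            {"type": "table", "bbox": [0, 50, 100, 90]},
            {"type": "footer", "bbox": [0, 95, 100, 100]}]

def extract_text(page, region):    # Stage 2 stub
    return {"kind": "text", "text": page.get("text", "")}

def parse_table(page, region):     # Stage 3 stub
    return {"kind": "table", "rows": page.get("rows", 0)}

def analyze_figure(page, region):  # Stage 4 stub
    return {"kind": "figure"}

ROUTES = {"text_block": extract_text, "table": parse_table, "figure": analyze_figure}

def parse_document(pages):
    parts = []
    for page_num, page in enumerate(pages):
        for region in detect_layout(page):
            handler = ROUTES.get(region["type"])
            if handler is None:
                continue  # headers, footers, watermarks carry no content
            parts.append({"page": page_num, "bbox": region["bbox"],
                          "content": handler(page, region)})
    # Combiner stub: recombine parts in page order, then top-to-bottom.
    return sorted(parts, key=lambda p: (p["page"], p["bbox"][1]))
```

The dispatch table is the important part: adding a new content type (say, equations) means adding one region category and one handler, without touching the other stages.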

    Each of these stages can be improved by fine-tuning on domain-specific data. Here's what training data preparation looks like for each one.

    Data Preparation for the Layout Detector

    What You Need

    Annotated document pages where every region has been identified with a bounding box and classified. This is an object detection annotation task — the same type of annotation used for training YOLO models on natural images, but applied to document pages.

    Annotation Workflow

    Step 1: Select representative pages. Don't annotate every page of every document. Select 200-500 pages that represent the full variety of layouts your pipeline will encounter. Include:

    • Text-heavy pages (reports, narratives)
    • Table-heavy pages (financial statements, BOQs)
    • Mixed pages (tables with surrounding text and captions)
    • Figure pages (technical drawings, charts)
    • Complex pages (multi-column layouts, sidebars, nested elements)

    Step 2: Define region categories. A practical category set for enterprise documents:

    • text_block: Continuous prose (paragraphs, bullet lists)
    • table: Tabular data with row/column structure
    • figure: Images, charts, drawings, diagrams
    • header: Page headers, section headers
    • footer: Page footers, footnotes
    • caption: Text that labels a figure or table
    • page_number: Page numbering
    • sidebar: Content in side columns or callout boxes
    • watermark: Background text or images to be ignored

    Start with fewer categories and add more only if the pipeline needs the distinction. Nine categories is usually enough for enterprise documents.

    Step 3: Annotate bounding boxes. For each page, draw bounding boxes around every region and assign the appropriate category. Use a bounding box annotation tool (CVAT, LabelImg, or a platform that supports object detection annotation).

    Key annotation guidelines:

    • Boxes should be tight — minimal whitespace padding
    • Overlapping regions should be annotated as the most specific type (a caption overlapping a figure gets a caption box, not a figure box)
    • Multi-column pages get separate boxes per column
    • Tables that span page breaks get a box on each page
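As an illustration, a single annotated page following these guidelines could be stored as a COCO-style record. The field names, file name, and pixel values below are one plausible convention, not a required schema:

```python
# One annotated page, loosely following the COCO object-detection convention.
# bbox = [x, y, width, height] in pixels; all values here are illustrative.
page_annotation = {
    "image": {"file_name": "spec_p017.png", "width": 2480, "height": 3508},
    "annotations": [
        {"category": "text_block",  "bbox": [180, 300, 2100, 620]},
        {"category": "table",       "bbox": [180, 980, 2100, 1450]},
        {"category": "caption",     "bbox": [180, 2460, 2100, 70]},
        {"category": "page_number", "bbox": [1180, 3400, 120, 60]},
    ],
}
```

Tools like CVAT export this format directly, which keeps the annotations compatible with standard object-detection training scripts.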

    Step 4: Quality validation. Have a second annotator review 20% of the annotations. Inter-annotator agreement should be above 90% on region classification and within 5% on bounding box coordinates.
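The agreement thresholds in Step 4 can be checked mechanically. A minimal sketch, assuming each annotator's output is a list of `{category, bbox}` records with boxes as `[x1, y1, x2, y2]`:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def agreement(ann_a, ann_b, iou_threshold=0.9):
    """Fraction of matched regions where two annotators agree on category.
    Only box pairs overlapping above the IoU threshold count as matched."""
    matched = agreed = 0
    for ra in ann_a:
        best = max(ann_b, key=lambda rb: iou(ra["bbox"], rb["bbox"]), default=None)
        if best and iou(ra["bbox"], best["bbox"]) >= iou_threshold:
            matched += 1
            agreed += ra["category"] == best["category"]
    return agreed / matched if matched else 0.0
```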

    Scale Requirements

    For a layout detector fine-tuned to your specific document types:

    • Minimum: 200 annotated pages. Achieves 88-92% accuracy on region classification.
    • Recommended: 500 annotated pages. Achieves 93-96% accuracy.
    • Optimal: 1,000+ annotated pages. Achieves 96-98% accuracy on consistent document types.

    If your documents use consistent templates (same report format, same invoice layout), 200 pages is often sufficient. For heterogeneous document collections (multiple vendors, multiple formats), aim for 500+.

    Data Preparation for the Text Extractor

    What You Need

    Ground truth text for each text region — the correct plain text that should be extracted from that region of the page.

    Creating Ground Truth

    For digital PDFs (text layer present): The PDF's embedded text can serve as ground truth, but it needs validation. Embedded text sometimes has encoding errors, incorrect reading order, or missing characters.

    Process: Extract text programmatically from the PDF, manually review a sample of 50-100 regions, and fix any systematic extraction errors. If the embedded text is consistently correct (>98% character accuracy), use it as ground truth. If not, manual transcription is needed.
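The >98% character-accuracy bar can be estimated during that manual review with a character-level similarity score; `difflib`'s ratio is a cheap stand-in for one minus the character error rate:

```python
import difflib

def char_accuracy(extracted: str, ground_truth: str) -> float:
    """Character-level similarity between extracted text and a manually
    verified transcription (1.0 = identical)."""
    return difflib.SequenceMatcher(None, extracted, ground_truth).ratio()

def embedded_text_usable(samples, threshold=0.98):
    """samples: list of (extracted, ground_truth) pairs from the review.
    Returns True if the embedded text layer clears the accuracy bar."""
    scores = [char_accuracy(e, g) for e, g in samples]
    return sum(scores) / len(scores) >= threshold
```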

    For scanned documents (image-only): Ground truth must be created by human transcription. This is labor-intensive but necessary for training accurate OCR models on your specific document types.

    Time estimate: manual transcription of a text region takes 1-3 minutes depending on length. For 500 annotated pages with an average of 5 text regions each, that's 2,500 regions × 2 minutes = ~83 hours. Spread across a team, this is 2-3 weeks of work.

    For specialized fonts or symbols: If your documents use domain-specific symbols (engineering notation, mathematical formulas, musical notation), ensure these are correctly represented in the ground truth. Standard OCR models often fail on specialized symbols — your fine-tuned model can learn them, but only if the ground truth includes them.

    Scale Requirements

    • Minimum: 500 text regions with ground truth
    • Recommended: 2,000 text regions
    • Optimal: 5,000+ text regions for maximum accuracy across varied fonts and layouts

    Data Preparation for the Table Parser

    What You Need

    Structured ground truth for each table — the correct row/column structure with cell values, merged cell information, and header relationships.

    The Challenge

    Table parsing ground truth is the most complex annotation in the pipeline. A single table requires:

    • Identifying the number of rows and columns
    • Mapping each cell's content to its row/column position
    • Marking merged cells with their span (e.g., a cell spanning columns 2-4 in row 1)
    • Identifying header rows versus data rows
    • Handling nested headers (multi-level column structures)
    • Connecting multi-page tables

    This is significantly more work per annotation than text transcription or bounding box drawing.

    Annotation Workflow

    Step 1: Identify table types in your documents. Common enterprise table types:

    • Simple tables (regular grid, no merged cells)
    • Tables with merged header cells
    • Tables with nested row/column headers
    • Tables with multi-line cells
    • Tables spanning multiple pages
    • Tables with subtotals and grand totals

    Categorize your tables so you can ensure coverage across all types.

    Step 2: Define the output format. The ground truth should be in a structured format that captures all table relationships. A practical format:

    {
      "rows": 15,
      "columns": 5,
      "headers": [
        {"text": "Item", "row": 0, "col": 0, "rowspan": 1, "colspan": 1},
        {"text": "Description", "row": 0, "col": 1, "rowspan": 1, "colspan": 1},
        {"text": "Specifications", "row": 0, "col": 2, "rowspan": 1, "colspan": 3}
      ],
      "cells": [
        {"text": "1.01", "row": 1, "col": 0},
        {"text": "Concrete Grade 30", "row": 1, "col": 1},
        ...
      ]
    }
    

    Step 3: Annotate tables. For each table, produce the structured ground truth. Use a table annotation tool that supports merged cells and multi-level headers — or export to a spreadsheet for manual structuring.

    Time estimate: simple tables take 5-10 minutes each. Complex tables with merged cells and nested headers take 15-30 minutes. Budget accordingly.

    Step 4: Validate. Round-trip validation: render the structured ground truth back into a visual table and compare against the original. Discrepancies indicate annotation errors.
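Part of the round-trip can be automated. A sketch that expands the Step 2 format back into a 2-D grid for visual comparison, with merged cells repeating their text across the spanned positions:

```python
def render_grid(table):
    """Expand the structured ground truth (Step 2 format) into a 2-D grid
    so it can be compared against the original table. Merged cells repeat
    their text across all spanned positions."""
    grid = [["" for _ in range(table["columns"])] for _ in range(table["rows"])]
    for cell in table["headers"] + table["cells"]:
        r0, c0 = cell["row"], cell["col"]
        for r in range(r0, r0 + cell.get("rowspan", 1)):
            for c in range(c0, c0 + cell.get("colspan", 1)):
                grid[r][c] = cell["text"]
    return grid
```

Out-of-range indices raise immediately, so the renderer doubles as a structural sanity check on the annotation itself.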

    Scale Requirements

    Table parsing requires more training data than layout detection because of structural complexity:

    • Minimum: 300 tables with ground truth. Handles simple table structures.
    • Recommended: 1,000 tables. Handles merged cells and standard header structures.
    • Optimal: 2,000+ tables. Handles complex nested headers, multi-page tables, and irregular structures.

    Include at least 50 examples of each table type identified in Step 1.

    Data Preparation for the Image Analyzer

    What You Need

    For each figure, two types of ground truth:

    1. Classification: What type of figure is it? (bar chart, line chart, flow diagram, technical drawing, photo, map)
    2. Structured extraction: What information does the figure contain?

    Classification Ground Truth

    Classify each figure into its subcategory. This is a straightforward image classification task requiring 20-50 examples per category.

    For enterprise documents, typical categories:

    • Bar chart
    • Line chart
    • Pie chart
    • Flow diagram
    • Organizational chart
    • Technical drawing
    • Floor plan / site plan
    • Photograph
    • Logo / decorative image

    Extraction Ground Truth

    For data-bearing figures (charts and diagrams), create structured ground truth:

    Charts: Extract the data series. A bar chart showing quarterly revenue should produce: [{"quarter": "Q1", "revenue": 1200000}, {"quarter": "Q2", "revenue": 1450000}, ...]

    Flow diagrams: Extract nodes and edges. {"nodes": ["Start", "Review", "Approve", "Reject", "End"], "edges": [["Start", "Review"], ["Review", "Approve"], ["Review", "Reject"], ...]}
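Graph-style ground truth like this benefits from an automated consistency check before it enters training. A small sketch, assuming the node/edge format shown above:

```python
def validate_diagram(diagram):
    """Check that a flow-diagram ground truth record is internally
    consistent: node names unique, every edge endpoint a declared node."""
    nodes = diagram["nodes"]
    errors = []
    if len(set(nodes)) != len(nodes):
        errors.append("duplicate node names")
    for src, dst in diagram["edges"]:
        for endpoint in (src, dst):
            if endpoint not in nodes:
                errors.append(f"edge endpoint {endpoint!r} not in nodes")
    return errors
```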

    Technical drawings: Extract key dimensions, labels, and annotations.

    Time estimate: chart data extraction takes 3-5 minutes per chart. Diagram extraction takes 5-15 minutes depending on complexity.

    Scale Requirements

    • Classification: 20-50 examples per figure type (150-400 total)
    • Chart data extraction: 100-300 charts with ground truth
    • Diagram extraction: 50-200 diagrams with ground truth

    Image analysis typically requires less training data than table parsing because pre-trained vision models already understand chart and diagram structures — fine-tuning adds domain-specific calibration.

    End-to-End Quality Validation

    After preparing training data for each stage, validate the full pipeline end-to-end:

    Step 1: Process 50 held-out documents through the complete pipeline.

    Step 2: Compare the pipeline's structured output against manually created ground truth for those documents.

    Step 3: Measure accuracy at each stage:

    • Layout detection: region classification accuracy and bounding box IoU
    • Text extraction: character-level accuracy
    • Table parsing: cell-level accuracy
    • Image analysis: classification accuracy and extraction accuracy

    Step 4: Identify the weakest stage. The weakest stage gates the entire pipeline's accuracy. If layout detection is 97% accurate but table parsing is 82%, improving table parsing training data yields the highest ROI.
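A rough way to see why the weakest stage gates the pipeline: for content that must pass through several stages, per-stage accuracies compound multiplicatively (assuming independent errors):

```python
# Back-of-the-envelope: stage errors compound multiplicatively for any
# content that must pass through several stages (independence assumed).
def end_to_end_accuracy(stage_accuracies):
    result = 1.0
    for acc in stage_accuracies:
        result *= acc
    return result

# A table cell must survive layout detection AND table parsing:
# 0.97 * 0.82 ≈ 0.795 — the 82% stage dominates the product.
```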

    Step 5: Iterate. Add more training data to the weakest stage, retrain, re-evaluate. Repeat until all stages meet your accuracy target.

    Timeline and Resources

    For a typical enterprise pipeline processing construction documents:

    Stage                 | Annotation Volume | Time Estimate | Personnel
    Layout detector       | 500 pages         | 2-3 weeks     | 1-2 annotators
    Text extractor        | 2,000 regions     | 2-3 weeks     | 2-3 annotators
    Table parser          | 1,000 tables      | 3-4 weeks     | 2 annotators + domain expert
    Image analyzer        | 300 figures       | 1-2 weeks     | 1 annotator + domain expert
    End-to-end validation | 50 documents      | 1 week        | 1 ML engineer + domain expert
    Total: 8-12 weeks with a team of 3-4 people. Stages can overlap — text extractor annotation can begin while layout detector annotation is still in progress.

    Ertas Data Suite supports multi-stage pipeline data preparation with annotation workflows for each stage — bounding box annotation for layout detection, text transcription for text extraction, structured table annotation for table parsing, and figure classification for image analysis. The platform maintains the relationships between stages (which text regions came from which pages, which tables correspond to which bounding boxes), providing the end-to-end traceability that synthetic parsing pipelines require.

