    Preparing Synthetic Parsing Pipelines: The 2026 Approach to Document Processing


    Document processing in 2026 isn't one model's job anymore. Synthetic parsing pipelines break documents into parts and route each to a specialized model. Here's how to prepare data for this architecture.

Ertas Team

    Document processing used to be one model's job. Feed a PDF to an OCR engine. Get text out. Maybe run a table extractor on the side. Stitch the outputs together manually.

    That approach hit a ceiling around 2024. Enterprise documents are too complex — a single construction specification contains narrative text, bills of quantities in nested tables, technical drawings with dimensions, charts showing project timelines, and cross-references linking all of them together. No single model handles all of these content types well.

    The 2026 approach is the synthetic parsing pipeline: a multi-stage architecture where the document is broken into components, each component is routed to a specialized model, and the outputs are recombined into a single structured representation. "Synthetic" because the final output is synthesized from multiple models' contributions, not produced by a single model.

    This article focuses on the data preparation side: how to create the training data that powers each stage of the pipeline.

    The Pipeline Architecture

A synthetic parsing pipeline has four model stages plus a combiner, and each model stage needs its own training data.

    Stage 1: Layout Detector

    The layout detector examines each page and identifies regions: where is the text? Where are the tables? Where are the figures? Where are the headers and footers?

    The output is a set of bounding boxes, each labeled with a region type: text_block, table, figure, header, footer, caption, page_number, sidebar, watermark.

    This is an object detection problem, solved with models like LayoutLMv3, DiT (Document Image Transformer), or YOLO variants trained on document layouts.

    Stage 2: Text Extractor

    Text regions identified by the layout detector are sent to the text extraction stage. This stage produces clean, structured text from each text region — handling fonts, columns, reading order, and special characters.

    Stage 3: Table Parser

    Table regions go to a specialized table parsing model that understands row/column structure, merged cells, multi-level headers, and multi-page tables.

    Stage 4: Image Analyzer

    Figure regions go to a vision model that classifies the figure type (chart, diagram, photo, drawing) and extracts relevant structured information.

    Combiner

    The combiner merges outputs from all stages into a single structured document representation, resolving cross-references and maintaining the logical document structure.
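As a sketch, the routing logic across the four stages plus the combiner looks roughly like this. The stage functions below are trivial stubs standing in for the specialized models; the region fields and function names are illustrative, not a fixed API:

```python
# Routing loop for a synthetic parsing pipeline. The stage functions below
# are trivial stubs; in a real pipeline each would call a specialized model.

def detect_layout(page):
    # Stage 1 stub: a fixed layout with one text block, one table, one footer.
    return [{"type": "text_block", "bbox": [0, 0, 100, 40]},
            {"type": "table", "bbox": [0, 50, 100, 90]},
            {"type": "footer", "bbox": [0, 95, 100, 100]}]

def extract_text(page, region):    # Stage 2 stub
    return {"kind": "text", "text": page.get("text", "")}

def parse_table(page, region):     # Stage 3 stub
    return {"kind": "table", "rows": page.get("rows", 0)}

def analyze_figure(page, region):  # Stage 4 stub
    return {"kind": "figure"}

ROUTES = {"text_block": extract_text, "table": parse_table, "figure": analyze_figure}

def parse_document(pages):
    parts = []
    for page_num, page in enumerate(pages):
        for region in detect_layout(page):
            handler = ROUTES.get(region["type"])
            if handler is None:
                continue  # headers, footers, watermarks carry no content
            parts.append({"page": page_num, "bbox": region["bbox"],
                          "content": handler(page, region)})
    # Combiner stub: recombine parts in page order, then top-to-bottom.
    return sorted(parts, key=lambda p: (p["page"], p["bbox"][1]))
```

The dispatch table is the important part: adding a new content type (say, equations) means adding one region category and one handler, without touching the other stages.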

    Each of these stages can be improved by fine-tuning on domain-specific data. Here's what training data preparation looks like for each one.

    Data Preparation for the Layout Detector

    What You Need

    Annotated document pages where every region has been identified with a bounding box and classified. This is an object detection annotation task — the same type of annotation used for training YOLO models on natural images, but applied to document pages.

    Annotation Workflow

    Step 1: Select representative pages. Don't annotate every page of every document. Select 200-500 pages that represent the full variety of layouts your pipeline will encounter. Include:

    • Text-heavy pages (reports, narratives)
    • Table-heavy pages (financial statements, BOQs)
    • Mixed pages (tables with surrounding text and captions)
    • Figure pages (technical drawings, charts)
    • Complex pages (multi-column layouts, sidebars, nested elements)

    Step 2: Define region categories. A practical category set for enterprise documents:

    • text_block: Continuous prose (paragraphs, bullet lists)
    • table: Tabular data with row/column structure
    • figure: Images, charts, drawings, diagrams
    • header: Page headers, section headers
    • footer: Page footers, footnotes
    • caption: Text that labels a figure or table
    • page_number: Page numbering
    • sidebar: Content in side columns or callout boxes
    • watermark: Background text or images to be ignored

    Start with fewer categories and add more only if the pipeline needs the distinction. Nine categories is usually enough for enterprise documents.

    Step 3: Annotate bounding boxes. For each page, draw bounding boxes around every region and assign the appropriate category. Use a bounding box annotation tool (CVAT, LabelImg, or a platform that supports object detection annotation).

    Key annotation guidelines:

    • Boxes should be tight — minimal whitespace padding
    • Overlapping regions should be annotated as the most specific type (a caption overlapping a figure gets a caption box, not a figure box)
    • Multi-column pages get separate boxes per column
    • Tables that span page breaks get a box on each page
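As an illustration, a single annotated page following these guidelines could be stored as a COCO-style record. The field names, file name, and pixel values below are one plausible convention, not a required schema:

```python
# One annotated page, loosely following the COCO object-detection convention.
# bbox = [x, y, width, height] in pixels; all values here are illustrative.
page_annotation = {
    "image": {"file_name": "spec_p017.png", "width": 2480, "height": 3508},
    "annotations": [
        {"category": "text_block",  "bbox": [180, 300, 2100, 620]},
        {"category": "table",       "bbox": [180, 980, 2100, 1450]},
        {"category": "caption",     "bbox": [180, 2460, 2100, 70]},
        {"category": "page_number", "bbox": [1180, 3400, 120, 60]},
    ],
}
```

Tools like CVAT export this format directly, which keeps the annotations compatible with standard object-detection training scripts.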

    Step 4: Quality validation. Have a second annotator review 20% of the annotations. Inter-annotator agreement should be above 90% on region classification and within 5% on bounding box coordinates.
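The agreement thresholds in Step 4 can be checked mechanically. A minimal sketch, assuming each annotator's output is a list of `{category, bbox}` records with boxes as `[x1, y1, x2, y2]`:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def agreement(ann_a, ann_b, iou_threshold=0.9):
    """Fraction of matched regions where two annotators agree on category.
    Only box pairs overlapping above the IoU threshold count as matched."""
    matched = agreed = 0
    for ra in ann_a:
        best = max(ann_b, key=lambda rb: iou(ra["bbox"], rb["bbox"]), default=None)
        if best and iou(ra["bbox"], best["bbox"]) >= iou_threshold:
            matched += 1
            agreed += ra["category"] == best["category"]
    return agreed / matched if matched else 0.0
```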

    Scale Requirements

    For a layout detector fine-tuned to your specific document types:

    • Minimum: 200 annotated pages. Achieves 88-92% accuracy on region classification.
    • Recommended: 500 annotated pages. Achieves 93-96% accuracy.
    • Optimal: 1,000+ annotated pages. Achieves 96-98% accuracy on consistent document types.

    If your documents use consistent templates (same report format, same invoice layout), 200 pages is often sufficient. For heterogeneous document collections (multiple vendors, multiple formats), aim for 500+.

    Data Preparation for the Text Extractor

    What You Need

    Ground truth text for each text region — the correct plain text that should be extracted from that region of the page.

    Creating Ground Truth

    For digital PDFs (text layer present): The PDF's embedded text can serve as ground truth, but it needs validation. Embedded text sometimes has encoding errors, incorrect reading order, or missing characters.

    Process: Extract text programmatically from the PDF, manually review a sample of 50-100 regions, and fix any systematic extraction errors. If the embedded text is consistently correct (>98% character accuracy), use it as ground truth. If not, manual transcription is needed.
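The >98% character-accuracy bar can be estimated during that manual review with a character-level similarity score; `difflib`'s ratio is a cheap stand-in for one minus the character error rate:

```python
import difflib

def char_accuracy(extracted: str, ground_truth: str) -> float:
    """Character-level similarity between extracted text and a manually
    verified transcription (1.0 = identical)."""
    return difflib.SequenceMatcher(None, extracted, ground_truth).ratio()

def embedded_text_usable(samples, threshold=0.98):
    """samples: list of (extracted, ground_truth) pairs from the review.
    Returns True if the embedded text layer clears the accuracy bar."""
    scores = [char_accuracy(e, g) for e, g in samples]
    return sum(scores) / len(scores) >= threshold
```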

    For scanned documents (image-only): Ground truth must be created by human transcription. This is labor-intensive but necessary for training accurate OCR models on your specific document types.

    Time estimate: manual transcription of a text region takes 1-3 minutes depending on length. For 500 annotated pages with an average of 5 text regions each, that's 2,500 regions × 2 minutes = ~83 hours. Spread across a team, this is 2-3 weeks of work.

    For specialized fonts or symbols: If your documents use domain-specific symbols (engineering notation, mathematical formulas, musical notation), ensure these are correctly represented in the ground truth. Standard OCR models often fail on specialized symbols — your fine-tuned model can learn them, but only if the ground truth includes them.

    Scale Requirements

    • Minimum: 500 text regions with ground truth
    • Recommended: 2,000 text regions
    • Optimal: 5,000+ text regions for maximum accuracy across varied fonts and layouts

    Data Preparation for the Table Parser

    What You Need

    Structured ground truth for each table — the correct row/column structure with cell values, merged cell information, and header relationships.

    The Challenge

    Table parsing ground truth is the most complex annotation in the pipeline. A single table requires:

    • Identifying the number of rows and columns
    • Mapping each cell's content to its row/column position
    • Marking merged cells with their span (e.g., a cell spanning columns 2-4 in row 1)
    • Identifying header rows versus data rows
    • Handling nested headers (multi-level column structures)
    • Connecting multi-page tables

    This is significantly more work per annotation than text transcription or bounding box drawing.

    Annotation Workflow

    Step 1: Identify table types in your documents. Common enterprise table types:

    • Simple tables (regular grid, no merged cells)
    • Tables with merged header cells
    • Tables with nested row/column headers
    • Tables with multi-line cells
    • Tables spanning multiple pages
    • Tables with subtotals and grand totals

    Categorize your tables so you can ensure coverage across all types.

    Step 2: Define the output format. The ground truth should be in a structured format that captures all table relationships. A practical format:

    {
      "rows": 15,
      "columns": 5,
      "headers": [
        {"text": "Item", "row": 0, "col": 0, "rowspan": 1, "colspan": 1},
        {"text": "Description", "row": 0, "col": 1, "rowspan": 1, "colspan": 1},
        {"text": "Specifications", "row": 0, "col": 2, "rowspan": 1, "colspan": 3}
      ],
      "cells": [
        {"text": "1.01", "row": 1, "col": 0},
        {"text": "Concrete Grade 30", "row": 1, "col": 1},
        ...
      ]
    }
    

    Step 3: Annotate tables. For each table, produce the structured ground truth. Use a table annotation tool that supports merged cells and multi-level headers — or export to a spreadsheet for manual structuring.

    Time estimate: simple tables take 5-10 minutes each. Complex tables with merged cells and nested headers take 15-30 minutes. Budget accordingly.

    Step 4: Validate. Round-trip validation: render the structured ground truth back into a visual table and compare against the original. Discrepancies indicate annotation errors.
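Part of the round-trip can be automated. A sketch that expands the Step 2 format back into a 2-D grid for visual comparison, with merged cells repeating their text across the spanned positions:

```python
def render_grid(table):
    """Expand the structured ground truth (Step 2 format) into a 2-D grid
    so it can be compared against the original table. Merged cells repeat
    their text across all spanned positions."""
    grid = [["" for _ in range(table["columns"])] for _ in range(table["rows"])]
    for cell in table["headers"] + table["cells"]:
        r0, c0 = cell["row"], cell["col"]
        for r in range(r0, r0 + cell.get("rowspan", 1)):
            for c in range(c0, c0 + cell.get("colspan", 1)):
                grid[r][c] = cell["text"]
    return grid
```

Out-of-range indices raise immediately, so the renderer doubles as a structural sanity check on the annotation itself.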

    Scale Requirements

    Table parsing requires more training data than layout detection because of structural complexity:

    • Minimum: 300 tables with ground truth. Handles simple table structures.
    • Recommended: 1,000 tables. Handles merged cells and standard header structures.
    • Optimal: 2,000+ tables. Handles complex nested headers, multi-page tables, and irregular structures.

    Include at least 50 examples of each table type identified in Step 1.

    Data Preparation for the Image Analyzer

    What You Need

    For each figure, two types of ground truth:

    1. Classification: What type of figure is it? (bar chart, line chart, flow diagram, technical drawing, photo, map)
    2. Structured extraction: What information does the figure contain?

    Classification Ground Truth

    Classify each figure into its subcategory. This is a straightforward image classification task requiring 20-50 examples per category.

    For enterprise documents, typical categories:

    • Bar chart
    • Line chart
    • Pie chart
    • Flow diagram
    • Organizational chart
    • Technical drawing
    • Floor plan / site plan
    • Photograph
    • Logo / decorative image

    Extraction Ground Truth

    For data-bearing figures (charts and diagrams), create structured ground truth:

    Charts: Extract the data series. A bar chart showing quarterly revenue should produce: [{"quarter": "Q1", "revenue": 1200000}, {"quarter": "Q2", "revenue": 1450000}, ...]

    Flow diagrams: Extract nodes and edges. {"nodes": ["Start", "Review", "Approve", "Reject", "End"], "edges": [["Start", "Review"], ["Review", "Approve"], ["Review", "Reject"], ...]}
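Graph-style ground truth like this benefits from an automated consistency check before it enters training. A small sketch, assuming the node/edge format shown above:

```python
def validate_diagram(diagram):
    """Check that a flow-diagram ground truth record is internally
    consistent: node names unique, every edge endpoint a declared node."""
    nodes = diagram["nodes"]
    errors = []
    if len(set(nodes)) != len(nodes):
        errors.append("duplicate node names")
    for src, dst in diagram["edges"]:
        for endpoint in (src, dst):
            if endpoint not in nodes:
                errors.append(f"edge endpoint {endpoint!r} not in nodes")
    return errors
```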

    Technical drawings: Extract key dimensions, labels, and annotations.

    Time estimate: chart data extraction takes 3-5 minutes per chart. Diagram extraction takes 5-15 minutes depending on complexity.

    Scale Requirements

    • Classification: 20-50 examples per figure type (150-400 total)
    • Chart data extraction: 100-300 charts with ground truth
    • Diagram extraction: 50-200 diagrams with ground truth

    Image analysis typically requires less training data than table parsing because pre-trained vision models already understand chart and diagram structures — fine-tuning adds domain-specific calibration.

    End-to-End Quality Validation

    After preparing training data for each stage, validate the full pipeline end-to-end:

    Step 1: Process 50 held-out documents through the complete pipeline.

    Step 2: Compare the pipeline's structured output against manually created ground truth for those documents.

    Step 3: Measure accuracy at each stage:

    • Layout detection: region classification accuracy and bounding box IoU
    • Text extraction: character-level accuracy
    • Table parsing: cell-level accuracy
    • Image analysis: classification accuracy and extraction accuracy

    Step 4: Identify the weakest stage. The weakest stage gates the entire pipeline's accuracy. If layout detection is 97% accurate but table parsing is 82%, improving table parsing training data yields the highest ROI.
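A rough way to see why the weakest stage gates the pipeline: for content that must pass through several stages, per-stage accuracies compound multiplicatively (assuming independent errors):

```python
# Back-of-the-envelope: stage errors compound multiplicatively for any
# content that must pass through several stages (independence assumed).
def end_to_end_accuracy(stage_accuracies):
    result = 1.0
    for acc in stage_accuracies:
        result *= acc
    return result

# A table cell must survive layout detection AND table parsing:
# 0.97 * 0.82 ≈ 0.795 — the 82% stage dominates the product.
```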

    Step 5: Iterate. Add more training data to the weakest stage, retrain, re-evaluate. Repeat until all stages meet your accuracy target.

    Timeline and Resources

    For a typical enterprise pipeline processing construction documents:

    Stage                 | Annotation Volume | Time Estimate | Personnel
    Layout detector       | 500 pages         | 2-3 weeks     | 1-2 annotators
    Text extractor        | 2,000 regions     | 2-3 weeks     | 2-3 annotators
    Table parser          | 1,000 tables      | 3-4 weeks     | 2 annotators + domain expert
    Image analyzer        | 300 figures       | 1-2 weeks     | 1 annotator + domain expert
    End-to-end validation | 50 documents      | 1 week        | 1 ML engineer + domain expert
    Total: 8-12 weeks with a team of 3-4 people. Stages can overlap — text extractor annotation can begin while layout detector annotation is still in progress.

    Ertas Data Suite supports multi-stage pipeline data preparation with annotation workflows for each stage — bounding box annotation for layout detection, text transcription for text extraction, structured table annotation for table parsing, and figure classification for image analysis. The platform maintains the relationships between stages (which text regions came from which pages, which tables correspond to which bounding boxes), providing the end-to-end traceability that synthetic parsing pipelines require.

