
Preparing Synthetic Parsing Pipelines: The 2026 Approach to Document Processing
Document processing in 2026 isn't one model's job anymore. Synthetic parsing pipelines break documents into parts and route each to a specialized model. Here's how to prepare data for this architecture.
Document processing used to be one model's job. Feed a PDF to an OCR engine. Get text out. Maybe run a table extractor on the side. Stitch the outputs together manually.
That approach hit a ceiling around 2024. Enterprise documents are too complex — a single construction specification contains narrative text, bills of quantities in nested tables, technical drawings with dimensions, charts showing project timelines, and cross-references linking all of them together. No single model handles all of these content types well.
The 2026 approach is the synthetic parsing pipeline: a multi-stage architecture where the document is broken into components, each component is routed to a specialized model, and the outputs are recombined into a single structured representation. "Synthetic" because the final output is synthesized from multiple models' contributions, not produced by a single model.
This article focuses on the data preparation side: how to create the training data that powers each stage of the pipeline.
The Pipeline Architecture
A synthetic parsing pipeline has four stages, and each stage needs its own training data.
Stage 1: Layout Detector
The layout detector examines each page and identifies regions: where is the text? Where are the tables? Where are the figures? Where are the headers and footers?
The output is a set of bounding boxes, each labeled with a region type: text_block, table, figure, header, footer, caption, page_number, sidebar, watermark.
This is an object detection problem, solved with models like LayoutLMv3, DiT (Document Image Transformer), or YOLO variants trained on document layouts.
Stage 2: Text Extractor
Text regions identified by the layout detector are sent to the text extraction stage. This stage produces clean, structured text from each text region — handling fonts, columns, reading order, and special characters.
Stage 3: Table Parser
Table regions go to a specialized table parsing model that understands row/column structure, merged cells, multi-level headers, and multi-page tables.
Stage 4: Image Analyzer
Figure regions go to a vision model that classifies the figure type (chart, diagram, photo, drawing) and extracts relevant structured information.
Combiner
The combiner merges outputs from all stages into a single structured document representation, resolving cross-references and maintaining the logical document structure.
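Concretely, the routing logic is simple once each stage sits behind a function. The sketch below is illustrative, not a reference implementation: detect_layout, extract_text, parse_table, and analyze_figure are hypothetical wrappers around whichever models you choose, and reading order is approximated by sorting regions top-to-bottom, then left-to-right.

```python
# Minimal routing sketch. The four stage functions are hypothetical wrappers
# around whichever models serve each stage; page_image is a PIL image.

def parse_document(page_image):
    regions = detect_layout(page_image)  # Stage 1: [{"type": ..., "bbox": (l, t, r, b)}, ...]
    parts = []
    # Approximate reading order: top-to-bottom, then left-to-right.
    for region in sorted(regions, key=lambda r: (r["bbox"][1], r["bbox"][0])):
        crop = page_image.crop(region["bbox"])
        if region["type"] == "text_block":
            parts.append({"type": "text", "content": extract_text(crop)})      # Stage 2
        elif region["type"] == "table":
            parts.append({"type": "table", "content": parse_table(crop)})      # Stage 3
        elif region["type"] == "figure":
            parts.append({"type": "figure", "content": analyze_figure(crop)})  # Stage 4
    return {"parts": parts}  # the combiner merges these across pages
```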
Each of these stages can be improved by fine-tuning on domain-specific data. Here's what training data preparation looks like for each one.
Data Preparation for the Layout Detector
What You Need
Annotated document pages where every region has been identified with a bounding box and classified. This is an object detection annotation task — the same type of annotation used for training YOLO models on natural images, but applied to document pages.
Annotation Workflow
Step 1: Select representative pages. Don't annotate every page of every document. Select 200-500 pages that represent the full variety of layouts your pipeline will encounter. Include:
- Text-heavy pages (reports, narratives)
- Table-heavy pages (financial statements, BOQs)
- Mixed pages (tables with surrounding text and captions)
- Figure pages (technical drawings, charts)
- Complex pages (multi-column layouts, sidebars, nested elements)
Step 2: Define region categories. A practical category set for enterprise documents:
- text_block: Continuous prose (paragraphs, bullet lists)
- table: Tabular data with row/column structure
- figure: Images, charts, drawings, diagrams
- header: Page headers, section headers
- footer: Page footers, footnotes
- caption: Text that labels a figure or table
- page_number: Page numbering
- sidebar: Content in side columns or callout boxes
- watermark: Background text or images to be ignored
Start with fewer categories and add more only if the pipeline needs the distinction. Nine categories are usually enough for enterprise documents.
Step 3: Annotate bounding boxes. For each page, draw bounding boxes around every region and assign the appropriate category. Use a bounding box annotation tool (CVAT, LabelImg, or a platform that supports object detection annotation).
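Most of these tools export standard object detection formats. For reference, a single annotated page in COCO detection format looks roughly like this (file names, IDs, and coordinates are illustrative):

```python
# One page's layout annotations in COCO detection format.
# All values below are illustrative.
page_annotation = {
    "images": [{"id": 1, "file_name": "spec_page_014.png", "width": 2480, "height": 3508}],
    "categories": [
        {"id": 1, "name": "text_block"},
        {"id": 2, "name": "table"},
        {"id": 3, "name": "figure"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [180, 300, 2100, 900]},
        {"id": 2, "image_id": 1, "category_id": 2, "bbox": [180, 1300, 2100, 1400]},
    ],
}
```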
Key annotation guidelines:
- Boxes should be tight — minimal whitespace padding
- Overlapping regions should be annotated as the most specific type (a caption overlapping a figure gets a caption box, not a figure box)
- Multi-column pages get separate boxes per column
- Tables that span page breaks get a box on each page
Step 4: Quality validation. Have a second annotator review 20% of the annotations. Inter-annotator agreement should be above 90% on region classification and within 5% on bounding box coordinates.
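A minimal sketch of that check, assuming each annotator produces a list of (box, label) pairs with boxes as (left, top, right, bottom) tuples; the IoU threshold is a rough stand-in for the 5% coordinate tolerance:

```python
def iou(a, b):
    """Intersection-over-union of two (left, top, right, bottom) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def agreement(boxes_a, boxes_b, iou_threshold=0.95):
    """Fraction of annotator A's boxes matched by annotator B with the same label."""
    matched = 0
    for box_a, label_a in boxes_a:
        best = max(boxes_b, key=lambda b: iou(box_a, b[0]), default=None)
        if best and best[1] == label_a and iou(box_a, best[0]) >= iou_threshold:
            matched += 1
    return matched / len(boxes_a) if boxes_a else 1.0
```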
Scale Requirements
For a layout detector fine-tuned to your specific document types:
- Minimum: 200 annotated pages. Achieves 88-92% accuracy on region classification.
- Recommended: 500 annotated pages. Achieves 93-96% accuracy.
- Optimal: 1,000+ annotated pages. Achieves 96-98% accuracy on consistent document types.
If your documents use consistent templates (same report format, same invoice layout), 200 pages is often sufficient. For heterogeneous document collections (multiple vendors, multiple formats), aim for 500+.
Data Preparation for the Text Extractor
What You Need
Ground truth text for each text region — the correct plain text that should be extracted from that region of the page.
Creating Ground Truth
For digital PDFs (text layer present): The PDF's embedded text can serve as ground truth, but it needs validation. Embedded text sometimes has encoding errors, incorrect reading order, or missing characters.
Process: Extract text programmatically from the PDF, manually review a sample of 50-100 regions, and fix any systematic extraction errors. If the embedded text is consistently correct (>98% character accuracy), use it as ground truth. If not, manual transcription is needed.
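One plausible way to script the extraction step, sketched here with pdfplumber (PyMuPDF offers similar region-clipped extraction). The region boxes are assumed to come from your layout annotations, expressed in PDF points:

```python
import pdfplumber

def region_texts(pdf_path, regions_per_page):
    """Extract embedded text for each annotated region.

    regions_per_page maps a page index to a list of
    (x0, top, x1, bottom) boxes in PDF points.
    """
    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_index, boxes in regions_per_page.items():
            page = pdf.pages[page_index]
            for box in boxes:
                # within_bbox crops the page to the region before extraction
                text = page.within_bbox(box).extract_text() or ""
                results.append({"page": page_index, "bbox": box, "text": text})
    return results
```

Review a sample of the output manually before accepting it as ground truth.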
For scanned documents (image-only): Ground truth must be created by human transcription. This is labor-intensive but necessary for training accurate OCR models on your specific document types.
Time estimate: manual transcription of a text region takes 1-3 minutes depending on length. For 500 annotated pages with an average of 5 text regions each, that's 2,500 regions × 2 minutes = ~83 hours. Spread across a team, this is 2-3 weeks of work.
For specialized fonts or symbols: If your documents use domain-specific symbols (engineering notation, mathematical formulas, musical notation), ensure these are correctly represented in the ground truth. Standard OCR models often fail on specialized symbols — your fine-tuned model can learn them, but only if the ground truth includes them.
Scale Requirements
- Minimum: 500 text regions with ground truth
- Recommended: 2,000 text regions
- Optimal: 5,000+ text regions for maximum accuracy across varied fonts and layouts
Data Preparation for the Table Parser
What You Need
Structured ground truth for each table — the correct row/column structure with cell values, merged cell information, and header relationships.
The Challenge
Table parsing ground truth is the most complex annotation in the pipeline. A single table requires:
- Identifying the number of rows and columns
- Mapping each cell's content to its row/column position
- Marking merged cells with their span (e.g., a cell spanning columns 2-4 in row 1)
- Identifying header rows versus data rows
- Handling nested headers (multi-level column structures)
- Connecting multi-page tables
This is significantly more work per annotation than text transcription or bounding box drawing.
Annotation Workflow
Step 1: Identify table types in your documents. Common enterprise table types:
- Simple tables (regular grid, no merged cells)
- Tables with merged header cells
- Tables with nested row/column headers
- Tables with multi-line cells
- Tables spanning multiple pages
- Tables with subtotals and grand totals
Categorize your tables so you can ensure coverage across all types.
Step 2: Define the output format. The ground truth should be in a structured format that captures all table relationships. A practical format:
```json
{
  "rows": 15,
  "columns": 5,
  "headers": [
    {"text": "Item", "row": 0, "col": 0, "rowspan": 1, "colspan": 1},
    {"text": "Description", "row": 0, "col": 1, "rowspan": 1, "colspan": 1},
    {"text": "Specifications", "row": 0, "col": 2, "rowspan": 1, "colspan": 3}
  ],
  "cells": [
    {"text": "1.01", "row": 1, "col": 0},
    {"text": "Concrete Grade 30", "row": 1, "col": 1},
    ...
  ]
}
```
Step 3: Annotate tables. For each table, produce the structured ground truth. Use a table annotation tool that supports merged cells and multi-level headers — or export to a spreadsheet for manual structuring.
Time estimate: simple tables take 5-10 minutes each. Complex tables with merged cells and nested headers take 15-30 minutes. Budget accordingly.
Step 4: Validate. Round-trip validation: render the structured ground truth back into a visual table and compare against the original. Discrepancies indicate annotation errors.
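A minimal rendering sketch for that round-trip check, assuming ground truth in the Step 2 format. It produces an HTML table you can open next to the original page image (no HTML escaping; spans are assumed to be internally consistent):

```python
def render_table(gt):
    """Render Step 2 ground truth as an HTML table for visual comparison."""
    grid = [["" for _ in range(gt["columns"])] for _ in range(gt["rows"])]
    spans = {}
    for entry in gt["headers"] + gt["cells"]:
        r, c = entry["row"], entry["col"]
        grid[r][c] = entry["text"]
        spans[(r, c)] = (entry.get("rowspan", 1), entry.get("colspan", 1))
    covered = set()  # grid positions absorbed by a merged cell
    html = ["<table border='1'>"]
    for r in range(gt["rows"]):
        html.append("<tr>")
        for c in range(gt["columns"]):
            if (r, c) in covered:
                continue
            rs, cs = spans.get((r, c), (1, 1))
            for dr in range(rs):
                for dc in range(cs):
                    if (dr, dc) != (0, 0):
                        covered.add((r + dr, c + dc))
            html.append(f"<td rowspan='{rs}' colspan='{cs}'>{grid[r][c]}</td>")
        html.append("</tr>")
    html.append("</table>")
    return "\n".join(html)
```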
Scale Requirements
Table parsing requires more training data than layout detection because of structural complexity:
- Minimum: 300 tables with ground truth. Handles simple table structures.
- Recommended: 1,000 tables. Handles merged cells and standard header structures.
- Optimal: 2,000+ tables. Handles complex nested headers, multi-page tables, and irregular structures.
Include at least 50 examples of each table type identified in Step 1.
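A quick way to keep score against that floor, assuming each annotation records its table type from Step 1 (the table_type key is illustrative):

```python
from collections import Counter

def coverage_gaps(annotated_tables, floor=50):
    """Return table types that fall short of the per-type example floor."""
    counts = Counter(t["table_type"] for t in annotated_tables)
    return {table_type: n for table_type, n in counts.items() if n < floor}
```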
Data Preparation for the Image Analyzer
What You Need
For each figure, two types of ground truth:
- Classification: What type of figure is it? (bar chart, line chart, flow diagram, technical drawing, photo, map)
- Structured extraction: What information does the figure contain?
Classification Ground Truth
Classify each figure into its subcategory. This is a straightforward image classification task requiring 20-50 examples per category.
For enterprise documents, typical categories:
- Bar chart
- Line chart
- Pie chart
- Flow diagram
- Organizational chart
- Technical drawing
- Floor plan / site plan
- Photograph
- Logo / decorative image
Extraction Ground Truth
For data-bearing figures (charts and diagrams), create structured ground truth:
Charts: Extract the data series. A bar chart showing quarterly revenue should produce: [{"quarter": "Q1", "revenue": 1200000}, {"quarter": "Q2", "revenue": 1450000}, ...]
Flow diagrams: Extract nodes and edges. {"nodes": ["Start", "Review", "Approve", "Reject", "End"], "edges": [["Start", "Review"], ["Review", "Approve"], ["Review", "Reject"], ...]}
Technical drawings: Extract key dimensions, labels, and annotations.
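Because this ground truth is typed by hand, cheap consistency checks catch errors early. The sketch below validates diagram annotations in the format shown above by confirming every edge endpoint is a declared node:

```python
def check_diagram(gt):
    """Return edges whose endpoints are not declared nodes."""
    nodes = set(gt["nodes"])
    return [edge for edge in gt["edges"] if not set(edge) <= nodes]

example = {
    "nodes": ["Start", "Review", "Approve", "Reject", "End"],
    "edges": [["Start", "Review"], ["Review", "Aprove"]],  # deliberate typo
}
print(check_diagram(example))  # -> [['Review', 'Aprove']]
```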
Time estimate: chart data extraction takes 3-5 minutes per chart. Diagram extraction takes 5-15 minutes depending on complexity.
Scale Requirements
- Classification: 20-50 examples per figure type (150-400 total)
- Chart data extraction: 100-300 charts with ground truth
- Diagram extraction: 50-200 diagrams with ground truth
Image analysis typically requires less training data than table parsing because pre-trained vision models already understand chart and diagram structures — fine-tuning adds domain-specific calibration.
End-to-End Quality Validation
After preparing training data for each stage, validate the full pipeline end-to-end:
Step 1: Process 50 held-out documents through the complete pipeline.
Step 2: Compare the pipeline's structured output against manually created ground truth for those documents.
Step 3: Measure accuracy at each stage:
- Layout detection: region classification accuracy and bounding box IoU
- Text extraction: character-level accuracy
- Table parsing: cell-level accuracy
- Image analysis: classification accuracy and extraction accuracy
Step 4: Identify the weakest stage. The weakest stage gates the entire pipeline's accuracy. If layout detection is 97% accurate but table parsing is 82%, improving table parsing training data yields the highest ROI.
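A sketch of that comparison, with illustrative stage scores matching the example above. Character accuracy for the text stage can be approximated with Python's standard difflib; the other scores are assumed to come from each stage's own evaluation:

```python
import difflib

def char_accuracy(predicted, ground_truth):
    """Approximate character-level accuracy as a similarity ratio."""
    return difflib.SequenceMatcher(None, predicted, ground_truth).ratio()

# Illustrative held-out scores per stage
stage_scores = {
    "layout_detection": 0.97,  # region classification accuracy
    "text_extraction": 0.95,   # mean char_accuracy over regions
    "table_parsing": 0.82,     # cell-level accuracy
    "image_analysis": 0.91,    # classification + extraction accuracy
}
weakest = min(stage_scores, key=stage_scores.get)
print(f"Highest-ROI stage to improve: {weakest}")  # -> table_parsing
```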
Step 5: Iterate. Add more training data to the weakest stage, retrain, re-evaluate. Repeat until all stages meet your accuracy target.
Timeline and Resources
For a typical enterprise pipeline processing construction documents:
| Stage | Annotation Volume | Time Estimate | Personnel |
|---|---|---|---|
| Layout detector | 500 pages | 2-3 weeks | 1-2 annotators |
| Text extractor | 2,000 regions | 2-3 weeks | 2-3 annotators |
| Table parser | 1,000 tables | 3-4 weeks | 2 annotators + domain expert |
| Image analyzer | 300 figures | 1-2 weeks | 1 annotator + domain expert |
| End-to-end validation | 50 documents | 1 week | 1 ML engineer + domain expert |
Total: 8-12 weeks with a team of 3-4 people. Stages can overlap — text extractor annotation can begin while layout detector annotation is still in progress.
Ertas Data Suite supports multi-stage pipeline data preparation with annotation workflows for each stage — bounding box annotation for layout detection, text transcription for text extraction, structured table annotation for table parsing, and figure classification for image analysis. The platform maintains the relationships between stages (which text regions came from which pages, which tables correspond to which bounding boxes), providing the end-to-end traceability that synthetic parsing pipelines require.
Further Reading
- PDF to JSONL: Building an Enterprise Data Preparation Pipeline — The practical guide to converting enterprise PDFs into structured training data.
- Multi-Format Export: JSONL, COCO, YOLO Pipeline — How to export training data in the format each model stage requires.
- Local Document Ingestion for Enterprise AI — Setting up on-premise document ingestion that keeps data within your infrastructure.