
How to Extract AI Training Data from Engineering Drawings and BOQ Documents
A practical guide to extracting structured AI training data from engineering drawings, bills of quantities, and construction PDFs — for teams building domain-specific AI in construction and infrastructure.
Engineering drawings and bills of quantities are the dense, information-rich backbone of every construction project. They contain specifications, dimensions, quantities, materials, and cost structures that took years of domain expertise to produce. They are also, from a machine learning perspective, some of the hardest documents to parse.
If you are trying to build a domain-specific AI for construction — an estimating model, a spec search system, a compliance checker — the first obstacle is not the model. It is getting the training data out of the documents.
This guide covers exactly that: what makes engineering drawings and BOQ documents difficult to process, how to approach extraction, and what a usable training dataset actually looks like at the end.
Why Engineering Drawings Break Standard OCR
A standard OCR engine — Tesseract, AWS Textract, or the OCR layer inside Adobe Acrobat — is trained on continuous prose text. It expects words arranged in lines, lines arranged in paragraphs, and paragraphs arranged on a page with predictable margins.
Engineering drawings violate every one of those assumptions.
Symbol-dense content. Structural and civil engineering drawings use dozens of specialized symbols: weld types, cross-section indicators, reinforcement bar callouts, elevation markers, slope arrows. A standard OCR model has no training data for these. It will either skip them or misread adjacent text because the symbol is interfering with the character segmentation.
Multi-layout pages. A single A1 drawing sheet may contain a plan view in the upper left, section cuts in the lower half, a title block in the lower right corner, revision history in the right margin, and general notes scattered wherever there was space. There is no reading order an OCR engine can infer. It will concatenate content from these zones in the wrong sequence, producing text that has no semantic coherence.
Annotation layers. CAD-exported PDFs contain dimensions, keynotes, and leader lines that sit in separate layers from the main drawing geometry. When flattened to a rasterized PDF, these become overlapping elements. Text recognition fails on text that overlaps or touches other elements.
Scale-dependent detail. On a site plan at 1:500, building outlines are thick and annotation is sparse. On a detail drawing at 1:10, the same area is full of material callouts, dimension strings, and reference bubbles. A single PDF project document can contain both, and no fixed OCR resolution works for both simultaneously.
The consequence: generic OCR extraction from engineering drawings produces noisy, out-of-sequence text with missing annotations and garbled dimension strings. That output cannot be used for AI training without extensive — and often impractical — manual correction.
Why BOQ Documents Are Different (and Also Hard)
Bills of quantities sit at the other end of the spectrum. They are not visual documents; they are highly structured tabular data. But they are structured in a way that is specific to the construction industry and that most data extraction tools handle poorly.
A typical BOQ is organized as a multi-level numbered hierarchy: divisions, sections, items. Each line item has an item code, a description that may run across multiple lines, a quantity, a unit of measure, a rate, and an extended amount. The description field is where the domain knowledge lives — it encodes the material specification, the work method, the applicable standard, and any references to drawing numbers.
The extraction challenges specific to BOQs:
Hybrid PDF format. BOQs are often generated from quantity surveying software (CostX, CANDY, Buildsoft) and exported as PDFs that contain a mix of embedded text and rasterized tables. The tabular structure may look perfect on screen but be represented in the PDF as a series of disconnected text fragments at arbitrary X-Y coordinates, with no underlying table object that extraction tools can detect.
Multi-page item descriptions. A complex BOQ item — say, structural concrete for a post-tensioned slab with specific admixture requirements and reference to a project-specific specification — can have a description that runs across three or four lines. Page breaks interrupt items mid-description. Extraction tools that process page by page will split these items incorrectly.
Continuation tables. BOQ tables often continue across pages with column headers repeated at the top of each page. Naive table extraction merges the repeated headers as data rows, corrupting the structure.
Quantity and unit normalization. Units in BOQs are not standardized. "m3", "M3", "CUM", "cum", and "Cum" all mean cubic meters. Item quantities appear as "1,234.50", "1234.5", and "1,234·50" depending on the locale and the software that produced the document. An extraction pipeline must normalize these without changing the values.
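A minimal normalization pass for the unit aliases and number formats shown above might look like this. The alias table and the locale handling are assumptions; a real archive needs a project-specific alias list and care with comma-decimal locales:

```python
# Canonical unit mapping -- illustrative, extend per project archive.
UNIT_ALIASES = {
    "m3": "m3", "cum": "m3",      # cubic meters
    "m2": "m2", "sqm": "m2",      # square meters
    "no": "nr", "nr": "nr",       # item counts
    "lm": "m", "m": "m",          # linear meters
}

def normalize_unit(raw: str) -> str:
    """Map a raw unit string ('CUM', 'M3', 'Cum', ...) to a canonical form.
    Unknown units pass through lowercased rather than being dropped."""
    key = raw.strip().rstrip(".").lower()
    return UNIT_ALIASES.get(key, key)

def normalize_quantity(raw: str) -> float:
    """Parse '1,234.50', '1234.5', or '1,234·50' to the same float value.
    Assumes comma is a thousands separator, not a decimal mark."""
    cleaned = raw.replace("\u00b7", ".").replace(",", "").strip()
    return float(cleaned)
```

The key design choice is that unknown units survive (lowercased) instead of raising, so unexpected entries show up in the output where they can be reviewed, rather than halting a multi-day batch run.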
The Extraction Pipeline
A practical pipeline for construction documents requires separate handling for drawing documents and BOQ documents, with a merge step at the end.
Stage 1: Document classification. Before processing, each file needs to be classified: is this a drawing sheet, a BOQ, a specification section, an inspection report, or something else? The processing logic differs significantly between types, and applying the wrong extractor produces garbage output. Classification can be rule-based (file naming conventions, page dimensions, presence of a title block) or model-based for ambiguous cases.
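The rule-based first pass can be sketched as follows. The filename pattern and the A1 page-size heuristic are assumptions, not universal conventions; anything the rules cannot decide falls through to "unknown" for model-based or manual classification:

```python
import re

def classify_document(filename: str, page_width_mm: float,
                      page_height_mm: float, first_page_text: str) -> str:
    """Rule-based first-pass classifier. Ambiguous files return 'unknown'."""
    name = filename.lower()
    # Drawing-number naming conventions like S-201 or A-101 (assumed pattern).
    if re.search(r"\b[asdm]-\d{3}", name):
        return "drawing"
    if "boq" in name or "bill of quantities" in first_page_text.lower():
        return "boq"
    # A1 sheets (841 x 594 mm) strongly suggest a drawing even without a match above.
    if page_width_mm > 800 and page_height_mm > 550:
        return "drawing"
    if "specification" in name or "spec" in name.split("_"):
        return "specification"
    return "unknown"
```

Falling through to "unknown" rather than guessing matters here: as the paragraph above notes, applying the wrong extractor produces garbage output, so it is cheaper to route the ambiguous tail to a slower classifier.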
Stage 2: Drawing extraction. For drawings, the pipeline has to operate on regions rather than the full page. The title block, general notes, and plan area are processed separately. Region detection can use template matching for standard title block positions, or a layout segmentation model for non-standard sheets. Within each region, OCR runs at the appropriate resolution, and symbol detection runs as a separate pass — outputting structured records like {type: "weld_symbol", subtype: "fillet", size_mm: 6, location: "beam_flange_connection"} rather than trying to convert symbols to text.
Stage 3: BOQ extraction. For BOQ documents, the pipeline focuses on table structure reconstruction. This requires detecting column boundaries from the distribution of X coordinates of text fragments, associating continuation lines with their parent items using indentation and numbering patterns, normalizing quantities and units to canonical forms, and extracting drawing references from description text.
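Column boundary detection from fragment X coordinates can be sketched like this, assuming text fragments arrive as (x0, y0, text) tuples from a PDF text extractor and that a real column gap is wider than the spacing inside one column (the 8 mm default is illustrative):

```python
def detect_column_boundaries(fragments, gap_mm=8.0):
    """Infer column start positions from the left (x0) coordinates of
    text fragments: x0 values separated by more than gap_mm start a
    new column cluster."""
    xs = sorted({round(x, 1) for x, _, _ in fragments})
    columns = [xs[0]]
    last = xs[0]
    for x in xs[1:]:
        if x - last > gap_mm:
            columns.append(x)
        last = x
    return columns

def assign_column(x0, columns):
    """Assign a fragment to the rightmost column starting at or left of x0."""
    idx = 0
    for i, start in enumerate(columns):
        if x0 >= start - 0.5:
            idx = i
    return idx
```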
Stage 4: Cross-reference linking. Drawing annotations reference item codes; BOQ items reference drawing numbers. Linking these creates richer training data. A line item for "structural steelwork, grade S275, hot-dip galvanized" becomes more useful as training data when linked to the drawing that shows the member geometry and the connection detail.
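A sketch of the reference-extraction side of this linking, assuming drawing numbers follow the S-201 / D-C-04 style used in this article's examples (real projects need their own numbering grammar):

```python
import re

# Assumed convention: one or two uppercase letters, an optional second
# letter group, then a 2-4 digit sheet number, e.g. S-201, D-C-04.
DRAWING_REF = re.compile(r"\b([A-Z]{1,2}(?:-[A-Z]{1,2})?-\d{2,4})\b")

def extract_drawing_refs(description: str) -> list:
    """Pull drawing-number references out of a BOQ description,
    deduplicated and sorted for stable output."""
    return sorted(set(DRAWING_REF.findall(description)))
```

Running the extractor over every description field, then inverting the result (drawing number to list of item codes), gives both directions of the link for the training records.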
Stage 5: Quality scoring. Each extracted record gets a confidence score based on OCR confidence, table structure completeness, and cross-reference resolution rate. Low-confidence records are flagged for human review rather than passed directly to the training dataset.
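One way to combine the three signals into a single score; the weights and the threshold below are illustrative and would need calibration against a human-reviewed sample:

```python
def quality_score(ocr_conf: float, table_complete: float,
                  ref_resolved: float, weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted combination of three signals, each already scaled to [0, 1]:
    mean OCR confidence, table structure completeness, and the fraction of
    cross-references that resolved."""
    w_ocr, w_table, w_ref = weights
    return w_ocr * ocr_conf + w_table * table_complete + w_ref * ref_resolved

REVIEW_THRESHOLD = 0.85  # assumption: tune against a labeled sample

def route(score: float) -> str:
    """Low-confidence records go to human review, not the training set."""
    return "accept" if score >= REVIEW_THRESHOLD else "human_review"
```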
What Structured Output Looks Like
After extraction, a BOQ line item should be a structured record with these fields:
{
  "item_code": "03.04.12",
  "description": "Reinforced concrete, grade C35/45, in columns above ground floor, including formwork, vibration and curing",
  "quantity": 127.5,
  "unit": "m3",
  "unit_rate": null,
  "drawing_refs": ["S-201", "S-202", "D-C-04"],
  "spec_refs": ["Clause 5.4.2"],
  "section": "Structural Concrete",
  "division": "Substructure"
}
A drawing annotation record looks different:
{
  "drawing_number": "S-201",
  "zone": "section_cut_AA",
  "element_type": "column",
  "annotation_type": "reinforcement_callout",
  "text": "8T25 + links T10@200 c/c",
  "parsed": {
    "main_bars": {"count": 8, "dia_mm": 25, "grade": "T"},
    "links": {"dia_mm": 10, "spacing_mm": 200}
  }
}
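The text-to-parsed mapping in the record above can be produced with a small grammar. This sketch handles only the single callout format shown; real drawings use many variants and need a larger set of patterns:

```python
import re
from typing import Optional

CALLOUT = re.compile(
    r"(?P<count>\d+)\s*(?P<grade>[TRY])(?P<dia>\d+)"  # main bars: 8T25
    r"(?:\s*\+\s*links\s*[TRY](?P<link_dia>\d+)@(?P<spacing>\d+)\s*c/c)?"
)

def parse_callout(text: str) -> Optional[dict]:
    """Parse a reinforcement callout like '8T25 + links T10@200 c/c'
    into the structured form shown above; return None if no match."""
    m = CALLOUT.search(text)
    if not m:
        return None
    parsed = {"main_bars": {"count": int(m["count"]),
                            "dia_mm": int(m["dia"]),
                            "grade": m["grade"]}}
    if m["link_dia"]:
        parsed["links"] = {"dia_mm": int(m["link_dia"]),
                           "spacing_mm": int(m["spacing"])}
    return parsed
```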
This level of structure is what makes the data useful for AI training. Unstructured text scraped from a PDF produces a language model that can discuss construction. Structured, annotated records produce a model that can reason about specific items, quantities, and specifications.
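A lightweight record check before anything enters the dataset might look like this, using the field names from the BOQ example above; the required-field list and checks are illustrative:

```python
REQUIRED_FIELDS = {"item_code", "description", "quantity",
                   "unit", "section", "division"}

def validate_boq_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is usable."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - record.keys())]
    if "quantity" in record and not isinstance(record["quantity"], (int, float)):
        problems.append("quantity is not numeric")
    if "unit" in record and record["unit"] != record["unit"].lower():
        problems.append("unit not in canonical lowercase form")
    return problems
```

Returning a problem list rather than raising lets the pipeline attach the problems to the record and route it through the same human-review queue as low-confidence extractions.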
AI Use Cases This Data Enables
Estimating model fine-tuning. A model trained on structured BOQ data from completed projects can suggest rates and quantities for new items, catching outliers and improving estimating consistency. Structured records map naturally to JSONL fine-tuning pairs: description plus context as the input, rate as the target.
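Mapping one structured record to one JSONL training line might look like this; the prompt template and the prompt/completion shape are illustrative, and only items with a known rate would become training pairs:

```python
import json

def to_finetune_example(item: dict) -> str:
    """Serialize a structured BOQ record as one JSONL fine-tuning line.
    Field names follow the record schema shown earlier in this article."""
    prompt = (f"Section: {item['section']}\n"
              f"Item: {item['description']}\n"
              f"Unit: {item['unit']}\n"
              "Suggest a unit rate:")
    completion = str(item.get("unit_rate", ""))
    return json.dumps({"prompt": prompt, "completion": completion})
```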
Specification search (RAG). Chunked BOQ items and specification clauses, embedded and stored in a vector index, allow engineers to query "what specification applies to pre-cast concrete in external walls?" and retrieve relevant clauses from the project's own specification documents — not generic web content.
Compliance checking. A model trained on drawing annotations and their corresponding specification requirements can flag when a drawing detail appears to deviate from the project specification — or when a BOQ item references a drawing that does not exist.
Historical estimate retrieval. With enough projects processed, a RAG system can answer "what was the rate for waterproofing to basement walls on projects of this type, in this region, in the last three years?" using the firm's own historical data.
Why This Has to Be On-Premise
Construction companies with large project archives cannot send those archives to cloud APIs for processing. The documents contain commercially sensitive quantities, rates, and specifications. Some jurisdictions — including Pakistan's PPIA — require data processing approval for external transfers, and obtaining that approval can take over a year.
More practically: the volume is the problem. A 700GB archive of project documents is not a batch job you run through an API. It requires a pipeline that runs locally, processes files incrementally, and maintains state across sessions so that interrupted jobs can resume without reprocessing everything from the start.
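The resumable part needs little more than a manifest recording per-file status. A sketch, assuming a local JSON state file (the filename is arbitrary):

```python
import json
import os

MANIFEST = "extraction_manifest.json"  # assumption: local state file name

def load_manifest() -> dict:
    """Processing state that survives interrupts: file path -> status."""
    if os.path.exists(MANIFEST):
        with open(MANIFEST) as f:
            return json.load(f)
    return {}

def process_archive(paths, extract_fn):
    """Process files incrementally; files already marked done are
    skipped when an interrupted run resumes."""
    state = load_manifest()
    for path in paths:
        if state.get(path) == "done":
            continue
        extract_fn(path)
        state[path] = "done"
        # Flush state after every file so an interrupt loses at most one file.
        with open(MANIFEST, "w") as f:
            json.dump(state, f)
```

Writing the manifest after every file is deliberately conservative: for a 700GB archive, re-extracting one file after a crash is cheap, while re-running days of completed work is not.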
The extraction pipeline should run entirely on the local machine. OCR, table detection, symbol recognition, quality scoring — all of it must operate without any internet dependency. The output — structured JSONL records and chunked text — is what gets used downstream.
Getting Started
The minimum viable approach:
- Classify a representative sample of your document archive by type (drawings, BOQs, specifications, reports)
- Start with the BOQ documents — they yield the most structured data with the least ambiguity
- Process drawings by zone, not by page
- Establish a quality threshold for automatic acceptance vs. human review
- Build the cross-reference links between BOQ items and drawing numbers before exporting to JSONL
The goal is not perfect extraction on every document. It is a training dataset large enough and clean enough to be useful. For most construction AI use cases, a dataset of 50,000 structured BOQ line items with drawing cross-references is a meaningful starting point.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.