
How to Extract AI Training Data from Engineering Drawings and BOQ Documents
A practical guide to extracting structured AI training data from engineering drawings, bills of quantities, and construction PDFs — for teams building domain-specific AI in construction and infrastructure.
Engineering drawings and bills of quantities are the dense, information-rich backbone of every construction project. They contain specifications, dimensions, quantities, materials, and cost structures that took years of domain expertise to produce. They are also, from a machine learning perspective, some of the hardest documents to parse.
If you are trying to build a domain-specific AI for construction — an estimating model, a spec search system, a compliance checker — the first obstacle is not the model. It is getting the training data out of the documents.
This guide covers exactly that: what makes engineering drawings and BOQ documents difficult to process, how to approach extraction, and what a usable training dataset actually looks like at the end.
Why Engineering Drawings Break Standard OCR
A standard OCR engine — Tesseract, AWS Textract, or the OCR layer inside Adobe Acrobat — is trained on continuous prose text. It expects words arranged in lines, lines arranged in paragraphs, and paragraphs arranged on a page with predictable margins.
Engineering drawings violate every one of those assumptions.
Symbol-dense content. Structural and civil engineering drawings use dozens of specialized symbols: weld types, cross-section indicators, reinforcement bar callouts, elevation markers, slope arrows. A standard OCR model has no training data for these. It will either skip them or misread adjacent text because the symbol is interfering with the character segmentation.
Multi-layout pages. A single A1 drawing sheet may contain a plan view in the upper left, section cuts in the lower half, a title block in the lower right corner, revision history in the right margin, and general notes scattered wherever there was space. There is no reading order an OCR engine can infer. It will concatenate content from these zones in the wrong sequence, producing text that has no semantic coherence.
Annotation layers. CAD-exported PDFs contain dimensions, keynotes, and leader lines that sit in separate layers from the main drawing geometry. When flattened to a rasterized PDF, these become overlapping elements. Text recognition fails on text that overlaps or touches other elements.
Scale-dependent detail. On a site plan at 1:500, building outlines are thick and annotation is sparse. On a detail drawing at 1:10, the same area is full of material callouts, dimension strings, and reference bubbles. A single PDF project document can contain both, and no fixed OCR resolution works for both simultaneously.
The consequence: generic OCR extraction from engineering drawings produces noisy, out-of-sequence text with missing annotations and garbled dimension strings. That output cannot be used for AI training without extensive — and often impractical — manual correction.
Why BOQ Documents Are Different (and Also Hard)
Bills of quantities sit at the other end of the spectrum. They are not visual documents; they are highly structured tabular data. But they are structured in a way that is specific to the construction industry and that most data extraction tools handle poorly.
A typical BOQ is organized as a multi-level numbered hierarchy: divisions, sections, items. Each line item has an item code, a description that may run across multiple lines, a quantity, a unit of measure, a rate, and an extended amount. The description field is where the domain knowledge lives — it encodes the material specification, the work method, the applicable standard, and any references to drawing numbers.
The extraction challenges specific to BOQs:
Hybrid PDF format. BOQs are often generated from quantity surveying software (CostX, CANDY, Buildsoft) and exported as PDFs that contain a mix of embedded text and rasterized tables. The tabular structure may look perfect on screen but be represented in the PDF as a series of disconnected text fragments at arbitrary X-Y coordinates, with no underlying table object that extraction tools can detect.
Multi-page item descriptions. A complex BOQ item — say, structural concrete for a post-tensioned slab with specific admixture requirements and reference to a project-specific specification — can have a description that runs across three or four lines. Page breaks interrupt items mid-description. Extraction tools that process page by page will split these items incorrectly.
Continuation tables. BOQ tables often continue across pages with column headers repeated at the top of each page. Naive table extraction merges the repeated headers as data rows, corrupting the structure.
Quantity and unit normalization. Units in BOQs are not standardized. "m3", "M3", "CUM", "cum", and "Cum" all mean cubic meters. Item quantities appear as "1,234.50", "1234.5", and "1,234·50" depending on the locale and the software that produced the document. An extraction pipeline must normalize these without changing the values.
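A minimal normalization pass for the unit aliases and number formats shown above might look like this. The alias table and the locale handling are assumptions; a real archive needs a project-specific alias list and care with comma-decimal locales:

```python
# Canonical unit mapping -- illustrative, extend per project archive.
UNIT_ALIASES = {
    "m3": "m3", "cum": "m3",      # cubic meters
    "m2": "m2", "sqm": "m2",      # square meters
    "no": "nr", "nr": "nr",       # item counts
    "lm": "m", "m": "m",          # linear meters
}

def normalize_unit(raw: str) -> str:
    """Map a raw unit string ('CUM', 'M3', 'Cum', ...) to a canonical form.
    Unknown units pass through lowercased rather than being dropped."""
    key = raw.strip().rstrip(".").lower()
    return UNIT_ALIASES.get(key, key)

def normalize_quantity(raw: str) -> float:
    """Parse '1,234.50', '1234.5', or '1,234·50' to the same float value.
    Assumes comma is a thousands separator, not a decimal mark."""
    cleaned = raw.replace("\u00b7", ".").replace(",", "").strip()
    return float(cleaned)
```

The key design choice is that unknown units survive (lowercased) instead of raising, so unexpected entries show up in the output where they can be reviewed, rather than halting a multi-day batch run.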
The Extraction Pipeline
A practical pipeline for construction documents requires separate handling for drawing documents and BOQ documents, with a merge step at the end.
Stage 1: Document classification. Before processing, each file needs to be classified: is this a drawing sheet, a BOQ, a specification section, an inspection report, or something else? The processing logic differs significantly between types, and applying the wrong extractor produces garbage output. Classification can be rule-based (file naming conventions, page dimensions, presence of a title block) or model-based for ambiguous cases.
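The rule-based first pass can be sketched as follows. The filename pattern and the A1 page-size heuristic are assumptions, not universal conventions; anything the rules cannot decide falls through to "unknown" for model-based or manual classification:

```python
import re

def classify_document(filename: str, page_width_mm: float,
                      page_height_mm: float, first_page_text: str) -> str:
    """Rule-based first-pass classifier. Ambiguous files return 'unknown'."""
    name = filename.lower()
    # Drawing-number naming conventions like S-201 or A-101 (assumed pattern).
    if re.search(r"\b[asdm]-\d{3}", name):
        return "drawing"
    if "boq" in name or "bill of quantities" in first_page_text.lower():
        return "boq"
    # A1 sheets (841 x 594 mm) strongly suggest a drawing even without a match above.
    if page_width_mm > 800 and page_height_mm > 550:
        return "drawing"
    if "specification" in name or "spec" in name.split("_"):
        return "specification"
    return "unknown"
```

Falling through to "unknown" rather than guessing matters here: as the paragraph above notes, applying the wrong extractor produces garbage output, so it is cheaper to route the ambiguous tail to a slower classifier.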
Stage 2: Drawing extraction. For drawings, the pipeline has to operate on regions rather than the full page. The title block, general notes, and plan area are processed separately. Region detection can use template matching for standard title block positions, or a layout segmentation model for non-standard sheets. Within each region, OCR runs at the appropriate resolution, and symbol detection runs as a separate pass — outputting structured records like {type: "weld_symbol", subtype: "fillet", size_mm: 6, location: "beam_flange_connection"} rather than trying to convert symbols to text.
Stage 3: BOQ extraction. For BOQ documents, the pipeline focuses on table structure reconstruction. This requires detecting column boundaries from the distribution of X coordinates of text fragments, associating continuation lines with their parent items using indentation and numbering patterns, normalizing quantities and units to canonical forms, and extracting drawing references from description text.
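Column boundary detection from fragment X coordinates can be sketched like this, assuming text fragments arrive as (x0, y0, text) tuples from a PDF text extractor and that a real column gap is wider than the spacing inside one column (the 8 mm default is illustrative):

```python
def detect_column_boundaries(fragments, gap_mm=8.0):
    """Infer column start positions from the left (x0) coordinates of
    text fragments: x0 values separated by more than gap_mm start a
    new column cluster."""
    xs = sorted({round(x, 1) for x, _, _ in fragments})
    columns = [xs[0]]
    last = xs[0]
    for x in xs[1:]:
        if x - last > gap_mm:
            columns.append(x)
        last = x
    return columns

def assign_column(x0, columns):
    """Assign a fragment to the rightmost column starting at or left of x0."""
    idx = 0
    for i, start in enumerate(columns):
        if x0 >= start - 0.5:
            idx = i
    return idx
```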
Stage 4: Cross-reference linking. Drawing annotations reference item codes; BOQ items reference drawing numbers. Linking these creates richer training data. A line item for "structural steelwork, grade S275, hot-dip galvanized" becomes more useful as training data when linked to the drawing that shows the member geometry and the connection detail.
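A sketch of the reference-extraction side of this linking, assuming drawing numbers follow the S-201 / D-C-04 style used in this article's examples (real projects need their own numbering grammar):

```python
import re

# Assumed convention: one or two uppercase letters, an optional second
# letter group, then a 2-4 digit sheet number, e.g. S-201, D-C-04.
DRAWING_REF = re.compile(r"\b([A-Z]{1,2}(?:-[A-Z]{1,2})?-\d{2,4})\b")

def extract_drawing_refs(description: str) -> list:
    """Pull drawing-number references out of a BOQ description,
    deduplicated and sorted for stable output."""
    return sorted(set(DRAWING_REF.findall(description)))
```

Running the extractor over every description field, then inverting the result (drawing number to list of item codes), gives both directions of the link for the training records.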
Stage 5: Quality scoring. Each extracted record gets a confidence score based on OCR confidence, table structure completeness, and cross-reference resolution rate. Low-confidence records are flagged for human review rather than passed directly to the training dataset.
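One way to combine the three signals into a single score; the weights and the threshold below are illustrative and would need calibration against a human-reviewed sample:

```python
def quality_score(ocr_conf: float, table_complete: float,
                  ref_resolved: float, weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted combination of three signals, each already scaled to [0, 1]:
    mean OCR confidence, table structure completeness, and the fraction of
    cross-references that resolved."""
    w_ocr, w_table, w_ref = weights
    return w_ocr * ocr_conf + w_table * table_complete + w_ref * ref_resolved

REVIEW_THRESHOLD = 0.85  # assumption: tune against a labeled sample

def route(score: float) -> str:
    """Low-confidence records go to human review, not the training set."""
    return "accept" if score >= REVIEW_THRESHOLD else "human_review"
```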
What Structured Output Looks Like
After extraction, a BOQ line item should be a structured record with these fields:
{
  "item_code": "03.04.12",
  "description": "Reinforced concrete, grade C35/45, in columns above ground floor, including formwork, vibration and curing",
  "quantity": 127.5,
  "unit": "m3",
  "unit_rate": null,
  "drawing_refs": ["S-201", "S-202", "D-C-04"],
  "spec_refs": ["Clause 5.4.2"],
  "section": "Structural Concrete",
  "division": "Substructure"
}
A drawing annotation record looks different:
{
  "drawing_number": "S-201",
  "zone": "section_cut_AA",
  "element_type": "column",
  "annotation_type": "reinforcement_callout",
  "text": "8T25 + links T10@200 c/c",
  "parsed": {
    "main_bars": {"count": 8, "dia_mm": 25, "grade": "T"},
    "links": {"dia_mm": 10, "spacing_mm": 200}
  }
}
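The text-to-parsed mapping in the record above can be produced with a small grammar. This sketch handles only the single callout format shown; real drawings use many variants and need a larger set of patterns:

```python
import re
from typing import Optional

CALLOUT = re.compile(
    r"(?P<count>\d+)\s*(?P<grade>[TRY])(?P<dia>\d+)"  # main bars: 8T25
    r"(?:\s*\+\s*links\s*[TRY](?P<link_dia>\d+)@(?P<spacing>\d+)\s*c/c)?"
)

def parse_callout(text: str) -> Optional[dict]:
    """Parse a reinforcement callout like '8T25 + links T10@200 c/c'
    into the structured form shown above; return None if no match."""
    m = CALLOUT.search(text)
    if not m:
        return None
    parsed = {"main_bars": {"count": int(m["count"]),
                            "dia_mm": int(m["dia"]),
                            "grade": m["grade"]}}
    if m["link_dia"]:
        parsed["links"] = {"dia_mm": int(m["link_dia"]),
                           "spacing_mm": int(m["spacing"])}
    return parsed
```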
This level of structure is what makes the data useful for AI training. Unstructured text scraped from a PDF produces a language model that can discuss construction. Structured, annotated records produce a model that can reason about specific items, quantities, and specifications.
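A lightweight record check before anything enters the dataset might look like this, using the field names from the BOQ example above; the required-field list and checks are illustrative:

```python
REQUIRED_FIELDS = {"item_code", "description", "quantity",
                   "unit", "section", "division"}

def validate_boq_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is usable."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - record.keys())]
    if "quantity" in record and not isinstance(record["quantity"], (int, float)):
        problems.append("quantity is not numeric")
    if "unit" in record and record["unit"] != record["unit"].lower():
        problems.append("unit not in canonical lowercase form")
    return problems
```

Returning a problem list rather than raising lets the pipeline attach the problems to the record and route it through the same human-review queue as low-confidence extractions.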
AI Use Cases This Data Enables
Estimating model fine-tuning. A model trained on structured BOQ data from completed projects can suggest rates and quantities for new items, catching outliers and improving estimating consistency. Structured records map naturally to JSONL fine-tuning pairs: description plus context as the input, rate as the target.
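Mapping one structured record to one JSONL training line might look like this; the prompt template and the prompt/completion shape are illustrative, and only items with a known rate would become training pairs:

```python
import json

def to_finetune_example(item: dict) -> str:
    """Serialize a structured BOQ record as one JSONL fine-tuning line.
    Field names follow the record schema shown earlier in this article."""
    prompt = (f"Section: {item['section']}\n"
              f"Item: {item['description']}\n"
              f"Unit: {item['unit']}\n"
              "Suggest a unit rate:")
    completion = str(item.get("unit_rate", ""))
    return json.dumps({"prompt": prompt, "completion": completion})
```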
Specification search (RAG). Chunked BOQ items and specification clauses, embedded and stored in a vector index, allow engineers to query "what specification applies to pre-cast concrete in external walls?" and retrieve relevant clauses from the project's own specification documents — not generic web content.
Compliance checking. A model trained on drawing annotations and their corresponding specification requirements can flag when a drawing detail appears to deviate from the project specification — or when a BOQ item references a drawing that does not exist.
Historical estimate retrieval. With enough projects processed, a RAG system can answer "what was the rate for waterproofing to basement walls on projects of this type, in this region, in the last three years?" using the firm's own historical data.
Why This Has to Be On-Premise
Construction companies with large project archives cannot send those archives to cloud APIs for processing. The documents contain commercially sensitive quantities, rates, and specifications. Some jurisdictions — including Pakistan's PPIA — require data processing approval for external transfers, and obtaining that approval can take over a year.
More practically: the volume is the problem. A 700GB archive of project documents is not a batch job you run through an API. It requires a pipeline that runs locally, processes files incrementally, and maintains state across sessions so that interrupted jobs can resume without reprocessing everything from the start.
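The resumable part needs little more than a manifest recording per-file status. A sketch, assuming a local JSON state file (the filename is arbitrary):

```python
import json
import os

MANIFEST = "extraction_manifest.json"  # assumption: local state file name

def load_manifest() -> dict:
    """Processing state that survives interrupts: file path -> status."""
    if os.path.exists(MANIFEST):
        with open(MANIFEST) as f:
            return json.load(f)
    return {}

def process_archive(paths, extract_fn):
    """Process files incrementally; files already marked done are
    skipped when an interrupted run resumes."""
    state = load_manifest()
    for path in paths:
        if state.get(path) == "done":
            continue
        extract_fn(path)
        state[path] = "done"
        # Flush state after every file so an interrupt loses at most one file.
        with open(MANIFEST, "w") as f:
            json.dump(state, f)
```

Writing the manifest after every file is deliberately conservative: for a 700GB archive, re-extracting one file after a crash is cheap, while re-running days of completed work is not.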
The extraction pipeline should run entirely on the local machine. OCR, table detection, symbol recognition, quality scoring — all of it must operate without any internet dependency. The output — structured JSONL records and chunked text — is what gets used downstream.
Getting Started
The minimum viable approach:
- Classify a representative sample of your document archive by type (drawings, BOQs, specifications, reports)
- Start with the BOQ documents — they yield the most structured data with the least ambiguity
- Process drawings by zone, not by page
- Establish a quality threshold for automatic acceptance vs. human review
- Build the cross-reference links between BOQ items and drawing numbers before exporting to JSONL
The goal is not perfect extraction on every document. It is a training dataset large enough and clean enough to be useful. For most construction AI use cases, a dataset of 50,000 structured BOQ line items with drawing cross-references is a meaningful starting point.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.