    AI Data Preparation for Construction: BOQs, Drawings, and Technical PDFs


    How construction and engineering companies can convert BOQs, technical drawings, and project documentation into AI-ready training datasets — on-premise, with full audit trail.

Ertas Team

    Construction companies sit on some of the largest untapped data archives in any industry. Hundreds of gigabytes of project documentation — Bills of Quantities (BOQs), technical drawings, specifications, RFIs, submittals, change orders — accumulated over decades of projects. This data represents enormous domain knowledge, and it's almost entirely locked in unstructured formats.

    Converting these archives into AI-ready training data is the prerequisite for every construction AI use case: automated quantity estimation, document classification, specification compliance checking, and cost forecasting. But the data preparation challenges in construction are unique.

    What's in the Archive

    A typical mid-to-large construction company's document archive includes:

    Bills of Quantities (BOQs): Structured tables listing materials, labor, quantities, unit rates, and amounts — but in wildly inconsistent formats. Some are Excel spreadsheets, some are PDF exports, some are scanned paper documents. The hierarchy (sections, subsections, items, sub-items) varies by contractor, region, and era.
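Whatever the source format, the parsing target is a single normalized record per line item. A minimal sketch of what that record might look like, where the field names and the list-based hierarchy representation are illustrative assumptions rather than an industry standard:

```python
from dataclasses import dataclass

@dataclass
class BoqItem:
    """Normalized BOQ line item, independent of the source format.

    Field names are illustrative assumptions, not an industry standard.
    """
    section_path: list[str]   # hierarchy, e.g. ["3 Concrete Works", "3.2 Columns"]
    item_code: str            # contractor-specific item reference
    description: str
    unit: str                 # canonical unit, e.g. "m3", "kg", "nr"
    quantity: float
    unit_rate: float

    @property
    def amount(self) -> float:
        # Derived rather than stored, so parsed amounts can be cross-checked
        return self.quantity * self.unit_rate

item = BoqItem(
    section_path=["3 Concrete Works", "3.2 Columns"],
    item_code="3.2.1",
    description="Reinforced concrete in columns, grade C30",
    unit="m3",
    quantity=42.5,
    unit_rate=180.0,
)
```

Computing the amount instead of trusting the extracted one gives a cheap consistency check: if the parsed amount disagrees with quantity × rate, the row likely suffered an extraction error.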

    Technical drawings: DWG files, PDF exports of CAD drawings, scanned blueprints. These contain spatial information, dimensions, annotations, and symbols that represent specific construction elements.

    Specifications: Multi-hundred-page documents defining materials, methods, and quality requirements. Mix of structured sections and free-text descriptions.

    RFIs (Requests for Information): Questions and answers between contractors, architects, and engineers. Often in email chains, PDFs, or project management system exports.

    Submittals: Manufacturer data sheets, shop drawings, material certificates. Varied formats, often scanned.

    Change orders: Modifications to original scope with cost and schedule implications. Mix of structured forms and narrative descriptions.

    Why Construction Data Prep Is Especially Hard

    Format Inconsistency

    Unlike healthcare (where HL7/FHIR standards exist) or finance (where XBRL provides structure), construction has no universal data standard. A BOQ from one contractor looks completely different from another. Column names, hierarchies, unit conventions, and formatting vary project to project.
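One practical countermeasure is an alias table mapping each contractor's column headers onto a single canonical schema. A minimal sketch, where the alias list is illustrative and would grow contractor by contractor:

```python
# Contractor-specific column headers mapped to one canonical schema.
# The alias table is an illustrative assumption; a real deployment
# extends it per contractor and per era of documents.
CANONICAL_COLUMNS = {
    "qty": "quantity", "quantity": "quantity",
    "rate": "unit_rate", "unit rate": "unit_rate", "u/rate": "unit_rate",
    "amount": "amount", "total": "amount", "amt.": "amount",
    "unit": "unit", "uom": "unit",
    "description": "description", "item description": "description",
}

def normalize_headers(headers: list[str]) -> list[str]:
    """Map raw spreadsheet headers to canonical names; keep unknowns lowercased."""
    return [
        CANONICAL_COLUMNS.get(h.strip().lower(), h.strip().lower())
        for h in headers
    ]
```

Unknown headers pass through lowercased rather than failing, so new contractor formats surface as unmapped columns that a reviewer can add to the table.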

    Mixed Modalities

    Construction documents combine text, tables, drawings, and images — often on the same page. A specification might have a paragraph of text, a table of material properties, and a cross-reference to a drawing number. Parsing this requires understanding the relationship between these elements.

    Scale

    A single large project can generate 50,000+ pages of documentation. A company with 20 years of project history might have hundreds of thousands of documents. Manual processing at this scale is impractical.

    Domain Specificity

    Understanding construction documents requires domain expertise. An ML engineer can't tell whether a BOQ item is correctly classified without understanding construction trades, measurement conventions, and material specifications. This is knowledge that lives in quantity surveyors and project managers, not data scientists.

    Compliance and Sensitivity

    Construction project data often contains commercially sensitive information: pricing, contractor rates, client budgets. In some regions (particularly the Middle East and South Asia), data sovereignty regulations restrict where this information can be processed.

    The Data Preparation Pipeline for Construction

    Stage 1: Ingestion

    • OCR for scanned documents with layout detection
    • Table extraction from BOQs (handling merged cells, nested hierarchies)
    • Drawing file parsing (extracting annotations, dimensions, element identification)
    • PDF structure analysis (distinguishing sections, appendices, references)
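As one concrete example of the merged-cell problem: PDF table extractors often return a merged section cell as text in the first row and empty strings in the rows below it. A small forward-fill pass, assuming the section label sits in column 0 (a common BOQ layout, but an assumption here), restores the hierarchy:

```python
def forward_fill_sections(rows: list[list[str]]) -> list[list[str]]:
    """Fill empty leading 'section' cells that were merged in the source table.

    Assumes column 0 holds the section label, as is common in BOQ layouts.
    """
    filled, current = [], ""
    for row in rows:
        if row[0].strip():
            current = row[0].strip()  # new section starts here
        filled.append([current] + row[1:])
    return filled

# Rows as a PDF table extractor might return them (illustrative data)
raw = [
    ["Earthworks", "Excavate to reduce levels", "m3", "120"],
    ["", "Disposal of surplus material", "m3", "95"],
    ["Concrete", "Blinding under footings", "m3", "14"],
]
filled = forward_fill_sections(raw)
```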

    Stage 2: Cleaning

    • Normalization of units (converting between metric and imperial)
    • Standardization of terminology (mapping contractor-specific terms to common vocabulary)
    • Deduplication across project documents (the same specification section often appears in multiple documents)
    • Quality scoring (confidence levels for OCR output, table extraction accuracy)
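The unit-normalization step above can be sketched as a lookup table of exact conversion factors. The table below covers only a handful of imperial units as an illustration; a real pipeline would carry a far larger table plus per-region unit conventions:

```python
# Illustrative conversion table; the factors shown are the exact
# definitions for these length/area/volume/mass units.
TO_METRIC = {
    "ft": ("m", 0.3048),
    "sq ft": ("m2", 0.09290304),
    "cu yd": ("m3", 0.764554857984),
    "lb": ("kg", 0.45359237),
}

def normalize_quantity(value: float, unit: str) -> tuple[float, str]:
    """Convert a quantity to metric units; pass metric units through unchanged."""
    unit = unit.strip().lower()
    if unit in TO_METRIC:
        target, factor = TO_METRIC[unit]
        return round(value * factor, 4), target
    return value, unit
```

Rounding to four decimals is a pragmatic choice for BOQ quantities, not a requirement; keep full precision if downstream models recompute amounts.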

    Stage 3: Labeling

    • Construction trade classification (civil, mechanical, electrical, plumbing)
    • Document type categorization (specification, BOQ, drawing, RFI, submittal)
    • Entity extraction (material names, quantities, rates, project references)
    • Relationship mapping (which specification section relates to which BOQ item)

    Stage 4: Augmentation

    • Synthetic data generation for underrepresented document types
    • Balanced sampling across trades and project types
    • Cross-referencing between documents to build relational training data
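Balanced sampling across trades can be sketched as capping each class's contribution to the training set. The record schema and the `trade` label field are assumptions for illustration:

```python
import random
from collections import defaultdict

def balanced_sample(records: list[dict], key: str,
                    per_class: int, seed: int = 7) -> list[dict]:
    """Downsample so each class (e.g. trade) contributes at most
    `per_class` records. A fixed seed keeps dataset builds reproducible."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for r in records:
        buckets[r[key]].append(r)
    out = []
    for items in buckets.values():
        rng.shuffle(items)          # random choice within each class
        out.extend(items[:per_class])
    return out
```

Classes with fewer than `per_class` records are kept whole; synthetic generation (the previous bullet) is what fills those underrepresented classes back up.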

    Stage 5: Export

    • JSONL for fine-tuning construction language models
    • Chunked text for RAG knowledge bases
    • Structured JSON for classification and extraction models
    • CSV for traditional ML quantity estimation models
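The JSONL target is simply one JSON object per line. A minimal writer sketch, using a hypothetical prompt/completion pair as the fine-tuning record shape (actual record schemas depend on the target model's training format):

```python
import io
import json

def write_jsonl(records: list[dict], stream) -> int:
    """Serialize records as one JSON object per line (the JSONL convention).

    ensure_ascii=False keeps non-Latin text (common in GCC project
    documents) readable instead of escape-encoded.
    """
    n = 0
    for rec in records:
        stream.write(json.dumps(rec, ensure_ascii=False) + "\n")
        n += 1
    return n

buf = io.StringIO()
count = write_jsonl(
    [{"prompt": "Classify: Excavate to reduce levels",
      "completion": "earthworks"}],
    buf,
)
```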

    Why This Must Happen On-Premise

    Construction data preparation has a strong case for on-premise processing:

    1. Commercial sensitivity: Pricing data, contractor rates, and client budgets can't be exposed to cloud services
    2. Data sovereignty: Companies operating in regions with data localization requirements (GCC countries, Pakistan's PPIA) need data to stay on local infrastructure
    3. Volume: Shipping hundreds of gigabytes to cloud services is slow and expensive
    4. Domain expert involvement: Quantity surveyors and project managers who need to participate in labeling shouldn't need cloud accounts and DevOps support

    Getting Started

    If your construction company is sitting on a large document archive and exploring AI adoption, the path forward is:

    1. Audit your archive: What document types do you have? What formats? What volume?
    2. Identify the first use case: Start narrow — automated BOQ classification is a common first project
    3. Assess data quality: How much of your archive is digital-native vs. scanned? Scanned documents need an OCR pass, which adds an error-prone step and lowers extraction quality.
    4. Engage domain experts: Quantity surveyors and project managers need to define the labeling schema — they know what matters.

    Platforms like Ertas Data Suite are built for exactly this workflow — handling the full pipeline from ingestion through export, on-premise, with a native desktop interface that domain experts can use directly. The 700GB PDF archive isn't a problem to be solved later. It's the asset that makes construction AI possible.

