    AI Data Preparation for Construction: BOQs, Drawings, and Technical PDFs


    How construction and engineering companies can convert BOQs, technical drawings, and project documentation into AI-ready training datasets — on-premise, with full audit trail.

Ertas Team

    Construction companies sit on some of the largest untapped data archives in any industry. Hundreds of gigabytes of project documentation — Bills of Quantities (BOQs), technical drawings, specifications, RFIs, submittals, change orders — accumulated over decades of projects. This data represents enormous domain knowledge, and it's almost entirely locked in unstructured formats.

    Converting these archives into AI-ready training data is the prerequisite for every construction AI use case: automated quantity estimation, document classification, specification compliance checking, and cost forecasting. But the data preparation challenges in construction are unique.

    What's in the Archive

    A typical mid-to-large construction company's document archive includes:

    Bills of Quantities (BOQs): Structured tables listing materials, labor, quantities, unit rates, and amounts — but in wildly inconsistent formats. Some are Excel spreadsheets, some are PDF exports, some are scanned paper documents. The hierarchy (sections, subsections, items, sub-items) varies by contractor, region, and era.
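Whatever the source format, the parsing target is a single normalized record per line item. A minimal sketch of what that record might look like, where the field names and the list-based hierarchy representation are illustrative assumptions rather than an industry standard:

```python
from dataclasses import dataclass

@dataclass
class BoqItem:
    """Normalized BOQ line item, independent of the source format.

    Field names are illustrative assumptions, not an industry standard.
    """
    section_path: list[str]   # hierarchy, e.g. ["3 Concrete Works", "3.2 Columns"]
    item_code: str            # contractor-specific item reference
    description: str
    unit: str                 # canonical unit, e.g. "m3", "kg", "nr"
    quantity: float
    unit_rate: float

    @property
    def amount(self) -> float:
        # Derived rather than stored, so parsed amounts can be cross-checked
        return self.quantity * self.unit_rate

item = BoqItem(
    section_path=["3 Concrete Works", "3.2 Columns"],
    item_code="3.2.1",
    description="Reinforced concrete in columns, grade C30",
    unit="m3",
    quantity=42.5,
    unit_rate=180.0,
)
```

Computing the amount instead of trusting the extracted one gives a cheap consistency check: if the parsed amount disagrees with quantity × rate, the row likely suffered an extraction error.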

    Technical drawings: DWG files, PDF exports of CAD drawings, scanned blueprints. These contain spatial information, dimensions, annotations, and symbols that represent specific construction elements.

    Specifications: Multi-hundred-page documents defining materials, methods, and quality requirements. Mix of structured sections and free-text descriptions.

    RFIs (Requests for Information): Questions and answers between contractors, architects, and engineers. Often in email chains, PDFs, or project management system exports.

    Submittals: Manufacturer data sheets, shop drawings, material certificates. Varied formats, often scanned.

    Change orders: Modifications to original scope with cost and schedule implications. Mix of structured forms and narrative descriptions.

    Why Construction Data Prep Is Especially Hard

    Format Inconsistency

    Unlike healthcare (where HL7/FHIR standards exist) or finance (where XBRL provides structure), construction has no universal data standard. A BOQ from one contractor looks completely different from another. Column names, hierarchies, unit conventions, and formatting vary project to project.
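One practical countermeasure is an alias table mapping each contractor's column headers onto a single canonical schema. A minimal sketch, where the alias list is illustrative and would grow contractor by contractor:

```python
# Contractor-specific column headers mapped to one canonical schema.
# The alias table is an illustrative assumption; a real deployment
# extends it per contractor and per era of documents.
CANONICAL_COLUMNS = {
    "qty": "quantity", "quantity": "quantity",
    "rate": "unit_rate", "unit rate": "unit_rate", "u/rate": "unit_rate",
    "amount": "amount", "total": "amount", "amt.": "amount",
    "unit": "unit", "uom": "unit",
    "description": "description", "item description": "description",
}

def normalize_headers(headers: list[str]) -> list[str]:
    """Map raw spreadsheet headers to canonical names; keep unknowns lowercased."""
    return [
        CANONICAL_COLUMNS.get(h.strip().lower(), h.strip().lower())
        for h in headers
    ]
```

Unknown headers pass through lowercased rather than failing, so new contractor formats surface as unmapped columns that a reviewer can add to the table.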

    Mixed Modalities

    Construction documents combine text, tables, drawings, and images — often on the same page. A specification might have a paragraph of text, a table of material properties, and a cross-reference to a drawing number. Parsing this requires understanding the relationship between these elements.

    Scale

    A single large project can generate 50,000+ pages of documentation. A company with 20 years of project history might have hundreds of thousands of documents. Manual processing at this scale is impractical.

    Domain Specificity

    Understanding construction documents requires domain expertise. An ML engineer can't tell whether a BOQ item is correctly classified without understanding construction trades, measurement conventions, and material specifications. This is knowledge that lives in quantity surveyors and project managers, not data scientists.

    Compliance and Sensitivity

    Construction project data often contains commercially sensitive information: pricing, contractor rates, client budgets. In some regions (particularly the Middle East and South Asia), data sovereignty regulations restrict where this information can be processed.

    The Data Preparation Pipeline for Construction

    Stage 1: Ingestion

    • OCR for scanned documents with layout detection
    • Table extraction from BOQs (handling merged cells, nested hierarchies)
    • Drawing file parsing (extracting annotations, dimensions, element identification)
    • PDF structure analysis (distinguishing sections, appendices, references)
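As one concrete example of the merged-cell problem: PDF table extractors often return a merged section cell as text in the first row and empty strings in the rows below it. A small forward-fill pass, assuming the section label sits in column 0 (a common BOQ layout, but an assumption here), restores the hierarchy:

```python
def forward_fill_sections(rows: list[list[str]]) -> list[list[str]]:
    """Fill empty leading 'section' cells that were merged in the source table.

    Assumes column 0 holds the section label, as is common in BOQ layouts.
    """
    filled, current = [], ""
    for row in rows:
        if row[0].strip():
            current = row[0].strip()  # new section starts here
        filled.append([current] + row[1:])
    return filled

# Rows as a PDF table extractor might return them (illustrative data)
raw = [
    ["Earthworks", "Excavate to reduce levels", "m3", "120"],
    ["", "Disposal of surplus material", "m3", "95"],
    ["Concrete", "Blinding under footings", "m3", "14"],
]
filled = forward_fill_sections(raw)
```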

    Stage 2: Cleaning

    • Normalization of units (converting between metric and imperial)
    • Standardization of terminology (mapping contractor-specific terms to common vocabulary)
    • Deduplication across project documents (the same specification section often appears in multiple documents)
    • Quality scoring (confidence levels for OCR output, table extraction accuracy)
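The unit-normalization step above can be sketched as a lookup table of exact conversion factors. The table below covers only a handful of imperial units as an illustration; a real pipeline would carry a far larger table plus per-region unit conventions:

```python
# Illustrative conversion table; the factors shown are the exact
# definitions for these length/area/volume/mass units.
TO_METRIC = {
    "ft": ("m", 0.3048),
    "sq ft": ("m2", 0.09290304),
    "cu yd": ("m3", 0.764554857984),
    "lb": ("kg", 0.45359237),
}

def normalize_quantity(value: float, unit: str) -> tuple[float, str]:
    """Convert a quantity to metric units; pass metric units through unchanged."""
    unit = unit.strip().lower()
    if unit in TO_METRIC:
        target, factor = TO_METRIC[unit]
        return round(value * factor, 4), target
    return value, unit
```

Rounding to four decimals is a pragmatic choice for BOQ quantities, not a requirement; keep full precision if downstream models recompute amounts.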

    Stage 3: Labeling

    • Construction trade classification (civil, mechanical, electrical, plumbing)
    • Document type categorization (specification, BOQ, drawing, RFI, submittal)
    • Entity extraction (material names, quantities, rates, project references)
    • Relationship mapping (which specification section relates to which BOQ item)

    Stage 4: Augmentation

    • Synthetic data generation for underrepresented document types
    • Balanced sampling across trades and project types
    • Cross-referencing between documents to build relational training data
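Balanced sampling across trades can be sketched as capping each class's contribution to the training set. The record schema and the `trade` label field are assumptions for illustration:

```python
import random
from collections import defaultdict

def balanced_sample(records: list[dict], key: str,
                    per_class: int, seed: int = 7) -> list[dict]:
    """Downsample so each class (e.g. trade) contributes at most
    `per_class` records. A fixed seed keeps dataset builds reproducible."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for r in records:
        buckets[r[key]].append(r)
    out = []
    for items in buckets.values():
        rng.shuffle(items)          # random choice within each class
        out.extend(items[:per_class])
    return out
```

Classes with fewer than `per_class` records are kept whole; synthetic generation (the previous bullet) is what fills those underrepresented classes back up.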

    Stage 5: Export

    • JSONL for fine-tuning construction language models
    • Chunked text for RAG knowledge bases
    • Structured JSON for classification and extraction models
    • CSV for traditional ML quantity estimation models
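The JSONL target is simply one JSON object per line. A minimal writer sketch, using a hypothetical prompt/completion pair as the fine-tuning record shape (actual record schemas depend on the target model's training format):

```python
import io
import json

def write_jsonl(records: list[dict], stream) -> int:
    """Serialize records as one JSON object per line (the JSONL convention).

    ensure_ascii=False keeps non-Latin text (common in GCC project
    documents) readable instead of escape-encoded.
    """
    n = 0
    for rec in records:
        stream.write(json.dumps(rec, ensure_ascii=False) + "\n")
        n += 1
    return n

buf = io.StringIO()
count = write_jsonl(
    [{"prompt": "Classify: Excavate to reduce levels",
      "completion": "earthworks"}],
    buf,
)
```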

    Why This Must Happen On-Premise

    Construction data preparation has a strong case for on-premise processing:

    1. Commercial sensitivity: Pricing data, contractor rates, and client budgets can't be exposed to cloud services
    2. Data sovereignty: Companies operating in regions with data localization requirements (GCC countries, Pakistan's PPIA) need data to stay on local infrastructure
    3. Volume: Shipping hundreds of gigabytes to cloud services is slow and expensive
    4. Domain expert involvement: Quantity surveyors and project managers who need to participate in labeling shouldn't need cloud accounts and DevOps support

    Getting Started

    If your construction company is sitting on a large document archive and exploring AI adoption, the path forward is:

    1. Audit your archive: What document types do you have? What formats? What volume?
    2. Identify the first use case: Start narrow — automated BOQ classification is a common first project
    3. Assess data quality: How much of your archive is digital-native vs. scanned? Scanned documents need an OCR pass, which adds an error-prone step and lowers extraction quality.
    4. Engage domain experts: Quantity surveyors and project managers need to define the labeling schema — they know what matters.

    Platforms like Ertas Data Suite are built for exactly this workflow — handling the full pipeline from ingestion through export, on-premise, with a native desktop interface that domain experts can use directly. The 700GB PDF archive isn't a problem to be solved later. It's the asset that makes construction AI possible.

