
Construction AI: Turning 700GB of Unstructured Project Files into a Domain-Specific Model
Construction companies sit on massive archives of PDFs, drawings, BOQs, and inspection reports. Here's how to turn that archive into AI training datasets — on-premise, without sending files to cloud APIs.
Most construction companies already have the raw material for a powerful AI system. It is sitting in a shared drive somewhere — a decade of project files, PDFs, drawings, inspection reports, bills of quantities, progress photos, and RFIs, accumulated project by project and organized just well enough to find things manually.
The challenge is not data scarcity. It is the opposite: data that is rich, dense with domain knowledge, and almost entirely unusable by AI systems in its current form.
This guide covers how to approach that conversion problem: what construction AI actually requires, why the data is hard to work with, what the full processing pipeline looks like, and what you end up with at the end.
The Scope of the Problem
A mid-size construction firm with ten to fifteen years of project history will typically have between 200GB and 1TB of project documents. A firm that has been operating for thirty years, or one that has managed large civil infrastructure projects, can have considerably more.
Within that archive, the document types are heterogeneous:
- Drawings: Architectural, structural, MEP, civil, drainage — CAD-exported PDFs ranging from simple floor plans to complex 3D section cuts
- Bills of quantities: Tabular cost documents generated from quantity surveying software
- Specifications: Technical spec sections in Word or PDF, often hundreds of pages per project
- Inspection and testing reports: Structured forms and free-text narrative, often including photographs
- RFIs and correspondence: Question-and-answer chains, site instructions, variation orders
- BIM exports: IFC files, COBie spreadsheets, clash detection reports
- Progress photos: JPEGs with inconsistent naming, often the only record of as-built conditions
Construction AI practitioners capture the key insight succinctly: "The problem is not fine-tuning but cleaning and preparing the diverse data." The model is the easy part. The data is the hard part.
What Construction AI Use Cases Are Actually Valuable
Before processing a single document, it is worth being specific about what AI models you are trying to build. The data pipeline design depends on the downstream use case.
Cost estimating models. A model trained on historical BOQ data — item descriptions, quantities, rates, project type, location, and date — can suggest rates for new estimate items, flag outliers against historical ranges, and improve estimating consistency across the team. This requires structured, normalized BOQ data from completed projects.
Document search (RAG). A retrieval-augmented generation system over the firm's specification library, project specifications, and technical standards allows engineers to query the archive in natural language. "What is the specified minimum compressive strength for concrete in ground-floor columns on Type A projects?" retrieves the relevant clause from the relevant document. This requires clean, chunked text with good metadata.
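To make "chunked text with good metadata" concrete, here is a minimal chunking sketch. Token counts are approximated by whitespace word counts, and the metadata fields are illustrative assumptions, not a fixed schema:

```python
# Minimal chunking sketch for a RAG index. Token length is approximated
# by whitespace word count; metadata fields are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def chunk_section(text: str, metadata: dict,
                  max_words: int = 350, overlap: int = 50) -> list[Chunk]:
    """Split one extracted section into overlapping word windows."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        if window:
            chunks.append(Chunk(" ".join(window), dict(metadata)))
    return chunks
```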
Inspection analysis. A model trained on inspection report data — defect type, location, severity, remediation action, project phase — can classify new inspection findings, suggest remediation actions, and flag patterns that correlate with downstream problems. This requires structured extraction from inspection forms plus NER annotation of defect terminology.
Drawing search and coordination. A system that understands drawing content — not just file names — can answer "find all details showing RC column connections to steel beams" by understanding what is in the drawings rather than relying on file naming conventions. This requires a model trained on annotated drawing content.
Each of these requires a somewhat different data preparation approach. A single archive may yield datasets for multiple use cases, but processing should be planned per use case, not run generically.
Why Cloud Tools Are Off-Limits
The immediate response from most data teams when confronted with a large document archive is to run it through a cloud API. AWS Textract, Azure Document Intelligence, Google Document AI — these are capable tools. The problem is not the technology.
The problem is the data governance.
Construction project documents contain commercially sensitive information: quantities, rates, subcontractor prices, and specifications that represent competitive intelligence. They may also contain personal information about site workers, subcontractors, and clients. Sending that data to a cloud API means it leaves the firm's environment. Depending on jurisdiction, that may require data processing agreements, privacy impact assessments, or explicit consent.
In some jurisdictions, it is simply not possible within a reasonable timeframe. The data processing approval process under Pakistan's PPIA, for example, can take over a year for cross-border data transfers. A firm waiting for that approval cannot begin building its AI capability.
The practical consequence: the processing pipeline must run locally. OCR, table extraction, layout analysis, annotation, quality scoring — everything must happen on hardware the firm controls, without network calls to external services.
The Full Pipeline for Construction Data
A construction data preparation pipeline has five stages. These are not quick sequential passes; each stage can take significant time on a large archive.
Stage 1: Ingest and classify. Every file in the archive is ingested and classified by document type. This is not optional — the processing logic for a drawing is completely different from the processing logic for a BOQ or a specification section. Classification uses a combination of file metadata (naming conventions, directory structure), page layout analysis, and content sampling. Outputs: a classified inventory of every document in the archive, with estimated processing complexity.
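A first classification pass can lean on cheap signals before any content analysis. The sketch below uses only file extensions and path keywords; the categories and keyword lists are assumptions for illustration, and a production pass would add the layout analysis and content sampling described above:

```python
# First-pass classifier using only file extension and path keywords.
# Categories and keyword lists are illustrative assumptions; a real
# pipeline refines them with page-layout analysis and content sampling.
from collections import Counter
from pathlib import Path

PATH_HINTS = {
    "boq": "bill_of_quantities",
    "drawing": "drawing",
    "spec": "specification",
    "inspection": "inspection_report",
    "rfi": "rfi",
}


def classify(path: Path) -> str:
    lowered = str(path).lower()
    for hint, doc_type in PATH_HINTS.items():
        if hint in lowered:
            return doc_type
    ext = path.suffix.lower()
    if ext == ".ifc":
        return "bim_export"
    if ext in {".jpg", ".jpeg", ".png"}:
        return "progress_photo"
    if ext in {".xlsx", ".xls", ".csv"}:
        return "tabular"  # candidate BOQ or COBie sheet
    return "unclassified"  # queue for content sampling


archive = Path("/archive")  # placeholder archive root
inventory = Counter(classify(p) for p in archive.rglob("*") if p.is_file())
print(inventory.most_common())
```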
Stage 2: Extraction. Each document type is processed with type-appropriate logic. Drawings go through zone-based OCR with symbol detection and annotation parsing. BOQs go through table reconstruction with line-item parsing and unit normalization. Specifications go through section segmentation and clause extraction. Inspection reports go through form field extraction and free-text parsing. This stage is computationally intensive and runs overnight or over a weekend for large archives.
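For the BOQ branch specifically, table reconstruction can start with an off-the-shelf PDF table extractor. A sketch using pdfplumber, assuming a hypothetical six-column layout (item, description, unit, quantity, rate, amount); real layouts vary by quantity surveying software and need per-template column mapping:

```python
# Sketch of the BOQ extraction branch using pdfplumber. The six-column
# layout assumed here is an illustration; real layouts vary by QS
# software and need per-template column mapping.
import pdfplumber


def extract_boq_rows(pdf_path: str) -> list[dict]:
    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for cells in table:
                    # Skip header rows, notes, and ruled blank lines.
                    if len(cells) < 6 or not cells[0]:
                        continue
                    rows.append({
                        "item": cells[0],
                        "description": cells[1],
                        "unit": cells[2],
                        "quantity": cells[3],
                        "rate": cells[4],
                        "amount": cells[5],
                        "source_page": page.page_number,
                    })
    return rows
```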
Stage 3: Cleaning. Extracted data is deduplicated (the same BOQ may exist as three versions — first issue, revision A, revision B), inconsistencies are flagged, and quality scores are assigned. Records below the quality threshold are flagged for human review rather than passed to the next stage.
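Exact duplicates are the easy half of deduplication and can be caught with a content hash, as sketched below. Revision chains (first issue, revision A, revision B) need a separate near-duplicate pass, for example on normalized filenames or shingled text:

```python
# Exact-duplicate pass: group files by SHA-256 content hash so
# byte-identical copies scattered across project folders collapse
# to a single record.
import hashlib
from collections import defaultdict
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root: Path) -> dict[str, list[Path]]:
    groups: dict[str, list[Path]] = defaultdict(list)
    for p in root.rglob("*"):
        if p.is_file():
            groups[content_hash(p)].append(p)
    return {h: ps for h, ps in groups.items() if len(ps) > 1}
```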
Stage 4: Annotation. For training data specifically, extracted records need labels. For estimating models, this is rate data from completed projects. For NER models, it is entity labels applied by domain experts (quantity surveyors, site engineers). For classification models, it is category labels applied to inspection findings. This stage depends on domain experts, not ML engineers.
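One way to stretch scarce expert time is to pre-annotate with a term list, so experts correct suggestions instead of labeling from scratch. A sketch, using a hypothetical gazetteer of defect terms:

```python
# Pre-annotation sketch: seed NER spans from a gazetteer of defect terms
# so domain experts review suggestions rather than label from scratch.
# The term list is a hypothetical example, not a real taxonomy.
import re

DEFECT_TERMS = ["honeycombing", "spalling", "cold joint", "segregation"]
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, DEFECT_TERMS)) + r")\b",
                     re.IGNORECASE)


def pre_annotate(sentence: str) -> dict:
    """Return a sentence with suggested DEFECT_TYPE spans for expert review."""
    spans = [{"start": m.start(), "end": m.end(), "label": "DEFECT_TYPE"}
             for m in PATTERN.finditer(sentence)]
    return {"text": sentence, "spans": spans, "status": "needs_review"}


print(pre_annotate("Honeycombing observed at column C3, grid line 4."))
```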
Stage 5: Export. The annotated, cleaned data is exported in the format required by the downstream system: JSONL for fine-tuning language models, chunked text with metadata for RAG indexing, CSV for traditional analytics, or structured records for database ingestion.
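The JSONL write itself is trivial; the useful discipline is validating every record against a minimal schema before it is written, as in this sketch (field names follow the hypothetical BOQ record used earlier):

```python
# Minimal JSONL export with a schema check before each write. Records
# failing validation are counted rather than silently dropped.
import json

REQUIRED_FIELDS = {"description", "unit", "quantity", "rate"}


def export_jsonl(records: list[dict], out_path: str) -> tuple[int, int]:
    written = rejected = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            if not REQUIRED_FIELDS <= rec.keys():
                rejected += 1
                continue
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
            written += 1
    return written, rejected
```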
What a Finished Construction Training Dataset Looks Like
At the end of the pipeline, you should have:
For a RAG knowledge base: 50,000 to 200,000 text chunks, each 300–800 tokens, with metadata fields for document type, project type, date, section heading, and source document. The chunks are clean, correctly ordered, and free of OCR artifacts.
For a fine-tuning dataset (estimating): 30,000 to 100,000 structured records, each representing a BOQ line item with its description, unit, quantity context, and rate — normalized across projects for date and location. In JSONL format, with one record per line.
For a fine-tuning dataset (inspection NER): 5,000 to 20,000 annotated sentences from inspection reports, with entity labels for defect type, location, severity, and remediation action. In JSONL format with token-level span annotations.
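To make those three formats concrete, here are hypothetical single records for each, shown as Python dicts that would serialize one per line into JSONL. Field names and values are illustrative, not a prescribed schema:

```python
# Illustrative single records for each dataset type. Field names and
# values are hypothetical examples, not a prescribed schema.

rag_chunk = {
    "text": "Concrete for ground-floor columns shall achieve a minimum "
            "compressive strength of 40 MPa at 28 days...",
    "doc_type": "specification",
    "project_type": "Type A",
    "date": "2019-03",
    "section": "03300 Cast-in-Place Concrete",
    "source_doc": "PRJ-114_spec_structural.pdf",
}

boq_record = {
    "description": "Reinforced concrete in columns, grade C40/50",
    "unit": "m3",
    "quantity": 182.5,
    "rate": 210.00,  # normalized for date and location
    "project_type": "Type A",
    "year": 2019,
}

ner_example = {
    "text": "Honeycombing observed at column C3, severity moderate.",
    "spans": [
        {"start": 0, "end": 12, "label": "DEFECT_TYPE"},
        {"start": 25, "end": 34, "label": "LOCATION"},
        {"start": 45, "end": 53, "label": "SEVERITY"},
    ],
}
```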
These are not enormous datasets by general AI standards. A GPT-scale model is trained on trillions of tokens. But domain-specific fine-tuning works with much less data than general pre-training. A construction estimating model fine-tuned on 50,000 structured BOQ records from projects of the same type can significantly outperform a general-purpose model on construction-specific tasks.
Expected Timeline and Effort
The timeline depends on archive size, document quality, and the complexity of the use cases being targeted.
For a 700GB archive of mixed construction documents:
- Ingest and classify: 1–2 days of compute time, plus 2–3 days of human review to validate classification quality and fix systematic errors
- Extraction: 5–10 days of compute time (heavily dependent on the proportion of rasterized PDFs requiring OCR)
- Cleaning: 3–5 days of compute time, plus domain expert review of flagged records
- Annotation: this is where the timeline varies most. For BOQ data where rates already exist in the documents, annotation is effectively automatic. For NER annotation of inspection reports, expect 2–4 weeks of domain expert time for a dataset of 10,000 annotated sentences
- Export and validation: 1–2 days
Total elapsed time for a first usable dataset from a 700GB archive: 6–10 weeks, assuming domain expert availability is not a bottleneck.
The bottleneck is almost always the annotation stage. Domain experts — quantity surveyors, site engineers, inspectors — are not available full-time for data labeling. Planning around their availability, and designing annotation interfaces that do not require Python or terminal access, is critical to keeping the timeline reasonable.
What You Can Do Right Now
The first step is not processing. It is inventory.
Before running any tools, understand what you have: how many files, what types, what date range, which projects. A simple inventory — file counts by type, total size by document type, project metadata — tells you where the extraction effort will be concentrated and which use cases are feasible with the data you have.
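That inventory can be a single script. The sketch below reports file counts and total bytes per extension, sorted by size; the archive root is a placeholder:

```python
# First-pass inventory: file counts and total bytes per extension,
# sorted by total size. "/archive" is a placeholder root.
from collections import defaultdict
from pathlib import Path

root = Path("/archive")
stats = defaultdict(lambda: [0, 0])  # extension -> [file_count, total_bytes]

for p in root.rglob("*"):
    if p.is_file():
        entry = stats[p.suffix.lower() or "(none)"]
        entry[0] += 1
        entry[1] += p.stat().st_size

for ext, (count, size) in sorted(stats.items(), key=lambda kv: -kv[1][1]):
    print(f"{ext:12s} {count:8d} files  {size / 1e9:8.2f} GB")
```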
Most construction firms find that their BOQ archive is the most tractable starting point. The data is structured (or nearly structured), the use cases are high-value (estimating), and the annotation burden is lower because rate data is often already present in the documents.
Start there. Process the BOQs, export the JSONL, and test a fine-tuned estimating model on a small sample. That first result, even if imperfect, is more useful for building organizational buy-in than any amount of planning documentation.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
Related Reading
- How to Extract AI Training Data from Engineering Drawings and BOQ Documents — Technical deep dive on drawing and BOQ extraction
- Bill of Quantities Data Extraction: A Guide for Construction AI Projects — Step-by-step BOQ processing guide
- On-Premise AI Data Preparation and Compliance — Why on-premise matters for regulated industries