
    How to Convert Bill of Quantities into AI Training Data

    A technical guide to converting Bills of Quantities (BOQs) from varied formats into structured AI training data — covering table extraction, normalization, labeling, and export.

    Ertas Team

    Bills of Quantities (BOQs) are the backbone of construction cost estimation. They list every material, labor item, and activity in a project, with quantities, unit rates, and amounts. For AI applications in construction — cost forecasting, automated estimation, quantity takeoff verification — BOQs are the primary training data source.

    The problem: BOQs come in wildly inconsistent formats, and converting them into structured training data requires handling format variation, nested hierarchies, and domain-specific terminology. This guide covers the practical pipeline.

    The BOQ Format Problem

    A single construction company might have BOQs in all of these formats:

    • Excel spreadsheets with varying column layouts, merged cells, and color-coded sections
    • PDF exports from estimation software (CostX, Bluebeam, PlanSwift)
    • Scanned paper documents from older projects
    • CSV exports from ERP systems
    • Word documents with manually created tables

    Even within the same format, the structure varies:

    Contractor A's BOQ: | Item No | Description | Unit | Qty | Rate | Amount |

    Contractor B's BOQ: | Ref | Work Item | UOM | Quantity | Unit Price | Total Price | Remarks |

    Contractor C's BOQ: | S/N | Trade | Description of Works | Unit | Estimated Qty | Rate (USD) | Amount (USD) |

    Same information, different column names, different ordering, different granularity. Multiply this across hundreds of projects and the scale of the normalization challenge becomes clear.

    Pipeline Stage 1: Ingestion and Table Extraction

    For Excel/CSV files

    • Parse worksheets, identifying header rows (which aren't always row 1)
    • Handle merged cells (section headers often span multiple columns)
    • Detect and preserve hierarchy (sections → subsections → items → sub-items)
    • Handle multiple BOQ sheets in a single workbook
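    Header rows are the first trap: title rows, logos, and blank spacers often push the real header several rows down. A minimal sketch of keyword-based header detection, operating on rows as lists of cell values (the shape you'd get from e.g. openpyxl's iter_rows; the keyword set is an illustrative assumption, not a standard):

```python
# Illustrative header keywords -- in practice this set is built from the
# column-mapping dictionary used later in the pipeline.
HEADER_KEYWORDS = {"description", "unit", "qty", "quantity", "rate", "amount"}

def find_header_row(rows):
    """Return the index of the first row matching enough header keywords."""
    for i, row in enumerate(rows):
        cells = {str(c).strip().lower() for c in row if c is not None}
        # Require at least two known column names before calling it a header
        if len(cells & HEADER_KEYWORDS) >= 2:
            return i
    return None

sheet = [
    ["Project X - Bill of Quantities", None, None, None, None, None],  # title
    [None, None, None, None, None, None],                              # spacer
    ["Item No", "Description", "Unit", "Qty", "Rate", "Amount"],
    ["1.1", "Excavation", "m3", 120, 4.5, 540.0],
]
print(find_header_row(sheet))  # header is on index 2, not the first row
```

The two-keyword threshold is deliberately loose; tightening it trades false positives for missed headers, and any sheet where no row qualifies should be routed to manual review.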

    For PDF files

    • Table detection using layout analysis (identifying grid structures, aligned columns)
    • Cell extraction with handling for multi-line cell content
    • Header identification (distinguishing column headers from data rows)
    • Page continuation detection (tables that span multiple pages)
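    PDF table extractors typically return one table fragment per page, with the column header often repeated at the top of each continuation. A sketch of stitching fragments back together, assuming the header row is already known from the first page:

```python
def merge_continued_tables(page_tables, header):
    """Join per-page table fragments into one table, stripping repeated
    column headers that appear at the top of continuation pages."""
    rows = []
    for frag in page_tables:
        body = frag[1:] if frag and frag[0] == header else frag
        rows.extend(body)
    return rows

header = ["Item No", "Description", "Unit", "Qty"]
page1 = [header, ["1.1", "Excavation", "m3", 120]]
page2 = [header, ["1.2", "Backfill", "m3", 80]]   # header repeats on page 2
merged = merge_continued_tables([page1, page2], header)
# merged now holds the two data rows with the repeated header removed
```

Real continuations are messier (a row can split across the page break), so a production version would also check whether a fragment's first row completes the previous fragment's last row.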

    For scanned documents

    • OCR with table-aware processing
    • Line detection for table grid identification
    • Character confidence scoring (flagging low-confidence extractions for review)
    • Handling handwritten annotations alongside printed text

    Pipeline Stage 2: Normalization

    Once tables are extracted, the raw data needs normalization:

    Column Mapping

    Map varied column names to a standard schema:

    • "Description" / "Work Item" / "Description of Works" → description
    • "Unit" / "UOM" / "U/M" → unit
    • "Qty" / "Quantity" / "Estimated Qty" → quantity
    • "Rate" / "Unit Price" / "Unit Rate" → unit_rate
    • "Amount" / "Total Price" / "Total" → amount
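    The mapping above is naturally expressed as an alias dictionary. A minimal sketch (the alias sets here cover only the examples in this post; a real dictionary grows with every new contractor format):

```python
COLUMN_ALIASES = {
    "description": {"description", "work item", "description of works"},
    "unit": {"unit", "uom", "u/m"},
    "quantity": {"qty", "quantity", "estimated qty"},
    "unit_rate": {"rate", "unit price", "unit rate", "rate (usd)"},
    "amount": {"amount", "total price", "total", "amount (usd)"},
}

def map_columns(headers):
    """Map raw header strings to the standard schema; None = unmapped."""
    mapping = {}
    for h in headers:
        key = h.strip().lower()
        for canonical, aliases in COLUMN_ALIASES.items():
            if key in aliases:
                mapping[h] = canonical
                break
        else:
            mapping[h] = None  # unmapped column: flag for manual review
    return mapping

mapping = map_columns(["Ref", "Work Item", "UOM", "Quantity", "Unit Price"])
# "Ref" has no alias yet, so it maps to None and gets queued for review
```

Keeping unmapped columns as explicit None values, rather than dropping them, is what surfaces new format variants for a human to resolve.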

    Unit Standardization

    Construction uses numerous unit abbreviations inconsistently:

    • "m3" / "cu.m" / "CUM" / "cubic meter" → m3
    • "sqm" / "sq.m" / "SQM" / "m2" → m2
    • "nr" / "no" / "nos" / "each" / "ea" → nr
    • "rm" / "r.m" / "running meter" / "lm" → rm
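    Unit normalization follows the same lookup pattern, inverted for convenience so each canonical unit lists its variants once (the variant lists here match the bullets above; trailing periods and case are stripped before lookup):

```python
# Build variant -> canonical lookup from a canonical -> variants table
UNIT_MAP = {}
for canonical, variants in {
    "m3": ["m3", "cu.m", "cum", "cubic meter"],
    "m2": ["sqm", "sq.m", "m2"],
    "nr": ["nr", "no", "nos", "each", "ea"],
    "rm": ["rm", "r.m", "running meter", "lm"],
}.items():
    for v in variants:
        UNIT_MAP[v] = canonical

def normalize_unit(raw):
    """Normalize case and trailing periods, then look up; unknown units
    pass through unchanged so they can be flagged downstream."""
    return UNIT_MAP.get(raw.strip().lower().rstrip("."), raw)
```

Passing unknown units through unchanged (rather than guessing) keeps the decision with a domain expert, which matters later in the pipeline.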

    Hierarchy Reconstruction

    BOQ items are hierarchical, but the hierarchy is often implicit:

    • Section numbers (1.0, 1.1, 1.1.1) encode parent-child relationships
    • Indentation levels indicate hierarchy in some formats
    • Bold/font-size formatting distinguishes sections from items
    • "Total" and "Sub-total" rows indicate hierarchy boundaries

    Reconstructing this hierarchy is essential — it provides context for each item. "Concrete" under "Foundations" is different from "Concrete" under "Superstructure."
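    When section numbers are present, they are the most reliable hierarchy signal: "1.1.3" is a child of "1.1", which is a child of "1". A sketch of recovering parent-child links from codes alone (indentation- and formatting-based signals would need separate handling):

```python
def parent_code(code):
    """'1.1.3' -> '1.1'; top-level codes have no parent."""
    parts = code.split(".")
    return ".".join(parts[:-1]) if len(parts) > 1 else None

def build_tree(items):
    """items: list of (code, description) tuples.
    Returns a child-code -> parent-code mapping."""
    return {code: parent_code(code) for code, _ in items}

tree = build_tree([("1", "Foundations"), ("1.1", "Excavation"),
                   ("1.1.3", "Excavation in rock")])
# tree records that 1.1.3 sits under 1.1, which sits under 1
```

With this mapping in hand, each item's full context path ("Foundations / Excavation / Excavation in rock") can be assembled by walking parents upward, which is exactly the context that distinguishes "Concrete" under Foundations from "Concrete" under Superstructure.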

    Numeric Handling

    • Remove thousand separators (which vary by locale: commas, periods, spaces)
    • Parse currency symbols and standardize
    • Handle calculated fields (Amount = Qty × Rate) and flag inconsistencies
    • Convert between measurement systems where needed
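    The first two bullets can be sketched as a locale-aware number parser, plus a tolerance check for the Amount = Qty × Rate invariant. The decimal separator must be known per source file, since "1.234,50" and "1,234.50" are both valid:

```python
import re

def parse_number(raw, decimal_sep="."):
    """Strip currency symbols and thousand separators. decimal_sep encodes
    an assumption about the source file's locale."""
    s = re.sub(r"[^\d.,\-]", "", str(raw))  # drop currency symbols, spaces
    if decimal_sep == ",":
        s = s.replace(".", "").replace(",", ".")  # European-style numbers
    else:
        s = s.replace(",", "")                    # Anglophone-style numbers
    return float(s)

def amount_consistent(qty, rate, amount, tol=0.01):
    """Flag rows where Amount deviates from Qty * Rate beyond a relative
    tolerance (rounding in the source spreadsheet is common)."""
    return abs(qty * rate - amount) <= tol * max(abs(amount), 1.0)
```

Rows failing the consistency check shouldn't be auto-corrected: sometimes the amount is right and the quantity was retyped wrong, and only a reviewer can tell which.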

    Pipeline Stage 3: Labeling

    With normalized data, domain experts label the records:

    Trade Classification

    Each BOQ item maps to a construction trade:

    • Civil/structural, mechanical, electrical, plumbing, HVAC, finishing, landscaping, etc.
    • This classification enables trade-specific cost models

    Material vs. Labor vs. Equipment

    BOQ items often bundle these, but AI models benefit from the distinction:

    • "Supply and fix structural steel" → material + labor
    • "Crane hire for steel erection" → equipment
    • "Reinforcement steel Grade 60" → material
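    A keyword heuristic can pre-suggest these labels to speed up expert review, though it is emphatically not a substitute for domain judgment. A hypothetical sketch (the keyword lists are illustrative assumptions):

```python
# Hypothetical keyword -> label suggestions, checked in priority order.
# Output is a *suggestion* for a domain expert to confirm or override.
KEYWORDS = {
    "equipment": ("crane", "hire", "scaffold"),
    "material_labor": ("supply and fix", "supply and install"),
    "material": ("supply only", "grade"),
}

def suggest_resource_label(description):
    d = description.lower()
    for label, words in KEYWORDS.items():
        if any(w in d for w in words):
            return label
    return "unreviewed"  # no match: goes straight to the expert queue
```

Pre-labeling like this only reduces clicks; every suggestion still passes through the expert workflow described below.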

    Standardized Item Coding

    Mapping to standard classification systems where applicable:

    • UniFormat (for building elements)
    • MasterFormat (for work results)
    • Company-specific coding systems

    Quality Flags

    • Completeness (does the item have all required fields?)
    • Consistency (does Amount = Qty × Rate?)
    • Reasonableness (is the rate within expected ranges for this item type?)
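    These three checks combine naturally into a per-record flagging function. A sketch, where the rate range is a hypothetical placeholder (real pipelines would use per-trade, per-region ranges maintained by estimators):

```python
REQUIRED = ("description", "unit", "quantity", "unit_rate", "amount")

def quality_flags(item, rate_range=(0.5, 5000.0)):
    """Return a list of quality flags for one normalized BOQ record.
    rate_range is illustrative; real ranges are trade- and region-specific."""
    flags = []
    if any(item.get(f) in (None, "") for f in REQUIRED):
        flags.append("incomplete")
        return flags  # numeric checks are meaningless on incomplete rows
    expected = item["quantity"] * item["unit_rate"]
    if abs(expected - item["amount"]) > 0.01 * max(abs(item["amount"]), 1.0):
        flags.append("inconsistent_amount")
    if not rate_range[0] <= item["unit_rate"] <= rate_range[1]:
        flags.append("rate_out_of_range")
    return flags

row = {"description": "Excavation", "unit": "m3",
       "quantity": 120, "unit_rate": 4.5, "amount": 540.0}
# a clean row produces no flags; flagged rows go to expert review
```

Flags feed the review queue rather than dropping records: a flagged row may still be valid training data once an expert explains the anomaly.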

    Pipeline Stage 4: Export

    The labeled, normalized BOQ data exports to different formats depending on the downstream AI use case:

    For cost estimation models (JSONL):

    {"description": "Supply and fix reinforcement steel...", "trade": "structural", "unit": "kg", "rate_usd_per_unit": 1.85, "context": "foundations/piling"}
    

    For document classification (JSONL):

    {"text": "1.1.3 Excavation in rock...", "label": "civil_earthworks"}
    

    For RAG knowledge bases (chunked text): Structured chunks with trade/section metadata for retrieval-augmented generation.
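    Whatever the target, the JSONL exports above share one serializer: one JSON object per line. A minimal sketch:

```python
import json

def to_jsonl(records):
    """Serialize labeled BOQ records to JSONL: one JSON object per line.
    ensure_ascii=False preserves non-ASCII characters in descriptions."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

records = [
    {"text": "1.1.3 Excavation in rock...", "label": "civil_earthworks"},
    {"text": "Cable tray 300mm wide...", "label": "electrical"},
]
jsonl = to_jsonl(records)
# write with: open(path, "w", encoding="utf-8").write(jsonl + "\n")
```

Keeping the serializer dumb and pushing all schema decisions upstream (into normalization and labeling) means the same records can be re-exported for a new use case without reprocessing the source files.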

    The Domain Expert Requirement

    This pipeline can't be run by ML engineers alone. The normalization rules, trade classifications, and quality judgments require construction domain knowledge:

    • Is "concrete class C30" the same as "30 MPa concrete"? (Yes, but only a structural engineer would know.)
    • Should "provisional sum for unforeseen ground conditions" be included in training data? (Depends on the model's purpose.)
    • Is a rate of $500/m³ for concrete reasonable? (Depends on the region, project type, and year.)

    This is why the data preparation tool needs to be accessible to quantity surveyors and project managers — not locked behind Python scripts and CLI interfaces. Platforms like Ertas Data Suite put domain experts directly in the labeling workflow, which is where their knowledge has the most impact on training data quality.

    Getting Started

    If you're sitting on a collection of BOQs and want to build AI training data:

    1. Start with digital-native files (Excel/CSV) — they're easier to process than scanned PDFs
    2. Define your target schema before you start processing
    3. Involve a quantity surveyor in the labeling schema design
    4. Begin with a single project type to establish the pipeline, then scale
    5. Expect iteration — the first pass will reveal format variations you didn't anticipate

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
