
How to Convert Bills of Quantities into AI Training Data
A technical guide to converting Bills of Quantities (BOQs) from varied formats into structured AI training data — covering table extraction, normalization, labeling, and export.
Bills of Quantities (BOQs) are the backbone of construction cost estimation. They list every material, labor item, and activity in a project, with quantities, unit rates, and amounts. For AI applications in construction — cost forecasting, automated estimation, quantity takeoff verification — BOQs are the primary training data source.
The problem: BOQs come in wildly inconsistent formats, and converting them into structured training data requires handling format variation, nested hierarchies, and domain-specific terminology. This guide covers the practical pipeline.
The BOQ Format Problem
A single construction company might have BOQs in all of these formats:
- Excel spreadsheets with varying column layouts, merged cells, and color-coded sections
- PDF exports from estimation software (CostX, Bluebeam, PlanSwift)
- Scanned paper documents from older projects
- CSV exports from ERP systems
- Word documents with manually created tables
Even within the same format, the structure varies:
Contractor A's BOQ: | Item No | Description | Unit | Qty | Rate | Amount |
Contractor B's BOQ: | Ref | Work Item | UOM | Quantity | Unit Price | Total Price | Remarks |
Contractor C's BOQ: | S/N | Trade | Description of Works | Unit | Estimated Qty | Rate (USD) | Amount (USD) |
Same information, different column names, different ordering, different granularity. Multiply this across hundreds of projects and the scale of the normalization challenge becomes clear.
Pipeline Stage 1: Ingestion and Table Extraction
For Excel/CSV files
- Parse worksheets, identifying header rows, which aren't always row 1 (see the sketch after this list)
- Handle merged cells (section headers often span multiple columns)
- Detect and preserve hierarchy (sections → subsections → items → sub-items)
- Handle multiple BOQ sheets in a single workbook
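As a minimal sketch of the header-row step, assuming openpyxl: the keyword set, the two-match threshold, and the boq.xlsx filename are illustrative assumptions, not fixed rules.

import openpyxl

# Illustrative keyword set; a real pipeline grows a per-contractor synonym table
HEADER_KEYWORDS = {"description", "unit", "qty", "quantity", "rate", "amount"}

def find_header_row(sheet, max_scan=20):
    """Return the 1-based index of the first row that looks like a BOQ header."""
    for row in sheet.iter_rows(min_row=1, max_row=max_scan):
        cells = [str(c.value).strip().lower() for c in row if c.value is not None]
        # Require at least two keyword hits so a lone "Description" note doesn't match
        hits = sum(1 for cell in cells if any(k in cell for k in HEADER_KEYWORDS))
        if hits >= 2:
            return row[0].row
    return None

wb = openpyxl.load_workbook("boq.xlsx", data_only=True)  # data_only=True reads formula results
print(find_header_row(wb.active))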
For PDF files
- Table detection using layout analysis (identifying grid structures, aligned columns), as sketched after this list
- Cell extraction with handling for multi-line cell content
- Header identification (distinguishing column headers from data rows)
- Page continuation detection (tables that span multiple pages)
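A minimal sketch assuming pdfplumber, which handles the grid-detection step. Page continuation is handled naively here (rows are simply concatenated across pages); a production version would compare column counts between fragments before merging.

import pdfplumber

def extract_boq_rows(path):
    """Collect table rows across all pages of a PDF BOQ."""
    rows = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    # Drop rows that are entirely empty (ruling-line artifacts)
                    if any(cell and cell.strip() for cell in row):
                        rows.append([(cell or "").strip() for cell in row])
    return rows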
For scanned documents
- OCR with table-aware processing
- Line detection for table grid identification
- Character confidence scoring (flagging low-confidence extractions for review), as sketched below
- Handling handwritten annotations alongside printed text
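For the confidence-scoring bullet, a sketch assuming pytesseract; the threshold of 60 is an illustrative cutoff, not a recommendation.

import pytesseract
from PIL import Image

def low_confidence_words(image_path, threshold=60):
    """Return OCR words whose confidence falls below the review threshold."""
    data = pytesseract.image_to_data(Image.open(image_path),
                                     output_type=pytesseract.Output.DICT)
    flagged = []
    for text, conf in zip(data["text"], data["conf"]):
        c = float(conf)  # conf is -1 for non-word boxes
        if text.strip() and 0 <= c < threshold:
            flagged.append((text, c))
    return flagged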
Pipeline Stage 2: Normalization
Once tables are extracted, the raw data needs normalization:
Column Mapping
Map varied column names to a standard schema:
- "Description" / "Work Item" / "Description of Works" →
description - "Unit" / "UOM" / "U/M" →
unit - "Qty" / "Quantity" / "Estimated Qty" →
quantity - "Rate" / "Unit Price" / "Unit Rate" →
unit_rate - "Amount" / "Total Price" / "Total" →
amount
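A lookup sketch for this mapping. The synonym table is seeded from the examples above and would grow as new contractor formats appear; unmatched headers should be routed to manual review rather than guessed.

COLUMN_SYNONYMS = {
    "description": {"description", "work item", "description of works"},
    "unit": {"unit", "uom", "u/m"},
    "quantity": {"qty", "quantity", "estimated qty"},
    "unit_rate": {"rate", "unit price", "unit rate", "rate (usd)"},
    "amount": {"amount", "total price", "total", "amount (usd)"},
}

def map_column(raw_header):
    """Map a raw column header to the standard schema; None means unknown."""
    key = raw_header.strip().lower()
    for canonical, synonyms in COLUMN_SYNONYMS.items():
        if key in synonyms:
            return canonical
    return None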
Unit Standardization
Construction documents use unit abbreviations inconsistently:
- "m3" / "cu.m" / "CUM" / "cubic meter" →
m³ - "sqm" / "sq.m" / "SQM" / "m2" →
m² - "nr" / "no" / "nos" / "each" / "ea" →
nr - "rm" / "r.m" / "running meter" / "lm" →
rm
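The same lookup pattern works for units, with one addition: strip case and punctuation before the lookup so "cu.m", "CUM", and "Cu. M" all hit the same key. A sketch:

import re

UNIT_MAP = {
    "m3": "m³", "cum": "m³", "cubicmeter": "m³",
    "m2": "m²", "sqm": "m²",
    "nr": "nr", "no": "nr", "nos": "nr", "each": "nr", "ea": "nr",
    "rm": "rm", "runningmeter": "rm", "lm": "rm",
}

def normalize_unit(raw):
    """Normalize a raw unit string; unknowns pass through for review."""
    key = re.sub(r"[^a-z0-9]", "", raw.lower())  # "cu.m" -> "cum", "Sq. M" -> "sqm"
    return UNIT_MAP.get(key, raw)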
Hierarchy Reconstruction
BOQ items are hierarchical, but the hierarchy is often implicit:
- Section numbers (1.0, 1.1, 1.1.1) encode parent-child relationships
- Indentation levels indicate hierarchy in some formats
- Bold/font-size formatting distinguishes sections from items
- "Total" and "Sub-total" rows indicate hierarchy boundaries
Reconstructing this hierarchy is essential — it provides context for each item. "Concrete" under "Foundations" is different from "Concrete" under "Superstructure."
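When the hierarchy is encoded in dotted section numbers (the first pattern above), it can be reconstructed mechanically; indentation- and formatting-based hierarchies need format-specific handling. A minimal sketch:

def parent_ref(item_no):
    """'1.1.3' -> '1.1'; returns None for top-level items like '1'."""
    parts = item_no.rstrip(".").split(".")
    return ".".join(parts[:-1]) or None

def attach_parents(items):
    """items: dicts with an 'item_no' key; adds a 'parent' key in place."""
    refs = {i["item_no"] for i in items}
    for i in items:
        p = parent_ref(i["item_no"])
        i["parent"] = p if p in refs else None  # tolerate gaps in numbering
    return items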
Numeric Handling
- Remove thousand separators (which vary by locale: commas, periods, spaces)
- Parse currency symbols and standardize
- Handle calculated fields (Amount = Qty × Rate) and flag inconsistencies (see the sketch below)
- Convert between measurement systems where needed
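A sketch of locale-aware number parsing and the consistency check, assuming the decimal separator is known per document (detecting it automatically is its own problem):

import re

def parse_number(raw, decimal_sep="."):
    """Strip currency symbols and thousand separators, then parse to float."""
    s = re.sub(r"[^\d.,-]", "", str(raw))  # drop "$", "USD", spaces, etc.
    if decimal_sep == ",":
        s = s.replace(".", "").replace(",", ".")  # 1.234,50 -> 1234.50
    else:
        s = s.replace(",", "")  # 1,234.50 -> 1234.50
    return float(s) if s else None

def amount_consistent(qty, rate, amount, tol=0.01):
    """True if Amount is within tol (relative) of Qty x Rate; None if uncheckable."""
    if None in (qty, rate, amount) or amount == 0:
        return None
    return abs(qty * rate - amount) <= tol * abs(amount)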
Pipeline Stage 3: Labeling
With normalized data, domain experts label the records:
Trade Classification
Each BOQ item maps to a construction trade:
- Civil/structural, mechanical, electrical, plumbing, HVAC, finishing, landscaping, etc.
- This classification enables trade-specific cost models
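Keyword rules can pre-label items before expert review. The rule table below is purely illustrative; anything unmatched or ambiguous should go to a domain expert rather than be guessed.

TRADE_RULES = [
    ("electrical", ["cable", "conduit", "switchgear", "lighting"]),
    ("plumbing", ["drainage", "sanitary", "soil pipe", "water supply"]),
    ("civil_structural", ["concrete", "reinforcement", "formwork", "excavation"]),
]

def suggest_trade(description):
    """First-match keyword pre-label; None routes the item to expert review."""
    text = description.lower()
    for trade, keywords in TRADE_RULES:
        if any(k in text for k in keywords):
            return trade
    return None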
Material vs. Labor vs. Equipment
BOQ items often bundle these, but AI models benefit from the distinction:
- "Supply and fix structural steel" → material + labor
- "Crane hire for steel erection" → equipment
- "Reinforcement steel Grade 60" → material
Standardized Item Coding
Mapping to standard classification systems where applicable:
- UniFormat (for building elements)
- MasterFormat (for work results)
- Company-specific coding systems
Quality Flags
- Completeness (does the item have all required fields?)
- Consistency (does Amount = Qty × Rate?)
- Reasonableness (is the rate within expected ranges for this item type?)
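These three checks map directly to a flagging function. A sketch; the rate band here is a placeholder, since reasonable ranges vary by item type, region, and year:

def quality_flags(item, rate_range=(50, 300), tol=0.01):
    """Return quality flags for one normalized BOQ item (a dict in the standard schema)."""
    flags = []
    required = ("description", "unit", "quantity", "unit_rate", "amount")
    if any(item.get(field) in (None, "") for field in required):
        flags.append("incomplete")
    q, r, a = item.get("quantity"), item.get("unit_rate"), item.get("amount")
    if None not in (q, r, a) and a and abs(q * r - a) > tol * abs(a):
        flags.append("inconsistent_amount")
    # Placeholder band; real checks need per-item-type ranges by region and year
    if r is not None and not (rate_range[0] <= r <= rate_range[1]):
        flags.append("rate_out_of_range")
    return flags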
Pipeline Stage 4: Export
The labeled, normalized BOQ data is exported in different formats depending on the downstream AI use case:
For cost estimation models (JSONL):
{"description": "Supply and fix reinforcement steel...", "trade": "structural", "unit": "kg", "rate_usd_per_unit": 1.85, "context": "foundations/piling"}
For document classification (JSONL):
{"text": "1.1.3 Excavation in rock...", "label": "civil_earthworks"}
For RAG knowledge bases (chunked text): Structured chunks with trade/section metadata for retrieval-augmented generation.
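Whatever the target, the export step itself is small. A sketch of the JSONL case, assuming items are dicts already matching the schema above:

import json

def export_jsonl(items, path):
    """Write one JSON object per line, the JSONL convention shown above."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            # ensure_ascii=False keeps units like m³ human-readable in the file
            f.write(json.dumps(item, ensure_ascii=False) + "\n")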
The Domain Expert Requirement
This pipeline can't be run by ML engineers alone. The normalization rules, trade classifications, and quality judgments require construction domain knowledge:
- Is "concrete class C30" the same as "30 MPa concrete"? (Yes, but only a structural engineer would know.)
- Should "provisional sum for unforeseen ground conditions" be included in training data? (Depends on the model's purpose.)
- Is a rate of $500/m³ for concrete reasonable? (Depends on the region, project type, and year.)
This is why the data preparation tool needs to be accessible to quantity surveyors and project managers — not locked behind Python scripts and CLI interfaces. Platforms like Ertas Data Suite put domain experts directly in the labeling workflow, which is where their knowledge has the most impact on training data quality.
Getting Started
If you're sitting on a collection of BOQs and want to build AI training data:
- Start with digital-native files (Excel/CSV) — they're easier to process than scanned PDFs
- Define your target schema before you start processing
- Involve a quantity surveyor in the labeling schema design
- Begin with a single project type to establish the pipeline, then scale
- Expect iteration — the first pass will reveal format variations you didn't anticipate