
    How to Convert Bill of Quantities into AI Training Data

    A technical guide to converting Bills of Quantities (BOQs) from varied formats into structured AI training data — covering table extraction, normalization, labeling, and export.

    Ertas Team

    Bills of Quantities (BOQs) are the backbone of construction cost estimation. They list every material, labor item, and activity in a project, with quantities, unit rates, and amounts. For AI applications in construction — cost forecasting, automated estimation, quantity takeoff verification — BOQs are the primary training data source.

    The problem: BOQs come in wildly inconsistent formats, and converting them into structured training data requires handling format variation, nested hierarchies, and domain-specific terminology. This guide covers the practical pipeline.

    The BOQ Format Problem

    A single construction company might have BOQs in all of these formats:

    • Excel spreadsheets with varying column layouts, merged cells, and color-coded sections
    • PDF exports from estimation software (CostX, Bluebeam, PlanSwift)
    • Scanned paper documents from older projects
    • CSV exports from ERP systems
    • Word documents with manually created tables

    Even within the same format, the structure varies:

    Contractor A's BOQ: | Item No | Description | Unit | Qty | Rate | Amount |

    Contractor B's BOQ: | Ref | Work Item | UOM | Quantity | Unit Price | Total Price | Remarks |

    Contractor C's BOQ: | S/N | Trade | Description of Works | Unit | Estimated Qty | Rate (USD) | Amount (USD) |

    Same information, different column names, different ordering, different granularity. Multiply this across hundreds of projects and the scale of the normalization challenge becomes clear.

    Pipeline Stage 1: Ingestion and Table Extraction

    For Excel/CSV files

    • Parse worksheets, identifying header rows (which aren't always row 1)
    • Handle merged cells (section headers often span multiple columns)
    • Detect and preserve hierarchy (sections → subsections → items → sub-items)
    • Handle multiple BOQ sheets in a single workbook
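    Header rows are the first trap: title rows, logos, and blank spacers often push the real header several rows down. A minimal sketch of keyword-based header detection, operating on rows as lists of cell values (the shape you'd get from e.g. openpyxl's iter_rows; the keyword set is an illustrative assumption, not a standard):

```python
# Illustrative header keywords -- in practice this set is built from the
# column-mapping dictionary used later in the pipeline.
HEADER_KEYWORDS = {"description", "unit", "qty", "quantity", "rate", "amount"}

def find_header_row(rows):
    """Return the index of the first row matching enough header keywords."""
    for i, row in enumerate(rows):
        cells = {str(c).strip().lower() for c in row if c is not None}
        # Require at least two known column names before calling it a header
        if len(cells & HEADER_KEYWORDS) >= 2:
            return i
    return None

sheet = [
    ["Project X - Bill of Quantities", None, None, None, None, None],  # title
    [None, None, None, None, None, None],                              # spacer
    ["Item No", "Description", "Unit", "Qty", "Rate", "Amount"],
    ["1.1", "Excavation", "m3", 120, 4.5, 540.0],
]
print(find_header_row(sheet))  # header is on index 2, not the first row
```

The two-keyword threshold is deliberately loose; tightening it trades false positives for missed headers, and any sheet where no row qualifies should be routed to manual review.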

    For PDF files

    • Table detection using layout analysis (identifying grid structures, aligned columns)
    • Cell extraction with handling for multi-line cell content
    • Header identification (distinguishing column headers from data rows)
    • Page continuation detection (tables that span multiple pages)
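    PDF table extractors typically return one table fragment per page, with the column header often repeated at the top of each continuation. A sketch of stitching fragments back together, assuming the header row is already known from the first page:

```python
def merge_continued_tables(page_tables, header):
    """Join per-page table fragments into one table, stripping repeated
    column headers that appear at the top of continuation pages."""
    rows = []
    for frag in page_tables:
        body = frag[1:] if frag and frag[0] == header else frag
        rows.extend(body)
    return rows

header = ["Item No", "Description", "Unit", "Qty"]
page1 = [header, ["1.1", "Excavation", "m3", 120]]
page2 = [header, ["1.2", "Backfill", "m3", 80]]   # header repeats on page 2
merged = merge_continued_tables([page1, page2], header)
# merged now holds the two data rows with the repeated header removed
```

Real continuations are messier (a row can split across the page break), so a production version would also check whether a fragment's first row completes the previous fragment's last row.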

    For scanned documents

    • OCR with table-aware processing
    • Line detection for table grid identification
    • Character confidence scoring (flagging low-confidence extractions for review)
    • Handling handwritten annotations alongside printed text

    Pipeline Stage 2: Normalization

    Once tables are extracted, the raw data needs normalization:

    Column Mapping

    Map varied column names to a standard schema:

    • "Description" / "Work Item" / "Description of Works" → description
    • "Unit" / "UOM" / "U/M" → unit
    • "Qty" / "Quantity" / "Estimated Qty" → quantity
    • "Rate" / "Unit Price" / "Unit Rate" → unit_rate
    • "Amount" / "Total Price" / "Total" → amount
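    The mapping above is naturally expressed as an alias dictionary. A minimal sketch (the alias sets here cover only the examples in this post; a real dictionary grows with every new contractor format):

```python
COLUMN_ALIASES = {
    "description": {"description", "work item", "description of works"},
    "unit": {"unit", "uom", "u/m"},
    "quantity": {"qty", "quantity", "estimated qty"},
    "unit_rate": {"rate", "unit price", "unit rate", "rate (usd)"},
    "amount": {"amount", "total price", "total", "amount (usd)"},
}

def map_columns(headers):
    """Map raw header strings to the standard schema; None = unmapped."""
    mapping = {}
    for h in headers:
        key = h.strip().lower()
        for canonical, aliases in COLUMN_ALIASES.items():
            if key in aliases:
                mapping[h] = canonical
                break
        else:
            mapping[h] = None  # unmapped column: flag for manual review
    return mapping

mapping = map_columns(["Ref", "Work Item", "UOM", "Quantity", "Unit Price"])
# "Ref" has no alias yet, so it maps to None and gets queued for review
```

Keeping unmapped columns as explicit None values, rather than dropping them, is what surfaces new format variants for a human to resolve.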

    Unit Standardization

    Construction uses numerous unit abbreviations inconsistently:

    • "m3" / "cu.m" / "CUM" / "cubic meter" → m3
    • "sqm" / "sq.m" / "SQM" / "m2" → m2
    • "nr" / "no" / "nos" / "each" / "ea" → nr
    • "rm" / "r.m" / "running meter" / "lm" → rm
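    Unit normalization follows the same lookup pattern, inverted for convenience so each canonical unit lists its variants once (the variant lists here match the bullets above; trailing periods and case are stripped before lookup):

```python
# Build variant -> canonical lookup from a canonical -> variants table
UNIT_MAP = {}
for canonical, variants in {
    "m3": ["m3", "cu.m", "cum", "cubic meter"],
    "m2": ["sqm", "sq.m", "m2"],
    "nr": ["nr", "no", "nos", "each", "ea"],
    "rm": ["rm", "r.m", "running meter", "lm"],
}.items():
    for v in variants:
        UNIT_MAP[v] = canonical

def normalize_unit(raw):
    """Normalize case and trailing periods, then look up; unknown units
    pass through unchanged so they can be flagged downstream."""
    return UNIT_MAP.get(raw.strip().lower().rstrip("."), raw)
```

Passing unknown units through unchanged (rather than guessing) keeps the decision with a domain expert, which matters later in the pipeline.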

    Hierarchy Reconstruction

    BOQ items are hierarchical, but the hierarchy is often implicit:

    • Section numbers (1.0, 1.1, 1.1.1) encode parent-child relationships
    • Indentation levels indicate hierarchy in some formats
    • Bold/font-size formatting distinguishes sections from items
    • "Total" and "Sub-total" rows indicate hierarchy boundaries

    Reconstructing this hierarchy is essential — it provides context for each item. "Concrete" under "Foundations" is different from "Concrete" under "Superstructure."
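    When section numbers are present, they are the most reliable hierarchy signal: "1.1.3" is a child of "1.1", which is a child of "1". A sketch of recovering parent-child links from codes alone (indentation- and formatting-based signals would need separate handling):

```python
def parent_code(code):
    """'1.1.3' -> '1.1'; top-level codes have no parent."""
    parts = code.split(".")
    return ".".join(parts[:-1]) if len(parts) > 1 else None

def build_tree(items):
    """items: list of (code, description) tuples.
    Returns a child-code -> parent-code mapping."""
    return {code: parent_code(code) for code, _ in items}

tree = build_tree([("1", "Foundations"), ("1.1", "Excavation"),
                   ("1.1.3", "Excavation in rock")])
# tree records that 1.1.3 sits under 1.1, which sits under 1
```

With this mapping in hand, each item's full context path ("Foundations / Excavation / Excavation in rock") can be assembled by walking parents upward, which is exactly the context that distinguishes "Concrete" under Foundations from "Concrete" under Superstructure.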

    Numeric Handling

    • Remove thousand separators (which vary by locale: commas, periods, spaces)
    • Parse currency symbols and standardize
    • Handle calculated fields (Amount = Qty × Rate) and flag inconsistencies
    • Convert between measurement systems where needed
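    The first two bullets can be sketched as a locale-aware number parser, plus a tolerance check for the Amount = Qty × Rate invariant. The decimal separator must be known per source file, since "1.234,50" and "1,234.50" are both valid:

```python
import re

def parse_number(raw, decimal_sep="."):
    """Strip currency symbols and thousand separators. decimal_sep encodes
    an assumption about the source file's locale."""
    s = re.sub(r"[^\d.,\-]", "", str(raw))  # drop currency symbols, spaces
    if decimal_sep == ",":
        s = s.replace(".", "").replace(",", ".")  # European-style numbers
    else:
        s = s.replace(",", "")                    # Anglophone-style numbers
    return float(s)

def amount_consistent(qty, rate, amount, tol=0.01):
    """Flag rows where Amount deviates from Qty * Rate beyond a relative
    tolerance (rounding in the source spreadsheet is common)."""
    return abs(qty * rate - amount) <= tol * max(abs(amount), 1.0)
```

Rows failing the consistency check shouldn't be auto-corrected: sometimes the amount is right and the quantity was retyped wrong, and only a reviewer can tell which.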

    Pipeline Stage 3: Labeling

    With normalized data, domain experts label the records:

    Trade Classification

    Each BOQ item maps to a construction trade:

    • Civil/structural, mechanical, electrical, plumbing, HVAC, finishing, landscaping, etc.
    • This classification enables trade-specific cost models

    Material vs. Labor vs. Equipment

    BOQ items often bundle these, but AI models benefit from the distinction:

    • "Supply and fix structural steel" → material + labor
    • "Crane hire for steel erection" → equipment
    • "Reinforcement steel Grade 60" → material
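    A keyword heuristic can pre-suggest these labels to speed up expert review, though it is emphatically not a substitute for domain judgment. A hypothetical sketch (the keyword lists are illustrative assumptions):

```python
# Hypothetical keyword -> label suggestions, checked in priority order.
# Output is a *suggestion* for a domain expert to confirm or override.
KEYWORDS = {
    "equipment": ("crane", "hire", "scaffold"),
    "material_labor": ("supply and fix", "supply and install"),
    "material": ("supply only", "grade"),
}

def suggest_resource_label(description):
    d = description.lower()
    for label, words in KEYWORDS.items():
        if any(w in d for w in words):
            return label
    return "unreviewed"  # no match: goes straight to the expert queue
```

Pre-labeling like this only reduces clicks; every suggestion still passes through the expert workflow described below.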

    Standardized Item Coding

    Mapping to standard classification systems where applicable:

    • UniFormat (for building elements)
    • MasterFormat (for work results)
    • Company-specific coding systems

    Quality Flags

    • Completeness (does the item have all required fields?)
    • Consistency (does Amount = Qty × Rate?)
    • Reasonableness (is the rate within expected ranges for this item type?)
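    These three checks combine naturally into a per-record flagging function. A sketch, where the rate range is a hypothetical placeholder (real pipelines would use per-trade, per-region ranges maintained by estimators):

```python
REQUIRED = ("description", "unit", "quantity", "unit_rate", "amount")

def quality_flags(item, rate_range=(0.5, 5000.0)):
    """Return a list of quality flags for one normalized BOQ record.
    rate_range is illustrative; real ranges are trade- and region-specific."""
    flags = []
    if any(item.get(f) in (None, "") for f in REQUIRED):
        flags.append("incomplete")
        return flags  # numeric checks are meaningless on incomplete rows
    expected = item["quantity"] * item["unit_rate"]
    if abs(expected - item["amount"]) > 0.01 * max(abs(item["amount"]), 1.0):
        flags.append("inconsistent_amount")
    if not rate_range[0] <= item["unit_rate"] <= rate_range[1]:
        flags.append("rate_out_of_range")
    return flags

row = {"description": "Excavation", "unit": "m3",
       "quantity": 120, "unit_rate": 4.5, "amount": 540.0}
# a clean row produces no flags; flagged rows go to expert review
```

Flags feed the review queue rather than dropping records: a flagged row may still be valid training data once an expert explains the anomaly.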

    Pipeline Stage 4: Export

    The labeled, normalized BOQ data exports to different formats depending on the downstream AI use case:

    For cost estimation models (JSONL):

    {"description": "Supply and fix reinforcement steel...", "trade": "structural", "unit": "kg", "rate_usd_per_unit": 1.85, "context": "foundations/piling"}
    

    For document classification (JSONL):

    {"text": "1.1.3 Excavation in rock...", "label": "civil_earthworks"}
    

    For RAG knowledge bases (chunked text): Structured chunks with trade/section metadata for retrieval-augmented generation.
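    Whatever the target, the JSONL exports above share one serializer: one JSON object per line. A minimal sketch:

```python
import json

def to_jsonl(records):
    """Serialize labeled BOQ records to JSONL: one JSON object per line.
    ensure_ascii=False preserves non-ASCII characters in descriptions."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

records = [
    {"text": "1.1.3 Excavation in rock...", "label": "civil_earthworks"},
    {"text": "Cable tray 300mm wide...", "label": "electrical"},
]
jsonl = to_jsonl(records)
# write with: open(path, "w", encoding="utf-8").write(jsonl + "\n")
```

Keeping the serializer dumb and pushing all schema decisions upstream (into normalization and labeling) means the same records can be re-exported for a new use case without reprocessing the source files.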

    The Domain Expert Requirement

    This pipeline can't be run by ML engineers alone. The normalization rules, trade classifications, and quality judgments require construction domain knowledge:

    • Is "concrete class C30" the same as "30 MPa concrete"? (Yes, but only a structural engineer would know.)
    • Should "provisional sum for unforeseen ground conditions" be included in training data? (Depends on the model's purpose.)
    • Is a rate of $500/m³ for concrete reasonable? (Depends on the region, project type, and year.)

    This is why the data preparation tool needs to be accessible to quantity surveyors and project managers — not locked behind Python scripts and CLI interfaces. Platforms like Ertas Data Suite put domain experts directly in the labeling workflow, which is where their knowledge has the most impact on training data quality.

    Getting Started

    If you're sitting on a collection of BOQs and want to build AI training data:

    1. Start with digital-native files (Excel/CSV) — they're easier to process than scanned PDFs
    2. Define your target schema before you start processing
    3. Involve a quantity surveyor in the labeling schema design
    4. Begin with a single project type to establish the pipeline, then scale
    5. Expect iteration — the first pass will reveal format variations you didn't anticipate

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
