
    Bill of Quantities Data Extraction: A Guide for Construction AI Projects

    Bill of quantities documents are dense, mixed-format files that hold critical domain knowledge for construction AI. Here's how to extract and structure BOQ data for model training — on-premise.

Ertas Team

    A bill of quantities is one of the most information-dense documents in construction. Every line item encodes a specification, a quantity, a unit of measure, and — in completed projects — a rate. Taken together, a firm's historical BOQ archive represents years of accumulated cost knowledge, calibrated to specific project types, locations, and market conditions.

    For construction AI, that archive is training data waiting to be unlocked. The main obstacle is that BOQ documents were designed for human readers and quantity surveying software, not for machine learning pipelines. Extracting them correctly requires understanding both the document format and the domain conventions.

    This guide covers the structure of BOQ documents, why extraction is harder than it appears, how to approach the extraction pipeline, and what the output should look like for AI training use cases.

    What a BOQ Contains and Why It Matters for AI

    A bill of quantities is a structured cost and quantity document produced during the pre-tender or post-tender phase of a construction project. It serves as the basis for pricing, contract administration, and final account settlement.

    The content is organized hierarchically. At the top level are divisions — typically corresponding to work categories like substructure, superstructure, finishes, MEP services. Within each division are sections, each covering a specific work type. Within each section are line items, each representing a discrete unit of work with a measured quantity.

    Each line item encodes:

    • Item code: A hierarchical reference number (e.g., 03.04.12)
    • Description: A technical specification of the work, often referencing materials, grades, standards, and methods
    • Quantity: A measured amount (e.g., 127.5)
    • Unit: The measurement unit (e.g., m3, m2, m, Nr, sum)
    • Rate: The unit price (present in priced BOQs from completed projects)
    • Amount: Quantity × Rate
    • Cross-references: Drawing numbers and specification clause references embedded in the description

    For AI training, the most valuable field is the description. It is where the domain knowledge lives. A description like "Reinforced concrete, grade C35/45, designed mix, in columns above ground floor slab, including formwork, vibration, and curing in accordance with Clause 5.4.2 of the Project Specification; ref dwgs S-201, S-202" contains: a concrete grade specification, a location specifier, a list of included activities, a specification cross-reference, and two drawing references. All in one line.

    A corpus of 100,000 such line items from completed projects is a dense, structured representation of construction knowledge — far more useful for training a construction estimating model than general web text.

    Why Extraction Is Harder Than It Looks

    BOQ documents are generated by quantity surveying software (CostX, CANDY, Buildsoft, CCS, WinQS) and exported to PDF for distribution. The problem is that PDF is a presentation format, not a data format. The software renders the table perfectly for the eye, but the underlying PDF may store each cell's text as a separate positioned text element with no structural relationship to its neighbors.

    The column alignment problem. In a natively digital PDF, text in a table column is aligned by X coordinate. But the X coordinates of text fragments from different software versions, different printers, and different export settings are not consistent. A table that looks clean on screen may have "Rate" at X=412 in one document and X=418 in another. Column detection by X coordinate requires tolerance handling and per-document calibration.

    Multi-line descriptions. Long descriptions wrap across multiple lines within the same cell. Each line is a separate text fragment. Reconstruction requires detecting that the lines belong to the same item — using indentation, the absence of a quantity in the adjacent column, and the absence of an item code at the start of the line.
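The continuation heuristic above can be sketched in a few lines. This assumes item codes follow a dotted numeric format like "03.04.12"; adapt the pattern to your archive's convention:

```python
import re

# Assumed item-code format: two-digit groups separated by dots, e.g. "03.04.12".
ITEM_CODE = re.compile(r"^\d{2}(\.\d{2})+\b")

def is_continuation(line_text: str, has_quantity: bool) -> bool:
    """A line continues the previous item's description if it starts with
    no item code and the adjacent quantity column is empty."""
    return not ITEM_CODE.match(line_text.strip()) and not has_quantity
```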

    Continuation across pages. BOQ documents are often hundreds of pages long. A section may start on page 47 and continue through page 83. Page headers repeat at the top of each page, and section totals appear at the bottom. Naive page-by-page table extraction will include these repeated headers and totals as data rows, and will split items that straddle a page break.

    Rasterized PDFs. Some BOQs are scanned paper documents or are exported by software that rasterizes the output (producing an image-based PDF rather than a text-based one). These require OCR before any table extraction can happen. OCR on tabular content introduces alignment errors that compound the column detection problem.

    Embedded specification text. Some BOQ formats — particularly those following the Standard Method of Measurement — include preamble clauses above each section that define the specification applicable to all items in that section. These preambles are not line items but must be associated with the items below them to provide complete context for AI training.

    The Extraction Approach

    A BOQ extraction pipeline has four sub-stages: structure detection, line-item parsing, normalization, and cross-reference extraction.

    Structure detection. Before parsing individual items, the pipeline must identify the document's column layout and the locations of headers, preambles, section breaks, and continuation page headers. This is done by analyzing the distribution of text X-coordinates across a sample of pages to infer column boundaries, and scanning for patterns that indicate section structure (all-caps section titles, item code format changes, running totals).
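Inferring column boundaries from the X-coordinate distribution can be done with simple gap-based clustering: sort the coordinates sampled across pages and start a new cluster wherever consecutive values jump by more than a gap threshold. A sketch, with the gap value as an assumption to tune:

```python
def infer_column_starts(x_coords: list[float], gap: float = 15.0) -> list[float]:
    """Cluster sampled fragment X coordinates into columns. A new column
    starts wherever sorted coordinates jump by more than `gap` points.
    Returns one representative (mean) X per cluster."""
    xs = sorted(x_coords)
    clusters, current = [], [xs[0]]
    for x in xs[1:]:
        if x - current[-1] > gap:
            clusters.append(current)
            current = [x]
        else:
            current.append(x)
    clusters.append(current)
    return [sum(c) / len(c) for c in clusters]
```

Running this over a sample of pages yields the calibrated column positions that the tolerance-based assignment step consumes.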

    Line-item parsing. With the column structure established, each page is processed to extract item records. The parser reads text fragments in reading order, assigns each fragment to a column based on its X coordinate, detects multi-line descriptions by checking for item codes and quantities, and handles page-break continuation by carrying the current item state across pages.

    The output of this stage is a raw record for each line item with its code, description, quantity, unit, rate, and amount. At this point, the records are raw — descriptions contain the full text including spec references, quantities are raw strings, and units have not been normalized.

    Normalization. Quantity strings are converted to numeric values, handling thousands separators and locale-specific decimal marks. Unit strings are normalized to canonical forms: "m3", "CUM", "cum", "M3", "cubic metre", and "cu.m" all normalize to "m3". Item codes are parsed into their hierarchical components. Amounts are validated against Quantity × Rate where both are present.

    Cross-reference extraction. Drawing references and specification clause references are extracted from description text using pattern matching. Drawing references typically follow patterns like "dwg S-201", "ref. drawing A/301", or "S-201/Rev.A". Specification clause references typically follow patterns like "Clause 5.4.2", "BS EN 206", or "to Spec Section 4". These are extracted as structured fields rather than left embedded in the description text.
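The pattern matching can be sketched with two regexes. These cover only the example patterns above — real archives will need per-firm variants:

```python
import re

# Assumed patterns — adapt to the reference conventions in your documents.
DRAWING_REF = re.compile(
    r"\b(?:dwgs?|drawing)\.?\s+([A-Z]+[-/]\d+(?:,\s*[A-Z]+[-/]\d+)*)", re.I
)
CLAUSE_REF = re.compile(r"\bClause\s+(\d+(?:\.\d+)*)", re.I)

def extract_refs(description: str) -> dict:
    """Pull drawing and clause references out of a description string."""
    drawings = []
    for m in DRAWING_REF.finditer(description):
        drawings += [d.strip() for d in m.group(1).split(",")]
    clauses = CLAUSE_REF.findall(description)
    return {"drawings": drawings, "clauses": clauses}
```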

    Quality Checks for BOQ Data

    Raw extraction produces records that range from high-confidence (clean tabular PDF, clear column structure) to low-confidence (scanned raster PDF, irregular layout). Quality checks should run before any extracted data enters a training dataset.

    Item code consistency. Within a document, item codes should follow a consistent numbering format. Items that break the pattern — missing components, unexpected depth levels — are flagged for review.

    Unit normalization completeness. Any unit string that does not normalize to a known canonical form is flagged. Construction BOQs use a finite set of measurement units; an unrecognized unit string usually indicates a parsing error.

    Cross-page continuity. The sum of item amounts within a section should match the section total at the bottom of the section. Where they do not match, cross-page continuation errors are likely.

    Description completeness. Description fields that are very short (less than 10 characters) or that contain obvious OCR artifacts (strings of symbols, character sequences with no word breaks) are flagged.

    Duplicate detection. The same BOQ may exist in multiple revisions in the archive. Records from different revisions of the same document should be deduplicated using the latest revision.

    A practical quality threshold: records with item code, description of at least 20 characters, numeric quantity, and recognized unit pass automatically. Records failing any check are queued for human review. On a high-quality digital BOQ archive, expect 80–90% automatic pass rates. On a mixed archive including scanned documents, expect 50–70%.
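The automatic-pass rule translates directly into a gate function. The set of known units here is a placeholder, and the 1% reconciliation tolerance on amounts is an assumption:

```python
def passes_quality_gate(record: dict) -> bool:
    """Auto-pass rule from the text: item code present, description of at
    least 20 characters, numeric quantity, recognized unit; where rate and
    amount are both present, amount must reconcile with quantity * rate."""
    known_units = {"m3", "m2", "m", "Nr", "sum", "kg", "t"}  # placeholder set
    if not record.get("item_code") or len(record.get("description", "")) < 20:
        return False
    if not isinstance(record.get("quantity"), (int, float)):
        return False
    if record.get("unit") not in known_units:
        return False
    qty, rate, amount = record["quantity"], record.get("rate"), record.get("amount")
    if rate is not None and amount is not None:
        # Assumed tolerance: 1% of the amount, to absorb rounding.
        return abs(qty * rate - amount) <= 0.01 * max(abs(amount), 1.0)
    return True
```

Records that fail any branch go to the human-review queue rather than the training set.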

    Output Formats for AI Training

    JSONL for fine-tuning estimating models. Each line item becomes one JSON record:

    {"item_code": "03.04.12", "description": "Reinforced concrete, grade C35/45, in columns above ground floor slab, including formwork and curing", "quantity": 127.5, "unit": "m3", "rate": 285.00, "project_type": "office", "region": "southeast", "date": "2024-Q2"}
    

    This format trains a model to predict rates from descriptions, or to suggest descriptions for a given specification.

    CSV for cost analytics. The same records in tabular form, with one column per field, enable statistical analysis: rate distributions by item type, cost trends over time, regional rate variations.

    Chunked text for RAG. BOQ line items can be embedded as text chunks for a retrieval system, allowing queries like "what was typically specified for RC columns in commercial office projects?" The description field, combined with project metadata, forms an effective retrieval unit.
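One way to build that retrieval unit is to prefix the description with its metadata so that project type, region, and period are embedded alongside the specification text. An illustrative sketch:

```python
def to_chunk(record: dict) -> str:
    """Combine a line item's description with project metadata into a
    single text chunk for embedding. Field names match the JSONL example."""
    meta = f"[{record['project_type']} | {record['region']} | {record['date']}]"
    return f"{meta} {record['item_code']} ({record['unit']}): {record['description']}"
```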

    Expected Dataset Sizes for Construction AI

    For a meaningful estimating model, you need enough records to represent the range of items, project types, and time periods in your data. Rule-of-thumb thresholds:

    • Minimum viable dataset: 10,000 line items from at least 10 completed projects, covering the main work categories
    • Useful dataset: 50,000 line items from 30+ projects across multiple project types
    • Strong dataset: 150,000+ line items from 80+ projects with full rate data and project metadata

    Most firms with a decade of project history will find enough BOQ data to reach the "useful" threshold within their archive. The constraint is typically extraction quality, not data volume.


    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.

