    Insurance Underwriting AI: From Policy PDFs to Structured Training Data


    How to convert underwriting documents — risk assessments, policy applications, actuarial reports — into structured AI training data for risk scoring and automated underwriting.

    Ertas Team

    Underwriting is where insurance companies make their most consequential decisions: what to insure, at what price, under what terms. AI is increasingly assisting these decisions — risk classification, pricing optimization, submission triage — but the training data required is buried in decades of underwriting documents that were never designed for machine consumption.

    Converting underwriting documents into structured AI training data requires understanding the unique document types, the domain-specific knowledge embedded in them, and the regulatory constraints around algorithmic underwriting.

    Underwriting Document Types

    Policy Applications

    The starting point for every underwriting decision. Applications contain:

    • Structured fields: Applicant demographics, coverage requested, limits, deductibles
    • Narrative sections: Business descriptions, loss history explanations, risk management practices
    • Supporting schedules: Vehicle lists, property schedules, employee counts, revenue breakdowns

    Applications vary significantly by line of business. A personal auto application looks nothing like a commercial property application, which looks nothing like a directors & officers liability application.

    Risk Assessment Reports

    Underwriters produce narrative evaluations that capture their analysis:

    • Risk factors identified (positive and negative)
    • Comparison to class averages
    • Pricing rationale and deviation justification
    • Terms and conditions modifications
    • Referral notes for risks that exceed authority

    These reports are the richest source of underwriting intelligence — they capture the reasoning, not just the decision.

    Loss Runs

    Historical claims data for a specific insured:

    • Claim dates, types, amounts paid and reserved
    • Open vs. closed status
    • Development patterns (how claims evolved over time)
    • Loss ratios by coverage line

    Loss runs come from multiple sources (current carrier, prior carriers) in inconsistent formats.

    Inspection Reports

    Third-party assessments of the risk being underwritten:

    • Property condition, construction type, protection class
    • Safety practices and hazard identification
    • Compliance with building codes and fire protection standards
    • Photos and diagrams

    Financial Statements

    For commercial lines, the insured's financial health informs underwriting:

    • Balance sheets, income statements, cash flow statements
    • Revenue trends, debt ratios, liquidity measures
    • Comparison to industry benchmarks

    Building the Training Pipeline

    Stage 1: Document Ingestion

    Applications: Parse PDF forms with field extraction. Handle the variation across application versions and lines of business. Multi-page applications with schedules require page-level classification.

    Risk assessments: Extract narrative text with section detection. Identify key sections (risk summary, pricing rationale, terms) even when formatting varies by underwriter.

    Loss runs: Table extraction with column mapping. Loss runs from different carriers use different column layouts, date formats, and status codes.

    Financial statements: Structured table extraction with line-item identification. Map varied presentations to a standard financial structure.
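    To make the loss-run step concrete, here is a minimal sketch of carrier-to-canonical column mapping. The carrier names and column labels are illustrative assumptions, not real carrier layouts:

```python
# Carrier loss runs label the same data differently; map each known
# layout onto one canonical schema before merging histories.
# (Carrier keys and column labels below are hypothetical examples.)
COLUMN_MAPS = {
    "carrier_a": {
        "Date of Loss": "loss_date",
        "Paid": "paid",
        "Reserve": "reserved",
        "Status": "status",
    },
    "carrier_b": {
        "DOL": "loss_date",
        "Amt Paid": "paid",
        "O/S Reserve": "reserved",
        "Claim Status": "status",
    },
}

def normalize_loss_run_row(row: dict, carrier: str) -> dict:
    """Rename one extracted table row to the canonical schema,
    dropping columns we have no mapping for."""
    mapping = COLUMN_MAPS[carrier]
    return {mapping[k]: v for k, v in row.items() if k in mapping}
```

    In practice each new carrier layout only requires adding one entry to the mapping table, which keeps the extraction code itself stable.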

    Stage 2: Normalization and Enrichment

    • Map inconsistent field names to a standard schema across all document sources
    • Standardize codes (SIC → NAICS, state codes, coverage codes)
    • Calculate derived features (loss ratios, frequency/severity splits, growth rates)
    • Cross-reference data across documents (does the loss run match the application's loss history disclosure?)
    • Flag inconsistencies for review
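    The derived-feature step above can be sketched as a small pure function. The field names (`paid`, `reserved`) assume the canonical loss-run schema from Stage 1 and are illustrative:

```python
def derived_features(claims: list[dict], earned_premium: float) -> dict:
    """Compute loss ratio and frequency/severity from normalized claims.

    Incurred = paid + reserved, a common convention; adjust if your
    book treats recoveries or ALAE differently.
    """
    total_incurred = sum(c["paid"] + c["reserved"] for c in claims)
    n = len(claims)
    return {
        "loss_ratio": total_incurred / earned_premium if earned_premium else None,
        "frequency": n,
        "severity": total_incurred / n if n else 0.0,
    }
```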

    Stage 3: Labeling for AI Models

    Risk classification labels:

    • Preferred / standard / substandard / decline
    • Risk score (1-10 or similar scale)
    • Key risk factors that drove the classification

    Pricing labels:

    • Target premium, actual premium, deviation percentage
    • Rate adequacy assessment
    • Pricing components (base rate, experience modification, schedule credits/debits)

    Decision labels:

    • Quote / decline / refer
    • Terms offered vs. standard terms
    • Endorsements added and rationale

    Who labels: Senior underwriters and pricing actuaries. Risk classification is judgment-intensive — a junior analyst might miss the risk factors that an experienced underwriter catches instantly.
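    One way to keep these three label families consistent across annotators is a fixed record schema. A minimal sketch, with field names and the provenance field chosen here for illustration:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class UnderwritingLabel:
    """One labeled underwriting decision for model training."""
    risk_class: str                      # preferred | standard | substandard | decline
    risk_score: int                      # 1-10 scale
    decision: str                        # quote | decline | refer
    key_risk_factors: list[str] = field(default_factory=list)
    labeler_role: str = "senior_underwriter"  # provenance, useful for audit

    def to_record(self) -> dict:
        return asdict(self)
```

    Recording who labeled each example (the `labeler_role` field) pays off later: regulators and internal audit will ask whether labels came from qualified underwriters or junior staff.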

    Stage 4: Bias Testing

    Underwriting AI faces intense regulatory scrutiny for discrimination:

    • Protected characteristics: Models must not use race, ethnicity, gender, religion, or other protected classes as pricing or selection factors
    • Proxy variables: Geographic, credit, and occupational variables can serve as proxies for protected characteristics
    • Disparate impact analysis: Even facially neutral models must be tested for disproportionate impact on protected groups
    • State regulatory requirements: Many states require algorithmic underwriting models to be filed and approved

    Bias testing must be documented and the results included in the training data package.
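    A common first check for disparate impact is the approval-rate ratio between groups (the "four-fifths rule" threshold of 0.8 used in US employment-discrimination practice is often borrowed as a screening heuristic). A minimal sketch, assuming outcomes are `(group, approved)` pairs:

```python
def disparate_impact_ratio(
    outcomes: list[tuple[str, bool]],
    protected: str,
    reference: str,
) -> float:
    """Approval-rate ratio of a protected group vs. a reference group.

    A ratio well below 0.8 is a common red flag warranting deeper
    statistical analysis; it is a screen, not a legal determination.
    """
    def rate(group: str) -> float:
        decisions = [approved for g, approved in outcomes if g == group]
        return sum(decisions) / len(decisions)

    return rate(protected) / rate(reference)
```

    Real disparate-impact testing goes further (significance testing, proxy-variable analysis), but even this screen catches gross imbalances early.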

    Stage 5: Export

    • JSONL for risk classification models: {"application_features": {...}, "loss_history": [...], "risk_class": "standard", "risk_score": 6}
    • Structured JSON for pricing models: Input features + target premium with component breakdown
    • Chunked text for RAG: Underwriting guidelines, risk appetite statements, and pricing manuals for retrieval-augmented underwriting assistants
    • CSV for traditional actuarial models: Feature matrices with outcome variables
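    The JSONL export above is a few lines of standard-library Python. A minimal sketch; the record shape mirrors the risk-classification example from this section:

```python
import json

def export_jsonl(records: list[dict], path: str) -> None:
    """Write one JSON object per line -- the usual training-data format."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

    Usage would look like `export_jsonl([{"application_features": {...}, "risk_class": "standard", "risk_score": 6}], "train.jsonl")`, with one underwriting decision per line.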

    The On-Premise Imperative

    Underwriting data is among the most competitively sensitive information an insurance company possesses:

    • Pricing algorithms represent years of actuarial research and competitive positioning
    • Risk selection criteria define the company's risk appetite — core strategic IP
    • Loss experience reveals the company's book performance
    • Underwriter judgment encoded in risk assessments represents institutional knowledge

    Sending this data to cloud-based preparation tools exposes competitive intelligence. On-premise processing keeps everything within the company's infrastructure.

    Getting Started

    1. Pick one line of business: Commercial property or personal auto are common starting points — high volume, well-documented processes
    2. Start with structured data: Applications and loss runs before narrative risk assessments
    3. Engage senior underwriters: They define what "good underwriting" looks like — that's what the model needs to learn
    4. Build bias testing in from day one: Not as an afterthought — regulators will ask

    Platforms like Ertas Data Suite handle the complete pipeline on-premise: ingestion of varied document formats, PII redaction, domain expert labeling, bias documentation, and export to model-ready formats. For underwriting AI, where data sensitivity and regulatory scrutiny are at their highest, on-premise is the only approach that makes sense.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
