
Insurance Underwriting AI: From Policy PDFs to Structured Training Data
How to convert underwriting documents — risk assessments, policy applications, actuarial reports — into structured AI training data for risk scoring and automated underwriting.
Underwriting is where insurance companies make their most consequential decisions: what to insure, at what price, under what terms. AI is increasingly assisting these decisions — risk classification, pricing optimization, submission triage — but the training data required is buried in decades of underwriting documents that were never designed for machine consumption.
Converting underwriting documents into structured AI training data requires understanding the unique document types, the domain-specific knowledge embedded in them, and the regulatory constraints around algorithmic underwriting.
Underwriting Document Types
Policy Applications
The starting point for every underwriting decision. Applications contain:
- Structured fields: Applicant demographics, coverage requested, limits, deductibles
- Narrative sections: Business descriptions, loss history explanations, risk management practices
- Supporting schedules: Vehicle lists, property schedules, employee counts, revenue breakdowns
Applications vary significantly by line of business. A personal auto application looks nothing like a commercial property application, which looks nothing like a directors & officers liability application.
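Despite that variation, downstream stages are easier if every application is normalized into one record shape. A minimal sketch, assuming a hypothetical normalized schema (the field names here are illustrative, not an industry standard):

```python
from dataclasses import dataclass, field

# Hypothetical line-of-business-agnostic application record.
# Field names are illustrative only, not a standard schema.
@dataclass
class ApplicationRecord:
    line_of_business: str                 # e.g. "commercial_property"
    applicant_name: str
    coverage_limit: float
    deductible: float
    narrative_sections: dict[str, str] = field(default_factory=dict)
    schedules: dict[str, list[dict]] = field(default_factory=dict)

app = ApplicationRecord(
    line_of_business="commercial_property",
    applicant_name="Acme Warehousing LLC",
    coverage_limit=5_000_000,
    deductible=25_000,
)
# Narrative sections and schedules attach as free-form keyed entries,
# so one record type covers auto, property, and D&O applications alike.
app.narrative_sections["loss_history_explanation"] = "Two water-damage claims in 2022..."
```

Keeping structured fields, narratives, and schedules in one record means later labeling and export stages never need line-of-business-specific code paths.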
Risk Assessment Reports
Underwriters produce narrative evaluations that capture their analysis:
- Risk factors identified (positive and negative)
- Comparison to class averages
- Pricing rationale and deviation justification
- Terms and conditions modifications
- Referral notes for risks that exceed authority
These reports are the richest source of underwriting intelligence — they capture the reasoning, not just the decision.
Loss Runs
Historical claims data for a specific insured:
- Claim dates, types, amounts paid and reserved
- Open vs. closed status
- Development patterns (how claims evolved over time)
- Loss ratios by coverage line
Loss runs come from multiple sources (current carrier, prior carriers) in inconsistent formats.
Inspection Reports
Third-party assessments of the risk being underwritten:
- Property condition, construction type, protection class
- Safety practices and hazard identification
- Compliance with building codes and fire protection standards
- Photos and diagrams
Financial Statements
For commercial lines, the insured's financial health informs underwriting:
- Balance sheets, income statements, cash flow statements
- Revenue trends, debt ratios, liquidity measures
- Comparison to industry benchmarks
Building the Training Pipeline
Stage 1: Document Ingestion
Applications: Parse PDF forms with field extraction. Handle the variation across application versions and lines of business. Multi-page applications with schedules require page-level classification.
Risk assessments: Extract narrative text with section detection. Identify key sections (risk summary, pricing rationale, terms) even when formatting varies by underwriter.
Loss runs: Table extraction with column mapping. Loss runs from different carriers use different column layouts, date formats, and status codes.
Financial statements: Structured table extraction with line-item identification. Map varied presentations to a standard financial structure.
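The column-mapping problem for loss runs can be sketched as an alias table keyed to a canonical schema. The carrier header names below are hypothetical examples; a production mapping would be built from the actual carriers in your book:

```python
# Canonical loss-run fields mapped to carrier-specific header aliases.
# These aliases are illustrative; real carrier layouts vary widely.
COLUMN_ALIASES = {
    "loss_date": {"date of loss", "dol", "loss dt", "accident date"},
    "paid":      {"paid loss", "amt paid", "total paid"},
    "reserved":  {"outstanding", "o/s reserve", "case reserve"},
    "status":    {"clm status", "open/closed", "stat"},
}

def map_columns(raw_headers):
    """Map a carrier's raw headers to canonical names.

    Unrecognized headers map to None so they can be flagged
    for manual review instead of being silently dropped.
    """
    mapped = {}
    for header in raw_headers:
        key = header.strip().lower()
        for canonical, aliases in COLUMN_ALIASES.items():
            if key == canonical or key in aliases:
                mapped[header] = canonical
                break
        else:
            mapped[header] = None  # unknown column: flag for review
    return mapped

mapping = map_columns(["Date of Loss", "Amt Paid", "O/S Reserve", "Adjuster"])
```

The `None` entries are the important part: every unmapped column is a potential silent data loss, so surfacing them for review is what makes multi-carrier ingestion trustworthy.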
Stage 2: Normalization and Enrichment
- Map inconsistent field names to a standard schema across all document sources
- Standardize codes (SIC → NAICS, state codes, coverage codes)
- Calculate derived features (loss ratios, frequency/severity splits, growth rates)
- Cross-reference data across documents (does the loss run match the application's loss history disclosure?)
- Flag inconsistencies for review
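The derived-feature step above can be sketched in a few lines, assuming claims have already been normalized to the canonical schema from Stage 1:

```python
def derived_features(claims, earned_premium):
    """Compute basic derived features from normalized claims.

    `claims` is a list of dicts with canonical "paid" and "reserved"
    fields; `earned_premium` is for the matching policy period.
    """
    total_incurred = sum(c["paid"] + c["reserved"] for c in claims)
    count = len(claims)
    return {
        # incurred-to-earned loss ratio; None if premium is missing
        "loss_ratio": total_incurred / earned_premium if earned_premium else None,
        "frequency": count,                                   # claim count
        "severity": total_incurred / count if count else 0.0, # avg incurred
    }

features = derived_features(
    [{"paid": 40_000, "reserved": 10_000}, {"paid": 5_000, "reserved": 0}],
    earned_premium=100_000,
)
# loss_ratio 0.55, frequency 2, severity 27_500.0
```

Splitting frequency from severity matters because two books with identical loss ratios can have very different risk profiles: many small claims versus one large one.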
Stage 3: Labeling for AI Models
Risk classification labels:
- Preferred / standard / substandard / decline
- Risk score (1-10 or similar scale)
- Key risk factors that drove the classification
Pricing labels:
- Target premium, actual premium, deviation percentage
- Rate adequacy assessment
- Pricing components (base rate, experience modification, schedule credits/debits)
Decision labels:
- Quote / decline / refer
- Terms offered vs. standard terms
- Endorsements added and rationale
Who labels: Senior underwriters and pricing actuaries. Risk classification is judgment-intensive — a junior analyst might miss the risk factors that an experienced underwriter catches instantly.
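The three label families above can live in a single record per submission. A minimal sketch, with illustrative field names (the submission ID and values are made up):

```python
import json

# Hypothetical label record combining risk, pricing, and decision labels.
# Field names and the submission ID are illustrative only.
label = {
    "submission_id": "SUB-2024-00123",
    "risk_class": "standard",        # preferred / standard / substandard / decline
    "risk_score": 6,                 # 1-10 scale
    "key_risk_factors": ["aging roof", "favorable loss history"],
    "target_premium": 48_500,
    "actual_premium": 51_000,
    "decision": "quote",             # quote / decline / refer
    "labeled_by": "senior_underwriter",
}
line = json.dumps(label)  # one JSONL line, ready for Stage 5 export
```

Recording who produced each label (`labeled_by`) pays off later: it lets you measure inter-annotator agreement and weight labels by underwriter seniority.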
Stage 4: Bias Testing
Underwriting AI faces intense regulatory scrutiny for discrimination:
- Protected characteristics: Models must not use race, ethnicity, gender, religion, or other protected classes as pricing or selection factors
- Proxy variables: Geographic, credit, and occupational variables can serve as proxies for protected characteristics
- Disparate impact analysis: Even facially neutral models must be tested for disproportionate impact on protected groups
- State regulatory requirements: Many states require algorithmic underwriting models to be filed and approved
Bias testing must be documented and the results included in the training data package.
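One common disparate-impact screen is the four-fifths (adverse impact ratio) test: compare each group's rate of favorable outcomes against the best-performing group's rate. A minimal sketch on synthetic data (note that the protected-group attribute is held only for testing, never as a model feature):

```python
def adverse_impact_ratio(outcomes, groups, favorable="quote"):
    """Four-fifths-rule screen.

    Returns each group's favorable-outcome rate divided by the
    highest group's rate. Ratios below 0.8 conventionally warrant
    further disparate-impact review.
    """
    rates = {}
    for g in set(groups):
        members = [o for o, gg in zip(outcomes, groups) if gg == g]
        rates[g] = sum(o == favorable for o in members) / len(members)
    best = max(rates.values())
    return {g: rate / best for g, rate in rates.items()}

# Synthetic example: group B is quoted half as often as group A.
ratios = adverse_impact_ratio(
    outcomes=["quote", "quote", "decline", "quote", "decline", "decline"],
    groups=["A", "A", "A", "B", "B", "B"],
)
# ratios == {"A": 1.0, "B": 0.5}  -> group B fails the 0.8 threshold
```

This is a screening heuristic, not a legal determination; failing groups should trigger the deeper statistical analysis and documentation that regulators expect.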
Stage 5: Export
- JSONL for risk classification models: `{"application_features": {...}, "loss_history": [...], "risk_class": "standard", "risk_score": 6}`
- Structured JSON for pricing models: Input features + target premium with component breakdown
- Chunked text for RAG: Underwriting guidelines, risk appetite statements, and pricing manuals for retrieval-augmented underwriting assistants
- CSV for traditional actuarial models: Feature matrices with outcome variables
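The JSONL export is the simplest of these targets: one JSON object per line. A minimal writer, assuming records are already plain dicts from the labeling stage:

```python
import json

def export_jsonl(records, path):
    """Write one JSON object per line (the JSONL format above)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # ensure_ascii=False keeps insured names with accents readable
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

export_jsonl(
    [{"risk_class": "standard", "risk_score": 6},
     {"risk_class": "substandard", "risk_score": 3}],
    "train.jsonl",
)
```

JSONL streams well: training frameworks can read it line by line without loading the whole dataset, which matters once the export covers years of submissions.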
The On-Premise Imperative
Underwriting data is among the most competitively sensitive information an insurance company possesses:
- Pricing algorithms represent years of actuarial research and competitive positioning
- Risk selection criteria define the company's risk appetite — core strategic IP
- Loss experience reveals the company's book performance
- Underwriter judgment encoded in risk assessments represents institutional knowledge
Sending this data to cloud-based preparation tools exposes competitive intelligence. On-premise processing keeps everything within the company's infrastructure.
Getting Started
- Pick one line of business: Commercial property or personal auto are common starting points — high volume, well-documented processes
- Start with structured data: Applications and loss runs before narrative risk assessments
- Engage senior underwriters: They define what "good underwriting" looks like — that's what the model needs to learn
- Build bias testing in from day one: Not as an afterthought — regulators will ask
Platforms like Ertas Data Suite handle the complete pipeline on-premise: ingestion of varied document formats, PII redaction, domain expert labeling, bias documentation, and export to model-ready formats. For underwriting AI, where data sensitivity and regulatory scrutiny are at their highest, on-premise is the only approach that makes sense.