
Training AI on Financial Statements: Data Extraction and Labeling On-Premise
How to extract and label financial statement data for AI training — parsing XBRL, extracting tables from PDFs, handling format variation, and building classification models for financial analysis.
Financial statements are among the most structured documents in business — yet converting them into AI training data is surprisingly difficult. Varied presentation formats, nested table structures, cross-references between statements and notes, and the domain-specific meaning of line items create extraction and labeling challenges that generic document AI tools don't handle well.
This guide covers the practical pipeline for turning financial statement PDFs and XBRL filings into labeled training datasets — on-premise, for use cases like automated financial analysis, anomaly detection, and report generation.
Financial Statement Data Sources
SEC Filings (XBRL/iXBRL)
Public company filings are available in structured XBRL (eXtensible Business Reporting Language):
- Advantage: Machine-readable with standardized taxonomy tags
- Challenge: XBRL extensions create custom tags that vary by filer, taxonomy versions change over time, and rendering differs between filing software packages
- What you get: Structured facts (Revenue = $X for period Y with unit Z) linked to US GAAP or IFRS taxonomy concepts
PDF Financial Statements
Private companies, international filings, and many reports exist only as PDFs:
- Advantage: Visual layout preserves human-readable formatting
- Challenge: Table extraction from PDFs is unreliable — merged cells, spanning headers, footnote references, and multi-page tables all cause problems
- What you get: Raw text and table structures that need significant processing
Audit/Compilation Software Exports
Many financial statements originate in accounting software (Caseware, Workiva, CCH):
- Advantage: Structured data at the source
- Challenge: Export formats are proprietary and vary between software versions
- What you get: Structured data that needs format normalization
The Extraction Pipeline
XBRL Processing
- Parse XBRL instance documents to extract facts (concept, value, period, unit, context)
- Resolve taxonomy references — map each fact to the US GAAP or IFRS taxonomy hierarchy
- Handle extensions — custom tags created by filers need to be mapped to standard concepts or flagged
- Build financial statement structure — reconstruct the balance sheet, income statement, and cash flow statement from individual facts
- Handle dimensional data — segment reporting, geographic breakdowns, and product line data use XBRL dimensions
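The first step above, pulling raw facts from an instance document, can be sketched with the standard library alone. This is a minimal illustration on a made-up instance fragment; real filings need a full XBRL processor such as Arelle for taxonomy resolution, extensions, and dimensions:

```python
import xml.etree.ElementTree as ET

# Illustrative XBRL instance fragment (not a real filing).
INSTANCE = """<xbrli:xbrl xmlns:us-gaap="http://fasb.org/us-gaap/2024"
      xmlns:xbrli="http://www.xbrl.org/2003/instance">
  <xbrli:context id="FY2025">
    <xbrli:period>
      <xbrli:startDate>2025-01-01</xbrli:startDate>
      <xbrli:endDate>2025-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
  <xbrli:unit id="usd"><xbrli:measure>iso4217:USD</xbrli:measure></xbrli:unit>
  <us-gaap:Revenues contextRef="FY2025" unitRef="usd">45000000</us-gaap:Revenues>
</xbrli:xbrl>"""

def extract_facts(instance_xml: str) -> list[dict]:
    """Pull (concept, value, context, unit) facts from an XBRL instance."""
    root = ET.fromstring(instance_xml)
    facts = []
    for el in root.iter():
        ctx = el.get("contextRef")
        if ctx is None:  # contexts and units themselves carry no contextRef
            continue
        concept = el.tag.split("}")[-1]  # strip the namespace URI
        facts.append({
            "concept": concept,
            "value": el.text,
            "context": ctx,
            "unit": el.get("unitRef"),
        })
    return facts

print(extract_facts(INSTANCE))
```

Each extracted fact is the (concept, value, period, unit, context) tuple the pipeline needs; resolving `FY2025` to actual dates and `Revenues` to its place in the US GAAP hierarchy are the later steps.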
PDF Table Extraction
- Layout detection — identify table regions on each page
- Column and row detection — find grid lines, aligned text, and cell boundaries
- Header identification — distinguish column headers from data rows (including multi-row headers)
- Cell extraction — extract text from each cell, handling:
  - Parentheses for negative numbers: (1,234) → -1234
  - Dash or em-dash for zero: — → 0
  - Percentage signs: 12.5% → 0.125
  - Currency symbols: $1,234 → 1234 (USD)
- Multi-page table continuation — detect when a table spans pages and merge correctly
- Footnote reference extraction — identify superscript markers and link to footnote text
Normalization
Financial statement line items vary in presentation:
| Company A | Company B | Normalized |
|---|---|---|
| Net revenues | Revenue | revenue |
| Cost of goods sold | Cost of revenue | cost_of_revenue |
| Selling, general and admin | SG&A expenses | sg_and_a |
| Net income (loss) | Net earnings | net_income |
Normalization maps these variations to a standard chart of accounts. This requires:
- A mapping dictionary (built from domain expertise)
- Fuzzy matching for novel presentations
- Context awareness (the same label can mean different things on different statements)
Labeling for AI Use Cases
Financial Analysis Automation
Label type: Line item classification
```json
{"text": "Depreciation and amortization", "label": "depreciation_amortization", "statement": "income_statement", "subtotal_parent": "operating_expenses"}
```
Training data: thousands of examples mapping varied line item descriptions to standardized categories.
Anomaly Detection
Label type: Normal vs. anomalous patterns
```json
{"company": "ANON_001", "metric": "gross_margin", "period": "2025-Q3", "value": 0.12, "historical_avg": 0.34, "label": "anomaly", "severity": "high"}
```
Training data: historical financial data with labeled anomalies (unusual fluctuations, errors, restatements).
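For bootstrapping labels of this shape, a simple z-score rule against each metric's own history is a common starting point. This is a sketch under that assumption, not the full labeling process, which would add seasonality, peer comparison, and human review:

```python
from statistics import mean, stdev

def label_metric(history: list[float], value: float,
                 z_threshold: float = 2.0) -> dict:
    """Label one metric observation against its own history.

    Flags the observation as an anomaly when it sits more than
    z_threshold standard deviations from the historical mean.
    """
    avg = mean(history)
    sd = stdev(history)
    z = (value - avg) / sd if sd else 0.0
    is_anomaly = abs(z) > z_threshold
    return {
        "value": value,
        "historical_avg": round(avg, 4),
        "label": "anomaly" if is_anomaly else "normal",
        # Crude severity rule: double the threshold counts as high.
        "severity": ("high" if abs(z) > 2 * z_threshold else "medium")
                    if is_anomaly else None,
    }

gross_margins = [0.34, 0.35, 0.33, 0.34, 0.36]
print(label_metric(gross_margins, 0.12))  # far below history: anomaly, high
```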
Report Generation
Label type: Text-to-data and data-to-text pairs
```json
{"financials": {"revenue": 45000000, "revenue_growth": 0.15, "gross_margin": 0.62}, "narrative": "Revenue increased 15% year-over-year to $45 million, driven by..."}
```
Training data: pairs of financial data and the human-written narratives that describe them.
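One way to bootstrap pairs of this shape is to template narratives from the structured side. This is a hedged sketch of that augmentation idea only; the human-written narratives described above remain the gold-standard targets:

```python
def render_narrative(fin: dict) -> str:
    """Template a data-to-text target from structured financials.

    Templated 'silver' pairs can seed a dataset, but they lack the
    causal detail ("driven by...") that human narratives carry.
    """
    return (f"Revenue increased {fin['revenue_growth']:.0%} "
            f"year-over-year to ${fin['revenue'] / 1e6:.0f} million.")

pair = {
    "financials": {"revenue": 45_000_000, "revenue_growth": 0.15,
                   "gross_margin": 0.62},
    "narrative": render_narrative({"revenue": 45_000_000,
                                   "revenue_growth": 0.15}),
}
print(pair["narrative"])
```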
Ratio Analysis
Label type: Calculated ratios with interpretive labels
```json
{"current_ratio": 0.85, "industry_avg": 1.5, "interpretation": "below_industry_norm", "risk_flag": true}
```
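Records of this shape are straightforward to derive once balance-sheet line items are normalized. A sketch with illustrative thresholds (real benchmarks come from industry datasets and analyst judgment):

```python
def label_current_ratio(current_assets: float, current_liabilities: float,
                        industry_avg: float) -> dict:
    """Compute the current ratio and attach interpretive labels.

    The 20% bands around the industry average are an assumption
    for illustration, not an accounting standard.
    """
    ratio = round(current_assets / current_liabilities, 2)
    if ratio < industry_avg * 0.8:
        interpretation = "below_industry_norm"
    elif ratio > industry_avg * 1.2:
        interpretation = "above_industry_norm"
    else:
        interpretation = "within_industry_norm"
    return {
        "current_ratio": ratio,
        "industry_avg": industry_avg,
        "interpretation": interpretation,
        "risk_flag": ratio < 1.0,  # current liabilities exceed current assets
    }

print(label_current_ratio(850_000, 1_000_000, industry_avg=1.5))
```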
Quality Challenges
Restatements and Corrections
Financial statements get restated. Original filings may contain errors corrected in subsequent filings. Training data should:
- Use the most recent version of each filing
- Flag restated periods (the original error and correction are both useful training signals for anomaly detection)
- Track which version of each statement was used
GAAP vs. Non-GAAP
Many companies report non-GAAP metrics alongside GAAP figures. Training data must distinguish between them — a model trained on a mix of GAAP and non-GAAP data without labels will produce unreliable outputs.
Consolidation Complexity
Consolidated financial statements combine multiple entities with elimination entries. Segment-level data may not reconcile to consolidated totals due to intersegment eliminations and corporate allocations.
Why On-Premise
Financial statement data for AI training involves:
- Client confidential information (for accounting firm data)
- Material non-public information (for pre-release financials)
- Competitive intelligence (financial performance data)
- Regulatory obligations (SOX, PCAOB, SEC)
Processing this data on cloud services creates unnecessary risk. On-premise platforms like Ertas Data Suite keep the entire pipeline local — extraction, normalization, labeling, and export all happen on your infrastructure. Financial professionals can label data directly through the desktop interface, and the complete audit trail satisfies regulatory documentation requirements.
Financial AI starts with financial data, prepared by financial professionals, on infrastructure you control.