
Training AI on Financial Statements: Data Extraction and Labeling On-Premise
How to extract and label financial statement data for AI training — parsing XBRL, extracting tables from PDFs, handling format variation, and building classification models for financial analysis.
Financial statements are among the most structured documents in business — yet converting them into AI training data is surprisingly difficult. Varied presentation formats, nested table structures, cross-references between statements and notes, and the domain-specific meaning of line items create extraction and labeling challenges that generic document AI tools don't handle well.
This guide covers the practical pipeline for turning financial statement PDFs and XBRL filings into labeled training datasets — on-premise, for use cases like automated financial analysis, anomaly detection, and report generation.
Financial Statement Data Sources
SEC Filings (XBRL/iXBRL)
Public company filings are available in structured XBRL (eXtensible Business Reporting Language):
- Advantage: Machine-readable with standardized taxonomy tags
- Challenge: XBRL extensions create custom tags that vary by filer, taxonomy versions change over time, and rendering differs between filing software packages
- What you get: Structured facts (Revenue = $X for period Y with unit Z) linked to US GAAP or IFRS taxonomy concepts
PDF Financial Statements
Private companies, international filings, and many reports exist only as PDFs:
- Advantage: Visual layout preserves human-readable formatting
- Challenge: Table extraction from PDFs is unreliable — merged cells, spanning headers, footnote references, and multi-page tables all cause problems
- What you get: Raw text and table structures that need significant processing
Audit/Compilation Software Exports
Many financial statements originate in accounting software (Caseware, Workiva, CCH):
- Advantage: Structured data at the source
- Challenge: Export formats are proprietary and vary between software versions
- What you get: Structured data that needs format normalization
The Extraction Pipeline
XBRL Processing
- Parse XBRL instance documents to extract facts (concept, value, period, unit, context)
- Resolve taxonomy references — map each fact to the US GAAP or IFRS taxonomy hierarchy
- Handle extensions — custom tags created by filers need to be mapped to standard concepts or flagged
- Build financial statement structure — reconstruct the balance sheet, income statement, and cash flow statement from individual facts
- Handle dimensional data — segment reporting, geographic breakdowns, and product line data use XBRL dimensions
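The first step above, pulling raw facts from an instance document, can be sketched with the standard library alone. This is a minimal illustration on a made-up instance fragment; real filings need a full XBRL processor such as Arelle for taxonomy resolution, extensions, and dimensions:

```python
import xml.etree.ElementTree as ET

# Illustrative XBRL instance fragment (not a real filing).
INSTANCE = """<xbrli:xbrl xmlns:us-gaap="http://fasb.org/us-gaap/2024"
      xmlns:xbrli="http://www.xbrl.org/2003/instance">
  <xbrli:context id="FY2025">
    <xbrli:period>
      <xbrli:startDate>2025-01-01</xbrli:startDate>
      <xbrli:endDate>2025-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
  <xbrli:unit id="usd"><xbrli:measure>iso4217:USD</xbrli:measure></xbrli:unit>
  <us-gaap:Revenues contextRef="FY2025" unitRef="usd">45000000</us-gaap:Revenues>
</xbrli:xbrl>"""

def extract_facts(instance_xml: str) -> list[dict]:
    """Pull (concept, value, context, unit) facts from an XBRL instance."""
    root = ET.fromstring(instance_xml)
    facts = []
    for el in root.iter():
        ctx = el.get("contextRef")
        if ctx is None:  # contexts and units themselves carry no contextRef
            continue
        concept = el.tag.split("}")[-1]  # strip the namespace URI
        facts.append({
            "concept": concept,
            "value": el.text,
            "context": ctx,
            "unit": el.get("unitRef"),
        })
    return facts

print(extract_facts(INSTANCE))
```

Each extracted fact is the (concept, value, period, unit, context) tuple the pipeline needs; resolving `FY2025` to actual dates and `Revenues` to its place in the US GAAP hierarchy are the later steps.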
PDF Table Extraction
- Layout detection — identify table regions on each page
- Column and row detection — find grid lines, aligned text, and cell boundaries
- Header identification — distinguish column headers from data rows (including multi-row headers)
- Cell extraction — extract text from each cell, handling:
  - Parentheses for negative numbers: (1,234) → -1234
  - Dash or em-dash for zero: — → 0
  - Percentage signs: 12.5% → 0.125
  - Currency symbols: $1,234 → 1234 (USD)
- Multi-page table continuation — detect when a table spans pages and merge correctly
- Footnote reference extraction — identify superscript markers and link to footnote text
Normalization
Financial statement line items vary in presentation:
| Company A | Company B | Normalized |
|---|---|---|
| Net revenues | Revenue | revenue |
| Cost of goods sold | Cost of revenue | cost_of_revenue |
| Selling, general and admin | SG&A expenses | sg_and_a |
| Net income (loss) | Net earnings | net_income |
Normalization maps these variations to a standard chart of accounts. This requires:
- A mapping dictionary (built from domain expertise)
- Fuzzy matching for novel presentations
- Context awareness (the same label can mean different things on different statements)
Labeling for AI Use Cases
Financial Analysis Automation
Label type: Line item classification
```json
{"text": "Depreciation and amortization", "label": "depreciation_amortization", "statement": "income_statement", "subtotal_parent": "operating_expenses"}
```
Training data: thousands of examples mapping varied line item descriptions to standardized categories.
Anomaly Detection
Label type: Normal vs. anomalous patterns
```json
{"company": "ANON_001", "metric": "gross_margin", "period": "2025-Q3", "value": 0.12, "historical_avg": 0.34, "label": "anomaly", "severity": "high"}
```
Training data: historical financial data with labeled anomalies (unusual fluctuations, errors, restatements).
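For bootstrapping labels of this shape, a simple z-score rule against each metric's own history is a common starting point. This is a sketch under that assumption, not the full labeling process, which would add seasonality, peer comparison, and human review:

```python
from statistics import mean, stdev

def label_metric(history: list[float], value: float,
                 z_threshold: float = 2.0) -> dict:
    """Label one metric observation against its own history.

    Flags the observation as an anomaly when it sits more than
    z_threshold standard deviations from the historical mean.
    """
    avg = mean(history)
    sd = stdev(history)
    z = (value - avg) / sd if sd else 0.0
    is_anomaly = abs(z) > z_threshold
    return {
        "value": value,
        "historical_avg": round(avg, 4),
        "label": "anomaly" if is_anomaly else "normal",
        # Crude severity rule: double the threshold counts as high.
        "severity": ("high" if abs(z) > 2 * z_threshold else "medium")
                    if is_anomaly else None,
    }

gross_margins = [0.34, 0.35, 0.33, 0.34, 0.36]
print(label_metric(gross_margins, 0.12))  # far below history: anomaly, high
```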
Report Generation
Label type: Text-to-data and data-to-text pairs
```json
{"financials": {"revenue": 45000000, "revenue_growth": 0.15, "gross_margin": 0.62}, "narrative": "Revenue increased 15% year-over-year to $45 million, driven by..."}
```
Training data: pairs of financial data and the human-written narratives that describe them.
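One way to bootstrap pairs of this shape is to template narratives from the structured side. This is a hedged sketch of that augmentation idea only; the human-written narratives described above remain the gold-standard targets:

```python
def render_narrative(fin: dict) -> str:
    """Template a data-to-text target from structured financials.

    Templated 'silver' pairs can seed a dataset, but they lack the
    causal detail ("driven by...") that human narratives carry.
    """
    return (f"Revenue increased {fin['revenue_growth']:.0%} "
            f"year-over-year to ${fin['revenue'] / 1e6:.0f} million.")

pair = {
    "financials": {"revenue": 45_000_000, "revenue_growth": 0.15,
                   "gross_margin": 0.62},
    "narrative": render_narrative({"revenue": 45_000_000,
                                   "revenue_growth": 0.15}),
}
print(pair["narrative"])
```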
Ratio Analysis
Label type: Calculated ratios with interpretive labels
```json
{"current_ratio": 0.85, "industry_avg": 1.5, "interpretation": "below_industry_norm", "risk_flag": true}
```
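Records of this shape are straightforward to derive once balance-sheet line items are normalized. A sketch with illustrative thresholds (real benchmarks come from industry datasets and analyst judgment):

```python
def label_current_ratio(current_assets: float, current_liabilities: float,
                        industry_avg: float) -> dict:
    """Compute the current ratio and attach interpretive labels.

    The 20% bands around the industry average are an assumption
    for illustration, not an accounting standard.
    """
    ratio = round(current_assets / current_liabilities, 2)
    if ratio < industry_avg * 0.8:
        interpretation = "below_industry_norm"
    elif ratio > industry_avg * 1.2:
        interpretation = "above_industry_norm"
    else:
        interpretation = "within_industry_norm"
    return {
        "current_ratio": ratio,
        "industry_avg": industry_avg,
        "interpretation": interpretation,
        "risk_flag": ratio < 1.0,  # current liabilities exceed current assets
    }

print(label_current_ratio(850_000, 1_000_000, industry_avg=1.5))
```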
Quality Challenges
Restatements and Corrections
Financial statements get restated. Original filings may contain errors corrected in subsequent filings. Training data should:
- Use the most recent version of each filing
- Flag restated periods (the original error and correction are both useful training signals for anomaly detection)
- Track which version of each statement was used
GAAP vs. Non-GAAP
Many companies report non-GAAP metrics alongside GAAP figures. Training data must distinguish between them — a model trained on a mix of GAAP and non-GAAP data without labels will produce unreliable outputs.
Consolidation Complexity
Consolidated financial statements combine multiple entities with elimination entries. Segment-level data may not reconcile to consolidated totals due to intersegment eliminations and corporate allocations.
Why On-Premise
Financial statement data for AI training involves:
- Client confidential information (for accounting firm data)
- Material non-public information (for pre-release financials)
- Competitive intelligence (financial performance data)
- Regulatory obligations (SOX, PCAOB, SEC)
Processing this data on cloud services creates unnecessary risk. On-premise platforms like Ertas Data Suite keep the entire pipeline local — extraction, normalization, labeling, and export all happen on your infrastructure. Financial professionals can label data directly through the desktop interface, and the complete audit trail satisfies regulatory documentation requirements.
Financial AI starts with financial data, prepared by financial professionals, on infrastructure you control.