    Training AI on Financial Statements: Data Extraction and Labeling On-Premise


    How to extract and label financial statement data for AI training — parsing XBRL, extracting tables from PDFs, handling format variation, and building classification models for financial analysis.

    Ertas Team

    Financial statements are among the most structured documents in business — yet converting them into AI training data is surprisingly difficult. Varied presentation formats, nested table structures, cross-references between statements and notes, and the domain-specific meaning of line items create extraction and labeling challenges that generic document AI tools don't handle well.

    This guide covers the practical pipeline for turning financial statement PDFs and XBRL filings into labeled training datasets — on-premise, for use cases like automated financial analysis, anomaly detection, and report generation.

    Financial Statement Data Sources

    SEC Filings (XBRL/iXBRL)

    Public company filings are available in structured XBRL (eXtensible Business Reporting Language):

    • Advantage: Machine-readable with standardized taxonomy tags
    • Challenge: XBRL extensions introduce filer-specific custom tags, taxonomy versions change over time, and filing software renders the same data differently
    • What you get: Structured facts (Revenue = $X for period Y with unit Z) linked to US GAAP or IFRS taxonomy concepts

    PDF Financial Statements

    Private companies, international filings, and many reports exist only as PDFs:

    • Advantage: Visual layout preserves human-readable formatting
    • Challenge: Table extraction from PDFs is unreliable — merged cells, spanning headers, footnote references, and multi-page tables all cause problems
    • What you get: Raw text and table structures that need significant processing

    Audit/Compilation Software Exports

    Many financial statements originate in accounting software (Caseware, Workiva, CCH):

    • Advantage: Structured data at the source
    • Challenge: Export formats are proprietary and vary between software versions
    • What you get: Structured data that needs format normalization

    The Extraction Pipeline

    XBRL Processing

    1. Parse XBRL instance documents to extract facts (concept, value, period, unit, context)
    2. Resolve taxonomy references — map each fact to the US GAAP or IFRS taxonomy hierarchy
    3. Handle extensions — custom tags created by filers need to be mapped to standard concepts or flagged
    4. Build financial statement structure — reconstruct the balance sheet, income statement, and cash flow statement from individual facts
    5. Handle dimensional data — segment reporting, geographic breakdowns, and product line data use XBRL dimensions
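    Step 1 can be sketched with nothing but the standard library. The instance document and taxonomy namespace below are a toy example for illustration, not a real filing; real instance documents are far larger and need the later steps (taxonomy resolution, extensions, dimensions) on top of this:

    ```python
    import xml.etree.ElementTree as ET

    XBRLI = "http://www.xbrl.org/2003/instance"
    GAAP = "http://fasb.org/us-gaap/2024"  # taxonomy namespace varies by year

    # Toy instance document for illustration only.
    INSTANCE = f"""<?xml version="1.0"?>
    <xbrl xmlns="{XBRLI}" xmlns:us-gaap="{GAAP}">
      <context id="FY2025">
        <period><startDate>2025-01-01</startDate><endDate>2025-12-31</endDate></period>
      </context>
      <unit id="usd"><measure>iso4217:USD</measure></unit>
      <us-gaap:Revenues contextRef="FY2025" unitRef="usd" decimals="-3">45000000</us-gaap:Revenues>
    </xbrl>"""

    def extract_facts(xml_text):
        """Return (concept, value, contextRef, unitRef) for every element
        that carries a contextRef, i.e. every reported fact."""
        root = ET.fromstring(xml_text)
        facts = []
        for el in root.iter():
            ctx = el.get("contextRef")
            if ctx is not None:
                concept = el.tag.split("}")[-1]  # strip the namespace URI
                facts.append((concept, el.text, ctx, el.get("unitRef")))
        return facts

    print(extract_facts(INSTANCE))
    # → [('Revenues', '45000000', 'FY2025', 'usd')]
    ```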

    PDF Table Extraction

    1. Layout detection — identify table regions on each page
    2. Column and row detection — find grid lines, aligned text, and cell boundaries
    3. Header identification — distinguish column headers from data rows (including multi-row headers)
    4. Cell extraction — extract text from each cell, handling:
      • Parentheses for negative numbers: (1,234) → -1234
      • Dash or em-dash for zero: — → 0
      • Percentage signs: 12.5% → 0.125
      • Currency symbols: $1,234 → 1234 (USD)
    5. Multi-page table continuation — detect when a table spans pages and merge correctly
    6. Footnote reference extraction — identify superscript markers and link to footnote text
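    The cell-value rules in step 4 are mechanical enough to sketch directly. This stdlib-only function is a starting point, not a complete parser; the `scale` parameter is an assumed hook for "amounts in thousands" headers:

    ```python
    def normalize_cell(raw, *, scale=1):
        """Convert a financial-table cell string into a float.
        Returns None for non-numeric cells."""
        s = raw.strip()
        if s in {"—", "–", "-", ""}:  # dash/em-dash conventionally means zero
            return 0.0 if s else None
        negative = s.startswith("(") and s.endswith(")")  # (1,234) → negative
        if negative:
            s = s[1:-1]
        percent = s.endswith("%")
        if percent:
            s = s[:-1]
        s = s.replace("$", "").replace(",", "").strip()
        try:
            value = float(s)
        except ValueError:
            return None
        if negative:
            value = -value
        if percent:
            value /= 100
        return value * scale

    assert normalize_cell("(1,234)") == -1234
    assert normalize_cell("—") == 0
    assert normalize_cell("12.5%") == 0.125
    assert normalize_cell("$1,234") == 1234
    ```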

    Normalization

    Financial statement line items vary in presentation:

    Company A                    Company B           Normalized
    Net revenues                 Revenue             revenue
    Cost of goods sold           Cost of revenue     cost_of_revenue
    Selling, general and admin   SG&A expenses       sg_and_a
    Net income (loss)            Net earnings        net_income

    Normalization maps these variations to a standard chart of accounts. This requires:

    • A mapping dictionary (built from domain expertise)
    • Fuzzy matching for novel presentations
    • Context awareness (the same label can mean different things on different statements)
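    The dictionary-plus-fuzzy-matching approach can be sketched with `difflib` from the standard library. The mapping below is a toy excerpt of what a domain-expert-built dictionary would contain, and the 0.8 cutoff is an assumed threshold:

    ```python
    from difflib import get_close_matches

    # Toy excerpt; a production dictionary covers thousands of variants.
    LINE_ITEM_MAP = {
        "net revenues": "revenue",
        "revenue": "revenue",
        "cost of goods sold": "cost_of_revenue",
        "cost of revenue": "cost_of_revenue",
        "selling, general and admin": "sg_and_a",
        "sg&a expenses": "sg_and_a",
        "net income (loss)": "net_income",
        "net earnings": "net_income",
    }

    def normalize_line_item(label, cutoff=0.8):
        """Exact lookup first, then fuzzy match against known variants.
        Returns None when nothing is close enough (route to human review)."""
        key = label.strip().lower()
        if key in LINE_ITEM_MAP:
            return LINE_ITEM_MAP[key]
        close = get_close_matches(key, LINE_ITEM_MAP.keys(), n=1, cutoff=cutoff)
        return LINE_ITEM_MAP[close[0]] if close else None

    assert normalize_line_item("Net Revenues") == "revenue"
    assert normalize_line_item("Net revenue") == "revenue"  # fuzzy hit
    ```

    Note that fuzzy matching alone cannot supply the context awareness mentioned above; statement type and position need to be part of the lookup key in a real pipeline.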

    Labeling for AI Use Cases

    Financial Analysis Automation

    Label type: Line item classification

    {"text": "Depreciation and amortization", "label": "depreciation_amortization", "statement": "income_statement", "subtotal_parent": "operating_expenses"}
    

    Training data: thousands of examples mapping varied line item descriptions to standardized categories.

    Anomaly Detection

    Label type: Normal vs. anomalous patterns

    {"company": "ANON_001", "metric": "gross_margin", "period": "2025-Q3", "value": 0.12, "historical_avg": 0.34, "label": "anomaly", "severity": "high"}
    

    Training data: historical financial data with labeled anomalies (unusual fluctuations, errors, restatements).
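    One simple way to produce such labels is a z-score of each observation against the metric's own history. This stdlib sketch assumes a fixed threshold and is far cruder than a production anomaly model, but it illustrates the record shape:

    ```python
    from statistics import mean, stdev

    def label_anomaly(value, history, z_threshold=3.0):
        """Label a metric observation against its historical values."""
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            return {"label": "normal", "z": 0.0, "historical_avg": mu}
        z = (value - mu) / sigma
        label = "anomaly" if abs(z) > z_threshold else "normal"
        return {"label": label, "z": round(z, 2), "historical_avg": round(mu, 4)}

    # A gross margin of 12% against a steady ~34% history gets flagged.
    rec = label_anomaly(0.12, [0.33, 0.35, 0.34, 0.34])
    assert rec["label"] == "anomaly"
    ```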

    Report Generation

    Label type: Text-to-data and data-to-text pairs

    {"financials": {"revenue": 45000000, "revenue_growth": 0.15, "gross_margin": 0.62}, "narrative": "Revenue increased 15% year-over-year to $45 million, driven by..."}
    

    Training data: pairs of financial data and the human-written narratives that describe them.
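    Label quality matters for these pairs. A cheap consistency check, sketched below under the assumption that any percentage mentioned in the narrative should appear as a ratio field in the structured data, can catch mismatched pairs before they reach training:

    ```python
    import re

    def pair_is_consistent(financials, narrative):
        """Every percentage in the narrative should match some ratio-valued
        field (between 0 and 1) in the structured data."""
        mentioned = {round(float(m) / 100, 4)
                     for m in re.findall(r"(\d+(?:\.\d+)?)\s*%", narrative)}
        available = {round(v, 4) for v in financials.values() if 0 < v < 1}
        return mentioned <= available

    fin = {"revenue": 45_000_000, "revenue_growth": 0.15, "gross_margin": 0.62}
    assert pair_is_consistent(fin, "Revenue increased 15% year-over-year to $45 million.")
    assert not pair_is_consistent(fin, "Margins reached 70% this quarter.")
    ```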

    Ratio Analysis

    Label type: Calculated ratios with interpretive labels

    {"current_ratio": 0.85, "industry_avg": 1.5, "interpretation": "below_industry_norm", "risk_flag": true}
    
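    A sketch of how such records might be produced. The 20% band around the industry average and the risk threshold of 1.0 are illustrative choices, not accounting standards:

    ```python
    def ratio_record(current_assets, current_liabilities, industry_avg=1.5):
        """Build a labeled current-ratio record like the example above."""
        ratio = round(current_assets / current_liabilities, 2)
        if ratio < industry_avg * 0.8:
            interpretation = "below_industry_norm"
        elif ratio > industry_avg * 1.2:
            interpretation = "above_industry_norm"
        else:
            interpretation = "within_industry_norm"
        return {
            "current_ratio": ratio,
            "industry_avg": industry_avg,
            "interpretation": interpretation,
            "risk_flag": ratio < 1.0,  # current liabilities exceed current assets
        }

    rec = ratio_record(850_000, 1_000_000)
    # → {'current_ratio': 0.85, 'industry_avg': 1.5,
    #    'interpretation': 'below_industry_norm', 'risk_flag': True}
    ```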

    Quality Challenges

    Restatements and Corrections

    Financial statements get restated. Original filings may contain errors corrected in subsequent filings. Training data should:

    • Use the most recent version of each filing
    • Flag restated periods (the original error and correction are both useful training signals for anomaly detection)
    • Track which version of each statement was used

    GAAP vs. Non-GAAP

    Many companies report non-GAAP metrics alongside GAAP figures. Training data must distinguish between them — a model trained on a mix of GAAP and non-GAAP data without labels will produce unreliable outputs.

    Consolidation Complexity

    Consolidated financial statements combine multiple entities with elimination entries. Segment-level data may not reconcile to consolidated totals due to intersegment eliminations and corporate allocations.
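    A reconciliation check like the following can flag segment data that doesn't tie to consolidated totals before it enters a training set; the tolerance value is an assumption to absorb rounding in "amounts in thousands" presentations:

    ```python
    def reconciles(segments, eliminations, consolidated, tolerance=1.0):
        """Check that segment totals plus intersegment eliminations tie
        to the consolidated figure, within a rounding tolerance."""
        return abs(sum(segments.values()) + eliminations - consolidated) <= tolerance

    segs = {"north_america": 30_000_000, "europe": 18_000_000}
    assert reconciles(segs, eliminations=-3_000_000, consolidated=45_000_000)
    ```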

    Why On-Premise

    Financial statement data for AI training involves:

    • Client confidential information (for accounting firm data)
    • Material non-public information (for pre-release financials)
    • Competitive intelligence (financial performance data)
    • Regulatory obligations (SOX, PCAOB, SEC)

    Processing this data on cloud services creates unnecessary risk. On-premise platforms like Ertas Data Suite keep the entire pipeline local — extraction, normalization, labeling, and export all happen on your infrastructure. Financial professionals can label data directly through the desktop interface, and the complete audit trail satisfies regulatory documentation requirements.

    Financial AI starts with financial data, prepared by financial professionals, on infrastructure you control.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
