
    How to Audit Your Unstructured Data for AI Potential

    A practical guide to assessing your enterprise's unstructured data for AI readiness — inventorying file types, estimating labeling effort, identifying PII, and evaluating document quality.

Ertas Team

    Before you select a model, hire an ML engineer, or buy a GPU, you need to answer one question: is your data usable for AI?

    An unstructured data audit is the process of systematically evaluating what you have, assessing its quality, estimating the effort to prepare it, and identifying blockers. This guide provides a practical framework for conducting the audit — something you can complete in 1-2 weeks with existing staff.

    Phase 1: Inventory (Days 1-3)

    Locate All Data Sources

    Enterprise data lives in more places than anyone expects:

    • Network file servers and NAS devices
    • SharePoint / OneDrive / Google Drive
    • Email archives (Exchange, Gmail)
• Document management systems (M-Files, OpenText)
    • Line-of-business applications (ERP, CRM, HRIS)
    • Physical paper archives (yes, still)
    • Individual hard drives and local storage
    • Legacy systems scheduled for decommissioning

    Catalog by Type

For each source, count documents by type; a scripted starting point follows the table:

| Document Type | Count | Format | Digital/Scanned | Estimated Size |
| --- | --- | --- | --- | --- |
| Contracts | 12,400 | PDF | 70% digital / 30% scanned | 45 GB |
| Invoices | 89,000 | PDF, TIFF | 40% digital / 60% scanned | 120 GB |
| Reports | 3,200 | Word, PDF | 95% digital | 8 GB |
| Emails | 450,000 | MSG, EML | 100% digital | 65 GB |
| Spreadsheets | 15,600 | Excel, CSV | 100% digital | 12 GB |
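If your sources are reachable as mounted file shares, the first pass of this catalog can be scripted. A minimal sketch, assuming documents sit on a mounted path; the extension-to-type mapping and the mount point are illustrative, and this won't reach sources like email servers or line-of-business applications:

```python
from collections import defaultdict
from pathlib import Path

# Illustrative mapping from file extension to document type;
# adjust to match your own catalog categories.
TYPE_MAP = {
    ".pdf": "PDF document", ".tif": "Scanned image", ".tiff": "Scanned image",
    ".docx": "Word document", ".msg": "Email", ".eml": "Email",
    ".xlsx": "Spreadsheet", ".csv": "Spreadsheet",
}

def inventory(root: str) -> None:
    """Walk a file share and tally document counts and sizes by type."""
    counts = defaultdict(int)
    sizes = defaultdict(int)  # bytes per document type
    for path in Path(root).rglob("*"):
        if path.is_file():
            doc_type = TYPE_MAP.get(path.suffix.lower(), "Other")
            counts[doc_type] += 1
            sizes[doc_type] += path.stat().st_size
    for doc_type in sorted(counts):
        gb = sizes[doc_type] / 1e9
        print(f"{doc_type:<16} {counts[doc_type]:>9,} files  {gb:8.2f} GB")

inventory("/mnt/fileserver")  # hypothetical mount point
```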

    Assess Volume

    • Total documents and total size
    • Growth rate (how much new data accumulates per month/year?)
    • Historical depth (how far back does the archive go?)
    • Coverage (are there gaps in the archive — missing years, departments, or document types?)
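Growth rate, historical depth, and coverage gaps can be roughed out from file modification times. A sketch, with a loud caveat: migrations and bulk copies reset mtimes, so treat spikes and gaps as prompts for questions, not as ground truth.

```python
from collections import Counter
from datetime import datetime
from pathlib import Path

def volume_by_year(root: str) -> Counter:
    """Tally file counts by modification year as a rough proxy for
    growth rate and historical depth."""
    years = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            years[datetime.fromtimestamp(path.stat().st_mtime).year] += 1
    return years

for year, count in sorted(volume_by_year("/mnt/fileserver").items()):
    print(f"{year}: {count:,} files")
```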

    Phase 2: Quality Assessment (Days 4-7)

    Sample Selection

Don't try to assess everything. Pull a representative sample (a scripted sketch of the pull follows this list):

    • 100-500 documents across document types and time periods
    • Include documents from different sources and departments
    • Include both digital-native and scanned documents
    • Weight the sample toward the document types most relevant to your AI use case
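A minimal sketch of such a stratified pull, assuming you have already grouped document paths by type; the weights parameter is how you bias the sample toward the types that matter for your use case:

```python
import random

def stratified_sample(docs_by_type, n_total=300, weights=None, seed=42):
    """Draw a quality-assessment sample stratified by document type.

    docs_by_type: dict mapping doc_type -> list of document paths
    weights: optional dict of relative weights for oversampling the
             types most relevant to the target AI use case
    Rounding means the result may be a few documents off n_total.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    weights = weights or {}
    total = sum(weights.get(t, 1.0) for t in docs_by_type)
    sample = []
    for doc_type, paths in docs_by_type.items():
        k = round(n_total * weights.get(doc_type, 1.0) / total)
        sample.extend(rng.sample(paths, min(k, len(paths))))
    return sample

# Example: weight contracts 3x relative to every other type.
# sample = stratified_sample(docs_by_type, weights={"contracts": 3.0})
```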

    Quality Dimensions

    Extraction Quality: Can content be reliably extracted?

    • Digital PDFs: text extraction confidence (usually high)
    • Scanned documents: OCR quality (depends on scan quality, resolution, document age)
    • Tables: Can table structures be preserved during extraction?
    • Images: Are embedded images relevant and extractable?

    Score each sample document: High / Medium / Low extraction quality.
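For digital PDFs, a rough High/Medium/Low score can be automated as a first pass before human review. A sketch using the pypdf library; the thresholds and the "clean characters" heuristic are assumptions to tune against your own sample, not a standard:

```python
from pypdf import PdfReader  # pip install pypdf

# Characters we expect in clean business text; a large share of
# anything else suggests a damaged or OCR-garbled text layer.
ALLOWED_PUNCT = set(".,;:!?()'\"-/%&$@#*+=[]")

def extraction_score(pdf_path: str) -> str:
    """Heuristic High/Medium/Low extraction-quality score for one PDF."""
    reader = PdfReader(pdf_path)
    text = "".join(page.extract_text() or "" for page in reader.pages)
    chars_per_page = len(text) / max(len(reader.pages), 1)
    if chars_per_page < 100:  # near-empty text layer: likely a scan needing OCR
        return "Low"
    clean = sum(c.isalnum() or c.isspace() or c in ALLOWED_PUNCT for c in text)
    return "High" if clean / len(text) > 0.97 else "Medium"
```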

    Completeness: Does each document contain the information needed?

    • Are required fields populated?
    • Are sections complete or truncated?
    • Are attachments and appendices included?

    Consistency: How much does format vary?

    • Same document type from different sources — how similar is the structure?
    • How many format variations exist for each document type?
    • Are naming conventions consistent enough for automated classification?

    Relevance: How much of the data actually relates to the target AI use case?

    • What percentage of documents are directly useful?
    • What percentage are tangentially useful (provide context but not training signal)?
    • What percentage are irrelevant (can be excluded)?

    Quality Summary

    Produce a quality scorecard:

| Document Type | Extraction | Completeness | Consistency | Relevance | Overall |
| --- | --- | --- | --- | --- | --- |
| Contracts | High | High | Medium | High | Good |
| Invoices | Medium | High | Low | Medium | Fair |
| Legacy reports | Low | Medium | Low | High | Needs work |
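The Overall column is a judgment call, but it helps to make the rubric explicit. One illustrative rubric, which happens to reproduce the sample scorecard above; the double weight on extraction is an assumption, chosen because nothing downstream works if extraction fails:

```python
SCORE = {"High": 3, "Medium": 2, "Low": 1}

def overall_rating(extraction, completeness, consistency, relevance):
    """Collapse per-dimension High/Medium/Low scores into an overall
    rating, weighting extraction double."""
    avg = (2 * SCORE[extraction] + SCORE[completeness]
           + SCORE[consistency] + SCORE[relevance]) / 5
    if avg >= 2.5:
        return "Good"
    if avg >= 2.0:
        return "Fair"
    return "Needs work"

print(overall_rating("High", "High", "Medium", "High"))  # Good
```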

    Phase 3: Compliance Assessment (Days 8-9)

    PII/PHI Identification

    Sample documents for sensitive data:

    • Personal names, addresses, phone numbers, email addresses
    • Social Security numbers, tax IDs, account numbers
    • Medical information (diagnoses, treatments, prescriptions)
    • Financial information (income, credit, account balances)
    • Biometric data (photos with identifiable faces)

    Estimate PII density: what percentage of documents contain PII, and how much per document?
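A regex scan over already-extracted text gives a cheap first estimate of PII density. The patterns below are illustrative and US-centric; regexes miss names, addresses, and anything context-dependent, so for real compliance work pair a purpose-built detector such as Microsoft Presidio with human review.

```python
import re

# Illustrative, US-centric patterns only.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pii_density(texts):
    """Given extracted document texts, return (share of documents with
    at least one PII hit, average hits per document)."""
    docs_with_pii = 0
    total_hits = 0
    for text in texts:
        hits = sum(len(p.findall(text)) for p in PII_PATTERNS.values())
        docs_with_pii += hits > 0
        total_hits += hits
    n = max(len(texts), 1)
    return docs_with_pii / n, total_hits / n
```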

    Regulatory Mapping

    Based on PII findings and industry, identify applicable regulations:

    • GDPR (EU data subjects)
    • HIPAA (health information)
    • EU AI Act (high-risk AI systems)
    • Industry-specific (SOX, PCAOB, ITAR, etc.)
    • State/regional privacy laws

    Processing Constraints

    • Can data leave the building? (Air-gapped requirements?)
    • Who can access the data? (Clearance, need-to-know, professional privilege?)
    • What audit trail is required?
    • What are the data retention and destruction obligations?

    Phase 4: Effort Estimation (Days 10-12)

    Ingestion Effort

    Based on quality assessment:

    • High-quality digital documents: Fast (batch processing)
    • Mixed quality: Moderate (some manual review of extraction results)
    • Low-quality scanned documents: Slow (OCR quality review, manual correction)

    Labeling Effort

    Estimate based on:

    • Number of records to label
    • Complexity of labeling schema (binary classification vs. multi-label vs. entity extraction)
    • Domain expertise required (generalist vs. specialist)
    • Estimated time per record (10 seconds for simple classification, 2-5 minutes for complex annotation)
    • Review cycles (typically 2-3 passes for quality)

    Example: 10,000 documents × 2 minutes per document × 2 review cycles = ~670 hours of labeling effort.
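The same arithmetic as a small helper, so you can rerun it as estimates change; the parameters mirror the list above:

```python
def labeling_hours(n_docs, minutes_per_doc, review_cycles=2):
    """Labeling effort in hours: documents x minutes per pass x passes."""
    return n_docs * minutes_per_doc * review_cycles / 60

print(f"{labeling_hours(10_000, 2):.0f} hours")  # 667, the ~670 above
```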

    Timeline

    Produce a realistic timeline:

| Phase | Effort | Duration |
| --- | --- | --- |
| Ingestion | X documents | Y weeks |
| Cleaning | Z records | W weeks |
| Labeling | N records | M weeks |
| Quality review | N records | P weeks |
| Export | - | 1 week |

    Phase 5: Recommendations (Days 13-14)

    Go / No-Go Assessment

    Based on the audit, recommend one of:

    • Proceed: Data quality and volume support the AI use case. Define scope and timeline.
    • Proceed with caveats: Data is usable but requires significant preparation. Budget accordingly.
    • Defer: Data quality or volume is insufficient. Invest in data collection or improvement before starting an AI project.
    • Pivot: The intended use case doesn't match the available data. Consider alternative use cases that better fit what you have.

    Priority Ranking

    If multiple AI use cases are being considered, rank them by data readiness — the use case with the most ready data should go first, regardless of which use case seems most valuable on paper.

    The Audit Deliverable

    Produce a concise document (5-10 pages) covering:

    1. Data inventory summary
    2. Quality assessment by document type
    3. Compliance requirements and constraints
    4. Effort and timeline estimates
    5. Go/no-go recommendation with rationale

    This document becomes the foundation for your AI data preparation project plan. Without it, you're planning blind.

    When you're ready to move from audit to preparation, platforms like Ertas Data Suite handle the full pipeline — ingestion, cleaning, labeling, augmentation, and export — on-premise, with the audit trail and compliance documentation built in. But the audit comes first. Know your data before you try to prepare it.

