
    How to Audit Your Unstructured Data for AI Potential

    A practical guide to assessing your enterprise's unstructured data for AI readiness — inventorying file types, estimating labeling effort, identifying PII, and evaluating document quality.

Ertas Team

    Before you select a model, hire an ML engineer, or buy a GPU, you need to answer one question: is your data usable for AI?

    An unstructured data audit is the process of systematically evaluating what you have, assessing its quality, estimating the effort to prepare it, and identifying blockers. This guide provides a practical framework for conducting the audit — something you can complete in 1-2 weeks with existing staff.

    Phase 1: Inventory (Days 1-3)

    Locate All Data Sources

    Enterprise data lives in more places than anyone expects:

    • Network file servers and NAS devices
    • SharePoint / OneDrive / Google Drive
    • Email archives (Exchange, Gmail)
• Document management systems (M-Files, OpenText)
    • Line-of-business applications (ERP, CRM, HRIS)
    • Physical paper archives (yes, still)
    • Individual hard drives and local storage
    • Legacy systems scheduled for decommissioning

    Catalog by Type

For each source, count documents by type; a scripted starting point follows the table:

| Document Type | Count | Format | Digital/Scanned | Estimated Size |
| --- | --- | --- | --- | --- |
| Contracts | 12,400 | PDF | 70% digital / 30% scanned | 45 GB |
| Invoices | 89,000 | PDF, TIFF | 40% digital / 60% scanned | 120 GB |
| Reports | 3,200 | Word, PDF | 95% digital | 8 GB |
| Emails | 450,000 | MSG, EML | 100% digital | 65 GB |
| Spreadsheets | 15,600 | Excel, CSV | 100% digital | 12 GB |
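If your sources are reachable as mounted file shares, the first pass of this catalog can be scripted. A minimal sketch, assuming documents sit on a mounted path; the extension-to-type mapping and the mount point are illustrative, and this won't reach sources like email servers or line-of-business applications:

```python
from collections import defaultdict
from pathlib import Path

# Illustrative mapping from file extension to document type;
# adjust to match your own catalog categories.
TYPE_MAP = {
    ".pdf": "PDF document", ".tif": "Scanned image", ".tiff": "Scanned image",
    ".docx": "Word document", ".msg": "Email", ".eml": "Email",
    ".xlsx": "Spreadsheet", ".csv": "Spreadsheet",
}

def inventory(root: str) -> None:
    """Walk a file share and tally document counts and sizes by type."""
    counts = defaultdict(int)
    sizes = defaultdict(int)  # bytes per document type
    for path in Path(root).rglob("*"):
        if path.is_file():
            doc_type = TYPE_MAP.get(path.suffix.lower(), "Other")
            counts[doc_type] += 1
            sizes[doc_type] += path.stat().st_size
    for doc_type in sorted(counts):
        gb = sizes[doc_type] / 1e9
        print(f"{doc_type:<16} {counts[doc_type]:>9,} files  {gb:8.2f} GB")

inventory("/mnt/fileserver")  # hypothetical mount point
```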

    Assess Volume

    • Total documents and total size
    • Growth rate (how much new data accumulates per month/year?)
    • Historical depth (how far back does the archive go?)
    • Coverage (are there gaps in the archive — missing years, departments, or document types?)
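Growth rate, historical depth, and coverage gaps can be roughed out from file modification times. A sketch, with a loud caveat: migrations and bulk copies reset mtimes, so treat spikes and gaps as prompts for questions, not as ground truth.

```python
from collections import Counter
from datetime import datetime
from pathlib import Path

def volume_by_year(root: str) -> Counter:
    """Tally file counts by modification year as a rough proxy for
    growth rate and historical depth."""
    years = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            years[datetime.fromtimestamp(path.stat().st_mtime).year] += 1
    return years

for year, count in sorted(volume_by_year("/mnt/fileserver").items()):
    print(f"{year}: {count:,} files")
```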

    Phase 2: Quality Assessment (Days 4-7)

    Sample Selection

Don't try to assess everything. Pull a representative sample (a scripted sketch of the pull follows this list):

    • 100-500 documents across document types and time periods
    • Include documents from different sources and departments
    • Include both digital-native and scanned documents
    • Weight the sample toward the document types most relevant to your AI use case
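A minimal sketch of such a stratified pull, assuming you have already grouped document paths by type; the weights parameter is how you bias the sample toward the types that matter for your use case:

```python
import random

def stratified_sample(docs_by_type, n_total=300, weights=None, seed=42):
    """Draw a quality-assessment sample stratified by document type.

    docs_by_type: dict mapping doc_type -> list of document paths
    weights: optional dict of relative weights for oversampling the
             types most relevant to the target AI use case
    Rounding means the result may be a few documents off n_total.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    weights = weights or {}
    total = sum(weights.get(t, 1.0) for t in docs_by_type)
    sample = []
    for doc_type, paths in docs_by_type.items():
        k = round(n_total * weights.get(doc_type, 1.0) / total)
        sample.extend(rng.sample(paths, min(k, len(paths))))
    return sample

# Example: weight contracts 3x relative to every other type.
# sample = stratified_sample(docs_by_type, weights={"contracts": 3.0})
```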

    Quality Dimensions

    Extraction Quality: Can content be reliably extracted?

    • Digital PDFs: text extraction confidence (usually high)
    • Scanned documents: OCR quality (depends on scan quality, resolution, document age)
    • Tables: Can table structures be preserved during extraction?
    • Images: Are embedded images relevant and extractable?

    Score each sample document: High / Medium / Low extraction quality.
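For digital PDFs, a rough High/Medium/Low score can be automated as a first pass before human review. A sketch using the pypdf library; the thresholds and the "clean characters" heuristic are assumptions to tune against your own sample, not a standard:

```python
from pypdf import PdfReader  # pip install pypdf

# Characters we expect in clean business text; a large share of
# anything else suggests a damaged or OCR-garbled text layer.
ALLOWED_PUNCT = set(".,;:!?()'\"-/%&$@#*+=[]")

def extraction_score(pdf_path: str) -> str:
    """Heuristic High/Medium/Low extraction-quality score for one PDF."""
    reader = PdfReader(pdf_path)
    text = "".join(page.extract_text() or "" for page in reader.pages)
    chars_per_page = len(text) / max(len(reader.pages), 1)
    if chars_per_page < 100:  # near-empty text layer: likely a scan needing OCR
        return "Low"
    clean = sum(c.isalnum() or c.isspace() or c in ALLOWED_PUNCT for c in text)
    return "High" if clean / len(text) > 0.97 else "Medium"
```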

    Completeness: Does each document contain the information needed?

    • Are required fields populated?
    • Are sections complete or truncated?
    • Are attachments and appendices included?

    Consistency: How much does format vary?

    • Same document type from different sources — how similar is the structure?
    • How many format variations exist for each document type?
    • Are naming conventions consistent enough for automated classification?

    Relevance: How much of the data actually relates to the target AI use case?

    • What percentage of documents are directly useful?
    • What percentage are tangentially useful (provide context but not training signal)?
    • What percentage are irrelevant (can be excluded)?

    Quality Summary

    Produce a quality scorecard:

| Document Type | Extraction | Completeness | Consistency | Relevance | Overall |
| --- | --- | --- | --- | --- | --- |
| Contracts | High | High | Medium | High | Good |
| Invoices | Medium | High | Low | Medium | Fair |
| Legacy reports | Low | Medium | Low | High | Needs work |
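The Overall column is a judgment call, but it helps to make the rubric explicit. One illustrative rubric, which happens to reproduce the sample scorecard above; the double weight on extraction is an assumption, chosen because nothing downstream works if extraction fails:

```python
SCORE = {"High": 3, "Medium": 2, "Low": 1}

def overall_rating(extraction, completeness, consistency, relevance):
    """Collapse per-dimension High/Medium/Low scores into an overall
    rating, weighting extraction double."""
    avg = (2 * SCORE[extraction] + SCORE[completeness]
           + SCORE[consistency] + SCORE[relevance]) / 5
    if avg >= 2.5:
        return "Good"
    if avg >= 2.0:
        return "Fair"
    return "Needs work"

print(overall_rating("High", "High", "Medium", "High"))  # Good
```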

    Phase 3: Compliance Assessment (Days 8-9)

    PII/PHI Identification

    Sample documents for sensitive data:

    • Personal names, addresses, phone numbers, email addresses
    • Social Security numbers, tax IDs, account numbers
    • Medical information (diagnoses, treatments, prescriptions)
    • Financial information (income, credit, account balances)
    • Biometric data (photos with identifiable faces)

    Estimate PII density: what percentage of documents contain PII, and how much per document?
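A regex scan over already-extracted text gives a cheap first estimate of PII density. The patterns below are illustrative and US-centric; regexes miss names, addresses, and anything context-dependent, so for real compliance work pair a purpose-built detector such as Microsoft Presidio with human review.

```python
import re

# Illustrative, US-centric patterns only.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pii_density(texts):
    """Given extracted document texts, return (share of documents with
    at least one PII hit, average hits per document)."""
    docs_with_pii = 0
    total_hits = 0
    for text in texts:
        hits = sum(len(p.findall(text)) for p in PII_PATTERNS.values())
        docs_with_pii += hits > 0
        total_hits += hits
    n = max(len(texts), 1)
    return docs_with_pii / n, total_hits / n
```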

    Regulatory Mapping

    Based on PII findings and industry, identify applicable regulations:

    • GDPR (EU data subjects)
    • HIPAA (health information)
    • EU AI Act (high-risk AI systems)
    • Industry-specific (SOX, PCAOB, ITAR, etc.)
    • State/regional privacy laws

    Processing Constraints

    • Can data leave the building? (Air-gapped requirements?)
    • Who can access the data? (Clearance, need-to-know, professional privilege?)
    • What audit trail is required?
    • What are the data retention and destruction obligations?

    Phase 4: Effort Estimation (Days 10-12)

    Ingestion Effort

    Based on quality assessment:

    • High-quality digital documents: Fast (batch processing)
    • Mixed quality: Moderate (some manual review of extraction results)
    • Low-quality scanned documents: Slow (OCR quality review, manual correction)

    Labeling Effort

    Estimate based on:

    • Number of records to label
    • Complexity of labeling schema (binary classification vs. multi-label vs. entity extraction)
    • Domain expertise required (generalist vs. specialist)
    • Estimated time per record (10 seconds for simple classification, 2-5 minutes for complex annotation)
    • Review cycles (typically 2-3 passes for quality)

    Example: 10,000 documents × 2 minutes per document × 2 review cycles = ~670 hours of labeling effort.
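The same arithmetic as a small helper, so you can rerun it as estimates change; the parameters mirror the list above:

```python
def labeling_hours(n_docs, minutes_per_doc, review_cycles=2):
    """Labeling effort in hours: documents x minutes per pass x passes."""
    return n_docs * minutes_per_doc * review_cycles / 60

print(f"{labeling_hours(10_000, 2):.0f} hours")  # 667, the ~670 above
```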

    Timeline

    Produce a realistic timeline:

| Phase | Effort | Duration |
| --- | --- | --- |
| Ingestion | X documents | Y weeks |
| Cleaning | Z records | W weeks |
| Labeling | N records | M weeks |
| Quality review | N records | P weeks |
| Export | - | 1 week |

    Phase 5: Recommendations (Days 13-14)

    Go / No-Go Assessment

    Based on the audit, recommend one of:

    • Proceed: Data quality and volume support the AI use case. Define scope and timeline.
    • Proceed with caveats: Data is usable but requires significant preparation. Budget accordingly.
    • Defer: Data quality or volume is insufficient. Invest in data collection or improvement before starting an AI project.
    • Pivot: The intended use case doesn't match the available data. Consider alternative use cases that better fit what you have.

    Priority Ranking

    If multiple AI use cases are being considered, rank them by data readiness — the use case with the most ready data should go first, regardless of which use case seems most valuable on paper.

    The Audit Deliverable

    Produce a concise document (5-10 pages) covering:

    1. Data inventory summary
    2. Quality assessment by document type
    3. Compliance requirements and constraints
    4. Effort and timeline estimates
    5. Go/no-go recommendation with rationale

    This document becomes the foundation for your AI data preparation project plan. Without it, you're planning blind.

    When you're ready to move from audit to preparation, platforms like Ertas Data Suite handle the full pipeline — ingestion, cleaning, labeling, augmentation, and export — on-premise, with the audit trail and compliance documentation built in. But the audit comes first. Know your data before you try to prepare it.

