
How Long Does Enterprise AI Data Preparation Actually Take?
Honest benchmarks for AI data preparation timelines — by data type, volume, and pipeline complexity — and the biggest time sinks that slow enterprise AI projects down.
The honest answer is: longer than you've budgeted for. Almost universally.
The 60-80% statistic — the share of ML project time that goes to data preparation — is widely cited, occasionally doubted, and consistently confirmed by teams that have been through a real enterprise AI project. What the statistic doesn't capture is what that means for project planning.
If data preparation takes 60-80% of total project time, and your project has a 6-month deadline, then you have roughly 3.5 to 5 months for data preparation alone. Not for model training. Not for evaluation and iteration. Not for deployment. Just for getting the data into a shape that allows training to begin.
Most project plans don't reflect this. Most project plans allocate 3-4 weeks for "data preprocessing" and 4-5 months for everything else. The discovery that the timeline is inverted typically happens at week 6, when the first pass at data preparation has been completed and the output quality is not yet fit for training.
This article gives you concrete benchmarks so you can plan based on realistic numbers.
The Variables That Drive Timeline
Timeline varies enormously depending on four factors:
1. Source format quality. Native PDFs from modern document management systems parse cleanly and quickly. Scanned documents from 1990s archives require OCR, deskewing, and manual quality review. The same nominal "1,000 document" corpus can take 8 hours to process if native, or 40+ hours if scanned.
2. Data volume. Not just file count, but total text volume. 10,000 short forms is a different problem from 10,000 dense technical reports.
3. Label complexity. Classifying documents into 5 categories at the document level is fast. Annotating named entities at the token level across a specialized domain (clinical terminology, legal clauses, engineering components) is slow.
4. Team composition and tooling. Manual spreadsheet-based cleaning vs. automated deduplication pipelines. Domain experts who can access annotation tools independently vs. domain experts who require an ML engineer to assist them. These multipliers are large.
Stage-by-Stage Time Benchmarks
Ingestion
Ingestion time is primarily determined by format and OCR requirements.
| Source Format | Automated Throughput | Error Rate | Manual Review Required |
|---|---|---|---|
| Native PDF (clean layout) | 5,000–15,000 | < 1% | Minimal |
| Native PDF (complex multi-column) | 1,000–3,000 | 2–5% | Table validation |
| Scanned PDF (good quality, 300+ DPI) | 500–1,500 | 2–8% | Spot check |
| Scanned PDF (poor quality, mixed) | 100–400 | 10–25% | Significant |
| Word (.docx) | 10,000–30,000 | < 1% | Minimal |
| Excel (.xlsx, simple) | 5,000–20,000 sheets | 1–3% | Header validation |
| Audio transcripts | 2–5x real-time + review | 5–15% | Speaker/term corrections |
These are automated processing rates. Add setup time — pipeline configuration, sample validation, parameter tuning — of 4-16 hours per corpus type before the main run begins.
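For native PDFs with a scanned-document fallback, the ingestion pass itself is a small amount of code; the setup hours go into validating it against samples of each corpus type. Below is a minimal sketch, assuming pypdf for native text extraction and pdf2image plus pytesseract for the OCR fallback (the library choices and the empty-page heuristic are illustrative assumptions, not part of the benchmarks above):

```python
# Minimal ingestion sketch: extract text from native PDFs, fall back to OCR
# for pages that yield almost no text (likely scans). Assumes pypdf, pdf2image,
# and pytesseract are installed, with Poppler and Tesseract available on the
# system. Library choices are illustrative assumptions.
from pathlib import Path

from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract


def extract_text(pdf_path: Path, min_chars_per_page: int = 50) -> str:
    """Try native text extraction first; OCR pages that come back near-empty."""
    reader = PdfReader(str(pdf_path))
    pages = [page.extract_text() or "" for page in reader.pages]

    # Heuristic: pages with almost no extractable text are probably scans.
    needs_ocr = [i for i, text in enumerate(pages) if len(text.strip()) < min_chars_per_page]
    if needs_ocr:
        images = convert_from_path(str(pdf_path), dpi=300)
        for i in needs_ocr:
            pages[i] = pytesseract.image_to_string(images[i])

    return "\n\n".join(pages)


if __name__ == "__main__":
    Path("extracted").mkdir(exist_ok=True)
    for pdf in Path("corpus").glob("*.pdf"):
        Path("extracted", pdf.stem + ".txt").write_text(extract_text(pdf), encoding="utf-8")
```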
Cleaning
Cleaning time is harder to estimate because it depends on the error rate from ingestion and the compliance requirements.
| Task | Time Estimate |
|---|---|
| Automated deduplication (50K records) | 1–4 hours compute + 2–4 hours validation |
| PII/PHI redaction (standard patterns) | 2–8 hours compute + 4–8 hours audit sample review |
| Quality scoring and filtering | 2–6 hours compute + 2–4 hours threshold calibration |
| Manual cleaning of OCR artifacts | 1–3 minutes per page with significant errors |
The manual cleaning component is the unpredictable one. If OCR quality is poor across a significant fraction of documents, manual correction becomes the timeline driver. A 10,000-page corpus at a 5% page-level error rate has 500 pages requiring manual attention; at 2 minutes per page, that's over 16 hours of manual correction, roughly two full working days for a single annotator.
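The automated deduplication row above hides two distinct problems: exact duplicates, which normalized hashing catches cheaply, and near-duplicates, which need a similarity measure. Here is a minimal sketch of both, with the shingle size and similarity threshold as illustrative assumptions (production pipelines typically use MinHash/LSH rather than the pairwise comparison shown here):

```python
# Deduplication sketch: drop exact duplicates by normalized hash, then flag
# near-duplicates by Jaccard similarity over character shingles. Thresholds
# and shingle size are illustrative assumptions; at scale, MinHash/LSH avoids
# the pairwise comparison used here.
import hashlib
import re


def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, k: int = 8) -> set[str]:
    text = normalize(text)
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def deduplicate(records: list[str], threshold: float = 0.85) -> list[str]:
    seen_hashes: set[str] = set()
    kept: list[str] = []
    kept_shingles: list[set[str]] = []

    for text in records:
        digest = hashlib.sha256(normalize(text).encode()).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        sh = shingles(text)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate
        seen_hashes.add(digest)
        kept.append(text)
        kept_shingles.append(sh)
    return kept
```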
Labeling
Labeling is almost always the longest stage, and almost always the most underestimated.
| Task | Time per Record | 10,000 Records |
|---|---|---|
| Document classification (5 classes) | 15–30 seconds | 40–80 hours |
| Document classification (20+ classes) | 30–90 seconds | 80–250 hours |
| NER tagging (3–5 entity types) | 2–5 minutes | 330–830 hours |
| NER tagging (10+ entity types, technical) | 5–15 minutes | 830–2,500 hours |
| Bounding box annotation (simple objects) | 1–3 minutes | 165–500 hours |
| Q&A pair generation per passage | 10–20 minutes | 1,650–3,300 hours |
| Instruction fine-tuning pair writing | 15–45 minutes | 2,500–7,500 hours |
These times assume domain experts who are calibrated and working efficiently. For the first labeling sessions before calibration, add 30-50% for inconsistency and rework.
At these rates, labeling 10,000 records for a complex NER task requires roughly 830-2,500 hours of expert annotation time. At 40 hours per week for a single annotator, that's 21-63 weeks. Most projects can't wait that long, which means either hiring multiple annotators, reducing scope, or using augmentation to expand a smaller high-quality labeled set.
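The arithmetic behind those numbers is simple enough to bake into the project plan. A small planning helper, sketched below, treats the per-record times from the table and the 30-50% calibration overhead as input assumptions:

```python
# Back-of-envelope annotation effort estimate. Per-record times and the
# calibration overhead are planning assumptions, not measured values.
def annotation_effort(num_records: int,
                      minutes_per_record: float,
                      annotators: int = 1,
                      hours_per_week: float = 40,
                      calibration_overhead: float = 0.4) -> dict:
    """Return total hours and calendar weeks, including calibration rework."""
    hours = num_records * minutes_per_record / 60 * (1 + calibration_overhead)
    weeks = hours / (annotators * hours_per_week)
    return {"hours": round(hours), "calendar_weeks": round(weeks, 1)}


# Complex NER at 5-15 minutes per record, three annotators:
print(annotation_effort(10_000, 5, annotators=3))   # ~1,167 hours, ~9.7 weeks
print(annotation_effort(10_000, 15, annotators=3))  # ~3,500 hours, ~29.2 weeks
```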
Augmentation
Automated augmentation using a local LLM runs at model inference speed — typically 50-500 synthetic records per hour depending on record length and hardware. Setup and quality review of synthetic examples adds 4-16 hours. This is usually the fastest stage.
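As a rough illustration of what this stage looks like in practice, here is a minimal paraphrase-style augmentation sketch using the Hugging Face transformers text-generation pipeline against a locally hosted model; the model name, prompt template, and sampling parameters are placeholder assumptions:

```python
# Synthetic augmentation sketch: paraphrase existing labeled records with a
# locally hosted model via the transformers text-generation pipeline. Model
# name, prompt, and sampling parameters are placeholders; throughput depends
# heavily on hardware and record length.
from transformers import pipeline

generator = pipeline("text-generation", model="local-model-name")  # placeholder model


def augment(record: str, variants: int = 3) -> list[str]:
    prompt = f"Paraphrase the following text, preserving all facts:\n\n{record}\n\nParaphrase:"
    outputs = generator(prompt,
                        num_return_sequences=variants,
                        max_new_tokens=256,
                        do_sample=True,
                        temperature=0.8,
                        return_full_text=False)
    return [out["generated_text"].strip() for out in outputs]
```

Every synthetic record still needs the quality review pass noted above; generation speed is rarely the constraint, review is.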
Export
Export is typically fast — hours, not days — assuming the format is correctly specified and validation is automated. Format validation failures (schema errors, encoding issues) can add debugging time of 4-16 hours if they're discovered late.
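Validating the export early is cheap insurance against that late debugging session. Below is a minimal JSONL validation sketch, assuming a chat-style "messages" schema as the target format; the actual schema is framework-specific, so substitute the one your training stack expects:

```python
# JSONL export validation sketch: check that every line is valid UTF-8 JSON
# and matches the expected keys before handing the file to training. The
# "messages" schema below is an illustrative assumption.
import json

REQUIRED_ROLES = {"user", "assistant"}


def validate_jsonl(path: str) -> list[str]:
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc})")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {lineno}: missing or empty 'messages' list")
                continue
            roles = {m.get("role") for m in messages}
            if not REQUIRED_ROLES.issubset(roles):
                errors.append(f"line {lineno}: roles {roles} missing one of {REQUIRED_ROLES}")
    return errors


if __name__ == "__main__":
    problems = validate_jsonl("export/train.jsonl")
    print(f"{len(problems)} problem lines found")
    for p in problems[:20]:
        print(p)
```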
The Compounding Cost of Skipping Cleaning
Teams that skip or rush cleaning — to hit a deadline or because cleaning "seems like overhead" — face a compounding problem.
A model trained on data with 10% near-duplicate records learns to reproduce common content with inflated confidence. A model trained on data with 2% PII contamination can leak that PII in production. A model trained on data with 5% OCR corruption will reproduce those corruption artifacts in its outputs.
The cost is not just the cleaning time itself — it's the full train-evaluate-diagnose-fix-retrain cycle that follows. If the cleaning problem isn't identified until model evaluation (weeks after training begins), the total added time is: time to identify the data problem + cleaning time + re-training time + re-evaluation time. This is consistently 2-4x longer than addressing cleaning at the right stage.
Where Teams Consistently Underestimate
OCR quality on legacy scanned documents. Teams that haven't audited the actual scan quality of their archive before planning often assume OCR will be "close enough." OCR on documents scanned at 150 DPI with skew, ink fade, and mixed printing quality is not close enough for AI training data. This is typically discovered only after ingestion, when the cleaning stage reveals the actual error rate.
Near-duplicate rates in accumulated archives. Enterprise document archives are not curated. Documents accumulate through email attachments, version saves, template instantiations, and copy-paste. After deduplication, the effective training data volume is often only 60-75% of the apparent volume.
Label consistency and calibration time. Teams assume that domain experts will naturally agree on labels. They rarely do on the first pass. Calibration — defining the label schema precisely, running trial annotation, measuring inter-annotator agreement, adjudicating disagreements, re-annotating with the refined schema — takes 2-6 weeks before the main annotation run begins.
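Measuring agreement on the trial round is the mechanical part of calibration. Here is a minimal sketch using Cohen's kappa from scikit-learn for a two-annotator classification trial; the label values and the 0.8 acceptance threshold are illustrative assumptions, and span-level tasks such as NER need a span-aware agreement measure instead:

```python
# Inter-annotator agreement sketch: Cohen's kappa over a trial round in which
# two annotators labeled the same records. Labels and the 0.8 threshold are
# illustrative assumptions.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["contract", "invoice", "report", "contract", "other"]
annotator_b = ["contract", "invoice", "contract", "contract", "other"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

if kappa < 0.8:
    print("Agreement below threshold: refine the label schema and re-run the trial.")
```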
Format requirements for the target framework. Discovering that the training framework requires a specific JSONL schema that doesn't match the export format, after labeling is complete, requires reformatting work and sometimes relabeling if the schema change affects how annotations map to the output.
A Rough Benchmark Table
| Corpus Size | Format | Annotation Type | Estimated Total Prep Time |
|---|---|---|---|
| 1,000 documents | Native PDF, simple | Document classification | 2–4 weeks |
| 1,000 documents | Scanned PDF | Document classification | 4–8 weeks |
| 10,000 documents | Native PDF, mixed | NER (5 entity types) | 3–6 months |
| 10,000 documents | Scanned PDF | NER (5 entity types) | 5–10 months |
| 50,000 documents | Mixed formats | Instruction fine-tuning pairs | 6–18 months |
| 100,000+ documents | Mixed formats | Multi-task labels | 12+ months |
These estimates assume a small team (2-4 people including at least 1 ML engineer and domain expert availability). Larger teams reduce calendar time proportionally, subject to annotation consistency overhead.
How Tooling Affects Timeline
Manual processes — Python scripts for cleaning, spreadsheet-based quality review, cobbled-together annotation tools — reliably produce 2-4x longer timelines than automated pipelines with built-in quality gates.
The compounding effects:
- Manual deduplication takes days; automated deduplication takes hours
- Manual PII review requires reading every document; automated detection with human audit sampling requires reading 5-10% (see the sketch after this list)
- Annotation tools that require ML engineer setup for each annotator session double the effective annotation time
- Format conversion scripts that need to be rewritten for each new export target add days to the export stage
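The audit-sampling point above is straightforward to implement. Below is a minimal sketch that flags documents matching common PII patterns and draws a random sample of the rest for human review; the two regexes are illustrative only, not a complete PII ruleset:

```python
# PII detection with human audit sampling sketch: flag documents containing
# common PII patterns automatically, then draw a random sample of the
# remainder for human review. The two patterns shown are illustrative only;
# a real ruleset covers many more identifier types and locale variants.
import random
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),
}


def triage(documents: dict[str, str], audit_fraction: float = 0.05):
    flagged = {doc_id for doc_id, text in documents.items()
               if any(p.search(text) for p in PII_PATTERNS.values())}
    unflagged = [doc_id for doc_id in documents if doc_id not in flagged]
    k = min(len(unflagged), max(1, int(len(unflagged) * audit_fraction)))
    audit_sample = random.sample(unflagged, k)
    return flagged, audit_sample  # flagged -> redaction queue, sample -> human audit
```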
Pipeline automation is not a luxury for large projects. For a team of 3 trying to prepare a 10,000-document NER corpus, the difference between a well-tooled pipeline and a manual process is the difference between a 3-month project and a 9-month project.
Ertas Data Suite automates the ingestion, cleaning, deduplication, and PII redaction stages, and provides a browser-based annotation interface that domain experts can access without installation. Based on teams using the pipeline, the automated stages alone reduce total preparation time by 40-60% compared to script-based pipelines.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
Related Reading
- The Five Stages of an Enterprise AI Data Pipeline — What actually happens at each stage and where teams get stuck.
- The Enterprise Guide to AI Data Preparation — The full strategic picture, including how to scope a data preparation project before starting.
- Cost of a Fragmented Data Prep Stack — The compounding cost of using 3–7 separate tools across the pipeline.
Turn unstructured data into AI-ready datasets — without it leaving the building.
On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.