
How Long Does Enterprise AI Data Preparation Actually Take?
Honest benchmarks for AI data preparation timelines — by data type, volume, and pipeline complexity — and the biggest time sinks that slow enterprise AI projects down.
The honest answer is: longer than you've budgeted for. Almost universally.
The 60-80% statistic — the share of ML project time that goes to data preparation — is widely cited, occasionally doubted, and consistently confirmed by teams that have been through a real enterprise AI project. What the statistic doesn't capture is what that means for project planning.
If data preparation takes 60-80% of total project time, and your project has a 6-month deadline, then you have roughly 3.5 to 5 months for data preparation alone. Not for model training. Not for evaluation and iteration. Not for deployment. Just for getting the data into a shape that allows training to begin.
Most project plans don't reflect this. Most project plans allocate 3-4 weeks for "data preprocessing" and 4-5 months for everything else. The discovery that the timeline is inverted typically happens at week 6, when the first pass at data preparation has been completed and the output quality is not yet fit for training.
This article gives you concrete benchmarks so you can plan based on realistic numbers.
The Variables That Drive Timeline
Timeline varies enormously depending on four factors:
1. Source format quality. Native PDFs from modern document management systems parse cleanly and quickly. Scanned documents from 1990s archives require OCR, deskewing, and manual quality review. The same nominal "1,000 document" corpus can take 8 hours to process if native, or 40+ hours if scanned.
2. Data volume. Not just file count, but total text volume. 10,000 short forms is a different problem from 10,000 dense technical reports.
3. Label complexity. Classifying documents into 5 categories at the document level is fast. Annotating named entities at the token level across a specialized domain (clinical terminology, legal clauses, engineering components) is slow.
4. Team composition and tooling. Manual spreadsheet-based cleaning vs. automated deduplication pipelines. Domain experts who can access annotation tools independently vs. domain experts who require an ML engineer to assist them. These multipliers are large.
Stage-by-Stage Time Benchmarks
Ingestion
Ingestion time is primarily determined by format and OCR requirements.
| Source Format | Automated Throughput | Error Rate | Manual Review Required |
|---|---|---|---|
| Native PDF (clean layout) | 5,000–15,000 | < 1% | Minimal |
| Native PDF (complex multi-column) | 1,000–3,000 | 2–5% | Table validation |
| Scanned PDF (good quality, 300+ DPI) | 500–1,500 | 2–8% | Spot check |
| Scanned PDF (poor quality, mixed) | 100–400 | 10–25% | Significant |
| Word (.docx) | 10,000–30,000 | < 1% | Minimal |
| Excel (.xlsx, simple) | 5,000–20,000 sheets | 1–3% | Header validation |
| Audio transcripts | 2–5x real-time + review | 5–15% | Speaker/term corrections |
These are automated processing rates. Add setup time — pipeline configuration, sample validation, parameter tuning — of 4-16 hours per corpus type before the main run begins.
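For native PDFs with a scanned-document fallback, the ingestion pass itself is a small amount of code; the setup hours go into validating it against samples of each corpus type. Below is a minimal sketch, assuming pypdf for native text extraction and pdf2image plus pytesseract for the OCR fallback (the library choices and the empty-page heuristic are illustrative assumptions, not part of the benchmarks above):

```python
# Minimal ingestion sketch: extract text from native PDFs, fall back to OCR
# for pages that yield almost no text (likely scans). Assumes pypdf, pdf2image,
# and pytesseract are installed, with Poppler and Tesseract available on the
# system. Library choices are illustrative assumptions.
from pathlib import Path

from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract


def extract_text(pdf_path: Path, min_chars_per_page: int = 50) -> str:
    """Try native text extraction first; OCR pages that come back near-empty."""
    reader = PdfReader(str(pdf_path))
    pages = [page.extract_text() or "" for page in reader.pages]

    # Heuristic: pages with almost no extractable text are probably scans.
    needs_ocr = [i for i, text in enumerate(pages) if len(text.strip()) < min_chars_per_page]
    if needs_ocr:
        images = convert_from_path(str(pdf_path), dpi=300)
        for i in needs_ocr:
            pages[i] = pytesseract.image_to_string(images[i])

    return "\n\n".join(pages)


if __name__ == "__main__":
    Path("extracted").mkdir(exist_ok=True)
    for pdf in Path("corpus").glob("*.pdf"):
        Path("extracted", pdf.stem + ".txt").write_text(extract_text(pdf), encoding="utf-8")
```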
Cleaning
Cleaning time is harder to estimate because it depends on the error rate from ingestion and the compliance requirements.
| Task | Time Estimate |
|---|---|
| Automated deduplication (50K records) | 1–4 hours compute + 2–4 hours validation |
| PII/PHI redaction (standard patterns) | 2–8 hours compute + 4–8 hours audit sample review |
| Quality scoring and filtering | 2–6 hours compute + 2–4 hours threshold calibration |
| Manual cleaning of OCR artifacts | 1–3 minutes per page with significant errors |
The manual cleaning component is the unpredictable one. If OCR quality is poor across a significant fraction of documents, manual correction becomes the timeline driver. A 10,000-page corpus at a 5% page-level error rate has 500 pages requiring manual attention; at 2 minutes per page, that's over 16 hours of manual correction, roughly two full working days for a single annotator.
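The automated deduplication row above hides two distinct problems: exact duplicates, which normalized hashing catches cheaply, and near-duplicates, which need a similarity measure. Here is a minimal sketch of both, with the shingle size and similarity threshold as illustrative assumptions (production pipelines typically use MinHash/LSH rather than the pairwise comparison shown here):

```python
# Deduplication sketch: drop exact duplicates by normalized hash, then flag
# near-duplicates by Jaccard similarity over character shingles. Thresholds
# and shingle size are illustrative assumptions; at scale, MinHash/LSH avoids
# the pairwise comparison used here.
import hashlib
import re


def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, k: int = 8) -> set[str]:
    text = normalize(text)
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def deduplicate(records: list[str], threshold: float = 0.85) -> list[str]:
    seen_hashes: set[str] = set()
    kept: list[str] = []
    kept_shingles: list[set[str]] = []

    for text in records:
        digest = hashlib.sha256(normalize(text).encode()).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        sh = shingles(text)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate
        seen_hashes.add(digest)
        kept.append(text)
        kept_shingles.append(sh)
    return kept
```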
Labeling
Labeling is almost always the longest stage, and almost always the most underestimated.
| Task | Time per Record | 10,000 Records |
|---|---|---|
| Document classification (5 classes) | 15–30 seconds | 40–80 hours |
| Document classification (20+ classes) | 30–90 seconds | 80–250 hours |
| NER tagging (3–5 entity types) | 2–5 minutes | 330–830 hours |
| NER tagging (10+ entity types, technical) | 5–15 minutes | 830–2,500 hours |
| Bounding box annotation (simple objects) | 1–3 minutes | 165–500 hours |
| Q&A pair generation per passage | 10–20 minutes | 1,650–3,300 hours |
| Instruction fine-tuning pair writing | 15–45 minutes | 2,500–7,500 hours |
These times assume domain experts who are calibrated and working efficiently. For the first labeling sessions before calibration, add 30-50% for inconsistency and rework.
At these rates, labeling 10,000 records for a complex NER task requires roughly 830-2,500 hours of expert annotation time. At 40 hours per week for a single annotator, that's 21-63 weeks. Most projects can't wait that long, which means either hiring multiple annotators, reducing scope, or using augmentation to expand a smaller high-quality labeled set.
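The arithmetic behind those numbers is simple enough to bake into the project plan. A small planning helper, sketched below, treats the per-record times from the table and the 30-50% calibration overhead as input assumptions:

```python
# Back-of-envelope annotation effort estimate. Per-record times and the
# calibration overhead are planning assumptions, not measured values.
def annotation_effort(num_records: int,
                      minutes_per_record: float,
                      annotators: int = 1,
                      hours_per_week: float = 40,
                      calibration_overhead: float = 0.4) -> dict:
    """Return total hours and calendar weeks, including calibration rework."""
    hours = num_records * minutes_per_record / 60 * (1 + calibration_overhead)
    weeks = hours / (annotators * hours_per_week)
    return {"hours": round(hours), "calendar_weeks": round(weeks, 1)}


# Complex NER at 5-15 minutes per record, three annotators:
print(annotation_effort(10_000, 5, annotators=3))   # ~1,167 hours, ~9.7 weeks
print(annotation_effort(10_000, 15, annotators=3))  # ~3,500 hours, ~29.2 weeks
```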
Augmentation
Automated augmentation using a local LLM runs at model inference speed — typically 50-500 synthetic records per hour depending on record length and hardware. Setup and quality review of synthetic examples adds 4-16 hours. This is usually the fastest stage.
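As a rough illustration of what this stage looks like in practice, here is a minimal paraphrase-style augmentation sketch using the Hugging Face transformers text-generation pipeline against a locally hosted model; the model name, prompt template, and sampling parameters are placeholder assumptions:

```python
# Synthetic augmentation sketch: paraphrase existing labeled records with a
# locally hosted model via the transformers text-generation pipeline. Model
# name, prompt, and sampling parameters are placeholders; throughput depends
# heavily on hardware and record length.
from transformers import pipeline

generator = pipeline("text-generation", model="local-model-name")  # placeholder model


def augment(record: str, variants: int = 3) -> list[str]:
    prompt = f"Paraphrase the following text, preserving all facts:\n\n{record}\n\nParaphrase:"
    outputs = generator(prompt,
                        num_return_sequences=variants,
                        max_new_tokens=256,
                        do_sample=True,
                        temperature=0.8,
                        return_full_text=False)
    return [out["generated_text"].strip() for out in outputs]
```

Every synthetic record still needs the quality review pass noted above; generation speed is rarely the constraint, review is.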
Export
Export is typically fast — hours, not days — assuming the format is correctly specified and validation is automated. Format validation failures (schema errors, encoding issues) can add debugging time of 4-16 hours if they're discovered late.
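Validating the export early is cheap insurance against that late debugging session. Below is a minimal JSONL validation sketch, assuming a chat-style "messages" schema as the target format; the actual schema is framework-specific, so substitute the one your training stack expects:

```python
# JSONL export validation sketch: check that every line is valid UTF-8 JSON
# and matches the expected keys before handing the file to training. The
# "messages" schema below is an illustrative assumption.
import json

REQUIRED_ROLES = {"user", "assistant"}


def validate_jsonl(path: str) -> list[str]:
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc})")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {lineno}: missing or empty 'messages' list")
                continue
            roles = {m.get("role") for m in messages}
            if not REQUIRED_ROLES.issubset(roles):
                errors.append(f"line {lineno}: roles {roles} missing one of {REQUIRED_ROLES}")
    return errors


if __name__ == "__main__":
    problems = validate_jsonl("export/train.jsonl")
    print(f"{len(problems)} problem lines found")
    for p in problems[:20]:
        print(p)
```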
The Compounding Cost of Skipping Cleaning
Teams that skip or rush cleaning — to hit a deadline or because cleaning "seems like overhead" — face a compounding problem.
A model trained on data with 10% near-duplicate records learns to reproduce common content with inflated confidence. A model trained on data with 2% PII contamination can leak that PII in production. A model trained on data with 5% OCR corruption will reproduce those corruption artifacts in its outputs.
The cost is not just the cleaning time itself — it's the full train-evaluate-diagnose-fix-retrain cycle that follows. If the cleaning problem isn't identified until model evaluation (weeks after training begins), the total added time is: time to identify the data problem + cleaning time + re-training time + re-evaluation time. This is consistently 2-4x longer than addressing cleaning at the right stage.
Where Teams Consistently Underestimate
OCR quality on legacy scanned documents. Teams that haven't audited the actual scan quality of their archive before planning often assume OCR will be "close enough." OCR on documents scanned at 150 DPI with skew, ink fade, and mixed printing quality is not close enough for AI training data. This is typically discovered only after ingestion, when the cleaning stage reveals the actual error rate.
Near-duplicate rates in accumulated archives. Enterprise document archives are not curated. Documents accumulate through email attachments, version saves, template instantiations, and copy-paste. After deduplication, the effective training data volume is often only 60-75% of the apparent volume.
Label consistency and calibration time. Teams assume that domain experts will naturally agree on labels. They rarely do on the first pass. Calibration — defining the label schema precisely, running trial annotation, measuring inter-annotator agreement, adjudicating disagreements, re-annotating with the refined schema — takes 2-6 weeks before the main annotation run begins.
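Measuring agreement on the trial round is the mechanical part of calibration. Here is a minimal sketch using Cohen's kappa from scikit-learn for a two-annotator classification trial; the label values and the 0.8 acceptance threshold are illustrative assumptions, and span-level tasks such as NER need a span-aware agreement measure instead:

```python
# Inter-annotator agreement sketch: Cohen's kappa over a trial round in which
# two annotators labeled the same records. Labels and the 0.8 threshold are
# illustrative assumptions.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["contract", "invoice", "report", "contract", "other"]
annotator_b = ["contract", "invoice", "contract", "contract", "other"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

if kappa < 0.8:
    print("Agreement below threshold: refine the label schema and re-run the trial.")
```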
Format requirements for the target framework. Discovering that the training framework requires a specific JSONL schema that doesn't match the export format, after labeling is complete, requires reformatting work and sometimes relabeling if the schema change affects how annotations map to the output.
A Rough Benchmark Table
| Corpus Size | Format | Annotation Type | Estimated Total Prep Time |
|---|---|---|---|
| 1,000 documents | Native PDF, simple | Document classification | 2–4 weeks |
| 1,000 documents | Scanned PDF | Document classification | 4–8 weeks |
| 10,000 documents | Native PDF, mixed | NER (5 entity types) | 3–6 months |
| 10,000 documents | Scanned PDF | NER (5 entity types) | 5–10 months |
| 50,000 documents | Mixed formats | Instruction fine-tuning pairs | 6–18 months |
| 100,000+ documents | Mixed formats | Multi-task labels | 12+ months |
These estimates assume a small team (2-4 people including at least 1 ML engineer and domain expert availability). Larger teams reduce calendar time proportionally, subject to annotation consistency overhead.
How Tooling Affects Timeline
Manual processes — Python scripts for cleaning, spreadsheet-based quality review, cobbled-together annotation tools — reliably produce 2-4x longer timelines than automated pipelines with built-in quality gates.
The compounding effects:
- Manual deduplication takes days; automated deduplication takes hours
- Manual PII review requires reading every document; automated detection with human audit sampling requires reading 5-10% (see the sketch after this list)
- Annotation tools that require ML engineer setup for each annotator session double the effective annotation time
- Format conversion scripts that need to be rewritten for each new export target add days to the export stage
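The audit-sampling point above is straightforward to implement. Below is a minimal sketch that flags documents matching common PII patterns and draws a random sample of the rest for human review; the two regexes are illustrative only, not a complete PII ruleset:

```python
# PII detection with human audit sampling sketch: flag documents containing
# common PII patterns automatically, then draw a random sample of the
# remainder for human review. The two patterns shown are illustrative only;
# a real ruleset covers many more identifier types and locale variants.
import random
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),
}


def triage(documents: dict[str, str], audit_fraction: float = 0.05):
    flagged = {doc_id for doc_id, text in documents.items()
               if any(p.search(text) for p in PII_PATTERNS.values())}
    unflagged = [doc_id for doc_id in documents if doc_id not in flagged]
    k = min(len(unflagged), max(1, int(len(unflagged) * audit_fraction)))
    audit_sample = random.sample(unflagged, k)
    return flagged, audit_sample  # flagged -> redaction queue, sample -> human audit
```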
Pipeline automation is not a luxury for large projects. For a team of 3 trying to prepare a 10,000-document NER corpus, the difference between a well-tooled pipeline and a manual process is the difference between a 3-month project and a 9-month project.
Ertas Data Suite automates the ingestion, cleaning, deduplication, and PII redaction stages, and provides a browser-based annotation interface that domain experts can access without installation. Based on teams using the pipeline, the automated stages alone reduce total preparation time by 40-60% compared to script-based pipelines.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
Related Reading
- The Five Stages of an Enterprise AI Data Pipeline — What actually happens at each stage and where teams get stuck.
- The Enterprise Guide to AI Data Preparation — The full strategic picture, including how to scope a data preparation project before starting.
- Cost of a Fragmented Data Prep Stack — The compounding cost of using 3–7 separate tools across the pipeline.
Turn unstructured data into AI-ready datasets — without it leaving the building.
On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.