    How Long Does Enterprise AI Data Preparation Actually Take?


    Honest benchmarks for AI data preparation timelines — by data type, volume, and pipeline complexity — and the biggest time sinks that slow enterprise AI projects down.

    Ertas Team

    The honest answer is: longer than you've budgeted for. Almost universally.

    The 60-80% statistic — the share of ML project time that goes to data preparation — is widely cited, occasionally doubted, and consistently confirmed by teams that have been through a real enterprise AI project. What the statistic doesn't capture is what that means for project planning.

    If data preparation takes 60-80% of total project time, and your project has a 6-month deadline, then you have roughly 3.5 to 5 months for data preparation alone. Not for model training. Not for evaluation and iteration. Not for deployment. Just for getting the data into a shape that allows training to begin.

    Most project plans don't reflect this: they allocate 3–4 weeks for "data preprocessing" and 4–5 months for everything else. The discovery that the timeline is inverted typically comes around week 6, when the first pass at data preparation is complete and the output is not yet fit for training.

    This article gives you concrete benchmarks so you can plan based on realistic numbers.

    The Variables That Drive Timeline

    Timeline varies enormously depending on four factors:

    1. Source format quality. Native PDFs from modern document management systems parse cleanly and quickly. Scanned documents from 1990s archives require OCR, deskewing, and manual quality review. The same nominal "1,000 document" corpus can take 8 hours to process if native, or 40+ hours if scanned.

    2. Data volume. Not just file count, but total text volume. 10,000 short forms is a different problem from 10,000 dense technical reports.

    3. Label complexity. Classifying documents into 5 categories at the document level is fast. Annotating named entities at the token level across a specialized domain (clinical terminology, legal clauses, engineering components) is slow.

    4. Team composition and tooling. Manual spreadsheet-based cleaning vs. automated deduplication pipelines. Domain experts who can access annotation tools independently vs. domain experts who require an ML engineer to assist them. These multipliers are large.

    Stage-by-Stage Time Benchmarks

    Ingestion

    Ingestion time is primarily determined by format and OCR requirements.

    | Source Format | Pages per Hour (automated) | Error Rate | Manual Review Required |
    | --- | --- | --- | --- |
    | Native PDF (clean layout) | 5,000–15,000 | < 1% | Minimal |
    | Native PDF (complex multi-column) | 1,000–3,000 | 2–5% | Table validation |
    | Scanned PDF (good quality, 300+ DPI) | 500–1,500 | 2–8% | Spot check |
    | Scanned PDF (poor quality, mixed) | 100–400 | 10–25% | Significant |
    | Word (.docx) | 10,000–30,000 | < 1% | Minimal |
    | Excel (.xlsx, simple) | 5,000–20,000 sheets | 1–3% | Header validation |
    | Audio transcripts | 2–5x real-time + review | 5–15% | Speaker/term corrections |

    These are automated processing rates. Add setup time — pipeline configuration, sample validation, parameter tuning — of 4-16 hours per corpus type before the main run begins.
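
As a sanity check on an ingestion plan, the throughput ranges above can be turned into a rough best-case/worst-case calculator. A minimal sketch in Python; the rates mirror the table, but the example corpus mix is invented:

```python
# Rough ingestion-time estimator built from the throughput table above.
# The rates are the table's; the example corpus mix is an assumption.

RATES = {  # format -> (pages/hour slow, pages/hour fast)
    "native_pdf_clean": (5_000, 15_000),
    "native_pdf_complex": (1_000, 3_000),
    "scanned_pdf_good": (500, 1_500),
    "scanned_pdf_poor": (100, 400),
}
SETUP_HOURS = (4, 16)  # per corpus type, before the main run

def estimate_hours(corpus):
    """Return (best_case, worst_case) hours for a {format: page_count} corpus."""
    best = worst = 0.0
    for fmt, pages in corpus.items():
        slow, fast = RATES[fmt]
        best += pages / fast + SETUP_HOURS[0]   # fast rate, minimal setup
        worst += pages / slow + SETUP_HOURS[1]  # slow rate, heavy setup
    return best, worst

best, worst = estimate_hours({"native_pdf_clean": 20_000, "scanned_pdf_poor": 5_000})
print(f"{best:.0f}-{worst:.0f} hours")  # → 22-86 hours
```

Note how wide the spread is: the 5,000 scanned pages dominate both bounds even though they are a fifth of the corpus by page count.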

    Cleaning

    Cleaning time is harder to estimate because it depends on the error rate from ingestion and the compliance requirements.

    | Task | Time Estimate |
    | --- | --- |
    | Automated deduplication (50K records) | 1–4 hours compute + 2–4 hours validation |
    | PII/PHI redaction (standard patterns) | 2–8 hours compute + 4–8 hours audit sample review |
    | Quality scoring and filtering | 2–6 hours compute + 2–4 hours threshold calibration |
    | Manual cleaning of OCR artifacts | 1–3 minutes per page with significant errors |

    The manual cleaning component is the unpredictable one. If OCR quality is poor across a significant fraction of documents, manual correction becomes the timeline driver. A 10,000-page corpus at a 5% page-level error rate has 500 pages requiring manual attention — at 2 minutes per page, that's roughly 17 hours of manual work for a single person.
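
That arithmetic is worth automating so it can be rerun as the measured error rate changes during the project. A minimal sketch:

```python
# Worked version of the arithmetic above: hours of manual OCR cleanup
# implied by a page count, a page-level error rate, and a per-page fix time.

def manual_cleaning_hours(total_pages, error_rate, minutes_per_page):
    flagged_pages = total_pages * error_rate        # pages needing manual attention
    return flagged_pages * minutes_per_page / 60.0  # total hours of manual work

hours = manual_cleaning_hours(10_000, 0.05, 2.0)
print(f"{hours:.1f} hours")  # → 16.7 hours
```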

    Labeling

    Labeling is almost always the longest stage, and almost always the most underestimated.

    | Task | Time per Record | 10,000 Records |
    | --- | --- | --- |
    | Document classification (5 classes) | 15–30 seconds | 40–80 hours |
    | Document classification (20+ classes) | 30–90 seconds | 80–250 hours |
    | NER tagging (3–5 entity types) | 2–5 minutes | 330–830 hours |
    | NER tagging (10+ entity types, technical) | 5–15 minutes | 830–2,500 hours |
    | Bounding box annotation (simple objects) | 1–3 minutes | 165–500 hours |
    | Q&A pair generation per passage | 10–20 minutes | 1,650–3,300 hours |
    | Instruction fine-tuning pair writing | 15–45 minutes | 2,500–7,500 hours |

    These times assume domain experts who are calibrated and working efficiently. For the first labeling sessions before calibration, add 30-50% for inconsistency and rework.

    At these rates, labeling 10,000 records for a complex NER task requires 830–2,500 hours of expert annotation time. At 40 hours per week for a single annotator, that's 21–63 weeks. Most projects can't wait that long, which leaves three options: hire multiple annotators, reduce scope, or use augmentation to expand a smaller high-quality labeled set.
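
One way to turn these hours into a staffing decision is to work backwards from the deadline. A rough sketch; the figure of 35 productive annotation hours per person-week is an assumption, not a benchmark from this article:

```python
import math

# Back-of-envelope staffing: annotators needed to finish a labeling stage
# by a deadline. 35 productive hours/week per person is an assumption.

def annotators_needed(records, minutes_per_record, deadline_weeks, hours_per_week=35.0):
    total_hours = records * minutes_per_record / 60.0
    return math.ceil(total_hours / (deadline_weeks * hours_per_week))

# Complex NER: 10,000 records at ~10 minutes each, 12-week deadline
print(annotators_needed(10_000, 10, 12))  # → 4
```

In practice the answer is a lower bound, since adding annotators also adds calibration and adjudication overhead.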

    Augmentation

    Automated augmentation using a local LLM runs at model inference speed — typically 50-500 synthetic records per hour depending on record length and hardware. Setup and quality review of synthetic examples adds 4-16 hours. This is usually the fastest stage.

    Export

    Export is typically fast — hours, not days — assuming the format is correctly specified and validation is automated. Format validation failures (schema errors, encoding issues) can add debugging time of 4-16 hours if they're discovered late.
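
A cheap way to avoid late-discovered schema failures is to validate a sample of the export before the full run. A minimal sketch, assuming a hypothetical prompt/completion JSONL schema; substitute the target framework's actual required keys:

```python
import json

# Pre-export validation sketch: catch schema and parse errors on a sample
# before the full export. REQUIRED_KEYS is illustrative, not a standard.

REQUIRED_KEYS = {"prompt", "completion"}

def validate_jsonl(lines):
    """Yield (line_number, error) for each record that fails validation."""
    for lineno, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            yield lineno, f"parse error: {exc}"
            continue
        if not isinstance(record, dict):
            yield lineno, "record is not a JSON object"
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            yield lineno, f"missing keys: {sorted(missing)}"

sample = ['{"prompt": "Q", "completion": "A"}', '{"prompt": "Q only"}', "not json"]
for lineno, err in validate_jsonl(sample):
    print(lineno, err)
```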

    The Compounding Cost of Skipping Cleaning

    Teams that skip or rush cleaning — to hit a deadline or because cleaning "seems like overhead" — face a compounding problem.

    A model trained on data with 10% near-duplicate records learns to reproduce common content with inflated confidence. A model trained on data with 2% PII contamination will output PII in production. A model trained on data with 5% OCR corruption will produce outputs that include corruption artifacts.

    The cost is not just the cleaning time itself — it's the full train-evaluate-diagnose-fix-retrain cycle that follows. If the cleaning problem isn't identified until model evaluation (weeks after training begins), the total added time is: time to identify the data problem + cleaning time + re-training time + re-evaluation time. This is consistently 2-4x longer than addressing cleaning at the right stage.

    Where Teams Consistently Underestimate

    OCR quality on legacy scanned documents. Teams that haven't audited the actual scan quality of their archive before planning often assume OCR will be "close enough." OCR on documents scanned at 150 DPI with skew, ink fade, and mixed printing quality is not close enough for AI training data. This is typically discovered only after ingestion, when the cleaning stage reveals the true error rate.

    Near-duplicate rates in accumulated archives. Enterprise document archives are not curated. Documents accumulate through email attachments, version saves, template instantiations, and copy-paste. Before deduplication, the effective training data volume is often 60-75% of the apparent volume.
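
A quick way to estimate the near-duplicate rate before committing to a timeline is to compare documents by word-shingle overlap. A toy sketch with invented documents; production pipelines typically use MinHash/LSH rather than exact pairwise Jaccard, which is O(n²) in document count:

```python
# Toy near-duplicate check via word-shingle Jaccard similarity.
# The two "version save" documents below are invented for illustration.

def shingles(text, k=5):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

doc_a = "quarterly report for the northern region covering sales and returns data"
doc_b = "quarterly report for the northern region covering sales and refunds data"
sim = jaccard(shingles(doc_a), shingles(doc_b))
print(f"{sim:.2f}")  # → 0.56
```

On realistic document lengths a single edited word moves the score far less than it does here, which is why version saves and template instantiations score very high and are easy to flag.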

    Label consistency and calibration time. Teams assume that domain experts will naturally agree on labels. They rarely do on the first pass. Calibration — defining the label schema precisely, running trial annotation, measuring inter-annotator agreement, adjudicating disagreements, re-annotating with the refined schema — takes 2-6 weeks before the main annotation run begins.
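
Inter-annotator agreement in the calibration round is usually quantified with a chance-corrected statistic such as Cohen's kappa. A minimal two-annotator sketch with invented labels:

```python
from collections import Counter

# Cohen's kappa for a two-annotator trial round; labels are invented.
# kappa near 1 means near-perfect agreement; near 0 means chance-level.

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lbl] * freq_b[lbl] for lbl in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["contract", "invoice", "contract", "report", "invoice", "contract"]
b = ["contract", "invoice", "report", "report", "invoice", "invoice"]
print(f"{cohens_kappa(a, b):.2f}")  # → 0.52
```

A score like 0.52 on a trial round is exactly the signal that the label schema needs refinement before the main annotation run.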

    Format requirements for the target framework. Discovering that the training framework requires a specific JSONL schema that doesn't match the export format, after labeling is complete, requires reformatting work and sometimes relabeling if the schema change affects how annotations map to the output.

    A Rough Benchmark Table

    | Corpus Size | Format | Annotation Type | Estimated Total Prep Time |
    | --- | --- | --- | --- |
    | 1,000 documents | Native PDF, simple | Document classification | 2–4 weeks |
    | 1,000 documents | Scanned PDF | Document classification | 4–8 weeks |
    | 10,000 documents | Native PDF, mixed | NER (5 entity types) | 3–6 months |
    | 10,000 documents | Scanned PDF | NER (5 entity types) | 5–10 months |
    | 50,000 documents | Mixed formats | Instruction fine-tuning pairs | 6–18 months |
    | 100,000+ documents | Mixed formats | Multi-task labels | 12+ months |

    These estimates assume a small team (2-4 people including at least 1 ML engineer and domain expert availability). Larger teams reduce calendar time proportionally, subject to annotation consistency overhead.

    How Tooling Affects Timeline

    Manual processes — Python scripts for cleaning, spreadsheet-based quality review, cobbled-together annotation tools — reliably produce 2-4x longer timelines than automated pipelines with built-in quality gates.

    The compounding effects:

    • Manual deduplication takes days; automated deduplication takes hours
    • Manual PII review requires reading every document; automated detection with human audit sampling requires reading 5-10%
    • Annotation tools that require ML engineer setup for each annotator session double the effective annotation time
    • Format conversion scripts that need to be rewritten for each new export target add days to the export stage
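
The audit-sampling point has a statistical basis: if the automated detector misses PII in a fraction p of documents, a random sample of n documents contains at least one miss with probability 1 − (1 − p)ⁿ. A sketch with illustrative numbers:

```python
import math

# Sample size needed for a human audit to surface automated PII misses.
# P(an n-document random sample contains no leaked doc) ≈ (1 - leak_rate)^n,
# so n ≥ ln(1 - confidence) / ln(1 - leak_rate). Numbers are illustrative.

def audit_sample_size(leak_rate, confidence):
    return math.ceil(math.log(1 - confidence) / math.log(1 - leak_rate))

# 95% confidence of seeing at least one leak if ≥1% of documents leak
print(audit_sample_size(0.01, 0.95))  # → 299
```

For a 5,000-document corpus that is about 6% of the archive, consistent with the 5–10% review fraction above.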

    Pipeline automation is not a luxury for large projects. For a team of 3 trying to prepare a 10,000-document NER corpus, the difference between a well-tooled pipeline and a manual process is the difference between a 3-month project and a 9-month project.

    Ertas Data Suite automates the ingestion, cleaning, deduplication, and PII redaction stages, and provides a browser-based annotation interface that domain experts can access without installation. Based on teams using the pipeline, the automated stages alone reduce total preparation time by 40-60% compared to script-based pipelines.


    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.


    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
