    Benchmark: On-Premise Data Prep Pipeline Throughput for 100GB+ Enterprise Datasets


    Realistic throughput benchmarks for on-premise data preparation — ingestion, OCR, cleaning, labeling, and export speeds by document type and hardware configuration.

    Ertas Team

    Every service provider delivering data preparation for enterprise AI projects faces the same question during scoping: "How long will this take?"

    The answer depends on document types, dataset size, pipeline stages, and hardware. Vague estimates like "a few weeks" don't help when writing statements of work with fixed timelines. Concrete throughput numbers do.

    This guide provides realistic benchmark data for each pipeline stage across different document types and hardware configurations. These numbers come from common configurations, not idealized lab conditions. Use them as baselines for scoping engagements.


    Methodology Note

    All benchmarks assume:

    • Single machine processing (not distributed)
    • Documents processed sequentially through pipeline stages (ingest all → clean all → label all → export all)
    • Default configurations for OCR engines and inference backends (no exotic tuning)
    • Throughput measured as sustained rate after initial warmup, not peak burst (a measurement sketch follows this list)
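
    To illustrate that last point, here is a minimal Python sketch of how sustained throughput can be measured. The process_document callable and the warmup window size are hypothetical placeholders, not part of any benchmark harness used for the numbers below.

```python
import time

WARMUP_DOCS = 500  # hypothetical warmup window; tune per pipeline stage

def sustained_docs_per_min(docs, process_document):
    """Return docs/min measured after discarding the warmup window."""
    for doc in docs[:WARMUP_DOCS]:   # warmup: model load, caches, page cache
        process_document(doc)
    measured = docs[WARMUP_DOCS:]
    start = time.perf_counter()
    for doc in measured:             # measured region: sustained rate only
        process_document(doc)
    elapsed_min = (time.perf_counter() - start) / 60.0
    return len(measured) / elapsed_min
```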

    Hardware configurations referenced:

    | Config | CPU | RAM | GPU | Storage |
    | --- | --- | --- | --- | --- |
    | Entry | Ryzen 7 7700 (8c/16t) | 32 GB | RTX 4060 Ti 16GB | 2 TB NVMe |
    | Mid-Range | Ryzen 9 7950X (16c/32t) | 64 GB | RTX 4080 16GB | 4 TB NVMe |
    | Production | Threadripper 7970X (32c/64t) | 128 GB | 2× RTX 4090 24GB | 8 TB NVMe |

    Stage 1: Ingestion Throughput

    Ingestion covers reading source files, parsing their structure, and extracting raw content (text, images, metadata).

    By Document Type

    | Document Type | Avg Size | Entry (docs/min) | Mid-Range (docs/min) | Production (docs/min) |
    | --- | --- | --- | --- | --- |
    | Native PDF (text-based) | 500 KB | 200–400 | 400–800 | 800–1,500 |
    | Scanned PDF (image-based) | 5 MB | 60–120 | 120–250 | 250–500 |
    | Word (.docx) | 200 KB | 300–600 | 600–1,200 | 1,200–2,000 |
    | Excel (.xlsx) | 1 MB | 100–200 | 200–400 | 400–800 |
    | Plain text / CSV | 50 KB | 1,000–3,000 | 3,000–8,000 | 8,000–15,000 |
    | Images (JPEG/PNG) | 2 MB | 150–300 | 300–600 | 600–1,200 |
    | HTML | 100 KB | 500–1,000 | 1,000–2,000 | 2,000–4,000 |
    | Email (.eml/.msg) | 100 KB | 200–400 | 400–800 | 800–1,500 |

    Ingestion Bottleneck Analysis

    Native PDFs: CPU-bound. PDF parsing is single-threaded per file, so throughput scales with the number of parallel workers (limited by CPU cores and I/O).

    Scanned PDFs: I/O-bound. Each page is a large image that must be decompressed. Storage speed dominates.

    Excel files: Memory-bound for large spreadsheets. A 50 MB Excel file can decompress to 500 MB+ in memory. Parallel processing limited by RAM.
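
    To make the worker-scaling point concrete, here is a minimal Python sketch of parallel ingestion with a process pool. The parse_file body is a hypothetical stand-in for real format-specific parsers; the worker count and chunk size are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def parse_file(path: Path) -> dict:
    # Hypothetical stand-in: a real pipeline dispatches on suffix to
    # dedicated PDF/.docx/.xlsx parsers here.
    return {"path": str(path), "bytes": path.stat().st_size}

def ingest(root: str, workers: int = 8) -> list[dict]:
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    # Parsing is single-threaded per file, so throughput scales with the
    # number of worker processes until CPU cores or storage I/O saturate.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_file, files, chunksize=64))
```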

    What 100 GB Looks Like

    A 100 GB enterprise archive typically contains a mix of document types. A representative distribution:

    | Type | Percentage | ~File Count | ~Total Size |
    | --- | --- | --- | --- |
    | Native PDF | 40% | 80,000 files | 40 GB |
    | Scanned PDF | 25% | 5,000 files | 25 GB |
    | Word/Excel | 20% | 40,000 files | 20 GB |
    | Images | 10% | 5,000 files | 10 GB |
    | Other (text, HTML, email) | 5% | 20,000 files | 5 GB |
    | Total | 100% | ~150,000 files | 100 GB |

    Mid-range ingestion time for this mix: ~4–8 hours. Although scanned PDFs account for only 25% of the volume, they carry the highest per-document cost at this stage, and once OCR runs downstream they come to dominate the end-to-end timeline.


    Stage 2: OCR Throughput

    OCR applies only to scanned documents and images. Text-based documents skip this stage.

    By Engine and Hardware

    | Engine | Hardware | Pages/Second | Accuracy (Clean Scans) | Accuracy (Poor Quality) |
    | --- | --- | --- | --- | --- |
    | Tesseract 5 | CPU (8 cores) | 1–3 | 90–95% | 70–80% |
    | Tesseract 5 | CPU (16 cores) | 2–5 | 90–95% | 70–80% |
    | PaddleOCR | CPU (16 cores) | 3–6 | 92–96% | 75–85% |
    | PaddleOCR | GPU (RTX 4070) | 15–25 | 92–96% | 75–85% |
    | PaddleOCR | GPU (RTX 4090) | 25–40 | 92–96% | 75–85% |
    | EasyOCR | GPU (RTX 4070) | 10–18 | 90–94% | 70–82% |
    | Surya OCR | GPU (RTX 4070) | 20–30 | 94–97% | 80–88% |
    | Surya OCR | GPU (RTX 4090) | 30–50 | 94–97% | 80–88% |

    OCR Time Estimates

    | Archive Size (Scanned Pages) | CPU-Only (Tesseract) | GPU Mid-Range | GPU Production |
    | --- | --- | --- | --- |
    | 10,000 pages | 1–3 hours | 7–12 minutes | 4–7 minutes |
    | 50,000 pages | 5–14 hours | 35–55 minutes | 17–33 minutes |
    | 100,000 pages | 10–28 hours | 1.1–1.8 hours | 0.6–1.1 hours |
    | 500,000 pages | 2–6 days | 5.5–9.2 hours | 2.8–5.5 hours |
    | 1,000,000 pages | 4–12 days | 11–18 hours | 5.5–11 hours |

    Key insight: OCR is the single largest time sink in pipelines with scanned documents. If your client's archive is mostly scanned PDFs, OCR throughput determines your project timeline.
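
    As a concrete reference for the CPU rows above, here is a minimal Tesseract-based sketch using the pytesseract and pdf2image libraries (the tesseract binary and poppler must be installed separately). File paths, DPI, and worker count are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

import pytesseract                       # wrapper around the Tesseract binary
from pdf2image import convert_from_path  # rasterizes PDF pages via poppler

def ocr_pdf(path: str, dpi: int = 300) -> str:
    """OCR every page of a scanned PDF and return the concatenated text."""
    pages = convert_from_path(path, dpi=dpi)  # decompressing pages is the I/O-heavy step
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

if __name__ == "__main__":
    # CPU OCR parallelizes across files: one Tesseract process per worker.
    with ProcessPoolExecutor(max_workers=8) as pool:
        texts = list(pool.map(ocr_pdf, ["scan_001.pdf", "scan_002.pdf"]))
```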


    Stage 3: Cleaning Throughput

    Cleaning includes deduplication, format normalization, PII detection/redaction, and quality filtering.

    By Operation

    | Operation | Method | Throughput (Mid-Range) | RAM Usage |
    | --- | --- | --- | --- |
    | Exact dedup | SHA-256 hash | 50,000–100,000 docs/min | Low (under 1 GB for 1M docs) |
    | Fuzzy dedup (MinHash) | 128 permutations | 5,000–15,000 docs/min | 2–4 GB per 1M docs |
    | PII detection (regex) | Pattern matching | 10,000–30,000 docs/min | Low |
    | PII detection (NER model) | GLiNER / spaCy NER | 500–2,000 docs/min | 2–4 GB VRAM |
    | PII redaction | Replace detected PII | Same as detection | Same |
    | Format normalization | Unicode, whitespace cleanup | 20,000–50,000 docs/min | Low |
    | Quality filtering | Length, language, coherence | 10,000–30,000 docs/min | Low |
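
    For reference, here is a minimal sketch of the two deduplication passes, using the standard-library hashlib for the exact pass and the datasketch library for MinHash. The 0.8 similarity threshold and whitespace tokenization are illustrative choices.

```python
import hashlib
from datasketch import MinHash, MinHashLSH

def exact_dedup(docs: list[str]) -> list[str]:
    """Drop byte-identical documents via SHA-256 digests."""
    seen, unique = set(), []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

def fuzzy_dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Drop near-duplicates with 128-permutation MinHash + LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique = []
    for i, text in enumerate(docs):
        mh = MinHash(num_perm=128)
        for token in text.split():      # illustrative tokenization
            mh.update(token.encode("utf-8"))
        if not lsh.query(mh):           # no near-duplicate indexed yet
            lsh.insert(str(i), mh)
            unique.append(text)
    return unique
```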

    Cleaning Time Estimates

    For a 150,000-document archive (the 100 GB mix from above):

    | Operation | Mid-Range Time |
    | --- | --- |
    | Exact dedup | 2–3 minutes |
    | Fuzzy dedup | 10–30 minutes |
    | Regex PII detection | 5–15 minutes |
    | NER PII detection | 1.5–5 hours |
    | Format normalization | 3–8 minutes |
    | Quality filtering | 5–15 minutes |
    | Total (with NER PII) | ~2–6 hours |
    | Total (regex PII only) | ~25–70 minutes |

    NER-based PII detection is the cleaning bottleneck. For projects where regex-based PII detection is sufficient (financial documents with structured PII like SSNs, account numbers), cleaning is fast. For unstructured PII in narrative text, NER adds significant time.
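
    Here is a minimal sketch of the regex path, with a few illustrative US-centric patterns; production rule sets are larger and locale-specific.

```python
import re

# Illustrative patterns only; real deployments need broader, locale-aware sets.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected span with its label, e.g. '[SSN]'."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```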


    Stage 4: Labeling Throughput

    Manual Labeling

    Human labeling speed varies enormously by task complexity and annotator experience:

    | Task | Speed (Experienced Annotator) | Documents/Day (8 hrs) |
    | --- | --- | --- |
    | Binary classification | 5–10 seconds/doc | 2,800–5,700 |
    | Multi-class (5–10 categories) | 10–30 seconds/doc | 960–2,800 |
    | Named entity annotation | 1–5 minutes/doc | 96–480 |
    | Span-level labeling | 2–10 minutes/doc | 48–240 |
    | Complex multi-label | 30–120 seconds/doc | 240–960 |

    AI-Assisted Labeling (Pre-Annotation + Human Review)

    The pre-annotation phase uses local LLM inference. Human review time depends on pre-annotation accuracy.

    Pre-annotation throughput (LLM inference):

    | Task | Model | Quant | Hardware | Docs/Hour |
    | --- | --- | --- | --- | --- |
    | Binary classification | Mistral 7B | Q4_K_M | RTX 4070 | 2,500–3,500 |
    | Multi-class (5 cats) | Mistral 7B | Q4_K_M | RTX 4070 | 2,000–3,000 |
    | Multi-class (5 cats) | Qwen 2.5 14B | Q4_K_M | RTX 4080 | 1,000–1,800 |
    | Entity extraction | Qwen 2.5 14B | Q5_K_M | RTX 4080 | 800–1,400 |
    | Document summarization | Qwen 2.5 14B | Q4_K_M | RTX 4080 | 300–500 |
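
    A minimal sketch of the pre-annotation loop, assuming a local llama.cpp-style server exposing an OpenAI-compatible /v1/chat/completions endpoint. The endpoint URL, model name, prompt, and truncation limit are all illustrative assumptions.

```python
import json
import urllib.request

# Assumed local inference server (e.g. llama.cpp) with an OpenAI-compatible API.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

def pre_annotate(text: str) -> str:
    """Ask the local model for a one-word binary label."""
    payload = {
        "model": "mistral-7b-q4_k_m",  # illustrative model name
        "messages": [
            {"role": "system",
             "content": "Answer with exactly one word: relevant or irrelevant."},
            {"role": "user", "content": text[:4000]},  # truncate long documents
        ],
        "temperature": 0,
    }
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"].strip().lower()
```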

    Human review throughput (reviewing pre-annotations):

    | Pre-Annotation Accuracy | Review Speed | Effective Throughput vs. Manual |
    | --- | --- | --- |
    | >90% correct | 3–5 seconds/doc (confirm or fix) | 5–10x faster than manual |
    | 80–90% correct | 5–15 seconds/doc | 3–5x faster than manual |
    | 70–80% correct | 10–30 seconds/doc | 1.5–3x faster than manual |
    | Under 70% correct | 15–60 seconds/doc | Marginal improvement |

    Break-even: Below ~70% pre-annotation accuracy, human reviewers spend more time understanding and correcting errors than they would labeling from scratch. The AI assistance becomes a distraction rather than an accelerator.

    Combined Labeling Timeline

    For 150,000 documents with binary classification:

    | Approach | Time Estimate |
    | --- | --- |
    | Manual (2 annotators) | 13–27 working days |
    | AI-assisted (90% accuracy, 2 reviewers) | 2–4 working days |
    | AI-assisted (80% accuracy, 2 reviewers) | 4–8 working days |

    AI-assisted labeling with >80% pre-annotation accuracy reduces labeling time by 3–10x.


    Stage 5: Augmentation Throughput

    Synthetic data generation throughput depends on output length:

    | Task | Model | Hardware | Output Length | Docs/Hour |
    | --- | --- | --- | --- | --- |
    | Paraphrase generation | Mistral 7B Q4 | RTX 4070 | ~100 tokens | 1,500–2,500 |
    | Synthetic document generation | Qwen 2.5 14B Q4 | RTX 4080 | ~500 tokens | 100–200 |
    | Augmented examples (classification) | Mistral 7B Q4 | RTX 4070 | ~50 tokens | 3,000–5,000 |
    | Question-answer pair generation | Qwen 2.5 14B Q4 | RTX 4080 | ~200 tokens | 400–700 |

    Stage 6: Export Throughput

    Export is rarely the bottleneck:

    | Format | Size (150K docs) | NVMe Write Time | SATA SSD Write Time |
    | --- | --- | --- | --- |
    | JSONL | 5–20 GB | 1–5 seconds | 10–40 seconds |
    | JSONL (gzip compressed) | 1–5 GB | 30–120 seconds | 60–240 seconds |
    | Parquet | 3–12 GB | 1–5 seconds | 10–40 seconds |
    | HuggingFace Dataset | 5–20 GB | 5–15 seconds | 30–120 seconds |
    | CSV | 5–20 GB | 1–5 seconds | 10–40 seconds |
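
    Export amounts to straightforward serialization. A minimal sketch of the JSONL (gzip) and Parquet writers using the standard library and pandas; the compression step is where the extra write time in the table comes from.

```python
import gzip
import json

import pandas as pd  # Parquet writing requires pyarrow or fastparquet

def export_jsonl_gz(records: list[dict], path: str) -> None:
    # gzip trades write speed for a roughly 4x smaller artifact.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def export_parquet(records: list[dict], path: str) -> None:
    pd.DataFrame(records).to_parquet(path)
```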

    End-to-End Pipeline Estimates

    Scenario A: 100 GB Mixed Enterprise Documents (150K Files)

    Mid-range hardware (Ryzen 9, 64 GB RAM, RTX 4080):

    | Stage | Time Estimate |
    | --- | --- |
    | Ingestion | 4–8 hours |
    | OCR (scanned subset: ~50K pages) | 35–55 minutes |
    | Cleaning (with regex PII) | 25–70 minutes |
    | AI-assisted labeling (binary classification) | 50–75 minutes (pre-annotation) + 2–4 days (human review) |
    | Export | Under 5 minutes |
    | Total compute time | ~6–10 hours |
    | Total project time (incl. human review) | 3–5 working days |

    Scenario B: 500 GB Scanned Document Archive (500K Pages)

    Mid-range hardware:

    | Stage | Time Estimate |
    | --- | --- |
    | Ingestion | 12–24 hours |
    | OCR (500K pages, GPU) | 5.5–9 hours |
    | Cleaning (with NER PII) | 4–12 hours |
    | AI-assisted labeling (multi-class) | 3–5 hours (pre-annotation) + 5–10 days (human review) |
    | Export | Under 10 minutes |
    | Total compute time | ~24–50 hours |
    | Total project time | 1–2 weeks |

    Scenario C: 1 TB Mixed Enterprise Archive (1M+ Files)

    Production hardware (Threadripper, 128 GB RAM, 2× RTX 4090):

    | Stage | Time Estimate |
    | --- | --- |
    | Ingestion | 24–48 hours |
    | OCR (scanned subset: ~200K pages) | 1–2 hours |
    | Cleaning (with NER PII) | 8–24 hours |
    | AI-assisted labeling (entity extraction) | 12–24 hours (pre-annotation) + 2–4 weeks (human review) |
    | Export | Under 30 minutes |
    | Total compute time | ~2–4 days |
    | Total project time | 3–5 weeks |

    How to Estimate Timeline from Data Volume

    A quick estimation framework for scoping proposals (a worked sketch follows the list):

    1. Assess document types: What percentage is scanned vs. native text? Scanned documents take 5–10x longer per document.
    2. Estimate file count: Total volume ÷ average file size. A 100 GB archive might be 10,000 large files or 500,000 small files. File count affects ingestion time; total volume affects OCR time.
    3. Identify the labeling task: Binary classification? Multi-label? Entity extraction? Task complexity determines both LLM inference time and human review time.
    4. Calculate human review hours: divide document count by pre-annotation throughput to get machine time, then multiply document count by the review speed for your expected accuracy level to get review hours. This is usually the longest phase.
    5. Add buffer: Real-world archives contain corrupt files, unexpected formats, and edge cases. Add 20–30% to compute time estimates.
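
    The framework reduces to simple arithmetic. Here is a back-of-envelope sketch with hypothetical mid-range rates pulled from the tables above; swap the constants for the rows that match your hardware and labeling task.

```python
# Back-of-envelope scoping calculator. All rates below are assumptions
# taken from the mid-range rows of the benchmark tables in this guide.
SCANNED_PAGES_PER_HOUR = 20 * 3600   # ~20 pages/s GPU OCR
NATIVE_DOCS_PER_HOUR = 600 * 60      # ~600 docs/min ingestion
REVIEW_SECONDS_PER_DOC = 10          # ~80-90% pre-annotation accuracy

def estimate_hours(native_docs: int, scanned_pages: int, label_docs: int) -> dict:
    compute = (native_docs / NATIVE_DOCS_PER_HOUR
               + scanned_pages / SCANNED_PAGES_PER_HOUR)
    review = label_docs * REVIEW_SECONDS_PER_DOC / 3600
    return {
        "compute_hours": round(compute * 1.25, 1),  # ~25% buffer for edge cases
        "review_hours": round(review, 1),
    }

print(estimate_hours(native_docs=100_000, scanned_pages=50_000, label_docs=150_000))
```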

    Improving Throughput Without Additional Hardware

    Before buying more hardware, optimize what you have:

    1. Fix the storage bottleneck: If source data is on HDD or network storage, copy it to local NVMe. This alone can cut ingestion time by 5–20x.
    2. Skip unnecessary OCR: Check if scanned PDFs already have text layers. Many enterprise scanners produce PDFs with embedded OCR. Extracting the existing text layer is 100x faster than re-running OCR (a detection sketch follows this list).
    3. Use the right quantization: Q4_K_M instead of Q8_0 for classification tasks. 40–60% throughput improvement with minimal accuracy loss.
    4. Increase inference parallelism: If VRAM allows, run 2–4 concurrent LLM requests.
    5. Pre-filter aggressively: Remove duplicate and irrelevant files before processing. A 10% reduction in file count saves 10% of pipeline time.
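
    For point 2, here is a minimal text-layer check using the pypdf library; the three-page sample and 200-character threshold are heuristic assumptions, not fixed rules.

```python
from pypdf import PdfReader

def has_text_layer(path: str, min_chars: int = 200) -> bool:
    """Heuristic: if the first few pages already yield text, skip OCR for this file."""
    reader = PdfReader(path)
    sample = "".join((page.extract_text() or "") for page in reader.pages[:3])
    return len(sample.strip()) >= min_chars
```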

    Ertas Data Suite Performance

    Ertas Data Suite's native desktop architecture avoids the overhead that containerized tools introduce — no Docker networking layer, no volume mount I/O penalties, no container runtime overhead. The application accesses the filesystem and GPU directly, which translates to throughput numbers at the upper end of the ranges listed in this guide.

    The built-in pipeline processes documents through Ingest → Clean → Label → Augment → Export stages with automatic batching and progress tracking. For service providers, this means the pipeline runs overnight with predictable throughput and detailed logging of what was processed, what failed, and what's ready for human review.


    Using These Numbers

    These benchmarks exist to answer one question: "How long will the data preparation phase take?" When scoping an engagement, estimate the compute time from these tables, add human review time based on your labeling task and team size, and apply a 20–30% buffer. The result is a defensible timeline for your statement of work.

    For more on the hardware and architecture decisions behind these numbers, see Hardware Sizing for On-Premise Data Preparation and On-Premise Runtime Architecture for Enterprise AI Data Preparation.
