
Benchmark: On-Premise Data Prep Pipeline Throughput for 100GB+ Enterprise Datasets
Realistic throughput benchmarks for on-premise data preparation — ingestion, OCR, cleaning, labeling, and export speeds by document type and hardware configuration.
Every service provider delivering data preparation for enterprise AI projects faces the same question during scoping: "How long will this take?"
The answer depends on document types, dataset size, pipeline stages, and hardware. Vague estimates like "a few weeks" don't help when writing statements of work with fixed timelines. Concrete throughput numbers do.
This guide provides realistic benchmark data for each pipeline stage across different document types and hardware configurations. These numbers come from common configurations, not idealized lab conditions. Use them as baselines for scoping engagements.
Methodology Note
All benchmarks assume:
- Single machine processing (not distributed)
- Documents processed sequentially through pipeline stages (ingest all → clean all → label all → export all)
- Default configurations for OCR engines and inference backends (no exotic tuning)
- Throughput measured as sustained rate after initial warmup, not peak burst
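The last assumption matters in practice: the first few hundred documents run slower while OS caches fill and models load. A minimal sketch of how sustained throughput can be measured, assuming a placeholder `process_doc` function standing in for any pipeline stage:

```python
import time

WARMUP_DOCS = 200  # discard the first N docs so cache fills and
                   # model loading don't distort the measured rate

def sustained_docs_per_min(docs, process_doc):
    """Measure sustained throughput, excluding the warmup window."""
    for doc in docs[:WARMUP_DOCS]:
        process_doc(doc)  # warmup pass: work done, time discarded
    start = time.perf_counter()
    for doc in docs[WARMUP_DOCS:]:
        process_doc(doc)
    elapsed = time.perf_counter() - start
    return (len(docs) - WARMUP_DOCS) / (elapsed / 60)
```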
Hardware configurations referenced:
| Config | CPU | RAM | GPU | Storage |
|---|---|---|---|---|
| Entry | Ryzen 7 7700 (8c/16t) | 32 GB | RTX 4060 Ti 16GB | 2 TB NVMe |
| Mid-Range | Ryzen 9 7950X (16c/32t) | 64 GB | RTX 4080 16GB | 4 TB NVMe |
| Production | Threadripper 7970X (32c/64t) | 128 GB | 2× RTX 4090 24GB | 8 TB NVMe |
Stage 1: Ingestion Throughput
Ingestion covers reading source files, parsing their structure, and extracting raw content (text, images, metadata).
By Document Type
| Document Type | Avg Size | Entry (docs/min) | Mid-Range (docs/min) | Production (docs/min) |
|---|---|---|---|---|
| Native PDF (text-based) | 500 KB | 200–400 | 400–800 | 800–1,500 |
| Scanned PDF (image-based) | 5 MB | 60–120 | 120–250 | 250–500 |
| Word (.docx) | 200 KB | 300–600 | 600–1,200 | 1,200–2,000 |
| Excel (.xlsx) | 1 MB | 100–200 | 200–400 | 400–800 |
| Plain text / CSV | 50 KB | 1,000–3,000 | 3,000–8,000 | 8,000–15,000 |
| Images (JPEG/PNG) | 2 MB | 150–300 | 300–600 | 600–1,200 |
| HTML | 100 KB | 500–1,000 | 1,000–2,000 | 2,000–4,000 |
| Email (.eml/.msg) | 100 KB | 200–400 | 400–800 | 800–1,500 |
Ingestion Bottleneck Analysis
Native PDFs: CPU-bound. PDF parsing is single-threaded per file, so throughput scales with the number of parallel workers (limited by CPU cores and I/O).
Scanned PDFs: I/O-bound. Each page is a large image that must be decompressed. Storage speed dominates.
Excel files: Memory-bound for large spreadsheets. A 50 MB Excel file can decompress to 500 MB+ in memory. Parallel processing limited by RAM.
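Since per-file parsing is single-threaded, ingestion throughput comes from running many parsers at once. A minimal sketch using a process pool; pypdf is an assumption here, and any parser with per-file text extraction slots in the same way:

```python
import os
from concurrent.futures import ProcessPoolExecutor

from pypdf import PdfReader  # assumption: swap in your parser of choice

def ingest_pdf(path: str) -> dict:
    """Parse one native PDF: raw text plus basic metadata."""
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return {"path": path, "pages": len(reader.pages), "text": text}

def ingest_all(paths: list[str]) -> list[dict]:
    # One worker per core: throughput scales with worker count
    # until storage I/O saturates (hence the NVMe recommendation).
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(ingest_pdf, paths, chunksize=16))
```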
What 100 GB Looks Like
A 100 GB enterprise archive typically contains a mix of document types. A representative distribution:
| Type | Percentage | ~File Count | ~Total Size |
|---|---|---|---|
| Native PDF | 40% | 80,000 files | 40 GB |
| Scanned PDF | 25% | 5,000 files | 25 GB |
| Word/Excel | 20% | 40,000 files | 20 GB |
| Images | 10% | 5,000 files | 10 GB |
| Other (text, HTML, email) | 5% | 20,000 files | 5 GB |
| Total | 100% | ~150,000 files | 100 GB |
Mid-range ingestion time for this mix: ~4–8 hours. Native PDFs consume most of the ingestion time through sheer file count; the scanned PDFs, despite being only 25% of the volume, dominate the end-to-end timeline once OCR (Stage 2) enters the picture.
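That estimate falls out of the two tables above. A quick sanity-check sketch using the mid-range docs/min columns; the blended rates for mixed categories are rough assumptions:

```python
# (file_count, mid-range docs/min low, high); mixed categories use blended rates
MIX = {
    "native_pdf":  (80_000, 400, 800),
    "scanned_pdf": (5_000, 120, 250),
    "word_excel":  (40_000, 200, 1_200),   # .xlsx low end, .docx high end
    "images":      (5_000, 300, 600),
    "other":       (20_000, 400, 8_000),   # email low end, text/CSV high end
}

best = sum(n / hi for n, lo, hi in MIX.values()) / 60    # hours
worst = sum(n / lo for n, lo, hi in MIX.values()) / 60   # hours
print(f"Ingestion estimate: {best:.1f}-{worst:.1f} hours")  # ~2.7-8.5 hours
```

The best case assumes every document type hits its top rate simultaneously, which never happens in practice; the realistic band narrows to roughly 4–8 hours.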
Stage 2: OCR Throughput
OCR applies only to scanned documents and images. Text-based documents skip this stage.
By Engine and Hardware
| Engine | Hardware | Pages/Second | Accuracy (Clean Scans) | Accuracy (Poor Quality) |
|---|---|---|---|---|
| Tesseract 5 | CPU (8 cores) | 1–3 | 90–95% | 70–80% |
| Tesseract 5 | CPU (16 cores) | 2–5 | 90–95% | 70–80% |
| PaddleOCR | CPU (16 cores) | 3–6 | 92–96% | 75–85% |
| PaddleOCR | GPU (RTX 4070) | 15–25 | 92–96% | 75–85% |
| PaddleOCR | GPU (RTX 4090) | 25–40 | 92–96% | 75–85% |
| EasyOCR | GPU (RTX 4070) | 10–18 | 90–94% | 70–82% |
| Surya OCR | GPU (RTX 4070) | 20–30 | 94–97% | 80–88% |
| Surya OCR | GPU (RTX 4090) | 30–50 | 94–97% | 80–88% |
OCR Time Estimates
| Archive Size (Scanned Pages) | CPU-Only (Tesseract) | GPU Mid-Range | GPU Production |
|---|---|---|---|
| 10,000 pages | 1–3 hours | 7–12 minutes | 4–7 minutes |
| 50,000 pages | 5–14 hours | 35–55 minutes | 17–33 minutes |
| 100,000 pages | 10–28 hours | 1.1–1.8 hours | 0.6–1.1 hours |
| 500,000 pages | 2–6 days | 5.5–9.2 hours | 2.8–5.5 hours |
| 1,000,000 pages | 4–12 days | 11–18 hours | 5.5–11 hours |
Key insight: OCR is the single largest time sink in pipelines with scanned documents. If your client's archive is mostly scanned PDFs, OCR throughput determines your project timeline.
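For CPU-only runs, the Tesseract rates in the table come from parallelizing across cores, since Tesseract handles one page at a time. A minimal sketch assuming pre-rasterized page images and pytesseract, with the Tesseract 5 binary installed separately:

```python
import os
from concurrent.futures import ProcessPoolExecutor

from PIL import Image
import pytesseract  # assumption: Tesseract 5 binary on PATH

def ocr_page(image_path: str) -> str:
    """OCR a single rasterized page image."""
    return pytesseract.image_to_string(Image.open(image_path))

def ocr_pages(image_paths: list[str]) -> list[str]:
    # One Tesseract process per core; expect ~1-3 pages/sec
    # sustained on 8 cores, per the table above.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(ocr_page, image_paths, chunksize=8))
```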
Stage 3: Cleaning Throughput
Cleaning includes deduplication, format normalization, PII detection/redaction, and quality filtering.
By Operation
| Operation | Method | Throughput (Mid-Range) | RAM Usage |
|---|---|---|---|
| Exact dedup | SHA-256 hash | 50,000–100,000 docs/min | Low (under 1 GB for 1M docs) |
| Fuzzy dedup (MinHash) | 128 permutations | 5,000–15,000 docs/min | 2–4 GB per 1M docs |
| PII detection (regex) | Pattern matching | 10,000–30,000 docs/min | Low |
| PII detection (NER model) | GLiNER / SpaCy NER | 500–2,000 docs/min | 2–4 GB VRAM |
| PII redaction | Replace detected PII | Same as detection | Same |
| Format normalization | Unicode, whitespace cleanup | 20,000–50,000 docs/min | Low |
| Quality filtering | Length, language, coherence | 10,000–30,000 docs/min | Low |
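Exact dedup is cheap because only a 32-byte digest per document stays in memory, which is how 1M documents fit under 1 GB. A minimal sketch:

```python
import hashlib

def exact_dedup(docs: dict[str, str]) -> dict[str, str]:
    """Drop byte-identical duplicates; first occurrence wins."""
    seen: set[bytes] = set()
    unique: dict[str, str] = {}
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).digest()  # 32 bytes/doc
        if digest not in seen:
            seen.add(digest)
            unique[doc_id] = text
    return unique
```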
Cleaning Time Estimates
For a 150,000-document archive (the 100 GB mix from above):
| Operation | Mid-Range Time |
|---|---|
| Exact dedup | 2–3 minutes |
| Fuzzy dedup | 10–30 minutes |
| Regex PII detection | 5–15 minutes |
| NER PII detection | 1.5–5 hours |
| Format normalization | 3–8 minutes |
| Quality filtering | 5–15 minutes |
| Total (with NER PII) | ~2–6 hours |
| Total (regex PII only) | ~25–70 minutes |
NER-based PII detection is the cleaning bottleneck. For projects where regex-based PII detection is sufficient (financial documents with structured PII like SSNs, account numbers), cleaning is fast. For unstructured PII in narrative text, NER adds significant time.
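Regex-based detection is fast because it is a linear scan per pattern. A minimal redaction sketch; the patterns shown are illustrative only, and production rules need tuning per client and jurisdiction:

```python
import re

# Illustrative patterns, not a complete PII rule set
PII_PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace structured PII matches with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```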
Stage 4: Labeling Throughput
Manual Labeling
Human labeling speed varies enormously by task complexity and annotator experience:
| Task | Speed (Experienced Annotator) | Documents/Day (8 hrs) |
|---|---|---|
| Binary classification | 5–10 seconds/doc | 2,800–5,700 |
| Multi-class (5–10 categories) | 10–30 seconds/doc | 960–2,800 |
| Named entity annotation | 1–5 minutes/doc | 96–480 |
| Span-level labeling | 2–10 minutes/doc | 48–240 |
| Complex multi-label | 30–120 seconds/doc | 240–960 |
AI-Assisted Labeling (Pre-Annotation + Human Review)
The pre-annotation phase uses local LLM inference. Human review time depends on pre-annotation accuracy.
Pre-annotation throughput (LLM inference):
| Task | Model | Quant | Hardware | Docs/Hour |
|---|---|---|---|---|
| Binary classification | Mistral 7B | Q4_K_M | RTX 4070 | 2,500–3,500 |
| Multi-class (5 cats) | Mistral 7B | Q4_K_M | RTX 4070 | 2,000–3,000 |
| Multi-class (5 cats) | Qwen 2.5 14B | Q4_K_M | RTX 4080 | 1,000–1,800 |
| Entity extraction | Qwen 2.5 14B | Q5_K_M | RTX 4080 | 800–1,400 |
| Document summarization | Qwen 2.5 14B | Q4_K_M | RTX 4080 | 300–500 |
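A minimal pre-annotation sketch against a local Ollama endpoint; the model tag, prompt wording, and label set are assumptions to adapt per task:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def pre_annotate(text: str) -> str:
    """Binary classification pre-annotation: returns 'relevant' or 'irrelevant'."""
    prompt = (
        "Answer with exactly one word, 'relevant' or 'irrelevant'.\n"
        "Is this document relevant to contract review?\n\n"
        f"{text[:4000]}\n\nAnswer:"
    )
    resp = requests.post(OLLAMA_URL, json={
        "model": "mistral:7b",  # assumption: a Q4-quantized model already pulled
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0, "num_predict": 4},  # deterministic, short
    }, timeout=120)
    return resp.json()["response"].strip().lower()
```

Capping `num_predict` matters: throughput is dominated by output token count, which is one reason the summarization rows in the table run far slower than classification on the same hardware.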
Human review throughput (reviewing pre-annotations):
| Pre-Annotation Accuracy | Review Speed | Effective Throughput vs. Manual |
|---|---|---|
| >90% correct | 3–5 seconds/doc (confirm or fix) | 5–10x faster than manual |
| 80–90% correct | 5–15 seconds/doc | 3–5x faster than manual |
| 70–80% correct | 10–30 seconds/doc | 1.5–3x faster than manual |
| Under 70% correct | 15–60 seconds/doc | Marginal improvement |
Break-even: Below ~70% pre-annotation accuracy, human reviewers spend more time understanding and correcting errors than they would labeling from scratch. The AI assistance becomes a distraction rather than an accelerator.
Combined Labeling Timeline
For 150,000 documents with binary classification:
| Approach | Time Estimate |
|---|---|
| Manual (2 annotators) | 13–27 working days |
| AI-assisted (90% accuracy, 2 reviewers) | 2–4 working days |
| AI-assisted (80% accuracy, 2 reviewers) | 4–8 working days |
AI-assisted labeling with >80% pre-annotation accuracy makes labeling 3–10x faster than fully manual annotation.
Stage 5: Augmentation Throughput
Synthetic data generation throughput depends on output length:
| Task | Model | Hardware | Output Length | Docs/Hour |
|---|---|---|---|---|
| Paraphrase generation | Mistral 7B Q4 | RTX 4070 | ~100 tokens | 1,500–2,500 |
| Synthetic document generation | Qwen 2.5 14B Q4 | RTX 4080 | ~500 tokens | 100–200 |
| Augmented examples (classification) | Mistral 7B Q4 | RTX 4070 | ~50 tokens | 3,000–5,000 |
| Question-answer pair generation | Qwen 2.5 14B Q4 | RTX 4080 | ~200 tokens | 400–700 |
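The same local-endpoint pattern covers augmentation; only the prompt and sampling settings change. A sketch for paraphrase generation, where the model tag and temperature are assumptions:

```python
import requests

def paraphrases(text: str, n: int = 3) -> list[str]:
    """Generate n paraphrases of a short example via a local model."""
    out = []
    for _ in range(n):
        resp = requests.post("http://localhost:11434/api/generate", json={
            "model": "mistral:7b",
            "prompt": f"Paraphrase this, preserving its meaning:\n{text}\n\nParaphrase:",
            "stream": False,
            "options": {"temperature": 0.9, "num_predict": 120},  # sample for variety
        }, timeout=120)
        out.append(resp.json()["response"].strip())
    return out
```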
Stage 6: Export Throughput
Export is rarely the bottleneck:
| Format | Size (150K docs) | NVMe Write Time | SATA SSD Write Time |
|---|---|---|---|
| JSONL | 5–20 GB | 1–5 seconds | 10–40 seconds |
| JSONL (gzip compressed) | 1–5 GB | 30–120 seconds | 60–240 seconds |
| Parquet | 3–12 GB | 1–5 seconds | 10–40 seconds |
| HuggingFace Dataset | 5–20 GB | 5–15 seconds | 30–120 seconds |
| CSV | 5–20 GB | 1–5 seconds | 10–40 seconds |
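A minimal export sketch for the two most common targets, assuming documents are already in memory as dicts; pyarrow handles the Parquet side:

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq

def export(docs: list[dict], stem: str) -> None:
    # JSONL: one record per line; streams and resumes trivially
    with open(f"{stem}.jsonl", "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")
    # Parquet: columnar and compressed; smaller on disk, faster to reload
    pq.write_table(pa.Table.from_pylist(docs), f"{stem}.parquet")
```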
End-to-End Pipeline Estimates
Scenario A: 100 GB Mixed Enterprise Documents (150K Files)
Mid-range hardware (Ryzen 9, 64 GB RAM, RTX 4080):
| Stage | Time Estimate |
|---|---|
| Ingestion | 4–8 hours |
| OCR (scanned subset: ~50K pages) | 35–55 minutes |
| Cleaning (with regex PII) | 25–70 minutes |
| AI-assisted labeling (binary classification) | ~40–60 hours (pre-annotation, runs unattended) + 2–4 days (human review) |
| Export | Under 5 minutes |
| Total compute time | ~2–3 days |
| Total project time (incl. human review) | ~4–7 working days |
Scenario B: 500 GB Scanned Document Archive (500K Pages)
Mid-range hardware:
| Stage | Time Estimate |
|---|---|
| Ingestion | 12–24 hours |
| OCR (500K pages, GPU) | 5.5–9 hours |
| Cleaning (with NER PII) | 4–12 hours |
| AI-assisted labeling (multi-class) | 3–5 hours (pre-annotation) + 5–10 days (human review) |
| Export | Under 10 minutes |
| Total compute time | ~24–50 hours |
| Total project time | 1–2 weeks |
Scenario C: 1 TB Mixed Enterprise Archive (1M+ Files)
Production hardware (Threadripper, 128 GB RAM, 2× RTX 4090):
| Stage | Time Estimate |
|---|---|
| Ingestion | 24–48 hours |
| OCR (scanned subset: ~200K pages) | 1–2 hours |
| Cleaning (with NER PII) | 8–24 hours |
| AI-assisted labeling (entity extraction) | 12–24 hours (pre-annotation) + 2–4 weeks (human review) |
| Export | Under 30 minutes |
| Total compute time | ~2–4 days |
| Total project time | 3–5 weeks |
How to Estimate Timeline from Data Volume
A quick estimation framework for scoping proposals (a calculator sketch follows the list):
- Assess document types: What percentage is scanned vs. native text? Scanned documents take 5–10x longer per document.
- Estimate file count: Total volume ÷ average file size. A 100 GB archive might be 10,000 large files or 500,000 small files. File count affects ingestion time; total volume affects OCR time.
- Identify the labeling task: Binary classification? Multi-label? Entity extraction? Task complexity determines both LLM inference time and human review time.
- Calculate human review hours: Pre-annotation throughput × accuracy level → review hours. This is usually the longest phase.
- Add buffer: Real-world archives contain corrupt files, unexpected formats, and edge cases. Add 20–30% to compute time estimates.
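These steps reduce to a few divisions. A rough calculator sketch; every default rate is an illustrative mid-range value from the tables above, not a measured constant:

```python
def estimate_compute_hours(total_files: int, scanned_pages: int,
                           ingest_docs_per_min: float = 500,
                           ocr_pages_per_sec: float = 20,
                           preannotate_docs_per_hour: float = 2_500,
                           buffer: float = 1.25) -> float:
    """Compute-time estimate in hours, with a 25% edge-case buffer."""
    ingest = total_files / ingest_docs_per_min / 60
    ocr = scanned_pages / ocr_pages_per_sec / 3600
    preannotate = total_files / preannotate_docs_per_hour
    return (ingest + ocr + preannotate) * buffer

# Scenario A-style input: 150K files, ~50K scanned pages
print(f"~{estimate_compute_hours(150_000, 50_000):.0f} compute hours")
```

Human review hours are estimated separately from the review-throughput table, since they depend on team size rather than hardware.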
Improving Throughput Without Additional Hardware
Before buying more hardware, optimize what you have:
- Fix the storage bottleneck: If source data is on HDD or network storage, copy it to local NVMe first. This alone can reduce ingestion time by a factor of 5–20.
- Skip unnecessary OCR: Check whether scanned PDFs already have text layers; many enterprise scanners embed OCR output, and extracting an existing text layer is ~100x faster than re-running OCR (see the sketch after this list).
- Use the right quantization: Q4_K_M instead of Q8_0 for classification tasks. 40–60% throughput improvement with minimal accuracy loss.
- Increase inference parallelism: If VRAM allows, run 2–4 concurrent LLM requests.
- Pre-filter aggressively: Remove duplicate and irrelevant files before processing. A 10% reduction in file count saves 10% of pipeline time.
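For the text-layer check mentioned above, a minimal heuristic sketch with pypdf; the three-page sample and character threshold are assumptions to tune:

```python
from pypdf import PdfReader

def has_text_layer(path: str, min_chars_per_page: int = 50) -> bool:
    """Sample the first pages; if they already carry extractable text,
    reuse the embedded OCR layer instead of re-running OCR."""
    reader = PdfReader(path)
    sample = [reader.pages[i] for i in range(min(3, len(reader.pages)))]
    chars = sum(len(page.extract_text() or "") for page in sample)
    return chars >= min_chars_per_page * len(sample)
```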
Ertas Data Suite Performance
Ertas Data Suite's native desktop architecture avoids the overhead that containerized tools introduce — no Docker networking layer, no volume mount I/O penalties, no container runtime overhead. The application accesses the filesystem and GPU directly, which translates to throughput numbers at the upper end of the ranges listed in this guide.
The built-in pipeline processes documents through Ingest → Clean → Label → Augment → Export stages with automatic batching and progress tracking. For service providers, this means the pipeline runs overnight with predictable throughput and detailed logging of what was processed, what failed, and what's ready for human review.
Using These Numbers
These benchmarks exist to answer one question: "How long will the data preparation phase take?" When scoping an engagement, estimate the compute time from these tables, add human review time based on your labeling task and team size, and apply a 20–30% buffer. The result is a defensible timeline for your statement of work.
For more on the hardware and architecture decisions behind these numbers, see Hardware Sizing for On-Premise Data Preparation and On-Premise Runtime Architecture for Enterprise AI Data Preparation.