    Benchmark: On-Premise Data Prep Pipeline Throughput for 100GB+ Enterprise Datasets


    Realistic throughput benchmarks for on-premise data preparation — ingestion, OCR, cleaning, labeling, and export speeds by document type and hardware configuration.

    Ertas Team

    Every service provider delivering data preparation for enterprise AI projects faces the same question during scoping: "How long will this take?"

    The answer depends on document types, dataset size, pipeline stages, and hardware. Vague estimates like "a few weeks" don't help when writing statements of work with fixed timelines. Concrete throughput numbers do.

    This guide provides realistic benchmark data for each pipeline stage across different document types and hardware configurations. These numbers come from common configurations, not idealized lab conditions. Use them as baselines for scoping engagements.


    Methodology Note

    All benchmarks assume:

    • Single machine processing (not distributed)
    • Documents processed sequentially through pipeline stages (ingest all → clean all → label all → export all)
    • Default configurations for OCR engines and inference backends (no exotic tuning)
    • Throughput measured as sustained rate after initial warmup, not peak burst (a measurement sketch follows this list)
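
    To illustrate that last point, here is a minimal Python sketch of how sustained throughput can be measured. The process_document callable and the warmup window size are hypothetical placeholders, not part of any benchmark harness used for the numbers below.

```python
import time

WARMUP_DOCS = 500  # hypothetical warmup window; tune per pipeline stage

def sustained_docs_per_min(docs, process_document):
    """Return docs/min measured after discarding the warmup window."""
    for doc in docs[:WARMUP_DOCS]:   # warmup: model load, caches, page cache
        process_document(doc)
    measured = docs[WARMUP_DOCS:]
    start = time.perf_counter()
    for doc in measured:             # measured region: sustained rate only
        process_document(doc)
    elapsed_min = (time.perf_counter() - start) / 60.0
    return len(measured) / elapsed_min
```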

    Hardware configurations referenced:

    | Config | CPU | RAM | GPU | Storage |
    | --- | --- | --- | --- | --- |
    | Entry | Ryzen 7 7700 (8c/16t) | 32 GB | RTX 4060 Ti 16GB | 2 TB NVMe |
    | Mid-Range | Ryzen 9 7950X (16c/32t) | 64 GB | RTX 4080 16GB | 4 TB NVMe |
    | Production | Threadripper 7970X (32c/64t) | 128 GB | 2× RTX 4090 24GB | 8 TB NVMe |

    Stage 1: Ingestion Throughput

    Ingestion covers reading source files, parsing their structure, and extracting raw content (text, images, metadata).

    By Document Type

    | Document Type | Avg Size | Entry (docs/min) | Mid-Range (docs/min) | Production (docs/min) |
    | --- | --- | --- | --- | --- |
    | Native PDF (text-based) | 500 KB | 200–400 | 400–800 | 800–1,500 |
    | Scanned PDF (image-based) | 5 MB | 60–120 | 120–250 | 250–500 |
    | Word (.docx) | 200 KB | 300–600 | 600–1,200 | 1,200–2,000 |
    | Excel (.xlsx) | 1 MB | 100–200 | 200–400 | 400–800 |
    | Plain text / CSV | 50 KB | 1,000–3,000 | 3,000–8,000 | 8,000–15,000 |
    | Images (JPEG/PNG) | 2 MB | 150–300 | 300–600 | 600–1,200 |
    | HTML | 100 KB | 500–1,000 | 1,000–2,000 | 2,000–4,000 |
    | Email (.eml/.msg) | 100 KB | 200–400 | 400–800 | 800–1,500 |

    Ingestion Bottleneck Analysis

    Native PDFs: CPU-bound. PDF parsing is single-threaded per file, so throughput scales with the number of parallel workers (limited by CPU cores and I/O).

    Scanned PDFs: I/O-bound. Each page is a large image that must be decompressed. Storage speed dominates.

    Excel files: Memory-bound for large spreadsheets. A 50 MB Excel file can decompress to 500 MB+ in memory. Parallel processing limited by RAM.
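
    To make the worker-scaling point concrete, here is a minimal Python sketch of parallel ingestion with a process pool. The parse_file body is a hypothetical stand-in for real format-specific parsers; the worker count and chunk size are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def parse_file(path: Path) -> dict:
    # Hypothetical stand-in: a real pipeline dispatches on suffix to
    # dedicated PDF/.docx/.xlsx parsers here.
    return {"path": str(path), "bytes": path.stat().st_size}

def ingest(root: str, workers: int = 8) -> list[dict]:
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    # Parsing is single-threaded per file, so throughput scales with the
    # number of worker processes until CPU cores or storage I/O saturate.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_file, files, chunksize=64))
```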

    What 100 GB Looks Like

    A 100 GB enterprise archive typically contains a mix of document types. A representative distribution:

    | Type | Percentage | ~File Count | ~Total Size |
    | --- | --- | --- | --- |
    | Native PDF | 40% | 80,000 files | 40 GB |
    | Scanned PDF | 25% | 5,000 files | 25 GB |
    | Word/Excel | 20% | 40,000 files | 20 GB |
    | Images | 10% | 5,000 files | 10 GB |
    | Other (text, HTML, email) | 5% | 20,000 files | 5 GB |
    | Total | 100% | ~150,000 files | 100 GB |

    Mid-range ingestion time for this mix: ~4–8 hours. Although scanned PDFs account for only 25% of the volume, they carry the highest per-document cost at this stage, and once OCR runs downstream they come to dominate the end-to-end timeline.


    Stage 2: OCR Throughput

    OCR applies only to scanned documents and images. Text-based documents skip this stage.

    By Engine and Hardware

    | Engine | Hardware | Pages/Second | Accuracy (Clean Scans) | Accuracy (Poor Quality) |
    | --- | --- | --- | --- | --- |
    | Tesseract 5 | CPU (8 cores) | 1–3 | 90–95% | 70–80% |
    | Tesseract 5 | CPU (16 cores) | 2–5 | 90–95% | 70–80% |
    | PaddleOCR | CPU (16 cores) | 3–6 | 92–96% | 75–85% |
    | PaddleOCR | GPU (RTX 4070) | 15–25 | 92–96% | 75–85% |
    | PaddleOCR | GPU (RTX 4090) | 25–40 | 92–96% | 75–85% |
    | EasyOCR | GPU (RTX 4070) | 10–18 | 90–94% | 70–82% |
    | Surya OCR | GPU (RTX 4070) | 20–30 | 94–97% | 80–88% |
    | Surya OCR | GPU (RTX 4090) | 30–50 | 94–97% | 80–88% |

    OCR Time Estimates

    | Archive Size (Scanned Pages) | CPU-Only (Tesseract) | GPU Mid-Range | GPU Production |
    | --- | --- | --- | --- |
    | 10,000 pages | 1–3 hours | 7–12 minutes | 4–7 minutes |
    | 50,000 pages | 5–14 hours | 35–55 minutes | 17–33 minutes |
    | 100,000 pages | 10–28 hours | 1.1–1.8 hours | 0.6–1.1 hours |
    | 500,000 pages | 2–6 days | 5.5–9.2 hours | 2.8–5.5 hours |
    | 1,000,000 pages | 4–12 days | 11–18 hours | 5.5–11 hours |

    Key insight: OCR is the single largest time sink in pipelines with scanned documents. If your client's archive is mostly scanned PDFs, OCR throughput determines your project timeline.
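
    As a concrete reference for the CPU rows above, here is a minimal Tesseract-based sketch using the pytesseract and pdf2image libraries (the tesseract binary and poppler must be installed separately). File paths, DPI, and worker count are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

import pytesseract                       # wrapper around the Tesseract binary
from pdf2image import convert_from_path  # rasterizes PDF pages via poppler

def ocr_pdf(path: str, dpi: int = 300) -> str:
    """OCR every page of a scanned PDF and return the concatenated text."""
    pages = convert_from_path(path, dpi=dpi)  # decompressing pages is the I/O-heavy step
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

if __name__ == "__main__":
    # CPU OCR parallelizes across files: one Tesseract process per worker.
    with ProcessPoolExecutor(max_workers=8) as pool:
        texts = list(pool.map(ocr_pdf, ["scan_001.pdf", "scan_002.pdf"]))
```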


    Stage 3: Cleaning Throughput

    Cleaning includes deduplication, format normalization, PII detection/redaction, and quality filtering.

    By Operation

    | Operation | Method | Throughput (Mid-Range) | RAM Usage |
    | --- | --- | --- | --- |
    | Exact dedup | SHA-256 hash | 50,000–100,000 docs/min | Low (under 1 GB for 1M docs) |
    | Fuzzy dedup (MinHash) | 128 permutations | 5,000–15,000 docs/min | 2–4 GB per 1M docs |
    | PII detection (regex) | Pattern matching | 10,000–30,000 docs/min | Low |
    | PII detection (NER model) | GLiNER / spaCy NER | 500–2,000 docs/min | 2–4 GB VRAM |
    | PII redaction | Replace detected PII | Same as detection | Same |
    | Format normalization | Unicode, whitespace cleanup | 20,000–50,000 docs/min | Low |
    | Quality filtering | Length, language, coherence | 10,000–30,000 docs/min | Low |
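
    For reference, here is a minimal sketch of the two deduplication passes, using the standard-library hashlib for the exact pass and the datasketch library for MinHash. The 0.8 similarity threshold and whitespace tokenization are illustrative choices.

```python
import hashlib
from datasketch import MinHash, MinHashLSH

def exact_dedup(docs: list[str]) -> list[str]:
    """Drop byte-identical documents via SHA-256 digests."""
    seen, unique = set(), []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

def fuzzy_dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Drop near-duplicates with 128-permutation MinHash + LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique = []
    for i, text in enumerate(docs):
        mh = MinHash(num_perm=128)
        for token in text.split():      # illustrative tokenization
            mh.update(token.encode("utf-8"))
        if not lsh.query(mh):           # no near-duplicate indexed yet
            lsh.insert(str(i), mh)
            unique.append(text)
    return unique
```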

    Cleaning Time Estimates

    For a 150,000-document archive (the 100 GB mix from above):

    | Operation | Mid-Range Time |
    | --- | --- |
    | Exact dedup | 2–3 minutes |
    | Fuzzy dedup | 10–30 minutes |
    | Regex PII detection | 5–15 minutes |
    | NER PII detection | 1.5–5 hours |
    | Format normalization | 3–8 minutes |
    | Quality filtering | 5–15 minutes |
    | Total (with NER PII) | ~2–6 hours |
    | Total (regex PII only) | ~25–70 minutes |

    NER-based PII detection is the cleaning bottleneck. For projects where regex-based PII detection is sufficient (financial documents with structured PII like SSNs, account numbers), cleaning is fast. For unstructured PII in narrative text, NER adds significant time.
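
    Here is a minimal sketch of the regex path, with a few illustrative US-centric patterns; production rule sets are larger and locale-specific.

```python
import re

# Illustrative patterns only; real deployments need broader, locale-aware sets.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected span with its label, e.g. '[SSN]'."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```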


    Stage 4: Labeling Throughput

    Manual Labeling

    Human labeling speed varies enormously by task complexity and annotator experience:

    | Task | Speed (Experienced Annotator) | Documents/Day (8 hrs) |
    | --- | --- | --- |
    | Binary classification | 5–10 seconds/doc | 2,800–5,700 |
    | Multi-class (5–10 categories) | 10–30 seconds/doc | 960–2,800 |
    | Named entity annotation | 1–5 minutes/doc | 96–480 |
    | Span-level labeling | 2–10 minutes/doc | 48–240 |
    | Complex multi-label | 30–120 seconds/doc | 240–960 |

    AI-Assisted Labeling (Pre-Annotation + Human Review)

    The pre-annotation phase uses local LLM inference. Human review time depends on pre-annotation accuracy.

    Pre-annotation throughput (LLM inference):

    | Task | Model | Quant | Hardware | Docs/Hour |
    | --- | --- | --- | --- | --- |
    | Binary classification | Mistral 7B | Q4_K_M | RTX 4070 | 2,500–3,500 |
    | Multi-class (5 cats) | Mistral 7B | Q4_K_M | RTX 4070 | 2,000–3,000 |
    | Multi-class (5 cats) | Qwen 2.5 14B | Q4_K_M | RTX 4080 | 1,000–1,800 |
    | Entity extraction | Qwen 2.5 14B | Q5_K_M | RTX 4080 | 800–1,400 |
    | Document summarization | Qwen 2.5 14B | Q4_K_M | RTX 4080 | 300–500 |
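
    A minimal sketch of the pre-annotation loop, assuming a local llama.cpp-style server exposing an OpenAI-compatible /v1/chat/completions endpoint. The endpoint URL, model name, prompt, and truncation limit are all illustrative assumptions.

```python
import json
import urllib.request

# Assumed local inference server (e.g. llama.cpp) with an OpenAI-compatible API.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

def pre_annotate(text: str) -> str:
    """Ask the local model for a one-word binary label."""
    payload = {
        "model": "mistral-7b-q4_k_m",  # illustrative model name
        "messages": [
            {"role": "system",
             "content": "Answer with exactly one word: relevant or irrelevant."},
            {"role": "user", "content": text[:4000]},  # truncate long documents
        ],
        "temperature": 0,
    }
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"].strip().lower()
```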

    Human review throughput (reviewing pre-annotations):

    | Pre-Annotation Accuracy | Review Speed | Effective Throughput vs. Manual |
    | --- | --- | --- |
    | >90% correct | 3–5 seconds/doc (confirm or fix) | 5–10x faster than manual |
    | 80–90% correct | 5–15 seconds/doc | 3–5x faster than manual |
    | 70–80% correct | 10–30 seconds/doc | 1.5–3x faster than manual |
    | Under 70% correct | 15–60 seconds/doc | Marginal improvement |

    Break-even: Below ~70% pre-annotation accuracy, human reviewers spend more time understanding and correcting errors than they would labeling from scratch. The AI assistance becomes a distraction rather than an accelerator.

    Combined Labeling Timeline

    For 150,000 documents with binary classification:

    | Approach | Time Estimate |
    | --- | --- |
    | Manual (2 annotators) | 13–27 working days |
    | AI-assisted (90% accuracy, 2 reviewers) | 2–4 working days |
    | AI-assisted (80% accuracy, 2 reviewers) | 4–8 working days |

    AI-assisted labeling with >80% pre-annotation accuracy reduces labeling time by 3–10x.


    Stage 5: Augmentation Throughput

    Synthetic data generation throughput depends on output length:

    | Task | Model | Hardware | Output Length | Docs/Hour |
    | --- | --- | --- | --- | --- |
    | Paraphrase generation | Mistral 7B Q4 | RTX 4070 | ~100 tokens | 1,500–2,500 |
    | Synthetic document generation | Qwen 2.5 14B Q4 | RTX 4080 | ~500 tokens | 100–200 |
    | Augmented examples (classification) | Mistral 7B Q4 | RTX 4070 | ~50 tokens | 3,000–5,000 |
    | Question-answer pair generation | Qwen 2.5 14B Q4 | RTX 4080 | ~200 tokens | 400–700 |

    Stage 6: Export Throughput

    Export is rarely the bottleneck:

    | Format | Size (150K docs) | NVMe Write Time | SATA SSD Write Time |
    | --- | --- | --- | --- |
    | JSONL | 5–20 GB | 1–5 seconds | 10–40 seconds |
    | JSONL (gzip compressed) | 1–5 GB | 30–120 seconds | 60–240 seconds |
    | Parquet | 3–12 GB | 1–5 seconds | 10–40 seconds |
    | HuggingFace Dataset | 5–20 GB | 5–15 seconds | 30–120 seconds |
    | CSV | 5–20 GB | 1–5 seconds | 10–40 seconds |
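
    Export amounts to straightforward serialization. A minimal sketch of the JSONL (gzip) and Parquet writers using the standard library and pandas; the compression step is where the extra write time in the table comes from.

```python
import gzip
import json

import pandas as pd  # Parquet writing requires pyarrow or fastparquet

def export_jsonl_gz(records: list[dict], path: str) -> None:
    # gzip trades write speed for a roughly 4x smaller artifact.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def export_parquet(records: list[dict], path: str) -> None:
    pd.DataFrame(records).to_parquet(path)
```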

    End-to-End Pipeline Estimates

    Scenario A: 100 GB Mixed Enterprise Documents (150K Files)

    Mid-range hardware (Ryzen 9, 64 GB RAM, RTX 4080):

    | Stage | Time Estimate |
    | --- | --- |
    | Ingestion | 4–8 hours |
    | OCR (scanned subset: ~50K pages) | 35–55 minutes |
    | Cleaning (with regex PII) | 25–70 minutes |
    | AI-assisted labeling (binary classification) | 50–75 minutes (pre-annotation) + 2–4 days (human review) |
    | Export | Under 5 minutes |
    | Total compute time | ~6–10 hours |
    | Total project time (incl. human review) | 3–5 working days |

    Scenario B: 500 GB Scanned Document Archive (500K Pages)

    Mid-range hardware:

    | Stage | Time Estimate |
    | --- | --- |
    | Ingestion | 12–24 hours |
    | OCR (500K pages, GPU) | 5.5–9 hours |
    | Cleaning (with NER PII) | 4–12 hours |
    | AI-assisted labeling (multi-class) | 3–5 hours (pre-annotation) + 5–10 days (human review) |
    | Export | Under 10 minutes |
    | Total compute time | ~24–50 hours |
    | Total project time | 1–2 weeks |

    Scenario C: 1 TB Mixed Enterprise Archive (1M+ Files)

    Production hardware (Threadripper, 128 GB RAM, 2× RTX 4090):

    | Stage | Time Estimate |
    | --- | --- |
    | Ingestion | 24–48 hours |
    | OCR (scanned subset: ~200K pages) | 1–2 hours |
    | Cleaning (with NER PII) | 8–24 hours |
    | AI-assisted labeling (entity extraction) | 12–24 hours (pre-annotation) + 2–4 weeks (human review) |
    | Export | Under 30 minutes |
    | Total compute time | ~2–4 days |
    | Total project time | 3–5 weeks |

    How to Estimate Timeline from Data Volume

    A quick estimation framework for scoping proposals (a worked sketch follows the list):

    1. Assess document types: What percentage is scanned vs. native text? Scanned documents take 5–10x longer per document.
    2. Estimate file count: Total volume ÷ average file size. A 100 GB archive might be 10,000 large files or 500,000 small files. File count affects ingestion time; total volume affects OCR time.
    3. Identify the labeling task: Binary classification? Multi-label? Entity extraction? Task complexity determines both LLM inference time and human review time.
    4. Calculate human review hours: divide document count by pre-annotation throughput to get machine time, then multiply document count by the review speed for your expected accuracy level to get review hours. This is usually the longest phase.
    5. Add buffer: Real-world archives contain corrupt files, unexpected formats, and edge cases. Add 20–30% to compute time estimates.
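
    The framework reduces to simple arithmetic. Here is a back-of-envelope sketch with hypothetical mid-range rates pulled from the tables above; swap the constants for the rows that match your hardware and labeling task.

```python
# Back-of-envelope scoping calculator. All rates below are assumptions
# taken from the mid-range rows of the benchmark tables in this guide.
SCANNED_PAGES_PER_HOUR = 20 * 3600   # ~20 pages/s GPU OCR
NATIVE_DOCS_PER_HOUR = 600 * 60      # ~600 docs/min ingestion
REVIEW_SECONDS_PER_DOC = 10          # ~80-90% pre-annotation accuracy

def estimate_hours(native_docs: int, scanned_pages: int, label_docs: int) -> dict:
    compute = (native_docs / NATIVE_DOCS_PER_HOUR
               + scanned_pages / SCANNED_PAGES_PER_HOUR)
    review = label_docs * REVIEW_SECONDS_PER_DOC / 3600
    return {
        "compute_hours": round(compute * 1.25, 1),  # ~25% buffer for edge cases
        "review_hours": round(review, 1),
    }

print(estimate_hours(native_docs=100_000, scanned_pages=50_000, label_docs=150_000))
```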

    Improving Throughput Without Additional Hardware

    Before buying more hardware, optimize what you have:

    1. Fix the storage bottleneck: If source data is on HDD or network storage, copy it to local NVMe. This alone can cut ingestion time by 5–20x.
    2. Skip unnecessary OCR: Check if scanned PDFs already have text layers. Many enterprise scanners produce PDFs with embedded OCR. Extracting the existing text layer is 100x faster than re-running OCR (a detection sketch follows this list).
    3. Use the right quantization: Q4_K_M instead of Q8_0 for classification tasks. 40–60% throughput improvement with minimal accuracy loss.
    4. Increase inference parallelism: If VRAM allows, run 2–4 concurrent LLM requests.
    5. Pre-filter aggressively: Remove duplicate and irrelevant files before processing. A 10% reduction in file count saves 10% of pipeline time.
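
    For point 2, here is a minimal text-layer check using the pypdf library; the three-page sample and 200-character threshold are heuristic assumptions, not fixed rules.

```python
from pypdf import PdfReader

def has_text_layer(path: str, min_chars: int = 200) -> bool:
    """Heuristic: if the first few pages already yield text, skip OCR for this file."""
    reader = PdfReader(path)
    sample = "".join((page.extract_text() or "") for page in reader.pages[:3])
    return len(sample.strip()) >= min_chars
```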

    Ertas Data Suite Performance

    Ertas Data Suite's native desktop architecture avoids the overhead that containerized tools introduce — no Docker networking layer, no volume mount I/O penalties, no container runtime overhead. The application accesses the filesystem and GPU directly, which translates to throughput numbers at the upper end of the ranges listed in this guide.

    The built-in pipeline processes documents through Ingest → Clean → Label → Augment → Export stages with automatic batching and progress tracking. For service providers, this means the pipeline runs overnight with predictable throughput and detailed logging of what was processed, what failed, and what's ready for human review.


    Using These Numbers

    These benchmarks exist to answer one question: "How long will the data preparation phase take?" When scoping an engagement, estimate the compute time from these tables, add human review time based on your labeling task and team size, and apply a 20–30% buffer. The result is a defensible timeline for your statement of work.

    For more on the hardware and architecture decisions behind these numbers, see Hardware Sizing for On-Premise Data Preparation and On-Premise Runtime Architecture for Enterprise AI Data Preparation.
