
Hardware Sizing for On-Premise Data Preparation: CPU, GPU, and Memory Requirements
Concrete hardware recommendations for on-premise AI data preparation — CPU, GPU, RAM, and storage requirements by pipeline stage with three budget tiers from $3K to $20K+.
"Do we need an A100?" is the most common hardware question from enterprise clients starting a data preparation project. The answer is almost always no.
Data preparation workloads — ingestion, OCR, cleaning, labeling, augmentation, export — have different compute profiles than model training. Training runs benefit from massive GPU parallelism and high memory bandwidth. Data preparation is sequential, I/O-heavy, and often bottlenecked by disk speed rather than compute. The hardware that's right for training is usually overkill and under-optimized for data prep.
This guide covers the specific hardware requirements for each pipeline stage and provides concrete recommendations at three budget tiers.
Requirements by Pipeline Stage
Ingestion: CPU + I/O
Ingestion reads source documents (PDFs, Word files, images, spreadsheets, HTML) and extracts their content into a normalized format. The work is parsing-heavy and I/O-heavy.
CPU: 4+ cores for parallel file processing. Most document parsers are single-threaded per file, so parallelism comes from processing multiple files concurrently. Clock speed matters more than core count for individual file throughput.
RAM: 16 GB minimum. Large PDFs (100+ pages with embedded images) can consume 500 MB–2 GB each during parsing. Processing multiple large files concurrently multiplies this.
Storage: This is the primary bottleneck. NVMe SSD delivers 3–7 GB/s sequential read. SATA SSD delivers 500–550 MB/s. HDD delivers 100–200 MB/s. For a 500 GB document archive, the difference between NVMe and HDD is 2 minutes vs. 40+ minutes for raw read throughput (actual parsing time is longer, but I/O dominates).
GPU: Not required for ingestion.
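A minimal sketch of the concurrent-ingestion pattern described above, assuming pypdf as the example parser (any extractor with a per-file entry point slots in the same way): each worker handles one file end to end, so throughput scales with worker count until the disk saturates.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from pypdf import PdfReader  # example parser; swap in your own extractor


def parse_pdf(path: Path) -> dict:
    """Parse one PDF into a normalized record. Single-threaded per file."""
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return {"source": str(path), "pages": len(reader.pages), "text": text}


def ingest(source_dir: str, workers: int = 8) -> list[dict]:
    # Parallelism comes from processing many files concurrently,
    # not from multi-threading inside a single parser.
    files = sorted(Path(source_dir).rglob("*.pdf"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_pdf, files))
```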
OCR: GPU Strongly Recommended
OCR converts scanned documents and images into machine-readable text. It's the most compute-intensive stage for document-heavy datasets.
| Engine | Hardware | Speed (pages/sec) | Accuracy |
|---|---|---|---|
| Tesseract 5 | CPU (8 cores) | 1–3 | Good for clean scans |
| PaddleOCR | CPU | 3–5 | Better for varied layouts |
| PaddleOCR | GPU (RTX 4070) | 15–25 | Better for varied layouts |
| EasyOCR | GPU (RTX 4070) | 10–20 | Good multilingual |
| Surya OCR | GPU (RTX 4070) | 20–30 | Strong on complex layouts |
CPU-only OCR math: A 100,000-page archive at 2 pages/second = ~14 hours. At 20 pages/second with GPU = ~1.4 hours. For one-time ingestion, 14 hours overnight may be acceptable. For iterative workflows where you're re-processing after adjusting OCR settings, GPU acceleration matters.
GPU: 8 GB VRAM minimum for GPU-accelerated OCR. 12 GB preferred for batch processing with larger page buffers.
RAM: 32 GB recommended. OCR engines load model weights into memory alongside page buffers.
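The throughput math generalizes into a quick estimator; the pages-per-second figures below are the table's estimates, not benchmarks from your hardware.

```python
# Rough pages/second taken from the table above -- estimates, not benchmarks.
OCR_RATES = {
    "tesseract-cpu": 2.0,
    "paddleocr-cpu": 4.0,
    "paddleocr-gpu": 20.0,
    "surya-gpu": 25.0,
}


def ocr_hours(pages: int, engine: str) -> float:
    """Estimated wall-clock hours to OCR `pages` pages with `engine`."""
    return pages / OCR_RATES[engine] / 3600


print(ocr_hours(100_000, "tesseract-cpu"))  # ~13.9 hours on CPU
print(ocr_hours(100_000, "paddleocr-gpu"))  # ~1.4 hours on GPU
```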
Cleaning: CPU + High RAM
Cleaning includes deduplication, format normalization, PII detection, and quality filtering.
Exact deduplication (hash-based): CPU-bound, low memory. Compute a hash per document and compare hashes; hashing a million documents takes seconds.
Fuzzy deduplication (MinHash/SimHash): CPU and memory-intensive. MinHash with 128 permutations on 1 million documents requires ~2–4 GB of RAM for the signature matrix. At 10 million documents, this grows to 20–40 GB.
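A sketch of the fuzzy-dedup flow using the datasketch library (one reasonable choice, not the only one); the LSH index keeps every 128-value signature in RAM, which is exactly where the 2–4 GB per million documents goes.

```python
from datasketch import MinHash, MinHashLSH


def signature(text: str, num_perm: int = 128) -> MinHash:
    """Build a 128-permutation MinHash over a document's word set."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m


# The in-memory index of signatures is the RAM cost discussed above.
lsh = MinHashLSH(threshold=0.8, num_perm=128)


def is_near_duplicate(doc_id: str, text: str) -> bool:
    sig = signature(text)
    if lsh.query(sig):       # any indexed doc above the similarity threshold?
        return True
    lsh.insert(doc_id, sig)  # first occurrence: index it and keep the doc
    return False
```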
PII detection: Regex-based PII detection is fast and CPU-light. NER-based detection (using a small model like GLiNER or a fine-tuned transformer) adds a GPU requirement: 2–4 GB of VRAM for a typical model.
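A minimal regex pass for the fast path (the patterns are illustrative, US-centric examples, not a production rule set); NER-based detection swaps the pattern table for model predictions but keeps the same scan-and-flag shape.

```python
import re

# Illustrative patterns only; production rule sets are much larger.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (pii_type, matched_text) pairs found in `text`."""
    hits = []
    for pii_type, pattern in PII_PATTERNS.items():
        hits.extend((pii_type, m.group()) for m in pattern.finditer(text))
    return hits
```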
RAM: 32 GB baseline. 64 GB for datasets exceeding 1 million documents or when running NER-based PII detection alongside dedup.
Labeling with Local LLM: GPU Required
AI-assisted labeling — where a local LLM pre-annotates documents that humans then review — is the stage most people ask about when planning hardware.
| Model Size | Quantization | VRAM Required | Speed (tokens/sec) | Notes |
|---|---|---|---|---|
| 7B–8B (Mistral 7B, Llama 3.1 8B) | Q4_K_M | 4–5 GB | 30–60 | Good for classification, simple extraction |
| 7B | Q8_0 | 7–8 GB | 25–45 | Better accuracy, still fast |
| 14B (Qwen 2.5) | Q4_K_M | 8–10 GB | 20–35 | Better for nuanced labeling |
| 14B | Q8_0 | 14–16 GB | 15–25 | Best quality in mid-range |
| 32B (Qwen 2.5) | Q4_K_M | 18–20 GB | 10–18 | Diminishing returns for most labeling tasks |
The practical ceiling: For data preparation labeling (classification, entity extraction, sentiment, topic assignment), 7B–14B models provide 90–95% of the accuracy of larger models at 2–4x the throughput. Moving to 30B+ models rarely improves labeling quality enough to justify the hardware cost and speed reduction.
GPU: 8 GB VRAM minimum (for 7B Q4). 16 GB VRAM recommended (for 14B Q4 or 7B Q8). RTX 4060 Ti 16GB, RTX 4070, or RTX 4080 are the sweet spots for price-to-VRAM ratio.
System RAM: 32 GB minimum. The model runs on GPU, but the application needs memory for document processing, context assembly, and batch management.
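A labeling sketch against a local Ollama server; the model name and label set are placeholders, and any local inference server with an HTTP API follows the same request shape.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint
LABELS = ["invoice", "contract", "correspondence", "report"]  # example set


def label_document(text: str, model: str = "llama3.1:8b") -> str:
    """Ask a local model to classify one document into a fixed label set."""
    prompt = (
        f"Classify the document into one of: {', '.join(LABELS)}.\n"
        f"Answer with the label only.\n\nDocument:\n{text[:4000]}"
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()
```

Batching documents through a loop like this is where the tokens-per-second figures in the table translate directly into labels per hour.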
Augmentation: GPU for LLM-Based Generation
Synthetic data generation and augmentation use the same local LLM infrastructure as labeling but with longer outputs. Generating a synthetic 500-word document takes 5–10x longer than generating a classification label.
Hardware requirements mirror the labeling stage. If you sized for labeling, you're sized for augmentation. The difference is throughput: expect 5–15 synthetic documents per minute at 7B Q4, fewer at larger model sizes.
Export: I/O-Bound
Export converts processed data into training formats. The bottleneck is write speed.
Storage: NVMe SSD for output. Writing 100 GB of JSONL takes 15–30 seconds on NVMe, 3–5 minutes on SATA SSD.
CPU: Moderate. Compression (gzip, zstd) adds CPU load. 4+ cores handles parallel compression.
RAM: 16 GB sufficient for most export operations.
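A minimal export sketch using the standard library's gzip (zstd via the zstandard package is a common swap when compression speed matters); records stream one at a time, so memory stays flat regardless of dataset size.

```python
import gzip
import json
from typing import Iterable


def export_jsonl(records: Iterable[dict], out_path: str) -> None:
    """Stream records to gzip-compressed JSONL; compression is the CPU cost."""
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


export_jsonl([{"text": "example", "label": "invoice"}], "train.jsonl.gz")
```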
Three Hardware Tiers
Entry Tier (~$3,000)
Use case: Small datasets (under 100 GB source), text-heavy documents, manual or light AI-assisted labeling.
| Component | Specification | Est. Cost |
|---|---|---|
| CPU | AMD Ryzen 7 7700 or Intel i7-13700 (8–16 cores) | $300–$350 |
| RAM | 32 GB DDR5-5600 | $100–$130 |
| GPU | NVIDIA RTX 4060 Ti 16GB | $400–$450 |
| Storage | 2 TB NVMe SSD (Gen4) | $120–$150 |
| Motherboard + PSU + Case | Mid-tower build | $400–$500 |
| Total | | ~$1,300–$1,600 |
Or a pre-built workstation from Dell/HP/Lenovo at ~$2,500–$3,500 for a comparable spec with warranty and support.
This tier handles proof-of-concept projects, small client engagements, and text-dominated datasets. CPU-only LLM inference is possible (via llama.cpp in CPU mode) but slow — plan for 2–5 tokens/second with a 7B model.
Mid-Range Tier (~$8,000)
Use case: Production data preparation, 100 GB–1 TB source data, GPU-accelerated OCR and labeling.
| Component | Specification | Est. Cost |
|---|---|---|
| CPU | AMD Ryzen 9 7950X or Intel i9-13900K (16–24 cores) | $450–$550 |
| RAM | 64 GB DDR5-5600 | $200–$260 |
| GPU | NVIDIA RTX 4080 16GB or RTX 4090 24GB | $1,000–$1,800 |
| Storage | 4 TB NVMe SSD (Gen4) | $250–$300 |
| Motherboard + PSU (850W+) + Case | Quality build | $600–$800 |
| Total | | ~$2,500–$3,700 |
Pre-built workstation equivalent: $5,000–$8,000 from major OEMs.
This is the workhorse tier for service providers. It handles GPU-accelerated OCR at 15–25 pages/second, runs 14B models at Q4 comfortably, and processes 100 GB+ datasets without bottlenecking on RAM. Most enterprise data preparation engagements are fully served by this configuration.
Production Tier (~$20,000+)
Use case: Large-scale data preparation (1 TB+ source), concurrent pipeline stages, 14B+ model inference with high throughput.
| Component | Specification | Est. Cost |
|---|---|---|
| CPU | AMD Threadripper 7970X (32 cores) or dual Xeon | $1,500–$3,000 |
| RAM | 128–256 GB DDR5 ECC | $500–$1,200 |
| GPU | 2× NVIDIA RTX 4090 24GB or 1× A6000 48GB | $3,600–$5,500 |
| Storage | 8 TB NVMe (RAID 0 for speed or RAID 1 for redundancy) | $600–$1,000 |
| Motherboard + PSU (1200W+) + Case | Server/workstation chassis | $1,000–$1,500 |
| Total | | ~$7,200–$12,200 |
Pre-built server/workstation equivalent: $15,000–$25,000+ from major OEMs.
Multi-GPU configurations enable parallel inference (different models on different GPUs) or larger model sizes (32B+ via tensor parallelism). Dual RTX 4090s provide 48 GB total VRAM — enough for 32B models at Q8 quantization.
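A sketch of the tensor-parallel configuration with vLLM, assuming a quantized Qwen 2.5 32B checkpoint as the example model (AWQ here; the parallelism flag, not the quantization format, is the point):

```python
from vllm import LLM, SamplingParams  # one common serving stack

# tensor_parallel_size=2 splits the model's weights across both GPUs.
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # example quantized checkpoint
    tensor_parallel_size=2,
)
params = SamplingParams(max_tokens=32, temperature=0.0)
outputs = llm.generate(["Classify this document: ..."], params)
print(outputs[0].outputs[0].text)
```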
"Do We Need an A100?"
The NVIDIA A100 (40 GB or 80 GB) costs $10,000–$15,000 per unit. It's designed for training workloads that benefit from high memory bandwidth (2 TB/s on the 80 GB variant) and large tensor cores.
For data preparation, the A100's strengths are largely irrelevant:
- Memory bandwidth: Small-batch decoding is memory-bandwidth-bound, but the A100 80 GB's ~2 TB/s is only about double an RTX 4090's ~1 TB/s, so the ceiling is roughly 2× tokens per second at several times the price.
- Tensor cores: Small-batch inference doesn't saturate tensor cores. The A100's FP16 throughput advantage over consumer GPUs is wasted at batch size 1.
- VRAM: The 80 GB variant is useful for very large models (70B+), but these models are slower for labeling tasks and rarely more accurate than 14B models on classification and extraction.
An RTX 4090 (24 GB VRAM, $1,800) provides 80–90% of the A100's inference performance for data preparation tasks at 12–18% of the cost. Two RTX 4090s ($3,600) provide more total VRAM than the 40 GB A100 and comparable throughput.
Save the A100 budget for actual training runs.
NPU Support for Newer Hardware
Neural Processing Units (NPUs) are appearing in recent laptop and desktop CPUs — Intel Meteor Lake and Arrow Lake, AMD Ryzen AI, Qualcomm Snapdragon X Elite. These dedicated inference accelerators promise efficient local AI inference without a discrete GPU.
Current state for data preparation:
- Throughput: Current NPUs deliver roughly 10–45 TOPS, compared to several hundred TOPS for a mid-range GPU. That is suitable for lightweight models (1B–3B parameters) but too slow for the 7B+ models that data prep labeling requires.
- Software support: Ollama and llama.cpp have experimental NPU support. Stability varies by hardware vendor. ONNX Runtime provides the broadest NPU compatibility.
- Use case: NPUs are useful for edge inference on deployed models. For data preparation — where you're processing documents in batch, not serving real-time requests — a discrete GPU is more practical.
NPUs will become more relevant as their TOPS rating increases and software support matures. For now, plan around GPU-based inference for data preparation workloads.
RAM Sizing for Large Document Processing
System RAM is the quiet bottleneck that catches teams off guard:
- PDF processing: A 200-page PDF with embedded images can consume 1–2 GB during parsing. Processing 16 files concurrently requires 16–32 GB just for PDF buffers.
- Deduplication: Fuzzy dedup on 5 million documents requires 10–20 GB for signature storage.
- LLM context: Even though the model runs on GPU, the application assembles prompts in system RAM. Long documents with extensive context windows (8K–32K tokens) consume 100–500 MB per concurrent inference.
- OS and application overhead: 4–8 GB for the OS, application runtime, and file system caches.
Sizing rule: Start at 32 GB. Move to 64 GB for production workloads. Move to 128 GB+ only for concurrent processing of very large document sets (10 million+ documents) or multi-GPU inference configurations.
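The bullets above reduce to a worst-case sum that makes the sizing rule concrete; a sketch using the per-item figures from this section:

```python
def ram_estimate_gb(
    concurrent_pdfs: int,
    dedup_docs_millions: float,
    concurrent_inferences: int,
) -> float:
    """Worst-case RAM estimate from this section's per-item figures."""
    pdf_buffers = concurrent_pdfs * 2.0        # up to ~2 GB per large PDF
    dedup = dedup_docs_millions * 4.0          # ~2-4 GB per 1M docs, upper bound
    llm_context = concurrent_inferences * 0.5  # up to ~500 MB per inference
    overhead = 8.0                             # OS, runtime, filesystem caches
    return pdf_buffers + dedup + llm_context + overhead


# 16 concurrent PDFs, fuzzy dedup on 5M docs, 4 inference streams:
print(ram_estimate_gb(16, 5, 4))  # ~62 GB -> lands at the 64 GB tier
```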
Putting It Together
Ertas Data Suite's native desktop architecture accesses all of this hardware directly — CPU, GPU, NPU, and filesystem — without the overhead of container layers or virtualization. The application detects available hardware at startup and configures pipeline stages accordingly: GPU-accelerated OCR when a GPU is present, CPU fallback when it isn't.
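As a minimal illustration of that detect-then-fallback pattern (not Ertas Data Suite's actual implementation), here is a probe using PyTorch's CUDA check:

```python
import torch  # used here only as a hardware probe


def select_ocr_backend() -> str:
    """Choose a GPU OCR engine when CUDA is available, else fall back to CPU."""
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if vram_gb >= 8:  # 8 GB VRAM floor for GPU-accelerated OCR (see above)
            return "paddleocr-gpu"
    return "paddleocr-cpu"


print(select_ocr_backend())
```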
For service providers scoping hardware for a client engagement, the mid-range tier ($5,000–$8,000 as a pre-built workstation) handles the vast majority of data preparation projects. Start there. If OCR throughput or labeling speed becomes a measured bottleneck on a specific engagement, upgrade the GPU. Don't pre-buy for hypothetical scale.
The hardware decision should follow the data assessment, not precede it. Know your document types, volumes, and labeling complexity before selecting components. A 500 GB archive of clean text PDFs has entirely different requirements than a 50 GB archive of scanned handwritten forms.