    Batch Processing Large Document Archives On-Premise: Performance Tuning Guide


    Performance tuning guide for batch processing 100GB–1TB+ document archives on-premise — parallel ingestion, memory management, I/O optimization, and resumability strategies.

    Ertas Team

    Enterprise document archives are measured in hundreds of gigabytes to terabytes. A mid-size law firm might have 500 GB of contract PDFs. A hospital system might have 2 TB of clinical records. A construction firm might have 300 GB of engineering drawings and specifications.

    Processing these archives through a data preparation pipeline — ingestion, OCR, cleaning, labeling, export — takes hours to days, not minutes. The difference between a well-tuned pipeline and a naive one can be 3–5x in total processing time. This guide covers the performance tuning strategies that matter in practice.


    The Overnight Processing Pattern

    Before optimizing, it's worth acknowledging the dominant workflow pattern for large batch processing: start the job in the evening, review results in the morning.

    This pattern works because:

    • Data preparation is a periodic task, not a real-time service
    • Nobody is waiting for interactive response during batch processing
    • Long-running jobs benefit from uninterrupted compute (no competing workloads)
    • Results need human review anyway — and humans review during business hours

    The goal of performance tuning isn't to make a 12-hour job finish in 12 minutes. It's to make a 48-hour job finish in 12 hours so it completes overnight instead of spanning a weekend.


    Parallel Ingestion Strategies

    Document ingestion is the first pipeline stage and often the most I/O-bound. The key optimizations:

    File-Level Parallelism

    Process multiple files concurrently. Each file is independent — parsing a PDF doesn't depend on parsing the previous Word document.

    Optimal parallelism: Match the number of concurrent file operations to your storage throughput, not your CPU cores. On NVMe SSD, 8–16 concurrent file reads typically saturate the drive. On HDD, even 2–4 concurrent reads cause seek contention that slows everything down.

    | Storage Type | Recommended Parallel Files | Notes |
    | --- | --- | --- |
    | NVMe SSD | 8–16 | Limited by CPU parsing speed |
    | SATA SSD | 4–8 | Limited by sequential bandwidth |
    | HDD | 1–2 | Random I/O kills parallelism |
    | Network (NFS/SMB) | 4–8 | Limited by network bandwidth and latency |

    Sorting Files by Size

    Process large files first. This avoids the "long tail" problem where the pipeline is idle except for one thread processing a single massive file at the end.

    A simple strategy: sort the file list by size descending, then process in parallel. The first batch picks up the largest files, and subsequent batches fill in with smaller files.
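The largest-first strategy takes only a few lines. A minimal sketch using the standard library, where `process_file` is a placeholder for the real per-file pipeline stage:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_file(path):
    # Placeholder for the real per-file work (parse, OCR, clean, ...).
    return path, os.path.getsize(path)

def ingest_largest_first(paths, max_workers=8):
    """Process files largest-first so one massive file never trails at the end."""
    ordered = sorted(paths, key=os.path.getsize, reverse=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_file, ordered))
```

Set `max_workers` from the storage table above (e.g. 8–16 for NVMe, 1–2 for HDD), not from the CPU core count.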

    Pre-Filtering by File Type

    Not every file in an archive needs processing. Common filters:

    • Skip files below a minimum size (e.g., under 1 KB — probably empty or stub files)
    • Skip non-document types (.exe, .dll, .tmp)
    • Skip duplicates by file hash before parsing

    Pre-filtering a 500 GB archive often removes 5–15% of files, saving parsing time on content that won't produce useful data.
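The three filters compose into a single pass over the file list. A minimal sketch (the extension set and 1 KB cutoff mirror the examples above; for very large archives you would hash in chunks rather than reading each file whole):

```python
import hashlib
import os

SKIP_EXTENSIONS = {".exe", ".dll", ".tmp"}
MIN_SIZE_BYTES = 1024  # under 1 KB: probably empty or a stub

def prefilter(paths):
    """Yield only files worth parsing: right type, big enough, not a duplicate."""
    seen_hashes = set()
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        if ext in SKIP_EXTENSIONS:
            continue
        if os.path.getsize(path) < MIN_SIZE_BYTES:
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen_hashes:
            continue  # identical content already queued
        seen_hashes.add(digest)
        yield path
```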


    Memory Management for Large PDFs

    PDFs are the most common document type in enterprise archives and the most problematic for memory.

    Why PDFs Consume Memory

    A 200-page PDF with embedded images can decompress to 500 MB–2 GB in memory during parsing. The PDF format uses internal compression (Flate, JPEG2000) and object references that require the entire file structure to be held in memory for random access.

    Scanned PDFs are worse: each page is a full-resolution image (typically 300 DPI, 2550×3300 pixels, 24-bit color = ~24 MB uncompressed per page). A 200-page scanned PDF decompresses to ~4.8 GB.

    Mitigation Strategies

    Page-level processing: Instead of loading the entire PDF into memory, process one page (or a small batch of pages) at a time. Most PDF libraries support page-range extraction without loading the full document.

    Memory limits per worker: Set a memory ceiling for each processing worker. If a single PDF exceeds the limit, process it sequentially (one page at a time) instead of in parallel. This prevents a single large document from consuming all available RAM and causing out-of-memory crashes.

    Streaming extraction: For text-only PDFs (not scanned), use a streaming parser that extracts text without fully decompressing embedded objects. PyMuPDF (fitz) supports this approach and uses significantly less memory than pdfplumber or PyPDF2 for text extraction.
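The page-level approach reduces to iterating over small page ranges instead of the whole document. A minimal sketch of the batching logic; the commented usage assumes PyMuPDF is installed (`fitz.open`, `doc.page_count`, and `page.get_text` are its real API):

```python
def page_batches(page_count, batch_size=10):
    """Yield (start, end) page ranges so a PDF is processed a few pages at a time."""
    for start in range(0, page_count, batch_size):
        yield start, min(start + batch_size, page_count)

# Usage sketch with PyMuPDF:
#   import fitz
#   doc = fitz.open("big.pdf")
#   for start, end in page_batches(doc.page_count, batch_size=10):
#       text = "".join(doc[i].get_text() for i in range(start, end))
#       ...  # hand this batch to the next pipeline stage, then release it
```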

    Practical Memory Budgeting

    | Concurrent Files | Avg File Size | Peak RAM per Worker | Total RAM Needed |
    | --- | --- | --- | --- |
    | 8 | 10 MB (text PDF) | ~200 MB | ~1.6 GB |
    | 8 | 50 MB (mixed PDF) | ~500 MB | ~4 GB |
    | 4 | 200 MB (scanned PDF) | ~2 GB | ~8 GB |
    | 2 | 500 MB+ (large scan) | ~4 GB | ~8 GB |

    Add 8–16 GB for the OS, application, and LLM (if running concurrently). A system with 64 GB RAM comfortably handles most parallel ingestion scenarios.
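The budget is simple arithmetic, but writing it down keeps the sizing honest. A minimal helper (the 12 GB default is the midpoint of the 8–16 GB overhead range above):

```python
def ram_budget_gb(workers, peak_per_worker_gb, overhead_gb=12):
    """Total RAM needed: per-worker peak times worker count, plus OS/app/LLM overhead."""
    return workers * peak_per_worker_gb + overhead_gb
```

For example, four workers on scanned PDFs at ~2 GB peak each need roughly 20 GB total, well within a 64 GB system.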


    I/O Optimization

    SSD vs. HDD: The Numbers

    This comparison bears repeating because the performance difference is stark:

    | Operation | NVMe SSD | SATA SSD | HDD |
    | --- | --- | --- | --- |
    | Sequential read | 3,500–7,000 MB/s | 500–550 MB/s | 100–200 MB/s |
    | Random read (4K) | 500K–1M IOPS | 50K–100K IOPS | 100–200 IOPS |
    | Latency | ~10 μs | ~50 μs | ~5,000 μs |

    For data preparation, random read IOPS matters as much as sequential throughput. Parsing documents involves seeking to different positions within files, loading metadata, reading embedded objects — all random access patterns.

    A concrete example: Ingesting 100,000 mixed documents (10 GB total) from different storage:

    | Storage | Estimated Ingestion Time |
    | --- | --- |
    | NVMe SSD | 8–15 minutes |
    | SATA SSD | 25–45 minutes |
    | HDD | 3–6 hours |
    | NFS over Gigabit Ethernet | 1–3 hours |

    RAID Configuration

    For the production tier (multi-TB archives):

    RAID 0 (striping): Doubles read throughput by spreading data across two drives. No redundancy — a single drive failure loses everything. Acceptable for intermediate processing data that can be regenerated.

    RAID 1 (mirroring): No throughput improvement but provides redundancy. Use for source data that cannot be easily replaced.

    RAID 10 (stripe of mirrors): Both throughput and redundancy. Four drives minimum. Best option when both speed and data safety matter.

    For NVMe, RAID is less necessary — a single Gen4 NVMe drive already provides more throughput than most data preparation pipelines can saturate. RAID becomes relevant at the HDD tier or when total capacity on a single drive is insufficient.

    Network Storage Best Practice

    If source data lives on network storage (NAS, SAN, NFS):

    1. Copy to local SSD before processing. Network latency on every file read adds up across millions of operations.
    2. If copying isn't practical, mount with appropriate options: noatime (skip access time updates), rsize=1048576,wsize=1048576 (large read/write buffers for NFS), and disable client-side NFS locking (nolock) where it's safe to do so.
    3. Accept that network storage will be the bottleneck. Plan accordingly — double or triple your time estimates vs. local SSD.
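Putting the options from step 2 together, an NFS mount might look like the following (a config sketch; the server name, export path, and mount point are hypothetical, and nolock should only be used when no other client writes to the same files):

```shell
# noatime skips access-time writes; large rsize/wsize cut NFS round trips;
# nolock disables client-side NFS locking.
sudo mount -t nfs -o noatime,rsize=1048576,wsize=1048576,nolock \
    nas.internal:/archives /mnt/archives
```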

    Progress Tracking and Resumability

    Long-running batch jobs fail. Drives fill up, power interruptions happen, software crashes on a malformed file. A pipeline that can't resume from where it left off wastes hours of completed work.

    Checkpoint-Based Resumability

    The minimum viable approach: maintain a log of completed files. When the pipeline restarts, it reads the log and skips already-processed files.

    # Simple checkpoint log format
    timestamp | filepath | status | duration_ms
    2026-03-11T22:15:03 | /data/contracts/2024-Q1/contract_0001.pdf | completed | 1250
    2026-03-11T22:15:04 | /data/contracts/2024-Q1/contract_0002.pdf | completed | 890
    2026-03-11T22:15:05 | /data/contracts/2024-Q1/contract_0003.pdf | error:corrupt_pdf | 45
    

    Key implementation details:

    • Write checkpoints synchronously (flush to disk) after each file. Async writes risk losing checkpoint data if the process crashes.
    • Record errors with enough context to investigate later — the error type, the file path, and the pipeline stage.
    • Store checkpoints separately from output data so they survive output directory cleanup.
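A minimal sketch of the checkpoint log in the format shown above, with the synchronous flush and the restart-time skip set (function names are illustrative):

```python
import os
from datetime import datetime

def append_checkpoint(log_path, filepath, status, duration_ms):
    """Append one checkpoint line and force it to disk before returning."""
    with open(log_path, "a") as log:
        stamp = datetime.now().isoformat(timespec="seconds")
        log.write(f"{stamp} | {filepath} | {status} | {duration_ms}\n")
        log.flush()
        os.fsync(log.fileno())  # synchronous: the line survives a crash right after

def completed_files(log_path):
    """Files already processed successfully, so a restart can skip them."""
    done = set()
    if not os.path.exists(log_path):
        return done
    with open(log_path) as log:
        for line in log:
            parts = line.strip().split(" | ")
            if len(parts) == 4 and parts[2] == "completed":
                done.add(parts[1])
    return done
```

On restart, filter the ingestion list with `completed_files` before scheduling any work.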

    Progress Reporting

    For batch jobs running overnight, progress reporting matters more than real-time dashboards. The essentials:

    • Total files / files processed / files remaining: Know where you are.
    • Current throughput (files/minute): Know your speed.
    • Estimated time remaining: Know when it'll finish.
    • Error count: Know if something is going wrong at scale (a few errors in 100K files is normal; 10,000 errors means a systematic problem).

    Write progress to a log file that can be checked without interrupting the process. A simple line like [2026-03-11 03:45:12] Progress: 45,230 / 100,000 files (45.2%) | 23.5 files/min | ETA: 38.8 hours | Errors: 12 is more useful than an elaborate dashboard that nobody watches at 3 AM.
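Producing that line is a one-liner's worth of arithmetic. A minimal sketch:

```python
def progress_line(done, total, elapsed_minutes, errors):
    """Format one progress log line with throughput and ETA."""
    rate = done / elapsed_minutes if elapsed_minutes else 0.0
    eta_hours = (total - done) / rate / 60 if rate else float("inf")
    return (f"Progress: {done:,} / {total:,} files ({done / total:.1%}) | "
            f"{rate:.1f} files/min | ETA: {eta_hours:.1f} hours | Errors: {errors}")
```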


    Error Handling for Corrupt Files

    Enterprise document archives contain corrupt files. Always. Common failure modes:

    • Truncated PDFs: File was interrupted during upload or copy. Parser reads the header, then encounters unexpected EOF.
    • Encrypted/password-protected files: Parser can detect the encryption flag but can't extract content.
    • Malformed XML (in DOCX/XLSX): Corrupted Office documents with invalid XML structures.
    • Zero-byte files: Present in the archive but contain no data.
    • Unsupported formats: Files with misleading extensions (a .pdf that's actually a TIFF).

    Error Handling Strategy

    1. Catch per-file: Never let a single corrupt file crash the entire pipeline. Wrap each file's processing in error handling.
    2. Log and skip: Record the error and move to the next file. Accumulate errors for post-run review.
    3. Quarantine: Move or link failed files to a separate directory for manual inspection.
    4. Set thresholds: If the error rate exceeds a threshold (e.g., >5% of files), pause the pipeline and alert. A high error rate usually indicates a systematic issue — wrong parser, character encoding problem, or corrupted source.
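The four rules combine into one loop. A minimal sketch, assuming `process` is the per-file pipeline callable; the 100-file warm-up before the threshold check avoids pausing on an unlucky first few files:

```python
import os
import shutil

class ErrorRateExceeded(Exception):
    """Raised when failures look systematic rather than incidental."""

def run_pipeline(paths, process, quarantine_dir, max_error_rate=0.05):
    """Process each file; quarantine failures; pause when errors exceed the threshold."""
    os.makedirs(quarantine_dir, exist_ok=True)
    errors = []
    for i, path in enumerate(paths, start=1):
        try:
            process(path)  # catch per-file: one corrupt file never kills the run
        except Exception as exc:
            errors.append((path, type(exc).__name__))   # log and skip
            shutil.copy(path, quarantine_dir)           # keep a copy for inspection
            if i >= 100 and len(errors) / i > max_error_rate:
                raise ErrorRateExceeded(f"{len(errors)}/{i} files failed")
    return errors
```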

    Tuning Common Bottlenecks

    Bottleneck: OCR Is Too Slow

    OCR is typically the slowest stage. Tuning options:

    • Switch to GPU-accelerated OCR: If running CPU-only Tesseract, switching to PaddleOCR or Surya with GPU can improve throughput 5–10x.
    • Reduce OCR resolution: Processing at 200 DPI instead of 300 DPI roughly doubles throughput with modest accuracy loss for standard printed text.
    • Skip OCR where unnecessary: If a PDF has extractable text layers, use text extraction instead of OCR. Many "scanned" PDFs actually have an OCR text layer already embedded.
    • Batch page processing: Process multiple pages per GPU inference call instead of one-at-a-time.

    Bottleneck: LLM Labeling Is Too Slow

    • Drop model size: Switch from 14B to 7B. Accept the accuracy trade-off if labeling quality remains above your threshold.
    • Increase quantization: Move from Q8 to Q4_K_M. Rough throughput improvement: 40–60%.
    • Reduce context window: If you're using 16K context but documents average 2K tokens, drop to 4K context.
    • Increase parallel requests: If VRAM allows, run 2–4 concurrent inference requests.

    Bottleneck: Memory Exhaustion

    • Reduce parallelism: Process fewer files concurrently, trading speed for stability.
    • Process by file type: Handle large scanned PDFs separately from small text documents, with different parallelism settings for each.
    • Increase swap space: A temporary measure, not a solution. Swapping to SSD is 100x slower than RAM but prevents crashes.

    Bottleneck: Disk Space

    • Process in waves: Ingest a batch, process it through all stages, export, then clean up intermediate files before the next batch.
    • Compress intermediate data: gzip or zstd compression on intermediate outputs trades CPU for disk space.
    • Monitor disk usage proactively: A disk-full error at hour 10 of a 12-hour job is avoidable with monitoring.
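Proactive monitoring can be as small as a headroom check between waves. A minimal sketch using the standard library (the 50 GB margin is an illustrative default, not a recommendation from this guide):

```python
import shutil

def check_disk_headroom(path, min_free_gb=50):
    """Return (free GB at `path`, True if above the safety margin)."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb, free_gb >= min_free_gb
```

Call it before ingesting each wave and pause (rather than crash at hour 10) when the margin is gone.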

    Monitoring Long-Running Batch Jobs

    For jobs running overnight or over a weekend:

    Log to file: Write structured logs that include timestamps, throughput metrics, and error counts. Check the log file before bed and first thing in the morning.

    Process monitoring: Use basic OS tools — htop for CPU and memory, nvidia-smi for GPU utilization, iostat for disk I/O. If GPU utilization drops to 0% while the job is running, something has stalled.

    Alerting (optional): For jobs running on a dedicated server, a simple script that checks for process liveness and sends a notification (email, Slack) on failure is worth the 10 minutes to set up.


    Practical Application

    Ertas Data Suite handles batch processing with built-in progress tracking, per-file error handling, and automatic resumability. The application maintains a processing journal that records the state of every file through each pipeline stage. If the process is interrupted — power outage, application crash, or intentional pause — restarting picks up exactly where it left off.

    For service providers processing client document archives, the overnight processing pattern combined with resumability means you can deliver results on a predictable schedule. Start the batch job when you leave the office, check the progress log remotely if needed, and review results the next morning. The performance tuning strategies in this guide help you get those results in one overnight run instead of three.


    The Broader Picture

    Batch processing performance directly affects engagement timelines and costs. A well-tuned pipeline that processes a 500 GB archive in 8 hours (one overnight run) delivers results in two days — one day to run, one day to review. A poorly tuned pipeline that takes 48 hours pushes the same deliverable to a week.

    For more on the infrastructure decisions that affect batch processing performance, see On-Premise Runtime Architecture for Enterprise AI Data Preparation and Hardware Sizing for On-Premise Data Preparation.
