    Batch Processing Large Document Archives On-Premise: Performance Tuning Guide


    Performance tuning guide for batch processing 100GB–1TB+ document archives on-premise — parallel ingestion, memory management, I/O optimization, and resumability strategies.

    Ertas Team

    Enterprise document archives are measured in hundreds of gigabytes to terabytes. A mid-size law firm might have 500 GB of contract PDFs. A hospital system might have 2 TB of clinical records. A construction firm might have 300 GB of engineering drawings and specifications.

    Processing these archives through a data preparation pipeline — ingestion, OCR, cleaning, labeling, export — takes hours to days, not minutes. The difference between a well-tuned pipeline and a naive one can be 3–5x in total processing time. This guide covers the performance tuning strategies that matter in practice.


    The Overnight Processing Pattern

    Before optimizing, it's worth acknowledging the dominant workflow pattern for large batch processing: start the job in the evening, review results in the morning.

    This pattern works because:

    • Data preparation is a periodic task, not a real-time service
    • Nobody is waiting for interactive response during batch processing
    • Long-running jobs benefit from uninterrupted compute (no competing workloads)
    • Results need human review anyway — and humans review during business hours

    The goal of performance tuning isn't to make a 12-hour job finish in 12 minutes. It's to make a 48-hour job finish in 12 hours so it completes overnight instead of spanning a weekend.


    Parallel Ingestion Strategies

    Document ingestion is the first pipeline stage and often the most I/O-bound. The key optimizations:

    File-Level Parallelism

    Process multiple files concurrently. Each file is independent — parsing a PDF doesn't depend on parsing the previous Word document.

    Optimal parallelism: Match the number of concurrent file operations to your storage throughput, not your CPU cores. On NVMe SSD, 8–16 concurrent file reads typically saturate the drive. On HDD, even 2–4 concurrent reads cause seek contention that slows everything down.

    | Storage Type | Recommended Parallel Files | Notes |
    | --- | --- | --- |
    | NVMe SSD | 8–16 | Limited by CPU parsing speed |
    | SATA SSD | 4–8 | Limited by sequential bandwidth |
    | HDD | 1–2 | Random I/O kills parallelism |
    | Network (NFS/SMB) | 4–8 | Limited by network bandwidth and latency |

    Sorting Files by Size

    Process large files first. This avoids the "long tail" problem where the pipeline is idle except for one thread processing a single massive file at the end.

    A simple strategy: sort the file list by size descending, then process in parallel. The first batch picks up the largest files, and subsequent batches fill in with smaller files.
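The largest-first strategy takes only a few lines. A minimal sketch using the standard library, where `process_file` is a placeholder for the real per-file pipeline stage:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_file(path):
    # Placeholder for the real per-file work (parse, OCR, clean, ...).
    return path, os.path.getsize(path)

def ingest_largest_first(paths, max_workers=8):
    """Process files largest-first so one massive file never trails at the end."""
    ordered = sorted(paths, key=os.path.getsize, reverse=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_file, ordered))
```

Set `max_workers` from the storage table above (e.g. 8–16 for NVMe, 1–2 for HDD), not from the CPU core count.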

    Pre-Filtering by File Type

    Not every file in an archive needs processing. Common filters:

    • Skip files below a minimum size (e.g., under 1 KB — probably empty or stub files)
    • Skip non-document types (.exe, .dll, .tmp)
    • Skip duplicates by file hash before parsing

    Pre-filtering a 500 GB archive often removes 5–15% of files, saving parsing time on content that won't produce useful data.
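The three filters compose into a single pass over the file list. A minimal sketch (the extension set and 1 KB cutoff mirror the examples above; for very large archives you would hash in chunks rather than reading each file whole):

```python
import hashlib
import os

SKIP_EXTENSIONS = {".exe", ".dll", ".tmp"}
MIN_SIZE_BYTES = 1024  # under 1 KB: probably empty or a stub

def prefilter(paths):
    """Yield only files worth parsing: right type, big enough, not a duplicate."""
    seen_hashes = set()
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        if ext in SKIP_EXTENSIONS:
            continue
        if os.path.getsize(path) < MIN_SIZE_BYTES:
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen_hashes:
            continue  # identical content already queued
        seen_hashes.add(digest)
        yield path
```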


    Memory Management for Large PDFs

    PDFs are the most common document type in enterprise archives and the most problematic for memory.

    Why PDFs Consume Memory

    A 200-page PDF with embedded images can decompress to 500 MB–2 GB in memory during parsing. The PDF format uses internal compression (Flate, JPEG2000) and object references that require the entire file structure to be held in memory for random access.

    Scanned PDFs are worse: each page is a full-resolution image (typically 300 DPI, 2550×3300 pixels, 24-bit color = ~24 MB uncompressed per page). A 200-page scanned PDF decompresses to ~4.8 GB.

    Mitigation Strategies

    Page-level processing: Instead of loading the entire PDF into memory, process one page (or a small batch of pages) at a time. Most PDF libraries support page-range extraction without loading the full document.

    Memory limits per worker: Set a memory ceiling for each processing worker. If a single PDF exceeds the limit, process it sequentially (one page at a time) instead of in parallel. This prevents a single large document from consuming all available RAM and causing out-of-memory crashes.

    Streaming extraction: For text-only PDFs (not scanned), use a streaming parser that extracts text without fully decompressing embedded objects. PyMuPDF (fitz) supports this approach and uses significantly less memory than pdfplumber or PyPDF2 for text extraction.
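The page-level approach reduces to iterating over small page ranges instead of the whole document. A minimal sketch of the batching logic; the commented usage assumes PyMuPDF is installed (`fitz.open`, `doc.page_count`, and `page.get_text` are its real API):

```python
def page_batches(page_count, batch_size=10):
    """Yield (start, end) page ranges so a PDF is processed a few pages at a time."""
    for start in range(0, page_count, batch_size):
        yield start, min(start + batch_size, page_count)

# Usage sketch with PyMuPDF:
#   import fitz
#   doc = fitz.open("big.pdf")
#   for start, end in page_batches(doc.page_count, batch_size=10):
#       text = "".join(doc[i].get_text() for i in range(start, end))
#       ...  # hand this batch to the next pipeline stage, then release it
```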

    Practical Memory Budgeting

    | Concurrent Files | Avg File Size | Peak RAM per Worker | Total RAM Needed |
    | --- | --- | --- | --- |
    | 8 | 10 MB (text PDF) | ~200 MB | ~1.6 GB |
    | 8 | 50 MB (mixed PDF) | ~500 MB | ~4 GB |
    | 4 | 200 MB (scanned PDF) | ~2 GB | ~8 GB |
    | 2 | 500 MB+ (large scan) | ~4 GB | ~8 GB |

    Add 8–16 GB for the OS, application, and LLM (if running concurrently). A system with 64 GB RAM comfortably handles most parallel ingestion scenarios.
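The budget is simple arithmetic, but writing it down keeps the sizing honest. A minimal helper (the 12 GB default is the midpoint of the 8–16 GB overhead range above):

```python
def ram_budget_gb(workers, peak_per_worker_gb, overhead_gb=12):
    """Total RAM needed: per-worker peak times worker count, plus OS/app/LLM overhead."""
    return workers * peak_per_worker_gb + overhead_gb
```

For example, four workers on scanned PDFs at ~2 GB peak each need roughly 20 GB total, well within a 64 GB system.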


    I/O Optimization

    SSD vs. HDD: The Numbers

    This comparison bears repeating because the performance difference is stark:

    | Operation | NVMe SSD | SATA SSD | HDD |
    | --- | --- | --- | --- |
    | Sequential read | 3,500–7,000 MB/s | 500–550 MB/s | 100–200 MB/s |
    | Random read (4K) | 500K–1M IOPS | 50K–100K IOPS | 100–200 IOPS |
    | Latency | ~10 μs | ~50 μs | ~5,000 μs |

    For data preparation, random read IOPS matters as much as sequential throughput. Parsing documents involves seeking to different positions within files, loading metadata, reading embedded objects — all random access patterns.

    A concrete example: Ingesting 100,000 mixed documents (10 GB total) from different storage:

    | Storage | Estimated Ingestion Time |
    | --- | --- |
    | NVMe SSD | 8–15 minutes |
    | SATA SSD | 25–45 minutes |
    | HDD | 3–6 hours |
    | NFS over Gigabit Ethernet | 1–3 hours |

    RAID Configuration

    For the production tier (multi-TB archives):

    RAID 0 (striping): Doubles read throughput by spreading data across two drives. No redundancy — a single drive failure loses everything. Acceptable for intermediate processing data that can be regenerated.

    RAID 1 (mirroring): No throughput improvement but provides redundancy. Use for source data that cannot be easily replaced.

    RAID 10 (stripe of mirrors): Both throughput and redundancy. Four drives minimum. Best option when both speed and data safety matter.

    For NVMe, RAID is less necessary — a single Gen4 NVMe drive already provides more throughput than most data preparation pipelines can saturate. RAID becomes relevant at the HDD tier or when total capacity on a single drive is insufficient.

    Network Storage Best Practice

    If source data lives on network storage (NAS, SAN, NFS):

    1. Copy to local SSD before processing. Network latency on every file read adds up across millions of operations.
    2. If copying isn't practical, mount with appropriate options: noatime (skip access time updates), rsize=1048576,wsize=1048576 (large read/write buffers for NFS), and disable client-side NFS locking (nolock) where it's safe to do so.
    3. Accept that network storage will be the bottleneck. Plan accordingly — double or triple your time estimates vs. local SSD.
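Putting the options from step 2 together, an NFS mount might look like the following (a config sketch; the server name, export path, and mount point are hypothetical, and nolock should only be used when no other client writes to the same files):

```shell
# noatime skips access-time writes; large rsize/wsize cut NFS round trips;
# nolock disables client-side NFS locking.
sudo mount -t nfs -o noatime,rsize=1048576,wsize=1048576,nolock \
    nas.internal:/archives /mnt/archives
```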

    Progress Tracking and Resumability

    Long-running batch jobs fail. Drives fill up, power interruptions happen, software crashes on a malformed file. A pipeline that can't resume from where it left off wastes hours of completed work.

    Checkpoint-Based Resumability

    The minimum viable approach: maintain a log of completed files. When the pipeline restarts, it reads the log and skips already-processed files.

    # Simple checkpoint log format
    timestamp | filepath | status | duration_ms
    2026-03-11T22:15:03 | /data/contracts/2024-Q1/contract_0001.pdf | completed | 1250
    2026-03-11T22:15:04 | /data/contracts/2024-Q1/contract_0002.pdf | completed | 890
    2026-03-11T22:15:05 | /data/contracts/2024-Q1/contract_0003.pdf | error:corrupt_pdf | 45
    

    Key implementation details:

    • Write checkpoints synchronously (flush to disk) after each file. Async writes risk losing checkpoint data if the process crashes.
    • Record errors with enough context to investigate later — the error type, the file path, and the pipeline stage.
    • Store checkpoints separately from output data so they survive output directory cleanup.
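A minimal sketch of the checkpoint log in the format shown above, with the synchronous flush and the restart-time skip set (function names are illustrative):

```python
import os
from datetime import datetime

def append_checkpoint(log_path, filepath, status, duration_ms):
    """Append one checkpoint line and force it to disk before returning."""
    with open(log_path, "a") as log:
        stamp = datetime.now().isoformat(timespec="seconds")
        log.write(f"{stamp} | {filepath} | {status} | {duration_ms}\n")
        log.flush()
        os.fsync(log.fileno())  # synchronous: the line survives a crash right after

def completed_files(log_path):
    """Files already processed successfully, so a restart can skip them."""
    done = set()
    if not os.path.exists(log_path):
        return done
    with open(log_path) as log:
        for line in log:
            parts = line.strip().split(" | ")
            if len(parts) == 4 and parts[2] == "completed":
                done.add(parts[1])
    return done
```

On restart, filter the ingestion list with `completed_files` before scheduling any work.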

    Progress Reporting

    For batch jobs running overnight, progress reporting matters more than real-time dashboards. The essentials:

    • Total files / files processed / files remaining: Know where you are.
    • Current throughput (files/minute): Know your speed.
    • Estimated time remaining: Know when it'll finish.
    • Error count: Know if something is going wrong at scale (a few errors in 100K files is normal; 10,000 errors means a systematic problem).

    Write progress to a log file that can be checked without interrupting the process. A simple line like [2026-03-11 03:45:12] Progress: 45,230 / 100,000 files (45.2%) | 23.5 files/min | ETA: 38.8 hours | Errors: 12 is more useful than an elaborate dashboard that nobody watches at 3 AM.
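Producing that line is a one-liner's worth of arithmetic. A minimal sketch:

```python
def progress_line(done, total, elapsed_minutes, errors):
    """Format one progress log line with throughput and ETA."""
    rate = done / elapsed_minutes if elapsed_minutes else 0.0
    eta_hours = (total - done) / rate / 60 if rate else float("inf")
    return (f"Progress: {done:,} / {total:,} files ({done / total:.1%}) | "
            f"{rate:.1f} files/min | ETA: {eta_hours:.1f} hours | Errors: {errors}")
```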


    Error Handling for Corrupt Files

    Enterprise document archives contain corrupt files. Always. Common failure modes:

    • Truncated PDFs: File was interrupted during upload or copy. Parser reads the header, then encounters unexpected EOF.
    • Encrypted/password-protected files: Parser can detect the encryption flag but can't extract content.
    • Malformed XML (in DOCX/XLSX): Corrupted Office documents with invalid XML structures.
    • Zero-byte files: Present in the archive but contain no data.
    • Unsupported formats: Files with misleading extensions (a .pdf that's actually a TIFF).

    Error Handling Strategy

    1. Catch per-file: Never let a single corrupt file crash the entire pipeline. Wrap each file's processing in error handling.
    2. Log and skip: Record the error and move to the next file. Accumulate errors for post-run review.
    3. Quarantine: Move or link failed files to a separate directory for manual inspection.
    4. Set thresholds: If the error rate exceeds a threshold (e.g., >5% of files), pause the pipeline and alert. A high error rate usually indicates a systematic issue — wrong parser, character encoding problem, or corrupted source.
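The four rules combine into one loop. A minimal sketch, assuming `process` is the per-file pipeline callable; the 100-file warm-up before the threshold check avoids pausing on an unlucky first few files:

```python
import os
import shutil

class ErrorRateExceeded(Exception):
    """Raised when failures look systematic rather than incidental."""

def run_pipeline(paths, process, quarantine_dir, max_error_rate=0.05):
    """Process each file; quarantine failures; pause when errors exceed the threshold."""
    os.makedirs(quarantine_dir, exist_ok=True)
    errors = []
    for i, path in enumerate(paths, start=1):
        try:
            process(path)  # catch per-file: one corrupt file never kills the run
        except Exception as exc:
            errors.append((path, type(exc).__name__))   # log and skip
            shutil.copy(path, quarantine_dir)           # keep a copy for inspection
            if i >= 100 and len(errors) / i > max_error_rate:
                raise ErrorRateExceeded(f"{len(errors)}/{i} files failed")
    return errors
```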

    Tuning Common Bottlenecks

    Bottleneck: OCR Is Too Slow

    OCR is typically the slowest stage. Tuning options:

    • Switch to GPU-accelerated OCR: If running CPU-only Tesseract, switching to PaddleOCR or Surya with GPU can improve throughput 5–10x.
    • Reduce OCR resolution: Processing at 200 DPI instead of 300 DPI roughly doubles throughput with modest accuracy loss for standard printed text.
    • Skip OCR where unnecessary: If a PDF has extractable text layers, use text extraction instead of OCR. Many "scanned" PDFs actually have an OCR text layer already embedded.
    • Batch page processing: Process multiple pages per GPU inference call instead of one-at-a-time.

    Bottleneck: LLM Labeling Is Too Slow

    • Drop model size: Switch from 14B to 7B. Accept the accuracy trade-off if labeling quality remains above your threshold.
    • Increase quantization: Move from Q8 to Q4_K_M. Rough throughput improvement: 40–60%.
    • Reduce context window: If you're using 16K context but documents average 2K tokens, drop to 4K context.
    • Increase parallel requests: If VRAM allows, run 2–4 concurrent inference requests.

    Bottleneck: Memory Exhaustion

    • Reduce parallelism: Process fewer files concurrently, trading speed for stability.
    • Process by file type: Handle large scanned PDFs separately from small text documents, with different parallelism settings for each.
    • Increase swap space: A temporary measure, not a solution. Swapping to SSD is 100x slower than RAM but prevents crashes.

    Bottleneck: Disk Space

    • Process in waves: Ingest a batch, process it through all stages, export, then clean up intermediate files before the next batch.
    • Compress intermediate data: gzip or zstd compression on intermediate outputs trades CPU for disk space.
    • Monitor disk usage proactively: A disk-full error at hour 10 of a 12-hour job is avoidable with monitoring.
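Proactive monitoring can be as small as a headroom check between waves. A minimal sketch using the standard library (the 50 GB margin is an illustrative default, not a recommendation from this guide):

```python
import shutil

def check_disk_headroom(path, min_free_gb=50):
    """Return (free GB at `path`, True if above the safety margin)."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb, free_gb >= min_free_gb
```

Call it before ingesting each wave and pause (rather than crash at hour 10) when the margin is gone.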

    Monitoring Long-Running Batch Jobs

    For jobs running overnight or over a weekend:

    Log to file: Write structured logs that include timestamps, throughput metrics, and error counts. Check the log file before bed and first thing in the morning.

    Process monitoring: Use basic OS tools — htop for CPU and memory, nvidia-smi for GPU utilization, iostat for disk I/O. If GPU utilization drops to 0% while the job is running, something has stalled.

    Alerting (optional): For jobs running on a dedicated server, a simple script that checks for process liveness and sends a notification (email, Slack) on failure is worth the 10 minutes to set up.


    Practical Application

    Ertas Data Suite handles batch processing with built-in progress tracking, per-file error handling, and automatic resumability. The application maintains a processing journal that records the state of every file through each pipeline stage. If the process is interrupted — power outage, application crash, or intentional pause — restarting picks up exactly where it left off.

    For service providers processing client document archives, the overnight processing pattern combined with resumability means you can deliver results on a predictable schedule. Start the batch job when you leave the office, check the progress log remotely if needed, and review results the next morning. The performance tuning strategies in this guide help you get those results in one overnight run instead of three.


    The Broader Picture

    Batch processing performance directly affects engagement timelines and costs. A well-tuned pipeline that processes a 500 GB archive in 8 hours (one overnight run) delivers results in two days — one day to run, one day to review. A poorly tuned pipeline that takes 48 hours pushes the same deliverable to a week.

    For more on the infrastructure decisions that affect batch processing performance, see On-Premise Runtime Architecture for Enterprise AI Data Preparation and Hardware Sizing for On-Premise Data Preparation.
