    Setting Up Local Document Ingestion for Enterprise AI Projects

    How to build local document ingestion for enterprise AI — covering PDFs, scanned forms, OCR options, table extraction, and handling 64+ file types without cloud dependencies.

Ertas Team

    Document ingestion is where every enterprise AI data pipeline starts — and where most service providers first discover how messy real-world enterprise data actually is.

    When a client hands you 200,000 PDFs, 50,000 Word documents, 10,000 Excel spreadsheets, and a box of scanned paper forms from 2003, your ingestion system needs to handle all of it. Locally. Without sending a single byte to a cloud API.

    This guide covers the practical side of setting up local document ingestion: what enterprise document types you'll encounter, how to handle each one on-premise, and where the common failure modes are.


    Enterprise Document Types You'll Actually Encounter

    Enterprise document collections are not clean corpora. They're decades of accumulated files across formats, conventions, and quality levels. Here's what shows up in practice:

    Native PDFs — Generated digitally, with extractable text layers. These are the easy case. Text extraction works reliably, and layout is generally recoverable. Still, complex layouts (multi-column, floating text boxes, nested tables) can defeat naive extraction.

    Scanned PDFs and image-based documents — No text layer. Every character must be reconstructed via OCR. Quality varies enormously: a clean 300 DPI scan from 2020 is straightforward; a 150 DPI fax from 1997 with coffee stains is not.

    Word documents (.docx, .doc) — Generally straightforward for .docx (XML-based). Legacy .doc files from pre-2007 Word versions require different parsing. Watch for tracked changes, embedded objects, and complex formatting that carries semantic meaning.

    Excel spreadsheets (.xlsx, .xls, .csv) — Table structure is the key information. Merged cells, multi-level headers, empty rows used as separators, and formulas that generate display values all need handling.

    PowerPoint presentations (.pptx) — Text embedded in shapes, text boxes, and SmartArt. Slide notes often contain additional context. Parsing must handle spatial layout, not just text extraction.

    CAD drawings and engineering documents — Title blocks contain structured metadata. Drawing annotations carry critical information. These are domain-specific and require specialized extraction logic.

    Scanned forms — Structured documents (insurance claims, patient intake forms, inspection checklists) where field positions encode meaning. OCR alone isn't enough — you need form field detection and key-value extraction.

    Email archives (.eml, .msg, .mbox) — Header metadata, body text, and attachments all need separate handling. Thread reconstruction adds another layer of complexity.


    On-Premise OCR Options

    For scanned documents, OCR is the critical dependency. Here's how the on-premise options compare:

| OCR Engine | Accuracy (clean scans) | Accuracy (degraded scans) | Language Support | Setup Complexity | License |
| --- | --- | --- | --- | --- | --- |
| Tesseract 5 | Good (92-96%) | Fair (75-85%) | 100+ languages | Low | Apache 2.0 |
| EasyOCR | Good (90-95%) | Good (80-88%) | 80+ languages | Low | Apache 2.0 |
| PaddleOCR | Very Good (94-97%) | Good (82-90%) | 80+ languages | Medium | Apache 2.0 |
| Cloud APIs (Google, AWS, Azure) | Excellent (97-99%) | Excellent (90-95%) | 100+ languages | Low | Not available on-premise |

    For on-premise deployments, PaddleOCR offers the best accuracy-to-complexity ratio. Tesseract is the simplest to deploy and works well for clean, modern scans. EasyOCR handles multilingual documents well.

    The accuracy gap between cloud and on-premise OCR is real — roughly 2-5% on clean documents, wider on degraded scans. For most enterprise use cases, on-premise accuracy is sufficient. For edge cases (badly degraded documents, unusual fonts, handwritten text), expect to build a human review step into the pipeline.
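
If you wire up Tesseract in Python, a thin wrapper that keeps per-word confidence makes that human-review routing concrete. This is a minimal sketch assuming the pytesseract and Pillow packages are installed and the Tesseract binary is on the PATH; the confidence threshold is a project-specific choice, not a standard value.

```python
# Sketch: run Tesseract locally via pytesseract and capture per-word
# confidence so low-confidence pages can be routed to human review.
import pytesseract
from PIL import Image

def ocr_with_confidence(image_path: str, min_conf: float = 60.0) -> dict:
    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    words, confidences = [], []
    for text, conf in zip(data["text"], data["conf"]):
        if text.strip() and float(conf) >= 0:  # -1 means "no confidence reported"
            words.append(text)
            confidences.append(float(conf))

    page_conf = sum(confidences) / len(confidences) if confidences else 0.0
    return {
        "text": " ".join(words),
        "page_confidence": page_conf,
        "needs_review": page_conf < min_conf,  # threshold is project-specific
    }
```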


    Table Extraction: The Hard Problem

    Tables in PDFs are where most ingestion pipelines break. A table that looks perfectly structured to a human viewer is, in the underlying PDF, just a collection of positioned text fragments with no explicit row/column structure.

    Docling (IBM's document understanding library) reports 97.9% table structure recognition accuracy on standard benchmarks. That's impressive, and in practice it handles most enterprise tables well. Complex cases — tables with merged cells spanning multiple rows, nested sub-tables, tables that span page breaks — still require validation.

    Camelot and Tabula are dedicated table extraction libraries for PDFs. They work well for simple, well-structured tables but struggle with complex layouts.
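
For the simple cases, a Camelot call is only a few lines. The sketch below assumes the camelot-py package and uses a placeholder file path; "lattice" mode expects ruled table lines, while "stream" mode infers structure from whitespace.

```python
# Sketch: table extraction from a native PDF with Camelot (camelot-py).
import camelot

# "lattice" expects ruled lines; use flavor="stream" for whitespace-aligned tables.
tables = camelot.read_pdf("contract.pdf", pages="1-10", flavor="lattice")
for table in tables:
    print(table.parsing_report)  # per-table accuracy and whitespace metrics
    df = table.df                # raw cell text as a pandas DataFrame
```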

    Layout-aware extraction is the current best approach: identify table regions in the document layout first, then extract cell contents using the detected structure. This requires a model (like the one Docling uses internally) rather than rule-based heuristics.

    After extraction, validate table structure programmatically:

    • Row counts should be consistent across columns
    • Header rows should be identifiable
    • Numeric columns should contain parseable numbers
    • Cell content shouldn't be truncated
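
A minimal sketch of those checks, assuming the extracted table arrives as a list of rows of cell strings (names and thresholds are illustrative):

```python
# Sketch: programmatic sanity checks on an extracted table,
# represented as a list of rows, each row a list of cell strings.
def validate_table(rows: list[list[str]]) -> list[str]:
    warnings = []
    if not rows:
        return ["table is empty"]

    # Row counts should be consistent across columns
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        warnings.append(f"inconsistent column counts: {sorted(widths)}")

    # Header row should be identifiable (no empty header cells)
    if not all(cell.strip() for cell in rows[0]):
        warnings.append("header row has empty cells")

    # Numeric-looking columns should contain parseable numbers
    for col in range(min(widths)):
        cells = [row[col].strip() for row in rows[1:] if col < len(row)]
        numeric = [c for c in cells
                   if c.replace(",", "").replace(".", "", 1).lstrip("-").isdigit()]
        if cells and len(numeric) < len(cells) and len(numeric) / len(cells) > 0.8:
            warnings.append(f"column {col} is mostly numeric but has non-numeric cells")

    # Cell content shouldn't be truncated (crude check: trailing ellipsis)
    if any(cell.rstrip().endswith(("...", "…")) for row in rows for cell in row):
        warnings.append("possible truncated cell content")

    return warnings
```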

    Handling Multi-Column Layouts

    Multi-column PDFs (academic papers, newsletters, some legal documents) create a reading order problem. Text extraction that reads left-to-right across the full page width will interleave content from two columns, producing garbage.

    Solving this requires layout detection: identify column boundaries, then extract text within each column in the correct reading order. The approaches:

1. Rule-based: Detect large horizontal gaps in text positioning. Works for simple two-column layouts, fails on complex ones (see the sketch after this list).
    2. ML-based layout detection: Models like LayoutLMv3 or Docling's layout model detect columns, headings, figures, and tables. More reliable, but requires a model deployment.
    3. Hybrid: Use rule-based detection first, fall back to ML-based for documents that don't parse cleanly.
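
For approach 1, the sketch below uses pdfplumber (one reasonable choice, not mandated here): find the widest vertical whitespace band near the page centre, then extract each column separately. The gap threshold is an assumption to tune per collection.

```python
# Sketch of rule-based column splitting with pdfplumber: look for a wide
# vertical whitespace band near the page centre, then read each column in order.
import pdfplumber

def extract_two_columns(pdf_path: str, page_number: int = 0, min_gap: float = 30.0) -> str:
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        words = page.extract_words()
        if not words:
            return page.extract_text() or ""

        # Widest gap between word boxes whose midpoint falls in the middle
        # third of the page is our candidate column boundary.
        centre = (page.width / 3, 2 * page.width / 3)
        spans = sorted((w["x0"], w["x1"]) for w in words)
        best_gap, split_x = 0.0, None
        rightmost = spans[0][1]
        for x0, x1 in spans[1:]:
            gap = x0 - rightmost
            mid = (rightmost + x0) / 2
            if gap > best_gap and centre[0] < mid < centre[1]:
                best_gap, split_x = gap, mid
            rightmost = max(rightmost, x1)

        if split_x is None or best_gap < min_gap:
            return page.extract_text() or ""  # no clear gap: treat as single column

        left = page.crop((0, 0, split_x, page.height)).extract_text() or ""
        right = page.crop((split_x, 0, page.width, page.height)).extract_text() or ""
        return left + "\n" + right
```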

    What Ingestion Output Should Look Like

    The output of a well-designed ingestion stage is structured text with metadata — not raw text dumps. Here's what good ingestion output contains:

    Document-level metadata:

    • Source filename, file type, page count
    • Ingestion timestamp
    • OCR confidence scores (for scanned documents)
    • Language detection results

    Section-level structure:

    • Heading hierarchy (H1, H2, H3)
    • Paragraph boundaries
    • List items
    • Table structures (rows, columns, headers identified)

    Inline metadata:

    • Bold/italic/underline markers where semantically meaningful
    • Footnote references
    • Cross-references and internal links

    Quality indicators:

    • OCR confidence per page/region
    • Layout detection confidence
    • Parsing warnings (e.g., "table may be malformed," "possible multi-column layout detected")

    This structured output feeds directly into the cleaning stage. If ingestion outputs flat text with no structure, every downstream stage works harder.
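
There is no single standard schema for this. The record below is an illustrative sketch (field names and values are hypothetical) of how one ingested document might be represented:

```python
# Illustrative sketch of one ingested document record: structured text
# plus the metadata downstream stages need. Field names are hypothetical.
ingested_doc = {
    "source": {"filename": "claims/2019/claim_0042.pdf", "file_type": "pdf_scanned", "pages": 6},
    "ingested_at": "2025-01-15T09:30:00Z",
    "language": "en",
    "quality": {
        "ocr_confidence_per_page": [0.94, 0.91, 0.88, 0.95, 0.67, 0.93],
        "layout_confidence": 0.82,
        "warnings": ["possible multi-column layout detected on page 5"],
    },
    "sections": [
        {
            "heading": "Claim Summary",
            "level": 1,
            "paragraphs": ["The claimant reports water damage to ..."],
            "tables": [
                {
                    "headers": ["Item", "Estimated Cost"],
                    "rows": [["Flooring", "4,200"], ["Drywall", "1,850"]],
                }
            ],
        }
    ],
}
```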


    Common Failure Modes

    Header/footer noise: Page numbers, running headers, document titles, and confidentiality notices repeat on every page. If not stripped, they appear hundreds of times in the extracted text and confuse deduplication and quality scoring.
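
A common mitigation is frequency-based: any line that repeats verbatim on a large fraction of pages is treated as boilerplate and stripped. A minimal sketch (exact match only, so numbered footers like "Page 3 of 12" still need pattern handling):

```python
# Sketch: strip lines that repeat across most pages (running headers, footers,
# confidentiality notices). `pages` is a list of per-page text strings.
from collections import Counter

def strip_repeated_lines(pages: list[str], threshold: float = 0.5) -> list[str]:
    line_counts = Counter()
    for page in pages:
        line_counts.update({line.strip() for line in page.splitlines() if line.strip()})

    repeated = {line for line, n in line_counts.items()
                if n >= max(2, threshold * len(pages))}
    return [
        "\n".join(line for line in page.splitlines() if line.strip() not in repeated)
        for page in pages
    ]
```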

    Hyphenation artifacts: Words split across line breaks (e.g., "docu-\nment") need to be rejoined. Simple regex handles most cases, but edge cases (e.g., "re-\nevaluate" where the hyphen is real) require dictionary lookup.
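
A minimal sketch of the regex-plus-dictionary approach; where the vocabulary comes from (a word list, or the document collection itself) is left open:

```python
# Sketch: rejoin words split across line breaks, keeping the hyphen only
# when the joined form isn't a known word (so "re-evaluate" can survive).
import re

def dehyphenate(text: str, vocabulary: set[str]) -> str:
    def join(match: re.Match) -> str:
        left, right = match.group(1), match.group(2)
        joined = left + right
        return joined if joined.lower() in vocabulary else left + "-" + right

    return re.sub(r"(\w+)-\n(\w+)", join, text)
```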

    Encoding issues: Legacy documents may use Windows-1252, ISO-8859-1, or other non-UTF-8 encodings. Mojibake (garbled characters from encoding mismatches) is common in mixed-encoding document collections.

    Metadata loss: Some extractors discard document properties (author, creation date, modification history) that may be valuable for filtering or provenance tracking.

    Empty or near-empty pages: Cover pages, separator pages, and blank pages waste processing time and can introduce noise if not filtered.


    Practical Setup

    For a service provider setting up local ingestion for a client project:

1. Inventory the document collection first. Before writing any code, count files by type, sample quality across the collection, and identify edge cases (a quick script like the sketch after this list is enough). A 30-minute inventory prevents days of debugging.

    2. Start with the hardest format. If the collection includes scanned PDFs from the 1990s, test OCR on those first. If they're unsalvageable, you need to know before planning the rest of the pipeline.

    3. Build a validation step. After ingestion, spot-check a random sample: Does the extracted text match the source document? Are tables intact? Is reading order correct?

    4. Log everything. Every file processed, every OCR confidence score, every parsing warning. This log is part of the audit trail.
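
For step 1, a standard-library script is usually enough; the directory path below is a placeholder:

```python
# Sketch of step 1: a quick inventory of file counts and sizes by extension
# before any pipeline code is written. Standard library only.
from collections import Counter
from pathlib import Path

def inventory(root: str) -> None:
    counts, sizes = Counter(), Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "(no extension)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size

    for ext, n in counts.most_common():
        print(f"{ext:>15}  {n:>8} files  {sizes[ext] / 1e9:8.2f} GB")

inventory("/mnt/client_documents")  # hypothetical mount point
```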

    Ertas Data Suite's Ingest module handles 64+ file types natively — including PDFs (native and scanned), Word, Excel, PowerPoint, images, and engineering documents — with built-in OCR, table extraction, and structured output. It logs every parsing decision and confidence score as part of the project audit trail, and the entire process runs locally with no network calls.


    Connecting to the Pipeline

    Ingestion produces structured text and metadata. The next stage — cleaning — takes that structured output and removes duplicates, normalizes encoding, redacts PII/PHI, and scores quality. Each stage builds on the previous one, and the quality of ingestion output directly determines how much work cleaning requires.

    For the complete pipeline overview, see How to Build an On-Premise Data Preparation Pipeline for LLM Fine-Tuning.

