
From Documents to Agent Knowledge Bases: The Complete Data Pipeline
Enterprise AI agents are only as good as their knowledge base. Here's the end-to-end pipeline for converting unstructured documents into structured, agent-ready knowledge — from PDF ingestion to retrieval-optimized chunks.
Enterprise AI agents fail for a predictable reason: the knowledge base is bad. Not "slightly suboptimal" — bad. Documents ingested without cleaning. PDFs with OCR errors propagating through the entire pipeline. Chunks that split mid-sentence or separate a table from its caption. Metadata so sparse that the retrieval system cannot distinguish a 2024 policy update from a 2019 draft.
The result is an agent that hallucinates with confidence. It retrieves a corrupted chunk, treats it as ground truth, and generates an authoritative-sounding answer that is wrong. In regulated industries, this is not an embarrassment — it is a liability.
The fix is not a better retrieval algorithm or a larger embedding model. The fix is a better data pipeline. The five-stage pipeline described here converts raw enterprise documents into structured, retrieval-optimized, agent-ready knowledge. Skip any stage and agent accuracy degrades measurably.
The Knowledge Base Quality Problem
A study by Databricks in late 2025 found that 67% of RAG system failures traced back to data quality issues — not retrieval failures, not model limitations, but garbage input. The breakdown: 28% from OCR and parsing errors, 19% from poor chunking (relevant information split across chunks), 12% from missing metadata, and 8% from duplicate or contradictory content.
This matches what we see in enterprise deployments. Teams spend weeks tuning retrieval parameters and embedding models when the real problem is that the source data was never properly cleaned.
The pipeline investment pays for itself. Teams that implement all five stages typically see retrieval accuracy improve from 55-65% (raw ingestion) to 85-92% (full pipeline). Agent answer accuracy follows: from 40-50% to 75-85% on domain-specific questions.
Stage 1: Ingest
The first stage handles format diversity. Enterprise documents come in every format: PDF (scanned and native), Word (.docx, .doc), PowerPoint, Excel, email (.eml, .msg), HTML, Markdown, Slack exports, Confluence pages, SharePoint documents, and plain text.
Each format requires a specialized parser:
- PDF (native): Extract text with layout preservation. Tools like Docling or PyMuPDF handle this well. Preserve table structure, headers, and page numbers.
- PDF (scanned): OCR with Tesseract or a local vision model. Expect 95-98% character accuracy on clean scans, 85-90% on older or low-quality documents.
- Word/PowerPoint: Parse the XML structure directly. python-docx and python-pptx handle most cases, but watch for embedded images with text, text boxes, and SmartArt — these are frequently missed.
- Email: Extract body, subject, sender, recipients, timestamps, and attachments. Attachments re-enter the pipeline as separate documents with parent-child metadata links.
- Slack/Teams exports: JSON format with threading structure. Preserve thread context — individual messages without their thread are often meaningless.
The output of Stage 1 is a standardized intermediate format: plain text with structural markers (headers, paragraphs, tables, lists) and source metadata (filename, format, page number, extraction date, extraction confidence score).
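A minimal sketch of this stage, assuming PyMuPDF for native PDFs and python-docx for Word files; the `ExtractedDoc` record and its field names are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field
from datetime import date
from pathlib import Path

import fitz                  # PyMuPDF, for PDFs with a native text layer
from docx import Document    # python-docx, for .docx files


@dataclass
class ExtractedDoc:
    """Standardized Stage 1 output: plain text plus source metadata."""
    filename: str
    source_format: str
    text: str
    extraction_date: date = field(default_factory=date.today)
    extraction_confidence: float = 1.0   # below 1.0 for OCR output


def extract_pdf(path: Path) -> str:
    # Keep a form-feed marker between pages so page numbers can be recovered later.
    return "\f".join(page.get_text() for page in fitz.open(str(path)))


def extract_docx(path: Path) -> str:
    return "\n".join(p.text for p in Document(str(path)).paragraphs)


PARSERS = {".pdf": ("pdf_native", extract_pdf), ".docx": ("docx", extract_docx)}


def ingest(path: Path) -> ExtractedDoc:
    fmt, parser = PARSERS[path.suffix.lower()]   # KeyError = unsupported format
    return ExtractedDoc(filename=path.name, source_format=fmt, text=parser(path))
```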
Volume benchmark: A typical enterprise knowledge base project starts with 10,000-50,000 documents. Ingestion throughput on a single 16-core server: approximately 500-1,000 documents per hour for native formats, 100-200 per hour for scanned PDFs requiring OCR.
Stage 2: Clean
Raw extracted text is noisy. Cleaning removes artifacts that would degrade retrieval quality.
OCR correction: Common OCR errors follow predictable patterns — "rn" misread as "m," "l" and "1" swapped, ligatures broken. Build a domain-specific correction dictionary. For a legal corpus, this means recognizing that "Artide" should be "Article" and "dause" should be "clause."
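A sketch of that correction pass; the dictionary entries below are just the two examples above, and a real one is built by reviewing OCR output from your own corpus:

```python
import re

# Domain-specific corrections, collected by reviewing OCR output samples.
OCR_CORRECTIONS = {
    r"\bArtide\b": "Article",   # "cl" misread as "d"
    r"\bdause\b": "clause",
}

def correct_ocr(text: str) -> str:
    """Apply whole-word replacements for known OCR error patterns."""
    for pattern, replacement in OCR_CORRECTIONS.items():
        text = re.sub(pattern, replacement, text)
    return text
```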
Deduplication: Enterprise document stores are full of duplicates — the same memo saved in three folders, the same policy document in a shared drive and a wiki, email attachments duplicated across recipients. Use content-based deduplication (hash the normalized text) rather than filename matching. Expect 15-30% of documents to be duplicates or near-duplicates.
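A sketch of content-based deduplication. Hashing lightly normalized text catches exact and trivially reformatted duplicates; true near-duplicates need fuzzier techniques such as MinHash:

```python
import hashlib

def content_fingerprint(text: str) -> str:
    """Hash of normalized text: identical content maps to the same digest."""
    normalized = " ".join(text.lower().split())   # case- and whitespace-insensitive
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs: list[dict]) -> list[dict]:
    seen: set[str] = set()
    unique = []
    for doc in docs:
        fp = doc["fingerprint"] = content_fingerprint(doc["text"])
        if fp not in seen:
            seen.add(fp)
            unique.append(doc)
    return unique
```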
Format normalization: Standardize Unicode encoding, line breaks, whitespace, and special characters. Convert smart quotes to straight quotes. Normalize dashes (em dash, en dash, hyphen-minus all become standard hyphens). This prevents retrieval misses caused by character encoding differences.
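A compact normalization pass along those lines; the character map is illustrative and grows with whatever your corpus actually contains:

```python
import unicodedata

# Smart quotes, dash variants, and non-breaking spaces mapped to plain equivalents.
CHAR_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u2014": "-", "\u2013": "-",    # em dash, en dash
    "\u00a0": " ",                   # non-breaking space
})

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)              # canonical Unicode form
    text = text.translate(CHAR_MAP)
    text = text.replace("\r\n", "\n").replace("\r", "\n")   # unify line breaks
    return "\n".join(line.rstrip() for line in text.splitlines())
```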
Boilerplate removal: Headers, footers, copyright notices, page numbers, "CONFIDENTIAL" watermarks, email signatures. These add noise to every chunk without adding information. Detect and strip them using pattern matching.
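A sketch of pattern-based stripping. The patterns are examples only; derive the real list from what actually recurs across your documents:

```python
import re

BOILERPLATE_PATTERNS = [
    re.compile(r"^\s*Page \d+ of \d+\s*$", re.MULTILINE),
    re.compile(r"^\s*CONFIDENTIAL.*$", re.MULTILINE | re.IGNORECASE),
    re.compile(r"^\s*©.*all rights reserved\.?\s*$", re.MULTILINE | re.IGNORECASE),
]

def strip_boilerplate(text: str) -> str:
    for pattern in BOILERPLATE_PATTERNS:
        text = pattern.sub("", text)
    return text
```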
Language detection: In multinational enterprises, documents may be in multiple languages. Tag each document with its language for downstream processing (different embedding models for different languages, or translation as a preprocessing step).
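A small sketch of per-document language tagging, assuming the langdetect package (any language-identification library slots in the same way):

```python
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0   # make detection deterministic across runs

def tag_language(doc: dict) -> dict:
    try:
        doc["language"] = detect(doc["text"][:2000])   # a sample is enough
    except LangDetectException:
        doc["language"] = "unknown"                    # e.g. text too short or all numeric
    return doc
```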
The output of Stage 2 is clean, normalized text with structural markers preserved and artifacts removed. A spot check of 50 random documents should show zero OCR artifacts, zero duplicates, and zero boilerplate remnants.
Stage 3: Structure
Clean text needs structure before it can be chunked effectively. Structure detection identifies the semantic organization of each document.
Section detection: Identify headers, subheaders, and the hierarchical structure of the document. A policy document has chapters, sections, and subsections. A technical manual has numbered sections. An email thread has individual messages with timestamps.
Metadata extraction: Pull structured information from the content: dates, version numbers, author names, department references, product names, regulation citations. This metadata becomes filterable attributes in the retrieval system.
Entity recognition: Identify named entities relevant to the domain — product names, customer names, regulation identifiers (GDPR Article 6, ISO 27001 Section A.12), internal project codes. Entity tags enable precise retrieval: "Show me all documents mentioning Project Phoenix" returns results based on entity matching, not keyword search.
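A sketch covering both metadata extraction and simple entity tagging with regular expressions; every pattern below (dates, version strings, regulation citations, project codes) is an illustrative assumption to be tuned per domain:

```python
import re

PATTERNS = {
    "dates":       re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                    # ISO dates
    "versions":    re.compile(r"\bv(?:ersion)?\s?\d+(?:\.\d+)+\b", re.IGNORECASE),
    "regulations": re.compile(r"\bGDPR Article \d+\b|\bISO 27001 Section [A-Z]\.\d+\b"),
    "projects":    re.compile(r"\bProject [A-Z][a-z]+\b"),                  # e.g. "Project Phoenix"
}

def extract_attributes(text: str) -> dict[str, list[str]]:
    """Return filterable metadata and entity tags found in the document text."""
    return {name: sorted(set(p.findall(text))) for name, p in PATTERNS.items()}
```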
Table extraction: Tables in documents contain dense, structured information. Extract them as structured data (rows and columns) rather than flattening them to text. A financial table flattened to text becomes "Revenue Q1 2025 $4.2M Q2 2025 $4.8M" — useless for comparison queries. Preserved as structured data, the retrieval system can answer "What was Q2 2025 revenue?" precisely.
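A sketch of the difference: a table kept as rows and columns supports exact lookups that the flattened string cannot (the figures echo the example above):

```python
# A table preserved as structured data: column names plus one record per row.
table = {
    "caption": "Quarterly revenue",
    "columns": ["Quarter", "Revenue"],
    "rows": [
        {"Quarter": "Q1 2025", "Revenue": "$4.2M"},
        {"Quarter": "Q2 2025", "Revenue": "$4.8M"},
    ],
}

def lookup(table: dict, key_col: str, key: str, value_col: str) -> str | None:
    """Answer 'What was Q2 2025 revenue?' by row lookup instead of text matching."""
    for row in table["rows"]:
        if row.get(key_col) == key:
            return row.get(value_col)
    return None

lookup(table, key_col="Quarter", key="Q2 2025", value_col="Revenue")   # -> "$4.8M"
```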
Cross-reference resolution: Documents reference other documents. "As described in Policy 4.2.1" should link to Policy 4.2.1 in the knowledge base. Resolve internal cross-references to create a document graph that the agent can traverse.
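A minimal sketch of cross-reference resolution, assuming references follow a "Policy 4.2.1" convention and that documents are keyed by those identifiers:

```python
import re
from collections import defaultdict

REF_PATTERN = re.compile(r"\bPolicy\s+(\d+(?:\.\d+)*)\b")

def build_reference_graph(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each document ID to the policy IDs it cites and that exist in the corpus."""
    graph: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for ref in REF_PATTERN.findall(text):
            if ref in docs and ref != doc_id:   # only keep references we can resolve
                graph[doc_id].add(ref)
    return graph
```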
Stage 4: Chunk
Chunking is where most knowledge base pipelines succeed or fail. The goal is to split documents into pieces that are small enough for effective embedding and retrieval, but large enough to preserve semantic coherence.
Fixed-Size Chunking
Split text every N tokens (typically 256-512). Fast, simple, and dumb. Fixed-size chunks split mid-sentence, separate questions from answers, and break tables in half. Retrieval accuracy with fixed-size chunks: typically 60-70%.
Use case: quick prototypes, low-stakes applications, situations where speed matters more than quality.
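For reference, the whole method is a few lines (splitting on whitespace tokens here, which only approximates model tokens):

```python
def fixed_size_chunks(text: str, size: int = 384) -> list[str]:
    """Split every `size` tokens with no regard for sentences, tables, or sections."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]
```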
Sentence-Level Chunking
Split on sentence boundaries, then group sentences until reaching the target chunk size. Better than fixed-size because chunks respect sentence structure. Still has problems with paragraphs that build an argument across 5-6 sentences — splitting after sentence 3 loses the conclusion.
Retrieval accuracy: typically 70-80%.
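A sketch of sentence-level grouping with a simple regex splitter; a proper sentence tokenizer (for example NLTK's punkt) handles abbreviations and edge cases better:

```python
import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentence_chunks(text: str, target_tokens: int = 384) -> list[str]:
    """Group whole sentences until the running token count reaches the target."""
    chunks, current, count = [], [], 0
    for sentence in SENTENCE_END.split(text):
        n = len(sentence.split())
        if current and count + n > target_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```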
Semantic Chunking
Use an LLM or embedding model to identify semantic boundaries — points where the topic shifts. Group semantically related sentences into chunks, letting length vary up to a maximum. This preserves the coherence of explanations, arguments, and procedures.
Retrieval accuracy: typically 80-90%.
The cost of semantic chunking is compute time. Running a local LLM over each document to identify semantic boundaries makes the chunking stage 5-10x slower. For enterprise knowledge bases where accuracy matters, this trade-off is worth it.
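One common way to approximate semantic boundary detection without an LLM is to embed consecutive sentences and start a new chunk where similarity drops. A sketch assuming sentence-transformers; the model name and threshold are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # example model; choose per domain and language

def semantic_chunks(sentences: list[str], threshold: float = 0.55,
                    max_tokens: int = 512) -> list[str]:
    """Start a new chunk where consecutive sentences stop being semantically similar."""
    # Assumes at least one sentence; sentences come from a sentence splitter upstream.
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current, count = [], [sentences[0]], len(sentences[0].split())
    for sent, prev_emb, emb in zip(sentences[1:], embeddings, embeddings[1:]):
        similarity = float(np.dot(prev_emb, emb))   # cosine, since vectors are normalized
        n = len(sent.split())
        if similarity < threshold or count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    chunks.append(" ".join(current))
    return chunks
```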
Overlap Strategy
Regardless of chunking method, use overlap — include the last 1-2 sentences of each chunk as the first sentences of the next chunk. This prevents information loss at chunk boundaries. An overlap of 10-15% of chunk size is standard.
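A sketch of adding that overlap after chunking, where each chunk is still a list of sentences:

```python
def add_overlap(chunks: list[list[str]], overlap_sentences: int = 2) -> list[str]:
    """Prefix each chunk with the final sentences of the previous chunk."""
    result = []
    for i, chunk in enumerate(chunks):
        carry = chunks[i - 1][-overlap_sentences:] if i > 0 else []
        result.append(" ".join(carry + chunk))
    return result
```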
Preserving Context
Each chunk must carry its context: the document title, section header, page number, and preceding section summary. Without context, a chunk saying "The threshold is 5%" is meaningless. With context — "Document: Risk Policy 2026 > Section 3.2: Credit Risk Limits > The threshold is 5%" — the retrieval system can match it to the right query.
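A sketch of stamping that context onto each chunk before embedding; the field names are assumptions:

```python
def contextualize(chunk: dict) -> str:
    """Prefix chunk text with its document and section breadcrumb."""
    parts = [chunk.get("document_title"), chunk.get("section_header"), chunk["text"]]
    return " > ".join(p for p in parts if p)

contextualize({
    "document_title": "Risk Policy 2026",
    "section_header": "Section 3.2: Credit Risk Limits",
    "text": "The threshold is 5%.",
})
# -> "Risk Policy 2026 > Section 3.2: Credit Risk Limits > The threshold is 5%."
```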
Stage 5: Export
The final stage produces the output format your agent system consumes. Most enterprise deployments need two outputs:
Vector-ready embeddings: Each chunk embedded using a model appropriate for your domain and language. Store embeddings with their metadata (source document, section, date, entities) in a vector database. This powers retrieval-augmented generation (RAG).
JSONL for fine-tuning: The same content formatted as instruction/response pairs for fine-tuning. This enables a complementary approach where the model learns domain knowledge directly, reducing retrieval dependency for common queries.
Producing both outputs from a single pipeline is more efficient than running two separate pipelines. The ingestion, cleaning, and structuring stages are identical — only the export format differs.
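A sketch of the dual export, assuming sentence-transformers for embeddings. The instruction/response template is a placeholder; real fine-tuning pairs are usually generated per chunk with an LLM or reviewed by domain experts:

```python
import json
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # example embedding model

def export(chunks: list[dict], rag_path: str, sft_path: str) -> None:
    """Write vector-ready records and instruction/response pairs from the same chunks."""
    vectors = model.encode([c["text"] for c in chunks], normalize_embeddings=True)
    with open(rag_path, "w", encoding="utf-8") as rag, \
         open(sft_path, "w", encoding="utf-8") as sft:
        for chunk, vec in zip(chunks, vectors):
            rag.write(json.dumps({
                "text": chunk["text"],
                "embedding": vec.tolist(),
                "metadata": chunk.get("metadata", {}),
            }) + "\n")
            # Placeholder pair: in practice, a question is generated for each chunk.
            sft.write(json.dumps({
                "instruction": chunk.get("question", f"Explain: {chunk.get('section', '')}"),
                "response": chunk["text"],
            }) + "\n")
```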
Quality Validation
Before deploying the knowledge base, validate it.
Retrieval accuracy testing: Prepare 100-200 test queries with known answers. Run each query against the knowledge base and check whether the correct chunk appears in the top-5 results. Target: 85%+ for production deployment.
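A sketch of that check, assuming a `retrieve(query, k)` function backed by your vector store and a test set of (query, expected chunk ID) pairs:

```python
def top_k_hit_rate(test_set: list[tuple[str, str]], retrieve, k: int = 5) -> float:
    """Fraction of test queries whose known-correct chunk appears in the top-k results."""
    hits = sum(1 for query, expected_id in test_set
               if expected_id in retrieve(query, k=k))
    return hits / len(test_set)

# Example: top_k_hit_rate(test_queries, retrieve=vector_store_search) >= 0.85 for production.
```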
Answer quality spot-checks: Have domain experts review 50 agent responses generated from the knowledge base. Score each response on accuracy, completeness, and source attribution. Any response that cites a nonexistent or incorrect source indicates a pipeline failure.
Coverage analysis: Does the knowledge base cover all the topics the agent is expected to handle? Map the topics to documents and identify gaps — these are topics where the agent will hallucinate because it has no source material.
Freshness audit: Check document dates. If the most recent version of a policy is from 2023 but a 2025 update exists in a different folder, the knowledge base is stale. Implement a freshness check that flags documents with newer versions available.
On-Premise Advantage
The entire pipeline runs on local infrastructure. Documents — which often contain proprietary business information, customer data, personal information, and trade secrets — never leave the organization's network.
This is not just a compliance checkbox. It is a practical requirement for the types of documents that form enterprise knowledge bases: HR policies with compensation details, legal memos with litigation strategy, financial reports with non-public data, engineering documents with trade secrets.
On-premise processing also eliminates vendor dependency. When your document pipeline runs on a third-party cloud service, that service controls your update schedule, your format support, and your pricing. When it runs locally, you control all three.
The infrastructure requirements are modest. A single server with 64GB RAM, 16 cores, and an NVIDIA A100 or equivalent GPU handles all five pipeline stages for knowledge bases up to 100,000 documents. Larger corpora benefit from parallelization across multiple nodes, but the pipeline itself is embarrassingly parallel — each document flows through the stages independently.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
Further Reading
- Building AI Agent Knowledge Bases for Enterprise — Strategic considerations for enterprise knowledge base architecture and governance.
- Enterprise AI Data Preparation Guide — The broader data preparation landscape for enterprise AI deployments.
- Unstructured Documents as AI Training Data — How to handle the 80% of enterprise data that lives in unstructured formats.
Turn unstructured data into AI-ready datasets — without it leaving the building.
On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
Keep reading

Why Your RAG Pipeline Breaks on Client-Uploaded Data (and How to Fix It)
Malformed PDFs, encoding issues, PII contamination, and duplicate content silently degrade RAG retrieval. Here is how to build a data quality pipeline upstream of your vector database.

Preparing Tool-Calling Datasets for Enterprise AI Agents: An On-Premise Workflow
AI agents need tool-calling training data to reliably select and invoke the right tools. Here's how to prepare function-calling datasets from enterprise documents — entirely on-premise.

Preparing RAG Datasets vs Fine-Tuning Datasets: Different Pipelines, Same Source Data
RAG needs chunked, retrieval-optimized text. Fine-tuning needs input/output pairs. Both start from the same raw documents. Here's how to run parallel preparation pipelines from a single source.