
Building AI Agent Knowledge Bases from Enterprise Documents On-Premise
A step-by-step guide to building RAG knowledge bases from enterprise documents — parsing, cleaning, chunking, embedding, and indexing — entirely on-premise. Covers common mistakes, scale considerations, and audit requirements.
An AI agent is only as good as the knowledge base behind it. You can deploy the best model on the best hardware with the best agent framework, and the agent will still give wrong answers if the knowledge base it retrieves from contains poorly parsed documents, badly chunked text, duplicate content, or outdated information.
This is not theoretical. In enterprise agent deployments, retrieval quality — the accuracy and relevance of what the vector store returns — is the single strongest predictor of agent output quality. A well-built knowledge base with a 7B model consistently outperforms a messy knowledge base with a 70B model. The bottleneck is almost always the data, not the model.
This guide walks through the complete pipeline for building an agent knowledge base from enterprise documents, entirely on-premise. No data leaves your network. No external APIs are called. Every component runs locally.
The Pipeline Overview
Raw Enterprise Documents
↓
Step 1: Document Ingestion (Parse)
↓
Step 2: Text Cleaning
↓
Step 3: Chunking with Metadata
↓
Step 4: Embedding (Local Model)
↓
Step 5: Vector Store Indexing
↓
Step 6: Retrieval Testing and Validation
↓
Agent RAG Queries
Each step has specific requirements, common failure modes, and quality metrics. Skipping or shortcutting any step degrades the final knowledge base quality — and by extension, the agent's output quality.
Step 1: Document Ingestion
Enterprise documents come in dozens of formats: PDFs (text-based and scanned), Word documents (.docx, .doc), Excel spreadsheets, PowerPoint presentations, emails (.eml, .msg), HTML pages, plain text files, and sometimes proprietary formats from enterprise systems.
What Good Parsing Looks Like
Good document parsing preserves the structure and meaning of the original document:
- Section headings are identified and labeled, not flattened into body text
- Tables are extracted as structured data (rows and columns preserved), not converted to streams of text
- Lists maintain their ordering and nesting
- Headers and footers are identified and separated from body content
- Page numbers are removed from the text but recorded as metadata
- Images with text (charts, diagrams with labels) are OCR'd and the text is extracted
- Formatting cues (bold, italic, underline) that convey meaning are preserved or annotated
Format-Specific Challenges
PDFs are the most common and most problematic format. A text-based PDF generated from Word is straightforward — the text layer is extractable. A scanned PDF is an image that requires OCR. A PDF generated from a web page may have unusual column layouts. A PDF with form fields has data in the form layer that simple text extraction misses.
Word documents are generally easier to parse, but complex formatting — nested tables, text boxes, SmartArt, embedded objects — can confuse extractors. Track changes and comments may or may not be relevant depending on the use case.
Excel spreadsheets present a structural challenge: the relationship between cells (which header goes with which value) is spatial, not textual. Flattening a spreadsheet to text loses these relationships. Merged cells, multiple sheets, and formulas add complexity.
Emails have metadata (from, to, date, subject) that is often more important for retrieval than the body text. Email chains have forwarding artifacts, reply markers, and signature blocks that should be cleaned. Attachments need separate handling.
Practical Approach
Use a document parser that handles multiple formats with format-specific logic. Run the parser on a sample of your document corpus and manually inspect the output. Flag documents that fail quality checks — OCR errors above a threshold, missing sections, corrupted tables — for manual review or reprocessing.
Quality metric: Parse accuracy rate — what percentage of documents are parsed with no significant structural errors? Target: 90%+ for text-based documents, 80%+ for scanned documents (with OCR).
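As a concrete starting point, the sketch below routes files to format-specific extractors and applies a crude quality gate. It assumes pypdf and python-docx purely for illustration; a production parser also needs table, heading, OCR, and form-field handling that this sketch omits.

from pathlib import Path
from pypdf import PdfReader           # text-layer PDF extraction
from docx import Document             # python-docx for .docx files

def parse_pdf(path: Path) -> str:
    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def parse_docx(path: Path) -> str:
    return "\n".join(p.text for p in Document(str(path)).paragraphs)

PARSERS = {".pdf": parse_pdf, ".docx": parse_docx}

def ingest(path: Path) -> dict | None:
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        return None                               # unsupported format: route to manual handling
    text = parser(path)
    if len(text.strip()) < 100:                   # likely a scanned PDF or a failed extraction
        return None                               # flag for OCR or manual review instead
    return {"path": str(path), "text": text}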
Step 2: Text Cleaning
Parsed text is not clean text. Cleaning removes artifacts that degrade retrieval quality without adding information value.
What to Remove
- Boilerplate: Headers, footers, copyright notices, confidentiality disclaimers that repeat on every page. These add noise to the vector store without adding retrieval value.
- Page numbers and running heads: "Page 47 of 132" and "Acme Corp — Confidential" on every page pollute the embedding space.
- OCR artifacts: Misrecognized characters, broken words, garbled text from poor scans. Light OCR errors (substituting "l" for "1") can be corrected with post-processing. Heavy errors may require re-scanning or manual transcription.
- Encoding issues: Unicode normalization, smart quotes vs. straight quotes, em dashes vs. hyphens. These seem minor but cause duplicate detection failures (the same text with different encoding is treated as different).
- Duplicate content: The same document exists in multiple locations (email attachment, shared drive, document management system). The same paragraph appears in multiple documents (standard clauses, boilerplate sections).
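A simple, effective way to catch repeating boilerplate is frequency analysis: a line that appears on most pages of a document is almost certainly a header, footer, or disclaimer. A minimal sketch, assuming the parser returns per-page text (the page-number pattern is an illustration to adapt to your documents):

import re
from collections import Counter

def strip_boilerplate(pages: list[str], min_repeat: float = 0.6) -> list[str]:
    """Drop lines repeated on a large fraction of pages, plus page-number lines."""
    line_counts = Counter(
        line.strip() for page in pages for line in page.splitlines() if line.strip()
    )
    threshold = max(2, int(len(pages) * min_repeat))
    page_no = re.compile(r"^\s*Page \d+( of \d+)?\s*$", re.IGNORECASE)

    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip()
            and line_counts[line.strip()] < threshold
            and not page_no.match(line)
        ]
        cleaned.append("\n".join(kept))
    return cleaned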
What to Preserve
- Domain terminology: Do not "clean" technical terms, abbreviations, or jargon. "ICD-10-CM" should not be normalized to "ICD 10 CM."
- Numerical data: Figures, measurements, dates, financial values. These are high-value for retrieval.
- Document structure: Section breaks, heading hierarchy, list structure. These inform chunking.
- Metadata: Author, date, document type, version, department. These enable filtered retrieval.
Deduplication Strategy
Deduplication operates at two levels:
Document-level: Exact and near-duplicate detection. Hash-based comparison catches exact duplicates. Similarity-based comparison (MinHash, SimHash) catches near-duplicates — documents that are 90%+ similar but have minor differences (version numbers, dates, formatting).
Paragraph-level: Standard clauses, boilerplate sections, and frequently copied text appear across many documents. These should be deduplicated to prevent the vector store from over-representing common text at the expense of unique, high-value content.
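For the near-duplicate case, MinHash with locality-sensitive hashing scales to large corpora without comparing every pair of documents. A sketch using the open-source datasketch library, with the ~90% similarity threshold mentioned above (the documents iterable of (id, cleaned text) pairs is assumed to come from the earlier steps):

from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.9, num_perm=128)     # ~90% estimated Jaccard similarity
unique_docs = {}

for doc_id, text in documents:                    # documents: iterable of (id, cleaned text)
    m = minhash(text)
    if lsh.query(m):                              # a near-duplicate is already indexed
        continue                                  # keep the first copy, drop this one
    lsh.insert(doc_id, m)
    unique_docs[doc_id] = text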
Quality metric: Deduplication rate — what percentage of duplicate or near-duplicate content was removed? Target: 95%+ of duplicates eliminated while losing zero unique content.
Step 3: Chunking with Metadata
Chunking is where most knowledge base builds go wrong. The default approach — split text every 512 or 1,024 characters — is simple to implement and reliably produces bad results.
Why Character-Count Chunking Fails
Character-count chunking has no awareness of document structure. It splits:
- A table between the header row and data rows, creating a chunk with column names but no data and another chunk with data but no column names
- A conditional statement between the condition and the consequence ("If the patient has diabetes..." in one chunk, "...then prescribe metformin" in the next)
- A numbered list between items, losing the context of what the list is about
- A paragraph mid-sentence, creating chunks that start and end with sentence fragments
Each of these splits creates chunks that are individually meaningless or misleading. When the agent retrieves a chunk that says "then prescribe metformin" without the condition, it may recommend metformin unconditionally.
Semantic Chunking
Semantic chunking splits at natural topic boundaries:
- Section headers: Split at each heading. This preserves the logical units of the document.
- Paragraph boundaries: Within a section, split at paragraph breaks. Each paragraph typically addresses a single point.
- Table boundaries: Keep tables intact as single chunks. Include the table header with every table chunk.
- List boundaries: Keep lists intact. Include the introductory text with the list items.
Chunk Configuration
| Parameter | Recommended Range | Rationale |
|---|---|---|
| Target chunk size | 300–800 tokens | Large enough for context, small enough for precise retrieval |
| Maximum chunk size | 1,200 tokens | Hard cap to prevent oversized chunks from dominating context windows |
| Overlap | 50–100 tokens | Maintains context continuity between adjacent chunks |
| Minimum chunk size | 50 tokens | Avoid micro-chunks that lack context |
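A simplified sketch of the paragraph-packing part of this approach is shown below. It assumes the document has already been split into sections by heading, approximates token counts by word counts (swap in your tokenizer), and omits the hard maximum cap and table/list handling:

import re

def chunk_section(section_title: str, section_text: str,
                  target: int = 600, overlap: int = 75) -> list[dict]:
    """Pack whole paragraphs into ~target-token chunks with a short overlap."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", section_text) if p.strip()]
    chunks, current = [], []

    def flush():
        if current:
            chunks.append({"section_title": section_title, "text": "\n\n".join(current)})

    for para in paragraphs:
        size = sum(len(p.split()) for p in current) + len(para.split())
        if current and size > target:
            flush()
            tail = current[-1].split()[-overlap:]         # carry overlap from the previous chunk
            current = [" ".join(tail)]
        current.append(para)
    flush()
    return chunks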
Metadata Tagging
Every chunk must carry metadata that enables filtered retrieval and traceability:
- source_document: File name or document ID of the original document
- source_path: Location of the original document in the file system or DMS
- document_date: When the document was created or last updated
- document_author: Who created the document
- document_type: Policy, procedure, guideline, contract, email, report, etc.
- section_title: The heading of the section this chunk comes from
- chunk_index: Position of this chunk within the document (for ordering)
- classification: Confidentiality level, department, business unit
This metadata serves three purposes:
- Filtered retrieval: "Find information about travel policy" → filter by document_type=policy, search for "travel"
- Recency weighting: Prefer chunks from more recent documents when multiple versions exist
- Audit trail: Trace any agent response back to the specific chunk and source document
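Put together, a fully tagged chunk record ready for embedding might look like this (field names follow the list above; the values are purely illustrative):

chunk = {
    "text": "Employees may book business-class travel only for flights longer than six hours...",
    "metadata": {
        "source_document": "travel-policy-2024.docx",
        "source_path": "/dms/hr/policies/travel-policy-2024.docx",
        "document_date": "2024-03-18",
        "document_author": "HR Policy Team",
        "document_type": "policy",
        "section_title": "Air Travel",
        "chunk_index": 12,
        "classification": "internal",
    },
}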
Quality metric: Chunk coherence — for a sample of chunks, does each chunk contain a complete, self-contained piece of information? Target: 80%+ of chunks are coherent standalone units.
Step 4: Embedding
Embedding converts text chunks into numerical vectors that capture semantic meaning. Similar concepts produce similar vectors, enabling semantic search.
Choosing an Embedding Model
For on-premise deployment, the embedding model must run locally. No external API calls. The current best options:
| Model | Dimensions | Speed (CPU) | Quality (MTEB) | Size |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | 80MB |
| E5-large-v2 | 1,024 | Medium | Very good | 1.3GB |
| BGE-large-en-v1.5 | 1,024 | Medium | Very good | 1.3GB |
| E5-mistral-7b-instruct | 4,096 | Slow (GPU needed) | Excellent | 14GB |
For most enterprise knowledge bases, E5-large-v2 or BGE-large-en-v1.5 offer the best balance of quality and speed. They run well on CPU for batch embedding and produce high-quality vectors for semantic search.
For very large knowledge bases (500K+ chunks) where retrieval quality is paramount, E5-mistral-7b-instruct provides better semantic understanding but requires GPU for reasonable embedding speed.
Batch Processing
Enterprise document corpora are large. Embedding 100,000 chunks one at a time is slow. Batch processing — embedding 32–128 chunks at a time — is 10–50x faster.
For a corpus of 100,000 documents producing approximately 500,000 chunks:
| Embedding Model | Hardware | Estimated Time |
|---|---|---|
| all-MiniLM-L6-v2 | CPU (16 cores) | 2–4 hours |
| E5-large-v2 | CPU (16 cores) | 8–16 hours |
| E5-large-v2 | GPU (RTX 4090) | 1–2 hours |
| E5-mistral-7b-instruct | GPU (RTX 4090) | 6–12 hours |
Plan for the initial embedding to take hours, not minutes. Subsequent updates (new or modified documents only) are incremental and much faster.
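A batch-embedding sketch using the sentence-transformers library and the E5-large-v2 model from the table above. E5 models expect a "passage: " prefix on documents and a "query: " prefix on queries; the chunks variable is assumed to hold the tagged chunk records from Step 3:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")     # the same model must embed queries later

# E5 models expect "passage: " for documents and "query: " for search queries.
texts = ["passage: " + chunk["text"] for chunk in chunks]

embeddings = model.encode(
    texts,
    batch_size=64,                   # batch for throughput; tune to available RAM or VRAM
    normalize_embeddings=True,       # unit vectors, so cosine similarity equals dot product
    show_progress_bar=True,
)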
Consistency
Use the same embedding model for indexing and for querying. If you embed your documents with E5-large-v2, your agent's queries must also be embedded with E5-large-v2. Mixing embedding models produces vectors in different spaces, making similarity search meaningless.
Step 5: Vector Store Indexing
The embedded vectors need to be stored in a vector database that supports efficient similarity search. For on-premise deployment:
| Vector Store | Strengths | Scale Limit | License |
|---|---|---|---|
| ChromaDB | Simple setup, good for prototyping | ~1M vectors | Apache 2.0 |
| Qdrant | Production-ready, filtering support, high performance | 100M+ vectors | Apache 2.0 |
| Milvus | Distributed, horizontal scaling | Billions of vectors | Apache 2.0 |
| Weaviate | Hybrid search (vector + keyword), good filtering | 100M+ vectors | BSD-3 |
For most enterprise deployments (10K–500K documents, 50K–2.5M chunks), Qdrant is the recommended choice. It handles the scale, supports metadata filtering (essential for enterprise use), and runs well on a single server.
For very large deployments (1M+ documents), consider Milvus for its distributed architecture.
Index Configuration
- Distance metric: Cosine similarity is the default and works well for most text embedding models
- Index type: HNSW (Hierarchical Navigable Small World) for approximate nearest neighbor search — fast and accurate
- ef_construction: Higher values (128–256) build a better index at the cost of longer build time. Worth the investment for enterprise deployments.
- m: Number of connections per node in the HNSW graph. 16–32 is typical. Higher values improve recall at the cost of memory.
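With the qdrant-client library, that configuration looks roughly like this (1,024 dimensions matches E5-large-v2 and BGE-large; Qdrant names the construction parameter ef_construct):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff

client = QdrantClient(url="http://localhost:6333")      # local, on-premise instance

client.create_collection(
    collection_name="enterprise_kb",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),   # E5/BGE-large dims
    hnsw_config=HnswConfigDiff(m=32, ef_construct=256),  # better recall, longer build time
)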
Testing the Index
Before connecting the agent, test the vector store directly:
results = vector_store.query(
    query="What is the company travel policy?",
    top_k=10,
    filters={"document_type": "policy"}
)

for result in results:
    print(f"Score: {result.score:.3f}")
    print(f"Source: {result.metadata['source_document']}")
    print(f"Text: {result.text[:200]}...")
    print()
Run 50–100 representative queries that reflect what users will actually ask the agent. For each query, manually evaluate whether the retrieved chunks are relevant and sufficient to answer the question.
Quality metric: Hits@10 — for what percentage of test queries is the correct answer contained in the top 10 retrieved chunks? Target: 85%+. If this is below 70%, fix the chunking and cleaning before connecting the agent.
Step 6: Retrieval Testing and Validation
This is the step most teams skip, and it is the step that matters most. Testing the knowledge base before connecting the agent separates the retrieval quality problem from the model quality problem.
Test Protocol
1. Create a test set: 50–100 questions that represent the queries your agent will receive. Include easy questions (answer is in a single obvious document), medium questions (answer requires finding the right document among many), and hard questions (answer requires synthesizing information from multiple documents).
2. Label ground truth: For each question, identify the correct source document(s) and the correct answer. This requires domain expert input.
3. Run retrieval: Query the vector store with each test question. Record the top-10 retrieved chunks.
4. Evaluate (a minimal scoring harness is sketched after this protocol):
   - Retrieval precision: What percentage of retrieved chunks are actually relevant?
   - Retrieval recall: What percentage of relevant chunks were retrieved?
   - Answer coverage: Do the retrieved chunks contain enough information to answer the question?
5. Identify failures: For questions where retrieval failed, diagnose why:
   - Chunk does not exist (document was not ingested or was filtered out)
   - Chunk exists but was not retrieved (embedding mismatch, query phrasing does not match document language)
   - Chunk was retrieved but is too fragmented (bad chunking split the relevant information)
   - Wrong chunk was retrieved (duplicate or misleading content outranked the correct chunk)
6. Fix and retest: Address the root causes and re-run the test. Iterate until retrieval quality meets your threshold.
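A minimal scoring harness for steps 3 and 4, assuming a labelled test set of (question, set of relevant source documents) pairs and the vector store interface used in Step 5:

def evaluate_hits_at_k(test_set, vector_store, k: int = 10) -> float:
    """test_set: list of (question, set_of_relevant_source_documents) pairs."""
    hits, failures = 0, []
    for question, relevant_docs in test_set:
        results = vector_store.query(query=question, top_k=k)
        retrieved_docs = {r.metadata["source_document"] for r in results}
        if retrieved_docs & relevant_docs:
            hits += 1
        else:
            failures.append(question)             # diagnose these individually in step 5
    print(f"Hits@{k}: {hits / len(test_set):.1%} ({len(failures)} failed queries)")
    return hits / len(test_set)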
Common Mistakes and Their Fixes
| Mistake | Symptom | Fix |
|---|---|---|
| Chunks too large (>1,500 tokens) | Retrieved chunks contain a mix of relevant and irrelevant information | Reduce target chunk size to 300–800 tokens with semantic boundaries |
| Chunks too small (<100 tokens) | Retrieved chunks lack context needed to answer the question | Increase minimum chunk size; add overlap; include section headers |
| No metadata filtering | Retrieval returns chunks from wrong document types or outdated versions | Add metadata to chunks; use filtered queries |
| No deduplication | Same information retrieved 3–5 times from different copies | Run deduplication before embedding |
| No PII/PHI handling | Sensitive data exposed through retrieval | Run PII/PHI detection and redaction before chunking |
| Embedding model mismatch | Retrieval returns semantically wrong results | Ensure same embedding model is used for indexing and querying |
| No overlap between chunks | Questions about information at chunk boundaries return irrelevant results | Add 50–100 token overlap between adjacent chunks |
Scale Considerations
Small Corpus (1K–10K Documents)
Manageable with semi-manual processes. A single data engineer can parse, clean, chunk, and validate the knowledge base in 1–2 weeks. Quality issues can be identified and fixed by manual inspection of samples.
Infrastructure: Single server with CPU is sufficient. ChromaDB or Qdrant. Embedding with all-MiniLM or E5-large on CPU.
Medium Corpus (10K–100K Documents)
Requires automated pipelines with quality controls. Manual inspection of every document is not feasible. Automated quality scoring, deduplication, and chunking with spot-check validation.
Infrastructure: Server with GPU recommended for embedding speed. Qdrant for vector storage. Expect 1–2 days for initial embedding.
Large Corpus (100K+ Documents)
Requires a production data pipeline with monitoring, error handling, incremental updates, and quality metrics dashboards. New documents should be processable without re-embedding the entire corpus.
Infrastructure: Dedicated GPU server or small cluster. Qdrant or Milvus. Consider sharding the vector store by document type or department for better retrieval performance.
Pipeline characteristics at scale:
- Automated document ingestion from source systems (DMS, SharePoint, email archives)
- Continuous quality monitoring (embedding distribution drift, retrieval accuracy degradation)
- Incremental updates (add new documents, update modified documents, remove deleted documents)
- Version tracking (which version of each document is currently in the knowledge base)
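Incremental updates hinge on knowing which documents changed since the last run. A hash-based sketch (the state file location and directory walk are assumptions to adapt to your source systems):

import hashlib, json
from pathlib import Path

STATE_FILE = Path("kb_state.json")    # maps document path -> content hash of the last indexed version

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def plan_incremental_update(corpus_dir: Path) -> dict:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    current = {str(p): file_hash(p) for p in corpus_dir.rglob("*") if p.is_file()}
    return {
        "new_or_modified": [p for p, h in current.items() if state.get(p) != h],
        "deleted": [p for p in state if p not in current],    # remove their chunks from the store
        "state": current,                                     # persist after a successful run
    }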
The Audit Requirement
Every document in the knowledge base must be traceable to its source. This is not just good practice — it is a requirement for regulated industries and a practical necessity for debugging.
The audit chain:
- Source document → identified by document ID, file path, and hash
- Parsed text → stored with parse timestamp and parser version
- Cleaned text → stored with cleaning rules applied
- Chunks → stored with chunk ID, parent document ID, and chunk position
- Embeddings → stored with embedding model version and timestamp
- Retrieval events → logged with query, retrieved chunk IDs, relevance scores, and timestamp
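The last link in the chain is cheap to implement: an append-only log the retrieval layer writes for every query. A sketch with illustrative field names:

import json, datetime

retrieval_event = {
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "query": "What is the company travel policy?",
    "embedding_model": "intfloat/e5-large-v2",
    "retrieved_chunks": [
        {"chunk_id": "travel-policy-2024#12",
         "source_document": "travel-policy-2024.docx",
         "score": 0.873},
    ],
}

with open("retrieval_audit.log", "a") as log:
    log.write(json.dumps(retrieval_event) + "\n")     # append-only JSONL audit trail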
When the agent gives a wrong answer, this chain lets you trace backwards: which chunks did the agent retrieve? Were those chunks relevant? Which source document did they come from? Was the source document correct and current? Was the chunk properly parsed and cleaned?
This traceability is what separates a production knowledge base from a prototype. It is also what makes continuous improvement possible — you can systematically identify and fix the data quality issues that cause agent errors, rather than guessing.
Where to Start
- Inventory your documents — what do you have, where is it stored, how much is there, how current is it?
- Pick a focused scope — start with one document type or one department, not the entire enterprise corpus
- Build the pipeline — parse, clean, chunk, embed, index. Automate each step.
- Test retrieval — 50+ queries, manual evaluation, identify failures
- Fix data quality — address the root causes of retrieval failures
- Connect the agent — only after retrieval quality meets your threshold
- Monitor and iterate — track retrieval quality over time, update documents, fix newly discovered issues
The knowledge base is not a one-time build. Documents change, new documents are created, old documents become obsolete. A production knowledge base has a maintenance pipeline that keeps it current and accurate. Plan for ongoing maintenance from the start, not as an afterthought.