
Building AI Agent Knowledge Bases from Enterprise Documents On-Premise
A step-by-step guide to building RAG knowledge bases from enterprise documents — parsing, cleaning, chunking, embedding, and indexing — entirely on-premise. Covers common mistakes, scale considerations, and audit requirements.
An AI agent is only as good as the knowledge base behind it. You can deploy the best model on the best hardware with the best agent framework, and the agent will still give wrong answers if the knowledge base it retrieves from contains poorly parsed documents, badly chunked text, duplicate content, or outdated information.
This is not theoretical. In enterprise agent deployments, retrieval quality — the accuracy and relevance of what the vector store returns — is the single strongest predictor of agent output quality. A well-built knowledge base with a 7B model consistently outperforms a messy knowledge base with a 70B model. The bottleneck is almost always the data, not the model.
This guide walks through the complete pipeline for building an agent knowledge base from enterprise documents, entirely on-premise. No data leaves your network. No external APIs are called. Every component runs locally.
The Pipeline Overview
Raw Enterprise Documents
↓
Step 1: Document Ingestion (Parse)
↓
Step 2: Text Cleaning
↓
Step 3: Chunking with Metadata
↓
Step 4: Embedding (Local Model)
↓
Step 5: Vector Store Indexing
↓
Step 6: Retrieval Testing and Validation
↓
Agent RAG Queries
Each step has specific requirements, common failure modes, and quality metrics. Skipping or shortcutting any step degrades the final knowledge base quality — and by extension, the agent's output quality.
Step 1: Document Ingestion
Enterprise documents come in dozens of formats: PDFs (text-based and scanned), Word documents (.docx, .doc), Excel spreadsheets, PowerPoint presentations, emails (.eml, .msg), HTML pages, plain text files, and sometimes proprietary formats from enterprise systems.
What Good Parsing Looks Like
Good document parsing preserves the structure and meaning of the original document:
- Section headings are identified and labeled, not flattened into body text
- Tables are extracted as structured data (rows and columns preserved), not converted to streams of text
- Lists maintain their ordering and nesting
- Headers and footers are identified and separated from body content
- Page numbers are removed from the text but recorded as metadata
- Images with text (charts, diagrams with labels) are OCR'd and the text is extracted
- Formatting cues (bold, italic, underline) that convey meaning are preserved or annotated
Format-Specific Challenges
PDFs are the most common and most problematic format. A text-based PDF generated from Word is straightforward — the text layer is extractable. A scanned PDF is an image that requires OCR. A PDF generated from a web page may have unusual column layouts. A PDF with form fields has data in the form layer that simple text extraction misses.
Word documents are generally easier to parse, but complex formatting — nested tables, text boxes, SmartArt, embedded objects — can confuse extractors. Track changes and comments may or may not be relevant depending on the use case.
Excel spreadsheets present a structural challenge: the relationship between cells (which header goes with which value) is spatial, not textual. Flattening a spreadsheet to text loses these relationships. Merged cells, multiple sheets, and formulas add complexity.
Emails have metadata (from, to, date, subject) that is often more important for retrieval than the body text. Email chains have forwarding artifacts, reply markers, and signature blocks that should be cleaned. Attachments need separate handling.
Practical Approach
Use a document parser that handles multiple formats with format-specific logic. Run the parser on a sample of your document corpus and manually inspect the output. Flag documents that fail quality checks — OCR errors above a threshold, missing sections, corrupted tables — for manual review or reprocessing.
Quality metric: Parse accuracy rate — what percentage of documents are parsed with no significant structural errors? Target: 90%+ for text-based documents, 80%+ for scanned documents (with OCR).
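As a concrete starting point, the sketch below routes files to format-specific extractors and applies a crude quality gate. It assumes pypdf and python-docx purely for illustration; a production parser also needs table, heading, OCR, and form-field handling that this sketch omits.

from pathlib import Path
from pypdf import PdfReader           # text-layer PDF extraction
from docx import Document             # python-docx for .docx files

def parse_pdf(path: Path) -> str:
    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def parse_docx(path: Path) -> str:
    return "\n".join(p.text for p in Document(str(path)).paragraphs)

PARSERS = {".pdf": parse_pdf, ".docx": parse_docx}

def ingest(path: Path) -> dict | None:
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        return None                               # unsupported format: route to manual handling
    text = parser(path)
    if len(text.strip()) < 100:                   # likely a scanned PDF or a failed extraction
        return None                               # flag for OCR or manual review instead
    return {"path": str(path), "text": text}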
Step 2: Text Cleaning
Parsed text is not clean text. Cleaning removes artifacts that degrade retrieval quality without adding information value.
What to Remove
- Boilerplate: Headers, footers, copyright notices, confidentiality disclaimers that repeat on every page. These add noise to the vector store without adding retrieval value.
- Page numbers and running heads: "Page 47 of 132" and "Acme Corp — Confidential" on every page pollute the embedding space.
- OCR artifacts: Misrecognized characters, broken words, garbled text from poor scans. Light OCR errors (substituting "l" for "1") can be corrected with post-processing. Heavy errors may require re-scanning or manual transcription.
- Encoding issues: Unicode normalization, smart quotes vs. straight quotes, em dashes vs. hyphens. These seem minor but cause duplicate detection failures (the same text with different encoding is treated as different).
- Duplicate content: The same document exists in multiple locations (email attachment, shared drive, document management system). The same paragraph appears in multiple documents (standard clauses, boilerplate sections).
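A simple, effective way to catch repeating boilerplate is frequency analysis: a line that appears on most pages of a document is almost certainly a header, footer, or disclaimer. A minimal sketch, assuming the parser returns per-page text (the page-number pattern is an illustration to adapt to your documents):

import re
from collections import Counter

def strip_boilerplate(pages: list[str], min_repeat: float = 0.6) -> list[str]:
    """Drop lines repeated on a large fraction of pages, plus page-number lines."""
    line_counts = Counter(
        line.strip() for page in pages for line in page.splitlines() if line.strip()
    )
    threshold = max(2, int(len(pages) * min_repeat))
    page_no = re.compile(r"^\s*Page \d+( of \d+)?\s*$", re.IGNORECASE)

    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip()
            and line_counts[line.strip()] < threshold
            and not page_no.match(line)
        ]
        cleaned.append("\n".join(kept))
    return cleaned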
What to Preserve
- Domain terminology: Do not "clean" technical terms, abbreviations, or jargon. "ICD-10-CM" should not be normalized to "ICD 10 CM."
- Numerical data: Figures, measurements, dates, financial values. These are high-value for retrieval.
- Document structure: Section breaks, heading hierarchy, list structure. These inform chunking.
- Metadata: Author, date, document type, version, department. These enable filtered retrieval.
Deduplication Strategy
Deduplication operates at two levels:
Document-level: Exact and near-duplicate detection. Hash-based comparison catches exact duplicates. Similarity-based comparison (MinHash, SimHash) catches near-duplicates — documents that are 90%+ similar but have minor differences (version numbers, dates, formatting).
Paragraph-level: Standard clauses, boilerplate sections, and frequently copied text appear across many documents. These should be deduplicated to prevent the vector store from over-representing common text at the expense of unique, high-value content.
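For the near-duplicate case, MinHash with locality-sensitive hashing scales to large corpora without comparing every pair of documents. A sketch using the open-source datasketch library, with the ~90% similarity threshold mentioned above (the documents iterable of (id, cleaned text) pairs is assumed to come from the earlier steps):

from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.9, num_perm=128)     # ~90% estimated Jaccard similarity
unique_docs = {}

for doc_id, text in documents:                    # documents: iterable of (id, cleaned text)
    m = minhash(text)
    if lsh.query(m):                              # a near-duplicate is already indexed
        continue                                  # keep the first copy, drop this one
    lsh.insert(doc_id, m)
    unique_docs[doc_id] = text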
Quality metric: Deduplication rate — what percentage of duplicate or near-duplicate content was removed? Target: 95%+ of duplicates eliminated while losing zero unique content.
Step 3: Chunking with Metadata
Chunking is where most knowledge base builds go wrong. The default approach — split text every 512 or 1,024 characters — is simple to implement and reliably produces bad results.
Why Character-Count Chunking Fails
Character-count chunking has no awareness of document structure. It splits:
- A table between the header row and data rows, creating a chunk with column names but no data and another chunk with data but no column names
- A conditional statement between the condition and the consequence ("If the patient has diabetes..." in one chunk, "...then prescribe metformin" in the next)
- A numbered list between items, losing the context of what the list is about
- A paragraph mid-sentence, creating chunks that start and end with sentence fragments
Each of these splits creates chunks that are individually meaningless or misleading. When the agent retrieves a chunk that says "then prescribe metformin" without the condition, it may recommend metformin unconditionally.
Semantic Chunking
Semantic chunking splits at natural topic boundaries:
- Section headers: Split at each heading. This preserves the logical units of the document.
- Paragraph boundaries: Within a section, split at paragraph breaks. Each paragraph typically addresses a single point.
- Table boundaries: Keep tables intact as single chunks. Include the table header with every table chunk.
- List boundaries: Keep lists intact. Include the introductory text with the list items.
Chunk Configuration
| Parameter | Recommended Range | Rationale |
|---|---|---|
| Target chunk size | 300–800 tokens | Large enough for context, small enough for precise retrieval |
| Maximum chunk size | 1,200 tokens | Hard cap to prevent oversized chunks from dominating context windows |
| Overlap | 50–100 tokens | Maintains context continuity between adjacent chunks |
| Minimum chunk size | 50 tokens | Avoid micro-chunks that lack context |
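A simplified sketch of the paragraph-packing part of this approach is shown below. It assumes the document has already been split into sections by heading, approximates token counts by word counts (swap in your tokenizer), and omits the hard maximum cap and table/list handling:

import re

def chunk_section(section_title: str, section_text: str,
                  target: int = 600, overlap: int = 75) -> list[dict]:
    """Pack whole paragraphs into ~target-token chunks with a short overlap."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", section_text) if p.strip()]
    chunks, current = [], []

    def flush():
        if current:
            chunks.append({"section_title": section_title, "text": "\n\n".join(current)})

    for para in paragraphs:
        size = sum(len(p.split()) for p in current) + len(para.split())
        if current and size > target:
            flush()
            tail = current[-1].split()[-overlap:]         # carry overlap from the previous chunk
            current = [" ".join(tail)]
        current.append(para)
    flush()
    return chunks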
Metadata Tagging
Every chunk must carry metadata that enables filtered retrieval and traceability:
- source_document: File name or document ID of the original document
- source_path: Location of the original document in the file system or DMS
- document_date: When the document was created or last updated
- document_author: Who created the document
- document_type: Policy, procedure, guideline, contract, email, report, etc.
- section_title: The heading of the section this chunk comes from
- chunk_index: Position of this chunk within the document (for ordering)
- classification: Confidentiality level, department, business unit
This metadata serves three purposes:
- Filtered retrieval: "Find information about travel policy" → filter by document_type=policy, search for "travel"
- Recency weighting: Prefer chunks from more recent documents when multiple versions exist
- Audit trail: Trace any agent response back to the specific chunk and source document
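Put together, a fully tagged chunk record ready for embedding might look like this (field names follow the list above; the values are purely illustrative):

chunk = {
    "text": "Employees may book business-class travel only for flights longer than six hours...",
    "metadata": {
        "source_document": "travel-policy-2024.docx",
        "source_path": "/dms/hr/policies/travel-policy-2024.docx",
        "document_date": "2024-03-18",
        "document_author": "HR Policy Team",
        "document_type": "policy",
        "section_title": "Air Travel",
        "chunk_index": 12,
        "classification": "internal",
    },
}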
Quality metric: Chunk coherence — for a sample of chunks, does each chunk contain a complete, self-contained piece of information? Target: 80%+ of chunks are coherent standalone units.
Step 4: Embedding
Embedding converts text chunks into numerical vectors that capture semantic meaning. Similar concepts produce similar vectors, enabling semantic search.
Choosing an Embedding Model
For on-premise deployment, the embedding model must run locally. No external API calls. The current best options:
| Model | Dimensions | Speed (CPU) | Quality (MTEB) | Size |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | 80MB |
| E5-large-v2 | 1,024 | Medium | Very good | 1.3GB |
| BGE-large-en-v1.5 | 1,024 | Medium | Very good | 1.3GB |
| E5-mistral-7b-instruct | 4,096 | Slow (GPU needed) | Excellent | 14GB |
For most enterprise knowledge bases, E5-large-v2 or BGE-large-en-v1.5 offer the best balance of quality and speed. They run well on CPU for batch embedding and produce high-quality vectors for semantic search.
For very large knowledge bases (500K+ chunks) where retrieval quality is paramount, E5-mistral-7b-instruct provides better semantic understanding but requires GPU for reasonable embedding speed.
Batch Processing
Enterprise document corpora are large. Embedding 100,000 chunks one at a time is slow. Batch processing — embedding 32–128 chunks at a time — is 10–50x faster.
For a corpus of 100,000 documents producing approximately 500,000 chunks:
| Embedding Model | Hardware | Estimated Time |
|---|---|---|
| all-MiniLM-L6-v2 | CPU (16 cores) | 2–4 hours |
| E5-large-v2 | CPU (16 cores) | 8–16 hours |
| E5-large-v2 | GPU (RTX 4090) | 1–2 hours |
| E5-mistral-7b-instruct | GPU (RTX 4090) | 6–12 hours |
Plan for the initial embedding to take hours, not minutes. Subsequent updates (new or modified documents only) are incremental and much faster.
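A batch-embedding sketch using the sentence-transformers library and the E5-large-v2 model from the table above. E5 models expect a "passage: " prefix on documents and a "query: " prefix on queries; the chunks variable is assumed to hold the tagged chunk records from Step 3:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")     # the same model must embed queries later

# E5 models expect "passage: " for documents and "query: " for search queries.
texts = ["passage: " + chunk["text"] for chunk in chunks]

embeddings = model.encode(
    texts,
    batch_size=64,                   # batch for throughput; tune to available RAM or VRAM
    normalize_embeddings=True,       # unit vectors, so cosine similarity equals dot product
    show_progress_bar=True,
)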
Consistency
Use the same embedding model for indexing and for querying. If you embed your documents with E5-large-v2, your agent's queries must also be embedded with E5-large-v2. Mixing embedding models produces vectors in different spaces, making similarity search meaningless.
Step 5: Vector Store Indexing
The embedded vectors need to be stored in a vector database that supports efficient similarity search. For on-premise deployment:
| Vector Store | Strengths | Scale Limit | License |
|---|---|---|---|
| ChromaDB | Simple setup, good for prototyping | ~1M vectors | Apache 2.0 |
| Qdrant | Production-ready, filtering support, high performance | 100M+ vectors | Apache 2.0 |
| Milvus | Distributed, horizontal scaling | Billions of vectors | Apache 2.0 |
| Weaviate | Hybrid search (vector + keyword), good filtering | 100M+ vectors | BSD-3 |
For most enterprise deployments (10K–500K documents, 50K–2.5M chunks), Qdrant is the recommended choice. It handles the scale, supports metadata filtering (essential for enterprise use), and runs well on a single server.
For very large deployments (1M+ documents), consider Milvus for its distributed architecture.
Index Configuration
- Distance metric: Cosine similarity is the default and works well for most text embedding models
- Index type: HNSW (Hierarchical Navigable Small World) for approximate nearest neighbor search — fast and accurate
- ef_construction: Higher values (128–256) build a better index at the cost of longer build time. Worth the investment for enterprise deployments.
- m: Number of connections per node in the HNSW graph. 16–32 is typical. Higher values improve recall at the cost of memory.
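With the qdrant-client library, that configuration looks roughly like this (1,024 dimensions matches E5-large-v2 and BGE-large; Qdrant names the construction parameter ef_construct):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff

client = QdrantClient(url="http://localhost:6333")      # local, on-premise instance

client.create_collection(
    collection_name="enterprise_kb",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),   # E5/BGE-large dims
    hnsw_config=HnswConfigDiff(m=32, ef_construct=256),  # better recall, longer build time
)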
Testing the Index
Before connecting the agent, test the vector store directly:
results = vector_store.query(
    query="What is the company travel policy?",
    top_k=10,
    filters={"document_type": "policy"}
)

for result in results:
    print(f"Score: {result.score:.3f}")
    print(f"Source: {result.metadata['source_document']}")
    print(f"Text: {result.text[:200]}...")
    print()
Run 50–100 representative queries that reflect what users will actually ask the agent. For each query, manually evaluate whether the retrieved chunks are relevant and sufficient to answer the question.
Quality metric: Hits@10 — for what percentage of test queries is the correct answer contained in the top 10 retrieved chunks? Target: 85%+. If this is below 70%, fix the chunking and cleaning before connecting the agent.
Step 6: Retrieval Testing and Validation
This is the step most teams skip, and it is the step that matters most. Testing the knowledge base before connecting the agent separates the retrieval quality problem from the model quality problem.
Test Protocol
1. Create a test set: 50–100 questions that represent the queries your agent will receive. Include easy questions (answer is in a single obvious document), medium questions (answer requires finding the right document among many), and hard questions (answer requires synthesizing information from multiple documents).
2. Label ground truth: For each question, identify the correct source document(s) and the correct answer. This requires domain expert input.
3. Run retrieval: Query the vector store with each test question. Record the top-10 retrieved chunks.
4. Evaluate (a minimal scoring harness is sketched after this protocol):
   - Retrieval precision: What percentage of retrieved chunks are actually relevant?
   - Retrieval recall: What percentage of relevant chunks were retrieved?
   - Answer coverage: Do the retrieved chunks contain enough information to answer the question?
5. Identify failures: For questions where retrieval failed, diagnose why:
   - Chunk does not exist (document was not ingested or was filtered out)
   - Chunk exists but was not retrieved (embedding mismatch, query phrasing does not match document language)
   - Chunk was retrieved but is too fragmented (bad chunking split the relevant information)
   - Wrong chunk was retrieved (duplicate or misleading content outranked the correct chunk)
6. Fix and retest: Address the root causes and re-run the test. Iterate until retrieval quality meets your threshold.
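A minimal scoring harness for steps 3 and 4, assuming a labelled test set of (question, set of relevant source documents) pairs and the vector store interface used in Step 5:

def evaluate_hits_at_k(test_set, vector_store, k: int = 10) -> float:
    """test_set: list of (question, set_of_relevant_source_documents) pairs."""
    hits, failures = 0, []
    for question, relevant_docs in test_set:
        results = vector_store.query(query=question, top_k=k)
        retrieved_docs = {r.metadata["source_document"] for r in results}
        if retrieved_docs & relevant_docs:
            hits += 1
        else:
            failures.append(question)             # diagnose these individually in step 5
    print(f"Hits@{k}: {hits / len(test_set):.1%} ({len(failures)} failed queries)")
    return hits / len(test_set)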
Common Mistakes and Their Fixes
| Mistake | Symptom | Fix |
|---|---|---|
| Chunks too large (>1,500 tokens) | Retrieved chunks contain a mix of relevant and irrelevant information | Reduce target chunk size to 300–800 tokens with semantic boundaries |
| Chunks too small (<100 tokens) | Retrieved chunks lack context needed to answer the question | Increase minimum chunk size; add overlap; include section headers |
| No metadata filtering | Retrieval returns chunks from wrong document types or outdated versions | Add metadata to chunks; use filtered queries |
| No deduplication | Same information retrieved 3–5 times from different copies | Run deduplication before embedding |
| No PII/PHI handling | Sensitive data exposed through retrieval | Run PII/PHI detection and redaction before chunking |
| Embedding model mismatch | Retrieval returns semantically wrong results | Ensure same embedding model is used for indexing and querying |
| No overlap between chunks | Questions about information at chunk boundaries return irrelevant results | Add 50–100 token overlap between adjacent chunks |
Scale Considerations
Small Corpus (1K–10K Documents)
Manageable with semi-manual processes. A single data engineer can parse, clean, chunk, and validate the knowledge base in 1–2 weeks. Quality issues can be identified and fixed by manual inspection of samples.
Infrastructure: Single server with CPU is sufficient. ChromaDB or Qdrant. Embedding with all-MiniLM or E5-large on CPU.
Medium Corpus (10K–100K Documents)
Requires automated pipelines with quality controls. Manual inspection of every document is not feasible. Automated quality scoring, deduplication, and chunking with spot-check validation.
Infrastructure: Server with GPU recommended for embedding speed. Qdrant for vector storage. Expect 1–2 days for initial embedding.
Large Corpus (100K+ Documents)
Requires a production data pipeline with monitoring, error handling, incremental updates, and quality metrics dashboards. New documents should be processable without re-embedding the entire corpus.
Infrastructure: Dedicated GPU server or small cluster. Qdrant or Milvus. Consider sharding the vector store by document type or department for better retrieval performance.
Pipeline characteristics at scale:
- Automated document ingestion from source systems (DMS, SharePoint, email archives)
- Continuous quality monitoring (embedding distribution drift, retrieval accuracy degradation)
- Incremental updates (add new documents, update modified documents, remove deleted documents)
- Version tracking (which version of each document is currently in the knowledge base)
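Incremental updates hinge on knowing which documents changed since the last run. A hash-based sketch (the state file location and directory walk are assumptions to adapt to your source systems):

import hashlib, json
from pathlib import Path

STATE_FILE = Path("kb_state.json")    # maps document path -> content hash of the last indexed version

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def plan_incremental_update(corpus_dir: Path) -> dict:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    current = {str(p): file_hash(p) for p in corpus_dir.rglob("*") if p.is_file()}
    return {
        "new_or_modified": [p for p, h in current.items() if state.get(p) != h],
        "deleted": [p for p in state if p not in current],    # remove their chunks from the store
        "state": current,                                     # persist after a successful run
    }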
The Audit Requirement
Every document in the knowledge base must be traceable to its source. This is not just good practice — it is a requirement for regulated industries and a practical necessity for debugging.
The audit chain:
- Source document → identified by document ID, file path, and hash
- Parsed text → stored with parse timestamp and parser version
- Cleaned text → stored with cleaning rules applied
- Chunks → stored with chunk ID, parent document ID, and chunk position
- Embeddings → stored with embedding model version and timestamp
- Retrieval events → logged with query, retrieved chunk IDs, relevance scores, and timestamp
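The last link in the chain is cheap to implement: an append-only log the retrieval layer writes for every query. A sketch with illustrative field names:

import json, datetime

retrieval_event = {
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "query": "What is the company travel policy?",
    "embedding_model": "intfloat/e5-large-v2",
    "retrieved_chunks": [
        {"chunk_id": "travel-policy-2024#12",
         "source_document": "travel-policy-2024.docx",
         "score": 0.873},
    ],
}

with open("retrieval_audit.log", "a") as log:
    log.write(json.dumps(retrieval_event) + "\n")     # append-only JSONL audit trail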
When the agent gives a wrong answer, this chain lets you trace backwards: which chunks did the agent retrieve? Were those chunks relevant? Which source document did they come from? Was the source document correct and current? Was the chunk properly parsed and cleaned?
This traceability is what separates a production knowledge base from a prototype. It is also what makes continuous improvement possible — you can systematically identify and fix the data quality issues that cause agent errors, rather than guessing.
Where to Start
- Inventory your documents — what do you have, where is it stored, how much is there, how current is it?
- Pick a focused scope — start with one document type or one department, not the entire enterprise corpus
- Build the pipeline — parse, clean, chunk, embed, index. Automate each step.
- Test retrieval — 50+ queries, manual evaluation, identify failures
- Fix data quality — address the root causes of retrieval failures
- Connect the agent — only after retrieval quality meets your threshold
- Monitor and iterate — track retrieval quality over time, update documents, fix newly discovered issues
The knowledge base is not a one-time build. Documents change, new documents are created, old documents become obsolete. A production knowledge base has a maintenance pipeline that keeps it current and accurate. Plan for ongoing maintenance from the start, not as an afterthought.