
Agentic RAG: How to Build a Retrieval Tool Your AI Agent Discovers and Calls Automatically
AI agents need retrieval as a callable tool, not embedded code. Here is how to build a RAG pipeline that generates tool-calling specs so agents can discover and query your knowledge base without custom integration.
Most RAG implementations are hardwired into application code. A user asks a question, the application embeds the query, searches a vector database, assembles context, and passes everything to a language model. The retrieval logic is embedded directly in the orchestration layer, tightly coupled to a single workflow.
This works until you introduce an AI agent.
Agents operate differently. They receive a goal, reason about which tools to use, and call those tools dynamically. An agent does not follow a fixed pipeline. It discovers available capabilities through tool-calling specifications, decides when retrieval is needed, and invokes it like any other function. If your RAG pipeline is buried inside application code, the agent cannot find it, cannot call it, and cannot reason about when retrieval would help.
The shift from embedded retrieval to agentic RAG is not a minor refactor. It is a fundamental change in how you architect knowledge access for AI systems.
What Makes RAG "Agentic"
Traditional RAG is a pipeline: query goes in, context comes out, the model generates a response. The application controls every step. The model has no say in whether retrieval happens, what gets retrieved, or how many times the pipeline runs.
An agentic RAG pipeline inverts this control. The retrieval pipeline becomes a tool that the agent can discover and call on its own terms. The agent decides:
- Whether to retrieve at all (some queries do not need external knowledge)
- What to search for (the agent formulates its own retrieval queries)
- When to retrieve again (if the first result is insufficient, the agent can call the tool a second time with a refined query)
- How to combine retrieval with other tools (search a knowledge base, then call a calculator, then search again)
This is the core idea behind an agentic RAG pipeline: retrieval is not a fixed step in a workflow. It is a capability the agent invokes through a standard tool-calling interface.
Why Embedded Retrieval Code Breaks When Agents Evolve
Consider a typical RAG integration. You have a Python function that takes a user query, generates an embedding, searches Pinecone or Weaviate, assembles the top-k results into a context string, and returns it. This function is called at a specific point in your application logic.
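In code, that usually looks something like the sketch below. The index name, client setup, and metadata fields are illustrative, assuming OpenAI embeddings and a Pinecone index:

    from openai import OpenAI
    from pinecone import Pinecone

    openai_client = OpenAI()
    index = Pinecone().Index("company-docs")  # hypothetical index name

    def retrieve_context(user_query: str, top_k: int = 5) -> str:
        # Embed the query with a hardcoded model choice.
        embedding = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=user_query,
        ).data[0].embedding

        # Search the vector database and assemble a context string.
        results = index.query(vector=embedding, top_k=top_k, include_metadata=True)
        return "\n\n".join(match.metadata["text"] for match in results.matches)

    # Called at one fixed point in the application flow:
    # context = retrieve_context(user_question)
    # prompt = f"Context:\n{context}\n\nQuestion: {user_question}"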
Now you want an AI agent to use this same retrieval capability. The problems start immediately:
Tight coupling to one workflow. The retrieval function assumes it will be called in a specific sequence. The agent does not follow sequences. It reasons about goals and picks tools dynamically. Your embedded function has no mechanism for the agent to discover it.
No schema for the agent to understand. Agents use tool-calling specifications — structured descriptions of what a tool does, what parameters it accepts, and what it returns. Your embedded retrieval function has none of this. The agent cannot reason about a tool it cannot see.
No independent deployment. The retrieval logic lives inside your application. If you want a different agent, a different framework, or a different orchestration layer to use the same knowledge base, you have to duplicate the code. Every copy drifts independently.
No versioning or swappability. When you update your embedding model, change your chunking strategy, or switch vector databases, every consumer of the retrieval logic must be updated. There is no abstraction boundary.
These problems compound as your AI system grows. One agent becomes three. One knowledge base becomes five. One orchestration framework gets replaced by another. Embedded retrieval code becomes a maintenance burden that scales linearly with every new consumer.
How Tool-Calling Specs Work
Modern AI agents discover tools through structured specifications. The two dominant formats are OpenAI function calling and Anthropic tool use, but the concept is the same across both: a JSON schema that describes the tool's name, purpose, parameters, and expected output.
A tool-calling spec for a RAG pipeline might look like this:
    {
      "name": "search_knowledge_base",
      "description": "Search the internal knowledge base for information relevant to a query. Returns ranked passages with source citations.",
      "parameters": {
        "type": "object",
        "properties": {
          "query": {
            "type": "string",
            "description": "Natural language search query"
          },
          "max_results": {
            "type": "integer",
            "description": "Maximum number of passages to return",
            "default": 5
          },
          "filter_category": {
            "type": "string",
            "description": "Optional category filter to narrow search scope"
          }
        },
        "required": ["query"]
      }
    }
When an agent receives this spec, it understands what the tool does, what inputs it needs, and when it might be useful. The agent does not need to know how the tool works internally — whether it uses dense embeddings, sparse retrieval, or a hybrid approach. The spec is the contract. The implementation is hidden.
This is what makes RAG tool calling for AI agents fundamentally different from embedded retrieval. The agent treats the knowledge base as a black-box service it can invoke, just like it would invoke a weather API or a database query tool.
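To make this concrete, here is a minimal agent loop using the OpenAI Python SDK that passes the spec above as a tool and lets the model decide when to call it. The names search_knowledge_base_spec and execute_search_knowledge_base are assumptions: the first is the JSON spec shown earlier loaded as a dict, the second is whatever dispatcher forwards the call to your retrieval endpoint.

    import json
    from openai import OpenAI

    client = OpenAI()
    tools = [{"type": "function", "function": search_knowledge_base_spec}]
    messages = [{"role": "user", "content": "What is our refund policy for enterprise contracts?"}]

    while True:
        response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
        message = response.choices[0].message
        messages.append(message)

        # The model may answer directly, or it may call the retrieval tool,
        # possibly several times with refined queries, before answering.
        if not message.tool_calls:
            print(message.content)
            break

        for call in message.tool_calls:
            args = json.loads(call.function.arguments)
            result = execute_search_knowledge_base(**args)  # hypothetical dispatcher
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })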
The Architecture: RAG as a Swappable Tool
Building an agentic RAG pipeline means decomposing retrieval into a standalone service with a tool-calling interface. The architecture has five components that form a clean pipeline:
1. API Endpoint. The entry point that receives tool calls from any agent. This is a standard HTTP endpoint that accepts the parameters defined in the tool-calling spec and returns structured results. Critically, this endpoint also serves the tool-calling spec itself — agents can discover what the tool does by requesting its specification.
2. Query Embedder. Transforms the incoming natural language query into a vector representation. This component is internal to the pipeline. The agent never interacts with it directly. You can swap embedding models — from OpenAI embeddings to a locally hosted model — without changing the tool-calling spec.
3. Vector Search. Executes similarity search against your vector database. Again, internal to the pipeline. The agent does not know or care whether you use Pinecone, Weaviate, Qdrant, or a local FAISS index. The abstraction boundary at the API endpoint means you can migrate databases without breaking any agent integration.
4. Context Assembler. Takes the raw search results and assembles them into a structured response: ranked passages, relevance scores, source citations, metadata. This component controls the quality of what the agent receives. You can add re-ranking, deduplication, or citation formatting here without touching the external interface.
5. API Response. Returns the assembled context in the format the agent expects. The response schema is part of the tool-calling spec, so agents know exactly what structure to parse.
This five-node pipeline — API Endpoint, Query Embedder, Vector Search, Context Assembler, API Response — can be deployed as an independent service. Any agent that supports tool calling can discover it and start using it immediately. No custom integration code. No framework-specific adapters.
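A rough sketch of the service side, using FastAPI, shows where each component sits. The embedder, vector_store, and assembler objects, the spec variable, and the /tools/... paths are placeholders for whatever implementations live behind the boundary:

    from fastapi import FastAPI
    from pydantic import BaseModel, Field

    app = FastAPI()

    class SearchRequest(BaseModel):
        query: str = Field(description="Natural language search query")
        max_results: int = Field(default=5, description="Maximum number of passages to return")
        filter_category: str | None = Field(default=None, description="Optional category filter")

    class Passage(BaseModel):
        text: str
        score: float
        source: str

    class SearchResponse(BaseModel):
        passages: list[Passage]

    @app.post("/tools/search_knowledge_base", response_model=SearchResponse)
    def search_knowledge_base(request: SearchRequest) -> SearchResponse:
        # 2. Query Embedder: internal detail, swappable without changing the contract.
        vector = embedder.embed(request.query)  # placeholder component
        # 3. Vector Search: any vector database can sit behind this call.
        hits = vector_store.search(vector, top_k=request.max_results,
                                   category=request.filter_category)  # placeholder component
        # 4. Context Assembler: ranking, deduplication, citation formatting.
        passages = assembler.assemble(hits)  # placeholder component
        # 5. API Response: structured output matching the tool-calling spec.
        return SearchResponse(passages=passages)

    @app.get("/tools/search_knowledge_base/spec")
    def get_spec() -> dict:
        # 1. API Endpoint also serves the tool-calling spec so agents can discover it.
        return search_knowledge_base_tool_spec  # placeholder: the JSON spec shown earlier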
Auto-Generating Tool-Calling Specs
The most tedious part of making RAG callable by an AI agent is writing and maintaining the tool-calling specifications. Every time you add a parameter, change a filter option, or modify the response format, the spec must be updated. Manual spec maintenance is error-prone and falls out of sync quickly.
This is where auto-generation matters. In Ertas, the API Endpoint node automatically generates tool-calling specs in both OpenAI function calling format and Anthropic tool use format. When you define your pipeline's inputs and outputs through the visual builder, the corresponding tool-calling specification is produced as a build artifact. Update the pipeline, and the spec updates with it.
Auto-generated specs eliminate a category of bugs: the mismatch between what the tool actually accepts and what the spec tells the agent it accepts. They also make it practical to maintain multiple RAG pipelines — one per knowledge domain, one per access level, one per language — without manually writing specs for each.
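How the generation works under the hood is Ertas-specific, but the underlying idea is simple to sketch: declare the request schema once and derive the spec from it, so the two cannot drift apart. A minimal illustration with Pydantic (not Ertas's implementation):

    from pydantic import BaseModel, Field

    class SearchRequest(BaseModel):
        """Search the internal knowledge base for information relevant to a query."""
        query: str = Field(description="Natural language search query")
        max_results: int = Field(default=5, description="Maximum number of passages to return")
        filter_category: str | None = Field(default=None, description="Optional category filter")

    def openai_function_spec(model: type[BaseModel], name: str) -> dict:
        # Derive the OpenAI function-calling spec from the request schema,
        # so the spec can never disagree with what the endpoint actually accepts.
        schema = model.model_json_schema()
        return {
            "name": name,
            "description": model.__doc__ or "",
            "parameters": {
                "type": "object",
                "properties": schema["properties"],
                "required": schema.get("required", []),
            },
        }

    spec = openai_function_spec(SearchRequest, "search_knowledge_base")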
What Changes When Retrieval Is a Tool
Treating RAG as a callable tool rather than embedded code changes how you think about knowledge infrastructure:
Agents become framework-agnostic. Your RAG pipeline works with any agent that supports tool calling — LangChain, CrewAI, AutoGen, custom orchestrators, or a simple loop calling the OpenAI API. The tool-calling spec is the universal interface.
Knowledge bases become composable. An agent can have access to multiple RAG tools, each connected to a different knowledge base. A legal research agent might call one tool for case law, another for regulatory filings, and a third for internal memos. Each is an independent pipeline with its own spec.
Upgrades become invisible. Swap your embedding model from text-embedding-3-small to a fine-tuned domain-specific model. Change your chunking strategy. Add a re-ranker. None of these changes are visible to the agent. The tool-calling spec stays the same. The API contract holds.
Testing becomes straightforward. A tool with a defined input schema and output schema is testable in isolation. You can evaluate retrieval quality, latency, and relevance without spinning up an entire agent framework. Integration tests verify the spec. Unit tests verify the pipeline.
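A contract test for the retrieval tool, for instance, needs nothing more than an HTTP client. The URL and response fields below are illustrative and match the endpoint sketch from the architecture section:

    import httpx

    def test_search_tool_contract():
        # Exercise the tool exactly as an agent would: spec-defined input in,
        # spec-defined output out. No orchestration framework required.
        response = httpx.post(
            "http://localhost:8000/tools/search_knowledge_base",  # illustrative local deployment
            json={"query": "data retention policy", "max_results": 3},
        )
        assert response.status_code == 200

        body = response.json()
        assert "passages" in body
        assert len(body["passages"]) <= 3
        for passage in body["passages"]:
            assert {"text", "score", "source"} <= passage.keys()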
Getting Started
If you have an existing RAG pipeline embedded in application code, the migration path is clear: extract the retrieval logic behind an API endpoint, define a tool-calling spec for the endpoint, and register that spec with your agent framework.
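The registration step can be a thin wrapper that forwards calls to the deployed endpoint. As one illustration, here is what that might look like with LangChain's tool decorator; the endpoint URL is hypothetical, and any framework with tool support follows the same pattern:

    import httpx
    from langchain_core.tools import tool

    @tool
    def search_knowledge_base(query: str, max_results: int = 5) -> str:
        """Search the internal knowledge base for information relevant to a query.
        Returns ranked passages with source citations."""
        response = httpx.post(
            "http://knowledge-api.internal/tools/search_knowledge_base",  # illustrative URL
            json={"query": query, "max_results": max_results},
        )
        response.raise_for_status()
        return response.text

    # The decorated function carries its own schema; pass it to any
    # LangChain agent alongside other tools.
    tools = [search_knowledge_base]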
If you are building from scratch, start with the pipeline architecture described above. Ertas provides the visual pipeline builder where you connect the five nodes — API Endpoint, Query Embedder, Vector Search, Context Assembler, API Response — and deploy. The tool-calling specs are generated automatically, ready for any agent to discover and call.
The future of RAG is not smarter retrieval algorithms. It is better interfaces between retrieval systems and the agents that need them. Tool-calling specs are that interface. Build your RAG pipeline as a tool, and every agent you deploy — today and in the future — can use it without a single line of integration code.