
    LangChain + Fine-Tuned Local Model: Build Pipelines Without API Costs

    LangChain works with any OpenAI-compatible API — including Ollama. Replace the API calls in your LangChain pipelines with a fine-tuned local model. Same chain structure, zero per-token costs.

    Ertas Team

    LangChain is the most widely used framework for building AI pipelines: document processing, RAG, agents, chains. Most LangChain tutorials point at OpenAI's API. Your production bills reflect this.

    LangChain supports local models via Ollama — and Ollama's OpenAI-compatible interface means you can swap the AI backend in a LangChain pipeline with minimal code changes. Combine this with a fine-tuned model, and you get pipelines that are faster for domain tasks, cheaper at scale, and private by default.

    LangChain + Ollama Integration Options

    LangChain has two integration paths for Ollama:

    Option 1: ChatOllama (LangChain native)

    from langchain_ollama import ChatOllama
    
    llm = ChatOllama(
        model="your-fine-tuned-model",
        base_url="http://localhost:11434",
        temperature=0.3
    )
    
    # Use exactly like ChatOpenAI
    response = llm.invoke("Generate a listing for this property: ...")
    print(response.content)
    

    Option 2: ChatOpenAI with Ollama base URL (preferred for drop-in replacement)

    from langchain_openai import ChatOpenAI
    
    llm = ChatOpenAI(
        model="your-fine-tuned-model",
        base_url="http://localhost:11434/v1",
        api_key="ollama",  # Required field, not validated
        temperature=0.3
    )
    
    # Identical interface to cloud ChatOpenAI
    

    Option 2 is the cleanest migration: if you already have LangChain code using ChatOpenAI, the only changes are model, base_url, and a placeholder api_key. You can even drive the switch from environment variables, as the sketch below shows.
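
    For illustration, here is a minimal sketch of that swap driven by environment variables. LLM_MODEL, LLM_BASE_URL, and LLM_API_KEY are hypothetical variable names for this example, not LangChain settings:

    import os
    from langchain_openai import ChatOpenAI
    
    # Point LLM_BASE_URL at Ollama for local inference; leave it unset
    # (and supply a real OpenAI key) to stay on the cloud endpoint.
    llm = ChatOpenAI(
        model=os.getenv("LLM_MODEL", "gpt-4o"),
        base_url=os.getenv("LLM_BASE_URL"),          # e.g. http://localhost:11434/v1
        api_key=os.getenv("LLM_API_KEY", "ollama"),
        temperature=0.3
    )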

    Common Pipeline Patterns

    Pattern 1: Document Processing Chain

    Before (GPT-4 API):

    from langchain_openai import ChatOpenAI
    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate
    
    llm = ChatOpenAI(model="gpt-4o")  # $0.005/1K input tokens
    
    template = PromptTemplate.from_template(
        "Classify this support ticket:\n{ticket}\nOutput: category, priority, suggested_response"
    )
    chain = LLMChain(llm=llm, prompt=template)
    
    result = chain.invoke({"ticket": ticket_text})
    

    After (Fine-tuned local model):

    from langchain_openai import ChatOpenAI
    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate
    
    # One line change
    llm = ChatOpenAI(
        model="support-classifier-v3",  # Your fine-tuned classifier
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    )
    
    # Chain definition is UNCHANGED
    template = PromptTemplate.from_template(
        "Classify this support ticket:\n{ticket}\nOutput: category, priority, suggested_response"
    )
    chain = LLMChain(llm=llm, prompt=template)
    
    result = chain.invoke({"ticket": ticket_text})
    

    Processing 10,000 support tickets (assuming roughly 1K input tokens per ticket):

    • Before: 10,000 × $0.005 = $50 in API costs
    • After: roughly $50/month for the VPS that runs the model, amortized across everything it processes

    Pattern 2: RAG Pipeline With Fine-Tuned Reader

    For Retrieval-Augmented Generation (RAG), you often want the retrieval model (embeddings) and the reader model (answer generation) to both be calibrated to your domain.

    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    from langchain.chains import RetrievalQA
    from langchain_chroma import Chroma
    
    # Local embeddings via Ollama (nomic-embed-text works well)
    embeddings = OpenAIEmbeddings(
        model="nomic-embed-text",
        base_url="http://localhost:11434/v1",
        api_key="ollama",
        check_embedding_ctx_length=False  # send raw text, not OpenAI token IDs, to the Ollama endpoint
    )
    
    # Fine-tuned reader model
    reader_llm = ChatOpenAI(
        model="your-domain-reader-model",
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    )
    
    # Build vector store from your documents
    vectorstore = Chroma.from_documents(
        documents=your_docs,
        embedding=embeddings,
        persist_directory="./chroma_db"
    )
    
    # RAG chain — zero cloud API calls at inference time
    qa_chain = RetrievalQA.from_chain_type(
        llm=reader_llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
    )
    
    answer = qa_chain.invoke({"query": "What is our return policy for sale items?"})
    

    Both embeddings and generation happen locally. The retrieval model understands your domain terminology. The reader model is fine-tuned on your domain Q&A. Zero API calls.

    Pattern 3: LangGraph Agent With Local Tool Executor

    LangGraph (LangChain's agent framework) works with any LangChain-compatible LLM:

    from typing import TypedDict
    
    from langgraph.graph import StateGraph, END
    from langchain_openai import ChatOpenAI
    
    # Shared state passed between graph nodes
    class AgentState(TypedDict):
        current_task: str
        result: str
    
    # Agent uses local model for orchestration
    orchestrator = ChatOpenAI(
        model="your-orchestrator-model",  # Or use Claude/GPT-4 for orchestration
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    )
    
    # Tool executor uses specialized fine-tuned model
    domain_executor = ChatOpenAI(
        model="your-domain-model",
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    )
    
    def run_domain_task(state: AgentState):
        task = state["current_task"]
        result = domain_executor.invoke(task)
        return {"result": result.content}
    
    # Build graph
    graph = StateGraph(AgentState)
    graph.add_node("domain_executor", run_domain_task)
    graph.set_entry_point("domain_executor")
    graph.add_edge("domain_executor", END)
    # ... add orchestration logic (routing nodes that call the orchestrator)
    
    app = graph.compile()
    

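    Once the remaining orchestration edges are wired in, invoking the compiled graph is a plain dict-in, dict-out call. A hypothetical run (the task text is illustrative) looks like this:

    final_state = app.invoke({"current_task": "Draft a listing summary for 42 Elm St"})
    print(final_state["result"])
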
    Performance Tuning for LangChain + Ollama

    Batch processing: For bulk jobs (say, classifying 5,000 tickets), use LangChain's async batch method, abatch, which runs calls concurrently:

    import asyncio
    
    async def classify_tickets(tickets):
        # Keep up to 10 Ollama calls in flight at a time
        return await chain.abatch(
            [{"ticket": t} for t in tickets],
            config={"max_concurrency": 10}
        )
    
    results = asyncio.run(classify_tickets(tickets))
    

    Caching: Enable LangChain's LLM cache so repeated prompts skip the model entirely; SQLiteCache stores results keyed on the exact prompt:

    from langchain.globals import set_llm_cache
    from langchain_community.cache import SQLiteCache
    
    set_llm_cache(SQLiteCache(database_path=".langchain.db"))
    # Identical prompts return cached results — no Ollama call needed
    

    Context length: 7B models typically support 4K-8K context. For long document processing, use LangChain's text splitters to chunk before passing to the model.
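
    As a rough sketch (chunk sizes are illustrative, and your_docs stands in for your own loaded documents), RecursiveCharacterTextSplitter can pre-chunk text so each call stays inside the model's window:

    from langchain_text_splitters import RecursiveCharacterTextSplitter
    
    # chunk_size is measured in characters; at roughly 4 characters per token,
    # 8,000 characters keeps a chunk comfortably inside a 4K-token window.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=8000,
        chunk_overlap=400
    )
    
    chunks = splitter.split_documents(your_docs)
    results = llm.batch([c.page_content for c in chunks])  # local model from the examples above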

    When to Use Fine-Tuned Local vs Cloud in LangChain Pipelines

    Task                     | Local Fine-Tuned    | Cloud (GPT-4)
    Domain classification    | ✓ Better + cheaper  | Overkill
    Domain generation        | ✓ Better + cheaper  | Overkill
    Complex reasoning chains | Consider            | ✓ Better
    Current events / web     | Not applicable      | ✓ Required
    High-volume batch        | ✓ Much cheaper      | Expensive
    One-off / low volume     | Either              | Either

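    One way to act on this table is a small router that keeps routine domain work on the local model and escalates open-ended reasoning to a cloud model. This is a sketch with placeholder model and task names:

    from langchain_openai import ChatOpenAI
    
    local_llm = ChatOpenAI(
        model="your-domain-model",            # placeholder fine-tuned model
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    )
    cloud_llm = ChatOpenAI(model="gpt-4o")    # needs OPENAI_API_KEY set
    
    LOCAL_TASKS = {"classification", "domain_generation", "batch"}
    
    def pick_llm(task_type: str) -> ChatOpenAI:
        # Route routine, high-volume domain work locally; escalate the rest
        return local_llm if task_type in LOCAL_TASKS else cloud_llm
    
    answer = pick_llm("classification").invoke("Classify this support ticket: ...")
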
    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
