
    LangChain + Fine-Tuned Local Model: Build Pipelines Without API Costs

    LangChain works with any OpenAI-compatible API — including Ollama. Replace the API calls in your LangChain pipelines with a fine-tuned local model. Same chain structure, zero per-token costs.

    Ertas Team

    LangChain is the most widely used framework for building AI pipelines: document processing, RAG, agents, chains. Most LangChain tutorials point at OpenAI's API. Your production bills reflect this.

    LangChain supports local models via Ollama — and Ollama's OpenAI-compatible interface means you can swap the AI backend in a LangChain pipeline with minimal code changes. Combine this with a fine-tuned model, and you get pipelines that are faster for domain tasks, cheaper at scale, and private by default.

    LangChain + Ollama Integration Options

    LangChain has two integration paths for Ollama:

    Option 1: ChatOllama (LangChain native)

    from langchain_ollama import ChatOllama
    
    llm = ChatOllama(
        model="your-fine-tuned-model",
        base_url="http://localhost:11434",
        temperature=0.3
    )
    
    # Use exactly like ChatOpenAI
    response = llm.invoke("Generate a listing for this property: ...")
    print(response.content)
    

    Option 2: ChatOpenAI with Ollama base URL (preferred for drop-in replacement)

    from langchain_openai import ChatOpenAI
    
    llm = ChatOpenAI(
        model="your-fine-tuned-model",
        base_url="http://localhost:11434/v1",
        api_key="ollama",  # Required field, not validated
        temperature=0.3
    )
    
    # Identical interface to cloud ChatOpenAI
    

    Option 2 is the cleanest migration: if you already have LangChain code using ChatOpenAI, the only changes are model, base_url, and a placeholder api_key. You can even drive the switch from environment variables, as the sketch below shows.
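
    For illustration, here is a minimal sketch of that swap driven by environment variables. LLM_MODEL, LLM_BASE_URL, and LLM_API_KEY are hypothetical variable names for this example, not LangChain settings:

    import os
    from langchain_openai import ChatOpenAI
    
    # Point LLM_BASE_URL at Ollama for local inference; leave it unset
    # (and supply a real OpenAI key) to stay on the cloud endpoint.
    llm = ChatOpenAI(
        model=os.getenv("LLM_MODEL", "gpt-4o"),
        base_url=os.getenv("LLM_BASE_URL"),          # e.g. http://localhost:11434/v1
        api_key=os.getenv("LLM_API_KEY", "ollama"),
        temperature=0.3
    )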

    Common Pipeline Patterns

    Pattern 1: Document Processing Chain

    Before (GPT-4 API):

    from langchain_openai import ChatOpenAI
    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate
    
    llm = ChatOpenAI(model="gpt-4o")  # $0.005/1K input tokens
    
    template = PromptTemplate.from_template(
        "Classify this support ticket:\n{ticket}\nOutput: category, priority, suggested_response"
    )
    chain = LLMChain(llm=llm, prompt=template)
    
    result = chain.invoke({"ticket": ticket_text})
    

    After (Fine-tuned local model):

    from langchain_openai import ChatOpenAI
    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate
    
    # One line change
    llm = ChatOpenAI(
        model="support-classifier-v3",  # Your fine-tuned classifier
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    )
    
    # Chain definition is UNCHANGED
    template = PromptTemplate.from_template(
        "Classify this support ticket:\n{ticket}\nOutput: category, priority, suggested_response"
    )
    chain = LLMChain(llm=llm, prompt=template)
    
    result = chain.invoke({"ticket": ticket_text})
    

    Processing 10,000 support tickets (assuming roughly 1K input tokens per ticket):

    • Before: 10,000 × $0.005 = $50 in API costs
    • After: roughly $50/month for the VPS that runs the model, amortized across everything it processes

    Pattern 2: RAG Pipeline With Fine-Tuned Reader

    For Retrieval-Augmented Generation (RAG), you often want the retrieval model (embeddings) and the reader model (answer generation) to both be calibrated to your domain.

    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    from langchain.chains import RetrievalQA
    from langchain_chroma import Chroma
    
    # Local embeddings via Ollama (nomic-embed-text works well)
    embeddings = OpenAIEmbeddings(
        model="nomic-embed-text",
        base_url="http://localhost:11434/v1",
        api_key="ollama",
        check_embedding_ctx_length=False  # send raw text, not OpenAI token IDs, to the Ollama endpoint
    )
    
    # Fine-tuned reader model
    reader_llm = ChatOpenAI(
        model="your-domain-reader-model",
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    )
    
    # Build vector store from your documents
    vectorstore = Chroma.from_documents(
        documents=your_docs,
        embedding=embeddings,
        persist_directory="./chroma_db"
    )
    
    # RAG chain — zero cloud API calls at inference time
    qa_chain = RetrievalQA.from_chain_type(
        llm=reader_llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
    )
    
    answer = qa_chain.invoke({"query": "What is our return policy for sale items?"})
    

    Both embeddings and generation happen locally. The retrieval model understands your domain terminology. The reader model is fine-tuned on your domain Q&A. Zero API calls.

    Pattern 3: LangGraph Agent With Local Tool Executor

    LangGraph (LangChain's agent framework) works with any LangChain-compatible LLM:

    from typing import TypedDict
    
    from langgraph.graph import StateGraph, END
    from langchain_openai import ChatOpenAI
    
    # Shared state passed between graph nodes
    class AgentState(TypedDict):
        current_task: str
        result: str
    
    # Agent uses local model for orchestration
    orchestrator = ChatOpenAI(
        model="your-orchestrator-model",  # Or use Claude/GPT-4 for orchestration
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    )
    
    # Tool executor uses specialized fine-tuned model
    domain_executor = ChatOpenAI(
        model="your-domain-model",
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    )
    
    def run_domain_task(state: AgentState):
        task = state["current_task"]
        result = domain_executor.invoke(task)
        return {"result": result.content}
    
    # Build graph
    graph = StateGraph(AgentState)
    graph.add_node("domain_executor", run_domain_task)
    graph.set_entry_point("domain_executor")
    graph.add_edge("domain_executor", END)
    # ... add orchestration logic (routing nodes that call the orchestrator)
    
    app = graph.compile()
    

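    Once the remaining orchestration edges are wired in, invoking the compiled graph is a plain dict-in, dict-out call. A hypothetical run (the task text is illustrative) looks like this:

    final_state = app.invoke({"current_task": "Draft a listing summary for 42 Elm St"})
    print(final_state["result"])
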
    Performance Tuning for LangChain + Ollama

    Batch processing: For bulk jobs (say, classifying 5,000 tickets), use LangChain's async batch method, abatch, which runs calls concurrently:

    import asyncio
    
    async def classify_tickets(tickets):
        # Keep up to 10 Ollama calls in flight at a time
        return await chain.abatch(
            [{"ticket": t} for t in tickets],
            config={"max_concurrency": 10}
        )
    
    results = asyncio.run(classify_tickets(tickets))
    

    Caching: Enable LangChain's LLM cache so repeated prompts skip the model entirely; SQLiteCache stores results keyed on the exact prompt:

    from langchain.globals import set_llm_cache
    from langchain_community.cache import SQLiteCache
    
    set_llm_cache(SQLiteCache(database_path=".langchain.db"))
    # Identical prompts return cached results — no Ollama call needed
    

    Context length: 7B models typically support 4K-8K context. For long document processing, use LangChain's text splitters to chunk before passing to the model.
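
    As a rough sketch (chunk sizes are illustrative, and your_docs stands in for your own loaded documents), RecursiveCharacterTextSplitter can pre-chunk text so each call stays inside the model's window:

    from langchain_text_splitters import RecursiveCharacterTextSplitter
    
    # chunk_size is measured in characters; at roughly 4 characters per token,
    # 8,000 characters keeps a chunk comfortably inside a 4K-token window.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=8000,
        chunk_overlap=400
    )
    
    chunks = splitter.split_documents(your_docs)
    results = llm.batch([c.page_content for c in chunks])  # local model from the examples above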

    When to Use Fine-Tuned Local vs Cloud in LangChain Pipelines

    Task                     | Local Fine-Tuned    | Cloud (GPT-4)
    Domain classification    | ✓ Better + cheaper  | Overkill
    Domain generation        | ✓ Better + cheaper  | Overkill
    Complex reasoning chains | Consider            | ✓ Better
    Current events / web     | Not applicable      | ✓ Required
    High-volume batch        | ✓ Much cheaper      | Expensive
    One-off / low volume     | Either              | Either

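    One way to act on this table is a small router that keeps routine domain work on the local model and escalates open-ended reasoning to a cloud model. This is a sketch with placeholder model and task names:

    from langchain_openai import ChatOpenAI
    
    local_llm = ChatOpenAI(
        model="your-domain-model",            # placeholder fine-tuned model
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    )
    cloud_llm = ChatOpenAI(model="gpt-4o")    # needs OPENAI_API_KEY set
    
    LOCAL_TASKS = {"classification", "domain_generation", "batch"}
    
    def pick_llm(task_type: str) -> ChatOpenAI:
        # Route routine, high-volume domain work locally; escalate the rest
        return local_llm if task_type in LOCAL_TASKS else cloud_llm
    
    answer = pick_llm("classification").invoke("Classify this support ticket: ...")
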
    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
