
    Fine-Tuned Models for LangGraph Agents: Replace GPT-4 in Your Agent Stack

    LangGraph agents default to GPT-4, but most agent tasks — routing, tool selection, response generation — work better with fine-tuned models trained on your specific workflows.

    Ertas Team

    LangGraph is the dominant framework for building stateful AI agents in 2026. It gives you explicit control over agent state, supports cycles and conditional branching, and handles the complexity that simpler frameworks can't. If you're building production agents in Python, you're probably using LangGraph or something heavily inspired by it.

    The default pattern in every LangGraph tutorial: ChatOpenAI(model="gpt-4o") as the reasoning engine. Every node that needs to think, route, summarize, or generate calls GPT-4. And every call costs money.

    A typical LangGraph agent with 5 nodes — router, researcher, analyzer, responder, reviewer — makes 5-15 LLM calls per task execution. At GPT-4o pricing, that's $0.05-$0.30 per execution. Run 1,000 tasks per day and you're spending $50-$300/day, or $1,500-$9,000/month. Just on inference.

    Most of those LLM calls don't need GPT-4. They need a model that knows your specific tools, your specific routing logic, and your specific output format. That's what fine-tuning gives you.

    Breaking Down What Each Agent Node Actually Does

    LangGraph agents are graphs of nodes, where each node performs a specific function. Let's categorize what each node type actually requires from an LLM:

    Router Nodes (Classification)

    The router looks at the incoming request and decides which path the agent should take. "Is this a billing question, a technical support issue, or a sales inquiry?" This is a classification task with a fixed set of categories.

    GPT-4 classification accuracy on well-defined categories: ~95-98%. Fine-tuned 8B model accuracy on the same categories: ~96-99%. The fine-tuned model is often more accurate because it's trained exclusively on your categories and never hallucinates categories that don't exist.
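
    As an illustration, a router node backed by a fine-tuned local model can be a few lines. This is a minimal sketch assuming a hypothetical fine-tuned-router model served through Ollama; the category set and fallback are illustrative:

    from langchain_ollama import ChatOllama

    CATEGORIES = {"billing", "technical", "sales", "general"}

    # Hypothetical fine-tuned router served by Ollama; temperature=0 for deterministic labels
    router_llm = ChatOllama(model="fine-tuned-router", temperature=0)

    def classify(request: str) -> str:
        result = router_llm.invoke([
            {"role": "system", "content": "Classify the request into one of: billing, technical, sales, general. Reply with the category only."},
            {"role": "user", "content": request},
        ])
        label = result.content.strip().lower()
        # A fine-tuned router rarely strays from its categories, but guard anyway
        return label if label in CATEGORIES else "general"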

    Tool Selection Nodes (Structured Output)

    The agent decides which tool to call and generates the parameters. "Call search_knowledge_base with {"query": "refund policy", "category": "billing"}." This is structured output generation — the model needs to output valid JSON matching a specific schema.

    Fine-tuned models excel here. They learn your exact tool names, your exact parameter schemas, and your exact calling conventions. No more hallucinated function names or wrong parameter types. See our detailed breakdown in Fine-Tuning for Tool Calling.
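
    Whichever model generates the call, it's worth validating the output against a schema before executing anything. A hedged sketch using pydantic; the search_kb schema matches the training examples later in this post and is otherwise an assumption:

    import json
    from pydantic import BaseModel, ValidationError

    # Assumed parameter schema for the search_kb tool used in the examples below
    class SearchKBParams(BaseModel):
        query: str
        category: str

    def parse_tool_call(raw: str) -> SearchKBParams | None:
        """Reject hallucinated tool names or malformed parameters before execution."""
        try:
            call = json.loads(raw)
            if call.get("tool") != "search_kb":
                return None
            return SearchKBParams(**call["params"])
        except (json.JSONDecodeError, ValidationError, KeyError, TypeError):
            return None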

    Response Generation Nodes (Domain-Specific Text)

    The agent generates the final response to the user based on gathered context. This requires domain knowledge, appropriate tone, and accurate information synthesis.

    For domain-specific responses (legal, medical, financial, customer support), a fine-tuned model trained on your corpus of approved responses produces more consistent output than GPT-4 with a system prompt. It doesn't drift, doesn't hallucinate disclaimer language you didn't approve, and doesn't randomly change tone between responses.

    State Summarization Nodes (Compression)

    In long-running agents, you periodically summarize the conversation state to fit within context limits. "Here are the last 20 messages. Summarize the key facts." This is a compression task.

    A fine-tuned summarization model trained on your specific state format produces summaries that preserve the fields your downstream nodes actually need, rather than a generic summary that might drop critical details.
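
    A summarization node along these lines might look like the sketch below; the fine-tuned-summarizer model and the fields it is told to preserve are hypothetical stand-ins for whatever your downstream nodes actually consume:

    from langchain_ollama import ChatOllama

    # Hypothetical fine-tuned summarization adapter served by Ollama
    summarizer_llm = ChatOllama(model="fine-tuned-summarizer", temperature=0)

    def summarize_state(state: dict) -> dict:
        # Compress the last 20 messages into a single summary message
        recent = "\n".join(state["messages"][-20:])
        summary = summarizer_llm.invoke([
            {"role": "system", "content": "Summarize the key facts. Always preserve ticket_id, account_tier, and open action items."},
            {"role": "user", "content": recent},
        ]).content
        return {**state, "messages": [summary]}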

    Complex Reasoning Nodes (Multi-Step Logic)

    "Given these three data points, determine whether the customer qualifies for a premium plan, and if so, calculate the discount." Multi-step reasoning with novel combinations of facts.

    This is where frontier models still earn their keep. Complex, multi-step reasoning on novel problems is the one area where GPT-4 and Claude consistently outperform fine-tuned small models. You can't fine-tune your way to general intelligence — you can only fine-tune for patterns the model has seen.

    The Hybrid Architecture

    The practical architecture isn't "replace everything with fine-tuned models" — it's "replace everything that can be fine-tuned and keep the cloud API for what can't."

    User Input
        │
        ▼
    [Router Node] ← Fine-tuned 8B (classification)
        │
        ├── Path A: Simple Query
        │   └── [Response Node] ← Fine-tuned 8B (domain response)
        │
        ├── Path B: Tool-Required Query
        │   ├── [Tool Selection] ← Fine-tuned 8B (structured output)
        │   ├── [Tool Execution] ← No LLM needed
        │   └── [Response Node] ← Fine-tuned 8B (domain response)
        │
        └── Path C: Complex Reasoning
            └── [Analysis Node] ← GPT-4o (multi-step reasoning)
                └── [Response Node] ← Fine-tuned 8B (format output)
    

    In this architecture, 80-90% of requests follow Path A or B — handled entirely by the fine-tuned model. Only the genuinely complex 10-20% route to GPT-4. Your API bill drops by 80-90%.

    Training Data from Agent Traces

    LangGraph makes training data collection straightforward because every execution produces a complete trace — every node's input and output, every decision, every tool call. These traces are your training dataset.

    Collecting Traces

    Enable LangSmith tracing or build custom logging at each node:

    from langchain_core.tracers import LangChainTracer

    # Every execution is automatically logged to the named project
    tracer = LangChainTracer(project_name="agent-training-data")
    agent.invoke(task_input, config={"callbacks": [tracer]})  # task_input: your agent's usual input payload
    

    After running your agent on GPT-4 for 2-4 weeks, you'll have thousands of traced executions. Each trace contains the exact input-output pairs for every node — ready-made training data.
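
    To pull those traces back out, the LangSmith SDK can list runs per project. A minimal sketch, assuming a LANGSMITH_API_KEY in the environment and the project name from above:

    from langsmith import Client

    client = Client()  # reads LANGSMITH_API_KEY from the environment

    # Fetch completed, error-free runs from the tracing project
    runs = client.list_runs(project_name="agent-training-data", error=False)
    for run in runs:
        print(run.inputs, run.outputs)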

    Filtering and Formatting

    Not all traces are good training data. Filter for:

    • Successful executions — the agent achieved the goal without errors
    • Clean tool calls — no retries due to malformed parameters
    • Positive user feedback — if you collect feedback, weight those traces higher

    Format each node's input-output pair as a training example:

    {
      "messages": [
        {"role": "system", "content": "You are a routing agent. Classify the user's request into one of: billing, technical, sales, general."},
        {"role": "user", "content": "I was charged twice for my subscription last month"},
        {"role": "assistant", "content": "billing"}
      ]
    }
    

    For tool-calling nodes:

    {
      "messages": [
        {"role": "system", "content": "Select the appropriate tool and parameters for the user's request. Available tools: search_kb, create_ticket, get_account_info."},
        {"role": "user", "content": "Look up the refund policy for annual plans"},
        {"role": "assistant", "content": "{\"tool\": \"search_kb\", \"params\": {\"query\": \"refund policy annual plans\", \"category\": \"billing\"}}"}
      ]
    }
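
    Most fine-tuning toolchains expect one JSON object per line (JSONL). A short export sketch, assuming you've already filtered traces into (user_input, assistant_output) pairs per node:

    import json

    def write_jsonl(examples: list[tuple[str, str]], system_prompt: str, path: str) -> None:
        """Write (user_input, assistant_output) pairs as chat-format JSONL."""
        with open(path, "w") as f:
            for user_input, assistant_output in examples:
                record = {"messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_input},
                    {"role": "assistant", "content": assistant_output},
                ]}
                f.write(json.dumps(record) + "\n")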
    

    Training Separate LoRA Adapters

    For maximum efficiency, train separate LoRA adapters for each node type:

    • Router adapter: Trained on routing decisions (small dataset, fast training)
    • Tool-calling adapter: Trained on tool selection and parameter generation
    • Response adapter: Trained on domain-specific response generation
    • Summarization adapter: Trained on state compression

    Each adapter is 50-200MB. You can hot-swap them per node, or merge the most commonly used one into the base model and load the others on demand.

    500-1,000 examples per node type is a reasonable starting point. Router nodes need fewer examples (200-300 is often sufficient for 5-10 categories). Response generation benefits from more (800-1,500 for diverse output quality).
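
    For reference, a minimal PEFT configuration for one such adapter might look like the following; the base model, rank, and target modules are typical starting points rather than tuned values:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Assumed Llama-style 8B base; swap in whichever base you deploy
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

    # One adapter per node type; r=16 / alpha=32 is a common starting point
    router_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, router_config)
    model.print_trainable_parameters()  # typically well under 1% of base weights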

    Drop-In Replacement with Ollama

    The reason this works with minimal code changes: LangGraph talks to models through LangChain's chat model interface, and Ollama ships a first-class LangChain integration (it also exposes an OpenAI-compatible API). Swapping GPT-4 for a local model is a one-line change.

    Before:

    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    

    After:

    from langchain_ollama import ChatOllama
    llm = ChatOllama(model="fine-tuned-router", temperature=0)
    

    For the hybrid architecture, assign different models to different nodes:

    router_llm = ChatOllama(model="fine-tuned-router")
    tool_llm = ChatOllama(model="fine-tuned-tools")
    response_llm = ChatOllama(model="fine-tuned-responder")
    reasoning_llm = ChatOpenAI(model="gpt-4o")  # Cloud fallback
    

    Each node in your LangGraph graph gets the model that best fits its task. The graph orchestration is identical — only the LLM backend changes.
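
    Putting the hybrid architecture from earlier into code, wiring per-node models into the graph might look like the sketch below, using the models defined above. The state schema, route labels, and node bodies are simplified placeholders, not a production graph:

    from typing import TypedDict
    from langgraph.graph import StateGraph, END

    class AgentState(TypedDict):
        request: str
        route: str
        response: str

    def router_node(state: AgentState) -> dict:
        # Fine-tuned 8B classifies the request into a path label
        return {"route": router_llm.invoke(state["request"]).content.strip().lower()}

    def respond_node(state: AgentState) -> dict:
        # Fine-tuned 8B generates the domain response
        return {"response": response_llm.invoke(state["request"]).content}

    def analyze_node(state: AgentState) -> dict:
        # GPT-4o handles only the complex-reasoning path
        return {"response": reasoning_llm.invoke(state["request"]).content}

    graph = StateGraph(AgentState)
    graph.add_node("router", router_node)
    graph.add_node("respond", respond_node)
    graph.add_node("analyze", analyze_node)
    graph.set_entry_point("router")
    graph.add_conditional_edges(
        "router",
        lambda s: s["route"],
        {"simple": "respond", "complex": "analyze"},  # illustrative route labels
    )
    graph.add_edge("respond", END)
    graph.add_edge("analyze", END)
    app = graph.compile()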

    Cost Impact

    Let's run the numbers on a customer support agent handling 1,000 tasks per day, averaging 8 LLM calls per task:

    All GPT-4o

    • 8,000 calls/day × ~2,000 tokens average ≈ 16M tokens/day
    • At GPT-4o pricing ($2.50/M input, $10/M output), that's roughly $50-$100/day depending on the input/output split → $1,500-$3,000/month

    Hybrid (80% Local, 20% GPT-4o)

    • 6,400 calls/day local: $0
    • 1,600 calls/day GPT-4o: $10-$20/day
    • Cloud GPU for local model: $150-$300/month
    • $450-$900/month (70% reduction)

    All Local (with cloud fallback for truly novel cases)

    • 7,800 calls/day local: $0
    • 200 calls/day GPT-4o fallback: ~$2.50/day
    • Cloud GPU: $150-$300/month
    • $225-$375/month (85-90% reduction)

    The transition from 100% cloud to hybrid takes about 3 weeks: 1 week collecting traces, 1 week training adapters, 1 week testing and deploying. The transition from hybrid to mostly-local is incremental — you expand the fine-tuned model's coverage as you collect more training data from the remaining cloud calls.

    Evaluation: Making Sure the Fine-Tuned Agent Works

    Before deploying fine-tuned nodes in production, run a proper evaluation:

    1. Hold out 20% of your traces as test data. Don't train on them.
    2. Run the fine-tuned model on test inputs and compare outputs to GPT-4 outputs. For routing nodes, measure accuracy. For tool-calling nodes, measure schema compliance and parameter accuracy. For response nodes, measure semantic similarity and factual accuracy.
    3. Run the full agent graph with fine-tuned nodes on historical tasks. Compare end-to-end success rates.
    4. Shadow deploy. Run both the GPT-4 agent and the fine-tuned agent on live traffic, but only serve GPT-4 responses. Log fine-tuned responses for comparison. When the fine-tuned agent matches GPT-4 on 95%+ of tasks, switch over.

    Target metrics:

    • Router accuracy: 97%+ (most teams hit 98-99%)
    • Tool call schema compliance: 99%+
    • Response quality (human rating): within 0.2 points of GPT-4 on a 5-point scale
    • End-to-end task success: within 2-3% of the GPT-4 agent
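
    As a sketch, the router-accuracy check from step 2 can be as simple as the following; the test_traces structure is hypothetical:

    def router_accuracy(model, test_traces: list[dict]) -> float:
        """Compare fine-tuned router output against routes from held-out traces."""
        correct = 0
        for trace in test_traces:
            predicted = model.invoke(trace["input"]).content.strip().lower()
            correct += predicted == trace["expected_route"]
        return correct / len(test_traces)

    # Gate deployment on the target, e.g. router_accuracy(...) >= 0.97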

    When to Stay on GPT-4

    Some LangGraph agents genuinely need frontier models throughout:

    • Agents that handle truly open-ended tasks with unpredictable input types
    • Agents operating across dozens of domains where you can't build domain-specific training data
    • Prototype agents in early development where you're still discovering the task structure
    • Agents where accuracy on the tail distribution is critical (medical diagnosis, legal analysis) and you need the strongest model for every case

    For the majority of production LangGraph agents — customer support, data processing, content generation, workflow automation — the hybrid approach with fine-tuned local models at the core delivers better consistency at a fraction of the cost.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
