
    Pydantic AI vs LangGraph: Which Agent Framework for Fine-Tuned Models

    Pydantic AI and LangGraph are the two production agent frameworks of 2026. Choose between them on type safety vs graph orchestration, then layer fine-tuning on top. Here's how to decide.

    Ertas Team

    Updated 2026-05-10 — Reflects the early-May Pydantic AI v1.90.x / v1.93.x releases (explicit tool_choice, dedicated OutputToolCallEvent / OutputToolResultEvent streaming events, OpenAI Conversations API state). The decision matrix below is unchanged; the new primitives mostly tighten the type-safety and observability story Pydantic AI was already winning on.

    By 2026 the Python agent-framework landscape has consolidated. The hand-rolled "while loop calling OpenAI" approach is gone from any team that ships agents to production. The Frankenstein early-LangChain stack with its forty composable abstractions is gone too. Two frameworks remain at the center of serious work: LangGraph, the graph-based state-machine framework now running production agents at Uber, JPMorgan, BlackRock, and Replit; and Pydantic AI, the type-safe FastAPI-style framework whose 1.0 release in April 2026 made it the obvious default for new projects.

    Both are model-agnostic. Both work cleanly with fine-tuned open-weight models served via Ollama, vLLM, or Ertas Cloud. Both treat tool calling as a first-class primitive. Choosing between them is not a question of which is "better" — it's a design-philosophy choice about how your agents should be structured. This piece walks through the tradeoffs honestly, then shows how either framework benefits dramatically from layering a fine-tuned model underneath.

    Pydantic AI: types-first agents

    Pydantic AI is built by the team behind Pydantic and Logfire. The design ethos is borrowed directly from FastAPI: types are contracts, validation is non-negotiable, and the framework should disappear into the background once you've declared your shapes. An Agent is parameterized by a result type. Tools are decorated functions whose signatures Pydantic AI parses to build tool schemas. Outputs are validated automatically against the result type. If the model emits something that doesn't conform, you get a ValidationError, not a silent bug.

    The runtime is lightweight. There's no graph engine, no checkpointing layer, no execution scheduler. An agent runs as ordinary Python: call agent.run or agent.run_sync, and the framework handles the LLM loop, tool dispatch, and validation. The whole library is MIT-licensed and the dependency tree is small enough that you'll barely notice it in your pyproject.toml.
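
    As a quick sketch of how lightweight that is in practice, an agent can sit directly inside a FastAPI handler. The Summary model, the /summarize endpoint, and the "openai:gpt-4o" model string below are illustrative placeholders; any configured model, including a fine-tuned one behind an OpenAI-compatible endpoint as shown later, slots in the same way.

    from fastapi import FastAPI
    from pydantic import BaseModel
    from pydantic_ai import Agent

    class Summary(BaseModel):
        title: str
        bullet_points: list[str]

    # Placeholder model string; swap in your own model object or identifier.
    agent = Agent("openai:gpt-4o", output_type=Summary)

    app = FastAPI()

    @app.post("/summarize")
    async def summarize(text: str) -> Summary:
        # agent.run is the async entry point; agent.run_sync wraps it for scripts.
        result = await agent.run(f"Summarize this ticket:\n{text}")
        return result.output  # already a validated Summary instance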

    This makes Pydantic AI a natural fit for the most common production case: an agent that takes input, calls a few tools, returns structured output. Extraction agents, classifiers, routers, lightweight assistants. If your workflow is mostly linear and you care about output schemas, Pydantic AI gets you to a tested production agent faster than anything else available. The 1.0 release in April 2026 stabilized the API and made it safe to build commercial products on top.

    LangGraph: stateful, durable, graph-orchestrated

    LangGraph took the opposite design path. An agent is a directed graph of nodes connected by edges. Each node is a function (an LLM call, a tool execution, a conditional check). Edges describe how state flows between nodes, including conditional edges that branch based on intermediate state. The graph engine runs the whole thing, persisting state at each step.
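
    A conditional edge is just a routing function plus a mapping from its return values to downstream nodes. A compressed sketch, with illustrative node names and stand-in node bodies:

    from typing import TypedDict

    from langgraph.graph import StateGraph, END

    class State(TypedDict):
        priority: str
        resolution: str

    def route_on_priority(state: State) -> str:
        # The returned key picks which edge the graph follows next.
        return "escalate" if state["priority"] == "urgent" else "resolve"

    graph = StateGraph(State)
    graph.add_node("classify", lambda s: {"priority": "urgent"})           # stand-in LLM node
    graph.add_node("escalate", lambda s: {"resolution": "paged on-call"})  # stand-in node
    graph.add_node("resolve", lambda s: {"resolution": "auto-resolved"})   # stand-in node
    graph.set_entry_point("classify")
    graph.add_conditional_edges("classify", route_on_priority, {
        "escalate": "escalate",
        "resolve": "resolve",
    })
    graph.add_edge("escalate", END)
    graph.add_edge("resolve", END)

    print(graph.compile().invoke({"priority": ""})["resolution"])  # -> paged on-call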

    Three things fall out of this design that Pydantic AI doesn't try to do.

    Durable checkpoints. Every state transition is persisted. If your agent crashes mid-execution — process killed, server rebooted, network partition — you can resume from the last checkpoint instead of starting over. For agents that run for hours or days, this is the difference between viable and not.

    Parallel branches. Because the graph engine schedules nodes, you can fan out to multiple parallel branches and join them later. A research agent that calls five different APIs in parallel and aggregates their results is one graph definition, not a hand-rolled async coordination layer.

    Human-in-the-loop interruption. A graph can pause at a designated node, surface state to a human reviewer, and resume once a decision comes back. For approval workflows, escalations, and any agent operating in regulated industries, this is essential. LangGraph's interrupt primitive turns "human approval" from an afterthought into a graph node like any other.
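
    Here's a compressed sketch of how the checkpoint and interrupt pieces fit together. The node names, the MemorySaver backend, and the thread id are illustrative; a production setup would swap in a durable checkpointer (Postgres, Redis) so state survives restarts.

    from typing import TypedDict

    from langgraph.checkpoint.memory import MemorySaver
    from langgraph.graph import StateGraph, END

    class State(TypedDict):
        ticket: str
        decision: str

    def propose(state: State) -> dict:
        # Stand-in for the LLM step that drafts a decision.
        return {"decision": "refund approved"}

    def human_review(state: State) -> dict:
        # By the time this node runs, a reviewer has already signed off.
        return {"decision": state["decision"]}

    graph = StateGraph(State)
    graph.add_node("propose", propose)
    graph.add_node("human_review", human_review)
    graph.set_entry_point("propose")
    graph.add_edge("propose", "human_review")
    graph.add_edge("human_review", END)

    app = graph.compile(
        checkpointer=MemorySaver(),          # swap in a durable backend in production
        interrupt_before=["human_review"],   # pause before this node runs
    )

    config = {"configurable": {"thread_id": "ticket-4812"}}

    # Runs up to the interrupt, persisting state at every step.
    app.invoke({"ticket": "Refund request over policy limit"}, config)
    print(app.get_state(config).values)      # surface the pending decision to a reviewer

    # Resume from the checkpoint once the reviewer approves. With a durable
    # checkpointer this works even after a process restart.
    app.invoke(None, config)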

    The cost of these capabilities is complexity. LangGraph requires you to think about your agent as a state machine. The graph definitions are more verbose than Pydantic AI's decorator-based agents. The runtime is heavier — it carries the graph engine, the checkpointing layer, and (in production) typically a Postgres or Redis backend for persistence. For a linear extraction agent this is overkill. For a multi-stage approval workflow it's exactly what you need.

    LangGraph is what JPMorgan, BlackRock, and Uber chose for production agents that touch money, customer support, and compliance-relevant operations. The graph model gives them the audit trails their compliance teams require: every state transition is logged, every tool call is reproducible, every decision can be replayed. Pydantic AI's lightweight runtime can't easily provide that level of traceability.

    Where the two frameworks overlap

    Despite different philosophies, both frameworks land in the same place on several practical points.

    Both speak the OpenAI-compatible API as a first-class transport. Both work with any model server that exposes that interface — Ollama, vLLM, llama.cpp's llama-server, LM Studio, Ertas Cloud, the OpenAI API itself, Anthropic via OpenAI-compatible proxies, or any custom serving stack. This means a fine-tuned model you've shipped in Ertas Studio works identically against either framework without code changes.

    Both have first-class tool-calling support. You declare functions, the framework extracts their schemas, and the LLM is given a structured tool-use format. Both validate tool arguments before execution; both surface tool results back to the LLM in the next turn.
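
    On the LangGraph side that declaration usually goes through langchain's @tool decorator, where the docstring and type hints become the schema the model sees. Here's a sketch of how the lookup_account_tool bound in the example later in this post might be declared (the return value is stubbed for illustration):

    from langchain_core.tools import tool

    @tool
    def lookup_account_tool(account_id: str) -> dict:
        """Look up account history to inform ticket routing."""
        # Stubbed response; in a real agent this would call your CRM.
        return {"account_id": account_id, "open_disputes": 1, "plan": "pro"}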

    Both have observability stories that matter in production. Pydantic AI integrates tightly with Logfire (same team) and emits OpenTelemetry traces. LangGraph integrates with LangSmith for graph-execution tracing and supports OpenTelemetry exporters. Either one will give you per-tool-call latency, token usage, and error traces in production.
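
    On the Pydantic AI side the wiring is typically a couple of lines of setup. A sketch, assuming the instrument_pydantic_ai hook that recent logfire releases expose (the model string is again a placeholder):

    import logfire
    from pydantic_ai import Agent

    logfire.configure()               # one-time setup
    logfire.instrument_pydantic_ai()  # assumed hook in recent logfire releases; traces agent runs and tool calls

    agent = Agent("openai:gpt-4o")    # placeholder model string
    agent.run_sync("ping")            # emits spans for the model call and any tool calls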

    So the choice between them is not about basic capabilities. It's about workflow shape and operational requirements.

    Decision matrix

    Scenario | Winner | Why
    Linear extraction agent | Pydantic AI | Lightweight runtime, schema-validated outputs, no graph overhead
    Multi-step approval workflow with human review | LangGraph | Interrupt primitive turns approval into a graph node
    Type-safe tool inputs and outputs critical | Pydantic AI | Validation is the framework's whole reason for existing
    Long-running agents that pause and resume across hours | LangGraph | Durable checkpoints survive process restarts
    Lightweight runtime for serverless or edge | Pydantic AI | Minimal dependencies, no persistence layer required
    Regulated industry needing audit trails | LangGraph | Every state transition logged, replayable, compliance-ready
    Quick prototype to production for a startup | Pydantic AI | Lower cognitive load, faster iteration
    Parallel multi-API research agent | LangGraph | Graph-native fan-out and join

    Notice the pattern. Pydantic AI wins when the workflow is mostly linear, output structure matters, and you want to move fast. LangGraph wins when the workflow is genuinely a state machine, when durability and audit trails are non-negotiable, and when the team has the engineering bandwidth to design graphs carefully.

    The fine-tuning angle: why both frameworks need it

    Here's the uncomfortable truth about agent frameworks in 2026: they assume the model can produce structured outputs reliably. Pydantic AI assumes it because the validator fires on every output. LangGraph assumes it because each node's output becomes the next node's input. When the model misbehaves, both frameworks fall back to retries — and retries add latency, burn tokens, and erode user trust.

    Against a frontier API like Claude or GPT-5 Pro, that assumption holds well enough. Against a generic open-weight model — Qwen3-8B, Llama 3.1 8B, Mistral Small straight off the Hugging Face shelf — it doesn't. Schema violations are common. Wrong tool names appear. Parameter types drift. The framework's validation layer turns from "guard rail" into "recurring exception source," and your team starts wrapping every agent call in retry decorators that paper over the real problem.

    Fine-tuning fixes this at the source. Train the model on the exact tool schemas your agent uses, on the exact output formats your code expects, on a few hundred representative conversations from your domain. The model becomes a reliable producer of structured outputs. Pydantic AI's validator goes back to being a guard rail. LangGraph's nodes flow into each other without wrapping logic.
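
    Concretely, one training example for a ticket-triage agent might look like the record below: the conversation shows the model the exact tool schema and the exact output shape you want back. The field layout follows the common OpenAI-style chat format; the exact export format depends on your training pipeline.

    # One illustrative training record in chat format.
    example = {
        "messages": [
            {"role": "system", "content": "Route incoming support tickets to the right team and priority."},
            {"role": "user", "content": "My card was charged twice for the same order this morning."},
            {"role": "assistant", "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {"name": "lookup_account", "arguments": '{"account_id": "A-10293"}'},
            }]},
            {"role": "tool", "tool_call_id": "call_1",
             "content": '{"open_disputes": 1, "duplicate_charge_detected": true}'},
            {"role": "assistant",
             "content": '{"team": "billing", "priority": "high", "reasoning": "Duplicate charge on one order"}'},
        ],
    }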

    The economics tilt the same way. A frontier API call for a multi-step agent costs $0.01 to $0.05 per run. At 10,000 daily runs that's $100 to $500 per day, $36,000 to $180,000 per year. A fine-tuned 7B or 8B model served on a single GPU instance costs orders of magnitude less, and a fine-tuned 4B model can run on the user's device for free. For mobile app builders feeling the agentic cost cliff bite between 500 and 5,000 users — the moment when API bills start consuming margin faster than revenue grows — fine-tuning isn't an optimization, it's the only path forward.

    The integration is identical for both frameworks: train your model in Ertas Studio, export to GGUF, serve via Ollama or vLLM, and point your agent's base_url at the OpenAI-compatible endpoint. From the framework's perspective, your fine-tuned model is just another OpenAI-compatible model. From the user's perspective, the agent is dramatically more reliable and the bill is dramatically smaller.

    Same agent, two frameworks

    To make the comparison concrete, here's the same triage agent — a claims-routing classifier that maps support tickets to the right team — implemented in both frameworks against an Ertas-fine-tuned Qwen3-4B served via Ollama.

    Pydantic AI version:

    from typing import Literal

    from pydantic import BaseModel
    from pydantic_ai import Agent, RunContext
    from pydantic_ai.models.openai import OpenAIModel
    from pydantic_ai.providers.openai import OpenAIProvider

    # Any OpenAI-compatible server (here, Ollama) is configured via a provider.
    model = OpenAIModel(
        "ertas-claims-router-4b",
        provider=OpenAIProvider(
            base_url="http://localhost:11434/v1",
            api_key="not-needed",  # Ollama ignores the key, but the client requires one
        ),
    )

    class TriageDecision(BaseModel):
        team: Literal["billing", "technical", "fraud", "general"]
        priority: Literal["low", "normal", "high", "urgent"]
        reasoning: str

    agent = Agent(
        model,
        output_type=TriageDecision,
        system_prompt="Route incoming support tickets to the right team and priority.",
    )

    @agent.tool
    async def lookup_account(ctx: RunContext[None], account_id: str) -> dict:
        """Look up account history to inform routing."""
        # `crm` stands in for your own async CRM client.
        return await crm.get_account(account_id)

    result = agent.run_sync(
        "My card was charged twice for the same order this morning."
    )
    print(result.output)
    # TriageDecision(team='billing', priority='high', reasoning='Duplicate charge')
    

    LangGraph version:

    from typing import Literal, TypedDict

    from langchain_core.messages import HumanMessage
    from langchain_openai import ChatOpenAI
    from langgraph.graph import StateGraph, END

    # lookup_account_tool is a @tool-declared account lookup (sketched earlier);
    # crm, extract_account_id, build_prompt and parse_triage_decision are
    # application helpers defined elsewhere.
    llm = ChatOpenAI(
        model="ertas-claims-router-4b",
        base_url="http://localhost:11434/v1",
        api_key="not-needed",
    ).bind_tools([lookup_account_tool])

    class TriageState(TypedDict):
        ticket: str
        account_data: dict | None
        team: Literal["billing", "technical", "fraud", "general"] | None
        priority: Literal["low", "normal", "high", "urgent"] | None

    def fetch_account(state: TriageState) -> dict:
        if account_id := extract_account_id(state["ticket"]):
            return {"account_data": crm.get_account(account_id)}
        return {"account_data": None}

    def classify(state: TriageState) -> dict:
        msg = llm.invoke([HumanMessage(content=build_prompt(state))])
        parsed = parse_triage_decision(msg.content)
        return {"team": parsed["team"], "priority": parsed["priority"]}

    graph = StateGraph(TriageState)
    graph.add_node("fetch", fetch_account)
    graph.add_node("classify", classify)
    graph.set_entry_point("fetch")
    graph.add_edge("fetch", "classify")
    graph.add_edge("classify", END)

    app = graph.compile()
    result = app.invoke({"ticket": "My card was charged twice this morning."})
    print(result["team"], result["priority"])
    

    Both implementations work. Both use the same fine-tuned model. Both are roughly thirty lines of code. The Pydantic AI version is shorter and gives you typed outputs by construction; the LangGraph version is more verbose but every state transition is checkpointable, the graph can be paused for human review by adding an interrupt node, and the whole thing scales naturally to a workflow that — say — also notifies the team, opens a Jira ticket, and waits for a human to confirm priority before final routing.

    For a pure classifier, Pydantic AI is the obvious choice. For a classifier that's the first node in a longer workflow involving approvals and escalations, LangGraph starts to earn its complexity.

    Recommendation

    For most teams in 2026, the right answer is to start with Pydantic AI. The cognitive load is lower, the runtime is lighter, the API is more pleasant, and the type safety pays for itself the first time it catches a malformed model output before it hits production. Ninety percent of agent use cases are linear or close to linear, and Pydantic AI handles them cleanly.

    Graduate to LangGraph when orchestration becomes the bottleneck — when you find yourself building checkpointing layers by hand, when you need genuine human-in-the-loop interrupts, when compliance demands audit trails of every state transition, when workflows naturally branch and rejoin in ways that hand-rolled control flow can't express cleanly. That's a real moment for some teams. It's not a moment most teams ever reach.

    Layer fine-tuning underneath either framework as soon as you see the second or third API bill. The combination of a fine-tuned 4B-to-8B model and a real agent framework is what makes production agents both reliable and economical in 2026. Without fine-tuning, your framework spends most of its energy paving over model unreliability. With fine-tuning, the framework gets to do the job it was designed for.

    Ertas Studio handles the fine-tuning side. Pick your base model — Qwen3-4B for mobile and edge, Qwen3-8B or Llama 3.1 8B for server-side agents — curate a few hundred examples in Data Craft, train, eval, export to GGUF. The Ertas Deployment CLI takes care of the mobile shipping path when that matters. The OpenAI-compatible serving means whichever framework you chose, the integration is one configuration line.

    Pick your framework for the workflow shape. Pick your model for the domain. Fine-tune so the model speaks your tools natively. That's the production recipe in 2026.

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
