    Multi-Step AI Agents on Local Models: Architecture and Patterns

    Multi-step reasoning is where small models historically fail. But with the right architecture — specialist chains, scratchpad fine-tuning, and hybrid routing — you can run reliable multi-step agents on 7B-14B models locally. Here are three proven patterns with benchmarks.

    Ertas Team

    Single-step AI agents are a solved problem. User says something, model picks a tool, tool returns a result. A fine-tuned 7B model handles this reliably — we have covered this in depth in our tool-calling fine-tuning guide.

    Multi-step agents are different. The model needs to plan a sequence of actions, track state between steps, handle errors mid-chain, and decide when the task is complete. This is where small models historically fall apart: they lose context, repeat steps, hallucinate tool calls, or loop infinitely.

    But "historically" is doing a lot of work in that sentence. With the right architecture patterns, multi-step agents run reliably on local 7B-14B models today. This guide covers three proven patterns, real benchmarks, and the engineering details that make them work.

    The Core Challenge: Why Multi-Step Is Hard for Small Models

    A single-step agent needs one skill: map user intent to tool call. A multi-step agent needs at least four:

    1. Planning — decompose a goal into ordered steps
    2. Execution — call the right tool at each step with the right parameters
    3. State tracking — carry results from step N into step N+1
    4. Error recovery — detect failures and decide between retry, skip, or escalate

    General-purpose 7B models struggle because they were trained on broad tasks, not sequential reasoning chains. The attention window gets noisy after 2-3 steps. The model "forgets" what it already did or what it planned to do.

    The fix is not a bigger model. The fix is better architecture.

    Pattern 1: Chain of Fine-Tuned Specialists

    Instead of one model doing everything, split the agent into specialized roles:

    User Request → Router → Planner → Executor → Validator → Response
    

    Each model is small but laser-focused:

    • Router (classification): Determines request type. Is this an order? A support query? A data lookup? This is a simple classification task — even a 1B model handles it.
    • Planner (decomposition): Takes a classified request and outputs an ordered step list. Fine-tuned on your specific workflows, it does not need to plan arbitrary tasks — only YOUR tasks.
    • Executor (tool calling): Takes one step at a time and produces the tool call. This is single-step tool calling — the solved problem.
    • Validator (verification): Checks the accumulated results. Are all required fields present? Does the output match the expected schema? Flags issues for retry or escalation.
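
As a sketch, the chain can be wired up as four plain functions, one per specialist. The function bodies, category names, and step lists below are illustrative stubs, not a specific framework; in production each function would wrap a call to its own fine-tuned model.

```python
# Sketch of the Router -> Planner -> Executor -> Validator chain.
# Each "specialist" is a separate fine-tuned model behind its own
# prompt; here they are stubbed as plain functions for illustration.

def route(request: str) -> str:
    # Router: classify the raw request into a known category.
    return "order" if "order" in request.lower() else "support"

def plan(category: str) -> list[str]:
    # Planner: a fine-tuned planner would emit this ordered step list.
    plans = {"order": ["validate_order", "check_inventory",
                       "calculate_shipping", "confirm_order"]}
    return plans.get(category, ["escalate"])

def execute(step: str, state: dict) -> dict:
    # Executor: single-step tool calling, one tool call per step.
    state[step] = {"status": "ok"}  # stub tool result
    return state

def validate(state: dict, steps: list[str]) -> bool:
    # Validator: every planned step must have produced a result.
    return all(state.get(s, {}).get("status") == "ok" for s in steps)

def run_chain(request: str) -> bool:
    steps = plan(route(request))
    state: dict = {}
    for step in steps:
        state = execute(step, state)
    return validate(state, steps)
```

The dependency structure is explicit here, which is what makes the parallelism mitigation mentioned below possible: any steps without data dependencies can be dispatched concurrently.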

    Why This Works on Small Models

    Each specialist is fine-tuned on a narrow task. The router sees thousands of examples of "this input type maps to this category." The planner sees thousands of examples of "this category decomposes into these steps." No single model needs to hold the full reasoning chain.

    The Tradeoff

    Latency. Four model calls in sequence means 4x the inference time of a single call. On a local GPU, each 7B model inference takes 200-500ms (depending on context length and quantization). A 4-step chain is 0.8-2.0 seconds total.

    For most business workflows, this is fine. For real-time chat, it is noticeable. The mitigation: run specialists in parallel where the dependency graph allows it.

    Benchmark: Specialist Chain vs Single Model

    Metric                  | Single 7B (generic) | Single 7B (fine-tuned) | 4-Specialist Chain (7B each)
    3-step task accuracy    | 41%                 | 67%                    | 89%
    5-step task accuracy    | 18%                 | 43%                    | 81%
    Avg latency (local GPU) | 450ms               | 450ms                  | 1,400ms
    Error recovery rate     | 12%                 | 34%                    | 78%

    The single fine-tuned model improves over generic, but the chain architecture is a step change. Each specialist does one thing well.

    Pattern 2: Single Model with Scratchpad (CoT Fine-Tuning)

    Not every team wants to maintain four separate models. Pattern 2 uses a single model, but fine-tunes it to think step-by-step before acting.

    The key insight: you train the model to produce a "scratchpad" — an explicit reasoning trace — before outputting the tool call. This is chain-of-thought (CoT) fine-tuning, but specifically for YOUR planning tasks.

    Training Data Format

    {
      "input": "Process order #4521 for customer acme-corp",
      "output": "<scratchpad>\nStep 1: Validate order #4521 exists and is in pending status\nStep 2: Check inventory for all line items\nStep 3: Calculate shipping based on customer location\nStep 4: Confirm order and send notification\nCurrent step: 1\n</scratchpad>\n<tool_call>{\"function\": \"validate_order\", \"params\": {\"order_id\": 4521}}</tool_call>"
    }
    

    After each tool result, the model receives the updated context and produces its next scratchpad:

    {
      "input": "[Previous context + tool result: order valid, status: pending]\nContinue processing.",
      "output": "<scratchpad>\nStep 1: Validate order — DONE (valid, pending)\nStep 2: Check inventory for all line items\nCurrent step: 2\n</scratchpad>\n<tool_call>{\"function\": \"check_inventory\", \"params\": {\"order_id\": 4521}}</tool_call>"
    }
    

    Why This Works

    The scratchpad externalizes the model's "working memory." Instead of implicitly tracking what it did and what comes next (which small models lose track of), the plan is literally in the output. The model reads its own previous reasoning at each step.

    Fine-tuning on 500-1,000 examples of your specific workflows teaches the model YOUR step patterns. It does not need to plan arbitrary tasks — just the 5-15 workflow types your agent handles.
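
At inference time you need to split the model's output back into the reasoning trace and the tool call. A minimal parser for the format shown in the training examples might look like this; `parse_agent_output` is a hypothetical helper, and the tag names follow this post's training data format.

```python
import json
import re

def parse_agent_output(text: str) -> tuple[str, dict]:
    """Split a scratchpad-format completion into (reasoning, tool_call)."""
    pad = re.search(r"<scratchpad>(.*?)</scratchpad>", text, re.DOTALL)
    call = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    if not pad or not call:
        raise ValueError("malformed agent output")
    return pad.group(1).strip(), json.loads(call.group(1))

# Example completion in the training-data format from this post:
out = ('<scratchpad>\nStep 1: Validate order\nCurrent step: 1\n'
       '</scratchpad>\n<tool_call>{"function": "validate_order", '
       '"params": {"order_id": 4521}}</tool_call>')
scratchpad, tool_call = parse_agent_output(out)
```

Raising on malformed output matters: a missing or truncated `<tool_call>` block is a signal to retry the generation, not something to guess around.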

    When to Use Pattern 2 Over Pattern 1

    • You have fewer than 5 distinct workflow types
    • Latency matters (one model call vs four)
    • You want simpler deployment (one model to serve, not four)
    • Your steps are mostly sequential (limited parallelism opportunity)

    Benchmark: Scratchpad vs Vanilla Fine-Tuning

    Metric           | 7B vanilla fine-tuned | 7B CoT fine-tuned | 14B CoT fine-tuned
    3-step accuracy  | 67%                   | 82%               | 91%
    5-step accuracy  | 43%                   | 71%               | 85%
    Plan correctness | 54%                   | 88%               | 93%
    Tokens per step  | 45                    | 120               | 130

    The CoT model uses ~2.7x more tokens per step (because of the scratchpad), but the accuracy jump is worth it. If you are running locally, tokens are free.

    Pattern 3: Hybrid — Local Model + Frontier Fallback

    Sometimes the honest answer is: your local model handles 85% of requests, but the remaining 15% genuinely need more capability. Pattern 3 makes this explicit.

    User Request → Local Router (classify complexity)
      ├── Simple/Known → Local Agent (Pattern 1 or 2)
      └── Complex/Ambiguous → Frontier API (GPT-4, Claude) for planning only
           └── Plan → Local Executor (runs the tools locally)
    

    How It Works

    The local router classifies each request on two axes:

    1. Complexity: How many steps? Are the steps from known templates or novel?
    2. Ambiguity: Is the user's intent clear? Are there multiple valid interpretations?

    Known-template, clear-intent requests go fully local. Novel or ambiguous requests send ONLY the planning step to a frontier API. The frontier model produces a step plan. The local model executes each step (tool calling).

    Cost Math

    Assume 1,000 agent requests per day:

    • Full frontier API: 1,000 x $0.03 avg = $30/day = $900/month
    • Full local: 1,000 x $0.00 per-query = $0/day (hardware costs only)
    • Hybrid (85/15 split): 150 x $0.01 (planning only, shorter prompt) = $1.50/day = $45/month

    The hybrid approach costs 95% less than full API and handles edge cases the local model would get wrong.

    Cost Caps and Guardrails

    Set a hard monthly budget for the frontier fallback. When you hit it, all requests route to the local model with a confidence flag. The application can decide: serve the local result with a disclaimer, or queue the request for later processing.

    State Management Between Steps

    Regardless of which pattern you use, multi-step agents need state management. Here is what works:

    Conversation Context Object

    {
      "session_id": "abc-123",
      "original_request": "Process order #4521 for acme-corp",
      "current_step": 2,
      "total_steps": 4,
      "completed_steps": [
        {"step": 1, "tool": "validate_order", "result": {"valid": true, "status": "pending"}, "latency_ms": 340}
      ],
      "pending_steps": ["check_inventory", "calculate_shipping", "confirm_order"],
      "error_count": 0,
      "max_errors": 3
    }
    

    Pass this object (or a condensed version) as context to each model call. The model does not need to remember — the state is explicit.
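
One way to build the condensed version is to keep full results only for the most recent steps and shrink older ones to a one-line summary. This is a sketch; the field names follow the context object above, and the `keep_last` cutoff is an arbitrary choice.

```python
def condense(ctx: dict, keep_last: int = 2) -> dict:
    """Shrink older step results so the context fits a small window."""
    done = ctx["completed_steps"]
    # Older steps: drop the raw result, keep only step id and tool name.
    older = [{"step": s["step"], "tool": s["tool"], "status": "done"}
             for s in done[:-keep_last]]
    # Recent steps keep their full result for the next model call.
    return {**ctx, "completed_steps": older + done[-keep_last:]}
```

Because the return value is a shallow copy, the full-fidelity context object survives for logging and escalation while each model call sees only the condensed view.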

    Context Window Budget

    A 7B model with a 4,096-token context window fills up fast when you include tool schemas + previous results + scratchpad. Budget it:

    • System prompt + tool schemas: ~800 tokens (fixed)
    • Current step context: ~200 tokens
    • Previous step results (condensed): ~150 tokens per step
    • Scratchpad output: ~120 tokens

    For a 5-step task: 800 + 200 + (4 x 150) + 120 = 1,720 tokens. Comfortable within 4K. For longer chains, summarize older step results instead of including raw output.
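
The budget arithmetic above can be captured in a quick helper. The per-item token costs are this post's estimates, not measured values; plug in your own once you have real traces.

```python
# Token budget estimates from the breakdown above (assumed, not measured).
FIXED = 800       # system prompt + tool schemas
STEP_CTX = 200    # current step context
PER_PRIOR = 150   # each condensed previous step result
SCRATCHPAD = 120  # scratchpad output

def tokens_needed(step_number: int) -> int:
    """Estimated context tokens when generating the Nth step."""
    prior_steps = step_number - 1
    return FIXED + STEP_CTX + prior_steps * PER_PRIOR + SCRATCHPAD

tokens_needed(5)  # 1720, comfortable within a 4,096-token window
```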

    Error Handling: When Step 3 Fails

    Multi-step agents will fail mid-chain. The question is what happens next.

    Retry Logic

    Step fails → Check error type
      ├── Transient (timeout, rate limit) → Retry same step (max 2 retries)
      ├── Input error (bad params) → Regenerate tool call with error context
      └── Logic error (impossible state) → Escalate
    

    For the "regenerate" path, append the error message to the model's context:

    Previous tool call failed: "inventory_check returned 404: order_id 4521 not found in warehouse_east"
    Adjust your approach and try again.
    

    Fine-tuned models handle this well when your training data includes error-recovery examples. Include 10-15% error scenarios in your training set.
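
The decision tree above can be reduced to a small function. The error-type strings are illustrative; map your actual tool errors onto these classes however fits your stack.

```python
# Retry decision for a failed step, following the tree above.
TRANSIENT = {"timeout", "rate_limit"}  # illustrative error classes
MAX_RETRIES = 2

def next_action(error_type: str, retries_so_far: int) -> str:
    if error_type in TRANSIENT and retries_so_far < MAX_RETRIES:
        return "retry"        # same step, same params
    if error_type == "bad_params":
        return "regenerate"   # re-prompt with the error message appended
    return "escalate"         # logic error, or retries exhausted
```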

    Escalation

    When retry fails or the error is a logic error, escalate:

    1. To frontier model: Re-run the failed step through GPT-4/Claude with full context
    2. To human: Flag the request in a review queue with the full step history
    3. Graceful degradation: Complete the steps that succeeded, return a partial result with a clear explanation of what failed

    Guardrails: Preventing Infinite Loops and Runaway Costs

    Max Steps

    Hard limit. If your longest known workflow is 6 steps, set max_steps to 8. If the model reaches step 9, kill the chain and escalate.

    MAX_STEPS = 8
    MAX_ERRORS = 3
    STEP_TIMEOUT_MS = 5000
    
    for step in range(MAX_STEPS):
        result = execute_step(context, timeout=STEP_TIMEOUT_MS)
        if result.status == "complete":
            return result
        if result.status == "error":
            context.error_count += 1
            if context.error_count >= MAX_ERRORS:
                return escalate(context)
    
    return escalate(context, reason="max_steps_exceeded")
    

    Cost Caps (Hybrid Pattern)

    For Pattern 3, track frontier API spend per hour, per day, and per month. Set alerts at 50%, 80%, and 100% of budget. At 100%, stop routing to frontier and log every request that would have been routed — this becomes your fine-tuning dataset for next iteration.
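
A minimal budget guard might look like the following. The in-memory counter and cent-based accounting are sketch-level choices; in production, persist spend and add the per-hour and per-day windows too.

```python
class BudgetGuard:
    """Hard monthly cap on frontier API spend, tracked in whole cents."""

    def __init__(self, monthly_cap_cents: int):
        self.cap = monthly_cap_cents
        self.spent = 0
        self.deferred: list[str] = []  # requests to mine for fine-tuning

    def allow_frontier(self, request_id: str, est_cost_cents: int) -> bool:
        if self.spent + est_cost_cents <= self.cap:
            self.spent += est_cost_cents
            return True
        # Over budget: route local and log the request. These logs are
        # exactly the cases your next fine-tuning round should cover.
        self.deferred.append(request_id)
        return False
```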

    Human-in-the-Loop

    For high-stakes workflows (financial transactions, healthcare decisions, legal actions), add a confirmation step before the final action:

    Steps 1-3 complete → Present summary to human → Human approves → Step 4 executes
    

    This works naturally in both async (queue-based) and sync (chat-based) interfaces.
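
As a sketch, the confirmation gate is just a pause before the final action. The `approve` callback here stands in for whatever review UI or queue your application uses.

```python
def run_with_approval(prior_steps, final_step, approve) -> str:
    """Run steps 1..N-1, pause for human approval, then run step N."""
    summary = [step() for step in prior_steps]  # results shown to the human
    if not approve(summary):
        return "held_for_review"  # queue for async human review
    final_step()                  # only executes after approval
    return "complete"
```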

    When Multi-Step Works on 7B vs When You Need 14B+

    Scenario                                     | 7B Sufficient?         | Recommendation
    2-3 step linear workflows                    | Yes                    | Pattern 2 (scratchpad)
    4-6 step linear workflows                    | With specialist chain  | Pattern 1
    Branching logic (if/else steps)              | Marginal               | 14B or Pattern 3
    Dynamic step count (model decides when done) | No                     | 14B with CoT or Pattern 3
    Error recovery with replanning               | Marginal               | 14B or specialist chain with dedicated replanner
    10+ distinct workflow types                  | Yes, with good routing | Pattern 1 with fine-tuned router

    The dividing line: if the model needs to DECIDE what to do (dynamic planning), go 14B+. If the model follows a KNOWN pattern with minor variations, 7B is fine with the right architecture.

    Example: 4-Step Order Processing Agent

    Here is a concrete implementation using Pattern 2 (scratchpad) on a fine-tuned Qwen 2.5 7B:

    Step 1 — Validate Order

    Input: "Process order #4521 for acme-corp"
    Scratchpad: "4 steps needed: validate → inventory → shipping → confirm. Starting with validation."
    Tool call: validate_order(order_id=4521, customer="acme-corp")
    Result: {valid: true, items: [{sku: "W-100", qty: 50}, {sku: "W-200", qty: 25}]}
    

    Step 2 — Check Inventory

    Scratchpad: "Order valid. 2 line items. Checking inventory for both SKUs."
    Tool call: check_inventory(skus=["W-100", "W-200"], quantities=[50, 25])
    Result: {W-100: {available: 200, allocated: 50}, W-200: {available: 30, allocated: 25}}
    

    Step 3 — Calculate Shipping

    Scratchpad: "All items available. Customer acme-corp is in warehouse zone east. Calculating shipping."
    Tool call: calculate_shipping(order_id=4521, zone="east", weight_kg=37.5)
    Result: {method: "ground", cost: 42.50, eta_days: 3}
    

    Step 4 — Confirm Order

    Scratchpad: "Shipping calculated. All checks passed. Confirming order."
    Tool call: confirm_order(order_id=4521, shipping_method="ground", shipping_cost=42.50)
    Result: {confirmed: true, confirmation_id: "CNF-88421"}
    

    Total latency on an RTX 4090: 1.8 seconds for all four steps. Total API cost: $0.00.

    This same flow on GPT-4 would cost ~$0.04 per order. At 500 orders/day, that is $20/day or $600/month — for something a local model does in under 2 seconds.

    Getting Started

    The fastest path to multi-step agents on local models:

    1. Map your workflows. List every multi-step process your agent handles. Most teams have 3-8 distinct types.
    2. Choose your pattern. Under 5 workflow types with linear steps? Pattern 2. More variety or need parallelism? Pattern 1. Need a safety net? Pattern 3.
    3. Build training data. 200-500 examples per workflow type, including 10-15% error scenarios.
    4. Fine-tune with LoRA. A 7B model fine-tunes in under an hour with Ertas on a single GPU.
    5. Test the chain end-to-end before deploying. Run your full workflow suite and measure step-by-step accuracy, not just final-result accuracy.

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
