
Multi-Step AI Agents on Local Models: Architecture and Patterns
Multi-step reasoning is where small models historically fail. But with the right architecture — specialist chains, scratchpad fine-tuning, and hybrid routing — you can run reliable multi-step agents on 7B-14B models locally. Here are three proven patterns with benchmarks.
Single-step AI agents are a solved problem. User says something, model picks a tool, tool returns a result. A fine-tuned 7B model handles this reliably — we have covered this in depth in our tool-calling fine-tuning guide.
Multi-step agents are different. The model needs to plan a sequence of actions, track state between steps, handle errors mid-chain, and decide when the task is complete. This is where small models historically fall apart: they lose context, repeat steps, hallucinate tool calls, or loop infinitely.
But "historically" is doing a lot of work in that sentence. With the right architecture patterns, multi-step agents run reliably on local 7B-14B models today. This guide covers three proven patterns, real benchmarks, and the engineering details that make them work.
The Core Challenge: Why Multi-Step Is Hard for Small Models
A single-step agent needs one skill: map user intent to tool call. A multi-step agent needs at least four:
- Planning — decompose a goal into ordered steps
- Execution — call the right tool at each step with the right parameters
- State tracking — carry results from step N into step N+1
- Error recovery — detect failures and decide between retry, skip, or escalate
General-purpose 7B models struggle because they were trained on broad tasks, not sequential reasoning chains. The attention window gets noisy after 2-3 steps. The model "forgets" what it already did or what it planned to do.
The fix is not a bigger model. The fix is better architecture.
Pattern 1: Chain of Fine-Tuned Specialists
Instead of one model doing everything, split the agent into specialized roles:
User Request → Router → Planner → Executor → Validator → Response
Each model is small but laser-focused:
- Router (classification): Determines request type. Is this an order? A support query? A data lookup? This is a simple classification task — even a 1B model handles it.
- Planner (decomposition): Takes a classified request and outputs an ordered step list. Fine-tuned on your specific workflows, it does not need to plan arbitrary tasks — only YOUR tasks.
- Executor (tool calling): Takes one step at a time and produces the tool call. This is single-step tool calling — the solved problem.
- Validator (verification): Checks the accumulated results. Are all required fields present? Does the output match the expected schema? Flags issues for retry or escalation.
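Putting the roles together, the orchestration is a thin loop around four model calls. Here is a minimal sketch in Python; call_model, parse_steps, and execute_tool are hypothetical helpers standing in for your local inference client and tool runtime:

def run_specialist_chain(user_request: str) -> dict:
    # 1. Router: classify the request type (simple classification, a 1B model is enough)
    request_type = call_model("router", f"Classify this request: {user_request}")

    # 2. Planner: decompose the classified request into an ordered step list
    plan = call_model("planner", f"Type: {request_type}\nRequest: {user_request}\nList the steps.")

    results = []
    for step in parse_steps(plan):
        # 3. Executor: single-step tool calling, one step at a time
        tool_call = call_model("executor", f"Step: {step}\nPrior results: {results}")
        results.append(execute_tool(tool_call))

    # 4. Validator: check the accumulated results against the expected schema
    verdict = call_model("validator", f"Request: {user_request}\nResults: {results}\nValidate.")
    return {"results": results, "verdict": verdict}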
Why This Works on Small Models
Each specialist is fine-tuned on a narrow task. The router sees thousands of examples of "this input type maps to this category." The planner sees thousands of examples of "this category decomposes into these steps." No single model needs to hold the full reasoning chain.
The Tradeoff
Latency. Four model calls in sequence means 4x the inference time of a single call. On a local GPU, each 7B model inference takes 200-500ms (depending on context length and quantization). A 4-step chain is 0.8-2.0 seconds total.
For most business workflows, this is fine. For real-time chat, it is noticeable. The mitigation: run specialists in parallel where the dependency graph allows it.
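Where steps are independent, the parallel mitigation is straightforward. A sketch assuming an async wrapper (call_executor_async, hypothetical) around your local executor:

import asyncio

async def run_independent_steps(context):
    # The inventory check and the shipping quote do not depend on each other,
    # so run them concurrently instead of back to back
    inventory, shipping = await asyncio.gather(
        call_executor_async("check_inventory", context),
        call_executor_async("calculate_shipping", context),
    )
    return inventory, shipping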
Benchmark: Specialist Chain vs Single Model
| Metric | Single 7B (generic) | Single 7B (fine-tuned) | 4-Specialist Chain (7B each) |
|---|---|---|---|
| 3-step task accuracy | 41% | 67% | 89% |
| 5-step task accuracy | 18% | 43% | 81% |
| Avg latency (local GPU) | 450ms | 450ms | 1,400ms |
| Error recovery rate | 12% | 34% | 78% |
The single fine-tuned model improves over generic, but the chain architecture is a step change. Each specialist does one thing well.
Pattern 2: Single Model with Scratchpad (CoT Fine-Tuning)
Not every team wants to maintain four separate models. Pattern 2 uses a single model, but fine-tunes it to think step-by-step before acting.
The key insight: you train the model to produce a "scratchpad" — an explicit reasoning trace — before outputting the tool call. This is chain-of-thought (CoT) fine-tuning, but specifically for YOUR planning tasks.
Training Data Format
{
"input": "Process order #4521 for customer acme-corp",
"output": "<scratchpad>\nStep 1: Validate order #4521 exists and is in pending status\nStep 2: Check inventory for all line items\nStep 3: Calculate shipping based on customer location\nStep 4: Confirm order and send notification\nCurrent step: 1\n</scratchpad>\n<tool_call>{\"function\": \"validate_order\", \"params\": {\"order_id\": 4521}}</tool_call>"
}
After each tool result, the model receives the updated context and produces its next scratchpad:
{
"input": "[Previous context + tool result: order valid, status: pending]\nContinue processing.",
"output": "<scratchpad>\nStep 1: Validate order — DONE (valid, pending)\nStep 2: Check inventory for all line items\nCurrent step: 2\n</scratchpad>\n<tool_call>{\"function\": \"check_inventory\", \"params\": {\"order_id\": 4521}}</tool_call>"
}
Why This Works
The scratchpad externalizes the model's "working memory." Instead of implicitly tracking what it did and what comes next (which small models lose track of), the plan is literally in the output. The model reads its own previous reasoning at each step.
Fine-tuning on 500-1,000 examples of your specific workflows teaches the model YOUR step patterns. It does not need to plan arbitrary tasks — just the 5-15 workflow types your agent handles.
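At inference time, your application splits the scratchpad from the tool call before executing anything. A minimal parser sketch, assuming the model reproduces the tags from the training format exactly:

import json
import re

def parse_agent_output(text: str):
    # Pull the reasoning trace and the tool call out of the tagged output
    scratchpad = re.search(r"<scratchpad>(.*?)</scratchpad>", text, re.DOTALL)
    tool_call = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    if not tool_call:
        # Treat a missing tool call as an input error: regenerate with error context
        raise ValueError("no <tool_call> block in model output")
    plan = scratchpad.group(1).strip() if scratchpad else ""
    call = json.loads(tool_call.group(1))
    return plan, call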
When to Use Pattern 2 Over Pattern 1
- You have fewer than 5 distinct workflow types
- Latency matters (one model call vs four)
- You want simpler deployment (one model to serve, not four)
- Your steps are mostly sequential (limited parallelism opportunity)
Benchmark: Scratchpad vs Vanilla Fine-Tuning
| Metric | 7B vanilla fine-tuned | 7B CoT fine-tuned | 14B CoT fine-tuned |
|---|---|---|---|
| 3-step accuracy | 67% | 82% | 91% |
| 5-step accuracy | 43% | 71% | 85% |
| Plan correctness | 54% | 88% | 93% |
| Tokens per step | 45 | 120 | 130 |
The CoT model uses ~2.7x more tokens per step (because of the scratchpad), but the accuracy jump is worth it. If you are running locally, the extra tokens cost nothing but inference time.
Pattern 3: Hybrid — Local Model + Frontier Fallback
Sometimes the honest answer is: your local model handles 85% of requests, but the remaining 15% genuinely need more capability. Pattern 3 makes this explicit.
User Request → Local Router (classify complexity)
├── Simple/Known → Local Agent (Pattern 1 or 2)
└── Complex/Ambiguous → Frontier API (GPT-4, Claude) for planning only
└── Plan → Local Executor (runs the tools locally)
How It Works
The local router classifies each request on two axes:
- Complexity: How many steps? Are the steps from known templates or novel?
- Ambiguity: Is the user's intent clear? Are there multiple valid interpretations?
Known-template, clear-intent requests go fully local. Novel or ambiguous requests send ONLY the planning step to a frontier API. The frontier model produces a step plan. The local model executes each step (tool calling).
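In code, the routing decision is a small function in front of the agent. A sketch with illustrative thresholds; local_router, local_planner, frontier_api, and local_executor are placeholders for your own clients:

def route_request(request: str):
    # Score the request on both axes with the local router
    scores = local_router.classify(request)  # e.g. {"complexity": 0.3, "ambiguity": 0.2}

    if scores["complexity"] < 0.5 and scores["ambiguity"] < 0.5:
        # Known template, clear intent: plan locally
        plan = local_planner.plan(request)
    else:
        # Novel or ambiguous: only the planning step goes to the frontier API
        plan = frontier_api.plan(request)

    # Execution always stays local, whoever produced the plan
    return local_executor.run(plan)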
Cost Math
Assume 1,000 agent requests per day:
- Full frontier API: 1,000 x $0.03 avg = $30/day = $900/month
- Full local: 1,000 x $0.00 per-query = $0/day (hardware costs only)
- Hybrid (85/15 split): 150 x $0.01 (planning only, shorter prompt) = $1.50/day = $45/month
The hybrid approach costs 95% less than full API and handles edge cases the local model would get wrong.
Cost Caps and Guardrails
Set a hard monthly budget for the frontier fallback. When you hit it, all requests route to the local model with a confidence flag. The application can decide: serve the local result with a disclaimer, or queue the request for later processing.
State Management Between Steps
Regardless of which pattern you use, multi-step agents need state management. Here is what works:
Conversation Context Object
{
  "session_id": "abc-123",
  "original_request": "Process order #4521 for acme-corp",
  "current_step": 2,
  "total_steps": 4,
  "completed_steps": [
    {"step": 1, "tool": "validate_order", "result": {"valid": true, "status": "pending"}, "latency_ms": 340}
  ],
  "pending_steps": ["check_inventory", "calculate_shipping", "confirm_order"],
  "error_count": 0,
  "max_errors": 3
}
Pass this object (or a condensed version) as context to each model call. The model does not need to remember — the state is explicit.
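One way to keep the state explicit in code is a small dataclass that the orchestrator owns and serializes into every prompt. A sketch mirroring the fields above:

from dataclasses import dataclass, field

@dataclass
class AgentContext:
    session_id: str
    original_request: str
    current_step: int = 1
    total_steps: int = 0
    completed_steps: list = field(default_factory=list)
    pending_steps: list = field(default_factory=list)
    error_count: int = 0
    max_errors: int = 3

    def to_prompt(self) -> str:
        # Condensed view for the model: tools and results only, no latencies
        done = "\n".join(f"{s['step']}. {s['tool']} -> {s['result']}" for s in self.completed_steps)
        return (f"Request: {self.original_request}\n"
                f"Completed steps:\n{done}\n"
                f"Pending: {', '.join(self.pending_steps)}")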
Context Window Budget
A 7B model with a 4,096-token context window fills up fast when you include tool schemas + previous results + scratchpad. Budget it:
- System prompt + tool schemas: ~800 tokens (fixed)
- Current step context: ~200 tokens
- Previous step results (condensed): ~150 tokens per step
- Scratchpad output: ~120 tokens
For a 5-step task: 800 + 200 + (4 x 150) + 120 = 1,720 tokens. That sits comfortably within a 4K window. For longer chains, summarize older step results instead of including raw output.
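A rough budget check before each call keeps the chain inside the window. A sketch, assuming count_tokens wraps your tokenizer (hypothetical helper) and using the per-item estimates above:

WINDOW = 4096
FIXED_OVERHEAD = 800 + 200 + 120  # system prompt + current step context + scratchpad output

def build_step_context(context):
    results = [str(s["result"]) for s in context.completed_steps]
    # Drop the oldest result when the budget is blown; recent steps matter most.
    # In practice you would summarize it first rather than drop it outright.
    while results and FIXED_OVERHEAD + sum(count_tokens(r) for r in results) > WINDOW:
        results.pop(0)
    return results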
Error Handling: When Step 3 Fails
Multi-step agents will fail mid-chain. The question is what happens next.
Retry Logic
Step fails → Check error type
├── Transient (timeout, rate limit) → Retry same step (max 2 retries)
├── Input error (bad params) → Regenerate tool call with error context
└── Logic error (impossible state) → Escalate
For the "regenerate" path, append the error message to the model's context:
Previous tool call failed: "inventory_check returned 404: order_id 4521 not found in warehouse_east"
Adjust your approach and try again.
Fine-tuned models handle this well when your training data includes error-recovery examples. Include 10-15% error scenarios in your training set.
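The decision tree above maps onto a small error handler. A sketch; retry_step, regenerate_step, and escalate are hypothetical hooks into your orchestrator, and the error object is assumed to carry kind, retries, and message fields:

TRANSIENT_ERRORS = ("timeout", "rate_limit")

def handle_step_error(context, error):
    # Transient failures: retry the same step, at most twice
    if error.kind in TRANSIENT_ERRORS and error.retries < 2:
        return retry_step(context)
    # Input errors (bad params): feed the error back and let the model regenerate the call
    if error.kind == "input_error":
        context.append_note(
            f'Previous tool call failed: "{error.message}"\n'
            "Adjust your approach and try again."
        )
        return regenerate_step(context)
    # Logic errors (impossible state): do not retry, escalate
    return escalate(context)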
Escalation
When retry fails or the error is a logic error, escalate:
- To frontier model: Re-run the failed step through GPT-4/Claude with full context
- To human: Flag the request in a review queue with the full step history
- Graceful degradation: Complete the steps that succeeded, return a partial result with a clear explanation of what failed
Guardrails: Preventing Infinite Loops and Runaway Costs
Max Steps
Hard limit. If your longest known workflow is 6 steps, set max_steps to 8. If the model reaches step 9, kill the chain and escalate.
MAX_STEPS = 8
MAX_ERRORS = 3
STEP_TIMEOUT_MS = 5000

def run_chain(context):
    for step in range(MAX_STEPS):
        result = execute_step(context, timeout=STEP_TIMEOUT_MS)
        if result.status == "complete":
            return result
        # Stop once the error budget is spent (see max_errors in the context object)
        if result.status == "error" and context.error_count >= MAX_ERRORS:
            return escalate(context)
    # The chain ran past its longest known workflow: kill it and escalate
    return escalate(context, reason="max_steps_exceeded")
Cost Caps (Hybrid Pattern)
For Pattern 3, track frontier API spend per hour, per day, and per month. Set alerts at 50%, 80%, and 100% of budget. At 100%, stop routing to frontier and log every request that would have been routed — this becomes your fine-tuning dataset for next iteration.
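A sketch of the budget guard; spend_tracker and alert are placeholders for your own accounting and alerting:

MONTHLY_BUDGET_USD = 50.0

def frontier_allowed(request) -> bool:
    spent = spend_tracker.month_to_date()
    if spent >= MONTHLY_BUDGET_USD:
        # Over budget: route local and log the request.
        # These logs become next iteration's fine-tuning dataset.
        spend_tracker.log_skipped(request)
        return False
    if spent >= 0.8 * MONTHLY_BUDGET_USD:
        alert("Frontier spend at 80% of monthly budget")
    return True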
Human-in-the-Loop
For high-stakes workflows (financial transactions, healthcare decisions, legal actions), add a confirmation step before the final action:
Steps 1-3 complete → Present summary to human → Human approves → Step 4 executes
This works naturally in both async (queue-based) and sync (chat-based) interfaces.
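In a queue-based setup, the confirmation is just a pause in the chain. A sketch; approval_queue is a placeholder for your review queue:

def confirm_before_final_step(context, summary):
    # Pause the chain: the high-stakes step runs only after explicit sign-off
    approval = approval_queue.request(context.session_id, summary)
    if approval.approved:
        return execute_step(context)
    return escalate(context, reason="human_rejected")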
When Multi-Step Works on 7B vs When You Need 14B+
| Scenario | 7B Sufficient? | Recommendation |
|---|---|---|
| 2-3 step linear workflows | Yes | Pattern 2 (scratchpad) |
| 4-6 step linear workflows | With specialist chain | Pattern 1 |
| Branching logic (if/else steps) | Marginal | 14B or Pattern 3 |
| Dynamic step count (model decides when done) | No | 14B with CoT or Pattern 3 |
| Error recovery with replanning | Marginal | 14B or specialist chain with dedicated replanner |
| 10+ distinct workflow types | Yes, with good routing | Pattern 1 with fine-tuned router |
The dividing line: if the model needs to DECIDE what to do (dynamic planning), go 14B+. If the model follows a KNOWN pattern with minor variations, 7B is fine with the right architecture.
Example: 4-Step Order Processing Agent
Here is a concrete implementation using Pattern 2 (scratchpad) on a fine-tuned Qwen 2.5 7B:
Step 1 — Validate Order
Input: "Process order #4521 for acme-corp"
Scratchpad: "4 steps needed: validate → inventory → shipping → confirm. Starting with validation."
Tool call: validate_order(order_id=4521, customer="acme-corp")
Result: {valid: true, items: [{sku: "W-100", qty: 50}, {sku: "W-200", qty: 25}]}
Step 2 — Check Inventory
Scratchpad: "Order valid. 2 line items. Checking inventory for both SKUs."
Tool call: check_inventory(skus=["W-100", "W-200"], quantities=[50, 25])
Result: {W-100: {available: 200, allocated: 50}, W-200: {available: 30, allocated: 25}}
Step 3 — Calculate Shipping
Scratchpad: "All items available. Customer acme-corp is in warehouse zone east. Calculating shipping."
Tool call: calculate_shipping(order_id=4521, zone="east", weight_kg=37.5)
Result: {method: "ground", cost: 42.50, eta_days: 3}
Step 4 — Confirm Order
Scratchpad: "Shipping calculated. All checks passed. Confirming order."
Tool call: confirm_order(order_id=4521, shipping_method="ground", shipping_cost=42.50)
Result: {confirmed: true, confirmation_id: "CNF-88421"}
Total latency on an RTX 4090: 1.8 seconds for all four steps. Total API cost: $0.00.
This same flow on GPT-4 would cost ~$0.04 per order. At 500 orders/day, that is $20/day or $600/month — for something a local model does in under 2 seconds.
Getting Started
The fastest path to multi-step agents on local models:
- Map your workflows. List every multi-step process your agent handles. Most teams have 3-8 distinct types.
- Choose your pattern. Under 5 workflow types with linear steps? Pattern 2. More variety or need parallelism? Pattern 1. Need a safety net? Pattern 3.
- Build training data. 200-500 examples per workflow type, including 10-15% error scenarios.
- Fine-tune with LoRA. A 7B model fine-tunes in under an hour with Ertas on a single GPU.
- Test the chain end-to-end before deploying. Run your full workflow suite and measure step-by-step accuracy, not just final-result accuracy.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Fine-Tuning for Tool Calling: How to Build Reliable AI Agents — the foundation for single-step tool calling that Pattern 1's executor uses
- When a 7B Model Beats an API Call — benchmarks showing where small models match frontier performance
- Fine-Tuning Small Models vs GPT-4: When the Little Model Wins — the data behind task-specific fine-tuning outperforming general-purpose models