
Multi-Step AI Agents on Local Models: Architecture and Patterns
Multi-step reasoning is where small models historically fail. But with the right architecture — specialist chains, scratchpad fine-tuning, and hybrid routing — you can run reliable multi-step agents on 7B-14B models locally. Here are three proven patterns with benchmarks.
Single-step AI agents are a solved problem. User says something, model picks a tool, tool returns a result. A fine-tuned 7B model handles this reliably — we have covered this in depth in our tool-calling fine-tuning guide.
Multi-step agents are different. The model needs to plan a sequence of actions, track state between steps, handle errors mid-chain, and decide when the task is complete. This is where small models historically fall apart: they lose context, repeat steps, hallucinate tool calls, or loop infinitely.
But "historically" is doing a lot of work in that sentence. With the right architecture patterns, multi-step agents run reliably on local 7B-14B models today. This guide covers three proven patterns, real benchmarks, and the engineering details that make them work.
The Core Challenge: Why Multi-Step Is Hard for Small Models
A single-step agent needs one skill: map user intent to tool call. A multi-step agent needs at least four:
- Planning — decompose a goal into ordered steps
- Execution — call the right tool at each step with the right parameters
- State tracking — carry results from step N into step N+1
- Error recovery — detect failures and decide between retry, skip, or escalate
General-purpose 7B models struggle because they were trained on broad tasks, not sequential reasoning chains. The attention window gets noisy after 2-3 steps. The model "forgets" what it already did or what it planned to do.
The fix is not a bigger model. The fix is better architecture.
Pattern 1: Chain of Fine-Tuned Specialists
Instead of one model doing everything, split the agent into specialized roles:
User Request → Router → Planner → Executor → Validator → Response
Each model is small but laser-focused:
- Router (classification): Determines request type. Is this an order? A support query? A data lookup? This is a simple classification task — even a 1B model handles it.
- Planner (decomposition): Takes a classified request and outputs an ordered step list. Fine-tuned on your specific workflows, it does not need to plan arbitrary tasks — only YOUR tasks.
- Executor (tool calling): Takes one step at a time and produces the tool call. This is single-step tool calling — the solved problem.
- Validator (verification): Checks the accumulated results. Are all required fields present? Does the output match the expected schema? Flags issues for retry or escalation.
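Putting the roles together, the orchestration is a thin loop around four model calls. Here is a minimal sketch in Python; call_model, parse_steps, and execute_tool are hypothetical helpers standing in for your local inference client and tool runtime:

def run_specialist_chain(user_request: str) -> dict:
    # 1. Router: classify the request type (simple classification, a 1B model is enough)
    request_type = call_model("router", f"Classify this request: {user_request}")

    # 2. Planner: decompose the classified request into an ordered step list
    plan = call_model("planner", f"Type: {request_type}\nRequest: {user_request}\nList the steps.")

    results = []
    for step in parse_steps(plan):
        # 3. Executor: single-step tool calling, one step at a time
        tool_call = call_model("executor", f"Step: {step}\nPrior results: {results}")
        results.append(execute_tool(tool_call))

    # 4. Validator: check the accumulated results against the expected schema
    verdict = call_model("validator", f"Request: {user_request}\nResults: {results}\nValidate.")
    return {"results": results, "verdict": verdict}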
Why This Works on Small Models
Each specialist is fine-tuned on a narrow task. The router sees thousands of examples of "this input type maps to this category." The planner sees thousands of examples of "this category decomposes into these steps." No single model needs to hold the full reasoning chain.
The Tradeoff
Latency. Four model calls in sequence means 4x the inference time of a single call. On a local GPU, each 7B model inference takes 200-500ms (depending on context length and quantization). A 4-step chain is 0.8-2.0 seconds total.
For most business workflows, this is fine. For real-time chat, it is noticeable. The mitigation: run specialists in parallel where the dependency graph allows it.
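Where steps are independent, the parallel mitigation is straightforward. A sketch assuming an async wrapper (call_executor_async, hypothetical) around your local executor:

import asyncio

async def run_independent_steps(context):
    # The inventory check and the shipping quote do not depend on each other,
    # so run them concurrently instead of back to back
    inventory, shipping = await asyncio.gather(
        call_executor_async("check_inventory", context),
        call_executor_async("calculate_shipping", context),
    )
    return inventory, shipping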
Benchmark: Specialist Chain vs Single Model
| Metric | Single 7B (generic) | Single 7B (fine-tuned) | 4-Specialist Chain (7B each) |
|---|---|---|---|
| 3-step task accuracy | 41% | 67% | 89% |
| 5-step task accuracy | 18% | 43% | 81% |
| Avg latency (local GPU) | 450ms | 450ms | 1,400ms |
| Error recovery rate | 12% | 34% | 78% |
The single fine-tuned model improves over generic, but the chain architecture is a step change. Each specialist does one thing well.
Pattern 2: Single Model with Scratchpad (CoT Fine-Tuning)
Not every team wants to maintain four separate models. Pattern 2 uses a single model, but fine-tunes it to think step-by-step before acting.
The key insight: you train the model to produce a "scratchpad" — an explicit reasoning trace — before outputting the tool call. This is chain-of-thought (CoT) fine-tuning, but specifically for YOUR planning tasks.
Training Data Format
{
"input": "Process order #4521 for customer acme-corp",
"output": "<scratchpad>\nStep 1: Validate order #4521 exists and is in pending status\nStep 2: Check inventory for all line items\nStep 3: Calculate shipping based on customer location\nStep 4: Confirm order and send notification\nCurrent step: 1\n</scratchpad>\n<tool_call>{\"function\": \"validate_order\", \"params\": {\"order_id\": 4521}}</tool_call>"
}
After each tool result, the model receives the updated context and produces its next scratchpad:
{
"input": "[Previous context + tool result: order valid, status: pending]\nContinue processing.",
"output": "<scratchpad>\nStep 1: Validate order — DONE (valid, pending)\nStep 2: Check inventory for all line items\nCurrent step: 2\n</scratchpad>\n<tool_call>{\"function\": \"check_inventory\", \"params\": {\"order_id\": 4521}}</tool_call>"
}
Why This Works
The scratchpad externalizes the model's "working memory." Instead of implicitly tracking what it did and what comes next (which small models lose track of), the plan is literally in the output. The model reads its own previous reasoning at each step.
Fine-tuning on 500-1,000 examples of your specific workflows teaches the model YOUR step patterns. It does not need to plan arbitrary tasks — just the 5-15 workflow types your agent handles.
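At inference time, your application splits the scratchpad from the tool call before executing anything. A minimal parser sketch, assuming the model reproduces the tags from the training format exactly:

import json
import re

def parse_agent_output(text: str):
    # Pull the reasoning trace and the tool call out of the tagged output
    scratchpad = re.search(r"<scratchpad>(.*?)</scratchpad>", text, re.DOTALL)
    tool_call = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    if not tool_call:
        # Treat a missing tool call as an input error: regenerate with error context
        raise ValueError("no <tool_call> block in model output")
    plan = scratchpad.group(1).strip() if scratchpad else ""
    call = json.loads(tool_call.group(1))
    return plan, call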
When to Use Pattern 2 Over Pattern 1
- You have fewer than 5 distinct workflow types
- Latency matters (one model call vs four)
- You want simpler deployment (one model to serve, not four)
- Your steps are mostly sequential (limited parallelism opportunity)
Benchmark: Scratchpad vs Vanilla Fine-Tuning
| Metric | 7B vanilla fine-tuned | 7B CoT fine-tuned | 14B CoT fine-tuned |
|---|---|---|---|
| 3-step accuracy | 67% | 82% | 91% |
| 5-step accuracy | 43% | 71% | 85% |
| Plan correctness | 54% | 88% | 93% |
| Tokens per step | 45 | 120 | 130 |
The CoT model uses ~2.7x more tokens per step (because of the scratchpad), but the accuracy jump is worth it. If you are running locally, the extra tokens cost nothing but inference time.
Pattern 3: Hybrid — Local Model + Frontier Fallback
Sometimes the honest answer is: your local model handles 85% of requests, but the remaining 15% genuinely need more capability. Pattern 3 makes this explicit.
User Request → Local Router (classify complexity)
├── Simple/Known → Local Agent (Pattern 1 or 2)
└── Complex/Ambiguous → Frontier API (GPT-4, Claude) for planning only
└── Plan → Local Executor (runs the tools locally)
How It Works
The local router classifies each request on two axes:
- Complexity: How many steps? Are the steps from known templates or novel?
- Ambiguity: Is the user's intent clear? Are there multiple valid interpretations?
Known-template, clear-intent requests go fully local. Novel or ambiguous requests send ONLY the planning step to a frontier API. The frontier model produces a step plan. The local model executes each step (tool calling).
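In code, the routing decision is a small function in front of the agent. A sketch with illustrative thresholds; local_router, local_planner, frontier_api, and local_executor are placeholders for your own clients:

def route_request(request: str):
    # Score the request on both axes with the local router
    scores = local_router.classify(request)  # e.g. {"complexity": 0.3, "ambiguity": 0.2}

    if scores["complexity"] < 0.5 and scores["ambiguity"] < 0.5:
        # Known template, clear intent: plan locally
        plan = local_planner.plan(request)
    else:
        # Novel or ambiguous: only the planning step goes to the frontier API
        plan = frontier_api.plan(request)

    # Execution always stays local, whoever produced the plan
    return local_executor.run(plan)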
Cost Math
Assume 1,000 agent requests per day:
- Full frontier API: 1,000 x $0.03 avg = $30/day = $900/month
- Full local: 1,000 x $0.00 per-query = $0/day (hardware costs only)
- Hybrid (85/15 split): 150 x $0.01 (planning only, shorter prompt) = $1.50/day = $45/month
The hybrid approach costs 95% less than full API and handles edge cases the local model would get wrong.
Cost Caps and Guardrails
Set a hard monthly budget for the frontier fallback. When you hit it, all requests route to the local model with a confidence flag. The application can decide: serve the local result with a disclaimer, or queue the request for later processing.
State Management Between Steps
Regardless of which pattern you use, multi-step agents need state management. Here is what works:
Conversation Context Object
{
  "session_id": "abc-123",
  "original_request": "Process order #4521 for acme-corp",
  "current_step": 2,
  "total_steps": 4,
  "completed_steps": [
    {"step": 1, "tool": "validate_order", "result": {"valid": true, "status": "pending"}, "latency_ms": 340}
  ],
  "pending_steps": ["check_inventory", "calculate_shipping", "confirm_order"],
  "error_count": 0,
  "max_errors": 3
}
Pass this object (or a condensed version) as context to each model call. The model does not need to remember — the state is explicit.
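One way to keep the state explicit in code is a small dataclass that the orchestrator owns and serializes into every prompt. A sketch mirroring the fields above:

from dataclasses import dataclass, field

@dataclass
class AgentContext:
    session_id: str
    original_request: str
    current_step: int = 1
    total_steps: int = 0
    completed_steps: list = field(default_factory=list)
    pending_steps: list = field(default_factory=list)
    error_count: int = 0
    max_errors: int = 3

    def to_prompt(self) -> str:
        # Condensed view for the model: tools and results only, no latencies
        done = "\n".join(f"{s['step']}. {s['tool']} -> {s['result']}" for s in self.completed_steps)
        return (f"Request: {self.original_request}\n"
                f"Completed steps:\n{done}\n"
                f"Pending: {', '.join(self.pending_steps)}")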
Context Window Budget
A 7B model with a 4,096-token context window fills up fast when you include tool schemas + previous results + scratchpad. Budget it:
- System prompt + tool schemas: ~800 tokens (fixed)
- Current step context: ~200 tokens
- Previous step results (condensed): ~150 tokens per step
- Scratchpad output: ~120 tokens
For a 5-step task: 800 + 200 + (4 x 150) + 120 = 1,720 tokens. That sits comfortably within a 4K window. For longer chains, summarize older step results instead of including raw output.
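A rough budget check before each call keeps the chain inside the window. A sketch, assuming count_tokens wraps your tokenizer (hypothetical helper) and using the per-item estimates above:

WINDOW = 4096
FIXED_OVERHEAD = 800 + 200 + 120  # system prompt + current step context + scratchpad output

def build_step_context(context):
    results = [str(s["result"]) for s in context.completed_steps]
    # Drop the oldest result when the budget is blown; recent steps matter most.
    # In practice you would summarize it first rather than drop it outright.
    while results and FIXED_OVERHEAD + sum(count_tokens(r) for r in results) > WINDOW:
        results.pop(0)
    return results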
Error Handling: When Step 3 Fails
Multi-step agents will fail mid-chain. The question is what happens next.
Retry Logic
Step fails → Check error type
├── Transient (timeout, rate limit) → Retry same step (max 2 retries)
├── Input error (bad params) → Regenerate tool call with error context
└── Logic error (impossible state) → Escalate
For the "regenerate" path, append the error message to the model's context:
Previous tool call failed: "inventory_check returned 404: order_id 4521 not found in warehouse_east"
Adjust your approach and try again.
Fine-tuned models handle this well when your training data includes error-recovery examples. Include 10-15% error scenarios in your training set.
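The decision tree above maps onto a small error handler. A sketch; retry_step, regenerate_step, and escalate are hypothetical hooks into your orchestrator, and the error object is assumed to carry kind, retries, and message fields:

TRANSIENT_ERRORS = ("timeout", "rate_limit")

def handle_step_error(context, error):
    # Transient failures: retry the same step, at most twice
    if error.kind in TRANSIENT_ERRORS and error.retries < 2:
        return retry_step(context)
    # Input errors (bad params): feed the error back and let the model regenerate the call
    if error.kind == "input_error":
        context.append_note(
            f'Previous tool call failed: "{error.message}"\n'
            "Adjust your approach and try again."
        )
        return regenerate_step(context)
    # Logic errors (impossible state): do not retry, escalate
    return escalate(context)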
Escalation
When retry fails or the error is a logic error, escalate:
- To frontier model: Re-run the failed step through GPT-4/Claude with full context
- To human: Flag the request in a review queue with the full step history
- Graceful degradation: Complete the steps that succeeded, return a partial result with a clear explanation of what failed
Guardrails: Preventing Infinite Loops and Runaway Costs
Max Steps
Hard limit. If your longest known workflow is 6 steps, set max_steps to 8. If the model reaches step 9, kill the chain and escalate.
MAX_STEPS = 8
MAX_ERRORS = 3
STEP_TIMEOUT_MS = 5000

def run_chain(context):
    for step in range(MAX_STEPS):
        result = execute_step(context, timeout=STEP_TIMEOUT_MS)
        if result.status == "complete":
            return result
        # Stop once the error budget is spent (see max_errors in the context object)
        if result.status == "error" and context.error_count >= MAX_ERRORS:
            return escalate(context)
    # The chain ran past its longest known workflow: kill it and escalate
    return escalate(context, reason="max_steps_exceeded")
Cost Caps (Hybrid Pattern)
For Pattern 3, track frontier API spend per hour, per day, and per month. Set alerts at 50%, 80%, and 100% of budget. At 100%, stop routing to frontier and log every request that would have been routed — this becomes your fine-tuning dataset for next iteration.
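A sketch of the budget guard; spend_tracker and alert are placeholders for your own accounting and alerting:

MONTHLY_BUDGET_USD = 50.0

def frontier_allowed(request) -> bool:
    spent = spend_tracker.month_to_date()
    if spent >= MONTHLY_BUDGET_USD:
        # Over budget: route local and log the request.
        # These logs become next iteration's fine-tuning dataset.
        spend_tracker.log_skipped(request)
        return False
    if spent >= 0.8 * MONTHLY_BUDGET_USD:
        alert("Frontier spend at 80% of monthly budget")
    return True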
Human-in-the-Loop
For high-stakes workflows (financial transactions, healthcare decisions, legal actions), add a confirmation step before the final action:
Steps 1-3 complete → Present summary to human → Human approves → Step 4 executes
This works naturally in both async (queue-based) and sync (chat-based) interfaces.
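In a queue-based setup, the confirmation is just a pause in the chain. A sketch; approval_queue is a placeholder for your review queue:

def confirm_before_final_step(context, summary):
    # Pause the chain: the high-stakes step runs only after explicit sign-off
    approval = approval_queue.request(context.session_id, summary)
    if approval.approved:
        return execute_step(context)
    return escalate(context, reason="human_rejected")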
When Multi-Step Works on 7B vs When You Need 14B+
| Scenario | 7B Sufficient? | Recommendation |
|---|---|---|
| 2-3 step linear workflows | Yes | Pattern 2 (scratchpad) |
| 4-6 step linear workflows | With specialist chain | Pattern 1 |
| Branching logic (if/else steps) | Marginal | 14B or Pattern 3 |
| Dynamic step count (model decides when done) | No | 14B with CoT or Pattern 3 |
| Error recovery with replanning | Marginal | 14B or specialist chain with dedicated replanner |
| 10+ distinct workflow types | Yes, with good routing | Pattern 1 with fine-tuned router |
The dividing line: if the model needs to DECIDE what to do (dynamic planning), go 14B+. If the model follows a KNOWN pattern with minor variations, 7B is fine with the right architecture.
Example: 4-Step Order Processing Agent
Here is a concrete implementation using Pattern 2 (scratchpad) on a fine-tuned Qwen 2.5 7B:
Step 1 — Validate Order
Input: "Process order #4521 for acme-corp"
Scratchpad: "4 steps needed: validate → inventory → shipping → confirm. Starting with validation."
Tool call: validate_order(order_id=4521, customer="acme-corp")
Result: {valid: true, items: [{sku: "W-100", qty: 50}, {sku: "W-200", qty: 25}]}
Step 2 — Check Inventory
Scratchpad: "Order valid. 2 line items. Checking inventory for both SKUs."
Tool call: check_inventory(skus=["W-100", "W-200"], quantities=[50, 25])
Result: {W-100: {available: 200, allocated: 50}, W-200: {available: 30, allocated: 25}}
Step 3 — Calculate Shipping
Scratchpad: "All items available. Customer acme-corp is in warehouse zone east. Calculating shipping."
Tool call: calculate_shipping(order_id=4521, zone="east", weight_kg=37.5)
Result: {method: "ground", cost: 42.50, eta_days: 3}
Step 4 — Confirm Order
Scratchpad: "Shipping calculated. All checks passed. Confirming order."
Tool call: confirm_order(order_id=4521, shipping_method="ground", shipping_cost=42.50)
Result: {confirmed: true, confirmation_id: "CNF-88421"}
Total latency on an RTX 4090: 1.8 seconds for all four steps. Total API cost: $0.00.
This same flow on GPT-4 would cost ~$0.04 per order. At 500 orders/day, that is $20/day or $600/month — for something a local model does in under 2 seconds.
Getting Started
The fastest path to multi-step agents on local models:
- Map your workflows. List every multi-step process your agent handles. Most teams have 3-8 distinct types.
- Choose your pattern. Under 5 workflow types with linear steps? Pattern 2. More variety or need parallelism? Pattern 1. Need a safety net? Pattern 3.
- Build training data. 200-500 examples per workflow type, including 10-15% error scenarios.
- Fine-tune with LoRA. A 7B model fine-tunes in under an hour with Ertas on a single GPU.
- Test the chain end-to-end before deploying. Run your full workflow suite and measure step-by-step accuracy, not just final-result accuracy.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Fine-Tuning for Tool Calling: How to Build Reliable AI Agents — the foundation for single-step tool calling that Pattern 1's executor uses
- When a 7B Model Beats an API Call — benchmarks showing where small models match frontier performance
- Fine-Tuning Small Models vs GPT-4: When the Little Model Wins — the data behind task-specific fine-tuning outperforming general-purpose models