
Fine-Tuned Models for CrewAI: Multi-Agent Workflows Without API Costs
A CrewAI workflow with 4 agents making 20+ LLM calls per task can cost $2-5 per execution on GPT-4. Fine-tuned local models make multi-agent workflows economically viable.
CrewAI made multi-agent workflows accessible. Define agents with roles, give them tools, wire them into a crew, and let them collaborate on complex tasks. A researcher gathers information, an analyst evaluates it, a writer produces the output, a reviewer checks it. Four agents, one cohesive workflow.
The problem is that "four agents collaborating" means "four agents each making 5-10 LLM calls." A single crew execution on GPT-4o can easily hit 20-40 API calls. At $2.50/$10 per million input/output tokens, with each agent processing 2,000-5,000 tokens per step and context accumulating as outputs are passed downstream, a single task execution costs $2-5.
Run that crew 100 times per day — a modest production workload — and you're spending $200-$500/day. That's $6,000-$15,000/month for a workflow automation tool.
Multi-agent architectures have a cost multiplication problem. Every agent you add multiplies the API bill. Fine-tuned local models are the only way to make multi-agent workflows economically sustainable at scale.
The Cost Multiplication Problem
Single-agent architectures have a linear cost relationship: one request, one LLM call, one charge. Multi-agent architectures have a multiplicative relationship.
Consider a content creation crew:
- Research Agent: Searches the web, reads sources, produces a research brief (3-5 LLM calls)
- Writer Agent: Takes the brief and produces a draft (2-3 LLM calls)
- Editor Agent: Reviews the draft, suggests changes, rewrites sections (3-5 LLM calls)
- SEO Agent: Analyzes the content, adds keywords, optimizes structure (2-3 LLM calls)
That's 10-16 LLM calls per article. At GPT-4o pricing with average prompt sizes:
- Per article: $1.50-$4.00
- 10 articles/day: $15-$40
- 300 articles/month: $450-$1,200
Now add a fact-checking agent and a formatting agent. You're at 16-24 calls per article and roughly $700-$1,800/month at the same 300-article volume. Cost scales linearly with agent count and linearly with task volume; multiply the two together and the bill compounds fast.
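The multiplicative relationship is easy to make concrete. A minimal sketch, where the per-agent call count, tokens per call, and blended token price are illustrative assumptions rather than measured figures:

```python
# Illustrative sketch of cost multiplication: monthly API spend as a
# function of agent count and task volume. All defaults are assumptions
# for the example, not measured figures.
def monthly_cost(agents: int, tasks_per_day: int,
                 calls_per_agent: int = 4,
                 tokens_per_call: int = 3_000,
                 usd_per_million_tokens: float = 6.25) -> float:
    calls_per_task = agents * calls_per_agent
    tokens_per_task = calls_per_task * tokens_per_call
    daily = tasks_per_day * tokens_per_task * usd_per_million_tokens / 1e6
    return daily * 30

# Adding agents multiplies the bill at every task volume:
print(monthly_cost(agents=4, tasks_per_day=10))  # → 90.0
print(monthly_cost(agents=6, tasks_per_day=10))  # → 135.0
```

Doubling either the agent count or the task volume doubles the bill; doubling both quadruples it.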
Which CrewAI Roles Work with Fine-Tuned Models
Not every agent role in a crew needs the same level of model capability. Here's a practical breakdown:
Roles That Work Well with Fine-Tuned 7-8B Models
Researcher/Gatherer Agents
These agents take a topic, formulate search queries, and synthesize information from tool results. The core task is query generation (structured output) and summarization (compression). Fine-tuned models handle both of these reliably.
A fine-tuned researcher agent learns your specific research patterns: what sources you prefer, how you want summaries formatted, what level of detail to include. It produces more consistent research briefs than GPT-4 with a system prompt, because the training data encodes your preferences directly.
Writer/Generator Agents
For domain-specific writing — product descriptions, support documentation, marketing copy with a specific brand voice — fine-tuned models are often better than prompted GPT-4. They don't drift from your style. They don't add qualifiers you didn't want. They produce output that sounds like your brand because they were trained on your brand's content.
Analyzer/Classifier Agents
Agents that evaluate, score, or categorize inputs. "Is this lead qualified or unqualified?" "What's the sentiment of this review?" "Which department should handle this ticket?" These are classification tasks — the sweet spot for fine-tuned small models.
Formatter/Post-Processor Agents
Agents that take raw output and format it for a specific target: convert to markdown, generate HTML, structure as JSON, format for a specific API. Pure structured output tasks where fine-tuned models achieve 99%+ compliance.
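That compliance figure is measurable rather than anecdotal. A minimal harness sketch, assuming an illustrative expected-key set for the formatter's JSON output:

```python
import json

# Hypothetical expected shape for a formatter agent's output.
EXPECTED_KEYS = {"title", "body_markdown", "tags"}

def is_compliant(raw: str) -> bool:
    """True if the model output parses as JSON with the expected keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and EXPECTED_KEYS <= data.keys()

def compliance_rate(outputs: list[str]) -> float:
    return sum(is_compliant(o) for o in outputs) / len(outputs)

samples = [
    '{"title": "T", "body_markdown": "# T", "tags": ["a"]}',
    'Sure! Here is the JSON you asked for: {...}',  # prose wrapper = failure
]
print(compliance_rate(samples))  # → 0.5
```

Running a harness like this over a few hundred held-out inputs is how you verify a formatter adapter before putting it in the crew.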
Roles That Still Benefit from Frontier Models
Strategic Planner Agents
Agents that need to create multi-step plans for novel problems. "Given these business constraints, design a marketing campaign targeting these demographics across these channels." This requires creative reasoning about novel combinations of factors.
Complex Reasoner Agents
Agents that need to evaluate multiple conflicting data points and make nuanced judgments. "Given these three conflicting reviews, this pricing data, and these market trends, recommend an investment strategy." Multi-factor reasoning with trade-offs is where frontier models maintain an edge.
Adversarial Reviewer Agents
Agents specifically designed to find flaws, challenge assumptions, and stress-test outputs. These need the broad knowledge and reasoning flexibility of larger models to catch subtle errors.
Assigning Different Models to Different Agents
CrewAI supports custom LLM configuration per agent. Here's how to implement a mixed-model crew:
```python
from crewai import Agent, Crew, Task
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI

# Fine-tuned local models for specialized roles. The "ft-*" names assume
# you have registered each base-model + LoRA-adapter combination with
# Ollama (e.g. via a Modelfile with an ADAPTER line).
researcher_llm = ChatOllama(model="ft-researcher-8b")
writer_llm = ChatOllama(model="ft-writer-8b")
editor_llm = ChatOllama(model="ft-editor-8b")

# GPT-4o for strategic planning
strategist_llm = ChatOpenAI(model="gpt-4o")

researcher = Agent(
    role="Research Analyst",
    goal="Gather and synthesize information on the given topic",
    llm=researcher_llm,
    tools=[search_tool, web_scraper],  # your tool instances, defined elsewhere
)

writer = Agent(
    role="Content Writer",
    goal="Produce clear, well-structured content",
    llm=writer_llm,
)

editor = Agent(
    role="Content Editor",
    goal="Review and improve content quality",
    llm=editor_llm,
)

strategist = Agent(
    role="Content Strategist",
    goal="Develop content strategy and evaluate alignment with business goals",
    llm=strategist_llm,  # complex reasoning stays on GPT-4o
)
```
This gives you the cost benefits of local inference for 3 out of 4 agents while maintaining frontier-model reasoning for the one agent that genuinely needs it.
Training Specialized LoRA Adapters per Agent Role
The most effective approach: one base model, multiple LoRA adapters — one per agent role. Each adapter specializes the base model for its specific task.
Base Model Selection
Llama 3.1 8B Instruct and Qwen 2.5 7B Instruct are the practical choices for agent adapters in 2026. Both support tool calling natively, handle structured output well, and fit comfortably on a 24GB GPU with room for multiple LoRA adapters in memory.
Adapter Training Data by Role
Researcher adapter training data:
- Input: research topic + available tools
- Output: structured research queries, tool calls, summarized findings
- Format: multi-turn conversations showing the full research process
- Dataset size: 300-600 examples
Writer adapter training data:
- Input: research brief or content outline + style guidelines
- Output: finished content in target format
- Format: input-output pairs with style-consistent examples
- Dataset size: 500-1,000 examples (more data = better style consistency)
Editor adapter training data:
- Input: draft content + editing guidelines
- Output: edited content with tracked changes or edit comments
- Format: before/after pairs showing specific edits and reasoning
- Dataset size: 400-800 examples
Analyzer adapter training data:
- Input: data or content to evaluate
- Output: structured analysis in a specific format (scores, categories, recommendations)
- Format: input → structured JSON output
- Dataset size: 300-500 examples
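These formats are easiest to see in code. A sketch of one analyzer-adapter training record serialized as a chat-style JSONL line; the system prompt, label fields, and output schema are illustrative, not from a specific dataset:

```python
import json

def make_analyzer_example(ticket_text: str, department: str,
                          confidence: float) -> str:
    """Serialize one input -> structured-JSON training pair as a JSONL line."""
    record = {
        "messages": [
            {"role": "system",
             "content": "Classify the support ticket. Reply with JSON only."},
            {"role": "user", "content": ticket_text},
            {"role": "assistant",
             "content": json.dumps({"department": department,
                                    "confidence": confidence})},
        ]
    }
    return json.dumps(record)

print(make_analyzer_example("My invoice is wrong.", "billing", 0.92))
```

Researcher and editor records follow the same shape, with multi-turn tool-call messages or before/after pairs in place of the single classification turn.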
Training Process
Upload each dataset to Ertas as a separate fine-tuning job. Select the same base model for all adapters. Each training run produces a LoRA adapter file (50-200MB). Store all adapters on the same machine as the base model.
Total training time: 1-3 hours per adapter, so 4-12 hours for a full crew. Total adapter storage: 200-800MB for a four-agent crew. Compare that to four separate full model copies at roughly 5GB each (4-bit quantized).
Cost Comparison: Three Configurations
Let's compare a four-agent content crew running 500 tasks per day.
Configuration 1: All GPT-4o
Every agent uses GPT-4o. 20 LLM calls per task, ~3,000 tokens per call.
- 10,000 calls/day × 3,000 tokens × $6.25/M tokens (blended rate)
- $187.50/day → $5,625/month
Configuration 2: Mixed Crew (3 Local, 1 GPT-4o)
Researcher, writer, editor on fine-tuned local models. Strategist on GPT-4o.
- 7,500 local calls/day (researcher, writer, editor): $0
- 2,500 GPT-4o calls/day × 3,000 tokens × $6.25/M tokens: $46.88/day
- Cloud GPU (A10G): $300/month
- $46.88/day + $300/month → $1,706/month (70% reduction)
Configuration 3: All Local (GPT-4o Fallback Only)
All agents on fine-tuned local models. GPT-4o called only when confidence is low (~5% of calls).
- 9,500 local calls/day: $0
- 500 GPT-4o calls/day × 3,000 tokens × $6.25/M tokens: $9.38/day
- Cloud GPU (A10G): $300/month
- $9.38/day + $300/month → $581/month (90% reduction)
Going from $5,625/month to $581/month. That's a $60,528/year saving for a single crew workflow.
Handling Inter-Agent Communication
In CrewAI, agents communicate by passing outputs as context to the next agent. When all agents use GPT-4, this works naturally — each agent receives the full context window of the previous agent's output.
With fine-tuned local models, you need to be thoughtful about context management:
Keep inter-agent messages concise. Train your agents to produce structured, compact outputs rather than verbose narratives. A researcher that outputs a JSON-structured brief is better than one that writes a 2,000-word essay — less context for the next agent to process, faster inference, and more predictable parsing.
Standardize communication formats. Define a schema for each agent-to-agent handoff. The researcher outputs {"topic": "...", "key_facts": [...], "sources": [...]}. The writer expects exactly this format. Fine-tune both agents on this shared schema. This eliminates the format negotiation that wastes tokens when agents communicate in free-form text.
Use smaller context windows. Fine-tuned 8B models work well within 2,048-4,096 token contexts. Design your agent communication to fit within these limits. If the researcher's output exceeds 2K tokens, add a summarization step (also fine-tuned) before passing to the writer.
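The handoff schema above can be enforced in code before each agent consumes its input. A minimal standard-library sketch (a library like pydantic would serve the same role; the strict rejection of extra fields is an illustrative choice):

```python
import json
from dataclasses import dataclass

# Field names mirror the researcher -> writer handoff schema from the text.
@dataclass
class ResearchBrief:
    topic: str
    key_facts: list
    sources: list

def parse_brief(raw: str) -> ResearchBrief:
    """Validate the researcher's output before it reaches the writer."""
    data = json.loads(raw)
    extra = data.keys() - {"topic", "key_facts", "sources"}
    if extra:
        raise ValueError(f"unexpected fields in brief: {sorted(extra)}")
    return ResearchBrief(**data)  # raises TypeError if a field is missing

brief = parse_brief('{"topic": "lora", "key_facts": ["cheap"], "sources": ["a"]}')
print(brief.topic)  # → lora
```

A failed parse is also a useful signal: it can trigger a retry, a fallback, or a new training example for the researcher adapter.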
Monitoring and Iterating
After deploying a fine-tuned crew, monitor these metrics:
- Per-agent task success rate: Does each agent complete its role successfully?
- End-to-end crew success rate: Does the full workflow produce acceptable output?
- Fallback rate: How often does a local agent trigger a GPT-4 fallback?
- Latency per agent: Is any fine-tuned agent slower than expected?
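The fallback-rate metric implies a small router in front of each local agent. A stub sketch: the confidence signal and threshold here are assumptions, and in practice confidence might come from log-probabilities or an output validator rather than the model itself:

```python
def run_with_fallback(task, local_agent, cloud_agent, stats, threshold=0.8):
    """Try the fine-tuned local agent first; escalate to the cloud model
    when confidence falls below the threshold. Tracks the fallback rate."""
    result, confidence = local_agent(task)
    stats["total"] += 1
    if confidence >= threshold:
        return result
    stats["fallbacks"] += 1
    return cloud_agent(task)

# Stub agents standing in for real LLM calls:
local = lambda t: (f"local:{t}", 0.9 if "easy" in t else 0.4)
cloud = lambda t: f"cloud:{t}"

stats = {"total": 0, "fallbacks": 0}
print(run_with_fallback("easy task", local, cloud, stats))   # → local:easy task
print(run_with_fallback("novel task", local, cloud, stats))  # → cloud:novel task
print(stats["fallbacks"] / stats["total"])                   # → 0.5
```

The same `stats` dict, kept per agent, gives you the fallback-rate metric directly.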
Log every execution. When a fine-tuned agent fails or produces subpar output, add the correct input-output pair to its training dataset. Retrain the adapter periodically — monthly is typical — with the expanded dataset. Each iteration improves the adapter's coverage of edge cases.
After 2-3 retraining cycles, most teams see fallback rates drop from 10-15% to 2-5%. By the fourth or fifth cycle, the fine-tuned crew handles 98%+ of tasks without cloud fallback.
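The retraining loop is mostly bookkeeping: every correction becomes a new training example. A sketch of the logging side, with a hypothetical dataset filename and an illustrative record shape:

```python
import json
from pathlib import Path

def log_correction(dataset_path: Path, agent_input: str,
                   corrected_output: str) -> None:
    """Append a corrected input-output pair to the agent's training set,
    ready for the next adapter retraining run."""
    record = {"messages": [
        {"role": "user", "content": agent_input},
        {"role": "assistant", "content": corrected_output},
    ]}
    with dataset_path.open("a") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical dataset file for the editor adapter's next training cycle:
path = Path("editor_adapter_v2.jsonl")
log_correction(path, "Edit this draft: ...", "Here is the tightened draft: ...")
```

At retraining time, this file is simply concatenated with the previous cycle's dataset and uploaded as a new fine-tuning job.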
When Multi-Agent Is Overkill
Before investing in fine-tuning a full CrewAI crew, ask whether you actually need multiple agents. A single well-prompted or fine-tuned model can handle many "multi-agent" workflows:
- If your agents are just doing sequential processing (researcher → writer → editor), a single model with structured prompting can do the same thing in 3 calls instead of 15.
- If your "collaboration" is really just a review loop, a single model with self-reflection can achieve similar quality.
Multi-agent architectures add real value when agents operate in parallel, when they have genuinely different tool sets, or when the task structure is complex enough that specialization produces measurably better results. If your crew is four agents doing what one agent could do in four steps, simplify first — then fine-tune.
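The sequential case is worth sketching. One model, three prompts, no crew overhead; `call_model` below is a stub standing in for whatever single-model client you use:

```python
# Collapsing a researcher -> writer -> editor crew into one model called
# three times. `call_model` is a stub, not a real LLM client.
def call_model(prompt: str) -> str:
    return f"<output for: {prompt[:30]}...>"

def sequential_pipeline(topic: str) -> str:
    brief = call_model(f"Research this topic and return a brief: {topic}")
    draft = call_model(f"Write an article from this brief: {brief}")
    final = call_model(f"Edit this draft for clarity and tone: {draft}")
    return final  # 3 calls total, versus 10-16 for the four-agent crew

print(sequential_pipeline("LoRA adapters"))
```

If this pipeline produces acceptable output, the crew was overhead; if quality measurably improves with specialized agents, fine-tune per role as described above.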
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Multi-Step AI Agents with Local Models — Architectural patterns for running complex agent workflows without cloud APIs.
- Reliable AI Agents with Fine-Tuned Local Models — Why fine-tuned models produce more consistent agent behavior at every step.
- Per-Client AI Agents for Agencies with LoRA — Using per-client LoRA adapters to run customized agent crews for each client.