
Fine-Tuned Models for CrewAI: Multi-Agent Workflows Without API Costs
A CrewAI workflow with 4 agents making 20+ LLM calls per task can cost $2-5 per execution on GPT-4. Fine-tuned local models make multi-agent workflows economically viable.
CrewAI made multi-agent workflows accessible. Define agents with roles, give them tools, wire them into a crew, and let them collaborate on complex tasks. A researcher gathers information, an analyst evaluates it, a writer produces the output, a reviewer checks it. Four agents, one cohesive workflow.
The problem is that "four agents collaborating" means "four agents each making 5-10 LLM calls." A single crew execution on GPT-4o can easily hit 20-40 API calls. At $2.50/$10 per million input/output tokens, with each agent processing 2,000-5,000 tokens per step and context accumulating as outputs are passed downstream, a single task execution costs $2-5.
Run that crew 100 times per day — a modest production workload — and you're spending $200-$500/day. That's $6,000-$15,000/month for a workflow automation tool.
Multi-agent architectures have a cost multiplication problem. Every agent you add multiplies the API bill. Fine-tuned local models are the only way to make multi-agent workflows economically sustainable at scale.
The Cost Multiplication Problem
Single-agent architectures have a linear cost relationship: one request, one LLM call, one charge. Multi-agent architectures have a multiplicative relationship.
Consider a content creation crew:
- Research Agent: Searches the web, reads sources, produces a research brief (3-5 LLM calls)
- Writer Agent: Takes the brief and produces a draft (2-3 LLM calls)
- Editor Agent: Reviews the draft, suggests changes, rewrites sections (3-5 LLM calls)
- SEO Agent: Analyzes the content, adds keywords, optimizes structure (2-3 LLM calls)
That's 10-16 LLM calls per article. At GPT-4o pricing with average prompt sizes:
- Per article: $1.50-$4.00
- 10 articles/day: $15-$40
- 300 articles/month: $450-$1,200
Now add a fact-checking agent and a formatting agent. You're at 16-24 calls per article and roughly $700-$1,800/month at the same 300-article volume. Cost scales linearly with agent count and linearly with task volume; multiply the two together and the bill compounds fast.
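The multiplicative relationship is easy to make concrete. A minimal sketch, where the per-agent call count, tokens per call, and blended token price are illustrative assumptions rather than measured figures:

```python
# Illustrative sketch of cost multiplication: monthly API spend as a
# function of agent count and task volume. All defaults are assumptions
# for the example, not measured figures.
def monthly_cost(agents: int, tasks_per_day: int,
                 calls_per_agent: int = 4,
                 tokens_per_call: int = 3_000,
                 usd_per_million_tokens: float = 6.25) -> float:
    calls_per_task = agents * calls_per_agent
    tokens_per_task = calls_per_task * tokens_per_call
    daily = tasks_per_day * tokens_per_task * usd_per_million_tokens / 1e6
    return daily * 30

# Adding agents multiplies the bill at every task volume:
print(monthly_cost(agents=4, tasks_per_day=10))  # → 90.0
print(monthly_cost(agents=6, tasks_per_day=10))  # → 135.0
```

Doubling either the agent count or the task volume doubles the bill; doubling both quadruples it.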
Which CrewAI Roles Work with Fine-Tuned Models
Not every agent role in a crew needs the same level of model capability. Here's a practical breakdown:
Roles That Work Well with Fine-Tuned 7-8B Models
Researcher/Gatherer Agents
These agents take a topic, formulate search queries, and synthesize information from tool results. The core task is query generation (structured output) and summarization (compression). Fine-tuned models handle both of these reliably.
A fine-tuned researcher agent learns your specific research patterns: what sources you prefer, how you want summaries formatted, what level of detail to include. It produces more consistent research briefs than GPT-4 with a system prompt, because the training data encodes your preferences directly.
Writer/Generator Agents
For domain-specific writing — product descriptions, support documentation, marketing copy with a specific brand voice — fine-tuned models are often better than prompted GPT-4. They don't drift from your style. They don't add qualifiers you didn't want. They produce output that sounds like your brand because they were trained on your brand's content.
Analyzer/Classifier Agents
Agents that evaluate, score, or categorize inputs. "Is this lead qualified or unqualified?" "What's the sentiment of this review?" "Which department should handle this ticket?" These are classification tasks — the sweet spot for fine-tuned small models.
Formatter/Post-Processor Agents
Agents that take raw output and format it for a specific target: convert to markdown, generate HTML, structure as JSON, format for a specific API. Pure structured output tasks where fine-tuned models achieve 99%+ compliance.
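That compliance figure is measurable rather than anecdotal. A minimal harness sketch, assuming an illustrative expected-key set for the formatter's JSON output:

```python
import json

# Hypothetical expected shape for a formatter agent's output.
EXPECTED_KEYS = {"title", "body_markdown", "tags"}

def is_compliant(raw: str) -> bool:
    """True if the model output parses as JSON with the expected keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and EXPECTED_KEYS <= data.keys()

def compliance_rate(outputs: list[str]) -> float:
    return sum(is_compliant(o) for o in outputs) / len(outputs)

samples = [
    '{"title": "T", "body_markdown": "# T", "tags": ["a"]}',
    'Sure! Here is the JSON you asked for: {...}',  # prose wrapper = failure
]
print(compliance_rate(samples))  # → 0.5
```

Running a harness like this over a few hundred held-out inputs is how you verify a formatter adapter before putting it in the crew.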
Roles That Still Benefit from Frontier Models
Strategic Planner Agents
Agents that need to create multi-step plans for novel problems. "Given these business constraints, design a marketing campaign targeting these demographics across these channels." This requires creative reasoning about novel combinations of factors.
Complex Reasoner Agents
Agents that need to evaluate multiple conflicting data points and make nuanced judgments. "Given these three conflicting reviews, this pricing data, and these market trends, recommend an investment strategy." Multi-factor reasoning with trade-offs is where frontier models maintain an edge.
Adversarial Reviewer Agents
Agents specifically designed to find flaws, challenge assumptions, and stress-test outputs. These need the broad knowledge and reasoning flexibility of larger models to catch subtle errors.
Assigning Different Models to Different Agents
CrewAI supports custom LLM configuration per agent. Here's how to implement a mixed-model crew:
```python
from crewai import Agent, Crew, Task
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI

# Fine-tuned local models for specialized roles. The "ft-*" names assume
# you have registered each base-model + LoRA-adapter combination with
# Ollama (e.g. via a Modelfile with an ADAPTER line).
researcher_llm = ChatOllama(model="ft-researcher-8b")
writer_llm = ChatOllama(model="ft-writer-8b")
editor_llm = ChatOllama(model="ft-editor-8b")

# GPT-4o for strategic planning
strategist_llm = ChatOpenAI(model="gpt-4o")

researcher = Agent(
    role="Research Analyst",
    goal="Gather and synthesize information on the given topic",
    llm=researcher_llm,
    tools=[search_tool, web_scraper],  # your tool instances, defined elsewhere
)

writer = Agent(
    role="Content Writer",
    goal="Produce clear, well-structured content",
    llm=writer_llm,
)

editor = Agent(
    role="Content Editor",
    goal="Review and improve content quality",
    llm=editor_llm,
)

strategist = Agent(
    role="Content Strategist",
    goal="Develop content strategy and evaluate alignment with business goals",
    llm=strategist_llm,  # complex reasoning stays on GPT-4o
)
```
This gives you the cost benefits of local inference for 3 out of 4 agents while maintaining frontier-model reasoning for the one agent that genuinely needs it.
Training Specialized LoRA Adapters per Agent Role
The most effective approach: one base model, multiple LoRA adapters — one per agent role. Each adapter specializes the base model for its specific task.
Base Model Selection
Llama 3.1 8B Instruct and Qwen 2.5 7B Instruct are the practical choices for agent adapters in 2026. Both support tool calling natively, handle structured output well, and fit comfortably on a 24GB GPU with room for multiple LoRA adapters in memory.
Adapter Training Data by Role
Researcher adapter training data:
- Input: research topic + available tools
- Output: structured research queries, tool calls, summarized findings
- Format: multi-turn conversations showing the full research process
- Dataset size: 300-600 examples
Writer adapter training data:
- Input: research brief or content outline + style guidelines
- Output: finished content in target format
- Format: input-output pairs with style-consistent examples
- Dataset size: 500-1,000 examples (more data = better style consistency)
Editor adapter training data:
- Input: draft content + editing guidelines
- Output: edited content with tracked changes or edit comments
- Format: before/after pairs showing specific edits and reasoning
- Dataset size: 400-800 examples
Analyzer adapter training data:
- Input: data or content to evaluate
- Output: structured analysis in a specific format (scores, categories, recommendations)
- Format: input → structured JSON output
- Dataset size: 300-500 examples
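These formats are easiest to see in code. A sketch of one analyzer-adapter training record serialized as a chat-style JSONL line; the system prompt, label fields, and output schema are illustrative, not from a specific dataset:

```python
import json

def make_analyzer_example(ticket_text: str, department: str,
                          confidence: float) -> str:
    """Serialize one input -> structured-JSON training pair as a JSONL line."""
    record = {
        "messages": [
            {"role": "system",
             "content": "Classify the support ticket. Reply with JSON only."},
            {"role": "user", "content": ticket_text},
            {"role": "assistant",
             "content": json.dumps({"department": department,
                                    "confidence": confidence})},
        ]
    }
    return json.dumps(record)

print(make_analyzer_example("My invoice is wrong.", "billing", 0.92))
```

Researcher and editor records follow the same shape, with multi-turn tool-call messages or before/after pairs in place of the single classification turn.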
Training Process
Upload each dataset to Ertas as a separate fine-tuning job. Select the same base model for all adapters. Each training run produces a LoRA adapter file (50-200MB). Store all adapters on the same machine as the base model.
Total training time: 1-3 hours per adapter, so 4-12 hours for a full crew. Total adapter storage: 200-800MB for a four-agent crew. Compare that to four separate full model copies at roughly 5GB each (4-bit quantized).
Cost Comparison: Three Configurations
Let's compare a four-agent content crew running 500 tasks per day.
Configuration 1: All GPT-4o
Every agent uses GPT-4o. 20 LLM calls per task, ~3,000 tokens per call.
- 10,000 calls/day × 3,000 tokens × $6.25/M tokens (blended rate)
- $187.50/day → $5,625/month
Configuration 2: Mixed Crew (3 Local, 1 GPT-4o)
Researcher, writer, editor on fine-tuned local models. Strategist on GPT-4o.
- 7,500 local calls/day (researcher, writer, editor): $0
- 2,500 GPT-4o calls/day × 3,000 tokens × $6.25/M tokens: $46.88/day
- Cloud GPU (A10G): $300/month
- $46.88/day + $300/month → $1,706/month (70% reduction)
Configuration 3: All Local (GPT-4o Fallback Only)
All agents on fine-tuned local models. GPT-4o called only when confidence is low (~5% of calls).
- 9,500 local calls/day: $0
- 500 GPT-4o calls/day × 3,000 tokens × $6.25/M tokens: $9.38/day
- Cloud GPU (A10G): $300/month
- $9.38/day + $300/month → $581/month (90% reduction)
Going from $5,625/month to $581/month. That's a $60,528/year saving for a single crew workflow.
Handling Inter-Agent Communication
In CrewAI, agents communicate by passing outputs as context to the next agent. When all agents use GPT-4, this works naturally — each agent receives the full context window of the previous agent's output.
With fine-tuned local models, you need to be thoughtful about context management:
Keep inter-agent messages concise. Train your agents to produce structured, compact outputs rather than verbose narratives. A researcher that outputs a JSON-structured brief is better than one that writes a 2,000-word essay — less context for the next agent to process, faster inference, and more predictable parsing.
Standardize communication formats. Define a schema for each agent-to-agent handoff. The researcher outputs {"topic": "...", "key_facts": [...], "sources": [...]}. The writer expects exactly this format. Fine-tune both agents on this shared schema. This eliminates the format negotiation that wastes tokens when agents communicate in free-form text.
Use smaller context windows. Fine-tuned 8B models work well within 2,048-4,096 token contexts. Design your agent communication to fit within these limits. If the researcher's output exceeds 2K tokens, add a summarization step (also fine-tuned) before passing to the writer.
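The handoff schema above can be enforced in code before each agent consumes its input. A minimal standard-library sketch (a library like pydantic would serve the same role; the strict rejection of extra fields is an illustrative choice):

```python
import json
from dataclasses import dataclass

# Field names mirror the researcher -> writer handoff schema from the text.
@dataclass
class ResearchBrief:
    topic: str
    key_facts: list
    sources: list

def parse_brief(raw: str) -> ResearchBrief:
    """Validate the researcher's output before it reaches the writer."""
    data = json.loads(raw)
    extra = data.keys() - {"topic", "key_facts", "sources"}
    if extra:
        raise ValueError(f"unexpected fields in brief: {sorted(extra)}")
    return ResearchBrief(**data)  # raises TypeError if a field is missing

brief = parse_brief('{"topic": "lora", "key_facts": ["cheap"], "sources": ["a"]}')
print(brief.topic)  # → lora
```

A failed parse is also a useful signal: it can trigger a retry, a fallback, or a new training example for the researcher adapter.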
Monitoring and Iterating
After deploying a fine-tuned crew, monitor these metrics:
- Per-agent task success rate: Does each agent complete its role successfully?
- End-to-end crew success rate: Does the full workflow produce acceptable output?
- Fallback rate: How often does a local agent trigger a GPT-4 fallback?
- Latency per agent: Is any fine-tuned agent slower than expected?
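The fallback-rate metric implies a small router in front of each local agent. A stub sketch: the confidence signal and threshold here are assumptions, and in practice confidence might come from log-probabilities or an output validator rather than the model itself:

```python
def run_with_fallback(task, local_agent, cloud_agent, stats, threshold=0.8):
    """Try the fine-tuned local agent first; escalate to the cloud model
    when confidence falls below the threshold. Tracks the fallback rate."""
    result, confidence = local_agent(task)
    stats["total"] += 1
    if confidence >= threshold:
        return result
    stats["fallbacks"] += 1
    return cloud_agent(task)

# Stub agents standing in for real LLM calls:
local = lambda t: (f"local:{t}", 0.9 if "easy" in t else 0.4)
cloud = lambda t: f"cloud:{t}"

stats = {"total": 0, "fallbacks": 0}
print(run_with_fallback("easy task", local, cloud, stats))   # → local:easy task
print(run_with_fallback("novel task", local, cloud, stats))  # → cloud:novel task
print(stats["fallbacks"] / stats["total"])                   # → 0.5
```

The same `stats` dict, kept per agent, gives you the fallback-rate metric directly.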
Log every execution. When a fine-tuned agent fails or produces subpar output, add the correct input-output pair to its training dataset. Retrain the adapter periodically — monthly is typical — with the expanded dataset. Each iteration improves the adapter's coverage of edge cases.
After 2-3 retraining cycles, most teams see fallback rates drop from 10-15% to 2-5%. By the fourth or fifth cycle, the fine-tuned crew handles 98%+ of tasks without cloud fallback.
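The retraining loop is mostly bookkeeping: every correction becomes a new training example. A sketch of the logging side, with a hypothetical dataset filename and an illustrative record shape:

```python
import json
from pathlib import Path

def log_correction(dataset_path: Path, agent_input: str,
                   corrected_output: str) -> None:
    """Append a corrected input-output pair to the agent's training set,
    ready for the next adapter retraining run."""
    record = {"messages": [
        {"role": "user", "content": agent_input},
        {"role": "assistant", "content": corrected_output},
    ]}
    with dataset_path.open("a") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical dataset file for the editor adapter's next training cycle:
path = Path("editor_adapter_v2.jsonl")
log_correction(path, "Edit this draft: ...", "Here is the tightened draft: ...")
```

At retraining time, this file is simply concatenated with the previous cycle's dataset and uploaded as a new fine-tuning job.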
When Multi-Agent Is Overkill
Before investing in fine-tuning a full CrewAI crew, ask whether you actually need multiple agents. A single well-prompted or fine-tuned model can handle many "multi-agent" workflows:
- If your agents are just doing sequential processing (researcher → writer → editor), a single model with structured prompting can do the same thing in 3 calls instead of 15.
- If your "collaboration" is really just a review loop, a single model with self-reflection can achieve similar quality.
Multi-agent architectures add real value when agents operate in parallel, when they have genuinely different tool sets, or when the task structure is complex enough that specialization produces measurably better results. If your crew is four agents doing what one agent could do in four steps, simplify first — then fine-tune.
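The sequential case is worth sketching. One model, three prompts, no crew overhead; `call_model` below is a stub standing in for whatever single-model client you use:

```python
# Collapsing a researcher -> writer -> editor crew into one model called
# three times. `call_model` is a stub, not a real LLM client.
def call_model(prompt: str) -> str:
    return f"<output for: {prompt[:30]}...>"

def sequential_pipeline(topic: str) -> str:
    brief = call_model(f"Research this topic and return a brief: {topic}")
    draft = call_model(f"Write an article from this brief: {brief}")
    final = call_model(f"Edit this draft for clarity and tone: {draft}")
    return final  # 3 calls total, versus 10-16 for the four-agent crew

print(sequential_pipeline("LoRA adapters"))
```

If this pipeline produces acceptable output, the crew was overhead; if quality measurably improves with specialized agents, fine-tune per role as described above.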
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Multi-Step AI Agents with Local Models — Architectural patterns for running complex agent workflows without cloud APIs.
- Reliable AI Agents with Fine-Tuned Local Models — Why fine-tuned models produce more consistent agent behavior at every step.
- Per-Client AI Agents for Agencies with LoRA — Using per-client LoRA adapters to run customized agent crews for each client.