
Open-Source Models for OpenClaw: Llama 3, Qwen 2.5, and Which to Fine-Tune
Not all open-source models work equally well as OpenClaw backends. Here's a practical comparison of Llama 3.1, Qwen 2.5, Mistral, and Phi-3 for agent tasks, with fine-tuning recommendations.
OpenClaw supports any model served through an OpenAI-compatible API. That includes dozens of open-source models available through Ollama, vLLM, and LM Studio. But not all models perform equally well for agent work.
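As a concrete example, here is a minimal sketch of pointing OpenAI-style client code at a local model, assuming an Ollama server running on its default port and the official openai Python package:

```python
# Minimal sketch: call a local Ollama server through its
# OpenAI-compatible endpoint. Assumes `ollama serve` is running on
# the default port and the model has already been pulled.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",  # required by the client library, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Summarise this thread in one sentence."}],
)
print(response.choices[0].message.content)
```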
Agent tasks demand a specific mix of capabilities: reliable instruction following, accurate tool use, multi-step reasoning, and consistent output formatting. A model that excels at creative writing might fail at structured data extraction. A model with strong reasoning might be too slow for real-time conversational agents.
This guide compares the leading open-source models for OpenClaw, with specific focus on how each performs after fine-tuning.
What Makes a Good OpenClaw Model
Before comparing models, here is what OpenClaw specifically requires:
1. Instruction Following
OpenClaw gives models structured instructions with specific output format requirements. The model needs to follow these precisely — deviating from the expected format breaks downstream processing.
2. Tool Use
OpenClaw uses function calling to interact with tools (file system, browser, messaging APIs). Models need to generate syntactically correct tool calls with the right parameters.
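On the wire, that looks like an OpenAI-style tools request. A sketch, reusing the client from above; the read_file tool below is a hypothetical illustration, not OpenClaw's actual schema:

```python
# Hypothetical tool definition in OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the local workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What does notes.txt say?"}],
    tools=tools,
)

# A model with solid tool-use training returns a structured call, not prose.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)  # read_file {"path": "notes.txt"}
```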
3. Multi-Step Reasoning
Many OpenClaw tasks involve chains of actions: read an email → classify it → look up related context → draft a response → send it. The model needs to plan and execute multi-step sequences reliably.
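A bare-bones agent loop shows why this matters: one malformed tool call anywhere in the chain stalls the whole task. A sketch, continuing with the client and tools defined above (the dispatch table is hypothetical):

```python
import json

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

DISPATCH = {"read_file": read_file}  # hypothetical tool registry

messages = [{"role": "user", "content": "Read notes.txt and summarise it."}]
for _ in range(5):  # hard cap so a confused model can't loop forever
    msg = client.chat.completions.create(
        model="llama3.1:8b", messages=messages, tools=tools
    ).choices[0].message
    messages.append(msg)
    if not msg.tool_calls:       # no tool calls left: the model is answering
        print(msg.content)
        break
    for call in msg.tool_calls:  # execute each call and feed the result back
        args = json.loads(call.function.arguments)
        result = DISPATCH[call.function.name](**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```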
4. Context Window
OpenClaw prompts can be long — they include conversation history, file contents, tool outputs, and system instructions. A minimum 8K context window is practical; 32K+ is preferred for document-heavy workflows.
5. Inference Speed
For conversational agent use cases (WhatsApp, Slack), response latency matters. Users expect sub-2-second responses for chat interactions. Batch processing tasks (report generation, email triage) are more tolerant of latency.
Model Comparison
Llama 3.1 8B
Strengths for OpenClaw:
- Strong instruction following out of the box
- Good tool-use support (Meta specifically trained Llama 3.1 for function calling)
- 128K context window
- Extensive community support and fine-tuning resources
- Wide compatibility across inference frameworks
Weaknesses:
- Slightly weaker on structured data extraction compared to Qwen
- Larger memory footprint than some alternatives at the same capability level
Best for: General-purpose OpenClaw agents, conversational tasks, multi-step workflows
Fine-tuning notes: Responds well to LoRA fine-tuning with rank 16-32. The large context window means it handles document-heavy fine-tuning datasets without truncation issues. Fine-tuned Llama 3.1 8B is the most broadly recommended starting point for OpenClaw deployments.
Hardware: Q5_K_M quantisation runs on 8GB+ RAM. Comfortable on M-series Macs, any GPU with 8GB+ VRAM.
Qwen 2.5 7B
Strengths for OpenClaw:
- Excellent structured output generation (JSON, tables, schemas)
- Strong multilingual support (particularly good for CJK languages)
- Good at data extraction and classification tasks
- Efficient inference speed
- 128K context window
Weaknesses:
- Slightly less natural in open-ended conversation compared to Llama 3.1
- Smaller community fine-tuning ecosystem (growing rapidly)
Best for: Data extraction, report generation, classification tasks, multilingual deployments
Fine-tuning notes: Particularly responsive to fine-tuning for structured output tasks. If your OpenClaw workflows are heavy on data extraction, invoice processing, or categorisation, Qwen 2.5 7B often outperforms Llama 3.1 8B after fine-tuning on the same dataset. Use LoRA rank 16 and 3-4 epochs.
Hardware: Slightly smaller than Llama 3.1 8B, runs well on 8GB+ RAM. Excellent performance on M-series Macs.
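If structured extraction is your main workload, it is also worth testing JSON-constrained decoding before fine-tuning. A sketch, reusing the client from earlier and assuming your inference server honours OpenAI-style response_format (Ollama's compatibility layer does):

```python
import json

invoice_text = "Invoice from ACME Ltd. Total: $1,240.00. Due: 2025-03-01."

response = client.chat.completions.create(
    model="qwen2.5:7b",
    response_format={"type": "json_object"},  # constrain output to valid JSON
    messages=[
        {"role": "system", "content": "Extract vendor, total, and due_date from the invoice. Reply with JSON only."},
        {"role": "user", "content": invoice_text},
    ],
)
invoice = json.loads(response.choices[0].message.content)
print(invoice)  # e.g. {"vendor": "ACME Ltd", "total": "$1,240.00", "due_date": "2025-03-01"}
```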
Mistral 7B / Mistral Nemo 12B
Strengths for OpenClaw:
- Fast inference speed (optimised architecture)
- Good reasoning capabilities relative to parameter count
- Nemo 12B offers strong middle ground between 7B and larger models
- Sliding window attention for efficient long-context handling
Weaknesses:
- Weaker tool-use support out of the box compared to Llama 3.1
- Smaller context window on base Mistral 7B (32K, though often sufficient)
- Less consistent structured output formatting
Best for: Speed-critical conversational agents, reasoning-heavy tasks where latency matters
Fine-tuning notes: Responds well to fine-tuning but requires more training examples for tool-use tasks compared to Llama 3.1. If your OpenClaw use case is primarily conversational (chat support, email drafting), Mistral's speed advantage makes it worth evaluating.
Hardware: Mistral 7B is highly efficient — runs on 6GB+ RAM. Nemo 12B needs 10GB+.
Phi-3 Mini (3.8B) / Phi-3 Medium (14B)
Strengths for OpenClaw:
- Phi-3 Mini is remarkably capable for its size — runs on very modest hardware
- Good instruction following despite small parameter count
- Phi-3 Medium offers near-frontier reasoning in a manageable package
- Excellent for edge deployment or resource-constrained environments
Weaknesses:
- Phi-3 Mini struggles with complex multi-step agent tasks
- Limited multilingual capability
- Smaller community and fewer fine-tuning examples available
Best for: Lightweight agents on constrained hardware, simple automation tasks, IoT/edge deployments
Fine-tuning notes: Phi-3 Mini benefits enormously from fine-tuning — the small base model has more room for domain-specific improvement. For simple, focused tasks (single-category classification, template-based responses), a fine-tuned Phi-3 Mini can match much larger models at a fraction of the compute cost.
Hardware: Phi-3 Mini runs on 4GB RAM. Phi-3 Medium needs 12GB+.
Recommendations by Use Case
| OpenClaw Use Case | Recommended Base Model | Why |
|---|---|---|
| General-purpose agent | Llama 3.1 8B | Best all-round instruction following and tool use |
| Email triage and response | Llama 3.1 8B or Qwen 2.5 7B | Both strong; Qwen edges ahead on classification |
| Document/data extraction | Qwen 2.5 7B | Best structured output generation |
| Customer support chat | Llama 3.1 8B | Natural conversational tone |
| Report generation | Qwen 2.5 7B | Consistent template adherence |
| Multilingual agent | Qwen 2.5 7B | Strongest multilingual support |
| Speed-critical chat | Mistral 7B | Fastest inference at this capability tier |
| Resource-constrained deployment | Phi-3 Mini 3.8B | Runs on minimal hardware |
| Complex reasoning tasks | Mistral Nemo 12B or Phi-3 Medium 14B | More parameters for harder problems |
| Agency (per-client adapters) | Llama 3.1 8B | Best LoRA adapter ecosystem, wide compatibility |
Quantisation Guide for OpenClaw
Quantisation level affects both quality and speed. Here is how each level performs for agent tasks:
| Quantisation | Quality Impact | Speed | RAM Needed (7B) | Recommended For |
|---|---|---|---|---|
| Q8_0 | Minimal loss | Baseline | ~8GB | Quality-critical tasks, evaluation |
| Q6_K | Near-lossless | +10% faster | ~7GB | Production agent work (recommended default) |
| Q5_K_M | Very slight loss | +20% faster | ~6GB | Good balance for most deployments |
| Q4_K_M | Noticeable on complex tasks | +30% faster | ~5GB | Simple tasks, speed-critical |
| Q4_K_S | Meaningful quality drop | +35% faster | ~4.5GB | Not recommended for agent work |
For OpenClaw, Q5_K_M or Q6_K is the sweet spot. Agent tasks involve chained reasoning where quality degradation compounds across steps. The small speed gain from Q4 quantisation is not worth the reliability loss in multi-step workflows.
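The RAM column above is mostly arithmetic: weight memory is roughly parameters times bits per weight divided by 8, plus headroom for the KV cache and runtime. A back-of-envelope check (the bits-per-weight figures are approximate averages for each GGUF scheme, not exact specifications):

```python
# Rough RAM estimate for a 7B model at common GGUF quantisation levels.
PARAMS = 7e9
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

for name, bpw in BITS_PER_WEIGHT.items():
    weights_gb = PARAMS * bpw / 8 / 1e9
    # ~1GB headroom for KV cache, activations, and the runtime itself.
    print(f"{name}: ~{weights_gb:.1f}GB weights, ~{weights_gb + 1:.1f}GB total")
```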
Fine-Tuning Strategy
Regardless of which base model you choose, the fine-tuning approach is similar:
Data Preparation
- Export your OpenClaw interaction logs (the tasks it handles most often)
- Format as instruction/response pairs in JSONL (a conversion sketch follows this list)
- Include examples of tool calls if your workflows use them
- Include examples of multi-step reasoning chains
- Aim for 500-2,000 examples
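A minimal conversion sketch; the input log format here (prompt/reply dicts) is hypothetical, so map the field names from however your logs are actually stored:

```python
import json

# Hypothetical log entries; in practice, load these from your
# exported OpenClaw interaction logs.
logs = [
    {"prompt": "Classify this email: 'Invoice #442 attached'", "reply": "billing"},
    {"prompt": "Classify this email: 'The staging server is down'", "reply": "incident"},
]

with open("train.jsonl", "w") as f:
    for entry in logs:
        record = {"instruction": entry["prompt"], "response": entry["reply"]}
        f.write(json.dumps(record) + "\n")
```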
Training Configuration
- LoRA rank: 16 (start here; increase to 32 if accuracy plateaus; see the config sketch after this list)
- Epochs: 3-4 (monitor for overfitting on the validation set)
- Learning rate: 2e-4 (standard for LoRA fine-tuning)
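With Hugging Face's peft library, those settings look roughly like this; the target_modules listed are the usual attention projections for Llama-family models, so verify the module names for your base model:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                   # LoRA rank; raise to 32 if accuracy plateaus
    lora_alpha=32,          # common convention: alpha = 2 * rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# Wrap your base model with peft.get_peft_model(model, lora_config), then
# train for 3-4 epochs at learning rate 2e-4, watching validation loss.
```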
Evaluation
- Test on a held-out set (20% of your data)
- Measure task-specific accuracy such as classification F1, schema compliance, and response quality (a scoring sketch follows this list)
- Compare against the base model on the same test set to quantify improvement
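For classification-style tasks, the scoring is a few lines. A sketch, assuming scikit-learn and reusing the client from earlier; the model name and label set are hypothetical:

```python
from sklearn.metrics import f1_score

def predict(text: str) -> str:
    # Ask the fine-tuned model for a single-word label.
    # "my-finetuned-model" is a placeholder for your exported model's name.
    reply = client.chat.completions.create(
        model="my-finetuned-model",
        messages=[
            {"role": "system", "content": "Reply with exactly one label: billing or incident."},
            {"role": "user", "content": text},
        ],
    )
    return reply.choices[0].message.content.strip()

held_out = [("Invoice #442 attached", "billing"), ("The staging server is down", "incident")]
y_true = [label for _, label in held_out]
y_pred = [predict(text) for text, _ in held_out]
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```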
Iteration
- Collect misclassified examples from production use
- Add them to the training set
- Re-fine-tune (typically 1-2 iterations to reach production quality)
With Ertas Studio, the entire process — upload, configure, train, evaluate, export GGUF — takes 30-90 minutes per iteration with no code required.
Getting Started
- Pick a base model from the recommendations above based on your primary use case
- Pull it via Ollama: `ollama pull llama3.1:8b` or `ollama pull qwen2.5:7b`
- Test it with OpenClaw on your actual tasks to establish a baseline
- Collect training data from your workflows (500+ examples)
- Fine-tune on Ertas Studio — upload, train, export GGUF
- Deploy the fine-tuned model via Ollama and compare against the baseline
Most teams start with Llama 3.1 8B (the safest all-round choice), fine-tune once, and then evaluate whether a different base model would serve their specific workload better. The fine-tuning investment (a few hundred training examples) is transferable — you can always re-fine-tune on a different base model using the same dataset.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.