    Open-Source Models for OpenClaw: Llama 3, Qwen 2.5, and Which to Fine-Tune

    Not all open-source models work equally well as OpenClaw backends. Here's a practical comparison of Llama 3.3, Qwen 2.5, Mistral, and Phi-3 for agent tasks, with fine-tuning recommendations.

    Ertas Team

    OpenClaw supports any model served through an OpenAI-compatible API. That includes dozens of open-source models available through Ollama, vLLM, and LM Studio. But not all models perform equally well for agent work.

    Agent tasks demand a specific mix of capabilities: reliable instruction following, accurate tool use, multi-step reasoning, and consistent output formatting. A model that excels at creative writing might fail at structured data extraction. A model with strong reasoning might be too slow for real-time conversational agents.

    This guide compares the leading open-source models for OpenClaw, with specific focus on how each performs after fine-tuning.

    What Makes a Good OpenClaw Model

    Before comparing models, here is what OpenClaw specifically requires:

    1. Instruction Following

    OpenClaw gives models structured instructions with specific output format requirements. The model needs to follow these precisely — deviating from the expected format breaks downstream processing.

    2. Tool Use

    OpenClaw uses function calling to interact with tools (file system, browser, messaging APIs). Models need to generate syntactically correct tool calls with the right parameters.
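A well-formed tool call in the OpenAI-compatible function-calling format looks like the sketch below. The send_message tool and its parameters are invented for illustration and are not part of OpenClaw's actual tool set; the point is the shape a backend model must produce.

```python
import json

# Hypothetical tool definition in the OpenAI-compatible function-calling
# format that backends like Ollama and vLLM accept. The "send_message"
# tool and its parameters are illustrative only.
tools = [{
    "type": "function",
    "function": {
        "name": "send_message",
        "description": "Send a chat message to a recipient",
        "parameters": {
            "type": "object",
            "properties": {
                "recipient": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["recipient", "body"],
        },
    },
}]

# A well-behaved model emits a syntactically valid call with the right
# parameter names -- this is what "accurate tool use" means in practice:
model_output = '{"name": "send_message", "arguments": {"recipient": "alex", "body": "On it."}}'
call = json.loads(model_output)
assert call["name"] == tools[0]["function"]["name"]
```

Models that hallucinate parameter names or emit malformed JSON here are the ones that fail as agent backends, regardless of how well they chat.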

    3. Multi-Step Reasoning

    Many OpenClaw tasks involve chains of actions: read an email → classify it → look up related context → draft a response → send it. The model needs to plan and execute multi-step sequences reliably.
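The email-triage chain above can be sketched as a pipeline of steps. The functions here are trivial stubs standing in for model and tool calls, purely to show the shape of a multi-step workflow and why an error at any step compounds downstream.

```python
# Sketch of the email chain: classify -> look up context -> draft.
# Each step is a stub; in a real deployment these are model/tool calls.
def classify(email):
    return "support" if "help" in email.lower() else "general"

def lookup_context(label):
    # Stand-in for a retrieval step; keys/values are illustrative.
    return {"support": "Known issue #42", "general": ""}[label]

def draft_response(email, context):
    suffix = f" ({context})" if context else ""
    return f"Re: {email[:20]}...{suffix}"

def run_chain(email):
    label = classify(email)
    context = lookup_context(label)
    draft = draft_response(email, context)
    return label, draft  # the final send step is omitted in this sketch
```

If classification is wrong at step one, every later step operates on bad state, which is why per-step reliability matters more for agents than for single-shot generation.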

    4. Context Window

    OpenClaw prompts can be long — they include conversation history, file contents, tool outputs, and system instructions. A minimum 8K context window is practical; 32K+ is preferred for document-heavy workflows.

    5. Inference Speed

    For conversational agent use cases (WhatsApp, Slack), response latency matters. Users expect sub-2-second responses for chat interactions. Batch processing tasks (report generation, email triage) are more tolerant of latency.

    Model Comparison

    Llama 3.3 8B

    Strengths for OpenClaw:

    • Strong instruction following out of the box
    • Good tool-use support (Meta specifically trained for function calling in Llama 3)
    • 128K context window
    • Extensive community support and fine-tuning resources
    • Wide compatibility across inference frameworks

    Weaknesses:

    • Slightly weaker on structured data extraction compared to Qwen
    • Larger memory footprint than some alternatives at the same capability level

    Best for: General-purpose OpenClaw agents, conversational tasks, multi-step workflows

    Fine-tuning notes: Responds well to LoRA fine-tuning with rank 16-32. The large context window means it handles document-heavy fine-tuning datasets without truncation issues. Fine-tuned Llama 3.3 8B is the most broadly recommended starting point for OpenClaw deployments.

    Hardware: Q5_K_M quantisation runs on 8GB+ RAM. Comfortable on M-series Macs, any GPU with 8GB+ VRAM.

    Qwen 2.5 7B

    Strengths for OpenClaw:

    • Excellent structured output generation (JSON, tables, schemas)
    • Strong multilingual support (particularly good for CJK languages)
    • Good at data extraction and classification tasks
    • Efficient inference speed
    • 128K context window

    Weaknesses:

    • Slightly less natural in open-ended conversation compared to Llama 3.3
    • Smaller community fine-tuning ecosystem (growing rapidly)

    Best for: Data extraction, report generation, classification tasks, multilingual deployments

    Fine-tuning notes: Particularly responsive to fine-tuning for structured output tasks. If your OpenClaw workflows are heavy on data extraction, invoice processing, or categorisation, Qwen 2.5 7B often outperforms Llama 3.3 8B after fine-tuning on the same dataset. Use rank 16, 3-4 epochs.

    Hardware: Slightly smaller than Llama 3.3, runs well on 8GB+ RAM. Excellent performance on M-series Macs.

    Mistral 7B / Mistral Nemo 12B

    Strengths for OpenClaw:

    • Fast inference speed (optimised architecture)
    • Good reasoning capabilities relative to parameter count
    • Nemo 12B offers strong middle ground between 7B and larger models
    • Sliding window attention for efficient long-context handling

    Weaknesses:

    • Weaker tool-use support out of the box compared to Llama 3.3
    • Smaller context window on base Mistral 7B (32K, though often sufficient)
    • Less consistent structured output formatting

    Best for: Speed-critical conversational agents, reasoning-heavy tasks where latency matters

    Fine-tuning notes: Responds well to fine-tuning but requires more training examples for tool-use tasks compared to Llama 3.3. If your OpenClaw use case is primarily conversational (chat support, email drafting), Mistral's speed advantage makes it worth evaluating.

    Hardware: Mistral 7B is highly efficient — runs on 6GB+ RAM. Nemo 12B needs 10GB+.

    Phi-3 Mini (3.8B) / Phi-3 Medium (14B)

    Strengths for OpenClaw:

    • Phi-3 Mini is remarkably capable for its size — runs on very modest hardware
    • Good instruction following despite small parameter count
    • Phi-3 Medium offers near-frontier reasoning in a manageable package
    • Excellent for edge deployment or resource-constrained environments

    Weaknesses:

    • Phi-3 Mini struggles with complex multi-step agent tasks
    • Limited multilingual capability
    • Smaller community and fewer fine-tuning examples available

    Best for: Lightweight agents on constrained hardware, simple automation tasks, IoT/edge deployments

    Fine-tuning notes: Phi-3 Mini benefits enormously from fine-tuning — the small base model has more room for domain-specific improvement. For simple, focused tasks (single-category classification, template-based responses), a fine-tuned Phi-3 Mini can match much larger models at a fraction of the compute cost.

    Hardware: Phi-3 Mini runs on 4GB RAM. Phi-3 Medium needs 12GB+.

    Recommendations by Use Case

    | OpenClaw Use Case | Recommended Base Model | Why |
    | --- | --- | --- |
    | General-purpose agent | Llama 3.3 8B | Best all-round instruction following and tool use |
    | Email triage and response | Llama 3.3 8B or Qwen 2.5 7B | Both strong; Qwen edges ahead on classification |
    | Document/data extraction | Qwen 2.5 7B | Best structured output generation |
    | Customer support chat | Llama 3.3 8B | Natural conversational tone |
    | Report generation | Qwen 2.5 7B | Consistent template adherence |
    | Multilingual agent | Qwen 2.5 7B | Strongest multilingual support |
    | Speed-critical chat | Mistral 7B | Fastest inference at this capability tier |
    | Resource-constrained deployment | Phi-3 Mini 3.8B | Runs on minimal hardware |
    | Complex reasoning tasks | Mistral Nemo 12B or Phi-3 Medium 14B | More parameters for harder problems |
    | Agency (per-client adapters) | Llama 3.3 8B | Best LoRA adapter ecosystem, wide compatibility |

    Quantisation Guide for OpenClaw

    Quantisation level affects both quality and speed. Here is how each level performs for agent tasks:

    | Quantisation | Quality Impact | Speed | RAM Needed (7B) | Recommended For |
    | --- | --- | --- | --- | --- |
    | Q8_0 | Minimal loss | Baseline | ~8GB | Quality-critical tasks, evaluation |
    | Q6_K | Near-lossless | +10% faster | ~7GB | Production agent work (recommended default) |
    | Q5_K_M | Very slight loss | +20% faster | ~6GB | Good balance for most deployments |
    | Q4_K_M | Noticeable on complex tasks | +30% faster | ~5GB | Simple tasks, speed-critical |
    | Q4_K_S | Meaningful quality drop | +35% faster | ~4.5GB | Not recommended for agent work |

    For OpenClaw, Q5_K_M or Q6_K is the sweet spot. Agent tasks involve chained reasoning where quality degradation compounds across steps. The small speed gain from Q4 quantisation is not worth the reliability loss in multi-step workflows.

    Fine-Tuning Strategy

    Regardless of which base model you choose, the fine-tuning approach is similar:

    Data Preparation

    1. Export your OpenClaw interaction logs (the tasks it handles most often)
    2. Format as instruction/response pairs in JSONL
    3. Include examples of tool calls if your workflows use them
    4. Include examples of multi-step reasoning chains
    5. Aim for 500-2,000 examples
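Steps 1-3 can be sketched as below. The instruction/response field names are a common fine-tuning convention, not a fixed OpenClaw export schema, and the example log entries are invented; match whatever format your training tool expects.

```python
import json

# Sketch: turn exported interaction logs into instruction/response JSONL.
# The log structure and field names here are illustrative assumptions.
logs = [
    {"task": "Classify this email: 'Invoice attached for March'", "output": "billing"},
    {"task": "Classify this email: 'Password reset not working'", "output": "support"},
]

lines = [
    json.dumps({"instruction": entry["task"], "response": entry["output"]})
    for entry in logs
]

# One JSON object per line -- the JSONL format most trainers expect.
with open("train.jsonl", "w") as f:
    f.write("\n".join(lines))
```

For tool-use and multi-step examples, the response field would contain the tool call or reasoning chain rather than a bare label.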

    Training Configuration

    • LoRA rank: 16 (start here; increase to 32 if accuracy plateaus)
    • Epochs: 3-4 (monitor for overfitting on the validation set)
    • Learning rate: 2e-4 (standard for LoRA fine-tuning)
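Gathered into one place, those hyperparameters might look like the config sketch below. The key names mirror common LoRA tooling (Hugging Face PEFT's LoraConfig uses r, lora_alpha, and lora_dropout), but exact names vary by trainer, and the alpha and dropout values are typical defaults rather than recommendations from this guide.

```python
# Sketch of a LoRA training config matching the bullets above.
# Key names follow common tooling conventions; adapt to your trainer.
lora_config = {
    "r": 16,                 # LoRA rank -- raise to 32 if accuracy plateaus
    "lora_alpha": 32,        # assumed value; 2x rank is a common default
    "lora_dropout": 0.05,    # assumed value; light regularisation
    "num_train_epochs": 3,   # 3-4; monitor the validation set for overfitting
    "learning_rate": 2e-4,   # standard for LoRA fine-tuning
}
```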

    Evaluation

    • Test on a held-out set (20% of your data)
    • Measure task-specific accuracy (classification F1, schema compliance, response quality)
    • Compare against the base model on the same test set to quantify improvement
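A minimal sketch of the held-out split and accuracy measurement, assuming classification-style tasks; predict is a placeholder for a call to your deployed model, and the seeded shuffle keeps the split reproducible across base-model comparisons.

```python
import random

# Hold out a fraction of examples for evaluation, then score predictions.
def split_holdout(examples, frac=0.2, seed=7):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # seeded: same split every run
    cut = int(len(shuffled) * (1 - frac))
    return shuffled[:cut], shuffled[cut:]  # train, test

def accuracy(test_set, predict):
    # `predict` stands in for a call to the base or fine-tuned model.
    hits = sum(1 for x in test_set if predict(x["instruction"]) == x["response"])
    return hits / len(test_set)
```

Running the same test set through both the base and fine-tuned models gives the before/after comparison; for multi-class tasks you would swap accuracy for per-class F1.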

    Iteration

    • Collect misclassified examples from production use
    • Add them to the training set
    • Re-fine-tune (typically 1-2 iterations to reach production quality)

    With Ertas Studio, the entire process — upload, configure, train, evaluate, export GGUF — takes 30-90 minutes per iteration with no code required.

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    Getting Started

    1. Pick a base model from the recommendations above based on your primary use case
    2. Pull it via Ollama: ollama pull llama3.3:8b or ollama pull qwen2.5:7b
    3. Test it with OpenClaw on your actual tasks to establish a baseline
    4. Collect training data from your workflows (500+ examples)
    5. Fine-tune on Ertas Studio — upload, train, export GGUF
    6. Deploy the fine-tuned model via Ollama and compare against the baseline
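Step 3 can be sketched against Ollama's OpenAI-compatible endpoint, which it serves at localhost:11434 by default. The model tag and prompt are whatever you pulled and whatever your real tasks look like; temperature 0 is an assumption chosen to keep baseline runs comparable.

```python
import json
import urllib.request

def build_payload(model, prompt):
    # OpenAI-compatible chat payload; temperature 0 makes baselines repeatable.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }

def baseline_request(model, prompt):
    # Sends one real task to a locally running Ollama server.
    req = urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Run a representative batch of your actual tasks through this before fine-tuning; the same script then scores the fine-tuned model in step 6.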

    Most teams start with Llama 3.3 8B (safest all-round choice), fine-tune once, and then evaluate whether a different base model would serve their specific workload better. The fine-tuning investment (a few hundred training examples) is transferable — you can always re-fine-tune on a different base model using the same dataset.
