    Extending OpenClaw with Custom Skills Powered by Fine-Tuned Models
    Tags: openclaw, fine-tuning, custom-skills, security, segment:indie-dev, segment:agency


    The ClawHub supply chain attack compromised 800+ skills. Build your own instead — backed by fine-tuned models that are safer, more accurate, and tailored to your domain.

    Ertas Team

    OpenClaw's skill system is one of its most powerful features — and currently its most dangerous liability.

    The ClawHub marketplace was designed to let the community share reusable capabilities: a skill for managing calendars, another for processing invoices, another for monitoring server health. In practice, the ClawHavoc campaign found 341 malicious skills delivering the Atomic macOS Stealer, and subsequent scans identified over 800 compromised entries — roughly 20% of the entire registry.

    The supply chain is poisoned. But the concept of extending OpenClaw with domain-specific capabilities is sound. The fix is not to avoid skills — it is to build your own, backed by models fine-tuned on your domain data rather than generic API calls.

    Why Custom Skills Beat Community Skills

    1. No Supply Chain Risk

    When you build a skill yourself, you control every line of code and every model call. There is no dependency on third-party authors, no risk of malicious updates, no need to audit someone else's code every time they push a new version.

    2. Better Performance

    Community skills are built to work generically — they use broad system prompts to handle any user's data. A custom skill backed by a fine-tuned model is specialised for your specific task, your specific data format, and your specific output requirements.

    3. Data Stays Local

    If you pair custom skills with a local fine-tuned model, the data processed by the skill never leaves your infrastructure. Community skills typically route through whatever cloud API OpenClaw is configured with — meaning your data flows through third-party servers even when the skill itself is harmless.

    Anatomy of a Custom OpenClaw Skill

    An OpenClaw skill is a self-contained capability with defined inputs, a processing function, and structured outputs. At its core, each skill is a prompt template that instructs the underlying model on how to handle a specific type of task.

    The key components:

    • Trigger: How the skill is invoked (keyword, pattern match, or automatic detection)
    • Context gathering: What data the skill collects before calling the model
    • Model interaction: The prompt template and expected output format
    • Action: What the skill does with the model's response

    When the underlying model is fine-tuned for the specific task, the prompt template can be simpler (less instruction needed), the output is more consistent (fewer format deviations), and accuracy improves (domain knowledge is in the weights, not crammed into system prompts).
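    The components above can be sketched as a small data structure. This is a hypothetical shape for illustration; OpenClaw's actual skill API will differ in its details:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    """Illustrative skill shape: trigger, context gathering, model interaction, action."""
    name: str
    trigger: Callable[[str], bool]          # does this message invoke the skill?
    gather_context: Callable[[str], dict]   # data collected before the model call
    prompt_template: str                    # the model interaction
    act: Callable[[str], None]              # what to do with the model's response

    def handle(self, message: str, model: Callable[[str], str]) -> bool:
        """Run the skill against a message; return True if it fired."""
        if not self.trigger(message):
            return False
        ctx = self.gather_context(message)
        prompt = self.prompt_template.format(message=message, **ctx)
        self.act(model(prompt))
        return True
```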

    Five Custom Skills Worth Building

    1. Support Ticket Triage Skill

    What it does: Monitors incoming support channels, classifies tickets by category and priority, routes to the appropriate team, and drafts initial responses.

    Why fine-tuning matters: Your support taxonomy is unique. The difference between "billing issue" and "subscription management" depends on your product's specific structure. A fine-tuned model trained on 500+ categorised tickets learns these distinctions precisely — a generic model guesses from a description.

    Training data: Export your last 6 months of support tickets with their categories, priorities, and initial responses. Format as instruction/response pairs.

    Expected improvement: Categorisation accuracy typically jumps from 70-75% (generic model with system prompt) to 90-95% (fine-tuned model).

    2. Contract Review Skill

    What it does: Processes uploaded contracts, flags unfavourable clauses, extracts key terms (dates, amounts, obligations), and generates a summary with risk assessment.

    Why fine-tuning matters: "Unfavourable" is subjective and domain-specific. What counts as a risk clause for a SaaS vendor agreement is different from a construction subcontract. Fine-tuning on your organisation's contract review history teaches the model your specific risk criteria.

    Training data: 200-500 reviewed contracts with annotated clauses (flagged/not flagged) and summary outputs.

    Expected improvement: Clause flagging accuracy reaches 90% with fine-tuning, compared to 65-75% with prompt-engineered generic models.

    3. Daily Report Generator Skill

    What it does: Pulls data from configured sources (dashboards, databases, APIs), generates a narrative report in your template format, and distributes to stakeholders via the appropriate channel.

    Why fine-tuning matters: Report format consistency. A fine-tuned model has seen hundreds of your reports and replicates the exact structure, tone, and analytical style every time. Generic models vary their output format unpredictably.

    Training data: 100-300 previous reports paired with the data inputs that generated them.

    Expected improvement: Template adherence goes from 80-85% to 97%+.

    4. Email Draft Skill

    What it does: Analyses incoming emails, identifies the required response type, drafts a reply matching the appropriate tone and level of detail, and queues for human review before sending.

    Why fine-tuning matters: Every person and organisation has a distinct email voice. Fine-tuning on your sent emails captures your communication style — formality level, greeting conventions, sign-off preferences, how you handle different relationship types (client vs. colleague vs. vendor).

    Training data: 500-1,000 of your sent emails with their triggering incoming messages.

    Expected improvement: Draft acceptance rate (sent without edits) typically doubles from 30-40% to 60-75%.

    5. Data Extraction and Normalisation Skill

    What it does: Processes incoming documents (invoices, purchase orders, intake forms) and extracts structured data into a consistent schema for downstream systems.

    Why fine-tuning matters: Schema compliance. When OpenClaw feeds extracted data into databases, APIs, or spreadsheets, every deviation from the expected schema causes an error. Fine-tuned models achieve 99%+ schema compliance because they have seen the exact output format hundreds of times during training.

    Training data: 200-500 documents with their corresponding structured data outputs.

    Expected improvement: Schema compliance from 79% to 99%.
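    To make "schema compliance" concrete, here is a minimal downstream validator for an illustrative invoice schema. The field names are hypothetical, not a real OpenClaw schema:

```python
# Illustrative invoice schema: field name -> expected Python type.
EXPECTED = {
    "invoice_number": str,
    "vendor": str,
    "total": float,
    "due_date": str,   # ISO 8601, e.g. "2025-03-01"
}

def validate(record: dict) -> list[str]:
    """Return a list of schema violations; an empty list means compliant."""
    errors = []
    for key, typ in EXPECTED.items():
        if key not in record:
            errors.append(f"missing field: {key}")
        elif not isinstance(record[key], typ):
            errors.append(f"wrong type for {key}: expected {typ.__name__}")
    for key in record:
        if key not in EXPECTED:
            errors.append(f"unexpected field: {key}")
    return errors
```

    Every non-empty result is a record a downstream database or API would reject, which is why the compliance percentage matters so much.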

    The Build Process

    For each custom skill:

    1. Collect Training Data

    Export examples of the task from your existing workflows. The format should be:

    {
      "instruction": "Classify this support ticket and draft a response",
      "input": "[ticket content]",
      "output": "Category: Billing\nPriority: Medium\nResponse: [draft response]"
    }
    

    Aim for 500+ examples. More data generally means better performance, but even 200 high-quality examples produce meaningful improvement over a generic model.
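    As a sketch of the export step, assuming a CSV ticket export with content, category, priority, and response columns (adjust the column names to whatever your helpdesk actually exports):

```python
import csv
import json

def tickets_to_jsonl(csv_path: str, out_path: str) -> int:
    """Convert a ticket export CSV into instruction/response JSONL; return the count."""
    count = 0
    with open(csv_path, newline="") as f, open(out_path, "w") as out:
        for row in csv.DictReader(f):
            example = {
                "instruction": "Classify this support ticket and draft a response",
                "input": row["content"],
                "output": (f"Category: {row['category']}\n"
                           f"Priority: {row['priority']}\n"
                           f"Response: {row['response']}"),
            }
            out.write(json.dumps(example) + "\n")
            count += 1
    return count
```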

    2. Fine-Tune the Model

    Upload your dataset to Ertas Studio. Select a base model — Qwen 2.5 7B or Llama 3.1 8B work well for most skill tasks. Run a LoRA fine-tuning job (rank 16, 3 epochs is a reliable starting point).

    Evaluate against a held-out test set. Iterate if accuracy is below your threshold.
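    A minimal held-out evaluation can be done in plain Python. The helpers below are illustrative; category_accuracy assumes the first output line carries the category, as in the example format above:

```python
import random

def split_dataset(examples: list, test_fraction: float = 0.1, seed: int = 42):
    """Shuffle and split examples into (train, test) sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def category_accuracy(predict, test_set: list) -> float:
    """Fraction of test examples whose predicted first line matches the reference."""
    correct = sum(
        1 for ex in test_set
        if predict(ex["input"]).splitlines()[0] == ex["output"].splitlines()[0]
    )
    return correct / len(test_set)
```

    Keeping the test split fixed (same seed) across iterations makes accuracy numbers comparable between fine-tuning runs.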

    Export as GGUF.

    3. Deploy Locally

    Deploy the GGUF model via Ollama. Configure OpenClaw to use your local model for this skill.
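    A deployment sketch, assuming the export was saved as ticket-triage.gguf — the model name and temperature setting are illustrative; see Ollama's Modelfile documentation for the full syntax:

```shell
# Package the exported GGUF as an Ollama model.
cat > Modelfile <<'EOF'
FROM ./ticket-triage.gguf
PARAMETER temperature 0.2
EOF

ollama create ticket-triage -f Modelfile

# Smoke-test the model locally before wiring it into OpenClaw.
ollama run ticket-triage "Classify: customer cannot update card details"
```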

    4. Build the Skill

    Write the skill definition — trigger conditions, context gathering logic, prompt template, and output actions. Because the model is fine-tuned for the task, the prompt template can be minimal. You do not need to spend paragraphs of system prompt describing the output format or domain rules — the model already knows them.
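    To make the contrast concrete, here are two illustrative prompt templates for the triage skill: the verbose one a generic model needs, and the minimal one a fine-tuned model gets by with:

```python
# What a generic model needs spelled out in the template (illustrative):
GENERIC_PROMPT = """You are a support triage assistant. Classify the ticket
into exactly one of: Billing, Subscription Management, Technical, Account.
Priority must be one of: Low, Medium, High. Respond in exactly this format:
Category: <category>
Priority: <priority>
Response: <draft reply in a friendly, concise tone, max 120 words>

Ticket:
{ticket}"""

# The fine-tuned model learned the taxonomy and format from training data,
# so the skill's template can be minimal:
FINETUNED_PROMPT = "Classify this support ticket and draft a response:\n{ticket}"
```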

    5. Test and Iterate

    Run the skill against real data. Collect cases where the model underperforms. Add these as training examples for the next fine-tuning iteration.
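    One lightweight way to capture these cases, assuming the JSONL format from step 1 (the helper name is hypothetical):

```python
import json

def log_failure(path: str, ticket: str, corrected_output: str) -> None:
    """Append an underperforming case, with its human-corrected output, to the
    dataset for the next fine-tuning iteration."""
    example = {
        "instruction": "Classify this support ticket and draft a response",
        "input": ticket,
        "output": corrected_output,
    }
    with open(path, "a") as f:
        f.write(json.dumps(example) + "\n")
```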

    One Model or Many?

    For most teams, a single fine-tuned model handling multiple skill types works well if the tasks share a domain. An agency managing client communications might train one model on email drafting, ticket triage, and report generation for a single client.

    When tasks are significantly different — contract review vs. code generation vs. medical note summarisation — separate fine-tuned models per skill type perform better. Ollama supports loading multiple models, and OpenClaw can route different skills to different model endpoints.

    The LoRA adapter approach is particularly efficient here: share a single base model, load task-specific adapters per skill. Storage overhead is minimal (50-200MB per adapter), and adapter switching is fast.
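    The routing itself can be as simple as a lookup table; skill and model names here are illustrative:

```python
# Map each skill to its fine-tuned model; fall back to a generic base model.
SKILL_MODELS = {
    "ticket-triage": "triage-ft",
    "contract-review": "contracts-ft",
    "daily-report": "reports-ft",
}
DEFAULT_MODEL = "qwen2.5:7b"

def model_for_skill(skill_name: str) -> str:
    """Return the model endpoint name a given skill should call."""
    return SKILL_MODELS.get(skill_name, DEFAULT_MODEL)
```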

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    The Security Advantage

    By building custom skills backed by local fine-tuned models, you eliminate two attack vectors simultaneously:

    1. Supply chain risk: No dependency on community-authored skills that may contain malicious code
    2. Data exfiltration risk: No data transmitted to cloud APIs during skill execution

    Your skills run entirely on your infrastructure, processing your data through your models. The only external dependency is the base model weights you downloaded once.

    In a landscape where 20% of the OpenClaw skill registry was compromised, building your own is not just a performance optimisation — it is a security requirement.
