
n8n Local AI: Replace OpenAI With Your Own Fine-Tuned Model
Step-by-step guide to replacing OpenAI API calls in your n8n workflows with a locally-running fine-tuned model. Cut costs to zero without sacrificing quality.
Your n8n workflows work great. You've built automations that classify emails, extract data from invoices, score leads, and draft follow-up messages — all powered by OpenAI nodes that reliably deliver quality results.
But every one of those AI nodes is a recurring cost. Each execution burns tokens. Each token appears on your monthly bill. And as your automations scale — more workflows, more clients, more volume — that bill grows in lockstep.
What if you could run AI of the same quality locally, at zero cost per execution?
That's what this tutorial walks you through, step by step. By the end, you'll have a fine-tuned model running on Ollama, connected to your n8n workflows, producing the same quality outputs as OpenAI — at zero per-execution cost.
No ML background required. No Python scripting. Just the specific steps to go from "paying OpenAI per token" to "running your own model locally."
What We're Building
Here's the end state:
┌──────────────────────────────────────────┐
│               n8n Workflow               │
│                                          │
│  Trigger → Process → [AI Node] → Action  │
│                          │               │
│                          ▼               │
│             HTTP Request Node            │
│      POST localhost:11434/api/chat       │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│              Ollama Server               │
│                                          │
│  ┌─────────────────────────────────────┐ │
│  │    Your Fine-Tuned Model (GGUF)     │ │
│  │    Trained on YOUR workflow data    │ │
│  │     Runs on CPU — no GPU needed     │ │
│  └─────────────────────────────────────┘ │
└──────────────────────────────────────────┘
Your n8n workflow triggers exactly as before. Instead of the OpenAI node sending a request to api.openai.com (and costing tokens), an HTTP Request node sends the same prompt to Ollama running on your local machine or VPS. Ollama runs your fine-tuned model and returns the result. Same format. Same quality. Zero API cost.
The key insight: your fine-tuned model is trained on the exact input/output pairs from your existing OpenAI workflows. It doesn't need to know everything GPT-4 knows. It just needs to replicate the specific behavior your workflows rely on.
Prerequisites
Before we start, make sure you have:
- A running n8n instance — self-hosted or n8n cloud. This tutorial assumes you have existing workflows with OpenAI nodes.
- Ollama installed — on your n8n server, a nearby VPS, or your local machine. Installation:
curl -fsSL https://ollama.com/install.sh | sh
- An Ertas account — for fine-tuning your model without code. Sign up at ertas.io ($14.50/month).
- Workflow execution history — at least 200 successful executions from the workflow you want to convert. More is better.
Hardware requirements for Ollama:
| Setup | Specs | Monthly Cost |
|---|---|---|
| Same server as n8n | 4+ vCPU, 16GB+ RAM | Already paying for it |
| Separate VPS | 4 vCPU, 16GB RAM | ~$30/month |
| Local machine | Any modern laptop with 16GB RAM | $0 |
A 7B parameter model with Q4 quantization uses about 4-5GB of RAM. If your n8n server has 16GB+ RAM, you can run Ollama alongside n8n on the same machine without issues.
Step 1: Export Your OpenAI Training Data
Your existing workflows have been generating training data every time they execute. Each execution contains the input that was sent to OpenAI and the output that came back. That's exactly what we need to fine-tune a model.
Identify the target workflow
Pick the workflow you want to convert first. The ideal candidate is:
- High volume — runs 50+ times per day (biggest cost savings)
- Narrow task — classification, extraction, or templated generation (easiest to fine-tune)
- Consistent quality — OpenAI outputs are reliably good (clean training data)
Extract input/output pairs
Method 1: Manual extraction from n8n UI
- Open the target workflow in n8n
- Click "Executions" in the sidebar
- Filter for "Success" status
- Click into each execution and find the OpenAI node
- Record the input (the messages array sent to OpenAI) and output (the response content)
- Format as JSONL:
{"input": "Classify this email: Hi, I'd like to cancel my subscription...", "output": "cancellation"}
{"input": "Classify this email: When will my order arrive?", "output": "shipping_inquiry"}
{"input": "Classify this email: The product broke after 2 days...", "output": "defect_report"}
This works for collecting 50-100 examples but becomes tedious for larger datasets.
Method 2: Automated extraction via n8n API
Create a separate n8n workflow that pulls execution data programmatically:
- Add an HTTP Request node that calls the n8n API:
  - URL: http://localhost:5678/api/v1/executions?workflowId=YOUR_WORKFLOW_ID&status=success&limit=500
  - Add your n8n API key in the headers
- Add a Code node to extract and format the AI node data:
const executions = $input.all();
const trainingData = [];

for (const exec of executions) {
  const nodes = exec.json.data?.resultData?.runData;
  // 'OpenAI' must match the exact name of the OpenAI node in your workflow
  if (nodes && nodes['OpenAI']) {
    const openaiNode = nodes['OpenAI'][0];
    const input = openaiNode.data?.main?.[0]?.[0]?.json?.messages;
    const output = openaiNode.data?.main?.[0]?.[0]?.json?.choices?.[0]?.message?.content;
    if (input && output) {
      // Keep only the user message; the system prompt repeats across every pair
      const userMessage = input.find(m => m.role === 'user')?.content || '';
      trainingData.push({
        input: userMessage,
        output: output
      });
    }
  }
}

return [{ json: { trainingData, count: trainingData.length } }];
- Add a Write File node to save the output as a JSONL file.
Method 3: Prospective data collection
If you don't have enough execution history yet, add a logging step to your current workflow. Insert a Code node after your OpenAI node that appends each input/output pair to a file or database. Run this for 1-2 weeks until you have 300+ examples.
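A minimal logging sketch, assuming a self-hosted instance started with NODE_FUNCTION_ALLOW_BUILTIN=fs so the Code node can require('fs'). The two field names are placeholders; point them at wherever your prompt and response actually live:
// Logging sketch: assumes self-hosted n8n with NODE_FUNCTION_ALLOW_BUILTIN=fs
const fs = require('fs');

for (const item of $input.all()) {
  const pair = {
    // Placeholder field names: adapt to your workflow's actual data
    input: item.json.prompt_text,
    output: item.json.ai_response
  };
  if (pair.input && pair.output) {
    // One JSON object per line: the JSONL format Ertas expects
    fs.appendFileSync('/home/node/training-data.jsonl', JSON.stringify(pair) + '\n');
  }
}

// Pass items through unchanged so the rest of the workflow keeps working
return $input.all();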
How much data do you need?
| Task Type | Minimum | Sweet Spot | Diminishing Returns |
|---|---|---|---|
| Binary classification (yes/no) | 100 pairs | 250 pairs | 500+ pairs |
| Multi-class classification | 200 pairs | 500 pairs | 1,000+ pairs |
| Data extraction (structured) | 200 pairs | 500 pairs | 1,000+ pairs |
| Short text generation | 300 pairs | 800 pairs | 2,000+ pairs |
| Summarization | 300 pairs | 1,000 pairs | 3,000+ pairs |
Aim for the "sweet spot" column. You'll get good results at the minimum, but the sweet spot gives you a model that handles edge cases better.
Clean your data
Before uploading, do a quick quality pass (a script automating the mechanical checks follows this list):
- Remove failed outputs. If OpenAI returned an error or a nonsensical response, drop that pair.
- Remove duplicates. Exact duplicate input/output pairs don't help. Keep one copy.
- Check for consistency. If similar inputs produced wildly different outputs, investigate. Your model will learn the average — inconsistent training data produces inconsistent outputs.
- Standardize format. Make sure all your outputs follow the same format (e.g., all classification labels are lowercase, all JSON outputs use the same schema).
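A small Node script can automate the mechanical checks (malformed lines, incomplete pairs, duplicates, label format). A minimal sketch, assuming a classification dataset where outputs are labels; drop the lowercase step for generation tasks:
// clean-dataset.js — usage: node clean-dataset.js training-data.jsonl > cleaned.jsonl
const fs = require('fs');

const lines = fs.readFileSync(process.argv[2], 'utf8').split('\n').filter(Boolean);
const seen = new Set();

for (const line of lines) {
  let pair;
  try { pair = JSON.parse(line); } catch { continue; }   // drop malformed lines
  if (!pair.input || !pair.output) continue;             // drop incomplete pairs
  const output = pair.output.trim().toLowerCase();       // standardize label format
  const key = pair.input + '||' + output;
  if (seen.has(key)) continue;                           // drop exact duplicates
  seen.add(key);
  console.log(JSON.stringify({ input: pair.input, output }));
}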
Step 2: Fine-Tune Your Model With Ertas
Now you've got a clean dataset. Time to turn it into a model.
Upload your data
- Log into Ertas Studio at app.ertas.io
- Create a new project (name it after your workflow, e.g., "Email Classifier" or "Invoice Extractor")
- Click "Upload Dataset" and drag in your JSONL file
- Ertas validates the file and shows you a preview — verify a few examples look correct
Select your base model
For n8n automation tasks, these are the recommended choices:
| Base Model | Best For | Inference Speed | Quality |
|---|---|---|---|
| Qwen 2.5 7B | Classification, extraction, structured output | Fast | Excellent |
| Llama 3.1 8B | Generation, summarization, longer outputs | Fast | Excellent |
| Mistral 7B | High-throughput automation, short outputs | Fastest | Very Good |
Our recommendation for most n8n workflows: Qwen 2.5 7B. It handles structured tasks exceptionally well and produces clean, consistent outputs — exactly what automation workflows need.
Configure training
Ertas auto-configures the training parameters based on your dataset:
- LoRA rank: Automatically selected based on task complexity (typically 16-32 for automation tasks)
- Learning rate: Optimized for your dataset size
- Epochs: Usually 3-5 for automation datasets
- Validation split: 10% of your data is held out for evaluation
You can adjust these if you want, but the defaults are tuned for automation use cases and work well out of the box.
Start training
Click "Start Training." Depending on your dataset size:
- 200-500 examples: ~15-20 minutes
- 500-1,000 examples: ~25-35 minutes
- 1,000+ examples: ~35-50 minutes
You can watch the training loss curve in real time. For automation tasks, you typically see the loss drop sharply in the first epoch and stabilize by epoch 3.
Evaluate results
When training completes, Ertas shows you:
- Accuracy metrics: For classification tasks, you get precision, recall, and F1 scores broken down by class.
- Side-by-side comparisons: Your fine-tuned model's outputs vs. the original GPT-4 outputs for test examples.
- Sample predictions: Run any input through the model and see the output instantly.
What to look for:
- Classification accuracy above 90% (for well-defined categories)
- Extraction completeness (all fields captured)
- Generation quality (reads naturally, matches expected format)
If quality isn't where you need it, the most common fixes are:
- Add more training examples (especially for underperforming categories)
- Clean up inconsistent training pairs
- Increase LoRA rank (from 16 to 32)
Step 3: Export to GGUF and Load in Ollama
Export from Ertas
- In your Ertas project, click "Export Model"
- Select GGUF format
- Choose Q4_K_M quantization — this is the optimal balance between quality and file size for 7B models:
| Quantization | File Size (7B) | Quality | Speed |
|---|---|---|---|
| Q8_0 | ~7.5GB | Highest | Slower |
| Q5_K_M | ~5.5GB | Very High | Medium |
| Q4_K_M | ~4.5GB | High | Fast |
| Q3_K_M | ~3.5GB | Good | Fastest |
Q4_K_M gives you about 99% of the quality at 60% of the file size compared to Q8. For automation tasks where outputs are short and structured, the quality difference is negligible.
- Download the GGUF file. It'll be named something like email-classifier-q4km.gguf.
Load in Ollama
Transfer the GGUF file to your server (the machine running Ollama):
scp email-classifier-q4km.gguf user@your-server:/home/user/models/
Create a Modelfile that tells Ollama how to serve your model:
FROM /home/user/models/email-classifier-q4km.gguf
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
PARAMETER stop "<|im_end|>"
Key parameters:
- temperature 0.1 — Low temperature for consistent, deterministic outputs. Critical for automation tasks.
- num_ctx 4096 — Context window size. Increase if your inputs are longer.
- stop token — Depends on your base model. Qwen uses <|im_end|>; Llama 3 uses <|eot_id|>.
Create the model in Ollama:
ollama create email-classifier -f Modelfile
Test it:
ollama run email-classifier "Classify this email: Hi, I need to change my shipping address for order #4521."
You should get a response like shipping_inquiry — matching the format your workflow expects.
Verify the API endpoint
Ollama serves a REST API on port 11434 by default. Test it with curl:
curl http://localhost:11434/api/chat -d '{
  "model": "email-classifier",
  "messages": [
    {
      "role": "system",
      "content": "Classify the following email into one of these categories: cancellation, shipping_inquiry, defect_report, billing_question, general_inquiry"
    },
    {
      "role": "user",
      "content": "Hi, I was charged twice for my last order."
    }
  ],
  "stream": false
}'
Expected response (timing metadata fields trimmed for clarity):
{
  "model": "email-classifier",
  "message": {
    "role": "assistant",
    "content": "billing_question"
  }
}
If this works, you're ready to wire it into n8n.
Step 4: Create the Ollama Node in n8n
Now for the actual workflow change. Open the n8n workflow you want to convert.
Option A: Replace with HTTP Request Node (recommended)
This gives you full control over the request and works with any n8n version.
- Disable the OpenAI node (don't delete it yet — you'll want it for comparison testing)
- Add an HTTP Request node right where the OpenAI node was
- Configure the HTTP Request node:
Method: POST
URL: http://localhost:11434/api/chat
(If Ollama is on a different server, use that server's IP instead of localhost)
Headers:
- Content-Type: application/json
Body (JSON):
{
  "model": "email-classifier",
  "messages": [
    {
      "role": "system",
      "content": "Classify the following email into one of these categories: cancellation, shipping_inquiry, defect_report, billing_question, general_inquiry"
    },
    {
      "role": "user",
      "content": "={{ $json.email_body }}"
    }
  ],
  "stream": false,
  "options": {
    "temperature": 0.1
  }
}
Replace $json.email_body with whatever expression references the input data from the previous node in your workflow.
- Add a Code node after the HTTP Request to extract the response:
const response = $input.first().json;
// /api/chat returns message.content; /api/generate would return response
const result = response.message?.content?.trim() || response.response?.trim() || '';

return [{
  json: {
    classification: result,
    model: response.model,
    raw_response: response
  }
}];
- Connect the Code node's output to whatever comes next in your workflow (the same node the OpenAI node was connected to).
Option B: Use the OpenAI-Compatible Endpoint
Ollama also serves an OpenAI-compatible API at /v1/chat/completions. If your n8n version has an OpenAI node that lets you change the base URL, you can:
- Open the OpenAI node settings
- Change the base URL to http://localhost:11434/v1
- Set the model to email-classifier
- Remove the API key (or set any dummy value — Ollama doesn't need one)
This approach requires fewer workflow changes but depends on your n8n version supporting custom base URLs in the OpenAI node.
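To sanity-check the /v1 endpoint outside n8n first, here's a minimal sketch. Save it as check.mjs and run it with Node 18+ (which ships fetch built in):
// check.mjs — verify Ollama's OpenAI-compatible endpoint returns the OpenAI response shape
const res = await fetch('http://localhost:11434/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'email-classifier',
    messages: [{ role: 'user', content: 'Classify this email: Where is my refund?' }],
    temperature: 0.1
  })
});
const data = await res.json();
// Same response shape as the OpenAI API: choices[0].message.content
console.log(data.choices[0].message.content);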
Option C: Native Ollama Node
Recent n8n versions (1.20+) include native Ollama integration in the AI nodes section. If available:
- Add the Ollama Chat Model node
- Set Base URL to http://localhost:11434
- Select your model name
- Wire it into your workflow
This is the simplest option but gives you the least control over request parameters.
Step 5: Test and Compare
Before you go live, run a proper comparison test.
A/B test your workflow
- Keep both the OpenAI node and the Ollama node in your workflow
- Add a Switch node before them that sends 50% of executions to each (one way to set this up is sketched after this list)
- Add logging nodes after each to capture the outputs
- Run 100+ real executions through the split
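If you'd rather not configure Switch rules by hand, one way to drive the split is a Code node that tags each item with a random route, with the Switch node branching on that field (the route field name is just for illustration):
// Tag each item with a random A/B route for a 50/50 split
return $input.all().map(item => ({
  json: {
    ...item.json,
    route: Math.random() < 0.5 ? 'openai' : 'ollama'
  }
}));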
Compare outputs
After collecting comparison data, evaluate:
For classification tasks:
| Metric | OpenAI (GPT-4) | Fine-Tuned Local |
|---|---|---|
| Accuracy | Baseline | Compare |
| Consistency (same input → same output) | ~95% | ~99% |
| Speed | 2-5 seconds | 0.3-1 second |
A fine-tuned model is typically more consistent than GPT-4 for classification because it's been trained specifically on your categories and doesn't exhibit the creative variation that general-purpose models do.
For extraction tasks:
| Metric | OpenAI (GPT-4) | Fine-Tuned Local |
|---|---|---|
| Field completeness | Baseline | Compare |
| Format adherence (valid JSON, etc.) | ~92% | ~97% |
| Speed | 3-8 seconds | 0.5-2 seconds |
Fine-tuned models tend to produce more consistent output formats because they've learned your specific schema through training, rather than following schema instructions through prompting.
For generation tasks:
| Metric | OpenAI (GPT-4) | Fine-Tuned Local |
|---|---|---|
| Quality (subjective) | Baseline | Compare |
| Tone consistency | Variable | Consistent |
| Speed | 3-10 seconds | 1-3 seconds |
Generation tasks are the most subjective. Run 20-30 outputs past a human reviewer and score them on a 1-5 scale for quality and appropriateness.
Performance Benchmarks
Here's real-world performance data for common n8n automation tasks running on a $30/month VPS (4 vCPU, 16GB RAM) with a fine-tuned Qwen 2.5 7B model:
| Metric | OpenAI API (GPT-4) | Local Fine-Tuned (7B) |
|---|---|---|
| Response time (classification) | 1.5-3.0 seconds | 0.2-0.5 seconds |
| Response time (extraction) | 2.0-5.0 seconds | 0.4-1.0 seconds |
| Response time (generation) | 3.0-8.0 seconds | 0.8-2.5 seconds |
| Throughput (requests/second) | Limited by rate tier | 10-20 req/sec |
| Cost per execution | $0.02-0.10 | $0.00 |
| Monthly cost (1K exec/day) | $600-3,000 | $44.50 flat |
| Monthly cost (10K exec/day) | $6,000-30,000 | $44.50 flat |
| Uptime dependency | OpenAI status page | Your server |
| Data leaves your infra | Yes | No |
The throughput advantage is significant for batch workflows. If you have a workflow that processes 500 emails every morning, the OpenAI version takes 12-25 minutes (rate-limited). The local version completes in 25-50 seconds.
Troubleshooting Common Issues
Model too slow
Symptom: Responses take 5+ seconds for simple tasks.
Fixes:
- Check VPS CPU usage — if it's maxed at 100%, you need more vCPUs or a beefier machine
- Use Q4_K_M quantization instead of Q8 — half the memory, 30% faster
- Reduce num_ctx if your inputs are short — a 2048 context window is faster than 4096
- Make sure no other resource-heavy processes are running on the same server
Quality drop compared to OpenAI
Symptom: Outputs are noticeably worse than GPT-4 was producing.
Fixes:
- More training data. The most common fix. Go from 200 to 500+ examples.
- Cleaner training data. Remove any examples where OpenAI's output was wrong or inconsistent.
- More representative data. If certain categories or input types are underrepresented, add more examples of those specifically.
- Higher LoRA rank. If you used rank 8 or 16, try 32. This gives the model more capacity to learn your task.
- Try a different base model. If Mistral 7B isn't cutting it, try Qwen 2.5 7B or Llama 3.1 8B. Different base models have different strengths.
Context length errors
Symptom: Model returns garbage or errors on longer inputs.
Fixes:
- Increase num_ctx in your Modelfile (e.g., from 4096 to 8192). Note: larger context uses more RAM. A 7B model with 8K context needs ~6GB RAM.
- If your inputs are regularly over 4K tokens, consider truncating or summarizing the input before sending it to the model (see the truncation sketch after this list)
- For very long inputs (8K+ tokens), consider a two-stage approach: summarize first, then classify/extract from the summary
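A rough truncation guard for such a Code node, using the common rule of thumb of ~4 characters per token (not an exact tokenizer count); email_body stands in for your input field:
// Rough guard against overflowing the model's context window.
// ~4 characters per token is a rule of thumb, not an exact count.
const MAX_TOKENS = 3500;            // leave headroom below num_ctx 4096
const MAX_CHARS = MAX_TOKENS * 4;

return $input.all().map(item => ({
  json: {
    ...item.json,
    email_body: (item.json.email_body || '').slice(0, MAX_CHARS)
  }
}));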
Ollama not responding
Symptom: n8n gets connection refused or timeout errors.
Fixes:
- Verify Ollama is running: systemctl status ollama or ollama list
- Check the port: curl http://localhost:11434/api/tags should return a JSON response
- If n8n is on a different machine, make sure Ollama is bound to 0.0.0.0: set OLLAMA_HOST=0.0.0.0 in the Ollama environment config
- Check firewall rules: port 11434 must be accessible from the n8n machine
- Check RAM: if the server ran out of memory, Ollama may have crashed. dmesg | grep -i oom will show out-of-memory kills.
Inconsistent output format
Symptom: Model sometimes returns "billing_question" and sometimes "Billing Question" or "The category is billing_question."
Fixes:
- Add a post-processing step in n8n (Code node) that normalizes the output: lowercase, trim whitespace, strip prefixes (see the sketch after this list)
- Improve training data consistency — make sure all your training examples use the exact same format
- Lower temperature to 0.05 (almost deterministic)
- Add a system prompt that explicitly specifies the output format
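A minimal normalizer for that post-processing Code node. The label list is this tutorial's example categories; substitute your own:
// Normalize the raw model output into one of the known labels
const LABELS = ['cancellation', 'shipping_inquiry', 'defect_report', 'billing_question', 'general_inquiry'];

const raw = $input.first().json.classification || '';
// Lowercase, trim, and turn spaces into underscores ("Billing Question" → "billing_question")
const cleaned = raw.toLowerCase().trim().replace(/\s+/g, '_');
// Strip chatty prefixes like "the_category_is_billing_question"
const label = LABELS.find(l => cleaned.endsWith(l));

// Fall back to a catch-all label (or route unmatched items to manual review)
return [{ json: { classification: label || 'general_inquiry', raw } }];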
Going Live
Once you've validated quality and resolved any issues:
- Remove the A/B split — route 100% of traffic to the local model
- Keep the OpenAI node disabled (not deleted) as a fallback for the first week
- Monitor for 7 days — check outputs daily, compare error rates
- After 7 days: If everything looks good, delete the OpenAI node and remove the API key from n8n credentials
- Set up retraining schedule — every 4-8 weeks, export new execution data and retrain the model on the expanded dataset
Your n8n workflows now run with zero API costs. Every execution is free. Scale to 10x the volume and your bill stays exactly the same: $14.50 for Ertas plus $30 for your VPS.
That's $44.50/month for unlimited AI automation. No tokens. No rate limits. No surprise invoices.
Further Reading
- n8n to Fine-Tuned Model: The Agency Playbook — How agencies are productizing local AI for their n8n automation clients.
- Fine-Tune a Model for Your App — The general guide to fine-tuning for any application, not just n8n.
- GGUF Format Explained — Everything you need to know about the GGUF format and why it matters for local deployment.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.