    n8n Local AI: Replace OpenAI With Your Own Fine-Tuned Model

    Step-by-step guide to replacing OpenAI API calls in your n8n workflows with a locally-running fine-tuned model. Cut costs to zero without sacrificing quality.

    Ertas Team

    Your n8n workflows work great. You've built automations that classify emails, extract data from invoices, score leads, and draft follow-up messages — all powered by OpenAI nodes that reliably deliver quality results.

    But every one of those AI nodes is a recurring cost. Each execution burns tokens. Each token appears on your monthly bill. And as your automations scale — more workflows, more clients, more volume — that bill grows in lockstep.

    What if you could run AI of the same quality locally, at zero cost per execution?

    That's what this tutorial walks you through, step by step. By the end, you'll have a fine-tuned model running on Ollama, connected to your n8n workflows, producing the same quality outputs as OpenAI — at zero per-execution cost.

    No ML background required. No Python scripting. Just the specific steps to go from "paying OpenAI per token" to "running your own model locally."

    What We're Building

    Here's the end state:

    ┌──────────────────────────────────────────┐
    │              n8n Workflow                  │
    │                                           │
    │  Trigger → Process → [AI Node] → Action   │
    │                        │                  │
    │                        ▼                  │
    │              HTTP Request Node            │
    │         POST localhost:11434/api/chat     │
    └──────────────────┬───────────────────────┘
                       │
                       ▼
    ┌──────────────────────────────────────────┐
    │            Ollama Server                  │
    │                                           │
    │  ┌─────────────────────────────────────┐ │
    │  │  Your Fine-Tuned Model (GGUF)       │ │
    │  │  Trained on YOUR workflow data      │ │
    │  │  Runs on CPU — no GPU needed        │ │
    │  └─────────────────────────────────────┘ │
    └──────────────────────────────────────────┘
    

    Your n8n workflow triggers exactly as before. Instead of the OpenAI node sending a request to api.openai.com (and costing tokens), an HTTP Request node sends the same prompt to Ollama running on your local machine or VPS. Ollama runs your fine-tuned model and returns the result. Same format. Same quality. Zero API cost.

    The key insight: your fine-tuned model is trained on the exact input/output pairs from your existing OpenAI workflows. It doesn't need to know everything GPT-4 knows. It just needs to replicate the specific behavior your workflows rely on.

    Prerequisites

    Before we start, make sure you have:

    • A running n8n instance — self-hosted or n8n cloud. This tutorial assumes you have existing workflows with OpenAI nodes.
    • Ollama installed — on your n8n server, a nearby VPS, or your local machine. Installation: curl -fsSL https://ollama.com/install.sh | sh
    • An Ertas account — for fine-tuning your model without code. Sign up at ertas.io ($14.50/month).
    • Workflow execution history — at least 200 successful executions from the workflow you want to convert. More is better.

    Hardware requirements for Ollama:

    | Setup | Specs | Monthly Cost |
    | --- | --- | --- |
    | Same server as n8n | 4+ vCPU, 16GB+ RAM | Already paying for it |
    | Separate VPS | 4 vCPU, 16GB RAM | ~$30/month |
    | Local machine | Any modern laptop with 16GB RAM | $0 |

    A 7B parameter model with Q4 quantization uses about 4-5GB of RAM. If your n8n server has 16GB+ RAM, you can run Ollama alongside n8n on the same machine without issues.

    Step 1: Export Your OpenAI Training Data

    Your existing workflows have been generating training data every time they execute. Each execution contains the input that was sent to OpenAI and the output that came back. That's exactly what we need to fine-tune a model.

    Identify the target workflow

    Pick the workflow you want to convert first. The ideal candidate is:

    • High volume — runs 50+ times per day (biggest cost savings)
    • Narrow task — classification, extraction, or templated generation (easiest to fine-tune)
    • Consistent quality — OpenAI outputs are reliably good (clean training data)

    Extract input/output pairs

    Method 1: Manual extraction from n8n UI

    1. Open the target workflow in n8n
    2. Click "Executions" in the sidebar
    3. Filter for "Success" status
    4. Click into each execution and find the OpenAI node
    5. Record the input (the messages array sent to OpenAI) and output (the response content)
    6. Format as JSONL:
    {"input": "Classify this email: Hi, I'd like to cancel my subscription...", "output": "cancellation"}
    {"input": "Classify this email: When will my order arrive?", "output": "shipping_inquiry"}
    {"input": "Classify this email: The product broke after 2 days...", "output": "defect_report"}
    

    This works for collecting 50-100 examples but becomes tedious for larger datasets.

    Method 2: Automated extraction via n8n API

    Create a separate n8n workflow that pulls execution data programmatically:

    1. Add an HTTP Request node that calls the n8n API:

      • URL: http://localhost:5678/api/v1/executions?workflowId=YOUR_WORKFLOW_ID&status=success&includeData=true&limit=500 (the includeData flag is needed to get each execution's node data)
      • Add your n8n API key in the headers
    2. Add a Code node to extract and format the AI node data:

    // Runs in an n8n Code node. Pulls the OpenAI node's input/output pair
    // out of each execution returned by the n8n API.
    const executions = $input.all();
    const trainingData = [];
    
    for (const exec of executions) {
      // runData is keyed by node name — change 'OpenAI' if your node is named differently
      const nodes = exec.json.data?.resultData?.runData;
      if (nodes && nodes['OpenAI']) {
        const openaiNode = nodes['OpenAI'][0];
        const input = openaiNode.data?.main?.[0]?.[0]?.json?.messages;
        const output = openaiNode.data?.main?.[0]?.[0]?.json?.choices?.[0]?.message?.content;
    
        if (input && output) {
          // Keep only the user message as the training input
          const userMessage = input.find(m => m.role === 'user')?.content || '';
          trainingData.push({
            input: userMessage,
            output: output
          });
        }
      }
    }
    
    return [{ json: { trainingData, count: trainingData.length } }];
    
    3. Add a Write File node to save the output as a JSONL file.

    Method 3: Prospective data collection

    If you don't have enough execution history yet, add a logging step to your current workflow. Insert a Code node after your OpenAI node that appends each input/output pair to a file or database. Run this for 1-2 weeks until you have 300+ examples.
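
    Here's a minimal sketch of that logging step, assuming a self-hosted n8n where the Code node is allowed to use Node's fs module (NODE_FUNCTION_ALLOW_BUILTIN=fs); the field paths and file location are illustrative, so adjust them to what your OpenAI node actually emits:

    // Logging Code node placed directly after the OpenAI node.
    const fs = require('fs');
    
    const items = $input.all();
    const lines = items.map(item => {
      // Adjust these paths to match your OpenAI node's real output shape.
      const input = item.json.messages?.find(m => m.role === 'user')?.content ?? '';
      const output = item.json.choices?.[0]?.message?.content ?? '';
      return JSON.stringify({ input, output });
    });
    
    // Append one JSON object per line (JSONL) to a file on the n8n host.
    fs.appendFileSync('/home/node/training-data.jsonl', lines.join('\n') + '\n');
    
    return items; // pass the data through so the rest of the workflow still runs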

    How much data do you need?

    | Task Type | Minimum | Sweet Spot | Diminishing Returns |
    | --- | --- | --- | --- |
    | Binary classification (yes/no) | 100 pairs | 250 pairs | 500+ pairs |
    | Multi-class classification | 200 pairs | 500 pairs | 1,000+ pairs |
    | Data extraction (structured) | 200 pairs | 500 pairs | 1,000+ pairs |
    | Short text generation | 300 pairs | 800 pairs | 2,000+ pairs |
    | Summarization | 300 pairs | 1,000 pairs | 3,000+ pairs |

    Aim for the "sweet spot" column. You'll get good results at the minimum, but the sweet spot gives you a model that handles edge cases better.
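
    For classification datasets, it's also worth checking how examples are spread across labels. A quick Node.js pass like this (the file name is a placeholder) surfaces underrepresented classes before you train:

    // count-labels.js — tally training examples per output label
    const fs = require('fs');
    
    const counts = {};
    for (const line of fs.readFileSync('training-data.jsonl', 'utf8').split('\n')) {
      if (!line.trim()) continue;
      const { output } = JSON.parse(line);
      counts[output] = (counts[output] ?? 0) + 1;
    }
    
    console.table(counts); // any label far below the others needs more examples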

    Clean your data

    Before uploading, do a quick quality pass (a scripted version of these checks follows the list):

    • Remove failed outputs. If OpenAI returned an error or a nonsensical response, drop that pair.
    • Remove duplicates. Exact duplicate input/output pairs don't help. Keep one copy.
    • Check for consistency. If similar inputs produced wildly different outputs, investigate. Your model will learn the average — inconsistent training data produces inconsistent outputs.
    • Standardize format. Make sure all your outputs follow the same format (e.g., all classification labels are lowercase, all JSON outputs use the same schema).
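
    Here's one way those checks might look as a script, assuming the JSONL format shown earlier (the lowercase step suits classification labels — skip it for generation tasks):

    // clean-dataset.js — drop malformed pairs, dedupe, and standardize labels
    const fs = require('fs');
    
    const lines = fs.readFileSync('training-data.jsonl', 'utf8').split('\n').filter(Boolean);
    const seen = new Set();
    const cleaned = [];
    
    for (const line of lines) {
      let pair;
      try { pair = JSON.parse(line); } catch { continue; } // drop malformed lines
      if (!pair.input || !pair.output) continue;           // drop incomplete pairs
    
      const output = pair.output.trim().toLowerCase();     // standardize label format
      const key = pair.input.trim() + '|' + output;
      if (seen.has(key)) continue;                         // drop exact duplicates
      seen.add(key);
    
      cleaned.push(JSON.stringify({ input: pair.input.trim(), output }));
    }
    
    fs.writeFileSync('training-data.clean.jsonl', cleaned.join('\n') + '\n');
    console.log(`kept ${cleaned.length} of ${lines.length} pairs`);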

    Step 2: Fine-Tune Your Model With Ertas

    Now you've got a clean dataset. Time to turn it into a model.

    Upload your data

    1. Log into Ertas Studio at app.ertas.io
    2. Create a new project (name it after your workflow, e.g., "Email Classifier" or "Invoice Extractor")
    3. Click "Upload Dataset" and drag in your JSONL file
    4. Ertas validates the file and shows you a preview — verify a few examples look correct

    Select your base model

    For n8n automation tasks, these are the recommended choices:

    | Base Model | Best For | Inference Speed | Quality |
    | --- | --- | --- | --- |
    | Qwen 2.5 7B | Classification, extraction, structured output | Fast | Excellent |
    | Llama 3.1 8B | Generation, summarization, longer outputs | Fast | Excellent |
    | Mistral 7B | High-throughput automation, short outputs | Fastest | Very Good |

    Our recommendation for most n8n workflows: Qwen 2.5 7B. It handles structured tasks exceptionally well and produces clean, consistent outputs — exactly what automation workflows need.

    Configure training

    Ertas auto-configures the training parameters based on your dataset:

    • LoRA rank: Automatically selected based on task complexity (typically 16-32 for automation tasks)
    • Learning rate: Optimized for your dataset size
    • Epochs: Usually 3-5 for automation datasets
    • Validation split: 10% of your data is held out for evaluation

    You can adjust these if you want, but the defaults are tuned for automation use cases and work well out of the box.
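
    If you're curious what the rank setting actually controls: LoRA freezes the base model's weights and trains only a small low-rank correction on top of them, so the rank r caps how much task-specific behavior the adapter can absorb.

    % LoRA trains a low-rank update BA on top of the frozen weight matrix W
    W' = W + BA, \qquad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}
    % r is the LoRA rank: rank 32 gives the update more capacity than rank 16,
    % at the cost of more trainable parameters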

    Start training

    Click "Start Training." Depending on your dataset size:

    • 200-500 examples: ~15-20 minutes
    • 500-1,000 examples: ~25-35 minutes
    • 1,000+ examples: ~35-50 minutes

    You can watch the training loss curve in real time. For automation tasks, you typically see the loss drop sharply in the first epoch and stabilize by epoch 3.

    Evaluate results

    When training completes, Ertas shows you:

    • Accuracy metrics: For classification tasks, you get precision, recall, and F1 scores broken down by class.
    • Side-by-side comparisons: Your fine-tuned model's outputs vs. the original GPT-4 outputs for test examples.
    • Sample predictions: Run any input through the model and see the output instantly.

    What to look for:

    • Classification accuracy above 90% (for well-defined categories)
    • Extraction completeness (all fields captured)
    • Generation quality (reads naturally, matches expected format)

    If quality isn't where you need it, the most common fixes are:

    1. Add more training examples (especially for underperforming categories)
    2. Clean up inconsistent training pairs
    3. Increase LoRA rank (from 16 to 32)

    Step 3: Export to GGUF and Load in Ollama

    Export from Ertas

    1. In your Ertas project, click "Export Model"
    2. Select GGUF format
    3. Choose Q4_K_M quantization — this is the optimal balance between quality and file size for 7B models:
    | Quantization | File Size (7B) | Quality | Speed |
    | --- | --- | --- | --- |
    | Q8_0 | ~7.5GB | Highest | Slower |
    | Q5_K_M | ~5.5GB | Very High | Medium |
    | Q4_K_M | ~4.5GB | High | Fast |
    | Q3_K_M | ~3.5GB | Good | Fastest |

    Q4_K_M gives you about 99% of the quality at 60% of the file size compared to Q8. For automation tasks where outputs are short and structured, the quality difference is negligible.

    4. Download the GGUF file. It'll be named something like email-classifier-q4km.gguf.

    Load in Ollama

    Transfer the GGUF file to your server (the machine running Ollama):

    scp email-classifier-q4km.gguf user@your-server:/home/user/models/
    

    Create a Modelfile that tells Ollama how to serve your model:

    FROM /home/user/models/email-classifier-q4km.gguf
    
    PARAMETER temperature 0.1
    PARAMETER num_ctx 4096
    PARAMETER stop "<|im_end|>"
    

    Key parameters:

    • temperature 0.1 — Low temperature for consistent, deterministic outputs. Critical for automation tasks.
    • num_ctx 4096 — Context window size. Increase if your inputs are longer.
    • stop token — Depends on your base model. Qwen uses <|im_end|>; Llama 3 uses <|eot_id|>.

    Create the model in Ollama:

    ollama create email-classifier -f Modelfile
    

    Test it:

    ollama run email-classifier "Classify this email: Hi, I need to change my shipping address for order #4521."
    

    You should get a response like shipping_inquiry — matching the format your workflow expects.

    Verify the API endpoint

    Ollama serves a REST API on port 11434 by default. Test it with curl:

    curl http://localhost:11434/api/chat -d '{
      "model": "email-classifier",
      "messages": [
        {
          "role": "system",
          "content": "Classify the following email into one of these categories: cancellation, shipping_inquiry, defect_report, billing_question, general_inquiry"
        },
        {
          "role": "user",
          "content": "Hi, I was charged twice for my last order."
        }
      ],
      "stream": false
    }'
    

    Expected response:

    {
      "model": "email-classifier",
      "message": {
        "role": "assistant",
        "content": "billing_question"
      }
    }
    

    If this works, you're ready to wire it into n8n.

    Step 4: Create the Ollama Node in n8n

    Now for the actual workflow change. Open the n8n workflow you want to convert. There are three ways to wire Ollama in.

    Option A: Use an HTTP Request Node

    This gives you full control over the request and works with any n8n version.

    1. Disable the OpenAI node (don't delete it yet — you'll want it for comparison testing)

    2. Add an HTTP Request node right where the OpenAI node was

    3. Configure the HTTP Request node:

    Method: POST

    URL: http://localhost:11434/api/chat

    (If Ollama is on a different server, use that server's IP instead of localhost)

    Headers:

    • Content-Type: application/json

    Body (JSON):

    {
      "model": "email-classifier",
      "messages": [
        {
          "role": "system",
          "content": "Classify the following email into one of these categories: cancellation, shipping_inquiry, defect_report, billing_question, general_inquiry"
        },
        {
          "role": "user",
          "content": "={{ $json.email_body }}"
        }
      ],
      "stream": false,
      "options": {
        "temperature": 0.1
      }
    }
    

    Replace $json.email_body with whatever expression references the input data from the previous node in your workflow.

    4. Add a Code node after the HTTP Request to extract the response:
    // Ollama's /api/chat returns the text under message.content;
    // the fallback covers /api/generate, which returns it under response.
    const response = $input.first().json;
    const result = response.message?.content?.trim() || response.response?.trim() || '';
    
    return [{
      json: {
        classification: result,
        model: response.model,
        raw_response: response
      }
    }];
    
    5. Connect the Code node's output to whatever comes next in your workflow (the same node the OpenAI node was connected to).

    Option B: Use the OpenAI-Compatible Endpoint

    Ollama also serves an OpenAI-compatible API at /v1/chat/completions. If your n8n version has an OpenAI node that lets you change the base URL, you can:

    1. Open the OpenAI node settings
    2. Change the base URL to http://localhost:11434/v1
    3. Set the model to email-classifier
    4. Remove the API key (or set any dummy value — Ollama doesn't need one)

    This approach requires fewer workflow changes but depends on your n8n version supporting custom base URLs in the OpenAI node.
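
    Before changing the node config, it's worth confirming the endpoint really speaks the OpenAI response shape. A quick probe (Node 18+ with built-in fetch, saved as an .mjs file so top-level await works; model name from this tutorial):

    // Sanity-check Ollama's OpenAI-compatible endpoint.
    const res = await fetch('http://localhost:11434/v1/chat/completions', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'email-classifier',
        messages: [{ role: 'user', content: 'Classify this email: I was charged twice.' }]
      })
    });
    
    const data = await res.json();
    // Same shape as an OpenAI response: the text lives at choices[0].message.content
    console.log(data.choices[0].message.content);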

    Option C: Native Ollama Node

    Recent n8n versions (1.20+) include native Ollama integration in the AI nodes section. If available:

    1. Add the Ollama Chat Model node
    2. Set Base URL to http://localhost:11434
    3. Select your model name
    4. Wire it into your workflow

    This is the simplest option but gives you the least control over request parameters.

    Step 5: Test and Compare

    Before you go live, run a proper comparison test.

    A/B test your workflow

    1. Keep both the OpenAI node and the Ollama node in your workflow
    2. Add a Switch node before them that sends 50% of executions to each
    3. Add logging nodes after each to capture the outputs (a sketch follows this list)
    4. Run 100+ real executions through the split
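
    A minimal version of that logging step might look like this in a Code node — place one copy after each branch and flip the source label; the input field name is from this tutorial's example, so adjust it to your workflow:

    // Tag each output with which model produced it, then feed this into
    // a Google Sheets, database, or file node for later comparison.
    const items = $input.all();
    
    return items.map(item => ({
      json: {
        source: 'ollama',                    // 'openai' on the other branch
        input: item.json.email_body ?? '',   // adjust to your workflow's input field
        output: item.json.classification ?? item.json.message?.content ?? '',
        timestamp: new Date().toISOString()
      }
    }));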

    Compare outputs

    After collecting comparison data, evaluate:

    For classification tasks:

    | Metric | OpenAI (GPT-4) | Fine-Tuned Local |
    | --- | --- | --- |
    | Accuracy | Baseline | Compare |
    | Consistency (same input → same output) | ~95% | ~99% |
    | Speed | 2-5 seconds | 0.3-1 second |

    A fine-tuned model is typically more consistent than GPT-4 for classification because it's been trained specifically on your categories and doesn't exhibit the creative variation that general-purpose models do.

    For extraction tasks:

    | Metric | OpenAI (GPT-4) | Fine-Tuned Local |
    | --- | --- | --- |
    | Field completeness | Baseline | Compare |
    | Format adherence (valid JSON, etc.) | ~92% | ~97% |
    | Speed | 3-8 seconds | 0.5-2 seconds |

    Fine-tuned models tend to produce more consistent output formats because they've learned your specific schema through training, rather than following schema instructions through prompting.

    For generation tasks:

    | Metric | OpenAI (GPT-4) | Fine-Tuned Local |
    | --- | --- | --- |
    | Quality (subjective) | Baseline | Compare |
    | Tone consistency | Variable | Consistent |
    | Speed | 3-10 seconds | 1-3 seconds |

    Generation tasks are the most subjective. Run 20-30 outputs past a human reviewer and score them on a 1-5 scale for quality and appropriateness.

    Performance Benchmarks

    Here's real-world performance data for common n8n automation tasks running on a $30/month VPS (4 vCPU, 16GB RAM) with a fine-tuned Qwen 2.5 7B model:

    | Metric | OpenAI API (GPT-4) | Local Fine-Tuned (7B) |
    | --- | --- | --- |
    | Response time (classification) | 1.5-3.0 seconds | 0.2-0.5 seconds |
    | Response time (extraction) | 2.0-5.0 seconds | 0.4-1.0 seconds |
    | Response time (generation) | 3.0-8.0 seconds | 0.8-2.5 seconds |
    | Throughput (requests/second) | Limited by rate tier | 10-20 req/sec |
    | Cost per execution | $0.02-0.10 | $0.00 |
    | Monthly cost (1K exec/day) | $600-3,000 | $44.50 flat |
    | Monthly cost (10K exec/day) | $6,000-30,000 | $44.50 flat |
    | Uptime dependency | OpenAI status page | Your server |
    | Data leaves your infra | Yes | No |

    The throughput advantage is significant for batch workflows. If you have a workflow that processes 500 emails every morning, the OpenAI version takes 12-25 minutes (rate-limited). The local version completes in 25-50 seconds.
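
    If you want to verify throughput on your own hardware, a rough probe like this works (Node 18+, run as an .mjs file; note that Ollama queues requests sequentially by default — the OLLAMA_NUM_PARALLEL environment variable controls how many it serves concurrently):

    // Fire a small batch of concurrent requests at Ollama and time them.
    const N = 20;
    const start = Date.now();
    
    await Promise.all(Array.from({ length: N }, (_, i) =>
      fetch('http://localhost:11434/api/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          model: 'email-classifier',
          messages: [{ role: 'user', content: `Classify this email: test message ${i}` }],
          stream: false
        })
      }).then(r => r.json())
    ));
    
    const secs = (Date.now() - start) / 1000;
    console.log(`${N} requests in ${secs.toFixed(1)}s (${(N / secs).toFixed(1)} req/sec)`);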

    Troubleshooting Common Issues

    Model too slow

    Symptom: Responses take 5+ seconds for simple tasks.

    Fixes:

    • Check VPS CPU usage — if it's maxed at 100%, you need more vCPUs or a beefier machine
    • Use Q4_K_M quantization instead of Q8 — half the memory, 30% faster
    • Reduce num_ctx if your inputs are short — a 2048 context window is faster than 4096
    • Make sure no other resource-heavy processes are running on the same server

    Quality drop compared to OpenAI

    Symptom: Outputs are noticeably worse than GPT-4 was producing.

    Fixes:

    • More training data. The most common fix. Go from 200 to 500+ examples.
    • Cleaner training data. Remove any examples where OpenAI's output was wrong or inconsistent.
    • More representative data. If certain categories or input types are underrepresented, add more examples of those specifically.
    • Higher LoRA rank. If you used rank 8 or 16, try 32. This gives the model more capacity to learn your task.
    • Try a different base model. If Mistral 7B isn't cutting it, try Qwen 2.5 7B or Llama 3.1 8B. Different base models have different strengths.

    Context length errors

    Symptom: Model returns garbage or errors on longer inputs.

    Fixes:

    • Increase num_ctx in your Modelfile (e.g., from 4096 to 8192)
    • Note: larger context uses more RAM. A 7B model with 8K context needs ~6GB RAM.
    • If your inputs are regularly over 4K tokens, consider truncating or summarizing the input before sending it to the model (a truncation sketch follows this list)
    • For very long inputs (8K+ tokens), consider a two-stage approach: summarize first, then classify/extract from the summary
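
    A crude truncation step for a Code node placed before the HTTP Request node might look like this, assuming roughly 4 characters per token and the example field name from this tutorial:

    // Trim the input to stay comfortably inside a 4096-token context window.
    const MAX_CHARS = 3000 * 4; // ~3,000 tokens at ~4 chars/token, leaving headroom
    
    return $input.all().map(item => ({
      json: {
        ...item.json,
        email_body: (item.json.email_body ?? '').slice(0, MAX_CHARS)
      }
    }));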

    Ollama not responding

    Symptom: n8n gets connection refused or timeout errors.

    Fixes:

    • Verify Ollama is running: systemctl status ollama or ollama list
    • Check the port: curl http://localhost:11434/api/tags should return a JSON response
    • If n8n is on a different machine, make sure Ollama is bound to 0.0.0.0: set OLLAMA_HOST=0.0.0.0 in the Ollama environment config
    • Check firewall rules: port 11434 must be accessible from the n8n machine
    • Check RAM: if the server ran out of memory, Ollama may have crashed. dmesg | grep -i oom will show out-of-memory kills.

    Inconsistent output format

    Symptom: Model sometimes returns "billing_question" and sometimes "Billing Question" or "The category is billing_question."

    Fixes:

    • Add a post-processing step in n8n (Code node) that normalizes the output: lowercase, trim whitespace, strip prefixes (see the sketch after this list)
    • Improve training data consistency — make sure all your training examples use the exact same format
    • Lower temperature to 0.05 (almost deterministic)
    • Add a system prompt that explicitly specifies the output format
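
    That normalization step might look like this in a Code node — the valid label set here is this tutorial's example categories, so swap in your own:

    // Normalize the model's output to a known label, or flag it as unknown.
    const VALID = ['cancellation', 'shipping_inquiry', 'defect_report',
                   'billing_question', 'general_inquiry'];
    
    const raw = $input.first().json.classification ?? '';
    
    let label = raw.toLowerCase().trim()
      .replace(/^the category is[:\s]*/, '')  // strip chatty prefixes
      .replace(/[^a-z_ ]/g, '')               // drop quotes and punctuation
      .trim()
      .replace(/\s+/g, '_');                  // "Billing Question" -> "billing_question"
    
    if (!VALID.includes(label)) label = 'unknown'; // route these to a human or a fallback
    
    return [{ json: { classification: label, raw } }];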

    Going Live

    Once you've validated quality and resolved any issues:

    1. Remove the A/B split — route 100% of traffic to the local model
    2. Keep the OpenAI node disabled (not deleted) as a fallback for the first week
    3. Monitor for 7 days — check outputs daily, compare error rates
    4. After 7 days: If everything looks good, delete the OpenAI node and remove the API key from n8n credentials
    5. Set up retraining schedule — every 4-8 weeks, export new execution data and retrain the model on the expanded dataset

    Your n8n workflows now run with zero API costs. Every execution is free. Scale to 10x the volume and your bill stays exactly the same: $14.50 for Ertas plus $30 for your VPS.

    That's $44.50/month for unlimited AI automation. No tokens. No rate limits. No surprise invoices.

