
    I Replaced Every OpenAI Call in My n8n Workflows With a Fine-Tuned Model

    A builder's firsthand account of migrating 12 n8n workflows from OpenAI to locally-running fine-tuned models. The costs, the gotchas, and the results after 60 days.

Ertas Team

Three months ago, my n8n instance was running 12 workflows that all hit the OpenAI API. My monthly bill was $487. Today, that number is $0 in OpenAI API costs. My total AI spend is $49.99/month: a $30.49 VPS running Ollama, $14.50 for Ertas, and $5 for the search API one workflow still needs. Here's exactly what I did, what went wrong, and what surprised me after 60 days of running fully local.

    I'm not an ML engineer. I'm a builder who uses n8n to automate everything from client onboarding to content pipelines. I started using OpenAI's API in mid-2025 because it was the easiest way to add intelligence to my workflows. And it was — until the bill started growing, a rate limit incident took down my production workflows for three hours, and I realized I was one API policy change away from my entire business breaking.

    This is the full story of the migration: the planning, the training, the testing, the cutover, and the honest results after two months of production use.

    My Setup Before

    I run a small automation agency. My n8n instance handles workflows for both internal operations and client projects. Here are the 12 workflows that were hitting the OpenAI API:

| # | Workflow | Model Used | Daily Volume | Monthly Cost |
|---|----------|------------|--------------|--------------|
| 1 | Email triage (classify incoming emails into 6 categories) | GPT-4o-mini | ~180 emails | $18 |
| 2 | Lead scoring (analyze form submissions, assign 1-10 score) | GPT-4o | ~45 leads | $32 |
| 3 | Content summarization (summarize articles for newsletter) | GPT-4o | ~30 articles | $28 |
| 4 | Support ticket classification (route to correct team) | GPT-4o-mini | ~120 tickets | $14 |
| 5 | Invoice data extraction (pull line items from PDF text) | GPT-4o | ~200 invoices | $67 |
| 6 | Meeting notes summarization (transcripts to action items) | GPT-4o | ~40 meetings | $45 |
| 7 | Social media content generation (draft posts from briefs) | GPT-4o | ~25 briefs | $38 |
| 8 | Customer feedback analysis (sentiment + theme extraction) | GPT-4o-mini | ~90 responses | $12 |
| 9 | Contract clause extraction (identify key terms) | GPT-4o | ~15 contracts | $52 |
| 10 | Product description normalization (standardize listings) | GPT-4o-mini | ~300 listings | $35 |
| 11 | Chatbot response generation (client-facing support bot) | GPT-4o | ~250 conversations | $98 |
| 12 | Data enrichment (research companies from name + URL) | GPT-4o | ~60 companies | $48 |
| | Total | | ~1,355/day | $487/mo |

    The $487 was growing about 12% month-over-month as my clients added more volume to the workflows. At that trajectory, I was looking at $650/month by summer and $800+ by end of year.

    Why I Decided to Migrate

    Three things pushed me over the edge.

    The bill was growing. $487 might not sound catastrophic, but I'm a small operation. That $487 was my second-largest business expense after my own salary. And unlike rent or software subscriptions, I couldn't predict it — it fluctuated 15-20% month to month based on volume.

    The rate limit incident. In December 2025, OpenAI experienced capacity issues that caused elevated latency and rate limit errors for about three hours during a Tuesday afternoon. Three of my client-facing workflows — email triage, support ticket classification, and the chatbot — all started failing. Tickets piled up. Emails went unclassified. The chatbot returned errors. I spent the afternoon apologizing to clients and manually processing queues. I had no fallback. My entire AI layer was a single point of failure.

    I wanted to own the models. I was building my business on someone else's infrastructure. The prompts I'd spent months refining were worthless without the model behind them. If OpenAI changed their pricing, deprecated a model, or decided my use case violated their terms, I had no Plan B. That felt unacceptable for something this central to my business.

    The Migration Plan

    I didn't try to migrate everything at once. I grouped the 12 workflows by task type and complexity:

    Group 1: Classification (4 workflows) — Email triage, support ticket classification, customer feedback analysis, lead scoring. These take a text input and output a category or score. Simplest to fine-tune.

    Group 2: Extraction (3 workflows) — Invoice data extraction, contract clause extraction, product description normalization. These take unstructured text and output structured data. Medium complexity.

    Group 3: Generation (5 workflows) — Content summarization, meeting notes, social media generation, chatbot responses, data enrichment. These produce free-form text. Highest complexity.

    My plan was to tackle them in this order, spending about a week on each group, with one final week for parallel testing and cutover.

    Week 1: Data Collection

    This was the most tedious part. I needed training data — examples of inputs and correct outputs for each workflow.

    Fortunately, n8n stores execution history. I had months of successful executions with both the inputs sent to OpenAI and the outputs received. I wrote a simple script to export these as JSONL files.
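Here's a minimal sketch of what that export script could look like against n8n's public REST API. The executions endpoint, API-key header, and cursor pagination follow n8n's documented API; the `extract_pair` helper is a placeholder you'd adapt, since the path into each execution's run data depends on your node names.

```python
import json
import requests

N8N_URL = "https://n8n.example.com"   # your n8n instance (placeholder)
API_KEY = "..."                        # n8n API key
HEADERS = {"X-N8N-API-KEY": API_KEY}

def extract_pair(execution):
    """Placeholder: dig the prompt sent to OpenAI and the response it returned
    out of the execution's run data. The exact path depends on your node names.
    Return {"execution_id": ..., "input": ..., "output": ...} or None to skip."""
    data = execution.get("data", {})
    # e.g. data["resultData"]["runData"]["OpenAI"][0]["data"]["main"][0][0]["json"]
    ...
    return None

def export_workflow(workflow_id, out_path):
    cursor = None
    with open(out_path, "w") as f:
        while True:
            params = {"workflowId": workflow_id, "status": "success",
                      "includeData": "true", "limit": 100}
            if cursor:
                params["cursor"] = cursor
            resp = requests.get(f"{N8N_URL}/api/v1/executions",
                                headers=HEADERS, params=params)
            resp.raise_for_status()
            body = resp.json()
            for execution in body["data"]:
                pair = extract_pair(execution)
                if pair:
                    f.write(json.dumps(pair) + "\n")
            cursor = body.get("nextCursor")
            if not cursor:
                break

export_workflow("42", "email_triage.jsonl")   # hypothetical workflow id
```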

    For each workflow, I exported the last 3 months of successful executions and cleaned them:

| Workflow Group | Raw Executions | After Cleaning | Final Training Set |
|---|---|---|---|
| Classification (4 workflows) | 12,400 | 8,200 | 800 (200 per workflow) |
| Extraction (3 workflows) | 6,150 | 4,800 | 600 (200 per workflow) |
| Generation (5 workflows) | 5,600 | 3,900 | 750 (150 per workflow) |

    "Cleaning" meant removing duplicates, discarding executions where the output was clearly wrong (I'd manually flagged some over the months), fixing formatting inconsistencies, and ensuring the input-output pairs were self-contained (no references to context that wouldn't be in the prompt).

    I ended up with three consolidated datasets:

    1. Classification dataset (800 examples): Input text + category/score label. Covered all 4 classification workflows.
    2. Extraction dataset (600 examples): Input text + structured JSON output. Covered invoice, contract, and product normalization workflows.
    3. Generation dataset (750 examples): Input brief/text + generated output. Covered summarization, social media, chatbot, and enrichment workflows.

    The cleaning took about 6 hours spread over 3 days. Not glamorous work, but this is the step that determines whether your fine-tuned model actually works. Garbage in, garbage out.

    Week 2: Fine-Tuning

    I used Ertas to fine-tune three models:

    Model 1: Qwen 2.5 7B for classification. I chose Qwen for classification because it's fast at inference and handles structured output well. Uploaded the 800-example classification dataset. Training took 35 minutes. I ran the built-in evaluation and got 94% accuracy on the held-out test set right away. Honestly, I was shocked — I expected to need multiple training runs.

Model 2: Llama 3.1 8B for extraction. Extraction requires understanding document structure and producing valid JSON, so I went with the slightly larger Llama base. Uploaded the 600-example extraction dataset. Training took 42 minutes. Evaluation showed 91% accuracy on extraction tasks. The main errors were on edge cases — multi-currency invoices and contracts with unusual formatting.

Model 3: Llama 3.1 8B for generation. Generation is the hardest task type because the "correct" output is subjective. I used Llama 3.1 8B again. Training took 48 minutes on the 750-example dataset. Evaluation was trickier here — I couldn't just compare outputs for exact match. I manually reviewed 50 generated outputs and found that about 82% were immediately usable, 14% needed minor edits, and 4% missed the mark.

    Total fine-tuning cost: $14.50 for the Ertas Builder plan (which covered all three training runs and more). Total time: about 2 hours of active work plus the training time.

    I exported all three models as GGUF files. Each was about 4.5-5GB quantized to Q4_K_M.

    Week 3: Testing

    This was the most important week. I set up Ollama on my development machine, loaded all three models, and ran parallel testing.

    For each workflow, I took the last 100 production executions and ran them through both the OpenAI model and my fine-tuned model. Then I compared outputs.
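Conceptually the harness was a small loop. Here's a sketch, assuming a test file of inputs with manually verified labels and a local model served through Ollama's OpenAI-compatible endpoint (the model names and system prompt are placeholders):

```python
import json
from openai import OpenAI

openai_client = OpenAI()                                   # uses OPENAI_API_KEY
local_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def classify(client, model, text):
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Classify the email into one of: ..."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Last 100 production executions, with manually verified labels in "output"
with open("email_triage_test.jsonl") as f:
    rows = [json.loads(line) for line in f]

matches_openai = matches_local = 0
for row in rows:
    expected = row["output"]
    if classify(openai_client, "gpt-4o-mini", row["input"]) == expected:
        matches_openai += 1
    if classify(local_client, "email-triage-qwen", row["input"]) == expected:
        matches_local += 1

print(f"OpenAI: {matches_openai}/{len(rows)}  Local: {matches_local}/{len(rows)}")
```

For classification I compared labels exactly like this; for the generation workflows I reviewed outputs by hand instead.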

    Classification Results

| Workflow | GPT-4o/mini Accuracy | Fine-tuned Qwen 7B Accuracy | Winner |
|---|---|---|---|
| Email triage | 88% | 96% | Fine-tuned |
| Support ticket classification | 85% | 95% | Fine-tuned |
| Customer feedback analysis | 90% | 93% | Fine-tuned |
| Lead scoring (within 1 point) | 72% | 84% | Fine-tuned |

    The fine-tuned model beat GPT-4o-mini on every single classification task. This makes sense — the fine-tuned model was trained on exactly these categories, with examples from my specific domain. GPT-4o-mini was guessing from a system prompt.

    The lead scoring improvement was the biggest surprise. GPT-4o (not mini — I was paying for the full model on this one) scored leads inconsistently. The same lead description might get a 6 one day and an 8 the next. The fine-tuned model was dramatically more consistent because it had learned my specific scoring criteria from 200 examples.

    Extraction Results

| Workflow | GPT-4o Accuracy | Fine-tuned Llama 8B Accuracy | Winner |
|---|---|---|---|
| Invoice extraction (all fields correct) | 76% | 92% | Fine-tuned |
| Contract clause extraction | 71% | 88% | Fine-tuned |
| Product description normalization | 83% | 95% | Fine-tuned |

    Again, fine-tuned won across the board. The biggest improvement was on invoices — GPT-4o struggled with consistent field naming and occasionally hallucinated values. The fine-tuned model learned my exact JSON schema and almost never deviated from it.

    The contract clause extraction was the most challenging. Some contracts had unusual formatting that the fine-tuned model hadn't seen. I noted these for additional training data later.

    Generation Results

| Workflow | GPT-4o Quality (manual rating 1-5) | Fine-tuned Llama 8B Quality | Winner |
|---|---|---|---|
| Content summarization | 4.2 | 3.9 | GPT-4o (slight edge) |
| Meeting notes to action items | 3.8 | 4.1 | Fine-tuned |
| Social media post drafting | 4.0 | 3.5 | GPT-4o |
| Chatbot responses | 3.6 | 4.3 | Fine-tuned |
| Data enrichment | 3.4 | 2.8 | GPT-4o |

    Generation was mixed. The fine-tuned model excelled at tasks with predictable structure — meeting notes and chatbot responses, where the format is consistent and the content is domain-specific. GPT-4o had the edge on creative tasks (social media posts) and tasks requiring external knowledge (data enrichment).

    For social media post drafting, the fine-tuned model produced functional but formulaic posts. They matched our brand voice perfectly (because that's what it was trained on) but lacked the variety that GPT-4o brought. I decided this was acceptable for a first draft that gets human review anyway.

    For data enrichment, the fine-tuned model couldn't match GPT-4o because the task inherently requires broad knowledge about companies. This was the one workflow I considered keeping on the API. More on that later.

    Week 4: Cutover

    I provisioned a Hetzner CPX41 VPS — 8 vCPU (AMD EPYC), 16GB RAM, 240GB NVMe SSD. Cost: $30.49/month. Installed Ubuntu 24.04, set up Ollama, and loaded all three GGUF models.

    I configured n8n to use the Ollama endpoint instead of the OpenAI API. Since Ollama exposes an OpenAI-compatible API, this was literally changing the base URL and API key in the n8n HTTP Request nodes. For most workflows, the switch was a one-line change.
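If you want to sanity-check the swap outside n8n first, the same request works against Ollama with only the host and key changed. A minimal sketch (the model name is whatever you registered the GGUF under in Ollama; the key is a dummy because Ollama ignores it):

```python
import requests

# Before (OpenAI): POST https://api.openai.com/v1/chat/completions with your real key
# After  (Ollama): same request shape, different host, dummy key
resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    headers={"Authorization": "Bearer ollama"},
    json={
        "model": "support-ticket-qwen",   # hypothetical name of the loaded GGUF model
        "messages": [{"role": "user", "content": "Route this ticket: my invoice is wrong"}],
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```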

    I migrated workflows in order of confidence:

    Day 1-2: Classification workflows (highest confidence). Switched all 4 classification workflows to the fine-tuned Qwen model. Kept OpenAI running in parallel as a shadow pipeline — both models processed every input, but only the fine-tuned model's output was used.

    Day 3-4: Extraction workflows. Switched all 3 extraction workflows. Same shadow pipeline setup. Monitored for JSON parsing errors since downstream workflows depend on clean structured output.

    Day 5: Generation workflows (except data enrichment). Switched 4 of the 5 generation workflows. For the chatbot workflow, I was nervous — it's client-facing. I kept the shadow pipeline running for an extra 3 days.

    Day 6-7: Data enrichment. This was the tricky one. My fine-tuned model couldn't match GPT-4o's broad knowledge for company research. I made a pragmatic decision: I switched to a hybrid approach. The fine-tuned model handles the formatting and structured output, while I use the Serper API ($5/month) for the actual web research data that feeds into the model. Total cost for this workflow went from $48/month to about $7/month.
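A rough sketch of that hybrid shape, assuming Serper's standard search endpoint and a fine-tuned model served by Ollama (the model name, prompt, and fields pulled from the search results are placeholders):

```python
import requests
from openai import OpenAI

SERPER_KEY = "..."                     # serper.dev API key
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def enrich_company(name, url):
    # 1. Web research via Serper -- this replaces GPT-4o's broad knowledge
    search = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": SERPER_KEY, "Content-Type": "application/json"},
        json={"q": f"{name} {url} company"},
    ).json()
    snippets = "\n".join(r.get("snippet", "") for r in search.get("organic", [])[:5])

    # 2. Formatting / structured output via the fine-tuned local model
    resp = local.chat.completions.create(
        model="enrichment-llama",      # hypothetical fine-tuned model name
        messages=[
            {"role": "system", "content": "Summarize the company as JSON: ..."},
            {"role": "user", "content": f"Company: {name} ({url})\n\n{snippets}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content

print(enrich_company("Acme GmbH", "acme.example"))
```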

    After the full cutover, I kept the OpenAI API key active for one week as a manual fallback. I never needed it. On Day 14, I deleted the API key.

    The Results After 60 Days

    Here's the honest comparison after two full months of production use.

    Cost

| Metric | Before (OpenAI) | After (Self-hosted) | Change |
|---|---|---|---|
| Monthly AI API cost | $487 | $0 | -100% |
| Monthly VPS cost | $0 | $30.49 | +$30.49 |
| Monthly Ertas cost | $0 | $14.50 | +$14.50 |
| Monthly Serper API (enrichment) | $0 | $5 | +$5 |
| Total monthly AI spend | $487 | $49.99 | -$437.01 |
| Annual AI spend | $5,844 | $599.88 | -$5,244.12 |

    I'm saving $437 per month. Over a year, that's $5,244. For a small automation agency, that's significant — it's roughly the cost of a part-time contractor for a month, or new equipment, or just straight profit.

    Performance

| Metric | Before (OpenAI) | After (Self-hosted) | Change |
|---|---|---|---|
| Average classification accuracy | 84% | 95% | +11 points |
| Average extraction accuracy | 77% | 92% | +15 points |
| Average generation quality (1-5) | 3.8 | 3.9 | +0.1 |
| Average latency per request | 1,100ms | 185ms | -83% |
| Timeout/error rate | 1.8% | 0.1% | -94% |
| Uptime (AI layer) | 99.2% | 99.95% | +0.75 points |

    The accuracy improvements on classification and extraction are real and sustained. After 60 days, the fine-tuned models are consistently outperforming what GPT-4o was doing on my specific tasks.

    The latency improvement is dramatic. My workflows are noticeably faster. The email triage workflow that used to take 3-4 seconds per email now completes in under 500ms. The support ticket classifier responds in 120ms. Users have noticed — my client whose support bot runs on my n8n instance commented that "the bot feels snappier."

    The uptime improvement matters more than anything. Zero API outages in 60 days. Zero rate limit errors. Zero "model overloaded" responses. My workflows just run. The only downtime was 12 minutes when I rebooted the VPS for a kernel update — and I scheduled that for 3 AM.

    Volume

| Metric | Before | After | Change |
|---|---|---|---|
| Daily AI requests processed | ~1,355 | ~1,620 | +20% |
| Monthly AI requests | ~40,650 | ~48,600 | +20% |
| Cost per request | $0.012 | $0.001 | -92% |

    Volume grew 20% over the 60 days because my clients added more data to their pipelines. Under the old system, that growth would have pushed my API bill to $585+. Under the new system, I didn't even notice — the VPS handles the extra load without breaking a sweat.

    What Surprised Me

    Local is Actually Faster

    I expected self-hosted to be slower. It's not even close. Without the network round-trip to OpenAI's servers, the API queue time, and the cold start latency, every single request is faster. Classification requests that took 800ms via API now complete in 120ms locally. That's a 6.7x improvement.

For my meeting notes workflow, this matters a lot. Processing a 45-minute meeting transcript through GPT-4o took 8-12 seconds. Through the local model, it takes 1.5 seconds. Multiply that by 40 meetings/month and I'm saving roughly 5-7 minutes of wall clock time per month on just that one workflow.

    The Fine-Tuned Model is More Consistent

    This was the biggest practical improvement. GPT-4o's outputs would vary in format, structure, and even correctness between identical inputs. My downstream n8n nodes had extensive error handling — regex patterns to catch format variations, retry logic for malformed JSON, fallback parsers.

    With the fine-tuned model, I deleted most of that error handling code. The model produces the same output format 98%+ of the time. My n8n workflows got simpler and more reliable simultaneously.

    Some Workflows Actually Improved

    I expected "good enough" from the fine-tuned models. For classification and extraction, I got "significantly better." The lead scoring workflow went from 72% agreement with manual scores to 84%. The support ticket classifier went from 85% to 95%. These aren't marginal improvements — they materially reduced the manual review my team does.

    The reason is simple: GPT-4o is trying to be good at everything. My fine-tuned models are trying to be good at one thing. On that one thing, specialization wins.

    Resource Usage is Minimal

    I was worried about the VPS running out of memory or CPU with three models loaded. In practice, Ollama is efficient about model loading — it keeps the active model in memory and swaps on demand. With 16GB RAM, I can keep two models loaded simultaneously, and the third loads in about 2 seconds when needed.

    CPU usage averages 15-20% during normal operation, spiking to 60-70% during batch processing. I have significant headroom to add more workflows before needing to upgrade.

    What I'd Do Differently

    Start With the Highest-Volume Workflow

I started with classification because it was the simplest to fine-tune. In hindsight, I should have started with the chatbot (250 conversations/day, $98/month) or product description normalization (300 listings/day). Those were my highest-volume workflows, and the chatbot was my single biggest API cost, so they would have shown the fastest ROI.

    Collect More Training Data Upfront

    I used 200 examples per classification workflow, which worked well. But for extraction and generation, I wish I'd used 300-400 examples. The contract clause extraction model still struggles with unusual contract formats, and I suspect more training data would help. I've been collecting additional examples and plan to retrain next month.

    Test Edge Cases Earlier

    My first batch of training data was biased toward common cases. The invoice extraction model was great at standard invoices but initially fumbled multi-page invoices and those with foreign currency formatting. I should have intentionally included more edge cases in the training set from the start. I fixed this by adding 50 edge-case examples and retraining — accuracy on edge cases jumped from 61% to 87%.

    Set Up Monitoring From Day One

    For the first two weeks, I was manually checking outputs. I should have built a simple monitoring dashboard from the start — tracking accuracy, latency, and output format compliance automatically. I eventually built this with a separate n8n workflow that samples 5% of outputs and logs them for review. Wish I'd done it on day one.
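Mine lives in a separate n8n workflow, but the sampling logic itself is a few lines in any language. A sketch of the idea, with illustrative field names and a basic JSON-validity check standing in for real accuracy tracking:

```python
import json
import random
import time

SAMPLE_RATE = 0.05      # log ~5% of outputs for manual review

def _is_json(text):
    try:
        json.loads(text)
        return True
    except (ValueError, TypeError):
        return False

def maybe_log(workflow, model, prompt, output, latency_ms, log_path="ai_samples.jsonl"):
    if random.random() > SAMPLE_RATE:
        return
    record = {
        "ts": time.time(),
        "workflow": workflow,
        "model": model,
        "prompt": prompt,
        "output": output,
        "latency_ms": latency_ms,
        # quick automatic format check; accuracy still needs a human eye
        "valid_json": _is_json(output),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```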

    The Final Numbers

    Let me lay it out one more time, because these numbers still surprise me when I look at them.

| Metric | Before | After |
|---|---|---|
| Monthly AI spend | $487 | $49.99 |
| Annual AI spend | $5,844 | $599.88 |
| Annual savings | | $5,244.12 |
| Accuracy (classification avg) | 84% | 95% |
| Accuracy (extraction avg) | 77% | 92% |
| Average latency | 1,100ms | 185ms |
| API outages experienced (60 days) | 3 | 0 |
| Vendor dependencies | OpenAI (critical) | None |

The $487/month is now $49.99/month. That's $30.49 for the VPS, $14.50 for Ertas, and $5 for the Serper API I use in the data enrichment workflow. Annual savings of $5,244.

    But the savings are almost secondary at this point. What I value most is the reliability and the independence. My AI layer doesn't go down when OpenAI has issues. My costs don't change when someone in San Francisco adjusts a pricing slider. My clients' data stays on my server.

    I own my models. I own my data. I own my infrastructure. For the first time since I started using AI in my workflows, I actually own the most important part of my business.

    If you're running n8n workflows with OpenAI calls and spending more than $50/month, the migration is worth it. It took me 4 weeks of part-time work. The models are better on my tasks, faster on every task, and cost 90% less. I wish I'd done it six months sooner.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
