
    I Replaced Every OpenAI Call in My n8n Workflows With a Fine-Tuned Model

    A builder's firsthand account of migrating 12 n8n workflows from OpenAI to locally-running fine-tuned models. The costs, the gotchas, and the results after 60 days.

Ertas Team

Three months ago, my n8n instance was running 12 workflows that all hit the OpenAI API. My monthly bill was $487. Today, that number is $0 in OpenAI API costs. My total AI spend is $49.99/month: a $30.49 VPS running Ollama, $14.50 for Ertas, and $5 for the search API one workflow still needs. Here's exactly what I did, what went wrong, and what surprised me after 60 days of running fully local.

    I'm not an ML engineer. I'm a builder who uses n8n to automate everything from client onboarding to content pipelines. I started using OpenAI's API in mid-2025 because it was the easiest way to add intelligence to my workflows. And it was — until the bill started growing, a rate limit incident took down my production workflows for three hours, and I realized I was one API policy change away from my entire business breaking.

    This is the full story of the migration: the planning, the training, the testing, the cutover, and the honest results after two months of production use.

    My Setup Before

    I run a small automation agency. My n8n instance handles workflows for both internal operations and client projects. Here are the 12 workflows that were hitting the OpenAI API:

| # | Workflow | Model Used | Daily Volume | Monthly Cost |
|---|----------|------------|--------------|--------------|
| 1 | Email triage (classify incoming emails into 6 categories) | GPT-4o-mini | ~180 emails | $18 |
| 2 | Lead scoring (analyze form submissions, assign 1-10 score) | GPT-4o | ~45 leads | $32 |
| 3 | Content summarization (summarize articles for newsletter) | GPT-4o | ~30 articles | $28 |
| 4 | Support ticket classification (route to correct team) | GPT-4o-mini | ~120 tickets | $14 |
| 5 | Invoice data extraction (pull line items from PDF text) | GPT-4o | ~200 invoices | $67 |
| 6 | Meeting notes summarization (transcripts to action items) | GPT-4o | ~40 meetings | $45 |
| 7 | Social media content generation (draft posts from briefs) | GPT-4o | ~25 briefs | $38 |
| 8 | Customer feedback analysis (sentiment + theme extraction) | GPT-4o-mini | ~90 responses | $12 |
| 9 | Contract clause extraction (identify key terms) | GPT-4o | ~15 contracts | $52 |
| 10 | Product description normalization (standardize listings) | GPT-4o-mini | ~300 listings | $35 |
| 11 | Chatbot response generation (client-facing support bot) | GPT-4o | ~250 conversations | $98 |
| 12 | Data enrichment (research companies from name + URL) | GPT-4o | ~60 companies | $48 |
| | Total | | ~1,355/day | $487/mo |

    The $487 was growing about 12% month-over-month as my clients added more volume to the workflows. At that trajectory, I was looking at $650/month by summer and $800+ by end of year.

    Why I Decided to Migrate

    Three things pushed me over the edge.

    The bill was growing. $487 might not sound catastrophic, but I'm a small operation. That $487 was my second-largest business expense after my own salary. And unlike rent or software subscriptions, I couldn't predict it — it fluctuated 15-20% month to month based on volume.

    The rate limit incident. In December 2025, OpenAI experienced capacity issues that caused elevated latency and rate limit errors for about three hours during a Tuesday afternoon. Three of my client-facing workflows — email triage, support ticket classification, and the chatbot — all started failing. Tickets piled up. Emails went unclassified. The chatbot returned errors. I spent the afternoon apologizing to clients and manually processing queues. I had no fallback. My entire AI layer was a single point of failure.

    I wanted to own the models. I was building my business on someone else's infrastructure. The prompts I'd spent months refining were worthless without the model behind them. If OpenAI changed their pricing, deprecated a model, or decided my use case violated their terms, I had no Plan B. That felt unacceptable for something this central to my business.

    The Migration Plan

    I didn't try to migrate everything at once. I grouped the 12 workflows by task type and complexity:

    Group 1: Classification (4 workflows) — Email triage, support ticket classification, customer feedback analysis, lead scoring. These take a text input and output a category or score. Simplest to fine-tune.

    Group 2: Extraction (3 workflows) — Invoice data extraction, contract clause extraction, product description normalization. These take unstructured text and output structured data. Medium complexity.

    Group 3: Generation (5 workflows) — Content summarization, meeting notes, social media generation, chatbot responses, data enrichment. These produce free-form text. Highest complexity.

    My plan was to tackle them in this order, spending about a week on each group, with one final week for parallel testing and cutover.

    Week 1: Data Collection

    This was the most tedious part. I needed training data — examples of inputs and correct outputs for each workflow.

    Fortunately, n8n stores execution history. I had months of successful executions with both the inputs sent to OpenAI and the outputs received. I wrote a simple script to export these as JSONL files.
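Here's a minimal sketch of what that export script could look like against n8n's public REST API. The executions endpoint, API-key header, and cursor pagination follow n8n's documented API; the `extract_pair` helper is a placeholder you'd adapt, since the path into each execution's run data depends on your node names.

```python
import json
import requests

N8N_URL = "https://n8n.example.com"   # your n8n instance (placeholder)
API_KEY = "..."                        # n8n API key
HEADERS = {"X-N8N-API-KEY": API_KEY}

def extract_pair(execution):
    """Placeholder: dig the prompt sent to OpenAI and the response it returned
    out of the execution's run data. The exact path depends on your node names.
    Return {"execution_id": ..., "input": ..., "output": ...} or None to skip."""
    data = execution.get("data", {})
    # e.g. data["resultData"]["runData"]["OpenAI"][0]["data"]["main"][0][0]["json"]
    ...
    return None

def export_workflow(workflow_id, out_path):
    cursor = None
    with open(out_path, "w") as f:
        while True:
            params = {"workflowId": workflow_id, "status": "success",
                      "includeData": "true", "limit": 100}
            if cursor:
                params["cursor"] = cursor
            resp = requests.get(f"{N8N_URL}/api/v1/executions",
                                headers=HEADERS, params=params)
            resp.raise_for_status()
            body = resp.json()
            for execution in body["data"]:
                pair = extract_pair(execution)
                if pair:
                    f.write(json.dumps(pair) + "\n")
            cursor = body.get("nextCursor")
            if not cursor:
                break

export_workflow("42", "email_triage.jsonl")   # hypothetical workflow id
```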

    For each workflow, I exported the last 3 months of successful executions and cleaned them:

| Workflow Group | Raw Executions | After Cleaning | Final Training Set |
|---|---|---|---|
| Classification (4 workflows) | 12,400 | 8,200 | 800 (200 per workflow) |
| Extraction (3 workflows) | 6,150 | 4,800 | 600 (200 per workflow) |
| Generation (5 workflows) | 5,600 | 3,900 | 750 (150 per workflow) |

    "Cleaning" meant removing duplicates, discarding executions where the output was clearly wrong (I'd manually flagged some over the months), fixing formatting inconsistencies, and ensuring the input-output pairs were self-contained (no references to context that wouldn't be in the prompt).

    I ended up with three consolidated datasets:

    1. Classification dataset (800 examples): Input text + category/score label. Covered all 4 classification workflows.
    2. Extraction dataset (600 examples): Input text + structured JSON output. Covered invoice, contract, and product normalization workflows.
    3. Generation dataset (750 examples): Input brief/text + generated output. Covered summarization, social media, chatbot, and enrichment workflows.

    The cleaning took about 6 hours spread over 3 days. Not glamorous work, but this is the step that determines whether your fine-tuned model actually works. Garbage in, garbage out.

    Week 2: Fine-Tuning

    I used Ertas to fine-tune three models:

    Model 1: Qwen 2.5 7B for classification. I chose Qwen for classification because it's fast at inference and handles structured output well. Uploaded the 800-example classification dataset. Training took 35 minutes. I ran the built-in evaluation and got 94% accuracy on the held-out test set right away. Honestly, I was shocked — I expected to need multiple training runs.

Model 2: Llama 3.1 8B for extraction. Extraction requires understanding document structure and producing valid JSON, so I went with the slightly larger Llama base. Uploaded the 600-example extraction dataset. Training took 42 minutes. Evaluation showed 91% accuracy on extraction tasks. The main errors were on edge cases — multi-currency invoices and contracts with unusual formatting.

Model 3: Llama 3.1 8B for generation. Generation is the hardest task type because the "correct" output is subjective. I used Llama 3.1 8B again. Training took 48 minutes on the 750-example dataset. Evaluation was trickier here — I couldn't just compare outputs for exact match. I manually reviewed 50 generated outputs and found that about 82% were immediately usable, 14% needed minor edits, and 4% missed the mark.

    Total fine-tuning cost: $14.50 for the Ertas Builder plan (which covered all three training runs and more). Total time: about 2 hours of active work plus the training time.

    I exported all three models as GGUF files. Each was about 4.5-5GB quantized to Q4_K_M.

    Week 3: Testing

    This was the most important week. I set up Ollama on my development machine, loaded all three models, and ran parallel testing.

    For each workflow, I took the last 100 production executions and ran them through both the OpenAI model and my fine-tuned model. Then I compared outputs.
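Conceptually the harness was a small loop. Here's a sketch, assuming a test file of inputs with manually verified labels and a local model served through Ollama's OpenAI-compatible endpoint (the model names and system prompt are placeholders):

```python
import json
from openai import OpenAI

openai_client = OpenAI()                                   # uses OPENAI_API_KEY
local_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def classify(client, model, text):
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Classify the email into one of: ..."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Last 100 production executions, with manually verified labels in "output"
with open("email_triage_test.jsonl") as f:
    rows = [json.loads(line) for line in f]

matches_openai = matches_local = 0
for row in rows:
    expected = row["output"]
    if classify(openai_client, "gpt-4o-mini", row["input"]) == expected:
        matches_openai += 1
    if classify(local_client, "email-triage-qwen", row["input"]) == expected:
        matches_local += 1

print(f"OpenAI: {matches_openai}/{len(rows)}  Local: {matches_local}/{len(rows)}")
```

For classification I compared labels exactly like this; for the generation workflows I reviewed outputs by hand instead.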

    Classification Results

| Workflow | GPT-4o/mini Accuracy | Fine-tuned Qwen 7B Accuracy | Winner |
|---|---|---|---|
| Email triage | 88% | 96% | Fine-tuned |
| Support ticket classification | 85% | 95% | Fine-tuned |
| Customer feedback analysis | 90% | 93% | Fine-tuned |
| Lead scoring (within 1 point) | 72% | 84% | Fine-tuned |

    The fine-tuned model beat GPT-4o-mini on every single classification task. This makes sense — the fine-tuned model was trained on exactly these categories, with examples from my specific domain. GPT-4o-mini was guessing from a system prompt.

    The lead scoring improvement was the biggest surprise. GPT-4o (not mini — I was paying for the full model on this one) scored leads inconsistently. The same lead description might get a 6 one day and an 8 the next. The fine-tuned model was dramatically more consistent because it had learned my specific scoring criteria from 200 examples.

    Extraction Results

| Workflow | GPT-4o Accuracy | Fine-tuned Llama 8B Accuracy | Winner |
|---|---|---|---|
| Invoice extraction (all fields correct) | 76% | 92% | Fine-tuned |
| Contract clause extraction | 71% | 88% | Fine-tuned |
| Product description normalization | 83% | 95% | Fine-tuned |

    Again, fine-tuned won across the board. The biggest improvement was on invoices — GPT-4o struggled with consistent field naming and occasionally hallucinated values. The fine-tuned model learned my exact JSON schema and almost never deviated from it.

    The contract clause extraction was the most challenging. Some contracts had unusual formatting that the fine-tuned model hadn't seen. I noted these for additional training data later.

    Generation Results

| Workflow | GPT-4o Quality (manual rating 1-5) | Fine-tuned Llama 8B Quality | Winner |
|---|---|---|---|
| Content summarization | 4.2 | 3.9 | GPT-4o (slight edge) |
| Meeting notes to action items | 3.8 | 4.1 | Fine-tuned |
| Social media post drafting | 4.0 | 3.5 | GPT-4o |
| Chatbot responses | 3.6 | 4.3 | Fine-tuned |
| Data enrichment | 3.4 | 2.8 | GPT-4o |

    Generation was mixed. The fine-tuned model excelled at tasks with predictable structure — meeting notes and chatbot responses, where the format is consistent and the content is domain-specific. GPT-4o had the edge on creative tasks (social media posts) and tasks requiring external knowledge (data enrichment).

    For social media post drafting, the fine-tuned model produced functional but formulaic posts. They matched our brand voice perfectly (because that's what it was trained on) but lacked the variety that GPT-4o brought. I decided this was acceptable for a first draft that gets human review anyway.

    For data enrichment, the fine-tuned model couldn't match GPT-4o because the task inherently requires broad knowledge about companies. This was the one workflow I considered keeping on the API. More on that later.

    Week 4: Cutover

    I provisioned a Hetzner CPX41 VPS — 8 vCPU (AMD EPYC), 16GB RAM, 240GB NVMe SSD. Cost: $30.49/month. Installed Ubuntu 24.04, set up Ollama, and loaded all three GGUF models.

    I configured n8n to use the Ollama endpoint instead of the OpenAI API. Since Ollama exposes an OpenAI-compatible API, this was literally changing the base URL and API key in the n8n HTTP Request nodes. For most workflows, the switch was a one-line change.
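If you want to sanity-check the swap outside n8n first, the same request works against Ollama with only the host and key changed. A minimal sketch (the model name is whatever you registered the GGUF under in Ollama; the key is a dummy because Ollama ignores it):

```python
import requests

# Before (OpenAI): POST https://api.openai.com/v1/chat/completions with your real key
# After  (Ollama): same request shape, different host, dummy key
resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    headers={"Authorization": "Bearer ollama"},
    json={
        "model": "support-ticket-qwen",   # hypothetical name of the loaded GGUF model
        "messages": [{"role": "user", "content": "Route this ticket: my invoice is wrong"}],
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```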

    I migrated workflows in order of confidence:

    Day 1-2: Classification workflows (highest confidence). Switched all 4 classification workflows to the fine-tuned Qwen model. Kept OpenAI running in parallel as a shadow pipeline — both models processed every input, but only the fine-tuned model's output was used.

    Day 3-4: Extraction workflows. Switched all 3 extraction workflows. Same shadow pipeline setup. Monitored for JSON parsing errors since downstream workflows depend on clean structured output.

    Day 5: Generation workflows (except data enrichment). Switched 4 of the 5 generation workflows. For the chatbot workflow, I was nervous — it's client-facing. I kept the shadow pipeline running for an extra 3 days.

    Day 6-7: Data enrichment. This was the tricky one. My fine-tuned model couldn't match GPT-4o's broad knowledge for company research. I made a pragmatic decision: I switched to a hybrid approach. The fine-tuned model handles the formatting and structured output, while I use the Serper API ($5/month) for the actual web research data that feeds into the model. Total cost for this workflow went from $48/month to about $7/month.
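A rough sketch of that hybrid shape, assuming Serper's standard search endpoint and a fine-tuned model served by Ollama (the model name, prompt, and fields pulled from the search results are placeholders):

```python
import requests
from openai import OpenAI

SERPER_KEY = "..."                     # serper.dev API key
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def enrich_company(name, url):
    # 1. Web research via Serper -- this replaces GPT-4o's broad knowledge
    search = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": SERPER_KEY, "Content-Type": "application/json"},
        json={"q": f"{name} {url} company"},
    ).json()
    snippets = "\n".join(r.get("snippet", "") for r in search.get("organic", [])[:5])

    # 2. Formatting / structured output via the fine-tuned local model
    resp = local.chat.completions.create(
        model="enrichment-llama",      # hypothetical fine-tuned model name
        messages=[
            {"role": "system", "content": "Summarize the company as JSON: ..."},
            {"role": "user", "content": f"Company: {name} ({url})\n\n{snippets}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content

print(enrich_company("Acme GmbH", "acme.example"))
```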

    After the full cutover, I kept the OpenAI API key active for one week as a manual fallback. I never needed it. On Day 14, I deleted the API key.

    The Results After 60 Days

    Here's the honest comparison after two full months of production use.

    Cost

| Metric | Before (OpenAI) | After (Self-hosted) | Change |
|---|---|---|---|
| Monthly AI API cost | $487 | $0 | -100% |
| Monthly VPS cost | $0 | $30.49 | +$30.49 |
| Monthly Ertas cost | $0 | $14.50 | +$14.50 |
| Monthly Serper API (enrichment) | $0 | $5 | +$5 |
| Total monthly AI spend | $487 | $49.99 | -$437.01 |
| Annual AI spend | $5,844 | $599.88 | -$5,244.12 |

    I'm saving $437 per month. Over a year, that's $5,244. For a small automation agency, that's significant — it's roughly the cost of a part-time contractor for a month, or new equipment, or just straight profit.

    Performance

| Metric | Before (OpenAI) | After (Self-hosted) | Change |
|---|---|---|---|
| Average classification accuracy | 84% | 95% | +11 points |
| Average extraction accuracy | 77% | 92% | +15 points |
| Average generation quality (1-5) | 3.8 | 3.9 | +0.1 |
| Average latency per request | 1,100ms | 185ms | -83% |
| Timeout/error rate | 1.8% | 0.1% | -94% |
| Uptime (AI layer) | 99.2% | 99.95% | +0.75 points |

    The accuracy improvements on classification and extraction are real and sustained. After 60 days, the fine-tuned models are consistently outperforming what GPT-4o was doing on my specific tasks.

    The latency improvement is dramatic. My workflows are noticeably faster. The email triage workflow that used to take 3-4 seconds per email now completes in under 500ms. The support ticket classifier responds in 120ms. Users have noticed — my client whose support bot runs on my n8n instance commented that "the bot feels snappier."

    The uptime improvement matters more than anything. Zero API outages in 60 days. Zero rate limit errors. Zero "model overloaded" responses. My workflows just run. The only downtime was 12 minutes when I rebooted the VPS for a kernel update — and I scheduled that for 3 AM.

    Volume

| Metric | Before | After | Change |
|---|---|---|---|
| Daily AI requests processed | ~1,355 | ~1,620 | +20% |
| Monthly AI requests | ~40,650 | ~48,600 | +20% |
| Cost per request | $0.012 | $0.001 | -92% |

    Volume grew 20% over the 60 days because my clients added more data to their pipelines. Under the old system, that growth would have pushed my API bill to $585+. Under the new system, I didn't even notice — the VPS handles the extra load without breaking a sweat.

    What Surprised Me

    Local is Actually Faster

    I expected self-hosted to be slower. It's not even close. Without the network round-trip to OpenAI's servers, the API queue time, and the cold start latency, every single request is faster. Classification requests that took 800ms via API now complete in 120ms locally. That's a 6.7x improvement.

For my meeting notes workflow, this matters a lot. Processing a 45-minute meeting transcript through GPT-4o took 8-12 seconds. Through the local model, it takes 1.5 seconds. Multiply that by 40 meetings/month and I'm saving roughly 5-7 minutes of wall clock time per month on just that one workflow.

    The Fine-Tuned Model is More Consistent

    This was the biggest practical improvement. GPT-4o's outputs would vary in format, structure, and even correctness between identical inputs. My downstream n8n nodes had extensive error handling — regex patterns to catch format variations, retry logic for malformed JSON, fallback parsers.

    With the fine-tuned model, I deleted most of that error handling code. The model produces the same output format 98%+ of the time. My n8n workflows got simpler and more reliable simultaneously.

    Some Workflows Actually Improved

    I expected "good enough" from the fine-tuned models. For classification and extraction, I got "significantly better." The lead scoring workflow went from 72% agreement with manual scores to 84%. The support ticket classifier went from 85% to 95%. These aren't marginal improvements — they materially reduced the manual review my team does.

    The reason is simple: GPT-4o is trying to be good at everything. My fine-tuned models are trying to be good at one thing. On that one thing, specialization wins.

    Resource Usage is Minimal

    I was worried about the VPS running out of memory or CPU with three models loaded. In practice, Ollama is efficient about model loading — it keeps the active model in memory and swaps on demand. With 16GB RAM, I can keep two models loaded simultaneously, and the third loads in about 2 seconds when needed.

    CPU usage averages 15-20% during normal operation, spiking to 60-70% during batch processing. I have significant headroom to add more workflows before needing to upgrade.

    What I'd Do Differently

    Start With the Highest-Volume Workflow

I started with classification because it was the simplest to fine-tune. In hindsight, I should have started with the chatbot (250 conversations/day, $98/month) or product description normalization (300 listings/day). Those were my highest-volume workflows, and the chatbot was my single biggest API cost, so they would have shown the fastest ROI.

    Collect More Training Data Upfront

    I used 200 examples per classification workflow, which worked well. But for extraction and generation, I wish I'd used 300-400 examples. The contract clause extraction model still struggles with unusual contract formats, and I suspect more training data would help. I've been collecting additional examples and plan to retrain next month.

    Test Edge Cases Earlier

    My first batch of training data was biased toward common cases. The invoice extraction model was great at standard invoices but initially fumbled multi-page invoices and those with foreign currency formatting. I should have intentionally included more edge cases in the training set from the start. I fixed this by adding 50 edge-case examples and retraining — accuracy on edge cases jumped from 61% to 87%.

    Set Up Monitoring From Day One

    For the first two weeks, I was manually checking outputs. I should have built a simple monitoring dashboard from the start — tracking accuracy, latency, and output format compliance automatically. I eventually built this with a separate n8n workflow that samples 5% of outputs and logs them for review. Wish I'd done it on day one.
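Mine lives in a separate n8n workflow, but the sampling logic itself is a few lines in any language. A sketch of the idea, with illustrative field names and a basic JSON-validity check standing in for real accuracy tracking:

```python
import json
import random
import time

SAMPLE_RATE = 0.05      # log ~5% of outputs for manual review

def _is_json(text):
    try:
        json.loads(text)
        return True
    except (ValueError, TypeError):
        return False

def maybe_log(workflow, model, prompt, output, latency_ms, log_path="ai_samples.jsonl"):
    if random.random() > SAMPLE_RATE:
        return
    record = {
        "ts": time.time(),
        "workflow": workflow,
        "model": model,
        "prompt": prompt,
        "output": output,
        "latency_ms": latency_ms,
        # quick automatic format check; accuracy still needs a human eye
        "valid_json": _is_json(output),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```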

    The Final Numbers

    Let me lay it out one more time, because these numbers still surprise me when I look at them.

| Metric | Before | After |
|---|---|---|
| Monthly AI spend | $487 | $49.99 |
| Annual AI spend | $5,844 | $599.88 |
| Annual savings | | $5,244.12 |
| Accuracy (classification avg) | 84% | 95% |
| Accuracy (extraction avg) | 77% | 92% |
| Average latency | 1,100ms | 185ms |
| API outages experienced (60 days) | 3 | 0 |
| Vendor dependencies | OpenAI (critical) | None |

The $487/month is now $49.99/month. That's $30.49 for the VPS, $14.50 for Ertas, and $5 for the Serper API I use in the data enrichment workflow. Annual savings of $5,244.

    But the savings are almost secondary at this point. What I value most is the reliability and the independence. My AI layer doesn't go down when OpenAI has issues. My costs don't change when someone in San Francisco adjusts a pricing slider. My clients' data stays on my server.

    I own my models. I own my data. I own my infrastructure. For the first time since I started using AI in my workflows, I actually own the most important part of my business.

    If you're running n8n workflows with OpenAI calls and spending more than $50/month, the migration is worth it. It took me 4 weeks of part-time work. The models are better on my tasks, faster on every task, and cost 90% less. I wish I'd done it six months sooner.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
