Fine-Tuning for Apple Silicon: Running Custom Models on M-Series Macs
A practical guide to deploying fine-tuned AI models on Apple Silicon Macs. Covers M4 hardware capabilities, unified memory advantages, Ollama and MLX setup, quantization choices, and Core ML LoRA adapter support.
Apple Silicon has a quiet advantage for local AI inference that most people underestimate: unified memory. The CPU, GPU, and Neural Engine all share the same memory pool — no data copying between separate VRAM and system RAM. For large language model inference, where memory bandwidth is the primary bottleneck, this architecture is a genuine competitive advantage.
If you own an M-series Mac, you already own capable AI inference hardware. This guide covers how to take a fine-tuned model from Ertas and deploy it locally on your Mac — no cloud APIs, no GPU rentals, no per-token billing.
What Your Mac Can Run
The limiting factor for local LLM inference is memory. Here's what each M-series tier supports:
| Mac | Unified Memory | Recommended Models | Expected Speed |
|---|---|---|---|
| M1/M2/M3/M4 (base) | 8-16 GB | 1-3B quantized, 7B at Q4 (tight) | ~15-25 tok/s |
| M1/M2/M3/M4 Pro | 18-24 GB | 7-8B at Q5/Q8, 13B at Q4 | ~25-35 tok/s |
| M1/M2/M3/M4 Max | 32-128 GB | 13B at Q8, 70B at Q4 | ~15-30 tok/s |
| M2/M4 Ultra | 64-192 GB | 70B at Q8, multiple models simultaneously | ~20-35 tok/s |
The sweet spot for most developers: An M4 Pro (24 GB) running a fine-tuned 8B model at Q5_K_M or Q8_0. This delivers ~30+ tokens per second — fast enough for interactive use — with room for generous context windows.
For guidance on choosing between quantization levels, see our quantization guide.
Why Unified Memory Matters
On a traditional PC with a discrete GPU, LLM inference works like this:
- Model weights live in GPU VRAM (limited to 8-24 GB on consumer cards)
- If the model doesn't fit in VRAM, parts spill to system RAM
- Accessing system RAM from the GPU is 10-20x slower than VRAM
- This "offloading" kills performance
On Apple Silicon:
- Everything — CPU, GPU, Neural Engine — accesses the same memory pool
- There's no VRAM/RAM distinction
- A Mac with 64 GB unified memory gives the GPU access to all 64 GB at full speed
- No offloading penalty
This means a Mac Studio M4 Ultra with 192 GB unified memory can run models that would require multiple enterprise GPUs on a traditional setup. For inference (not training), Apple Silicon is surprisingly competitive.
The Deployment Stack
Option 1: Ollama (Easiest)
Ollama is the simplest path from fine-tuned model to running inference on your Mac.
Setup:
- Install Ollama: `brew install ollama`
- Fine-tune your model on Ertas and export as GGUF
- Create a Modelfile pointing to your GGUF: `FROM ./your-fine-tuned-model.Q5_K_M.gguf`
- Import: `ollama create my-model -f Modelfile`
- Run: `ollama run my-model`
Ollama handles all the Apple Silicon optimization automatically — it uses Metal for GPU acceleration on M-series chips. No configuration needed.
When to use Ollama: When you want the fastest path to running a fine-tuned model locally. Excellent for development, testing, and production inference behind an API endpoint.
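If you'd rather script against the model than use the interactive CLI, the official `ollama` Python package talks to the same local server. A minimal sketch, assuming the model was imported as `my-model` in the steps above (the prompt is just an illustration):

```python
# Smoke test a locally imported Ollama model from Python.
# Requires `pip install ollama` and a prior `ollama create my-model -f Modelfile`.
import ollama

resp = ollama.chat(
    model="my-model",
    messages=[{"role": "user", "content": "Summarize this ticket: printer offline again."}],
)
# Recent versions also support attribute access: resp.message.content
print(resp["message"]["content"])
```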
Option 2: MLX (Apple-Native Performance)
MLX is Apple's own machine learning framework, designed specifically for Apple Silicon. It offers lower-level control and often better performance than Ollama on M-series hardware.
Advantages over Ollama:
- Built by Apple, optimized for the specific memory hierarchy of M-series chips
- Supports LoRA adapter loading natively (swap adapters without reloading the base model)
- Fine-tuning directly on Mac is possible for small models (though Ertas cloud GPUs are faster)
When to use MLX: When you need maximum performance on Apple Silicon, when you want to hot-swap LoRA adapters, or when you're building a native macOS application with AI features.
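As a rough sketch of what adapter swapping looks like with the `mlx-lm` package (the model repo and adapter path below are placeholders, and argument names can differ slightly between mlx-lm versions):

```python
# Load a base model once, then apply a task-specific LoRA adapter on top.
# Requires `pip install mlx-lm`; paths are illustrative placeholders.
from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",  # example base model repo
    adapter_path="./adapters/support-tickets",         # fine-tuned LoRA weights
)

text = generate(model, tokenizer, prompt="Classify: 'refund not received'", max_tokens=64)
print(text)
```

Swapping tasks is then a matter of calling `load` again with a different `adapter_path` while the base weights stay cached on disk.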
Option 3: llama.cpp (Maximum Control)
The underlying engine that powers Ollama. Use it directly when you need custom batch sizes, specific threading configurations, or when integrating with a custom application via the C/C++ API.
llama.cpp includes Metal support for Apple Silicon GPU acceleration out of the box.
When to use llama.cpp: When you need fine-grained control over inference parameters or are embedding inference into a compiled application.
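From Python, the `llama-cpp-python` bindings expose that control directly. A minimal sketch under the assumption that the package was installed with Metal support (the model path is a placeholder):

```python
# Direct llama.cpp inference with explicit control over GPU offload,
# context size, and threading. Requires `pip install llama-cpp-python`.
from llama_cpp import Llama

llm = Llama(
    model_path="./your-fine-tuned-model.Q5_K_M.gguf",
    n_gpu_layers=-1,   # offload every layer to the Metal GPU
    n_ctx=2048,        # small context window keeps the KV cache lean
    n_threads=8,       # CPU threads for the non-offloaded work
)

out = llm("Extract the invoice number: 'INV-2024-0917 due March 3'", max_tokens=32)
print(out["choices"][0]["text"])
```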
Core ML and LoRA Adapters
Apple's Core ML framework now supports LoRA adapter inference on the Neural Engine — the dedicated AI accelerator built into every M-series chip.
This matters for two reasons:
- Adapter swapping is fast. Load a base model once, swap LoRA adapters for different tasks without reloading the full model. This is the same pattern that hardware vendors are building into their chips.
- Neural Engine efficiency. The ANE (Apple Neural Engine) is optimized for specific quantization levels and model architectures. Running inference on the ANE can be more power-efficient than GPU inference, extending battery life on MacBooks.
Apple published a research paper demonstrating Llama 3.1 8B running locally via Core ML at ~33 tokens/sec on M1 Max. M4 series chips are faster.
The End-to-End Workflow
Here's the complete workflow from domain data to running inference on your Mac:
1. Fine-Tune on Cloud GPUs (via Ertas)
Fine-tuning requires GPU compute that's impractical on consumer hardware — even powerful Macs. The M4 Max can fine-tune 7B models via MLX, but it's slow (hours vs. minutes on cloud GPUs) and ties up your machine.
Use Ertas to fine-tune on cloud GPUs: upload your dataset, configure training visually, monitor results. Training happens in minutes, not hours.
2. Export as GGUF
Export your fine-tuned model from Ertas as GGUF at your target quantization level:
- Q4_K_M for memory-constrained Macs (8-16 GB)
- Q5_K_M for production quality on Macs with 24 GB+
- Q8_0 for maximum quality on Macs with 32 GB+
You can also export as a separate LoRA adapter if you want to use adapter-swapping via MLX or Core ML.
3. Load into Ollama
Import your GGUF into Ollama and start serving inference. Ollama exposes an OpenAI-compatible API by default, so any application that talks to the OpenAI API can point to your local model with a one-line configuration change.
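For example, a sketch using the `openai` Python package pointed at the local endpoint (Ollama ignores the API key, so any placeholder string works; `my-model` is the name used in `ollama create`):

```python
# Point an existing OpenAI-client integration at the local Ollama server.
# Ollama exposes an OpenAI-compatible endpoint at /v1 on port 11434.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Tag this email: 'cancel my subscription'"}],
)
print(resp.choices[0].message.content)
```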
4. Integrate with Your Stack
Your fine-tuned model running on Ollama on your Mac can serve:
- n8n workflows via the Ollama node (replace OpenAI API calls)
- Web applications via the REST API (localhost:11434; see the sketch after this list)
- CLI tools via Ollama's command-line interface
- Custom applications via the Python or JavaScript client libraries
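For the plain REST route, a minimal sketch with `requests` against Ollama's native `/api/generate` endpoint (the model name and prompt are placeholders):

```python
# Call the local Ollama REST API directly; any HTTP client works the same way.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "my-model", "prompt": "Extract the order ID: 'order 4471 delayed'", "stream": False},
)
print(resp.json()["response"])
```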
5. Run at Zero Marginal Cost
Once the model is loaded, every query costs only electricity. No per-token billing. No API rate limits. No data leaving your machine.
For an indie developer processing 50,000 queries per month, the difference between cloud API costs ($500-2,000/month) and local inference on a Mac you already own ($10-15/month electricity) is the difference between a viable business and burning cash.
Performance Optimization Tips
Match Context Length to Your Needs
Longer context windows consume more memory (KV cache grows linearly with context length). If your use case only needs 2K context (many classification and extraction tasks), set the context window accordingly. This frees memory for the model weights and improves speed.
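With Ollama, the context window can be capped per request via the `num_ctx` option, shown here with the Python client (the same setting can go in a Modelfile `PARAMETER num_ctx` line):

```python
# Cap the context window for a short classification task so the KV cache
# stays small and more unified memory is left for the model weights.
import ollama

resp = ollama.chat(
    model="my-model",
    messages=[{"role": "user", "content": "Classify sentiment: 'the update broke search'"}],
    options={"num_ctx": 2048},  # 2K tokens is plenty for short prompts
)
print(resp["message"]["content"])
```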
Use the Right Quantization for Your Memory
Don't default to the highest-precision quantization your Mac can technically load. Leave headroom for the KV cache, the operating system, and your other applications. A model that barely fits will run slower due to memory pressure.
Safe rule: Model file size should be no more than 60-70% of your total unified memory for comfortable operation.
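As a back-of-the-envelope check (the bits-per-weight figures below are approximations; real GGUF files vary by a few percent and the estimate excludes the KV cache):

```python
# Rough fit check: does a quantized model leave enough headroom on a given Mac?
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}  # approximate

def fits_comfortably(params_billion: float, quant: str, unified_gb: float, budget: float = 0.65) -> bool:
    model_gb = params_billion * BITS_PER_WEIGHT[quant] / 8  # weights only
    return model_gb <= unified_gb * budget

# An 8B model at Q5_K_M on a 24 GB M4 Pro: ~5.7 GB against a ~15.6 GB budget.
print(fits_comfortably(8, "Q5_K_M", 24))   # True
# A 70B model at Q4_K_M (~42 GB) does not fit on the same machine.
print(fits_comfortably(70, "Q4_K_M", 24))  # False
```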
Batch Related Queries
If you're processing many similar inputs (document classification, data extraction), batch them through a script rather than interactive chat. This keeps the model loaded and avoids cold-start overhead.
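A minimal batching sketch with the Python client (model name and inputs are placeholders):

```python
# Process a list of inputs in one run so the model stays resident in memory
# instead of paying a cold-start load on every query.
import ollama

documents = [
    "Invoice INV-118 overdue by 12 days",
    "Customer requests plan downgrade",
    "Shipping address updated for order 4471",
]

for doc in documents:
    resp = ollama.generate(
        model="my-model",
        prompt=f"Classify this record as billing, account, or logistics: {doc}",
    )
    print(resp["response"])
```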
Consider a Dedicated Mac for Inference
For agencies or teams running AI inference as a service, a Mac Mini M4 Pro ($1,600-2,000) or Mac Studio M4 Max ($3,000-5,000) makes an excellent dedicated inference server. Low power consumption, silent operation, and enough memory for production workloads.
Compare that to a cloud GPU at $800-1,500/month. The Mac pays for itself in 2-4 months.
When Not to Use Apple Silicon for Inference
Apple Silicon is excellent for inference but not always the right choice:
- Throughput-critical workloads: If you need to serve hundreds of concurrent users, dedicated GPU servers (or dedicated silicon like Taalas HC1) will outperform a Mac
- Models larger than your memory: If your model requires more memory than your Mac has, you need bigger hardware
- Fine-tuning itself: Training on cloud GPUs via Ertas is faster and more cost-effective than training on-device (except for small experiments)
For everything else — development, testing, single-user or small-team production inference, privacy-sensitive deployments, and cost-conscious indie apps — Apple Silicon is a strong choice.
Getting Started
- Check your Mac's unified memory: Apple menu → About This Mac → Memory
- Refer to the table above to see what models you can run
- Fine-tune on Ertas — upload your domain data, train visually, export as GGUF
- Install Ollama: `brew install ollama`
- Import your model and start querying
Your fine-tuned AI model, running on hardware you already own, at zero per-query cost. That's the local AI promise — and on Apple Silicon, it works well today.
References: Apple Core ML — On-Device Llama, SitePoint — Guide to Local LLMs in 2026, XDA — Apple's Sleeper Advantage for Local LLMs, Best Local LLMs for Apple Silicon 2026.