
    Fine-Tuning for Apple Silicon: Running Custom Models on M-Series Macs

    A practical guide to deploying fine-tuned AI models on Apple Silicon Macs. Covers M4 hardware capabilities, unified memory advantages, Ollama and MLX setup, quantization choices, and Core ML LoRA adapter support.

    Ertas Team

    Apple Silicon has a quiet advantage for local AI inference that most people underestimate: unified memory. The CPU, GPU, and Neural Engine all share the same memory pool — no data copying between separate VRAM and system RAM. For large language model inference, where memory bandwidth is the primary bottleneck, this architecture is a genuine competitive advantage.

    If you own an M-series Mac, you already own capable AI inference hardware. This guide covers how to take a fine-tuned model from Ertas and deploy it locally on your Mac — no cloud APIs, no GPU rentals, no per-token billing.

    What Your Mac Can Run

    The limiting factor for local LLM inference is memory. Here's what each M-series tier supports:

    | Mac | Unified Memory | Recommended Models | Expected Speed |
    | --- | --- | --- | --- |
    | M1/M2/M3/M4 (base) | 8-16 GB | 1-3B quantized, 7B at Q4 (tight) | ~15-25 tok/s |
    | M1/M2/M3/M4 Pro | 18-24 GB | 7-8B at Q5/Q8, 13B at Q4 | ~25-35 tok/s |
    | M1/M2/M3/M4 Max | 32-128 GB | 13B at Q8, 70B at Q4 | ~15-30 tok/s |
    | M2/M4 Ultra | 64-192 GB | 70B at Q8, multiple models simultaneously | ~20-35 tok/s |

    The sweet spot for most developers: An M4 Pro (24 GB) running a fine-tuned 8B model at Q5_K_M or Q8_0. This delivers ~30+ tokens per second — fast enough for interactive use — with room for generous context windows.
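    As a rough sanity check, you can estimate a quantized model's file size from its parameter count and effective bits per weight (a back-of-the-envelope sketch; the bits-per-weight figures below are approximate, and real GGUF files run slightly larger because some tensors are kept at higher precision):

```python
def estimated_model_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: parameters x bits per weight, in GB.

    Real files are a bit larger (metadata, embeddings and norms stored
    at higher precision), so treat this as a lower bound.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# An 8B model at roughly Q4 (~4.5 effective bits) vs Q8 (~8.5 effective bits):
print(round(estimated_model_gb(8, 4.5), 1))  # ~4.5 GB
print(round(estimated_model_gb(8, 8.5), 1))  # ~8.5 GB
```

    Comparing those numbers against the memory column in the table above tells you quickly which tier of Mac a given export will fit on.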

    For guidance on choosing between quantization levels, see our quantization guide.

    Why Unified Memory Matters

    On a traditional PC with a discrete GPU, LLM inference works like this:

    1. Model weights live in GPU VRAM (limited to 8-24 GB on consumer cards)
    2. If the model doesn't fit in VRAM, parts spill to system RAM
    3. Accessing system RAM from the GPU is 10-20x slower than VRAM
    4. This "offloading" kills performance

    On Apple Silicon:

    1. Everything — CPU, GPU, Neural Engine — accesses the same memory pool
    2. There's no VRAM/RAM distinction
    3. A Mac with 64 GB unified memory gives the GPU access to all 64 GB at full speed
    4. No offloading penalty

    This means a Mac Studio M4 Ultra with 192 GB unified memory can run models that would require multiple enterprise GPUs on a traditional setup. For inference (not training), Apple Silicon is surprisingly competitive.

    The Deployment Stack

    Option 1: Ollama (Easiest)

    Ollama is the simplest path from fine-tuned model to running inference on your Mac.

    Setup:

    1. Install Ollama: brew install ollama
    2. Fine-tune your model on Ertas and export as GGUF
    3. Create a Modelfile pointing to your GGUF:
      FROM ./your-fine-tuned-model.Q5_K_M.gguf
    4. Import: ollama create my-model -f Modelfile
    5. Run: ollama run my-model

    Ollama handles all the Apple Silicon optimization automatically — it uses Metal for GPU acceleration on M-series chips. No configuration needed.

    When to use Ollama: When you want the fastest path to running a fine-tuned model locally. Excellent for development, testing, and production inference behind an API endpoint.

    Option 2: MLX (Apple-Native Performance)

    MLX is Apple's own machine learning framework, designed specifically for Apple Silicon. It offers lower-level control and often better performance than Ollama on M-series hardware.

    Advantages over Ollama:

    • Built by Apple, optimized for the specific memory hierarchy of M-series chips
    • Supports LoRA adapter loading natively (swap adapters without reloading the base model)
    • Fine-tuning directly on Mac is possible for small models (though Ertas cloud GPUs are faster)

    When to use MLX: When you need maximum performance on Apple Silicon, when you want to hot-swap LoRA adapters, or when you're building a native macOS application with AI features.

    Option 3: llama.cpp (Maximum Control)

    The underlying engine that powers Ollama. Use it directly when you need custom batch sizes, specific threading configurations, or when integrating with a custom application via the C/C++ API.

    llama.cpp includes Metal support for Apple Silicon GPU acceleration out of the box.

    When to use llama.cpp: When you need fine-grained control over inference parameters or are embedding inference into a compiled application.

    Core ML and LoRA Adapters

    Apple's Core ML framework now supports LoRA adapter inference on the Neural Engine — the dedicated AI accelerator built into every M-series chip.

    This matters for two reasons:

    1. Adapter swapping is fast. Load a base model once, swap LoRA adapters for different tasks without reloading the full model. This is the same pattern that hardware vendors are building into their chips.

    2. Neural Engine efficiency. The ANE (Apple Neural Engine) is optimized for specific quantization levels and model architectures. Running inference on the ANE can be more power-efficient than GPU inference, extending battery life on MacBooks.

    Apple published a research paper demonstrating Llama 3.1 8B running locally via Core ML at ~33 tokens/sec on M1 Max. M4 series chips are faster.

    The End-to-End Workflow

    Here's the complete workflow from domain data to running inference on your Mac:

    1. Fine-Tune on Cloud GPUs (via Ertas)

    Fine-tuning requires GPU compute that's impractical on consumer hardware — even powerful Macs. The M4 Max can fine-tune 7B models via MLX, but it's slow (hours vs. minutes on cloud GPUs) and ties up your machine.

    Use Ertas to fine-tune on cloud GPUs: upload your dataset, configure training visually, monitor results. Training happens in minutes, not hours.

    2. Export as GGUF

    Export your fine-tuned model from Ertas as GGUF at your target quantization level:

    • Q4_K_M for memory-constrained Macs (8-16 GB)
    • Q5_K_M for production quality on Macs with 24 GB+
    • Q8_0 for maximum quality on Macs with 32 GB+

    You can also export as a separate LoRA adapter if you want to use adapter-swapping via MLX or Core ML.

    3. Load into Ollama

    Import your GGUF into Ollama and start serving inference. Ollama exposes an OpenAI-compatible API by default, so any application that talks to the OpenAI API can point to your local model with a one-line configuration change.
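    To illustrate, an OpenAI-style chat request aimed at the local endpoint differs from a cloud call only in its base URL (a sketch using only the standard library; `my-model` is the name from the import step above):

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for local Ollama."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("my-model", "Classify this ticket: 'login page 500s'")
# With Ollama running locally: resp = urllib.request.urlopen(req)
print(req.full_url)  # http://localhost:11434/v1/chat/completions
```

    Any OpenAI SDK can make the same switch by setting its base URL to `http://localhost:11434/v1`.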

    4. Integrate with Your Stack

    Your fine-tuned model running on Ollama on your Mac can serve:

    • n8n workflows via the Ollama node (replace OpenAI API calls)
    • Web applications via the REST API (localhost:11434)
    • CLI tools via Ollama's command-line interface
    • Custom applications via the Python or JavaScript client libraries

    5. Run at Zero Marginal Cost

    Once the model is loaded, every query costs only electricity. No per-token billing. No API rate limits. No data leaving your machine.

    For an indie developer processing 50,000 queries per month, the difference between cloud API costs ($500-2,000/month) and local inference on a Mac you already own ($10-15/month electricity) is the difference between a viable business and burning cash.
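    Using the article's own figures, the break-even math is simple (the monthly costs below are assumed midpoints of the ranges above, not quoted prices):

```python
queries_per_month = 50_000
cloud_monthly = 1_000.0  # assumed midpoint of the $500-2,000/month range
local_monthly = 12.5     # assumed midpoint of the $10-15/month electricity range

savings = cloud_monthly - local_monthly
print(f"${savings:,.0f}/month saved")                    # $988/month saved
print(f"${savings / queries_per_month:.4f} per query")   # $0.0198 per query
```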

    Performance Optimization Tips

    Match Context Length to Your Needs

    Longer context windows consume more memory (KV cache grows linearly with context length). If your use case only needs 2K context (many classification and extraction tasks), set the context window accordingly. This frees memory for the model weights and improves speed.
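    That linear growth is easy to quantify. For a grouped-query-attention model in the Llama 3.1 8B class (32 layers, 8 KV heads, head dimension 128 — architecture numbers assumed from the public model card), a 16-bit KV cache costs:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: 2 (K and V) x layers x heads x dim x tokens x bytes."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value / 1e9

# Llama-3.1-8B-style geometry at two context lengths:
print(round(kv_cache_gb(32, 8, 128, 2_048), 2))    # ~0.27 GB at 2K context
print(round(kv_cache_gb(32, 8, 128, 131_072), 1))  # ~17.2 GB at full 128K
```

    On a 24 GB Mac, the difference between a 2K and a 128K context window is the difference between trivial overhead and more memory than the model weights themselves.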

    Use the Right Quantization for Your Memory

    Don't just use the highest quantization your Mac can technically load. Leave headroom for KV cache, the operating system, and your other applications. A model that barely fits will be slower due to memory pressure.

    Safe rule: Model file size should be no more than 60-70% of your total unified memory for comfortable operation.
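    That rule of thumb can be expressed as a quick pre-flight check (a sketch; the 0.7 ceiling is this article's guideline, not a hard limit):

```python
def fits_comfortably(model_gb: float, unified_memory_gb: float,
                     ceiling: float = 0.7) -> bool:
    """True if the model file stays within ~70% of unified memory,
    leaving headroom for KV cache, the OS, and other applications."""
    return model_gb <= unified_memory_gb * ceiling

print(fits_comfortably(5.7, 24))   # True  - an 8B Q5_K_M on a 24 GB M4 Pro
print(fits_comfortably(40.0, 48))  # False - a 70B Q4 wants more headroom
```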

    Batch Similar Requests

    If you're processing many similar inputs (document classification, data extraction), batch them through a script rather than interactive chat. This keeps the model loaded and avoids cold-start overhead.
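    A minimal batch run might build one request per document against Ollama's native API, using the `keep_alive` parameter to hold the model in memory for the whole run (a sketch; `my-model` is the hypothetical name from the import step earlier):

```python
import json

def generate_payload(prompt: str, model: str = "my-model") -> bytes:
    """Payload for Ollama's /api/generate endpoint; keep_alive keeps the
    weights resident between calls so a batch run pays no reload cost."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": "30m",  # hold the model loaded for the whole batch
    }).encode()

documents = ["Invoice #1042: ...", "Support ticket: login fails ..."]
payloads = [generate_payload(f"Classify: {d}") for d in documents]
print(len(payloads))  # one request per document, model loaded once
```

    POST each payload to `http://localhost:11434/api/generate` with any HTTP client; because the model stays loaded, only the first request pays the startup cost.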

    Consider a Dedicated Mac for Inference

    For agencies or teams running AI inference as a service, a Mac Mini M4 Pro ($1,600-2,000) or Mac Studio M4 Max ($3,000-5,000) makes an excellent dedicated inference server. Low power consumption, silent operation, and enough memory for production workloads.

    Compare that to a cloud GPU at $800-1,500/month. The Mac pays for itself in 2-4 months.

    When Not to Use Apple Silicon for Inference

    Apple Silicon is excellent for inference but not always the right choice:

    • Throughput-critical workloads: If you need to serve hundreds of concurrent users, dedicated GPU servers (or dedicated silicon like Taalas HC1) will outperform a Mac
    • Models larger than your memory: If your model requires more memory than your Mac has, you need bigger hardware
    • Fine-tuning itself: Training on cloud GPUs via Ertas is faster and more cost-effective than training on-device (except for small experiments)

    For everything else — development, testing, single-user or small-team production inference, privacy-sensitive deployments, and cost-conscious indie apps — Apple Silicon is a strong choice.

    Getting Started

    1. Check your Mac's unified memory: Apple menu → About This Mac → Memory
    2. Refer to the table above to see what models you can run
    3. Fine-tune on Ertas — upload your domain data, train visually, export as GGUF
    4. Install Ollama: brew install ollama
    5. Import your model and start querying

    Your fine-tuned AI model, running on hardware you already own, at zero per-query cost. That's the local AI promise — and on Apple Silicon, it works well today.


    References: Apple Core ML — On-Device Llama, SitePoint — Guide to Local LLMs in 2026, XDA — Apple's Sleeper Advantage for Local LLMs, Best Local LLMs for Apple Silicon 2026.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
