
Running AI Models Locally: The Complete Guide to Local LLM Inference
Everything you need to know about running large language models on your own hardware — from hardware requirements and model formats to tools like Ollama, LM Studio, and llama.cpp.
You can run AI models locally by downloading a GGUF-quantized model and serving it with tools like Ollama, LM Studio, or llama.cpp — a 7B parameter model runs comfortably on any machine with 16 GB of RAM, no GPU required. According to benchmarks from the llama.cpp project, Q4_K_M quantization reduces model size by approximately 70% while maintaining quality that is nearly indistinguishable from full precision on most tasks. The Stanford HAI AI Index Report notes that the cost of training and inference has dropped over 90% since 2020, making local deployment practical for individuals and small teams.
This guide covers everything you need to get started: why local inference matters, what hardware you need, which model format to use, and which tools make it easy.
Why Run Models Locally?
Privacy and Data Control
When you send a prompt to a cloud API, your data travels to someone else's server. For many use cases — medical records, legal documents, financial data, proprietary code — that's a non-starter.
Local inference means your data never leaves your network. There's no third-party processing agreement to negotiate, no data residency questions to answer, and no risk of your prompts being used to train someone else's model.
Predictable Costs
Cloud LLM APIs charge per token. At low volume, this is affordable. At scale, it becomes a significant line item. A team processing 100,000 queries per month can easily spend $1,000–3,000 on API calls alone.
Local inference has a fixed cost: your hardware. Whether you run 10 queries or 10 million, that cost doesn't change (beyond electricity). For high-volume applications, the break-even point comes surprisingly fast, often within 2–3 months.
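As a back-of-the-envelope sketch of that break-even point (the hardware price and monthly API spend below are illustrative assumptions, not quoted figures):

```python
def break_even_months(hardware_cost: float, monthly_api_spend: float) -> float:
    """Months until a one-time hardware purchase beats recurring API fees."""
    return hardware_cost / monthly_api_spend

# Illustrative numbers: a $2,500 workstation vs. $1,000/month in API calls.
months = break_even_months(2500, 1000)
print(round(months, 1))  # → 2.5
```

At the upper end of the API spend range cited above, the hardware pays for itself in under a quarter.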
No Vendor Lock-In
If your application depends on a cloud API, you're at the mercy of that provider's pricing changes, rate limits, model deprecations, and terms of service updates. Running locally means you own the model file and can switch inference tools at any time.
Latency
Local inference eliminates network round-trips. For applications that need sub-100ms response times or operate in environments with unreliable connectivity, local deployment is the only viable option.
Hardware Requirements
The good news: you don't need a data center. Modern quantized models run on consumer hardware.
RAM Is the Bottleneck
For CPU inference (which is what most people use for local deployment), the key constraint is system RAM — not GPU VRAM. A quantized model needs to fit entirely in memory.
| Model Size | Quantization | RAM Required | Example Hardware |
|---|---|---|---|
| 1–3B | Q4_K_M | 2–4 GB | Any modern laptop |
| 7–8B | Q4_K_M | 6–8 GB | Mid-range laptop, desktop |
| 13B | Q4_K_M | 10–12 GB | 16 GB laptop or desktop |
| 34B | Q4_K_M | 24–28 GB | 32 GB workstation |
| 70B | Q4_K_M | 40–48 GB | 64 GB workstation or server |
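The table's figures follow roughly from a simple rule: quantized weights take about parameters × bits-per-weight / 8 bytes, plus headroom for the KV cache and runtime. A rough estimator, where the ~4.9 effective bits for Q4_K_M and the 30% overhead factor are both approximations:

```python
def estimate_ram_gb(params_billions: float,
                    bits_per_weight: float = 4.9,  # approx. effective bpw for Q4_K_M
                    overhead: float = 1.3) -> float:
    """Rough RAM estimate: quantized weight bytes plus ~30% headroom
    for the KV cache and inference runtime."""
    weight_gb = params_billions * bits_per_weight / 8
    return round(weight_gb * overhead, 1)

print(estimate_ram_gb(7))   # → 5.6
print(estimate_ram_gb(70))  # → 55.7
```

The results land in the same ballpark as the table; real usage also varies with context length, since a longer context means a larger KV cache.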
GPU Acceleration (Optional but Nice)
If you have a discrete GPU, inference speeds up dramatically. Apple Silicon Macs are particularly good at this — the unified memory architecture means the GPU can access the full system RAM.
| GPU | VRAM | Comfortable Model Size |
|---|---|---|
| Apple M2/M3 (16 GB unified) | Shared | Up to 13B |
| Apple M2/M3 Pro (36 GB unified) | Shared | Up to 34B |
| NVIDIA RTX 3060 (12 GB) | 12 GB | Up to 13B |
| NVIDIA RTX 4090 (24 GB) | 24 GB | Up to 34B |
| NVIDIA A100 (80 GB) | 80 GB | Up to 70B |
For most use cases, a 7B–8B quantized model on a machine with 16 GB RAM hits the sweet spot of capability and performance.
Model Formats: Why GGUF Matters
GGUF (GPT-Generated Unified Format) is the standard format for local LLM inference. It was designed by the llama.cpp project and is now supported by virtually every local inference tool.
What Makes GGUF Special
- Quantization baked in — GGUF files contain quantized weights, so a 7B model that would normally be 14 GB in full precision can be 4–5 GB in Q4 quantization with minimal quality loss.
- Single file — everything the model needs (weights, tokenizer config, metadata) is in one file. No dependency management.
- CPU-optimized — designed for efficient CPU inference using SIMD instructions, with optional GPU offloading.
- Universal compatibility — works with llama.cpp, Ollama, LM Studio, GPT4All, Jan, KoboldCpp, and many more tools.
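As a concrete illustration of the single-file design: per the GGUF specification, every file starts with a small fixed header, a 4-byte magic, a little-endian uint32 version, and two uint64 counts (tensors and metadata key-value pairs). A minimal parser over a synthetic header:

```python
import struct

def read_gguf_header(data: bytes):
    """Parse the fixed GGUF header: 4-byte magic, then little-endian
    uint32 version, uint64 tensor count, uint64 metadata KV count."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return version, n_tensors, n_kv

# Synthetic header for illustration (version 3, 291 tensors, 24 metadata keys).
fake = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(fake))  # → (3, 291, 24)
```

The metadata key-value section that follows the header is where the tokenizer config and model hyperparameters live, which is why no sidecar files are needed.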
Quantization Levels
| Quantization | Size (7B model) | Quality | Speed |
|---|---|---|---|
| F16 | ~14 GB | Best | Slowest |
| Q8_0 | ~7.5 GB | Near-lossless | Fast |
| Q6_K | ~5.5 GB | Excellent | Faster |
| Q5_K_M | ~5 GB | Very good | Fast |
| Q4_K_M | ~4.3 GB | Good (recommended) | Fast |
| Q3_K_M | ~3.3 GB | Acceptable | Fastest |
| Q2_K | ~2.7 GB | Noticeable degradation | Fastest |
Q4_K_M is the sweet spot for most use cases — it reduces model size by ~70% with quality that's nearly indistinguishable from full precision on most tasks.
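The sizes in the table follow from effective bits per weight: size ≈ parameters × bpw / 8. The bpw values below are approximations that include quantization block overhead:

```python
# Approximate effective bits per weight for common GGUF quantizations
# (rough values, including per-block scale/offset overhead).
BPW = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
       "Q4_K_M": 4.9, "Q3_K_M": 3.9, "Q2_K": 3.35}

def file_size_gb(params_billions: float, quant: str) -> float:
    """Approximate GGUF file size for a given quantization level."""
    return round(params_billions * BPW[quant] / 8, 1)

for q in ("F16", "Q8_0", "Q4_K_M"):
    print(q, file_size_gb(7, q))  # F16 14.0, Q8_0 7.4, Q4_K_M 4.3
```

The same formula scales to any parameter count, which is how the RAM table earlier in this guide was derived.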
Tools for Local Inference
Ollama
The easiest way to get started. Ollama packages models and inference into a single CLI tool with an API server built in.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Run a model
ollama run llama3
# Serve as an API
ollama serve
curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Hello"}'
Best for: developers who want a quick API endpoint, teams that need OpenAI-compatible API format, Docker-based deployments.
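When you call the /api/generate endpoint shown above, Ollama streams newline-delimited JSON objects, each carrying a response fragment and a done flag. A minimal client-side collector (the sample lines are fabricated for illustration):

```python
import json

def collect_stream(ndjson_lines):
    """Concatenate 'response' fragments from an Ollama-style NDJSON stream."""
    out = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk signals end of generation
            break
    return "".join(out)

# Fabricated sample of what the stream looks like:
sample = [
    '{"model":"llama3","response":"Hello","done":false}',
    '{"model":"llama3","response":" there!","done":true}',
]
print(collect_stream(sample))  # → Hello there!
```

In a real client you would iterate over the HTTP response body line by line instead of a list, but the parsing logic is the same.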
LM Studio
A desktop application with a visual interface for downloading, managing, and chatting with local models.
Best for: non-technical users, teams that want a ChatGPT-like experience with local models, quick testing and evaluation.
llama.cpp
The foundational inference engine that powers most other tools. Maximum control and performance tuning options.
# Run inference directly
./llama-cli -m model.gguf -p "Translate to French: Hello, how are you?"
# Start an API server
./llama-server -m model.gguf --port 8080
Best for: production deployments where you need full control over inference parameters, custom applications, embedded systems.
Open WebUI
A self-hosted web interface that connects to Ollama or other backends. Gives your team a ChatGPT-style experience backed by your local models.
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
ghcr.io/open-webui/open-webui:main
Best for: teams that want a shared, web-based chat interface for local models.
vLLM
A high-throughput serving engine designed for production workloads. Implements continuous batching and PagedAttention for maximum GPU utilization.
Best for: production APIs serving many concurrent users, applications that need high throughput.
The Complete Workflow: Fine-Tune to Local Deployment
The most powerful local inference setup starts with a model fine-tuned on your data. Here's the end-to-end workflow:
- Prepare training data in JSONL format
- Fine-tune a base model on your data (using LoRA for efficiency)
- Export the fine-tuned model as a GGUF file
- Deploy using Ollama, LM Studio, or any GGUF-compatible tool
- Integrate into your application via the local API
The result: a model that understands your domain, runs on your hardware, and costs nothing per query.
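Step 1's JSONL format is simply one JSON object per line. A chat-style layout is common, though exact field names depend on the trainer, so treat this as a sketch with made-up example data:

```python
import json

# Example chat-style training records; field names vary by trainer.
records = [
    {"messages": [
        {"role": "user", "content": "What is our refund window?"},
        {"role": "assistant", "content": "Refunds are accepted within 30 days of purchase."},
    ]},
]

# Write one JSON object per line.
with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Each line round-trips as standalone JSON:
with open("train.jsonl") as f:
    first = json.loads(f.readline())
print(first["messages"][0]["role"])  # → user
```

Keeping each record on its own line means datasets can be streamed, split, and validated line by line without loading the whole file.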
With Ertas Studio
Ertas Studio handles steps 1–3 through a visual interface. Upload your dataset, select a base model, fine-tune on managed cloud GPUs, and download the GGUF file. From there, deploy with any of the tools above.
This gives you the best of both worlds: cloud-powered training (fast, no GPU to manage) with fully local inference (private, no ongoing costs).
Lock in early bird pricing at $14.50/mo — guaranteed for life. Increases to $34.50/mo at launch. Join the waitlist →
Frequently Asked Questions
What hardware do I need to run AI locally?
For a 7B parameter model (the most common size for local deployment), you need a machine with at least 8 GB of RAM — though 16 GB is recommended for comfortable performance. No GPU is required; modern quantized models run on CPU using tools like llama.cpp and Ollama. Apple Silicon Macs are particularly well-suited due to their unified memory architecture. For larger models (13B–70B), you need proportionally more RAM: 16 GB for 13B, 32 GB for 34B, and 64 GB for 70B models.
Is local AI as good as cloud APIs?
For general-purpose, open-ended tasks, large cloud models like GPT-4 still have an edge. But for narrow, well-defined tasks — which represent the majority of production AI applications — a fine-tuned 7B local model can match or exceed cloud API quality. According to research from Hugging Face, fine-tuned small models routinely achieve 90–95% accuracy on domain-specific classification tasks, matching GPT-4 class models. The key is that fine-tuning creates a specialist, not a generalist.
What's the fastest way to run LLMs locally?
The fastest path from zero to running a local LLM is Ollama. Install it with a single command (curl -fsSL https://ollama.com/install.sh | sh), then run ollama run llama3 to download and start chatting with a model. The entire process takes under 5 minutes. For a GUI experience, LM Studio provides a desktop application where you can browse, download, and run models without touching the terminal. For production use cases with higher throughput needs, vLLM or llama.cpp's server mode offer more control.
Can I run AI models on a Mac?
Yes — Apple Silicon Macs are actually among the best hardware for local AI inference. The unified memory architecture allows the GPU to access all system RAM, which means a Mac with 16 GB unified memory can run models that would require a dedicated GPU with 16 GB VRAM on a PC. An M2/M3 Mac with 16 GB handles 7B–13B models comfortably, while the M2/M3 Pro or Max with 36–96 GB can run models up to 70B parameters. Ollama, LM Studio, and llama.cpp all have native Apple Silicon support with Metal GPU acceleration.
Further Reading
- How to Fine-Tune an LLM: Complete Guide — prepare data and train your own model
- Fine-Tuning vs RAG: When to Use Each — decide the right approach for your use case
- Privacy-Conscious AI Development — the case for keeping AI data under your control
- The Hidden Cost of Per-Token AI Pricing — why local inference saves money at scale
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.