llama.cpp + Ertas
Export GGUF models from Ertas and run high-performance inference with llama.cpp on CPUs, GPUs, or Apple Silicon without heavy framework dependencies.
Overview
llama.cpp is the reference implementation for efficient LLM inference in pure C/C++, supporting a wide range of hardware from consumer laptops to multi-GPU servers. By eliminating the need for Python runtimes and heavy ML frameworks, llama.cpp delivers some of the highest tokens-per-second throughput available for local inference. It supports advanced quantization schemes (from 2-bit to 8-bit), KV-cache optimization, speculative decoding, and batched inference, making it the backbone of many production-grade local AI deployments.
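If you don't already have llama.cpp installed, a minimal from-source setup looks like the following (a sketch assuming a Unix-like system with git and CMake; on macOS, `brew install llama.cpp` is an alternative):

```shell
# Clone and build llama.cpp (CPU build by default;
# Metal acceleration is enabled automatically on Apple Silicon)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# For NVIDIA GPUs, configure with CUDA support instead:
# cmake -B build -DGGML_CUDA=ON
```

The resulting binaries (llama-server, llama-cli, llama-bench) land in `build/bin`.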
For teams using Ertas to fine-tune domain-specific models, llama.cpp provides the performance layer that turns trained weights into production-ready inference endpoints. Whether you are embedding a model into a desktop application, running inference on edge devices, or building a high-throughput API server, llama.cpp gives you fine-grained control over memory usage, threading, and GPU offloading that managed runtimes cannot match.
How Ertas Integrates
After fine-tuning in Ertas Studio, you can download your model directly in the GGUF format that llama.cpp consumes. During download, you choose from over a dozen quantization options, and Ertas displays perplexity benchmarks against your validation set to help you pick the right trade-off between model size and output quality. The downloaded GGUF file includes embedded chat templates, tokenizer configuration, and metadata so llama.cpp can load and serve the model without additional configuration files.
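Because the chat template and tokenizer are embedded in the GGUF file, you can sanity-check a download with a one-off generation before standing up a server (the model path below is illustrative):

```shell
# One-shot generation to verify the GGUF loads and produces sensible output
# -m: model path, -p: prompt, -n: max tokens to generate
llama-cli -m ./models/my-model.gguf -p "Say hello in one sentence." -n 64
```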
Ertas Studio also provides recommended llama-server launch parameters alongside your download, based on the model size and quantization level you selected. These suggestions cover context sizes, batch sizes, and layer offloading strategies, removing the guesswork from performance tuning and helping your fine-tuned model run at peak efficiency on your specific hardware.
Getting Started
1. Complete fine-tuning in Ertas Studio
Train your model using LoRA or full-parameter methods in Ertas Studio. Validate against your test set to confirm quality before export.
2. Select quantization strategy
Choose a GGUF quantization level based on your deployment constraints. Ertas shows estimated file sizes and perplexity impact for each option.
3. Download GGUF model
Download the fine-tuned model in GGUF format from Ertas Studio with embedded tokenizer, chat template, and metadata. The file is self-contained and ready for llama.cpp.
4. Review recommended server settings
Ertas Studio displays recommended llama-server launch parameters alongside your download, including context size, GPU layer offloading, and thread count.
5. Launch llama-server
Start the llama.cpp HTTP server with your exported model. The server provides an OpenAI-compatible API endpoint for chat completions and embeddings.
6. Benchmark and iterate
Run the built-in benchmarking suite to measure tokens per second, time to first token, and memory usage. Feed results back into Ertas for the next training iteration.
# After downloading the Q4_K_M GGUF file from Ertas Studio,
# launch llama-server with the recommended settings
llama-server \
--model ./models/my-model.gguf \
--ctx-size 4096 \
--n-gpu-layers 35 \
--threads 8 \
--port 8080
# Test the endpoint
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Hello"}]}'
Benefits
- No Python runtime or ML framework dependencies required for inference
- Industry-leading inference speed on CPUs, GPUs, and Apple Silicon
- Over a dozen quantization options with perplexity impact previews
- Self-contained GGUF files with embedded tokenizer and chat templates
- Recommended server settings provided alongside your GGUF download
- Suitable for edge deployment, desktop applications, and high-throughput servers
Related Resources
Tags: Fine-Tuning, GGUF, Inference, LoRA
Guides:
- Getting Started with Ertas: Fine-Tune and Deploy Custom AI Models
- Privacy-Conscious AI Development: Fine-Tune in the Cloud, Run on Your Terms
- Running AI Models Locally: The Complete Guide to Local LLM Inference
- Fine-Tuning Llama 3: A Practical Guide for Your Use Case
- Self-Hosted AI for Indie Apps: Replace GPT-4 with Your Own Model
- The Indie Dev's Guide to AI Model Costs in 2026
Other integrations: Hugging Face, KoboldCpp, LM Studio, Ollama, vLLM
Industries:
- Ertas for Healthcare
- Ertas for SaaS Product Teams
- Ertas for Customer Support
- Ertas for E-Commerce
- Ertas for Indie Developers & Vibe-Coded Apps
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.