llama.cpp + Ertas

    Export GGUF models from Ertas and run high-performance inference with llama.cpp on CPUs, GPUs, or Apple Silicon without heavy framework dependencies.

    Overview

    llama.cpp is an efficient LLM inference engine written in pure C/C++, supporting a wide range of hardware from consumer laptops to multi-GPU servers. By eliminating the need for Python runtimes and heavy ML frameworks, llama.cpp delivers some of the highest tokens-per-second throughput available for local inference. It supports advanced quantization schemes (from 2-bit to 8-bit), KV-cache optimizations, speculative decoding, and batched inference, making it the backbone of many production-grade local AI deployments.
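    For example, re-quantizing a full-precision export down to 4-bit is a single command with the llama-quantize tool included in llama.cpp builds. A minimal sketch, assuming placeholder file names (older releases ship the binary as quantize):

    # Re-quantize an F16 GGUF to the 4-bit Q4_K_M scheme
    # (input and output paths are placeholders)
    llama-quantize ./models/my-model-f16.gguf ./models/my-model-Q4_K_M.gguf Q4_K_M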

    For teams using Ertas to fine-tune domain-specific models, llama.cpp provides the performance layer that turns trained weights into production-ready inference endpoints. Whether you are embedding a model into a desktop application, running inference on edge devices, or building a high-throughput API server, llama.cpp gives you fine-grained control over memory usage, threading, and GPU offloading that managed runtimes cannot match.

    How Ertas Integrates

    After fine-tuning in Ertas Studio, you can download your model directly in the GGUF format that llama.cpp consumes. During download, you choose from over a dozen quantization options, and Ertas displays perplexity benchmarks against your validation set to help you pick the right trade-off between model size and output quality. The downloaded GGUF file includes embedded chat templates, tokenizer configuration, and metadata so llama.cpp can load and serve the model without additional configuration files.
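    If you want to confirm what is embedded, the gguf Python package maintained in the llama.cpp repository includes a gguf-dump utility for inspecting GGUF metadata. A minimal sketch (the model path is a placeholder; inference itself still needs no Python):

    # Optional: install the GGUF inspection tooling
    pip install gguf

    # Print the file's embedded metadata, including tokenizer
    # and chat-template keys
    gguf-dump ./models/my-model.gguf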

    Ertas Studio also provides recommended llama-server launch parameters alongside your download, based on the model size and quantization level you selected. These suggestions cover context sizes, batch sizes, and layer offloading strategies, removing the guesswork from performance tuning and helping your fine-tuned model run at peak efficiency on your specific hardware.
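    As an illustration of how such recommendations vary by hardware, the sketch below shows two common launch profiles. The flag values are examples, not Ertas output:

    # CPU-only box: keep all layers on the CPU, use more threads
    llama-server -m ./models/my-model.gguf --ctx-size 4096 --n-gpu-layers 0 --threads 16

    # Apple Silicon or a single large GPU: offload every layer
    llama-server -m ./models/my-model.gguf --ctx-size 4096 --n-gpu-layers 99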

    Getting Started

    1. Complete fine-tuning in Ertas Studio

      Train your model using LoRA or full-parameter methods in Ertas Studio. Validate against your test set to confirm quality before export.

    2. Select quantization strategy

      Choose a GGUF quantization level based on your deployment constraints. Ertas shows estimated file sizes and perplexity impact for each option.

    3. Download GGUF model

      Download the fine-tuned model in GGUF format from Ertas Studio with embedded tokenizer, chat template, and metadata. The file is self-contained and ready for llama.cpp.

    4. Review recommended server settings

      Ertas Studio displays recommended llama-server launch parameters alongside your download, including context size, GPU layer offloading, and thread count.

    5. Launch llama-server

      Start the llama.cpp HTTP server with your exported model. The server provides an OpenAI-compatible API endpoint for chat completions and embeddings.

    6. Benchmark and iterate

      Run the built-in benchmarking suite to measure tokens per second, time to first token, and memory usage, then feed the results back into Ertas for the next training iteration (see the llama-bench sketch after the server example below).

    # After downloading the Q4_K_M GGUF file from Ertas Studio,
    # launch llama-server with the recommended settings
    llama-server \
      --model ./models/my-model.gguf \
      --ctx-size 4096 \
      --n-gpu-layers 35 \
      --threads 8 \
      --port 8080
    
    # Test the endpoint
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Hello"}]}'

    After downloading your GGUF model from Ertas Studio, serve it with llama.cpp for high-performance local inference.
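    For the benchmarking step, llama.cpp ships llama-bench, which measures prompt-processing and generation throughput. A minimal sketch, with example sizes and a placeholder model path:

    # pp = prompt processing speed, tg = token generation speed
    # -p 512: benchmark a 512-token prompt
    # -n 128: benchmark generating 128 tokens
    # -ngl 35: offload 35 layers to the GPU
    llama-bench -m ./models/my-model.gguf -p 512 -n 128 -ngl 35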

    Benefits

    • No Python runtime or ML framework dependencies required for inference
    • Industry-leading inference speed on CPUs, GPUs, and Apple Silicon
    • Over a dozen quantization options with perplexity impact previews
    • Self-contained GGUF files with embedded tokenizer and chat templates
    • Recommended server settings provided alongside your GGUF download
    • Suitable for edge deployment, desktop applications, and high-throughput servers
