llama.cpp + Ertas

    Export GGUF models from Ertas and run high-performance inference with llama.cpp on CPUs, GPUs, or Apple Silicon without heavy framework dependencies.

    Overview

    llama.cpp is an efficient LLM inference engine written in pure C/C++, supporting a wide range of hardware from consumer laptops to multi-GPU servers. By eliminating the need for Python runtimes and heavy ML frameworks, llama.cpp delivers some of the highest tokens-per-second throughput available for local inference. It supports advanced quantization schemes (from 2-bit to 8-bit), KV-cache optimizations, speculative decoding, and batched inference, making it the backbone of many production-grade local AI deployments.
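    For example, re-quantizing a full-precision export down to 4-bit is a single command with the llama-quantize tool included in llama.cpp builds. A minimal sketch, assuming placeholder file names (older releases ship the binary as quantize):

    # Re-quantize an F16 GGUF to the 4-bit Q4_K_M scheme
    # (input and output paths are placeholders)
    llama-quantize ./models/my-model-f16.gguf ./models/my-model-Q4_K_M.gguf Q4_K_M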

    For teams using Ertas to fine-tune domain-specific models, llama.cpp provides the performance layer that turns trained weights into production-ready inference endpoints. Whether you are embedding a model into a desktop application, running inference on edge devices, or building a high-throughput API server, llama.cpp gives you fine-grained control over memory usage, threading, and GPU offloading that managed runtimes cannot match.

    How Ertas Integrates

    After fine-tuning in Ertas Studio, you can download your model directly in the GGUF format that llama.cpp consumes. During download, you choose from over a dozen quantization options, and Ertas displays perplexity benchmarks against your validation set to help you pick the right trade-off between model size and output quality. The downloaded GGUF file includes embedded chat templates, tokenizer configuration, and metadata so llama.cpp can load and serve the model without additional configuration files.
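    If you want to confirm what is embedded, the gguf Python package maintained in the llama.cpp repository includes a gguf-dump utility for inspecting GGUF metadata. A minimal sketch (the model path is a placeholder; inference itself still needs no Python):

    # Optional: install the GGUF inspection tooling
    pip install gguf

    # Print the file's embedded metadata, including tokenizer
    # and chat-template keys
    gguf-dump ./models/my-model.gguf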

    Ertas Studio also provides recommended llama-server launch parameters alongside your download, based on the model size and quantization level you selected. These suggestions cover context sizes, batch sizes, and layer offloading strategies, removing the guesswork from performance tuning and helping your fine-tuned model run at peak efficiency on your specific hardware.
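    As an illustration of how such recommendations vary by hardware, the sketch below shows two common launch profiles. The flag values are examples, not Ertas output:

    # CPU-only box: keep all layers on the CPU, use more threads
    llama-server -m ./models/my-model.gguf --ctx-size 4096 --n-gpu-layers 0 --threads 16

    # Apple Silicon or a single large GPU: offload every layer
    llama-server -m ./models/my-model.gguf --ctx-size 4096 --n-gpu-layers 99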

    Getting Started

    1. Complete fine-tuning in Ertas Studio

      Train your model using LoRA or full-parameter methods in Ertas Studio. Validate against your test set to confirm quality before export.

    2. Select quantization strategy

      Choose a GGUF quantization level based on your deployment constraints. Ertas shows estimated file sizes and perplexity impact for each option.

    3. Download GGUF model

      Download the fine-tuned model in GGUF format from Ertas Studio with embedded tokenizer, chat template, and metadata. The file is self-contained and ready for llama.cpp.

    4. Review recommended server settings

      Ertas Studio displays recommended llama-server launch parameters alongside your download, including context size, GPU layer offloading, and thread count.

    5. Launch llama-server

      Start the llama.cpp HTTP server with your exported model. The server provides an OpenAI-compatible API endpoint for chat completions and embeddings.

    6. Benchmark and iterate

      Run the built-in benchmarking suite to measure tokens per second, time to first token, and memory usage, then feed the results back into Ertas for the next training iteration (see the llama-bench sketch after the server example below).

    # After downloading the Q4_K_M GGUF file from Ertas Studio,
    # launch llama-server with the recommended settings
    llama-server \
      --model ./models/my-model.gguf \
      --ctx-size 4096 \
      --n-gpu-layers 35 \
      --threads 8 \
      --port 8080
    
    # Test the endpoint
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Hello"}]}'

    After downloading your GGUF model from Ertas Studio, serve it with llama.cpp for high-performance local inference.
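    For the benchmarking step, llama.cpp ships llama-bench, which measures prompt-processing and generation throughput. A minimal sketch, with example sizes and a placeholder model path:

    # pp = prompt processing speed, tg = token generation speed
    # -p 512: benchmark a 512-token prompt
    # -n 128: benchmark generating 128 tokens
    # -ngl 35: offload 35 layers to the GPU
    llama-bench -m ./models/my-model.gguf -p 512 -n 128 -ngl 35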

    Benefits

    • No Python runtime or ML framework dependencies required for inference
    • Industry-leading inference speed on CPUs, GPUs, and Apple Silicon
    • Over a dozen quantization options with perplexity impact previews
    • Self-contained GGUF files with embedded tokenizer and chat templates
    • Recommended server settings provided alongside your GGUF download
    • Suitable for edge deployment, desktop applications, and high-throughput servers
