KoboldCpp + Ertas

    Export fine-tuned GGUF models from Ertas Studio and run them with KoboldCpp for fast local inference optimized for creative writing, roleplay, and long-context generation.

    Overview

    KoboldCpp is a lightweight, self-contained inference engine built on llama.cpp that specializes in long-form text generation and creative AI workflows. Distributed as a single executable with no dependencies, KoboldCpp provides a browser-based UI, a KoboldAI-compatible API, and an OpenAI-compatible API — all from a single binary that runs on Windows, macOS, and Linux. It supports GGUF models natively with full GPU acceleration on NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (Metal), along with a Vulkan backend for broad GPU compatibility.

    What distinguishes KoboldCpp from general-purpose inference tools is its focus on generation quality and creative control. Features like SmartContext for intelligent context window management, story mode with world info and memory systems, and fine-grained sampler controls (including Mirostat, tail-free sampling, and typical sampling) make it the preferred tool for creative writing, interactive fiction, and roleplay applications. For teams that fine-tune models with Ertas for content generation or narrative AI, KoboldCpp provides the generation controls needed to get the best output from their trained models.

    How Ertas Integrates

    After fine-tuning a creative writing, content generation, or domain-specific model in Ertas Studio, you can download the GGUF file and launch it with KoboldCpp in a single command. KoboldCpp reads all necessary configuration from the GGUF metadata — chat templates, tokenizer settings, and context length — so the model is ready to use immediately. The built-in launcher GUI also provides a point-and-click interface for selecting your model file and configuring GPU layers, context size, and other runtime parameters before starting the server.

    The integration is particularly valuable for teams building AI-powered content tools. Fine-tune a model in Ertas on your specific writing style, brand voice, or narrative structure, then deploy it locally with KoboldCpp's advanced generation controls. The SmartContext feature intelligently manages the context window for long documents, and the story mode with memory and world info systems enables persistent narrative context that goes beyond the model's raw context length. All of this runs locally, ensuring that proprietary creative content and writing samples never leave your infrastructure.

    Getting Started

    1. Fine-tune your model in Ertas Studio

      Upload your creative writing dataset in JSONL format to Ertas Studio. Configure training parameters optimized for text generation quality, such as longer sequence lengths and appropriate learning rates.

    2. Export as GGUF

      Download the fine-tuned model in GGUF format. For creative writing workloads, Q5_K_M or Q6_K quantization preserves noticeably more generation quality than more aggressive quantization levels.

    3. Download KoboldCpp

      Download the single-file KoboldCpp executable for your platform. No installation or dependency management is required — it is fully self-contained.

    4. Launch with your model

      Run KoboldCpp with your GGUF file path. Use the launcher GUI for point-and-click configuration, or pass command-line flags for GPU layers, context size, and port.

    5. Configure generation settings

      Adjust sampler settings in the web UI including temperature, repetition penalty, Mirostat, and top-k/top-p. Enable SmartContext for intelligent context window management on long documents.
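The dataset uploaded in step 1 is one JSON object per line. The exact field names Ertas Studio expects are an assumption here — a chat-style `messages` schema is shown as a common convention, so check the Studio documentation for the real format:

```shell
# Sketch of a JSONL training file. The "messages" schema is an
# assumption about Ertas Studio's expected format — confirm against
# the Studio docs before uploading.
cat > dataset.jsonl <<'EOF'
{"messages": [{"role": "user", "content": "Write an opening line for a mystery."}, {"role": "assistant", "content": "The letter arrived three days after the funeral."}]}
{"messages": [{"role": "user", "content": "Describe a storm at sea."}, {"role": "assistant", "content": "The waves rose like walls of black glass."}]}
EOF

# One JSON object per line — check the record count
wc -l dataset.jsonl
```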

    ```bash
    # After downloading the GGUF model from Ertas Studio,
    # launch KoboldCpp with GPU acceleration
    ./koboldcpp \
      --model ./my-model-Q5_K_M.gguf \
      --contextsize 8192 \
      --gpulayers 35 \
      --port 5001 \
      --smartcontext

    # The web UI is available at http://localhost:5001
    # The API is OpenAI-compatible at http://localhost:5001/v1/
    curl http://localhost:5001/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "koboldcpp",
        "messages": [{"role": "user", "content": "Continue the story..."}]
      }'
    ```

    Launch KoboldCpp with your Ertas-exported GGUF model for local inference with advanced generation controls and SmartContext.
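The sampler controls from step 5 can also be set per request through KoboldCpp's native KoboldAI-compatible endpoint. The field names below follow the KoboldAI generate API and should be treated as a sketch — confirm them against the API documentation for your KoboldCpp version:

```shell
# Build a request with per-request sampler settings for the native
# KoboldAI-style endpoint (/api/v1/generate). Field names follow the
# KoboldAI generate API; verify against your KoboldCpp version.
cat > request.json <<'EOF'
{
  "prompt": "The old lighthouse keeper lit the lamp and",
  "max_length": 120,
  "temperature": 0.8,
  "top_p": 0.92,
  "top_k": 40,
  "rep_pen": 1.1
}
EOF

# Send it to a running KoboldCpp server (started as shown above);
# fails harmlessly if no server is listening on port 5001
curl -s http://localhost:5001/api/v1/generate \
  -H "Content-Type: application/json" \
  -d @request.json || true
```

Per-request sampling lets a content tool vary temperature or repetition penalty by task without restarting the server.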

    Benefits

    • Single executable with zero dependencies for instant deployment
    • SmartContext for intelligent context window management on long documents
    • Advanced sampler controls (Mirostat, tail-free, typical) for generation quality
    • Vulkan GPU backend for broad hardware compatibility beyond CUDA and Metal
    • Both KoboldAI and OpenAI-compatible API endpoints from a single server
    • Story mode with memory and world info for persistent narrative context
