Ollama + Ertas
Deploy Ertas-trained models through Ollama for fast, private local inference with a simple CLI and OpenAI-compatible API.
Overview
Ollama simplifies local model deployment by packaging model weights, configuration, and runtime into a single streamlined tool. With a familiar CLI inspired by container workflows, Ollama lets developers pull and run large language models on their own hardware without configuring complex inference servers or managing GPU drivers manually. Its built-in OpenAI-compatible REST API means existing application code can switch to local inference with a single endpoint change.
For teams that have invested in fine-tuning custom models with Ertas, Ollama provides the fastest path from trained weights to a running inference endpoint. The combination of Ertas for training and Ollama for serving creates a fully local AI pipeline where sensitive data never leaves your infrastructure, making it ideal for regulated industries and privacy-conscious organizations.
How Ertas Integrates
After a training job completes in Ertas Studio, you can download your fine-tuned model directly from the platform in GGUF format, which Ollama supports natively. Ertas also provides a downloadable Modelfile with the correct template, system prompt, and quantization settings baked in, so you can register the model with Ollama in a single step. The download preserves chat templates, stop tokens, and any custom parameters you configured during training.
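A generated Modelfile might look something like the following. This is an illustrative sketch, not the exact file Ertas emits; the FROM, TEMPLATE, SYSTEM, and PARAMETER directives are standard Ollama Modelfile syntax, and the file name, template tokens, and parameter values shown here are assumptions.

```
# Illustrative Modelfile — the file Ertas generates for your model will differ
FROM ./my-model.Q4_K_M.gguf

# Chat template preserved from training (a generic chat format shown here)
TEMPLATE """{{ if .System }}<|system|>{{ .System }}<|end|>{{ end }}<|user|>{{ .Prompt }}<|end|><|assistant|>"""

SYSTEM """You are a helpful assistant."""

PARAMETER stop "<|end|>"
PARAMETER temperature 0.7
```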
Once deployed, Ertas Cloud can monitor your Ollama instances for health, throughput, and latency metrics. You can manage multiple Ollama endpoints from the Ertas dashboard, route traffic between model versions for A/B testing, and roll back to previous checkpoints without restarting the server. This tight feedback loop between training and serving lets teams iterate on model quality with minimal operational overhead.
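Health monitoring of this kind can be built on Ollama's own HTTP API: `GET /api/tags` lists installed models and doubles as a liveness probe. Below is a minimal sketch of such a check, assuming the default Ollama port; a stub HTTP server stands in for a real instance so the example is self-contained (`check_endpoint` and the stub are illustrative names, not part of Ertas or Ollama).

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_endpoint(base_url, timeout=2.0):
    """Probe an Ollama instance: list installed models via /api/tags
    and report round-trip latency in milliseconds."""
    start = time.monotonic()
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        body = json.load(resp)
    latency_ms = (time.monotonic() - start) * 1000
    models = [m["name"] for m in body.get("models", [])]
    return {"healthy": True, "latency_ms": latency_ms, "models": models}

# --- demo against a stub server standing in for Ollama at localhost:11434 ---
class _Stub(BaseHTTPRequestHandler):
    def do_GET(self):
        payload = json.dumps({"models": [{"name": "my-model:latest"}]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), _Stub)  # ephemeral port
threading.Thread(target=server.serve_forever, daemon=True).start()
status = check_endpoint(f"http://127.0.0.1:{server.server_port}")
server.shutdown()
print(status["healthy"], status["models"])
```

Against a real deployment you would pass `http://localhost:11434` (or the host of each managed instance) as `base_url`.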
Getting Started
1. Download the model in GGUF format
After fine-tuning in Ertas Studio, download the model in GGUF format with your preferred quantization level (Q4_K_M, Q5_K_M, Q8_0, or full precision) from the platform.
2. Download the Ollama Modelfile
Ertas provides a ready-made Modelfile alongside your GGUF download that includes the correct chat template, system prompt, and runtime parameters.
3. Register the model with Ollama
Run a single CLI command to create the Ollama model from the generated Modelfile and GGUF weights.
4. Start the inference server
Launch Ollama to serve your model locally. The OpenAI-compatible API is available immediately at localhost:11434.
5. Connect your application
Point your application to the local Ollama endpoint. Any OpenAI SDK or HTTP client works out of the box with no code changes beyond the base URL.
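The last step can be sketched with nothing but the Python standard library: only the base URL distinguishes a local Ollama call from a hosted OpenAI one. The helper name and system prompt below are illustrative, and the default Ollama port is assumed.

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434/v1"  # default local Ollama endpoint

def build_chat_request(model, user_content, system=None):
    """Build an OpenAI-compatible chat completion request aimed at a
    local Ollama server instead of a hosted API."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_content})
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("my-model", "Summarize this patient report")
# With Ollama running, send it like any other HTTP request:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
print(req.full_url)
```

An OpenAI SDK client works the same way: construct it with `base_url="http://localhost:11434/v1"` and any placeholder API key, and the rest of your application code is unchanged.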
# After downloading the GGUF model and Modelfile from Ertas Studio,
# create an Ollama model from the downloaded files
ollama create my-model -f ./models/Modelfile
# Run the model locally
ollama run my-model "Summarize this patient report"
# Or use the OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-model",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
Benefits
- Deploy fine-tuned models locally with a single CLI command
- OpenAI-compatible API for drop-in replacement in existing applications
- No data leaves your infrastructure during inference
- Automatic Modelfile generation with correct chat templates and parameters
- Support for multiple quantization levels to balance speed and quality
- Monitor Ollama instances from the Ertas Cloud dashboard
Related Resources
Fine-Tuning
GGUF
Inference
LoRA
Getting Started with Ertas: Fine-Tune and Deploy Custom AI Models
Privacy-Conscious AI Development: Fine-Tune in the Cloud, Run on Your Terms
Running AI Models Locally: The Complete Guide to Local LLM Inference
GDPR-Compliant AI: How to Use LLMs Without Sharing User Data
Self-Hosted AI for Indie Apps: Replace GPT-4 with Your Own Model
Hugging Face
Jan
llama.cpp
LM Studio
Open WebUI
Ertas for Healthcare
Ertas for Customer Support
Ertas for Legal
Ertas for Finance
Ertas for Indie Developers & Vibe-Coded Apps