vLLM + Ertas
Fine-tune models in Ertas Studio and deploy them with vLLM for production-grade serving with continuous batching, PagedAttention, and OpenAI-compatible API endpoints.
Overview
vLLM is a high-throughput, memory-efficient inference engine designed for production LLM serving. Its core innovation, PagedAttention, manages the KV cache like virtual memory pages, dramatically reducing memory waste and enabling significantly higher concurrent request throughput compared to traditional inference frameworks. vLLM supports continuous batching, tensor parallelism across multiple GPUs, speculative decoding, and quantization formats including AWQ and GPTQ, making it the go-to choice for teams that need to serve models at scale with predictable latency.
Unlike desktop-oriented tools, vLLM is built for server environments where throughput, latency percentiles, and resource utilization matter. It provides an OpenAI-compatible API server out of the box, supports streaming responses, and integrates with observability tools for monitoring request queues, token generation rates, and GPU utilization. For organizations that fine-tune models with Ertas for customer-facing applications, vLLM bridges the gap between a trained model and a production-ready inference service.
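As a sketch of that OpenAI-compatible interface, a streamed chat completion can be requested from a running vLLM server like this (the port and model name are illustrative; with `"stream": true` the server returns server-sent event chunks rather than a single JSON body):

```shell
# Request a streamed response from a local vLLM server.
# Each chunk arrives as a server-sent event ("data: {...}") line.
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-org/my-fine-tuned-model",
        "stream": true,
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```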
How Ertas Integrates
After fine-tuning in Ertas Studio, you can export your model in formats compatible with vLLM's supported architectures. For Hugging Face-format models, Ertas pushes the fine-tuned weights directly to a Hub repository (public or private) that vLLM can pull from at startup. For LoRA-based fine-tunes, Ertas exports the adapter weights separately, allowing vLLM to load them dynamically on top of a base model using its built-in LoRA serving support — enabling multiple fine-tuned variants to share the same base model memory.
This integration is designed for production workloads where the model needs to handle hundreds or thousands of concurrent requests. A typical workflow involves fine-tuning a customer support model in Ertas Studio, exporting the adapter, and deploying it on vLLM behind a load balancer. vLLM's continuous batching ensures efficient GPU utilization even under variable traffic patterns, while its OpenAI-compatible API means your application code requires no changes when migrating from a cloud AI provider to self-hosted inference.
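Because the API is OpenAI-compatible, the migration can be as small as repointing the client. As a sketch, recent official OpenAI SDKs read these environment variables, so no application code changes are needed (the URL is illustrative):

```shell
# Point an existing OpenAI SDK client at self-hosted vLLM.
export OPENAI_BASE_URL="http://localhost:8000/v1"
# vLLM ignores the key unless the server was started with --api-key.
export OPENAI_API_KEY="not-needed-for-local-vllm"
```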
Getting Started
1. Fine-tune in Ertas Studio
   Train your model using LoRA or full-parameter fine-tuning in Ertas Studio. LoRA is recommended for vLLM deployments because it enables efficient multi-adapter serving.
2. Export model weights
   Push your fine-tuned model or LoRA adapter to Hugging Face Hub from Ertas Studio. For full fine-tunes, export the complete model; for LoRA, export the adapter weights separately.
3. Install and configure vLLM
   Install vLLM on your GPU server. Configure the serving parameters, including tensor parallelism, max model length, and GPU memory utilization, based on your hardware and traffic requirements.
4. Launch the vLLM server
   Start vLLM with your model path or Hugging Face repo ID. For LoRA adapters, specify the base model and adapter path. vLLM exposes an OpenAI-compatible API immediately.
5. Load-test and tune
   Run load tests against the vLLM endpoint to validate throughput and latency under expected traffic patterns. Adjust batch sizes, max concurrent requests, and GPU memory allocation as needed.
6. Deploy behind a load balancer
   Place the vLLM server behind a reverse proxy or load balancer for production traffic. Scale horizontally by adding more vLLM instances, or scale a single large model across GPUs with tensor parallelism.
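The install-and-launch steps above can be sketched as follows (the repo ID is a placeholder; `pip install vllm` assumes a Linux host with a supported NVIDIA GPU and matching CUDA drivers):

```shell
# Step 3: install vLLM on the GPU server.
pip install vllm

# Step 4: launch the server, then verify it is ready.
vllm serve my-org/my-fine-tuned-model --port 8000 &

curl http://localhost:8000/health      # returns 200 once the engine is up
curl http://localhost:8000/v1/models   # lists the model(s) being served
```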
# After fine-tuning in Ertas Studio and pushing to Hugging Face,
# serve the model with vLLM
vllm serve my-org/my-fine-tuned-model \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --port 8000

# For LoRA adapter serving on a shared base model
vllm serve meta-llama/Llama-3-8B \
  --enable-lora \
  --lora-modules my-adapter=my-org/my-lora-adapter \
  --port 8000

# Query the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-adapter",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Benefits
- PagedAttention delivers up to 24x higher throughput than naive inference implementations
- Continuous batching maximizes GPU utilization under variable traffic loads
- Built-in LoRA serving enables multiple fine-tuned variants on a single base model
- OpenAI-compatible API for zero-change migration from cloud providers
- Tensor parallelism for serving large models across multiple GPUs
- Production-ready with streaming, metrics, and health check endpoints
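For the metrics endpoint mentioned above, vLLM exposes Prometheus-format counters and gauges on the same port as the API server; a quick way to inspect them (port is illustrative):

```shell
# Metric names are prefixed with "vllm:", e.g. gauges for
# running and waiting requests and token throughput counters.
curl -s http://localhost:8000/metrics | grep '^vllm:'
```

Pointing a Prometheus scrape job at `/metrics` gives you the request-queue and token-generation-rate visibility described in the Overview.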
Related Resources
Fine-Tuning
Inference
LoRA
QLoRA
Quantization
Getting Started with Ertas: Fine-Tune and Deploy Custom AI Models
Introducing Ertas Studio: A Visual Canvas for Fine-Tuning AI Models
Multi-Tenant AI Deployment: One Base Model, Dozens of Client Adapters
From Notebook to Production: Closing the Fine-Tuning Deployment Gap
Hugging Face
llama.cpp
Ollama
Ertas for SaaS Product Teams
Ertas for Customer Support
Ertas for AI Automation Agencies
Ertas for ML Engineers & Fine-Tuning Practitioners