ExLlamaV2 + Ertas

    Deploy fine-tuned models from Ertas using ExLlamaV2's highly optimized quantized inference engine, achieving exceptional generation speed and memory efficiency on consumer and data center NVIDIA GPUs alike.

    Overview

    ExLlamaV2 is a high-performance inference library focused on extracting maximum speed from quantized language models on NVIDIA GPUs. It implements custom CUDA kernels specifically optimized for quantized matrix multiplication, achieving generation speeds that consistently rank among the fastest in independent benchmarks. ExLlamaV2 supports GPTQ, EXL2, and other quantization formats, with EXL2 offering particularly fine-grained control over per-layer quantization levels to balance quality against memory usage.

    What sets ExLlamaV2 apart is its focus on practical efficiency for single-GPU and dual-GPU setups. While other inference engines target large-scale multi-GPU clusters, ExLlamaV2 excels at making large models run fast on the hardware most developers actually have — a single RTX 4090, a used 3090, or a pair of consumer GPUs. Its paged attention implementation, speculative decoding support, and cache quantization allow it to serve models that would otherwise require more expensive hardware. For developers and small teams deploying fine-tuned models locally, ExLlamaV2 delivers production-quality speed without production-scale infrastructure.

    How Ertas Integrates

    Ertas Studio produces fine-tuned models that can be quantized into ExLlamaV2's EXL2 format for optimized deployment. After completing a fine-tuning job in Ertas (training on your domain-specific data with LoRA and merging the adapters), you export the full model and run it through ExLlamaV2's quantization pipeline. The EXL2 format lets you target a specific bits-per-weight (bpw) rate, typically 3.0 to 6.0 bpw, giving precise control over the tradeoff between model quality and GPU memory requirements.
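
    As a rough rule of thumb, weight memory is parameter count × bpw ÷ 8 bytes: a 13B-parameter model quantized at 4.0 bpw needs roughly 13 × 10⁹ × 4 / 8 ≈ 6.5 GB for its weights, leaving headroom on a 24 GB card for the KV cache and activations.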

    Once quantized, the model runs through ExLlamaV2's inference server, which exposes an OpenAI-compatible API endpoint for integration with any client application. The combination is particularly effective for deploying domain-specific models on consumer hardware: Ertas handles the knowledge injection through fine-tuning, and ExLlamaV2 handles the performance optimization through quantization and custom kernels. A 13B parameter model fine-tuned for your use case can serve requests at 80+ tokens per second on a single RTX 4090 — fast enough for interactive applications and concurrent users.
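
    Most deployments will sit behind the OpenAI-compatible server described in the steps below, but a minimal sketch of driving an EXL2 model directly from Python looks roughly like this. Paths and sampling values are placeholders, and the class and method names follow the style of the exllamav2 README examples, so verify them against your installed version:

    ```python
    # Minimal sketch: load an EXL2-quantized model and generate a completion.
    # Paths are hypothetical; verify class/method names against your exllamav2 version.
    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    config = ExLlamaV2Config()
    config.model_dir = "/models/my-ertas-model-exl2-4.0bpw"  # hypothetical path
    config.prepare()

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)   # KV cache is allocated as layers load
    model.load_autosplit(cache)                # split weights across available GPUs

    tokenizer = ExLlamaV2Tokenizer(config)
    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = 0.7
    settings.top_p = 0.9

    # The third positional argument is the number of new tokens to generate.
    print(generator.generate_simple("Summarize our refund policy in two sentences:", settings, 200))
    ```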

    Getting Started

    1. Fine-tune and export from Ertas Studio

      Train your domain-specific model in Ertas Studio using LoRA fine-tuning. Once satisfied with quality, merge the LoRA adapters into the base model and export the full merged model in safetensors format.
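
      If you want to run the merge yourself outside Ertas, a typical sketch with Hugging Face transformers and peft might look like the following; the base model name and adapter path are placeholders for whatever your Ertas job used:

      ```python
      # Sketch: fold LoRA adapters into the base model and save as safetensors.
      # Model name and paths are placeholders.
      import torch
      from peft import PeftModel
      from transformers import AutoModelForCausalLM, AutoTokenizer

      base = AutoModelForCausalLM.from_pretrained(
          "meta-llama/Llama-2-13b-hf",          # example base model
          torch_dtype=torch.bfloat16,
      )
      model = PeftModel.from_pretrained(base, "/exports/ertas-lora-adapter")  # hypothetical adapter dir
      merged = model.merge_and_unload()         # merge LoRA weights into the base layers

      merged.save_pretrained("/exports/merged-model", safe_serialization=True)
      AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf").save_pretrained("/exports/merged-model")
      ```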

    2. Quantize to EXL2 format

      Use ExLlamaV2's conversion tool to quantize the merged model into EXL2 format. Choose a bits-per-weight target that balances quality and memory: 4.0 bpw is a common sweet spot for consumer GPUs, while 5.0-6.0 bpw preserves more quality for larger VRAM budgets.
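
      The conversion is driven by the convert.py script in the exllamav2 repository; a sketch of invoking it from Python is below. Flag names reflect the script's documented usage and may change between releases, and all paths are placeholders:

      ```python
      # Sketch: quantize the merged model to EXL2 at 4.0 bits per weight by calling
      # exllamav2's convert.py from a checkout of the exllamav2 repository.
      # Paths are placeholders; check `python convert.py -h` for the current flags.
      import subprocess

      subprocess.run(
          [
              "python", "convert.py",
              "-i", "/exports/merged-model",                 # merged bf16 model (safetensors)
              "-o", "/tmp/exl2-work",                        # scratch dir for measurement passes
              "-cf", "/models/my-ertas-model-exl2-4.0bpw",   # final quantized output dir
              "-b", "4.0",                                   # target bits per weight
          ],
          check=True,
      )
      ```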

    3. Benchmark inference performance

      Run ExLlamaV2's built-in benchmark to measure generation speed, prompt processing throughput, and memory usage on your target GPU. Verify that performance meets your latency requirements for interactive use.
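
      The exllamav2 repository ships its own benchmarking script, but a coarse sanity check can be as simple as timing a generation with the Python API, reusing the generator from the earlier sketch:

      ```python
      # Sketch: rough tokens-per-second check using the generator built earlier.
      # Generation may stop early at an end-of-sequence token, so treat this as an
      # approximation and use exllamav2's own benchmarks for real figures.
      import time

      num_tokens = 256
      start = time.time()
      generator.generate_simple("Explain EXL2 quantization in one paragraph:", settings, num_tokens)
      elapsed = time.time() - start

      print(f"~{num_tokens / elapsed:.1f} tokens/s over {elapsed:.2f} s")
      ```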

    4. Launch the inference server

      Start TabbyAPI (or another ExLlamaV2-backed server) to expose your quantized model as an OpenAI-compatible endpoint. Configure context length, concurrent request handling, and speculative decoding if using a draft model.
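
      Once the server is running, you can confirm it is serving your model by listing models over the OpenAI-compatible API. The port shown is an assumption based on TabbyAPI's default, and the API key depends on your configuration:

      ```python
      # Sketch: check that the local server is up and serving the quantized model.
      # Port and API key are assumptions about your TabbyAPI configuration.
      import requests

      resp = requests.get(
          "http://localhost:5000/v1/models",
          headers={"Authorization": "Bearer YOUR_API_KEY"},
      )
      resp.raise_for_status()
      print([m["id"] for m in resp.json()["data"]])
      ```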

    5. Connect client applications

      Point your coding assistant, chat interface, or custom application to the ExLlamaV2 endpoint. Monitor generation quality in real use and re-fine-tune in Ertas if the model needs improvement on specific tasks.
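
      Any OpenAI-compatible client works; a minimal Python example using the openai package is shown below. The base URL, API key, and model name are placeholders for your deployment:

      ```python
      # Sketch: call the local ExLlamaV2 endpoint through the openai client.
      # base_url, api_key, and model name are placeholders.
      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_API_KEY")

      response = client.chat.completions.create(
          model="my-ertas-model-exl2-4.0bpw",
          messages=[
              {"role": "system", "content": "You are a support assistant for Acme Corp."},
              {"role": "user", "content": "How do I reset a customer's password?"},
          ],
          max_tokens=300,
      )
      print(response.choices[0].message.content)
      ```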

    Benefits

    • Industry-leading generation speed on single-GPU setups through optimized CUDA kernels
    • Fine-grained EXL2 quantization for precise quality-versus-memory tradeoff control
    • Efficient enough to serve fine-tuned 13B+ models interactively on consumer RTX cards
    • Speculative decoding support for even faster generation with compatible draft models
    • OpenAI-compatible API for seamless integration with coding tools and custom applications
    • Cache quantization and paged attention to maximize concurrent users on limited VRAM
