Fine-Tune Vicuna with Ertas
LMSYS's instruction-tuned model family in 7B, 13B, and 33B sizes, fine-tuned from Llama on ShareGPT conversations and widely recognized for pioneering open-source chatbot evaluation methodology.
Overview
Vicuna is a family of open-source chatbot models developed by LMSYS (Large Model Systems Organization), a research group spanning UC Berkeley, CMU, Stanford, and UCSD. First released in March 2023, Vicuna was created by fine-tuning Llama models on user-shared conversations collected from ShareGPT.com, roughly 70,000 conversations for the original release and about 125,000 for later versions. Early GPT-4-judged evaluations estimated the resulting models at roughly 90% of ChatGPT's quality.
Vicuna played a pivotal role in the open-source LLM ecosystem by demonstrating that relatively simple fine-tuning on high-quality conversational data could dramatically improve a base model's chat ability. The project also introduced innovations in evaluation methodology — LMSYS developed the Chatbot Arena, a crowdsourced platform for comparing LLM responses head-to-head, which has since become the most widely cited independent benchmark for conversational AI quality.
The Vicuna family includes 7B, 13B, and 33B parameter variants, all derived from Llama base models. Vicuna v1.5, the most widely used version, builds its 7B and 13B models on Llama 2 with a native 4K context, plus separate 16K-context variants that extend the window via linear RoPE scaling; the 33B model (v1.3) remains based on the original Llama, since Llama 2 shipped no 33B base. The models use the standard Llama decoder architecture with RoPE positional embeddings.
Vicuna models are released under the Llama 2 Community License (for v1.5). While newer models have surpassed Vicuna on benchmarks, the project's contributions to evaluation methodology and its demonstration of the power of fine-tuning on conversational data remain influential.
Key Features
Vicuna's training on ShareGPT conversations gives it a distinctive conversational style. The training data consists of real multi-turn conversations between users and ChatGPT, capturing the natural flow of human-AI dialogue including follow-up questions, clarifications, topic switches, and nuanced instructions. This produces a model that feels more naturally conversational than models fine-tuned on synthetic instruction-following datasets.
The Chatbot Arena evaluation platform, developed alongside Vicuna, introduced pairwise comparison evaluation to the LLM community. Users submit prompts and rate two anonymous model responses side-by-side, generating Elo ratings that reflect real-world user preferences. This methodology has become the gold standard for evaluating conversational AI and is now used to benchmark virtually every major language model release.
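The Elo mechanics behind pairwise arena voting can be sketched in a few lines. The K-factor, starting rating, and vote data below are illustrative assumptions, not LMSYS's exact parameters (the live leaderboard has since moved to a Bradley-Terry-style fit), but the update rule is the standard Elo formula:

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32) -> None:
    """Apply one pairwise vote. k=32 is an illustrative K-factor."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

# Start every model at 1000 (illustrative) and replay crowd votes.
ratings = defaultdict(lambda: 1000.0)
votes = [("vicuna-13b", "alpaca-13b"), ("vicuna-13b", "llama-13b"),
         ("alpaca-13b", "llama-13b"), ("vicuna-13b", "alpaca-13b")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)
```

Because each vote transfers points symmetrically, the rating pool is zero-sum, and a model's rating converges toward its real-world win rate against the field rather than its score on any fixed test set.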
Vicuna v1.5 is also available in 16K-context variants, enabling far longer conversations and document processing than the original 2K-context release. The model handles multi-turn conversations well, maintaining context and coherence across extended dialogue sessions, a direct benefit of training on real conversational data rather than single-turn instruction pairs.
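The 16K variants reportedly reach the longer window via linear RoPE position scaling: position indices are divided by a scale factor so that long positions map into the angle range the model saw during shorter training. A minimal sketch of that idea, with illustrative Llama-like `dim` and `base` values:

```python
import math

def rope_angles(position: int, dim: int = 128, base: float = 10000.0,
                scale: float = 1.0) -> list:
    """Rotary-embedding angles for one position.

    Linear scaling divides the position index by `scale`, so position
    16384 with scale=4 produces the same angles as position 4096 did
    during 4K-context training. dim/base are illustrative values.
    """
    pos = position / scale
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]

# Extending a 4K-trained model to 16K implies a scale factor of 4.
scale = 16384 / 4096
angles_long = rope_angles(8192, scale=scale)   # position 8192 at 16K context
angles_short = rope_angles(2048)               # position 2048 at 4K context
```

Under this scheme `angles_long == angles_short`, which is exactly why the extension works: the attention layers never see rotary angles outside their training distribution.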
Fine-Tuning with Ertas
Vicuna models are straightforward to fine-tune in Ertas Studio, following the same workflow as other Llama-based models. The 7B variant requires 8-12GB VRAM with QLoRA, the 13B needs 10-14GB, and the 33B needs 20-24GB. Since Vicuna is already instruction-tuned, further fine-tuning adapts its conversational style and knowledge to your specific domain.
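A back-of-envelope estimator reproduces those VRAM figures. The coefficients here are assumptions fitted to the ranges above (4-bit NF4 weights at roughly 0.5 bytes per parameter, with adapters, optimizer states, checkpointed activations, and CUDA context lumped into a constant), so treat the result as a rough midpoint, not a guarantee:

```python
def qlora_vram_gb(params_b: float) -> float:
    """Rough QLoRA VRAM estimate in GB (assumed coefficients).

    - 4-bit NF4 base weights: ~0.5 bytes/param -> 0.5 GB per billion params
    - LoRA adapters + AdamW states + checkpointed activations + runtime
      overhead: lumped into a fitted ~6.5 GB constant for ~2K-token batches
    """
    return params_b * 0.5 + 6.5
```

Plugging in the three model sizes gives about 10 GB (7B), 13 GB (13B), and 23 GB (33B), landing inside the ranges quoted above; longer sequence lengths or larger batch sizes push toward the top of each range.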
Vicuna's conversational training makes it a strong starting point for chatbot and customer-facing applications. Fine-tune on your organization's conversation logs, FAQ databases, or support ticket records to create a domain-specific conversational assistant. The model's natural dialogue style means less fine-tuning data is needed to achieve a conversational tone compared to base models.
After fine-tuning in Ertas Studio, export to GGUF for deployment. Vicuna models are compatible with the standard llama.cpp-family inference backends. A Q4_K_M quantized Vicuna 13B at approximately 7.8GB provides a good balance of conversational quality and resource efficiency for production chatbot deployments. Ollama and LM Studio both support the Vicuna chat template natively.
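When driving a backend that expects raw prompts rather than a chat API, the Vicuna conversation format must be rendered manually. This sketch follows the `vicuna_v1.1` template as implemented in FastChat (space-separated turns, `</s>` closing each assistant reply); verify the separators against your backend's template before relying on it:

```python
from typing import List, Optional, Tuple

SYSTEM = ("A chat between a curious user and an artificial intelligence "
          "assistant. The assistant gives helpful, detailed, and polite "
          "answers to the user's questions.")

def vicuna_prompt(turns: List[Tuple[str, Optional[str]]]) -> str:
    """Render (user, assistant) turns in the Vicuna v1.1 chat format.

    Pass None as the final assistant reply to leave the prompt open
    for generation.
    """
    parts = [SYSTEM]
    for user, assistant in turns:
        parts.append(f"USER: {user}")
        if assistant is None:
            parts.append("ASSISTANT:")  # model continues from here
        else:
            parts.append(f"ASSISTANT: {assistant}</s>")
    return " ".join(parts)

prompt = vicuna_prompt([("What is RoPE?", None)])
```

Backends like Ollama apply this template automatically; hand-rolling it mainly matters for custom evaluation harnesses or direct llama.cpp invocations.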
Use Cases
Vicuna's primary strength is conversational AI. Its training on real human-AI conversations makes it natural and engaging in multi-turn dialogue, suitable for customer support chatbots, internal knowledge assistants, and interactive help systems. The model handles conversation flow, context tracking, and topic management well.
The model is also valuable for organizations evaluating and comparing language models. The Chatbot Arena methodology pioneered with Vicuna offers a practical framework for assessment, and running Vicuna alongside newer models gives a useful quality baseline. Many organizations include Vicuna in their evaluation suites as a reference point.
Fine-tuned Vicuna models serve well as conversational interfaces for domain-specific knowledge bases. The model's natural dialogue capability, combined with domain-specific fine-tuning, creates assistants that can discuss technical topics in an accessible, conversational manner — useful for educational platforms, technical documentation navigation, and expert consultation systems.
Hardware Requirements
Vicuna 7B at Q4_K_M requires approximately 4.4GB of RAM, the 13B needs about 7.8GB, and the 33B needs about 19GB. These requirements mirror the underlying Llama architecture. The 7B and 13B models run comfortably on consumer hardware with 8-16GB RAM or GPUs with 8-12GB VRAM.
At Q8_0, the requirements are approximately 7.7GB (7B), 13.8GB (13B), and 35GB (33B). Full FP16 inference requires approximately 14.5GB (7B), 26GB (13B), and 66GB (33B). The 13B model on an RTX 4090 at Q4_K_M typically achieves 35-50 tokens per second, providing a responsive conversational experience.
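The file sizes above follow from simple bits-per-weight arithmetic. The bpw values here are approximate figures for llama.cpp quantization formats (Q4_K_M mixes 4- and 6-bit tensors, averaging near 4.85 bpw), and real files add a few percent for tokenizer and metadata:

```python
# Approximate bits-per-weight for common llama.cpp formats (assumed values).
BPW = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}

def gguf_size_gb(params_b: float, quant: str) -> float:
    """Approximate GGUF file size: parameters x bits-per-weight / 8.

    params_b is the parameter count in billions; the result ignores
    the few percent of extra space real files use for metadata and
    mixed per-tensor quant types.
    """
    return params_b * BPW[quant] / 8
```

For example, `gguf_size_gb(13, "Q4_K_M")` lands near the 7.8GB quoted above, and `gguf_size_gb(33, "Q8_0")` near 35GB; runtime RAM adds a KV cache on top of the file size, growing with context length.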
For fine-tuning in Ertas Studio, the 7B needs 8-12GB VRAM, the 13B needs 10-14GB, and the 33B needs 20-24GB with QLoRA. The 13B variant offers the best quality-to-resource ratio for most conversational fine-tuning tasks, providing noticeably better multi-turn coherence than the 7B at manageable training costs.