
    Fine-Tuning for Voice AI Agents: Vapi, ElevenLabs, and Local Models

    Voice AI agents running on GPT-4 cost $0.10-0.30 per minute of conversation. Fine-tuned local models cut that to near-zero. Here's how to build voice agents that don't bankrupt you per call.

    Ertas Team

    The voice AI agent market has exploded. Vapi, ElevenLabs, Retell, Bland.ai — there are now dozens of platforms promising conversational voice agents for customer support, appointment booking, lead qualification, and outbound sales. Agencies are spinning up voice bots for clients in hours. Demo calls sound impressive.

    Then the invoices arrive.

    A single voice AI agent handling 1,000 calls per month at an average of 4 minutes per call racks up $400-$1,200/month in LLM backbone costs alone. That's before STT, TTS, telephony, and platform fees. At 10,000 calls/month, you're looking at $4,000-$12,000 — just for the language model deciding what to say next.

    The LLM backbone is the expensive part. And for the vast majority of voice agent use cases, GPT-4 is massive overkill.

    The Voice Agent Architecture

    Every voice AI agent follows the same pipeline:

    1. Speech-to-Text (STT): Convert the caller's audio to text (Whisper, Deepgram, AssemblyAI)
    2. LLM Processing: Generate the agent's response based on the transcript, context, and instructions
    3. Text-to-Speech (TTS): Convert the response text back to audio (ElevenLabs, PlayHT, XTTS)

    The STT and TTS steps are relatively cheap — $0.006/minute for Deepgram, $0.01-0.03/minute for ElevenLabs depending on your plan. The LLM step is where the money goes.
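    The pipeline itself is a loop around those three steps. Here's a minimal sketch in Python; the three callables are hypothetical stand-ins for whichever STT, LLM, and TTS providers you wire in:

    from typing import Callable

    # Minimal turn loop for the STT -> LLM -> TTS pipeline. The three callables
    # are stand-ins for your providers (e.g., Deepgram, a chat model, ElevenLabs).
    def handle_turn(
        audio: bytes,
        history: list[dict],
        transcribe: Callable[[bytes], str],
        generate_reply: Callable[[list[dict]], str],
        synthesize: Callable[[str], bytes],
    ) -> bytes:
        user_text = transcribe(audio)                            # 1. STT
        history.append({"role": "user", "content": user_text})
        reply = generate_reply(history)                          # 2. LLM
        history.append({"role": "assistant", "content": reply})
        return synthesize(reply)                                 # 3. TTS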

    A typical voice conversation generates 500-2,000 tokens per turn, with 10-30 turns per call. That's 5,000-60,000 tokens of transcript per conversation. And because the full history is re-sent as input on every turn, billed input tokens grow much faster than the transcript itself. At GPT-4o pricing ($2.50/$10 per million input/output tokens), a 4-minute call with 20 turns costs roughly $0.15-0.40 in LLM inference alone.
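    To sanity-check that figure, here's the arithmetic with the full history re-sent each turn. Every parameter below is an illustrative assumption, not a measurement:

    # Rough per-call LLM cost at GPT-4o pricing ($2.50 / $10 per 1M input/output tokens).
    INPUT_PRICE = 2.50 / 1_000_000    # $ per input token
    OUTPUT_PRICE = 10.00 / 1_000_000  # $ per output token

    def call_cost(turns: int = 20, user_tokens: int = 300,
                  reply_tokens: int = 150, system_tokens: int = 500) -> float:
        cost, context = 0.0, system_tokens
        for _ in range(turns):
            context += user_tokens               # caller's new utterance
            cost += context * INPUT_PRICE        # whole history billed as input
            cost += reply_tokens * OUTPUT_PRICE  # agent's reply billed as output
            context += reply_tokens              # reply joins the history
        return cost

    print(f"${call_cost():.2f}")  # ~$0.28 for a 20-turn call under these assumptions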

    Fine-tuned local models bring that LLM cost to effectively zero.

    Why Voice Agents Are Perfect for Fine-Tuning

    Voice agent conversations are surprisingly narrow in scope. An appointment-booking agent for a dental practice handles maybe 15-20 distinct conversation patterns:

    • Scheduling a new appointment
    • Rescheduling an existing one
    • Asking about insurance acceptance
    • Asking about office hours
    • Requesting directions
    • Handling emergencies
    • Polite small talk and greetings

    That's a classification and response-generation task — exactly the type of work where a fine-tuned 7B or 8B model matches GPT-4 quality. The agent doesn't need to reason about novel problems or synthesize information from disparate domains. It needs to recognize the caller's intent, pull the right information from context, and generate a natural-sounding response.

    The Latency Advantage of Small Models

    Voice agents have strict latency requirements. Callers expect responses within 300-800ms. Anything over 1 second feels unnatural. Over 2 seconds and people start saying "Hello? Are you there?"

    This is where small models actually have an advantage over cloud APIs:

    Setup | Time to First Token | Full Response (avg)
    GPT-4o via API | 200-600ms | 800-2,000ms
    GPT-3.5 via API | 150-400ms | 500-1,200ms
    Fine-tuned 8B (local, RTX 4090) | 30-80ms | 150-400ms
    Fine-tuned 3B (local, RTX 3090) | 15-40ms | 80-250ms

    Local inference eliminates network round-trip time entirely. For voice agents where every 100ms matters, running a fine-tuned model on local hardware produces noticeably more natural conversations. The agent responds faster than a human would — which, counterintuitively, makes it sound more human because there's less awkward silence.

    Building Training Data from Conversation Transcripts

    The best training data for a voice agent model is real conversation transcripts. If you're already running a voice agent on GPT-4, you have a gold mine of training data accumulating every day.

    Here's the process:

    Step 1: Collect Transcripts

    Export conversation logs from your voice platform. Vapi provides full conversation transcripts via their API. Retell and Bland.ai have similar export capabilities. You need the full turn-by-turn transcript including system prompts and tool calls.
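    As a sketch, exporting from Vapi looks roughly like this. It assumes Vapi's call-listing endpoint (GET https://api.vapi.ai/call) and that each call object carries turn-by-turn messages; verify field names against their current API docs:

    import requests

    VAPI_API_KEY = "your-vapi-api-key"  # placeholder

    # Assumption: GET /call returns recent call objects that include
    # turn-by-turn messages. Check Vapi's docs for the current shape.
    resp = requests.get(
        "https://api.vapi.ai/call",
        headers={"Authorization": f"Bearer {VAPI_API_KEY}"},
        params={"limit": 100},
    )
    resp.raise_for_status()
    calls = resp.json()
    print(f"Exported {len(calls)} calls")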

    Step 2: Filter for Quality

    Not every conversation is good training data. Filter for:

    • Successful outcomes: Calls where the agent accomplished the goal (appointment booked, question answered, lead qualified)
    • Natural flow: Conversations without excessive retries or confusion
    • Representative coverage: Ensure all major conversation types are included

    Typically, 60-70% of conversations are good training candidates. From 1,000 total calls, expect 600-700 usable transcripts.
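    Continuing from the Step 1 sketch, the filtering pass can be a few predicates over each call. The outcome labels and retry heuristic here are hypothetical; substitute whatever your platform actually logs:

    # Keep calls that look like good training data. `outcome` and the retry
    # heuristic are hypothetical conventions; adapt them to your own logs.
    GOOD_OUTCOMES = {"appointment_booked", "question_answered", "lead_qualified"}

    def is_good_candidate(call: dict) -> bool:
        if call.get("outcome") not in GOOD_OUTCOMES:
            return False
        # Natural flow: repeated apologies/clarifications suggest a confused exchange
        retries = sum(
            phrase in m.get("content", "").lower()
            for m in call.get("messages", []) if m.get("role") == "assistant"
            for phrase in ("sorry", "didn't catch")
        )
        return retries <= 2

    good_calls = [c for c in calls if is_good_candidate(c)]  # expect ~60-70% to survive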

    Step 3: Format as Training Pairs

    Convert each conversation into the chat format your target model expects:

    {
      "messages": [
        {"role": "system", "content": "You are a scheduling assistant for Downtown Dental..."},
        {"role": "user", "content": "Hi, I'd like to make an appointment"},
        {"role": "assistant", "content": "I'd be happy to help you schedule an appointment..."},
        {"role": "user", "content": "Do you take Delta Dental?"},
        {"role": "assistant", "content": "Yes, we accept Delta Dental PPO and Premier plans..."}
      ]
    }
    

    Step 4: Add Edge Cases

    Supplement your real transcripts with synthetic examples for edge cases:

    • Callers who are upset or confused
    • Calls where the agent needs to say "I don't know" or transfer to a human
    • Simultaneous requests ("I need to reschedule my appointment and also ask about my bill")
    • Off-topic requests the agent should gracefully deflect

    A dataset of 500-1,000 conversations is typically sufficient for a voice agent fine-tune. More data helps, but the returns diminish after 1,000 examples for a single-purpose agent.

    Cost Comparison at Scale

    Here's what the numbers look like across different scale levels. Assumes 4-minute average call duration, 20 turns per call, and ~30,000 tokens per conversation.

    1,000 Calls/Month

    Component | GPT-4o Agent | Fine-Tuned 8B Agent
    LLM inference | $400-$1,200 | $0 (local)
    STT (Deepgram) | $24 | $24
    TTS (ElevenLabs) | $99-$330 | $99-$330
    Hardware/hosting | $0 | $50-$100 (cloud GPU)
    Monthly total | $523-$1,554 | $173-$454

    10,000 Calls/Month

    Component | GPT-4o Agent | Fine-Tuned 8B Agent
    LLM inference | $4,000-$12,000 | $0 (local)
    STT | $240 | $240
    TTS | $330-$990 | $330-$990
    Hardware/hosting | $0 | $150-$300
    Monthly total | $4,570-$13,230 | $720-$1,530

    100,000 Calls/Month

    Component | GPT-4o Agent | Fine-Tuned 8B Agent
    LLM inference | $40,000-$120,000 | $0 (local)
    STT | $2,400 | $2,400
    TTS | $3,300-$9,900 | $3,300-$9,900
    Hardware/hosting | $0 | $500-$1,500
    Monthly total | $45,700-$132,300 | $6,200-$13,800

    At 100K calls/month, the fine-tuned model saves $39,500-$118,500/month. The economics are not subtle.

    Architecture: Replacing the LLM Backbone

    The swap is straightforward. Your voice agent architecture stays the same — STT, LLM, TTS — you just replace the LLM layer.

    Option A: Ollama + OpenAI-Compatible API

    Most voice platforms let you specify a custom LLM endpoint. Run your fine-tuned model via Ollama, which exposes an OpenAI-compatible API:

    ollama serve
    # Model accessible at http://localhost:11434/v1/chat/completions
    

    Point Vapi or your custom voice pipeline at http://your-server:11434/v1/chat/completions instead of https://api.openai.com/v1/chat/completions. The request format is identical.
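    If your pipeline calls the LLM through the OpenAI Python SDK, the swap is a one-line base_url change. Ollama ignores the API key, but the client requires one; "voice-agent" stands in for whatever name you gave your model when importing it:

    from openai import OpenAI

    # Same SDK, different base_url. Ollama ignores the key but the client needs one.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    response = client.chat.completions.create(
        model="voice-agent",  # the name of your fine-tuned model in Ollama
        messages=[
            {"role": "system", "content": "You are a scheduling assistant for Downtown Dental..."},
            {"role": "user", "content": "Hi, I'd like to make an appointment"},
        ],
    )
    print(response.choices[0].message.content)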

    Option B: vLLM for High Throughput

    If you're handling more than 50 concurrent calls, Ollama's limited request concurrency becomes a bottleneck. vLLM provides continuous batching and can serve hundreds of concurrent requests on a single GPU:

    python -m vllm.entrypoints.openai.api_server \
      --model ./fine-tuned-voice-agent \
      --max-model-len 4096 \
      --gpu-memory-utilization 0.9
    

    Option C: Hybrid with Cloud Fallback

    Run local inference for the 90% of calls that fit your fine-tuned model's training distribution. Route unusual or complex calls to GPT-4 as a fallback. You capture most of the cost savings while maintaining quality on edge cases.

    Detection is simple: if the model's confidence score drops below a threshold, or if the conversation exceeds a certain turn count, escalate to the cloud model.
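    A minimal sketch of that router, assuming local_client and cloud_client are OpenAI-SDK clients pointed at your local server and at OpenAI (as in Option A), and that you derive a confidence value yourself, e.g. from token log-probabilities. The thresholds are assumptions to tune against your own traffic:

    # Hybrid routing: local model first, cloud fallback for long or
    # low-confidence conversations. Thresholds are illustrative.
    MAX_LOCAL_TURNS = 25
    MIN_CONFIDENCE = 0.6

    def generate_reply(history: list[dict], confidence: float) -> str:
        turns = sum(1 for m in history if m["role"] == "user")
        if turns > MAX_LOCAL_TURNS or confidence < MIN_CONFIDENCE:
            client, model = cloud_client, "gpt-4o"       # escalate to frontier model
        else:
            client, model = local_client, "voice-agent"  # fine-tuned local model
        resp = client.chat.completions.create(model=model, messages=history)
        return resp.choices[0].message.content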

    Hardware Requirements

    For voice agent inference, latency matters more than throughput. Here's what works:

    Hardware | Model Size | Concurrent Calls | Cost
    RTX 4090 (24GB) | 8B Q4 | 5-15 | $1,600 one-time
    A6000 (48GB) | 8B Q8 or 14B Q4 | 10-30 | $4,500 one-time
    L4 (cloud) | 8B Q4 | 5-15 | $0.50/hr
    A10G (cloud) | 8B Q8 | 10-25 | $1.00/hr

    A single RTX 4090 handling 10 concurrent calls at an average of 4 minutes per call can process 150 calls per hour — 3,600 calls per day — 108,000 calls per month. One consumer GPU. $1,600.

    Self-Hosting TTS to Cut Costs Further

    ElevenLabs and PlayHT produce excellent voice quality, but their costs add up at scale. Open-source TTS models have closed the gap significantly:

    • XTTS v2 (Coqui): Near-human quality, supports voice cloning, runs locally
    • Piper TTS: Lower quality but extremely fast, good for simple use cases
    • StyleTTS2: High-quality neural TTS with style control

    Running XTTS v2 locally adds zero marginal cost per call. Quality is roughly 85-90% of ElevenLabs for most voices. For cost-sensitive deployments, self-hosted TTS can eliminate another $99-$9,900/month from the bill.
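    A minimal local sketch with Coqui's TTS package (pip install TTS). The speaker reference file is a short clip of the voice you want to clone:

    from TTS.api import TTS

    # Load XTTS v2 once at startup; each synthesis call is then local,
    # with zero marginal cost per call.
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

    tts.tts_to_file(
        text="Sure, I can help you schedule that. What day works best for you?",
        speaker_wav="voice_sample.wav",  # short reference clip of the target voice
        language="en",
        file_path="reply.wav",
    )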

    Training the Fine-Tuned Voice Agent Model

    The fine-tuning process itself is the same as any chat model fine-tune, with a few voice-specific considerations:

    1. Keep responses short. Voice responses should be 1-3 sentences. Long paragraphs sound unnatural when spoken. Train the model to be concise.
    2. Include filler and conversational markers. Real speech includes "Sure," "Let me check that," "One moment." These make the TTS output sound natural; there's a sample turn after this list.
    3. Train on multi-turn conversations. The model needs to handle back-and-forth, not just single question-answer pairs.
    4. Include interruption handling. Callers interrupt. Train the model to handle partial inputs gracefully.
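    To make the first two points concrete, here's the shape of a turn you want the model to learn: short, with a natural conversational marker. This is an illustrative example, not a real transcript:

    {
      "messages": [
        {"role": "user", "content": "Can you move my cleaning to next week?"},
        {"role": "assistant", "content": "Sure, let me check that. We have Tuesday at 10am or Thursday at 2pm. Would either of those work?"}
      ]
    }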

    Upload your formatted dataset to Ertas and select your base model — Llama 3.1 8B Instruct or Qwen 2.5 7B Instruct work well for voice agents. The fine-tuning takes 1-3 hours depending on dataset size. Deploy the resulting model via Ollama and point your voice pipeline at it.

    When to Keep GPT-4 in the Loop

    Fine-tuned local models work for the majority of voice agent use cases, but some scenarios still benefit from a frontier model:

    • Open-domain conversations where the caller can ask about anything (general customer service hotlines)
    • Complex troubleshooting that requires multi-step reasoning about technical problems
    • Multilingual agents handling 10+ languages (though fine-tuned multilingual models are improving fast)
    • Initial prototyping before you have conversation data to fine-tune on

    For most production voice agents — appointment booking, lead qualification, order status, FAQ handling — a fine-tuned 7-8B model handles 95%+ of calls at equal or better quality than GPT-4, with lower latency and dramatically lower cost.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
