
    Fine-Tuning for Voice AI Agents: Vapi, ElevenLabs, and Local Models

    Voice AI agents running on GPT-4 cost $0.10-0.30 per minute of conversation. Fine-tuned local models cut that to near-zero. Here's how to build voice agents that don't bankrupt you per call.

    Ertas Team

    The voice AI agent market has exploded. Vapi, ElevenLabs, Retell, Bland.ai — there are now dozens of platforms promising conversational voice agents for customer support, appointment booking, lead qualification, and outbound sales. Agencies are spinning up voice bots for clients in hours. Demo calls sound impressive.

    Then the invoices arrive.

    A single voice AI agent handling 1,000 calls per month at an average of 4 minutes per call racks up $400-$1,200/month in LLM backbone costs alone. That's before STT, TTS, telephony, and platform fees. At 10,000 calls/month, you're looking at $4,000-$12,000 — just for the language model deciding what to say next.

    The LLM backbone is the expensive part. And for the vast majority of voice agent use cases, GPT-4 is massive overkill.

    The Voice Agent Architecture

    Every voice AI agent follows the same pipeline:

    1. Speech-to-Text (STT): Convert the caller's audio to text (Whisper, Deepgram, AssemblyAI)
    2. LLM Processing: Generate the agent's response based on the transcript, context, and instructions
    3. Text-to-Speech (TTS): Convert the response text back to audio (ElevenLabs, PlayHT, XTTS)

    The STT and TTS steps are relatively cheap — $0.006/minute for Deepgram, $0.01-0.03/minute for ElevenLabs depending on your plan. The LLM step is where the money goes.
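    The pipeline itself is a loop around those three steps. Here's a minimal sketch in Python; the three callables are hypothetical stand-ins for whichever STT, LLM, and TTS providers you wire in:

    from typing import Callable

    # Minimal turn loop for the STT -> LLM -> TTS pipeline. The three callables
    # are stand-ins for your providers (e.g., Deepgram, a chat model, ElevenLabs).
    def handle_turn(
        audio: bytes,
        history: list[dict],
        transcribe: Callable[[bytes], str],
        generate_reply: Callable[[list[dict]], str],
        synthesize: Callable[[str], bytes],
    ) -> bytes:
        user_text = transcribe(audio)                            # 1. STT
        history.append({"role": "user", "content": user_text})
        reply = generate_reply(history)                          # 2. LLM
        history.append({"role": "assistant", "content": reply})
        return synthesize(reply)                                 # 3. TTS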

    A typical voice conversation generates 500-2,000 tokens per turn, with 10-30 turns per call. That's 5,000-60,000 tokens of transcript per conversation. And because the full history is re-sent as input on every turn, billed input tokens grow much faster than the transcript itself. At GPT-4o pricing ($2.50/$10 per million input/output tokens), a 4-minute call with 20 turns costs roughly $0.15-0.40 in LLM inference alone.
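    To sanity-check that figure, here's the arithmetic with the full history re-sent each turn. Every parameter below is an illustrative assumption, not a measurement:

    # Rough per-call LLM cost at GPT-4o pricing ($2.50 / $10 per 1M input/output tokens).
    INPUT_PRICE = 2.50 / 1_000_000    # $ per input token
    OUTPUT_PRICE = 10.00 / 1_000_000  # $ per output token

    def call_cost(turns: int = 20, user_tokens: int = 300,
                  reply_tokens: int = 150, system_tokens: int = 500) -> float:
        cost, context = 0.0, system_tokens
        for _ in range(turns):
            context += user_tokens               # caller's new utterance
            cost += context * INPUT_PRICE        # whole history billed as input
            cost += reply_tokens * OUTPUT_PRICE  # agent's reply billed as output
            context += reply_tokens              # reply joins the history
        return cost

    print(f"${call_cost():.2f}")  # ~$0.28 for a 20-turn call under these assumptions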

    Fine-tuned local models bring that LLM cost to effectively zero.

    Why Voice Agents Are Perfect for Fine-Tuning

    Voice agent conversations are surprisingly narrow in scope. An appointment-booking agent for a dental practice handles maybe 15-20 distinct conversation patterns:

    • Scheduling a new appointment
    • Rescheduling an existing one
    • Asking about insurance acceptance
    • Asking about office hours
    • Requesting directions
    • Handling emergencies
    • Polite small talk and greetings

    That's a classification and response-generation task — exactly the type of work where a fine-tuned 7B or 8B model matches GPT-4 quality. The agent doesn't need to reason about novel problems or synthesize information from disparate domains. It needs to recognize the caller's intent, pull the right information from context, and generate a natural-sounding response.

    The Latency Advantage of Small Models

    Voice agents have strict latency requirements. Callers expect responses within 300-800ms. Anything over 1 second feels unnatural. Over 2 seconds and people start saying "Hello? Are you there?"

    This is where small models actually have an advantage over cloud APIs:

    Setup | Time to First Token | Full Response (avg)
    GPT-4o via API | 200-600ms | 800-2,000ms
    GPT-3.5 via API | 150-400ms | 500-1,200ms
    Fine-tuned 8B (local, RTX 4090) | 30-80ms | 150-400ms
    Fine-tuned 3B (local, RTX 3090) | 15-40ms | 80-250ms

    Local inference eliminates network round-trip time entirely. For voice agents where every 100ms matters, running a fine-tuned model on local hardware produces noticeably more natural conversations. The agent responds faster than a human would — which, counterintuitively, makes it sound more human because there's less awkward silence.

    Building Training Data from Conversation Transcripts

    The best training data for a voice agent model is real conversation transcripts. If you're already running a voice agent on GPT-4, you have a gold mine of training data accumulating every day.

    Here's the process:

    Step 1: Collect Transcripts

    Export conversation logs from your voice platform. Vapi provides full conversation transcripts via their API. Retell and Bland.ai have similar export capabilities. You need the full turn-by-turn transcript including system prompts and tool calls.
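    As a sketch, exporting from Vapi looks roughly like this. It assumes Vapi's call-listing endpoint (GET https://api.vapi.ai/call) and that each call object carries turn-by-turn messages; verify field names against their current API docs:

    import requests

    VAPI_API_KEY = "your-vapi-api-key"  # placeholder

    # Assumption: GET /call returns recent call objects that include
    # turn-by-turn messages. Check Vapi's docs for the current shape.
    resp = requests.get(
        "https://api.vapi.ai/call",
        headers={"Authorization": f"Bearer {VAPI_API_KEY}"},
        params={"limit": 100},
    )
    resp.raise_for_status()
    calls = resp.json()
    print(f"Exported {len(calls)} calls")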

    Step 2: Filter for Quality

    Not every conversation is good training data. Filter for:

    • Successful outcomes: Calls where the agent accomplished the goal (appointment booked, question answered, lead qualified)
    • Natural flow: Conversations without excessive retries or confusion
    • Representative coverage: Ensure all major conversation types are included

    Typically, 60-70% of conversations are good training candidates. From 1,000 total calls, expect 600-700 usable transcripts.
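    Continuing from the Step 1 sketch, the filtering pass can be a few predicates over each call. The outcome labels and retry heuristic here are hypothetical; substitute whatever your platform actually logs:

    # Keep calls that look like good training data. `outcome` and the retry
    # heuristic are hypothetical conventions; adapt them to your own logs.
    GOOD_OUTCOMES = {"appointment_booked", "question_answered", "lead_qualified"}

    def is_good_candidate(call: dict) -> bool:
        if call.get("outcome") not in GOOD_OUTCOMES:
            return False
        # Natural flow: repeated apologies/clarifications suggest a confused exchange
        retries = sum(
            phrase in m.get("content", "").lower()
            for m in call.get("messages", []) if m.get("role") == "assistant"
            for phrase in ("sorry", "didn't catch")
        )
        return retries <= 2

    good_calls = [c for c in calls if is_good_candidate(c)]  # expect ~60-70% to survive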

    Step 3: Format as Training Pairs

    Convert each conversation into the chat format your target model expects:

    {
      "messages": [
        {"role": "system", "content": "You are a scheduling assistant for Downtown Dental..."},
        {"role": "user", "content": "Hi, I'd like to make an appointment"},
        {"role": "assistant", "content": "I'd be happy to help you schedule an appointment..."},
        {"role": "user", "content": "Do you take Delta Dental?"},
        {"role": "assistant", "content": "Yes, we accept Delta Dental PPO and Premier plans..."}
      ]
    }
    

    Step 4: Add Edge Cases

    Supplement your real transcripts with synthetic examples for edge cases:

    • Callers who are upset or confused
    • Calls where the agent needs to say "I don't know" or transfer to a human
    • Simultaneous requests ("I need to reschedule my appointment and also ask about my bill")
    • Off-topic requests the agent should gracefully deflect

    A dataset of 500-1,000 conversations is typically sufficient for a voice agent fine-tune. More data helps, but the returns diminish after 1,000 examples for a single-purpose agent.

    Cost Comparison at Scale

    Here's what the numbers look like across different scale levels. Assumes 4-minute average call duration, 20 turns per call, and ~30,000 tokens per conversation.

    1,000 Calls/Month

    Component | GPT-4o Agent | Fine-Tuned 8B Agent
    LLM inference | $400-$1,200 | $0 (local)
    STT (Deepgram) | $24 | $24
    TTS (ElevenLabs) | $99-$330 | $99-$330
    Hardware/hosting | $0 | $50-$100 (cloud GPU)
    Monthly total | $523-$1,554 | $173-$454

    10,000 Calls/Month

    Component | GPT-4o Agent | Fine-Tuned 8B Agent
    LLM inference | $4,000-$12,000 | $0 (local)
    STT | $240 | $240
    TTS | $330-$990 | $330-$990
    Hardware/hosting | $0 | $150-$300
    Monthly total | $4,570-$13,230 | $720-$1,530

    100,000 Calls/Month

    Component | GPT-4o Agent | Fine-Tuned 8B Agent
    LLM inference | $40,000-$120,000 | $0 (local)
    STT | $2,400 | $2,400
    TTS | $3,300-$9,900 | $3,300-$9,900
    Hardware/hosting | $0 | $500-$1,500
    Monthly total | $45,700-$132,300 | $6,200-$13,800

    At 100K calls/month, the fine-tuned model saves $39,500-$118,500/month. The economics are not subtle.

    Architecture: Replacing the LLM Backbone

    The swap is straightforward. Your voice agent architecture stays the same — STT, LLM, TTS — you just replace the LLM layer.

    Option A: Ollama + OpenAI-Compatible API

    Most voice platforms let you specify a custom LLM endpoint. Run your fine-tuned model via Ollama, which exposes an OpenAI-compatible API:

    ollama serve
    # Model accessible at http://localhost:11434/v1/chat/completions
    

    Point Vapi or your custom voice pipeline at http://your-server:11434/v1/chat/completions instead of https://api.openai.com/v1/chat/completions. The request format is identical.
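    If your pipeline calls the LLM through the OpenAI Python SDK, the swap is a one-line base_url change. Ollama ignores the API key, but the client requires one; "voice-agent" stands in for whatever name you gave your model when importing it:

    from openai import OpenAI

    # Same SDK, different base_url. Ollama ignores the key but the client needs one.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    response = client.chat.completions.create(
        model="voice-agent",  # the name of your fine-tuned model in Ollama
        messages=[
            {"role": "system", "content": "You are a scheduling assistant for Downtown Dental..."},
            {"role": "user", "content": "Hi, I'd like to make an appointment"},
        ],
    )
    print(response.choices[0].message.content)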

    Option B: vLLM for High Throughput

    If you're handling more than 50 concurrent calls, Ollama's limited request concurrency becomes a bottleneck. vLLM provides continuous batching and can serve hundreds of concurrent requests on a single GPU:

    python -m vllm.entrypoints.openai.api_server \
      --model ./fine-tuned-voice-agent \
      --max-model-len 4096 \
      --gpu-memory-utilization 0.9
    

    Option C: Hybrid with Cloud Fallback

    Run local inference for the 90% of calls that fit your fine-tuned model's training distribution. Route unusual or complex calls to GPT-4 as a fallback. You capture most of the cost savings while maintaining quality on edge cases.

    Detection is simple: if the model's confidence score drops below a threshold, or if the conversation exceeds a certain turn count, escalate to the cloud model.
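    A minimal sketch of that router, assuming local_client and cloud_client are OpenAI-SDK clients pointed at your local server and at OpenAI (as in Option A), and that you derive a confidence value yourself, e.g. from token log-probabilities. The thresholds are assumptions to tune against your own traffic:

    # Hybrid routing: local model first, cloud fallback for long or
    # low-confidence conversations. Thresholds are illustrative.
    MAX_LOCAL_TURNS = 25
    MIN_CONFIDENCE = 0.6

    def generate_reply(history: list[dict], confidence: float) -> str:
        turns = sum(1 for m in history if m["role"] == "user")
        if turns > MAX_LOCAL_TURNS or confidence < MIN_CONFIDENCE:
            client, model = cloud_client, "gpt-4o"       # escalate to frontier model
        else:
            client, model = local_client, "voice-agent"  # fine-tuned local model
        resp = client.chat.completions.create(model=model, messages=history)
        return resp.choices[0].message.content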

    Hardware Requirements

    For voice agent inference, latency matters more than throughput. Here's what works:

    Hardware | Model Size | Concurrent Calls | Cost
    RTX 4090 (24GB) | 8B Q4 | 5-15 | $1,600 one-time
    A6000 (48GB) | 8B Q8 or 14B Q4 | 10-30 | $4,500 one-time
    L4 (cloud) | 8B Q4 | 5-15 | $0.50/hr
    A10G (cloud) | 8B Q8 | 10-25 | $1.00/hr

    A single RTX 4090 handling 10 concurrent calls at an average of 4 minutes per call can process 150 calls per hour — 3,600 calls per day — 108,000 calls per month. One consumer GPU. $1,600.

    Self-Hosting TTS to Cut Costs Further

    ElevenLabs and PlayHT produce excellent voice quality, but their costs add up at scale. Open-source TTS models have closed the gap significantly:

    • XTTS v2 (Coqui): Near-human quality, supports voice cloning, runs locally
    • Piper TTS: Lower quality but extremely fast, good for simple use cases
    • StyleTTS2: High-quality neural TTS with style control

    Running XTTS v2 locally adds zero marginal cost per call. Quality is roughly 85-90% of ElevenLabs for most voices. For cost-sensitive deployments, self-hosted TTS can eliminate another $99-$9,900/month from the bill.
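    A minimal local sketch with Coqui's TTS package (pip install TTS). The speaker reference file is a short clip of the voice you want to clone:

    from TTS.api import TTS

    # Load XTTS v2 once at startup; each synthesis call is then local,
    # with zero marginal cost per call.
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

    tts.tts_to_file(
        text="Sure, I can help you schedule that. What day works best for you?",
        speaker_wav="voice_sample.wav",  # short reference clip of the target voice
        language="en",
        file_path="reply.wav",
    )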

    Training the Fine-Tuned Voice Agent Model

    The fine-tuning process itself is the same as any chat model fine-tune, with a few voice-specific considerations:

    1. Keep responses short. Voice responses should be 1-3 sentences. Long paragraphs sound unnatural when spoken. Train the model to be concise.
    2. Include filler and conversational markers. Real speech includes "Sure," "Let me check that," "One moment." These make the TTS output sound natural; there's a sample turn after this list.
    3. Train on multi-turn conversations. The model needs to handle back-and-forth, not just single question-answer pairs.
    4. Include interruption handling. Callers interrupt. Train the model to handle partial inputs gracefully.
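    To make the first two points concrete, here's the shape of a turn you want the model to learn: short, with a natural conversational marker. This is an illustrative example, not a real transcript:

    {
      "messages": [
        {"role": "user", "content": "Can you move my cleaning to next week?"},
        {"role": "assistant", "content": "Sure, let me check that. We have Tuesday at 10am or Thursday at 2pm. Would either of those work?"}
      ]
    }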

    Upload your formatted dataset to Ertas and select your base model — Llama 3.1 8B Instruct or Qwen 2.5 7B Instruct work well for voice agents. The fine-tuning takes 1-3 hours depending on dataset size. Deploy the resulting model via Ollama and point your voice pipeline at it.

    When to Keep GPT-4 in the Loop

    Fine-tuned local models work for the majority of voice agent use cases, but some scenarios still benefit from a frontier model:

    • Open-domain conversations where the caller can ask about anything (general customer service hotlines)
    • Complex troubleshooting that requires multi-step reasoning about technical problems
    • Multilingual agents handling 10+ languages (though fine-tuned multilingual models are improving fast)
    • Initial prototyping before you have conversation data to fine-tune on

    For most production voice agents — appointment booking, lead qualification, order status, FAQ handling — a fine-tuned 7-8B model handles 95%+ of calls at equal or better quality than GPT-4, with lower latency and dramatically lower cost.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
