    Replit App AI Costs Exploding? Replace OpenAI with a Fine-Tuned Local Model


    Replit's always-on deployment and easy AI integration create a specific API cost problem. Here's how to replace OpenAI with a fine-tuned local model and cut costs to flat rate.

    Ertas Team

    Replit's AI agents make it dangerously easy to add OpenAI-powered features. You describe what you want, the agent writes the code, and your app has AI in it. The problem is that the cost of that AI does not show up in your Replit bill — it shows up in your OpenAI dashboard, quietly climbing every week as your app gets more users.

    Replit has a specific AI cost problem that other platforms do not: always-on deployments.

    The Replit AI Stack

    Most Replit apps with AI features integrate OpenAI through one of two patterns:

    The direct API call pattern (most common):

    import os
    import openai
    
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    
    def get_ai_response(user_input):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_input}]
        )
        return response.choices[0].message.content
    

    The Replit AI template pattern: Some Replit templates include pre-configured OpenAI integrations. If you used one of these, your app is making API calls without you explicitly seeing the implementation.

    Both patterns have the same scaling problem: every user request that touches an AI feature costs money.

    Real Cost Numbers at Different Scales

    For a typical Replit app with a chat or AI generation feature:

    Users     AI Requests/Day    Daily Tokens    Monthly OpenAI Cost
    50        150                105,000         ~$1.50
    200       600                420,000         ~$6
    500       1,500              1,050,000       ~$15
    1,000     3,000              2,100,000       ~$30
    3,000     9,000              6,300,000       ~$90
    10,000    30,000             21,000,000      ~$300

    These numbers assume gpt-4o-mini at 700 tokens per request. Switch to gpt-4o and multiply by 15-20x.
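    The arithmetic behind the table is simple enough to sketch. The blended per-token price below (~$0.48 per million tokens, mixing gpt-4o-mini input and output rates) is an assumption back-derived from the table, not an official figure:

```python
def monthly_openai_cost(users, requests_per_user_per_day=3,
                        tokens_per_request=700,
                        blended_price_per_million=0.476):
    """Rough monthly OpenAI spend for a user-driven AI feature."""
    daily_tokens = users * requests_per_user_per_day * tokens_per_request
    return daily_tokens * 30 * blended_price_per_million / 1_000_000

print(f"${monthly_openai_cost(1_000):.2f}/month at 1,000 users")
```

    Plug in your own request rate and token count per request; those two inputs dominate the result far more than the exact per-token price.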

    The Specific Replit Problem: Always-On Deployments

    Here is what makes Replit different from other platforms: Replit Deployments are always-on. Your app runs 24/7, even when no users are active.

    This creates AI cost exposure that other platforms do not have:

    Scheduled tasks making API calls: If your Replit app has any schedule or cron-style tasks that call OpenAI (daily summaries, periodic data enrichment, background processing), those run regardless of user activity.

    Webhook handlers: If your app receives webhooks (Stripe events, GitHub hooks, third-party service callbacks), and those trigger AI processing, each webhook is an API call you pay for.

    Database watchers / polling loops: Some Replit apps poll external APIs or watch databases in the background. If this polling triggers AI processing on new data, costs accumulate without user interaction.

    Session initialization: Some AI features initialize on app load or session start, making API calls before any user interaction.

    Before fixing the scale problem, audit your Replit app for background AI calls. Check the OpenAI usage dashboard: if costs scale linearly with user activity, they are user-driven; if there is a non-zero base cost even when no users are active, you have background calls.
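    One low-effort way to run that audit from inside the app: wrap each call site with a decorator that logs where the request came from, then compare the log against the usage dashboard. A minimal sketch (the `audited` decorator and its source labels are hypothetical, not a Replit or OpenAI feature):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ai-audit")

def audited(source):
    """Tag an AI call site with its origin: 'user', 'webhook', 'cron', ..."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            result = fn(*args, **kwargs)
            # One log line per AI call, attributable to a source
            log.info("ai_call source=%s fn=%s duration=%.2fs",
                     source, fn.__name__, time.monotonic() - start)
            return result
        return wrapper
    return decorator

# Example: mark a scheduled job so its calls show up labeled in the log
@audited("cron")
def daily_summary(text):
    return text  # stand-in for your real OpenAI call
```

    If the log shows `source=cron` or `source=webhook` entries piling up overnight, you have found your base cost.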

    The Local Model Alternative

    The fix is the same as any other platform: fine-tune a small model on your domain, run it locally, route requests to your own VPS instead of OpenAI.

    For Replit apps, the architecture looks like this:

    Replit App (frontend + logic)
             ↓
        HTTP request
             ↓
    External VPS (Hetzner $14-26/mo)
      └── Ollama serving fine-tuned GGUF
             ↓
        Response back to Replit app
    

    Your Replit app makes HTTP requests to an external URL (your VPS). The VPS runs Ollama, which serves your fine-tuned model. This works because:

    1. Replit apps can make outbound HTTP requests to any URL
    2. Ollama serves an OpenAI-compatible API
    3. Your existing OpenAI SDK code keeps working; only the base_url changes

    Architecture: Replit App + External Ollama VPS

    Setting up the VPS (Hetzner CX32, ~$14/month):

    # Install Ollama
    curl -fsSL https://ollama.com/install.sh | sh
    
    # Create an Ollama model from your fine-tuned GGUF
    cat > Modelfile << 'EOF'
    FROM /path/to/your-fine-tuned-model.gguf
    SYSTEM "You are a helpful assistant specialized in [your domain]."
    EOF
    
    ollama create my-app-model -f Modelfile
    
    # Start Ollama (it listens on port 11434 by default)
    # For external access, set OLLAMA_HOST=0.0.0.0
    OLLAMA_HOST=0.0.0.0 ollama serve
    

    Updating your Replit app code:

    # Before:
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    
    # After:
    client = openai.OpenAI(
        api_key="not-required",
        base_url=f"http://{os.environ['OLLAMA_VPS_IP']}:11434/v1"
    )
    
    # Everything else in your code stays the same
    response = client.chat.completions.create(
        model="my-app-model",  # your Ollama model name
        messages=[{"role": "user", "content": user_input}]
    )
    

    Store your VPS IP as a Replit Secret (OLLAMA_VPS_IP). Never hardcode IPs.

    Security note: Add a simple API key check with nginx if your VPS is public. Otherwise anyone with the IP can use your model.
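    One way to do that, sketched below: keep Ollama bound to localhost and put nginx in front of it, rejecting any request that lacks a shared secret. Because the OpenAI SDK sends its api_key as an `Authorization: Bearer` header, setting api_key to your secret in the Replit client is the only client-side change needed. The port, server_name, and secret value are placeholders:

```nginx
server {
    listen 8080;
    server_name _;

    location /v1/ {
        # Reject any request that does not carry the shared secret
        if ($http_authorization != "Bearer YOUR_SECRET_KEY") {
            return 401;
        }
        # Ollama stays bound to localhost only
        proxy_pass http://127.0.0.1:11434;
    }
}
```

    If you proxy like this, point base_url at port 8080 instead of 11434, set api_key to the secret, and leave OLLAMA_HOST unset so Ollama never listens on a public interface.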

    Fine-Tuning for Your Replit Use Case

    To get the fine-tuned model you'll run on the VPS:

    1. Export 400-800 input/output pairs from your existing OpenAI API logs (Replit logs all environment output; your app may also be logging responses to a database)
    2. Format as JSONL
    3. Upload to Ertas, select Qwen 2.5 7B, train
    4. Download GGUF, upload to your VPS, load into Ollama
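    Step 2's JSONL format is one chat-style record per line. A minimal sketch (the example pair is invented, and the exact schema Ertas expects is an assumption; check the upload docs for the required field names):

```python
import json

# Hypothetical pairs exported from your API logs: (user_input, model_output)
pairs = [
    ("How do I cancel my subscription?",
     "Go to Settings > Billing > Cancel subscription."),
]

# Write one JSON object per line: the standard chat fine-tuning layout
with open("train.jsonl", "w") as f:
    for user_input, model_output in pairs:
        record = {"messages": [
            {"role": "user", "content": user_input},
            {"role": "assistant", "content": model_output},
        ]}
        f.write(json.dumps(record) + "\n")
```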

    For Replit apps, common fine-tuning tasks:

    • Chat/Q&A on domain content: Train on (question, answer) pairs from your logs
    • Content generation: Train on (prompt, output) pairs where outputs were accepted/used
    • Classification/routing: Train on (input, category) pairs with verified correct categories

    Cost After Migration

    Users (MAU)    Monthly OpenAI (gpt-4o-mini)    Monthly (Ertas + VPS)
    500            ~$15                            $40.50
    1,000          ~$30                            $40.50
    5,000          ~$150                           $40.50
    20,000         ~$600                           $40.50-66.50

    Break-even against gpt-4o-mini is around 1,500-2,000 MAU for typical usage. Against gpt-4o, break-even is under 200 MAU.
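    The break-even claim is simple division: flat cost over per-user cost. Using ~$0.03 per MAU per month for gpt-4o-mini (back-derived from the table above, so treat it as an assumption):

```python
def break_even_mau(flat_monthly=40.50, cost_per_mau=0.03):
    """MAU at which a flat-rate local model beats per-call API pricing."""
    return flat_monthly / cost_per_mau
```

    This lands around 1,350 MAU, a bit below the 1,500-2,000 range quoted above; the real number depends on how many requests each user actually makes. With gpt-4o's roughly 15x higher per-MAU cost, the same division gives a break-even well under 200 MAU.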

    The flat cost structure also eliminates the background call problem: your always-on Replit app can call your always-on Ollama VPS at zero additional cost per call.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
