
    7B vs GPT-4: Which Model Size Actually Fits Your Client's Task

    Bigger isn't always better. A guide for AI solutions architects on matching model size to client task requirements — including when a fine-tuned 7B model will outperform GPT-4.

    Ertas Team

    One of the most expensive mistakes in AI agency work is defaulting to the most capable model available. GPT-4o is impressive — but it often brings far more capability than the tasks clients actually need. And the cost difference between GPT-4o and a well-deployed 7B model is not 20% — it is often 95%.

    This guide gives you a practical decision framework for model selection that you can use with clients.

    Why Bigger Models Are Not Always Better

    GPT-4 and similar frontier models are trained to be generalists. They have broad knowledge, strong reasoning capabilities, and can handle a wide variety of tasks. They are also:

    • Expensive per token
    • Slow (higher latency)
    • Controlled by a third party (data goes to OpenAI/Anthropic)
    • Non-customizable without expensive fine-tuning APIs

    A 7B-class model — Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B — is much smaller. It is faster, cheaper, runs locally, and can be fine-tuned on consumer hardware in a few hours.

    The critical insight is: task complexity and model size are not the same thing. A 7B model fine-tuned on a narrow domain can significantly outperform GPT-4 on that specific task. The general-purpose intelligence of GPT-4 is irrelevant — and sometimes counterproductive — for specialized use cases.

    The Task Taxonomy

    When evaluating a client's AI use case, classify the task against this taxonomy:

    Tier 1: Narrow and Repetitive Tasks

    Examples: Email routing classification, intent detection, entity extraction from structured text, yes/no filtering, template-based content generation with fixed format.

    Characteristics: The task has a small, well-defined output space. "Correct" answers can be enumerated or validated automatically. The same type of request appears thousands of times in slightly different forms.

    Best model choice: Fine-tuned 7B model. These tasks are exactly what LoRA fine-tuning excels at. A model trained on 500-2,000 examples of your client's specific task will match or beat GPT-4 accuracy at 1/50th the inference cost.

    Example results: A fine-tuned Llama 3.1 8B model for legal document classification (trained on 1,200 examples) achieves 93% accuracy on held-out test cases. GPT-4o with an optimized prompt achieves 87%. The fine-tuned model wins on both accuracy and cost.
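    The Tier 1 pattern boils down to a closed label set, a constrained prompt, and strict output validation. A minimal sketch — the category names and prompt wording here are hypothetical, not from any client engagement:

```python
# Tier 1 classification harness: fixed labels, constrained prompt,
# machine-checkable output. Category names are illustrative.
CATEGORIES = ["billing", "shipping", "returns", "technical", "other"]

def build_prompt(email_body: str) -> str:
    """Constrain the model to a closed label set so output can be validated."""
    labels = ", ".join(CATEGORIES)
    return (
        f"Classify the following support email into exactly one of: {labels}.\n"
        f"Reply with the label only.\n\nEmail:\n{email_body}\n\nLabel:"
    )

def parse_label(raw_response: str) -> str:
    """Validate model output; anything outside the label set falls back to 'other'."""
    label = raw_response.strip().lower()
    return label if label in CATEGORIES else "other"
```

    Because the output space is enumerable, every response is automatically verifiable — which is exactly the property that makes these tasks cheap to evaluate and fine-tune.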

    Tier 2: Domain-Specific Generation

    Examples: Customer support responses in a specific brand voice, product descriptions following a template, meeting summaries in a prescribed format, code review comments following team conventions.

    Characteristics: The output is longer and more variable than Tier 1, but the domain and style are well-defined. "Good" answers follow patterns that can be learned from examples.

    Best model choice: Fine-tuned 7B or 13B model. The base intelligence requirement is modest — what matters is domain knowledge and style consistency. Fine-tuning provides both. A 7B model trained on 2,000 of the client's existing support responses will sound exactly like the client, which GPT-4 with a prompt cannot replicate as consistently.

    Edge case to monitor: If the domain requires dense factual recall (medical, legal, financial) and the training data does not cover all required knowledge, supplement with RAG. Fine-tuning handles style and behavior; RAG handles facts.
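    The style/facts split can be sketched in a few lines. The knowledge snippets below are invented, and the keyword-overlap scoring stands in for the embedding search a production RAG system would use:

```python
# Sketch of "fine-tuning handles style, RAG handles facts": retrieved
# snippets go into the prompt; the fine-tuned model supplies the voice.
# Snippets and scoring are illustrative only.
KNOWLEDGE = [
    "Refunds are processed within 14 days of receiving the returned item.",
    "Standard shipping takes 3-5 business days within Australia.",
    "Warranty claims require the original proof of purchase.",
]

def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Rank snippets by word overlap with the query; return the best matches."""
    q_words = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_grounded_prompt(question: str) -> str:
    """Facts come from retrieval; tone comes from the fine-tuned model."""
    facts = "\n".join(retrieve(question, KNOWLEDGE))
    return f"Facts:\n{facts}\n\nCustomer question: {question}\n\nAnswer in brand voice:"
```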

    Tier 3: Complex Reasoning and Multi-Step Tasks

    Examples: Legal contract analysis, complex code generation, multi-document synthesis, strategic recommendations, nuanced creative writing.

    Characteristics: The task requires genuine reasoning, synthesizing information from multiple sources, or generating novel solutions to novel problems. The output space is large and cannot be easily learned from examples.

    Best model choice: Larger models (GPT-4o, Claude 3.5 Sonnet, Llama 3.3 70B, Qwen 2.5 72B) — or smaller models with strong chain-of-thought prompting and decomposition. These tasks genuinely benefit from larger parameter counts and more extensive pretraining.

    Cost mitigation strategy: Even for Tier 3 tasks, you can use small models for preprocessing (extraction, classification, routing) and reserve the large model call for the final step. Mixing tiers in a pipeline is often the most cost-effective production architecture.
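    That mixed-tier pipeline can be sketched as a simple router. The two model callables here are stubs standing in for a real local-model call and a real frontier-API call:

```python
# Mixed-tier pipeline sketch: the small model does the cheap routing step,
# and only requests it flags as complex reach the frontier model.
from typing import Callable

def run_pipeline(request: str,
                 small_model: Callable[[str], str],
                 large_model: Callable[[str], str]) -> str:
    # Step 1: cheap preprocessing — the small model classifies the request.
    tier = small_model(f"Classify as 'simple' or 'complex': {request}")
    # Step 2: only complex requests pay the frontier-model price.
    if tier.strip() == "complex":
        return large_model(request)
    return small_model(request)
```

    The economics follow directly: if 80% of traffic is routed to the local model, the frontier-model bill shrinks by roughly the same fraction.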

    Tier 4: General-Purpose Assistance

    Examples: Open-ended Q&A, research, general chat, tasks that vary wildly day to day.

    Characteristics: No fixed domain, highly variable input, no ability to define "correct" output.

    Best model choice: GPT-4o or Claude 3.5. These tasks genuinely need the breadth and reasoning of a frontier model. There is no fine-tuning shortcut because the task is intentionally general.

    The Cost Matrix

    Approximate costs per 1,000 completions, assuming 500 input tokens + 300 output tokens per request:

    Model                     | Cost per 1K requests | Notes
    --------------------------|----------------------|------------------------------------------------
    GPT-4o                    | AU$6-12              | Variable, depends on context length
    Claude 3.5 Sonnet         | AU$5-10              | Similar to GPT-4o
    GPT-4o-mini               | AU$0.60-1.20         | Good for Tier 3 at lower cost
    Self-hosted 7B (Ollama)   | ~AU$0 variable       | Hardware fixed cost, ~AU$0.001/request amortized
    Self-hosted 13B (Ollama)  | ~AU$0 variable       | Slightly slower, same economics
    Fine-tuned 7B (Ollama)    | ~AU$0 variable       | Best quality/cost for Tier 1-2 tasks

    The hardware cost for a local 7B model inference server (Mac Mini M4 or RTX 4070 workstation) is AU$800-1,200 amortized over 12 months. At moderate client volumes, the break-even against GPT-4o-mini is often under three months.
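    The break-even point is simple arithmetic. A minimal sketch, using illustrative figures drawn from the ranges above (AU$1,000 hardware, GPT-4o-mini at AU$0.90 per 1K requests); power and maintenance are ignored:

```python
def break_even_months(hardware_cost: float,
                      cloud_cost_per_1k: float,
                      monthly_requests: int) -> float:
    """Months until local hardware pays for itself versus a cloud API.

    Assumes near-zero marginal cost per local request once the
    hardware is purchased; ignores power and maintenance.
    """
    monthly_cloud_spend = cloud_cost_per_1k * monthly_requests / 1000
    return hardware_cost / monthly_cloud_spend

# Illustrative: AU$1,000 workstation vs GPT-4o-mini at AU$0.90 per 1K
# requests, with 400K requests/month — roughly 2.8 months to break even.
months = break_even_months(1000, 0.90, 400_000)
```

    Run the same calculation with the client's actual volumes before quoting it; at low volumes the cloud API stays cheaper for much longer.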

    The "Fine-Tuned 7B Outperforms GPT-4" Claim

    This claim is commonly made but frequently misunderstood. Let me be precise:

    A fine-tuned 7B model outperforms GPT-4 on narrow, domain-specific tasks when:

    1. The task is well-defined (Tier 1 or Tier 2)
    2. The training data is high quality and representative
    3. The evaluation metric aligns with the training objective
    4. The volume of examples is adequate (200+ for simple tasks, 1,000+ for complex ones)

    A fine-tuned 7B model does NOT outperform GPT-4 on:

    • Reasoning-heavy tasks that require broad world knowledge
    • Tasks with genuinely novel inputs not represented in training data
    • Tier 3-4 tasks generally

    The error agencies make is applying the fine-tuning claim too broadly. Fine-tuning is not magic — it is efficient when the task boundaries are clear.

    A Practical Client Assessment Process

    When scoping a new client engagement, ask these questions:

    1. What is the task? Be specific. "AI for marketing" is not a task. "Classify incoming support emails into 8 categories and extract the order number" is a task.

    2. How repetitive is it? What percentage of requests follow the same pattern with different specifics? 80%+ repetition = strong fine-tuning candidate.

    3. Is there existing example data? Do they have 500+ examples of the input-output behavior they want? If yes, fine-tuning is viable. If no, you are starting from scratch.

    4. What does "correct" look like? Can you define success metrics? If yes, you can evaluate fine-tuning rigorously. If no, you are in Tier 4 territory where general models are needed.

    5. What are the data sensitivity requirements? If the client cannot send data to OpenAI, local models are required regardless of task type.

    The answers determine whether you are looking at a local fine-tuned 7B, a local base 7B with prompting, or a cloud frontier model — and what the cost structure of the engagement looks like.
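    The decision logic above can be encoded directly. This is a hypothetical helper, not a product feature; the thresholds (80% repetition, 500 examples) come from the questions above, and the field names are assumptions:

```python
# Hypothetical encoding of the client assessment questions as a decision
# helper. Thresholds mirror the article's rules of thumb.
def recommend(repetition_pct: float,
              example_count: int,
              success_measurable: bool,
              data_sensitive: bool) -> str:
    if not success_measurable:
        # Tier 4 territory: no way to evaluate fine-tuning rigorously.
        return "local base model" if data_sensitive else "cloud frontier model"
    if repetition_pct >= 80 and example_count >= 500:
        return "local fine-tuned 7B"
    if data_sensitive:
        # Sensitivity forces local deployment regardless of task type.
        return "local base 7B with prompting"
    return "cloud frontier model"
```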

    Summary

    Client Task                      | Recommended Model    | Rationale
    ---------------------------------|----------------------|---------------------------------------
    Support ticket classification    | Fine-tuned 7B        | Repetitive, well-defined, high volume
    Brand-voice content generation   | Fine-tuned 7B/13B    | Style learnable from examples
    Complex legal analysis           | 70B or GPT-4o        | Requires broad reasoning
    Open-ended assistant             | GPT-4o               | General intelligence needed
    Code generation (specific stack) | Fine-tuned 7B coder  | Domain-consistent
    Data extraction from documents   | Fine-tuned 7B + RAG  | Structured output + factual retrieval

    Defaulting to the biggest model available is not good architecture — it is a failure to understand what the task actually requires. Solutions architects who do this analysis rigorously build better products and deliver better margins.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
