
7B vs GPT-4: Which Model Size Actually Fits Your Client's Task
Bigger isn't always better. A guide for AI solutions architects on matching model size to client task requirements — including when a fine-tuned 7B model will outperform GPT-4.
One of the most expensive mistakes in AI agency work is defaulting to the most capable model available. GPT-4o is impressive — but for the tasks clients actually need, it is often far more model than the job requires. And the cost difference between GPT-4o and a well-deployed 7B model is not 20% — it is often 95%.
This guide gives you a practical decision framework for model selection that you can use with clients.
Why Bigger Models Are Not Always Better
GPT-4 and similar frontier models are trained to be generalists. They have broad knowledge, strong reasoning capabilities, and can handle a wide variety of tasks. They are also:
- Expensive per token
- Slow (higher latency)
- Controlled by a third party (data goes to OpenAI/Anthropic)
- Non-customizable without expensive fine-tuning APIs
A 7B-class model — Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B — is much smaller. It is faster, cheaper, runs locally, and can be fine-tuned on consumer hardware in a few hours.
The critical insight is: task complexity and model size are not the same thing. A 7B model fine-tuned on a narrow domain can significantly outperform GPT-4 on that specific task. The general-purpose intelligence of GPT-4 is irrelevant — and sometimes counterproductive — for specialized use cases.
The Task Taxonomy
When evaluating a client's AI use case, classify the task against this taxonomy:
Tier 1: Narrow and Repetitive Tasks
Examples: Email routing classification, intent detection, entity extraction from structured text, yes/no filtering, template-based content generation with fixed format.
Characteristics: The task has a small, well-defined output space. "Correct" answers can be enumerated or validated automatically. The same type of request appears thousands of times in slightly different forms.
Best model choice: Fine-tuned 7B model. These tasks are exactly what LoRA fine-tuning excels at. A model trained on 500-2,000 examples of your client's specific task will match or beat GPT-4 accuracy at 1/50th the inference cost.
Example results: A fine-tuned Llama 3.1 8B model for legal document classification (trained on 1,200 examples) achieves 93% accuracy on held-out test cases. GPT-4o with an optimized prompt achieves 87%. The fine-tuned model wins on both accuracy and cost.
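To make the Tier 1 data requirement concrete, here is a minimal sketch of what a fine-tuning dataset can look like: prompt/completion pairs written to a JSONL file. The categories, field names, and file name are illustrative; match the schema to whatever your fine-tuning toolchain expects.

```python
import json

# Illustrative Tier 1 training pairs: email classification plus order-number
# extraction. Categories, field names, and file name are all hypothetical.
examples = [
    {
        "prompt": "Subject: Where is my package?\n"
                  "Body: Ordered two weeks ago, order #84712, still nothing.",
        "completion": '{"category": "shipping_delay", "order_number": "84712"}',
    },
    {
        "prompt": "Subject: Refund please\n"
                  "Body: Item arrived broken. Order 19334.",
        "completion": '{"category": "refund_request", "order_number": "19334"}',
    },
]

# One JSON object per line is the de facto format most tooling accepts.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```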
Tier 2: Domain-Specific Generation
Examples: Customer support responses in a specific brand voice, product descriptions following a template, meeting summaries in a prescribed format, code review comments following team conventions.
Characteristics: The output is longer and more variable than Tier 1, but the domain and style are well-defined. "Good" answers follow patterns that can be learned from examples.
Best model choice: Fine-tuned 7B or 13B model. The base intelligence requirement is modest — what matters is domain knowledge and style consistency, and fine-tuning provides both. A 7B model trained on 2,000 of the client's existing support responses will match the client's voice with a consistency that GPT-4 plus a prompt cannot replicate.
Edge case to monitor: If the domain requires dense factual recall (medical, legal, financial) and the training data does not cover all required knowledge, supplement with RAG. Fine-tuning handles style and behavior; RAG handles facts.
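Here is one way that split can look in code: a minimal sketch, assuming a local fine-tuned model served by Ollama. The `/api/generate` endpoint is Ollama's standard API, but the model name `acme-support-7b` and the `retrieve` function are placeholders for your own deployment and vector store.

```python
import requests

def answer(question: str, retrieve) -> str:
    # `retrieve` stands in for your vector store; assume it returns the
    # top-k text chunks relevant to the question. RAG supplies the facts.
    facts = "\n".join(retrieve(question, k=3))
    prompt = (
        "Answer using only the reference material below.\n\n"
        f"Reference material:\n{facts}\n\n"
        f"Question: {question}"
    )
    # The fine-tuned model supplies the style and behavior.
    # "acme-support-7b" is a hypothetical model registered with Ollama.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "acme-support-7b", "prompt": prompt, "stream": False},
        timeout=120,
    )
    return resp.json()["response"]
```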
Tier 3: Complex Reasoning and Multi-Step Tasks
Examples: Legal contract analysis, complex code generation, multi-document synthesis, strategic recommendations, nuanced creative writing.
Characteristics: The task requires genuine reasoning, synthesizing information from multiple sources, or generating novel solutions to novel problems. The output space is large and cannot be easily learned from examples.
Best model choice: Larger models (GPT-4o, Claude 3.5 Sonnet, Llama 3.3 70B, Qwen 2.5 72B) — or smaller models with strong chain-of-thought prompting and decomposition. These tasks genuinely benefit from larger parameter counts and more extensive pretraining.
Cost mitigation strategy: Even for Tier 3 tasks, you can use small models for preprocessing (extraction, classification, routing) and reserve the large model call for the final step. Mixing tiers in a pipeline is often the most cost-effective production architecture.
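A minimal sketch of that tiered routing, assuming Ollama serving a small local model on its default port and the official `openai` Python client for the frontier call. The routing prompt and the simple/complex labels are illustrative; a production router would more likely be a fine-tuned classifier.

```python
import requests
from openai import OpenAI  # official client; assumes OPENAI_API_KEY is set

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port
client = OpenAI()

def route(ticket: str) -> str:
    """Cheap first pass: a local model labels the ticket simple or complex."""
    resp = requests.post(OLLAMA_URL, json={
        "model": "llama3.1:8b",  # any small model you have pulled locally
        "prompt": "Label this ticket 'simple' or 'complex'. "
                  f"Reply with one word.\n\n{ticket}",
        "stream": False,
    }, timeout=60)
    return "complex" if "complex" in resp.json()["response"].lower() else "simple"

def handle(ticket: str) -> str:
    if route(ticket) == "simple":
        # Stay local: cheap, fast, and the data never leaves the machine.
        resp = requests.post(OLLAMA_URL, json={
            "model": "llama3.1:8b", "prompt": ticket, "stream": False,
        }, timeout=120)
        return resp.json()["response"]
    # Reserve the expensive frontier call for the genuinely hard cases.
    chat = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": ticket}],
    )
    return chat.choices[0].message.content
```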
Tier 4: General-Purpose Assistance
Examples: Open-ended Q&A, research, general chat, tasks that vary wildly day to day.
Characteristics: No fixed domain, highly variable input, no ability to define "correct" output.
Best model choice: GPT-4o or Claude 3.5. These tasks genuinely need the breadth and reasoning of a frontier model. There is no fine-tuning shortcut because the task is intentionally general.
The Cost Matrix
Approximate costs per 1,000 completions, assuming 500 input tokens + 300 output tokens per request:
| Model | Cost per 1K requests | Notes |
|---|---|---|
| GPT-4o | AU$6-12 | Variable, depends on context length |
| Claude 3.5 Sonnet | AU$5-10 | Similar to GPT-4o |
| GPT-4o-mini | AU$0.60-1.20 | Cheaper option for lighter Tier 3 work |
| Self-hosted 7B (Ollama) | ~AU$1 | Hardware fixed cost, ~AU$0.001/request amortized |
| Self-hosted 13B (Ollama) | ~AU$1 | Slightly slower, same economics |
| Fine-tuned 7B (Ollama) | ~AU$1 | Best quality/cost for Tier 1-2 tasks |
The hardware cost for a local 7B model inference server (Mac Mini M4 or RTX 4070 workstation) is AU$800-1,200 amortized over 12 months. At moderate client volumes, the break-even against GPT-4o-mini is often under three months.
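The break-even arithmetic is simple enough to sanity-check yourself. The figures below are assumptions drawn from the table and paragraph above, not measurements:

```python
# Back-of-envelope break-even against GPT-4o-mini. All figures are
# assumptions: adjust them to the client's actual volume and pricing.
hardware_cost = 1000           # AU$, midpoint of the AU$800-1,200 range
cloud_cost_per_1k = 0.90       # AU$ per 1K requests, GPT-4o-mini midpoint
requests_per_month = 400_000   # assumed "moderate" client volume

monthly_saving = (requests_per_month / 1000) * cloud_cost_per_1k
print(f"Break-even after {hardware_cost / monthly_saving:.1f} months")
# -> Break-even after 2.8 months (entirely driven by the volume assumption)
```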
The "Fine-Tuned 7B Outperforms GPT-4" Claim
This claim is commonly made but frequently misunderstood. Let me be precise:
A fine-tuned 7B model outperforms GPT-4 on narrow, domain-specific tasks when:
- The task is well-defined (Tier 1 or Tier 2)
- The training data is high quality and representative
- The evaluation metric aligns with the training objective
- The volume of examples is adequate (200+ for simple tasks, 1,000+ for complex ones)
A fine-tuned 7B model does NOT outperform GPT-4 on:
- Reasoning-heavy tasks that require broad world knowledge
- Tasks with genuinely novel inputs not represented in training data
- Tier 3-4 tasks generally
The error agencies make is applying the fine-tuning claim too broadly. Fine-tuning is not magic — it is efficient when the task boundaries are clear.
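The practical consequence: never accept the claim without a held-out evaluation. A minimal harness might look like the sketch below, assuming a JSONL test set in the prompt/completion format shown earlier; `predict` is any callable that maps a prompt to a completion, so the same function scores both the fine-tuned 7B and GPT-4o.

```python
import json

def accuracy(predict, test_path: str) -> float:
    """Exact-match accuracy on a held-out JSONL test set.

    `predict` is any callable mapping a prompt string to a completion
    string, so the same harness scores a local fine-tune and GPT-4o.
    """
    total = correct = 0
    with open(test_path) as f:
        for line in f:
            ex = json.loads(line)
            total += 1
            if predict(ex["prompt"]).strip() == ex["completion"].strip():
                correct += 1
    return correct / total

# Usage: run both models over the same file and compare.
# print(accuracy(finetuned_7b_predict, "test.jsonl"))
# print(accuracy(gpt4o_predict, "test.jsonl"))
```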
A Practical Client Assessment Process
When scoping a new client engagement, ask these questions:
- What is the task? Be specific. "AI for marketing" is not a task. "Classify incoming support emails into 8 categories and extract the order number" is a task.
- How repetitive is it? What percentage of requests follow the same pattern with different specifics? 80%+ repetition = strong fine-tuning candidate.
- Is there existing example data? Do they have 500+ examples of the input-output behavior they want? If yes, fine-tuning is viable. If no, you are starting from scratch.
- What does "correct" look like? Can you define success metrics? If yes, you can evaluate fine-tuning rigorously. If no, you are in Tier 4 territory where general models are needed.
- What are the data sensitivity requirements? If the client cannot send data to OpenAI, local models are required regardless of task type.
The answers determine whether you are looking at a local fine-tuned 7B, a local base 7B with prompting, or a cloud frontier model — and what the cost structure of the engagement looks like.
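If it helps to make the framework explicit, the routing logic can be written down directly. This is a rough sketch: the thresholds come from the questions above and should be treated as starting points, not hard rules.

```python
def recommend(task_defined: bool, repetition_pct: int, example_count: int,
              success_measurable: bool, data_must_stay_local: bool) -> str:
    # Thresholds (80% repetition, 500 examples) come from the assessment
    # questions above; treat them as starting points, not hard rules.
    if not task_defined or not success_measurable:
        return ("local base model + prompting" if data_must_stay_local
                else "cloud frontier model (Tier 3-4)")
    if repetition_pct >= 80 and example_count >= 500:
        return "fine-tuned local 7B (Tier 1-2)"
    if data_must_stay_local:
        return "local base 7B + prompting; collect examples, fine-tune later"
    return "cloud frontier model now; revisit once example data exists"

print(recommend(True, 90, 1200, True, False))
# -> fine-tuned local 7B (Tier 1-2)
```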
Summary
| Client Task | Recommended Model | Rationale |
|---|---|---|
| Support ticket classification | Fine-tuned 7B | Repetitive, well-defined, high volume |
| Brand-voice content generation | Fine-tuned 7B/13B | Style learnable from examples |
| Complex legal analysis | 70B or GPT-4o | Requires broad reasoning |
| Open-ended assistant | GPT-4o | General intelligence needed |
| Code generation (specific stack) | Fine-tuned 7B coder | Stack conventions learnable from examples |
| Data extraction from documents | Fine-tuned 7B + RAG | Structured output + factual retrieval |
Defaulting to the biggest model available is not good architecture — it is a failure to understand what the task actually requires. Solutions architects who do this analysis rigorously build better products and deliver better margins.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Fine-Tuned Chatbot vs RAG Chatbot: What to Actually Build for a Client — Choosing between the two primary customization techniques
- LoRA Adapters for AI Agency Owners (No ML Degree Required) — How fine-tuning works in practice
- Agency AI Cost Reduction — Full cost breakdown of local vs cloud inference