
Running Fine-Tuned Models on Enterprise Hardware: CPU vs GPU vs NPU Guide
A technical guide comparing CPUs, GPUs, and NPUs for running fine-tuned small language models in enterprise environments. Includes performance benchmarks, cost analysis, and a decision framework for infrastructure teams.
You've fine-tuned your small language model. It performs well on your benchmarks. Now comes the infrastructure question: what hardware should you run it on?
This isn't as straightforward as "just buy GPUs." The right answer depends on your deployment scale, model size, latency requirements, and existing infrastructure. A 3B-parameter model serving a single team has very different hardware needs than a 14B model serving an entire organization.
This guide compares three accelerator types — CPUs, GPUs, and NPUs — with real performance numbers, cost analysis, and a decision framework for infrastructure teams.
The Three Accelerator Types
CPU: The Universal Baseline
Every server in your data center has CPUs. Every workstation, every laptop, every VM. CPUs are the most available compute resource in any enterprise, and modern CPUs with AVX-512 or AMX (Advanced Matrix Extensions) instructions can run quantized SLMs at usable speeds.
Strengths:
- Zero additional hardware procurement — you already own them
- No driver issues, no CUDA compatibility problems
- Scales horizontally across existing server fleet
- Well-understood by every operations team
Limitations:
- Significantly slower than GPUs for matrix operations
- Practically limited to models under 3B parameters for interactive use
- Higher power-per-token than purpose-built accelerators
Best for: Small models (sub-3B), low-volume deployments, prototyping, and situations where you want to avoid GPU procurement entirely.
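In practice, CPU-only inference usually means llama.cpp. A minimal invocation sketch (the model filename and prompt are placeholders; llama.cpp detects AVX-512/AMX support at build time):

```shell
# Run a Q4-quantized model on CPU via llama.cpp; use one thread per
# physical core (-t) and cap generation length (-n)
./llama-cli -m phi-3-mini-q4_k_m.gguf -t 16 -n 256 \
  -p "Extract the invoice date from: ..."
```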
Shunya Labs and similar vendors have demonstrated CPU-first architectures claiming 20x cost reduction compared to GPU-based deployment for appropriate workloads. The key qualifier is "appropriate" — this works for small models at moderate volume, not for running a 14B model at high throughput.
GPU: The Performance Standard
NVIDIA GPUs remain the default choice for AI inference, and for good reason. The combination of high memory bandwidth, massive parallelism, and mature software ecosystem (CUDA, cuDNN, TensorRT) means GPUs deliver the best raw performance for language model inference.
The relevant GPU tiers for enterprise SLM deployment:
| GPU | VRAM | FP16 TFLOPS | Price (approx.) | Target Use |
|---|---|---|---|---|
| RTX 4060 Ti | 16GB | 22 | $400–$500 | Single-user, small models |
| RTX 4090 | 24GB | 83 | $1,600–$2,000 | Small team, up to 14B models |
| L40S | 48GB | 91 | $7,000–$9,000 | Department, multi-model serving |
| A100 | 40/80GB | 78 | $8,000–$15,000 | High-throughput production |
| H100 | 80GB | 267 | $25,000–$35,000 | Organization-wide, maximum throughput |
AMD's MI300X (192GB HBM3) is emerging as a cost-effective alternative to NVIDIA's H100, particularly for inference workloads where AMD's ROCm software stack has matured enough to be production-viable. Pricing sits between the A100 and H100 tiers with competitive throughput.
Strengths:
- Highest absolute throughput for models of any size
- Mature software ecosystem with extensive optimization tools
- Scales from single-user (RTX 4060 Ti) to enterprise (H100 cluster)
- Supports both inference and fine-tuning on the same hardware
Limitations:
- Procurement cost, especially for datacenter GPUs
- Power consumption (300–700W per card for datacenter GPUs)
- GPU driver and CUDA version management across a fleet
- Supply constraints for high-end cards (though improving in 2026)
Best for: Any deployment where throughput or model size exceeds what CPUs or NPUs can handle. This is the default choice for 7B+ models at any meaningful volume.
NPU: The Efficiency Play
Neural Processing Units are purpose-built inference accelerators integrated into modern processors. Unlike GPUs (which are general-purpose parallel processors adapted for AI), NPUs are designed specifically for the matrix operations and memory access patterns of neural network inference.
Current NPU implementations:
| NPU | Found In | TOPS (INT8) | Power | Status |
|---|---|---|---|---|
| Intel NPU (Meteor Lake) | Intel Core Ultra laptops/workstations | 10–11 | 5–15W | Available |
| Intel NPU (Arrow Lake) | Intel Core Ultra 200 series | 13 | 5–15W | Available |
| Qualcomm Hexagon (Snapdragon X) | Snapdragon X Elite/Plus laptops | 45 | 15–25W | Available |
| Apple Neural Engine (M4) | M4/M4 Pro/M4 Max MacBooks | 38 | 10–20W | Available |
| AMD XDNA 2 (Ryzen AI) | AMD Ryzen AI 300 series | 50 | 15–25W | Available |
Strengths:
- Dramatically lower power consumption than GPUs
- Built into hardware your enterprise may already be purchasing (new laptops and workstations)
- No separate procurement — it's on the chip
- Silent operation (no GPU fan noise in office environments)
- Good enough for single-user interactive inference with quantized SLMs
Limitations:
- Lower absolute throughput than discrete GPUs
- Software ecosystem is still maturing (framework support varies)
- Limited to smaller models (practical ceiling around 7B quantized)
- Performance varies significantly between vendors
- Multi-user serving isn't practical — NPUs are designed for single-user workloads
Best for: Individual workstation deployment, edge inference, scenarios where models run on employee laptops/desktops without requiring server infrastructure.
Microsoft's Foundry Local initiative provides useful signal here: it's designed to run models locally on Windows PCs, targeting exactly the NPU and integrated GPU hardware in modern devices. When a major platform vendor optimizes for specific hardware, that's a reliable indicator of where the ecosystem is heading.
Performance Benchmarks
Here's where the abstract comparison turns concrete. The following benchmarks show tokens per second for a quantized 7B model (Q4_K_M quantization, a good balance of quality and speed) across different hardware.
Tokens Per Second — Quantized 7B Model (Q4_K_M)
| Hardware | Tokens/Second | Notes |
|---|---|---|
| CPU: 32-core Xeon W (server) | 8–15 tok/s | Using llama.cpp with AVX-512 |
| CPU: Intel Core Ultra 7 (laptop) | 5–10 tok/s | Using llama.cpp |
| CPU: AMD Ryzen 9 7950X (desktop) | 10–18 tok/s | 16 cores, fast memory helps |
| GPU: RTX 4060 Ti (16GB) | 60–80 tok/s | Entry-level discrete GPU |
| GPU: RTX 4090 (24GB) | 80–120 tok/s | Best consumer GPU |
| GPU: A100 (40GB) | 100–150 tok/s | Datacenter standard |
| GPU: H100 (80GB) | 150–200 tok/s | Peak single-GPU performance |
| NPU: Qualcomm Snapdragon X Elite | 20–40 tok/s | Hexagon NPU, framework-dependent |
| NPU: Apple M4 Max (Neural Engine) | 40–60 tok/s | Unified memory architecture helps |
| NPU: Intel Core Ultra (Meteor Lake NPU) | 8–15 tok/s | Early NPU generation, improving |
What These Numbers Mean in Practice
For interactive use (chatbot, document analysis where a human is waiting):
- Comfortable: 30+ tokens/second. The user sees a fast, fluid response.
- Acceptable: 15–30 tokens/second. Noticeable generation speed but still usable.
- Frustrating: Under 15 tokens/second. The user is watching text appear word by word.
For batch processing (document classification, nightly extraction jobs):
- Throughput matters more than per-query speed
- A CPU doing 10 tok/s can still process thousands of documents overnight
- Parallelism across multiple CPU cores or multiple GPU instances scales linearly
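Both thresholds reduce to simple arithmetic. A quick sketch (the 200-token response length and 300-tokens-per-document figure are illustrative assumptions, not benchmarks):

```python
def generation_time_s(tokens: int, tok_per_s: float) -> float:
    """Wall-clock time for a user to receive a full response."""
    return tokens / tok_per_s

def overnight_docs(tok_per_s: float, hours: float, tokens_per_doc: int) -> int:
    """Documents a single worker can process in a batch window."""
    return int(tok_per_s * hours * 3600 / tokens_per_doc)

# A 200-token answer at 30 tok/s feels fluid; at 10 tok/s the user waits.
print(round(generation_time_s(200, 30), 1))  # 6.7 seconds
print(generation_time_s(200, 10))            # 20.0 seconds
# A CPU at 10 tok/s over a 10-hour window, ~300 tokens per document:
print(overnight_docs(10, 10, 300))           # 1200 documents per night
```

Multiply by the number of parallel workers (cores or GPU instances) to see why even slow hardware clears large batch queues overnight.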
Smaller Models Change the Equation
The benchmarks above are for a 7B model. Smaller models run proportionally faster:
| Hardware | 7B (Q4) tok/s | 3.8B (Q4) tok/s | 1.5B (Q4) tok/s |
|---|---|---|---|
| CPU: 32-core Xeon | 8–15 | 15–30 | 30–60 |
| GPU: RTX 4090 | 80–120 | 140–200 | 250–400 |
| NPU: Snapdragon X Elite | 20–40 | 40–70 | 60–100 |
| Apple M4 Max | 40–60 | 70–100 | 100–160 |
A 3.8B model (like Phi-3 mini) on a modern laptop CPU delivers 15–30 tokens/second — comfortable for interactive use. On an NPU or Apple Silicon, it's 40–100 tokens/second, which is fast enough that the user barely notices generation latency.
Cost Per Token
Raw speed doesn't tell the full story. What matters for budget planning is cost efficiency: how much does each token cost when you amortize hardware over its useful life?
Cost Per Million Tokens (Amortized Over 3 Years)
Assumptions: hardware runs at 70% utilization for 12 hours/day, power cost $0.12/kWh.
| Hardware | Hardware Cost | Monthly Amortized | Power/Month | Tokens/Month (est.) | Cost per 1M Tokens |
|---|---|---|---|---|---|
| CPU: 32-core Xeon server | $5,000 | $139 | $40 | 130M | $1.38 |
| GPU: RTX 4090 + server | $6,000 | $167 | $55 | 1.3B | $0.17 |
| GPU: L40S + server | $13,000 | $361 | $70 | 1.9B | $0.23 |
| GPU: A100 + server | $18,000 | $500 | $80 | 2.4B | $0.24 |
| GPU: H100 + server | $38,000 | $1,056 | $120 | 3.2B | $0.37 |
| NPU: Laptop (Snapdragon X) | $1,500 | $42 | $8 | 52M | $0.96 |
| NPU: MacBook Pro M4 Max | $3,500 | $97 | $10 | 96M | $1.11 |
Some patterns emerge:
The RTX 4090 is the cost-efficiency champion. At $0.17 per million tokens, it delivers the lowest cost per token of any option. This is a $1,600 consumer GPU in a $4,400 server — total system cost around $6,000. For small-to-medium deployments, this is hard to beat.
Datacenter GPUs (A100, H100) trade cost efficiency for throughput and reliability. The H100 costs 2x per token compared to the RTX 4090, but it delivers higher absolute throughput, supports larger batch sizes, has ECC memory, and is designed for 24/7 datacenter operation. For mission-critical production workloads, the premium is justified.
CPUs are the most expensive per token but have zero incremental hardware cost if you're using existing servers. If your servers have idle CPU capacity during off-hours, the marginal cost of running inference is essentially just power — $40/month.
NPUs are mid-range on cost but their real value is deployment simplicity. No server infrastructure, no GPU procurement, no dedicated cooling. The model runs on the same laptop the employee already uses.
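The cost figures in the table reduce to a short amortization formula (36-month straight-line depreciation, using the table's own utilization and power assumptions):

```python
def cost_per_million_tokens(hw_cost: float, power_per_month: float,
                            tokens_per_month_m: float, months: int = 36) -> float:
    """Amortized hardware plus power cost per 1M generated tokens.

    tokens_per_month_m is in millions of tokens.
    """
    monthly_hw = hw_cost / months
    return (monthly_hw + power_per_month) / tokens_per_month_m

# Reproducing two rows from the table above:
print(round(cost_per_million_tokens(5_000, 40, 130), 2))    # Xeon server: 1.38
print(round(cost_per_million_tokens(6_000, 55, 1_300), 2))  # RTX 4090: 0.17
```

Plug in your own electricity rate and expected volume; the ranking between tiers is fairly insensitive to the exact utilization assumption.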
The Quantization Factor
Quantization is the technique of reducing model weights from their original precision (usually FP16 or BF16, 16 bits per weight) to lower precision (8, 5, or 4 bits). This directly affects model size, inference speed, and output quality.
Quantization Levels Compared (7B Model)
| Quantization | Bits/Weight | Model Size | Speed Impact | Quality Impact |
|---|---|---|---|---|
| FP16 (no quant) | 16 | ~14GB | Baseline | Baseline (best) |
| Q8_0 | 8 | ~7.5GB | ~1.5x faster | Negligible quality loss |
| Q5_K_M | 5 | ~5.3GB | ~2x faster | Very minor quality loss |
| Q4_K_M | 4 | ~4.4GB | ~2.5x faster | Minor quality loss, acceptable for most tasks |
| Q4_0 | 4 | ~4.0GB | ~2.8x faster | Noticeable quality loss on nuanced tasks |
| Q3_K_M | 3 | ~3.3GB | ~3x faster | Significant quality loss |
| Q2_K | 2 | ~2.7GB | ~3.5x faster | Substantial quality loss, not recommended |
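The sizes in the table follow directly from bits per weight. A rough estimator (the 4.8 effective bits/weight for Q4_K_M's mixed-precision layers and the 1.5GB KV-cache allowance are approximations; real GGUF files vary, and KV cache grows with context length):

```python
def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of the quantized weights alone."""
    return params_b * bits_per_weight / 8

def fits_in_vram(params_b: float, bits_per_weight: float,
                 vram_gb: float, kv_cache_gb: float = 1.5) -> bool:
    """Weights plus a KV-cache allowance must fit in available VRAM."""
    return model_size_gb(params_b, bits_per_weight) + kv_cache_gb <= vram_gb

print(round(model_size_gb(7, 4.8), 1))   # ~4.2 GB of weights for a 7B at Q4
print(fits_in_vram(7, 4.8, vram_gb=8))   # True: fits an 8GB card
print(fits_in_vram(14, 16, vram_gb=24))  # False: FP16 14B needs ~28GB
```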
The Enterprise Sweet Spot: Q4_K_M
For most enterprise workloads, Q4_K_M provides the optimal trade-off:
- Size reduction: 3.2x smaller than FP16, fitting in 4–5GB VRAM for a 7B model
- Speed improvement: 2–2.5x faster inference than FP16
- Quality retention: Minimal degradation on structured tasks (classification, extraction). Accuracy drops typically less than 1% compared to FP16 on narrow enterprise tasks.
When should you use higher precision?
- Q5_K_M: If your task involves nuanced text generation or your fine-tuning showed sensitivity to quantization. Costs ~20% more VRAM for a marginal quality improvement.
- Q8_0: For evaluation and benchmarking to establish a quality ceiling, or for tasks where every fraction of a percent of accuracy matters (critical medical or legal decisions).
- FP16: Almost never for production inference. The performance penalty doesn't justify the marginal quality gain in production workloads.
When can you go lower?
- Q3_K_M or Q2_K: Only when hardware constraints absolutely require it (e.g., running on a device with 2GB available memory). The quality trade-off is real and measurable. Test thoroughly before deploying.
Decision Framework
Here's how to match your deployment scenario to the right hardware.
Single-User Workstation
Scenario: One employee using a fine-tuned model for their daily work — document analysis, email classification, code review.
Recommendation:
- If they have a modern laptop (2024+): Use the NPU or integrated GPU. Deploy a Q4-quantized 3.8B model (Phi-3 mini) via Ollama. No additional hardware needed.
- If they have a desktop with a GPU: Any discrete GPU with 8GB+ VRAM runs a Q4 7B model comfortably. Even an RTX 3060 (12GB) works fine.
- If no GPU and older CPU: Stick with a 1.5B or 3B model at Q4 quantization, or consider a Snapdragon X or M4 Mac refresh.
Expected performance: 15–60 tokens/second depending on model size and hardware. Sufficient for interactive use.
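For the laptop case, a single-workstation setup via Ollama is a few commands (the model tag is one example; Ollama automatically selects the best available backend, including Apple Silicon and supported GPUs):

```shell
# Pull a Q4-quantized Phi-3 mini (~2.2GB) and chat interactively
ollama pull phi3:mini
ollama run phi3:mini "Classify this email as urgent or routine: ..."

# Or call the local REST API from your own tooling
curl http://localhost:11434/api/generate \
  -d '{"model": "phi3:mini", "prompt": "Summarize: ...", "stream": false}'
```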
Small Team (5–20 Users)
Scenario: A team sharing a fine-tuned model for a common workload — legal contract review, customer support triage, compliance checking.
Recommendation:
- Budget option: Single RTX 4090 in a team server. $6,000 total. Handles 5–15 concurrent users on a Q4 7B model with acceptable latency.
- Production option: Single L40S in a rackmount server. $13,000 total. Handles 15–30 concurrent users with headroom for burst traffic.
Expected performance: 30–80 tokens/second per user (depending on concurrency), with sub-100ms time to first token for short queries.
Department (50–200 Users)
Scenario: A department-wide deployment — all customer support agents, all analysts, all legal staff.
Recommendation:
- 2–4 RTX 4090s in a multi-GPU server, or 1–2 L40S cards. Run vLLM for efficient batch scheduling and continuous batching.
- Total cost: $15,000–$30,000 for the server.
- At 200 concurrent users, expect 15–30 tokens/second per user with proper batching.
Expected performance: Comparable to cloud API latency (100–300ms per short query) with the cost advantage of local hardware.
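At this scale, serving is typically a single vLLM launch command. A sketch assuming two GPUs and a locally stored fine-tuned model (the model path is a placeholder; the flags are standard vLLM options):

```shell
# Serve a fine-tuned 7B model across 2 GPUs with continuous batching
vllm serve ./my-finetuned-7b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 256 \
  --port 8000
# Exposes an OpenAI-compatible API at http://localhost:8000/v1
```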
Organization-Wide (500+ Users)
Scenario: Company-wide deployment of one or more fine-tuned models, possibly serving multiple applications.
Recommendation:
- GPU cluster: 4–8 datacenter GPUs (A100 or H100) in a dedicated server or small rack.
- Use vLLM or TGI with load balancing across GPU instances.
- Consider redundancy: N+1 GPU configuration for failover.
- Total cost: $80,000–$200,000 for infrastructure, which pays for itself within 3–6 months against equivalent cloud API costs at this volume.
Expected performance: Cloud-competitive latency and throughput, with full data sovereignty and no per-token marginal cost.
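The four scenarios reduce to a lookup keyed on concurrent users. A sketch of the mapping (the breakpoints and costs are the figures quoted above, not hard rules; validate against your own latency targets):

```python
def recommend_hardware(concurrent_users: int) -> dict:
    """Map deployment scale to the hardware tier suggested in this guide."""
    tiers = [
        (1,   {"tier": "workstation",  "hardware": "NPU / integrated GPU",
               "est_cost_usd": 0}),
        (20,  {"tier": "team",         "hardware": "1x RTX 4090 server",
               "est_cost_usd": 6_000}),
        (200, {"tier": "department",   "hardware": "2-4x RTX 4090 or 1-2x L40S",
               "est_cost_usd": 30_000}),
    ]
    for max_users, rec in tiers:
        if concurrent_users <= max_users:
            return rec
    return {"tier": "organization", "hardware": "4-8x A100/H100 cluster",
            "est_cost_usd": 200_000}

print(recommend_hardware(12)["tier"])   # team
print(recommend_hardware(500)["tier"])  # organization
```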
Power and Cooling Considerations
Infrastructure teams often overlook power and cooling when planning GPU deployments. Here's what to budget (annual power cost assumes continuous operation at TDP, a worst-case bound):
| Hardware | Power Draw | Annual Power Cost (@$0.12/kWh) | Cooling Overhead |
|---|---|---|---|
| RTX 4090 | 450W TDP | ~$473 | Standard office HVAC |
| L40S | 350W TDP | ~$368 | Rackmount cooling |
| A100 | 300W TDP | ~$315 | Datacenter cooling |
| H100 | 700W TDP | ~$735 | Datacenter cooling required |
| NPU (laptop) | 15–25W | ~$26 | None (passive) |
For 1–4 GPUs, existing office infrastructure usually handles the power and cooling load. Beyond that, you're looking at dedicated rack space with appropriate power distribution and cooling capacity.
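Each annual power figure is just draw times hours times rate. A quick check, with the card pinned at TDP around the clock:

```python
def annual_power_cost(watts: float, usd_per_kwh: float = 0.12,
                      hours_per_year: float = 24 * 365) -> float:
    """Electricity cost of a constant power draw over one year."""
    return watts / 1000 * hours_per_year * usd_per_kwh

print(round(annual_power_cost(450)))  # RTX 4090 at TDP: 473
print(round(annual_power_cost(700)))  # H100 at TDP: 736
```

Real average draw is usually well below TDP, so treat these as an upper bound when budgeting.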
The Bottom Line
There's no single "best" hardware for running fine-tuned models. The right choice maps directly to your deployment scale:
- Individual use: NPU or CPU on the device they already have. Cost: $0 incremental.
- Team use: Single RTX 4090 in a shared server. Cost: ~$6,000.
- Department use: Multi-GPU server with 2–4 GPUs. Cost: $15,000–$30,000.
- Organization-wide: Datacenter GPU cluster. Cost: $80,000–$200,000.
In every case, the total cost of ownership is a fraction of equivalent cloud API spend at the same query volume. The hardware decision isn't about whether to deploy on-premise — the economics already favor it for high-volume workloads. It's about right-sizing the hardware to your actual scale and growth trajectory.
Start with the smallest configuration that meets your current needs. A single RTX 4090 server is a $6,000 experiment that can serve a team of 15 people. If the results justify scaling, add capacity incrementally. GPU servers don't require long-term commitments or multi-year contracts — they're capital equipment that you own and can repurpose.
The silicon is ready. The models are ready. The decision is a straightforward infrastructure planning exercise, not a technology bet.