
Enterprise AI Capacity Planning: How to Size Your On-Premise Infrastructure
A step-by-step technical guide for sizing on-premise AI infrastructure. Covers compute, storage, network, and power requirements with a sizing worksheet and common planning mistakes to avoid.
The most expensive mistake in on-premise AI is buying the wrong hardware. Oversizing means hundreds of thousands of dollars of idle compute. Undersizing means performance bottlenecks that undermine the business case for on-premise in the first place. And unlike cloud, you can't resize a GPU cluster with a configuration change — you're ordering hardware with 8-16 week lead times.
This guide walks through a structured capacity planning process: inventory your workloads, calculate compute requirements, factor storage and networking, and plan for growth. The goal is a specific, defensible hardware recommendation — not a vague "buy some GPUs."
Step 1: Inventory Your AI Workloads
Before selecting any hardware, build a complete inventory of every AI workload that will run on your on-premise infrastructure. This includes workloads running today (even if they're in the cloud) and workloads planned within the next 18 months.
For each workload, document:
| Field | Example Value | Why It Matters |
|---|---|---|
| Workload Name | Customer Support Chatbot | Identification |
| Type | Inference | Determines GPU utilization pattern |
| Model | 14B open-weight model (Q4 quantized) | Determines VRAM and compute needs |
| Requests/Day | 50,000 queries | Determines throughput requirements |
| Peak QPS | 15 queries/second | Determines concurrent GPU instances |
| Avg. Input Tokens | 800 tokens | Affects latency and throughput |
| Avg. Output Tokens | 400 tokens | Affects latency and throughput |
| Latency Requirement | <3 seconds to first token | Determines GPU class needed |
| Data Sensitivity | High (contains customer PII) | Confirms on-prem requirement |
| Availability Requirement | 99.9% (8.7 hours downtime/year) | Determines redundancy needs |
| Growth Projection | 2x in 12 months | Determines headroom |
Build this inventory as a spreadsheet. It becomes the foundation for every sizing decision that follows.
Common gap: Organizations inventory their primary workload but forget supporting workloads. A RAG-based chatbot doesn't just need inference compute — it also needs:
- Embedding generation for document ingestion (runs on GPU)
- Reranking model for retrieval (runs on GPU)
- Vector database (runs on CPU, needs fast storage)
- Document processing pipeline (mixed CPU/GPU)
Each of these consumes resources that must be planned for.
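To make the inventory concrete, here is one way to capture a row as structured data. This is a hypothetical schema that simply mirrors the table above, not a standard; field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """One row of the AI workload inventory (fields mirror the table above)."""
    name: str
    workload_type: str        # "inference", "training", "embedding", ...
    model: str
    requests_per_day: int
    peak_qps: float
    avg_input_tokens: int
    avg_output_tokens: int
    ttft_slo_seconds: float   # latency requirement: seconds to first token
    availability: float       # e.g. 0.999
    growth_12mo: float        # projected multiplier over 12 months

chatbot = Workload(
    name="Customer Support Chatbot",
    workload_type="inference",
    model="14B open-weight model (Q4 quantized)",
    requests_per_day=50_000,
    peak_qps=15.0,
    avg_input_tokens=800,
    avg_output_tokens=400,
    ttft_slo_seconds=3.0,
    availability=0.999,
    growth_12mo=2.0,
)
```

A structured inventory like this also makes the later sizing steps scriptable instead of manual spreadsheet math.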
Step 2: Calculate Compute Requirements
VRAM Sizing
VRAM (GPU memory) is usually the binding constraint. A model must fit in VRAM to run — there's no graceful degradation, just failure to load.
Model VRAM requirements by size and quantization:
| Model Size | FP16 (no quantization) | INT8 | INT4 (GPTQ/AWQ) |
|---|---|---|---|
| 7B parameters | ~14 GB | ~7 GB | ~4 GB |
| 14B parameters | ~28 GB | ~14 GB | ~8 GB |
| 32B parameters | ~64 GB | ~32 GB | ~18 GB |
| 70B parameters | ~140 GB | ~70 GB | ~35 GB |
These numbers represent model weights only. At inference time, you also need VRAM for:
- KV cache: Scales with context length and batch size. For a 14B model with 8K context serving 8 concurrent requests, add ~4-8GB.
- Activation memory: Typically 1-3GB depending on batch size.
- Framework overhead: PyTorch, vLLM, or TensorRT-LLM each add 1-2GB of baseline memory.
Rule of thumb: Reserve 30-40% VRAM headroom beyond the model weight size. A 14B INT4 model that needs 8GB of weight storage should be planned for 11-12GB total VRAM usage.
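The components above can be summed in a few lines. This is a rough sketch; the defaults are the rule-of-thumb figures from this section, not measured values:

```python
def plan_vram_gb(params_b: float, bytes_per_param: float,
                 kv_cache_gb: float = 4.0,
                 activations_gb: float = 2.0,
                 framework_gb: float = 1.5) -> float:
    """Itemized VRAM estimate: weights + KV cache + activations + framework."""
    return params_b * bytes_per_param + kv_cache_gb + activations_gb + framework_gb

# bytes_per_param: ~2.0 for FP16, ~1.0 for INT8, ~0.55-0.6 for INT4 (incl. scales)
# 14B INT4 with a modest KV cache: ~15.5GB itemized, vs. ~11-12GB from the
# weights-plus-headroom shortcut; plan against whichever estimate is larger.
print(f"{plan_vram_gb(14, 0.57):.1f} GB")
```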
Throughput Sizing
Calculate how many GPU instances you need to serve your target queries per second (QPS):
1. Measure single-instance throughput. For a 14B INT4 model on an L40S, expect approximately 70-110 tokens/second per GPU. With an average output of 400 tokens, that's roughly 0.17-0.28 requests/second per GPU.
2. Calculate instances needed. If your peak QPS is 15 and each GPU handles 0.2 requests/second, the naive math gives 15 / 0.2 = 75 GPUs, but that assumes strictly sequential generation. With batched inference (vLLM, TensorRT-LLM), a single GPU can serve 4-8 concurrent requests with minimal per-request throughput degradation. Realistic capacity: 1-2 requests/second per GPU for a 14B model with batching.
3. Add headroom. Target 60-80% GPU utilization at peak, not 100%. At 100% utilization, any traffic spike causes latency degradation. For the example above: 15 QPS / 1.5 QPS per GPU / 0.7 utilization target = ~14 GPUs (see the sketch after this list).
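The whole calculation fits in one function. A sketch using the numbers above; note that rounding up gives 15 where the text rounds the raw 14.3 down to ~14:

```python
import math

def gpus_needed(peak_qps: float, qps_per_gpu: float,
                target_utilization: float = 0.7) -> int:
    """GPU instances required to serve peak traffic at a target utilization."""
    return math.ceil(peak_qps / (qps_per_gpu * target_utilization))

# 15 QPS peak, ~1.5 QPS/GPU with batched inference, 70% utilization target
print(gpus_needed(15, 1.5))  # -> 15 (raw value 14.3)
```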
GPU Utilization Targets
Do not plan for 100% GPU utilization. Here's why:
| Target Utilization | Implication |
|---|---|
| 90-100% | No headroom. Any spike = latency degradation or dropped requests. |
| 70-80% | Healthy production target. Handles normal variance in traffic. |
| 50-60% | Conservative. Appropriate for critical workloads with strict SLAs. |
| Below 50% | Likely over-provisioned. Consider smaller hardware or consolidating workloads. |
Utilization also depends on workload patterns. A customer-facing chatbot with peak hours (9am-5pm) will average 30-40% utilization even if peak hits 70-80%. An internal document processing pipeline running 24/7 can sustain 70-80% consistently.
Step 3: Factor Storage Requirements
GPU compute gets all the attention, but storage is where capacity planning most often goes wrong.
Storage Categories
Model Weights
Each model version needs storage. A 14B FP16 model is ~28GB. If you keep 5 versions (current + 4 rollback versions), that's 140GB per model. Multiply by the number of models you serve.
Training Datasets
If you're fine-tuning on-premise, your training data needs fast storage. Sizes vary wildly:
- Text fine-tuning datasets: 1GB–50GB typical
- Document corpora for RAG: 10GB–1TB+
- Image/multimodal datasets: 100GB–10TB+
Model Checkpoints
During fine-tuning, checkpoints are saved at regular intervals. A weights-only checkpoint for a 14B model at FP16 is ~28GB (checkpoints that also capture optimizer state can be two to three times larger). If you save checkpoints every 500 steps for a 5,000-step training run, that's 10 checkpoints × 28GB = 280GB per training run. Checkpoints accumulate quickly if not cleaned up.
Vector Database
RAG workloads need vector storage. A rough estimate: 1 million document chunks with 1,536-dimension embeddings requires approximately 6GB of vector storage, plus metadata and indexes that can double or triple the raw size.
Audit Logs and Telemetry
Every inference request and response should be logged for compliance and monitoring. A single request/response pair averages 2-5KB. At 50,000 requests/day, that's 100-250MB/day, or 36-91GB/year. Not huge, but it adds up and must be on fast, reliable storage if you need real-time audit capability.
Storage Sizing Worksheet
| Storage Category | Calculation | Example |
|---|---|---|
| Model weights | Models × Versions × Size | 3 models × 5 versions × 28GB = 420GB |
| Training datasets | Sum of all datasets | 50GB + 200GB = 250GB |
| Checkpoints | Runs/month × Checkpoints × Size | 4 runs × 10 × 28GB = 1,120GB |
| Vector database | Chunks × Embedding size × 3 (overhead) | 2M × 6KB × 3 = 36GB |
| Audit logs | Requests/day × Size × Retention | 50K × 3KB × 365 days = 55GB |
| Total | Sum of the above | ~1.9TB |
| With 50% headroom | Total × 1.5 | ~2.8TB |
Use NVMe SSDs for model weights and active training data — spinning disks cannot keep up with GPU data loading. A typical configuration pairs 2-4TB of NVMe storage per GPU server with a larger NAS or SAN for archival storage (old checkpoints, historical audit logs).
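The worksheet is easy to script so you can rerun it as workloads change. A minimal sketch in which every default is the example value from the table above:

```python
def storage_budget_gb(
    models: int = 3, versions: int = 5, model_gb: float = 28,
    datasets_gb: float = 250,
    runs: int = 4, ckpts_per_run: int = 10, ckpt_gb: float = 28,
    chunks: int = 2_000_000, embed_kb: float = 6, vector_overhead: float = 3,
    req_per_day: int = 50_000, log_kb: float = 3, retention_days: int = 365,
    headroom: float = 0.5,
) -> float:
    """Reproduce the storage sizing worksheet; every default is the article's example."""
    weights = models * versions * model_gb                  # 420 GB
    ckpts   = runs * ckpts_per_run * ckpt_gb                # 1,120 GB
    vectors = chunks * embed_kb * vector_overhead / 1e6     # 36 GB
    logs    = req_per_day * log_kb * retention_days / 1e6   # ~55 GB
    return (weights + datasets_gb + ckpts + vectors + logs) * (1 + headroom)

print(f"{storage_budget_gb() / 1000:.1f} TB")  # -> ~2.8 TB, matching the worksheet
```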
Step 4: Network Requirements
Network planning depends on whether you're running multi-node training or inference-only workloads.
Multi-Node Training
If you're training or fine-tuning models across multiple servers (distributed training), you need high-speed interconnect between nodes. The GPU communication during training is continuous and latency-sensitive.
- InfiniBand HDR (200 Gb/s) or NDR (400 Gb/s): The standard for multi-node GPU training. Each server needs an InfiniBand HCA, and you need an InfiniBand switch. Cost: $5,000-$15,000 per server + $10,000-$30,000 for the switch.
- RoCE (RDMA over Converged Ethernet): A cheaper alternative using standard Ethernet NICs with RDMA capabilities. Performance is 80-90% of InfiniBand for most workloads. Cost: $2,000-$5,000 per server with existing network switches.
Inference-Only
If you're only running inference (no distributed training), standard networking is sufficient:
- 25 GbE: Adequate for most inference workloads. Handles model loading and client request/response traffic without bottlenecks.
- 100 GbE: Useful if you're transferring large datasets frequently or serving very high QPS with large context windows.
Standard 1 GbE is too slow for model loading (loading a 28GB model over 1 GbE takes ~4 minutes — unacceptable for failover scenarios).
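A quick way to sanity-check link speeds is to compute model load time directly. In this sketch, the 80% efficiency factor is an assumption for protocol overhead:

```python
def load_time_seconds(model_gb: float, link_gbps: float,
                      efficiency: float = 0.8) -> float:
    """Transfer time for model weights over the network."""
    return model_gb * 8 / (link_gbps * efficiency)

for link in (1, 25, 100):
    print(f"{link:>3} GbE: {load_time_seconds(28, link):6.1f} s")
# 1 GbE: 280s (the ~4 min figure assumes full line rate); 25 GbE: ~11s; 100 GbE: ~3s
```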
Bandwidth to Clients
Calculate the bandwidth your inference service needs to serve clients:
- Average response: 400 tokens × ~4 bytes/token = 1.6KB per response
- At 15 QPS: 24KB/second — negligible
- But streaming responses token by token adds connection overhead: plan for 100-500 concurrent WebSocket connections if serving real-time chat interfaces
Client bandwidth is rarely a bottleneck, but connection count can be. Ensure your inference server (or load balancer in front of it) is configured for sufficient concurrent connections.
Step 5: Power and Cooling
This is the step that kills on-premise projects that looked great on paper.
Power Requirements
| Configuration | GPU Power | System Total | Circuit Required |
|---|---|---|---|
| 4x RTX 4090 workstation | 1,800W | ~2,500W | 1x 20A 208V |
| 8x L40S server | 2,800W | ~4,000W | 1x 30A 208V |
| 8x A100 server | 3,200W | ~4,500W | 1x 30A 208V |
| 8x H100 server | 5,600W | ~8,000W | 2x 30A 208V or 1x 60A |
Before purchasing hardware, verify with your facilities team:
- Available power capacity in your server room/data center. Many enterprise server rooms were sized for CPU-based servers at 2-5kW per rack, not GPU servers at 8-15kW per rack.
- Circuit availability. A single 8xH100 server may need its own dedicated circuit.
- UPS capacity. Your uninterruptible power supply must handle the GPU load plus runtime for safe shutdown.
Cooling Requirements
GPUs generate heat proportional to their power draw. Every watt of GPU power requires roughly 0.3-0.5 watts of cooling energy (depends on PUE — power usage effectiveness).
| Configuration | Heat Output | Cooling Method |
|---|---|---|
| 4x RTX 4090 | ~2.5kW | Standard room AC sufficient |
| 8x L40S | ~4kW | In-row cooling recommended |
| 8x H100 | ~8kW | In-row cooling or rear-door heat exchangers required |
| 16x H100 (2 servers) | ~16kW | Likely needs liquid cooling or dedicated cooling infrastructure |
If your server room's cooling capacity is maxed out, adding GPU servers may require HVAC upgrades costing $20,000-$100,000+. Check cooling capacity before committing to hardware purchases.
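The system totals in the power table can be approximated from GPU TDP alone. In the sketch below, the 45% overhead factor for CPU, memory, and fans is an assumption fitted to the figures above, not a vendor specification:

```python
def system_power_w(gpu_count: int, gpu_tdp_w: float,
                   overhead: float = 0.45) -> float:
    """Approximate whole-server draw: GPUs plus CPU/memory/fan overhead."""
    return gpu_count * gpu_tdp_w * (1 + overhead)

def heat_btu_per_hr(power_w: float) -> float:
    """Heat output to size cooling against (1 W ≈ 3.412 BTU/hr)."""
    return power_w * 3.412

watts = system_power_w(8, 700)  # 8x H100 at 700W TDP -> ~8.1 kW
print(f"{watts / 1000:.1f} kW, {heat_btu_per_hr(watts):,.0f} BTU/hr")
```

Facilities teams typically think in BTU/hr rather than watts, so converting up front makes the cooling conversation faster.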
The Sizing Worksheet
Pull it all together into a single sizing table:
| Workload | Model | Quantization | QPS | VRAM/Instance | GPUs Needed | GPU Type | Storage |
|---|---|---|---|---|---|---|---|
| Customer chatbot | 14B | INT4 | 15 | 12 GB | 14 | L40S | 50GB models |
| Document processing | 7B | INT4 | 5 | 6 GB | 4 | L40S | 200GB corpus |
| Embedding generation | 0.3B | FP16 | 50 | 2 GB | 2 | L40S | Shared |
| Reranking | 0.4B | FP16 | 50 | 2 GB | 2 | L40S | Shared |
| Monthly fine-tuning | 14B | FP16 | N/A | 80 GB (train) | 4 | A100 or L40S | 1.5TB checkpoints |
| Total | | | | | 26 | | ~2TB NVMe |
In this example, 22 L40S GPUs handle the inference workloads and 4 more handle monthly fine-tuning, for 26 GPUs total across four 8-GPU servers. The fine-tuning workload can either share the inference hardware during off-peak hours or run on a dedicated 4-GPU server.
Total estimated cost: 4x 8-GPU L40S servers × $79,000 = $316,000, plus storage, networking, and supporting infrastructure: approximately $380,000-$420,000 total.
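As a final sanity check, the bill of materials follows directly from the worksheet. In this sketch, the 20-33% adder for storage, networking, and supporting infrastructure is applied as an assumption to reproduce the range above:

```python
import math

gpus, gpus_per_server, server_cost = 26, 8, 79_000
servers = math.ceil(gpus / gpus_per_server)    # -> 4 servers
hardware = servers * server_cost               # -> $316,000
low, high = hardware * 1.20, hardware * 1.33   # supporting-infrastructure adder
print(f"{servers} servers, ${hardware:,}; total ~${low:,.0f}-${high:,.0f}")
```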
Common Capacity Planning Mistakes
Mistake 1: Buying H100s When L40S Would Suffice
The H100 is the most capable GPU available, but it's 4x the price of the L40S. If your workloads are inference-heavy with models under 30B parameters, the L40S provides 80-90% of the practical performance at 25% of the cost. The H100's advantages — HBM3 bandwidth, NVLink, MIG — matter most for large-model training and multi-tenant inference. If you're not doing those, you're paying for capabilities you won't use.
Mistake 2: Undersizing Storage
Model checkpoints are the most common storage surprise. A single fine-tuning run can generate 200-500GB of checkpoints. Organizations that budget 2TB of NVMe for "plenty of storage" find themselves full within weeks of starting fine-tuning experiments. Budget 2-3x more storage than your initial calculation suggests.
Mistake 3: Ignoring Power and Cooling Constraints
Hardware arrives, gets racked, and then trips the circuit breaker. Or the server room temperature climbs to 95°F within hours. Always verify power and cooling capacity with your facilities team before purchasing hardware, not after.
Mistake 4: Not Planning for Multi-Model Serving
Most organizations start with one model but quickly expand to 3-5 models serving different use cases. If you size your infrastructure for a single model, you'll be out of capacity within 6 months. Plan for at least 2-3x your initial model count.
Mistake 5: Sizing for Average Instead of Peak
A workload that averages 5 QPS but peaks at 20 QPS during business hours needs to be sized for 20 QPS (with headroom). Sizing for the average means degraded performance during the hours when usage matters most.
Mistake 6: Forgetting Redundancy
If you have exactly enough GPUs to serve your workload, losing a single GPU means degraded service. For workloads with 99.9%+ availability requirements, plan for N+1 redundancy — at minimum, one spare GPU per server, or a standby server that can absorb load during maintenance or failures.
Planning Horizon: 18-24 Months
GPU clusters are not trivially expandable. Adding GPUs to an existing server may not be possible (depends on chassis), and adding a new server requires procurement, racking, cabling, and configuration that takes 2-4 months from decision to production.
Size your initial deployment for 18-24 months of projected growth. It's better to have 20% excess capacity in year one than to face a capacity crunch in month eight while waiting for hardware procurement.
However, don't try to predict needs beyond 24 months. The AI hardware landscape changes rapidly — the GPU you'd buy in two years may not exist yet, and workload patterns will shift as your organization's AI usage matures.
Plan for what you can see. Build in headroom for what you can't.