
Enterprise AI Capacity Planning: How to Size Your On-Premise Infrastructure
A step-by-step technical guide for sizing on-premise AI infrastructure. Covers compute, storage, network, and power requirements with a sizing worksheet and common planning mistakes to avoid.
The most expensive mistake in on-premise AI is buying the wrong hardware. Oversizing means hundreds of thousands of dollars of idle compute. Undersizing means performance bottlenecks that undermine the business case for on-premise in the first place. And unlike cloud, you can't resize a GPU cluster with a configuration change — you're ordering hardware with 8-16 week lead times.
This guide walks through a structured capacity planning process: inventory your workloads, calculate compute requirements, factor storage and networking, and plan for growth. The goal is a specific, defensible hardware recommendation — not a vague "buy some GPUs."
Step 1: Inventory Your AI Workloads
Before selecting any hardware, build a complete inventory of every AI workload that will run on your on-premise infrastructure. This includes workloads running today (even if they're in the cloud) and workloads planned within the next 18 months.
For each workload, document:
| Field | Example Value | Why It Matters |
|---|---|---|
| Workload Name | Customer Support Chatbot | Identification |
| Type | Inference | Determines GPU utilization pattern |
| Model | 14B open-weight model (Q4 quantized) | Determines VRAM and compute needs |
| Requests/Day | 50,000 queries | Determines throughput requirements |
| Peak QPS | 15 queries/second | Determines concurrent GPU instances |
| Avg. Input Tokens | 800 tokens | Affects latency and throughput |
| Avg. Output Tokens | 400 tokens | Affects latency and throughput |
| Latency Requirement | <3 seconds to first token | Determines GPU class needed |
| Data Sensitivity | High (contains customer PII) | Confirms on-prem requirement |
| Availability Requirement | 99.9% (8.7 hours downtime/year) | Determines redundancy needs |
| Growth Projection | 2x in 12 months | Determines headroom |
Build this inventory as a spreadsheet. It becomes the foundation for every sizing decision that follows.
Common gap: Organizations inventory their primary workload but forget supporting workloads. A RAG-based chatbot doesn't just need inference compute — it also needs:
- Embedding generation for document ingestion (runs on GPU)
- Reranking model for retrieval (runs on GPU)
- Vector database (runs on CPU, needs fast storage)
- Document processing pipeline (mixed CPU/GPU)
Each of these consumes resources that must be planned for.
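To make the inventory concrete, here is one way to capture a row as structured data. This is a hypothetical schema that simply mirrors the table above, not a standard; field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """One row of the AI workload inventory (fields mirror the table above)."""
    name: str
    workload_type: str        # "inference", "training", "embedding", ...
    model: str
    requests_per_day: int
    peak_qps: float
    avg_input_tokens: int
    avg_output_tokens: int
    ttft_slo_seconds: float   # latency requirement: seconds to first token
    availability: float       # e.g. 0.999
    growth_12mo: float        # projected multiplier over 12 months

chatbot = Workload(
    name="Customer Support Chatbot",
    workload_type="inference",
    model="14B open-weight model (Q4 quantized)",
    requests_per_day=50_000,
    peak_qps=15.0,
    avg_input_tokens=800,
    avg_output_tokens=400,
    ttft_slo_seconds=3.0,
    availability=0.999,
    growth_12mo=2.0,
)
```

A structured inventory like this also makes the later sizing steps scriptable instead of manual spreadsheet math.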
Step 2: Calculate Compute Requirements
VRAM Sizing
VRAM (GPU memory) is usually the binding constraint. A model must fit in VRAM to run — there's no graceful degradation, just failure to load.
Model VRAM requirements by size and quantization:
| Model Size | FP16 (no quantization) | INT8 | INT4 (GPTQ/AWQ) |
|---|---|---|---|
| 7B parameters | ~14 GB | ~7 GB | ~4 GB |
| 14B parameters | ~28 GB | ~14 GB | ~8 GB |
| 32B parameters | ~64 GB | ~32 GB | ~18 GB |
| 70B parameters | ~140 GB | ~70 GB | ~35 GB |
These numbers represent model weights only. At inference time, you also need VRAM for:
- KV cache: Scales with context length and batch size. For a 14B model with 8K context serving 8 concurrent requests, add ~4-8GB.
- Activation memory: Typically 1-3GB depending on batch size.
- Framework overhead: PyTorch, vLLM, or TensorRT-LLM each add 1-2GB of baseline memory.
Rule of thumb: Reserve 30-40% VRAM headroom beyond the model weight size. A 14B INT4 model that needs 8GB of weight storage should be planned for 11-12GB total VRAM usage.
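The components above can be summed in a few lines. This is a rough sketch; the defaults are the rule-of-thumb figures from this section, not measured values:

```python
def plan_vram_gb(params_b: float, bytes_per_param: float,
                 kv_cache_gb: float = 4.0,
                 activations_gb: float = 2.0,
                 framework_gb: float = 1.5) -> float:
    """Itemized VRAM estimate: weights + KV cache + activations + framework."""
    return params_b * bytes_per_param + kv_cache_gb + activations_gb + framework_gb

# bytes_per_param: ~2.0 for FP16, ~1.0 for INT8, ~0.55-0.6 for INT4 (incl. scales)
# 14B INT4 with a modest KV cache: ~15.5GB itemized, vs. ~11-12GB from the
# weights-plus-headroom shortcut; plan against whichever estimate is larger.
print(f"{plan_vram_gb(14, 0.57):.1f} GB")
```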
Throughput Sizing
Calculate how many GPU instances you need to serve your target queries per second (QPS):
1. Measure single-instance throughput. For a 14B INT4 model on an L40S, expect approximately 70-110 tokens/second per GPU. With an average output of 400 tokens, that's roughly 0.17-0.28 requests/second per GPU.
2. Calculate instances needed. If your peak QPS is 15 and each GPU handles 0.2 requests/second, the naive math gives 15 / 0.2 = 75 GPUs, but that assumes strictly sequential generation. With batched inference (vLLM, TensorRT-LLM), a single GPU can serve 4-8 concurrent requests with minimal per-request throughput degradation. Realistic capacity: 1-2 requests/second per GPU for a 14B model with batching.
3. Add headroom. Target 60-80% GPU utilization at peak, not 100%. At 100% utilization, any traffic spike causes latency degradation. For the example above: 15 QPS / 1.5 QPS per GPU / 0.7 utilization target = ~14 GPUs (see the sketch after this list).
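The whole calculation fits in one function. A sketch using the numbers above; note that rounding up gives 15 where the text rounds the raw 14.3 down to ~14:

```python
import math

def gpus_needed(peak_qps: float, qps_per_gpu: float,
                target_utilization: float = 0.7) -> int:
    """GPU instances required to serve peak traffic at a target utilization."""
    return math.ceil(peak_qps / (qps_per_gpu * target_utilization))

# 15 QPS peak, ~1.5 QPS/GPU with batched inference, 70% utilization target
print(gpus_needed(15, 1.5))  # -> 15 (raw value 14.3)
```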
GPU Utilization Targets
Do not plan for 100% GPU utilization. Here's why:
| Target Utilization | Implication |
|---|---|
| 90-100% | No headroom. Any spike = latency degradation or dropped requests. |
| 70-80% | Healthy production target. Handles normal variance in traffic. |
| 50-60% | Conservative. Appropriate for critical workloads with strict SLAs. |
| Below 50% | Likely over-provisioned. Consider smaller hardware or consolidating workloads. |
Utilization also depends on workload patterns. A customer-facing chatbot with peak hours (9am-5pm) will average 30-40% utilization even if peak hits 70-80%. An internal document processing pipeline running 24/7 can sustain 70-80% consistently.
Step 3: Factor Storage Requirements
GPU compute gets all the attention, but storage is where capacity planning most often goes wrong.
Storage Categories
Model Weights
Each model version needs storage. A 14B FP16 model is ~28GB. If you keep 5 versions (current + 4 rollback versions), that's 140GB per model. Multiply by the number of models you serve.
Training Datasets
If you're fine-tuning on-premise, your training data needs fast storage. Sizes vary wildly:
- Text fine-tuning datasets: 1GB–50GB typical
- Document corpora for RAG: 10GB–1TB+
- Image/multimodal datasets: 100GB–10TB+
Model Checkpoints
During fine-tuning, checkpoints are saved at regular intervals. A weights-only checkpoint for a 14B model at FP16 is ~28GB (checkpoints that also capture optimizer state can be two to three times larger). If you save checkpoints every 500 steps for a 5,000-step training run, that's 10 checkpoints × 28GB = 280GB per training run. Checkpoints accumulate quickly if not cleaned up.
Vector Database
RAG workloads need vector storage. A rough estimate: 1 million document chunks with 1,536-dimension embeddings requires approximately 6GB of vector storage, plus metadata and indexes that can double or triple the raw size.
Audit Logs and Telemetry
Every inference request and response should be logged for compliance and monitoring. A single request/response pair averages 2-5KB. At 50,000 requests/day, that's 100-250MB/day, or 36-91GB/year. Not huge, but it adds up and must be on fast, reliable storage if you need real-time audit capability.
Storage Sizing Worksheet
| Storage Category | Calculation | Example |
|---|---|---|
| Model weights | Models × Versions × Size | 3 models × 5 versions × 28GB = 420GB |
| Training datasets | Sum of all datasets | 50GB + 200GB = 250GB |
| Checkpoints | Runs/month × Checkpoints × Size | 4 runs × 10 × 28GB = 1,120GB |
| Vector database | Chunks × Embedding size × 3 (overhead) | 2M × 6KB × 3 = 36GB |
| Audit logs | Requests/day × Size × Retention | 50K × 3KB × 365 days = 55GB |
| Total | Sum of the above | ~1.9TB |
| With 50% headroom | Total × 1.5 | ~2.8TB |
Use NVMe SSDs for model weights and active training data — spinning disks cannot keep up with GPU data loading. A typical configuration pairs 2-4TB of NVMe storage per GPU server with a larger NAS or SAN for archival storage (old checkpoints, historical audit logs).
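The worksheet is easy to script so you can rerun it as workloads change. A minimal sketch in which every default is the example value from the table above:

```python
def storage_budget_gb(
    models: int = 3, versions: int = 5, model_gb: float = 28,
    datasets_gb: float = 250,
    runs: int = 4, ckpts_per_run: int = 10, ckpt_gb: float = 28,
    chunks: int = 2_000_000, embed_kb: float = 6, vector_overhead: float = 3,
    req_per_day: int = 50_000, log_kb: float = 3, retention_days: int = 365,
    headroom: float = 0.5,
) -> float:
    """Reproduce the storage sizing worksheet; every default is the article's example."""
    weights = models * versions * model_gb                  # 420 GB
    ckpts   = runs * ckpts_per_run * ckpt_gb                # 1,120 GB
    vectors = chunks * embed_kb * vector_overhead / 1e6     # 36 GB
    logs    = req_per_day * log_kb * retention_days / 1e6   # ~55 GB
    return (weights + datasets_gb + ckpts + vectors + logs) * (1 + headroom)

print(f"{storage_budget_gb() / 1000:.1f} TB")  # -> ~2.8 TB, matching the worksheet
```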
Step 4: Network Requirements
Network planning depends on whether you're running multi-node training or inference-only workloads.
Multi-Node Training
If you're training or fine-tuning models across multiple servers (distributed training), you need high-speed interconnect between nodes. The GPU communication during training is continuous and latency-sensitive.
- InfiniBand HDR (200 Gb/s) or NDR (400 Gb/s): The standard for multi-node GPU training. Each server needs an InfiniBand HCA, and you need an InfiniBand switch. Cost: $5,000-$15,000 per server + $10,000-$30,000 for the switch.
- RoCE (RDMA over Converged Ethernet): A cheaper alternative using standard Ethernet NICs with RDMA capabilities. Performance is 80-90% of InfiniBand for most workloads. Cost: $2,000-$5,000 per server with existing network switches.
Inference-Only
If you're only running inference (no distributed training), standard networking is sufficient:
- 25 GbE: Adequate for most inference workloads. Handles model loading and client request/response traffic without bottlenecks.
- 100 GbE: Useful if you're transferring large datasets frequently or serving very high QPS with large context windows.
Standard 1 GbE is too slow for model loading (loading a 28GB model over 1 GbE takes ~4 minutes — unacceptable for failover scenarios).
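A quick way to sanity-check link speeds is to compute model load time directly. In this sketch, the 80% efficiency factor is an assumption for protocol overhead:

```python
def load_time_seconds(model_gb: float, link_gbps: float,
                      efficiency: float = 0.8) -> float:
    """Transfer time for model weights over the network."""
    return model_gb * 8 / (link_gbps * efficiency)

for link in (1, 25, 100):
    print(f"{link:>3} GbE: {load_time_seconds(28, link):6.1f} s")
# 1 GbE: 280s (the ~4 min figure assumes full line rate); 25 GbE: ~11s; 100 GbE: ~3s
```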
Bandwidth to Clients
Calculate the bandwidth your inference service needs to serve clients:
- Average response: 400 tokens × ~4 bytes/token = 1.6KB per response
- At 15 QPS: 24KB/second — negligible
- But streaming responses token by token adds connection overhead: plan for 100-500 concurrent WebSocket connections if serving real-time chat interfaces
Client bandwidth is rarely a bottleneck, but connection count can be. Ensure your inference server (or load balancer in front of it) is configured for sufficient concurrent connections.
Step 5: Power and Cooling
This is the step that kills on-premise projects that looked great on paper.
Power Requirements
| Configuration | GPU Power | System Total | Circuit Required |
|---|---|---|---|
| 4x RTX 4090 workstation | 1,800W | ~2,500W | 1x 20A 208V |
| 8x L40S server | 2,800W | ~4,000W | 1x 30A 208V |
| 8x A100 server | 3,200W | ~4,500W | 1x 30A 208V |
| 8x H100 server | 5,600W | ~8,000W | 2x 30A 208V or 1x 60A |
Before purchasing hardware, verify with your facilities team:
- Available power capacity in your server room/data center. Many enterprise server rooms were sized for CPU-based servers at 2-5kW per rack, not GPU servers at 8-15kW per rack.
- Circuit availability. A single 8xH100 server may need its own dedicated circuit.
- UPS capacity. Your uninterruptible power supply must handle the GPU load plus runtime for safe shutdown.
Cooling Requirements
GPUs generate heat proportional to their power draw. Every watt of GPU power requires roughly 0.3-0.5 watts of cooling energy (depends on PUE — power usage effectiveness).
| Configuration | Heat Output | Cooling Method |
|---|---|---|
| 4x RTX 4090 | ~2.5kW | Standard room AC sufficient |
| 8x L40S | ~4kW | In-row cooling recommended |
| 8x H100 | ~8kW | In-row cooling or rear-door heat exchangers required |
| 16x H100 (2 servers) | ~16kW | Likely needs liquid cooling or dedicated cooling infrastructure |
If your server room's cooling capacity is maxed out, adding GPU servers may require HVAC upgrades costing $20,000-$100,000+. Check cooling capacity before committing to hardware purchases.
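The system totals in the power table can be approximated from GPU TDP alone. In the sketch below, the 45% overhead factor for CPU, memory, and fans is an assumption fitted to the figures above, not a vendor specification:

```python
def system_power_w(gpu_count: int, gpu_tdp_w: float,
                   overhead: float = 0.45) -> float:
    """Approximate whole-server draw: GPUs plus CPU/memory/fan overhead."""
    return gpu_count * gpu_tdp_w * (1 + overhead)

def heat_btu_per_hr(power_w: float) -> float:
    """Heat output to size cooling against (1 W ≈ 3.412 BTU/hr)."""
    return power_w * 3.412

watts = system_power_w(8, 700)  # 8x H100 at 700W TDP -> ~8.1 kW
print(f"{watts / 1000:.1f} kW, {heat_btu_per_hr(watts):,.0f} BTU/hr")
```

Facilities teams typically think in BTU/hr rather than watts, so converting up front makes the cooling conversation faster.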
The Sizing Worksheet
Pull it all together into a single sizing table:
| Workload | Model | Quantization | QPS | VRAM/Instance | GPUs Needed | GPU Type | Storage |
|---|---|---|---|---|---|---|---|
| Customer chatbot | 14B | INT4 | 15 | 12 GB | 14 | L40S | 50GB models |
| Document processing | 7B | INT4 | 5 | 6 GB | 4 | L40S | 200GB corpus |
| Embedding generation | 0.3B | FP16 | 50 | 2 GB | 2 | L40S | Shared |
| Reranking | 0.4B | FP16 | 50 | 2 GB | 2 | L40S | Shared |
| Monthly fine-tuning | 14B | FP16 | N/A | 80 GB (train) | 4 | A100 or L40S | 1.5TB checkpoints |
| Total | | | | | 26 | | ~2TB NVMe |
In this example, 22 L40S GPUs handle the inference workloads and 4 more handle monthly fine-tuning, for 26 GPUs total across four 8-GPU servers. The fine-tuning workload can either share the inference hardware during off-peak hours or run on a dedicated 4-GPU server.
Total estimated cost: 4x 8-GPU L40S servers × $79,000 = $316,000, plus storage, networking, and supporting infrastructure: approximately $380,000-$420,000 total.
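As a final sanity check, the bill of materials follows directly from the worksheet. In this sketch, the 20-33% adder for storage, networking, and supporting infrastructure is applied as an assumption to reproduce the range above:

```python
import math

gpus, gpus_per_server, server_cost = 26, 8, 79_000
servers = math.ceil(gpus / gpus_per_server)    # -> 4 servers
hardware = servers * server_cost               # -> $316,000
low, high = hardware * 1.20, hardware * 1.33   # supporting-infrastructure adder
print(f"{servers} servers, ${hardware:,}; total ~${low:,.0f}-${high:,.0f}")
```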
Common Capacity Planning Mistakes
Mistake 1: Buying H100s When L40S Would Suffice
The H100 is the most capable GPU available, but it's 4x the price of the L40S. If your workloads are inference-heavy with models under 30B parameters, the L40S provides 80-90% of the practical performance at 25% of the cost. The H100's advantages — HBM3 bandwidth, NVLink, MIG — matter most for large-model training and multi-tenant inference. If you're not doing those, you're paying for capabilities you won't use.
Mistake 2: Undersizing Storage
Model checkpoints are the most common storage surprise. A single fine-tuning run can generate 200-500GB of checkpoints. Organizations that budget 2TB of NVMe for "plenty of storage" find themselves full within weeks of starting fine-tuning experiments. Budget 2-3x more storage than your initial calculation suggests.
Mistake 3: Ignoring Power and Cooling Constraints
Hardware arrives, gets racked, and then trips the circuit breaker. Or the server room temperature climbs to 95°F within hours. Always verify power and cooling capacity with your facilities team before purchasing hardware, not after.
Mistake 4: Not Planning for Multi-Model Serving
Most organizations start with one model but quickly expand to 3-5 models serving different use cases. If you size your infrastructure for a single model, you'll be out of capacity within 6 months. Plan for at least 2-3x your initial model count.
Mistake 5: Sizing for Average Instead of Peak
A workload that averages 5 QPS but peaks at 20 QPS during business hours needs to be sized for 20 QPS (with headroom). Sizing for the average means degraded performance during the hours when usage matters most.
Mistake 6: Forgetting Redundancy
If you have exactly enough GPUs to serve your workload, losing a single GPU means degraded service. For workloads with 99.9%+ availability requirements, plan for N+1 redundancy — at minimum, one spare GPU per server, or a standby server that can absorb load during maintenance or failures.
Planning Horizon: 18-24 Months
GPU clusters are not trivially expandable. Adding GPUs to an existing server may not be possible (depends on chassis), and adding a new server requires procurement, racking, cabling, and configuration that takes 2-4 months from decision to production.
Size your initial deployment for 18-24 months of projected growth. It's better to have 20% excess capacity in year one than to face a capacity crunch in month eight while waiting for hardware procurement.
However, don't try to predict needs beyond 24 months. The AI hardware landscape changes rapidly — the GPU you'd buy in two years may not exist yet, and workload patterns will shift as your organization's AI usage matures.
Plan for what you can see. Build in headroom for what you can't.