
    Enterprise AI Capacity Planning: How to Size Your On-Premise Infrastructure

    A step-by-step technical guide for sizing on-premise AI infrastructure. Covers compute, storage, network, and power requirements with a sizing worksheet and common planning mistakes to avoid.

Ertas Team

    The most expensive mistake in on-premise AI is buying the wrong hardware. Oversizing means hundreds of thousands of dollars of idle compute. Undersizing means performance bottlenecks that undermine the business case for on-premise in the first place. And unlike cloud, you can't resize a GPU cluster with a configuration change — you're ordering hardware with 8-16 week lead times.

    This guide walks through a structured capacity planning process: inventory your workloads, calculate compute requirements, factor storage and networking, and plan for growth. The goal is a specific, defensible hardware recommendation — not a vague "buy some GPUs."

    Step 1: Inventory Your AI Workloads

    Before selecting any hardware, build a complete inventory of every AI workload that will run on your on-premise infrastructure. This includes workloads running today (even if they're in the cloud) and workloads planned within the next 18 months.

    For each workload, document:

| Field | Example Value | Why It Matters |
| --- | --- | --- |
| Workload Name | Customer Support Chatbot | Identification |
| Type | Inference | Determines GPU utilization pattern |
| Model | 14B open-weight model (Q4 quantized) | Determines VRAM and compute needs |
| Requests/Day | 50,000 queries | Determines throughput requirements |
| Peak QPS | 15 queries/second | Determines concurrent GPU instances |
| Avg. Input Tokens | 800 tokens | Affects latency and throughput |
| Avg. Output Tokens | 400 tokens | Affects latency and throughput |
| Latency Requirement | <3 seconds to first token | Determines GPU class needed |
| Data Sensitivity | High (contains customer PII) | Confirms on-prem requirement |
| Availability Requirement | 99.9% (8.7 hours downtime/year) | Determines redundancy needs |
| Growth Projection | 2x in 12 months | Determines headroom |

    Build this inventory as a spreadsheet. It becomes the foundation for every sizing decision that follows.
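If the inventory will also feed the sizing calculations later in this guide, the same fields translate directly into a small data structure. A minimal sketch in Python, using the chatbot example above (field names and types are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """One row of the AI workload inventory (fields mirror the table above)."""
    name: str
    workload_type: str            # "inference" or "training"
    model: str
    requests_per_day: int
    peak_qps: float
    avg_input_tokens: int
    avg_output_tokens: int
    ttft_slo_seconds: float       # latency requirement: seconds to first token
    data_sensitivity: str
    availability_target: float    # e.g. 0.999
    growth_12mo: float            # projected growth multiplier over 12 months

chatbot = Workload(
    name="Customer Support Chatbot",
    workload_type="inference",
    model="14B (Q4 quantized)",
    requests_per_day=50_000,
    peak_qps=15,
    avg_input_tokens=800,
    avg_output_tokens=400,
    ttft_slo_seconds=3.0,
    data_sensitivity="High (contains customer PII)",
    availability_target=0.999,
    growth_12mo=2.0,
)
```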

    Common gap: Organizations inventory their primary workload but forget supporting workloads. A RAG-based chatbot doesn't just need inference compute — it also needs:

    • Embedding generation for document ingestion (runs on GPU)
    • Reranking model for retrieval (runs on GPU)
    • Vector database (runs on CPU, needs fast storage)
    • Document processing pipeline (mixed CPU/GPU)

    Each of these consumes resources that must be planned for.

    Step 2: Calculate Compute Requirements

    VRAM Sizing

    VRAM (GPU memory) is usually the binding constraint. A model must fit in VRAM to run — there's no graceful degradation, just failure to load.

    Model VRAM requirements by size and quantization:

| Model Size | FP16 (no quantization) | INT8 | INT4 (GPTQ/AWQ) |
| --- | --- | --- | --- |
| 7B parameters | ~14 GB | ~7 GB | ~4 GB |
| 14B parameters | ~28 GB | ~14 GB | ~8 GB |
| 32B parameters | ~64 GB | ~32 GB | ~18 GB |
| 70B parameters | ~140 GB | ~70 GB | ~35 GB |

    These numbers represent model weights only. At inference time, you also need VRAM for:

    • KV cache: Scales with context length and batch size. For a 14B model with 8K context serving 8 concurrent requests, add ~4-8GB.
    • Activation memory: Typically 1-3GB depending on batch size.
    • Framework overhead: PyTorch, vLLM, or TensorRT-LLM each add 1-2GB of baseline memory.

    Rule of thumb: Reserve 30-40% VRAM headroom beyond the model weight size. A 14B INT4 model that needs 8GB of weight storage should be planned for 11-12GB total VRAM usage.
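The rule of thumb is easy to wrap in a small helper. This is a sketch of the estimate described above, not a substitute for measuring real memory use with your serving stack; the bytes-per-parameter figures follow the table:

```python
# Approximate bytes per parameter, per the table above
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate weight memory in GB. Real checkpoints run slightly higher
    (unquantized embedding/output layers, quantization scales, etc.)."""
    return params_billion * BYTES_PER_PARAM[quant]

def planned_vram_gb(weight_gb: float, headroom: float = 0.35) -> float:
    """Planning figure: weights plus 30-40% headroom for KV cache,
    activations, and framework overhead, as described above."""
    return weight_gb * (1 + headroom)

# Example: the 14B INT4 chatbot model (~8 GB of weights per the table)
print(weights_gb(14, "int4"))       # 7.0 -- the table rounds this up to ~8 GB
print(planned_vram_gb(8.0))         # ~10.8 GB -> plan for the 11-12 GB quoted above
```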

    Throughput Sizing

    Calculate how many GPU instances you need to serve your target queries per second (QPS):

    1. Measure single-instance throughput. For a 14B INT4 model on an L40S, expect approximately 70-110 tokens/second per GPU. With an average output of 400 tokens, that's roughly 0.17-0.28 requests/second per GPU.

    2. Calculate instances needed. If your peak QPS is 15, and each GPU handles 0.2 requests/second: 15 / 0.2 = 75 GPUs? No — that math is for sequential generation. With batched inference (vLLM, TensorRT-LLM), a single GPU can serve 4-8 concurrent requests with minimal per-request throughput degradation. Realistic capacity: 1-2 requests/second per GPU for a 14B model with batching.

    3. Add headroom. Target 60-80% GPU utilization at peak, not 100%. At 100% utilization, any traffic spike causes latency degradation. For the example above: 15 QPS / 1.5 QPS per GPU / 0.7 utilization target = ~14 GPUs.
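The three steps above reduce to one line of arithmetic. A sketch using the illustrative numbers from the text; the per-GPU throughput figure should come from benchmarking your own model and serving stack:

```python
import math

def gpus_needed(peak_qps: float, qps_per_gpu: float,
                utilization_target: float = 0.7, spares: int = 0) -> int:
    """GPUs required to serve peak QPS at a target utilization, plus optional spares."""
    return math.ceil(peak_qps / (qps_per_gpu * utilization_target)) + spares

# Example from the text: 15 QPS peak, ~1.5 QPS per GPU with batching, 70% utilization target
print(gpus_needed(peak_qps=15, qps_per_gpu=1.5, utilization_target=0.7))
# 15 (14.3 before rounding up; the text quotes ~14)
```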

    GPU Utilization Targets

    Do not plan for 100% GPU utilization. Here's why:

| Target Utilization | Implication |
| --- | --- |
| 90-100% | No headroom. Any spike = latency degradation or dropped requests. |
| 70-80% | Healthy production target. Handles normal variance in traffic. |
| 50-60% | Conservative. Appropriate for critical workloads with strict SLAs. |
| Below 50% | Likely over-provisioned. Consider smaller hardware or consolidating workloads. |

    Utilization also depends on workload patterns. A customer-facing chatbot with peak hours (9am-5pm) will average 30-40% utilization even if peak hits 70-80%. An internal document processing pipeline running 24/7 can sustain 70-80% consistently.

    Step 3: Factor Storage Requirements

    GPU compute gets all the attention, but storage planning is where capacity planning most often goes wrong.

    Storage Categories

    Model Weights

    Each model version needs storage. A 14B FP16 model is ~28GB. If you keep 5 versions (current + 4 rollback versions), that's 140GB per model. Multiply by the number of models you serve.

    Training Datasets

    If you're fine-tuning on-premise, your training data needs fast storage. Sizes vary wildly:

    • Text fine-tuning datasets: 1GB–50GB typical
    • Document corpora for RAG: 10GB–1TB+
    • Image/multimodal datasets: 100GB–10TB+

    Model Checkpoints

    During fine-tuning, checkpoints are saved at regular intervals. A full checkpoint for a 14B model is ~28GB. If you save checkpoints every 500 steps for a 5,000-step training run, that's 10 checkpoints × 28GB = 280GB per training run. Checkpoints accumulate quickly if not cleaned up.

    Vector Database

    RAG workloads need vector storage. A rough estimate: 1 million document chunks with 1,536-dimension embeddings requires approximately 6GB of vector storage, plus metadata and indexes that can double or triple the raw size.

    Audit Logs and Telemetry

    Every inference request and response should be logged for compliance and monitoring. A single request/response pair averages 2-5KB. At 50,000 requests/day, that's 100-250MB/day, or 36-91GB/year. Not huge, but it adds up and must be on fast, reliable storage if you need real-time audit capability.

    Storage Sizing Worksheet

| Storage Category | Calculation | Example |
| --- | --- | --- |
| Model weights | Models × Versions × Size | 3 models × 5 versions × 28GB = 420GB |
| Training datasets | Sum of all datasets | 50GB + 200GB = 250GB |
| Checkpoints | Runs/month × Checkpoints × Size | 4 runs × 10 × 28GB = 1,120GB |
| Vector database | Chunks × Embedding size × 3 (overhead) | 2M × 6KB × 3 = 36GB |
| Audit logs | Requests/day × Size × Retention | 50K × 3KB × 365 days = 55GB |
| Total | | ~1.9TB |
| With 50% headroom | | ~2.8TB |
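The worksheet is a handful of multiplications, so it is worth scripting once and re-running as assumptions change. A sketch using the example values above, with all sizes in GB:

```python
def storage_plan_gb(models: int, versions: int, model_size_gb: float,
                    dataset_gb: float,
                    runs_per_month: int, checkpoints_per_run: int, checkpoint_gb: float,
                    chunks: int, bytes_per_embedding: int, index_overhead: float,
                    requests_per_day: int, kb_per_request: float, retention_days: int,
                    headroom: float = 0.5) -> float:
    """Sum the worksheet categories and add planning headroom. Returns GB."""
    weights  = models * versions * model_size_gb
    ckpts    = runs_per_month * checkpoints_per_run * checkpoint_gb
    vectors  = chunks * bytes_per_embedding * index_overhead / 1e9
    logs     = requests_per_day * kb_per_request * retention_days / 1e6
    total = weights + dataset_gb + ckpts + vectors + logs
    return total * (1 + headroom)

# Example values from the worksheet above
print(storage_plan_gb(models=3, versions=5, model_size_gb=28,
                      dataset_gb=250,
                      runs_per_month=4, checkpoints_per_run=10, checkpoint_gb=28,
                      chunks=2_000_000, bytes_per_embedding=6_000, index_overhead=3,
                      requests_per_day=50_000, kb_per_request=3, retention_days=365))
# ~2,820 GB -> the ~2.8TB figure with 50% headroom
```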

    Use NVMe SSDs for model weights and active training data — spinning disks cannot keep up with GPU data loading. A typical configuration pairs 2-4TB of NVMe storage per GPU server with a larger NAS or SAN for archival storage (old checkpoints, historical audit logs).

    Step 4: Network Requirements

    Network planning depends on whether you're running multi-node training or inference-only workloads.

    Multi-Node Training

    If you're training or fine-tuning models across multiple servers (distributed training), you need high-speed interconnect between nodes. The GPU communication during training is continuous and latency-sensitive.

    • InfiniBand HDR (200 Gb/s) or NDR (400 Gb/s): The standard for multi-node GPU training. Each server needs an InfiniBand HCA, and you need an InfiniBand switch. Cost: $5,000-$15,000 per server + $10,000-$30,000 for the switch.
    • RoCE (RDMA over Converged Ethernet): A cheaper alternative using standard Ethernet NICs with RDMA capabilities. Performance is 80-90% of InfiniBand for most workloads. Cost: $2,000-$5,000 per server with existing network switches.

    Inference-Only

    If you're only running inference (no distributed training), standard networking is sufficient:

    • 25 GbE: Adequate for most inference workloads. Handles model loading and client request/response traffic without bottlenecks.
    • 100 GbE: Useful if you're transferring large datasets frequently or serving very high QPS with large context windows.

    Standard 1 GbE is too slow for model loading (loading a 28GB model over 1 GbE takes ~4 minutes — unacceptable for failover scenarios).
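Model load time over a given link is simply size divided by line rate (real transfers run somewhat slower due to protocol overhead). A quick sketch:

```python
def load_time_seconds(model_gb: float, link_gbps: float) -> float:
    """Time to move a model across the network at line rate: size (GB -> Gb) / link speed."""
    return (model_gb * 8) / link_gbps

# 28 GB model (14B FP16) over common link speeds
for link in (1, 10, 25, 100):
    print(f"{link:>3} GbE: {load_time_seconds(28, link):.0f} s")
# ~224 s (the ~4 minutes quoted above) over 1 GbE, ~9 s over 25 GbE
```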

    Bandwidth to Clients

    Calculate the bandwidth your inference service needs to serve clients:

    • Average response: 400 tokens × ~4 bytes/token = 1.6KB per response
    • At 15 QPS: 24KB/second — negligible
    • Streaming responses token by token adds connection overhead, though: plan for 100-500 concurrent WebSocket connections if you serve real-time chat interfaces

    Client bandwidth is rarely a bottleneck, but connection count can be. Ensure your inference server (or load balancer in front of it) is configured for sufficient concurrent connections.

    Step 5: Power and Cooling

    This is the step that kills on-premise projects that looked great on paper.

    Power Requirements

| Configuration | GPU Power | System Total | Circuit Required |
| --- | --- | --- | --- |
| 4x RTX 4090 workstation | 1,800W | ~2,500W | 1x 20A 208V |
| 8x L40S server | 2,800W | ~4,000W | 1x 30A 208V |
| 8x A100 server | 3,200W | ~4,500W | 1x 30A 208V |
| 8x H100 server | 5,600W | ~8,000W | 2x 30A 208V or 1x 60A |
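A back-of-the-envelope check of whether a server fits a circuit: continuous loads are commonly limited to 80% of a circuit's rating. That derating factor is an assumption here, so confirm the electrical rules that apply in your facility:

```python
def circuit_capacity_w(amps: float, volts: float = 208, continuous_derating: float = 0.8) -> float:
    """Usable continuous power on a circuit: amps x volts x derating factor."""
    return amps * volts * continuous_derating

# Does an 8x H100 server (~8,000W system load) fit on a single 30A 208V circuit?
print(circuit_capacity_w(30))   # ~4,992W -> no; needs 2x 30A or 1x 60A, as in the table
print(circuit_capacity_w(60))   # ~9,984W -> yes
```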

    Before purchasing hardware, verify with your facilities team:

    1. Available power capacity in your server room/data center. Many enterprise server rooms were sized for CPU-based servers at 2-5kW per rack, not GPU servers at 8-15kW per rack.
    2. Circuit availability. A single 8xH100 server may need its own dedicated circuit.
    3. UPS capacity. Your uninterruptible power supply must handle the GPU load plus runtime for safe shutdown.

    Cooling Requirements

    GPUs generate heat proportional to their power draw. Every watt of GPU power requires roughly 0.3-0.5 watts of cooling energy (depends on PUE — power usage effectiveness).

| Configuration | Heat Output | Cooling Method |
| --- | --- | --- |
| 4x RTX 4090 | ~2.5kW | Standard room AC sufficient |
| 8x L40S | ~4kW | In-row cooling recommended |
| 8x H100 | ~8kW | In-row cooling or rear-door heat exchangers required |
| 16x H100 (2 servers) | ~16kW | Likely needs liquid cooling or dedicated cooling infrastructure |
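Facilities teams and HVAC vendors usually quote heat load in BTU/hr or tons of cooling rather than kW. The conversion is fixed (1 kW ≈ 3,412 BTU/hr; 1 ton of cooling = 12,000 BTU/hr):

```python
def cooling_load(kw: float) -> tuple[float, float]:
    """Convert electrical load to heat load: (BTU/hr, tons of refrigeration)."""
    btu_per_hr = kw * 3412
    return btu_per_hr, btu_per_hr / 12_000

# Two 8x H100 servers (~16 kW)
print(cooling_load(16))   # ~54,600 BTU/hr, ~4.6 tons of cooling
```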

    If your server room's cooling capacity is maxed out, adding GPU servers may require HVAC upgrades costing $20,000-$100,000+. Check cooling capacity before committing to hardware purchases.

    The Sizing Worksheet

    Pull it all together into a single sizing table:

| Workload | Model | Quantization | QPS | VRAM/Instance | GPUs Needed | GPU Type | Storage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Customer chatbot | 14B | INT4 | 15 | 12 GB | 14 | L40S | 50GB models |
| Document processing | 7B | INT4 | 5 | 6 GB | 4 | L40S | 200GB corpus |
| Embedding generation | 0.3B | FP16 | 50 | 2 GB | 2 | L40S | Shared |
| Reranking | 0.4B | FP16 | 50 | 2 GB | 2 | L40S | Shared |
| Monthly fine-tuning | 14B | FP16 | N/A | 80 GB (train) | 4 | A100 or L40S | 1.5TB checkpoints |
| Total | | | | | 26 GPUs | | ~2TB NVMe |

    In this example, the inference workloads need 22 L40S GPUs across 3-4 servers, and the fine-tuning workload either shares that hardware during off-peak hours or runs on a dedicated 4-GPU server, for 26 GPUs in total.

    Total estimated cost: 4x 8-GPU L40S servers × $79,000 = $316,000, plus storage, networking, and supporting infrastructure: approximately $380,000-$420,000 total.
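The roll-up behind those totals, as a sketch; the per-server price is the illustrative figure quoted above, not a current market quote:

```python
import math

gpus_by_workload = {
    "customer chatbot": 14,
    "document processing": 4,
    "embedding generation": 2,
    "reranking": 2,
    "monthly fine-tuning": 4,
}
total_gpus = sum(gpus_by_workload.values())   # 26
servers = math.ceil(total_gpus / 8)           # 4 eight-GPU servers
print(total_gpus, servers, servers * 79_000)  # 26 GPUs, 4 servers, $316,000 before storage/networking
```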

    Common Capacity Planning Mistakes

    Mistake 1: Buying H100s When L40S Would Suffice

    H100s sit at the top of the data-center GPU range, but they're roughly 4x the price of L40S hardware. If your workloads are inference-heavy with models under 30B parameters, the L40S provides 80-90% of the practical performance at 25% of the cost. The H100's advantages — HBM3 bandwidth, NVLink, MIG — matter most for large-model training and multi-tenant inference. If you're not doing those, you're paying for capabilities you won't use.

    Mistake 2: Undersizing Storage

    Model checkpoints are the most common storage surprise. A single fine-tuning run can generate 200-500GB of checkpoints. Organizations that budget 2TB of NVMe for "plenty of storage" find themselves full within weeks of starting fine-tuning experiments. Budget 2-3x more storage than your initial calculation suggests.

    Mistake 3: Ignoring Power and Cooling Constraints

    Hardware arrives, gets racked, and then trips the circuit breaker. Or the server room temperature climbs to 95°F within hours. Always verify power and cooling capacity with your facilities team before purchasing hardware, not after.

    Mistake 4: Not Planning for Multi-Model Serving

    Most organizations start with one model but quickly expand to 3-5 models serving different use cases. If you size your infrastructure for a single model, you'll be out of capacity within 6 months. Plan for at least 2-3x your initial model count.

    Mistake 5: Sizing for Average Instead of Peak

    A workload that averages 5 QPS but peaks at 20 QPS during business hours needs to be sized for 20 QPS (with headroom). Sizing for the average means degraded performance during the hours when usage matters most.

    Mistake 6: Forgetting Redundancy

    If you have exactly enough GPUs to serve your workload, losing a single GPU means degraded service. For workloads with 99.9%+ availability requirements, plan for N+1 redundancy — at minimum, one spare GPU per server, or a standby server that can absorb load during maintenance or failures.

    Planning Horizon: 18-24 Months

    GPU clusters are not trivially expandable. Adding GPUs to an existing server may not be possible (depends on chassis), and adding a new server requires procurement, racking, cabling, and configuration that takes 2-4 months from decision to production.

    Size your initial deployment for 18-24 months of projected growth. It's better to have 20% excess capacity in year one than to face a capacity crunch in month eight while waiting for hardware procurement.

    However, don't try to predict needs beyond 24 months. The AI hardware landscape changes rapidly — the GPU you'd buy in two years may not exist yet, and workload patterns will shift as your organization's AI usage matures.

    Plan for what you can see. Build in headroom for what you can't.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
