What is GPU Memory (VRAM)?

    The dedicated high-bandwidth memory on a graphics processing unit that stores model weights, activations, and gradients during training and inference.

    Definition

    GPU memory, also known as VRAM (Video Random Access Memory), is the high-bandwidth memory physically located on a graphics processing unit. In machine learning, GPU memory is the primary bottleneck that determines which models can be trained or served on a given piece of hardware. During training, GPU memory must hold the model weights, optimizer states, gradients, and intermediate activations simultaneously. During inference, it holds the model weights and the key-value cache that grows with context length.

    The memory requirements of LLMs have grown dramatically. A 7B-parameter model in FP16 precision requires approximately 14 GB of VRAM just for weights. During training with the Adam optimizer, the requirement balloons to roughly 84 GB for the weights, gradients, and two optimizer states alone, before counting activations, far exceeding the 24 GB available on a consumer RTX 4090 or even the 48 GB on an RTX A6000. This memory wall is the primary reason techniques like quantization, gradient checkpointing, and parameter-efficient fine-tuning exist.
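
    A back-of-the-envelope sketch of where those figures come from, assuming FP16 weights and gradients with FP32 Adam moment estimates, and ignoring activations and framework overhead:

    ```python
    def training_memory_gb(n_params: float) -> dict:
        """Rough VRAM estimate, assuming FP16 weights/gradients and FP32 Adam
        moments; activations and framework overhead are not counted."""
        weights = n_params * 2 / 1e9       # 2 bytes per FP16 parameter
        gradients = n_params * 2 / 1e9     # FP16 gradients, same size as weights
        adam_states = n_params * 8 / 1e9   # two FP32 moment estimates, 4 bytes each
        return {
            "inference_weights_gb": weights,
            "training_total_gb": weights + gradients + adam_states,
        }

    print(training_memory_gb(7e9))
    # {'inference_weights_gb': 14.0, 'training_total_gb': 84.0}
    ```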

    Modern GPU architectures relevant to LLM work include NVIDIA's A100 (40 GB HBM2 or 80 GB HBM2e), H100 (80 GB HBM3), and consumer options like the RTX 4090 (24 GB GDDR6X). Memory bandwidth — how fast data can be read from and written to VRAM — is just as important as capacity, because LLM inference is typically memory-bandwidth-bound rather than compute-bound. The H100's 3.35 TB/s of bandwidth is a major reason for its dominance in LLM serving.
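
    A minimal sketch of the usual back-of-the-envelope bound: during single-stream decoding, each generated token requires streaming roughly all of the weights from VRAM once, so bandwidth divided by model size gives a ceiling on tokens per second (real systems land below this, and batching changes the arithmetic):

    ```python
    def decode_tokens_per_sec_ceiling(n_params: float, bytes_per_param: float,
                                      bandwidth_bytes_per_sec: float) -> float:
        """Upper bound on single-stream decode speed when memory-bandwidth-bound:
        each generated token reads roughly all model weights from VRAM once."""
        model_bytes = n_params * bytes_per_param
        return bandwidth_bytes_per_sec / model_bytes

    # 7B model in FP16 on an H100 with ~3.35 TB/s of HBM3 bandwidth
    print(round(decode_tokens_per_sec_ceiling(7e9, 2, 3.35e12)))  # ~239 tokens/s ceiling
    ```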

    Why It Matters

    GPU memory is the single most important hardware constraint in LLM work. It determines whether a model can be loaded at all, how large a batch size can be used during training, how long a context window can be served during inference, and how many concurrent requests a serving system can handle. Every other optimization technique — quantization, LoRA, gradient accumulation, KV cache compression — exists fundamentally to work around GPU memory limitations.

    For teams on a budget, understanding GPU memory requirements is essential for hardware planning. The choice between a 7B and a 13B model, between LoRA and full fine-tuning, or between FP16 and 4-bit quantization ultimately comes down to how much VRAM is available and how efficiently it can be used. Making the wrong choice means either failing to load the model at all or wasting expensive GPU capacity.

    How It Works

    During training, GPU memory is allocated across several categories. Model parameters consume memory proportional to the parameter count times the bytes per parameter of the chosen precision (e.g., 7B parameters times 2 bytes for FP16 equals 14 GB). Gradients require roughly the same memory as the parameters. The Adam optimizer maintains two additional per-parameter states (first and second moment estimates), typically stored in FP32, which at least triples the parameter memory. Intermediate activations — the outputs of each layer saved for the backward pass — consume variable memory that grows with batch size and sequence length.
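
    Activation memory is the hardest term to pin down because it scales with batch size and sequence length. As a rough sketch, one published approximation (from Megatron-LM's activation-recomputation analysis) estimates the per-layer activations of a standard transformer at about s*b*h*(34 + 5*a*s/h) bytes in 16-bit precision, without checkpointing and with attention scores materialized; the model dimensions below are assumptions for a 7B-class configuration:

    ```python
    def activation_bytes_per_layer(seq_len: int, batch: int, hidden: int, heads: int) -> float:
        """Approximate per-layer activation memory for a standard transformer layer,
        16-bit precision, no gradient checkpointing, attention scores materialized
        (i.e., no FlashAttention). Formula: s*b*h*(34 + 5*a*s/h) bytes."""
        return seq_len * batch * hidden * (34 + 5 * heads * seq_len / hidden)

    # Hypothetical 7B-class config: hidden 4096, 32 heads, 32 layers, batch 4, 2048-token sequences
    per_layer = activation_bytes_per_layer(seq_len=2048, batch=4, hidden=4096, heads=32)
    print(round(per_layer * 32 / 1e9, 1), "GB")  # ~122.4 GB across 32 layers
    ```

    Gradient checkpointing and fused attention kernels cut this figure dramatically, which is why they are near-universal in full fine-tuning setups.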

    During inference, memory usage is dominated by model weights and the KV cache. The KV cache stores the key and value tensors computed for each token in the context, and it grows linearly with both context length and batch size. For long-context applications (32K+ tokens), the KV cache can exceed the model weight memory. Techniques like quantization reduce weight memory, while KV cache compression and paged attention (used by vLLM) optimize the cache memory footprint.
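
    A minimal sketch of that growth, using a hypothetical 7B-class model with full multi-head attention (32 layers, 32 KV heads, head dimension 128) and an FP16 cache; models that use grouped-query attention keep fewer KV heads and a proportionally smaller cache:

    ```python
    def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                    seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
        """KV cache size in GB: two tensors (K and V) per layer, per token, per sequence."""
        total = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
        return total / 1e9

    # Hypothetical 7B-class model with full multi-head attention, FP16 cache
    print(kv_cache_gb(32, 32, 128, seq_len=32_768, batch_size=1))  # ~17.2 GB, more than the weights
    print(kv_cache_gb(32, 32, 128, seq_len=32_768, batch_size=8))  # ~137 GB
    ```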

    Example Use Case

    A team wants to fine-tune Llama 3 8B on a server with 2x RTX 4090 GPUs (48 GB total VRAM). Full fine-tuning would require approximately 100 GB of VRAM — impossible on their hardware. By using QLoRA (4-bit quantized weights + LoRA adapters) and gradient checkpointing, they reduce the memory footprint to 18 GB, fitting comfortably on a single GPU and leaving room for a reasonable batch size of 4 with 2048-token sequences.
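
    A rough sketch of how that footprint breaks down, assuming NF4 (4-bit) base weights and a hypothetical adapter of about 40M trainable parameters; the gap up to the observed 18 GB is mostly activations, quantization constants, and framework overhead:

    ```python
    def qlora_footprint_gb(base_params: float, adapter_params: float) -> dict:
        """Very rough QLoRA memory breakdown: frozen 4-bit base weights plus
        trainable FP16 adapters with FP16 gradients and FP32 Adam moments.
        Activations, quantization constants, and CUDA overhead are not counted."""
        base_weights = base_params * 0.5 / 1e9            # ~0.5 bytes per NF4-quantized weight
        adapter = adapter_params * (2 + 2 + 4 + 4) / 1e9  # weights + grads + two Adam moments
        return {"base_weights_gb": base_weights, "adapter_and_optimizer_gb": adapter}

    # Llama 3 8B base with a hypothetical 40M-parameter LoRA adapter
    print(qlora_footprint_gb(8e9, 40e6))
    # {'base_weights_gb': 4.0, 'adapter_and_optimizer_gb': 0.48}
    ```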

    Key Takeaways

    • GPU memory (VRAM) is the primary hardware bottleneck for LLM training and inference.
    • A 7B model requires approximately 14 GB of VRAM for FP16 inference and roughly 84 GB for full FP16 fine-tuning with Adam, before counting activations.
    • Memory-efficient techniques like quantization, LoRA, and gradient checkpointing exist to overcome VRAM limitations.
    • Memory bandwidth is as important as capacity — LLM inference is typically bandwidth-bound.
    • Hardware planning around VRAM constraints is essential for cost-effective LLM deployment.

    How Ertas Helps

    Ertas Studio automatically estimates GPU memory requirements for each training configuration and recommends optimization settings like QLoRA and gradient accumulation to fit within available VRAM, making fine-tuning accessible on consumer and mid-range hardware.
