What is Batch Size?
The number of training examples processed simultaneously in one forward-backward pass during model training, affecting memory usage, training speed, and convergence behavior.
Definition
Batch size is a fundamental training hyperparameter that determines how many examples from the training dataset are processed together in a single iteration of the training loop. After the model processes a batch, it computes the average loss across all examples in the batch and uses this to calculate gradients for updating the model's weights. Larger batch sizes provide more stable gradient estimates (because the average is computed over more examples) but require proportionally more GPU memory.
In LLM fine-tuning, batch size is particularly constrained by GPU memory. For a 7B model, the weights, gradients, and optimizer states already occupy a large, fixed share of VRAM, and the activations for each additional example in the batch can add several gigabytes more. As a result, practical batch sizes for fine-tuning are often small, typically 1 to 8 examples per GPU. To gain the stability benefits of larger effective batch sizes without exceeding memory limits, practitioners use gradient accumulation: processing several small batches sequentially and accumulating their gradients before performing a single weight update. For example, an effective batch size of 32 can be achieved by accumulating the gradients of 4 batches of 8 examples.
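To make the arithmetic concrete, here is a small illustrative helper; the function name and signature are hypothetical and not part of any particular training framework:

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Effective batch size = examples per device x accumulation steps x devices."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

# 4 micro-batches of 8 examples accumulated before each optimizer step -> 32
print(effective_batch_size(per_device_batch_size=8, gradient_accumulation_steps=4))  # 32
```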
The relationship between batch size and learning rate is well-established: larger effective batch sizes generally require higher learning rates to maintain training dynamics, following the linear scaling rule. This means that when adjusting batch size, the learning rate often needs to be adjusted proportionally to maintain stable convergence. The interplay between these two hyperparameters is one of the key considerations in fine-tuning configuration.
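As a sketch of how the linear scaling rule translates into practice; the function and the example values below are illustrative, not recommendations for any specific model:

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling rule: lr_new = lr_base * (batch_new / batch_base)."""
    return base_lr * (new_batch_size / base_batch_size)

# Doubling the effective batch size from 16 to 32 doubles the learning rate.
print(scale_learning_rate(base_lr=1e-5, base_batch_size=16, new_batch_size=32))  # 2e-05
```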
Why It Matters
Batch size directly affects three critical aspects of training: memory consumption (larger batches need more VRAM), training speed (larger batches leverage GPU parallelism more efficiently), and convergence behavior (larger batches give smoother gradient estimates but may converge to sharper minima that generalize worse). For teams working with limited GPU resources — which includes most fine-tuning practitioners — understanding batch size trade-offs is essential for maximizing the quality of results within hardware constraints. Getting the batch size wrong can lead to out-of-memory errors, wasted training time, or suboptimal model quality.
How It Works
In each training step, the data loader selects batch_size examples from the training dataset. These examples are tokenized, padded to the same length, and stacked into a tensor. The model performs a forward pass on the entire batch simultaneously (leveraging GPU parallelism), computing predictions for all examples at once. The loss function computes the error for each example and averages them. The backward pass computes gradients of this averaged loss with respect to all trainable parameters. When gradient accumulation is used, these gradients are added to a running buffer for N steps before the optimizer applies the accumulated gradients to update the weights. The effective batch size is then batch_size × gradient_accumulation_steps.
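A minimal PyTorch sketch of this loop, using a toy model and synthetic data in place of an LLM; the key detail is dividing each micro-batch loss by the number of accumulation steps so the accumulated gradient matches the average over the full effective batch:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for a language model: the accumulation pattern is the same.
model = nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# 64 synthetic examples, micro-batches of 8, accumulated over 4 steps -> effective batch 32.
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
accumulation_steps = 4

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    logits = model(inputs)                                 # forward pass on the whole micro-batch
    loss = loss_fn(logits, targets) / accumulation_steps   # scale so accumulated grads average correctly
    loss.backward()                                        # gradients add into the existing .grad buffers

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()        # one weight update per effective batch
        optimizer.zero_grad()   # clear the accumulated gradients
```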
Example Use Case
A team fine-tuning a 13B model on a single NVIDIA A100 40GB GPU (using a parameter-efficient method so the weights and optimizer states fit in memory) plans for an effective batch size of 16, but finds that even a per-device batch size of 4 causes out-of-memory errors. They reduce the per-device batch size to 1 and set gradient accumulation steps to 8, achieving an effective batch size of 8 while staying within VRAM limits. Because this is half the effective batch size they originally planned, they follow the linear scaling rule and halve the learning rate from 2e-5 to 1e-5. Training completes successfully, and the model reaches their target validation accuracy after 3 epochs.
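If the team were using the Hugging Face Trainer, the relevant configuration might look roughly like the sketch below; the values mirror the scenario above rather than being universal recommendations, and model and dataset setup are omitted:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=1,   # small micro-batch that fits within 40 GB of VRAM
    gradient_accumulation_steps=8,   # effective batch size = 1 x 8 = 8
    learning_rate=1e-5,              # scaled down from 2e-5 per the linear scaling rule
    num_train_epochs=3,
    bf16=True,                       # mixed precision to reduce activation memory
)
```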
Key Takeaways
- Batch size determines how many examples are processed per training step.
- Larger batches give more stable gradients but require more GPU memory.
- Gradient accumulation simulates larger batches within memory constraints.
- Batch size and learning rate should be adjusted together (linear scaling rule).
- Practical fine-tuning batch sizes are typically 1–8 per GPU, with gradient accumulation for larger effective sizes.
How Ertas Helps
Ertas Studio automatically configures batch size and gradient accumulation based on the selected model size and available GPU resources in Ertas Cloud. Users who want manual control can adjust both parameters in the advanced hyperparameter panel. The platform prevents out-of-memory errors by estimating VRAM requirements before training starts and suggesting appropriate batch size configurations, making it safe for users without deep hardware knowledge to train models reliably.