What is a Checkpoint?

    A saved snapshot of a model's weights and training state at a specific point during training, enabling recovery, evaluation, and selection of the best-performing version.

    Definition

    A checkpoint is a serialized snapshot of a model's complete state at a particular point during training. This includes the model weights, optimizer states (momentum and variance buffers for Adam), the learning rate scheduler state, the current step and epoch number, and the random number generator state. Saving checkpoints at regular intervals serves multiple purposes: crash recovery (resume training after hardware failure), model selection (choose the best-performing version based on validation metrics), and experiment management (compare models from different stages of training).
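
    As a minimal PyTorch sketch, the full training state can be bundled into one dictionary and serialized. The function and key names here are illustrative, not a standard:

        import torch

        def save_checkpoint(path, model, optimizer, scheduler, step, epoch):
            # Bundle every piece of training state into a single serializable dict.
            torch.save({
                "model": model.state_dict(),          # weights
                "optimizer": optimizer.state_dict(),  # e.g. Adam momentum/variance buffers
                "scheduler": scheduler.state_dict(),  # learning rate schedule position
                "step": step,
                "epoch": epoch,
                "rng": {                              # RNG states, for reproducible resumption
                    "torch": torch.get_rng_state(),
                    "cuda": torch.cuda.get_rng_state_all(),
                },
            }, path)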

    In LLM fine-tuning, checkpoints are typically saved at the end of each epoch and optionally at fixed step intervals. Each checkpoint represents a complete model that can be loaded for inference or further training. Because checkpoints capture the full optimizer state, they enable exact resumption of training: provided the RNG states and data ordering are also restored, continued training produces results identical to an uninterrupted run.
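
    A matching resume step might look like the following sketch, which assumes the dictionary layout from the save example above:

        import torch

        def load_checkpoint(path, model, optimizer, scheduler):
            ckpt = torch.load(path, map_location="cpu")  # device-agnostic load
            model.load_state_dict(ckpt["model"])
            optimizer.load_state_dict(ckpt["optimizer"])
            scheduler.load_state_dict(ckpt["scheduler"])
            # Restoring RNG state is what makes resumption reproducible.
            torch.set_rng_state(ckpt["rng"]["torch"])
            torch.cuda.set_rng_state_all(ckpt["rng"]["cuda"])
            return ckpt["step"], ckpt["epoch"]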

    Checkpoint management becomes a significant operational concern at scale. A single checkpoint for a 7B model can be 14-28 GB depending on precision, and with optimizer states included, the total rises to 56-112 GB. Training runs that save checkpoints every 500 steps can accumulate terabytes of checkpoint data. Teams must implement retention policies — for example, keeping only the best 3 checkpoints by validation loss and the most recent checkpoint for crash recovery.
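
    One way such a retention policy might be implemented (a hypothetical sketch; the list of checkpoint directories and their validation losses is assumed to be tracked elsewhere):

        import shutil

        def prune_checkpoints(records, keep_best=3):
            """records: list of (checkpoint_dir, val_loss) pairs, oldest first."""
            best = sorted(records, key=lambda r: r[1])[:keep_best]  # lowest val loss
            keep = {d for d, _ in best} | {records[-1][0]}          # plus the most recent
            for d, _ in records:
                if d not in keep:
                    shutil.rmtree(d)  # delete the rest to bound storage growth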

    Why It Matters

    Without checkpoints, any interruption in training — hardware failure, preemption on shared compute, accidental process termination — means restarting from scratch. For LLM fine-tuning jobs that run for hours or days, this represents a substantial waste of compute and time. Checkpoints convert training from an all-or-nothing operation into a recoverable, incremental process.

    Beyond recovery, checkpoint-based model selection is a critical quality technique. Models often achieve their best validation performance partway through training before overfitting to the training data. By saving checkpoints at regular intervals and evaluating each on a validation set, teams can select the checkpoint that generalizes best rather than defaulting to the final training state.
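
    The selection step itself is simple. A sketch, assuming a hypothetical evaluate function that loads a checkpoint and returns its validation loss:

        def select_best(checkpoint_paths, evaluate):
            # Score each saved checkpoint on the held-out validation set,
            # then keep the one with the lowest loss.
            scores = {path: evaluate(path) for path in checkpoint_paths}
            return min(scores, key=scores.get)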

    How It Works

    Checkpoint saving is integrated into the training loop. At configured intervals — every N steps, every epoch, or triggered by validation metric improvements — the trainer serializes the model state to disk. Modern training frameworks like Hugging Face Transformers, PyTorch Lightning, and Axolotl all support automatic checkpoint management with configurable strategies.
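
    With Hugging Face Transformers, for example, these strategies are plain configuration. A sketch (argument names can vary slightly across library versions; older releases spell eval_strategy as evaluation_strategy):

        from transformers import TrainingArguments

        args = TrainingArguments(
            output_dir="./checkpoints",
            save_strategy="steps",             # checkpoint every N steps...
            save_steps=500,                    # ...here, every 500
            save_total_limit=3,                # retention: keep only the 3 newest
            eval_strategy="steps",             # evaluate on the same cadence
            eval_steps=500,
            load_best_model_at_end=True,       # checkpoint-based model selection
            metric_for_best_model="eval_loss",
            greater_is_better=False,
        )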

    Checkpoint loading reverses the process: the serialized state is deserialized and loaded into the model and optimizer objects. For inference-only use, only the model weights need to be loaded — the optimizer states can be discarded, reducing the memory footprint. Some frameworks support checkpoint sharding, where large checkpoints are split across multiple files for parallel I/O, reducing save and load times for very large models.
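
    An inference-only load, sketched under the same assumed checkpoint layout as above (the model architecture must be constructed before its weights are loaded):

        import torch

        def load_for_inference(path, model):
            ckpt = torch.load(path, map_location="cpu")
            model.load_state_dict(ckpt["model"])  # weights only
            del ckpt  # discard the rest of the dict, including optimizer state
            return model.eval()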

    Example Use Case

    A team fine-tuning a 13B model runs training for 5 epochs with checkpoint saving at each epoch. Validation loss improves through epoch 3 but degrades in epochs 4 and 5 due to overfitting. They select the epoch 3 checkpoint as their production model, achieving 8% better performance than the epoch 5 model. Without checkpointing, they would have deployed the overfitted epoch 5 model or been forced to re-run training with different settings.

    Key Takeaways

    • Checkpoints are serialized snapshots of model weights and training state.
    • They enable crash recovery, model selection, and experiment tracking.
    • Checkpoint-based model selection often yields better results than using the final training state.
    • Storage management is important — a single checkpoint can be tens of gigabytes.
    • Modern training frameworks automate checkpoint saving, loading, and retention policies.

    How Ertas Helps

    Ertas Studio automatically saves checkpoints during fine-tuning and lets users compare validation metrics across checkpoints to select the best-performing model version for GGUF export and deployment.
