What is Weight?
A numerical parameter in a neural network that is learned during training and determines how the model transforms input data into output predictions.
Definition
In neural networks, a weight is a learnable numerical value that controls the strength of the connection between neurons (or more precisely, between input features and output activations in a layer). The collection of all weights in a model constitutes the model's learned knowledge — when we say a model 'knows' something, we mean that information is encoded in the specific values of its millions or billions of weights.
In transformer-based language models, weights are organized into matrices that perform specific functions. Attention weights (query, key, value, and output projection matrices) determine how tokens attend to each other. Feed-forward weights (two large matrices per transformer layer) apply non-linear transformations that encode factual and linguistic knowledge. Embedding weights map token IDs to dense vectors, and the language model head weights map hidden states back to vocabulary probabilities for next-token prediction.
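The per-layer weight breakdown above can be sketched with some quick arithmetic. The dimensions below are illustrative (loosely 7B-class), and the count omits biases, layer norms, and gated feed-forward variants:

```python
# Hypothetical dimensions for one transformer layer (illustrative only).
d_model = 4096   # hidden size
d_ff = 11008     # feed-forward inner size

# Attention projections: query, key, value, output — each d_model x d_model.
attention_params = 4 * d_model * d_model

# Feed-forward block: two large matrices (up- and down-projection).
ffn_params = 2 * d_model * d_ff

per_layer = attention_params + ffn_params
print(per_layer)  # 157286400 — roughly 157M weights in a single layer
```

Multiplying by the number of layers (often 32 for a 7B-class model) and adding embedding and output-head weights gets you close to the headline parameter count.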
The number of weights in a model — its parameter count — is the primary measure of model scale. A 7B model has 7 billion weights, a 70B model has 70 billion. Larger parameter counts generally enable greater knowledge capacity and reasoning ability, but also require proportionally more memory and compute for training and inference. Each weight is stored as a floating-point number (typically 16-bit during training, often quantized to 4-8 bits for inference), so a 7B model requires approximately 14 GB in FP16 precision.
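The memory figure follows directly from the parameter count and precision. A minimal sketch of that calculation:

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed just to hold the weights themselves.
    Ignores activations, KV cache, and runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9

print(weight_memory_gb(7e9, 16))  # FP16: 14.0 GB
print(weight_memory_gb(7e9, 4))   # INT4: 3.5 GB
```

This is why quantizing a 7B model from FP16 to 4 bits brings it within reach of consumer GPUs and laptops.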
Why It Matters
Weights are quite literally what you are paying for when you train or fine-tune a model. Pre-training initializes weights randomly and then adjusts them across trillions of training tokens until they encode useful representations of language and knowledge. Fine-tuning further adjusts these weights (or a subset, in the case of LoRA) to encode domain-specific knowledge and behaviors. The entire ML pipeline exists to produce a set of weights that transform inputs into useful outputs.
Understanding weights is essential for practical decisions. Weight precision (FP16 vs FP32 vs INT4) determines memory requirements and inference speed. Weight formats (SafeTensors, GGUF, PyTorch checkpoints) determine compatibility with different inference engines. Weight licensing determines how a model can be used commercially. When practitioners discuss quantization, pruning, or model merging, all of these are operations on weights.
How It Works
Weights are initialized before training using strategies designed to prevent vanishing or exploding gradients — commonly Xavier or Kaiming initialization, which scale initial values based on layer dimensions. During training, the forward pass uses current weight values to compute predictions, the loss function measures prediction error, and backpropagation computes the gradient of the loss with respect to each weight. The optimizer then updates each weight by a small amount in the direction that reduces the loss.
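The forward-loss-backward-update cycle above can be shown in miniature with a single weight and a single training example (toy values, no framework):

```python
# Minimal sketch: one gradient-descent step on a single weight.
# Model: y_hat = w * x; loss = (y_hat - y) ** 2 (squared error).
w = 0.5          # current weight (imagine it was initialized randomly)
x, y = 2.0, 3.0  # one training example
lr = 0.1         # learning rate

# Forward pass: use the current weight to compute a prediction.
y_hat = w * x

# Loss function measures prediction error.
loss = (y_hat - y) ** 2

# Backpropagation: d(loss)/dw = 2 * (y_hat - y) * x  (chain rule).
grad = 2 * (y_hat - y) * x

# Optimizer step: nudge the weight in the direction that reduces the loss.
w = w - lr * grad
print(w)  # 1.3 — closer to the value (1.5) that fits this example exactly
```

A real optimizer repeats this for billions of weights across trillions of tokens, but each individual update is exactly this shape.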
After training, weights are serialized to disk in a model file. Different serialization formats store weights differently: PyTorch's .bin format uses Python's pickle serialization, SafeTensors uses a memory-mapped format with integrity checks, and GGUF stores quantized weights in a format optimized for CPU inference. When a model is loaded for inference, the weight tensors are deserialized and placed in GPU (or CPU) memory, where they remain fixed as the model processes inputs.
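At its core, serialization turns weight tensors into raw numeric buffers on disk. The toy sketch below uses Python's `struct` module to make that concrete; real formats like SafeTensors and GGUF add headers, dtype tags, and per-tensor metadata on top of the same idea:

```python
import struct

# Toy serialization: pack float32 weights to bytes and read them back.
weights = [0.25, -1.5, 3.0]

# '<3f' = three little-endian 32-bit floats; 4 bytes per weight.
blob = struct.pack(f"<{len(weights)}f", *weights)
restored = list(struct.unpack(f"<{len(blob) // 4}f", blob))

print(len(blob))   # 12 bytes on "disk"
print(restored)    # [0.25, -1.5, 3.0] — these values survive float32 exactly
```

The values chosen here are exactly representable in float32, so the round trip is lossless; in general, serialization preserves weights at whatever precision the format stores.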
Example Use Case
A team trains a LoRA adapter on a 7B base model, producing 20 million new adapter weights (compared to the base model's 7 billion). During inference, the adapter weights are merged with the base model weights by adding the low-rank weight updates to the original matrices. The merged model behaves identically to having the adapter loaded separately but eliminates the overhead of managing two sets of weights at runtime, simplifying deployment.
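The merge step described above is just matrix addition: the low-rank product is added to the frozen base matrix. A minimal sketch with tiny hypothetical matrices (rank r = 1, scaling factor standing in for alpha/r):

```python
# Sketch of a LoRA merge: W_merged = W + scale * (B @ A),
# where B is (d x r) and A is (r x k). All values are hypothetical.
scale = 2.0                # alpha / r scaling factor
W = [[1.0, 0.0],
     [0.0, 1.0]]           # frozen base weight matrix (2 x 2)
B = [[1.0], [0.0]]         # low-rank factor, d x r with r = 1
A = [[0.5, 0.5]]           # low-rank factor, r x k

# Add the low-rank update into the base matrix in place.
for i in range(2):
    for j in range(2):
        update = sum(B[i][r] * A[r][j] for r in range(1))
        W[i][j] += scale * update

print(W)  # [[2.0, 1.0], [0.0, 1.0]]
```

After the merge, the adapter matrices can be discarded: the combined behavior now lives entirely in the base-shaped weight matrix, which is why deployment gets simpler.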
Key Takeaways
- Weights are learnable numerical parameters that encode a model's knowledge and capabilities.
- A model's parameter count (number of weights) determines its scale, capacity, and resource requirements.
- Weight precision (FP16, INT4, etc.) controls the trade-off between memory usage and model quality.
- Fine-tuning adjusts weights to encode domain-specific knowledge on top of pre-trained representations.
- Weight serialization formats (SafeTensors, GGUF, PyTorch) determine compatibility with inference engines.
How Ertas Helps
Ertas Studio manages model weights throughout the fine-tuning lifecycle — loading base model weights, training adapter weights via LoRA, optionally merging adapters, and exporting final weights in GGUF format for local deployment.