What Is a Parameter?

    A learnable value in a neural network — including weights and biases — that the model adjusts during training to minimize prediction error.

    Definition

    In machine learning, a parameter is any value within a model that is learned from data during training. Parameters include weights (which scale input features) and biases (which shift activation values), and together they define the model's behavior. When practitioners refer to a '7B parameter model,' they mean the model contains approximately 7 billion individually learnable values that were adjusted through training on a large corpus.
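The definition above can be made concrete by counting the learnable values in a single fully connected layer. This is a minimal sketch with a hypothetical layer size, not any specific model:

```python
def linear_layer_params(in_features: int, out_features: int) -> int:
    """Weights scale inputs (an in x out matrix); biases shift outputs (one per output)."""
    weights = in_features * out_features
    biases = out_features
    return weights + biases

# A toy layer mapping 4096 inputs to 4096 outputs:
print(linear_layer_params(4096, 4096))  # 16781312 learnable values
```

A "7B parameter model" is simply the sum of counts like this across all of its layers.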

    The parameter count of a language model is its most commonly cited specification because it correlates strongly with model capability. Research has consistently shown that, given sufficient training data, larger models (more parameters) learn more nuanced representations, demonstrate better reasoning, and perform better on downstream tasks. This relationship is described by neural scaling laws, which predict that model loss decreases smoothly as a power law in parameter count.
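The power-law form can be sketched in a few lines. The reference scale and exponent below are illustrative placeholders for whatever a given scaling-law fit produces, not measured values:

```python
def predicted_loss(n_params: float, n_ref: float = 8.8e13, alpha: float = 0.076) -> float:
    """Loss falls as a power law in parameter count: L(N) = (N_ref / N) ** alpha.
    n_ref and alpha are hypothetical fit constants for illustration."""
    return (n_ref / n_params) ** alpha

# Doubling parameters lowers predicted loss by a constant factor:
ratio = predicted_loss(14e9) / predicted_loss(7e9)
print(ratio < 1.0)  # True: larger model, lower predicted loss
```

The key qualitative point is the constant multiplicative improvement per doubling, which is why returns diminish in absolute terms as models grow.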

    However, parameter count alone does not determine model quality. Training data quality, training duration (measured in tokens seen), architecture choices, and post-training alignment all significantly impact the final model. A well-trained 7B parameter model can outperform a poorly trained 13B model. Additionally, not all parameters contribute equally — Mixture of Experts architectures have large total parameter counts but only activate a fraction per input, and LoRA fine-tuning adds a small number of high-impact parameters rather than modifying all existing ones.

    Why It Matters

    Parameter count is the primary factor determining a model's hardware requirements. Each parameter must be stored in memory during inference (at the chosen precision), and during training, additional memory is needed for gradients and optimizer states — typically 4-8x the weight memory. A 7B parameter model requires approximately 14 GB for inference in FP16 and 56-112 GB for training. These requirements directly dictate hardware costs and deployment feasibility.
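The figures above are straightforward back-of-envelope arithmetic. This sketch uses the 2-bytes-per-parameter FP16 assumption and the 4-8x training multiplier from the text; exact numbers vary with precision and optimizer choice:

```python
def inference_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """FP16/BF16 inference: 2 bytes per parameter, weights only."""
    return n_params * bytes_per_param / 1e9

def training_gb(n_params: float, multiplier: int = 4) -> float:
    """Training adds gradients and optimizer state: roughly 4-8x weight memory."""
    return inference_gb(n_params) * multiplier

print(inference_gb(7e9))     # 14.0 GB for a 7B model in FP16
print(training_gb(7e9, 4))   # 56.0 GB (low end of training estimate)
print(training_gb(7e9, 8))   # 112.0 GB (high end)
```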

    For practitioners, understanding the relationship between parameters, quality, and cost enables informed model selection. A 3B parameter model fine-tuned on domain data may outperform a general-purpose 13B model for specific tasks, while being 4x cheaper to deploy. This trade-off between parameter count and specialization is at the heart of the fine-tuning value proposition.

    How It Works

    Parameters are organized into tensors (multi-dimensional arrays) that correspond to specific model components. In a transformer, key parameter groups include embedding matrices (vocabulary size times hidden dimension), attention projection matrices (4 matrices per layer of hidden dimension squared), feed-forward network matrices (2 per layer, typically hidden dimension times 4x hidden dimension), and layer normalization parameters (2 small vectors per layer).
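The parameter groups listed above can be totaled directly. The dimensions below are a hypothetical 7B-class configuration chosen for illustration, not any specific model:

```python
def transformer_params(vocab: int, d: int, layers: int) -> int:
    """Sum the parameter groups of a simplified transformer."""
    embeddings = vocab * d             # embedding matrix: vocab size x hidden dim
    attention = 4 * d * d              # Q, K, V, and output projections per layer
    ffn = 2 * d * (4 * d)              # up- and down-projection per layer
    layer_norms = 2 * d                # two small normalization vectors per layer
    per_layer = attention + ffn + layer_norms
    return embeddings + layers * per_layer

total = transformer_params(vocab=32_000, d=4096, layers=32)
print(f"{total / 1e9:.2f}B parameters")  # 6.57B parameters
```

Note how the embedding matrix is a one-off cost while attention and feed-forward matrices dominate because they repeat in every layer.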

    During training, each parameter is updated through gradient descent. The gradient — computed via backpropagation — indicates the direction and magnitude of change that would reduce the loss. The optimizer applies the gradient (potentially with momentum and adaptive learning rates) to produce the new parameter value. This process repeats over millions of optimization steps across billions of training tokens, gradually shaping the parameters into a configuration that produces useful outputs.
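The update rule above can be sketched as plain stochastic gradient descent on a toy one-parameter problem (real optimizers such as Adam add momentum and adaptive learning rates on top of this):

```python
def sgd_step(param: float, gradient: float, lr: float) -> float:
    """Move the parameter against the gradient to reduce the loss."""
    return param - lr * gradient

# Toy loss: (w - 3)**2, so the gradient is 2 * (w - 3).
w = 0.0
for _ in range(200):
    grad = 2 * (w - 3.0)
    w = sgd_step(w, grad, lr=0.1)
print(round(w, 4))  # 3.0 — the parameter converges to the loss minimum
```

A real training run does exactly this, simultaneously, for every one of the billions of parameters.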

    Example Use Case

    A startup evaluates three model sizes for their customer support chatbot: 3B, 7B, and 13B parameters. The 3B model runs on a single consumer GPU but produces mediocre responses. The 13B model is excellent but requires an expensive A100 GPU. The 7B model, fine-tuned on 5,000 domain-specific examples, matches the 13B model's quality on support tasks while running on an affordable RTX 4090. They choose the fine-tuned 7B, trading parameter count for task-specific specialization.

    Key Takeaways

    • Parameters are all learnable values in a model — weights and biases — adjusted during training.
    • Parameter count is the primary specification for LLM scale, following neural scaling laws.
    • More parameters generally mean better capability but also higher memory and compute costs.
    • Fine-tuning can make smaller-parameter models competitive with larger general-purpose models on specific tasks.
    • Training requires 4-8x more memory per parameter than inference due to gradients and optimizer states.

    How Ertas Helps

    Ertas Studio displays parameter counts and memory requirements for each base model, helping users choose the right model size for their hardware and use case. LoRA fine-tuning in Studio adds only a small fraction of new parameters, making large model customization accessible.
