What is Learning Rate?

    A hyperparameter that controls how much the model's weights are adjusted in response to each batch of training data, directly influencing training speed and stability.

    Definition

    The learning rate is a scalar value — typically a small number like 1e-4 or 2e-5 — that determines the magnitude of weight updates during gradient descent. After each batch of training data, the model computes gradients that indicate the direction each weight should move to reduce the loss. The learning rate scales these gradients before they are applied: a larger learning rate means bigger steps (faster but riskier), while a smaller learning rate means smaller steps (slower but more stable).
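The effect of the learning rate on step size is easy to see on a toy problem. This is a minimal sketch using an illustrative one-parameter loss, f(w) = (w - 3)^2, not any real model:

```python
# Toy loss f(w) = (w - 3)^2 with minimum at w = 3; purely illustrative.

def grad(w):
    # Gradient of f(w) = (w - 3)^2 with respect to w.
    return 2 * (w - 3)

def train(lr, steps=50, w=0.0):
    for _ in range(steps):
        w = w - lr * grad(w)  # the learning rate scales each gradient step
    return w

print(train(0.1))  # moderate rate: w converges toward the minimum at 3
print(train(1.5))  # too high: every step overshoots and w diverges
```

With lr = 0.1, each step shrinks the distance to the minimum; with lr = 1.5, each step multiplies it, so the weight explodes — the same dynamic that, at model scale, destroys pre-trained knowledge.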

    In the context of LLM fine-tuning, the learning rate is arguably the most sensitive hyperparameter. Fine-tuning uses a much lower learning rate than pre-training — typically 10x to 100x smaller — because the goal is to gently adapt the model's existing knowledge rather than overwrite it. A learning rate that is too high can cause catastrophic forgetting, where the model loses its pre-trained capabilities. A learning rate that is too low wastes compute by making negligible progress per epoch.

    Modern fine-tuning pipelines typically use learning rate schedules that vary the rate over the course of training. Common schedules include cosine annealing (which gradually reduces the learning rate following a cosine curve), linear warmup followed by decay, and constant-with-warmup. These schedules let the model make large initial progress, then refine its weights more carefully as training approaches convergence.
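A linear-warmup-then-cosine-decay schedule can be sketched in a few lines. The peak rate and warmup fraction below are illustrative choices, not universal defaults:

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_frac=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear warmup: ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay: fall from peak_lr back toward 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=1000) for s in range(1000)]
```

The rate starts at zero, reaches its peak at the end of warmup (step 100 here), and decays smoothly toward zero — large steps early, careful steps late.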

    Why It Matters

    The learning rate is often the first hyperparameter practitioners tune because it has the most dramatic effect on training outcomes. An order-of-magnitude mistake in either direction can be the difference between a high-performing model and a completely broken one. For teams without deep ML expertise, understanding learning rate basics — and having sensible defaults — is critical to avoiding wasted compute and frustrating debugging sessions.

    How It Works

    During each training step, the optimizer multiplies the computed gradient by the learning rate to produce the actual weight update: new_weight = old_weight - learning_rate × gradient. For parameter-efficient methods like LoRA, the learning rate is applied only to the adapter weights (since the base model is frozen). Advanced optimizers like AdamW maintain per-parameter adaptive learning rates based on historical gradient statistics, but the base learning rate still acts as a global scaling factor. Learning rate schedulers then modify this base rate over time — for example, linearly warming up from zero over the first 10% of training steps to prevent early instability.
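The adaptive-optimizer point can be made concrete with a stripped-down, single-parameter Adam-style update. This is a simplified sketch (the hyperparameter values are the common published defaults, used here for illustration), not a full AdamW implementation — weight decay is omitted:

```python
import math

def adam_step(w, g, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update for a single parameter w with gradient g."""
    m = beta1 * m + (1 - beta1) * g       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g * g   # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)          # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    # The adaptive ratio m_hat / sqrt(v_hat) sets the per-parameter step
    # direction and shape; the base learning rate scales the whole thing.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, g=0.5, m=m, v=v, t=1)
```

Note that on the first bias-corrected step the ratio m_hat / sqrt(v_hat) is close to 1 regardless of the gradient's raw magnitude, so the step size is roughly the base learning rate itself — which is why the base rate remains the dominant knob even with an adaptive optimizer.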

    Example Use Case

    A data science team fine-tunes a Llama 3 8B model and initially sets the learning rate to 1e-3 (too high). After one epoch, the model produces incoherent outputs — catastrophic forgetting has destroyed the pre-trained knowledge. They restart with 2e-5, and after 3 epochs the model produces fluent, accurate domain-specific responses. They then experiment with 1e-4 using cosine annealing and find it converges in 2 epochs with marginally better validation scores, saving 33% of training time.

    Key Takeaways

    • The learning rate controls how aggressively model weights are updated during training.
    • Fine-tuning learning rates are typically 10–100x lower than pre-training rates (commonly 1e-5 to 1e-4).
    • Too high a learning rate causes catastrophic forgetting; too low wastes compute.
    • Learning rate schedules (cosine, linear warmup) help optimize the training trajectory.
    • The learning rate is usually the first and most impactful hyperparameter to tune.

    How Ertas Helps

    Ertas Studio provides sensible default learning rates tailored to each base model and training method (LoRA vs. QLoRA), so users can start training without needing to research optimal values. For advanced users, the visual configuration panel exposes learning rate, scheduler type, and warmup steps as adjustable parameters. Real-time training loss charts in Studio make it easy to diagnose learning rate issues — a spiking loss curve signals the rate is too high, while a flat curve suggests it is too low.
