What is Inference?
The process of running a trained AI model to generate predictions or outputs from new input data, as opposed to the training phase where the model learns from data.
Definition
Inference is the production phase of a machine learning model's lifecycle — the point at which the model applies what it learned during training to generate useful outputs from new, unseen inputs. For large language models, inference means processing a user's prompt through the model's transformer layers to produce a sequence of tokens (words or sub-words) that form a coherent response. While training happens once (or periodically), inference happens continuously for every user request, making it the primary driver of ongoing operational cost and the main determinant of end-user experience.
Inference performance is measured along several axes: latency (time to first token and total generation time), throughput (requests per second or tokens per second), and cost per token. These metrics are influenced by model size, quantization level, hardware (GPU vs. CPU, memory bandwidth), batching strategy, and the serving runtime. A 70B-parameter model in FP16 might deliver exceptional quality but require multiple A100 GPUs, while the same model quantized to 4-bit GGUF format might run on a single RTX 4090 with acceptable quality and dramatically lower cost.
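The three metrics above can be related with simple arithmetic. The sketch below is illustrative only — the request profile and hourly GPU cost are assumed figures, not benchmarks:

```python
# Back-of-the-envelope inference metrics. All input figures are assumptions
# for illustration, not measured numbers.

def inference_metrics(ttft_s, gen_tokens, gen_time_s, cost_per_hour, requests_per_hour):
    """Compute total latency, decode throughput, and cost per token
    for a single request profile on a fixed-cost GPU."""
    total_latency = ttft_s + gen_time_s           # time to first token + generation time
    tokens_per_s = gen_tokens / gen_time_s        # decode throughput
    cost_per_request = cost_per_hour / requests_per_hour
    cost_per_token = cost_per_request / gen_tokens
    return total_latency, tokens_per_s, cost_per_token

# Assumed profile: 180 ms to first token, 90 tokens generated in 2 s,
# $1.00/hour of GPU time serving 1,800 requests/hour.
latency, tps, cpt = inference_metrics(
    ttft_s=0.18, gen_tokens=90, gen_time_s=2.0,
    cost_per_hour=1.00, requests_per_hour=1800,
)
print(f"latency={latency:.2f}s  throughput={tps:.0f} tok/s  cost=${cpt:.8f}/token")
```

Batching shifts these numbers: larger batches raise throughput (and lower cost per token) at the expense of per-request latency.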
Modern inference optimization is a rich field encompassing techniques like KV-cache management, continuous batching, speculative decoding, tensor parallelism, and PagedAttention (used by vLLM). The choice of inference stack — whether llama.cpp for local CPU/GPU inference, vLLM for high-throughput GPU serving, or Ollama for developer-friendly local deployment — can make a 5-10x difference in performance for the same model.
Why It Matters
Inference is where AI models deliver value to end users, and its cost often exceeds training cost over a model's lifetime by an order of magnitude. A model that is brilliant but takes 30 seconds to respond will be abandoned; one that is fast but inaccurate will erode trust. Getting inference right means balancing quality, speed, and cost — a triad that depends on smart choices about model size, quantization, hardware, and serving infrastructure. For organizations deploying AI at scale, inference optimization directly impacts margins, user satisfaction, and competitive positioning.
How It Works
When a user submits a prompt, the inference pipeline first tokenizes the input text into a sequence of integer token IDs using the model's vocabulary. These tokens pass through the model's embedding layer to become dense vectors, then flow through dozens of transformer layers — each applying self-attention and feed-forward operations. For autoregressive generation, the model produces one token at a time: after generating each token, it appends that token to the input sequence and runs another forward pass (using a KV-cache to avoid recomputing attention for previous tokens). This loop continues until the model emits a stop token or reaches the maximum output length. The resulting token IDs are decoded back into text and returned to the user. Optimizations like continuous batching allow the server to interleave multiple requests in a single batch, maximizing GPU utilization.
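The prefill-then-decode loop described above can be sketched with a toy stand-in for the model. Here the "model" is just a lookup table and the KV-cache is a plain list — the structure of the loop, not the forward pass, is the point:

```python
# Toy sketch of autoregressive decoding with a KV-cache. toy_model is a
# stand-in (a lookup table), not a real transformer.

EOS = "<eos>"

def toy_model(token, kv_cache):
    """Stand-in forward pass: predicts the next token from the current one.
    A real model would attend over the cached keys/values of all prior tokens."""
    kv_cache.append(token)  # cache this step's keys/values (simulated)
    return {"How": "can", "can": "I", "I": "help", "help": "?"}.get(token, EOS)

def generate(prompt_tokens, max_new_tokens=10):
    kv_cache = []
    # Prefill: run the prompt through the model once, filling the cache.
    for tok in prompt_tokens[:-1]:
        toy_model(tok, kv_cache)
    token = prompt_tokens[-1]
    output = []
    # Decode: one token per forward pass, reusing the cache instead of
    # recomputing attention over earlier tokens.
    for _ in range(max_new_tokens):
        token = toy_model(token, kv_cache)
        if token == EOS:  # stop token ends generation
            break
        output.append(token)
    return output

print(generate(["How"]))  # -> ['can', 'I', 'help', '?']
```

In a real server this loop runs per request, and continuous batching interleaves the decode steps of many requests in each forward pass.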
import requests

# Query an Ertas Cloud inference endpoint
response = requests.post(
    "https://api.ertas.ai/v1/completions",
    headers={"Authorization": "Bearer ert_sk_..."},
    json={
        "model": "my-org/support-assistant-v2",
        "prompt": "How do I upgrade my subscription plan?",
        "max_tokens": 256,
        "temperature": 0.3,
        "stream": True,
    },
    stream=True,
)

# Stream tokens as they are generated
for chunk in response.iter_lines():
    if chunk:
        print(chunk.decode(), end="", flush=True)

Example Use Case
An e-commerce company deploys a fine-tuned 7B model to power its product recommendation chatbot. During peak hours, the system handles 200 concurrent users. By serving the model in Q4_K_M GGUF format via a llama.cpp-based backend with continuous batching, they achieve a median time-to-first-token of 180ms and a generation speed of 45 tokens per second on a single A10G GPU — meeting their latency SLA of under 2 seconds for typical responses while keeping infrastructure costs below $0.001 per interaction.
Key Takeaways
- Inference is the production phase where a trained model generates outputs from new inputs — it is where AI delivers user value.
- Latency, throughput, and cost-per-token are the three key metrics for evaluating inference performance.
- Quantization (e.g., GGUF Q4_K_M) can reduce inference costs by 4-8x with modest quality trade-offs.
- The choice of serving runtime (llama.cpp, vLLM, Ollama) has a major impact on performance characteristics.
- Inference cost typically exceeds training cost over a model's lifetime, making optimization critical for production deployments.
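The quantization takeaway can be made concrete with a rough weight-memory estimate. The figures below ignore the KV-cache, activations, and runtime overhead, and assume roughly 4.5 bits per weight for Q4_K_M (a common approximation):

```python
# Rough weight-memory estimate for a 7B-parameter model at different
# precisions. Ignores KV-cache, activations, and runtime overhead.

PARAMS = 7e9

def weight_gb(bits_per_param):
    """Approximate weight memory in GB for PARAMS parameters."""
    return PARAMS * bits_per_param / 8 / 1e9

fp16 = weight_gb(16)   # ~14 GB: needs a data-center-class GPU
q4 = weight_gb(4.5)    # ~3.9 GB: fits on a single consumer GPU
print(f"FP16: {fp16:.1f} GB, Q4_K_M: {q4:.1f} GB ({fp16 / q4:.1f}x smaller)")
```

This is why the same 7B model that demands an A100 in FP16 can run comfortably on a consumer card once quantized.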
How Ertas Helps
Ertas Cloud provides managed inference endpoints for models fine-tuned in Ertas Studio. Users deploy a model with a single click and receive an API endpoint compatible with the OpenAI API format, making integration straightforward. Under the hood, Ertas Cloud automatically selects the optimal serving runtime, quantization level, and hardware tier based on the model's size and the user's latency and throughput requirements. Auto-scaling ensures that endpoints handle traffic spikes without manual intervention, while Ertas Vault guarantees that inference data is processed in compliance with the organization's privacy policies — no prompts or completions are logged unless explicitly opted in.
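Because the endpoints follow the OpenAI API format, the official openai Python SDK can target them by overriding the base URL. The endpoint URL and model name below are illustrative assumptions, not documented values:

```python
from openai import OpenAI

# Hypothetical Ertas Cloud endpoint; the base_url and model name here
# are assumptions for illustration.
client = OpenAI(
    base_url="https://api.ertas.ai/v1",
    api_key="ert_sk_...",
)

response = client.chat.completions.create(
    model="my-org/support-assistant-v2",
    messages=[{"role": "user", "content": "How do I upgrade my subscription plan?"}],
    max_tokens=256,
    temperature=0.3,
)
print(response.choices[0].message.content)
```

Pointing an existing OpenAI-based integration at a compatible endpoint like this typically requires no code changes beyond the base URL and API key.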
Related Resources
Batch Size
Context Window
Fine-Tuning
GGUF
JSONL
LoRA
Model Routing
Multi-Tenant Inference
Quantization
Temperature
Tokenizer
Top-p (Nucleus Sampling)
Getting Started with Ertas: Fine-Tune and Deploy Custom AI Models
Privacy-Conscious AI Development: Fine-Tune in the Cloud, Run on Your Terms
Running AI Models Locally: The Complete Guide to Local LLM Inference
The Hidden Cost of Per-Token AI Pricing
Multi-Tenant AI Deployment: One Base Model, Dozens of Client Adapters
GPT4All
Hugging Face
Jan
KoboldCpp
llama.cpp
LM Studio
Ollama
vLLM
Ertas for Healthcare
Ertas for SaaS Product Teams
Ertas for Customer Support
Ertas for E-Commerce
Ertas for Content Creation
Ertas for AI Automation Agencies