What is Inference?

    The process of running a trained AI model to generate predictions or outputs from new input data, as opposed to the training phase where the model learns from data.

    Definition

    Inference is the production phase of a machine learning model's lifecycle — the point at which the model applies what it learned during training to generate useful outputs from new, unseen inputs. For large language models, inference means processing a user's prompt through the model's transformer layers to produce a sequence of tokens (words or sub-words) that form a coherent response. While training happens once (or periodically), inference happens continuously for every user request, making it the primary driver of ongoing operational cost and the main determinant of end-user experience.

    Inference performance is measured along several axes: latency (time to first token and total generation time), throughput (requests per second or tokens per second), and cost per token. These metrics are influenced by model size, quantization level, hardware (GPU vs. CPU, memory bandwidth), batching strategy, and the serving runtime. A 70B-parameter model in FP16 might deliver exceptional quality but require multiple A100 GPUs, while the same model quantized to 4-bit GGUF format might run on a single RTX 4090 with acceptable quality and dramatically lower cost.
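
    These metrics are straightforward to derive once a single request has been timed end to end. The sketch below is illustrative only: the $1.00/hour GPU price and the token timestamps are assumptions, not benchmark figures.

    python
    # Derive the three core metrics from one timed generation. The GPU price
    # and the timestamps are illustrative assumptions, not measured values.
    def inference_metrics(request_start, token_timestamps, gpu_hourly_cost=1.00):
        ttft = token_timestamps[0] - request_start       # time to first token (s)
        total = token_timestamps[-1] - request_start     # total generation time (s)
        throughput = len(token_timestamps) / total       # tokens per second
        cost_per_token = (gpu_hourly_cost / 3600) * total / len(token_timestamps)
        return ttft, throughput, cost_per_token
    
    ttft, tps, cpt = inference_metrics(0.0, [0.18, 0.21, 0.24, 0.27])
    print(f"TTFT {ttft:.2f}s, {tps:.1f} tok/s, ${cpt:.7f}/token")
    Computing latency, throughput, and cost per token from one timed request (illustrative numbers).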

    Modern inference optimization is a rich field encompassing techniques like KV-cache management, continuous batching, speculative decoding, tensor parallelism, and PagedAttention (used by vLLM). The choice of inference stack — whether llama.cpp for local CPU/GPU inference, vLLM for high-throughput GPU serving, or Ollama for developer-friendly local deployment — can make a 5-10x difference in performance for the same model.
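
    As a concrete point of reference, the sketch below shows offline batch inference with vLLM, one of the runtimes named above. The model name is a placeholder; vLLM applies continuous batching and PagedAttention automatically.

    python
    from vllm import LLM, SamplingParams
    
    # Minimal vLLM offline-inference sketch. The model name is a placeholder;
    # vLLM batches the prompts and manages the KV-cache via PagedAttention.
    llm = LLM(model="meta-llama/Llama-2-7b-hf")
    params = SamplingParams(temperature=0.3, max_tokens=128)
    
    outputs = llm.generate(
        ["How do I upgrade my plan?", "Summarize our refund policy."],
        params,
    )
    for out in outputs:
        print(out.outputs[0].text)
    Batch inference with vLLM; both prompts are scheduled together to keep the GPU busy.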

    Why It Matters

    Inference is where AI models deliver value to end users, and its cost often exceeds training cost over a model's lifetime by an order of magnitude. A model that is brilliant but takes 30 seconds to respond will be abandoned; one that is fast but inaccurate will erode trust. Getting inference right means balancing quality, speed, and cost — a triad that depends on smart choices about model size, quantization, hardware, and serving infrastructure. For organizations deploying AI at scale, inference optimization directly impacts margins, user satisfaction, and competitive positioning.

    How It Works

    When a user submits a prompt, the inference pipeline first tokenizes the input text into a sequence of integer token IDs using the model's vocabulary. These tokens pass through the model's embedding layer to become dense vectors, then flow through dozens of transformer layers — each applying self-attention and feed-forward operations. For autoregressive generation, the model produces one token at a time: after generating each token, it appends that token to the input sequence and runs another forward pass (using a KV-cache to avoid recomputing attention for previous tokens). This loop continues until the model emits a stop token or reaches the maximum output length. The resulting token IDs are decoded back into text and returned to the user. Optimizations like continuous batching allow the server to interleave multiple requests in a single batch, maximizing GPU utilization.
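
    The loop below sketches this process with the Hugging Face transformers library, using GPT-2 purely as a small illustrative model; greedy decoding stands in for whatever sampling strategy a production server would apply.

    python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    # Autoregressive generation with a KV-cache: after the first pass, each
    # step feeds only the newest token and reuses cached attention states.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    
    input_ids = tokenizer("Inference is", return_tensors="pt").input_ids
    generated = input_ids
    past_key_values = None
    
    for _ in range(20):  # maximum output length
        out = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values  # updated cache of attention keys/values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        if next_token.item() == tokenizer.eos_token_id:  # stop token ends the loop
            break
        generated = torch.cat([generated, next_token], dim=-1)
        input_ids = next_token  # only the new token goes through the next forward pass
    
    print(tokenizer.decode(generated[0]))
    A bare-bones autoregressive decoding loop with a KV-cache; GPT-2 is used only for illustration.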

    python
    import json
    import requests
    
    # Query an Ertas Cloud inference endpoint (OpenAI-compatible format)
    response = requests.post(
        "https://api.ertas.ai/v1/completions",
        headers={"Authorization": "Bearer ert_sk_..."},
        json={
            "model": "my-org/support-assistant-v2",
            "prompt": "How do I upgrade my subscription plan?",
            "max_tokens": 256,
            "temperature": 0.3,
            "stream": True,
        },
        stream=True,
    )
    
    # Stream tokens as they are generated. With stream=True the server sends
    # Server-Sent Events: each non-empty line is "data: {...}" with one chunk.
    for line in response.iter_lines():
        if not line:
            continue
        payload = line.decode().removeprefix("data: ")
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        print(chunk["choices"][0]["text"], end="", flush=True)
    Calling a fine-tuned model's inference endpoint on Ertas Cloud with streaming enabled for low perceived latency.

    Example Use Case

    An e-commerce company deploys a fine-tuned 7B model to power its product recommendation chatbot. During peak hours, the system handles 200 concurrent users. By serving the model in Q4_K_M GGUF format via a llama.cpp-based backend with continuous batching, they achieve a median time-to-first-token of 180ms and a generation speed of 45 tokens per second on a single A10G GPU — meeting their latency SLA of under 2 seconds for typical responses while keeping infrastructure costs below $0.001 per interaction.
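
    A quick sanity check shows how these numbers meet the SLA; the ~80-token response length below is an assumed typical value, not a figure from the deployment.

    python
    # Back-of-envelope latency check for the deployment described above.
    ttft = 0.180           # median time to first token (s)
    tokens_per_sec = 45    # measured generation speed
    typical_response = 80  # assumed token count for a typical reply
    
    total_latency = ttft + typical_response / tokens_per_sec
    print(f"{total_latency:.2f}s")  # ~1.96s, just under the 2-second SLA
    Verifying that time-to-first-token plus generation time stays within the 2-second SLA.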

    Key Takeaways

    • Inference is the production phase where a trained model generates outputs from new inputs — it is where AI delivers user value.
    • Latency, throughput, and cost-per-token are the three key metrics for evaluating inference performance.
    • Quantization (e.g., GGUF Q4_K_M) can reduce inference costs by 4-8x with modest quality trade-offs.
    • The choice of serving runtime (llama.cpp, vLLM, Ollama) has a major impact on performance characteristics.
    • Inference cost typically exceeds training cost over a model's lifetime, making optimization critical for production deployments.

    How Ertas Helps

    Ertas Cloud provides managed inference endpoints for models fine-tuned in Ertas Studio. Users deploy a model with a single click and receive an API endpoint compatible with the OpenAI API format, making integration straightforward. Under the hood, Ertas Cloud automatically selects the optimal serving runtime, quantization level, and hardware tier based on the model's size and the user's latency and throughput requirements. Auto-scaling ensures that endpoints handle traffic spikes without manual intervention, while Ertas Vault guarantees that inference data is processed in compliance with the organization's privacy policies — no prompts or completions are logged unless explicitly opted in.
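
    Because the endpoints follow the OpenAI API format, the standard OpenAI Python SDK can target them by overriding base_url; a minimal sketch, reusing the placeholder model and key from the earlier example:

    python
    from openai import OpenAI
    
    # Point the OpenAI SDK at an Ertas Cloud endpoint. The base URL, model
    # name, and API key are placeholders carried over from the example above.
    client = OpenAI(base_url="https://api.ertas.ai/v1", api_key="ert_sk_...")
    
    completion = client.completions.create(
        model="my-org/support-assistant-v2",
        prompt="How do I upgrade my subscription plan?",
        max_tokens=256,
        temperature=0.3,
    )
    print(completion.choices[0].text)
    Calling the same endpoint through the OpenAI Python SDK instead of raw requests.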
