AI Model Benchmarks

    What each benchmark measures, how to interpret scores, and which models lead.

    SWE-Bench Verified

    Coding

    A human-validated, 500-task subset of the original SWE-bench, drawn from open-source GitHub repositories — measuring whether a model can autonomously resolve real issues by making the correct multi-file code changes, with resolution judged by the repository's own tests (see the sketch below).

    5 ranked leaders · Updated 2026-04-30
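
    A task in SWE-bench counts as resolved only if, after the model's patch is applied, every FAIL_TO_PASS test (a test the golden fix makes pass) passes and every PASS_TO_PASS test still passes. The helper below is a minimal illustrative sketch of that criterion, not the official harness; the test-report format is an assumption.

    ```python
    # Sketch of SWE-bench's resolution criterion (illustrative, not the
    # official harness). `report` maps test IDs to "PASSED"/"FAILED",
    # collected after applying the model's patch and running the suite.

    def is_resolved(report: dict[str, str],
                    fail_to_pass: list[str],
                    pass_to_pass: list[str]) -> bool:
        # Tests broken by the original issue must now pass...
        fixed = all(report.get(t) == "PASSED" for t in fail_to_pass)
        # ...and previously passing tests must not regress.
        no_regressions = all(report.get(t) == "PASSED" for t in pass_to_pass)
        return fixed and no_regressions

    def resolved_rate(results: list[dict]) -> float:
        """Headline SWE-bench score: the fraction of tasks resolved."""
        resolved = sum(
            is_resolved(r["report"], r["FAIL_TO_PASS"], r["PASS_TO_PASS"])
            for r in results
        )
        return resolved / len(results)
    ```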

    SWE-Bench Pro

    Coding

    A harder successor to SWE-Bench Verified, designed to be contamination-resistant and to evaluate models on more complex multi-file changes drawn from recent GitHub issues — the current frontier benchmark for agentic coding capability.

    5 ranked leaders · Updated 2026-04-30

    LiveBench

    General Knowledge

    A monthly-refreshed benchmark designed to be contamination-resistant — questions are sourced from current events and recent academic content, reducing the risk that frontier models have seen the test data during training.

    5 ranked leaders · Updated 2026-04-30

    MMLU-Pro

    General Knowledge

    A harder version of the original MMLU benchmark — multi-subject knowledge and reasoning questions with 10 answer choices instead of 4, lowering the random-guess floor from 25% to 10% and addressing the saturation and contamination of vanilla MMLU at the frontier (see the snippet below).

    5 ranked leaders · Updated 2026-04-30
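
    Because the guessing floor differs (10% here versus 25% on 4-choice MMLU), raw accuracies are not directly comparable across the two. A standard chance correction, shown below, is one common interpretive aid; it is not something MMLU-Pro itself reports.

    ```python
    def chance_corrected(acc: float, n_choices: int) -> float:
        """Rescale accuracy so random guessing maps to 0 and perfect to 1."""
        floor = 1.0 / n_choices
        return (acc - floor) / (1.0 - floor)

    # The same 70% raw accuracy means more on 10-choice MMLU-Pro
    # than on 4-choice MMLU:
    print(chance_corrected(0.70, 10))  # ~0.667
    print(chance_corrected(0.70, 4))   # 0.6
    ```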

    GPQA Diamond

    Reasoning

    A graduate-level science Q&A benchmark designed to be "Google-proof", i.e. not solvable by web search alone — questions written and validated by PhD-level domain experts, testing deep, specialized knowledge in physics, chemistry, and biology. Diamond is the highest-quality 198-question subset.

    5 ranked leaders · Updated 2026-04-30

    HumanEval

    Coding

    An older Python coding benchmark — 164 hand-written programming problems, each graded against unit tests and conventionally scored as pass@k (see the estimator below). Once the standard coding benchmark, it is now widely considered saturated and contamination-prone, with frontier models scoring 95%+.

    5 ranked leaders · Updated 2026-04-30
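
    HumanEval scores are conventionally reported as pass@k, using the unbiased estimator from the original paper (Chen et al., 2021): draw n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k samples would pass.

    ```python
    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
        given n samples per problem, c of which pass the tests."""
        if n - c < k:
            return 1.0  # fewer than k failures, so any k-subset contains a pass
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    # e.g. 200 samples with 50 passing:
    print(pass_at_k(200, 50, 1))   # 0.25
    print(pass_at_k(200, 50, 10))  # ~0.948
    ```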

    AIME 2025

    Math

    The American Invitational Mathematics Examination — a 15-question, 3-hour contest on the qualification path to the USA Mathematical Olympiad (USAMO). Every answer is an integer from 0 to 999, so responses can be autograded by exact match (see the sketch below); the 2025 problems test advanced high-school competition math at a level where most adults without specialized training would struggle.

    5 ranked leaders · Updated 2026-04-30
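
    Since every answer is a small integer, grading reduces to exact match. A minimal sketch follows; the "Answer: <integer>" response format is an assumed prompting convention, not part of the exam.

    ```python
    import re

    def extract_aime_answer(response: str) -> int | None:
        # Assumes the model was prompted to end with "Answer: <integer>";
        # takes the last such line if several appear.
        matches = re.findall(r"Answer:\s*(\d{1,3})\b", response)
        if not matches:
            return None
        value = int(matches[-1])
        return value if 0 <= value <= 999 else None

    def aime_score(responses: list[str], answers: list[int]) -> float:
        """Fraction of the 15 problems answered exactly correctly."""
        correct = sum(extract_aime_answer(r) == a
                      for r, a in zip(responses, answers))
        return correct / len(answers)
    ```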

    ARC-AGI

    Reasoning

    François Chollet's Abstraction and Reasoning Corpus — visual pattern-recognition puzzles posed as small colored grids with a few input/output examples per task (format sketched below), designed to test fluid intelligence rather than memorized knowledge. ARC-AGI-2 and ARC-AGI-3 succeed the original ARC, with frontier models still scoring well below human baselines.

    5 ranked leaders · Updated 2026-04-30
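
    Each public ARC task is a small JSON file: a few "train" input/output grid pairs demonstrating the hidden rule, plus "test" inputs to solve, where a grid is a list of lists of color indices 0-9. Scoring is all-or-nothing exact match on the predicted output grid. A minimal loader and grader, with the file path purely illustrative:

    ```python
    import json

    Grid = list[list[int]]  # cell values are color indices 0-9

    def load_task(path: str) -> dict:
        """An ARC task file holds 'train' and 'test' lists of
        {'input': Grid, 'output': Grid} pairs."""
        with open(path) as f:
            return json.load(f)

    def grade(prediction: Grid, expected: Grid) -> bool:
        # Exact match, including grid dimensions; no partial credit.
        return prediction == expected

    # Usage (path and task ID illustrative):
    #   task = load_task("data/training/0a1b2c3d.json")
    #   demos = task["train"]
    #   first_test_input = task["test"][0]["input"]
    ```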

    τ-bench (TauBench)

    Tool Use

    A benchmark for evaluating tool-using language models in realistic multi-turn customer service interactions — measuring whether an agent can correctly call APIs to complete user requests in domains such as retail and airline support, and whether it does so consistently across repeated trials (see the pass^k estimator below).

    5 ranked leaders · Updated 2026-04-30
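
    Alongside plain success rate, the τ-bench paper reports pass^k: the probability that k independent trials of the same task all succeed, which penalizes inconsistent agents. The estimator below follows that definition, with c observed successes out of n trials per task; the helper itself is illustrative.

    ```python
    import math

    def pass_hat_k(n: int, c: int, k: int) -> float:
        """pass^k: estimated probability that k i.i.d. attempts at a task
        all succeed, given c successes observed in n trials (k <= n)."""
        if c < k:
            return 0.0
        return math.comb(c, k) / math.comb(n, k)

    # An agent that succeeds 8/10 times looks strong for a single try
    # but decays quickly when all k runs must succeed:
    print(pass_hat_k(10, 8, 1))  # 0.8
    print(pass_hat_k(10, 8, 4))  # ~0.333
    ```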

    Humanity's Last Exam (HLE)

    General Knowledge

    A benchmark of expert-level questions across academic disciplines — positioned as a final closed-ended academic exam for frontier models, with current leaders scoring well below 50% across thousands of expert-written, expert-validated questions.

    5 ranked leaders · Updated 2026-04-30