Humanity's Last Exam (HLE)

    A benchmark of expert-level questions across academic disciplines — designed as a final exam hard enough that frontier models cannot yet pass it, with current leaders scoring well below 50% across thousands of expert-validated questions.

    General Knowledge · Updated 2026-04-30

    What It Measures

    Humanity's Last Exam (HLE) is a benchmark of expert-level questions sourced from across academic disciplines — mathematics, physics, biology, chemistry, computer science, history, philosophy, classics, and more. Questions are written by domain experts and validated by other experts, with the explicit design goal of being the 'last exam' that frontier AI models will struggle with. By design, even the best current models score well below 50% on HLE — providing substantial differentiation room as frontier capability advances.

    The benchmark's positioning as 'humanity's last exam' reflects a research community concern: as benchmarks have saturated (MMLU, HumanEval), they've stopped providing meaningful frontier evaluation. HLE is intentionally constructed to remain hard for years to come, drawing on the deepest expert knowledge available across disciplines. The benchmark is published with thousands of questions to ensure that even substantial training-data inclusion would only marginally affect scores.

    How It Works

    Each question is an expert-level academic problem, typically requiring deep specialized knowledge plus reasoning to solve. Some questions are multiple-choice; others are short-answer and require an exactly correct response. Scoring is the percentage of questions answered correctly across the full benchmark.
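
    A minimal scoring sketch is shown below. This is not the official HLE grading harness; the record fields ('answer_type', 'answer', 'prediction') and the normalization rules are illustrative assumptions, but the overall metric matches the description above: percent of questions answered correctly.

```python
# Minimal scoring sketch (not the official HLE harness). Assumes each record
# carries 'answer_type' ('multipleChoice' or 'exactMatch'), a gold 'answer',
# and the model's 'prediction'; normalization here is an illustrative choice.
def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace/trailing period for comparison."""
    return text.strip().strip(".").lower()

def is_correct(record: dict) -> bool:
    gold = normalize(record["answer"])
    pred = normalize(record["prediction"])
    # Multiple-choice: compare chosen option letters; short-answer: exact match.
    return pred == gold

def hle_score(records: list[dict]) -> float:
    """Percentage of questions answered correctly across the benchmark."""
    correct = sum(is_correct(r) for r in records)
    return 100.0 * correct / len(records)

if __name__ == "__main__":
    sample = [
        {"answer_type": "multipleChoice", "answer": "C", "prediction": "c"},
        {"answer_type": "exactMatch", "answer": "137", "prediction": "136"},
    ]
    print(f"{hle_score(sample):.1f}%")  # -> 50.0%
```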

    Unlike multiple-choice benchmarks where guessing provides a baseline, HLE's mix of formats (with many short-answer questions) makes the random-guess baseline effectively zero. This means every percentage point of score reflects genuine model capability rather than guessing artifacts.
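
    To make the "effectively zero" claim concrete, here is a back-of-the-envelope calculation under an assumed format mix. The 25% multiple-choice share and five options per question are assumptions for illustration, not published HLE statistics.

```python
# Back-of-the-envelope random-guess baseline (illustrative assumptions only).
mc_fraction = 0.25        # assumed share of multiple-choice questions
mc_options = 5            # assumed options per multiple-choice question
short_answer_hit = 0.0    # random guessing on free-form answers is ~0

baseline = mc_fraction * (1 / mc_options) + (1 - mc_fraction) * short_answer_hit
print(f"Expected random-guess score: {baseline:.1%}")  # ~5% under these assumptions
```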

    Current Leaders

    How to Interpret Scores

    HLE scores below 25% indicate substantial gaps in expert-level reasoning across disciplines. Scores of 25-40% indicate frontier-level capability — the current top models. Scores above 50% would indicate substantially superhuman capability across academic disciplines, and no model has reached that level as of April 2026. For evaluating frontier model capability specifically, HLE is one of the most useful single benchmarks to monitor, since it has substantial headroom and resists contamination by virtue of question difficulty and breadth. Strong HLE performance combined with strong GPQA Diamond performance is a credible signal of broad expert-level reasoning capability across both depth (GPQA) and breadth (HLE).
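
    The thresholds above can be expressed as a small lookup. This sketch only restates the section's guidance, not an official rubric; the label for the 40-50% range is an interpolation.

```python
# Maps an HLE score to the interpretation bands described in this section.
# Thresholds mirror the prose above; treat them as rough guidance.
def interpret_hle_score(score_pct: float) -> str:
    if score_pct < 25:
        return "substantial gaps in expert-level reasoning"
    if score_pct <= 40:
        return "frontier-level capability"
    if score_pct <= 50:
        return "beyond the current frontier"  # no model here as of April 2026
    return "would suggest broadly superhuman academic capability"

print(interpret_hle_score(30))  # -> "frontier-level capability"
```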
