AI Model Benchmarks

    What each benchmark measures, how to interpret scores, and which models lead.

    SWE-Bench Verified

    Coding

    A human-validated, 500-task subset of the original SWE-bench, drawn from open-source GitHub repositories — measuring whether a model can autonomously resolve real issues by making the correct multi-file code changes, with resolution judged by the repository's own tests (see the sketch below).

    5 ranked leaders · Updated 2026-04-30
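
    A task in SWE-bench counts as resolved only if, after the model's patch is applied, every FAIL_TO_PASS test (a test the golden fix makes pass) passes and every PASS_TO_PASS test still passes. The helper below is a minimal illustrative sketch of that criterion, not the official harness; the test-report format is an assumption.

    ```python
    # Sketch of SWE-bench's resolution criterion (illustrative, not the
    # official harness). `report` maps test IDs to "PASSED"/"FAILED",
    # collected after applying the model's patch and running the suite.

    def is_resolved(report: dict[str, str],
                    fail_to_pass: list[str],
                    pass_to_pass: list[str]) -> bool:
        # Tests broken by the original issue must now pass...
        fixed = all(report.get(t) == "PASSED" for t in fail_to_pass)
        # ...and previously passing tests must not regress.
        no_regressions = all(report.get(t) == "PASSED" for t in pass_to_pass)
        return fixed and no_regressions

    def resolved_rate(results: list[dict]) -> float:
        """Headline SWE-bench score: the fraction of tasks resolved."""
        resolved = sum(
            is_resolved(r["report"], r["FAIL_TO_PASS"], r["PASS_TO_PASS"])
            for r in results
        )
        return resolved / len(results)
    ```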

    SWE-Bench Pro

    Coding

    A harder successor to SWE-Bench Verified, designed to be contamination-resistant and to evaluate models on more complex multi-file changes drawn from recent GitHub issues — the current frontier benchmark for agentic coding capability.

    5 ranked leaders · Updated 2026-04-30

    LiveBench

    General Knowledge

    A monthly-refreshed benchmark designed to be contamination-resistant — questions are sourced from current events and recent academic content, reducing the risk that frontier models have seen the test data during training.

    5 ranked leaders · Updated 2026-04-30

    MMLU-Pro

    General Knowledge

    A harder version of the original MMLU benchmark — multi-subject knowledge and reasoning questions with 10 answer choices instead of 4, lowering the random-guess floor from 25% to 10% and addressing the saturation and contamination of vanilla MMLU at the frontier (see the snippet below).

    5 ranked leaders · Updated 2026-04-30
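
    Because the guessing floor differs (10% here versus 25% on 4-choice MMLU), raw accuracies are not directly comparable across the two. A standard chance correction, shown below, is one common interpretive aid; it is not something MMLU-Pro itself reports.

    ```python
    def chance_corrected(acc: float, n_choices: int) -> float:
        """Rescale accuracy so random guessing maps to 0 and perfect to 1."""
        floor = 1.0 / n_choices
        return (acc - floor) / (1.0 - floor)

    # The same 70% raw accuracy means more on 10-choice MMLU-Pro
    # than on 4-choice MMLU:
    print(chance_corrected(0.70, 10))  # ~0.667
    print(chance_corrected(0.70, 4))   # 0.6
    ```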

    GPQA Diamond

    Reasoning

    A graduate-level science Q&A benchmark designed to be "Google-proof", i.e. not solvable by web search alone — questions written and validated by PhD-level domain experts, testing deep, specialized knowledge in physics, chemistry, and biology. Diamond is the highest-quality 198-question subset.

    5 ranked leaders · Updated 2026-04-30

    HumanEval

    Coding

    An older Python coding benchmark — 164 hand-written programming problems, each graded against unit tests and conventionally scored as pass@k (see the estimator below). Once the standard coding benchmark, it is now widely considered saturated and contamination-prone, with frontier models scoring 95%+.

    5 ranked leaders · Updated 2026-04-30
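
    HumanEval scores are conventionally reported as pass@k, using the unbiased estimator from the original paper (Chen et al., 2021): draw n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k samples would pass.

    ```python
    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
        given n samples per problem, c of which pass the tests."""
        if n - c < k:
            return 1.0  # fewer than k failures, so any k-subset contains a pass
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    # e.g. 200 samples with 50 passing:
    print(pass_at_k(200, 50, 1))   # 0.25
    print(pass_at_k(200, 50, 10))  # ~0.948
    ```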

    AIME 2025

    Math

    The American Invitational Mathematics Examination — a 15-question, 3-hour contest on the qualification path to the USA Mathematical Olympiad (USAMO). Every answer is an integer from 0 to 999, so responses can be autograded by exact match (see the sketch below); the 2025 problems test advanced high-school competition math at a level where most adults without specialized training would struggle.

    5 ranked leaders · Updated 2026-04-30
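
    Since every answer is a small integer, grading reduces to exact match. A minimal sketch follows; the "Answer: <integer>" response format is an assumed prompting convention, not part of the exam.

    ```python
    import re

    def extract_aime_answer(response: str) -> int | None:
        # Assumes the model was prompted to end with "Answer: <integer>";
        # takes the last such line if several appear.
        matches = re.findall(r"Answer:\s*(\d{1,3})\b", response)
        if not matches:
            return None
        value = int(matches[-1])
        return value if 0 <= value <= 999 else None

    def aime_score(responses: list[str], answers: list[int]) -> float:
        """Fraction of the 15 problems answered exactly correctly."""
        correct = sum(extract_aime_answer(r) == a
                      for r, a in zip(responses, answers))
        return correct / len(answers)
    ```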

    ARC-AGI

    Reasoning

    François Chollet's Abstraction and Reasoning Corpus — visual pattern-recognition puzzles posed as small colored grids with a few input/output examples per task (format sketched below), designed to test fluid intelligence rather than memorized knowledge. ARC-AGI-2 and ARC-AGI-3 succeed the original ARC, with frontier models still scoring well below human baselines.

    5 ranked leaders · Updated 2026-04-30
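
    Each public ARC task is a small JSON file: a few "train" input/output grid pairs demonstrating the hidden rule, plus "test" inputs to solve, where a grid is a list of lists of color indices 0-9. Scoring is all-or-nothing exact match on the predicted output grid. A minimal loader and grader, with the file path purely illustrative:

    ```python
    import json

    Grid = list[list[int]]  # cell values are color indices 0-9

    def load_task(path: str) -> dict:
        """An ARC task file holds 'train' and 'test' lists of
        {'input': Grid, 'output': Grid} pairs."""
        with open(path) as f:
            return json.load(f)

    def grade(prediction: Grid, expected: Grid) -> bool:
        # Exact match, including grid dimensions; no partial credit.
        return prediction == expected

    # Usage (path and task ID illustrative):
    #   task = load_task("data/training/0a1b2c3d.json")
    #   demos = task["train"]
    #   first_test_input = task["test"][0]["input"]
    ```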

    τ-bench (TauBench)

    Tool Use

    A benchmark for evaluating tool-using language models in realistic multi-turn customer service interactions — measuring whether an agent can correctly call APIs to complete user requests in domains such as retail and airline support, and whether it does so consistently across repeated trials (see the pass^k estimator below).

    5 ranked leaders · Updated 2026-04-30
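
    Alongside plain success rate, the τ-bench paper reports pass^k: the probability that k independent trials of the same task all succeed, which penalizes inconsistent agents. The estimator below follows that definition, with c observed successes out of n trials per task; the helper itself is illustrative.

    ```python
    import math

    def pass_hat_k(n: int, c: int, k: int) -> float:
        """pass^k: estimated probability that k i.i.d. attempts at a task
        all succeed, given c successes observed in n trials (k <= n)."""
        if c < k:
            return 0.0
        return math.comb(c, k) / math.comb(n, k)

    # An agent that succeeds 8/10 times looks strong for a single try
    # but decays quickly when all k runs must succeed:
    print(pass_hat_k(10, 8, 1))  # 0.8
    print(pass_hat_k(10, 8, 4))  # ~0.333
    ```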

    Humanity's Last Exam (HLE)

    General Knowledge

    A benchmark of expert-level questions across academic disciplines — positioned as a final closed-ended academic exam for frontier models, with current leaders scoring well below 50% across thousands of expert-written, expert-validated questions.

    5 ranked leaders · Updated 2026-04-30