GPQA Diamond

    A graduate-level science Q&A benchmark designed to be 'Google-proof': questions written by PhD-level domain experts that test deep, specialized knowledge in physics, chemistry, and biology, and that resist answering by web search alone.

    Reasoning · Updated 2026-04-30

    What It Measures

    GPQA (Graduate-Level Google-Proof Q&A) is a benchmark of multiple-choice questions in physics, chemistry, and biology written by PhD-level domain experts. The 'Diamond' subset is the highest-difficulty tier: questions that expert validators answered correctly but that most non-expert validators missed even with unrestricted internet access. The benchmark is designed to test deep specialized knowledge and reasoning rather than recall of widely available information.

    GPQA Diamond has emerged as one of the most credible benchmarks for evaluating frontier model reasoning capability. The questions are hard enough that even strong models score well below 100%, leaving room for differentiation; they are written by domain experts rather than scraped from public sources, reducing contamination risk; and the subject matter spans enough disciplines that high scores reflect broad scientific reasoning rather than narrow specialization.

    How It Works

    Each question is a multiple-choice problem with 4 answer choices, written by a PhD-level domain expert in physics, chemistry, or biology. Questions are designed to require deep specialized knowledge: typically the kind of reasoning a graduate student in the relevant subfield could handle but a non-specialist would struggle with even given research time.

    Scoring is the percentage of questions answered correctly. The 'Diamond' subset is 198 questions selected from the full GPQA pool by validation: both expert validators answered them correctly, while the majority of non-expert validators answered them incorrectly despite unrestricted web access. Diamond is the standard subset cited in leaderboards and research reports.
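    A minimal sketch of this scoring loop, assuming the public Hugging Face release of the dataset (Idavidrein/gpqa, config "gpqa_diamond") for the record field names, and a hypothetical ask_model callable standing in for whatever model API is under evaluation; verify both against your actual setup:

    ```python
    import random

    LETTERS = "ABCD"

    def score_gpqa_diamond(questions, ask_model, seed=0):
        """Return accuracy on a list of GPQA-style records.

        Field names ("Question", "Correct Answer", "Incorrect Answer 1"...)
        follow the public Hugging Face release; `ask_model` is a hypothetical
        callable that takes a prompt string and returns the model's text.
        """
        rng = random.Random(seed)  # fixed seed so option order is reproducible
        correct = 0
        for q in questions:
            options = [q["Correct Answer"],
                       q["Incorrect Answer 1"],
                       q["Incorrect Answer 2"],
                       q["Incorrect Answer 3"]]
            rng.shuffle(options)  # randomize where the correct answer lands
            gold = LETTERS[options.index(q["Correct Answer"])]
            prompt = (q["Question"] + "\n"
                      + "\n".join(f"{l}) {text}" for l, text in zip(LETTERS, options))
                      + "\nAnswer with a single letter (A-D).")
            reply = ask_model(prompt).strip().upper()
            if reply.startswith(gold):  # naive letter match; real harnesses parse harder
                correct += 1
        return correct / len(questions)  # the reported score: fraction correct
    ```

    With 198 questions, each one moves the score by about half a percentage point (1/198 ≈ 0.51), so leaderboard gaps of a point or two on Diamond amount to only a handful of questions.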

    Current Leaders

    The current open-weight leader is Qwen 3.5 at 88.4%, competitive with the best frontier proprietary models.

    How to Interpret Scores

    GPQA Diamond scores in the 80%+ range indicate strong scientific reasoning capability across multiple disciplines; as noted above, the best open-weight and proprietary frontier models now sit in this band. A model scoring 60%+ on GPQA Diamond demonstrates meaningful capability for tasks requiring graduate-level scientific knowledge; below 50% suggests the model would struggle with research-assistance use cases involving scientific reasoning. GPQA Diamond is particularly useful as a complement to MMLU-Pro: MMLU-Pro covers breadth (14 subject categories spanning fields from math to law), while GPQA Diamond covers depth (graduate-level questions in 3 sciences). Strong scores on both suggest both broad and deep reasoning capability.
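    The tier boundaries above are this page's rules of thumb rather than standardized cut-offs. As a small illustration, here is how they might be encoded for automated model triage; the function name and tier descriptions are illustrative, not from any official source:

    ```python
    def interpret_gpqa_diamond(score: float) -> str:
        """Map a GPQA Diamond score (0-100) to the rough tiers described above.

        The cut-offs mirror this page's heuristics; they are not official
        thresholds from the benchmark authors.
        """
        if score >= 80:
            return "strong scientific reasoning across disciplines"
        if score >= 60:
            return "meaningful graduate-level scientific capability"
        if score >= 50:
            return "marginal; evaluate carefully for science-heavy use"
        return "likely to struggle with scientific research assistance"
    ```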
