MMLU-Pro

    A harder version of the original MMLU benchmark — multi-subject knowledge and reasoning questions with 10 answer choices instead of 4, designed to address the saturation and contamination of vanilla MMLU at the frontier.

    General Knowledge · Updated 2026-04-30

    What It Measures

    MMLU-Pro is a harder, more rigorous successor to the original Massive Multitask Language Understanding (MMLU) benchmark. It draws on the same breadth of material, from elementary mathematics to professional law to clinical medicine, but consolidates MMLU's 57 subjects into 14 broader disciplines and adds several modifications designed to keep the benchmark discriminating at the frontier: questions have 10 answer choices instead of 4 (reducing the random-guess baseline), the question pool is weighted toward reasoning-intensive problems, and questions have been carefully curated to reduce the contamination problems that affected vanilla MMLU.
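    One way to see this structure directly is to load the public release and count disciplines and answer choices. The sketch below is illustrative: it assumes the TIGER-Lab/MMLU-Pro dataset on the Hugging Face Hub, with each row carrying options and category fields as in the published release.

    ```python
    from collections import Counter

    from datasets import load_dataset

    # Assumes the TIGER-Lab/MMLU-Pro release on the Hugging Face Hub,
    # with each row carrying "options" (answer choices) and "category".
    dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

    categories = Counter(row["category"] for row in dataset)
    print(len(categories), "disciplines")  # 14 in the published release
    for name, count in categories.most_common():
        print(f"{name}: {count} questions")

    # Most questions carry 10 choices; some have fewer, since the
    # authors dropped invalid distractors during curation.
    choice_counts = Counter(len(row["options"]) for row in dataset)
    print(choice_counts)
    ```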

    MMLU-Pro is now the standard substitute for vanilla MMLU in research and leaderboard contexts. Vanilla MMLU is considered saturated — frontier models score 90%+ and the differences between top models are within noise. MMLU-Pro spreads the leaderboard out enough to provide meaningful differentiation, and the harder format makes it more reflective of real reasoning capability.

    How It Works

    Each question is a multiple-choice problem from one of 14 disciplines, with 10 answer choices (A through J). Models are scored on the percentage of questions they answer correctly. Scoring is typically reported as a single composite score across all disciplines, though per-discipline scores can reveal interesting model-specific strengths; a minimal scoring sketch follows below.
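    As a concrete sketch of the scoring loop, the Python below computes the composite score and a per-discipline breakdown. It again assumes the Hugging Face release of the dataset; generate() is a hypothetical stand-in for the model under evaluation, and the single-regex answer extraction is a simplification of what published evaluation harnesses do.

    ```python
    import re
    from collections import defaultdict

    from datasets import load_dataset

    dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

    LETTERS = "ABCDEFGHIJ"  # up to 10 answer choices

    def format_prompt(example):
        """Render a question and its lettered choices as a plain prompt."""
        choices = "\n".join(
            f"{LETTERS[i]}. {opt}" for i, opt in enumerate(example["options"])
        )
        return (
            f"{example['question']}\n{choices}\n"
            "Answer with the letter of the correct choice."
        )

    def extract_letter(completion):
        """Pull the first standalone choice letter out of a completion."""
        match = re.search(r"\b([A-J])\b", completion)
        return match.group(1) if match else None

    correct = defaultdict(int)
    total = defaultdict(int)

    for example in dataset:
        completion = generate(format_prompt(example))  # hypothetical model call
        predicted = extract_letter(completion)
        total[example["category"]] += 1
        if predicted == example["answer"]:
            correct[example["category"]] += 1

    # Composite accuracy plus the per-discipline breakdown.
    print("overall:", sum(correct.values()) / sum(total.values()))
    for category in sorted(total):
        print(category, correct[category] / total[category])
    ```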

    The 10-choice format has two effects: it reduces the random-guess baseline from 25% to 10%, and it makes the questions harder by requiring the model to discriminate among more plausible distractors. Both effects make MMLU-Pro a more discriminating evaluation at the high end of the capability spectrum.
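    The guess-floor arithmetic is easy to check directly. In the sketch below, the question counts are approximate (vanilla MMLU's test split is around 14,000 questions, MMLU-Pro's around 12,000) and the example accuracies are illustrative.

    ```python
    import math

    def guess_floor(num_choices):
        """Expected accuracy of uniform random guessing."""
        return 1.0 / num_choices

    def accuracy_stderr(p, n):
        """Binomial standard error of an accuracy estimate p on n questions."""
        return math.sqrt(p * (1 - p) / n)

    print(guess_floor(4))   # 0.25 -- vanilla MMLU's guess baseline
    print(guess_floor(10))  # 0.10 -- MMLU-Pro's guess baseline

    # The noise bands are similar in size on both benchmarks; what the
    # 10-choice format changes is the usable range between the guess
    # floor and saturation, where models can still be told apart.
    print(accuracy_stderr(0.92, 14_000))  # a saturated vanilla-MMLU score
    print(accuracy_stderr(0.75, 12_000))  # a strong MMLU-Pro score
    ```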

    How to Interpret Scores

    MMLU-Pro scores in the 80%+ range indicate strong general knowledge and reasoning capability. The current open-weight leader is Qwen 3.5 at 84.9%, with frontier proprietary models scoring slightly higher. Compared to vanilla MMLU (where most flagship models score 90%+), MMLU-Pro provides better differentiation at the top end. For evaluating a model's general capability across diverse subject areas, MMLU-Pro is now the standard benchmark to consult. Like all multiple-choice benchmarks, MMLU-Pro doesn't capture certain capability dimensions — instruction following, structured output, tool use, long-context retrieval — that matter for production deployment but require different evaluations.
