AIME 2025

    The American Invitational Mathematics Examination — the qualifying exam for the US Math Olympiad. The 2025 problems test advanced high-school math reasoning at a level where most adults without specialized training would struggle.

    Math · Updated 2026-04-30

    What It Measures

    AIME (American Invitational Mathematics Examination) is the qualifying exam for the United States of America Mathematical Olympiad. Problems are challenging high-school-level math: number theory, combinatorics, geometry, algebra, and probability — substantially harder than typical school math but well-defined and solvable through reasoning rather than specialized advanced knowledge. AIME 2025 refers specifically to the 2025 edition's problems, used as a benchmark in early 2026 model evaluations.

    For language models, AIME is a useful benchmark because the problems require multi-step reasoning, careful arithmetic, and structured problem-solving. Unlike simpler math benchmarks (basic arithmetic, GSM8K) where frontier models score near 100%, AIME spreads the leaderboard out enough to provide meaningful differentiation. The benchmark has become a standard component of reasoning evaluations for 2026-generation models.

    How It Works

    Each AIME exam consists of 15 problems, each with an integer answer between 0 and 999. The model is given the problem statement and must produce the correct integer answer. Scoring is the percentage of correct answers across the 15 problems (or across the combined 30 problems if both AIME I and AIME II from a given year are used).
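    The scoring rule described above is simple exact-match grading. The sketch below illustrates it under the stated assumptions (integer answers in 0–999, score as percentage correct); the problem IDs and answer values are placeholders, not real AIME 2025 data.

    ```python
    # Minimal sketch of AIME-style grading: each problem has one integer
    # answer in [0, 999]; the score is the percentage answered exactly right.
    # All problem IDs and answers below are fake placeholders.

    def aime_score(predictions: dict[int, int], answer_key: dict[int, int]) -> float:
        """Percentage of problems answered correctly (exact integer match)."""
        correct = 0
        for problem_id, answer in answer_key.items():
            pred = predictions.get(problem_id)
            # Only integers in the valid 0-999 range can count as correct.
            if pred is not None and 0 <= pred <= 999 and pred == answer:
                correct += 1
        return 100.0 * correct / len(answer_key)

    # Placeholder run: 15 problems, 12 answered correctly -> 80.0
    answer_key = {i: (i * 7) % 1000 for i in range(1, 16)}
    predictions = {i: ((i * 7) % 1000 if i <= 12 else -1) for i in range(1, 16)}
    print(aime_score(predictions, answer_key))  # 80.0
    ```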

    For language models, results are typically reported as the percentage of problems answered correctly with the model running in extended-reasoning mode. Models with hybrid thinking modes (Qwen 3+, DeepSeek V3.2/V4, Hermes 4) generally need to be evaluated in their reasoning configuration to compete at the high end of the AIME leaderboard.
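    Evaluation harnesses differ in how they pull the final integer out of a long reasoning-mode completion. The snippet below is one assumed convention, not a standard: the model is prompted to end with a line such as "Answer: 204", and the harness takes the last in-range integer following that cue.

    ```python
    # Hedged sketch of answer extraction from a reasoning-mode completion.
    # Assumes the prompt asks the model to finish with "Answer: <integer>".
    import re

    def extract_final_answer(completion: str) -> int | None:
        """Return the last 0-999 integer following an 'Answer:' cue, if any."""
        matches = re.findall(r"[Aa]nswer\s*[:=]\s*(\d{1,3})\b", completion)
        if not matches:
            return None
        value = int(matches[-1])
        return value if 0 <= value <= 999 else None

    completion = "...long chain of reasoning...\nAnswer: 204"
    print(extract_final_answer(completion))  # 204
    ```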

    Current Leaders

    How to Interpret Scores

    AIME 2025 scores correlate well with general math reasoning capability and are a meaningful signal of how a model handles multi-step problem-solving. Falcon H1R-7B scoring 83.1% on AIME 2025 — at only 7B parameters — is particularly notable, demonstrating that targeted training and architectural innovation can produce substantial reasoning capability at small scale. QwQ-32B scoring 79% and reaching 95% on related MATH benchmarks shows the same pattern at the 32B scale. Strong AIME 2025 performance is one of the more credible indicators of genuine reasoning capability in 2026-era models.
