HumanEval

    An older Python coding benchmark — 164 hand-written programming problems with hidden test suites. Once the standard coding benchmark; now widely considered saturated and contamination-prone, with frontier models scoring 95%+.

    Coding · Updated 2026-04-30

    What It Measures

    HumanEval is a benchmark of 164 hand-written Python programming problems published by OpenAI in 2021. Each problem provides a function signature, a docstring describing what the function should do, and a hidden test suite. The model must generate a function body that passes the test suite. The benchmark was the standard coding evaluation for several years and remains widely cited for comparison against older models.

    By 2026, HumanEval is considered saturated and contamination-prone. Frontier models routinely score 95%+ (Kimi K2.5 set the open-weight record at 99.0%), and the gap between top models is dominated by noise rather than genuine capability differences. The original problems have also been included in many training corpora, directly or in reformulated forms, making contamination a substantial concern at the high end of the leaderboard.

    How It Works

    Each problem provides a function signature and docstring. The model is asked to generate the function body. The function is then run against a hidden test suite, and the model's score on the problem is binary — either all tests pass or they don't. The aggregate HumanEval score is the percentage of the 164 problems for which the model's solution passes all tests.
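
    To make the format concrete, the sketch below shows a HumanEval-style task and a minimal pass/fail check of the kind the harness performs. The problem, tests, and helper function here are a hypothetical illustration, not one of the 164 official tasks or the official harness code:

```python
# A HumanEval-style task: the model sees the signature and docstring
# and must produce the function body. (Hypothetical example, not one
# of the 164 official problems.)
PROMPT = '''
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[:i+1].

    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
'''

# A model completion: just the generated function body.
COMPLETION = '''
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result
'''

# Hidden test suite; scoring is binary per problem.
TESTS = '''
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert running_max([]) == []
assert running_max([-2, -5, -1]) == [-2, -2, -1]
'''

def passes(prompt: str, completion: str, tests: str) -> bool:
    """Run prompt + completion, then the tests; any failure scores 0.
    A real harness sandboxes this step: exec on untrusted code is unsafe."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)
        exec(tests, namespace)
        return True
    except Exception:
        return False

print(passes(PROMPT, COMPLETION, TESTS))  # True
```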

    Most current evaluations use 'pass@1' scoring — the model generates one attempt per problem, and that attempt is evaluated against the test suite. Earlier evaluations sometimes used 'pass@k' (multiple attempts, with success counted if any attempt passes), but pass@1 is now the standard for cross-report comparison.
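
    For multi-sample evaluation, the original HumanEval paper defines an unbiased pass@k estimator: generate n attempts per problem, count the c that pass, and compute 1 - C(n-c, k)/C(n, k). A minimal sketch follows; the function name is ours, and the comb-based form is mathematically equivalent to the numpy implementation in the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples drawn without replacement from n attempts
    (c of them correct) passes, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing attempts: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this reduces to the raw pass rate c/n, which is why pass@1
# numbers from different sample counts stay comparable.
print(pass_at_k(n=10, c=7, k=1))  # 0.7
print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```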

    Current Leaders

    #1 · Kimi K2.6 · ~99.0% (K2.5 set the open-weight record)

    How to Interpret Scores

    HumanEval scores in 2026 should be interpreted carefully. Frontier models scoring 95-99% are likely benefiting from training-data contamination as much as from genuine capability: the original problems are sufficiently old and widely published that no current model is encountering them for the first time. For practical model evaluation, SWE-Bench Verified or SWE-Bench Pro are substantially more meaningful signals of real-world coding capability. HumanEval remains useful as a sanity check (a model scoring below 80% on HumanEval is unlikely to be useful for production coding) and for comparison against older models, but it should not be the primary signal for frontier model selection.
