LiveBench
A monthly-refreshed benchmark designed to be contamination-resistant — questions are sourced from current events and recent academic content, reducing the risk that frontier models have seen the test data during training.
What It Measures
LiveBench is a contamination-resistant benchmark covering a broad range of capability areas: math, reasoning, coding, instruction following, language, and data analysis. The defining feature is its monthly refresh cadence: new questions are drawn from current sources (recent news, recent academic papers, recently modified open-source codebases), and older questions are rotated out. This makes LiveBench substantially harder to game through training-data inclusion than fixed benchmarks, since the test set is constantly moving forward in time.
As traditional benchmarks like MMLU and HumanEval became saturated and contamination-prone in 2024-2025, LiveBench emerged as one of the more credible alternatives for evaluating frontier model capability. The benchmark covers enough capability areas that a strong LiveBench score is a meaningful general-purpose capability signal, while the monthly refresh keeps the leaderboard informative as new models are released.
How It Works
Each month, new questions are sourced from the previous month's events and content. The test set rotates: questions older than a fixed window are removed and the new questions are added. A 'LiveBench score' therefore carries an implicit timestamp; the reported score is on the test set that was current in the month of evaluation. Scores from different time periods are not directly comparable, though the benchmark publishes month-over-month trend data.
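The rotation can be pictured as a rolling window over timestamped questions. The sketch below is illustrative only; the class names, the six-month retention window, and the field layout are assumptions, not LiveBench's actual implementation.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Question:
    """A single benchmark question with the month it was sourced."""
    prompt: str
    category: str          # e.g. "math", "coding", "reasoning"
    added: date            # month the question entered the pool

@dataclass
class RollingTestSet:
    """Illustrative rolling-window test set: keep only recent questions."""
    window_months: int = 6                           # assumed retention window
    questions: list[Question] = field(default_factory=list)

    def refresh(self, today: date, new_questions: list[Question]) -> None:
        """Add this month's questions and drop anything older than the window."""
        cutoff = today.year * 12 + today.month - self.window_months
        self.questions = [
            q for q in self.questions
            if q.added.year * 12 + q.added.month > cutoff
        ] + list(new_questions)

    def snapshot_label(self, today: date) -> str:
        """Every score carries an implicit timestamp: the month of the snapshot."""
        return f"LiveBench-{today:%Y-%m}"
```

Because refresh() both adds and evicts questions, two snapshots taken a few months apart share only part of their test sets, which is why a score only makes sense alongside its month label.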
The scoring methodology aggregates results across the capability areas (math, reasoning, coding, etc.) into both a composite score and per-category scores. For comparing models, the composite score is most commonly cited, though per-category scores can reveal model-specific strengths and weaknesses.
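As a hedged illustration of that aggregation, here is a minimal sketch that averages per-question scores into per-category scores and then into a composite. The equal weighting across categories and the function name are assumptions for illustration, not the published methodology.

```python
from statistics import mean

def aggregate(results: dict[str, list[float]]) -> tuple[float, dict[str, float]]:
    """Collapse per-question scores into per-category scores and a composite.

    `results` maps a category name (e.g. "math", "coding") to the list of
    per-question scores in that category. Equal category weighting is an
    illustrative assumption.
    """
    per_category = {cat: mean(scores) for cat, scores in results.items()}
    composite = mean(per_category.values())
    return composite, per_category

# Example usage with made-up numbers:
composite, by_category = aggregate({
    "math": [0.80, 0.90],
    "coding": [0.70, 0.75],
    "reasoning": [0.85, 0.80],
})
print(f"composite={composite:.3f}", by_category)
```

Per-category breakdowns like `by_category` are what reveal the model-specific strengths and weaknesses the composite hides.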
Current Leaders
How to Interpret Scores
LiveBench scores are generally a better signal of frontier model capability than older benchmarks because contamination resistance keeps the comparison meaningful. A model that improves on MMLU might be improving genuinely or might be benefiting from training-data contamination; a model that improves on LiveBench is more likely to be improving genuinely. As of April 2026, OpenAI's o3-mini leads the overall LiveBench leaderboard at 0.846, with frontier closed models occupying the top positions. Among open-weight models, the top tier (DeepSeek V4, Kimi K2.6, MiMo V2.5 Pro) scores competitively but typically below the closed-model leaders. For tracking the frontier, LiveBench is one of the most useful single benchmarks to monitor monthly.