DeepSeek V4
Open-weight leader on ARC-AGI-2, though well below GPT-5.5's 85%
François Chollet's Abstraction and Reasoning Corpus — a benchmark of visual pattern-recognition puzzles designed to test fluid intelligence. ARC-AGI-2 and ARC-AGI-3 succeed the original ARC, with frontier models still scoring well below human baselines.
ARC-AGI (Abstraction and Reasoning Corpus, Artificial General Intelligence variant) is a benchmark of visual pattern-recognition puzzles designed by François Chollet to test fluid intelligence — the ability to solve novel problems without relying on memorized patterns. Each problem provides a few input-output examples of a visual transformation; the model must infer the transformation rule and apply it to a new input. The transformations are designed to be solvable by humans (typical human baseline is ~80%) but resistant to memorization or pattern-matching from training data.
The benchmark family has evolved through multiple versions: the original ARC (now considered solved by frontier models with extensive scaffolding), ARC-AGI-2 (the current standard, designed to be harder), and ARC-AGI-3 (the latest version, harder still). The benchmark is positioned as a measure of capability that genuinely transfers to novel problems rather than capability gained through training-data exposure — making it a particularly useful complement to benchmarks that may suffer from contamination.
Each problem provides 2-5 input-output example pairs demonstrating a visual transformation, then a 'test' input. The model must produce the correct test output. The transformations involve operations like reflection, rotation, color substitution, shape recognition, counting, and various combinations of these. Scoring is the percentage of problems for which the model produces the exactly-correct output.
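The task format and all-or-nothing scoring above can be sketched in a few lines of Python. This is a minimal illustration, not any official ARC harness: the grids, the `reflect_horizontal` rule, and the function names are all invented for the example.

```python
# Toy sketch of the ARC task format: grids are small 2-D arrays of
# color indices (0-9); a task gives train input/output pairs plus a
# held-out test pair, and scoring demands an exact-match output.

Grid = list[list[int]]

def reflect_horizontal(grid: Grid) -> Grid:
    """Example transformation rule: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Two train pairs demonstrate the rule; one test pair probes it.
task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 5, 0]], "output": [[0, 5, 5]]},
    ],
    "test": [{"input": [[7, 8, 9]], "output": [[9, 8, 7]]}],
}

def score(task, predict) -> float:
    """Fraction of test outputs reproduced exactly; no partial credit."""
    hits = sum(predict(t["input"]) == t["output"] for t in task["test"])
    return hits / len(task["test"])

print(score(task, reflect_horizontal))  # exact match on the one test pair: 1.0
```

Nested-list equality in Python is element-wise, so `==` already implements the "exactly-correct output" criterion with no tolerance for near misses.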
Most current ARC-AGI evaluations involve substantial scaffolding around the model — code-generation agents that produce Python programs implementing the transformation, multiple sampling strategies, verifier agents, and similar approaches. Reported scores typically include the scaffolding contribution; pure model scores without scaffolding are substantially lower. The ARC Prize organizers maintain a public leaderboard of human-vs-model performance with various rule sets.
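The sample-and-verify pattern behind that scaffolding can be sketched as follows. This is a hedged simplification: real systems sample candidate Python programs from an LLM and may use separate verifier agents, whereas here a small hand-written pool of transformations stands in for the sampling step.

```python
# Sketch of sample-and-verify scaffolding: each candidate program is
# checked against the training pairs, and only a candidate that
# reproduces every training output is applied to the test inputs.

def identity(g):  return [row[:] for row in g]
def reflect(g):   return [list(reversed(row)) for row in g]
def rotate180(g): return [list(reversed(row)) for row in reversed(g)]

CANDIDATES = [identity, reflect, rotate180]  # stand-in for LLM-sampled programs

def verify(program, train_pairs) -> bool:
    """A candidate passes only if it fits every training pair."""
    return all(program(p["input"]) == p["output"] for p in train_pairs)

def solve(task):
    for program in CANDIDATES:
        if verify(program, task["train"]):
            return [program(t["input"]) for t in task["test"]]
    return None  # no sampled program fit the training pairs

task = {
    "train": [{"input": [[1, 2], [3, 4]], "output": [[4, 3], [2, 1]]}],
    "test": [{"input": [[5, 6]]}],
}
print(solve(task))  # rotate180 fits the train pair -> [[[6, 5]]]
```

Sampling more candidates raises the chance that some program fits the training pairs, which is why reported scores depend so heavily on the sampling budget and why pure model scores without this loop are substantially lower.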
ARC-AGI scores remain notable because frontier models, despite massive parameter counts and training compute, still score well below human baselines on the harder versions. As of April 2026, GPT-5.5 (released April 24) leads ARC-AGI-2 at 85%; other frontier models score lower. The human baseline on ARC-AGI-2 is typically reported as 90%+. Open-weight models generally score lower than proprietary ones — this is a benchmark where heavy scaffolding and extensive sampling matter substantially. ARC-AGI is a useful complement to other benchmarks because it specifically tests capability that shouldn't be improvable by including the benchmark in training data. Strong ARC-AGI performance is a credible signal of generalization, though absolute scores should be interpreted in the context of the scaffolding used.