Google DeepMind just dropped an upgraded Gemini 3 Deep Think that scores 84.6% on ARC-AGI-2 -- a benchmark that was explicitly designed to resist exactly this kind of result.
When ARC-AGI-2 launched in early 2025, pure LLMs scored 0%. AI reasoning systems managed single-digit percentages. The benchmark was supposed to expose "what is fundamentally missing in our current AI architectures." Instead, less than a year later, a single model is within striking distance of the 85% threshold that the ARC Prize Foundation set as its grand prize target.
This is not just a benchmark story. It is a story about the accelerating pace of AI capability and what it means when the tests we build to measure intelligence keep getting solved faster than anyone expected.
What ARC-AGI Was Built to Measure
In 2019, Francois Chollet -- the Google AI researcher who created Keras -- published "On the Measure of Intelligence," a paper that argued the AI field was measuring the wrong things. Most benchmarks tested narrow skills: can a model translate text, recognize objects, or answer trivia questions? But skill on a specific task is not intelligence. A calculator can multiply better than any human, but nobody calls it smart.
Chollet proposed a different definition: intelligence is the efficiency of skill acquisition on unknown tasks. In other words, how quickly can a system learn to do something it has never seen before, with minimal examples?
The Abstraction and Reasoning Corpus (ARC-AGI) was his practical test of this idea. It consisted of 800 grid-based visual puzzles, each presenting a few input-output examples. The test-taker had to figure out the underlying rule and apply it to a new input. The puzzles relied on what developmental psychologists call core knowledge priors -- basic concepts like object permanence, spatial reasoning, and counting that humans develop in early childhood.
The design was deliberate: easy for humans, hard for AI. No language, no cultural knowledge, no domain expertise. Just raw reasoning from a handful of examples.
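To make the format concrete, here is a minimal, invented ARC-style task sketched in Python. Grids are small integer matrices (each integer encodes a color), a few training pairs demonstrate a hidden rule, and the solver must infer the rule and apply it to a fresh input. The specific task and rule below (mirroring each row) are made up for illustration; real ARC-AGI tasks are considerably more varied.

```python
# A toy, invented ARC-style task: grids are lists of lists of color codes (0-9).
# The hidden rule in this example is "mirror each row left-to-right".
train_examples = [
    {"input": [[1, 0, 0],
               [0, 2, 0]],
     "output": [[0, 0, 1],
                [0, 2, 0]]},
    {"input": [[3, 3, 0],
               [0, 0, 4]],
     "output": [[0, 3, 3],
                [4, 0, 0]]},
]
test_input = [[5, 0, 0],
              [0, 0, 6]]

def mirror_rows(grid):
    """The transformation a solver would have to infer from the examples."""
    return [list(reversed(row)) for row in grid]

# Check the inferred rule against every training pair before trusting it.
assert all(mirror_rows(ex["input"]) == ex["output"] for ex in train_examples)
print(mirror_rows(test_input))  # [[0, 0, 5], [6, 0, 0]]
```

A human sees the mirroring at a glance; the benchmark's premise was that a system without genuine few-shot reasoning could not.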
Five Years of Resistance
ARC-AGI-1 held up remarkably well. From its debut in 2019 through late 2024, it resisted every approach the AI research community threw at it.
The first Kaggle competition in 2020 produced a winning score of just 21%. Subsequent competitions in 2022 and 2023 saw marginal gains but nothing approaching human performance (which tested at 97-98% accuracy). Deep learning, reinforcement learning, program synthesis, domain-specific languages -- none of them cracked it.
Then in December 2024, OpenAI's o3-preview model scored 75% on ARC-AGI-1 at low compute and 87% at high compute. Five years of resistance collapsed in a single demo. The breakthrough was not brute-force scaling but a new paradigm: test-time reasoning systems that could think step-by-step, explore hypotheses, and refine their answers during inference.
ARC-AGI-1 was effectively solved. Chollet and his team had already been preparing the next iteration.
ARC-AGI-2: Designed to Be Harder
ARC-AGI-2 launched in early 2025 with a clear mandate: stress-test the reasoning systems that had cracked the first version. The team designed it to close the specific loopholes those systems exploited.
Key changes included:
- Harder tasks targeting known AI weaknesses: symbolic interpretation (symbols that carry meaning beyond their visual patterns), compositional reasoning (multiple rules interacting simultaneously), and contextual rule application (rules that change based on context).
- Brute-force resistance: Tasks from the original 2020 Kaggle competition that were susceptible to exhaustive search were removed from evaluation sets.
- Human calibration: Over 400 members of the general public were tested in a controlled study. Every task in the evaluation set was solved by at least two humans in two or fewer attempts. Average human test-taker score: 60%.
- Efficiency measurement: For the first time, ARC-AGI reporting included cost-per-task, acknowledging that intelligence is not just about getting the answer but getting it efficiently.
The initial results looked promising for the benchmark's longevity. Pure LLMs scored 0%. Early reasoning systems hit single digits. The ARC Prize 2025 Kaggle competition saw 1,455 teams submit over 15,000 entries, with the top score reaching just 24% on the private evaluation set.
Even commercial frontier models struggled. As of the ARC Prize 2025 results, the top verified commercial model -- Anthropic's Opus 4.5 with thinking mode -- scored 37.6% at $2.20 per task. The best refinement solution, built on Gemini 3 Pro, reached 54% but cost $30 per task.
Then Gemini 3 Deep Think Happened
Today's announcement changes the picture dramatically. Gemini 3 Deep Think's 84.6% on ARC-AGI-2, verified by the ARC Prize Foundation, represents a massive leap:
| Model | ARC-AGI-2 Score |
|-------|-----------------|
| Gemini 3 Deep Think | 84.6% |
| Claude Opus 4.6 (Thinking Max) | 68.8% |
| GPT-5.2 (Thinking xhigh) | 52.9% |
| Gemini 3 Pro Preview | 31.1% |
| ARC Prize 2025 winner (NVARC) | 24.0% |
For context, the ARC Prize Foundation set 85% as the threshold for its grand prize -- a level they expected to take years to reach. Gemini 3 Deep Think is within 0.4 percentage points of that target.
The model's approach relies on what Google DeepMind describes as "enhanced reasoning chains, parallel hypothesis exploration, and inference-time optimizations." It excels at exactly the kind of deep, iterative thinking that ARC-AGI-2 was designed to test.
And Gemini 3 Deep Think is not just dominating this one benchmark. It simultaneously posted a 48.4% on Humanity's Last Exam (without tools), a 3455 Elo on Codeforces, and gold-medal results in physics and chemistry olympiads.
The Benchmark Treadmill
The ARC Prize team is already developing ARC-AGI-3. In their technical report, they acknowledge the pattern: "We are actively developing ARC-AGI-3 for release in early 2026 and are optimistic about the new format."
They have also raised a critical concern: even well-designed benchmarks can be "overfit" if the public training and private test sets are too similar. The team suggests this may already be happening with both ARC-AGI-1 and ARC-AGI-2 -- "accidentally or intentionally, we cannot determine which."
This is the fundamental challenge. We are caught in a benchmark treadmill:
- Researchers design a test meant to capture something essential about intelligence.
- AI systems improve until they saturate it.
- The community asks whether the systems actually got smarter or just got better at the specific test.
- New benchmarks are created. Repeat.
The cycle used to take years. ARC-AGI-1 lasted five years. ARC-AGI-2 lasted less than one. The AI field is running out of yardsticks faster than it can build new ones.
What This Means
Three things stand out from the ARC-AGI trajectory:
Reasoning is the real breakthrough. The jump from ARC-AGI-1 being unsolvable to being solved, and from ARC-AGI-2 being hard to being nearly saturated, tracks directly with the development of test-time reasoning systems. This is not just models getting bigger -- it is a fundamental architectural shift that all four major AI labs now report on their model cards.
Efficiency gaps remain massive. The ARC Prize 2025 Kaggle winner scored 24% at $0.20 per task. The top refinement solution hit 54% at $30 per task. Gemini 3 Deep Think hits 84.6% but the per-task cost has not been disclosed. Intelligence without efficiency is just brute-force search wearing a lab coat.
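As a back-of-the-envelope illustration of that gap, the snippet below compares accuracy per dollar for the figures cited in this article; Gemini 3 Deep Think's per-task cost is left as unknown because it has not been disclosed.

```python
# Accuracy-per-dollar for the publicly reported ARC-AGI-2 results cited above.
# Costs are per task; Gemini 3 Deep Think's cost has not been disclosed.
results = {
    "ARC Prize 2025 Kaggle winner": {"score": 0.24, "cost_per_task": 0.20},
    "Gemini 3 Pro refinement solution": {"score": 0.54, "cost_per_task": 30.00},
    "Gemini 3 Deep Think": {"score": 0.846, "cost_per_task": None},
}

for name, r in results.items():
    if r["cost_per_task"] is None:
        print(f"{name}: {r['score']:.1%}, cost unknown")
    else:
        pts_per_dollar = (r["score"] * 100) / r["cost_per_task"]
        print(f"{name}: {r['score']:.1%}, "
              f"{pts_per_dollar:.1f} accuracy points per dollar")
```

By that crude measure, the Kaggle winner delivers more than 60 times the accuracy per dollar of the refinement solution, which is exactly the efficiency gap the ARC Prize team wants reported alongside raw scores.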
Measurement itself is in crisis. When benchmarks get saturated this quickly, the field loses its ability to track meaningful progress. As Chollet himself framed it: AGI is about efficiently acquiring new skills on unknown tasks. But if the "unknown" tasks become predictable, we are measuring something else entirely.
What Would ARC-AGI-3 Need to Look Like?
As noted above, ARC-AGI-3 is in active development for an early 2026 release. Given how quickly ARC-AGI-2 fell, the next version faces an existential design question: how do you build a benchmark that stays ahead of systems improving at this pace?
Grid puzzles with static input-output pairs -- no matter how cleverly designed -- may have hit a ceiling as a format. Reasoning models have gotten too good at the specific loop of "observe examples, hypothesize rule, apply rule, verify." The next frontier likely needs to break that loop entirely.
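Part of the problem is that this loop is easy to write down. The Python sketch below is deliberately naive, assuming a tiny hand-picked pool of candidate transformations; real reasoning systems search an enormously larger hypothesis space in natural language or code, but the observe-hypothesize-apply-verify shape is the same.

```python
# A naive observe-hypothesize-apply-verify loop over a tiny, hand-picked
# hypothesis space. Real systems search far richer spaces, but this is the
# loop structure reasoning models have gotten very good at.

def flip_h(g):  # mirror each row left-to-right
    return [list(reversed(row)) for row in g]

def flip_v(g):  # reverse the order of rows
    return [list(row) for row in reversed(g)]

def transpose(g):  # swap rows and columns
    return [list(col) for col in zip(*g)]

CANDIDATE_RULES = [flip_h, flip_v, transpose]

def solve(train_examples, test_input):
    # Observe the examples, hypothesize each candidate rule, verify it
    # against every training pair, then apply the first rule that survives.
    for rule in CANDIDATE_RULES:
        if all(rule(ex["input"]) == ex["output"] for ex in train_examples):
            return rule(test_input)
    return None  # no candidate explains the training data

# With the toy task sketched earlier in the article,
# solve(train_examples, test_input) returns [[0, 0, 5], [6, 0, 0]].
```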
A few directions seem plausible:
Embodied or interactive reasoning. Instead of static puzzles, tasks where the system must interact with an environment, observe consequences, and adjust its strategy in real time. Think less "solve this grid" and more "navigate this unfamiliar system." This would test whether AI can reason about causality and feedback -- not just pattern transformation.
Open-ended generation with human judgment. Tasks where there is no single correct answer but a space of good answers that require genuine creativity and contextual judgment. Current benchmarks reward convergent thinking. Real intelligence often requires divergent thinking -- generating novel solutions that satisfy loosely defined constraints.
Multi-domain transfer. Tasks that require pulling together reasoning across fundamentally different domains within a single problem. Not "solve a physics puzzle" or "solve a logic puzzle" but "use spatial reasoning to inform a causal inference that unlocks a symbolic pattern." The kind of cross-domain synthesis that humans do effortlessly but that current systems compartmentalize.
Adversarial co-evolution. Perhaps the most radical idea: benchmarks that evolve alongside the systems being tested. If a model demonstrates mastery of a task type, the benchmark automatically generates harder variants. This would turn the benchmark treadmill into a feature rather than a bug -- a continuously adaptive measure rather than a static target.
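None of these directions has a public reference design yet, but the co-evolution idea in particular reduces to a simple control loop. The sketch below is purely conceptual: generate_task(difficulty) and evaluate(model, task) are hypothetical placeholders, and the only point is that the benchmark's difficulty becomes a function of the model's measured performance rather than a fixed target.

```python
# Conceptual sketch of an adversarially co-evolving benchmark.
# generate_task() and evaluate() are hypothetical placeholders, not a real API;
# evaluate() is assumed to return 1.0 for a solved task and 0.0 otherwise.

MASTERY_THRESHOLD = 0.85   # if the model clears this, escalate difficulty
TASKS_PER_ROUND = 100

def run_adaptive_benchmark(model, generate_task, evaluate, rounds=10):
    difficulty = 1
    history = []
    for _ in range(rounds):
        tasks = [generate_task(difficulty) for _ in range(TASKS_PER_ROUND)]
        accuracy = sum(evaluate(model, t) for t in tasks) / len(tasks)
        history.append((difficulty, accuracy))
        if accuracy >= MASTERY_THRESHOLD:
            difficulty += 1    # the benchmark moves as soon as the model does
        else:
            break              # first difficulty level the model cannot master
    return history
```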
The honest answer is that nobody knows yet whether any benchmark can stay ahead indefinitely. The space of tasks that are trivial for humans but impossible for AI is shrinking in every direction. ARC-AGI-1 held for five years. ARC-AGI-2 held for less than one. If ARC-AGI-3 holds for six months, the field may need to accept that static benchmarks are the wrong tool for measuring a moving target.
But here is the optimistic read: the fact that we keep outrunning our own measurements is itself the signal. We are not stuck. We are not plateauing. The benchmarks are falling because the underlying capabilities -- abstract reasoning, few-shot generalization, compositional thinking -- are genuinely improving at an accelerating rate. The measurement crisis is a good problem to have. It means the thing we are trying to measure is real and it is moving fast.
The Small Business Angle
If you run a small business and read all this as "academic AI navel-gazing," here is the practical takeaway: AI capabilities are outpacing even the researchers who design the hardest tests meant to measure them.
The tools available to your business today -- from Claude's agentic coding to Gemini's deep reasoning to open-weight models like GLM-5 -- represent a genuine step-change in what machines can do. They are not just autocompleting text. They are reasoning through novel problems, adapting to context, and applying multiple rules simultaneously.
For SMBs, this translates to concrete capabilities:
- Complex analysis that used to require consultants -- financial modeling, competitive research, regulatory compliance analysis -- is increasingly within reach of reasoning-capable AI systems.
- Custom automation that requires understanding your specific business context, not just following rigid scripts.
- Technical problem-solving at a level that rivals junior domain experts.
The benchmarks are breaking because AI is genuinely getting better at the kind of flexible, adaptive reasoning that businesses need. The companies that recognize this shift and start building AI into their workflows now will have a significant advantage over those that wait for the dust to settle.
The dust is not going to settle. But that is the point. In a field where the hardest tests humans can design get solved within months, the only losing strategy is standing still.
Want to understand how advancing AI capabilities can benefit your specific business? Contact Barista Labs -- we help small businesses navigate the AI landscape without the enterprise consulting price tag.
