After a week-long delay, Google has officially launched Gemini 3.1 Pro Preview, now live on both Vertex AI and the Gemini API. Google's own description cuts straight to the point: "Gemini 3.1 Pro Preview is our most powerful agentic and coding model." For businesses that have been building on the Gemini Pro tier, this is a significant upgrade. For those who have been waiting to see which model to bet on in 2026, the benchmarks make a strong argument.
The Numbers That Matter
Let's start with what makes this release stand out in a market that sees new model announcements weekly.
ARC-AGI-2 State of the Art: 71.8%. This is the benchmark designed to measure how well AI handles genuinely novel problems — tasks that require abstract reasoning rather than pattern matching against training data. Gemini 3.1 Pro Preview now holds the top score. For context, ARC-AGI was originally designed as a multi-year challenge that researchers expected would take until the late 2020s to crack. We wrote about how rapidly that timeline collapsed in our coverage of the ARC-AGI benchmark crisis. The fact that Google keeps pushing this number higher, and now holds the lead, tells you something about where their research focus is.
SWE-Bench Verified: 83.9%. SWE-Bench measures a model's ability to solve real-world software engineering problems from GitHub repositories. An 83.9% score means the model autonomously resolves the vast majority of the benchmark's real bug reports and feature requests, drawn from production codebases. For comparison, Claude Opus 4.6 scored 81.42% when it launched earlier this month. Google just leapfrogged it.
AIME 2025: 100%. A perfect score on the American Invitational Mathematics Examination. This is a competition-level math test used to qualify for the US Math Olympiad, and a perfect score signals exceptional command of structured, multi-step logical reasoning.
Terminal-Bench 2.0: 63.5%. This measures the model's ability to operate autonomously in terminal environments — running commands, navigating file systems, debugging in real time. This is the benchmark that matters most for agentic engineering workflows, where AI systems need to operate independently rather than just generating text.
The New Medium Thinking Level
One of the less-discussed but practically important additions is a new Medium thinking level. Gemini 3.1 Pro Preview now supports three tiers — Low, Medium, and High — matching the thinking controls already available in Gemini 3 Flash.
This matters more than it sounds. Thinking levels control how deeply the model reasons before generating a response, and they directly affect both latency and cost. High thinking produces the most thorough reasoning but takes longer and costs more per token. Low thinking is fast and cheap but may miss nuance on complex problems.
Medium sits in the sweet spot. It gives the model enough reasoning depth to handle business-critical tasks — contract analysis, code review, multi-step planning — without the latency and cost overhead of High thinking. For businesses running AI at scale, having this middle option can meaningfully reduce their monthly API bills while maintaining quality where it counts.
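For developers who want to try the new tier, the request shape looks something like the sketch below. Treat it as a minimal illustration: it assumes the google-genai Python SDK accepts "medium" as a thinking_level value for this model, extending the Low/High controls Gemini 3 already exposes, and that the model string is gemini-3.1-pro-preview.

```python
from google import genai
from google.genai import types

# The client reads GEMINI_API_KEY from the environment by default.
client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed model ID for the preview release
    contents="Review this contract clause for termination-notice risks: ...",
    config=types.GenerateContentConfig(
        # "medium" is the tier this release adds; "low" and "high" mirror
        # the controls already available on Gemini 3 Flash (assumed values).
        thinking_config=types.ThinkingConfig(thinking_level="medium"),
    ),
)
print(response.text)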
What This Means for Small Businesses
If you are a small or mid-sized business using AI for coding, document processing, customer interactions, or internal workflows, here is why this release specifically matters to you.
The Pro Tier Is the Workhorse
In Google's model hierarchy, Pro is the tier most businesses actually use in production. Flash models are faster and cheaper for simple tasks, and Ultra models exist for the most demanding research workloads. But Pro is where the balance of capability, cost, and reliability sits for real business applications. An upgrade to Pro is an upgrade to your production infrastructure, not a research preview you read about and move on.
A Million Tokens of Context, Applied
The 1M token context window is not new to Gemini, but at this capability level it becomes genuinely transformative. One million tokens is roughly 750,000 words. That is an entire codebase. An entire year of customer service transcripts. A full set of legal contracts for due diligence. With Gemini 3.1 Pro Preview, you can feed all of that into a single prompt and get accurate analysis across the entire dataset. No chunking, no retrieval-augmented generation workarounds, no loss of context between segments.
For businesses running AI-powered coding assistants, this means the model can hold your entire project in memory while making changes. For firms doing document review, it means a single pass instead of dozens.
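As a rough sketch of what single-pass analysis can look like in practice, here is one way to hand a whole repository to the model through the google-genai Python SDK. The model ID is an assumption, and a very large repo may still need filtering to stay under the window.

```python
from pathlib import Path

from google import genai

client = genai.Client()

# Concatenate every Python file in the project into one prompt. A mid-sized
# repo fits comfortably under the 1M-token window; very large repos may
# still need filtering (vendored deps, generated files, and so on).
repo = Path("./my_project")
source = "\n\n".join(
    f"# FILE: {p}\n{p.read_text(encoding='utf-8', errors='ignore')}"
    for p in sorted(repo.rglob("*.py"))
)

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview model ID
    contents=(
        "Here is an entire codebase. Identify the modules involved in "
        "order processing and propose a refactor plan:\n\n" + source
    ),
)
print(response.text)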
Cost Control Without Capability Sacrifice
The three-tier thinking system gives businesses something they have been asking for: granular cost control. Route simple customer inquiries through Low thinking. Use Medium for standard document analysis and code generation. Reserve High for complex multi-step reasoning, architectural decisions, and critical analysis.
This is not just about saving money. It is about building AI workflows that are economically sustainable at scale. A business processing thousands of documents daily cannot afford High thinking on every one. But it also cannot afford Low thinking on contracts that need careful analysis. Medium fills that gap.
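A thin routing layer is enough to put that policy into code. The sketch below reuses the assumed thinking_level control from earlier; the task categories and their assignments are illustrative, not anything Google prescribes.

```python
from google import genai
from google.genai import types

client = genai.Client()

# Illustrative mapping from task category to thinking tier. The categories
# are our own; only the Low/Medium/High tiers come from the release.
THINKING_LEVEL = {
    "customer_inquiry": "low",      # fast and cheap, high volume
    "document_analysis": "medium",  # the new middle tier
    "code_generation": "medium",
    "architecture_review": "high",  # slow and thorough, used sparingly
}

def run_task(category: str, prompt: str) -> str:
    level = THINKING_LEVEL.get(category, "medium")  # default to the middle tier
    response = client.models.generate_content(
        model="gemini-3.1-pro-preview",  # assumed preview model ID
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level=level),
        ),
    )
    return response.text

print(run_task("customer_inquiry", "Where is my order #1042?"))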
Best-in-Class for Novel Problems
The ARC-AGI-2 state-of-the-art score deserves special attention for business applications. Most AI benchmarks test how well models perform on problems similar to their training data. ARC-AGI-2 specifically tests performance on problems the model has never seen before. Businesses encounter novel problems every day — unusual customer requests, edge cases in their data, new market conditions that don't match historical patterns. A model that excels at novel reasoning is a model that excels at real work, and Gemini 3.1 Pro Preview is now the best available option on that metric.
The Competitive Landscape Just Shifted
Google delayed this launch from last week, and the wait appears to have paid off. Anthropic had been holding the coding crown with Claude Opus 4.6. OpenAI has been pushing hard on agentic capabilities with GPT-5.2 and Codex. Now Google has reclaimed the lead on both coding and reasoning benchmarks with a single release.
For businesses, this three-way competition is the best possible outcome. Each major provider is pushing the other to deliver more capable models at lower prices. The practical advice is straightforward: if you are already on Vertex AI or the Gemini API, switch to gemini-3.1-pro-preview and test it against your workloads. If you are evaluating providers, this model belongs in your benchmark suite alongside Claude Opus 4.6 and GPT-5.2.
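A quick way to start that evaluation is to replay a sample of saved production prompts against the new model ID. The harness below is deliberately minimal and assumes only the model string; plug in whatever scoring you already use on the outputs.

```python
import json

from google import genai

client = genai.Client()

# Replay a file of saved production prompts (one JSON object per line,
# e.g. {"id": "t1", "prompt": "..."}) against the new model.
with open("workload_samples.jsonl", encoding="utf-8") as f:
    for line in f:
        case = json.loads(line)
        response = client.models.generate_content(
            model="gemini-3.1-pro-preview",  # assumed preview model ID
            contents=case["prompt"],
        )
        print(f"--- {case['id']} ---\n{response.text}\n")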
The model that was supposed to launch last week is here. And it is exactly as good as the numbers suggest.
