Tuesday brought a stealth company claiming a near-perfect score on one of AI's hardest benchmarks, a new way to measure coding ability across nine programming languages, and a $650 billion spending forecast that makes last year's budgets look quaint. Here is everything worth knowing.
Models and Benchmarks
Confluence Labs Emerges from Stealth with 97.9% on ARC-AGI-2. YC-backed Confluence Labs came out of stealth today with what may be the most impressive benchmark result of the year: 97.9 percent on ARC-AGI-2, the abstract reasoning benchmark designed to resist brute-force scaling. The company says its approach centers on learning efficiency — making AI useful in domains where data is sparse and experiments are expensive. ARC-AGI-3 is expected to launch next month, which will immediately test whether this result holds up against an even harder evaluation. For now, 97.9 percent on a benchmark built specifically to stump frontier models is worth paying attention to.
Reve Drops Its First Image Model — Lands Top Three on Arena. Reve launched v1.5, its debut text-to-image model, and it immediately placed in the top three on the Image Arena leaderboard with a score of 1177 — behind only GPT-Image-1.5 and Google's Nano Banana Pro variants. The model generates 4K-resolution images with strong text rendering and performs particularly well on product and commercial design prompts. A first release landing this high is unusual and suggests Reve may become a serious contender in the image generation space alongside Midjourney and DALL-E.
SWE-Bench Goes Multilingual — And Rankings Shuffle. The team behind SWE-bench launched a multilingual variant today: 300 tasks across nine programming languages, none overlapping with SWE-bench Verified. The current state-of-the-art sits at 72 percent, leaving meaningful headroom. More interesting than the top score is that model rankings shift significantly between languages — a model that dominates Python tasks may struggle with Rust or TypeScript. This matters for teams evaluating coding agents, because your language stack now determines which model is actually best for your work.
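The practical upshot — that the "best" coding model depends on your stack — reduces to a simple weighted comparison. Here is a minimal sketch with invented pass rates (the model names and scores below are hypothetical, not actual SWE-bench Multilingual results):

```python
# Hypothetical per-language pass rates -- illustrative only, not real
# SWE-bench Multilingual numbers. The point: rankings flip once you
# weight scores by how much of each language a team actually writes.
SCORES = {
    "model_a": {"python": 0.78, "rust": 0.41, "typescript": 0.60},
    "model_b": {"python": 0.65, "rust": 0.58, "typescript": 0.71},
}

def best_model(scores, stack_weights):
    """Pick the model with the highest stack-weighted pass rate."""
    def weighted(model):
        per_lang = scores[model]
        return sum(per_lang[lang] * w for lang, w in stack_weights.items())
    return max(scores, key=weighted)

# A Python-heavy team and a Rust/TypeScript-heavy team choose differently.
print(best_model(SCORES, {"python": 0.9, "rust": 0.05, "typescript": 0.05}))  # model_a
print(best_model(SCORES, {"python": 0.1, "rust": 0.5, "typescript": 0.4}))    # model_b
```

With these made-up numbers, model_a wins for a Python-heavy team while model_b wins for a Rust/TypeScript shop — exactly the kind of flip the multilingual benchmark surfaces.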
Infrastructure and Spending
Bridgewater: Big Tech Will Spend $650 Billion on AI in 2026. Bridgewater Associates, the world's largest hedge fund, published a forecast projecting that major technology companies will collectively invest approximately $650 billion in AI infrastructure this year. That figure dwarfs 2025 spending and underscores how aggressively the hyperscalers are building out compute capacity. NVIDIA's upcoming earnings report is being framed as the single biggest test of whether AI demand justifies this capital expenditure binge.
NVIDIA and Red Hat Launch AI Factory Partnership. NVIDIA and Red Hat announced a joint initiative called the Red Hat AI Factory, designed to accelerate enterprise AI deployment on standardized infrastructure. The partnership combines Red Hat's enterprise Linux and OpenShift platform with NVIDIA's AI stack, giving enterprises a turnkey path from model training to production. The timing aligns with the broader push from both companies to make AI infrastructure as routine as cloud computing was a decade ago.
Enterprise and Products
Anthropic Cowork Gets Enterprise Plugins — We Covered It. Anthropic rolled out department-specific plugins for Claude Cowork, including connectors for Gmail, DocuSign, FactSet, and native Excel-to-PowerPoint orchestration. This is the most concrete move yet to make AI agents standard-issue for knowledge workers. We covered the full story earlier today — read the deep dive here.
DeepSeek V4 Shows Pre-Release Activity — 39 PRs Merged. DeepSeek developers merged 39 pull requests in a concentrated batch, a pattern consistent with pre-release polish. The V4 model, which a senior administration official claims was trained on NVIDIA Blackwell hardware, could arrive as soon as next week. We broke down the four architectural bets behind V4 in this morning's analysis. The export control question surrounding Blackwell access adds a geopolitical layer to what is already one of the most anticipated model launches of the quarter.
OpenAI Hires Meta FAIR Researcher for Robotics Push. OpenAI recruited PengchuanZ, a key contributor to Meta's SAM and Llama projects during a 3.75-year stint at FAIR. The hire signals acceleration of OpenAI's relaunched humanoid robotics initiative, with work focused on the convergence of visual perception, world models, and physical manipulation. OpenAI quietly shelved its robotics division years ago; bringing it back with talent from Meta's best research group suggests the company sees embodied AI as commercially viable sooner than previously expected.
Research and Community
Google DeepMind Explores LLMs That Discover Multi-Agent Algorithms. New research from DeepMind investigates whether large language models can discover entirely novel multi-agent learning algorithms rather than just executing existing ones. The work uses LLM-powered optimization to design algorithms from scratch. This is still firmly in the research phase, but the implication — AI systems that can invent their own coordination strategies — would be a meaningful step toward more autonomous agent swarms.
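The propose-and-evaluate loop at the heart of such systems can be sketched with a stub standing in for the LLM. Everything below — the mutation-based proposer, the toy scoring function — is illustrative scaffolding, not DeepMind's actual method:

```python
import random

def propose(best_params, rng):
    """Stand-in for the LLM proposer: suggest a variant of the current best."""
    return [p + rng.uniform(-0.1, 0.1) for p in best_params]

def evaluate(params):
    """Stand-in task score (peaks when every parameter equals 0.5)."""
    return -sum((p - 0.5) ** 2 for p in params)

def search(iterations=200, seed=0):
    """Propose candidates, evaluate them, keep only improvements."""
    rng = random.Random(seed)
    best = [rng.random() for _ in range(3)]
    initial = best_score = evaluate(best)
    for _ in range(iterations):
        cand = propose(best, rng)
        score = evaluate(cand)
        if score > best_score:  # greedy: keep only improving designs
            best, best_score = cand, score
    return initial, best_score

initial, final = search()
```

Swap the stub proposer for an LLM that writes candidate algorithms as code, and the evaluator for real multi-agent tasks, and you have the basic shape of LLM-powered algorithm discovery.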
Tencent's CogRouter: Teaching AI When to Think Hard and When to Think Fast. Tencent published "Think Fast and Slow," a training framework called CogRouter that teaches models to adjust cognitive depth on a step-by-step basis across four levels, from fast reflexive responses to deep multi-step planning. A 7B-parameter model trained with this framework hit 82.3 percent on the paper's benchmark suite, outperforming larger models while using 62 percent fewer tokens. If this technique generalizes, it could make small models dramatically more efficient by spending compute only where problems actually require it.
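The idea of routing each step to one of four depth levels can be sketched as follows. Note the thresholds, token budgets, and hand-written difficulty scores here are all hypothetical — the actual framework learns its routing policy during training:

```python
# Illustrative sketch in the spirit of CogRouter, not the paper's method.
# A hand-written difficulty estimate stands in for the learned router.
LEVELS = {0: 16, 1: 64, 2: 256, 3: 1024}  # token budget per cognitive level

def route_step(difficulty):
    """Map a difficulty estimate in [0, 1] to one of four depth levels."""
    if difficulty < 0.25:
        return 0  # fast reflexive response
    if difficulty < 0.5:
        return 1  # short chain of thought
    if difficulty < 0.75:
        return 2  # multi-step reasoning
    return 3      # deep planning

def solve(steps):
    """Spend tokens per step only where the router says they are needed."""
    budgets = [LEVELS[route_step(d)] for _, d in steps]
    return list(zip((name for name, _ in steps), budgets)), sum(budgets)

plan, total = solve([("parse question", 0.1),
                     ("recall fact", 0.3),
                     ("derive answer", 0.8)])
# Routing spends 16 + 64 + 1024 = 1104 tokens; always reasoning at the
# deepest level would spend 3 * 1024 = 3072 on the same three steps.
```

The efficiency claim falls out of the arithmetic: most steps are cheap, so reserving the deep-planning budget for the few steps that need it cuts total tokens sharply.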
The Thread Connecting Today's News
The recurring theme across today's stories is efficiency over brute force. Confluence Labs cracked ARC-AGI-2 with learning efficiency, not data volume. Tencent's CogRouter beats bigger models with fewer tokens. SWE-bench Multilingual revealed that raw coding ability matters less than language-specific competence. Even Bridgewater's $650 billion forecast carries an implicit question: how much of that spending is justified, and how much is just scaling for scaling's sake? The next wave of AI progress may belong to the teams that figure out how to do more with less.
