37.4% is the number to remember from ServiceNow Research's new paper, "Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings". On the paper's new EnterpriseOps-Gym benchmark, that was the best success rate any of the 14 tested frontier models could manage.
That is low enough to cut through a lot of enterprise-agent marketing. If a top model only completes a bit more than one in three tasks inside a sandbox built to resemble real enterprise operations, then glossy demos are not the standard to use when you're evaluating an automation pitch. The standard is whether the agent can plan, recover, and refuse bad requests inside a messy, stateful system without leaving damage behind.
ServiceNow built a benchmark that behaves like operations
The benchmark matters because it is not another tidy tool-calling test. ServiceNow says EnterpriseOps-Gym runs inside a containerized sandbox with 164 database tables and 512 functional tools, covering 1,150 expert-curated tasks across eight enterprise verticals including customer service, HR, and IT.
The introduction makes the design goal clear: enterprise work is not just "call the right API." It is long-horizon work inside systems with memory, permissions, irreversible state changes, and a lot of irrelevant noise. The paper frames that as the actual gap between an impressive assistant and an autonomous worker. In practice, that means an agent has to keep identifiers straight, chain actions across systems, obey access rules, and notice when a request should not be completed at all.
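To make the distinction concrete, here is a minimal illustrative sketch (not code from the paper, and not ServiceNow's actual tool API; all names are invented) of what a stateful, permission-gated tool layer looks like. The point is that a single tool call can fail on an invalid identifier, fail on an access rule, or succeed and irreversibly mutate state:

```python
# Hypothetical sketch of a stateful, permission-gated sandbox tool,
# loosely inspired by the paper's description. All names are invented.
class Sandbox:
    def __init__(self):
        self.tickets = {"T-100": {"status": "open", "owner": "hr"}}
        self.audit_log = []

    def call(self, tool, actor_role, **args):
        if tool == "close_ticket":
            ticket = self.tickets.get(args["ticket_id"])
            if ticket is None:
                return {"error": "unknown ticket"}      # noisy / invalid id
            if actor_role != ticket["owner"]:
                return {"error": "permission denied"}   # access rule enforced
            ticket["status"] = "closed"                 # irreversible state change
            self.audit_log.append(("close_ticket", args["ticket_id"]))
            return {"ok": True}
        return {"error": f"unknown tool: {tool}"}

sandbox = Sandbox()
print(sandbox.call("close_ticket", actor_role="it", ticket_id="T-100"))
# → {'error': 'permission denied'}
print(sandbox.call("close_ticket", actor_role="hr", ticket_id="T-100"))
# → {'ok': True}
```

An agent operating against hundreds of tools like this has to track which identifiers exist, which roles it is acting under, and which actions cannot be undone, across a long horizon.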
That is much closer to how internal operations software behaves than most public benchmarks. It also makes the results harder to dismiss.
The leaderboard is weaker than the sales decks
Per the arXiv abstract, Claude Opus 4.5 led the benchmark at 37.4%. Gemini-3-Flash scored 31.9%. DeepSeek-V3.2 (High) reached 24.5%, making it the top open-source entry in the paper's reporting.
None of those numbers describe dependable automation. They describe partial competence under pressure.
For anyone comparing AI vendors right now, that should reset the default question. Instead of asking whether a model can use tools, ask what success rate it achieves on long, stateful workflows with permissions, coupled systems, and noisy schemas. If a vendor cannot answer that with a benchmark that looks like real operations, they are probably still selling a polished happy path.
The paper's most useful finding is about planning, not tools
The headline result is not just that scores are low. It is that supplying human-authored oracle plans lifted model performance by 14 to 35 percentage points.
That finding points in a very specific direction. The models are not primarily failing because the environment lacks tools or because tool invocation is impossible. They are failing because they do not reliably choose the right sequence of actions, preserve task state, and adapt when the environment pushes back.
That distinction matters operationally. Plenty of enterprise AI buying conversations still revolve around connectors, MCP support, tool catalogs, and platform integrations. Those things matter, but this paper suggests the harder problem sits one layer above the connector map. You can wire up hundreds of tools and still have an agent that cannot consistently decompose work, notice infeasibility, or recover from an early bad move.
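The gap between wiring tools and validating a plan can be shown in a few lines. This is an illustrative sketch under invented names, not the paper's method: a plan is simulated against current state before anything executes, so infeasible steps surface up front instead of halfway through a mutation:

```python
# Illustrative sketch: connecting tools is the easy layer; checking a plan
# against live state before acting is the harder one. All names invented.
def validate_plan(plan, state):
    """Simulate a plan step by step; report the first infeasible step."""
    sim = dict(state)
    for i, (action, arg) in enumerate(plan):
        if action == "lookup" and arg not in sim:
            return f"step {i}: '{arg}' not found"
        if action == "delete" and sim.get(arg) == "locked":
            return f"step {i}: '{arg}' is locked; deletion is irreversible"
        if action == "delete":
            sim.pop(arg, None)  # track the mutation so later steps see it
    return "plan feasible"

state = {"record_a": "open", "record_b": "locked"}
print(validate_plan([("lookup", "record_a"), ("delete", "record_a")], state))
# → plan feasible
print(validate_plan([("delete", "record_b")], state))
# → step 0: 'record_b' is locked; deletion is irreversible
print(validate_plan([("lookup", "record_c")], state))
# → step 0: 'record_c' not found
```

The oracle-plan result suggests current models struggle with exactly this layer: not invoking tools, but ordering them, carrying state forward, and detecting infeasibility before committing.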
If you are evaluating a platform, the uncomfortable question is whether the vendor has solved planning, or merely expanded the surface area on which the agent can fail.
Refusal is still a safety problem
ServiceNow also reports that the best model refused infeasible tasks only 53.9% of the time. In other words, nearly half of the requests that should have been declined were attempted anyway. That is an alarming number in an enterprise context.
In a chat product, a weak refusal can produce a bad answer. In an operational system, a weak refusal can produce side effects: a record modified in the wrong place, an unauthorized action attempt, a workflow left in an inconsistent state. The paper's conclusion that current agents are not yet ready for autonomous enterprise deployment follows directly from that risk profile, not from conservatism.
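One way to frame the difference: in an operational system, refusal needs to be a hard gate that fires before any state is touched, not a soft preference the model can override. A minimal sketch, with hypothetical names, of what that gate looks like:

```python
# Illustrative sketch of refusal as a hard precondition gate. A weak refusal
# in an agent mutates state anyway; this guard (names invented) blocks the
# request before any side effect and records why, for auditability.
def guarded_execute(request, permissions, executor):
    required = request.get("requires", set())
    missing = required - permissions
    if missing:
        # Refuse *before* touching state, and say why.
        return {"refused": True,
                "reason": f"missing permissions: {sorted(missing)}"}
    return {"refused": False, "result": executor(request)}

side_effects = []
result = guarded_execute(
    {"action": "update_payroll", "requires": {"hr_admin"}},
    permissions={"it_support"},
    executor=lambda req: side_effects.append(req["action"]),
)
print(result["refused"], result["reason"])
print(side_effects)  # → [] — nothing executed, no partial state left behind
```

A 53.9% refusal rate means the model-level equivalent of this gate is failing open almost half the time, which is why the paper's readiness conclusion follows from risk, not caution.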
This is where the 37.4% figure becomes more useful than a generic benchmark score. It is not merely a ranking. It is a decision filter. When an AI vendor claims their agent can run back-office work end to end, the burden should now be on them to show performance on tasks that are stateful, permission-constrained, and failure-sensitive. A benchmark like EnterpriseOps-Gym makes it harder to hide behind tool demos and cherry-picked workflows.
The verdict is straightforward: ServiceNow's paper does not say enterprise agents are useless. It says the market is still early, the planning stack is still brittle, and autonomous execution claims should be discounted hard until they clear something much better than 37.4%.
