37.4% is the number to remember from ServiceNow Research's new paper, "Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings". On the paper's new EnterpriseOps-Gym benchmark, that was the best success rate any of the 14 tested frontier models could manage.
That is low enough to cut through a lot of enterprise-agent marketing. If a top model only completes a bit more than one in three tasks inside a sandbox built to resemble real enterprise operations, then glossy demos are not the standard to use when you're evaluating an automation pitch. The standard is whether the agent can plan, recover, and refuse bad requests inside a messy, stateful system without leaving damage behind.
ServiceNow built a benchmark that behaves like operations
The benchmark matters because it is not another tidy tool-calling test. ServiceNow says EnterpriseOps-Gym runs inside a containerized sandbox with 164 database tables and 512 functional tools, covering 1,150 expert-curated tasks across eight enterprise verticals including customer service, HR, and IT.
The introduction makes the design goal clear: enterprise work is not just "call the right API." It is long-horizon work inside systems with memory, permissions, irreversible state changes, and a lot of irrelevant noise. The paper frames that as the actual gap between an impressive assistant and an autonomous worker. In practice, that means an agent has to keep identifiers straight, chain actions across systems, obey access rules, and notice when a request should not be completed at all.
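To make the distinction concrete, here is a minimal illustrative sketch (not code from the paper, and not ServiceNow's actual tool API; all names are invented) of what a stateful, permission-gated tool layer looks like. The point is that a single tool call can fail on an invalid identifier, fail on an access rule, or succeed and irreversibly mutate state:

```python
# Hypothetical sketch of a stateful, permission-gated sandbox tool,
# loosely inspired by the paper's description. All names are invented.
class Sandbox:
    def __init__(self):
        self.tickets = {"T-100": {"status": "open", "owner": "hr"}}
        self.audit_log = []

    def call(self, tool, actor_role, **args):
        if tool == "close_ticket":
            ticket = self.tickets.get(args["ticket_id"])
            if ticket is None:
                return {"error": "unknown ticket"}      # noisy / invalid id
            if actor_role != ticket["owner"]:
                return {"error": "permission denied"}   # access rule enforced
            ticket["status"] = "closed"                 # irreversible state change
            self.audit_log.append(("close_ticket", args["ticket_id"]))
            return {"ok": True}
        return {"error": f"unknown tool: {tool}"}

sandbox = Sandbox()
print(sandbox.call("close_ticket", actor_role="it", ticket_id="T-100"))
# → {'error': 'permission denied'}
print(sandbox.call("close_ticket", actor_role="hr", ticket_id="T-100"))
# → {'ok': True}
```

An agent operating against hundreds of tools like this has to track which identifiers exist, which roles it is acting under, and which actions cannot be undone, across a long horizon.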
That is much closer to how internal operations software behaves than most public benchmarks. It also makes the results harder to dismiss.
The leaderboard is weaker than the sales decks
Per the arXiv abstract, Claude Opus 4.5 led the benchmark at 37.4%. Gemini-3-Flash scored 31.9%. DeepSeek-V3.2 (High) reached 24.5%, making it the top open-source entry in the paper's reporting.
None of those numbers describe dependable automation. They describe partial competence under pressure.
For anyone comparing AI vendors right now, that should reset the default question. Instead of asking whether a model can use tools, ask what success rate it achieves on long, stateful workflows with permissions, coupled systems, and noisy schemas. If a vendor cannot answer that with a benchmark that looks like real operations, they are probably still selling a polished happy path.
The paper's most useful finding is about planning, not tools
The headline result is not just that scores are low. It is that supplying human-authored oracle plans lifted model performance by 14 to 35 percentage points.
That finding points in a very specific direction. The models are not primarily failing because the environment lacks tools or because tool invocation is impossible. They are failing because they do not reliably choose the right sequence of actions, preserve task state, and adapt when the environment pushes back.
That distinction matters operationally. Plenty of enterprise AI buying conversations still revolve around connectors, MCP support, tool catalogs, and platform integrations. Those things matter, but this paper suggests the harder problem sits one layer above the connector map. You can wire up hundreds of tools and still have an agent that cannot consistently decompose work, notice infeasibility, or recover from an early bad move.
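The gap between wiring tools and validating a plan can be shown in a few lines. This is an illustrative sketch under invented names, not the paper's method: a plan is simulated against current state before anything executes, so infeasible steps surface up front instead of halfway through a mutation:

```python
# Illustrative sketch: connecting tools is the easy layer; checking a plan
# against live state before acting is the harder one. All names invented.
def validate_plan(plan, state):
    """Simulate a plan step by step; report the first infeasible step."""
    sim = dict(state)
    for i, (action, arg) in enumerate(plan):
        if action == "lookup" and arg not in sim:
            return f"step {i}: '{arg}' not found"
        if action == "delete" and sim.get(arg) == "locked":
            return f"step {i}: '{arg}' is locked; deletion is irreversible"
        if action == "delete":
            sim.pop(arg, None)  # track the mutation so later steps see it
    return "plan feasible"

state = {"record_a": "open", "record_b": "locked"}
print(validate_plan([("lookup", "record_a"), ("delete", "record_a")], state))
# → plan feasible
print(validate_plan([("delete", "record_b")], state))
# → step 0: 'record_b' is locked; deletion is irreversible
print(validate_plan([("lookup", "record_c")], state))
# → step 0: 'record_c' not found
```

The oracle-plan result suggests current models struggle with exactly this layer: not invoking tools, but ordering them, carrying state forward, and detecting infeasibility before committing.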
If you are evaluating a platform, the uncomfortable question is whether the vendor has solved planning, or merely expanded the surface area on which the agent can fail.
Refusal is still a safety problem
ServiceNow also reports that the best model refused infeasible tasks only 53.9% of the time. In other words, nearly half of the requests that should have been declined were attempted anyway. That is an alarming number in an enterprise context.
In a chat product, a weak refusal can produce a bad answer. In an operational system, a weak refusal can produce side effects: a record modified in the wrong place, an unauthorized action attempt, a workflow left in an inconsistent state. The paper's conclusion that current agents are not yet ready for autonomous enterprise deployment follows directly from that risk profile, not from conservatism.
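One way to frame the difference: in an operational system, refusal needs to be a hard gate that fires before any state is touched, not a soft preference the model can override. A minimal sketch, with hypothetical names, of what that gate looks like:

```python
# Illustrative sketch of refusal as a hard precondition gate. A weak refusal
# in an agent mutates state anyway; this guard (names invented) blocks the
# request before any side effect and records why, for auditability.
def guarded_execute(request, permissions, executor):
    required = request.get("requires", set())
    missing = required - permissions
    if missing:
        # Refuse *before* touching state, and say why.
        return {"refused": True,
                "reason": f"missing permissions: {sorted(missing)}"}
    return {"refused": False, "result": executor(request)}

side_effects = []
result = guarded_execute(
    {"action": "update_payroll", "requires": {"hr_admin"}},
    permissions={"it_support"},
    executor=lambda req: side_effects.append(req["action"]),
)
print(result["refused"], result["reason"])
print(side_effects)  # → [] — nothing executed, no partial state left behind
```

A 53.9% refusal rate means the model-level equivalent of this gate is failing open almost half the time, which is why the paper's readiness conclusion follows from risk, not caution.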
This is where the 37.4% figure becomes more useful than a generic benchmark score. It is not merely a ranking. It is a decision filter. When an AI vendor claims their agent can run back-office work end to end, the burden should now be on them to show performance on tasks that are stateful, permission-constrained, and failure-sensitive. A benchmark like EnterpriseOps-Gym makes it harder to hide behind tool demos and cherry-picked workflows.
The verdict is straightforward: ServiceNow's paper does not say enterprise agents are useless. It says the market is still early, the planning stack is still brittle, and autonomous execution claims should be discounted hard until they clear something much better than 37.4%.
