Enterprise IT agents just got a harder benchmark. The best models still scored below 50%.

ITBench-AA shows why enterprise IT agents need scoped pilots, workflow receipts, eval datasets, approval gates, and human escalation before they touch production systems.

Sean McLellan

Lead Architect & Founder

May 29, 2026

8 min read

The most useful result from ITBench-AA is not that AI agents are bad at IT work.

It is that the work is finally being tested in a way that looks closer to the mess businesses actually care about.

Artificial Analysis and IBM launched ITBench-AA, a benchmark for agentic enterprise IT tasks based on IBM's ITBench work. The first track focuses on Site Reliability Engineering. Agents are asked to diagnose Kubernetes incidents using evidence an operations team would recognize: logs, traces, metrics, events, alerts, and topology snapshots.

The headline number is sobering. Claude Opus 4.7 Adaptive Reasoning Max Effort led the initial SRE leaderboard at 47%. GPT-5.5 xhigh scored 46%. Qwen3.7 Max scored 42%. According to the announcement, all frontier models scored below 50%.

That should not send teams back to spreadsheets and ticket queues forever. It should change what "ready for production" means.

For support, infrastructure, internal ops, workflow automation, and customer-impacting systems, the question is no longer, "Can an agent produce a plausible answer?" The better question is whether it can find the right cause, avoid adding false causes, show its work, stay inside its authority, and escalate when the risk is too high.

That is a different bar. It is also the bar that matters.

What ITBench-AA is testing

ITBench-AA starts with 59 SRE tasks: 40 public tasks and 19 held-out tasks. Each task gives the agent a Kubernetes incident snapshot. The agent has shell access to a sandboxed file system with the relevant logs and snapshots, a 100-turn cap, and three repeats per task.

The job is to identify the minimal set of independent root-cause Kubernetes entities responsible for the incident. That could include deployments, services, pods, or other entities, depending on the scenario.

The scoring method is stricter than "did the answer sound reasonable?" It uses average precision at full recall. If the agent misses any ground-truth root cause, that repeat scores 0.0. If it finds all of them, the score equals precision, so extra wrong root-cause entities lower the score.
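
As a rough illustration of that rule, here is a minimal sketch of how one repeat could be scored. It assumes the per-task score is the mean over repeats, and the entity names are made up for the example; it is not the benchmark's actual harness code.

```python
def score_repeat(predicted: set[str], ground_truth: set[str]) -> float:
    """Precision at full recall for a single repeat."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    # Missing any ground-truth root cause zeroes the repeat.
    if not ground_truth <= predicted:
        return 0.0
    # With full recall, the score is precision: extra entities dilute it.
    return len(ground_truth) / len(predicted)


def score_task(repeats: list[set[str]], ground_truth: set[str]) -> float:
    """Mean of the per-repeat scores (assumed aggregation)."""
    return sum(score_repeat(p, ground_truth) for p in repeats) / len(repeats)


# Hypothetical incident with one true root cause and three repeats.
truth = {"deployment/checkout"}
repeats = [
    {"deployment/checkout"},                      # exact match: 1.0
    {"deployment/checkout", "service/payments"},  # one extra entity: 0.5
    {"service/payments"},                         # missed the cause: 0.0
]
print(score_task(repeats, truth))  # 0.5
```

Note how the second repeat names the right deployment and still loses half its score for blaming an unrelated service.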

That detail matters. In a real incident, naming the right service but also blaming unrelated upstream systems can waste time. A noisy diagnosis is not just imperfect. It can send responders down the wrong path while customers are still waiting.

The benchmark also shows that more agent activity does not automatically mean better work. The announcement notes that GPT-5.5 xhigh averaged 31 turns per task and scored 46%, while Gemini 3.1 Pro Preview averaged 83 turns and scored 30%. Longer trails can mean deeper investigation. They can also mean the agent is collecting symptoms, upstream mechanisms, and false positives.

That is one reason benchmarks like this are useful. They push evaluation beyond "the model answered" and toward "the model completed the operational job cleanly."

What the sub-50% score should mean

A sub-50% score does not mean teams should ignore enterprise IT agents. It means teams should stop treating demos as deployment evidence.

There is still useful work for agents in IT and business operations. They can summarize noisy tickets, draft incident timelines, search logs, correlate runbook steps, prepare change requests, gather context for a human reviewer, and recommend next actions. In many organizations, that alone would save real time.

But ITBench-AA is a reminder that there is a gap between "helpful assistant" and "autonomous operator."

The first category can improve human work. The second can change systems, affect customers, and create audit or compliance problems if it is wrong.

That gap is where deployment design matters.

If an agent scores 46% on a benchmark for diagnosing Kubernetes incidents, you probably do not want it restarting services, rolling back deployments, changing routing, or paging downstream teams without a second check. You might want it to collect evidence, propose likely root causes, cite the files or metrics it used, and ask for approval before taking action.

That is not a timid use of AI. It is a more honest one.

The benchmark supports a better evaluation habit

At BaristaLabs, we have been arguing that agent evals should inspect the work, not only the answer. For workflow agents, that means evaluating tool calls, approvals, state changes, recovery behavior, and other evidence left behind by the run.

ITBench-AA points in the same direction.

The score depends on whether the agent found the root causes and avoided extra wrong entities. The harness constrains the environment. The task includes logs and traces. The run produces a trajectory. That gives evaluators more to inspect than a final paragraph.

This is the right direction for enterprise AI agent evaluation. A business workflow is not just a text answer. It is a chain of actions.

For an IT agent, the evidence worth inspecting might include:

Which logs it opened.
Which commands it ran.
Which alerts it used or ignored.
Which entities it named as root causes.
Whether it confused symptoms with causes.
Whether it proposed a reversible next step.
Whether it asked for approval before a risky action.
Whether it escalated when evidence was thin.

If your eval does not look at those receipts, it can pass agents that sound confident and fail in practice.

What a safe pilot should look like

The practical lesson is not "wait until agents are perfect." They will not be. The lesson is to pilot agents in a way that makes failure visible, bounded, and useful.

Start narrow. Do not begin with "run IT operations." Pick one workflow where the inputs, success criteria, and risk boundaries are clear. Good early candidates might be support ticket triage, alert enrichment, incident timeline generation, runbook lookup, log summarization, or read-only root-cause suggestion.

Keep the agent away from irreversible production actions at first. Read-only access is underrated. An agent that can gather evidence, classify an issue, and draft a recommended response may still save your team time. Let humans approve restarts, refunds, permission changes, customer messages, billing adjustments, data edits, and infrastructure changes until the eval record justifies more autonomy.

Collect workflow receipts. Store transcripts, tool calls, command history, retrieved documents, state changes, approvals, and final outcomes. This is where teams often underinvest. Without receipts, every failure becomes a vague anecdote: "the agent got confused." With receipts, you can see whether the problem was missing context, a bad tool call, weak instructions, poor retrieval, a model limitation, or an unsafe permission boundary.
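
One way to make receipts concrete is a small record per run that can be appended to a log file. The sketch below is only an assumption about what such a record might hold; the field names, `RunReceipt`, and `ToolCall` are hypothetical and should follow whatever your own workflow produces.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCall:
    tool: str        # hypothetical tool name, e.g. "shell"
    arguments: str   # the command or query the agent issued
    read_only: bool  # whether the call could change system state


@dataclass
class RunReceipt:
    run_id: str
    workflow: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    retrieved_documents: list[str] = field(default_factory=list)
    approval_ids: list[str] = field(default_factory=list)
    proposed_root_causes: list[str] = field(default_factory=list)
    escalated: bool = False
    outcome: str = "unknown"

    def to_json_line(self) -> str:
        """Serialize to one JSON line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


receipt = RunReceipt(run_id="run-001", workflow="alert-enrichment")
receipt.tool_calls.append(
    ToolCall(tool="shell", arguments="grep -i error app.log", read_only=True)
)
receipt.proposed_root_causes.append("deployment/checkout")
print(receipt.to_json_line())
```

Keeping the record flat and serializable makes it easier to diff two runs or replay a failure later.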

Turn real failures into eval cases. AWS made a related point in its May 28 post on dataset management in Amazon Bedrock AgentCore: agent evaluation benefits from stable offline baselines plus fast-moving online signals. Versioned datasets keep inputs fixed so scores stay comparable across runs. When something breaks in production, that failure can become a permanent test case.
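
The idea of fixed, versioned inputs can be sketched in a few lines of plain Python. This is not the Amazon Bedrock AgentCore API; `dataset_version` and `add_failure_case` are hypothetical helpers, and the case fields are assumptions chosen for illustration.

```python
import hashlib
import json


def dataset_version(cases: list[dict]) -> str:
    """Content hash of the dataset: any change to the cases changes the version."""
    canonical = json.dumps(sorted(cases, key=lambda c: c["case_id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def add_failure_case(cases: list[dict], case_id: str, inputs: dict, expected: dict) -> list[dict]:
    """Return a new dataset that keeps a production failure as a permanent case."""
    if any(c["case_id"] == case_id for c in cases):
        raise ValueError(f"case {case_id} already exists")
    return cases + [{"case_id": case_id, "inputs": inputs, "expected": expected}]


baseline = [
    {
        "case_id": "inc-001",
        "inputs": {"alert": "HighLatency"},
        "expected": {"root_causes": ["deployment/checkout"]},
    }
]
v1 = dataset_version(baseline)

# A missed escalation in production becomes a new case and a new version.
updated = add_failure_case(baseline, "inc-002", {"alert": "PodCrashLoop"}, {"escalate": True})
v2 = dataset_version(updated)
print(v1, v2, v1 != v2)
```

Recording the version next to every score is what keeps two runs comparable: if the hashes differ, the scores were measured on different inputs.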

That is how business teams should treat early agent pilots. Every missed escalation, wrong recommendation, bad tool call, or confusing handoff should become part of the next eval suite.

Use deterministic checks where you can. Not every eval needs an LLM judge. If the agent must call a certain tool, check whether it did. If it must avoid a destructive action, check the action log. If it must name a root-cause entity, compare against known ground truth. If it must create an approval request before execution, verify the request exists.
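
Those four checks can be written as ordinary code over a run's receipt. The sketch below assumes a receipt stored as a plain dictionary; the keys, tool names, and the `run_checks` helper are hypothetical.

```python
def run_checks(
    receipt: dict,
    *,
    required_tool: str,
    forbidden_tools: set[str],
    ground_truth: set[str],
) -> dict[str, bool]:
    """Deterministic pass/fail checks over one run's receipt."""
    tools_used = [call["tool"] for call in receipt["tool_calls"]]
    approved = set(receipt["approval_ids"])
    return {
        # The agent must have called a certain tool.
        "called_required_tool": required_tool in tools_used,
        # The agent must not have used a destructive tool.
        "avoided_forbidden_tools": not forbidden_tools.intersection(tools_used),
        # Named root causes must equal the known ground truth.
        "root_causes_match": set(receipt["root_causes"]) == ground_truth,
        # Every executed action must point at an approval request that exists.
        "actions_had_approval": all(
            action["approval_id"] in approved
            for action in receipt["executed_actions"]
        ),
    }


receipt = {
    "tool_calls": [{"tool": "read_logs"}, {"tool": "query_metrics"}],
    "root_causes": ["deployment/checkout"],
    "executed_actions": [{"name": "restart_service", "approval_id": "apr-17"}],
    "approval_ids": ["apr-17"],
}
results = run_checks(
    receipt,
    required_tool="read_logs",
    forbidden_tools={"delete_namespace"},
    ground_truth={"deployment/checkout"},
)
print(results)
print(all(results.values()))  # True
```

Returning a named result per check, rather than one boolean, makes it obvious which rule a failing run broke.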

An AWS and LangSmith post on evaluating deep agents recommends deterministic graders where possible, LLM graders where judgment is needed, and human graders for calibration. That is a sensible split. Use code for the facts. Use people for the parts that require operational judgment.

Separate capability evals from regression evals. Capability evals answer: can this agent do the task at all? Regression evals answer: did the latest change make the agent worse at something it used to handle?

You need both. Early on, you are learning whether the workflow is a fit for an agent. Once the workflow starts working, the eval suite becomes a gate. A new model, prompt, tool, retrieval source, or permission change should not ship just because it looked good in one demo. It should pass the cases you already care about.
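
A small sketch shows why the two questions need separate answers. It assumes each eval run is summarized as a mapping from case id to pass or fail; `pass_rate` and `regressions` are hypothetical helper names.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Capability view: what share of cases does the agent handle at all?"""
    return sum(results.values()) / len(results)


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Regression view: cases that passed before and now fail or are missing."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


baseline = {"inc-001": True, "inc-002": False, "inc-003": True}
candidate = {"inc-001": True, "inc-002": True, "inc-003": False}

# The overall pass rate is identical for both runs...
print(pass_rate(baseline) == pass_rate(candidate))  # True
# ...but the candidate broke a case the baseline handled.
print(regressions(baseline, candidate))  # ['inc-003']
```

A gate built on this would block the change whenever `regressions` returns anything, even if the headline pass rate held steady or improved.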

Add approval gates before autonomy expands. Before an agent can take action, decide which actions need human approval, which can run automatically, and which are prohibited. We have written about this in our guide to building an AI approval queue before giving agents real authority.

For production IT, customer data, payments, legal workflows, and security-sensitive systems, approval queues are not bureaucracy. They are how you let agents do useful work without pretending they have the judgment, context, or accountability of the people who own the system.
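
The three-way split between automatic, approval-required, and prohibited actions can be written down as an explicit policy table. The action names and the `decide` helper below are hypothetical, and sending unknown actions to approval is one possible default, not the only reasonable one.

```python
from enum import Enum


class Decision(Enum):
    AUTO = "auto"
    NEEDS_APPROVAL = "needs_approval"
    PROHIBITED = "prohibited"


# Hypothetical policy for an early, mostly read-only pilot.
POLICY: dict[str, Decision] = {
    "read_logs": Decision.AUTO,
    "draft_incident_timeline": Decision.AUTO,
    "restart_service": Decision.NEEDS_APPROVAL,
    "rollback_deployment": Decision.NEEDS_APPROVAL,
    "delete_namespace": Decision.PROHIBITED,
}


def decide(action: str) -> Decision:
    """Look up an action; anything not listed goes to a human by default."""
    return POLICY.get(action, Decision.NEEDS_APPROVAL)


for action in ["read_logs", "restart_service", "delete_namespace", "change_dns"]:
    print(action, "->", decide(action).value)
```

Keeping the policy as data, separate from the agent's prompt, means expanding autonomy is a reviewed change to one table rather than an edit to instructions the model may or may not follow.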

The open-weight cost result is interesting, but keep it in scope

One detail from ITBench-AA is worth watching: some open-weight models were cost-competitive inside this benchmark. Gemma 4 31B Reasoning scored 37% at $0.14 per task. GLM-5.1 Reasoning scored 40% at $1.23 per task.

That does not prove those models are better for your environment. It does suggest that teams should evaluate cost, accuracy, latency, hosting requirements, privacy needs, and workflow risk together.

For some narrow tasks, a cheaper model with good guardrails may beat a frontier model used carelessly. For other tasks, the best model still may not be safe enough without human review. The only way to know is to test the actual workflow with your own data, tools, and failure modes.
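
To look at cost and score together, one crude option is to divide cost per task by the benchmark score. This ratio is our own illustrative simplification, not a metric from the announcement, and it ignores latency, hosting, privacy, and risk. The figures are the two quoted above.

```python
def cost_per_score_point(cost_per_task: float, score: float) -> float:
    """Dollars spent per full point of benchmark score (an assumed, crude ratio)."""
    if score <= 0:
        raise ValueError("score must be positive")
    return cost_per_task / score


# (cost per task in USD, benchmark score as a fraction) from the figures above.
models = {
    "Gemma 4 31B Reasoning": (0.14, 0.37),
    "GLM-5.1 Reasoning": (1.23, 0.40),
}

for name, (cost, score) in models.items():
    ratio = cost_per_score_point(cost, score)
    print(f"{name}: ${ratio:.2f} per full point of score")
```

On these two data points, the three-point score gain comes at roughly eight times the cost per score point, which is the kind of trade-off worth rerunning on your own workflow before choosing a model.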

The business takeaway

ITBench-AA is useful because it makes agentic enterprise IT work look less like a chat demo and more like operations.

That is good for buyers.

If you are considering enterprise IT agents, support agents, infrastructure agents, or broader process automation, do not ask vendors only for answer quality. Ask for eval design. Ask what receipts they store. Ask how they score tool use. Ask what actions are gated. Ask how failures become regression tests. Ask who reviews high-risk decisions. Ask what happens when the agent is uncertain.

The teams that get value from agents will not be the ones that trust them the fastest. They will be the ones that narrow the job, measure the work, and expand autonomy only when the evidence supports it.

That fits our view of responsible AI: clear boundaries, human approval for high-risk work, audit logs, and workflow review. It is also how we approach AI consulting when a team wants to move from "we should try agents" to a pilot that can survive contact with real operations.

If you are trying to decide where agents fit in your business, start with the workflow, not the model. Pick the task. Define the receipts. Build the eval. Then decide how much autonomy the agent has earned.

If you want help choosing a safe first workflow, BaristaLabs can help you map the pilot and the evaluation plan before anyone gives an agent production keys.
