Frontier models fail most enterprise IT agent tasks — here's where your budget gets burned

IBM and Artificial Analysis published the first credible third-party benchmark for agentic IT work, and the pass rates are a cold shower for 2026 agent budgets.

May 30, 2026 · 5 min read · Domani AI

A credible third-party benchmark just put a number on the gap between agent demos and production IT operations: frontier models score below 50% on real enterprise IT tasks. For CTOs finalising agent platform contracts this quarter, that number is the most important data point you don't have in your vendor deck. Our read is that the benchmark's task categories map directly to a go/no-go decision tree for which IT workflows are safe to hand to an agent today.

What changed — IBM and Artificial Analysis published the first agentic IT benchmark

ITBench-AA, released jointly by IBM Research and Artificial Analysis, is the first publicly documented benchmark designed specifically for agentic enterprise IT tasks. Unlike prior coding or reasoning benchmarks, ITBench-AA evaluates models on multi-step IT operations work: tasks that require an agent to observe system state, plan a sequence of actions, execute them against real or simulated environments, and verify the outcome — the same loop your IT automation vendor promises in every demo.

The headline result is that frontier models — including the leading models from the major labs — score below 50% on the benchmark's task suite. The benchmark spans categories that map to real IT workflows: incident detection and triage, configuration management, compliance checking, and multi-system coordination. Performance varies meaningfully across these categories, which is the part most coverage will miss. A model that passes 60% of compliance-checking tasks might pass fewer than 30% of multi-system coordination tasks. The aggregate sub-50% figure flattens a distribution that should be driving your prioritisation decisions.
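To see how an aggregate can hide the spread, here is a minimal Python sketch. The category names follow the workflow categories listed above, but the task and pass counts are invented for illustration and are not ITBench-AA results. They are chosen only to echo the 60% versus under-30% example.

```python
# Hypothetical per-category results: (tasks attempted, tasks passed).
# Illustrative numbers only; these are NOT taken from ITBench-AA.
results = {
    "incident_triage": (50, 24),
    "configuration_management": (50, 22),
    "compliance_checking": (50, 30),
    "multi_system_coordination": (50, 14),
}


def pass_rates(results):
    """Pass rate per category, as a fraction between 0 and 1."""
    return {
        name: passed / attempted
        for name, (attempted, passed) in results.items()
    }


def aggregate_pass_rate(results):
    """Single headline number: total passed over total attempted."""
    attempted = sum(a for a, _ in results.values())
    passed = sum(p for _, p in results.values())
    return passed / attempted


if __name__ == "__main__":
    print(f"aggregate: {aggregate_pass_rate(results):.0%}")
    for name, rate in sorted(pass_rates(results).items(), key=lambda kv: kv[1]):
        print(f"{name}: {rate:.0%}")
```

With these made-up counts the headline lands under 50%, while the per-category rates range from well below that figure to well above it. That spread is what a vendor's aggregate accuracy number leaves out.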

The benchmark is reproducible and third-party: neither IBM nor Artificial Analysis is selling you the model being evaluated. That independence matters when you're deciding whether to trust a pass rate. The primary methodology and results are documented in the ITBench-AA post on Hugging Face.

Why the task-category breakdown changes the math on agent ownership

Most enterprise IT agent pilots fail quietly. A team ships an agent for Tier-1 ticket triage, it handles 70% of cases adequately in staging, and then in production it starts closing tickets it shouldn't and escalating ones it should resolve — but nobody built an audit trail that surfaces the pattern until three months in. The ITBench-AA results explain the mechanism: agentic failure is not uniformly distributed. It concentrates in tasks that require coordinating state across more than one system, interpreting ambiguous environmental signals, or executing irreversible actions in the correct sequence.

That distribution should reshape how you categorise your IT automation backlog. Tasks that are stateless, reversible, and scoped to a single system (password resets, log pulls, scheduled report generation) sit in a different risk tier than tasks that touch multiple systems, mutate state, or have downstream dependencies. The benchmark's category structure gives you the empirical basis to draw that line. If your vendor's agent demo lives in the stateless, reversible bucket, the sub-50% headline does not apply directly. If the demo involves multi-system incident remediation, the headline understates the risk: expect pass rates below the aggregate figure.

The second implication is for contract terms. If you're signing a platform contract that prices on agent task volume or on automation rate, and the vendor's internal benchmarks are self-reported, you now have an external reference point to request independent evals as a contract condition. That is a negotiating lever that did not exist six months ago.

The Monday-morning move — run your IT automation backlog through a two-axis triage

Before your next vendor call or internal roadmap review, sort your planned IT agent use cases on two axes: state complexity (single-system vs. multi-system) and reversibility (easily undone vs. hard or impossible to undo). That gives you four quadrants. The only quadrant where current frontier models are likely to deliver production-grade reliability without a human-in-the-loop scaffold is single-system, reversible tasks.

For everything outside that quadrant, the ITBench-AA results suggest you should be budgeting for at minimum a confirmation step before execution — and for multi-system, irreversible tasks, a human sign-off gate. That is not a reason to stop the programme; it's a reason to wire the scaffold before you sign the volume contract.

This week: Pull your IT automation use-case list and tag each item: single-system or multi-system, reversible or irreversible.
Before your next vendor call: Ask for task-category-level pass rates, not aggregate accuracy. If they can't provide them, assume the aggregate hides weak categories and plan around the worst case.
Before signing: Add an independent eval clause or a 60-day production accuracy review to any agent platform contract over a material threshold.
In parallel: Identify one single-system, reversible workflow — password resets, scheduled diagnostics, log aggregation — and run a contained pilot with full audit logging. Use it as your internal benchmark baseline before expanding scope.
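The two-axis triage above can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed implementation: the `UseCase` and `Gate` names are hypothetical, and the tags on the sample backlog are assumptions you would replace with your own assessment.

```python
from dataclasses import dataclass
from enum import Enum


class Gate(Enum):
    NONE = "autonomous, with full audit logging"
    CONFIRM = "confirmation step before execution"
    HUMAN_SIGNOFF = "human sign-off gate"


@dataclass(frozen=True)
class UseCase:
    name: str
    multi_system: bool  # state complexity axis
    reversible: bool    # reversibility axis


def required_gate(use_case: UseCase) -> Gate:
    """Map a tagged use case to the minimum scaffold the article suggests."""
    if not use_case.multi_system and use_case.reversible:
        return Gate.NONE           # the only quadrant without a human in the loop
    if use_case.multi_system and not use_case.reversible:
        return Gate.HUMAN_SIGNOFF  # highest-risk quadrant
    return Gate.CONFIRM            # the two mixed quadrants


# Example backlog; the tags here are illustrative assumptions.
backlog = [
    UseCase("password reset", multi_system=False, reversible=True),
    UseCase("log aggregation", multi_system=False, reversible=True),
    UseCase("multi-system incident remediation", multi_system=True, reversible=False),
]

for item in backlog:
    print(f"{item.name}: {required_gate(item).value}")
```

Keeping the policy in one small function makes it easy to review with the vendor and to tighten later, for example by moving a mixed quadrant up to a sign-off gate if the pilot data warrants it.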

What this costs, and why staying uninformed costs more

Adding human-in-the-loop gates to an agent workflow increases operational overhead. If you planned to automate 200 monthly IT tasks and 40% of them fall into the multi-system or irreversible quadrants, then 80 of those tasks are not fully automated; you're creating a lighter-weight approval workflow for them. That is a real cost: engineer time to build the gates, process time for reviewers, and a lower automation rate than your business case projected. Plan for it now rather than discover it in Q4 when the contract is already running.
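The arithmetic behind that example is simple enough to check in a short Python sketch. The 200 tasks and 40% share come from the paragraph above; the six minutes of reviewer time per gated task is a placeholder assumption, not a measured figure.

```python
def gate_overhead(monthly_tasks, gated_share, review_minutes_per_task):
    """Split a monthly task volume into gated and autonomous work.

    review_minutes_per_task is an assumption you supply; the article
    gives no figure for reviewer time.
    """
    gated = round(monthly_tasks * gated_share)
    autonomous = monthly_tasks - gated
    return {
        "gated_tasks": gated,
        "autonomous_tasks": autonomous,
        "autonomous_rate": autonomous / monthly_tasks,
        "reviewer_hours_per_month": gated * review_minutes_per_task / 60,
    }


if __name__ == "__main__":
    # 200 tasks and 40% are the article's example; 6 minutes is hypothetical.
    print(gate_overhead(monthly_tasks=200, gated_share=0.40,
                        review_minutes_per_task=6))
```

Under these inputs, 80 tasks go through a gate and 120 run autonomously, so the fully automated rate is 60% rather than 100%. Swapping in your own review-time estimate turns the gated count into a reviewer-hours line item for the business case.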

The cost of not doing this triage is worse and less visible. Agent errors in IT operations compound: a misconfigured firewall rule, a wrongly closed incident ticket, a compliance record that reflects what the agent reported rather than what actually happened. The ITBench-AA benchmark documents that these failure modes are not edge cases — they are the median outcome on multi-system tasks for today's frontier models. Vendors will improve, and we expect pass rates to rise through 2026 as both models and scaffolding mature. But the contracts you sign this quarter will govern production deployments that run for 12 to 18 months. Build the scaffold for today's capability, not next year's roadmap.

Source: https://huggingface.co/blog/ibm-research/itbench-aa
