Insights · evaluation safety

Your agent eval bar is obsolete — EVA-Bench 2.0 sets a new one

121 tools, 3 domains, 213 scenarios: the first benchmark that actually looks like a production agent stack.

June 5, 2026 · 5 min read · Domani AI

ServiceNow AI just published EVA-Bench 2.0, a public agent evaluation dataset spanning 121 tools, 3 domains, and 213 scenarios. For CTOs who signed off on an agent pilot this year, this matters because the benchmark your team is probably using — some combination of internal demos and vibes — has a name now, and that name is "not enough." Our take: EVA-Bench 2.0 is the first external artifact concrete enough to anchor a real conversation about production readiness, and if you haven't already defined your own eval surface, someone else just defined it for you.

What changed in agent benchmarking this week

EVA-Bench 2.0, released by ServiceNow AI on Hugging Face, expands the original EVA-Bench with a structured dataset designed to test agents across realistic, multi-tool workflows. The benchmark is organized into 3 distinct domains — each representing a class of enterprise workload — and covers 121 tools with 213 distinct scenarios built to surface failure modes that single-tool evals systematically miss.

The design philosophy here is deliberate. Rather than testing whether a model can call one API correctly in isolation, EVA-Bench 2.0 chains tool use across realistic sequences, which is where production agents actually fail. A scenario might require an agent to query a knowledge base, cross-reference a ticketing system, and write a structured output — in sequence, with state carried between steps. That's a materially different test than "did the model pick the right function name."
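
To make that difference concrete, here is a minimal Python sketch of the two kinds of check. The tool names, the `Step` structure, and the scoring rule are illustrative assumptions, not EVA-Bench 2.0's actual schema or grader.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str   # hypothetical tool name
    args: dict  # arguments the agent passed to that tool


# Hypothetical three-step scenario: search, cross-reference, write.
EXPECTED_TOOLS = ["kb_search", "ticket_lookup", "write_summary"]


def single_call_ok(step: Step, expected_tool: str) -> bool:
    """Single-tool check: did the agent pick the right function name?"""
    return step.tool == expected_tool


def chain_ok(steps: list, expected_tools: list, carried_value: str) -> bool:
    """Chained check: right tools in the right order, and the value found
    in step 1 still present in the arguments of every later step."""
    tools_in_order = [s.tool for s in steps] == expected_tools
    state_carried = all(carried_value in s.args.values() for s in steps[1:])
    return tools_in_order and state_carried


good = [
    Step("kb_search", {"query": "vpn reset"}),
    Step("ticket_lookup", {"article_id": "KB-17"}),
    Step("write_summary", {"article_id": "KB-17", "format": "json"}),
]

# Same tools, same order, but the article ID is garbled in the last step.
drifted = [
    Step("kb_search", {"query": "vpn reset"}),
    Step("ticket_lookup", {"article_id": "KB-17"}),
    Step("write_summary", {"article_id": "KB-71", "format": "json"}),
]

print(all(single_call_ok(s, t) for s, t in zip(drifted, EXPECTED_TOOLS)))  # True
print(chain_ok(drifted, EXPECTED_TOOLS, "KB-17"))                          # False
```

The drifted run passes every per-call check and still fails the chained one, which is the class of failure a function-name eval cannot see.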

The dataset is public, structured for reproducibility, and MIT-licensed, which means your engineering team can pull it today, run their own models against it, and get numbers that are at least comparable to what others are reporting. That alone puts it ahead of most internal eval rigs we've seen at companies shipping their first agent to production.

Why your current eval process probably won't survive contact with this

Most agent pilots we audit in 2026 have the same evaluation architecture: a set of hand-crafted happy-path demos, a few edge cases someone on the team thought of in a Slack thread, and an implicit bar of "the PM signed off after the Thursday review." That process finds the bugs your team already knew about. It does not find the failure modes that emerge when an agent encounters tool combinations it hasn't seen, or when a 4th-step API call returns a schema your prompt never anticipated.

EVA-Bench 2.0's 121-tool surface is significant because it forces a question most teams haven't formally asked: how many tools does your agent actually touch in production, and have you evaluated behavior across every meaningful combination? For a customer-support agent that touches a CRM, a knowledge base, a ticketing system, and an email API, the combinatorial surface is large — and the regression risk every time you upgrade the underlying model is real. Without a structured eval, you won't know a regression happened until a customer tells you.
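
A rough count shows how quickly that surface grows even for the four-tool support agent. The sketch below assumes runs of at most 4 steps and counts only the order of tool calls, ignoring argument variation, so it is a floor rather than an estimate. The tool identifiers are placeholders.

```python
from math import perm

# Placeholder names for the four tools in the support-agent example.
TOOLS = ["crm", "knowledge_base", "ticketing", "email_api"]


def sequences_without_repeats(n_tools: int, max_steps: int) -> int:
    """Ordered tool sequences of length 1..max_steps, each tool used at most once."""
    return sum(perm(n_tools, k) for k in range(1, max_steps + 1))


def sequences_with_repeats(n_tools: int, max_steps: int) -> int:
    """Ordered tool sequences of length 1..max_steps, tools may be called again."""
    return sum(n_tools ** k for k in range(1, max_steps + 1))


print(sequences_without_repeats(len(TOOLS), 4))  # 64
print(sequences_with_repeats(len(TOOLS), 4))     # 340
```

Most of those sequences will never occur in practice, which is why the useful question is about meaningful combinations. The count still makes clear that a handful of hand-written demos covers a small fraction of the space.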

There's also a vendor selection angle here. If you're evaluating third-party agent frameworks or foundation models for your next build, EVA-Bench 2.0 gives you a shared vocabulary to demand comparable numbers from vendors. "How does your model perform on EVA-Bench 2.0 domain 2" is a more defensible procurement question than "can you show us a demo that looks like our use case."

Book a Domani AI architecture audit →

The Monday-morning move depends on where your agent is in its lifecycle

The right response to EVA-Bench 2.0 isn't "run the full benchmark on everything." It's diagnostic. Start by answering 5 questions about your current agent:

1. How many distinct tools does it call in production? If the answer is more than 5, you almost certainly have untested failure modes in multi-step sequences.
2. Do you have a written eval suite, separate from your demo scripts? If the demo and the eval are the same artifact, you don't have an eval.
3. Have you tested behavior after your last model version bump? Most teams haven't run a structured regression since the initial build.
4. What's your error taxonomy? Can you categorize failures by type (wrong tool selection, bad parameter extraction, hallucinated output), or do you just know something went wrong?
5. Do you have a domain mapping? EVA-Bench's 3-domain structure is a useful forcing function: can you map your agent's workload to a coherent domain, or is it sprawling across contexts it was never designed for?
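
For the error-taxonomy question, a first version can be very small. The sketch below uses the three failure types named in the list. The record fields (`tool`, `args`, `output_grounded`) and the order of checks are assumptions for illustration, and deciding whether an output is grounded would need its own check upstream.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Failure(Enum):
    WRONG_TOOL = "wrong tool selection"
    BAD_PARAMS = "bad parameter extraction"
    HALLUCINATED_OUTPUT = "hallucinated output"


def classify(expected: dict, actual: dict) -> Optional[Failure]:
    """Return the first failure type that applies, or None if the step passed."""
    if actual["tool"] != expected["tool"]:
        return Failure.WRONG_TOOL
    if actual["args"] != expected["args"]:
        return Failure.BAD_PARAMS
    if not actual["output_grounded"]:
        return Failure.HALLUCINATED_OUTPUT
    return None


# Hypothetical logged steps: (expected, actual) pairs.
runs = [
    ({"tool": "crm_lookup", "args": {"id": "42"}},
     {"tool": "crm_lookup", "args": {"id": "42"}, "output_grounded": True}),
    ({"tool": "ticket_create", "args": {"priority": "high"}},
     {"tool": "email_send", "args": {"priority": "high"}, "output_grounded": True}),
    ({"tool": "crm_lookup", "args": {"id": "42"}},
     {"tool": "crm_lookup", "args": {"id": "24"}, "output_grounded": True}),
    ({"tool": "kb_search", "args": {"query": "refund policy"}},
     {"tool": "kb_search", "args": {"query": "refund policy"}, "output_grounded": False}),
]

results = [classify(expected, actual) for expected, actual in runs]
tally = Counter(f for f in results if f is not None)

for failure, count in tally.items():
    print(f"{failure.value}: {count}")
```

Even a tally this coarse turns "something went wrong" into a distribution you can track from one release to the next.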

If you couldn't give a confident, specific answer to 3 or more of those, the Monday move is to scope a structured eval sprint before your next production push. Pull the EVA-Bench 2.0 dataset from Hugging Face, identify which of the 3 domains most closely maps to your agent's tool surface, and run your current model against the relevant scenario subset. You won't cover everything, but you'll have a documented baseline, which is more than most teams shipping agents today can say.
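
The baseline itself can be a few lines once scenarios are loaded. The sketch below works on in-memory records with hypothetical fields (`id`, `domain`, `input`, `expected`) and a stub in place of the agent. It does not show loading from Hugging Face, because the dataset's repository ID and schema are not given here, and exact-match scoring is a simplifying assumption.

```python
from collections import defaultdict
from typing import Callable


def baseline_by_domain(scenarios: list, run_agent: Callable, domains: set) -> dict:
    """Pass rate per domain, restricted to the domains you chose to cover."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for scenario in scenarios:
        domain = scenario["domain"]
        if domain not in domains:
            continue  # outside the subset that maps to your tool surface
        total[domain] += 1
        if run_agent(scenario) == scenario["expected"]:
            passed[domain] += 1
    return {domain: passed[domain] / total[domain] for domain in total}


# Hypothetical records; the real dataset's field names may differ.
scenarios = [
    {"id": "s1", "domain": "domain_1", "input": "a", "expected": "A"},
    {"id": "s2", "domain": "domain_2", "input": "b", "expected": "B"},
    {"id": "s3", "domain": "domain_2", "input": "c", "expected": "C"},
]


def stub_agent(scenario: dict) -> str:
    """Stand-in for a real agent call: fails on scenario s3 on purpose."""
    return "wrong" if scenario["id"] == "s3" else scenario["input"].upper()


baseline = baseline_by_domain(scenarios, stub_agent, {"domain_2"})
print(baseline)  # {'domain_2': 0.5}
```

Storing that dictionary alongside the model version and date is enough to count as a documented baseline.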

If you're earlier in the cycle, evaluating whether to build or buy an agent capability, use EVA-Bench 2.0 as your vendor RFP scaffold. Ask any model or framework provider to show you domain-specific scores. If they can't, that's a signal worth weighing.

What this costs, and what running without it costs more

Running a subset of EVA-Bench 2.0 is not free. A structured eval sprint — scoping the relevant scenarios, instrumenting your agent to produce loggable outputs, and actually analyzing failure modes — takes 2 to 4 weeks of engineering time depending on your current observability setup. If you have no eval infrastructure today, expect to spend the first week building the harness before you generate a single useful number.

The alternative cost is harder to quantify but easier to recognize in retrospect: a model upgrade that silently regresses a critical workflow, a production incident that traces back to a tool-chaining failure your demo never touched, or a procurement decision that looked defensible until a vendor's numbers turned out to be cherry-picked on a benchmark that doesn't match your stack. EVA-Bench 2.0 doesn't eliminate those risks — no benchmark does — but it gives you a shared external standard to pressure-test against, which is a different category of defense than "it worked when we showed the board."
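
Catching the silent-regression case is mostly a matter of keeping per-scenario results from before and after a model change. A minimal sketch, assuming each run is stored as a mapping from a scenario ID to a pass/fail boolean (the IDs below are made up):

```python
def find_regressions(baseline: dict, candidate: dict) -> list:
    """Scenario IDs that passed on the baseline model but not on the candidate.

    A scenario missing from the candidate run is treated as a failure.
    """
    return sorted(
        scenario_id
        for scenario_id, passed in baseline.items()
        if passed and not candidate.get(scenario_id, False)
    )


# Hypothetical results before and after a model version bump.
before = {"refund_flow": True, "escalation": True, "kb_answer": False}
after = {"refund_flow": True, "escalation": False, "kb_answer": True}

print(find_regressions(before, after))  # ['escalation']
```

The aggregate pass rate is identical in both runs (2 of 3), which is exactly how a regression in a critical workflow hides behind a flat headline number.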


More on this: EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios
