
Agent framework selection just got a benchmark you can cite

The Open Agent Leaderboard compares frameworks on task success and cost-per-task — which changes how you shortlist for a 2026 build.

May 22, 2026 · 5 min read · Domani AI

IBM Research and Hugging Face published the Open Agent Leaderboard this month, and it is the first public benchmark that scores agent frameworks on task success rate and cost alongside each other. For any CTO still choosing between LangGraph, AutoGen, smolagents, or a custom harness, the selection process just changed: the answer no longer has to be "we liked the docs" or "our last engineer knew it." The uncomfortable implication is that frameworks your team assumed were equivalent are measurably not — and the gaps matter at production scale.

What changed — and what the leaderboard actually measures

The Open Agent Leaderboard evaluates open-source agent frameworks across a standardized set of agentic tasks. It scores each framework on task success rate (did the agent complete the goal?), tool-calling accuracy (did it invoke the right tools in the right order?), and cost-per-task in USD. The model backend is held fixed, so the framework, not the underlying model, is the variable under test.

The leaderboard is designed to be model-agnostic within the evaluation harness. That means two frameworks running on the same base model can produce materially different success rates depending on how they structure prompts, manage state, and handle tool-call retries. The cost-per-task metric captures token consumption across the full agent loop, not just a single inference call — which is the number that actually hits your API bill in production.
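To make the "full agent loop" point concrete, here is a minimal sketch of how cost-per-task can be tallied from per-call token usage. The prices, the `StepUsage` type, and the token counts are illustrative assumptions, not the leaderboard's harness or any provider's actual rates.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


@dataclass
class StepUsage:
    """Token usage for one model call inside the agent loop."""
    input_tokens: int
    output_tokens: int


def cost_per_task(steps: list[StepUsage]) -> float:
    """Sum the cost of every model call the agent made for one task."""
    input_total = sum(s.input_tokens for s in steps)
    output_total = sum(s.output_tokens for s in steps)
    return (
        input_total * INPUT_PRICE_PER_MTOK
        + output_total * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


# One made-up task: a planning call, a retried tool call, and a final answer.
task = [StepUsage(1200, 150), StepUsage(1800, 200), StepUsage(2400, 400)]

single_call_cost = cost_per_task(task[:1])  # what a one-shot estimate would see
full_loop_cost = cost_per_task(task)        # what the API bill would reflect

print(f"first call only: ${single_call_cost:.5f}")
print(f"full agent loop: ${full_loop_cost:.5f}")
```

The gap between the two figures grows with every retry and reflection step, which is why a per-call estimate tends to understate what a framework costs to run.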

The methodology distinguishes between single-agent and multi-agent task configurations, and it separates tool-heavy tasks (file operations, API calls, code execution) from reasoning-heavy tasks (multi-step planning, conditional branching). Those two axes map almost directly to the split in real enterprise workloads: orchestration-heavy pipelines versus research-style reasoning chains.

Why this changes the math on agent framework ownership

Most teams pick an agent framework the way they pick a JavaScript bundler: based on community momentum, a strong advocate internally, or a tutorial that happened to be good. That was defensible in 2024, when the frameworks were all early and the benchmarks didn't exist. In 2026, committing to a framework is a multi-year architectural decision. Agent logic accretes. Prompt structures get embedded in tooling. Migrating from one orchestration layer to another after 12 months of production is not a weekend project.

The cost-per-task number is the one most decision-makers underweight. At 10,000 tasks per month, a framework that costs 40% more per task is a rounding error. At 500,000 tasks per month, which is where a successful internal deployment can land within 18 months, the same premium is a line item that surfaces in your infrastructure review. The leaderboard gives you a way to model that number before you build, not after you're already committed.
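The arithmetic behind that comparison is simple enough to sketch. The 40% premium and the two volumes come from the paragraph above; the base cost of $0.05 per task is a placeholder assumption, so replace it with a figure from your own workload.

```python
def monthly_cost_delta(base_cost_per_task: float, premium: float, tasks_per_month: int) -> float:
    """Extra monthly spend when one framework costs `premium` (0.40 = 40%) more per task."""
    return base_cost_per_task * premium * tasks_per_month


BASE_COST = 0.05  # hypothetical USD per task for the cheaper framework
PREMIUM = 0.40    # the 40% per-task premium from the text

for volume in (10_000, 500_000):
    delta = monthly_cost_delta(BASE_COST, PREMIUM, volume)
    print(f"{volume:>7,} tasks/month -> ${delta:,.2f} extra per month")
```

Under those assumed inputs the premium works out to about $200 a month at the lower volume and about $10,000 a month at the higher one. The percentage is the same in both cases, but only the second shows up in a budget review.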

The second underweighted dimension is tool-calling accuracy under failure conditions. Most framework demos show the happy path. The leaderboard's evaluation includes tasks where tools return errors, return partial results, or require retry logic. Success rate on those tasks predicts production reliability far better than performance on clean demos. If your workload is heavily tool-dependent, and most enterprise agent workloads are, that sub-score matters more than the headline success rate.
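If you want to probe failure-path behavior in your own stack, one option is to inject tool failures deliberately and measure how often a task still completes. The sketch below is an assumption-laden stand-in, not the leaderboard's methodology: `make_flaky_tool`, `call_with_retries`, and `success_rate` are hypothetical helpers, and the "tool" is a stub that fails at a configurable rate.

```python
import random
from typing import Callable


class ToolError(Exception):
    """Raised by the stand-in tool to simulate a failed call."""


def make_flaky_tool(failure_rate: float, rng: random.Random) -> Callable[[str], str]:
    """Return a stub tool that fails with probability `failure_rate`."""
    def tool(query: str) -> str:
        if rng.random() < failure_rate:
            raise ToolError("simulated upstream failure")
        return f"result for {query}"
    return tool


def call_with_retries(tool: Callable[[str], str], query: str, max_attempts: int) -> bool:
    """Try the tool up to `max_attempts` times; report whether any attempt succeeded."""
    for _ in range(max_attempts):
        try:
            tool(query)
            return True
        except ToolError:
            continue
    return False


def success_rate(failure_rate: float, max_attempts: int, trials: int = 1000, seed: int = 0) -> float:
    """Fraction of simulated tasks that complete under injected tool failures."""
    rng = random.Random(seed)  # seeded so repeated runs are comparable
    tool = make_flaky_tool(failure_rate, rng)
    successes = sum(
        call_with_retries(tool, f"task-{i}", max_attempts) for i in range(trials)
    )
    return successes / trials


for attempts in (1, 3):
    rate = success_rate(failure_rate=0.3, max_attempts=attempts)
    print(f"max_attempts={attempts}: success rate {rate:.3f}")
```

In a real spike you would swap the stub for your actual tools behind a fault-injecting wrapper and let each candidate framework's own retry handling do the work. What you are comparing is how each framework behaves when the tools misbehave.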


The Monday-morning move: run your requirements against the leaderboard axes

Don't start by reading the full leaderboard. Start by writing down three constraints from your actual build before you look at the rankings:

  • Latency budget: What is the maximum acceptable end-to-end task completion time? Frameworks with more retry and reflection loops score better on accuracy but add wall-clock time.
  • Tool surface: How many distinct tools does your agent need to call, and how often do those tools return non-200 responses? Weight the tool-calling accuracy sub-score accordingly.
  • Monthly task volume at 18-month scale: Project forward to the volume you expect in 18 months rather than using today's figure. Multiply that number by the cost-per-task delta between your top two framework candidates.

Once you have those three numbers, look at the leaderboard's sub-scores for your task configuration (single-agent vs. multi-agent, tool-heavy vs. reasoning-heavy). Shortlist the two frameworks that score highest on your weighted criteria. Then run a 48-hour spike: take one real task from your backlog, implement it in both frameworks, and record actual token consumption and success rate on your tooling. The leaderboard gives you the prior; the spike gives you the evidence to update it for your specific stack.
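The shortlisting step can be sketched as a weighted score over sub-scores. Everything in this example is a placeholder: the framework names, the sub-score values, and the weights are invented for illustration and are not leaderboard results. The `cost_efficiency` field is also an assumption, standing for cost-per-task normalized so that higher is better.

```python
def weighted_score(sub_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of sub-scores, normalized by the total weight."""
    total_weight = sum(weights.values())
    return sum(sub_scores[name] * w for name, w in weights.items()) / total_weight


def shortlist(candidates: dict[str, dict[str, float]], weights: dict[str, float], top_n: int = 2) -> list[str]:
    """Return the `top_n` candidate names ranked by weighted score, best first."""
    ranked = sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )
    return ranked[:top_n]


# Placeholder names and numbers, NOT leaderboard data.
candidates = {
    "framework_a": {"task_success": 0.80, "tool_accuracy": 0.70, "cost_efficiency": 0.60},
    "framework_b": {"task_success": 0.75, "tool_accuracy": 0.85, "cost_efficiency": 0.50},
    "framework_c": {"task_success": 0.70, "tool_accuracy": 0.60, "cost_efficiency": 0.90},
}

# A tool-heavy workload, so tool-calling accuracy gets the largest weight.
weights = {"task_success": 0.3, "tool_accuracy": 0.5, "cost_efficiency": 0.2}

top_two = shortlist(candidates, weights)
print(top_two)
```

Writing the weights down before looking at the rankings is the point of the exercise. It keeps the shortlist tied to your constraints rather than to whichever framework tops the headline table.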

If your team doesn't have 48 hours to run that spike without pulling someone off a delivery commitment, that is itself a signal: you are not staffed to make this architectural decision safely without external input.

What it costs to act now versus waiting for the leaderboard to mature

Acting now means committing to a framework before the leaderboard has full coverage of every framework your team might consider. The Open Agent Leaderboard is live but not exhaustive — if your preferred framework isn't yet on it, you're back to partial information. The honest trade-off: you can wait 60 to 90 days for broader coverage, but if you have a build starting in Q3, that delay compresses your architecture window.

The bigger risk is the opposite: waiting indefinitely for a perfect benchmark while continuing to let each engineer pick the framework they know. That produces a heterogeneous agent stack that is expensive to operate and nearly impossible to audit. One standardized framework selection — even an imperfect one, made with the best available data — costs less over 24 months than three frameworks running in parallel because no one made a call.

The leaderboard doesn't make the decision for you. It removes the excuse for not making it.
