Third-party AI evals are now a procurement gate — read them like a buyer
OpenAI's new eval playbook signals that vendor-supplied safety claims won't survive security reviews much longer.
OpenAI just published a shared playbook for trustworthy third-party evaluations, a structured framework for how frontier model evaluations should be designed, conducted, and reported. For researchers, it's methodology. For CTOs, it's something different: a preview of the standard your procurement and infosec teams will be asked to enforce within the next quarter or two. The part most coverage misses is that this document hands buyers a checklist. The real question is whether you know how to use it.
What changed in how evaluations are framed
OpenAI's guidance organizes third-party evals into three distinct tracks: capability evaluations (what can the model actually do?), safeguard evaluations (what behaviors has it been trained or instructed to suppress?), and validity assessments (do the benchmarks measure what they claim to measure?). Each track, the document argues, requires different evaluator expertise, different access to model internals, and different reporting standards.
The document also makes a pointed distinction between evaluations run with a model provider's cooperation and those run adversarially or independently. Cooperative evals get fuller access but carry the risk of shaping results toward the provider's preferred narrative. Independent evals preserve objectivity but often work only on model outputs, missing instruction-tuning artifacts or system-prompt defaults that matter enormously in production deployment.
This is the framing your vendors are about to adopt in their security documentation. The timing is not accidental: the EU AI Act's conformity assessment obligations for high-risk systems are coming into enforcement scope this year, and several large enterprise procurement frameworks (including US federal contractor AI guidance) are converging on third-party eval requirements as the accepted proof of due diligence.
Why this changes the math on trusting vendor safety claims
Until recently, a frontier model vendor could hand you a model card, cite a few academic benchmarks, and call it a day. That era is ending. What OpenAI's playbook signals — even if unintentionally — is that the industry is coalescing around a shared vocabulary for what a credible eval looks like. That vocabulary will be used against you in a vendor review if you can't speak it.
The harder problem is that most eval reports in the wild fail at least one of the three tracks. Capability claims are frequently benchmark-optimized: the model was trained on data that overlaps with the test set in ways the report doesn't disclose. Safeguard claims almost always come from cooperative evals, meaning the model provider selected the prompts, defined the harm categories, and in some cases ran the evaluation themselves with a nominal third-party signature on the cover page. Validity is the track that nearly every vendor report skips entirely, because it requires the evaluator to defend why their benchmark predicts real-world behavior. That argument can be made, but it is hard to make well.
For your stack, this matters most when you are deploying a model in any context where a failure has regulatory, reputational, or contractual consequences: customer-facing automation, legal document processing, HR screening tools, medical information retrieval. In those contexts, a capability claim without a validity section is not evidence; it is marketing.
→ Book a Domani AI architecture audit if you're mid-procurement on a frontier model and need an outside read on whether the eval documentation holds up.
The Monday-morning move: a decision tree for reading an eval report
When a vendor hands you an eval report — or when you're reviewing a model for internal deployment — run it through these gates before it influences a build or buy decision.
Gate 1: Who conducted the eval, and what did they have access to?
- If the evaluator was paid directly by the model provider and had no pre-registered protocol: treat capability claims as directional, not definitive.
- If the evaluator had weights or fine-tuning access: safeguard claims are more trustworthy than those from output-only audits, but check for conflict-of-interest disclosure.
- If the evaluator was fully independent with no provider access: validity of the benchmark choice matters more than the score itself.
Gate 2: Are capability, safeguard, and validity claims separated or bundled?
- Bundled reports that use a single score to cover all three tracks are almost always marketing documents. Separate them manually if you have to: find where the capability claims end and the safety claims begin.
- Absence of a validity section is a yellow flag. Ask the vendor: why does this benchmark predict your production use case?
Gate 3: What is the scope of the safeguard evaluation?
- Ask explicitly: were the harm categories defined by the evaluator or the provider?
- Ask: what adversarial prompting methodology was used, and was it consistent with your deployment context (API access, system prompt, tool use)?
- If the safeguard eval was run on the base model but you're deploying a fine-tuned or system-prompted variant: the eval is not applicable to your deployment. Request a new one or scope your liability accordingly.
Gate 4: Can the report be reproduced?
- Is the benchmark public? Are the prompts disclosed? If neither: the report cannot be independently verified, and you should price that uncertainty into your risk assessment.
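The four gates above can be sketched as a small checklist function. This is a minimal illustration, not a standard schema: the `EvalReport` fields, the `run_gates` name, and the string values are all hypothetical, chosen only to mirror the questions in the gates. You would fill the fields in by hand while reading the report.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalReport:
    """Hypothetical summary of one vendor eval report (all field names are illustrative)."""
    paid_by_provider: bool
    pre_registered_protocol: bool
    access: str                      # "weights" or "outputs_only"
    conflict_disclosure: bool
    tracks_separated: bool           # capability / safeguard / validity reported separately
    has_validity_section: bool
    harm_categories_defined_by: str  # "evaluator" or "provider"
    evaluated_variant: str           # e.g. "base"
    deployed_variant: str            # e.g. "base", "fine_tuned", "system_prompted"
    benchmark_public: bool
    prompts_disclosed: bool


def run_gates(r: EvalReport) -> List[str]:
    """Return one finding per gate condition the report trips; an empty list means none."""
    findings = []
    # Gate 1: who ran the eval, and with what access?
    if r.paid_by_provider and not r.pre_registered_protocol:
        findings.append("G1: treat capability claims as directional, not definitive")
    if r.access == "weights" and not r.conflict_disclosure:
        findings.append("G1: no conflict-of-interest disclosure")
    if r.access == "outputs_only":
        findings.append("G1: output-only audit; weigh benchmark validity over the score")
    # Gate 2: are the three tracks separated?
    if not r.tracks_separated:
        findings.append("G2: bundled score; separate the tracks manually")
    if not r.has_validity_section:
        findings.append("G2: no validity section (yellow flag)")
    # Gate 3: scope of the safeguard evaluation
    if r.harm_categories_defined_by == "provider":
        findings.append("G3: harm categories were defined by the provider")
    if r.evaluated_variant != r.deployed_variant:
        findings.append("G3: eval does not apply to the deployed variant")
    # Gate 4: reproducibility (fails only if neither benchmark nor prompts are available)
    if not r.benchmark_public and not r.prompts_disclosed:
        findings.append("G4: report cannot be independently verified")
    return findings


# Example: a provider-funded eval of the base model, deployed as a fine-tune.
example = EvalReport(
    paid_by_provider=True, pre_registered_protocol=False,
    access="weights", conflict_disclosure=True,
    tracks_separated=True, has_validity_section=False,
    harm_categories_defined_by="evaluator",
    evaluated_variant="base", deployed_variant="fine_tuned",
    benchmark_public=True, prompts_disclosed=False,
)
for finding in run_gates(example):
    print(finding)
```

The function returns findings rather than a single pass/fail score on purpose: collapsing the gates into one number would recreate the bundling problem that Gate 2 warns about.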
This week: pull the eval documentation for every frontier model currently in procurement or live in your stack. Run gates 1 and 3 first, since those two catch most of the failures for the least time invested. Flag anything that fails gate 3 for immediate legal and compliance review before the next deployment increment.
What this costs, and what it saves you
Reading an eval report critically adds 2 to 4 hours per vendor to your procurement cycle. Commissioning an independent architecture review against a specific deployment context, the next step when a report fails gate 2 or 3, runs 3 to 6 weeks and costs real money. That is not a small ask in a quarter with shipping pressure.
The counterfactual cost is harder to see until it arrives. A safeguard claim that doesn't hold under your actual system prompt is your liability exposure, not just the vendor's problem. Regulatory enforcement under the EU AI Act for high-risk system failures can reach 3% of global annual turnover, and "the vendor's eval said it was safe" has not, in any published guidance, been accepted as a compliance defense. The Monday-morning move is not about perfection; it's about knowing which claims in your current stack are load-bearing and which are theater, before someone else finds out.
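To put the two sides of that trade-off next to each other, here is a back-of-the-envelope sketch. It uses only the figures quoted above (2 to 4 review hours per vendor, a 3% turnover ceiling). The vendor count and turnover are made-up inputs, the function names are hypothetical, and the calculation does not model any other detail of how fines are actually set.

```python
from typing import Tuple


def review_hours(vendors: int, low: float = 2.0, high: float = 4.0) -> Tuple[float, float]:
    """Range of hours spent reading eval reports critically, at 2 to 4 hours per vendor."""
    return vendors * low, vendors * high


def turnover_ceiling(global_annual_turnover: int, rate_percent: int = 3) -> float:
    """Upper bound implied by a percentage-of-turnover cap (3% is the figure cited above)."""
    return global_annual_turnover * rate_percent / 100


# Hypothetical inputs: 5 vendors in procurement, 500 million in global annual turnover.
lo_hours, hi_hours = review_hours(5)
ceiling = turnover_ceiling(500_000_000)
print(f"Review effort: {lo_hours:.0f} to {hi_hours:.0f} hours")
print(f"Turnover-based ceiling: {ceiling:,.0f}")
```

The comparison is deliberately crude: a few working days of review on one side, a ceiling that scales with company size on the other.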