Benchmark proof

Paid API Benchmark Scoring

The Paid API Benchmark Lab is designed for agents and crawlers that need comparable evidence before spending. Scores are built from fixtures, unpaid x402 challenge metadata, schema evidence, and saved readiness reports when Ontario has them.

Commercial next step

Turn this evidence into launch checks and spend policy.

If this page describes a real workflow, use Ontario's free verifier and can-pay API first. Buy implementation help only when agents will touch paid tools, x402 endpoints, or spend-capable MCP workflows.

Generate the free x402 launch kit: Listing, manifest, MCP, CI, README, and registry drafts.
See the $199 catalog launch pack: Fixed-scope provider help after the free evidence check.
Agent payment firewall: Allow, review, or deny before a wallet signs.
Scope the $499 payment firewall: One workflow, one policy, and fail-closed allow/review/deny tests.
Try can-pay free: No settlement, no wallet signature, no private keys.

Scoring Weights

Category	Points
Uptime	20
x402 Payment Correctness	25
Schema Quality	15
Price Clarity	15
Network Asset Clarity	10
Report History	15
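
As a rough illustration of how the weights above could combine into a total, the sketch below sums each category's share of its maximum points. The weights come from the table; the per-category fraction input, the key names, and the `total_score` helper are assumptions made for this example and are not Ontario's documented formula.

```python
# Maximum points per category, copied from the Scoring Weights table.
# The snake_case key names are hypothetical.
WEIGHTS = {
    "uptime": 20,
    "x402_payment_correctness": 25,
    "schema_quality": 15,
    "price_clarity": 15,
    "network_asset_clarity": 10,
    "report_history": 15,
}


def total_score(fractions):
    """Combine per-category fractions (0.0 to 1.0) into a 0-100 total.

    Assumption: each category earns a share of its maximum points and the
    total is the sum of those shares. Missing categories count as zero.
    """
    total = 0.0
    for category, max_points in WEIGHTS.items():
        # Clamp to [0, 1] so a bad input cannot exceed the category maximum.
        fraction = min(max(fractions.get(category, 0.0), 0.0), 1.0)
        total += fraction * max_points
    return total


# The table's weights add up to a 100-point scale.
assert sum(WEIGHTS.values()) == 100

print(total_score({"uptime": 1.0, "schema_quality": 0.5}))  # prints 27.5
```

Because the maximums sum to 100, a total under this assumption reads directly as a percentage of the available points.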

Safety policy: the benchmark sends no signed payment headers, makes no paid x402 settlement calls, and never calls a facilitator settle route.

No paid calls are made

Benchmark scoring is a pre-payment evaluation. It does not sign a payment header, submit a payment payload, or call a facilitator settle route.

Evidence: The benchmark payload exposes safety.paid_settlement_calls_made=false and safety.facilitator_settle_called=false.

Fix: Use the benchmark to shortlist endpoints, then run a separate controlled settlement test only when your own policy allows spending.
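
An agent could check the two safety flags before trusting a benchmark result. The sketch below is a minimal example that assumes the payload has already been parsed into a Python dict with a top-level `safety` object; the function name `is_prepayment_safe` is hypothetical.

```python
def is_prepayment_safe(payload):
    """Return True only when both safety flags are explicitly False.

    Fails closed: a missing "safety" object, a missing flag, or any value
    other than the boolean False is treated as unsafe.
    """
    safety = payload.get("safety")
    if not isinstance(safety, dict):
        return False
    return (
        safety.get("paid_settlement_calls_made") is False
        and safety.get("facilitator_settle_called") is False
    )


# Hypothetical payload fragment for illustration.
example = {
    "safety": {
        "paid_settlement_calls_made": False,
        "facilitator_settle_called": False,
    }
}
print(is_prepayment_safe(example))  # prints True
```

Using `is False` rather than a truthiness test means an absent flag is never mistaken for a confirmed "no paid calls" signal.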

Readiness reports are reused first

If Ontario already has saved x402 readiness reports for an endpoint or origin, the benchmark uses that history for report-derived signals. Fixture reports are used when no saved history exists.

Evidence: Each benchmark row includes report_history.source, actual_report_count, fixture_report_count, and matching_strategy.

Fix: Generate a fresh readiness report with /api/verify/x402-readiness to replace fixture-only evidence for a real endpoint.
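
To find rows that still rest on fixture-only evidence, a consumer might filter on the report counts. This sketch assumes `actual_report_count` and `fixture_report_count` are nested under `report_history` in each row and that rows carry an `endpoint` key; both the nesting and the `endpoint` key are assumptions, as is the helper name `needs_fresh_report`.

```python
def needs_fresh_report(row):
    """Return True when a row has no saved (actual) readiness reports.

    Assumption: the counts live under row["report_history"]. A missing
    count is treated as zero, so the row is flagged for a fresh report.
    """
    history = row.get("report_history") or {}
    return history.get("actual_report_count", 0) == 0


# Hypothetical rows for illustration only.
rows = [
    {"endpoint": "https://api.example.com/a",
     "report_history": {"actual_report_count": 3, "fixture_report_count": 0}},
    {"endpoint": "https://api.example.com/b",
     "report_history": {"actual_report_count": 0, "fixture_report_count": 2}},
]

stale = [row["endpoint"] for row in rows if needs_fresh_report(row)]
print(stale)  # prints ['https://api.example.com/b']
```

Each flagged endpoint is a candidate for a new report from /api/verify/x402-readiness.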

Scores are decomposed

Every total score includes a category-level breakdown, so agents can distinguish a cheap endpoint with weak schema from an expensive endpoint with strong report history.

Evidence: The score_breakdown object lists category scores, maximum points, explanations, and the signals behind each category.

Fix: Improve the weakest category first: publish clearer price fields, fix network or asset ambiguity, add OpenAPI, or keep report history fresh.
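
Picking the weakest category can be automated by comparing each category's score to its maximum. The sketch below assumes `score_breakdown` maps category names to objects with `score` and `max_points` keys; that shape and the `weakest_category` helper are assumptions for illustration, since the actual field names are not given here.

```python
def weakest_category(score_breakdown):
    """Return the category with the lowest score-to-maximum ratio.

    Assumption: each entry looks like {"score": 6, "max_points": 15} with a
    positive max_points. Returns None for an empty breakdown. On a tie, the
    first category in iteration order wins.
    """
    if not score_breakdown:
        return None

    def ratio(item):
        _, entry = item
        return entry["score"] / entry["max_points"]

    name, _ = min(score_breakdown.items(), key=ratio)
    return name


# Hypothetical breakdown for illustration only.
breakdown = {
    "schema_quality": {"score": 6, "max_points": 15},
    "price_clarity": {"score": 5, "max_points": 15},
    "report_history": {"score": 15, "max_points": 15},
}
print(weakest_category(breakdown))  # prints price_clarity
```

Comparing ratios rather than raw scores keeps a 10-point category from always looking weaker than a 25-point one.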

Sources

Open the benchmark lab