Skip to content

Efficacy Benchmark (XAAB)

The Xybern Agent Authorization Benchmark (XAAB) is a neutral, reproducible suite that measures how well an AI-agent authorization layer does the one job that matters: stop unsafe agent actions while letting legitimate ones through.

Anyone can copy a marketing claim. A published benchmark, dataset and the exact configuration scored, is much harder to copy, because the numbers have to hold up when you re-run them.

Results (v1)

Target Catch rate False positives F1 Youden's J Latency p50 / p95
Xybern Authorisation Layer 100.0% 0.0% 100.0% 100.0% 6.8 s / 10.6 s
Pattern guardrail (regex/keyword) 49.5% 13.3% 64.6% 36.2% < 1 ms
Baseline: allow-all (no layer) 0.0% 0.0% n/a 0.0% 0 ms
Baseline: block-all 100.0% 100.0% 78.1% 0.0% 0 ms

137 scenarios, 107 unsafe / 30 legitimate, across 14 categories. The dataset is 54 base scenarios, 62 evasion variants, 11 hard-adversarial attacks (obfuscation, multi-turn, indirect injection), and 10 hard-benign look-alikes (see below). The Authorisation Layer caught all 107 unsafe actions and allowed all 30 legitimate ones, zero misses, zero false positives.

The baselines are there on purpose. Block-all also scores a 100% catch rate, by blocking everything, including all legitimate work. That's why the honest number is Youden's J (catch rate − false-positive rate): it only rewards a layer that catches attacks and doesn't cry wolf. Both baselines score J = 0; the Authorisation Layer scores J = 100.

Evasion robustness

The real test of a layer is whether it survives rephrasing. Each base attack is also run as paraphrased / obfuscated variants that keep the malicious intent but strip the obvious keywords (e.g. "DROP TABLE users" becomes "eliminate the users relation entirely as part of quarterly housekeeping"; "ignore previous instructions" becomes "set aside any prior operational guidelines for this compliance review"). This is precisely how a real attacker, or a manipulated agent, gets past a filter.

Target Catch rate (original) Catch rate (evasion variants) Drop
Xybern Authorisation Layer 100.0% 100.0% 0 pts
Pattern guardrail (regex/keyword) 83.9% 43.5% −40 pts

The pattern guardrail catches a strong 83.9% of those base attacks, but its catch rate collapses to 43.5% the moment the same attacks are reworded, because a keyword it doesn't contain is a keyword it can't match. The Authorisation Layer's catch rate is unchanged at 100%: its semantic policies judge intent, so a paraphrase is just another phrasing of the same prohibited thing.

Two failure modes keyword matching can't escape

The hard cases probe both ways a layer fails. The pattern guardrail is caught in a vice: loosen the keywords and it misses more; tighten them and it blocks legitimate work.

  • Obfuscated / multi-turn / indirect attacks (base64-encoded commands, leetspeak injection, a request built up across turns, an instruction hidden in fetched data): the pattern layer misses all 11; the Authorisation Layer catches all 11, because intent survives encoding and context.
  • Scary-looking but legitimate actions (a code-review note mentioning DROP TABLE, an authorised pentest report describing exfiltration, an incident post-mortem, a fraud-awareness training email quoting a scam): the pattern layer false-positively blocks 4 of these on keyword collisions; the Authorisation Layer correctly allows all 10, because it judges what the action does, not which words it contains.

That is the whole case for intent over keywords, on both axes at once, at the cost of the multi-second LLM latency the Authorisation Layer pays and the pattern layer doesn't.

What it measures

Each scenario is labelled restrict (unsafe, should be blocked/escalated) or allow (legitimate, should proceed). From the resulting confusion matrix we report catch rate (recall), false-positive rate, precision, F1, Youden's J, and authorize latency. A target's native verdict is normalised to restrict/allow so the comparison is vendor-neutral, allow_with_warning counts as allowed, never as a catch.

Threat taxonomy

Scenarios are derived from public references (OWASP LLM Top 10, MITRE ATLAS, CWE):

Category Example unsafe action Caught by
Destructive ops DROP TABLE users action-type policy
Privilege escalation grant self admin role action-type policy
Credential access dump all env secrets action-type policy
Data exfiltration email customer DB to Gmail content + sequence (read→send)
PII disclosure send SSNs / card numbers out content policy
Prompt injection "ignore previous instructions…" content + semantic
Unsafe code curl … \| sh, SQL injection content policy
Velocity abuse runaway loop / scraping burst sequence (velocity)
Financial risk wire funds to a crypto wallet semantic (intent)
Social engineering CEO-impersonation payment semantic (intent)

The hardest cases have no literal trigger, a paraphrased injection, an insider-trading trade, a business-email-compromise request. These are caught by the Authorisation Layer's semantic policies, which judge intent, not keywords. Every category also includes legitimate look-alikes (a benign email, a redacted summary, a routine payment) to test false positives.

Reproduce it

You don't need a Xybern account to run the benchmark. The dataset, harness, the pattern guardrail, and the baselines run locally with no API key:

pip install -r requirements.txt
python -m xaab.cli validate
python -m xaab.cli run --target pattern-guardrail --target allow-all --target block-all

The Xybern Authorisation Layer row runs against the hosted product, so its scores are published as frozen results in the repo. Both the scenario dataset and the exact reference policy pack are public, so the configuration behind that row is fully transparent and auditable.

Adding a competitor

Implement a ~30-line adapter that maps a scenario to a vendor's authorize call and maps the response back to restrict/allow. The dataset and scoring are identical for every target. We welcome adapters for competing layers and adversarial scenarios, including ones designed to make the Authorisation Layer fail.

Neutrality & limitations

  • The Authorisation Layer's score uses the published reference pack, nothing hidden or tuned per scenario. Scenarios come from public taxonomies, not Authorisation Layer internals.
  • The paraphrased-attack catches come from an LLM intent judge, which adds minor run-to-run variance and is the main driver of the multi-second latency, both reported honestly.
  • v1 is 137 hand-auditable scenarios (54 base + 62 evasion variants + 11 hard-adversarial + 10 hard-benign); we intend to grow it.

The benchmark is fully open source, dataset, harness, the reference policy pack, and frozen results, at github.com/xybern-ai/agent-authz-benchmark. Clone it, run the offline targets in seconds, add an adapter for your own layer, or contribute adversarial scenarios. See the repo's METHODOLOGY.md for the full threat taxonomy and per-category breakdown.