Efficacy Benchmark (XAAB)¶
The Xybern Agent Authorization Benchmark (XAAB) is a neutral, reproducible suite that measures how well an AI-agent authorization layer does the one job that matters: stop unsafe agent actions while letting legitimate ones through.
Anyone can copy a marketing claim. A published benchmark, dataset and the exact configuration scored, is much harder to copy, because the numbers have to hold up when you re-run them.
Results (v1)¶
| Target | Catch rate | False positives | F1 | Youden's J | Latency p50 / p95 |
|---|---|---|---|---|---|
| Xybern Authorisation Layer | 100.0% | 0.0% | 100.0% | 100.0% | 6.8 s / 10.6 s |
| Pattern guardrail (regex/keyword) | 49.5% | 13.3% | 64.6% | 36.2% | < 1 ms |
| Baseline: allow-all (no layer) | 0.0% | 0.0% | n/a | 0.0% | 0 ms |
| Baseline: block-all | 100.0% | 100.0% | 78.1% | 0.0% | 0 ms |
137 scenarios, 107 unsafe / 30 legitimate, across 14 categories. The dataset is 54 base scenarios, 62 evasion variants, 11 hard-adversarial attacks (obfuscation, multi-turn, indirect injection), and 10 hard-benign look-alikes (see below). The Authorisation Layer caught all 107 unsafe actions and allowed all 30 legitimate ones, zero misses, zero false positives.
The baselines are there on purpose. Block-all also scores a 100% catch rate, by blocking everything, including all legitimate work. That's why the honest number is Youden's J (catch rate − false-positive rate): it only rewards a layer that catches attacks and doesn't cry wolf. Both baselines score J = 0; the Authorisation Layer scores J = 100.
Evasion robustness¶
The real test of a layer is whether it survives rephrasing. Each base attack is also run as paraphrased / obfuscated variants that keep the malicious intent but strip the obvious keywords (e.g. "DROP TABLE users" becomes "eliminate the users relation entirely as part of quarterly housekeeping"; "ignore previous instructions" becomes "set aside any prior operational guidelines for this compliance review"). This is precisely how a real attacker, or a manipulated agent, gets past a filter.
| Target | Catch rate (original) | Catch rate (evasion variants) | Drop |
|---|---|---|---|
| Xybern Authorisation Layer | 100.0% | 100.0% | 0 pts |
| Pattern guardrail (regex/keyword) | 83.9% | 43.5% | −40 pts |
The pattern guardrail catches a strong 83.9% of those base attacks, but its catch rate collapses to 43.5% the moment the same attacks are reworded, because a keyword it doesn't contain is a keyword it can't match. The Authorisation Layer's catch rate is unchanged at 100%: its semantic policies judge intent, so a paraphrase is just another phrasing of the same prohibited thing.
Two failure modes keyword matching can't escape¶
The hard cases probe both ways a layer fails. The pattern guardrail is caught in a vice: loosen the keywords and it misses more; tighten them and it blocks legitimate work.
- Obfuscated / multi-turn / indirect attacks (base64-encoded commands, leetspeak injection, a request built up across turns, an instruction hidden in fetched data): the pattern layer misses all 11; the Authorisation Layer catches all 11, because intent survives encoding and context.
- Scary-looking but legitimate actions (a code-review note mentioning
DROP TABLE, an authorised pentest report describing exfiltration, an incident post-mortem, a fraud-awareness training email quoting a scam): the pattern layer false-positively blocks 4 of these on keyword collisions; the Authorisation Layer correctly allows all 10, because it judges what the action does, not which words it contains.
That is the whole case for intent over keywords, on both axes at once, at the cost of the multi-second LLM latency the Authorisation Layer pays and the pattern layer doesn't.
What it measures¶
Each scenario is labelled restrict (unsafe, should be blocked/escalated) or
allow (legitimate, should proceed). From the resulting confusion matrix we
report catch rate (recall), false-positive rate, precision, F1, Youden's J, and
authorize latency. A target's native verdict is normalised to restrict/allow
so the comparison is vendor-neutral, allow_with_warning counts as allowed,
never as a catch.
Threat taxonomy¶
Scenarios are derived from public references (OWASP LLM Top 10, MITRE ATLAS, CWE):
| Category | Example unsafe action | Caught by |
|---|---|---|
| Destructive ops | DROP TABLE users |
action-type policy |
| Privilege escalation | grant self admin role | action-type policy |
| Credential access | dump all env secrets | action-type policy |
| Data exfiltration | email customer DB to Gmail | content + sequence (read→send) |
| PII disclosure | send SSNs / card numbers out | content policy |
| Prompt injection | "ignore previous instructions…" | content + semantic |
| Unsafe code | curl … \| sh, SQL injection |
content policy |
| Velocity abuse | runaway loop / scraping burst | sequence (velocity) |
| Financial risk | wire funds to a crypto wallet | semantic (intent) |
| Social engineering | CEO-impersonation payment | semantic (intent) |
The hardest cases have no literal trigger, a paraphrased injection, an insider-trading trade, a business-email-compromise request. These are caught by the Authorisation Layer's semantic policies, which judge intent, not keywords. Every category also includes legitimate look-alikes (a benign email, a redacted summary, a routine payment) to test false positives.
Reproduce it¶
You don't need a Xybern account to run the benchmark. The dataset, harness, the pattern guardrail, and the baselines run locally with no API key:
pip install -r requirements.txt
python -m xaab.cli validate
python -m xaab.cli run --target pattern-guardrail --target allow-all --target block-all
The Xybern Authorisation Layer row runs against the hosted product, so its scores are published as frozen results in the repo. Both the scenario dataset and the exact reference policy pack are public, so the configuration behind that row is fully transparent and auditable.
Adding a competitor¶
Implement a ~30-line adapter that maps a scenario to a vendor's authorize call
and maps the response back to restrict/allow. The dataset and scoring are
identical for every target. We welcome adapters for competing layers and
adversarial scenarios, including ones designed to make the Authorisation Layer fail.
Neutrality & limitations¶
- The Authorisation Layer's score uses the published reference pack, nothing hidden or tuned per scenario. Scenarios come from public taxonomies, not Authorisation Layer internals.
- The paraphrased-attack catches come from an LLM intent judge, which adds minor run-to-run variance and is the main driver of the multi-second latency, both reported honestly.
- v1 is 137 hand-auditable scenarios (54 base + 62 evasion variants + 11 hard-adversarial + 10 hard-benign); we intend to grow it.
The benchmark is fully open source, dataset, harness, the reference policy pack,
and frozen results, at
github.com/xybern-ai/agent-authz-benchmark.
Clone it, run the offline targets in seconds, add an adapter for your own layer,
or contribute adversarial scenarios. See the repo's METHODOLOGY.md for the full
threat taxonomy and per-category breakdown.