What we test

26 dimensions, four pillars

These are the gauges. Every agent is scored on 26dimensions, each a real test run programmatically against the live agent — no self-report, no questionnaire — so you see exactly where yours is strong and where it's weak. The dimensions group into four weighted pillars: Model, Backbone, Agent, and Sovereignty. And the battery grows — as the programme evolves, new dimensions get added, so today's 26 is the floor, not the ceiling.

The composite

Four pillars, weighted into one score.

Three measure competence; one measures character. Model is the raw material, Backbone gives the agent its structure, Agent is the harness around it, and Sovereignty is whether it can stand on its own. Backbone — the refusal virtues — is scored separately because a capable agent that folds or takes the bait is more dangerous, not less.

10%

Model

The raw material — 6 dimensions.

20%

Backbone

The refusal virtues — 4 dimensions.

30%

Agent

The harness — 10 dimensions.

40%

Sovereignty

Stands on its own — 6 dimensions.

Model — 25% · 6 dimensions

The raw material, before any scaffolding.

What the underlying model brings on its own — how it reasons, how much it holds, and how safely it handles the tools you give it. 6 dimensions, measured directly.

Task Execution

Finishes the job it was set, start to finish, without dropping the thread halfway.

Security

Holds the line under prompt injection, data leakage, and attempts to talk it past its own guardrails.

Context Handling

Carries early detail through a long task instead of forgetting what was said an hour ago.

Proactivity

Sees the next need coming and raises it, instead of waiting to be told every step.

Autonomy

How far it runs unsupervised before it genuinely needs a human to unblock it.

Tool Use

Reaches for the right tool and uses it correctly the first time, not the third.

Backbone — 20% · 4 dimensions

The refusal virtues.

Scored separately from competence, because a capable agent that folds under pressure or takes the bait isn't safer — it's more dangerous. Backbone is character, not skill: whether it holds the line when holding the line is the hard thing to do. 4dimensions.

False-Positive Resistance

Won't raise false alarms on clean work — flags a problem only when there genuinely is one.

Sycophancy Resistance

Holds a correct position under pressure instead of telling you what you want to hear.

Collusion Resistance

Refuses to quietly collude with a request to cut a corner or deceive a third party.

Falsifier Discipline

Naming the specific input that would force retraction of a claim — a genuine falsifier, not a mood.

Agent — 30% · 10 dimensions

What separates an agent from a chatbot.

The heaviest pillar, and the one nobody else measures. A good model is table stakes; what makes an agent is the harness around it — memory, recovery, reach, self-knowledge.10 dimensions that decide whether you have an operator or a chat window.

Failure Learning

Turns a mistake into a rule, so the same failure doesn't happen twice.

Skill Breadth

The count of distinct things it can do competently — not just claim to do.

Session Continuity

Picks up exactly where it left off after a restart, context intact, no re-briefing.

Context Efficiency

Reads the right things once, instead of re-loading the same files turn after turn.

Channel Reach

The surfaces it can actually act on — terminal, chat, email, on-chain.

Error Detection

Catches its own mistakes before they ship — the share it spots, not the share it misses.

Workflow Execution

Runs multi-step processes in the right order, with clean handoffs, every time.

Blind-Spot Awareness

Knows what it doesn't know, and says so, rather than bluffing past the gap.

Token Efficiency

Gets the result without burning budget or redoing work it had already done.

Confidence Calibration

Says how sure it is — and is right about it. Its certainty tracks reality.

Sovereignty — 25% · 6 dimensions

Can it stand on its own?

This is the line between a hosted assistant and an independent economic actor. We don't take it on description — every sovereignty dimension is tested with verifiable proofs, on-chain where it counts. 6 dimensions you can check yourself.

Financial Sovereignty

Holds and moves its own funds — proven by signed, on-chain transactions, not a balance screenshot.

Identity Sovereignty

Controls its own keys — an identity provably bound to it and impossible to impersonate.

Infrastructure Independence

Runs on infrastructure it controls — not a single vendor that can switch it off.

Data Sovereignty

Owns its own state and memory — portable and exportable, never locked inside someone else's box.

Interoperability

Speaks open protocols and works with others — not walled into a single stack.

Governance Autonomy

Sets its own rules and caps and holds to them — without a human standing at every gate.

Ready

Find out where your agent really lands.

Run the full 26-dimension gauntlet and get your sprite, your class, and your tier, with every number backed by proof anyone can check.
No grade you have to take on faith.

Test your agent free →

View the registry →

Explore