Risk Guardian: Preventing Catastrophic Actions in Long-Running AI Agents
Most agent frameworks have a way to judge an action after it happens — an evaluator node that looks at the output and decides RETRY or ACCEPT. That’s necessary, and for a whole class of work it’s enough.
It is not enough when the action touches the real world. You cannot un-send a webhook. You cannot un-drain an API budget. You cannot un-issue a duplicate order. For that class of failure, an evaluator catching the mistake after the fact is just a well-documented post-mortem.
So I filed RFC #7218 — “Risk Guardian: a production safety harness for catastrophic-action prevention in long-running agent loops” on Aden HQ’s open-source Hive (a multi-agent harness for production AI). It proposes the layer that most harnesses are missing: a deterministic, pre-action gate. This is the design, drawn from a smaller reference implementation I maintain (alphagrid-orchestrator, MIT), extracted from a multi-strategy live execution system I operate.
The failures you can’t afford even once
Post-action evaluation assumes the action is reversible or cheap. These aren’t:
- An agent with browser/tool access issuing a destructive request — a DELETE, or a POST to a payments endpoint.
- A worker draining a budget — LLM tokens, API quota, on-chain gas — because a loop never terminated.
- A retry storm against a rate-limited or paid API: compounding cost plus a lockout.
- Two workers issuing conflicting writes to the same external system: duplicate orders, double-posts.
- A previously validated agent that quietly drifts and starts producing actions outside its acceptance envelope.
The common thread: by the time an evaluator sees the output, the side-effect has already fired. You need something that can say “no” before.
Where the guard sits
Risk Guardian is a layer between the action-issuing node and the actual side-effecting tool. Every guarded action passes through a deterministic chain before it’s allowed to touch an external system.
    ┌─────────────┐     proposes action      ┌──────────────────────────┐
    │   Agent /   │ ───────────────────────▶ │      RISK GUARDIAN       │
    │   Worker    │                          │    (pre-action gate)     │
    └─────────────┘                          │                          │
                                             │  kill switch?        ──┐ │
                                             │  budget cap?           │ │
                                             │  duplicate/conflict?   ├─ ALL must pass
                                             │  policy allow-list?    │ │
                                             │  drift envelope?     ──┘ │
                                             └───────┬───────────┬──────┘
                                                pass │           │ fail
                                                     ▼           ▼
                                       ┌────────────────┐  ┌────────────────────┐
                                       │  Execute the   │  │ Block + structured │
                                       │  side-effect   │  │   failure record   │
                                       └────────────────┘  └─────────┬──────────┘
                                                                     │
                                                      feeds the failure-learning loop
A failure here isn’t a silent drop — it emits a structured failure record, which is exactly the input a failure-learning loop wants.
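As a rough illustration of that chain, the gate can be an ordered list of pure check functions, where the first failing check blocks the action and produces a record. Every name here (`Action`, `FailureRecord`, `Decision`, `Gate`, and the two example checks) is hypothetical and not taken from the RFC or from alphagrid-orchestrator; this is a minimal sketch under those assumptions, not the proposed implementation.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Action:
    agent_id: str
    tool: str      # e.g. "http_get", "http_delete" (illustrative tool names)
    target: str    # the external resource the action touches


@dataclass(frozen=True)
class FailureRecord:
    check: str     # name of the check that blocked the action
    reason: str
    action: Action
    timestamp: float


@dataclass(frozen=True)
class Decision:
    allowed: bool
    result: Any = None
    failure: Optional[FailureRecord] = None


# A check returns None to pass, or a human-readable reason to block.
Check = Callable[[Action], Optional[str]]


def kill_switch_check(flag: dict) -> Check:
    """Block everything while flag["on"] is true."""
    return lambda action: "kill switch is on" if flag["on"] else None


def allow_list_check(allowed: dict[str, set[str]]) -> Check:
    """Block any tool not explicitly allowed for the issuing agent."""
    def check(action: Action) -> Optional[str]:
        if action.tool in allowed.get(action.agent_id, set()):
            return None
        return f"{action.agent_id} may not call {action.tool}"
    return check


class Gate:
    def __init__(self, checks: list[tuple[str, Check]]):
        self.checks = checks
        self.failures: list[FailureRecord] = []

    def submit(self, action: Action, execute: Callable[[Action], Any]) -> Decision:
        # Run every check in order; the side-effect fires only if all pass.
        for name, check in self.checks:
            reason = check(action)
            if reason is not None:
                record = FailureRecord(name, reason, action, time.time())
                self.failures.append(record)  # input for a failure-learning loop
                return Decision(allowed=False, failure=record)
        return Decision(allowed=True, result=execute(action))
```

Because `execute` is only called after the loop completes, a blocked action never reaches the external system, and the caller always gets back either a result or a record naming the check that said no.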
The six mechanisms
Each is small, independently toggleable, and — importantly — deterministic. No LLM call sits in the guard path.
- Budget caps. Per-loop and per-session limits on token spend, tool-call count, and external-API cost. The counter increments before the call; a breach is a hard halt, not a warning.
- Conflict / duplicate guards. A content-hash + target-resource fingerprint of each side-effecting action. An identical fingerprint inside a configurable window is blocked — this is what kills retry storms and the “two workers, same write” race.
- Two-stage dispatcher. For anything flagged expensive or destructive, the agent emits an intent first; a separate confirmation — a human, or an automated policy check — must clear before the action fires. This is the natural seam for human-in-the-loop.
- Drift / acceptance-envelope monitor. A rolling-window quality metric per agent. If it leaves the acceptance envelope (e.g. the evaluator’s RETRY rate exceeds X over the last N runs), the loop pauses for review. This is the “this agent has drifted” detector.
- Kill switch. A single flag that halts every loop in a colony. The tested invariant: if the kill switch is on at the top of any worker iteration, the worker exits immediately and cleanly with a structured terminal state. This one must be exercised in CI: a kill switch you’ve never fired is a kill switch you don’t have.
- Allow-list / deny-list. A static policy restricting which tool calls each agent may issue, e.g. read-only for a newly deployed agent, with write access unlocked only after it earns promotion.
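Two of these mechanisms are small enough to sketch together: a budget cap that charges before the call, and a duplicate guard keyed on a target-plus-content fingerprint. The class names, the SHA-256 fingerprint, and the in-memory dictionary are assumptions made for illustration. In particular, an in-process dictionary only covers a single worker, so catching the “two workers, same write” race would need a shared store, which this sketch does not attempt.

```python
import hashlib
import time
from typing import Callable, Dict


class BudgetExceeded(RuntimeError):
    """Raised when a reservation would push spend past the cap."""


class BudgetCap:
    def __init__(self, limit: float):
        self.limit = limit
        self.spent = 0.0

    def reserve(self, cost: float) -> None:
        # Charge BEFORE the call: a breach is a hard halt, not a warning.
        if self.spent + cost > self.limit:
            raise BudgetExceeded(
                f"cap {self.limit}: spent {self.spent}, requested {cost}"
            )
        self.spent += cost


class DuplicateGuard:
    def __init__(self, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.window = window_seconds
        self.clock = clock                  # injectable so tests control time
        self._seen: Dict[str, float] = {}   # fingerprint -> first-seen time

    @staticmethod
    def fingerprint(target: str, payload: bytes) -> str:
        # Hash the target resource and the content together; the NUL
        # separator keeps ("ab", b"c") distinct from ("a", b"bc").
        h = hashlib.sha256()
        h.update(target.encode("utf-8"))
        h.update(b"\x00")
        h.update(payload)
        return h.hexdigest()

    def allow(self, target: str, payload: bytes) -> bool:
        now = self.clock()
        # Forget fingerprints that have aged out of the window.
        self._seen = {fp: t for fp, t in self._seen.items()
                      if now - t < self.window}
        fp = self.fingerprint(target, payload)
        if fp in self._seen:
            return False                    # identical action inside the window
        self._seen[fp] = now
        return True
```

A caller would run `guard.allow(...)` and `cap.reserve(...)` before dispatching the tool call, and treat `False` or `BudgetExceeded` as a block rather than something to retry.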
Why this is a separate layer
The objection I expected was “doesn’t the evaluator already do this?” No — and the reason is the whole argument for the RFC. These layers answer different questions at different times:
| Layer | Question it answers | Timing |
|---|---|---|
| System prompt / planning | What should the agent do? | Pre-loop |
| Evaluator / Judge | Was the output good? | Post-action |
| Failure-learning loop | Have we seen this failure before? | Pre-loop + post-action |
| Risk Guardian | Is this specific action safe to issue right now? | Pre-action |
| Retry | Did the action transiently fail? | Post-action |
Risk Guardian is the only layer that says “no” before the side-effect, and it does so on grounds that have nothing to do with output quality — cost, idempotency, drift, and external-system invariants. Those are not things you want an LLM adjudicating in the hot path; they’re things you want a deterministic gate enforcing.
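Drift is a good example of a ground that needs no model call: it reduces to counting evaluator verdicts over a rolling window. The sketch below assumes the evaluator reports plain `"RETRY"` / `"ACCEPT"` strings and that the monitor only judges once the window is full; `DriftMonitor`, its parameters, and the explicit `resume()` step are hypothetical choices, not something the RFC specifies.

```python
from collections import deque


class DriftMonitor:
    def __init__(self, window: int, max_retry_rate: float):
        self.max_retry_rate = max_retry_rate
        # deque(maxlen=...) drops the oldest verdict automatically.
        self._retries = deque(maxlen=window)
        self.paused = False

    def record(self, verdict: str) -> None:
        self._retries.append(verdict == "RETRY")
        # Only judge a full window, so a single early RETRY cannot pause the loop.
        if len(self._retries) == self._retries.maxlen:
            rate = sum(self._retries) / len(self._retries)
            if rate > self.max_retry_rate:
                self.paused = True

    def allows_action(self) -> bool:
        # The pre-action gate consults this before every guarded action.
        return not self.paused

    def resume(self) -> None:
        # Called after review; history is cleared so the agent starts fresh.
        self._retries.clear()
        self.paused = False
```

Once paused, the monitor stays paused until someone calls `resume()`, which keeps the "pause for review" behavior from silently healing itself when a few good runs roll through the window.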
Design principles
The whole point is to be the boring, trustworthy part of the stack:
- Composable — every mechanism toggles independently.
- Deterministic — no model call in the guard path, so the gate’s behavior is reproducible and auditable.
- Off in dev, on in prod — guards engage when an integration is marked production; paper and dev runs stay frictionless.
- Testable in CI — the kill-switch and budget invariants ship with integration tests. Safety you don’t test isn’t safety.
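The kill-switch invariant is the easiest of these to pin down in a test. The following sketch assumes a worker that runs a sequence of side-effecting steps and checks a `threading.Event` at the top of each iteration; `run_worker` and `TerminalState` are illustrative names, and a real colony-wide switch would need a flag shared across processes rather than an in-process event.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class TerminalState:
    status: str           # "killed" or "completed"
    iterations_run: int


def run_worker(kill_switch: threading.Event,
               steps: Iterable[Callable[[], None]]) -> TerminalState:
    done = 0
    for step in steps:
        # Checked at the top of every iteration, before any side-effect.
        if kill_switch.is_set():
            return TerminalState("killed", done)
        step()
        done += 1
    return TerminalState("completed", done)


def test_kill_switch_halts_before_next_side_effect() -> None:
    kill = threading.Event()
    fired = []

    def first_step() -> None:
        fired.append("a")
        kill.set()        # the switch flips while the worker is mid-loop

    def second_step() -> None:
        fired.append("b")  # must never run

    state = run_worker(kill, [first_step, second_step])
    assert state == TerminalState("killed", 1)
    assert fired == ["a"]
```

The test fires the switch for real and asserts on both the terminal state and the absence of the second side-effect, which is the property worth guarding in CI.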
The throughline
This is the same idea I keep coming back to: in an agent that acts, the guardrail is the feature, and the safe default is to ask a human. Risk Guardian is what that looks like one level down — at the harness, as a deterministic pre-action gate, with the kill switch built and tested before the clever part.
Intelligence is increasingly something you can buy from a model provider. The gate that decides whether an autonomous loop is allowed to touch real systems is the part you have to build, and own. RFC #7218 is one proposal for what that should look like.