Building a Guardrailed AI Agent with Human-in-the-Loop
A copilot suggests. An agent acts. That one-word difference is where most of the hard product work lives.
The moment an agent can take a real action (send the email, issue the refund, update the record), a single question becomes the entire product: is this confident and safe enough to do without a human? Get that gate right and the agent earns trust. Set it too strict and you flood a human review queue with trivia; set it too loose and you quietly let a wrong action ship.
I built a small, working version of this gate to make the pattern concrete. Here it is end to end.
The pattern
The agent never acts directly. It does three things in order: draft a proposed action, self-evaluate that draft against guardrails, then route based on the result — auto-execute only when it clears every check, otherwise hand it to a human.
               ┌──────────────┐
input / event →│   AI Agent   │  drafts a proposed action
               │    (LLM)     │
               └──────┬───────┘
                      │  { action, confidence, sensitive, reason }
                      ▼
               ┌──────────────┐
               │  Guardrail   │  confidence ≥ 0.8 AND not sensitive?
               │    Gate      │
               └───┬──────┬───┘
               yes │      │ no
                   ▼      ▼
        ┌───────────┐  ┌──────────────────────┐
        │   Auto-   │  │ Escalate to a human  │
        │  execute  │  │   (review queue)     │
        └───────────┘  └──────────┬───────────┘
                                  │  approve / edit / reject
                                  ▼
                       decision logged → feeds back
The interesting part isn’t the LLM. It’s the self-evaluation step, which forces the agent to emit a confidence score and a safety flag alongside its draft, and the gate that reads them.
The self-evaluation
Before anything can happen, the agent scores its own draft on two independent axes:
// The agent drafts an action, then self-evaluates it before anything happens.
const draft = await proposeAction(input);

// Guardrail 1: is the agent confident enough in its own answer?
const confidence = draft.confidence; // 0–1, model-reported

// Guardrail 2: does the request or the drafted action touch a sensitive topic?
// Scanning the draft as well as the input catches an action that mentions a
// refund even when the incoming message never used the word.
const sensitiveTopics = ['refund', 'chargeback', 'legal', 'medical', 'password'];
const scanned = `${input} ${draft.text}`.toLowerCase();
const sensitive = sensitiveTopics.some(t => scanned.includes(t));

return {
  action: draft.text,
  confidence,
  sensitive,
  reason: sensitive
    ? 'Sensitive topic: needs human sign-off'
    : 'No sensitive topics detected',
};
Then the gate is almost boring — which is the point:
if (confidence >= 0.8 && !sensitive) {
  autoExecute(draft);      // safe to act
} else {
  escalateToHuman(draft);  // route to review, with the reason attached
}
I built this as an n8n workflow — Manual Trigger → Code (draft + self-evaluate) → IF (the gate) → Auto-send or Escalate to Human Review — so it’s a real, runnable artifact, not a diagram of one. The Code node returns the object above; the IF node is the gate; the two branches are the two outcomes.
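For readers who want to try the same flow outside n8n, here is a minimal sketch in plain JavaScript. It is not the workflow itself. The `proposeAction` stub, the in-memory `executed` and `reviewQueue` arrays, the `handle` wrapper, and the low-confidence reason string are all assumptions made for illustration, and the sketch scans both the request and the drafted text for sensitive topics.

```javascript
// Hypothetical stand-in for the model call; a real proposeAction would call an LLM.
async function proposeAction(input) {
  const unsure = input.toLowerCase().includes('not sure');
  return { text: `Reply to: "${input}"`, confidence: unsure ? 0.55 : 0.92 };
}

const SENSITIVE_TOPICS = ['refund', 'chargeback', 'legal', 'medical', 'password'];
const CONFIDENCE_THRESHOLD = 0.8;

// Self-evaluation: attach the two guardrail signals and a reason to the draft.
function selfEvaluate(input, draft) {
  const scanned = `${input} ${draft.text}`.toLowerCase();
  const sensitive = SENSITIVE_TOPICS.some(t => scanned.includes(t));
  // Written as a negation so a missing confidence also counts as low.
  const lowConfidence = !(draft.confidence >= CONFIDENCE_THRESHOLD);

  let reason = 'Cleared both guardrails';
  if (sensitive) {
    reason = 'Sensitive topic: needs human sign-off';
  } else if (lowConfidence) {
    reason = `Confidence ${draft.confidence} is below ${CONFIDENCE_THRESHOLD}`;
  }
  return { action: draft.text, confidence: draft.confidence, sensitive, reason };
}

// The gate: auto-execute only when both checks pass, otherwise escalate.
function gate(evaluation) {
  const { confidence, sensitive } = evaluation;
  return confidence >= CONFIDENCE_THRESHOLD && !sensitive ? 'auto-execute' : 'escalate';
}

// In-memory stand-ins for the two branches of the workflow.
const executed = [];
const reviewQueue = [];

async function handle(input) {
  const draft = await proposeAction(input);
  const evaluation = selfEvaluate(input, draft);
  const route = gate(evaluation);
  (route === 'auto-execute' ? executed : reviewQueue).push(evaluation);
  return { route, ...evaluation };
}

// Usage: one routine request, one sensitive request, one the stub is unsure about.
const inputs = [
  'Where is my order?',
  'I want a refund for my order',
  'Not sure what I need, can you help?',
];
Promise.all(inputs.map(handle)).then(results => {
  for (const r of results) console.log(r.route, '|', r.reason);
});
```

Keeping `selfEvaluate` and `gate` as separate pure functions mirrors the split between the Code node and the IF node, and it makes each half testable without a model call.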
Why gate on two axes, not one
The non-obvious design choice is the AND. A lot of teams gate on confidence alone. That’s not enough — a model can be very confident about an action you still never want it to take unsupervised.
                    sensitive = false    sensitive = true
confidence ≥ 0.8    AUTO-EXECUTE         ESCALATE
confidence < 0.8    ESCALATE             ESCALATE
Only the top-left cell auto-executes. A confident refund still goes to a human, because the category is high-stakes regardless of how sure the model is. Confidence handles “might be wrong”; the sensitivity flag handles “don’t let it decide this alone even when it’s right.” Those are different failure modes and they need different guards.
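One way to keep the two failure modes separate in code is to write each guard as its own small function and let the gate be the AND over all of them. The sketch below is an illustration under that assumption. The `guards` array, the `route` function, and the reason strings are hypothetical names, not part of the workflow described here.

```javascript
// Each guard checks one failure mode and returns a reason when it objects,
// or null when it has no objection.
const guards = [
  ({ confidence }) =>
    confidence >= 0.8 ? null : 'low confidence: the draft might be wrong',
  ({ sensitive }) =>
    sensitive ? 'sensitive topic: needs human sign-off even if the draft is right' : null,
];

// The AND in the gate: auto-execute only if no guard objects.
function route(evaluation) {
  const reasons = guards.map(guard => guard(evaluation)).filter(Boolean);
  return { decision: reasons.length === 0 ? 'AUTO-EXECUTE' : 'ESCALATE', reasons };
}

// Walk the four cells of the table.
for (const confidence of [0.9, 0.5]) {
  for (const sensitive of [false, true]) {
    const { decision, reasons } = route({ confidence, sensitive });
    console.log(`confidence=${confidence} sensitive=${sensitive} -> ${decision}`, reasons);
  }
}
```

Because every objecting guard contributes its own reason, a draft that is both low-confidence and sensitive reaches the reviewer with both reasons attached rather than only the first one found.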
The other deliberate choice: uncertainty defaults to a human. Three of the four cells escalate. In an agent that acts in the world, the safe default isn’t “do nothing” and it isn’t “do it anyway.” It’s “ask.” The cost of an unnecessary escalation is a few seconds of a person’s attention. The cost of a wrong auto-execution can be a refund, a leaked record, or a customer you can’t get back.
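The same default can be applied to malformed self-evaluations. In JavaScript, a bare `confidence >= 0.8 && !sensitive` check passes when `confidence` arrives as the string `'0.95'` (it is coerced to a number) and when `sensitive` is missing (`!undefined` is `true`). The sketch below shows one possible way to fail closed. `normalizeEvaluation` and `shouldAutoExecute` are hypothetical helper names introduced for this example.

```javascript
// Coerce whatever the self-evaluation step returned into a shape the gate can
// trust. Anything missing or malformed becomes the most cautious value.
function normalizeEvaluation(raw) {
  const c = raw?.confidence;
  const confidenceOk =
    typeof c === 'number' && Number.isFinite(c) && c >= 0 && c <= 1;
  const sensitiveOk = typeof raw?.sensitive === 'boolean';

  return {
    confidence: confidenceOk ? c : 0,              // unknown confidence counts as none
    sensitive: sensitiveOk ? raw.sensitive : true, // unknown sensitivity counts as sensitive
    malformed: !confidenceOk || !sensitiveOk,      // worth logging for later review
  };
}

function shouldAutoExecute(raw) {
  const { confidence, sensitive } = normalizeEvaluation(raw);
  return confidence >= 0.8 && !sensitive;
}

console.log(shouldAutoExecute({ confidence: 0.95, sensitive: false }));   // true
console.log(shouldAutoExecute({ confidence: '0.95', sensitive: false })); // false
console.log(shouldAutoExecute({ confidence: 0.95 }));                     // false
console.log(shouldAutoExecute(undefined));                                // false
```

Only the first call clears the gate. The other three fall through to a human, which is the "ask" default applied to bad data rather than to a bad draft.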
The human step isn’t a dead end — it’s the training signal
The easy way to read the escalation branch is as a fallback: the agent couldn’t handle it, so a person cleans up. That undersells it.
When a human approves, edits, or rejects an escalated action, that decision is the single most valuable piece of data the system produces. Logged against the original draft and its reasoning, it tells you why the gate fired and whether the agent was actually wrong or just appropriately cautious. Over time those decisions are exactly what you’d use to tune thresholds, expand or contract the sensitive-topic list, and teach the agent which exceptions a human always wants to see. The human-in-the-loop is how the system learns what “safe” means in your specific domain — not a patch over the agent’s gaps.
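As a rough illustration of how logged decisions could feed back into the gate, the sketch below groups escalations by which guard fired and measures how often the reviewer approved the draft unchanged. The log entries are made-up sample data, and the field names, the `summarize` function, and the 0.1-wide band under the threshold are assumptions rather than anything taken from a real system.

```javascript
// Hypothetical log: one entry per escalated action, written when the reviewer decides.
const decisionLog = [
  { confidence: 0.72, sensitive: false, decision: 'approve' },
  { confidence: 0.75, sensitive: false, decision: 'approve' },
  { confidence: 0.78, sensitive: false, decision: 'edit' },
  { confidence: 0.41, sensitive: false, decision: 'reject' },
  { confidence: 0.95, sensitive: true, decision: 'approve' },
  { confidence: 0.91, sensitive: true, decision: 'reject' },
];

// Share of a group that the human approved without changes (null if the group is empty).
function approvalRate(entries) {
  if (entries.length === 0) return null;
  const approved = entries.filter(e => e.decision === 'approve').length;
  return approved / entries.length;
}

// Split escalations by which guard fired, then measure how often the human simply agreed.
function summarize(log, threshold = 0.8, band = 0.1) {
  const lowConfidence = log.filter(e => !e.sensitive && e.confidence < threshold);
  const nearThreshold = lowConfidence.filter(e => e.confidence >= threshold - band);
  const sensitive = log.filter(e => e.sensitive);

  return {
    nearThresholdApprovalRate: approvalRate(nearThreshold),
    lowConfidenceApprovalRate: approvalRate(lowConfidence),
    sensitiveApprovalRate: approvalRate(sensitive),
  };
}

console.log(summarize(decisionLog));
```

A consistently high approval rate just under the threshold would be a hint that the threshold could come down a little. A high rejection rate on sensitive escalations would suggest that category is earning its place on the list. Both readings need far more data than six rows before anyone acts on them.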
This is also the trust surface. As I wrote in a year of running production AI agents, users don’t trust agents that are right — they trust agents that can show their work, especially on the calls they got wrong. A gate that escalates with its reasoning attached, and a human decision logged next to it, is that “showing your work,” made operational.
The takeaway
The clever part of an agent — the model, the prompt, the planning — is increasingly something you can buy. The part you have to build is the boring infrastructure around it: the self-evaluation, the gate, the escalation path, the log. In an agent that acts, the guardrail is the feature. This pattern — draft, self-evaluate on confidence and sensitivity, auto-execute only on the top-left cell, route everything else to a human whose decision feeds back — is about the smallest version of that idea that still does real work.
Start there. The intelligence can come later; the trust has to come first.