Lesson 2 - Guardrails
Welcome to Guardrails
Your agent will happily try anything. Ask a trip planner to write a poem, and it will — spending tokens on a request it should never have touched. Ask it a real question and it might answer with an empty string, a half-finished draft, or a leftover placeholder — and hand that straight to the user. The agent has no sense of its own boundary. Guardrails are how you give it one: a check on the way in that refuses requests that don’t belong, and a check on the way out that catches answers that aren’t good enough. This is the first pillar from Lesson 1, and it’s the same “validate, then repair or refuse” discipline you already used on tool inputs in Module 3 — now wrapped around the entire agent.
By the end of this lesson, you will be able to:
- Explain why an agent needs a check on both its input and its output
- Write an input guardrail that refuses out-of-scope requests before the agent runs
- Write a deterministic output guardrail and add one repair attempt when it fails
- Assemble both into a single guarded_run wrapper around your existing agent
Let’s start with the way in.
The Input Guardrail: Refuse Before You Run
The cheapest bad request to handle is the one you never run. An input guardrail sits in front of the agent: it asks a quick question — “is this request even in scope?” — and if the answer is no, it refuses immediately. The expensive agent loop never starts, so you spend nothing and you block misuse at the door.
A good input check is a small, cheap classification. We reuse the say helper from earlier modules — a single Claude call that returns text — and ask it to classify the request against the agent’s scope:
def say(client, prompt, *, system="", model="claude-haiku-4-5", max_tokens=256):
    r = client.messages.create(model=model, max_tokens=max_tokens, system=system,
                               messages=[{"role": "user", "content": prompt}])
    return "".join(b.text for b in r.content if b.type == "text").strip()

def input_guardrail(client, request, *, scope):
    """Refuse out-of-scope requests BEFORE running the expensive agent."""
    verdict = say(client, f"Is this request about {scope}? Answer yes or no.\n{request}").lower()
    return verdict.startswith("yes")

Notice this uses a small, fast model (claude-haiku-4-5) for the check. The guardrail should always be cheaper than the agent it protects; otherwise you’ve doubled the cost of every request just to maybe save one. A yes/no classification on a tiny model is exactly that kind of cheap.
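One detail worth a sketch is how the verdict text gets turned into a boolean. Models do not always answer with a bare "yes": they may add punctuation, quotes, or markdown emphasis. The helper below, parse_verdict, is a hypothetical name that is not part of the lesson's code. It shows one way to normalize the reply and fail closed, so anything that is not clearly a yes is treated as a refusal.

```python
import re

def parse_verdict(raw: str) -> bool:
    """Return True only for a clear 'yes'; anything else counts as a refusal."""
    # Drop leading characters that are not letters (quotes, asterisks, spaces).
    cleaned = re.sub(r"^[^a-z]+", "", raw.strip().lower())
    # Require "yes" as a whole word, so "yesterday ..." does not pass.
    return re.match(r"yes\b", cleaned) is not None

for raw in ["yes", "Yes.", "**Yes**, it is.", "no", "", "Yesterday I flew to Kyoto"]:
    print(repr(raw), "->", parse_verdict(raw))
```

Failing closed is a design choice: an ambiguous verdict costs one polite refusal, while failing open would let an unclassified request reach the agent.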
The Output Guardrail: Validate, Then Repair or Refuse
The way out is the mirror image. After the agent answers, an output guardrail checks whether the answer is actually usable before it reaches the user. The key insight: many output checks don’t need a model at all. “Is it non-empty? Does it still contain a TODO or a [[placeholder]]?” are questions a plain Python function answers instantly, reliably, and for free.
def output_ok(answer):
    """Deterministic output guardrail: non-empty and no leftover placeholder."""
    return bool(answer.strip()) and "TODO" not in answer and "[[" not in answer

This is a deterministic guardrail: a pure function with no API call. Prefer these whenever they suffice: schema validation, format checks, regex, and length limits are all cheap and never flaky. You’d reach for a model-based policy check (another say call) only when the property you care about (tone, safety, factual grounding) genuinely can’t be captured by a rule.
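To make the "format checks, regex, and length limits" idea concrete, here is a minimal sketch of a richer deterministic check. The function name output_problems, the 500-word default, and the optional date requirement are assumptions chosen for illustration. Returning a list of reasons instead of a single boolean is one possible design, because the reasons can later be fed back to the agent as repair feedback.

```python
import re
from datetime import date

# Shape of an ISO calendar date such as 2025-04-01.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def has_valid_iso_date(text: str) -> bool:
    """True if the text contains at least one real calendar date in ISO form."""
    for candidate in ISO_DATE.findall(text):
        try:
            date.fromisoformat(candidate)  # rejects e.g. month 13
            return True
        except ValueError:
            continue
    return False

def output_problems(answer: str, *, max_words: int = 500,
                    require_date: bool = False) -> list[str]:
    """Return the reasons an answer fails; an empty list means it passes."""
    problems = []
    if not answer.strip():
        problems.append("empty")
    if "TODO" in answer or "[[" in answer:
        problems.append("placeholder")
    if len(answer.split()) > max_words:
        problems.append("too long")
    if require_date and not has_valid_iso_date(answer):
        problems.append("missing ISO date")
    return problems

print(output_problems(""))
print(output_problems("TODO: fill in"))
print(output_problems("Depart 2025-04-01 for Kyoto.", require_date=True))
print(output_problems("Depart 2025-13-45 for Kyoto.", require_date=True))
```

Every check here is a pure function of the answer text, so it costs nothing to run and gives the same result every time.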
When the output check fails, you have two moves: refuse (return a safe fallback) or repair (give the agent a second chance with feedback about what went wrong). Repair is often the better user experience, so guarded_run wires both guardrails together and allows exactly one repair attempt:
def guarded_run(client, request, agent_fn, *, scope):
    if not input_guardrail(client, request, scope=scope):
        return {"answer": f"Sorry, I can only help with {scope}.", "refused": True, "repaired": False}
    answer = agent_fn(request)
    if not output_ok(answer):  # output failed the guardrail -> one repair
        answer = agent_fn(request + "\n(Your previous answer was empty/incomplete; give a full answer.)")
        return {"answer": answer, "refused": False, "repaired": True}
    return {"answer": answer, "refused": False, "repaired": False}

Read the control flow top to bottom: the input guard runs first, and if it fails the function returns a refusal without ever calling agent_fn. Only in-scope requests reach the agent. If the agent’s answer fails output_ok, we re-run it once with a note appended (the repair) and return the result flagged with repaired: True. A single attempt is deliberate: it fixes a one-off empty or truncated answer without opening the door to an unbounded loop (that’s what the budgets in Lesson 3 are for). Note that the code as written returns the repaired answer without running output_ok on it again; to get the refusal fallback, re-check the second answer before returning it.
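The lesson describes the full discipline as one repair, then a fallback. The sketch below shows one way that last step could look. It is a variant, not the lesson's guarded_run: the name guarded_run_checked, the injected in_scope callable (used here so the example runs without an API client), and the fallback wording are all assumptions.

```python
def output_ok(answer):
    """Same deterministic check as in the lesson."""
    return bool(answer.strip()) and "TODO" not in answer and "[[" not in answer

def guarded_run_checked(request, agent_fn, in_scope, *, scope):
    """Input check, one repair attempt, then a safe fallback if repair fails."""
    if not in_scope(request):
        return {"answer": f"Sorry, I can only help with {scope}.",
                "refused": True, "repaired": False}
    answer = agent_fn(request)
    if output_ok(answer):
        return {"answer": answer, "refused": False, "repaired": False}
    # First answer failed: exactly one repair attempt with feedback.
    answer = agent_fn(request + "\n(Your previous answer was empty/incomplete; give a full answer.)")
    if output_ok(answer):
        return {"answer": answer, "refused": False, "repaired": True}
    # Repair also failed: refuse rather than ship a bad answer.
    # Assumption: "repaired" marks a successful repair, so it stays False here.
    return {"answer": "Sorry, I couldn't produce a complete answer.",
            "refused": True, "repaired": False}

calls = []
def broken_agent(request):
    """Stub agent that always returns an empty answer."""
    calls.append(request)
    return ""

result = guarded_run_checked("Plan a 2-day Kyoto trip", broken_agent,
                             lambda r: True, scope="trip planning")
print(result["refused"], len(calls))
```

The stub agent runs twice (the first attempt and the single repair) and the user receives the fallback instead of an empty string.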
The guard logic is real; the labels are illustrative
There’s no API key in this environment, so these patterns were verified against a mock that mirrors the Anthropic SDK’s surface (client.messages.create returning content blocks). The orchestration is what’s proven: the input guard is dispatched first, a refusal returns without running the agent, and a failed output check triggers exactly one repair. The Claude API code is written as it runs in production; only the classification label (“yes”/“no”) and the model’s answer text are stand-ins — exact wording varies.
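For readers who want to reproduce this without an API key, here is a rough sketch of what such a mock might look like. It is not the mock used to verify the lesson. MockClient and MockMessages are hypothetical names, and the only part of the SDK surface imitated is what say touches: client.messages.create returning an object whose content is a list of blocks with type and text attributes.

```python
from types import SimpleNamespace

class MockMessages:
    """Stands in for client.messages; replays canned reply texts in order."""
    def __init__(self, replies):
        self._replies = list(replies)
        self.calls = []  # keyword arguments of every create() call, for inspection

    def create(self, **kwargs):
        self.calls.append(kwargs)
        text = self._replies.pop(0)
        # Mimic a response with a single text content block.
        return SimpleNamespace(content=[SimpleNamespace(type="text", text=text)])

class MockClient:
    def __init__(self, replies):
        self.messages = MockMessages(replies)

# The lesson's helpers, repeated so this sketch runs on its own.
def say(client, prompt, *, system="", model="claude-haiku-4-5", max_tokens=256):
    r = client.messages.create(model=model, max_tokens=max_tokens, system=system,
                               messages=[{"role": "user", "content": prompt}])
    return "".join(b.text for b in r.content if b.type == "text").strip()

def input_guardrail(client, request, *, scope):
    verdict = say(client, f"Is this request about {scope}? Answer yes or no.\n{request}").lower()
    return verdict.startswith("yes")

client = MockClient(["no"])  # canned classification label, a stand-in
allowed = input_guardrail(client, "Write me a poem about cats", scope="trip planning")
print(allowed)
print(client.messages.calls[0]["model"])
```

Because the mock records each call, a test can assert both the decision and which model the guardrail asked for.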
Running the two verified scenarios makes the behavior concrete. First, an out-of-scope request — a poem, to a trip planner:
out = guarded_run(client, "Write me a poem about cats", agent, scope="trip planning")
# input guard classifies "no" -> refused, and agent was NEVER called (ran == 0)
# out == {"answer": "Sorry, I can only help with trip planning.",
#         "refused": True, "repaired": False}

The input guard returned no, so guarded_run refused and the agent function ran zero times. Now an in-scope request where the agent’s first answer comes back empty:
out = guarded_run(client, "Plan a 2-day Kyoto trip", agent, scope="trip planning")
# input guard passes; first answer is "" -> output_ok fails -> repair re-runs the agent
# out == {"answer": "Day 1: Kyoto temples. Day 2: Arashiyama.",
#         "refused": False, "repaired": True}   # example; exact wording varies

The empty first answer failed output_ok, the repair re-ran the agent with feedback, and the second answer passed, flagged repaired: True so you can log how often repairs happen.
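Since the flags exist so you can log how often repairs happen, here is a small sketch of one way to tally them. The outcome helper and the sample results are made up for illustration; they are not measurements from a real run.

```python
from collections import Counter

def outcome(result):
    """Collapse a guarded_run result dict into one label."""
    if result["refused"]:
        return "refused"
    return "repaired" if result["repaired"] else "clean"

# Illustrative results in the shape guarded_run returns (not real data).
results = [
    {"answer": "Sorry, I can only help with trip planning.", "refused": True, "repaired": False},
    {"answer": "Day 1: Kyoto temples. Day 2: Arashiyama.", "refused": False, "repaired": True},
    {"answer": "Day 1: Nara. Day 2: Osaka.", "refused": False, "repaired": False},
    {"answer": "Day 1: Fushimi Inari.", "refused": False, "repaired": False},
]

tally = Counter(outcome(r) for r in results)
served = tally["repaired"] + tally["clean"]
# Share of served requests that needed the repair attempt.
repair_rate = tally["repaired"] / served if served else 0.0
print(dict(tally), round(repair_rate, 2))
```

A rising repair rate is a cheap early signal that the agent's first answers are getting worse, even while users still see complete responses.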
Practice Exercises
Exercise 1: Which guardrail, and why?
For each, say whether an input or output guardrail catches it, and whether the fix is refuse or repair: (a) a user asks your recipe agent for legal advice; (b) the agent returns an answer that’s just the string "TODO: fill in"; (c) the agent returns a genuinely empty response.
Hint
(a) input guardrail — legal advice is out of scope for a recipe agent, so refuse before running. (b) and (c) are both output guardrails: output_ok catches the leftover TODO and the empty string, and the natural fix is repair — re-run once with feedback — before falling back to a refusal.
Exercise 2: Deterministic or model-based?
You want to reject answers that (a) are longer than 500 words, (b) don’t contain a valid ISO date, or (c) sound rude to the user. For each, decide whether a deterministic check or a model-based check is the right tool.
Hint
(a) and (b) are deterministic — a length check and a regex are cheap, exact, and never flaky, so prefer them. (c) “sounds rude” is a judgment about tone that a rule can’t reliably capture, so it’s the one case worth a model-based check (another say call). Rule of thumb: reach for the model only when a rule genuinely can’t express the property.
Exercise 3: Why refuse before running?
guarded_run checks the input guard before it ever calls agent_fn. What two things do you lose if you instead let the agent run and only check scope afterward?
Hint
You lose cost and safety. Running the full agent loop on an out-of-scope request spends tokens (and time) you didn’t need to, and it means the agent actually attempted a request you’d decided not to serve — so a misuse attempt still touched your tools and prompts. Refusing at the door spends nothing and never lets the request reach the agent at all.
Summary
Guardrails give an agent a sense of its own boundary by wrapping it in two checks. An input guardrail — a cheap yes/no classification on a small model — refuses out-of-scope or unsafe requests before the agent runs, so you spend nothing and block misuse at the door. An output guardrail validates the answer after the agent produces it; when it fails, you either refuse with a safe fallback or repair by re-running the agent once with feedback. Crucially, output checks should be deterministic wherever a rule suffices — non-empty, no placeholder, right format — because pure functions are cheap and never flaky; a model-based check is reserved for properties a rule can’t express. The guarded_run wrapper assembles all of this around the loop you already built, extending Module 3’s “validate, then repair or refuse” discipline from a single tool input to the agent’s entire boundary.
Key Concepts
- Both ends — an input guardrail checks the request before running; an output guardrail checks the answer after.
- Refuse before you run — a failed input check returns a refusal without calling the agent, saving cost and blocking misuse.
- Deterministic first — prefer pure-function output checks (non-empty, no placeholder, format/regex); use a model check only when a rule can’t express the property.
- Repair, once — a failed output check gets one re-run with feedback, then a fallback — a bounded fix, not an unbounded loop.
Why This Matters
Guardrails are the difference between an agent that tries anything and one that knows its job. Without an input check, every out-of-scope or adversarial request costs you tokens and reaches your tools; without an output check, an empty or half-finished answer goes straight to the user with your name on it. Both failures are invisible in a demo and constant at scale — and both are fixed with a wrapper that takes a few lines to add and never touches the agent’s core. Just as important, this is a pattern you can apply to any agent you already have: a check before, a check after, and one repair in between. Next, you’ll add the second and third pillars — retries that survive transient failures and budgets that stop a run from spending forever.
Next Steps
Continue to Lesson 3 - Retries, Timeouts, and Cost Control
Survive transient failures with backoff and a cap, and bound spending with token and step budgets so no run can run away.
Back to Module Overview
Return to the Reliability and Evaluation module overview
Continue Building Your Skills
You’ve built the first pillar: guardrails that check what goes into the agent and what comes out. An input guardrail refuses out-of-scope requests before the agent runs, and an output guardrail validates the answer and repairs it — deterministic checks first, one repair attempt, a refusal as the fallback. Next you’ll make each run robust and affordable: retries with exponential backoff to survive transient errors, and token and step budgets to make sure no single run can run away with your bill.