Lesson 1 - Why Reliability and Evaluation
Welcome to Why Reliability and Evaluation
Your agent works. You typed a request, it planned, retrieved, coordinated specialists, and gave a great answer. So it’s done, right? Not quite. A demo proves the agent can work; a product needs it to work for every user, on every request, every time — without crashing on a hiccup, without quietly burning your budget, and without silently getting worse the next time you change a prompt. That last mile is what this final module is about. It’s less glamorous than building the agent, and it’s exactly what separates something you show off from something you ship.
By the end of this lesson, you will be able to:
- Explain why “works in a demo” is a weak guarantee for a real agent
- Name the four failure modes production agents must handle
- Map each to a pillar: guardrails, retries, cost control, and evaluation
- See that all four wrap around the agent loop you already built
Let’s look at what actually goes wrong once real users show up.
What Breaks Between Demo and Production
A demo is one person, one friendly request, one run, and you watching. Production is thousands of requests you’ll never see, some malformed, some adversarial, over a flaky network, on a budget. Four things break:
- Bad inputs and outputs. Users ask for things out of scope (“write my essay” to a trip planner), or the agent produces an answer that’s empty, malformed, or off-policy. Nothing validates either end.
- Transient failures. An API times out, a rate limit hits, a tool errors once. In a demo you retry by hand; in production an unhandled blip becomes a failed request — or a crash.
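- Runaway cost. A confused agent loops ten extra times, or picks an expensive path, and every one of those steps is tokens you pay for. Because each step typically resends the growing conversation, cost can climb faster than the step count. Without caps, one bad run can cost a hundred times a normal one.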
- Silent quality regressions. You tweak a prompt to fix one case and unknowingly break five others. With no way to measure quality, you find out from users, not from yourself.
None of these show up when you’re demoing happily to yourself. All of them show up at scale. The good news: each has a well-understood fix, and this module builds all four.
Four Pillars of a Production Agent
The fixes map to four pillars — and, true to the whole course, none of them is a new engine. They wrap around the run_agent loop you already have:
- Guardrails (Lesson 2): check what goes in (refuse out-of-scope or unsafe requests before spending anything) and what comes out (validate the answer; repair it or refuse if it’s bad). This extends the input validation you built for tools in Module 3 to the agent’s whole boundary.
- Retries and cost control (Lesson 3): two pillars taught in one lesson. Retries recover from transient failures with exponential backoff and a hard attempt cap; cost control bounds spending with token and step budgets so no single run can run away. Together they keep a run robust and affordable.
- Evaluation (Lesson 4): build a test set and score the agent, often using an LLM as judge, so quality is a number you track instead of a feeling. This is what turns “I think that helped” into a measurable claim such as “pass rate went from 72% to 88%.”
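The retry pillar in the list above can be sketched in a few lines. This is a minimal illustration, not the Lesson 3 implementation: `TransientError`, `call_with_retries`, and the default values are hypothetical names and numbers chosen for the example, and the injectable `sleep` parameter is there only so the function can be exercised without actually waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a timeout, rate limit, or 'overloaded' response."""


def call_with_retries(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on TransientError, wait and retry, up to max_attempts total calls."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # hard attempt cap reached: surface the failure
            # Exponential backoff: base_delay, then 2x, 4x, ... plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
```

The cap matters as much as the backoff: without it, a persistent outage turns a retry loop into an endless one. Only errors you know to be transient should be retried; anything else should propagate immediately, as it does here.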
The module ends with a capstone (Lesson 5): a production-ready Atlas that assembles everything — memory, retrieval, multi-agent coordination — behind guardrails, retries, budgets, and an eval harness.
Reliability is mostly wrapping, not rewriting
Notice the shape of this module: you won’t rebuild the agent. You’ll wrap it — a check before it runs, a check after it answers, a retry around a flaky call, a budget around the loop, a scorer around the whole thing. That’s deliberate. Production-hardening is layers of protection around working code, added where the risk is, not a redesign. It also means you can adopt these one at a time on an agent you already have.
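To make the wrapping idea concrete, here is one possible shape for it. The `run_agent` below is a hypothetical stand-in with an assumed `on_step` callback, since the real loop’s signature comes from earlier modules. `in_scope`, `guarded_run`, and `BudgetExceeded` are likewise invented for this sketch, and the keyword check is a toy placeholder for a real input guardrail.

```python
class BudgetExceeded(Exception):
    """Raised when a run uses more steps than its budget allows."""


def run_agent(request, on_step):
    """Hypothetical stand-in for the agent loop; calls on_step() once per step."""
    for _ in range(3):
        on_step()
    return f"Plan for: {request}"


def in_scope(request):
    # Toy input guardrail (assumption): a trip planner accepts only travel-like requests.
    return any(word in request.lower() for word in ("trip", "flight", "hotel"))


def guarded_run(request, agent=run_agent, max_steps=10):
    # Check before the agent runs: refuse without spending anything.
    if not in_scope(request):
        return "Sorry, I can only help with travel planning."

    steps = 0

    def on_step():
        # Budget around the loop: stop a run that takes too many steps.
        nonlocal steps
        steps += 1
        if steps > max_steps:
            raise BudgetExceeded(f"more than {max_steps} steps")

    answer = agent(request, on_step)

    # Check after it answers: never hand an empty answer to the user.
    if not answer.strip():
        return "Sorry, I couldn't produce a usable answer."
    return answer
```

Note that `run_agent` itself is untouched: each protection is a layer outside it, which is why the layers can be added one at a time.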
Practice Exercises
Exercise 1: Name the failure
For each, name which pillar addresses it: (a) a user asks your trip planner to debug their code; (b) the model API returns a 529 “overloaded” once; (c) a confused agent loops 20 times on one request.
Hint
(a) guardrails — an input guardrail refuses the out-of-scope request before the agent runs; (b) retries — a transient error should be retried with backoff, not surfaced as a failure; (c) cost control — a step budget caps the loop so it can’t run away. Evaluation is the fourth pillar, for measuring quality over time rather than handling a single run.
Exercise 2: Why is “it worked when I tried it” weak?
Your agent gave a great answer in your demo. Why is that a weak guarantee that it’s production-ready?
Hint
A demo is one friendly request, one run, with you watching. It says nothing about malformed or adversarial inputs, transient network failures, cost under load, or whether your next change will break cases you didn’t test. Production quality is about the requests you don’t see — which is exactly what guardrails, retries, budgets, and evaluation address.
Exercise 3: Which pillar is different?
Three pillars protect a single run; one protects quality over time. Which is which, and why does that one need a test set?
Hint
Guardrails, retries, and cost control protect a single run, making it correct, robust, and affordable, respectively. Evaluation is different: it measures whether the agent, as a whole, is getting better or worse as you change it. You can’t measure that from one run, so it needs a repeatable test set scored the same way each time.
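One way to picture a repeatable test set scored the same way each time is the sketch below. Everything in it is hypothetical: the four test cases, the two toy agents, and `keyword_judge`, which stands in for an LLM judge.

```python
def pass_rate(agent, test_set, judge):
    """Fraction of test cases whose answer the judge accepts."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    passed = sum(1 for case in test_set if judge(case, agent(case["request"])))
    return passed / len(test_set)


# Hypothetical test set: each case names something the answer must mention.
TEST_SET = [
    {"request": "Weekend in Rome", "must_mention": "Rome"},
    {"request": "Two days in Kyoto", "must_mention": "Kyoto"},
    {"request": "Beach trip to Bali", "must_mention": "Bali"},
    {"request": "Hiking in Banff", "must_mention": "Banff"},
]


def keyword_judge(case, answer):
    # Toy judge (assumption): passes if the required keyword appears in the answer.
    return case["must_mention"].lower() in answer.lower()


def agent_v1(request):
    # Toy "before" agent: always echoes the request in its answer.
    return f"Here is a plan for: {request}"


def agent_v2(request):
    # Toy "after" agent: a change that silently drops detail on longer requests.
    return "Here is a plan." if len(request) > 16 else f"Plan: {request}"
```

With these toy agents, `pass_rate(agent_v1, TEST_SET, keyword_judge)` is 1.0 and `pass_rate(agent_v2, TEST_SET, keyword_judge)` is 0.5, so the regression shows up as a number before any user sees it.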
Summary
A working demo proves an agent can work; production requires it to work for every user, every time, affordably — and four things break in between. Bad inputs/outputs, transient failures, runaway cost, and silent quality regressions are all invisible in a demo and unavoidable at scale. The fixes are four pillars that wrap around the loop you already built: guardrails (validate the way in and the way out — refuse or repair), retries (recover from transient errors with backoff and a cap), cost control (token and step budgets so nothing runs away), and evaluation (score against a test set so quality is a number you track). The first three protect each run; evaluation protects quality over time. The module builds all four and assembles them on a production-ready Atlas.
Key Concepts
- Demo ≠ production — one friendly run says nothing about scale, failures, cost, or regressions.
- Four failure modes — bad I/O, transient failures, runaway cost, silent quality loss.
- Four pillars — guardrails, retries, cost control, evaluation; they wrap the agent.
- Run vs. over-time — three pillars protect a single run; evaluation protects quality as you change things.
Why This Matters
This is the difference between an impressive prototype and an agent you’d put in front of real users or customers. The failures here are the ones that generate the 3 a.m. page, the surprise bill, and the “it used to work” bug report — and every one of them is preventable with a pattern that takes a few lines to add. Just as important, evaluation changes how you build: once quality is a number, you can improve the agent deliberately instead of guessing, and ship changes with confidence instead of hope. Next, you’ll build the first pillar — guardrails on the agent’s inputs and outputs.
Next Steps
Continue to Lesson 2 - Guardrails
Validate what goes into the agent and what comes out — refuse out-of-scope requests and repair bad answers.
Continue Building Your Skills
You now know the gap between a demo and a product, and the four pillars that close it — guardrails, retries, cost control, and evaluation, each wrapping the agent loop rather than replacing it. Next you’ll build the first pillar: guardrails that check what goes into the agent and what comes out, refusing out-of-scope requests and repairing bad answers before they reach a user.