Lesson 5 - Guided Project: Production-Ready Atlas
Welcome to the Guided Project
This is the last lesson of the last module — the capstone of the whole course. Across seven modules you built Atlas from nothing into a capable travel-planning agent: the agent loop (Module 2), robust tools (Module 3), memory it manages and recalls (Module 4), planning and reflection for hard requests (Module 5), retrieval over real documents (Module 6), and multi-agent coordination where specialists collaborate (Module 7). Atlas can already do impressive things. What it can’t do yet is survive contact with thousands of real users on a flaky network with a finite budget — and know, every time you change it, that it still works. This module gave you the four pillars that close that gap, and now you’ll wrap all four around Atlas at once. By the end, Atlas is production-ready: it refuses what it shouldn’t do, repairs a bad answer, shrugs off a transient failure, stops before it overspends, and reports a pass rate you can track.
By the end of this project, you will be able to:
- Wrap Atlas in an input/output guardrail that refuses out-of-scope requests and repairs bad answers
- Make Atlas’s calls resilient with with_retries (backoff and a cap) and bounded with a TokenBudget
- Score Atlas against a small test set with evaluate and judge, reporting a pass rate
- Assemble all four pillars into one production-ready turn around the exact agent you already built
We’ll build it in stages, wrapping the agent you already have — not rewriting it. Let’s harden Atlas.
Stage 1: The Full Atlas So Far
Start by taking stock of what you’ve built. Atlas is not a toy at this point — it is the accumulation of seven modules of work, and every piece is still in play:
- The loop (M2) — Atlas reads a request, calls the model, runs tools, feeds results back, and repeats until it has an answer.
- Robust tools (M3) — validated inputs, typed outputs, errors returned as data rather than crashes.
- Memory (M4) — a bounded transcript plus a searchable store of traveler facts recalled before planning.
- Planning and reflection (M5) — decompose hard trips into steps, reason through them, and critique the draft.
- Retrieval (M6) — pull grounding facts from real documents before answering.
- Multi-agent coordination (M7) — an orchestrator delegates to specialists (flights, lodging, food) and merges their work.
All of that runs inside one function we’ll call run_agent(request). In this project we treat that whole assembly as a black box: something that takes a request and returns an answer, using memory, retrieval, and specialists along the way. Module 8’s job is to put four protective layers around it without touching the engine.
That framing is the whole point of this module: production-hardening is layers around working code, added where the risk is. Let’s add them one at a time.
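To make the black-box framing concrete, here is a minimal sketch of what "a layer around working code" looks like. It assumes run_agent takes a request string and returns an answer string; the lesson only names run_agent(request), so the stub body, the AgentFn alias, and the logged wrapper are all hypothetical stand-ins.

```python
from typing import Callable

# Assumed contract: Atlas takes a request string and returns an answer string.
AgentFn = Callable[[str], str]

def run_agent(request: str) -> str:
    """Stub standing in for the full Atlas (loop, tools, memory, retrieval, specialists)."""
    return f"(stub itinerary for: {request})"

def logged(agent_fn: AgentFn) -> AgentFn:
    """A hypothetical layer: it wraps any agent without touching its internals."""
    def wrapper(request: str) -> str:
        answer = agent_fn(request)
        wrapper.calls.append((request, answer))  # record what went in and what came out
        return answer
    wrapper.calls = []
    return wrapper

atlas = logged(run_agent)
answer = atlas("Plan a 2-day Kyoto trip")
```

Every layer in the stages that follow has this shape: it accepts the agent as a function, adds behavior before or after the call, and leaves the engine alone.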
Stage 2: Guardrails — Refuse Out-of-Scope, Repair Bad Output
The first layer sits at Atlas’s boundary. Before the expensive run starts, an input guardrail checks that the request is something Atlas should handle at all. After the run, an output guardrail checks that the answer is usable; if it isn’t, Atlas gets one repair attempt. This is the exact code from Lesson 2:

def input_guardrail(client, request, *, scope):
    """Refuse out-of-scope requests BEFORE running the expensive agent."""
    verdict = say(client, f"Is this request about {scope}? Answer yes or no.\n{request}").lower()
    return verdict.startswith("yes")

def output_ok(answer):
    """Deterministic output guardrail: non-empty and no leftover placeholder."""
    return bool(answer.strip()) and "TODO" not in answer and "[[" not in answer

def guarded_run(client, request, agent_fn, *, scope):
    if not input_guardrail(client, request, scope=scope):
        return {"answer": f"Sorry, I can only help with {scope}.", "refused": True, "repaired": False}
    answer = agent_fn(request)
    if not output_ok(answer):  # output failed the guardrail -> one repair
        answer = agent_fn(request + "\n(Your previous answer was empty/incomplete; give a full answer.)")
        return {"answer": answer, "refused": False, "repaired": True}
    return {"answer": answer, "refused": False, "repaired": False}

Here agent_fn is the full Atlas from Stage 1. Two things can now go right that would have gone wrong. First, an out-of-scope request is refused before Atlas ever runs:

out = guarded_run(client, "Write me a poem about cats", run_agent, scope="trip planning")
# {'answer': 'Sorry, I can only help with trip planning.', 'refused': True, 'repaired': False}

The input guardrail said “no,” so run_agent was never called: beyond the one short scope check, no tokens were spent and no specialists were spun up. Second, when Atlas does run but returns something empty or half-finished, the output guardrail catches it and asks once more:

out2 = guarded_run(client, "Plan a 2-day Kyoto trip", run_agent, scope="trip planning")
# repaired -> "Day 1: Kyoto temples. Day 2: Arashiyama." (example — exact wording varies)

The first answer came back empty, output_ok returned False, and guarded_run re-asked with a nudge, yielding a real itinerary. Atlas now has a validated way in and a validated way out.
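Both outcomes above can be reproduced offline. The sketch below repeats the Lesson 2 functions and drives them with stand-ins that are assumptions, not part of the course code: say is reduced to calling the client directly, fake_client answers the scope question with a keyword check, and flaky_agent returns an empty answer on its first call.

```python
def say(client, prompt):
    """Hypothetical helper: here the 'client' is just a callable returning text."""
    return client(prompt)

def input_guardrail(client, request, *, scope):
    verdict = say(client, f"Is this request about {scope}? Answer yes or no.\n{request}").lower()
    return verdict.startswith("yes")

def output_ok(answer):
    return bool(answer.strip()) and "TODO" not in answer and "[[" not in answer

def guarded_run(client, request, agent_fn, *, scope):
    if not input_guardrail(client, request, scope=scope):
        return {"answer": f"Sorry, I can only help with {scope}.", "refused": True, "repaired": False}
    answer = agent_fn(request)
    if not output_ok(answer):  # one repair attempt
        answer = agent_fn(request + "\n(Your previous answer was empty/incomplete; give a full answer.)")
        return {"answer": answer, "refused": False, "repaired": True}
    return {"answer": answer, "refused": False, "repaired": False}

def fake_client(prompt):
    """Mock scope check: looks only at the request line, not the question above it."""
    request = prompt.split("\n", 1)[1]
    return "Yes" if "trip" in request.lower() else "No"

calls = []

def flaky_agent(request):
    """Mock Atlas: empty on the first call, a real answer on the repair call."""
    calls.append(request)
    return "" if len(calls) == 1 else "Day 1: Kyoto temples. Day 2: Arashiyama."

refused = guarded_run(fake_client, "Write me a poem about cats", flaky_agent, scope="trip planning")
calls_after_refusal = len(calls)  # the agent was never invoked
repaired = guarded_run(fake_client, "Plan a 2-day Kyoto trip", flaky_agent, scope="trip planning")
```

One detail worth noticing: the repaired answer is returned without a second output_ok check. Whether to re-validate it, or to fall back to an apology when the repair also fails, is a design choice this code leaves open.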
Stage 3: Retries and a Budget — Survive Blips, Cap Spend
The second and third layers wrap the calls inside Atlas. A model call or a tool call can fail transiently — a timeout, a rate limit, a one-off 529. In a demo you retry by hand; in production you wrap the call in with_retries, which backs off exponentially and gives up after a cap so it never loops forever. This is the verified code from Lesson 3:

def with_retries(fn, *, max_attempts=3, base_delay=0.5, sleep, transient=(TransientError,)):
    """Call fn(); on a transient error, back off (base*2**n) and retry, up to a cap."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except transient:
            if attempt == max_attempts:
                raise
            sleep(base_delay * (2 ** (attempt - 1)))

Wrap a flaky Atlas call (say, a model API that fails twice, then succeeds) and the backoff schedule is deterministic:

out = with_retries(flaky_model_call, max_attempts=5, base_delay=0.5, sleep=delays.append)
# result: "ok" | attempts: 3 | backoff delays: [0.5, 1.0]

Two failures, two waits of 0.5s then 1.0s, then success: the caller never sees the blips. And if a call is genuinely down, with_retries raises after the cap instead of hanging forever.
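The run above leaves flaky_model_call, delays, and TransientError undefined. A self-contained version might look like the sketch below, where TransientError is assumed to be a plain Exception subclass and make_flaky is a hypothetical helper that fails a fixed number of times before succeeding. It also exercises the give-up path.

```python
class TransientError(Exception):
    """Assumed base class for retryable failures (timeouts, rate limits)."""

def with_retries(fn, *, max_attempts=3, base_delay=0.5, sleep, transient=(TransientError,)):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except transient:
            if attempt == max_attempts:
                raise
            sleep(base_delay * (2 ** (attempt - 1)))

def make_flaky(failures):
    """Hypothetical helper: a call that raises `failures` times, then returns "ok"."""
    state = {"attempts": 0}
    def call():
        state["attempts"] += 1
        if state["attempts"] <= failures:
            raise TransientError("simulated timeout")
        return "ok"
    call.state = state
    return call

# Recoverable case: two blips, then success.
delays = []
flaky_model_call = make_flaky(failures=2)
out = with_retries(flaky_model_call, max_attempts=5, base_delay=0.5, sleep=delays.append)

# Dead case: the call never succeeds, so the error is re-raised at the cap.
dead_delays = []
always_down = make_flaky(failures=99)
try:
    with_retries(always_down, max_attempts=3, base_delay=0.5, sleep=dead_delays.append)
    gave_up = False
except TransientError:
    gave_up = True
```

Passing sleep=delays.append records the waits instead of performing them, which keeps the demo instant and deterministic. In a real run you would pass time.sleep.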
The other half is cost. A confused agent can loop, pick an expensive path, or fan out to too many specialists — every step is tokens you pay for. A TokenBudget meters spend and stops the run the moment the cap is crossed:

class TokenBudget:
    def __init__(self, limit):
        self.limit = limit
        self.spent = 0

    def charge(self, usage_tokens):
        self.spent += usage_tokens
        if self.spent > self.limit:
            raise BudgetExceeded(f"spent {self.spent} > limit {self.limit}")
        return self.spent

Charge it after each model call with the tokens that call used, and it enforces the cap:

budget = TokenBudget(limit=1000)
budget.charge(400); budget.charge(400)  # spent 800, fine
budget.charge(300)                      # 1100 > 1000 -> raises BudgetExceeded

Atlas now recovers from transient failures and can’t run away with your money. Robust and affordable, both in a few lines around the existing calls.
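As a sketch of how "charge it after each model call" might be wired in, the example below assumes each model call returns its text together with the tokens it used. BudgetExceeded is assumed to be a plain Exception subclass, and metered and the step list are hypothetical.

```python
class BudgetExceeded(Exception):
    """Assumed exception type raised when spend crosses the cap."""

class TokenBudget:
    def __init__(self, limit):
        self.limit = limit
        self.spent = 0

    def charge(self, usage_tokens):
        self.spent += usage_tokens
        if self.spent > self.limit:
            raise BudgetExceeded(f"spent {self.spent} > limit {self.limit}")
        return self.spent

def metered(budget, model_call):
    """Hypothetical wrapper: run a call returning (text, tokens_used), then charge the budget."""
    text, tokens_used = model_call()
    budget.charge(tokens_used)
    return text

budget = TokenBudget(limit=1000)
steps = [("flights draft", 400), ("lodging draft", 400), ("food draft", 300)]  # mock calls
completed = []
stopped = None
for text, tokens in steps:
    try:
        completed.append(metered(budget, lambda: (text, tokens)))
    except BudgetExceeded as err:
        stopped = str(err)  # the run halts here instead of fanning out further
        break
```

Because the charge happens after the call, the call that crosses the cap has already been paid for. The budget bounds how far a runaway can go; it does not make the limit exact.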
Stage 4: Evaluation — Score Atlas, Catch Regressions
The first three layers protect a single run. The fourth protects quality over time. You build a small test set of representative requests, each with a rubric describing a passing answer, run Atlas over all of them, and let an LLM judge score each answer PASS or FAIL. The pass rate becomes a number you track. This is the verified harness from Lesson 4:

def judge(client, question, answer, rubric):
    verdict = say(client, f"Question: {question}\nAnswer: {answer}\nRubric: {rubric}\n"
                          f"Reply with exactly PASS or FAIL.")
    return verdict.strip().upper().startswith("PASS")

def evaluate(client, agent_fn, cases):
    results = []
    for c in cases:
        answer = agent_fn(c["input"])
        passed = judge(client, c["input"], answer, c["rubric"])
        results.append({"input": c["input"], "answer": answer, "passed": passed})
    n_pass = sum(r["passed"] for r in results)
    return {"pass_rate": round(n_pass / len(results), 3), "results": results}

A three-case test set for Atlas, with the judge verdicts PASS, PASS, FAIL, scores exactly as you’d expect:

cases = [
    {"input": "Plan a day in Kyoto", "rubric": "mentions a real Kyoto sight"},
    {"input": "Budget for 3 days", "rubric": "gives a number"},
    {"input": "Best time for Arashiyama", "rubric": "says morning"},
]
report = evaluate(client, run_agent, cases)
# PASS Plan a day in Kyoto
# PASS Budget for 3 days
# FAIL Best time for Arashiyama (example — exact wording varies)
# pass_rate: 0.667

Two of three passed, for a pass_rate of 0.667. That number is the point. Run this harness on every change: tweak a prompt, swap a tool, add a specialist, and re-score. If the pass rate drops, you’ve caught a regression before your users did. Quality stops being a feeling and becomes something you can defend with a number.
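"If the pass rate drops" can be turned into an automatic check. The sketch below repeats the Lesson 4 harness and adds two helpers that are not part of it, is_regression and failed_inputs. The judge is a scripted mock that returns PASS, PASS, FAIL in order, and stub_agent stands in for Atlas, so the run is deterministic.

```python
def say(client, prompt):
    """Hypothetical helper: here the 'client' is just a callable returning text."""
    return client(prompt)

def judge(client, question, answer, rubric):
    verdict = say(client, f"Question: {question}\nAnswer: {answer}\nRubric: {rubric}\n"
                          f"Reply with exactly PASS or FAIL.")
    return verdict.strip().upper().startswith("PASS")

def evaluate(client, agent_fn, cases):
    results = []
    for c in cases:
        answer = agent_fn(c["input"])
        passed = judge(client, c["input"], answer, c["rubric"])
        results.append({"input": c["input"], "answer": answer, "passed": passed})
    n_pass = sum(r["passed"] for r in results)
    return {"pass_rate": round(n_pass / len(results), 3), "results": results}

def failed_inputs(report):
    """Hypothetical helper: which requests should be investigated first."""
    return [r["input"] for r in report["results"] if not r["passed"]]

def is_regression(report, baseline, tolerance=0.0):
    """Hypothetical gate: True when the pass rate fell below the stored baseline."""
    return report["pass_rate"] < baseline - tolerance

verdicts = iter(["PASS", "PASS", "FAIL"])  # scripted judge replies, in case order

def scripted_judge(prompt):
    return next(verdicts)

def stub_agent(request):
    return f"(stub answer to: {request})"

cases = [
    {"input": "Plan a day in Kyoto", "rubric": "mentions a real Kyoto sight"},
    {"input": "Budget for 3 days", "rubric": "gives a number"},
    {"input": "Best time for Arashiyama", "rubric": "says morning"},
]
report = evaluate(scripted_judge, stub_agent, cases)
```

In practice the baseline would be the pass rate recorded from the last accepted version. Since a real LLM judge may not return identical verdicts on every run, a small tolerance can keep the gate from flagging noise.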
Stage 5: Putting It Together — One Production-Ready Turn
Now watch a single request flow through all four layers around the same Atlas you built in Module 2:
- Input guard. guarded_run asks the scope question. Out-of-scope (“write my essay”)? Refuse now: Atlas never runs, and nothing is spent beyond the one short scope check.
- The run, retried and budgeted. In scope, so Atlas runs: it recalls memory (M4), retrieves grounding facts (M6), plans and reflects (M5), and delegates to specialists (M7), all through the loop (M2) with robust tools (M3). Each model/tool call is wrapped in with_retries, and every call charges the TokenBudget, so a blip is absorbed and a runaway is stopped.
- Output guard. output_ok checks the answer. Empty or half-finished? One repair attempt. Otherwise it’s delivered.
- Offline evaluation. Separately, whenever you change anything, evaluate scores Atlas over the test set and reports a pass rate so you know the change helped rather than hurt.
That’s it — that’s a production agent. Notice what did not happen: you did not rebuild the loop, the tools, the memory, the retrieval, or the specialists. Every one of those is the same code from earlier modules. All you added was four layers of protection around it: a check before, a retry and budget within, a check after, and a scorer around the whole thing.
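The three in-turn layers can be sketched end to end with mocks. The pillar code below is repeated from the earlier stages; everything else is an assumption made for the demo: make_atlas (a one-call stand-in for the real agent), model_call (fails once, then returns text plus a token count), scope_client, and the two exception classes.

```python
class TransientError(Exception):
    """Assumed base class for retryable failures."""

class BudgetExceeded(Exception):
    """Assumed exception raised when spend crosses the cap."""

class TokenBudget:
    def __init__(self, limit):
        self.limit = limit
        self.spent = 0

    def charge(self, usage_tokens):
        self.spent += usage_tokens
        if self.spent > self.limit:
            raise BudgetExceeded(f"spent {self.spent} > limit {self.limit}")
        return self.spent

def with_retries(fn, *, max_attempts=3, base_delay=0.5, sleep, transient=(TransientError,)):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except transient:
            if attempt == max_attempts:
                raise
            sleep(base_delay * (2 ** (attempt - 1)))

def say(client, prompt):
    """Hypothetical helper: here the 'client' is just a callable returning text."""
    return client(prompt)

def input_guardrail(client, request, *, scope):
    verdict = say(client, f"Is this request about {scope}? Answer yes or no.\n{request}").lower()
    return verdict.startswith("yes")

def output_ok(answer):
    return bool(answer.strip()) and "TODO" not in answer and "[[" not in answer

def guarded_run(client, request, agent_fn, *, scope):
    if not input_guardrail(client, request, scope=scope):
        return {"answer": f"Sorry, I can only help with {scope}.", "refused": True, "repaired": False}
    answer = agent_fn(request)
    if not output_ok(answer):
        answer = agent_fn(request + "\n(Your previous answer was empty/incomplete; give a full answer.)")
        return {"answer": answer, "refused": False, "repaired": True}
    return {"answer": answer, "refused": False, "repaired": False}

def make_atlas(model_call, budget, sleep):
    """Hypothetical wiring: each model call is retried, then charged to the budget."""
    def run_agent(request):
        text, tokens = with_retries(lambda: model_call(request), sleep=sleep)
        budget.charge(tokens)
        return text
    return run_agent

attempts = {"n": 0}

def model_call(request):
    """Mock model: one transient failure, then an answer and its token count."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TransientError("simulated blip")
    return "Day 1: Kyoto temples. Day 2: Arashiyama.", 120

def scope_client(prompt):
    """Mock scope check: looks only at the request line."""
    request = prompt.split("\n", 1)[1]
    return "yes" if "trip" in request.lower() else "no"

delays = []
budget = TokenBudget(limit=1000)
atlas = make_atlas(model_call, budget, delays.append)
served = guarded_run(scope_client, "Plan a 2-day Kyoto trip", atlas, scope="trip planning")
refused = guarded_run(scope_client, "Write my essay", atlas, scope="trip planning")
```

In this sketch a BudgetExceeded raised inside the run propagates out of guarded_run to the caller. Deciding what the user sees in that case, such as an apology or a partial answer, is left open here.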
Ship the smallest safe version
You don’t have to add all four layers on day one. Add protection where the risk is. Getting adversarial inputs? Start with the input guardrail. Seeing flaky API errors? Add retries. Watching the bill climb? Add the budget. Shipping changes and hoping? Build the eval harness first. Each pillar is independent and takes a few lines — you can adopt them one at a time on an agent you already have. And congratulations: this is the last stage of the last lesson. Atlas is production-ready, and so are you.
Practice Exercises
Exercise 1: Add a rate-limit-specific retry
Right now with_retries treats every TransientError the same. A 429 “rate limited” response usually tells you how long to wait (a Retry-After header). Make the retry honor that hint instead of the fixed backoff schedule when it’s present.
Hint
Define a RateLimited(TransientError) exception carrying a retry_after value, and catch it specifically before the general transient case. When you catch a RateLimited, call sleep(err.retry_after) instead of sleep(base_delay * 2 ** n). Everything else — the attempt loop, the cap, the re-raise — stays identical, because rate limiting is just a transient failure with a known wait time.
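One possible shape for this exercise, following the hint, is sketched below. How retry_after gets populated (for example, parsed from a Retry-After header) is assumed to happen wherever the error is raised; the scripted call sequence is a mock.

```python
class TransientError(Exception):
    """Assumed base class for retryable failures."""

class RateLimited(TransientError):
    """Transient failure that carries the server's suggested wait, in seconds."""
    def __init__(self, retry_after):
        super().__init__(f"rate limited; retry after {retry_after}s")
        self.retry_after = retry_after

def with_retries(fn, *, max_attempts=3, base_delay=0.5, sleep, transient=(TransientError,)):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimited as err:  # must come before the general transient case
            if attempt == max_attempts:
                raise
            sleep(err.retry_after)  # honor the hint instead of the backoff schedule
        except transient:
            if attempt == max_attempts:
                raise
            sleep(base_delay * (2 ** (attempt - 1)))

# Mock call sequence: a rate limit, then a timeout, then success.
script = [RateLimited(retry_after=2.0), TransientError("timeout"), "ok"]

def scripted_call():
    outcome = script.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome

delays = []
out = with_retries(scripted_call, max_attempts=5, base_delay=0.5, sleep=delays.append)
```

The first wait comes from the hint (2.0s). The second comes from the backoff schedule for attempt 2, which is 0.5 * 2**1 = 1.0s.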
Exercise 2: Grow the eval set from real failures
Three test cases barely scratch Atlas’s behavior. Turn every production failure into a permanent test: when a user reports a bad answer, add that request and a rubric to the eval set so it can never silently regress again.
Hint
Keep cases in a JSON file, one object per case with input and rubric. When you find a failure, append {"input": "<the request>", "rubric": "<what a good answer must do>"} and re-run evaluate. Over time this “regression suite” grows into the real definition of what Atlas must do — and each fix comes with a test that guards it forever. This is exactly how test suites grow in ordinary software.
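A minimal sketch of that workflow, assuming the cases live in a single JSON array on disk. The file name, the load_cases and add_case helpers, and the skip-duplicates rule are illustrative choices, not part of the course code.

```python
import json
import tempfile
from pathlib import Path

def load_cases(path):
    """Read the eval set, treating a missing file as an empty set."""
    path = Path(path)
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))

def add_case(path, request, rubric):
    """Append a reported failure as a permanent test case, skipping exact duplicates."""
    cases = load_cases(path)
    if all(c["input"] != request for c in cases):
        cases.append({"input": request, "rubric": rubric})
        Path(path).write_text(json.dumps(cases, indent=2), encoding="utf-8")
    return cases

# Demo in a temporary directory so nothing in the project is touched.
cases_path = Path(tempfile.mkdtemp()) / "eval_cases.json"
add_case(cases_path, "Best time for Arashiyama", "says morning")
cases = add_case(cases_path, "Best time for Arashiyama", "says morning")  # duplicate ignored
```

The list returned by load_cases has the same shape as the cases argument to evaluate, so the file can be fed straight into the harness.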
Exercise 3: Add a cost dashboard
You have a TokenBudget per run, but no view of spending across many runs. Log budget.spent after each request and summarize it, so you can see cost per turn, average, and outliers.
Hint
After each guarded_run, append budget.spent (and the request) to a log file or a small SQLite table. Then a tiny script can report the mean, the max, and the requests in the top 5% of spend — those outliers are exactly the confused runs worth investigating. Pair it with the pass rate from Exercise 2 and you have the two numbers that matter most in production: is it good, and what does it cost?
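The reporting half could be sketched as follows, assuming the log is a list of dicts with request and spent keys; the same summary would work over rows read from a file or a SQLite table. summarize_costs and the sample numbers are made up for illustration.

```python
import statistics

def summarize_costs(log, top_fraction=0.05):
    """Report mean and max spend, plus the requests in the top slice of spend."""
    spends = sorted((entry["spent"] for entry in log), reverse=True)
    n_top = max(1, round(len(spends) * top_fraction))  # always flag at least one run
    threshold = spends[n_top - 1]
    outliers = [entry for entry in log if entry["spent"] >= threshold]
    return {"mean": statistics.mean(spends), "max": spends[0], "outliers": outliers}

# Illustrative log: nineteen ordinary turns and one confused, expensive one.
log = [{"request": f"trip request {i}", "spent": 100 + i} for i in range(19)]
log.append({"request": "Plan a 40-city world tour", "spent": 5000})

summary = summarize_costs(log)
```

With ties at the threshold, every tied run is reported, so the outlier list can be slightly larger than the nominal 5%.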
Summary
You took the full Atlas — the loop, tools, memory, planning, retrieval, and multi-agent coordination you built across seven modules — and wrapped it in Module 8’s four production layers without rewriting the engine. Guardrails (guarded_run) refuse out-of-scope requests before Atlas runs and repair an empty or half-finished answer after; the verified run refused a poem request and repaired a blank itinerary. Retries (with_retries) absorbed a flaky call with a deterministic [0.5, 1.0] backoff and capped a dead one, while a TokenBudget raised BudgetExceeded the moment spend crossed the cap. Evaluation (evaluate/judge) scored a three-case test set at a pass_rate of 0.667, the number you re-check on every change to catch regressions. Put together, one production-ready turn flows request → input guard → retried, budgeted agent run → output guard → delivered, with the eval harness scoring Atlas offline. It’s the same agent from Module 2, now behind four layers of protection.
The retries/backoff and TokenBudget here are pure Python, verified deterministically; the guardrails (refuse and repair) and the LLM-as-judge evaluation were verified against an SDK-shaped mock with no ANTHROPIC_API_KEY. Illustrative model text is labeled “example — exact wording varies.”
Key Concepts
- Wrap, don’t rewrite: the four pillars surround the existing run_agent; the engine is untouched.
- Guardrails at the boundary: an input check refuses out-of-scope before spending; an output check repairs a bad answer after.
- Resilient and bounded: with_retries survives transient blips with backoff and a cap; TokenBudget stops runaway spend.
- Quality as a number: evaluate/judge produce a pass rate you re-run on every change to catch regressions.
Why This Matters
This is the line between an impressive demo and something you’d put in front of real users. Every failure the pillars prevent — the out-of-scope abuse, the 3 a.m. blip, the surprise bill, the silent regression — is the kind that turns a great prototype into a support ticket. And because each pillar is a thin layer added where the risk is, you can harden an agent you already have, incrementally, without a redesign. That’s the shape of shipping AI: build the capability, then wrap it in protection you can measure. Atlas now has both, and so do you.
Continue Building Your Skills
You did it — you finished Building AI Agents in Python. Look back at the whole journey. You started with a bare agent loop that could call the model and run a tool. You made those tools robust, so bad inputs became data instead of crashes. You gave Atlas memory — a bounded transcript and a store it recalls facts from across sessions. You taught it to plan and reflect on hard, multi-step requests. You grounded it with retrieval over real documents. You had it coordinate specialists as a team of agents. And in this final module you wrapped the whole thing in production reliability — guardrails, retries, a budget, and an evaluation harness. That arc, loop → tools → memory → planning → retrieval → multi-agent → production-ready, is the full anatomy of a real AI agent, and you built every layer yourself.
Atlas began as an empty loop and ends as an agent you could genuinely ship. The patterns you practiced here are not specific to travel planning — they are how serious agents are built, whatever the domain. Take them and build your own. Congratulations, and happy building.