Lesson 5 - Guided Project: Production-Ready Atlas

Welcome to the Guided Project

This is the last lesson of the last module — the capstone of the whole course. Across seven modules you built Atlas from nothing into a capable travel-planning agent: the agent loop (Module 2), robust tools (Module 3), memory it manages and recalls (Module 4), planning and reflection for hard requests (Module 5), retrieval over real documents (Module 6), and multi-agent coordination where specialists collaborate (Module 7). Atlas can already do impressive things. What it can’t do yet is survive contact with thousands of real users on a flaky network with a finite budget — and know, every time you change it, that it still works. This module gave you the four pillars that close that gap, and now you’ll wrap all four around Atlas at once. By the end, Atlas is production-ready: it refuses what it shouldn’t do, repairs a bad answer, shrugs off a transient failure, stops before it overspends, and reports a pass rate you can track.

By the end of this project, you will be able to:

  • Wrap Atlas in an input/output guardrail that refuses out-of-scope requests and repairs bad answers
  • Make Atlas’s calls resilient with with_retries (backoff and a cap) and bounded with a TokenBudget
  • Score Atlas against a small test set with evaluate and judge, reporting a pass rate
  • Assemble all four pillars into one production-ready turn around the exact agent you already built

We’ll build it in stages, wrapping the agent you already have — not rewriting it. Let’s harden Atlas.


Stage 1: The Full Atlas So Far

Start by taking stock of what you’ve built. Atlas is not a toy at this point — it is the accumulation of seven modules of work, and every piece is still in play:

  • The loop (M2) — Atlas reads a request, calls the model, runs tools, feeds results back, and repeats until it has an answer.
  • Robust tools (M3) — validated inputs, typed outputs, errors returned as data rather than crashes.
  • Memory (M4) — a bounded transcript plus a searchable store of traveler facts recalled before planning.
  • Planning and reflection (M5) — decompose hard trips into steps, reason through them, and critique the draft.
  • Retrieval (M6) — pull grounding facts from real documents before answering.
  • Multi-agent coordination (M7) — an orchestrator delegates to specialists (flights, lodging, food) and merges their work.

All of that runs inside one function we’ll call run_agent(request). In this project we treat that whole assembly as a black box: something that takes a request and returns an answer, using memory, retrieval, and specialists along the way. Module 8’s job is to put four protective layers around it without touching the engine.

Figure: The four production pillars wrapping the full Atlas from Modules 2-7. Guardrails check inputs and outputs, refusing or repairing; retries recover from transient errors with backoff and a cap; cost control applies token and step budgets so there are no runaway loops; evaluation scores Atlas against a test set to catch regressions. The first three protect each run, evaluation protects quality over time, and none of them replace the agent: they surround it.

That framing is the whole point of this module: production-hardening is layers around working code, added where the risk is. Let’s add them one at a time.


Stage 2: Guardrails — Refuse Out-of-Scope, Repair Bad Output

The first layer sits at Atlas’s boundary. Before the expensive run starts, an input guardrail checks that the request is even something Atlas should handle. After the run, an output guardrail checks the answer is usable, and gets one repair attempt if it isn’t. This is the exact code from Lesson 2:

def input_guardrail(client, request, *, scope):
    """Refuse out-of-scope requests BEFORE running the expensive agent."""
    # strip() tolerates leading whitespace or newlines in the model's reply
    verdict = say(client, f"Is this request about {scope}? Answer yes or no.\n{request}").strip().lower()
    return verdict.startswith("yes")

def output_ok(answer):
    """Deterministic output guardrail: non-empty and no leftover placeholder."""
    return bool(answer.strip()) and "TODO" not in answer and "[[" not in answer

def guarded_run(client, request, agent_fn, *, scope):
    if not input_guardrail(client, request, scope=scope):
        return {"answer": f"Sorry, I can only help with {scope}.", "refused": True, "repaired": False}
    answer = agent_fn(request)
    if not output_ok(answer):                       # output failed the guardrail -> one repair
        answer = agent_fn(request + "\n(Your previous answer was empty/incomplete; give a full answer.)")
        return {"answer": answer, "refused": False, "repaired": True}
    return {"answer": answer, "refused": False, "repaired": False}

Here agent_fn is the full Atlas from Stage 1. Two things can now go right that would have gone wrong. First, an out-of-scope request is refused before Atlas ever runs:

out = guarded_run(client, "Write me a poem about cats", run_agent, scope="trip planning")
# {'answer': 'Sorry, I can only help with trip planning.', 'refused': True, 'repaired': False}

The input guardrail said “no,” so run_agent was never called — no tokens spent, no specialists spun up. Second, when Atlas does run but returns something empty or half-finished, the output guardrail catches it and asks once more:

out2 = guarded_run(client, "Plan a 2-day Kyoto trip", run_agent, scope="trip planning")
# repaired -> "Day 1: Kyoto temples. Day 2: Arashiyama."   (example — exact wording varies)

The first answer came back empty, output_ok returned False, and guarded_run re-asked with a nudge — yielding a real itinerary. Atlas now has a validated way in and a validated way out.
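Both guardrail paths can be exercised without spending a token. The sketch below repeats the guardrail functions so it runs on its own, and drives them with stand-ins: say here is a hypothetical keyword-matching stub (not the course’s real helper), and flaky_agent is an invented agent that returns an empty answer on its first call.

```python
# Hypothetical stub for the course's say(client, prompt) helper:
# replies "Yes" only when the request line mentions a trip.
def say(client, prompt):
    request = prompt.split("\n", 1)[1]
    return "Yes" if "trip" in request.lower() else "No"

def input_guardrail(client, request, *, scope):
    verdict = say(client, f"Is this request about {scope}? Answer yes or no.\n{request}").lower()
    return verdict.startswith("yes")

def output_ok(answer):
    return bool(answer.strip()) and "TODO" not in answer and "[[" not in answer

def guarded_run(client, request, agent_fn, *, scope):
    if not input_guardrail(client, request, scope=scope):
        return {"answer": f"Sorry, I can only help with {scope}.", "refused": True, "repaired": False}
    answer = agent_fn(request)
    if not output_ok(answer):
        answer = agent_fn(request + "\n(Your previous answer was empty/incomplete; give a full answer.)")
        return {"answer": answer, "refused": False, "repaired": True}
    return {"answer": answer, "refused": False, "repaired": False}

# Invented stand-in for run_agent: empty on the first call, a real answer after.
calls = []
def flaky_agent(request):
    calls.append(request)
    return "" if len(calls) == 1 else "Day 1: Kyoto temples. Day 2: Arashiyama."

refused = guarded_run(None, "Write me a poem about cats", flaky_agent, scope="trip planning")
repaired = guarded_run(None, "Plan a 2-day Kyoto trip", flaky_agent, scope="trip planning")

print(refused["refused"], len(calls))   # the refusal never reached the agent
print(repaired["repaired"], repaired["answer"])
```

Note that the repaired answer is returned without a second output_ok check: the guardrail allows exactly one repair attempt, so a second bad answer would still be delivered.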


Stage 3: Retries and a Budget — Survive Blips, Cap Spend

The second and third layers wrap the calls inside Atlas. A model call or a tool call can fail transiently: a timeout, a rate limit (HTTP 429), or a one-off 529 “overloaded” response. In a demo you retry by hand; in production you wrap the call in with_retries, which backs off exponentially and gives up after a cap so it never loops forever. This is the verified code from Lesson 3:

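class TransientError(Exception):
    """Retryable failure (timeout, rate limit). Assumed definition; Lesson 3 may define it differently."""

def with_retries(fn, *, max_attempts=3, base_delay=0.5, sleep, transient=(TransientError,)):
    """Call fn(); on a transient error, back off (base*2**n) and retry, up to a cap."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except transient:
            if attempt == max_attempts:
                raise                                  # cap reached: surface the error to the caller
            sleep(base_delay * (2 ** (attempt - 1)))   # waits base, 2*base, 4*base, ...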

Wrap a flaky Atlas call — say the model API that fails twice, then succeeds — and the backoff schedule is deterministic:

out = with_retries(flaky_model_call, max_attempts=5, base_delay=0.5, sleep=delays.append)
# result: "ok" | attempts: 3 | backoff delays: [0.5, 1.0]

Two failures, two waits of 0.5s then 1.0s, then success — the caller never sees the blips. And if a call is genuinely down, with_retries raises after the cap instead of hanging forever.
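The call above uses flaky_model_call and delays without showing where they come from. One way to set them up is sketched below; make_flaky is a hypothetical helper invented for this illustration, and TransientError is assumed to be a plain Exception subclass. Passing delays.append as sleep records the waits instead of actually sleeping, which is what makes the schedule checkable.

```python
class TransientError(Exception):
    """Assumed base class for retryable failures (timeouts, rate limits)."""

def with_retries(fn, *, max_attempts=3, base_delay=0.5, sleep, transient=(TransientError,)):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except transient:
            if attempt == max_attempts:
                raise
            sleep(base_delay * (2 ** (attempt - 1)))

def make_flaky(failures):
    """Hypothetical helper: build a call that fails `failures` times, then returns 'ok'."""
    state = {"attempts": 0}
    def call():
        state["attempts"] += 1
        if state["attempts"] <= failures:
            raise TransientError("simulated timeout")
        return "ok"
    return call, state

# Fails twice, then succeeds: two recorded waits, three attempts.
delays = []
flaky_model_call, state = make_flaky(failures=2)
result = with_retries(flaky_model_call, max_attempts=5, base_delay=0.5, sleep=delays.append)
print(result, state["attempts"], delays)   # ok 3 [0.5, 1.0]

# Never recovers: the cap turns an endless wait into a raised error.
dead_delays = []
dead_call, dead_state = make_flaky(failures=99)
try:
    with_retries(dead_call, max_attempts=3, base_delay=0.5, sleep=dead_delays.append)
    gave_up = False
except TransientError:
    gave_up = True
print(gave_up, dead_state["attempts"], dead_delays)   # True 3 [0.5, 1.0]
```

In a real run you would pass time.sleep as the sleep argument; injecting it is what lets the same function be tested instantly.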

The other half is cost. A confused agent can loop, pick an expensive path, or fan out to too many specialists — every step is tokens you pay for. A TokenBudget meters spend and stops the run the moment the cap is crossed:

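class BudgetExceeded(Exception):
    """Spend crossed the cap. Assumed definition; Lesson 3 may define it differently."""

class TokenBudget:
    def __init__(self, limit):
        self.limit = limit   # maximum tokens this run may spend
        self.spent = 0       # running total charged so far
    def charge(self, usage_tokens):
        self.spent += usage_tokens
        if self.spent > self.limit:
            raise BudgetExceeded(f"spent {self.spent} > limit {self.limit}")
        return self.spent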

Charge it after each model call with the tokens that call used, and it enforces the cap:

budget = TokenBudget(limit=1000)
budget.charge(400); budget.charge(400)   # spent 800, fine
budget.charge(300)                        # 1100 > 1000 -> raises BudgetExceeded

Atlas now recovers from transient failures and can’t run away with your money. Robust and affordable, both in a few lines around the existing calls.
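To see how the budget stops a loop rather than a single call, here is a minimal sketch. run_steps is a hypothetical stand-in for Atlas’s loop in which each step “uses” a known token count, and BudgetExceeded is assumed to be a plain Exception subclass.

```python
class BudgetExceeded(Exception):
    """Assumed error raised when cumulative spend crosses the cap."""

class TokenBudget:
    def __init__(self, limit):
        self.limit = limit
        self.spent = 0
    def charge(self, usage_tokens):
        self.spent += usage_tokens
        if self.spent > self.limit:
            raise BudgetExceeded(f"spent {self.spent} > limit {self.limit}")
        return self.spent

def run_steps(step_costs, budget):
    """Hypothetical agent loop: charge the budget after each step, stop at the cap."""
    completed = 0
    try:
        for cost in step_costs:
            budget.charge(cost)   # charged after the step's model call has finished
            completed += 1
    except BudgetExceeded as err:
        return {"completed": completed, "stopped": True, "reason": str(err)}
    return {"completed": completed, "stopped": False, "reason": ""}

budget = TokenBudget(limit=1000)
outcome = run_steps([400, 400, 300, 300], budget)
print(outcome)        # stopped on the third step; the fourth never ran
print(budget.spent)   # 1100
```

Because charge runs after a call has already happened, the cap can be overshot by up to one call’s worth of tokens (1100 against a limit of 1000 here). If that margin matters, set the limit slightly below the true ceiling.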


Stage 4: Evaluation — Score Atlas, Catch Regressions

The first three layers protect a single run. The fourth protects quality over time. You build a small test set of representative requests, each with a rubric describing a passing answer, run Atlas over all of them, and let an LLM judge score each answer PASS or FAIL. The pass rate becomes a number you track. This is the verified harness from Lesson 4:

def judge(client, question, answer, rubric):
    verdict = say(client, f"Question: {question}\nAnswer: {answer}\nRubric: {rubric}\n"
                          f"Reply with exactly PASS or FAIL.")
    return verdict.strip().upper().startswith("PASS")

def evaluate(client, agent_fn, cases):
    results = []
    for c in cases:
        answer = agent_fn(c["input"])
        passed = judge(client, c["input"], answer, c["rubric"])
        results.append({"input": c["input"], "answer": answer, "passed": passed})
    n_pass = sum(r["passed"] for r in results)
    return {"pass_rate": round(n_pass / len(results), 3), "results": results}

A three-case test set for Atlas, with the judge verdicts PASS, PASS, FAIL, scores exactly as you’d expect:

cases = [
    {"input": "Plan a day in Kyoto",      "rubric": "mentions a real Kyoto sight"},
    {"input": "Budget for 3 days",        "rubric": "gives a number"},
    {"input": "Best time for Arashiyama", "rubric": "says morning"},
]
report = evaluate(client, run_agent, cases)
#   PASS  Plan a day in Kyoto
#   PASS  Budget for 3 days
#   FAIL  Best time for Arashiyama      (example — exact wording varies)
# pass_rate: 0.667

Two of three passed — pass_rate of 0.667. That number is the point. Run this harness on every change: tweak a prompt, swap a tool, add a specialist, and re-score. If the pass rate drops, you’ve caught a regression before your users did. Quality stops being a feeling and becomes something you can defend with a number.
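A pass rate is most useful when compared against the previous one. The sketch below assumes you keep the report returned by evaluate for the last known-good version; find_regressions is a hypothetical helper, and the two reports are hand-written placeholders in the shape evaluate returns, not real model output.

```python
def find_regressions(baseline, candidate):
    """Compare two evaluate() reports: pass-rate change plus cases that newly fail."""
    before = {r["input"]: r["passed"] for r in baseline["results"]}
    newly_failing = [
        r["input"] for r in candidate["results"]
        if before.get(r["input"]) and not r["passed"]
    ]
    delta = round(candidate["pass_rate"] - baseline["pass_rate"], 3)
    return {"delta": delta, "newly_failing": newly_failing}

# Placeholder reports (answers elided) for the version before and after a change.
baseline = {"pass_rate": 0.667, "results": [
    {"input": "Plan a day in Kyoto",      "answer": "...", "passed": True},
    {"input": "Budget for 3 days",        "answer": "...", "passed": True},
    {"input": "Best time for Arashiyama", "answer": "...", "passed": False},
]}
candidate = {"pass_rate": 0.333, "results": [
    {"input": "Plan a day in Kyoto",      "answer": "...", "passed": True},
    {"input": "Budget for 3 days",        "answer": "...", "passed": False},
    {"input": "Best time for Arashiyama", "answer": "...", "passed": False},
]}

diff = find_regressions(baseline, candidate)
print(diff["newly_failing"])   # ['Budget for 3 days']
```

Listing the newly failing inputs, not just the drop in pass rate, points you at the exact case the change broke. Since the judge is itself a model, a single flipped verdict may be noise; re-running the case before blaming the change is a reasonable precaution.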


Stage 5: Putting It Together — One Production-Ready Turn

Now watch a single request flow through all four layers around the same Atlas you built across Modules 2-7:

  1. Input guard. guarded_run asks the scope question. Out-of-scope (“write my essay”)? Refuse now — Atlas never runs, nothing is spent.
  2. The run — retried and budgeted. In scope, so Atlas runs: it recalls memory (M4), retrieves grounding facts (M6), plans and reflects (M5), and delegates to specialists (M7) — all through the loop (M2) with robust tools (M3). Each model/tool call is wrapped in with_retries, and every call charges the TokenBudget, so a blip is absorbed and a runaway is stopped.
  3. Output guard. output_ok checks the answer. Empty or half-finished? One repair attempt. Otherwise it’s delivered.
  4. Offline evaluation. Separately, whenever you change anything, evaluate scores Atlas over the test set and reports a pass rate so you know the change helped rather than hurt.

That’s it — that’s a production agent. Notice what did not happen: you did not rebuild the loop, the tools, the memory, the retrieval, or the specialists. Every one of those is the same code from earlier modules. All you added was four layers of protection around it: a check before, a retry and budget within, a check after, and a scorer around the whole thing.
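The first three steps of that flow can be sketched as one function. This is an illustration under stated assumptions, not the course’s reference assembly: production_turn, the over-budget message, the say stub, and model_call (which fails once, then returns text plus a token count) are all invented here, and a single model call stands in for the full run_agent.

```python
class TransientError(Exception):
    """Assumed base class for retryable failures."""

class BudgetExceeded(Exception):
    """Assumed error raised when token spend crosses the cap."""

def with_retries(fn, *, max_attempts=3, base_delay=0.5, sleep, transient=(TransientError,)):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except transient:
            if attempt == max_attempts:
                raise
            sleep(base_delay * (2 ** (attempt - 1)))

class TokenBudget:
    def __init__(self, limit):
        self.limit = limit
        self.spent = 0
    def charge(self, usage_tokens):
        self.spent += usage_tokens
        if self.spent > self.limit:
            raise BudgetExceeded(f"spent {self.spent} > limit {self.limit}")
        return self.spent

def say(client, prompt):
    # Hypothetical stub for the course's say() helper: "yes" only for trip requests.
    return "yes" if "trip" in prompt.split("\n", 1)[1].lower() else "no"

def output_ok(answer):
    return bool(answer.strip()) and "TODO" not in answer and "[[" not in answer

def guarded_run(client, request, agent_fn, *, scope):
    verdict = say(client, f"Is this request about {scope}? Answer yes or no.\n{request}").lower()
    if not verdict.startswith("yes"):
        return {"answer": f"Sorry, I can only help with {scope}.", "refused": True, "repaired": False}
    answer = agent_fn(request)
    if not output_ok(answer):
        answer = agent_fn(request + "\n(Your previous answer was empty/incomplete; give a full answer.)")
        return {"answer": answer, "refused": False, "repaired": True}
    return {"answer": answer, "refused": False, "repaired": False}

def production_turn(client, request, *, scope, limit, model_call, sleep):
    """One turn: input guard -> retried, budgeted run -> output guard."""
    budget = TokenBudget(limit)
    def agent_fn(req):
        # Stand-in for run_agent: one model call, retried and then metered.
        text, tokens = with_retries(lambda: model_call(req), sleep=sleep)
        budget.charge(tokens)
        return text
    try:
        out = guarded_run(client, request, agent_fn, scope=scope)
    except BudgetExceeded:
        # Assumed fallback message; the lesson does not prescribe one.
        out = {"answer": "Sorry, this request exceeded its budget.", "refused": False, "repaired": False}
    out["spent"] = budget.spent
    return out

attempts = {"n": 0}
def model_call(req):
    # Invented model call: one simulated blip, then an answer costing 120 tokens.
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TransientError("simulated blip")
    return "Day 1: Kyoto temples. Day 2: Arashiyama.", 120

delays = []
common = dict(scope="trip planning", model_call=model_call, sleep=delays.append)
refused_out = production_turn(None, "Write me a poem about cats", limit=1000, **common)
trip_out = production_turn(None, "Plan a 2-day Kyoto trip", limit=1000, **common)
capped_out = production_turn(None, "Plan a 5-day Tokyo trip", limit=100, **common)
print(refused_out["refused"], trip_out["spent"], delays, capped_out["answer"])
```

The fourth layer, evaluation, stays outside this function on purpose: it runs offline over the whole test set, not inside a user’s turn.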

Ship the smallest safe version

You don’t have to add all four layers on day one. Add protection where the risk is. Getting adversarial inputs? Start with the input guardrail. Seeing flaky API errors? Add retries. Watching the bill climb? Add the budget. Shipping changes and hoping? Build the eval harness first. Each pillar is independent and takes a few lines — you can adopt them one at a time on an agent you already have. And congratulations: this is the last stage of the last lesson. Atlas is production-ready, and so are you.


Practice Exercises

Exercise 1: Add a rate-limit-specific retry

Right now with_retries treats every TransientError the same. A 429 “rate limited” response usually tells you how long to wait (a Retry-After header). Make the retry honor that hint instead of the fixed backoff schedule when it’s present.

Hint

Define a RateLimited(TransientError) exception carrying a retry_after value, and catch it specifically before the general transient case. When you catch a RateLimited, call sleep(err.retry_after) instead of sleep(base_delay * 2 ** n). Everything else — the attempt loop, the cap, the re-raise — stays identical, because rate limiting is just a transient failure with a known wait time.
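If you want to check your attempt against something, one possible shape is sketched below. RateLimited and its retry_after attribute follow the hint; how you would populate retry_after from a real response’s Retry-After header is left out, and the scripted call is invented for the demonstration.

```python
class TransientError(Exception):
    """Assumed base class for retryable failures."""

class RateLimited(TransientError):
    """A transient failure that carries the server's suggested wait in seconds."""
    def __init__(self, retry_after):
        super().__init__(f"rate limited; retry after {retry_after}s")
        self.retry_after = retry_after

def with_retries(fn, *, max_attempts=3, base_delay=0.5, sleep, transient=(TransientError,)):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimited as err:          # must come before the general case
            if attempt == max_attempts:
                raise
            sleep(err.retry_after)          # honor the server's hint
        except transient:
            if attempt == max_attempts:
                raise
            sleep(base_delay * (2 ** (attempt - 1)))

# Invented scripted call: rate limited, then a generic blip, then success.
script = [RateLimited(2.0), TransientError("timeout"), "ok"]
def call():
    step = script.pop(0)
    if isinstance(step, Exception):
        raise step
    return step

delays = []
result = with_retries(call, max_attempts=5, base_delay=0.5, sleep=delays.append)
print(result, delays)   # ok [2.0, 1.0]
```

The second wait is 1.0 rather than 0.5 because the exponential schedule is indexed by attempt number, and the rate-limited attempt still counts as attempt one.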

Exercise 2: Grow the eval set from real failures

Three test cases barely scratch Atlas’s behavior. Turn every production failure into a permanent test: when a user reports a bad answer, add that request and a rubric to the eval set so it can never silently regress again.

Hint

Keep cases in a JSON file, one object per case with input and rubric. When you find a failure, append {"input": "<the request>", "rubric": "<what a good answer must do>"} and re-run evaluate. Over time this “regression suite” grows into the real definition of what Atlas must do — and each fix comes with a test that guards it forever. This is exactly how test suites grow in ordinary software.
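A minimal sketch of that file handling follows, assuming the cases live in a single JSON array. The file name eval_cases.json and the helpers load_cases and add_case are hypothetical; the demonstration writes to a temporary directory so it leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path

def load_cases(path):
    """Read the case list, or return an empty list if the file does not exist yet."""
    path = Path(path)
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else []

def add_case(path, request, rubric):
    """Append a failure as a permanent test case, skipping exact duplicate inputs."""
    cases = load_cases(path)
    if not any(c["input"] == request for c in cases):
        cases.append({"input": request, "rubric": rubric})
        Path(path).write_text(json.dumps(cases, indent=2), encoding="utf-8")
    return cases

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval_cases.json"
    add_case(path, "Best time for Arashiyama", "says morning")
    add_case(path, "Best time for Arashiyama", "says morning")   # duplicate, ignored
    cases = add_case(path, "Budget for 3 days", "gives a number")
    reloaded = load_cases(path)

print(len(cases))   # 2
```

The list returned by load_cases has the same shape as the cases argument evaluate expects, so it can be passed straight in.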

Exercise 3: Add a cost dashboard

You have a TokenBudget per run, but no view of spending across many runs. Log budget.spent after each request and summarize it, so you can see cost per turn, average, and outliers.

Hint

After each guarded_run, append budget.spent (and the request) to a log file or a small SQLite table. Then a tiny script can report the mean, the max, and the requests in the top 5% of spend — those outliers are exactly the confused runs worth investigating. Pair it with the pass rate from Exercise 2 and you have the two numbers that matter most in production: is it good, and what does it cost?
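The reporting half might look like the sketch below. It assumes the log is an in-memory list of dicts with request and spent keys (a file or SQLite table would feed the same function); summarize_spend and the sample data are invented for illustration.

```python
import statistics

def summarize_spend(log, top_fraction=0.05):
    """Report mean, max, and the highest-spending requests from a per-turn log."""
    if not log:
        return {"mean": 0, "max": 0, "outliers": []}
    spends = sorted((entry["spent"] for entry in log), reverse=True)
    n_top = max(1, int(len(spends) * top_fraction))   # always flag at least one run
    threshold = spends[n_top - 1]
    outliers = [entry for entry in log if entry["spent"] >= threshold]
    return {"mean": statistics.mean(spends), "max": spends[0], "outliers": outliers}

# Invented sample: nineteen ordinary turns and one confused, expensive run.
log = [{"request": f"trip {i}", "spent": 100} for i in range(19)]
log.append({"request": "confused run", "spent": 2100})

summary = summarize_spend(log)
print(summary["mean"], summary["max"])            # 200 2100
print([e["request"] for e in summary["outliers"]])   # ['confused run']
```

One outlier doubles the mean here, which is why the maximum and the top slice are worth reporting alongside the average.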


Summary

You took the full Atlas (the loop, tools, memory, planning, retrieval, and multi-agent coordination you built across seven modules) and wrapped it in Module 8’s four production layers without rewriting the engine. Guardrails (guarded_run) refuse out-of-scope requests before Atlas runs and repair an empty or half-finished answer after; the verified run refused a poem request and repaired a blank itinerary. Retries (with_retries) absorbed a flaky call with a deterministic [0.5, 1.0] backoff and capped a dead one, while a TokenBudget raised BudgetExceeded the moment spend crossed the cap. Evaluation (evaluate/judge) scored a three-case test set at a pass_rate of 0.667, the number you re-check on every change to catch regressions. Put together, one production-ready turn flows request → input guard → retried, budgeted agent run → output guard → delivered, with the eval harness scoring Atlas offline. It’s the same agent you built across Modules 2-7, now behind four layers of protection.

The retries/backoff and TokenBudget here are pure Python, verified deterministically; the guardrails (refuse and repair) and the LLM-as-judge evaluation were verified against an SDK-shaped mock with no ANTHROPIC_API_KEY. Illustrative model text is labeled “example — exact wording varies.”

Key Concepts

  • Wrap, don’t rewrite: the four pillars surround the existing run_agent; the engine is untouched.
  • Guardrails at the boundary: an input check refuses out-of-scope requests before spending; an output check repairs a bad answer after.
  • Resilient and bounded: with_retries survives transient blips with backoff and a cap; TokenBudget stops runaway spend.
  • Quality as a number: evaluate/judge produce a pass rate you re-run on every change to catch regressions.

Why This Matters

This is the line between an impressive demo and something you’d put in front of real users. Every failure the pillars prevent — the out-of-scope abuse, the 3 a.m. blip, the surprise bill, the silent regression — is the kind that turns a great prototype into a support ticket. And because each pillar is a thin layer added where the risk is, you can harden an agent you already have, incrementally, without a redesign. That’s the shape of shipping AI: build the capability, then wrap it in protection you can measure. Atlas now has both, and so do you.


Continue Building Your Skills

You did it — you finished Building AI Agents in Python. Look back at the whole journey. You started with a bare agent loop that could call the model and run a tool. You made those tools robust, so bad inputs became data instead of crashes. You gave Atlas memory — a bounded transcript and a store it recalls facts from across sessions. You taught it to plan and reflect on hard, multi-step requests. You grounded it with retrieval over real documents. You had it coordinate specialists as a team of agents. And in this final module you wrapped the whole thing in production reliability — guardrails, retries, a budget, and an evaluation harness. That arc, loop → tools → memory → planning → retrieval → multi-agent → production-ready, is the full anatomy of a real AI agent, and you built every layer yourself.

Atlas began as an empty loop and ends as an agent you could genuinely ship. The patterns you practiced here are not specific to travel planning — they are how serious agents are built, whatever the domain. Take them and build your own. Congratulations, and happy building.