Lesson 4 - Evaluating Agents
Welcome to Evaluating Agents
You’ve built guardrails, retries, and budgets — three pillars that protect a single run. This lesson builds the fourth, and it’s different in kind: it protects quality over time. Right now, “is the agent good?” is a feeling. You try a few requests, the answers look right, and you ship. Then you tweak a prompt to fix one case and silently break five others — and you find out from users, not from yourself. An evaluation harness closes that gap. It turns quality into a number: a fixed set of test cases, run through the agent, scored the same way every time, reported as a pass rate you watch change as you edit. You can’t improve what you don’t measure — so let’s start measuring.
By the end of this lesson, you will be able to:
- Explain why an evaluation harness turns agent quality into a trackable number
- Build an evaluate function that runs a test set and reports a pass rate
- Use an LLM-as-judge to score open-ended answers against a rubric
- Recognize when a deterministic check beats a model, and what a regression looks like
Let’s build the harness that makes “did that change help?” answerable.
Why a Pass Rate Beats a Vibe
A demo tells you the agent can produce a good answer. It says nothing about whether it does so reliably, and nothing at all about whether your next change made things better or worse. The fix is an evaluation harness: a repeatable process you run on every change.
The harness has four parts, and none of them is exotic:
- A test set: a fixed list of cases, each an input plus a rubric describing what a good answer needs. Build it from real failure cases you’ve actually seen, so it tests the things that break.
- A run: feed each input to the agent and collect its answer.
- A score: judge each answer against its rubric, PASS or FAIL.
- A pass rate: passed / total. One number. Track it over time.
That single number is what changes how you work. Before, you guessed: “I think that prompt tweak helped.” After, you know: “pass rate went from 0.667 to 0.833.” A drop after a change has a name — a regression — and the harness catches it before a user does.
The whole point is repeatability. Because the test set is fixed and the scoring is mechanical, two runs are comparable — so the difference between them means something.
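As a rough illustration of the last two parts, the sketch below turns a list of PASS/FAIL verdicts into a pass rate and compares two runs. The helper names `pass_rate` and `regressions`, the case names, and the verdicts are all hypothetical, made up for this example; they are not part of the lesson's harness.

```python
def pass_rate(verdicts):
    """Fraction of True verdicts, rounded to 3 places like the lesson's numbers."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return round(sum(verdicts) / len(verdicts), 3)


def regressions(before, after):
    """Return the case names that passed in `before` but fail in `after`.

    Both arguments are dicts mapping a case name to a bool verdict
    (a hypothetical shape, assumed only for this sketch).
    """
    return sorted(name for name, ok in before.items()
                  if ok and not after.get(name, False))


# Made-up verdicts for six cases, before and after a prompt tweak.
before = {"c1": True, "c2": True, "c3": True, "c4": True, "c5": False, "c6": False}
after = {"c1": True, "c2": True, "c3": True, "c4": False, "c5": True, "c6": True}

rate_before = pass_rate(list(before.values()))
rate_after = pass_rate(list(after.values()))
print(rate_before, "->", rate_after)                       # 0.667 -> 0.833
print("newly failing:", regressions(before, after))        # newly failing: ['c4']
```

In this made-up run the headline number improves, yet one case that used to pass now fails. Comparing per-case verdicts alongside the pass rate is one way to notice that kind of change.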
An LLM as the Judge
Scoring is the hard part. For a lookup with one right answer, you could compare strings — but agents produce open-ended prose, and there are a hundred ways to correctly plan a day in Kyoto. Exact-match scoring is useless here. So we use a model to judge: give it the question, the answer, and the rubric, and ask for a single verdict.
The judge is one focused model call that returns exactly PASS or FAIL:
    def say(client, prompt, *, system="", model="claude-haiku-4-5", max_tokens=256):
        # One prompt in, the reply text out.
        r = client.messages.create(model=model, max_tokens=max_tokens, system=system,
                                   messages=[{"role": "user", "content": prompt}])
        return "".join(b.text for b in r.content if b.type == "text").strip()

    def judge(client, question, answer, rubric):
        # Hand the model the question, answer, and rubric; normalize its reply to a bool.
        verdict = say(client, f"Question: {question}\nAnswer: {answer}\nRubric: {rubric}\n"
                              f"Reply with exactly PASS or FAIL.")
        return verdict.strip().upper().startswith("PASS")

say is the same thin wrapper around a Claude call you’ve used throughout this module: one prompt in, the text out. judge wraps it: it hands the model the three things it needs and normalizes the reply to a boolean. Asking for exactly PASS or FAIL keeps the output parseable. .strip() and .upper() tolerate stray whitespace and casing, and .startswith("PASS") tolerates trailing text such as "PASS." while treating any other reply as a failure.
The judge is powerful because it reads meaning, but that’s also its risk: the judge can be wrong. A vague rubric like “gives a good answer” invites the model to guess. A tight rubric — one specific, checkable thing — gives it a real target:
- ✅ “mentions a real Kyoto sight”
- ✅ “gives a number”
- ❌ “is helpful and accurate” (what does the judge even check?)
Keep rubrics tight, and spot-check the judge’s verdicts against your own reading now and then. A judge you never audit is a number you can’t trust.
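One way to make that spot-check concrete is to label a small sample by hand and measure how often the judge agrees with you. The sketch below assumes you already have both sets of verdicts as lists of booleans; the function names and the sample data are hypothetical.

```python
def judge_agreement(judge_verdicts, human_verdicts):
    """Fraction of cases where the judge's PASS/FAIL matches a human's."""
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be the same length")
    if not judge_verdicts:
        raise ValueError("need at least one verdict")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return round(matches / len(judge_verdicts), 3)


def disagreements(judge_verdicts, human_verdicts):
    """Indexes of the cases worth re-reading: judge and human disagree."""
    return [i for i, (j, h) in enumerate(zip(judge_verdicts, human_verdicts))
            if j != h]


# Made-up sample: True = PASS, False = FAIL.
judge_says = [True, True, False, True, False]
human_says = [True, False, False, True, False]

print(judge_agreement(judge_says, human_says))  # 0.8
print(disagreements(judge_says, human_says))    # [1]
```

The disagreeing cases are the useful output here: each one points at either a rubric that is too loose or a verdict the judge got wrong.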
Running the Whole Test Set
judge scores one answer. evaluate runs the loop: for every case, call the agent, judge the result, tally the passes, and report the rate.
    def evaluate(client, agent_fn, cases):
        results = []
        for c in cases:
            answer = agent_fn(c["input"])
            passed = judge(client, c["input"], answer, c["rubric"])
            results.append({"input": c["input"], "answer": answer, "passed": passed})
        n_pass = sum(r["passed"] for r in results)
        return {"pass_rate": round(n_pass / len(results), 3), "results": results}

agent_fn is any callable that takes an input and returns an answer: your real agent, or a fixed stand-in while you develop the harness. evaluate returns both the headline pass_rate and the per-case results, so you get the number and the receipts: exactly which cases failed and what they answered.
Here’s a run over three Kyoto cases, verified against an SDK-shaped mock (no live API — the harness orchestration is real; the judge’s verdicts are illustrative model output):
    cases = [
        {"input": "Plan a day in Kyoto", "rubric": "mentions a real Kyoto sight"},
        {"input": "Budget for 3 days", "rubric": "gives a number"},
        {"input": "Best time for Arashiyama", "rubric": "says morning"},
    ]

    def agent(q):  # a fixed stand-in agent under test
        return {"Plan a day in Kyoto": "Visit Fushimi Inari and Arashiyama.",
                "Budget for 3 days": "About $270 total.",
                "Best time for Arashiyama": "Anytime is fine."}[q]

    report = evaluate(client, agent, cases)
    for r in report["results"]:
        print(f" {'PASS' if r['passed'] else 'FAIL'} {r['input']}")
    print("pass_rate:", report["pass_rate"])

Output:

     PASS Plan a day in Kyoto
     PASS Budget for 3 days
     FAIL Best time for Arashiyama
    pass_rate: 0.667

Two of three cases pass, so the pass rate is 2 / 3 rounded to 0.667, and the harness made exactly three judge calls, one per case. The third fails on purpose: the rubric wants “says morning,” the answer says “Anytime is fine,” and a tight rubric catches the gap. That failing row is the whole value of the harness: it points straight at the case to fix.
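If you want to exercise the harness without a live API, a small stand-in client is enough. The sketch below is one possible shape for such a mock, not the one used to produce the run above: `FakeClient`, `FakeMessages`, and the canned verdict table are hypothetical, and the fake never reads the answer at all. It only replays a fixed verdict per rubric, so it tests the orchestration, not the judging. The say, judge, and evaluate functions are repeated from the lesson so the block runs on its own.

```python
from types import SimpleNamespace


class FakeMessages:
    """Stand-in for `client.messages`: replays canned verdicts, no network."""

    def __init__(self, verdict_by_rubric):
        self.verdict_by_rubric = verdict_by_rubric
        self.calls = 0  # lets us confirm one judge call per case

    def create(self, *, model, max_tokens, system, messages):
        self.calls += 1
        prompt = messages[0]["content"]
        # Pick the canned verdict whose rubric text appears in the prompt.
        for rubric, verdict in self.verdict_by_rubric.items():
            if f"Rubric: {rubric}\n" in prompt:
                text = verdict
                break
        else:
            text = "FAIL"  # unknown rubric: fail closed
        # Mimic the response shape that `say` reads: a list of text blocks.
        return SimpleNamespace(content=[SimpleNamespace(type="text", text=text)])


class FakeClient:
    def __init__(self, verdict_by_rubric):
        self.messages = FakeMessages(verdict_by_rubric)


# --- The lesson's functions, repeated so this block is self-contained ---
def say(client, prompt, *, system="", model="claude-haiku-4-5", max_tokens=256):
    r = client.messages.create(model=model, max_tokens=max_tokens, system=system,
                               messages=[{"role": "user", "content": prompt}])
    return "".join(b.text for b in r.content if b.type == "text").strip()


def judge(client, question, answer, rubric):
    verdict = say(client, f"Question: {question}\nAnswer: {answer}\nRubric: {rubric}\n"
                          f"Reply with exactly PASS or FAIL.")
    return verdict.strip().upper().startswith("PASS")


def evaluate(client, agent_fn, cases):
    results = []
    for c in cases:
        answer = agent_fn(c["input"])
        passed = judge(client, c["input"], answer, c["rubric"])
        results.append({"input": c["input"], "answer": answer, "passed": passed})
    n_pass = sum(r["passed"] for r in results)
    return {"pass_rate": round(n_pass / len(results), 3), "results": results}


cases = [
    {"input": "Plan a day in Kyoto", "rubric": "mentions a real Kyoto sight"},
    {"input": "Budget for 3 days", "rubric": "gives a number"},
    {"input": "Best time for Arashiyama", "rubric": "says morning"},
]
answers = {"Plan a day in Kyoto": "Visit Fushimi Inari and Arashiyama.",
           "Budget for 3 days": "About $270 total.",
           "Best time for Arashiyama": "Anytime is fine."}

# Canned, illustrative verdicts (not real model output).
client = FakeClient({"mentions a real Kyoto sight": "PASS",
                     "gives a number": "PASS",
                     "says morning": "FAIL"})
report = evaluate(client, answers.__getitem__, cases)
print(report["pass_rate"], client.messages.calls)  # 0.667 3
```

Swapping `FakeClient` for a real client changes nothing else in the harness, which is the reason to keep the mock shaped like the SDK call that say makes.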
Not every check needs a model
The LLM-as-judge shines on open-ended answers, but a model call costs tokens and can be wrong. Some checks don’t need one at all. “Is the answer under 200 words?”, “does it cite a source?”, “is it non-empty?” are deterministic assertions — plain Python if statements that are cheaper, faster, and perfectly reliable when they apply. Reach for the judge for the fuzzy, meaning-based criteria; use a deterministic check whenever the rubric is something you can test with code. A good harness mixes both.
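The checks named above could be sketched as plain functions, with a small dispatcher that prefers a deterministic check and falls back to a judge. Everything here is an assumption for illustration: the function names, the `check` key on a case, and the `judge_fn` parameter are not part of the lesson's harness, and treating any http(s) link as a citation is a deliberate simplification.

```python
import re


def is_non_empty(answer):
    return bool(answer.strip())


def under_word_limit(answer, limit=200):
    return len(answer.split()) < limit


def has_dollar_amount(answer):
    # Matches "$270", "$1,200", or "$12.50"; other currency formats are not covered.
    return re.search(r"\$\d[\d,]*(\.\d+)?", answer) is not None


def cites_url(answer):
    # Simplifying assumption: any http(s) link counts as a cited source.
    return re.search(r"https?://\S+", answer) is not None


def score(case, answer, judge_fn=None):
    """Use the case's deterministic `check` if it has one, else fall back to a judge.

    `check` is a hypothetical extra key on a case; `judge_fn` is any callable
    taking (question, answer, rubric) and returning a bool.
    """
    if "check" in case:
        return case["check"](answer)
    if judge_fn is None:
        raise ValueError("case has no deterministic check and no judge was given")
    return judge_fn(case["input"], answer, case["rubric"])


answer = "About $270 total."
print(is_non_empty(answer), under_word_limit(answer),
      has_dollar_amount(answer), cites_url(answer))  # True True True False

budget_case = {"input": "Budget for 3 days", "check": has_dollar_amount}
print(score(budget_case, answer))  # True
```

With this layout, a case pays for a model call only when it has no code-checkable criterion.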
Practice Exercises
Exercise 1: Write a checkable rubric
You want to evaluate a trip planner on the case "Suggest a 3-day Kyoto itinerary." Write a rubric the judge can actually check, then explain why "gives a good itinerary" is a poor one.
Hint
A good rubric names one specific, checkable thing — e.g. "mentions at least three distinct Kyoto sights" or "covers all three days". "gives a good itinerary" is unusable because “good” isn’t defined: the judge has nothing concrete to test, so its verdict is a guess and two runs may disagree. Tight rubrics make the judge — and the pass rate — trustworthy.
Exercise 2: Compute the pass rate
You run evaluate over 5 cases. The judge returns PASS, FAIL, PASS, PASS, PASS. What is pass_rate, how many judge calls were made, and if a change drops it to 0.6 next run, what do you call that?
Hint
Four of five pass, so pass_rate = round(4 / 5, 3) = 0.8. evaluate makes exactly one judge call per case — five cases, five calls. A drop from 0.8 to 0.6 after a change is a regression: the change broke a case that used to pass, and the harness caught it before a user did. That’s exactly why you run the harness on every change.
Exercise 3: Judge or deterministic check?
For each rubric, decide whether it needs the LLM-as-judge or a deterministic assertion: (a) “the answer is under 100 words”; (b) “the answer correctly summarizes the trip’s theme”; (c) “the answer includes a dollar amount.”
Hint
(a) deterministic — len(answer.split()) < 100, no model needed; (b) judge — “correctly summarizes the theme” is a meaning-based judgment only a model can make; (c) deterministic — a regex or "$" in answer check is cheaper and more reliable than asking a model. Rule of thumb: if you can test it with plain Python, do — save the judge for the fuzzy criteria where code can’t reach.
Summary
Quality you can only feel is quality you can’t improve — and it silently regresses the moment you change a prompt. An evaluation harness fixes that by turning quality into a number. You assemble a test set of cases (each an input plus a checkable rubric), run the agent on each, score every answer PASS or FAIL, and report a pass rate you track over time. The scorer here is an LLM-as-judge: judge hands a model the question, answer, and rubric and gets back exactly PASS or FAIL — powerful for open-ended answers where exact-match won’t work, but only as trustworthy as the rubric is tight, so keep rubrics specific and spot-check the verdicts. evaluate wraps judge in a loop over the whole test set and returns {pass_rate, results} — the headline number and the per-case receipts. On three Kyoto cases the judge returned PASS, PASS, FAIL, giving a pass rate of 2/3 = 0.667 from exactly three judge calls. And where a rubric is something code can check — a word count, a citation, a dollar amount — a deterministic assertion is cheaper and more reliable than a model. Build the test set from real failures, run it on every change, and a regression shows up as a falling pass rate before it shows up in a bug report.
Key Concepts
- Evaluation harness: a fixed test set + scoring + a pass rate, run on every change to turn quality into a number.
- LLM-as-judge: a model call returning PASS/FAIL against a rubric; great for open-ended answers, only as good as the rubric.
- Pass rate: passed / total; the one number you track to see whether a change helped or hurt.
- Deterministic check vs. regression: code-checkable rubrics skip the model; a falling pass rate after a change is a regression.
Why This Matters
Evaluation is what turns building an agent from guesswork into engineering. Without it, every change is a gamble and every regression is discovered by a user. With it, you ship changes with a number in hand: pass rate went up, or it went down, and you know before anyone else does. It also compounds — every real failure you add to the test set is a bug that can never silently come back. This is the discipline behind every agent you’d trust in production: not “it looked right when I tried it,” but “it passes the suite, and the suite grows every time something breaks.” Next, you’ll assemble everything from this module — guardrails, retries, budgets, and this eval harness — onto a production-ready Atlas.
Next Steps
Continue to Lesson 5 - Guided Project: Production-Ready Atlas
Assemble memory, retrieval, and multi-agent coordination behind guardrails, retries, budgets, and an eval harness.
Continue Building Your Skills
You can now measure agent quality instead of guessing at it: a fixed test set, an LLM-as-judge scoring each answer against a tight rubric, and a pass rate you track to catch regressions — with deterministic checks where a model isn’t needed. That’s the fourth and final pillar. Next, in the module capstone, you’ll bring all four together on a production-ready Atlas that assembles memory, retrieval, and multi-agent coordination behind guardrails, retries, budgets, and the eval harness you just built.