Lesson 6 - Evaluating and Improving Prompts
Welcome to Evaluating and Improving Prompts
Up to now you’ve judged prompts by reading the output and deciding whether it “looks good.” That works for one example and fails the moment you have ten. A prompt that nails a single message can quietly mislabel a third of the rest, and you’d never know unless you counted. Eyeballing doesn’t scale, and it doesn’t tell you whether your latest tweak actually helped or just felt better.
In this lesson you’ll switch from opinion to measurement. You’ll build a tiny test set with known answers, write a check that scores output automatically, and compare two prompt versions on identical inputs to see — in real numbers — which one wins.
By the end of this lesson, you will be able to:
- Build a small labelled test set for a checkable task
- Write an automated check (exact match, contains, valid-JSON, label-in-set) and compute an accuracy score
- Compare two prompt versions on the same test set and report which scores higher
- Use an LLM as a judge to rate open-ended output where exact match doesn’t apply
You’ll reuse the ask() helper from earlier lessons. Let’s begin.
Why You Must Measure, Not Eyeball
When you read one model output and think “good enough,” you’re testing a sample of size one. The next input might phrase things differently, sit on the boundary between two categories, or trip a wording you never considered. Without a way to count how often the prompt is right, every change you make is a guess dressed up as an improvement.
Measurement replaces that guesswork with a number. Once you can say “this prompt is right 7 times out of 8,” you can change one thing, re-run, and see the number move. That single discipline — measure, change one thing, re-measure — is what separates prompt engineering from prompt fiddling.
The catch is that you can only measure what you can check automatically. So the first move is to pick a task whose answers are verifiable.
A Checkable Task: Classification
The easiest output to grade is one with a small set of correct answers. Classification fits perfectly: every input has exactly one right label, drawn from a list you control, so a string comparison can tell right from wrong without a human in the loop.
Building a small test set
A test set is a list of inputs paired with the answer you already know is correct. Keep it tiny so it runs fast and cheap — eight examples is plenty to expose a weak prompt. Here we label short customer-support messages as complaint, question, or praise:
test_set = [
    ("My order still hasn't arrived and it's been three weeks.", "complaint"),
    ("Do you ship to Canada?", "question"),
    ("Absolutely love the new packaging, great job!", "praise"),
    ("The app crashes every time I open the settings page.", "complaint"),
    ("What time does your support line open?", "question"),
    ("Thanks so much, the refund came through quickly.", "praise"),
    ("Can I change the email address on my account?", "question"),
    ("I was charged twice for the same item and no one replied.", "complaint"),
]
LABELS = {"complaint", "question", "praise"}

The (message, expected_label) pairs are your ground truth, and LABELS is the set of answers you’ll accept. Notice the set is closed: anything outside complaint, question, or praise is automatically wrong, no matter how reasonable it sounds.
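Before scoring anything, it can help to sanity-check the test set itself, since a typo in an expected label would make every prompt look wrong on that row. The following is a minimal sketch, assuming the (message, expected_label) shape shown above. The helper name validate_test_set is hypothetical, and sample_set is a two-row subset of the lesson's data to keep the example short.

```python
from collections import Counter

LABELS = {"complaint", "question", "praise"}

# Two rows copied from the lesson's test set, to keep the sketch short.
sample_set = [
    ("Do you ship to Canada?", "question"),
    ("Absolutely love the new packaging, great job!", "praise"),
]

def validate_test_set(test_set, labels):
    """Return per-label counts; raise if an expected label is not in the allowed set."""
    for message, expected in test_set:
        if expected not in labels:
            raise ValueError(f"unknown label {expected!r} for message {message!r}")
    return Counter(expected for _, expected in test_set)

counts = validate_test_set(sample_set, LABELS)
print(dict(counts))
```

The per-label counts also show at a glance whether one label dominates the set, which would let a prompt that always answers with that label score deceptively well.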
Writing an Automated Check
An automated check is a function that turns a model’s raw output into a pass or fail. The right check depends on the task:
- Exact match — output must equal the expected string exactly. Strict, good for single labels.
- Contains — the expected answer must appear somewhere in the output. Forgiving of extra words.
- Valid JSON — json.loads() succeeds and the required keys exist. For structured output (Lesson 4).
- Label-in-set — output, normalized, must be one of an allowed set. Ideal for classification.
For our task we combine label-in-set with exact match: normalize the output, confirm it’s a real label, then check it equals the expected one.
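The four checks listed above could each be written as a small function that returns True or False. This is only a sketch: the function names and the normalize helper are assumptions for illustration, and the normalization mirrors the strip/lowercase/strip-period step used later in this lesson.

```python
import json

def normalize(raw):
    """Trim whitespace, lowercase, and drop leading/trailing periods."""
    return raw.strip().lower().strip(".")

def exact_match(raw, expected):
    """Pass only if the normalized output equals the expected string."""
    return normalize(raw) == expected

def contains(raw, expected):
    """Pass if the expected answer appears anywhere in the lowercased output."""
    return expected in raw.lower()

def valid_json(raw, required_keys):
    """Pass if the output parses as a JSON object that has every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)

def label_in_set(raw, labels):
    """Pass if the normalized output is one of the allowed labels."""
    return normalize(raw) in labels

print(exact_match("Praise.", "praise"))
print(contains("This looks like a complaint to me.", "complaint"))
print(valid_json('{"label": "praise"}', ["label"]))
print(label_in_set("shipping", {"complaint", "question", "praise"}))
```

Keeping each check as its own function makes it easy to swap one for another without touching the rest of the scoring code.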
Scoring the test set
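The accuracy of a prompt is just the fraction of the test set it gets right:
accuracy = (number of correct answers) / (number of test examples)
For example, 7 correct out of 8 gives 7 / 8 = 0.875.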
Here’s a score() function that runs a prompt over the whole test set and returns that fraction, along with a per-example breakdown so you can see where it failed:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment

def ask(prompt, max_tokens=50):
    return client.messages.create(
        model="claude-haiku-4-5", max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text

def score(prompt_template):
    correct = 0
    rows = []
    for message, expected in test_set:
        raw = ask(prompt_template.format(message=message))
        pred = raw.strip().lower().strip(".")  # normalize
        in_set = pred in LABELS  # label-in-set check
        hit = in_set and pred == expected  # exact match against truth
        correct += hit
        rows.append((expected, pred, in_set, hit))
    return correct / len(test_set), rows

Every example is graded the same way, so the grading itself is consistent and objective. (Model output can still vary a little from run to run, so expect small differences if you re-run.) Now we have a yardstick. Let’s measure two prompts with it.
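Because score() calls the API on every run, one way to test the grading logic on its own is to pass the model call in as a parameter and substitute a canned function. The sketch below assumes that approach. The names score_with and fake_ask are hypothetical, and the canned replies are invented to exercise the normalizer; they are not real model output.

```python
LABELS = {"complaint", "question", "praise"}

def score_with(ask_fn, prompt_template, test_set):
    """Same grading as score(), but the model call is injected as ask_fn."""
    rows = []
    for message, expected in test_set:
        raw = ask_fn(prompt_template.format(message=message))
        pred = raw.strip().lower().strip(".")   # normalize
        in_set = pred in LABELS                 # label-in-set check
        hit = in_set and pred == expected       # exact match against truth
        rows.append((expected, pred, in_set, hit))
    correct = sum(hit for _, _, _, hit in rows)
    return correct / len(test_set), rows

# Hypothetical stand-in for ask(): canned replies keyed on a phrase in the prompt.
def fake_ask(prompt):
    if "Canada" in prompt:
        return " Question.\n"   # messy casing and punctuation, still a valid label
    if "packaging" in prompt:
        return "positive"       # reasonable word, but outside LABELS
    return "complaint"

mini_set = [
    ("Do you ship to Canada?", "question"),
    ("Absolutely love the new packaging, great job!", "praise"),
    ("The app crashes every time I open the settings page.", "complaint"),
]

accuracy, rows = score_with(fake_ask, "Message: {message}", mini_set)
print(f"{accuracy:.3f}")  # 2 of 3 canned replies are graded correct
```

With the real helper, the same function would be called as score_with(ask, prompt_a, test_set).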
Comparing Two Prompt Versions
The point of a yardstick is comparison. We’ll write two prompts for the same task and run both through score() on the same test set. Version A is a vague zero-shot ask. Version B pins down the label set, demands a single word, and gives one example (few-shot, from Lesson 3):
prompt_a = """Classify this customer message in one word.
Message: {message}"""
prompt_b = """You are a support-desk classifier.
Classify the customer message into exactly one label: complaint, question, or praise.
Reply with only the single lowercase label and nothing else.
Example:
Message: "This is the best service I've ever used."
Label: praise
Message: {message}
Label:"""
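acc_a, rows_a = score(prompt_a)
acc_b, rows_b = score(prompt_b)

def report(name, title, acc, rows):
    # Print one line per test example, then the overall accuracy.
    print(f"=== Version {name} ({title}) ===")
    for exp, pred, in_set, hit in rows:
        mark = "OK" if hit else "XX"
        print(f"{mark} expected={exp} predicted={pred!r} in_set={in_set}")
    print(f"Accuracy {name}: {acc:.3f} ({sum(r[3] for r in rows)}/{len(rows)})")

report("A", "vague zero-shot", acc_a, rows_a)
report("B", "sharpened + 1 example", acc_b, rows_b)

Running both versions against the eight messages gives real, measured results: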
=== Version A (vague zero-shot) ===
OK expected=complaint predicted='complaint' in_set=True
XX expected=question predicted='shipping' in_set=False
OK expected=praise predicted='praise' in_set=True
XX expected=complaint predicted='bug' in_set=False
XX expected=question predicted='inquiry' in_set=False
XX expected=praise predicted='satisfied' in_set=False
XX expected=question predicted='account' in_set=False
XX expected=complaint predicted='billing' in_set=False
Accuracy A: 0.250 (2/8)
=== Version B (sharpened + 1 example) ===
OK expected=complaint predicted='complaint' in_set=True
OK expected=question predicted='question' in_set=True
OK expected=praise predicted='praise' in_set=True
OK expected=complaint predicted='complaint' in_set=True
OK expected=question predicted='question' in_set=True
OK expected=praise predicted='praise' in_set=True
OK expected=question predicted='question' in_set=True
OK expected=complaint predicted='complaint' in_set=True
Accuracy B: 1.000 (8/8)

Version A scored 0.250 (2 out of 8); Version B scored 1.000 (8 out of 8). The breakdown shows why: asked vaguely to classify “in one word,” the model answered with sensible-but-unusable words like shipping, bug, inquiry, and account. Those aren’t wrong descriptions; they’re just not labels in your set, so the automated check rejects them. Version B fixed the label set and gave one example, and the drift disappeared.
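The per-row breakdown can also be summarized in code, separating answers that fell outside the label set (format drift) from answers that were valid labels but the wrong one. A small sketch, where rows_a is copied by hand from the Version A output above and summarize_failures is a hypothetical helper name:

```python
# (expected, predicted, in_set, hit) rows copied from the Version A output above.
rows_a = [
    ("complaint", "complaint", True, True),
    ("question", "shipping", False, False),
    ("praise", "praise", True, True),
    ("complaint", "bug", False, False),
    ("question", "inquiry", False, False),
    ("praise", "satisfied", False, False),
    ("question", "account", False, False),
    ("complaint", "billing", False, False),
]

def summarize_failures(rows):
    """Split misses into out-of-set answers and in-set but wrong labels."""
    out_of_set = [pred for _, pred, in_set, _ in rows if not in_set]
    wrong_label = [(exp, pred) for exp, pred, in_set, hit in rows if in_set and not hit]
    return {"out_of_set": out_of_set, "wrong_label": wrong_label}

summary = summarize_failures(rows_a)
print(len(summary["out_of_set"]), len(summary["wrong_label"]))  # prints: 6 0
```

All six of Version A's misses are out-of-set answers and none are in-set misclassifications, which points at the output format rather than the model's understanding of the messages.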
The same test set, every time
The comparison is only fair because both prompts saw the identical eight messages and were graded by the identical check. If you change the test set or the check between runs, the numbers aren’t comparable. Lock those down, then vary only the prompt.
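One possible way to notice an accidental change to the test set is to record a short fingerprint of it next to each score and compare fingerprints before comparing numbers. This is an illustrative sketch, not part of the lesson's code: the fingerprint helper and the two small sets are assumptions, and the hash covers content and order, so reordering rows also changes it.

```python
import hashlib
import json

def fingerprint(test_set):
    """Short, stable hash of the test set's contents and order."""
    payload = json.dumps(test_set, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

base_set = [
    ("Do you ship to Canada?", "question"),
    ("Absolutely love the new packaging, great job!", "praise"),
]
# Same messages, but one expected label was edited.
edited_set = [
    ("Do you ship to Canada?", "complaint"),
    ("Absolutely love the new packaging, great job!", "praise"),
]

same = fingerprint(base_set) == fingerprint(list(base_set))
changed = fingerprint(base_set) != fingerprint(edited_set)
print(same, changed)
```

If two runs report different fingerprints, their accuracies were measured on different data and should not be compared directly.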
LLM-as-Judge for Open-Ended Output
Classification is easy to grade because the answer is a fixed label. But most real tasks — summaries, explanations, rewrites — have no single correct string. Exact match is useless there: two excellent summaries can share almost no words.
For those, you use an LLM-as-judge: a second model call whose only job is to rate an answer against written criteria and return a score. You’re not asking the judge to do the task — you’re asking it to grade it, which is a narrower, more reliable job.
A small real example
Suppose the task is “explain a database index to a non-technical manager in two sentences.” We write a judge prompt that scores any answer from 1 to 5 against a rubric, then run it on a deliberately weak answer and a good one:
def judge(answer, question, criteria):
    rubric = f"""You are a strict grader. Score the ANSWER from 1 to 5 on these criteria:
{criteria}
Question: {question}
Answer: {answer}
Reply with only a single integer 1-5 and nothing else."""
    return ask(rubric, max_tokens=5).strip()

q = "Explain what a database index is to a non-technical manager in two sentences."
crit = "1=off-topic or wrong, 5=correct, jargon-free, exactly two sentences."
weak = "An index is a B-tree data structure that reduces I/O by avoiding full table scans via logarithmic seeks."
good = ("A database index is like the index at the back of a book: it lets the system jump "
        "straight to the rows it needs instead of reading every page. The trade-off is that "
        "it takes extra storage and slightly slows down writes.")
print("weak answer score:", judge(weak, q, crit))
print("good answer score:", judge(good, q, crit))

weak answer score: 1
good answer score: 5

The judge gave the jargon-stuffed answer a 1 and the clear analogy a 5, exactly the ranking a human reviewer would give. Now you can swap in a new prompt, regenerate the answer, and watch the judge’s score move, all without reading every output yourself.
The judge is a model too
An LLM judge can be wrong or inconsistent, especially on close calls. Keep its rubric concrete and its output tightly constrained (here, a single integer), and spot-check its scores against your own judgment before trusting it at scale. Treat the judge as a fast first pass, not an infallible referee.
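Both pieces of advice above can be made mechanical: validate that the judge's reply really is a single integer in range, and measure how often its scores land close to your own. The sketch below is one way to do that. The helper names parse_score and agreement, the one-point tolerance, and the sample replies and human scores are all invented for illustration.

```python
import re

def parse_score(raw, low=1, high=5):
    """Return the judge's integer score, or None if the reply is not a lone in-range integer."""
    text = raw.strip()
    if not re.fullmatch(r"\d+", text):
        return None
    value = int(text)
    return value if low <= value <= high else None

def agreement(judge_scores, human_scores, tolerance=1):
    """Fraction of parsed items where judge and human differ by at most `tolerance` points."""
    pairs = [(j, h) for j, h in zip(judge_scores, human_scores) if j is not None]
    if not pairs:
        return 0.0
    close = sum(abs(j - h) <= tolerance for j, h in pairs)
    return close / len(pairs)

# Invented judge replies and human scores, purely to exercise the helpers.
judge_replies = ["5", " 1\n", "4/5", "3"]
human_scores = [5, 2, 4, 5]

parsed = [parse_score(reply) for reply in judge_replies]
print(parsed)
print(f"{agreement(parsed, human_scores):.3f}")
```

Replies that fail to parse are returned as None and left out of the agreement figure, so it is worth tracking how many there are as a separate number.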
Iterate: Change One Thing, Re-Measure
With a test set and a check in place, improving a prompt becomes a tight loop:
- Measure the current prompt to get a baseline number.
- Change exactly one thing — add an example, fix the label set, tighten the format line.
- Re-measure on the same test set.
- Keep the change if the number went up; revert it if it didn’t.
Changing one thing at a time is what makes the loop informative. If you rewrite three lines at once and the score moves, you don’t know which line did it. Version B is a case in point: it added a fixed label set, a single-word format instruction, and one example all at once, so the jump from 0.250 to 1.000 shows that the combination works but not how much each addition contributed. To find out, start from Version A, introduce the additions one at a time, and re-measure after each.
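The measure, change, re-measure, keep-or-revert loop can be sketched as a short function. Everything here is an assumption made for illustration: iterate and toy_score are hypothetical names, and toy_score is a made-up stand-in so the sketch runs without API calls, which means its numbers are not measurements. In real use, score_fn would be something like lambda p: score(p)[0].

```python
def iterate(baseline_prompt, candidates, score_fn):
    """Try one change at a time; keep it only if the score goes up."""
    best_prompt = baseline_prompt
    best_score = score_fn(best_prompt)
    history = [("baseline", best_score, True)]
    for name, change in candidates:
        trial = change(best_prompt)          # apply exactly one change to the current best
        trial_score = score_fn(trial)
        kept = trial_score > best_score      # no improvement means revert
        history.append((name, trial_score, kept))
        if kept:
            best_prompt, best_score = trial, trial_score
    return best_prompt, best_score, history

# Made-up scoring stand-in: rewards a label list and an example, nothing else.
def toy_score(prompt):
    value = 0.5
    if "complaint, question, or praise" in prompt:
        value += 0.25
    if "Example:" in prompt:
        value += 0.125
    return value

candidates = [
    ("fix the label set", lambda p: "Labels: complaint, question, or praise.\n" + p),
    ("add politeness", lambda p: "Please be careful.\n" + p),
    ("add one example", lambda p: "Example:\nMessage: Great service!\nLabel: praise\n" + p),
]

best_prompt, best_score, history = iterate(
    "Classify this customer message in one word.\nMessage: {message}",
    candidates,
    toy_score,
)
for name, value, kept in history:
    print(f"{name}: {value:.3f} {'kept' if kept else 'reverted'}")
```

Because each candidate is applied to the current best prompt and scored on its own, the history records which individual change moved the number and which did not.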
Practice Exercises
Exercise 1: Add a hard case and re-measure
Add two tricky messages to the test set — for example a question phrased as a complaint (“Why is your checkout so confusing?”) — with the label you think is correct. Re-run both Version A and Version B and report the new accuracies.
Hint
Append the new (message, label) pairs to test_set and call score() again — no other code changes. Boundary cases are where prompts break, so expect at least one version’s number to drop.
Exercise 2: Swap the check
Rewrite the automated check to use contains instead of label-in-set: count an answer correct if the expected label appears anywhere in the lowercased output. Re-score Version A and see how much its accuracy rises.
Hint
Replace the pred in LABELS and pred == expected logic with expected in raw.lower(). A looser check is more forgiving but can mask real problems — note which failures it now lets through.
Exercise 3: Build a judge for summaries
Pick a short paragraph and write a prompt that summarizes it in one sentence. Then write a judge() rubric scoring the summary 1–5 on “accurate and one sentence only,” and grade two different summary prompts with it.
Hint
Reuse the judge() function as-is; only the criteria string changes. Constrain the judge to output a single integer so you can compare the two prompts’ scores directly.
Summary
A prompt you only eyeball is a prompt you don’t actually understand. By building a tiny labelled test set, writing an automated check, and computing an accuracy score, you turn vague impressions into numbers you can act on. We measured two prompts on the same eight messages and saw a vague zero-shot score 0.250 while a sharpened, few-shot version scored 1.000 — a real, reproducible gap that told us exactly which prompt to ship. For open-ended output where exact match fails, an LLM-as-judge rates answers against written criteria, and the whole workflow runs on the disciplined loop of change one thing, re-measure.
Key Concepts
- Test set — inputs paired with known-correct answers; keep it tiny so it runs cheaply.
- Automated check — exact match, contains, valid-JSON, or label-in-set logic that grades output without a human.
- Accuracy — the fraction of the test set a prompt gets right; the score you optimize.
- A/B comparison — running two prompt versions on the identical test set and check to see which wins.
- LLM-as-judge — a second model call that scores open-ended output against a rubric.
- Change one thing, re-measure — the iteration loop that makes improvement interpretable.
Why This Matters
Every serious LLM application ships with an evaluation suite, because “it worked when I tried it” is not a guarantee anyone can build on. The moment you can put a number on a prompt, you can improve it deliberately, catch regressions before users do, and justify your choices with evidence instead of vibes — the difference between a demo and a product.
Next Steps
Continue to Lesson 7 - Reducing Hallucinations and Unsafe Output
Make the model say 'I don't know,' ground answers in provided context, and refuse unsafe requests.
Continue Building Your Skills
You can now measure a prompt instead of guessing about it — the skill that turns prompting into engineering. Next you’ll tackle the failure mode that measurement most often exposes: confident, wrong answers. You’ll learn to reduce hallucinations and keep the model from producing unsafe output, so the prompts you’ve learned to score also stay trustworthy.