Lesson 8 - Guided Project: A Reusable Prompt Toolkit

Welcome to the Guided Project

You’ll build a small, reusable prompt-toolkit module that bakes the anatomy and techniques from this module into clean, parameterized functions — so you never write a one-off prompt again. By the end you’ll have a single file, prompt_toolkit.py, with a PromptTemplate and three ready-made tools — summarize, classify, and extract — plus a tiny harness that measures how well classify actually works.

Nothing here is new theory. Each step snaps one idea from Lessons 1 through 7 into place: the six-element anatomy from Lesson 1, the format and constraint control from Lesson 2, the structured-output / tool approach from Lesson 4, the deterministic settings you want for data tasks from Lesson 5, and the objective evaluation from Lesson 6. The project is where those separate exercises become one module you import.

By the end of this lesson, you will be able to:

  • Turn the prompt anatomy into a single parameterized PromptTemplate function
  • Build summarize, classify, and extract tools on top of it, with extract returning a validated dict
  • Pin a fixed model and low temperature so the deterministic tools are repeatable
  • Measure a prompt’s quality with a small eval harness that prints an accuracy score

You only need the Anthropic SDK and your API key in an environment variable. Let’s build it.


Step 1: A PromptTemplate From the Anatomy

In Lesson 1 you learned the elements of a strong prompt — role, task, context, format, constraints. Writing them out by hand every time is exactly the one-off habit we want to kill. So the first piece of the toolkit is a function that fills that anatomy from parameters.

As in every lesson, the client reads your key from the ANTHROPIC_API_KEY environment variable — you never put the key in the code. If you haven’t set it for this terminal session yet:

export ANTHROPIC_API_KEY="your-key-here"
pip install anthropic

Now the top of prompt_toolkit.py. We create the client and write build_prompt, which assembles a structured prompt string from the pieces you supply:

import json
import anthropic

# The client reads your key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()


def build_prompt(role, task, context=None, output_format=None, constraints=None):
    """Fill the anatomy from Lesson 1 into one structured prompt string."""
    lines = [f"You are {role}.", task]
    if context:
        lines.append(f"\nContext:\n{context}")
    if output_format:
        lines.append(f"\nFormat: {output_format}")
    if constraints:
        lines.append(f"Constraints: {constraints}")
    return "\n".join(lines)

That small function is doing real work. Every tool we build next will call it instead of hand-writing a prompt, so the shape of every prompt in the toolkit is consistent: role first, then the task, then optional context, format, and constraints. Optional parameters mean a tool supplies only the elements it needs — classify cares about format and constraints, summarize adds an audience — but they all speak the same structured language.

Why a template, not a giant string

A PromptTemplate is just the bad → better → best habit from Lesson 1 turned into code. Each parameter is one element the model no longer has to guess. Putting them behind a function means you sharpen the anatomy once, here, and every tool inherits it — fix a weakness in build_prompt and all three tools improve at the same time.


Step 2: Three Ready-Made Tools

Now we build the tools learners will actually call: summarize, classify, and extract. Each one is a thin wrapper that fills build_prompt with the right elements for its job and sends the result.

We need one shared helper to make the call. It takes a prompt and returns the response, and it’s where the model and temperature get pinned (Step 3 explains why those values):

# A fixed model and low temperature make the deterministic tools repeatable.
MODEL = "claude-haiku-4-5"
TEMPERATURE = 0.0


def run(prompt, max_tokens=400, tools=None, tool_choice=None):
    """One low-temperature call on the fixed model. Returns the response object."""
    kwargs = dict(
        model=MODEL,
        max_tokens=max_tokens,
        temperature=TEMPERATURE,
        messages=[{"role": "user", "content": prompt}],
    )
    if tools:
        kwargs["tools"] = tools
    if tool_choice:
        kwargs["tool_choice"] = tool_choice
    return client.messages.create(**kwargs)

summarize asks for a fixed number of sentences and lets you name the audience — the Format and Audience levers from Lesson 1, parameterized:

def summarize(text, sentences=2, audience="a busy reader"):
    """Tool 1: summarize text to a fixed number of sentences."""
    prompt = build_prompt(
        role="a precise summarizer",
        task=f"Summarize the text below in exactly {sentences} sentences.",
        context=text,
        output_format="plain text, no preamble, no bullet points",
        constraints=f"write for {audience}; state only what the text says",
    )
    return run(prompt).content[0].text.strip()

classify assigns exactly one label from a set you pass in. The prompt asks for a single label word, and then — because a model can still drift — we snap the answer back onto the allowed set, so the function is guaranteed to return a valid label:

def classify(text, labels):
    """Tool 2: assign exactly one label from a fixed set."""
    label_list = ", ".join(labels)
    prompt = build_prompt(
        role="a strict text classifier",
        task=f"Classify the message below into exactly one of these labels: {label_list}.",
        context=text,
        output_format="respond with the single label word only, nothing else",
        constraints="choose the closest label; never invent a new one",
    )
    answer = run(prompt, max_tokens=20).content[0].text.strip().lower()
    # Snap the answer back onto the allowed set so the result is always valid.
    for label in labels:
        if label.lower() in answer:
            return label
    return labels[0]

extract is the one that returns structured data. Rather than parsing free-form text, it uses the tool / structured-output approach from Lesson 4: you pass a JSON schema describing the fields you want, force the model to call a record tool, and read the validated dict straight off the tool_use block:

def extract(text, schema):
    """Tool 3: return a validated dict using the structured/tool approach from Lesson 4."""
    tool = {
        "name": "record",
        "description": "Record the fields extracted from the text.",
        "input_schema": schema,
    }
    resp = run(
        build_prompt(
            role="a careful information extractor",
            task="Extract the requested fields from the text and call the record tool.",
            context=text,
            constraints="if a field is not present, use null",
        ),
        tools=[tool],
        tool_choice={"type": "tool", "name": "record"},
    )
    for block in resp.content:
        if block.type == "tool_use":
            return block.input
    return {}

Three different tasks, one shared anatomy underneath. Notice how little code each tool is — the work went into build_prompt and run, and the tools just describe what they need.

Why extract uses a tool, not text parsing

You could ask the model to “reply with JSON” and run json.loads on the text — but then you’re back to brittle string parsing, the very thing Lesson 4 warned against. Forcing a tool call with tool_choice makes the model fill your schema, and the SDK hands you a real Python dict. The schema is the contract; the tool call enforces it.


Step 3: Low Temperature and a Fixed Model

Look back at run — it pins MODEL = "claude-haiku-4-5" and TEMPERATURE = 0.0, and every tool goes through it. That’s deliberate. These are deterministic tools: classify should give the same label for the same ticket, and extract should pull the same fields every run. You learned in Lesson 5 that data tasks want repeatability, not creativity.

Temperature controls randomness. A high temperature is what you want for brainstorming or varied copy; for classification and extraction it’s a liability — the same input could land on a different label tomorrow. Setting temperature=0.0 tells the model to take its most-likely path every time, which is exactly what a classifier should do.

Pinning the model matters for the same reason. If the model silently changed under you, your classify accuracy could move without a single line of your code changing. Naming claude-haiku-4-5 in one constant means the whole toolkit is reproducible and cheap — Haiku is the inexpensive model, and these are short, high-volume calls where cost adds up.

One place to change, everything updated

Because MODEL and TEMPERATURE live in run, every tool inherits them. Want to test the toolkit on a more capable model, or loosen summarize to a higher temperature for a livelier voice? You change it in one spot, not in three separate functions. That’s the payoff of routing every call through a single helper.


Step 4: A Tiny Eval Harness

A prompt that looks right isn’t the same as a prompt that is right. Lesson 6 made the case for measuring quality objectively instead of eyeballing a few outputs. So the last piece of the toolkit is a small harness that runs classify over a handful of labeled examples and reports its accuracy.

The idea is simple: you supply (text, expected_label) pairs, the harness classifies each one, counts how many match, and returns the fraction:

def evaluate_classifier(examples, labels):
    """Tool 4: tiny eval harness — accuracy of classify() over labeled examples (Lesson 6)."""
    correct = 0
    for text, expected in examples:
        predicted = classify(text, labels)
        correct += predicted == expected
    return correct / len(examples)

That’s the whole harness. It turns a vague feeling — “the classifier seems okay” — into a number you can track. Change the prompt, rerun the harness, and you’ll see whether accuracy went up or down. This is the difference between tweaking prompts by vibes and engineering them: a labeled set and a score.

A few examples is enough to start

You don’t need thousands of labeled examples to benefit. Even five well-chosen ones catch the obvious failures — a label the model keeps confusing, a phrasing it mishandles. Start small, watch the score, and add examples for the cases that actually break. The harness is a habit, not a research project.


Step 5: Run the Toolkit Over a Batch

The final step is to use the toolkit the way you actually would: import the functions, point them at some real text, and collect structured results over a small batch. The if __name__ == "__main__": block at the bottom of the file does exactly this — it runs each tool once and prints what comes back.

For extract, we loop over a small inline batch of messages and build a list of validated dicts — the shape you’d hand to a database or a DataFrame. With that, the toolkit is complete; here is the whole program.


The Complete prompt_toolkit.py

import json
import anthropic

# The client reads your key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

# A fixed model and low temperature make the deterministic tools repeatable.
MODEL = "claude-haiku-4-5"
TEMPERATURE = 0.0


def build_prompt(role, task, context=None, output_format=None, constraints=None):
    """Fill the anatomy from Lesson 1 into one structured prompt string."""
    lines = [f"You are {role}.", task]
    if context:
        lines.append(f"\nContext:\n{context}")
    if output_format:
        lines.append(f"\nFormat: {output_format}")
    if constraints:
        lines.append(f"Constraints: {constraints}")
    return "\n".join(lines)


def run(prompt, max_tokens=400, tools=None, tool_choice=None):
    """One low-temperature call on the fixed model. Returns the response object."""
    kwargs = dict(
        model=MODEL,
        max_tokens=max_tokens,
        temperature=TEMPERATURE,
        messages=[{"role": "user", "content": prompt}],
    )
    if tools:
        kwargs["tools"] = tools
    if tool_choice:
        kwargs["tool_choice"] = tool_choice
    return client.messages.create(**kwargs)


def summarize(text, sentences=2, audience="a busy reader"):
    """Tool 1: summarize text to a fixed number of sentences."""
    prompt = build_prompt(
        role="a precise summarizer",
        task=f"Summarize the text below in exactly {sentences} sentences.",
        context=text,
        output_format="plain text, no preamble, no bullet points",
        constraints=f"write for {audience}; state only what the text says",
    )
    return run(prompt).content[0].text.strip()


def classify(text, labels):
    """Tool 2: assign exactly one label from a fixed set."""
    label_list = ", ".join(labels)
    prompt = build_prompt(
        role="a strict text classifier",
        task=f"Classify the message below into exactly one of these labels: {label_list}.",
        context=text,
        output_format="respond with the single label word only, nothing else",
        constraints="choose the closest label; never invent a new one",
    )
    answer = run(prompt, max_tokens=20).content[0].text.strip().lower()
    # Snap the answer back onto the allowed set so the result is always valid.
    for label in labels:
        if label.lower() in answer:
            return label
    return labels[0]


def extract(text, schema):
    """Tool 3: return a validated dict using the structured/tool approach from Lesson 4."""
    tool = {
        "name": "record",
        "description": "Record the fields extracted from the text.",
        "input_schema": schema,
    }
    resp = run(
        build_prompt(
            role="a careful information extractor",
            task="Extract the requested fields from the text and call the record tool.",
            context=text,
            constraints="if a field is not present, use null",
        ),
        tools=[tool],
        tool_choice={"type": "tool", "name": "record"},
    )
    for block in resp.content:
        if block.type == "tool_use":
            return block.input
    return {}


def evaluate_classifier(examples, labels):
    """Tool 4: tiny eval harness — accuracy of classify() over labeled examples (Lesson 6)."""
    correct = 0
    for text, expected in examples:
        predicted = classify(text, labels)
        correct += predicted == expected
    return correct / len(examples)


if __name__ == "__main__":
    # --- A summary -------------------------------------------------------
    article = (
        "The city council approved a plan to add 40 kilometers of protected "
        "bike lanes over the next three years. Supporters say the network will "
        "cut commute times and reduce traffic injuries, while some business "
        "owners worry about the temporary loss of street parking during "
        "construction. The first segment breaks ground in the spring."
    )
    print("SUMMARY")
    print(summarize(article, sentences=2, audience="a city newsletter"))
    print()

    # --- Classification + eval accuracy ----------------------------------
    LABELS = ["billing", "bug", "feature_request", "praise"]
    eval_set = [
        ("I was charged twice for last month's subscription.", "billing"),
        ("The export button does nothing when I click it.", "bug"),
        ("It would be great if you added a dark mode.", "feature_request"),
        ("Honestly the new dashboard is fantastic, thank you!", "praise"),
        ("Why is my invoice higher than the plan price?", "billing"),
    ]
    print("CLASSIFY")
    sample = "The app crashes every time I open the settings page."
    print(f"{sample!r} -> {classify(sample, LABELS)}")
    accuracy = evaluate_classifier(eval_set, LABELS)
    print(f"Eval accuracy on {len(eval_set)} examples: {accuracy:.0%}")
    print()

    # --- Extraction over an inline batch ---------------------------------
    schema = {
        "type": "object",
        "properties": {
            "company": {"type": ["string", "null"]},
            "role": {"type": ["string", "null"]},
            "location": {"type": ["string", "null"]},
            "remote": {"type": ["boolean", "null"]},
        },
        "required": ["company", "role", "location", "remote"],
    }
    messages = [
        "Acme Corp is hiring a Data Engineer in Berlin; the role is fully remote.",
        "Join Nimbus Labs as a Frontend Developer based in our Toronto office.",
    ]
    print("EXTRACT (batch)")
    results = [extract(m, schema) for m in messages]
    print(json.dumps(results, indent=2))

Run it with python prompt_toolkit.py. Here is the real output — a summary, a classification with the accuracy score, and the extracted dicts:

SUMMARY
The city council has approved a plan to build 40 kilometers of protected bike lanes over three years, with construction beginning this spring. While supporters believe the bike lanes will reduce commute times and traffic injuries, some business owners have expressed concerns about temporary parking loss during construction.

CLASSIFY
'The app crashes every time I open the settings page.' -> bug
Eval accuracy on 5 examples: 100%

EXTRACT (batch)
[
  {
    "company": "Acme Corp",
    "role": "Data Engineer",
    "location": "Berlin",
    "remote": true
  },
  {
    "company": "Nimbus Labs",
    "role": "Frontend Developer",
    "location": "Toronto",
    "remote": false
  }
]

Read down the page and the whole module is visible at once. summarize obeyed the two-sentence, plain-text format and wrote for the audience we named. classify correctly tagged the crash report as a bug, and the eval harness scored a clean 100% on the five labeled examples — a real number, not a hope. And extract returned two validated dicts, each pulling company, role, location, and the boolean remote straight from the sentence — note it correctly read “fully remote” as true and the “Toronto office” message as false. You just turned a module’s worth of techniques into a toolkit you can import.


Take It Further

The toolkit works; now make it yours. Each of these is a small change to the file you just wrote:

  • Add a rewrite tool. Build a fourth tool, rewrite(text, tone), on top of build_prompt — same pattern as summarize, but with a tone parameter ("formal", "friendly", "concise"). You’ll see how little code a new tool takes once the template exists.
  • Grow the eval set. Add more (text, label) pairs to eval_set, including a few deliberately tricky ones, and watch the accuracy score move. Then try weakening the classify prompt and rerun — the harness will show you the damage.
  • Return the raw answer too. Have classify return both the snapped label and the model’s raw reply, so you can spot when the snapping logic is doing real work versus when the model already answered cleanly.
  • Batch from a file. Replace the inline messages list with rows read from a CSV, run extract over each, and write the list of dicts back out — the first step toward a real extraction pipeline.

Summary

You assembled a reusable prompt toolkit from the parts you built across this module. A PromptTemplate (build_prompt) turned the Lesson 1 anatomy into a parameterized function; three tools — summarize, classify, and extract — were thin wrappers on top of it, with extract returning a validated dict via the Lesson 4 tool approach. A fixed model and low temperature made the deterministic tools repeatable, a small eval harness scored classify objectively, and a final batch run produced structured results you could feed straight into a pipeline.

Key Concepts

  • PromptTemplate — one function that fills the prompt anatomy (role, task, context, format, constraints) from parameters, so every tool shares the same shape.
  • Parameterized toolssummarize, classify, and extract each describe only the elements they need and reuse the template underneath.
  • Validated extraction — forcing a tool call with a JSON schema returns a real dict, not brittle free-form text.
  • Deterministic settings — a fixed model plus temperature=0.0 make classification and extraction reproducible.
  • Eval harness — a labeled set and an accuracy score turn prompt-tweaking into measurable engineering.

Why This Matters

Almost every production LLM feature you’ll build is a small set of parameterized tools like these, wired to a fixed model and watched by an eval. Summarizers, classifiers, and extractors are the workhorses of real data products — and now you have a clean, testable module that produces them, instead of scattering one-off prompts across your codebase. You finished the module by turning everything you learned into a tool you can actually reuse.


Next Steps

Continue to Module 3 - Tool Use & Function Calling (next in the course)

Move from shaping prompts to giving the model real tools — function calling, schemas, and the agent loop.

Back to Module Overview

Return to the Prompt Engineering module overview


Continue Building Your Skills

You’ve built a real, reusable toolkit — a PromptTemplate and three tools you can import into your next project, scored by an eval you can trust. That’s a genuine milestone: every technique in this module now lives in one file you understand top to bottom. Next you’ll go a step further and hand the model actual tools to call, turning prompts into programs. Onward to tool use and function calling.