Lesson 6 - Controlling the Output
Welcome to Controlling the Output
So far you have been sending prompts and taking whatever comes back. That works, but it leaves real money and reliability on the table. A response that runs three paragraphs long when you needed one word, or that varies wildly between runs when you needed a stable answer, is a response you did not control. This lesson hands you the knobs.
Every control here comes back to one idea from Lesson 1: the model writes one token at a time by sampling from a probability distribution. You cannot change what the model knows, but you can change how it samples and how long it is allowed to keep going. That is the whole game — you steer the sampling and the budget.
By the end of this lesson, you will be able to:
- Cap a response’s length with max_tokens and detect when the cap was hit
- End generation early at a marker you choose with stop_sequences
- Use temperature to trade consistency for variety, and know when to use each
- Choose between Haiku, Sonnet, and Opus based on cost and difficulty
You’ll need the working client from Lesson 2 (anthropic.Anthropic() reading your ANTHROPIC_API_KEY). Let’s begin.
Capping Length with max_tokens
max_tokens is a required parameter that you have set in every call so far, and it does exactly one thing: it puts a hard ceiling on how many tokens the model is allowed to generate. It is a budget, not a target. The model writes until it naturally finishes or until it hits the ceiling, whichever comes first.
This matters for two reasons. First, output tokens cost money, so an unbounded response is an unbounded bill. Second, you often know roughly how long a good answer should be: a classification label is a few tokens, a tweet is a few dozen, a summary is a few hundred. Setting max_tokens close to that keeps responses tight and predictable.
Detecting when the cap was hit
The catch is that a ceiling can cut the model off mid-sentence. You need to know when that happened, and the response tells you through stop_reason. Set the cap deliberately low and watch what comes back:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=16,
    messages=[{"role": "user", "content": "List ten facts about the ocean, one per line."}],
)
print(response.content[0].text)
print("stop_reason:", response.stop_reason)
print("output_tokens:", response.usage.output_tokens)

Output:

# Ten Ocean Facts
1. The ocean covers approximately 71% of Earth
stop_reason: max_tokens
output_tokens: 16

The model wanted to write ten facts but only got 16 tokens, so it stopped in the middle of the first one. The stop_reason is "max_tokens": that is your signal that the answer is truncated, not complete. The output_tokens count confirms it spent exactly the budget you gave it.
Compare that to a response that finishes on its own:
response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=100,
    messages=[{"role": "user", "content": "Reply with exactly: Hello."}],
)
print(repr(response.content[0].text))
print("stop_reason:", response.stop_reason)

Output:

'Hello.'
stop_reason: end_turn

Here stop_reason is "end_turn": the model decided it was done well before the 100-token ceiling. The rule of thumb: give yourself enough headroom that a complete answer finishes naturally with end_turn, and treat max_tokens as a safety cap. If you keep seeing "max_tokens" in production, your ceiling is too low and answers are being cut off.
Always read stop_reason
Whenever output length matters, check response.stop_reason before trusting the text. A "max_tokens" stop means you are looking at a fragment. Either raise the ceiling or, if you genuinely want short output, shorten the prompt’s request so the model finishes inside the budget.
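One way to make that check hard to forget is to route every response through a small guard. The sketch below is only an illustration: complete_text, TruncatedResponseError, and fake_response are hypothetical names, not part of the anthropic SDK, and the guard relies only on the stop_reason and content[0].text fields shown above. A stand-in object replaces the real response so the logic runs without an API call.

```python
from types import SimpleNamespace


class TruncatedResponseError(RuntimeError):
    """Raised when generation stopped at the max_tokens ceiling."""


def complete_text(response):
    """Return the response text, or raise if the token cap cut it off.

    Hypothetical helper, not part of the anthropic SDK.
    """
    if response.stop_reason == "max_tokens":
        raise TruncatedResponseError("response hit max_tokens; the text is a fragment")
    return response.content[0].text


def fake_response(text, stop_reason):
    """Build a stand-in with the same fields as the responses above."""
    return SimpleNamespace(content=[SimpleNamespace(text=text)], stop_reason=stop_reason)


# A complete answer passes straight through.
print(complete_text(fake_response("Hello.", "end_turn")))

# A truncated answer is refused instead of being silently trusted.
try:
    complete_text(fake_response("1. The ocean covers", "max_tokens"))
except TruncatedResponseError as error:
    print("truncated:", error)
```

With a real client, the same function could be applied to the object returned by client.messages.create. Raising is one possible policy; logging the event and retrying with a higher cap would be another.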
Stopping Early with stop_sequences
max_tokens stops the model after a count of tokens. Sometimes you want it to stop at a marker instead — a specific string that means “you’ve written enough.” That is what stop_sequences is for. You pass a list of strings; the moment the model is about to emit any of them, generation halts.
This is the workhorse behind structured output. If you tell the model to write one item and then a separator, you can use that separator as a stop sequence and guarantee you only get the first item back. Here it is cutting a numbered list off cleanly at the fourth item:
response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=200,
    temperature=0.0,
    stop_sequences=["4."],
    messages=[{"role": "user", "content": "List the planets nearest the Sun as a numbered list: 1. ... 2. ... and so on."}],
)
print(repr(response.content[0].text))
print("stop_reason:", response.stop_reason)
print("stop_sequence:", response.stop_sequence)

Output:

'# Planets Nearest the Sun\n\n1. Mercury\n2. Venus\n3. Earth\n'
stop_reason: stop_sequence
stop_sequence: 4.

Two things to notice. First, stop_reason is now "stop_sequence", and a new field, stop_sequence, tells you which of your strings triggered the stop, which is handy when you pass several. Second, the stop string itself is not included in the text. The model wrote up to the point where it would have typed "4.", then stopped, so you get items 1 through 3 and nothing more. You did not have to trim the output yourself.
One constraint to know
A stop sequence must contain at least one non-whitespace character. A pure-whitespace marker like a blank line ("\n\n") is rejected by the API:
try:
    client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=50,
        stop_sequences=["\n\n"],
        messages=[{"role": "user", "content": "Hi"}],
    )
except anthropic.BadRequestError as error:
    print(error.body["error"]["message"])

Output:

stop_sequences: each stop sequence must contain non-whitespace

So pick a marker that has real characters in it: "END", "---", "</answer>", or a "4." like above. If you want to stop at a blank line, anchor it to something visible, such as "\n\nNEXT", rather than whitespace alone.
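If you would rather catch a bad marker before the request is sent, a local pre-flight check is one option. The sketch below is a minimal example under an assumption: check_stop_sequences is a hypothetical helper, and Python's str.strip() is assumed to be a close enough match for the API's own definition of whitespace.

```python
def check_stop_sequences(stop_sequences):
    """Raise ValueError for any marker with no visible characters.

    Hypothetical pre-flight helper; str.strip() is assumed to approximate
    the API's own whitespace rule.
    """
    for marker in stop_sequences:
        if not marker.strip():
            raise ValueError(f"stop sequence {marker!r} is whitespace-only")
    return list(stop_sequences)


# Markers with visible characters pass, even if they also contain newlines.
print(check_stop_sequences(["END", "\n\nNEXT"]))

# A blank-line marker is rejected locally, before any API call is made.
try:
    check_stop_sequences(["\n\n"])
except ValueError as error:
    print("rejected:", error)
```

The API remains the authority here; the local check only saves a round trip for the obvious cases.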
max_tokens still applies
stop_sequences and max_tokens work together. Generation ends at whichever comes first: the marker, or the token ceiling. Always set a sane max_tokens even when you use a stop sequence, so a model that never produces the marker can’t run on forever.
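The "whichever comes first" rule can be made concrete with a toy simulation. The sketch below is not how the API is implemented: simulate_generation is a hypothetical function, and the "tokens" are hand-split strings rather than real tokenizer output. It only mirrors the behavior described above: the marker is excluded from the text, and the token ceiling still applies.

```python
def simulate_generation(tokens, max_tokens, stop_sequences=()):
    """Toy model of when generation ends.

    Returns (text, stop_reason, stop_sequence) using the same three stop
    reasons as the real responses in this lesson.
    """
    text = ""
    for token in tokens[:max_tokens]:
        text += token
        for marker in stop_sequences:
            cut = text.find(marker)
            if cut != -1:
                # The marker itself is dropped from the returned text.
                return text[:cut], "stop_sequence", marker
    # No marker was seen: either the budget ran out or the text ended on its own.
    reason = "max_tokens" if len(tokens) > max_tokens else "end_turn"
    return text, reason, None


# Hand-split pieces standing in for real tokens.
tokens = ["1. ", "Mercury\n", "2. ", "Venus\n", "3. ", "Earth\n", "4. ", "Mars\n"]

print(simulate_generation(tokens, max_tokens=200, stop_sequences=["4."]))
print(simulate_generation(tokens, max_tokens=3, stop_sequences=["4."]))
print(simulate_generation(tokens, max_tokens=200))
```

The first call ends at the marker, the second runs out of budget before the marker appears, and the third finishes naturally.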
Tuning Randomness with temperature
Recall the picture from Lesson 1: at each step the model has a probability distribution over the next token, and it samples one. Temperature is the knob that reshapes that distribution before the sample is drawn. On claude-haiku-4-5 it ranges from 0.0 to 1.0.
At low temperature (near 0.0), the distribution is sharpened so the single most-likely token almost always wins. Output becomes focused, stable, and close to repeatable. At higher temperature (toward 1.0), the distribution is flattened so lower-probability tokens get a real chance, and the output gets more varied and creative. You are choosing, directly, how much run-to-run variation you are willing to accept.
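The sharpening and flattening can be seen with a few lines of arithmetic. The sketch below uses the common softmax-with-temperature formulation (divide the scores by the temperature, then normalize). The scores are made up, and this lesson does not specify how Claude applies temperature internally, so treat it as an illustration of the idea, not the actual implementation. Because the formula divides by the temperature, it cannot take 0.0 directly; that setting is best read as the limit in which the top token takes nearly all of the probability.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities, reshaped by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 for this formula")
    scaled = [score / temperature for score in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


# Made-up scores for three candidate next tokens.
scores = [2.0, 1.0, 0.5]

for temperature in (0.2, 0.5, 1.0):
    probabilities = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probabilities])
```

As the temperature falls, the first candidate's share grows and the others shrink; as it rises toward 1.0, the lower-scored candidates get a larger share.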
Low temperature for consistency
When you need the same answer every time — extraction, classification, parsing, anything a downstream system depends on — pull the temperature to 0.0. Here the same extraction prompt runs twice and returns an identical, clean result:
prompt = (
    "Extract the city from this sentence and reply with only the city name: "
    "'We landed in Lisbon late on Tuesday night.'"
)
for _ in range(2):
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=20,
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}],
    )
    print(repr(response.content[0].text))

Output:

'Lisbon'
'Lisbon'

Both runs return 'Lisbon'. Low temperature does not guarantee byte-for-byte identical output (sampling can still wobble at the margins), but it makes the model strongly prefer its top choice, which is exactly what structured tasks want.
Higher temperature for variety
Now flip it. For a creative task — a tagline, a brainstorm, alternative phrasings — you want the model to explore. The same prompt at temperature=0.0 collapses to one answer; at temperature=1.0 it spreads out:
prompt = "Write a single short tagline for a coffee shop on Mars."

print("temperature 0.0:")
for _ in range(2):
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=40,
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}],
    )
    print(" ", repr(response.content[0].text))

print("temperature 1.0:")
for _ in range(2):
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=40,
        temperature=1.0,
        messages=[{"role": "user", "content": prompt}],
    )
    print(" ", repr(response.content[0].text))

Output:

temperature 0.0:
  '"Brew where no one has brewed before."'
  '"Brew where no one has brewed before."'
temperature 1.0:
  '"Wake up on Mars—we\'re out of this world."'
  '"Out of This World Caffeine"'

At 0.0 the two runs are identical: the model keeps picking its single favorite tagline. At 1.0 it gives two genuinely different lines. That contrast is temperature: same model, same prompt, different amount of sampling randomness. Use low values when you need one dependable answer, higher values when you want a spread of options to choose from.
Start low, raise only as needed
A good default for most engineering tasks is a low temperature — you usually want predictable behavior, and you can always rerun a creative task at a higher setting. Raising temperature is a deliberate choice to trade reliability for variety, not a free upgrade.
Choosing the Right Model
The last control is the biggest one: which model you call. Every parameter above tunes a single response, but picking the model sets the ceiling on capability and the floor on cost for every response. Claude comes in a small family, and they trade capability against price.
| Model | Strength | Input / Output per 1M tokens | Reach for it when |
|---|---|---|---|
| claude-haiku-4-5 | Fastest, cheapest | $1 / $5 | High volume, simple-to-moderate tasks: extraction, classification, routing, short replies |
| claude-sonnet-4-6 | Balanced | $3 / $15 | Harder reasoning, longer writing, multi-step work where Haiku starts to slip |
| claude-opus-4-8 | Most capable | $5 / $25 | The hardest problems: deep reasoning, tricky code, work where a wrong answer is expensive |
Read the pricing carefully: it is per million tokens, split into input and output, and output is the pricier side. Moving from Haiku to Sonnet roughly triples your cost; Sonnet to Opus adds more on top. This course uses claude-haiku-4-5 throughout precisely because it is cheap and fast enough to experiment freely while you learn — and it handles the great majority of everyday tasks well.
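To see what the per-million pricing means for a concrete workload, the arithmetic can be sketched as below. The prices are copied from the table above; estimate_cost is a hypothetical helper, and the workload (500 input tokens and 200 output tokens per call) is an illustrative assumption, not a measurement.

```python
# Prices in USD per 1M tokens, copied from the table above: (input, output).
PRICES_PER_MILLION = {
    "claude-haiku-4-5": (1.00, 5.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-8": (5.00, 25.00),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Return the estimated USD cost of one call (hypothetical helper)."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Illustrative workload: 1,000 calls of 500 input and 200 output tokens each.
for model in PRICES_PER_MILLION:
    per_call = estimate_cost(model, input_tokens=500, output_tokens=200)
    print(f"{model}: ${per_call * 1000:.2f} per 1,000 calls")
```

For the real token counts of a call, response.usage.output_tokens appears earlier in this lesson; an input-side counterpart would be needed to complete the estimate.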
When to upgrade
The right move is almost always to start on Haiku and upgrade only when you have evidence you need to. Switching models is a one-line change (just the model string), so trying another tier takes almost no effort:
for model in ["claude-haiku-4-5", "claude-sonnet-4-6"]:
    response = client.messages.create(
        model=model,
        max_tokens=200,
        messages=[{"role": "user", "content": "Explain why the sky is blue in two sentences."}],
    )
    print(model, "->", response.content[0].text)

Run the same prompt across two models, compare the answers, and let the results decide. Upgrade when Haiku consistently gets a task wrong, drops important detail, or fumbles multi-step reasoning. Stay on Haiku for everything it already handles, because otherwise you are paying for capability you do not use.
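"Evidence" is easier to act on when it is a number. One possible approach, sketched below, is to score each model against a handful of prompts with known answers. Everything here is hypothetical: accuracy and fake_ask are illustrative names, the test cases are made up, and exact-match scoring is an assumption that only suits short answers such as labels or extracted fields.

```python
def accuracy(ask, cases):
    """Fraction of cases where the reply matches the expected answer.

    `ask` is any callable that maps a prompt string to a reply string,
    for example a thin wrapper around a call to one specific model.
    """
    if not cases:
        raise ValueError("cases must not be empty")
    correct = sum(
        1
        for prompt, expected in cases
        if ask(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(cases)


def fake_ask(prompt):
    """Stand-in for a real model call so the sketch runs offline."""
    return "Lisbon" if "Lisbon" in prompt else "unknown"


cases = [
    ("City in: 'We landed in Lisbon late on Tuesday night.'", "Lisbon"),
    ("City in: 'The train to Porto was delayed.'", "Porto"),
]
print(accuracy(fake_ask, cases))
```

Running the same cases through one wrapper per model gives a score for each tier, which is a firmer basis for upgrading than a single side-by-side read.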
Sonnet and Opus have extra reasoning controls
The larger models add capabilities Haiku does not expose — an extended (or “adaptive”) thinking mode where the model reasons at length before answering, and an “effort” control that trades more compute for better answers on hard problems. These belong to the Sonnet and Opus tiers; do not assume they apply to claude-haiku-4-5. They are worth exploring later when a task genuinely needs that extra horsepower.
Putting It Together
These four controls are not separate tricks — they are the dials on one console. A real call often sets several at once: pick a model for the difficulty, cap max_tokens to your budget, set temperature for the consistency you need, and add a stop_sequence to keep structured output clean.
response = client.messages.create(
    model="claude-haiku-4-5",   # cheap model for a simple task
    max_tokens=30,              # short answer, hard ceiling
    temperature=0.0,            # near-deterministic: this is extraction
    stop_sequences=["END"],     # stop at a visible, non-whitespace marker
    messages=[{
        "role": "user",
        "content": "Reply with only the country, then the word END: 'I flew home to Canada.'",
    }],
)
print(repr(response.content[0].text))
print("stop_reason:", response.stop_reason)

Output:

'Canada\n'
stop_reason: stop_sequence

Every dial here answers a question you actually had: how capable, how long, how varied, where to stop. That is the difference between taking whatever the model gives you and getting the output you asked for.
Keep stop markers visible
Notice the stop sequence is "END", not a newline. A whitespace-only marker like "\n" or "\n\n" is rejected by the API, so anchor your stop on a real, non-whitespace string. For a plain “one line only” cut where you don’t want any marker, lean on a low max_tokens instead.
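When the same combination of dials recurs across many calls, one option is to keep it in a single place and override only what differs. The sketch below assumes hypothetical names (EXTRACTION_DEFAULTS, extraction_request); the keys themselves are the parameters used throughout this lesson.

```python
# Shared settings for short extraction tasks (values taken from this lesson).
EXTRACTION_DEFAULTS = {
    "model": "claude-haiku-4-5",
    "max_tokens": 30,
    "temperature": 0.0,
    "stop_sequences": ["END"],
}


def extraction_request(prompt, **overrides):
    """Build keyword arguments for client.messages.create (hypothetical helper)."""
    request = {**EXTRACTION_DEFAULTS, **overrides}
    # Copy the list so callers cannot mutate the shared defaults by accident.
    request["stop_sequences"] = list(request["stop_sequences"])
    request["messages"] = [{"role": "user", "content": prompt}]
    return request


request = extraction_request(
    "Reply with only the country, then the word END: 'I flew home to Canada.'",
    max_tokens=20,  # override one dial, keep the rest
)
print(request["model"], request["max_tokens"], request["stop_sequences"])
```

The resulting dictionary can be passed to the client as client.messages.create(**request).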
Practice Exercises
Exercise 1: Catch a truncated answer
Ask the model for a five-paragraph essay with max_tokens=20. Print response.stop_reason and response.usage.output_tokens. Then raise max_tokens until the same prompt finishes on its own. What does stop_reason change to?
Hint
A capped response reports stop_reason == "max_tokens" and spends exactly your budget. A complete one reports "end_turn" and usually uses fewer tokens than the ceiling. Raise the cap in steps until the reason flips.
Exercise 2: Cut a list with a stop sequence
Prompt the model for a numbered list of ten programming languages, but pass stop_sequences=["6."] so you only get the first five. Confirm the text stops before item six and that stop_reason is "stop_sequence". Then try stop_sequences=[" "] and read the error.
Hint
The marker string is not included in the returned text, so you get items 1–5 cleanly. A whitespace-only stop sequence such as " " raises a BadRequestError — every stop sequence must contain a non-whitespace character.
Exercise 3: See temperature in action
Run a creative prompt — “Name a fictional planet” — three times at temperature=0.0, then three times at temperature=1.0. Count how many unique answers you get at each setting.
Hint
At 0.0 you should see the same name repeat (the model keeps choosing its top token); at 1.0 you should see more variety. This is the Lesson 1 sampling picture made concrete: low temperature sharpens the distribution, high temperature flattens it.
Summary
You now control what comes back from the API. max_tokens caps the length and signals truncation through stop_reason == "max_tokens". stop_sequences ends generation at a marker you choose (the marker is excluded from the output, and it must contain non-whitespace). temperature trades consistency for variety — low for extraction and classification, higher for creative work — by reshaping the sampling distribution from Lesson 1. And choosing the model sets your capability-versus-cost trade-off, with Haiku as the cheap, fast default you upgrade only when the evidence demands it. Across all four: you steer the sampling and the budget.
Key Concepts
- max_tokens: a hard ceiling on generated tokens; a "max_tokens" stop reason means the answer was truncated.
- stop_reason: why generation ended ("end_turn", "max_tokens", or "stop_sequence").
- stop_sequences: strings that halt generation when the model would emit them; excluded from output and must be non-whitespace.
- temperature: 0.0–1.0 on Haiku; low sharpens the distribution for consistent output, high flattens it for variety.
- Model choice: Haiku (cheap/fast), Sonnet (balanced), Opus (most capable); start cheap and upgrade on evidence.
Why This Matters
Production LLM systems live or die on control. The same call that costs a fraction of a cent and returns a clean label can, mishandled, return a truncated ramble at ten times the price. Knowing which dial to turn — and reading stop_reason to confirm what actually happened — is what separates a demo from something you can ship and trust.
Next Steps
Continue to Lesson 7 - Guided Project: Command-Line Assistant
Put every control to work building a small, real command-line assistant on top of Claude.
Back to Module Overview
Return to the Working with LLMs in Python module overview
Continue Building Your Skills
You have gone from sending prompts to shaping responses — capping length, stopping early, dialing randomness, and choosing the right model for the cost. Next you’ll combine all of it into a working command-line assistant, the first thing in this course that feels like a real tool rather than a single call.