Lesson 5 - Tokens, Cost, and Streaming
Welcome to Tokens, Cost, and Streaming
You have made a call to Claude, picked your model, and read the response. Now comes the practical question every engineer eventually asks: how much did that cost, and how do I keep an answer from sitting in silence while it generates? Both answers live in the same place — the token.
This lesson is about money and feel. You will read the exact token counts off a real response, turn them into a precise dollar figure, learn to estimate a request before you send it, and then make long answers print word-by-word instead of all at once. None of it is hard, and the headline is reassuring: at these prices, learning by experimenting is almost free.
By the end of this lesson, you will be able to:
- Read input_tokens and output_tokens from a real response
- Compute the cost of any call from the per-million-token pricing
- Use count_tokens to budget a request before you send it
- Stream a response so its output appears progressively
You only need the anthropic client from earlier lessons. Let’s begin.
Reading Token Usage
Every response from the Messages API carries a usage object that reports exactly how many tokens the call consumed. There are two numbers that matter: the input tokens (your prompt — system text, messages, everything you sent) and the output tokens (what the model wrote back). Here is a real call that reads both:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=200,
    messages=[
        {"role": "user", "content": "In two sentences, explain what an API is to a new programmer."}
    ],
)

print(response.content[0].text)
print("---")
print("input_tokens:", response.usage.input_tokens)
print("output_tokens:", response.usage.output_tokens)

Output:
# What is an API?
An API (Application Programming Interface) is a set of rules and tools that allows different software programs to communicate with each other and share data or functionality. Think of it like a restaurant menu—you don't need to know how the kitchen works; you just order what you want, and the kitchen delivers the result through a standard process.
---
input_tokens: 21
output_tokens: 77

That short prompt cost 21 input tokens, and the answer came to 77 output tokens. These are not estimates: they are the exact counts the model billed, returned on every single response. You never have to guess; you read them.
Why input and output are billed differently
Input and output tokens have separate prices, and output is almost always the more expensive of the two. The reason is mechanical: the model processes your whole prompt in one pass, but it has to generate each output token one at a time (that next-token prediction from Lesson 1). Generation is the costly part, so a token the model writes costs several times more than a token it merely reads.
Computing the Cost of a Call
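Now turn those two numbers into money. The model you are using, claude-haiku-4-5, is priced per million tokens: $1.00 per 1M input tokens and $5.00 per 1M output tokens. To find the cost of a call, divide each token count by one million and multiply by its price, where in and out are the input and output token counts:
cost = in/10^6 × input_price + out/10^6 × output_price
Plug in the real numbers from the call above, 21 input tokens and 77 output tokens:
cost = 21/10^6 × $1.00 + 77/10^6 × $5.00 = $0.000021 + $0.000385 = $0.000406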
Here is the same calculation in code, which is how you’d track spend in a real program:
INPUT_PRICE = 1.00 # dollars per 1M input tokens
OUTPUT_PRICE = 5.00 # dollars per 1M output tokens
in_tokens = 21
out_tokens = 77
cost = (in_tokens / 1_000_000) * INPUT_PRICE + (out_tokens / 1_000_000) * OUTPUT_PRICE
print(f"Cost of this call: ${cost:.6f}")
print(f"That is {cost * 100:.4f} cents.")

Output:
Cost of this call: $0.000406
That is 0.0406 cents.

Read that result again: the entire round trip cost four hundredths of a cent. You could run that exact call more than two thousand times for a single dollar. This is the most important practical fact in the whole course: experimenting with Claude is essentially free at this scale. When you are learning, the cost of running one more test is rounding error. Don’t ration your curiosity; run the code, change the prompt, run it again.
The numbers only start to matter when you scale up: thousands of calls, long documents stuffed into the context window, or a bigger model. For comparison, Sonnet 4.6 costs $3 / $15 per 1M tokens and Opus 4.8 costs $5 / $25 — three to five times Haiku’s price. That gap is exactly why Lesson 3 had you reach for Haiku on simple, high-volume work and save the expensive models for the genuinely hard problems.
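To make the comparison concrete, here is a minimal sketch that prices the same 21-input, 77-output call at the three rates quoted above. The PRICES table and the call_cost helper are illustrative names, not part of the SDK, and the figures are simply the ones quoted in this lesson, so check current pricing before relying on them.

```python
# Prices in dollars per 1M tokens, copied from this lesson (they may change).
# The dictionary keys are labels for illustration, not API model IDs.
PRICES = {
    "Haiku 4.5": (1.00, 5.00),
    "Sonnet 4.6": (3.00, 15.00),
    "Opus 4.8": (5.00, 25.00),
}


def call_cost(in_tokens, out_tokens, input_price, output_price):
    """Return the dollar cost of one call given per-1M-token prices."""
    return (in_tokens / 1_000_000) * input_price + (out_tokens / 1_000_000) * output_price


for label, (input_price, output_price) in PRICES.items():
    cost = call_cost(21, 77, input_price, output_price)
    print(f"{label}: ${cost:.6f} per call, about {1 / cost:,.0f} calls per dollar")
```

Even at the highest rate in the table, a single short call stays well under a cent; the gap only becomes visible once you multiply by volume.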
Budgeting Before You Send
The usage object tells you what a call cost after it happened. But sometimes you need to know the size before you send — to check a long document fits the context window, or to estimate the bill for a batch job you haven’t run yet. The SDK has a dedicated endpoint for exactly this: count_tokens. It tokenizes your messages and returns the input count without ever generating a response, so it is free and instant.
count = client.messages.count_tokens(
    model="claude-haiku-4-5",
    messages=[
        {"role": "user", "content": "Summarize the plot of Romeo and Juliet in one paragraph."}
    ],
)

print("count_tokens input_tokens:", count.input_tokens)

Output:
count_tokens input_tokens: 22

That prompt is 22 tokens before anything is sent. Knowing this up front lets you do real budgeting: multiply the input count by the input price, add a reasonable estimate for the output you expect, and you have a cost forecast for a job before you commit to running it. It also catches the failure case early: if count_tokens reports a number close to the context window limit, you know to trim the input before the call fails, not after.
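The budgeting step described above can be sketched as plain arithmetic. Both helpers below, forecast_batch_cost and fits_context, are hypothetical names invented for this example; the 200,000-token context window is an assumed figure, and the expected output length is a guess you supply yourself.

```python
def forecast_batch_cost(input_tokens, expected_output_tokens, num_calls,
                        input_price=1.00, output_price=5.00):
    """Estimate the dollar cost of a batch before running it.

    input_tokens would come from count_tokens; expected_output_tokens is
    your own guess (max_tokens gives a worst-case ceiling).
    Prices are dollars per 1M tokens, defaulting to the Haiku rates above.
    """
    per_call = (input_tokens / 1_000_000) * input_price \
        + (expected_output_tokens / 1_000_000) * output_price
    return per_call * num_calls


def fits_context(input_tokens, max_tokens, context_window=200_000):
    """Check that the prompt plus the requested output stays inside the window.

    context_window=200_000 is an assumed value; look up the real limit
    for the model you use.
    """
    return input_tokens + max_tokens <= context_window


# 10,000 prompts of 22 tokens each, guessing about 150 output tokens per reply.
estimate = forecast_batch_cost(22, 150, 10_000)
print(f"Forecast for the batch: ${estimate:.2f}")
print("Fits in context:", fits_context(22, 300))
```

Passing max_tokens instead of a guess for expected_output_tokens turns the forecast into an upper bound, which is often the safer number to plan around.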
Count tokens, never guess from word count
It is tempting to approximate token counts from word or character counts, but that approximation drifts badly on code, numbers, JSON, and non-English text, which are the exact inputs where budgeting matters most. count_tokens uses the model’s real tokenizer, so its result is far more reliable than any heuristic, although the count billed on the actual call can still differ by a small amount. When the size actually matters, measure it.
Streaming: Output as It Is Generated
There is one more thing the cost numbers hint at. A long answer is a lot of output tokens, and the model generates them one at a time. With a normal create call, your program waits in silence until every token is finished, then receives the whole response at once. For a short reply that’s fine. For a long one — a multi-paragraph explanation, a generated report — the user stares at a blank screen for several seconds, which feels broken even when it isn’t.
Streaming fixes the feel. Instead of waiting for the complete message, you receive each chunk of text the instant the model produces it and print it immediately. The answer appears progressively, the same way it does in a chat interface. Use client.messages.stream(...) as a context manager and loop over stream.text_stream:
import anthropic
client = anthropic.Anthropic()
with client.messages.stream(
    model="claude-haiku-4-5",
    max_tokens=300,
    messages=[
        {"role": "user", "content": "Write a short, encouraging 4-line poem about learning to code."}
    ],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    print()

Two details make the printing work. end="" stops print from adding a newline after every chunk, so the words flow together instead of stacking one per line. flush=True forces Python to display each chunk the moment it arrives rather than holding it in a buffer; without it, the progressive effect is lost. Here is a captured run:
# Learning to Code
Each bug you fix builds strength within your mind,
The logic flows like light through clouded haze,
With every line of code, new doors you'll find,
You're crafting worlds—embrace the learning phase.

On screen, that text did not appear all at once. It streamed in, a few words at a time, exactly as the model wrote it.
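If you also want the full text in a variable after printing it live, one option is to collect the chunks as they arrive. The sketch below uses a hypothetical helper, print_and_collect, that accepts any iterable of strings. In a real call you would pass stream.text_stream; here a hard-coded list with a small artificial delay stands in for the network so the snippet runs without an API key.

```python
import time


def print_and_collect(chunks, delay=0.0):
    """Print each text chunk as it arrives and return the joined text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)  # no newline, no buffering
        parts.append(chunk)
        if delay:
            time.sleep(delay)  # only simulates generation time in this demo
    print()  # finish the line once the stream ends
    return "".join(parts)


# Stand-in for stream.text_stream; these chunks are made up for illustration.
fake_chunks = ["Each ", "bug ", "you ", "fix ", "builds ", "strength."]
full_text = print_and_collect(fake_chunks, delay=0.05)
```

Try running it once with flush=True removed and the output redirected to a file or pipe: depending on the environment, the chunks may then appear in one burst, which is the buffering the lesson warns about.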
Getting the Final Message and Usage
Streaming the text doesn’t mean giving up the structured response. After the loop finishes, call stream.get_final_message() to get the same complete message object a normal create call would have returned — including the usage numbers, so you can still compute cost on a streamed call:
with client.messages.stream(
    model="claude-haiku-4-5",
    max_tokens=300,
    messages=[
        {"role": "user", "content": "Write a short, encouraging 4-line poem about learning to code."}
    ],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    print()
    final = stream.get_final_message()

print("---")
print("stop_reason:", final.stop_reason)
print("input_tokens:", final.usage.input_tokens)
print("output_tokens:", final.usage.output_tokens)

Output:
# Learning to Code
Each bug you fix builds strength within your mind,
The logic flows like light through clouded haze,
With every line of code, new doors you'll find,
You're crafting worlds—embrace the learning phase.
---
stop_reason: end_turn
input_tokens: 22
output_tokens: 55

The poem streamed in live, and afterward you still recovered everything: a clean end_turn stop reason and the full usage of 22 input tokens and 55 output tokens. You lose nothing by streaming.
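Because streamed and non-streamed calls report usage in the same shape, one small accumulator can track spend across both. The SpendTracker class below is a hypothetical helper written for this lesson, not something the SDK provides, and SimpleNamespace objects stand in for response.usage and final.usage so the sketch runs offline.

```python
from types import SimpleNamespace


class SpendTracker:
    """Accumulate token usage and dollar cost across several calls."""

    def __init__(self, input_price=1.00, output_price=5.00):
        self.input_price = input_price    # dollars per 1M input tokens
        self.output_price = output_price  # dollars per 1M output tokens
        self.input_tokens = 0
        self.output_tokens = 0

    def add(self, usage):
        """Record one call; usage needs input_tokens and output_tokens attributes."""
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens

    @property
    def cost(self):
        """Total dollars spent so far at the configured prices."""
        return (self.input_tokens / 1_000_000) * self.input_price \
            + (self.output_tokens / 1_000_000) * self.output_price


tracker = SpendTracker()
# Stand-ins for response.usage and final.usage, using the counts from this lesson.
tracker.add(SimpleNamespace(input_tokens=21, output_tokens=77))
tracker.add(SimpleNamespace(input_tokens=22, output_tokens=55))
print(f"Total so far: ${tracker.cost:.6f}")
```

In a real program you would call tracker.add(response.usage) after a create call and tracker.add(final.usage) after a streamed one.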
When to Stream
Streaming is a user-experience choice, not a correctness one — the final text is identical either way. Reach for it when the output is long enough that the wait would feel like a hang, and when a human is watching the result appear:
- Stream for chat interfaces, long generated text, and anything a person reads as it arrives.
- Don’t bother for short answers, or for back-end calls where you only use the final value (extraction, classification) and no one is watching the tokens land.
You’ll feel the difference the first time you stream a paragraph: the same response that felt frozen now feels alive.
Practice Exercises
Exercise 1: Cost of your own call
Make any create call you like, read response.usage.input_tokens and response.usage.output_tokens, and compute the cost with the formula from this lesson. Then ask yourself: how many times could you run that call for one dollar?
Hint
Reuse the cost code: cost = (in_tokens / 1_000_000) * 1.00 + (out_tokens / 1_000_000) * 5.00. To get the runs-per-dollar, compute 1 / cost. Most short calls land in the thousands.
Exercise 2: Budget before sending
Pick a long string — a few paragraphs of text, or a chunk of code — and pass it to count_tokens before making any real call. Compare the count for prose against the count for the same number of characters of code. Which is denser in tokens?
Hint
Call client.messages.count_tokens(model="claude-haiku-4-5", messages=[{"role": "user", "content": your_string}]) and read .input_tokens. Code and symbol-heavy text usually tokenize into more tokens per character than plain English prose.
Exercise 3: Stream a long answer
Ask the model for something genuinely long — “Explain how the internet works in five paragraphs” — first with a normal create call, then with stream. Notice how different the two feel even though the text is the same. Then recover the usage from the streamed call.
Hint
For the streamed version, wrap the call in with client.messages.stream(...) as stream: and loop over stream.text_stream with print(text, end="", flush=True). After the loop, stream.get_final_message().usage gives you the token counts.
Summary
Every Claude response reports its token usage in a usage object: input_tokens for your prompt and output_tokens for the reply. Multiply each by its per-million price — $1.00 input and $5.00 output for claude-haiku-4-5 — and you have the exact cost of the call, which at this scale is a tiny fraction of a cent. count_tokens lets you measure a request’s size before you send it, so you can budget and check context limits up front. And client.messages.stream(...) delivers the output progressively as it is generated, with get_final_message() still handing back the full response and usage afterward.
Key Concepts
- usage.input_tokens / usage.output_tokens: the exact token counts a call consumed, returned on every response.
- Cost formula: in/10^6 × input_price + out/10^6 × output_price; output tokens cost more because the model generates them one at a time.
- count_tokens: counts a request’s input tokens before sending, with no generation and no cost.
- Streaming: receiving and printing each chunk of output as it is produced, via stream.text_stream, with get_final_message() for the complete result and usage.
Why This Matters
Cost and latency are the two constraints every production LLM application lives inside. Reading usage turns “this feels expensive” into a measured dollar figure you can optimize against; count_tokens lets you forecast a bill before you pay it; and streaming is the difference between an app that feels responsive and one that feels stuck. These are the habits that separate a demo from something people actually use.
Next Steps
Continue to Lesson 6 - Controlling the Output
Use temperature, system prompts, and stop sequences to steer how Claude responds.
Continue Building Your Skills
You can now answer the two questions that follow every real LLM project: what does this cost, and why does it feel slow? You read usage off the response, turn it into dollars, budget before you send, and stream long answers so they land progressively. Next you’ll take control of what the model says — steering its tone, format, and consistency with the model’s output controls.