Lesson 3 - Managing Cost and Tokens
Welcome to Managing Cost and Tokens
In a prototype, cost is invisible. You make a handful of calls, the bill is a fraction of a cent, and the model can be the biggest one available because it never matters. In production it matters constantly: the same call running thousands of times a day, against longer prompts and bigger models, turns that fraction of a cent into a budget line someone has to defend. The good news is that LLM cost is not mysterious — it is arithmetic on two numbers you already have. This lesson shows you how to read those numbers, turn them into dollars, predict cost before you spend it, and pull the levers that actually move the bill.
You already saw the usage object in Lesson 1. Here you put it to work.
By the end of this lesson, you will be able to:
- Read the usage object on every response and convert it to a real dollar figure
- Use count_tokens to estimate cost and enforce length limits before sending a request
- Pick the biggest cost levers: model choice, capping max_tokens, trimming context, and caching
- Accumulate and log spend across calls so you can see it in production
Let’s start with the number that turns tokens into money.
From Usage to Dollars
Every response carries a usage object with two counts: input_tokens (everything you sent — prompt, system message, context, history) and output_tokens (everything the model generated). Pricing is quoted per million tokens, and input and output are priced separately. For claude-haiku-4-5 it’s $1 per million input tokens and $5 per million output tokens. The cost of any single call is just those counts multiplied by those rates:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment

PRICES = {
    "claude-haiku-4-5": (1.0, 5.0),    # (input $/M, output $/M)
    "claude-sonnet-4-6": (3.0, 15.0),
    "claude-opus-4-8": (5.0, 25.0),
}

def cost_from_usage(model, usage):
    # Convert token counts to dollars using the model's per-million rates.
    price_in, price_out = PRICES[model]
    return usage.input_tokens / 1_000_000 * price_in + usage.output_tokens / 1_000_000 * price_out

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=60,
    messages=[{"role": "user", "content": "Summarize what an API is in one sentence."}],
)
print("reply:", response.content[0].text)
print("input tokens:", response.usage.input_tokens)
print("output tokens:", response.usage.output_tokens)
print("cost: $%.8f" % cost_from_usage("claude-haiku-4-5", response.usage))

reply: An API (Application Programming Interface) is a set of rules and tools that allows different software applications to communicate with and request data or services from each other.
input tokens: 18
output tokens: 34
cost: $0.00018800

Eighteen tokens in, thirty-four out, for about two hundredths of a cent. Notice that the 34 output tokens contributed $0.00017000 of that total while the 18 input tokens contributed only $0.00001800: the output costs roughly ten times as much as the input here, even though it’s only about twice as many tokens. That is the whole game in miniature: output is the expensive half, and the dictionary holding per-model prices is the only thing you need to extend this to Sonnet or Opus.
Output tokens cost about 5x input — so cap them
For every Claude model, output tokens are priced at roughly 5x the input rate (Haiku is $1 vs $5, Sonnet $3 vs $15, Opus $5 vs $25). A long, rambling answer can cost more than a long prompt. That makes max_tokens a genuine cost control, not just a formatting nicety — pick the cheapest model that passes your evaluation, and cap its output to the length you actually need.
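To make the callout concrete, here is a small sketch of how the max_tokens ceiling bounds the output side of a call. It uses only the output rates quoted in this lesson; the helper name max_output_cost is made up for illustration, and the caps shown are arbitrary examples.

```python
# Output $/M rates as quoted in this lesson; update them if pricing changes.
OUTPUT_PRICE_PER_M = {
    "claude-haiku-4-5": 5.0,
    "claude-sonnet-4-6": 15.0,
    "claude-opus-4-8": 25.0,
}

def max_output_cost(model: str, max_tokens: int) -> float:
    """Most the output side of one call can cost when capped at max_tokens."""
    return max_tokens / 1_000_000 * OUTPUT_PRICE_PER_M[model]

# Print the output-cost ceiling for a few example caps on each model.
for cap in (100, 1000, 4000):
    row = "  ".join(f"{m}: ${max_output_cost(m, cap):.4f}" for m in OUTPUT_PRICE_PER_M)
    print(f"max_tokens={cap:<5} {row}")
```

Because the ceiling is linear in max_tokens, dropping the cap from 4,000 to 100 cuts the worst-case output cost by a factor of 40 on every model.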
Estimating Cost Before You Send
Reading usage tells you what a call did cost — after you’ve already paid for it. Often you want to know before: to reject input that’s too long, to show the user an estimate, or to keep a runaway prompt from blowing your budget. The count_tokens endpoint counts the input tokens for a request without running the model, so it’s effectively free and fast.
prompt = "Write a haiku about the ocean."

# Count input tokens without running the model.
ct = client.messages.count_tokens(
    model="claude-haiku-4-5",
    messages=[{"role": "user", "content": prompt}],
)

MAX_OUT = 50
price_in, price_out = PRICES["claude-haiku-4-5"]
# Worst case: the model uses every output token the cap allows.
worst_case = ct.input_tokens / 1_000_000 * price_in + MAX_OUT / 1_000_000 * price_out
print("input tokens (estimate):", ct.input_tokens)
print("worst-case cost with max_tokens=%d: $%.8f" % (MAX_OUT, worst_case))

input tokens (estimate): 15
worst-case cost with max_tokens=50: $0.00026500

You can’t know the exact output length in advance, but you do control its ceiling: max_tokens. Combining the counted input with your max_tokens cap gives you a true upper bound on the call’s cost, a number you can budget against with confidence.
The same count is also a length gate. If a user pastes a wall of text, you can measure it and reject it before it ever reaches the model:
LIMIT = 100
long_prompt = "word " * 200

ct = client.messages.count_tokens(
    model="claude-haiku-4-5",
    messages=[{"role": "user", "content": long_prompt}],
)
if ct.input_tokens > LIMIT:
    print("rejected:", ct.input_tokens, "tokens exceeds limit of", LIMIT)

rejected: 208 tokens exceeds limit of 100

Count the input with the same model you’ll send to, because token counts are model-specific. And resist the temptation to estimate tokens with a character or word heuristic; count_tokens uses the model’s own tokenizer, so it is the number that tracks what you’ll actually be billed for.
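The two checks above (a length gate and a worst-case cost bound) can be combined into one pre-flight function. This is a minimal sketch under stated assumptions: the rates are the Haiku prices from this lesson, admit and worst_case_cost are hypothetical names, and the function takes a token count as a plain integer so it can run without a network call.

```python
# Haiku rates from this lesson: input $/M and output $/M.
PRICE_IN, PRICE_OUT = 1.0, 5.0

def worst_case_cost(input_tokens: int, max_tokens: int) -> float:
    """Upper bound on one call's cost: counted input plus a fully used output cap."""
    return input_tokens / 1_000_000 * PRICE_IN + max_tokens / 1_000_000 * PRICE_OUT

def admit(input_tokens: int, max_tokens: int,
          token_limit: int, budget_usd: float) -> tuple[bool, str]:
    """Hypothetical pre-flight check; returns (ok, reason)."""
    if input_tokens > token_limit:
        return False, f"input of {input_tokens} tokens exceeds limit of {token_limit}"
    cost = worst_case_cost(input_tokens, max_tokens)
    if cost > budget_usd:
        return False, f"worst-case cost ${cost:.6f} exceeds budget ${budget_usd:.6f}"
    return True, "ok"

# In real use, input_tokens would come from
# client.messages.count_tokens(...).input_tokens.
print(admit(15, 50, token_limit=100, budget_usd=0.001))    # short prompt, small cap
print(admit(208, 50, token_limit=100, budget_usd=0.001))   # too long
print(admit(90, 400, token_limit=100, budget_usd=0.001))   # output cap too generous
```

Returning a reason string alongside the boolean is a design choice for this sketch; it lets the caller show the user why a request was refused.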
The Big Cost Levers
Once you can measure cost, four levers move it, roughly in order of impact.
1. Choose the cheapest model that does the job. This is by far the largest lever. The exact same request costs five times as much on Opus as on Haiku. Here is one request’s worth of tokens (1,000 in, 300 out) priced across all three models:
IN, OUT = 1000, 300
for model, (price_in, price_out) in PRICES.items():
    c = IN / 1_000_000 * price_in + OUT / 1_000_000 * price_out
    print(f"{model:20} ${c:.6f}")

claude-haiku-4-5     $0.002500
claude-sonnet-4-6    $0.007500
claude-opus-4-8      $0.012500

The discipline is to start with the smallest model, measure quality against your own evaluation set, and only step up when the cheaper model genuinely fails the task. Many production workloads (classification, extraction, routing, short summaries) run perfectly well on Haiku, and reserving Opus for the requests that truly need it is the single biggest thing you can do for your bill.
2. Cap max_tokens. Because output is the expensive half, putting a tight ceiling on it directly caps the most expensive part of every call. Set max_tokens to the longest answer you actually need, not the largest the model allows.
3. Trim what you send. Input tokens are cheaper per token, but they add up across long system prompts, large retrieved context, and growing conversation history. Send only the context the task needs: prune old turns from a chat, retrieve five relevant snippets instead of fifty, and keep the system prompt lean.
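Pruning old turns, as lever 3 suggests, can be sketched as a function that keeps only the most recent messages that fit a token budget. This is an illustration under loud assumptions: trim_history and rough_count are hypothetical names, and rough_count is a word-count stand-in used only so the demo runs offline. In real use you would pass a counter backed by count_tokens.

```python
from typing import Callable

Message = dict  # {"role": "user" | "assistant", "content": str}

def rough_count(message: Message) -> int:
    """Demo-only stand-in: one 'token' per whitespace-separated word.
    Replace with a counter backed by count_tokens for real budgeting."""
    return len(message["content"].split())

def trim_history(messages: list[Message], budget: int,
                 count: Callable[[Message], int] = rough_count) -> list[Message]:
    """Keep the most recent messages whose combined count fits within budget."""
    kept: list[Message] = []
    used = 0
    for msg in reversed(messages):      # walk from newest to oldest
        cost = count(msg)
        if used + cost > budget:
            break                       # everything older is dropped as well
        kept.append(msg)
        used += cost
    kept.reverse()                      # restore chronological order
    # Drop leading assistant turns so the history still opens with a user turn.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept

history = [
    {"role": "user", "content": "one two three four"},
    {"role": "assistant", "content": "five six seven"},
    {"role": "user", "content": "eight nine"},
    {"role": "assistant", "content": "ten"},
    {"role": "user", "content": "eleven twelve thirteen"},
]
print([m["content"] for m in trim_history(history, budget=6)])
# prints ['eight nine', 'ten', 'eleven twelve thirteen']
```

If even the newest message exceeds the budget, the function returns an empty list, so the caller still needs a length gate for single oversized messages.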
4. Cache large reused prefixes. When many requests share a big, identical chunk at the front — a long system prompt, a fixed knowledge base, a set of examples — you can mark that prefix with cache_control so the model processes it once and reuses it. Subsequent calls read the cached prefix at a fraction of the normal price:
big_context = ("The following is reference material for a support assistant. "
               "Each product line has its own return policy and warranty terms. ") * 600

# Mark the large, stable prefix as cacheable.
system = [{
    "type": "text",
    "text": big_context,
    "cache_control": {"type": "ephemeral"},
}]

r1 = client.messages.create(
    model="claude-haiku-4-5", max_tokens=20, system=system,
    messages=[{"role": "user", "content": "Reply with the single word OK."}],
)
print("call 1:",
      "input", r1.usage.input_tokens,
      "cache_creation", r1.usage.cache_creation_input_tokens,
      "cache_read", r1.usage.cache_read_input_tokens)

# Same system prefix, different user message: the prefix should be read from cache.
r2 = client.messages.create(
    model="claude-haiku-4-5", max_tokens=20, system=system,
    messages=[{"role": "user", "content": "Reply with the single word YES."}],
)
print("call 2:",
      "input", r2.usage.input_tokens,
      "cache_creation", r2.usage.cache_creation_input_tokens,
      "cache_read", r2.usage.cache_read_input_tokens)

call 1: input 13 cache_creation 13202 cache_read 0
call 2: input 13 cache_creation 0 cache_read 13202

The first call writes the 13,202-token prefix into the cache (reported as cache_creation_input_tokens); the second call reads it back (cache_read_input_tokens) at a steep discount. Note that input_tokens counts only the uncached part of the request, 13 tokens here. At full input price those 13,202 tokens cost about $0.01320200 each time; the cache write costs about $0.01650250 once (a 25% premium), and every cache read after that costs about $0.00132020, roughly a tenth of full price. Over many requests sharing the same prefix, that difference dominates. The catch is that caching is a prefix match: any change in the cached portion invalidates it, so keep the stable content at the front and the per-request part at the end. If cache_read_input_tokens stays at zero across repeated calls, the most likely cause is that something in the prefix is changing between requests; an expired cache entry or a prefix below the model’s minimum cacheable length produces the same symptom.
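The arithmetic in the paragraph above can be folded into a cost function that prices the cache fields separately. A minimal sketch, assuming the multipliers implied by this lesson’s figures (cache writes at 1.25x the input rate, cache reads at 0.1x); check current pricing before relying on them. The names cost_with_cache and prefix_cost are illustrative, and the comparison assumes every later call arrives while the cache entry is still live.

```python
CACHE_WRITE_MULT = 1.25  # assumption inferred from $0.01650250 / $0.01320200
CACHE_READ_MULT = 0.10   # assumption inferred from $0.00132020 / $0.01320200

def cost_with_cache(price_in, price_out, input_tokens, output_tokens,
                    cache_creation=0, cache_read=0):
    """Per-call cost in dollars, pricing cache writes and reads separately."""
    per_tok_in = price_in / 1_000_000
    return (input_tokens * per_tok_in
            + cache_creation * per_tok_in * CACHE_WRITE_MULT
            + cache_read * per_tok_in * CACHE_READ_MULT
            + output_tokens * price_out / 1_000_000)

PREFIX = 13_202  # prefix size from the run above

def prefix_cost(n_calls, cached):
    """Cost of the shared prefix alone across n_calls at Haiku rates ($1/$5)."""
    if n_calls <= 0:
        return 0.0
    if not cached:
        return n_calls * cost_with_cache(1.0, 5.0, PREFIX, 0)
    first = cost_with_cache(1.0, 5.0, 0, 0, cache_creation=PREFIX)   # one write
    later = cost_with_cache(1.0, 5.0, 0, 0, cache_read=PREFIX)       # each read
    return first + (n_calls - 1) * later

for n in (1, 2, 10, 100):
    print(f"{n:4} calls  uncached ${prefix_cost(n, False):.6f}  cached ${prefix_cost(n, True):.6f}")
```

Under these assumptions a single call is slightly more expensive with caching (the write premium), and caching comes out ahead from the second call onward.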
Tracking Spend in Production
A single call’s cost is trivia. The number that matters is the total, accumulated across every call your app makes — and the only way to have it is to record usage on every response as you go. The pattern is simple: cost each call from its usage, and add it to a running total.
def call_cost(usage, model):
    # Same arithmetic as cost_from_usage, with the arguments in a different order.
    price_in, price_out = PRICES[model]
    return usage.input_tokens / 1_000_000 * price_in + usage.output_tokens / 1_000_000 * price_out

MODEL = "claude-haiku-4-5"
total_cost = 0.0
total_in = 0
total_out = 0

questions = [
    "What is 12 times 8?",
    "Name the capital of Japan.",
    "Give one synonym for 'fast'.",
]

for q in questions:
    r = client.messages.create(
        model=MODEL, max_tokens=30,
        messages=[{"role": "user", "content": q}],
    )
    c = call_cost(r.usage, MODEL)
    # Accumulate running totals across calls.
    total_cost += c
    total_in += r.usage.input_tokens
    total_out += r.usage.output_tokens
    print(f"in={r.usage.input_tokens:3} out={r.usage.output_tokens:3} cost=${c:.8f} {q}")

print(f"TOTAL: in={total_in} out={total_out} cost=${total_cost:.8f}")

in= 16 out= 14 cost=$0.00008600 What is 12 times 8?
in= 13 out= 10 cost=$0.00006300 Name the capital of Japan.
in= 15 out=  6 cost=$0.00004500 Give one synonym for 'fast'.
TOTAL: in=44 out=30 cost=$0.00019400

In a real service you would log each call’s model, input tokens, output tokens, and cost to your observability system, tagged with the feature or user that made it, instead of summing in a loop. That log is what lets you answer the questions production actually asks: which feature is expensive, did cost spike overnight, is one user driving the bill, would caching or a smaller model help here? You can’t manage a cost you don’t measure, and usage is the measurement: recorded on every call, it turns spend from a surprise at the end of the month into a number you watch in real time.
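One way to sketch that kind of per-call logging uses only the standard library: emit one JSON line per call and keep a per-feature tally. Everything here is an assumption for illustration: the record_call helper, the logger name, the feature tags, and the in-process dictionary, which stands in for whatever metrics or observability system a real service uses.

```python
import json
import logging
from collections import defaultdict

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("llm_cost")  # logger name is an arbitrary choice

PRICES = {"claude-haiku-4-5": (1.0, 5.0)}  # (input $/M, output $/M), from this lesson

spend_by_feature = defaultdict(float)  # in-process tally, for the demo only

def record_call(model, feature, input_tokens, output_tokens):
    """Log one call as a JSON line and add its cost to the per-feature total."""
    price_in, price_out = PRICES[model]
    cost = input_tokens / 1_000_000 * price_in + output_tokens / 1_000_000 * price_out
    spend_by_feature[feature] += cost
    entry = {
        "model": model,
        "feature": feature,            # hypothetical tag identifying the caller
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 8),
    }
    log.info(json.dumps(entry))
    return entry

# Token counts are the ones from the run above; in real use, pass
# r.usage.input_tokens and r.usage.output_tokens from each response.
record_call("claude-haiku-4-5", "math_helper", 16, 14)
record_call("claude-haiku-4-5", "trivia", 13, 10)
record_call("claude-haiku-4-5", "trivia", 15, 6)

for feature, usd in sorted(spend_by_feature.items()):
    print(f"{feature:12} ${usd:.8f}")
```

Note that this sketch prices only input_tokens and output_tokens; if you use prompt caching, the cache_creation_input_tokens and cache_read_input_tokens fields need to be recorded and priced as well, or the log will understate spend.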
Practice Exercises
Exercise 1: Cost a single call
A call to claude-sonnet-4-6 ($3 / $15 per million) returns usage with 1,200 input tokens and 400 output tokens. Compute the cost by hand, then with the cost_from_usage helper. Which half — input or output — contributed more?
Hint
Input: 1200 / 1e6 x $3 = $0.0036. Output: 400 / 1e6 x $15 = $0.0060. Total $0.0096. The output costs more ($0.0060 vs $0.0036) despite being a third as many tokens, because output is priced at 5x the input rate. This is exactly why capping max_tokens is a real lever.
Exercise 2: Estimate before sending
You want to allow at most $0.001 per request on Haiku, with max_tokens capped at 100. Use count_tokens to find the input count for a prompt, then compute the worst-case cost (counted input + 100 output) and reject the request if it would exceed the budget.
Hint
Worst-case cost = input_tokens / 1e6 * 1.0 + 100 / 1e6 * 5.0. The 100 output tokens alone cost $0.0005, so your input budget is the remaining $0.0005 — about 500 input tokens. Call count_tokens, compute the figure, and gate on it before messages.create, so you never pay for an over-budget call.
Exercise 3: Pick the lever
For each situation, name the single most effective cost lever: (a) every request prepends the same 8,000-token policy document; (b) you’re running Opus for a simple “is this email spam?” classification; (c) answers come back as long essays when a sentence would do.
Hint
(a) Prompt caching: the 8,000-token prefix is identical every time, so cache it, pay the write premium once, and read it back at roughly a tenth of the input price on every later call. (b) Model choice: drop to Haiku; spam classification almost certainly doesn’t need Opus, and that’s a 5x saving. (c) Cap max_tokens: output is the expensive half, so a tight ceiling cuts both cost and verbosity.
Summary
LLM cost is arithmetic on two numbers that arrive free with every response: input_tokens and output_tokens. Multiply them by per-million pricing — $1/$5 for Haiku, $3/$15 for Sonnet, $5/$25 for Opus — and you have the exact cost of any call, with output running about 5x the input rate. Before sending, count_tokens gives you the input count for an estimate or a length gate; combined with max_tokens, it bounds a call’s cost in advance. The big levers, in order of impact, are choosing the cheapest model that passes your evaluation, capping max_tokens, trimming the context and history you send, and caching large reused prefixes (verified live: a 13,202-token prefix that costs full price each time reads back at roughly a tenth of that once cached). And because the only spend that matters is the cumulative total, you accumulate cost from usage on every call and log it — so cost is something you watch, not something you discover.
Key Concepts
- usage to dollars: input_tokens and output_tokens times per-million pricing gives exact per-call cost.
- count_tokens: counts input before sending, for cost estimates and length limits, at no cost.
- The four levers: model choice (biggest), max_tokens cap, context trimming, and prompt caching.
- Spend tracking: accumulate and log cost from usage on every call to see and manage it in production.
Why This Matters
Cost is the concern that quietly kills LLM features after launch. A demo that felt free becomes a real bill once it runs at scale, and teams that never instrumented it have no idea where the money is going or how to bring it down. The engineers who ship sustainable AI treat cost the way they treat latency or errors: measured on every call, attributed to features, and controlled with deliberate levers. The token counts are right there in every response — turning them into a number you watch is the difference between a feature you can afford to keep and one you have to switch off.
Next Steps
Continue to Lesson 4 - Securing and Serving
Keep secrets out of your code, validate untrusted input, and serve your model behind a real API with proper error codes.
Continue Building Your Skills
You can now read the cost of any call, predict it before you spend, pull the levers that move the bill, and track spend as it accumulates. Next you’ll handle the other two production concerns that turn a script into a service: keeping your credentials and your users safe, and serving your model behind a proper API.