Lesson 1 - From Prototype to Production
Welcome to From Prototype to Production
Every system you’ve built in this course works — on your machine, on the happy path, with you watching. That’s a prototype. A product is different: it has to keep working when the API returns an error, when a hundred requests arrive at once, when the bill comes due, and when someone sends input you didn’t expect. This final module is about closing that gap. None of it requires new AI concepts — it’s the engineering that turns the systems you already built into services other people can rely on.
This lesson maps the territory and grounds it in real numbers. The rest of the module builds each piece.
By the end of this lesson, you will be able to:
- Explain how a production app differs from a prototype
- Name the five production concerns: reliability, cost, security, serving, observability
- Read the token usage returned with every response
- Estimate the cost of a request from its token counts
You’ll build on everything from the course so far. Let’s begin.
What Changes When You Ship
A prototype answers one question: can this work at all? A product answers a harder one: will this keep working for people who aren’t me? The difference is everything the prototype quietly ignored:
- Reliability — networks fail and APIs rate-limit. Production code retries, sets timeouts, and streams long responses instead of blocking. (Lesson 2)
- Cost — every token costs money. You measure usage, cache what you can, and choose the cheapest model that does the job. (Lesson 3)
- Security — your API key is a credential; user input is untrusted. You keep secrets out of code and validate every request. (Lesson 4)
- Serving — a product answers over a network, with a defined request/response shape and proper error codes. (Lesson 4)
- Observability — you can’t fix what you can’t see, so you log requests, track token spend, and monitor errors. (woven throughout)
These five are the agenda for the module. The good news: each is ordinary software engineering applied to the LLM calls you already know how to make.
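To make the "ordinary software engineering" point concrete, the reliability concern often comes down to a retry loop with exponential backoff. The sketch below is a minimal, generic version and not the course's implementation: `call_with_retries`, `TransientError`, and the `flaky` stand-in function are hypothetical names for illustration, and none of them are part of the Anthropic SDK. Lesson 2 covers the real error handling.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (rate limit, timeout)."""


def call_with_retries(fn, max_attempts=3, base_delay=0.01):
    """Call fn(); on TransientError wait and retry, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for an API call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("simulated 429")
    return "ok"


result = call_with_retries(flaky)
print(result, "after", attempts["count"], "attempts")
```

The delay is kept tiny here so the example runs instantly; a real service would typically wait on the order of seconds.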
The Data Behind Production: usage
Production decisions are made on data, and the most important data point comes free with every response. The Anthropic SDK returns a usage object telling you exactly how many tokens the request consumed:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=60,
    messages=[{"role": "user", "content": "Say hello in one word."}],
)

print("reply:", response.content[0].text)
print("input tokens:", response.usage.input_tokens)
print("output tokens:", response.usage.output_tokens)

reply: Hello.
input tokens: 13
output tokens: 5

That tiny request cost 13 input tokens and produced 5 output tokens. Those two numbers are the raw material of cost control: input tokens are everything you send (your prompt, retrieved context, conversation history), and output tokens are everything the model generates. In production you log these on every call, because they tell you what’s expensive and where to optimize.
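One way to log these counts on every call is a small helper that copies them into a record you can store or print. This is a sketch, not a prescribed design: `log_usage` and `usage_log` are hypothetical names, and a `SimpleNamespace` stands in for a real response so the example runs without an API key.

```python
from types import SimpleNamespace

usage_log = []  # hypothetical in-memory store; production code would use a real log sink


def log_usage(response, label):
    """Record the token counts from a response-like object under a label."""
    record = {
        "label": label,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }
    usage_log.append(record)
    return record


# Stand-in with the same attribute shape as the response shown above.
fake_response = SimpleNamespace(
    usage=SimpleNamespace(input_tokens=13, output_tokens=5)
)

print(log_usage(fake_response, label="hello"))
```

Because the helper only reads `response.usage.input_tokens` and `response.usage.output_tokens`, it should accept the object returned by `client.messages.create(...)` unchanged.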
Input and output tokens are priced differently
Output tokens cost several times more than input tokens. For claude-haiku-4-5 it’s $1 per million input tokens and $5 per million output tokens — output is 5× the price. That’s why a long, rambling answer can cost more than a long prompt, and why capping max_tokens is a real cost lever, not just a formatting one.
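To see why the cap works as a cost lever: a response cannot generate more output tokens than `max_tokens`, so the cap puts a ceiling on the output side of the bill. A rough sketch using the Haiku output price quoted above; `worst_case_output_cost` is a hypothetical helper name.

```python
OUTPUT_PRICE_PER_MILLION = 5.0  # USD, claude-haiku-4-5 output price from this lesson


def worst_case_output_cost(max_tokens):
    """Upper bound on the output cost of one request capped at max_tokens."""
    return max_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION


for cap in (60, 1_000, 4_000):
    print(f"max_tokens={cap:>5}: at most ${worst_case_output_cost(cap):.4f} of output")
```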
Turning Tokens into Dollars
Once you have token counts and the per-token price, cost is just arithmetic. With Haiku at $1 per million input tokens and $5 per million output tokens, the hello request above costs:
input_tokens, output_tokens = 13, 5
cost = input_tokens / 1_000_000 * 1.0 + output_tokens / 1_000_000 * 5.0
print(f"estimated cost: ${cost:.8f}")

estimated cost: $0.00003800

That is about four thousandths of a cent, which is exactly why prototypes feel free. But multiply by millions of requests, longer prompts, and bigger models, and it becomes a real budget line. The discipline of production is to know this number, watch it, and keep it where it belongs. You’ll build proper token counting and cost tracking in Lesson 3; for now the point is that the data you need is right there in every response.
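The same arithmetic can be wrapped in a small function so the prices live in one place. This is a sketch under the prices quoted in this lesson: `PRICES` and `estimate_cost` are hypothetical names, and the figures should be checked against current pricing before you rely on them.

```python
# USD per million tokens, as quoted in this lesson (an assumption that may go stale).
PRICES = {"claude-haiku-4-5": {"input": 1.0, "output": 5.0}}


def estimate_cost(model, input_tokens, output_tokens):
    """Estimate the USD cost of one request from its token counts."""
    price = PRICES[model]
    return (
        input_tokens / 1_000_000 * price["input"]
        + output_tokens / 1_000_000 * price["output"]
    )


per_request = estimate_cost("claude-haiku-4-5", 13, 5)
print(f"one request:         ${per_request:.8f}")
print(f"one million of them: ${per_request * 1_000_000:.2f}")
```

At these prices, a million hello-sized requests come to about $38, which is how a number that rounds to zero turns into a budget line.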
Practice Exercises
Exercise 1: Name the concern
For each problem, name which of the five production concerns it falls under: (a) the API returns a 429 “too many requests” error; (b) your monthly bill is triple what you expected; (c) a user pastes 50,000 words into your chat box.
Hint
(a) is reliability — you need retry/backoff handling. (b) is cost — you need usage tracking and limits. (c) is security/serving — you need input validation to reject oversized or malicious input before it reaches the model.
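For case (c), the validation step can be as simple as a length check that runs before any model call. A minimal sketch: `MAX_WORDS`, `validate_message`, and the 2,000-word limit are hypothetical choices for illustration, not values from this course.

```python
MAX_WORDS = 2_000  # hypothetical limit; pick one that fits your product


def validate_message(text):
    """Return (ok, reason); reject empty or oversized input before calling the model."""
    words = len(text.split())
    if words == 0:
        return False, "empty message"
    if words > MAX_WORDS:
        return False, f"message too long: {words} words (limit {MAX_WORDS})"
    return True, "ok"


print(validate_message("Summarize this paragraph for me."))
print(validate_message("word " * 50_000))
```

A word count is only a rough proxy for tokens, but it is cheap to compute and rejects the 50,000-word paste without spending anything on the model.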
Exercise 2: Predict the cost driver
A request sends a 2,000-token prompt and gets back a 500-token answer, using Haiku ($1/$5 per million). Compute each part. Which costs more — the longer input or the shorter output?
Hint
Input: 2000 / 1e6 × $1 = $0.002. Output: 500 / 1e6 × $5 = $0.0025. The shorter output costs more, because output tokens are 5× the price. This is why output length is often the bigger cost lever.
Exercise 3: Why log usage?
Your prototype works, so why bother recording input_tokens and output_tokens on every call once it’s in production?
Hint
Logged usage is how you spot which requests are expensive, catch a sudden cost spike, attribute spend to features or users, and decide where caching or a smaller model would help. You can’t manage cost you don’t measure — that’s the observability concern applied to money.
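As a sketch of what attributing spend looks like once usage is logged, the snippet below totals estimated cost per feature. The log records and feature names are made-up sample data, `cost_by_feature` is a hypothetical helper, and the prices are the Haiku figures from this lesson.

```python
from collections import defaultdict

INPUT_PRICE, OUTPUT_PRICE = 1.0, 5.0  # USD per million tokens (Haiku, from this lesson)

# Made-up sample log: one record per call, tagged with the feature that made it.
usage_log = [
    {"feature": "chat", "input_tokens": 2_000, "output_tokens": 500},
    {"feature": "chat", "input_tokens": 1_000, "output_tokens": 200},
    {"feature": "summarize", "input_tokens": 8_000, "output_tokens": 300},
]


def cost_by_feature(records):
    """Sum the estimated USD cost of logged calls, grouped by feature."""
    totals = defaultdict(float)
    for r in records:
        totals[r["feature"]] += (
            r["input_tokens"] / 1_000_000 * INPUT_PRICE
            + r["output_tokens"] / 1_000_000 * OUTPUT_PRICE
        )
    return dict(totals)


for feature, cost in sorted(cost_by_feature(usage_log).items()):
    print(f"{feature:<10} ${cost:.4f}")
```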
Summary
A prototype proves something can work; a product keeps working for other people, which means handling everything the happy path ignored. Five concerns define that gap: reliability (retries, timeouts, streaming), cost (tokens, caching, model choice), security (secrets, input validation), serving (an API with schemas and error codes), and observability (logging and monitoring). You saw the foundation of cost control — the usage object returned with every response (13 input, 5 output tokens for a tiny call) — and turned it into a real dollar figure. None of this is new AI; it’s the engineering that makes your AI shippable.
Key Concepts
- Prototype vs. product: happy-path proof versus dependable service.
- The five concerns: reliability, cost, security, serving, observability.
- usage: the input/output token counts returned with every response.
- Token-based cost: input and output tokens priced separately (output costs more).
Why This Matters
Most LLM projects die in the gap between a working demo and a deployable service. The teams that ship are the ones who treat reliability, cost, security, serving, and observability as first-class — not afterthoughts. This module gives you those skills, turning every system you built in this course into something you could actually put in front of users. It’s the final step from knowing how LLMs work to engineering with them.
Next Steps
Continue to Lesson 2 - Streaming and Error Handling
Make your calls reliable: stream long responses and handle rate limits, timeouts, and transient errors with retries.
Back to Module Overview
Return to the Shipping AI Applications module overview
Continue Building Your Skills
You now know the five concerns that separate a prototype from a product, and you’ve seen the usage data that production runs on. Next you’ll tackle the first concern head-on — making your calls reliable with streaming and robust error handling.