Lesson 2 - Streaming and Error Handling
Welcome to Streaming and Error Handling
Reliability is the first thing a production app has to earn. In Lesson 1 you saw the five concerns that separate a script from a service; this lesson tackles the first of them head-on. When you call a model in a notebook, you wait for the whole answer and you watch it happen. In production, nobody is watching — a request might block for thirty seconds while a long answer generates, the network might drop mid-call, or the API might rate-limit you under load. None of that should take your app down. The good news is that the SDK already gives you most of the tools; your job is to use them deliberately.
This lesson covers three reliability levers — streaming, error handling, and retries — and shows the real, verified API patterns for each.
By the end of this lesson, you will be able to:
- Stream a response token by token to cut perceived latency and avoid long-request timeouts
- Catch and respond to the anthropic exception hierarchy with try/except
- Use the SDK’s built-in automatic retries with exponential backoff, and set timeouts
You’ll build directly on the calls you already know how to make. Let’s begin.
Streaming: Don’t Make Users Wait for the Whole Answer
When you call messages.create(), the SDK sends your request and blocks until the entire response has been generated, then hands it back all at once. For a one-word reply that’s fine. For a 500-token answer, the user stares at nothing for several seconds — and worse, a long generation can run up against the SDK’s HTTP timeout and fail outright before the answer ever arrives.
Streaming fixes both problems. Instead of waiting for the full message, you receive the text in small pieces as the model produces them. The user sees output appear immediately (lower perceived latency, even though total time is the same), and because data keeps flowing, you don’t hit the request-timeout wall on long outputs.
The SDK exposes this through client.messages.stream(), used as a context manager. Iterating stream.text_stream gives you the text deltas as they arrive:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment

    with client.messages.stream(
        model="claude-haiku-4-5",
        max_tokens=80,
        messages=[{"role": "user", "content": "Count from 1 to 5, comma separated. Just the numbers."}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
        final = stream.get_final_message()  # the complete Message, after streaming finishes

    print()
    print("stop_reason:", final.stop_reason)
    print("output_tokens:", final.usage.output_tokens)

Output:

    1, 2, 3, 4, 5
    stop_reason: end_turn
    output_tokens: 17

Two things to notice. First, the text printed as it streamed: each piece appeared the moment the model generated it, which is exactly what gives a chat UI its live, responsive feel. Second, after the loop you still get the whole picture: get_final_message() returns the same complete Message object a non-streaming call would have, so you keep access to stop_reason, the full usage counts, and everything else. You stream for the user and get the structured result for your code, with no tradeoff.
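To make the perceived-latency point concrete, here is a minimal offline sketch. It assumes only that text_stream behaves like an iterable of strings; relay_deltas and fake_text_stream are hypothetical names, and the fake generator merely simulates a model emitting pieces with a short delay, so no API call is made.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def relay_deltas(
    deltas: Iterable[str], emit: Callable[[str], None]
) -> Tuple[str, float, float]:
    """Forward each delta to `emit` as it arrives.

    Returns (full_text, seconds_until_first_delta, total_seconds).
    """
    start = time.perf_counter()
    first_delta_s: Optional[float] = None
    parts: List[str] = []
    for delta in deltas:
        if first_delta_s is None:
            first_delta_s = time.perf_counter() - start
        emit(delta)          # the user sees this piece immediately
        parts.append(delta)  # your code still collects the whole answer
    total_s = time.perf_counter() - start
    if first_delta_s is None:  # empty stream: nothing was ever shown
        first_delta_s = total_s
    return "".join(parts), first_delta_s, total_s


def fake_text_stream() -> Iterable[str]:
    """Hypothetical stand-in for stream.text_stream (assumption, not SDK code)."""
    for piece in ["1", ", 2", ", 3", ", 4", ", 5"]:
        time.sleep(0.01)  # pretend the model needs time per piece
        yield piece


shown: List[str] = []
text, first_s, total_s = relay_deltas(fake_text_stream(), shown.append)
print(text)  # 1, 2, 3, 4, 5
print(f"first piece after {first_s:.3f}s, full answer after {total_s:.3f}s")
```

With a real call you would pass stream.text_stream as the deltas argument inside the with block. The total time does not shrink; what changes is how soon the first piece reaches the user.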
The same pattern scales to real answers. Here it is on a slightly longer output:

    # Reuses the client created in the previous example
    with client.messages.stream(
        model="claude-haiku-4-5",
        max_tokens=120,
        messages=[{"role": "user", "content": "List three primary colors, one per line."}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
        final = stream.get_final_message()

    print()
    print("stop_reason:", final.stop_reason)
    print("tokens:", final.usage.input_tokens, "in /", final.usage.output_tokens, "out")

Output:

    Red
    Yellow
    Blue
    stop_reason: end_turn
    tokens: 16 in / 8 out

The rule of thumb: stream anything that might produce a long answer. It costs you nothing extra and removes a whole class of timeout failures.
Streaming is how you avoid long-request timeouts
A non-streaming request has to come back before the SDK’s HTTP timeout (10 minutes by default) elapses, and large max_tokens outputs can push right up against it. Streaming keeps data flowing the whole time, so the connection never sits idle waiting for one giant response. That’s why production code streams long or open-ended generations by default — it’s a reliability measure, not just a UI nicety.
Error Handling: Catch the Right Exception
Streaming handles the happy path gracefully. But networks fail, keys get revoked, and APIs rate-limit. When something goes wrong, the SDK raises a typed exception, and the type tells you exactly what happened — so you can respond appropriately instead of guessing from an error string.
The anthropic package defines an exception for each failure category. These all exist in the installed SDK (version 0.112.0):
| Exception | Raised when |
|---|---|
| anthropic.RateLimitError | You hit a rate limit (HTTP 429); slow down and retry |
| anthropic.APITimeoutError | The request took longer than your timeout |
| anthropic.APIConnectionError | The request never reached the server (network failure) |
| anthropic.APIStatusError | Any other non-success HTTP status (e.g. 400, 404); also the parent class of RateLimitError and InternalServerError |
| anthropic.InternalServerError | A server-side error (HTTP 5xx); transient, retry |
| anthropic.APIError | The base class all of the above inherit from |
The practical pattern is to wrap the call in try/except and order your handlers from most specific to least specific. Each category gets a response that fits it — back off on a rate limit, surface a clear message on a bad request, log and alert on an unexpected status:

    import anthropic

    client = anthropic.Anthropic()

    try:
        response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=60,
            messages=[{"role": "user", "content": "Reply with exactly: ok"}],
        )
        print("success:", response.content[0].text)
    except anthropic.RateLimitError:
        print("rate limited - backing off")
    except anthropic.APIStatusError as e:
        print("API error:", e.status_code)
    except anthropic.APIConnectionError:
        print("connection failed")

Output:

    success: ok

The call succeeded, so we landed in the try body and printed the reply. But the structure is what matters: had the API returned a 429, the RateLimitError branch would have caught it; a malformed request (400) or bad model ID (404) would fall to the APIStatusError branch, where e.status_code tells you which; a dropped connection would hit APIConnectionError. Because every one of these inherits from anthropic.APIError, you can add a final except anthropic.APIError as a catch-all for anything you didn’t name specifically.
Order matters: list the specific exceptions first. If you put a broad except anthropic.APIError at the top, it would swallow rate-limit and connection errors before your specific handlers ever got a chance to run, and you’d lose the ability to treat them differently.
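The effect of handler order is easy to see offline. The sketch below uses small stand-in classes that mirror the inheritance described above (they are not the SDK's classes, and specific_first and broad_first are hypothetical names), so it runs without a network call or an API key.

```python
class APIError(Exception):
    """Stand-in for anthropic.APIError, the common base class."""


class APIConnectionError(APIError):
    """Stand-in for anthropic.APIConnectionError."""


class APIStatusError(APIError):
    """Stand-in for anthropic.APIStatusError; carries the HTTP status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


class RateLimitError(APIStatusError):
    """Stand-in for anthropic.RateLimitError (HTTP 429)."""

    def __init__(self) -> None:
        super().__init__(429)


def specific_first(exc: Exception) -> str:
    """Handlers ordered most specific to least specific."""
    try:
        raise exc
    except RateLimitError:
        return "back off"
    except APIStatusError as e:
        return f"status {e.status_code}"
    except APIConnectionError:
        return "connection failed"
    except APIError:
        return "unexpected API error"


def broad_first(exc: Exception) -> str:
    """Same idea, but the base class is listed first and shadows the rest."""
    try:
        raise exc
    except APIError:
        return "unexpected API error"
    except RateLimitError:  # never reached: APIError above already matched
        return "back off"


print(specific_first(RateLimitError()))     # back off
print(specific_first(APIStatusError(404)))  # status 404
print(broad_first(RateLimitError()))        # unexpected API error
```

Python checks except clauses top to bottom and runs the first one whose class matches, which is why the broad handler in broad_first hides the rate-limit branch.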
Retries and Timeouts: Built-In Resilience
Here’s the part that saves you the most work: the SDK already retries transient failures for you. Rate limits (429), server errors (5xx), and connection problems are retried automatically with exponential backoff — each retry waits a little longer than the last, which is exactly what you want so you don’t hammer an overloaded API. The default is two retries, and you can raise it:

    import anthropic

    # Default client retries twice; this one retries up to five times
    robust = anthropic.Anthropic(max_retries=5)
    print("max_retries set:", robust.max_retries)

Output:

    max_retries set: 5

What gets retried is deliberately limited to the transient failures, the ones where trying again actually helps: HTTP 429 (rate limit), 5xx (server errors), and connection errors. A 400 (bad request) or 404 (bad model ID) is not retried, because retrying a malformed request just fails again more slowly. The SDK draws that line for you.
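As a rough illustration of the two ideas in this section, the sketch below classifies a status code as transient or not and computes a capped exponential backoff schedule. The function names, the 0.5-second base, and the 8-second cap are assumptions chosen for the example, not values read from the SDK, and real implementations typically add random jitter on top.

```python
from typing import List


def is_transient(status_code: int) -> bool:
    """True for the statuses this lesson lists as retried: 429 and any 5xx."""
    return status_code == 429 or 500 <= status_code <= 599


def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 8.0) -> List[float]:
    """Seconds to wait before each retry: base * 2**n, never more than cap.

    base and cap are illustrative assumptions, not the SDK's constants.
    """
    return [min(base * (2 ** n), cap) for n in range(max_retries)]


print(backoff_delays(2))  # [0.5, 1.0]
print(backoff_delays(5))  # [0.5, 1.0, 2.0, 4.0, 8.0]
print([code for code in (400, 404, 429, 500, 529) if is_transient(code)])  # [429, 500, 529]
```

Doubling the wait each time is what keeps a struggling API from being hit at a constant rate, and the cap keeps a long retry chain from waiting indefinitely.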
The other lever is the timeout. By default a request waits up to 10 minutes, which is rarely what you want in a service that needs to respond promptly. You can tighten it per request — either with with_options(), which returns a configured client, or by passing timeout directly to the call:

    # Configure a timeout via with_options()
    # (equivalently, pass timeout=20.0 straight to messages.create())
    response = client.with_options(timeout=20.0).messages.create(
        model="claude-haiku-4-5",
        max_tokens=20,
        messages=[{"role": "user", "content": "Say hi"}],
    )
    print(response.content[0].text)

Output:

    Hi! 👋 How can I help you today?

Both forms work, and timeout is in seconds. If a request exceeds the limit, the SDK raises anthropic.APITimeoutError, which, being transient, is itself subject to the automatic retry logic. So a timeout doesn’t immediately bubble up as a failure; the SDK gives it another attempt (or several, depending on max_retries) before giving up. The total wall-clock budget for a call is therefore roughly your timeout multiplied by the number of attempts (max_retries + 1), plus the backoff waits between them, which is worth keeping in mind when you set both.
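The budget arithmetic can be written down in a few lines. This is a back-of-the-envelope sketch with a hypothetical helper name, assuming every attempt runs all the way to the timeout; the backoff values passed in are examples, not the SDK's actual waits.

```python
from typing import Sequence


def worst_case_seconds(
    timeout: float, max_retries: int, backoff: Sequence[float] = ()
) -> float:
    """Upper bound on wall-clock time if every attempt times out.

    attempts = the first try plus max_retries retries; `backoff` holds the
    assumed wait before each retry (only the first max_retries are counted).
    """
    attempts = max_retries + 1
    return timeout * attempts + sum(backoff[:max_retries])


print(worst_case_seconds(20.0, 2))              # 60.0
print(worst_case_seconds(20.0, 2, (0.5, 1.0)))  # 61.5
```

So a 20-second timeout with the default two retries can hold a caller for about a minute, which may matter more to your latency target than the timeout value alone.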
Put together, the production recipe is short: set a sensible max_retries and timeout on the client, stream long responses, and wrap the call in try/except for the failures that survive the retries. The SDK handles the rest.
The SDK already retries — don’t double-retry
Because the SDK automatically retries 429s, 5xx errors, and connection failures with exponential backoff, you usually do not need to wrap your call in your own retry loop. If you add a for loop that re-calls on every exception and leave max_retries at its default, a single rate limit can trigger your loop’s retries on top of the SDK’s — multiplying the requests and making the rate limit worse. Configure max_retries and let the SDK do the backoff; reserve your own logic for application-level decisions (like falling back to a smaller model), not raw retrying.
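A small simulation shows how quickly the request count multiplies. Everything here is a stand-in: SimulatedRateLimit, sdk_like_call, and requests_sent are hypothetical names, and the "SDK" is modeled simply as one attempt plus max_retries retries against an API that never stops rate-limiting.

```python
from typing import List


class SimulatedRateLimit(Exception):
    """Hypothetical stand-in for a 429 that never clears."""


def sdk_like_call(log: List[int], max_retries: int) -> None:
    """Model one SDK call: the first attempt plus max_retries retries, all failing."""
    for attempt in range(max_retries + 1):
        log.append(attempt)  # one HTTP request actually sent
    raise SimulatedRateLimit()


def requests_sent(outer_attempts: int, max_retries: int) -> int:
    """Count requests when a hand-rolled loop wraps the already-retrying call."""
    log: List[int] = []
    for _ in range(outer_attempts):
        try:
            sdk_like_call(log, max_retries)
        except SimulatedRateLimit:
            continue  # the hand-rolled loop simply tries again
    return len(log)


print(requests_sent(outer_attempts=1, max_retries=2))  # 3
print(requests_sent(outer_attempts=3, max_retries=2))  # 9
```

A three-pass outer loop around the default two retries turns one failing call into nine requests against an API that is already asking you to slow down.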
Practice Exercises
Exercise 1: Stream and inspect
You want to stream a response to the user but still log how many output tokens it cost. Using client.messages.stream(), how do you print the text live and get the final token count?
Hint
Iterate stream.text_stream inside the with block to print deltas as they arrive (print(text, end="", flush=True)), then call stream.get_final_message() after the loop. The returned Message has .usage.output_tokens — you get the live experience for the user and the full structured result for your logs, from one call.
Exercise 2: Match the error to the response
For each failure, name the anthropic exception you’d catch and a sensible response: (a) the API returns HTTP 429; (b) you mistype the model ID and get a 404; (c) the request never reaches the server because the network is down.
Hint
(a) anthropic.RateLimitError — back off and retry (the SDK already does this). (b) anthropic.APIStatusError with e.status_code == 404 — this is a bug in your code, so fix the model ID; don’t retry. (c) anthropic.APIConnectionError — a transient network failure the SDK will retry; surface it only if retries are exhausted.
Exercise 3: Why not write your own retry loop?
Your teammate wraps every API call in a while loop that retries on any exception. Why is that usually unnecessary, and when could it actually make reliability worse?
Hint
The SDK already retries transient failures (429, 5xx, connection errors) with exponential backoff, controlled by max_retries. A hand-rolled loop on top of that multiplies the request count on a rate limit — making the 429 worse — and may also retry non-transient errors like a 400, which can never succeed. Let the SDK retry; use your own logic only for application-level fallbacks.
Summary
Reliability is the first production concern, and the SDK gives you three levers to address it. Streaming (client.messages.stream() with text_stream and get_final_message()) cuts perceived latency and keeps long requests from hitting timeouts, while still returning the complete Message for your code. Error handling means catching the typed anthropic exceptions — RateLimitError, APITimeoutError, APIConnectionError, APIStatusError, and the base APIError — most-specific-first, so each failure gets the response it deserves. Retries are built in: the SDK automatically retries 429s, 5xx errors, and connection failures with exponential backoff (default max_retries=2, configurable), and you can set a per-request timeout to bound how long any single attempt waits. Together, that’s a call that streams for the user, survives transient failures on its own, and fails cleanly when something is genuinely wrong.
Key Concepts
- Streaming: client.messages.stream() yields text deltas via text_stream; get_final_message() returns the full Message.
- Exception hierarchy: typed errors (RateLimitError, APITimeoutError, APIConnectionError, APIStatusError) all inherit from anthropic.APIError.
- Automatic retries: 429s, 5xx, and connection errors are retried with exponential backoff; max_retries defaults to 2.
- Timeouts: set per request with with_options(timeout=...) or the timeout argument; exceeding it raises APITimeoutError.
Why This Matters
The gap between a working demo and a deployable service is mostly made of failures the demo never saw. A prototype that calls the API once, on a fast connection, with you watching, looks identical to a production service — right up until a hundred users hit it at once, the API rate-limits, and the whole thing falls over. Streaming, typed error handling, and configured retries are how you close that gap with a few lines of code instead of a custom resilience framework. Get these right and your LLM calls stop being a liability and start being something other systems can depend on.
Next Steps
Continue to Lesson 3 - Managing Cost and Tokens
Measure and control what every call costs: count tokens, track usage, and choose the cheapest model that does the job.
Continue Building Your Skills
You can now make a call that streams its output, handles the failures it might hit, and leans on the SDK’s built-in retries and timeouts to ride out transient trouble. That’s reliability — the first production concern — handled. Next you’ll tackle the second: cost. You’ll measure exactly what each call spends and learn the levers that keep the bill where it belongs.