Lesson 2 - Streaming and Error Handling

Welcome to Streaming and Error Handling

Reliability is the first thing a production app has to earn. In Lesson 1 you saw the five concerns that separate a script from a service; this lesson tackles the first of them head-on. When you call a model in a notebook, you wait for the whole answer and you watch it happen. In production, nobody is watching — a request might block for thirty seconds while a long answer generates, the network might drop mid-call, or the API might rate-limit you under load. None of that should take your app down. The good news is that the SDK already gives you most of the tools; your job is to use them deliberately.

This lesson covers three reliability levers — streaming, error handling, and retries — and shows the real, verified API patterns for each.

By the end of this lesson, you will be able to:

  • Stream a response token by token to cut perceived latency and avoid long-request timeouts
  • Catch and respond to the anthropic exception hierarchy with try/except
  • Use the SDK’s built-in automatic retries with exponential backoff, and set timeouts

You’ll build directly on the calls you already know how to make. Let’s begin.


Streaming: Don’t Make Users Wait for the Whole Answer

When you call messages.create(), the SDK sends your request and blocks until the entire response has been generated, then hands it back all at once. For a one-word reply that’s fine. For a 500-token answer, the user stares at nothing for several seconds — and worse, a long generation can run up against the SDK’s HTTP timeout and fail outright before the answer ever arrives.

Streaming fixes both problems. Instead of waiting for the full message, you receive the text in small pieces as the model produces them. The user sees output appear immediately (lower perceived latency, even though total time is the same), and because data keeps flowing, you don’t hit the request-timeout wall on long outputs.
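To make the "lower perceived latency, same total time" point concrete, here is a minimal sketch that needs no API key. fake_model is a hypothetical stand-in for a model that emits one chunk at a time, and the 10 ms delay is an arbitrary assumption; only the gap between the two measurements matters.

```python
import time
from typing import Iterator


def fake_model(chunks: list[str], delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one chunk every `delay` seconds."""
    for chunk in chunks:
        time.sleep(delay)
        yield chunk


chunks = ["Red", "\n", "Yellow", "\n", "Blue"]

start = time.perf_counter()
first_seen = None
for piece in fake_model(chunks):
    if first_seen is None:
        # With streaming, this is when the user first sees output.
        first_seen = time.perf_counter() - start
# Without streaming, the user sees nothing until this point.
total = time.perf_counter() - start

print(f"first chunk after {first_seen:.3f}s, full answer after {total:.3f}s")
```

The total is the same either way; streaming only changes how soon the first piece becomes visible.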

The SDK exposes this through client.messages.stream(), used as a context manager. Iterating stream.text_stream gives you the text deltas as they arrive:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment

with client.messages.stream(
    model="claude-haiku-4-5",
    max_tokens=80,
    messages=[{"role": "user", "content": "Count from 1 to 5, comma separated. Just the numbers."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    final = stream.get_final_message()  # the complete Message, after streaming finishes

print()
print("stop_reason:", final.stop_reason)
print("output_tokens:", final.usage.output_tokens)

Output:

1, 2, 3, 4, 5
stop_reason: end_turn
output_tokens: 17

Two things to notice. First, the text printed as it streamed — each piece appeared the moment the model generated it, which is exactly what gives a chat UI its live, responsive feel. Second, after the loop you still get the whole picture: get_final_message() returns the same complete Message object a non-streaming call would have, so you keep access to stop_reason, the full usage counts, and everything else. You stream for the user and get the structured result for your code — no tradeoff.
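If your code also needs the streamed text as a string (for logging, say), the loop body can display and collect it in one pass. The sketch below is an illustration rather than part of the SDK: relay_deltas is a hypothetical helper, and a plain list of strings stands in for stream.text_stream so the example runs offline. The chunk boundaries in the fake data are made up.

```python
import io
import sys
from typing import Iterable, TextIO


def relay_deltas(deltas: Iterable[str], sink: TextIO = sys.stdout) -> str:
    """Write each text delta to `sink` as it arrives and return the joined text.

    Hypothetical helper for illustration; in real code `deltas` would be
    `stream.text_stream`.
    """
    pieces = []
    for text in deltas:
        sink.write(text)
        sink.flush()  # push each piece out immediately, like print(..., flush=True)
        pieces.append(text)
    return "".join(pieces)


# Stand-in for stream.text_stream (assumed chunking, for demonstration only).
fake_deltas = ["1,", " 2,", " 3,", " 4,", " 5"]
buffer = io.StringIO()
full_text = relay_deltas(fake_deltas, sink=buffer)
print(full_text)
```

In practice get_final_message() already gives you the complete text, so a helper like this is mainly useful when you want to forward deltas somewhere other than stdout.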

The same pattern scales to real answers. Here it is on a slightly longer output:

with client.messages.stream(
    model="claude-haiku-4-5",
    max_tokens=120,
    messages=[{"role": "user", "content": "List three primary colors, one per line."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    final = stream.get_final_message()

print()
print("stop_reason:", final.stop_reason)
print("tokens:", final.usage.input_tokens, "in /", final.usage.output_tokens, "out")

Output:

Red
Yellow
Blue
stop_reason: end_turn
tokens: 16 in / 8 out

The rule of thumb: stream anything that might produce a long answer. It costs you nothing extra and removes a whole class of timeout failures.

Streaming is how you avoid long-request timeouts

A non-streaming request has to come back before the SDK’s HTTP timeout (10 minutes by default) elapses, and large max_tokens outputs can push right up against it. Streaming keeps data flowing the whole time, so the connection never sits idle waiting for one giant response. That’s why production code streams long or open-ended generations by default — it’s a reliability measure, not just a UI nicety.


Error Handling: Catch the Right Exception

Streaming handles the happy path gracefully. But networks fail, keys get revoked, and APIs rate-limit. When something goes wrong, the SDK raises a typed exception, and the type tells you exactly what happened — so you can respond appropriately instead of guessing from an error string.

The anthropic package defines an exception for each failure category. These all exist in the installed SDK (version 0.112.0):

Exception                       Raised when
anthropic.RateLimitError        You hit a rate limit (HTTP 429); slow down and retry
anthropic.APITimeoutError       The request took longer than your timeout (a subclass of APIConnectionError)
anthropic.APIConnectionError    The request never reached the server (network failure)
anthropic.APIStatusError        The API returned a non-success HTTP status (e.g. 400, 404, 500); parent of the status-specific errors, including RateLimitError and InternalServerError
anthropic.InternalServerError   A server-side error (HTTP 5xx); transient, retry
anthropic.APIError              The base class all of the above inherit from

The practical pattern is to wrap the call in try/except and order your handlers from most specific to least specific. Each category gets a response that fits it — back off on a rate limit, surface a clear message on a bad request, log and alert on an unexpected status:

import anthropic

client = anthropic.Anthropic()

try:
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=60,
        messages=[{"role": "user", "content": "Reply with exactly: ok"}],
    )
    print("success:", response.content[0].text)
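except anthropic.RateLimitError:
    # HTTP 429 that survived the SDK's automatic retries
    print("rate limited - backing off")
except anthropic.APIStatusError as e:
    # Any other non-success HTTP response; e.status_code says which
    print("API error:", e.status_code)
except anthropic.APIConnectionError:
    # Network failure; also catches APITimeoutError, which is its subclass
    print("connection failed")

Output:

success: ok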

The call succeeded, so we landed in the try body and printed the reply. But the structure is what matters: had the API returned a 429, the RateLimitError branch would have caught it; a malformed request (400) or bad model ID (404) would fall to the APIStatusError branch, where e.status_code tells you which; a dropped connection would hit APIConnectionError, and so would a timeout, because APITimeoutError is a subclass of APIConnectionError. Because every one of these inherits from anthropic.APIError, you can add a final except anthropic.APIError as a catch-all for anything you didn’t name specifically.

Order matters: list the specific exceptions first. If you put a broad except anthropic.APIError at the top, it would swallow rate-limit and connection errors before your specific handlers ever got a chance to run, and you’d lose the ability to treat them differently.
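The ordering rule is ordinary Python except matching, so it can be demonstrated without the SDK. In the sketch below the four classes are hypothetical local stand-ins that mirror the shape of the hierarchy (assuming, as in the SDK, that RateLimitError is a subclass of APIStatusError); they are not imports from the anthropic package.

```python
# Hypothetical stand-ins mirroring the shape of the SDK's exception hierarchy.
class APIError(Exception): ...
class APIConnectionError(APIError): ...
class APIStatusError(APIError): ...
class RateLimitError(APIStatusError): ...


def specific_first(exc: Exception) -> str:
    """Handlers ordered from most specific to least specific."""
    try:
        raise exc
    except RateLimitError:
        return "back off"
    except APIStatusError:
        return "inspect status code"
    except APIConnectionError:
        return "check network"
    except APIError:
        return "catch-all"


def broad_first(exc: Exception) -> str:
    """Broad handler first: it matches every subclass, so later handlers never run."""
    try:
        raise exc
    except APIError:
        return "catch-all"
    except RateLimitError:  # unreachable: already matched above
        return "back off"


print(specific_first(RateLimitError()))  # back off
print(broad_first(RateLimitError()))     # catch-all
```

Python picks the first except clause whose class matches, which is why the broad handler in broad_first hides the rate-limit branch entirely.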


Retries and Timeouts: Built-In Resilience

Here’s the part that saves you the most work: the SDK already retries transient failures for you. Rate limits (429), server errors (5xx), and connection problems are retried automatically with exponential backoff — each retry waits a little longer than the last, which is exactly what you want so you don’t hammer an overloaded API. The default is two retries, and you can raise it:

import anthropic

# Default client retries twice; this one retries up to five times
robust = anthropic.Anthropic(max_retries=5)
print("max_retries set:", robust.max_retries)

Output:

max_retries set: 5

What gets retried is deliberately limited to the transient failures — the ones where trying again actually helps: HTTP 429 (rate limit), 5xx (server errors), and connection errors. A 400 (bad request) or 404 (bad model ID) is not retried, because retrying a malformed request just fails again more slowly. The SDK draws that line for you.
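The two ideas here, which statuses are worth retrying and how the wait grows between attempts, can be sketched in a few lines. This is an illustration of the concept, not the SDK’s internal code: is_retryable and backoff_delays are hypothetical helpers, and the 0.5 s starting delay and 8 s cap are assumed values chosen for the example. Real implementations usually add random jitter as well, which is omitted here to keep the output deterministic.

```python
def is_retryable(status_code: int) -> bool:
    """Transient statuses this lesson focuses on: 429 and any 5xx."""
    return status_code == 429 or 500 <= status_code <= 599


def backoff_delays(max_retries: int, initial: float = 0.5, cap: float = 8.0) -> list[float]:
    """Wait before each retry: doubles every time, never exceeding `cap`.

    `initial` and `cap` are assumed values for illustration.
    """
    return [min(initial * 2 ** attempt, cap) for attempt in range(max_retries)]


print(is_retryable(429), is_retryable(503), is_retryable(400), is_retryable(404))
# True True False False
print(backoff_delays(5))
# [0.5, 1.0, 2.0, 4.0, 8.0]
```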

The other lever is the timeout. By default a request waits up to 10 minutes, which is rarely what you want in a service that needs to respond promptly. You can tighten it per request — either with with_options(), which returns a configured client, or by passing timeout directly to the call:

# Configure a timeout via with_options()
response = client.with_options(timeout=20.0).messages.create(
    model="claude-haiku-4-5",
    max_tokens=20,
    messages=[{"role": "user", "content": "Say hi"}],
)
print(response.content[0].text)

Output:

Hi! 👋 How can I help you today?

The direct form adds the same timeout=20.0 argument to the messages.create() call itself. Both forms work, and timeout is in seconds. If a request exceeds the limit, the SDK raises anthropic.APITimeoutError, which is itself subject to the automatic retry logic. So a timeout doesn’t immediately bubble up as a failure; the SDK gives it another attempt (or several, depending on max_retries) before giving up. The worst-case wall-clock budget for a call is therefore roughly your timeout multiplied by the number of attempts (max_retries + 1), plus the backoff waits between attempts, which is worth keeping in mind when you set both.
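A quick back-of-the-envelope calculation shows how timeout and max_retries combine. The sketch assumes every attempt runs to the full timeout, that the number of attempts is max_retries + 1, and that the waits between attempts double from 0.5 s up to an 8 s cap; those backoff numbers are illustrative assumptions, and worst_case_budget is a hypothetical helper, not an SDK function.

```python
def worst_case_budget(timeout: float, max_retries: int,
                      initial_delay: float = 0.5, max_delay: float = 8.0) -> float:
    """Upper bound in seconds if every attempt times out.

    Backoff parameters are assumed values for illustration.
    """
    attempts = max_retries + 1  # the first try plus each retry
    waits = sum(min(initial_delay * 2 ** n, max_delay) for n in range(max_retries))
    return attempts * timeout + waits


print(worst_case_budget(timeout=20.0, max_retries=2))  # 3 * 20 + (0.5 + 1.0) = 61.5
print(worst_case_budget(timeout=20.0, max_retries=5))  # 6 * 20 + 15.5 = 135.5
```

A 20-second timeout with five retries can therefore hold a caller for over two minutes, so set the two values together rather than one at a time.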

Put together, the production recipe is short: set a sensible max_retries and timeout on the client, stream long responses, and wrap the call in try/except for the failures that survive the retries. The SDK handles the rest.

The SDK already retries — don’t double-retry

Because the SDK automatically retries 429s, 5xx errors, and connection failures with exponential backoff, you usually do not need to wrap your call in your own retry loop. If you add a for loop that re-calls on every exception and leave max_retries at its default, a single rate limit can trigger your loop’s retries on top of the SDK’s — multiplying the requests and making the rate limit worse. Configure max_retries and let the SDK do the backoff; reserve your own logic for application-level decisions (like falling back to a smaller model), not raw retrying.
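An application-level fallback looks different from a retry loop: it tries each option once and moves on, leaving backoff to the SDK. The sketch below is one possible shape under stated assumptions. ask_with_fallback, TransientFailure, fake_call, and the model names are all hypothetical; in real code call_model would wrap client.messages.create() and the except clause would name the anthropic exceptions you consider worth a fallback.

```python
from typing import Callable, Optional, Sequence


class TransientFailure(Exception):
    """Hypothetical stand-in for an SDK error that survived its built-in retries."""


def ask_with_fallback(call_model: Callable[[str, str], str],
                      prompt: str,
                      models: Sequence[str]) -> str:
    """Try each model once, in order; move to the next only after a failure.

    There is no sleeping and no re-calling the same model: backoff stays the SDK's job.
    """
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return call_model(model, prompt)
        except TransientFailure as err:
            last_error = err
    raise RuntimeError("all models failed") from last_error


# Fake call for demonstration: the first (made-up) model name always fails.
def fake_call(model: str, prompt: str) -> str:
    if model == "primary-model":
        raise TransientFailure("overloaded")
    return f"[{model}] ok"


result = ask_with_fallback(fake_call, "Say hi", ["primary-model", "smaller-model"])
print(result)  # [smaller-model] ok
```

Each model is called at most once per request here, so the fallback adds a bounded number of calls instead of multiplying the SDK’s retries.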


Practice Exercises

Exercise 1: Stream and inspect

You want to stream a response to the user but still log how many output tokens it cost. Using client.messages.stream(), how do you print the text live and get the final token count?

Hint

Iterate stream.text_stream inside the with block to print deltas as they arrive (print(text, end="", flush=True)), then call stream.get_final_message() after the loop. The returned Message has .usage.output_tokens — you get the live experience for the user and the full structured result for your logs, from one call.

Exercise 2: Match the error to the response

For each failure, name the anthropic exception you’d catch and a sensible response: (a) the API returns HTTP 429; (b) you mistype the model ID and get a 404; (c) the request never reaches the server because the network is down.

Hint

(a) anthropic.RateLimitError: back off and retry (the SDK already does this). (b) anthropic.APIStatusError with e.status_code == 404 (anthropic.NotFoundError is the more specific subclass): this is a bug in your code, so fix the model ID; don’t retry. (c) anthropic.APIConnectionError: a transient network failure the SDK will retry; surface it only if retries are exhausted.

Exercise 3: Why not write your own retry loop?

Your teammate wraps every API call in a while loop that retries on any exception. Why is that usually unnecessary, and when could it actually make reliability worse?

Hint

The SDK already retries transient failures (429, 5xx, connection errors) with exponential backoff, controlled by max_retries. A hand-rolled loop on top of that multiplies the request count on a rate limit — making the 429 worse — and may also retry non-transient errors like a 400, which can never succeed. Let the SDK retry; use your own logic only for application-level fallbacks.


Summary

Reliability is the first production concern, and the SDK gives you three levers to address it. Streaming (client.messages.stream() with text_stream and get_final_message()) cuts perceived latency and keeps long requests from hitting timeouts, while still returning the complete Message for your code. Error handling means catching the typed anthropic exceptions — RateLimitError, APITimeoutError, APIConnectionError, APIStatusError, and the base APIError — most-specific-first, so each failure gets the response it deserves. Retries are built in: the SDK automatically retries 429s, 5xx errors, and connection failures with exponential backoff (default max_retries=2, configurable), and you can set a per-request timeout to bound how long any single attempt waits. Together, that’s a call that streams for the user, survives transient failures on its own, and fails cleanly when something is genuinely wrong.

Key Concepts

  • Streaming: client.messages.stream() yields text deltas via text_stream; get_final_message() returns the full Message.
  • Exception hierarchy: typed errors (RateLimitError, APITimeoutError, APIConnectionError, APIStatusError) all inherit from anthropic.APIError.
  • Automatic retries: 429s, 5xx, and connection errors are retried with exponential backoff; max_retries defaults to 2.
  • Timeouts: set per request with with_options(timeout=...) or the timeout argument; exceeding it raises APITimeoutError.

Why This Matters

The gap between a working demo and a deployable service is mostly made of failures the demo never saw. A prototype that calls the API once, on a fast connection, with you watching, looks identical to a production service — right up until a hundred users hit it at once, the API rate-limits, and the whole thing falls over. Streaming, typed error handling, and configured retries are how you close that gap with a few lines of code instead of a custom resilience framework. Get these right and your LLM calls stop being a liability and start being something other systems can depend on.


Next Steps

Continue to Lesson 3 - Managing Cost and Tokens

Measure and control what every call costs: count tokens, track usage, and choose the cheapest model that does the job.

Back to Module Overview

Return to the Shipping AI Applications module overview


Continue Building Your Skills

You can now make a call that streams its output, handles the failures it might hit, and leans on the SDK’s built-in retries and timeouts to ride out transient trouble. That’s reliability — the first production concern — handled. Next you’ll tackle the second: cost. You’ll measure exactly what each call spends and learn the levers that keep the bill where it belongs.