Lesson 3 - Retries, Timeouts, and Cost Control
Welcome to Retries, Timeouts, and Cost Control
Guardrails protected the edges of your agent: what goes in, what comes out. This lesson protects the middle: the calls it makes while it runs. Two things go wrong there, and they pull in opposite directions. A call can fail through no fault of yours (an API times out, a rate limit hits), and you want to try again. Or a run can go long (a confused agent loops and loops), and you want to make it stop. Robustness says “try harder”; affordability says “know when to quit.” Both are about bounding a run, and both are a few lines of pure Python wrapped around the loop you already have.
By the end of this lesson, you will be able to:
- Retry transient failures with exponential backoff and a hard attempt cap
- Tell transient errors (retry) from permanent ones (fail fast)
- Bound spending with a token budget that raises when the cap is crossed
- See how a step cap and a token budget complement each other around the loop
Let’s start with robustness — recovering from failures that aren’t your fault.
Retries with Exponential Backoff
Some failures are transient: they’ll succeed if you just try again in a moment. A model API returns a 529 “overloaded,” a request times out, a rate limit (429) kicks in. Surfacing those to the user as a hard failure is a waste — the fix is to wait a beat and retry. But two rules matter. First, back off: wait longer each time, so a struggling service gets room to recover instead of being hammered. Second, cap the attempts: never retry forever, or one bad call becomes a hung request.
Here’s the pattern. It takes the function to call, a max_attempts cap, a base_delay, and — so it’s testable — an injected sleep:

    class TransientError(Exception):
        pass

    def with_retries(fn, *, max_attempts=3, base_delay=0.5, sleep, transient=(TransientError,)):
        """Call fn(); on a transient error, back off (base*2**n) and retry, up to a cap."""
        for attempt in range(1, max_attempts + 1):
            try:
                return fn()
            except transient:
                if attempt == max_attempts:
                    raise  # out of attempts: let the caller see the failure
                sleep(base_delay * (2 ** (attempt - 1)))

Read the delay formula: after attempt 1 fails it waits base_delay * 2**0, after attempt 2 fails it waits base_delay * 2**1, and so on, so each wait doubles. On the last attempt it doesn’t sleep; it raises, so the caller learns the call truly failed. Crucially, only errors in transient are caught. A permanent error (bad input, a bad API key) is not a TransientError, so it propagates immediately and fails fast, exactly as it should. Retrying an auth failure five times just delays the same failure by five round trips.
This is verified deterministically — no API key, no waiting. A function that fails twice then succeeds returns "ok" on attempt 3, and because sleep is injected as a list’s append, the test captures the exact delays without ever pausing:

    delays = []
    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        if calls["n"] < 3:  # fail twice, then succeed
            raise TransientError("temporary")
        return "ok"

    out = with_retries(flaky, max_attempts=5, base_delay=0.5, sleep=delays.append)
    # out == "ok"; calls["n"] == 3; delays == [0.5, 1.0] (exponential backoff)

And a function that always fails proves the cap holds: it raises after exactly 3 attempts, with delays [1.0, 2.0]; it does not retry forever.

    d2, c2 = [], {"n": 0}

    def dead():
        c2["n"] += 1
        raise TransientError("down")

    # with_retries(dead, max_attempts=3, base_delay=1.0, sleep=d2.append) raises TransientError
    # after c2["n"] == 3 attempts, having slept d2 == [1.0, 2.0]

Injecting sleep is why this is instantly testable: swap the list’s append for real time.sleep in production, and the same code waits for real.
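
Because each wait doubles, the attempt cap also bounds how long a doomed call can stall the caller. As a rough sketch (the helper name backoff_schedule is hypothetical and not part of the lesson's code), the worst-case sleeps for a given base_delay and max_attempts follow directly from the delay formula:

```python
def backoff_schedule(base_delay, max_attempts):
    """Sleeps with_retries would perform if every attempt failed (hypothetical helper)."""
    # One sleep after each failed attempt except the last, which raises instead.
    return [base_delay * (2 ** (attempt - 1)) for attempt in range(1, max_attempts)]

schedule = backoff_schedule(1.0, 3)
total_wait = sum(schedule)
print(schedule, total_wait)  # [1.0, 2.0] 3.0
```

Summing the schedule before choosing max_attempts and base_delay is a quick way to check that the worst case still fits inside whatever deadline the caller has.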
Only retry what’s worth retrying
Backoff and a cap only help if you retry the right errors. Transient: timeouts, rate limits, 429, 529 “overloaded” — a second try genuinely might work. Permanent: bad input, invalid API key, a malformed request — a second try will fail identically, so retrying just burns time and money. The transient tuple is the switch: everything in it retries with backoff; everything else fails fast. Get that classification right and retries are pure upside.
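
One way to keep that classification in a single place is a small wrapper that turns retry-worthy failures into TransientError and lets everything else pass through untouched. The sketch below is illustrative only: ApiError, TRANSIENT_STATUSES, and mark_transient are hypothetical names, and the status set simply mirrors the examples above. A real SDK raises its own exception classes, which could instead be listed directly in the transient tuple.

```python
class TransientError(Exception):
    """Marker exception; redefined here only so the sketch runs on its own."""

class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

# Assumed classification, taken from the lesson's examples.
TRANSIENT_STATUSES = {429, 529}

def mark_transient(fn):
    """Wrap fn so retry-worthy failures surface as TransientError; others pass through."""
    def wrapped():
        try:
            return fn()
        except TimeoutError as exc:
            raise TransientError("timeout") from exc
        except ApiError as exc:
            if exc.status in TRANSIENT_STATUSES:
                raise TransientError(str(exc)) from exc
            raise  # permanent (e.g. 400 or 401): fail fast
    return wrapped

def overloaded():
    raise ApiError(529)

try:
    mark_transient(overloaded)()
except TransientError as exc:
    print("retry:", exc)
```

Passing mark_transient(call) to with_retries would then retry the 429, 529, and timeout cases while an invalid key or malformed request still propagates on the first attempt.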
Cost Control with a Token Budget
Robustness makes a run survive; affordability makes sure it can’t ruin you. Every model call spends tokens, and tokens are money. A confused agent that loops ten extra times, or takes an expensive path, quietly runs up the bill — and without a cap, one bad run can cost a hundred times a normal one. The fix is a budget: track what you’ve spent and stop the moment you cross the cap.

    class BudgetExceeded(Exception):
        pass

    class TokenBudget:
        def __init__(self, limit):
            self.limit = limit
            self.spent = 0

        def charge(self, usage_tokens):
            self.spent += usage_tokens
            if self.spent > self.limit:
                raise BudgetExceeded(f"spent {self.spent} > limit {self.limit}")
            return self.spent

It’s tiny on purpose. After each model call you charge the tokens that call used, spent accumulates, and the instant it crosses limit you get a BudgetExceeded: a clean, catchable signal that the run has hit its ceiling. Verified deterministically: charging 400 then 400 leaves spent at 800 of 1000; the next 300 pushes it to 1100, crosses the cap, and raises.

    budget = TokenBudget(limit=1000)
    budget.charge(400)
    budget.charge(400)  # spent == 800, under the cap
    # budget.charge(300) -> spent becomes 1100 > 1000 -> raises BudgetExceeded

A token budget pairs with the max_steps cap from the agent loop (Modules 2 and 4). They bound different things: max_steps limits how many iterations the loop runs, while the token budget limits total spend. A run could stay well under its step cap and still blow the budget on a few enormous calls, or take many cheap steps and stay affordable. Use both. And remember the biggest cost lever of all is model choice: a cheap, fast claude-haiku-4-5 for routine steps versus a larger model only where you truly need the extra capability. Budgets cap the damage; model choice sets the baseline.
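
To make the difference between the two caps concrete, here is a small simulation. It is only a sketch: run_bounded is a hypothetical helper, the per-step token costs are made-up numbers, and TokenBudget is repeated so the block runs on its own.

```python
class BudgetExceeded(Exception):
    pass

class TokenBudget:
    def __init__(self, limit):
        self.limit = limit
        self.spent = 0

    def charge(self, usage_tokens):
        self.spent += usage_tokens
        if self.spent > self.limit:
            raise BudgetExceeded(f"spent {self.spent} > limit {self.limit}")
        return self.spent

def run_bounded(step_costs, *, max_steps, token_limit):
    """Simulate a loop where each entry in step_costs is the tokens one step uses."""
    budget = TokenBudget(token_limit)
    for step, cost in enumerate(step_costs, start=1):
        if step > max_steps:
            return f"stopped by step cap after {max_steps} steps"
        try:
            budget.charge(cost)
        except BudgetExceeded:
            return f"stopped by token budget at step {step}"
    return f"finished in {len(step_costs)} steps, {budget.spent} tokens"

# A few enormous calls: well under the step cap, but over budget.
print(run_bounded([3000] * 4, max_steps=20, token_limit=10_000))
# Many cheap calls: under budget, but the step cap ends the run.
print(run_bounded([100] * 30, max_steps=20, token_limit=10_000))
# A normal run trips neither cap.
print(run_bounded([100] * 10, max_steps=20, token_limit=10_000))
```

The first run is stopped by the token budget at step 4 and the second by the step cap after 20 steps, which is the sense in which neither cap can stand in for the other.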
Wrapping the Loop for Real
Put the two together and the shape is the same as every other pillar: you don’t rebuild the agent, you wrap it. The model call — the flaky, expensive part — goes inside with_retries, and each response’s usage is charged to the TokenBudget. In a real, API-backed run it looks like this:

    import time
    from anthropic import Anthropic, APITimeoutError, RateLimitError

    client = Anthropic()
    budget = TokenBudget(limit=20_000)

    def call_model():
        return client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=512,
            messages=[{"role": "user", "content": "Plan a day in Kyoto."}],
        )

    # Retry the flaky call with backoff, then charge what it cost.
    # The SDK raises its own exception types, so list the retry-worthy ones in transient.
    resp = with_retries(
        call_model, max_attempts=3, base_delay=1.0, sleep=time.sleep,
        transient=(TransientError, RateLimitError, APITimeoutError),
    )
    budget.charge(resp.usage.input_tokens + resp.usage.output_tokens)

The retry wraps around the flaky model or tool call; the budget is checked as the loop spends. Note the transient tuple: with_retries only catches what you list, and the SDK raises its own exception classes rather than TransientError, so the rate-limit and timeout errors are passed in explicitly; add the SDK’s class for overloaded (529) responses per its documentation. Do this at each step of your run_agent loop, and the whole run is both robust (transient blips recover automatically) and affordable (it stops as soon as a charge crosses the cap). The retry and budget logic above are pure Python and verified for real, with no API key needed. The messages.create call is shown as correct production code; there’s no ANTHROPIC_API_KEY in this environment, so the live call doesn’t actually run here.
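
Since the live call can't run without a key, the per-step wrapping can still be exercised end to end with a stand-in model. This is a minimal sketch under stated assumptions: FakeModel and this run_agent are hypothetical, the fake fails every odd attempt and reports a fixed token count, and the loop stops only on a cap, whereas a real loop would also stop when the agent produces a final answer. The lesson's helpers are repeated so the block runs on its own.

```python
from types import SimpleNamespace

class TransientError(Exception):
    pass

class BudgetExceeded(Exception):
    pass

def with_retries(fn, *, max_attempts=3, base_delay=0.5, sleep, transient=(TransientError,)):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except transient:
            if attempt == max_attempts:
                raise
            sleep(base_delay * (2 ** (attempt - 1)))

class TokenBudget:
    def __init__(self, limit):
        self.limit = limit
        self.spent = 0

    def charge(self, usage_tokens):
        self.spent += usage_tokens
        if self.spent > self.limit:
            raise BudgetExceeded(f"spent {self.spent} > limit {self.limit}")
        return self.spent

class FakeModel:
    """Stand-in for an API client: every odd attempt fails, every even one answers."""
    def __init__(self, tokens_per_call):
        self.tokens_per_call = tokens_per_call
        self.attempts = 0

    def __call__(self):
        self.attempts += 1
        if self.attempts % 2 == 1:
            raise TransientError("overloaded")
        usage = SimpleNamespace(input_tokens=self.tokens_per_call, output_tokens=0)
        return SimpleNamespace(usage=usage)

def run_agent(model, *, max_steps, budget, sleep):
    """Hypothetical loop: retry each model call, then charge its usage."""
    steps_done = 0
    try:
        for _ in range(max_steps):
            resp = with_retries(model, max_attempts=3, base_delay=1.0, sleep=sleep)
            budget.charge(resp.usage.input_tokens + resp.usage.output_tokens)
            steps_done += 1
    except BudgetExceeded:
        return steps_done, "budget exceeded"
    return steps_done, "step cap reached"

delays = []
budget = TokenBudget(limit=1000)
result = run_agent(FakeModel(tokens_per_call=400), max_steps=10, budget=budget, sleep=delays.append)
print(result, budget.spent, delays)  # (2, 'budget exceeded') 1200 [1.0, 1.0, 1.0]
```

One thing the run makes visible: because charge happens after the call returns, spend ends at 1200 against a limit of 1000. The budget stops the run on the call that crosses the cap, so actual spend can overshoot by up to one call's worth.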
Practice Exercises
Exercise 1: Compute the backoff delays
With base_delay=0.5 and a call that fails three times before succeeding on the fourth attempt, what delays does with_retries sleep, and how many total attempts run?
Hint
The delay after failed attempt n is base_delay * 2**(n-1), and there’s no sleep before attempt 1 or after the final success. Failing on attempts 1, 2, and 3 means sleeping [0.5, 1.0, 2.0], then attempt 4 succeeds, for four attempts total. Each wait doubles the last.
Exercise 2: Retry or fail fast?
For each, decide whether with_retries should retry: (a) a 529 “overloaded” from the model API; (b) an invalid API key; (c) a request timeout; (d) a malformed-input 400.
Hint
Retry the transient ones — (a) overloaded and (c) timeout — because a second try genuinely might succeed. Fail fast on the permanent ones — (b) bad key and (d) bad input — because retrying will fail identically and just waste time and money. Only the transient errors belong in the transient tuple.
Exercise 3: Steps vs. tokens
A run finishes in 4 of its allowed 20 steps but raises BudgetExceeded. What happened, and why does a step cap alone not catch it?
Hint
A few very large calls spent more tokens than the budget allows, even though the loop barely iterated. max_steps bounds the number of iterations, not the cost per iteration — so a short run with huge calls slips past it. The token budget catches exactly that, which is why the two caps are complementary rather than redundant.
Summary
Two forces bound a run, and they pull opposite ways. Robustness says retry: with_retries calls a function, and on a transient error (timeout, rate limit, 429/529) it waits base_delay * 2**(attempt-1) — exponential backoff — and tries again, up to a hard max_attempts cap, then raises. Only transient errors retry; permanent ones (bad input, auth) fail fast. It’s verified deterministically: fail-twice-then-succeed returns "ok" on attempt 3 with delays [0.5, 1.0], and always-fails raises after 3 attempts with [1.0, 2.0] — instantly testable because sleep is injected. Affordability says stop: TokenBudget.charge accumulates spent and raises BudgetExceeded once the cap is crossed (400 + 400 leaves 800/1000; +300 raises). A step cap and a token budget are complementary — steps bound loop length, tokens bound total spend — and model choice (cheap claude-haiku-4-5 vs. a bigger model) is the other big cost lever. Both wrap the agent loop: retry around a flaky model or tool call, budget checked as it spends.
Key Concepts
- Exponential backoff + cap: wait base*2**(n-1) and retry, but never past max_attempts, then raise.
- Transient vs. permanent: retry timeouts and rate limits; fail fast on bad input and auth.
- Token budget: accumulate spent; raise BudgetExceeded the moment the cap is crossed.
- Steps and tokens complement: a step cap bounds iterations; a token budget bounds spend.
Why This Matters
This is the difference between an agent that survives a real network and a real bill, and one that doesn’t. Retries turn a random 529 into a non-event instead of a failed request or a 3 a.m. page; backoff keeps you from making an overloaded service worse; the attempt cap keeps a retry from becoming a hang. On the other side, a token budget and a step cap are what stand between a confused loop and a surprise invoice — and picking the right model for each step keeps the baseline cheap before any cap is hit. Best of all, every one of these is a few lines wrapped around code you already wrote, added exactly where the risk is. Next, you’ll build the pillar that measures whether the whole thing still works as you change it.
Next Steps
Continue to Lesson 4 - Evaluating Agents
Turn quality into a number: build a test set and score the agent with an LLM as judge to catch regressions.
Back to Module Overview
Return to the Reliability and Evaluation module overview
Continue Building Your Skills
You can now make an agent both robust and affordable — retries with exponential backoff and an attempt cap so transient failures recover instead of crashing, and a token budget alongside the step cap so no run can run away with your bill. Both wrap the loop rather than replacing it. Next you’ll build the final protective pillar: evaluation, where quality becomes a number you track against a test set, so you can change the agent with confidence instead of hope.