Lesson 5 - Guided Project: Deploying an AI App
Welcome to Deploying an AI App
This is it — the final project of the course. Over the last four lessons you took apart the five concerns that separate a script from a shipped product: reliability, cost, security, serving, and observability. Now you’ll put them back together into a single thing you can actually run: a small FastAPI service that wraps a Claude call and applies all five concerns at once. Nothing here is a new AI idea. It’s the production shell — the boring, dependable layer — that lets the AI work you’ve already done face real users. Build it once, understand it deeply, and you can wrap anything in it.
We’ll grow the service one stage at a time, verifying each stage with a test client before moving on, and finish with the exact files you need to deploy it.
By the end of this project, you will be able to:
- Build a FastAPI service with validated request/response schemas and a health check
- Map upstream API failures to correct HTTP status codes and compute per-request cost from usage
- Log tokens, cost, and latency for every request, with built-in retries for reliability
- Run the service with uvicorn and pin its dependencies for deployment
Let’s assemble it.
New to FastAPI? Learn it first
This project uses FastAPI but doesn’t teach it from the ground up — we focus on the production concerns, not the framework basics. If terms like path operations, Pydantic request/response models, dependency injection, or uvicorn are new to you, work through our dedicated FastAPI course first. It takes you from your very first endpoint to a complete, tested, deployable API — validation, a database, auth, and uvicorn/Docker — exactly the foundation this project assumes. Come back here and this lesson will click.
Stage 1: Schemas, App, and a Health Check
Start with the skeleton. Before there’s any AI in the picture, a production service needs three things: a way to describe what it accepts, a way to describe what it returns, and a way for infrastructure to ask “are you alive?”
Pydantic models give us the first two. AskRequest declares that a request must carry a non-empty question no longer than 2,000 characters — that single Field constraint is your input validation, and it runs before any of your code does. AskResponse declares exactly what callers get back, so the shape of your service is a contract, not a surprise. The /health endpoint is the part orchestration systems (load balancers, Kubernetes, uptime monitors) hit constantly to decide whether to send you traffic.
import os, time, logging, anthropic
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from fastapi.testclient import TestClient
client = anthropic.Anthropic(max_retries=3) # key from env; built-in retry/backoff
app = FastAPI(title="Ask Claude")
class AskRequest(BaseModel):
    question: str = Field(min_length=1, max_length=2000)

class AskResponse(BaseModel):
    answer: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

@app.get("/health")
def health():
    return {"status": "ok"}
Notice the very first line of real work: anthropic.Anthropic(max_retries=3). We never pass a key; the SDK reads ANTHROPIC_API_KEY from the environment. That’s the security concern handled at the source: the credential lives in the environment, never in this file, never in your repository, never in a log. The max_retries=3 is the reliability concern, baked in before we write a single endpoint; we’ll come back to what it does.
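Because the key lives only in the environment, a missing variable may not surface until the first request is made. The following is a minimal sketch of failing fast at startup instead. Both `require_env` and `mask` are hypothetical helpers introduced here for illustration; they are not part of the Anthropic SDK or FastAPI.

```python
import os
from typing import Mapping


def require_env(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the value of `name`, or raise a clear error if it is unset or empty."""
    value = env.get(name, "").strip()
    if not value:
        # Report only the variable name; never echo secret values into errors or logs.
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hypothetical helper: keep only the last few characters, for safe log output."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

One way to use it would be to call `require_env("ANTHROPIC_API_KEY")` once at import time, before constructing the client, so a misconfigured deployment stops at startup rather than on the first user request.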
A health check sounds trivial, but it earns its keep. Here’s why infrastructure cares about it:
tc = TestClient(app)
print("HEALTH:", tc.get("/health").json())
HEALTH: {'status': 'ok'}
That {'status': 'ok'} is the heartbeat. A load balancer that gets it keeps routing requests to this instance; one that doesn’t (timeout, a 500, a crashed process) pulls the instance out of rotation automatically. You get self-healing deployments for the price of three lines.
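To make the load balancer's side of that exchange concrete, here is a rough sketch of a probe written with the standard library. It assumes the service answers `GET /health` with HTTP 200 when it is alive; `is_healthy`, the two-second timeout, and the injectable `opener` parameter are illustrative choices, not how any particular load balancer is implemented.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 2.0, opener=urllib.request.urlopen) -> bool:
    """Return True only if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError, ConnectionError):
        # A refused connection, a timeout, or an HTTP error status all count as unhealthy.
        return False
```

A real orchestrator usually also requires several consecutive failures before removing an instance, so a single slow response does not take a healthy process out of rotation.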
Stage 2: The /ask Endpoint — Errors and Cost
Now the core. The /ask endpoint takes a validated AskRequest, calls Claude, and returns an AskResponse. Two production concerns live here that a prototype skips entirely: error mapping and cost.
A prototype lets an exception bubble up and crash with a 500 for everything. A service translates each failure into the right signal. A rate limit is the caller’s problem to retry, so it becomes a 429. An upstream API failure is a bad gateway, so it becomes a 502. The caller can now react sensibly instead of guessing.
Cost is the other addition. Every response carries a usage object with input_tokens and output_tokens. With Haiku priced at $1 per million input tokens and $5 per million output tokens, turning that into dollars is arithmetic — and returning it on every response means cost is never a mystery you discover at the end of the month.
INPUT_PRICE_PER_MTOK = 1.0
OUTPUT_PRICE_PER_MTOK = 5.0
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    # Prices are per million tokens, so scale the counts down by 1e6.
    cost = (input_tokens / 1e6) * INPUT_PRICE_PER_MTOK \
         + (output_tokens / 1e6) * OUTPUT_PRICE_PER_MTOK
    return round(cost, 8)

@app.post("/ask", response_model=AskResponse)
def ask(req: AskRequest):
    try:
        resp = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=200,
            system="Answer concisely and accurately.",
            messages=[{"role": "user", "content": req.question}],
        )
    # Most specific exception first: RateLimitError is a subclass of APIError.
    except anthropic.RateLimitError:
        raise HTTPException(429, "rate limited, please retry shortly")
    except anthropic.APIError as e:
        raise HTTPException(502, f"upstream error: {e}")
    u = resp.usage
    cost = estimate_cost(u.input_tokens, u.output_tokens)
    return AskResponse(
        answer=resp.content[0].text,
        input_tokens=u.input_tokens,
        output_tokens=u.output_tokens,
        cost_usd=cost,
    )
The model call itself is deliberately simple: a concise, grounded Q&A. That’s the point: the production shell doesn’t care what’s inside the try block. We’ll return to swapping in something bigger shortly.
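The cost arithmetic from this stage can be checked by hand. The sketch below repeats `estimate_cost` with the lesson's prices so it runs on its own, then applies it to a request with 25 input tokens and 65 output tokens. The `monthly_cost` helper and the traffic figure of 10,000 requests per day are assumptions added purely for illustration.

```python
INPUT_PRICE_PER_MTOK = 1.0   # USD per million input tokens (price quoted in the lesson)
OUTPUT_PRICE_PER_MTOK = 5.0  # USD per million output tokens (price quoted in the lesson)


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    cost = (input_tokens / 1e6) * INPUT_PRICE_PER_MTOK \
         + (output_tokens / 1e6) * OUTPUT_PRICE_PER_MTOK
    return round(cost, 8)


def monthly_cost(requests_per_day: int, avg_in: int, avg_out: int, days: int = 30) -> float:
    """Hypothetical projection: average per-request cost times request volume."""
    return round(estimate_cost(avg_in, avg_out) * requests_per_day * days, 2)


# 25 input tokens  -> 25 / 1e6 * $1 = $0.000025
# 65 output tokens -> 65 / 1e6 * $5 = $0.000325
per_request = estimate_cost(25, 65)
print(f"per request: ${per_request:.8f}")
print(f"per month at 10,000 requests/day: ${monthly_cost(10_000, 25, 65):.2f}")
```

Most of the per-request figure comes from the output side: 65 output tokens cost thirteen times as much as the 25 input tokens, which is why a `max_tokens` limit is also a cost control.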
Stage 3: Observability and Reliability
The service works, but right now it’s a black box — you’d have no idea what it’s doing in production. The observability concern fixes that: log the numbers that matter on every request. We add a logger, time the call, and record tokens, cost, and latency.
These four numbers (input tokens, output tokens, cost, and latency) answer the questions you’ll actually ask in production. The token counts tell you what is expensive. Cost tells you how much. Latency tells you whether the service feels fast or slow to users. Logged together on every request, they’re the difference between “the bill went up and I don’t know why” and “requests carrying long context spiked yesterday at 3 p.m.”
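As a sketch of how those per-request records become answers, the snippet below aggregates log lines into totals. It assumes each record looks like `ask in=25 out=65 cost=$0.00035000 latency=2325ms`, the shape this lesson's logger emits; `LOG_PATTERN` and `summarize` are hypothetical names.

```python
import re
from statistics import mean

# Matches lines shaped like: ask in=25 out=65 cost=$0.00035000 latency=2325ms
LOG_PATTERN = re.compile(
    r"ask in=(?P<inp>\d+) out=(?P<out>\d+) cost=\$(?P<cost>[\d.]+) latency=(?P<ms>\d+)ms"
)


def summarize(lines):
    """Aggregate per-request log lines; lines that do not match are skipped."""
    rows = [m.groupdict() for line in lines if (m := LOG_PATTERN.search(line))]
    if not rows:
        return {"requests": 0, "total_cost_usd": 0.0,
                "avg_latency_ms": 0.0, "max_output_tokens": 0}
    return {
        "requests": len(rows),
        "total_cost_usd": round(sum(float(r["cost"]) for r in rows), 8),
        "avg_latency_ms": mean(int(r["ms"]) for r in rows),
        "max_output_tokens": max(int(r["out"]) for r in rows),
    }
```

In a real deployment a log platform would typically do this aggregation for you; the point is only that a consistent, machine-readable line makes it possible.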
Reliability is already in place from Stage 1 — max_retries=3 means the SDK automatically retries transient failures (network blips, 429s, 5xx errors) with exponential backoff before giving up. Your endpoint only sees the error classes if every retry fails. That’s why the try/except and the retry policy work together rather than overlap: retries handle the recoverable cases silently, and your error mapping handles what’s left.
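To show the idea behind retrying with exponential backoff, here is a simplified stand-alone sketch. It is not the SDK's actual implementation: `TransientError`, `call_with_retries`, the delay values, and the jitter range are all assumptions chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as a network blip, a 429, or a 5xx."""


def call_with_retries(fn, max_retries=3, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(); on TransientError, retry up to max_retries times with backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: the caller's error mapping takes over
            # Double the wait on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads retries out so many clients do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

With `max_retries=3` the function makes at most four attempts in total, which mirrors the division of labor described above: recoverable failures are absorbed inside the loop, and only the final failure reaches the `except` clauses in the endpoint.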
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("ask-claude")
# This replaces the Stage 2 version of ask(). Keep only one /ask definition in app.py:
# routes match in registration order, so an earlier definition would shadow this one.
@app.post("/ask", response_model=AskResponse)
def ask(req: AskRequest):
    start = time.perf_counter()
    try:
        resp = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=200,
            system="Answer concisely and accurately.",
            messages=[{"role": "user", "content": req.question}],
        )
    except anthropic.RateLimitError:
        logger.warning("rate limited by upstream")
        raise HTTPException(429, "rate limited, please retry shortly")
    except anthropic.APIError as e:
        logger.error("upstream error: %s", e)
        raise HTTPException(502, f"upstream error: {e}")
    # Latency covers the whole upstream call, including any SDK retries.
    latency_ms = (time.perf_counter() - start) * 1000
    u = resp.usage
    cost = estimate_cost(u.input_tokens, u.output_tokens)
    logger.info(
        "ask in=%d out=%d cost=$%.8f latency=%.0fms",
        u.input_tokens, u.output_tokens, cost, latency_ms,
    )
    return AskResponse(
        answer=resp.content[0].text,
        input_tokens=u.input_tokens,
        output_tokens=u.output_tokens,
        cost_usd=cost,
    )
One more reliability lever worth knowing: for long answers, you can stream instead of blocking. The SDK’s client.messages.stream(...) context manager yields text as it’s generated through text_stream, and get_final_message() gives you the complete response (with usage) at the end, so you can start showing output immediately while still logging cost. We keep the blocking call here for clarity, but streaming is a one-method swap when responses get long.
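The streaming calls named above can be sketched as a small function. It assumes `client` is the `anthropic.Anthropic` instance from Stage 1; `stream_answer` and the `on_text` callback are hypothetical names, and the sketch only forwards text and returns token counts. It does not wire the stream into an HTTP response.

```python
def stream_answer(client, question, on_text, model="claude-haiku-4-5", max_tokens=200):
    """Forward text chunks to on_text, then return (input_tokens, output_tokens)."""
    with client.messages.stream(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": question}],
    ) as stream:
        for text in stream.text_stream:
            on_text(text)  # hand each chunk on as soon as it arrives
        # The final message carries usage, so cost can still be logged.
        final = stream.get_final_message()
    usage = final.usage
    return usage.input_tokens, usage.output_tokens
```

Passing `print` as `on_text` would echo the answer to the console as it is generated; the returned counts can then go through `estimate_cost` exactly as in the blocking endpoint.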
Stage 4: Verify End to End
Now run the whole thing. We use FastAPI’s TestClient, which exercises the real app in-process — no server to launch, but every route, schema, and exception handler runs exactly as it would in production. We hit the health check, send a real question, and send an empty one to confirm validation rejects it.
tc = TestClient(app)
print("HEALTH:", tc.get("/health").json())
r = tc.post("/ask", json={"question": "Name one benefit of using a health-check endpoint."})
print("ASK:", r.json())
print("EMPTY STATUS:", tc.post("/ask", json={"question": ""}).status_code)
Here is the real output, including the log lines the service emits as it handles each request:
INFO HTTP Request: GET http://testserver/health "HTTP/1.1 200 OK"
INFO HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO ask in=25 out=65 cost=$0.00035000 latency=2325ms
INFO HTTP Request: POST http://testserver/ask "HTTP/1.1 200 OK"
INFO HTTP Request: POST http://testserver/ask "HTTP/1.1 422 Unprocessable Entity"
HEALTH: {'status': 'ok'}
ASK: {'answer': '# Health-Check Endpoint Benefit\n\n**Load balancing and service reliability**: A health-check endpoint allows load balancers and orchestration systems (like Kubernetes) to automatically detect when a service instance is unhealthy and route traffic away from it, ensuring requests only go to functioning instances.', 'input_tokens': 25, 'output_tokens': 65, 'cost_usd': 0.00035}
EMPTY STATUS: 422
Read that output as a checklist of all five concerns working together:
- Serving: /health returns {'status': 'ok'}, and /ask returns a structured AskResponse matching its schema.
- Observability: the ask in=25 out=65 cost=$0.00035000 latency=2325ms line is the per-request record you’ll grep, graph, and alert on.
- Cost: that request cost $0.00035, computed and returned, not guessed.
- Security: the call to api.anthropic.com succeeded using a key the code never saw, never printed, and never stored.
- Reliability: the request that did succeed was protected by the SDK’s automatic retries; the validation layer returned a clean 422 for the empty question instead of wasting a paid API call on garbage input.
That last point is worth pausing on: the empty question never reached Claude. Pydantic’s Field(min_length=1) rejected it with a 422 before your endpoint body ran. Bad input gets cheap, fast, correct failures.
Running it for real
TestClient is for verification. To serve real traffic, put the code in a file named app.py and run a real server:
export ANTHROPIC_API_KEY="your-key-here" # set in the environment, never in code
uvicorn app:app --host 0.0.0.0 --port 8000
The app:app means “in the module app, serve the object named app.” Your service is now listening; POST to http://localhost:8000/ask with a JSON body, and GET /health for the heartbeat.
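Once the server is running, any HTTP client can call it. Below is a minimal standard-library sketch of such a client, assuming the service is listening on `http://localhost:8000` as in the command above; `build_ask_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_ask_request(question: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /ask request with a JSON body matching AskRequest."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, base_url: str = "http://localhost:8000", timeout: float = 30.0) -> dict:
    """Send the question to a running service and return the decoded JSON response."""
    with urllib.request.urlopen(build_ask_request(question, base_url), timeout=timeout) as resp:
        return json.load(resp)
```

With `urllib`, non-2xx responses raise `urllib.error.HTTPError`, so a 422 from validation or a 429 from rate limiting reaches the caller as an exception carrying the status code.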
Pinning dependencies
A deployment needs to know exactly what to install. Create a requirements.txt:
anthropic==0.112.0
fastapi==0.138.1
uvicorn
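httpx==0.28.1
Pinning versions means the build that works on your machine works on the server. Note that uvicorn is listed above without a version; pin it too, using the version installed in your environment (pip freeze reports it). Anyone (or any CI pipeline) can then run pip install -r requirements.txt, set the environment variable, and start the exact service you built.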
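A small check can catch entries that lack an exact pin before they reach a deployment. This is a rough sketch that only looks for `==` on each line; `unpinned` is a hypothetical helper, and it does not handle the full requirements-file syntax (extras, markers, `-r` includes).

```python
def unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that lack an exact '==' version pin."""
    loose = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if line and "==" not in line:
            loose.append(line)
    return loose
```

Run against the file shown above, it would report `uvicorn` as the only entry without a pinned version.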
This shell wraps any LLM system you’ve built
The model call inside /ask is intentionally the simplest thing — a single concise Q&A. But the production shell around it doesn’t care what’s inside the try block. Drop in your RAG pipeline from Module 7 (retrieve, then answer), your tool-using agent from Module 8, or your LangGraph workflow from Module 9, and everything else stays exactly the same: the same schemas, the same error mapping, the same cost-from-usage logging, the same health check. That’s the real lesson of this capstone — you don’t rebuild the production layer per project. You build it once and wrap whatever AI system you ship behind the same dependable endpoint.
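One way to picture that separation is to treat the AI system as a function the shell calls. The sketch below is an illustration under assumptions, not the lesson's code: `Generation` and `run_in_shell` are hypothetical names, and the prices default to the Haiku figures quoted earlier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Generation:
    """What the shell needs back from any AI system: text plus token usage."""
    answer: str
    input_tokens: int
    output_tokens: int


def run_in_shell(question: str, generate: Callable[[str], Generation],
                 in_price: float = 1.0, out_price: float = 5.0) -> dict:
    """Apply the same cost accounting to whatever `generate` does internally."""
    g = generate(question)
    cost = round(g.input_tokens / 1e6 * in_price + g.output_tokens / 1e6 * out_price, 8)
    return {"answer": g.answer, "input_tokens": g.input_tokens,
            "output_tokens": g.output_tokens, "cost_usd": cost}
```

A pipeline that makes several model calls per request, such as retrieval followed by an answer or an agent loop, would need to sum usage across those calls before returning its `Generation`, so the reported cost covers the whole request.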
Extend the Project
Exercise 1: Add a streaming endpoint
Add a second route, POST /ask-stream, that streams the answer back token by token instead of waiting for the whole response. Use the SDK’s streaming API and FastAPI’s StreamingResponse. Still log the final token counts and cost when the stream finishes.
Hint
Use with client.messages.stream(...) as stream: and iterate over stream.text_stream, yielding each chunk. After the loop, call stream.get_final_message() to get the complete message — its .usage gives you the input/output tokens to log and cost the request, exactly as the blocking endpoint does.
Exercise 2: Protect your own endpoint with an API key
The Anthropic key authenticates you to Claude. But who’s allowed to call your service? Add a simple shared-secret check: require an X-API-Key header on /ask and reject requests that don’t match a value you read from the environment (e.g. SERVICE_API_KEY).
Hint
Read the expected secret once with os.environ["SERVICE_API_KEY"]. In the endpoint, accept the header via FastAPI’s Header dependency, compare it to the expected value, and raise HTTPException(401, "unauthorized") if it’s missing or wrong. Keep both keys in the environment — never in code.
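For the comparison step itself, a constant-time check is a common precaution against timing attacks on shared secrets. The sketch below covers only that comparison, not the FastAPI wiring; `key_matches` is a hypothetical helper built on the standard library's `hmac.compare_digest`.

```python
import hmac
from typing import Optional


def key_matches(provided: Optional[str], expected: str) -> bool:
    """Compare in constant time; a missing or empty value never matches."""
    if not provided or not expected:
        return False
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

The endpoint would raise the 401 whenever this returns `False`, which treats a missing header and a wrong key identically.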
Exercise 3: Enforce a daily cost cap
Cost tracking tells you what you spent; a cost cap stops you from spending too much. Keep a running total of cost_usd across requests, and once it crosses a daily limit, reject new /ask requests with a 429 until the budget resets.
Hint
Maintain a module-level accumulator (in real deployments this lives in a shared store like Redis, but a variable shows the idea). Before each call, check whether the total has crossed your limit; after each call, add the new cost. When over budget, raise HTTPException(429, "daily cost limit reached") — the same status as a rate limit, because that’s effectively what it is.
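The accumulator also needs to reset when the day changes. The following is one possible shape for it, as an in-process sketch only: `DailyBudget` and its method names are hypothetical, the `today` parameter exists so the date can be faked in tests, and a multi-instance deployment would still need the shared store mentioned above.

```python
import threading
from datetime import date


class DailyBudget:
    """Track spend for the current day and reset when the date changes."""

    def __init__(self, limit_usd: float, today=date.today):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self._today = today
        self._day = today()
        # Sync endpoints run in a thread pool, so guard the shared total.
        self._lock = threading.Lock()

    def _roll_over(self) -> None:
        current = self._today()
        if current != self._day:
            self._day, self.spent_usd = current, 0.0

    def exhausted(self) -> bool:
        """True once today's spend has reached the limit."""
        with self._lock:
            self._roll_over()
            return self.spent_usd >= self.limit_usd

    def record(self, cost_usd: float) -> None:
        """Add the cost of a completed request to today's total."""
        with self._lock:
            self._roll_over()
            self.spent_usd += cost_usd
```

The endpoint would check `exhausted()` before calling the model and call `record(cost)` afterwards; because the check happens before the spend is known, the total can overshoot the limit by roughly one request's cost.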
Summary
You built a complete, deployable FastAPI service for Claude and, in doing so, applied every production concern from this module in one place. Pydantic schemas define and validate the request and response. The /health endpoint gives infrastructure a heartbeat. The /ask endpoint maps RateLimitError to 429 and APIError to 502, and computes cost from the response’s usage. A logger records tokens, cost, and latency on every request, while max_retries=3 handles transient failures automatically. The key stays in the environment, never in code. You verified the whole thing end to end with TestClient — a real answer, real token counts, a real $0.00035 cost, and a clean 422 rejection — then saw how to run it with uvicorn app:app and pin it with requirements.txt.
Key Concepts
- Validated schemas: AskRequest/AskResponse make your service a typed contract and reject bad input before it costs anything.
- Health check: a /health endpoint lets load balancers and orchestrators route around unhealthy instances.
- Error mapping: translate upstream exceptions into correct HTTP status codes (429, 502) so callers can react.
- Cost from usage: turn input_tokens and output_tokens into a dollar figure returned on every response.
- Per-request logging: tokens, cost, and latency on every call are the raw data of production debugging.
- Built-in reliability: max_retries gives you exponential backoff for free; streaming is a one-method swap for long answers.
- Reusable production shell: the same wrapper fits any LLM system you build behind the endpoint.
Why This Matters
The hardest part of LLM engineering isn’t getting a clever answer in a notebook — it’s making that answer dependable enough to put in front of real people. This capstone is the bridge. Everything inside the try block is the AI you spent the course learning to build; everything around it is the engineering that makes it shippable. Master this shell and you’re no longer someone who can prototype with LLMs — you’re someone who can deploy them, which is the skill teams actually hire for.
Continue Building Your Skills
Congratulations — you’ve reached the end of the Generative AI & LLM Engineering course. You started by making a single call to a model and you’re finishing by deploying a production-shaped service that handles reliability, cost, security, serving, and observability all at once. The best way to make these skills stick is to use them: take a system you’re genuinely curious about — a RAG assistant over your own documents, an agent that automates a real task — wrap it in the shell you just built, and ship it. The gap between knowing and doing closes the moment you deploy something real. Go build it.