Lesson 5 - Guided Project: Documentation Q&A Bot
Welcome to the Documentation Q&A Bot
This is the capstone for the module, and it pulls together everything you’ve built so far: you’ll load a real document, chunk it with overlap, embed the chunks into a persistent vector database so you only pay for embedding once, and then write a single ask() function that retrieves the best chunks, builds a grounded, citation-aware prompt, calls Claude, and returns an answer plus the source chunks it came from. By the end you’ll have something that feels like a real “chat with your docs” product — it answers real questions with citations, and it declines questions the document can’t support. The document is a product handbook for a fictional notes app called Acme Cloud Notes; download it from datatweets.com/datasets/product-handbook.md and save it next to your script as product-handbook.md.
By the end of this project, you will be able to:
- Load a real document and split it into overlapping chunks
- Build a persistent Chroma collection and embed the chunks exactly once
- Write an ask(question) function that retrieves, grounds, cites, and generates
- Demonstrate grounded answers with citations, and honest refusals when the doc can’t answer
Each stage below is a small, runnable piece. By the last one they assemble into a complete bot. Let’s build it.
Stage 1: Load the Document and Chunk It
A whole document is too big to drop into a single prompt, and embedding it as one giant blob would make retrieval useless — every query would match the whole thing. So the first step is the same one from Lesson 3: split the text into smaller chunks, with a little overlap between neighbors so a sentence that straddles a boundary isn’t lost.
We’ll reuse the word-based chunker with overlap. Load the handbook with open(...).read(), then chunk it:
def chunk_text(text, chunk_size=120, overlap=30):
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        start = end - overlap  # step back by the overlap so chunks share words
    return chunks
document = open("product-handbook.md").read()
chunks = chunk_text(document)
print("chunk count:", len(chunks))
print(chunks[1][:300])

chunk count: 8
code from your phone each time you sign in on a new device. We never store your password in plain text, and support staff can never see it. ## Plans and Billing Acme Cloud Notes offers three plans. The Starter plan is free forever and includes up to 100 notes and 1 GB of storage. The Pro plan costs

The handbook split into 8 chunks. Notice the sample chunk starts mid-sentence (“code from your phone…”). Splitting on word counts cuts wherever the count lands, and the overlap is what keeps that harmless: chunk 1 begins 30 words before chunk 0 ends, so the text around the boundary appears in both chunks instead of being split between them. Each chunk is now a bite-sized passage we can embed and retrieve independently.
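One edge case in the chunker is worth guarding against: if overlap is greater than or equal to chunk_size, start never moves forward and the loop never ends. The following is a minimal sketch of a guarded variant, run on a toy sentence so the shared words are easy to see. The name chunk_text_checked and the toy input are illustrative assumptions, not part of the lesson’s code.

```python
def chunk_text_checked(text, chunk_size=120, overlap=30):
    """Word-based chunker with overlap, plus a guard against bad settings."""
    # Assumption for this sketch: reject settings that would stall the loop.
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        start = end - overlap  # each new chunk re-reads the last `overlap` words
    return chunks


# Toy input: 10 words, chunks of 4 words, 1 word shared between neighbors.
toy = "one two three four five six seven eight nine ten"
toy_chunks = chunk_text_checked(toy, chunk_size=4, overlap=1)
for chunk in toy_chunks:
    print(chunk)
# one two three four
# four five six seven
# seven eight nine ten
```

Each chunk starts with the last word of the one before it, which is the same behavior the handbook chunks show at a larger scale.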
Stage 2: Store the Chunks in a Persistent Collection
In earlier lessons we used an in-memory Chroma client, which forgets everything when the script ends — so every run re-embeds the whole document. A real product embeds once and reuses the stored vectors forever. The fix is PersistentClient, which writes the collection to a folder on disk.
The important trick is the embed-once guard: only add the chunks if the collection is empty (count() == 0). On the first run it embeds and stores; on every run after that it finds the chunks already there and skips straight to querying.
import chromadb

client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection(name="product_handbook")

if collection.count() == 0:
    collection.add(
        documents=chunks,
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        metadatas=[{"chunk_index": i} for i in range(len(chunks))],
    )
print("stored chunk count:", collection.count())

stored chunk count: 8

All 8 chunks are stored on disk. Each chunk gets a string id (Chroma requires string ids) and a small metadata dict carrying its chunk_index, which we’ll use later as the citation label. Run this script a second time and you’ll see stored chunk count: 8 again without re-embedding: the guard saw a non-empty collection and skipped the add(). That persisted folder is your knowledge base.
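The count() == 0 guard only checks whether the collection is empty, so it will not notice if the handbook is edited after the first run. One possible way to detect that, sketched here with only the standard library, is to keep a hash of the document in a small file next to the database and compare it on startup. The helper name needs_reingest and the idea of a sidecar state file are hypothetical choices for illustration, not something the lesson or Chroma provides.

```python
import hashlib
import json
from pathlib import Path


def needs_reingest(document_text, state_path):
    """Return True when the document differs from the last recorded version.

    Side effect: records the new hash, so the next call with the same text
    returns False.
    """
    digest = hashlib.sha256(document_text.encode("utf-8")).hexdigest()
    state_file = Path(state_path)
    if state_file.exists():
        saved = json.loads(state_file.read_text(encoding="utf-8"))
        if saved.get("sha256") == digest:
            return False  # same content as last time: keep the stored vectors
    state_file.write_text(json.dumps({"sha256": digest}), encoding="utf-8")
    return True
```

When this returns True for a collection that already holds chunks, you would rebuild the collection before querying. For a handbook that never changes, the simpler count() == 0 guard is enough.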
This is the architecture behind real chat-with-your-docs products
Embed-once-then-reuse is exactly how production documentation bots work: a build step ingests the docs into a vector store, and the live app only ever queries it. The same shape powers support assistants, internal wikis, and research tools. In Module 8 you’ll take the next step — instead of always retrieving, you’ll give an agent tools and let it decide when to search the docs and when to answer directly.
Stage 3: Write the ask() Function
Now the heart of the bot. The ask(question) function ties the whole pipeline into one call: it retrieves the top chunks, builds a grounded + citation prompt, generates with Claude, and returns both the answer and the source chunks it drew from.
Two design choices make this feel like a product. First, we label each retrieved chunk with its chunk_index ([Source chunk 3]) inside the context, and we instruct the model to cite that label — so every answer points back to its evidence. Second, we keep the grounding rule from Lesson 4: answer only from the context, and say “I don’t know” otherwise.
import anthropic

llm = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment

def ask(question, n_results=3):
    results = collection.query(query_texts=[question], n_results=n_results)
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    sources = [meta["chunk_index"] for meta in metas]
    context = "\n\n".join(
        f"[Source chunk {meta['chunk_index']}]\n{doc}"
        for doc, meta in zip(docs, metas)
    )
    prompt = (
        "You are a documentation assistant for Acme Cloud Notes. "
        "Answer the question using ONLY the context below. "
        "Cite the source chunk you used in square brackets, like [Source chunk 3]. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = llm.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text, sources

Let’s call it once and look at both return values, the cited answer and the list of source chunks:
answer, sources = ask("How much does the Pro plan cost?")
print("Retrieved source chunks:", sources)
print(answer)

Retrieved source chunks: [1, 0, 7]
The Pro plan costs $8 per month. [Source chunk 1]

That’s the complete loop in one function. Retrieval surfaced chunks 1, 0, and 7; the model read them, answered from chunk 1, and cited it. You get the grounded answer and the receipts.
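To see exactly what the model receives as context, it can help to run the labeling step on its own, without Chroma or an API key. The sketch below uses a hand-made dictionary shaped like a query result (one inner list per query text); the passage texts are invented placeholders and build_context is a hypothetical helper name.

```python
# Hand-made stand-in for a query result: one inner list per query text.
# The chunk texts are invented placeholders, not real handbook content.
fake_results = {
    "documents": [["Pricing passage.", "Security passage."]],
    "metadatas": [[{"chunk_index": 1}, {"chunk_index": 0}]],
}


def build_context(results):
    """Label each retrieved chunk so the model can cite it by index."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    return "\n\n".join(
        f"[Source chunk {meta['chunk_index']}]\n{doc}"
        for doc, meta in zip(docs, metas)
    )


context = build_context(fake_results)
print(context)
# [Source chunk 1]
# Pricing passage.
#
# [Source chunk 0]
# Security passage.
```

The labels come from the stored chunk_index metadata, not from the position in the result list, so a citation always points at the same chunk regardless of how it ranked for a given question.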
Stage 4: Run It on Real Questions
A bot is only convincing when you push real questions through it. Let’s ask several — three the handbook can answer, and one it deliberately can’t — and print the cited answer and sources for each.
questions = [
    "How much does the Pro plan cost?",
    "How long are deleted notes kept before they're removed?",
    "Can I use my notes offline?",
    "Does Acme Cloud Notes have a calendar feature?",
]
for q in questions:
    answer, sources = ask(q)
    print("Q:", q)
    print("Sources:", sources)
    print("A:", answer)
    print("-" * 50)

Q: How much does the Pro plan cost?
Sources: [1, 0, 7]
A: The Pro plan costs $8 per month. [Source chunk 1]
--------------------------------------------------
Q: How long are deleted notes kept before they're removed?
Sources: [5, 2, 3]
A: Deleted notes are kept in Trash for 30 days before they are permanently removed. [Source chunk 5]
--------------------------------------------------
Q: Can I use my notes offline?
Sources: [3, 2, 4]
A: Yes, you can use your notes offline. The desktop and mobile apps keep a full
local copy of your notes, so you can read and edit your notes offline. Any edits
you make offline are automatically merged the next time your device reconnects to
the internet. [Source chunk 2]
--------------------------------------------------
Q: Does Acme Cloud Notes have a calendar feature?
Sources: [0, 1, 3]
A: I don't know. The documentation provided doesn't mention a calendar feature
for Acme Cloud Notes.
--------------------------------------------------

Look at what happened. The three answerable questions each got a correct, natural-language answer with a citation pointing at the chunk that supported it: Pro pricing from chunk 1, the 30-day Trash window from chunk 5, offline access from chunk 2. The handbook never appeared in Claude’s training; the bot retrieved the right slice each time.
The fourth question is the real test. The handbook says nothing about a calendar, but the retriever still returned its three closest chunks (0, 1, 3), because retrieval always returns something. Those chunks didn’t contain the answer, and because the prompt is grounded, the model declined instead of inventing a feature. That honest “I don’t know” is what makes the bot safe to put in front of real users. You’ve built a complete documentation Q&A bot.
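Since every answer is supposed to cite one of the retrieved chunks, a small check can confirm that the cited labels actually match the sources list. This is a sketch, assuming the model follows the [Source chunk N] format requested in the prompt; the helper names cited_chunks and citations_are_valid are hypothetical.

```python
import re


def cited_chunks(answer):
    """Return the chunk indices cited as [Source chunk N] in an answer."""
    return [int(n) for n in re.findall(r"\[Source chunk (\d+)\]", answer)]


def citations_are_valid(answer, sources):
    """True when every cited index is one of the retrieved source chunks.

    An answer with no citations (such as a refusal) also returns True,
    because there is nothing to contradict.
    """
    return all(n in sources for n in cited_chunks(answer))


# Sample answers copied from the run above.
pro_answer = "The Pro plan costs $8 per month. [Source chunk 1]"
refusal = "I don't know. The documentation provided doesn't mention a calendar feature"

print(cited_chunks(pro_answer))                    # [1]
print(citations_are_valid(pro_answer, [1, 0, 7]))  # True
print(cited_chunks(refusal))                       # []
```

A check like this only confirms that the label refers to a retrieved chunk. It does not verify that the chunk really supports the claim.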
Extend the Project
You have a working bot. These extensions push it closer to production.
Exercise 1: Add a “no relevant docs found” threshold
Right now retrieval always returns three chunks, even when none are relevant — the bot leans entirely on the prompt to decline. Add a check that inspects the retrieval distance and short-circuits to “I couldn’t find anything about that in the docs” before calling Claude when the closest chunk is too far away.
Hint
Query results include distances by default, so you can read results["distances"][0] directly. If you pass include=[...] yourself, list "documents" and "metadatas" alongside "distances", or ask() will lose the chunk text it needs. Chroma’s default distance is squared L2, so lower means closer. Inspect the distances on a few real questions to pick a cutoff, then return an early “not found” message when results["distances"][0][0] exceeds it, which also saves an API call.
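The cutoff logic itself can be tried out on stand-in data before wiring it into ask(). The sketch below assumes a result dictionary with the same "distances" shape as a Chroma query result; the distance values and the CUTOFF constant are invented for illustration and are not recommendations.

```python
def closest_distance(results):
    """Smallest distance for the first (only) query, or None if empty."""
    distances = results["distances"][0]
    return min(distances) if distances else None


def has_relevant_chunk(results, max_distance):
    """True when at least one retrieved chunk is within the cutoff."""
    best = closest_distance(results)
    return best is not None and best <= max_distance


# Invented numbers for illustration only; measure real ones on your data.
near = {"distances": [[0.62, 0.91, 1.10]]}
far = {"distances": [[1.48, 1.52, 1.60]]}
CUTOFF = 1.2  # hypothetical value, chosen only to separate the two examples

print(has_relevant_chunk(near, CUTOFF))  # True
print(has_relevant_chunk(far, CUTOFF))   # False
```

Inside ask(), a False result would be the point to return the “not found” message before building the prompt.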
Exercise 2: Show the source text in the output
Returning the chunk indices is useful for debugging, but a user-facing bot should show the actual passages it cited. Change ask() to return the source chunk text alongside the answer, and print a short “Sources:” section under each answer.
Hint
You already have docs (the retrieved chunk texts) inside ask(). Return docs (or a list of (chunk_index, doc) pairs) as a third value, then in your print loop show the first sentence or first ~120 characters of each cited chunk so the user can verify the answer.
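For the preview itself, the standard library’s textwrap.shorten is one option: it collapses whitespace and trims at a word boundary instead of cutting mid-word. A minimal sketch, where the function name preview and the 120-character default are assumptions taken from the hint above:

```python
import textwrap


def preview(doc, limit=120):
    """Shorten a chunk to at most `limit` characters, ending in '...'."""
    return textwrap.shorten(doc, width=limit, placeholder="...")


print(preview("Hello world", limit=10))  # Hello...
print(preview("short text"))             # short text
```

Text that already fits within the limit is returned without the trailing dots.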
Exercise 3: Support multiple documents
A real bot covers more than one document. Add a second handbook (or any Markdown file), chunk it too, and store every chunk with a source field in its metadata (e.g. {"chunk_index": i, "source": "product-handbook"}). Then include the source name in the citation.
Hint
Metadata values must be str, int, float, or bool, so a filename string is fine. When you build the context in ask(), read meta["source"] and label each block like [product-handbook, chunk 3]. Because everything lives in one collection, retrieval automatically ranks chunks across all documents by relevance.
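One detail to watch when adding a second document: ids must be unique within a collection, and the chunk-{i} scheme from Stage 2 would produce the same ids for every document. A possible approach, sketched below, is to prefix each id with the source name. The helpers build_records and label are hypothetical names, and the chunk texts are placeholders.

```python
def build_records(source_name, chunks):
    """Build parallel ids, documents, and metadatas lists for one document."""
    ids = [f"{source_name}-chunk-{i}" for i in range(len(chunks))]
    metadatas = [
        {"chunk_index": i, "source": source_name} for i in range(len(chunks))
    ]
    return ids, list(chunks), metadatas


def label(meta):
    """Citation label that names both the document and the chunk."""
    return f"[{meta['source']}, chunk {meta['chunk_index']}]"


# Placeholder chunk texts standing in for real chunk_text(...) output.
ids, documents, metadatas = build_records("product-handbook", ["first", "second"])
print(ids)                  # ['product-handbook-chunk-0', 'product-handbook-chunk-1']
print(label(metadatas[1]))  # [product-handbook, chunk 1]
```

The three returned lists line up with the documents, ids, and metadatas arguments that collection.add() already takes in Stage 2.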
Summary
You built a complete documentation Q&A bot — the capstone that combines every piece of this module. You loaded a real product handbook, chunked it with overlap, embedded the chunks into a persistent Chroma collection guarded so it only embeds once, and wrote a single ask() function that retrieves the top chunks, builds a grounded, citation-aware prompt, generates with Claude, and returns the cited answer plus its source chunks. You watched it answer real questions about pricing, data retention, and offline access — each with a citation — and decline a question (a calendar feature) the document couldn’t support. That’s a real chat-with-your-docs product in well under a hundred lines.
Key Concepts
- Chunking with overlap: split a document into bite-sized passages that share a few words at the seams, so retrieval works and boundary sentences aren’t lost.
- Persistent collection: PersistentClient writes embeddings to disk; a count() == 0 guard embeds once and reuses forever.
- ask() pipeline: one function that retrieves, grounds, cites, and generates.
- Citations: labeling chunks with their index and asking the model to cite gives every answer traceable evidence.
- Grounded refusal: answering only from context means the bot declines instead of hallucinating when the docs can’t help.
Why This Matters
This is the architecture behind nearly every “chat with your documents” product in the wild — support bots, internal knowledge assistants, research tools. The shape never changes: ingest and embed your docs once, then retrieve-ground-generate on each question, with citations so users can trust and verify the answer. You can now point this same pattern at any document set and ship a working knowledge bot. From here, the next step is letting a model decide when to retrieve — which is exactly what agents do.
Next Steps
Continue to Module 8 - Building AI Agents
Go beyond always-retrieving: give the model tools and let it decide when to search your docs and which tool to use for each step.
Back to Module Overview
Return to the Retrieval-Augmented Generation module overview
Continue Building Your Skills
You’ve shipped a real RAG application — a documentation bot that retrieves, grounds, cites, and knows when to say “I don’t know.” Everything until now has retrieved on every question. Next you’ll hand the model its own tools and let it choose when to reach for them, turning a fixed pipeline into a reasoning agent.