Lesson 3 - Chunking Documents
Welcome to Chunking Documents
In the last two lessons you retrieved from a knowledge base where each document was already a short, single-fact sentence. Real documents are not like that. A product handbook, a policy PDF, or a wiki page covers a dozen topics across hundreds of words. If you embed the whole thing as one vector, you destroy the very precision that makes retrieval useful — the vector becomes a blurry average of every topic on the page, and a search for any one of them returns the entire document. Chunking is the fix: before embedding, you split a long document into smaller, focused pieces, embed each piece on its own, and store them separately. This lesson teaches you why chunking matters and how to do it well, working on a real multi-section document.
You’ll use the Acme Cloud Notes product handbook — a real Markdown document with seven sections (accounts, billing, syncing, sharing, exporting, data and deletion, support). Download it from https://datatweets.com/datasets/product-handbook.md and save it next to your script as product-handbook.md. Everything here runs locally and free — Chroma’s built-in embedding model does all the work, so no API key is needed.
By the end of this lesson, you will be able to:
- Explain why a whole long document makes a poor single embedding
- Write a fixed-size chunking function with overlap between consecutive chunks
- Store each chunk in Chroma along with metadata such as its source section
- Retrieve the single most relevant chunk for a question to see precision improve
Let’s start with the problem chunking solves.
The Problem: One Big Embedding Blurs Everything
An embedding turns a piece of text into a single point in space, positioned by overall meaning. That works beautifully for a short, focused sentence. It works badly for a long document, because the model has to compress accounts, billing, syncing, sharing, exporting, deletion, and support into one point — the result sits in the vague middle of all those topics and is a strong match for none of them.
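To build intuition for that "vague middle", here is a toy sketch. It assumes, purely for illustration, that each topic is a hand-made 3-dimensional vector and that a whole-document vector behaves like the normalized average of its topics. Real embedding models do not literally average topics, and every vector and name below is hypothetical.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical topic vectors: one axis per topic (not real embeddings).
topics = {
    "billing": [1.0, 0.0, 0.0],
    "syncing": [0.0, 1.0, 0.0],
    "deletion": [0.0, 0.0, 1.0],
}

# Stand-in for a whole-document embedding: the normalized mean of all topics.
mean = [sum(v[i] for v in topics.values()) / len(topics) for i in range(3)]
whole_doc = normalize(mean)

# A query that is purely about deletion.
query = [0.0, 0.0, 1.0]

print("query vs deletion vector:", round(squared_l2(query, topics["deletion"]), 4))
print("query vs whole document: ", round(squared_l2(query, whole_doc), 4))
```

In this toy setup the focused vector matches the query exactly, while the averaged vector sits at a distance of about 0.85 from it, and it would sit at that same distance from a billing or syncing query too.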
Load the handbook and see what happens when you embed it whole:
with open("product-handbook.md", encoding="utf-8") as f:
    text = f.read()
print("total characters:", len(text))
print("total words:", len(text.split()))

Output:
total characters: 3831
total words: 674

That is far too much meaning for one vector. Store the entire document as a single entry and ask a sharp, specific question:
import chromadb
client = chromadb.Client()
whole = client.get_or_create_collection(name="whole_doc")
whole.add(documents=[text], ids=["handbook"])
question = "How long are deleted notes kept?"
result = whole.query(query_texts=[question], n_results=1)
print("returned characters:", len(result["documents"][0][0]))
print("distance:", round(result["distances"][0][0], 4))

Output:
returned characters: 3831
distance: 1.3366

Two problems, both visible in the output. First, the only thing retrieval can hand back is the whole 3,831-character document: every topic in it, when you asked about one. Second, the distance is 1.3366. Chroma’s default metric is squared L2, where lower means more similar, and that is a poor score because the deletion fact is diluted by six unrelated sections. (Embedding models also cap their input length and typically truncate anything beyond it, so the later sections of a long document may contribute little or nothing to the vector.) To answer the question, you (or the model) would have to read the entire handbook and find the relevant sentence yourself. That is exactly the work retrieval is supposed to do for you.
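To see what a squared L2 number means, the sketch below computes one by hand on two made-up vectors. It also checks a standard identity: for unit-length vectors, squared L2 equals 2 minus twice the cosine similarity. Whether the built-in model's vectors are unit length is an assumption here, not something this lesson establishes.

```python
import math

def squared_l2(a, b):
    """Sum of squared coordinate differences (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two made-up unit vectors (0.6^2 + 0.8^2 = 1). Not real embeddings.
a = [0.6, 0.8]
b = [0.8, 0.6]

d = squared_l2(a, b)
cos = cosine_similarity(a, b)
print("squared L2:", round(d, 4))
print("2 - 2*cos: ", round(2 - 2 * cos, 4))
```

If the unit-length assumption holds, a squared L2 distance of 1.3366 would correspond to a cosine similarity of roughly 0.33, which is a weak match.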
The cure is to split the document into smaller pieces before embedding, so each piece carries one focused idea.
A Fixed-Size Chunker With Overlap
The simplest chunking strategy is fixed-size: walk through the text and cut it into pieces of a fixed length. The one subtlety is overlap: each chunk repeats the last bit of the previous one. Without overlap, a fact that straddles a boundary ("…recoverable for 30 | days before…") gets sliced in half and appears whole in neither chunk. With overlap, any passage no longer than the overlap appears whole in at least one chunk, even when it falls on a boundary.
Here is a chunker that takes a chunk size and an overlap, both measured in characters:
def chunk_text(text, chunk_size=500, overlap=100):
    # The step must be positive, or the loop below would never advance.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap  # step forward, leaving an overlap
    return chunks

chunks = chunk_text(text, chunk_size=500, overlap=100)
print("number of chunks:", len(chunks))
print("length of first chunk:", len(chunks[0]))
print("length of last chunk:", len(chunks[-1]))

Output:
number of chunks: 10
length of first chunk: 500
length of last chunk: 231

The 3,831-character handbook becomes ten chunks. Each step advances by chunk_size - overlap (here 400 characters), so the final 100 characters of each chunk are repeated as the opening 100 characters of the next. The last chunk is shorter simply because the text ran out. Look at one chunk to confirm the overlap and the focus:
print(repr(chunks[3]))

Output:
'the Billing page; if you cancel, your plan stays active until the end of the
current billing period and is not auto-renewed.\n\n## Syncing and Offline
Access\n\nNotes sync automatically across all your devices within a few seconds
of any change, as long as you are online. The desktop and mobile apps also keep
a full local copy, so you can read and edit your notes offline; any edits you
make offline are merged the next time the device reconnects. If the same note
was changed on two devices while one '

This chunk is about syncing: one topic, not seven. Notice it starts mid-sentence (the tail of the billing section) because a fixed-character cut does not respect sentence or section boundaries. That tail is the overlap doing its job: the billing fact still lives intact in chunk 2. The trade-off is that fixed-size chunks can begin and end awkwardly. We’ll refine that next, but even this crude version already gives retrieval something focused to match against.
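The straddling-fact problem is easy to reproduce on a toy string. The sketch below uses a small stand-in chunker with the same stepping logic as chunk_text, run on a made-up sentence with deliberately small sizes. The name fixed_chunks and the sample text are illustrative, not part of the handbook.

```python
def fixed_chunks(text, chunk_size, overlap):
    """Fixed-size chunks; each new chunk starts chunk_size - overlap later."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Made-up sample text; the boundary at character 50 falls inside the fact.
sample = "Deleted notes stay in Trash and are recoverable for 30 days before removal."
fact = "recoverable for 30 days"

no_overlap = fixed_chunks(sample, chunk_size=50, overlap=0)
with_overlap = fixed_chunks(sample, chunk_size=50, overlap=25)

print("fact intact without overlap:", any(fact in c for c in no_overlap))
print("fact intact with overlap:   ", any(fact in c for c in with_overlap))

# Consecutive chunks share exactly `overlap` characters.
print("shared text:", repr(with_overlap[0][-25:]))
```

The fact is 23 characters long and the overlap is 25, which is why the second run keeps it whole. A fact longer than the overlap could still be split.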
Storing Chunks in Chroma With Metadata
Fixed-character cutting is fine, but this document hands us a natural, cleaner boundary for free: its Markdown ## headers. Splitting on sections gives chunks that each cover one topic and a label we can keep — the section name — as metadata. Metadata travels with each chunk in Chroma, so a retrieved chunk can tell you exactly where it came from.
Split the handbook into sections, chunk each section (long ones still get cut with overlap), and store every chunk with its section as metadata:
import re

# Split on level-2 Markdown headers; each piece starts with its section title.
sections = re.split(r"\n## ", text)
parsed = []
for sec in sections:
    sec = sec.strip()
    if "\n" not in sec:
        continue  # no body under this title (for example, a bare title line)
    title = sec.split("\n", 1)[0].lstrip("# ").strip()
    body = sec.split("\n", 1)[1].strip()
    if body:
        parsed.append((title, body))

# Chunk each section body and tag every chunk with its section title.
all_chunks, metadatas, ids = [], [], []
i = 0
for title, body in parsed:
    for piece in chunk_text(body, chunk_size=500, overlap=100):
        all_chunks.append(piece)
        metadatas.append({"section": title})
        ids.append(f"chunk-{i}")
        i += 1
print("total chunks:", len(all_chunks))
print("sample:", ids[0], metadatas[0])

Output:
total chunks: 14
sample: chunk-0 {'section': 'Accounts and Sign-In'}

Fourteen chunks from seven sections: an average of two per section. Because chunk_text starts a new chunk every 400 characters, any section body longer than 400 characters produces a second chunk, even when that chunk is only a short tail that mostly repeats the overlap. Now store them in a collection, attaching the metadata as you add:
collection = client.get_or_create_collection(name="handbook_chunks")
collection.add(documents=all_chunks, metadatas=metadatas, ids=ids)
print("stored:", collection.count(), "chunks")

Output:
stored: 14 chunks

Chroma embeds each chunk on its own with its built-in model, so you now have fourteen focused vectors instead of one blurry one, each tagged with the section it belongs to.
Chunk size is a trade-off, and overlap protects boundaries
There is no universally correct chunk size. Chunks that are too small lose context — a single line can be retrieved without the sentence that explains it. Chunks that are too big drift back toward the one-blurry-vector problem, mixing topics and returning noise. A few hundred characters (a paragraph or two) is a sensible starting point; tune it for your documents and questions. Overlap is the seatbelt: by repeating a little text across the boundary, you ensure a fact split between two chunks survives whole in at least one of them.
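One way to make the trade-off concrete is to sweep a few sizes and watch the chunk count and the amount of repeated text change. The sketch below is hypothetical tuning scaffolding, not part of the lesson's pipeline. It works from a text length alone (3,831, the handbook's length) and assumes a chunker that steps by chunk_size - overlap, as chunk_text does. The function names are illustrative.

```python
import math

def chunk_count(text_length, chunk_size, overlap):
    """Number of chunks produced when stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return math.ceil(text_length / step)

def stored_characters(text_length, chunk_size, overlap):
    """Total characters across all chunks; overlapping text is counted each time."""
    step = chunk_size - overlap
    total = 0
    for start in range(0, text_length, step):
        total += min(chunk_size, text_length - start)
    return total

text_length = 3831  # same length as the handbook; the contents do not matter here

for chunk_size, overlap in [(120, 20), (500, 100), (1500, 300)]:
    n = chunk_count(text_length, chunk_size, overlap)
    stored = stored_characters(text_length, chunk_size, overlap)
    print(f"size={chunk_size:>4} overlap={overlap:>3} -> {n:>2} chunks, "
          f"{stored / text_length:.2f}x the original text stored")
```

Counts like these only describe cost and granularity. Whether a given size retrieves well still has to be checked against real questions.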
Retrieving the Most Relevant Chunk
Now the payoff. Ask the same specific question you asked the whole-document collection, but this time request the single best chunk:
question = "How long are deleted notes kept?"
result = collection.query(query_texts=[question], n_results=1)
print("section:", result["metadatas"][0][0]["section"])
print("distance:", round(result["distances"][0][0], 4))
print(result["documents"][0][0])

Output:
section: Data, Privacy, and Deletion
distance: 0.6309
Your notes belong to you. Acme Cloud Notes does not read your notes or sell your
data, and we do not use the contents of your notes to train any models. When you
delete a note it goes to Trash, where it remains recoverable for 30 days before
it is permanently removed. If you delete your entire account, all of your data is
permanently erased within 30 days, and any active subscription is cancelled. You
can request a copy of all data we hold about you from the Privacy section of
Account Settings.

Compare this to the whole-document result. Retrieval returned just the Data, Privacy, and Deletion chunk (the metadata tells you so) at a distance of 0.6309, far better than the 1.3366 the blurry single vector scored. Instead of 3,831 characters spanning every topic, you get one focused paragraph containing the exact fact (“recoverable for 30 days”). That is the precision chunking buys you.
It is not a one-off. Ask about pricing and retrieval jumps to the right section just as cleanly:
question = "How much does the Pro plan cost?"
result = collection.query(query_texts=[question], n_results=1)
print("section:", result["metadatas"][0][0]["section"])
print("distance:", round(result["distances"][0][0], 4))
print(result["documents"][0][0][:140], "...")

Output:
section: Plans and Billing
distance: 0.8462
Acme Cloud Notes offers three plans. The Starter plan is free forever and
includes up to 100 notes and 1 GB of storage. The Pro plan costs 8 dollars ...

A pricing question lands in Plans and Billing; a deletion question lands in Data, Privacy, and Deletion. Each returns one tight, on-topic chunk. This is the chunk you would hand to the model as grounding context in a full RAG pipeline: small enough to fit comfortably in the prompt, focused enough that the model isn’t wading through six unrelated sections to find the answer.
Splitting on ## headers worked here because the document is well structured. A common refinement is sentence- or paragraph-aware chunking: instead of cutting at a fixed character count, you split on sentence or paragraph boundaries and then group sentences up to your target size. That keeps every chunk readable and avoids the mid-sentence starts you saw earlier, while still bounding the size. The principle is identical — small, focused, slightly overlapping pieces — just with smarter cut points.
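A minimal sketch of that idea follows. It assumes sentences end with ".", "!" or "?" followed by whitespace, a naive rule that will mis-split abbreviations and decimals. The function name sentence_chunks and its parameters are hypothetical, and overlap is counted here in whole sentences rather than characters.

```python
import re

def sentence_chunks(text, max_chars=500, overlap_sentences=1):
    """Group whole sentences into chunks of roughly max_chars characters.

    Each new chunk re-uses the last `overlap_sentences` sentences of the
    previous one. A chunk can exceed max_chars when a single sentence, or
    the carried-over sentences plus one new sentence, is longer than the limit.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Made-up sample text for illustration.
sample = (
    "Notes sync automatically across your devices. "
    "The apps keep a full local copy. "
    "Offline edits are merged when the device reconnects. "
    "Deleted notes stay in Trash for 30 days."
)

sentence_aware = sentence_chunks(sample, max_chars=100, overlap_sentences=1)
for chunk in sentence_aware:
    print(len(chunk), repr(chunk))
```

Every chunk here starts and ends on a sentence boundary, and each one opens with the sentence that closed the previous chunk.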
Practice Exercises
Exercise 1: Predict the chunk count
The handbook is 3,831 characters. Using chunk_text(text, chunk_size=500, overlap=100), how far does the loop advance on each step, and roughly how many chunks should you expect? Then run it and check against the real number.
Hint
Each step advances by chunk_size - overlap, which is 500 − 100 = 400 characters. About 3831 / 400 ≈ 9.6, which rounds up to 10 chunks — exactly what the flat chunker produces. (The section-aware version yields 14 because it also breaks at every header.)
Exercise 2: Shrink the chunk size
Re-chunk the handbook with chunk_size=120, overlap=20 and store those chunks in a fresh collection. Ask “How much does the Pro plan cost?” and read the top chunk. Does it still contain the 8-dollar figure, or did the small size cut the answer off from its context?
Hint
With very small chunks the pricing sentence may be split so the retrieved piece holds only part of the plan description, or even matches the wrong plan. This is the “too small loses context” failure: the chunk no longer carries enough surrounding text to be a complete, useful answer.
Exercise 3: Filter by metadata
You stored a section on every chunk. Use Chroma’s where={"section": "Support"} argument in collection.query(...) to retrieve only from the Support section, then ask “What is the support email address?” Why is filtering on metadata a useful complement to semantic search?
Hint
Passing where={"section": "Support"} restricts the search to chunks tagged with that section, and only those chunks are ranked by similarity. Metadata filtering lets you narrow retrieval to a known source, date, or category. Combining a structured filter with semantic ranking gets you the right chunk from the right place.
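The filter-then-rank idea can be mimicked in a few lines of plain Python. This is a conceptual stand-in only: it is not Chroma's implementation, the records and 2-dimensional vectors are made up, and the local query function is a hypothetical helper rather than collection.query.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up records: (id, metadata, vector). Not real embeddings.
records = [
    ("chunk-0", {"section": "Plans and Billing"}, [0.9, 0.1]),
    ("chunk-1", {"section": "Support"}, [0.5, 0.5]),
    ("chunk-2", {"section": "Support"}, [0.2, 0.8]),
]

def query(query_vector, where=None, n_results=1):
    """Keep records whose metadata matches every key in `where`, then rank by distance."""
    where = where or {}
    candidates = [
        r for r in records
        if all(r[1].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda r: squared_l2(query_vector, r[2]))
    return [r[0] for r in ranked[:n_results]]

query_vector = [1.0, 0.0]
print("unfiltered:  ", query(query_vector))
print("Support only:", query(query_vector, where={"section": "Support"}))
```

Without the filter the nearest record is the billing chunk. With it, the nearest record inside Support wins even though it sits farther from the query.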
Summary
A long document makes a poor single embedding: compressing many topics into one vector produces a blurry point that matches no specific question well, and forces retrieval to return the whole document. Chunking splits a document into smaller, focused pieces that are embedded and stored separately. You wrote a fixed-size chunker with overlap — overlap so a fact split across a boundary survives in at least one chunk — and saw a 3,831-character handbook become focused chunks. You stored each chunk in Chroma with metadata (its source section), then retrieved the single best chunk for a question: the deletion question returned the Data, Privacy, and Deletion chunk at distance 0.6309, versus 1.3366 for the whole document. Chunking is the step that turns vague, document-level retrieval into precise, paragraph-level retrieval.
Key Concepts
- Chunking — splitting a long document into smaller pieces before embedding, so each piece carries one focused idea.
- Fixed-size chunking — cutting text into pieces of a set length (in characters or words).
- Overlap — repeating a little text across chunk boundaries so a straddling fact isn’t lost.
- Chunk metadata — labels (such as the source section) stored with each chunk and returned with retrieval.
- Chunk-size trade-off — too small loses context; too big adds noise. Tune it to your documents.
Why This Matters
Every real RAG system runs on chunks, not whole documents. The quality of your chunking sets a ceiling on the quality of your retrieval — and therefore on every answer your application produces. Get chunk size and boundaries right and the model receives clean, relevant context it can answer from confidently; get them wrong and you feed it either fragments or noise no prompt can rescue. This is the unglamorous step that quietly decides whether a RAG product feels precise or vague.
Next Steps
Continue to Lesson 4 - Grounding and Citations
Make the model answer strictly from retrieved chunks and cite which chunk each fact came from.
Continue Building Your Skills
You can now take any long document and turn it into focused, retrievable chunks — the raw material of every RAG pipeline. Next you’ll make the model answer strictly from those chunks and cite exactly where each fact came from, so your answers are not just relevant but verifiable.