Lesson 2 - Getting Started with Chroma
Welcome to Getting Started with Chroma
In Lesson 1 you saw why a vector database exists: it stores embeddings, documents, and metadata together, indexes them for fast search, and lets you embed once and query many times. Now you’ll use one. Chroma is an open-source vector database that runs entirely on your own machine — no server to host, no account to create, no API key to manage. Even better, Chroma can do the embedding for you. In Module 5 you called an embedding model by hand and stored the vectors yourself; here you simply hand Chroma your text, and it embeds and stores it in a single call.
This lesson gets you from a blank file to a working semantic search in a handful of lines.
By the end of this lesson, you will be able to:
- Install Chroma and create an in-memory client and a collection
- Add documents and let Chroma embed them automatically with its built-in model
- Count the items in a collection and query it by meaning
- Read the result dictionary and interpret Chroma’s squared-L2 distance scores
Let’s install it and create our first collection.
Installing Chroma and Creating a Collection
Chroma ships as a single Python package. Install it with pip:
pip install chromadb
That one package brings everything you need, including a built-in embedding model, so you don’t need an API key or any external service. The work in this lesson runs against chromadb 1.5.9.
Chroma stores data inside collections — think of a collection as a table dedicated to one set of related items. To create one, you first make a client. The simplest client is in-memory: it lives only for the duration of your program, which is perfect for learning and quick experiments (later lessons cover a persistent client that saves to disk).
import chromadb
print("chromadb version:", chromadb.__version__)
# An in-memory client — data lives only while this program runs
client = chromadb.Client()
# A collection is where your documents and their embeddings live
collection = client.create_collection(name="faqs")
chromadb version: 1.5.9
One thing to know early: create_collection fails if a collection with that name already exists. While you’re iterating in a notebook or re-running a script, that’s a common stumbling block. The fix is get_or_create_collection, which returns the existing collection if it’s there and creates it otherwise:
collection = client.get_or_create_collection(name="faqs")
Use create_collection when you want to be sure you’re starting fresh, and get_or_create_collection when you just want a handle to the collection regardless of whether it already exists.
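To make the difference between the two calls concrete, here is a minimal sketch that models it with a plain dictionary. ToyClient is a hypothetical stand-in written for this illustration, not Chroma's code, and the ValueError it raises is an assumption; the exception Chroma actually raises may be a different type.

```python
class ToyClient:
    """Hypothetical stand-in for a client; it only tracks collections by name."""

    def __init__(self):
        self._collections = {}

    def create_collection(self, name):
        # Refuse to overwrite: models the "already exists" failure described above.
        if name in self._collections:
            raise ValueError(f"Collection {name!r} already exists")
        self._collections[name] = []
        return self._collections[name]

    def get_or_create_collection(self, name):
        # Return the existing collection if present, otherwise create an empty one.
        return self._collections.setdefault(name, [])


toy = ToyClient()
first = toy.create_collection("faqs")
same = toy.get_or_create_collection("faqs")  # safe to call again
print(same is first)

try:
    toy.create_collection("faqs")  # second create with the same name
except ValueError as err:
    print("error:", err)
```

The second create_collection call fails while get_or_create_collection hands back the same object, which is the behavior to keep in mind when you re-run a notebook cell.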
Adding Documents — Chroma Embeds Them for You
Here is the big win over Module 5. To store text, you call collection.add with your documents and a list of unique ids. You do not generate any embeddings yourself — Chroma runs each document through its built-in default embedding model (the all-MiniLM-L6-v2 model, executed locally via onnxruntime) and stores the resulting vectors alongside your text. No API key, no separate embedding step.
collection.add(
    documents=[
        "You can return any unused item within 30 days of delivery for a full refund.",
        "Standard shipping takes 3-5 business days; express arrives in 1-2 days.",
        "Go to the sign-in page and click Forgot password to reset it.",
        "We ship to over 60 countries worldwide.",
        "Email [email protected] or use live chat for support.",
    ],
    ids=["doc0", "doc1", "doc2", "doc3", "doc4"],
)
print("count:", collection.count())
count: 5
collection.count() confirms how many items the collection holds (here, five). Each document is now stored together with its embedding and is ready to be searched. The ids are yours to choose; they must be unique within the collection and are how you’d later update or delete a specific item.
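Because ids must be unique, it can help to check a list before handing it to add. The sketch below uses only the standard library; duplicate_ids and make_ids are hypothetical helper names invented for this example, and the doc0, doc1, ... pattern is just a naming convention, not something Chroma requires.

```python
from collections import Counter


def duplicate_ids(ids):
    """Return the ids that appear more than once, in first-seen order."""
    counts = Counter(ids)
    return [i for i in counts if counts[i] > 1]


def make_ids(prefix, n):
    """Generate n sequential ids such as doc0, doc1, ..."""
    return [f"{prefix}{k}" for k in range(n)]


ids = make_ids("doc", 5)
print(ids)
print(duplicate_ids(ids))                       # empty list: safe to add
print(duplicate_ids(["doc0", "doc1", "doc0"]))  # reports the repeated id
```

Running a check like this before collection.add catches a repeated id in your own list early, before it reaches the database.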
The first time you add documents, Chroma downloads the small embedding model. After that it’s cached locally, so subsequent runs are fast and fully offline.
No API key, no embedding code
In Module 5 you wrote the embedding step yourself and managed the vectors by hand. Chroma’s add collapses that into one call: it embeds your text with its built-in all-MiniLM-L6-v2 model running locally, then stores the vectors for you. There’s no account, no API key, and nothing leaves your machine. When you’re ready, you can swap in a different embedding function — but the default is enough to get real semantic search working immediately.
Querying and Reading the Results
To search, call collection.query with one or more query_texts and the number of matches you want. Chroma embeds the query the same way it embedded your documents, finds the nearest stored vectors, and returns them.
res = collection.query(
    query_texts=["How long do I have to send something back?"],
    n_results=3,
)
for doc, dist in zip(res["documents"][0], res["distances"][0]):
    print(f"{dist:.4f} {doc}")
0.7514 You can return any unused item within 30 days of delivery for a full refund.
1.1615 Standard shipping takes 3-5 business days; express arrives in 1-2 days.
1.7333 We ship to over 60 countries worldwide.
The query never mentioned “return,” “refund,” or “30 days”; it asked “How long do I have to send something back?” Chroma matched it to the refund policy anyway, because the meaning is close. That’s semantic search, with the embedding handled for you.
Now look carefully at the shape of what query returns. It’s a dictionary with the keys ids, documents, distances, and metadatas (among a few others). The important detail: each value is a list of lists — one inner list per query text you passed. Because you can query with several texts at once, Chroma keeps each query’s results in its own inner list:
print("documents:", res["documents"])
print()
print("distances:", res["distances"])
documents: [['You can return any unused item within 30 days of delivery for a full refund.', 'Standard shipping takes 3-5 business days; express arrives in 1-2 days.', 'We ship to over 60 countries worldwide.']]
distances: [[0.7513809204101562, 1.16151762008667, 1.7332555055618286]]
You passed a single query, so there’s a single inner list. To get the matches for your first (and only) query, you read res["documents"][0] and res["distances"][0]. The [0] selects the results for the first query text. Forgetting it is the most common early mistake, because you’d be working with the outer list of all queries instead of one query’s hits.
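The list-of-lists shape is easier to see when you walk it with two nested loops. This sketch needs no database: it copies the documents and distances printed above into a plain dictionary (a real result has more keys) and uses a hypothetical helper, iter_hits, written for this example.

```python
# Values copied from the lesson's output; a real query result has more keys.
res = {
    "documents": [[
        "You can return any unused item within 30 days of delivery for a full refund.",
        "Standard shipping takes 3-5 business days; express arrives in 1-2 days.",
        "We ship to over 60 countries worldwide.",
    ]],
    "distances": [[0.7513809204101562, 1.16151762008667, 1.7332555055618286]],
}


def iter_hits(result):
    """Yield (query_index, rank, document, distance) for every hit of every query."""
    for q, (docs, dists) in enumerate(zip(result["documents"], result["distances"])):
        for rank, (doc, dist) in enumerate(zip(docs, dists)):
            yield q, rank, doc, dist


print(len(res["documents"]))     # outer list: one entry per query text
print(len(res["documents"][0]))  # inner list: hits for the first query

for q, rank, doc, dist in iter_hits(res):
    print(q, rank, f"{dist:.4f}", doc[:30])
```

The outer loop runs once per query text and the inner loop once per hit, so the same helper would keep working if you passed several query_texts at once.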
Understanding Chroma’s Distance Scores
Each match comes with a number in distances. By default, Chroma measures similarity with squared L2 distance (squared Euclidean distance) between the query’s embedding and each stored embedding. The rule to remember is:
With squared L2 distance, lower means more similar. A distance of 0 would be an identical vector; larger numbers are further apart in meaning.
You can see this in the results above. The refund answer scored 0.7514, the smallest distance and the best match. The shipping-time answer was further at 1.1615, and the worldwide-shipping line further still at 1.7333. The list comes back sorted from closest to farthest, so res["documents"][0][0] is always the top match.
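To see what the score measures, here is a small sketch that computes squared L2 distance by hand and ranks a few vectors from closest to farthest. The three-dimensional vectors and the names doc_a, doc_b, and doc_c are invented for illustration; real embeddings have hundreds of dimensions, and a real database uses an index rather than scanning every vector as this loop does.

```python
def squared_l2(a, b):
    """Sum of squared coordinate differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy 3-dimensional "embeddings", invented for illustration only.
stored = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.5, 0.5, 0.0],
}
query = [0.9, 0.1, 0.0]

# Rank ascending: the smallest distance is the closest match.
ranked = sorted(stored, key=lambda name: squared_l2(query, stored[name]))
for name in ranked:
    print(name, round(squared_l2(query, stored[name]), 4))
```

The query sits almost on top of doc_a, so doc_a comes first; a vector compared with itself would score exactly 0.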
This is the opposite direction from the cosine similarity you used in Module 5. There, scores ran toward 1 for a close match, so higher was better. Here, with distance, lower is better. Same goal — find the most relevant document — but read the number the other way.
Distance vs. similarity: which way is closer?
In Module 5 you ranked by cosine similarity, where a bigger score (toward 1) meant a closer match. Chroma’s default is squared L2 distance, where a smaller score means a closer match. The intuition flips: similarity goes up as things get closer, distance goes down. When you sort Chroma results, the smallest distance is the best hit — which is why Chroma returns them in ascending order.
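The two scores are more closely related than the flipped direction suggests. For vectors scaled to unit length, squared L2 distance equals 2 minus twice the cosine similarity, so both produce the same ranking. The sketch below checks that identity on toy vectors and then applies it to the distances from this lesson. Treat the second part as an assumption: the conversion is only valid if the stored embeddings are unit length, which this lesson does not state.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

d = squared_l2(a, b)
c = cosine(a, b)
print(math.isclose(d, 2 - 2 * c))  # the identity holds for unit-length vectors

# Same identity applied to the lesson's distances (assumes unit-length embeddings).
for dist in [0.7514, 1.1615, 1.7333]:
    print(dist, "->", round(1 - dist / 2, 4))
```

As the distance grows, the converted similarity shrinks, which is the flip in direction described above.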
That’s a complete cycle: install, create a collection, add documents that Chroma embeds for you, count them, and query by meaning, all locally and for free.
Practice Exercises
Exercise 1: Create and add
Write the code to create an in-memory Chroma client, create a collection named notes, and add two documents of your own with ids n0 and n1. Then print collection.count(). What value do you expect?
Hint
Start with client = chromadb.Client(), then collection = client.create_collection(name="notes"). Call collection.add(documents=[...], ids=["n0", "n1"]) — no embedding step needed. With two documents added, collection.count() returns 2.
Exercise 2: Read the top match
After running the query in this lesson, you have the result dictionary res. Write a single line that prints just the closest document’s text. Which keys and indices do you need?
Hint
Results are a list of lists. The first query’s documents are res["documents"][0], and the closest match is first in that list, so res["documents"][0][0] is the top document. The closest distance is res["distances"][0][0].
Exercise 3: Lower or higher?
A teammate looks at two Chroma matches with distances 0.75 and 1.73 and says the 1.73 one is the better match because the number is bigger. Are they right? Explain in one sentence using the correct rule.
Hint
They’re wrong. Chroma’s default squared L2 distance means lower is more similar, so 0.75 is the closer, better match — unlike cosine similarity, where higher would win.
Summary
Chroma is an open-source vector database that runs locally and for free. You install it with pip install chromadb, create an in-memory client with chromadb.Client(), and make a collection with create_collection (or get_or_create_collection to avoid the “already exists” error). You add text with collection.add(documents=[...], ids=[...]), and Chroma embeds the documents for you using its built-in all-MiniLM-L6-v2 model — no API key, no separate embedding step. collection.count() tells you how many items you have, and collection.query(query_texts=[...], n_results=3) searches by meaning. The result is a dictionary whose values are lists of lists, one inner list per query, so you read res["documents"][0] and res["distances"][0]. Chroma’s default distance is squared L2, where lower means more similar.
Key Concepts
- Client: chromadb.Client() creates an in-memory store; it lives only while your program runs.
- Collection: a named container for documents and their embeddings; create with create_collection or get_or_create_collection.
- Automatic embedding: collection.add embeds your text with Chroma’s built-in model and stores the vectors, no API key required.
- Result shape: query returns a dict of lists of lists; index with [0] to get the first query’s matches.
- Squared L2 distance: Chroma’s default score; lower is more similar, the opposite direction from cosine similarity.
Why This Matters
Module 5 had you embed, store, and compare vectors by hand. Chroma folds all of that into two calls, add and query, and keeps the embeddings in the collection so you don’t recompute them for every search. (With the in-memory client used here, that holds for the life of the program; saving to disk comes with the persistent client in a later lesson.) That’s the foundation every retrieval system is built on: store your data once, embedded and indexed, then answer questions against it cheaply. With the basics working, you’re ready to attach metadata and filter your searches.
Next Steps
Continue to Lesson 3 - Metadata and Filtering
Attach structured metadata to your documents and combine semantic search with filters like topic and category.
Back to Module Overview
Return to the Vector Databases module overview
Continue Building Your Skills
You’ve run a real vector database end to end: installed Chroma, created a collection, added documents it embedded for you, and queried them by meaning. Next you’ll make those searches sharper by attaching metadata to each document and filtering on it — so you can find the most relevant answer that also matches the category, user, or date you care about.