Lesson 3 - Metadata and Filtering

Welcome to Metadata and Filtering

In the previous lesson you stored a handful of documents in Chroma and searched them by meaning alone. That works well when every document is fair game, but real systems are rarely that simple. You usually want to search within a category: only shipping articles, only documents written this year, only items tagged for a particular product line. Semantic similarity finds the closest text; metadata filtering decides which text is even allowed to compete. This lesson shows you how to attach that structured information and use it, and then how to save the whole database to disk so you never have to rebuild it from scratch.

By the end of this lesson, you will be able to:

  • Attach metadata to documents when you add them to a collection
  • Restrict a semantic query to a subset of documents using a where clause
  • Use operators like $eq and $in, and filter on document text with where_document
  • Persist a collection to disk with PersistentClient so it survives restarts without re-embedding

Let’s begin by giving each document some structured context.


Attaching Metadata to Documents

Metadata is a small dictionary of structured facts that travels alongside each document. You supply it through the metadatas argument of collection.add, passing one dictionary per document in the same order as your documents and ids. The keys are yours to choose, and the values are simple scalars: strings, integers, floats, or booleans.

Here we tag five short support articles with a topic so we can later search inside a single category.

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection(name="support_docs")

documents = [
    "You can return any unused item within 30 days of delivery for a full refund.",
    "Standard shipping takes 3-5 business days; express arrives in 1-2 days.",
    "Go to the sign-in page and click Forgot password to reset it.",
    "We ship to over 60 countries worldwide.",
    "Email [email protected] or use live chat for support.",
]
metadatas = [
    {"topic": "returns"},
    {"topic": "shipping"},
    {"topic": "account"},
    {"topic": "shipping"},
    {"topic": "support"},
]
ids = ["doc1", "doc2", "doc3", "doc4", "doc5"]

collection.add(documents=documents, ids=ids, metadatas=metadatas)
print("count:", collection.count())
count: 5

Nothing about the embeddings changed: each document is still turned into a vector exactly as before. The metadata rides along untouched, stored next to the vector so it can be used as a filter at query time. Notice that two documents share topic: "shipping" while the others are unique. That overlap is the point: metadata describes groups, and groups are what we filter on.
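Because documents, ids, and metadatas are matched up purely by position, it can be worth checking a batch before you add it. The helper below is a minimal sketch of such a check. The name validate_batch is hypothetical and not part of Chroma, and the scalar rule simply mirrors the description above rather than Chroma's own validation.

```python
# Hypothetical pre-flight check; validate_batch is NOT part of the Chroma API.
ALLOWED_TYPES = (str, int, float, bool)


def validate_batch(documents, ids, metadatas):
    """Return a list of problems found in a batch; an empty list means none found."""
    problems = []
    # The three lists are paired by position, so their lengths must agree.
    if not (len(documents) == len(ids) == len(metadatas)):
        problems.append(
            f"length mismatch: {len(documents)} documents, "
            f"{len(ids)} ids, {len(metadatas)} metadatas"
        )
    if len(set(ids)) != len(ids):
        problems.append("ids are not unique")
    # Every metadata value should be a simple scalar.
    for doc_id, meta in zip(ids, metadatas):
        for key, value in meta.items():
            if not isinstance(value, ALLOWED_TYPES):
                problems.append(
                    f"{doc_id}: value for '{key}' is {type(value).__name__}, not a scalar"
                )
    return problems


documents = ["Return items within 30 days.", "We ship to over 60 countries."]
ids = ["doc1", "doc2"]
good = [{"topic": "returns"}, {"topic": "shipping", "priority": 2}]
bad = [{"topic": "returns"}, {"topic": ["shipping", "international"]}]

print(validate_batch(documents, ids, good))
print(validate_batch(documents, ids, bad))
```

An empty list means the batch passed these particular checks. The check cannot tell whether the dictionaries are in the right order, so keeping the three lists side by side in your source, as the example above does, is still the main safeguard.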


Filtering a Query with a where Clause

To search inside a category, add a where argument to collection.query. Chroma first narrows the collection to only the documents whose metadata matches your filter, then ranks those documents by semantic similarity. The result is the closest match among the allowed set, not the closest match overall.

The query below asks about “delivery time” but restricts the search to shipping articles. Only two documents carry topic: "shipping", so those are the only two that can be returned.

results = collection.query(
    query_texts=["delivery time"],
    n_results=2,
    where={"topic": "shipping"},
)

for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"({dist:.4f}) {doc}")
(0.9359) Standard shipping takes 3-5 business days; express arrives in 1-2 days.
(1.3711) We ship to over 60 countries worldwide.

Remember that Chroma’s default distance is squared L2, where a lower number means a closer match. The first result is the obvious answer about delivery times. The second is the other shipping document, returned only because we asked for two results and it was the next-best within the filter. The returns, account, and support documents never appeared because the filter excluded them before similarity was even considered.
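To make "squared L2" concrete, here is a small sketch of the calculation in plain Python. The three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions, and this is not Chroma's internal code.

```python
def squared_l2(a, b):
    """Sum of squared coordinate differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors, for illustration only.
query = [1.0, 0.0, 0.0]
near = [0.5, 0.5, 0.0]
far = [0.0, 0.0, 1.0]

print(squared_l2(query, near))  # 0.5
print(squared_l2(query, far))   # 2.0
```

The vector that sits closer to the query produces the smaller number, which is why the results above are listed from lowest distance to highest.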

Operators for richer filters

The shorthand {"topic": "shipping"} is equivalent to writing the explicit equality operator:

results = collection.query(
    query_texts=["delivery time"],
    n_results=2,
    where={"topic": {"$eq": "shipping"}},
)

for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"({dist:.4f}) {doc}")
(0.9359) Standard shipping takes 3-5 business days; express arrives in 1-2 days.
(1.3711) We ship to over 60 countries worldwide.

The result is identical, but the operator form opens the door to more expressive filters. Use $in to allow several values at once. Here we look for help-related content across two topics:

results = collection.query(
    query_texts=["how do I get help"],
    n_results=3,
    where={"topic": {"$in": ["support", "account"]}},
)

for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"({dist:.4f}) {doc}")
(0.9065) Email [email protected] or use live chat for support.
(1.6590) Go to the sign-in page and click Forgot password to reset it.

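Both eligible documents come back, ranked by similarity to the question. Although the query asked for three results, only two documents pass the filter, so only two are returned.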
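If it helps to see the matching rules spelled out, the sketch below is a toy matcher for the three filter forms used so far: the shorthand, $eq, and $in. The function names matches_where and matching_ids are hypothetical, the sketch handles only single-field filters, and it illustrates the semantics rather than how Chroma implements them.

```python
def matches_where(metadata, where):
    """Toy matcher for single-field filters using the shorthand, $eq, or $in."""
    if len(where) != 1:
        raise ValueError("this sketch only handles single-field filters")
    (key, condition), = where.items()
    # Shorthand {"topic": "shipping"} means {"topic": {"$eq": "shipping"}}.
    if not isinstance(condition, dict):
        condition = {"$eq": condition}
    value = metadata.get(key)
    for operator, operand in condition.items():
        if operator == "$eq":
            if value != operand:
                return False
        elif operator == "$in":
            if value not in operand:
                return False
        else:
            raise ValueError(f"operator not covered by this sketch: {operator}")
    return True


# The same topic tags used in this lesson.
metadata_by_id = {
    "doc1": {"topic": "returns"},
    "doc2": {"topic": "shipping"},
    "doc3": {"topic": "account"},
    "doc4": {"topic": "shipping"},
    "doc5": {"topic": "support"},
}


def matching_ids(where):
    """Return the ids whose metadata passes the filter."""
    return [i for i, meta in metadata_by_id.items() if matches_where(meta, where)]


print(matching_ids({"topic": "shipping"}))
print(matching_ids({"topic": {"$eq": "shipping"}}))
print(matching_ids({"topic": {"$in": ["support", "account"]}}))
```

The first two calls select the same documents, which is the equivalence between the shorthand and $eq described above. The third selects the account and support documents, matching the two results of the $in query.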

Filtering on the document text itself

Metadata filters operate on the tags you attached, but sometimes you want to filter on the content of the document. The where_document argument does exactly that. The $contains operator keeps only documents whose text includes a given substring, and then similarity ranks what remains.

results = collection.query(
    query_texts=["how do I send something back"],
    n_results=2,
    where_document={"$contains": "refund"},
)

for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"({dist:.4f}) {doc}")
(1.0899) You can return any unused item within 30 days of delivery for a full refund.

Only one document contains the word “refund”, so it is the only candidate and the only result, even though n_results is 2. You can combine where and where_document in the same query to filter on both metadata and text at once.

Filtering narrows the field before ranking

A filtered query does not rank everything and then throw away the rows that fail the filter. Chroma applies the where and where_document conditions first to build the set of eligible documents, then computes similarity within that set. You always get the closest match among the allowed documents, which is why a filter that leaves only two candidates can never return a third, no matter how relevant it might have been.
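The filter-then-rank order can be sketched in a few lines of plain Python. Everything here is a stand-in: the two-dimensional vectors are invented, filtered_query is a hypothetical name, and the code illustrates the behavior described above rather than Chroma's implementation.

```python
def squared_l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Each record: (id, text, metadata, made-up 2-D vector standing in for an embedding).
records = [
    ("doc1", "return for a full refund", {"topic": "returns"}, [0.9, 0.1]),
    ("doc2", "shipping takes 3-5 business days", {"topic": "shipping"}, [0.2, 0.8]),
    ("doc4", "we ship to over 60 countries", {"topic": "shipping"}, [0.6, 0.6]),
]


def filtered_query(query_vector, n_results, topic=None, contains=None):
    """Filter first, then rank only the survivors by squared L2 distance."""
    eligible = [
        r for r in records
        if (topic is None or r[2].get("topic") == topic)
        and (contains is None or contains in r[1])
    ]
    ranked = sorted(eligible, key=lambda r: squared_l2(query_vector, r[3]))
    return [r[0] for r in ranked[:n_results]]


query_vector = [1.0, 0.0]  # closest to doc1 when nothing is filtered out
print(filtered_query(query_vector, n_results=3))
print(filtered_query(query_vector, n_results=3, topic="shipping"))
print(filtered_query(query_vector, n_results=3, topic="shipping", contains="countries"))
```

Without a filter, doc1 ranks first. Once the topic filter is applied, doc1 is never scored at all, so the best the query can return is doc4, and asking for three results still yields only the documents that passed the filter.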


Persisting the Database to Disk

Every example so far has used chromadb.Client(), which keeps the entire collection in memory. That is convenient for experiments, but the moment your program exits, the vectors, metadata, and documents are gone. The next run has to embed everything again. For anything beyond a quick test, you want the database written to disk.

Swap chromadb.Client() for chromadb.PersistentClient(path="./chroma_db"). Everything else stays the same, but now Chroma stores the collection in the folder you named. The pattern is to call get_or_create_collection, check whether it already has data, and embed only if it is empty.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="support_docs")

if collection.count() == 0:
    print("Collection is empty, adding documents...")
    documents = [
        "You can return any unused item within 30 days of delivery for a full refund.",
        "Standard shipping takes 3-5 business days; express arrives in 1-2 days.",
        "Go to the sign-in page and click Forgot password to reset it.",
        "We ship to over 60 countries worldwide.",
        "Email [email protected] or use live chat for support.",
    ]
    metadatas = [
        {"topic": "returns"},
        {"topic": "shipping"},
        {"topic": "account"},
        {"topic": "shipping"},
        {"topic": "support"},
    ]
    ids = ["doc1", "doc2", "doc3", "doc4", "doc5"]
    collection.add(documents=documents, ids=ids, metadatas=metadatas)
else:
    print("Collection already has data, skipping embedding.")

print("documents in collection:", collection.count())

Run this script once and it embeds the documents and saves them:

Collection is empty, adding documents...
documents in collection: 5

Now run the exact same script a second time. A brand-new PersistentClient opens the same folder, finds the collection already populated, and takes the fast path:

Collection already has data, skipping embedding.
documents in collection: 5

The data survived because it was on disk the whole time. If you list the chroma_db folder you will see what Chroma wrote: a chroma.sqlite3 file holding the metadata and documents, alongside a folder containing the vector index. The folder name is a generated identifier, so yours will differ from the one shown here.

11c252ec-2582-453a-8c33-01d0a76a2704
chroma.sqlite3

The count() guard is what makes this safe to run repeatedly. Embedding text is the slow, expensive part of building a vector database, so checking the count before adding means you pay that cost exactly once. Every later run reads the saved vectors from disk instead of recomputing them, and is ready to query as soon as the client opens the folder.
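If you use this guard in more than one script, you might wrap it in a small function. The sketch below assumes only that the object you pass in has count() and add() methods, as a Chroma collection does. Both seed_if_empty and the InMemoryStandIn class are hypothetical names introduced for this illustration; the stand-in exists only so the example runs without a database.

```python
def seed_if_empty(collection, documents, ids, metadatas):
    """Add the batch only when the collection has no rows; return True if it added."""
    if collection.count() > 0:
        return False
    collection.add(documents=documents, ids=ids, metadatas=metadatas)
    return True


class InMemoryStandIn:
    """Minimal stand-in exposing count() and add(), used here only for the demo."""

    def __init__(self):
        self.rows = {}

    def count(self):
        return len(self.rows)

    def add(self, documents, ids, metadatas):
        for doc_id, doc, meta in zip(ids, documents, metadatas):
            self.rows[doc_id] = (doc, meta)


store = InMemoryStandIn()
batch = dict(
    documents=["We ship to over 60 countries worldwide."],
    ids=["doc4"],
    metadatas=[{"topic": "shipping"}],
)
print(seed_if_empty(store, **batch))  # first call adds the batch
print(seed_if_empty(store, **batch))  # second call skips it
```

Keep in mind that the guard only asks whether the collection is empty. If your source documents change after the first run, the guard will skip them too, so you would need to add the new documents explicitly or start from a fresh folder.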


Practice Exercises

Exercise 1

Add a second metadata field to each of the five documents called audience, set to either "customer" or "agent". Then run a query that returns only documents intended for customers.

Hint

Metadata dictionaries can hold more than one key, for example {"topic": "shipping", "audience": "customer"}. Filter on the new field with where={"audience": "customer"}.

Exercise 2

Write a single query that searches for “international orders” but only among documents whose topic is "shipping" and whose text contains the word “countries”.

Hint

Pass both where={"topic": "shipping"} and where_document={"$contains": "countries"} to the same collection.query call. The two filters apply together before similarity ranking.

Exercise 3

Convert your in-memory experiment into a persistent one. Point a PersistentClient at a folder named support_db, embed the documents only when the collection is empty, then run your script twice and confirm the second run skips embedding.

Hint

Use chromadb.PersistentClient(path="./support_db") and guard the add call with if collection.count() == 0:. The second run should print your “already has data” message and still report a count of 5.


Summary

You learned how to turn a flat collection into a structured, searchable store. Metadata attaches facts to each document, where and where_document restrict a query to a relevant subset before similarity ranks the survivors, and PersistentClient writes everything to disk so the embedding work is done once and reused on every later run.

Key Concepts

  • Metadata is a per-document dictionary of scalar values passed through the metadatas argument of add, stored alongside each vector.
  • The where clause filters a query by metadata, so similarity is computed only among the documents that match.
  • Operators such as $eq and $in express exact matches and multi-value matches inside a where filter.
  • The where_document clause filters on the document text itself, with $contains keeping only documents that include a substring.
  • PersistentClient saves the collection to a folder on disk, replacing the in-memory Client so data survives restarts and embedding happens only once.

Why This Matters

A search engine that can only rank everything against everything is rarely what production needs. Filtering lets you scope results to a tenant, a category, a date range, or a permission level, which is the difference between a toy and a tool people can trust with real data. Persistence turns that tool into something you can deploy: you build the index once, ship the folder, and every restart skips the embedding step. Together, metadata filtering and persistence are what move a vector database from a notebook demo to a service you can run.


Next Steps


Continue Building Your Skills

You now have every piece a small retrieval system needs: documents, embeddings, metadata, filtered queries, and a database that lives on disk. In the next lesson you will put them together into a single guided project, building a searchable knowledge base from the ground up and querying it the way a real application would.