Lesson 4 - Guided Project: Searchable Knowledge Base

Welcome to the Searchable Knowledge Base Project

So far in this module you have learned what embeddings are, how a vector database stores them, and how similarity search finds the closest matches by meaning. In this guided project you will put all of it together into something you could actually ship: a support knowledge base that takes a customer’s question in plain language and returns the most relevant FAQ answer. You will work with a real dataset of fifteen support FAQs, store them in a Chroma collection that lives on disk, and write a clean search function that returns the matched question, its answer, and how close the match was. Download the dataset from https://datatweets.com/datasets/support-faqs.csv and save it next to your script as support-faqs.csv.

By the end of this project, you will be able to:

  • Load and inspect a real FAQ dataset with pandas and attach a simple topic label to each row.
  • Create a persistent Chroma collection that embeds your documents once and survives a restart.
  • Write a reusable search(query, k=3) function that returns the matched question, its stored answer, and the distance.
  • Filter results by metadata so a search can be scoped to a single topic.

We will build the project in small, verifiable stages. Each stage adds one capability, runs real code, and shows you the real output so you always know where you stand.


Stage 1: Load and Inspect the FAQ Data

Start by reading the CSV with pandas. The file has three columns — id, question, and answer — and fifteen rows. Some answers contain commas, but the file is properly quoted, so a plain pd.read_csv parses it correctly without any extra arguments.

import pandas as pd

faqs = pd.read_csv("support-faqs.csv")

print("Shape:", faqs.shape)
print(faqs.head())

Shape: (15, 3)
   id                      question                                                                                                  answer
0   1   How do I reset my password?  Go to the sign-in page and click "Forgot password." We'll email you a secure link to choose a new one.
1   2  How long does shipping take?               Standard shipping takes 3-5 business days. Express shipping arrives in 1-2 business days.
2   3   What is your return policy?                            You can return any unused item within 30 days of delivery for a full refund.
3   4  Do you ship internationally?     Yes. We ship to over 60 countries. International orders typically arrive within 7-14 business days.
4   5     How can I track my order?               Sign in to your account and open the Orders page to see live tracking for every shipment.

You have fifteen question-and-answer pairs covering accounts, shipping, returns, billing, and a couple of general topics. The question text is what you will embed and search against, and the answer is what you ultimately want to return to the user.
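
The earlier point about quoted commas can be seen in isolation with a small sketch. It uses Python's built-in csv module instead of pandas, purely to keep the example dependency-free, and it parses FAQ 14 the way a properly quoted file would store it. The raw line is an assumed reconstruction, not a copy of the real file.

```python
import csv
import io

# One header line plus one data row, written the way a properly quoted CSV
# stores an answer that itself contains commas. The raw text is an assumed
# reconstruction of row 14, not a copy of the real file.
raw = (
    "id,question,answer\n"
    '14,How do I update my email address?,'
    '"Open Account Settings, edit the email field, and confirm the change '
    'from the verification email we send."\n'
)

rows = list(csv.reader(io.StringIO(raw)))
header, record = rows[0], rows[1]

# The quoted answer stays a single field even though it contains two commas.
print(len(header), len(record))
print(record[2])
```

Without the surrounding double quotes, the same row would split into five fields and the columns would no longer line up.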

It is useful to be able to filter searches by topic later, so let us derive a simple topic label from each question using a few keyword rules. This is deliberately plain — no machine learning, just string matching — because the goal here is to have a clean categorical value we can store in metadata.

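def topic_for(question):
    """Return a coarse topic label based on keywords in the question."""
    q = question.lower()
    # Rules are checked top to bottom and the first match wins, so the
    # specific phrase "email address" (account) must be tested before the
    # broader keyword "address" (shipping).
    if any(w in q for w in ["return", "damaged", "refund"]):
        return "returns"
    if any(w in q for w in ["password", "subscription", "email address"]):
        return "account"
    if any(w in q for w in ["payment", "secure", "gift card", "discount"]):
        return "billing"
    if any(w in q for w in ["ship", "track", "order", "address"]):
        return "shipping"
    return "general"  # no keyword matched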

faqs["topic"] = faqs["question"].apply(topic_for)
print(faqs[["id", "question", "topic"]].to_string(index=False))

 id                                         question    topic
  1                      How do I reset my password?  account
  2                     How long does shipping take? shipping
  3                      What is your return policy?  returns
  4                     Do you ship internationally? shipping
  5                        How can I track my order? shipping
  6              What payment methods do you accept?  billing
  7                 How do I cancel my subscription?  account
  8                Is my payment information secure?  billing
  9 Can I change my shipping address after ordering? shipping
 10               How do I contact customer support?  general
 11                  Do you offer student discounts?  billing
 12     What should I do if my item arrives damaged?  returns
 13                           Can I buy a gift card?  billing
 14                How do I update my email address?  account
 15                        Do you have a mobile app?  general

Every row now has a topic. Notice that the topic is just a label that travels alongside each document; it has nothing to do with how the embedding is computed. The embedding still comes from the meaning of the question text.
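
One property of keyword rules like these is that their order matters. The sketch below restates the same rules as an ordered list (the list form and the `rules` parameter are illustrative choices, not part of the lesson's code) and shows what happens to FAQ 14 if the shipping rule is checked first.

```python
def topic_for(question, rules):
    """Return the label of the first rule whose keyword appears in the question."""
    q = question.lower()
    for label, keywords in rules:
        if any(w in q for w in keywords):
            return label
    return "general"

# Same keyword rules as the lesson's topic_for, expressed as an ordered list.
RULES = [
    ("returns", ["return", "damaged", "refund"]),
    ("account", ["password", "subscription", "email address"]),
    ("billing", ["payment", "secure", "gift card", "discount"]),
    ("shipping", ["ship", "track", "order", "address"]),
]

question = "How do I update my email address?"

# With the original order, the account rule catches "email address" first.
print(topic_for(question, RULES))  # account

# Hypothetical reordering: checking shipping first lets the bare keyword
# "address" win, and the question is mislabeled.
shipping_first = [RULES[3]] + RULES[:3]
print(topic_for(question, shipping_first))  # shipping
```

If you add rules of your own, a quick check like this on a few known questions is a cheap way to catch accidental relabeling.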


Stage 2: Create a Persistent Collection and Embed Once

This is the part that makes the project feel like a real product rather than a throwaway script. Instead of an in-memory client that forgets everything when the program ends, you will use chromadb.PersistentClient. It writes the collection — including the computed embeddings — to a folder on disk. The next time your script runs, the data is already there.

To make that pay off, you need to guard the work that embeds the documents. Embedding costs time (and, with a paid model, money), so you only want to do it once. The pattern is: get or create the collection, then add the documents only if the collection is empty, which you check with collection.count().

Chroma has a few rules worth keeping in mind as you write this:

  • Document ids must be strings, so convert the CSV’s integer id with str(...).
  • Metadata values must be a string, int, float, or bool — never a list. We store the answer and topic as plain strings.
  • The default distance is squared L2, where lower means more similar.

import chromadb

# PersistentClient stores the collection, embeddings included, in ./chroma_db.
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection(name="support_faqs")

# Embed only when the collection is empty, so later runs reuse the saved data.
if collection.count() == 0:
    collection.add(
        ids=[str(i) for i in faqs["id"]],      # ids must be strings
        documents=faqs["question"].tolist(),   # the text that gets embedded
        metadatas=[
            {"answer": a, "topic": t}          # scalar values only
            for a, t in zip(faqs["answer"], faqs["topic"])
        ],
    )
    print("Embedded and added", collection.count(), "FAQs.")
else:
    print("Collection already has", collection.count(), "FAQs - skipping embedding.")

print("Count:", collection.count())

On the first run, the collection is empty, so Chroma embeds all fifteen questions and saves them:

Embedded and added 15 FAQs.
Count: 15
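
The id and metadata rules listed above can also be checked before anything is sent to the collection. The helper below is a hypothetical pre-flight check written for this lesson; it is not part of Chroma, and it only enforces the two rules stated in the list (string ids, scalar metadata values).

```python
def check_records(ids, metadatas):
    """Raise ValueError if ids or metadata values break the rules listed above."""
    allowed = (str, int, float, bool)
    for doc_id in ids:
        if not isinstance(doc_id, str):
            raise ValueError(f"id {doc_id!r} is not a string")
    for doc_id, meta in zip(ids, metadatas):
        for key, value in meta.items():
            if not isinstance(value, allowed):
                raise ValueError(
                    f"metadata {key!r} on id {doc_id!r} has type "
                    f"{type(value).__name__}"
                )

# Passes silently: string id, string metadata values.
check_records(
    ["3"],
    [{"answer": "You can return any unused item within 30 days of delivery "
                "for a full refund.", "topic": "returns"}],
)

# Fails: a list of topics is not a scalar value.
error_message = None
try:
    check_records(["3"], [{"topic": ["returns", "billing"]}])
except ValueError as err:
    error_message = str(err)
print(error_message)
```

Running a check like this first tends to give a clearer error than waiting for the database call to reject the batch.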

We embedded the question text and stored the answer in metadata. That is a deliberate choice: customers ask in the shape of questions, so matching their query against stored questions tends to retrieve cleaner results than matching against long answer paragraphs. The answer is what we hand back once we have found the right entry.
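
Since every result in the next stage is ranked by squared L2 distance, it may help to see the arithmetic on its own. This is a plain-Python sketch with three-dimensional toy vectors invented for illustration; real embeddings have hundreds of dimensions, and Chroma computes the distance for you.

```python
def squared_l2(a, b):
    """Sum of squared coordinate differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy 3-dimensional vectors, invented for illustration only.
query = [1.0, 0.0, 2.0]
near = [1.0, 1.0, 2.0]
far = [3.0, 0.0, 0.0]

print(squared_l2(query, near))  # 1.0
print(squared_l2(query, far))   # 8.0
```

No square root is taken, which is why the values are called squared distances and why they should only be compared with each other, not read as percentages.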


Stage 3: Write and Run the Search Function

Now wrap the query in a small function that does the unpacking for you. collection.query returns a dictionary where ids, documents, distances, and metadatas are each a list-of-lists — one inner list per query. Since we send a single query, we read index [0] of each. The function zips those together and returns a tidy list of results.

def search(query, k=3):
    """Return the top-k FAQ matches for query as a list of dicts."""
    res = collection.query(query_texts=[query], n_results=k)
    results = []
    # Each field is a list of lists; index [0] selects our single query.
    for question, meta, dist in zip(
        res["documents"][0], res["metadatas"][0], res["distances"][0]
    ):
        results.append(
            {
                "question": question,
                "answer": meta["answer"],
                "topic": meta["topic"],
                "distance": dist,
            }
        )
    return results

for hit in search("I can't sign into my account"):
    print(f"distance={hit['distance']:.4f}  [{hit['topic']}]  {hit['question']}")
    print(f"    -> {hit['answer']}")

distance=0.9020  [account]  How do I reset my password?
    -> Go to the sign-in page and click "Forgot password." We'll email you a secure link to choose a new one.
distance=1.3205  [account]  How do I update my email address?
    -> Open Account Settings, edit the email field, and confirm the change from the verification email we send.
distance=1.3436  [general]  How do I contact customer support?
    -> Email us at [email protected] or use the live chat button at the bottom-right of any page.

This is the moment the project earns its name. Apart from filler words such as “I” and “my”, the query “I can’t sign into my account” shares no keywords with “How do I reset my password?”: nothing in the raw text links sign in to reset password. Yet that FAQ comes back as the top match with the lowest distance, 0.9020. That is semantic search doing its job: it matched on meaning, not on words. The runner-up entries about email and contacting support are clearly further away.
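
The list-of-lists shape that search unpacks is easier to see when more than one query is involved. The dictionary below is a hand-written stand-in, not a live response: it holds the top two questions and distances this lesson reports for its two example queries, arranged the way a single call with two query texts would return them.

```python
# Hand-written stand-in for the dictionary returned by a query with TWO
# query texts. The questions and distances are the top two hits this lesson
# reports for each query; a real response also carries "ids" and
# "metadatas" in the same nested shape.
res = {
    "documents": [
        ["How do I reset my password?", "How do I update my email address?"],
        ["What is your return policy?", "How long does shipping take?"],
    ],
    "distances": [
        [0.9020, 1.3205],
        [0.8079, 1.0958],
    ],
}
queries = [
    "I can't sign into my account",
    "What is the deadline for returning a purchase?",
]

# Outer index = which query; inner list = that query's ranked hits.
for i, query in enumerate(queries):
    best_question = res["documents"][i][0]
    best_distance = res["distances"][i][0]
    print(f"{query!r} -> {best_question!r} ({best_distance:.4f})")
```

With a single query there is only one inner list, which is why search always reads index [0].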

Let us try a second query in a completely different domain, phrased nothing like the stored FAQ:

for hit in search("What is the deadline for returning a purchase?"):
    print(f"distance={hit['distance']:.4f}  [{hit['topic']}]  {hit['question']}")
    print(f"    -> {hit['answer']}")

distance=0.8079  [returns]  What is your return policy?
    -> You can return any unused item within 30 days of delivery for a full refund.
distance=1.0958  [shipping]  How long does shipping take?
    -> Standard shipping takes 3-5 business days. Express shipping arrives in 1-2 business days.
distance=1.2379  [general]  How do I contact customer support?
    -> Email us at [email protected] or use the live chat button at the bottom-right of any page.

“Deadline for returning a purchase” lands on “What is your return policy?” with a distance of 0.8079, an even tighter match than in the first query. The model understood that deadline for returning and return policy are about the same thing. The shipping FAQ is a distant second, which makes sense: only the question text is embedded, and “How long does shipping take?” is also a question about timing. The gap in distance tells you the first result is the real answer.
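
The gap between the best and second-best hit can be computed rather than eyeballed. The helper name below is hypothetical, and the input is the list-of-dicts shape that search returns, filled with the distances reported for this query. How large a gap counts as decisive depends on your data and model, so treat the number as a signal, not a rule.

```python
def top_gap(results):
    """Distance margin between the best and second-best hit (None if fewer than two)."""
    if len(results) < 2:
        return None
    ordered = sorted(r["distance"] for r in results)
    return ordered[1] - ordered[0]

# Distances reported above for the return-deadline query.
hits = [
    {"question": "What is your return policy?", "distance": 0.8079},
    {"question": "How long does shipping take?", "distance": 1.0958},
    {"question": "How do I contact customer support?", "distance": 1.2379},
]

print(f"gap = {top_gap(hits):.4f}")  # gap = 0.2879
```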


Stage 4: Filter by Metadata, and Prove Persistence

The topic we stored is not just decoration. You can pass a where filter to query so the search only considers documents whose metadata matches. This is how a real system narrows a search — for example, when a user is already in the “Shipping & Delivery” section of a help page and you only want shipping answers.

res = collection.query(
    query_texts=["how do I find my package"],
    n_results=3,
    where={"topic": "shipping"},
)
for question, meta, dist in zip(
    res["documents"][0], res["metadatas"][0], res["distances"][0]
):
    print(f"distance={dist:.4f}  [{meta['topic']}]  {question}")

distance=1.2138  [shipping]  How can I track my order?
distance=1.4408  [shipping]  How long does shipping take?
distance=1.5086  [shipping]  Do you ship internationally?

Every result is from the shipping topic, and the order tracking FAQ rises to the top — exactly what you would want for “how do I find my package.” The filter runs before similarity is scored, so it acts like a WHERE clause in SQL combined with a sort by relevance.
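
A where filter is just a dictionary, so it can be built by a small function. The sketch below uses a hypothetical helper name and relies on the `$in` inclusion operator from Chroma's filter syntax to cover more than one topic; confirm that operator against the Chroma version you have installed before depending on it.

```python
def topic_filter(topics=None):
    """Build a `where` dict for one topic, several topics, or no filter at all."""
    if not topics:
        return None  # caller should leave out `where` entirely
    if isinstance(topics, str):
        return {"topic": topics}
    topics = list(topics)
    if len(topics) == 1:
        return {"topic": topics[0]}
    # Assumed Chroma syntax: match any value in the given list.
    return {"topic": {"$in": topics}}

print(topic_filter("shipping"))
print(topic_filter(["shipping", "returns"]))
print(topic_filter())
```

The single-topic form is the same dictionary used in the query above; the multi-topic form would suit a help page that groups shipping and returns together.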

Finally, the payoff for using a persistent client. Run the entire script a second time, without deleting the chroma_db folder. The data is already on disk, so the count() guard from Stage 2 sees fifteen documents and skips the embedding step entirely:

Collection already has 15 FAQs - skipping embedding.
Count: 15

No re-embedding, no wasted work — the collection simply loaded what it saved before and was ready to search instantly. That is the difference between a demo and a service.

Embed once, search forever

The count() == 0 guard plus PersistentClient is the pattern you will reuse in almost every real project. Embedding is the expensive step; once it is done and saved, every restart is free and fast. You only re-embed when your source data actually changes. Keep this idea close, because in Module 7 (Retrieval-Augmented Generation) the documents you retrieve here become the context you feed to Claude. The search you just wrote is the retrieval half of RAG — next you will hand the matched answer to a model so it can write a grounded, conversational reply.
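
One caveat: the count() guard only knows whether the collection is empty, so it cannot tell that an existing answer was edited. A possible way to notice changed source data, sketched here with a hypothetical helper, is to hash the rows and compare the digest with one saved from the previous run. Where that saved digest lives (for example, a small text file next to chroma_db) is left as an assumption.

```python
import hashlib
import json

def fingerprint(rows):
    """Stable SHA-256 digest of a list of (id, question, answer) rows."""
    payload = json.dumps(sorted(rows), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

original = [
    (2, "How long does shipping take?",
     "Standard shipping takes 3-5 business days. "
     "Express shipping arrives in 1-2 business days."),
    (3, "What is your return policy?",
     "You can return any unused item within 30 days of delivery "
     "for a full refund."),
]

# Hypothetical edit: support changes the return window.
edited = [
    original[0],
    (3, original[1][1], original[1][2].replace("30 days", "60 days")),
]

# Row order does not matter, but any change to the text does.
print(fingerprint(original) == fingerprint(list(reversed(original))))  # True
print(fingerprint(original) == fingerprint(edited))                    # False
```

When the digests differ, you know it is time to update or rebuild the collection; when they match, the guard's shortcut is safe.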


Practice Exercises

Try these on your own to deepen what you have built. Each one is a small, realistic extension.

Exercise 1: Add a “no good match” threshold

Right now search always returns its top k results, even when nothing is genuinely relevant. Add a distance threshold so that if the best match is too far away, you return a friendly fallback instead of a misleading answer. Try a query that has no real FAQ, such as "what is the weather like today", and observe how high the best distance gets.

Hint

Query for n_results=1, read res["distances"][0][0], and compare it to a cutoff. A query like "what is the weather like today" returns a best distance near 1.65 against this dataset — well above your strongest matches around 0.8. Pick a cutoff between those (for example, 1.5) and return a message like “Sorry, I couldn’t find an answer to that” when the best distance exceeds it.
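
If you want something to compare your attempt against, here is one possible shape for the threshold logic. It works on the list of dicts that search returns, so it runs without a database. The function name and the placeholder answer text are made up for illustration, and the 1.5 cutoff is simply the example value from the hint.

```python
FALLBACK = "Sorry, I couldn't find an answer to that."

def answer_or_fallback(results, cutoff=1.5):
    """Return the best answer, or FALLBACK when the best distance exceeds cutoff."""
    if not results:
        return FALLBACK
    best = min(results, key=lambda r: r["distance"])
    return best["answer"] if best["distance"] <= cutoff else FALLBACK

good = [
    {"answer": "You can return any unused item within 30 days of delivery "
               "for a full refund.", "distance": 0.8079},
]
# The hint puts the weather query's best distance near 1.65; the answer
# text here is a placeholder, not real output.
weak = [{"answer": "(some unrelated FAQ answer)", "distance": 1.65}]

print(answer_or_fallback(good))
print(answer_or_fallback(weak))
```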

Exercise 2: Search within a chosen topic

Add a topic parameter to your search function so the caller can optionally restrict the search to one category. When the parameter is provided, pass where={"topic": topic} to the query; when it is None, search across everything.

Hint

Build the query keyword arguments conditionally: start with kwargs = {"query_texts": [query], "n_results": k}, then add kwargs["where"] = {"topic": topic} only if topic is not None. Call collection.query(**kwargs). Compare searching "how do I find my package" with and without topic="shipping".

Exercise 3: Add new FAQs without re-embedding the rest

Imagine support publishes a new FAQ. Append one new row to the collection using collection.add with a fresh string id (such as "16"), then re-query to confirm it can be found — all without touching the existing fifteen documents.

Hint

Call collection.add(ids=["16"], documents=["Can I pause my subscription instead of cancelling?"], metadatas=[{"answer": "Yes, open Billing and choose Pause to hold your plan for up to three months.", "topic": "account"}]). Then check that collection.count() is now 16 and run a search like "can I temporarily stop my plan" to see the new entry surface. Because collection.add embeds only the documents you pass to it, the fifteen existing entries are left untouched and only the new one gets embedded.


Summary

You built a complete, persistent semantic search system over real data. You loaded a FAQ dataset with pandas, derived a simple topic label for each entry, embedded every question once into a Chroma collection saved on disk, and wrote a search function that returns the matched question, its answer, and the distance. You also scoped a search by metadata and proved that re-running the script reuses the saved embeddings instead of recomputing them.

Key Concepts

  • PersistentClient writes a collection (and its embeddings) to disk, so your data survives restarts.
  • Embed once with a count() guard — add documents only when the collection is empty, so you never pay to re-embed unchanged data.
  • String ids and scalar metadata — Chroma requires string ids, and metadata values must be str, int, float, or bool, never lists.
  • Documents vs. metadata — you embed and search the document text (the questions), while answers and topics ride along as retrievable metadata.
  • Distance as confidence — with the default squared L2 metric, a lower distance means a closer match, which you can threshold to detect “no good answer.”
  • Metadata filtering — a where clause narrows the candidate set before similarity scoring, like a WHERE plus a relevance sort.

Why This Matters

Almost every practical use of large language models that needs current or private knowledge starts exactly like this: take your real documents, embed them once, store them in a vector database, and search by meaning. The knowledge base you built is not a toy — swap fifteen FAQs for fifteen thousand pages of product docs and the code barely changes. More importantly, this retrieval layer is the foundation of retrieval-augmented generation, where the documents you find here are handed to a model as grounding context. Get retrieval right, and everything you build on top of it gets more accurate and more trustworthy.

Next Steps

Continue to Module 7 - Retrieval-Augmented Generation

Take the documents your search function retrieves and feed them to Claude as context, so it can answer questions grounded in your own knowledge base instead of guessing.

Back to Module Overview

Revisit embeddings, vector databases, and similarity search to reinforce the foundations before moving on.


Continue Building Your Skills

You now have a working pattern you can reach for again and again: load, embed once, persist, and search by meaning. Carry the count() guard and the tidy search function forward, because in the next module you will plug this exact retrieval step into a full RAG pipeline and watch Claude answer from documents you control. Keep the project handy — you are about to build on top of it.