Lesson 2 - Building a RAG Pipeline

Welcome to Building a RAG Pipeline

In Lesson 1 you watched the three RAG steps run end-to-end as one straight-line script. That’s perfect for seeing the idea, but it’s not something you’d ship. The retrieval logic, the prompt text, and the API call were all tangled together in one block — so the moment you wanted to change the model, tune how many documents to pull, or reword the grounding instruction, you’d be editing a wall of code and hoping you didn’t break the rest. This lesson fixes that. You’ll turn retrieve → augment → generate into three small, named functions, then compose them into a single answer() function you can call on any question.

The payoff is separation of concerns: each step lives in its own function, so you can change one without touching the others.

By the end of this lesson, you will be able to:

  • Write a retrieve(question, k) function that queries a Chroma collection
  • Write a build_prompt(question, docs) function that assembles a grounded prompt
  • Write a generate(prompt) function that calls Claude
  • Compose all three into a single reusable answer(question, k=3) function

You’ll reuse the same four-document knowledge base from Lesson 1, but this time the code will be clean. Let’s build it.


Step 1: A Retrieval Function

First, set up the collection (same knowledge base as Lesson 1) and create the Claude client once, so every function can share them:

import chromadb
import anthropic

client = chromadb.Client()
collection = client.get_or_create_collection(name="knowledge_base")
collection.add(
    documents=[
        "You can return any unused item within 30 days of delivery for a full refund.",
        "Standard shipping takes 3-5 business days; express arrives in 1-2 days.",
        "We ship to over 60 countries worldwide.",
        "To reset your password, go to the sign-in page and click Forgot password.",
    ],
    ids=["d0", "d1", "d2", "d3"],
)

llm = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment

Now the first function. retrieve takes a question and a number k, queries the collection, and returns just the list of matching documents — nothing else. It hides the Chroma-specific detail (the nested results["documents"][0] indexing) so the rest of your pipeline never has to know how the vector store returns data.

def retrieve(question, k=3):
    results = collection.query(query_texts=[question], n_results=k)
    return results["documents"][0]

Try it on its own. A function you can call in isolation is a function you can trust:

docs = retrieve("How many days do I have to return something?", k=3)
for doc in docs:
    print("-", doc)

- You can return any unused item within 30 days of delivery for a full refund.
- Standard shipping takes 3-5 business days; express arrives in 1-2 days.
- To reset your password, go to the sign-in page and click Forgot password.

The return-policy sentence comes back first — it’s the closest in meaning to the question. The other two are less relevant but still pulled in because we asked for k=3. Notice that k is a knob you control: ask for more documents and you give the model more context (and more chances to be distracted); ask for fewer and you keep the prompt tight. Because retrieval is now its own function, tuning k is a one-line change.
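To make the nested indexing and the effect of k concrete, here is a minimal offline sketch. FakeCollection is a hypothetical stand-in, not Chroma: it ranks by simple word overlap rather than embedding similarity, and exists only to mimic the result shape (one inner list of documents per query text). retrieve_from is likewise an illustrative name for the same unwrapping that retrieve performs.

```python
import re


class FakeCollection:
    """Hypothetical stand-in for a Chroma collection (illustration only)."""

    def __init__(self, documents):
        self.documents = documents

    def query(self, query_texts, n_results):
        # Mimic the nested result shape: one inner list per query text.
        ranked_per_query = []
        for text in query_texts:
            query_words = set(re.findall(r"\w+", text.lower()))
            # Toy ranking by shared words; real Chroma compares embeddings.
            ranked = sorted(
                self.documents,
                key=lambda doc: len(query_words & set(re.findall(r"\w+", doc.lower()))),
                reverse=True,
            )
            ranked_per_query.append(ranked[:n_results])
        return {"documents": ranked_per_query}


fake_collection = FakeCollection([
    "You can return any unused item within 30 days of delivery for a full refund.",
    "Standard shipping takes 3-5 business days; express arrives in 1-2 days.",
    "We ship to over 60 countries worldwide.",
    "To reset your password, go to the sign-in page and click Forgot password.",
])


def retrieve_from(collection, question, k=3):
    # We send exactly one query text, so we take the first (only) inner list.
    results = collection.query(query_texts=[question], n_results=k)
    return results["documents"][0]


question = "How many days do I have to return something?"
top_one = retrieve_from(fake_collection, question, k=1)
top_three = retrieve_from(fake_collection, question, k=3)
print(len(top_one), len(top_three))  # prints: 1 3
```

The caller only ever sees a flat list of strings, whatever k is; the `[0]` never leaks out of the retrieval function.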


Step 2: A Grounded Prompt Builder

Retrieval gives you a list of documents. The model needs a prompt. build_prompt is the bridge: it takes the question and the retrieved documents and assembles the grounded prompt — the same shape you saw in Lesson 1, now in a reusable function.

def build_prompt(question, docs):
    context = "\n".join(f"- {doc}" for doc in docs)
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

This function does no retrieval and makes no API calls — it’s pure string assembly, which makes it trivial to inspect. Print what it produces and you can read exactly what the model will see:

docs = retrieve("How many days do I have to return something?", k=3)
print(build_prompt("How many days do I have to return something?", docs))

Answer the question using ONLY the context below. If the answer is not in the context, say you don't know.

Context:
- You can return any unused item within 30 days of delivery for a full refund.
- Standard shipping takes 3-5 business days; express arrives in 1-2 days.
- To reset your password, go to the sign-in page and click Forgot password.

Question: How many days do I have to return something?

The grounding instruction comes first, then the retrieved documents as a bulleted context block, then the question. Keeping this in its own function means build_prompt is the only place the prompt wording lives: when you want to tighten the grounding in Lesson 4, you edit one function and the whole pipeline updates.
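Because build_prompt is pure string assembly, it can be checked with plain assertions and no network access. The sketch below repeats the function so it runs on its own; the sample documents and the particular checks are illustrative assumptions, not part of the lesson’s pipeline.

```python
def build_prompt(question, docs):
    context = "\n".join(f"- {doc}" for doc in docs)
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


sample_docs = ["Doc one.", "Doc two."]  # made-up documents for the check
prompt = build_prompt("What is in doc one?", sample_docs)

# Instruction first, one bullet per document, question last.
assert prompt.startswith("Answer the question using ONLY the context below.")
assert "Context:\n- Doc one.\n- Doc two.\n\n" in prompt
assert prompt.endswith("Question: What is in doc one?")

# Same inputs, same output: no hidden state and no side effects.
assert build_prompt("What is in doc one?", sample_docs) == prompt

# With no documents the context block is simply empty, so the grounding
# instruction is the only thing steering the model toward "I don't know".
empty_prompt = build_prompt("Anything?", [])
assert "Context:\n\n\nQuestion: Anything?" in empty_prompt
```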


Step 3: A Generation Call

The last step sends the prompt to Claude and returns the text of the answer. generate knows nothing about Chroma or about how the prompt was built — it just takes a finished prompt string and calls the model.

def generate(prompt):
    response = llm.messages.create(
        model="claude-haiku-4-5",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

Because the model choice lives entirely inside this one function, swapping models later (for a faster or smarter one) is a single-line edit — and nothing else in your pipeline has to change. That’s the whole point of pulling generation out on its own.
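One way to exercise generate without spending tokens is to pass the client in as an argument and substitute a stub in tests. This is a sketch of that idea, not something the lesson requires: the client, model, and max_tokens parameters and the FakeClient and FakeMessages classes are assumptions introduced here. The stub mimics only the two things generate relies on, a messages.create(...) method and a response whose content[0].text holds the answer.

```python
from types import SimpleNamespace


def generate(prompt, client, model="claude-haiku-4-5", max_tokens=150):
    # Variant of the lesson's generate with the client and model injected.
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


class FakeMessages:
    """Hypothetical stub: records the request and returns a canned reply."""

    def __init__(self):
        self.last_request = None

    def create(self, **kwargs):
        self.last_request = kwargs
        return SimpleNamespace(content=[SimpleNamespace(text="stubbed answer")])


class FakeClient:
    """Hypothetical stand-in exposing only the .messages attribute."""

    def __init__(self):
        self.messages = FakeMessages()


fake = FakeClient()
reply = generate("Question: test?", client=fake, model="some-other-model")
print(reply)                                # prints: stubbed answer
print(fake.messages.last_request["model"])  # prints: some-other-model
```

With the real client from Step 1, the call would be generate(prompt, client=llm); swapping models stays a one-argument change.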

Why three functions instead of one block

Splitting the pipeline into retrieve, build_prompt, and generate makes each part independently tunable. Want different retrieval? Edit retrieve; build_prompt and generate don’t notice. Want to reword the grounding instruction? Edit build_prompt. Want a different model or token limit? Edit generate. Each concern lives in exactly one place, so a change in one step is far less likely to break another. This is the same separation-of-concerns discipline that keeps any real codebase maintainable, and RAG is no exception.


Step 4: Compose Them Into answer()

Now the satisfying part. Each step is a clean function, so the full pipeline is just three lines: retrieve, build, generate. Wrap them in answer() and you have a single function that takes a question and returns a grounded answer.

def answer(question, k=3):
    docs = retrieve(question, k)
    prompt = build_prompt(question, docs)
    return generate(prompt)

That’s the entire RAG pipeline. The k=3 default flows straight through to retrieve, so callers can tune retrieval breadth per question without touching the internals. Run it on a couple of real questions:

print(answer("How many days do I have to return something?"))

You have 30 days to return any unused item for a full refund.

print(answer("How long does express shipping take?"))

Express shipping takes 1-2 days.

Two different questions, one function, both answered correctly and entirely from the retrieved documents. The first pulled the return policy to the top and Claude read “30 days” out of it; the second retrieved the shipping sentence and Claude extracted just the express figure. And because the grounding instruction is baked into build_prompt, the pipeline still declines gracefully when the knowledge base can’t help:

print(answer("Do you offer gift wrapping?", k=2))

I don't know. The context provided doesn't include information about gift wrapping services.

Nothing in the four documents mentions gift wrapping, so answer() says so instead of inventing a policy. You now have a real, reusable RAG pipeline: change the collection that retrieve queries and everything downstream works the same way.
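The composition can also be made explicit by passing the three steps in as arguments, which is one possible way to swap any single step (for example, a retrieve bound to a different collection). The sketch below is a variation under stated assumptions rather than the lesson’s code: make_answer is a hypothetical helper, and the three fake_ functions stand in for Chroma and Claude so the example runs offline.

```python
def make_answer(retrieve, build_prompt, generate):
    """Return an answer(question, k) function wired to the three given steps."""

    def answer(question, k=3):
        docs = retrieve(question, k)
        prompt = build_prompt(question, docs)
        return generate(prompt)

    return answer


calls = []  # records the order in which the steps run


def fake_retrieve(question, k):
    calls.append(("retrieve", k))
    return ["Express arrives in 1-2 days."][:k]


def fake_build_prompt(question, docs):
    calls.append(("build_prompt", len(docs)))
    return f"Context: {docs} Question: {question}"


def fake_generate(prompt):
    calls.append(("generate", None))
    return f"echo: {prompt}"  # echoes the prompt instead of calling a model


answer = make_answer(fake_retrieve, fake_build_prompt, fake_generate)
result = answer("How long does express shipping take?", k=1)
print(calls)
# prints: [('retrieve', 1), ('build_prompt', 1), ('generate', None)]
```

The recorded calls show the fixed order retrieve, build_prompt, generate, and that the caller’s k reaches the retrieval step unchanged.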


Practice Exercises

Exercise 1: Tune k

Call answer("How many days do I have to return something?", k=1). Why might passing k=1 instead of the default k=3 still produce a correct answer here — and when could a small k hurt you?

Hint

With k=1 only the single closest document is retrieved — and for the return question, that’s the return-policy sentence, so the answer is still correct. A small k hurts when the answer needs information spread across several documents, or when the single top match isn’t quite the right one; then too-narrow retrieval starves the model of the context it needs.
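A small offline sketch of the “answer spread across several documents” case. The ranking here is hard-coded as an assumption (retrieve_stub simply slices a fixed list), and the two-part question and facts_covered helper are hypothetical; the point is only that k=1 can leave out a document the answer depends on.

```python
# Assumed ranking for the question below; a real vector store would compute it.
ranked_docs = [
    "Standard shipping takes 3-5 business days; express arrives in 1-2 days.",
    "We ship to over 60 countries worldwide.",
    "You can return any unused item within 30 days of delivery for a full refund.",
]


def retrieve_stub(question, k):
    # Stand-in for retrieve(): return the top-k of the fixed ranking.
    return ranked_docs[:k]


question = "Do you ship abroad, and how long does express take?"
needed_facts = ["60 countries", "1-2 days"]  # both are required for a full answer


def facts_covered(docs):
    # Which of the needed facts appear somewhere in the retrieved documents?
    return [fact for fact in needed_facts if any(fact in doc for doc in docs)]


for k in (1, 2):
    print(f"k={k} covers {facts_covered(retrieve_stub(question, k))}")
# prints: k=1 covers ['1-2 days']
# prints: k=2 covers ['60 countries', '1-2 days']
```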

Exercise 2: Swap the model in one place

Suppose you want to try a different Claude model for generation. Which of the three functions do you edit, and which functions stay completely untouched?

Hint

You edit only generate — the model="claude-haiku-4-5" argument lives there and nowhere else. retrieve, build_prompt, and answer stay exactly as they are. That isolation is the direct benefit of separating concerns into one function per step.

Exercise 3: Inspect before you generate

Before calling answer() on a new question, you want to check what context the model will receive. Using only retrieve and build_prompt, how would you print the exact prompt without making any API call?

Hint

Call docs = retrieve(question) then print(build_prompt(question, docs)). Neither function touches the API, so you see the full grounded prompt — instruction, context bullets, and question — without spending a single token on generation. This is why pulling prompt assembly into its own pure function is so useful for debugging.


Summary

A RAG pipeline is cleanest when each of its three steps is its own function. retrieve(question, k) queries the Chroma collection and returns the matching documents, hiding the vector-store details. build_prompt(question, docs) assembles the grounded prompt — instruction, context, question — as pure string work with no side effects. generate(prompt) sends that prompt to Claude and returns the answer text. Composing them gives you answer(question, k=3), a single reusable entry point that you ran on real questions: it answered “30 days” and “1-2 days” from the retrieved documents, and said “I don’t know” when the knowledge base couldn’t help. Because each concern lives in exactly one function, you can retune retrieval, reword the prompt, or swap the model independently.

Key Concepts

  • retrieve(question, k) — queries the vector database and returns the top-k documents.
  • build_prompt(question, docs) — pure-function prompt assembly with the grounding instruction baked in.
  • generate(prompt) — the single place the model call (and model choice) lives.
  • answer(question, k=3) — composes the three steps into one reusable RAG entry point.
  • Separation of concerns — one function per step, so each piece can be swapped or tuned without breaking the others.

Why This Matters

Production RAG systems are commonly built this way: small, composable functions you can test, tune, and replace independently. When you later add chunking, better retrieval, or a different prompt strategy, you’ll change one function at a time instead of rewriting a monolith. The answer() function you just built is the skeleton of a real “chat with your documents” application; point it at a bigger collection and the structure doesn’t change. Master this composition pattern and the rest of the module is about improving each piece in place.


Next Steps

Continue to Lesson 3 - Chunking Documents

Real documents are too long to embed whole. Learn to split them into chunks so retrieval pulls the right passage, not the whole file.


Continue Building Your Skills

You’ve gone from a tangled one-shot script to a clean, composable pipeline — three functions and one answer() that ties them together. Next you’ll feed it realistic data: documents far too long to embed whole, split into chunks so retrieval surfaces the right passage instead of an entire file.