Lesson 3 - RAG with LangChain
Welcome to RAG with LangChain
In Module 7 you built retrieval-augmented generation from the ground up: you chunked a document yourself, computed embeddings, stored vectors, ran a similarity search, stuffed the top hits into a prompt, and called the model. It worked, and now you understand every moving part. In this lesson you’ll rebuild that same pipeline with LangChain — and you’ll watch it collapse into a handful of composable lines. Nothing here is new capability; it’s the same retrieve-and-generate flow, expressed with the framework’s building blocks. The point is recognition: you’ll see a splitter, embeddings, a retriever, and a prompt, and you’ll know exactly what each one does because you wrote them by hand first.
We’ll use the same “Acme Cloud Notes” handbook as our knowledge source. Download it from https://datatweets.com/datasets/product-handbook.md and save it next to your script as product-handbook.md.
By the end of this lesson, you will be able to:
- Load a document and split it into chunks with RecursiveCharacterTextSplitter
- Embed and store chunks in a Chroma vector store using local HuggingFace embeddings
- Turn a vector store into a retriever and run a similarity search
- Compose a grounded RAG chain ({context, question} | prompt | model | parser) that answers from the document and admits when it can’t
We’re on LangChain 1.x. Let’s rebuild RAG.
Load and Split the Document
RAG starts with the same first step you did by hand: take a long document and cut it into chunks small enough to retrieve precisely and stuff into a prompt. LangChain’s RecursiveCharacterTextSplitter does the cutting. It tries to split on natural boundaries first (paragraph breaks, then line breaks, then spaces between words) and only falls back to a hard character cut when it must, which keeps related text together. You give it a chunk_size (the maximum chunk length, measured in characters by default) and a chunk_overlap (how much each chunk repeats from the previous one, so a fact straddling a boundary isn’t lost).
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Read the handbook; UTF-8 is stated explicitly so the result does not depend on the OS default
with open("product-handbook.md", encoding="utf-8") as f:
    text = f.read()

# Chunks of at most 500 characters, each repeating up to 100 characters of the previous one
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.create_documents([text])

print("Number of chunks:", len(chunks))
print("First chunk:", repr(chunks[0].page_content[:120]))

Number of chunks: 14
First chunk: '# Acme Cloud Notes — Product Handbook\n\n## Accounts and Sign-In'

The handbook became 14 chunks. We used create_documents, which returns a list of Document objects: the same Document type you’ll embed and retrieve, each carrying a .page_content string (and a .metadata dict). If you only want raw strings, splitter.split_text(text) returns a list[str] instead. Either way, this one call replaces the manual loop you wrote in Module 7 to slice text and track overlap by hand.
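To make the “natural boundaries first” idea concrete, here is a minimal pure-Python sketch of recursive splitting. It is not LangChain’s implementation: the function name recursive_split, the separator tuple, and the sample text are assumptions made for illustration, and the sketch leaves out the merging of small pieces and the overlap that the real splitter handles.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most max_len characters.

    Hypothetical helper for illustration only: it tries the coarsest
    separator first and recurses with finer ones. Unlike the real
    splitter, it does not merge small pieces or add overlap.
    """
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # Last resort: hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    first, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        pieces.extend(recursive_split(part, max_len, rest))
    return pieces


# Made-up sample text with a paragraph break and a line break.
sample = "Plans\n\nThe Pro plan costs 8 dollars.\nStarter is free."
pieces = recursive_split(sample, max_len=30)
for piece in pieces:
    print(repr(piece))
```

With a 30-character limit the sample breaks at the paragraph boundary and then at the line boundary, so no sentence is cut in half. A hard cut only happens when a piece has no separators left to try.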
Embed and Store in a Chroma Vector Store
Next you turn each chunk into a vector and store it so you can search by meaning. In Module 7 you called an embedding model and managed the vectors yourself; here, HuggingFaceEmbeddings runs a small embedding model locally (no API key, no cost, and no network call for embeddings once the model weights have been downloaded on first use), and Chroma.from_documents embeds every chunk and stores it in one step. Then .as_retriever() hands you a retriever, the component whose entire job is “give me the k chunks most relevant to this query.”
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Local embedding model; Chroma embeds and stores every chunk in one call
emb = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vs = Chroma.from_documents(chunks, embedding=emb, collection_name="handbook")

# Retriever that returns the 2 most similar chunks for a query
retriever = vs.as_retriever(search_kwargs={"k": 2})
hits = retriever.invoke("How much does the Pro plan cost?")

print("Number of hits:", len(hits))
print("Top hit:", hits[0].page_content[:220])

Number of hits: 2
Top hit: Acme Cloud Notes offers three plans. The Starter plan is free forever and includes up to 100 notes and 1 GB of storage. The Pro plan costs 8 dollars per month and raises those limits to unlimited notes and 50 GB of stora

Notice the retriever speaks the same invoke() interface as every other LangChain component, and it returns a list[Document], exactly the chunks you split earlier. We asked for k=2, so it returned the two most semantically similar chunks, and the top one is precisely the “Plans and Billing” section that answers a pricing question. The model never sees the whole handbook; it will see only these focused chunks. That selective retrieval is the heart of RAG, and you just rebuilt it with three lines.
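If you want a reminder of what “most semantically similar” means underneath the retriever, the sketch below ranks a toy store by cosine similarity, much like the similarity search you wrote in Module 7. Treat it as an illustration under assumptions: the three-dimensional vectors are made up, the helper names cosine and top_k are hypothetical, and Chroma’s actual index and distance metric may differ in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, store, k=2):
    """store is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up 3-dimensional vectors; a real embedding model produces hundreds of dimensions.
toy_store = [
    ("Pro plan costs 8 dollars per month", [0.9, 0.1, 0.0]),
    ("Deleted notes stay in Trash for 30 days", [0.0, 0.2, 0.9]),
    ("Annual billing has a 20 percent discount", [0.7, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]  # stand-in vector for a pricing question

hits = top_k(query, toy_store, k=2)
print(hits)
```

The two billing-related texts point in roughly the same direction as the query vector, so they rank above the Trash policy. That is the same job search_kwargs={"k": 2} asks the retriever to do.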
Compose the RAG Chain
Now you wire retrieval and generation together. The pattern is the LCEL pipe from Lesson 1, with one new idea: the first stage is a dictionary of parallel work. The question flows straight through with RunnablePassthrough(), while the same question is sent to the retriever, whose chunks are joined into a single context string. Both land in the prompt, which goes to the model, which goes to the parser. Read it left to right and it’s the exact flow you hand-coded — retrieve, build context, prompt, generate — just composed.
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

model = ChatAnthropic(model="claude-haiku-4-5", max_tokens=120)

# Grounding prompt: answer only from the retrieved context
prompt = ChatPromptTemplate.from_template(
    "Answer using ONLY this context. If the answer is not in the context, "
    "say you don't know.\n\nContext:\n{context}\n\nQuestion: {question}"
)

def format_docs(ds):
    # Join the retrieved Document objects into one bulleted context string
    return "\n".join(f"- {d.page_content}" for d in ds)

# The dict runs both branches on the same input question
rag = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

print(rag.invoke("How much does the Pro plan cost, and what does it include?"))
print("---")
print(rag.invoke("How long do deleted notes stay in the Trash?"))

The Pro plan costs $8 per month and includes:
- Unlimited notes
- 50 GB of storage
- Version history
(Note: Annual billing is available at a 20 percent discount if you choose that option instead of monthly billing.)
---
According to the context, deleted notes remain in Trash for 30 days before they are permanently removed.

Both answers are correct and grounded in the handbook: the price, the storage, the version history, and the 30-day window all come straight from the retrieved chunks, not from the model’s general knowledge. The whole pipeline, from raw document to grounded answer, is about a dozen lines. In Module 7 the same thing took a page of hand-written code; here the framework supplies the splitter, the vector store, the retriever, and the prompt plumbing, and you compose them with the pipe.
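To see what the dictionary stage hands to the prompt, it can help to imitate it with plain functions. The sketch below is only an approximation: Doc is a stand-in for LangChain’s Document, fake_retriever is a hypothetical retriever that always returns the same two chunks, and first_stage and TEMPLATE are illustrative names, not framework objects.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for LangChain's Document; only page_content is modelled."""
    page_content: str


def format_docs(docs):
    # Same joining logic as the lesson's format_docs.
    return "\n".join(f"- {d.page_content}" for d in docs)


def fake_retriever(question):
    # Hypothetical retriever: ignores the question and returns two fixed chunks.
    return [
        Doc("The Pro plan costs 8 dollars per month."),
        Doc("Deleted notes remain in Trash for 30 days."),
    ]


def first_stage(question):
    # Imitates {"context": retriever | format_docs, "question": RunnablePassthrough()}:
    # every key receives the same input, and the results are collected in a dict.
    return {
        "context": format_docs(fake_retriever(question)),
        "question": question,
    }


TEMPLATE = (
    "Answer using ONLY this context. If the answer is not in the context, "
    "say you don't know.\n\nContext:\n{context}\n\nQuestion: {question}"
)

filled = TEMPLATE.format(**first_stage("How much does the Pro plan cost?"))
print(filled)
```

Printing the filled template shows exactly the text the model would receive: the grounding instruction, the bulleted chunks, and the unchanged question. In the real chain the dictionary is turned into a parallel runnable, so both branches are fed the same input before the prompt is filled.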
Every piece is something you built by hand
Look at what each component replaced from Module 7: the splitter is your manual chunking loop, HuggingFaceEmbeddings + Chroma is your embed-and-store code, the retriever is your similarity search, and the prompt is your “answer using only this context” template. The framework didn’t invent RAG — it packaged the exact pipeline you wrote from scratch into composable parts. Because you built it the hard way, you can read this chain and know precisely what every line is doing under the hood.
Grounding Means Knowing What It Doesn’t Know
A RAG system is only trustworthy if it refuses to answer when the document doesn’t contain the answer. Our prompt explicitly tells the model to say it doesn’t know when the context is silent, and that instruction is what keeps it from hallucinating. The handbook lists an email address for support but no phone hotline — so let’s ask for the thing that isn’t there.
print(rag.invoke("Does Acme Cloud Notes have a phone support hotline number?"))

I don't know. The context provided does not mention a phone support hotline number for Acme Cloud Notes.

This is the behavior you want. The retriever still pulled the support-related chunks, but those chunks only mention email and in-app chat, with no phone number, so the model followed the instruction and declined rather than inventing a number. That honest “I don’t know” is the difference between a grounded assistant and a confident liar, and it’s why the prompt’s grounding instruction is not optional.
Practice Exercises
Exercise 1: Tune the chunking
Change chunk_size to 1000 and re-run the split. How does the chunk count change, and what trade-off are you making between chunk size and retrieval precision?
Hint
Larger chunks mean fewer of them, so the count drops. Bigger chunks carry more surrounding context but retrieve less precisely (each hit pulls in more unrelated text); smaller chunks are sharper but risk splitting a fact across a boundary — which is exactly what chunk_overlap guards against.
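For a rough feel of the arithmetic behind this hint, the sketch below counts chunks for a plain fixed-size sliding window. It is a simplification under stated assumptions: estimate_chunks is a hypothetical helper, the 5000-character document length is invented, and RecursiveCharacterTextSplitter cuts on separators, so its real counts (such as the 14 chunks earlier in the lesson) will differ.

```python
import math


def estimate_chunks(text_len, chunk_size, chunk_overlap):
    """Windows needed to cover text_len characters when each window is
    chunk_size long and advances by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_len <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((text_len - chunk_size) / step)


doc_len = 5000  # hypothetical document length in characters
for size in (500, 1000):
    print(size, estimate_chunks(doc_len, size, chunk_overlap=100))
```

In this toy setting, doubling chunk_size from 500 to 1000 drops the estimate from 13 windows to 6. The count falls by more than half because the fixed 100-character overlap is a smaller share of each larger chunk.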
Exercise 2: Retrieve more context
Set search_kwargs={"k": 4} instead of 2 and re-run a grounded question. What changes in what the model sees, and when might more chunks help or hurt?
Hint
With k=4 the format_docs context string holds four chunks instead of two. More chunks raise the chance the answer is present, but they also add noise and tokens — too many can bury the relevant passage and cost more. Tune k to the smallest value that reliably includes the answer.
Exercise 3: Read the chain
In {"context": retriever | format_docs, "question": RunnablePassthrough()}, explain what each key produces and why both are needed before the prompt.
Hint
context runs the question through the retriever, then format_docs joins the returned Document objects into one string. question passes the original question straight through unchanged. The prompt template has both {context} and {question} placeholders, so the dictionary fills both in parallel before the filled prompt goes to the model.
Summary
You rebuilt the entire Module 7 RAG pipeline with LangChain in a fraction of the code. You loaded the handbook and split it into 14 chunks with RecursiveCharacterTextSplitter, embedded and stored them in a Chroma vector store using local HuggingFace embeddings (no API key, no cost), turned the store into a retriever, and composed a grounded RAG chain with the LCEL pipe: {context, question} | prompt | model | parser. It answered pricing and policy questions straight from the document, and — because the prompt instructs it to — it honestly said “I don’t know” when asked for a phone number the handbook never lists. Every component mapped cleanly onto something you wrote by hand, which is exactly why you can trust what the framework is doing.
Key Concepts
- RecursiveCharacterTextSplitter: splits text on natural boundaries with chunk_size and chunk_overlap.
- HuggingFaceEmbeddings: a local, free embedding model; no API key required.
- Chroma.from_documents: embeds and stores chunks in a vector store in one call.
- retriever: vs.as_retriever(); returns the top-k most relevant Document chunks for a query.
- RAG chain: {context, question} | prompt | model | parser, where the question both flows through and feeds the retriever.
- grounding: instructing the model to answer only from context, so it declines instead of hallucinating.
Why This Matters
RAG is the most common way to put your own documents in front of an LLM, and it’s a staple of production work — support assistants, internal search, document Q&A all run on this exact pattern. Building it by hand taught you how it works; building it with LangChain teaches you to ship it quickly without losing that understanding. When a retrieval result looks wrong, you’ll know whether to blame the chunking, the embeddings, the k, or the prompt — because you’ve built and now composed every one of those pieces yourself.
Next Steps
Continue to Lesson 4 - Agents with LangGraph
Move from fixed chains to stateful agents — rebuild the think-act-observe loop with LangGraph.
Continue Building Your Skills
You’ve now rebuilt RAG as a clean, composable LangChain chain and seen how much plumbing the framework removes while leaving you in full command of every part. Next you’ll leave fixed chains behind and step into LangGraph, where you’ll rebuild the agent loop — the think, act, observe cycle — as a stateful graph that decides its own steps.