Every NLP task starts with the same unglamorous work: cutting text into tokens, cleaning them up, and converting them into numbers. This guide walks through tokenization, normalization, stemming vs lemmatization, and bag-of-words/TF-IDF vectorization with NLTK and scikit-learn.
“Turn this text into something a model can learn from” sounds like one step, but it’s really four or five quiet ones that happen before any model gets involved. A sentence has to be broken into pieces, those pieces have to be cleaned up and made consistent, and only then can they become the kind of numbers a machine learning algorithm actually operates on. Skip or fumble any of those steps and everything downstream — search, classification, clustering, an LLM’s tokenizer — inherits the mess.
This is the layer most NLP tutorials rush past on the way to a model, which is exactly where people quietly lose the thread: they can copy-paste a training loop, but they can’t explain why their vocabulary has 4,000 words in it or why their stemmer turned “happiness” into “happi”. This guide slows down and builds that foundation first — tokenization, normalization, stemming vs lemmatization, and bag-of-words/TF-IDF vectorization — with real text and real output at every step. If you already have clean, numeric text and want to see it become a trained model, our post on text classification with spaCy picks up right where this one ends.
Text is unstructured until you tokenize it, and it’s not usable by a model until you vectorize it. Every NLP pipeline, no matter how sophisticated, is some version of the same four-stage path: raw text, tokens, normalized tokens, numeric vectors.
Notice what each stage removes or adds, not just what it looks like: tokenization adds structure (a list instead of a blob), normalization removes noise (case, punctuation, filler words), and vectorization adds shape (every document becomes the same fixed-length row of numbers, however long the original sentence was). That last property is the whole point — a model can’t accept “sentences of varying length,” it can only accept “vectors of a fixed size.”
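To make the fixed-length property concrete, here is a minimal standard-library sketch of the path from raw text to vectors. The helper names (`tokenize`, `normalize`, `vectorize`), the tiny stopword set, and the two sample notes are illustrative assumptions, not part of NLTK or scikit-learn.

```python
import re
from collections import Counter

# Hypothetical mini stopword list, for illustration only.
TOY_STOPWORDS = frozenset({"a", "and", "is", "of", "the"})

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(tokens):
    """Lowercase, then drop punctuation tokens and stopwords."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t.isalnum() and t not in TOY_STOPWORDS]

def vectorize(token_lists):
    """Build a shared vocabulary and one count row per document."""
    vocab = sorted({t for toks in token_lists for t in toks})
    rows = []
    for toks in token_lists:
        counts = Counter(toks)
        rows.append([counts[t] for t in vocab])
    return vocab, rows

# Two made-up notes of very different lengths.
notes = [
    "The island is small.",
    "A long voyage, and the sea, and the island of the whale.",
]
vocab, rows = vectorize([normalize(tokenize(n)) for n in notes])
print(vocab)
for row in rows:
    print(row)  # every row has len(vocab) entries, whatever the note's length
```

The short note and the long one end up as rows of the same width, which is the property a downstream model depends on.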
Say you’re building a tiny personal reading log: one short note per book you finish, meant to be searchable later (“show me books that mention islands”). Before you can search notes like that, you need to turn each one into a comparable numeric form. To keep this reproducible without hunting down a book-review dataset of uncertain licensing, we’ll stand in for eight real notes with the opening lines of eight public-domain books (novels, poetry, and story collections), pulled straight from NLTK’s own bundled gutenberg corpus, with no network fetch beyond the one-time nltk.download.
Data: eight public-domain texts (Austen, Carroll, Melville, Whitman, Blake, Edgeworth, Chesterton, and a children’s story collection by Bryant), distributed as NLTK’s built-in gutenberg corpus, itself sourced from Project Gutenberg.
    import re
    import nltk
    from nltk.corpus import gutenberg

    # One-time setup (resource names as of NLTK 3.9):
    # nltk.download("gutenberg"); nltk.download("punkt_tab")

    fileids = [
        "austen-emma.txt", "carroll-alice.txt", "melville-moby_dick.txt",
        "whitman-leaves.txt", "blake-poems.txt", "edgeworth-parents.txt",
        "chesterton-thursday.txt", "bryant-stories.txt",
    ]

    # Headings and front matter to skip when looking for the first real sentence
    SKIP_WORDS = ("CHAPTER", "VOLUME", "BOOK ", "PART ", "ACT ", "SCENE", "PREFACE")

    def first_content_sentence(fileid, min_len=25, search_chars=4000):
        raw = gutenberg.raw(fileid)
        raw = re.sub(r"^\[[^\]]*\]", "", raw, count=1).strip()  # drop "[Title by Author]"
        for sent in nltk.sent_tokenize(raw[:search_chars]):
            cleaned = " ".join(sent.split())  # collapse line breaks and runs of spaces
            if len(cleaned) < min_len or cleaned.isupper():
                continue
            if any(word in cleaned.upper() for word in SKIP_WORDS):
                continue
            return cleaned

    documents = [first_content_sentence(fid) for fid in fileids]
    for fid, doc in zip(fileids, documents):
        print(f"{fid:>24}: {doc}")

             austen-emma.txt: She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period.
           carroll-alice.txt: So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
      melville-moby_dick.txt: (Supplied by a Late Consumptive Usher to a Grammar School) The pale Usher--threadbare in coat, heart, body, and brain; I see him now.
          whitman-leaves.txt: Of physiology from top to toe I sing, Not physiognomy alone nor brain alone is worthy for the Muse, I say the Form complete is worthier far, The Female equally with the Male I sing.
             blake-poems.txt: So I piped with merry cheer.
       edgeworth-parents.txt: Near the ruins of the castle of Rossmore, in Ireland, is a small cabin, in which there once lived a widow and her four children.
     chesterton-thursday.txt: To Edmund Clerihew Bentley A cloud was on the mind of men, and wailing went the weather, Yea, a sick cloud upon the soul when we were boys together.
          bryant-stories.txt: TWO LITTLE RIDDLES IN RHYME There's a garden that I ken, Full of little gentlemen; Little caps of blue they wear, And green ribbons, very fair.

Eight documents of real public-domain prose and poetry with real punctuation quirks: exactly the kind of text that makes the rest of this post honest instead of a toy example. (The outputs in this post come from NLTK 3.9 and scikit-learn 1.9; nothing here relies on version-specific behavior.)
.split() Isn’t Enough

A tokenizer’s job is deciding where one unit of meaning ends and the next begins, and plain whitespace is a bad rule for that. Take one sentence built to show the cracks:
    from nltk.tokenize import word_tokenize

    example = "I can't believe how quickly the runners were running yesterday."
    print("naive split():", example.split())
    print("word_tokenize():", word_tokenize(example))

    naive split(): ['I', "can't", 'believe', 'how', 'quickly', 'the', 'runners', 'were', 'running', 'yesterday.']
    word_tokenize(): ['I', 'ca', "n't", 'believe', 'how', 'quickly', 'the', 'runners', 'were', 'running', 'yesterday', '.']

.split() leaves "yesterday." glued to its period and treats "can't" as one indivisible token. NLTK’s word_tokenize, which applies Penn Treebank-style rules for punctuation and contractions, makes two different, more useful calls: it splits the trailing period off as its own token, and it splits "can't" into "ca" and "n't", because grammatically that contraction really is two words (“can” + “not”) mashed together. Whether that specific split helps or hurts depends on what you’re building next, but the point stands either way: a tokenizer that understands language beats one that just looks for spaces.
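To show what "rules for punctuation and contractions" can look like in practice, here is a small regex-based sketch. It is a hypothetical toy (`toy_tokenize` is not an NLTK function, and it is far less thorough than `word_tokenize`), but it reproduces the two decisions discussed above for this one sentence.

```python
import re

# Alternatives are tried left to right at each position.
TOKEN_PATTERN = re.compile(
    r"\w+(?=n't)"   # the part of a word before an n't contraction ("ca" in "can't")
    r"|n't"         # the contraction itself
    r"|\w+"         # ordinary words
    r"|[^\w\s]"     # any single punctuation mark
)

def toy_tokenize(text):
    """Rule-based tokenizer sketch; handles straight apostrophes only."""
    return TOKEN_PATTERN.findall(text)

example = "I can't believe how quickly the runners were running yesterday."
print(toy_tokenize(example))
```

A handful of regex alternatives already beats whitespace splitting here, but real tokenizers carry many more rules (abbreviations, quotes, currency, and so on), which is the reason to reach for a library one.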
Once you have tokens, the next job is shrinking the amount of meaningless variation between them — “The” and “the” are the same word to a human, but two different strings to a naive vectorizer unless you normalize first.
    doc0 = documents[0]
    print("original:", doc0)

    lowered = doc0.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    tokens0 = word_tokenize(no_punct)
    print("no punctuation:", no_punct)
    print("tokens:", tokens0)

    original: She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period.
    no punctuation: she was the youngest of the two daughters of a most affectionate indulgent father and had in consequence of her sisters marriage been mistress of his house from a very early period
    tokens: ['she', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate', 'indulgent', 'father', 'and', 'had', 'in', 'consequence', 'of', 'her', 'sisters', 'marriage', 'been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very', 'early', 'period']

Look closely and "sister's" became "sisters": a blunt regex has no idea an apostrophe inside a word can mean something different from a comma between words. Hold that thought; it’s worth a full gotcha below.
The next normalization move is removing stopwords — extremely common words (“the”, “of”, “and”, “a”) that carry almost no distinguishing meaning for most tasks. NLTK ships a ready-made English list:
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))
    print("count:", len(stop_words))
    print("sample:", sorted(stop_words)[:15])

    filtered0 = [t for t in tokens0 if t not in stop_words]
    print("after stopword removal:", filtered0)

    count: 198
    sample: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't"]
    after stopword removal: ['youngest', 'two', 'daughters', 'affectionate', 'indulgent', 'father', 'consequence', 'sisters', 'marriage', 'mistress', 'house', 'early', 'period']

Thirty-two tokens shrank to thirteen. What’s left is almost entirely content words, the ones that would actually help distinguish this sentence from a different one, which is exactly what you want going into a vectorizer.
Even after stopword removal, "running", "ran", and "runs" are three separate strings to a computer, even though a person reading them knows they’re the same underlying idea. Stemming and lemmatization both try to collapse word variants down to one shared form — they just disagree about how careful to be.
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    words = ["running", "ran", "runs", "better", "geese", "studies", "happiness"]
    # Hand-assigned WordNet part-of-speech codes: v = verb, a = adjective, n = noun
    pos_map = {"running": "v", "ran": "v", "runs": "v", "better": "a",
               "geese": "n", "studies": "v", "happiness": "n"}

    print(f"{'word':<12}{'stem':<12}{'lemma (default)':<18}lemma (correct pos)")
    for w in words:
        stem = stemmer.stem(w)
        lemma_default = lemmatizer.lemmatize(w)
        lemma_pos = lemmatizer.lemmatize(w, pos=pos_map[w])
        print(f"{w:<12}{stem:<12}{lemma_default:<18}{lemma_pos}")

    word        stem        lemma (default)   lemma (correct pos)
    running     run         running           run
    ran         ran         ran               run
    runs        run         run               run
    better      better      better            good
    geese       gees        goose             goose
    studies     studi       study             study
    happiness   happi       happiness         happiness

A stemmer (PorterStemmer here) works by chopping suffixes according to a fixed set of rules, with no dictionary and no idea what the word actually means. "running" and "runs" both correctly collapse to "run", but "ran" stays "ran" and "geese" becomes the non-word "gees", because no suffix rule can recover an irregular form. A lemmatizer looks the word up against a real dictionary (WordNetLemmatizer uses WordNet) and returns an actual word, so "geese" correctly becomes "goose". But it needs to be told the word’s part of speech, or it silently assumes “noun” and gives up early:
    print("lemmatize('better') ->", lemmatizer.lemmatize("better"))
    print("lemmatize('better', pos='a') ->", lemmatizer.lemmatize("better", pos="a"))

    lemmatize('better') -> better
    lemmatize('better', pos='a') -> good

Without a part-of-speech hint, "better" is assumed to be a noun and left untouched. Tell the lemmatizer it’s an adjective (pos="a") and it correctly resolves to "good". This is the practical tradeoff: stemming is fast and dictionary-free but produces rough, sometimes non-word output; lemmatization produces real words but needs more setup and, ideally, part-of-speech information to do its best work.
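In a real pipeline nobody hand-writes a part-of-speech map per word; the usual approach is to run a tagger and translate its tags. A tagger such as nltk.pos_tag emits Penn Treebank tags (for example "VBG" or "JJR"), while the lemmatizer expects WordNet’s one-letter codes. The helper below is a hypothetical sketch of that translation (`penn_to_wordnet` is not an NLTK function), and the tagged pairs are written by hand so the block runs without any NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r', or 'n')."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # everything else falls back to noun, the lemmatizer's own default

# Hand-written (word, tag) pairs standing in for tagger output.
tagged = [("running", "VBG"), ("better", "JJR"), ("geese", "NNS"), ("quickly", "RB")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

The resulting code would then be passed as the `pos` argument, as in `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.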
A cleaned list of tokens is still not something a model can consume — you need a fixed-size numeric vector per document. The simplest approach is bag-of-words: count how many times each vocabulary word appears in each document, ignoring word order entirely.
    def clean(doc):
        lowered = doc.lower()
        no_punct = re.sub(r"[^\w\s]", "", lowered)
        tokens = word_tokenize(no_punct)
        return [t for t in tokens if t not in stop_words]

    cleaned_docs = [clean(d) for d in documents]
    cleaned_joined = [" ".join(toks) for toks in cleaned_docs]

    from sklearn.feature_extraction.text import CountVectorizer

    count_vec = CountVectorizer()
    count_matrix = count_vec.fit_transform(cleaned_joined)
    vocab = count_vec.get_feature_names_out()
    print("vocabulary size:", len(vocab))

    row = count_matrix.toarray()[3]  # whitman-leaves.txt
    for term, count in zip(vocab, row):
        if count:
            print(f"  {term}: {count}")

    vocabulary size: 111
      alone: 2
      brain: 1
      complete: 1
      equally: 1
      far: 1
      female: 1
      form: 1
      male: 1
      muse: 1
      physiognomy: 1
      physiology: 1
      say: 1
      sing: 2
      toe: 1
      top: 1
      worthier: 1
      worthy: 1

The eight documents together produce a 111-word vocabulary. In Whitman’s line two words repeat: "alone" and "sing" each appear twice, so they get a count of 2 while everything else gets 1. That’s the entire idea of bag-of-words: order is gone, but frequency survives.
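The "order is gone" half of that sentence is easy to check by hand. The sketch below uses only the standard library; `bag_of_words` and the three sample sentences are illustrative assumptions, not scikit-learn code.

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs, ignoring position."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

a = "dog bites man"
b = "man bites dog"   # same words as a, opposite meaning
c = "dog bites dog"   # a repeated word

vocab = sorted(set(" ".join([a, b, c]).split()))
vec_a, vec_b, vec_c = (bag_of_words(s, vocab) for s in (a, b, c))

print(vocab)
print(vec_a, vec_b, vec_a == vec_b)  # identical rows despite different order
print(vec_c)                         # frequency still shows up
```

Sentences `a` and `b` are indistinguishable to a bag-of-words model, which is the price paid for getting fixed-length rows so cheaply.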
Raw counts have a blind spot, though: a word that’s common everywhere (say, across all eight documents) tells you nothing distinctive about any one of them, but a plain count treats it the same as a rare, on-topic word. TF-IDF (term frequency–inverse document frequency) fixes that by weighting a word up when it’s frequent within a document and weighting it down when it’s common across documents — see the scikit-learn TfidfVectorizer documentation for the exact formula it uses:
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf_vec = TfidfVectorizer()
    tfidf_matrix = tfidf_vec.fit_transform(cleaned_joined)
    tfidf_vocab = tfidf_vec.get_feature_names_out()

    row_tfidf = tfidf_matrix.toarray()[3]  # whitman-leaves.txt
    pairs = sorted(
        ((t, w) for t, w in zip(tfidf_vocab, row_tfidf) if w > 0),
        key=lambda p: p[1], reverse=True,
    )
    for term, weight in pairs[:8]:
        print(f"  {term}: {weight:.3f}")

      alone: 0.420
      sing: 0.420
      complete: 0.210
      equally: 0.210
      far: 0.210
      female: 0.210
      form: 0.210
      male: 0.210

"alone" and "sing", the two repeated words, come out on top with double the weight of the other words shown. The doubling comes from term frequency: each appears twice in this document. The inverse-document-frequency part is the same for every word in this list, because none of them shows up anywhere else in the eight-document set; "brain", which also appears in the Melville line, is weighted down and drops out of the top eight. That ranked list is what “turning text into numbers” is actually for: it’s now something you could sort, compare, feed to a classifier, or use to search for the “most Whitman-like” note in the reading log.
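To see where weights like these come from, here is a hand-rolled sketch of the computation. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that scikit-learn documents as TfidfVectorizer defaults; `tfidf_row` and the three-document toy corpus are made up for illustration.

```python
import math
from collections import Counter

def tfidf_row(doc_tokens, corpus):
    """TF-IDF weights for one document: raw count times smoothed IDF, L2-normalized."""
    n_docs = len(corpus)
    raw = {}
    for term, count in Counter(doc_tokens).items():
        df = sum(1 for other in corpus if term in other)  # documents containing term
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        raw[term] = count * idf
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

# Toy corpus: "sing" repeats in the first document, "form" also occurs in the second.
toy_corpus = [
    ["sing", "alone", "sing", "form"],
    ["form", "castle"],
    ["castle", "widow"],
]
weights = tfidf_row(toy_corpus[0], toy_corpus)
for term, w in sorted(weights.items(), key=lambda p: p[1], reverse=True):
    print(f"{term}: {w:.3f}")
```

The repeated word gets exactly twice the weight of a once-only word with the same document frequency, and the word shared with another document is pushed below both.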
A blunt regex can quietly merge or change words, not just strip punctuation. Look back at two real artifacts from this post’s own dataset: "sister's" became "sisters" (an apostrophe means something different from a comma, but [^\w\s] treats them identically), and "Usher--threadbare" became one token, "usherthreadbare" (a double hyphen standing in for a dash, with no surrounding spaces, so deleting it glued two separate words together). Neither is a bug exactly; it’s what a simple regex does. But both are worth eyeballing before you trust a “cleaned” dataset.
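One way to soften both artifacts is to treat dashes as word separators and to keep apostrophes that sit inside a word. The function below is a sketch under those assumptions (`gentle_clean` is a hypothetical name, and it only knows about straight apostrophes); whether hyphenated words should be split at all is a design choice that depends on the task.

```python
import re

def gentle_clean(text):
    """Lowercase and strip punctuation without gluing or mangling words."""
    text = text.lower()
    # Hyphens, en dashes, and em dashes become spaces instead of vanishing.
    text = re.sub(r"[-\u2013\u2014]+", " ", text)
    # Drop apostrophes at word edges and all other punctuation,
    # but keep an apostrophe that has a word character on both sides.
    text = re.sub(r"(?<!\w)'|'(?!\w)|[^\w\s']", "", text)
    return " ".join(text.split())  # collapse leftover runs of whitespace

print(gentle_clean("The pale Usher--threadbare in coat, heart, body, and brain;"))
print(gentle_clean("in consequence of her sister's marriage,"))
```

Here "usher" and "threadbare" stay separate and "sister's" keeps its apostrophe, at the cost of a slightly larger vocabulary.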
A stemmer producing a non-word is normal, not a mistake. "happiness" stemming to "happi" and "studies" stemming to "studi" look broken to a human reader, but a model doesn’t read English. It just needs related forms such as "happiness" and "happy" to map to the same token consistently, and PorterStemmer does that reliably even when the result isn’t a real word. Don’t “fix” stemmer output back into English; that defeats the point.
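The consistency argument is easier to see with a stemmer small enough to read in one glance. The sketch below is a deliberately tiny suffix stripper, not the Porter algorithm: the rule table and the `toy_stem` name are invented for illustration, and it will mis-stem plenty of words the real thing handles.

```python
# (suffix, replacement) pairs, tried in order; the first match wins.
TOY_RULES = [("ness", ""), ("ies", "i"), ("ing", ""), ("s", ""), ("y", "i")]

def toy_stem(word):
    """Chop one suffix by rule, with no dictionary lookup at all."""
    w = word.lower()
    for suffix, replacement in TOY_RULES:
        # Require at least three characters to remain before the suffix.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)] + replacement
            break
    # Undo consonant doubling left behind by "-ing" ("runn" -> "run").
    if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeioulsz":
        w = w[:-1]
    return w

for word in ["happiness", "happy", "running", "runs", "studies", "geese"]:
    print(word, toy_stem(word))
```

"happiness" and "happy" both land on the non-word "happi", which is all a vectorizer needs, while "geese" passes through unchanged because no suffix rule applies to it.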
Stopword lists are generic defaults, blind to your domain and sometimes to meaning itself. Removing "not" from "This movie is not good" leaves ["movie", "good", "."], and the negation, which completely flips the sentence’s meaning, is gone:

    neg_sentence = "This movie is not good."
    neg_tokens = [t for t in word_tokenize(neg_sentence.lower()) if t not in stop_words]
    print(neg_tokens)

    ['movie', 'good', '.']

For sentiment-sensitive tasks, blindly applying a generic stopword list can erase the exact signal you’re trying to detect. And the opposite failure mode is just as real: skipping stopword removal, or skipping max_features/min_df limits, lets a vectorizer’s vocabulary balloon on real-sized text. Pulling every sentence from the first ~20,000 characters of each of our eight books (1,423 sentences) produces a 4,213-word vocabulary with no limits, versus 200 words once you cap it with TfidfVectorizer(max_features=200, min_df=2). On a corpus of thousands of documents instead of eight, that gap gets far more dramatic, and an unbounded vocabulary quietly turns into a memory and performance problem.
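Returning to the negation problem: one common guard is to subtract negation words from the stopword set before filtering. The sketch below uses a tiny hand-written stand-in for a generic list so it runs without NLTK data; the set contents and names are assumptions for illustration, and the same set subtraction would apply to a real list such as the one loaded earlier.

```python
# Hypothetical stand-in for a generic stopword list.
GENERIC_STOPWORDS = {"a", "is", "no", "nor", "not", "the", "this"}

# Words whose removal can flip the meaning of a sentence.
NEGATIONS = {"no", "nor", "not", "never"}

# Task-specific list: generic stopwords minus the negations.
SENTIMENT_STOPWORDS = GENERIC_STOPWORDS - NEGATIONS

def remove_stopwords(tokens, stop_set):
    """Keep only the tokens that are not in stop_set."""
    return [t for t in tokens if t not in stop_set]

tokens = "this movie is not good".split()
generic_result = remove_stopwords(tokens, GENERIC_STOPWORDS)
sentiment_result = remove_stopwords(tokens, SENTIMENT_STOPWORDS)
print(generic_result)
print(sentiment_result)
```

With the adjusted set the negation survives, so a downstream sentiment model at least has a chance to see it.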
Every NLP pipeline, underneath whatever model sits on top of it, runs through the same handful of moves:
Tokenize with a language-aware tokenizer (word_tokenize), not .split(), so contractions and punctuation are handled on purpose
Vectorize with CountVectorizer for raw frequency or TfidfVectorizer when you need words that are common everywhere to matter less

Once your text is clean and numeric, the natural next step is asking a model to learn from it. That’s exactly where our text classification with spaCy post picks up: the same idea of turning text into something learnable, applied to training a real spam/ham classifier. If you want to go further into embeddings, sequence models, and transformer-based text understanding, the NLP with Deep Learning module in our free Machine Learning course builds directly on the preprocessing steps covered here.