
NLP Preprocessing in Python: From Raw Text to a TF-IDF Vector

Every NLP task starts with the same unglamorous work: cutting text into tokens, cleaning them up, and converting them into numbers. This guide walks through tokenization, normalization, stemming vs lemmatization, and bag-of-words/TF-IDF vectorization with NLTK and scikit-learn.

“Turn this text into something a model can learn from” sounds like one step, but it’s really four or five quiet ones that happen before any model gets involved. A sentence has to be broken into pieces, those pieces have to be cleaned up and made consistent, and only then can they become the kind of numbers a machine learning algorithm actually operates on. Skip or fumble any of those steps and everything downstream — search, classification, clustering, an LLM’s tokenizer — inherits the mess.

This is the layer most NLP tutorials rush past on the way to a model, which is exactly where people quietly lose the thread: they can copy-paste a training loop, but they can’t explain why their vocabulary has 4,000 words in it or why their stemmer turned “happiness” into “happi”. This guide slows down and builds that foundation first — tokenization, normalization, stemming vs lemmatization, and bag-of-words/TF-IDF vectorization — with real text and real output at every step. If you already have clean, numeric text and want to see it become a trained model, our post on text classification with spaCy picks up right where this one ends.

The Mental Model: Raw Text, Tokens, Clean Tokens, Numbers

Text is unstructured until you tokenize it, and it’s not usable by a model until you vectorize it. Every NLP pipeline, no matter how sophisticated, is some version of the same four-stage path:

  1. Raw text — a string, exactly as a human wrote it: contractions, punctuation, capitalization, typos and all.
  2. Tokens — the string split into discrete units (usually words) by a tokenizer that actually understands language, not just whitespace.
  3. Cleaned tokens — tokens with the noise squeezed out: lowercased, stripped of punctuation, common words removed, variants collapsed to a shared root.
  4. Numbers — cleaned tokens mapped into a fixed-size numeric vector, because that’s the only input format a model understands.

Figure: the word "alone" traced through the preprocessing pipeline. A raw sentence from Walt Whitman’s Leaves of Grass is tokenized into a list of tokens, cleaned by removing stopwords, and finally vectorized, where "alone" gets a bag-of-words count of 2 and a TF-IDF score of 0.420.

Notice what each stage removes or adds, not just what it looks like: tokenization adds structure (a list instead of a blob), normalization removes noise (case, punctuation, filler words), and vectorization adds shape (every document becomes the same fixed-length row of numbers, however long the original sentence was). That last property is the whole point — a model can’t accept “sentences of varying length,” it can only accept “vectors of a fixed size.”

A Dataset You Can Reproduce

Say you’re building a tiny personal reading log: one short note per book you finish, meant to be searchable later (“show me books that mention islands”). Before you can search notes like that, you need to turn each one into a comparable numeric form. To keep this reproducible without hunting down a book-review dataset of uncertain licensing, we’ll stand in for eight real notes with an early sentence from each of eight public-domain books, pulled straight from NLTK’s own bundled gutenberg corpus. Nothing is fetched over the network beyond the one-time nltk.download calls.

Data: eight public-domain texts (Austen, Carroll, Melville, Whitman, Blake, Edgeworth, Chesterton, and a children’s story collection by Bryant), distributed as NLTK’s built-in gutenberg corpus, itself sourced from Project Gutenberg.

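import re
import nltk
from nltk.corpus import gutenberg

# One-time data downloads; repeat calls are no-ops once the data is cached.
# "punkt_tab" is the sentence-tokenizer data name in NLTK 3.9 (older releases use "punkt").
for resource in ("gutenberg", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)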

fileids = [
    "austen-emma.txt", "carroll-alice.txt", "melville-moby_dick.txt",
    "whitman-leaves.txt", "blake-poems.txt", "edgeworth-parents.txt",
    "chesterton-thursday.txt", "bryant-stories.txt",
]

SKIP_WORDS = ("CHAPTER", "VOLUME", "BOOK ", "PART ", "ACT ", "SCENE", "PREFACE")

def first_content_sentence(fileid, min_len=25, search_chars=4000):
    raw = gutenberg.raw(fileid)
    raw = re.sub(r"^\[[^\]]*\]", "", raw, count=1).strip()  # drop "[Title by Author]"
    for sent in nltk.sent_tokenize(raw[:search_chars]):
        cleaned = " ".join(sent.split())
        if len(cleaned) < min_len or cleaned.isupper():
            continue
        if any(word in cleaned.upper() for word in SKIP_WORDS):
            continue
        return cleaned

documents = [first_content_sentence(fid) for fid in fileids]
for fid, doc in zip(fileids, documents):
    print(f"{fid:>24}: {doc}")
            austen-emma.txt: She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period.
          carroll-alice.txt: So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
     melville-moby_dick.txt: (Supplied by a Late Consumptive Usher to a Grammar School) The pale Usher--threadbare in coat, heart, body, and brain; I see him now.
         whitman-leaves.txt: Of physiology from top to toe I sing, Not physiognomy alone nor brain alone is worthy for the Muse, I say the Form complete is worthier far, The Female equally with the Male I sing.
            blake-poems.txt: So I piped with merry cheer.
      edgeworth-parents.txt: Near the ruins of the castle of Rossmore, in Ireland, is a small cabin, in which there once lived a widow and her four children.
    chesterton-thursday.txt: To Edmund Clerihew Bentley A cloud was on the mind of men, and wailing went the weather, Yea, a sick cloud upon the soul when we were boys together.
        bryant-stories.txt: TWO LITTLE RIDDLES IN RHYME There's a garden that I ken, Full of little gentlemen; Little caps of blue they wear, And green ribbons, very fair.

Eight documents, real 19th-century prose and poetry with real punctuation quirks — exactly the kind of text that makes the rest of this post honest instead of a toy example. (The outputs in this post come from NLTK 3.9 and scikit-learn 1.9; nothing here relies on version-specific behavior.)

Tokenization: Why .split() Isn’t Enough

A tokenizer’s job is deciding where one unit of meaning ends and the next begins, and plain whitespace is a bad rule for that. Take one sentence built to show the cracks:

from nltk.tokenize import word_tokenize

example = "I can't believe how quickly the runners were running yesterday."
print("naive split():", example.split())
print("word_tokenize():", word_tokenize(example))
naive split(): ['I', "can't", 'believe', 'how', 'quickly', 'the', 'runners', 'were', 'running', 'yesterday.']
word_tokenize(): ['I', 'ca', "n't", 'believe', 'how', 'quickly', 'the', 'runners', 'were', 'running', 'yesterday', '.']

.split() leaves "yesterday." glued to its period and treats "can't" as one indivisible token. NLTK’s word_tokenize, which applies Penn Treebank-style rules for English punctuation and contractions, makes two different, more useful calls: it splits the trailing period off as its own token, and it splits "can't" into "ca" and "n't", because grammatically that contraction really is two words (“can” + “not”) mashed together. Whether that specific split helps or hurts depends on what you’re building next, but the point stands either way: a tokenizer built around how English is actually written beats one that just looks for spaces.
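
If the "ca" / "n't" split would hurt what you’re building next, a middle ground is a small regex tokenizer that still separates punctuation but keeps contractions whole. The sketch below is a hypothetical pattern written for this post, not something NLTK provides, and it only recognizes the straight apostrophe ('), not the curly one.

```python
import re

# Hypothetical pattern (not NLTK's): a run of word characters, optionally
# followed by apostrophe + more word characters, OR one punctuation symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens, keeping contractions whole."""
    return TOKEN_PATTERN.findall(text)

example = "I can't believe how quickly the runners were running yesterday."
print(simple_tokenize(example))
# ['I', "can't", 'believe', 'how', 'quickly', 'the', 'runners', 'were', 'running', 'yesterday', '.']
```

The period still becomes its own token, but "can't" survives intact. That is a design choice rather than an improvement: you trade the grammatical split for tokens that match what a reader would call a word.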

Normalization: Lowercasing, Punctuation, and Stopwords

Once you have tokens, the next job is shrinking the amount of meaningless variation between them — “The” and “the” are the same word to a human, but two different strings to a naive vectorizer unless you normalize first.

doc0 = documents[0]
print("original:", doc0)

lowered = doc0.lower()
no_punct = re.sub(r"[^\w\s]", "", lowered)
tokens0 = word_tokenize(no_punct)
print("no punctuation:", no_punct)
print("tokens:", tokens0)
original: She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period.
no punctuation: she was the youngest of the two daughters of a most affectionate indulgent father and had in consequence of her sisters marriage been mistress of his house from a very early period
tokens: ['she', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate', 'indulgent', 'father', 'and', 'had', 'in', 'consequence', 'of', 'her', 'sisters', 'marriage', 'been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very', 'early', 'period']

Look closely and "sister's" became "sisters" — a blunt regex has no idea an apostrophe inside a word can mean something different from a comma between words. Hold that thought; it’s worth a full gotcha below.

The next normalization move is removing stopwords — extremely common words (“the”, “of”, “and”, “a”) that carry almost no distinguishing meaning for most tasks. NLTK ships a ready-made English list:

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
print("count:", len(stop_words))
print("sample:", sorted(stop_words)[:15])

filtered0 = [t for t in tokens0 if t not in stop_words]
print("after stopword removal:", filtered0)
count: 198
sample: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't"]
after stopword removal: ['youngest', 'two', 'daughters', 'affectionate', 'indulgent', 'father', 'consequence', 'sisters', 'marriage', 'mistress', 'house', 'early', 'period']

Thirty-two tokens shrank to thirteen. What’s left is almost entirely content words — the ones that would actually help distinguish this sentence from a different one, which is exactly what you want going into a vectorizer.

Stemming vs Lemmatization: Two Ways to Collapse a Word Family

Even after stopword removal, "running", "ran", and "runs" are three separate strings to a computer, even though a person reading them knows they’re the same underlying idea. Stemming and lemmatization both try to collapse word variants down to one shared form — they just disagree about how careful to be.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["running", "ran", "runs", "better", "geese", "studies", "happiness"]
pos_map = {"running": "v", "ran": "v", "runs": "v", "better": "a",
           "geese": "n", "studies": "v", "happiness": "n"}

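# Header row, padded to the same column widths as the rows below.
print(f"{'word':<12}{'stem':<12}{'lemma (default)':<18}lemma (correct pos)")
for w in words:
    stem = stemmer.stem(w)
    lemma_default = lemmatizer.lemmatize(w)               # no hint: treated as a noun
    lemma_pos = lemmatizer.lemmatize(w, pos=pos_map[w])   # with the hand-written POS hint
    print(f"{w:<12}{stem:<12}{lemma_default:<18}{lemma_pos}")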
word        stem        lemma (default)   lemma (correct pos)
running     run         running           run
ran         ran         ran               run
runs        run         run               run
better      better      better            good
geese       gees        goose             goose
studies     studi       study             study
happiness   happi       happiness         happiness

A stemmer (PorterStemmer here) works by chopping suffixes according to a fixed set of rules, with no dictionary and no idea what the word actually means. "running" and "runs" both correctly collapse to "run", but "ran" stays "ran" and "geese" becomes the non-word "gees", because no suffix rule can recover an irregular form. A lemmatizer looks the word up against a real dictionary (WordNetLemmatizer uses WordNet) and returns an actual word, so "geese" correctly becomes "goose". The catch is that it needs to be told the word’s part of speech, or it silently assumes “noun” and gives up early:

print("lemmatize('better')          ->", lemmatizer.lemmatize("better"))
print("lemmatize('better', pos='a') ->", lemmatizer.lemmatize("better", pos="a"))
lemmatize('better')          -> better
lemmatize('better', pos='a') -> good

Without a part-of-speech hint, "better" is assumed to be a noun and left untouched. Tell the lemmatizer it’s an adjective (pos="a") and it correctly resolves to "good". This is the practical tradeoff: stemming is fast and dictionary-free but produces rough, sometimes non-word output; lemmatization produces real words but needs more setup and, ideally, part-of-speech information to do its best work.
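
In the table above the part-of-speech hints were typed in by hand. On real text they usually come from a tagger. The sketch below shows one common arrangement, under a few assumptions: penn_to_wordnet and lemmatize_with_pos are hypothetical helper names, nltk.pos_tag is used as the tagger, and its model data has already been downloaded (the resource name has changed between NLTK releases, so check the error message if the lookup fails).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for every other tag

def lemmatize_with_pos(tokens):
    """Tag the tokens, then lemmatize each one using its mapped part of speech."""
    # Imported here so the tag mapping above can be used without NLTK installed.
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)  # list of (word, Penn Treebank tag) pairs
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged]
```

A tagger leans on context, so it tends to do better on a full tokenized sentence than on isolated words, and it should run before stopwords are removed. Its guesses are not always right, which means the lemmas are only as good as the tags behind them.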

Turning Cleaned Tokens Into Numbers

A cleaned list of tokens is still not something a model can consume — you need a fixed-size numeric vector per document. The simplest approach is bag-of-words: count how many times each vocabulary word appears in each document, ignoring word order entirely.

def clean(doc):
    lowered = doc.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    tokens = word_tokenize(no_punct)
    return [t for t in tokens if t not in stop_words]

cleaned_docs = [clean(d) for d in documents]
cleaned_joined = [" ".join(toks) for toks in cleaned_docs]

from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(cleaned_joined)
vocab = count_vec.get_feature_names_out()
print("vocabulary size:", len(vocab))

row = count_matrix.toarray()[3]  # whitman-leaves.txt
for term, count in zip(vocab, row):
    if count:
        print(f"  {term}: {count}")
vocabulary size: 111
  alone: 2
  brain: 1
  complete: 1
  equally: 1
  far: 1
  female: 1
  form: 1
  male: 1
  muse: 1
  physiognomy: 1
  physiology: 1
  say: 1
  sing: 2
  toe: 1
  top: 1
  worthier: 1
  worthy: 1

The eight documents together produce a 111-word vocabulary. In Whitman’s line, "alone" and "sing" each appear twice, so they get a count of 2 while every other word in the row gets 1. That’s the entire idea of bag-of-words: order is gone, but frequency survives.
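
To see what the vectorizer is doing underneath, here is a minimal bag-of-words built with nothing but the standard library. It is a sketch of the idea, not a reimplementation of CountVectorizer: bag_of_words is a hypothetical helper, the two token lists are made up for illustration, and the result is a plain list of lists rather than a sparse matrix.

```python
from collections import Counter

def bag_of_words(token_lists):
    """Return (vocabulary, rows): one fixed-length count row per document."""
    # Vocabulary: every distinct token across all documents, sorted alphabetically.
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    rows = []
    for toks in token_lists:
        counts = Counter(toks)  # a missing term counts as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Two short, already-cleaned token lists (made up for illustration).
docs = [
    ["brain", "alone", "sing", "alone", "sing"],
    ["piped", "merry", "cheer"],
]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)  # ['alone', 'brain', 'cheer', 'merry', 'piped', 'sing']
for row in rows:
    print(row)
# [2, 1, 0, 0, 0, 2]
# [0, 0, 1, 1, 1, 0]
```

Both rows have six entries even though the documents have five and three tokens, which is the fixed-size property a model needs. Shuffling the tokens inside a document leaves its row unchanged, which is the sense in which word order is lost.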

Raw counts have a blind spot, though: a word that’s common everywhere (say, across all eight documents) tells you nothing distinctive about any one of them, but a plain count treats it the same as a rare, on-topic word. TF-IDF (term frequency–inverse document frequency) fixes that by weighting a word up when it’s frequent within a document and weighting it down when it’s common across documents — see the scikit-learn TfidfVectorizer documentation for the exact formula it uses:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(cleaned_joined)
tfidf_vocab = tfidf_vec.get_feature_names_out()

row_tfidf = tfidf_matrix.toarray()[3]  # whitman-leaves.txt
pairs = sorted(
    ((t, w) for t, w in zip(tfidf_vocab, row_tfidf) if w > 0),
    key=lambda p: p[1], reverse=True,
)
for term, weight in pairs[:8]:
    print(f"  {term}: {weight:.3f}")
  alone: 0.420
  sing: 0.420
  complete: 0.210
  equally: 0.210
  far: 0.210
  female: 0.210
  form: 0.210
  male: 0.210
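
Those weights can be reproduced by hand, which is a good way to demystify them. With its default settings, TfidfVectorizer multiplies each raw count by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales each row to unit Euclidean length. The sketch below applies that to the Whitman row using only the standard library. One thing in it is an assumption read off the eight sentences rather than printed by the code above: that "brain" is the only Whitman term that also appears in another document (the Moby Dick sentence), so every other term has a document frequency of 1.

```python
import math

N_DOCS = 8

def smoothed_idf(doc_freq, n_docs=N_DOCS):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the default smoothed form."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Term counts for the Whitman row, taken from the bag-of-words output.
once = ["brain", "complete", "equally", "far", "female", "form", "male", "muse",
        "physiognomy", "physiology", "say", "toe", "top", "worthier", "worthy"]
counts = {"alone": 2, "sing": 2, **{term: 1 for term in once}}

# Document frequencies: ASSUMED to be 1 everywhere except "brain" (2 documents).
doc_freq = {term: 1 for term in counts}
doc_freq["brain"] = 2

# Raw weight = count * idf, then divide by the row's Euclidean (L2) norm.
raw = {t: c * smoothed_idf(doc_freq[t]) for t, c in counts.items()}
norm = math.sqrt(sum(w * w for w in raw.values()))
tfidf = {t: w / norm for t, w in raw.items()}

print(f"alone:    {tfidf['alone']:.3f}")     # 0.420
print(f"complete: {tfidf['complete']:.3f}")  # 0.210
print(f"brain:    {tfidf['brain']:.3f}")     # 0.176
```

The first two values match the vectorizer’s output. The third shows the “inverse document frequency” half at work: "brain" has the same count as "complete", but being shared with one other sentence is enough to pull its weight down.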

"alone" and "sing", the two repeated words, come out on top with double the weight of the other terms listed, because they’re frequent here and don’t show up anywhere else in the eight-document set, which would have diluted their value. That ranked list is what “turning text into numbers” is actually for: it’s now something you could sort, compare, feed to a classifier, or use to search for the “most Whitman-like” note in the reading log.

Three Gotchas Worth Knowing

A blunt regex can quietly merge or change words, not just strip punctuation. Look back at two real artifacts from this post’s own dataset: "sister's" became "sisters" (an apostrophe means something different from a comma, but [^\w\s] treats them identically), and "Usher--threadbare" became one token, "usherthreadbare" (a double hyphen standing in for a dash, with no surrounding spaces, so deleting it glued two separate words together). Neither is a bug exactly, since it’s what a simple regex does, but both are worth eyeballing before you trust a “cleaned” dataset.
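
One way to soften both artifacts is to handle hyphens and possessives before the catch-all punctuation strip. The function below is a sketch with a hypothetical name, gentle_clean, and it bakes in two judgment calls that may not suit every project: any run of hyphens becomes a space, and a trailing 's is dropped entirely.

```python
import re

def gentle_clean(text):
    """Lowercase and strip punctuation without gluing words together."""
    text = text.lower()
    # Treat runs of hyphens (e.g. "--" standing in for a dash) as word breaks.
    text = re.sub(r"-+", " ", text)
    # Drop a trailing 's so "sister's" becomes "sister" rather than "sisters".
    text = re.sub(r"'s\b", "", text)
    # Remove the remaining punctuation, then collapse repeated whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

print(gentle_clean("The pale Usher--threadbare in coat; her sister's marriage."))
# the pale usher threadbare in coat her sister marriage
```

The same rules have side effects worth knowing about: "daisy-chain" now becomes two tokens instead of one, and contractions such as "it's" lose their 's just like possessives do. No single regex is right for every corpus, so the habit of eyeballing the output matters more than the exact pattern.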

A stemmer producing a non-word is normal, not a mistake. "happiness" stemming to "happi" and "studies" stemming to "studi" look broken to a human reader, but a model doesn’t read English. It just needs related forms such as "happiness" and "happy" to map to the same token consistently, and PorterStemmer does that (both become "happi") even when the result isn’t a real word. Don’t “fix” stemmer output back into English; that defeats the point.

Stopword lists are generic defaults: tied to one language, and blind to your domain and sometimes to meaning. Removing "not" from "This movie is not good" leaves ["movie", "good", "."], and the negation, which completely flips the sentence’s meaning, is gone:

neg_sentence = "This movie is not good."
neg_tokens = [t for t in word_tokenize(neg_sentence.lower()) if t not in stop_words]
print(neg_tokens)
['movie', 'good', '.']
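
A simple hedge is to subtract the words you care about from the stopword set before filtering. The sketch below uses a small hand-written set so it runs without NLTK; with the real list you would start from set(stopwords.words("english")) instead. Which words count as negations is a choice, and the three picked here are only a starting point.

```python
# A tiny hand-written stopword set for illustration; NLTK's list is much longer.
generic_stop_words = {"this", "is", "the", "a", "not", "no", "nor"}

# Words whose removal can flip the meaning of a sentence (an illustrative choice).
NEGATIONS = {"not", "no", "nor"}
sentiment_stop_words = generic_stop_words - NEGATIONS

tokens = ["this", "movie", "is", "not", "good"]
kept_generic = [t for t in tokens if t not in generic_stop_words]
kept_sentiment = [t for t in tokens if t not in sentiment_stop_words]

print(kept_generic)    # ['movie', 'good']
print(kept_sentiment)  # ['movie', 'not', 'good']
```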

For sentiment-sensitive tasks, blindly applying a generic stopword list can erase the exact signal you’re trying to detect.

The opposite failure mode is just as real: skipping stopword removal, or skipping max_features/min_df limits, lets a vectorizer’s vocabulary balloon on real-sized text. Pulling every sentence from the first ~20,000 characters of each of our eight books (1,423 sentences) produces a 4,213-word vocabulary with no limits, versus 200 words once you cap it with TfidfVectorizer(max_features=200, min_df=2). On a corpus of thousands of documents instead of eight, that gap gets far more dramatic, and an unbounded vocabulary quietly turns into a memory and performance problem.
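
What those two limits do can be sketched in a few lines of plain Python. The idea is that min_df drops terms found in fewer than that many documents, and max_features then keeps only the most frequent survivors. prune_vocabulary is a hypothetical helper, the toy documents are made up, and the alphabetical tie-break is this sketch’s own choice rather than a statement about how scikit-learn orders ties. It also handles min_df only as a whole number of documents, not as a proportion.

```python
from collections import Counter

def prune_vocabulary(token_lists, min_df=1, max_features=None):
    """Keep terms seen in at least min_df documents, then the max_features most frequent."""
    doc_freq = Counter()    # number of documents containing each term
    total_freq = Counter()  # total occurrences across the whole corpus
    for toks in token_lists:
        total_freq.update(toks)
        doc_freq.update(set(toks))  # set(): count each document at most once

    kept = [t for t in total_freq if doc_freq[t] >= min_df]
    # Most frequent first; ties broken alphabetically (this sketch's own choice).
    kept.sort(key=lambda t: (-total_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

docs = [
    ["cloud", "mind", "cloud", "soul"],
    ["mind", "sleepy", "rabbit", "mind"],
    ["cloud", "garden", "mind"],
]
print(prune_vocabulary(docs))
# ['cloud', 'garden', 'mind', 'rabbit', 'sleepy', 'soul']
print(prune_vocabulary(docs, min_df=2))
# ['cloud', 'mind']
print(prune_vocabulary(docs, min_df=2, max_features=1))
# ['mind']
```

Even on three toy documents, min_df=2 removes the four words that appear in only one place, and those one-off words are what make up most of a large unbounded vocabulary.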

Wrapping Up

Every NLP pipeline, underneath whatever model sits on top of it, runs through the same handful of moves:

  • Tokenize with a real tokenizer (word_tokenize), not .split(), so contractions and punctuation are handled on purpose
  • Normalize: lowercase, strip punctuation carefully, and remove stopwords to cut down meaningless variation
  • Stem or lemmatize to collapse word variants — stemming for speed, lemmatization for correctness, with a part-of-speech hint if you need the accurate word back
  • Vectorize with CountVectorizer for raw frequency or TfidfVectorizer when you need words that are common everywhere to matter less

Once your text is clean and numeric, the natural next step is asking a model to learn from it. That’s exactly where our text classification with spaCy post picks up — same idea of turning text into something learnable, applied to training a real spam/ham classifier. If you want to go further into embeddings, sequence models, and transformer-based text understanding, the NLP with Deep Learning module in our free Machine Learning course builds directly on the preprocessing steps covered here.
