Lesson 4 - The Naive Bayes Algorithm
Welcome to the Naive Bayes Algorithm
In the last lessons you learned Bayes’ theorem as a rule for updating a single belief when one piece of evidence arrives. Now you will turn that rule into something that does real work: a classifier — a program that reads a piece of text and decides which category it belongs to.
The category here is a familiar annoyance. Some news headlines are honest summaries; others are clickbait, engineered to make you click without telling you anything (“You Won’t Believe What Happened Next”). You will build the algorithm that separates the two, using nothing but counts of words and the theorem you already know. It is called Naive Bayes, and despite the dismissive name it is one of the most effective, most widely used text classifiers ever written.
By the end of this lesson, you will be able to:
- Frame classification as comparing against
- Explain the “naive” conditional-independence assumption and why it is wrong but useful
- Estimate class priors and per-word likelihoods from training data
- Fix the zero-frequency problem with Laplace (add-1) smoothing and work in log-probabilities
You need only pandas, a little numpy, and the standard re and math modules. Let’s begin.
From Bayes’ Theorem to a Classifier
A headline is a bag of words. Given those words, we want the probability that the headline is clickbait. Bayes’ theorem gives it to us directly. Writing for a class (either clickbait or genuine) and for the words in the headline:
We have to compute this for both classes and pick the larger. But notice the denominator — the overall probability of seeing this exact set of words — is the same for both classes. It only rescales the two numbers by an identical factor, so it cannot change which one is bigger. We can drop it and compare the numerators:
The symbol means “proportional to.” To classify a headline, compute for each class and choose whichever is larger. Two pieces remain: the prior , and the likelihood . The prior is easy. The likelihood is the hard part — and it is where the “naive” idea rescues us.
The Naive Assumption
Estimating directly is hopeless. It asks: among clickbait headlines, how often do you see this exact sequence of words together? Almost any real headline is unique, so the honest answer for nearly every headline is zero. We would have no data.
Naive Bayes makes a bold simplification: it assumes the words are conditionally independent given the class. That is, once you know a headline is clickbait, the chance it contains “you” tells you nothing about whether it also contains “won’t.” Under that assumption the joint probability factors into a simple product:
Now we only need for individual words — the probability of seeing a single word given the class — which we can estimate by counting.
Putting it together, the rule we will actually compute is:
Wrong but useful
The independence assumption is plainly false. In real headlines “click” and “here” travel together; “you” and “won’t” cluster in clickbait. Words are correlated, not independent. So why does the algorithm work so well?
Because we are not trying to model language faithfully — we are trying to rank two classes. Even when the individual probabilities are distorted by pretending words are independent, the distortion usually pushes both classes in the same direction, and the comparison between them survives. Naive Bayes is often a poor estimator of the actual probability yet an excellent decider of the class. That gap between “wrong about the number” and “right about the answer” is exactly why this naive model has stayed useful for decades.
Why “naive”?
The word naive refers specifically to the conditional-independence assumption — treating every word as if it appeared on its own, blind to the other words around it. It is the model being naive about language, not about you.
Counting Priors and Likelihoods
Let’s get our hands on real data. The dataset has 6,000 headlines, evenly split between clickbait and genuine, each with its label.
import pandas as pd
headlines = pd.read_csv("https://datatweets.com/datasets/headlines.csv")
print(headlines.shape)
print(headlines["label"].value_counts())
print(headlines.head(3))(6000, 2)
label
genuine 3000
clickbait 3000
Name: count, dtype: int64
headline label
0 Why do cities close refuges unless they want w... genuine
1 From Helen Mirrens Instagram to Kanyes new hai... genuine
2 15 Romantic Movies That Will Make You Cry clickbaitTo learn from text we first tokenize it: split each headline into lowercase words. A small regular expression keeps runs of letters and apostrophes and throws away digits and punctuation.
import re
def tokenize(text):
return re.findall(r"[a-z']+", text.lower())
print(tokenize("These 17 Things Will Blow Your Mind!"))['these', 'things', 'will', 'blow', 'your', 'mind']Train and test
We never judge a classifier on the data it learned from — it could simply memorize. We split off 20% of the headlines as a test set the model never sees during training, and learn from the remaining 80%.
shuffled = headlines.sample(frac=1, random_state=1).reset_index(drop=True)
n_test = int(len(shuffled) * 0.2)
test = shuffled.iloc[:n_test].reset_index(drop=True)
train = shuffled.iloc[n_test:].reset_index(drop=True)
print(len(train), "train,", len(test), "test")4800 train, 1200 testThe priors
The prior is the probability a headline is clickbait (or genuine) before we read a single word — just the fraction of each class in the training data.
classes = ["clickbait", "genuine"]
priors = {c: (train["label"] == c).mean() for c in classes}
print({c: round(p, 3) for c, p in priors.items()}){'clickbait': 0.498, 'genuine': 0.502}Because the dataset is balanced 3,000/3,000, the priors come out almost exactly 0.5 and 0.5. The split nudged them to 0.498 and 0.502, but the message is the same: before reading the words, a headline is equally likely to be either kind. With balanced priors the decision will rest entirely on the words.
The per-word likelihoods
Now the likelihoods . For a class , the natural estimate of is: how many times does word appear across all headlines of class , divided by the total number of words in class . We tally every word in the training set, per class.
word_counts = {c: {} for c in classes}
total_words = {c: 0 for c in classes}
vocab = set()
for _, row in train.iterrows():
c = row["label"]
for w in tokenize(row["headline"]):
word_counts[c][w] = word_counts[c].get(w, 0) + 1
total_words[c] += 1
vocab.add(w)
print("vocabulary size:", len(vocab))
print("words in clickbait:", total_words["clickbait"])
print("words in genuine: ", total_words["genuine"])vocabulary size: 10172
words in clickbait: 27461
words in genuine: 25552Our vocabulary — the set of distinct words seen in training — has about 10,000 words. We now have everything Bayes’ theorem needs: priors, per-class word counts, totals, and a vocabulary. One problem stands between us and a working classifier.
The Zero-Frequency Problem and Laplace Smoothing
Suppose a test headline contains the word “blockchain,” but no clickbait headline in training ever used it. Then its count is zero, so . Because the likelihoods are multiplied together, a single zero wipes out the entire product — the whole headline gets probability zero for clickbait, no matter how many other words screamed clickbait. One unseen word vetoes everything. This is the zero-frequency problem.
The fix is Laplace smoothing, also called add-1 smoothing: pretend every word in the vocabulary was seen one extra time in each class. Add 1 to every count in the numerator, and to keep it a valid probability, add the vocabulary size to the denominator:
Here is how often word appeared in class , is the total word count for class , and is the vocabulary size. Now no probability is ever exactly zero — an unseen word gets a tiny but nonzero value instead of a fatal veto.
V = len(vocab)
def p_word(w, c):
return (word_counts[c].get(w, 0) + 1) / (total_words[c] + V)
print(round(p_word("you", "clickbait"), 5), round(p_word("you", "genuine"), 5))
print(round(p_word("xyzzy", "clickbait"), 7)) # a word never seen0.01584 0.0016
1e-05The word “you” is about ten times more likely in clickbait than in genuine headlines, and a word the model has never seen still gets a small positive probability instead of zero. Smoothing turned a brittle classifier into a robust one.
Working in Log-Probabilities
There is one more practical hazard. A headline might have a dozen words, each with a probability around 0.001. Multiply twelve such numbers and you get something like — so close to zero that floating-point arithmetic loses precision and may underflow to a flat zero. Both classes collapse to zero and the comparison becomes meaningless.
The cure is to work with logarithms. Logs turn products into sums, and adding moderate negative numbers is perfectly stable where multiplying tiny ones is not. Because is monotonic — bigger inputs give bigger outputs — taking the log never changes which class wins. Our decision rule becomes:
We compute a log-score for each class and pick the larger.
import math
def classify(headline):
scores = {}
for c in classes:
score = math.log(priors[c])
for w in tokenize(headline):
if w in vocab: # ignore words never seen in training
score += math.log(p_word(w, c))
scores[c] = score
return max(scores, key=scores.get)
demo = "These 23 Photos Will Change The Way You See The World"
print(classify(demo))clickbaitLook at the two log-scores behind that decision — the clickbait score is much higher (less negative), so clickbait wins:
for c in classes:
score = math.log(priors[c])
for w in tokenize(demo):
if w in vocab:
score += math.log(p_word(w, c))
print(c, round(score, 2))clickbait -53.73
genuine -68.71Words like “these,” “you,” and “this” pulled the clickbait score far ahead. That is the whole algorithm: start each class at its log-prior, add the log-likelihood of every word, and choose the winner.
A First End-to-End Pass
Let’s see which words do the heaviest lifting. For each word we compute the likelihood ratio — its probability under clickbait divided by its probability under genuine. A ratio far above 1 means the word is a strong clickbait tell.
for w in ["these", "this", "you", "things", "reasons"]:
ratio = p_word(w, "clickbait") / p_word(w, "genuine")
print(f"{w:>8} {ratio:5.1f}x") these 36.7x
this 10.0x
you 9.9x
things 9.4x
reasons 8.0xThese are exactly the words you would expect from listicle clickbait: “These 19 Photos…”, “This Is Why…”, “21 Reasons You Need…”. The word “these” alone is about 37 times more likely in a clickbait headline than a genuine one. Genuine headlines, by contrast, lean on proper nouns and event words that carry no such lopsided ratio.
Now run the finished classifier across all 1,200 held-out test headlines and measure its accuracy — the fraction it labels correctly.
correct = sum(classify(h) == label
for h, label in zip(test["headline"], test["label"]))
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.1%}")test accuracy: 91.8%A classifier built from nothing but word counts and Bayes’ theorem labels real headlines correctly about 92% of the time — roughly nine out of ten — on data it never saw during training. No neural network, no deep learning library, just priors, smoothed likelihoods, and a log-sum. That is the power of Naive Bayes, and you now understand every step of it.
Practice Exercises
Exercise 1: Read the likelihood ratio for new words
Compute the clickbait-to-genuine likelihood ratio for the words "will", "trump", and "the". Which one is the strongest clickbait signal, and which is essentially neutral? Explain what a ratio near 1 means.
Hint
Reuse p_word(w, "clickbait") / p_word(w, "genuine") for each word. A ratio near 1 means the word appears at about the same rate in both classes, so it barely moves either log-score — it carries almost no evidence.
Exercise 2: Watch smoothing prevent a zero
Pick a word that appears in genuine headlines but never in clickbait ones (for example, search word_counts["clickbait"] for a missing key). Print its unsmoothed probability count / total and its smoothed probability with p_word. Why would the unsmoothed value break a multiplied likelihood?
Hint
The unsmoothed probability is 0 / total_words["clickbait"], which is exactly 0; multiplying it into the product zeroes out the whole headline. The smoothed p_word adds 1 on top, so it returns a tiny positive number like 1 / (total_words["clickbait"] + V) instead.
Exercise 3: Classify your own headlines
Write three headlines of your own — one obviously clickbait, one plainly genuine, and one in between — and run classify on each. For the borderline case, print both class log-scores and see how close the decision was.
Hint
Reuse the loop from the demo that prints math.log(priors[c]) plus the summed math.log(p_word(w, c)) for each class. The smaller the gap between the two log-scores, the less confident the classifier is.
Summary
You turned Bayes’ theorem into a working text classifier. The key move was to compare across classes, drop the shared denominator, and use the naive assumption that words are conditionally independent given the class so that the likelihood factors into a product of per-word probabilities. You estimated priors as class frequencies and likelihoods by counting words, fixed the zero-frequency problem with Laplace (add-1) smoothing, and switched to log-probabilities to avoid underflow. On 6,000 real headlines, that recipe reached about 92% accuracy.
Key Concepts
- Naive Bayes — a classifier that picks the class maximizing .
- Naive (conditional independence) assumption — treating words as independent given the class; wrong about language but good enough to rank classes.
- Prior — the probability of a class before seeing the words; here estimated as each class’s fraction of the training data.
- Likelihood — the probability of a word given a class, estimated by counting.
- Laplace (add-1) smoothing — add 1 to every count and to the denominator so no probability is ever zero.
- Log-probabilities — replace a product of tiny probabilities with a sum of logs to prevent numerical underflow.
Why This Matters
Naive Bayes was the original engine behind spam filters and still anchors sentiment analysis, document tagging, and medical triage tools. It is fast, needs little data, and is easy to inspect — you can read off exactly which words swayed a decision. More than that, building one teaches the habit at the heart of applied probability: take a clean theorem, make a defensible simplifying assumption, patch the places where raw data breaks, and ship something that works.
Next Steps
Continue to Lesson 5 - Guided Project: Clickbait Classifier
Build the full Naive Bayes clickbait classifier end to end, the complete guided project.
Back to Module Overview
Return to the Conditional Probability & Bayes module overview
Continue Building Your Skills
You have seen every moving part of Naive Bayes and watched it sort real headlines with about 92% accuracy. In the next lesson you will build the whole thing yourself, start to finish, as a guided project — wiring training, smoothing, and evaluation into one clean classifier you can point at any text you like.