Lesson 4 - The Naive Bayes Algorithm

Welcome to the Naive Bayes Algorithm

In the last lessons you learned Bayes’ theorem as a rule for updating a single belief when one piece of evidence arrives. Now you will turn that rule into something that does real work: a classifier — a program that reads a piece of text and decides which category it belongs to.

The category here is a familiar annoyance. Some news headlines are honest summaries; others are clickbait, engineered to make you click without telling you anything (“You Won’t Believe What Happened Next”). You will build the algorithm that separates the two, using nothing but counts of words and the theorem you already know. It is called Naive Bayes, and despite the dismissive name it is one of the most effective, most widely used text classifiers ever written.

By the end of this lesson, you will be able to:

  • Frame classification as comparing P(clickbaitwords) P(\text{clickbait} \mid \text{words}) against P(genuinewords) P(\text{genuine} \mid \text{words})
  • Explain the “naive” conditional-independence assumption and why it is wrong but useful
  • Estimate class priors and per-word likelihoods from training data
  • Fix the zero-frequency problem with Laplace (add-1) smoothing and work in log-probabilities

You need only pandas, a little numpy, and the standard re and math modules. Let’s begin.


From Bayes’ Theorem to a Classifier

A headline is a bag of words. Given those words, we want the probability that the headline is clickbait. Bayes’ theorem gives it to us directly. Writing c c for a class (either clickbait or genuine) and w1,w2,,wn w_1, w_2, \dots, w_n for the words in the headline:

P(cw1,,wn)=P(c)P(w1,,wnc)P(w1,,wn) P(c \mid w_1, \dots, w_n) = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}

We have to compute this for both classes and pick the larger. But notice the denominator P(w1,,wn) P(w_1, \dots, w_n) — the overall probability of seeing this exact set of words — is the same for both classes. It only rescales the two numbers by an identical factor, so it cannot change which one is bigger. We can drop it and compare the numerators:

P(cw1,,wn)P(c)P(w1,,wnc) P(c \mid w_1, \dots, w_n) \propto P(c)\, P(w_1, \dots, w_n \mid c)

The \propto symbol means “proportional to.” To classify a headline, compute P(c)P(wordsc) P(c)\,P(\text{words} \mid c) for each class and choose whichever is larger. Two pieces remain: the prior P(c) P(c) , and the likelihood P(wordsc) P(\text{words} \mid c) . The prior is easy. The likelihood is the hard part — and it is where the “naive” idea rescues us.


The Naive Assumption

Estimating P(w1,,wnc) P(w_1, \dots, w_n \mid c) directly is hopeless. It asks: among clickbait headlines, how often do you see this exact sequence of words together? Almost any real headline is unique, so the honest answer for nearly every headline is zero. We would have no data.

Naive Bayes makes a bold simplification: it assumes the words are conditionally independent given the class. That is, once you know a headline is clickbait, the chance it contains “you” tells you nothing about whether it also contains “won’t.” Under that assumption the joint probability factors into a simple product:

P(w1,,wnc)=i=1nP(wic) P(w_1, \dots, w_n \mid c) = \prod_{i=1}^{n} P(w_i \mid c)

Now we only need P(wc) P(w \mid c) for individual words — the probability of seeing a single word given the class — which we can estimate by counting.

A headline split into separate word tokens, each token feeding its own independent factor P(word given class), and all factors multiplied together with the prior to give score of a class equals P(class) times P(w1 given class) times P(w2 given class) and so on.
The naive assumption treats each word as independent given the class, so the score is just the prior times one P(word | class) factor per word.

Putting it together, the rule we will actually compute is:

P(cwords)P(c)iP(wic) P(c \mid \text{words}) \propto P(c)\prod_{i} P(w_i \mid c)

Wrong but useful

The independence assumption is plainly false. In real headlines “click” and “here” travel together; “you” and “won’t” cluster in clickbait. Words are correlated, not independent. So why does the algorithm work so well?

Because we are not trying to model language faithfully — we are trying to rank two classes. Even when the individual probabilities are distorted by pretending words are independent, the distortion usually pushes both classes in the same direction, and the comparison between them survives. Naive Bayes is often a poor estimator of the actual probability yet an excellent decider of the class. That gap between “wrong about the number” and “right about the answer” is exactly why this naive model has stayed useful for decades.

Why “naive”?

The word naive refers specifically to the conditional-independence assumption — treating every word as if it appeared on its own, blind to the other words around it. It is the model being naive about language, not about you.


Counting Priors and Likelihoods

Let’s get our hands on real data. The dataset has 6,000 headlines, evenly split between clickbait and genuine, each with its label.

import pandas as pd

headlines = pd.read_csv("https://datatweets.com/datasets/headlines.csv")
print(headlines.shape)
print(headlines["label"].value_counts())
print(headlines.head(3))
(6000, 2)
label
genuine      3000
clickbait    3000
Name: count, dtype: int64
                                            headline      label
0  Why do cities close refuges unless they want w...    genuine
1  From Helen Mirrens Instagram to Kanyes new hai...    genuine
2          15 Romantic Movies That Will Make You Cry  clickbait

To learn from text we first tokenize it: split each headline into lowercase words. A small regular expression keeps runs of letters and apostrophes and throws away digits and punctuation.

import re

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

print(tokenize("These 17 Things Will Blow Your Mind!"))
['these', 'things', 'will', 'blow', 'your', 'mind']

Train and test

We never judge a classifier on the data it learned from — it could simply memorize. We split off 20% of the headlines as a test set the model never sees during training, and learn from the remaining 80%.

shuffled = headlines.sample(frac=1, random_state=1).reset_index(drop=True)
n_test = int(len(shuffled) * 0.2)
test = shuffled.iloc[:n_test].reset_index(drop=True)
train = shuffled.iloc[n_test:].reset_index(drop=True)
print(len(train), "train,", len(test), "test")
4800 train, 1200 test

The priors

The prior P(c) P(c) is the probability a headline is clickbait (or genuine) before we read a single word — just the fraction of each class in the training data.

classes = ["clickbait", "genuine"]
priors = {c: (train["label"] == c).mean() for c in classes}
print({c: round(p, 3) for c, p in priors.items()})
{'clickbait': 0.498, 'genuine': 0.502}

Because the dataset is balanced 3,000/3,000, the priors come out almost exactly 0.5 and 0.5. The split nudged them to 0.498 and 0.502, but the message is the same: before reading the words, a headline is equally likely to be either kind. With balanced priors the decision will rest entirely on the words.

The per-word likelihoods

Now the likelihoods P(wc) P(w \mid c) . For a class c c , the natural estimate of P(wc) P(w \mid c) is: how many times does word w w appear across all headlines of class c c , divided by the total number of words in class c c . We tally every word in the training set, per class.

word_counts = {c: {} for c in classes}
total_words = {c: 0 for c in classes}
vocab = set()

for _, row in train.iterrows():
    c = row["label"]
    for w in tokenize(row["headline"]):
        word_counts[c][w] = word_counts[c].get(w, 0) + 1
        total_words[c] += 1
        vocab.add(w)

print("vocabulary size:", len(vocab))
print("words in clickbait:", total_words["clickbait"])
print("words in genuine: ", total_words["genuine"])
vocabulary size: 10172
words in clickbait: 27461
words in genuine:  25552

Our vocabulary — the set of distinct words seen in training — has about 10,000 words. We now have everything Bayes’ theorem needs: priors, per-class word counts, totals, and a vocabulary. One problem stands between us and a working classifier.


The Zero-Frequency Problem and Laplace Smoothing

Suppose a test headline contains the word “blockchain,” but no clickbait headline in training ever used it. Then its count is zero, so P("blockchain"clickbait)=0 P(\text{"blockchain"} \mid \text{clickbait}) = 0 . Because the likelihoods are multiplied together, a single zero wipes out the entire product — the whole headline gets probability zero for clickbait, no matter how many other words screamed clickbait. One unseen word vetoes everything. This is the zero-frequency problem.

The fix is Laplace smoothing, also called add-1 smoothing: pretend every word in the vocabulary was seen one extra time in each class. Add 1 to every count in the numerator, and to keep it a valid probability, add the vocabulary size V |V| to the denominator:

P(wc)=count(w,c)+1Nc+V P(w \mid c) = \frac{\text{count}(w,c) + 1}{N_c + |V|}

Here count(w,c) \text{count}(w,c) is how often word w w appeared in class c c , Nc N_c is the total word count for class c c , and V |V| is the vocabulary size. Now no probability is ever exactly zero — an unseen word gets a tiny but nonzero value instead of a fatal veto.

V = len(vocab)

def p_word(w, c):
    return (word_counts[c].get(w, 0) + 1) / (total_words[c] + V)

print(round(p_word("you", "clickbait"), 5), round(p_word("you", "genuine"), 5))
print(round(p_word("xyzzy", "clickbait"), 7))   # a word never seen
0.01584 0.0016
1e-05

The word “you” is about ten times more likely in clickbait than in genuine headlines, and a word the model has never seen still gets a small positive probability instead of zero. Smoothing turned a brittle classifier into a robust one.


Working in Log-Probabilities

There is one more practical hazard. A headline might have a dozen words, each with a probability around 0.001. Multiply twelve such numbers and you get something like 1036 10^{-36} — so close to zero that floating-point arithmetic loses precision and may underflow to a flat zero. Both classes collapse to zero and the comparison becomes meaningless.

The cure is to work with logarithms. Logs turn products into sums, and adding moderate negative numbers is perfectly stable where multiplying tiny ones is not. Because log \log is monotonic — bigger inputs give bigger outputs — taking the log never changes which class wins. Our decision rule becomes:

logP(cwords)logP(c)+ilogP(wic) \log P(c \mid \text{words}) \propto \log P(c) + \sum_{i} \log P(w_i \mid c)

We compute a log-score for each class and pick the larger.

import math

def classify(headline):
    scores = {}
    for c in classes:
        score = math.log(priors[c])
        for w in tokenize(headline):
            if w in vocab:            # ignore words never seen in training
                score += math.log(p_word(w, c))
        scores[c] = score
    return max(scores, key=scores.get)

demo = "These 23 Photos Will Change The Way You See The World"
print(classify(demo))
clickbait

Look at the two log-scores behind that decision — the clickbait score is much higher (less negative), so clickbait wins:

for c in classes:
    score = math.log(priors[c])
    for w in tokenize(demo):
        if w in vocab:
            score += math.log(p_word(w, c))
    print(c, round(score, 2))
clickbait -53.73
genuine -68.71

Words like “these,” “you,” and “this” pulled the clickbait score far ahead. That is the whole algorithm: start each class at its log-prior, add the log-likelihood of every word, and choose the winner.


A First End-to-End Pass

Let’s see which words do the heaviest lifting. For each word we compute the likelihood ratio — its probability under clickbait divided by its probability under genuine. A ratio far above 1 means the word is a strong clickbait tell.

for w in ["these", "this", "you", "things", "reasons"]:
    ratio = p_word(w, "clickbait") / p_word(w, "genuine")
    print(f"{w:>8}  {ratio:5.1f}x")
   these   36.7x
    this   10.0x
     you    9.9x
  things    9.4x
 reasons    8.0x

These are exactly the words you would expect from listicle clickbait: “These 19 Photos…”, “This Is Why…”, “21 Reasons You Need…”. The word “these” alone is about 37 times more likely in a clickbait headline than a genuine one. Genuine headlines, by contrast, lean on proper nouns and event words that carry no such lopsided ratio.

A horizontal bar chart of five words — these, this, you, things, reasons — with bars showing each word's likelihood ratio of clickbait to genuine. 'these' has by far the longest bar at 36.7 times, while the others cluster between 8 and 10 times.
The likelihood ratio P(word | clickbait) / P(word | genuine) for five listicle words, computed from the training set. "These" is about 37 times more likely in clickbait headlines; "this," "you," "things," and "reasons" each run roughly 8–10 times more likely. These lopsided ratios are the evidence Naive Bayes adds up.

Now run the finished classifier across all 1,200 held-out test headlines and measure its accuracy — the fraction it labels correctly.

correct = sum(classify(h) == label
              for h, label in zip(test["headline"], test["label"]))
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.1%}")
test accuracy: 91.8%

A classifier built from nothing but word counts and Bayes’ theorem labels real headlines correctly about 92% of the time — roughly nine out of ten — on data it never saw during training. No neural network, no deep learning library, just priors, smoothed likelihoods, and a log-sum. That is the power of Naive Bayes, and you now understand every step of it.


Practice Exercises

Exercise 1: Read the likelihood ratio for new words

Compute the clickbait-to-genuine likelihood ratio for the words "will", "trump", and "the". Which one is the strongest clickbait signal, and which is essentially neutral? Explain what a ratio near 1 means.

Hint

Reuse p_word(w, "clickbait") / p_word(w, "genuine") for each word. A ratio near 1 means the word appears at about the same rate in both classes, so it barely moves either log-score — it carries almost no evidence.

Exercise 2: Watch smoothing prevent a zero

Pick a word that appears in genuine headlines but never in clickbait ones (for example, search word_counts["clickbait"] for a missing key). Print its unsmoothed probability count / total and its smoothed probability with p_word. Why would the unsmoothed value break a multiplied likelihood?

Hint

The unsmoothed probability is 0 / total_words["clickbait"], which is exactly 0; multiplying it into the product zeroes out the whole headline. The smoothed p_word adds 1 on top, so it returns a tiny positive number like 1 / (total_words["clickbait"] + V) instead.

Exercise 3: Classify your own headlines

Write three headlines of your own — one obviously clickbait, one plainly genuine, and one in between — and run classify on each. For the borderline case, print both class log-scores and see how close the decision was.

Hint

Reuse the loop from the demo that prints math.log(priors[c]) plus the summed math.log(p_word(w, c)) for each class. The smaller the gap between the two log-scores, the less confident the classifier is.


Summary

You turned Bayes’ theorem into a working text classifier. The key move was to compare P(cwords) P(c \mid \text{words}) across classes, drop the shared denominator, and use the naive assumption that words are conditionally independent given the class so that the likelihood factors into a product of per-word probabilities. You estimated priors as class frequencies and likelihoods by counting words, fixed the zero-frequency problem with Laplace (add-1) smoothing, and switched to log-probabilities to avoid underflow. On 6,000 real headlines, that recipe reached about 92% accuracy.

Key Concepts

  • Naive Bayes — a classifier that picks the class maximizing P(c)iP(wic) P(c)\prod_i P(w_i \mid c) .
  • Naive (conditional independence) assumption — treating words as independent given the class; wrong about language but good enough to rank classes.
  • Prior P(c) P(c) — the probability of a class before seeing the words; here estimated as each class’s fraction of the training data.
  • Likelihood P(wc) P(w \mid c) — the probability of a word given a class, estimated by counting.
  • Laplace (add-1) smoothing — add 1 to every count and V |V| to the denominator so no probability is ever zero.
  • Log-probabilities — replace a product of tiny probabilities with a sum of logs to prevent numerical underflow.

Why This Matters

Naive Bayes was the original engine behind spam filters and still anchors sentiment analysis, document tagging, and medical triage tools. It is fast, needs little data, and is easy to inspect — you can read off exactly which words swayed a decision. More than that, building one teaches the habit at the heart of applied probability: take a clean theorem, make a defensible simplifying assumption, patch the places where raw data breaks, and ship something that works.


Next Steps

Continue to Lesson 5 - Guided Project: Clickbait Classifier

Build the full Naive Bayes clickbait classifier end to end, the complete guided project.

Back to Module Overview

Return to the Conditional Probability & Bayes module overview


Continue Building Your Skills

You have seen every moving part of Naive Bayes and watched it sort real headlines with about 92% accuracy. In the next lesson you will build the whole thing yourself, start to finish, as a guided project — wiring training, smoothing, and evaluation into one clean classifier you can point at any text you like.