Lesson 5 - Guided Project: Building a Clickbait Detector

Welcome to the Guided Project

You’re building a tool that flags clickbait headlines. Using only Bayes’ theorem and the Naive Bayes idea from this module, with no machine-learning library, you will train a classifier on 4,800 of 6,000 real headlines and test it on the remaining 1,200 to see how well it separates clickbait from genuine news.

That is the whole project. Every piece comes from this module: a prior for each class, a likelihood for each word, and Bayes’ theorem to combine them into a posterior. The word naive is the one assumption we make — that the words in a headline are independent given its class — and at the end we will be honest about what that shortcut costs. In between you will tokenize text, count words, turn those counts into probabilities, and classify headlines you have never seen.

By the end of this lesson, you will be able to:

  • Turn raw text into word counts you can compute probabilities from
  • Estimate class priors and smoothed per-word likelihoods from a training set
  • Combine them with Bayes’ theorem in log-space to classify new headlines
  • Measure real accuracy, read a confusion matrix, and inspect what the model learned

You only need pandas and the standard-library modules re, math, and collections. Let’s begin.


Step 1: Load and Explore the Data

Start by loading the headlines and getting the measure of what you have.

import pandas as pd

headlines = pd.read_csv("https://datatweets.com/datasets/headlines.csv")
print(headlines.shape)
print(headlines["label"].value_counts())

(6000, 2)
label
genuine      3000
clickbait    3000
Name: count, dtype: int64

There are 6,000 headlines with two columns — the headline text and its label, either "clickbait" or "genuine". The classes are perfectly balanced: 3,000 of each. That balance is convenient, because it means our priors will be roughly even and any accuracy above 50% is real signal, not an artifact of guessing the majority class.
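
To make the 50% floor concrete, here is a minimal sketch of a majority-class baseline. It uses only the label counts printed above; the variable names are illustrative and not part of the lesson's code.

```python
# Label counts as printed by value_counts() above.
label_counts = {"genuine": 3000, "clickbait": 3000}

total = sum(label_counts.values())

# A baseline that always predicts the most common class is right
# this fraction of the time.
baseline_accuracy = max(label_counts.values()) / total

print(f"majority-class baseline: {baseline_accuracy:.0%}")
```

On an imbalanced dataset the same calculation would give a higher floor, which is why balance makes the later accuracy figure easier to interpret.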

Look at a few examples to get a feel for the two styles:

for label in ["clickbait", "genuine"]:
    print(f"--- {label} ---")
    for text in headlines[headlines["label"] == label]["headline"].head(3):
        print(" ", text)

--- clickbait ---
  15 Romantic Movies That Will Make You Cry
  17 Things That Inevitably Happen When You Move Back Home
  This Is What Happens When You Try To Take A Group Photo
--- genuine ---
  Why do cities close refuges unless they want women to die?
  From Helen Mirrens Instagram to Kanyes new hair-do: this weeks fashion trends
  Football transfer rumours: Antoine Griezmann to Manchester United?

You can already feel the difference. Clickbait leans on lists and second-person teasers (“Things”, “You”, “What Happens”); genuine headlines name people, places, and events. Our classifier’s whole job is to make that intuition quantitative.

Before any modeling, check one simple numeric signal — does headline length differ between the classes? Tokenize each headline into lowercase word-like pieces and count them:

import re

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

headlines["n_words"] = headlines["headline"].apply(lambda s: len(tokenize(s)))
print(headlines.groupby("label")["n_words"].mean().round(2))

label
clickbait    11.48
genuine      10.61
Name: n_words, dtype: float64

Clickbait headlines are a touch longer on average — 11.48 words versus 10.61 — but the gap is under one word, far too small to classify on its own. Length is a weak hint. The real signal is in which words appear, and that is exactly what Naive Bayes is built to exploit.

Why tokenize this way

The function lowercases the text first; the pattern [a-z']+ then keeps runs of letters and apostrophes. So “You'll” becomes the single token you'll and “15” disappears entirely. That keeps contractions intact (they turn out to be strong clickbait signals) while dropping numbers and punctuation that would only clutter the vocabulary.
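
A quick way to check the rule is to run the tokenizer on a few strings. The sketch below repeats the lesson's tokenize function so it runs on its own; the third input, with a curly apostrophe, is an added test case, and whether the dataset contains any such characters is not something the lesson states.

```python
import re

def tokenize(text):
    # Same rule as the lesson: lowercase, then keep runs of
    # letters and straight apostrophes.
    return re.findall(r"[a-z']+", text.lower())

# Digits are dropped, words are lowercased.
print(tokenize("15 Romantic Movies That Will Make You Cry"))

# A straight apostrophe keeps the contraction in one piece.
print(tokenize("You'll Never Guess"))

# A curly apostrophe (U+2019) is not in the pattern, so the
# contraction splits into two tokens.
print(tokenize("You\u2019ll Never Guess"))
```

If your own text uses curly apostrophes, one option is to replace them with straight ones before tokenizing.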


Step 2: Prepare — Tokenize and Split

A classifier you test on the same headlines it trained on will look better than it is — it has effectively seen the answers. To get an honest estimate, hold out part of the data. Split 80% for training and 20% for a test set the model never sees while learning:

train = headlines.sample(frac=0.8, random_state=1)
test = headlines.drop(train.index)

print("train:", train.shape[0], "  test:", test.shape[0])
print(train["label"].value_counts().to_dict())
print(test["label"].value_counts().to_dict())

train: 4800   test: 1200
{'genuine': 2400, 'clickbait': 2400}
{'clickbait': 600, 'genuine': 600}

The random_state=1 makes the shuffle reproducible, so you get the same split every run. The training set has 4,800 headlines (2,400 per class) and the test set 1,200 (600 per class). Both stay balanced, which keeps the evaluation clean.
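
The sample-then-drop pattern is easy to verify on a small example. The sketch below uses a made-up ten-row DataFrame (not the real dataset) to confirm that the two pieces never share a row and together cover every row.

```python
import pandas as pd

# Toy stand-in for the headlines DataFrame (made-up rows, illustration only).
toy = pd.DataFrame({
    "headline": [f"headline {i}" for i in range(10)],
    "label": ["clickbait", "genuine"] * 5,
})

# Same pattern as the lesson: sample 80%, then drop those rows to get the rest.
toy_train = toy.sample(frac=0.8, random_state=1)
toy_test = toy.drop(toy_train.index)

# No row index appears in both sets.
overlap = set(toy_train.index) & set(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))
```

Note that sample draws rows at random without regard to label, so exact class balance in a split is not guaranteed in general. Printing the per-class counts after splitting, as the lesson does, is the way to confirm it.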

We already have the tokenize function from Step 1. It is the only preprocessing the model needs: every headline becomes a list of lowercase tokens, and the classifier works entirely from how often each token shows up in each class.


Step 3: Train — Priors and Smoothed Word Likelihoods

Recall Bayes’ theorem from this module. For a class $c$ and a headline made of tokens $w_1, w_2, \dots, w_n$:

$$P(c \mid \text{headline}) \propto P(c) \cdot \prod_{i=1}^{n} P(w_i \mid c)$$

Two ingredients to estimate from the training set. The prior $P(c)$ is just how common each class is. The likelihood $P(w_i \mid c)$ is how often word $w_i$ appears in headlines of class $c$. The naive assumption is the product itself: we multiply the word likelihoods as if each word were independent of the others given the class. They aren’t, really, but pretending so makes the math tractable and, as you’ll see, still works remarkably well.
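
To see the formula at work before touching real data, here is a minimal sketch with made-up priors and likelihoods for a two-word vocabulary. The numbers are invented for illustration and do not come from the dataset; the function names are hypothetical.

```python
# Made-up numbers for illustration only; they do not come from the dataset.
prior = {"clickbait": 0.5, "genuine": 0.5}
likelihood = {
    "clickbait": {"you": 0.05, "bank": 0.001},
    "genuine": {"you": 0.005, "bank": 0.01},
}

def unnormalized_scores(tokens):
    # P(c) times the product of P(w | c): the right-hand side of the formula.
    scores = {}
    for c in prior:
        score = prior[c]
        for w in tokens:
            score *= likelihood[c][w]
        scores[c] = score
    return scores

def posterior(tokens):
    # Dividing by the sum over classes turns "proportional to" into
    # probabilities that add up to one.
    scores = unnormalized_scores(tokens)
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

post = posterior(["you", "you"])
print({c: round(p, 3) for c, p in post.items()})
```

Because the normalizing constant is the same for every class, picking the largest unnormalized score always picks the largest posterior, which is why a classifier can skip the division entirely.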

First, count every word in each class and build the vocabulary:

import math
from collections import Counter

classes = ["clickbait", "genuine"]
word_counts = {c: Counter() for c in classes}   # word -> count, per class
class_tokens = {c: 0 for c in classes}           # total tokens, per class
class_docs = {c: 0 for c in classes}             # headline count, per class
vocab = set()

for _, row in train.iterrows():
    c = row["label"]
    class_docs[c] += 1
    tokens = tokenize(row["headline"])
    word_counts[c].update(tokens)
    class_tokens[c] += len(tokens)
    vocab.update(tokens)

V = len(vocab)
N = len(train)
log_prior = {c: math.log(class_docs[c] / N) for c in classes}

print("vocabulary size:", V)
print("log priors:", {c: round(v, 3) for c, v in log_prior.items()})

vocabulary size: 10224
log priors: {'clickbait': -0.693, 'genuine': -0.693}

The training headlines use 10,224 distinct words. Both log priors are $\ln(0.5) \approx -0.693$, confirming the even split. We store the priors in log-space for a reason that becomes critical in the next step.

Now the likelihoods. The straightforward estimate of $P(w \mid c)$ is “how many times did $w$ appear in class $c$, out of all tokens in class $c$.” But there’s a trap: any word that never appeared in a class gets probability zero, and a single zero would wipe out the entire product. The fix from this module is add-1 (Laplace) smoothing, which pretends every word in the vocabulary was seen one extra time:

$$P(w \mid c) = \frac{\text{count}(w, c) + 1}{(\text{total tokens in } c) + V}$$

Adding $V$ to the denominator (one for each vocabulary word’s extra count) keeps the probabilities summing to one. No word is ever impossible; a word unseen in a class just gets a small, non-zero likelihood. We don’t precompute all of these (one per word per class, over twenty thousand in total); we compute each one on demand in the prediction step.
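
A tiny worked case makes the smoothing rule easy to check by hand. The counts and the four-word vocabulary below are made up for illustration, and the names counts, toy_vocab, and smoothed are hypothetical.

```python
from collections import Counter

# Toy counts for one class (made up for illustration).
counts = Counter({"you": 3, "will": 2, "photos": 1})

# "brexit" is in the vocabulary but was never seen in this class.
toy_vocab = {"you", "will", "photos", "brexit"}

total_tokens = sum(counts.values())  # 6 tokens in this class
V_toy = len(toy_vocab)               # 4 distinct words overall

def smoothed(word):
    # Add-1 (Laplace) estimate: (count + 1) / (total tokens + vocabulary size).
    # Counter returns 0 for a missing key, so unseen words get 1 / (6 + 4).
    return (counts[word] + 1) / (total_tokens + V_toy)

for w in sorted(toy_vocab):
    print(w, smoothed(w))

print("sum over vocabulary:", sum(smoothed(w) for w in toy_vocab))
```

The unseen word ends up with a likelihood of 1/10 rather than zero, and the four likelihoods still add up to one (up to floating-point rounding).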

Why log-probabilities

A 12-word headline multiplies a dozen probabilities, each well below one. The product is already vanishingly small, and for longer texts it can underflow to zero in floating point. Working in logs turns the product into a sum, $\ln(a \cdot b) = \ln a + \ln b$, which is numerically stable and, conveniently, monotonic: the class with the highest log-score is the same one with the highest probability.
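
Underflow is easiest to see with more factors than a single headline has. The sketch below multiplies 100 hypothetical likelihoods of 1e-5 each (illustrative values, not taken from the model) and compares the result with the sum of their logs.

```python
import math

# 100 hypothetical word likelihoods, each 1e-5 (illustrative values only).
probs = [1e-5] * 100

# Direct product: the true value is 1e-500, far below the smallest
# positive float, so the running product collapses to zero.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same information, stored as an ordinary number.
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # about -1151.29
```

Once a product has collapsed to zero for every class, there is nothing left to compare, while the log-sums remain distinct and comparable.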


Step 4: Predict and Evaluate

With priors and a smoothing rule in hand, classifying a headline is mechanical: start from each class’s log prior, add the log likelihood of every token, and pick the class with the higher total.

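def predict(text):
    tokens = tokenize(text)
    scores = {}
    for c in classes:
        score = log_prior[c]                 # start from the class log prior
        denom = class_tokens[c] + V          # add-1 smoothing denominator
        for w in tokens:
            # Counter returns 0 for unseen words, so the +1 keeps the log finite
            score += math.log((word_counts[c][w] + 1) / denom)
        scores[c] = score
    return max(scores, key=scores.get)       # class with the highest log-score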

That is the entire model — Bayes’ theorem, in log-space, with add-1 smoothing. Run it across all 1,200 held-out headlines and compare to the truth:

test_pred = test["headline"].apply(predict)
accuracy = (test_pred == test["label"]).mean()
correct = (test_pred == test["label"]).sum()
print(f"accuracy: {accuracy * 100:.2f}%  ({correct} of {len(test)})")

accuracy: 91.42%  (1097 of 1200)

The classifier is right 91.42% of the time — 1,097 of 1,200 headlines it had never seen — using nothing but word counts and Bayes’ theorem. Against the 50% you’d get by flipping a coin on a balanced set, that is a strong result for a model you can read top to bottom.

But an overall number hides which way the mistakes go. A confusion matrix lays out actual labels against predicted ones:

confusion = pd.crosstab(test["label"], test_pred,
                        rownames=["actual"], colnames=["predicted"])
print(confusion)

predicted  clickbait  genuine
actual
clickbait        580       20
genuine           83      517
Figure: Confusion matrix heatmap (actual versus predicted). The diagonal (580 and 517) holds the correct calls. The model misses far more often by flagging genuine news as clickbait (83) than by letting clickbait slip through as genuine (20).

Read the off-diagonal cells — they are the errors, and they are lopsided:

  • Clickbait caught: 580 of 600 (97%). The model rarely lets clickbait slip past. Only 20 clickbait headlines were mislabeled genuine.
  • Genuine misflagged: 83 of 600. The bulk of the errors are genuine headlines wrongly called clickbait. The model is trigger-happy — when in doubt, it cries clickbait.

That asymmetry is worth naming. For a tool that flags clickbait, the dangerous error depends on your goal. If flagging means hiding a headline, those 83 false alarms mean real news gets buried — you’d want to make the model more cautious. The confusion matrix, not the headline accuracy figure, is what tells you that.
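
The asymmetry can be put into numbers. The sketch below recomputes accuracy, recall, precision, and the false-alarm rate from the four cells of the matrix, treating clickbait as the positive class. That choice, and the variable names, are assumptions made here for illustration and are not part of the lesson's code.

```python
# Counts copied from the confusion matrix above
# (rows: actual, columns: predicted). Clickbait is the positive class.
tp = 580   # clickbait correctly flagged
fn = 20    # clickbait called genuine
fp = 83    # genuine called clickbait
tn = 517   # genuine correctly passed

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)             # share of clickbait that was caught
precision = tp / (tp + fp)          # share of clickbait flags that were right
false_alarm_rate = fp / (fp + tn)   # share of genuine news that was flagged

print(f"accuracy:         {accuracy:.4f}")
print(f"recall:           {recall:.4f}")
print(f"precision:        {precision:.4f}")
print(f"false-alarm rate: {false_alarm_rate:.4f}")
```

Recall near 97% alongside precision near 87% says the same thing as the matrix: the model catches almost all clickbait, but roughly one flag in eight lands on a genuine headline.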


Step 5: Look Inside — Signal Words and Fresh Headlines

A 91% number is nice, but the satisfying part of a from-scratch model is that you can ask it why. For each word, compare its likelihood under each class. The log of that ratio says which way the word pushes a headline:

$$\text{signal}(w) = \ln \frac{P(w \mid \text{clickbait})}{P(w \mid \text{genuine})}$$

A large positive value means the word is a clickbait flag; a large negative value means it points to genuine news. Restrict to words that appear at least 20 times in the training set (both classes combined) so the ratios are stable, then sort:

signal = []
for w in vocab:
    cb = word_counts["clickbait"][w]
    gen = word_counts["genuine"][w]
    if cb + gen < 20:
        continue
    p_cb = (cb + 1) / (class_tokens["clickbait"] + V)
    p_gen = (gen + 1) / (class_tokens["genuine"] + V)
    signal.append((w, math.log(p_cb / p_gen)))

signal.sort(key=lambda x: x[1])
print("Strongest GENUINE words:", [w for w, _ in signal[:8]])
print("Strongest CLICKBAIT words:", [w for w, _ in signal[-8:][::-1]])

Strongest GENUINE words: ['review', 'uk', 'brexit', 'australian', 'theresa', 'nhs', 'speech', 'eu']
Strongest CLICKBAIT words: ["you'll", 'pics', 'amazing', 'photos', 'them', 'these', "you've", 'absolutely']
Figure: Horizontal bar chart of the strongest signal words (genuine words extend left in blue, clickbait words extend right in orange). The model's strongest cues. Genuine headlines lean on news vocabulary ("brexit", "nhs", "eu", "review"); clickbait leans on teasers and plurals ("you'll", "pics", "amazing", "these").

This is the model showing its work. It learned, from counts alone, that proper-noun news vocabulary (“brexit”, “theresa”, “nhs”, “eu”) signals genuine reporting, while second-person contractions and breathless plurals (“you’ll”, “pics”, “amazing”, “photos”, “these”) signal clickbait. Nobody coded those rules — they fell out of Bayes’ theorem applied to 4,800 examples.

The real test is headlines the model has never seen, in any form. Write a few of your own and run them through:

fresh = [
    "23 Photos That Prove You Will Never Be This Cool",
    "Senate approves budget bill after late-night vote",
    "You Won't Believe What This Dog Did Next",
    "Central bank holds interest rates steady amid inflation concerns",
    "15 Things Only 90s Kids Will Understand",
]
for text in fresh:
    print(f"[{predict(text)}]  {text}")

[clickbait]  23 Photos That Prove You Will Never Be This Cool
[genuine]  Senate approves budget bill after late-night vote
[clickbait]  You Won't Believe What This Dog Did Next
[genuine]  Central bank holds interest rates steady amid inflation concerns
[clickbait]  15 Things Only 90s Kids Will Understand

Five for five. The listicle and teaser headlines land as clickbait; the sober policy and finance headlines land as genuine — and the model decided each one by adding up the log-likelihoods of words it had counted during training. That is the payoff: a tool you built from a single theorem, classifying text it has never encountered.

Take It Further

The model is good, and the same toolkit can push it further:

  • Use bigrams. Add two-word tokens like “you’ll never” or “interest rates” alongside single words. Phrases carry signal that individual words lose, and they soften the naive independence assumption a little.
  • Strip stopwords, or don’t. Common words like “the” and “of” add noise. Try removing the words on a stopword list and see whether accuracy moves. It may not: notice that “these” and “them” were strong clickbait signals, so blunt stopword removal can throw away real information.
  • Tune the decision threshold. Right now a headline is clickbait whenever its clickbait log-score edges out genuine. Require a margin before flagging — only call clickbait when it wins by some gap — and watch the 83 false alarms fall while a few more clickbait slip through. The confusion matrix tells you whether that trade is worth it.
  • Inspect the misses. Pull the 20 clickbait headlines the model called genuine. What do they have in common? Often they’re short, or use unusual words the model barely saw.
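
Two of the ideas above, bigrams and a decision margin, can be sketched in a few lines. The helper names add_bigrams and predict_with_margin are hypothetical. The margin version assumes you have the per-class log-scores in hand; the lesson's predict returns only the label, so you would need to return or expose its scores dictionary. The scores in the demo are made-up numbers.

```python
def add_bigrams(tokens):
    # Keep the single words and append each adjacent pair joined by a space.
    pairs = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + pairs

def predict_with_margin(scores, margin=0.0):
    # scores: dict mapping class name to log-score, like the one built
    # inside the lesson's predict().
    # Flag clickbait only when it beats genuine by at least `margin`
    # (measured in log units).
    gap = scores["clickbait"] - scores["genuine"]
    return "clickbait" if gap >= margin else "genuine"

print(add_bigrams(["you'll", "never", "guess"]))

# Made-up log-scores where clickbait leads by exactly 1.0.
demo_scores = {"clickbait": -40.0, "genuine": -41.0}
print(predict_with_margin(demo_scores, margin=0.0))  # clickbait
print(predict_with_margin(demo_scores, margin=2.0))  # genuine
```

With margin=0.0 the rule matches the original decision; raising it makes the model more reluctant to flag, trading false alarms for missed clickbait. To use bigrams, the same add_bigrams call would have to be applied both when counting words during training and when scoring a new headline.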

Summary

You built a working clickbait detector from a single theorem. You loaded 6,000 real headlines, tokenized them into word counts, and split off a held-out test set. You estimated class priors and add-1 smoothed word likelihoods from the training data, then combined them with Bayes’ theorem in log-space to classify headlines the model had never seen — reaching a real 91.42% accuracy. You read the confusion matrix to find the errors run lopsided toward false clickbait alarms, and you opened the model up to see the exact words driving its decisions. No machine-learning library touched any of it.

Key Concepts

  • Naive Bayes — classify by combining a class prior with per-word likelihoods, assuming words are independent given the class.
  • Prior and likelihood — $P(c)$ is how common a class is; $P(w \mid c)$ is how often a word appears in that class.
  • Add-1 smoothing — pad every count by one so no unseen word forces a probability of zero.
  • Log-probabilities — sum logs instead of multiplying probabilities to stay numerically stable.
  • Confusion matrix — actual versus predicted counts that reveal which way the errors go, not just how many.

Why This Matters

This is Bayes’ theorem doing real work on messy, real-world text. The same machinery — priors, likelihoods, smoothing, log-space scoring — is the backbone of spam filters, language identification, and countless production classifiers. And the honest close is the naive part: words in a headline are obviously not independent (“you’ll” and “never” travel together), yet pretending they are still bought us 91% accuracy. That gap between a convenient assumption and the messy truth is the cost you accept for a model this simple and transparent — and knowing when that trade is worth making is exactly the judgment this module was building toward.


Next Steps

Continue to Module 5 - Statistical Inference (next in the course)

Move from probability to inference: estimate population quantities, build confidence intervals, and test hypotheses.

Back to Module Overview

Return to the Conditional Probability & Bayes module overview


Continue Building Your Skills

You just turned Bayes’ theorem into a tool — counting words, smoothing the gaps, summing logs, and measuring your work against headlines the model had never seen. That whole loop, from a probability rule to a thing that makes decisions you can audit, is what data work feels like at its best. Carry forward the two habits that made it honest: hold out a test set, and always read the confusion matrix, not just the accuracy. Onward to statistical inference.