Lesson 5 - Guided Project: Building a Clickbait Detector
Welcome to the Guided Project
You’re building a tool that flags clickbait headlines. Using only Bayes’ theorem and the Naive Bayes idea from this module — no machine-learning library — you will train a classifier on 6,000 real headlines and see how well it separates clickbait from genuine news.
That is the whole project. Every piece comes from this module: a prior for each class, a likelihood for each word, and Bayes’ theorem to combine them into a posterior. The word naive is the one assumption we make — that the words in a headline are independent given its class — and at the end we will be honest about what that shortcut costs. In between you will tokenize text, count words, turn those counts into probabilities, and classify headlines you have never seen.
By the end of this lesson, you will be able to:
- Turn raw text into word counts you can compute probabilities from
- Estimate class priors and smoothed per-word likelihoods from a training set
- Combine them with Bayes’ theorem in log-space to classify new headlines
- Measure real accuracy, read a confusion matrix, and inspect what the model learned
You only need pandas and the standard-library modules re, math, and collections. Let’s begin.
Step 1: Load and Explore the Data
Start by loading the headlines and getting the measure of what you have.
import pandas as pd
headlines = pd.read_csv("https://datatweets.com/datasets/headlines.csv")
print(headlines.shape)
print(headlines["label"].value_counts())

(6000, 2)
label
genuine      3000
clickbait    3000
Name: count, dtype: int64

There are 6,000 headlines with two columns: the headline text and its label, either "clickbait" or "genuine". The classes are perfectly balanced, with 3,000 of each. That balance is convenient, because it means our priors will be roughly even and any accuracy above 50% is real signal, not an artifact of guessing the majority class.
Look at a few examples to get a feel for the two styles:
for label in ["clickbait", "genuine"]:
    print(f"--- {label} ---")
    for text in headlines[headlines["label"] == label]["headline"].head(3):
        print(" ", text)

--- clickbait ---
  15 Romantic Movies That Will Make You Cry
  17 Things That Inevitably Happen When You Move Back Home
  This Is What Happens When You Try To Take A Group Photo
--- genuine ---
  Why do cities close refuges unless they want women to die?
  From Helen Mirrens Instagram to Kanyes new hair-do: this weeks fashion trends
  Football transfer rumours: Antoine Griezmann to Manchester United?

You can already feel the difference. Clickbait leans on lists and second-person teasers (“Things”, “You”, “What Happens”); genuine headlines name people, places, and events. Our classifier’s whole job is to make that intuition quantitative.
Before any modeling, check one simple numeric signal — does headline length differ between the classes? Tokenize each headline into lowercase word-like pieces and count them:
import re
def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

headlines["n_words"] = headlines["headline"].apply(lambda s: len(tokenize(s)))
print(headlines.groupby("label")["n_words"].mean().round(2))

label
clickbait    11.48
genuine      10.61
Name: n_words, dtype: float64

Clickbait headlines are a touch longer on average (11.48 words versus 10.61), but the gap is under one word, far too small to classify on its own. Length is a weak hint. The real signal is in which words appear, and that is exactly what Naive Bayes is built to exploit.
Why tokenize this way
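The function lowercases the text first; the pattern [a-z']+ then keeps runs of letters and straight apostrophes. So “You'll” becomes the single token you'll and “15” disappears entirely. That keeps contractions intact (they turn out to be strong clickbait signals) while dropping numbers and punctuation that would only clutter the vocabulary. Two side effects are worth knowing: a hyphenated word such as “late-night” splits into two tokens, and a mixed token such as “90s” keeps only its letters.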
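To make the tokenizer's behavior concrete, here is a small sketch that runs the same tokenize function on three strings. The first string is made up for illustration; the other two are taken from the fresh headlines used later in this lesson.

```python
import re

def tokenize(text):
    # Lowercase first, then keep runs of letters and straight apostrophes.
    return re.findall(r"[a-z']+", text.lower())

samples = [
    "15 Things You'll Never Forget",                      # hypothetical example
    "Senate approves budget bill after late-night vote",  # from this lesson
    "15 Things Only 90s Kids Will Understand",            # from this lesson
]
for s in samples:
    print(tokenize(s))
# ['things', "you'll", 'never', 'forget']
# ['senate', 'approves', 'budget', 'bill', 'after', 'late', 'night', 'vote']
# ['things', 'only', 's', 'kids', 'will', 'understand']
```

The stray 's' left over from “90s” is harmless here, but it is the kind of artifact worth checking for whenever you change the pattern.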
Step 2: Prepare — Tokenize and Split
A classifier you test on the same headlines it trained on will look better than it is — it has effectively seen the answers. To get an honest estimate, hold out part of the data. Split 80% for training and 20% for a test set the model never sees while learning:
train = headlines.sample(frac=0.8, random_state=1)
test = headlines.drop(train.index)
print("train:", train.shape[0], " test:", test.shape[0])
print(train["label"].value_counts().to_dict())
print(test["label"].value_counts().to_dict())

train: 4800  test: 1200
{'genuine': 2400, 'clickbait': 2400}
{'clickbait': 600, 'genuine': 600}

The random_state=1 makes the random sample reproducible, so you get the same split every run. The training set has 4,800 headlines (2,400 per class) and the test set 1,200 (600 per class). Both stay balanced, which keeps the evaluation clean.
We already have the tokenize function from Step 1. It is the only preprocessing the model needs: every headline becomes a list of lowercase tokens, and the classifier works entirely from how often each token shows up in each class.
Step 3: Train — Priors and Smoothed Word Likelihoods
Recall Bayes’ theorem from this module. For a class $c$ and a headline made of tokens $w_1, w_2, \dots, w_n$:

$$P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

There are two ingredients to estimate from the training set. The prior $P(c)$ is just how common each class is. The likelihood $P(w_i \mid c)$ is how often word $w_i$ appears in headlines of class $c$. The naive assumption is the product itself: we multiply the word likelihoods as if each word were independent of the others given the class. They aren’t, really, but pretending so makes the math tractable and, as you’ll see, still works remarkably well.
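A tiny worked example may help fix the arithmetic before the real counts arrive. The sketch below uses made-up priors and likelihoods for a two-token headline; none of these numbers come from the dataset.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
prior = {"clickbait": 0.5, "genuine": 0.5}
likelihood = {
    "clickbait": {"you": 0.030, "won't": 0.004},
    "genuine": {"you": 0.005, "won't": 0.001},
}
tokens = ["you", "won't"]

unnormalized = {}
for c in prior:
    score = prior[c]
    for w in tokens:
        score *= likelihood[c][w]  # naive step: multiply as if independent
    unnormalized[c] = score

# Dividing by the total turns the two scores into posteriors that sum to one.
total = sum(unnormalized.values())
posterior = {c: s / total for c, s in unnormalized.items()}

for c, p in posterior.items():
    print(f"{c}: {p:.2f}")
# clickbait: 0.96
# genuine: 0.04
```

Only the comparison between classes matters for classification, which is why the normalizing step can be skipped in practice.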
First, count every word in each class and build the vocabulary:
import math
from collections import Counter
classes = ["clickbait", "genuine"]
word_counts = {c: Counter() for c in classes} # word -> count, per class
class_tokens = {c: 0 for c in classes} # total tokens, per class
class_docs = {c: 0 for c in classes} # headline count, per class
vocab = set()
for _, row in train.iterrows():
    c = row["label"]
    class_docs[c] += 1
    tokens = tokenize(row["headline"])
    word_counts[c].update(tokens)
    class_tokens[c] += len(tokens)
    vocab.update(tokens)
V = len(vocab)
N = len(train)
log_prior = {c: math.log(class_docs[c] / N) for c in classes}
print("vocabulary size:", V)
print("log priors:", {c: round(v, 3) for c, v in log_prior.items()})

vocabulary size: 10224
log priors: {'clickbait': -0.693, 'genuine': -0.693}

The training headlines use 10,224 distinct words. Both log priors are $\log(0.5) \approx -0.693$, confirming the even split. We store the priors in log-space for a reason that becomes critical in the next step.
Now the likelihoods. The naive estimate of $P(w \mid c)$ is “how many times did $w$ appear in class $c$, out of all tokens in class $c$.” But there’s a trap: any word that never appeared in a class gets probability zero, and a single zero would wipe out the entire product. The fix from this module is add-1 (Laplace) smoothing, which pretends every word in the vocabulary was seen one extra time:

$$P(w \mid c) = \frac{\text{count}(w, c) + 1}{T_c + V}$$

Here $\text{count}(w, c)$ is the number of times $w$ appears in class $c$, $T_c$ is the total number of tokens in class $c$, and $V$ is the vocabulary size. Adding $V$ to the denominator (one for each vocabulary word’s extra count) keeps the probabilities summing to one. No word is ever impossible; an unseen word just gets a small, non-zero likelihood. We don’t precompute all of these (there are over twenty thousand across the two classes); we compute each one on demand in the prediction step.
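The effect of smoothing is easy to check by hand on a toy example. The sketch below uses a handful of made-up tokens (not from the dataset) and a hypothetical helper named smoothed; it reuses the variable names from the training code only for readability.

```python
from collections import Counter

# Toy token lists standing in for each class (hypothetical data).
tokens_by_class = {
    "clickbait": ["you", "won't", "believe", "you", "these"],
    "genuine": ["senate", "approves", "budget", "vote"],
}
word_counts = {c: Counter(toks) for c, toks in tokens_by_class.items()}
class_tokens = {c: len(toks) for c, toks in tokens_by_class.items()}
vocab = set().union(*tokens_by_class.values())
V = len(vocab)  # 8 distinct words across both classes

def smoothed(word, c):
    # Add-1 smoothing: one extra count for every vocabulary word.
    return (word_counts[c][word] + 1) / (class_tokens[c] + V)

p_seen = smoothed("you", "clickbait")       # (2 + 1) / (5 + 8) = 3/13
p_unseen = smoothed("senate", "clickbait")  # (0 + 1) / (5 + 8) = 1/13
total = sum(smoothed(w, "clickbait") for w in vocab)

print(f"seen: {p_seen:.4f}  unseen: {p_unseen:.4f}  total: {total:.4f}")
```

The total over the vocabulary comes out to one because the numerators add up to the 5 real tokens plus 8 extra counts, which is exactly the denominator.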
Why log-probabilities
A 12-word headline multiplies a dozen probabilities, each well below one, so the product is already vanishingly small; for longer texts it can underflow to zero in floating point. Working in logs turns the product into a sum, $\log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c)$, which is numerically stable. Conveniently, the logarithm is monotonic: the class with the highest log-score is the same one with the highest probability.
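A quick experiment shows where the raw product breaks down. This is a sketch under assumed numbers: the per-word likelihood of 1e-4 and the helper name product_and_log_sum are illustrative, not taken from the model.

```python
import math

def product_and_log_sum(p, n):
    # Multiply p by itself n times, and sum log(p) n times, side by side.
    product, log_sum = 1.0, 0.0
    for _ in range(n):
        product *= p
        log_sum += math.log(p)
    return product, log_sum

p = 1e-4  # hypothetical per-word likelihood

headline_product, headline_log = product_and_log_sum(p, 12)
long_product, long_log = product_and_log_sum(p, 100)

print(headline_product > 0)  # True: tiny, but still representable
print(long_product)          # 0.0: the product has underflowed
print(f"{long_log:.2f}")     # -921.03: the log-sum is unaffected
```

At headline length the raw product would still work, so the log-space habit is mostly insurance here; it becomes essential as soon as the same code is pointed at longer documents.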
Step 4: Predict and Evaluate
With priors and a smoothing rule in hand, classifying a headline is mechanical: start from each class’s log prior, add the log likelihood of every token, and pick the class with the higher total.
def predict(text):
    tokens = tokenize(text)
    scores = {}
    for c in classes:
        score = log_prior[c]
        denom = class_tokens[c] + V
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / denom)
        scores[c] = score
    return max(scores, key=scores.get)

That is the entire model: Bayes’ theorem, in log-space, with add-1 smoothing. Run it across all 1,200 held-out headlines and compare to the truth:
test_pred = test["headline"].apply(predict)
accuracy = (test_pred == test["label"]).mean()
correct = (test_pred == test["label"]).sum()
print(f"accuracy: {accuracy * 100:.2f}% ({correct} of {len(test)})")

accuracy: 91.42% (1097 of 1200)

The classifier is right 91.42% of the time (1,097 of 1,200 headlines it had never seen) using nothing but word counts and Bayes’ theorem. Against the 50% you’d get by flipping a coin on a balanced set, that is a strong result for a model you can read top to bottom.
But an overall number hides which way the mistakes go. A confusion matrix lays out actual labels against predicted ones:
confusion = pd.crosstab(test["label"], test_pred,
                        rownames=["actual"], colnames=["predicted"])
print(confusion)

predicted  clickbait  genuine
actual
clickbait        580       20
genuine           83      517

Read the off-diagonal cells. They are the errors, and they are lopsided:
- Clickbait caught: 580 of 600 (97%). The model rarely lets clickbait slip past. Only 20 clickbait headlines were mislabeled genuine.
- Genuine misflagged: 83 of 600. The bulk of the errors are genuine headlines wrongly called clickbait. The model is trigger-happy — when in doubt, it cries clickbait.
That asymmetry is worth naming. For a tool that flags clickbait, the dangerous error depends on your goal. If flagging means hiding a headline, those 83 false alarms mean real news gets buried — you’d want to make the model more cautious. The confusion matrix, not the headline accuracy figure, is what tells you that.
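The four cells of the matrix can be turned into rates that make the asymmetry explicit. The sketch below copies the counts from the confusion matrix above; the names precision, recall, and false-positive rate are standard terms, though this lesson does not otherwise use them.

```python
# Counts copied from the confusion matrix in this lesson.
tp = 580  # clickbait predicted clickbait
fn = 20   # clickbait predicted genuine
fp = 83   # genuine predicted clickbait
tn = 517  # genuine predicted genuine

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)               # share of clickbait that was caught
precision = tp / (tp + fp)            # share of clickbait flags that were right
false_positive_rate = fp / (fp + tn)  # share of genuine headlines misflagged

print(f"accuracy:            {accuracy:.4f}")             # 0.9142
print(f"recall:              {recall:.4f}")               # 0.9667
print(f"precision:           {precision:.4f}")            # 0.8748
print(f"false-positive rate: {false_positive_rate:.4f}")  # 0.1383
```

Seen this way, roughly one in eight clickbait flags lands on a genuine headline, which is the number to watch if a flag means hiding the story.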
Step 5: Look Inside — Signal Words and Fresh Headlines
A 91% number is nice, but the satisfying part of a from-scratch model is that you can ask it why. For each word, compare its likelihood under each class. The log of that ratio says which way the word pushes a headline:

$$\text{signal}(w) = \log \frac{P(w \mid \text{clickbait})}{P(w \mid \text{genuine})}$$

A large positive value means the word is a clickbait flag; a large negative value means it points to genuine news. Restrict to words that appear at least 20 times so the ratios are stable, then sort:
signal = []
for w in vocab:
    cb = word_counts["clickbait"][w]
    gen = word_counts["genuine"][w]
    if cb + gen < 20:
        continue
    p_cb = (cb + 1) / (class_tokens["clickbait"] + V)
    p_gen = (gen + 1) / (class_tokens["genuine"] + V)
    signal.append((w, math.log(p_cb / p_gen)))

signal.sort(key=lambda x: x[1])
print("Strongest GENUINE words:", [w for w, _ in signal[:8]])
print("Strongest CLICKBAIT words:", [w for w, _ in signal[-8:][::-1]])

Strongest GENUINE words: ['review', 'uk', 'brexit', 'australian', 'theresa', 'nhs', 'speech', 'eu']
Strongest CLICKBAIT words: ["you'll", 'pics', 'amazing', 'photos', 'them', 'these', "you've", 'absolutely']

This is the model showing its work. It learned, from counts alone, that proper-noun news vocabulary (“brexit”, “theresa”, “nhs”, “eu”) signals genuine reporting, while second-person contractions and breathless teaser words (“you’ll”, “pics”, “amazing”, “photos”, “these”) signal clickbait. Nobody coded those rules. They fell out of Bayes’ theorem applied to 4,800 examples.
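These per-word log ratios are not just a diagnostic; they are the decision itself. Writing $\text{score}(c)$ for the log-score that predict computes for class $c$, and subtracting one class's score from the other, gives the following identity (a short derivation from the scoring rule used in this lesson, in the same notation):

```latex
\begin{aligned}
\text{score}(\text{clickbait}) - \text{score}(\text{genuine})
  &= \log \frac{P(\text{clickbait})}{P(\text{genuine})}
   + \sum_{i=1}^{n} \log \frac{P(w_i \mid \text{clickbait})}{P(w_i \mid \text{genuine})}
\end{aligned}
```

With the balanced priors in this training set the first term is $\log(0.5 / 0.5) = 0$, so a headline is called clickbait when the log ratios of its tokens sum to a positive number. Each token casts a weighted vote, and the weights are exactly the values sorted above.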
The real test is headlines the model has never seen, in any form. Write a few of your own and run them through:
fresh = [
    "23 Photos That Prove You Will Never Be This Cool",
    "Senate approves budget bill after late-night vote",
    "You Won't Believe What This Dog Did Next",
    "Central bank holds interest rates steady amid inflation concerns",
    "15 Things Only 90s Kids Will Understand",
]
for text in fresh:
    print(f"[{predict(text)}] {text}")

[clickbait] 23 Photos That Prove You Will Never Be This Cool
[genuine] Senate approves budget bill after late-night vote
[clickbait] You Won't Believe What This Dog Did Next
[genuine] Central bank holds interest rates steady amid inflation concerns
[clickbait] 15 Things Only 90s Kids Will Understand

Five for five. The listicle and teaser headlines land as clickbait; the sober policy and finance headlines land as genuine, and the model decided each one by adding up the log-likelihoods of words it had counted during training. That is the payoff: a tool you built from a single theorem, classifying text it has never encountered.
Take It Further
The model is good, and the same toolkit can push it further:
- Use bigrams. Add two-word tokens like “you’ll never” or “interest rates” alongside single words. Phrases carry signal that individual words lose, and they soften the naive independence assumption a little.
- Strip stopwords — or don’t. Common words like “the” and “of” add noise. Try dropping a stopword list and see whether accuracy moves. It may not: notice that “these” and “them” were strong clickbait signals, so blunt stopword removal can throw away real information.
- Tune the decision threshold. Right now a headline is clickbait whenever its clickbait log-score edges out genuine. Require a margin before flagging — only call clickbait when it wins by some gap — and watch the 83 false alarms fall while a few more clickbait slip through. The confusion matrix tells you whether that trade is worth it.
- Inspect the misses. Pull the 20 clickbait headlines the model called genuine. What do they have in common? Often they’re short, or use unusual words the model barely saw.
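Two of these ideas, bigrams and a decision margin, take only a few lines each. The sketch below is one possible starting point, not a tested improvement: with_bigrams, decide, and margin are hypothetical names, and the example scores are made-up log-scores of the kind predict builds internally. Using the margin in practice would mean changing predict to return its scores dictionary instead of only the winning label.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def with_bigrams(tokens):
    # Keep the single words and append each adjacent pair as one extra token.
    pairs = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + pairs

def decide(scores, margin=0.0):
    # Flag clickbait only when its log-score beats genuine by more than
    # `margin`. Note that an exact tie goes to genuine here.
    gap = scores["clickbait"] - scores["genuine"]
    return "clickbait" if gap > margin else "genuine"

print(with_bigrams(tokenize("You'll Never Guess")))
# ["you'll", 'never', 'guess', "you'll never", 'never guess']

example_scores = {"clickbait": -40.0, "genuine": -41.5}  # made-up log-scores
print(decide(example_scores, margin=0.0))  # clickbait: wins by 1.5
print(decide(example_scores, margin=2.0))  # genuine: 1.5 is not enough
```

If you add bigrams, remember to apply with_bigrams in both training and prediction so the vocabulary and the counts stay consistent.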
Summary
You built a working clickbait detector from a single theorem. You loaded 6,000 real headlines, tokenized them into word counts, and split off a held-out test set. You estimated class priors and add-1 smoothed word likelihoods from the training data, then combined them with Bayes’ theorem in log-space to classify headlines the model had never seen — reaching a real 91.42% accuracy. You read the confusion matrix to find the errors run lopsided toward false clickbait alarms, and you opened the model up to see the exact words driving its decisions. No machine-learning library touched any of it.
Key Concepts
- Naive Bayes: classify by combining a class prior with per-word likelihoods, assuming words are independent given the class.
- Prior and likelihood: $P(c)$ is how common a class is; $P(w \mid c)$ is how often a word appears in that class.
- Add-1 smoothing: pad every count by one so no unseen word forces a probability of zero.
- Log-probabilities: sum logs instead of multiplying probabilities to stay numerically stable.
- Confusion matrix: actual versus predicted counts that reveal which way the errors go, not just how many.
Why This Matters
This is Bayes’ theorem doing real work on messy, real-world text. The same machinery — priors, likelihoods, smoothing, log-space scoring — is the backbone of spam filters, language identification, and countless production classifiers. And the honest close is the naive part: words in a headline are obviously not independent (“you’ll” and “never” travel together), yet pretending they are still bought us 91% accuracy. That gap between a convenient assumption and the messy truth is the cost you accept for a model this simple and transparent — and knowing when that trade is worth making is exactly the judgment this module was building toward.
Next Steps
Continue to Module 5 - Statistical Inference (next in the course)
Move from probability to inference: estimate population quantities, build confidence intervals, and test hypotheses.
Continue Building Your Skills
You just turned Bayes’ theorem into a tool — counting words, smoothing the gaps, summing logs, and measuring your work against headlines the model had never seen. That whole loop, from a probability rule to a thing that makes decisions you can audit, is what data work feels like at its best. Carry forward the two habits that made it honest: hold out a test set, and always read the confusion matrix, not just the accuracy. Onward to statistical inference.