Lesson 6 - Guided Project: Classifying Disaster Tweets

Welcome to the Capstone Project

This lesson pulls together every technique from the module into a single, realistic project. You will play the role of a data scientist at a news-monitoring company that wants to flag which tweets describe genuine disasters and which only borrow disaster language. You will load real data, clean and vectorize messy tweet text, train three different model families, and then evaluate the strongest one honestly.

By the end of this lesson, you will be able to:

  • Load and explore a real text classification dataset and inspect its class balance
  • Clean and standardize raw tweet text so a neural network can learn from it
  • Build and train three model families in Keras: a dense embedding model, a bidirectional LSTM, and a transformer
  • Compare models fairly on a held-out test set and pick a winner
  • Evaluate the best model with a confusion matrix, precision, and recall, and reason about its mistakes

You should be comfortable with the earlier lessons in this module: text vectorization, embeddings, sequence models, and the transformer’s attention mechanism. Let’s build.


The Problem

Twitter is one of the first places people post during emergencies: wildfires, floods, earthquakes, and accidents all show up there in real time. A news-monitoring team would love to surface those tweets automatically. The trouble is that people also use disaster words constantly when nothing is wrong. “This new album is fire.” “I’m literally dying at this meme.” “Traffic today was a total disaster.” A keyword filter drowns in these false alarms.

So this is a binary text classification problem. Each tweet gets a label:

  • 1 means the tweet is about a real disaster.
  • 0 means it is not.

Your job is to learn the difference from the words themselves, which is exactly what the models in this module are built for.

The dataset is disaster_tweets.csv. Each row is one tweet with an id, an optional keyword, an optional location, the tweet text, and the target label you want to predict.

import numpy as np
import pandas as pd
import tensorflow as tf

# download: https://datatweets.com/datasets/disaster_tweets.csv
df = pd.read_csv("disaster_tweets.csv")

print("Shape:", df.shape)
print(df[["text", "target"]].head(3).to_string(index=False))
# Output:
# Shape: (7613, 5)
#                                              text  target
#  Our Deeds are the Reason of this #earthquake ...       1
#              Forest fire near La Ronge Sask. ...       1
#  All residents asked to 'shelter in place' ...        1

The full file has 7,613 tweets. The keyword and location columns are blank for many rows and add little predictive power once the model reads the text itself, so you can drop them and focus on text and target.

df = df[["text", "target"]]

Exploring the Data

Before modeling anything, look at how the labels are distributed. If one class dominated, accuracy would be a misleading score, so this check shapes how you read every later result.

print(df["target"].value_counts())
print("disaster rate:", round(df["target"].mean(), 2))
# Output:
# target
# 0    4342
# 1    3271
# Name: count, dtype: int64
# disaster rate: 0.43

About 43 percent of the tweets describe real disasters, so the dataset is reasonably balanced. A model that blindly guessed “not a disaster” every time would score only 57 percent, which gives you a clear baseline to beat.

Bar chart of disaster versus non-disaster tweet counts
The disaster tweets dataset is fairly balanced, with non-disaster tweets slightly more common.
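
The 57 percent baseline can be reproduced from the class counts alone. This is a minimal sketch in which the counts are copied by hand from the value_counts() output above rather than read from the file:

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4342, 1: 3271}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print("total tweets:", total)
print("majority class:", majority_class)
print("baseline accuracy:", round(baseline_accuracy, 2))
```

Any trained model has to beat this number before its accuracy means anything.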

It also helps to know how long these tweets are, because that decides how many tokens your models need to read.

word_counts = df["text"].str.split().apply(len)
print("avg words:", round(word_counts.mean(), 1), "max:", word_counts.max())
# Output:
# avg words: 14.9 max: 31

Tweets average about 15 words and top out at 31, so a sequence length of 30 captures all but the very longest tweets in full without wasting much memory on padding.

Histogram of tweet length in words
Most tweets are short, clustering around 15 words, which keeps sequences cheap to process.
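
One way to sanity-check a candidate sequence length is to measure what fraction of tweets fit inside it. The sketch below uses a handful of made-up word counts (an assumption for illustration, not values from the dataset), and the helper name coverage is hypothetical. On the real data you would pass in word_counts from the cell above.

```python
def coverage(lengths, seq_len):
    """Return the fraction of texts with at most seq_len words."""
    lengths = list(lengths)
    fits = sum(1 for n in lengths if n <= seq_len)
    return fits / len(lengths)

# Hypothetical word counts, for illustration only.
toy_lengths = [5, 9, 12, 14, 15, 15, 17, 20, 24, 31]

for seq_len in (10, 20, 30):
    print(seq_len, coverage(toy_lengths, seq_len))
```

A length that covers nearly all tweets but sits well below the maximum is usually a reasonable trade between truncation and padding.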

Always explore before you model

Two minutes of exploration just told you three useful things: the dataset is balanced (so accuracy is meaningful), the baseline to beat is 57 percent, and tweets are short (so a modest sequence length is fine). Skipping this step is one of the most common beginner mistakes.


Cleaning the Text

Raw tweets are noisy. They contain URLs, hashtags, mentions, punctuation, numbers, and inconsistent capitalization. A neural network can sometimes learn through this noise, but cleaning the text first makes the signal easier to find and shrinks the vocabulary the model has to memorize.

You will apply a few standard steps: lowercase everything, strip URLs and mentions, remove punctuation and numbers, and collapse extra whitespace.

import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)                # remove @mentions
    text = re.sub(r"[^a-z\s]", " ", text)            # keep letters only
    text = re.sub(r"\s+", " ", text).strip()         # collapse whitespace
    return text

df["clean"] = df["text"].apply(clean_text)

print(df["text"].iloc[0])
print(df["clean"].iloc[0])
# Output:
# Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
# our deeds are the reason of this earthquake may allah forgive us all

Notice what survived: the hashtag #earthquake became the plain word earthquake, which is exactly the kind of token that signals a real disaster. Cleaning keeps meaning while throwing away formatting noise.
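
The first tweet has no URL or mention, so it does not exercise every rule. The sketch below repeats clean_text so it runs on its own and applies it to a made-up tweet (hypothetical, not from the dataset) that contains a mention, a hashtag, a URL, and a number:

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)                # remove @mentions
    text = re.sub(r"[^a-z\s]", " ", text)            # keep letters only
    text = re.sub(r"\s+", " ", text).strip()         # collapse whitespace
    return text

# Hypothetical tweet, for illustration only.
raw = "@user123 Flood warning for #Houston!! Details: https://example.com/alert 2 roads closed"
cleaned = clean_text(raw)
print(cleaned)
```

The order of the steps matters: URLs are removed before punctuation is stripped, otherwise the URL would break apart into stray words such as "https" and "com". Note also that the number 2 disappears, which is the price of the letters-only rule.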

Let the layer do the splitting

You could tokenize and remove stopwords by hand with a library like NLTK, and that is a fine exercise. But Keras’s TextVectorization layer already lowercases, strips punctuation, splits on whitespace, and builds the vocabulary for you. The light cleaning above handles what the layer does not (URLs and mentions), and you let the layer handle the rest. Fewer moving parts means fewer bugs.


Vectorizing and Splitting

Models need numbers, not strings. The TextVectorization layer maps each tweet to a fixed-length sequence of integer token IDs, and an Embedding layer then turns each ID into a dense vector the network can learn from. This is the same two-layer foundation you met earlier in the module.

Diagram showing words mapped to integer IDs and then to embedding vectors
Text vectorization turns words into integer IDs, and the embedding layer maps each ID to a learned vector.
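
To make the mapping concrete, here is a minimal pure-Python sketch of the same idea. It is not how TextVectorization is implemented; it only mimics the layer's convention of reserving ID 0 for padding and ID 1 for out-of-vocabulary words. The helper names build_vocab and encode and the toy sentences are hypothetical.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # reserved IDs, mirroring the Keras convention

def build_vocab(texts, max_tokens):
    """Map the most frequent words to integer IDs starting at 2."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(max_tokens - 2)
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}

def encode(text, vocab, seq_len):
    """Turn one text into exactly seq_len integer IDs."""
    ids = [vocab.get(word, OOV_ID) for word in text.split()]
    ids = ids[:seq_len]                           # truncate long texts
    return ids + [PAD_ID] * (seq_len - len(ids))  # pad short texts

# Toy training texts, for illustration only.
train_texts = ["forest fire near town", "fire crews on scene"]
vocab = build_vocab(train_texts, max_tokens=10)

print(vocab)
print(encode("fire near school", vocab, seq_len=5))
```

The word "school" never appeared in the toy training texts, so it maps to the out-of-vocabulary ID, and the short sentence is padded with zeros up to the fixed length.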

First, hold out a test set so every model is judged on tweets it never saw during training. As in the rest of the module, fix the random seed so your split is reproducible, and pass stratify=y so the training and test sets keep the same 43 percent disaster rate.

from sklearn.model_selection import train_test_split

X = df["clean"].values
y = df["target"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train:", len(X_train), "Test:", len(X_test))
# Output:
# Train: 6090 Test: 1523

Now build the vectorizer and adapt it on the training text only. Adapting is how the layer learns its vocabulary; doing it on the training split alone keeps test information from leaking in.

from tensorflow.keras.layers import TextVectorization

MAX_TOKENS = 10000
SEQ_LEN = 30

vectorizer = TextVectorization(
    max_tokens=MAX_TOKENS,
    output_mode="int",
    output_sequence_length=SEQ_LEN,
)
vectorizer.adapt(X_train)   # learn the vocabulary from TRAIN only

print("Vocabulary size:", len(vectorizer.get_vocabulary()))
# Output:
# Vocabulary size: 10000

With the data split and the vectorizer ready, you can build models. Each model will start with this same vectorizer followed by an Embedding layer, so the only thing that changes between them is what happens after the embeddings.


Model 1: A Dense Embedding Model

Start simple. The first model embeds each token, averages the embeddings across the tweet with GlobalAveragePooling1D, and feeds that single averaged vector through a couple of dense layers. Averaging throws away word order, which makes this a kind of learned bag-of-words. It is fast, hard to overfit, and a good baseline.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Embedding, GlobalAveragePooling1D, Dense, Dropout
)

EMBED_DIM = 64

dense_model = Sequential([
    vectorizer,
    Embedding(MAX_TOKENS, EMBED_DIM),
    GlobalAveragePooling1D(),
    Dense(32, activation="relu"),
    Dropout(0.3),
    Dense(1, activation="sigmoid"),
])

dense_model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

dense_model.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=10,
    batch_size=32,
    verbose=0,
)

loss, acc = dense_model.evaluate(X_test, y_test, verbose=0)
print(f"Dense model test accuracy: {acc:.3f}")
# Output:
# Dense model test accuracy: 0.710

The dense model reaches about 0.710 accuracy on the test set, comfortably above the 0.57 baseline. Not bad for a model that ignores word order entirely.
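
The claim that averaging discards word order can be checked in a few lines. This sketch uses made-up two-dimensional embeddings (an assumption for illustration; the real model learns 64-dimensional ones) and a hypothetical helper named average_pool:

```python
def average_pool(vectors):
    """Average a list of equal-length vectors, component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}

sentence_a = ["dog", "bites", "man"]
sentence_b = ["man", "bites", "dog"]

pooled_a = average_pool([toy_embeddings[w] for w in sentence_a])
pooled_b = average_pool([toy_embeddings[w] for w in sentence_b])

print(pooled_a)
print(pooled_b)
print("same vector:", pooled_a == pooled_b)
```

Both sentences collapse to the same pooled vector, so everything after the pooling layer sees them as identical.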

Training and validation accuracy curves for the dense model across epochs
Training and validation accuracy for the dense model; the gap that opens up between the curves is a sign of overfitting, which the dropout layer helps to limit.

Because the embedding layer learns a vector for every word, you can project those vectors down to two dimensions and see related words cluster together, a sign the model is learning genuine meaning rather than noise.

Two-dimensional projection of learned word embeddings with related words clustered
A 2D projection of the learned embeddings; disaster-related words drift toward one region of the space.

Model 2: A Bidirectional LSTM

The dense model’s weakness is that it ignores order. “Fire crews contained the blaze” and “this party is fire” share the word “fire” but mean very different things. A bidirectional LSTM reads the sequence forward and backward, so it can use context on both sides of a word to decide what it means.

from tensorflow.keras.layers import LSTM, Bidirectional

lstm_model = Sequential([
    vectorizer,
    Embedding(MAX_TOKENS, EMBED_DIM),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(32)),
    Dense(32, activation="relu"),
    Dropout(0.3),
    Dense(1, activation="sigmoid"),
])

lstm_model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

lstm_model.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=10,
    batch_size=32,
    verbose=0,
)

loss, acc = lstm_model.evaluate(X_test, y_test, verbose=0)
print(f"BiLSTM test accuracy: {acc:.3f}")
# Output:
# BiLSTM test accuracy: 0.751

The BiLSTM lifts test accuracy to about 0.751. Reading word order pays off: the model now has the context it needs to learn that “contained the blaze” is reporting and “this party is fire” is slang, even though it will not get every such case right. That jump of roughly four points over the dense model is the payoff for modeling sequence.

Sequence models are slower

The BiLSTM is far more expensive to train than the dense model because it processes tokens one step at a time and cannot fully parallelize across the sequence. On short tweets the cost is bearable, but on long documents this is exactly the bottleneck the transformer was invented to remove.
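
The step-by-step dependency is easy to see in a stripped-down recurrence. The sketch below is a plain tanh recurrence with hand-picked weights, not an LSTM, and the input values are made up. It only illustrates that each state needs the previous one and that reading the sequence in reverse gives a different summary.

```python
import math

def run_recurrence(inputs, w_x=0.8, w_h=0.5):
    """Scan a sequence left to right; each state needs the previous one."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h

# Hypothetical one-number-per-token inputs, for illustration only.
sequence = [1.0, -0.5, 0.25]

forward_state = run_recurrence(sequence)
backward_state = run_recurrence(list(reversed(sequence)))

# Keras's Bidirectional wrapper concatenates the two directions by default.
bidirectional_summary = [forward_state, backward_state]
print(bidirectional_summary)
print("order matters:", forward_state != backward_state)
```

Because the loop cannot compute a state before the one preceding it, the work inside one sequence cannot be spread across tokens the way an average or an attention layer can.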


Model 3: A Transformer

The transformer replaces step-by-step recurrence with self-attention: every token looks at every other token in one parallel operation and decides which ones matter. This is the architecture that powers modern language models, and you met its attention mechanism earlier in the module.

Diagram of self-attention connecting every token to every other token in a sentence
Self-attention lets each token weigh every other token in the tweet at once, capturing long-range context.
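
The core computation can be sketched in pure Python. This is a simplification: Keras's MultiHeadAttention learns separate query, key, and value projections and runs several heads, while the sketch below uses the raw token vectors for all three roles. The toy vectors and the helper names are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with no learned projections."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # New representation: weighted average of all token vectors.
        mixed = [
            sum(w * value[d] for w, value in zip(weights, vectors))
            for d in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional token vectors, for illustration only.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how much one token draws from every other token. The first token leans most on itself and on the similar second token, and least on the dissimilar third one. No row depends on any other row, which is why the whole computation can run in parallel.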

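You can build a compact transformer encoder block in Keras: a multi-head attention layer followed by a small feed-forward network, each wrapped in a residual connection and layer normalization. To keep the code short, this block adds no positional encoding, so attention sees the tweet’s tokens without any built-in sense of their order; full transformers add position information to the embeddings. The model is also trained from scratch on the tweets, so it has no outside knowledge of language to lean on.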

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (
    MultiHeadAttention, LayerNormalization, Add
)

inputs = Input(shape=(1,), dtype=tf.string)
x = vectorizer(inputs)
x = Embedding(MAX_TOKENS, EMBED_DIM)(x)

# one transformer encoder block
attn = MultiHeadAttention(num_heads=2, key_dim=EMBED_DIM)(x, x)
x = LayerNormalization()(Add()([x, attn]))
ff = Dense(64, activation="relu")(x)
ff = Dense(EMBED_DIM)(ff)
x = LayerNormalization()(Add()([x, ff]))

x = GlobalAveragePooling1D()(x)
x = Dropout(0.3)(x)
outputs = Dense(1, activation="sigmoid")(x)

transformer = Model(inputs, outputs)
transformer.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

transformer.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=10,
    batch_size=32,
    verbose=0,
)

loss, acc = transformer.evaluate(X_test, y_test, verbose=0)
print(f"Transformer test accuracy: {acc:.3f}")
# Output:
# Transformer test accuracy: 0.753

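The transformer edges out the BiLSTM at about 0.753 accuracy, but the 0.002 gap amounts to roughly three of the 1,523 test tweets, so on accuracy the two are effectively tied. The difference that matters is architectural: attention processes all tokens in parallel rather than one step at a time, which matters enormously at scale. On these short tweets that advantage barely shows, but it is real.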


Comparing the Three Models

Lining the results up side by side makes the story clear.

results = {
    "Dense (embedding + pooling)": 0.710,
    "Bidirectional LSTM":          0.751,
    "Transformer (self-attention)": 0.753,
}
for name, acc in results.items():
    print(f"{name:<32} {acc:.3f}")
# Output:
# Dense (embedding + pooling)      0.710
# Bidirectional LSTM               0.751
# Transformer (self-attention)     0.753

Bar chart comparing test accuracy of the dense, BiLSTM, and transformer models
Test accuracy climbs from the dense baseline to the sequence and attention models, with the transformer narrowly best.

The pattern matches what the module taught. Modeling word order (the BiLSTM) clearly beats ignoring it (the dense model), and self-attention (the transformer) matches or slightly exceeds recurrence while being far easier to parallelize. The transformer is your winner, so evaluate it more carefully.
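
Before reading too much into the top two scores, it helps to ask how large the gap is relative to the noise in a test set of this size. The sketch below is a rough check that treats each test accuracy as a binomial proportion, which is an approximation and ignores run-to-run variation from random initialization.

```python
import math

n_test = 1523            # test-set size from the split above
acc_bilstm = 0.751       # reported test accuracies
acc_transformer = 0.753

# How many test tweets does the accuracy gap correspond to?
gap_in_tweets = (acc_transformer - acc_bilstm) * n_test

# Binomial standard error of one accuracy estimate (an approximation).
std_error = math.sqrt(acc_transformer * (1 - acc_transformer) / n_test)

print("gap in tweets:", round(gap_in_tweets))
print("standard error:", round(std_error, 3))
```

The gap is about three tweets, while one standard error is roughly 1.1 accuracy points, so a single run cannot really separate the BiLSTM from the transformer. The lead of both over the dense model is several times larger than that and is much safer to trust. Treat the ranking of the top two as tentative; repeating the training with a few different seeds would give a clearer picture.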


Evaluating the Best Model

Accuracy alone hides which kinds of mistakes a model makes. A confusion matrix breaks predictions into four buckets, and from those you can compute precision and recall.

from sklearn.metrics import confusion_matrix, precision_score, recall_score

probs = transformer.predict(X_test, verbose=0).ravel()
preds = (probs >= 0.5).astype(int)

cm = confusion_matrix(y_test, preds)
print(cm)
print("precision:", round(precision_score(y_test, preds), 3))
print("recall:   ", round(recall_score(y_test, preds), 3))
# Output:
# [[672 197]
#  [179 475]]
# precision: 0.707
# recall:    0.726

Here is what those four numbers mean for this problem, where the positive class is “real disaster”:

  • True negatives (672): non-disaster tweets correctly ignored.
  • False positives (197): ordinary tweets wrongly flagged as disasters (a false alarm).
  • False negatives (179): real disaster tweets the model missed (the most costly error for a monitoring team).
  • True positives (475): real disasters correctly caught.

Confusion matrix for the transformer with 672 true negatives, 197 false positives, 179 false negatives, and 475 true positives
The transformer’s test confusion matrix: the errors are split fairly evenly between false alarms and missed disasters.

From the matrix you get two metrics that accuracy cannot show:

$$\text{precision} = \frac{TP}{TP + FP} = \frac{475}{475 + 197} \approx 0.707$$

$$\text{recall} = \frac{TP}{TP + FN} = \frac{475}{475 + 179} \approx 0.726$$

A precision of 0.707 means that when the model shouts “disaster,” it is right about 71 percent of the time. A recall of 0.726 means it catches about 73 percent of the real disasters. For a news team, recall is often the more important number, because a missed earthquake is worse than an extra false alarm to review. If you cared even more about recall, you could lower the 0.5 threshold and accept more false positives in exchange for catching more real disasters.
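
As a cross-check, all three headline metrics can be recomputed by hand from the four confusion-matrix counts. This is a minimal sketch with the counts copied from the output above; for a binary problem, scikit-learn's confusion_matrix(...).ravel() returns them in the same tn, fp, fn, tp order.

```python
# Counts copied from the confusion matrix above.
tn, fp, fn, tp = 672, 197, 179, 475

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the tweets flagged, how many were real
recall = tp / (tp + fn)      # of the real disasters, how many were caught

print("accuracy: ", round(accuracy, 3))
print("precision:", round(precision, 3))
print("recall:   ", round(recall, 3))
```

The accuracy that falls out of these counts matches the 0.753 reported by evaluate, which confirms that the matrix and the earlier score describe the same predictions.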


Error Analysis: Where the Model Slips

Roughly 75 percent accuracy may sound modest next to the textbook 99 percent you see on clean benchmarks. It is actually a respectable result, and looking at the errors shows why these tweets are genuinely hard.

# inspect a few tweets the model got wrong
import numpy as np
wrong = np.where(preds != y_test)[0]
for i in wrong[:3]:
    print(f"true={y_test[i]} pred={preds[i]} | {X_test[i]}")
# Output (illustrative):
# true=0 pred=1 | this party is fire everyone is here
# true=1 pred=0 | so many families displaced after the storm last night
# true=0 pred=1 | i am literally dying these jokes are too good

Two failure modes show up again and again:

  • Metaphor and slang. “This party is fire” and “I am literally dying” use disaster words to mean the opposite of disaster. The model latches onto the word and flags a false alarm. Even humans need context (and tone) to read these correctly.
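  • Sarcasm and understatement. A genuinely serious tweet phrased calmly, without obvious alarm words, can slip past the model as ordinary chatter, as with the “families displaced after the storm” example. Sarcasm causes the mirror-image problem: the literal words point one way and the intent the other.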

These are not bugs you can simply patch. Human language is ambiguous, and short tweets give the model very little context to disambiguate. Against that, getting three out of four right is a solid outcome.

How to push past 75 percent

Every model here learned its word meanings from the 6,090-tweet training split alone, and with validation_split=0.1 only about 5,480 of those tweets were used to fit the weights. A pretrained transformer such as DistilBERT arrives already knowing how English works from billions of words, so you fine-tune it rather than teach it language from scratch. On a dataset this size, that transfer of knowledge is the single biggest lever you have for accuracy, typically pushing scores into the low-to-mid 80s. That is exactly where you would go next on a production version of this project.


Practice Exercises

Now it is your turn. Try these on the same data and models before checking the hints.

Exercise 1: Inspect the Cleaning

Pick five raw tweets that contain a URL or a @mention and print their cleaned versions side by side, so you can confirm clean_text is removing the noise you expect.

import pandas as pd
df = pd.read_csv("disaster_tweets.csv")  # download: https://datatweets.com/datasets/disaster_tweets.csv

# Your code here: apply clean_text and compare before/after

Hint

Filter the raw text with df["text"].str.contains("http|@") to find rows that actually have URLs or mentions, then apply your clean_text function and print the text and clean columns together with df[["text", "clean"]].head().

Exercise 2: Move the Decision Threshold

The transformer flags a tweet as a disaster when its probability is at least 0.5. Recompute precision and recall using a threshold of 0.35 instead, and describe what happens to each metric.

# Reuse probs and y_test from the lesson
# Your code here: apply a 0.35 threshold and recompute precision and recall

Hint

Build new predictions with preds_low = (probs >= 0.35).astype(int), then call precision_score and recall_score again. Lowering the threshold flags more tweets as disasters, so you should expect recall to rise (you catch more real disasters) while precision falls (you also raise more false alarms).

Exercise 3: Add a Second Transformer Block

The transformer used a single encoder block. Stack a second identical block on top of the first and retrain. Does the extra depth help on this small dataset?

# Start from the transformer code in the lesson
# Your code here: repeat the attention + feed-forward block once more before pooling

Hint

Wrap the attention-plus-feed-forward block in a small helper function that takes x and returns the updated x, then call it twice in a row before GlobalAveragePooling1D. With only 6,090 training tweets, more depth often does not help and can even overfit, so compare test accuracy carefully rather than assuming deeper is better.


Summary

Congratulations! You have built a complete deep learning text classification pipeline from raw tweets to an evaluated, winning model. Let’s review what you learned.

Key Concepts

Loading and Exploring Text Data

  • The disaster tweets dataset has 7,613 tweets with a 43 percent disaster rate, making it reasonably balanced
  • Exploring class balance and tweet length up front tells you the baseline to beat and a sensible sequence length

Cleaning and Vectorizing

  • Light cleaning (lowercasing, stripping URLs, mentions, punctuation, and numbers) reduces noise without losing meaning
  • TextVectorization turns text into integer sequences, and an Embedding layer maps each token to a learned vector
  • Adapt the vectorizer on the training split only to avoid leaking test information

Comparing Model Families

  • A dense embedding model ignores word order and reached 0.710 accuracy, a solid baseline
  • A bidirectional LSTM reads order in both directions and reached 0.751
  • A transformer uses parallel self-attention and was best at 0.753, with a large efficiency advantage at scale

Evaluating Honestly

  • A confusion matrix splits predictions into true/false positives and negatives
  • Precision (0.707) is how often a flagged tweet is really a disaster; recall (0.726) is how many real disasters you caught
  • Moving the decision threshold trades precision against recall

Error Analysis

  • The hardest mistakes come from metaphor and slang (“this party is fire”) and from sarcasm
  • About 75 percent accuracy is respectable on short, ambiguous, noisy tweets
  • A pretrained transformer like DistilBERT is the most effective next step for higher accuracy

Why This Matters

This project is a miniature of nearly every real NLP system you will build: get messy text, clean it, turn it into numbers, try a few architectures of increasing power, and then judge the winner with metrics that match the business cost of its errors. The specific dataset changes, but the workflow does not.

It also makes a quieter point about modern machine learning. The transformer barely beat the LSTM here, not because attention is weak, but because it learned English from scratch on only a few thousand tweets. The real power of transformers shows up when they arrive pretrained on enormous corpora and you simply fine-tune them. Understanding that distinction is what separates someone who can copy a model from someone who knows which lever to pull next.


Next Steps

You have assembled the entire module into one working project and evaluated it like a professional. From here, revisit the broader program or return to this module’s overview to review any technique you want to strengthen.



Keep Building Your Skills

You just took a noisy, real-world dataset and turned it into a model that catches roughly three out of four disaster tweets, with a clear-eyed understanding of where and why it fails. That combination of building and critiquing is the heart of practical machine learning. Carry the habit forward: whatever model you train next, do not stop at the accuracy number. Look at the confusion matrix, read the mistakes, and ask what the next lever should be. That is how good models become great ones.