Lesson 4 - Building Sequence Models for Text
Welcome to Sequence Models for Text
In the last lesson you built a neural network that turned each tweet into an embedding and then averaged those embeddings into a single vector. It worked surprisingly well, but it had a quiet weakness: it ignored the order of the words. In this lesson you will fix that. You will learn why word order carries meaning, how recurrent layers read text one token at a time while remembering what came before, and how a Bidirectional LSTM reads each tweet in both directions at once. Then you will build one in TensorFlow and watch it beat the averaging model on real data.
By the end of this lesson, you will be able to:
- Explain why averaging word embeddings throws away word order and why that matters
- Describe how a recurrent layer processes a sequence one token at a time while carrying a memory
- Explain what makes an LSTM different from a plain RNN, and what “bidirectional” adds
- Build an Embedding -> Bidirectional(LSTM) -> Dense classifier in Keras
- Train and evaluate the model on the real Disaster Tweets dataset and compare it to a bag-of-embeddings baseline
You should be comfortable with Python, pandas, and the embedding and dense-network ideas from the previous lessons in this module. Let’s begin.
Why Word Order Matters
Consider two short messages:
- “The fire is not spreading, evacuation called off.”
- “The fire is spreading, not called off, evacuation.”
They use almost the same words. A model that only counts which words appear, or averages their embeddings into one vector, sees these two sentences as nearly identical. Yet one is reassuring and the other is an emergency. The difference lives entirely in the order of the words and how each one relates to its neighbors.
This is the limitation of the model you built last lesson. A bag-of-embeddings classifier looks up an embedding vector for every token, averages them, and feeds that single average to a dense layer:
"fire" "spreading" "not" "evacuation"
   |        |        |         |
 [emb]    [emb]    [emb]     [emb]   <- one vector per word
   \________\________/_________/
              average                <- one vector for the whole tweet
                 |
        Dense -> probability
Averaging is order-invariant: shuffle the words and the average is exactly the same. That is great for speed and simplicity, and it captures a lot of signal (the word “evacuation” is a strong clue no matter where it sits). But it can never learn that “not spreading” means something different from “spreading.”
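To make the order-invariance concrete, here is a minimal sketch in plain Python. The two-dimensional embedding values are made up for illustration; a real model would learn them.

```python
# Toy 2-dimensional embeddings; the values are invented for this illustration.
toy_embeddings = {
    "fire": [1.0, 0.5],
    "is": [0.0, 0.25],
    "not": [-1.0, 0.5],
    "spreading": [0.5, -0.75],
}

def average_embedding(tokens):
    """Average the tokens' embedding vectors, dimension by dimension."""
    vectors = [toy_embeddings[t] for t in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

original = ["fire", "is", "not", "spreading"]
shuffled = ["not", "fire", "is", "spreading"]  # same words, different order

print(average_embedding(original))
print(average_embedding(shuffled))
print(average_embedding(original) == average_embedding(shuffled))
```

Both calls return the same vector, so a dense layer fed this average has no way to tell the two orderings apart.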
To go further, you need a model that reads the tweet as a sequence, processing each word in order while remembering the words that came before. That is exactly what recurrent layers do.
Bag-of-embeddings is still a strong baseline
Order-invariant models are not useless. For short, keyword-heavy text like tweets, simply knowing which words appear gets you a long way. You will see this shortly: the bag-of-embeddings model reaches 0.710 test accuracy. The point of this lesson is to beat that baseline by adding word-order awareness, not to dismiss it.
How Recurrent Layers Read a Sequence
A recurrent neural network (RNN) processes a sequence one element at a time. At each step it takes two inputs: the current token’s embedding, and a hidden state that summarizes everything it has seen so far. It combines them, produces an updated hidden state, and moves to the next token. The hidden state is the network’s memory.
token 1      token 2      token 3      token 4
   |            |            |            |
[cell] -h1-> [cell] -h2-> [cell] -h3-> [cell] -> final state
   ^            ^            ^            ^
   (memory carried forward from word to word)
By the time the layer reaches the last word, its hidden state has been shaped by the entire sentence in order. That final state is a summary of the whole sequence that respects word order, which is precisely what averaging could not give you.
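The loop described above can be sketched with a deliberately tiny recurrence. This is a scalar toy, not what Keras runs internally: each "embedding" is a single number, the hidden state is a single number, and the weights `w_x`, `w_h`, and `b` are arbitrary values chosen for illustration.

```python
import math

def run_rnn(inputs, w_x=0.8, w_h=0.5, b=0.0):
    """Scalar toy RNN: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    h = 0.0                      # empty memory before the first token
    states = []
    for x in inputs:             # one token at a time, in order
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

tokens = [1.0, -0.5, 0.25, 2.0]  # stand-ins for token embeddings

forward_states = run_rnn(tokens)
reversed_states = run_rnn(tokens[::-1])

print("final state, original order:", round(forward_states[-1], 4))
print("final state, reversed order:", round(reversed_states[-1], 4))
```

Unlike the average, the final state changes when the same inputs arrive in a different order, because each step depends on the state left behind by the previous one.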
The Problem with Plain RNNs
Plain RNNs have a famous weakness. When a sequence is long, the influence of early words has to pass through many steps to affect the final state, and the signal tends to fade away (or blow up) along the way. In practice this means a simple RNN struggles to connect a word at the start of a sentence to one at the end. For longer text, it effectively forgets the beginning.
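A rough way to see the fading (or exploding) signal is to drop the nonlinearity and look at a linear scalar recurrence, which is a simplification of a real RNN. In that setting the first token's contribution to the final state is multiplied by the recurrent weight once per remaining step. The function name below is hypothetical.

```python
def influence_of_first_token(recurrent_weight, n_steps):
    """In the linear scalar recurrence h_t = w * h_prev + x_t, the first
    input reaches the final state scaled by w ** (n_steps - 1)."""
    return recurrent_weight ** (n_steps - 1)

for steps in (5, 15, 31):
    fading = influence_of_first_token(0.5, steps)
    exploding = influence_of_first_token(1.5, steps)
    print(f"{steps:>2} steps   w=0.5: {fading:.2e}   w=1.5: {exploding:.2e}")
```

With a weight below 1 the first token's influence shrinks toward zero as the sequence grows, and with a weight above 1 it grows without bound. The same repeated multiplication affects the gradients a plain RNN uses to learn.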
LSTMs: Adding a Memory Cell
A Long Short-Term Memory network (LSTM) is a smarter recurrent layer designed to fix that fading-memory problem. Alongside the hidden state, an LSTM maintains a separate cell state that acts as a long-term memory conveyor belt. Three small gates decide, at each step, how to manage that memory:
- The forget gate decides what to erase from the cell state.
- The input gate decides what new information to write into it.
- The output gate decides what to read out as the hidden state.
Because information can flow along the cell state with only minor, gated changes, an LSTM can carry a detail from the first word all the way to the last. That is why LSTMs handle longer sequences far better than plain RNNs, and why they became a workhorse of text modeling.
You do not need to memorize the gate equations to use an LSTM effectively. The key intuition is this: an LSTM reads a sequence in order and learns what to remember and what to discard as it goes.
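If you are curious how the three gates fit together, the sketch below runs one scalar LSTM cell by hand. It is a toy under stated assumptions: every quantity is a single number, and the weights in `params` are hypothetical values picked so the forget gate stays near 1 and the input gate near 0. A trained LSTM learns its own weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar toy LSTM; `p` holds hypothetical weights."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory
    c = f * c_prev + i * g      # erase part of the old memory, write some new
    h = o * math.tanh(c)        # read out a gated view of the memory
    return h, c

# Biases chosen so the cell mostly keeps what it has and writes very little.
params = {"wf": 0.0, "uf": 0.0, "bf": 6.0,
          "wi": 0.0, "ui": 0.0, "bi": -6.0,
          "wo": 1.0, "uo": 0.0, "bo": 0.0,
          "wg": 1.0, "ug": 0.0, "bg": 0.0}

h, c = 0.0, 1.0                        # pretend an early token stored 1.0
for x in [0.3, -0.7, 0.5, -0.2] * 5:   # 20 later tokens
    h, c = lstm_step(x, h, c, params)

print("cell state after 20 more tokens:", round(c, 3))
```

With these settings the cell state stays close to its starting value after 20 steps, which is the "conveyor belt" behavior: information survives because it is only lightly modified at each step.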
Reading in Both Directions
A standard LSTM reads left to right, so when it processes a word it only knows the words before it. But context often comes from both sides. In a sentence that begins “the alarm went off, but it was a”, how to read “alarm” depends on the words that follow, which a left-to-right reader has not seen yet.
A Bidirectional LSTM solves this by running two LSTMs over the same sequence: one forward (left to right) and one backward (right to left). Each token’s representation then combines both views, so every word is understood with full context from both directions.
forward : w1 -> w2 -> w3 -> w4
backward: w1 <- w2 <- w3 <- w4
          \____ combined per token ____/
In Keras this is just one wrapper, Bidirectional, placed around an LSTM layer. That small change is often worth a meaningful accuracy bump on text tasks, as you are about to see.
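As a rough picture of what the wrapper does, the sketch below runs two independent scalar toy recurrences, one over the sequence and one over its reverse, and concatenates their final states. The weights are made up, and a plain tanh recurrence stands in for the LSTM to keep the example short.

```python
import math

def rnn_final_state(inputs, w_x, w_h):
    """Scalar toy recurrence; returns the state after the last input."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h

def bidirectional_summary(inputs):
    # Two separate readers, each with its own (invented) weights.
    forward = rnn_final_state(inputs, w_x=0.8, w_h=0.5)          # left to right
    backward = rnn_final_state(inputs[::-1], w_x=-0.6, w_h=0.9)  # right to left
    return [forward, backward]  # concatenate the two views

tokens = [1.0, -0.5, 0.25, 2.0]
summary = bidirectional_summary(tokens)
print(len(summary), [round(s, 4) for s in summary])
```

One unit per direction gives a summary of length 2. By the same logic, 64 units per direction give 128 numbers, since concatenation is the default way Keras's Bidirectional wrapper merges the two directions.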
The Disaster Tweets Dataset
You will work with a real, well-known text classification problem: deciding whether a tweet is about a genuine disaster or not. The word “ablaze” might describe a wildfire, or it might describe a sunset in a poem. Your model has to learn the difference from the words and their order.
Download the dataset and load it with pandas.
import pandas as pd
# download: https://datatweets.com/datasets/disaster_tweets.csv
df = pd.read_csv("disaster_tweets.csv")
print("Shape:", df.shape)
# Output: Shape: (7613, 2)
The dataset has 7,613 tweets and two columns: the tweet text and a target that is 1 for a real disaster and 0 otherwise. Look at how the two classes are split.
print(df["target"].value_counts())
# Output:
# target
# 0 4342
# 1 3271
# Name: count, dtype: int64
print("disaster rate:", round(df["target"].mean(), 2))
# Output: disaster rate: 0.43
About 43 percent of the tweets describe real disasters (3,271 out of 7,613), so the dataset is reasonably balanced. That means plain accuracy is a fair first metric here: always guessing the majority class would score only about 57 percent (4,342 out of 7,613), so a model cannot score well that way.
The chart above previews where you are headed. The dense bag-of-embeddings model from the last lesson reaches 0.710. The Bidirectional LSTM you build in this lesson pushes that to 0.751. (The third bar, a transformer at 0.753, is the subject of the next lesson, so set it aside for now.) Each step adds a richer way of accounting for word order.
It also helps to know how long these tweets are, because that sets the sequence length your model has to handle.
df["n_words"] = df["text"].str.split().apply(len)
print("avg words:", round(df["n_words"].mean(), 1))
print("max words:", df["n_words"].max())
# Output:
# avg words: 14.9
# max words: 31
Tweets are short: about 15 words on average and 31 at most. Short sequences are forgiving, which is partly why even the simple averaging model does well, but word order still carries enough signal for the LSTM to pull ahead.
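When you pick a sequence length, it can help to check how many texts a given cap would cut short. The helper below is a hypothetical convenience function, shown on a few invented tweets rather than the real dataset. It counts whitespace-separated words, the same way the `n_words` column was computed.

```python
def share_truncated(texts, max_len):
    """Fraction of texts whose whitespace-split length exceeds max_len."""
    too_long = sum(1 for t in texts if len(t.split()) > max_len)
    return too_long / len(texts)

# Invented examples for illustration, not rows from the dataset.
toy_tweets = [
    "forest fire near the village",
    "just got sent this photo from the highway as smoke pours in",
    "what a sunset",
    "residents asked to shelter in place while officers assess the damage downtown",
]

for cap in (4, 8, 32):
    print(f"cap={cap:>2}: {share_truncated(toy_tweets, cap):.2f} of tweets truncated")
```

On the real data you could pass `df["text"]` instead of `toy_tweets` to see how aggressive a candidate length would be.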
Preparing Text for the Model
A neural network cannot read raw strings; it needs integers. As in the previous lesson, you will let a TextVectorization layer learn a vocabulary and turn each tweet into a fixed-length sequence of integer token IDs. Keeping this layer inside the model means the model accepts raw text directly, which makes it easy to deploy.
First, split the data into training and test sets so you can evaluate honestly on tweets the model never saw.
from sklearn.model_selection import train_test_split
X = df["text"]
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,    # hold out 20% for testing
    random_state=42,   # reproducible split
    stratify=y,        # keep the disaster/non-disaster ratio in both sets
)
print("Train tweets:", len(X_train))
print("Test tweets: ", len(X_test))
# Output:
# Train tweets: 6090
# Test tweets:  1523
Now build the TextVectorization layer and adapt it to the training text only, so no test information leaks in. You cap the vocabulary and fix every sequence to a uniform length; short tweets are padded and rare words map to a single “unknown” token.
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization
max_tokens = 10000 # keep the 10,000 most frequent words
output_sequence_length = 32 # pad/truncate every tweet to 32 tokens
vectorizer = TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=output_sequence_length,
)
vectorizer.adapt(X_train)  # learn the vocabulary from TRAIN only
print("Vocabulary size:", len(vectorizer.get_vocabulary()))
# Output: Vocabulary size: 10000
You set output_sequence_length=32 because the longest tweet is 31 words, so 32 comfortably fits every tweet without cutting any off. With preparation done, you can assemble the model.
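To demystify what the layer is doing, here is a simplified pure-Python imitation of its main steps: lowercase and strip punctuation, rank words by frequency, reserve index 0 for padding and index 1 for unknown words, then truncate or pad to a fixed length. The function names are hypothetical, and details such as tie-breaking and exact punctuation handling are assumptions that may differ from Keras.

```python
import string
from collections import Counter

PAD, UNK = "", "[UNK]"  # index 0 and index 1, mirroring the Keras convention

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocab(train_texts, max_tokens):
    """Keep the most frequent words; two slots are reserved for PAD and UNK."""
    counts = Counter(tok for text in train_texts for tok in tokenize(text))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return [PAD, UNK] + most_common

def vectorize(text, vocab, seq_len):
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in tokenize(text)][:seq_len]  # truncate
    return ids + [0] * (seq_len - len(ids))                        # pad

train = ["Fire near the school!", "the fire is out", "Nice day at the beach"]
vocab = build_vocab(train, max_tokens=4)
print(vocab)
print(vectorize("Fire at the harbour", vocab, seq_len=6))
```

Words that did not make it into the vocabulary ("at", "harbour") all share index 1, and the trailing zeros are padding. The vocabulary is built from `train` alone, which is the same discipline as calling `adapt` on the training text only.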
Adapt on training data only
Call vectorizer.adapt(X_train), never on the full dataset. The vocabulary the layer learns is itself information derived from the data. Letting it peek at the test tweets would leak information and inflate your score, the same leakage trap you guard against when scaling features in classic machine learning.
Building the Bidirectional LSTM
Now you stack the layers. The architecture reads top to bottom: raw text goes in, gets vectorized into integers, each integer becomes an embedding, the Bidirectional LSTM reads the sequence in both directions, and a small dense head turns the result into a probability.
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
embedding_dim = 64
model = Sequential([
    Input(shape=(1,), dtype=tf.string),   # accepts raw tweet strings
    vectorizer,                           # text -> integer sequence
    Embedding(input_dim=max_tokens,       # integer -> dense vector
              output_dim=embedding_dim),
    Bidirectional(LSTM(64)),              # read the sequence both ways
    Dense(64, activation="relu"),         # a small hidden layer
    Dense(1, activation="sigmoid"),       # probability of "disaster"
])
model.summary()
A few details are worth pausing on:
- The Embedding layer has input_dim=max_tokens (one row per vocabulary slot) and output_dim=64 (each word becomes a 64-dimensional learned vector). These vectors are trained from scratch on this task.
- Bidirectional(LSTM(64)) wraps a 64-unit LSTM. Because it runs forward and backward, its output is 128 numbers per tweet (64 from each direction), a single fixed-size summary of the whole sequence that respects word order.
- The final Dense(1, activation="sigmoid") squeezes everything into one number between 0 and 1: the model’s estimated probability that the tweet describes a real disaster.
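You can sanity-check the output of model.summary() by counting parameters by hand. The arithmetic below assumes the standard LSTM layout (three gates plus a candidate, each with input weights, recurrent weights, and a bias) and the layer sizes used in this lesson. Treat it as a back-of-the-envelope check rather than a guarantee about any particular Keras version.

```python
vocab_size = 10_000   # max_tokens
embedding_dim = 64
lstm_units = 64

# One 64-dimensional row per vocabulary slot.
embedding_params = vocab_size * embedding_dim

# Each direction: 4 weight sets (3 gates + candidate), each with
# input weights, recurrent weights, and a bias vector.
lstm_one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bidirectional_params = 2 * lstm_one_direction

bidirectional_output = 2 * lstm_units                    # 64 forward + 64 backward
dense_hidden_params = bidirectional_output * 64 + 64     # weights + biases
dense_output_params = 64 * 1 + 1

total = (embedding_params + bidirectional_params
         + dense_hidden_params + dense_output_params)

print("Embedding:    ", embedding_params)
print("Bidirectional:", bidirectional_params)
print("Dense(64):    ", dense_hidden_params)
print("Dense(1):     ", dense_output_params)
print("Total:        ", total)
```

Most of the parameters sit in the embedding table, not in the recurrent layer. That is typical for small text models with a large vocabulary.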
Next, compile the model. Because this is binary classification with a sigmoid output, you use binary cross-entropy as the loss. You also track accuracy and AUC (area under the ROC curve), which measures how well the model ranks disasters above non-disasters across all thresholds.
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", "AUC"],
)
Training and Evaluating the Model
With the model compiled, you train it on the raw training tweets. Keras handles vectorization and embedding lookup internally because those layers live inside the model. You pass a slice of the training data as validation so you can watch for overfitting as the epochs go by.
history = model.fit(
    X_train, y_train,
    validation_split=0.1,  # watch performance on held-out training tweets
    epochs=5,
    batch_size=32,
)
During training you will see the loss fall and accuracy rise each epoch. (Exact per-epoch numbers vary from run to run because weights start randomly; what matters is the trend and the final test score.) Once training finishes, evaluate on the test set, the tweets the model has never seen.
test_loss, test_acc, test_auc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.3f}")
print(f"Test AUC: {test_auc:.3f}")
# Output:
# Test accuracy: 0.751
# Test AUC: 0.825
The Bidirectional LSTM reaches 0.751 test accuracy with an AUC of 0.825. Compare that to the bag-of-embeddings baseline from the previous lesson, which scored 0.710 accuracy. By reading each tweet as an ordered sequence in both directions, the LSTM gained about four percentage points of accuracy on the same data, exactly the lift you would hope word-order awareness would buy.
You can also look at a few predictions to see the model in action.
sample = pd.Series([
    "Massive wildfire forces thousands to evacuate the valley",
    "this party is absolutely on fire tonight",
])
probs = model.predict(sample, verbose=0).ravel()
for tweet, p in zip(sample, probs):
    label = "disaster" if p >= 0.5 else "not disaster"
    print(f"[{label}] {tweet}")
# Output:
# [disaster] Massive wildfire forces thousands to evacuate the valley
# [not disaster] this party is absolutely on fire tonight
Both tweets contain fire-related words, yet the model separates the literal emergency from the figurative one. That is the kind of distinction a pure word-counting model finds much harder, because the answer depends on how the words combine, not just which words appear.
Why AUC went down while accuracy went up
Notice the LSTM’s AUC (0.825) is actually a touch lower than the dense model’s (0.850), even though its accuracy is higher. Accuracy measures correct labels at the 0.5 threshold, while AUC measures ranking quality across all thresholds. They can disagree. This is a healthy reminder to report more than one metric: a single number rarely tells the whole story about a classifier.
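A small hand-built example shows how the two metrics can point in opposite directions. The labels and scores below are invented for illustration. AUC is computed here as the fraction of (positive, negative) pairs ranked correctly, which is the exact pairwise definition; Keras approximates it with a set of thresholds, so its numbers can differ slightly.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of examples labeled correctly at the given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
# Model A: mostly right at 0.5, but one negative gets the highest score of all.
model_a = [0.90, 0.80, 0.60, 0.40, 0.30, 0.95]
# Model B: ranks every positive above every negative, but all scores sit below 0.5.
model_b = [0.45, 0.40, 0.35, 0.30, 0.20, 0.10]

for name, scores in (("A", model_a), ("B", model_b)):
    print(f"model {name}: accuracy={accuracy(labels, scores):.3f}  "
          f"AUC={auc(labels, scores):.3f}")
```

Model A wins on accuracy and Model B wins on AUC. Model B's ranking is perfect, so moving its decision threshold would fix its accuracy, whereas no threshold can fix Model A's badly misranked negative.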
Where Sequence Models Fit
You have now seen two ways to turn a tweet into a prediction, and they sit on a spectrum of how much they respect word order:
- Bag-of-embeddings (0.710): fast and simple, but order-blind. It knows which words appear, not how they relate.
- Bidirectional LSTM (0.751): reads the sequence in both directions, remembering context as it goes. It captures word order at the cost of more computation.
There is a third point on that chart you saw earlier, a transformer at 0.753, which edges out the LSTM by a hair. Transformers handle word order in a completely different way, using a mechanism called attention to let every word look directly at every other word at once. You will build one in the next lesson, so we will not unpack attention here. For now, the lesson is the trend: each architecture finds a better way to account for the relationships between words, and on this short-text task the gains from LSTM to transformer are small, while the gain from order-blind to order-aware is the big one.
Practice Exercises
Try these before checking the hints. They reuse the vectorizer, the train/test splits, and the imports from the lesson.
Exercise 1: Compare Against a Plain (Unidirectional) LSTM
Build a model identical to the lesson’s, but swap Bidirectional(LSTM(64)) for a plain LSTM(64) that reads only left to right. Train it for 5 epochs and compare its test accuracy to the bidirectional version.
# Your code here: build, compile, fit, and evaluate a unidirectional LSTM model
Hint
Reuse the exact same Sequential stack, but replace the line Bidirectional(LSTM(64)) with just LSTM(64). Everything else, the Embedding, the dense head, the compile call, and the fit call, stays the same. You typically lose a little accuracy because the plain LSTM only sees context from one direction.
Exercise 2: Change the Sequence Length
The lesson padded tweets to 32 tokens. Rebuild the TextVectorization layer with output_sequence_length=16, adapt it on X_train, build a fresh model around it, and see how cutting the sequence length affects test accuracy.
# Your code here: rebuild the vectorizer with output_sequence_length=16, then the model
Hint
Create a new TextVectorization(max_tokens=10000, output_mode="int", output_sequence_length=16), call .adapt(X_train), and place it in a new Sequential model (you cannot reuse the old adapted layer with a different length). Since the longest tweet is 31 words, a length of 16 truncates the longer tweets and may cost you some accuracy.
Exercise 3: Inspect the Most Confident Predictions
After training the lesson’s model, find the five test tweets the model is most confident are disasters, and print them with their predicted probabilities. This helps you see what the model latches onto.
# Your code here: get probabilities for X_test, sort, and print the top 5
Hint
Call probs = model.predict(X_test, verbose=0).ravel(), then use import numpy as np and top = np.argsort(probs)[-5:][::-1] to get the indices of the five highest probabilities. Index back into X_test.iloc[i] to read the tweets. You will usually see vivid, literal disaster language scoring near 1.0.
Summary
Congratulations! You have moved from order-blind models to true sequence models, and built a Bidirectional LSTM that reads text the way it is actually written: in order, in both directions. Let’s review what you learned.
Key Concepts
Why Word Order Matters
- A bag-of-embeddings model averages word vectors, which is order-invariant: shuffling the words gives the same result
- Word order carries meaning (“not spreading” vs “spreading”), so order-blind models hit a ceiling on tasks where structure matters
Recurrent Layers
- An RNN processes a sequence one token at a time, carrying a hidden state as memory
- Plain RNNs forget early words in long sequences because the signal fades across many steps
- An LSTM adds a cell state and forget/input/output gates to remember important information over long spans
- A Bidirectional LSTM runs one LSTM forward and one backward so every word is understood with context from both sides
Building the Model in Keras
- Keep TextVectorization inside the model so it accepts raw text; adapt it on training data only
- The stack is Input -> TextVectorization -> Embedding -> Bidirectional(LSTM) -> Dense -> Dense(1, sigmoid)
- A 64-unit Bidirectional(LSTM) outputs 128 numbers per tweet, a sequence-aware summary
- Compile with binary_crossentropy loss for a sigmoid output, and track more than one metric
Results on Disaster Tweets
- The Bidirectional LSTM reached 0.751 test accuracy and 0.825 AUC
- It beat the bag-of-embeddings baseline (0.710) by adding word-order awareness
- Accuracy and AUC can disagree, so report both rather than trusting a single number
Why This Matters
The jump from averaging embeddings to reading sequences is one of the most important ideas in NLP. Almost every meaningful task (translation, summarization, question answering) depends on understanding how words relate to one another, not just which words are present. Recurrent layers were the first widely successful way to model that structure, and the LSTM in particular powered a generation of language systems.
You also saw something important about progress in this field. On this short-text task, the leap from order-blind to order-aware (0.710 to 0.751) was large, while the leap from LSTM to transformer (0.751 to 0.753) was tiny. The biggest wins often come from a fundamentally better way of representing the problem, not from piling on complexity. Keep that lens as you move to transformers next: understand what a new architecture changes about how the model sees the data, and you will know when it is worth the extra cost.
Next Steps
You now understand how sequence models capture word order and have trained a Bidirectional LSTM that beats a strong baseline. In the next lesson, you will meet the transformer, the architecture behind modern language models, and see how attention lets every word relate to every other word at once.
Continue to Lesson 5 - Building Text Models with Transformers
Learn how attention works and build a transformer text classifier in TensorFlow.
Keep Building Your Skills
You have crossed an important threshold: your models now read text as a sequence, not a bag of words. That single shift, from averaging to remembering, is what separates toy text models from ones that capture real meaning. As you move into transformers, keep comparing each new architecture to the baseline you built here. The habit of asking “how much did this actually buy me, and at what cost?” will serve you in every machine learning project you take on.