Lesson 5 - Building Text Models with Transformers
Welcome to Transformers
Over the last few lessons you moved from a simple embedding-and-pooling model to a recurrent BiLSTM that reads a tweet word by word. In this lesson you meet the architecture behind modern language models, from BERT to GPT: the transformer. You will learn what self-attention is and why it lets a model see long-range context, then build a small transformer block in Keras and evaluate it on the same disaster tweets you have used throughout this module.
By the end of this lesson, you will be able to:
- Explain the core transformer idea and why it removes the need for recurrence
- Describe self-attention in terms of queries, keys, and values
- Explain what multi-head attention adds and why positional information is needed
- Build a transformer block in Keras using MultiHeadAttention, LayerNormalization, and pooling
- Train and evaluate a transformer text classifier and compare it to earlier models
You should be comfortable with Keras, embeddings, and the train/test workflow from the earlier lessons in this module. Let’s begin.
Why Move Beyond Recurrence?
A recurrent model like the BiLSTM from the previous lesson reads a sentence one token at a time, carrying a hidden state forward. That design has two costs. First, it is sequential: token five cannot be processed until tokens one through four are done, so the work within a sequence cannot be parallelized across positions, however many cores you have. Second, information from early words has to survive a long chain of updates to influence a decision made at the end of the sentence. Over long sequences that signal weakens.
Transformers, introduced by Google researchers in 2017, take a completely different approach. There is no recurrence at all. Instead, every token looks at every other token directly, in a single step, through a mechanism called self-attention. This buys two things at once:
- Parallelism. Because tokens are not processed in order, the whole sequence can be handled simultaneously, which makes transformers far faster to train on modern hardware.
- Long-range context. Any word can attend to any other word in one hop, no matter how far apart they are, so distant relationships are just as easy to capture as nearby ones.
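To make the contrast concrete, here is a minimal NumPy sketch, not taken from the lesson's model. The weight matrices `W_x` and `W_h`, the toy sizes, and the random token vectors are all illustrative assumptions. The recurrent loop has to run one step at a time because each hidden state depends on the previous one, while the pairwise comparisons that attention relies on come out of a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 6, 4
tokens = rng.normal(size=(seq_len, dim))  # toy token vectors (hypothetical data)

# Recurrent style: each hidden state depends on the previous one,
# so the loop has to run step by step.
W_x = rng.normal(size=(dim, dim)) * 0.5
W_h = rng.normal(size=(dim, dim)) * 0.5
h = np.zeros(dim)
hidden_states = []
for x_t in tokens:
    h = np.tanh(x_t @ W_x + h @ W_h)
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)

# Attention style: every token is compared with every other token
# in one matrix product, with no dependence between positions.
pairwise_scores = tokens @ tokens.T

print(hidden_states.shape)    # (6, 4)
print(pairwise_scores.shape)  # (6, 6)
```

The loop cannot be reordered or split across positions, whereas the single matrix product is exactly the kind of operation GPUs handle well.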
Where the name shows up
The “T” in GPT stands for Transformer (Generative Pretrained Transformer), and BERT is built from transformer encoder blocks. The same idea you will build at small scale in this lesson is the foundation of the largest language models in use today.
Self-Attention: The Core Idea
The heart of a transformer is self-attention. The intuition is simple: when the model builds a representation for one word, it should be allowed to mix in information from the other words that matter most for understanding it.
Consider the tweet fragment “the fire spread through the building”. To understand the word fire in this context, it helps a great deal to look at building and spread. Self-attention lets the representation of fire pull in information from those words while largely ignoring the filler words like the. Every word does this for every other word, all at once.
Queries, Keys, and Values
How does the model decide which words to attend to? It gives each token three learned vectors, all derived from the token’s embedding:
- A query ($Q$): what this token is looking for.
- A key ($K$): what this token offers to others.
- A value ($V$): the actual information this token will pass along.
To compute attention for one token, you compare its query against every token’s key with a dot product. A large dot product means “these two are relevant to each other,” so that token’s value should contribute more. The scores are scaled and turned into weights that sum to one with a softmax, then used to take a weighted average of the value vectors.
Written compactly for the whole sequence at once, with query, key, and value matrices $Q$, $K$, and $V$:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
Here $d_k$ is the dimension of the key vectors, and dividing by $\sqrt{d_k}$ keeps the dot products from growing too large and pushing the softmax into a region where gradients vanish. The output is a new vector for each token that is a context-aware blend of the whole sequence.
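The computation described above (compare queries with keys, scale, apply a softmax, then average the values) fits in a few lines of NumPy. This is a minimal sketch rather than what Keras does internally. The projection matrices `W_q`, `W_k`, `W_v` and the toy sizes are assumptions chosen for illustration, and the weights are random instead of learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) relevance scores
    weights = softmax(scores, axis=-1)  # each row sums to one
    return weights @ V, weights         # weighted average of the value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4
X = rng.normal(size=(seq_len, d_model))  # toy token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape)   # (6, 4)
print(weights.shape)  # (6, 6)
```

Row `i` of `weights` shows how much token `i` draws from every token in the sequence, and row `i` of `output` is the resulting blend of value vectors.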
A familiar analogy
Think of a library lookup. Your query is the topic you want. Each book’s key is its label on the spine, and its value is the actual content inside. You compare your query to every key, then read most heavily from the books whose labels match best. Self-attention does exactly this, but with soft, weighted matches instead of a single hit.
Multi-Head Attention
A single attention computation captures one kind of relationship. But words relate in many ways at once: one pattern might track which noun a verb acts on, another might track sentiment, another might track which words negate each other.
Multi-head attention runs several attention computations in parallel, each with its own learned query, key, and value projections. Each “head” is free to focus on a different kind of relationship. Their outputs are concatenated and projected back down to the original size. The result is a richer representation than any single head could produce, and because all heads are computed together as batched matrix operations, the extra training time is usually small.
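The split, attend, concatenate, and project steps can be sketched in NumPy as follows. This sketch assumes the common design in which the model dimension is divided evenly across heads; the Keras model later in this lesson instead gives every head the full `embed_dim` through `key_dim`. The function name, weight matrices, and sizes are illustrative, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len = X.shape[0]
    head_dim = W_q.shape[1] // num_heads

    def split_heads(M):
        # (seq_len, num_heads * head_dim) -> (num_heads, seq_len, head_dim)
        return M.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (heads, seq, head_dim)
    # Concatenate the heads back into one vector per token, then project
    concat = heads.transpose(1, 0, 2).reshape(seq_len, num_heads * head_dim)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)      # (5, 8)
print(weights.shape)  # (2, 5, 5)
```

Each head produces its own 5 by 5 attention pattern, and the output keeps the original shape, so the layer can be dropped into a residual connection.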
Positional Information
Self-attention has one surprising blind spot: by itself it treats the input as an unordered set. Shuffle the tokens and each token still receives exactly the same output vector, just in the shuffled order, because the attention formula only compares pairs of tokens and never looks at where they sit. But “disaster averted” and “averted disaster” mean different things.
To fix this, transformers add positional information to each token’s embedding before attention runs. A common approach uses fixed sinusoidal patterns; another, which you will use here, simply learns a position embedding the same way you learn word embeddings. Either way, the model now knows not just which words appear but where they sit in the sequence.
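You can check the blind spot numerically. The sketch below uses the simplest possible self-attention, where queries, keys, and values are all the raw token vectors, and random vectors stand in for learned position embeddings; both are simplifying assumptions. Without positions, shuffling the tokens only shuffles the output rows, so a pooled summary is unchanged. With positions added, the pooled summary depends on the order.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Simplest case: queries, keys, and values are the token vectors themselves
    scores = X @ X.T / np.sqrt(X.shape[-1])
    return softmax(scores) @ X

rng = np.random.default_rng(0)
seq_len, dim = 5, 4
tokens = rng.normal(size=(seq_len, dim))
positions = rng.normal(size=(seq_len, dim))  # stand-in for learned position vectors
perm = np.array([4, 2, 0, 3, 1])             # a fixed shuffle of the token order

# Without positions: shuffling the input only shuffles the output rows
out = self_attention(tokens)
out_shuffled = self_attention(tokens[perm])
print(np.allclose(out[perm], out_shuffled))                      # True
print(np.allclose(out.mean(axis=0), out_shuffled.mean(axis=0)))  # True

# With positions: slot i always receives positions[i], so order now matters
out_pos = self_attention(tokens + positions)
out_pos_shuffled = self_attention(tokens[perm] + positions)
print(np.allclose(out_pos.mean(axis=0), out_pos_shuffled.mean(axis=0)))  # False
```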
Loading the Disaster Tweets
You will train your transformer on the same dataset you used for the earlier models in this module: real tweets labeled as describing a genuine disaster (1) or not (0). Reusing the dataset means the transformer’s score is directly comparable to the dense and BiLSTM models you already built.
import pandas as pd
# download: https://datatweets.com/datasets/disaster_tweets.csv
df = pd.read_csv("disaster_tweets.csv")
print("Shape:", df.shape)
print(df["target"].value_counts().to_dict())
print("disaster rate:", round(df["target"].mean(), 2))
# Output:
# Shape: (7613, 2)
# {0: 4342, 1: 3271}
# disaster rate: 0.43
The dataset has 7,613 tweets across two columns, a text column and a target label. About 43 percent describe a real disaster, so the classes are reasonably balanced. The tweets are short, averaging about 15 words, which matters for the result you will see later.
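One useful reference point before training anything is the accuracy of always predicting the more common class. The short calculation below uses only the class counts printed above. It is computed on the full dataset, so treat it as a rough yardstick rather than the exact baseline for the later test split.

```python
# Class counts reported in the output above
counts = {0: 4342, 1: 3271}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(total)                        # 7613
print(majority_class)               # 0
print(round(baseline_accuracy, 3))  # 0.57
```

Any model worth keeping should clear roughly 0.57 accuracy by a comfortable margin.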
word_counts = df["text"].str.split().str.len()
print("avg words:", round(word_counts.mean(), 1), "max:", word_counts.max())
# Output:
# avg words: 14.9 max: 31
Preparing Text for the Model
Like the earlier models, the transformer needs integer sequences, not raw strings. You use a Keras TextVectorization layer to map each tweet to a fixed-length sequence of token IDs, then split into training and test sets.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
# Split first so the vectorizer only ever learns from training text
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df["text"].values, df["target"].values,
    test_size=0.2, random_state=42, stratify=df["target"].values,
)
max_tokens = 10000 # vocabulary size
seq_len = 32 # tweets are short; 32 tokens covers the longest
vectorizer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_sequence_length=seq_len,
)
vectorizer.adapt(X_train_text) # learn the vocabulary from TRAIN only
X_train = vectorizer(X_train_text)
X_test = vectorizer(X_test_text)
print("X_train shape:", X_train.shape)
# Output:
# X_train shape: (6090, 32)
Every tweet is now a length-32 row of integers, padded or truncated as needed. The vectorizer is adapted on the training text only, so no information about the test set leaks in.
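If the mapping from text to integers feels opaque, the pure-Python sketch below mimics what the layer does with its default settings: lowercase, strip punctuation, split on whitespace, look up each word, then pad or truncate. It is an illustration, not the Keras implementation. The helper names and the two sample sentences are made up, and the alphabetical tie-breaking is an assumption. The reserved IDs match the layer's convention of 0 for padding and 1 for out-of-vocabulary words.

```python
import string

def standardize(text):
    # Lowercase and strip punctuation, mirroring the layer's default behavior
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(train_texts, max_tokens):
    counts = {}
    for text in train_texts:
        for word in standardize(text).split():
            counts[word] = counts.get(word, 0) + 1
    # Most frequent words first; ties broken alphabetically (an assumption)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    # IDs 0 and 1 are reserved for padding and unknown words
    return {word: i + 2 for i, word in enumerate(ranked[: max_tokens - 2])}

def vectorize(text, vocab, seq_len):
    ids = [vocab.get(word, 1) for word in standardize(text).split()]
    ids = ids[:seq_len]                      # truncate long inputs
    return ids + [0] * (seq_len - len(ids))  # pad short inputs with zeros

train_texts = ["Fire in the building!", "The fire spread fast"]  # hypothetical
vocab = build_vocab(train_texts, max_tokens=10)
print(vectorize("Fire near the bridge", vocab, seq_len=6))
# [2, 1, 3, 1, 0, 0]
```

The words "near" and "bridge" never appeared in the training text, so both map to the unknown ID 1. Test-set words are treated the same way when the vocabulary comes from the training split only.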
Adapt on training data only
Call vectorizer.adapt() on the training text before you touch the test set. Building the vocabulary from the full dataset would let test-set words influence training and make your final score look better than it really is. The same discipline applied to scaling and splitting in earlier lessons applies here.
Building a Transformer Block in Keras
You now have everything you need to assemble a small transformer. Keras ships the key pieces as layers, so you can build a working block in a few lines:
- An embedding layer for the tokens, plus a second embedding for positions.
- A MultiHeadAttention layer that performs self-attention.
- LayerNormalization and a residual (skip) connection to stabilize training.
- A small feed-forward network applied to each token.
- A pooling layer to collapse the sequence into one vector for classification.
Adding Positional Information
First, a layer that embeds both the tokens and their positions and adds them together. This is how the model learns word order.
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, seq_len, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = layers.Embedding(seq_len, embed_dim)
        self.seq_len = seq_len

    def call(self, x):
        positions = tf.range(start=0, limit=self.seq_len, delta=1)
        return self.token_emb(x) + self.pos_emb(positions)
The token embedding answers “what is this word?” and the position embedding answers “where does it sit?”. Adding them gives each input a representation that carries both pieces of information into the attention layer.
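The addition works because of broadcasting: the token lookup has one vector per token per tweet, while the position lookup has one vector per position that is shared by every tweet in the batch. The NumPy sketch below reproduces the shapes with small random tables. The table names, sizes, and token IDs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, vocab_size, embed_dim = 2, 4, 10, 3

token_table = rng.normal(size=(vocab_size, embed_dim))  # one row per word ID
pos_table = rng.normal(size=(seq_len, embed_dim))       # one row per position

token_ids = np.array([[5, 2, 7, 0],
                      [2, 5, 0, 0]])                    # two toy tweets

token_vecs = token_table[token_ids]       # (batch, seq_len, embed_dim)
pos_vecs = pos_table[np.arange(seq_len)]  # (seq_len, embed_dim)
combined = token_vecs + pos_vecs          # position rows broadcast over the batch

print(combined.shape)  # (2, 4, 3)
# Word ID 5 appears at position 0 in the first tweet and position 1 in the second
print(np.allclose(combined[0, 0], combined[1, 1]))  # False
```

The same word ends up with a different vector depending on where it sits, which is the information attention cannot recover on its own.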
The Transformer Block
Next, the block itself: self-attention followed by a feed-forward network, each wrapped in a residual connection and layer normalization. The residual connections let gradients flow cleanly, and normalization keeps activations well-scaled.
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.ffn = keras.Sequential([
            layers.Dense(ff_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)
        self.drop1 = layers.Dropout(dropout)
        self.drop2 = layers.Dropout(dropout)

    def call(self, x, training=False):
        # Self-attention sublayer with a residual connection
        attn = self.att(x, x) # query=key=value=x
        attn = self.drop1(attn, training=training)
        x = self.norm1(x + attn)
        # Feed-forward sublayer with a residual connection
        ffn = self.ffn(x)
        ffn = self.drop2(ffn, training=training)
        return self.norm2(x + ffn)
Notice the call self.att(x, x). Passing the same tensor as both query and value (Keras reuses the value as the key when no key is given) is exactly what makes this self-attention: every token attends to every other token in the same sequence.
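The "add, then normalize" step that follows each sublayer is easy to inspect on its own. The sketch below is a simplified NumPy version of layer normalization that leaves out the learned scale and shift, so it is an approximation of the Keras layer at initialization rather than a replacement for it. The input arrays are random stand-ins.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    # Normalize each token vector across its feature dimension.
    # The learned scale and shift are omitted to keep the sketch small.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(4, 8))  # 4 tokens, 8 features, badly scaled
sublayer_out = rng.normal(size=(4, 8))           # stand-in for the attention output

y = layer_norm(x + sublayer_out)  # residual add, then normalize

print(np.allclose(y.mean(axis=-1), 0.0, atol=1e-6))  # True
print(np.allclose(y.std(axis=-1), 1.0, atol=1e-3))   # True
```

Whatever scale the inputs arrive at, every token vector leaves with mean near zero and standard deviation near one, which is what keeps activations well-scaled from block to block.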
Assembling the Full Model
Finally, wire the pieces together. The transformer block produces one context-aware vector per token; you pool those into a single vector with global average pooling, then send it through a small classification head.
embed_dim = 32 # size of each token vector
num_heads = 2 # number of attention heads
ff_dim = 32 # hidden size of the feed-forward network
inputs = keras.Input(shape=(seq_len,))
x = TokenAndPositionEmbedding(seq_len, max_tokens, embed_dim)(inputs)
x = TransformerBlock(embed_dim, num_heads, ff_dim)(x)
x = layers.GlobalAveragePooling1D()(x) # collapse the sequence
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", keras.metrics.AUC(name="auc")],
)
model.summary()
The output layer is a single sigmoid unit because this is binary classification: the probability that a tweet describes a real disaster.
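It can help to estimate by hand where the parameters live before reading the summary. The count below is a back-of-the-envelope sketch using the hyperparameters from this lesson. It assumes every projection and dense layer carries a bias and that each attention head has size `key_dim`, so check it against the actual `model.summary()` output rather than taking it as authoritative.

```python
max_tokens, seq_len = 10000, 32
embed_dim, num_heads, ff_dim = 32, 2, 32
key_dim = embed_dim  # per-head size passed to MultiHeadAttention in the lesson

def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights plus biases

params = {
    "token embedding": max_tokens * embed_dim,
    "position embedding": seq_len * embed_dim,
    # Query, key, and value projections: embed_dim -> num_heads * key_dim each
    "attention q/k/v": 3 * dense_params(embed_dim, num_heads * key_dim),
    # Output projection: num_heads * key_dim -> embed_dim
    "attention output": dense_params(num_heads * key_dim, embed_dim),
    "feed-forward": dense_params(embed_dim, ff_dim) + dense_params(ff_dim, embed_dim),
    "layer norms": 2 * (2 * embed_dim),  # scale and shift for each of the two norms
    "classifier head": dense_params(embed_dim, 20) + dense_params(20, 1),
}

for name, n in params.items():
    print(f"{name:>20}: {n:,}")

total = sum(params.values())
print("total:", total)
print("token embedding share:", round(params["token embedding"] / total, 3))
```

Under these assumptions the token embedding table accounts for roughly 96 percent of the weights, so the attention block itself is a small part of this model.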
Training and Evaluating
With the model compiled, training looks exactly like the earlier Keras models in this module. You fit on the training sequences and hold out the test set for a final, honest evaluation.
history = model.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=10,
    batch_size=32,
    verbose=2,
)
test_loss, test_acc, test_auc = model.evaluate(X_test, y_test, verbose=0)
print(f"transformer (self-attention) test acc={test_acc:.3f} AUC={test_auc:.3f}")
# Output:
# transformer (self-attention) test acc=0.753 AUC=0.821
The transformer reaches a test accuracy of 0.753 and an AUC of 0.821. That is a solid result, and it is right in line with what you have seen so far in this module.
How It Compares
Putting all three models from this module side by side tells an instructive story:
| Model | Test accuracy | AUC |
|---|---|---|
| Dense (embedding + pooling) | 0.710 | 0.850 |
| BiLSTM | 0.751 | 0.825 |
| Transformer (self-attention) | 0.753 | 0.821 |
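The transformer edges out the BiLSTM on accuracy (0.753 versus 0.751) and lands very close on AUC (0.821 versus 0.825). The dense baseline has the lowest accuracy of the three but the highest AUC in the table, a reminder that the two metrics measure different things. On this particular task the gain is modest, and that is worth understanding rather than glossing over.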
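In the table above, the dense model has the lowest accuracy but the highest AUC, which can look contradictory. Accuracy depends on where scores fall relative to the 0.5 threshold, while AUC depends only on how well positives are ranked above negatives. The pure-Python sketch below shows the two metrics disagreeing on made-up scores. The labels and scores are hypothetical, and the pairwise `auc` function is an exact ranking definition, whereas the Keras AUC metric uses a threshold-based approximation.

```python
def accuracy(labels, scores, threshold=0.5):
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(labels, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count half
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1, 0, 1]
# Model A ranks perfectly but every score sits below the 0.5 threshold
scores_a = [0.10, 0.20, 0.30, 0.40, 0.15, 0.45]
# Model B is better calibrated around 0.5 but makes one ranking mistake
scores_b = [0.40, 0.60, 0.70, 0.80, 0.30, 0.55]

print(round(accuracy(labels, scores_a), 3), round(auc(labels, scores_a), 3))  # 0.5 1.0
print(round(accuracy(labels, scores_b), 3), round(auc(labels, scores_b), 3))  # 0.833 0.889
```

Model A wins on AUC and loses on accuracy, the same pattern the dense baseline shows against the other two models.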
Why the transformer does not run away with it here
The big advantages of transformers, parallelism and long-range context, matter most when sequences are long and training data is plentiful. These tweets average about 15 words, so there is little long-range context for attention to exploit, and the dataset is small. On short text with limited data, a transformer performs comparably to a BiLSTM. With longer inputs such as paragraphs and documents, and with more training data, transformers typically pull ahead.
Pretrained Transformers and BERT
The transformer you built learns everything from scratch on a few thousand tweets. The real power of the architecture shows up when you start from a model that has already been trained on enormous amounts of text.
BERT (Bidirectional Encoder Representations from Transformers) is a stack of transformer encoder blocks pretrained on billions of words. Because it has already learned rich, general-purpose representations of language, you can take that pretrained model and fine-tune it on a small task-specific dataset, often reaching far higher accuracy than a from-scratch model could. Lighter variants like DistilBERT keep most of that capability in a smaller, faster package.
The pattern is the same one you have used all along: build a representation, attach a small classification head, and train. The difference is that BERT brings a head start of general language understanding that no model trained on 6,000 tweets could match. You will not fine-tune BERT here, but it is the natural next rung on the ladder once you understand the transformer block underneath it.
Practice Exercises
Try these before checking the hints.
Exercise 1: Change the Number of Heads
The model used num_heads=2. Rebuild and retrain the model with num_heads=4, keeping everything else the same, and compare the test accuracy. Does giving the model more attention heads help on this short-text task?
# Your code here: change num_heads, rebuild the Model, refit, then evaluate
Hint
Set num_heads = 4 before constructing the TransformerBlock, then rebuild the full keras.Model, recompile, and call .fit() and .evaluate() exactly as in the lesson. Expect a result close to the lesson’s 0.753, because more heads add capacity but cannot manufacture long-range context that short tweets do not contain.
Exercise 2: Swap the Pooling Layer
The lesson used GlobalAveragePooling1D to collapse the token sequence. Replace it with GlobalMaxPooling1D, which keeps the strongest signal per dimension instead of the average, and compare the test accuracy.
# Your code here: replace the pooling layer, rebuild, refit, and evaluate
Hint
Change the pooling line to x = layers.GlobalMaxPooling1D()(x) and keep the rest of the architecture identical. Average pooling smooths over all tokens while max pooling latches onto the single most disaster-like token; comparing them shows how the readout choice affects performance.
Exercise 3: Remove the Position Embedding
Self-attention alone ignores word order. Build a version of the model that uses only the token embedding (drop the pos_emb addition) and see whether removing positional information hurts the score.
# Your code here: use only token_emb in TokenAndPositionEmbedding, then retrain
Hint
In TokenAndPositionEmbedding.call, return just self.token_emb(x) instead of adding the position embedding. Because these tweets are short and bag-of-words-like, the drop may be small, but the experiment makes concrete why positional information is part of every real transformer.
Summary
You built and trained your first transformer, going from the self-attention idea all the way to a working text classifier. Let’s review what you learned.
Key Concepts
The Transformer Idea
- Transformers replace recurrence with self-attention, so every token attends to every other token directly
- This gives parallelism (the whole sequence is processed at once) and easy long-range context
Self-Attention
- Each token produces a query, a key, and a value
- Attention weights come from comparing queries to keys with a scaled dot product and a softmax
- The output is a context-aware blend of value vectors: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(Q K^{\top} / \sqrt{d_k}\right) V$
Multi-Head Attention and Position
- Multi-head attention runs several attention computations in parallel to capture different relationships
- Self-attention ignores order, so a positional embedding is added to give the model word position
Building in Keras
- A transformer block combines MultiHeadAttention, residual connections, LayerNormalization, and a feed-forward network
- self.att(x, x) makes it self-attention; pooling collapses the token sequence for classification
Results
- The transformer reached test accuracy 0.753 and AUC 0.821, comparable to the BiLSTM here
- On short tweets with little data the gain is modest; transformers dominate at scale
- Pretrained transformers like BERT bring general language understanding you can fine-tune
Why This Matters
The transformer block you assembled is the same building unit that powers BERT, GPT, and nearly every state-of-the-art language model. Understanding self-attention, queries and keys and values, multi-head attention, and positional information is what separates someone who can only call an API from someone who understands what is happening inside. The modest gain on these short tweets is also a useful lesson in judgment: the best architecture depends on the data, and a transformer is not automatically the right tool for every problem. As your sequences grow longer and your datasets larger, the advantages you read about here turn into decisive wins.
Next Steps
You now understand the transformer architecture and have trained one end to end. In the next lesson you will pull everything from this module together in a guided project, building and comparing models on the disaster tweets dataset from start to finish.
Continue to Lesson 6 - Guided Project: Classifying Disaster Tweets
Apply everything you have learned in an end-to-end project on the disaster tweets dataset.
Keep Building Your Skills
You have just built the architecture that defines modern NLP. Self-attention may feel abstract the first time through, so revisit the query, key, and value picture until it clicks; that single idea unlocks the entire transformer family. Carry this forward into the guided project, where you will put your dense, recurrent, and transformer models head to head and decide, with real numbers, which one earns a place in production.