Lesson 5 - Building Text Models with Transformers

Welcome to Transformers

Over the last few lessons you moved from a simple embedding-and-pooling model to a recurrent BiLSTM that reads a tweet word by word. In this lesson you meet the architecture behind modern language models, from BERT to GPT: the transformer. You will learn what self-attention is and why it lets a model see long-range context, then build a small transformer block in Keras and evaluate it on the same disaster tweets you have used throughout this module.

By the end of this lesson, you will be able to:

Explain the core transformer idea and why it removes the need for recurrence
Describe self-attention in terms of queries, keys, and values
Explain what multi-head attention adds and why positional information is needed
Build a transformer block in Keras using MultiHeadAttention, LayerNormalization, and pooling
Train and evaluate a transformer text classifier and compare it to earlier models

You should be comfortable with Keras, embeddings, and the train/test workflow from the earlier lessons in this module. Let’s begin.

Why Move Beyond Recurrence?

A recurrent model like the BiLSTM from the previous lesson reads a sentence one token at a time, carrying a hidden state forward. That design has two costs. First, it is sequential: token five cannot be processed until tokens one through four are done, so the work cannot be spread across many cores at once. Second, information from early words has to survive a long chain of updates to influence a decision made at the end of the sentence. Over long sequences that signal weakens.

Transformers, introduced by Google researchers in 2017, take a completely different approach. There is no recurrence at all. Instead, every token looks at every other token directly, in a single step, through a mechanism called self-attention. This buys two things at once:

Parallelism. Because no token has to wait for the one before it, the whole sequence can be handled simultaneously, which makes transformers far faster to train on modern hardware.
Long-range context. Any word can attend to any other word in one hop, no matter how far apart they are, so distant relationships are just as easy to capture as nearby ones.

Where the name shows up

The “T” in GPT stands for Transformer (Generative Pretrained Transformer), and BERT is built from transformer encoder blocks. The same idea you will build at small scale in this lesson is the foundation of the largest language models in use today.

Self-Attention: The Core Idea

The heart of a transformer is self-attention. The intuition is simple: when the model builds a representation for one word, it should be allowed to mix in information from the other words that matter most for understanding it.

Consider the tweet fragment “the fire spread through the building”. To understand the word fire in this context, it helps a great deal to look at building and spread. Self-attention lets the representation of fire pull in information from those words while largely ignoring the filler words like the. Every word does this for every other word, all at once.

Figure: In self-attention, each word weighs every other word; here "fire" attends strongly to "building" to capture the disaster context.

Queries, Keys, and Values

How does the model decide which words to attend to? It gives each token three learned vectors, all derived from the token’s embedding:

A query ( $q$ ): what this token is looking for.
A key ( $k$ ): what this token offers to others.
A value ( $v$ ): the actual information this token will pass along.

To compute attention for one token, you compare its query against every token’s key with a dot product. A large dot product means “these two are relevant to each other,” so that token’s value should contribute more. The scores are scaled and turned into weights that sum to one with a softmax, then used to take a weighted average of the value vectors.

Written compactly for the whole sequence at once, with query, key, and value matrices $Q$ , $K$ , and $V$ :

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

Here $d_k$ is the dimension of the key vectors, and dividing by $\sqrt{d_k}$ keeps the dot products from growing too large and pushing the softmax into a region where gradients vanish. The output is a new vector for each token that is a context-aware blend of the whole sequence.
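The formula can be traced by hand on a tiny example. The sketch below uses only the Python standard library so each step stays visible; the query, key, and value vectors are made-up numbers rather than learned projections, and the helper names `softmax` and `attention` are chosen here for illustration.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to one."""
    top = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Compare this token's query with every key, then scale by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three tokens with hand-picked 2-dimensional queries, keys, and values (assumed data)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])   # one row of attention weights per token
```

Each printed row sums to one, and the first token puts equal weight on the two keys that match its query and less on the one that does not.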

A familiar analogy

Think of a library lookup. Your query is the topic you want. Each book’s key is its label on the spine, and its value is the actual content inside. You compare your query to every key, then read most heavily from the books whose labels match best. Self-attention does exactly this, but with soft, weighted matches instead of a single hit.

Multi-Head Attention

A single attention computation captures one kind of relationship. But words relate in many ways at once: one pattern might track which noun a verb acts on, another might track sentiment, another might track which words negate each other.

Multi-head attention runs several attention computations in parallel, each with its own learned query, key, and value projections. Each “head” is free to focus on a different kind of relationship. Their outputs are concatenated and projected back down to the original size. The result is a richer representation than any single head could produce. Because the heads are independent of one another they can be computed in parallel, and in the standard design each head works with smaller vectors, so the total cost stays close to that of one full-size attention computation.
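The split, attend, and concatenate pattern can be sketched in plain Python. This is a deliberate simplification: a real layer gives every head its own learned projection matrices and applies a final output projection, whereas here each head simply takes a slice of the token vector. The function names and input numbers are invented for illustration.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    """Cut every token vector into num_heads equal slices (stand-in for learned projections)."""
    head_dim = len(X[0]) // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in X] for h in range(num_heads)]

def multi_head_self_attention(X, num_heads):
    heads = split_heads(X, num_heads)
    head_outputs = [attention(H, H, H) for H in heads]   # each head attends independently
    # Concatenate the head outputs back into one vector per token
    return [sum((head_outputs[h][t] for h in range(num_heads)), []) for t in range(len(X))]

# Three tokens with made-up 4-dimensional embeddings
X = [[1.0, 0.0, 0.0, 2.0],
     [0.0, 1.0, 2.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
Y = multi_head_self_attention(X, num_heads=2)
print(len(Y), len(Y[0]))   # same number of tokens and the same vector size as the input
```

The output keeps the input's shape, which is what lets attention layers be stacked or wrapped in residual connections.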

Positional Information

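Self-attention has one surprising blind spot: by itself it has no sense of order. The attention formula above only compares pairs of tokens, so shuffling the input simply shuffles the outputs in the same way. Each token ends up with exactly the same vector wherever it sits, and an order-insensitive readout such as average pooling cannot tell the two orderings apart. But “disaster averted” and “averted disaster” mean different things.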
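This order-blindness is easy to check on a toy example. The sketch below runs self-attention with no positional information on two made-up word vectors in both orders and compares the results; the embeddings and the helper names are assumptions for illustration.

```python
import math

def self_attention(X):
    """Self-attention with Q = K = V = X and no positional information."""
    d_k = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d_k)])
    return out

def close(u, v):
    """Compare two vectors while tolerating floating-point rounding."""
    return all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(u, v))

# Made-up 2-dimensional embeddings for the two words
disaster = [1.0, 0.2]
averted = [0.1, 0.9]

out_a = self_attention([disaster, averted])   # "disaster averted"
out_b = self_attention([averted, disaster])   # "averted disaster"

print(close(out_a[0], out_b[1]))   # vector for "disaster" is the same in both orders
print(close(out_a[1], out_b[0]))   # vector for "averted" is the same in both orders
```

Both checks print `True`: the two orderings produce the same set of vectors, just in a different row order, so nothing downstream can recover which word came first.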

To fix this, transformers add positional information to each token’s embedding before attention runs. A common approach uses fixed sinusoidal patterns; another, which you will use here, simply learns a position embedding the same way you learn word embeddings. Either way, the model now knows not just which words appear but where they sit in the sequence.
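For reference, the fixed sinusoidal option can be written in a few lines. This follows the sine and cosine formulation from the original transformer paper and is not what this lesson's model uses (it learns its position vectors instead); the function name and the small `embed_dim` are chosen here for illustration.

```python
import math

def sinusoidal_position_encoding(seq_len, embed_dim):
    """Return one fixed vector per position: sines on even dimensions, cosines on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(embed_dim):
            # Dimensions 2k and 2k+1 share one frequency: 1 / 10000^(2k / embed_dim)
            angle = pos / (10000 ** ((i - i % 2) / embed_dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_position_encoding(seq_len=32, embed_dim=8)
print([round(v, 3) for v in pe[0]])   # position 0
print([round(v, 3) for v in pe[1]])   # position 1 gets a different pattern
```

Every position receives a distinct pattern of values between -1 and 1, which is added to the token embedding in the same way a learned position embedding would be.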

Loading the Disaster Tweets

You will train your transformer on the same dataset you used for the earlier models in this module: real tweets labeled as describing a genuine disaster (1) or not (0). Reusing the dataset means the transformer’s score is directly comparable to the dense and BiLSTM models you already built.

import pandas as pd

# download: https://datatweets.com/datasets/disaster_tweets.csv
df = pd.read_csv("disaster_tweets.csv")

print("Shape:", df.shape)
print(df["target"].value_counts().to_dict())
print("disaster rate:", round(df["target"].mean(), 2))
# Output:
# Shape: (7613, 2)
# {0: 4342, 1: 3271}
# disaster rate: 0.43

The dataset has 7,613 tweets across two columns, a text column and a target label. About 43 percent describe a real disaster, so the classes are reasonably balanced. The tweets are short, averaging about 15 words, which matters for the result you will see later.

word_counts = df["text"].str.split().str.len()
print("avg words:", round(word_counts.mean(), 1), "max:", word_counts.max())
# Output:
# avg words: 14.9 max: 31

Preparing Text for the Model

Like the earlier models, the transformer needs integer sequences, not raw strings. You use a Keras TextVectorization layer to map each tweet to a fixed-length sequence of token IDs, then split into training and test sets.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

# Split first so the vectorizer only ever learns from training text
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df["text"].values, df["target"].values,
    test_size=0.2, random_state=42, stratify=df["target"].values,
)

max_tokens = 10000   # vocabulary size
seq_len = 32         # tweets are short; 32 tokens covers the longest

vectorizer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_sequence_length=seq_len,
)
vectorizer.adapt(X_train_text)   # learn the vocabulary from TRAIN only

X_train = vectorizer(X_train_text)
X_test = vectorizer(X_test_text)

print("X_train shape:", X_train.shape)
# Output:
# X_train shape: (6090, 32)

Every tweet is now a length-32 row of integers, padded or truncated as needed. The vectorizer is adapted on the training text only, so no information about the test set leaks in.
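To see what "padded or truncated" means in practice, here is a rough standard-library stand-in for the vectorizer. It is a sketch, not the Keras implementation: it assumes lowercasing, punctuation stripping, whitespace splitting, ID 0 for padding, and ID 1 for unknown words, and the two training sentences are invented.

```python
import string
from collections import Counter

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocab(texts, max_tokens):
    """Map the most frequent words to integer IDs, learned from the given texts only."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    # IDs 0 and 1 are reserved: 0 = padding, 1 = unknown word
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}

def vectorize(text, vocab, seq_len):
    """Turn one string into exactly seq_len integer IDs."""
    ids = [vocab.get(tok, 1) for tok in tokenize(text)][:seq_len]   # truncate long inputs
    return ids + [0] * (seq_len - len(ids))                          # pad short inputs

train_texts = ["Fire spread through the building!", "The fire is out."]   # made-up examples
vocab = build_vocab(train_texts, max_tokens=10)

# "near" and "bridge" never appeared in the training text, so they map to 1
print(vectorize("Fire near the bridge", vocab, seq_len=6))
```

Words that were not in the training text all collapse to the unknown ID, which is exactly what happens to unseen test-set words when the vocabulary is built from training data alone.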

Adapt on training data only

Call vectorizer.adapt() on the training text before you touch the test set. Building the vocabulary from the full dataset would let test-set words influence training and make your final score look better than it really is. The same discipline applied to scaling and splitting in earlier lessons applies here.

Building a Transformer Block in Keras

You now have everything you need to assemble a small transformer. Keras ships the key pieces as layers, so you can build a working block in a few lines:

An embedding layer for the tokens, plus a second embedding for positions.
A MultiHeadAttention layer that performs self-attention.
LayerNormalization and a residual (skip) connection to stabilize training.
A small feed-forward network applied to each token.
A pooling layer to collapse the sequence into one vector for classification.

Adding Positional Information

First, a layer that embeds both the tokens and their positions and adds them together. This is how the model learns word order.

class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, seq_len, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = layers.Embedding(seq_len, embed_dim)
        self.seq_len = seq_len

    def call(self, x):
        positions = tf.range(start=0, limit=self.seq_len, delta=1)
        return self.token_emb(x) + self.pos_emb(positions)

The token embedding answers “what is this word?” and the position embedding answers “where does it sit?”. Adding them gives each input a representation that carries both pieces of information into the attention layer.
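A tiny hand-worked version shows the effect of the addition. The lookup tables below are made-up numbers standing in for the two learned Embedding layers, and the token IDs 7 and 9 are arbitrary.

```python
# Made-up lookup tables standing in for the two learned Embedding layers
token_emb = {7: [0.5, -0.2, 0.1], 9: [0.0, 0.3, 0.4]}                     # token ID -> vector
pos_emb = [[0.01, 0.02, 0.03], [0.10, 0.20, 0.30], [-0.10, 0.00, 0.10]]   # position -> vector

def embed(token_ids):
    """Add each token's vector to the vector for the position it occupies."""
    return [
        [t + p for t, p in zip(token_emb[tok], pos_emb[pos])]
        for pos, tok in enumerate(token_ids)
    ]

a = embed([7, 9, 7])
print(a[0])   # token 7 at position 0
print(a[2])   # token 7 at position 2: same word, different input vector
```

Because the same word now produces a different vector at each position, the attention layer that follows can distinguish one ordering of the tokens from another.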

The Transformer Block

Next, the block itself: self-attention followed by a feed-forward network, each wrapped in a residual connection and layer normalization. The residual connections let gradients flow cleanly, and normalization keeps activations well-scaled.

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim   # key_dim is the size of EACH head
        )
        self.ffn = keras.Sequential([
            layers.Dense(ff_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)
        self.drop1 = layers.Dropout(dropout)
        self.drop2 = layers.Dropout(dropout)

    def call(self, x, training=False):
        # Self-attention sublayer with a residual connection
        attn = self.att(x, x)                       # query=key=value=x
        attn = self.drop1(attn, training=training)
        x = self.norm1(x + attn)
        # Feed-forward sublayer with a residual connection
        ffn = self.ffn(x)
        ffn = self.drop2(ffn, training=training)
        return self.norm2(x + ffn)

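Notice the call self.att(x, x). The two arguments are the query and the value, and Keras reuses the value as the key when no key is given. Passing the same tensor for all three roles is exactly what makes this self-attention: every token attends to every other token in the same sequence.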
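The "add, then normalize" step that wraps each sublayer can be sketched for a single token vector. This version leaves out the learned scale and offset that a LayerNormalization layer also carries, and the input numbers are invented.

```python
import math

def layer_norm(vec, epsilon=1e-6):
    """Normalize one token vector to zero mean and unit variance across its features."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + epsilon) for v in vec]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization, for one token."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [2.0, -1.0, 0.5, 4.0]        # made-up token vector entering the block
attn = [0.3, 0.1, -0.2, 0.0]     # made-up output of the attention sublayer
y = add_and_norm(x, attn)
print([round(v, 3) for v in y])
```

The residual sum keeps the original signal available to later layers, and the normalization rescales whatever comes out so that each token vector has a consistent spread of values.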

Assembling the Full Model

Finally, wire the pieces together. The transformer block produces one context-aware vector per token; you pool those into a single vector with global average pooling, then send it through a small classification head.

embed_dim = 32    # size of each token vector
num_heads = 2     # number of attention heads
ff_dim = 32       # hidden size of the feed-forward network

inputs = keras.Input(shape=(seq_len,))
x = TokenAndPositionEmbedding(seq_len, max_tokens, embed_dim)(inputs)
x = TransformerBlock(embed_dim, num_heads, ff_dim)(x)
x = layers.GlobalAveragePooling1D()(x)   # collapse the sequence
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", keras.metrics.AUC(name="auc")],
)
model.summary()

The output layer is a single sigmoid unit because this is binary classification: the probability that a tweet describes a real disaster.
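A back-of-the-envelope parameter count helps when reading the summary. The sketch below assumes every projection inside the attention layer is a dense layer with a bias and that each normalization layer carries one scale and one offset per feature; treat model.summary() as the authority if the numbers differ.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

max_tokens, seq_len, embed_dim = 10000, 32, 32
num_heads, key_dim, ff_dim = 2, 32, 32     # key_dim=embed_dim, as in TransformerBlock

token_embedding = max_tokens * embed_dim
position_embedding = seq_len * embed_dim
# Query, key, and value projections each map embed_dim -> num_heads * key_dim;
# the output projection maps back to embed_dim (assumed layout, biases included)
attention = (3 * dense_params(embed_dim, num_heads * key_dim)
             + dense_params(num_heads * key_dim, embed_dim))
feed_forward = dense_params(embed_dim, ff_dim) + dense_params(ff_dim, embed_dim)
layer_norms = 2 * (2 * embed_dim)          # a scale and an offset per feature, two layers
head = dense_params(embed_dim, 20) + dense_params(20, 1)

total = (token_embedding + position_embedding + attention
         + feed_forward + layer_norms + head)
print("attention parameters:", attention)
print("total parameters:", total)
print("token embedding share:", round(token_embedding / total, 3))
```

Under these assumptions the token embedding table holds the large majority of the weights, while the attention and feed-forward layers are comparatively small.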

Training and Evaluating

With the model compiled, training looks exactly like the earlier Keras models in this module. You fit on the training sequences and hold out the test set for a final, honest evaluation.

history = model.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=10,
    batch_size=32,
    verbose=2,
)

test_loss, test_acc, test_auc = model.evaluate(X_test, y_test, verbose=0)
print(f"transformer (self-attention) test acc={test_acc:.3f} AUC={test_auc:.3f}")
# Output:
# transformer (self-attention) test acc=0.753 AUC=0.821

The transformer reaches a test accuracy of 0.753 and an AUC of 0.821. That is a solid result, in line with the earlier models in this module.
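The two reported metrics answer different questions. Accuracy counts correct decisions at a fixed threshold, while AUC asks how often a disaster tweet is scored above a non-disaster tweet. The sketch below computes both on invented labels and probabilities; it uses the exact pairwise definition of AUC, whereas Keras estimates AUC over a grid of thresholds, so its value can differ slightly.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Share of examples classified correctly when probabilities are cut at the threshold."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    return sum(int(p == t) for p, t in zip(preds, y_true)) / len(y_true)

def pairwise_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [p for p, t in zip(y_prob, y_true) if t == 1]
    neg = [p for p, t in zip(y_prob, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for six tweets
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.45, 0.40, 0.35, 0.30, 0.20, 0.10]

print("accuracy:", accuracy(y_true, y_prob))
print("AUC:", pairwise_auc(y_true, y_prob))
```

In this toy case every disaster tweet is ranked above every non-disaster tweet, so the AUC is perfect, yet no probability crosses 0.5 and accuracy is only 50 percent. A model can therefore score well on one metric and poorly on the other.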

How It Compares

Putting all three models from this module side by side tells an instructive story:

Model	Test accuracy	AUC
Dense (embedding + pooling)	0.710	0.850
BiLSTM	0.751	0.825
Transformer (self-attention)	0.753	0.821

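The transformer edges out the BiLSTM on accuracy (0.753 versus 0.751) and lands just below it on AUC (0.821 versus 0.825). The dense baseline has the lowest accuracy of the three yet the highest AUC in this table, a reminder that the two metrics measure different things: accuracy depends on the 0.5 decision threshold, while AUC measures how well the model ranks disaster tweets above the rest. On this particular task the transformer’s gain is modest, and that is worth understanding rather than glossing over.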

Why the transformer does not run away with it here

The big advantages of transformers, parallelism and long-range context, matter most when sequences are long and training data is plentiful. These tweets average about 15 words, so there is little long-range context for attention to exploit, and the dataset is small. On short text with limited data, a transformer performs comparably to a BiLSTM. With longer inputs such as paragraphs and documents, and with more training data, transformers typically pull ahead.

Pretrained Transformers and BERT

The transformer you built learns everything from scratch on a few thousand tweets. The real power of the architecture shows up when you start from a model that has already been trained on enormous amounts of text.

BERT (Bidirectional Encoder Representations from Transformers) is a stack of transformer encoder blocks pretrained on billions of words. Because it has already learned rich, general-purpose representations of language, you can take that pretrained model and fine-tune it on a small task-specific dataset, often reaching far higher accuracy than a from-scratch model could. Lighter variants like DistilBERT keep most of that capability in a smaller, faster package.

The pattern is the same one you have used all along: build a representation, attach a small classification head, and train. The difference is that BERT brings a head start of general language understanding that no model trained on 6,000 tweets could match. You will not fine-tune BERT here, but it is the natural next rung on the ladder once you understand the transformer block underneath it.

Practice Exercises

Try these before checking the hints.

Exercise 1: Change the Number of Heads

The model used num_heads=2. Rebuild and retrain the model with num_heads=4, keeping everything else the same, and compare the test accuracy. Does giving the model more attention heads help on this short-text task?

# Your code here: change num_heads, rebuild the Model, refit, then evaluate

Hint

Set num_heads = 4 before constructing the TransformerBlock, then rebuild the full keras.Model, recompile, and call .fit() and .evaluate() exactly as in the lesson. Expect a result close to the lesson’s 0.753, because more heads add capacity but cannot manufacture long-range context that short tweets do not contain.

Exercise 2: Swap the Pooling Layer

The lesson used GlobalAveragePooling1D to collapse the token sequence. Replace it with GlobalMaxPooling1D, which keeps the strongest signal per dimension instead of the average, and compare the test accuracy.

# Your code here: replace the pooling layer, rebuild, refit, and evaluate

Hint

Change the pooling line to x = layers.GlobalMaxPooling1D()(x) and keep the rest of the architecture identical. Average pooling smooths over all tokens, while max pooling keeps, for each embedding dimension, the largest value found at any token, so a few strongly disaster-like tokens can dominate the result. Comparing the two shows how the readout choice affects performance.

Exercise 3: Remove the Position Embedding

Self-attention alone ignores word order. Build a version of the model that uses only the token embedding (drop the pos_emb addition) and see whether removing positional information hurts the score.

# Your code here: use only token_emb in TokenAndPositionEmbedding, then retrain

Hint

In TokenAndPositionEmbedding.call, return just self.token_emb(x) instead of adding the position embedding. Because these tweets are short and bag-of-words-like, the drop may be small, but the experiment makes concrete why positional information is part of every real transformer.

Summary

You built and trained your first transformer, going from the self-attention idea all the way to a working text classifier. Let’s review what you learned.

Key Concepts

The Transformer Idea

Transformers replace recurrence with self-attention, so every token attends to every other token directly
This gives parallelism (the whole sequence is processed at once) and easy long-range context

Self-Attention

Each token produces a query, a key, and a value
Attention weights come from comparing queries to keys with a scaled dot product and a softmax
The output is a context-aware blend of value vectors: $\text{softmax}(QK^{\top}/\sqrt{d_k})\,V$

Multi-Head Attention and Position

Multi-head attention runs several attention computations in parallel to capture different relationships
Self-attention ignores order, so a positional embedding is added to give the model word position

Building in Keras

A transformer block combines MultiHeadAttention, residual connections, LayerNormalization, and a feed-forward network
self.att(x, x) makes it self-attention; pooling collapses the token sequence for classification

Results

The transformer reached test accuracy 0.753 and AUC 0.821, comparable to the BiLSTM here
On short tweets with little data the gain is modest; transformers dominate at scale
Pretrained transformers like BERT bring general language understanding you can fine-tune

Why This Matters

The transformer block you assembled is the same building unit that powers BERT, GPT, and nearly every state-of-the-art language model. Understanding self-attention, queries and keys and values, multi-head attention, and positional information is what separates someone who can only call an API from someone who understands what is happening inside. The modest gain on these short tweets is also a useful lesson in judgment: the best architecture depends on the data, and a transformer is not automatically the right tool for every problem. As your sequences grow longer and your datasets larger, the advantages you read about here turn into decisive wins.

Next Steps

You now understand the transformer architecture and have trained one end to end. In the next lesson you will pull everything from this module together in a guided project, building and comparing models on the disaster tweets dataset from start to finish.

Keep Building Your Skills

You have just built the architecture that defines modern NLP. Self-attention may feel abstract the first time through, so revisit the query, key, and value picture until it clicks; that single idea unlocks the entire transformer family. Carry this forward into the guided project, where you will put your dense, recurrent, and transformer models head to head and decide, with real numbers, which one earns a place in production.
