Lesson 1 - Introduction to Unsupervised Learning

Welcome to Unsupervised Learning

This lesson introduces you to unsupervised machine learning, a family of techniques that finds hidden structure in data that has no answers attached. You will learn what unsupervised learning is, how it differs from the supervised learning you may have already seen, why clustering is its most important task, and where it shows up in the real world. Then you will meet the dataset you will use throughout this module and learn to spot natural groups with your own eyes, before any algorithm gets involved.

By the end of this lesson, you will be able to:

  • Explain what unsupervised machine learning is and how it differs from supervised learning
  • Describe the main families of unsupervised tasks: clustering, association, and anomaly detection
  • Explain why clustering is the headline task and where it is used in practice
  • Load and explore the real Mall Customers dataset with pandas
  • Read an unlabeled scatter plot and identify candidate groups by eye

No prior unsupervised learning experience is needed. You should be comfortable with basic Python, pandas, and reading a simple scatter plot. Let’s begin.


What Is Unsupervised Learning?

Most introductions to machine learning start with supervised learning, where every example in your data comes with a correct answer attached. You show a model thousands of past phone calls, each labeled “subscribed” or “did not subscribe,” and the model learns to predict that label for new customers. The word supervised is the key: because every training example has a known answer, you can objectively check whether the model got it right.

But what happens when there are no answers?

Imagine you work for a shopping mall. You have a spreadsheet of customers with their age, income, and a spending score, but nobody has labeled them. There is no column that says “this is a high-value customer” or “this is a bargain hunter.” No one told you how many types of customer exist, or which customer belongs to which type. You simply have raw measurements and a hunch that some natural groups are hiding in there.

This is the world of unsupervised learning. The data is unlabeled: there is no target variable to predict. In fact, the goal is not to predict anything at all. The goal is to find structure that is already present in the data but not written down anywhere.

The Core Difference

The single question that separates the two families is whether your data comes with answers.

Supervised Learning              Unsupervised Learning
-----------------------          -----------------------
Data HAS labels (answers)        Data has NO labels
Goal: predict the label          Goal: find hidden structure
You can score correctness        No "correct" answer to check
Examples:                        Examples:
  - Spam vs. not spam             - Customer segmentation
  - Price of a house              - Grouping similar documents
  - Disease vs. healthy           - Anomaly / fraud detection

In supervised learning, a human supplied the answers ahead of time, so you can measure accuracy by comparing predictions to those answers. In unsupervised learning there is no answer key. The algorithm proposes a structure, and it is up to you, the data scientist, to look at that structure and judge whether it is useful and what it means.
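To make the contrast concrete, here is a minimal sketch built on a tiny made-up table. The rows, the column names, and the `predicted` values are all invented for illustration and do not come from any real model. With a label column you can compute accuracy. Once the label is dropped, there is nothing left to compare against.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
labeled = pd.DataFrame({
    "calls_last_month": [2, 9, 4, 7],
    "subscribed": [0, 1, 0, 1],   # the label, i.e. the answer key
})

# Pretend a model produced these predictions (made up for this sketch).
predicted = pd.Series([0, 1, 1, 1])

# Supervised: predictions can be scored against the known labels.
accuracy = (predicted == labeled["subscribed"]).mean()
print("Accuracy:", accuracy)  # 3 of 4 match -> 0.75

# Unsupervised: the same measurements with the label column removed.
unlabeled = labeled.drop(columns="subscribed")
print(unlabeled.columns.tolist())  # ['calls_last_month']
```

With `unlabeled` there is no column to score against, so any structure an algorithm proposes has to be judged by a person.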

Unsupervised does not mean unattended

A common misconception is that “unsupervised” means the algorithm runs with no human involvement. It does not. It means the data has no labels. A person still has to interpret the output, decide whether the groups make sense, and translate them into action. Unsupervised learning shifts the human effort from labeling data to interpreting results.

Why Bother Without Labels?

Labeling data is expensive and slow. Someone has to read every email, inspect every transaction, or interview every customer to attach an answer. In practice, the vast majority of data in the world arrives without labels. Unsupervised learning lets you extract value from that data anyway, by revealing patterns you did not know to look for. It is often the very first thing a data scientist reaches for when handed an unfamiliar dataset, because it answers the question “what is actually in here?”


The Main Types of Unsupervised Learning

Unsupervised learning is not a single technique but a family of related tasks. Three of them show up again and again.

Clustering

Clustering is the process of dividing a dataset into groups so that items in the same group are similar to each other and different from items in other groups. Each group is called a cluster. The algorithm is never told what the groups are; it discovers them from the data.

The classic business application is customer segmentation: splitting your customers into groups so you can market to each group differently. A “young, high spender” cluster might get one campaign while an “older, cautious saver” cluster gets another. Clustering is also used to group similar products, organize documents by topic, and compress images.

Association

Association looks for relationships between variables rather than between entries. The textbook example is market basket analysis: discovering that customers who buy bread and butter also tend to buy jam, so you can place those items together or recommend one when another is added to the cart. Association rules power many of the “frequently bought together” suggestions you see online.
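As a rough sketch of the market basket idea, the snippet below checks one candidate rule, "bread and butter, therefore jam", against a handful of invented baskets. The baskets are hypothetical. Support and confidence are the two standard measures for such rules, although this lesson does not name them. Real association mining searches over many candidate rules rather than testing one by hand.

```python
# Hypothetical shopping baskets, invented for illustration.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"milk", "bread"},
    {"butter", "jam"},
]

antecedent = {"bread", "butter"}   # "if a basket contains these..."
consequent = {"jam"}               # "...does it also contain this?"

# Baskets containing every antecedent item (<= is the subset test).
with_antecedent = [b for b in baskets if antecedent <= b]
# Of those, the baskets that also contain the consequent.
with_both = [b for b in with_antecedent if consequent <= b]

# Support: share of all baskets that contain both sides of the rule.
support = len(with_both) / len(baskets)
# Confidence: share of antecedent baskets that also contain the consequent.
confidence = len(with_both) / len(with_antecedent)

print(f"support={support:.2f}, confidence={confidence:.2f}")
# support=0.40, confidence=0.67
```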

Anomaly Detection

Anomaly detection flags data points that do not fit the overall pattern of the dataset. If almost every credit card transaction for a customer is under fifty dollars and made in their home city, a sudden thousand-dollar purchase from another country stands out as an anomaly. This is widely used for fraud detection, network security, and spotting faulty equipment before it fails.
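One simple way to express "does not fit the overall pattern" in code is to measure how far each value sits from the typical value. The sketch below uses invented transaction amounts and an assumed cutoff of five median absolute deviations. Both are illustrative choices, not a recommended fraud rule, and the sketch ignores location entirely.

```python
import pandas as pd

# Hypothetical transaction amounts in dollars, invented for illustration.
amounts = pd.Series([12, 35, 20, 48, 15, 27, 1000])

median = amounts.median()
# Median absolute deviation (MAD): a typical distance from the median.
mad = (amounts - median).abs().median()

# Assumed rule for this sketch: flag anything more than 5 MADs away
# from the median. The cutoff of 5 is arbitrary.
is_anomaly = (amounts - median).abs() > 5 * mad

print(amounts[is_anomaly].tolist())  # [1000]
```

The median is used instead of the mean because a single extreme value pulls the mean (and the standard deviation) toward itself, which can hide the very point you are trying to flag.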

This module focuses on clustering, the most widely used and most intuitive of the three. It is the perfect entry point into unsupervised thinking, and the customer segmentation problem we are about to set up is one of its purest applications.

The same data, two questions

The line between supervised and unsupervised is about the question, not the data. With customer records you could ask “will this customer churn next month?” (supervised, because churn is a known label) or “what natural groups of customers exist?” (unsupervised, because no grouping is given). The same table can feed either approach depending on what you are trying to learn.


Meet the Mall Customers Dataset

Throughout this module you will work with the Mall Customers dataset, a small, clean dataset of shoppers at a retail mall. It is a favorite for learning clustering because it is just big enough to be interesting and small enough to plot every point and reason about the results by eye.

The business question is simple and real: the mall wants to divide its customers into segments so it can tailor marketing to each group. No one has told us how many segments there are or who belongs where. That is exactly the kind of “find the structure” problem unsupervised learning was built for.

You can download the dataset and load it with pandas.

import pandas as pd

# download: https://datatweets.com/datasets/mall_customers.csv
customers = pd.read_csv("mall_customers.csv")

print("Shape:", customers.shape)
# Output: Shape: (200, 4)

The dataset has 200 rows and 4 columns. Each row is one customer, and there is no target column, which is the hallmark of an unsupervised problem.

A Data Dictionary

Here are the columns you will work with:

Column            Type        Meaning
---------------   ---------   ----------------------------------------
Gender            category    The customer’s gender
age               int         The customer’s age in years
annual_income     int         Annual income in thousands of dollars
spending_score    int         A score from 1 to 100 reflecting shopping behavior (higher means more spending)

The spending_score is the mall’s own internal measure of how much and how often a customer shops; you can treat it as a ready-made signal of engagement. Notice that none of these columns is an “answer.” There is no label telling us which segment a customer belongs to. Our job is to discover those segments.

Take a moment to look at a few rows and the basic statistics.

print(customers.head())
# Output:
#    Gender  age  annual_income  spending_score
# 0    Male   19             15              39
# 1    Male   21             15              81
# 2  Female   20             16               6
# 3  Female   23             16              77
# 4  Female   31             17              40

print(customers.describe())
# Output:
#               age  annual_income  spending_score
# count  200.000000     200.000000      200.000000
# mean    38.850000      60.560000       50.200000
# std     13.969007      26.264721       25.823522
# min     18.000000      15.000000        1.000000
# 25%     28.750000      41.500000       34.750000
# 50%     36.000000      61.500000       50.000000
# 75%     49.000000      78.000000       73.000000
# max     70.000000     137.000000       99.000000

The customers range from 18 to 70 years old, earn between 15 and 137 thousand dollars a year, and have spending scores from 1 to 99, covering almost the whole 1-to-100 scale. Crucially, there are no missing values to clean up, so you can focus entirely on the ideas rather than on data wrangling.
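If you want to confirm the "no missing values" claim yourself, a per-column count of missing entries is enough. The sketch below retypes the five rows from the head() output by hand so that it runs without the CSV file. It therefore checks only those five rows, not the full dataset.

```python
import pandas as pd

# The five rows printed by customers.head(), typed in by hand so the
# sketch runs without the CSV file.
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "age": [19, 21, 20, 23, 31],
    "annual_income": [15, 15, 16, 16, 17],
    "spending_score": [39, 81, 6, 77, 40],
})

# Count missing values per column; all zeros means nothing to clean.
missing = sample.isna().sum()
print(missing)

# Column types: Gender holds text, the other three hold integers.
print(sample.dtypes)
```

With the real file loaded, the same check is `customers.isna().sum()`.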

Why such a small dataset?

With only 200 rows and a handful of columns, you can plot every single customer and check whether an algorithm’s groups match your intuition. Real customer datasets can have millions of rows and hundreds of columns, where eyeballing is impossible, but the concepts you build here scale directly to those larger problems. Start small, see clearly, then scale up.


The Segmentation Question

The two columns that tell the clearest marketing story are annual_income and spending_score. Income tells you how much a customer could spend; the spending score tells you how much they actually engage with the mall. Plotting one against the other is a natural way to look for customer types.

So let’s ask the central question of this module directly: if you look at income versus spending, do customers fall into natural groups?

Before reaching for any algorithm, plot the raw data and look. This is one of the most valuable habits in unsupervised learning, because your eyes are an excellent clustering tool in two dimensions, and they give you a sanity check for whatever the algorithm produces later.

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(customers["annual_income"], customers["spending_score"])
plt.xlabel("Annual income (k$)")
plt.ylabel("Spending score (1-100)")
plt.title("Mall customers: income vs spending score")
plt.show()

This produces an unlabeled scatter plot: every point is one customer, and there are no colors, because we have nothing to color by. That is the whole point of unsupervised learning. We start with undifferentiated dots and try to discover the groups.

Figure: The raw Mall Customers data as an unlabeled scatter plot, with annual income on the x-axis and spending score on the y-axis. There are no labels and no colors, and the points spread into several visible clumps.

Eyeball the Groups

Now do the work an unsupervised algorithm is meant to automate. Stare at that scatter plot and ask yourself: how many natural groups can you pick out?

Most people, looking at this plot, spot something like five blobs:

  • A cluster in the center: middle income, middle spending. The “average” customer.
  • A cluster at the top right: high income and high spending. These are valuable, engaged customers.
  • A cluster at the bottom right: high income but low spending. They can afford to shop but choose not to, an obvious target for a campaign.
  • A cluster at the top left: low income but high spending. Enthusiastic shoppers on a budget.
  • A cluster at the bottom left: low income and low spending.

Notice how much business meaning falls out of this simple picture. Each corner of the plot suggests a different marketing strategy. That is the promise of clustering: turning a cloud of points into actionable segments.
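The five eyeballed regions can be written down as explicit rules, which makes it obvious how much guesswork they contain. In the sketch below, the cutoffs (40 and 70 for income, 40 and 60 for spending score), the `eyeball_segment` helper, and the five example customers are all hypothetical. They are rough values someone might read off the plot, not values the data or any algorithm supplies.

```python
import pandas as pd

# Hypothetical cutoffs read off the plot by eye; assumptions for this
# sketch only.
LOW_INCOME, HIGH_INCOME = 40, 70   # thousands of dollars
LOW_SPEND, HIGH_SPEND = 40, 60     # spending score

def eyeball_segment(income, spending):
    """Name the region of the plot a customer falls in."""
    if income > HIGH_INCOME and spending > HIGH_SPEND:
        return "top right: high income, high spending"
    if income > HIGH_INCOME and spending < LOW_SPEND:
        return "bottom right: high income, low spending"
    if income < LOW_INCOME and spending > HIGH_SPEND:
        return "top left: low income, high spending"
    if income < LOW_INCOME and spending < LOW_SPEND:
        return "bottom left: low income, low spending"
    return "center or unclear"

# Made-up customers, one aimed at each region.
people = pd.DataFrame({
    "annual_income": [55, 90, 95, 20, 18],
    "spending_score": [50, 85, 15, 80, 10],
})
people["segment"] = [
    eyeball_segment(income, spending)
    for income, spending in zip(people["annual_income"], people["spending_score"])
]
print(people)
```

Hand-drawn boundaries are brittle. Under these cutoffs, the real customer with income 17 and spending score 40 lands in "center or unclear" simply because 40 is not strictly below the low-spending cutoff, even though that customer sits far from the middle of the plot.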

Your eyes only work in two dimensions

We could pick out groups here only because we are looking at two columns at once. The Mall Customers data also has age and Gender, and real datasets have dozens or hundreds of features. You cannot eyeball groups in fifty dimensions, and even in two dimensions, different people will count the blobs differently. That is precisely why we need an algorithm to find clusters objectively and consistently, which is where the next lessons go.

From Eyeballing to an Algorithm

You just performed clustering by hand. You looked at the data and proposed groups based on how close points were to one another. An algorithm does the same thing, only it makes the notion of “close” precise, it scales to any number of dimensions, and it applies the same explicit rule every time instead of depending on who is squinting at the screen.
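One common way to make “close” precise is the straight-line (Euclidean) distance between two customers. This is only one possible choice, used here as an illustration. The lesson itself does not say which distance later lessons rely on. The sketch uses the income and spending values of the first three customers from the head() output, and `euclidean` is a helper defined just for this example.

```python
import numpy as np

# (annual_income, spending_score) for the first three customers shown
# in the head() output earlier in the lesson.
a = np.array([15, 39])
b = np.array([15, 81])
c = np.array([16, 6])

def euclidean(p, q):
    """Straight-line distance between two points of equal length."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

print(euclidean(a, b))  # 42.0
print(euclidean(a, c))  # about 33.02
```

The same function works unchanged if each array holds three, ten, or fifty features, which is exactly where eyeballing stops being possible.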

But this also exposes the hard questions that the rest of the module answers. How does a computer decide which points belong together? How many groups should there be, when five was just a guess? And once the algorithm hands you groups, how do you describe and use them? You will not solve those here. For now, the important takeaways are conceptual: unsupervised learning finds structure in unlabeled data, clustering is how we find groups, and even a plain scatter plot already hints that real structure is there to be found.


Practice Exercises

Now it is your turn. Try these before checking the hints. They use only the ideas from this lesson: exploring the data and looking for structure, not running a clustering algorithm.

Exercise 1: Explore Gender Balance

The data dictionary lists Gender as a column. Find out how many customers of each gender are in the dataset, and what fraction of the total each represents.

import pandas as pd

# download: https://datatweets.com/datasets/mall_customers.csv
customers = pd.read_csv("mall_customers.csv")

# Your code here

Hint

Use customers["Gender"].value_counts() to count each category, and customers["Gender"].value_counts(normalize=True) to get the proportions instead of raw counts. You should find the dataset is reasonably balanced between the two groups.

Exercise 2: Plot a Different Pair of Features

The lesson plotted annual_income against spending_score. Make a scatter plot of age against spending_score instead, and look at it. Do you still see clear, separated groups, or does the structure look weaker?

import matplotlib.pyplot as plt

# Your code here (reuse the customers DataFrame)

Hint

Call plt.scatter(customers["age"], customers["spending_score"]), then add plt.xlabel(...) and plt.ylabel(...). The groups are less obvious in this view, which is a useful reminder that which features you choose strongly affects what structure you can see.

Exercise 3: Count the Clusters by Eye

Look again at the income-versus-spending scatter plot from the lesson. Write down, in a comment, how many clusters you see and where each one sits (for example, “top right: high income, high spending”). There is no single correct answer.

# Your answer here, as comments. For example:
# Cluster 1: center, average income and spending
# Cluster 2: ...

Hint

Focus on the four corners and the middle of the plot. Many people settle on about five groups, but if you see four or six, that is fine. The fact that reasonable people disagree on the count is exactly why later lessons introduce a systematic way to choose the number of clusters.


Summary

Congratulations! You now understand what unsupervised learning is, how it differs from supervised learning, and why clustering is its headline task. You also met the Mall Customers dataset and learned to spot structure with your own eyes. Let’s review what you learned.

Key Concepts

Unsupervised vs. Supervised Learning

  • Supervised learning uses labeled data and predicts a known target; you can objectively score its correctness
  • Unsupervised learning uses unlabeled data; the goal is to find structure, not to predict
  • “Unsupervised” refers to the data having no labels, not to the absence of human involvement; a person still interprets the results

Types of Unsupervised Learning

  • Clustering groups similar entries together (for example, customer segmentation)
  • Association finds relationships between variables (for example, market basket analysis)
  • Anomaly detection flags points that break the overall pattern (for example, fraud detection)

The Mall Customers Problem

  • The dataset has 200 customers and 4 columns: Gender, age, annual_income, and spending_score, with no target column
  • The business goal is segmentation: dividing customers into groups to market to each differently
  • Plotting annual_income against spending_score reveals candidate groups by eye

The Habit of Looking First

  • Always plot raw, unlabeled data before running an algorithm; your eyes are a strong clustering tool in two dimensions
  • Eyeballing only works in two or three dimensions and depends on the observer, which is why an algorithm is needed

Why This Matters

Most data in the world arrives without labels. Customer records, sensor readings, transaction logs, and documents pile up far faster than anyone could ever annotate them. Unsupervised learning is how you extract value from that data anyway, by surfacing patterns you did not know to look for. Clustering in particular turns a featureless cloud of points into named segments that a business can actually act on.

You also saw both the promise and the limitation of doing this by eye. A simple scatter plot already hinted at real structure in the mall’s customers, which is genuinely useful. But you had to guess the number of groups, you could only use two features at a time, and another person might count the blobs differently. Removing that guesswork, scaling beyond two dimensions, and getting a consistent answer is exactly what a clustering algorithm provides, and it is what the rest of this module builds toward.


Next Steps

You now understand what unsupervised learning is and have seen the structure hiding in the Mall Customers data. In the next lesson, you will learn the algorithm that finds these clusters automatically, k-means, and watch it iterate step by step toward a solution.

Continue to Lesson 2 - The Iterative K-Means Algorithm

Learn how k-means finds clusters automatically, one iteration at a time.



Keep Building Your Skills

You have taken your first step into unsupervised learning, the part of machine learning that works without an answer key. The instinct you practiced here, looking at raw data and asking “what natural groups are in here?”, is the same instinct that drives every clustering project, from segmenting customers to detecting fraud. Hold onto that intuition. As you learn the algorithms in the coming lessons, keep checking their output against what your own eyes would say. When the math and the picture agree, you can trust the result.