Lesson 3 - Choosing the Number of Clusters with the Elbow Method

Welcome to Choosing the Number of Clusters

This lesson tackles the question that sits at the very center of k-means clustering: how many clusters should you ask for? K-means cannot answer this for you, because you set the number before the algorithm even starts. You will learn to measure how tight your clusters are with inertia, read the elbow curve to spot a sensible value, and then confirm that choice with a complementary metric called the silhouette score.

By the end of this lesson, you will be able to:

  • Explain why the number of clusters, $k$, is a decision you must make yourself
  • Define inertia and describe how it changes as $k$ grows
  • Build and read an elbow curve to find a reasonable number of clusters
  • Compute and interpret the silhouette score as a second opinion
  • Combine both metrics to choose a defensible value of $k$ on a real dataset

You should be comfortable with basic Python, pandas, and the idea of k-means clustering from the previous lessons in this module. Let’s begin.


The Central Question of K-Means

In the last lesson you watched k-means run: it placed centroids, assigned each point to its nearest centroid, recomputed the centroids, and repeated until nothing moved. But there was a quiet assumption baked into all of it. Before the algorithm could take a single step, you had to tell it how many clusters to look for. That number is the hyperparameter $k$.

This is genuinely different from supervised learning. When you train a classifier, the data carries the right answers and you can measure how often the model is correct. Clustering has no answer key. There is no column that tells you “this customer truly belongs to group 3.” So when you ask for three clusters versus five clusters, nothing in the data shouts that one is right and the other is wrong. You have to bring a method.

Sometimes the choice is made for you by the situation. If a marketing team has the budget for exactly three campaigns, then $k = 3$ whether you like it or not. But most of the time nobody hands you the number, and you have to find a value of $k$ that splits the data in a way that is both tight and meaningful.

This lesson gives you two tools for that job. The first is the elbow method, built on a quantity called inertia. The second is the silhouette score, which looks at cluster quality from a different angle. When the two agree, you can be confident in your choice.

Clustering has no ground truth

Because unsupervised learning has no labels, you cannot compute an accuracy score the way you would for a classifier. Choosing $k$ is therefore part measurement and part judgment. The metrics in this lesson narrow the options down to a small, sensible range; you and your domain knowledge make the final call.


Inertia: Measuring How Tight Your Clusters Are

To compare different values of $k$, you first need a number that captures how good a clustering is. The standard choice for k-means is inertia.

Inertia measures how far each data point sits from the centroid of the cluster it was assigned to. Concretely, it is the sum of the squared distances from every point to its own centroid. If you have $n$ points, and $c_{x_i}$ is the centroid assigned to point $x_i$, then:

$$\text{Inertia} = \sum_{i=1}^{n} \lVert x_i - c_{x_i} \rVert^2$$

A small inertia means points are packed tightly around their centroids, which is what you want: compact, well-defined groups. A large inertia means points are scattered far from their centroids, which suggests a loose, poorly fitting clustering.
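
To make the formula concrete, here is a minimal sketch that computes inertia by hand in plain Python. The four points and their cluster labels are invented for illustration and have nothing to do with the Mall Customers data; the helper names are hypothetical too.

```python
# Hypothetical toy data: four 2-D points already assigned to two clusters.
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 5.0), (7.0, 5.0)]
labels = [0, 0, 1, 1]


def centroid(members):
    """Mean of a list of points, computed coordinate by coordinate."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


# One centroid per cluster label
centroids = {
    label: centroid([p for p, l in zip(points, labels) if l == label])
    for label in set(labels)
}

# Inertia: sum of squared distances from each point to its own centroid
inertia = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))

print("Inertia:", inertia)
# Output: Inertia: 4.0
```

Each point sits exactly one unit from its centroid, so the four squared distances of 1 add up to 4.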

Why the Distances Are Squared

Notice the square in the formula. It does two jobs. First, a squared distance is a sum of squared coordinate differences, so offsets in opposite directions never cancel each other out. Second, and more importantly, it penalizes outliers. A point twice as far from its centroid contributes four times as much to the total. This means a cluster with a few points stranded far from the center looks much worse than a cluster where everyone sits at a moderate distance, even if the average distance is similar. Squaring pushes k-means toward clusters that are genuinely compact rather than ones with a tight core and a few stragglers.
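
A small sketch of that outlier penalty, using made-up point-to-centroid distances rather than anything from the dataset: both hypothetical clusters have the same average distance, yet the one with a single straggler has a much larger sum of squares.

```python
# Hypothetical point-to-centroid distances for two clusters of four points
even_cluster = [2.0, 2.0, 2.0, 2.0]        # everyone at a moderate distance
straggler_cluster = [1.0, 1.0, 1.0, 5.0]   # tight core plus one far point


def mean(values):
    return sum(values) / len(values)


def sum_of_squares(values):
    return sum(v ** 2 for v in values)


for name, distances in [("even", even_cluster), ("straggler", straggler_cluster)]:
    print(f"{name:>9}  mean distance={mean(distances):.1f}  "
          f"sum of squares={sum_of_squares(distances):.1f}")
# Output:
#      even  mean distance=2.0  sum of squares=16.0
# straggler  mean distance=2.0  sum of squares=28.0
```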

Inertia Always Drops as K Grows

Here is the catch that makes inertia tricky to use on its own: the best achievable inertia never goes up as you add more clusters, and in practice it almost always goes down. With more centroids to go around, every point can end up at least as close to one of them.

Take the extreme case. Imagine a dataset of 200 customers and you ask for 200 clusters. Each customer becomes its own cluster, sitting exactly on its own centroid. Every distance is zero, so the inertia is zero, perfect by the numbers. But you have not actually segmented anyone; you just relabeled each row. That clustering is useless.

So you cannot simply chase the lowest inertia, or you will always be pushed toward more and more clusters until each point is alone. The real goal is a trade-off: the smallest number of clusters that still drives inertia low enough to be useful. That trade-off is exactly what the elbow method visualizes.
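
The same effect shows up on a tiny scale. This sketch uses four invented one-dimensional points and hand-picked groupings (no k-means fitting involved) to show inertia shrinking to zero once every point is its own cluster.

```python
def inertia(groups):
    """Sum of squared distances from each value to the mean of its group."""
    total = 0.0
    for group in groups:
        center = sum(group) / len(group)
        total += sum((x - center) ** 2 for x in group)
    return total


# Hypothetical 1-D data: two obvious pairs
one_cluster = inertia([[1.0, 2.0, 9.0, 10.0]])
two_clusters = inertia([[1.0, 2.0], [9.0, 10.0]])
four_clusters = inertia([[1.0], [2.0], [9.0], [10.0]])

print(one_cluster, two_clusters, four_clusters)
# Output: 65.0 1.0 0.0
```

Going from one cluster to two removes almost all of the inertia; going from two to four removes the last unit but tells you nothing new about the data.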


Computing Inertia in scikit-learn

You do not have to write the inertia formula by hand. When you fit a KMeans model in scikit-learn, it stores the inertia of the result in an attribute called inertia_. You will use the real Mall Customers dataset, which records 200 shoppers and includes their annual income and a spending score assigned by the store.

Load the data and keep the two columns you will cluster on.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# download: https://datatweets.com/datasets/mall_customers.csv
customers = pd.read_csv("mall_customers.csv")

# Cluster on the two behavioral columns
features = ["annual_income", "spending_score"]
X = customers[features]

print("Shape:", X.shape)
# Output: Shape: (200, 2)

The two features live on different scales, so you standardize them first, exactly as you would before any distance-based algorithm. (You met StandardScaler earlier; it rescales each column to mean 0 and standard deviation 1.)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
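
If you want to see what that rescaling amounts to, here is a rough plain-Python equivalent for a single column. The income values are invented, and `standardize` is a hypothetical helper; it divides by the population standard deviation, which is what StandardScaler uses.

```python
from statistics import mean, pstdev

# Hypothetical annual incomes, not taken from the dataset
incomes = [15.0, 40.0, 60.0, 85.0]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


scaled = standardize(incomes)

print([round(z, 3) for z in scaled])
print("mean:", round(mean(scaled), 6), " std:", round(pstdev(scaled), 6))
```

After this step every column contributes on the same footing to the distances that k-means computes.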

Now fit a single k-means model and read its inertia. Start with $k = 2$.

model = KMeans(n_clusters=2, random_state=42, n_init=10)
model.fit(X_scaled)

print(f"Inertia (k=2): {model.inertia_:.2f}")
# Output: Inertia (k=2): 269.69

You get an inertia of about 269.69. On its own, that number means almost nothing. Is 269.69 good? Bad? You have nothing to compare it against. A single inertia value is like a single data point with no axis: meaningless until you place it next to others.

What n_init does

K-means starts from randomly chosen centroid positions, and a bad start can lead to a poor result. Setting n_init=10 tells scikit-learn to run the algorithm ten times from different starting points and keep the best one (the run with the lowest inertia), which makes your results more stable. Fixing random_state=42 makes them reproducible from run to run. Both matter a great deal when you are comparing inertia across many values of $k$.
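
The idea of keeping the best of several runs can be sketched without scikit-learn. The code below is a simplified one-dimensional k-means written for illustration only, not scikit-learn's implementation: `run_kmeans_1d` is a hypothetical helper, the data is invented, and two fixed starting positions stand in for random restarts.

```python
def run_kmeans_1d(points, start_centroids, max_iter=50):
    """Plain assign-and-update iterations on 1-D data from given starts."""
    centroids = list(start_centroids)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        groups = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(x - centroids[j]))
            groups[nearest].append(x)
        # Update step: move each centroid to the mean of its points
        updated = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    inertia = sum(min((x - c) ** 2 for c in centroids) for x in points)
    return centroids, inertia


# Hypothetical data with three obvious groups
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]

# Two hand-picked starts: one unlucky, one sensible
starts = [[0.0, 1.0, 2.0], [1.0, 11.0, 21.0]]
results = [run_kmeans_1d(points, start) for start in starts]

# Keep the run with the lowest inertia, as n_init does
best_centroids, best_inertia = min(results, key=lambda r: r[1])

print([inertia for _, inertia in results])
print("best:", best_centroids, best_inertia)
# Output:
# [154.5, 6.0]
# best: [1.0, 11.0, 21.0] 6.0
```

The unlucky start gets stuck with two centroids crowded into the first group, which is exactly the kind of result that extra restarts protect you from.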


Calculating Inertia Across Many Values of K

To make sense of inertia, you compute it for a whole range of $k$ values and compare them. The pattern is a simple loop: for each candidate $k$, fit a model and record its inertia.

inertias = []
k_values = range(1, 11)

for k in k_values:
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    model.fit(X_scaled)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k:>2}  inertia={inertia:7.2f}")
# Output:
# k= 1  inertia= 400.00
# k= 2  inertia= 269.69
# k= 3  inertia= 157.70
# k= 4  inertia= 108.92
# k= 5  inertia=  65.57
# k= 6  inertia=  55.06
# k= 7  inertia=  44.86
# k= 8  inertia=  37.23
# k= 9  inertia=  32.39
# k=10  inertia=  29.98

Read the numbers from the top down and the pattern jumps out. The inertia falls steeply at first: from 400.00 at $k = 1$ down to 65.57 at $k = 5$. After that, the drops get much smaller, from 65.57 to 55.06 to 44.86, shrinking each step. This is exactly the trade-off we predicted. Early clusters buy you a lot of tightness; later clusters buy you very little.
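
A quick sanity check on the first row: the value of exactly 400.00 is no coincidence. With a single cluster the centroid is the overall mean, which is zero for every standardized column, and each standardized column has (population) variance 1. Writing $z_{ij}$ for the standardized value of feature $j$ for customer $i$ (notation introduced here for the derivation), with $n = 200$ customers and $d = 2$ features:

```latex
\text{Inertia}_{k=1}
  = \sum_{i=1}^{n} \lVert z_i - \bar{z} \rVert^2
  = \sum_{i=1}^{n} \sum_{j=1}^{d} z_{ij}^2
  = \sum_{j=1}^{d} n \cdot \operatorname{Var}(z_j)
  = n \cdot d
  = 200 \times 2
  = 400
```

So the starting point of the curve is fixed by the size of the dataset, and every later value tells you how much of that total spread the clusters have absorbed.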

The size of each drop is the part worth watching:

Step             Inertia falls from   to        Drop
$k = 1 \to 2$    400.00               269.69    130.31
$k = 2 \to 3$    269.69               157.70    111.99
$k = 3 \to 4$    157.70               108.92     48.78
$k = 4 \to 5$    108.92                65.57     43.35
$k = 5 \to 6$     65.57                55.06     10.51

The drops stay large through $k = 5$, then collapse: the move from 5 to 6 saves only about 10 units, a fraction of what earlier steps saved. That sudden flattening is the signal you are looking for, and it is far easier to spot in a picture than in a table.


The Elbow Method

The elbow method turns the table above into a single, readable chart. You plot inertia on the y-axis against the number of clusters on the x-axis and connect the points with a line.

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow method for the Mall Customers dataset")
plt.show()

Because inertia only ever falls, the line always slopes downward. But it does not fall evenly. It plunges steeply for the first few clusters, then bends and flattens into a near-horizontal tail. That bend is the elbow, and it gives the method its name. The idea is wonderfully simple: the elbow marks the point where adding another cluster stops paying off. Before the elbow, each new cluster sharply reduces inertia. After it, you are spending complexity for almost no gain, the textbook case of diminishing returns.

Figure: Line chart of inertia against the number of clusters for the Mall Customers dataset. The elbow curve drops steeply through $k = 5$ and then flattens, marking $k = 5$ as the natural choice.

Look at the curve for this dataset. It descends sharply from $k = 1$ through $k = 5$, then visibly levels off. The bend sits right at $k = 5$. Past that point the line is nearly flat, the signature of diminishing returns. The elbow method points clearly at five clusters.

When the Elbow Is Not Obvious

Be honest about the elbow method’s limits: the bend is not always this clean. Real datasets often produce a gentle, rounded curve where two or three values of $k$ could each plausibly be “the elbow.” Different people will eyeball the same chart and pick different numbers. That ambiguity is not a flaw you can code away; it is the nature of unlabeled data. It is also the reason you do not want to lean on the elbow method alone.

The elbow is a guide, not a rule

The elbow method relies on visual judgment, and visual judgment can be fuzzy. A curve that bends sharply gives you confidence; a curve that rounds off gently leaves room for doubt. Treat the elbow as one piece of evidence rather than a final verdict, and always cross-check it with a second metric and with what makes sense for your problem.


The Silhouette Score: A Second Opinion

The elbow method asks a single question: how tight are the clusters? The silhouette score asks a richer one: are the points not only close to their own cluster, but also clearly separated from the others? Because it brings new information, it makes an excellent independent check on the elbow.

For each individual point, the silhouette compares two distances:

  • $a$: the average distance from the point to the other points in its own cluster (how cozy it is at home).
  • $b$: the average distance from the point to the points in the nearest neighboring cluster (how far the next-best group is).

The silhouette value for that point is:

$$s = \frac{b - a}{\max(a, b)}$$

The result always lands between $-1$ and $1$, and it has an intuitive reading:

  • Near $+1$: the point is much closer to its own cluster than to any other. It is confidently placed.
  • Near $0$: the point sits right on the border between two clusters. It could have gone either way.
  • Below $0$: the point is actually closer to a neighboring cluster than to its own, a sign it may have been assigned to the wrong group.

The silhouette score for a whole clustering is just the average of $s$ over every point. Higher is better. Unlike inertia, it does not automatically improve as $k$ grows, which is precisely what makes it useful: it can actually peak at the right number of clusters and then decline, giving you a genuine maximum to aim for.
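
Here is a minimal by-hand version of that calculation, assuming a toy one-dimensional dataset with two invented clusters. The `silhouette` helper is hypothetical and only handles clusters with at least two distinct points; it is meant to mirror the definition, not to replace a library implementation.

```python
# Hypothetical 1-D data, already split into two clusters
clusters = {"A": [1.0, 2.0, 3.0], "B": [8.0, 9.0]}


def mean_distance(x, others):
    """Average absolute distance from x to every value in others."""
    return sum(abs(x - y) for y in others) / len(others)


def silhouette(x, own_label):
    """Silhouette value of one point, following s = (b - a) / max(a, b)."""
    own = list(clusters[own_label])
    own.remove(x)  # leave the point itself out of its own cluster
    a = mean_distance(x, own)
    # b: smallest average distance to any other cluster
    b = min(mean_distance(x, members)
            for label, members in clusters.items() if label != own_label)
    return (b - a) / max(a, b)


values = [silhouette(x, label)
          for label, members in clusters.items() for x in members]
average = sum(values) / len(values)

print([round(s, 3) for s in values])
print("average silhouette:", round(average, 3))
# Output:
# [0.8, 0.846, 0.727, 0.833, 0.857]
# average silhouette: 0.813
```

Every point scores well above zero because the two groups are far apart relative to their own width, which is what a high average silhouette is telling you.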

Why two metrics beat one

Inertia only measures compactness, and it always favors more clusters. The silhouette score measures compactness and separation, and it does not blindly reward adding clusters. When a metric that loves big $k$ (the elbow) and a metric that does not (the silhouette) land on the same value, that agreement is strong evidence you have found a real structure in the data.


Computing the Silhouette Score

scikit-learn provides silhouette_score, which takes your data and the cluster labels and returns the average silhouette. One detail to remember: the silhouette is undefined for a single cluster (there is no “neighboring cluster” to compare against), so you start the loop at $k = 2$.

from sklearn.metrics import silhouette_score

sil_scores = []
k_values = range(2, 11)

for k in k_values:
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = model.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    sil_scores.append(score)

for k, score in zip(k_values, sil_scores):
    print(f"k={k:>2}  silhouette={score:.3f}")
# Output:
# k= 2  silhouette=0.321
# k= 3  silhouette=0.467
# k= 4  silhouette=0.494
# k= 5  silhouette=0.555
# k= 6  silhouette=0.540
# k= 7  silhouette=0.528
# k= 8  silhouette=0.455
# k= 9  silhouette=0.457
# k=10  silhouette=0.443

Now watch the shape of these numbers, because it is different from inertia. The silhouette rises from 0.321 at $k = 2$ up to a peak of 0.555 at $k = 5$, and then it starts falling: 0.540 at $k = 6$, 0.528 at $k = 7$, and lower from there. This is a true maximum, not a curve that flattens. The best-separated clustering is the one with five clusters.

Plot it to make the peak unmistakable.

plt.figure(figsize=(8, 5))
plt.plot(k_values, sil_scores, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Average silhouette score")
plt.title("Silhouette score for the Mall Customers dataset")
plt.show()

Figure: Line chart of average silhouette score against the number of clusters. The score climbs to a clear peak at $k = 5$ before declining.

Where the elbow chart asks you to judge a bend, the silhouette chart hands you a single highest point. For this dataset that point is $k = 5$, and the rule is simply to pick the $k$ with the highest average silhouette.


Putting Both Metrics Together

You now have two independent verdicts on the Mall Customers data:

  • The elbow method shows inertia dropping steeply through $k = 5$ and then flattening. The bend sits at five.
  • The silhouette score climbs to its maximum of 0.555 at $k = 5$ and falls afterward. The peak is at five.

Two metrics that measure different things, built on different intuitions, arrive at the same answer. That kind of agreement is the strongest signal you can hope for in unsupervised learning. The five clusters are not an artifact of one particular metric; they reflect genuine structure in how these shoppers spread across income and spending.

This is also the right moment to bring in business judgment. Suppose the marketing team can only run four campaigns this quarter. Even though the metrics favor five clusters, you might choose four, accepting a slightly looser fit because it matches what the team can actually act on. Or suppose operations has five regional budgets to allocate; then five is doubly justified. The metrics narrow the field to a small, defensible range, and your knowledge of the problem makes the final pick.
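
If you do settle on four for practical reasons, it helps to state what that costs. This small sketch uses only the inertia and silhouette values printed earlier in the lesson; the variable names are illustrative.

```python
# Values copied from the lesson's printed results
inertia_by_k = {4: 108.92, 5: 65.57}
silhouette_by_k = {4: 0.494, 5: 0.555}

# How much looser is k=4 than k=5?
inertia_ratio = inertia_by_k[4] / inertia_by_k[5]
silhouette_gap = silhouette_by_k[5] - silhouette_by_k[4]

print(f"Inertia at k=4 is {inertia_ratio:.2f}x the inertia at k=5")
print(f"Silhouette is lower by {silhouette_gap:.3f}")
# Output:
# Inertia at k=4 is 1.66x the inertia at k=5
# Silhouette is lower by 0.061
```

Numbers like these let you tell the team plainly what they give up by going with four segments instead of five.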

Agreement is the goal, not certainty

The metrics will not always agree as cleanly as they do here. When the elbow is fuzzy and the silhouette has a soft peak, you may be left with two or three reasonable candidates. That is normal. Report the range, explain the trade-off, and let the downstream use of the clusters guide the decision.
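
One possible way to report a range rather than a single number is to shortlist every $k$ whose silhouette comes close to the best one. The sketch below uses the scores printed earlier in the lesson; the tolerance of 0.03 is an arbitrary assumption chosen for illustration, not a standard threshold.

```python
# Silhouette scores copied from the lesson's printed results
sil_by_k = {2: 0.321, 3: 0.467, 4: 0.494, 5: 0.555, 6: 0.540,
            7: 0.528, 8: 0.455, 9: 0.457, 10: 0.443}

tolerance = 0.03  # hypothetical cutoff; adjust to your own problem

best_k = max(sil_by_k, key=sil_by_k.get)
shortlist = [k for k, score in sil_by_k.items()
             if score >= sil_by_k[best_k] - tolerance]

print("best k:", best_k)
print("shortlist:", shortlist)
# Output:
# best k: 5
# shortlist: [5, 6, 7]
```

Here the inertia curve settles the matter, since little is gained beyond five, but on a dataset with a fuzzier elbow a shortlist like this is the range you would take to the people who will use the clusters.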


Practice Exercises

Now it is your turn. Try these before checking the hints. Each assumes you have loaded and scaled the data as shown earlier.

Exercise 1: Measure a Single Inertia

Fit a KMeans model with n_clusters=3 on the scaled data and print its inertia rounded to two decimals. Confirm that it matches the value from the lesson’s table.

# Reuse X_scaled from the lesson
# Your code here

Hint

Create the model with KMeans(n_clusters=3, random_state=42, n_init=10), call .fit(X_scaled), then read model.inertia_. You should get about 157.70, matching the $k = 3$ entry from the inertia table.

Exercise 2: Find the Biggest Drop in Inertia

Using the inertias list computed across $k = 1$ to $10$, print the drop in inertia at each step (the difference between consecutive values). Which step has the largest drop, and which is the first step where the drop becomes small?

# Reuse the inertias list from the lesson
# Your code here

Hint

Loop over the indices and compute inertias[i] - inertias[i + 1] for each consecutive pair. The largest single drop is from $k = 1$ to $k = 2$ (about 130), and the drop shrinks dramatically after $k = 5$, where it falls to roughly 10.5, confirming the elbow.

Exercise 3: Confirm the Silhouette Peak

Write a loop that computes the silhouette score for $k = 2$ through $k = 6$, then use Python to print the value of $k$ with the highest score. Does it match the elbow?

from sklearn.metrics import silhouette_score
# Reuse X_scaled from the lesson
# Your code here

Hint

Collect the scores in a list, then pair each score with its $k$ using zip(range(2, 7), scores) and pick the maximum with max(..., key=lambda pair: pair[1]). The winner is $k = 5$ with a score of about 0.555, the same answer the elbow gives.


Summary

Congratulations! You have learned how to answer the hardest question in k-means: how many clusters to use. Let’s review what you covered.

Key Concepts

Why K Is Your Decision

  • K-means requires the number of clusters, $k$, as input before it runs
  • Unsupervised learning has no ground truth, so you cannot score $k$ with accuracy
  • Choosing $k$ combines quantitative metrics with business judgment

Inertia

  • Inertia is the sum of squared distances from each point to its assigned centroid
  • Squaring keeps values positive and penalizes far-flung outliers
  • Inertia keeps falling as $k$ grows, so the lowest inertia is never the goal
  • scikit-learn exposes it through the inertia_ attribute after fitting

The Elbow Method

  • Plot inertia against $k$ and look for the bend where the curve flattens
  • The elbow marks the point of diminishing returns, where extra clusters stop paying off
  • For the Mall Customers data, inertia falls steeply through $k = 5$ and then flattens
  • The elbow relies on visual judgment and can be ambiguous on real data

The Silhouette Score

  • It combines cohesion (closeness within a cluster) and separation (distance from other clusters)
  • Values range from $-1$ to $1$; higher means better-defined clusters
  • Unlike inertia, it does not always improve with more clusters, so it can peak at the right $k$
  • For this dataset it peaks at 0.555 when $k = 5$

Combining the Two

  • Inertia, the metric behind the elbow, favors larger $k$; the silhouette does not, so agreement between them is meaningful
  • On the Mall Customers data both point to $k = 5$
  • Always weigh the metrics against operational and business constraints

Why This Matters

Every clustering project eventually faces the same question, and getting it wrong quietly undermines everything downstream. Choose too few clusters and you blur distinct groups together, hiding the very segments you set out to find. Choose too many and you fracture real groups into meaningless slivers that no team can act on. The elbow method and the silhouette score give you a disciplined, repeatable way to land in the right range instead of guessing.

Just as importantly, you saw how to use two metrics as checks on each other. Inertia loves more clusters; the silhouette does not. When a metric biased toward big $k$ and one that is not both settle on five, that convergence is far more trustworthy than either number alone. This habit, looking for agreement between independent measures rather than trusting a single score, is one of the most valuable instincts you can carry into any machine learning work.


Next Steps

You can now choose a defensible number of clusters and back it up with evidence. In the next lesson, you will commit to $k = 5$, fit k-means with scikit-learn properly, and turn the resulting clusters into a profile of each customer segment that a business could actually use.

Continue to Lesson 4 - K-Means with Scikit-Learn and Interpreting Results

Fit k-means with scikit-learn and turn clusters into actionable customer segments.


Keep Building Your Skills

Choosing $k$ is where clustering shifts from mechanical to thoughtful. You learned to measure inertia, read the elbow, confirm it with the silhouette score, and weigh both against the realities of a business problem. Hold on to the bigger lesson too: in unsupervised learning there is no answer key, so you build confidence by triangulating from several independent signals. Carry that mindset forward, and the clusters you create will be ones people can actually trust and use.