Lesson 6 - Guided Project: Telling the Species Apart

Welcome to the Guided Project

You’re handed penguin body measurements with the species labels hidden. Using only the descriptive-statistics tools from this module — distributions, summaries, and comparisons — how well can you tell the three species apart?

That is the whole project. No machine learning, no fancy models — just the habits you have been building: look at the data, summarize each group, compare distributions, and turn what you see into a simple rule. Then you will do the honest thing most tutorials skip: measure exactly how often your rule is right, and look hard at where it fails.

By the end of this lesson, you will be able to:

  • Profile an unfamiliar dataset and summarize each measurement by group
  • Read distributions to find which variables separate groups and which do not
  • Translate a visual pattern into a written rule-of-thumb classifier
  • Measure a rule’s real accuracy and diagnose its errors honestly

You only need pandas, a little numpy, and the descriptive statistics from the earlier lessons. Let’s begin.


Step 1: Load and Inspect the Data

Start by loading the penguins and taking the measure of what you have.

import pandas as pd

penguins = pd.read_csv("https://datatweets.com/datasets/penguins.csv")
print(penguins.shape)
print(penguins["species"].value_counts())
(344, 7)
species
Adelie       152
Gentoo       124
Chinstrap     68
Name: count, dtype: int64

There are 344 penguins across three species — Adelie, Gentoo, and Chinstrap — and the classes are uneven, with more than twice as many Adelie as Chinstrap. That imbalance matters later: a lazy rule that simply guesses “Adelie” every time would already be right 44% of the time, so any rule we build has to beat that bar by a wide margin.
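The 44% baseline can be checked from the class counts alone. Here is a minimal sketch, assuming the counts printed above are typed in by hand; the names `counts`, `majority_species`, and `baseline` are illustrative and not part of the lesson's code.

```python
import pandas as pd

# Species counts copied from the value_counts() output above.
counts = pd.Series({"Adelie": 152, "Gentoo": 124, "Chinstrap": 68})

# Always guessing the most common species is right this often.
majority_species = counts.idxmax()
baseline = counts.max() / counts.sum()

print(majority_species, round(baseline * 100, 1))
```

On the full frame, `penguins["species"].value_counts(normalize=True).max()` should give the same share without retyping the counts.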

Four numeric measurements describe each bird: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. A few values are missing; here are the counts for the three measurements this project uses:

print(penguins[["bill_length_mm", "flipper_length_mm", "body_mass_g"]].isna().sum())
bill_length_mm       2
flipper_length_mm    2
body_mass_g          2
dtype: int64

Only two values per measurement are missing: a couple of birds that were never fully measured. We will drop those rows before classifying so every penguin we judge actually has the numbers our rule needs.
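To see what dropping incomplete rows does, here is a small sketch on a made-up four-row frame. The values are invented for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: the third bird was never measured.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "bill_length_mm": [39.1, 47.3, np.nan, 49.0],
    "flipper_length_mm": [181.0, 215.0, np.nan, 196.0],
})

# Keep only rows where both columns the rule needs are present.
needed = ["bill_length_mm", "flipper_length_mm"]
toy_clean = toy.dropna(subset=needed)

print(len(toy), "->", len(toy_clean))
```

Passing `subset` keeps a row even when some other column is empty, so birds are only discarded when the rule truly cannot judge them.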

The labels aren’t really hidden

We keep the species column the whole time — we just promise not to use it while building the rule. Treating it as the hidden answer key lets us do something you rarely can in the real world: check our work against the truth and compute an exact accuracy.


Step 2: Summarize Each Measurement by Species

The single most useful table in this project is the mean and standard deviation of each measurement we will use, split by species. groupby plus agg builds it in a single chained expression:

summary = (penguins.groupby("species")[
              ["bill_length_mm", "flipper_length_mm", "body_mass_g"]]
           .agg(["mean", "std"]).round(1))
print(summary)
          bill_length_mm      flipper_length_mm      body_mass_g
                    mean  std              mean  std        mean    std
species
Adelie              38.8  2.7             190.0  6.5      3700.7  458.6
Chinstrap           48.8  3.3             195.8  7.1      3733.1  384.3
Gentoo              47.5  3.1             217.2  6.5      5076.0  504.1

Read this table the way the rest of the project depends on. Three patterns jump out:

  • Gentoo is the giant. Its flippers average 217.2 mm versus ~190–196 mm for the others, and it is over 1,300 g heavier. On size alone, Gentoo stands apart.
  • Adelie has a short bill. At 38.8 mm, its bill is roughly 9 to 10 mm shorter than the other two. That is the cleanest single gap in the table.
  • Chinstrap and Gentoo have nearly identical bill lengths (48.8 vs 47.5 mm) but completely different bodies, so bill length tells Adelie from Chinstrap, while size tells Gentoo from both of the others.

The standard deviations are the second half of the story. A 6.5 mm spread on flipper length is small next to the 27 mm gap between Adelie and Gentoo means — those groups barely overlap. But Adelie and Chinstrap bill lengths sit only 10 mm apart with spreads of ~3 mm each, so their tails will mingle. Already we can predict where a simple rule will struggle.
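One rough way to put a number on "how far apart" is to divide the gap between two means by a typical spread. The sketch below does that with the rounded values from the table. The helper `separation` and the choice of a root-mean-square spread are assumptions made for illustration, not part of the lesson's method.

```python
import math

def separation(mean_a, std_a, mean_b, std_b):
    """Gap between two group means, in units of their typical spread."""
    pooled = math.sqrt((std_a**2 + std_b**2) / 2)  # root-mean-square of the two SDs
    return abs(mean_a - mean_b) / pooled

# Means and standard deviations read from the summary table above.
flipper_adelie_gentoo = separation(190.0, 6.5, 217.2, 6.5)
bill_adelie_chinstrap = separation(38.8, 2.7, 48.8, 3.3)
flipper_adelie_chinstrap = separation(190.0, 6.5, 195.8, 7.1)

print(round(flipper_adelie_gentoo, 1))
print(round(bill_adelie_chinstrap, 1))
print(round(flipper_adelie_chinstrap, 1))
```

Larger values mean less overlap. Flipper length puts Adelie and Gentoo about four spreads apart but Adelie and Chinstrap less than one spread apart, with bill length for Adelie versus Chinstrap landing in between.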


Step 3: Visualize Which Measurements Separate the Species

A summary table hints at separation; a distribution shows it. Plot flipper length as one histogram per species and the picture becomes obvious.

Figure: Overlapping histograms of flipper length for the three penguin species, with a dashed line at 206 mm. Gentoo flippers (green) form a near-separate hump above ~206 mm, while Adelie and Chinstrap overlap heavily on the left.

Gentoo lives almost entirely to the right; Adelie and Chinstrap pile up together on the left. A single vertical cut around 206 mm would peel off nearly every Gentoo while catching almost no one else. That is our first rule.
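The claim about a 206 mm cut can be checked by counting rather than by eye. Here is a sketch on a small invented frame; the numbers are made up, and on the real data you would use `penguins` in place of `toy`.

```python
import pandas as pd

toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "flipper_length_mm": [185.0, 192.0, 196.0, 210.0, 215.0, 222.0],
})

# True where the flipper exceeds the cut, False otherwise.
above_cut = toy["flipper_length_mm"] > 206

# The mean of a True/False column is the share of True values per species.
share_above = above_cut.groupby(toy["species"]).mean()

print(share_above)
```

A good cut shows a share near 1 for the species it is meant to catch and near 0 for the rest.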

But flipper length cannot tell Adelie from Chinstrap — they share the same hump. For that we need bill length:

Figure: Histograms of bill length by species, with a dashed line at 43 mm. Adelie (blue) clusters at short bills around 39 mm; Chinstrap and Gentoo both sit higher and overlap on the right, so a cut near 43 mm separates Adelie from the longer-billed species.

Now Adelie sits clearly on the left with a short bill, and a cut near 43 mm divides it from the longer-billed Chinstrap and Gentoo. Put both measurements on one scatter plot and you can see all three groups pulled into their own corners:

Figure: Scatter plot of bill length (x) against flipper length (y), colored by species, with dashed lines at flipper 206 mm and bill 43 mm. The two cuts carve the plane into a Gentoo region (top), a Chinstrap region (bottom right), and an Adelie region (bottom left).

The scatter plot is the blueprint for our classifier. The horizontal cut at 206 mm separates the high-flippered Gentoo on top. Below that line, the vertical cut at 43 mm separates short-billed Adelie on the left from long-billed Chinstrap on the right. Two cuts, three regions.
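The lesson shows its figures without the plotting code. One way to draw the scatter plot with its two cuts is sketched below, assuming matplotlib is installed. The function name `plot_cuts`, the styling, and the tiny invented frame are illustrative choices, not the lesson's own code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def plot_cuts(df, flipper_cut=206, bill_cut=43):
    """Scatter bill length against flipper length and draw both cuts."""
    fig, ax = plt.subplots()
    for species, group in df.groupby("species"):
        ax.scatter(group["bill_length_mm"], group["flipper_length_mm"],
                   label=species, alpha=0.7)
    ax.axhline(flipper_cut, linestyle="--", color="gray")  # Gentoo above this line
    ax.axvline(bill_cut, linestyle="--", color="gray")     # Adelie left, Chinstrap right
    ax.set_xlabel("bill_length_mm")
    ax.set_ylabel("flipper_length_mm")
    ax.legend()
    return ax

# Invented points, just enough to exercise the function.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Gentoo"],
    "bill_length_mm": [38.0, 40.5, 49.0, 47.0],
    "flipper_length_mm": [188.0, 193.0, 197.0, 218.0],
})
ax = plot_cuts(toy)
```

Call `ax.figure.savefig(...)` to write the image to a file, or drop the `matplotlib.use("Agg")` line and call `plt.show()` when working interactively.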


Step 4: Build a Rule-of-Thumb Classifier

We can now write the rule the plots described, in plain order:

  1. If flipper_length_mm is greater than 206, call it Gentoo (it is large).
  2. Otherwise, if bill_length_mm is greater than 43, call it Chinstrap (long bill).
  3. Otherwise, call it Adelie (short bill, smaller body).

The thresholds are not guesses — they come straight from the distributions in Step 3, sitting in the gaps between the species. Here it is in code. First drop the rows with missing measurements, then apply the rule to every penguin:

clean = penguins.dropna(subset=["bill_length_mm", "flipper_length_mm"]).copy()

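def classify(row):
    """Label one penguin with the two-threshold rule; the first match wins."""
    if row["flipper_length_mm"] > 206:    # rule 1: long flipper
        return "Gentoo"
    elif row["bill_length_mm"] > 43:      # rule 2: long bill
        return "Chinstrap"
    else:                                 # rule 3: everything else
        return "Adelie"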

clean["predicted"] = clean.apply(classify, axis=1)
print(clean["predicted"].value_counts())
predicted
Adelie       146
Gentoo       129
Chinstrap     67

The rule predicts 146 Adelie, 129 Gentoo, and 67 Chinstrap. Compare that to the true counts among the 342 measured birds (151, 123, 68) and the totals are already close, which is a good sign. But matching the totals is not the same as getting each individual bird right, so we have to check properly.

Order matters in a rule chain

Because the checks run top to bottom, the flipper test gets first say. A long-billed Gentoo is caught by rule 1 before the bill rule can mislabel it Chinstrap. When you write a rule chain, put the cleanest, most decisive split first.
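The same ordered chain can be written without a row-by-row function. Here is a sketch using numpy.select, which takes the first condition that is true for each row and so preserves the top-to-bottom order described above. The toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bill_length_mm": [38.5, 49.2, 47.8, 44.0],
    "flipper_length_mm": [186.0, 198.0, 220.0, 205.0],
})

conditions = [
    toy["flipper_length_mm"] > 206,  # rule 1: long flipper
    toy["bill_length_mm"] > 43,      # rule 2: long bill (used only if rule 1 fails)
]
labels = ["Gentoo", "Chinstrap"]

# Rows matching no condition fall through to the default, Adelie.
toy["predicted"] = np.select(conditions, labels, default="Adelie")
print(toy["predicted"].tolist())
```

The third bird has a long bill and a long flipper; because the flipper condition is listed first, it is labeled Gentoo rather than Chinstrap. Note that a missing measurement compares as False and would fall through to Adelie, which is one more reason to drop incomplete rows first.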


Step 5: Check How Often the Rule Is Right

The moment of truth. Accuracy is the share of penguins whose predicted species matches their real one:

$$\text{accuracy} = \frac{\text{number predicted correctly}}{\text{total number of penguins}}$$

Because we kept the species column as our hidden answer key, we can compute it directly:

accuracy = (clean["predicted"] == clean["species"]).mean()
print(round(accuracy * 100, 1))
94.4

The rule is right 94.4% of the time — 323 of 342 penguins — using nothing but two thresholds read off a couple of histograms. Against the 44% you would get by always guessing Adelie, that is a genuine result. But an overall number can hide trouble in one group, so break it down by species:

clean["correct"] = clean["predicted"] == clean["species"]
per_species = (clean.groupby("species")["correct"].mean() * 100).round(1)
print(per_species)
species
Adelie       94.0
Chinstrap    86.8
Gentoo       99.2
Name: correct, dtype: float64

The accuracy is not spread evenly. Gentoo is nearly perfect (99.2%) — exactly what the well-separated flipper histogram promised. Adelie is strong (94.0%), but Chinstrap lags at 86.8%. To see which mistakes the rule makes, build a confusion table — actual species down the rows, predicted across the columns:

print(pd.crosstab(clean["species"], clean["predicted"]))
predicted  Adelie  Chinstrap  Gentoo
species
Adelie        142          7       2
Chinstrap       4         59       5
Gentoo          0          1     122

Every off-diagonal cell is an error. Reading them tells the whole story of where the rule breaks:

  • Adelie ↔ Chinstrap confusion (7 + 4 = 11 errors). This is exactly the overlap we predicted in Step 2. A handful of Adelie have unusually long bills (over 43 mm) and get called Chinstrap; a few Chinstrap have unusually short bills and get called Adelie. Their bill-length distributions touch, so no single cut can separate them cleanly.
  • Chinstrap → Gentoo (5 errors) and Adelie → Gentoo (2 errors). Five Chinstrap and two Adelie have long flippers (over 206 mm) and trip the Gentoo rule. The Chinstrap flipper distribution has a long upper tail that reaches into Gentoo territory.
  • Gentoo almost never misses (1 error). Its size makes it the easiest species to call.

The lesson here is honest and important: the rule is only as good as the gaps between the groups. Where distributions are well separated (Gentoo’s size), the rule is nearly flawless. Where they overlap (Adelie vs Chinstrap bills), no threshold can be perfect, and the errors cluster exactly where the histograms told us they would.
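Every number in this step can be recovered from the confusion table itself. Here is a sketch with the table typed in as an array; rows are actual species and columns are predicted, both in the order Adelie, Chinstrap, Gentoo. The variable names are illustrative.

```python
import numpy as np

species = ["Adelie", "Chinstrap", "Gentoo"]
confusion = np.array([
    [142,  7,   2],   # actual Adelie
    [  4, 59,   5],   # actual Chinstrap
    [  0,  1, 122],   # actual Gentoo
])

correct = np.trace(confusion)   # diagonal cells are the correct calls
total = confusion.sum()
accuracy = correct / total

# Per-species accuracy: correct calls divided by that species' row total.
per_species = np.diag(confusion) / confusion.sum(axis=1)

print(correct, total, round(accuracy * 100, 1))
for name, share in zip(species, per_species):
    print(name, round(share * 100, 1))
```

The off-diagonal cells add up to `total - correct`, which is 19 errors here: 11 between Adelie and Chinstrap, 7 birds wrongly sent to Gentoo, and 1 Gentoo missed.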

Take It Further

The rule is good, but you can push it further with the same descriptive tools:

  • Tune the thresholds. Try shifting the bill cut between 42 and 45 mm and recompute the accuracy. Does nudging it help Chinstrap at Adelie’s expense, or the reverse? Use the confusion table to decide what “better” even means here.
  • Add a third measurement. body_mass_g is another size variable on which Gentoo stands well apart: its mean is over 1,300 g higher than either of the other two. Add a mass check to the Gentoo rule and see whether it rescues any of the misread Chinstrap.
  • Quantify the overlap. For Adelie and Chinstrap bill length, compute how many Adelie fall above 43 mm and how many Chinstrap fall below it. That count is the number of mistakes a 43 mm cut makes on this pair; repeat it for nearby cuts to find the smallest error any single bill threshold can reach.
  • Stratify your check. Compute accuracy separately by sex or island. Does the rule work as well for females as males? Uneven accuracy across subgroups is something every honest analysis should report.
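The first suggestion, tuning the bill cut, can be sketched as a small loop. The helper `rule_accuracy` and the six-row frame are hypothetical and exist only to show the mechanics; run the loop on `clean` to get real numbers.

```python
import numpy as np
import pandas as pd

def rule_accuracy(df, flipper_cut=206, bill_cut=43):
    """Apply the two-cut rule and return the share of correct calls."""
    predicted = np.select(
        [df["flipper_length_mm"] > flipper_cut, df["bill_length_mm"] > bill_cut],
        ["Gentoo", "Chinstrap"],
        default="Adelie",
    )
    return (df["species"] == predicted).mean()

# Invented birds, including one long-billed Adelie and one short-billed Chinstrap.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo"],
    "bill_length_mm": [38.0, 41.0, 44.5, 42.5, 49.0, 47.0],
    "flipper_length_mm": [185.0, 190.0, 195.0, 193.0, 200.0, 218.0],
})

# Try each candidate bill cut while holding the flipper cut at 206 mm.
for cut in [42, 43, 44, 45]:
    print(cut, round(rule_accuracy(toy, bill_cut=cut), 2))
```

On this toy frame a low cut rescues the short-billed Chinstrap and a high cut rescues the long-billed Adelie, but no single cut gets both right. That trade is what the confusion table helps you weigh on the real data.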

Summary

You told three penguin species apart using only descriptive statistics. You profiled the data, summarized every measurement by species, and read the distributions to find which variables separate the groups — flipper length isolates the large Gentoo, bill length splits short-billed Adelie from long-billed Chinstrap. You turned those gaps into a two-threshold rule, applied it to all 342 measured birds, and measured an honest 94.4% accuracy — then used a confusion table to pin the errors on exactly the overlap your summaries had predicted.

Key Concepts

  • Group summaries: groupby with mean and standard deviation reveals which measurements separate groups and which overlap.
  • Separation vs overlap: a variable is useful for classification only where its group distributions don’t blend together.
  • Rule-of-thumb classifier: an ordered chain of threshold checks read directly off the distributions.
  • Accuracy: the share of cases predicted correctly; compare it against a naive baseline.
  • Confusion table: a cross-tabulation of actual versus predicted that shows where a rule fails, not just how often.

Why This Matters

This is descriptive statistics doing real work. Long before anyone reaches for a model, the questions that decide a project are the ones you just answered: which variables actually carry signal, where do groups overlap, and how good is “good enough”? A rule you can read in one sentence, whose every error you can explain, is often more trustworthy than a black box — and the habit of measuring accuracy honestly, then staring at the mistakes, is what separates careful analysis from wishful thinking.


Next Steps

Continue to Module 2 - Measures of Center & Variability (next in the course)

Go deeper on the mean, median, standard deviation, and the summaries you leaned on throughout this project.



Continue Building Your Skills

You just turned a pile of measurements into a working classifier without a single model — only the descriptive tools from this module and the discipline to check your own work. That instinct, to summarize first, look for the gaps, and always measure how often you’re right, will serve you in every analysis you ever do. Onward to measures of center and variability, where these summaries get a proper foundation.