Lesson 1 - Populations and Samples

Welcome to Populations and Samples

Almost every statistic you have ever read — an average salary, an approval rating, a defect rate — was calculated from a small slice of a much larger group. Pollsters do not phone every voter; factories do not test every part. They measure a sample and use it to reason about a population. Getting that relationship right is the foundation everything else in statistics is built on.

In this lesson you will make that idea concrete using a real dataset of penguin measurements, sample from it in Python, and watch first-hand how a sample can mislead you when it is not chosen carefully.

By the end of this lesson, you will be able to:

  • Tell the difference between a population and a sample, and a parameter and a statistic
  • Explain why we sample instead of measuring everything
  • Apply simple random, stratified, cluster, and systematic sampling
  • Describe sampling error and what makes a sample representative

You only need a little Python and pandas. Let’s begin.


Populations and Samples

A population is the entire group you want to draw a conclusion about — every member of it. A sample is the subset you actually collect data from. The whole craft of statistics is using the sample in your hands to say something reliable about the population you cannot fully see.

Figure: a population box (N = 344) of gray dots with a dashed circle enclosing a sample (n = 30) of blue dots. We measure a small sample (statistic x̄) to estimate a parameter (μ) of the whole population we cannot fully observe.

The dataset for this module contains body measurements of penguins from three species near Palmer Station, Antarctica. Load it and look at its size:

import pandas as pd

penguins = pd.read_csv("https://datatweets.com/datasets/penguins.csv")
print(penguins.shape)
print(penguins.head())
(344, 7)
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    MALE
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  FEMALE
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  FEMALE
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  FEMALE

For this lesson, treat these 344 penguins as our entire population. That is a convenient fiction — in reality these birds are themselves a sample of all penguins — but it lets us do something you can almost never do with real populations: calculate the true answer, then see how close a sample gets to it.

Parameters and statistics

When you compute a number from the whole population, it is called a parameter. When you compute the same number from a sample, it is called a statistic. A statistic is your best estimate of the parameter you usually cannot observe directly.

population_mean = penguins["body_mass_g"].mean()
print(round(population_mean, 1))
4201.8

So the population mean body mass is $\mu = 4201.8$ grams. By convention we write a population mean as $\mu$ (the Greek letter “mu”) and a sample mean as $\bar{x}$ (“x-bar”). Keep that distinction in mind: $\mu$ is a fixed fact about the population; $\bar{x}$ changes every time you draw a new sample.
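
Written out, the two means are the same calculation applied to different sets of values. The notation here is the usual textbook convention rather than something this lesson defines: $N$ is the number of population members with a measurement, $n$ is the sample size, and $x_i$ is one individual measurement.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\text{and}\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```

One practical detail: pandas `.mean()` skips missing values by default, so the 4201.8 g above averages only the penguins with a recorded body mass (row 3 in the preview, for example, has none).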


Why We Sample

If we already have all 344 penguins, why not always measure the whole population? Because in the real world you almost never can. Measuring an entire population is usually:

  • Too expensive or slow — surveying every customer or citizen costs more than any project can afford.
  • Impossible — the population may be infinite, or include future members (every part a machine will produce).
  • Destructive — to test whether a match lights or a fuse blows, you destroy it. Test them all and you have nothing left to sell.

A well-chosen sample gives you an answer that is close enough, far faster and cheaper. The catch is in the words well-chosen. A sample is only useful if it is representative — if its makeup mirrors the population. Let’s see what happens when we sample carelessly.

sample = penguins.sample(30, random_state=1)
print(round(sample["body_mass_g"].mean(), 1))
4415.0

This sample of 30 penguins gives a mean of 4415.0 g — about 213 grams heavier than the true 4201.8 g. The sample overshot. To see why, look at which species landed in it:

print(sample["species"].value_counts())
species
Gentoo       13
Adelie       12
Chinstrap     5

The Gentoo is by far the heaviest species, and this sample happened to scoop up a lot of them: 13 of 30, or about 43%, against 36% of the population. That mismatch between sample and population is the heart of the next idea.


Sampling Error

Sampling error is the difference between a sample statistic and the true population parameter, caused simply by the luck of which members you happened to draw. It is not a mistake in your arithmetic — even a perfectly fair sample will rarely match the population exactly.
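
As a worked example using numbers already computed in this lesson (the sample mean of 4415.0 g from the earlier draw of 30 and the population mean of 4201.8 g), the sampling error for that one draw can be written as:

```latex
\text{sampling error} = \bar{x} - \mu = 4415.0 - 4201.8 = 213.2 \text{ g}
```

A positive value means this sample overestimated. A different draw could just as easily give a negative one.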

Figure: two dartboards. On the left, dots scatter randomly around the bullseye (low bias); on the right, dots cluster tightly off to one side (biased). Sampling error is random scatter around the truth; bias is a systematic offset that no amount of averaging removes.

Watch the sample mean bounce around the true value as we change which 30 penguins we draw:

for seed in [1, 2, 7, 42]:
    m = penguins.sample(30, random_state=seed)["body_mass_g"].mean()
    print(seed, round(m, 1))
1 4415.0
2 4085.0
7 4097.4
42 4200.0

Four samples, four different answers — 4415, 4085, 4097, 4200 — all orbiting the true 4201.8 g. Two lessons fall out of this:

  1. A single sample is one draw from a lottery. Any one estimate could be high or low.
  2. The errors are not biased in one direction. They scatter on both sides of the truth, because sample() gives every penguin an equal chance. A fair method produces error that is random, not systematic.
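
Four draws are too few to show the two-sided scatter convincingly, so here is a minimal sketch that repeats the draw a thousand times. It uses only the standard library and an invented toy population (normally distributed values, not the penguin data), so the specific numbers are assumptions made for illustration.

```python
import random
import statistics

# Toy population (invented values, NOT the penguin data): 344 body masses in grams.
rng = random.Random(0)
population = [rng.gauss(4200, 800) for _ in range(344)]
mu = statistics.mean(population)

# Draw many simple random samples of 30 and record each sample mean's error.
errors = []
for _ in range(1000):
    sample = rng.sample(population, 30)  # without replacement, equal chance for all
    errors.append(statistics.mean(sample) - mu)

too_high = sum(e > 0 for e in errors)
too_low = sum(e < 0 for e in errors)
mean_error = statistics.mean(errors)

print("overestimates:", too_high, "underestimates:", too_low)
print("average error:", round(mean_error, 1))
```

With a fair method the overestimates and underestimates should come out roughly balanced, and the average error should be small compared with any single error.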

The danger is bias — error that pushes consistently one way. If we had only ever sampled Gentoo-heavy groups, every estimate would be too high, and no amount of averaging would save us. Good sampling methods exist to prevent exactly that.

Sampling error vs. bias

Sampling error is the random, unavoidable gap between a sample and the population; it tends to shrink as samples get larger. Bias is a systematic gap baked in by a flawed method; it does not shrink with size. A bigger biased sample is just a more confident wrong answer.
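
The contrast can be sketched with a small simulation. Everything here is hypothetical: the toy population of lighter and heavier animals is invented, and the "flawed method" is modelled as one that can only ever reach the heavier animals. The helper name `average_abs_error` is likewise just an illustration.

```python
import random
import statistics

rng = random.Random(1)
# Toy population (invented, not the penguin data): 700 lighter and 300 heavier animals.
light = [rng.gauss(3700, 400) for _ in range(700)]
heavy = [rng.gauss(5000, 500) for _ in range(300)]
population = light + heavy
mu = statistics.mean(population)

def average_abs_error(pool, n, repeats=300):
    """Mean absolute gap between the sample mean and mu over many draws of size n."""
    gaps = [abs(statistics.mean(rng.sample(pool, n)) - mu) for _ in range(repeats)]
    return statistics.mean(gaps)

results = {}
for n in (10, 50, 250):
    fair = average_abs_error(population, n)  # every animal can be drawn
    biased = average_abs_error(heavy, n)     # flawed method: only heavy animals reachable
    results[n] = (fair, biased)
    print(n, "fair:", round(fair), "biased:", round(biased))
```

The fair column should fall as n grows, while the biased column stays stuck near the gap between the heavy group's mean and the true mean.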


Sampling Methods

How you choose your sample determines whether it is representative. Here are the four methods you will meet most often.

Figure: four mini-schematics. Simple random highlights a scattered handful; stratified highlights a few dots from each of three colored layers; cluster highlights two whole groups; systematic highlights every fourth dot in a row. The four sampling methods differ in how they pick members: at random, within proportional strata, by whole clusters, or every k-th in order.

Simple random sampling

In simple random sampling (SRS), every member of the population has an equal chance of being selected, and every possible sample of a given size is equally likely. It is the gold standard for fairness and what sample() does by default.

srs = penguins.sample(30, random_state=42)
print(round(srs["body_mass_g"].mean(), 1))
4200.0

SRS is unbiased, but a small sample can still miss: by chance it might draw too many Gentoos and too few Chinstraps, as we saw. When a population has distinct subgroups, we can do better.
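
"Unbiased" has a precise meaning that a tiny example makes visible: if you list every possible sample and average their means, you land exactly on the population mean. The sketch below does this for a made-up population of five values, small enough to enumerate; the numbers are purely illustrative.

```python
from itertools import combinations
from statistics import mean

# Tiny made-up population of five body masses in grams (for illustration only).
population = [3200, 3700, 4100, 4600, 5400]
mu = mean(population)

# Under SRS every one of these size-2 samples is equally likely.
all_samples = list(combinations(population, 2))
sample_means = [mean(s) for s in all_samples]

print(len(all_samples), "possible samples")
print("population mean:", mu)
print("average of all sample means:", mean(sample_means))
print("lowest and highest sample mean:", min(sample_means), max(sample_means))
```

Individual samples range well below and above the population mean, yet their average matches it. That is the sense in which SRS is unbiased even though any single draw can miss.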

Stratified sampling

In stratified sampling, you split the population into non-overlapping groups called strata, then sample from each one, usually in proportion to its size. This guarantees every subgroup is represented. Species is a natural stratifying variable here: each species forms one stratum.

First, the population proportions:

print((penguins["species"].value_counts(normalize=True) * 100).round(1))
species
Adelie       44.2
Gentoo       36.0
Chinstrap    19.8

A proportional sample of 30 should therefore contain about 13 Adelie, 11 Gentoo, and 6 Chinstrap. Draw it by sampling within each species:

stratified = penguins.groupby("species", group_keys=False).sample(
    frac=30/len(penguins), random_state=1)
print(stratified["species"].value_counts())
species
Adelie       13
Gentoo       11
Chinstrap     6

Now the sample mirrors the population’s species mix by design, so a heavy-Gentoo accident cannot happen. Stratifying removes a known source of error instead of leaving it to luck.
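
To get a feel for how much stratifying can help, the sketch below compares the spread of estimates from simple random and proportionally stratified samples of 30. It runs on invented data: only the 44.2 / 36.0 / 19.8 percent mix is taken from this lesson, while the stratum labels and mass values are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(2)
# Toy strata (invented values; only the 44.2 / 36.0 / 19.8 percent mix comes from the lesson).
strata = {
    "A": [rng.gauss(3700, 450) for _ in range(442)],
    "B": [rng.gauss(5100, 500) for _ in range(360)],
    "C": [rng.gauss(3750, 400) for _ in range(198)],
}
population = [x for group in strata.values() for x in group]
n = 30

# Proportional allocation: each stratum gets n * (stratum size / population size), rounded.
allocation = {name: round(n * len(group) / len(population)) for name, group in strata.items()}

def srs_mean():
    """Mean of one simple random sample of size n."""
    return statistics.mean(rng.sample(population, n))

def stratified_mean():
    """Mean of one proportionally stratified sample of size n."""
    picked = []
    for name, group in strata.items():
        picked.extend(rng.sample(group, allocation[name]))
    return statistics.mean(picked)

srs_spread = statistics.stdev([srs_mean() for _ in range(500)])
strat_spread = statistics.stdev([stratified_mean() for _ in range(500)])

print(allocation)
print("SRS spread:", round(srs_spread), "stratified spread:", round(strat_spread))
```

The allocation reproduces the 13 / 11 / 6 split, and the stratified estimates should vary noticeably less from draw to draw. The gain depends on how different the strata are: if the groups had similar means, stratifying would help very little.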

Cluster sampling

In cluster sampling, you divide the population into many groups called clusters, randomly pick a few entire clusters, and measure everyone inside them. It trades some accuracy for huge convenience: instead of chasing individuals scattered everywhere, you study a handful of whole groups. Here the three islands make natural clusters.

import numpy as np
rng = np.random.default_rng(0)
chosen = rng.choice(penguins["island"].unique(), size=2, replace=False)
cluster_sample = penguins[penguins["island"].isin(chosen)]
print(chosen, "->", len(cluster_sample), "penguins")
['Biscoe' 'Dream'] -> 292 penguins

A pollster who samples two whole neighborhoods instead of scattered households is cluster sampling. It is cheap, but if clusters differ from one another, picking only a couple can skew the result.
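
The skew is easy to see when there are only a few clusters, because every possible choice can be listed. The following sketch uses three made-up clusters with deliberately different typical masses; the names and values are hypothetical and are not the islands' real data.

```python
from itertools import combinations
from statistics import mean

# Three made-up clusters with different typical masses (illustrative values only).
clusters = {
    "north": [3500, 3600, 3700, 3800],
    "east": [4100, 4200, 4300, 4400],
    "south": [5000, 5100, 5200, 5300],
}
mu = mean([x for values in clusters.values() for x in values])

# Each way of choosing two whole clusters gives a different estimate.
estimates = {}
for pair in combinations(clusters, 2):
    members = [x for name in pair for x in clusters[name]]
    estimates[pair] = mean(members)
    print(pair, "estimate:", estimates[pair], "error:", estimates[pair] - mu)
```

Depending on which pair is drawn, the estimate lands 400 g low, 50 g high, or 350 g high. Sampling more clusters, or clusters that each resemble the whole population, narrows that range.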

Systematic sampling

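In systematic sampling, you order the population and take every $k$-th member after a random start. To draw 30 from 344 you would use a step of $k = 344 / 30 \approx 11$. For simplicity the code here starts at row 0 rather than at a random row, and because 344 is not an exact multiple of 11 it returns 32 rows instead of 30: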

k = len(penguins) // 30
systematic = penguins.iloc[::k]
print(len(systematic), "penguins, every", k, "th row")
32 penguins, every 11 th row

Systematic sampling is simple and spreads the sample evenly across the ordered list. Its one trap: if the ordering has a hidden repeating pattern that lines up with your step, the sample can become badly biased.
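
Here is a minimal sketch of both points: a random start, and the periodicity trap. It works on plain Python lists rather than the penguin DataFrame, and the function name `systematic_sample` and the repeating list are made up for illustration. With an integer step the sample size can vary slightly with the start (31 or 32 here), which the sketch leaves as is.

```python
import random

def systematic_sample(items, n, rng):
    """Take every k-th item after a random start, where k = len(items) // n."""
    k = len(items) // n        # step between picks
    start = rng.randrange(k)   # random start in 0 .. k-1
    return items[start::k]

rng = random.Random(3)
rows = list(range(344))        # stand-in for row positions 0..343
picked = systematic_sample(rows, 30, rng)
print(len(picked), "rows, first few:", picked[:4])

# The trap: a made-up list whose values repeat every 11 rows, matching the step k = 11.
periodic = [5000 if i % 11 == 0 else 4000 for i in range(344)]
estimates = {}
for start in range(11):
    chosen = periodic[start::11]
    estimates[start] = sum(chosen) / len(chosen)
print(estimates)
```

Because the pattern's period equals the step, a start of 0 picks only the 5000 values and every other start picks only the 4000 values. No single systematic sample reflects the true mix.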


Practice Exercises

Exercise 1: Estimate a different parameter

The population mean flipper length is a parameter. Estimate it from a simple random sample of 50 penguins (use random_state=10), then compare it to the true value computed from the whole dataset. How large is the sampling error?

Hint

Compute the parameter with penguins["flipper_length_mm"].mean() and the statistic with penguins.sample(50, random_state=10)["flipper_length_mm"].mean(). The sampling error is the difference between them.

Exercise 2: Stratify by sex

Draw a sample of 40 penguins stratified by the sex column instead of species. Check that the proportion of males and females in your sample matches the population. Why might stratifying by sex matter when estimating average body mass?

Hint

Drop missing values first with clean = penguins.dropna(subset=["sex"]), since some rows have no recorded sex, then use clean.groupby("sex", group_keys=False).sample(frac=40/len(clean), random_state=1).

Exercise 3: See the law of large numbers

Draw simple random samples of size 10, 50, and 150 and compare each sample mean to the true population mean of 4201.8 g. What happens to the sampling error as the sample grows?

Hint

Loop over [10, 50, 150], draw penguins.sample(n, random_state=1), and print the absolute difference from 4201.8. Larger samples should sit closer to the true value, on average.


Summary

You met the central relationship in statistics: we measure a sample to learn about a population, estimating an unknown parameter with a sample statistic. Because a sample is just one draw, it carries sampling error — random, two-sided, and shrinking with size — which is very different from bias, the one-sided error a flawed method bakes in. Good sampling methods (simple random, stratified, cluster, systematic) exist to keep samples representative so your estimates stay honest.

Key Concepts

  • Population — the entire group you want to study.
  • Sample — the subset you actually measure.
  • Parameter — a number describing a population (e.g. $\mu$, the population mean).
  • Statistic — a number describing a sample (e.g. $\bar{x}$, the sample mean), used to estimate a parameter.
  • Sampling error — the random gap between a statistic and its parameter.
  • Bias — a systematic gap caused by a flawed sampling method.
  • Stratified sampling — sampling within each subgroup (stratum), usually in proportion to its size, to guarantee representation.

Why This Matters

Every dataset you will ever analyze is a sample of something larger, and every model you train assumes that sample resembles the world you will deploy it in. Knowing when a sample is representative — and when sampling error or bias is quietly steering your numbers — is what separates a conclusion you can trust from one that just sounds confident.


Next Steps

Continue to Lesson 2 - Variables and Measurement Scales

Learn how to classify variables and why their measurement scale decides what math you are allowed to do.



Continue Building Your Skills

You now know what a sample really is and how to draw one fairly — the quiet decision that every statistic downstream depends on. Next you will learn to look at a single variable and ask the question that shapes every analysis: what kind of data is this, and what is it allowed to tell me?