Lesson 1 - Populations and Samples
Welcome to Populations and Samples
Almost every statistic you have ever read — an average salary, an approval rating, a defect rate — was calculated from a small slice of a much larger group. Pollsters do not phone every voter; factories do not test every part. They measure a sample and use it to reason about a population. Getting that relationship right is the foundation everything else in statistics is built on.
In this lesson you will make that idea concrete using a real dataset of penguin measurements, sample from it in Python, and watch first-hand how a sample can mislead you when it is not chosen carefully.
By the end of this lesson, you will be able to:
- Tell the difference between a population and a sample, and between a parameter and a statistic
- Explain why we sample instead of measuring everything
- Apply simple random, stratified, cluster, and systematic sampling
- Describe sampling error and what makes a sample representative
You only need a little Python and pandas. Let’s begin.
Populations and Samples
A population is the entire group you want to draw a conclusion about — every member of it. A sample is the subset you actually collect data from. The whole craft of statistics is using the sample in your hands to say something reliable about the population you cannot fully see.
The dataset for this module contains body measurements of penguins from three species near Palmer Station, Antarctica. Load it and look at its size:
import pandas as pd
penguins = pd.read_csv("https://datatweets.com/datasets/penguins.csv")
print(penguins.shape)
print(penguins.head())

(344, 7)
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    MALE
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  FEMALE
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  FEMALE
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  FEMALE

For this lesson, treat these 344 penguins as our entire population. That is a convenient fiction (in reality these birds are themselves a sample of all penguins), but it lets us do something you can almost never do with real populations: calculate the true answer, then see how close a sample gets to it.
Parameters and statistics
When you compute a number from the whole population, it is called a parameter. When you compute the same number from a sample, it is called a statistic. A statistic is your best estimate of the parameter you usually cannot observe directly.
population_mean = penguins["body_mass_g"].mean()
print(round(population_mean, 1))

4201.8

So the population mean body mass is 4201.8 grams. (pandas skips missing values, shown as NaN in the table above, when it computes a mean.) By convention we write a population mean as $\mu$ (the Greek letter “mu”) and a sample mean as $\bar{x}$ (“x-bar”). Keep that distinction in mind: $\mu$ is a fixed fact about the population; $\bar{x}$ changes every time you draw a new sample.
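Written out, the parameter and the statistic use the same formula over different sets of values. As a sketch, assume the population has $N$ measured members with values $x_1, \dots, x_N$, and let $S$ be the set of $n$ members that ended up in the sample (the symbols $N$, $n$, and $S$ are notation introduced here for illustration):

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
\qquad\qquad
\bar{x} = \frac{1}{n} \sum_{i \in S} x_i
```

The left-hand sum runs over everyone, so it has exactly one value. The right-hand sum depends on which members fall in $S$, which is why $\bar{x}$ moves from sample to sample.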
Why We Sample
If we already have all 344 penguins, why not always measure the whole population? Because in the real world you almost never can. Measuring an entire population is usually:
- Too expensive or slow — surveying every customer or citizen costs more than any project can afford.
- Impossible — the population may be infinite, or include future members (every part a machine will produce).
- Destructive — to test whether a match lights or a fuse blows, you destroy it. Test them all and you have nothing left to sell.
A well-chosen sample gives you an answer that is close enough, far faster and cheaper. The catch is in the words well-chosen. A sample is only useful if it is representative, meaning its makeup mirrors the population. Let’s see how far a single small sample can drift, even when it is drawn at random.
sample = penguins.sample(30, random_state=1)
print(round(sample["body_mass_g"].mean(), 1))

4415.0

This sample of 30 penguins gives a mean of 4415.0 g, about 213 grams heavier than the true 4201.8 g. The sample overshot. To see why, look at which species landed in it:
print(sample["species"].value_counts())

species
Gentoo       13
Adelie       12
Chinstrap     5

The Gentoo is by far the heaviest species, and this sample happened to scoop up a lot of them: 13 of 30, about 43%, compared with 36% of the population. That mismatch between sample and population is the heart of the next idea.
Sampling Error
Sampling error is the difference between a sample statistic and the true population parameter, caused simply by the luck of which members you happened to draw. It is not a mistake in your arithmetic — even a perfectly fair sample will rarely match the population exactly.
Watch the sample mean bounce around the true value as we change which 30 penguins we draw:
for seed in [1, 2, 7, 42]:
    # Each seed selects a different set of 30 penguins.
    m = penguins.sample(30, random_state=seed)["body_mass_g"].mean()
    print(seed, round(m, 1))

1 4415.0
2 4085.0
7 4097.4
42 4200.0

Four samples, four different answers (4415, 4085, 4097, 4200), all orbiting the true 4201.8 g. Two lessons fall out of this:
- A single sample is one draw from a lottery. Any one estimate could be high or low.
- The errors are not biased in one direction. They scatter on both sides of the truth, because sample() gives every penguin an equal chance. A fair method produces error that is random, not systematic.
The danger is bias — error that pushes consistently one way. If we had only ever sampled Gentoo-heavy groups, every estimate would be too high, and no amount of averaging would save us. Good sampling methods exist to prevent exactly that.
Sampling error vs. bias
Sampling error is the random, unavoidable gap between a sample and the population — it shrinks as samples get larger. Bias is a systematic gap baked in by a flawed method — it does not shrink with size. A bigger biased sample is just a more confident wrong answer.
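The contrast can be checked with a small simulation. This is a minimal sketch on a synthetic population, not the penguin data: the 10,000 normally distributed "masses", the 200 repetitions, and the deliberately flawed "heavy half only" method are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 10,000 synthetic masses in grams (not the penguin data).
population = rng.normal(loc=4200, scale=800, size=10_000)
mu = population.mean()

# A deliberately flawed method: only members above the median can be chosen.
heavy_half = population[population > np.median(population)]

def typical_error(pool, n, repeats=200):
    """Average absolute gap between a sample mean and mu over many samples."""
    means = [rng.choice(pool, size=n, replace=False).mean() for _ in range(repeats)]
    return float(np.mean(np.abs(np.array(means) - mu)))

results = {}
for n in (30, 300, 3000):
    results[n] = {
        "fair": typical_error(population, n),    # simple random sampling
        "biased": typical_error(heavy_half, n),  # flawed sampling frame
    }
    print(n, round(results[n]["fair"], 1), round(results[n]["biased"], 1))
```

Under these assumptions the "fair" column should fall steadily as n grows, while the "biased" column should stay roughly where it started: the flawed frame keeps returning the mean of the heavy half no matter how many members are drawn.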
Sampling Methods
How you choose your sample determines whether it is representative. Here are the four methods you will meet most often.
Simple random sampling
In simple random sampling (SRS), every member of the population has an equal chance of being selected, and every possible sample of a given size is equally likely. It is the gold standard for fairness and what sample() does by default.
srs = penguins.sample(30, random_state=42)
print(round(srs["body_mass_g"].mean(), 1))

4200.0

SRS is unbiased, but with small samples it can still miss: by chance it might draw too many of one species and too few of another, as we saw with the Gentoo-heavy sample. When a population has distinct subgroups, we can do better.
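The "equal chance" property of simple random sampling can be checked empirically. The sketch below uses a hypothetical population of 20 positions and draws 5 without replacement many times. The sizes and the repetition count are arbitrary choices for illustration, and NumPy's Generator.choice stands in for the sampling step.

```python
import numpy as np

rng = np.random.default_rng(42)

N, n, draws = 20, 5, 20_000   # hypothetical population size, sample size, repetitions
counts = np.zeros(N, dtype=int)

for _ in range(draws):
    # One simple random sample: n distinct positions out of N, all equally likely.
    chosen = rng.choice(N, size=n, replace=False)
    counts[chosen] += 1

# Fraction of samples in which each member appeared.
inclusion_rate = counts / draws
print(inclusion_rate.round(3))
```

Each entry should sit close to n / N = 0.25, so no member is favoured. Any single sample can still be lopsided; the fairness is a property of the method, not of each draw.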
Stratified sampling
In stratified sampling, you split the population into non-overlapping groups called strata, then sample from each one, usually in proportion to its size. This guarantees every subgroup is represented. Species is a natural stratifying variable here: each species forms one stratum.
First, the population proportions:
print((penguins["species"].value_counts(normalize=True) * 100).round(1))

species
Adelie       44.2
Gentoo       36.0
Chinstrap    19.8

A proportional sample of 30 should therefore contain about 13 Adelie, 11 Gentoo, and 6 Chinstrap. Draw it by sampling within each species:
stratified = penguins.groupby("species", group_keys=False).sample(
    frac=30/len(penguins), random_state=1)
print(stratified["species"].value_counts())

species
Adelie       13
Gentoo       11
Chinstrap     6

Now the sample mirrors the population’s species mix by design, so a heavy-Gentoo accident cannot happen. Stratifying removes a known source of error instead of leaving it to luck.
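One way to see what stratifying buys is to repeat both methods many times and compare how much the estimates wobble. The following is a sketch on a synthetic population, not the penguin data: the three groups "A", "B", "C", their sizes, and their typical masses are made up, and 300 repetitions is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical population: three groups with different sizes and typical masses.
sizes = {"A": 150, "B": 120, "C": 70}
centers = {"A": 3700, "B": 5100, "C": 3750}
population = pd.DataFrame({
    "group": np.repeat(list(sizes), list(sizes.values())),
    "mass": np.concatenate([rng.normal(centers[g], 400, sizes[g]) for g in sizes]),
})

n = 34                        # a 10% sample of the 340 rows
frac = n / len(population)

srs_means, strat_means = [], []
for seed in range(300):
    # Simple random sample: group counts are left to chance.
    srs_means.append(population.sample(n, random_state=seed)["mass"].mean())
    # Proportional stratified sample: group counts are fixed at 15, 12, and 7.
    strat = population.groupby("group", group_keys=False).sample(
        frac=frac, random_state=seed)
    strat_means.append(strat["mass"].mean())

srs_spread = float(np.std(srs_means))
strat_spread = float(np.std(strat_means))
print(round(srs_spread, 1), round(strat_spread, 1))
```

With groups this different, the stratified means should cluster noticeably more tightly around the truth than the simple random ones. The gain comes entirely from the between-group differences; if the groups had similar masses, the two spreads would be close.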
Cluster sampling
In cluster sampling, you divide the population into many groups called clusters, randomly pick a few entire clusters, and measure everyone inside them. It trades some accuracy for huge convenience: instead of chasing individuals scattered everywhere, you study a handful of whole groups. Here the three islands make natural clusters.
import numpy as np
rng = np.random.default_rng(0)
# Pick two of the three islands at random, then keep every penguin on them.
chosen = rng.choice(penguins["island"].unique(), size=2, replace=False)
cluster_sample = penguins[penguins["island"].isin(chosen)]
print(chosen, "->", len(cluster_sample), "penguins")

['Biscoe' 'Dream'] -> 292 penguins

With only three islands, two clusters already cover 292 of the 344 penguins; real cluster designs usually pick a few clusters out of many. A pollster who samples two whole neighborhoods instead of scattered households is cluster sampling. It is cheap, but if clusters differ from one another, picking only a couple can skew the result.
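The risk from clusters that differ can be made concrete by listing every possible choice of two clusters. The sketch below uses a tiny made-up population of nine members in three hypothetical clusters "X", "Y", and "Z", not the penguin islands.

```python
from itertools import combinations

import pandas as pd

# Hypothetical population: three clusters with different sizes and typical values.
population = pd.DataFrame({
    "cluster": ["X"] * 4 + ["Y"] * 3 + ["Z"] * 2,
    "mass": [4700, 4800, 4600, 4900, 3700, 3800, 3600, 3700, 3500],
})
true_mean = population["mass"].mean()

# Enumerate every way of picking two whole clusters and measure everyone inside.
estimates = {}
for pair in combinations(sorted(population["cluster"].unique()), 2):
    picked = population[population["cluster"].isin(pair)]
    estimates[pair] = picked["mass"].mean()
    print(pair, len(picked), round(estimates[pair], 1))

print("true mean:", round(true_mean, 1))
```

With these made-up numbers the three possible samples give means of 4300.0, about 4366.7, and 3660.0, against a true mean of about 4144.4. Whether the heavy cluster "X" is included decides which side of the truth the estimate lands on.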
Systematic sampling
In systematic sampling, you order the population and take every $k$-th member after a random start. To draw 30 from 344 you would use a step of $k = \lfloor 344 / 30 \rfloor = 11$:
k = len(penguins) // 30
systematic = penguins.iloc[::k]
print(len(systematic), "penguins, every", k, "th row")

32 penguins, every 11 th row

For simplicity this snippet starts at row 0 rather than at a random row between 0 and k - 1, and it returns 32 rows instead of 30 because the step was rounded down (30 × 11 = 330, which is less than 344). Systematic sampling is simple and spreads the sample evenly across the ordered list. Its one trap: if the ordering has a hidden repeating pattern that lines up with your step, the sample can become badly biased.
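Both the random start and the periodicity trap can be sketched with row positions alone. The population size and step below match the lesson (344 and 11), but the "spike every 11th row" list is a made-up worst case, and the variable names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, n = 344, 30
k = N // n                            # step size: 11
start = int(rng.integers(0, k))       # random start somewhere in 0 .. k-1
positions = np.arange(start, N, k)    # row positions a systematic sample would keep
print(start, len(positions))

# The trap: a hypothetical ordered list whose values repeat with period k.
# Every k-th value is a spike of 100; everything else is 0.
periodic = np.where(np.arange(N) % k == 0, 100, 0)
true_mean = periodic.mean()

# Mean of the systematic sample for each possible starting row.
sample_means = {s: periodic[s::k].mean() for s in range(k)}
print(round(true_mean, 2), sample_means[0], sample_means[1])
```

With a pandas DataFrame, the positions array could be passed to iloc to pull the rows. In the periodic list, a start of 0 hits only spikes (mean 100) and every other start misses them all (mean 0), while the true mean is about 9.3, so no starting row gives a reasonable estimate.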
Practice Exercises
Exercise 1: Estimate a different parameter
The population mean flipper length is a parameter. Estimate it from a simple random sample of 50 penguins (use random_state=10), then compare it to the true value computed from the whole dataset. How large is the sampling error?
Hint
Compute the parameter with penguins["flipper_length_mm"].mean() and the statistic with penguins.sample(50, random_state=10)["flipper_length_mm"].mean(). The sampling error is the difference between them.
Exercise 2: Stratify by sex
Draw a sample of 40 penguins stratified by the sex column instead of species. Check that the proportion of males and females in your sample matches the population. Why might stratifying by sex matter when estimating average body mass?
Hint
Drop missing values first with clean = penguins.dropna(subset=["sex"]), since some rows have no recorded sex, then use clean.groupby("sex", group_keys=False).sample(frac=40/len(clean), random_state=1).
Exercise 3: See the law of large numbers
Draw simple random samples of size 10, 50, and 150 and compare each sample mean to the true population mean of 4201.8 g. What happens to the sampling error as the sample grows?
Hint
Loop over [10, 50, 150], draw penguins.sample(n, random_state=1), and print the absolute difference from 4201.8. Larger samples should sit closer to the true value, on average.
Summary
You met the central relationship in statistics: we measure a sample to learn about a population, estimating an unknown parameter with a sample statistic. Because a sample is just one draw, it carries sampling error — random, two-sided, and shrinking with size — which is very different from bias, the one-sided error a flawed method bakes in. Good sampling methods (simple random, stratified, cluster, systematic) exist to keep samples representative so your estimates stay honest.
Key Concepts
- Population: the entire group you want to study.
- Sample: the subset you actually measure.
- Parameter: a number describing a population (e.g. $\mu$, the population mean).
- Statistic: a number describing a sample (e.g. $\bar{x}$, the sample mean), used to estimate a parameter.
- Sampling error: the random gap between a statistic and its parameter.
- Bias: a systematic gap caused by a flawed sampling method.
- Stratified sampling: sampling from every subgroup, usually in proportion to its size, to guarantee representation.
Why This Matters
Every dataset you will ever analyze is a sample of something larger, and every model you train assumes that sample resembles the world you will deploy it in. Knowing when a sample is representative — and when sampling error or bias is quietly steering your numbers — is what separates a conclusion you can trust from one that just sounds confident.
Next Steps
Continue to Lesson 2 - Variables and Measurement Scales
Learn how to classify variables and why their measurement scale decides what math you are allowed to do.
Back to Module Overview
Return to the Statistics Fundamentals module overview
Continue Building Your Skills
You now know what a sample really is and how to draw one fairly — the quiet decision that every statistic downstream depends on. Next you will learn to look at a single variable and ask the question that shapes every analysis: what kind of data is this, and what is it allowed to tell me?