Your First Machine Learning Model: A Hands-On scikit-learn Tutorial

A complete first pass at supervised learning: explore a real health dataset with pandas, chart a relationship with matplotlib, split it honestly, and train and grade a scikit-learn regression model.

“Given some numbers I already have about something, can I predict a number I don’t have yet?” That question — not “what is machine learning” — is the one that actually gets answered the first time you train a model. A patient’s baseline health measurements predicting how their condition progresses. A house’s size and location predicting its price. The mechanics are the same no matter what the numbers describe.

The confusing part isn’t the idea, it’s the vocabulary that shows up all at once: fit, predict, train_test_split, features, targets. It’s easy to copy-paste your way through a tutorial without ever building a mental model of what those pieces are actually doing to your data. (If you’ve already trained a text classifier with spaCy, the shape of this will feel familiar — our post on text classification with spaCy walks through the same fit-then-evaluate rhythm for labeled text instead of numbers.) This guide builds that mental model first, then walks through a real, reproducible example end to end: load and explore data with pandas, visualize it with matplotlib, and train and grade a model with scikit-learn.

The Mental Model: Study, Then Quiz

Every supervised learning problem, regardless of the algorithm, follows the same four-step discipline:

  1. Split your labeled examples into a study set and a quiz set, and don’t let the model see the quiz set’s answers.
  2. Fit the model on the study set — show it examples paired with their correct answers, over and over, until it works out the general relationship between them.
  3. Predict on the quiz set — ask the model to guess the answers for examples it has never seen.
  4. Evaluate the guesses against the real answers, which you kept hidden the whole time.

Much of the discipline of machine learning comes down to making sure step 4 is an honest quiz. A model that got to peek at the answers during step 2 will look brilliant and then fail the moment it meets real, new data. Keep this mental model in mind: features go in, a fitted relationship comes out, and every claim about how “good” a model is only means something if it was graded on examples it never studied.
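To make the "peeking" failure concrete, here is a minimal sketch on the same bundled diabetes data the rest of this post uses. The unconstrained decision tree is my own choice for illustration, not the model this post trains; it is flexible enough to memorize its study set, which makes the gap between the practice score and the quiz score easy to see.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With no depth limit, the tree keeps splitting until it reproduces
# the training answers (an illustrative "memorizer", not a recommended model).
memorizer = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = memorizer.score(X_train, y_train)  # graded on the study set
test_r2 = memorizer.score(X_test, y_test)     # graded on the quiz set
print(f"R2 on study set: {train_r2:.3f}")
print(f"R2 on quiz set:  {test_r2:.3f}")
```

The study-set score should come out at or very near a perfect 1.0, while the quiz-set score should be far lower. Only the second number says anything about new patients.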

A Dataset You Can Reproduce

Rather than downloading anything, we’ll use a dataset that ships with scikit-learn itself, so your numbers will match mine exactly with nothing more than pip install scikit-learn pandas matplotlib. Imagine you work with a small diabetes research clinic that wants a quick, defensible way to flag which patients are likely to see their condition progress the most over the next year, using only the measurements already in each patient’s chart.

Data: the diabetes dataset bundled with scikit-learn (originally from Efron, Hastie, Johnstone & Tibshirani’s 2004 “Least Angle Regression” paper; public research data, redistributed under scikit-learn’s BSD license), loaded via load_diabetes().

import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)
df = diabetes.frame
df[["age", "sex", "bmi", "bp", "s1", "target"]].head()
        age       sex       bmi        bp        s1  target
0  0.038076  0.050680  0.061696  0.021872 -0.044223   151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449    75.0
2  0.085299  0.050680  0.044451 -0.005670 -0.045599   141.0
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191   206.0
4  0.005383 -0.044642 -0.036385  0.021872  0.003935   135.0

Notice the feature values don’t look like real ages or blood pressures: they’re small floats centered on zero. That’s not missing or corrupted data. scikit-learn ships this dataset with every one of its ten baseline features (age, sex, bmi, average blood pressure bp, and six blood serum measurements s1 through s6) already mean-centered and scaled by the standard deviation times the square root of the number of samples, so each column’s squared values sum to 1. Those statistics were computed across all 442 patients, before any train/test split exists. Keep that detail in mind; it becomes relevant again in the gotchas section. target is untouched: a quantitative measure of disease progression one year after the baseline measurements were taken, on a continuous scale that runs from 25 to 346 in this data.
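If you would rather check the scaling than take it on faith, a short sketch like the one below works. It relies on scikit-learn's documented description of this dataset (features mean-centered, then scaled so each column's sum of squares is 1); the variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X = load_diabetes(as_frame=True).data  # the ten feature columns only

col_means = X.mean()          # should all be essentially zero
col_sum_sq = (X ** 2).sum()   # should all be essentially one

print(col_means.round(6))
print(col_sum_sq.round(6))

# With a sum of squares of 1 over n rows, the spread of each column
# is about 1 / sqrt(n) rather than 1.
print("1 / sqrt(n) =", round(1 / np.sqrt(len(X)), 6))
```

That last figure, roughly 0.048 for 442 rows, is why the features look so small compared with a conventional standardization to unit standard deviation.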

df.shape
(442, 11)

442 patients, 10 features, 1 target column. Small enough to explore by hand, real enough that the relationships are worth predicting. (The outputs in this post come from pandas 3.0, scikit-learn 1.9, and matplotlib 3.11.)

df[["age", "bmi", "bp", "target"]].describe().round(3)
           age      bmi       bp   target
count  442.000  442.000  442.000  442.000
mean    -0.000   -0.000   -0.000  152.133
std      0.048    0.048    0.048   77.093
min     -0.107   -0.090   -0.112   25.000
25%     -0.037   -0.034   -0.037   87.000
50%      0.005   -0.007   -0.006  140.500
75%      0.038    0.031    0.036  211.500
max      0.111    0.171    0.132  346.000

The feature columns all share the same near-zero mean and tiny standard deviation — that’s the scaling described above. target, on the other hand, has a real mean of about 152 and ranges from 25 to 346. Predicting that number is the job.

Find a Feature Worth Plotting

Before training anything, it’s worth asking which features actually move together with the target. corr() gives every feature’s linear correlation with target in one line:

df.corr(numeric_only=True)["target"].sort_values(ascending=False)
target    1.000000
bmi       0.586450
s5        0.565883
bp        0.441482
s4        0.430453
s6        0.382483
s1        0.212022
age       0.187889
s2        0.174054
sex       0.043062
s3       -0.394789
Name: target, dtype: float64

bmi (body mass index) has the strongest relationship with disease progression, narrowly ahead of s5. That’s a reasonable place to start exploring visually before handing all ten features to a model.
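For readers wondering what corr() computes, the sketch below reproduces the bmi figure by hand using the standard Pearson formula, then re-ranks the features by absolute value. The re-ranking is my addition: a signed sort pushes negative correlations to the bottom even when they are fairly strong.

```python
import numpy as np
from sklearn.datasets import load_diabetes

df = load_diabetes(as_frame=True).frame

# Pearson correlation by hand: shared variation divided by the product of the spreads.
bmi_dev = df["bmi"] - df["bmi"].mean()
target_dev = df["target"] - df["target"].mean()
r_bmi = (bmi_dev * target_dev).sum() / np.sqrt(
    (bmi_dev ** 2).sum() * (target_dev ** 2).sum()
)
print(f"bmi vs target, by hand: {r_bmi:.6f}")

# Rank by strength regardless of sign, so negative relationships are not buried.
strength = (
    df.corr(numeric_only=True)["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(strength.head(5))
```

Going by the values printed above, s3's correlation of -0.394789 is stronger in magnitude than s6's 0.382483, even though the signed sort lists s3 last.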

Visualize the Relationship with matplotlib

Grouping patients into BMI bands makes the relationship easy to see. pd.qcut splits bmi into three roughly equal-sized groups (terciles) by rank:

df["bmi_group"] = pd.qcut(df["bmi"], q=3, labels=["Low BMI", "Medium BMI", "High BMI"])
bmi_summary = df.groupby("bmi_group", observed=True)["target"].mean().round(1)
bmi_summary
bmi_group
Low BMI       105.8
Medium BMI    145.0
High BMI      207.7
Name: target, dtype: float64

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
bmi_summary.plot(kind="bar", ax=ax, color="#0067c0")
ax.set_ylabel("Average disease progression score")
ax.set_title("Disease progression by baseline BMI group")
fig.tight_layout()
Figure: bar chart of average disease progression score by baseline BMI group (105.8 for Low BMI, 145.0 for Medium BMI, 207.7 for High BMI), based on 442 patients split into tercile groups.

Average disease progression roughly doubles between the lowest and highest BMI groups — 105.8 up to 207.7. That’s a real, sizeable relationship, and exactly the kind of pattern a model like linear regression is built to quantify precisely instead of eyeballing in three buckets.
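As a preview of what "quantify precisely" means, here is a rough sketch that fits a straight line to bmi alone. It is a descriptive fit on all 442 rows, meant only to summarize the chart, not an evaluated model; the honest split-then-grade workflow follows in the next sections. The name bmi_only is my own.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

df = load_diabetes(as_frame=True).frame

# Descriptive only: fitted on every row, so this score is not an honest grade.
bmi_only = LinearRegression().fit(df[["bmi"]], df["target"])

slope = bmi_only.coef_[0]
intercept = bmi_only.intercept_
r2_in_sample = bmi_only.score(df[["bmi"]], df["target"])

print(f"slope:        {slope:.1f} target points per unit of scaled bmi")
print(f"intercept:    {intercept:.1f}")
print(f"in-sample R2: {r2_in_sample:.3f}")
```

Because bmi is centered on zero, the intercept should land on the overall mean target (about 152). For a one-feature fit, the in-sample R² equals the squared correlation, so it should come out near 0.586² ≈ 0.34.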

Split the Data Before You Do Anything Else

Before fitting anything, set aside a slice of patients the model will never see during training. scikit-learn’s train_test_split does the splitting and the shuffling in one call:

from sklearn.model_selection import train_test_split

X = df.drop(columns=["target", "bmi_group"])
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape
((353, 10), (89, 10))

X holds the ten feature columns (the “study material”), y holds target (the “answer key”). test_size=0.2 reserves 20% — 89 of the 442 patients — as the quiz set, and random_state=42 makes the split reproducible: rerun this and you’ll get the same 353/89 split every time.
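The reproducibility claim is easy to check for yourself. In this small sketch, quiz_rows is a hypothetical helper of my own and the alternative seed 7 is arbitrary; the point is only that the same random_state yields the same quiz set, and a different one does not.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)

def quiz_rows(seed):
    """Return the row labels that land in the 20% quiz set for a given seed."""
    _, X_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=seed)
    return X_test.index

same_a = quiz_rows(42)
same_b = quiz_rows(42)
other = quiz_rows(7)

print("same seed, same quiz set:     ", same_a.equals(same_b))
print("different seed, same quiz set:", same_a.equals(other))
print("quiz set size:", len(same_a))
```

Leaving random_state unset gives a different split on every run, which is fine for experimentation but makes numbers in a write-up impossible to reproduce.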

Train a scikit-learn Model with fit

With bmi showing the strongest single relationship but several other features also correlated with target, a plain linear regression across all ten features is a sensible first model — it learns one weight per feature and adds them together.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()

That’s the entire study step. fit looked at all 353 training patients’ features and their real disease-progression scores, and worked out the linear combination of features that best predicts the score. Nothing has been asked about the 89 held-out patients yet.
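To see what fit actually stored, you can inspect the fitted model's coef_ and intercept_ attributes. The sketch below rebuilds the same split and model so it runs on its own, then confirms that a prediction is nothing more than a weighted sum of features plus the intercept. The names weights and by_hand are mine.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# One learned weight per feature, plus a single intercept.
weights = pd.Series(model.coef_, index=X.columns)
print(weights.round(1))
print("intercept:", round(model.intercept_, 1))

# A prediction is the weighted sum of a patient's features plus the intercept.
by_hand = X_test.to_numpy() @ model.coef_ + model.intercept_
print("matches model.predict:", np.allclose(by_hand, model.predict(X_test)))
```

Read individual weights with some caution: several of the serum measurements are correlated with one another, so the model can trade weight between them without changing its predictions much.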

Ask for Predictions with predict

Now hand the model the test set’s features — and only the features, never y_test — and ask it to guess:

y_pred = model.predict(X_test)

preview = pd.DataFrame({
    "actual": y_test.values[:5],
    "predicted": y_pred[:5].round(1),
})
preview
   actual  predicted
0   219.0      139.5
1    70.0      179.5
2   202.0      134.0
3   230.0      291.4
4   111.0      123.8

Some guesses land close (row 4: 111 vs. 123.8), some are far off (row 1: 70 vs. 179.5, a miss of more than 100 points). That’s expected. The interesting question isn’t whether any single guess is right, it’s how good the guesses are on average across all 89 patients.

Grade the Quiz: R² and MAE

scikit-learn ships metrics for exactly this. R² (the coefficient of determination) and mean absolute error answer different questions:

from sklearn.metrics import r2_score, mean_absolute_error

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R2: {r2:.3f}")
print(f"MAE: {mae:.2f}")
R2: 0.453
MAE: 42.79

R² of 0.453 means the model explains about 45% of the variation in disease progression across the test patients — the remaining 55% comes from things the model doesn’t see (other health factors, measurement noise, or genuine unpredictability). MAE of 42.79 is in the target’s own units: on average, a prediction is off by about 43 points on a scale where the test set’s patients average 145.8. Read together, they say the model has found a real but partial signal — useful for flagging patients worth a closer look, not precise enough to replace a clinician’s judgment.
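Both metrics are short formulas, and computing them by hand can make the interpretation above less abstract. This sketch rebuilds the same split and model so it is self-contained, then compares the hand-rolled numbers with scikit-learn's. The variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# MAE: the average size of a miss, ignoring its direction.
mae_by_hand = np.mean(np.abs(y_test - y_pred))

# R2: 1 minus (the model's squared error divided by the squared error
# of always guessing the test set's mean).
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot

print(f"MAE by hand: {mae_by_hand:.2f}  sklearn: {mean_absolute_error(y_test, y_pred):.2f}")
print(f"R2 by hand:  {r2_by_hand:.3f}  sklearn: {r2_score(y_test, y_pred):.3f}")
```

Seen this way, R² is a comparison against a very simple competitor (the mean), while MAE is a plain average of misses in the target's own units.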

Three Gotchas Worth Knowing

Grading yourself on the practice test flatters you. It’s tempting to check model.score() (which returns R² for a regressor) against the data you just trained on. Compare that to the honest test score and the gap is immediate:

y_train_pred = model.predict(X_train)
r2_train = r2_score(y_train, y_train_pred)
print(f"R2 on training data: {r2_train:.3f}")
print(f"R2 on test data:     {r2:.3f}")
R2 on training data: 0.528
R2 on test data:     0.453

0.528 versus 0.453 — the model looks noticeably better on the data it already memorized patterns from. Always report the test-set number; the training-set number tells you almost nothing about how the model will do on a new patient.

Fitting a transformer on the whole dataset before splitting leaks the test set into training. Any preprocessing step that learns statistics from your data — a scaler, an imputer — must be fit on the training set only, never on the full dataset before the split:

from sklearn.preprocessing import StandardScaler

leaky_scaler = StandardScaler().fit(X)          # sees all 442 rows, test included
honest_scaler = StandardScaler().fit(X_train)   # sees only the 353 training rows

print("bmi mean seen by leaky scaler (all 442 rows):            ", round(leaky_scaler.mean_[2], 6))
print("bmi mean seen by honest scaler (only 353 training rows): ", round(honest_scaler.mean_[2], 6))
print("bmi std seen by leaky scaler (all 442 rows):              ", round(leaky_scaler.scale_[2], 6))
print("bmi std seen by honest scaler (only 353 training rows):   ", round(honest_scaler.scale_[2], 6))
bmi mean seen by leaky scaler (all 442 rows):             -0.0
bmi mean seen by honest scaler (only 353 training rows):  0.001736
bmi std seen by leaky scaler (all 442 rows):               0.047565
bmi std seen by honest scaler (only 353 training rows):    0.047208

The gap looks tiny here only because this particular dataset’s features were already globally centered and scaled before it was published, as described earlier — with already-near-zero values, there isn’t much room left for the leak to move. Swap in a real, unscaled dataset (prices in euros, ages in years, page views in the thousands) and the same mistake shifts what “zero” and “one standard deviation” mean between train and test by a lot, quietly inflating your test score. The fix is always the same: split first, fit any transformer on the training portion only, then apply it to both sets.
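One common way to make "split first, fit on the training portion only" hard to get wrong is to wrap the transformer and the model in a scikit-learn pipeline. The sketch below is one possible arrangement, not the only one; scaling is redundant for this already-scaled dataset and is included purely to show the pattern.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Split first, so nothing downstream can learn from the quiz set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() learns the scaler's mean and std from X_train only, then trains the regression.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# score() reuses those training-set statistics to transform X_test before predicting.
test_r2 = pipeline.score(X_test, y_test)
print(f"R2 on test data: {test_r2:.3f}")
```

Ordinary linear regression gives the same predictions whether or not its features are rescaled, so the score here should match the unscaled model's. The pattern pays off with models that are sensitive to scale, and with imputers or encoders that learn from the data.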

R² is not a percentage of correct guesses. It’s easy to read “R² = 0.453” as “45% of predictions were right.” It means the model accounts for 45% of the variance in the target — a statement about how much of the spread in disease-progression scores is explained by the model, not about how many individual guesses landed on the exact right number. For a sense of typical error in real units, look at MAE instead: 42.79 points, on a target that averages 145.8 in the test set.
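A baseline can help calibrate both numbers. The sketch below compares the linear regression with a deliberately naive competitor that ignores every feature and always guesses the training set's mean target. Using DummyRegressor for this is my suggestion, not a step from the walkthrough above.

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: ignore the features and always guess the training set's mean target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

results = {}
for name, estimator in [("always guess the mean", baseline), ("linear regression", model)]:
    y_pred = estimator.predict(X_test)
    results[name] = (r2_score(y_test, y_pred), mean_absolute_error(y_test, y_pred))
    print(f"{name:>22}:  R2 = {results[name][0]:6.3f}   MAE = {results[name][1]:6.2f}")
```

A constant guess cannot score above zero on R², so the baseline lands at or just below 0. An R² of 0.453 therefore measures how much of the distance between "always guess the mean" and "perfect" the model has covered, not a share of correct answers.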

Wrapping Up

The mental model holds regardless of the algorithm: split your labeled data, fit on the training portion, predict on the held-out portion the model never saw, and evaluate honestly. Mapped onto the tools:

  • pandas (corr, groupby) → explore which features actually relate to your target, before you model anything
  • matplotlib → visualize a relationship you’re about to ask a model to quantify
  • train_test_split → hold out an honest quiz set the model never studies from
  • model.fit() → learn the relationship from the training set
  • model.predict() → generate guesses on data the model has never seen
  • r2_score / mean_absolute_error → grade those guesses without flattering yourself

If you want to go deeper than one linear regression — the full supervised learning workflow, k-nearest neighbors, hyperparameter tuning, and a guided project with a real dataset — the Machine Learning Workflow lesson in our free Machine Learning course picks up exactly where this post leaves off.
