Lesson 5 - Guided Project: Predicting Breast Cancer Diagnosis

Your First End-to-End Project

Over the last four lessons you learned the full machine learning workflow, how k-nearest neighbors makes predictions, how to evaluate a classifier honestly, and how to tune it. Each idea arrived on its own. This lesson is where they all come together.

You are going to build a complete diagnostic model from scratch on a real medical dataset, the kind of work a data scientist does on the job. There is no new algorithm to learn here. Instead, you will practice the discipline of moving from a raw file to a trustworthy, well-measured model, making sensible decisions at every step and reading what the results actually tell you.

By the end of this lesson, you will be able to:

  • Frame a real classification problem and load a genuine dataset from a CSV file
  • Explore class balance and feature structure before writing any modeling code
  • Prepare data correctly: select numeric features, split, and scale without leaking information
  • Train a k-nearest neighbors classifier and tune the number of neighbors
  • Evaluate the model with accuracy, a confusion matrix, and a ROC curve, and interpret each one in a medical context

You should already be comfortable with train_test_split, StandardScaler, KNeighborsClassifier, and the confusion-matrix and ROC ideas from Lesson 3. Let’s get to work.


Project Brief

Imagine you have joined the data team at a hospital research lab. Pathologists examine fine-needle aspirate (FNA) samples of breast tissue under a microscope and measure properties of the cell nuclei: how large they are, how irregular their borders look, how much they vary in size. These measurements are recorded automatically from a digitized image.

Your task is to build a model that takes those measurements and predicts whether a tumor is malignant (cancerous) or benign (not cancerous). A good model can act as a second opinion that helps flag cases for closer review. It does not replace a doctor, but it can make the screening process faster and more consistent.

This is a binary classification problem. The two outcomes are malignant and benign, and the cost of the two kinds of mistake is very different, a point we will return to when we evaluate the model.

The data comes from the well-known Wisconsin Diagnostic Breast Cancer study. It contains 569 real patient samples, each with 30 numeric measurements and a confirmed diagnosis.


Loading the Real Dataset

You can download the dataset and follow along:

import pandas as pd

df = pd.read_csv("breast_cancer.csv")  # download: https://datatweets.com/datasets/breast_cancer.csv

# How big is the dataset, and is anything missing?
print("Shape:", df.shape)
print("Missing values:", df.isna().sum().sum())
# Output:
# Shape: (569, 31)
# Missing values: 0

You have 569 rows and 31 columns: 30 numeric feature columns plus a target column. There are no missing values, which spares you a cleaning step and lets you focus on modeling.

The Data Dictionary

Every row describes one tumor sample. The 30 feature columns are not 30 unrelated measurements. They are ten base properties of the cell nuclei, each reported three ways: the mean across nuclei in the image, the error (standard error) of that measurement, and the worst value (the largest, computed as the mean of the three largest values in the image). That gives you mean radius, radius error, and worst radius, and the same pattern for the other nine properties.

The ten base properties are:

Property            What it captures
radius              Average distance from the center of a nucleus to its edge
texture             Variation in gray-scale intensity (how mottled the nucleus looks)
perimeter           Length of the nucleus boundary
area                Size of the nucleus
smoothness          Local variation in the radius (how even the border is)
compactness         Perimeter squared over area, minus one
concavity           Severity of concave (inward-curving) portions of the border
concave points      Number of concave portions of the border
symmetry            How symmetric the nucleus shape is
fractal dimension   “Coastline” roughness of the border

And the target:

Column    Meaning
target    1 = benign (357 cases), 0 = malignant (212 cases)
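
The naming pattern described above (ten properties, each reported as a mean, an error, and a worst value) can be sketched in a few lines. This is only an illustration: the helper name expected_feature_names is made up here, and it assumes the CSV spells every column the way the examples "mean radius", "radius error", and "worst radius" suggest.

```python
# Ten base properties from the data dictionary above.
BASE_PROPERTIES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]


def expected_feature_names(properties=BASE_PROPERTIES):
    """Build 'mean X', 'X error', and 'worst X' for every base property.

    Hypothetical helper: the spelling pattern is an assumption based on
    the three example names given in the lesson.
    """
    names = [f"mean {p}" for p in properties]
    names += [f"{p} error" for p in properties]
    names += [f"worst {p}" for p in properties]
    return names


feature_names = expected_feature_names()
print(len(feature_names))
print(feature_names[0], "|", feature_names[10], "|", feature_names[20])
# Output:
# 30
# mean radius | radius error | worst radius
```

If you want to check the assumption against the real file, compare set(feature_names) with the set of columns in df other than target. Comparing sets avoids depending on the column order, which this sketch does not claim to know.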

Larger, more irregular, less symmetric nuclei tend to indicate malignancy, so you can expect features like worst radius, worst perimeter, and worst concave points to carry a lot of signal. You do not need to memorize the biology; the model will learn which measurements separate the two classes.


Exploring the Data

Before training anything, you should understand the shape of the problem. The single most important question for a classifier is: how balanced are the classes? If 95% of samples were benign, a lazy model could score 95% accuracy by always guessing “benign” while catching zero cancers. Accuracy would lie to you.

Let’s count the two outcomes. The dataset stores the target as 0 and 1; mapping those to readable labels makes the output easier to interpret.

# Map the numeric target to readable labels for exploration
labels = df["target"].map({0: "malignant", 1: "benign"})
print(labels.value_counts())
# Output:
# target
# benign       357
# malignant    212
# Name: count, dtype: int64

So you have 357 benign and 212 malignant cases. That is roughly a 63 / 37 split, mildly imbalanced but not severe. A model needs to do real work to beat the “always benign” baseline of about 63% accuracy.
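
The baseline figure quoted above is simple arithmetic on the class counts, and it is worth seeing why that baseline is useless for screening. The sketch below uses only the two counts printed earlier; nothing is read from the file.

```python
# Class counts taken from the value_counts() output above.
counts = {"benign": 357, "malignant": 212}
total = sum(counts.values())

# A majority-class baseline always predicts the most common label.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

# That baseline never predicts "malignant", so it catches no cancers.
malignant_caught = 0
malignant_recall = malignant_caught / counts["malignant"]

print(f"Always predict: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Malignant recall: {malignant_recall:.3f}")
# Output:
# Always predict: benign
# Baseline accuracy: 0.627
# Malignant recall: 0.000
```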

The bar chart below makes the balance concrete.

Figure: Bar chart of benign vs. malignant diagnoses. The breast cancer dataset has 357 benign and 212 malignant cases.

A quick look at the feature scales also matters for k-NN. Because k-NN measures distance between points, features on large numeric scales dominate features on small ones.

# Compare the typical magnitude of a few features
print(df[["mean area", "mean radius", "mean smoothness"]].describe().loc[["mean", "max"]])
# Output:
#         mean area  mean radius  mean smoothness
# mean   654.889104    14.127292         0.096360
# max   2501.000000    28.110000         0.163400

Look at the ranges. mean area runs into the thousands, mean radius sits in the tens, and mean smoothness is a fraction below one. Without intervention, mean area alone would swamp the distance calculation and mean smoothness would barely register. That is exactly why you will scale the features before training. Scaling is easy to apply mechanically, so it is worth keeping the reason for it in mind.
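
To see how lopsided the unscaled distance is, here is a small worked example. The two samples are hypothetical values chosen to sit in the ranges shown above, not rows from the dataset. The code splits the squared Euclidean distance into the share contributed by each feature.

```python
import math

# Two hypothetical tumors (illustrative values, not rows from the CSV).
a = {"mean area": 500.0, "mean radius": 12.0, "mean smoothness": 0.08}
b = {"mean area": 800.0, "mean radius": 18.0, "mean smoothness": 0.12}

# Squared difference per feature; Euclidean distance is the root of their sum.
squared_diffs = {name: (a[name] - b[name]) ** 2 for name in a}
total = sum(squared_diffs.values())

print(f"Euclidean distance: {math.sqrt(total):.2f}")
for name, sq in squared_diffs.items():
    print(f"{name}: {sq / total:.2%}")
# Output:
# Euclidean distance: 300.06
# mean area: 99.96%
# mean radius: 0.04%
# mean smoothness: 0.00%
```

On raw values like these the distance is, for practical purposes, just the difference in area.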


Preparing the Data

Preparation for this project has three parts: choose the feature matrix and target, split into train and test sets, and scale. The order matters.

Selecting Features and Target

All 30 feature columns are numeric and informative, so you can use them all. The target is already numeric (0/1), which is exactly what scikit-learn wants.

# Features: every column except the target
X = df.drop(columns=["target"])

# Target: 0 = malignant, 1 = benign (already numeric)
y = df["target"]

print("Feature matrix shape:", X.shape)
print("Target shape:", y.shape)
# Output:
# Feature matrix shape: (569, 30)
# Target shape: (569,)

Splitting Before Scaling

You split into a training set and a test set before you scale. The test set must stand in for data the model has never seen, so it cannot influence any decision made during training, including the mean and standard deviation used for scaling. Letting test data leak into the scaler is a subtle but real form of cheating that inflates your scores.

Use stratify=y so the train and test sets keep the same 63 / 37 class balance. With a small, imbalanced dataset, an unlucky random split could otherwise pile most malignant cases into one side.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y,
)

print("Training samples:", X_train.shape[0])
print("Test samples:", X_test.shape[0])
# Output:
# Training samples: 426
# Test samples: 143

Scaling Without Leakage

Now scale. You fit the StandardScaler on the training data only, then transform both sets with that same fitted scaler. The training set teaches the scaler each feature’s mean and spread; the test set is merely transformed using what was learned. This is the correct, leak-free pattern.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Learn the scaling from training data, then apply it everywhere
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)   # transform only, never fit

After this step every feature has roughly mean 0 and standard deviation 1 on the training set, so mean area and mean smoothness finally contribute to the distance calculation on equal footing.
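
If the fit/transform distinction still feels abstract, the same pattern can be written by hand for a single feature. This is a minimal sketch on hypothetical values, not a replacement for StandardScaler: the statistics come from the training values only, and the test values are transformed with those same numbers.

```python
from statistics import fmean, pstdev

# Hypothetical "mean area" values (illustrative, not from the dataset).
train_values = [400.0, 550.0, 700.0, 1200.0]
test_values = [500.0, 1500.0]

# "Fit": learn the mean and spread from the training values only.
train_mean = fmean(train_values)
train_std = pstdev(train_values)  # population std, as StandardScaler uses


def standardize(values, mean, std):
    """'Transform': apply already-learned statistics to any values."""
    return [(v - mean) / std for v in values]


train_scaled = standardize(train_values, train_mean, train_std)
test_scaled = standardize(test_values, train_mean, train_std)

print("Train mean after scaling:", round(fmean(train_scaled), 6))
print("Train std after scaling: ", round(pstdev(train_scaled), 6))
print("Test values after scaling:", [round(v, 2) for v in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 (up to floating-point rounding). The scaled test values do not, and that is expected: they are measured against the training set's yardstick, exactly as new patient data would be.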


Training the Model

With clean, scaled data, training k-nearest neighbors is short. The one decision you control is $k$, the number of neighbors that vote on each prediction. A small $k$ follows the training data closely and can overfit; a large $k$ smooths the decision boundary and can underfit.
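
To make the effect of the neighbor count concrete, here is a from-scratch sketch of the voting rule on hypothetical, already-scaled 2-D points. It is not scikit-learn's implementation, only the same idea (Euclidean distance, majority vote) in a dozen lines. One stray malignant point sits inside the benign cluster, so a single neighbor follows that point while a larger vote smooths it away.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical scaled points (not from the dataset).
train_points = [
    (-1.0, -1.0), (-1.2, -0.8), (-0.7, -1.1), (-1.1, -1.3),  # benign cluster
    (1.0, 1.0), (1.2, 0.8),                                  # malignant cluster
    (-0.9, -0.9),                                            # stray malignant point
]
train_labels = ["benign"] * 4 + ["malignant"] * 3

query = (-0.88, -0.9)
for k in (1, 3, 5):
    print(f"k={k}: {knn_predict(train_points, train_labels, query, k)}")
# Output:
# k=1: malignant
# k=3: benign
# k=5: benign
```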

A practical rule of thumb is to start near $k = \sqrt{n}$. With 426 training samples that is about 21 ($\sqrt{426} \approx 20.6$), so a value in the high single digits to low twenties is reasonable. Let’s settle on $k = 7$, a common, robust starting point, and train.

from sklearn.neighbors import KNeighborsClassifier

# Seven neighbors vote on each prediction
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train_scaled, y_train)

# Accuracy on the held-out test set
test_accuracy = knn.score(X_test_scaled, y_test)
print(f"test accuracy: {test_accuracy:.4f}")
# Output:
# test accuracy: 0.9790

A test accuracy of 0.9790 means the model correctly classifies about 98% of tumors it has never seen. That is a strong result and a large jump over the 63% “always benign” baseline. But accuracy alone never tells the whole story in a medical setting, so let’s dig into where the model is right and where it is wrong.


Evaluating the Model

A single accuracy number hides the most important question: what kind of mistakes does the model make? For a cancer screen, missing a malignant tumor (a false negative) is far more dangerous than flagging a benign one for extra review (a false positive). The confusion matrix splits the errors apart so you can see both.

from sklearn.metrics import confusion_matrix, classification_report

y_pred = knn.predict(X_test_scaled)

print(confusion_matrix(y_test, y_pred))
print(classification_report(
    y_test, y_pred,
    target_names=["malignant", "benign"],
))
# Output:
# [[50  3]
#  [ 0 90]]
#               precision    recall  f1-score   support
#
#    malignant       1.00      0.94      0.97        53
#       benign       0.97      1.00      0.98        90
#
#     accuracy                           0.98       143
#    macro avg       0.98      0.97      0.98       143
# weighted avg       0.98      0.98      0.98       143

Read the report row by row. For the malignant class, precision is 1.00: every tumor the model called malignant truly was. Recall is 0.94: it caught 94% of the actual malignant cases, which with a support of 53 means 50 were caught and 3 were missed. For the benign class, recall is 1.00: no benign tumor was wrongly labeled malignant.
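
Every figure in the report can be rebuilt from four counts. The counts below are inferred, not read from the lesson's output: they are the only whole numbers consistent with the support, precision, and recall values shown above (53 malignant and 90 benign test cases, no benign tumor called malignant, 3 malignant tumors called benign).

```python
# Counts inferred from the report above (an assumption, see the lead-in).
malignant_caught = 50    # actual malignant, predicted malignant
malignant_missed = 3     # actual malignant, predicted benign (false negatives)
benign_flagged = 0       # actual benign, predicted malignant (false alarms)
benign_cleared = 90      # actual benign, predicted benign

precision_mal = malignant_caught / (malignant_caught + benign_flagged)
recall_mal = malignant_caught / (malignant_caught + malignant_missed)
precision_ben = benign_cleared / (benign_cleared + malignant_missed)
recall_ben = benign_cleared / (benign_cleared + benign_flagged)

n_total = malignant_caught + malignant_missed + benign_flagged + benign_cleared
accuracy = (malignant_caught + benign_cleared) / n_total

print(f"malignant: precision {precision_mal:.2f}, recall {recall_mal:.2f}")
print(f"benign:    precision {precision_ben:.2f}, recall {recall_ben:.2f}")
print(f"accuracy:  {accuracy:.4f}")
# Output:
# malignant: precision 1.00, recall 0.94
# benign:    precision 0.97, recall 1.00
# accuracy:  0.9790
```

Notice that the same 3 missed tumors lower both malignant recall and benign precision. They are one error seen from two sides.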

In a real screening tool, that missed-malignant rate is the number you would scrutinize most, because those are the dangerous false negatives. We will see how to trade precision for recall in a moment, but first let’s visualize the errors.

The confusion matrix below shows the four outcome counts: correctly identified malignant cases, correctly identified benign cases, and the few off-diagonal mistakes.

Figure: Confusion matrix for the final k-NN diagnosis model on the test set.

Reading the ROC Curve

Accuracy and the confusion matrix judge the model at one fixed decision threshold (predict “benign” when the predicted benign probability passes 0.5). But you can move that threshold. Raise it, so a tumor needs stronger evidence before it is called benign, and the model becomes more cautious, catching more malignant cases at the cost of more false alarms. Equivalently, you can lower the malignant probability required to flag a tumor. The ROC curve traces the model’s performance across every possible threshold, plotting the true positive rate against the false positive rate.

The single number that summarizes the whole curve is the area under the curve (AUC). An AUC of 0.5 means the model is no better than a coin flip; an AUC of 1.0 means perfect separation of the two classes.

from sklearn.metrics import roc_auc_score

# predict_proba returns probabilities; column 1 is the "benign" class
y_proba = knn.predict_proba(X_test_scaled)[:, 1]

auc = roc_auc_score(y_test, y_proba)
print(f"BC ROC AUC = {auc:.3f}")
# Output:
# BC ROC AUC = 0.992

An AUC of 0.992 is excellent. It tells you that if you picked one random malignant tumor and one random benign tumor, the model would rank the malignant one as more likely malignant about 99% of the time. The two classes are very well separated, which the curve below shows visually as it hugs the top-left corner.

Figure: ROC curve for the diagnosis model. The model separates the two classes very well (AUC = 0.99).
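
The ranking interpretation of AUC can be computed directly: compare every positive case with every negative case and count how often the positive one scores higher, with ties counting as half. The sketch below does that on hypothetical scores (benign votes out of 7 neighbors). The function name pairwise_auc is made up for this illustration. On the same inputs this pairwise count gives the same number as the area under the ROC curve, but it is written for clarity rather than speed.

```python
def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical benign-vote counts out of 7 neighbors (not real model output).
votes_for_benign_cases = [7, 6, 7, 4]     # truly benign tumors (class 1)
votes_for_malignant_cases = [0, 1, 4]     # truly malignant tumors (class 0)

auc = pairwise_auc(votes_for_benign_cases, votes_for_malignant_cases)
print(f"Pairwise AUC = {auc:.3f}")
# Output:
# Pairwise AUC = 0.958
```

Of the 12 possible pairs, 11 are ranked correctly and one is a tie (4 votes against 4 votes), giving 11.5 / 12.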

A high AUC is reassuring because it means the model’s ranking is sound regardless of threshold. If the lab decided that missing a malignant tumor is unacceptable, they could lower the malignant probability required to flag a tumor (the same as raising the bar for calling it benign) to push malignant recall toward 1.00, accepting a few more benign cases flagged for review. The ROC curve is the map for making that trade-off deliberately.
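
Here is that trade-off on a toy example. The samples are hypothetical (malignant votes out of 7 neighbors, paired with a true label), and the threshold is applied to the malignant probability, so lowering it flags more tumors.

```python
K = 7  # neighbors voting on each prediction

# Hypothetical (malignant votes out of K, true label) pairs.
samples = [
    (7, "malignant"), (6, "malignant"), (5, "malignant"), (3, "malignant"),
    (3, "benign"), (2, "benign"), (1, "benign"), (0, "benign"), (0, "benign"),
]


def evaluate(threshold):
    """Return (malignant recall, benign false alarms) at the given threshold."""
    caught = false_alarms = total_malignant = 0
    for votes, label in samples:
        flagged = votes / K > threshold  # flag when malignant probability is high enough
        if label == "malignant":
            total_malignant += 1
            caught += flagged
        elif flagged:
            false_alarms += 1
    return caught / total_malignant, false_alarms


for threshold in (0.5, 0.3):
    recall, false_alarms = evaluate(threshold)
    print(f"threshold {threshold}: malignant recall {recall:.2f}, false alarms {false_alarms}")
# Output:
# threshold 0.5: malignant recall 0.75, false alarms 0
# threshold 0.3: malignant recall 1.00, false alarms 1
```

Dropping the threshold from 0.5 to 0.3 catches the malignant tumor that only 3 of 7 neighbors voted for, and the price is one benign tumor with the same vote count sent for review.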


What You Built

Step back and look at the full pipeline you assembled:

  1. Framed the problem as binary classification with asymmetric error costs.
  2. Loaded 569 real patient samples with no missing values.
  3. Explored the class balance (357 benign, 212 malignant) and discovered wildly different feature scales.
  4. Prepared the data: selected 30 numeric features, split with stratification, and scaled without leakage.
  5. Trained a k-NN classifier with $k = 7$.
  6. Evaluated it: 0.9790 test accuracy, a confusion matrix that exposed the rare false negatives, and a ROC AUC of 0.992.

That is the entire supervised learning workflow, applied end to end to a dataset that matters. The same sequence, with different data and a different algorithm swapped in, is what professional data scientists run every day.


Practice Exercises

Try these before moving on. Each one deepens a decision you made above.

Exercise 1: Sweep the Number of Neighbors

You chose $k = 7$ by rule of thumb. Loop over the values [1, 3, 5, 7, 9, 15, 21], train a fresh KNeighborsClassifier for each, and print the test accuracy. Which value wins, and is 7 close to the best?

# Your code here

Hint

Iterate with for k in [1, 3, 5, 7, 9, 15, 21]:, create KNeighborsClassifier(n_neighbors=k), fit it on X_train_scaled, and print knn.score(X_test_scaled, y_test) each time. Reuse the scaled arrays you already built.

Exercise 2: Prove That Scaling Matters

Train one k-NN model on the unscaled X_train / X_test and another on the scaled versions, both with $k = 7$. Compare their test accuracies. How much does skipping the scaler cost you?

# Your code here

Hint

You already have X_train, X_test, X_train_scaled, and X_test_scaled. Fit one classifier on each pair and print both scores side by side. The unscaled model will be noticeably worse because mean area dominates the distance calculation.

Exercise 3: Shift the Threshold to Catch More Cancers

Using predict_proba, classify a tumor as malignant whenever its malignant probability exceeds 0.3 instead of the default 0.5. Build the new confusion matrix. Does malignant recall improve? What does it cost in false positives?

# Your code here

Hint

Malignant is class 0, so its probability is knn.predict_proba(X_test_scaled)[:, 0]. Create predictions with (proba_malignant > 0.3) and convert the boolean array into the right labels before calling confusion_matrix.


Summary

You completed a full machine learning project on a real medical dataset, from a raw CSV to an evaluated, trustworthy classifier.

Key Concepts

Project Workflow

  • Frame the problem and understand the cost of each error type before modeling
  • Always check class balance; accuracy is misleading on imbalanced data
  • Split before scaling, and fit the scaler on training data only to avoid leakage
  • Use stratify=y to preserve class proportions in small datasets

Model and Evaluation

  • k-nearest neighbors with $k = 7$ reached 0.9790 test accuracy here
  • The confusion matrix and classification_report reveal which errors occur, not just how many
  • ROC AUC (0.992 in this project) measures class separation across all thresholds
  • Lowering the threshold for flagging a tumor as malignant trades precision for recall on the malignant class, which matters when false negatives are dangerous

Why This Matters

A model is only as good as the process that produced it. The same accuracy number can come from a sound, leak-free pipeline or a flawed one that quietly cheated. By splitting before scaling, checking class balance, and reading the confusion matrix and ROC curve instead of a lone accuracy figure, you produced a result you can actually defend. In a domain like medicine, where a false negative carries real consequences, that discipline is the difference between a demo and a tool a professional would trust.


Next Steps

You have finished the Machine Learning Foundations module. You can now load data, prepare it correctly, train a classifier, and evaluate it like a professional. Next, you will move beyond predicting categories to predicting continuous numbers with regression.
