Lesson 3 - Handling Imbalanced Data

Welcome to Handling Imbalanced Data

Back in the Boosting Foundations module, the Northwind Analytics team trained a classifier to flag high earners on the real Adult Income dataset, and it reached a satisfying accuracy in the mid-80s. That number felt like success. But there is a catch hiding inside it, and it is the kind of catch that quietly sinks real projects: only about 24 percent of people in the data actually earn more than 50K. The other 76 percent do not. When one class is that much larger than the other, a single headline accuracy figure can look great while the model is failing badly at the exact job you built it for.

This lesson is about seeing through that illusion. Northwind’s marketing team does not care how many low earners the model correctly labels; there is an endless supply of those and labeling them right earns nothing. What they care about is catching the high earners, the minority class, the people worth targeting. So we will train a default XGBClassifier, watch its accuracy look reassuring while its recall on high earners lags, learn the metrics that expose the problem, and then fix it with one hyperparameter, scale_pos_weight, measuring the honest trade-off that fix brings. Every number in this lesson comes from a model we actually fit.

By the end of this lesson, you will be able to:

  • Explain why accuracy is a misleading score on imbalanced data and what it hides
  • Read a confusion matrix and per-class precision, recall, and F1 to see how a model treats the minority class
  • Use threshold-free metrics like ROC AUC and average precision (PR AUC) to judge probability quality
  • Set XGBoost’s scale_pos_weight to the negative-to-positive ratio to reweight the minority class
  • Describe the recall-versus-precision trade-off that reweighting produces and when it is worth making

This lesson assumes you can train an XGBClassifier and run a stratified train/test split, as covered earlier in the course. Let’s begin by making the problem visible.


Why Accuracy Misleads on Imbalanced Data

Start with the imbalance itself. We load the Adult Income data, keep five of its numeric columns (handling the categorical ones is Lesson 4’s job), and turn the target into a clean 0/1 label where 1 means a person earns more than 50K.

import warnings
warnings.filterwarnings("ignore")
import pandas as pd
from sklearn.datasets import fetch_openml

adult = fetch_openml("adult", version=2, as_frame=True)

features = ["age", "education-num", "hours-per-week",
            "capital-gain", "capital-loss"]
X = adult.data[features].copy()
y = (adult.target == ">50K").astype(int)

print("Rows, columns:", X.shape)
print("Positive rate (>50K):", round(float(y.mean()), 4))
print(y.value_counts().to_string())
# Output:
# Rows, columns: (48842, 5)
# Positive rate (>50K): 0.2393
# class
# 0    37155
# 1    11687

There it is: 37,155 people in the majority class and only 11,687 in the minority, a positive rate of 0.2393. Keep that ratio in mind, because it sets a trap. A do-nothing model that ignored every feature and always predicted “not a high earner” would be correct on all 37,155 low earners and wrong only on the 11,687 high earners, scoring about 76 percent accuracy while being completely useless. Any real model has to clear that 76 percent bar just to justify its existence, and clearing it by a few points is not as impressive as it sounds.
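The do-nothing baseline is simple enough to check by hand. The sketch below only reuses the two class counts printed above; the variable names are illustrative, not part of any library.

```python
# Class counts printed earlier in this lesson for the full Adult Income data.
n_negative = 37155  # earn <=50K (class 0)
n_positive = 11687  # earn >50K (class 1)
total = n_negative + n_positive

# A model that always predicts class 0 is right on every negative
# and wrong on every positive.
baseline_accuracy = n_negative / total
baseline_recall_positive = 0 / n_positive  # it never flags a single high earner

print("Always-predict-0 accuracy:", round(baseline_accuracy, 4))
print("Always-predict-0 recall on class 1:", baseline_recall_positive)
# Output:
# Always-predict-0 accuracy: 0.7607
# Always-predict-0 recall on class 1: 0.0
```

The same baseline scores 76 percent on accuracy and zero on the one quantity Northwind cares about, which is the whole trap in two lines.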

Now train an honest default XGBClassifier and look past the accuracy to the confusion matrix and the per-class report. The confusion matrix lays out actual labels against predicted ones, so we can see exactly who the model catches and who it misses.

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import xgboost as xgb

adult = fetch_openml("adult", version=2, as_frame=True)
features = ["age", "education-num", "hours-per-week",
            "capital-gain", "capital-loss"]
X = adult.data[features].copy()
y = (adult.target == ">50K").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = xgb.XGBClassifier(random_state=42, eval_metric="logloss")
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("Test accuracy:", round(accuracy_score(y_test, pred), 4))
print("Confusion matrix [rows = actual, cols = predicted]:")
print(confusion_matrix(y_test, pred))
print()
print(classification_report(y_test, pred, digits=4))
# Output:
# Test accuracy: 0.8495
# Confusion matrix [rows = actual, cols = predicted]:
# [[7129  302]
#  [1168 1170]]
#
#               precision    recall  f1-score   support
#
#            0     0.8592    0.9594    0.9065      7431
#            1     0.7948    0.5004    0.6142      2338
#
#     accuracy                         0.8495      9769
#    macro avg     0.8270    0.7299    0.7604      9769
# weighted avg     0.8438    0.8495    0.8366      9769

The accuracy is a healthy-looking 0.8495, roughly nine points above the 76 percent do-nothing baseline. But read the second row of the report, the one for class 1, the high earners Northwind actually cares about. Its recall is only 0.5004. Recall answers “of all the real high earners, what fraction did the model find?” and the answer is barely half. The confusion matrix says the same thing in raw counts: of the 1168 + 1170 = 2338 genuine high earners in the test set, the model caught 1170 and missed 1168. It is essentially a coin flip on the group that matters, and the strong accuracy sailed right over that fact because the 7,129 correctly labeled low earners in the top-left cell dominate the arithmetic. Accuracy rewards the model for being good at the easy, plentiful class and barely penalizes it for fumbling the rare, valuable one.
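One way to see why the majority class dominates the arithmetic: accuracy is just the per-class recalls averaged with each class's share of the test set as the weight. The sketch below checks that identity using the four confusion-matrix counts from the output above; the variable names are illustrative.

```python
# Confusion-matrix counts from the default model's test-set output above.
tn, fp = 7129, 302    # actual class 0: correctly labeled, false alarms
fn, tp = 1168, 1170   # actual class 1: missed, caught

n_neg = tn + fp
n_pos = fn + tp
n_all = n_neg + n_pos

recall_neg = tn / n_neg   # how well the model does on low earners
recall_pos = tp / n_pos   # how well the model does on high earners
share_neg = n_neg / n_all
share_pos = n_pos / n_all

accuracy = (tn + tp) / n_all
weighted = share_neg * recall_neg + share_pos * recall_pos

print("Accuracy:", round(accuracy, 4))
print("Share-weighted recall:", round(weighted, 4))
print("Weight on class 0:", round(share_neg, 4), " on class 1:", round(share_pos, 4))
# Output:
# Accuracy: 0.8495
# Share-weighted recall: 0.8495
# Weight on class 0: 0.7607  on class 1: 0.2393
```

Because class 0 carries about three quarters of the weight, its 0.96 recall lifts the average even though class 1 sits at 0.50.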

Recall, precision, and which one costs you

Recall for the positive class is $\text{TP} / (\text{TP} + \text{FN})$: of all actual high earners, how many did we catch? A missed high earner (a false negative) is a lost opportunity for Northwind’s campaign. Precision is $\text{TP} / (\text{TP} + \text{FP})$: of everyone we flagged, how many really were high earners? A false alarm (a false positive) wastes outreach budget. Imbalance almost always crushes recall first, because the model learns that guessing “majority” is a safe bet. Which error hurts more is a business question, not a statistical one, and it decides how hard you push the fix in the rest of this lesson.
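The two definitions can be applied directly to the counts in the confusion matrix above. This is a minimal sketch; the helper name positive_class_metrics is made up for illustration, and the F1 line uses the standard harmonic-mean formula.

```python
def positive_class_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class from raw counts."""
    precision = tp / (tp + fp)   # of everyone flagged, how many were real
    recall = tp / (tp + fn)      # of all real positives, how many were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Default model on the test set: 1170 caught, 302 false alarms, 1168 missed.
precision, recall, f1 = positive_class_metrics(tp=1170, fp=302, fn=1168)

print("Precision:", round(precision, 4))
print("Recall:   ", round(recall, 4))
print("F1:       ", round(f1, 4))
# Output:
# Precision: 0.7948
# Recall:    0.5004
# F1:        0.6142
```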


Better Metrics for Imbalance

If accuracy hides the problem, what should you watch instead? Two families of metric are far more honest on imbalanced data, and together they give a complete picture.

The first family is per-class precision and recall, which you already saw in the report. Never trust a single aggregate on imbalanced data; always split it out by class and look hardest at the minority. The second family is threshold-free ranking metrics, chiefly ROC AUC and average precision (the area under the precision-recall curve, often called PR AUC). These do not depend on the default 0.5 cutoff at all. Instead they ask how well the model’s probabilities order the positive cases above the negative ones. That matters because a model can rank people beautifully and still make poor 0.5-threshold decisions, and these metrics let you see the ranking quality separately from the threshold choice.

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             precision_score, recall_score, f1_score)
import xgboost as xgb

adult = fetch_openml("adult", version=2, as_frame=True)
features = ["age", "education-num", "hours-per-week",
            "capital-gain", "capital-loss"]
X = adult.data[features].copy()
y = (adult.target == ">50K").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = xgb.XGBClassifier(random_state=42, eval_metric="logloss")
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
pred = model.predict(X_test)

majority_rate = float((y_test == 0).mean())

print("Always-guess-<=50K accuracy:", round(majority_rate, 4))
print("Positive-class precision:  ", round(precision_score(y_test, pred), 4))
print("Positive-class recall:     ", round(recall_score(y_test, pred), 4))
print("Positive-class F1:         ", round(f1_score(y_test, pred), 4))
print("ROC AUC:                   ", round(roc_auc_score(y_test, proba), 4))
print("Average precision (PR AUC):", round(average_precision_score(y_test, proba), 4))
# Output:
# Always-guess-<=50K accuracy: 0.7607
# Positive-class precision:   0.7948
# Positive-class recall:      0.5004
# Positive-class F1:          0.6142
# ROC AUC:                    0.8783
# Average precision (PR AUC): 0.7551

Now the model tells a fuller story. The always-guess-majority accuracy is 0.7607, confirming the trap: our real model’s 0.8495 is a genuine but modest lift over doing nothing. The F1 of 0.6142 is the harmonic mean of precision and recall, and it drops well below the accuracy precisely because recall is weak; F1 is a much better single number to optimize on imbalanced problems than accuracy is. Meanwhile the ROC AUC of 0.8783 and average precision of 0.7551 are both strong. That combination, good ranking metrics but poor recall at the default threshold, is the classic signature of imbalance: the model has learned who the high earners are and orders them well, but its decision threshold is tuned to please the majority. The ranking is fine; the cutoff is the problem. That is exactly what scale_pos_weight addresses.
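To make "the ranking is fine; the cutoff is the problem" concrete, here is a small pure-Python sketch on hypothetical toy scores (not the Adult Income model). ROC AUC is computed as the fraction of positive-negative pairs ordered correctly, so it never looks at a cutoff, while recall depends entirely on where the cutoff sits. The helper names are illustrative.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def recall_at(labels, scores, threshold):
    """Share of real positives whose score reaches the threshold."""
    caught = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    return caught / sum(labels)

# Hypothetical toy data: 3 positives among 10 rows, all scored conservatively.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.35, 0.30, 0.40, 0.60]

print("ROC AUC:", round(roc_auc(labels, scores), 4))
print("Recall at 0.50:", round(recall_at(labels, scores, 0.50), 4))
print("Recall at 0.25:", round(recall_at(labels, scores, 0.25), 4))
# Output:
# ROC AUC: 0.9524
# Recall at 0.50: 0.3333
# Recall at 0.25: 1.0
```

The scores order the positives almost perfectly, yet the 0.5 cutoff catches only one of the three. Nothing about the ranking changes when the cutoff drops to 0.25; only the decisions do.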

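You will also notice we passed eval_metric="logloss" when constructing the classifier. This setting controls which score XGBoost reports on any validation data you pass through eval_set (and which score early stopping watches); it does not change the loss the trees are trained to minimize. For imbalanced problems you can ask XGBoost to track a more relevant score instead, for example eval_metric="auc" for ROC AUC or eval_metric="aucpr" for the area under the precision-recall curve, which is often the single most informative metric when positives are rare.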


The scale_pos_weight Hyperparameter

XGBoost gives you a direct lever for imbalance: scale_pos_weight. It multiplies the gradient contribution of every positive example during training, so a value above 1 makes each high earner “count for more” and pushes the model to stop ignoring them. The recommended starting point is simply the ratio of negatives to positives in the training set:

$$\texttt{scale\_pos\_weight} = \frac{\text{number of negatives}}{\text{number of positives}}$$

For our training split that ratio works out to about 3.18, meaning each high earner will be weighted a little more than three times as heavily as each low earner. Let’s fit two models, one with the default scale_pos_weight=1 and one with the reweighted value, and compare them on the same test set.

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)
import xgboost as xgb

adult = fetch_openml("adult", version=2, as_frame=True)
features = ["age", "education-num", "hours-per-week",
            "capital-gain", "capital-loss"]
X = adult.data[features].copy()
y = (adult.target == ">50K").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

neg = int((y_train == 0).sum())
pos = int((y_train == 1).sum())
spw = neg / pos
print("Train negatives:", neg, " positives:", pos)
print("scale_pos_weight = neg / pos =", round(spw, 4))
print()

def evaluate(weight, label):
    m = xgb.XGBClassifier(random_state=42, eval_metric="logloss",
                          scale_pos_weight=weight)
    m.fit(X_train, y_train)
    pred = m.predict(X_test)
    proba = m.predict_proba(X_test)[:, 1]
    print(f"--- {label} (scale_pos_weight={round(weight, 4)}) ---")
    print("Accuracy: ", round(accuracy_score(y_test, pred), 4))
    print("Confusion matrix [rows = actual, cols = predicted]:")
    print(confusion_matrix(y_test, pred))
    print("Positive precision:", round(precision_score(y_test, pred), 4))
    print("Positive recall:   ", round(recall_score(y_test, pred), 4))
    print("Positive F1:       ", round(f1_score(y_test, pred), 4))
    print("ROC AUC:           ", round(roc_auc_score(y_test, proba), 4))
    print()

evaluate(1.0, "Before: default weighting")
evaluate(spw, "After: reweighted minority class")
# Output:
# Train negatives: 29724  positives: 9349
# scale_pos_weight = neg / pos = 3.1794
#
# --- Before: default weighting (scale_pos_weight=1.0) ---
# Accuracy:  0.8495
# Confusion matrix [rows = actual, cols = predicted]:
# [[7129  302]
#  [1168 1170]]
# Positive precision: 0.7948
# Positive recall:    0.5004
# Positive F1:        0.6142
# ROC AUC:            0.8783
#
# --- After: reweighted minority class (scale_pos_weight=3.1794) ---
# Accuracy:  0.779
# Confusion matrix [rows = actual, cols = predicted]:
# [[5793 1638]
#  [ 521 1817]]
# Positive precision: 0.5259
# Positive recall:    0.7772
# Positive F1:        0.6273
# ROC AUC:            0.8786

Look at what moved and what did not. Positive-class recall leapt from 0.5004 to 0.7772: the model went from catching 1,170 of the 2,338 high earners to catching 1,817 of them, missing only 521 instead of 1,168. For Northwind, that is more than 600 additional high earners the campaign will now reach. But nothing is free. Precision fell from 0.7948 to 0.5259, because in its eagerness to catch high earners the model now also flags many more low earners as high (false positives jumped from 302 to 1,638). Overall accuracy dropped from 0.8495 to 0.779, which is exactly why accuracy is the wrong thing to optimize here: the reweighted model is worse on accuracy yet far better at the actual task.

Two details are worth noticing. First, the F1 barely changed (0.6142 to 0.6273), because it balances the recall gain against the precision loss; scale_pos_weight mostly redistributed errors from false negatives to false positives rather than eliminating them. Second, the ROC AUC stayed essentially flat (0.8783 to 0.8786). That confirms what we said earlier: scale_pos_weight did not teach the model anything new about who the high earners are, the ranking was already good. What it changed was the effective decision threshold, sliding it so the model commits to “high earner” more readily. The figure below makes that redistribution concrete using these exact counts.

[Figure: Two confusion matrices side by side for the same 2,338 real high earners in the test set. Left, the default model (scale_pos_weight = 1): the actual >50K row shows 1168 missed (red) and 1170 caught (green), the actual <=50K row shows 7129 correct and 302 false alarms, and recall on high earners is 50.0 percent. Right, the reweighted model (scale_pos_weight = 3.18): the actual >50K row shows 521 missed (red) and 1817 caught (green), the actual <=50K row shows 5793 correct and 1638 false alarms, and recall is 77.7 percent. An orange trade-off banner notes that recall rises from 50.0 to 77.7 percent while positive-class precision falls from 79.5 to 52.6 percent due to the extra false alarms.]
Caption: Reweighting the minority class slides the decision threshold: the model now catches far more true high earners (green, bottom-right) but at the cost of many more false alarms (top-right), trading precision for recall.

scale_pos_weight moves the threshold, not the ranking

Because the ROC AUC hardly budged while recall and precision swung sharply, you can think of scale_pos_weight as an indirect way of shifting the decision boundary toward the minority class. If your only goal were to trade precision for recall, you could achieve much the same effect at prediction time by lowering the probability threshold below 0.5. The advantage of scale_pos_weight is that it also influences training, encouraging the trees to spend more of their capacity carving out the minority-class regions, which can help when the minority is not just rare but genuinely hard to separate.
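The link between the weight and the threshold can be made explicit under an idealized assumption: if a model were perfectly calibrated, then weighting each positive by w in the log loss would move its best estimate from the true probability p to q = w·p / (w·p + 1 − p). Setting q = 0.5 and solving gives p = 1 / (1 + w), so a 0.5 cutoff on the reweighted model corresponds to a lower cutoff on the unweighted one. Real boosted models are regularized and only approximately calibrated, so treat this as a rough guide rather than an exact equivalence. The helper names below are illustrative.

```python
def reweighted_probability(p, w):
    """Minimizer of the log loss when positives are weighted by w and the
    true probability is p (idealized: assumes perfect calibration)."""
    return w * p / (w * p + (1.0 - p))

def equivalent_threshold(w):
    """Cutoff on the unweighted probability that matches 0.5 after reweighting."""
    return 1.0 / (1.0 + w)

# Training-split counts from this lesson.
neg, pos = 29724, 9349
w = neg / pos

t = equivalent_threshold(w)
print("scale_pos_weight:", round(w, 4))
print("Equivalent cutoff on unweighted probabilities:", round(t, 4))
print("Reweighted probability at that cutoff:", round(reweighted_probability(t, w), 4))
# Output:
# scale_pos_weight: 3.1794
# Equivalent cutoff on unweighted probabilities: 0.2393
# Reweighted probability at that cutoff: 0.5
```

With w set to negatives over positives, the equivalent cutoff works out to the training positive rate itself, pos / (pos + neg). In this idealized picture the reweighted model flags anyone whose estimated chance of being a high earner beats the base rate.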

Other Tools for Imbalance

scale_pos_weight is the most direct XGBoost-native fix, but it is not the only option, and the right choice depends entirely on the business cost of the two error types. Threshold tuning takes a trained model’s probabilities and picks a cutoff other than 0.5 to hit a target recall or precision; it is cheap, transparent, and easy to adjust after deployment. Resampling changes the training data instead of the loss: oversampling the minority (including synthetic methods like SMOTE) or undersampling the majority to even out the classes. And you can combine approaches. The key discipline is to decide first which error is more expensive for your problem, a missed high earner or a wasted outreach, and only then choose the technique and its strength. If false negatives are catastrophic (say, missing a fraudulent transaction or a sick patient), you push recall hard and accept the precision hit; if false positives are expensive, you hold back. There is no universally correct setting, only the one that matches your costs.
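"Decide by the cost of the error" can be turned into a small procedure: assign a cost to each false negative and each false positive, then pick the cutoff with the lowest total. The sketch below does this on hypothetical toy scores; the costs, candidate cutoffs, and helper names are all made up for illustration, and in practice the cutoff should be chosen on a validation set rather than the test set.

```python
def total_cost(labels, scores, threshold, cost_fn, cost_fp):
    """Total cost of the errors made when flagging every score >= threshold."""
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return cost_fn * fn + cost_fp * fp

def best_threshold(labels, scores, cost_fn, cost_fp, candidates):
    """Candidate cutoff with the lowest total cost (first one wins on ties)."""
    return min(candidates,
               key=lambda t: total_cost(labels, scores, t, cost_fn, cost_fp))

# Hypothetical toy data: 3 positives among 10 rows.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.35, 0.30, 0.40, 0.60]
candidates = [0.25, 0.37, 0.50]

# A missed positive costs five times a false alarm: push recall.
print(best_threshold(labels, scores, cost_fn=5, cost_fp=1, candidates=candidates))
# A false alarm costs five times a missed positive: protect precision.
print(best_threshold(labels, scores, cost_fn=1, cost_fp=5, candidates=candidates))
# Output:
# 0.25
# 0.37
```

The model and its scores are identical in both calls; only the stated costs change, and the chosen cutoff moves with them.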


Practice Exercises

Try these before checking the hints. They reinforce diagnosing imbalance, reading per-class metrics, and steering the recall-precision trade-off.

Exercise 1: Compute scale_pos_weight for a Rarer Positive

Imagine a fraud dataset where only 5 percent of transactions are fraudulent (the positive class), out of 10,000 training rows. Compute the recommended scale_pos_weight from the negative-to-positive ratio, then explain in one sentence why it is so much larger than the 3.18 we used for Adult Income.

# Your code here

Hint

With 5 percent positives there are 500 positives and 9,500 negatives, so scale_pos_weight = 9500 / 500 = 19.0. It is far larger than 3.18 because the imbalance is far more severe: the rarer the positive class, the more heavily each positive example must be weighted to keep the model from ignoring it entirely.

Exercise 2: Try an Intermediate Weight

scale_pos_weight is a dial, not an on/off switch. Retrain the Adult classifier with scale_pos_weight=2.0, roughly halfway between the default 1 and the full 3.18, and print the positive-class precision, recall, and F1. Confirm the recall lands between the two values you saw in the lesson.

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
import xgboost as xgb

# Your code here

Hint

Build the model with xgb.XGBClassifier(random_state=42, eval_metric="logloss", scale_pos_weight=2.0) and reuse the same features and stratified split. You should get a recall of about 0.6484 and precision of about 0.6416 (F1 about 0.645), both sitting neatly between the scale_pos_weight=1 and scale_pos_weight=3.18 results, confirming that the weight smoothly trades precision for recall.

Exercise 3: Raise Recall by Threshold Tuning

Instead of reweighting, take the default model’s predicted probabilities and lower the decision threshold from 0.5 to 0.3 (label as positive whenever proba >= 0.3). Print the positive-class recall and precision at both thresholds and compare the direction of the change to what scale_pos_weight did.

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
import xgboost as xgb

# Your code here

Hint

Get probabilities with proba = model.predict_proba(X_test)[:, 1], then build predictions as (proba >= t).astype(int) for t in [0.5, 0.3]. At 0.5 you recover recall 0.5004 and precision 0.7948; at 0.3 recall rises to about 0.6865 while precision falls to about 0.6114. The direction is the same as scale_pos_weight, more recall for less precision, showing that both tools ultimately move the same threshold, one at training time and one afterward.


Summary

You took a classifier that looked good on paper and learned to see the failure its accuracy was hiding, then fixed it deliberately while measuring exactly what the fix cost.

Key Concepts

Why Accuracy Misleads

  • When one class dominates, a model that always predicts the majority already scores high; for Adult Income that do-nothing baseline is 0.7607
  • A default XGBClassifier reached 0.8495 accuracy yet only 0.5004 recall on the minority high-earner class, catching 1,170 of 2,338 and missing 1,168
  • Always split precision and recall out by class and scrutinize the minority; a single aggregate number cannot reveal an imbalance failure

Better Metrics

  • F1 (0.6142 here) balances precision and recall and is a far safer single target than accuracy on imbalanced data
  • ROC AUC (0.8783) and average precision / PR AUC (0.7551) are threshold-free and judge how well probabilities rank positives above negatives
  • Strong ranking metrics alongside weak recall are the signature of imbalance: the model knows the answer but its 0.5 threshold favors the majority

scale_pos_weight

  • Set it to $\text{negatives} / \text{positives}$ in the training set (about 3.18 for Adult Income) to weight each minority example more heavily
  • Doing so raised positive recall from 0.5004 to 0.7772 (catching 1,817 high earners instead of 1,170), while precision fell from 0.7948 to 0.5259 and accuracy dropped to 0.779
  • ROC AUC barely moved (0.8783 to 0.8786), showing the tool shifts the decision threshold rather than improving the underlying ranking
  • Threshold tuning and resampling are alternatives; the right choice depends on whether a false negative or a false positive costs your business more

Why This Matters

Almost every high-stakes classification problem is imbalanced: fraud, disease, churn, default, and yes, high-value prospects. In every one of them, the rare class is the one you actually care about, and it is exactly the class a naive accuracy-driven workflow will neglect. The skill you built here, refusing to trust a headline accuracy, reading per-class recall, and reaching for a lever like scale_pos_weight with a clear-eyed view of the trade-off, is what separates a model that demos well from one that works. Just as important, you saw that there is no free lunch: buying recall costs precision, and deciding how much to buy is a business judgment about the cost of each mistake, not something a metric can settle for you. Carry that honesty into every imbalanced problem you meet.


Next Steps

You can now diagnose and correct class imbalance on a real dataset. The Adult Income data still has a loose end, though: we have been quietly ignoring its rich categorical columns and any missing values. In the next lesson you will bring those back in and let XGBoost handle them natively.

Lesson 4: Missing Values and Categorical Features

Let XGBoost handle missing values natively and encode the Adult Income categorical columns to push the model further.



Continue Building Your Skills

You started this lesson with a model that looked like a success and ended it with a model that actually is one, and the difference was never a fancier algorithm; it was refusing to be fooled by a single number. Trace the arc one more time: accuracy hid a coin-flip recall on the class that mattered, per-class metrics and PR AUC exposed it, and scale_pos_weight fixed it at a precision cost you measured rather than ignored. That habit (look past the aggregate, weigh the trade-off, decide by the cost of the error) will serve you on every imbalanced problem long after the specific hyperparameter fades from memory. Next, you will make the model even stronger by feeding it the features we set aside.
