Lesson 1 - Feature Importance and Its Pitfalls

Welcome to Feature Importance and Its Pitfalls

Welcome to Module 4. Over the last three modules you turned XGBoost from a name into a tool you can drive: you built the boosting core by hand, opened up the regularized objective, and learned to tune the knobs that control accuracy. Now the questions change. A tuned model that predicts well is only half the job; the other half is being able to look someone in the eye and say why it predicts what it does. This module is about interpretation, tuning discipline, and getting the model into production, and it starts here, with the single most requested and most misused number in applied machine learning: feature importance.

Our running team, Northwind Analytics, has a strong California Housing model in hand (test RMSE 0.4696 from Module 2). The moment they show it to stakeholders, the question is always the same: which features actually drive the prediction? XGBoost is happy to answer. In fact it will answer three different ways, and, as you are about to see with real numbers, those three answers do not agree with each other. One of them, the default, will confidently tell Northwind that median income is the single most important feature. A more careful method, measured the right way, will disagree and put geography on top. Getting this right is the difference between an honest insight and a confident mistake. Every number below was produced by running the code for real.

By the end of this lesson, you will be able to:

  • Compute all three importance types XGBoost reports (weight, gain, and cover) with booster.get_score(importance_type=...) and explain what each one measures
  • Show that the three types rank the same features differently on a real model
  • Explain why the default (gain) importance is biased, computed on training data, and says nothing about causation or generalization
  • Measure permutation importance on a held-out set with sklearn.inspection.permutation_importance and compare its ranking to gain
  • State clear guidance on when to trust each method and never read causation into any of them

You should be comfortable fitting an XGBRegressor and working with a train/test split. No new theory is required, just a healthy skepticism. Let’s begin.


The Three Importances XGBoost Reports

When a fitted booster is asked “how important is each feature?”, it does not have one answer. It keeps a few different bookkeeping tallies as it grows trees, and each tally is a legitimate but different notion of importance. XGBoost exposes them through booster.get_score(importance_type=...). Three matter most:

  • weight counts how many times a feature is used to make a split across all trees. A feature used in many splits gets a high weight. This is purely a frequency count; it says nothing about how useful each split was.
  • gain measures the average improvement in the training objective (the drop in loss) that splits on this feature produced. A feature that, when split on, sharply reduces the error gets a high gain. This is the default behind the feature_importances_ attribute of XGBoost’s scikit-learn wrapper, so it is the one most people see without realizing it.
  • cover measures the average number of training samples affected by splits on this feature (technically the average sum of second-order gradient statistics, the Hessians, of the rows reaching those splits; with squared-error loss each row’s Hessian is 1, so this is just the row count). A feature whose splits sit near the top of trees, touching many rows, gets high cover.
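As a minimal sketch of the three tallies just described, the snippet below computes weight, average gain, and average cover from a handful of made-up split records. The records, their numbers, and the tally helper are hypothetical and are not XGBoost internals; the point is only that one set of splits can yield different rankings.

```python
from collections import defaultdict

# Hypothetical split records: (feature, loss reduction, rows reaching the split).
# These numbers are invented for illustration, not taken from a real booster.
splits = [
    ("Latitude", 4.0, 900),
    ("Latitude", 3.0, 400),
    ("Latitude", 2.0, 150),
    ("MedInc", 60.0, 1000),
    ("MedInc", 30.0, 600),
]


def tally(split_records):
    """Summarize split records into weight, average gain, and average cover."""
    by_feature = defaultdict(list)
    for feature, gain, cover in split_records:
        by_feature[feature].append((gain, cover))

    report = {}
    for feature, records in by_feature.items():
        count = len(records)
        report[feature] = {
            "weight": count,                                  # how often it is used
            "gain": sum(g for g, _ in records) / count,       # how much each use helps
            "cover": sum(c for _, c in records) / count,      # how many rows each use touches
        }
    return report


report = tally(splits)
for feature, stats in report.items():
    print(feature, stats)
```

In this toy data Latitude wins on weight (3 splits against 2) while MedInc wins on gain and cover, which is the same kind of disagreement the real model shows.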

Three different questions: How often is it used? How much does each use help? How many rows does it touch? There is no reason these should agree, and on real data they do not. Let’s fit Northwind’s model and read all three off the same booster.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data = fetch_california_housing()
features = list(data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

model = xgb.XGBRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42
)
model.fit(X_train, y_train)

# Attach readable names so the booster reports them instead of f0, f1, ...
booster = model.get_booster()
booster.feature_names = features

for imp_type in ["weight", "gain", "cover"]:
    scores = booster.get_score(importance_type=imp_type)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    print(f"--- {imp_type} ---")
    for name, value in ranked:
        print(f"  {name:11s} {float(round(value, 2))}")

Output:

--- weight ---
  Latitude    867.0
  Longitude   831.0
  MedInc      627.0
  AveOccup    550.0
  HouseAge    474.0
  AveRooms    421.0
  Population  359.0
  AveBedrms   301.0
--- gain ---
  MedInc      92.48
  AveOccup    23.72
  Longitude   14.47
  Latitude    12.46
  HouseAge    9.64
  AveRooms    5.89
  AveBedrms   3.1
  Population  2.51
--- cover ---
  AveRooms    5753.22
  MedInc      5747.47
  Longitude   5284.72
  Latitude    5014.86
  AveBedrms   3983.29
  AveOccup    3160.23
  HouseAge    2662.33
  Population  2367.47

Look carefully at the top of each list, because they tell three different stories about the same model:

  • weight says Latitude is the most important feature (used in 867 splits), with Longitude close behind. The geographic coordinates are used constantly.
  • gain says MedInc is the most important feature by a landslide: its average split improves the objective by 92.48, nearly four times the next feature (AveOccup at 23.72). By gain, Latitude drops all the way to 4th.
  • cover says AveRooms edges out MedInc, and reshuffles the middle of the pack again.

The units are not comparable across types, so do not compare a weight of 867 to a gain of 92; compare rankings. And the rankings genuinely conflict. Latitude is 1st by weight and 4th by gain. AveOccup is 4th by weight but 2nd by gain. If Northwind reported “the model’s top feature is X,” the honest answer would be “which definition of top did you use?”

Why does Latitude rack up so many splits (high weight) but only middling gain? Because geography is useful in many small ways: the model keeps carving the map into finer regions, each split helping a little. MedInc is the opposite: it is split on fewer times, but each of those splits is enormously informative, so it dominates gain. Neither view is wrong. They are answers to different questions.
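To put rough numbers on "many small splits" versus "few large ones," the sketch below multiplies each feature's split count by its average gain, using the values printed above. The product only approximates the summed gain because the printed averages are rounded to two decimals; XGBoost also exposes this quantity directly as the total_gain importance type.

```python
# Values copied from the weight and gain listings above (gain is rounded).
weight = {"Latitude": 867, "Longitude": 831, "MedInc": 627, "AveOccup": 550}
avg_gain = {"Latitude": 12.46, "Longitude": 14.47, "MedInc": 92.48, "AveOccup": 23.72}

# Number of splits times average payoff per split approximates the summed gain.
approx_total_gain = {name: weight[name] * avg_gain[name] for name in weight}

ranked = sorted(approx_total_gain.items(), key=lambda kv: kv[1], reverse=True)
for name, value in ranked:
    print(f"{name:10s} {value:10.2f}")
```

Latitude's 867 splits add up to roughly 10,800 in summed gain, while MedInc's 627 splits add up to roughly 58,000. Latitude is used more often, yet each use pays far less.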

feature_importances_ is gain, and it is normalized

When you read model.feature_importances_ from the scikit-learn wrapper, you are getting the gain importance, rescaled to sum to 1. So the raw gain of 92.48 for MedInc becomes a normalized 0.563: MedInc’s average gain makes up about 56 percent of the eight average gains added together. That is the number most tutorials plot without ever mentioning that a different importance_type would draw a different bar chart. The defaults are not even consistent across the API: booster.get_score() and xgb.plot_importance() default to weight unless you pass importance_type, so two default-looking charts of the same model can disagree. Whenever you see a feature-importance plot, your first question should be “is this weight, gain, or cover?”
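The rescaling can be checked by hand. This short sketch divides each raw gain printed earlier by their sum; because the inputs are the rounded printed values, the results match feature_importances_ only to about three decimals.

```python
# Raw (average) gain per feature, copied from the gain listing above.
raw_gain = {
    "MedInc": 92.48,
    "AveOccup": 23.72,
    "Longitude": 14.47,
    "Latitude": 12.46,
    "HouseAge": 9.64,
    "AveRooms": 5.89,
    "AveBedrms": 3.1,
    "Population": 2.51,
}

# Divide by the sum so the shares add up to 1.
total = sum(raw_gain.values())
normalized = {name: value / total for name, value in raw_gain.items()}

for name, share in normalized.items():
    print(f"{name:11s} {share:.3f}")
```

MedInc comes out at 92.48 / 164.27, which rounds to 0.563.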


The Pitfalls: Why the Default Answer Can Mislead

Gain is the default, it is easy to read, and it is genuinely useful for a rough sense of what the model leans on. But it carries three pitfalls that get people into trouble when they treat it as ground truth.

1. Gain is biased toward some kinds of features. Split-based importances systematically favor continuous, high-cardinality features and features that happen to be used near the top of trees. A feature with many distinct values gives the greedy split-finder more thresholds to choose from, so it wins more splits and accumulates more gain, partly on merit and partly just because it had more chances. MedInc (a fine-grained continuous income measure) is exactly the kind of feature this bias flatters. That does not mean MedInc is unimportant; it means gain overstates its lead relative to features with fewer split points.

2. Gain is computed on training data. Every number in the block above was measured while the trees were being fit, on the training set. It reflects how the model used each feature to reduce training loss, not whether that feature helps the model generalize to new districts. A feature the model leaned on heavily during training could still be one it overfit to. Importance measured on the training set cannot tell you the difference.

3. Importance is not causation, and the types disagree. A high importance means “the model used this feature,” not “this feature causes the target.” If two features are correlated, the model may split on one and ignore the other, and the ignored one will look unimportant even if it is just as predictive. And, as you just saw, weight, gain, and cover rank the same features differently, so “the important features” is not even a well-defined list until you pick a definition. Any conclusion that would flip if you switched importance_type was never solid to begin with.
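The first pitfall can be seen in a small simulation. The sketch below is a simplified stand-in, not XGBoost's regularized gain formula: it scores a split by plain reduction in squared error, makes the target pure noise, and compares the best split available on a binary feature with the best split on a continuous one. The helper name best_split_gain and all sizes are assumptions chosen for illustration.

```python
import numpy as np


def best_split_gain(x, y):
    """Largest squared-error reduction over all thresholds between distinct x values."""
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    n = len(ys)
    left_n = np.arange(1, n)                 # rows on the left for each candidate split
    left_sum = np.cumsum(ys)[:-1]
    right_sum = ys.sum() - left_sum
    # SSE(parent) - SSE(left) - SSE(right), written with sums only.
    gain = left_sum**2 / left_n + right_sum**2 / (n - left_n) - ys.sum() ** 2 / n
    valid = xs[1:] != xs[:-1]                # a threshold needs two distinct values
    return float(gain[valid].max()) if valid.any() else 0.0


rng = np.random.default_rng(0)
n_rows, n_trials = 200, 300
binary_gains, continuous_gains = [], []

for _ in range(n_trials):
    y = rng.normal(size=n_rows)                                # target is pure noise
    x_binary = rng.integers(0, 2, size=n_rows).astype(float)   # one possible threshold
    x_continuous = rng.normal(size=n_rows)                     # about 199 possible thresholds
    binary_gains.append(best_split_gain(x_binary, y))
    continuous_gains.append(best_split_gain(x_continuous, y))

mean_binary = float(np.mean(binary_gains))
mean_continuous = float(np.mean(continuous_gains))
print("mean best gain, binary feature    :", round(mean_binary, 2))
print("mean best gain, continuous feature:", round(mean_continuous, 2))
```

Neither feature carries any signal, yet the continuous one reports a larger best gain on average simply because it offers more thresholds to try.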

The figure below makes the disagreement concrete using the exact rankings you just computed, plus the permutation ranking you are about to measure.

Figure: three side-by-side ranked lists of the eight California Housing features.

  • By weight (number of splits): Latitude, Longitude, MedInc, AveOccup, HouseAge, AveRooms, Population, AveBedrms
  • By gain (default importance): MedInc, AveOccup, Longitude, Latitude, HouseAge, AveRooms, AveBedrms, Population
  • By permutation (held-out R-squared drop): Latitude, Longitude, MedInc, AveOccup, AveRooms, HouseAge, AveBedrms, Population

Colored lines track three features across the columns: MedInc is first by gain but third by weight and permutation, while Latitude and Longitude are first and second by both weight and permutation but fall to fourth and third by gain.

Caption: The same eight features ranked three ways on one California Housing model. Gain crowns MedInc; weight and permutation both put Latitude and Longitude on top. Whether a feature is "most important" depends entirely on which definition you use.

A More Reliable Answer: Permutation Importance

If gain is biased and training-based, what should Northwind trust instead? The most dependable model-agnostic tool is permutation importance, and its logic is refreshingly direct. Take your held-out test set, where the model has never seen the labels. Measure the model’s score. Now take one feature’s column and randomly shuffle it, breaking any relationship between that feature and the target while leaving every other feature intact. Score the model again. If that feature mattered for generalization, performance drops; the size of the drop is the feature’s importance. Shuffle a useless feature and the score barely moves.
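The procedure is short enough to write by hand. Below is a minimal sketch on synthetic data, assuming a hypothetical predict function that stands in for a fitted model and uses only the first column; the helper names r2_score_manual and permutation_drop are illustrative and are not scikit-learn's implementation.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def permutation_drop(predict, X, y, column, n_repeats, rng):
    """Average drop in R^2 when one column is shuffled and everything else is kept."""
    baseline = r2_score_manual(y, predict(X))
    drops = []
    for _ in range(n_repeats):
        X_shuffled = X.copy()
        X_shuffled[:, column] = rng.permutation(X_shuffled[:, column])
        drops.append(baseline - r2_score_manual(y, predict(X_shuffled)))
    return float(np.mean(drops))


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                       # synthetic held-out features
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)  # target depends on column 0 only


def predict(features):
    # Hypothetical stand-in for a fitted model: it ignores column 1 entirely.
    return 3.0 * features[:, 0]


drop_useful = permutation_drop(predict, X, y, column=0, n_repeats=10, rng=rng)
drop_ignored = permutation_drop(predict, X, y, column=1, n_repeats=10, rng=rng)
print("drop when shuffling the useful column :", round(drop_useful, 3))
print("drop when shuffling the ignored column:", round(drop_ignored, 3))
```

Shuffling the column the model ignores changes nothing, so its drop is exactly zero. Shuffling the useful column costs more than 1.0 of R^2, which shows that a drop larger than the best possible score is not a bug: the shuffled predictions become worse than predicting the mean.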

This fixes the pitfalls head-on. It is measured on held-out data, so it reports generalization, not training-set usage. It works for any model, so it does not inherit the split-finder’s bias toward high-cardinality features. And it directly answers the question people actually mean: how much does the model’s real-world accuracy depend on this feature? sklearn.inspection.permutation_importance does the shuffling for us; we run several repeats and average, and we score with R^2 so a bigger number always means “more important.”

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

data = fetch_california_housing()
features = list(data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

model = xgb.XGBRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42
)
model.fit(X_train, y_train)

# Gain ranking straight from feature_importances_ (normalized gain)
gain = {f: round(float(v), 4) for f, v in zip(features, model.feature_importances_)}
gain_rank = [f for f, _ in sorted(gain.items(), key=lambda kv: kv[1], reverse=True)]

# Permutation importance on the HELD-OUT test set, scored by R2
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42, scoring="r2"
)
perm = {f: float(round(m, 4)) for f, m in zip(features, result.importances_mean)}
perm_rank = [f for f, _ in sorted(perm.items(), key=lambda kv: kv[1], reverse=True)]

print("rank | by gain      | by permutation")
for i in range(len(features)):
    g = gain_rank[i]
    p = perm_rank[i]
    print(f"  {i+1}  | {g:11s}  | {p:11s}")

print("\nTop feature by gain       :", gain_rank[0], gain[gain_rank[0]])
print("Top feature by permutation:", perm_rank[0], perm[perm_rank[0]])

Output:

rank | by gain      | by permutation
  1  | MedInc       | Latitude   
  2  | AveOccup     | Longitude  
  3  | Longitude    | MedInc     
  4  | Latitude     | AveOccup   
  5  | HouseAge     | AveRooms   
  6  | AveRooms     | HouseAge   
  7  | AveBedrms    | AveBedrms  
  8  | Population   | Population 

Top feature by gain       : MedInc 0.563
Top feature by permutation: Latitude 1.5858

This is the punchline of the lesson. Gain says the single most important feature is MedInc. Permutation importance, measured honestly on held-out data, says it is Latitude, with Longitude second, and MedInc only third. Shuffling Latitude costs the model 1.5858 of R^2. That is an enormous drop: a perfect score is 1.0, so losing more than 1.0 pushes R^2 below zero, meaning the shuffled model does worse than simply predicting the mean value for every district. Shuffling MedInc costs 0.4301. Geography, not income, is what the model’s out-of-sample accuracy leans on hardest.

Where do the two methods agree? At the bottom, completely: both rank AveBedrms 7th and Population 8th, so both agree these two features barely matter. And both keep the same four features (MedInc, Latitude, Longitude, AveOccup) in the top four, just in a different order. Where they disagree is exactly where it counts, the headline: gain’s top-of-tree, high-cardinality bias inflated MedInc to first place, while permutation, immune to that bias and measured on unseen districts, reveals that the geographic pair carries more of the real predictive weight. Both facts are true about the same model. Only one of them answers “what does generalization depend on.”
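The agreement and disagreement can be quantified from the two printed rankings alone. The sketch below checks the top-four and bottom-two claims and adds a Spearman rank correlation as one summary number; the choice of Spearman is an illustration added here, not something the lesson's code computes.

```python
# Rankings copied from the comparison table above, best first.
gain_rank = ["MedInc", "AveOccup", "Longitude", "Latitude",
             "HouseAge", "AveRooms", "AveBedrms", "Population"]
perm_rank = ["Latitude", "Longitude", "MedInc", "AveOccup",
             "AveRooms", "HouseAge", "AveBedrms", "Population"]

# Same members in the top four, identical order in the bottom two?
same_top_four = set(gain_rank[:4]) == set(perm_rank[:4])
same_bottom_two = gain_rank[-2:] == perm_rank[-2:]

# Spearman rank correlation for rankings without ties:
# 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the rank difference per feature.
n = len(gain_rank)
d_squared = sum((gain_rank.index(f) - perm_rank.index(f)) ** 2 for f in gain_rank)
spearman = 1 - 6 * d_squared / (n * (n**2 - 1))

print("same top four  :", same_top_four)
print("same bottom two:", same_bottom_two)
print("spearman       :", round(spearman, 4))
```

The correlation comes out at about 0.76: the two methods broadly agree on the overall ordering, and the disagreement is concentrated in the order of the top four.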

Why the geography story makes sense

California house prices are famously location-driven: coastal and urban districts command a premium that income alone cannot explain. It is reassuring that the honest importance method surfaces Latitude and Longitude on top, matching what we know about the housing market, while the default method was distracted by income’s abundance of split points. When a more careful measurement agrees with domain knowledge and the quick default does not, trust the careful one, and be glad you checked.

Permutation importance is not free. It costs extra passes over the data (one refit-free re-scoring per feature per repeat), and it has a blind spot of its own: when two features are strongly correlated, shuffling just one lets the model lean on its partner, so both can look less important than they really are. The next lesson introduces SHAP, which handles interactions and gives per-prediction explanations, going beyond the single global ranking permutation importance provides. For a trustworthy global ranking today, permutation importance on held-out data is the right default.
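The correlated-feature blind spot is easy to reproduce on synthetic data. In this sketch, two columns are near-duplicates of one signal, and two hypothetical stand-in models are compared: one that puts all its weight on a single column and one that spreads it across both. The models, helper names, and data are assumptions made for illustration.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def mean_drop(predict, X, y, column, rng, n_repeats=10):
    """Average R^2 drop when one column is shuffled."""
    baseline = r2(y, predict(X))
    drops = []
    for _ in range(n_repeats):
        X_shuffled = X.copy()
        X_shuffled[:, column] = rng.permutation(X_shuffled[:, column])
        drops.append(baseline - r2(y, predict(X_shuffled)))
    return float(np.mean(drops))


rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
# Column 1 is column 0 plus a tiny amount of noise: strongly correlated features.
X = np.column_stack([signal, signal + rng.normal(scale=0.01, size=1000)])
y = 3.0 * signal


def predict_single(features):
    # Hypothetical model that relies on column 0 only.
    return 3.0 * features[:, 0]


def predict_shared(features):
    # Hypothetical model that spreads the same signal across both columns.
    return 1.5 * features[:, 0] + 1.5 * features[:, 1]


drop_single = mean_drop(predict_single, X, y, 0, rng)
drop_shared_0 = mean_drop(predict_shared, X, y, 0, rng)
drop_shared_1 = mean_drop(predict_shared, X, y, 1, rng)
print("signal in one column, shuffle it    :", round(drop_single, 3))
print("signal shared, shuffle column 0 only:", round(drop_shared_0, 3))
print("signal shared, shuffle column 1 only:", round(drop_shared_1, 3))
```

Both models predict equally well, but when the signal is shared, shuffling either column alone leaves the partner in place, so each column's drop is much smaller than the drop for the single-column model, and even the two drops added together fall short of it.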


Practice Exercises

Try each one before opening its hint. They reinforce reading the three importance types and measuring permutation importance the right way.

Exercise 1: Find Where Weight and Gain Disagree Most

Fit the XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42) model, attach the real feature names to its booster, and print the weight and gain rankings side by side. Identify the feature whose rank changes the most between the two.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data = fetch_california_housing()
features = list(data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Your code here

Hint

Call booster.get_score(importance_type="weight") and booster.get_score(importance_type="gain"), sort each into a ranked list of feature names, then compare positions. Latitude is the standout: it is 1st by weight but 4th by gain, a three-place drop, because it wins many low-payoff splits. AveOccup moves the other way (4th by weight, 2nd by gain). Any feature that swings this much between definitions is one you should never describe as simply “the Nth most important feature.”

Exercise 2: Measure Permutation Importance and Compare to Gain

Fit the same model, compute permutation importance on the test set with n_repeats=10, random_state=42, scoring="r2", and print each feature’s mean importance sorted high to low. Confirm which feature tops the list and how it compares to gain’s top feature.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

data = fetch_california_housing()
features = list(data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Your code here

Hint

Fit the model, then call permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42, scoring="r2") and read result.importances_mean. Cast values with round(float(m), 4) before printing so numpy does not wrap them as np.float64(...). You should find Latitude on top at 1.5858, then Longitude at 1.3851, then MedInc at 0.4301, overturning gain’s ranking, which put MedInc first and Latitude fourth. Note that permutation importance was measured on held-out data, so it reflects generalization, while gain reflected training-set usage.

Exercise 3: Watch Gain Credit a Pure-Noise Feature

Add a column of pure random noise (draw from np.random.default_rng(42).normal(...)) as a 9th feature, refit the model, and compare the noise column’s normalized gain to its permutation importance on the test set. Which method is fooled?

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

data = fetch_california_housing()
features = list(data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Your code here

Hint

Use rng = np.random.default_rng(42) and np.hstack to append rng.normal(size=(X_train.shape[0], 1)) to X_train (and a column of matching length to X_test). After refitting, the noise column earns a nonzero normalized gain of about 0.0093, because across hundreds of trees the greedy split-finder will occasionally find a split on it that reduces training loss by chance. Its permutation importance on the test set is about 0.0 (it rounds to -0.0), correctly reporting that the feature is worthless out of sample. Gain is fooled by noise; held-out permutation importance is not. That gap is the whole reason to prefer permutation importance for real conclusions.


Summary

You learned that “feature importance” is not one number but several, that the default one can mislead, and how to get a trustworthy answer. Let’s review.

Key Concepts

The three importances XGBoost reports

  • weight: how many times a feature is used to split (a frequency count)
  • gain: the average improvement in the training objective from splits on the feature, and the default behind feature_importances_ (normalized to sum to 1)
  • cover: the average number of samples affected by the feature’s splits
  • On Northwind’s California Housing model, these ranked the top feature as Latitude (weight), MedInc (gain), and AveRooms (cover), all from the same booster

Why the default (gain) can mislead

  • It is biased toward high-cardinality/continuous features and features used near the top of trees
  • It is computed on training data, so it reflects usage, not generalization
  • It is not causation, and because the three types disagree, “the important features” is undefined until you name a definition

Permutation importance, the reliable alternative

  • Shuffle one feature’s column on a held-out set and measure how much the score drops; a bigger drop means more importance
  • It is measured out-of-sample and is model-agnostic, avoiding gain’s bias
  • On the test set, shuffling Latitude cost the most (R^2 drop of 1.5858) and MedInc landed only third (0.4301), whereas gain ranked MedInc first and Latitude fourth
  • Its weakness is correlated features (it can split their credit); SHAP, next lesson, addresses that

Why This Matters

The default importance plot is the most common way a machine learning model gets explained to a stakeholder, and it is one of the easiest ways to mislead one. You have now seen, on real data, a case where the default confidently names the wrong top feature: gain says income, held-out permutation importance says location, and location is the answer that matches both how California housing actually works and how the model behaves on unseen districts. Knowing to ask “is this weight, gain, or cover, and was it measured on training or held-out data?” is what separates an analyst who reports a chart from one who reports the truth. Use the built-in importances for a fast, rough sense of what the model leans on; reach for permutation importance (or SHAP) before you put a conclusion in front of anyone who will act on it; and never, on any of these numbers, claim that a feature causes the outcome. The model tells you what it used, not why the world works.


Next Steps

You can now measure importance three ways and know which to trust. But even permutation importance gives you a single global ranking; it cannot tell you why the model priced this particular district the way it did, and it stumbles when features are correlated. Next you will meet SHAP, which explains individual predictions and handles feature interactions, and is one of the most widely used tools for interpreting XGBoost models.

Lesson 2: Explaining Predictions with SHAP

Go beyond global rankings to per-prediction explanations that handle feature interactions, using SHAP values on the same California Housing model.



Continue Building Your Skills

Before moving on, rerun the two rankings yourself and sit with the disagreement. Try changing max_depth and watch how the gap between weight and gain shifts, since deeper trees give features more chances to split. Add a second noise column and confirm again that gain credits it while permutation importance does not. The habit worth building here is skepticism: whenever you are handed a feature-importance chart, ask which definition produced it and whether it was measured on data the model had already seen. That one question, asked reflexively, will keep you from confidently reporting the wrong most-important feature, exactly the mistake this lesson caught the default method making.
