Lesson 1 - Feature Importance and Its Pitfalls
Welcome to Feature Importance and Its Pitfalls
Welcome to Module 4. Over the last three modules you turned XGBoost from a name into a tool you can drive: you built the boosting core by hand, opened up the regularized objective, and learned to tune the knobs that control accuracy. Now the questions change. A tuned model that predicts well is only half the job; the other half is being able to look someone in the eye and say why it predicts what it does. This module is about interpretation, tuning discipline, and getting the model into production, and it starts here, with the single most requested and most misused number in applied machine learning: feature importance.
Our running team, Northwind Analytics, has a strong California Housing model in hand (test RMSE 0.4696 from Module 2). The moment they show it to stakeholders, the question is always the same: which features actually drive the prediction? XGBoost is happy to answer. In fact it will answer three different ways, and, as you are about to see with real numbers, those three answers do not agree with each other. One of them, the default, will confidently tell Northwind that median income is the single most important feature. A more careful method, measured the right way, will disagree and put geography on top. Getting this right is the difference between an honest insight and a confident mistake. Every number below was produced by running the code for real.
By the end of this lesson, you will be able to:
- Compute all three importance types XGBoost reports (weight, gain, and cover) with booster.get_score(importance_type=...) and explain what each one measures
- Show that the three types rank the same features differently on a real model
- Explain why the default (gain) importance is biased, computed on training data, and says nothing about causation or generalization
- Measure permutation importance on a held-out set with sklearn.inspection.permutation_importance and compare its ranking to gain
- State clear guidance on when to trust each method and never read causation into any of them
You should be comfortable fitting an XGBRegressor and with a train/test split. No new theory is required, just a healthy skepticism. Let’s begin.
The Three Importances XGBoost Reports
When a fitted booster is asked “how important is each feature?”, it does not have one answer. It keeps a few different bookkeeping tallies as it grows trees, and each tally is a legitimate but different notion of importance. XGBoost exposes them through booster.get_score(importance_type=...). Three matter most:
- weight counts how many times a feature is used to make a split across all trees. A feature used in many splits gets a high weight. This is purely a frequency count; it says nothing about how useful each split was.
- gain measures the average improvement in the training objective (the drop in loss) that splits on this feature produced. A feature that, when split on, sharply reduces the error gets a high gain. This is the default behind scikit-learn’s feature_importances_ attribute, so it is the one most people see without realizing it.
- cover measures the average number of training samples affected by splits on this feature (technically the sum of the second-order statistics passing through those splits). A feature whose splits sit near the top of trees, touching many rows, gets high cover.
Three different questions: How often is it used? How much does each use help? How many rows does it touch? There is no reason these should agree, and on real data they do not. Let’s fit Northwind’s model and read all three off the same booster.
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data = fetch_california_housing()
features = list(data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

model = xgb.XGBRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42
)
model.fit(X_train, y_train)

# Attach readable names so the booster reports them instead of f0, f1, ...
booster = model.get_booster()
booster.feature_names = features

# Read each importance type off the same booster and print it high to low
for imp_type in ["weight", "gain", "cover"]:
    scores = booster.get_score(importance_type=imp_type)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    print(f"--- {imp_type} ---")
    for name, value in ranked:
        print(f" {name:11s} {float(round(value, 2))}")

--- weight ---
Latitude 867.0
Longitude 831.0
MedInc 627.0
AveOccup 550.0
HouseAge 474.0
AveRooms 421.0
Population 359.0
AveBedrms 301.0
--- gain ---
MedInc 92.48
AveOccup 23.72
Longitude 14.47
Latitude 12.46
HouseAge 9.64
AveRooms 5.89
AveBedrms 3.1
Population 2.51
--- cover ---
AveRooms 5753.22
MedInc 5747.47
Longitude 5284.72
Latitude 5014.86
AveBedrms 3983.29
AveOccup 3160.23
HouseAge 2662.33
Population 2367.47

Look carefully at the top of each list, because they tell three different stories about the same model:
- weight says Latitude is the most important feature (used in 867 splits), with Longitude close behind. The geographic coordinates are used constantly.
- gain says MedInc is the most important feature by a landslide: its average split improves the objective by 92.48, nearly four times the next feature (AveOccup at 23.72). By gain, Latitude drops all the way to 4th.
- cover says AveRooms edges out MedInc, and reshuffles the middle of the pack again.
The units are not comparable across types, so do not compare a weight of 867 to a gain of 92; compare rankings. And the rankings genuinely conflict. Latitude is 1st by weight and 4th by gain. AveOccup is 4th by weight but 2nd by gain. If Northwind reported “the model’s top feature is X,” the honest answer would be “which definition of top did you use?”
Why does Latitude rack up so many splits (high weight) but only middling gain? Because geography is useful in many small ways: the model keeps carving the map into finer regions, each split helping a little. MedInc is the opposite: it is split on fewer times, but each of those splits is enormously informative, so it dominates gain. Neither view is wrong. They are answers to different questions.
feature_importances_ is gain, and it is normalized
When you read model.feature_importances_ from the scikit-learn wrapper, you are getting the gain importance, rescaled to sum to 1. So the raw gain of 92.48 for MedInc becomes a normalized 0.563: MedInc’s average gain per split accounts for about 56 percent of the eight features’ average gains added together. That is the number most tutorials plot without ever mentioning that a different importance_type would draw a different bar chart. Whenever you see a feature-importance plot, your first question should be “is this weight, gain, or cover?”
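The rescaling is plain arithmetic, and it can be reproduced from the raw gains printed earlier. This is a minimal sketch that hard-codes those printed values (already rounded to two decimals), so the shares it produces are approximate.

```python
# Raw average gains as printed earlier in this lesson (rounded to two decimals).
raw_gain = {
    "MedInc": 92.48, "AveOccup": 23.72, "Longitude": 14.47, "Latitude": 12.46,
    "HouseAge": 9.64, "AveRooms": 5.89, "AveBedrms": 3.1, "Population": 2.51,
}

# Divide each feature's gain by the sum so the shares add up to 1.
total = sum(raw_gain.values())
normalized = {name: value / total for name, value in raw_gain.items()}

for name, share in normalized.items():
    print(f"{name:11s} {share:.3f}")
```

MedInc comes out at about 0.563, matching the normalized value quoted above.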
The Pitfalls: Why the Default Answer Can Mislead
Gain is the default, it is easy to read, and it is genuinely useful for a rough sense of what the model leans on. But it carries three pitfalls that get people into trouble when they treat it as ground truth.
1. Gain is biased toward some kinds of features. Split-based importances systematically favor continuous, high-cardinality features and features that happen to be used near the top of trees. A feature with many distinct values gives the greedy split-finder more thresholds to choose from, so it wins more splits and accumulates more gain, partly on merit and partly just because it had more chances. MedInc (a fine-grained continuous income measure) is exactly the kind of feature this bias flatters. That does not mean MedInc is unimportant; it means gain overstates its lead relative to features with fewer split points.
2. Gain is computed on training data. Every number in the block above was measured while the trees were being fit, on the training set. It reflects how the model used each feature to reduce training loss, not whether that feature helps the model generalize to new districts. A feature the model leaned on heavily during training could still be one it overfit to. Importance measured on the training set cannot tell you the difference.
3. Importance is not causation, and the types disagree. A high importance means “the model used this feature,” not “this feature causes the target.” If two features are correlated, the model may split on one and ignore the other, and the ignored one will look unimportant even if it is just as predictive. And, as you just saw, weight, gain, and cover rank the same features differently, so “the important features” is not even a well-defined list until you pick a definition. Any conclusion that would flip if you switched importance_type was never solid to begin with.
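One quick way to test whether a conclusion survives a change of definition is to compare each feature's best and worst rank across the three types. The sketch below hard-codes the rankings printed earlier in this lesson; the helper name rank_spread is hypothetical, not part of XGBoost.

```python
# Rankings copied from the weight / gain / cover output earlier in this lesson.
rankings = {
    "weight": ["Latitude", "Longitude", "MedInc", "AveOccup",
               "HouseAge", "AveRooms", "Population", "AveBedrms"],
    "gain": ["MedInc", "AveOccup", "Longitude", "Latitude",
             "HouseAge", "AveRooms", "AveBedrms", "Population"],
    "cover": ["AveRooms", "MedInc", "Longitude", "Latitude",
              "AveBedrms", "AveOccup", "HouseAge", "Population"],
}

def rank_spread(rankings):
    """Return {feature: (best_rank, worst_rank)} across all importance types."""
    spread = {}
    for feature in rankings["weight"]:
        ranks = [order.index(feature) + 1 for order in rankings.values()]
        spread[feature] = (min(ranks), max(ranks))
    return spread

spread = rank_spread(rankings)
top_features = {imp_type: order[0] for imp_type, order in rankings.items()}
stable = [f for f, (best, worst) in spread.items() if best == worst]

print("top feature per type:", top_features)
print("rank range per feature:", spread)
print("features with the same rank everywhere:", stable)
```

On these rankings the list of stable features is empty: no feature holds the same position under all three definitions.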
The figure below makes the disagreement concrete using the exact rankings you just computed, plus the permutation ranking you are about to measure.
A More Reliable Answer: Permutation Importance
If gain is biased and training-based, what should Northwind trust instead? The most dependable model-agnostic tool is permutation importance, and its logic is refreshingly direct. Take your held-out test set, where the model has never seen the labels. Measure the model’s score. Now take one feature’s column and randomly shuffle it, breaking any relationship between that feature and the target while leaving every other feature intact. Score the model again. If that feature mattered for generalization, performance drops; the size of the drop is the feature’s importance. Shuffle a useless feature and the score barely moves.
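The shuffle-and-rescore loop described above is short enough to write by hand. The following is a minimal sketch under stated assumptions, not the scikit-learn implementation: the function name manual_permutation_importance is hypothetical, the data is synthetic, and a plain linear model stands in for XGBoost to keep the example light (any fitted estimator with a predict method would slot in).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in R^2 when each column of X is shuffled, one column at a time."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, model.predict(X))
    drops = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffle one column only; every other column keeps its real values.
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops[col] += baseline - r2_score(y, model.predict(X_shuffled))
    return drops / n_repeats

# Synthetic data: column 0 matters a lot, column 1 a little, column 2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Importance is measured on the held-out rows, never on the training rows.
drops = manual_permutation_importance(model, X_test, y_test)
print(np.round(drops, 4))
```

Note that the drop for the dominant column can exceed 1: once that column is shuffled, the model's R² on the held-out rows falls below zero, so the distance from the baseline is larger than the baseline itself.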
This fixes the pitfalls head-on. It is measured on held-out data, so it reports generalization, not training-set usage. It works for any model, so it does not inherit the split-finder’s bias toward high-cardinality features. And it directly answers the question people actually mean: how much does the model’s real-world accuracy depend on this feature? sklearn.inspection.permutation_importance does the shuffling for us; we run several repeats and average, and we score with R² so a bigger number always means “more important.”
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

data = fetch_california_housing()
features = list(data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

model = xgb.XGBRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42
)
model.fit(X_train, y_train)

# Gain ranking straight from feature_importances_ (normalized gain)
gain = {f: round(float(v), 4) for f, v in zip(features, model.feature_importances_)}
gain_rank = [f for f, _ in sorted(gain.items(), key=lambda kv: kv[1], reverse=True)]

# Permutation importance on the HELD-OUT test set, scored by R2
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42, scoring="r2"
)
perm = {f: float(round(m, 4)) for f, m in zip(features, result.importances_mean)}
perm_rank = [f for f, _ in sorted(perm.items(), key=lambda kv: kv[1], reverse=True)]

# Print the two rankings side by side, one row per rank position
print("rank | by gain | by permutation")
for i in range(len(features)):
    g = gain_rank[i]
    p = perm_rank[i]
    print(f" {i+1} | {g:11s} | {p:11s}")

print("\nTop feature by gain :", gain_rank[0], gain[gain_rank[0]])
print("Top feature by permutation:", perm_rank[0], perm[perm_rank[0]])

rank | by gain | by permutation
1 | MedInc | Latitude
2 | AveOccup | Longitude
3 | Longitude | MedInc
4 | Latitude | AveOccup
5 | HouseAge | AveRooms
6 | AveRooms | HouseAge
7 | AveBedrms | AveBedrms
8 | Population | Population
Top feature by gain : MedInc 0.563
Top feature by permutation: Latitude 1.5858

This is the punchline of the lesson. Gain says the single most important feature is MedInc. Permutation importance, measured honestly on held-out data, says it is Latitude, with Longitude second, and MedInc only third. Shuffling Latitude costs the model 1.5858 in R² (an enormous drop: a perfect R² is 1.0, so the shuffled model’s score falls below zero, worse than always predicting the mean), while shuffling MedInc costs 0.4301. Geography, not income, is what the model’s out-of-sample accuracy leans on hardest.
Where do the two methods agree? At the bottom, completely: both rank AveBedrms 7th and Population 8th, so both agree these two features barely matter. And both keep the same four features (MedInc, Latitude, Longitude, AveOccup) in the top four, just in a different order. Where they disagree is exactly where it counts, the headline: gain’s top-of-tree, high-cardinality bias inflated MedInc to first place, while permutation, immune to that bias and measured on unseen districts, reveals that the geographic pair carries more of the real predictive weight. Both facts are true about the same model. Only one of them answers “what does generalization depend on.”
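The agreement described above can be confirmed mechanically. This small sketch hard-codes the two rankings from the printed table; nothing in it calls XGBoost or scikit-learn.

```python
# Rankings copied from the gain-versus-permutation table above.
gain_rank = ["MedInc", "AveOccup", "Longitude", "Latitude",
             "HouseAge", "AveRooms", "AveBedrms", "Population"]
perm_rank = ["Latitude", "Longitude", "MedInc", "AveOccup",
             "AveRooms", "HouseAge", "AveBedrms", "Population"]

# Positions (1-based) where both methods name the same feature.
same_position = [
    i + 1 for i, (g, p) in enumerate(zip(gain_rank, perm_rank)) if g == p
]

# Do both methods pick the same set of features for the top four?
same_top_four = set(gain_rank[:4]) == set(perm_rank[:4])

print("positions in agreement:", same_position)
print("same top-four set:", same_top_four)
```

Only positions 7 and 8 match exactly, while the top-four sets are identical, so the disagreement is about order at the top rather than about which features belong there.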
Why the geography story makes sense
California house prices are famously location-driven: coastal and urban districts command a premium that income alone cannot explain. It is reassuring that the honest importance method surfaces Latitude and Longitude on top, matching what we know about the housing market, while the default method was distracted by income’s abundance of split points. When a more careful measurement agrees with domain knowledge and the quick default does not, trust the careful one, and be glad you checked.
Permutation importance is not free. It costs extra passes over the data (one refit-free re-scoring per feature per repeat), and it has a blind spot of its own: when two features are strongly correlated, shuffling just one lets the model lean on its partner, so both can look less important than they really are. The next lesson introduces SHAP, which handles interactions and gives per-prediction explanations, going beyond the single global ranking permutation importance provides. For a trustworthy global ranking today, permutation importance on held-out data is the right default.
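The correlated-feature blind spot can be reproduced on toy data. The sketch below rests on several assumptions that are not from the lesson: the data is synthetic, x2 is built as a near-copy of x1, and a ridge regression stands in for XGBoost because it tends to spread weight across near-duplicate columns (a tree ensemble may divide the credit differently). The helper drop_after_shuffle is hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # near-duplicate of x1
y = x1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2])

# Simple ordered split: first 800 rows for training, last 200 held out.
X_train, X_test = X[:800], X[800:]
y_train, y_test = y[:800], y[800:]

# Ridge is a stand-in model that shares weight between the two columns.
model = Ridge(alpha=100.0).fit(X_train, y_train)
baseline = r2_score(y_test, model.predict(X_test))

def drop_after_shuffle(columns, n_repeats=20):
    """Mean R^2 drop when the given columns are shuffled together (same row order)."""
    total = 0.0
    for _ in range(n_repeats):
        order = rng.permutation(len(X_test))
        X_shuffled = X_test.copy()
        X_shuffled[:, columns] = X_test[order][:, columns]
        total += baseline - r2_score(y_test, model.predict(X_shuffled))
    return total / n_repeats

drop_x1 = drop_after_shuffle([0])
drop_x2 = drop_after_shuffle([1])
drop_both = drop_after_shuffle([0, 1])
print(f"x1 alone: {drop_x1:.3f}  x2 alone: {drop_x2:.3f}  both together: {drop_both:.3f}")
```

Shuffling either column alone costs much less than shuffling the pair together, because the untouched partner still carries most of the signal. Read in isolation, each single-column score understates how much the model depends on the shared information.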
Practice Exercises
Try each one before opening its hint. They reinforce reading the three importance types and measuring permutation importance the right way.
Exercise 1: Find Where Weight and Gain Disagree Most
Fit the XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42) model, attach the real feature names to its booster, and print the weight and gain rankings side by side. Identify the feature whose rank changes the most between the two.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
data = fetch_california_housing()
features = list(data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Your code here

Hint
Call booster.get_score(importance_type="weight") and booster.get_score(importance_type="gain"), sort each into a ranked list of feature names, then compare positions. Latitude is the standout: it is 1st by weight but 4th by gain, a three-place drop, because it wins many low-payoff splits. AveOccup moves the other way (4th by weight, 2nd by gain). Any feature that swings this much between definitions is one you should never describe as simply “the Nth most important feature.”
Exercise 2: Measure Permutation Importance and Compare to Gain
Fit the same model, compute permutation importance on the test set with n_repeats=10, random_state=42, scoring="r2", and print each feature’s mean importance sorted high to low. Confirm which feature tops the list and how it compares to gain’s top feature.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
data = fetch_california_housing()
features = list(data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Your code here

Hint
Fit the model, then call permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42, scoring="r2") and read result.importances_mean. Cast values with round(float(m), 4) before printing so numpy does not wrap them as np.float64(...). You should find Latitude on top at 1.5858, then Longitude at 1.3851, then MedInc at 0.4301. That is a different order from gain, which put MedInc first and Latitude fourth. Note that permutation importance was measured on held-out data, so it reflects generalization, while gain reflected training-set usage.
Exercise 3: Watch Gain Credit a Pure-Noise Feature
Add a column of pure random noise (draw from np.random.default_rng(42).normal(...)) as a 9th feature, refit the model, and compare the noise column’s normalized gain to its permutation importance on the test set. Which method is fooled?
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
data = fetch_california_housing()
features = list(data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Your code here

Hint
Use rng = np.random.default_rng(42) and np.hstack to append rng.normal(size=(X_train.shape[0], 1)) to X_train (and the same for X_test). After refitting, the noise column earns a nonzero normalized gain of about 0.0093, because the greedy split-finder will always find some split on it that reduces training loss by chance. Its permutation importance on the test set is about 0.0 (it rounds to -0.0), correctly reporting that the feature is worthless out of sample. Gain is fooled by noise; held-out permutation importance is not. That gap is the whole reason to prefer permutation importance for real conclusions.
Summary
You learned that “feature importance” is not one number but several, that the default one can mislead, and how to get a trustworthy answer. Let’s review.
Key Concepts
The three importances XGBoost reports
- weight: how many times a feature is used to split (a frequency count)
- gain: the average improvement in the training objective from splits on the feature, and the default behind feature_importances_ (normalized to sum to 1)
- cover: the average number of samples affected by the feature’s splits
- On Northwind’s California Housing model, these ranked the top feature as Latitude (weight), MedInc (gain), and AveRooms (cover), all from the same booster
Why the default (gain) can mislead
- It is biased toward high-cardinality/continuous features and features used near the top of trees
- It is computed on training data, so it reflects usage, not generalization
- It is not causation, and because the three types disagree, “the important features” is undefined until you name a definition
Permutation importance, the reliable alternative
- Shuffle one feature’s column on a held-out set and measure how much the score drops; a bigger drop means more importance
- It is measured out-of-sample and is model-agnostic, avoiding gain’s bias
- On the test set, shuffling Latitude cost the most (an R² drop of 1.5858) and MedInc landed only third (0.4301), even though gain ranked MedInc first
- Its weakness is correlated features (it can split their credit); SHAP, next lesson, addresses that
Why This Matters
The default importance plot is the most common way a machine learning model gets explained to a stakeholder, and it is one of the easiest ways to mislead one. You have now seen, on real data, a case where the default confidently names the wrong top feature: gain says income, held-out permutation importance says location, and location is the answer that matches both how California housing actually works and how the model behaves on unseen districts. Knowing to ask “is this weight, gain, or cover, and was it measured on training or held-out data?” is what separates an analyst who reports a chart from one who reports the truth. Use the built-in importances for a fast, rough sense of what the model leans on; reach for permutation importance (or SHAP) before you put a conclusion in front of anyone who will act on it; and never, on any of these numbers, claim that a feature causes the outcome. The model tells you what it used, not why the world works.
Next Steps
You can now measure importance three ways and know which to trust. But even permutation importance gives you a single global ranking; it cannot tell you why the model priced this particular district the way it did, and it stumbles when features are correlated. Next you will meet SHAP, which explains individual predictions and handles feature interactions, the gold standard for XGBoost interpretability.
Lesson 2: Explaining Predictions with SHAP
Go beyond global rankings to per-prediction explanations that handle feature interactions, using SHAP values on the same California Housing model.
Continue Building Your Skills
Before moving on, rerun the two rankings yourself and sit with the disagreement. Try changing max_depth and watch how the gap between weight and gain shifts, since deeper trees give features more chances to split. Add a second noise column and confirm again that gain credits it while permutation importance does not. The habit worth building here is skepticism: whenever you are handed a feature-importance chart, ask which definition produced it and whether it was measured on data the model had already seen. That one question, asked reflexively, will keep you from confidently reporting the wrong most-important feature, exactly the mistake this lesson caught the default method making.