Lesson 4 - Missing Values and Categorical Features
Welcome to Missing Values and Categorical Features
Real data is messy in two specific, predictable ways: it has holes and it has words. Some rows are missing a value; some columns hold categories like workclass or occupation instead of numbers. For most machine learning models these two facts mean a whole preprocessing stage before you can even call fit: imputers to fill the gaps, and one-hot encoders to turn categories into columns of zeros and ones.
Our running team, Northwind Analytics, is about to build a model on the real Adult Income dataset, predicting whether a person earns more than 50,000 dollars a year. That dataset has both problems at once: three of its columns (workclass, occupation, native-country) contain genuine missing values, and eight of its columns are categorical. A colleague coming from linear models is already reaching for SimpleImputer and OneHotEncoder. In this lesson you will show Northwind that with XGBoost, most of that machinery is unnecessary. XGBoost handles missing values through the sparsity-aware split finding you first heard named in Module 2, and it handles categories through native categorical splits. You will prove both work on real data, with real numbers produced by running the code.
By the end of this lesson, you will be able to:
- Explain why most models need imputation and how XGBoost instead learns a default branch direction for missing values at each split
- Train an XGBClassifier on data containing np.nan with no imputer and confirm it predicts without error
- Enable native categorical splits with enable_categorical=True and tree_method="hist", passing a pandas DataFrame with category dtype straight to fit
- Compare native categorical against traditional one-hot encoding on the same split and report both the accuracy and the feature-count difference
- Decide when native categorical support helps and what its caveats are
You should be comfortable with the scikit-learn fit/predict workflow, a stratified train/test split, and the split-finding idea from Module 2. Let’s begin.
Missing Values: XGBoost Learns Where the Gaps Should Go
Start with the problem. A decision tree split asks a question like “is hours-per-week less than 40?” and sends each row left or right based on the answer. But if a row’s hours-per-week is missing, there is no answer, and the split does not know which way to send it. This is why most models refuse to train on data containing np.nan at all: a linear model cannot multiply a coefficient by a missing number, and most scikit-learn estimators raise an error when they meet one. The standard fix is imputation: fill every gap with the column mean, median, or some learned value, so there are no holes left. Imputation works, but it is a guess, and a mean-filled value can be misleading when the fact that a value is missing is itself informative.
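To make the linear-model problem concrete, here is a minimal NumPy sketch. The three-row matrix and the coefficient vector w are made up for illustration and are not taken from Adult; the point is only that a single NaN turns the whole prediction for that row into NaN, and that mean imputation removes the NaN by substituting a guess.

```python
import numpy as np

# Toy feature matrix (hypothetical): columns are age and hours-per-week.
# The second row is missing its hours-per-week value.
X = np.array([
    [25.0, 40.0],
    [38.0, np.nan],
    [52.0, 60.0],
])
w = np.array([0.1, 0.5])  # hypothetical linear-model coefficients

# A single NaN propagates through the dot product for that row.
raw_scores = X @ w
print(raw_scores)

# Mean imputation: replace each NaN with the mean of the observed values
# in the same column, then score again.
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)
filled_scores = X_filled @ w
print(filled_scores)
```

The filled row now scores like a person who works the average number of hours, which may or may not be true of them.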
XGBoost takes a different route, the one Module 2 called sparsity-aware split finding. When XGBoost evaluates a split, it does not just decide the threshold. It also learns a default direction: for the rows whose value is missing, it tries sending them all left and all right, measures which choice improves the objective more, and bakes that winning direction into the split. Every split gets its own learned default. Missing values are not filled in; they are routed, following whichever branch the training data showed was best.
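The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not XGBoost's implementation: the function names are hypothetical, the score is the structure term G² / (H + λ) without the 1/2 factor or the gamma penalty, and the gradients are toy squared-error-style values (a prediction of 0.5 minus the label) with unit hessians.

```python
def leaf_score(g_sum, h_sum, lam=1.0):
    """Simplified structure score G^2 / (H + lambda) for one child node."""
    return g_sum * g_sum / (h_sum + lam)


def best_default_direction(values, grads, hess, threshold, lam=1.0):
    """Evaluate one split twice: missing rows sent left, then sent right."""
    g_left = h_left = g_right = h_right = g_miss = h_miss = 0.0
    for v, g, h in zip(values, grads, hess):
        if v is None:            # missing value: set aside for now
            g_miss += g
            h_miss += h
        elif v < threshold:
            g_left += g
            h_left += h
        else:
            g_right += g
            h_right += h

    parent = leaf_score(g_left + g_right + g_miss,
                        h_left + h_right + h_miss, lam)
    gain_left = (leaf_score(g_left + g_miss, h_left + h_miss, lam)
                 + leaf_score(g_right, h_right, lam) - parent)
    gain_right = (leaf_score(g_left, h_left, lam)
                  + leaf_score(g_right + g_miss, h_right + h_miss, lam) - parent)
    direction = "left" if gain_left >= gain_right else "right"
    return direction, gain_left, gain_right


# Toy rows: hours-per-week (None = missing) and a binary label.
hours = [20.0, 30.0, 45.0, 50.0, None, None]
labels = [0, 0, 1, 1, 1, 1]
grads = [0.5 - y for y in labels]   # prediction 0.5 minus label
hess = [1.0] * len(labels)          # unit hessians for simplicity

direction, gain_left, gain_right = best_default_direction(
    hours, grads, hess, threshold=40.0
)
print(direction, round(gain_left, 4), round(gain_right, 4))
```

In this toy data the missing rows have the same labels as the long-hours rows, so sending them right yields the larger gain and "right" becomes the default for this split.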
The practical payoff is that you hand XGBoost data with np.nan in it and it just trains. To prove it, we take Adult’s numeric columns, inject missing values into hours-per-week for 20 percent of the rows, and fit with no imputer anywhere in sight.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
adult = fetch_openml("adult", version=2, as_frame=True)
X = adult.data
y = (adult.target == ">50K").astype(int)
# Use only the numeric columns for a clean missing-values demo
num = X.select_dtypes("number").copy()
Xtr, Xte, ytr, yte = train_test_split(
    num, y, test_size=0.2, random_state=42, stratify=y
)
# Baseline: no missing values at all
base = xgb.XGBClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42
)
base.fit(Xtr, ytr)
auc_base = roc_auc_score(yte, base.predict_proba(Xte)[:, 1])
# Inject np.nan into hours-per-week for 20% of rows, train and test
rng = np.random.RandomState(42)
Xtr_m, Xte_m = Xtr.copy(), Xte.copy()
mask_tr = rng.rand(len(Xtr_m)) < 0.20
mask_te = rng.rand(len(Xte_m)) < 0.20
Xtr_m.loc[Xtr_m.index[mask_tr], "hours-per-week"] = np.nan
Xte_m.loc[Xte_m.index[mask_te], "hours-per-week"] = np.nan
print("NaN cells in train:", int(Xtr_m["hours-per-week"].isna().sum()))
print("NaN cells in test :", int(Xte_m["hours-per-week"].isna().sum()))
# No imputation - fit directly on data that contains np.nan
miss = xgb.XGBClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42
)
miss.fit(Xtr_m, ytr)
auc_miss = roc_auc_score(yte, miss.predict_proba(Xte_m)[:, 1])
print("baseline (no NaN) test AUC:", float(round(auc_base, 4)))
print("20% NaN, no imputation test AUC:", float(round(auc_miss, 4)))

Output:
NaN cells in train: 7777
NaN cells in test : 2000
baseline (no NaN) test AUC: 0.8794
20% NaN, no imputation test AUC: 0.8781

Read what happened. We punched 7,777 holes into the training data and 2,000 into the test data, then called fit and predict_proba with those holes still there. No error, no imputer, no warning we cared about. And the cost was tiny: test AUC slipped from 0.8794 to 0.8781, about one thousandth. XGBoost learned, split by split, which way to send the rows whose hours-per-week it could not see, and those learned default directions carried almost all of the signal the intact column had. That is the sparsity-aware split finding from Module 2, working on real data.
Missing on purpose, not by accident
The learned default direction does more than tolerate gaps: it lets XGBoost use the absence of a value as a signal. If people who leave occupation blank tend to fall in one income bracket, XGBoost can route all missing-occupation rows the informative way, something mean-imputation destroys by pretending the value was simply average. This is why you should resist the reflex to impute before handing data to XGBoost; you may be erasing information the model would otherwise have used.
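A small pandas sketch shows what mean imputation throws away. The eight rows below are invented for illustration (they are not drawn from Adult), and a numeric hours column stands in for the occupation example so that a mean exists to impute with.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows with a missing 'hours' value are mostly high earners.
df = pd.DataFrame({
    "hours": [35.0, 40.0, 45.0, 50.0, np.nan, np.nan, np.nan, np.nan],
    "high_income": [0, 0, 1, 0, 1, 1, 1, 0],
})

# Before imputation, missingness itself separates the two groups.
was_missing = df["hours"].isna()
rate_missing = df.loc[was_missing, "high_income"].mean()
rate_present = df.loc[~was_missing, "high_income"].mean()
print("high-income rate, hours missing:", rate_missing)
print("high-income rate, hours present:", rate_present)

# Mean imputation overwrites every gap with the same ordinary-looking value,
# so nothing in the column marks those rows as special any more.
imputed = df["hours"].fillna(df["hours"].mean())
print("NaN cells left:", int(imputed.isna().sum()))
print("values written into the gaps:", imputed[was_missing].unique())
```

After the fill, a model sees four rows that look like people working an average week. Leaving the NaN in place lets XGBoost route those rows as a group instead.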
Categorical Features: Skip the One-Hot Explosion
Now the second kind of mess: columns that hold categories. Adult’s workclass, education, marital-status, occupation, relationship, race, sex, and native-country are all categorical. A tree wants to ask numeric questions, so the classic fix is one-hot encoding: replace each categorical column with a block of 0/1 indicator columns, one per category. sex becomes 2 columns, education becomes 16, native-country becomes 41, and so on. The trouble is dimensionality. Across Adult’s eight categorical columns there are 99 distinct categories, so one-hot encoding turns 14 tidy columns into a wide, sparse matrix of 105 (the 6 numeric columns plus 99 indicators), and high-cardinality columns like native-country dominate the width while carrying little signal per column.
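The width arithmetic is easy to check on a miniature example. The frame below is a hypothetical stand-in for Adult with one numeric and two categorical columns; pd.get_dummies and select_dtypes are the same pandas calls the lesson uses on the real data.

```python
import pandas as pd

# Hypothetical miniature of Adult: one numeric column, two categorical columns.
toy = pd.DataFrame({
    "age": [25, 38, 52, 41],
    "sex": pd.Categorical(["Male", "Female", "Female", "Male"]),
    "education": pd.Categorical(["HS-grad", "Bachelors", "Masters", "HS-grad"]),
})

cat_cols = toy.select_dtypes("category").columns.tolist()
n_categories = sum(int(toy[c].nunique()) for c in cat_cols)
n_numeric = toy.shape[1] - len(cat_cols)

# One-hot width = numeric columns + one indicator per distinct category.
toy_ohe = pd.get_dummies(toy, columns=cat_cols)
print("columns before:", toy.shape[1])
print("columns after :", toy_ohe.shape[1])
print("numeric + categories:", n_numeric + n_categories)
```

The same rule applied to Adult gives 6 numeric columns plus 99 categories, which is the 105-column matrix that appears in the head-to-head comparison.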
XGBoost offers a direct alternative: native categorical splits. Instead of exploding a column into indicators, XGBoost keeps it as one column and, at a split, partitions the set of categories into two groups, sending some categories left and the rest right. You opt in with two settings: enable_categorical=True and tree_method="hist". The one requirement on your side is that the categorical columns carry the pandas category dtype, which Adult’s already do when loaded with as_frame=True. Then you pass the DataFrame straight to fit, no encoder in the pipeline.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
adult = fetch_openml("adult", version=2, as_frame=True)
X = adult.data
y = (adult.target == ">50K").astype(int)
cat_cols = X.select_dtypes("category").columns.tolist()
print("categorical columns:", cat_cols)
print("distinct categories across them:",
      sum(int(X[c].nunique()) for c in cat_cols))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Native categorical: enable_categorical + hist, pass the DataFrame directly
native = xgb.XGBClassifier(
    enable_categorical=True,
    tree_method="hist",
    n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42,
)
native.fit(X_train, y_train)
p_native = native.predict_proba(X_test)[:, 1]
acc = accuracy_score(y_test, native.predict(X_test))
auc = roc_auc_score(y_test, p_native)
print("native categorical columns:", X_train.shape[1])
print("native categorical test accuracy:", float(round(acc, 4)))
print("native categorical test AUC:", float(round(auc, 4)))

Output:
categorical columns: ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
distinct categories across them: 99
native categorical columns: 14
native categorical test accuracy: 0.8755
native categorical test AUC: 0.93

XGBoost trained on all 14 original columns, categorical ones included, and reached test AUC 0.9300 and accuracy 87.55 percent. No OneHotEncoder, no ColumnTransformer, no expanded matrix. The categorical columns stayed as themselves, and XGBoost found the category groupings that best separated the two income classes.
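To give a feel for what "partitioning the set of categories" means, here is a rough pure-Python sketch of one common heuristic: sum the gradient statistics per category, sort the categories by that statistic, and scan the sorted order for the best cut. This is an illustration under stated assumptions, not a description of XGBoost's internals; the function names are hypothetical, the gain omits the 1/2 factor and gamma penalty, and the gradients are toy values (a prediction of 0.5 minus the label) with unit hessians.

```python
from collections import defaultdict


def leaf_score(g_sum, h_sum, lam=1.0):
    """Simplified structure score G^2 / (H + lambda) for one child node."""
    return g_sum * g_sum / (h_sum + lam)


def best_category_partition(categories, grads, hess, lam=1.0):
    """Sort categories by their gradient statistic, then scan prefix cuts."""
    g_by_cat = defaultdict(float)
    h_by_cat = defaultdict(float)
    for c, g, h in zip(categories, grads, hess):
        g_by_cat[c] += g
        h_by_cat[c] += h

    order = sorted(g_by_cat, key=lambda c: g_by_cat[c] / (h_by_cat[c] + lam))
    g_total = sum(g_by_cat.values())
    h_total = sum(h_by_cat.values())
    parent = leaf_score(g_total, h_total, lam)

    best_gain, best_left = 0.0, None
    g_left = h_left = 0.0
    for i, c in enumerate(order[:-1]):   # keep at least one category on the right
        g_left += g_by_cat[c]
        h_left += h_by_cat[c]
        gain = (leaf_score(g_left, h_left, lam)
                + leaf_score(g_total - g_left, h_total - h_left, lam) - parent)
        if gain > best_gain:
            best_gain, best_left = gain, set(order[: i + 1])
    return best_left, best_gain


# Toy rows: a workclass-like column and a binary label.
workclass = ["Private", "Private", "Private", "Self-emp", "Self-emp", "Gov", "Gov"]
labels = [0, 0, 0, 1, 1, 1, 0]
grads = [0.5 - y for y in labels]
hess = [1.0] * len(labels)

best_left, best_gain = best_category_partition(workclass, grads, hess)
print(sorted(best_left), round(best_gain, 5))
```

The column stays a single feature throughout; only the question asked at the split changes from "is the value below a threshold?" to "is the value in this group of categories?".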
Set enable_categorical explicitly
Do not rely on the default value of enable_categorical. In many XGBoost releases the scikit-learn wrapper defaults it to False, and fitting on a DataFrame with category columns without the flag raises an error, so check the documentation for the version you have installed. We set it explicitly. Being explicit documents your intent, keeps the code correct whatever the default happens to be, and makes it obvious to a reader that native categorical handling is switched on. tree_method="hist" is the histogram-based algorithm that native categorical splits are built on; it has also been XGBoost’s default tree method since version 2.0.
Head-to-Head: Native Categorical vs. One-Hot Encoding
The natural question is whether skipping one-hot encoding costs you anything. Let’s settle it: build the one-hot version of exactly the same data, on the same split and seed, and compare.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
adult = fetch_openml("adult", version=2, as_frame=True)
X = adult.data
y = (adult.target == ">50K").astype(int)
cat_cols = X.select_dtypes("category").columns.tolist()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Native categorical
native = xgb.XGBClassifier(
    enable_categorical=True, tree_method="hist",
    n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42,
)
native.fit(X_train, y_train)
auc_native = roc_auc_score(y_test, native.predict_proba(X_test)[:, 1])
# One-hot encoding: expand every categorical column into indicators
X_ohe = pd.get_dummies(X, columns=cat_cols, dummy_na=False)
Xo_train, Xo_test = train_test_split(
    X_ohe, test_size=0.2, random_state=42, stratify=y
)
ohe = xgb.XGBClassifier(
    tree_method="hist",
    n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42,
)
ohe.fit(Xo_train, y_train)
auc_ohe = roc_auc_score(y_test, ohe.predict_proba(Xo_test)[:, 1])
print(f"{'approach':22s}{'columns':>9}{'test AUC':>11}")
print(f"{'native categorical':22s}{X_train.shape[1]:>9}{float(round(auc_native, 4)):>11}")
print(f"{'one-hot encoding':22s}{Xo_train.shape[1]:>9}{float(round(auc_ohe, 4)):>11}")

Output:
approach                columns   test AUC
native categorical           14       0.93
one-hot encoding            105     0.9308

There is the comparison Northwind wanted. One-hot encoding scores test AUC 0.9308; native categorical scores 0.9300. The gap is eight ten-thousandths, effectively a tie on this dataset. But look at the columns column: one-hot encoding fed the model 105 columns to earn that 0.9308, while native categorical earned essentially the same score from the original 14. The eight categorical columns, expanded into 99 indicators, grew the matrix more than sevenfold for no meaningful gain in AUC. Native categorical support buys you that same predictive power on a model that is smaller to store, faster to sweep, and far easier to read.
Native does not always win, and that is fine
Notice we are not claiming native categorical is more accurate; it landed a hair below one-hot here. The claim is that it reaches comparable accuracy while avoiding the dimensionality blowup, and it saves you the encoding step entirely. On some datasets one-hot edges ahead, on others native does; the reliable, dataset-independent win is the far smaller feature count and the simpler pipeline. When the two tie on accuracy, prefer the one with 14 columns over the one with 105.
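The trade-off comes down to three numbers, all taken from the comparison table above. A few lines of arithmetic make them explicit; the variable names are only for illustration.

```python
# Figures reported in the head-to-head comparison above.
native_cols, ohe_cols = 14, 105
auc_native, auc_ohe = 0.9300, 0.9308

width_ratio = ohe_cols / native_cols   # how much wider the one-hot matrix is
extra_cols = ohe_cols - native_cols    # columns added by encoding
auc_gap = auc_ohe - auc_native         # what those columns bought

print(f"width ratio  : {width_ratio:.1f}x")
print(f"extra columns: {extra_cols}")
print(f"AUC gap      : {auc_gap:.4f}")
```

A 7.5x wider matrix and 91 extra columns for a difference in the fourth decimal place of AUC is the whole argument for trying native categorical first.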
Practical Guidance: When Native Categorical Helps
Native categorical support is a strong default for XGBoost on tabular data, but it comes with conditions worth knowing before you rely on it.
- The dtype must be category. XGBoost decides a column is categorical from its pandas dtype, not its contents. A column of strings left as object, or a category encoded as plain integers, will not be treated categorically. Convert with df[col] = df[col].astype("category") before fitting, exactly what fetch_openml(..., as_frame=True) already did for Adult.
- It shines when one-hot would explode. The more categorical columns you have and the higher their cardinality, the bigger the feature-count saving. Adult’s jump from 14 to 105 columns is modest; a dataset with several high-cardinality columns (product IDs, ZIP codes) can balloon into thousands of one-hot columns, and native categorical avoids all of them.
- Very high cardinality is still hard. Native splits partition a set of categories, and searching partitions of a column with thousands of distinct values is genuinely difficult for any tree method. For columns like a free-text ID with tens of thousands of levels, neither native splitting nor one-hot is ideal; consider grouping rare categories, target-style encodings, or dropping the column. Native categorical helps most in the common middle ground of a handful to a few dozen categories per column.
- It pairs naturally with native missing handling. A categorical column can also have missing values, and XGBoost applies the same learned-default-direction logic to them. Together, the two features let you take a raw DataFrame of numbers, categories, and gaps and hand it to fit with almost no preprocessing, which is exactly the robust pipeline this module is building toward.
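The first and third points can be sketched with pandas alone. The frame, its column names, and the rare-level threshold of 2 are all hypothetical; the example only shows one way to group rare levels into a catch-all bucket and then give string and integer-coded columns the category dtype.

```python
import pandas as pd

# Hypothetical frame: 'city' holds plain strings, 'plan_code' stores
# categories as integers, and 'spend' is a genuine numeric feature.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Pune", "Oslo", "Lima", "Kyiv"],
    "plan_code": [1, 2, 1, 3, 1, 2, 2],
    "spend": [10.0, 12.5, 9.0, 30.0, 11.0, 14.0, 8.0],
})

# Group rare levels (seen fewer than 2 times here) into one "Other" bucket
# to keep cardinality manageable.
counts = df["city"].value_counts()
rare_levels = counts[counts < 2].index
df["city"] = df["city"].where(~df["city"].isin(rare_levels), "Other")

# Mark both columns as categorical; without this step they would be read
# as text or as ordinary numbers.
for col in ["city", "plan_code"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
print(list(df["city"].cat.categories))
```

Whether a threshold of 2, 20, or 200 is sensible depends on the dataset; treat the cutoff as something to tune rather than a rule.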
Used within these bounds, enable_categorical=True removes an entire preprocessing stage from your workflow while keeping accuracy intact, and it keeps your model honest about how many features it truly has.
Practice Exercises
Try each one before opening its hint. They reinforce the native handling of gaps and categories you just saw.
Exercise 1: Prove XGBoost Trains Through Missing Values
Load Adult’s numeric columns, split with random_state=42 stratified on y, and inject np.nan into the age column for 25 percent of both train and test rows using np.random.RandomState(42). Fit an XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42) with no imputer, and print the test AUC alongside a baseline fit on the intact columns.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
adult = fetch_openml("adult", version=2, as_frame=True)
X = adult.data.select_dtypes("number").copy()
y = (adult.target == ">50K").astype(int)
# Your code here

Hint
Split first, then build a boolean mask with rng.rand(len(Xtr)) < 0.25 and assign np.nan via Xtr.loc[Xtr.index[mask], "age"] = np.nan. Fit the classifier directly on the holed data, no SimpleImputer needed, and call roc_auc_score(yte, model.predict_proba(Xte_m)[:, 1]). The baseline (no NaN) lands around AUC 0.8794, and the 25-percent-missing model stays close behind it. The point is not the exact number but that fit and predict_proba run without error on data containing np.nan.
Exercise 2: Count the One-Hot Blowup
Using the full Adult DataFrame, list the categorical columns with X.select_dtypes("category").columns. Print how many columns the data has natively, then apply pd.get_dummies(X, columns=cat_cols) and print the new column count. Report the multiplier.
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
from sklearn.datasets import fetch_openml
adult = fetch_openml("adult", version=2, as_frame=True)
X = adult.data
# Your code here

Hint
cat_cols = X.select_dtypes("category").columns.tolist() gives you the eight categorical columns. X.shape[1] is 14; pd.get_dummies(X, columns=cat_cols).shape[1] is 105. That is a 7.5x increase in width, driven mostly by high-cardinality columns like native-country (41 categories), education (16), and occupation. Native categorical support gets comparable accuracy from the original 14.
Exercise 3: Native vs. One-Hot on Your Own Split
Fit two XGBClassifier models with the same hyperparameters (n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42, tree_method="hist"): one with enable_categorical=True on the raw DataFrame, and one on a pd.get_dummies version. Use the same random_state=42 stratified split for both, and print each model’s test AUC and column count side by side.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
adult = fetch_openml("adult", version=2, as_frame=True)
X = adult.data
y = (adult.target == ">50K").astype(int)
# Your code here

Hint
Keep y fixed and split both X (native) and pd.get_dummies(X, columns=cat_cols) (one-hot) with the identical train_test_split(..., test_size=0.2, random_state=42, stratify=y) so the row assignment matches. You should see native categorical at AUC 0.93 from 14 columns and one-hot at 0.9308 from 105, a tie on accuracy with a sevenfold difference in width. Conclude, as the lesson did, that native categorical delivers the same predictive power on a far leaner model.
Summary
You gave XGBoost the two things that usually demand a preprocessing stage, missing values and categorical columns, and it took them in stride. Let’s review.
Key Concepts
Missing values via learned default directions
- Most models require imputation because a split cannot answer its question for a missing value
- XGBoost’s sparsity-aware split finding instead learns a default direction per split, routing missing rows the way training data showed was best
- You fit directly on data containing np.nan, no imputer: injecting 20 percent missing values into hours-per-week moved test AUC only from 0.8794 to 0.8781
- Because missingness is routed rather than filled, XGBoost can even use the fact that a value is missing as signal
Categorical features via native splits
- One-hot encoding turns categories into 0/1 indicator columns and can explode dimensionality: Adult’s 14 columns became 105
- Native categorical splits (enable_categorical=True, tree_method="hist") partition the set of categories directly, keeping the column count at 14
- The columns must carry the pandas category dtype, which fetch_openml(..., as_frame=True) provides
- On Adult, native categorical reached test AUC 0.9300 versus one-hot’s 0.9308, a tie on accuracy from a model with 91 fewer columns
When native categorical helps
- Best when it spares you a large one-hot blowup, and it pairs naturally with native missing-value handling
- Its caveats: the category dtype is required, and very high-cardinality columns remain hard for any method
Why This Matters
Every preprocessing step you remove is a step that can no longer leak information, break in production, or silently distort your data. Imputation and one-hot encoding are two of the most common such steps, and XGBoost lets you drop both without giving up accuracy. That is not just convenience; it is robustness. A pipeline that hands the raw DataFrame, gaps and categories and all, straight to fit has fewer moving parts to get wrong when tomorrow’s data arrives with a new category or with gaps in a column that used to be complete.
Just as important, you saw why it works rather than taking it on faith. The missing-value handling is the sparsity-aware split finding named back in Module 2, now demonstrated on real holes you punched yourself. The categorical handling is a tree doing what trees do, splitting on a question, just phrased over a set of categories instead of a numeric threshold. With both in hand, you are ready to assemble the robust end-to-end training pipeline the next lesson builds.
Next Steps
You can now feed XGBoost messy, real-world tabular data with missing values and categorical columns and let it do the cleanup that other models push onto you. Next you will put every piece of this module together into one robust training pipeline from raw data to an evaluated model.
Guided Project: A Robust Training Pipeline
Assemble missing-value handling, native categorical features, and validation into one end-to-end XGBoost pipeline on real data.
Continue Building Your Skills
Before moving on, run the two demos yourself and change one thing at a time. Push the missing-value rate from 20 percent up toward 50 and watch how gently, or not, the AUC gives way, so you build intuition for how much information the learned default directions can carry. Then take one high-cardinality column like native-country, one-hot just that column, and compare the width and AUC against leaving it native. Feeling these trade-offs on real data now, while the pipeline is still simple, is exactly what will make the guided project’s full robust pipeline feel like assembly rather than guesswork.