Lesson 1 - Introducing XGBoost

Welcome to Introducing XGBoost

Welcome to Module 2. In Module 1 you did something most people who use boosting never do: you built a gradient booster from scratch, fitting shallow trees to residuals one after another and watching the test error fall as the corrections accumulated. That work was not busywork. It means that when you meet XGBoost in this lesson, it will not be a magic box. You already know the engine inside it: sequential trees, each fixing the mistakes of the ones before.

Our running team, Northwind Analytics, has been predicting median house values across California districts with scikit-learn’s GradientBoostingRegressor. It works, but they keep hearing the same name from every competition writeup and every colleague on tabular data: XGBoost. In this lesson you will find out what XGBoost actually is, why it is built differently from the plain gradient boosting you already know, and how to fit your very first XGBoost model. You will fit it two different ways, using both of XGBoost’s APIs, and put its accuracy head-to-head with Module 1’s model on the exact same California Housing data. Every number below was produced by running the code for real.

By the end of this lesson, you will be able to:

Explain the four design choices that separate XGBoost from plain gradient boosting: a regularized objective, second-order optimization, sparsity-aware split finding, and speed engineering
Fit an XGBoost model with the scikit-learn API (xgb.XGBRegressor) using the familiar fit/predict pattern
Fit the same model with the native API (xgb.DMatrix plus xgb.train) and explain what a DMatrix is
Report and compare the test RMSE of both APIs and confirm they produce the identical model
Show that XGBoost beats Module 1’s GradientBoostingRegressor on the same train/test split

You should be comfortable with the scikit-learn fit/predict workflow, a train/test split, and the boosting intuition from Module 1. No prior XGBoost experience is assumed. Let’s begin.

What Is XGBoost?

XGBoost (short for eXtreme Gradient Boosting) is an open-source library that implements gradient-boosted decision trees. At its core it does exactly what you built by hand in Module 1: it trains an ensemble of trees sequentially, each new tree correcting the residual errors left by the ensemble so far. If that were the whole story, XGBoost would just be a faster version of GradientBoostingRegressor. It is much more than that.

XGBoost takes the plain boosting loop and hardens it with four deliberate design choices. You do not need to master the math today (Lesson 2 opens up the objective in full), but you should know the names and what each one buys you:

A regularized objective. Plain gradient boosting minimizes only the training loss. XGBoost adds a penalty on tree complexity directly into the objective it optimizes, so the model is discouraged from growing needlessly complicated trees. This built-in regularization is a big part of why XGBoost resists overfitting so well. You will see the exact penalty term in Lesson 2.
Second-order optimization. Module 1’s booster fit each tree to the negative gradient of the loss (the first derivative, which for squared error is exactly the residuals). XGBoost uses both the gradient and the hessian (the second derivative), giving it a more informed, Newton-style step toward the minimum. The full derivation is Lesson 2’s job; for now, just note that XGBoost looks at more of the loss’s shape before it commits to each tree.
Sparsity-aware split finding. Real data has holes. XGBoost’s split-finding algorithm has a built-in default direction for missing values, so it handles gaps natively instead of forcing you to fill them in first. We cover this properly in Module 3; the headline is that missing values are a first-class citizen.
Engineered for speed. XGBoost is written for performance: cache-aware access patterns, parallelized split finding, and an optimized internal data structure. On the same problem it typically trains faster than scikit-learn’s implementation while matching or beating its accuracy.

Keep these four in mind as labels for now. The rest of this module fills each one in. What you will do today is get a model running and prove to yourself that the accuracy claim is real.
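As a rough preview of the first two labels, the sketch below contrasts a plain first-order leaf value with a regularized, Newton-style leaf value for a single leaf under squared-error loss. The toy arrays, the variable names, and the penalty strength reg_lambda = 1.0 are illustrative assumptions. The leaf formula -G / (H + lambda) is the standard form from the XGBoost paper, which Lesson 2 derives properly, so treat this as intuition rather than a picture of the library's internals.

```python
import numpy as np

# Toy targets and the current ensemble prediction (illustrative values).
y = np.array([3.0, 5.0, 4.0])
pred = np.zeros_like(y)

# Squared-error loss 0.5 * (pred - y)**2, differentiated with respect to pred.
grad = pred - y          # first derivative: the negative residuals
hess = np.ones_like(y)   # second derivative: constant 1 for this loss

# Plain gradient boosting: a leaf holding these rows predicts the mean residual.
first_order_value = float(np.mean(-grad))

# Newton-style leaf value with an L2 penalty (reg_lambda is an assumed value).
reg_lambda = 1.0
newton_value = float(-grad.sum() / (hess.sum() + reg_lambda))

print("mean-residual leaf value:", first_order_value)  # 4.0
print("regularized Newton value:", newton_value)       # 3.0
```

With these toy numbers the penalty shrinks the leaf value from 4.0 to 3.0: the regularized step is deliberately more cautious than the raw mean residual.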

You already understand the hard part

Everything genuinely new in XGBoost is a refinement of the boosting loop you built in Module 1. The sequential, residual-correcting core is unchanged. The four upgrades above make that core more accurate (regularized objective, second-order steps), more robust (sparsity-aware splits), and much faster (speed engineering). Because you built the core yourself, you can treat this lesson as learning an interface, not a new algorithm.

XGBoost keeps Module 1's sequential boosting core and layers four upgrades on top; on the same California Housing split it lands at a lower test RMSE (0.4696) than plain gradient boosting (0.5422).

Two APIs, One Library

One thing that confuses newcomers is that XGBoost exposes two different interfaces, and tutorials mix them freely. It is worth being clear about both from the start:

The scikit-learn API wraps XGBoost in the familiar estimator objects (xgb.XGBRegressor, xgb.XGBClassifier) with the exact fit/predict methods you already use. It plugs straight into scikit-learn tools like Pipeline, GridSearchCV, and cross_val_score.
The native API is XGBoost’s own lower-level interface. You pack your data into a DMatrix, hand XGBoost a dictionary of params, and call xgb.train. It exposes a few advanced features first and is what you will see in a lot of competition code.

They are two doors into the same house. Given the same settings, they train the identical model. You will confirm that below by fitting both and getting the same test RMSE to four decimals. We will start with the scikit-learn API because it will feel like home.

Fitting Your First XGBoost Model (scikit-learn API)

Northwind loads the real California Housing data, splits it with the same random_state=42 used throughout the course, and fits an XGBRegressor. Note the hyperparameters: n_estimators (how many trees), learning_rate (how big a step each tree takes), and max_depth (how deep each tree grows). These are the same three knobs from Module 1, and we will study them closely in Lessons 3 and 4. For now we use sensible values.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

model = xgb.XGBRegressor(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=4,
    random_state=42,
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, pred))
test_r2 = r2_score(y_test, pred)

print("XGBRegressor test RMSE:", float(round(test_rmse, 4)))
print("XGBRegressor test R2  :", float(round(test_r2, 4)))

XGBRegressor test RMSE: 0.4696
XGBRegressor test R2  : 0.8317

That is it. The interface is exactly the scikit-learn workflow you already know: construct, fit, predict, score. Your first XGBoost model reaches a test RMSE of 0.4696 and explains about 83 percent of the variance in unseen districts ($R^2 = 0.8317$). Hold on to that 0.4696; we will compare it in two places.

The Same Model, the Native Way (native API)

Now the second door. The native API asks you to wrap your arrays in a DMatrix first. A DMatrix is XGBoost’s own optimized internal data structure: it stores your features and labels together in a compressed, cache-friendly layout that the training algorithm can sweep through quickly. Building it once up front is part of how XGBoost stays fast. Instead of passing hyperparameters to a constructor, you collect them in a params dictionary, where the key "objective" names the loss to minimize ("reg:squarederror" is plain squared-error regression), and you pass the tree count to xgb.train as num_boost_round.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Pack the arrays into XGBoost's optimized DMatrix structure
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    "objective": "reg:squarederror",
    "learning_rate": 0.1,
    "max_depth": 4,
    "seed": 42,
}

bst = xgb.train(params, dtrain, num_boost_round=300)

pred = bst.predict(dtest)
test_rmse = np.sqrt(mean_squared_error(y_test, pred))

print("native API test RMSE:", float(round(test_rmse, 4)))

native API test RMSE: 0.4696

Look at the number: 0.4696, identical to the scikit-learn API to four decimals. This is the point. n_estimators in the wrapper is the same as num_boost_round in the native call; learning_rate and max_depth mean the same thing in both. With matching settings, both APIs run the same training algorithm and produce the same model. The predictions agree not just on aggregate RMSE but row by row, which the exercises let you verify.

Which API should you use?

For most work, reach for the scikit-learn API (XGBRegressor / XGBClassifier). It gives you fit/predict and drops straight into pipelines, grid search, and cross-validation, so you keep all the tooling you already know. Use the native API (DMatrix + xgb.train) when you want a feature that surfaces there first or when you are reading competition code written in that style. They are interchangeable in accuracy; the choice is about ergonomics, not results.

Head-to-Head: XGBoost vs. Plain Gradient Boosting

You have seen XGBoost score 0.4696 twice. The question Northwind cares about is: does switching from scikit-learn’s GradientBoostingRegressor (Module 1’s tool) to XGBoost actually buy anything? Let’s settle it directly, with both models on the same split and the same random_state=42.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

xgb_model = xgb.XGBRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42
)
gbr_model = GradientBoostingRegressor(random_state=42)

for name, model in [("XGBoost", xgb_model), ("GradientBoosting", gbr_model)]:
    model.fit(X_train, y_train)
    p = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, p))
    r2 = r2_score(y_test, p)
    print(f"{name:16s} RMSE={float(round(rmse, 4)):<8} R2={float(round(r2, 4))}")

XGBoost          RMSE=0.4696   R2=0.8317
GradientBoosting RMSE=0.5422   R2=0.7756

The verdict is clear. On identical data, XGBoost’s test RMSE of 0.4696 beats plain gradient boosting’s 0.5422, and its $R^2$ climbs from 0.7756 to 0.8317, explaining about 5.6 more percentage points of the variance. In MedHouseVal units (100,000 dollars each), that RMSE gap of about 0.073 is roughly 7,300 dollars less error on a typical district, with no tuning effort beyond picking a reasonable tree count and depth. And this is XGBoost before any real tuning: the regularization and second-order steps are already at work at their default settings, and the tuning knobs you will study next give you room to widen the gap.
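The arithmetic behind those figures is short enough to check by hand. The sketch below recomputes the gaps from the four reported scores; the variable names are illustrative, and the relative-reduction line is an extra way of reading the same numbers rather than a figure from the lesson.

```python
# Scores reported in the head-to-head run above.
xgb_rmse, gbr_rmse = 0.4696, 0.5422
xgb_r2, gbr_r2 = 0.8317, 0.7756

# MedHouseVal is measured in units of 100,000 dollars.
rmse_gap = gbr_rmse - xgb_rmse
dollars = rmse_gap * 100_000

# Gain in explained variance, in percentage points.
r2_gain_points = (xgb_r2 - gbr_r2) * 100

# The same RMSE gap as a share of the baseline error.
relative_cut = rmse_gap / gbr_rmse * 100

print(f"RMSE gap      : {rmse_gap:.4f}")                        # 0.0726
print(f"in dollars    : {dollars:,.0f}")                        # 7,260
print(f"R2 gain       : {r2_gain_points:.2f} points")           # 5.61 points
print(f"relative cut  : {relative_cut:.1f}% of baseline RMSE")  # 13.4%
```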

This is not a fluke of these particular numbers. On structured, tabular data like Northwind’s, XGBoost is routinely competitive with or better than scikit-learn’s gradient boosting, and it typically gets there faster. That combination, more accuracy and more speed, is exactly why it became the default tool for this kind of problem.

A fair comparison, honestly reported

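Both models share the same split and seed, and neither has been tuned. One caveat keeps the comparison honest: the XGBRegressor is set to 300 trees of depth 4, while GradientBoostingRegressor runs at its scikit-learn defaults (100 trees of depth 3, learning rate 0.1), so part of the gap reflects the larger ensemble rather than XGBoost’s design alone. We are not claiming XGBoost is magic; we are showing that a lightly configured XGBoost model delivers a real, measurable accuracy gain on real data. Later lessons will tune both far past these starting points, but the starting points already tell the story.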

Practice Exercises

Try each one before opening its hint. They reinforce the two APIs and the comparison you just saw.

Exercise 1: Fit XGBoost Two Ways and Confirm They Match

Fit an xgb.XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42) and, separately, an xgb.train model with the matching params and num_boost_round=300 on a DMatrix. Print both test RMSEs and confirm they agree to four decimals.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Your code here

Hint

Build the scikit-learn model with the constructor arguments above and the native model with params = {"objective": "reg:squarederror", "learning_rate": 0.1, "max_depth": 4, "seed": 42} plus num_boost_round=300. Remember to wrap the test features in xgb.DMatrix(X_test) before calling bst.predict. Both should print 0.4696. If you want to be strict, compare the two prediction arrays with np.allclose(pred_sklearn, pred_native, atol=1e-4), which returns True: the two APIs produce the same model row by row, not just on average.

Exercise 2: How Much Does XGBoost Beat Plain Gradient Boosting?

Fit an XGBRegressor (same settings as above) and a default GradientBoostingRegressor(random_state=42) on the same split. Print each model’s test RMSE and $R^2$, then state in one sentence how much accuracy the switch to XGBoost buys.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Your code here

Hint

Fit both models, predict on X_test, and compute RMSE with np.sqrt(mean_squared_error(...)) and $R^2$ with r2_score(...). You should see XGBoost at RMSE 0.4696 / $R^2$ 0.8317 versus gradient boosting at 0.5422 / 0.7756. The switch cuts test RMSE by about 0.073 (roughly 7,300 dollars per district) and lifts explained variance by about 5.6 percentage points, all before any tuning.

Exercise 3: Feel the Learning Rate Knob

Keeping n_estimators=300, max_depth=4, and random_state=42 fixed, fit three XGBRegressor models with learning_rate set to 0.05, 0.1, and 0.3. Print each one’s test RMSE and note which learning rate did best here.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Your code here

Hint

Loop over [0.05, 0.1, 0.3], building a fresh XGBRegressor(n_estimators=300, learning_rate=lr, max_depth=4, random_state=42) each time. You should see roughly 0.4923, 0.4696, and 0.4573. On this dataset the larger learning rate happens to score best at 300 trees, but do not over-read one split: learning rate and tree count trade off against each other, and Lessons 3 and 4 will show how to tune them together rather than in isolation.

Summary

You fit your first XGBoost models and proved, with real numbers, that the library earns its reputation. Let’s review.

Key Concepts

What XGBoost adds to plain boosting

Keeps Module 1’s sequential, residual-correcting core, then layers on four upgrades
A regularized objective (penalizes complex trees), second-order optimization (uses gradient and hessian), sparsity-aware split finding (handles missing values natively), and speed engineering
The details of each are the subject of the coming lessons; today you learned the names and the payoff

Two APIs, one model

The scikit-learn API (xgb.XGBRegressor) gives you fit/predict and plugs into pipelines and grid search
The native API (xgb.DMatrix + xgb.train with a params dict and num_boost_round) is XGBoost’s own lower-level interface
A DMatrix is XGBoost’s optimized internal data structure for fast training
With matching settings both APIs train the identical model: each scored test RMSE 0.4696 on California Housing, agreeing row by row

XGBoost vs. Module 1’s gradient boosting

On the same split and seed, XGBoost reached RMSE 0.4696 / $R^2$ 0.8317, beating GradientBoostingRegressor at 0.5422 / 0.7756
That is about a 0.073 RMSE improvement (roughly 7,300 dollars per district) with no tuning, and XGBoost typically trains faster too

Why This Matters

XGBoost is the workhorse of applied machine learning on tabular data, and you have now driven it yourself instead of reading about it. Just as important, you saw that it is not a mysterious black box: it is the boosting loop you already built by hand, upgraded in four specific, nameable ways. That framing is what lets you tune it deliberately later, because when a knob like learning_rate or max_depth changes the result, you will know which part of the machine you are adjusting.

Knowing both APIs matters more than it looks. The scikit-learn API keeps you inside the ecosystem of pipelines and cross-validation you already trust, while the native API unlocks XGBoost’s own features and lets you read the competition code where they first appear. From here on you can pick whichever door fits the task, confident they lead to the same model.

Next Steps

You have a working XGBoost model and proof it beats plain gradient boosting. Next you will open up the objective function you have only named so far, seeing exactly how the regularization penalty and the second-order (gradient-and-hessian) step change what each tree is trained to do.

Lesson 2: Inside the XGBoost Objective

See the regularized objective and the second-order optimization that make XGBoost more than a faster gradient booster.

Back to Module Overview

Return to the XGBoost in Depth module overview

Continue Building Your Skills

You crossed the line from understanding boosting to using the tool the whole industry reaches for. Before moving on, rerun the head-to-head yourself and try one small change at a time: swap max_depth up or down, raise or lower n_estimators, and watch how XGBoost responds versus how plain gradient boosting did in Module 1. Getting a feel for these knobs now, while the model is still simple and the numbers are honest, will make Lesson 2’s tour of the objective, and the serious tuning in Lessons 3 and 4, land as a natural next step rather than a leap.
