Lesson 8 - Guided Project: Predicting Insurance Costs

A Project to Tie It All Together

Over the last seven lessons you learned how a line is fit to data, how to read its coefficients, how to check whether a linear model actually fits, how to scale features, how gradient descent finds the best parameters, and how scikit-learn wraps all of it into a few lines. This lesson is where you use those skills end to end on a real, messy, business-relevant problem.

You will work as if you were on a team at a health insurer. The company wants to estimate the annual medical cost of a customer before issuing a policy, using a handful of facts it already collects on an application form. A good estimate helps the company set fair premiums, plan its budget, and flag unusually expensive cases for review.

By the end of this lesson, you will be able to:

  • Frame a regression project from a plain-language business brief
  • Explore a real dataset to decide which features are worth modeling
  • One-hot encode categorical columns so a linear model can use them
  • Train and evaluate a multiple linear regression model on a held-out test set
  • Interpret standardized coefficients to explain what drives the prediction

You should already be comfortable with the workflow from the earlier lessons: train_test_split, StandardScaler, LinearRegression, and the R-squared and RMSE metrics. We will move quickly through the mechanics and spend more time on the judgment calls.


The Project Brief

Here is the brief, exactly as a stakeholder might hand it to you.

We collect six pieces of information on every application: the applicant’s age, sex, body mass index, number of dependent children, whether they smoke, and which region of the country they live in. For past customers we also know the total medical cost they billed in a year. Build us a model that predicts that annual cost from the six application fields, and tell us which fields matter most.

That brief gives you everything you need to define the problem:

  • The target is charges, the annual medical cost in US dollars. It is a continuous, positive number, which makes this a regression problem, exactly the kind of task this module has prepared you for.
  • The features are the six application fields. Three are numeric (age, bmi, children) and three are categorical (sex, smoker, region).
  • Success means the model predicts well on customers it has never seen, and that you can explain in plain English which features drive the cost.

The plan follows the same shape as every project in this module: explore, prepare, train, evaluate, interpret. Let’s go.


Loading the Data

You can download the dataset and load it with pandas. It is the classic medical-cost dataset: 1,338 anonymized billing records, one row per customer, with no missing values to clean up.

import pandas as pd

df = pd.read_csv("insurance.csv")  # download: https://datatweets.com/datasets/insurance.csv

print("Shape:", df.shape)
# Output: Shape: (1338, 7)

Seven columns: the six features plus the target. Here is the data dictionary.

Column    Type      Meaning
age       int       Age of the primary policyholder in years
sex       category  female or male
bmi       float     Body mass index (weight in kg / height in m squared)
children  int       Number of dependents covered by the policy
smoker    category  yes or no
region    category  Residential area: northeast, northwest, southeast, southwest
charges   float     Target: annual medical cost billed, in US dollars

A quick look at the first rows and at the target’s range tells you what you are working with.

print(df.head(3))
# Output:
#    age     sex     bmi  children smoker     region      charges
# 0   19  female  27.900         0    yes  southwest  16884.92400
# 1   18    male  33.770         1     no  southeast   1725.55230
# 2   28    male  33.000         3     no  southeast   4449.46200

print(df["charges"].describe()[["mean", "min", "max"]].round())
# Output:
# mean    13270.0
# min      1122.0
# max     63770.0
# Name: charges, dtype: float64

The average annual cost is about $13,270, but it ranges from roughly $1,122 to $63,770. That is a wide spread, more than fifty to one between the cheapest and most expensive customers. A big part of your job is figuring out what separates the two ends.


Exploring What Drives Cost

Before modeling, it pays to look. The earlier lessons showed that features which correlate strongly with the target make good predictors, so start with the three numeric features.

print(df[["age", "bmi", "children", "charges"]].corr()["charges"].round(3))
# Output:
# age         0.299
# bmi         0.198
# children    0.068
# charges     1.000
# Name: charges, dtype: float64

None of the numeric features is strongly correlated with cost. age has the strongest linear relationship at 0.299, bmi is weaker at 0.198, and children is nearly flat at 0.068. If those were your only features, the model would be weak. So where does the rest of the signal hide? In the categoricals, and one in particular.

The Smoker Effect

Split the average cost by whether the customer smokes.

print(df["smoker"].value_counts())
# Output:
# smoker
# no     1064
# yes     274
# Name: count, dtype: int64

print(df.groupby("smoker")["charges"].mean().round())
# Output:
# smoker
# no      8434.0
# yes    32050.0
# Name: charges, dtype: float64

Only about one customer in five smokes, but the average smoker is billed roughly $32,050 a year against $8,434 for a non-smoker, almost four times as much. That single binary field carries more predictive power than all three numeric features combined. A histogram of charges, colored by smoking status, makes the separation impossible to miss.

[Figure: Histogram of insurance charges split by smoker. Smokers face dramatically higher charges than non-smokers.]

Notice the shape. Non-smokers pile up at the low end, while smokers form a separate cluster shifted far to the right. This is exactly the kind of relationship a linear model can capture well, as long as you give it the smoker column in a form it can read. That is the next problem.

Look before you model

The numeric correlations alone would have told a misleading story, suggesting a weak model is the best you can do. Splitting by a categorical feature revealed the strongest signal in the whole dataset. Always explore categorical features too, not just the numbers a correlation matrix can reach.
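One way to make this habit routine is to loop over every non-numeric column and print the average target per category. The sketch below uses a tiny made-up frame so that it runs on its own; the values and the helper name group_means are illustrative assumptions, not rows or functions from the lesson. With the real data you would pass df instead of toy.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not taken from insurance.csv).
toy = pd.DataFrame({
    "age":     [19, 45, 33, 52, 27, 61],
    "smoker":  ["yes", "no", "no", "yes", "no", "no"],
    "region":  ["southwest", "southeast", "northwest",
                "northeast", "southeast", "northwest"],
    "charges": [17000.0, 7500.0, 4800.0, 41000.0, 3200.0, 13500.0],
})

def group_means(frame, target):
    """Return {column: mean of target per category} for every non-numeric column."""
    text_cols = frame.select_dtypes(exclude="number").columns
    return {col: frame.groupby(col)[target].mean() for col in text_cols}

summaries = group_means(toy, "charges")
for col, means in summaries.items():
    print(means.round(), "\n")
```

A column whose group means sit far apart, as smoker does here, is a strong candidate feature even though it never shows up in a numeric correlation matrix.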


Preparing the Data

A linear model multiplies each feature by a coefficient and adds the results, which means every feature has to be a number. Right now three of your columns are text. You cannot multiply a coefficient by "yes" or "northeast", so you need to convert them.

One-Hot Encoding

The standard fix is one-hot encoding. For a categorical column, you create one new 0/1 column per category. A customer who smokes gets a 1 in the smoker_yes column and a 0 otherwise.

There is a subtlety. If smoker has two values, you only need one new column to capture the information: smoker_yes = 1 already implies the customer is a smoker, and smoker_yes = 0 implies they are not. Keeping both a smoker_yes and a smoker_no column would be redundant, and that redundancy can make a linear model’s coefficients unstable. The convention is to drop one category per feature, called the reference category. pandas does this for you with drop_first=True.

X = pd.get_dummies(df.drop(columns=["charges"]), drop_first=True)
y = df["charges"]

print(list(X.columns))
# Output:
# ['age', 'bmi', 'children', 'sex_male', 'smoker_yes',
#  'region_northwest', 'region_southeast', 'region_southwest']

The three numeric columns pass through untouched. sex becomes a single sex_male column (female is the reference). smoker becomes smoker_yes (non-smoker is the reference). region, which has four values, becomes three columns (northeast is the reference). You started with six features and ended with eight numeric columns, all ready for the model.
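To see why keeping both smoker_yes and smoker_no is redundant, you can compare the rank of the design matrix with and without drop_first. This is a minimal sketch on a hypothetical four-row frame; the helper design_rank is an illustrative name, and the column of ones stands in for the intercept that a linear model fits by default.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row example; only the smoker column matters here.
toy = pd.DataFrame({"smoker": ["yes", "no", "no", "yes"]})

both = pd.get_dummies(toy, dtype=float)                   # smoker_no and smoker_yes
one = pd.get_dummies(toy, drop_first=True, dtype=float)   # smoker_yes only

def design_rank(dummies):
    """Rank of the design matrix: an intercept column of ones plus the dummies."""
    ones = np.ones((len(dummies), 1))
    return np.linalg.matrix_rank(np.hstack([ones, dummies.to_numpy()]))

print(list(both.columns), "-> rank", design_rank(both), "for", both.shape[1] + 1, "columns")
print(list(one.columns), "-> rank", design_rank(one), "for", one.shape[1] + 1, "columns")
```

With both dummies the matrix has three columns but only rank 2, because smoker_no + smoker_yes always equals the intercept column. Dropping one category removes that exact dependence, which is what keeps the coefficients well defined.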

Splitting and Scaling

From here the recipe is the familiar one. Hold out 25 percent of the data as a test set so you can judge the model on customers it never trained on, then standardize the features. Scaling does not change which model fits best, but it puts every feature on the same scale, which makes the coefficients directly comparable, exactly what you need to answer the stakeholder’s “which fields matter most?” question.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on TRAIN only
X_test_scaled = scaler.transform(X_test)         # apply the same transform to TEST

print("Training rows:", X_train.shape[0])
print("Test rows:    ", X_test.shape[0])
# Output:
# Training rows: 1003
# Test rows:     335

You train on 1,003 customers and keep 335 in reserve.

Fit the scaler on training data only

Call fit_transform on the training set and plain transform on the test set. Fitting the scaler on the full dataset would leak information about the test customers into training, and your test scores would look better than they really are. This is the same golden rule you followed when scaling features in the earlier lessons.
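The rule is easy to check on synthetic numbers. In the sketch below, the single feature is a made-up, age-like column (an assumption for illustration); the point is only that the scaler stores the training mean and reuses it on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical single feature (think of it as age); the values are synthetic.
train = rng.normal(loc=40, scale=14, size=(750, 1))
test = rng.normal(loc=40, scale=14, size=(250, 1))

scaler = StandardScaler().fit(train)       # statistics come from TRAIN only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)       # reuses the training mean and std

print("stored mean equals train mean:", bool(np.isclose(scaler.mean_[0], train.mean())))
print("train mean after scaling:", round(float(train_scaled.mean()), 6))
print("test mean after scaling: ", round(float(test_scaled.mean()), 6))
```

The scaled test mean comes out close to zero but not exactly zero. That small mismatch is expected: the test rows are being treated the way genuinely new customers would be.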


Training the Model

With the data prepared, training is two lines. You instantiate a LinearRegression and call .fit() on the scaled training data. Under the hood scikit-learn solves for the coefficients that minimize the mean squared error, the same objective you implemented by hand with gradient descent two lessons ago.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train_scaled, y_train)

print("Model trained!")
# Output: Model trained!
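If you want to convince yourself that .fit() really is a least-squares solve, you can compare it with a direct solution on a small synthetic problem. This is a sketch under stated assumptions: the data and true weights below are invented, and np.linalg.lstsq is used as the by-hand solver rather than the gradient descent loop from the earlier lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))    # synthetic features
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Least squares by hand: prepend a column of ones so the first weight is the intercept.
A = np.column_stack([np.ones(len(X)), X])
weights, *_ = np.linalg.lstsq(A, y, rcond=None)

model = LinearRegression().fit(X, y)
sklearn_weights = np.r_[model.intercept_, model.coef_]

print("by hand:     ", weights.round(3))
print("scikit-learn:", sklearn_weights.round(3))
```

Both approaches minimize the same squared-error objective, so they land on the same intercept and coefficients up to floating-point noise.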

The model you just fit has the form you have seen throughout this module:

\[ \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_8 x_8 \]

where \(\hat{y}\) is the predicted annual cost, \(\beta_0\) is the intercept, and each \(\beta_j\) is the coefficient on one of the eight standardized features. Because the features are standardized, each \(\beta_j\) tells you how many dollars the prediction moves when feature \(x_j\) increases by one standard deviation, holding the others fixed. That makes the coefficients directly comparable in size.

import pandas as pd

coefs = pd.Series(model.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(coefs.round(1))
# Output:
# smoker_yes          9546.3
# age                 3643.1
# bmi                 2042.3
# children             513.5
# region_southwest    -370.6
# region_southeast    -342.0
# region_northwest    -152.8
# sex_male              22.8
# dtype: float64

print("Intercept:", round(model.intercept_, 1))
# Output: Intercept: 13267.9

The intercept of about $13,268 is the predicted cost for an “average” customer, because after standardizing, the average customer sits at zero on every feature. From there the coefficients tell the story.

[Figure: Bar chart of insurance regression coefficients. Smoking, age, and BMI are the strongest drivers of cost.]

Reading the bars from largest to smallest:

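  • smoker_yes (+9,546) dwarfs everything else. A one-standard-deviation increase in the standardized smoker_yes column adds about $9,500 to the prediction, by far the biggest single lever. Because a 0/1 column that is 1 for about a fifth of customers has a standard deviation of roughly 0.4, the full switch from non-smoker to smoker spans about 2.5 standard deviations, so the per-customer effect is larger still. This confirms exactly what the histogram hinted at.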
  • age (+3,643) is next: older customers cost more, as you would expect.
  • bmi (+2,042) also pushes cost up, consistent with the health risks of a higher body mass index.
  • children (+513) has a small positive effect.
  • region and sex barely move the needle. The regional coefficients are all small and negative relative to the northeast reference, and sex_male is essentially zero, meaning sex tells the model almost nothing once the other features are known.

If the stakeholder only remembers one thing from your analysis, it should be this: smoking is the dominant driver of medical cost in this data.
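Standardized coefficients answer "which field matters most?", but a stakeholder may also ask "how many dollars per year of age, or per smoker?". The two views are linked by a simple rule: a standardized coefficient divided by the feature's standard deviation gives the per-unit coefficient. The sketch below checks that rule, and the claim that the intercept equals the mean target, on synthetic data; the columns, the true weights, and the helper fit are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# Synthetic stand-ins: a 0/1 flag that is 1 about 20% of the time, and an age-like column.
flag = (rng.random(n) < 0.2).astype(float)
age = rng.uniform(18, 64, size=n)
y = 2000 + 24000 * flag + 250 * age + rng.normal(scale=500, size=n)

def fit(X, y):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w[0], w[1:]

X_raw = np.column_stack([flag, age])
sigma = X_raw.std(axis=0)                        # population std, as StandardScaler uses
X_std = (X_raw - X_raw.mean(axis=0)) / sigma

b0_raw, b_raw = fit(X_raw, y)
b0_std, b_std = fit(X_std, y)

print("raw coefficients:         ", b_raw.round(1))
print("standardized coefficients:", b_std.round(1))
print("standardized / sigma:     ", (b_std / sigma).round(1))   # recovers the raw ones
print("intercept vs mean(y):     ", round(float(b0_std), 1), round(float(y.mean()), 1))
```

Applied to this lesson's numbers, and assuming the training split has about the same share of smokers as the full data (274 of 1,338), the smoker_yes column has a standard deviation near 0.40, so 9,546 per standard deviation corresponds to roughly $23,650 per smoker.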


Evaluating on the Test Set

Coefficients explain the model, but they do not tell you whether it predicts well. For that you turn to the held-out test set, the 335 customers the model never saw. Two metrics from the earlier lessons do the job.

The coefficient of determination, R-squared, measures the fraction of the variation in charges that the model explains:

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]

An \(R^2\) of 1.0 is a perfect fit; 0.0 means the model does no better than always guessing the mean. The root mean squared error, RMSE, reports the typical prediction error in the original units, dollars:

\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]

from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

predictions = model.predict(X_test_scaled)

r2 = r2_score(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))

print(f"Test R-squared: {r2:.3f}")
print(f"Test RMSE:      ${rmse:,.0f}")
# Output:
# Test R-squared: 0.767
# Test RMSE:      $5,926

The model explains about 77 percent of the variation in medical costs on customers it has never seen, with a typical error of about $5,926. For a model built from six application fields and a straight-line assumption, that is a genuinely useful result. A scatter of predicted versus actual cost shows the fit visually: if predictions were perfect, every point would sit on the diagonal line.

[Figure: Predicted vs actual insurance charges on the test set. The model explains about 77% of the variation in charges.]

Most points track the diagonal closely, which is why \(R^2\) is high. You can also look at individual predictions to get a feel for the errors.

for pred, actual in zip(predictions[:5], y_test.values[:5]):
    print(f"predicted ${pred:>8,.0f}   actual ${actual:>8,.0f}")
# Output:
# predicted $   8,952   actual $   9,095
# predicted $   7,054   actual $   5,272
# predicted $  36,888   actual $  29,331
# predicted $   9,522   actual $   9,302
# predicted $  26,962   actual $  33,750

Some predictions are remarkably close (the first and fourth are within a few hundred dollars), while the high-cost cases are noticeably off by several thousand. That pattern is worth a closer look.
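To connect the two formulas to real numbers, you can apply them by hand to the five rows printed above. This sketch uses only the standard library so that every term of each formula is visible; it is a worked illustration, not a replacement for r2_score and mean_squared_error.

```python
import math

# The five (predicted, actual) pairs printed above, in dollars.
pairs = [(8952, 9095), (7054, 5272), (36888, 29331), (9522, 9302), (26962, 33750)]

actual = [a for _, a in pairs]
mean_actual = sum(actual) / len(actual)

ss_res = sum((a - p) ** 2 for p, a in pairs)           # squared errors of the model
ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # squared errors of "always guess the mean"

rmse = math.sqrt(ss_res / len(pairs))
r2 = 1 - ss_res / ss_tot

print(f"RMSE on these five rows:      ${rmse:,.0f}")
print(f"R-squared on these five rows: {r2:.3f}")
# Output:
# RMSE on these five rows:      $4,614
# R-squared on these five rows: 0.846
```

These figures differ from the test-set values because five rows are a tiny sample; the full 335-row test set is what the reported 0.767 and $5,926 are based on.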


Reading the Residuals

A residual is the gap between what actually happened and what the model predicted:

\[ e_i = y_i - \hat{y}_i \]

In lesson 3 you learned that a well-behaved linear model should produce residuals that scatter randomly around zero with no visible pattern. Plotting them against the predicted value is the quickest way to spot trouble.

[Figure: Residual plot for the insurance model. The residuals reveal where a linear model still struggles.]

The residuals are not a featureless cloud. You can see structure: distinct bands of points, and the spread of errors grows as the predicted cost rises. That tells you two honest things about the model.

First, the bands come from the smoker split. Smokers and non-smokers follow such different cost curves that a single straight line cannot serve both groups perfectly, so each group leaves its own trail of residuals. Second, the widening spread (called heteroscedasticity) means the model is least reliable exactly where the stakes are highest, the expensive customers. A $5,926 typical error is small next to a $32,000 smoker bill but large next to an $8,000 non-smoker bill.

None of this makes the model useless. It makes it honest: you now know where it is strong (typical, low-to-mid-cost customers) and where it should be trusted less (the expensive tail). Being able to say that is the difference between shipping a model and shipping a model you understand.
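You can get a rough numerical feel for the widening spread without a plot by comparing the average residual size for low and high predictions. The sketch below reuses the five rows printed in the previous section; the $20,000 cut-off and the helper name mean_abs_residual are assumptions chosen for illustration, and five rows are far too few to prove anything on their own.

```python
# The five (predicted, actual) pairs printed in the previous section, in dollars.
pairs = [(8952, 9095), (7054, 5272), (36888, 29331), (9522, 9302), (26962, 33750)]

THRESHOLD = 20_000  # assumed cut-off between "low" and "high" predictions

def mean_abs_residual(rows):
    """Average size of the residual e = actual - predicted, ignoring sign."""
    return sum(abs(actual - pred) for pred, actual in rows) / len(rows)

low = [row for row in pairs if row[0] < THRESHOLD]
high = [row for row in pairs if row[0] >= THRESHOLD]

print(f"low predictions:  {len(low)} rows, mean |residual| ${mean_abs_residual(low):,.1f}")
print(f"high predictions: {len(high)} rows, mean |residual| ${mean_abs_residual(high):,.1f}")
```

On these rows the high-prediction group misses by roughly ten times as much as the low-prediction group, which is the same pattern the residual plot shows across the whole test set.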

A good model knows its limits

The residual plot did not invalidate the project; it sharpened it. The structure it revealed points straight at the next improvements: model smokers and non-smokers separately, or add interaction terms so age and BMI are allowed to behave differently for smokers. That is how diagnostics drive the next iteration.
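An interaction term is just an extra column holding the product of two features. The sketch below shows the idea on synthetic data in which BMI only matters for smokers; that data-generating rule, the numbers in it, and the helper holdout_r2 are assumptions made for illustration, so the scores say nothing about the real insurance data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 1000
# Synthetic data in which BMI only matters for smokers (an assumption for illustration).
smoker = (rng.random(n) < 0.2).astype(float)
bmi = rng.normal(30, 6, size=n)
charges = (5000 + 10000 * smoker + 1400 * smoker * (bmi - 30)
           + rng.normal(scale=2000, size=n))

plain = np.column_stack([smoker, bmi])
with_interaction = np.column_stack([smoker, bmi, smoker * bmi])  # extra column: smoker x bmi

def holdout_r2(features):
    """Fit on 75% of the rows and return R-squared on the remaining 25%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, charges, test_size=0.25, random_state=42
    )
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

r2_plain = holdout_r2(plain)
r2_interaction = holdout_r2(with_interaction)
print(f"without interaction: {r2_plain:.3f}")
print(f"with interaction:    {r2_interaction:.3f}")
```

Scaling is left out to keep the sketch short; it would not change either score. On the real data you would build the product column from smoker_yes and bmi after one-hot encoding and then check whether the test R-squared and the residual plot actually improve.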


What You Built

Step back and look at what you have. Starting from a one-paragraph business brief, you:

  • Explored the data and discovered that smoking, not any single numeric feature, is the dominant cost driver.
  • Prepared it by one-hot encoding three categorical columns into eight numeric features, then splitting and scaling without leaking the test set.
  • Trained a multiple linear regression model and read its standardized coefficients to rank what matters: smoking, then age, then BMI.
  • Evaluated it honestly on held-out data, reporting a test \(R^2\) of 0.767 and a typical error of about $5,926.
  • Diagnosed its residuals to identify exactly where it succeeds and where it strains.

Here is the entire project condensed into one runnable script. It is a template you can adapt for almost any regression problem with a mix of numeric and categorical features.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# 1. Load
df = pd.read_csv("insurance.csv")  # download: https://datatweets.com/datasets/insurance.csv

# 2. Prepare: one-hot encode categoricals, split off the target
X = pd.get_dummies(df.drop(columns=["charges"]), drop_first=True)
y = df["charges"]

# 3. Split, then scale (fit on train only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. Train
model = LinearRegression()
model.fit(X_train, y_train)

# 5. Evaluate on the held-out test set
predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"Test R-squared: {r2:.3f}   RMSE: ${rmse:,.0f}")
# Output: Test R-squared: 0.767   RMSE: $5,926

That is a complete, defensible regression project in about 20 lines.


Practice Exercises

Try these before peeking at the hints. They push the project a little further in the directions the residual plot suggested.

Exercise 1: Quantify the Smoker Gap

You saw that smokers cost far more on average. Compute the exact difference: the mean charges for smokers minus the mean for non-smokers, rounded to the nearest dollar. Does the gap line up with the size of the smoker_yes coefficient?

import pandas as pd
df = pd.read_csv("insurance.csv")

# Your code here

Hint

Use df.groupby("smoker")["charges"].mean() to get both averages, then subtract. You should find smokers average about $32,050 and non-smokers about $8,434, a gap of roughly $23,600. The coefficient is smaller than the raw gap because it is measured per standard deviation of a standardized 0/1 column, not per category, but both point to smoking as the dominant driver.

Exercise 2: Model Without the Smoker Feature

How much of the model’s power comes from smoker? Rebuild the model after dropping every smoker-related column, and compare the test \(R^2\) to the full model’s 0.767.

# Reuse df from Exercise 1
# Your code here

Hint

After one-hot encoding, drop the smoker_yes column with X = X.drop(columns=["smoker_yes"]) before the split. Re-run the same train_test_split, StandardScaler, and LinearRegression steps. The test \(R^2\) collapses dramatically, which is direct evidence that smoking carries most of the predictive signal.

Exercise 3: Separate Models for Smokers and Non-Smokers

The residual plot suggested smokers and non-smokers follow different cost curves. Split the data into two groups by smoker, fit a separate LinearRegression to each, and compare the test \(R^2\) of each group’s model. Do the focused models fit their own group better than the single combined model?

# Reuse df from Exercise 1
# Your code here

Hint

Filter with df[df["smoker"] == "yes"] and df[df["smoker"] == "no"], then run the full encode/split/scale/fit/evaluate pipeline on each subset (drop smoker itself, since it is now constant within each group). Fitting groups separately is one of the cleanest ways to handle the structure the residual plot revealed.


Summary

You completed an end-to-end regression project on a real dataset, applying every skill from this module. Let’s review what you learned.

Key Concepts

Framing a Project

  • A continuous, positive target like annual cost is a regression problem
  • A clear brief defines the target, the features, and what success means before any code is written

Preparing Mixed Data

  • Linear models need numbers, so categorical columns must be one-hot encoded
  • pd.get_dummies(..., drop_first=True) creates 0/1 columns and drops one reference category per feature to avoid redundancy
  • Split before scaling, and fit the scaler on the training set only

Training and Interpreting

  • LinearRegression().fit() solves for the coefficients that minimize mean squared error
  • Standardized coefficients are directly comparable: here smoker_yes (+9,546) dwarfs age (+3,643) and bmi (+2,042)
  • The intercept is the prediction for an average customer

Evaluating Honestly

  • Judge the model on a held-out test set, never on training data
  • The model reached a test \(R^2\) of 0.767 and an RMSE of about $5,926
  • Residual plots reveal structure the metrics hide, pointing the way to the next iteration

Why This Matters

This project is a microcosm of real applied machine learning. The hard parts were not the algorithm calls, which were the same handful of scikit-learn lines you have used all module. The hard parts were the judgment: noticing that a categorical feature held the strongest signal, encoding it correctly, choosing an honest evaluation, and reading the residuals well enough to know where the model can and cannot be trusted.

A model that explains 77 percent of medical-cost variation from six form fields is genuinely useful to an insurer, and you can defend every number in it. That combination, a working model plus a clear account of its limits, is exactly what good data work delivers.


Next Steps

You have finished the Regression module: from fitting a single line to building, tuning, and critiquing a full multiple-regression model on real data. The same workflow, explore, prepare, train, evaluate, iterate, carries forward to every model you will meet next.

Continue to the Next Module - Classification

Predict categories with logistic regression and learn to evaluate classifiers with the right metrics.


Keep Building Your Skills

The best way to lock in what you learned is to run this project yourself and then push past it. Try the practice exercises, experiment with interaction terms, or bring your own dataset with a numeric target and a mix of feature types. Every project you complete makes the next one faster, and the workflow you practiced here is the one professionals reach for every day.