Lesson 8 - Guided Project: Predicting Insurance Costs
A Project to Tie It All Together
Over the last seven lessons you learned how a line is fit to data, how to read its coefficients, how to check whether a linear model actually fits, how to scale features, how gradient descent finds the best parameters, and how scikit-learn wraps all of it into a few lines. This lesson is where you use those skills end to end on a real, messy, business-relevant problem.
You will work as if you were on a team at a health insurer. The company wants to estimate the annual medical cost of a customer before issuing a policy, using a handful of facts it already collects on an application form. A good estimate helps the company set fair premiums, plan its budget, and flag unusually expensive cases for review.
By the end of this lesson, you will be able to:
- Frame a regression project from a plain-language business brief
- Explore a real dataset to decide which features are worth modeling
- One-hot encode categorical columns so a linear model can use them
- Train and evaluate a multiple linear regression model on a held-out test set
- Interpret standardized coefficients to explain what drives the prediction
You should already be comfortable with the workflow from the earlier lessons: train_test_split, StandardScaler, LinearRegression, and the R-squared and RMSE metrics. We will move quickly through the mechanics and spend more time on the judgment calls.
The Project Brief
Here is the brief, exactly as a stakeholder might hand it to you.
We collect six pieces of information on every application: the applicant’s age, sex, body mass index, number of dependent children, whether they smoke, and which region of the country they live in. For past customers we also know the total medical cost they billed in a year. Build us a model that predicts that annual cost from the six application fields, and tell us which fields matter most.
That brief gives you everything you need to define the problem:
- The target is charges, the annual medical cost in US dollars. It is a continuous, positive number, which makes this a regression problem, exactly the kind of task this module has prepared you for.
- The features are the six application fields. Three are numeric (age, bmi, children) and three are categorical (sex, smoker, region).
- Success means the model predicts well on customers it has never seen, and that you can explain in plain English which features drive the cost.
The plan follows the same shape as every project in this module: explore, prepare, train, evaluate, interpret. Let’s go.
Loading the Data
You can download the dataset and load it with pandas. It is the classic medical-cost dataset: 1,338 anonymized billing records, one row per customer, with no missing values to clean up.
import pandas as pd
df = pd.read_csv("insurance.csv") # download: https://datatweets.com/datasets/insurance.csv
print("Shape:", df.shape)
# Output: Shape: (1338, 7)
Seven columns: the six features plus the target. Here is the data dictionary.
| Column | Type | Meaning |
|---|---|---|
| age | int | Age of the primary policyholder in years |
| sex | category | female or male |
| bmi | float | Body mass index (weight in kg / height in m squared) |
| children | int | Number of dependents covered by the policy |
| smoker | category | yes or no |
| region | category | Residential area: northeast, northwest, southeast, southwest |
| charges | float | Target: annual medical cost billed, in US dollars |
A quick look at the first rows and at the target’s range tells you what you are working with.
print(df.head(3))
# Output:
# age sex bmi children smoker region charges
# 0 19 female 27.900 0 yes southwest 16884.92400
# 1 18 male 33.770 1 no southeast 1725.55230
# 2 28 male 33.000 3 no southeast 4449.46200
print(df["charges"].describe()[["mean", "min", "max"]].round())
# Output:
# mean 13270.0
# min 1122.0
# max 63770.0
# Name: charges, dtype: float64
The average annual cost is about $13,270, but it ranges from roughly $1,122 to $63,770. That is a wide spread, more than fifty to one between the cheapest and most expensive customers. A big part of your job is figuring out what separates the two ends.
Exploring What Drives Cost
Before modeling, it pays to look. The earlier lessons showed that features which correlate strongly with the target make good predictors, so start with the three numeric features.
print(df[["age", "bmi", "children", "charges"]].corr()["charges"].round(3))
# Output:
# age 0.299
# bmi 0.198
# children 0.068
# charges 1.000
# Name: charges, dtype: float64
None of the numeric features is overwhelmingly correlated with cost. age has the strongest linear relationship at 0.299, bmi is weaker at 0.198, and children is nearly flat at 0.068. If those were your only features, the model would be weak. So where does the rest of the signal hide? In the categoricals, and one in particular.
The Smoker Effect
Split the average cost by whether the customer smokes.
print(df["smoker"].value_counts())
# Output:
# smoker
# no 1064
# yes 274
# Name: count, dtype: int64
print(df.groupby("smoker")["charges"].mean().round())
# Output:
# smoker
# no 8434.0
# yes 32050.0
# Name: charges, dtype: float64
Only about one customer in five smokes, but the average smoker is billed roughly $32,050 a year against $8,434 for a non-smoker, almost four times as much. That single binary field carries more predictive power than all three numeric features combined. A histogram of charges, colored by smoking status, makes the separation impossible to miss.
Notice the shape. Non-smokers pile up at the low end, while smokers form a separate cluster shifted far to the right. This is exactly the kind of relationship a linear model can capture well, as long as you give it the smoker column in a form it can read. That is the next problem.
Look before you model
The numeric correlations alone would have told a misleading story, suggesting a weak model is the best you can do. Splitting by a categorical feature revealed the strongest signal in the whole dataset. Always explore categorical features too, not just the numbers a correlation matrix can reach.
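One way to make this habit routine is a small helper that reports the mean target for every category of every non-numeric column. The sketch below is illustrative only: the function name mean_target_by_category is hypothetical, and it runs on the three sample rows printed earlier rather than the full dataset, so the numbers it prints are not the lesson's group averages.

```python
import pandas as pd

def mean_target_by_category(df, target):
    """Return {column: mean of target per category} for every non-numeric column."""
    cat_cols = df.select_dtypes(exclude="number").columns
    return {col: df.groupby(col)[target].mean() for col in cat_cols}

# The three sample rows shown by df.head(3) earlier in this lesson
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

by_category = mean_target_by_category(sample, "charges")
for col, means in by_category.items():
    print(means.round(2), "\n")
```

Calling the same helper on the full df would give one table per categorical column, which is a quick way to see whether any of them separates the target as sharply as smoker does.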
Preparing the Data
A linear model multiplies each feature by a coefficient and adds the results, which means every feature has to be a number. Right now three of your columns are text. You cannot multiply a coefficient by "yes" or "northeast", so you need to convert them.
One-Hot Encoding
The standard fix is one-hot encoding. For a categorical column, you create one new 0/1 column per category. A customer who smokes gets a 1 in the smoker_yes column and a 0 otherwise.
There is a subtlety. If smoker has two values, you only need one new column to capture the information: smoker_yes = 1 already implies the customer is a smoker, and smoker_yes = 0 implies they are not. Keeping both a smoker_yes and a smoker_no column would be redundant, and that redundancy can make a linear model’s coefficients unstable. The convention is to drop one category per feature, called the reference category. pandas does this for you with drop_first=True.
X = pd.get_dummies(df.drop(columns=["charges"]), drop_first=True)
y = df["charges"]
print(list(X.columns))
# Output:
# ['age', 'bmi', 'children', 'sex_male', 'smoker_yes',
# 'region_northwest', 'region_southeast', 'region_southwest']
The three numeric columns pass through untouched. sex becomes a single sex_male column (female is the reference). smoker becomes smoker_yes (non-smoker is the reference). region, which has four values, becomes three columns (northeast is the reference). You started with six features and ended with eight numeric columns, all ready for the model.
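To see the redundancy that drop_first=True avoids, here is a minimal sketch on a four-row toy column (the toy data is made up for illustration). Without drop_first, the two indicator columns always sum to 1, so one of them adds no information.

```python
import pandas as pd

toy = pd.DataFrame({"smoker": ["yes", "no", "no", "yes"]})

# Full encoding: one column per category
full = pd.get_dummies(toy, dtype=int)
# Reduced encoding: the first category ("no") becomes the reference
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))     # ['smoker_no', 'smoker_yes']
print(list(reduced.columns))  # ['smoker_yes']

# The two full-encoding columns are perfectly redundant: they sum to 1 on every row
row_sums = full.sum(axis=1)
print(row_sums.tolist())      # [1, 1, 1, 1]
```

The dtype=int argument is only there to print 0/1 instead of True/False; the model treats both the same way.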
Splitting and Scaling
From here the recipe is the familiar one. Hold out 25 percent of the data as a test set so you can judge the model on customers it never trained on, then standardize the features. Scaling does not change which model fits best, but it puts every feature on the same scale, which makes the coefficients directly comparable, exactly what you need to answer the stakeholder’s “which fields matter most?” question.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # learn mean/std on TRAIN only
X_test_scaled = scaler.transform(X_test) # apply the same transform to TEST
print("Training rows:", X_train.shape[0])
print("Test rows: ", X_test.shape[0])
# Output:
# Training rows: 1003
# Test rows: 335
You train on 1,003 customers and keep 335 in reserve.
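The claim that scaling does not change which model fits best can be checked directly. The sketch below uses synthetic data (the feature ranges and coefficients are invented for illustration) and fits ordinary least squares twice, once on raw features and once on standardized ones. The predictions should agree, while the coefficients differ because they are expressed in different units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two numeric features and a cost target
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.uniform(18, 64, 100), rng.uniform(16, 50, 100)])
y_toy = 250 * X_toy[:, 0] + 300 * X_toy[:, 1] + rng.normal(0, 1000, 100)

# Fit once on raw features, once on standardized features
raw_model = LinearRegression().fit(X_toy, y_toy)
scaler_toy = StandardScaler().fit(X_toy)
scaled_model = LinearRegression().fit(scaler_toy.transform(X_toy), y_toy)

pred_raw = raw_model.predict(X_toy)
pred_scaled = scaled_model.predict(scaler_toy.transform(X_toy))

print("Same predictions:", np.allclose(pred_raw, pred_scaled))
print("Raw coefficients:   ", raw_model.coef_.round(1))
print("Scaled coefficients:", scaled_model.coef_.round(1))
```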
Fit the scaler on training data only
Call fit_transform on the training set and plain transform on the test set. Fitting the scaler on the full dataset would leak information about the test customers into training, and your test scores would look better than they really are. This is the same golden rule you followed when scaling features in the earlier lessons.
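A tiny worked example makes the rule concrete. In this sketch (toy ages, not the real dataset) the scaler learns its mean and standard deviation from three training values only, and the single test value is then transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_ages = np.array([[20.0], [30.0], [40.0]])
test_ages = np.array([[50.0]])

age_scaler = StandardScaler()
train_scaled = age_scaler.fit_transform(train_ages)  # statistics come from TRAIN only
test_scaled = age_scaler.transform(test_ages)        # reuse them, never refit on TEST

print("Learned mean:", age_scaler.mean_)    # mean of 20, 30, 40
print("Learned std: ", age_scaler.scale_)   # population std of the training values
print("Scaled test value:", test_scaled)    # (50 - 30) / std
```

The test value lands well above zero because it is older than anything in the training set. That is the honest answer; refitting the scaler on the test data would hide it.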
Training the Model
With the data prepared, training is two lines. You instantiate a LinearRegression and call .fit() on the scaled training data. Under the hood scikit-learn solves for the coefficients that minimize the mean squared error, the same objective you implemented by hand with gradient descent two lessons ago.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_scaled, y_train)
print("Model trained!")
# Output: Model trained!
The model you just fit has the form you have seen throughout this module:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_8 x_8$$
where $\hat{y}$ is the predicted annual cost, $\beta_0$ is the intercept, and each $\beta_j$ is the coefficient on one of the eight standardized features $x_j$. Because the features are standardized, each $\beta_j$ tells you how many dollars the prediction moves when feature $x_j$ increases by one standard deviation, holding the others fixed. That makes the coefficients directly comparable in size.
import pandas as pd
coefs = pd.Series(model.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(coefs.round(1))
# Output:
# smoker_yes 9546.3
# age 3643.1
# bmi 2042.3
# children 513.5
# region_southwest -370.6
# region_southeast -342.0
# region_northwest -152.8
# sex_male 22.8
# dtype: float64
print("Intercept:", round(model.intercept_, 1))
# Output: Intercept: 13267.9
The intercept of about $13,268 is the predicted cost for an “average” customer, because after standardizing, the average customer sits at zero on every feature. From there the coefficients tell the story.
Reading the coefficients from largest to smallest:
- smoker_yes (+9,546) dwarfs everything else. A one-standard-deviation increase in the standardized smoker_yes column adds about $9,500 to the prediction. Because a 0/1 column in which roughly one value in five is 1 has a standard deviation of only about 0.4, the full switch from non-smoker to smoker spans about 2.5 standard deviations and is worth considerably more than $9,500. Either way it is by far the biggest single lever, which confirms what the histogram hinted at.
- age (+3,643) is next: older customers cost more, as you would expect.
- bmi (+2,042) also pushes cost up, consistent with the health risks of a higher body mass index.
- children (+513) has a small positive effect.
- region and sex barely move the needle. The regional coefficients are all small and negative relative to the northeast reference, and sex_male is essentially zero, meaning sex tells the model almost nothing once the other features are known.
If the stakeholder only remembers one thing from your analysis, it should be this: smoking is the dominant driver of medical cost in this data.
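Standardized coefficients rank the features, but a stakeholder may also ask for the effect in original units, for example dollars per year of age or dollars for smoker versus non-smoker. Dividing each coefficient by the standard deviation the scaler learned (its scale_ attribute) converts it back. The sketch below uses synthetic, noiseless data with made-up effects of $250 per year of age and $20,000 for a 0/1 flag, so you can see the conversion recover them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with known per-unit effects (illustrative values, not the real dataset)
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200).astype(float)
flag = (rng.random(200) < 0.2).astype(float)      # a 0/1 column, about one in five set
X_demo = np.column_stack([age, flag])
y_demo = 1000.0 + 250.0 * age + 20000.0 * flag    # noiseless on purpose

scaler_demo = StandardScaler().fit(X_demo)
model_demo = LinearRegression().fit(scaler_demo.transform(X_demo), y_demo)

# Dollars per one standard deviation -> dollars per one original unit
per_unit = model_demo.coef_ / scaler_demo.scale_

print("Standardized coefficients:", model_demo.coef_.round(1))
print("Per-unit coefficients:    ", per_unit.round(1))
print("Intercept equals mean target:", np.isclose(model_demo.intercept_, y_demo.mean()))
```

The standardized coefficient on the 0/1 flag comes out far below 20,000 because that column's standard deviation is well under 1. The last line also confirms the earlier statement about the intercept: with features centered on the training mean, the intercept is the mean of the training target.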
Evaluating on the Test Set
Coefficients explain the model, but they do not tell you whether it predicts well. For that you turn to the held-out test set, the 335 customers the model never saw. Two metrics from the earlier lessons do the job.
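The coefficient of determination, R-squared, measures the fraction of the variation in charges that the model explains:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Here $y_i$ is the actual cost for customer $i$, $\hat{y}_i$ is the predicted cost, and $\bar{y}$ is the mean actual cost. An $R^2$ of 1.0 is a perfect fit; 0.0 means the model does no better than always guessing the mean. The root mean squared error, RMSE, reports the typical prediction error in the original units, dollars:
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$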
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
predictions = model.predict(X_test_scaled)
r2 = r2_score(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"Test R-squared: {r2:.3f}")
print(f"Test RMSE: ${rmse:,.0f}")
# Output:
# Test R-squared: 0.767
# Test RMSE: $5,926
The model explains about 77 percent of the variation in medical costs on customers it has never seen, with a typical error of about $5,926. For a model built from six application fields and a straight-line assumption, that is a genuinely useful result. A scatter of predicted versus actual cost shows the fit visually: if predictions were perfect, every point would sit on the diagonal line.
Most points track the diagonal closely, which is why $R^2$ is high. You can also look at individual predictions to get a feel for the errors.
for pred, actual in zip(predictions[:5], y_test.values[:5]):
    print(f"predicted ${pred:>8,.0f} actual ${actual:>8,.0f}")
# Output:
# predicted $ 8,952 actual $ 9,095
# predicted $ 7,054 actual $ 5,272
# predicted $ 36,888 actual $ 29,331
# predicted $ 9,522 actual $ 9,302
# predicted $ 26,962 actual $ 33,750
Some predictions are remarkably close (the first and fourth are within a few hundred dollars), while the high-cost cases are noticeably off by several thousand. That pattern is worth a closer look.
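To connect the two metrics to their definitions, the sketch below computes R-squared and RMSE by hand on the five rounded predictions printed above and checks them against scikit-learn. These five rows are only a small slice of the test set, so the values will not match the full-test figures of 0.767 and $5,926.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# The five predicted/actual pairs printed above (rounded to whole dollars)
predicted = np.array([8952.0, 7054.0, 36888.0, 9522.0, 26962.0])
actual = np.array([9095.0, 5272.0, 29331.0, 9302.0, 33750.0])

# Sum of squared errors, and sum of squared deviations from the mean
residual_ss = np.sum((actual - predicted) ** 2)
total_ss = np.sum((actual - actual.mean()) ** 2)

r2_manual = 1 - residual_ss / total_ss
rmse_manual = np.sqrt(residual_ss / len(actual))

print(f"Manual:       R-squared {r2_manual:.3f}, RMSE ${rmse_manual:,.0f}")
print(f"scikit-learn: R-squared {r2_score(actual, predicted):.3f}, "
      f"RMSE ${np.sqrt(mean_squared_error(actual, predicted)):,.0f}")
```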
Reading the Residuals
A residual is the gap between what actually happened and what the model predicted:
$$e_i = y_i - \hat{y}_i$$
where $y_i$ is the actual cost and $\hat{y}_i$ is the predicted cost. In Lesson 3 you learned that a well-behaved linear model should produce residuals that scatter randomly around zero with no visible pattern. Plotting them against the predicted value is the quickest way to spot trouble.
The residuals are not a featureless cloud. You can see structure: distinct bands of points, and the spread of errors grows as the predicted cost rises. That tells you two honest things about the model.
First, the bands come from the smoker split. Smokers and non-smokers follow such different cost curves that a single straight line cannot serve both groups perfectly, so each group leaves its own trail of residuals. Second, the widening spread (called heteroscedasticity) means the model is least reliable exactly where the stakes are highest, the expensive customers. A $5,926 typical error is small next to a $32,000 smoker bill but large next to an $8,000 non-smoker bill.
None of this makes the model useless. It makes it honest: you now know where it is strong (typical, low-to-mid-cost customers) and where it should be trusted less (the expensive tail). Being able to say that is the difference between shipping a model and shipping a model you understand.
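If you want a number to go with the visual impression of a widening spread, one rough check is to compare the standard deviation of the residuals for low predictions against high predictions. The sketch below is an assumption-laden illustration: it uses synthetic data whose noise grows with the feature by construction, and the median split is just one simple choice, not a formal test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data where the noise grows with x (heteroscedastic by construction)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=2000)
y_noisy = 3.0 * x + rng.normal(0, 1, size=2000) * (0.5 + x)

line = LinearRegression().fit(x.reshape(-1, 1), y_noisy)
pred = line.predict(x.reshape(-1, 1))
residuals = y_noisy - pred          # actual minus predicted

# Compare the residual spread below and above the median prediction
cut = np.median(pred)
low_spread = residuals[pred <= cut].std()
high_spread = residuals[pred > cut].std()

print(f"Residual std, low predictions:  {low_spread:.2f}")
print(f"Residual std, high predictions: {high_spread:.2f}")
```

On the insurance model you would replace pred and residuals with predictions and y_test - predictions. A clearly larger spread in the upper half is the numeric counterpart of the fan shape in the plot.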
A good model knows its limits
The residual plot did not invalidate the project; it sharpened it. The structure it revealed points straight at the next improvements: model smokers and non-smokers separately, or add interaction terms so age and BMI are allowed to behave differently for smokers. That is how diagnostics drive the next iteration.
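As a sketch of what an interaction term looks like in practice, the example below builds a synthetic dataset in which BMI raises cost more steeply for smokers (all numbers are invented for illustration), then compares an additive model with one that includes a smoker-times-BMI column. The column name smoker_x_bmi is a hypothetical choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: the BMI slope is steeper for smokers (noiseless for clarity)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "bmi": rng.uniform(18, 45, 300),
    "smoker_yes": (rng.random(300) < 0.2).astype(int),
})
toy["charges"] = 2000 + 100 * toy["bmi"] + toy["smoker_yes"] * (5000 + 600 * toy["bmi"])

# Additive model: one BMI slope shared by both groups
additive_cols = ["bmi", "smoker_yes"]
additive = LinearRegression().fit(toy[additive_cols], toy["charges"])
r2_additive = additive.score(toy[additive_cols], toy["charges"])

# Interaction model: the product column lets the BMI slope differ for smokers
toy["smoker_x_bmi"] = toy["smoker_yes"] * toy["bmi"]
interaction_cols = ["bmi", "smoker_yes", "smoker_x_bmi"]
interaction = LinearRegression().fit(toy[interaction_cols], toy["charges"])
r2_interaction = interaction.score(toy[interaction_cols], toy["charges"])

print(f"Additive R-squared:    {r2_additive:.4f}")
print(f"Interaction R-squared: {r2_interaction:.4f}")
```

Whether the same trick helps on the real insurance data is something to verify on the held-out test set, not assume from this toy example.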
What You Built
Step back and look at what you have. Starting from a one-paragraph business brief, you:
- Explored the data and discovered that smoking, not any single numeric feature, is the dominant cost driver.
- Prepared it by one-hot encoding three categorical columns into eight numeric features, then splitting and scaling without leaking the test set.
- Trained a multiple linear regression model and read its standardized coefficients to rank what matters: smoking, then age, then BMI.
- Evaluated it honestly on held-out data, reporting a test $R^2$ of 0.767 and a typical error of about $5,926.
- Diagnosed its residuals to identify exactly where it succeeds and where it strains.
Here is the entire project condensed into one runnable script. It is a template you can adapt for almost any regression problem with a mix of numeric and categorical features.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# 1. Load
df = pd.read_csv("insurance.csv") # download: https://datatweets.com/datasets/insurance.csv
# 2. Prepare: one-hot encode categoricals, split off the target
X = pd.get_dummies(df.drop(columns=["charges"]), drop_first=True)
y = df["charges"]
# 3. Split, then scale (fit on train only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# 4. Train
model = LinearRegression()
model.fit(X_train, y_train)
# 5. Evaluate on the held-out test set
predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"Test R-squared: {r2:.3f} RMSE: ${rmse:,.0f}")
# Output: Test R-squared: 0.767 RMSE: $5,926
That is a complete, defensible regression project in about 20 lines.
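If you adapt the template often, one optional variation (not used in this lesson) is to bundle the scaler and the model with scikit-learn's make_pipeline, so that calling fit on the pipeline fits the scaler on the training data only. The sketch below uses synthetic data and checks that the pipeline gives the same test predictions as the manual two-step approach.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on different scales, with a linear target plus noise
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 3)) * [10, 5, 1] + [40, 30, 1]
y_demo = X_demo @ np.array([250.0, 300.0, 500.0]) + rng.normal(0, 500, 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Pipeline: fit() scales with training statistics, predict() reuses them
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Manual equivalent, as in the script above
sc = StandardScaler().fit(X_tr)
manual = LinearRegression().fit(sc.transform(X_tr), y_tr)
manual_pred = manual.predict(sc.transform(X_te))

print("Pipeline matches manual steps:", np.allclose(pipe_pred, manual_pred))
```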
Practice Exercises
Try these before peeking at the hints. They push the project a little further in the directions the residual plot suggested.
Exercise 1: Quantify the Smoker Gap
You saw that smokers cost far more on average. Compute the exact difference: the mean charges for smokers minus the mean for non-smokers, rounded to the nearest dollar. Does the gap line up with the size of the smoker_yes coefficient?
import pandas as pd
df = pd.read_csv("insurance.csv")
# Your code here
Hint
Use df.groupby("smoker")["charges"].mean() to get both averages, then subtract. You should find smokers average about $32,050 and non-smokers about $8,434, a gap of roughly $23,600. The coefficient is smaller than the raw gap because it is measured per standard deviation of a standardized 0/1 column, not per category, but both point to smoking as the dominant driver.
Exercise 2: Model Without the Smoker Feature
How much of the model’s power comes from smoker? Rebuild the model after dropping every smoker-related column, and compare the test $R^2$ to the full model’s 0.767.
# Reuse df from Exercise 1
# Your code here
Hint
After one-hot encoding, drop the smoker_yes column with X = X.drop(columns=["smoker_yes"]) before the split. Re-run the same train_test_split, StandardScaler, and LinearRegression steps. The test $R^2$ collapses dramatically, which is direct evidence that smoking carries most of the predictive signal.
Exercise 3: Separate Models for Smokers and Non-Smokers
The residual plot suggested smokers and non-smokers follow different cost curves. Split the data into two groups by smoker, fit a separate LinearRegression to each, and compare the test $R^2$ of each group’s model. Do the focused models fit their own group better than the single combined model?
# Reuse df from Exercise 1
# Your code here
Hint
Filter with df[df["smoker"] == "yes"] and df[df["smoker"] == "no"], then run the full encode/split/scale/fit/evaluate pipeline on each subset (drop smoker itself, since it is now constant within each group). Fitting groups separately is one of the cleanest ways to handle the structure the residual plot revealed.
Summary
You completed an end-to-end regression project on a real dataset, applying every skill from this module. Let’s review what you learned.
Key Concepts
Framing a Project
- A continuous, positive target like annual cost is a regression problem
- A clear brief defines the target, the features, and what success means before any code is written
Preparing Mixed Data
- Linear models need numbers, so categorical columns must be one-hot encoded
- pd.get_dummies(..., drop_first=True) creates 0/1 columns and drops one reference category per feature to avoid redundancy
- Split before scaling, and fit the scaler on the training set only
Training and Interpreting
- LinearRegression().fit() solves for the coefficients that minimize mean squared error
- Standardized coefficients are directly comparable: here smoker_yes (+9,546) dwarfs age (+3,643) and bmi (+2,042)
- The intercept is the prediction for an average customer
Evaluating Honestly
- Judge the model on a held-out test set, never on training data
- The model reached a test $R^2$ of 0.767 and an RMSE of about $5,926
- Residual plots reveal structure the metrics hide, pointing the way to the next iteration
Why This Matters
This project is a microcosm of real applied machine learning. The hard parts were not the algorithm calls, which were the same handful of scikit-learn lines you have used all module. The hard parts were the judgment: noticing that a categorical feature held the strongest signal, encoding it correctly, choosing an honest evaluation, and reading the residuals well enough to know where the model can and cannot be trusted.
A model that explains 77 percent of medical-cost variation from six form fields is genuinely useful to an insurer, and you can defend every number in it. That combination, a working model plus a clear account of its limits, is exactly what good data work delivers.
Next Steps
You have finished the Regression module: from fitting a single line to building, tuning, and critiquing a full multiple-regression model on real data. The same workflow, explore, prepare, train, evaluate, iterate, carries forward to every model you will meet next.
Keep Building Your Skills
The best way to lock in what you learned is to run this project yourself and then push past it. Try the practice exercises, experiment with interaction terms, or bring your own dataset with a numeric target and a mix of feature types. Every project you complete makes the next one faster, and the workflow you practiced here is the one professionals reach for every day.