Lesson 1 - Introduction to Logistic Regression

Welcome to Classification

This lesson introduces you to classification, the branch of supervised learning that predicts categories instead of numbers, and to one of its workhorse models: logistic regression. You will see why the linear regression you already know breaks down on a yes/no target, learn how the sigmoid function fixes it, and build a real churn classifier with scikit-learn.

By the end of this lesson, you will be able to:

Tell the difference between classification and regression problems
Explain why linear regression is the wrong tool for a 0/1 target
Describe the sigmoid function and how it maps any number to a probability between 0 and 1
Write down the logistic regression model in its standard form
Build, train, and evaluate a LogisticRegression model on a real dataset

You should be comfortable with basic Python, pandas, and the machine learning workflow (features, targets, train/test split). If those terms are new, work through the Foundations module first. Let’s begin.

Classification vs. Regression

In the Foundations module you saw that the type of target decides the type of task. That single distinction is worth revisiting carefully, because everything in this module flows from it.

When the target is a number that can vary smoothly, like a house price, a temperature, or next month’s revenue, you are doing regression. The model outputs a continuous value, and you measure error by how far off that value is.

When the target is a category, like spam or not-spam, fraud or legitimate, churn or stay, you are doing classification. The model’s job is to assign each example to one of a fixed set of labels.

Regression                          Classification
-----------------------             -----------------------
Target is a number                  Target is a category
Examples:                           Examples:
  - House price ($)                   - Spam vs. not spam
  - Tomorrow's temperature            - Fraud vs. legitimate
  - Customer lifetime value           - Will the customer churn?

When a classifier chooses between exactly two categories, it is a binary classifier. Binary classification is one of the most common cases in industry, and it is where we will spend this entire module. The car-loan default, the medical diagnosis, the customer who cancels their subscription: all of these are yes/no questions, and all of them are binary classification.

The name logistic regression is a small but famous source of confusion. Despite the word “regression,” logistic regression is a classification model. The “regression” part refers to the fact that, under the hood, it still builds a linear combination of the predictors, exactly like linear regression. It then wraps that linear combination in one extra function to turn it into a probability. Understanding that one extra function is the heart of this lesson.

Why start with logistic regression?

Logistic regression is the first classifier most practitioners learn, and many never stop using it. It trains quickly (typically well under a second on a dataset of this size), its predictions are easy to explain to non-technical stakeholders, and it produces probabilities rather than bare labels, probabilities that are often reasonably well calibrated. Even when a fancier model wins on accuracy, logistic regression is the baseline it is usually measured against.

The Problem: Predicting Customer Churn

Imagine you work at a telecom company. Every month, some fraction of your customers cancel their service. That is called churn, and it is expensive: winning a brand-new customer costs far more than keeping an existing one. If you could flag the customers most likely to leave before they leave, your retention team could step in with a discount or a better plan.

That is a classic binary classification problem. For each customer, the answer is one of two categories: churn (they leave) or stay. You have records of past customers, including who churned and who did not, so this is a supervised learning task with a labeled target.

You will work with the real Customer Churn dataset, which records account information for telecom customers along with whether each one churned.

import pandas as pd

# download: https://datatweets.com/datasets/customer_churn.csv
df = pd.read_csv("customer_churn.csv")

print("Shape:", df.shape)
# Output: Shape: (7032, 12)

The dataset has 7,032 rows and 12 columns. Each row is one customer. Most columns describe the account, such as how long the customer has been with the company and how much they pay, and one column, Churn, records the outcome you want to predict.

A Look at the Columns

You do not need to memorize every field, but here are the ones you will use as features, plus the target.

| Column | Type | Meaning |
| --- | --- | --- |
| `tenure` | int | Months the customer has stayed with the company |
| `MonthlyCharges` | float | The amount billed each month |
| `TotalCharges` | float | The total amount billed over the lifetime of the account |
| `SeniorCitizen` | binary | Whether the customer is a senior citizen (1) or not (0) |
| `Contract_One year`, `Contract_Two year` | binary | Contract type (month-to-month is the omitted baseline) |
| `InternetService_Fiber optic`, `InternetService_No` | binary | Internet service type (DSL is the omitted baseline) |
| `Partner_Yes`, `Dependents_Yes` | binary | Whether the customer has a partner or dependents |
| `PaperlessBilling_Yes` | binary | Whether the customer uses paperless billing |
| `Churn` | category | Target: `"Yes"` if the customer churned, `"No"` otherwise |

The categorical fields like contract type and internet service have already been turned into 0/1 columns for you, a process called one-hot encoding that you will study later. For now, treat them as ready-to-use numeric features.

How Balanced Is the Target?

Before modeling anything, always check how the target is distributed. If one class vastly outnumbers the other, a lazy model can score high accuracy while being useless.

print(df["Churn"].value_counts().to_dict())
# Output: {'No': 5163, 'Yes': 1869}

churn_rate = (df["Churn"] == "Yes").mean()
print("churn rate:", round(churn_rate, 3))
# Output: churn rate: 0.266

About 27 percent of customers churned (1,869 out of 7,032). That is imbalanced but not severely so: the positive class is well represented. A picture makes the split clear.

Figure: bar chart of the churn split, with about 27 percent of customers churned and 73 percent stayed. The dataset leans toward the "stay" class but keeps a healthy minority of churners.

Keep the base rate in mind

Because only 27 percent of customers churn, a model that blindly predicts “no churn” for everyone would already be right 73 percent of the time. Remember that number: any useful model must beat it. This is exactly why you should never judge a classifier on accuracy alone, a point we return to in a later lesson.
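
The base-rate arithmetic can be checked directly from the two class counts printed earlier. The following is a minimal sketch that hard-codes those counts instead of loading the file; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output shown earlier.
counts = {"No": 5163, "Yes": 1869}

total = sum(counts.values())
churn_rate = counts["Yes"] / total

# A "model" that always predicts the majority class is right exactly
# as often as that class appears in the data.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"churn rate:        {churn_rate:.3f}")
print(f"majority class:    {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier you train on this data has to clear that baseline accuracy before its accuracy score says anything useful.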

Why Linear Regression Fails Here

Here is a natural question: you already know linear regression, so why not just use it? Encode the target as 1 for churn and 0 for stay, fit a line, and call any prediction above 0.5 a churn. It sounds reasonable, and it falls apart fast.

Linear regression fits the equation

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

The right-hand side is a straight line (a plane, in higher dimensions). It is free to take any value from $-\infty$ to $+\infty$. But your target is only ever 0 or 1. There is nothing stopping the fitted line from predicting 1.8 for a customer with very high charges, or -0.4 for a customer with very low ones. What is a churn probability of 180 percent supposed to mean? Or negative 40 percent? Probabilities must live between 0 and 1, and a straight line refuses to stay in that range.

The picture below shows the problem directly. On the left, a straight line is fit to a 0/1 target: it shoots above 1 and dips below 0, and its slope forces a single rigid threshold that misclassifies points at both ends. On the right, the logistic curve hugs the data, flattening out near 0 and 1 exactly where it should.

Figure: side-by-side comparison of a straight regression line and an S-shaped logistic curve fit to a 0/1 target. The straight line escapes the 0-to-1 range, while the S-shaped logistic curve stays inside it and bends smoothly between the two classes.

There are deeper problems too. The errors in a 0/1 target are not the smooth, evenly-spread errors that linear regression assumes, and a single outlier far out on one axis can drag the whole line and shift the threshold. The fix is not to patch linear regression. The fix is to wrap its linear score in a function that guarantees an output between 0 and 1. That function is the sigmoid.
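
To see the out-of-range problem in numbers, here is a small sketch on made-up data (not the churn dataset). It fits an ordinary least squares line to a 0/1 target with `np.polyfit`, which is used here only as a convenient stand-in for linear regression with one feature.

```python
import numpy as np

# Toy data (made up for illustration): one feature, and a 0/1 target
# that switches from 0 to 1 once the feature passes 5.
x = np.linspace(0, 10, 21)
y = (x > 5).astype(float)

# Ordinary least squares line through the 0/1 points.
slope, intercept = np.polyfit(x, y, deg=1)

# Predictions at the ends of the observed range and a little beyond it.
x_new = np.array([-5.0, 0.0, 5.0, 10.0, 15.0])
y_hat = intercept + slope * x_new

for xi, yi in zip(x_new, y_hat):
    flag = "" if 0.0 <= yi <= 1.0 else "  <- not a valid probability"
    print(f"x = {xi:5.1f}  prediction = {yi:6.3f}{flag}")
```

Even at the edges of the data the line already leaves the 0-to-1 range, and it drifts further out the more you extrapolate.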

The Sigmoid Function

The sigmoid function, also called the logistic function, is the piece that turns logistic regression into a probability machine. It takes any real number, no matter how large or small, and squashes it into the open interval between 0 and 1. Its formula is:

$$h(z) = \frac{1}{1 + e^{-z}}$$

Here $z$ is any real number, and $h(z)$ is always strictly between 0 and 1. Let’s read the formula by checking its behavior at the extremes:

When $z$ is a large positive number, $e^{-z}$ shrinks toward 0, so $h(z)$ approaches $\frac{1}{1+0} = 1$.
When $z$ is a large negative number, $e^{-z}$ blows up toward $+\infty$, so $h(z)$ approaches $\frac{1}{\infty} = 0$.
When $z = 0$, we get $h(0) = \frac{1}{1+1} = 0.5$, the exact midpoint.

That is the whole trick. No matter what you feed it, the output is a valid probability. The shape this produces is the famous S-curve.

Figure: the S-shaped sigmoid curve. The sigmoid function maps any input $z$ to a probability between 0 and 1, rising smoothly and crossing 0.5 at $z = 0$.

You can compute it yourself in a couple of lines to confirm the behavior.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for z in [-6, -2, 0, 2, 6]:
    print(f"sigmoid({z:>2}) = {sigmoid(z):.4f}")
# Output:
# sigmoid(-6) = 0.0025
# sigmoid(-2) = 0.1192
# sigmoid( 0) = 0.5000
# sigmoid( 2) = 0.8808
# sigmoid( 6) = 0.9975

Notice how quickly the curve transitions. Around $z = 0$ it is steep and decisive, and far out on either side it flattens out, becoming nearly certain but never quite reaching 0 or 1. That gentle saturation is exactly what we want: the model can be very confident without ever claiming an impossible probability.

The midpoint is the decision boundary

Because $h(z) = 0.5$ exactly when $z = 0$, the line $z = 0$ is the natural dividing line between the two classes. Inputs that push $z$ above 0 lean toward the positive class; inputs that push it below 0 lean toward the negative class. You will revisit and move this 0.5 cutoff deliberately in a later lesson on thresholds.
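
A quick way to convince yourself of this equivalence is to threshold both quantities and compare. The sketch below uses a handful of hand-picked scores; nothing in it comes from the churn data.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A spread of linear scores on both sides of zero.
z = np.array([-3.0, -0.5, -0.01, 0.01, 0.5, 3.0])
p = sigmoid(z)

# Thresholding the probability at 0.5 ...
label_from_p = (p > 0.5).astype(int)
# ... gives the same labels as thresholding the raw score at 0.
label_from_z = (z > 0).astype(int)

print("probabilities:", p.round(3))
print("labels from p:", label_from_p)
print("labels from z:", label_from_z)
```

Because the sigmoid is strictly increasing, any cutoff on the probability corresponds to exactly one cutoff on the score, and 0.5 happens to correspond to 0.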

The Logistic Regression Model

Now you can assemble the full model. Logistic regression is built in two stages.

First, exactly like linear regression, it forms a linear combination of the features. Call this intermediate score $z$:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

Here $\beta_0$ is the intercept and each $\beta_j$ is the weight, or coefficient, attached to feature $x_j$. This score $z$ can be any real number.

Second, it feeds $z$ through the sigmoid to get a probability:

$$P(y = 1 \mid x) = h(z) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}$$

Read that left side carefully: it is the model’s estimated probability that the example belongs to the positive class (here, that the customer churns), given the feature values. To turn that probability into an actual label, you apply a threshold, usually 0.5: predict churn if the probability is at least 0.5, otherwise predict stay.
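
The two stages and the final threshold can be traced by hand for a single example. In the sketch below the coefficients and feature values are hypothetical, chosen only so the arithmetic is easy to follow; they are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical coefficients for a two-feature model (not fitted to any data).
beta_0 = -1.0                      # intercept
beta = np.array([0.8, -0.5])       # one weight per feature

# One example with two feature values.
x = np.array([2.0, 1.0])

# Stage 1: linear score z = beta_0 + beta_1*x_1 + beta_2*x_2
z = beta_0 + beta @ x

# Stage 2: squash the score into a probability
p = sigmoid(z)

# Decision: apply the usual 0.5 threshold
label = int(p >= 0.5)

print(f"z = {z:.2f}, P(y=1 | x) = {p:.3f}, label = {label}")
```

Here the score works out to 0.1, just above zero, so the probability lands slightly above 0.5 and the example is labeled as the positive class.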

The values the model has to learn are the coefficients $\beta_0, \beta_1, \ldots, \beta_p$. Training means searching for the coefficients that make the predicted probabilities match the observed outcomes as closely as possible. Linear regression has a tidy formula for its best coefficients; logistic regression does not, so it finds them through iterative optimization instead. The good news is that scikit-learn handles all of that for you behind a single .fit() call. What the coefficients mean once you have them, and how to read them as odds, is the subject of the next lesson.
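
To make "iterative optimization" less abstract, here is a rough sketch of one simple approach: plain gradient descent on the average log loss, run on synthetic data. This is an illustration of the idea only. It is not what scikit-learn runs internally (its default solver is, as far as I know, L-BFGS with an L2 penalty), and the data, learning rate, and step count are all made up.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    # Average negative log-likelihood; clipping avoids log(0).
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic data (made up): 200 rows, 2 features, labels drawn from a
# logistic model with known "true" coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_beta = np.array([-0.5, 2.0, -1.0])          # intercept, beta_1, beta_2
X1 = np.column_stack([np.ones(len(X)), X])       # prepend a column of 1s
y = (rng.random(200) < sigmoid(X1 @ true_beta)).astype(float)

beta = np.zeros(3)        # start with every coefficient at 0
learning_rate = 0.1
losses = []

for step in range(2000):
    p = sigmoid(X1 @ beta)
    losses.append(log_loss(y, p))
    gradient = X1.T @ (p - y) / len(y)   # gradient of the average log loss
    beta -= learning_rate * gradient

print(f"loss at start: {losses[0]:.4f}")
print(f"loss at end:   {losses[-1]:.4f}")
print("learned coefficients:", beta.round(2))
```

With all coefficients at zero every prediction is 0.5, so the starting loss is ln 2. Each step nudges the coefficients in the direction that lowers the loss, and the learned values should drift toward the ones used to generate the labels.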

Building Your First Logistic Regression

Time to put it together. The workflow mirrors what you already know: separate features from target, split into train and test sets, scale, fit, and evaluate.

Step 1: Features and Target

Separate the predictor columns (X) from the outcome column (y), converting the text target to 0/1.

feature_cols = [
    "tenure", "MonthlyCharges", "TotalCharges", "SeniorCitizen",
    "Contract_One year", "Contract_Two year",
    "InternetService_Fiber optic", "InternetService_No",
    "Partner_Yes", "Dependents_Yes", "PaperlessBilling_Yes",
]

X = df[feature_cols]
y = (df["Churn"] == "Yes").astype(int)   # 1 = churned, 0 = stayed

print("X shape:", X.shape)
print("Churners:", int(y.sum()))
# Output:
# X shape: (7032, 11)
# Churners: 1869

Step 2: Split and Scale

Hold out 25 percent of the data for testing, and use stratify=y so the churn rate is nearly identical in both sets. Then standardize the features so that columns measured in different units (months, dollars, 0/1 flags) sit on a comparable scale, which also helps the optimizer converge.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y,
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit on TRAIN only
X_test = scaler.transform(X_test)         # apply same transform to TEST

print("Train rows:", X_train.shape[0])
print("Test rows: ", X_test.shape[0])
# Output:
# Train rows: 5274
# Test rows:  1758
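
If you want to confirm what stratify=y buys you without loading the file, the sketch below rebuilds a stand-in target with the same class counts as the churn column and splits it the same way. The feature column is a dummy and the data is otherwise made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same class counts as the churn target:
# 5,163 zeros (stay) and 1,869 ones (churn). The single feature is a dummy.
y = np.array([0] * 5163 + [1] * 1869)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(f"overall churn rate: {y.mean():.4f}")
print(f"train churn rate:   {y_train.mean():.4f}")
print(f"test churn rate:    {y_test.mean():.4f}")
```

The two rates cannot match to the last decimal because the class counts do not divide evenly, but stratification keeps them within a fraction of a percentage point of each other.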

Fit the scaler on training data only

Always call fit_transform on the training set and only transform on the test set. If you fit the scaler on the full dataset, information about the test rows leaks into training and your score becomes dishonestly optimistic. The same rule applies to the model itself: it must never see the test set until the final evaluation.
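
The sketch below shows what the scaler actually stores and why the test set only gets transformed. The feature matrix is random and made up, loosely imitating one column in months and one in dollars; the split is a simple slice rather than train_test_split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two columns on very different scales,
# loosely imitating "tenure in months" and "total charges".
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(1, 72, size=1000),
    rng.uniform(20, 8000, size=1000),
])
X_train, X_test = X[:750], X[750:]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learns mean/std from TRAIN
X_test_scaled = scaler.transform(X_test)         # reuses those same numbers

print("learned means:       ", scaler.mean_.round(1))
print("train means (scaled):", X_train_scaled.mean(axis=0).round(3))
print("train stds  (scaled):", X_train_scaled.std(axis=0).round(3))
print("test means  (scaled):", X_test_scaled.mean(axis=0).round(3))
```

The scaled training columns come out with mean 0 and standard deviation 1 by construction. The scaled test columns are close to that but not exact, which is expected: they were shifted and scaled using statistics the test rows had no part in.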

Step 3: Fit the Model

Building and training a logistic regression takes two lines: instantiate, then fit. The max_iter argument simply gives the iterative optimizer enough steps to settle on its coefficients.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

print("Model trained!")
# Output: Model trained!

That single .fit() call ran the optimization that searches for the best coefficients, the ones that plug into the sigmoid to produce probabilities closest to the actual churn outcomes.
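
After fitting, the learned values are available as attributes on the model. The sketch below fits on a synthetic problem generated with make_classification (not the churn data) purely to show where the intercept, the coefficients, and the solver's iteration count live.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (made up) with 11 features, matching the
# number of columns used in this lesson.
X, y = make_classification(
    n_samples=1000, n_features=11, n_informative=5, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

print("iterations used:", model.n_iter_)           # steps the solver actually took
print("intercept shape:", model.intercept_.shape)  # beta_0
print("coef shape:     ", model.coef_.shape)       # beta_1 ... beta_p
```

On a problem like this the solver typically stops well short of max_iter, which is why that argument is best read as a ceiling rather than a target.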

Step 4: Predict Probabilities and Labels

Logistic regression gives you more than a bare label. The .predict_proba() method returns the actual probability for each class, which is one of the model’s biggest selling points.

# Probability of churn for the first five test customers
proba = model.predict_proba(X_test)[:, 1]   # column 1 = P(churn)
print("Churn probabilities:", proba[:5].round(3))
# Output: Churn probabilities: [0.03  0.625 0.118 0.444 0.706]

# Apply the default 0.5 threshold to get labels
preds = model.predict(X_test)
print("Predicted labels:   ", preds[:5])
print("Actual labels:      ", y_test.values[:5])
# Output:
# Predicted labels:    [0 1 0 0 1]
# Actual labels:       [0 1 0 0 1]

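Each probability is the sigmoid output for that customer. The default .predict() applies the 0.5 cutoff: probabilities above 0.5 become 1, the rest become 0. (A probability of exactly 0.5 essentially never occurs with real-valued features, so the handling of ties rarely matters in practice.)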
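
The relationship between the linear score, predict_proba, and predict can be checked directly. This sketch uses synthetic data from make_classification as a stand-in for the churn features; the checks assume a binary LogisticRegression with default settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (made up) standing in for the churn features.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

both = model.predict_proba(X)      # column 0 = P(class 0), column 1 = P(class 1)
proba = both[:, 1]

# Re-create the labels by thresholding the positive-class probability.
manual_labels = (proba > 0.5).astype(int)

print("rows sum to 1:      ", np.allclose(both.sum(axis=1), 1.0))
print("matches .predict(): ", np.array_equal(manual_labels, model.predict(X)))

# Sanity check: proba is the sigmoid of the model's linear score.
z = model.decision_function(X)
print("proba == sigmoid(z):", np.allclose(proba, 1 / (1 + np.exp(-z))))
```

The decision_function method returns the raw score $z$ for each row, so this is the two-stage model from earlier made visible: score first, sigmoid second, threshold last.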

Evaluating the Model

Now measure how well the model does on the held-out test set, the data it never saw during training.

The simplest metric is accuracy: the fraction of test customers classified correctly. scikit-learn computes it with .score().

accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
# Output: Test accuracy: 0.796

An accuracy of 0.796 means the model classified about 80 percent of the test customers correctly. Recall the base rate from earlier: always guessing “no churn” would score 0.73, so the model is genuinely learning something, beating the naive baseline by a meaningful margin.
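
Accuracy itself is nothing more than the fraction of matching labels, which the following sketch computes by hand on a tiny made-up example alongside the majority-class baseline.

```python
import numpy as np

# Tiny made-up example: 8 customers, 3 of whom actually churned.
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 0, 0])   # a model's predicted labels

# Accuracy = fraction of predictions that match the truth.
accuracy = (y_pred == y_true).mean()

# Baseline = accuracy of always predicting the majority class (0 here).
baseline = (y_true == 0).mean()

print(f"model accuracy:    {accuracy:.3f}")
print(f"baseline accuracy: {baseline:.3f}")
```

Six of the eight predictions match, while always guessing 0 would get five of eight right, so this toy model beats its baseline by one customer.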

Accuracy is only part of the story, though, especially with an imbalanced target. A more informative single number is the AUC (area under the ROC curve), which measures how well the model ranks customers by churn risk across every possible threshold, not just at 0.5. AUC ranges from 0 to 1: a value of 0.5 means the ranking is no better than a coin flip, 1.0 means a perfect ranking, and anything below 0.5 means the model ranks worse than chance.

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
# Output: Test AUC: 0.836

An AUC of 0.836 is solid. It means that if you pick a random churner and a random non-churner, the model assigns the churner a higher churn probability about 84 percent of the time. That ranking ability is exactly what a retention team needs: it can sort customers by risk and focus on the most likely to leave.
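
That pairwise reading of AUC can be verified by brute force. The sketch below uses four made-up customers, counts how often a churner outscores a non-churner, and compares the result with roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Tiny made-up example: two non-churners (0) and two churners (1),
# with the churn probability a model assigned to each.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every churner with every non-churner; ties count as half a win.
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos) * len(neg))

print("pairwise AUC:  ", pairwise_auc)
print("roc_auc_score: ", roc_auc_score(y_true, scores))
```

Of the four churner/non-churner pairs, the churner has the higher score in three, so both calculations give 0.75.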

Two numbers, two questions

Accuracy answers “how often is the model’s label correct?” while AUC answers “how well does the model rank customers by risk?” They can disagree, and on imbalanced problems AUC is often the more trustworthy summary. You will unpack accuracy, precision, recall, ROC curves, and thresholds in depth in the lessons that follow.

Putting It All Together

Here is the entire pipeline you just built, condensed into one runnable script you can adapt for any binary classification problem.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# 1. Load
df = pd.read_csv("customer_churn.csv")  # download: https://datatweets.com/datasets/customer_churn.csv

# 2. Features and target
feature_cols = [
    "tenure", "MonthlyCharges", "TotalCharges", "SeniorCitizen",
    "Contract_One year", "Contract_Two year",
    "InternetService_Fiber optic", "InternetService_No",
    "Partner_Yes", "Dependents_Yes", "PaperlessBilling_Yes",
]
X = df[feature_cols]
y = (df["Churn"] == "Yes").astype(int)

# 3. Split and scale (fit scaler on train only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. Train
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# 5. Evaluate
print(f"Accuracy: {model.score(X_test, y_test):.3f}")
print(f"AUC:      {roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]):.3f}")
# Output:
# Accuracy: 0.796
# AUC:      0.836

In about 25 lines you loaded real customer data, built a logistic regression classifier, and evaluated it honestly on unseen customers. That is a working churn model.

Practice Exercises

Try these before checking the hints.

Exercise 1: Plot the Sigmoid Yourself

Write a sigmoid(z) function, then plot it over the range $z = -8$ to $z = 8$. Confirm visually that the curve passes through 0.5 at $z = 0$ and flattens toward 0 and 1 at the ends.

import numpy as np
import matplotlib.pyplot as plt

# Your code here

Hint

Define sigmoid(z) as 1 / (1 + np.exp(-z)). Create your inputs with z = np.linspace(-8, 8, 200), then call plt.plot(z, sigmoid(z)). Add plt.axhline(0.5) and plt.axvline(0) to mark the midpoint, and plt.show() to display it.

Exercise 2: Check the Naive Baseline

Before trusting any model, compute the accuracy of always predicting the majority class (“no churn”) on the test set. This is the number your real model must beat.

# Reuse y_test from the lesson
# Your code here

Hint

The majority class is 0 (stay). The baseline accuracy is just the fraction of the test set that is 0, which you can compute with (y_test == 0).mean(). You should get about 0.734, comfortably below the model’s 0.796.

Exercise 3: Inspect the Predicted Probabilities

Use model.predict_proba(X_test)[:, 1] to get each test customer’s churn probability, then report the minimum, maximum, and mean. Are all the values inside the 0-to-1 range the sigmoid promises?

# Reuse the trained model and X_test from the lesson
# Your code here

Hint

Store the probabilities in proba = model.predict_proba(X_test)[:, 1], then print proba.min(), proba.max(), and proba.mean(). Every value will fall strictly between 0 and 1, and the mean will sit near the overall churn rate of about 0.27.

Summary

Congratulations! You have built and evaluated your first classification model and you understand the machinery that makes it work. Let’s review.

Key Concepts

Classification vs. Regression

Regression predicts a continuous number; classification predicts a category
A binary classifier chooses between exactly two classes, the focus of this module
Despite its name, logistic regression is a classification model; “regression” refers only to the linear combination inside it

Why Linear Regression Fails

A straight line is free to predict values below 0 and above 1, which cannot be probabilities
Binary targets also violate linear regression’s assumptions about errors and are sensitive to outliers
The fix is to wrap the linear score in a function that always outputs a value between 0 and 1

The Sigmoid Function

The sigmoid $h(z) = \frac{1}{1 + e^{-z}}$ squashes any real number into the interval (0, 1)
It outputs 0.5 at $z = 0$, approaches 1 for large positive $z$, and 0 for large negative $z$
The resulting S-curve is the natural shape for a probability

The Logistic Regression Model

First form a linear score $z = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
Then pass it through the sigmoid to get $P(y=1 \mid x) = h(z)$
Apply a threshold (default 0.5) to turn that probability into a 0/1 label
Training searches for the coefficients that best match the observed outcomes

scikit-learn Pattern

model = LogisticRegression(max_iter=1000) instantiates the model
model.fit(X_train, y_train) trains it
model.predict_proba(X)[:, 1] returns churn probabilities; model.predict(X) returns labels
model.score(X_test, y_test) gives accuracy; roc_auc_score gives AUC
On the churn data: test accuracy 0.796, AUC 0.836

Why This Matters

Classification is everywhere: deciding which email is spam, which transaction is fraud, which patient needs follow-up, which customer is about to walk away. Logistic regression is a model that anchors much of it. It is fast, it is interpretable, and it produces probabilities rather than bare guesses, which is why it remains a default first model and the baseline other classifiers are usually judged against.

The one idea to carry forward is the sigmoid. Linear regression failed not because the linear combination was wrong, but because nothing kept its output in bounds. Wrapping that same linear score in the sigmoid is the single change that converts a regression model into a probability-producing classifier. Once you see that, the rest of the module, interpreting coefficients, measuring performance, and tuning thresholds, becomes far easier to follow.

Next Steps

You now know what logistic regression is, why it works, and how to train one. In the next lesson, you will open up the trained model and learn to read its coefficients as odds ratios, turning the math into plain-language statements about what drives churn.

Continue to Lesson 2 - Interpreting the Regression Parameters

Learn to read logistic regression coefficients as odds ratios and explain what drives the prediction.

Back to Module Overview

Return to the Classification module overview.

Keep Building Your Skills

You have made the leap from predicting numbers to predicting categories, and you did it on a real business problem. The sigmoid function you learned today is not just a logistic regression detail; it reappears throughout machine learning, from neural networks to calibration. Keep the big picture in mind: a linear score, squashed into a probability, thresholded into a decision. Master that pattern here, and every classifier you meet later will feel familiar.
