Lesson 1 - Introduction to Logistic Regression
Welcome to Classification
This lesson introduces you to classification, the branch of supervised learning that predicts categories instead of numbers, and to the workhorse model that makes it possible: logistic regression. You will see why the linear regression you already know breaks down on a yes/no target, learn how the sigmoid function fixes it, and build a real churn classifier with scikit-learn.
By the end of this lesson, you will be able to:
- Tell the difference between classification and regression problems
- Explain why linear regression is the wrong tool for a 0/1 target
- Describe the sigmoid function and how it maps any number to a probability between 0 and 1
- Write down the logistic regression model in its standard form
- Build, train, and evaluate a LogisticRegression model on a real dataset
You should be comfortable with basic Python, pandas, and the machine learning workflow (features, targets, train/test split). If those terms are new, work through the Foundations module first. Let’s begin.
Classification vs. Regression
In the Foundations module you saw that the type of target decides the type of task. That single distinction is worth revisiting carefully, because everything in this module flows from it.
When the target is a number that can vary smoothly, like a house price, a temperature, or next month’s revenue, you are doing regression. The model outputs a continuous value, and you measure error by how far off that value is.
When the target is a category, like spam or not-spam, fraud or legitimate, churn or stay, you are doing classification. The model’s job is to assign each example to one of a fixed set of labels.
| Regression | Classification |
|---|---|
| Target is a number | Target is a category |
| Example: house price ($) | Example: spam vs. not spam |
| Example: tomorrow's temperature | Example: fraud vs. legitimate |
| Example: customer lifetime value | Example: will the customer churn? |
When a classifier chooses between exactly two categories, it is a binary classifier. Binary classification is the most common case in industry, and it is where we will spend this entire module. The car-loan default, the medical diagnosis, the customer who cancels their subscription: all of these are yes/no questions, and all of them are binary classification.
The name logistic regression is a small but famous source of confusion. Despite the word “regression,” logistic regression is a classification model. The “regression” part refers to the fact that, under the hood, it still builds a linear combination of the predictors, exactly like linear regression. It then wraps that linear combination in one extra function to turn it into a probability. Understanding that one extra function is the heart of this lesson.
Why start with logistic regression?
Logistic regression is the first classifier most practitioners learn, and many never stop using it. It trains in milliseconds, its predictions are easy to explain to non-technical stakeholders, and it produces well-calibrated probabilities rather than bare labels. Even when a fancier model wins on accuracy, logistic regression is the baseline everyone measures against.
The Problem: Predicting Customer Churn
Imagine you work at a telecom company. Every month, some fraction of your customers cancel their service. That is called churn, and it is expensive: winning a brand-new customer costs far more than keeping an existing one. If you could flag the customers most likely to leave before they leave, your retention team could step in with a discount or a better plan.
That is a classic binary classification problem. For each customer, the answer is one of two categories: churn (they leave) or stay. You have records of past customers, including who churned and who did not, so this is a supervised learning task with a labeled target.
You will work with the real Customer Churn dataset, which records account information for telecom customers along with whether each one churned.
import pandas as pd
# download: https://datatweets.com/datasets/customer_churn.csv
df = pd.read_csv("customer_churn.csv")
print("Shape:", df.shape)
# Output: Shape: (7032, 12)
The dataset has 7,032 rows and 12 columns. Each row is one customer. Most columns describe the account, such as how long the customer has been with the company and how much they pay, and one column, Churn, records the outcome you want to predict.
A Look at the Columns
You do not need to memorize every field, but here are the ones you will use as features, plus the target.
| Column | Type | Meaning |
|---|---|---|
| tenure | int | Months the customer has stayed with the company |
| MonthlyCharges | float | The amount billed each month |
| TotalCharges | float | The total amount billed over the lifetime of the account |
| SeniorCitizen | binary | Whether the customer is a senior citizen (1) or not (0) |
| Contract_One year, Contract_Two year | binary | Contract type (month-to-month is the omitted baseline) |
| InternetService_Fiber optic, InternetService_No | binary | Internet service type (DSL is the omitted baseline) |
| Partner_Yes, Dependents_Yes | binary | Whether the customer has a partner or dependents |
| PaperlessBilling_Yes | binary | Whether the customer uses paperless billing |
| Churn | category | Target: "Yes" if the customer churned, "No" otherwise |
The categorical fields like contract type and internet service have already been turned into 0/1 columns for you, a process called one-hot encoding that you will study later. For now, treat them as ready-to-use numeric features.
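To make the "omitted baseline" idea concrete, here is a minimal pure-Python sketch of how a three-level contract field could become the two 0/1 columns listed in the table. The helper name encode_contract and the raw level strings are assumptions made for illustration; the dataset already ships with these columns built.

```python
# Hypothetical helper: one-hot encode a contract type, dropping the
# "Month-to-month" level so that it becomes the implicit baseline.
LEVELS = ["One year", "Two year"]  # the baseline level gets no column


def encode_contract(contract):
    """Return the two indicator columns for a single contract value."""
    return {f"Contract_{level}": int(contract == level) for level in LEVELS}


for value in ["Month-to-month", "One year", "Two year"]:
    print(value, "->", encode_contract(value))
```

A month-to-month customer comes out as all zeros, which is why no separate column is needed for that level.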
How Balanced Is the Target?
Before modeling anything, always check how the target is distributed. If one class vastly outnumbers the other, a lazy model can score high accuracy while being useless.
print(df["Churn"].value_counts().to_dict())
# Output: {'No': 5163, 'Yes': 1869}
churn_rate = (df["Churn"] == "Yes").mean()
print("churn rate:", round(churn_rate, 3))
# Output: churn rate: 0.266
About 27 percent of customers churned (1,869 out of 7,032). That is imbalanced but not severely so: the positive class is well represented. A picture makes the split clear.
Keep the base rate in mind
Because only 27 percent of customers churn, a model that blindly predicts “no churn” for everyone would already be right 73 percent of the time. Remember that number: any useful model must beat it. This is exactly why you should never judge a classifier on accuracy alone, a point we return to in a later lesson.
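The 73 percent figure can be checked directly from the class counts printed earlier. The sketch below uses only those two counts; the variable names are illustrative.

```python
# Class counts reported above for the Churn column.
counts = {"No": 5163, "Yes": 1869}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print("majority class:", majority_class)                   # majority class: No
print("baseline accuracy:", round(baseline_accuracy, 3))   # baseline accuracy: 0.734
```

Any classifier you train on this data should be compared against that number, not against zero.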
Why Linear Regression Fails Here
Here is a natural question: you already know linear regression, so why not just use it? Encode the target as 1 for churn and 0 for stay, fit a line, and call any prediction above 0.5 a churn. It sounds reasonable, and it falls apart fast.
Linear regression fits the equation
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$
The right-hand side is a straight line (a plane, in higher dimensions). It is free to take any value from $-\infty$ to $+\infty$. But your target is only ever 0 or 1. There is nothing stopping the fitted line from predicting 1.8 for a customer with very high charges, or -0.4 for a customer with very low ones. What is a churn probability of 180 percent supposed to mean? Or negative 40 percent? Probabilities must live between 0 and 1, and a straight line refuses to stay in that range.
The picture below shows the problem directly. On the left, a straight line is fit to a 0/1 target: it shoots above 1 and dips below 0, and its slope forces a single rigid threshold that misclassifies points at both ends. On the right, the logistic curve hugs the data, flattening out near 0 and 1 exactly where it should.
There are deeper problems too. The errors in a 0/1 target are not the smooth, evenly-spread errors that linear regression assumes, and a single outlier far out on one axis can drag the whole line and shift the threshold. The fix is not to patch linear regression. The fix is to wrap its linear score in a function that guarantees an output between 0 and 1. That function is the sigmoid.
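A small experiment shows the out-of-range problem. The sketch below fits an ordinary least-squares line to a made-up 0/1 target with a single feature (the data is hypothetical, not the churn dataset) and prints the "probabilities" the line produces.

```python
# Toy data (hypothetical): one feature, 0/1 target.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form least-squares slope and intercept for a single feature.
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
intercept = y_mean - slope * x_mean


def predict(value):
    """Prediction of the fitted straight line at a given feature value."""
    return intercept + slope * value


for value in [0, 9, 12]:
    print(f"x = {value:>2} -> predicted 'probability' = {predict(value):.3f}")
# x =  0 -> predicted 'probability' = -0.182
# x =  9 -> predicted 'probability' = 1.182
# x = 12 -> predicted 'probability' = 1.636
```

Even inside the range of the training data, the line already leaves the interval from 0 to 1, and it only gets worse as you extrapolate.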
The Sigmoid Function
The sigmoid function, also called the logistic function, is the piece that turns logistic regression into a probability machine. It takes any real number, no matter how large or small, and squashes it into the open interval between 0 and 1. Its formula is:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Here $z$ is any real number, and $\sigma(z)$ is always strictly between 0 and 1. Let’s read the formula by checking its behavior at the extremes:
- When $z$ is a large positive number, $e^{-z}$ shrinks toward 0, so $\sigma(z)$ approaches $\frac{1}{1 + 0} = 1$.
- When $z$ is a large negative number, $e^{-z}$ blows up toward $+\infty$, so $\sigma(z)$ approaches $0$.
- When $z = 0$, we get $\sigma(0) = \frac{1}{1 + 1} = 0.5$, the exact midpoint.
That is the whole trick. No matter what you feed it, the output is a valid probability. The shape this produces is the famous S-curve.
You can compute it yourself in a couple of lines to confirm the behavior.
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
for z in [-6, -2, 0, 2, 6]:
    print(f"sigmoid({z:>2}) = {sigmoid(z):.4f}")
# Output:
# sigmoid(-6) = 0.0025
# sigmoid(-2) = 0.1192
# sigmoid( 0) = 0.5000
# sigmoid( 2) = 0.8808
# sigmoid( 6) = 0.9975
Notice how quickly the curve transitions. Around $z = 0$ it is steep and decisive, and far out on either side it flattens out, becoming nearly certain but never quite reaching 0 or 1. That gentle saturation is exactly what we want: the model can be very confident without ever claiming an impossible probability.
The midpoint is the decision boundary
Because $\sigma(z) = 0.5$ exactly when $z = 0$, the line $z = 0$ is the natural dividing line between the two classes. Inputs that push $z$ above 0 lean toward the positive class; inputs that push it below 0 lean toward the negative class. You will revisit and move this 0.5 cutoff deliberately in a later lesson on thresholds.
The Logistic Regression Model
Now you can assemble the full model. Logistic regression is built in two stages.
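First, exactly like linear regression, it forms a linear combination of the features. Call this intermediate score $z$:
$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$
Here $\beta_0$ is the intercept and each $\beta_j$ is the weight, or coefficient, attached to feature $x_j$. This score can be any real number.
Second, it feeds $z$ through the sigmoid to get a probability:
$$P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}$$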
Read that left side carefully: it is the model’s estimated probability that the example belongs to the positive class (here, that the customer churns), given the feature values. To turn that probability into an actual label, you apply a threshold, usually 0.5: predict churn if the probability is at least 0.5, otherwise predict stay.
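The two stages and the threshold fit in a few lines of plain Python. In this sketch the intercept and weights are invented for illustration (they are not fitted values from the churn data), and the feature values are assumed to be standardized.

```python
import math

# Hypothetical coefficients for two standardized features (not fitted values).
intercept = -1.0
weights = {"tenure": -0.8, "MonthlyCharges": 0.6}


def churn_probability(features):
    """Stage 1: linear score z. Stage 2: sigmoid of z."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))


def predict_label(features, threshold=0.5):
    """Turn the probability into a 0/1 label with a cutoff."""
    return int(churn_probability(features) >= threshold)


new_customer = {"tenure": -1.5, "MonthlyCharges": 1.0}    # short tenure, high bill
loyal_customer = {"tenure": 1.5, "MonthlyCharges": -0.5}  # long tenure, low bill

for name, customer in [("new", new_customer), ("loyal", loyal_customer)]:
    p = churn_probability(customer)
    print(f"{name}: P(churn) = {p:.3f}, label = {predict_label(customer)}")
# new: P(churn) = 0.690, label = 1
# loyal: P(churn) = 0.076, label = 0
```

The new customer's score is 0.8, which lands above the midpoint; the loyal customer's score is -2.5, which lands well below it.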
The values the model has to learn are the coefficients $\beta_0, \beta_1, \dots, \beta_p$. Training means searching for the coefficients that make the predicted probabilities match the observed outcomes as closely as possible. Linear regression has a tidy formula for its best coefficients; logistic regression does not, so it finds them through iterative optimization instead. The good news is that scikit-learn handles all of that for you behind a single .fit() call. What the coefficients mean once you have them, and how to read them as odds, is the subject of the next lesson.
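To give a feel for what "iterative optimization" means, here is a rough sketch that fits a one-feature logistic regression with plain gradient descent on the average log loss. This is one simple optimizer chosen for illustration, on made-up data; scikit-learn uses a more sophisticated solver and applies regularization by default, so its coefficients would differ.

```python
import math

# Toy data (hypothetical): one feature, 0/1 target with some overlap.
x = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
y = [0, 0, 0, 1, 0, 1, 1, 1, 1]


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def log_loss(b0, b1):
    """Average negative log-likelihood under the current coefficients."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        total += -(yi * math.log(p) + (1 - yi) * math.log(1 - p))
    return total / len(x)


b0, b1 = 0.0, 0.0      # start with no information
learning_rate = 0.5    # step size, picked by hand for this toy problem
initial_loss = log_loss(b0, b1)

for step in range(500):
    # Gradient of the average log loss with respect to each coefficient.
    errors = [sigmoid(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
    grad_b0 = sum(errors) / len(x)
    grad_b1 = sum(e * xi for e, xi in zip(errors, x)) / len(x)
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

final_loss = log_loss(b0, b1)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
print(f"learned intercept = {b0:.2f}, weight = {b1:.2f}")
```

Each pass nudges the coefficients in the direction that makes the predicted probabilities agree better with the labels, which is the "search" the paragraph above describes.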
Building Your First Logistic Regression
Time to put it together. The workflow mirrors what you already know: separate features from target, split into train and test sets, scale, fit, and evaluate.
Step 1: Features and Target
Separate the predictor columns (X) from the outcome column (y), converting the text target to 0/1.
feature_cols = [
    "tenure", "MonthlyCharges", "TotalCharges", "SeniorCitizen",
    "Contract_One year", "Contract_Two year",
    "InternetService_Fiber optic", "InternetService_No",
    "Partner_Yes", "Dependents_Yes", "PaperlessBilling_Yes",
]
X = df[feature_cols]
y = (df["Churn"] == "Yes").astype(int)  # 1 = churned, 0 = stayed
print("X shape:", X.shape)
print("Churners:", int(y.sum()))
# Output:
# X shape: (7032, 11)
# Churners: 1869
Step 2: Split and Scale
Hold out 25 percent of the data for testing, and use stratify=y so the churn rate is essentially the same in both sets. Then standardize the features (mean 0, standard deviation 1) so the model treats them on a level playing field.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y,
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on TRAIN only
X_test = scaler.transform(X_test)        # apply same transform to TEST
print("Train rows:", X_train.shape[0])
print("Test rows: ", X_test.shape[0])
# Output:
# Train rows: 5274
# Test rows: 1758
Fit the scaler on training data only
Always call fit_transform on the training set and only transform on the test set. If you fit the scaler on the full dataset, information about the test rows leaks into training and your score becomes dishonestly optimistic. The same rule applies to the model itself: it must never see the test set until the final evaluation.
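The fit-then-transform split is easier to trust once you see it by hand. The sketch below standardizes a single made-up column using only the standard library; the numbers are hypothetical, and the two steps mirror in spirit what the scaler's fit and transform do.

```python
import statistics

# Hypothetical monthly charges, already split into train and test.
train_charges = [20.0, 50.0, 80.0, 110.0]
test_charges = [65.0, 140.0]

# "fit": learn the mean and population standard deviation from TRAIN only.
train_mean = statistics.fmean(train_charges)
train_std = statistics.pstdev(train_charges)


def standardize(values):
    # "transform": reuse the training statistics on whatever data comes in.
    return [(v - train_mean) / train_std for v in values]


train_scaled = standardize(train_charges)
test_scaled = standardize(test_charges)

print("train mean/std:", train_mean, round(train_std, 3))
print("scaled train:", [round(v, 3) for v in train_scaled])
print("scaled test: ", [round(v, 3) for v in test_scaled])
```

The test values are shifted and scaled by numbers the test set had no part in choosing, so nothing about the test rows leaks into the preprocessing.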
Step 3: Fit the Model
Building and training a logistic regression takes two lines: instantiate, then fit. The max_iter argument simply gives the iterative optimizer enough steps to settle on its coefficients.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
print("Model trained!")
# Output: Model trained!
That single .fit() call ran the optimization that searches for the best coefficients, the ones that plug into the sigmoid to produce probabilities closest to the actual churn outcomes.
Step 4: Predict Probabilities and Labels
Logistic regression gives you more than a bare label. The .predict_proba() method returns the actual probability for each class, which is one of the model’s biggest selling points.
# Probability of churn for the first five test customers
proba = model.predict_proba(X_test)[:, 1] # column 1 = P(churn)
print("Churn probabilities:", proba[:5].round(3))
# Output: Churn probabilities: [0.03 0.625 0.118 0.444 0.706]
# Apply the default 0.5 threshold to get labels
preds = model.predict(X_test)
print("Predicted labels: ", preds[:5])
print("Actual labels: ", y_test.values[:5])
# Output:
# Predicted labels: [0 1 0 0 1]
# Actual labels: [0 1 0 0 1]
Each probability is the sigmoid output for that customer. The default .predict() simply rounds at 0.5: probabilities at or above 0.5 become 1, the rest become 0.
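Because the label is just the probability compared against a cutoff, you can reproduce it by hand and see what a different cutoff would do. The sketch below reuses the five probabilities printed above; the helper name to_labels is an illustrative assumption, not a scikit-learn function.

```python
# The five churn probabilities printed above.
proba = [0.03, 0.625, 0.118, 0.444, 0.706]


def to_labels(probabilities, threshold=0.5):
    """Label a customer as churn (1) when the probability reaches the cutoff."""
    return [int(p >= threshold) for p in probabilities]


print("threshold 0.5:", to_labels(proba))       # [0, 1, 0, 0, 1]
print("threshold 0.4:", to_labels(proba, 0.4))  # [0, 1, 0, 1, 1]
print("threshold 0.7:", to_labels(proba, 0.7))  # [0, 0, 0, 0, 1]
```

Lowering the cutoff flags more customers as likely churners and raising it flags fewer, which is the trade-off the later lesson on thresholds explores.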
Evaluating the Model
Now measure how well the model does on the held-out test set, the data it never saw during training.
The simplest metric is accuracy: the fraction of test customers classified correctly. scikit-learn computes it with .score().
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
# Output: Test accuracy: 0.796
An accuracy of 0.796 means the model classified about 80 percent of the test customers correctly. Recall the base rate from earlier: always guessing “no churn” would score 0.73, so the model is genuinely learning something, beating the naive baseline by a meaningful margin.
Accuracy is only part of the story, though, especially with an imbalanced target. A more informative single number is the AUC (area under the ROC curve), which measures how well the model ranks customers by churn risk across every possible threshold, not just at 0.5. AUC ranges from 0 to 1: a value of 0.5 means the ranking is no better than a coin flip, 1.0 means perfect ranking, and anything below 0.5 means the model ranks customers worse than chance.
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
# Output: Test AUC: 0.836
An AUC of 0.836 is solid. It means that if you pick a random churner and a random non-churner, the model assigns the churner a higher churn probability about 84 percent of the time. That ranking ability is exactly what a retention team needs: it can sort customers by risk and focus on the most likely to leave.
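The "random churner versus random non-churner" reading can be computed literally by comparing every such pair. The sketch below does that on a handful of made-up scores (hypothetical values, not model output); the helper name pairwise_auc is an assumption for illustration.

```python
# Hypothetical predicted churn probabilities for a few test customers.
churner_scores = [0.81, 0.62, 0.35]        # customers who actually churned
stayer_scores = [0.10, 0.25, 0.40, 0.55]   # customers who actually stayed


def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (churner, stayer) pairs ranked correctly; ties count half."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


print(f"AUC = {pairwise_auc(churner_scores, stayer_scores):.3f}")
# AUC = 0.833
```

Ten of the twelve pairs are ordered correctly here. The double loop is fine for a toy example, but on real data a library routine such as roc_auc_score is the practical choice.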
Two numbers, two questions
Accuracy answers “how often is the model’s label correct?” while AUC answers “how well does the model rank customers by risk?” They can disagree, and on imbalanced problems AUC is often the more trustworthy summary. You will unpack accuracy, precision, recall, ROC curves, and thresholds in depth in the lessons that follow.
Putting It All Together
Here is the entire pipeline you just built, condensed into one runnable script you can adapt for any binary classification problem.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
# 1. Load
df = pd.read_csv("customer_churn.csv") # download: https://datatweets.com/datasets/customer_churn.csv
# 2. Features and target
feature_cols = [
    "tenure", "MonthlyCharges", "TotalCharges", "SeniorCitizen",
    "Contract_One year", "Contract_Two year",
    "InternetService_Fiber optic", "InternetService_No",
    "Partner_Yes", "Dependents_Yes", "PaperlessBilling_Yes",
]
X = df[feature_cols]
y = (df["Churn"] == "Yes").astype(int)
# 3. Split and scale (fit scaler on train only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# 4. Train
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
# 5. Evaluate
print(f"Accuracy: {model.score(X_test, y_test):.3f}")
print(f"AUC: {roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]):.3f}")
# Output:
# Accuracy: 0.796
# AUC: 0.836
In about 25 lines you loaded real customer data, built a logistic regression classifier, and evaluated it honestly on unseen customers. That is a working churn model.
Practice Exercises
Try these before checking the hints.
Exercise 1: Plot the Sigmoid Yourself
Write a sigmoid(z) function, then plot it over the range $z = -8$ to $z = 8$. Confirm visually that the curve passes through 0.5 at $z = 0$ and flattens toward 0 and 1 at the ends.
import numpy as np
import matplotlib.pyplot as plt
# Your code here
Hint
Define sigmoid(z) as 1 / (1 + np.exp(-z)). Create your inputs with z = np.linspace(-8, 8, 200), then call plt.plot(z, sigmoid(z)). Add plt.axhline(0.5) and plt.axvline(0) to mark the midpoint, and plt.show() to display it.
Exercise 2: Check the Naive Baseline
Before trusting any model, compute the accuracy of always predicting the majority class (“no churn”) on the test set. This is the number your real model must beat.
# Reuse y_test from the lesson
# Your code here
Hint
The majority class is 0 (stay). The baseline accuracy is just the fraction of the test set that is 0, which you can compute with (y_test == 0).mean(). You should get about 0.734, comfortably below the model’s 0.796.
Exercise 3: Inspect the Predicted Probabilities
Use model.predict_proba(X_test)[:, 1] to get each test customer’s churn probability, then report the minimum, maximum, and mean. Are all the values inside the 0-to-1 range the sigmoid promises?
# Reuse the trained model and X_test from the lesson
# Your code here
Hint
Store the probabilities in proba = model.predict_proba(X_test)[:, 1], then print proba.min(), proba.max(), and proba.mean(). Every value will fall strictly between 0 and 1, and the mean will sit near the overall churn rate of about 0.27.
Summary
Congratulations! You have built and evaluated your first classification model and you understand the machinery that makes it work. Let’s review.
Key Concepts
Classification vs. Regression
- Regression predicts a continuous number; classification predicts a category
- A binary classifier chooses between exactly two classes, the focus of this module
- Despite its name, logistic regression is a classification model; “regression” refers only to the linear combination inside it
Why Linear Regression Fails
- A straight line is free to predict values below 0 and above 1, which cannot be probabilities
- Binary targets also violate linear regression’s assumptions about errors and are sensitive to outliers
- The fix is to wrap the linear score in a function that always outputs a value between 0 and 1
The Sigmoid Function
- The sigmoid squashes any real number into the interval (0, 1)
- It outputs 0.5 at $z = 0$, approaches 1 for large positive $z$, and approaches 0 for large negative $z$
- The resulting S-curve is the natural shape for a probability
The Logistic Regression Model
- First form a linear score $z = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
- Then pass it through the sigmoid to get $P(y = 1 \mid x) = \sigma(z)$
- Apply a threshold (default 0.5) to turn that probability into a 0/1 label
- Training searches for the coefficients that best match the observed outcomes
scikit-learn Pattern
- model = LogisticRegression(max_iter=1000) instantiates the model
- model.fit(X_train, y_train) trains it
- model.predict_proba(X)[:, 1] returns churn probabilities; model.predict(X) returns labels
- model.score(X_test, y_test) gives accuracy; roc_auc_score gives AUC
- On the churn data: test accuracy 0.796, AUC 0.836
Why This Matters
Classification is everywhere: deciding which email is spam, which transaction is fraud, which patient needs follow-up, which customer is about to walk away. Logistic regression is the model that anchors all of it. It is fast, it is interpretable, and it produces honest probabilities rather than bare guesses, which is why it remains the default first model and the baseline every other classifier is judged against.
The one idea to carry forward is the sigmoid. Linear regression failed not because the linear combination was wrong, but because nothing kept its output in bounds. Wrapping that same linear score in the sigmoid is the single change that converts a regression model into a probability-producing classifier. Once you see that, the rest of the module, interpreting coefficients, measuring performance, and tuning thresholds, becomes far easier to follow.
Next Steps
You now know what logistic regression is, why it works, and how to train one. In the next lesson, you will open up the trained model and learn to read its coefficients as odds ratios, turning the math into plain-language statements about what drives churn.
Continue to Lesson 2 - Interpreting the Regression Parameters
Learn to read logistic regression coefficients as odds ratios and explain what drives the prediction.
Keep Building Your Skills
You have made the leap from predicting numbers to predicting categories, and you did it on a real business problem. The sigmoid function you learned today is not just a logistic regression detail; it reappears throughout machine learning, from neural networks to calibration. Keep the big picture in mind: a linear score, squashed into a probability, thresholded into a decision. Master that pattern here, and every classifier you meet later will feel familiar.