Lesson 1 - Introduction to Neural Networks
Welcome to Deep Learning
This lesson introduces you to neural networks, the engine behind modern deep learning. You will learn where the idea came from, what an artificial neuron actually computes, how neurons stack into layers to form a network, and why a small piece called an activation function is what makes the whole thing powerful. By the end, you will have a clear mental model of how a network turns inputs into a prediction.
By the end of this lesson, you will be able to:
- Explain what a neural network is and the biological intuition behind it
- Describe what a single artificial neuron computes from its inputs, weights, and bias
- Identify the input, hidden, and output layers of a feedforward network
- Explain why activation functions add the nonlinearity that makes networks powerful
- Describe the forward pass and the high-level idea of learning by adjusting weights
No deep learning experience is needed. You should be comfortable with basic Python and pandas, and you should have seen the general machine learning workflow before. Let’s begin.
What Is a Neural Network?
Suppose you work at a clinic and you want to predict, from a few routine health measurements, whether a patient is likely to have diabetes. You have measurements like glucose level, blood pressure, body mass index, and age. The relationship between these numbers and the diagnosis is real, but it is tangled: glucose matters more when BMI is also high, age interacts with blood pressure, and no single measurement decides the outcome on its own.
You could try to write the rules by hand, but the interactions quickly outgrow what any person can untangle. A neural network takes a different approach. Instead of you specifying the rules, the network discovers them by adjusting a large set of internal numbers until its predictions match the known answers in your data.
A neural network is, at heart, a function. It takes some numbers in (your measurements), passes them through a series of simple mathematical steps, and produces a number out (a prediction). What makes it special is that the steps contain thousands or millions of adjustable values, and the network learns good values for them automatically.
Deep Learning in the Bigger Picture
You will hear three terms used together, and it helps to see how they nest:
+---------------------------------------------------+
|  Artificial Intelligence                          |
|  +---------------------------------------------+  |
|  |  Machine Learning                           |  |
|  |  +---------------------------------------+  |  |
|  |  |  Deep Learning                        |  |  |
|  |  |  (neural networks with many layers)   |  |  |
|  |  +---------------------------------------+  |  |
|  +---------------------------------------------+  |
+---------------------------------------------------+

Artificial intelligence is the broad goal of getting machines to do things that seem intelligent. Machine learning is the subset where machines learn patterns from data rather than following hand-written rules. Deep learning is the subset of machine learning that uses neural networks with multiple layers. The word “deep” simply refers to having many layers stacked one after another.
The systems you have probably heard about, from image recognition to the language models that generate text, are built from the neural network ideas you will learn in this module.
The Biological Inspiration
The name “neural network” comes from the brain. A biological brain is made of cells called neurons. Each neuron receives signals from other neurons through connections, combines those signals, and if the combined signal is strong enough, it “fires” and passes a signal along to the next neurons. Learning, in a brain, happens partly by strengthening or weakening these connections over time.
Artificial neural networks borrow this picture in a loose, simplified way. An artificial neuron receives several numbers, combines them, and produces an output that feeds into the next neurons. The strength of each connection is a number the network can adjust. That is the whole analogy, and you should hold it loosely. Artificial neurons are far simpler than real ones, and the goal was never to simulate a brain, just to borrow a useful idea: many simple units, richly connected, can learn complex behavior together.
The analogy is a starting point, not a blueprint
Real neurons are vastly more complex than artificial ones, and the brain does not learn the way our networks do. The biological story is helpful for building intuition, but once you understand the math, you will think of a neural network as what it really is: a flexible function with many adjustable numbers. Do not lean on the brain metaphor too hard.
The Artificial Neuron
Everything in a neural network is built from one simple unit, so it is worth understanding it on its own before stacking many together.
A single artificial neuron does three things:
- It receives several input numbers.
- It multiplies each input by a weight, adds the results together, and adds one more number called a bias.
- It passes that sum through an activation function to produce its output.
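The first two steps amount to a weighted sum. If a neuron receives inputs $x_1, x_2, \dots, x_n$ with weights $w_1, w_2, \dots, w_n$ and a bias $b$, the weighted sum is:

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$

Then the neuron applies an activation function $f$ to that sum to get its final output:

$$\text{output} = f(z)$$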
That is the entire computation of one neuron. Let’s make each piece concrete.
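The computation above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `neuron_output` and the input, weight, and bias values are made up, and `math.tanh` (one of the activation functions covered later in this lesson) stands in for the activation.

```python
import math

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical values chosen only for illustration
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -0.25, 0.1]
bias = 0.2

# The weighted sum on its own: 0.5*1 + (-0.25)*2 + 0.1*3 + 0.2 = 0.5
z = sum(w * x for w, x in zip(weights, inputs)) + bias
output = neuron_output(inputs, weights, bias, math.tanh)

print("z =", round(z, 4))
print("output =", round(output, 4))
```

Swapping `math.tanh` for a different function changes only the last step. The weighted sum stays the same.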
Weights: How Much Each Input Matters
A weight is a number attached to each input that controls how strongly that input influences the neuron. A large positive weight means “this input pushes the output up.” A large negative weight means “this input pushes the output down.” A weight near zero means “this input barely matters.”
In the diabetes example, a neuron might learn a large positive weight on glucose (high glucose pushes toward a positive diagnosis) and a smaller weight on, say, blood pressure. Crucially, you do not set these weights yourself. The network discovers them during training.
Bias: Shifting the Threshold
The bias is a single extra number added to the weighted sum, independent of any input. It lets the neuron shift its output up or down regardless of the inputs. You can think of it as setting the neuron’s baseline, or how easy it is for the neuron to “activate.” Without a bias, every neuron’s weighted sum would be forced to zero whenever all its inputs are zero, so its output would be pinned to whatever the activation function returns at zero. That is an unnecessary restriction, and the bias removes it.
x1 ---- w1 ----\
                \
x2 ---- w2 ------> z = (w1*x1 + w2*x2 + w3*x3) + b ---> f(z) ---> output
                /
x3 ---- w3 ----/
                   b (bias)

Together, the weights and the bias are the neuron’s parameters, the numbers that get adjusted during learning. A network’s knowledge lives entirely in its parameters.
Parameters versus inputs
Keep two ideas separate. The inputs change with every example you feed in (each patient has different measurements). The parameters, meaning the weights and biases, stay fixed while making a prediction and only change during training. When people say a model has “100 million parameters,” they are counting all its weights and biases.
Activation Functions: Adding Nonlinearity
So far a neuron computes a weighted sum and then applies an activation function. Why is that last step there? Could we not just stop at the weighted sum? This question is more important than it looks, and the answer is the reason neural networks work at all.
A weighted sum is a linear operation. If you stack many neurons that only ever compute weighted sums, the whole network, no matter how many layers it has, still computes one big weighted sum. In other words, stacking linear steps gives you another linear step, and a single linear function can only draw straight lines (or flat planes) through your data. It could never capture the curved, interacting relationship between glucose, BMI, and diabetes.
The activation function breaks this limitation. It is a small nonlinear function applied to each neuron’s weighted sum. By inserting a nonlinearity between the layers, the network gains the ability to bend and fold its decision boundaries, which is what lets it represent genuinely complex patterns. Without activation functions, deep networks would be pointless.
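A small numeric check can make the "stacking linear steps stays linear" argument tangible. The sketch below uses two single-neuron "layers" with made-up parameters (`w1`, `b1`, `w2`, `b2` are hypothetical). It shows that the stacked computation equals one linear step with merged parameters, and that inserting a ReLU between the steps breaks that equivalence.

```python
def linear(x, w, b):
    """One linear step: multiply by a weight and add a bias."""
    return w * x + b

def relu(z):
    """Pass positive values through; clamp negatives to 0."""
    return max(0.0, z)

# Hypothetical parameters for two single-neuron "layers"
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

# w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_merged = w2 * w1
b_merged = w2 * b1 + b2

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
stacked = [linear(linear(x, w1, b1), w2, b2) for x in xs]
single = [linear(x, w_merged, b_merged) for x in xs]
print(stacked == single)  # True: two linear steps collapse into one

# With a ReLU in between, the outputs no longer lie on a single straight line
with_relu = [linear(relu(linear(x, w1, b1)), w2, b2) for x in xs]
print(with_relu)  # [0.5, 0.5, -2.5, -8.5, -14.5]
```

The ReLU version is flat for the two most negative inputs and then slopes downward. That bend is something no single linear function can produce.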
Here are the activation functions you will meet most often.
Sigmoid
The sigmoid function squashes any input into the range between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Large positive inputs come out near 1, large negative inputs come out near 0, and an input of 0 comes out at exactly 0.5. Because its output looks like a probability, sigmoid is a natural choice for the final neuron of a binary classifier, like predicting diabetes versus no diabetes. It was also the standard choice inside hidden layers for many years, though it has fallen out of favor there for reasons later lessons explore.
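As a quick check of that behavior, here is a minimal sigmoid written with Python's standard `math` module. The sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    """Sigmoid: 1 / (1 + e^(-z)), always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

for z in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    print(z, round(sigmoid(z), 4))
```

This direct formula is fine for moderate inputs. For very large negative inputs `math.exp(-z)` can overflow, which is one reason numerical libraries use more careful implementations.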
Tanh
The tanh (hyperbolic tangent) function is closely related to sigmoid but squashes inputs into the range between -1 and 1, centered on 0:

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Being centered on zero often makes tanh behave a little better than sigmoid inside hidden layers, since its outputs are balanced around zero rather than always positive.
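One way to see how closely the two functions are related: tanh is a sigmoid that has been stretched and shifted, via the identity tanh(z) = 2 * sigmoid(2z) - 1. The sketch below checks this numerically at a few arbitrary points, using `math.tanh` from the standard library.

```python
import math

def sigmoid(z):
    """Sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

points = [-3.0, -1.0, 0.0, 1.0, 3.0]
for z in points:
    rescaled = 2.0 * sigmoid(2.0 * z) - 1.0  # sigmoid stretched to the range (-1, 1)
    print(z, round(math.tanh(z), 4), round(rescaled, 4))
```

The two printed columns should agree at every point, and both are 0 at z = 0, which is the zero-centering the text describes.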
ReLU
The rectified linear unit, or ReLU, is the most widely used activation in modern networks, and it is refreshingly simple:

$$\text{ReLU}(z) = \max(0, z)$$
It returns the input unchanged when the input is positive, and returns 0 when the input is negative. Despite being almost trivial, ReLU trains quickly and works extremely well in practice, which is why it became the default choice for hidden layers.
Leaky ReLU
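ReLU has one quirk: any neuron whose weighted sum is negative outputs exactly 0, and such a neuron can sometimes get “stuck” and stop contributing. Leaky ReLU fixes this by letting a small slope through for negative inputs instead of flattening them to zero:

$$\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{otherwise} \end{cases}$$

Here $\alpha$ is a small positive constant; 0.01 is a common choice.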
The negative side has a gentle slope rather than being completely flat, which keeps those neurons alive.
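Both functions fit in a line or two of Python. In this sketch the slope `alpha=0.01` is an assumed, commonly used value rather than something fixed by the definition, and the sample inputs are arbitrary.

```python
def relu(z):
    """Pass positive inputs through unchanged; clamp negatives to 0."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope alpha (assumed 0.01)."""
    return z if z > 0 else alpha * z

for z in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    print(z, relu(z), leaky_relu(z))
```

For positive inputs the two agree exactly. They differ only on the negative side, where ReLU returns 0 and leaky ReLU returns a small negative number.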
Which one should you use?
A good default is ReLU (or leaky ReLU) for the hidden layers, and sigmoid for the output layer when you are doing binary classification. You do not need to memorize the formulas right now. What matters is the idea: an activation function is a nonlinear step that gives the network its expressive power.
Stacking Neurons into Layers
A single neuron can only do so much. The power of neural networks comes from organizing many neurons into layers and connecting the layers in sequence. A network arranged this way, where information flows in one direction from inputs to outputs, is called a feedforward network.
There are three kinds of layers.
The Input Layer
The input layer is not made of computing neurons at all. It simply holds the raw feature values for one example. If each patient is described by eight measurements, the input layer has eight slots, one per feature. Its only job is to pass those numbers into the first hidden layer.
Hidden Layers
The hidden layers sit between input and output and do the real work. Each neuron in a hidden layer takes every value from the previous layer, computes its own weighted sum plus bias, applies an activation function, and passes its output forward. Stacking several hidden layers lets the network build up increasingly abstract combinations of the inputs. They are called “hidden” because you never observe their values directly; they are internal to the network.
The number of hidden layers and the number of neurons in each are design choices you make. More layers and more neurons give the network more capacity to fit complex patterns, but also more parameters to train and more risk of overfitting, a tension later lessons address.
The Output Layer
The output layer produces the final prediction. For binary classification like the diabetes problem, it is a single neuron with a sigmoid activation, so its output is a number between 0 and 1 that you read as the probability of the positive class. For predicting a continuous number, the output layer might be a single neuron with no activation at all.
Fully Connected Layers
In the most basic feedforward network, every neuron in one layer connects to every neuron in the next layer. Such a layer is called fully connected (or dense). Each of those connections has its own weight, so even a modest network can have a surprising number of parameters. This dense, fully connected design is exactly the kind of network you will build by hand in the lessons ahead.
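To get a feel for how quickly parameters add up, the count for a fully connected network can be computed from the layer widths alone. The helper `count_parameters` and the example architecture below are hypothetical, chosen only for illustration.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected feedforward network.

    layer_sizes lists the width of each layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = n_in * n_out  # one weight per connection between the two layers
        biases = n_out          # one bias per neuron in the receiving layer
        total += weights + biases
    return total

# Hypothetical architecture: 8 inputs, hidden layers of 16 and 8 neurons, 1 output
print(count_parameters([8, 16, 8, 1]))  # (128 + 16) + (128 + 8) + (8 + 1) = 289
```

The input layer contributes no parameters of its own, since it only holds feature values.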
The Forward Pass
Now you can describe how a network turns inputs into a prediction. The process of pushing data through the network from the input layer to the output layer is called the forward pass (or forward propagation).
It works layer by layer:
- The input layer holds the feature values for one example.
- Each neuron in the first hidden layer computes its weighted sum plus bias, applies its activation, and produces an output.
- Those outputs become the inputs to the next layer, which repeats the same computation.
- This continues layer by layer until the output layer produces the final prediction.
  inputs        hidden 1      hidden 2      output
  -------       ---------     ---------     --------
  glucose  -->  [neuron] -->  [neuron] -->
  bmi      -->  [neuron] -->  [neuron] -->  [neuron] --> P(diabetes)
  age      -->  [neuron] -->  [neuron] -->
  ...      -->  [neuron] -->  [neuron] -->

  weighted sum + bias + activation, repeated at every layer

The forward pass is just the neuron computation you already know, applied at every neuron, one layer after another. Given a fixed set of weights and biases, the forward pass is completely deterministic: the same input always produces the same output.
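The layer-by-layer procedure can be sketched in plain Python. Everything here is illustrative: the helper names `dense` and `forward`, the tiny 2-2-1 architecture, and the hand-picked weights and biases are assumptions, not the network you will build later in the module.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each row of weights belongs to one neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Push the inputs through every layer in order and return the final outputs."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense(values, weights, biases, activation)
    return values

# Hypothetical hand-picked parameters: 2 inputs -> 2 hidden neurons -> 1 output
layers = [
    ([[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu),  # hidden layer
    ([[1.0, -2.0]], [0.5], sigmoid),                 # output layer
]

prediction = forward([2.0, 1.0], layers)[0]
print(round(prediction, 4))  # 0.0293
```

Tracing it by hand: the hidden neurons produce 0.0 and 2.0, the output neuron's weighted sum is -3.5, and the sigmoid of -3.5 is about 0.0293. Running the same input again gives the same number, which is the determinism described above.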
A fresh network predicts nonsense
When a network is first created, its weights are typically set to small random numbers (biases often simply start at zero). Run a forward pass at that point and the prediction is essentially meaningless. The network only becomes useful after learning, which adjusts those parameters so the forward pass produces good predictions. The forward pass is the machinery; learning is what tunes the machinery.
How a Network Learns
You now know that a network’s behavior is determined entirely by its weights and biases, and that a fresh network has random, useless ones. So how do they become good?
The high-level loop is the same idea behind most machine learning, applied to a network’s parameters:
- Forward pass. Feed a training example through the network and get its prediction.
- Measure the error. Compare the prediction to the known correct answer using a loss function, which produces a single number measuring how wrong the prediction was. A large loss means a bad prediction; a small loss means a good one.
- Adjust the parameters. Nudge every weight and bias a little in the direction that would have made the loss smaller.
- Repeat. Do this across many examples, many times, and the loss gradually drops as the network’s predictions improve.
That third step, figuring out exactly how to nudge each parameter, is the heart of training, and it is what the next several lessons build up carefully: the math behind it (Lesson 2), the optimization method called gradient descent (Lesson 3), and the algorithm called backpropagation that makes it efficient (Lesson 4). For now, hold on to the intuition: learning means repeatedly adjusting the weights and biases to reduce the error.
This is the same generalization goal you have seen before. A network that merely memorizes its training data is useless. A good network learns parameters that produce accurate predictions on new patients it has never seen.
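The predict, measure, nudge, repeat loop can be shown with a deliberately crude toy. The sketch below is not gradient descent or backpropagation. It fits a one-weight model to made-up data by trying a small nudge in each direction and keeping whichever lowers the loss. The data, the step size, and the mean squared error loss are all assumptions chosen for illustration.

```python
# Toy data following y = 3 * x, so the ideal weight is 3.0 (made up for illustration)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def loss(w):
    """Mean squared error of the one-weight model: prediction = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w = 0.0     # start from an uninformed guess
step = 0.1  # size of each nudge
history = [loss(w)]

for _ in range(100):
    # Try a small nudge in each direction and keep whichever gives the lowest loss
    candidates = [w - step, w, w + step]
    w = min(candidates, key=loss)
    history.append(loss(w))

print("learned w:", round(w, 2))
print("loss before:", round(history[0], 2), "after:", round(history[-1], 4))
```

Trial-and-error nudging like this would be hopelessly slow for a network with thousands of parameters, which is why real training computes the right direction for every parameter at once.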
The Running Example: The Pima Diabetes Dataset
Throughout this module you will build neural networks on one real dataset, so it is worth meeting it now. The Pima Indians Diabetes dataset records health measurements for adult female patients, along with whether each was later diagnosed with diabetes. It is a classic binary classification problem and a perfect fit for the network you will build.
You can download the dataset and load it with pandas.
import pandas as pd
# download: https://datatweets.com/datasets/diabetes.csv
df = pd.read_csv("diabetes.csv")
print("Shape:", df.shape)
# Output: Shape: (768, 9)

The dataset has 768 rows and 9 columns. Each row is one patient. Eight of the columns are health measurements (the features), and the final column, Outcome, records the diagnosis (the target).
A Data Dictionary
Here are the columns you will work with:
| Column | Meaning |
|---|---|
| Pregnancies | Number of times pregnant |
| Glucose | Plasma glucose concentration |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skinfold thickness (mm) |
| Insulin | 2-hour serum insulin (mu U/ml) |
| BMI | Body mass index (kg/m²) |
| DiabetesPedigreeFunction | A score summarizing family history of diabetes |
| Age | Age in years |
| Outcome | Target: 1 if diagnosed with diabetes, 0 otherwise |
The eight feature columns become the inputs to your network’s input layer, and the single Outcome column is what the output neuron will learn to predict.
Exploring the Target
Before building anything, it is good practice to see how the two outcomes are distributed.
print(df["Outcome"].value_counts())
# Output:
# Outcome
# 0 500
# 1 268
# Name: count, dtype: int64
print("positive rate:", round(df["Outcome"].mean(), 3))
# Output: positive rate: 0.349

Out of 768 patients, 500 do not have diabetes (outcome 0) and 268 do (outcome 1). That is about 35 percent positive cases. The classes are somewhat imbalanced but not extremely so, which is useful to keep in mind when you evaluate a model later: a lazy predictor that always guesses “no diabetes” would already be right about 65 percent of the time, so your network needs to do meaningfully better than that.
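The 65 percent figure is just arithmetic on the class counts, and it is worth seeing once. This sketch hard-codes the two counts reported above rather than reloading the file, so it runs without the dataset.

```python
# Class counts taken from the value_counts() output above
negatives = 500
positives = 268
total = negatives + positives

positive_rate = positives / total
# A predictor that always guesses the majority class is right this often
baseline_accuracy = max(negatives, positives) / total

print("positive rate:", round(positive_rate, 3))          # 0.349
print("baseline accuracy:", round(baseline_accuracy, 3))  # 0.651
```

Any model you train later can be compared against this baseline accuracy to see whether it has learned something beyond the class balance.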
# A quick look at the first few feature values
print(df.head(3))
# Output:
# Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
# 0 6 148 72 35 0 33.6
# 1 1 85 66 29 0 26.6
# 2 8 183 64 0 0 23.3
#
# DiabetesPedigreeFunction Age Outcome
# 0 0.627 50 1
# 1 0.351 31 0
# 2 0.288 32 1

Why this dataset works well here
The diabetes dataset is small enough to load instantly and reason about by hand, yet its features interact in ways that a simple linear rule cannot fully capture. That makes it an ideal sandbox for watching a neural network learn. You will return to this exact dataset in every lesson of this module, so the only thing that changes from lesson to lesson is the network and the technique, not the problem.
You will not train a network in this lesson. The goal here is the mental model. With the architecture and the dataset both clear in your mind, you are ready for the math and code that bring a network to life in the lessons ahead.
Practice Exercises
These exercises are about checking your understanding and exploring the dataset. Try them before reading the hints.
Exercise 1: Trace a Single Neuron
A neuron receives two inputs, $x_1 = 2$ and $x_2 = 3$, with weights $w_1 = 0.5$ and $w_2 = -1.0$ and a bias $b = 1.0$. It uses a ReLU activation. Compute the neuron’s output by hand, then check it with code.
# Your code here

Hint
First compute the weighted sum: $z = (0.5)(2) + (-1.0)(3) + 1.0 = -1.0$. Then apply ReLU, which is max(0, z). Since $z$ is negative, the output is 0.0. In code: z = 0.5*2 + (-1.0)*3 + 1.0 then output = max(0.0, z).
Exercise 2: Count the Parameters
Imagine a fully connected network for the diabetes problem: an input layer of 8 features, one hidden layer of 4 neurons, and an output layer of 1 neuron. How many weights and biases does it have in total? Compute it, then print the answer.
# Your code here

Hint
A fully connected layer has (inputs x neurons) weights plus one bias per neuron. The hidden layer has 8*4 = 32 weights and 4 biases. The output layer has 4*1 = 4 weights and 1 bias. Total parameters = 32 + 4 + 4 + 1 = 41.
Exercise 3: Explore the Features
Load the diabetes dataset and compare the average Glucose level for patients who have diabetes versus those who do not. Does the difference match your intuition about which features should matter?
import pandas as pd
df = pd.read_csv("diabetes.csv") # download: https://datatweets.com/datasets/diabetes.csv
# Your code here

Hint
Use df.groupby("Outcome")["Glucose"].mean(). You will see that patients with Outcome 1 have a clearly higher average glucose level than those with Outcome 0, which is exactly the kind of signal a neuron would learn to weight heavily.
Summary
You now have a working mental model of what a neural network is and how it produces a prediction. Let’s review what you learned.
Key Concepts
The Big Picture
- A neural network is a flexible function with many adjustable numbers that learns patterns from data
- Deep learning is the part of machine learning that uses neural networks with multiple layers
- The biological brain inspired the name, but artificial neurons are far simpler and the metaphor is only a starting point
The Artificial Neuron
- A neuron computes a weighted sum of its inputs plus a bias, then applies an activation function
- Weights control how much each input matters; a bias shifts the neuron’s baseline
- Weights and biases together are the parameters, the numbers learning adjusts
Activation Functions
- A weighted sum alone is linear, and stacking linear steps stays linear
- Activation functions add the nonlinearity that lets networks represent complex patterns
- Sigmoid (0 to 1) and tanh (-1 to 1) squash inputs; ReLU and leaky ReLU are the modern defaults for hidden layers
Layers and Architecture
- The input layer holds the feature values; hidden layers do the computation; the output layer produces the prediction
- A feedforward network passes information in one direction; a fully connected (dense) layer connects every neuron to every neuron in the next layer
The Forward Pass and Learning
- The forward pass pushes inputs through the network layer by layer to a prediction
- A fresh network has random parameters and predicts nonsense
- Learning repeatedly adjusts weights and biases to reduce a loss that measures prediction error
Why This Matters
Every modern deep learning system, from image classifiers to large language models, is built from the pieces you just learned: neurons, weights, biases, activations, and layers, trained by repeatedly reducing a loss. The architectures get more elaborate, but the foundation does not change. If you understand what a single neuron computes and why an activation function matters, you understand the core of deep learning.
You also met the Pima diabetes dataset, the problem you will solve again and again as the module progresses. Keeping one problem fixed means every new technique you learn, from gradient descent to regularization, shows up as a visible change in how well your network predicts diabetes. The concepts here are the scaffolding; the next lessons turn them into running code.
Next Steps
You now understand the anatomy of a neural network and how it makes a prediction. In the next lesson, you will build the mathematical and NumPy foundations you need to actually implement these computations efficiently, turning the weighted sums and activations into vector and matrix operations.
Keep Building Your Skills
You have taken your first step into deep learning, and it is a conceptual one on purpose. The single most valuable thing you can carry forward is the picture of a neuron computing a weighted sum, adding a bias, and applying an activation, repeated across layers to form the forward pass. Every lesson from here builds on that picture. Keep it in mind, and the math and code that follow will feel like filling in details on a structure you already understand.