Lesson 3 - Derivatives and Finding Extreme Points
Welcome to Derivatives
This lesson is the capstone of the calculus thread. You will learn what a derivative really is, how to compute one with a handful of simple rules, and how to use derivatives to find the extreme points of a function, the peaks and valleys where it reaches a local maximum or minimum. Then you will see why this matters for machine learning: the algorithm that trains almost every model you will ever build, gradient descent, is nothing more than repeatedly stepping downhill using the derivative.
By the end of this lesson, you will be able to:
- Explain the derivative as the instantaneous slope of a curve, defined as the limit of a difference quotient
- Apply the power rule and the linearity rules to differentiate polynomials by hand
- Find a function’s critical points by solving $f'(x) = 0$ and classify each as a maximum or minimum
- Connect derivatives to optimization and explain how gradient descent minimizes a loss
- Implement a working gradient descent loop in NumPy and watch it converge
You should be comfortable with basic Python and NumPy, and with the idea of a limit from the previous lesson. Let’s begin.
From Average Slope to Instantaneous Slope
You already know how to find the slope of a straight line: pick two points, divide the change in $y$ by the change in $x$. That ratio, the rise over run, is constant everywhere on a line.
A curve is different. Its steepness changes from place to place. Stand at the bottom of a valley and the ground is flat; climb the side and it gets steeper. So the question “what is the slope of this curve?” only makes sense if you also say where. The slope at a single point is called the instantaneous slope, and capturing it precisely is the whole reason derivatives exist.
Here is the trick. To measure the slope at a point $x$, pick a second point a tiny distance $h$ away, at $x + h$, and compute the ordinary slope of the line through both points:
$$\frac{f(x + h) - f(x)}{h}$$
This ratio is called the difference quotient. It is the average slope over the little interval of width $h$. The two points are connected by a secant line (a line cutting through the curve at two places). As you shrink $h$ toward zero, the second point slides closer and closer to the first, and the secant line pivots until it just grazes the curve at a single point. That grazing line is the tangent line, and its slope is the instantaneous slope you wanted.
To turn “shrink $h$ toward zero” into something exact, you use a limit. The result is the formal definition of the derivative:
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
The notation $f'(x)$, read “f prime of x,” is the derivative of $f$. It is itself a function: plug in any $x$ and it returns the slope of the tangent line at that point.
Why the limit is necessary
You cannot simply set $h = 0$ in the difference quotient, because that gives $0/0$, which is undefined. The limit is the careful way to ask “what value does this ratio approach as $h$ gets arbitrarily small?” without ever dividing by zero. That is exactly the limit machinery you built in the previous lesson, now put to work.
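To see the ratio settle toward a value without ever dividing by zero, here is a minimal sketch. The function $f(x) = x^2$, the point $x = 3$, and the helper name `difference_quotient` are illustrative choices, not part of the lesson's own code.

```python
def f(x):
    # Illustrative function chosen for this sketch (an assumption)
    return x**2

def difference_quotient(f, x, h):
    # Average slope of f over the interval [x, x + h]
    return (f(x + h) - f(x)) / h

x0 = 3.0
slopes = []
for h in [1.0, 0.1, 0.01, 0.001]:
    slope = difference_quotient(f, x0, h)
    slopes.append(slope)
    print(f"h = {h:<6} secant slope = {slope:.4f}")
```

Each smaller $h$ moves the secant slope closer to 6, the slope of the tangent line at $x = 3$, and $h$ itself is never zero.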
A worked derivative from the definition
Let’s differentiate $f(x) = x^2$ straight from the definition, so you see the machinery once before we replace it with shortcuts. Substitute $x + h$ into the function and expand:
$$\frac{f(x + h) - f(x)}{h} = \frac{(x + h)^2 - x^2}{h} = \frac{x^2 + 2xh + h^2 - x^2}{h}$$
The $x^2$ terms cancel, leaving $2xh + h^2$ on top. Every remaining term has a factor of $h$, so factor it out and cancel against the denominator:
$$\frac{2xh + h^2}{h} = \frac{h(2x + h)}{h} = 2x + h$$
Now the limit is safe: as $h \to 0$, the expression approaches $2x$. So the derivative of $x^2$ is $2x$. Doing that expansion by hand for every function would be exhausting, which is exactly why differentiation rules exist.
The Rules of Differentiation
A few rules let you differentiate any polynomial in seconds, no limits required. Their proofs all come from the definition above, but you only need the results.
The power rule
The single most useful rule is the power rule. For any power $n$:
$$\frac{d}{dx}\,x^n = n\,x^{n-1}$$
In words: bring the exponent down in front as a multiplier, then reduce the exponent by one. Check it against the example you just worked: for $x^2$, the power rule gives $2x^{2-1} = 2x$. It matches.
A few quick applications:
- $\frac{d}{dx}\,x = 1$ (since $x = x^1$, and $x^0 = 1$)
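The power rule can be spot-checked numerically. The sketch below compares $n\,x^{n-1}$ against a slope estimate for a few exponents. The central-difference helper `numerical_derivative` and the test point 1.5 are assumptions made for illustration; the lesson itself uses the one-sided difference quotient.

```python
def numerical_derivative(f, x, h=1e-5):
    # Central difference: average slope over [x - h, x + h]
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.5  # arbitrary test point (assumption)
results = []
for n in [1, 2, 3, 5]:
    power_rule = n * x0 ** (n - 1)                       # n * x^(n-1)
    estimate = numerical_derivative(lambda x: x**n, x0)  # slope from the curve itself
    results.append((n, power_rule, estimate))
    print(f"n={n}  power rule: {power_rule:.4f}  numerical: {estimate:.4f}")
```

The two columns should agree to the printed precision for every exponent tried.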
Linearity: the sum and constant-factor rules
Two more rules let you break a complicated function into pieces. First, the sum rule says the derivative of a sum is the sum of the derivatives:
$$\frac{d}{dx}\left[f(x) + g(x)\right] = f'(x) + g'(x)$$
Second, the constant-factor rule lets you pull a constant multiplier outside:
$$\frac{d}{dx}\left[c\,f(x)\right] = c\,f'(x)$$
Together these two are called the linearity of differentiation. One last fact you will lean on constantly: the derivative of any constant is zero, because a constant function is flat and a flat line has slope $0$.
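As a rough numerical illustration of linearity, the sketch below checks the sum rule, the constant-factor rule, and the zero slope of a constant. The functions `f` and `g`, the constant `c`, the test point, and the helper `slope` are all hypothetical choices for this example.

```python
def slope(func, x, h=1e-5):
    # Central-difference estimate of func'(x); an illustrative helper
    return (func(x + h) - func(x - h)) / (2 * h)

def f(x):
    return x**3   # f'(x) = 3x^2

def g(x):
    return x**2   # g'(x) = 2x

c = 4.0    # arbitrary constant multiplier
x0 = 2.0   # arbitrary test point

sum_direct = slope(lambda x: f(x) + g(x), x0)   # derivative of the sum
sum_by_rule = slope(f, x0) + slope(g, x0)       # sum of the derivatives
scaled_direct = slope(lambda x: c * f(x), x0)   # derivative of c * f
scaled_by_rule = c * slope(f, x0)               # c times the derivative
flat = slope(lambda x: 7.0, x0)                 # a constant has slope 0

print("sum rule     :", round(sum_direct, 4), round(sum_by_rule, 4))
print("factor rule  :", round(scaled_direct, 4), round(scaled_by_rule, 4))
print("constant rule:", flat)
```

At $x = 2$ the exact values are $3(2)^2 + 2(2) = 16$ for the sum and $4 \cdot 12 = 48$ for the scaled function, so both pairs should land on those numbers.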
Putting the rules together
Combine all three rules and you can differentiate any polynomial term by term. Take the function we will study for the rest of this lesson:
$$f(x) = x^3 - 3x$$
Differentiate each term separately. The power rule turns $x^3$ into $3x^2$. The term $-3x$ is a constant times $x$, so its derivative is $-3$. Add the pieces:
$$f'(x) = 3x^2 - 3$$
That is the entire derivative, found in one line. You can verify it numerically by comparing the formula against the difference quotient with a tiny $h$.
import numpy as np

def f(x):
    return x**3 - 3*x

def f_prime(x):  # our analytic derivative
    return 3*x**2 - 3

# Numerical slope using a tiny h, evaluated at x = 2
x = 2.0
h = 1e-6
numerical = (f(x + h) - f(x)) / h
print("analytic f'(2) :", f_prime(x))
print("numerical f'(2):", round(numerical, 4))
# Output:
# analytic f'(2) : 9.0
# numerical f'(2): 9.0

The analytic value matches the numerical estimate to four decimal places, which is a good sanity check that the rules and the definition agree.
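Term-by-term differentiation is mechanical enough to automate. The sketch below stores a polynomial as a list of coefficients and applies the power rule to each term. The helper name `differentiate` and the lowest-power-first ordering are assumptions for this example; `np.polyder` is used only as a cross-check and expects the highest power first.

```python
import numpy as np

def differentiate(coeffs):
    # coeffs[k] is the coefficient of x**k (lowest power first).
    # Power rule per term: c * x**k  ->  (c * k) * x**(k - 1).
    # The constant term (k = 0) differentiates to zero and is dropped.
    return [c * k for k, c in enumerate(coeffs)][1:]

f_coeffs = [0, -3, 0, 1]               # f(x) = x^3 - 3x
df_coeffs = differentiate(f_coeffs)
print(df_coeffs)                       # -> [-3, 0, 3], i.e. f'(x) = 3x^2 - 3

# Cross-check with NumPy, which stores the highest power first
np_result = np.polyder(f_coeffs[::-1])
print(np_result)
```

Reading the result from lowest power up gives $-3 + 0x + 3x^2$, the same derivative found by hand.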
Finding Extreme Points
Now for the payoff. A critical point is an $x$ where the derivative is zero (or undefined). These points matter because they are where a curve stops climbing and starts falling, or stops falling and starts climbing. Picture hiking to a summit: just before the peak the trail slopes up (positive slope), at the very top it is momentarily flat (zero slope), and just past it the trail slopes down (negative slope). The flat instant at the top is a critical point.
Critical points come in three flavors:
- A local maximum (a peak): the slope changes from positive to negative.
- A local minimum (a valley): the slope changes from negative to positive.
- A saddle or inflection: the slope touches zero but does not change sign, so the point is neither a peak nor a valley.
The word “local” matters: these are the highest or lowest points in their immediate neighborhood, not necessarily on the whole curve. They are also called local extrema (singular: extremum).
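The three flavors can be told apart with a small sign test. The sketch below is illustrative only: the helper `classify`, its default step of 0.5, and the three example derivatives are assumptions, each chosen so that its only critical point is at $x = 0$.

```python
def classify(f_prime, c, step=0.5):
    # Compare the sign of the slope just left and just right of c
    before = f_prime(c - step)
    after = f_prime(c + step)
    if before > 0 > after:
        return "local max"
    if before < 0 < after:
        return "local min"
    return "neither"

# -x^2 has derivative -2x: slope goes from positive to negative at 0
peak = classify(lambda x: -2 * x, 0.0)
# x^2 has derivative 2x: slope goes from negative to positive at 0
valley = classify(lambda x: 2 * x, 0.0)
# x^3 has derivative 3x^2: slope touches zero but stays positive on both sides
flat_spot = classify(lambda x: 3 * x**2, 0.0)

print(peak, "|", valley, "|", flat_spot)
```

The last case is the one to remember: a zero derivative alone does not guarantee a peak or a valley.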
Worked example: the peaks and valleys of x³ − 3x
To find the extreme points of $f(x) = x^3 - 3x$, set its derivative to zero and solve:
$$3x^2 - 3 = 0$$
Divide both sides by 3 to get $x^2 = 1$, which gives two critical points: $x = -1$ and $x = 1$. To classify each one, check the sign of the slope just before and just after.
Around $x = -1$: at $x = -2$ the slope is $f'(-2) = 9$ (positive), and at $x = 0$ the slope is $f'(0) = -3$ (negative). The slope flips from positive to negative, so $x = -1$ is a local maximum. Its height is $f(-1) = (-1)^3 - 3(-1) = 2$.
Around $x = 1$: at $x = 0$ the slope is $f'(0) = -3$ (negative), and at $x = 2$ the slope is $f'(2) = 9$ (positive). The slope flips from negative to positive, so $x = 1$ is a local minimum. Its height is $f(1) = 1^3 - 3(1) = -2$.
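Solving $f'(x) = 0$ can also be handed to NumPy. This is a minimal sketch, assuming the derivative is written as the coefficient list `[3, 0, -3]` for $3x^2 - 3$; `np.roots` takes coefficients from the highest power down.

```python
import numpy as np

def f(x):
    return x**3 - 3*x

# Roots of f'(x) = 3x^2 + 0x - 3, sorted left to right
critical_points = np.sort(np.roots([3, 0, -3]).real)

# Height of the curve at each critical point
heights = f(critical_points)

for c, height in zip(critical_points, heights):
    print(f"critical point x = {c:+.1f}, height f(x) = {height:+.1f}")
```

The roots come back as $x = -1$ and $x = 1$ with heights $2$ and $-2$, matching the hand solution. Root-finding only locates the critical points; the sign test is still needed to classify them.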
The figure below shows the curve with its tangent lines at both critical points. At each extreme point the tangent is perfectly horizontal, which is the visual signature of a zero derivative.
You can confirm the whole analysis in code.
import numpy as np

def f(x):
    return x**3 - 3*x

def f_prime(x):
    return 3*x**2 - 3

def classify(before, after):
    # Compare the sign of the slope on each side of the critical point
    if before > 0 > after:
        return "local max"
    if before < 0 < after:
        return "local min"
    return "neither"

critical_points = [-1.0, 1.0]
for c in critical_points:
    before = f_prime(c - 0.5)  # slope just to the left
    after = f_prime(c + 0.5)   # slope just to the right
    kind = classify(before, after)
    print(f"x={c:+.0f} f(x)={f(c):+.0f} slope before={before:+.2f} after={after:+.2f} -> {kind}")
# Output:
# x=-1 f(x)=+2 slope before=+3.75 after=-2.25 -> local max
# x=+1 f(x)=-2 slope before=-2.25 after=+3.75 -> local min

The output agrees with the hand calculation. Stepping 0.5 to each side of $x = -1$ lands at $x = -1.5$, where the curve is still climbing toward the peak, and at $x = -0.5$, which is on the downslope between the peak and the valley, so the slope flips from positive to negative. Around $x = 1$ the test points are $x = 0.5$ and $x = 1.5$, and the slope flips from negative to positive. The exact step size does not matter, as long as neither test point crosses a neighboring critical point. A wider step of 1.0 gives the same classification.

# Same check with a wider step: the test points -2, 0, and 2
# each stay on their own side of the turn
for c in critical_points:
    before = f_prime(c - 1.0)
    after = f_prime(c + 1.0)
    kind = classify(before, after)
    print(f"x={c:+.0f} f(x)={f(c):+.0f} -> {kind}")
# Output:
# x=-1 f(x)=+2 -> local max
# x=+1 f(x)=-2 -> local min

Both step sizes agree: $x = -1$ is the local maximum with height $2$, and $x = 1$ is the local minimum with height $-2$, exactly matching the hand calculation.
Sign charts need points that truly straddle the turn
When you classify a critical point by checking the slope on either side, the two test points must land on opposite sides of the turn, with no other critical point between a test point and the point being classified. Here the two critical points are 2 units apart, so any step smaller than 2 is safe. A step of 2.5 from $x = -1$ would land at $x = 1.5$, past the valley at $x = 1$, where the slope is positive again; the test would then see positive slopes on both sides and miss the peak. When critical points are close together, sample close to each one, inside its own region.
From Extreme Points to Machine Learning
So why does any of this belong in a machine learning course? Because training a model is an optimization problem, and optimization is the search for an extreme point.
When you train a model, you define a loss function (also called a cost function) that measures how wrong the model’s predictions are. A high loss means bad predictions; a low loss means good ones. Training means adjusting the model’s parameters until the loss is as small as possible. In other words, you are hunting for the minimum of the loss function, and you already know the minimum sits at a critical point where the derivative is zero.
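To make "the minimum sits where the derivative is zero" concrete, here is a small sketch with a one-parameter model $\hat{y} = w\,x$ and a mean squared error loss. The data values, the model, and the names `loss`, `loss_prime`, and `w_best` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical data: y is roughly 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def loss(w):
    # Mean squared error of the one-parameter model y_hat = w * x
    return np.mean((w * x - y) ** 2)

def loss_prime(w):
    # Expanding (w*x - y)^2 = w^2 x^2 - 2 w x y + y^2 and differentiating
    # term by term in w gives 2 w x^2 - 2 x y = 2 (w*x - y) x
    return np.mean(2 * (w * x - y) * x)

# Setting loss_prime(w) = 0 and solving for w gives the critical point
w_best = np.sum(x * y) / np.sum(x * x)

print("best w:", round(w_best, 4))
print("loss at best w:", round(loss(w_best), 4))
print("loss at w = 1 :", round(loss(1.0), 4))
```

The derivative of the loss vanishes at `w_best`, and any other value of `w` gives a larger loss. A loss this simple can be solved by hand; the point of the sketch is only to show the loss as an ordinary function of the parameter.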
For a simple function you could solve by hand, just as you did above. But a real model has thousands or millions of parameters, and the loss is far too complicated to solve directly. You need a method that finds the minimum numerically, by taking small informed steps. That method is gradient descent.
The idea of gradient descent
Imagine you are standing on a hillside in thick fog and want to reach the lowest point in the valley. You cannot see the bottom, but you can feel which way the ground slopes under your feet. The sensible move is to step in the downhill direction, then feel the slope again, and repeat. That is gradient descent in one sentence.
The “slope under your feet” is the derivative. If the derivative is positive, the function is rising as $x$ increases, so to go down you move $x$ in the negative direction. If the derivative is negative, the function is falling, so you move $x$ in the positive direction. In both cases you step opposite the sign of the derivative. The update rule is:
$$x_{\text{new}} = x_{\text{old}} - \alpha\, f'(x_{\text{old}})$$
The number $\alpha$ (alpha) is the learning rate, a small positive constant that controls how big each step is. Subtracting $\alpha\, f'(x_{\text{old}})$ automatically sends you downhill: where the slope is steep the step is large, and as you approach the minimum the slope flattens toward zero and the steps naturally shrink, so you settle gently into the valley.
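A single update from each side of a valley shows why subtracting the scaled derivative always points downhill. The bowl $f(x) = x^2$, the starting points, and the learning rate below are illustrative assumptions.

```python
def f_prime(x):
    # Derivative of the illustrative bowl f(x) = x^2 (minimum at x = 0)
    return 2 * x

alpha = 0.1  # learning rate (assumed value)

moves = {}
for x_old in [4.0, -4.0]:
    # One gradient descent update: step opposite the sign of the slope
    x_new = x_old - alpha * f_prime(x_old)
    moves[x_old] = x_new
    print(f"x_old={x_old:+.1f}  slope={f_prime(x_old):+.1f}  x_new={x_new:+.1f}")
```

Starting on the right, the slope is positive and the step moves left; starting on the left, the slope is negative and the step moves right. Both moves head toward the minimum at $x = 0$.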
A runnable gradient descent demo
Let’s minimize a function whose answer you can verify by eye: $f(x) = (x - 3)^2$. This parabola has its lowest point at $x = 3$, where its value is $0$. Its derivative, by the power rule and a small substitution, is $f'(x) = 2(x - 3)$.
You will start far from the answer at $x = -2$, use a learning rate of $\alpha = 0.1$, and take 20 steps. Watch the value of $x$ march steadily toward 3.
import numpy as np

def f(x):
    return (x - 3)**2

def grad(x):  # derivative of (x-3)^2
    return 2*(x - 3)

x = -2.0  # starting point, far from the true minimum
lr = 0.1  # learning rate (alpha)
for step in range(20):
    x = x - lr * grad(x)  # step opposite the derivative
print("final x:", round(x, 4))
print("true minimum at x = 3")
# Output:
# final x: 2.9424
# true minimum at x = 3

After 20 steps the algorithm reaches $x \approx 2.9424$, already very close to the true minimum at $x = 3$. It would inch even nearer with more steps. This tiny loop, stepping opposite the derivative, scaled up to millions of parameters and far more elaborate loss functions, is the engine that trains neural networks, linear regressions, and most of modern machine learning.
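The same loop can fit a model parameter instead of a bare $x$. The sketch below is a hedged illustration, not the lesson's own code: the data, the one-parameter model $\hat{y} = w\,x$, the mean squared error loss, and the learning rate of 0.05 are all assumptions.

```python
import numpy as np

# Hypothetical data generated by y = 2 * x exactly
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

def loss(w):
    # Mean squared error of the model y_hat = w * x
    return np.mean((w * x - y) ** 2)

def loss_grad(w):
    # Derivative of the loss with respect to the single parameter w
    return np.mean(2 * (w * x - y) * x)

w = 0.0     # initial guess for the parameter
lr = 0.05   # learning rate (alpha)
start_loss = loss(w)
for step in range(20):
    w = w - lr * loss_grad(w)   # step opposite the derivative of the loss

print("learned w:", round(w, 4))
print("loss before:", round(start_loss, 4), " loss after:", round(loss(w), 8))
```

The only change from the demo above is what the derivative is taken with respect to: the loop adjusts the parameter `w`, and the loss plays the role of $f$.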
The learning rate is a balancing act
Too small a learning rate and gradient descent crawls, needing thousands of steps to arrive. Too large and it can overshoot the minimum and bounce around, or even diverge and fly off entirely. Choosing a good learning rate is one of the most common practical decisions you will make when training models, and it is worth building intuition for it early.
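The three regimes are easy to see on the same parabola. The learning rates below (0.01, 0.9, and 1.05) and the helper `run` are illustrative assumptions picked to show crawling, bouncing, and diverging.

```python
def grad(x):
    # Derivative of f(x) = (x - 3)^2
    return 2 * (x - 3)

def run(lr, steps=20, x=-2.0):
    # Return the distance from the minimum at x = 3 after each step
    distances = []
    for _ in range(steps):
        x = x - lr * grad(x)
        distances.append(abs(x - 3))
    return distances

for lr in [0.01, 0.9, 1.05]:
    d = run(lr)
    print(f"lr={lr:<5} distance after 1 step: {d[0]:.3f}   after 20 steps: {d[-1]:.3f}")
```

With the smallest rate the distance barely shrinks in 20 steps. With 0.9 each step jumps across the minimum to the other side but still ends up closer. With 1.05 each jump lands farther away than the last, so the distance grows.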
Practice Exercises
Now it is your turn. Try these before checking the hints.
Exercise 1: Differentiate and evaluate
Using only the differentiation rules from this lesson, find the derivative of by hand, then write a Python function for the derivative and use it to print the slope at $x = 1$.
import numpy as np
# Your code here: define the derivative and evaluate it at x = 1
Hint
Apply the power rule to to get , and recall that the derivative of is . So . Plugging in gives .
Exercise 2: Find and classify the critical points
The function $f(x) = x^3 - x^2$ has derivative $f'(x) = 3x^2 - 2x$. Set the derivative to zero, solve for the critical points, and classify each as a relative minimum or relative maximum by checking the sign of the slope on either side.
import numpy as np
def f_prime(x):
return 3*x**2 - 2*x
# Your code here: find the critical points and classify them
Hint
Factor the derivative as $x(3x - 2)$, which gives critical points at $x = 0$ and $x = 2/3$. Testing the slope on either side of each point: at $x = 0$ the slope changes from positive to negative, making it a relative maximum; at $x = 2/3$ the slope changes from negative to positive, making it a relative minimum.
Exercise 3: Tune the gradient descent demo
Take the gradient descent loop that minimizes $f(x) = (x - 3)^2$ starting from $x = -2$. Run it again with a larger learning rate of lr = 0.5 for the same 20 steps, and compare how close it gets to the true minimum at $x = 3$.
import numpy as np
def grad(x):
return 2*(x - 3)
x = -2.0
lr = 0.5 # a much larger learning rate
# Your code here: run 20 steps and print the final x
Hint
Reuse the same loop: x = x - lr * grad(x) inside a for step in range(20) loop. With this larger learning rate the algorithm converges much faster than the lr = 0.1 run did. In fact, with lr = 0.5 the update becomes $x - 0.5 \cdot 2(x - 3) = 3$, so the very first step lands exactly on $x = 3$ and every later step stays there. Try lr = 1.1 afterward to see what happens when the step is too big and the iterates start to diverge.
Summary
Congratulations! You have completed the calculus foundation for machine learning. You can now compute derivatives, find a function’s extreme points, and explain the optimization algorithm that trains nearly every model. Let’s review what you learned.
Key Concepts
The Derivative
- The derivative is the instantaneous slope of a curve, the slope of its tangent line at a point
- Formally it is the limit of the difference quotient: $f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$
- The derivative is itself a function: give it an $x$, it returns the slope there
Rules of Differentiation
- Power rule: $\frac{d}{dx}\,x^n = n\,x^{n-1}$
- Sum rule: differentiate a sum term by term
- Constant-factor rule: pull constant multipliers outside the derivative
- The derivative of a constant is $0$, and the derivative of $x$ is $1$
Extreme Points
- A critical point is where $f'(x) = 0$ (or is undefined)
- Classify with a sign test: positive-to-negative is a local maximum, negative-to-positive is a local minimum, no sign change is neither
- For $f(x) = x^3 - 3x$: a local max at $x = -1$ (height 2) and a local min at $x = 1$ (height -2)
Gradient Descent
- Training a model means minimizing a loss function, which is finding a minimum
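- The update rule $x_{\text{new}} = x_{\text{old}} - \alpha\, f'(x_{\text{old}})$ steps opposite the derivative, downhill
- The learning rate controls step size: too small is slow, too large can overshoot or diverge
- Minimizing $f(x) = (x - 3)^2$ from $x = -2$ with $\alpha = 0.1$ reaches $x \approx 2.9424$ after 20 steps, closing in on the true minimum at $x = 3$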
Why This Matters
Every time a model trains, it is solving an optimization problem: nudge the parameters until the loss is as small as possible. That search is built entirely on the ideas in this lesson. The loss is a function, its minimum sits where the derivative is zero, and gradient descent finds that minimum by repeatedly stepping opposite the slope.
This is why calculus is non-negotiable for understanding machine learning rather than just using it. When a model trains slowly, or its loss explodes, or it gets stuck short of a good solution, the explanation almost always lives in the derivative and the learning rate. You now have the vocabulary and the intuition to reason about what is happening under the hood, which is exactly what separates someone who runs .fit() from someone who can fix it when it goes wrong.
Next Steps
You have finished the calculus thread and seen how derivatives power optimization. The next part of the course turns to linear algebra, the language used to represent data, features, and model parameters in bulk. You will start with linear systems, the foundation for everything that follows.
Keep Building Your Skills
You have reached the end of the calculus thread, and it is a real milestone. The derivative began as an abstract limit, but you have already seen it do concrete work: locating peaks and valleys, and driving a gradient descent loop toward a minimum. Keep that mental image of stepping downhill in the fog, because you will meet it again and again as you train real models. Master the slope, and optimization stops being magic and starts being something you can reason about.