Lesson 3 - Derivatives and Finding Extreme Points

Welcome to Derivatives

This lesson is the capstone of the calculus thread. You will learn what a derivative really is, how to compute one with a handful of simple rules, and how to use derivatives to find the extreme points of a function, the peaks and valleys where it reaches a local maximum or minimum. Then you will see why this matters for machine learning: the algorithm that trains almost every model you will ever build, gradient descent, is nothing more than repeatedly stepping downhill using the derivative.

By the end of this lesson, you will be able to:

  • Explain the derivative as the instantaneous slope of a curve, defined as the limit of a difference quotient
  • Apply the power rule and the linearity rules to differentiate polynomials by hand
  • Find a function’s critical points by solving $f'(x) = 0$ and classify each as a maximum or minimum
  • Connect derivatives to optimization and explain how gradient descent minimizes a loss
  • Implement a working gradient descent loop in NumPy and watch it converge

You should be comfortable with basic Python and NumPy, and with the idea of a limit from the previous lesson. Let’s begin.


From Average Slope to Instantaneous Slope

You already know how to find the slope of a straight line: pick two points, divide the change in $y$ by the change in $x$. That ratio, the rise over run, is constant everywhere on a line.

A curve is different. Its steepness changes from place to place. Stand at the bottom of a valley and the ground is flat; climb the side and it gets steeper. So the question “what is the slope of this curve?” only makes sense if you also say where. The slope at a single point is called the instantaneous slope, and capturing it precisely is the whole reason derivatives exist.

Here is the trick. To measure the slope at a point $x$, pick a second point a tiny distance $h$ away, at $x + h$, and compute the ordinary slope of the line through both points:

$$\frac{f(x + h) - f(x)}{h}$$

This ratio is called the difference quotient. It is the average slope over the little interval of width $h$. The two points are connected by a secant line (a line cutting through the curve at two places). As you shrink $h$ toward zero, the second point slides closer and closer to the first, and the secant line pivots until it just grazes the curve at a single point. That grazing line is the tangent line, and its slope is the instantaneous slope you wanted.

To turn “shrink $h$ toward zero” into something exact, you use a limit. The result is the formal definition of the derivative:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

The notation $f'(x)$, read “f prime of x,” is the derivative of $f$. It is itself a function: plug in any $x$ and it returns the slope of the tangent line at that point.

Why the limit is necessary

You cannot simply set $h = 0$ in the difference quotient, because that gives $0/0$, which is undefined. The limit is the careful way to ask “what value does this ratio approach as $h$ gets arbitrarily small?” without ever dividing by zero. That is exactly the limit machinery you built in the previous lesson, now put to work.
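To see the limit at work numerically, here is a small sketch. It assumes the example function $f(x) = x^2$ and the point $x = 3$, both chosen only for illustration, and the helper name `difference_quotient` is made up for this example. It evaluates the difference quotient for smaller and smaller $h$, then shows what happens if you try $h = 0$ directly.

```python
def f(x):
    return x**2

def difference_quotient(f, x, h):
    """Average slope of f between x and x + h."""
    return (f(x + h) - f(x)) / h

x = 3.0
for h in [1.0, 0.1, 0.01, 0.001]:
    print(f"h={h:<6} slope={difference_quotient(f, x, h):.4f}")

# Setting h = 0 exactly is not allowed: Python raises ZeroDivisionError,
# the code-level version of the undefined 0/0.
try:
    difference_quotient(f, x, 0.0)
except ZeroDivisionError:
    print("h=0 is undefined")
```

The printed slopes close in on 6 as $h$ shrinks, while the attempt at $h = 0$ fails outright. The limit names the value the ratio is heading toward without ever performing that division.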

A worked derivative from the definition

Let’s differentiate $f(x) = x^2$ straight from the definition, so you see the machinery once before we replace it with shortcuts. Substitute $x + h$ into the function and expand:

$$f'(x) = \lim_{h \to 0} \frac{(x + h)^2 - x^2}{h} = \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h}$$

The $x^2$ terms cancel, leaving $2xh + h^2$ on top. Every remaining term has a factor of $h$, so factor it out and cancel against the denominator:

$$f'(x) = \lim_{h \to 0} \frac{h(2x + h)}{h} = \lim_{h \to 0} (2x + h)$$

Now the limit is safe: as $h \to 0$, the expression approaches $2x$. So the derivative of $x^2$ is $2x$. Doing that expansion by hand for every function would be exhausting, which is exactly why differentiation rules exist.
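The same three moves (expand, cancel, factor out $h$) work for other powers. As one more worked example, added here for practice rather than taken from the steps above, this is the derivative of $x^3$ from the definition:

```latex
\begin{aligned}
\frac{d}{dx}\,x^3
  &= \lim_{h \to 0} \frac{(x + h)^3 - x^3}{h} \\
  &= \lim_{h \to 0} \frac{3x^2 h + 3x h^2 + h^3}{h} \\
  &= \lim_{h \to 0} \left(3x^2 + 3xh + h^2\right) \\
  &= 3x^2
\end{aligned}
```

Every term that still contains $h$ after the cancellation vanishes in the limit, leaving only $3x^2$.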


The Rules of Differentiation

A few rules let you differentiate any polynomial in seconds, no limits required. Their proofs all come from the definition above, but you only need the results.

The power rule

The single most useful rule is the power rule. For any power $r$:

$$\frac{d}{dx}\,x^{r} = r\,x^{r-1}$$

In words: bring the exponent down in front as a multiplier, then reduce the exponent by one. Check it against the example you just worked: for $x^2$, the power rule gives $2x^{2-1} = 2x$. It matches.

A few quick applications:

  • $\frac{d}{dx}\,x^3 = 3x^2$
  • $\frac{d}{dx}\,x^5 = 5x^4$
  • $\frac{d}{dx}\,x = 1$ (since $x = x^1$, and $1 \cdot x^0 = 1$)
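The applications above can be spot-checked numerically. This is a minimal sketch: the helper names `power_rule` and `numerical_slope` and the test point $x = 2$ are assumptions made for the example. The numerical estimate uses a symmetric (central) difference, a slight variation on the lesson's difference quotient that is a little more accurate for the same $h$.

```python
def power_rule(r, x):
    """Slope of x**r at x according to the power rule: r * x**(r - 1)."""
    return r * x**(r - 1)

def numerical_slope(r, x, h=1e-6):
    """Central-difference estimate of the slope of x**r at x."""
    return ((x + h)**r - (x - h)**r) / (2 * h)

x = 2.0
for r in [1, 3, 5]:
    print(f"r={r}: power rule={power_rule(r, x):.4f}  numerical={numerical_slope(r, x):.4f}")
```

At $x = 2$ the power rule gives 1, 12, and 80 for $r = 1, 3, 5$, and the numerical column should agree to the displayed precision.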

Linearity: the sum and constant-factor rules

Two more rules let you break a complicated function into pieces. First, the sum rule says the derivative of a sum is the sum of the derivatives:

$$\frac{d}{dx}\,[\,f(x) + g(x)\,] = \frac{d}{dx}\,f(x) + \frac{d}{dx}\,g(x)$$

Second, the constant-factor rule lets you pull a constant multiplier outside:

$$\frac{d}{dx}\,[\,c\,f(x)\,] = c\,\frac{d}{dx}\,f(x)$$

Together these two are called the linearity of differentiation. One last fact you will lean on constantly: the derivative of any constant is zero, because a constant function is flat and a flat line has slope $0$.

Putting the rules together

Combine all three rules and you can differentiate any polynomial term by term. Take the function we will study for the rest of this lesson:

$$f(x) = x^3 - 3x$$

Differentiate each term separately. The power rule turns $x^3$ into $3x^2$. The term $-3x$ is a constant times $x$, so its derivative is $-3 \cdot 1 = -3$. Add the pieces:

$$f'(x) = 3x^2 - 3$$

That is the entire derivative, found in one line. You can verify it numerically by comparing the formula against the difference quotient with a tiny $h$.

import numpy as np

def f(x):
    return x**3 - 3*x

def f_prime(x):     # our analytic derivative
    return 3*x**2 - 3

# Numerical slope using a tiny h, evaluated at x = 2
x = 2.0
h = 1e-6
numerical = (f(x + h) - f(x)) / h

print("analytic f'(2) :", f_prime(x))
print("numerical f'(2):", round(numerical, 4))
# Output:
# analytic f'(2) : 9.0
# numerical f'(2): 9.0

The analytic value $f'(2) = 3(2)^2 - 3 = 9$ matches the numerical estimate, which is a good sanity check that the rules and the definition agree.
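The term-by-term process is mechanical enough to automate. The sketch below assumes a polynomial stored as a list of coefficients from the highest power down; the function name `differentiate` is invented for this example. It applies the power rule and constant-factor rule to each term, then compares the result against NumPy's built-in `np.polyder`.

```python
import numpy as np

def differentiate(coeffs):
    """Differentiate a polynomial given as coefficients, highest power first.

    [1, 0, -3, 0] means x^3 + 0x^2 - 3x + 0.
    """
    n = len(coeffs) - 1                      # degree of the polynomial
    # Power rule per term: c * x^k -> (c * k) * x^(k - 1); the constant term drops out
    return [c * (n - i) for i, c in enumerate(coeffs[:-1])]

f_coeffs = [1, 0, -3, 0]                     # x^3 - 3x
print(differentiate(f_coeffs))               # [3, 0, -3], i.e. 3x^2 - 3
print(np.polyder(f_coeffs))                  # NumPy's built-in equivalent
```

Both lines report the coefficients of $3x^2 - 3$, the same derivative found by hand.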


Finding Extreme Points

Now for the payoff. A critical point is an x x where the derivative is zero (or undefined). These points matter because they are where a curve stops climbing and starts falling, or stops falling and starts climbing. Picture hiking to a summit: just before the peak the trail slopes up (positive slope), at the very top it is momentarily flat (zero slope), and just past it the trail slopes down (negative slope). The flat instant at the top is a critical point.

Critical points come in three flavors:

  • A local maximum (a peak): the slope changes from positive to negative.
  • A local minimum (a valley): the slope changes from negative to positive.
  • A saddle or inflection: the slope touches zero but does not change sign, so the point is neither a peak nor a valley.

The word “local” matters: these are the highest or lowest points in their immediate neighborhood, not necessarily on the whole curve. They are also called local extrema (singular: extremum).
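To make the three flavors concrete, here is a minimal sketch of the sign test as a function. The name `classify`, the default step of 0.1, and the three example curves are all assumptions chosen for illustration; the step only works if no other critical point lies within that distance.

```python
def classify(f_prime, c, step=0.1):
    """Classify the critical point c by the sign of the slope on each side."""
    before = f_prime(c - step)
    after = f_prime(c + step)
    if before > 0 > after:
        return "local maximum"
    if before < 0 < after:
        return "local minimum"
    return "neither"

# Three curves that all have zero slope at x = 0
print(classify(lambda x: -2 * x, 0.0))      # f(x) = -x^2  -> local maximum
print(classify(lambda x: 2 * x, 0.0))       # f(x) = x^2   -> local minimum
print(classify(lambda x: 3 * x**2, 0.0))    # f(x) = x^3   -> neither
```

The third case is the one to remember: $x^3$ has zero slope at $x = 0$, but the slope is positive on both sides, so the curve flattens for an instant and keeps climbing.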

Worked example: the peaks and valleys of x³ − 3x

To find the extreme points of $f(x) = x^3 - 3x$, set its derivative to zero and solve:

$$f'(x) = 3x^2 - 3 = 0$$

Divide both sides by 3 to get $x^2 = 1$, which gives two critical points: $x = -1$ and $x = 1$. To classify each one, check the sign of the slope just before and just after.
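When a derivative is a polynomial that is awkward to factor by hand, one option (not something this lesson depends on) is to let NumPy's `np.roots` solve $f'(x) = 0$. The sketch below passes the coefficients of $3x^2 - 3$ from the highest power down. Taking `.real` is safe here only because both roots are already known to be real.

```python
import numpy as np

# f'(x) = 3x^2 + 0x - 3, coefficients listed from the highest power down
critical_points = np.sort(np.roots([3, 0, -3]).real)
print(critical_points)   # approximately [-1.  1.]
```

The two roots agree with the hand solution, $x = -1$ and $x = 1$.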

Around $x = -1$: at $x = -2$ the slope is $f'(-2) = 3(4) - 3 = 9$ (positive), and at $x = 0$ the slope is $f'(0) = -3$ (negative). The slope flips from positive to negative, so $x = -1$ is a local maximum. Its height is $f(-1) = (-1)^3 - 3(-1) = -1 + 3 = 2$.

Around $x = 1$: at $x = 0$ the slope is $-3$ (negative), and at $x = 2$ the slope is $9$ (positive). The slope flips from negative to positive, so $x = 1$ is a local minimum. Its height is $f(1) = 1 - 3 = -2$.

The figure below shows the curve with its tangent lines at both critical points. At each extreme point the tangent is perfectly horizontal, which is the visual signature of a zero derivative.

[Figure: the curve $f(x) = x^3 - 3x$ with horizontal tangent lines at the local maximum at $x = -1$ and the local minimum at $x = 1$. Where the tangent line is horizontal, the derivative is zero.]

You can confirm the whole analysis in code.

import numpy as np

def f(x):
    return x**3 - 3*x

def f_prime(x):
    return 3*x**2 - 3

critical_points = [-1.0, 1.0]

for c in critical_points:
    before = f_prime(c - 2.5)   # slope 2.5 units to the left
    after  = f_prime(c + 2.5)   # slope 2.5 units to the right
    kind = "local max" if before > 0 > after else "local min"
    print(f"x={c:+.0f}  f(x)={f(c):+.0f}  slope before={before:+.2f}  after={after:+.2f}  -> {kind}")
# Output:
# x=-1  f(x)=+2  slope before=+33.75  after=+3.75  -> local min
# x=+1  f(x)=-2  slope before=+3.75  after=+33.75  -> local min

That output looks wrong, and it is a deliberate lesson: stepping 2.5 to each side of $x = -1$ lands at $x = -3.5$ and $x = 1.5$, and $x = 1.5$ is already past the valley at $x = 1$, where the curve is climbing again. Both slopes come out positive, so the sign flip is hidden, and because the one-line test labels anything that is not a maximum as a minimum, the peak is misclassified. The fix is to use test points close enough to the critical point that no other critical point sits in between, and to report "neither" when the sign does not flip.

# A robust check: stay close enough that no other critical point is crossed
for c in critical_points:
    before = f_prime(c - 0.5)
    after  = f_prime(c + 0.5)
    if before > 0 > after:
        kind = "local max"
    elif before < 0 < after:
        kind = "local min"
    else:
        kind = "neither"        # no sign change
    print(f"x={c:+.0f}  f(x)={f(c):+.0f}  -> {kind}")
# Output:
# x=-1  f(x)=+2  -> local max
# x=+1  f(x)=-2  -> local min

With a small enough step the signs flip correctly: $x = -1$ is the local maximum with height $2$, and $x = 1$ is the local minimum with height $-2$, exactly matching the hand calculation.

Sign charts need points that truly straddle the turn

When you classify a critical point by checking the slope on either side, your test points must each land on a different “side” of the turn, with no other critical point in between. Here the two critical points are only 2 units apart, so a step of 2.5 jumps past the neighboring critical point and the test breaks, while any step smaller than 2 works. When critical points are close together, sample close to each one, inside its own region.


From Extreme Points to Machine Learning

So why does any of this belong in a machine learning course? Because training a model is an optimization problem, and optimization is the search for an extreme point.

When you train a model, you define a loss function (also called a cost function) that measures how wrong the model’s predictions are. A high loss means bad predictions; a low loss means good ones. Training means adjusting the model’s parameters until the loss is as small as possible. In other words, you are hunting for the minimum of the loss function, and you already know the minimum sits at a critical point where the derivative is zero.
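As a concrete picture of a loss function, here is a minimal sketch with a one-parameter model. Everything in it is an assumption made for illustration: the toy data (which follows $y = 2x$ exactly), the model $\hat{y} = w x$, and the choice of mean squared error as the loss. It prints the loss and its derivative with respect to $w$ for a few candidate values.

```python
import numpy as np

# Toy data that follows y = 2x exactly (made up for this example)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

def loss(w):
    """Mean squared error of the one-parameter model y_hat = w * x."""
    return np.mean((w * x - y) ** 2)

def loss_prime(w):
    """Derivative of the loss with respect to w."""
    return np.mean(2 * (w * x - y) * x)

for w in [0.0, 1.0, 2.0, 3.0]:
    print(f"w={w:.1f}  loss={loss(w):6.2f}  slope={loss_prime(w):+7.2f}")
```

The loss is smallest at $w = 2$, exactly where its derivative is zero; the slope is negative to the left of that point and positive to the right, the same negative-to-positive pattern that marks a local minimum.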

For a simple function you could solve f(x)=0 f'(x) = 0 by hand, just as you did above. But a real model has thousands or millions of parameters, and the loss is far too complicated to solve directly. You need a method that finds the minimum numerically, by taking small informed steps. That method is gradient descent.

The idea of gradient descent

Imagine you are standing on a hillside in thick fog and want to reach the lowest point in the valley. You cannot see the bottom, but you can feel which way the ground slopes under your feet. The sensible move is to step in the downhill direction, then feel the slope again, and repeat. That is gradient descent in one sentence.

The “slope under your feet” is the derivative. If the derivative is positive, the function is rising as $x$ increases, so to go down you move in the negative $x$ direction. If the derivative is negative, the function is falling, so you move in the positive direction. In both cases you step opposite the sign of the derivative. The update rule is:

$$x_{\text{new}} = x_{\text{old}} - \alpha \, f'(x_{\text{old}})$$

The number $\alpha$ (alpha) is the learning rate, a small positive constant that controls how big each step is. Subtracting $\alpha \, f'(x)$ automatically sends you downhill: where the slope is steep the step is large, and as you approach the minimum the slope flattens toward zero and the steps naturally shrink, so you settle gently into the valley.

[Figure: gradient descent walking step by step downhill along the parabola $(x - 3)^2$ toward its minimum at $x = 3$. It takes large steps where the slope is steep and smaller steps as it nears the minimum.]

A runnable gradient descent demo

Let’s minimize a function whose answer you can verify by eye: $f(x) = (x - 3)^2$. This parabola has its lowest point at $x = 3$, where its value is $0$. Expanding gives $x^2 - 6x + 9$, so the rules from this lesson yield the derivative $f'(x) = 2x - 6 = 2(x - 3)$.

You will start far from the answer at $x = -2$, use a learning rate of $0.1$, and take 20 steps. Watch the value of $x$ march steadily toward 3.

import numpy as np

def f(x):
    return (x - 3)**2

def grad(x):          # derivative of (x-3)^2
    return 2*(x - 3)

x = -2.0              # starting point, far from the true minimum
lr = 0.1             # learning rate (alpha)

for step in range(20):
    x = x - lr * grad(x)   # step opposite the derivative

print("final x:", round(x, 4))
print("true minimum at x = 3")
# Output:
# final x: 2.9424
# true minimum at x = 3

After 20 steps the algorithm reaches $x = 2.9424$, already very close to the true minimum at $x = 3$. It would inch even nearer with more steps. This tiny loop, stepping opposite the derivative, scaled up to millions of parameters and far more elaborate loss functions, is the engine that trains neural networks, linear regressions, and most of modern machine learning.
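It can help to watch the whole path rather than only the endpoint. For this particular function the update rule gives $x_{\text{new}} - 3 = (1 - 2\alpha)(x_{\text{old}} - 3)$, so with $\alpha = 0.1$ the distance to the minimum is multiplied by $0.8$ on every step. The sketch below records each iterate in a list (the name `history` is an assumption for this example) and prints the remaining distance at a few selected steps.

```python
def grad(x):
    return 2 * (x - 3)        # derivative of (x - 3)^2

x, lr = -2.0, 0.1
history = [x]                 # history[k] is x after k steps
for step in range(20):
    x = x - lr * grad(x)
    history.append(x)

# Distance to the minimum at x = 3 after selected steps
for step in (0, 1, 2, 5, 10, 20):
    print(f"step {step:2d}: x={history[step]:+.4f}  distance={abs(history[step] - 3):.4f}")
```

The early steps cover the most ground because the slope is steepest far from the minimum, and each later step is smaller than the one before it.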

The learning rate is a balancing act

Too small a learning rate and gradient descent crawls, needing thousands of steps to arrive. Too large and it can overshoot the minimum and bounce around, or even diverge and fly off entirely. Choosing a good learning rate is one of the most common practical decisions you will make when training models, and it is worth building intuition for it early.
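The trade-off is easy to see on the same parabola. This is a rough sketch, assuming the 20-step budget and starting point used above; the helper name `run` and the four learning rates are chosen only for illustration.

```python
def run(lr, steps=20, x=-2.0):
    """Run gradient descent on (x - 3)^2 and return the final x."""
    for _ in range(steps):
        x = x - lr * 2 * (x - 3)   # 2 * (x - 3) is the derivative
    return x

for lr in [0.01, 0.1, 0.9, 1.05]:
    final = run(lr)
    print(f"lr={lr:<5} final x={final:+.4f}  distance from 3={abs(final - 3):.4f}")
```

With `lr = 0.01` the loop is still more than 3 units from the minimum after 20 steps. With `lr = 0.1` it gets within about 0.06. With `lr = 0.9` every step overshoots to the other side of $x = 3$, yet the overshoot shrinks each time, so it still converges. With `lr = 1.05` each overshoot is larger than the last and the iterates run away from the minimum.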


Practice Exercises

Now it is your turn. Try these before checking the hints.

Exercise 1: Differentiate and evaluate

Using only the differentiation rules from this lesson, find the derivative of $f(x) = x^5 - x$ by hand, then write a Python function for the derivative and use it to print the slope at $x = 1$.

import numpy as np

# Your code here: define the derivative and evaluate it at x = 1

Hint

Apply the power rule to $x^5$ to get $5x^4$, and recall that the derivative of $x$ is $1$. So $f'(x) = 5x^4 - 1$. Plugging in $x = 1$ gives $5(1)^4 - 1 = 4$.

Exercise 2: Find and classify the critical points

The function $f(x) = x^3 - x^2$ has derivative $f'(x) = 3x^2 - 2x$. Set the derivative to zero, solve for the critical points, and classify each as a local minimum or local maximum by checking the sign of the slope on either side.

import numpy as np

def f_prime(x):
    return 3*x**2 - 2*x

# Your code here: find the critical points and classify them

Hint

Factor the derivative as $x(3x - 2) = 0$, which gives critical points at $x = 0$ and $x = 2/3$. Test the slope on either side of each point, staying clear of the other critical point: at $x = 0$ the slope changes from positive to negative, making it a local maximum; at $x = 2/3$ the slope changes from negative to positive, making it a local minimum.

Exercise 3: Tune the gradient descent demo

Take the gradient descent loop that minimizes $f(x) = (x - 3)^2$ starting from $x = -2$. Run it again with a larger learning rate of lr = 0.5 for the same 20 steps, and compare how close it gets to the true minimum at $x = 3$.

import numpy as np

def grad(x):
    return 2*(x - 3)

x = -2.0
lr = 0.5   # a much larger learning rate

# Your code here: run 20 steps and print the final x

Hint

Reuse the same loop: x = x - lr * grad(x) inside a for step in range(20) loop. With this larger learning rate the algorithm converges far faster than the lr = 0.1 run did: each update subtracts exactly $x - 3$, so it lands on $x = 3$ at the first step and stays there. Try lr = 1.1 afterward to see what happens when the step is too big and the iterates start to diverge.


Summary

Congratulations! You have completed the calculus foundation for machine learning. You can now compute derivatives, find a function’s extreme points, and explain the optimization algorithm that trains nearly every model. Let’s review what you learned.

Key Concepts

The Derivative

  • The derivative is the instantaneous slope of a curve, the slope of its tangent line at a point
  • Formally it is the limit of the difference quotient: $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$
  • The derivative is itself a function: give it an $x$, it returns the slope there

Rules of Differentiation

  • Power rule: $\frac{d}{dx}\,x^r = r\,x^{r-1}$
  • Sum rule: differentiate a sum term by term
  • Constant-factor rule: pull constant multipliers outside the derivative
  • The derivative of a constant is $0$, and the derivative of $x$ is $1$

Extreme Points

  • A critical point is where $f'(x) = 0$ (or is undefined)
  • Classify with a sign test: positive-to-negative is a local maximum, negative-to-positive is a local minimum, no sign change is neither
  • For $f(x) = x^3 - 3x$: a local max at $x = -1$ (height $2$) and a local min at $x = 1$ (height $-2$)

Gradient Descent

  • Training a model means minimizing a loss function, which is finding a minimum
  • The update rule $x_{\text{new}} = x_{\text{old}} - \alpha\,f'(x_{\text{old}})$ steps opposite the derivative, downhill
  • The learning rate $\alpha$ controls step size: too small is slow, too large can overshoot or diverge
  • Minimizing $(x-3)^2$ from $x = -2$ with $\alpha = 0.1$ reaches $x = 2.9424$ after 20 steps, closing in on the true minimum at $x = 3$

Why This Matters

Every time a model trains, it is solving an optimization problem: nudge the parameters until the loss is as small as possible. That search is built entirely on the ideas in this lesson. The loss is a function, its minimum sits where the derivative is zero, and gradient descent finds that minimum by repeatedly stepping opposite the slope.

This is why calculus is non-negotiable for understanding machine learning rather than just using it. When a model trains slowly, or its loss explodes, or it gets stuck short of a good solution, the explanation almost always lives in the derivative and the learning rate. You now have the vocabulary and the intuition to reason about what is happening under the hood, which is exactly what separates someone who runs .fit() from someone who can fix it when it goes wrong.


Next Steps

You have finished the calculus thread and seen how derivatives power optimization. The next part of the course turns to linear algebra, the language used to represent data, features, and model parameters in bulk. You will start with linear systems, the foundation for everything that follows.

Continue to Lesson 4 - Linear Systems

Move from calculus to linear algebra and learn how to solve systems of linear equations.


Keep Building Your Skills

You have reached the end of the calculus thread, and it is a real milestone. The derivative began as an abstract limit, but you have already seen it do concrete work: locating peaks and valleys, and driving a gradient descent loop toward a minimum. Keep that mental image of stepping downhill in the fog, because you will meet it again and again as you train real models. Master the slope, and optimization stops being magic and starts being something you can reason about.