
Distilled AI


Primers • Math
Linear Algebra
Vectors
Definition
Purpose
Vectors in Python
Plotting Vectors
2D Vectors
3D Vectors
Norm
Addition
Differential Calculus
Slope of a Straight Line
Defining the Slope of a Curve
Differentiability
Differentiating a Function
Example: Finding the Derivative of $x^2$
Notations
Differentiation Rules
The Chain Rule
Derivatives and Optimization
Higher Order Derivatives
Partial Derivatives
Gradients
Gradient Descent, Revisited
Jacobians
Hessians
Examples
Derivative 1: Composed Exponential Function
Derivative 2. Function with Variable Base and Variable Exponent
Derivative 3: Gradient of a Multi­Dimensional Input Function
Derivative 4. Jacobian of a Multi­Dimensional Input and Output Function
Derivative 5. Hessian of a Multi­Dimensional Input Function
Probability Theory
Concepts
Chance Events
Expectation
Variance
Set Theory
Counting
Conditional Probability
Probability Distributions
Random Variables
Central Limit Theorem
Types of Probability Distributions
Discrete
Continuous
Bernoulli Distribution
PMF
CDF
Iid Bernoulli Trials
Binomial Trials
The Selection Problem
Example Justification of the Binomial Likelihood
Example
Gaussian/normal Distribution
PDF
CDF
Standard Normal Distribution
Example 1
Example 2
Facts about the Normal Density
Other Properties
Poisson Distribution
PMF
CDF
Use­cases for the Poisson Distribution
Poisson Derivation
Rates and Poisson Random Variables
Poisson Approximation to the Binomial
Example
Example: Poisson Approximation to the Binomial
Uniform Distribution
PDF
CDF
Geometric Distribution
Student’s T­distribution
Chi­squared Distribution
Exponential Distribution
F Distribution
Gamma Distribution
Beta Distribution
Frequentist Inference
Point Estimation
Confidence Intervals
The Bootstrap
Bayesian Inference
Bayes’ Theorem
Likelihood Function
Prior to Posterior
Regression Analysis
Ordinary Least Squares
Correlation
Analysis of Variance
Trigonometry
Ratios
Graphical View of Sin and Cos
References and Credits
Citation

Linear Algebra
Linear Algebra is the branch of mathematics that studies vector spaces and linear transformations between
vector spaces, such as rotating a shape, scaling it up or down, translating it (i.e., moving it), etc.
Machine Learning relies heavily on Linear Algebra, so it is essential to understand what vectors and
matrices are, what operations you can perform with them, and how they can be useful.

Vectors

Definition
A vector is a quantity defined by a magnitude and a direction. For example, a rocket’s velocity is a
3-dimensional vector: its magnitude is the speed of the rocket, and its direction is (hopefully) up. A vector can
be represented by an array of numbers called scalars. Each scalar corresponds to the magnitude of the
vector with regard to each dimension.
For example, say the rocket is going up at a slight angle: it has a vertical speed of 5,000 m/s, and also a
slight speed towards the East at 10 m/s, and a slight speed towards the North at 50 m/s. The rocket’s
velocity may be represented by the following vector:

$$\textbf{velocity} = \begin{bmatrix} 10 \\ 50 \\ 5000 \end{bmatrix}$$

Note: by convention vectors are generally presented in the form of columns. Also, vector names are
generally lowercase to distinguish them from matrices (which we will discuss below) and in bold (when
possible) to distinguish them from simple scalar values such as meters_per_second = 5026 .
A list of $N$ numbers may also represent the coordinates of a point in an $N$-dimensional space, so it is
quite frequent to represent vectors as simple points instead of arrows. A vector with 1 element may be
represented as an arrow or a point on an axis, a vector with 2 elements is an arrow or a point on a plane, a
vector with 3 elements is an arrow or a point in space, and a vector with $N$ elements is an arrow or a point in
an $N$-dimensional space… which most people find hard to imagine.

Purpose
Vectors have many purposes in Machine Learning, most notably to represent observations and predictions.
For example, say we built a Machine Learning system to classify videos into 3 categories (good, spam,
clickbait) based on what we know about them. For each video, we would have a vector representing what
we know about it, such as:

$$\textbf{video} = \begin{bmatrix} 10.5 \\ 5.2 \\ 3.25 \\ 7.0 \end{bmatrix}$$


This vector could represent a video that lasts 10.5 minutes, but only 5.2% of viewers watch for more than a
minute, it gets 3.25 views per day on average, and it was flagged 7 times as spam. As you can see, each
axis may have a different meaning.
Based on this vector our Machine Learning system may predict that there is an 80% probability that it is a
spam video, 18% that it is clickbait, and 2% that it is a good video. This could be represented as the
following vector:

$$\textbf{class\_probabilities} = \begin{bmatrix} 0.80 \\ 0.18 \\ 0.02 \end{bmatrix}$$

Vectors in Python
In Python, a vector can be represented in many ways, the simplest being a regular Python list of numbers:

[10.5, 5.2, 3.25, 7.0]

Since we plan to do quite a lot of scientific calculations, it is much better to use NumPy’s ndarray , which
provides a lot of convenient and optimized implementations of essential mathematical operations on
vectors (for more details about NumPy, check out the NumPy tutorial). For example:

import numpy as np
video = np.array([10.5, 5.2, 3.25, 7.0])
video

The size of a vector can be obtained using the size attribute:

video.size

The $i^{th}$ element (also called entry or item) of a vector $\textbf{v}$ is noted $\textbf{v}_i$.

Note that indices in mathematics generally start at 1, but in programming they usually start at 0. So to
access $\textbf{video}_3$ programmatically, we would write:

video[2]  # 3rd element

Plotting Vectors
To plot vectors, we will use matplotlib, so let’s start by importing it (for details about Matplotlib, check out
our Matplotlib tutorial):

import matplotlib.pyplot as plt

In a Jupyter/Colab notebook, we can simply output the graphs within the notebook itself by running the
%matplotlib inline magic command. Run this cell if you’re viewing this in Colab:

%matplotlib inline

2D Vectors
Now, import NumPy for array containers to handle the input data that we’re looking to plot:

import numpy as np

Let’s create a couple of very simple 2D vectors to plot:

u = np.array([2, 5])
v = np.array([3, 1])

These vectors each have 2 elements, so they can easily be represented graphically on a 2D graph, for
example as points:

x_coords, y_coords = zip(u, v)


plt.scatter(x_coords, y_coords, color=["r","b"])
plt.axis([0, 9, 0, 6])
plt.grid()
plt.show()

3D Vectors
Plotting 3D vectors is also relatively straightforward. First let’s create two 3D vectors:

import numpy as np

a = np.array([1, 2, 8])
b = np.array([5, 6, 3])

Now let’s plot them using matplotlib’s Axes3D . Note that Axes3D lives in mpl_toolkits , which ships
with matplotlib itself, so no separate installation is needed. If the import below fails, your matplotlib
installation is probably outdated; upgrade it using pip install --upgrade matplotlib (if you’re running
this in a Jupyter notebook, use !pip install --upgrade matplotlib ).

import matplotlib.pyplot as plt


from mpl_toolkits.mplot3d import Axes3D

subplot3d = plt.subplot(111, projection='3d')


x_coords, y_coords, z_coords = zip(a,b)
subplot3d.scatter(x_coords, y_coords, z_coords)
subplot3d.set_zlim3d([0, 9])
plt.show()

It is a bit hard to visualize exactly where in space these two points are, so let’s add vertical lines. We’ll
create a small convenience function to plot a list of 3D vectors with vertical lines attached:

def plot_vectors3d(ax, vectors3d, z0, **options):
    # Draw a dotted vertical line from height z0 up to each point
    for v in vectors3d:
        x, y, z = v
        ax.plot([x, x], [y, y], [z0, z], color="gray", linestyle='dotted', marker=".")
    # Then draw the points themselves
    x_coords, y_coords, z_coords = zip(*vectors3d)
    ax.scatter(x_coords, y_coords, z_coords, **options)

subplot3d = plt.subplot(111, projection='3d')


subplot3d.set_zlim([0, 9])
plot_vectors3d(subplot3d, [a,b], 0, color=("r","b"))
plt.show()
Norm
The norm of a vector $\textbf{u}$, noted $\|\textbf{u}\|$, is a measure of the length (a.k.a. the magnitude) of
$\textbf{u}$. There are multiple possible norms, but the most common one (and the only one we will discuss
here) is the Euclidean norm, which is defined as:

$$\|\textbf{u}\| = \sqrt{\sum_{i} \textbf{u}_i^2}$$

We could implement this easily in pure Python, recalling that $\sqrt{x} = x^{\frac{1}{2}}$.


def vector_norm(vector):
    # Sum the squared components, then take the square root
    squares = [element**2 for element in vector]
    return sum(squares)**0.5

u = np.array([2, 5])
print("||", u, "|| =")
vector_norm(u) # Prints 5.385164807134504

However, it is much more efficient to use NumPy’s norm function, available in the linalg (Linear
Algebra) module:

import numpy.linalg as LA
LA.norm(u) # Prints 5.385164807134504

Let’s plot a little diagram to confirm that the length of vector $\textbf{u}$ is indeed ≈ 5.4 :

radius = LA.norm(u)
plt.gca().add_artist(plt.Circle((0,0), radius, color="#DDDDDD"))

def plot_vector2d(vector2d, origin=[0, 0], **options):
    # Draw vector2d as an arrow starting at origin
    return plt.arrow(origin[0], origin[1], vector2d[0], vector2d[1], head_width=0.2,
                     head_length=0.3, length_includes_head=True, **options)

plot_vector2d(u, color="red")
plt.axis([0, 8.7, 0, 6])
plt.grid()
plt.show()
Looks about right!

Addition
Vectors of the same size can be added together. Addition is performed elementwise:

import numpy as np

u = np.array([2, 5])
v = np.array([3, 1])

print(" ", u)
print("+", v)
print("-"*10)
u + v

which outputs:

  [2 5]
+ [3 1]
----------
array([5, 6])

Let’s look at what vector addition looks like graphically:

plot_vector2d(u, color="r")
plot_vector2d(v, color="b")
plot_vector2d(v, origin=u, color="b", linestyle="dotted")
plot_vector2d(u, origin=v, color="r", linestyle="dotted")
plot_vector2d(u+v, color="g")
plt.axis([0, 9, 0, 7])
plt.text(0.7, 3, "u", color="r", fontsize=18)
plt.text(4, 3, "u", color="r", fontsize=18)
plt.text(1.8, 0.2, "v", color="b", fontsize=18)
plt.text(3.1, 5.6, "v", color="b", fontsize=18)
plt.text(2.4, 2.5, "u+v", color="g", fontsize=18)
plt.grid()
plt.show()
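Vector addition is commutative, meaning that $\textbf{u} + \textbf{v} = \textbf{v} + \textbf{u}$. You can see it on
the previous image: following $\textbf{u}$ then $\textbf{v}$ leads to the same point as following $\textbf{v}$ then
$\textbf{u}$.

Vector addition is also associative, meaning that
$\textbf{u} + (\textbf{v} + \textbf{w}) = (\textbf{u} + \textbf{v}) + \textbf{w}$.
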
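Both properties are easy to check numerically. The snippet below is a minimal sketch: it reuses the vectors u and v from the text, while w is a hypothetical third vector introduced only for illustration.

```python
import numpy as np

u = np.array([2, 5])
v = np.array([3, 1])
w = np.array([-1, 4])  # hypothetical third vector, not taken from the text

# Commutativity: the order of the operands does not matter
commutative = np.array_equal(u + v, v + u)

# Associativity: the grouping of the operands does not matter
associative = np.array_equal(u + (v + w), (u + v) + w)

print(commutative, associative)
```

A check on a few integer vectors is of course not a proof, but it is a quick sanity test of the two identities.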
If you have a shape defined by a number of points (vectors), and you add a vector v to all of these points,
then the whole shape gets shifted by v . This is called a geometric translation:

t1 = np.array([2, 0.25])
t2 = np.array([2.5, 3.5])
t3 = np.array([1, 2])

x_coords, y_coords = zip(t1, t2, t3, t1)


plt.plot(x_coords, y_coords, "c--", x_coords, y_coords, "co")

plot_vector2d(v, t1, color="r", linestyle=":")


plot_vector2d(v, t2, color="r", linestyle=":")
plot_vector2d(v, t3, color="r", linestyle=":")

t1b = t1 + v
t2b = t2 + v
t3b = t3 + v

x_coords_b, y_coords_b = zip(t1b, t2b, t3b, t1b)


plt.plot(x_coords_b, y_coords_b, "b-", x_coords_b, y_coords_b, "bo")

plt.text(4, 4.2, "v", color="r", fontsize=18)


plt.text(3, 2.3, "v", color="r", fontsize=18)
plt.text(3.5, 0.4, "v", color="r", fontsize=18)

plt.axis([0, 6, 0, 5])
plt.grid()
plt.show()
Finally, subtracting a vector is like adding the opposite vector.

Differential Calculus
Calculus is the study of continuous change. It has two major subfields: differential calculus, which studies
the rate of change of functions, and integral calculus, which studies the area under the curve. In this
notebook, we will discuss the former.
Differential calculus is at the core of deep learning, so it is important to understand what derivatives and
gradients are, how they are used in deep learning, and understand what their limitations are.

Slope of a Straight Line


The slope of a (non-vertical) straight line can be calculated by taking any two points $A$ and $B$ on the line,
and computing the “rise over run”:

$$\text{slope} = \frac{\Delta y}{\Delta x} = \frac{\text{height}}{\text{width}} = \frac{\text{rise}}{\text{run}} = \frac{y_B - y_A}{x_B - x_A}$$

In this example, the height (rise) is 3, and the width (run) is 6, so the slope is $\frac{3}{6} = 0.5$.
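As a small illustration, the rise-over-run formula can be written as a function. This is only a sketch: the function name slope and the two sample points are hypothetical, chosen so that the rise is 3 and the run is 6 as in the example.

```python
def slope(point_a, point_b):
    """Rise over run between two points given as (x, y) pairs."""
    (x_a, y_a), (x_b, y_b) = point_a, point_b
    if x_b == x_a:
        raise ValueError("vertical line: the slope is undefined")
    return (y_b - y_a) / (x_b - x_a)

# Hypothetical points: rise = 2 - (-1) = 3, run = 4 - (-2) = 6
A = (-2.0, -1.0)
B = (4.0, 2.0)
print(slope(A, B))  # 0.5
```

Swapping A and B flips the sign of both the rise and the run, so the slope is unchanged.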
Defining the Slope of a Curve
Now, let’s try to figure out how we can compute the slope of something other than a straight line. For
example, let’s consider the curve defined by $y = f(x) = x^2$:

Obviously, the slope varies: on the left (i.e., when $x < 0$), the slope is negative (i.e., when we move from left
to right, the curve goes down), while on the right (i.e., when $x > 0$) the slope is positive (i.e., when we move
from left to right, the curve goes up). At the point $x = 0$, the slope is equal to 0 (i.e., the curve is locally flat).
The fact that the slope is 0 when we reach a minimum (or indeed a maximum) is crucially important, and
we will come back to it later.
How can we put numbers on these intuitions? Well, say we want to estimate the slope of the curve at a
point A , we can do this by taking another point B on the curve, not too far away, and then computing the
slope between these two points.
As you can see, when point $B$ is very close to point $A$, the $(AB)$ line becomes almost indistinguishable
from the curve itself (at least locally around point $A$). The $(AB)$ line gets closer and closer to the tangent
line to the curve at point $A$: this is the best linear approximation of the curve at point $A$.
So it makes sense to define the slope of the curve at point $A$ as the slope that the $(AB)$ line approaches
when $B$ gets infinitely close to $A$. This slope is called the derivative of the function $f$ at $x = x_A$. For
example, the derivative of the function $f(x) = x^2$ at $x = x_A$ is equal to $2x_A$ (we will see how to get this result
shortly), so on the graph above, since the point $A$ is located at $x_A = -1$, the tangent line to the curve at that
point has a slope of $-2$.
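To make this concrete, here is a minimal sketch that computes the slope of the (AB) line for the curve y = x² with A fixed at x = −1 and B moved progressively closer. The helper name secant_slope and the particular gaps are assumptions made for illustration.

```python
def f(x):
    return x ** 2

def secant_slope(func, x_a, x_b):
    # Slope of the straight line through (x_a, func(x_a)) and (x_b, func(x_b))
    return (func(x_b) - func(x_a)) / (x_b - x_a)

x_a = -1.0
for gap in (1.0, 0.1, 0.001):
    # As the gap shrinks, the slope gets closer and closer to -2
    print(gap, secant_slope(f, x_a, x_a + gap))
```

The printed slopes approach −2, the value the text gives for the tangent at that point.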


Differentiability
Note that some functions are not quite as well-behaved as $x^2$: for example, consider the function $f(x) = |x|$, the
absolute value of $x$:

No matter how much you zoom in on the origin (the point at $x = 0, y = 0$), the curve will always look like a V. The
slope is −1 for any $x < 0$, and it is +1 for any $x > 0$, but at $x = 0$, the slope is undefined, since it is not
possible to approximate the curve $y = |x|$ locally around the origin using a straight line, no matter how much
you zoom in on that point.

The function $f(x) = |x|$ is said to be non-differentiable at $x = 0$: its derivative is undefined at $x = 0$. This means
that the curve $y = |x|$ has an undefined slope at that point. However, the function $f(x) = |x|$ is differentiable at
all other points.


In order for a function $f(x)$ to be differentiable at some point $x_A$, the slope of the $(AB)$ line must approach
a single finite value as $B$ gets infinitely close to $A$.


This implies several constraints:
- First, the function must of course be defined at $x_A$. As a counterexample, the function $f(x) = \frac{1}{x}$ is
  undefined at $x_A = 0$, so it is not differentiable at that point.
- The function must also be continuous at $x_A$, meaning that as $x_B$ gets infinitely close to $x_A$, $f(x_B)$
  must also get infinitely close to $f(x_A)$. As a counterexample,

  $$f(x) = \begin{cases} -1 & \text{if } x < 0 \\ +1 & \text{if } x \geq 0 \end{cases}$$

  is not continuous at $x_A = 0$, even though it is defined at that point: indeed, when you
  approach it from the negative side, it does not approach infinitely close to $f(0) = +1$.
  Therefore, it is not continuous at that point, and thus not differentiable either.
- The function must not have a breaking point at $x_A$, meaning that the slope that the $(AB)$ line
  approaches as $B$ approaches $A$ must be the same whether $B$ approaches from the left side or from
  the right side. We already saw a counterexample with $f(x) = |x|$, which is both defined and
  continuous at $x_A = 0$, but which has a breaking point at $x_A = 0$: the slope of the curve $y = |x|$ is −1
  on the left, and +1 on the right.
- The curve $y = f(x)$ must not be vertical at point $A$. One counterexample is $f(x) = \sqrt[3]{x}$, the cube
  root of $x$: the curve is vertical at the origin, so the function is not differentiable at $x_A = 0$.
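The "breaking point" condition can be probed numerically by comparing the slope seen from the left with the slope seen from the right. The sketch below is a heuristic check, not a proof of (non-)differentiability; the helper one_sided_slopes and the step size are assumptions.

```python
def one_sided_slopes(func, x_a, eps=1e-6):
    # Slope of the (AB) line when B sits just left of A, then just right of A
    left = (func(x_a - eps) - func(x_a)) / (-eps)
    right = (func(x_a + eps) - func(x_a)) / eps
    return left, right

# |x| at 0: the two slopes disagree (-1 on the left, +1 on the right)
print(one_sided_slopes(abs, 0.0))

# x^2 at 0: both slopes are close to 0, so they agree
print(one_sided_slopes(lambda x: x ** 2, 0.0))
```

When the two values stay far apart no matter how small eps gets, the function has a breaking point there.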

Differentiating a Function
Now let’s see how to actually differentiate a function (i.e., find its derivative).
The derivative of a function $f(x)$ at $x = x_A$ is noted $f'(x_A)$, and it is defined as:

$$f'(x_A) = \lim_{x_B \to x_A} \frac{f(x_B) - f(x_A)}{x_B - x_A}$$

Don’t be scared, this is simpler than it looks! You may recognize the rise over run equation
$\frac{y_B - y_A}{x_B - x_A}$ that we discussed earlier. That’s just the slope of the $(AB)$ line. And the notation
$\lim_{x_B \to x_A}$ means that we are making $x_B$ approach infinitely close to $x_A$. So in plain English,
$f'(x_A)$ is the value that the slope of the $(AB)$ line approaches when $B$ gets infinitely close to $A$. This is
just a formal way of saying exactly the same thing as earlier.

Example: Finding the Derivative of $x^2$

Let’s look at a concrete example. Let’s see if we can determine what the slope of the $y = x^2$ curve is, at any
point $A$:

$$\begin{aligned}
f'(x_A) &= \lim_{x_B \to x_A} \frac{f(x_B) - f(x_A)}{x_B - x_A} \\
&= \lim_{x_B \to x_A} \frac{x_B^2 - x_A^2}{x_B - x_A} && \text{since } f(x) = x^2 \\
&= \lim_{x_B \to x_A} \frac{(x_B - x_A)(x_B + x_A)}{x_B - x_A} && \text{since } x_B^2 - x_A^2 = (x_B - x_A)(x_B + x_A) \\
&= \lim_{x_B \to x_A} (x_B + x_A) && \text{since the two } (x_B - x_A) \text{ cancel out} \\
&= \lim_{x_B \to x_A} x_B + \lim_{x_B \to x_A} x_A && \text{since the limit of a sum is the sum of the limits} \\
&= x_A + \lim_{x_B \to x_A} x_A && \text{since } x_B \text{ approaches } x_A \\
&= x_A + x_A && \text{since } x_A \text{ remains constant when } x_B \text{ approaches } x_A \\
&= 2x_A
\end{aligned}$$

That’s it! We just proved that the slope of $y = x^2$ at any point $A$ is $f'(x_A) = 2x_A$. What we have done is called
differentiation: finding the derivative of a function.


Note that we used a couple of important properties of limits. Here are the main properties you need to
know to work with derivatives:
- $\lim_{x \to k} c = c$: if $c$ is some constant value that does not depend on $x$, then the limit is just $c$.
- $\lim_{x \to k} x = k$: if $x$ approaches some value $k$, then the limit is $k$.
- $\lim_{x \to k} [f(x) + g(x)] = \lim_{x \to k} f(x) + \lim_{x \to k} g(x)$: the limit of a sum is the sum of the limits.
- $\lim_{x \to k} [f(x) \times g(x)] = \lim_{x \to k} f(x) \times \lim_{x \to k} g(x)$: the limit of a product is the product of the limits.


Important note: in Deep Learning, differentiation is almost always performed automatically by the
framework you are using (such as TensorFlow or PyTorch). This is called auto-differentiation. However, you
should still make sure you have a good understanding of derivatives, or else they will come and bite you
one day, for example when you use a square root in your cost function without realizing that its derivative
approaches infinity when $x$ approaches 0 (tip: you should use $\sqrt{x + \epsilon}$ instead, where $\epsilon$ is some small
constant, such as $10^{-4}$).
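The effect of the tip can be seen with a few numbers. By the power rule with r = 1/2, the derivative of √x is 1/(2√x), which blows up near 0. The sketch below compares it with the derivative of √(x + ε); the function name sqrt_derivative is hypothetical, and ε = 10⁻⁴ follows the suggestion in the text.

```python
import math

def sqrt_derivative(x, eps=0.0):
    """Derivative of sqrt(x + eps) with respect to x: 1 / (2 * sqrt(x + eps))."""
    return 1.0 / (2.0 * math.sqrt(x + eps))

for x in (1.0, 1e-4, 1e-8):
    # Without eps the derivative grows without bound as x shrinks;
    # with eps it stays below 1 / (2 * sqrt(eps)) = 50
    print(x, sqrt_derivative(x), sqrt_derivative(x, eps=1e-4))
```

Far from 0 the two derivatives are almost identical, so the extra ε changes very little where the function was already well-behaved.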


You will often find a slightly different (but equivalent) definition of the derivative. Let’s derive it from the
previous definition. First, let’s define $\epsilon = x_B - x_A$. Next, note that $\epsilon$ will approach 0 as $x_B$ approaches $x_A$. Lastly,
note that $x_B = x_A + \epsilon$. With that, we can reformulate the definition above like so:

$$f'(x_A) = \lim_{\epsilon \to 0} \frac{f(x_A + \epsilon) - f(x_A)}{\epsilon}$$

While we’re at it, let’s just rename $x_A$ to $x$, to get rid of the annoying subscript $A$ and make the equation
simpler to read:

$$f'(x) = \lim_{\epsilon \to 0} \frac{f(x + \epsilon) - f(x)}{\epsilon}$$

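Okay! Now let’s use this new definition to find the derivative of $f(x) = x^2$ at any point $x$, and (hopefully) we
should find the same result as above (except using $x$ instead of $x_A$):

$$\begin{aligned}
f'(x) &= \lim_{\epsilon \to 0} \frac{f(x + \epsilon) - f(x)}{\epsilon} \\
&= \lim_{\epsilon \to 0} \frac{(x + \epsilon)^2 - x^2}{\epsilon} && \text{since } f(x) = x^2 \\
&= \lim_{\epsilon \to 0} \frac{x^2 + 2x\epsilon + \epsilon^2 - x^2}{\epsilon} && \text{since } (x + \epsilon)^2 = x^2 + 2x\epsilon + \epsilon^2 \\
&= \lim_{\epsilon \to 0} \frac{2x\epsilon + \epsilon^2}{\epsilon} && \text{since the two } x^2 \text{ cancel out} \\
&= \lim_{\epsilon \to 0} (2x + \epsilon) && \text{since } 2x\epsilon \text{ and } \epsilon^2 \text{ can both be divided by } \epsilon \\
&= 2x
\end{aligned}$$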

As we see, this result matches the result we obtained earlier.

Notations

A word about notations: there are several other notations for the derivative that you will find in the
literature:

$$f'(x) = \frac{df(x)}{dx} = \frac{d}{dx} f(x)$$

This notation is also handy when a function is not named. For example $\frac{d}{dx}[x^2]$ refers to the derivative of the
function $x \mapsto x^2$.
Moreover, when people talk about the function $f(x)$, they sometimes leave out “$(x)$”, and they just talk
about the function $f$. When this is the case, the notation of the derivative is also simpler:

$$f' = \frac{df}{dx} = \frac{d}{dx} f$$

The $f'$ notation is Lagrange’s notation, while $\frac{df}{dx}$ is Leibniz’s notation.
There are also other less common notations, such as Newton’s notation $\dot{y}$ (assuming $y = f(x)$) or Euler’s
notation $\mathrm{D}f$.

Differentiation Rules
One very important rule is that the derivative of a sum is the sum of the derivatives. More precisely, if
we define $f(x) = g(x) + h(x)$, then $f'(x) = g'(x) + h'(x)$. This is quite easy to prove:

$$\begin{aligned}
f'(x) &= \lim_{\epsilon \to 0} \frac{f(x + \epsilon) - f(x)}{\epsilon} && \text{by definition} \\
&= \lim_{\epsilon \to 0} \frac{g(x + \epsilon) + h(x + \epsilon) - g(x) - h(x)}{\epsilon} && \text{using } f(x) = g(x) + h(x) \\
&= \lim_{\epsilon \to 0} \frac{g(x + \epsilon) - g(x) + h(x + \epsilon) - h(x)}{\epsilon} && \text{just moving terms around} \\
&= \lim_{\epsilon \to 0} \frac{g(x + \epsilon) - g(x)}{\epsilon} + \lim_{\epsilon \to 0} \frac{h(x + \epsilon) - h(x)}{\epsilon} && \text{since the limit of a sum is the sum of the limits} \\
&= g'(x) + h'(x) && \text{using the definitions of } g'(x) \text{ and } h'(x)
\end{aligned}$$

Similarly, it is possible to show the following important rules (I’ve included the proofs at the end of this
notebook, in case you’re curious):

| | Function $f$ | Derivative $f'$ |
|---|---|---|
| Constant | $f(x) = c$ | $f'(x) = 0$ |
| Sum | $f(x) = g(x) + h(x)$ | $f'(x) = g'(x) + h'(x)$ |
| Product | $f(x) = g(x)h(x)$ | $f'(x) = g(x)h'(x) + g'(x)h(x)$ |
| Quotient | $f(x) = \frac{g(x)}{h(x)}$ | $f'(x) = \frac{g'(x)h(x) - g(x)h'(x)}{h^2(x)}$ |
| Power | $f(x) = x^r$ with $r \neq 0$ | $f'(x) = rx^{r-1}$ |
| Exponential | $f(x) = \exp(x)$ | $f'(x) = \exp(x)$ |
| Logarithm | $f(x) = \ln(x)$ | $f'(x) = \frac{1}{x}$ |
| Sin | $f(x) = \sin(x)$ | $f'(x) = \cos(x)$ |
| Cos | $f(x) = \cos(x)$ | $f'(x) = -\sin(x)$ |
| Tan | $f(x) = \tan(x)$ | $f'(x) = \frac{1}{\cos^2(x)}$ |
| Chain Rule | $f(x) = g(h(x))$ | $f'(x) = g'(h(x))\,h'(x)$ |

Let’s try differentiating a simple function using the above rules: we will find the derivative of
$f(x) = x^3 + \cos(x)$. Using the rule for the derivative of sums, we find that
$f'(x) = \frac{d}{dx}[x^3] + \frac{d}{dx}[\cos(x)]$. Using the rule for the derivative of powers and for
the cos function, we find that $f'(x) = 3x^2 - \sin(x)$.

Let’s try a harder example: let’s find the derivative of $f(x) = \sin(2x^2) + 1$. First, let’s define
$u(x) = \sin(x) + 1$ and $v(x) = 2x^2$. Using the rule for sums, we find that
$u'(x) = \frac{d}{dx}[\sin(x)] + \frac{d}{dx}[1]$. Since the derivative of the sin function is cos, and the derivative of
constants is 0, we find that $u'(x) = \cos(x)$. Next, using the product rule, we find that
$v'(x) = 2\frac{d}{dx}[x^2] + \frac{d}{dx}[2]\,x^2$.
Since the derivative of a constant is 0, the second term cancels out. And since the power rule tells us that
the derivative of $x^2$ is $2x$, we find that $v'(x) = 4x$. Lastly, using the chain rule, since $f(x) = u(v(x))$, we find that
$f'(x) = u'(v(x))\,v'(x) = \cos(2x^2)\,4x$.
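A quick way to double-check a hand-derived result like this one is to compare it with a numerical estimate. The sketch below does so for f(x) = sin(2x²) + 1. It uses a central difference rather than the one-sided limit from the definition (an assumption made for better accuracy), and the helper name numerical_derivative is hypothetical.

```python
import math

def f(x):
    return math.sin(2 * x ** 2) + 1

def f_prime(x):
    # Result derived in the text with the chain rule: cos(2x^2) * 4x
    return math.cos(2 * x ** 2) * 4 * x

def numerical_derivative(func, x, eps=1e-6):
    # Central difference: slope between two points on either side of x
    return (func(x + eps) - func(x - eps)) / (2 * eps)

for x in (-1.0, 0.5, 2.0):
    # The two columns should agree to several decimal places
    print(x, f_prime(x), numerical_derivative(f, x))
```

If the two columns disagreed noticeably, that would point to a mistake in the symbolic derivation.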

The Chain Rule

Refer to our treatment of the chain rule here.

Derivatives and Optimization


When trying to optimize a function f (x) , we look for the values of x that minimize (or maximize) the
function.
It is important to note that when a function reaches a minimum or maximum, assuming it is differentiable at
that point, the derivative will necessarily be equal to 0. For example, you can check the above animation,
and notice that whenever the function f (in the upper graph) reaches a maximum or minimum, then the
derivative f ′ (in the lower graph) is equal to 0.
So one way to optimize a function is to differentiate it and analytically find all the values for which the
derivative is 0, then determine which of these values optimize the function (if any). For example, consider
the function $f(x) = \frac{1}{4}x^4 - x^2 + \frac{1}{2}$. Using the derivative rules (specifically, the sum rule, the product rule, the
power rule and the constant rule), we find that $f'(x) = x^3 - 2x$. We look for the values of $x$ for which
$f'(x) = 0$, so $x^3 - 2x = 0$, and therefore $x(x^2 - 2) = 0$. So $x = 0$, or $x = \sqrt{2}$ or $x = -\sqrt{2}$.
As you can see on the following graph of $f(x)$, these 3 values correspond to local extrema. Two global minima
$f(\sqrt{2}) = f(-\sqrt{2}) = -\frac{1}{2}$ and one local maximum $f(0) = \frac{1}{2}$.


If a function has a local extremum at a point $x_A$ and is differentiable at that point, then $f'(x_A) = 0$. However, the
reverse is not always true. For example, consider $f(x) = x^3$. Its derivative is $f'(x) = 3x^2$, which is equal to 0 at $x_A = 0$.
Yet, this point is not an extremum, as you can see on the following diagram. It’s just a single point where
the slope is 0.
So in short, you can optimize a function by analytically working out the points at which the derivative is 0,
and then investigating only these points. It’s a beautifully elegant solution, but it requires a lot of work, and
it’s not always easy, or even possible. For neural networks, it’s practically impossible.
Another option to optimize a function is to perform Gradient Descent (we will consider minimizing the
function, but the process would be almost identical if we tried to maximize a function instead): start at a
random point x0 , then use the function’s derivative to determine the slope at that point, and move a little bit
in the downwards direction, then repeat the process until you reach a local minimum, and cross your
fingers in the hope that this happens to be the global minimum.
At each iteration, the step size is proportional to the slope, so the process naturally slows down as it
approaches a local minimum. Each step is also proportional to the learning rate: a parameter of the
Gradient Descent algorithm itself (since it is not a parameter of the function we are optimizing, it is called a
hyperparameter).
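The procedure can be sketched in a few lines for the function f(x) = ¼x⁴ − x² + ½ discussed above, whose derivative is f′(x) = x³ − 2x. This is a minimal illustration, not a production optimizer: the learning rate of 0.1, the 200 steps, and the fixed (rather than random) starting points are all assumptions chosen to keep the run reproducible.

```python
def f_prime(x):
    # Derivative of f(x) = x**4 / 4 - x**2 + 1 / 2
    return x ** 3 - 2 * x

def gradient_descent(derivative, x0, learning_rate=0.1, n_steps=200):
    x = x0
    for _ in range(n_steps):
        # Step against the slope; the step shrinks as the slope approaches 0
        x = x - learning_rate * derivative(x)
    return x

print(gradient_descent(f_prime, x0=0.5))   # ends near sqrt(2)
print(gradient_descent(f_prime, x0=-0.5))  # ends near -sqrt(2)
```

Which minimum is reached depends on the starting point. Starting exactly at x = 0 would go nowhere, since the slope there is 0 even though that point is a local maximum.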

Higher Order Derivatives


What happens if we try to differentiate the function $f'(x)$? Well, we get the so-called second order
derivative, noted $f''(x)$, or $\frac{d^2f}{dx^2}$. If we repeat the process by differentiating $f''(x)$, we get the third-order
derivative $f'''(x)$, or $\frac{d^3f}{dx^3}$. And we could go on to get higher order derivatives.

What’s the intuition behind second order derivatives? Well, since the (first order) derivative represents the
instantaneous rate of change of $f$ at each point, the second order derivative represents the instantaneous
rate of change of the rate of change itself, in other words, you can think of it as the acceleration of the
curve: if $f''(x) < 0$, then the curve is accelerating “downwards”, if $f''(x) > 0$ then the curve is accelerating
“upwards”, and if $f''(x) = 0$, then the curve is locally a straight line. Note that a curve could be going upwards
(i.e., $f'(x) > 0$) but also be accelerating downwards (i.e., $f''(x) < 0$): for example, imagine the path of a stone
thrown upwards, as it is being slowed down by gravity (which constantly accelerates the stone
downwards).
Deep Learning generally only uses first order derivatives, but you will sometimes run into some
optimization algorithms or cost functions based on second order derivatives.
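The stone example can be checked numerically. In the sketch below, the height function (initial speed 20 m/s, gravity 9.8 m/s²) is a hypothetical choice, and the derivatives are estimated with standard central finite differences rather than computed symbolically.

```python
def first_derivative(func, x, eps=1e-4):
    # Central difference estimate of the slope
    return (func(x + eps) - func(x - eps)) / (2 * eps)

def second_derivative(func, x, eps=1e-4):
    # Central difference estimate of the rate of change of the slope
    return (func(x + eps) - 2 * func(x) + func(x - eps)) / eps ** 2

def height(t):
    # Hypothetical stone thrown upwards: 20 m/s initial speed, 9.8 m/s^2 gravity
    return 20 * t - 4.9 * t ** 2

# One second after the throw the stone is still rising (positive first
# derivative) while being accelerated downwards (negative second derivative)
print(first_derivative(height, 1.0))
print(second_derivative(height, 1.0))
```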

Partial Derivatives
Up to now, we have only considered functions with a single variable $x$. What happens when there are
multiple variables? For example, let’s start with a simple function with 2 variables: $f(x, y) = \sin(xy)$. If we plot this
function, using $z = f(x, y)$, we get the following 3D graph. I also plotted some point $A$ on the surface, along with
two lines I will describe shortly.


If you were to stand on this surface at point A and walk along the x axis towards the right (increasing x ),
your path would go down quite steeply (along the dashed blue line). The slope along this axis would be
negative. However, if you were to walk along the y axis, towards the back (increasing y ), then your path
would almost be flat (along the solid red line), at least locally: the slope along that axis, at point A , would
be very slightly positive.
As you can see, a single number is no longer sufficient to describe the slope of the function at a given
point. We need one slope for the $x$ axis, and one slope for the $y$ axis. One slope for each variable. To find
the slope along the $x$ axis, called the partial derivative of $f$ with regard to $x$, and noted $\frac{\partial f}{\partial x}$ (with curly
$\partial$), we can differentiate $f(x, y)$ with regard to $x$ while treating all other variables (in this case just $y$) as
constants:

$$\frac{\partial f}{\partial x} = \lim_{\epsilon \to 0} \frac{f(x + \epsilon, y) - f(x, y)}{\epsilon}$$

If you use the derivative rules listed earlier (in this example you would just need the product rule and the
chain rule), making sure to treat $y$ as a constant, then you will find:

$$\frac{\partial f}{\partial x} = y \cos(xy)$$

Similarly, the partial derivative of $f$ with regard to $y$ is defined as:

$$\frac{\partial f}{\partial y} = \lim_{\epsilon \to 0} \frac{f(x, y + \epsilon) - f(x, y)}{\epsilon}$$

All variables except for $y$ are treated like constants (just $x$ in this example). Using the derivative rules, we
get:

$$\frac{\partial f}{\partial y} = x \cos(xy)$$

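The two limit definitions translate directly into code: nudge one variable by a small ε while holding the other fixed. The sketch below compares such estimates with the closed-form results for f(x, y) = sin(xy). The helper names, the step size, and the evaluation point (1.5, 2.0) are assumptions for illustration.

```python
import math

def f(x, y):
    return math.sin(x * y)

def partial_x(func, x, y, eps=1e-6):
    # Vary x only; y is held constant
    return (func(x + eps, y) - func(x, y)) / eps

def partial_y(func, x, y, eps=1e-6):
    # Vary y only; x is held constant
    return (func(x, y + eps) - func(x, y)) / eps

x, y = 1.5, 2.0  # hypothetical evaluation point
print(partial_x(f, x, y), y * math.cos(x * y))  # estimate vs. y cos(xy)
print(partial_y(f, x, y), x * math.cos(x * y))  # estimate vs. x cos(xy)
```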
We now have equations to compute the slope along the x axis and along the y axis. But what about the
other directions? If you were standing on the surface at point A , you could decide to walk in any direction
you choose, not just along the x or y axes. What would the slope be then? Shouldn’t we compute the
slope along every possible direction?
Well, it can be shown that if all the partial derivatives are defined and continuous in a neighborhood around
point $A$, then the function $f$ is totally differentiable at that point, meaning that it can be locally
approximated by a plane $P_A$ (the tangent plane to the surface at point $A$). In this case, having just the
partial derivatives along each axis ($x$ and $y$ in our case) is sufficient to perfectly characterize that plane.
Its equation is:

$$z = f(x_A, y_A) + (x - x_A)\frac{\partial f}{\partial x}(x_A, y_A) + (y - y_A)\frac{\partial f}{\partial y}(x_A, y_A)$$

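To see how good the tangent plane approximation is, the sketch below evaluates both f(x, y) = sin(xy) and its tangent plane at points progressively closer to A. The point A = (1.5, 2.0) and the step sizes are hypothetical; the partial derivatives are the closed-form expressions y cos(xy) and x cos(xy).

```python
import math

def f(x, y):
    return math.sin(x * y)

def tangent_plane(x, y, x_a, y_a):
    # Partial derivatives of sin(xy), evaluated at the point A
    df_dx = y_a * math.cos(x_a * y_a)
    df_dy = x_a * math.cos(x_a * y_a)
    return f(x_a, y_a) + (x - x_a) * df_dx + (y - y_a) * df_dy

x_a, y_a = 1.5, 2.0  # hypothetical point A
for step in (0.1, 0.01):
    x, y = x_a + step, y_a + step
    # The gap between the surface and the plane shrinks quickly with the step
    print(step, f(x, y), tangent_plane(x, y, x_a, y_a))
```

At A itself the plane and the surface coincide exactly; the approximation is only local and degrades as you move away.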

In Deep Learning, we will generally be dealing with well­behaved functions that are totally differentiable at
any point where all the partial derivatives are defined, but you should know that some functions are not that
nice. For example, consider the function:

$$h(x, y) = \begin{cases} 0 & \text{if } x = 0 \text{ or } y = 0 \\ 1 & \text{otherwise} \end{cases}$$

At the origin (i.e., at $(x, y) = (0, 0)$), the partial derivatives of the function $h$ with respect to $x$ and $y$ are both perfectly
defined: they are equal to 0. Yet the function can clearly not be approximated by a plane at that point. It is
not totally differentiable at that point (but it is totally differentiable at any point off the axes).

Gradients
So far we have considered only functions with a single variable $x$, or with 2 variables, $x$ and $y$, but the
previous paragraph also applies to functions with more variables. So let’s consider a function $f$ with $n$
variables: $f(x_1, x_2, \dots, x_n)$. For convenience, we will define a vector $\textbf{X}$ whose components are these variables:

$$\textbf{X} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

Now $f(\textbf{X})$ is easier to write than $f(x_1, x_2, \dots, x_n)$.

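The gradient of the function $f(\textbf{X})$ at some point $\textbf{X}_A$ is the vector whose components are all the partial
derivatives of the function at that point. It is noted $\nabla f(\textbf{X}_A)$, or sometimes $\nabla_{\textbf{X}_A} f$:

$$\nabla f(\textbf{X}_A) = \begin{pmatrix} \frac{\partial f}{\partial x_1}(\textbf{X}_A) \\ \frac{\partial f}{\partial x_2}(\textbf{X}_A) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\textbf{X}_A) \end{pmatrix}$$
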
Assuming the function is totally differentiable at the point XA , then the surface it describes can be
approximated by a plane at that point (as discussed in the previous section), and the gradient vector is the
one that points towards the steepest slope on that plane.

Gradient Descent, Revisited

In Deep Learning, the Gradient Descent algorithm we discussed earlier is based on gradients instead of
derivatives (hence its name). It works in much the same way, but using vectors instead of scalars: simply
start with a random vector $\textbf{X}_0$, then compute the gradient of $f$ at that point, and perform a small step in
the opposite direction, then repeat until convergence. More precisely, at each step $t$, compute
$\textbf{X}_t = \textbf{X}_{t-1} - \eta \nabla f(\textbf{X}_{t-1})$. The constant $\eta$ is the learning rate, typically a small value such as $10^{-3}$. In
practice, we generally use more efficient variants of this algorithm, but the general idea remains the same.
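The update rule above can be sketched with NumPy as follows. The function being minimized, f(X) = (x₁ − 1)² + 2(x₂ + 3)², is a hypothetical example with a known minimum at (1, −3). The learning rate of 0.1 (larger than the typical value quoted above, so that the run converges in few steps), the fixed starting point, and the step count are likewise assumptions.

```python
import numpy as np

def grad_f(X):
    # Gradient of the hypothetical function f(X) = (x1 - 1)^2 + 2 * (x2 + 3)^2
    return np.array([2 * (X[0] - 1), 4 * (X[1] + 3)])

def gradient_descent(grad, X0, eta=0.1, n_steps=500):
    X = np.asarray(X0, dtype=float)
    for _ in range(n_steps):
        # X_t = X_{t-1} - eta * grad f(X_{t-1})
        X = X - eta * grad(X)
    return X

X_min = gradient_descent(grad_f, X0=[5.0, 5.0])
print(X_min)  # close to [1, -3]
```

The only change from the single-variable version is that the position and the slope are now vectors; the loop itself is identical.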
In Deep Learning, the letter $\textbf{X}$ is generally used to represent the input data. When you use a neural
network to make predictions, you feed the neural network the inputs $\textbf{X}$, and you get back a prediction
$\hat{y} = f(\textbf{X})$. The function $f$ treats the model parameters as constants. We can use more explicit notation by
writing $\hat{y} = f_{\textbf{w}}(\textbf{X})$, where $\textbf{w}$ represents the model parameters and indicates that the function relies on them, but
treats them as constants.


However, when training a neural network, we do quite the opposite: all the training examples are grouped in a matrix $X$, all the labels are grouped in a vector $y$, and both $X$ and $y$ are treated as constants, while $w$ is treated as variable: specifically, we try to minimize the cost function $L_{X,y}(w) = g(f_X(w), y)$, where $g$ is a function that measures the "discrepancy" between the predictions $f_X(w)$ and the labels $y$, and where $f_X(w)$ represents the vector containing the predictions for each training example. Minimizing the loss function is usually performed using Gradient Descent (or a variant of GD): we start with random model parameters $w_0$, then we compute $\nabla L(w_0)$ and we use this gradient vector to perform a Gradient Descent step, then we repeat the process until convergence. It is crucial to understand that the gradient of the loss function is taken with respect to the model parameters $w$ (not the inputs $X$).

Jacobians
Until now we have only considered functions that output a scalar, but it is possible to output vectors instead. For example, a classification neural network typically outputs one probability for each class, so if there are $m$ classes, the neural network will output an $m$-dimensional vector for each input.
In Deep Learning we generally only need to differentiate the loss function, which almost always outputs a single scalar number. But suppose for a second that you want to differentiate a function $f(X)$ which outputs $m$-dimensional vectors. The good news is that you can treat each output dimension independently of the others. This will give you a partial derivative for each input dimension and each output dimension. If you put them all in a single matrix, with one column per input dimension and one row per output dimension, you get the so-called Jacobian matrix.

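$$J_f(X_A) = \begin{pmatrix}
\dfrac{\partial f_1}{\partial x_1}(X_A) & \dfrac{\partial f_1}{\partial x_2}(X_A) & \dots & \dfrac{\partial f_1}{\partial x_n}(X_A) \\
\dfrac{\partial f_2}{\partial x_1}(X_A) & \dfrac{\partial f_2}{\partial x_2}(X_A) & \dots & \dfrac{\partial f_2}{\partial x_n}(X_A) \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1}(X_A) & \dfrac{\partial f_m}{\partial x_2}(X_A) & \dots & \dfrac{\partial f_m}{\partial x_n}(X_A)
\end{pmatrix}$$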

The partial derivatives themselves are often called the Jacobians. It’s just the first order partial derivatives
of the function f .
Hessians
Let's come back to a function $f(X)$ which takes an $n$-dimensional vector as input and outputs a scalar. If you determine the equation of the partial derivative of $f$ with respect to $x_i$ (the $i$-th component of $X$), you will get a new function of $X$: $\dfrac{\partial f}{\partial x_i}$. You can then compute the partial derivative of this function with respect to $x_j$ (the $j$-th component of $X$). The result is a partial derivative of a partial derivative: in other words, it is a second order partial derivative, also called a Hessian. It is noted $\dfrac{\partial^2 f}{\partial x_j \partial x_i}$. If $i \neq j$ then it is called a mixed second order partial derivative. Or else, if $j = i$, it is noted $\dfrac{\partial^2 f}{\partial x_i^2}$.

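Let's look at an example: $f(x, y) = \sin(xy)$. As we showed earlier, the first order partial derivatives of $f$ are $\dfrac{\partial f}{\partial x} = y\cos(xy)$ and $\dfrac{\partial f}{\partial y} = x\cos(xy)$. So we can now compute all the Hessians (using the derivative rules we discussed earlier):

$$\frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\left[y\cos(xy)\right] = -y^2\sin(xy)$$

$$\frac{\partial^2 f}{\partial y \, \partial x} = \frac{\partial}{\partial y}\left[y\cos(xy)\right] = \cos(xy) - xy\sin(xy)$$

$$\frac{\partial^2 f}{\partial x \, \partial y} = \frac{\partial}{\partial x}\left[x\cos(xy)\right] = \cos(xy) - xy\sin(xy)$$

$$\frac{\partial^2 f}{\partial y^2} = \frac{\partial}{\partial y}\left[x\cos(xy)\right] = -x^2\sin(xy)$$

Note that $\dfrac{\partial^2 f}{\partial x \, \partial y} = \dfrac{\partial^2 f}{\partial y \, \partial x}$. This is the case whenever all the partial derivatives are defined and continuous in a neighborhood around the point at which we differentiate.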
The matrix containing all the Hessians is called the Hessian matrix:

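$$H_f(X_A) = \begin{pmatrix}
\dfrac{\partial^2 f}{\partial x_1^2}(X_A) & \dfrac{\partial^2 f}{\partial x_1 \partial x_2}(X_A) & \dots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n}(X_A) \\
\dfrac{\partial^2 f}{\partial x_2 \partial x_1}(X_A) & \dfrac{\partial^2 f}{\partial x_2^2}(X_A) & \dots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n}(X_A) \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_n \partial x_1}(X_A) & \dfrac{\partial^2 f}{\partial x_n \partial x_2}(X_A) & \dots & \dfrac{\partial^2 f}{\partial x_n^2}(X_A)
\end{pmatrix}$$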

There are great optimization algorithms which take advantage of the Hessians, but in practice Deep Learning almost never uses them. Indeed, if a function has $n$ variables, there are $n^2$ Hessians: since neural networks typically have several millions of parameters, the number of Hessians would exceed thousands of billions. Even if we had the necessary amount of RAM, the computations would be prohibitively slow.
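As a sanity check on the $\sin(xy)$ example, the four second order partial derivatives can be approximated with central finite differences and compared with the closed forms. The sketch below is an illustration only: the helper name `hessian_2d`, the step size `h` and the evaluation point are assumptions, and the helper computes a single mixed derivative because it assumes the Hessian is symmetric.

```python
import math


def f(x, y):
    return math.sin(x * y)


def hessian_2d(func, x, y, h=1e-4):
    """Approximate the 2x2 Hessian of func at (x, y) with central differences.

    Assumes the mixed partial derivatives are equal, so only one is computed.
    """
    fxx = (func(x + h, y) - 2 * func(x, y) + func(x - h, y)) / h**2
    fyy = (func(x, y + h) - 2 * func(x, y) + func(x, y - h)) / h**2
    fxy = (func(x + h, y + h) - func(x + h, y - h)
           - func(x - h, y + h) + func(x - h, y - h)) / (4 * h**2)
    return [[fxx, fxy], [fxy, fyy]]


x, y = 0.7, 1.3  # arbitrary evaluation point
mixed = math.cos(x * y) - x * y * math.sin(x * y)
H_exact = [[-y**2 * math.sin(x * y), mixed],
           [mixed, -x**2 * math.sin(x * y)]]
H_num = hessian_2d(f, x, y)

for row_exact, row_num in zip(H_exact, H_num):
    print(row_exact, row_num)
```

Each Hessian entry costs a handful of function evaluations here, which hints at why storing and computing all $n^2$ entries becomes impractical for millions of parameters.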
Examples

Derivative 1: Composed Exponential Function


$$f(x) = e^{x^2}, \quad x \in \mathbb{R}$$

The exponential function is a very foundational, common, and useful example. It is a strictly positive function, i.e. $e^x > 0$ in $\mathbb{R}$, and an important property to remember is that $e^0 = 1$. In addition, you should remember that the exponential is the inverse of the logarithmic function. It is also one of the easiest functions to differentiate because its derivative is simply the exponential itself, i.e. $(e^x)' = e^x$. The derivative becomes trickier when the exponential is combined with another function. In such cases, we use the chain rule formula, which states that the derivative of $f(g(x))$ is equal to $f'(g(x)) \cdot g'(x)$, i.e.,

$$\frac{\partial f(g(x))}{\partial x} = f'(g(x)) \, \frac{\partial g}{\partial x}$$

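Applying the chain rule, we can compute the derivative of $f(x) = e^{x^2}$. We first differentiate the inner function $g(x) = x^2$, which gives $g'(x) = 2x$, and recall that $(e^x)' = e^x$. Multiplying these two intermediate results, we obtain,

$$\frac{\partial e^{x^2}}{\partial x} = \frac{\partial x^2}{\partial x} \times e^{x^2} = 2x \times e^{x^2}$$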

Derivative 2. Function with Variable Base and Variable Exponent


$$f(x) = x^x, \quad x \in \mathbb{R}^+$$

This function is a classic in interviews, especially in the financial/quant industry, where math skills are
tested in even greater depth than in tech companies for machine learning positions. It sometimes brings
the interviewees out of their comfort zone, but really, the hardest part of this question is to be able to start
correctly.
The most important thing to realize when approaching a function in such exponential form is, first, the inverse relationship between exponential and logarithm, and, second, the fact that every exponential function can be rewritten as a natural exponential function in the form of,

$$\forall a \in \mathbb{R}^+, \ \forall b \in \mathbb{R}, \quad a^b = e^{b\ln(a)}$$

Before we get to our $f(x) = x^x$ example, let us demonstrate this property with a simpler function $f(x) = 2^x$. We first use the above equation to rewrite $2^x$ as $\exp(x\ln(2))$ and subsequently apply the chain rule.

$$\frac{\partial 2^x}{\partial x} = \frac{\partial e^{x\ln(2)}}{\partial x} = \frac{\partial \, x\ln(2)}{\partial x} \times e^{x\ln(2)} = \ln(2) \, e^{x\ln(2)} = \ln(2) \, 2^x$$

Going back to the original function $f(x) = x^x$, once you rewrite the function as $f(x) = \exp(x\ln x)$, the derivative becomes relatively straightforward to compute, with the only potentially difficult part being the chain rule step.

$$\frac{\partial x^x}{\partial x} = \frac{\partial e^{x\ln(x)}}{\partial x} = \frac{\partial \, x\ln(x)}{\partial x} \times e^{x\ln(x)} = \left(\ln(x) + \frac{x}{x}\right) x^x = x^x \left(1 + \ln(x)\right)$$

Note that here we used the product rule $(uv)' = u'v + uv'$ for the exponent $x\ln(x)$.

This function is generally asked without any information on the function's domain. If your interviewer doesn't specify the domain by default, they might be testing your mathematical acuity. Here is where the question gets deceiving. Without being specific about the domain, it seems that $x^x$ is defined for both positive and negative values. However, for negative $x$, e.g. $(-0.9)^{-0.9}$, the result is a complex number, concretely $-1.05 - 0.34i$. A potential way out would be to define the domain of the function as $\mathbb{Z}^- \cup \mathbb{R}^+ \setminus \{0\}$ (see here for further discussion), but this would still not be differentiable for negative values. Therefore, in order to properly define the derivative of $x^x$, we need to restrict the domain to only strictly positive values. We exclude 0 because for a derivative to be defined in 0, we need the limit derivative from the left (limit in 0 for negative values) to be equal to the limit derivative from the right (limit in 0 for positive values), a condition that is broken in this case. Since the left limit $\lim_{\Delta x \to 0^-} \dfrac{f(0 + \Delta x) - f(0)}{\Delta x}$ is undefined, the function is not differentiable in 0, and thus the function's domain is restricted to only positive values.


Before we move on to the next section, I leave you with a slightly more advanced version of this function to test your understanding: $f(x) = x^{x^2}$. If you understood the logic and steps behind the first example, adding the extra exponent shouldn't cause any difficulties and you should conclude the following result:

$$\frac{\partial x^{x^2}}{\partial x} = \frac{\partial e^{x^2\ln(x)}}{\partial x} = \frac{\partial \, x^2\ln(x)}{\partial x} \times e^{x^2\ln(x)} = \left(x^2 \cdot \frac{1}{x} + 2x\ln(x)\right) e^{x^2\ln(x)} = x^{x^2+1}\left(1 + 2\ln(x)\right)$$
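One way to gain confidence in closed-form results like these is to compare them with a numerical derivative. The snippet below is a rough check, not a proof: the helper `numeric_derivative`, the step size and the sample points are assumptions made for illustration, and the sample points are kept strictly positive to stay inside the domain discussed above.

```python
import math


def numeric_derivative(func, x, h=1e-6):
    """Central-difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)


# (function, closed-form derivative) pairs for x^x and x^(x^2).
cases = {
    "x^x": (lambda x: x ** x,
            lambda x: x ** x * (1 + math.log(x))),
    "x^(x^2)": (lambda x: x ** (x ** 2),
                lambda x: x ** (x ** 2 + 1) * (1 + 2 * math.log(x))),
}

max_abs_error = 0.0
for name, (func, closed_form) in cases.items():
    for x in (0.5, 1.0, 2.0):  # strictly positive sample points
        error = abs(numeric_derivative(func, x) - closed_form(x))
        max_abs_error = max(max_abs_error, error)
        print(f"{name} at x={x}: closed form {closed_form(x):.6f}, error {error:.2e}")
```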

Derivative 3: Gradient of a Multi­Dimensional Input Function


$$f(x, y, z) = 2^{xy} + z\cos(x), \quad (x, y, z) \in \mathbb{R}^3$$

So far, the functions discussed in the first and second derivative sections are functions mapping from $\mathbb{R}$ to $\mathbb{R}$, i.e. the domain as well as the range of the function are real numbers. But machine learning is essentially vectorial and the functions are multi-dimensional. A good example of such multidimensionality is a neural network layer of input size $m$ and output size $k$, i.e., $f(x) = g(W^T x + b)$, which is an element-wise composition of a linear mapping $W^T x$ (with weight matrix $W$ and input vector $x$) and a non-linear mapping $g$ (activation function). In the general case, this can also be viewed as a mapping from $\mathbb{R}^m$ to $\mathbb{R}^k$.

In the specific case of $k = 1$, the derivative is called the gradient. Let us now compute the derivative of the following three-dimensional function mapping $\mathbb{R}^3$ to $\mathbb{R}$:

$$f(x, y, z) = 2^{xy} + z\cos(x)$$

You can think of $f$ as a function mapping a vector of size 3 to a vector of size 1.
The derivative of a multi-dimensional input function is called a gradient. The gradient of a function $g$ that maps $\mathbb{R}^n$ to $\mathbb{R}$ is a set of $n$ partial derivatives of $g$, where each partial derivative is a function of $n$ variables. Thus, if $g$ is a mapping from $\mathbb{R}^n$ to $\mathbb{R}$, its gradient $\nabla g$ is a mapping from $\mathbb{R}^n$ to $\mathbb{R}^n$.

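To find the gradient of our function $f(x, y, z) = 2^{xy} + z\cos(x)$, we construct the vector of partial derivatives $\partial f/\partial x$, $\partial f/\partial y$ and $\partial f/\partial z$, and obtain the following result:

$$\nabla f(x, y, z) = \begin{bmatrix} \dfrac{\partial f}{\partial x} \\ \dfrac{\partial f}{\partial y} \\ \dfrac{\partial f}{\partial z} \end{bmatrix} = \begin{bmatrix} \ln(2) \, y \, 2^{xy} - z\sin(x) \\ \ln(2) \, x \, 2^{xy} \\ \cos(x) \end{bmatrix}$$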










Note that this is an example similar to the previous section and we use the equivalence $2^{xy} = \exp(xy\ln(2))$.

In conclusion, for a multi-dimensional function that maps $\mathbb{R}^3$ to $\mathbb{R}$, the derivative is a gradient $\nabla f$, which maps $\mathbb{R}^3$ to $\mathbb{R}^3$.

In the general form of mappings $\mathbb{R}^m$ to $\mathbb{R}^k$ where $k > 1$, the derivative of a multi-dimensional function that maps $\mathbb{R}^m$ to $\mathbb{R}^k$ is a Jacobian matrix (instead of a gradient vector). Let us investigate this in the next section.
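The gradient of $f(x, y, z) = 2^{xy} + z\cos(x)$ can be cross-checked numerically by perturbing one input at a time. This is a minimal sketch under assumptions: `numeric_grad`, the step size `h` and the evaluation point are illustrative choices, not part of the original derivation.

```python
import math


def f(x, y, z):
    return 2 ** (x * y) + z * math.cos(x)


def grad_f(x, y, z):
    """Closed-form gradient of f, one entry per input variable."""
    return [
        math.log(2) * y * 2 ** (x * y) - z * math.sin(x),  # df/dx
        math.log(2) * x * 2 ** (x * y),                    # df/dy
        math.cos(x),                                       # df/dz
    ]


def numeric_grad(func, point, h=1e-6):
    """Central-difference gradient: perturb one coordinate at a time."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grad.append((func(*up) - func(*down)) / (2 * h))
    return grad


point = (1.0, 2.0, 0.5)  # arbitrary evaluation point
analytic = grad_f(*point)
numeric = numeric_grad(f, point)
print(analytic)
print(numeric)
```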

Derivative 4. Jacobian of a Multi­Dimensional Input and Output Function


$$f(x, y) = \begin{bmatrix} 2x^2 \\ x\sqrt{y} \end{bmatrix}, \quad x \in \mathbb{R}, \ y \in \mathbb{R}^+$$

We know from the previous section that the derivative of a function mapping $\mathbb{R}^m$ to $\mathbb{R}$ is a gradient mapping $\mathbb{R}^m$ to $\mathbb{R}^m$. But what about the case where the output domain is also multi-dimensional, i.e. a mapping from $\mathbb{R}^m$ to $\mathbb{R}^k$ for $k > 1$?
In such a case, the derivative is called the Jacobian matrix. We can view the gradient simply as a special case of the Jacobian with dimension $m \times 1$, with $m$ equal to the number of variables. The Jacobian $J(g)$ of a function $g$ mapping $\mathbb{R}^m$ to $\mathbb{R}^k$ has a dimension of $k \times m$, i.e. it is a matrix of shape $k \times m$. In other words, each row $i$ of $J(g)$ represents the gradient $\nabla g_i$ of each sub-function $g_i$ of $g$.
Let us differentiate the above defined function $f(x, y) = [2x^2, \ x\sqrt{y}]$ mapping $\mathbb{R}^2$ to $\mathbb{R}^2$; thus both input and output domains are multidimensional. In this particular case, since the square root function is not defined for negative values, we need to restrict the domain of $y$ to $\mathbb{R}^+$. The first row of our output Jacobian will be the derivative of function 1, i.e. $\nabla 2x^2$, and the second row the derivative of function 2, i.e. $\nabla x\sqrt{y}$.

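$$\nabla f(x, y) = J_f(x, y) = \begin{bmatrix} \dfrac{\partial f_1}{\partial x} & \dfrac{\partial f_1}{\partial y} \\ \dfrac{\partial f_2}{\partial x} & \dfrac{\partial f_2}{\partial y} \end{bmatrix} = \begin{bmatrix} 4x & 0 \\ \sqrt{y} & \dfrac{x}{2\sqrt{y}} \end{bmatrix}$$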

In deep learning, an example where the Jacobian is of special interest is in the explainability field (see, for
example, Sensitivity based Neural Networks Explanations) that aims to understand the behavior of neural
networks and analyses sensitivity of the output layer of neural networks with regard to the inputs. The
Jacobian helps to investigate the impact of variation in the input space on the output. This can analogously
be applied to understand the concepts of intermediate layers in neural networks.
In summary, remember that while gradient is a derivative of a scalar with regard to a vector, Jacobian is a
derivative of a vector with regard to another vector.
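To make the "one row per output, one column per input" layout concrete, the Jacobian of $f(x, y) = [2x^2, \ x\sqrt{y}]$ can be compared with a finite-difference approximation. The sketch below is illustrative only: `numeric_jacobian`, the step size and the evaluation point (chosen with $y > 0$) are assumptions.

```python
import math


def f(x, y):
    """Vector-valued function with two outputs: [2x^2, x*sqrt(y)]."""
    return [2 * x**2, x * math.sqrt(y)]


def jacobian_f(x, y):
    """Closed-form Jacobian: row i is the gradient of output i."""
    return [[4 * x, 0.0],
            [math.sqrt(y), x / (2 * math.sqrt(y))]]


def numeric_jacobian(func, point, h=1e-6):
    """Central-difference Jacobian; column j perturbs input j."""
    n_out = len(func(*point))
    jac = [[0.0] * len(point) for _ in range(n_out)]
    for j in range(len(point)):
        up, down = list(point), list(point)
        up[j] += h
        down[j] -= h
        f_up, f_down = func(*up), func(*down)
        for i in range(n_out):
            jac[i][j] = (f_up[i] - f_down[i]) / (2 * h)
    return jac


point = (1.5, 4.0)  # y must be strictly positive
J_exact = jacobian_f(*point)  # [[6.0, 0.0], [2.0, 0.375]]
J_num = numeric_jacobian(f, point)
print(J_exact)
print(J_num)
```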

Derivative 5. Hessian of a Multi­Dimensional Input Function


$$f(x, y) = x^2 y^3, \quad (x, y) \in \mathbb{R}^2$$

So far, our discussion has only been focused on first-order derivatives, but in neural networks we often talk about higher-order derivatives of multidimensional functions. A specific case is the second derivative, also called the Hessian matrix, and denoted $H(f)$ or $\nabla^2$ (nabla squared). The Hessian of a function $g$ mapping $\mathbb{R}^n$ to $\mathbb{R}$ is a mapping $H(g)$ from $\mathbb{R}^n$ to $\mathbb{R}^{n \times n}$.

Let us analyze how we went from $\mathbb{R}$ to $\mathbb{R}^{n \times n}$ on the output domain. The first derivative, i.e. the gradient $\nabla g$, is a mapping from $\mathbb{R}^n$ to $\mathbb{R}^n$ and its derivative is a Jacobian. Thus, the differentiation of each sub-function $\nabla g_i$ results in a mapping of $\mathbb{R}^n$ to $\mathbb{R}^n$, with $n$ such functions. You can think of this as if differentiating each element of the gradient vector expanded it into a vector, thus producing a vector of vectors, i.e. a matrix.
To compute the Hessian, we need to calculate so-called cross-derivatives, that is, differentiate first with respect to $x$ and then with respect to $y$, or vice-versa. One might ask if the order in which we take the cross derivatives matters; in other words, if the Hessian matrix is symmetric or not. In cases where the function $f$ is $C^2$, i.e. twice continuously differentiable, Schwarz's theorem states that the cross derivatives are equal and thus the Hessian matrix is symmetric. Some functions whose second-order partial derivatives exist but are not continuous do not satisfy the equality of cross-derivatives.
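Constructing the Hessian of a function is equal to finding the second-order partial derivatives of a scalar-valued function. For the specific example $f(x, y) = x^2 y^3$, the computation yields the following result:

$$\nabla^2 f(x, y) = H_f(x, y) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \partial y} \\ \dfrac{\partial^2 f}{\partial y \partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2y^3 & 6xy^2 \\ 6xy^2 & 6yx^2 \end{bmatrix}$$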

You can see that the cross-derivatives $6xy^2$ are in fact equal. We first differentiated with respect to $x$ and obtained $2xy^3$, then again with respect to $y$, obtaining $6xy^2$. The diagonal elements are simply $f_i''$ for each mono-dimensional subfunction of either $x$ or $y$.
An extension would be to discuss the case of second order derivatives for multi-dimensional functions mapping $\mathbb{R}^m$ to $\mathbb{R}^k$, which can intuitively be seen as a second-order Jacobian. This is a mapping from $\mathbb{R}^m$ to $\mathbb{R}^{k \times m \times m}$, i.e. a 3D tensor. Similarly to the Hessian, in order to find the gradient of the Jacobian (differentiate a second time), we differentiate each element of the $k \times m$ matrix and obtain a matrix of vectors, i.e. a tensor. While it is rather unlikely that you would be asked to do such a computation manually, it is important to be aware of higher-order derivatives for multidimensional functions.

Probability Theory

Concepts

Chance Events

Randomness is all around us. Probability theory is the mathematical framework that allows us to analyze
chance events in a logically sound manner. The probability of an event is a number indicating how likely
that event will occur. This number is always between 0 and 1, where 0 indicates impossibility and 1
indicates certainty.
A classic example of a probabilistic experiment is a fair coin toss, in which the two possible outcomes are heads or tails. In this case, the probability of flipping a head or a tail is $\frac{1}{2}$. In an actual series of coin tosses, we may get more or less than exactly 50% heads. But as the number of flips increases, the long-run frequency of heads is bound to get closer and closer to 50%.
For an unfair or weighted coin, the two outcomes are not equally likely, in which case, you’ll need to assign
appropriate weights to each of the outcomes. If we assign numbers to the outcomes — say, 1 for heads, 0
for tails — then we have created the mathematical object known as a random variable.

Expectation

The expectation of a random variable is a number that attempts to capture the center of that random variable's distribution. It can be interpreted as the long-run average of many independent samples from the given distribution. More precisely, it is defined as the probability-weighted sum of all possible values in the random variable's support,

$$E[X] = \sum_{x \in \mathcal{X}} x \, P(x)$$

Consider the probabilistic experiment of rolling a fair die. After a sufficiently large number of iterations, the
running sample mean converges to the expectation of 3.5. Changing the distribution of the different faces
of the die (thus making the die biased or “unfair”) would affect the expected value.

Variance

Whereas expectation provides a measure of centrality, the variance of a random variable quantifies the
spread of that random variable’s distribution. The variance is the average value of the squared difference
between the random variable and its expectation,

$$\mathrm{Var}(X) = E\left[(X - E[X])^2\right]$$

When you draw cards randomly from a deck of ten cards, you’ll observe that the running average of
squared differences begins to resemble the true variance.
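Both definitions can be checked on the fair die from the expectation section. The sketch below computes the exact expectation and variance with rational arithmetic and then compares them with running estimates from simulated rolls; the seed and the number of rolls are arbitrary assumptions.

```python
import random
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, 6)  # fair die: every face equally likely

# Probability-weighted sums, straight from the definitions.
expectation = sum(x * prob for x in faces)                     # 7/2
variance = sum((x - expectation) ** 2 * prob for x in faces)   # 35/12

# Monte Carlo estimates from simulated rolls.
rng = random.Random(0)
rolls = [rng.choice(faces) for _ in range(100_000)]
sample_mean = sum(rolls) / len(rolls)
sample_var = sum((r - sample_mean) ** 2 for r in rolls) / len(rolls)

print(expectation, variance)
print(sample_mean, sample_var)
```

Changing the probabilities assigned to each face (an "unfair" die) would change both the exact values and what the simulation converges to.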

Set Theory

A set, broadly defined, is a collection of objects. In the context of probability theory, we use set notation to specify compound events. For example, we can represent the event "roll an even number" by the set $\{2, 4, 6\}$. For this reason it is important to be familiar with the algebra of sets.

Sets are usually visualized as Venn diagrams.

Counting

It can be surprisingly difficult to count the number of sequences or sets satisfying certain conditions.
For example, consider a bag of marbles in which each marble is a different color. If we draw marbles one at a time from the bag without replacement, and there are four unique marbles in the bag, how many different ordered sequences (permutations) of the marbles are possible? How many different unordered sets (combinations)?
Permutations with $n = 4$ and $x = 1$ are written ${}^nP_x$.
Combinations with $n = 4$ and $x = 1$ are written $\binom{n}{x}$ or ${}^nC_x$.
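For a bag this small, the counts can be computed directly and confirmed by brute-force enumeration. In the sketch below the marble colours are made up for illustration; `math.perm` and `math.comb` are the standard-library functions for ${}^nP_x$ and ${}^nC_x$ (Python 3.8+).

```python
import math
from itertools import combinations, permutations

marbles = ["red", "green", "blue", "yellow"]  # assumed colours, n = 4
n = len(marbles)

counts = {}
for x in range(1, n + 1):
    n_perm = math.perm(n, x)  # ordered draws of x marbles
    n_comb = math.comb(n, x)  # unordered draws of x marbles
    # Brute-force check by listing every possibility.
    assert n_perm == len(list(permutations(marbles, x)))
    assert n_comb == len(list(combinations(marbles, x)))
    counts[x] = (n_perm, n_comb)

print(counts)
```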

Conditional Probability

Conditional probabilities allow us to account for information we have about our system of interest. For
example, we might expect the probability that it will rain tomorrow (in general) to be smaller than the
probability it will rain tomorrow given that it is cloudy today. This latter probability is a conditional probability,
since it accounts for relevant information that we possess.
Mathematically, computing a conditional probability amounts to shrinking our sample space to a particular
event. So in our rain example, instead of looking at how often it rains on any day in general, we “pretend”
that our sample space consists of only those days for which the previous day was cloudy. We then
determine how many of those days were rainy.

Probability Distributions
A probability distribution specifies the relative likelihoods of all possible outcomes. Before we dive into
some common probability distributions, let’s go over the associated terminologies.

Random Variables

Formally, a random variable is a function that assigns a real number to each outcome in the probability
space. By sampling from the probability space associated with your distribution, you can generate the
empirical distribution of your random variable.

Central Limit Theorem

The Central Limit Theorem (CLT) states that the sample mean of a sufficiently large number of independent and identically distributed (i.i.d.) random variables is approximately normally distributed. The larger the sample size, the better the approximation.

Types of Probability Distributions

There are two major classes of probability distributions:


Discrete
Continuous
Note that the discrete distributions are defined by the probability mass function (PMF) while the continuous
ones are defined by the probability density function (PDF), as we’ll see in the below section.

Discrete

A discrete random variable has a finite or countable number of possible values.


If $X$ is a discrete random variable, then there exist unique non-negative functions, $f(x)$ and $F(x)$, such that the following are true:

$$P(X = x) = f(x)$$

$$P(X \le x) = F(x)$$

where $f(x)$ denotes the probability mass function and $F(x)$ denotes the cumulative distribution function.

Continuous

A continuous random variable takes on an uncountably infinite number of possible values (e.g. all real
numbers).
If $X$ is a continuous random variable, then there exist unique non-negative functions, $f(x)$ and $F(x)$, such that the following are true:

$$P(a \le X \le b) = \int_a^b f(x) \, dx$$

$$P(X \le x) = F(x)$$

where $f(x)$ denotes the probability density function and $F(x)$ denotes the cumulative distribution function.

Bernoulli Distribution

The Bernoulli distribution arises as the result of a binary outcome, which is why it is used to model binary data.
For example, building a spam vs. ham binary classifier, or modeling a coin toss.
A Bernoulli random variable thus models a discrete distribution.
Bernoulli random variables take the values 1 and 0 with probabilities of $p$ and $1 - p$, respectively.
The mean of a Bernoulli random variable is $p$ and the variance is $p(1 - p)$.
If we let $X$ be a Bernoulli random variable, it is typical to call $X = 1$ a "success" and $X = 0$ a "failure".

PMF

The probability mass function $f(\cdot)$ of a Bernoulli distribution, over possible outcomes $k \in \{0, 1\}$, is given by,

$$f(k; p) = \begin{cases} p & \text{if } k = 1 \\ q = 1 - p & \text{if } k = 0 \end{cases}$$

This can also be expressed as,

$$f(k; p) = p^k (1 - p)^{1 - k} \quad \text{for } k \in \{0, 1\}$$

or as,

$$f(k; p) = pk + (1 - p)(1 - k) \quad \text{for } k \in \{0, 1\}$$

Note that the Bernoulli distribution is a special case of the binomial distribution with $n = 1$.

CDF

The cumulative distribution function (CDF) of the Bernoulli distribution is given by,

$$F(k; p) = \begin{cases} 0 & \text{if } k < 0 \\ 1 - p & \text{if } 0 \le k < 1 \\ 1 & \text{if } k \ge 1 \end{cases}$$
Iid Bernoulli Trials

If several iid Bernoulli observations $x_1, \dots, x_n$ are observed, the likelihood is

$$\prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i} = p^{\sum x_i} (1 - p)^{n - \sum x_i}$$

Notice that the likelihood depends only on the sum of the $x_i$.
Because $n$ is fixed and assumed known, this implies that the sample proportion $\sum_i \frac{x_i}{n}$ contains all of the relevant information about $p$.
We can maximize the Bernoulli likelihood over $p$ to obtain that $\hat{p} = \sum_i \frac{x_i}{n}$ is the maximum likelihood estimator for $p$.
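The claim that the sample proportion maximizes the likelihood can be checked numerically. The sketch below evaluates the Bernoulli log-likelihood (the logarithm of the product above) on a grid of candidate values of $p$ and compares the best grid point with $\hat{p}$; the observations in `xs` and the grid resolution are made-up assumptions.

```python
import math


def bernoulli_log_likelihood(p, xs):
    """log of p^(sum x_i) * (1 - p)^(n - sum x_i); depends on xs only via the sum."""
    s, n = sum(xs), len(xs)
    return s * math.log(p) + (n - s) * math.log(1 - p)


xs = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # hypothetical observations (7 successes out of 10)
p_hat = sum(xs) / len(xs)            # sample proportion

# Brute-force search over p in (0, 1), excluding the endpoints where log blows up.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, xs))

print(p_hat, p_grid)
```

Working with the log-likelihood rather than the likelihood itself does not change the maximizer and avoids underflow when $n$ is large.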

Binomial Trials

Binomial random variables are obtained as the sum of iid Bernoulli trials.
Specifically, let $X_1, \dots, X_n$ be iid $\text{Bernoulli}(p)$; then $X = \sum_{i=1}^{n} X_i$ is a binomial random variable.
The binomial mass function is:

$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} \quad \text{for } x = 0, \dots, n$$

The Selection Problem

The notation ${}^nC_x$ or $\binom{n}{x} = \dfrac{n!}{x!(n - x)!}$ (read "$n$ choose $x$") counts the number of ways of selecting $x$ items out of $n$ without replacement, disregarding the order of the items.
Also,

$$\binom{n}{0} = \binom{n}{n} = 1$$

Example Justification of the Binomial Likelihood

Consider the probability of getting 6 heads out of 10 coin flips from a coin with success probability $p$.
The probability of getting 6 heads and 4 tails in any specific order is $p^6 (1 - p)^4$.
There are $\binom{10}{6}$ possible orders of 6 heads and 4 tails.

Example

Suppose a friend has 8 children, 7 of whom are girls and none are twins.
If each gender has an independent 50% probability for each birth, what's the probability of getting 7 or more girls out of 8 births?

$$\binom{8}{7} 0.5^7 (1 - 0.5)^1 + \binom{8}{8} 0.5^8 (1 - 0.5)^0 \approx 0.04$$

choose(8, 7) * 0.5 ^ 8 + choose(8, 8) * 0.5 ^ 8 # Returns 0.03516
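The same tail probability can be reproduced in Python directly from the binomial mass function. This is a small sketch using only the standard library; the helper name `binomial_pmf` is an assumption introduced for illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# P(7 or more girls out of 8 births) = P(X = 7) + P(X = 8)
prob_7_or_more = sum(binomial_pmf(x, 8, 0.5) for x in (7, 8))
print(prob_7_or_more)  # 9/256 = 0.03515625
```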

Gaussian/normal Distribution

The normal (or Gaussian) distribution has a bell-shaped density function and is used to model real-valued random variables that are assumed to be additively produced by many small effects.
For example, the normal distribution is used to model people's height, since height can be assumed to be the result of many small genetic and environmental factors.
Another example would be modeling the price of a house, since the price of a house can be assumed to be a function of the area, school district, distance to landmarks etc.
If $X$ is a random variable that follows a Gaussian distribution, then $E[X] = \mu$ and $\mathrm{Var}(X) = \sigma^2$.
The notation used to indicate that a random variable was sampled from a normal distribution is: $X \sim N(\mu, \sigma^2)$.

PDF

A random variable is said to follow a normal or Gaussian distribution with mean $\mu$ and variance $\sigma^2$ if the associated PDF is,

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$
CDF

The CDF of the Gaussian distribution is given by,

$$F(x) = \Phi\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right]$$

where,
$\Phi(\cdot)$ represents the CDF of the standard normal distribution.
$\operatorname{erf}(x)$ represents the error function.

Standard Normal Distribution

The simplest case of the normal distribution is called the standard normal distribution. This is a special case when $\mu = 0$ and $\sigma = 1$, described by the PDF:

$$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}$$

Note that the standard normal density function is denoted by $\phi$.
Standard normal random variables are often labeled $Z$.

Example 1

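What is the $95^{th}$ percentile of a $N(\mu, \sigma^2)$ distribution?
We want the point $x_0$ so that $P(X \le x_0) = 0.95$.

$$P(X \le x_0) = P\left(\frac{X - \mu}{\sigma} \le \frac{x_0 - \mu}{\sigma}\right) = P\left(Z \le \frac{x_0 - \mu}{\sigma}\right) = 0.95$$

Therefore, $\dfrac{x_0 - \mu}{\sigma} = 1.645$, or $x_0 = \mu + 1.645\sigma$.
In general, $x_0 = \mu + \sigma z_0$ where $z_0$ is the appropriate standard normal quantile.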

Example 2

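What is the probability that a $N(\mu, \sigma^2)$ random variable is more than $2\sigma$ (i.e., 2 standard deviations) above the mean?
Formally, we can write the question as:

$$P(X > \mu + 2\sigma) = P\left(\frac{X - \mu}{\sigma} > \frac{\mu + 2\sigma - \mu}{\sigma}\right)$$

This simplifies to,

$$P(X > \mu + 2\sigma) = P(Z \ge 2) \approx 2.5\%$$

(more precisely, about 2.28%).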
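Both examples can be reproduced with `statistics.NormalDist` from the Python standard library (available since Python 3.8). The values of `mu` and `sigma` below are arbitrary assumptions used only to illustrate the $x_0 = \mu + \sigma z_0$ relationship.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Example 1: 95th percentile of the standard normal, roughly 1.645.
z_95 = z.inv_cdf(0.95)

# The same percentile for an assumed N(mu, sigma^2), computed two ways.
mu, sigma = 100.0, 15.0
x0_from_z = mu + sigma * z_95
x0_direct = NormalDist(mu, sigma).inv_cdf(0.95)

# Example 2: probability of landing more than 2 standard deviations above the mean.
upper_tail = 1 - z.cdf(2)

print(z_95, x0_from_z, x0_direct, upper_tail)
```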

Facts about the Normal Density

If $X \sim N(\mu, \sigma^2)$, then $Z = \dfrac{X - \mu}{\sigma}$ follows the standard normal distribution.
If $Z$ is a random variable that follows the standard normal distribution, then $X = \mu + \sigma Z \sim N(\mu, \sigma^2)$.
The PDF of a general normal distribution in terms of the PDF of a standard normal $\phi(\cdot)$ is,

$$\frac{1}{\sigma} \phi\left(\frac{x - \mu}{\sigma}\right)$$

Approximately 68%, 95% and 99.7% of the normal density lies within 1, 2 and 3 standard deviations from the mean, respectively.
$-1.28$, $-1.645$, $-1.96$ and $-2.33$ are the $10^{th}$, $5^{th}$, $2.5^{th}$ and $1^{st}$ percentiles of the standard normal distribution respectively.
By symmetry, $1.28$, $1.645$, $1.96$ and $2.33$ are the $90^{th}$, $95^{th}$, $97.5^{th}$ and $99^{th}$ percentiles of the standard normal distribution respectively.

Other Properties

The normal distribution is symmetric and peaked around its mean (therefore the mean, median and mode are all equal).
A constant times a normally distributed random variable is also normally distributed (what is the mean and variance?).
Sums of normally distributed random variables are again normally distributed, even if the variables are dependent, provided they are jointly normal (what is the mean and variance?).
Sample means of normally distributed random variables are again normally distributed (with what mean and variance?).
The square of a standard normal random variable follows what is called the chi-squared distribution.
The exponential of a normally distributed random variable follows what is called the log-normal distribution.
As we will see later, many random variables, properly normalized, converge to a normal distribution.
Poisson Distribution

A Poisson random variable counts the number of events occurring in a fixed interval of time or space, given
that these events occur with an average rate λ .
This distribution can be used to model events such as:
The number of meteor showers in a year.
The number of goals in a soccer match.
The number of patients arriving in an emergency room between 10 and 11 PM.
The number of laser photons hitting a detector in a particular time interval.
The number of customers arriving in a store (or say, the number of page­views on a website).
A Poisson random variable thus models a discrete distribution.
Both the mean and the variance of this distribution are $\lambda$.
Note that λ ranges from 0 to ∞ .

PMF

The PMF of the Poisson distribution is given by,

$$P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \quad \text{for } x = 0, 1, \dots$$

CDF

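The CDF of the Poisson distribution is given by,

$$\frac{\Gamma(\lfloor k + 1 \rfloor, \lambda)}{\lfloor k \rfloor!}, \quad \text{or} \quad e^{-\lambda} \sum_{i=0}^{\lfloor k \rfloor} \frac{\lambda^i}{i!}, \quad \text{or} \quad Q(\lfloor k + 1 \rfloor, \lambda)$$

(for $k \ge 0$, where $\Gamma(x, y)$ is the upper incomplete gamma function, $\lfloor k \rfloor$ is the floor function, and $Q$ is the regularized gamma function)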
Use­cases for the Poisson Distribution

Modeling count data, i.e., data of the form $\dfrac{\text{number of events}}{\text{time}}$. Examples include radioactive decay, survival data, contingency tables etc.
Approximating binomials when $n$ is large and $p$ is small.

Poisson Derivation

Let $h$ be very small.
Now, if we assume that…
The probability of an event in an interval of length $h$ is $\lambda h$, while the probability of more than one event is negligible.
Whether or not an event occurs in one small interval does not impact whether or not an event occurs in another small interval.
… then, the number of events per unit time is Poisson with mean $\lambda$.

Rates and Poisson Random Variables

Poisson random variables are used to model rates.
$X \sim \text{Poisson}(\lambda t)$ where,
$\lambda = E\left[\frac{X}{t}\right]$ is the expected count per unit of time.
$t$ is the total monitoring time.

Poisson Approximation to the Binomial

A binomial random variable is the sum of $n$ independent Bernoulli random variables with parameter $p$. It is frequently used to model the number of successes in a specified number of identical binary experiments, such as the number of heads in five coin tosses.
When $n$ is large and $p$ is small (with $np < 10$), the Poisson distribution is an accurate approximation to the binomial distribution.
Formally, if $X \sim \text{Binomial}(n, p)$, then $X$ is approximately $\text{Poisson}(\lambda)$ with $\lambda = np$.

Example
The number of people that show up at a bus stop is Poisson with a mean of 2.5 per hour.
If watching the bus stop for 4 hours, what is the probability that 3 or fewer people show up for the whole
time?

ppois(3, lambda = 2.5 * 4) # Returns 0.01034

Example: Poisson Approximation to the Binomial

If we flip a coin with success probability 0.01 five hundred times, what's the probability of 2 or fewer successes?

pbinom(2, size=500, prob=0.01) # Returns 0.1234


ppois(2, lambda=500 * 0.01) # Returns 0.1247
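For readers without R at hand, the three values above can be reproduced in plain Python from the PMF formulas. The helper names `poisson_cdf` and `binomial_cdf` are assumptions introduced here; they are simple sums of the mass functions, not optimized implementations.

```python
from math import comb, exp, factorial


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summing the PMF from 0 to k."""
    return exp(-lam) * sum(lam**i / factorial(i) for i in range(k + 1))


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summing the PMF from 0 to k."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Bus stop: mean 2.5 per hour, watched for 4 hours, 3 or fewer arrivals.
bus_stop = poisson_cdf(3, 2.5 * 4)

# 500 flips with success probability 0.01, 2 or fewer successes.
exact = binomial_cdf(2, 500, 0.01)
approx = poisson_cdf(2, 500 * 0.01)

print(round(bus_stop, 5), round(exact, 4), round(approx, 4))
```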

Uniform Distribution

The uniform distribution (or rectangular distribution) is a continuous distribution such that all intervals of equal length on the distribution's support have equal probability. For example, this distribution might be used to model people's full birth dates, where it is assumed that all times in the calendar year are equally likely.
The distribution describes an experiment where there is an arbitrary outcome that lies between certain bounds.
The bounds are defined by the parameters, $a$ and $b$, which are the minimum and maximum values. The interval can be either closed (e.g., $[a, b]$) or open (e.g., $(a, b)$).
Therefore, the distribution is often abbreviated $U(a, b)$, where $U$ stands for uniform distribution.

PDF

The PDF of the continuous uniform distribution is given by,

$$f(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } a \le x \le b \\ 0 & \text{for } x < a \text{ or } x > b \end{cases}$$
CDF

The CDF of the continuous uniform distribution is given by,

$$F(x) = \begin{cases} 0 & \text{for } x < a \\ \dfrac{x - a}{b - a} & \text{for } x \in [a, b] \\ 1 & \text{for } x > b \end{cases}$$

Geometric Distribution
A geometric random variable counts the number of trials that are required to observe a single success,
where each trial is independent and has success probability p. A geometric random variable thus models a
discrete distribution.
For example, this distribution can be used to model the number of times a die must be rolled in order for a
six to be observed.
Student’s T-distribution

A Student’s t-distribution (or simply the t-distribution) is a continuous probability distribution that arises
when estimating the mean of a normally distributed population in situations where the sample size is small
and the population standard deviation is unknown.

Chi-squared Distribution

A chi-squared random variable with k degrees of freedom is the sum of k independent and identically
distributed squared standard normal random variables. A chi-squared random variable thus models a
continuous distribution.
It is often used in hypothesis testing and in the construction of confidence intervals.
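The definition can be checked with a quick simulation: draw k standard normals, square them, and add them up. The sketch below uses k = 4 and a fixed seed (both arbitrary choices) and relies on the known fact that a chi-squared variable with k degrees of freedom has mean k.

```python
import random

def chi_squared_sample(k, rng):
    """One draw: the sum of k squared independent standard normal variates."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

rng = random.Random(0)
k = 4
draws = [chi_squared_sample(k, rng) for _ in range(50_000)]
sample_mean = sum(draws) / len(draws)  # theory: the mean equals k
print(sample_mean)
```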

Exponential Distribution
The exponential distribution is the continuous analogue of the geometric distribution. It is often used to
model waiting times.
F Distribution

The F-distribution (also known as the Fisher–Snedecor distribution) is a continuous distribution that arises
frequently as the null distribution of a test statistic, most notably in the analysis of variance.

Gamma Distribution

The gamma distribution is a general family of continuous probability distributions. The exponential and
chi-squared distributions are special cases of the gamma distribution.
Beta Distribution

The beta distribution is a general family of continuous probability distributions bounded between 0 and 1. The
beta distribution is frequently used as a conjugate prior distribution in Bayesian statistics.

Frequentist Inference
Frequentist inference is the process of determining properties of an underlying distribution via the
observation of data.

Point Estimation

One of the main goals of statistics is to estimate unknown parameters. To approximate these parameters,
we choose an estimator, which is simply any function of randomly sampled observations. To illustrate this
idea, let’s consider the problem of estimating the value of π . To do so, we can uniformly drop samples on a
square containing an inscribed circle. Notice that the value of π can be expressed as a ratio of the areas,

$$
S_{\text{circle}} = \pi r^2, \qquad S_{\text{square}} = 4r^2 \implies \pi = 4 \frac{S_{\text{circle}}}{S_{\text{square}}}
$$

We can estimate this ratio with our samples. Let m be the number of samples within our circle and n the
total number of samples dropped. We define our estimator $\hat{\pi}$ as:

$$
\hat{\pi} = 4 \frac{m}{n}
$$

It can be shown that this estimator has the desirable properties of being unbiased and consistent.
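A minimal sketch of this estimator, assuming the square is [−1, 1] × [−1, 1] so that the inscribed circle has radius 1. The function name `estimate_pi`, the sample size, and the seed are illustrative.

```python
import random

def estimate_pi(n, rng):
    """Drop n uniform points on the square [-1, 1]^2 and return 4 * m / n."""
    m = 0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # the point landed inside the inscribed circle
            m += 1
    return 4.0 * m / n

rng = random.Random(0)
pi_hat = estimate_pi(200_000, rng)
print(pi_hat)
```

Consistency shows up in practice as the estimate tightening around π when n grows; with a few hundred samples the estimate is typically only good to about one decimal place.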
Confidence Intervals

In contrast to point estimators, confidence intervals estimate a parameter by specifying a range of possible
values. Such an interval is associated with a confidence level, which is the probability that the procedure
used to generate the interval will produce an interval containing the true parameter.
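The meaning of the confidence level can be illustrated by simulation. The sketch below makes a simplifying assumption that is not in the text: the data are normal with a known standard deviation, so the interval is the sample mean plus or minus z·σ/√n. It then counts how often the interval contains the true mean; all parameter values are made up.

```python
import random
from statistics import NormalDist, mean

def mean_confidence_interval(sample, sigma, level=0.95):
    """Interval for the mean of normal data with known standard deviation sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level = 0.95
    half_width = z * sigma / len(sample) ** 0.5
    center = mean(sample)
    return center - half_width, center + half_width

rng = random.Random(0)
true_mu, sigma, n, trials = 10.0, 2.0, 25, 2_000
hits = 0
for _ in range(trials):
    sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
    low, high = mean_confidence_interval(sample, sigma)
    if low <= true_mu <= high:
        hits += 1
coverage = hits / trials  # fraction of intervals that contain true_mu
print(coverage)
```

The coverage should come out close to the nominal 95%, which is a statement about the procedure over repeated samples rather than about any single interval.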

The Bootstrap

Much of frequentist inference centers on the use of “good” estimators. The precise distributions of these
estimators, however, can often be difficult to derive analytically. The computational technique known as the
Bootstrap provides a convenient way to estimate properties of an estimator via resampling.
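As a rough sketch of the idea, the code below estimates the standard error of the sample median, a quantity that is awkward to derive analytically. The data are synthetic, and the helper name `bootstrap_standard_error` and the number of resamples are illustrative choices.

```python
import random
from statistics import median, stdev

def bootstrap_standard_error(data, estimator, n_resamples, rng):
    """Estimate the standard error of `estimator` by resampling with replacement."""
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=len(data))  # same size as the original data
        estimates.append(estimator(resample))
    return stdev(estimates)

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]  # synthetic data for illustration
se_median = bootstrap_standard_error(data, median, 1_000, rng)
print(se_median)
```

Any function of the sample can be passed as `estimator`, which is what makes the technique convenient when no closed-form distribution is available.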

Bayesian Inference
Bayesian inference techniques specify how one should update one’s beliefs upon observing data.

Bayes’ Theorem
Suppose that on your most recent visit to the doctor’s office, you decide to get tested for a rare disease. If
you are unlucky enough to receive a positive result, the logical next question is, “Given the test result, what
is the probability that I actually have this disease?” (Medical tests are, after all, not perfectly accurate.)
Bayes’ Theorem tells us exactly how to compute this probability:

$$
P(\text{Disease} \mid +) = \frac{P(+ \mid \text{Disease}) \, P(\text{Disease})}{P(+)}
$$

As the equation indicates, the posterior probability of having the disease given that the test was positive
depends on the prior probability of the disease P(Disease). Think of this as the incidence of the disease in
the general population.
The posterior probability also depends on the test accuracy: How often does the test correctly report a
negative result for a healthy patient, and how often does it report a positive result for someone with the
disease?
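The calculation can be sketched as follows. The denominator P(+) is expanded with the law of total probability, and the numbers used (1% prevalence, 95% sensitivity, 90% specificity) are hypothetical values picked only to illustrate the effect of a low prior.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(Disease | +) from the prior and the two test-accuracy rates."""
    p_pos_given_disease = sensitivity        # true positive rate
    p_pos_given_healthy = 1.0 - specificity  # false positive rate
    # Law of total probability for the denominator P(+)
    p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1.0 - prior)
    return p_pos_given_disease * prior / p_pos

# Hypothetical numbers: 1% prevalence, 95% sensitivity, 90% specificity
posterior = posterior_given_positive(prior=0.01, sensitivity=0.95, specificity=0.90)
print(round(posterior, 4))
```

With these inputs the posterior stays below 10% even though the test looks accurate, because false positives from the large healthy population outnumber true positives from the small diseased one.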

Likelihood Function

In statistics, the likelihood function has a very precise definition:

$$
L(\theta \mid x) = P(x \mid \theta)
$$

The concept of likelihood plays a fundamental role in both Bayesian and frequentist statistics. To read
more, refer to the section on likelihood vs. probability in our CS229 notes.

Prior to Posterior

At the core of Bayesian statistics is the idea that prior beliefs should be updated as new data is acquired.
Consider a possibly biased coin that comes up heads with probability p, which would be unknown in
practice.
As we acquire data in the form of coin tosses, we update the posterior distribution on p, which represents
our best guess about the likely values for the bias of the coin. This updated distribution then serves as the
prior for future coin tosses.
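One common way to make this concrete, assumed here rather than stated in the text, is to place a Beta(α, β) prior on p. Since the beta distribution is conjugate to coin-toss data, observing h heads and t tails gives a Beta(α + h, β + t) posterior. The true bias, batch size, and seed below are arbitrary.

```python
import random

def update_beta(alpha, beta, heads, tails):
    """Conjugate update: Beta(alpha, beta) prior to Beta posterior after coin tosses."""
    return alpha + heads, beta + tails

rng = random.Random(0)
true_p = 0.7            # hidden bias, used only to simulate the tosses
alpha, beta = 1.0, 1.0  # Beta(1, 1) is a uniform prior on p

for batch in range(5):
    tosses = [rng.random() < true_p for _ in range(20)]
    heads = sum(tosses)
    # The posterior from the previous batch serves as the prior for this one
    alpha, beta = update_beta(alpha, beta, heads, len(tosses) - heads)

posterior_mean = alpha / (alpha + beta)
print(alpha, beta, posterior_mean)
```

Updating in five batches of 20 gives the same posterior as a single update with all 100 tosses, which is the sense in which yesterday's posterior becomes today's prior.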

Regression Analysis
Linear regression is an approach for modeling the linear relationship between two variables.

Ordinary Least Squares

The ordinary least squares (OLS) approach to regression allows us to estimate the parameters of a linear
model.
The goal of this method is to determine the linear model that minimizes the sum of the squared errors
between the observations in a dataset and those predicted by the model.
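For a single predictor, the minimizing line has a closed form: the slope is s_xy / s_xx and the intercept is ȳ − slope · x̄. The sketch below applies it to a small made-up dataset; `fit_ols` is an illustrative name.

```python
def fit_ols(xs, ys):
    """Intercept and slope minimizing the sum of squared errors for y = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up data lying near the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
intercept, slope = fit_ols(xs, ys)
# Sum of squared errors between observations and the fitted line
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
print(intercept, slope, sse)
```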

Correlation

Correlation is a measure of the linear relationship between two variables. For a sample, it is defined as
follows and takes values between −1 and +1 inclusive:

$$
r = \frac{s_{xy}}{\sqrt{s_{xx}} \sqrt{s_{yy}}}
$$

where $s_{xy}$, $s_{xx}$, and $s_{yy}$ are defined as:

$$
s_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$

$$
s_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

$$
s_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2
$$

It can also be understood as the cosine of the angle between the two mean-centered data vectors,
(x_1 − x̄, …, x_n − x̄) and (y_1 − ȳ, …, y_n − ȳ).
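The sample correlation can be computed directly from the three sums in its definition. This is a minimal sketch on made-up data; `correlation` is an illustrative helper name.

```python
import math

def correlation(xs, ys):
    """Sample correlation r = s_xy / (sqrt(s_xx) * sqrt(s_yy))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = correlation(xs, [2.0, 4.0, 6.0, 8.0, 10.0])   # perfectly increasing line
r_down = correlation(xs, [5.0, 4.0, 3.0, 2.0, 1.0])  # perfectly decreasing line
r_noisy = correlation(xs, [2.0, 1.0, 4.0, 3.0, 5.0]) # increasing trend with scatter
print(r_up, r_down, r_noisy)
```

The two perfectly linear cases hit the extremes of the range, while the scattered data falls strictly between them.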

Analysis of Variance

Analysis of Variance (ANOVA) is a statistical method for testing whether groups of data have the same
mean. ANOVA generalizes the t-test to two or more groups by comparing the sums of squared errors within
and between groups.
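A sketch of the one-way ANOVA test statistic, assuming the usual form: the between-group sum of squares divided by its degrees of freedom (groups − 1), over the within-group sum of squares divided by its degrees of freedom (observations − groups). The data are made up, and `one_way_anova_f` is an illustrative name.

```python
def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up measurements for three groups
groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]
f_stat = one_way_anova_f(groups)
print(f_stat)
```

A large value means the group means differ by more than the within-group scatter would explain; under the null hypothesis of equal means, this statistic follows an F-distribution.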
Trigonometry

Ratios
Angle (degrees)   0°     30°     45°     60°     90°
Angle (radians)   0      π/6     π/4     π/3     π/2
sin θ             0      1/2     1/√2    √3/2    1
cos θ             1      √3/2    1/√2    1/2     0
tan θ             0      1/√3    1       √3      N/A
cosec θ           N/A    2       √2      2/√3    1
sec θ             1      2/√3    √2      2       N/A
cot θ             N/A    √3      1       1/√3    0

N/A = not defined.
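The first three ratios for these standard angles can be reproduced numerically as a quick check. In this sketch, `trig_row` is an illustrative helper, and tan is reported as `None` at 90°, where it is not defined.

```python
import math

def trig_row(deg):
    """Return (sin, cos, tan) for an angle in degrees; tan is None at 90 degrees."""
    theta = math.radians(deg)
    s, c = math.sin(theta), math.cos(theta)
    t = None if deg == 90 else s / c  # cos is zero at 90 degrees, so tan is undefined
    return s, c, t

for deg in [0, 30, 45, 60, 90]:
    s, c, t = trig_row(deg)
    tan_text = "N/A" if t is None else f"{t:.4f}"
    print(f"{deg:>3}  sin={s:.4f}  cos={c:.4f}  tan={tan_text}")
```

The remaining ratios follow as reciprocals: cosec θ = 1/sin θ, sec θ = 1/cos θ, and cot θ = 1/tan θ.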

Graphical View of Sin and Cos


Shown below is a graphical view of how cos(θ) and sin(θ) vary as the angle goes from 0° to 360° (or
equivalently, 0 to 2π radians).
Note that the diagram below shows a unit circle (with radius = 1).

References and Credits


Aurélien Géron’s Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow served as a major
inspiration for this tutorial.
Seeing Theory helped with sections that offer an overview to probability and statistics.
5 Derivatives to Excel in Your Machine Learning Interview helped with examples on derivatives for uni- and
multi-dimensional functions, including the gradient, Jacobian and Hessian.
Wikipedia articles on Bernoulli distribution, Normal distribution, Poisson distribution and Uniform
distribution.
Wolfram MathWorld article on Bernoulli distribution.
Trigonometric functions for a graphical view into how cos(θ) and sin(θ) vary with θ.

Citation
If you found our work useful, please cite it as:

@article{Chadha2020DistilledMathTutorial,
title = {Math Tutorial},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}


www.amanchadha.com
