Aman Ai Primers Math
Primers • Math
Linear Algebra
Vectors
Definition
Purpose
Vectors in Python
Plotting Vectors
2D Vectors
3D Vectors
Norm
Addition
Differential Calculus
Slope of a Straight Line
Defining the Slope of a Curve
Differentiability
Differentiating a Function
Example: Finding the Derivative of $x^2$
Notations
Differentiation Rules
The Chain Rule
Derivatives and Optimization
Higher Order Derivatives
Partial Derivatives
Gradients
Gradient Descent, Revisited
Jacobians
Hessians
Examples
Derivative 1: Composed Exponential Function
Derivative 2. Function with Variable Base and Variable Exponent
Derivative 3: Gradient of a MultiDimensional Input Function
Derivative 4. Jacobian of a MultiDimensional Input and Output Function
Derivative 5. Hessian of a MultiDimensional Input Function
Probability Theory
Concepts
Chance Events
Expectation
Variance
Set Theory
Counting
Conditional Probability
Probability Distributions
Random Variables
Central Limit Theorem
Types of Probability Distributions
Discrete
Continuous
Bernoulli Distribution
PMF
CDF
Iid Bernoulli Trials
Binomial Trials
The Selection Problem
Example Justification of the Binomial Likelihood
Example
Gaussian/normal Distribution
PDF
CDF
Standard Normal Distribution
Example 1
Example 2
Facts about the Normal Density
Other Properties
Poisson Distribution
PMF
CDF
Usecases for the Poisson Distribution
Poisson Derivation
Rates and Poisson Random Variables
Poisson Approximation to the Binomial
Example
Example: Poisson Approximation to the Binomial
Uniform Distribution
PDF
CDF
Geometric Distribution
Student’s t-Distribution
Chi-squared Distribution
Exponential Distribution
F Distribution
Gamma Distribution
Beta Distribution
Frequentist Inference
Point Estimation
Confidence Intervals
The Bootstrap
Bayesian Inference
Bayes’ Theorem
Likelihood Function
Prior to Posterior
Regression Analysis
Ordinary Least Squares
Correlation
Analysis of Variance
Trigonometry
Ratios
Graphical View of Sin and Cos
References and Credits
Citation
Linear Algebra
Linear Algebra is the branch of mathematics that studies vector spaces and linear transformations between
vector spaces, such as rotating a shape, scaling it up or down, translating it (i.e., moving it), etc.
Machine Learning relies heavily on Linear Algebra, so it is essential to understand what vectors and
matrices are, what operations you can perform with them, and how they can be useful.
Vectors
Definition
A vector is a quantity defined by a magnitude and a direction. For example, a rocket’s velocity is a
3-dimensional vector: its magnitude is the speed of the rocket, and its direction is (hopefully) up. A vector can
be represented by an array of numbers called scalars. Each scalar corresponds to the magnitude of the
vector with regard to each dimension.
For example, say the rocket is going up at a slight angle: it has a vertical speed of 5,000 m/s, and also a
slight speed towards the East at 10 m/s, and a slight speed towards the North at 50 m/s. The rocket’s
velocity may be represented by the following vector:
$$\textbf{velocity} = \begin{pmatrix} 10 \\ 50 \\ 5000 \end{pmatrix}$$
Note: by convention vectors are generally presented in the form of columns. Also, vector names are
generally lowercase to distinguish them from matrices (which we will discuss below) and in bold (when
possible) to distinguish them from simple scalar values such as meters_per_second = 5026 .
A list of $N$ numbers may also represent the coordinates of a point in an $N$-dimensional space, so it is
quite frequent to represent vectors as simple points instead of arrows. A vector with 1 element may be
represented as an arrow or a point on an axis, a vector with 2 elements is an arrow or a point on a plane, a
vector with 3 elements is an arrow or point in space, and a vector with $N$ elements is an arrow or a point in
an $N$-dimensional space… which most people find hard to imagine.
Purpose
Vectors have many purposes in Machine Learning, most notably to represent observations and predictions.
For example, say we built a Machine Learning system to classify videos into 3 categories (good, spam,
clickbait) based on what we know about them. For each video, we would have a vector representing what
we know about it, such as:
$$\textbf{video} = \begin{pmatrix} 10.5 \\ 5.2 \\ 3.25 \\ 7.0 \end{pmatrix}$$
This vector could represent a video that lasts 10.5 minutes, but only 5.2% of viewers watch it for more than a
minute, it gets 3.25 views per day on average, and it was flagged 7 times as spam. As you can see, each
axis may have a different meaning.
Based on this vector our Machine Learning system may predict that there is an 80% probability that it is a
spam video, 18% that it is clickbait, and 2% that it is a good video. This could be represented as the
following vector:
$$\textbf{class\_probabilities} = \begin{pmatrix} 0.80 \\ 0.18 \\ 0.02 \end{pmatrix}$$
Vectors in Python
In python, a vector can be represented in many ways, the simplest being a regular python list of numbers:
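For instance, a minimal sketch that reuses the values of the video vector above (the variable names `video_list` and `dimension` are only illustrative):

```python
# A plain Python list holding the four features of the video vector.
video_list = [10.5, 5.2, 3.25, 7.0]

# The number of elements, i.e. the dimension of the vector.
dimension = len(video_list)
print(dimension)  # 4
```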
Since we plan to do quite a lot of scientific calculations, it is much better to use NumPy’s ndarray , which
provides a lot of convenient and optimized implementations of essential mathematical operations on
vectors (for more details about NumPy, check out the NumPy tutorial). For example:
import numpy as np
video = np.array([10.5, 5.2, 3.25, 7.0])
video  # in a notebook, this displays the array
video.size  # number of elements in the vector: 4
Plotting Vectors
To plot vectors, we will use matplotlib, so let’s start by importing it (for details about Matplotlib, check out
our Matplotlib tutorial):
In a Jupyter/Colab notebook, we can simply output the graphs within the notebook itself by running the
%matplotlib inline magic command. Run this cell if you’re viewing this in Colab:
%matplotlib inline
2D Vectors
Now, import NumPy for array containers to handle the input data that we’re looking to plot:
import numpy as np
Let’s create a couple of very simple 2D vectors to plot:
u = np.array([2, 5])
v = np.array([3, 1])
These vectors each have 2 elements, so they can easily be represented graphically on a 2D graph, for
example as points:
3D Vectors
Plotting 3D vectors is also relatively straightforward. First let’s create two 3D vectors:
import numpy as np
a = np.array([1, 2, 8])
b = np.array([5, 6, 3])
Now let’s plot them using matplotlib’s Axes3D . Note that Axes3D lives in the mpl_toolkits package,
which ships with matplotlib. If the import fails, upgrading matplotlib with
pip install --upgrade matplotlib usually resolves it (if you’re running this in a Jupyter notebook, use
!pip install --upgrade matplotlib ).
It is a bit hard to visualize exactly where in space these two points are, so let’s add vertical lines. We’ll
create a small convenience function to plot a list of 3D vectors with vertical lines attached:
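Norm
The norm of a vector $\textbf{u}$, noted $\|\textbf{u}\|$, is a measure of the length (a.k.a. the magnitude) of $\textbf{u}$. There are
multiple possible norms, but the most common one (and the only one we will discuss here) is the Euclidean
norm, which is defined as:
$$\|\textbf{u}\| = \sqrt{\sum_i u_i^2}$$
We could implement this easily in pure python, recalling that $\sqrt{x} = x^{\frac{1}{2}}$.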
def vector_norm(vector):
    # Sum of the squared components, then the square root (power 0.5)
    squares = [element**2 for element in vector]
    return sum(squares)**0.5

u = np.array([2, 5])
print("||", u, "|| =")
vector_norm(u) # Displays 5.385164807134504
However, it is much more efficient to use NumPy’s norm function, available in the linalg (Linear
Algebra) module:
import numpy.linalg as LA
LA.norm(u) # Prints 5.385164807134504
Let’s plot a little diagram to confirm that the length of vector u is indeed ≈ 5.4 :
radius = LA.norm(u)
plt.gca().add_artist(plt.Circle((0,0), radius, color="#DDDDDD"))
plot_vector2d(u, color="red")
plt.axis([0, 8.7, 0, 6])
plt.grid()
plt.show()
Looks about right!
Addition
Vectors of the same size can be added together. Addition is performed elementwise:
import numpy as np
u = np.array([2, 5])
v = np.array([3, 1])
print(" ", u)
print("+", v)
print("‐"*10)
u + v
which outputs:
[2 5]
+ [3 1]
‐‐‐‐‐‐‐‐‐‐
array([5, 6])
plot_vector2d(u, color="r")
plot_vector2d(v, color="b")
plot_vector2d(v, origin=u, color="b", linestyle="dotted")
plot_vector2d(u, origin=v, color="r", linestyle="dotted")
plot_vector2d(u+v, color="g")
plt.axis([0, 9, 0, 7])
plt.text(0.7, 3, "u", color="r", fontsize=18)
plt.text(4, 3, "u", color="r", fontsize=18)
plt.text(1.8, 0.2, "v", color="b", fontsize=18)
plt.text(3.1, 5.6, "v", color="b", fontsize=18)
plt.text(2.4, 2.5, "u+v", color="g", fontsize=18)
plt.grid()
plt.show()
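Vector addition is commutative, meaning that $\textbf{u} + \textbf{v} = \textbf{v} + \textbf{u}$. You can see it on the previous image: following $\textbf{u}$
then $\textbf{v}$ leads to the same point as following $\textbf{v}$ then $\textbf{u}$.
Vector addition is also associative, meaning that $\textbf{u} + (\textbf{v} + \textbf{w}) = (\textbf{u} + \textbf{v}) + \textbf{w}$.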
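Both properties can be checked on concrete numbers. The sketch below uses plain Python lists rather than NumPy; the helper `add` and the third vector `w` are illustrative assumptions, not part of the original notebook:

```python
def add(u, v):
    """Element-wise sum of two vectors of the same size."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same size")
    return [a + b for a, b in zip(u, v)]

u, v = [2, 5], [3, 1]   # same values as the vectors used above
w = [-1, 4]             # arbitrary third vector, chosen for illustration

commutative = add(u, v) == add(v, u)
associative = add(u, add(v, w)) == add(add(u, v), w)
print(add(u, v), commutative, associative)
```

Integer components are used so that the equality checks are exact; with floating-point components a small tolerance would be more appropriate.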
If you have a shape defined by a number of points (vectors), and you add a vector v to all of these points,
then the whole shape gets shifted by v . This is called a geometric translation:
t1 = np.array([2, 0.25])
t2 = np.array([2.5, 3.5])
t3 = np.array([1, 2])
t1b = t1 + v
t2b = t2 + v
t3b = t3 + v
plt.axis([0, 6, 0, 5])
plt.grid()
plt.show()
Finally, subtracting a vector is like adding the opposite vector.
Differential Calculus
Calculus is the study of continuous change. It has two major subfields: differential calculus, which studies
the rate of change of functions, and integral calculus, which studies the area under the curve. In this
notebook, we will discuss the former.
Differential calculus is at the core of deep learning, so it is important to understand what derivatives and
gradients are, how they are used in deep learning, and understand what their limitations are.
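Slope of a Straight Line
The slope of a (non-vertical) straight line passing through two points $A$ and $B$ is defined as:
$$\text{slope} = \frac{\Delta y}{\Delta x} = \frac{\text{height}}{\text{width}} = \frac{\text{rise}}{\text{run}} = \frac{y_B - y_A}{x_B - x_A}$$
In this example, the height (rise) is 3, and the width (run) is 6, so the slope is $\frac{3}{6} = 0.5$.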
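The rise-over-run formula translates directly into code. In this sketch the two points are hypothetical, chosen only so that the rise is 3 and the run is 6 as in the example:

```python
def slope(point_a, point_b):
    """Slope of the straight line through two (x, y) points."""
    x_a, y_a = point_a
    x_b, y_b = point_b
    if x_b == x_a:
        raise ValueError("vertical line: the slope is undefined")
    return (y_b - y_a) / (x_b - x_a)

a = (1.0, 2.0)  # hypothetical point A
b = (7.0, 5.0)  # hypothetical point B: rise = 3, run = 6
line_slope = slope(a, b)
print(line_slope)  # 0.5
```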
Defining the Slope of a Curve
Now, let’s try to figure out how we can compute the slope of something other than a straight line. For
example, let’s consider the curve defined by $y = f(x) = x^2$:
Obviously, the slope varies: on the left (i.e., when $x < 0$), the slope is negative (i.e., when we move from left
to right, the curve goes down), while on the right (i.e., when $x > 0$) the slope is positive (i.e., when we move
from left to right, the curve goes up). At the point $x = 0$, the slope is equal to 0 (i.e., the curve is locally flat).
The fact that the slope is 0 when we reach a minimum (or indeed a maximum) is crucially important, and
we will come back to it later.
How can we put numbers on these intuitions? Well, say we want to estimate the slope of the curve at a
point A , we can do this by taking another point B on the curve, not too far away, and then computing the
slope between these two points.
As you can see, when point $B$ is very close to point $A$, the $(AB)$ line becomes almost indistinguishable
from the curve itself (at least locally around point $A$). The $(AB)$ line gets closer and closer to the tangent
line to the curve at point $A$: this is the best linear approximation of the curve at point $A$.
So it makes sense to define the slope of the curve at point $A$ as the slope that the $(AB)$ line approaches
when $B$ gets infinitely close to $A$. This slope is called the derivative of the function $f$ at $x = x_A$. For
example, the derivative of the function $f(x) = x^2$ at $x = x_A$ is equal to $2x_A$ (we will see how to get this result
shortly), so on the graph above, since the point $A$ is located at $x_A = -1$, the tangent line to the curve at that
point has a slope of $-2$.
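Differentiability
Not every function is as well-behaved as $x^2$. Consider, for example, the function $f(x) = |x|$, the
absolute value of $x$:
No matter how much you zoom in on the origin (the point at $x = 0, y = 0$), the curve will always look like a V. The
slope is $-1$ for any $x < 0$, and it is $+1$ for any $x > 0$, but at $x = 0$, the slope is undefined, since it is not
possible to approximate the curve $y = |x|$ locally around the origin using a straight line, no matter how much
you zoom in. The function is therefore not differentiable at $x = 0$, which means
that the curve $y = |x|$ has an undefined slope at that point. However, the function $f(x) = |x|$ is differentiable at
every other point.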
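The function must be continuous at $x_A$, meaning that $f(x_B)$ must get infinitely close to $f(x_A)$ as $x_B$
gets infinitely close to $x_A$. As a counterexample,
$$f(x) = \begin{cases} -1 & \text{if } x < 0 \\ +1 & \text{if } x \geq 0 \end{cases}$$
is not continuous at $x_A = 0$, even though it is defined at that point: indeed, when you
approach it from the negative side, it does not approach infinitely close to $f(0) = +1$.
Therefore, it is not continuous at that point, and thus not differentiable either.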
The function must not have a breaking point at $x_A$, meaning that the slope that the $(AB)$ line
approaches as $B$ approaches $A$ must be the same whether $B$ approaches from the left side or from
the right side. We already saw a counterexample with $f(x) = |x|$, which is both defined and
continuous at $x_A = 0$, but which has a breaking point at $x_A = 0$: the slope of the curve $y = |x|$ is $-1$
for any $x < 0$ and $+1$ for any $x > 0$.
root of x : the curve is vertical at the origin, so the function is not differentiable at xA = 0 .
Differentiating a Function
Now let’s see how to actually differentiate a function (i.e., find its derivative).
The derivative of a function $f(x)$ at $x = x_A$ is noted $f'(x_A)$, and it is defined as:
$$f'(x_A) = \lim_{x_B \to x_A} \frac{f(x_B) - f(x_A)}{x_B - x_A}$$
Don’t be scared, this is simpler than it looks! You may recognize the rise over run equation $\frac{y_B - y_A}{x_B - x_A}$ that we
discussed earlier. That’s just the slope of the $(AB)$ line. And the notation $\lim_{x_B \to x_A}$ means that we are making
$x_B$ get infinitely close to $x_A$. In other words, $f'(x_A)$ is the value that the slope of the $(AB)$ line
approaches when $B$ gets infinitely close to $A$. This is just a formal way of saying exactly the same thing as
earlier.
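The limit can be explored numerically by computing the slope of the $(AB)$ line for points $B$ that get closer and closer to $A$. This is only a sketch: the function name `secant_slope` and the chosen gaps are illustrative, and a finite gap only approximates the limit.

```python
def f(x):
    return x ** 2

def secant_slope(func, x_a, x_b):
    """Slope of the (AB) line, where A and B lie on the curve y = func(x)."""
    return (func(x_b) - func(x_a)) / (x_b - x_a)

x_a = -1.0
# Move B closer and closer to A and watch the slope settle towards -2.
slopes = [secant_slope(f, x_a, x_a + gap) for gap in (1.0, 0.1, 0.001, 1e-6)]
for s in slopes:
    print(s)
```

The printed slopes approach $-2$, which matches the derivative $2x_A$ at $x_A = -1$.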
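Let’s look at a concrete example. Let’s see if we can determine what the slope of the $y = x^2$ curve is, at any
point $A$:
$$
\begin{aligned}
f'(x_A) &= \lim_{x_B \to x_A} \frac{f(x_B) - f(x_A)}{x_B - x_A} \\
&= \lim_{x_B \to x_A} \frac{x_B^2 - x_A^2}{x_B - x_A} && \text{since } f(x) = x^2 \\
&= \lim_{x_B \to x_A} \frac{(x_B - x_A)(x_B + x_A)}{x_B - x_A} && \text{since } x_A^2 - x_B^2 = (x_A - x_B)(x_A + x_B) \\
&= \lim_{x_B \to x_A} (x_B + x_A) && \text{since the two } (x_B - x_A) \text{ cancel out} \\
&= \lim_{x_B \to x_A} x_B + \lim_{x_B \to x_A} x_A && \text{since the limit of a sum is the sum of the limits} \\
&= x_A + \lim_{x_B \to x_A} x_A && \text{since } x_B \text{ approaches } x_A \\
&= x_A + x_A && \text{since } x_A \text{ does not depend on } x_B \\
&= 2 x_A
\end{aligned}
$$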
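The derivation above relies on the following properties of limits:
$$\lim_{x \to k} [f(x) + g(x)] = \lim_{x \to k} f(x) + \lim_{x \to k} g(x) \quad \text{(the limit of a sum is the sum of the limits)}$$
$$\lim_{x \to k} [f(x) \times g(x)] = \lim_{x \to k} f(x) \times \lim_{x \to k} g(x) \quad \text{(the limit of a product is the product of the limits)}$$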
Important note: in Deep Learning, differentiation is almost always performed automatically by the
framework you are using (such as TensorFlow or PyTorch). This is called automatic differentiation. However, you
should still make sure you have a good understanding of derivatives, or else they will come and bite you
one day, for example when you use a square root in your cost function without realizing that its derivative
approaches infinity when $x$ approaches 0 (tip: you should use $\sqrt{x + \epsilon}$ instead, where $\epsilon$ is some small
positive constant).
The definition of the derivative can also be written in a slightly different but equivalent way. Let’s define $\epsilon = x_B - x_A$ and
note that $x_B = x_A + \epsilon$. With that, we can reformulate the definition above like so:
$$f'(x_A) = \lim_{\epsilon \to 0} \frac{f(x_A + \epsilon) - f(x_A)}{\epsilon}$$
While we’re at it, let’s just rename $x_A$ to $x$, to get rid of the annoying subscript $A$ and make the equation
simpler to read:
$$f'(x) = \lim_{\epsilon \to 0} \frac{f(x + \epsilon) - f(x)}{\epsilon}$$
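Okay! Now let’s use this new definition to find the derivative of $f(x) = x^2$ at any point $x$, and (hopefully) we
should find the same result as above (except using $x$ instead of $x_A$):
$$
\begin{aligned}
f'(x) &= \lim_{\epsilon \to 0} \frac{f(x + \epsilon) - f(x)}{\epsilon} \\
&= \lim_{\epsilon \to 0} \frac{(x + \epsilon)^2 - x^2}{\epsilon} && \text{since } f(x) = x^2 \\
&= \lim_{\epsilon \to 0} \frac{x^2 + 2x\epsilon + \epsilon^2 - x^2}{\epsilon} && \text{since } (x + \epsilon)^2 = x^2 + 2x\epsilon + \epsilon^2 \\
&= \lim_{\epsilon \to 0} \frac{2x\epsilon + \epsilon^2}{\epsilon} && \text{since the two } x^2 \text{ cancel out} \\
&= \lim_{\epsilon \to 0} (2x + \epsilon) && \text{since } 2x\epsilon \text{ and } \epsilon^2 \text{ can both be divided by } \epsilon \\
&= 2x
\end{aligned}
$$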
Notations
A word about notations: there are several other notations for the derivative that you will find in the
literature:
$$f'(x) = \frac{df(x)}{dx} = \frac{d}{dx} f(x)$$
This notation is also handy when a function is not named. For example, $\frac{d}{dx}[x^2]$ refers to the derivative of the
function $x \mapsto x^2$.
Moreover, when people talk about the function $f(x)$, they sometimes leave out “$(x)$”, and they just talk
about the function $f$. When this is the case, the notation of the derivative is also simpler:
$$f' = \frac{df}{dx} = \frac{d}{dx} f$$
The $f'$ notation is Lagrange’s notation, while $\frac{df}{dx}$ is Leibniz’s notation.
There are also other less common notations, such as Newton’s notation $\dot{y}$ (assuming $y = f(x)$) or Euler’s
notation $Df$.
Differentiation Rules
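One very important rule is that the derivative of a sum is the sum of the derivatives. More precisely, if
we define $f(x) = g(x) + h(x)$, then $f'(x) = g'(x) + h'(x)$. This is quite easy to prove:
$$
\begin{aligned}
f'(x) &= \lim_{\epsilon \to 0} \frac{f(x + \epsilon) - f(x)}{\epsilon} && \text{by definition} \\
&= \lim_{\epsilon \to 0} \frac{g(x + \epsilon) + h(x + \epsilon) - g(x) - h(x)}{\epsilon} && \text{using } f(x) = g(x) + h(x) \\
&= \lim_{\epsilon \to 0} \frac{g(x + \epsilon) - g(x) + h(x + \epsilon) - h(x)}{\epsilon} && \text{rearranging the terms} \\
&= \lim_{\epsilon \to 0} \frac{g(x + \epsilon) - g(x)}{\epsilon} + \lim_{\epsilon \to 0} \frac{h(x + \epsilon) - h(x)}{\epsilon} && \text{since the limit of a sum is the sum of the limits} \\
&= g'(x) + h'(x) && \text{using the definitions of } g'(x) \text{ and } h'(x)
\end{aligned}
$$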
Similarly, it is possible to show the following important rules (I’ve included the proofs at the end of this
notebook, in case you’re curious):
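| Rule | Function $f$ | Derivative $f'$ |
|---|---|---|
| Constant | $f(x) = c$ | $f'(x) = 0$ |
| Sum | $f(x) = g(x) + h(x)$ | $f'(x) = g'(x) + h'(x)$ |
| Product | $f(x) = g(x)h(x)$ | $f'(x) = g(x)h'(x) + g'(x)h(x)$ |
| Quotient | $f(x) = \frac{g(x)}{h(x)}$ | $f'(x) = \frac{g'(x)h(x) - g(x)h'(x)}{h^2(x)}$ |
| Power | $f(x) = x^r$ with $r \neq 0$ | $f'(x) = rx^{r-1}$ |
| Exponential | $f(x) = \exp(x)$ | $f'(x) = \exp(x)$ |
| Logarithm | $f(x) = \ln(x)$ | $f'(x) = \frac{1}{x}$ |
| Sin | $f(x) = \sin(x)$ | $f'(x) = \cos(x)$ |
| Cos | $f(x) = \cos(x)$ | $f'(x) = -\sin(x)$ |
| Tan | $f(x) = \tan(x)$ | $f'(x) = \frac{1}{\cos^2(x)}$ |
| Chain Rule | $f(x) = g(h(x))$ | $f'(x) = g'(h(x))\,h'(x)$ |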
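Let’s try differentiating a simple function using the above rules: we will find the derivative of $f(x) = x^3 + \cos(x)$. Using
the rule for the derivative of sums, we find that $f'(x) = \frac{d}{dx}[x^3] + \frac{d}{dx}[\cos(x)]$. Using the rules for powers and
for the cosine, we get $f'(x) = 3x^2 - \sin(x)$.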
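A result like this can be sanity-checked numerically. The sketch below compares the analytical derivative $3x^2 - \sin(x)$ with a central-difference estimate; note that the central difference is a swapped-in variant of the forward-difference definition used above, and `numerical_derivative` and the test points are illustrative assumptions.

```python
import math

def f(x):
    return x ** 3 + math.cos(x)

def f_prime(x):
    # Derivative obtained with the sum, power and cosine rules.
    return 3 * x ** 2 - math.sin(x)

def numerical_derivative(func, x, eps=1e-6):
    """Central-difference estimate of the derivative of func at x."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)

errors = [abs(numerical_derivative(f, x) - f_prime(x)) for x in (-2.0, 0.0, 0.5, 2.0)]
print(max(errors))  # a very small number
```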
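Let’s try a harder example: let’s find the derivative of $f(x) = \sin(2x^2) + 1$. First, let’s define $u(x) = \sin(x) + 1$ and $v(x) = 2x^2$. Using the
chain rule, we have $f(x) = u(v(x))$ and therefore $f'(x) = u'(v(x))\,v'(x)$. The two derivatives we need are:
$$u'(x) = \frac{d}{dx}[\sin(x)] + \frac{d}{dx}[1] = \cos(x) + 0 = \cos(x) \quad \text{and} \quad v'(x) = 4x$$
Putting them together:
$$f'(x) = u'(v(x))\,v'(x) = \cos(2x^2) \cdot 4x$$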
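The chain rule computation can be written out literally in code, which makes the roles of $u$ and $v$ explicit. This is a sketch for checking the result; the evaluation point `x = 0.7` and the step `eps` are arbitrary choices.

```python
import math

def u(x):
    return math.sin(x) + 1

def u_prime(x):
    return math.cos(x)

def v(x):
    return 2 * x ** 2

def v_prime(x):
    return 4 * x

def f(x):
    return u(v(x))                      # f(x) = sin(2x^2) + 1

def f_prime(x):
    return u_prime(v(x)) * v_prime(x)   # chain rule: u'(v(x)) * v'(x)

x = 0.7
eps = 1e-6
approx = (f(x + eps) - f(x - eps)) / (2 * eps)  # central-difference estimate
print(f_prime(x), approx)
```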
The $x$ values at which $f'(x) = 0$ correspond to local extrema: two global minima,
$f(\sqrt{2}) = f(-\sqrt{2}) = -\frac{1}{2}$, and one local maximum, $f(0) = \frac{1}{2}$.
If a function has a local extremum at a point $x_A$ and is differentiable at that point, then $f'(x_A) = 0$. However, the
reverse is not always true. For example, consider $f(x) = x^3$. Its derivative is $f'(x) = 3x^2$, which is equal to 0 at $x_A = 0$.
Yet, this point is not an extremum, as you can see on the following diagram. It’s just a single point where
the slope is 0.
So in short, you can optimize a function by analytically working out the points at which the derivative is 0,
and then investigating only these points. It’s a beautifully elegant solution, but it requires a lot of work, and
it’s not always easy, or even possible. For neural networks, it’s practically impossible.
Another option to optimize a function is to perform Gradient Descent (we will consider minimizing the
function, but the process would be almost identical if we tried to maximize a function instead): start at a
random point x0 , then use the function’s derivative to determine the slope at that point, and move a little bit
in the downwards direction, then repeat the process until you reach a local minimum, and cross your
fingers in the hope that this happens to be the global minimum.
At each iteration, the step size is proportional to the slope, so the process naturally slows down as it
approaches a local minimum. Each step is also proportional to the learning rate: a parameter of the
Gradient Descent algorithm itself (since it is not a parameter of the function we are optimizing, it is called a
hyperparameter).
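The procedure above can be sketched in a few lines. Everything here is an illustrative assumption: the function $f(x) = \frac{x^4}{4} - x^2 + \frac{1}{2}$ (whose minima are at $x = \pm\sqrt{2}$), the fixed starting point used instead of a random one so that the run is reproducible, the learning rate, and the number of steps.

```python
def f(x):
    return x ** 4 / 4 - x ** 2 + 0.5

def f_prime(x):
    return x ** 3 - 2 * x

def gradient_descent(derivative, x0, learning_rate=0.1, n_steps=200):
    """Repeatedly step against the slope; each step is proportional to it."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * derivative(x)
    return x

x_min = gradient_descent(f_prime, x0=0.5)
print(x_min, f(x_min))  # close to sqrt(2) and -0.5
```

Starting at `x0=-0.5` instead would lead to the other minimum at $-\sqrt{2}$, which illustrates that the result depends on the starting point.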
What’s the intuition behind second order derivatives? Well, since the (first order) derivative represents the
instantaneous rate of change of $f$ at each point, the second order derivative represents the instantaneous
rate of change of the rate of change itself, in other words, you can think of it as the acceleration of the
curve: if $f''(x) < 0$, then the curve is accelerating “downwards”, if $f''(x) > 0$ then the curve is accelerating
“upwards”, and if $f''(x) = 0$, then the curve is locally a straight line. Note that a curve could be going upwards
(i.e., $f'(x) > 0$) but also be accelerating downwards (i.e., $f''(x) < 0$): for example, imagine the path of a stone
thrown upwards, as it is being slowed down by gravity (which constantly accelerates the stone
downwards).
Deep Learning generally only uses first order derivatives, but you will sometimes run into some
optimization algorithms or cost functions based on second order derivatives.
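The stone example can be made concrete with finite differences. This is a sketch under simplifying assumptions: the height model $h(t) = v_0 t - \frac{1}{2} g t^2$, the initial speed `v0`, and the step `eps` are illustrative choices, and the helper names are hypothetical.

```python
def height(t, v0=20.0, g=9.81):
    """Height of a stone thrown upwards at speed v0, ignoring air resistance."""
    return v0 * t - 0.5 * g * t ** 2

def first_derivative(func, t, eps=1e-4):
    return (func(t + eps) - func(t - eps)) / (2 * eps)

def second_derivative(func, t, eps=1e-4):
    return (func(t + eps) - 2 * func(t) + func(t - eps)) / eps ** 2

t = 1.0
velocity = first_derivative(height, t)       # positive: the stone is still going up
acceleration = second_derivative(height, t)  # negative: it is being slowed down
print(velocity, acceleration)
```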
Partial Derivatives
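Up to now, we have only considered functions with a single variable $x$. What happens when there are
multiple variables? For example, let’s start with a simple function with 2 variables: $f(x, y) = \sin(xy)$. If we plot this
function, using $z = f(x, y)$, we get the following 3D graph. I also plotted some point $A$ on the surface.
The partial derivative of $f$ with regard to $x$, noted $\frac{\partial f}{\partial x}$, is defined like the ordinary derivative, except
that all variables other than $x$ (just $y$ in this example) are treated as constants:
$$\frac{\partial f}{\partial x} = \lim_{\epsilon \to 0} \frac{f(x + \epsilon, y) - f(x, y)}{\epsilon}$$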
If you use the derivative rules listed earlier (in this example you would just need the product rule and the
chain rule), making sure to treat y as a constant, then you will find:
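$$\frac{\partial f}{\partial x} = y \cos(xy)$$
Similarly, the partial derivative of $f$ with regard to $y$ is defined as:
$$\frac{\partial f}{\partial y} = \lim_{\epsilon \to 0} \frac{f(x, y + \epsilon) - f(x, y)}{\epsilon}$$
All variables except for $y$ are treated like constants (just $x$ in this example). Using the derivative rules, we
get:
$$\frac{\partial f}{\partial y} = x \cos(xy)$$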
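Both partial derivatives can be checked numerically by nudging one variable while holding the other fixed. In this sketch the helpers `partial_x` and `partial_y`, the evaluation point, and the use of central differences (rather than the one-sided limit above) are illustrative assumptions.

```python
import math

def f(x, y):
    return math.sin(x * y)

def partial_x(func, x, y, eps=1e-6):
    """Vary x only; y is treated as a constant."""
    return (func(x + eps, y) - func(x - eps, y)) / (2 * eps)

def partial_y(func, x, y, eps=1e-6):
    """Vary y only; x is treated as a constant."""
    return (func(x, y + eps) - func(x, y - eps)) / (2 * eps)

x, y = 0.5, 1.5
df_dx = partial_x(f, x, y)
df_dy = partial_y(f, x, y)
print(df_dx, y * math.cos(x * y))  # the two values should nearly agree
print(df_dy, x * math.cos(x * y))
```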
We now have equations to compute the slope along the x axis and along the y axis. But what about the
other directions? If you were standing on the surface at point A , you could decide to walk in any direction
you choose, not just along the x or y axes. What would the slope be then? Shouldn’t we compute the
slope along every possible direction?
Well, it can be shown that if all the partial derivatives are defined and continuous in a neighborhood around
point $A$, then the function $f$ is totally differentiable at that point, meaning that it can be locally
approximated by a plane $P_A$ (the tangent plane to the surface at point $A$). In this case, having just the
partial derivatives along each axis ($x$ and $y$ in our case) is sufficient to perfectly characterize that plane.
Its equation is:
$$z = f(x_A, y_A) + (x - x_A)\frac{\partial f}{\partial x}(x_A, y_A) + (y - y_A)\frac{\partial f}{\partial y}(x_A, y_A)$$
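To see how well the tangent plane approximates the surface, one can compare the two near and far from $A$. This sketch uses $f(x, y) = \sin(xy)$ with its partial derivatives from above; the location of point $A$ and the two test points are hypothetical.

```python
import math

def f(x, y):
    return math.sin(x * y)

def df_dx(x, y):
    return y * math.cos(x * y)

def df_dy(x, y):
    return x * math.cos(x * y)

def tangent_plane(x, y, x_a, y_a):
    """Height of the tangent plane at A = (x_a, y_a), evaluated at (x, y)."""
    return (f(x_a, y_a)
            + (x - x_a) * df_dx(x_a, y_a)
            + (y - y_a) * df_dy(x_a, y_a))

x_a, y_a = 0.5, 1.0   # hypothetical point A
near = (0.51, 1.01)   # close to A
far = (1.5, 2.0)      # far from A

error_near = abs(f(*near) - tangent_plane(*near, x_a, y_a))
error_far = abs(f(*far) - tangent_plane(*far, x_a, y_a))
print(error_near, error_far)
```

The approximation error is tiny close to $A$ and large far from it, which is what "locally approximated" means.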
In Deep Learning, we will generally be dealing with well-behaved functions that are totally differentiable at
any point where all the partial derivatives are defined, but you should know that some functions are not that
nice. For example, consider the function:
$$h(x, y) = \begin{cases} 0 & \text{if } x = 0 \text{ or } y = 0 \\ 1 & \text{otherwise} \end{cases}$$
At the origin (i.e., at $(x, y) = (0, 0)$), the partial derivatives of the function $h$ with respect to $x$ and $y$ are both perfectly
defined: they are equal to 0. Yet the function can clearly not be approximated by a plane at that point. It is
not totally differentiable at that point (but it is totally differentiable at any point off the axes).
Gradients
So far we have considered only functions with a single variable $x$, or with 2 variables, $x$ and $y$, but the
previous paragraph also applies to functions with more variables. So let’s consider a function $f$ with $n$
variables: $f(x_1, x_2, \dots, x_n)$. For convenience, we will define a vector $\mathbf{X}$ whose components are these variables:
$$\mathbf{X} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$
Now $f(\mathbf{X})$ is a short way of writing $f(x_1, x_2, \dots, x_n)$.
The gradient of the function $f(\mathbf{X})$ at some point $\mathbf{X}_A$ is the vector whose components are all the partial
derivatives of the function at that point. It is noted $\nabla f(\mathbf{X}_A)$, or sometimes $\nabla_{\mathbf{X}_A} f$:
$$\nabla f(\mathbf{X}_A) = \begin{pmatrix} \dfrac{\partial f}{\partial x_1}(\mathbf{X}_A) \\ \dfrac{\partial f}{\partial x_2}(\mathbf{X}_A) \\ \vdots \\ \dfrac{\partial f}{\partial x_n}(\mathbf{X}_A) \end{pmatrix}$$
Assuming the function is totally differentiable at the point XA , then the surface it describes can be
approximated by a plane at that point (as discussed in the previous section), and the gradient vector is the
one that points towards the steepest slope on that plane.
In Deep Learning, the Gradient Descent algorithm we discussed earlier is based on gradients instead of
derivatives (hence its name). It works in much the same way, but using vectors instead of scalars: simply
start with a random vector $\mathbf{X}_0$, then compute the gradient of $f$ at that point, and perform a small step in
the opposite direction, then repeat until convergence. More precisely, at each step $t$, compute
$\mathbf{X}_t = \mathbf{X}_{t-1} - \eta \nabla f(\mathbf{X}_{t-1})$. The constant $\eta$ is the learning rate, typically a small value such as $10^{-3}$. In
practice, we generally use more efficient variants of this algorithm, but the general idea remains the same.
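The update rule $\mathbf{X}_t = \mathbf{X}_{t-1} - \eta \nabla f(\mathbf{X}_{t-1})$ can be sketched with plain Python lists. The quadratic function, its hand-written gradient, the fixed starting vector, and the learning rate of 0.1 (larger than the typical $10^{-3}$, so that this toy problem converges in few steps) are all illustrative assumptions.

```python
def f(point):
    x, y = point
    return (x - 1) ** 2 + 2 * (y + 2) ** 2   # minimum at (1, -2)

def grad_f(point):
    x, y = point
    return [2 * (x - 1), 4 * (y + 2)]        # vector of partial derivatives

def gradient_descent(gradient, start, learning_rate=0.1, n_steps=200):
    """Apply X_t = X_{t-1} - eta * grad f(X_{t-1}) for a fixed number of steps."""
    point = list(start)
    for _ in range(n_steps):
        g = gradient(point)
        point = [p - learning_rate * g_i for p, g_i in zip(point, g)]
    return point

minimum = gradient_descent(grad_f, start=[5.0, 5.0])
print(minimum)  # close to [1.0, -2.0]
```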
In Deep Learning, the letter $\mathbf{X}$ is generally used to represent the input data. When you use a neural
network to make predictions, you feed the neural network the inputs $\mathbf{X}$, and you get back a prediction
$\hat{y} = f(\mathbf{X})$. The function $f$ treats the model parameters as constants. We can use more explicit notation by
writing $\hat{y} = f_{\mathbf{w}}(\mathbf{X})$, where $\mathbf{w}$ represents the model parameters and indicates that the function relies on them, but
treats them as constants.
When training the network, it is the other way around: the inputs $\mathbf{X}$ are fixed and the parameters $\mathbf{w}$ vary,
so the model is written $f_{\mathbf{X}}(\mathbf{w})$. The loss function $L(\mathbf{w}) = g(f_{\mathbf{X}}(\mathbf{w}), \mathbf{y})$
measures the “discrepancy” between the predictions $f_{\mathbf{X}}(\mathbf{w})$ and the labels $\mathbf{y}$, where $f_{\mathbf{X}}(\mathbf{w})$ represents the
vector containing the predictions for each training example. Minimizing the loss function is usually
performed using Gradient Descent (or a variant of GD): we start with random model parameters $\mathbf{w}_0$, then
we compute $\nabla L(\mathbf{w}_0)$ and we use this gradient vector to perform a Gradient Descent step, then we repeat the
process until convergence. It is crucial to understand that the gradient of the loss function is with regard to
the model parameters $\mathbf{w}$ (not the inputs $\mathbf{X}$).
Jacobians
Until now we have only considered functions that output a scalar, but it is possible to output vectors
instead. For example, a classification neural network typically outputs one probability for each class, so if
there are $m$ classes, the neural network will output an $m$-dimensional vector for each input.
In Deep Learning we generally only need to differentiate the loss function, which almost always outputs a
single scalar number. But suppose for a second that you want to differentiate a function $f(\mathbf{X})$ which
outputs $m$-dimensional vectors. The good news is that you can treat each output dimension independently
of the others. This will give you a partial derivative for each input dimension and each output dimension. If
you put them all in a single matrix, with one column per input dimension and one row per output dimension,
you get the so-called Jacobian matrix.
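$$\mathbf{J}_f(\mathbf{X}_A) = \begin{pmatrix}
\dfrac{\partial f_1}{\partial x_1}(\mathbf{X}_A) & \dfrac{\partial f_1}{\partial x_2}(\mathbf{X}_A) & \dots & \dfrac{\partial f_1}{\partial x_n}(\mathbf{X}_A) \\
\dfrac{\partial f_2}{\partial x_1}(\mathbf{X}_A) & \dfrac{\partial f_2}{\partial x_2}(\mathbf{X}_A) & \dots & \dfrac{\partial f_2}{\partial x_n}(\mathbf{X}_A) \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1}(\mathbf{X}_A) & \dfrac{\partial f_m}{\partial x_2}(\mathbf{X}_A) & \dots & \dfrac{\partial f_m}{\partial x_n}(\mathbf{X}_A)
\end{pmatrix}$$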
The partial derivatives themselves are often called the Jacobians: they are simply the first order partial
derivatives of the function $f$.
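A Jacobian can be estimated numerically by perturbing one input at a time and recording how every output changes. The sketch below is not a library routine: `numerical_jacobian` is a hypothetical helper, and the two-output function `g` is chosen only for illustration.

```python
import math

def g(point):
    """Illustrative function with 2 inputs and 2 outputs."""
    x, y = point
    return [2 * x ** 2, x * math.sqrt(y)]

def numerical_jacobian(func, point, eps=1e-6):
    """Rows are output dimensions, columns are input dimensions."""
    n_outputs = len(func(point))
    jacobian = [[0.0] * len(point) for _ in range(n_outputs)]
    for j in range(len(point)):
        plus = list(point)
        minus = list(point)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = func(plus), func(minus)
        for i in range(n_outputs):
            jacobian[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jacobian

J = numerical_jacobian(g, [1.0, 4.0])
for row in J:
    print(row)  # approximately [4, 0] and [2, 0.25]
```

For this `g`, the analytical Jacobian is $\begin{pmatrix} 4x & 0 \\ \sqrt{y} & \frac{x}{2\sqrt{y}} \end{pmatrix}$, which gives the values in the comment at $(1, 4)$.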
Hessians
Let’s come back to a function $f(\mathbf{X})$ which takes an $n$-dimensional vector as input and outputs a scalar. If
you determine the equation of the partial derivative of $f$ with regard to $x_i$ (the $i$th component of $\mathbf{X}$), you
will get a new function of $\mathbf{X}$: $\frac{\partial f}{\partial x_i}$. You can then compute the partial derivative of this function with regard
to $x_j$ (the $j$th component of $\mathbf{X}$). The result is a partial derivative of a partial derivative: in other words, it is
a second order partial derivative, also called a Hessian. It is noted $\frac{\partial^2 f}{\partial x_j \partial x_i}$. If $i \neq j$ then it is called
a mixed second order partial derivative. Or else, if $j = i$, it is noted $\frac{\partial^2 f}{\partial x_i^2}$.
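Let’s look at an example: $f(x, y) = \sin(xy)$. As we showed earlier, the first order partial derivatives of $f$ are:
$\frac{\partial f}{\partial x} = y \cos(xy)$ and $\frac{\partial f}{\partial y} = x \cos(xy)$. So we can now compute all the Hessians (using the derivative rules we
discussed earlier):
$$\frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}[y \cos(xy)] = -y^2 \sin(xy)$$
$$\frac{\partial^2 f}{\partial y \, \partial x} = \frac{\partial}{\partial y}[y \cos(xy)] = \cos(xy) - xy \sin(xy)$$
$$\frac{\partial^2 f}{\partial x \, \partial y} = \frac{\partial}{\partial x}[x \cos(xy)] = \cos(xy) - xy \sin(xy)$$
$$\frac{\partial^2 f}{\partial y^2} = \frac{\partial}{\partial y}[x \cos(xy)] = -x^2 \sin(xy)$$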
Note that $\frac{\partial^2 f}{\partial x \, \partial y} = \frac{\partial^2 f}{\partial y \, \partial x}$. This is the case whenever all the partial derivatives are defined and continuous in a
neighborhood around the point at which we differentiate.
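The symmetry of the mixed derivatives can be observed numerically by differentiating in the two possible orders. This is a rough sketch: `partial_x` and `partial_y` are hypothetical helpers that return new functions, and nested finite differences are only accurate to a few digits.

```python
import math

def f(x, y):
    return math.sin(x * y)

def partial_x(func, eps=1e-4):
    """Return a function approximating the partial derivative w.r.t. x."""
    return lambda x, y: (func(x + eps, y) - func(x - eps, y)) / (2 * eps)

def partial_y(func, eps=1e-4):
    """Return a function approximating the partial derivative w.r.t. y."""
    return lambda x, y: (func(x, y + eps) - func(x, y - eps)) / (2 * eps)

x, y = 0.5, 1.5
d2f_dydx = partial_y(partial_x(f))(x, y)   # first x, then y
d2f_dxdy = partial_x(partial_y(f))(x, y)   # first y, then x
d2f_dx2 = partial_x(partial_x(f))(x, y)

exact_mixed = math.cos(x * y) - x * y * math.sin(x * y)
print(d2f_dydx, d2f_dxdy, exact_mixed)
```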
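The matrix containing all the Hessians is called the Hessian matrix:
$$\mathbf{H}_f(\mathbf{X}_A) = \begin{pmatrix}
\dfrac{\partial^2 f}{\partial x_1^2}(\mathbf{X}_A) & \dfrac{\partial^2 f}{\partial x_1 \partial x_2}(\mathbf{X}_A) & \dots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n}(\mathbf{X}_A) \\
\dfrac{\partial^2 f}{\partial x_2 \partial x_1}(\mathbf{X}_A) & \dfrac{\partial^2 f}{\partial x_2^2}(\mathbf{X}_A) & \dots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n}(\mathbf{X}_A) \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_n \partial x_1}(\mathbf{X}_A) & \dfrac{\partial^2 f}{\partial x_n \partial x_2}(\mathbf{X}_A) & \dots & \dfrac{\partial^2 f}{\partial x_n^2}(\mathbf{X}_A)
\end{pmatrix}$$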
There are great optimization algorithms which take advantage of the Hessians, but in practice Deep
Learning almost never uses them. Indeed, if a function has $n$ variables, there are $n^2$ Hessians: since
neural networks typically have several millions of parameters, the number of Hessians would exceed
thousands of billions. Even if we had the necessary amount of RAM, the computations would be
prohibitively slow.
Examples
Derivative 1: Composed Exponential Function
$$f(x) = e^{x^2}, \quad x \in \mathbb{R}$$
The exponential function is a very foundational, common, and useful example. It is a strictly positive
function, i.e. $e^x > 0$ in $\mathbb{R}$, and an important property to remember is that $e^0 = 1$. In addition, you should
remember that the exponential is the inverse of the logarithmic function. It is also one of the easiest
functions to differentiate because its derivative is simply the exponential itself, i.e. $(e^x)' = e^x$. The derivative
becomes trickier when the exponential is combined with another function. In such cases, we use the chain
rule formula, which states that the derivative of $f(g(x))$ is equal to $f'(g(x)) \cdot g'(x)$, i.e.,
$$\frac{\partial f(g(x))}{\partial x} = f'(g(x)) \frac{\partial g}{\partial x}$$
Applying the chain rule, we can compute the derivative of $f(x) = e^{x^2}$. We first find the derivative of the inner
function, $g(x) = x^2$, which is $g'(x) = 2x$, and recall that $(e^x)' = e^x$. Multiplying these two intermediate
results, we obtain,
$$\frac{\partial e^{x^2}}{\partial x} = \frac{\partial x^2}{\partial x} \times e^{x^2} = 2x \times e^{x^2}$$
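As a quick sanity check, the closed form $2x e^{x^2}$ can be compared with a finite-difference estimate. The evaluation point and step size below are arbitrary choices made for this sketch.

```python
import math

def f(x):
    return math.exp(x ** 2)

def f_prime(x):
    # Chain rule: derivative of the inner function (2x) times e^(x^2).
    return 2 * x * math.exp(x ** 2)

x = 0.8
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)  # central-difference estimate
print(numeric, f_prime(x))
```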
This function is a classic in interviews, especially in the financial/quant industry, where math skills are
tested in even greater depth than in tech companies for machine learning positions. It sometimes brings
the interviewees out of their comfort zone, but really, the hardest part of this question is to be able to start
correctly.
The most important thing to realize when approaching a function in such exponential form is, first, the
inverse relationship between exponential and logarithm, and, second, the fact that every exponential
function can be rewritten as a natural exponential function in the form of,
$$\forall a \in \mathbb{R}^{*}_{+}, \ \forall b \in \mathbb{R}, \quad a^b = e^{b \ln(a)}$$
Before we get to our $f(x) = x^x$ example, let us demonstrate this property with a simpler function $f(x) = 2^x$. We first
use the above equation to rewrite $2^x$ as $\exp(x \ln(2))$ and subsequently apply the chain rule.
$$2^x = e^{x \ln(2)} \implies \frac{\partial e^{x \ln(2)}}{\partial x} = \frac{\partial x \ln(2)}{\partial x} \times e^{x \ln(2)} = \ln(2) e^{x \ln(2)} = \ln(2) 2^x$$
Going back to the original function $f(x) = x^x$, once you rewrite the function as $f(x) = \exp(x \ln x)$, the derivative becomes
relatively straightforward to compute, with the only potentially difficult part being the chain rule step.
$$x^x = e^{x \ln(x)} \implies \frac{\partial x^x}{\partial x} = \frac{\partial x \ln(x)}{\partial x} \times e^{x \ln(x)} = \left(\ln(x) + \frac{x}{x}\right) x^x = x^x (1 + \ln(x))$$
Note that here we used the product rule $(uv)' = u'v + uv'$ for the exponent $x \ln(x)$.
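The result $x^x (1 + \ln(x))$ and the rewriting $x^x = e^{x \ln(x)}$ can both be checked numerically for a few positive values of $x$. The test points and step size are illustrative choices in this sketch.

```python
import math

def f(x):
    return x ** x

def f_prime(x):
    return x ** x * (1 + math.log(x))

def numerical_derivative(func, x, eps=1e-6):
    return (func(x + eps) - func(x - eps)) / (2 * eps)

points = (0.5, 1.0, 2.0)   # strictly positive values only
errors = [abs(numerical_derivative(f, x) - f_prime(x)) for x in points]
rewrite_gap = abs(f(2.0) - math.exp(2.0 * math.log(2.0)))  # x^x vs e^(x ln x)
print(max(errors), rewrite_gap)
```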
This function is generally asked without any information on the function’s domain. If your interviewer
doesn’t specify the domain by default, they might be testing your mathematical acuity. Here is where the
question gets deceiving. Without being specific about the domain, it seems that $x^x$ is defined for both
positive and negative values. However, for negative $x$, e.g. $(-0.9)^{(-0.9)}$, the result is a complex number,
concretely $-1.05 - 0.34i$. A potential way out would be to define the domain of the function as $\mathbb{Z}^{-} \cup \mathbb{R}^{+} \setminus 0$ (see
here for further discussion), but this would still not be differentiable for negative values. Therefore, in order
to properly define the derivative of $x^x$, we need to restrict the domain to only strictly positive values. We
exclude 0 because for a derivative to be defined in 0, we need the limit derivative from the left (limit in 0 for
negative values) to be equal to the limit derivative from the right (limit in 0 for positive values), a condition
that is broken in this case. Since the left limit $\lim_{\Delta x \to 0^{-}} \frac{f(0 + \Delta x) - f(0)}{\Delta x}$ is undefined, the function is not differentiable in
0.
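Python makes the domain issue easy to observe, because in Python 3 raising a negative float to a non-integer power returns a complex number. A short sketch (the variable names are illustrative):

```python
# The parentheses matter: -0.9 ** -0.9 would be parsed as -(0.9 ** -0.9).
value = (-0.9) ** (-0.9)

is_complex = isinstance(value, complex)
rounded = complex(round(value.real, 2), round(value.imag, 2))
print(value, is_complex, rounded)
```

The rounded value agrees with the $-1.05 - 0.34i$ quoted above.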
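The same rewriting applies to $x^{x^2}$:
$$x^{x^2} = e^{x^2 \ln(x)} \implies \frac{\partial e^{x^2 \ln(x)}}{\partial x} = \frac{\partial x^2 \ln(x)}{\partial x} \times e^{x^2 \ln(x)} = \left(x^2 \frac{1}{x} + 2x \ln(x)\right) e^{x^2 \ln(x)} = x^{x^2 + 1} (1 + 2 \ln(x))$$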
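Derivative 3: Gradient of a Multidimensional Input Function
$$f(x, y, z) = 2^{xy} + z \cos(x), \quad (x, y, z) \in \mathbb{R}^3$$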
So far, the functions discussed in the first and second derivative sections are functions mapping from $\mathbb{R}$ to
$\mathbb{R}$, i.e. the domain as well as the range of the function are real numbers. But machine learning is essentially
vectorial and the functions are multidimensional. A good example of such multidimensionality is a neural
network layer of input size $m$ and output size $k$, i.e., $f(x) = g(W^\top x + b)$, which is an elementwise composition of a
linear mapping $W^\top x$ (with weight matrix $W$ and input vector $x$) and a nonlinear mapping $g$ (activation
function). In the general case, this can also be viewed as a mapping from $\mathbb{R}^m$ to $\mathbb{R}^k$.
In the specific case of $k = 1$, the derivative is called the gradient. Let us now compute the derivative of the following three-dimensional function mapping $\mathbb{R}^3$ to $\mathbb{R}$:

$$f(x, y, z) = 2^{xy} + z\cos(x)$$

You can think of $f$ as a function mapping a vector of size 3 to a vector of size 1.
The derivative of a multidimensional input function is called a gradient. The gradient of a function $g$ that maps $\mathbb{R}^n$ to $\mathbb{R}$ is a set of $n$ partial derivatives of $g$, where each partial derivative is a function of $n$ variables. Thus, if $g$ is a mapping from $\mathbb{R}^n$ to $\mathbb{R}$, its gradient $\nabla g$ is a mapping from $\mathbb{R}^n$ to $\mathbb{R}^n$.
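To find the gradient of our function $f(x, y, z) = 2^{xy} + z\cos(x)$, we construct the vector of partial derivatives $\partial f/\partial x$, $\partial f/\partial y$ and $\partial f/\partial z$:

$$\nabla f(x, y, z) = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \\ \frac{\partial f}{\partial z} \end{bmatrix} = \begin{bmatrix} \ln(2)\, y\, 2^{xy} - z\sin(x) \\ \ln(2)\, x\, 2^{xy} \\ \cos(x) \end{bmatrix}$$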
Note that this example is similar to the previous section, and that we use the equivalence $2^{xy} = \exp(xy\ln(2))$.
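The gradient derived for $f(x, y, z) = 2^{xy} + z\cos(x)$ can be checked numerically. The sketch below is one possible way to do so, using central finite differences per coordinate; the function names, the test point, and the step size `h` are assumptions chosen for illustration.

```python
import math

def f(x, y, z):
    """f(x, y, z) = 2**(x*y) + z*cos(x)."""
    return 2.0 ** (x * y) + z * math.cos(x)

def grad_f(x, y, z):
    """Analytic gradient: one partial derivative per input variable."""
    p = 2.0 ** (x * y)
    return [
        math.log(2.0) * y * p - z * math.sin(x),  # df/dx
        math.log(2.0) * x * p,                    # df/dy
        math.cos(x),                              # df/dz
    ]

def numerical_gradient(func, point, h=1e-6):
    """Central-difference estimate of the gradient at `point`."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grad.append((func(*up) - func(*down)) / (2.0 * h))
    return grad

point = (1.0, 2.0, 3.0)  # arbitrary test point
print("analytic :", grad_f(*point))
print("numerical:", numerical_gradient(f, point))
```

Both vectors have three entries, matching the statement that the gradient of a function from $\mathbb{R}^3$ to $\mathbb{R}$ maps $\mathbb{R}^3$ to $\mathbb{R}^3$.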
In conclusion, for a multidimensional function that maps $\mathbb{R}^3$ to $\mathbb{R}$, the derivative is a gradient $\nabla f$, which maps $\mathbb{R}^3$ to $\mathbb{R}^3$.
In a general form of mappings $\mathbb{R}^m$ to $\mathbb{R}^k$ where $k > 1$, the derivative of a multidimensional function that maps $\mathbb{R}^m$ to $\mathbb{R}^k$ is a Jacobian matrix (instead of a gradient vector). Let us investigate this in the next section.
$$f(x, y) = [2x^2,\ x\sqrt{y}], \quad x \in \mathbb{R},\ y \in \mathbb{R}^+$$
We know from the previous section that the derivative of a function mapping $\mathbb{R}^m$ to $\mathbb{R}$ is a gradient mapping $\mathbb{R}^m$ to $\mathbb{R}^m$. But what about the case where the output domain is also multidimensional, i.e. a mapping from $\mathbb{R}^m$ to $\mathbb{R}^k$ for $k > 1$?
In such a case, the derivative is called the Jacobian matrix. We can view the gradient simply as a special case of the Jacobian with dimension $m \times 1$, with $m$ equal to the number of variables (under the row-per-output convention used here, it corresponds to the transpose of a $1 \times m$ Jacobian). The Jacobian $J(g)$ of a function $g$ mapping $\mathbb{R}^m$ to $\mathbb{R}^k$ has a dimension of $k \times m$, i.e. is a matrix of shape $k \times m$. In other words, each row $i$ of $J(g)$ represents the gradient $\nabla g_i$ of each subfunction $g_i$ of $g$.
Let us differentiate the above defined function $f(x, y) = [2x^2, x\sqrt{y}]$ mapping $\mathbb{R}^2$ to $\mathbb{R}^2$, thus both input and output domains are multidimensional. In this particular case, since the square root function is not defined for negative values, we need to restrict the domain of $y$ to $\mathbb{R}^+$. The first row of our output Jacobian will be the derivative of function 1, i.e. $\nabla 2x^2$, and the second row the derivative of function 2, i.e. $\nabla x\sqrt{y}$.
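$$\nabla f(x, y) = J_f(x, y) = \begin{bmatrix} \frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} \\ \frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y} \end{bmatrix} = \begin{bmatrix} 4x & 0 \\ \sqrt{y} & \frac{x}{2\sqrt{y}} \end{bmatrix}$$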
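To make the row-per-output structure of the Jacobian concrete, here is a small sketch that builds the analytic matrix for $f(x, y) = [2x^2, x\sqrt{y}]$ and compares it with a finite-difference estimate. The helper `numerical_jacobian`, the test point, and the step size are illustrative assumptions.

```python
import math

def f(x, y):
    """f(x, y) = [2x^2, x*sqrt(y)], defined for y > 0."""
    return [2.0 * x ** 2, x * math.sqrt(y)]

def jacobian_f(x, y):
    """Analytic Jacobian: row i holds the gradient of output component i."""
    return [
        [4.0 * x, 0.0],
        [math.sqrt(y), x / (2.0 * math.sqrt(y))],
    ]

def numerical_jacobian(func, point, h=1e-6):
    """Central-difference Jacobian of shape k x m (outputs x inputs)."""
    k, m = len(func(*point)), len(point)
    J = [[0.0] * m for _ in range(k)]
    for j in range(m):
        up, down = list(point), list(point)
        up[j] += h
        down[j] -= h
        f_up, f_down = func(*up), func(*down)
        for i in range(k):
            J[i][j] = (f_up[i] - f_down[i]) / (2.0 * h)
    return J

point = (1.5, 4.0)  # arbitrary point with y > 0
print("analytic :", jacobian_f(*point))
print("numerical:", numerical_jacobian(f, point))
```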
In deep learning, an example where the Jacobian is of special interest is in the explainability field (see, for
example, Sensitivity based Neural Networks Explanations) that aims to understand the behavior of neural
networks and analyses sensitivity of the output layer of neural networks with regard to the inputs. The
Jacobian helps to investigate the impact of variation in the input space on the output. This can analogously
be applied to understand the concepts of intermediate layers in neural networks.
In summary, remember that while gradient is a derivative of a scalar with regard to a vector, Jacobian is a
derivative of a vector with regard to another vector.
$$f(x, y) = x^2 y^3, \quad (x, y) \in \mathbb{R}^2$$
So far, our discussion has only been focused on first-order derivatives, but in neural networks we often talk about higher-order derivatives of multidimensional functions. A specific case is the second derivative, also called the Hessian matrix, and denoted $H(f)$ or $\nabla^2$ (nabla squared). The Hessian of a function $g$ mapping $\mathbb{R}^n$ to $\mathbb{R}$ is a mapping $H(g)$ from $\mathbb{R}^n$ to $\mathbb{R}^{n \times n}$.
Let us analyze how we went from $\mathbb{R}$ to $\mathbb{R}^{n \times n}$ on the output domain. The first derivative, i.e. gradient $\nabla g$, is a mapping from $\mathbb{R}^n$ to $\mathbb{R}^n$ and its derivative is a Jacobian. Thus, the differentiation of each subfunction $\nabla g_i$ results in a mapping of $\mathbb{R}^n$ to $\mathbb{R}^n$, with $n$ such functions. You can think of this as if differentiating each element of the gradient vector expanded it into a vector, thus yielding a vector of vectors, i.e. a matrix.
To compute the Hessian, we need to calculate so-called cross-derivatives, that is, differentiate first with respect to $x$ and then with respect to $y$, or vice versa. One might ask if the order in which we take the cross-derivatives matters; in other words, if the Hessian matrix is symmetric or not. In cases where the function $f$ is $C^2$, i.e. twice continuously differentiable, Schwarz's theorem states that the cross-derivatives are equal and thus the Hessian matrix is symmetric. Some functions whose second partial derivatives exist but are not continuous do not satisfy the equality of cross-derivatives.
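Constructing the Hessian of a function is equal to finding the second-order partial derivatives of a scalar-valued function. For the specific example $f(x, y) = x^2 y^3$, the computation yields the following result:

$$\nabla^2 f(x, y) = H_f(x, y) = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2y^3 & 6xy^2 \\ 6xy^2 & 6yx^2 \end{bmatrix}$$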
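The Hessian of $f(x, y) = x^2 y^3$ and the symmetry of its cross-derivatives can be verified numerically. The following sketch uses a standard four-point finite-difference stencil; the helper names, test point, and step size `h` are assumptions made for illustration.

```python
def f(x, y):
    """f(x, y) = x^2 * y^3."""
    return x ** 2 * y ** 3

def hessian_f(x, y):
    """Analytic Hessian of f."""
    return [
        [2.0 * y ** 3, 6.0 * x * y ** 2],
        [6.0 * x * y ** 2, 6.0 * y * x ** 2],
    ]

def numerical_hessian(func, point, h=1e-4):
    """Estimate second partial derivatives with a four-point stencil."""
    n = len(point)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                p = list(point)
                p[i] += si * h
                p[j] += sj * h
                return func(*p)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return H

point = (2.0, 3.0)  # arbitrary test point
print("analytic :", hessian_f(*point))
print("numerical:", numerical_hessian(f, point))
```

Since this polynomial is twice continuously differentiable, the off-diagonal entries coincide, as Schwarz's theorem predicts.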
For a function mapping $\mathbb{R}^m$ to $\mathbb{R}^k$, the second derivative is a mapping from $\mathbb{R}^m$ to $\mathbb{R}^{k \times m \times m}$, i.e. a 3D tensor. Similarly to the Hessian, in order to find the gradient of the Jacobian (differentiate a second time), we differentiate each element of the $k \times m$ matrix and obtain a matrix of vectors, i.e. a tensor. While it is rather unlikely that you would be asked to do such a computation manually, it is important to be aware of higher-order derivatives for multidimensional functions.
Probability Theory
Concepts
Chance Events
Randomness is all around us. Probability theory is the mathematical framework that allows us to analyze
chance events in a logically sound manner. The probability of an event is a number indicating how likely
that event will occur. This number is always between 0 and 1, where 0 indicates impossibility and 1
indicates certainty.
A classic example of a probabilistic experiment is a fair coin toss, in which the two possible outcomes are heads or tails. In this case, the probability of flipping a head or a tail is $\frac{1}{2}$. In an actual series of coin tosses, we may get more or less than exactly 50% heads. But as the number of flips increases, the long-run frequency of heads is bound to get closer and closer to 50%.
For an unfair or weighted coin, the two outcomes are not equally likely, in which case, you’ll need to assign
appropriate weights to each of the outcomes. If we assign numbers to the outcomes — say, 1 for heads, 0
for tails — then we have created the mathematical object known as a random variable.
Expectation
The expectation of a random variable is a number that attempts to capture the center of that random variable's distribution. It can be interpreted as the long-run average of many independent samples from the given distribution. More precisely, it is defined as the probability-weighted sum of all possible values in the random variable's support,

$$E[X] = \sum_{x \in \mathcal{X}} x P(x)$$
Consider the probabilistic experiment of rolling a fair die. After a sufficiently large number of iterations, the
running sample mean converges to the expectation of 3.5. Changing the distribution of the different faces
of the die (thus making the die biased or “unfair”) would affect the expected value.
Variance
Whereas expectation provides a measure of centrality, the variance of a random variable quantifies the
spread of that random variable’s distribution. The variance is the average value of the squared difference
between the random variable and its expectation,
$$\mathrm{Var}(X) = E\left[(X - E[X])^2\right]$$
When you draw cards randomly from a deck of ten cards, you’ll observe that the running average of
squared differences begins to resemble the true variance.
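The definitions of expectation and variance can be illustrated with the fair die mentioned earlier. The sketch below computes both quantities exactly from the definitions and then compares them with running estimates from simulated rolls; the seed and the number of rolls are arbitrary assumptions.

```python
import random
from fractions import Fraction

faces = range(1, 7)
prob = Fraction(1, 6)  # fair die: every face equally likely

# Exact values from the definitions (probability-weighted sums).
expectation = sum(x * prob for x in faces)
variance = sum((x - expectation) ** 2 * prob for x in faces)
print("E[X]   =", expectation, "=", float(expectation))
print("Var(X) =", variance, "=", float(variance))

# Simulated estimates approach the exact values as the sample grows.
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(100_000)]
sample_mean = sum(rolls) / len(rolls)
sample_var = sum((r - sample_mean) ** 2 for r in rolls) / len(rolls)
print("sample mean     =", sample_mean)
print("sample variance =", sample_var)
```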
Set Theory
A set, broadly defined, is a collection of objects. In the context of probability theory, we use set notation to
specify compound events. For example, we can represent the event “roll an even number” by the set $\{2, 4, 6\}$. For this reason it is important to be familiar with the algebra of sets.
Counting
It can be surprisingly difficult to count the number of sequences or sets satisfying certain conditions.
For example, consider a bag of marbles in which each marble is a different color. If we draw marbles one at a time from the bag without replacement, and there are four distinct marbles in the bag, how many different ordered sequences (permutations) of the marbles are possible? How many different unordered sets (combinations)?
Permutations with $n = 4$ and $x = 1$, denoted ${}^nP_x$: $\frac{n!}{(n-x)!} = 4$.
Combinations with $n = 4$ and $x = 1$, denoted $\binom{n}{x}$ or ${}^nC_x$: $\frac{n!}{x!(n-x)!} = 4$.
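For a bag this small, the counts can be confirmed by brute-force enumeration. The sketch below compares Python's `math.perm` and `math.comb` (available from Python 3.8) against explicit enumeration with `itertools`; the marble colors are hypothetical placeholders.

```python
import itertools
import math

marbles = ["red", "green", "blue", "yellow"]  # hypothetical colors
n = len(marbles)

# Ordered sequences that use all four marbles: 4! of them.
all_orderings = math.perm(n, n)
print("orderings of all marbles:", all_orderings)

for x in range(n + 1):
    n_perm = math.perm(n, x)  # n! / (n - x)!
    n_comb = math.comb(n, x)  # n! / (x! (n - x)!)
    # Cross-check the formulas by listing every draw of size x.
    assert n_perm == len(list(itertools.permutations(marbles, x)))
    assert n_comb == len(list(itertools.combinations(marbles, x)))
    print(f"x={x}: permutations={n_perm}, combinations={n_comb}")
```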
Conditional Probability
Conditional probabilities allow us to account for information we have about our system of interest. For
example, we might expect the probability that it will rain tomorrow (in general) to be smaller than the
probability it will rain tomorrow given that it is cloudy today. This latter probability is a conditional probability,
since it accounts for relevant information that we possess.
Mathematically, computing a conditional probability amounts to shrinking our sample space to a particular
event. So in our rain example, instead of looking at how often it rains on any day in general, we “pretend”
that our sample space consists of only those days for which the previous day was cloudy. We then
determine how many of those days were rainy.
Probability Distributions
A probability distribution specifies the relative likelihoods of all possible outcomes. Before we dive into
some common probability distributions, let’s go over the associated terminologies.
Random Variables
Formally, a random variable is a function that assigns a real number to each outcome in the probability
space. By sampling from the probability space associated with your distribution, you can generate the
empirical distribution of your random variable.
Central Limit Theorem
The Central Limit Theorem (CLT) states that the sample mean of a sufficiently large number of independent and identically distributed (i.i.d.) random variables with finite variance is approximately normally distributed. The larger the sample size, the better the approximation.
Discrete
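If $X$ is a discrete random variable, then there exist unique nonnegative functions, $f(x)$ and $F(x)$, such that the following are true:

$$P(X = x) = f(x)$$

$$P(X \le x) = F(x)$$

where $f(x)$ denotes the probability mass function and $F(x)$ denotes the cumulative distribution function.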
Continuous
A continuous random variable takes on an uncountably infinite number of possible values (e.g. all real
numbers).
If $X$ is a continuous random variable, then there exist unique nonnegative functions, $f(x)$ and $F(x)$, such that the following are true:

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

$$P(X \le x) = F(x)$$

where $f(x)$ denotes the probability density function and $F(x)$ denotes the cumulative distribution function.
Bernoulli Distribution
The Bernoulli distribution arises as the result of a binary outcome, which is why it is used to model binary
data.
For example, building a spam vs. ham binary classifier, or modeling a coin toss.
A Bernoulli random variable thus models a discrete distribution.
Bernoulli random variables take the values 1 and 0 with probabilities of $p$ and $1 - p$, respectively.
PMF
The probability mass function $f(\cdot)$ of a Bernoulli distribution, over possible outcomes $k \in \{0, 1\}$, is given by,

$$f(k; p) = \begin{cases} p & \text{if } k = 1 \\ q = 1 - p & \text{if } k = 0 \end{cases}$$

This can also be expressed as,

$$f(k; p) = p^k (1 - p)^{1-k} \quad \text{for } k \in \{0, 1\}$$

or as,

$$f(k; p) = pk + (1 - p)(1 - k) \quad \text{for } k \in \{0, 1\}$$

Note that the Bernoulli distribution is a special case of the binomial distribution with $n = 1$.
CDF
The cumulative distribution function (CDF) of the Bernoulli distribution is given by,

$$F(k; p) = \begin{cases} 0 & \text{if } k < 0 \\ 1 - p & \text{if } 0 \le k < 1 \\ 1 & \text{if } k \ge 1 \end{cases}$$
Iid Bernoulli Trials
For iid Bernoulli observations $x_1, \ldots, x_n$, the likelihood is

$$\prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} = p^{\sum x_i} (1 - p)^{n - \sum x_i}.$$

Notice that the likelihood depends only on the sum of the $x_i$.
Because $n$ is fixed and assumed known, this implies that the sample proportion $\sum_i \frac{x_i}{n}$ contains all of the relevant information about $p$.
We can maximize the Bernoulli likelihood over $p$ to obtain that $\hat{p} = \sum_i \frac{x_i}{n}$ is the maximum likelihood estimator for $p$.
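The claim that the sample proportion maximizes the Bernoulli likelihood can be checked with a coarse grid search over $p$. This is only a sketch: the observations `xs` are made-up data and the grid resolution is an arbitrary assumption.

```python
import math

def bernoulli_log_likelihood(p, xs):
    """Log of p**sum(xs) * (1 - p)**(n - sum(xs))."""
    s, n = sum(xs), len(xs)
    return s * math.log(p) + (n - s) * math.log(1.0 - p)

xs = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical observations

# Closed-form maximum likelihood estimate: the sample proportion.
p_hat = sum(xs) / len(xs)

# Brute-force check: evaluate the log-likelihood on a grid of p values.
grid = [k / 100 for k in range(1, 100)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, xs))

print("sample proportion:", p_hat)
print("grid maximizer   :", p_grid)
```

Working with the log-likelihood rather than the likelihood is a common numerical convenience; both have the same maximizer because the logarithm is monotonic.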
Binomial Trials
Binomial random variables are obtained as the sum of iid Bernoulli trials.
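Specifically, let $X_1, \ldots, X_n$ be iid Bernoulli($p$); then $X = \sum_{i=1}^n X_i$ is a binomial random variable, with probability mass function

$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} \quad \text{for } x = 0, \ldots, n$$

The notation ${}^nC_x$ or $\binom{n}{x}$ (read “$n$ choose $x$”) counts the number of ways of selecting $x$ items out of $n$ when order does not matter:

$$\binom{n}{x} = \frac{n!}{x!(n - x)!}, \qquad \binom{n}{0} = \binom{n}{n} = 1$$

Consider the probability of getting 6 heads out of 10 coin flips from a coin with success probability $p$.
The probability of getting 6 heads and 4 tails in any specific order is $p^6 (1 - p)^4$.
There are $\binom{10}{6}$ such orderings, so the probability of exactly 6 heads is $\binom{10}{6} p^6 (1 - p)^4$.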
Example
Suppose a friend has 8 children, 7 of whom are girls and none are twins.
If each gender has an independent 50% probability for each birth, what's the probability of getting 7 or more girls out of 8 births?

$$\binom{8}{7} 0.5^7 (1 - 0.5)^1 + \binom{8}{8} 0.5^8 (1 - 0.5)^0 \approx 0.04$$
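The arithmetic in this example is easy to reproduce. The sketch below evaluates the binomial mass function with `math.comb` and sums the two terms; the function names are illustrative assumptions.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)

def binomial_upper_tail(x_min, n, p):
    """P(X >= x_min) obtained by summing the mass function."""
    return sum(binomial_pmf(x, n, p) for x in range(x_min, n + 1))

p_seven_or_more = binomial_upper_tail(7, 8, 0.5)
print(f"P(7 or more girls out of 8) = {p_seven_or_more:.4f}")
```

The exact value is $9/256$, which rounds to the 0.04 quoted in the text.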
Gaussian/normal Distribution
The normal (or Gaussian) distribution has a bell-shaped density function and is used to model real-valued random variables that are assumed to be additively produced by many small effects.
For example, the normal distribution is used to model people's height, since height can be assumed to be the result of many small genetic and environmental factors.
Another example would be modeling the price of a house, since the price of a house can be
assumed to be a function of the area, school district, distance to landmarks etc.
If $X$ is a random variable that follows a Gaussian distribution, then $E[X] = \mu$ and $\mathrm{Var}(X) = \sigma^2$.
The notation used to indicate that a random variable was sampled from a normal distribution is: $X \sim N(\mu, \sigma^2)$.
A random variable is said to follow a normal or Gaussian distribution with mean $\mu$ and variance $\sigma^2$ if the associated PDF is,

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$
CDF
$$F(x) = \Phi\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right]$$

where,
$\Phi(\cdot)$ represents the CDF of the standard normal distribution.
$\operatorname{erf}(x)$ represents the error function.
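The error-function form of the CDF translates directly into code. The sketch below implements it with `math.erf` and cross-checks it against `statistics.NormalDist` from the standard library (Python 3.8+); the values of `mu`, `sigma`, and `x` are arbitrary assumptions.

```python
import math
from statistics import NormalDist

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F(x) = 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma, x = 10.0, 2.0, 13.0  # hypothetical parameters and query point
print("erf-based CDF :", normal_cdf(x, mu, sigma))
print("NormalDist CDF:", NormalDist(mu, sigma).cdf(x))

# Standard normal quantities: a quantile, an upper tail, and a central mass.
z95 = NormalDist().inv_cdf(0.95)
upper_tail = 1.0 - normal_cdf(2.0)
within_one_sigma = normal_cdf(1.0) - normal_cdf(-1.0)
print("95th percentile of Z:", z95)
print("P(Z > 2)            :", upper_tail)
print("P(-1 < Z < 1)       :", within_one_sigma)
```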
The simplest case of the normal distribution is called the standard normal distribution. This is a special case when $\mu = 0$ and $\sigma = 1$, described by the PDF:

$$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}$$
Example 1
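What is the $95^{th}$ percentile of a $N(\mu, \sigma^2)$ distribution? We want the point $x_0$ such that $P(X \le x_0) = 0.95$:

$$P(X \le x_0) = P\left(\frac{X - \mu}{\sigma} \le \frac{x_0 - \mu}{\sigma}\right) = P\left(Z \le \frac{x_0 - \mu}{\sigma}\right) = 0.95$$

$$\therefore \frac{x_0 - \mu}{\sigma} = 1.645$$

or $x_0 = \mu + \sigma 1.645$. In general, $x_0 = \mu + \sigma z_0$, where $z_0$ is the corresponding standard normal quantile.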
Example 2
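What is the probability that a $N(\mu, \sigma^2)$ random variable is $2\sigma$ (i.e., 2 standard deviations) above the mean?

$$P(X > \mu + 2\sigma) = P\left(\frac{X - \mu}{\sigma} > \frac{\mu + 2\sigma - \mu}{\sigma}\right) = P(Z \ge 2) \approx 2.5\%$$

Here 2.5% is the rule-of-thumb value implied by the 95% rule; the more precise figure is about 2.3%.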
If $Z \sim \phi$, i.e., $Z$ is a random variable that follows the standard normal distribution, then $X = \mu + \sigma Z \sim N(\mu, \sigma^2)$.
The PDF of a general normal distribution in terms of the PDF of a standard normal $\phi(\cdot)$ is,

$$\frac{1}{\sigma}\phi\left(\frac{x - \mu}{\sigma}\right)$$
Approximately 68%, 95% and 99.7% of the normal density lies within 1, 2 and 3 standard deviations from the mean, respectively. $-1.28$, $-1.645$, $-1.96$, and $-2.33$ are the $10^{th}$, $5^{th}$, $2.5^{th}$, and $1^{st}$ percentiles of the standard normal distribution, respectively.
Other Properties
The normal distribution is symmetric and peaked around its mean (therefore the mean, median and mode
are all equal).
A constant times a normally distributed random variable is also normally distributed (what is the mean and variance?).
Sums of jointly normally distributed random variables are again normally distributed, even if the variables are dependent (what is the mean and variance?).
Sample means of normally distributed random variables are again normally distributed (with what mean and variance?).
The square of a standard normal random variable follows what is called the chi-squared distribution.
The exponential of a normally distributed random variable follows what is called the log-normal distribution.
As we will see later, many random variables, properly normalized, limit to a normal distribution.
Poisson Distribution
A Poisson random variable counts the number of events occurring in a fixed interval of time or space, given
that these events occur with an average rate λ .
This distribution can be used to model events such as:
The number of meteor showers in a year.
The number of goals in a soccer match.
The number of patients arriving in an emergency room between 10 and 11 PM.
The number of laser photons hitting a detector in a particular time interval.
The number of customers arriving in a store (or say, the number of pageviews on a website).
A Poisson random variable thus models a discrete distribution.
Both the mean and variance of this distribution are $\lambda$.
Note that λ ranges from 0 to ∞ .
PMF
$$P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \quad \text{for } x = 0, 1, \ldots$$
CDF
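$$\frac{\Gamma(\lfloor k + 1 \rfloor, \lambda)}{\lfloor k \rfloor !}, \quad \text{or} \quad e^{-\lambda} \sum_{i=0}^{\lfloor k \rfloor} \frac{\lambda^i}{i!}, \quad \text{or} \quad Q(\lfloor k + 1 \rfloor, \lambda)$$

(for $k \ge 0$, where $\Gamma(x, y)$ is the upper incomplete gamma function, $\lfloor k \rfloor$ is the floor function, and $Q$ is the regularized gamma function)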
Usecases for the Poisson Distribution
Modeling count data, i.e., data of the form $\frac{\text{number of events}}{\text{time}}$. Examples include radioactive decay, survival
Poisson Derivation
A binomial random variable is the sum of n independent Bernoulli random variables with parameter p. It is
frequently used to model the number of successes in a specified number of identical binary experiments,
such as the number of heads in five coin tosses.
When $n$ is large and $p$ is small (with $np < 10$), the Poisson distribution is an accurate approximation to the binomial distribution.
Formally, if $X \sim \text{Binomial}(n, p)$, then $X$ is approximately Poisson with $\lambda = np$.
Example
The number of people that show up at a bus stop is Poisson with a mean of 2.5 per hour.
If we watch the bus stop for 4 hours, what is the probability that 3 or fewer people show up for the whole time?
If we flip a coin with success probability 0.01 five hundred times, what's the probability of 2 or fewer successes?
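Both questions can be answered by summing mass functions directly. The sketch below is one way to do so under the stated setup: the bus stop count over 4 hours is treated as Poisson with rate $2.5 \times 4 = 10$, and the coin-flip count is computed both exactly (binomial) and with the Poisson approximation $\lambda = np = 5$. The function names are illustrative assumptions.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_cdf(k, lam):
    """P(X <= k): sum of the mass function up to floor(k)."""
    return sum(poisson_pmf(i, lam) for i in range(int(math.floor(k)) + 1))

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)
               for i in range(k + 1))

# Bus stop: rate 2.5 per hour over 4 hours gives lambda = 10.
bus = poisson_cdf(3, 2.5 * 4)
print(f"P(3 or fewer arrivals in 4 hours) = {bus:.4f}")

# Coin flips: exact binomial versus Poisson approximation with lambda = n * p.
exact = binomial_cdf(2, 500, 0.01)
approx = poisson_cdf(2, 500 * 0.01)
print(f"binomial exact = {exact:.4f}, Poisson approximation = {approx:.4f}")
```

The two coin-flip probabilities differ only in the third decimal place, which illustrates why the approximation is considered accurate for large $n$ and small $p$.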
Uniform Distribution
The uniform distribution (or rectangular distribution) is a continuous distribution such that all intervals of
equal length on the distribution’s support have equal probability. For example, this distribution might be
used to model people’s full birth dates, where it is assumed that all times in the calendar year are equally
likely.
The distribution describes an experiment where there is an arbitrary outcome that lies between certain
bounds.
The bounds are defined by the parameters, $a$ and $b$, which are the minimum and maximum values. The interval can either be closed (e.g., $[a, b]$) or open (e.g., $(a, b)$).
Therefore, the distribution is often abbreviated $U(a, b)$, where $U$ stands for uniform distribution.
PDF

$$f(x) = \begin{cases} \frac{1}{b - a} & \text{for } a \le x \le b \\ 0 & \text{for } x < a \text{ or } x > b \end{cases}$$
CDF
$$F(x) = \begin{cases} 0 & \text{for } x < a \\ \frac{x - a}{b - a} & \text{for } x \in [a, b] \\ 1 & \text{for } x > b \end{cases}$$
Geometric Distribution
A geometric random variable counts the number of trials that are required to observe a single success,
where each trial is independent and has success probability p. A geometric random variable thus models a
discrete distribution.
For example, this distribution can be used to model the number of times a die must be rolled in order for a
six to be observed.
Student’s t-distribution
A Student’s t-distribution (or simply the t-distribution) is a continuous probability distribution that arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown.
Chi-squared Distribution
A chi-squared random variable with $k$ degrees of freedom is the sum of $k$ independent and identically distributed squared standard normal random variables. A chi-squared random variable thus models a continuous distribution.
It is often used in hypothesis testing and in the construction of confidence intervals.
Exponential Distribution
The exponential distribution is the continuous analogue of the geometric distribution. It is often used to
model waiting times.
F Distribution
The F-distribution (also known as the Fisher–Snedecor distribution) is a continuous distribution that arises frequently as the null distribution of a test statistic, most notably in the analysis of variance.
Gamma Distribution
The gamma distribution is a general family of continuous probability distributions. The exponential and chi-squared distributions are special cases of the gamma distribution.
Beta Distribution
The beta distribution is a general family of continuous probability distributions bound between 0 and 1. The
beta distribution is frequently used as a conjugate prior distribution in Bayesian statistics.
Frequentist Inference
Frequentist inference is the process of determining properties of an underlying distribution via the
observation of data.
Point Estimation
One of the main goals of statistics is to estimate unknown parameters. To approximate these parameters,
we choose an estimator, which is simply any function of randomly sampled observations. To illustrate this
idea, let’s consider the problem of estimating the value of π . To do so, we can uniformly drop samples on a
square containing an inscribed circle. Notice that the value of π can be expressed as a ratio of the areas,
$$S_{circle} = \pi r^2$$

$$S_{square} = 4r^2$$

$$\implies \pi = 4\frac{S_{circle}}{S_{square}}$$
We can estimate this ratio with our samples. Let m be the number of samples within our circle and n the
total number of samples dropped. We define our estimator $\hat{\pi}$ as:

$$\hat{\pi} = 4\frac{m}{n}$$
It can be shown that this estimator has the desirable properties of being unbiased and consistent.
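The estimator $\hat{\pi} = 4m/n$ is straightforward to simulate. The sketch below drops uniform samples on the square $[-1, 1]^2$ and counts those falling inside the inscribed unit circle; the seed and sample size are arbitrary assumptions.

```python
import random

def estimate_pi(n_samples, seed=0):
    """Monte Carlo estimate of pi as 4 * m / n."""
    rng = random.Random(seed)
    inside = 0  # m: number of samples that land inside the circle
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

pi_hat = estimate_pi(200_000)
print("estimate of pi:", pi_hat)
```

Increasing `n_samples` tends to shrink the error, in line with the consistency property mentioned above.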
Confidence Intervals
In contrast to point estimators, confidence intervals estimate a parameter by specifying a range of possible
values. Such an interval is associated with a confidence level, which is the probability that the procedure
used to generate the interval will produce an interval containing the true parameter.
The Bootstrap
Much of frequentist inference centers on the use of “good” estimators. The precise distributions of these
estimators, however, can often be difficult to derive analytically. The computational technique known as the
Bootstrap provides a convenient way to estimate properties of an estimator via resampling.
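As a rough illustration of resampling, the sketch below bootstraps the sample mean of a made-up dataset to estimate its standard error and a percentile interval. The data, the number of resamples, and the percentile-interval construction are all assumptions chosen for simplicity; other bootstrap variants exist.

```python
import random
import statistics

def bootstrap_statistics(data, statistic, n_resamples=2000, seed=0):
    """Recompute `statistic` on resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

# Hypothetical sample: 50 draws from a normal distribution.
data_rng = random.Random(42)
data = [data_rng.gauss(5.0, 2.0) for _ in range(50)]

replicates = sorted(bootstrap_statistics(data, statistics.mean))

# Bootstrap standard error versus the textbook formula s / sqrt(n).
boot_se = statistics.stdev(replicates)
analytic_se = statistics.stdev(data) / len(data) ** 0.5

# Simple 95% percentile interval from the sorted replicates.
ci_low = replicates[int(0.025 * len(replicates))]
ci_high = replicates[int(0.975 * len(replicates)) - 1]

print("bootstrap SE:", boot_se)
print("analytic SE :", analytic_se)
print("95% percentile interval:", (ci_low, ci_high))
```

The mean is used here because its standard error has a known formula to compare against; the same function accepts any statistic, such as `statistics.median`, for which no simple formula is available.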
Bayesian Inference
Bayesian inference techniques specify how one should update one’s beliefs upon observing data.
Bayes’ Theorem
Suppose that on your most recent visit to the doctor’s office, you decide to get tested for a rare disease. If
you are unlucky enough to receive a positive result, the logical next question is, “Given the test result, what
is the probability that I actually have this disease?” (Medical tests are, after all, not perfectly accurate.)
Bayes’ Theorem tells us exactly how to compute this probability:
$$P(\text{Disease} \mid +) = \frac{P(+ \mid \text{Disease})\, P(\text{Disease})}{P(+)}$$
As the equation indicates, the posterior probability of having the disease given that the test was positive depends on the prior probability of the disease $P(\text{Disease})$. Think of this as the incidence of the disease in the general population.
The posterior probability also depends on the test accuracy: How often does the test correctly report a
negative result for a healthy patient, and how often does it report a positive result for someone with the
disease?
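To see how strongly the prior influences the posterior, the sketch below evaluates Bayes' theorem for a positive test, expanding $P(+)$ with the law of total probability. The prevalence, sensitivity, and specificity values are hypothetical numbers chosen for illustration only.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(Disease | +) via Bayes' theorem.

    sensitivity = P(+ | Disease), specificity = P(- | no Disease).
    """
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    # P(+) = P(+ | Disease) P(Disease) + P(+ | no Disease) P(no Disease)
    return true_positive / (true_positive + false_positive)

# Hypothetical rare disease: 1% prevalence, 99% sensitivity, 95% specificity.
posterior = posterior_given_positive(prior=0.01, sensitivity=0.99, specificity=0.95)
print(f"P(Disease | +) = {posterior:.3f}")
```

Even with an accurate test, the posterior in this made-up scenario stays well below 50% because the disease is rare, which is the point of the doctor's office example.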
Likelihood Function
$$L(\theta \mid x) = P(x \mid \theta)$$

The concept of likelihood plays a fundamental role in both Bayesian and frequentist statistics. To read more, refer to the section on likelihood vs. probability in our CS229 notes.
Prior to Posterior
At the core of Bayesian statistics is the idea that prior beliefs should be updated as new data is acquired.
Consider a possibly biased coin that comes up heads with probability $p$ (which would be unknown in practice).
As we acquire data in the form of coin tosses, we update the posterior distribution on p, which represents
our best guess about the likely values for the bias of the coin. This updated distribution then serves as the
prior for future coin tosses.
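One common way to carry out this update, assumed here for illustration, is to place a Beta prior on $p$: after observing heads and tails, a Beta($a$, $b$) prior becomes a Beta($a$ + heads, $b$ + tails) posterior. The sketch below applies the update in two batches and shows that the first posterior serves as the prior for the second; the toss data and the uniform Beta(1, 1) starting prior are hypothetical.

```python
def update_beta(a, b, tosses):
    """Conjugate update of a Beta(a, b) prior with 0/1 coin tosses (1 = heads)."""
    heads = sum(tosses)
    tails = len(tosses) - heads
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

prior = (1.0, 1.0)                 # assumed uniform prior on p
first_batch = [1, 1, 0, 1]         # hypothetical tosses
second_batch = [0, 1, 1, 1, 1, 0]  # hypothetical tosses

# The posterior after the first batch is the prior for the second batch.
posterior = update_beta(*prior, first_batch)
posterior = update_beta(*posterior, second_batch)

print("posterior parameters:", posterior)
print("posterior mean of p :", beta_mean(*posterior))
```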
Regression Analysis
Linear regression is an approach for modeling the linear relationship between two variables.
The ordinary least squares (OLS) approach to regression allows us to estimate the parameters of a linear
model.
The goal of this method is to determine the linear model that minimizes the sum of the squared errors
between the observations in a dataset and those predicted by the model.
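For a single predictor, the least-squares line has a closed form: the slope is the ratio of the co-deviation sum to the squared-deviation sum of $x$, and the intercept follows from the means. The sketch below implements this on made-up data; the function names and data values are assumptions.

```python
def ols_fit(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared differences between observations and predictions."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical observations
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = ols_fit(xs, ys)
print("slope:", slope, "intercept:", intercept)
print("SSE at the fit:", sum_squared_errors(xs, ys, slope, intercept))
```

Perturbing either parameter away from the fitted values increases the sum of squared errors, which is what makes the fit "least squares".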
Correlation
Correlation is a measure of the linear relationship between two variables. It is defined for a sample as the following and takes values between $-1$ and $+1$ inclusive:

$$r = \frac{s_{xy}}{\sqrt{s_{xx}}\sqrt{s_{yy}}}$$

where

$$s_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$$

$$s_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$$

$$s_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2$$
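The sample correlation follows directly from the three sums $s_{xy}$, $s_{xx}$, and $s_{yy}$. The sketch below computes it for a few made-up datasets, including perfectly increasing and perfectly decreasing linear relationships; the data are hypothetical.

```python
import math

def correlation(xs, ys):
    """Sample correlation r = s_xy / (sqrt(s_xx) * sqrt(s_yy))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))

print("perfect increasing line:", correlation([1, 2, 3], [2, 4, 6]))
print("perfect decreasing line:", correlation([1, 2, 3], [6, 4, 2]))
print("noisy increasing data  :",
      correlation([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1]))
```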
It can also be understood as the cosine of the angle formed by the ordinary least square line determined in
both variable dimensions.
Analysis of Variance
Analysis of Variance (ANOVA) is a statistical method for testing whether groups of data have the same mean. ANOVA generalizes the t-test to two or more groups by comparing the sum of squared errors within and between groups.
Trigonometry
Ratios
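| Angle (degrees) | $0^\circ$ | $30^\circ$ | $45^\circ$ | $60^\circ$ | $90^\circ$ |
| --- | --- | --- | --- | --- | --- |
| Angle (radians) | $0^c$ | $\pi/6^c$ | $\pi/4^c$ | $\pi/3^c$ | $\pi/2^c$ |
| $\sin\theta$ | $0$ | $\frac{1}{2}$ | $\frac{1}{\sqrt{2}}$ | $\frac{\sqrt{3}}{2}$ | $1$ |
| $\cos\theta$ | $1$ | $\frac{\sqrt{3}}{2}$ | $\frac{1}{\sqrt{2}}$ | $\frac{1}{2}$ | $0$ |
| $\tan\theta$ | $0$ | $\frac{1}{\sqrt{3}}$ | $1$ | $\sqrt{3}$ | N/A |
| $\operatorname{cosec}\theta$ | N/A | $2$ | $\sqrt{2}$ | $\frac{2}{\sqrt{3}}$ | $1$ |
| $\sec\theta$ | $1$ | $\frac{2}{\sqrt{3}}$ | $\sqrt{2}$ | $2$ | N/A |
| $\cot\theta$ | N/A | $\sqrt{3}$ | $1$ | $\frac{1}{\sqrt{3}}$ | $0$ |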
equivalently, $0^c$ to $2\pi^c$).
Note that the below diagram shows a unit circle (with radius = 1).
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledMathTutorial,
title = {Math Tutorial},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}
www.amanchadha.com