
Math for Machine Learning

Vectors
  Definitions: Vectors as Directions; Geometry of Column Vectors (2D intuition).
  Operations: Addition as Displacement; Subtraction as Mapping; Scalar Multiplication.

Measures of Magnitude
  Norms are measures of distance.
  Norm Properties
    All distances are non-negative.
    Distances multiply with scalar multiplication: ||a v|| = |a| ||v||.
    Example: if I travel from A to B and then from B to C, that is at least as far as going
    directly from A to C (Triangle Inequality).
  Types of Norms
    Lp-Norm
    L1-Norm
    Euclidean Norm (L2-Norm)
    L∞-Norm
    L0-Norm: despite the name, it's not a norm; it is the number of non-zero elements of a vector.
  Geometry of Norms: visualize the different norms, e.g. by their unit circles.
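A minimal sketch (not from the original notes) of how the norms above could be computed for a concrete example vector, in plain Python:

```python
# Sketch: computing the norms listed above for an example vector.
v = [3.0, -4.0, 0.0, 1.0]

l1   = sum(abs(x) for x in v)            # L1-Norm: sum of absolute values
l2   = sum(x * x for x in v) ** 0.5      # Euclidean Norm (L2): square root of the sum of squares
linf = max(abs(x) for x in v)            # L-infinity Norm: largest absolute entry
l0   = sum(1 for x in v if x != 0)       # "L0-Norm": count of non-zero entries (not a true norm)

print(l1, l2, linf, l0)                  # 8.0  5.0990...  4.0  3
```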


Convexity
  The function is convex if the line between two points stays above the graph; the opposite is
  concave.
  Hf is called positive semi-definite if x^T Hf x >= 0 for every x, and this implies f is convex.
  Benefits: when a function is convex there is a single unique local minimum, no maxima, and no
  saddle points, so gradient descent is guaranteed to find the global minimum with a small enough
  learning rate.
  A Warning: when going to more complex models, i.e. Neural Networks, there are many local minima
  & many saddle points, so they are not convex.
  Level set: a set along which f is constant.

The Gradient
  The collection of partial derivatives of f.
  Key Properties: the gradient points in the direction of maximum increase; minus the gradient
  points in the direction of maximum decrease.

Critical points investigation
  Derivative Condition: minimizing f <---> f'(x) = 0 (the gradient is zero).
  Hessian: Hf, the matrix of second derivatives. If the matrix is diagonal, a positive entry is a
  direction where the function curves up, and a negative entry is a direction where it curves
  down. The Trace is the sum of the diagonal terms, tr(Hf).
  For a function of two variables, classify a critical point further by det(Hf) and tr(Hf):
    det(Hf) > 0, tr(Hf) > 0  ->  local minimum
    det(Hf) > 0, tr(Hf) < 0  ->  local maximum
    det(Hf) > 0, tr(Hf) = 0  ->  does not happen
    det(Hf) < 0              ->  saddle point
    det(Hf) = 0              ->  unclear, need more info
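A small sketch of the two-variable classification table, assuming the 2x2 Hessian at a critical point is already known; the example Hessians are made up for illustration:

```python
def classify_critical_point(hf):
    """Classify a critical point of a two-variable function from its 2x2 Hessian hf."""
    a, b = hf[0]
    c, d = hf[1]
    det = a * d - b * c
    tr = a + d
    if det < 0:
        return "saddle point"
    if det == 0:
        return "unclear, need more info"
    # det > 0 from here on
    if tr > 0:
        return "local minimum"
    if tr < 0:
        return "local maximum"
    return "does not happen"   # det > 0 with tr = 0 is impossible for a real symmetric 2x2 Hessian

print(classify_critical_point([[2.0, 0.0], [0.0, 3.0]]))    # local minimum
print(classify_critical_point([[1.0, 0.0], [0.0, -2.0]]))   # saddle point
```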
Partial Derivatives
  When it's a function of many variables, there are many derivatives. A partial derivative is a
  measure of the rate of change of the function when one of the variables is subjected to a small
  change but the others are kept constant. You need to know how the function responds to changes
  in all of them.

Newton's Method
  Now look for an algorithm to find the zero of some function g(x).
  Idea: apply this algorithm to f'(x).
  Computing the Line: take the line through (x0, g(x0)) with slope g'(x0),
    y = g'(x0)(x - x0) + g(x0),
  and solve the equation y = 0.
  Update Step for Zero Finding: we want to find where g(x) = 0, so we start with some initial
  guess x0 and then iterate
    x_{n+1} = x_n - g(x_n) / g'(x_n).
  Pictorially: follow the line down to the x-axis to get closer to the x such that g(x) = 0.
  Update Step for Minimization: to minimize f, we want to find where f'(x) = 0, so we may start
  with some initial guess x0 and then iterate Newton's Method on f' to get
    x_{n+1} = x_n - f'(x_n) / f''(x_n).
  Relationship to Gradient Descent: this is gradient descent with a learning rate that is adaptive
  to f(x), namely 1 / f''(x_n).
  If you have a function of an n-dimensional vector, the same update uses the gradient and the
  inverse of the Hessian.
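A sketch of both update steps; the functions g and f below are made-up examples with hand-written derivatives, not anything from the notes:

```python
def newton_zero(g, dg, x0, steps=10):
    """Update Step for Zero Finding: x_{n+1} = x_n - g(x_n) / g'(x_n)."""
    x = x0
    for _ in range(steps):
        x = x - g(x) / dg(x)
    return x

def newton_minimize(df, ddf, x0, steps=10):
    """Update Step for Minimization: run Newton's Method on f' to solve f'(x) = 0."""
    x = x0
    for _ in range(steps):
        x = x - df(x) / ddf(x)
    return x

# Zero finding: g(x) = x^2 - 2 has a zero at sqrt(2).
print(newton_zero(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0))                     # ~1.41421356

# Minimization: f(x) = x^2 - ln(x) has f'(x) = 2x - 1/x and f''(x) = 2 + 1/x^2,
# with a unique minimum at x = 1/sqrt(2).
print(newton_minimize(lambda x: 2 * x - 1 / x, lambda x: 2 + 1 / x ** 2, x0=1.0))    # ~0.70710678
```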
Newton's Method: Issues
  Does it always work? The multivariate update needs the inverse of the Hessian, and the
  computational complexity of inverting an n x n matrix is not actually known, but the best-known
  algorithm is O(n^2.373). For high-dimensional data sets, anything past linear time in the
  dimensions is often impractical, so Newton's Method is reserved for a few hundred dimensions at
  most. Sometimes we can circumvent this issue.

Matrix Calculus
  Matrix derivative, Vector derivative.
  The majority of this will be just bookkeeping, but it will be terribly messy bookkeeping.

Matrix multiplication and examples
  If A is a matrix where the rows are features w_i and B is a matrix where the columns are data
  vectors v_j, then the (i, j)-th entry of the product is w_i · v_j, which is to say the i-th
  feature of the j-th vector.
  In formulae: if C = AB, where A is an n x m matrix and B is an m x k matrix, then C is an
  n x k matrix where C_ij = Σ_l A_il B_lj.
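A sketch of the formula C_ij = Σ_l A_il B_lj, with A holding feature rows and B holding data-vector columns; the numbers are invented for illustration:

```python
def matmul(A, B):
    """C = AB where A is n x m and B is m x k; C[i][j] is row i of A dotted with column j of B."""
    n, m, k = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "inner dimensions must match"
    return [[sum(A[i][l] * B[l][j] for l in range(m)) for j in range(k)] for i in range(n)]

A = [[1.0, 0.0, 2.0],      # each row is a feature w_i
     [0.0, 1.0, -1.0]]
B = [[1.0, 4.0],           # each column is a data vector v_j
     [2.0, 5.0],
     [3.0, 6.0]]

print(matmul(A, B))        # [[7.0, 16.0], [-1.0, -1.0]]  -- entry (i, j) is w_i . v_j
```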

Derivatives
  Univariate
    Derivative: the slope of the function.
    Interpretation: the derivative can be presented as an approximation; let's approximate
      f(x + ε) ≈ f(x) + f'(x)ε.
    An alternative, better approximation also uses the second derivative:
      f(x + ε) ≈ f(x) + f'(x)ε + (1/2)f''(x)ε².
    Rules: Sum Rule, Product Rule, Quotient Rule, Chain Rule (the most usable).
      (Reference: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html)
    Second Derivative: f''(x) shows how the slope is changing. At a critical point:
      local max  -> f''(x) < 0
      local min  -> f''(x) > 0
      can't tell -> f''(x) = 0; proceed with a higher derivative.
  Multivariate: a function of many variables has many derivatives (its partial derivatives).

Gradient Descent (1D and 2D intuition)
  Goal: minimize f(x).
  ALGORITHM
    1. Start with a guess x0.
    2. Iterate x_{n+1} = x_n - η∇f(x_n), where η is the learning rate.
    3. Stop after some condition is met: the value of x doesn't change by more than 0.001, a fixed
       number of steps, or fancier things (TBD).
  How to pick η: recall that an improperly chosen learning rate will cause the entire optimization
  procedure to either fail or operate too slowly to be of practical use.
  As simplistic as this is, almost all machine learning you have heard of uses some version of
  this in the learning process.
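A minimal sketch of the three-step algorithm on a made-up function f(x, y) = (x - 1)^2 + (y + 2)^2, with its gradient written out by hand; the learning rate and stopping tolerance are just the illustrative values mentioned above:

```python
def grad_f(p):
    x, y = p
    return [2 * (x - 1), 2 * (y + 2)]    # gradient of f(x, y) = (x - 1)^2 + (y + 2)^2

eta = 0.1                                 # learning rate
p = [0.0, 0.0]                            # 1. start with a guess x0
for step in range(10_000):                # 2. iterate x_{n+1} = x_n - eta * grad f(x_n)
    g = grad_f(p)
    new_p = [p[0] - eta * g[0], p[1] - eta * g[1]]
    # 3. stop after some condition is met: here, x doesn't change by more than 0.001
    if max(abs(new_p[0] - p[0]), abs(new_p[1] - p[1])) < 0.001:
        p = new_p
        break
    p = new_p

print(p)   # close to the true minimizer (1, -2)
```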

Dot Product and how to extract angles
  Dot Product: the sum of the products of the corresponding entries of the two sequences of
  numbers.
  Angles: cos(θ) = v·w / (||v|| ||w||), so v·w > 0 when the angle between v and w is less than 90°
  and v·w < 0 when it is more than 90°.
  Orthogonality: v·w = 0.
  Hyperplane: the thing orthogonal to a given vector, i.e. a perpendicular line in 2D and a
  perpendicular surface in 3D; this is what is used as a decision plane.
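A sketch of extracting angles from the dot product; the vectors are arbitrary examples:

```python
import math

def dot(v, w):
    """Sum of the products of corresponding entries."""
    return sum(a * b for a, b in zip(v, w))

def angle_degrees(v, w):
    """Extract the angle via cos(theta) = v.w / (||v|| ||w||)."""
    cos_theta = dot(v, w) / (math.sqrt(dot(v, v)) * math.sqrt(dot(w, w)))
    return math.degrees(math.acos(cos_theta))

v, w = [1.0, 0.0], [1.0, 1.0]
print(dot(v, w), angle_degrees(v, w))   # 1.0, 45.0  (positive dot product: angle < 90 degrees)
print(dot([1.0, 0.0], [0.0, 3.0]))      # 0.0        (orthogonal vectors)
```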

Matrices

Matrix Products
  Distributivity: A(B + C) = AB + AC
  Associativity: A(BC) = (AB)C
  Not commutative: AB != BA

The Identity Matrix
  All ones on the diagonal (zeros elsewhere); IA = A.

Hadamard product
  An (often less useful) method of multiplying matrices is element-wise: A∘B.
  Properties of the Hadamard Product
    Distributivity: A∘(B + C) = A∘B + A∘C
    Associativity: A∘(B∘C) = (A∘B)∘C
    Commutativity: A∘B = B∘A

Geometry of matrix operations
  Intuition from Two Dimensions: any vector can be written as a sum of scalar multiples of two
  specific vectors. Suppose A is a 2x2 matrix (mapping R^2 to itself); any such matrix can be
  expressed uniquely as a stretching, followed by a skewing, followed by a rotation.

Linear dependence
  Vectors v1, ..., vk are linearly dependent if there are some a's, not all zero, such that
  a1*v1 + a2*v2 + ... + ak*vk = 0.
  Example: a1 = 1, a2 = -2, a3 = -1.
  Linearly dependent vectors lie in a lower-dimensional space.
  det(A) = 0 exactly when the columns of A are linearly dependent.

Probability

Terminology
  Outcome: a single possibility from the experiment.
  Sample Space: the set of all possible outcomes (capital Omega).
  Event: something you can observe with a yes/no answer (capital E).

Definition
  Probability: the fraction of an experiment where an event occurs; P{E} ∈ [0, 1].
  Intuition: the probability of an event is the expected fraction of time that the outcome would
  occur with repeated experiments.

Axioms of probability
  1. The fraction of the times an event occurs is between 0 and 1.
  2. Something always happens.
  3. If two events can't happen at the same time (disjoint events), then the fraction of the time
     that at least one of them occurs is the sum of the fractions of the time either one occurs
     separately.

Visualizing Probability (using Venn diagrams)
  Intersection of two sets, union of two sets, symmetric difference of two sets,
  inclusion/exclusion, relative complement of A (left) in B (right), absolute complement of A in U.
  General Picture:
    Sample Space <-> Region
    Outcomes <-> Points
    Events <-> Subregions
    Disjoint events <-> Disjoint subregions
    Probability <-> Area of a subregion
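A sketch of the "fraction of repeated experiments" intuition: estimating the probability that two fair dice sum to 7 by simulation (the event is an arbitrary example, not from the notes):

```python
import random

trials = 100_000
event_count = 0
for _ in range(trials):
    outcome = (random.randint(1, 6), random.randint(1, 6))   # one outcome from the sample space
    if outcome[0] + outcome[1] == 7:                         # the event: "the dice sum to 7"
        event_count += 1

print(event_count / trials)   # close to the true probability 6/36 = 0.1666...
```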

The Determinant
  Intuition: det(A) is the factor the area is multiplied by when A is applied to any region.
  det(A) is negative if it flips the plane over.
  The Two-by-two Determinant computation: for A = [[a, b], [c, d]], det(A) = ad - bc.
  Larger Matrices: expanding gives m determinants of (m-1)x(m-1) matrices; a computer does it more
  simply, in O(m^3) time, using what are called matrix factorizations.

Matrix invertibility
  When can you invert? Only when det(A) != 0.
  How to Compute the Inverse: find A^(-1) such that A^(-1) A = I.
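A sketch for the two-by-two case: the ad - bc formula, the det != 0 invertibility check, and the standard 2x2 inverse; the matrix is an arbitrary example:

```python
def det2(A):
    (a, b), (c, d) = A
    return a * d - b * c

def inverse2(A):
    """Inverse of a 2x2 matrix; only possible when det(A) != 0."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("det(A) = 0: columns are linearly dependent, no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2.0, 1.0], [5.0, 3.0]]
print(det2(A))       # 1.0
print(inverse2(A))   # [[3.0, -1.0], [-5.0, 2.0]]  -- multiplying by A gives the identity
```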

Conditional probability
  If I know B occurred, the probability that A occurred is the fraction of the area of B which is
  occupied by A. Conditional probability can be leveraged to understand competing hypotheses.

Bayes' rule
  Odds are a ratio of two probabilities, i.e. 2/1.
  Posterior odds = (ratio of the probabilities of generating the data) * prior odds.

Maximum Likelihood Estimation
  Given a probability model with some vector of parameters (theta) and observed data D, the best
  fitting model is the one that maximizes the probability of the data.
  Well defined: find p such that P_p(D) is maximized.
  Building machine learning models amounts to exactly this.
  Key Consequences.

Independence
  Two events are independent if one event doesn't influence the other.
  A and B are independent if P{A∩B} = P{A} * P{B}.

Central limit theorem
  A statistical theory which states that, given a sufficiently large sample size from a population
  with a finite level of variance, the mean of all samples from the same population will be
  approximately equal to the mean of the population.
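A sketch of maximum likelihood estimation for the simplest possible model, a coin with unknown heads probability p; the observed data D and the grid search over p are illustrative choices, not the notes' method:

```python
def likelihood(p, data):
    """P_p(D) for independent coin flips: product of p for heads and (1 - p) for tails."""
    prob = 1.0
    for flip in data:
        prob *= p if flip == "H" else (1 - p)
    return prob

D = ["H", "H", "T", "H", "T", "H", "H", "T"]          # 5 heads out of 8 flips
candidates = [i / 1000 for i in range(1, 1000)]
best_p = max(candidates, key=lambda p: likelihood(p, D))

print(best_p)   # ~0.625, the observed fraction of heads maximizes the likelihood
```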
Random variables
  A random variable is a function X that takes in an outcome and gives a number back.
  Discrete: X takes at most countably many values, usually only a finite set of values.
  Continuous: for many applications ML works with continuous random variables (measurements with
  real numbers), described by a probability density function.

Expected Value: the mean, E[X].
Variance: Var[X].
Standard Deviation: how close to the mean the samples are.

Chebyshev's inequality
  For any random variable X (no assumptions), samples fall within k standard deviations of the
  mean at least a 1 - 1/k^2 fraction of the time, e.g. at least 99% of the time within 10
  standard deviations.

The Gaussian curve
  General Gaussian Density.
  Standard Gaussian (Normal Distribution) Density; Key Properties: E[X] = 0, Var[X] = 1.
  Maximum entropy distribution: the Gaussian is the most random RV with fixed mean and variance;
  amongst all continuous RVs with E[X] = 0 and Var[X] = 1, the entropy H(X) is maximized uniquely
  for X ~ N(0, 1).
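A sketch that computes the sample mean, variance, and standard deviation of simulated standard-Gaussian draws and compares the observed within-k-standard-deviations fraction to the 1 - 1/k^2 Chebyshev bound; the distribution and k = 3 are arbitrary choices:

```python
import random

samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]   # standard Gaussian: E[X]=0, Var[X]=1

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
std = var ** 0.5

k = 3
within = sum(1 for x in samples if abs(x - mean) <= k * std) / len(samples)
print(mean, var)               # close to 0 and 1
print(within, 1 - 1 / k**2)    # the observed fraction is at least the Chebyshev bound 1 - 1/k^2
```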

Entropy (H)
  The Only Choice was the Units: first choose the base for the logarithm. If the base is not 2,
  divide the entropy by the logarithm of 2 in that base (equivalently, multiply by log2 of the
  base) to express it in bits.
  Examples (examine the trees of outcomes):
    One coin: H (1/2), T (1/2). Entropy = one bit of randomness.
    Two coins: HH (1/4), HT (1/4), TH (1/4), TT (1/4). Entropy = 2 bits of randomness.
    A mixed case: H (1/2), TH (1/4), TT (1/4). Entropy = 1.5 bits of randomness
      = 1/2 (1 bit) + 1/2 (2 bits).
    In general, if we flip n coins then each outcome has P = 1/2^n, and # coin flips = -log2(P).
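A sketch of the entropy computation behind the coin examples, using log base 2 so the answer comes out in bits:

```python
import math

def entropy_bits(probs):
    """H = -sum p * log2(p); with another log base, divide by the log of 2 in that base for bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))            # one coin: 1.0 bit
print(entropy_bits([0.25] * 4))            # two coins: 2.0 bits
print(entropy_bits([0.5, 0.25, 0.25]))     # mixed case: 1.5 bits = 1/2*(1 bit) + 1/2*(2 bits)
```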
