
Combined Cheat Sheets of Machine Learning, Deep Learning, Probability and LLMs

Syed Afroz Ali


Data Scientist (Kaggle Grandmaster)
https://www.kaggle.com/pythonafroz
https://www.linkedin.com/in/syed-afroz-70939914/
Probability – the Science of Uncertainty and Data
by Fabián Kozynski

Probability

Probability models and axioms

Definition (Sample space) A sample space Ω is the set of all possible outcomes. The set's elements must be mutually exclusive, collectively exhaustive and at the right granularity.

Definition (Event) An event is a subset of the sample space. Probability is assigned to events.

Definition (Probability axioms) A probability law P assigns probabilities to events and satisfies the following axioms:
• Nonnegativity: P(A) ≥ 0 for all events A.
• Normalization: P(Ω) = 1.
• (Countable) additivity: for every sequence of events A1, A2, ... such that Ai ∩ Aj = ∅ for i ≠ j, P(⋃i Ai) = ∑i P(Ai).

Corollaries (Consequences of the axioms)
• P(∅) = 0.
• For any finite collection of disjoint events A1, ..., An, P(⋃_{i=1}^{n} Ai) = ∑_{i=1}^{n} P(Ai).
• P(A) + P(Aᶜ) = 1.
• P(A) ≤ 1.
• If A ⊂ B, then P(A) ≤ P(B).
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
• P(A ∪ B) ≤ P(A) + P(B).

Example (Discrete uniform law) Assume Ω is finite and consists of n equally likely elements, and that A ⊂ Ω has k elements. Then P(A) = k/n.

Conditioning and Bayes' rule

Definition (Conditional probability) Given that event B has occurred and that P(B) > 0, the probability that A occurs is
P(A∣B) = P(A ∩ B)/P(B).

Remark (Conditional probability properties) Conditional probabilities behave like ordinary probabilities. Assuming P(B) > 0:
• P(A∣B) ≥ 0.
• P(Ω∣B) = 1.
• P(B∣B) = 1.
• If A ∩ C = ∅, then P(A ∪ C∣B) = P(A∣B) + P(C∣B).

Proposition (Multiplication rule)
P(A1 ∩ A2 ∩ ⋯ ∩ An) = P(A1) · P(A2∣A1) ⋯ P(An∣A1 ∩ A2 ∩ ⋯ ∩ An−1).

Theorem (Total probability theorem) Given a partition {A1, A2, ...} of the sample space, meaning that ⋃i Ai = Ω and the events are disjoint, for every event B we have
P(B) = ∑i P(Ai) P(B∣Ai).

Theorem (Bayes' rule) Given a partition {A1, A2, ...} of the sample space, meaning that ⋃i Ai = Ω and the events are disjoint, and if P(Ai) > 0 for all i, then for every event B the conditional probabilities P(Ai∣B) can be obtained from the conditional probabilities P(B∣Ai) and the initial probabilities P(Ai) as follows:
P(Ai∣B) = P(Ai) P(B∣Ai) / ∑j P(Aj) P(B∣Aj).

Independence

Definition (Independence of events) Two events are independent if the occurrence of one provides no information about the other. We say that A and B are independent if
P(A ∩ B) = P(A) P(B).
Equivalently, as long as P(A) > 0 and P(B) > 0,
P(B∣A) = P(B) and P(A∣B) = P(A).

Remarks
• The definition of independence is symmetric with respect to A and B.
• The product definition applies even if P(A) = 0 or P(B) = 0.

Corollary If A and B are independent, then A and Bᶜ are independent. Similarly for Aᶜ and B, and for Aᶜ and Bᶜ.

Definition (Conditional independence) We say that A and B are independent conditioned on C, where P(C) > 0, if
P(A ∩ B∣C) = P(A∣C) P(B∣C).

Definition (Independence of a collection of events) We say that events A1, A2, ..., An are independent if for every collection of distinct indices i1, i2, ..., ik we have
P(Ai1 ∩ ⋯ ∩ Aik) = P(Ai1) · P(Ai2) ⋯ P(Aik).

Counting

This section deals with finite sets with a uniform probability law. In this case, to calculate P(A), we need to count the number of elements in A and in Ω.

Remark (Basic counting principle) For a selection that can be done in r stages, with ni choices at each stage i, the number of possible selections is n1 · n2 ⋯ nr.

Definition (Permutations) The number of permutations (orderings) of n different elements is
n! = 1 · 2 · 3 ⋯ n.

Definition (Combinations) Given a set of n elements, the number of subsets with exactly k elements is
(n choose k) = n!/(k!(n − k)!).

Definition (Partitions) We are given an n-element set and nonnegative integers n1, n2, ..., nr whose sum is equal to n. The number of partitions of the set into r disjoint subsets, with the iᵗʰ subset containing exactly ni elements, is equal to the multinomial coefficient
(n choose n1, ..., nr) = n!/(n1! n2! ⋯ nr!).

Remark This is the same as counting the ways to assign n distinct elements to r people, giving each person i exactly ni elements.

Discrete random variables

Probability mass function and expectation

Definition (Random variable) A random variable X is a function of the sample space Ω into the real numbers (or Rⁿ). Its range can be discrete or continuous.

Definition (Probability mass function (PMF)) The probability law of a discrete random variable X is called its PMF. It is defined as
pX(x) = P(X = x) = P({ω ∈ Ω : X(ω) = x}).

Properties
• pX(x) ≥ 0 for all x.
• ∑x pX(x) = 1.

Example (Bernoulli random variable) A Bernoulli random variable X with parameter 0 ≤ p ≤ 1 (X ∼ Ber(p)) takes the value 1 w.p. p and the value 0 w.p. 1 − p.
An indicator random variable of an event (IA = 1 if A occurs) is an example of a Bernoulli random variable.

Example (Discrete uniform random variable) A discrete uniform random variable X between a and b with a ≤ b (X ∼ Uni[a, b]) takes any of the values in {a, a + 1, ..., b} with probability 1/(b − a + 1).

Example (Binomial random variable) A binomial random variable X with parameters n (a natural number) and 0 ≤ p ≤ 1 (X ∼ Bin(n, p)) takes values in the set {0, 1, ..., n} with probabilities pX(i) = (n choose i) pⁱ (1 − p)ⁿ⁻ⁱ.
It represents the number of successes in n independent trials where each trial has a probability of success p. Therefore, it can also be seen as the sum of n independent Bernoulli random variables, each with parameter p.

Example (Geometric random variable) A geometric random variable X with parameter 0 ≤ p ≤ 1 (X ∼ Geo(p)) takes values in the set {1, 2, ...} with probabilities pX(i) = (1 − p)^(i−1) p.
It represents the number of independent trials until (and including) the first success, when the probability of success in each trial is p.

Definition (Expectation/mean of a random variable) The expectation of a discrete random variable is defined as
E[X] = ∑x x pX(x),
assuming ∑x ∣x∣ pX(x) < ∞.

Properties (Properties of expectation)
• If X ≥ 0 then E[X] ≥ 0.
• If a ≤ X ≤ b then a ≤ E[X] ≤ b.
• If X = c then E[X] = c.

Example (Expected values of known r.v.)
• If X ∼ Ber(p) then E[X] = p.
• If X = IA then E[X] = P(A).
• If X ∼ Uni[a, b] then E[X] = (a + b)/2.
• If X ∼ Bin(n, p) then E[X] = np.
• If X ∼ Geo(p) then E[X] = 1/p.
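As a concrete illustration of the total probability theorem and Bayes' rule above, here is a minimal Python sketch; the two-event partition, the prior probabilities, and the likelihoods are invented purely for the example:

    # Bayes' rule on a partition {A1, A2}: hypothetical priors P(Ai) and
    # likelihoods P(B | Ai), chosen only to exercise the formulas above.
    prior = {"A1": 0.3, "A2": 0.7}          # P(Ai), sums to 1
    likelihood = {"A1": 0.9, "A2": 0.2}     # P(B | Ai)

    # Total probability theorem: P(B) = sum_i P(Ai) P(B | Ai)
    p_B = sum(prior[a] * likelihood[a] for a in prior)

    # Bayes' rule: P(Ai | B) = P(Ai) P(B | Ai) / P(B)
    posterior = {a: prior[a] * likelihood[a] / p_B for a in prior}

    print(p_B)        # 0.41
    print(posterior)  # {'A1': 0.6585..., 'A2': 0.3414...}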
Theorem (Expected value rule) Given a random variable X and a function g : R → R, we construct the random variable Y = g(X). Then
∑y y pY(y) = E[Y] = E[g(X)] = ∑x g(x) pX(x).

Remark (PMF of Y = g(X)) The PMF of Y = g(X) is
pY(y) = ∑_{x : g(x) = y} pX(x).

Remark In general g(E[X]) ≠ E[g(X)]. They are equal if g(x) = ax + b.

Variance, conditioning on an event, multiple r.v.

Definition (Variance of a random variable) Given a random variable X with µ = E[X], its variance is a measure of the spread of the random variable and is defined as
Var(X) = E[(X − µ)²] = ∑x (x − µ)² pX(x).

Definition (Standard deviation) σX = √Var(X).

Properties (Properties of the variance)
• Var(aX) = a² Var(X), for all a ∈ R.
• Var(X + b) = Var(X), for all b ∈ R.
• Var(aX + b) = a² Var(X).
• Var(X) = E[X²] − (E[X])².

Example (Variance of known r.v.)
• If X ∼ Ber(p), then Var(X) = p(1 − p).
• If X ∼ Uni[a, b], then Var(X) = (b − a)(b − a + 2)/12.
• If X ∼ Bin(n, p), then Var(X) = np(1 − p).
• If X ∼ Geo(p), then Var(X) = (1 − p)/p².

Proposition (Conditional PMF and expectation, given an event) Given the event A, with P(A) > 0, we have the following:
• pX∣A(x) = P(X = x∣A).
• If A is a subset of the range of X, then
pX∣A(x) = pX∣{X∈A}(x) = (1/P(A)) pX(x) if x ∈ A, and 0 otherwise.
• ∑x pX∣A(x) = 1.
• E[X∣A] = ∑x x pX∣A(x).
• E[g(X)∣A] = ∑x g(x) pX∣A(x).

Proposition (Total expectation rule) Given a partition of disjoint events A1, ..., An such that ∑i P(Ai) = 1 and P(Ai) > 0,
E[X] = P(A1) E[X∣A1] + ⋯ + P(An) E[X∣An].

Definition (Memorylessness of the geometric random variable) When we condition a geometric random variable X on the event X > n we have memorylessness, meaning that the "remaining time" X − n, given that X > n, is also geometric with the same parameter. Formally,
pX−n∣X>n(i) = pX(i).

Definition (Joint PMF) The joint PMF of random variables X1, X2, ..., Xn is
pX1,...,Xn(x1, ..., xn) = P(X1 = x1, ..., Xn = xn).

Properties (Properties of the joint PMF)
• ∑x1 ⋯ ∑xn pX1,...,Xn(x1, ..., xn) = 1.
• pX1(x1) = ∑x2 ⋯ ∑xn pX1,...,Xn(x1, x2, ..., xn).
• pX2,...,Xn(x2, ..., xn) = ∑x1 pX1,X2,...,Xn(x1, x2, ..., xn).

Definition (Functions of multiple r.v.) If Z = g(X1, ..., Xn), where g : Rⁿ → R, then pZ(z) = P(g(X1, ..., Xn) = z).

Proposition (Expected value rule for multiple r.v.) Given g : Rⁿ → R,
E[g(X1, ..., Xn)] = ∑_{x1,...,xn} g(x1, ..., xn) pX1,...,Xn(x1, ..., xn).

Properties (Linearity of expectation)
• E[aX + b] = a E[X] + b.
• E[X1 + ⋯ + Xn] = E[X1] + ⋯ + E[Xn].

Conditioning on a random variable, independence

Definition (Conditional PMF given another random variable) Given discrete random variables X, Y and y such that pY(y) > 0, we define
pX∣Y(x∣y) = pX,Y(x, y)/pY(y).

Proposition (Multiplication rule) Given jointly discrete random variables X, Y, and whenever the conditional probabilities are defined,
pX,Y(x, y) = pX(x) pY∣X(y∣x) = pY(y) pX∣Y(x∣y).

Definition (Conditional expectation) Given discrete random variables X, Y and y such that pY(y) > 0, we define
E[X∣Y = y] = ∑x x pX∣Y(x∣y).
Additionally we have
E[g(X)∣Y = y] = ∑x g(x) pX∣Y(x∣y).

Theorem (Total probability and expectation theorems) If pY(y) > 0, then
pX(x) = ∑y pY(y) pX∣Y(x∣y),
E[X] = ∑y pY(y) E[X∣Y = y].

Definition (Independence of a random variable and an event) A discrete random variable X and an event A are independent if
P(X = x and A) = pX(x) P(A), for all x.

Definition (Independence of two random variables) Two discrete random variables X and Y are independent if
pX,Y(x, y) = pX(x) pY(y) for all x, y.

Remark (Independence of a collection of random variables) A collection X1, X2, ..., Xn of random variables is independent if
pX1,...,Xn(x1, ..., xn) = pX1(x1) ⋯ pXn(xn), for all x1, ..., xn.

Remark (Independence and expectation) In general, E[g(X, Y)] ≠ g(E[X], E[Y]). An exception is linear functions:
E[aX + bY] = a E[X] + b E[Y].

Proposition (Expectation of a product of independent r.v.) If X and Y are discrete independent random variables,
E[XY] = E[X] E[Y].

Remark If X and Y are independent, E[g(X) h(Y)] = E[g(X)] E[h(Y)].

Proposition (Variance of a sum of independent random variables) If X and Y are discrete independent random variables,
Var(X + Y) = Var(X) + Var(Y).

Continuous random variables

PDF, expectation, variance, CDF

Definition (Probability density function (PDF)) A probability density function of a r.v. X is a nonnegative real-valued function fX that satisfies the following:
• ∫_{−∞}^{∞} fX(x) dx = 1.
• P(a ≤ X ≤ b) = ∫_{a}^{b} fX(x) dx.

Definition (Continuous random variable) A random variable X is continuous if its probability law can be described by a PDF fX.

Remark Continuous random variables satisfy:
• For small δ > 0, P(a ≤ X ≤ a + δ) ≈ fX(a) δ.
• P(X = a) = 0, for all a ∈ R.

Definition (Expectation of a continuous random variable) The expectation of a continuous random variable is
E[X] = ∫_{−∞}^{∞} x fX(x) dx,
assuming ∫_{−∞}^{∞} ∣x∣ fX(x) dx < ∞.

Properties (Properties of expectation)
• If X ≥ 0 then E[X] ≥ 0.
• If a ≤ X ≤ b then a ≤ E[X] ≤ b.
• E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx.
• E[aX + b] = a E[X] + b.

Definition (Variance of a continuous random variable) Given a continuous random variable X with µ = E[X], its variance is
Var(X) = E[(X − µ)²] = ∫_{−∞}^{∞} (x − µ)² fX(x) dx.
It has the same properties as the variance of a discrete random variable.

Example (Uniform continuous random variable) A uniform continuous random variable X between a and b, with a < b (X ∼ Uni(a, b)), has PDF
fX(x) = 1/(b − a) if a < x < b, and 0 otherwise.
We have E[X] = (a + b)/2 and Var(X) = (b − a)²/12.
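The expectation and variance definitions above translate directly into code. A small sketch with a Bin(5, 0.3) PMF (the parameters are arbitrary) checks that E[X] = np and that Var(X) = E[X²] − (E[X])² = np(1 − p):

    from math import comb

    # PMF of X ~ Bin(n, p) for illustrative values n = 5, p = 0.3.
    n, p = 5, 0.3
    pmf = {i: comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)}

    mean = sum(x * px for x, px in pmf.items())        # E[X] = np = 1.5
    second = sum(x**2 * px for x, px in pmf.items())   # E[X^2]
    var = second - mean**2                             # np(1-p) = 1.05

    print(round(mean, 6), round(var, 6))  # 1.5 1.05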
Example (Exponential random variable) An exponential random variable X with parameter λ > 0 (X ∼ Exp(λ)) has PDF
fX(x) = λe^(−λx) if x ≥ 0, and 0 otherwise.
We have E[X] = 1/λ and Var(X) = 1/λ².

Definition (Cumulative Distribution Function (CDF)) The CDF of a random variable X is FX(x) = P(X ≤ x). In particular, for a continuous random variable, we have
FX(x) = ∫_{−∞}^{x} fX(t) dt and fX(x) = dFX(x)/dx.

Properties (Properties of the CDF)
• If y ≥ x, then FX(y) ≥ FX(x).
• lim_{x→−∞} FX(x) = 0.
• lim_{x→∞} FX(x) = 1.

Definition (Normal/Gaussian random variable) A normal random variable X with mean µ and variance σ² > 0 (X ∼ N(µ, σ²)) has PDF
fX(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²)).
We have E[X] = µ and Var(X) = σ².

Remark (Standard normal) The standard normal is N(0, 1).

Proposition (Linearity of Gaussians) Given X ∼ N(µ, σ²), and if a ≠ 0, then aX + b ∼ N(aµ + b, a²σ²). Using this, Y = (X − µ)/σ is a standard Gaussian.

Conditioning on an event, and multiple continuous r.v.

Definition (Conditional PDF given an event) Given a continuous random variable X and an event A with P(A) > 0, we define the conditional PDF as the function fX∣A that satisfies
P(X ∈ B∣A) = ∫_B fX∣A(x) dx.

Definition (Conditional PDF given X ∈ A) Given a continuous random variable X and a set A ⊂ R, with P(A) > 0:
fX∣X∈A(x) = (1/P(A)) fX(x) if x ∈ A, and 0 otherwise.

Definition (Conditional expectation) Given a continuous random variable X and an event A, with P(A) > 0:
E[X∣A] = ∫_{−∞}^{∞} x fX∣A(x) dx.

Definition (Memorylessness of the exponential random variable) When we condition an exponential random variable X on the event X > t we have memorylessness, meaning that the "remaining time" X − t, given that X > t, is also exponential with the same parameter, i.e.,
P(X − t > x∣X > t) = P(X > x).

Theorem (Total probability and expectation theorems) Given a partition of the space into disjoint events A1, A2, ..., An such that ∑i P(Ai) = 1, we have the following:
FX(x) = P(A1) FX∣A1(x) + ⋯ + P(An) FX∣An(x),
fX(x) = P(A1) fX∣A1(x) + ⋯ + P(An) fX∣An(x),
E[X] = P(A1) E[X∣A1] + ⋯ + P(An) E[X∣An].

Definition (Jointly continuous random variables) A pair (collection) of random variables is jointly continuous if there exists a joint PDF fX,Y that describes them, that is, for every set B ⊂ Rⁿ,
P((X, Y) ∈ B) = ∬_B fX,Y(x, y) dx dy.

Properties (Properties of joint PDFs)
• fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy.
• FX,Y(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^{x} [∫_{−∞}^{y} fX,Y(u, v) dv] du.
• fX,Y(x, y) = ∂²FX,Y(x, y)/∂x∂y.

Example (Uniform joint PDF on a set S) Let S ⊂ R² with area s > 0. The random variable (X, Y) is uniform over S if it has PDF
fX,Y(x, y) = 1/s if (x, y) ∈ S, and 0 otherwise.

Conditioning on a random variable, independence, Bayes' rule

Definition (Conditional PDF given another random variable) Given jointly continuous random variables X, Y and a value y such that fY(y) > 0, we define the conditional PDF as
fX∣Y(x∣y) = fX,Y(x, y)/fY(y).
Additionally we define P(X ∈ A∣Y = y) = ∫_A fX∣Y(x∣y) dx.

Proposition (Multiplication rule) Given jointly continuous random variables X, Y, whenever possible we have
fX,Y(x, y) = fX(x) fY∣X(y∣x) = fY(y) fX∣Y(x∣y).

Definition (Conditional expectation) Given jointly continuous random variables X, Y, and y such that fY(y) > 0, we define the conditional expected value as
E[X∣Y = y] = ∫_{−∞}^{∞} x fX∣Y(x∣y) dx.
Additionally we have
E[g(X)∣Y = y] = ∫_{−∞}^{∞} g(x) fX∣Y(x∣y) dx.

Theorem (Total probability and total expectation theorems)
fX(x) = ∫_{−∞}^{∞} fY(y) fX∣Y(x∣y) dy,
E[X] = ∫_{−∞}^{∞} fY(y) E[X∣Y = y] dy.

Definition (Independence) Jointly continuous random variables X, Y are independent if fX,Y(x, y) = fX(x) fY(y) for all x, y.

Proposition (Expectation of a product of independent r.v.) If X and Y are independent continuous random variables,
E[XY] = E[X] E[Y].

Remark If X and Y are independent, E[g(X) h(Y)] = E[g(X)] E[h(Y)].

Proposition (Variance of a sum of independent random variables) If X and Y are independent continuous random variables,
Var(X + Y) = Var(X) + Var(Y).

Proposition (Bayes' rule summary)
• For X, Y discrete: pX∣Y(x∣y) = pX(x) pY∣X(y∣x)/pY(y).
• For X, Y continuous: fX∣Y(x∣y) = fX(x) fY∣X(y∣x)/fY(y).
• For X discrete, Y continuous: pX∣Y(x∣y) = pX(x) fY∣X(y∣x)/fY(y).
• For X continuous, Y discrete: fX∣Y(x∣y) = fX(x) pY∣X(y∣x)/pY(y).

Derived distributions

Proposition (Discrete case) Given a discrete random variable X and a function g, the r.v. Y = g(X) has PMF
pY(y) = ∑_{x : g(x) = y} pX(x).

Remark (Linear function of a discrete random variable) If g(x) = ax + b, then pY(y) = pX((y − b)/a).

Proposition (Linear function of a continuous r.v.) Given a continuous random variable X and Y = aX + b, with a ≠ 0, we have
fY(y) = (1/∣a∣) fX((y − b)/a).

Corollary (Linear function of a normal r.v.) If X ∼ N(µ, σ²) and Y = aX + b, with a ≠ 0, then Y ∼ N(aµ + b, a²σ²).

Example (General function of a continuous r.v.) If X is a continuous random variable and g is any function, to obtain the PDF of Y = g(X) we follow the two-step procedure:
1. Find the CDF of Y: FY(y) = P(Y ≤ y) = P(g(X) ≤ y).
2. Differentiate the CDF of Y to obtain the PDF: fY(y) = dFY(y)/dy.

Proposition (General formula for monotonic g) Let X be a continuous random variable and g a function that is monotonic wherever fX(x) > 0. The PDF of Y = g(X) is given by
fY(y) = fX(h(y)) ∣dh(y)/dy∣,
where h = g⁻¹ in the interval where g is monotonic.
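A quick Monte Carlo check of the linear derived-distribution formula above, fY(y) = (1/∣a∣) fX((y − b)/a) for Y = aX + b; the constants λ, a, b and the evaluation point are arbitrary illustration values:

    import math
    import random

    # Compare an empirical density estimate of Y = a X + b, X ~ Exp(lam),
    # against the formula f_Y(y) = (1/|a|) f_X((y - b)/a).
    random.seed(0)
    lam, a, b = 2.0, 3.0, 1.0
    N = 200_000
    samples = [a * random.expovariate(lam) + b for _ in range(N)]

    y, h = 2.5, 0.05                      # point and small window half-width
    empirical = sum(y - h < s <= y + h for s in samples) / (N * 2 * h)
    theoretical = (1 / abs(a)) * lam * math.exp(-lam * (y - b) / a)

    print(round(empirical, 3), round(theoretical, 3))  # both ≈ 0.245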
Sums of independent r.v., covariance and correlation

Proposition (Discrete case) Let X, Y be discrete independent random variables and Z = X + Y. Then the PMF of Z is
pZ(z) = ∑x pX(x) pY(z − x).

Proposition (Continuous case) Let X, Y be continuous independent random variables and Z = X + Y. Then the PDF of Z is
fZ(z) = ∫_{−∞}^{∞} fX(x) fY(z − x) dx.

Proposition (Sum of independent normal r.v.) Let X ∼ N(µx, σx²) and Y ∼ N(µy, σy²) be independent. Then
Z = X + Y ∼ N(µx + µy, σx² + σy²).

Definition (Covariance) We define the covariance of random variables X, Y as
Cov(X, Y) = E[(X − E[X])(Y − E[Y])].

Properties (Properties of covariance)
• If X, Y are independent, then Cov(X, Y) = 0.
• Cov(X, X) = Var(X).
• Cov(aX + b, Y) = a Cov(X, Y).
• Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z).
• Cov(X, Y) = E[XY] − E[X] E[Y].

Proposition (Variance of a sum of r.v.)
Var(X1 + ⋯ + Xn) = ∑i Var(Xi) + ∑_{i≠j} Cov(Xi, Xj).

Definition (Correlation coefficient) We define the correlation coefficient of random variables X, Y, with σX, σY > 0, as
ρ(X, Y) = Cov(X, Y)/(σX σY).

Properties (Properties of the correlation coefficient)
• −1 ≤ ρ ≤ 1.
• If X, Y are independent, then ρ = 0.
• ∣ρ∣ = 1 if and only if X − E[X] = c(Y − E[Y]).
• ρ(aX + b, Y) = sign(a) ρ(X, Y).

Conditional expectation and variance, sum of a random number of r.v.

Definition (Conditional expectation as a random variable) Given random variables X, Y, the conditional expectation E[X∣Y] is the random variable that takes the value E[X∣Y = y] whenever Y = y.

Theorem (Law of iterated expectations)
E[E[X∣Y]] = E[X].

Definition (Conditional variance as a random variable) Given random variables X, Y, the conditional variance Var(X∣Y) is the random variable that takes the value Var(X∣Y = y) whenever Y = y.

Theorem (Law of total variance)
Var(X) = E[Var(X∣Y)] + Var(E[X∣Y]).

Proposition (Sum of a random number of independent r.v.) Let N be a nonnegative integer random variable, let X, X1, X2, ..., XN be i.i.d. random variables, and let Y = ∑i Xi. Then
E[Y] = E[N] E[X],
Var(Y) = E[N] Var(X) + (E[X])² Var(N).

Inequalities, convergence, and the Weak Law of Large Numbers

Theorem (Markov inequality) Given a random variable X ≥ 0 and a > 0, we have
P(X ≥ a) ≤ E[X]/a.

Theorem (Chebyshev inequality) Given a random variable X with E[X] = µ and Var(X) = σ², for every ε > 0 we have
P(∣X − µ∣ ≥ ε) ≤ σ²/ε².

Theorem (Weak Law of Large Numbers (WLLN)) Given a sequence of i.i.d. random variables {X1, X2, ...} with E[Xi] = µ and Var(Xi) = σ², we define
Mn = (1/n) ∑_{i=1}^{n} Xi.
For every ε > 0 we have
lim_{n→∞} P(∣Mn − µ∣ ≥ ε) = 0.

Definition (Convergence in probability) A sequence of random variables {Yn} converges in probability to the random variable Y if, for every ε > 0,
lim_{n→∞} P(∣Yn − Y∣ ≥ ε) = 0.

Properties (Properties of convergence in probability) If Xn → a and Yn → b in probability, then
• Xn + Yn → a + b.
• If g is a continuous function, then g(Xn) → g(a).
• E[Xn] does not always converge to a.

The Central Limit Theorem

Theorem (Central Limit Theorem (CLT)) Given a sequence of independent random variables {X1, X2, ...} with E[Xi] = µ and Var(Xi) = σ², we define
Zn = (1/(σ√n)) ∑_{i=1}^{n} (Xi − µ).
Then, for every z, we have
lim_{n→∞} P(Zn ≤ z) = P(Z ≤ z),
where Z ∼ N(0, 1).

Corollary (Normal approximation of a binomial) Let X ∼ Bin(n, p) with n large. Then X can be approximated by Z ∼ N(np, np(1 − p)).

Remark (De Moivre–Laplace 1/2 approximation) Let X ∼ Bin. Then P(X = i) = P(i − 1/2 ≤ X ≤ i + 1/2), and we can use the CLT to approximate the PMF of X.
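The WLLN and the CLT are easy to see empirically. A minimal simulation sketch with i.i.d. Uniform(0, 1) variables (µ = 1/2, σ² = 1/12); the sample sizes and seed are arbitrary choices:

    import random
    import statistics

    random.seed(0)
    mu, sigma = 0.5, (1 / 12) ** 0.5

    # WLLN: the sample mean M_n concentrates around mu as n grows.
    for n in (10, 100, 10_000):
        M_n = statistics.fmean(random.random() for _ in range(n))
        print(n, round(M_n, 4))

    # CLT: Z_n = sum(X_i - mu) / (sigma * sqrt(n)) is approximately N(0, 1).
    n, trials = 1_000, 5_000
    z = [sum(random.random() - mu for _ in range(n)) / (sigma * n**0.5)
         for _ in range(trials)]
    print(round(statistics.fmean(z), 3), round(statistics.stdev(z), 3))  # ≈ 0, ≈ 1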
CS 229 – Machine Learning (https://stanford.edu/~shervine)

Super VIP Cheatsheet: Machine Learning

Afshine Amidi and Shervine Amidi

October 6, 2018

Contents

1 Supervised Learning
  1.1 Introduction to Supervised Learning
  1.2 Notations and general concepts
  1.3 Linear models
    1.3.1 Linear regression
    1.3.2 Classification and logistic regression
    1.3.3 Generalized Linear Models
  1.4 Support Vector Machines
  1.5 Generative Learning
    1.5.1 Gaussian Discriminant Analysis
    1.5.2 Naive Bayes
  1.6 Tree-based and ensemble methods
  1.7 Other non-parametric approaches
  1.8 Learning Theory

2 Unsupervised Learning
  2.1 Introduction to Unsupervised Learning
  2.2 Clustering
    2.2.1 Expectation-Maximization
    2.2.2 k-means clustering
    2.2.3 Hierarchical clustering
    2.2.4 Clustering assessment metrics
  2.3 Dimension reduction
    2.3.1 Principal component analysis
    2.3.2 Independent component analysis

3 Deep Learning
  3.1 Neural Networks
  3.2 Convolutional Neural Networks
  3.3 Recurrent Neural Networks
  3.4 Reinforcement Learning and Control

4 Machine Learning Tips and Tricks
  4.1 Metrics
    4.1.1 Classification
    4.1.2 Regression
  4.2 Model selection
  4.3 Diagnostics

5 Refreshers
  5.1 Probabilities and Statistics
    5.1.1 Introduction to Probability and Combinatorics
    5.1.2 Conditional Probability
    5.1.3 Random Variables
    5.1.4 Jointly Distributed Random Variables
    5.1.5 Parameter estimation
  5.2 Linear Algebra and Calculus
    5.2.1 General notations
    5.2.2 Matrix operations
    5.2.3 Matrix properties
    5.2.4 Matrix calculus

1 Supervised Learning

1.1 Introduction to Supervised Learning

Given a set of data points {x(1), ..., x(m)} associated to a set of outcomes {y(1), ..., y(m)}, we want to build a classifier that learns how to predict y from x.

r Type of prediction – The different types of predictive models are summed up in the table below:

               Regression           Classifier
Outcome        Continuous           Class
Examples       Linear regression    Logistic regression, SVM, Naive Bayes

r Type of model – The different models are summed up in the table below:

                 Discriminative model         Generative model
Goal             Directly estimate P(y|x)     Estimate P(x|y) to deduce P(y|x)
What's learned   Decision boundary            Probability distributions of the data
Examples         Regressions, SVMs            GDA, Naive Bayes

1.2 Notations and general concepts

r Hypothesis – The hypothesis is noted hθ and is the model that we choose. For a given input data x(i), the model prediction output is hθ(x(i)).

r Loss function – A loss function is a function L : (z, y) ∈ R × Y ↦ L(z, y) ∈ R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:

Least squared:  (1/2)(y − z)²                          (linear regression)
Logistic:       log(1 + exp(−yz))                      (logistic regression)
Hinge:          max(0, 1 − yz)                         (SVM)
Cross-entropy:  −[y log(z) + (1 − y) log(1 − z)]       (neural network)

r Cost function – The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:
J(θ) = ∑_{i=1}^{m} L(hθ(x(i)), y(i))

r Gradient descent – By noting α ∈ R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:
θ ← θ − α∇J(θ)

Remark: Stochastic gradient descent (SGD) updates the parameter based on each training example, while batch gradient descent updates it based on a batch of training examples.

r Likelihood – The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through likelihood maximization. In practice, we use the log-likelihood ℓ(θ) = log(L(θ)), which is easier to optimize. We have:
θ_opt = arg max_θ L(θ)

r Newton's algorithm – Newton's algorithm is a numerical method that finds θ such that ℓ′(θ) = 0. Its update rule is as follows:
θ ← θ − ℓ′(θ)/ℓ″(θ)

Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:
θ ← θ − (∇θ² ℓ(θ))⁻¹ ∇θ ℓ(θ)

1.3 Linear models

1.3.1 Linear regression

We assume here that y|x; θ ∼ N(µ, σ²).

r Normal equations – By noting X the design matrix, the value of θ that minimizes the cost function has a closed-form solution such that:
θ = (XᵀX)⁻¹Xᵀy
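The two fitting routes above, gradient descent on J(θ) and the normal equations, can be compared on toy data. A minimal sketch, assuming synthetic data and a hand-picked learning rate α (both are illustration choices, not part of the cheat sheet):

    import numpy as np

    # Least-squares linear regression two ways: batch gradient descent
    # on J(theta) and the normal-equation closed form.
    rng = np.random.default_rng(0)
    m = 100
    X = np.column_stack([np.ones(m), rng.uniform(-1, 1, m)])  # design matrix
    y = X @ np.array([2.0, -3.0]) + 0.1 * rng.standard_normal(m)

    # Gradient descent: theta <- theta - alpha * grad J(theta),
    # with J(theta) = (1/2) sum_i (h_theta(x_i) - y_i)^2.
    theta, alpha = np.zeros(2), 0.01
    for _ in range(2_000):
        grad = X.T @ (X @ theta - y)
        theta -= alpha * grad

    # Normal equations: theta = (X^T X)^{-1} X^T y.
    theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

    print(theta, theta_closed)  # both ≈ [2, -3]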

r LMS algorithm – By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:
∀j, θj ← θj + α ∑_{i=1}^{m} (y(i) − hθ(x(i))) xj(i)

Remark: the update rule is a particular case of gradient ascent.

r LWR – Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ ∈ R as:
w(i)(x) = exp(−(x(i) − x)²/(2τ²))

1.3.2 Classification and logistic regression

r Sigmoid function – The sigmoid function g, also known as the logistic function, is defined as follows:
∀z ∈ R, g(z) = 1/(1 + e^(−z)) ∈ ]0,1[

r Logistic regression – We assume here that y|x; θ ∼ Bernoulli(φ). We have the following form:
φ = p(y = 1|x; θ) = 1/(1 + exp(−θᵀx)) = g(θᵀx)

Remark: there is no closed-form solution in the case of logistic regression.

r Softmax regression – A softmax regression, also called multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK = 0, which makes the Bernoulli parameter φi of each class i equal to:
φi = exp(θiᵀx) / ∑_{j=1}^{K} exp(θjᵀx)

1.3.3 Generalized Linear Models

r Exponential family – A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:
p(y; η) = b(y) exp(ηT(y) − a(η))

Remark: we will often have T(y) = y. Also, exp(−a(η)) can be seen as a normalization parameter that makes sure the probabilities sum to one.

Here are the most common exponential distributions summed up in the following table:

Distribution   η                T(y)   a(η)                    b(y)
Bernoulli      log(φ/(1 − φ))   y      log(1 + exp(η))         1
Gaussian       µ                y      η²/2                    (1/√(2π)) exp(−y²/2)
Poisson        log(λ)           y      e^η                     1/y!
Geometric      log(1 − φ)       y      log(e^η/(1 − e^η))      1

r Assumptions of GLMs – Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x ∈ R^(n+1) and rely on the following 3 assumptions:
(1) y|x; θ ∼ ExpFamily(η)   (2) hθ(x) = E[y|x; θ]   (3) η = θᵀx

Remark: ordinary least squares and logistic regression are special cases of generalized linear models.

1.4 Support Vector Machines

The goal of support vector machines is to find the line that maximizes the minimum distance to the line.

r Optimal margin classifier – The optimal margin classifier h is such that:
h(x) = sign(wᵀx − b)
where (w, b) ∈ Rⁿ × R is the solution of the following optimization problem:
min (1/2)||w||², such that y(i)(wᵀx(i) − b) ≥ 1

Remark: the decision boundary is defined as wᵀx − b = 0.

r Hinge loss – The hinge loss is used in the setting of SVMs and is defined as follows:
L(z, y) = [1 − yz]₊ = max(0, 1 − yz)
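To make the logistic-regression setup above concrete, here is a minimal sketch that fits θ by gradient ascent on the log-likelihood; the synthetic data, the true parameter θ_true, and the step size α are all assumptions made for the example:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Synthetic labels drawn from a known theta_true (illustrative choice).
    rng = np.random.default_rng(1)
    m = 500
    X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
    theta_true = np.array([0.5, 2.0, -1.0])
    y = (rng.uniform(size=m) < sigmoid(X @ theta_true)).astype(float)

    # Gradient ascent on l(theta): grad l(theta) = X^T (y - g(X theta)).
    theta, alpha = np.zeros(3), 0.1
    for _ in range(5_000):
        theta += alpha * X.T @ (y - sigmoid(X @ theta)) / m

    print(theta)  # roughly recovers theta_true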

r Kernel – Given a feature mapping φ, we define the kernel K as:
K(x, z) = φ(x)ᵀφ(z)
In practice, the kernel K defined by K(x, z) = exp(−||x − z||²/(2σ²)) is called the Gaussian kernel and is commonly used.

Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping φ, which is often very complicated. Instead, only the values K(x, z) are needed.

r Lagrangian – We define the Lagrangian L(w, b) as follows:
L(w, b) = f(w) + ∑_{i=1}^{l} βi hi(w)

Remark: the coefficients βi are called the Lagrange multipliers.

1.5 Generative Learning

A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.

1.5.1 Gaussian Discriminant Analysis

r Setting – Gaussian Discriminant Analysis assumes that y, x|y = 0 and x|y = 1 are such that:
y ∼ Bernoulli(φ)
x|y = 0 ∼ N(µ0, Σ) and x|y = 1 ∼ N(µ1, Σ)

r Estimation – The following table sums up the estimates that we find when maximizing the likelihood:
φ̂ = (1/m) ∑_{i=1}^{m} 1{y(i)=1}
µ̂j = ∑_{i=1}^{m} 1{y(i)=j} x(i) / ∑_{i=1}^{m} 1{y(i)=j}   (j = 0, 1)
Σ̂ = (1/m) ∑_{i=1}^{m} (x(i) − µ_{y(i)})(x(i) − µ_{y(i)})ᵀ

1.5.2 Naive Bayes

r Assumption – The Naive Bayes model supposes that the features of each data point are all independent:
P(x|y) = P(x1, x2, ...|y) = P(x1|y)P(x2|y)... = ∏_{i=1}^{n} P(xi|y)

r Solutions – Maximizing the log-likelihood gives the following solutions, with k ∈ {0,1}, l ∈ [[1, L]]:
P(y = k) = (1/m) × #{j | y(j) = k}
P(xi = l | y = k) = #{j | y(j) = k and xi(j) = l} / #{j | y(j) = k}

Remark: Naive Bayes is widely used for text classification and spam detection.

1.6 Tree-based and ensemble methods

These methods can be used for both regression and classification problems.

r CART – Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.

r Random forest – It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to a simple decision tree, it is highly uninterpretable, but its generally good performance makes it a popular algorithm.

Remark: random forests are a type of ensemble method.

r Boosting – The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:

Adaptive boosting (known as Adaboost): high weights are put on errors to improve at the next boosting step.
Gradient boosting: weak learners are trained on remaining errors.

1.7 Other non-parametric approaches

r k-nearest neighbors – The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.

Remark: the higher the parameter k, the higher the bias; the lower the parameter k, the higher the variance.
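As a sketch of the k-NN idea above, here is a minimal majority-vote classifier; the points, labels, and k = 3 are toy choices:

    from collections import Counter

    def knn_predict(train, query, k=3):
        """train: list of (feature_vector, label); query: feature vector."""
        def dist2(u, v):
            return sum((a - b) ** 2 for a, b in zip(u, v))
        # Take the k training points closest to the query ...
        neighbors = sorted(train, key=lambda pair: dist2(pair[0], query))[:k]
        # ... and let them vote on the label.
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    train = [((0.0, 0.0), "neg"), ((0.2, 0.1), "neg"), ((0.1, 0.3), "neg"),
             ((1.0, 1.0), "pos"), ((0.9, 1.2), "pos"), ((1.1, 0.8), "pos")]
    print(knn_predict(train, (0.15, 0.2)))  # neg
    print(knn_predict(train, (1.0, 0.9)))   # pos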

1.8 Learning Theory

r Union bound – Let A1, ..., Ak be k events. We have:
P(A1 ∪ ... ∪ Ak) ≤ P(A1) + ... + P(Ak)

r Hoeffding inequality – Let Z1, ..., Zm be m i.i.d. variables drawn from a Bernoulli distribution of parameter φ. Let φ̂ be their sample mean and γ > 0 fixed. We have:
P(|φ − φ̂| > γ) ≤ 2 exp(−2γ²m)

Remark: this inequality is also known as the Chernoff bound.

r Training error – For a given classifier h, we define the training error ε̂(h), also known as the empirical risk or empirical error, to be as follows:
ε̂(h) = (1/m) ∑_{i=1}^{m} 1{h(x(i)) ≠ y(i)}

r Probably Approximately Correct (PAC) – PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:
• the training and testing sets follow the same distribution
• the training examples are drawn independently

r Shattering – Given a set S = {x(1), ..., x(d)} and a set of classifiers H, we say that H shatters S if for any set of labels {y(1), ..., y(d)}, we have:
∃h ∈ H, ∀i ∈ [[1, d]], h(x(i)) = y(i)

r Upper bound theorem – Let H be a finite hypothesis class such that |H| = k and let δ and the sample size m be fixed. Then, with probability of at least 1 − δ, we have:
ε(ĥ) ≤ (min_{h∈H} ε(h)) + 2√((1/(2m)) log(2k/δ))

r VC dimension – The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H), is the size of the largest set that is shattered by H.

Remark: the VC dimension of H = {set of linear classifiers in 2 dimensions} is 3.

r Theorem (Vapnik) – Let H be given, with VC(H) = d, and m the number of training examples. With probability at least 1 − δ, we have:
ε(ĥ) ≤ (min_{h∈H} ε(h)) + O(√((d/m) log(m/d) + (1/m) log(1/δ)))
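The Hoeffding bound above can be eyeballed empirically. A small Monte Carlo sketch; φ, m, γ and the number of trials are arbitrary illustration values:

    import math
    import random

    # Estimate P(|phi - phi_hat| > gamma) by simulation and compare it
    # with the Hoeffding bound 2 exp(-2 gamma^2 m).
    random.seed(0)
    phi, m, gamma, trials = 0.4, 200, 0.08, 20_000

    exceed = 0
    for _ in range(trials):
        phi_hat = sum(random.random() < phi for _ in range(m)) / m
        exceed += abs(phi_hat - phi) > gamma

    print(exceed / trials)                  # empirical tail probability
    print(2 * math.exp(-2 * gamma**2 * m))  # Hoeffding bound ≈ 0.155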

2 Unsupervised Learning

2.1 Introduction to Unsupervised Learning

r Motivation – The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1), ..., x(m)}.

r Jensen's inequality – Let f be a convex function and X a random variable. We have the following inequality:
E[f(X)] ≥ f(E[X])

2.2 Clustering

2.2.1 Expectation-Maximization

r Latent variables – Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:

Setting                  Latent variable z   x|z            Comments
Mixture of k Gaussians   Multinomial(φ)      N(µj, Σj)      µj ∈ Rⁿ, φ ∈ Rᵏ
Factor analysis          N(0, I)             N(µ + Λz, ψ)   µj ∈ Rⁿ

r Algorithm – The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:
• E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:
Qi(z(i)) = P(z(i)|x(i); θ)
• M-step: Use the posterior probabilities Qi(z(i)) as cluster-specific weights on data points x(i) to separately re-estimate each cluster model as follows:
θi = argmax_θ ∑i ∫_{z(i)} Qi(z(i)) log(P(x(i), z(i); θ)/Qi(z(i))) dz(i)

2.2.2 k-means clustering

We note c(i) the cluster of data point i and µj the center of cluster j.

r Algorithm – After randomly initializing the cluster centroids µ1, µ2, ..., µk ∈ Rⁿ, the k-means algorithm repeats the following step until convergence:
c(i) = arg min_j ||x(i) − µj||²  and  µj = ∑_{i=1}^{m} 1{c(i)=j} x(i) / ∑_{i=1}^{m} 1{c(i)=j}

r Distortion function – In order to see if the algorithm converges, we look at the distortion function defined as follows:
J(c, µ) = ∑_{i=1}^{m} ||x(i) − µ_{c(i)}||²

2.2.3 Hierarchical clustering

r Algorithm – It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.

r Types – There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, summed up in the table below:

Ward linkage:      minimize within-cluster distance.
Average linkage:   minimize average distance between cluster pairs.
Complete linkage:  minimize maximum distance between cluster pairs.

2.2.4 Clustering assessment metrics

In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground-truth labels, as was the case in the supervised learning setting.

r Silhouette coefficient – By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:
s = (b − a)/max(a, b)
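A compact implementation of the k-means updates above, assuming two synthetic Gaussian blobs and k = 2 (both are illustrative choices); note that a production version would also guard against empty clusters:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])

    k = 2
    mu = X[rng.choice(len(X), size=k, replace=False)]  # random centroid init
    for _ in range(20):
        # Assignment step: c(i) = argmin_j ||x(i) - mu_j||^2
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # Update step: mu_j = mean of the points assigned to cluster j
        mu = np.array([X[c == j].mean(axis=0) for j in range(k)])

    distortion = ((X - mu[c]) ** 2).sum()  # J(c, mu), decreases over iterations
    print(mu, distortion)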

r Calinski-Harabaz index – By noting k the number of clusters, and Bk and Wk the between- and within-clustering dispersion matrices respectively, defined as
Bk = ∑_{j=1}^{k} nj (µj − µ)(µj − µ)ᵀ,   Wk = ∑_{i=1}^{m} (x(i) − µ_{c(i)})(x(i) − µ_{c(i)})ᵀ,
the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters: the higher the score, the denser and better separated the clusters are. It is defined as follows:
s(k) = (Tr(Bk)/Tr(Wk)) × ((N − k)/(k − 1))

2.3 Dimension reduction

2.3.1 Principal component analysis

It is a dimension reduction technique that finds the variance-maximizing directions onto which to project the data.

r Eigenvalue, eigenvector – Given a matrix A ∈ R^(n×n), λ is said to be an eigenvalue of A if there exists a vector z ∈ Rⁿ\{0}, called an eigenvector, such that we have:
Az = λz

r Spectral theorem – Let A ∈ R^(n×n). If A is symmetric, then A is diagonalizable by a real orthogonal matrix U ∈ R^(n×n). By noting Λ = diag(λ1, ..., λn), we have:
∃Λ diagonal, A = UΛUᵀ

Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.

r Algorithm – The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:
• Step 1: Normalize the data to have a mean of 0 and standard deviation of 1:
xj(i) ← (xj(i) − µj)/σj, where µj = (1/m) ∑_{i=1}^{m} xj(i) and σj² = (1/m) ∑_{i=1}^{m} (xj(i) − µj)².
• Step 2: Compute Σ = (1/m) ∑_{i=1}^{m} x(i) x(i)ᵀ ∈ R^(n×n), which is symmetric with real eigenvalues.
• Step 3: Compute u1, ..., uk ∈ Rⁿ, the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.
• Step 4: Project the data on spanR(u1, ..., uk). This procedure maximizes the variance among all k-dimensional spaces.
A code sketch of these four steps follows the ICA section below.

2.3.2 Independent component analysis

It is a technique meant to find the underlying generating sources.

r Assumptions – We assume that our data x has been generated by the n-dimensional source vector s = (s1, ..., sn), where the si are independent random variables, via a mixing and non-singular matrix A as follows:
x = As
The goal is to find the unmixing matrix W = A⁻¹ by an update rule.

r Bell and Sejnowski ICA algorithm – This algorithm finds the unmixing matrix W by following the steps below:
• Write the probability of x = As = W⁻¹s as:
p(x) = ∏_{i=1}^{n} ps(wiᵀx) · |W|
• Write the log-likelihood given our training data {x(i), i ∈ [[1, m]]}, noting g the sigmoid function, as:
l(W) = ∑_{i=1}^{m} (∑_{j=1}^{n} log(g′(wjᵀx(i))) + log|W|)
Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:
W ← W + α((1 − 2g(w1ᵀx(i)), 1 − 2g(w2ᵀx(i)), ..., 1 − 2g(wnᵀx(i)))ᵀ x(i)ᵀ + (Wᵀ)⁻¹)
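Here is the PCA sketch referenced above, mirroring Steps 1-4 via an eigendecomposition of Σ; the correlated 2-D toy data is an arbitrary choice:

    import numpy as np

    rng = np.random.default_rng(0)
    m = 500
    z = rng.normal(size=m)
    X = np.column_stack([z + 0.1 * rng.normal(size=m),
                         2 * z + 0.1 * rng.normal(size=m)])

    # Step 1: normalize each feature to zero mean and unit standard deviation.
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 2: empirical covariance (symmetric, real eigenvalues).
    Sigma = X.T @ X / m

    # Step 3: principal eigenvectors (eigh returns ascending eigenvalues).
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    U = eigvecs[:, ::-1][:, :1]       # k = 1: top principal direction

    # Step 4: project the data onto span(u1).
    X_proj = X @ U
    print(eigvals[::-1], U.ravel())   # most variance on the first component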

3 Deep Learning

3.1 Neural Networks

Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.

r Architecture – By noting i the iᵗʰ layer of the network and j the jᵗʰ hidden unit of the layer, we have:
zj[i] = wj[i]ᵀ x + bj[i]
where we note w, b, z the weight, bias and output respectively.

r Activation function – Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:

Sigmoid:     g(z) = 1/(1 + e^(−z))
Tanh:        g(z) = (e^z − e^(−z))/(e^z + e^(−z))
ReLU:        g(z) = max(0, z)
Leaky ReLU:  g(z) = max(εz, z), with ε ≪ 1

r Cross-entropy loss – In the context of neural networks, the cross-entropy loss L(z, y) is commonly used and is defined as follows:
L(z, y) = −[y log(z) + (1 − y) log(1 − z)]

r Learning rate – The learning rate, often noted η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.

r Backpropagation – Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using the chain rule and is of the following form:
∂L(z, y)/∂w = ∂L(z, y)/∂a × ∂a/∂z × ∂z/∂w
As a result, the weight is updated as follows:
w ← w − η ∂L(z, y)/∂w

r Updating weights – In a neural network, weights are updated as follows:
• Step 1: Take a batch of training data.
• Step 2: Perform forward propagation to obtain the corresponding loss.
• Step 3: Backpropagate the loss to get the gradients.
• Step 4: Use the gradients to update the weights of the network.

r Dropout – Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1 − p.

3.2 Convolutional Neural Networks

r Convolutional layer requirement – By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding and S the stride, the number of neurons N that fit in a given volume is such that:
N = (W − F + 2P)/S + 1

r Batch normalization – It is a step of hyperparameters γ, β that normalizes the batch {xi}. By noting µB, σB² the mean and variance of the batch we want to correct, it is done as follows:
xi ← γ (xi − µB)/√(σB² + ε) + β
It is usually done after a fully connected/convolutional layer and before a non-linearity layer, and aims at allowing higher learning rates and reducing the strong dependence on initialization.

3.3 Recurrent Neural Networks

r Types of gates – Here are the different types of gates that we encounter in a typical recurrent neural network:

Input gate:   write to cell or not?
Forget gate:  erase a cell or not?
Output gate:  reveal a cell or not?
Gate:         how much writing?

r LSTM – A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.
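A tiny worked example of the forward pass and the chain rule above, for a single sigmoid neuron with cross-entropy loss (the data, learning rate η, and iteration count are assumptions of this sketch):

    import numpy as np

    # One neuron: z = w^T x + b, a = g(z), L = cross-entropy(a, y).
    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)

    w, b, eta = np.zeros(2), 0.0, 0.5
    for _ in range(500):
        z = X @ w + b          # forward pass
        a = g(z)
        # Backward pass: for cross-entropy + sigmoid, dL/dz = a - y,
        # so dL/dw = (a - y) x and dL/db = a - y (averaged over the batch).
        dz = a - y
        w -= eta * X.T @ dz / len(X)
        b -= eta * dz.mean()

    print(((g(X @ w + b) > 0.5) == y).mean())  # training accuracy ≈ 1.0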

3.4 Reinforcement Learning and Control

The goal of reinforcement learning is for an agent to learn how to evolve in an environment.

r Markov decision processes – A Markov decision process (MDP) is a 5-tuple (S, A, {Psa}, γ, R) where:
• S is the set of states
• A is the set of actions
• {Psa} are the state transition probabilities for s ∈ S and a ∈ A
• γ ∈ [0, 1[ is the discount factor
• R : S × A → R or R : S → R is the reward function that the algorithm wants to maximize

r Policy – A policy π is a function π : S → A that maps states to actions.

Remark: we say that we execute a given policy π if, given a state s, we take the action a = π(s).

r Value function – For a given policy π and a given state s, we define the value function V^π as follows:
V^π(s) = E[R(s0) + γR(s1) + γ²R(s2) + ...|s0 = s, π]

r Bellman equation – The optimal Bellman equations characterize the value function V^(π*) of the optimal policy π*:
V^(π*)(s) = R(s) + max_{a∈A} γ ∑_{s′∈S} Psa(s′) V^(π*)(s′)

Remark: we note that the optimal policy π* for a given state s is such that:
π*(s) = argmax_{a∈A} ∑_{s′∈S} Psa(s′) V*(s′)

r Value iteration algorithm – The value iteration algorithm is in two steps:
• We initialize the value: V0(s) = 0.
• We iterate the value based on the values before:
Vi+1(s) = R(s) + max_{a∈A} [∑_{s′∈S} γ Psa(s′) Vi(s′)]

r Maximum likelihood estimate – The maximum likelihood estimates for the state transition probabilities are as follows:
Psa(s′) = (#times took action a in state s and got to s′) / (#times took action a in state s)

r Q-learning – Q-learning is a model-free estimation of Q, which is done as follows:
Q(s, a) ← Q(s, a) + α[R(s, a, s′) + γ max_{a′} Q(s′, a′) − Q(s, a)]
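A minimal run of the value iteration algorithm above on a tiny MDP; the 2-state, 2-action transition probabilities P[s][a][s'] and rewards R[s] are made-up numbers for illustration:

    gamma = 0.9
    R = [1.0, 0.0]                    # reward R(s)
    P = [  # P[s][a][s'] = Psa(s')
        [[0.8, 0.2], [0.1, 0.9]],
        [[0.5, 0.5], [0.9, 0.1]],
    ]

    V = [0.0, 0.0]                    # V0(s) = 0
    for _ in range(200):
        # V_{i+1}(s) = R(s) + max_a gamma * sum_{s'} Psa(s') Vi(s')
        V = [R[s] + gamma * max(sum(P[s][a][t] * V[t] for t in range(2))
                                for a in range(2))
             for s in range(2)]

    # Greedy policy extracted from the converged values.
    policy = [max(range(2),
                  key=lambda a: sum(P[s][a][t] * V[t] for t in range(2)))
              for s in range(2)]
    print([round(v, 3) for v in V], policy)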

4 Machine Learning Tips and Tricks

4.1 Metrics

Given a set of data points {x(1), ..., x(m)}, where each x(i) has n features, associated to a set of outcomes {y(1), ..., y(m)}, we want to assess a given classifier that learns how to predict y from x.

4.1.1 Classification

In a context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.

r Confusion matrix – The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:

                         Predicted +                    Predicted –
Actual +   True Positives (TP)             False Negatives (FN), Type II error
Actual –   False Positives (FP), Type I error   True Negatives (TN)

r Main metrics – The following metrics are commonly used to assess the performance of classification models:

Accuracy     (TP + TN)/(TP + TN + FP + FN)   Overall performance of model
Precision    TP/(TP + FP)                    How accurate the positive predictions are
Recall       TP/(TP + FN)                    Coverage of actual positive samples (sensitivity)
Specificity  TN/(TN + FP)                    Coverage of actual negative samples
F1 score     2TP/(2TP + FP + FN)             Hybrid metric useful for unbalanced classes

r ROC – The receiver operating curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:

True Positive Rate (TPR)    TP/(TP + FN)   Equivalent: recall, sensitivity
False Positive Rate (FPR)   FP/(TN + FP)   Equivalent: 1 − specificity

r AUC – The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC curve.

4.1.2 Regression

r Basic metrics – Given a regression model f, the following metrics are commonly used to assess the performance of the model:

Total sum of squares:      SStot = ∑_{i=1}^{m} (yi − ȳ)²
Explained sum of squares:  SSreg = ∑_{i=1}^{m} (f(xi) − ȳ)²
Residual sum of squares:   SSres = ∑_{i=1}^{m} (yi − f(xi))²

r Coefficient of determination – The coefficient of determination, often noted R² or r², provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:
R² = 1 − SSres/SStot

r Main metrics – The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:

Mallow's Cp:   (SSres + 2(n + 1)σ̂²)/m
AIC:           2[(n + 2) − log(L)]
BIC:           log(m)(n + 2) − 2 log(L)
Adjusted R²:   1 − ((1 − R²)(m − 1))/(m − n − 1)

where L is the likelihood and σ̂² is an estimate of the variance associated with each response.
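The classification metrics above computed from scratch; the two label vectors are a tiny made-up example:

    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

    # Confusion-matrix counts.
    TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    TN = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    FP = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)          # = sensitivity = TPR
    specificity = TN / (TN + FP)
    f1 = 2 * TP / (2 * TP + FP + FN)
    print(accuracy, precision, recall, specificity, f1)
    # 0.8 0.75 0.75 0.833... 0.75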

4.2 Model selection

r Vocabulary – When selecting a model, we distinguish 3 different parts of the data that we have as follows:

Training set:    model is trained; usually 80% of the dataset.
Validation set:  model is assessed; usually 20% of the dataset; also called hold-out or development set.
Testing set:     model gives predictions; unseen data.

Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set.

r Cross-validation – Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:

k-fold:       training on k − 1 folds and assessment on the remaining one; generally k = 5 or 10.
Leave-p-out:  training on n − p observations and assessment on the p remaining ones; the case p = 1 is called leave-one-out.

The most commonly used method is called k-fold cross-validation: it splits the training data into k folds, validates the model on one fold while training it on the k − 1 other folds, and does this k times. The error is then averaged over the k folds and is named cross-validation error.

r Regularization – The regularization procedure aims at preventing the model from overfitting the data and thus deals with high-variance issues. The following table sums up the different types of commonly used regularization techniques:

LASSO:        adds λ||θ||₁ to the cost function (λ ∈ R); shrinks coefficients to 0; good for variable selection.
Ridge:        adds λ||θ||₂² to the cost function (λ ∈ R); makes coefficients smaller.
Elastic Net:  adds λ[(1 − α)||θ||₁ + α||θ||₂²] (λ ∈ R, α ∈ [0, 1]); tradeoff between variable selection and small coefficients.

r Model selection – Train the model on the training set, evaluate it on the development set, pick the model with the best performance on the development set, and retrain that model on the whole training set.

4.3 Diagnostics

r Bias – The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.

r Variance – The variance of a model is the variability of the model prediction for given data points.

r Bias/variance tradeoff – The simpler the model, the higher the bias; the more complex the model, the higher the variance.

            Underfitting                        Just right                      Overfitting
Symptoms    High training error;                Training error slightly         Low training error;
            training error close to test        lower than test error           training error much lower
            error; high bias                                                    than test error; high variance
Remedies    Complexify model; add more                                          Regularize; get more data
            features; train longer

r Error analysis – Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.

r Ablative analysis – Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.
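A bare-bones k-fold cross-validation loop following the recipe above; the "model" (predict the training-fold mean) and the data are deliberately trivial, illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(loc=3.0, size=50)   # toy regression targets
    k = 5
    folds = np.array_split(rng.permutation(len(y)), k)

    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        prediction = y[train_idx].mean()              # "train" on k-1 folds
        errors.append(((y[val_idx] - prediction) ** 2).mean())

    cv_error = np.mean(errors)         # cross-validation error
    print(round(cv_error, 4))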

5 Refreshers

5.1 Probabilities and Statistics

5.1.1 Introduction to Probability and Combinatorics

r Sample space – The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.

r Event – Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.

r Axioms of probability – For each event E, we denote P(E) as the probability of event E occurring. By noting E1, ..., En mutually exclusive events, we have the 3 following axioms:
(1) 0 ≤ P(E) ≤ 1   (2) P(S) = 1   (3) P(⋃_{i=1}^{n} Ei) = ∑_{i=1}^{n} P(Ei)

r Permutation – A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n, r), defined as:
P(n, r) = n!/(n − r)!

r Combination – A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n, r), defined as:
C(n, r) = P(n, r)/r! = n!/(r!(n − r)!)

Remark: we note that for 0 ≤ r ≤ n, we have P(n, r) ≥ C(n, r).

5.1.2 Conditional Probability

r Bayes' rule – For events A and B such that P(B) > 0, we have:
P(A|B) = P(B|A)P(A)/P(B)

Remark: we have P(A ∩ B) = P(A)P(B|A) = P(A|B)P(B).

r Partition – Let {Ai, i ∈ [[1, n]]} be such that for all i, Ai ≠ ∅. We say that {Ai} is a partition if we have:
∀i ≠ j, Ai ∩ Aj = ∅ and ⋃_{i=1}^{n} Ai = S

Remark: for any event B in the sample space, we have P(B) = ∑_{i=1}^{n} P(B|Ai)P(Ai).
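A one-liner check of the counting formulas above against Python's standard library, for illustrative n = 5, r = 2:

    from math import comb, perm, factorial

    n, r = 5, 2
    P_nr = factorial(n) // factorial(n - r)  # P(n, r) = n!/(n-r)! = 20
    C_nr = P_nr // factorial(r)              # C(n, r) = P(n, r)/r! = 10

    print(P_nr, perm(n, r))   # 20 20
    print(C_nr, comb(n, r))   # 10 10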

r Extended form of Bayes' rule – Let {Ai, i ∈ [[1,n]]} be a partition of the sample space. We have:

P(Ak|B) = P(B|Ak)P(Ak) / Σ_{i=1}^{n} P(B|Ai)P(Ai)

r Independence – Two events A and B are independent if and only if we have:

P(A ∩ B) = P(A)P(B)

5.1.3 Random Variables

r Random variable – A random variable, often noted X, is a function that maps every element in a sample space to a real line.

r Cumulative distribution function (CDF) – The cumulative distribution function F, which is monotonically non-decreasing and is such that lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1, is defined as:

F(x) = P(X ≤ x)

Remark: we have P(a < X ≤ b) = F(b) − F(a).

r Probability density function (PDF) – The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.

r Relationships involving the PDF and CDF – Here are the important properties to know in the discrete (D) and the continuous (C) cases:

Case (D): CDF F(x) = Σ_{xi ≤ x} P(X = xi); PDF f(xj) = P(X = xj); properties: 0 ≤ f(xj) ≤ 1 and Σ_j f(xj) = 1.
Case (C): CDF F(x) = ∫_{−∞}^{x} f(y) dy; PDF f(x) = dF/dx; properties: f(x) ≥ 0 and ∫_{−∞}^{+∞} f(x) dx = 1.

r Expectation and Moments of the Distribution – Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[X^k] and characteristic function ψ(ω) for the discrete and continuous cases:

Case (D): E[X] = Σ_{i=1}^{n} xi f(xi); E[g(X)] = Σ_{i=1}^{n} g(xi)f(xi); E[X^k] = Σ_{i=1}^{n} xi^k f(xi); ψ(ω) = Σ_{i=1}^{n} f(xi) e^{iωxi}
Case (C): E[X] = ∫_{−∞}^{+∞} x f(x) dx; E[g(X)] = ∫_{−∞}^{+∞} g(x)f(x) dx; E[X^k] = ∫_{−∞}^{+∞} x^k f(x) dx; ψ(ω) = ∫_{−∞}^{+∞} f(x) e^{iωx} dx

Remark: we have e^{iωx} = cos(ωx) + i sin(ωx).

r Revisiting the kth moment – The kth moment can also be computed with the characteristic function as follows:

E[X^k] = (1/i^k) [∂^k ψ/∂ω^k]_{ω=0}

r Transformation of random variables – Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:

fY(y) = fX(x) |dx/dy|

r Leibniz integral rule – Let g be a function of x and potentially c, and a, b boundaries that may depend on c. We have:

∂/∂c ∫_{a}^{b} g(x) dx = (∂b/∂c)·g(b) − (∂a/∂c)·g(a) + ∫_{a}^{b} (∂g/∂c)(x) dx

r Chebyshev's inequality – Let X be a random variable with expected value µ and standard deviation σ. For k, σ > 0, we have the following inequality:

P(|X − µ| ≥ kσ) ≤ 1/k²

r Variance – The variance of a random variable, often noted Var(X) or σ², is a measure of the spread of its distribution function. It is determined as follows:

Var(X) = E[(X − E[X])²] = E[X²] − E[X]²

r Standard deviation – The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:

σ = √Var(X)

5.1.4 Jointly Distributed Random Variables

r Conditional density – The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:

fX|Y(x) = fXY(x,y) / fY(y)

r Independence – Two random variables X and Y are said to be independent if we have:

fXY(x,y) = fX(x)fY(y)
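A small numeric illustration of the expectation and variance definitions (a sketch in plain Python over a made-up PMF):

# Hypothetical discrete PMF over the values 0, 1, 2.
xs = [0, 1, 2]
f = [0.2, 0.5, 0.3]                        # sums to 1

mean = sum(x * p for x, p in zip(xs, f))   # E[X]
m2 = sum(x**2 * p for x, p in zip(xs, f))  # E[X^2], the 2nd moment
var = m2 - mean**2                         # Var(X) = E[X^2] - E[X]^2
std = var ** 0.5                           # sigma
print(mean, var, std)                      # 1.1 0.49 0.7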


r Marginal density and cumulative distribution – From the joint density probability function fXY, we have:

Case (D): marginal density fX(xi) = Σ_j fXY(xi, yj); cumulative function FXY(x,y) = Σ_{xi ≤ x} Σ_{yj ≤ y} fXY(xi, yj).
Case (C): marginal density fX(x) = ∫_{−∞}^{+∞} fXY(x,y) dy; cumulative function FXY(x,y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY(x',y') dx' dy'.

r Distribution of a sum of independent random variables – Let Y = X1 + ... + Xn with X1, ..., Xn independent. We have:

ψY(ω) = Π_{k=1}^{n} ψXk(ω)

r Covariance – We define the covariance of two random variables X and Y, that we note σ²XY or more commonly Cov(X,Y), as follows:

Cov(X,Y) ≜ σ²XY = E[(X − µX)(Y − µY)] = E[XY] − µXµY

r Correlation – By noting σX, σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:

ρXY = σ²XY / (σXσY)

Remarks: For any X, Y, we have ρXY ∈ [−1,1]. If X and Y are independent, then ρXY = 0.

r Main distributions – Here are the main distributions to have in mind:

(D) Binomial, X ∼ B(n, p), x ∈ [[0,n]]: P(X = x) = C(n,x) p^x q^{n−x}; ψ(ω) = (pe^{iω} + q)^n; E[X] = np; Var(X) = npq.
(D) Poisson, X ∼ Po(µ), x ∈ ℕ: P(X = x) = (µ^x/x!) e^{−µ}; ψ(ω) = e^{µ(e^{iω}−1)}; E[X] = µ; Var(X) = µ.
(C) Uniform, X ∼ U(a, b), x ∈ [a,b]: f(x) = 1/(b − a); ψ(ω) = (e^{iωb} − e^{iωa})/((b − a)iω); E[X] = (a + b)/2; Var(X) = (b − a)²/12.
(C) Gaussian, X ∼ N(µ, σ), x ∈ R: f(x) = (1/(√(2π)σ)) e^{−(1/2)((x−µ)/σ)²}; ψ(ω) = e^{iωµ − (1/2)ω²σ²}; E[X] = µ; Var(X) = σ².
(C) Exponential, X ∼ Exp(λ), x ∈ R+: f(x) = λe^{−λx}; ψ(ω) = 1/(1 − iω/λ); E[X] = 1/λ; Var(X) = 1/λ².

5.1.5 Parameter estimation

r Random sample – A random sample is a collection of n random variables X1, ..., Xn that are independent and identically distributed with X.

r Estimator – An estimator θ̂ is a function of the data that is used to infer the value of an unknown parameter θ in a statistical model.

r Bias – The bias of an estimator θ̂ is defined as being the difference between the expected value of the distribution of θ̂ and the true value, i.e.:

Bias(θ̂) = E[θ̂] − θ

Remark: an estimator is said to be unbiased when we have E[θ̂] = θ.

r Sample mean and variance – The sample mean and the sample variance of a random sample are used to estimate the true mean µ and the true variance σ² of a distribution, are noted X̄ and s² respectively, and are such that:

X̄ = (1/n) Σ_{i=1}^{n} Xi  and  s² = σ̂² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²

r Central Limit Theorem – Let us have a random sample X1, ..., Xn following a given distribution with mean µ and variance σ². Then, in the limit n → +∞, we have:

X̄ ∼ N(µ, σ/√n)

5.2 Linear Algebra and Calculus

5.2.1 General notations

r Vector – We note x ∈ R^n a vector with n entries, where xi ∈ R is the ith entry:

x = (x1, x2, ..., xn)^T ∈ R^n

r Matrix – We note A ∈ R^{m×n} a matrix with m rows and n columns, where Ai,j ∈ R is the entry located in the ith row and jth column:

A = (Ai,j)_{1≤i≤m, 1≤j≤n} ∈ R^{m×n}

Remark: the vector x defined above can be viewed as a n × 1 matrix and is more particularly called a column-vector.

r Identity matrix – The identity matrix I ∈ R^{n×n} is a square matrix with ones in its diagonal and zero everywhere else, i.e. Ii,i = 1 and Ii,j = 0 for i ≠ j.
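As a quick check of the sample mean and sample variance estimators above (a sketch with NumPy, assuming it is installed; the distribution and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)   # true mean 2, true variance 4

x_bar = x.mean()                # sample mean, estimates mu
s2 = x.var(ddof=1)              # sample variance with 1/(n-1), unbiased for sigma^2
print(x_bar, s2)                # close to 2 and 4, as the CLT-style intuition suggests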


Remark: for all matrices A ∈ R^{n×n}, we have A × I = I × A = A.

r Diagonal matrix – A diagonal matrix D ∈ R^{n×n} is a square matrix with nonzero values in its diagonal and zero everywhere else, i.e. Di,i = di and Di,j = 0 for i ≠ j.

Remark: we also note D as diag(d1, ..., dn).

5.2.2 Matrix operations

r Vector-vector multiplication – There are two types of vector-vector products:

• inner product: for x, y ∈ R^n, we have:

x^T y = Σ_{i=1}^{n} xi yi ∈ R

• outer product: for x ∈ R^m, y ∈ R^n, we have:

xy^T = (xi yj)_{1≤i≤m, 1≤j≤n} ∈ R^{m×n}

r Matrix-vector multiplication – The product of matrix A ∈ R^{m×n} and vector x ∈ R^n is a vector of size R^m, such that:

Ax = (a_{r,1}^T x, ..., a_{r,m}^T x)^T = Σ_{i=1}^{n} a_{c,i} xi ∈ R^m

where a_{r,i}^T are the vector rows and a_{c,j} are the vector columns of A, and xi are the entries of x.

r Matrix-matrix multiplication – The product of matrices A ∈ R^{m×n} and B ∈ R^{n×p} is a matrix of size R^{m×p}, such that:

AB = (a_{r,i}^T b_{c,j})_{1≤i≤m, 1≤j≤p} = Σ_{i=1}^{n} a_{c,i} b_{r,i}^T ∈ R^{m×p}

where a_{r,i}^T, b_{r,i}^T are the vector rows and a_{c,j}, b_{c,j} are the vector columns of A and B respectively.

r Transpose – The transpose of a matrix A ∈ R^{m×n}, noted A^T, is such that its entries are flipped:

∀i,j, (A^T)_{i,j} = A_{j,i}

Remark: for matrices A, B, we have (AB)^T = B^T A^T.

r Inverse – The inverse of an invertible square matrix A is noted A^{−1} and is the only matrix such that:

AA^{−1} = A^{−1}A = I

Remark: not all square matrices are invertible. Also, for matrices A, B, we have (AB)^{−1} = B^{−1}A^{−1}.

r Trace – The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:

tr(A) = Σ_{i=1}^{n} Ai,i

Remark: for matrices A, B, we have tr(A^T) = tr(A) and tr(AB) = tr(BA).

r Determinant – The determinant of a square matrix A ∈ R^{n×n}, noted |A| or det(A), is expressed recursively in terms of A\i,\j, which is the matrix A without its ith row and jth column, as follows (for any fixed i):

det(A) = |A| = Σ_{j=1}^{n} (−1)^{i+j} Ai,j |A\i,\j|

Remark: A is invertible if and only if |A| ≠ 0. Also, |AB| = |A||B| and |A^T| = |A|.

5.2.3 Matrix properties

r Symmetric decomposition – A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:

A = (A + A^T)/2 + (A − A^T)/2

where the first term is symmetric and the second antisymmetric.

r Norm – A norm is a function N : V → [0, +∞[ where V is a vector space, and such that for all x, y ∈ V, we have:

• N(x + y) ≤ N(x) + N(y)
• N(ax) = |a|N(x) for a scalar
• if N(x) = 0, then x = 0

For x ∈ V, the most commonly used norms are summed up in the table below:
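A few of the identities above can be sanity-checked numerically (a sketch with NumPy, assuming it is installed; the random matrices are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

assert np.allclose((A @ B).T, B.T @ A.T)                  # (AB)^T = B^T A^T
assert np.allclose(np.trace(A @ B), np.trace(B @ A))      # tr(AB) = tr(BA)
assert np.isclose(np.linalg.det(A @ B),
                  np.linalg.det(A) * np.linalg.det(B))    # |AB| = |A||B|
assert np.allclose(np.linalg.inv(A @ B),
                   np.linalg.inv(B) @ np.linalg.inv(A))   # (AB)^{-1} = B^{-1}A^{-1}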


Norm / Notation / Definition / Use case:

• Manhattan, L1: ||x||1 = Σ_{i=1}^{n} |xi| — used in LASSO regularization
• Euclidean, L2: ||x||2 = √(Σ_{i=1}^{n} xi²) — used in Ridge regularization
• p-norm, Lp: ||x||p = (Σ_{i=1}^{n} xi^p)^{1/p} — used in the Hölder inequality
• Infinity, L∞: ||x||∞ = max_i |xi| — used in uniform convergence

r Linear dependence – A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.

Remark: if no vector can be written this way, then the vectors are said to be linearly independent.

r Matrix rank – The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.

r Positive semi-definite matrix – A matrix A ∈ R^{n×n} is positive semi-definite (PSD) and is noted A ⪰ 0 if we have:

A = A^T and ∀x ∈ R^n, x^T Ax ≥ 0

Remark: similarly, a matrix A is said to be positive definite, and is noted A ≻ 0, if it is a PSD matrix which satisfies, for all non-zero vectors x, x^T Ax > 0.

r Eigenvalue, eigenvector – Given a matrix A ∈ R^{n×n}, λ is said to be an eigenvalue of A if there exists a vector z ∈ R^n\{0}, called eigenvector, such that we have:

Az = λz

r Spectral theorem – Let A ∈ R^{n×n}. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U ∈ R^{n×n}. By noting Λ = diag(λ1, ..., λn), we have:

∃Λ diagonal, A = UΛU^T

r Singular-value decomposition – For a given matrix A of dimensions m × n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:

A = UΣV^T

5.2.4 Matrix calculus

r Gradient – Let f : R^{m×n} → R be a function and A ∈ R^{m×n} be a matrix. The gradient of f with respect to A is a m × n matrix, noted ∇A f(A), such that:

(∇A f(A))_{i,j} = ∂f(A)/∂Ai,j

Remark: the gradient of f is only defined when f is a function that returns a scalar.

r Hessian – Let f : R^n → R be a function and x ∈ R^n be a vector. The hessian of f with respect to x is a n × n symmetric matrix, noted ∇²x f(x), such that:

(∇²x f(x))_{i,j} = ∂²f(x)/(∂xi ∂xj)

Remark: the hessian of f is only defined when f is a function that returns a scalar.

r Gradient operations – For matrices A, B, C, the following gradient properties are worth having in mind:

∇A tr(AB) = B^T    ∇_{A^T} f(A) = (∇A f(A))^T
∇A tr(ABA^T C) = CAB + C^T AB^T    ∇A |A| = |A|(A^{−1})^T
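The spectral theorem and the SVD can both be verified numerically (a sketch with NumPy; shapes are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                         # symmetric matrix

lam, U = np.linalg.eigh(A)                # spectral theorem: A = U diag(lam) U^T
assert np.allclose(A, U @ np.diag(lam) @ U.T)

B = rng.standard_normal((3, 5))
U2, s, Vt = np.linalg.svd(B)              # SVD: B = U Sigma V^T
Sigma = np.zeros_like(B)
Sigma[:len(s), :len(s)] = np.diag(s)
assert np.allclose(B, U2 @ Sigma @ Vt)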



CS 230 – Deep Learning Shervine Amidi & Afshine Amidi

Super VIP Cheatsheet: Deep Learning

Afshine Amidi and Shervine Amidi

November 25, 2018

Contents

1 Convolutional Neural Networks
  1.1 Overview
  1.2 Types of layer
  1.3 Filter hyperparameters
  1.4 Tuning hyperparameters
  1.5 Commonly used activation functions
  1.6 Object detection
    1.6.1 Face verification and recognition
    1.6.2 Neural style transfer
    1.6.3 Architectures using computational tricks
2 Recurrent Neural Networks
  2.1 Overview
  2.2 Handling long term dependencies
  2.3 Learning word representation
    2.3.1 Motivation and notations
    2.3.2 Word embeddings
  2.4 Comparing words
  2.5 Language model
  2.6 Machine translation
  2.7 Attention
3 Deep Learning Tips and Tricks
  3.1 Data processing
  3.2 Training a neural network
    3.2.1 Definitions
    3.2.2 Finding optimal weights
  3.3 Parameter tuning
    3.3.1 Weights initialization
    3.3.2 Optimizing convergence
  3.4 Regularization
  3.5 Good practices

1 Convolutional Neural Networks

1.1 Overview

r Architecture of a traditional CNN – Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of convolution layers, pooling layers and fully connected layers. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.

1.2 Types of layer

r Convolutional layer (CONV) – The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.

Remark: the convolution step can be generalized to the 1D and 3D cases as well.

r Pooling (POOL) – The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which does some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.


• Max pooling: each pooling operation selects the maximum value of the current view; preserves detected features; most commonly used.
• Average pooling: each pooling operation averages the values of the current view; downsamples the feature map; used in LeNet.

r Fully Connected (FC) – The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.

1.3 Filter hyperparameters

The convolution layer contains filters for which it is important to know the meaning behind its hyperparameters.

r Dimensions of a filter – A filter of size F × F applied to an input containing C channels is a F × F × C volume that performs convolutions on an input of size I × I × C and produces an output feature map (also called activation map) of size O × O × 1.

Remark: the application of K filters of size F × F results in an output feature map of size O × O × K.

r Stride – For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.

r Zero-padding – Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:

• Valid: P = 0. No padding; drops the last convolution if dimensions do not match.
• Same: Pstart = ⌊(S⌈I/S⌉ − I + F − S)/2⌋ and Pend = ⌈(S⌈I/S⌉ − I + F − S)/2⌉. Padding such that the feature map has size ⌈I/S⌉; the output size is mathematically convenient; also called 'half' padding.
• Full: Pstart ∈ [[0, F − 1]] and Pend = F − 1. Maximum padding such that end convolutions are applied on the limits of the input; the filter 'sees' the input end-to-end.

1.4 Tuning hyperparameters

r Parameter compatibility in convolution layer – By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:

O = (I − F + Pstart + Pend)/S + 1

Remark: often times, Pstart = Pend ≜ P, in which case we can replace Pstart + Pend by 2P in the formula above.
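A minimal sketch of this output-size formula in Python (function and variable names are illustrative):

def conv_output_size(i, f, s, p_start=0, p_end=0):
    """Output size O = (I - F + P_start + P_end)/S + 1 along one dimension."""
    return (i - f + p_start + p_end) // s + 1

# 32x32 input, 5x5 filter, stride 1, padding of 2 on each side:
print(conv_output_size(32, 5, 1, 2, 2))  # 32, the spatial size is preserved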


r Understanding the complexity of the model – In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:

• CONV: input size I × I × C, output size O × O × K, number of parameters (F × F × C + 1) · K. Remarks: one bias parameter per filter; in most cases, S < F; a common choice for K is 2C.
• POOL: input size I × I × C, output size O × O × C, number of parameters 0. Remarks: the pooling operation is done channel-wise; in most cases, S = F.
• FC: input size Nin, output size Nout, number of parameters (Nin + 1) × Nout. Remarks: the input is flattened; one bias parameter per neuron; the number of FC neurons is free of structural constraints.

r Receptive field – The receptive field at layer k is the area denoted Rk × Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0 = 1, the receptive field at layer k can be computed with the formula:

Rk = 1 + Σ_{j=1}^{k} (Fj − 1) Π_{i=0}^{j−1} Si

For example, with F1 = F2 = 3 and S1 = S2 = 1, we get R2 = 1 + 2·1 + 2·1 = 5.

1.5 Commonly used activation functions

r Rectified Linear Unit – The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:

• ReLU: g(z) = max(0, z). Non-linearity complexities biologically interpretable.
• Leaky ReLU: g(z) = max(εz, z) with ε ≪ 1. Addresses the dying ReLU issue for negative values.
• ELU: g(z) = max(α(e^z − 1), z) with α ≪ 1. Differentiable everywhere.

r Softmax – The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x ∈ R^n and outputs a vector of output probability p ∈ R^n through a softmax function at the end of the architecture. It is defined as follows:

p = (p1, ..., pn)  where  pi = e^{xi} / Σ_{j=1}^{n} e^{xj}

1.6 Object detection

r Types of models – There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:

• Image classification: classifies a picture; predicts the probability of an object. Example: traditional CNN.
• Classification w. localization: detects an object in a picture; predicts the probability of the object and where it is located. Example: simplified YOLO, R-CNN.
• Detection: detects up to several objects in a picture; predicts probabilities of objects and where they are located. Example: YOLO, R-CNN.

r Detection – In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:
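A numerically stable softmax along the lines of the definition above (a sketch with NumPy; subtracting the max is a standard trick not mentioned in the text and does not change the result):

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))   # shift for numerical stability
    return z / z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities that sum to 1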


• Bounding box detection: detects the part of the image where the object is located; the prediction is a box of center (bx, by), height bh and width bw.
• Landmark detection: detects a shape or characteristics of an object (e.g. eyes); more granular; the predictions are reference points (l1x, l1y), ..., (lnx, lny).

r YOLO – You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:

• Step 1: Divide the input image into a G × G grid.
• Step 2: For each grid cell, run a CNN that predicts y of the following form:

y = [pc, bx, by, bh, bw, c1, c2, ..., cp, ...]^T ∈ R^{G×G×k×(5+p)}

where the block [pc, bx, by, bh, bw, c1, ..., cp] is repeated k times, pc is the probability of detecting an object, bx, by, bh, bw are the properties of the detected bounding box, c1, ..., cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.

• Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.

Remark: when pc = 0, then the network does not detect any object. In that case, the corresponding predictions bx, ..., cp have to be ignored.

r Intersection over Union – Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:

IoU(Bp, Ba) = |Bp ∩ Ba| / |Bp ∪ Ba|

Remark: we always have IoU ∈ [0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp, Ba) ≥ 0.5.

r Anchor boxes – Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.

r Non-max suppression – The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:

• Step 1: Pick the box with the largest prediction probability.
• Step 2: Discard any box having an IoU ≥ 0.5 with the previous box.

r R-CNN – Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then runs the detection algorithm to find the most probable objects in those bounding boxes.

Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.
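A minimal IoU computation for axis-aligned boxes (a sketch; boxes are assumed here to be (x1, y1, x2, y2) corner tuples, a representation chosen only for illustration):

def iou(bp, ba):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(bp[0], ba[0]), max(bp[1], ba[1])
    x2, y2 = min(bp[2], ba[2]), min(bp[3], ba[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_p = (bp[2] - bp[0]) * (bp[3] - bp[1])
    area_a = (ba[2] - ba[0]) * (ba[3] - ba[1])
    return inter / (area_p + area_a - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, well below the 0.5 convention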


1.6.1 Face verification and recognition

r Types of models – Two main types of model are summed up in the table below:

• Face verification: is this the correct person? One-to-one lookup.
• Face recognition: is this one of the K persons in the database? One-to-many lookup.

r One Shot Learning – One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1, image 2).

r Siamese Network – Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x^(i), the encoded output is often noted as f(x^(i)).

r Triplet loss – The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example belongs to another one. By calling α ∈ R+ the margin parameter, this loss is defined as follows:

ℓ(A, P, N) = max(d(A, P) − d(A, N) + α, 0)

1.6.2 Neural style transfer

r Motivation – The goal of neural style transfer is to generate an image G based on a given content C and a given style S.

r Activation – In a given layer l, the activation is noted a^[l] and is of dimensions nH × nw × nc.

r Content cost function – The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:

Jcontent(C,G) = (1/2) ||a^[l](C) − a^[l](G)||²

r Style matrix – The style matrix G^[l] of a given layer l is a Gram matrix where each of its elements G^[l]_{kk'} quantifies how correlated the channels k and k' are. It is defined with respect to activations a^[l] as follows:

G^[l]_{kk'} = Σ_{i=1}^{nH} Σ_{j=1}^{nw} a^[l]_{ijk} a^[l]_{ijk'}

Remark: the style matrices for the style image and the generated image are noted G^[l](S) and G^[l](G) respectively.

r Style cost function – The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:

J^[l]_style(S,G) = (1/(2nH nw nc)²) ||G^[l](S) − G^[l](G)||²_F = (1/(2nH nw nc)²) Σ_{k,k'=1}^{nc} (G^[l](S)_{kk'} − G^[l](G)_{kk'})²

r Overall cost function – The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α, β, as follows:

J(G) = αJcontent(C,G) + βJstyle(S,G)

Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.

1.6.3 Architectures using computational tricks

r Generative Adversarial Network – Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative model, which aims at differentiating the generated and true images.
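The style matrix is just a Gram matrix over channels; a minimal sketch with NumPy (the activation shape and values are arbitrary placeholders):

import numpy as np

rng = np.random.default_rng(3)
a = rng.standard_normal((4, 4, 3))   # activation of shape nH x nW x nC

flat = a.reshape(-1, a.shape[-1])    # (nH*nW) x nC
gram = flat.T @ flat                 # G[k, k'] = sum_ij a_ijk * a_ijk'
print(gram.shape)                    # (3, 3), symmetric by construction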


Remark: use cases using variants of GANs include text to image, music generation and synthesis.

r ResNet – The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:

a^[l+2] = g(a^[l] + z^[l+2])

r Inception Network – This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance. In particular, it uses the 1 × 1 convolution trick to lower the burden of computation.

⋆ ⋆ ⋆

2 Recurrent Neural Networks

2.1 Overview

r Architecture of a traditional RNN – Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. For each timestep t, the activation a^<t> and the output y^<t> are expressed as follows:

a^<t> = g1(Waa a^<t−1> + Wax x^<t> + ba)  and  y^<t> = g2(Wya a^<t> + by)

where Wax, Waa, Wya, ba, by are coefficients that are shared temporally and g1, g2 are activation functions.

The pros and cons of a typical RNN architecture are summed up in the table below:

Advantages:
- Possibility of processing input of any length
- Model size not increasing with size of input
- Computation takes into account historical information
- Weights are shared across time

Drawbacks:
- Computation being slow
- Difficulty of accessing information from a long time ago
- Cannot consider any future input for the current state

r Applications of RNNs – RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:
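A single RNN forward step following these equations (a sketch with NumPy; the dimensions and the tanh/softmax choices for g1, g2 are illustrative):

import numpy as np

rng = np.random.default_rng(4)
n_a, n_x, n_y = 8, 5, 3                    # hidden, input, output sizes

Waa = rng.standard_normal((n_a, n_a)) * 0.1
Wax = rng.standard_normal((n_a, n_x)) * 0.1
Wya = rng.standard_normal((n_y, n_a)) * 0.1
ba, by = np.zeros(n_a), np.zeros(n_y)

def rnn_step(a_prev, x_t):
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)     # g1 = tanh
    z = Wya @ a_t + by
    y_t = np.exp(z - z.max()); y_t /= y_t.sum()      # g2 = softmax
    return a_t, y_t

a, y = rnn_step(np.zeros(n_a), rng.standard_normal(n_x))
print(a.shape, y.sum())                              # (8,) 1.0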


r Types of RNN – The main RNN architectures are summed up in the table below:

• One-to-one (Tx = Ty = 1): traditional neural network.
• One-to-many (Tx = 1, Ty > 1): music generation.
• Many-to-one (Tx > 1, Ty = 1): sentiment classification.
• Many-to-many (Tx = Ty): name entity recognition.
• Many-to-many (Tx ≠ Ty): machine translation.

r Loss function – In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:

L(ŷ, y) = Σ_{t=1}^{Ty} L(ŷ^<t>, y^<t>)

r Backpropagation through time – Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:

∂L^(T)/∂W = Σ_{t=1}^{T} (∂L^(T)/∂W)|_(t)

2.2 Handling long term dependencies

r Commonly used activation functions – The most common activation functions used in RNN modules are described below:

• Sigmoid: g(z) = 1/(1 + e^{−z})
• Tanh: g(z) = (e^z − e^{−z})/(e^z + e^{−z})
• RELU: g(z) = max(0, z)

r Vanishing/exploding gradient – The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of the multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.

r Gradient clipping – It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.

r Types of gates – In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:

Γ = σ(W x^<t> + U a^<t−1> + b)

where W, U, b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:
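Gradient clipping in a couple of lines (a NumPy sketch; the threshold value is arbitrary):

import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

g = np.array([30.0, 40.0])   # norm 50
print(clip_gradient(g))      # rescaled to norm 5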


• Update gate Γu: how much past should matter now? Used in GRU, LSTM.
• Relevance gate Γr: drop previous information? Used in GRU, LSTM.
• Forget gate Γf: erase a cell or not? Used in LSTM.
• Output gate Γo: how much to reveal of a cell? Used in LSTM.

r GRU/LSTM – Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:

• c̃^<t>: tanh(Wc[Γr ⋆ a^<t−1>, x^<t>] + bc) for both GRU and LSTM.
• c^<t>: Γu ⋆ c̃^<t> + (1 − Γu) ⋆ c^<t−1> for GRU; Γu ⋆ c̃^<t> + Γf ⋆ c^<t−1> for LSTM.
• a^<t>: c^<t> for GRU; Γo ⋆ c^<t> for LSTM.

Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.

r Variants of RNNs – Other commonly used RNN architectures include the Bidirectional RNN (BRNN) and the Deep RNN (DRNN).

2.3 Learning word representation

In this section, we note V the vocabulary and |V| its size.

2.3.1 Motivation and notations

r Representation techniques – The two main ways of representing words are summed up in the table below:

• 1-hot representation: noted ow; naive approach, no similarity information.
• Word embedding: noted ew; takes into account words similarity.

r Embedding matrix – For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:

ew = E ow

Remark: learning the embedding matrix can be done using target/context likelihood models.

2.3.2 Word embeddings

r Word2vec – Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.

r Skip-gram – The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:
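The embedding lookup ew = E ow is just selecting a column of E; a sketch with NumPy (the sizes are made up):

import numpy as np

rng = np.random.default_rng(5)
vocab_size, emb_dim = 10, 4
E = rng.standard_normal((emb_dim, vocab_size))  # embedding matrix

w = 7                                           # index of word w in the vocabulary
o_w = np.zeros(vocab_size); o_w[w] = 1.0        # 1-hot representation
e_w = E @ o_w                                   # same as E[:, w]
assert np.allclose(e_w, E[:, w])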


P(t|c) = exp(θt^T ec) / Σ_{j=1}^{|V|} exp(θj^T ec)

Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.

r Negative sampling – It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target word are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:

P(y = 1|c,t) = σ(θt^T ec)

Remark: this method is less computationally expensive than the skip-gram model.

r GloVe – The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:

J(θ) = (1/2) Σ_{i,j=1}^{|V|} f(Xij)(θi^T ej + bi + b'j − log(Xij))²

where f is a weighting function such that Xi,j = 0 ⟹ f(Xi,j) = 0. Given the symmetry that e and θ play in this model, the final word embedding ew^(final) is given by:

ew^(final) = (ew + θw)/2

Remark: the individual components of the learned word embeddings are not necessarily interpretable.

2.4 Comparing words

r Cosine similarity – The cosine similarity between words w1 and w2 is expressed as follows:

similarity = (w1 · w2)/(||w1|| ||w2||) = cos(θ)

Remark: θ is the angle between words w1 and w2.

r t-SNE – t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.

2.5 Language model

r Overview – A language model aims at estimating the probability of a sentence P(y).

r n-gram model – This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearances in the training data.

r Perplexity – Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:

PP = Π_{t=1}^{T} (1 / Σ_{j=1}^{|V|} y_j^(t) · ŷ_j^(t))^{1/T}

Remark: PP is commonly used in t-SNE.

2.6 Machine translation

r Overview – A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred to as a conditional language model. The goal is to find a sentence y such that:

y = arg max_{y^<1>,...,y^<Ty>} P(y^<1>, ..., y^<Ty> | x)

r Beam search – It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.

• Step 1: Find top B likely words y^<1>.
• Step 2: Compute conditional probabilities y^<k>|x, y^<1>, ..., y^<k−1>.
• Step 3: Keep top B combinations x, y^<1>, ..., y^<k>.
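Cosine similarity in a few lines (a NumPy sketch; the vectors are made up):

import numpy as np

def cosine_similarity(w1, w2):
    return w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707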


Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.

r Beam width – The beam width B is a parameter for beam search. Large values of B yield better results but with slower performance and increased memory. Small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.

r Length normalization – In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:

Objective = (1/Ty^α) Σ_{t=1}^{Ty} log p(y^<t>|x, y^<1>, ..., y^<t−1>)

Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.

r Error analysis – When obtaining a predicted translation ŷ that is bad, one can wonder why we did not get a good translation y* by performing the following error analysis:

• Case P(y*|x) > P(ŷ|x): the root cause is a faulty beam search; remedy: increase the beam width.
• Case P(y*|x) ≤ P(ŷ|x): the root cause is a faulty RNN; remedies: try a different architecture, regularize, get more data.

r Bleu score – The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:

bleu score = exp((1/n) Σ_{k=1}^{n} p_k)

where p_n is the bleu score on n-gram only, defined as follows:

p_n = Σ_{n-gram ∈ ŷ} countclip(n-gram) / Σ_{n-gram ∈ ŷ} count(n-gram)

Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.

2.7 Attention

r Attention model – This model allows an RNN to pay attention to specific parts of the input that are considered as being important, which improves the performance of the resulting model in practice. By noting α^<t,t'> the amount of attention that the output y^<t> should pay to the activation a^<t'> and c^<t> the context at time t, we have:

c^<t> = Σ_{t'} α^<t,t'> a^<t'>  with  Σ_{t'} α^<t,t'> = 1

r Attention weight – The amount of attention that the output y^<t> should pay to the activation a^<t'> is given by α^<t,t'> computed as follows:

α^<t,t'> = exp(e^<t,t'>) / Σ_{t''=1}^{Tx} exp(e^<t,t''>)

Remark: computation complexity is quadratic with respect to Tx.
Remark: the attention scores are commonly used in image captioning and machine translation.

⋆ ⋆ ⋆
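Attention weights are a softmax over scores; a minimal sketch with NumPy (the scores and activations are random placeholders):

import numpy as np

rng = np.random.default_rng(6)
Tx, n_a = 5, 4
a = rng.standard_normal((Tx, n_a))   # activations a^<t'>
e = rng.standard_normal(Tx)          # scores e^<t,t'> for one output step t

alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # attention weights, sum to 1
c = alpha @ a                                        # context c^<t>
print(alpha.sum(), c.shape)                          # 1.0 (4,)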


3 Deep Learning Tips and Tricks

3.1 Data processing

r Data augmentation – Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below:

• Original: image without any modification.
• Flip: flipped with respect to an axis for which the meaning of the image is preserved.
• Rotation: rotation with a slight angle; simulates incorrect horizon calibration.
• Random crop: random focus on one part of the image; several random crops can be done in a row.
• Color shift: nuances of RGB are slightly changed; captures noise that can occur with light exposure.
• Noise addition: addition of noise; more tolerance to quality variation of inputs.
• Information loss: parts of the image are ignored; mimics potential loss of parts of the image.
• Contrast change: luminosity changes; controls difference in exposition due to time of day.

r Batch normalization – It is a step of hyperparameters γ, β that normalizes the batch {xi}. By noting µB, σB² the mean and variance of the batch we want to correct, it is done as follows:

xi ← γ (xi − µB)/√(σB² + ε) + β

It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.

3.2 Training a neural network

3.2.1 Definitions

r Epoch – In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.

r Mini-batch gradient descent – During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities, or on one data point due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.

r Loss function – In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.

r Cross-entropy loss – In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:

L(z,y) = −[y log(z) + (1 − y) log(1 − z)]

3.2.2 Finding optimal weights

r Backpropagation – Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule. Using this method, each weight is updated with the rule:

w ← w − α ∂L(z,y)/∂w

r Updating weights – In a neural network, weights are updated as follows:

• Step 1: Take a batch of training data and perform forward propagation to compute the loss.
• Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight.
• Step 3: Use the gradients to update the weights of the network.
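Batch normalization of a mini-batch in NumPy (a sketch; the γ, β and ε values are illustrative):

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=0)    # per-feature batch mean
    var = x.var(axis=0)    # per-feature batch variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.default_rng(7).standard_normal((32, 10)) * 3 + 5
out = batch_norm(x)
print(out.mean(axis=0).round(6)[:3], out.std(axis=0).round(3)[:3])  # ~0 and ~1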


3.3 Parameter tuning

3.3.1 Weights initialization

r Xavier initialization – Instead of initializing the weights in a purely random manner, Xavier initialization enables to have initial weights that take into account characteristics that are unique to the architecture.

r Transfer learning – Training a deep learning model requires a lot of data and more importantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage them towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:

• Small training size: freeze all layers, train weights on the softmax.
• Medium training size: freeze most layers, train weights on the last layers and the softmax.
• Large training size: train weights on all layers and the softmax, initializing weights on the pre-trained ones.

3.3.2 Optimizing convergence

r Learning rate – The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.

r Adaptive learning rates – Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While the Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:

• Momentum: dampens oscillations; improvement to SGD; 2 parameters to tune. Updates: w ← w − αv_dw and b ← b − αv_db.
• RMSprop: Root Mean Square propagation; speeds up the learning algorithm by controlling oscillations. Updates: w ← w − α dw/√s_dw and b ← b − α db/√s_db.
• Adam: Adaptive Moment estimation; most popular method; 4 parameters to tune. Updates: w ← w − α v_dw/(√s_dw + ε) and b ← b − α v_db/(√s_db + ε).

Remark: other methods include Adadelta, Adagrad and SGD.

3.4 Regularization

r Dropout – Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p > 0. It forces the model to avoid relying too much on particular sets of features.

Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1 − p.

r Weight regularization – In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:

• LASSO: adds λ||θ||₁ (λ ∈ R) to the loss; shrinks coefficients to 0; good for variable selection.
• Ridge: adds λ||θ||₂² (λ ∈ R) to the loss; makes coefficients smaller.
• Elastic Net: adds λ[(1 − α)||θ||₁ + α||θ||₂²] (λ ∈ R, α ∈ [0,1]) to the loss; tradeoff between variable selection and small coefficients.
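One Adam-style update step (a NumPy sketch; the usual bias correction is included here even though the table above omits it, and the hyperparameter values are the common defaults):

import numpy as np

def adam_step(w, dw, v, s, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw          # first moment (momentum term v_dw)
    s = beta2 * s + (1 - beta2) * dw**2       # second moment (RMS term s_dw)
    v_hat = v / (1 - beta1**t)                # bias correction
    s_hat = s / (1 - beta2**t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w, v, s = np.ones(3), np.zeros(3), np.zeros(3)
w, v, s = adam_step(w, dw=np.array([0.1, -0.2, 0.3]), v=v, s=s, t=1)
print(w)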


r Early stopping – This regularization technique stops the training process as soon as the
validation loss reaches a plateau or starts to increase.

3.5 Good practices


r Overfitting small batch – When debugging a model, it is often useful to make quick tests
to see if there is any major issue with the architecture of the model itself. In particular, in order
to make sure that the model can be properly trained, a mini-batch is passed inside the network
to see if it can overfit on it. If it cannot, it means that the model is either too complex or not
complex enough to even overfit on a small batch, let alone a normal-sized training set.
r Gradient checking – Gradient checking is a method used during the implementation of
the backward pass of a neural network. It compares the value of the analytical gradient to the
numerical gradient at given points and plays the role of a sanity-check for correctness.

Numerical gradient:
• Formula: df/dx(x) ≈ (f(x + h) − f(x − h))/(2h)
• Comments: expensive; the loss has to be computed two times per dimension; used to verify the correctness of the analytical implementation; trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation).

Analytical gradient:
• Formula: df/dx(x) = f'(x)
• Comments: 'exact' result; direct computation; used in the final implementation.
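A gradient check in miniature (plain Python; the function and the point are arbitrary):

def f(x):
    return x**3

def analytical_grad(x):
    return 3 * x**2

def numerical_grad(func, x, h=1e-5):
    return (func(x + h) - func(x - h)) / (2 * h)

x = 2.0
print(analytical_grad(x), numerical_grad(f, x))  # both ~12.0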

⋆ ⋆ ⋆



CS 221 – Artificial Intelligence Afshine Amidi & Shervine Amidi

Super VIP Cheatsheet: Artificial Intelligence

Afshine Amidi and Shervine Amidi

September 8, 2019

Contents

1 Reflex-based models
  1.1 Linear predictors
    1.1.1 Classification
    1.1.2 Regression
  1.2 Loss minimization
  1.3 Non-linear predictors
  1.4 Stochastic gradient descent
  1.5 Fine-tuning models
  1.6 Unsupervised Learning
    1.6.1 k-means
    1.6.2 Principal Component Analysis
2 States-based models
  2.1 Search optimization
    2.1.1 Tree search
    2.1.2 Graph search
    2.1.3 Learning costs
    2.1.4 A⋆ search
    2.1.5 Relaxation
  2.2 Markov decision processes
    2.2.1 Notations
    2.2.2 Applications
    2.2.3 When unknown transitions and rewards
  2.3 Game playing
    2.3.1 Speeding up minimax
    2.3.2 Simultaneous games
    2.3.3 Non-zero-sum games
3 Variables-based models
  3.1 Constraint satisfaction problems
    3.1.1 Factor graphs
    3.1.2 Dynamic ordering
    3.1.3 Approximate methods
    3.1.4 Factor graph transformations
  3.2 Bayesian networks
    3.2.1 Introduction
    3.2.2 Probabilistic programs
    3.2.3 Inference
4 Logic-based models
  4.1 Basics
  4.2 Knowledge base
  4.3 Propositional logic
  4.4 First-order logic


1 Reflex-based models

1.1 Linear predictors

In this section, we will go through reflex-based models that can improve with experience, by going through samples that have input-output pairs.

r Feature vector – The feature vector of an input x is noted φ(x) and is such that:

φ(x) = [φ1(x), ..., φd(x)]^T ∈ R^d

r Score – The score s(x,w) of an example (φ(x),y) ∈ R^d × R associated to a linear model of weights w ∈ R^d is given by the inner product:

s(x,w) = w · φ(x)

1.1.1 Classification

r Linear classifier – Given a weight vector w ∈ R^d and a feature vector φ(x) ∈ R^d, the binary linear classifier fw is given by:

fw(x) = sign(s(x,w)) = +1 if w · φ(x) > 0; −1 if w · φ(x) < 0; ? if w · φ(x) = 0

r Margin – The margin m(x,y,w) ∈ R of an example (φ(x),y) ∈ R^d × {−1, +1} associated to a linear model of weights w ∈ R^d quantifies the confidence of the prediction: larger values are better. It is given by:

m(x,y,w) = s(x,w) × y

1.1.2 Regression

r Linear regression – Given a weight vector w ∈ R^d and a feature vector φ(x) ∈ R^d, the output of a linear regression of weights w denoted as fw is given by:

fw(x) = s(x,w)

r Residual – The residual res(x,y,w) ∈ R is defined as being the amount by which the prediction fw(x) overshoots the target y:

res(x,y,w) = fw(x) − y

1.2 Loss minimization

r Loss function – A loss function Loss(x,y,w) quantifies how unhappy we are with the weights w of the model in the prediction task of output y from input x. It is a quantity we want to minimize during the training process.

r Classification case – The classification of a sample x of true label y ∈ {−1,+1} with a linear model of weights w can be done with the predictor fw(x) ≜ sign(s(x,w)). In this situation, a metric of interest quantifying the quality of the classification is given by the margin m(x,y,w), and can be used with the following loss functions:

• Zero-one loss: Loss(x,y,w) = 1{m(x,y,w) ≤ 0}
• Hinge loss: Loss(x,y,w) = max(1 − m(x,y,w), 0)
• Logistic loss: Loss(x,y,w) = log(1 + e^{−m(x,y,w)})

r Regression case – The prediction of a sample x of true label y ∈ R with a linear model of weights w can be done with the predictor fw(x) ≜ s(x,w). In this situation, a metric of interest quantifying the quality of the regression is given by the residual res(x,y,w) and can be used with the following loss functions:

• Squared loss: Loss(x,y,w) = (res(x,y,w))²
• Absolute deviation loss: Loss(x,y,w) = |res(x,y,w)|
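The classification losses side by side (plain Python; the margin values are arbitrary):

import math

def zero_one_loss(m):  return 1.0 if m <= 0 else 0.0
def hinge_loss(m):     return max(1 - m, 0)
def logistic_loss(m):  return math.log(1 + math.exp(-m))

for m in (-1.0, 0.5, 2.0):   # margin values
    print(m, zero_one_loss(m), hinge_loss(m), round(logistic_loss(m), 3))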


r Loss minimization framework – In order to train a model, we want to minimize the training loss, defined as follows:

TrainLoss(w) = (1/|Dtrain|) Σ_{(x,y) ∈ Dtrain} Loss(x,y,w)

1.3 Non-linear predictors

r k-nearest neighbors – The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.

Remark: the higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.

r Neural networks – Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:

z_j^[i] = w_j^[i]T x + b_j^[i]

where we note w, b, x, z the weight, bias, input and non-activated output of the neuron respectively.

1.4 Stochastic gradient descent

r Gradient descent – By noting η ∈ R the learning rate (also called step size), the update rule for gradient descent is expressed with the learning rate and the loss function Loss(x,y,w) as follows:

w ← w − η∇w Loss(x,y,w)

r Stochastic updates – Stochastic gradient descent (SGD) updates the parameters of the model one training example (φ(x),y) ∈ Dtrain at a time. This method leads to sometimes noisy, but fast updates.

r Batch updates – Batch gradient descent (BGD) updates the parameters of the model one batch of examples (e.g. the entire training set) at a time. This method computes stable update directions, at a greater computational cost.

1.5 Fine-tuning models

r Hypothesis class – A hypothesis class F is the set of possible predictors with a fixed φ(x) and varying w:

F = {fw : w ∈ R^d}

r Logistic function – The logistic function σ, also called the sigmoid function, is defined as:

∀z ∈ ]−∞, +∞[, σ(z) = 1/(1 + e^{−z})

Remark: we have σ'(z) = σ(z)(1 − σ(z)).

r Backpropagation – The forward pass is done through fi, which is the value for the subexpression rooted at i, while the backward pass is done through gi = ∂out/∂fi and represents how fi influences the output.

r Approximation and estimation error – The approximation error εapprox represents how far the entire hypothesis class F is from the target predictor g*, while the estimation error εest quantifies how good the predictor f̂ is with respect to the best predictor f* of the hypothesis class F.
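One pass of SGD on the squared loss for a one-parameter linear regression (a sketch in plain Python over a tiny made-up dataset):

# Fit y ≈ w * x with SGD on the squared loss; the data and eta are made up.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # (x, y) pairs, roughly y = 2x
w, eta = 0.0, 0.05

for _ in range(100):                           # epochs
    for x, y in data:                          # one example at a time (SGD)
        residual = w * x - y                   # f_w(x) - y
        grad = 2 * residual * x                # d/dw of residual^2
        w -= eta * grad
print(w)                                       # close to 2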


r Regularization – The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:

• LASSO: adds λ||θ||₁ (λ ∈ R) to the loss; shrinks coefficients to 0; good for variable selection.
• Ridge: adds λ||θ||₂² (λ ∈ R) to the loss; makes coefficients smaller.
• Elastic Net: adds λ[(1 − α)||θ||₁ + α||θ||₂²] (λ ∈ R, α ∈ [0,1]) to the loss; tradeoff between variable selection and small coefficients.

r Hyperparameters – Hyperparameters are the properties of the learning algorithm, and include features, regularization parameter λ, number of iterations T, step size η, etc.

r Sets vocabulary – When selecting a model, we distinguish 3 different parts of the data that we have as follows:

• Training set: the model is trained on it; usually 80% of the dataset.
• Validation set: the model is assessed on it; usually 20% of the dataset; also called hold-out or development set.
• Testing set: the model gives predictions on it; unseen data.

Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set.

1.6 Unsupervised Learning

The class of unsupervised learning methods aims at discovering the structure of the data, which may have rich latent structures.

1.6.1 k-means

r Clustering – Given a training set of input points Dtrain, the goal of a clustering algorithm is to assign each point φ(xi) to a cluster zi ∈ {1,...,k}.

r Objective function – The loss function for one of the main clustering algorithms, k-means, is given by:

Loss_k-means(x,µ) = Σ_{i=1}^{n} ||φ(xi) − µ_{zi}||²

r Algorithm – After randomly initializing the cluster centroids µ1, µ2, ..., µk ∈ R^n, the k-means algorithm repeats the following step until convergence:

zi = arg min_j ||φ(xi) − µj||²  and  µj = Σ_{i=1}^{m} 1{zi=j} φ(xi) / Σ_{i=1}^{m} 1{zi=j}

1.6.2 Principal Component Analysis

r Eigenvalue, eigenvector – Given a matrix A ∈ R^{n×n}, λ is said to be an eigenvalue of A if there exists a vector z ∈ R^n\{0}, called eigenvector, such that we have:

Az = λz
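A few k-means iterations following the two update rules above (a NumPy sketch on made-up 2D points; centroids are initialized from the data):

import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((20, 2))                    # points phi(x_i)
mu = X[rng.choice(len(X), size=2, replace=False)]   # k = 2 centroids

for _ in range(10):
    # Assignment step: nearest centroid for each point.
    z = np.argmin(((X[:, None] - mu[None]) ** 2).sum(-1), axis=1)
    # Update step: each centroid becomes the mean of its assigned points.
    mu = np.array([X[z == j].mean(axis=0) for j in range(2)])
print(mu)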


r Spectral theorem – Let A ∈ R^{n×n}. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U ∈ R^{n×n}. By noting Λ = diag(λ1, ..., λn), we have:

∃Λ diagonal, A = UΛU^T

Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.

r Algorithm – The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:

• Step 1: Normalize the data to have a mean of 0 and standard deviation of 1:

x_j^(i) ← (x_j^(i) − µj)/σj  where  µj = (1/m) Σ_{i=1}^{m} x_j^(i)  and  σj² = (1/m) Σ_{i=1}^{m} (x_j^(i) − µj)²

• Step 2: Compute Σ = (1/m) Σ_{i=1}^{m} x^(i) x^(i)T ∈ R^{n×n}, which is symmetric with real eigenvalues.

• Step 3: Compute u1, ..., uk ∈ R^n the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.

• Step 4: Project the data on span_R(u1, ..., uk). This procedure maximizes the variance among all k-dimensional spaces.

2 States-based models

2.1 Search optimization

In this section, we assume that by accomplishing action a from state s, we deterministically arrive in state Succ(s,a). The goal here is to determine a sequence of actions (a1, a2, a3, a4, ...) that starts from an initial state and leads to an end state. In order to solve this kind of problem, our objective will be to find the minimum cost path by using states-based models.

2.1.1 Tree search

This category of states-based algorithms explores all possible states and actions. It is quite memory efficient, and is suitable for huge state spaces but the runtime can become exponential in the worst cases.

r Search problem – A search problem is defined with:

• a starting state sstart
• possible actions Actions(s) from state s
• action cost Cost(s,a) from state s with action a
• successor Succ(s,a) of state s after action a
• whether an end state was reached IsEnd(s)

The objective is to find a path that minimizes the cost.

r Backtracking search – Backtracking search is a naive recursive algorithm that tries all possibilities to find the minimum cost path. Here, action costs can be either positive or negative.

r Breadth-first search (BFS) – Breadth-first search is a graph search algorithm that does a level-by-level traversal. We can implement it iteratively with the help of a queue that stores at each step future nodes to be visited. For this algorithm, we can assume action costs to be equal to a constant c ≥ 0.

r Graph – A graph is comprised of a set of vertices V (also called nodes) as well as a set of edges E (also called links).

Remark: a graph is said to be acyclic when there is no cycle.

r State – A state is a summary of all past actions sufficient to choose future actions optimally.

r Depth-first search (DFS) – Depth-first search is a search algorithm that traverses a graph by following each path as deep as it can. We can implement it recursively, or iteratively with the help of a stack that stores at each step future nodes to be visited. For this algorithm, action costs are assumed to be equal to 0.

r Dynamic programming – Dynamic programming (DP) is a backtracking search algorithm with memoization (i.e. partial results are saved) whose goal is to find a minimum cost path from state s to an end state send. It can potentially have exponential savings compared to traditional graph search algorithms, and has the property to only work for acyclic graphs. For any given state s, the future cost is computed as follows:

FutureCost(s) = 0 if IsEnd(s); min_{a ∈ Actions(s)} [Cost(s,a) + FutureCost(Succ(s,a))] otherwise

Remark: the recursion can be implemented bottom-to-top, whereas the formula provides the intuition of a top-to-bottom problem resolution.

r Iterative deepening – The iterative deepening trick is a modification of the depth-first search algorithm so that it stops after reaching a certain depth, which guarantees optimality when all action costs are equal. Here, we assume that action costs are equal to a constant c ≥ 0.

r Tree search algorithms summary – By noting b the number of actions per state, d the solution depth, and D the maximum depth, we have:

• Backtracking search: any action costs; space O(D); time O(b^D)
• Breadth-first search: action costs c ≥ 0; space O(b^d); time O(b^d)
• Depth-first search: action costs 0; space O(D); time O(b^D)
• DFS-Iterative deepening: action costs c ≥ 0; space O(d); time O(b^d)

2.1.2 Graph search

This category of states-based algorithms aims at constructing optimal paths, enabling exponential savings. In this section, we will focus on dynamic programming and uniform cost search.

Stanford University 6 Spring 2019


CS 221 – Artificial Intelligence Afshine Amidi & Shervine Amidi

r Types of states – The table below presents the terminology when it comes to states in the
context of uniform cost search:

State            Explanation
Explored E       States for which the optimal path has already been found
Frontier F       States seen for which we are still figuring out how to get there with the cheapest cost
Unexplored U     States not seen yet

r Uniform cost search – Uniform cost search (UCS) is a search algorithm that aims at finding
the shortest path from a state sstart to an end state send. It explores states s in increasing order
of PastCost(s) and relies on the fact that all action costs are non-negative.
Remark 1: the UCS algorithm is logically equivalent to Dijkstra's algorithm.
Remark 2: the algorithm would not work for a problem with negative action costs, and adding a
positive constant to make them non-negative would not solve the problem, since this would end
up being a different problem.

r Correctness theorem – When a state s is popped from the frontier F and moved to the
explored set E, its priority is equal to PastCost(s), which is the minimum cost path from sstart to s.

r Graph search algorithms summary – By noting N the number of total states, n of which
are explored before the end state send, we have:

Algorithm               Acyclicity   Costs    Time/space
Dynamic programming     yes          any      O(N)
Uniform cost search     no           c > 0    O(n log(n))

Remark: the complexity count supposes the number of possible actions per state to be constant.

2.1.3 Learning costs

Suppose we are not given the values of Cost(s,a); we want to estimate these quantities from a
training set of minimizing-cost-path sequences of actions (a1, a2, ..., ak).

r Structured perceptron – The structured perceptron is an algorithm aiming at iteratively
learning the cost of each state-action pair. At each step, it:
• decreases the estimated cost of each state-action pair of the true minimizing path y given by
the training data,
• increases the estimated cost of each state-action pair of the current predicted path y' inferred
from the learned weights.
Remark: there are several versions of the algorithm, one of which simplifies the problem to only
learning the cost of each action a, and another which parametrizes Cost(s,a) with a feature vector
of learnable weights.

2.1.4 A* search

r Heuristic function – A heuristic is a function h over states s, where each h(s) aims at
estimating FutureCost(s), the cost of the path from s to send.

r Algorithm – A* is a search algorithm that aims at finding the shortest path from a state s to
an end state send. It explores states s in increasing order of PastCost(s) + h(s). It is equivalent
to a uniform cost search with edge costs Cost'(s,a) given by:

Cost'(s,a) = Cost(s,a) + h(Succ(s,a)) − h(s)

Remark: this algorithm can be seen as a biased version of UCS exploring states estimated to be
closer to the end state.

r Consistency – A heuristic h is said to be consistent if it satisfies the two following properties:
• For all states s and actions a,
  h(s) ≤ Cost(s,a) + h(Succ(s,a))
• The end state verifies the following:
  h(send) = 0
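The following illustrative sketch implements uniform cost search with a priority queue; passing a nonzero heuristic h turns it into A*. It re-uses the hypothetical problem instance from the backtracking sketch and assumes non-negative action costs.

# Sketch of UCS / A* with a priority queue (illustrative names only).
import heapq

def uniform_cost_search(problem, h=lambda s: 0):
    frontier = [(h(problem.start), problem.start)]  # (PastCost + h, state)
    past_cost = {problem.start: 0}
    while frontier:
        _, s = heapq.heappop(frontier)
        if problem.is_end(s):
            return past_cost[s]       # priority equals PastCost(s) + h(s)
        for a in problem.actions(s):
            t = problem.succ(s, a)
            new_cost = past_cost[s] + problem.cost(s, a)
            if new_cost < past_cost.get(t, float("inf")):
                past_cost[t] = new_cost
                heapq.heappush(frontier, (new_cost + h(t), t))
    return float("inf")

print(uniform_cost_search(problem))                          # plain UCS: 3
print(uniform_cost_search(problem, h=lambda s: max(0, 4 - s)))  # A*, consistent h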

r Correctness – If h is consistent, then A* returns the minimum cost path.

r Admissibility – A heuristic h is said to be admissible if we have:
h(s) ≤ FutureCost(s)

r Theorem – Let h(s) be a given heuristic. We have:
h(s) consistent =⇒ h(s) admissible

r Efficiency – A* explores all states s satisfying the following equation:
PastCost(s) ≤ PastCost(send) − h(s)

Remark: larger values of h(s) are better, since, as this equation shows, they restrict the set of
states s that are going to be explored.

2.1.5 Relaxation

Relaxation is a framework for producing consistent heuristics. The idea is to find closed-form
reduced costs by removing constraints and to use them as heuristics.

r Relaxed search problem – The relaxation of a search problem P with costs Cost is noted
Prel with costs Costrel, and satisfies the identity:
Costrel(s,a) ≤ Cost(s,a)

r Relaxed heuristic – Given a relaxed search problem Prel, we define the relaxed heuristic
h(s) = FutureCostrel(s) as the minimum cost path from s to an end state in the graph of costs
Costrel(s,a).

r Consistency of relaxed heuristics – Let Prel be a given relaxed problem. By theorem, we have:
h(s) = FutureCostrel(s) =⇒ h(s) consistent

r Tradeoff when choosing heuristic – We have to balance two aspects in choosing a heuristic:
• Computational efficiency: h(s) = FutureCostrel(s) must be easy to compute. It has to
produce a closed form, an easier search and independent subproblems.
• Good enough approximation: the heuristic h(s) should be close to FutureCost(s), so we
should not remove too many constraints.

r Max heuristic – Let h1(s), h2(s) be two heuristics. We have the following property:
h1(s), h2(s) consistent =⇒ h(s) = max{h1(s), h2(s)} consistent

2.2 Markov decision processes

In this section, we assume that performing action a from state s can lead to several states s'1, s'2, ...
in a probabilistic manner. In order to find our way between an initial state and an end state,
our objective will be to find the maximum value policy by using Markov decision processes, which
help us cope with randomness and uncertainty.

2.2.1 Notations

r Definition – The objective of a Markov decision process is to maximize rewards. It is defined with:
• a starting state sstart
• possible actions Actions(s) from state s
• transition probabilities T(s,a,s') from s to s' with action a
• rewards Reward(s,a,s') from s to s' with action a
• whether an end state was reached IsEnd(s)
• a discount factor 0 ≤ γ ≤ 1

r Transition probabilities – The transition probability T(s,a,s') specifies the probability
of going to state s' after action a is taken in state s. Each s' ↦ T(s,a,s') is a probability
distribution, which means that:
∀s,a,   Σ_{s' ∈ States} T(s,a,s') = 1

r Policy – A policy π is a function that maps each state s to an action a, i.e.
π : s ↦ a

r Utility – The utility of a path (s0, ..., sk) is the discounted sum of the rewards on that path.
In other words,

u(s0, ..., sk) = Σ_{i=1}^k ri γ^(i−1)

r Q-value – The Q-value of a policy π by taking action a from state s, also noted Qπ(s,a), is
the expected utility of taking action a from state s and then following policy π. It is defined as follows:
Qπ(s,a) = Σ_{s' ∈ States} T(s,a,s') [ Reward(s,a,s') + γVπ(s') ]

r Value of a policy – The value of a policy π from state s, also noted Vπ(s), is the expected
utility by following policy π from state s over random paths. It is defined as follows:
Vπ(s) = Qπ(s, π(s))
Remark: Vπ(s) is equal to 0 if s is an end state.

2.2.2 Applications

r Policy evaluation – Given a policy π, policy evaluation is an iterative algorithm that computes
Vπ. It is done as follows:
• Initialization: for all states s, we have Vπ(0)(s) ← 0
• Iteration: for t from 1 to TPE, we have
∀s, Vπ(t)(s) ← Qπ(t−1)(s, π(s))
with
Qπ(t−1)(s,π(s)) = Σ_{s' ∈ States} T(s,π(s),s') [ Reward(s,π(s),s') + γVπ(t−1)(s') ]
Remark: by noting S the number of states, A the number of actions per state, S' the number
of successors and T the number of iterations, the time complexity is O(TPE · S · S').
r Optimal Q-value – The optimal Q-value Qopt(s,a) of state s with action a is defined to be
the maximum Q-value attained by any policy starting with state s and action a. It is computed as follows:
Qopt(s,a) = Σ_{s' ∈ States} T(s,a,s') [ Reward(s,a,s') + γVopt(s') ]

r Optimal value – The optimal value Vopt(s) of state s is defined as being the maximum value
attained by any policy. It is computed as follows:
Vopt(s) = max_{a ∈ Actions(s)} Qopt(s,a)

r Optimal policy – The optimal policy πopt is defined as being the policy that leads to the
optimal values. It is defined by:
∀s, πopt(s) = argmax_{a ∈ Actions(s)} Qopt(s,a)

r Value iteration – Value iteration is an algorithm that finds the optimal value Vopt as well
as the optimal policy πopt. It is done as follows:
• Initialization: for all states s, we have Vopt(0)(s) ← 0
• Iteration: for t from 1 to TVI, we have
∀s, Vopt(t)(s) ← max_{a ∈ Actions(s)} Qopt(t−1)(s,a)
with
Qopt(t−1)(s,a) = Σ_{s' ∈ States} T(s,a,s') [ Reward(s,a,s') + γVopt(t−1)(s') ]
Remark: if we have either γ < 1 or the MDP graph being acyclic, then the value iteration
algorithm is guaranteed to converge to the correct answer.
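A minimal sketch of the value iteration updates on a toy MDP follows; the dictionary-based encoding of T and Reward is an assumption made for this example.

# Value iteration sketch. transitions maps (s, a) to (s', probability, reward)
# triples; these container names are illustrative.
def value_iteration(states, actions, transitions, gamma=0.9, n_iters=100):
    V = {s: 0.0 for s in states}                 # V_opt^(0)(s) = 0
    for _ in range(n_iters):
        Q = {}
        for s in states:
            for a in actions(s):
                # Q_opt(s,a) = sum_s' T(s,a,s') [Reward(s,a,s') + gamma V(s')]
                Q[(s, a)] = sum(p * (r + gamma * V[sp])
                                for sp, p, r in transitions[(s, a)])
            if actions(s):                        # end states keep V = 0
                V[s] = max(Q[(s, a)] for a in actions(s))
    policy = {s: max(actions(s), key=lambda a: Q[(s, a)])
              for s in states if actions(s)}
    return V, policy

# Toy MDP: "stay" earns 4 and keeps playing with prob 2/3, "quit" earns 10 and
# ends the game. The game ends with prob 1/3 each step, so the iteration
# converges even with gamma = 1.
states = ["in", "end"]
actions = lambda s: ["stay", "quit"] if s == "in" else []
transitions = {
    ("in", "stay"): [("in", 2/3, 4), ("end", 1/3, 4)],
    ("in", "quit"): [("end", 1.0, 10)],
}
V, policy = value_iteration(states, actions, transitions, gamma=1.0)
print(round(V["in"], 2), policy)  # staying is optimal: V ~ 12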
Now, let’s assume that the transition probabilities and the rewards are unknown.
with
r Model-based Monte Carlo – The model-based Monte Carlo method aims at estimating
T (s,a,s0 ) and Reward(s,a,s0 ) using Monte Carlo simulation with:
X h i
(t−1) 0 0 (t−1) 0
Qπ (s,π(s)) = T (s,π(s),s ) Reward(s,π(s),s ) + γVπ (s )
# times (s,a,s0 ) occurs
s0 ∈ States Tb(s,a,s0 ) =
# times (s,a) occurs

Remark: by noting S the number of states, A the number of actions per state, S 0 the number and
of successors and T the number of iterations, then the time complexity is of O(TPE SS 0 ).
0
Reward(s,a,s
\ ) = r in (s,a,r,s0 )
r Optimal Q-value – The optimal Q-value Qopt (s,a) of state s with action a is defined to be
the maximum Q-value attained by any policy starting. It is computed as follows: These estimations will be then used to deduce Q-values, including Qπ and Qopt .

Stanford University 9 Spring 2019


CS 221 – Artificial Intelligence Afshine Amidi & Shervine Amidi

Remark: model-based Monte Carlo is said to be off-policy, because the estimation does not
depend on the exact policy.

r Model-free Monte Carlo – The model-free Monte Carlo method aims at directly estimating
Qπ, as follows:
Q̂π(s,a) = average of ut where st−1 = s, at = a
where ut denotes the utility starting at step t of a given episode.
Remark: model-free Monte Carlo is said to be on-policy, because the estimated value is dependent
on the policy π used to generate the data.

r Equivalent formulation – By introducing the constant η = 1/(1 + (# updates to (s,a))) and, for
each (s,a,u) of the training set, the update rule of model-free Monte Carlo has a convex combination
formulation:
Q̂π(s,a) ← (1 − η) Q̂π(s,a) + ηu
as well as a stochastic gradient formulation:
Q̂π(s,a) ← Q̂π(s,a) − η (Q̂π(s,a) − u)

r SARSA – State-action-reward-state-action (SARSA) is a bootstrapping method estimating
Qπ by using both raw data and estimates as part of the update rule. For each (s,a,r,s',a'), we have:
Q̂π(s,a) ← (1 − η) Q̂π(s,a) + η [ r + γ Q̂π(s',a') ]
Remark: the SARSA estimate is updated on the fly, as opposed to the model-free Monte Carlo
one, where the estimate can only be updated at the end of the episode.

r Q-learning – Q-learning is an off-policy algorithm that produces an estimate for Qopt. On
each (s,a,r,s',a'), we have:
Q̂opt(s,a) ← (1 − η) Q̂opt(s,a) + η [ r + γ max_{a' ∈ Actions(s')} Q̂opt(s',a') ]

r Epsilon-greedy – The epsilon-greedy policy is an algorithm that balances exploration with
probability ε and exploitation with probability 1 − ε. For a given state s, the policy πact is
computed as follows:
πact(s) = argmax_{a ∈ Actions(s)} Q̂opt(s,a)   with probability 1 − ε
πact(s) = random action from Actions(s)       with probability ε
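For illustration, here is a sketch of the Q-learning update together with an epsilon-greedy action choice; the dictionary-based Q table and all names are assumptions of this example.

# Q-learning update + epsilon-greedy policy (illustrative sketch).
import random
from collections import defaultdict

Q = defaultdict(float)   # estimates Q_opt(s, a)

def q_learning_update(s, a, r, s_next, next_actions, eta=0.1, gamma=0.9):
    # Q(s,a) <- (1 - eta) Q(s,a) + eta [ r + gamma max_a' Q(s',a') ]
    v_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * (r + gamma * v_next)

def epsilon_greedy(s, available_actions, epsilon=0.2):
    # Explore with probability epsilon, exploit with probability 1 - epsilon.
    if random.random() < epsilon:
        return random.choice(available_actions)
    return max(available_actions, key=lambda a: Q[(s, a)])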

2.3 Game playing

In games (e.g. chess, backgammon, Go), other agents are present and need to be taken into
account when constructing our policy.

r Game tree – A game tree is a tree that describes the possibilities of a game. In particular,
each node is a decision point for a player and each root-to-leaf path is a possible outcome of the
game.

r Two-player zero-sum game – It is a game where each state is fully observed and such that
players take turns. It is defined with:
• a starting state sstart
• possible actions Actions(s) from state s
• successors Succ(s,a) from state s with action a
• whether an end state was reached IsEnd(s)
• the agent's utility Utility(s) at end state s
• the player Player(s) who controls state s

Remark: we will assume that the utility of the agent has the opposite sign of the one of the opponent.

r Types of policies – There are two types of policies:
• Deterministic policies, noted πp(s), which are actions that player p takes in state s.
• Stochastic policies, noted πp(s,a) ∈ [0,1], which are probabilities that player p takes action a in state s.

r Expectimax – For a given state s, the expectimax value Vexptmax(s) is the maximum expected
utility of any agent policy when playing with respect to a fixed and known opponent policy πopp.
It is computed as follows:

Vexptmax(s) = Utility(s)                                                 if IsEnd(s)
Vexptmax(s) = max_{a ∈ Actions(s)} Vexptmax(Succ(s,a))                   if Player(s) = agent
Vexptmax(s) = Σ_{a ∈ Actions(s)} πopp(s,a) Vexptmax(Succ(s,a))           if Player(s) = opp

Remark: expectimax is the analog of value iteration for MDPs.

r Minimax – The goal of minimax policies is to find an optimal policy against an adversary
by assuming the worst case, i.e. that the opponent is doing everything to minimize the agent's
utility. It is done as follows:

Vminimax(s) = Utility(s)                                                 if IsEnd(s)
Vminimax(s) = max_{a ∈ Actions(s)} Vminimax(Succ(s,a))                   if Player(s) = agent
Vminimax(s) = min_{a ∈ Actions(s)} Vminimax(Succ(s,a))                   if Player(s) = opp

Remark: we can extract πmax and πmin from the minimax value Vminimax.

r Minimax properties – By noting V the value function, there are 3 properties around
minimax to have in mind:
• Property 1: if the agent were to change its policy to any πagent, then the agent would be no better off:
∀πagent, V(πmax, πmin) ≥ V(πagent, πmin)
• Property 2: if the opponent changes its policy from πmin to πopp, then it will be no better off:
∀πopp, V(πmax, πmin) ≤ V(πmax, πopp)
• Property 3: if the opponent is known to be not playing the adversarial policy, then the
minimax policy might not be optimal for the agent:
∀π, V(πmax, π) ≤ V(πexptmax, π)

In the end, we have the following relationship:
V(πexptmax, πmin) ≤ V(πmax, πmin) ≤ V(πmax, π) ≤ V(πexptmax, π)
2.3.1 Speeding up minimax

r Evaluation function – An evaluation function is a domain-specific and approximate estimate
of the value Vminimax(s). It is noted Eval(s).
Remark: FutureCost(s) is the analogous quantity for search problems.

r Alpha-beta pruning – Alpha-beta pruning is a domain-general exact method optimizing
the minimax algorithm by avoiding the unnecessary exploration of parts of the game tree. To do
so, each player keeps track of the best value they can hope for (stored in α for the maximizing
player and in β for the minimizing player). At a given step, the condition β < α means that the
optimal path is not going to be in the current branch, as the earlier player had a better option
at their disposal (see the sketch below).

r TD learning – Temporal difference (TD) learning is used when we don't know the
transitions/rewards. The value is based on an exploration policy. To be able to use it, we need to
know the rules of the game Succ(s,a). For each (s,a,r,s'), the update is done as follows:
w ← w − η [ V(s,w) − (r + γV(s',w)) ] ∇w V(s,w)
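The following sketch shows the minimax recursion with the alpha-beta pruning described above; the game interface mirrors the two-player zero-sum definitions but is an assumed API, not code from the original sheet.

# Minimax with alpha-beta pruning (illustrative sketch; the game object is
# assumed to expose is_end, utility, player, actions and succ).
import math

def minimax(game, s, alpha=-math.inf, beta=math.inf):
    if game.is_end(s):
        return game.utility(s)
    if game.player(s) == "agent":                   # maximizing player
        value = -math.inf
        for a in game.actions(s):
            value = max(value, minimax(game, game.succ(s, a), alpha, beta))
            alpha = max(alpha, value)
            if beta <= alpha:                       # opponent had a better option
                break                               # prune remaining branches
        return value
    else:                                           # minimizing opponent
        value = math.inf
        for a in game.actions(s):
            value = min(value, minimax(game, game.succ(s, a), alpha, beta))
            beta = min(beta, value)
            if beta <= alpha:
                break
        return value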

2.3.2 Simultaneous games

This is the contrary of turn-based games, where there is no ordering on the players' moves.

r Single-move simultaneous game – Let there be two players A and B, with given possible
actions. We note V(a,b) to be A's utility if A chooses action a and B chooses action b. V is called
the payoff matrix.

r Strategies – There are two main types of strategies:
• A pure strategy is a single action: a ∈ Actions
• A mixed strategy is a probability distribution over actions: ∀a ∈ Actions, 0 ≤ π(a) ≤ 1

r Game evaluation – The value of the game V(πA, πB) when player A follows πA and player
B follows πB is such that:
V(πA, πB) = Σ_{a,b} πA(a) πB(b) V(a,b)

r Minimax theorem – By noting πA, πB ranging over mixed strategies, for every simultaneous
two-player zero-sum game with a finite number of actions, we have:
max_{πA} min_{πB} V(πA, πB) = min_{πB} max_{πA} V(πA, πB)

2.3.3 Non-zero-sum games

r Payoff matrix – We define Vp(πA, πB) to be the utility for player p.

r Nash equilibrium – A Nash equilibrium is a pair (πA*, πB*) such that no player has an
incentive to change its strategy. We have:
∀πA, VA(πA*, πB*) ≥ VA(πA, πB*)   and   ∀πB, VB(πA*, πB*) ≥ VB(πA*, πB)

Remark: in any finite-player game with a finite number of actions, there exists at least one Nash
equilibrium.

3 Variables-based models

3.1 Constraint satisfaction problems

In this section, our objective is to find maximum weight assignments of variables-based models.
One advantage compared to states-based models is that these algorithms are more convenient
for encoding problem-specific constraints.

3.1.1 Factor graphs

r Definition – A factor graph, also referred to as a Markov random field, is a set of variables
X = (X1, ..., Xn), where Xi ∈ Domaini, and of m factors f1, ..., fm with each fj(X) ≥ 0.

r Scope and arity – The scope of a factor fj is the set of variables it depends on. The size of
this set is called the arity.
Remark: factors of arity 1 and 2 are called unary and binary respectively.

r Assignment weight – Each assignment x = (x1, ..., xn) yields a weight Weight(x), defined as
the product of all factors fj applied to that assignment. Its expression is given by:
Weight(x) = Π_{j=1}^m fj(x)

r Constraint satisfaction problem – A constraint satisfaction problem (CSP) is a factor
graph where all factors are binary; we call them constraints:
∀j ∈ [[1,m]], fj(x) ∈ {0,1}
Here, the constraint j with assignment x is said to be satisfied if and only if fj(x) = 1.

r Consistent assignment – An assignment x of a CSP is said to be consistent if and only if
Weight(x) = 1, i.e. all constraints are satisfied.
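A small sketch of assignment weights on a toy CSP follows, under the assumption that factors are encoded as Python functions of a full assignment.

# Assignment weight in a factor graph. For a CSP, every factor returns 0 or 1
# and a consistent assignment has weight 1 (toy encoding).
from math import prod

def weight(factors, x):
    # Weight(x) = product over j of f_j(x)
    return prod(f(x) for f in factors)

# Toy CSP: X1, X2 in {0, 1} with constraints X1 != X2 and X2 = 1.
factors = [lambda x: 1.0 if x["X1"] != x["X2"] else 0.0,
           lambda x: 1.0 if x["X2"] == 1 else 0.0]
print(weight(factors, {"X1": 0, "X2": 1}))  # 1.0 -> consistent
print(weight(factors, {"X1": 1, "X2": 1}))  # 0.0 -> violates a constraint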

3.1.2 Dynamic ordering

r Dependent factors – The set of dependent factors of variable Xi with partial assignment x
is called D(x, Xi), and denotes the set of factors that link Xi to already assigned variables.

r Backtracking search – Backtracking search is an algorithm used to find maximum weight
assignments of a factor graph. At each step, it chooses an unassigned variable and explores
its values by recursion. Dynamic ordering (i.e. the choice of variables and values) and lookahead
(i.e. early elimination of inconsistent options) can be used to explore the graph more efficiently,
although the worst-case runtime stays exponential: O(|Domain|^n).

r Forward checking – It is a one-step lookahead heuristic that preemptively removes
inconsistent values from the domains of neighboring variables. It has the following characteristics:
• After assigning a variable Xi, it eliminates inconsistent values from the domains of all its neighbors.
• If any of these domains becomes empty, we stop the local backtracking search.
• If we un-assign a variable Xi, we have to restore the domains of its neighbors.

r Most constrained variable – It is a variable-level ordering heuristic that selects the next
unassigned variable that has the fewest consistent values. This has the effect of making
inconsistent assignments fail earlier in the search, which enables more efficient pruning.

r Least constrained value – It is a value-level ordering heuristic that assigns the next value
that yields the highest number of consistent values of neighboring variables. Intuitively, this
procedure chooses first the values that are most likely to work.
Remark: in practice, this heuristic is useful when all factors are constraints.

Example (figure omitted): the 3-color problem can be solved with backtracking search coupled
with most constrained variable exploration and the least constrained value heuristic, as well as
forward checking at each step.

r Arc consistency – We say that arc consistency of variable Xl with respect to Xk is enforced
when for each xl ∈ Domainl:
• unary factors of Xl are non-zero,
• there exists at least one xk ∈ Domaink such that any factor between Xl and Xk is non-zero.

r AC-3 – The AC-3 algorithm is a multi-step lookahead heuristic that applies forward checking
to all relevant variables. After a given assignment, it performs forward checking and then
successively enforces arc consistency with respect to the neighbors of variables whose domain
changed during the process.
Remark: AC-3 can be implemented both iteratively and recursively.

3.1.3 Approximate methods

r Beam search – Beam search is an approximate algorithm that extends partial assignments
of n variables of branching factor b = |Domain| by exploring the K top paths at each step. The
beam size K ∈ {1,...,b^n} controls the tradeoff between efficiency and accuracy. This algorithm
has a time complexity of O(n · Kb log(Kb)).
Remark: K = 1 corresponds to greedy search, whereas K → +∞ is equivalent to BFS tree search.

r Iterated conditional modes – Iterated conditional modes (ICM) is an iterative approximate
algorithm that modifies the assignment of a factor graph one variable at a time until convergence.
At step i, we assign to Xi the value v that maximizes the product of all factors connected to
that variable.
Remark: ICM may get stuck in local minima.

r Gibbs sampling – Gibbs sampling is an iterative approximate method that modifies the
assignment of a factor graph one variable at a time until convergence. At step i:
• we assign to each element u ∈ Domaini a weight w(u) that is the product of all factors
connected to that variable,
• we sample v from the probability distribution induced by w and assign it to Xi.
Remark: Gibbs sampling can be seen as the probabilistic counterpart of ICM. It has the advantage
of being able to escape local minima in most cases.
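Here is an illustrative sketch of repeated Gibbs sampling sweeps, re-using the weight function and toy factors from the previous sketch (the domains mapping and all names are assumptions).

# One Gibbs sampling sweep over a factor graph (toy sketch).
import random

def gibbs_sweep(factors, domains, x):
    for var, domain in domains.items():
        # weight of the full assignment for each candidate value of var
        weights = [weight(factors, {**x, var: u}) for u in domain]
        if sum(weights) == 0:
            continue                      # keep the current value
        # sample v from the distribution induced by the weights
        x[var] = random.choices(domain, weights=weights)[0]
    return x

random.seed(0)
x = {"X1": 0, "X2": 0}
domains = {"X1": [0, 1], "X2": [0, 1]}
for _ in range(5):
    x = gibbs_sweep(factors, domains, x)
print(x)  # reaches the consistent assignment {'X1': 0, 'X2': 1}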
3.1.4 Factor graph transformations

r Independence – Let A, B be a partitioning of the variables X. We say that A and B are
independent if there are no edges between A and B, and we write:
A, B independent ⇐⇒ A ⊥⊥ B
Remark: independence is the key property that allows us to solve subproblems in parallel.

r Conditional independence – We say that A and B are conditionally independent given C
if conditioning on C produces a graph in which A and B are independent. In this case, it is
written:

A and B cond. indep. given C ⇐⇒ A ⊥⊥ B | C

r Conditioning – Conditioning is a transformation aiming at making variables independent,
which breaks up a factor graph into smaller pieces that can be solved in parallel and can use
backtracking. In order to condition on a variable Xi = v, we do as follows:
• Consider all factors f1, ..., fk that depend on Xi
• Remove Xi and f1, ..., fk
• Add gj(x) for j ∈ {1,...,k}, defined as:
  gj(x) = fj(x ∪ {Xi : v})

r Markov blanket – Let A ⊆ X be a subset of variables. We define MarkovBlanket(A) to be
the neighbors of A that are not in A.

r Proposition – Let C = MarkovBlanket(A) and B = X\(A ∪ C). Then we have:
A ⊥⊥ B | C

r Explaining away – Suppose causes C1 and C2 influence an effect E. Conditioning on the
effect E and on one of the causes (say C1) changes the probability of the other cause (say C2).
In this case, we say that C1 has explained away C2.

r Elimination – Elimination is a factor graph transformation that removes Xi from the graph
and solves a small subproblem conditioned on its Markov blanket as follows:
• Consider all factors fi,1, ..., fi,k that depend on Xi
• Remove Xi and fi,1, ..., fi,k
• Add fnew,i(x) defined as:
  fnew,i(x) = max_{xi} Π_{l=1}^k fi,l(x)

r Treewidth – The treewidth of a factor graph is the maximum arity of any factor created by
variable elimination with the best variable ordering. In other words,
Treewidth = min_{orderings} max_{i ∈ {1,...,n}} arity(fnew,i)
Example (figure omitted): a factor graph of treewidth 3.
Remark: finding the best variable ordering is an NP-hard problem.

3.2 Bayesian networks

In this section, our goal will be to compute conditional probabilities: what is the probability of
a query given evidence?

3.2.1 Introduction

r Directed acyclic graph – A directed acyclic graph (DAG) is a finite directed graph with
no directed cycles.

r Bayesian network – A Bayesian network is a directed acyclic graph (DAG) that specifies
a joint distribution over random variables X = (X1, ..., Xn) as a product of local conditional
distributions, one for each node:
P(X1 = x1, ..., Xn = xn) ≜ Π_{i=1}^n p(xi | xParents(i))

Remark: Bayesian networks are factor graphs imbued with the language of probability.

r Locally normalized – For each xParents(i), all factors are local conditional distributions.
Hence they have to satisfy:

Σ_{xi} p(xi | xParents(i)) = 1

As a result, sub-Bayesian networks and conditional distributions are consistent.
Remark: local conditional distributions are the true conditional distributions.

r Marginalization – The marginalization of a leaf node yields a Bayesian network without
that node.

3.2.2 Probabilistic programs

r Concept – A probabilistic program randomizes variable assignments. That way, we can write
down complex Bayesian networks that generate assignments without us having to explicitly
specify the associated probabilities.
Remark: examples of probabilistic programs include the Hidden Markov model (HMM), factorial
HMM, naive Bayes, latent Dirichlet allocation, diseases and symptoms, and stochastic block
models.

r Summary – The table below summarizes the common probabilistic programs as well as their
applications:

Program                             Algorithm                                                      Example
Markov model                        Xi ∼ p(Xi | Xi−1)                                              Language modeling
Hidden Markov model (HMM)           Ht ∼ p(Ht | Ht−1),  Et ∼ p(Et | Ht)                            Object tracking
Factorial HMM                       Ht^o ∼ p(Ht^o | Ht−1^o) for o ∈ {a,b},  Et ∼ p(Et | Ht^a,Ht^b)  Multiple object tracking
Naive Bayes                         Y ∼ p(Y),  Wi ∼ p(Wi | Y)                                      Document classification
Latent Dirichlet Allocation (LDA)   α ∈ R^K,  Zi ∼ p(Zi | α),  Wi ∼ p(Wi | Zi)                     Topic modeling

3.2.3 Inference

r General probabilistic inference strategy – The strategy to compute the probability
P(Q | E = e) of query Q given evidence E = e is as follows:
• Step 1: Remove variables that are not ancestors of the query Q or the evidence E by marginalization
• Step 2: Convert the Bayesian network to a factor graph
• Step 3: Condition on the evidence E = e
• Step 4: Remove nodes disconnected from the query Q by marginalization
• Step 5: Run a probabilistic inference algorithm (manual, variable elimination, Gibbs sampling,
particle filtering)

r Forward-backward algorithm – This algorithm computes the exact value of P(H = hk | E = e)
(smoothing query) for any k ∈ {1, ..., L} in the case of an HMM of size L. To do so, we proceed
in 3 steps:
• Step 1: for i ∈ {1, ..., L}, compute Fi(hi) = Σ_{hi−1} Fi−1(hi−1) p(hi | hi−1) p(ei | hi)
• Step 2: for i ∈ {L, ..., 1}, compute Bi(hi) = Σ_{hi+1} Bi+1(hi+1) p(hi+1 | hi) p(ei+1 | hi+1)
• Step 3: for i ∈ {1, ..., L}, compute Si(hi) = Fi(hi)Bi(hi) / Σ_{hi} Fi(hi)Bi(hi)

with the convention F0 = BL+1 = 1. From this procedure and these notations, we get that
P(H = hk | E = e) = Sk(hk).
Remark: this algorithm interprets each assignment to be a path where each edge hi−1 → hi is
of weight p(hi | hi−1) p(ei | hi).
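Below is a hedged sketch of the three forward-backward steps on a toy weather HMM; the initial distribution p_init plays the role of the F0 = 1 convention, and all names and numbers are illustrative.

# Forward-backward smoothing on a toy HMM. p_trans[h][h2] is p(h2|h),
# p_emit[h][e] is p(e|h); hidden states and probabilities are toy assumptions.
def forward_backward(hidden, evidence, p_trans, p_emit, p_init):
    L = len(evidence)
    F = [None] * L                                   # forward messages
    F[0] = {h: p_init[h] * p_emit[h][evidence[0]] for h in hidden}
    for i in range(1, L):
        F[i] = {h: sum(F[i-1][g] * p_trans[g][h] for g in hidden)
                   * p_emit[h][evidence[i]] for h in hidden}
    B = [None] * L                                   # backward messages
    B[L-1] = {h: 1.0 for h in hidden}
    for i in range(L - 2, -1, -1):
        B[i] = {h: sum(p_trans[h][g] * p_emit[g][evidence[i+1]] * B[i+1][g]
                       for g in hidden) for h in hidden}
    S = []                                           # smoothing: P(H_i | E = e)
    for i in range(L):
        raw = {h: F[i][h] * B[i][h] for h in hidden}
        z = sum(raw.values())
        S.append({h: v / z for h, v in raw.items()})
    return S

hidden = ["sun", "rain"]
p_init = {"sun": 0.5, "rain": 0.5}
p_trans = {"sun": {"sun": 0.8, "rain": 0.2}, "rain": {"sun": 0.3, "rain": 0.7}}
p_emit = {"sun": {"walk": 0.9, "umbrella": 0.1},
          "rain": {"walk": 0.2, "umbrella": 0.8}}
print(forward_backward(hidden, ["walk", "umbrella", "umbrella"],
                       p_trans, p_emit, p_init))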
r Gibbs sampling – This algorithm is an iterative approximate method that uses a small set of
assignments (particles) to represent a large probability distribution. From a random assignment
x, Gibbs sampling performs the following steps for i ∈ {1,...,n} until convergence:
• For all u ∈ Domaini, compute the weight w(u) of assignment x where Xi = u
• Sample v from the probability distribution induced by w: v ∼ P(Xi = v | X−i = x−i)
• Set Xi = v
Remark: X−i denotes X\{Xi} and x−i represents the corresponding assignment.

r Particle filtering – This algorithm approximates the posterior density of state variables
given the evidence of observation variables by keeping track of K particles at a time. Starting
from a set of particles C of size K, we run the following 3 steps iteratively:
• Step 1: proposal – For each old particle xt−1 ∈ C, sample x from the transition probability
distribution p(x | xt−1) and add x to a set C'.
• Step 2: weighting – Weigh each x of the set C' by w(x) = p(et | x), where et is the evidence
observed at time t.
• Step 3: resampling – Sample K elements from the set C' using the probability distribution
induced by w and store them in C: these are the current particles xt.
Remark: a more expensive version of this algorithm also keeps track of past particles in the
proposal step.

r Maximum likelihood – If we don't know the local conditional distributions, we can learn
them using maximum likelihood:
max_θ Π_{x ∈ Dtrain} p(X = x; θ)

r Laplace smoothing – For each distribution d and partial assignment (xParents(i), xi), add λ
to countd(xParents(i), xi), then normalize to get probability estimates.

r Algorithm – The Expectation-Maximization (EM) algorithm gives an efficient method for
estimating the parameter θ through maximum likelihood estimation by repeatedly constructing
a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:
• E-step: Evaluate the posterior probability q(h) that each data point e came from a particular
cluster h as follows:
q(h) = P(H = h | E = e; θ)
• M-step: Use the posterior probabilities q(h) as cluster-specific weights on data points e to
determine θ through maximum likelihood.

4 Logic-based models

4.1 Basics

r Syntax of propositional logic – By noting f, g formulas, and ¬, ∧, ∨, →, ↔ connectives, we
can write the following logical expressions:

Name            Symbol    Meaning
Affirmation     f         f
Negation        ¬f        not f
Conjunction     f ∧ g     f and g
Disjunction     f ∨ g     f or g
Implication     f → g     if f then g
Biconditional   f ↔ g     f, that is to say g

Remark: formulas can be built up recursively out of these connectives.

r Model – A model w denotes an assignment of binary weights to propositional symbols.
Example: the set of truth values w = {A : 0, B : 1, C : 0} is one possible model for the propositional
symbols A, B and C.

r Interpretation function – The interpretation function I(f,w) outputs whether model w
satisfies formula f:
I(f,w) ∈ {0,1}

r Set of models – M(f) denotes the set of models w that satisfy formula f. Mathematically
speaking, we define it as follows:



∀w ∈ M(f), I(f,w) = 1

4.2 Knowledge base

r Definition – The knowledge base KB is the conjunction of all formulas that have been
considered so far. The set of models of the knowledge base is the intersection of the sets of
models that satisfy each formula. In other words:
M(KB) = ∩_{f ∈ KB} M(f)

r Probabilistic interpretation – The probability that query f is evaluated to 1 can be seen
as the proportion of models w of the knowledge base KB that satisfy f, i.e.:
P(f | KB) = [ Σ_{w ∈ M(KB) ∩ M(f)} P(W = w) ] / [ Σ_{w ∈ M(KB)} P(W = w) ]

r Satisfiability – The knowledge base KB is said to be satisfiable if at least one model w
satisfies all its constraints. In other words:
KB satisfiable ⇐⇒ M(KB) ≠ ∅
Remark: M(KB) denotes the set of models compatible with all the constraints of the knowledge
base.

r Relation between formulas and knowledge base – We define the following properties
between the knowledge base KB and a new formula f:

Name                 Mathematical formulation                     Notes
KB entails f         M(KB) ∩ M(f) = M(KB)                         f does not bring any new information;
                                                                  also written KB |= f
KB contradicts f     M(KB) ∩ M(f) = ∅                             No model satisfies the constraints after
                                                                  adding f; equivalent to KB |= ¬f
f contingent to KB   M(KB) ∩ M(f) ≠ ∅ and                         f does not contradict KB; f adds a
                     M(KB) ∩ M(f) ≠ M(KB)                         non-trivial amount of information to KB

r Model checking – A model checking algorithm takes as input a knowledge base KB and
outputs whether it is satisfiable or not.
Remark: popular model checking algorithms include DPLL and WalkSat.

r Inference rule – An inference rule with premises f1, ..., fk and conclusion g is written:
f1, ..., fk
-----------
g

r Forward inference algorithm – From a set of inference rules Rules, this algorithm goes
through all possible f1, ..., fk and adds g to the knowledge base KB if a matching rule exists.
This process is repeated until no more additions can be made to KB.

r Derivation – We say that KB derives f (written KB ⊢ f) with rules Rules if f already is in
KB or gets added during the forward inference algorithm using the set of rules Rules.

r Properties of inference rules – A set of inference rules Rules can have the following
properties:

Name           Mathematical formulation         Notes
Soundness      {f : KB ⊢ f} ⊆ {f : KB |= f}     Inferred formulas are entailed by KB; can be
                                                checked one rule at a time; "nothing but the truth"
Completeness   {f : KB ⊢ f} ⊇ {f : KB |= f}     Formulas entailed by KB are either already in the
                                                knowledge base or inferred from it; "the whole truth"
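For illustration, here is a brute-force model checking sketch that enumerates all models; real model checkers such as DPLL or WalkSat are far more efficient, and the predicate encoding of formulas used here is an assumption of this example.

# Brute-force model checking by enumeration (toy sketch).
from itertools import product

def models(kb, symbols):
    """Return all models w (dicts symbol -> 0/1) satisfying every formula in kb."""
    out = []
    for values in product([0, 1], repeat=len(symbols)):
        w = dict(zip(symbols, values))
        if all(f(w) for f in kb):
            out.append(w)
    return out

# KB = { A -> B, A }; entailment check: KB |= B iff every model of KB satisfies B.
kb = [lambda w: (not w["A"]) or w["B"], lambda w: w["A"]]
ms = models(kb, ["A", "B"])
print(ms)                       # [{'A': 1, 'B': 1}] -> KB is satisfiable
print(all(w["B"] for w in ms))  # True: KB entails B (modus ponens)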

4.3 Propositional logic

In this section, we will go through logic-based models that use logical formulas and inference
rules. The idea here is to balance expressivity and computational efficiency.

r Horn clause – By noting p1, ..., pk and q propositional symbols, a Horn clause has the form:
(p1 ∧ ... ∧ pk) → q
Remark: when q = false, it is called a "goal clause"; otherwise we call it a "definite clause".

r Modus ponens inference rule – For propositional symbols f1, ..., fk and p, the modus
ponens rule is written:
f1, ..., fk,  (f1 ∧ ... ∧ fk) → p
---------------------------------
p
Remark: it takes linear time to apply this rule, as each application generates a clause that
contains a single propositional symbol.

r Completeness – Modus ponens is complete with respect to Horn clauses if we suppose that
KB contains only Horn clauses and p is an entailed propositional symbol. Applying modus
ponens will then derive p.

r Conjunctive normal form – A conjunctive normal form (CNF) formula is a conjunction of
clauses, where each clause is a disjunction of atomic formulas.
Remark: in other words, CNFs are ∧ of ∨.

r Equivalent representation – Every formula in propositional logic can be written as an
equivalent CNF formula. The table below presents general conversion properties:

Rule name              Initial        Converted
Eliminate ↔            f ↔ g          (f → g) ∧ (g → f)
Eliminate →            f → g          ¬f ∨ g
Eliminate ¬¬           ¬¬f            f
Distribute ¬ over ∧    ¬(f ∧ g)       ¬f ∨ ¬g
Distribute ¬ over ∨    ¬(f ∨ g)       ¬f ∧ ¬g
Distribute ∨ over ∧    f ∨ (g ∧ h)    (f ∨ g) ∧ (f ∨ h)

r Resolution inference rule – For propositional symbols f1, ..., fn and g1, ..., gm, as well as p,
the resolution rule is written:
f1 ∨ ... ∨ fn ∨ p,  ¬p ∨ g1 ∨ ... ∨ gm
--------------------------------------
f1 ∨ ... ∨ fn ∨ g1 ∨ ... ∨ gm
Remark: it can take exponential time to apply this rule, as each application generates a clause
that has a subset of the propositional symbols.

r Resolution-based inference – The resolution-based inference algorithm follows these steps:
• Step 1: Convert all formulas into CNF
• Step 2: Repeatedly apply the resolution rule
• Step 3: Return unsatisfiable if and only if False is derived

4.4 First-order logic

The idea here is that variables yield compact knowledge representations.

r Model – A model w in first-order logic maps:
• constant symbols to objects
• predicate symbols to tuples of objects

r Horn clause – By noting x1, ..., xn variables and a1, ..., ak, b atomic formulas, the first-order
logic version of a Horn clause has the form:
∀x1, ..., ∀xn, (a1 ∧ ... ∧ ak) → b

r Substitution – A substitution θ maps variables to terms, and Subst[θ,f] denotes the result
of substitution θ on f.

r Unification – Unification takes two formulas f and g and returns the most general substitution
θ that makes them equal:
Unify[f,g] = θ s.t. Subst[θ,f] = Subst[θ,g]
Note: Unify[f,g] returns Fail if no such θ exists.

r Modus ponens – By noting x1, ..., xn variables, a1, ..., ak and a'1, ..., a'k atomic formulas, and
by calling θ = Unify(a'1 ∧ ... ∧ a'k, a1 ∧ ... ∧ ak), the first-order logic version of modus ponens can
be written:
a'1, ..., a'k,  ∀x1, ..., ∀xn (a1 ∧ ... ∧ ak) → b
-------------------------------------------------
Subst[θ, b]

r Completeness – Modus ponens is complete for first-order logic with only Horn clauses.

r Resolution rule – By noting f1, ..., fn, g1, ..., gm, p, q formulas and by calling θ = Unify(p,q),
the first-order logic version of the resolution rule can be written:
f1 ∨ ... ∨ fn ∨ p,  ¬q ∨ g1 ∨ ... ∨ gm
--------------------------------------
Subst[θ, f1 ∨ ... ∨ fn ∨ g1 ∨ ... ∨ gm]

r Semi-decidability – First-order logic, even restricted to only Horn clauses, is semi-decidable:
• if KB |= f, forward inference on complete inference rules will prove f in finite time;
• if KB ⊭ f, no algorithm can show this in finite time.


Machine Learning Interview Cheat sheets
Aqeel Anwar
Last Updated: March 2021

This document contains cheat sheets on various topics asked during a Machine Learning/Data
science interview. This document is constantly updated to include more topics.


Table of Contents
Basics of Machine Learning
  1. Bias-Variance Trade-off
  2. Imbalanced Data in Classification
  3. Principal Component Analysis
  4. Bayes' Theorem and Classifier
  5. Regression Analysis
  6. Regularization in ML
  7. Convolutional Neural Network
  8. Famous CNNs
  9. Ensemble Methods in Machine Learning
Behavioral Interview
  1. How to prepare for behavioral interview?
  2. How to answer a behavioral question?
Cheat Sheet – Bias-Variance Tradeoff
What is Bias?
• Error between average model prediction and ground truth
• The bias of the estimated function tells us the capacity of the underlying model to
predict the values
What is Variance?
• Average variability in the model prediction for the given dataset
• The variance of the estimated function tells you how much the function can adjust
to the change in the dataset
High Bias (overly-simplified model): under-fitting; high error on both test and train data.

High Variance (overly-complex model): over-fitting; low error on train data and high error on
test data; starts modelling the noise in the input.

Bias-Variance Trade-off
• Increasing bias reduces variance, and vice-versa
• Error = bias² + variance + irreducible error
• The best model is where the error is minimized
• Compromise between bias and variance
Source: https://www.cheatsheets.aqeel-anwar.com

Cheat Sheet – Imbalanced Data in Classification
Example (figure omitted): datapoints are 90% blue (label 1) and 10% green (label 0).

Accuracy = Correct Predictions / Total Predictions

A classifier that always predicts label blue yields a prediction accuracy of 90%.
Accuracy doesn't always give the correct insight about your trained model.
Metric      What it measures            Explanation                                         Scope
Accuracy    %age correct prediction     Correct predictions over total predictions          One value for the entire network
Precision   Exactness of model          Of the detected cats, how many were actually cats   Each class/label has a value
Recall      Completeness of model       Correctly detected cats over total cats             Each class/label has a value
F1 Score    Combines Precision/Recall   Harmonic mean of Precision and Recall               Each class/label has a value

Performance metrics associated with Class 1 (confusion matrix, figure omitted):
• True Positive (TP): predicted 1, actual label 1 (your prediction is correct)
• False Positive (FP): predicted 1, actual label 0
• True Negative (TN): predicted 0, actual label 0
• False Negative (FN): predicted 0, actual label 1

Precision = TP / (TP + FP)
Recall (Sensitivity, True +ve rate) = TP / (TP + FN)
Specificity = TN / (TN + FP)
False +ve rate = FP / (TN + FP)
Accuracy = (TP + TN) / (TP + FN + FP + TN)
F1 score = 2 × (Prec × Rec) / (Prec + Rec)
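A small sketch computing these metrics from toy confusion-matrix counts:

# Metrics from confusion-matrix counts (toy numbers for illustration).
TP, FP, TN, FN = 30, 10, 50, 10

precision = TP / (TP + FP)                          # 0.75
recall = TP / (TP + FN)                             # 0.75 (sensitivity)
specificity = TN / (TN + FP)                        # ~0.83
accuracy = (TP + TN) / (TP + FP + TN + FN)          # 0.80
f1 = 2 * precision * recall / (precision + recall)  # 0.75

print(precision, recall, specificity, accuracy, f1)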

Possible solutions
1. Data Replication: Replicate the available data until the number of samples in each class are
comparable.
2. Synthetic Data: For images, rotate, dilate, crop, or add noise to existing input images to create
new data.
3. Modified Loss: Modify the loss to reflect greater error when misclassifying the smaller sample
set: loss = a · loss_green + b · loss_blue, with a > b.
4. Change the algorithm: Increase the model/algorithm complexity so that the two classes are
perfectly separable (con: overfitting). For example, no straight line y = ax passing through the
origin can perfectly separate the data (the best such solution is the line y = 0, predicting all
labels blue), whereas a straight line y = ax + b can perfectly separate it, so the green class will
no longer be predicted as blue.

Cheat Sheet – PCA Dimensionality Reduction
What is PCA?
• Based on the dataset find a new set of orthogonal feature vectors in such a way that the
data spread is maximum in the direction of the feature vector (or dimension)
• Rates the feature vector in the decreasing order of data spread (or variance)
• The datapoints have maximum variance in the first feature vector, and minimum variance
in the last feature vector
• The variance of the datapoints in the direction of feature vector can be termed as a
measure of information in that direction.
Steps
1. Standardize the datapoints
2. Find the covariance matrix from the given datapoints
3. Carry out eigen-value decomposition of the covariance matrix
4. Sort the eigenvalues and eigenvectors

Dimensionality Reduction with PCA


• Keep the first m out of n feature vectors rated by PCA. These m vectors will be the best m
vectors preserving the maximum information that could have been preserved with m
vectors on the given dataset
Steps:
1. Carry out steps 1-4 from above
2. Keep first m feature vectors from the sorted eigenvector matrix
3. Transform the data for the new basis (feature vectors)
4. The importance of the feature vector is proportional to the magnitude of the eigen value
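A minimal NumPy sketch of the steps above on random toy data follows (array names and sizes are illustrative assumptions):

# PCA via eigendecomposition of the covariance matrix of standardized data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 datapoints, 3 features

# Step 1: standardize the datapoints
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix
cov = np.cov(Xs, rowvar=False)

# Step 3: eigen-value decomposition (eigh: for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort in decreasing order of eigenvalue, keep the first 2 feature vectors
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]

# Transform the data onto the new basis (dimensionality reduced to 2)
X_reduced = Xs @ components
print(X_reduced.shape)                   # (100, 2)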

Figure 1 (omitted): Datapoints with the feature vectors as the x and y axes.
Figure 2 (omitted): The cartesian coordinate system is rotated to maximize the standard
deviation along any one axis (new feature # 2).
Figure 3 (omitted): Remove the feature vector with minimum standard deviation of datapoints
(new feature # 1) and project the data on new feature # 2.
Cheat Sheet – Bayes Theorem and Classifier
What is Bayes’ Theorem?
• Describes the probability of an event, based on prior knowledge of conditions that might be
related to the event.

P(A B)
• How the probability of an event changes when
we have knowledge of another event Posterior
Probability
P(A) P(A B)
Usually a better
estimate than P(A)
Bayes’ Theorem
Example
• Probability of fire P(F) = 1%
• Probability of smoke P(S) = 10%
Likelihood P(A) Evidence
• Prob of smoke given there is a fire P(S F) = 90%
• What is the probability that there is a fire given P(B A) Prior P(B)
we see a smoke P(F S)? Probability
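Plugging the numbers of the example into Bayes' theorem gives the answer:
P(F|S) = P(S|F) · P(F) / P(S) = (0.9 × 0.01) / 0.1 = 0.09,
i.e. a 9% chance of fire given that we see smoke.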

Maximum A Posteriori (MAP) Estimation
The MAP estimate of the random variable y, given that we have observed iid (x1, x2, x3, ...), is
the value of y that maximizes the product of the prior and the likelihood. We thereby accommodate
our prior knowledge when estimating:
ŷ_MAP = argmax_y P(y) · Π_i P(xi | y)

Maximum Likelihood Estimation (MLE)
The ML estimate of the random variable y, given that we have observed iid (x1, x2, x3, ...), is
the value of y that maximizes only the likelihood. We assume we don't have any prior knowledge
of the quantity being estimated:
ŷ_MLE = argmax_y Π_i P(xi | y)

MLE is a special case of MAP where our prior is uniform (all values are equally likely).

Naïve Bayes' Classifier (Instantiation of MAP as classifier)
Suppose we have two classes, y = y1 and y = y2, and more than one evidence/feature (x1, x2,
x3, ...). Using Bayes' theorem:
P(y | x1, ..., xn) ∝ P(y) · P(x1, ..., xn | y)
The naïve Bayes' classifier assumes the features (x1, x2, x3, ...) are conditionally independent
given the class, i.e.
P(x1, ..., xn | y) = Π_i P(xi | y)
Cheat Sheet – Regression Analysis
What is Regression Analysis?
Fitting a function f(.) to datapoints yi=f(xi) under some error function. Based on the estimated
function and error, we have the following types of regression
1. Linear Regression:
Fits a line minimizing the sum of mean-squared error
for each datapoint.
2. Polynomial Regression:
Fits a polynomial of order k (k+1 unknowns) minimizing
the sum of mean-squared error for each datapoint.
3. Bayesian Regression:
For each datapoint, fits a gaussian distribution by
minimizing the mean-squared error. As the number of
data points xi increases, it converges to point estimates.
4. Ridge Regression:
Can fit either a line, or polynomial minimizing the sum
of mean-squared error for each datapoint and the
weighted L2 norm of the function parameters beta.
5. LASSO Regression:
Can fit either a line or a polynomial, minimizing the
sum of mean-squared error for each datapoint and the
weighted L1 norm of the function parameters beta.
6. Logistic Regression (NOT regression, but classification):
Can fit either a line, or polynomial with sigmoid
activation minimizing the sum of mean-squared error for
each datapoint. The labels y are binary class labels.
Visual Representation (figures omitted): Linear Regression, Polynomial Regression, Bayesian
Linear Regression, and Logistic Regression (separating Label 0 from Label 1).

Summary:

Regression        What does it fit?
Linear            A line in n dimensions
Polynomial        A polynomial of order k
Bayesian Linear   Gaussian distribution for each point
Ridge             Linear/polynomial
LASSO             Linear/polynomial
Logistic          Linear/polynomial with sigmoid

(The estimated-function and error-function columns of the original table are equation images and
are omitted here.)
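An illustrative scikit-learn sketch of several of the regression variants above (assuming scikit-learn is available; hyperparameters are arbitrary):

# Comparing regression variants with scikit-learn (toy data).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 1.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=100)   # noisy quadratic

X_poly = PolynomialFeatures(degree=2).fit_transform(X)     # polynomial regression
print(LinearRegression().fit(X_poly, y).coef_)
print(Ridge(alpha=1.0).fit(X_poly, y).coef_)   # + weighted L2 norm of beta
print(Lasso(alpha=0.1).fit(X_poly, y).coef_)   # + weighted L1 norm of beta

# Logistic regression: classification with a sigmoid, not regression
labels = (X[:, 0] > 0).astype(int)
print(LogisticRegression().fit(X, labels).coef_)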

Cheat Sheet – Regularization in ML
What is Regularization in ML?
• Regularization is an approach to address over-fitting in ML.
• Overfitted model fails to generalize estimations on test data
• When the underlying model to be learned is low bias/high
variance, or when we have small amount of data, the
estimated model is prone to over-fitting.
• Regularization reduces the variance of the model
Types of Regularization: Figure 1. Overfitting
1. Modify the loss function (see the code sketch after this section):
• L2 Regularization: Prevents the weights from getting too large (defined by the L2 norm). The
larger the weights, the more complex the model and the higher the chance of overfitting.

• L1 Regularization: Prevents the weights from getting too large (defined by the L1 norm). L1
regularization introduces sparsity in the weights: it forces more weights to be exactly zero,
rather than just reducing the average magnitude of all weights.

• Entropy: Used for models that output a probability. Forces the probability distribution
towards the uniform distribution.

2. Modify data sampling:


• Data augmentation: Create more data from available data by randomly cropping, dilating,
rotating, adding small amount of noise etc.
• K-fold Cross-validation: Divide the data into k groups. Train on (k-1) groups and test on 1
group. Try all k possible combinations.

3. Change training approach:


• Injecting noise: Add random noise to the weights when they are being learned. It pushes the
model to be relatively insensitive to small variations in the weights, hence regularization
• Dropout: Generally used for neural networks. Connections between consecutive layers are
randomly dropped based on a dropout-ratio and the remaining network is trained in the
current iteration. In the next iteration, another set of random connections are dropped.
Figure 2 (omitted): 5-fold cross-validation; each of the 5 folds takes a turn as the test set while
the remaining folds are used for training.
Figure 3 (omitted): Dropout; of the original network's 16 connections, a dropout-ratio of 30%
leaves 11 connections (70%) active in each iteration.
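A minimal NumPy sketch of how the L2 and L1 penalties above modify a mean-squared-error loss (the data and lam value are toy assumptions):

# L2 (ridge) and L1 (lasso) penalties added to an MSE loss (illustrative).
import numpy as np

def loss_l2(w, X, y, lam=0.1):
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.sum(w ** 2)       # penalize large weights

def loss_l1(w, X, y, lam=0.1):
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.sum(np.abs(w))    # encourages sparsity

rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(50, 3)), np.array([1.0, 0.0, -2.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)
print(loss_l2(true_w, X, y), loss_l1(true_w, X, y))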


Cheat Sheet – Famous CNNs
AlexNet – 2012
Why: AlexNet was born out of the need to improve the results of
the ImageNet challenge.
What: The network consists of 5 Convolutional (CONV) layers and 3
Fully Connected (FC) layers. The activation used is the Rectified
Linear Unit (ReLU).
How: Data augmentation is carried out to reduce over-fitting. Uses Local Response
Normalization.

VGGNet – 2014
Why: VGGNet was born out of the need to reduce the # of
parameters in the CONV layers and improve on training time
What: There are multiple variants of VGGNet (VGG16, VGG19, etc.)
How: The important point to note here is that all the conv kernels are
of size 3x3 and maxpool kernels are of size 2x2 with a stride of two.

ResNet – 2015
Why: Neural Networks are notorious for not being able to find a
simpler mapping when it exists. ResNet solves that.
What: There are multiple versions of ResNetXX architectures where
‘XX’ denotes the number of layers. The most used ones are ResNet50
and ResNet101. Since the vanishing gradient problem was taken care of
(more about it in the How part), CNN started to get deeper and deeper
How: The ResNet architecture makes use of shortcut connections to solve
the vanishing gradient problem. The basic building block of ResNet is
a Residual block that is repeated throughout the network.
Figure 1 (omitted): ResNet block; two weight layers compute f(x), and a shortcut connection
adds the input x so the block outputs f(x) + x.
Figure 2 (omitted): Inception block; parallel 1x1 conv, 3x3 conv, 5x5 conv and max-pooling
branches (with 1x1 convolutions for dimensionality reduction) applied to the previous layer and
merged by filter concatenation.


Inception – 2014
Why: Larger kernels are preferred for more global features; on the other
hand, smaller kernels provide good results in detecting area-specific
features. For effective recognition of such a variable-sized feature, we
need kernels of different sizes. That is what Inception does.
What: The Inception network architecture consists of several inception
modules of the following structure. Each inception module consists of
four operations in parallel, 1x1 conv layer, 3x3 conv layer, 5x5 conv
layer, max pooling
How: Inception increases the network space from which the best
network is to be chosen via training. Each inception module can
capture salient features at different levels.

Cheat Sheet – Convolutional Neural Network
Convolutional Neural Network:
The data gets into the CNN through the input layer and passes
through various hidden layers before getting to the output layer.
The output of the network is compared to the actual labels in
terms of loss or error. The partial derivatives of this loss w.r.t the
trainable weights are calculated, and the weights are updated
through one of the various methods using backpropagation.

CNN Template:
Most of the commonly used hidden layers (not all) follow a
pattern
1. Layer function: Basic transforming function such as a convolutional or fully connected layer.
a. Fully Connected: Linear function between the input and the output.
b. Convolutional Layers: These layers are applied to 2D (3D) input feature maps. The trainable
weights are a 2D (3D) kernel/filter that moves across the input feature map, generating dot
products with the overlapping region of the input feature map.
c. Transposed Convolutional (DeConvolutional) Layer: Usually used to increase the size of the
output feature map (upsampling). The idea behind the transposed convolutional layer is to
undo (not exactly) the convolutional layer.
Figure (omitted): a fully connected layer computes outputs such as y1 = w11·x1 + w21·x2 +
w31·x3 + b1 from the input nodes; a convolutional layer slides a kernel over the input map to
produce the output map.

2. Pooling: Non-trainable layer to change the size of the feature map


a. Max/Average Pooling: Decrease the spatial size of the input layer based on
selecting the maximum/average value in receptive field defined by the kernel
b. UnPooling: A non-trainable layer used to increase the spatial size of the input
layer based on placing the input pixel at a certain index in the receptive field
of the output defined by the kernel.
3. Normalization: Usually used just before the activation functions to limit the
unbounded activation from increasing the output layer values too high
a. Local Response Normalization LRN: A non-trainable layer that square-normalizes the pixel values in a feature map
within a local neighborhood.
b. Batch Normalization: A trainable approach to normalizing the data by learning scale and shift variable during training.
4. Activation: Introduces non-linearity so the CNN can efficiently map non-linear complex
mappings.
a. Non-parametric/static functions: Linear, ReLU
b. Parametric functions: ELU, tanh, sigmoid, Leaky ReLU
c. Bounded functions: tanh, sigmoid

5. Loss function: Quantifies how far off the CNN prediction is from the actual labels.
a. Regression loss functions: MAE, MSE, Huber loss
b. Classification loss functions: Cross entropy, Hinge loss

Loss definitions (plots omitted):
• MSE: mse = (x − x̂)²
• MAE: mae = |x − x̂|
• Huber loss: ½(x − x̂)² if |x − x̂| < δ, and δ|x − x̂| − ½δ² otherwise (plotted with δ = 1.9)
• Hinge loss: max(0, 1 − x̂) if x = 1, and max(0, 1 + x̂) if x = −1
• Cross entropy loss: −y log(p) − (1 − y) log(1 − p)

Cheat Sheet – Ensemble Learning in ML
What is Ensemble Learning? Wisdom of the crowd
Combine multiple weak models/learners into one predictive model to reduce bias, variance and/or improve accuracy.

Types of Ensemble Learning (N weak learners):


1. Bagging: Trains N different weak models (usually of the same type – homogenous) with N
non-overlapping subsets of the input dataset, in parallel. In the test phase, each model is evaluated,
and the label with the greatest number of predictions is selected as the prediction. Bagging
methods reduce the variance of the prediction.

2. Boosting: Trains N different weak models (usually of the same type – homogenous) with the
complete dataset, in sequential order. The datapoints wrongly classified by the previous weak
model are given more weight so that they can be classified properly by the next weak learner.
In the test phase, each model is evaluated and, based on the test error of each weak model, the
prediction is weighted for voting. Boosting methods decrease the bias of the prediction.

3. Stacking: Trains N different weak models (usually of different types – heterogenous) with one
of two subsets of the dataset, in parallel. Once the weak learners are trained, they are used to
train a meta-learner that combines their predictions and carries out the final prediction using
the other subset. In the test phase, each model predicts its label; this set of labels is fed to the
meta-learner, which generates the final prediction.

The step-by-step procedures and the comparison table for each of these three methods are given below.
Ensemble Method – Bagging:
• Step 1: Create N subsets from the original dataset, one for each weak model.
• Step 2: Train each weak model with an independent subset, in parallel.
• Step 3: In the test phase, predict with each weak model and vote their predictions to get the
final prediction.

Ensemble Method – Boosting:
• Step 1: Assign equal weights to all the datapoints in the dataset.
• Step 2a: Train a weak model with equal weights for all the datapoints.
• Step 2b: Based on the final error of the trained weak model, calculate a scalar alpha. Use
alpha to increase the weights of wrongly classified points and decrease the weights of correctly
classified points.
• Step 3a: Train the next weak model with the adjusted weights on all the datapoints in the
dataset, and repeat the alpha calculation and weight adjustment after each weak model.
• Final step: In the test phase, predict with each weak model and vote their predictions, weighted
by the corresponding alpha, to get the final prediction.

Ensemble Method – Stacking:
• Step 1: Create 2 subsets from the original dataset, one for training the weak models and one
for the meta-learner.
• Step 2: Train each weak model with the weak-learner subset.
• Step 3: Train a meta-learner whose inputs are the outputs of the weak models on the
meta-learner subset.
• Step 4: In the test phase, feed the input to the weak models, collect their outputs and feed
them to the meta-learner. The output of the meta-learner is the final prediction.

Parameter                         Bagging             Boosting          Stacking
Focuses on                        Reducing variance   Reducing bias     Improving accuracy
Nature of weak learners           Homogenous          Homogenous        Heterogenous
Weak learners are aggregated by   Simple voting       Weighted voting   Learned voting (meta-learner)

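A hedged scikit-learn sketch of the three methods (the specific estimator choices are illustrative assumptions, not from the sheet):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                                  RandomForestClassifier, StackingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10).fit(X, y)   # variance down
    boosting = AdaBoostClassifier(n_estimators=10).fit(X, y)                           # bias down
    stacking = StackingClassifier(                                                     # accuracy up
        estimators=[("tree", DecisionTreeClassifier()), ("rf", RandomForestClassifier())],
        final_estimator=LogisticRegression(max_iter=1000),   # the meta-learner
    ).fit(X, y)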
Source: https://www.cheatsheets.aqeel-anwar.com
How to prepare for a behavioral interview? (1/4)
Collect stories, assign keywords, practice the STAR format

Keywords
List important keywords that will be populated with your personal stories. The most common keywords are given below:
Negotiation · Creativity · Conflict Resolution · Flexibility · Convincing · Compromise to achieve a goal
Handling Crisis · Challenging Situation · Another team's priorities not aligned · Working with difficult people · Adjusting to a colleague's style · Taking a Stand
Handling negative feedback · Coworkers' view of you · Working with a deadline · Your strength · Your weakness · Influencing Others
Handling failure · Handling unexpected situations · Conflict Resolution · Converting a challenge into an opportunity · Mentorship/Leadership · Decision without enough data
Stories
1. List all the organizations you have been a part of. For example:
   1. Academia: BSc, MSc, PhD
   2. Industry: Jobs, Internships
   3. Societies: Cultural, Technical, Sports
2. Think of stories from step 1 that fall into one of the keyword categories. The more stories the better; you should have at least 10-15 stories.
3. Create a summary table by assigning multiple keywords to each story. This will help you filter the stories when a question is asked in the interview. An example can be seen below.
   Story 1: [Convincing] [Take Stand] [Influence Others]
   Story 2: [Mentorship] [Leadership]
   Story 3: [Conflict Resolution] [Negotiation]
   Story 4: [Decision without enough data]

STAR Format
Write down the stories in the STAR format as explained in part 2/4 of this cheat sheet. This will help you practice organizing a story in a meaningful way.

Icon Source: www.flaticon.com
Source: https://www.cheatsheets.aqeel-anwar.com
How to prepare for a behavioral interview? (2/4)
Direct*, meaningful*, personalized*, logical*
*(Respective colors are used to identify these characteristics in the example)

Example: "Tell us about a time when you had to convince senior executives"

S: Situation
Explain the situation and provide the necessary context for your story.
"I worked as an intern at XYZ company in the summer of 2019. The project details provided to me were elaborate. After some initial brainstorming and research, I realized that the project approach could be modified to make it more efficient in terms of the underlying KPIs. I decided to talk to my manager about it."

T: Task
Explain the task and your responsibility in the situation.
"I had an hour-long call with my manager and explained to him in detail the proposed approach and how it could improve the KPIs. I was able to convince him. He asked me if I would be able to present my proposed approach for approval in front of the higher executives, and I agreed. I was working out of the ABC (city) office, and the executives needed to fly in from the XYZ (city) office."

A: Action
Walk through the steps and actions you took to address the issue.
"I did a quick background check on the executives to learn more about their areas of expertise, so that I could tailor my argument accordingly. I prepared an elaborate 15-slide presentation, starting by explaining their approach, moving on to my proposed approach, and finally comparing the two on preliminary results."

R: Result
State the outcome and the result of your actions.
"After some active discussion, we were able to establish that the proposed approach was better than the initial one. The executives proposed a few small changes to my approach and really appreciated my stand. At the end of my internship, I was selected among the 3 out of 68 interns who got to meet the senior vice president of the company over lunch."

Icon Source: www.flaticon.com
Source: https://www.cheatsheets.aqeel-anwar.com
How to answer a behavioral question? (3/4)
Understand, Extract, Map, Select and Apply

Example: "Tell us about a time when you had to convince senior executives"

Understand: Understand the question.
Example: A story where I was able to convince my seniors. Maybe they had something in mind, and I had a better approach and tried to convince them.

Extract: Extract useful keywords and tags that encapsulate the gist of the question.
Example: [Convincing], [Creative], [Leadership]

Map: Map the keywords to your stories. Shortlist all the stories that fall under the keywords extracted in the previous step.
Example: Story1, Story2, Story3, Story4, …, StoryN

Select: From the shortlisted stories, pick the one that best fits the question and has not been used so far in the interview.
Example: Story3

Apply: Apply the STAR method to the selected story to answer the question.
Example: See cheat sheet 2/4 for details.

Icon Source: www.flaticon.com
Source: https://www.cheatsheets.aqeel-anwar.com
Behavioral Interview Cheat Sheet (4/4)
Summarizing the behavioral interview

How to prepare for the interview:
1. Gather important topics as keywords: understand and collect all the important topics commonly asked in the interview.
2. Collect your stories: based on all the organizations you have been a part of, think of all the stories that fall under the keywords above.
3. Practice stories in the STAR format: practice each story using the STAR format; you will have to answer questions following this format.
4. Assign keywords to stories: assign each of your stories one or more keywords. This will help you recall them quickly.
5. Create a summary table: create a summary table mapping stories to their associated keywords. This will be used during the behavioral questions.

How to answer a question during the interview:
U: Understand the question and clarify any confusion that you have.
E: Extract one or more of the keywords from the question.
M: Map the keywords to stories. Based on the keywords extracted, find the stories using the summary table created during preparation (step 5).
S: Select a story. Since each keyword may be assigned to multiple stories, select the one that is most relevant and has not been used.
A: Apply the STAR format. Once the story has been shortlisted, apply the STAR format to the story to answer the question.

Icon Source: www.flaticon.com
Source: https://www.cheatsheets.aqeel-anwar.com
Python For Data Science Cheat Sheet: Keras
Learn Python for data science interactively at www.DataCamp.com

Keras is a powerful and easy-to-use deep learning library for Theano and TensorFlow that provides a high-level neural networks API to develop and evaluate deep learning models.

A Basic Example
>>> import numpy as np
>>> from keras.models import Sequential
>>> from keras.layers import Dense
>>> data = np.random.random((1000,100))
>>> labels = np.random.randint(2, size=(1000,1))
>>> model = Sequential()
>>> model.add(Dense(32, activation='relu', input_dim=100))
>>> model.add(Dense(1, activation='sigmoid'))
>>> model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
>>> model.fit(data, labels, epochs=10, batch_size=32)
>>> predictions = model.predict(data)

Data (also see NumPy, Pandas & Scikit-Learn)
Your data needs to be stored as NumPy arrays or as a list of NumPy arrays. Ideally, you split the data into training and test sets, for which you can also resort to the train_test_split function of sklearn.model_selection.

Keras Data Sets
>>> from keras.datasets import boston_housing, mnist, cifar10, imdb
>>> (x_train,y_train),(x_test,y_test) = mnist.load_data()
>>> (x_train2,y_train2),(x_test2,y_test2) = boston_housing.load_data()
>>> (x_train3,y_train3),(x_test3,y_test3) = cifar10.load_data()
>>> (x_train4,y_train4),(x_test4,y_test4) = imdb.load_data(num_words=20000)
>>> num_classes = 10

Other
>>> from urllib.request import urlopen
>>> data = np.loadtxt(urlopen("http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"), delimiter=",")
>>> X = data[:,0:8]
>>> y = data[:,8]

Preprocessing (also see NumPy & Scikit-Learn)
Sequence Padding
>>> from keras.preprocessing import sequence
>>> x_train4 = sequence.pad_sequences(x_train4, maxlen=80)
>>> x_test4 = sequence.pad_sequences(x_test4, maxlen=80)

One-Hot Encoding
>>> from keras.utils import to_categorical
>>> Y_train = to_categorical(y_train, num_classes)
>>> Y_test = to_categorical(y_test, num_classes)
>>> Y_train3 = to_categorical(y_train3, num_classes)
>>> Y_test3 = to_categorical(y_test3, num_classes)

Train and Test Sets
>>> from sklearn.model_selection import train_test_split
>>> X_train5, X_test5, y_train5, y_test5 = train_test_split(X, y, test_size=0.33, random_state=42)

Standardization/Normalization
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(x_train2)
>>> standardized_X = scaler.transform(x_train2)
>>> standardized_X_test = scaler.transform(x_test2)

Model Architecture
Sequential Model
>>> from keras.models import Sequential
>>> model = Sequential()
>>> model2 = Sequential()
>>> model3 = Sequential()

Multilayer Perceptron (MLP): Binary Classification
>>> from keras.layers import Dense
>>> model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu'))
>>> model.add(Dense(8, kernel_initializer='uniform', activation='relu'))
>>> model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))

MLP: Multi-Class Classification
>>> from keras.layers import Dropout
>>> model.add(Dense(512, activation='relu', input_shape=(784,)))
>>> model.add(Dropout(0.2))
>>> model.add(Dense(512, activation='relu'))
>>> model.add(Dropout(0.2))
>>> model.add(Dense(10, activation='softmax'))

MLP: Regression
>>> model.add(Dense(64, activation='relu', input_dim=train_data.shape[1]))
>>> model.add(Dense(1))

Convolutional Neural Network (CNN)
>>> from keras.layers import Activation, Conv2D, MaxPooling2D, Flatten
>>> model2.add(Conv2D(32, (3,3), padding='same', input_shape=x_train.shape[1:]))
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(32, (3,3)))
>>> model2.add(Activation('relu'))
>>> model2.add(MaxPooling2D(pool_size=(2,2)))
>>> model2.add(Dropout(0.25))
>>> model2.add(Conv2D(64, (3,3), padding='same'))
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(64, (3,3)))
>>> model2.add(Activation('relu'))
>>> model2.add(MaxPooling2D(pool_size=(2,2)))
>>> model2.add(Dropout(0.25))
>>> model2.add(Flatten())
>>> model2.add(Dense(512))
>>> model2.add(Activation('relu'))
>>> model2.add(Dropout(0.5))
>>> model2.add(Dense(num_classes))
>>> model2.add(Activation('softmax'))

Recurrent Neural Network (RNN)
>>> from keras.layers import Embedding, LSTM
>>> model3.add(Embedding(20000, 128))
>>> model3.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
>>> model3.add(Dense(1, activation='sigmoid'))

Inspect Model
>>> model.output_shape     # Model output shape
>>> model.summary()        # Model summary representation
>>> model.get_config()     # Model configuration
>>> model.get_weights()    # List all weight tensors in the model

Compile Model
MLP: Binary Classification
>>> model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
MLP: Multi-Class Classification
>>> model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
MLP: Regression
>>> model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
Recurrent Neural Network
>>> model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Model Training
>>> model3.fit(x_train4, y_train4, batch_size=32, epochs=15, verbose=1, validation_data=(x_test4, y_test4))

Evaluate Your Model's Performance
>>> score = model3.evaluate(x_test4, y_test4, batch_size=32)

Prediction
>>> model3.predict(x_test4, batch_size=32)
>>> model3.predict_classes(x_test4, batch_size=32)

Save/Reload Models
>>> from keras.models import load_model
>>> model3.save('model_file.h5')
>>> my_model = load_model('model_file.h5')

Model Fine-tuning
Optimization Parameters
>>> from keras.optimizers import RMSprop
>>> opt = RMSprop(lr=0.0001, decay=1e-6)
>>> model2.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

Early Stopping
>>> from keras.callbacks import EarlyStopping
>>> early_stopping_monitor = EarlyStopping(patience=2)
>>> model3.fit(x_train4, y_train4, batch_size=32, epochs=15, validation_data=(x_test4, y_test4), callbacks=[early_stopping_monitor])

DataCamp: Learn Python for Data Science Interactively
Matplotlib Cheat Sheet (Version 3.5.0)

Quick start

    import numpy as np
    import matplotlib as mpl
    import matplotlib.pyplot as plt

    X = np.linspace(0, 2*np.pi, 100)
    Y = np.cos(X)

    fig, ax = plt.subplots()
    ax.plot(X, Y, color='green')

    fig.savefig("figure.pdf")
    fig.show()

Basic plots
    plot([X], Y, [fmt], ...)           X, Y, fmt, color, marker, linestyle
    scatter(X, Y, ...)                 X, Y, [s]izes, [c]olors, marker, cmap
    bar[h](x, height, ...)             x, height, width, bottom, align, color
    imshow(Z, ...)                     Z, cmap, interpolation, extent, origin
    contour[f]([X], [Y], Z, ...)       X, Y, Z, levels, colors, extent, origin
    pcolormesh([X], [Y], Z, ...)       X, Y, Z, vmin, vmax, cmap
    quiver([X], [Y], U, V, ...)        X, Y, U, V, C, units, angles
    pie(X, ...)                        Z, explode, labels, colors, radius
    text(x, y, text, ...)              x, y, text, va, ha, size, weight, transform
    fill[_between][x](...)             X, Y1, Y2, color, where

Advanced plots
    step(X, Y, [fmt], ...)             X, Y, fmt, color, marker, where
    boxplot(X, ...)                    notch, sym, bootstrap, widths
    errorbar(X, Y, xerr, yerr, ...)    X, Y, xerr, yerr, fmt
    hist(X, bins, ...)                 X, bins, range, density, weights
    violinplot(D, ...)                 D, positions, widths, vert
    barbs([X], [Y], U, V, ...)         X, Y, U, V, C, length, pivot, sizes
    eventplot(positions, ...)          positions, orientation, lineoffsets
    hexbin(X, Y, C, ...)               X, Y, C, gridsize, bins

Anatomy of a figure
    [Diagram in the original sheet labeling:] Figure, Axes, Title, Legend, Grid, Spines, Line (line plot), Markers (scatter plot), major/minor ticks and tick labels, X axis label, Y axis label.

Scales: ax.set_[xy]scale(scale, ...)
    linear (any values), log (values > 0), symlog (any values), logit (0 < values < 1)

Projections: subplot(..., projection=p)
    p='polar'; p='3d'; p=Orthographic() with from cartopy.crs import Orthographic

Lines
    linestyle or ls: "-"  ":"  "--"  "-."  (0, (0.01, 2))
    capstyle or dash_capstyle: "butt"  "round"  "projecting"

Markers
    '.' 'o' 's' 'P' 'X' '*' 'p' 'D' '<' '>' '^' 'v' '1' '2' '3' '4' '+' 'x' '|' '_'
    markevery: 10, [0, -1], (25, 5), [0, 25, -1]

Colors
    'Cn' (cycle colors C0-C9); single letters b, g, r, c, m, y, k, w; named colors such as 'DarkRed' or 'Firebrick'; (R,G,B[,A]) tuples; '#RRGGBB[AA]' hex strings; 'x.y' grayscale fractions. (The full named-color chart from the original sheet is not reproduced here.)

Colormaps: plt.get_cmap(name)
    Uniform: viridis, plasma, inferno, magma, cividis
    Sequential: Greys, Purples, Blues, Greens, Oranges, Reds, YlOrBr, YlOrRd, OrRd, PuRd, RdPu, BuPu, GnBu, PuBu, YlGnBu, PuBuGn, BuGn, YlGn, Wistia
    Diverging: PiYG, PRGn, BrBG, PuOr, RdGy, RdBu, RdYlBu, RdYlGn, Spectral, coolwarm, bwr, seismic
    Qualitative: Pastel1, Pastel2, Paired, Accent, Dark2, Set1, Set2, Set3, tab10, tab20, tab20b, tab20c
    Cyclic: twilight. Miscellaneous: terrain, ocean, cubehelix, rainbow

Tick locators: ax.[xy]axis.set_[minor|major]_locator(locator)

    from matplotlib import ticker
    ticker.NullLocator()
    ticker.MultipleLocator(0.5)
    ticker.FixedLocator([0, 1, 5])
    ticker.LinearLocator(numticks=3)
    ticker.IndexLocator(base=0.5, offset=0.25)
    ticker.AutoLocator()
    ticker.MaxNLocator(n=4)
    ticker.LogLocator(base=10, numticks=15)

Tick formatters: ax.[xy]axis.set_[minor|major]_formatter(formatter)

    from matplotlib import ticker
    ticker.NullFormatter()
    ticker.FixedFormatter(['zero', 'one', 'two', ...])
    ticker.FuncFormatter(lambda x, pos: "[%.2f]" % x)
    ticker.FormatStrFormatter('>%d<')
    ticker.ScalarFormatter()
    ticker.StrMethodFormatter('{x}')
    ticker.PercentFormatter(xmax=5)

Ornaments
    ax.legend(...)      handles, labels, loc, title, frameon
    ax.colorbar(...)    mappable, ax, cax, orientation
    ax.annotate(...)    text, xy, xytext, xycoords, textcoords, arrowprops

Subplots layout
    fig, axs = plt.subplots(3, 3); subplot[s](rows, cols, ...)
    G = gridspec(rows, cols, ...); ax = G[0, :]
    ax.inset_axes(extent)
    d = make_axes_locatable(ax); ax = d.new_horizontal('10%')

Animation

    import matplotlib.animation as mpla

    T = np.linspace(0, 2*np.pi, 100)
    S = np.sin(T)
    line, = plt.plot(T, S)

    def animate(i):
        line.set_ydata(np.sin(T + i/50))

    anim = mpla.FuncAnimation(plt.gcf(), animate, interval=5)
    plt.show()

Styles: plt.style.use(style)
    default, classic, grayscale, ggplot, seaborn, fast, bmh, Solarize_Light2, seaborn-notebook

Quick reminder
    ax.grid()
    ax.set_[xy]lim(vmin, vmax)
    ax.set_[xy]label(label)
    ax.set_[xy]ticks(ticks, [labels])
    ax.set_[xy]ticklabels(labels)
    ax.set_title(title)
    ax.tick_params(width=10, ...)
    ax.set_axis_[on|off]()
    fig.suptitle(title)
    fig.tight_layout()
    plt.gcf(), plt.gca()
    mpl.rc('axes', linewidth=1, ...)
    [fig|ax].patch.set_alpha(0)
    text = r'$\frac{-e^{i\pi}}{2^n}$'

Keyboard shortcuts
    ctrl + s: Save          ctrl + w: Close plot
    r: Reset view           f: Fullscreen 0/1
    f: View forward         b: View back
    p: Pan view             o: Zoom to rect
    x: X pan/zoom           y: Y pan/zoom
    g: Minor grid 0/1       G: Major grid 0/1
    l: X axis log/linear    L: Y axis log/linear

Ten simple rules (READ)
    1. Know your audience  2. Identify your message  3. Adapt the figure  4. Captions are not optional  5. Do not trust the defaults  6. Use color effectively  7. Do not mislead the reader  8. Avoid "chartjunk"  9. Message trumps beauty  10. Get the right tool

Getting help
    matplotlib.org • github.com/matplotlib/matplotlib/issues • discourse.matplotlib.org • stackoverflow.com/questions/tagged/matplotlib • gitter.im/matplotlib • twitter.com/matplotlib • Matplotlib users mailing list

Event handling

    fig, ax = plt.subplots()

    def on_click(event):
        print(event)

    fig.canvas.mpl_connect('button_press_event', on_click)

Axes adjustments: plt.subplots_adjust(...)
    left, right, top, bottom, wspace, hspace

Extent & origin: ax.imshow(extent=..., origin=...)
    origin="upper" places (0,0) at the top left; origin="lower" places it at the bottom left.
    extent=[x0, x1, y0, y1] maps the image onto data coordinates (e.g. [0,10,0,5], or flipped with [10,0,0,5]).

Text alignments: ax.text(..., ha=..., va=...)
    ha: left, center, right; va: top, center, baseline, bottom

Text parameters: ax.text(..., family=..., size=..., weight=...) or ax.text(..., fontproperties=...)
    size: xx-large (1.73), x-large (1.44), large (1.20), medium (1.00), small (0.83), x-small (0.69), xx-small (0.58)
    weight: black (900), bold (700), semibold (600), normal (400), ultralight (100)
    family: monospace, serif, sans, cursive; style: italic, normal; variant: small-caps

Legend placement: ax.legend(loc="string", bbox_to_anchor=(x, y))
    Inside codes: 2 upper left, 9 upper center, 1 upper right, 6 center left, 10 center, 7 center right, 3 lower left, 8 lower center, 4 lower right.
    Outside anchors: A upper right / (-0.1, 0.9); B center right / (-0.1, 0.5); C lower right / (-0.1, 0.1); D upper left / (0.1, -0.1); E upper center / (0.5, -0.1); F upper right / (0.9, -0.1); G lower left / (1.1, 0.1); H center left / (1.1, 0.5); I upper left / (1.1, 0.9); J lower right / (0.9, 1.1); K lower center / (0.5, 1.1); L lower left / (0.1, 1.1).
    Legend keywords: handle, label, handlelength, handletextpad, labelspacing, borderpad, columnspacing, borderaxespad, numpoints or scatterpoints, markerfacecolor (mfc), markeredgecolor (mec).

How do I …
    … resize a figure? → fig.set_size_inches(w, h)
    … save a figure? → fig.savefig("figure.pdf")
    … save a transparent figure? → fig.savefig("figure.pdf", transparent=True)
    … clear a figure/an axes? → fig.clear() / ax.clear()
    … close all figures? → plt.close("all")
    … remove ticks? → ax.set_[xy]ticks([])
    … remove tick labels? → ax.set_[xy]ticklabels([])
    … rotate tick labels? → ax.tick_params(axis="x", rotation=90)
    … hide top spine? → ax.spines['top'].set_visible(False)
    … hide legend border? → ax.legend(frameon=False)
    … show error as shaded region? → ax.fill_between(X, Y + error, Y - error)
    … draw a rectangle? → ax.add_patch(plt.Rectangle((0, 0), 1, 1))
    … draw a vertical line? → ax.axvline(x=0.5)
    … draw outside frame? → ax.plot(..., clip_on=False)
    … use transparency? → ax.plot(..., alpha=0.25)
    … convert an RGB image into a gray image? → gray = 0.2989*R + 0.5870*G + 0.1140*B
    … set figure background color? → fig.patch.set_facecolor("grey")
    … get a reversed colormap? → plt.get_cmap("viridis_r")
    … get a discrete colormap? → plt.get_cmap("viridis", 10)
    … show a figure for one second? → fig.show(block=False), time.sleep(1)

Annotation connection styles (arrowprops connectionstyle)
    arc3 (rad=0 or rad=0.3); angle3 (angleA=0, angleB=90); angle (angleA=-90, angleB=180, rad=0 or rad=25); arc (angleA=-90, angleB=0, armA=0, armB=40, rad=0)

Annotation arrow styles (arrowprops arrowstyle)
    '-'  '<-'  '->'  '<->'  '<|-'  '-|>'  '<|-|>'  ']-'  '-['  ']-['  '|-|'  ']->'  '<-['  simple  fancy  wedge

Image interpolation
    none, nearest, bilinear, bicubic, spline16, spline36, hanning, hamming, hermite, kaiser, quadric, catrom, gaussian, bessel, mitchell, sinc, lanczos

Performance tips
    scatter(X, Y)                              slow
    plot(X, Y, marker="o", ls="")              fast
    for i in range(n): plot(X[i])              slow
    plot(sum([x + [None] for x in X], []))     fast
    cla(), imshow(...), canvas.draw()          slow
    im.set_data(...), canvas.draw()            fast

Beyond Matplotlib
    Seaborn: Statistical data visualization • Cartopy: Geospatial data processing • yt: Volumetric data visualization • mpld3: Bringing Matplotlib to the browser • Datashader: Large data processing pipeline • plotnine: A grammar of graphics for Python

Matplotlib Cheatsheets • Copyright (c) 2021 Matplotlib Development Team • Released under a CC-BY 4.0 International License
randomForestSRC CHEAT SHEET

Basics
randomForestSRC is a fast OpenMP and memory-efficient package for fitting random forests (RF) for univariate, multivariate, unsupervised, survival, competing risks, class imbalanced classification and quantile regression.

A basic grow call is of the form:
    rfsrc(formula, data, ntree, mtry, nodesize)
Grow your RF through rfsrc: specify your model in formula, provide your data frame in data, and tune your model via ntree, mtry and nodesize.

Specify a formula
    Survival                  rfsrc(Surv(time, status) ~ ., data = veteran)
    Competing Risk            rfsrc(Surv(time, status) ~ ., data = wihs)
    Regression                rfsrc(Ozone ~ ., data = airquality)
    Quantile Regression       quantreg(mpg ~ ., data = mtcars)
    Classification            rfsrc(Species ~ ., data = iris)
    Imbalanced Two-Class      imbalanced(status ~ ., data = breast)
    Multivariate Regression   rfsrc(Multivar(mpg, cyl) ~ ., data = mtcars)
    Mixed Regression          rfsrc(cbind(Species, Sepal.Length) ~ ., data = iris)
    MV Quantile Regression    quantreg(cbind(mpg, cyl) ~ ., data = mtcars)
    MV Mixed Quantile         quantreg(cbind(Species, Sepal.Length) ~ ., data = iris)
    Unsupervised              rfsrc(data = mtcars)
    sidClustering             sidClustering(data = mtcars)
    Breiman (Shi-Horvath)     sidClustering(data = mtcars, method = "sh")

Clean up and impute data
Choose your variables in formula and grow a tree:
    o <- rfsrc(y ~ a + z, data = dta, ntree = 1)
Your outcome(s) will be saved in o$y and your predictors in o$x, taken from dta without missing values. To impute your data, use:
    o <- impute(y ~ a + z, data = dta)
    o <- rfsrc(y ~ a + z, data = dta, na.action = "na.impute")

Tune mtry and nodesize
tune finds the optimal mtry and nodesize tuning parameters for a random forest using out-of-bag (OOB) error:
    o <- tune(quality ~ ., wine)
    > o$optimal
    nodesize mtry
           1    5
tune.nodesize finds the optimal nodesize alone.

Grow
Convenient interface for growing a CART tree:
    rfsrc.cart(formula, data, ntree = 1, mtry = ncol(data), bootstrap = "none")

Fast OpenMP parallel computing of random forests:
    rfsrc(formula, data, ntree = 500,
          mtry = NULL, ytry = NULL,
          nodesize = NULL, nodedepth = NULL,
          splitrule = NULL, nsplit = 10,
          importance = c(FALSE, TRUE, "none", "permute", "random", "anti"),
          ensemble = c("all", "oob", "inbag"),
          bootstrap = c("by.root", "none", "by.user"),
          samptype = c("swor", "swr"), samp = NULL, membership = FALSE,
          na.action = c("na.omit", "na.impute"), nimpute = 1,
          ntime = 250, cause,
          proximity = FALSE, distance = FALSE,
          forest.wt = FALSE, xvar.wt = NULL, yvar.wt = NULL, split.wt = NULL,
          case.wt = NULL, forest = TRUE,
          var.used = c(FALSE, "all.trees", "by.tree"),
          split.depth = c(FALSE, "all.trees", "by.tree"),
          seed = NULL, do.trace = FALSE, statistics = FALSE, ...)

Specialized interfaces
    rfsrc.fast        Fast approximate random forests using subsampling, with forest options set to encourage computational speed
    rfsrc.anonymous   Random forests carefully modified so as not to save the original training data when sharing
    synthetic         Synthetic random forest using synthetic features
    imbalanced        Solutions to the two-class imbalanced problem
    quantreg          Univariate or multivariate quantile regression forest; returns conditional quantile and density values
    sidClustering     Clustering of unsupervised data

Inference from the Forest
Ensemble predicted values for the training data:
    o <- rfsrc(Ozone ~ ., data = airquality)
Inbag and out-of-bag (OOB) predicted values for the training dataset are in o$predicted and o$predicted.oob.

Other ensemble values for the training data:
• For classification problems, $class and $class.oob hold the class labels.
• For survival problems, $survival and $survival.oob hold the survival function, $chf and $chf.oob the cumulative hazard function, and $cif and $cif.oob the cumulative incidence function.

Prediction error for assessing model performance:
    o <- rfsrc(Species ~ ., data = iris, block.size = 1)
o$err.rate returns the tree cumulative OOB error rate; print(o) lists the OOB error rate at the bottom; plot(o) plots the OOB error rate against the number of trees; get.auc(y, prob) obtains the AUC (area under the ROC curve); get.mv.error obtains the error rate from a multivariate random forest.

Predict on new data
    o.pred <- predict(object = o, newdata)
Predicted values for the new dataset are in o.pred$predicted. get.mv.predicted returns the predicted values for a multivariate regression analysis.

Restore
Restoration using the predict function makes it possible to acquire information from the grown forest without the computational expense of regrowing a new forest. Examples (extract proximity, variable splitting behavior, and performance over specific trees):
    o <- rfsrc(Ozone ~ ., data = airquality)
    predict(o, proximity = TRUE)$proximity
    predict(o, var.used = "by.tree")$var.used
    predict(o, get.tree = 10:15)$err.rate

Visualization
    plot.survival         plots various survival estimates
    plot.competing.risk   plots summary curves from a competing risk analysis
    plot.quantreg         plots quantiles obtained from a quantile regression forest

Tree visualization: get.tree extracts a single tree from a forest and plots it in your browser:
    mtcars.unspv <- rfsrc(data = mtcars)
    plot(get.tree(mtcars.unspv, 5))

Split statistics: stat.split acquires split statistic information; the end-cut preference (ECP) splitting property can be plotted.

Variable Selection and Hunting
    var.select(formula, data, method)   Variable selection or hunting, by setting method:
        md        Minimal depth (default)
        vh        Variable hunting
        vh.vimp   Variable hunting with VIMP

Variable Importance (VIMP)
    o <- rfsrc(Species ~ ., iris, importance = TRUE)
or
    obj <- rfsrc(Species ~ ., data = iris)
    o <- vimp(obj)
o$importance returns permutation VIMP, and plot(o) plots VIMP when importance is set to "permute" or TRUE in rfsrc, or when using vimp.

subsample subsamples forests for VIMP confidence intervals; plot.subsample plots the subsampled VIMP confidence intervals:
    o <- rfsrc(mpg ~ ., mtcars)
    smp.o <- subsample(o, B = 25, subratio = .5)
    plot.subsample(smp.o)

holdout.vimp calculates hold-out VIMP from the error rate of blocks of trees grown with and without a variable. get.mv.vimp returns VIMP from a multivariate random forest.

Minimal Depth
max.subtree extracts minimal depth and maximal subtree information, used for variable selection and for identifying interactions between variables.

Partial Plot
Marginal effect plot:
    plot.variable(o, xvar.names)
Partial dependence plot:
    plot.variable(o, xvar.names, partial = TRUE)
get.partial.plot.data is a handy function that parses the output of partial.rfsrc into a format suitable for plots.

Set surv.type for survival analysis:
    mort         Mortality
    rel.freq     Relative frequency of mortality
    surv         Predicted survival, at the time point specified using time
    years.lost   The expected number of life years lost
    cif          The cumulative incidence function
    chf          The cumulative hazard function

CC by Min Lu • Learn more with the randomForestSRC homepage • randomForestSRC version 2.12.0 • Updated: 2021-08
Large Language Models : : CHEAT SHEET
Large Language Models (LLMs)
LLMs are artificial intelligence models that can generate human-like text, based on patterns found in massive amounts of training data. They are used in applications such as language translation, chatbots, and content creation.

Some popular LLMs
Some popular LLMs include GPT-3 (Generative Pre-trained Transformer 3) by OpenAI, BERT (Bidirectional Encoder Representations from Transformers) by Google, and XLNet by Carnegie Mellon University and Google.

How are LLMs trained?
LLMs are trained using a process called unsupervised learning. This involves feeding the model massive amounts of text data, such as books, articles, and websites, and having the model learn the patterns and relationships between words and phrases in the text. The model is then fine-tuned on a specific task, such as language translation or text summarization.

Fine-Tuning
Fine-tuning is the process of training a pre-trained large language model on a specific task using a smaller dataset. This allows the model to learn task-specific features and improve its performance. The fine-tuning process typically involves freezing the weights of the pre-trained model and only training the task-specific layers. When fine-tuning a model, it is important to consider factors such as the size of the fine-tuning dataset, the choice of optimizer and learning rate, and the choice of evaluation metrics.
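A hedged sketch of the freeze-then-train recipe just described, using the Hugging Face Transformers library (the model name and the bert attribute are illustrative assumptions, not prescribed by this sheet):

    # Sketch: freeze the pre-trained encoder, train only the task-specific head.
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    for param in model.bert.parameters():        # freeze the pre-trained weights
        param.requires_grad = False
    # Only the classification head remains trainable:
    print([name for name, p in model.named_parameters() if p.requires_grad])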
Preprocessing
Text normalization is the process of converting text to a standard format, such as lowercasing all text, removing special characters, and converting numbers to their written form.

Tokenization is the process of breaking down text into individual units, such as words or phrases. This is an important step in preparing text data for NLP tasks.

Stop words are common words that are usually removed during text processing, as they do not carry much meaning and can introduce noise or affect the results of NLP tasks. Examples of stop words include "the," "a," "an," "in," and "is."

Lemmatization is the process of reducing words to their base or dictionary form, taking into account their part of speech and context. It is a more sophisticated technique than stemming and produces more accurate results, but it is computationally more expensive.

Stemming and lemmatization both reduce words to their base form. This helps to reduce the dimensionality of the data and improve the performance of models. A code sketch of these steps follows.
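A minimal NLTK sketch of the preprocessing steps above (my own illustration; it assumes the punkt, stopwords and wordnet resources have been downloaded via nltk.download):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    text = "The children were running in the parks."
    tokens = [t for t in nltk.word_tokenize(text.lower()) if t.isalpha()]   # normalize + tokenize
    content = [t for t in tokens if t not in stopwords.words("english")]    # drop stop words
    print([PorterStemmer().stem(t) for t in content])                       # stemming
    print([WordNetLemmatizer().lemmatize(t, pos="v") for t in content])     # lemmatization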
Choose between LLMs
When comparing different models, it is important to consider their architecture, the size of the model, the amount of training data used, and their performance on specific NLP tasks.

Components of LLMs
LLMs typically consist of an encoder, a decoder, and attention mechanisms. The encoder takes in input text and converts it into a set of hidden representations, while the decoder generates the output text. The attention mechanisms help the model focus on the most relevant parts of the input text.

Example of fine-tuning LLMs
• Model cost: $500-$5000 per month, depending on the size and complexity of the language model.
• GPU size: NVIDIA GeForce RTX 3080 or higher.
• Number of GPUs: 1-4, depending on the size of the language model and the desired speed of fine-tuning. For example, fine-tuning the GPT-3 model, one of the largest language models available, would require a minimum of 4 GPUs.
• The size of the data that GPT-3 is fine-tuned on can vary greatly depending on the specific use case and the size of the model itself. GPT-3 has over 175 billion parameters, so it typically requires a large amount of data for fine-tuning to see a noticeable improvement in performance. Note: fine-tuning GPT-3 on a small dataset of only a few gigabytes may not result in a significant improvement, while fine-tuning on a much larger dataset of several terabytes could result in a substantial improvement. The size of the fine-tuning data will also depend on the specific NLP task the model is being fine-tuned for and the desired level of accuracy.
This is just one example; actual costs and GPU specifications may vary depending on the language model, the data it is being fine-tuned on, and other factors. It is always best to check with the language model provider for the latest information and specific recommendations for fine-tuning.

Applications of LLMs
• LLMs are used in a wide range of applications, including language translation, chatbots, content creation, and text summarization.
• They can also be used to improve search engines, voice assistants, and virtual assistants.

Input Representations
• Word embeddings: Each token is replaced by a vector that represents its meaning in a continuous vector space. Common methods for word embeddings include Word2Vec, GloVe, and fastText.
• Subword embeddings: Each token is broken down into smaller subword units (e.g., characters or character n-grams), and each subword is replaced by a vector that represents its meaning. This approach can handle out-of-vocabulary (OOV) words and can improve the model's ability to capture morphological and semantic similarities. Common methods for subword embeddings include Byte Pair Encoding (BPE), Unigram Language Model (ULM), and SentencePiece.
• Positional encodings: Since LLMs operate on sequences of tokens, they need a way to encode the position of each token in the sequence. Positional encodings are vectors that are added to the word or subword embeddings to provide information about the position of each token; a sketch follows this list.
• Segment embeddings: In some LLMs, such as BERT, the input sequence can be divided into multiple segments (e.g., sentences or paragraphs). Segment embeddings are added to the word or subword embeddings to indicate which segment each token belongs to.
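A hedged NumPy sketch of sinusoidal positional encodings, the scheme used by the original Transformer and one common way to implement the positional encodings described above:

    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]                        # token positions (seq_len, 1)
        i = np.arange(d_model)[None, :]                          # embedding dims (1, d_model)
        angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])                    # even dimensions: sine
        pe[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dimensions: cosine
        return pe                                                # added to the token embeddings

    print(positional_encoding(4, 8).shape)                       # (4, 8)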
Attention Mechanisms
Self-Attention:
• A mechanism that allows a sequence to weigh the importance of all other elements in the sequence when computing a representation for each element.
• Can capture relationships between different elements in the sequence, making it well-suited for tasks that require modeling long-range dependencies.
• Popularized by the Transformer model.

Multi-Head Attention:
• A variation of self-attention in which the attention mechanism is applied multiple times in parallel, with different sets of weights.
• Allows the model to attend to different aspects of the input sequence, improving its ability to capture complex patterns and dependencies.
• Each "head" produces a separate output; the outputs are concatenated and linearly transformed to produce the final output.
• Also commonly used in the Transformer model.
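A hedged NumPy sketch of single-head scaled dot-product self-attention (the weight matrices are random stand-ins); multi-head attention runs several of these in parallel and concatenates the results:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])    # how strongly each token attends to every other
        return softmax(scores) @ V                 # weighted sum of value vectors

    d = 8
    X = np.random.randn(5, d)                      # 5 tokens, d-dimensional embeddings
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 8)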
Evaluating LLMs
• Accuracy measures the proportion of correctly classified instances out of all instances. This metric is commonly used for text classification tasks such as sentiment analysis, where the goal is to correctly classify a text as positive, negative, or neutral.
• F1-score takes into account both precision and recall. Precision is the proportion of true positive results out of all predicted positive results, while recall is the proportion of true positive results out of all actual positive results. The F1-score is the harmonic mean of precision and recall, and it provides a balanced measure of model performance on text classification, question answering, and other tasks.
• Perplexity measures how well a language model predicts the next word in a sequence: the lower the perplexity, the better the model. Perplexity is calculated as 2 to the power of the cross-entropy, which measures how well the model's predicted probabilities match the true probabilities of the next word in the sequence. Generated text can also be evaluated with metrics such as BLEU or ROUGE scores.
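A minimal sketch of that perplexity definition (next_word_probs holds the model's probabilities for the true next word at each step):

    import numpy as np

    def perplexity(next_word_probs):
        cross_entropy = -np.mean(np.log2(next_word_probs))   # cross-entropy in bits
        return 2 ** cross_entropy

    print(perplexity([0.25, 0.5, 0.125]))                    # 4.0; a perfect model scores 1.0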
Example - ChatGPT
ChatGPT is a large language model developed by OpenAI that has been trained on a diverse range of internet text, including articles, forums, and social media. This training allows ChatGPT to generate human-like text and answer a wide variety of questions on many topics.

Use Cases of ChatGPT
Question Answering: ChatGPT can answer factual questions based on the information it has been trained on. Example:
Human: What is the capital of France?
ChatGPT: The capital of France is Paris.

Conversational: ChatGPT can engage in a conversation with a user. Example:
Human: Hi, how are you today?
ChatGPT: Hello! I'm just an AI, so I don't have emotions, but I'm functioning well today. How can I assist you?

Text Generation: ChatGPT can generate text based on a prompt or seed text. Example:
Human: Write a short story about a magical world.
ChatGPT: Once upon a time, in a land far away, there was a magical world filled with wonder and enchantment. The sky was always a brilliant shade of purple, and the stars twinkled like diamonds in the sky. The people of this world possessed magical powers that they used to create incredible wonders and to help those in need. One day, a young woman discovered that she too had magical powers, and set out on an incredible journey to explore the wonders of this magical world and to hone her skills.

Summarization: ChatGPT can summarize a long text or article. Example:
Human: Please summarize this article about the benefits of exercise.
ChatGPT: The article discusses the various physical and mental health benefits of exercise, including improved cardiovascular health, stronger muscles and bones, better weight management, and reduced stress and anxiety. It also highlights the importance of finding an exercise routine that is enjoyable and sustainable.

Translation: ChatGPT can translate text from one language to another. Example:
Human: Translate this sentence from English to Spanish: "The sky is blue."
ChatGPT: El cielo es azul.

Challenges and limitations with LLMs
• One of the main challenges with LLMs is the potential for biased or offensive language, as the models learn from the patterns found in the training data.
• Ethical concerns, such as gender and racial biases.
• The amount of computational resources needed to train and run LLMs, which can be expensive and energy-intensive.
• Handling out-of-vocabulary words.
• Improving interpretability. While large language models have shown impressive performance on a variety of NLP tasks, they may not perform as well on specific tasks, such as those that require a deeper understanding of the underlying context.

Tools & Libraries that support LLMs
a. Popular NLP libraries, such as TensorFlow, PyTorch, spaCy, Hugging Face Transformers, AllenNLP, the OpenAI GPT-3 API, AllenAI's ELMo, and spaCy Transformers, provide tools for working with large language models. These libraries allow for easy fine-tuning and deployment of models.
b. Some large language models, such as GPT-3, provide APIs for access to their models. This can simplify the process of integrating a large language model into a real-world application.

Future of LLMs
The future of LLMs is promising, with ongoing research focused on improving their accuracy, reducing bias, and making them more accessible and energy-efficient. As the demand for AI-driven applications continues to grow, LLMs will play an increasingly important role in shaping the future of human-machine interaction.
Transformer Architecture • GPT Architecture • BERT Architecture
[The original sheet devotes this page to architecture diagrams of the Transformer, GPT, and BERT; the diagrams do not survive text extraction.]
Ashish Patel • Principal Research Scientist • ashishpatel.ce.2011@gmail.com Abonia Sojasingarayar • Machine Learning Scientist • aboniaa@gmail.com Updated: 2023-02