
MACHINE LEARNING - FOUNDATIONS

REVISION (WEEK 2)

IIT Madras Online Degree


Topics

• Continuity
• Differentiability
• Linear Approximation
• Higher order approximations
• Multivariate Linear Approximation
• Directional Derivative

Continuity

A function 𝑓(𝑥) is continuous at 𝑥 = 𝑎 if:

$$f(a) = \lim_{x \to a^-} f(x) = \lim_{x \to a^+} f(x)$$
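A quick numerical sanity check of this definition in Python (a small sketch with a hypothetical piecewise function g, not taken from the slides): approach the point from both sides and compare with the function value.

import numpy as np

# Hypothetical piecewise function: continuous at x = 1, but it jumps at x = 0.
def g(x):
    return np.where(x < 0, x - 1.0, x**2)

for a in (1.0, 0.0):
    left  = g(a - 1e-8)    # approximates the left-hand limit
    right = g(a + 1e-8)    # approximates the right-hand limit
    print(a, bool(np.isclose(left, g(a)) and np.isclose(right, g(a))))  # True, False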
Differentiability

A function 𝑓(𝑥) is differentiable at 𝑥 = 𝑎 if:

$$f'(a) = \lim_{x \to a^-} \frac{f(x) - f(a)}{x - a} = \lim_{x \to a^+} \frac{f(x) - f(a)}{x - a}$$
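As a hedged Python sketch (my own illustration): compare the two one-sided difference quotients numerically. For f(x) = |x| at a = 0 they disagree, so f is not differentiable there; for f(x) = x² they agree.

def one_sided_slopes(f, a, h=1e-6):
    # Approximate the left and right limits of (f(x) - f(a)) / (x - a).
    left  = (f(a - h) - f(a)) / (-h)
    right = (f(a + h) - f(a)) / (+h)
    return left, right

print(one_sided_slopes(abs, 0.0))            # (-1.0, 1.0)  -> not differentiable at 0
print(one_sided_slopes(lambda x: x**2, 0.0)) # (~0, ~0)     -> differentiable at 0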
Linear Approximation

The linear approximation 𝐿(𝑥) of a function 𝑓(𝑥) at point 𝑎 is given by:

$$L(x) = f(a) + f'(a)(x - a)$$

This is indeed the equation of a tangent line:

$$y - y_1 = m(x - x_1)$$
$$y = y_1 + m(x - x_1)$$

If $x_1 = a$, $y_1 = f(a)$ and $m = f'(a)$, we get

$$y = f(a) + f'(a)(x - a)$$
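A small Python sketch (my own example, not from the slides): the linear approximation of f(x) = √x at a = 4, compared with the true values near a.

import numpy as np

f      = np.sqrt
fprime = lambda x: 0.5 / np.sqrt(x)        # f'(x) for f(x) = sqrt(x)
a      = 4.0

L = lambda x: f(a) + fprime(a) * (x - a)   # L(x) = f(a) + f'(a)(x - a)

for x in (3.9, 4.1, 4.5):
    print(x, f(x), L(x), abs(f(x) - L(x))) # the error grows as x moves away from a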
Higher order approximations

Linear Approximation

$$L(x) = f(a) + f'(a)(x - a)$$

Quadratic Approximation

$$L(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2}(x - a)^2$$

Higher-order Approximations

$$L(x) = f(a) + f^{(1)}(a)(x - a) + \frac{f^{(2)}(a)}{2}(x - a)^2 + \frac{f^{(3)}(a)}{3 \cdot 2}(x - a)^3 + \frac{f^{(4)}(a)}{4 \cdot 3 \cdot 2}(x - a)^4 + \dots$$
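A hedged Python sketch (my own example): linear, quadratic, and fourth-order approximations of f(x) = eˣ around a = 0, built from the pattern above (the k-th term is f⁽ᵏ⁾(a)(x − a)ᵏ / k!).

import math

def taylor_approx(x, a=0.0, order=1):
    # For f(x) = e^x every derivative at a equals e^a, so the k-th term is e^a (x - a)^k / k!.
    return sum(math.exp(a) * (x - a)**k / math.factorial(k) for k in range(order + 1))

x = 0.5
for order in (1, 2, 4):
    approx = taylor_approx(x, order=order)
    print(order, approx, abs(math.exp(x) - approx))  # the error shrinks as the order grows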
Multivariate linear approximation: Linear approximation of functions involving multiple variables

The linear approximation of a function 𝑓 of two variables 𝑥 and 𝑦 in the neighborhood of (𝑎, 𝑏) is:

$$L(x, y) = f(a, b) + \frac{\partial f}{\partial x}(a, b)\,(x - a) + \frac{\partial f}{\partial y}(a, b)\,(y - b)$$
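A minimal Python sketch (my own example, assuming f(x, y) = x²y) of this two-variable linear approximation around (a, b) = (1, 2).

def f(x, y):   return x**2 * y     # example function
def f_x(x, y): return 2 * x * y    # partial derivative with respect to x
def f_y(x, y): return x**2         # partial derivative with respect to y

a, b = 1.0, 2.0
def L(x, y):
    return f(a, b) + f_x(a, b) * (x - a) + f_y(a, b) * (y - b)

print(f(1.1, 2.1), L(1.1, 2.1))    # true value 2.541 vs. linear approximation 2.5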
DIRECTIONAL DERIVATIVES
Directional Derivative

• $f_x(x, y) = \frac{\partial f}{\partial x}(x, y)$ = rate of change of 𝑓 as we vary 𝑥 (keeping 𝑦 fixed).
• $f_y(x, y) = \frac{\partial f}{\partial y}(x, y)$ = rate of change of 𝑓 as we vary 𝑦 (keeping 𝑥 fixed).
• Directional derivative of 𝑓(𝑥, 𝑦) = rate of change of 𝑓 if we allow both 𝑥 and 𝑦 to change simultaneously (in some direction $\vec{u}$).

$$D_{\vec{u}} f(x, y) = \nabla f \cdot \vec{u} = \left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right] \cdot [u_1, u_2] = \frac{\partial f}{\partial x} u_1 + \frac{\partial f}{\partial y} u_2$$

The directional derivative can be viewed as a weighted sum of the partial derivatives, with weights given by the components of the direction vector $\vec{u}$.
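A hedged Python sketch (my own example, with f(x, y) = x²y and the unit direction u = (1/√2, 1/√2)) comparing ∇f · u with a finite-difference estimate of the rate of change along u.

import numpy as np

def f(p):                 # f(x, y) = x^2 * y
    x, y = p
    return x**2 * y

def grad_f(p):            # [df/dx, df/dy] = [2xy, x^2]
    x, y = p
    return np.array([2 * x * y, x**2])

p = np.array([1.0, 2.0])
u = np.array([1.0, 1.0]) / np.sqrt(2)        # unit direction vector

D_u = grad_f(p) @ u                          # directional derivative = grad(f) . u
h = 1e-6
finite_diff = (f(p + h * u) - f(p)) / h      # numerical rate of change along u
print(D_u, finite_diff)                      # both approx (4 + 1) / sqrt(2) = 3.5355...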
WEEK 3: REVISION
1. Four Fundamental Subspaces
2. Orthogonal Vectors and Subspaces
3. Projections
4. Least Squares and Projections onto a Subspace
5. Example of Least Squares

Suppose A is an m × n matrix.

1. The column space is C(A), a subspace of ℝ𝑚 .


2. The row space is C(𝐀𝑇 ), a subspace of ℝ𝑛 .
3. The nullspace is N(A), a subspace of ℝ𝑛 .
4. The left nullspace is N(𝐀𝑇 ), a subspace of ℝ𝑚 .

▪ Find the condition on (𝑏1 , 𝑏2 , 𝑏3 ) for 𝐴𝑥 = 𝑏 to be solvable, if

$$\mathrm{rank}(A) + \mathrm{nullity}(A) = n \qquad\qquad \dim(C(A^T)) + \dim(N(A^T)) = m$$
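A quick numerical check of these dimension counts in Python (an illustrative matrix of my own, since the slide's example matrix isn't reproduced here).

import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])               # an illustrative 2 x 3 matrix of rank 1
m, n = A.shape

rank         = np.linalg.matrix_rank(A)       # dim C(A) = dim C(A^T)
nullity      = null_space(A).shape[1]         # dim N(A)
left_nullity = null_space(A.T).shape[1]       # dim N(A^T)

print(rank + nullity == n)                    # rank(A) + nullity(A) = n
print(rank + left_nullity == m)               # dim C(A^T) + dim N(A^T) = m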
▪ The projection p of b onto a line through a:

$$p = \frac{a^T b}{a^T a}\, a$$

Projection matrix of vector 𝑎:

$$P = \frac{a a^T}{a^T a}$$
▪ Projection matrix of $a = \begin{bmatrix} -1 \\ 3 \\ -2 \\ 1 \end{bmatrix}$
Projection of $b = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 1 \end{bmatrix}$ onto $a$:
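A Python sketch of the computation with the vectors above, using the rank-one projection matrix P = aaᵀ / aᵀa (my own verification of the numbers, not necessarily the slide's worked layout).

import numpy as np

a = np.array([-1.0, 3.0, -2.0, 1.0])
b = np.array([ 1.0, 0.0,  1.0, 1.0])

P = np.outer(a, a) / (a @ a)     # projection matrix onto the line through a (a.a = 15)
p = P @ b                        # projection of b onto a

print(p)                                          # equals (a.b / a.a) * a = (-2/15) * a
print(np.allclose(p, (a @ b) / (a @ a) * a))      # True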
▪ It often happens that Ax = b has no solution.
▪ The usual reason is: too many equations.
▪ The matrix A has more rows than columns.
▪ There are more equations than unknowns (m is greater than n).
▪ Then the columns span only a small part of m-dimensional space.
▪ We cannot always get the error e = b − Ax down to zero. When e is zero, x is an exact solution to Ax = b.
▪ When the length of e is as small as possible, $\hat{x}$ is a least squares solution.

▪ Least Squares method: solving $A^T A \hat{x} = A^T b$, we get $\hat{x}$.
$$A^T A \hat{x} = A^T b$$

Solving this we get,

Best fit line: y = 0.7x + 0.7
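A Python sketch of this computation. The slide's data table isn't reproduced in this text, so the four points below are my own illustrative choice, (0,1), (1,1), (2,2), (3,3); they happen to give the same best-fit line y = 0.7x + 0.7.

import numpy as np

# Hypothetical data points (x_i, y_i); the slide's own table is not reproduced here.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 1.0, 2.0, 3.0])

A = np.column_stack([x, np.ones_like(x)])    # model y = C*x + D, unknowns [C, D]
x_hat = np.linalg.solve(A.T @ A, A.T @ y)    # normal equations: A^T A x_hat = A^T y

print(x_hat)                                 # [0.7, 0.7]  ->  y = 0.7x + 0.7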
Machine Learning Foundations
Week 4

Machine Learning Foundations
Week-5 Revision

Arun Prakash A
Complex vectors

$x \in \mathbb{C}^n, \quad y \in \mathbb{C}^n$

$$x = \begin{bmatrix} 3 - 2i \\ -2 + i \\ -4 - 3i \end{bmatrix} \in \mathbb{C}^3 \qquad y = \begin{bmatrix} -2 + 4i \\ 5 - i \\ -2i \end{bmatrix} \in \mathbb{C}^3$$

Operations:

Addition: $z = x + y \in \mathbb{C}^n$

$$z = \begin{bmatrix} 1 + 2i \\ 3 \\ -4 - 5i \end{bmatrix} \in \mathbb{C}^3$$

Conjugate:

$$\bar{z} = \begin{bmatrix} 1 - 2i \\ 3 \\ -4 + 5i \end{bmatrix}$$
Inner Product: $x \cdot y = \bar{x}^T y \in \mathbb{C}$

With the vectors above, $\bar{x}^T = [\, 3 + 2i, \; -2 - i, \; -4 + 3i \,]$, so

$$x \cdot y = \begin{bmatrix} 3 + 2i & -2 - i & -4 + 3i \end{bmatrix} \begin{bmatrix} -2 + 4i \\ 5 - i \\ -2i \end{bmatrix} = -19 + 13i$$

Properties
1. $x \cdot y = \overline{y \cdot x}$
2. $(x + y) \cdot z = x \cdot z + y \cdot z$
3. $x \cdot (cy) = c\,(x \cdot y)$
4. $(cx) \cdot y = \bar{c}\,(x \cdot y)$
5. $x \cdot x = \|x\|^2$

$cx \cdot cy = |c|\,(x \cdot y)$, true or false?
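A minimal numpy check of this value and of properties 1 and 5 (a sketch; np.vdot conjugates its first argument, matching the convention x · y = x̄ᵀy).

import numpy as np

x = np.array([3 - 2j, -2 + 1j, -4 - 3j])
y = np.array([-2 + 4j, 5 - 1j, 0 - 2j])

print(np.vdot(x, y))                                          # (-19+13j)
print(np.isclose(np.vdot(x, y), np.conj(np.vdot(y, x))))      # property 1
print(np.isclose(np.vdot(x, x).real, np.linalg.norm(x)**2))   # property 5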
Complex Matrices

$$A = \begin{bmatrix} 2 & 3 - 3i \\ 3 + 3i & 5 \end{bmatrix} \qquad B = \begin{bmatrix} 2i & 3 + 3i \\ 3 + 3i & 5i \end{bmatrix}$$

Hermitian if: $A^* = \bar{A}^T = A$. (With this notation, the inner product is $x \cdot y = x^* y \in \mathbb{C}$.)

$A = \begin{bmatrix} 2 & 3 - 3i \\ 3 + 3i & 5 \end{bmatrix}$ is Hermitian.

$B = \begin{bmatrix} 2i & 3 + 3i \\ 3 + 3i & 5i \end{bmatrix}$ is not Hermitian.

$C = \begin{bmatrix} 2i & 3 + 3i \\ 3 - 3i & 5i \end{bmatrix}$ is not Hermitian (its diagonal entries are not real).
Properties of Hermitian Matrices
1. All eigenvalues $\lambda_i$ are real.
2. Eigenvectors are orthogonal if $\lambda_i \neq \lambda_j$ for $i \neq j$.

Finding complex eigenvectors:

Consider the matrix $A = \begin{bmatrix} 2 & 3 - 3i \\ 3 + 3i & 5 \end{bmatrix}$. Find the complex eigenvector for the eigenvalue $\lambda = 8$.

We need the nullspace $N[A - \lambda I]$, where

$$A - \lambda I = \begin{bmatrix} -6 & 3 - 3i \\ 3 + 3i & -3 \end{bmatrix}$$

The row operation $R_2 \to R_2 + \tfrac{1}{2}(1 + i) R_1$ gives

$$\begin{bmatrix} -6 & 3 - 3i \\ 0 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

$$-6 x_1 + (3 - 3i) x_2 = 0 \;\Rightarrow\; -2 x_1 + (1 - i) x_2 = 0 \;\Rightarrow\; 2 x_1 = (1 - i) x_2$$

$$\therefore\; x = c \begin{bmatrix} 1 \\ 1 + i \end{bmatrix}, \quad \text{e.g. } x = (1 - i) \begin{bmatrix} 1 \\ 1 + i \end{bmatrix} = \begin{bmatrix} 1 - i \\ 2 \end{bmatrix}$$
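A numpy check of this example (a sketch): A equals its conjugate transpose, its eigenvalues are real, and [1, 1+i]ᵀ is an eigenvector for λ = 8.

import numpy as np

A = np.array([[2, 3 - 3j],
              [3 + 3j, 5]])

print(np.allclose(A, A.conj().T))     # Hermitian: A equals its conjugate transpose
print(np.linalg.eigvalsh(A))          # real eigenvalues: [-1., 8.]

x = np.array([1, 1 + 1j])
print(np.allclose(A @ x, 8 * x))      # [1, 1+i] is an eigenvector for lambda = 8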
Unitary Matrices

Real case: $Q^T Q = I$. Complex case: $U^* U = I$.

Example (a real rotation matrix):

$$U = \begin{bmatrix} \cos(t) & -\sin(t) \\ \sin(t) & \cos(t) \end{bmatrix} \qquad U^T = \begin{bmatrix} \cos(t) & \sin(t) \\ -\sin(t) & \cos(t) \end{bmatrix}$$

$$U U^T = \begin{bmatrix} \cos^2(t) + \sin^2(t) & \cos(t)\sin(t) - \sin(t)\cos(t) \\ \sin(t)\cos(t) - \cos(t)\sin(t) & \sin^2(t) + \cos^2(t) \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = I$$

Properties:
• A unitary matrix preserves the lengths and angles of vectors!
• Therefore, its eigenvalues satisfy $|\lambda_i| = 1$.
• Eigenvectors are orthogonal if $\lambda_i \neq \lambda_j$ for $i \neq j$.
• Unitary matrices need not be Hermitian.
• There exists a unitary matrix that diagonalizes a Hermitian matrix.
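A quick numpy check of these claims for the rotation matrix above (a sketch with t = 0.7): U Uᵀ = I, lengths are preserved, and both eigenvalues have modulus 1.

import numpy as np

t = 0.7
U = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])

print(np.allclose(U @ U.T, np.eye(2)))            # U U^T = I

v = np.array([3.0, -4.0])
print(np.linalg.norm(U @ v), np.linalg.norm(v))   # lengths preserved: both 5.0

print(np.abs(np.linalg.eigvals(U)))               # |lambda_i| = 1 for both eigenvalues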
Diagonalization of Hermitian Matrices

Schur's Theorem: Any $n \times n$ matrix $A$ is similar, via a unitary matrix $U$, to an upper triangular matrix $T$; that is, $A = U T U^*$.

Example:

$$A = \begin{bmatrix} 5 & 7 \\ -2 & -4 \end{bmatrix}, \quad \lambda_1 = -2, \;\; \lambda_2 = 3, \quad x_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \;\; x_2 = \begin{bmatrix} 7 \\ -2 \end{bmatrix}$$

$$U = \begin{bmatrix} 1 & 7 \\ -1 & -2 \end{bmatrix}$$

If we do $U^* A U$, will it be a triangular or diagonal matrix?
No. Why? Because $U$ is not an orthonormal matrix!
So how do we find an orthogonal matrix?
Gram-Schmidt process:

$$A = \begin{bmatrix} 5 & 7 \\ -2 & -4 \end{bmatrix}, \quad \lambda_1 = -2, \;\; \lambda_2 = 3, \quad x_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \;\; x_2 = \begin{bmatrix} 7 \\ -2 \end{bmatrix}$$

Find a vector orthogonal to $x_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$ (you could have picked $x_2$ as well):

$$q_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \qquad q_2 = x_2 - (x_2 \cdot q_1)\,\frac{q_1}{q_1 \cdot q_1} = \begin{bmatrix} 2.5 \\ 2.5 \end{bmatrix}$$

Normalizing the two columns gives

$$U = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}$$

If we do $U^* A U$, will it be a triangular or diagonal matrix?

$$U^* A U = \begin{bmatrix} -2 & 9 \\ 0 & 3 \end{bmatrix}$$

It is upper triangular. What if we had considered a different vector instead of $x_2$ during the orthogonalization?
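A numpy check of this Schur step (a sketch): with U built from the normalized q₁ and q₂, Uᵀ A U comes out upper triangular with the eigenvalues on the diagonal.

import numpy as np

A  = np.array([[ 5.0,  7.0],
               [-2.0, -4.0]])
q1 = np.array([1.0, -1.0])
x2 = np.array([7.0, -2.0])

q2 = x2 - (x2 @ q1) * q1 / (q1 @ q1)              # Gram-Schmidt step -> [2.5, 2.5]
U  = np.column_stack([q1 / np.linalg.norm(q1),
                      q2 / np.linalg.norm(q2)])   # orthonormal columns

print(q2)                                         # [2.5, 2.5]
print(np.round(U.T @ A @ U, 6))                   # [[-2, 9], [0, 3]] -- upper triangular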
Gram-Schmidt process:

(continuing with the same $A$, $\lambda_1$, $\lambda_2$, $x_1$, $x_2$ as above)

For a 2 × 2 case, the orthogonal vector to a given vector $x_1$ is unique (up to scaling)! So it doesn't matter which vector you use as $x_2$. This is my claim!

GeoGebra demo: https://www.geogebra.org/material/iframe/id/vsb7b4pw/width/700/height/500/border/888888/sfsb/true/smb/false/stb/false/stbh/false/ai/false/asb/false/sri/false/rc/false/ld/false/sdz/false/ctl/false
Gram-Schmidt process:

For a matrix of size 3 × 3, if we have only one eigenvector $x_1$, then there are many possible orthogonal vectors, depending on the direction of $x_2$.

Could you reason why? (I hope there is no need for GeoGebra :-))

Therefore, the Schur decomposition is not unique!

Spectral Theorem: Any Hermitian matrix is similar, via a unitary matrix, to a diagonal matrix $D$; that is, $A = U D U^*$.
Singular Value Decomposition (SVD)

Any matrix $A$ can be decomposed as $A = Q_1 \Sigma Q_2^T$, where the columns of $Q_1$ are eigenvectors of $A A^T$ and the columns of $Q_2$ are eigenvectors of $A^T A$.

There is no problem in the computation steps as long as none of the singular values are zero.

If any singular value is zero, we need to bring in the Gram-Schmidt process to complete the unitary matrices.

Add-on: If SVD is used for PCA, then the singular values represent the variance of the data. The higher the singular value, the higher the variance! (Watch the image compression tutorial in week 5 once again, keeping this in mind.)
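A minimal numpy sketch (my own example matrix) relating np.linalg.svd to the eigen-decompositions of AAᵀ and AᵀA: the squared singular values are the nonzero eigenvalues of AᵀA.

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])                           # illustrative 3 x 2 matrix

Q1, s, Q2T = np.linalg.svd(A, full_matrices=False)

print(s**2)                                          # squared singular values
print(np.sort(np.linalg.eigvalsh(A.T @ A))[::-1])    # eigenvalues of A^T A (same values)
print(np.allclose(A, Q1 @ np.diag(s) @ Q2T))         # A = Q1 Sigma Q2^T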
WEEK 9: REVISION FINAL EXAM
CONTENTS
1. Properties of Convex Functions
2. Applications of Optimization in Machine Learning
3. Revisiting Constrained Optimization
4. Relation between Primal and Dual Problem, KKT Conditions
5. KKT conditions continued
1. PROPERTIES OF CONVEX FUNCTIONS
Necessary and sufficient conditions for optimality of convex functions
2. APPLICATIONS OF OPTIMIZATION IN ML
3. CONSTRAINED OPTIMIZATION
Consider the constrained optimization problem as follows:
4. RELATION BETWEEN PRIMAL AND DUAL PROBLEM
5. KARUSH-KUHN-TUCKER CONDITIONS
Consider the optimization problem with multiple equality and inequality constraints as
follows:
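The general formulation itself is not reproduced in this text; a standard statement (hedged, using the usual convention of inequality constraints gᵢ(x) ≤ 0 and equality constraints hⱼ(x) = 0) is:

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{s.t.} \quad & g_i(x) \le 0, \quad i = 1, \dots, m \\ & h_j(x) = 0, \quad j = 1, \dots, p \end{aligned}$$

and the KKT conditions at a candidate optimum $x^*$ with multipliers $\mu_i$ and $\lambda_j$ are:

$$\begin{aligned} & \nabla f(x^*) + \sum_i \mu_i \nabla g_i(x^*) + \sum_j \lambda_j \nabla h_j(x^*) = 0 && \text{(stationarity)} \\ & \mu_i \, g_i(x^*) = 0 && \text{(complementary slackness)} \\ & g_i(x^*) \le 0, \quad h_j(x^*) = 0 && \text{(primal feasibility)} \\ & \mu_i \ge 0 && \text{(dual feasibility)} \end{aligned}$$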
SOME SOLVED PROBLEMS
Properties of convex functions

https://www.geogebra.org/m/esqcd4he
Given below is a set of data points and their labels.

How to find the optimal w* using the analytical method?

Let us use Gradient descent optimization.
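The data points themselves are in the slide's figure and aren't reproduced here; as a hedged sketch of both approaches on a hypothetical small dataset, the analytical solution comes from the normal equations and gradient descent iterates w ← w − η∇L(w) on L(w) = ‖Xw − y‖².

import numpy as np

# Hypothetical data (the slide's actual points are not reproduced in this text).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # bias column + feature
y = np.array([1.0, 1.0, 2.0, 3.0])

# Analytical (normal-equations) solution: w* = (X^T X)^{-1} X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on L(w) = ||Xw - y||^2, whose gradient is 2 X^T (Xw - y)
w, lr = np.zeros(2), 0.01
for _ in range(5000):
    w -= lr * 2 * X.T @ (X @ w - y)

print(w_star, w)        # gradient descent converges to the analytical optimum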


$$\begin{aligned} \text{minimize} \quad & 3x_1 + x_2 \\ \text{subject to} \quad & x_1 - x_2 + 4 \le 0 \\ & -3x_1 + 2x_2 + 10 \le 0 \end{aligned}$$

KKT conditions to check:
1. Stationarity conditions
2. Complementary slackness conditions
3. Primal feasibility conditions
4. Dual feasibility conditions
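A sketch checking this particular problem numerically (my own verification, not part of the slides): scipy's linprog solves the LP, and the result can be tested against the four KKT conditions. Working the stationarity equations by hand gives multipliers μ₁ = 9, μ₂ = 4, both constraints active, and x* = (18, 22).

import numpy as np
from scipy.optimize import linprog

# minimize 3*x1 + x2  subject to  x1 - x2 + 4 <= 0  and  -3*x1 + 2*x2 + 10 <= 0
c    = np.array([3.0, 1.0])
A_ub = np.array([[1.0, -1.0], [-3.0, 2.0]])
b_ub = np.array([-4.0, -10.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None), (None, None)])
x = res.x
print(x, res.fun)                                # expected: x = [18, 22], objective 76

mu = np.array([9.0, 4.0])                        # multipliers from the stationarity equations
print(np.allclose(c + A_ub.T @ mu, 0))           # stationarity
print(np.allclose(mu * (A_ub @ x - b_ub), 0))    # complementary slackness
print(np.all(A_ub @ x - b_ub <= 1e-8), np.all(mu >= 0))   # primal and dual feasibility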

