
MACHINE LEARNING - FOUNDATIONS

REVISION (WEEK 2)

IIT Madras Online Degree


Topics

• Continuity
• Differentiability
• Linear Approximation
• Higher order approximations
• Multivariate Linear Approximation
• Directional Derivative

Continuity

A function 𝑓(𝑥) is continuous at 𝑥 = 𝑎 if:

$$f(a) = \lim_{x \to a^-} f(x) = \lim_{x \to a^+} f(x)$$
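A quick numerical sanity check of this definition in Python (a small sketch with a hypothetical piecewise function g, not taken from the slides): approach the point from both sides and compare with the function value.

import numpy as np

# Hypothetical piecewise function: continuous at x = 1, but it jumps at x = 0.
def g(x):
    return np.where(x < 0, x - 1.0, x**2)

for a in (1.0, 0.0):
    left  = g(a - 1e-8)    # approximates the left-hand limit
    right = g(a + 1e-8)    # approximates the right-hand limit
    print(a, bool(np.isclose(left, g(a)) and np.isclose(right, g(a))))  # True, False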
Differentiability

A function 𝑓(𝑥) is differentiable at 𝑥 = 𝑎 if:

$$f'(a) = \lim_{x \to a^-} \frac{f(x) - f(a)}{x - a} = \lim_{x \to a^+} \frac{f(x) - f(a)}{x - a}$$
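As a hedged Python sketch (my own illustration): compare the two one-sided difference quotients numerically. For f(x) = |x| at a = 0 they disagree, so f is not differentiable there; for f(x) = x² they agree.

def one_sided_slopes(f, a, h=1e-6):
    # Approximate the left and right limits of (f(x) - f(a)) / (x - a).
    left  = (f(a - h) - f(a)) / (-h)
    right = (f(a + h) - f(a)) / (+h)
    return left, right

print(one_sided_slopes(abs, 0.0))            # (-1.0, 1.0)  -> not differentiable at 0
print(one_sided_slopes(lambda x: x**2, 0.0)) # (~0, ~0)     -> differentiable at 0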
Linear Approximation

The linear approximation 𝐿(𝑥) of a function 𝑓(𝑥) at point 𝑎 is given by:

$$L(x) = f(a) + f'(a)(x - a)$$

This is indeed the equation of a tangent line:

$$y - y_1 = m(x - x_1)$$
$$y = y_1 + m(x - x_1)$$

If $x_1 = a$, $y_1 = f(a)$ and $m = f'(a)$, we get

$$y = f(a) + f'(a)(x - a)$$
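A small Python sketch (my own example, not from the slides): the linear approximation of f(x) = √x at a = 4, compared with the true values near a.

import numpy as np

f      = np.sqrt
fprime = lambda x: 0.5 / np.sqrt(x)        # f'(x) for f(x) = sqrt(x)
a      = 4.0

L = lambda x: f(a) + fprime(a) * (x - a)   # L(x) = f(a) + f'(a)(x - a)

for x in (3.9, 4.1, 4.5):
    print(x, f(x), L(x), abs(f(x) - L(x))) # the error grows as x moves away from a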
Higher order approximations

Linear Approximation

$$L(x) = f(a) + f'(a)(x - a)$$

Quadratic Approximation

$$L(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2}(x - a)^2$$

Higher-order Approximations

$$L(x) = f(a) + f^{(1)}(a)(x - a) + \frac{f^{(2)}(a)}{2}(x - a)^2 + \frac{f^{(3)}(a)}{3 \cdot 2}(x - a)^3 + \frac{f^{(4)}(a)}{4 \cdot 3 \cdot 2}(x - a)^4 + \dots$$
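A hedged Python sketch (my own example): linear, quadratic, and fourth-order approximations of f(x) = eˣ around a = 0, built from the pattern above (the k-th term is f⁽ᵏ⁾(a)(x − a)ᵏ / k!).

import math

def taylor_approx(x, a=0.0, order=1):
    # For f(x) = e^x every derivative at a equals e^a, so the k-th term is e^a (x - a)^k / k!.
    return sum(math.exp(a) * (x - a)**k / math.factorial(k) for k in range(order + 1))

x = 0.5
for order in (1, 2, 4):
    approx = taylor_approx(x, order=order)
    print(order, approx, abs(math.exp(x) - approx))  # the error shrinks as the order grows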
Multivariate linear approximation: Linear approximation of functions involving multiple variables

The linear approximation of a function 𝑓 of two variables 𝑥 and 𝑦 in the neighborhood of (𝑎, 𝑏) is:

$$L(x, y) = f(a, b) + \frac{\partial f}{\partial x}(a, b)\,(x - a) + \frac{\partial f}{\partial y}(a, b)\,(y - b)$$
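A minimal Python sketch (my own example, assuming f(x, y) = x²y) of this two-variable linear approximation around (a, b) = (1, 2).

def f(x, y):   return x**2 * y     # example function
def f_x(x, y): return 2 * x * y    # partial derivative with respect to x
def f_y(x, y): return x**2         # partial derivative with respect to y

a, b = 1.0, 2.0
def L(x, y):
    return f(a, b) + f_x(a, b) * (x - a) + f_y(a, b) * (y - b)

print(f(1.1, 2.1), L(1.1, 2.1))    # true value 2.541 vs. linear approximation 2.5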
DIRECTIONAL DERIVATIVES
Directional Derivative

• $f_x(x, y) = \frac{\partial f}{\partial x}(x, y)$ = rate of change of 𝑓 as we vary 𝑥 (keeping 𝑦 fixed).
• $f_y(x, y) = \frac{\partial f}{\partial y}(x, y)$ = rate of change of 𝑓 as we vary 𝑦 (keeping 𝑥 fixed).
• Directional derivative of 𝑓(𝑥, 𝑦) = rate of change of 𝑓 if we allow both 𝑥 and 𝑦 to change simultaneously (in some direction $\vec{u}$).

$$D_{\vec{u}} f(x, y) = \nabla f \cdot \vec{u} = \left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right] \cdot [u_1, u_2] = \frac{\partial f}{\partial x} u_1 + \frac{\partial f}{\partial y} u_2$$

The directional derivative can be viewed as a weighted sum of the partial derivatives, with weights given by the components of the direction vector $\vec{u}$.
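A hedged Python sketch (my own example, with f(x, y) = x²y and the unit direction u = (1/√2, 1/√2)) comparing ∇f · u with a finite-difference estimate of the rate of change along u.

import numpy as np

def f(p):                 # f(x, y) = x^2 * y
    x, y = p
    return x**2 * y

def grad_f(p):            # [df/dx, df/dy] = [2xy, x^2]
    x, y = p
    return np.array([2 * x * y, x**2])

p = np.array([1.0, 2.0])
u = np.array([1.0, 1.0]) / np.sqrt(2)        # unit direction vector

D_u = grad_f(p) @ u                          # directional derivative = grad(f) . u
h = 1e-6
finite_diff = (f(p + h * u) - f(p)) / h      # numerical rate of change along u
print(D_u, finite_diff)                      # both approx (4 + 1) / sqrt(2) = 3.5355...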
WEEK 3: REVISION
1. Four Fundamental Subspaces
2. Orthogonal Vectors and Subspaces
3. Projections
4. Least Squares and Projections onto a Subspace
5. Example of Least Squares

Suppose A is an m × n matrix.

1. The column space is C(A), a subspace of ℝ𝑚 .


2. The row space is C(𝐀𝑇 ), a subspace of ℝ𝑛 .
3. The nullspace is N(A), a subspace of ℝ𝑛 .
4. The left nullspace is N(𝐀𝑇 ), a subspace of ℝ𝑚 .

▪ Find the condition on (𝑏1 , 𝑏2 , 𝑏3 ) for 𝐴𝑥 = 𝑏 to be solvable, if

$$\mathrm{rank}(A) + \mathrm{nullity}(A) = n \qquad\qquad \dim(C(A^T)) + \dim(N(A^T)) = m$$
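A quick numerical check of these dimension counts in Python (an illustrative matrix of my own, since the slide's example matrix isn't reproduced here).

import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])               # an illustrative 2 x 3 matrix of rank 1
m, n = A.shape

rank         = np.linalg.matrix_rank(A)       # dim C(A) = dim C(A^T)
nullity      = null_space(A).shape[1]         # dim N(A)
left_nullity = null_space(A.T).shape[1]       # dim N(A^T)

print(rank + nullity == n)                    # rank(A) + nullity(A) = n
print(rank + left_nullity == m)               # dim C(A^T) + dim N(A^T) = m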
▪ The projection p of b onto a line through a:

$$p = \frac{a^T b}{a^T a}\, a$$

Projection matrix of vector 𝑎:

$$P = \frac{a a^T}{a^T a}$$
▪ Projection matrix of $a = \begin{bmatrix} -1 \\ 3 \\ -2 \\ 1 \end{bmatrix}$
Projection of $b = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 1 \end{bmatrix}$ onto $a$:
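A Python sketch of the computation with the vectors above, using the rank-one projection matrix P = aaᵀ / aᵀa (my own verification of the numbers, not necessarily the slide's worked layout).

import numpy as np

a = np.array([-1.0, 3.0, -2.0, 1.0])
b = np.array([ 1.0, 0.0,  1.0, 1.0])

P = np.outer(a, a) / (a @ a)     # projection matrix onto the line through a (a.a = 15)
p = P @ b                        # projection of b onto a

print(p)                                          # equals (a.b / a.a) * a = (-2/15) * a
print(np.allclose(p, (a @ b) / (a @ a) * a))      # True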
▪ It often happens that Ax = b has no solution.
▪ The usual reason is: too many equations.
▪ The matrix A has more rows than columns.
▪ There are more equations than unknowns (m is greater than n).
▪ Then the columns span only a small part of m-dimensional space.
▪ We cannot always get the error e = b − Ax down to zero. When e is zero, x is an exact solution to Ax = b.
▪ When the length of e is as small as possible, $\hat{x}$ is a least squares solution.

▪ Least Squares method: solving $A^T A \hat{x} = A^T b$, we get $\hat{x}$.
$$A^T A \hat{x} = A^T b$$

Solving this we get,

Best fit line: y = 0.7x + 0.7
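A Python sketch of this computation. The slide's data table isn't reproduced in this text, so the four points below are my own illustrative choice, (0,1), (1,1), (2,2), (3,3); they happen to give the same best-fit line y = 0.7x + 0.7.

import numpy as np

# Hypothetical data points (x_i, y_i); the slide's own table is not reproduced here.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 1.0, 2.0, 3.0])

A = np.column_stack([x, np.ones_like(x)])    # model y = C*x + D, unknowns [C, D]
x_hat = np.linalg.solve(A.T @ A, A.T @ y)    # normal equations: A^T A x_hat = A^T y

print(x_hat)                                 # [0.7, 0.7]  ->  y = 0.7x + 0.7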
Machine Learning Foundations
Week 4

Machine Learning Foundations
Week-5 Revision

Arun Prakash A
Complex vectors

$x \in \mathbb{C}^n, \quad y \in \mathbb{C}^n$

$$x = \begin{bmatrix} 3 - 2i \\ -2 + i \\ -4 - 3i \end{bmatrix} \in \mathbb{C}^3 \qquad y = \begin{bmatrix} -2 + 4i \\ 5 - i \\ -2i \end{bmatrix} \in \mathbb{C}^3$$

Operations:

Addition: $z = x + y \in \mathbb{C}^n$

$$z = \begin{bmatrix} 1 + 2i \\ 3 \\ -4 - 5i \end{bmatrix} \in \mathbb{C}^3$$

Conjugate:

$$\bar{z} = \begin{bmatrix} 1 - 2i \\ 3 \\ -4 + 5i \end{bmatrix}$$
Inner Product: $x \cdot y = \bar{x}^T y \in \mathbb{C}$

With the vectors above, $\bar{x}^T = [\, 3 + 2i, \; -2 - i, \; -4 + 3i \,]$, so

$$x \cdot y = \begin{bmatrix} 3 + 2i & -2 - i & -4 + 3i \end{bmatrix} \begin{bmatrix} -2 + 4i \\ 5 - i \\ -2i \end{bmatrix} = -19 + 13i$$

Properties
1. $x \cdot y = \overline{y \cdot x}$
2. $(x + y) \cdot z = x \cdot z + y \cdot z$
3. $x \cdot (cy) = c\,(x \cdot y)$
4. $(cx) \cdot y = \bar{c}\,(x \cdot y)$
5. $x \cdot x = \|x\|^2$

$cx \cdot cy = |c|\,(x \cdot y)$, true or false?
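A minimal numpy check of this value and of properties 1 and 5 (a sketch; np.vdot conjugates its first argument, matching the convention x · y = x̄ᵀy).

import numpy as np

x = np.array([3 - 2j, -2 + 1j, -4 - 3j])
y = np.array([-2 + 4j, 5 - 1j, 0 - 2j])

print(np.vdot(x, y))                                          # (-19+13j)
print(np.isclose(np.vdot(x, y), np.conj(np.vdot(y, x))))      # property 1
print(np.isclose(np.vdot(x, x).real, np.linalg.norm(x)**2))   # property 5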
Complex Matrices

$$A = \begin{bmatrix} 2 & 3 - 3i \\ 3 + 3i & 5 \end{bmatrix} \qquad B = \begin{bmatrix} 2i & 3 + 3i \\ 3 + 3i & 5i \end{bmatrix}$$

Hermitian if: $A^* = \bar{A}^T = A$. (With this notation, the inner product is $x \cdot y = x^* y \in \mathbb{C}$.)

$A = \begin{bmatrix} 2 & 3 - 3i \\ 3 + 3i & 5 \end{bmatrix}$ is Hermitian.

$B = \begin{bmatrix} 2i & 3 + 3i \\ 3 + 3i & 5i \end{bmatrix}$ is not Hermitian.

$C = \begin{bmatrix} 2i & 3 + 3i \\ 3 - 3i & 5i \end{bmatrix}$ is not Hermitian (its diagonal entries are not real).
Properties of Hermitian Matrices
1. All eigenvalues $\lambda_i$ are real.
2. Eigenvectors are orthogonal if $\lambda_i \neq \lambda_j$ for $i \neq j$.

Finding complex eigenvectors:

Consider the matrix $A = \begin{bmatrix} 2 & 3 - 3i \\ 3 + 3i & 5 \end{bmatrix}$. Find the complex eigenvector for the eigenvalue $\lambda = 8$.

We need the nullspace $N[A - \lambda I]$, where

$$A - \lambda I = \begin{bmatrix} -6 & 3 - 3i \\ 3 + 3i & -3 \end{bmatrix}$$

The row operation $R_2 \to R_2 + \tfrac{1}{2}(1 + i) R_1$ gives

$$\begin{bmatrix} -6 & 3 - 3i \\ 0 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

$$-6 x_1 + (3 - 3i) x_2 = 0 \;\Rightarrow\; -2 x_1 + (1 - i) x_2 = 0 \;\Rightarrow\; 2 x_1 = (1 - i) x_2$$

$$\therefore\; x = c \begin{bmatrix} 1 \\ 1 + i \end{bmatrix}, \quad \text{e.g. } x = (1 - i) \begin{bmatrix} 1 \\ 1 + i \end{bmatrix} = \begin{bmatrix} 1 - i \\ 2 \end{bmatrix}$$
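A numpy check of this example (a sketch): A equals its conjugate transpose, its eigenvalues are real, and [1, 1+i]ᵀ is an eigenvector for λ = 8.

import numpy as np

A = np.array([[2, 3 - 3j],
              [3 + 3j, 5]])

print(np.allclose(A, A.conj().T))     # Hermitian: A equals its conjugate transpose
print(np.linalg.eigvalsh(A))          # real eigenvalues: [-1., 8.]

x = np.array([1, 1 + 1j])
print(np.allclose(A @ x, 8 * x))      # [1, 1+i] is an eigenvector for lambda = 8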
Unitary Matrices

Real case: $Q^T Q = I$. Complex case: $U^* U = I$.

Example (a real rotation matrix):

$$U = \begin{bmatrix} \cos(t) & -\sin(t) \\ \sin(t) & \cos(t) \end{bmatrix} \qquad U^T = \begin{bmatrix} \cos(t) & \sin(t) \\ -\sin(t) & \cos(t) \end{bmatrix}$$

$$U U^T = \begin{bmatrix} \cos^2(t) + \sin^2(t) & \cos(t)\sin(t) - \sin(t)\cos(t) \\ \sin(t)\cos(t) - \cos(t)\sin(t) & \sin^2(t) + \cos^2(t) \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = I$$

Properties:
• A unitary matrix preserves the lengths and angles of vectors!
• Therefore, its eigenvalues satisfy $|\lambda_i| = 1$.
• Eigenvectors are orthogonal if $\lambda_i \neq \lambda_j$ for $i \neq j$.
• Unitary matrices need not be Hermitian.
• There exists a unitary matrix that diagonalizes a Hermitian matrix.
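A quick numpy check of these claims for the rotation matrix above (a sketch with t = 0.7): U Uᵀ = I, lengths are preserved, and both eigenvalues have modulus 1.

import numpy as np

t = 0.7
U = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])

print(np.allclose(U @ U.T, np.eye(2)))            # U U^T = I

v = np.array([3.0, -4.0])
print(np.linalg.norm(U @ v), np.linalg.norm(v))   # lengths preserved: both 5.0

print(np.abs(np.linalg.eigvals(U)))               # |lambda_i| = 1 for both eigenvalues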
Diagonalization of Hermitian Matrices

Schur's Theorem: Any $n \times n$ matrix $A$ is similar, via a unitary matrix $U$, to an upper triangular matrix $T$; that is, $A = U T U^*$.

Example:

$$A = \begin{bmatrix} 5 & 7 \\ -2 & -4 \end{bmatrix}, \quad \lambda_1 = -2, \;\; \lambda_2 = 3, \quad x_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \;\; x_2 = \begin{bmatrix} 7 \\ -2 \end{bmatrix}$$

$$U = \begin{bmatrix} 1 & 7 \\ -1 & -2 \end{bmatrix}$$

If we do $U^* A U$, will it be a triangular or diagonal matrix?
No. Why? Because $U$ is not an orthonormal matrix!
So how do we find an orthogonal matrix?
Gram-Schmidt process:

$$A = \begin{bmatrix} 5 & 7 \\ -2 & -4 \end{bmatrix}, \quad \lambda_1 = -2, \;\; \lambda_2 = 3, \quad x_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \;\; x_2 = \begin{bmatrix} 7 \\ -2 \end{bmatrix}$$

Find a vector orthogonal to $x_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$ (you could have picked $x_2$ as well):

$$q_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \qquad q_2 = x_2 - (x_2 \cdot q_1)\,\frac{q_1}{q_1 \cdot q_1} = \begin{bmatrix} 2.5 \\ 2.5 \end{bmatrix}$$

Normalizing the two columns gives

$$U = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}$$

If we do $U^* A U$, will it be a triangular or diagonal matrix?

$$U^* A U = \begin{bmatrix} -2 & 9 \\ 0 & 3 \end{bmatrix}$$

It is upper triangular. What if we had considered a different vector instead of $x_2$ during the orthogonalization?
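A numpy check of this Schur step (a sketch): with U built from the normalized q₁ and q₂, Uᵀ A U comes out upper triangular with the eigenvalues on the diagonal.

import numpy as np

A  = np.array([[ 5.0,  7.0],
               [-2.0, -4.0]])
q1 = np.array([1.0, -1.0])
x2 = np.array([7.0, -2.0])

q2 = x2 - (x2 @ q1) * q1 / (q1 @ q1)              # Gram-Schmidt step -> [2.5, 2.5]
U  = np.column_stack([q1 / np.linalg.norm(q1),
                      q2 / np.linalg.norm(q2)])   # orthonormal columns

print(q2)                                         # [2.5, 2.5]
print(np.round(U.T @ A @ U, 6))                   # [[-2, 9], [0, 3]] -- upper triangular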
Gram-Schmidt process:

(continuing with the same $A$, $\lambda_1$, $\lambda_2$, $x_1$, $x_2$ as above)

For a 2 × 2 case, the orthogonal vector to a given vector $x_1$ is unique (up to scaling)! So it doesn't matter which vector you use as $x_2$. This is my claim!

GeoGebra demo: https://www.geogebra.org/material/iframe/id/vsb7b4pw/width/700/height/500/border/888888/sfsb/true/smb/false/stb/false/stbh/false/ai/false/asb/false/sri/false/rc/false/ld/false/sdz/false/ctl/false
Gram-Schmidt process:

For a matrix of size 3 × 3, if we have only one eigenvector $x_1$, then there are many possible orthogonal vectors, depending on the direction of $x_2$.

Could you reason why? (I hope there is no need for GeoGebra :-))

Therefore, the Schur decomposition is not unique!

Spectral Theorem: Any Hermitian matrix is similar, via a unitary matrix, to a diagonal matrix $D$; that is, $A = U D U^*$.
Singular Value Decomposition (SVD)

Any matrix $A$ can be decomposed as $A = Q_1 \Sigma Q_2^T$, where the columns of $Q_1$ are eigenvectors of $A A^T$ and the columns of $Q_2$ are eigenvectors of $A^T A$.

There is no problem in the computation steps as long as none of the singular values are zero.

If any singular value is zero, we need to bring in the Gram-Schmidt process to complete the unitary matrices.

Add-on: If SVD is used for PCA, then the singular values represent the variance of the data. The higher the singular value, the higher the variance! (Watch the image compression tutorial in week 5 once again, keeping this in mind.)
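A minimal numpy sketch (my own example matrix) relating np.linalg.svd to the eigen-decompositions of AAᵀ and AᵀA: the squared singular values are the nonzero eigenvalues of AᵀA.

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])                           # illustrative 3 x 2 matrix

Q1, s, Q2T = np.linalg.svd(A, full_matrices=False)

print(s**2)                                          # squared singular values
print(np.sort(np.linalg.eigvalsh(A.T @ A))[::-1])    # eigenvalues of A^T A (same values)
print(np.allclose(A, Q1 @ np.diag(s) @ Q2T))         # A = Q1 Sigma Q2^T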
WEEK 9: REVISION FINAL EXAM
CONTENTS
1. Properties of Convex Functions
2. Applications of Optimization in Machine Learning
3. Revisiting Constrained Optimization
4. Relation between Primal and Dual Problem, KKT Conditions
5. KKT conditions continued
1. PROPERTIES OF CONVEX FUNCTIONS
Necessary and sufficient conditions for optimality of convex functions
2. APPLICATIONS OF OPTIMIZATION IN ML
3. CONSTRAINED OPTIMIZATION
Consider the constrained optimization problem as follows:
4. RELATION BETWEEN PRIMAL AND DUAL PROBLEM
5. KARUSH-KUHN-TUCKER CONDITIONS
Consider the optimization problem with multiple equality and inequality constraints as
follows:
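The general formulation itself is not reproduced in this text; a standard statement (hedged, using the usual convention of inequality constraints gᵢ(x) ≤ 0 and equality constraints hⱼ(x) = 0) is:

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{s.t.} \quad & g_i(x) \le 0, \quad i = 1, \dots, m \\ & h_j(x) = 0, \quad j = 1, \dots, p \end{aligned}$$

and the KKT conditions at a candidate optimum $x^*$ with multipliers $\mu_i$ and $\lambda_j$ are:

$$\begin{aligned} & \nabla f(x^*) + \sum_i \mu_i \nabla g_i(x^*) + \sum_j \lambda_j \nabla h_j(x^*) = 0 && \text{(stationarity)} \\ & \mu_i \, g_i(x^*) = 0 && \text{(complementary slackness)} \\ & g_i(x^*) \le 0, \quad h_j(x^*) = 0 && \text{(primal feasibility)} \\ & \mu_i \ge 0 && \text{(dual feasibility)} \end{aligned}$$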
SOME SOLVED PROBLEMS
Properties of convex functions

https://www.geogebra.org/m/esqcd4he
Given below is a set of data points and their labels.

How to find the optimal w* using the analytical method?

Let us use Gradient descent optimization.
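The data points themselves are in the slide's figure and aren't reproduced here; as a hedged sketch of both approaches on a hypothetical small dataset, the analytical solution comes from the normal equations and gradient descent iterates w ← w − η∇L(w) on L(w) = ‖Xw − y‖².

import numpy as np

# Hypothetical data (the slide's actual points are not reproduced in this text).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # bias column + feature
y = np.array([1.0, 1.0, 2.0, 3.0])

# Analytical (normal-equations) solution: w* = (X^T X)^{-1} X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on L(w) = ||Xw - y||^2, whose gradient is 2 X^T (Xw - y)
w, lr = np.zeros(2), 0.01
for _ in range(5000):
    w -= lr * 2 * X.T @ (X @ w - y)

print(w_star, w)        # gradient descent converges to the analytical optimum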


$$\begin{aligned} \text{minimize} \quad & 3x_1 + x_2 \\ \text{subject to} \quad & x_1 - x_2 + 4 \le 0 \\ & -3x_1 + 2x_2 + 10 \le 0 \end{aligned}$$

KKT conditions to check:
1. Stationarity conditions
2. Complementary slackness conditions
3. Primal feasibility conditions
4. Dual feasibility conditions
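A sketch checking this particular problem numerically (my own verification, not part of the slides): scipy's linprog solves the LP, and the result can be tested against the four KKT conditions. Working the stationarity equations by hand gives multipliers μ₁ = 9, μ₂ = 4, both constraints active, and x* = (18, 22).

import numpy as np
from scipy.optimize import linprog

# minimize 3*x1 + x2  subject to  x1 - x2 + 4 <= 0  and  -3*x1 + 2*x2 + 10 <= 0
c    = np.array([3.0, 1.0])
A_ub = np.array([[1.0, -1.0], [-3.0, 2.0]])
b_ub = np.array([-4.0, -10.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None), (None, None)])
x = res.x
print(x, res.fun)                                # expected: x = [18, 22], objective 76

mu = np.array([9.0, 4.0])                        # multipliers from the stationarity equations
print(np.allclose(c + A_ub.T @ mu, 0))           # stationarity
print(np.allclose(mu * (A_ub @ x - b_ub), 0))    # complementary slackness
print(np.all(A_ub @ x - b_ub <= 1e-8), np.all(mu >= 0))   # primal and dual feasibility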

