
# CS7015 (Deep Learning) : Lecture 6

## Eigenvalues, Eigenvectors, Eigenvalue Decomposition, Principal Component Analysis, Singular Value Decomposition

## Department of Computer Science and Engineering

Prof. Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 6
Module 6.1 : Eigenvalues and Eigenvectors

What happens when a matrix hits a vector?

$$A = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ 3 \end{bmatrix}, \qquad Ax = \begin{bmatrix} 7 \\ 5 \end{bmatrix}$$

The vector gets transformed into a new vector (it strays from its path). The vector may also get scaled (elongated or shortened) in the process.
For a given square matrix A, there exist special vectors which refuse to stray from their path.

$$x = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad Ax = \begin{bmatrix} 3 \\ 3 \end{bmatrix} = 3 \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 3x$$

These vectors are called eigenvectors. More formally,

$$Ax = \lambda x \quad \text{[direction remains the same]}$$

The vector will only get scaled but will not change its direction.
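The contrast between a generic vector and an eigenvector is easy to check numerically. A minimal sketch with numpy, using the matrix and vectors from the slide (the code itself is not part of the original lecture):

```python
import numpy as np

A = np.array([[1., 2.],
              [2., 1.]])

x = np.array([1., 3.])   # a generic vector: A knocks it off its path
v = np.array([1., 1.])   # an eigenvector of A: direction is preserved

print(A @ x)             # [7. 5.] -- not a scalar multiple of x
print(A @ v)             # [3. 3.] -- exactly 3 * v, so Av = 3v

# numpy recovers all eigenvalues/eigenvectors at once
eigvals, eigvecs = np.linalg.eig(A)
print(np.sort(eigvals))  # [-1.  3.]
```

Here `np.linalg.eig` confirms that A has eigenvalues 3 and −1, with [1, 1]ᵀ as the eigenvector for 3.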
So what is so special about eigenvectors? Why are they always in the limelight?

It turns out that several properties of matrices can be analyzed based on their eigenvalues (for example, see spectral graph theory).

We will now see two cases where eigenvalues/eigenvectors will help us in this course.
Let us assume that on day 0, k1 students eat Chinese food, and k2 students eat Mexican food. (Of course, no one eats in the mess!)

$$v^{(0)} = \begin{bmatrix} k_1 \\ k_2 \end{bmatrix}$$

On each subsequent day i, a fraction p of the students who ate Chinese food on day (i − 1) continue to eat Chinese food on day i, and (1 − p) shift to Mexican food. Similarly, a fraction q of the students who ate Mexican food on day (i − 1) continue to eat Mexican food on day i, and (1 − q) shift to Chinese food.

$$v^{(1)} = \begin{bmatrix} pk_1 + (1-q)k_2 \\ (1-p)k_1 + qk_2 \end{bmatrix} = \begin{bmatrix} p & 1-q \\ 1-p & q \end{bmatrix} \begin{bmatrix} k_1 \\ k_2 \end{bmatrix} = M v^{(0)}$$

$$v^{(2)} = M v^{(1)} = M^2 v^{(0)}$$

The number of customers in the two restaurants is thus given by the following series:

$$v^{(0)},\ M v^{(0)},\ M^2 v^{(0)},\ M^3 v^{(0)},\ \dots$$

In general, $v^{(n)} = M^n v^{(0)}$.
(Figure: the two restaurants as states of a transition diagram — a fraction p of the k1 Chinese-food eaters stays, 1 − p shifts to Mexican; a fraction q of the k2 Mexican-food eaters stays, 1 − q shifts to Chinese.)

This is a problem for the two restaurant owners. The number of patrons is changing constantly.

Or is it? Will the system eventually reach a steady state? (i.e., will the number of customers in the two restaurants become constant over time?)

Turns out they will! Let's see how.
Definition
Let λ1, λ2, …, λn be the eigenvalues of an n × n matrix A. λ1 is called the dominant eigenvalue of A if

$$|\lambda_1| \geq |\lambda_i|, \quad i = 2, \dots, n$$

Definition
A matrix M is called a stochastic matrix if all the entries are positive and the sum of the elements in each column is equal to 1. (Note that the matrix in our example is a stochastic matrix.)

Theorem
The largest (dominant) eigenvalue of a stochastic matrix is 1. (See proof here.)

Theorem
If A is an n × n square matrix with a dominant eigenvalue, then the sequence of vectors given by $Av_0, A^2 v_0, \dots, A^n v_0, \dots$ approaches a multiple of the dominant eigenvector of A. (The theorem is slightly misstated here for ease of explanation.)
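The theorem about stochastic matrices can be sanity-checked numerically on the transition matrix M from our example. A small sketch, with illustrative fractions p and q assumed here (any values in (0, 1) would do):

```python
import numpy as np

p, q = 0.8, 0.7                    # assumed example fractions, not from the slides
M = np.array([[p,     1 - q],
              [1 - p, q    ]])

# all entries are positive and each column sums to 1, so M is stochastic
assert np.allclose(M.sum(axis=0), 1.0)

eigvals = np.linalg.eigvals(M)
dominant = eigvals[np.argmax(np.abs(eigvals))]
print(dominant)                    # 1.0 -- the dominant eigenvalue is 1
```

The other eigenvalue works out to p + q − 1, which has magnitude strictly less than 1 for these values.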
(Figure: the same two-restaurant transition diagram with fractions p, 1 − p, q, 1 − q.)

Let e_d be the dominant eigenvector of M and λ_d = 1 the corresponding dominant eigenvalue.

Given the previous definitions and theorems, what can you say about the sequence $M v^{(0)}, M^2 v^{(0)}, M^3 v^{(0)}, \dots$?

There exists an n such that

$$v^{(n)} = M^n v^{(0)} = k e_d \quad \text{(some multiple of } e_d\text{)}$$

$$v^{(n+1)} = M v^{(n)} = M(k e_d) = k(M e_d) = k(\lambda_d e_d) = k e_d$$

The population in the two restaurants becomes constant after time step n. (See proof here.)
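The convergence of $M^n v^{(0)}$ to a multiple of e_d can also be simulated directly. A sketch with assumed values k1 = 300, k2 = 100, p = 0.8, q = 0.7 (illustrative, not from the slides):

```python
import numpy as np

p, q = 0.8, 0.7                      # assumed transition fractions
M = np.array([[p,     1 - q],
              [1 - p, q    ]])
v = np.array([300., 100.])           # v(0): assumed initial populations

for _ in range(50):                  # iterate v(i) = M v(i-1)
    v = M @ v

print(v)                             # [240. 160.] -- the steady state
print(M @ v)                         # same vector: M v = 1 * v = v
```

Note that the total population (400) is preserved at every step, since each column of M sums to 1; the iteration only redistributes it until it reaches the dominant eigenvector's proportions.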
Now instead of a stochastic matrix let us consider any square matrix A. Let p be the time step at which the sequence $x_0, Ax_0, A^2 x_0, \dots$ approaches a multiple of e_d (the dominant eigenvector of A):

$$A^p x_0 = k e_d$$
$$A^{p+1} x_0 = A(A^p x_0) = k A e_d = k \lambda_d e_d$$
$$A^{p+2} x_0 = A(A^{p+1} x_0) = k \lambda_d A e_d = k \lambda_d^2 e_d$$
$$A^{p+n} x_0 = k \lambda_d^n e_d$$

In general, if λ_d is the dominant eigenvalue of a matrix A, what would happen to the sequence $x_0, Ax_0, A^2 x_0, \dots$ if

|λ_d| > 1 (it will explode)
|λ_d| < 1 (it will vanish)
|λ_d| = 1 (it will reach a steady state)

(We will use this in the course at some point.)
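The three regimes can be seen by tracking $\|A^n x_0\|$ for matrices whose dominant eigenvalue is greater than, less than, and equal to 1. A sketch using simple diagonal matrices with known eigenvalues (assumed for illustration):

```python
import numpy as np

x0 = np.array([1., 1.])
norms = {}

for lam in (1.5, 0.5, 1.0):          # dominant eigenvalue > 1, < 1, = 1
    A = np.diag([lam, 0.2])          # diagonal, so the eigenvalues are lam and 0.2
    x = x0.copy()
    for _ in range(40):              # x <- A x, i.e. x = A^40 x0
        x = A @ x
    norms[lam] = np.linalg.norm(x)

print(norms)   # ~1.1e7 (explodes), ~9e-13 (vanishes), ~1.0 (steady state)
```

This is exactly the behaviour that will later matter for repeated matrix multiplications (e.g. in recurrent computations): powers of a matrix are governed by its dominant eigenvalue.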
Module 6.2 : Linear Algebra - Basic Definitions

We will see some more examples where eigenvectors are important, but before
that let’s revisit some basic definitions from linear algebra.

Basis
A set of vectors ∈ Rⁿ is called a basis if they are linearly independent and every vector ∈ Rⁿ can be expressed as a linear combination of these vectors.

Linearly independent vectors
A set of n vectors v1, v2, …, vn is linearly independent if no vector in the set can be expressed as a linear combination of the remaining n − 1 vectors. In other words, the only solution to

$$c_1 v_1 + c_2 v_2 + \dots + c_n v_n = 0$$

is c1 = c2 = ⋯ = cn = 0 (the ci's are scalars).
For example, consider the space R². Now consider the vectors

$$x = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

Any vector [a, b]ᵀ ∈ R² can be expressed as a linear combination of these two vectors, i.e.,

$$\begin{bmatrix} a \\ b \end{bmatrix} = a \begin{bmatrix} 1 \\ 0 \end{bmatrix} + b \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

Further, x and y are linearly independent. (The only solution to c1 x + c2 y = 0 is c1 = c2 = 0.)
In fact, it turns out that x and y are unit vectors in the direction of the co-ordinate axes. And indeed we are used to representing all vectors in R² as a linear combination of these two vectors.

But there is nothing sacrosanct about the particular choice of x and y. We could have chosen any 2 linearly independent vectors in R² as the basis vectors. For example, consider the linearly independent vectors [2, 3]ᵀ and [5, 7]ᵀ. See how any vector [a, b]ᵀ ∈ R² can be expressed as a linear combination of these two vectors:

$$\begin{bmatrix} a \\ b \end{bmatrix} = x_1 \begin{bmatrix} 2 \\ 3 \end{bmatrix} + x_2 \begin{bmatrix} 5 \\ 7 \end{bmatrix}$$

$$a = 2x_1 + 5x_2$$
$$b = 3x_1 + 7x_2$$

We can find x1 and x2 by solving this system of linear equations.
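Solving for x1 and x2 amounts to a 2 × 2 linear system. A sketch with an assumed target vector [a, b] = [4, 5] (the choice of target is illustrative, not from the slides):

```python
import numpy as np

B = np.array([[2., 5.],                # columns are the basis vectors [2,3] and [5,7]
              [3., 7.]])
target = np.array([4., 5.])            # assumed example vector [a, b]

x1, x2 = np.linalg.solve(B, target)    # solves  a = 2*x1 + 5*x2,  b = 3*x1 + 7*x2
print(x1, x2)                          # -3.0 2.0

# sanity check: the combination reconstructs the target
print(x1 * np.array([2., 3.]) + x2 * np.array([5., 7.]))   # [4. 5.]
```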
In general, given a set of linearly independent vectors u1, u2, …, un ∈ Rⁿ, we can express any vector z ∈ Rⁿ as a linear combination of these vectors:

$$z = \alpha_1 u_1 + \alpha_2 u_2 + \dots + \alpha_n u_n$$

$$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix} = \alpha_1 \begin{bmatrix} u_{11} \\ u_{12} \\ \vdots \\ u_{1n} \end{bmatrix} + \alpha_2 \begin{bmatrix} u_{21} \\ u_{22} \\ \vdots \\ u_{2n} \end{bmatrix} + \dots + \alpha_n \begin{bmatrix} u_{n1} \\ u_{n2} \\ \vdots \\ u_{nn} \end{bmatrix}$$

Basically rewriting in matrix form:

$$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix} = \begin{bmatrix} u_{11} & u_{21} & \dots & u_{n1} \\ u_{12} & u_{22} & \dots & u_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ u_{1n} & u_{2n} & \dots & u_{nn} \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{bmatrix}$$

We can now find the αi's using Gaussian elimination (time complexity: O(n³)).
Now let us see what happens if we have an orthonormal basis:

$$u_i^T u_j = 0 \ \ \forall i \neq j \quad \text{and} \quad u_i^T u_i = \|u_i\|^2 = 1$$

(Figure: the vector z with its projections α1 and α2 onto the orthonormal directions u1 and u2, where θ is the angle between z and u1.)

Again we have:

$$z = \alpha_1 u_1 + \alpha_2 u_2 + \dots + \alpha_n u_n$$
$$u_1^T z = \alpha_1 u_1^T u_1 + \dots + \alpha_n u_1^T u_n = \alpha_1$$

We can directly find each αi using a dot product between z and ui (time complexity O(n)). The total complexity will be O(n²).

Geometrically,

$$\alpha_1 = |\vec{z}| \cos\theta = |\vec{z}| \frac{z^T u_1}{|\vec{z}||u_1|} = z^T u_1$$

and similarly α2 = zᵀu2. When u1 and u2 are unit vectors along the co-ordinate axes:

$$z = \begin{bmatrix} a \\ b \end{bmatrix} = a \begin{bmatrix} 1 \\ 0 \end{bmatrix} + b \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$
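With an orthonormal basis the coefficients come straight out of dot products, with no linear system to solve. A sketch using an assumed rotated orthonormal pair in R² (the angle and target vector are illustrative):

```python
import numpy as np

theta = 0.3                                      # assumed rotation angle
u1 = np.array([np.cos(theta), np.sin(theta)])
u2 = np.array([-np.sin(theta), np.cos(theta)])   # orthonormal: u1.u2 = 0, |u1| = |u2| = 1

z = np.array([4., 5.])                           # assumed example vector

alpha1 = z @ u1                                  # each coefficient is one O(n) dot product
alpha2 = z @ u2
print(alpha1 * u1 + alpha2 * u2)                 # [4. 5.] -- reconstructs z exactly
```

For n basis vectors in Rⁿ this is n dot products of length n each, hence the O(n²) total cost, versus O(n³) for Gaussian elimination with a general basis.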
Remember
An orthogonal basis is the most convenient basis that one can hope for.

But what does any of this have to do with eigenvectors?
Turns out that the eigenvectors can form a basis.
In fact, the eigenvectors of a square symmetric matrix are even more special.
Thus they form a very convenient basis.
Why would we want to use the eigenvectors as a basis instead of the more
natural co-ordinate axes? We will answer this question soon.

Theorem 1
The eigenvectors of a matrix A ∈ R^{n×n} having distinct eigenvalues are
linearly independent.
Proof: See here

Theorem 2
The eigenvectors of a square symmetric matrix are orthogonal.
Proof: See here

19/71
Module 6.3 : Eigenvalue Decomposition

20/71

Before proceeding let's do a quick recap of eigenvalue decomposition.

21/71
Let u_1, u_2, ..., u_n be the eigenvectors of a matrix A and let
λ_1, λ_2, ..., λ_n be the corresponding eigenvalues.
Consider a matrix U whose columns are u_1, u_2, ..., u_n. Now

    AU = A [u_1  u_2  ...  u_n]
       = [Au_1  Au_2  ...  Au_n]
       = [λ_1 u_1  λ_2 u_2  ...  λ_n u_n]
       = [u_1  u_2  ...  u_n] diag(λ_1, λ_2, ..., λ_n)
       = UΛ

where Λ is a diagonal matrix whose diagonal elements are the eigenvalues of A.
22/71
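A quick numerical check of AU = UΛ (a sketch, not part of the lecture; the matrix is the A = [[1, 2], [2, 1]] example used earlier, NumPy assumed):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])          # the symmetric example from earlier slides

lam, U = np.linalg.eig(A)           # columns of U are eigenvectors, lam the eigenvalues
Lam = np.diag(lam)                  # Λ: diagonal matrix of eigenvalues

assert np.allclose(A @ U, U @ Lam)  # AU = UΛ holds numerically
```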
AU = UΛ

If U^{-1} exists, then we can write,

    A = U Λ U^{-1}        [eigenvalue decomposition]
    U^{-1} A U = Λ        [diagonalization of A]

Under what conditions would U^{-1} exist?
If the columns of U are linearly independent [See proof here]
i.e. if A has n linearly independent eigenvectors.
i.e. if A has n distinct eigenvalues [sufficient condition, proof: Slide 19,
Theorem 1]

23/71
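Both identities can be verified directly (a sketch with the same example matrix, NumPy assumed):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])

lam, U = np.linalg.eig(A)
Lam = np.diag(lam)

# Eigenvalue decomposition: A = U Λ U^{-1}
assert np.allclose(U @ Lam @ np.linalg.inv(U), A)

# Diagonalization: U^{-1} A U = Λ
assert np.allclose(np.linalg.inv(U) @ A @ U, Lam)
```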
If A is symmetric then the situation is even more convenient.
The eigenvectors are orthogonal [proof: Slide 19, Theorem 2].
Further let's assume that the eigenvectors have been normalized [u_i^T u_i = 1].
Consider

    Q = U^T U    (the rows of U^T are u_1^T, ..., u_n^T and the columns of U
                  are u_1, ..., u_n)

Each cell of the matrix, Q_ij, is given by u_i^T u_j:

    Q_ij = u_i^T u_j = 0  if i ≠ j
                     = 1  if i = j

∴ U^T U = I (the identity matrix)
U^T is the inverse of U (very convenient to calculate)
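This convenience shows up directly in NumPy (a sketch, not part of the lecture): for a symmetric matrix, np.linalg.eigh returns an orthonormal eigenvector matrix, so U^T U = I and U^{-1} = U^T.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])

lam, U = np.linalg.eigh(A)           # eigh is meant for symmetric/Hermitian matrices
assert np.allclose(U.T @ U, np.eye(2))           # orthonormal columns: U^T U = I
assert np.allclose(U @ np.diag(lam) @ U.T, A)    # A = U Λ U^T, no inverse needed
```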
Given the EVD, A = U Λ U^T,
what can you say about the sequence x_0, A x_0, A^2 x_0, ... in terms of the
eigenvalues of A?
(Hint: You should arrive at the same conclusion we saw earlier)

25/71
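A numerical sketch of one plausible answer (this is the power-iteration behaviour; the lecture's earlier conclusion is assumed, not quoted): repeatedly applying A makes A^k x_0 align with the eigenvector of the largest-magnitude eigenvalue.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])           # eigenvalues 3 and -1

x = np.array([1.0, 0.0])             # an arbitrary starting vector x_0
for _ in range(50):
    x = A @ x
    x = x / np.linalg.norm(x)        # normalize each step to avoid overflow

u_max = np.array([1.0, 1.0]) / np.sqrt(2)   # eigenvector for λ = 3
assert np.isclose(abs(x @ u_max), 1.0)      # x has aligned with u_max (up to sign)
```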
Theorem (one more important property of eigenvectors)
If A is a square symmetric N × N matrix, then the solution to the following
optimization problem is given by the eigenvector corresponding to the largest
eigenvalue of A:

    max_x  x^T A x
    s.t.   ||x|| = 1

and the solution to

    min_x  x^T A x
    s.t.   ||x|| = 1

is given by the eigenvector corresponding to the smallest eigenvalue of A.
Proof: Next slide.

26/71
This is a constrained optimization problem that can be solved using Lagrange
multipliers:

    L = x^T A x − λ(x^T x − 1)
    ∂L/∂x = 2Ax − λ(2x) = 0  =>  Ax = λx

Hence x must be an eigenvector of A with eigenvalue λ.
Multiplying by x^T:

    x^T A x = λ x^T x = λ    (since x^T x = 1)

Therefore, the critical points of this constrained problem are the eigenvalues
of A.
The maximum value is the largest eigenvalue, while the minimum value is the
smallest eigenvalue.
27/71
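A sanity-check sketch of the theorem (random sampling, not part of the lecture; NumPy assumed): x^T A x over unit vectors never exceeds λ_max or drops below λ_min.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 2.0],
              [2.0, 1.0]])           # eigenvalues -1 and 3

lam = np.linalg.eigvalsh(A)          # sorted ascending: [λ_min, λ_max]
for _ in range(1000):
    x = rng.normal(size=2)
    x /= np.linalg.norm(x)           # random point on the unit sphere
    q = x @ A @ x                    # the quadratic form x^T A x
    assert lam[0] - 1e-9 <= q <= lam[1] + 1e-9
```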
The story so far...
The eigenvectors corresponding to different eigenvalues are linearly
independent.
The eigenvectors of a square symmetric matrix are orthogonal.
The eigenvectors of a square symmetric matrix can thus form a convenient basis.
We will put all of this to use.

28/71
Module 6.4 : Principal Component Analysis and its Interpretations

29/71

Over the next few slides we will introduce Principal Component Analysis and
see three different interpretations of it.

30/71
[Figure: data points scattered in the x-y plane]
Consider the following data.
Each point (vector) here is represented using a linear combination of the x
and y axes (i.e. using the point's x and y co-ordinates).
In other words we are using x and y as the basis.
What if we choose a different basis?

31/71
[Figure: the same data with the alternative basis vectors u_1 and u_2 overlaid]
For example, what if we use u_1 and u_2 as a basis instead of x and y?
We observe that all the points have a very small component in the direction
of u_2 (almost noise).
It seems that the same data which was originally in R^2 (x, y) can now be
represented in R^1 (u_1) by making a smarter choice for the basis.

32/71
Let's try stating this more formally.
Why do we not care about u_2?
Because the variance in the data in this direction is very small (all data
points have almost the same value in the u_2 direction).
If we were to build a classifier on top of this data then u_2 would not
contribute to the classifier as the points are not distinguishable along this
direction.

33/71
In general, we are interested in representing the data using fewer dimensions
such that the data has high variance along these dimensions.
Is that all?
No, there is something else that we desire. Let's see what.

34/71
Consider the following data:

    x     y     z
    1     1     1
    0.5   0     0
    0.25  1     1
    0.35  1.5   1.5
    0.45  1     1
    0.57  2     2.1
    0.62  1.1   1
    0.73  0.75  0.76
    0.72  0.86  0.87

Is z adding any new information beyond what is already contained in y?
The two columns are highly correlated (or they have a high covariance).
In other words the column z is redundant since it is linearly dependent on y.

    ρ_yz = Σ_{i=1}^n (y_i − ȳ)(z_i − z̄) /
           ( √(Σ_{i=1}^n (y_i − ȳ)^2) · √(Σ_{i=1}^n (z_i − z̄)^2) )

35/71
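The formula for ρ_yz can be evaluated directly on the y and z columns of the table above (a sketch, NumPy assumed); the two columns are nearly identical, so the correlation comes out very close to 1:

```python
import numpy as np

# The y and z columns of the table above.
y = np.array([1, 0, 1, 1.5, 1, 2,   1.1, 0.75, 0.86])
z = np.array([1, 0, 1, 1.5, 1, 2.1, 1,   0.76, 0.87])

yc, zc = y - y.mean(), z - z.mean()                       # center both columns
rho = (yc @ zc) / (np.sqrt(yc @ yc) * np.sqrt(zc @ zc))   # the formula above

assert np.isclose(rho, np.corrcoef(y, z)[0, 1])           # matches NumPy's corrcoef
assert rho > 0.95                                         # highly correlated
```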
In general, we are interested in representing the data using fewer dimensions
such that
  the data has high variance along these dimensions
  the dimensions are linearly independent (uncorrelated)
  (even better if they are orthogonal because that is a very convenient basis)

36/71
Let p_1, p_2, ..., p_n be a set of n such linearly independent orthonormal
vectors. Let P be an n × n matrix such that p_1, p_2, ..., p_n are the columns
of P.

Let x_1, x_2, ..., x_m ∈ R^n be m data points and let X be a matrix such that
x_1, x_2, ..., x_m are the rows of this matrix. Further let us assume that the
data is 0-mean and unit variance.

We want to represent each x_i using this new basis P:

    x_i = α_i1 p_1 + α_i2 p_2 + α_i3 p_3 + ... + α_in p_n

For an orthonormal basis we know that we can find these α_ij's using

    α_ij = x_i^T p_j

37/71
In general, the transformed data x̂_i is given by

    x̂_i = x_i^T P = x_i^T [p_1  ...  p_n]

and

    X̂ = XP    (X̂ is the matrix of transformed points)

38/71
Theorem:
If X is a matrix such that its columns have zero mean and if X̂ = XP then the
columns of X̂ will also have zero mean.
Proof: For any matrix A, 1^T A gives us a row vector whose i-th element
contains the sum of the i-th column of A (this is easy to see using the
row-column picture of matrix multiplication). Consider

    1^T X̂ = 1^T XP = (1^T X)P

But 1^T X is the row vector containing the sums of the columns of X. Thus
1^T X = 0. Therefore, 1^T X̂ = 0.
Hence the transformed matrix also has columns with sum = 0.

Theorem:
X^T X is a symmetric matrix.
Proof: We can write (X^T X)^T = X^T (X^T)^T = X^T X.

39/71
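The first theorem is easy to check numerically (a sketch with random made-up data, NumPy assumed): center X column-wise, transform by any P, and the column means stay zero.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)               # make every column zero-mean

P = rng.normal(size=(3, 3))          # any 3x3 transform exhibits the property
X_hat = X @ P                        # X̂ = XP

# 1^T X̂ = (1^T X) P = 0, so the transformed columns are also zero-mean.
assert np.allclose(X_hat.mean(axis=0), 0.0)
```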
Definition:
If X is a matrix whose columns are zero mean then Σ = (1/m) X^T X is the
covariance matrix. In other words each entry Σ_ij stores the covariance
between columns i and j of X.

Explanation: Let C be the covariance matrix of X. Let μ_i, μ_j denote the
means of the i-th and j-th columns of X respectively. Then by definition of
covariance, we can write:

    C_ij = (1/m) Σ_{k=1}^m (X_ki − μ_i)(X_kj − μ_j)
         = (1/m) Σ_{k=1}^m X_ki X_kj        (∵ μ_i = μ_j = 0)
         = (1/m) X_i^T X_j = (1/m) (X^T X)_ij

40/71
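For zero-mean columns this definition agrees with NumPy's biased covariance estimator (a sketch with random made-up data; np.cov with bias=True divides by m rather than m − 1):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)               # zero-mean columns, as the definition assumes

m = X.shape[0]
Sigma = (X.T @ X) / m                # Σ = (1/m) X^T X

# rowvar=False: columns are the variables; bias=True: divide by m, not m-1.
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```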
X̂ = XP

Using the previous theorem & definition, we get that (1/m) X̂^T X̂ is the
covariance matrix of the transformed data. We can write:

    (1/m) X̂^T X̂ = (1/m) (XP)^T XP = (1/m) P^T X^T X P
                 = P^T ((1/m) X^T X) P = P^T Σ P

Each cell i, j of the covariance matrix (1/m) X̂^T X̂ stores the covariance
between columns i and j of X̂.
Ideally we want,

    ((1/m) X̂^T X̂)_ij = 0   if i ≠ j   (covariance = 0)
    ((1/m) X̂^T X̂)_ij ≠ 0   if i = j   (variance ≠ 0)

In other words, we want

    (1/m) X̂^T X̂ = P^T Σ P = D    [where D is a diagonal matrix]

41/71
We want,

    P^T Σ P = D

But Σ is a square matrix and P is an orthogonal matrix.
Which orthogonal matrix satisfies the following condition?

    P^T Σ P = D

In other words, which orthogonal matrix P diagonalizes Σ?
Answer: A matrix P whose columns are the eigenvectors of Σ = (1/m) X^T X
(equivalently, of X^T X, which has the same eigenvectors) [by eigenvalue
decomposition].
Thus, the new basis P used to transform X is the basis consisting of the
eigenvectors of X^T X.

42/71
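The whole argument can be verified end to end (a sketch with made-up correlated data, NumPy assumed): take P to be the eigenvectors of Σ, and P^T Σ P comes out diagonal, i.e. the transformed dimensions are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
# Mix independent noise through an upper-triangular matrix to correlate columns.
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 1.0, 0.0],
                                          [0.0, 1.0, 0.5],
                                          [0.0, 0.0, 0.2]])
X = X - X.mean(axis=0)

m = X.shape[0]
Sigma = (X.T @ X) / m
eigvals, P = np.linalg.eigh(Sigma)   # Σ symmetric => orthonormal eigenvectors

D = P.T @ Sigma @ P
off_diag = D - np.diag(np.diag(D))
assert np.allclose(off_diag, 0.0)    # transformed covariances vanish
```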
Why is this a good basis?
Because the eigenvectors of X^T X are linearly independent (proof: Slide 19,
Theorem 1).
And because the eigenvectors of X^T X are orthogonal (∵ X^T X is symmetric,
as proved earlier).
This method, which transforms the data to a new basis where the dimensions
are non-redundant (low covariance) & not noisy (high variance), is called
Principal Component Analysis.
In practice, we select only the top-k dimensions along which the variance is
high (this will become more clear when we look at an alternate interpretation
of PCA).

43/71
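Putting the pieces together, a minimal top-k PCA sketch under the slide's assumptions (zero-mean data; the function name pca_topk is a hypothetical helper, not from the lecture):

```python
import numpy as np

def pca_topk(X, k):
    """Project X onto the k eigenvectors of Σ with the largest eigenvalues."""
    X = X - X.mean(axis=0)                      # ensure zero-mean columns
    Sigma = (X.T @ X) / X.shape[0]              # covariance matrix
    eigvals, P = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
    top = P[:, np.argsort(eigvals)[::-1][:k]]   # k highest-variance directions
    return X @ top                              # co-ordinates in the new basis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca_topk(X, 2)
assert Z.shape == (100, 2)                      # 5 dimensions reduced to 2
```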
Module 6.5 : PCA : Interpretation 2

44/71
Given n orthogonal linearly independent vectors P = p_1, p_2, ..., p_n we can
represent x_i exactly as a linear combination of these vectors:

    x_i = Σ_{j=1}^n α_ij p_j    [we know how to estimate the α_ij's but we will
                                 come back to that later]

But we are interested only in the top-k dimensions (we want to get rid of
noisy & redundant dimensions):

    x̂_i = Σ_{j=1}^k α_ij p_j

We want to select the p_i's such that we minimise the reconstruction error

    e = Σ_{i=1}^m (x_i − x̂_i)^T (x_i − x̂_i)

45/71
e = \sum_{i=1}^{m} (x_i - \hat{x}_i)^T (x_i - \hat{x}_i)

= \sum_{i=1}^{m} \left( \sum_{j=1}^{n} \alpha_{ij} p_j - \sum_{j=1}^{k} \alpha_{ij} p_j \right)^2

= \sum_{i=1}^{m} \left( \sum_{j=k+1}^{n} \alpha_{ij} p_j \right)^2 = \sum_{i=1}^{m} \left( \sum_{j=k+1}^{n} \alpha_{ij} p_j \right)^T \left( \sum_{j=k+1}^{n} \alpha_{ij} p_j \right)

= \sum_{i=1}^{m} (\alpha_{i,k+1} p_{k+1} + \alpha_{i,k+2} p_{k+2} + \cdots + \alpha_{i,n} p_n)^T (\alpha_{i,k+1} p_{k+1} + \alpha_{i,k+2} p_{k+2} + \cdots + \alpha_{i,n} p_n)

= \sum_{i=1}^{m} \sum_{j=k+1}^{n} \alpha_{ij} p_j^T p_j \alpha_{ij} + \sum_{i=1}^{m} \sum_{j=k+1}^{n} \sum_{\substack{L=k+1 \\ L \neq j}}^{n} \alpha_{ij} p_j^T p_L \alpha_{iL}

= \sum_{i=1}^{m} \sum_{j=k+1}^{n} \alpha_{ij}^2 \quad (\because p_j^T p_j = 1,\ p_i^T p_j = 0\ \forall i \neq j)

= \sum_{i=1}^{m} \sum_{j=k+1}^{n} \left( x_i^T p_j \right)^2

= \sum_{i=1}^{m} \sum_{j=k+1}^{n} p_j^T x_i x_i^T p_j

= \sum_{j=k+1}^{n} p_j^T \left( \sum_{i=1}^{m} x_i x_i^T \right) p_j

= \sum_{j=k+1}^{n} p_j^T\, m C\, p_j \qquad \left[ \because \frac{1}{m} \sum_{i=1}^{m} x_i x_i^T = \frac{X^T X}{m} = C \right]
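The reconstruction-error identity derived above (the error of a top-k reconstruction equals the sum over the discarded directions of p_j^T m C p_j) can be checked numerically. A minimal sketch on synthetic zero-mean data (our own example, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 200, 4, 2
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))  # correlated data
X = X - X.mean(axis=0)                                 # zero-mean

_, P = np.linalg.eigh(X.T @ X)
P = P[:, ::-1]                                         # top eigenvectors first

alpha = X @ P                                          # alpha_ij = x_i^T p_j
X_rec = alpha[:, :k] @ P[:, :k].T                      # keep only top-k terms

# e = sum_i (x_i - x_hat_i)^T (x_i - x_hat_i)
e = np.sum((X - X_rec) ** 2)
# closed form: sum over discarded directions of p_j^T (X^T X) p_j = p_j^T m C p_j
e_closed = sum(P[:, j] @ X.T @ X @ P[:, j] for j in range(k, n))
assert np.isclose(e, e_closed)
```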
We want to minimize e:

\min_{p_{k+1}, p_{k+2}, \cdots, p_n} \sum_{j=k+1}^{n} p_j^T\, m C\, p_j \quad \text{s.t.} \quad p_j^T p_j = 1 \ \forall j = k+1, k+2, \cdots, n

The solution to the above problem is given by the eigenvectors corresponding to the smallest eigenvalues of C (Proof: refer Slide 26).
Thus we select P = p_1, p_2, ..., p_n as the eigenvectors of C and retain only the top-k eigenvectors to express the data [or discard the eigenvectors k + 1, ..., n].
Key Idea
Minimize the error in reconstructing x_i after projecting the data onto a new basis.

Let's look at the 'Reconstruction Error' in the context of our toy example.
u_1 = [1, 1] and u_2 = [−1, 1] are the new basis vectors.
Let us convert them to unit vectors:

u_1 = \left[ \tfrac{1}{\sqrt{2}},\ \tfrac{1}{\sqrt{2}} \right] \quad \& \quad u_2 = \left[ \tfrac{-1}{\sqrt{2}},\ \tfrac{1}{\sqrt{2}} \right]

Consider the point x = [3.3, 3] in the original data:

\alpha_1 = x^T u_1 = 6.3/\sqrt{2}
\alpha_2 = x^T u_2 = -0.3/\sqrt{2}

The perfect reconstruction of x is given by (using n = 2 dimensions)

x = \alpha_1 u_1 + \alpha_2 u_2 = [3.3,\ 3]

But we are going to reconstruct it using fewer dimensions (only k = 1 < n, ignoring the low-variance u_2 dimension):

\hat{x} = \alpha_1 u_1 = [3.15,\ 3.15]

(reconstruction with minimum error)
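The toy example above can be reproduced in a few lines of numpy; the numbers match the slide's values:

```python
import numpy as np

x = np.array([3.3, 3.0])
u1 = np.array([1.0, 1.0]) / np.sqrt(2)    # unit vector along [1, 1]
u2 = np.array([-1.0, 1.0]) / np.sqrt(2)   # unit vector along [-1, 1]

alpha1 = x @ u1                           # = 6.3 / sqrt(2)
alpha2 = x @ u2                           # = -0.3 / sqrt(2)

x_full = alpha1 * u1 + alpha2 * u2        # perfect reconstruction (n = 2 dims)
x_hat = alpha1 * u1                       # k = 1 reconstruction

assert np.allclose(x_full, [3.3, 3.0])
assert np.allclose(x_hat, [3.15, 3.15])
```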
Recap
The eigenvectors of a matrix with distinct eigenvalues are linearly independent.
The eigenvectors of a square symmetric matrix are orthogonal.
PCA exploits this fact by representing the data using a new basis comprising only the top-k eigenvectors.
The n − k dimensions which contribute very little to the reconstruction error are discarded.
These are also the directions along which the variance is minimum.
Module 6.6 : PCA : Interpretation 3
We started off with the following wishlist:
We are interested in representing the data using fewer dimensions such that
the dimensions have low covariance
the dimensions have high variance
So far we have paid a lot of attention to the covariance.
It has indeed played a central role in all our analysis.
But what about variance? Have we achieved our stated goal of high variance along dimensions?
To answer this question we will see yet another interpretation of PCA.
The i-th dimension of the transformed data \hat{X} is given by

\hat{X}_i = X p_i

The variance along this dimension is given by

\frac{\hat{X}_i^T \hat{X}_i}{m} = \frac{1}{m} p_i^T X^T X p_i
= \frac{1}{m} p_i^T \lambda_i p_i \quad [\because p_i \text{ is an eigenvector of } X^T X]
= \frac{1}{m} \lambda_i \underbrace{p_i^T p_i}_{=1}
= \frac{\lambda_i}{m}

Thus the variance along the i-th dimension (the i-th eigenvector of X^T X) is given by the corresponding (scaled) eigenvalue.
Hence, we did the right thing by discarding the dimensions (eigenvectors) corresponding to lower eigenvalues!
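The variance-equals-λ_i/m fact derived above is easy to verify numerically. A small sketch on synthetic zero-mean data (our own example):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 1000
X = rng.normal(size=(m, 3)) * np.array([3.0, 1.0, 0.1])  # unequal variances
X = X - X.mean(axis=0)                                    # zero-mean

lam, P = np.linalg.eigh(X.T @ X)       # eigenvalues lam_i, eigenvectors p_i

for i in range(3):
    Xi = X @ P[:, i]                   # i-th dimension of the transformed data
    var_i = (Xi @ Xi) / m              # variance along p_i
    assert np.isclose(var_i, lam[i] / m)   # equals lambda_i / m
```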
A Quick Summary
We have seen 3 different interpretations of PCA:
It ensures that the covariance between the new dimensions is minimized.
It picks dimensions such that the data exhibits a high variance across these dimensions.
It ensures that the data can be represented using fewer dimensions.
Module 6.7 : PCA : Practical Example
Suppose we are given a large number of images of human faces (say, m images).
Each image is 100 × 100 [10K dimensions].
We would like to represent and store the images using much fewer dimensions (around 50-200).
We construct a matrix X ∈ R^{m×10K}.
Each row of the matrix corresponds to 1 image.
Each image is represented using 10K dimensions.
X ∈ R^{m×10K} (as explained on the previous slide).
We retain the top 100 dimensions corresponding to the top 100 eigenvectors of X^T X.
Note that X^T X is an n × n matrix, so its eigenvectors will be n-dimensional (n = 10K in this case).
We can convert each eigenvector into a 100 × 100 matrix and treat it as an image.
Let's see what we get.
What we have plotted here are the first 16 eigenvectors of X^T X (basically, treating each 10K-dimensional eigenvector as a 100 × 100 image).
These images are called eigenfaces and form a basis for representing any face in our database.
In other words, we can now represent a given image (face) as a linear combination of these eigenfaces:

\sum_{i=1}^{k} \alpha_{1i} p_i \quad [\text{the slides show reconstructions with } k = 1, 2, 4, 8, 12, 16]

In practice, we just need to store p_1, p_2, ..., p_k (one-time storage).
Then for each image i we just need to store the scalar values α_{i1}, α_{i2}, ..., α_{ik}.
This significantly reduces the storage cost without much loss in image quality.
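The storage scheme described above can be sketched as follows. This is a toy sketch: random data stands in for the face images, and the sizes are shrunk (m = 50 "images" of n = 64 "pixels", k = 10 eigenfaces) so it runs instantly; the real example uses n = 10K.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 50, 64, 10                  # m "images", n pixels each, keep k eigenfaces
X = rng.normal(size=(m, n))           # stand-in for the face-image matrix
X = X - X.mean(axis=0)

# Top-k eigenvectors of X^T X = the eigenfaces p_1, ..., p_k (stored once)
_, P = np.linalg.eigh(X.T @ X)
P_k = P[:, ::-1][:, :k]               # n x k

# Per image we store only k scalars alpha_i1, ..., alpha_ik
alphas = X @ P_k                      # m x k

# Reconstruction: each image as a linear combination of the eigenfaces
X_rec = alphas @ P_k.T
# Storage drops from m*n values to n*k (eigenfaces) + m*k (coefficients)
```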
Module 6.8 : Singular Value Decomposition
Let us get some more perspective on eigenvectors before moving ahead.
Let v_1, v_2, ..., v_n be the eigenvectors of A and let λ_1, λ_2, ..., λ_n be the corresponding eigenvalues:

A v_1 = \lambda_1 v_1, \quad A v_2 = \lambda_2 v_2, \quad \cdots, \quad A v_n = \lambda_n v_n

If a vector x in R^n is represented using v_1, v_2, ..., v_n as a basis, then

x = \sum_{i=1}^{n} \alpha_i v_i

\text{Now, } Ax = \sum_{i=1}^{n} \alpha_i A v_i = \sum_{i=1}^{n} \alpha_i \lambda_i v_i

The matrix multiplication reduces to a scalar multiplication if the eigenvectors of A are used as a basis.
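This "matrix multiplication becomes scalar multiplication" fact can be verified directly. A sketch using a random symmetric matrix (our own example), whose eigenvectors form an orthogonal basis:

```python
import numpy as np

rng = np.random.default_rng(4)
S = rng.normal(size=(4, 4))
A = S + S.T                            # symmetric => orthogonal eigenbasis

lam, V = np.linalg.eigh(A)             # A v_i = lam_i v_i
x = rng.normal(size=4)

alpha = V.T @ x                        # coordinates of x in the eigenbasis
Ax_scalar = V @ (lam * alpha)          # sum_i alpha_i lam_i v_i : only scalars!
assert np.allclose(A @ x, Ax_scalar)   # matches the ordinary matrix product
```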
So far all the discussion was centered around square matrices (A ∈ R^{n×n}).
What about rectangular matrices A ∈ R^{m×n}? Can they have eigenvectors?
Is it possible to have A_{m×n} x_{n×1} = λ x_{n×1}? Not possible!
The result of A_{m×n} x_{n×1} is a vector belonging to R^m (whereas x ∈ R^n).
So do we miss out on the advantage that a basis of eigenvectors provides for square matrices (i.e. converting matrix multiplications into scalar multiplications)?
We will see the answer to this question over the next few slides.
Note that the matrix A_{m×n} provides a transformation R^n → R^m.
What if we could have pairs of vectors (v_1, u_1), (v_2, u_2), ..., (v_k, u_k) such that v_i ∈ R^n, u_i ∈ R^m and A v_i = σ_i u_i?
Further, let's assume that v_1, ..., v_k, ..., v_n are orthogonal & thus form a basis V in R^n.
Similarly, let's assume that u_1, ..., u_k, ..., u_m are orthogonal & thus form a basis U in R^m.
Now what if every vector x ∈ R^n is represented using the basis V:

x = \sum_{i=1}^{k} \alpha_i v_i \quad [\text{note we are using } k \text{ instead of } n\text{; will clarify this in a minute}]

Ax = \sum_{i=1}^{k} \alpha_i A v_i = \sum_{i=1}^{k} \alpha_i \sigma_i u_i

Once again the matrix multiplication reduces to a scalar multiplication.
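Such pairs (v_i, u_i) with scalars σ_i are exactly what the singular value decomposition provides. A sketch checking A v_i = σ_i u_i on a random rectangular matrix (our own example):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(3, 5))            # rectangular: m = 3, n = 5

U, sigma, Vt = np.linalg.svd(A)        # A = U Sigma V^T
V = Vt.T

k = np.linalg.matrix_rank(A)           # here k = 3 (random matrix, full row rank)
for i in range(k):
    # the pairs (v_i, u_i) satisfy A v_i = sigma_i u_i
    assert np.allclose(A @ V[:, i], sigma[i] * U[:, i])
```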
Let’s look at a geometric interpretation of this

65/71
Prof. Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 6
Prof. Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 6
A
n Co R m
R ace lum
p n
ows of spac
R of A A e

dim=k=rank(A)
dim=k=rank(A)

Rn - Space of all vectors which can multiply with A to give Ax [ this is the space of
inputs of the function]
Rm - Space of all vectors which are outputs of the function Ax
We are interested in finding a basis U , V such that
V - basis for inputs
U - basis for outputs
such that if the inputs and outputs are represented using this basis then the operation
Ax reduces to a scalar operation 66/71
Prof. Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 6
What do we mean by saying that the dimension of the row space is k? If x ∈ R^n, then why is the dimension not n?
It means that of all the possible vectors in R^n, only those lying in the k-dimensional row space can act as inputs to Ax and produce a non-zero output. The remaining directions span the (n − k)-dimensional null space, and any component of x along them produces a zero output.
Hence we need only k dimensions to represent x:

$$x = \sum_{i=1}^{k} \alpha_i v_i$$
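A small numpy sketch of this, using a hypothetical rank-1 matrix (not from the lecture): inputs along the row space produce non-zero outputs, while inputs in the null space are mapped to zero.

```python
import numpy as np

# Rank-1 matrix: its row space is 1-dimensional, so only the component of x
# along that direction survives; the orthogonal component lies in the null
# space and is mapped to zero.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])     # second row = 2 x first row, so rank(A) = 1

x_row = np.array([1.0, 2.0])   # along the row space of A
x_null = np.array([2.0, -1.0]) # orthogonal to every row -> null space

assert np.allclose(A @ x_null, 0)     # null-space input gives zero output
assert not np.allclose(A @ x_row, 0)  # row-space input does not
```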
Let’s look at a way of writing this as a matrix operation

$$A_{m \times n} V_{n \times k} = U_{m \times k} \underbrace{\Sigma_{k \times k}}_{\text{diagonal matrix}}$$

## If we have k orthogonal vectors (V_{n×k}) then using Gram-Schmidt

orthogonalization, we can find n − k more orthogonal vectors to complete the
basis for R^n [we can do the same for U]

$$A_{m \times n} V_{n \times n} = U_{m \times m} \Sigma_{m \times n}$$

$$U^T A V = \Sigma \;\; [U^{-1} = U^T] \qquad A = U \Sigma V^T \;\; [V^{-1} = V^T]$$

Σ is a diagonal matrix with only the first k diagonal elements non-zero
Now the question is how do we find V, U and Σ
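This factorization can be checked with numpy's full SVD; a sketch with a small illustrative matrix (not from the lecture):

```python
import numpy as np

# numpy's full SVD returns the completed orthonormal bases U (m x m) and
# V (n x n); U^T A V is then the m x n matrix Sigma with only the first k
# diagonal entries non-zero.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])          # m = 2, n = 3
U, s, Vt = np.linalg.svd(A)              # full matrices: U is 2x2, Vt is 3x3

Sigma = np.zeros_like(A)                 # Sigma has shape m x n
Sigma[:len(s), :len(s)] = np.diag(s)     # first k diagonal entries

assert np.allclose(U.T @ A @ Vt.T, Sigma)   # U^T A V = Sigma
assert np.allclose(A, U @ Sigma @ Vt)       # A = U Sigma V^T
```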
Suppose V, U and Σ exist. Then

$$A^T A = (U \Sigma V^T)^T (U \Sigma V^T) = V \Sigma^T U^T U \Sigma V^T = V \Sigma^2 V^T$$

## What does this look like? The eigenvalue decomposition of A^T A

Similarly, we can show that

$$A A^T = U \Sigma^2 U^T$$

Thus U and V are the eigenvectors of AA^T and A^T A respectively, and Σ² = Λ, where Λ is the diagonal matrix containing the eigenvalues of A^T A
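The correspondence can be verified numerically; a sketch, again with an arbitrary illustrative matrix:

```python
import numpy as np

# The eigenvalue decomposition of A^T A recovers V and Sigma^2, matching
# the derivation above.
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# eigenvalues of A^T A equal the squared singular values
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # eigvalsh sorts ascending
assert np.allclose(eigvals, s**2)

# A^T A = V Sigma^2 V^T and AA^T = U Sigma^2 U^T
assert np.allclose(A.T @ A, Vt.T @ np.diag(s**2) @ Vt)
assert np.allclose(A @ A.T, U @ np.diag(s**2) @ U.T)
```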
   
   
$$A_{m \times n} = \underbrace{\begin{bmatrix} \uparrow & & \uparrow \\ u_1 & \cdots & u_k \\ \downarrow & & \downarrow \end{bmatrix}}_{m \times k} \underbrace{\begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_k \end{bmatrix}}_{k \times k} \underbrace{\begin{bmatrix} \leftarrow & v_1 & \rightarrow \\ & \vdots & \\ \leftarrow & v_k & \rightarrow \end{bmatrix}}_{k \times n} = \sum_{i=1}^{k} \sigma_i u_i v_i^T$$

Theorem: $\sigma_1 u_1 v_1^T$ is the best rank-1 approximation of the matrix A. $\sum_{i=1}^{2} \sigma_i u_i v_i^T$ is the best rank-2 approximation of A. In general, $\sum_{i=1}^{k} \sigma_i u_i v_i^T$ is the best rank-k approximation of A. In other words, the solution to

$$\min_{B} \|A - B\|_F^2$$

is given by $B = U_{.,k} \Sigma_{k,k} V_{k,.}^T$ (minimizes the reconstruction error of A)
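A sketch of the theorem in numpy, using a random illustrative matrix: truncating the SVD after r terms gives the best rank-r approximation in Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_r_approx(r):
    # B = U_{.,r} Sigma_{r,r} V_{r,.}^T
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

errors = [np.linalg.norm(A - rank_r_approx(r), 'fro') for r in range(1, 5)]

# the error shrinks as rank grows, and the rank-1 error equals the l2 norm
# of the discarded singular values
assert all(e1 >= e2 for e1, e2 in zip(errors, errors[1:]))
assert np.allclose(errors[0], np.sqrt(np.sum(s[1:]**2)))
assert np.isclose(errors[-1], 0.0)   # full rank reconstructs A exactly
```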
$$\sigma_i = \sqrt{\lambda_i} = \text{the singular values of } A$$

U = left singular matrix of A
V = right singular matrix of A