
MACHINE INTELLIGENCE
Dimensionality Reduction

Aronya Baksy
Department of Computer Science and Engineering
MACHINE INTELLIGENCE

Introduction to Dimensionality Reduction

Aronya Baksy
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
Outline

• What is dimensionality reduction?

• Motivation behind dimensionality reduction

• Algorithm for Dimensionality Reduction: PCA

• Mathematical toolbox: the Singular Value Decomposition


MACHINE INTELLIGENCE
Discussion: Features for Learning Task

Comment on the usefulness of the following feature vectors for classification:


Feature Set 1:
Feature 1: 5 17 13 29 72 7
Feature 2: 0.9 1.1 1.0 0.95 1.05 0.91

Feature Set 2:
Feature 1: 5 17 13 29 72 7
Feature 2: 1.9 35 65 3.8 11.8 45.8

Feature Set 3:
Feature 1: 5 20 30 10 45 71
Feature 2: 5.5 19 28 9.5 44 70.5
MACHINE INTELLIGENCE
Discussion: Features for Learning Task (2)

Feature Set 1:
The second feature does not seem to be of much value in distinguishing the instances. Why?

Feature Set 2:
Both the features seem useful in distinguishing the instances. Why?

Feature Set 3:
Both the features seem useful in distinguishing the instances at first glance. But let us transform the
coordinate axes (why?) using the Transformation Matrix whose rows are (0.707, 0.707) and (-0.707, 0.707).
Example: The point (30, 28) gets transformed into
[  0.707  0.707 ] [ 30 ]   [ 41.0 ]
[ -0.707  0.707 ] [ 28 ] = [ -1.4 ]
The features now (linear combinations of the original features) are:
Feature 1’: 7.4 27.5 41 13.6 63 100
Feature 2’: 0.35 0.7 1.4 0.35 0.70 0.35
MACHINE INTELLIGENCE
Discussion: Features for Learning Task (3)

Feature Set 3 (continued):


The features now (linear combinations of the original features) are:
Feature 1’: 7.4 27.5 41 13.6 63 100
Feature 2’: 0.35 0.7 1.4 0.35 0.70 0.35
(The original coordinates can be recovered by a rotation of coordinate axes in the opposite direction!)
How good are these transformed features for discriminating the instances?

(Informed) Assumptions:
• Features with high variance are generally more useful than features with low variance.
• Only one feature from a set of highly correlated features is useful; the rest of the features do not add
much value in discriminating the data instances!

How do we capture these ideas in a quantitative manner?


COVARIANCE MATRIX!
MACHINE INTELLIGENCE
Discussion: Covariance Matrix

Variance: Var(F) = (1/(n-1)) Σ_i (f_i − μ_F)²

Covariance: Cov(F, G) = (1/(n-1)) Σ_i (f_i − μ_F)(g_i − μ_G)

Covariance Matrix: the matrix C whose (j, k) entry is Cov(Feature j, Feature k); the diagonal entries are the variances

Unbiased Estimate of Covariance Matrix with Mean-Centered Data:
C_X = (1/(n-1)) X X^T, where X is the m x n mean-centered data matrix (features along rows, instances along columns)
MACHINE INTELLIGENCE
Discussion: Covariance Matrix (2)

Covariance between 2 features:


• Positive: Both increase / decrease together
• Negative: While one increases (decreases), the other decreases (increases)
• Zero: The features are uncorrelated (no linear relationship)
Correlation between 2 features:
Covariance (Feature 1, Feature 2) / [ StdDev (Feature 1) * StdDev (Feature 2) ]
Covariance = 0 → Uncorrelated (independence implies zero covariance, but zero covariance does not imply independence in general)

Examples of Covariance Matrices with Mean-Centered Data:


• Feature Set 3 considered earlier:
Feature 1: 5 20 30 10 45 71 Mean = 30.2
Feature 2: 5.5 19 28 9.5 44 70.5 Mean = 29.4
Mean-Centered Data:
Feature 1: -25.2  -10.2  -0.2  -20.2  15.2  41.2
Feature 2: -23.9  -10.4  -1.4  -19.9  14.6  41.1
Covariance Matrix:
[ 615  605 ]
[ 605  596 ]
What is the inference?
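Below is a minimal numpy sketch of this computation for Feature Set 3; the small differences from the numbers above come from the slide rounding the mean-centered values before squaring.

```python
import numpy as np

# Feature Set 3: one row per feature, one column per instance
X = np.array([[5.0, 20, 30, 10, 45, 71],
              [5.5, 19, 28, 9.5, 44, 70.5]])

Xc = X - X.mean(axis=1, keepdims=True)   # mean-center each feature
C = (Xc @ Xc.T) / (X.shape[1] - 1)       # unbiased estimate: divide by n - 1
print(np.round(C))                       # ~[[606, 601], [601, 596]]: off-diagonal almost as large as the diagonals
print(np.round(np.cov(X)))               # same result with numpy's built-in covariance
```

The correlation is roughly 601 / sqrt(606 * 596) ≈ 1, i.e., the two features are almost perfectly correlated, which is the inference the question points at.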
MACHINE INTELLIGENCE
Discussion: Covariance Matrix (3)

• Feature Set 3 considered earlier:


Covariance Matrix:
[ 615  605 ]
[ 605  596 ]
• Same feature set after change of coordinate axes:
Feature 1’: 7.4 27.5 41 13.6 63 100 Mean = 42
Feature 2’: 0.35 0.7 1.4 0.35 0.70 0.35 Mean = 0.64
Mean-Centered Data:
Feature 1’: -34.6 -14.5 -1.0 -28.4 21 58
Feature 2’: -0.29 0.06 0.76 -0.29 0.06 -0.29
Covariance Matrix:
[ 1204   0.216 ]
[ 0.216  0.167 ]
Inference: Only Feature 1’ can be used without much loss of useful information! Drop the second
feature completely! Dimensionality is reduced from 2 to 1. The flip side is that we will not be able to
recover the original data in an error-free way! But that is OK for our purpose of learning to classify!
MACHINE INTELLIGENCE
Discussion: Covariance Matrix (4)

Desirable Covariance Matrix:


• Off-Diagonal Entries must be 0
• Diagonal entries are sorted by magnitude.
→Desirable features, sorted by the order of their importance!
• We could drop features of “very low” value → Dimensionality reduction!

Can we change the coordinate axes to achieve this?


MACHINE INTELLIGENCE
Discussion: PCA

• Can we change the coordinate axes to achieve this (in the Covariance Matrix, the off-diagonal
entries are 0 and the diagonal entries are sorted by magnitude)? (Have a re-look at the example!)

Find some orthonormal matrix P that transforms the original feature space X into Y = PX, such that
the Covariance Matrix of Y has the desired form indicated above! In other words:

C_Y = (1/(n-1)) Y Y^T is diagonalized!

The rows of P are called the Principal Components!


• These are the directions along which the data has maximum variance, in decreasing order, and the
covariance between any two distinct transformed features is 0!
• If we wish, we can drop the “less important” principal axes and this will lead to a reduction
in the dimensionality!
MACHINE INTELLIGENCE
Discussion: PCA (2)

• Let A = X XT (with mean-centered X). A is symmetric and thus can be diagonalized!


• A = E D ET
where D is diagonal and
the columns of E are the Eigen Vectors of A.
• Now, select P as ET !!!
• This completes the diagonalization: C_Y = (1/(n-1)) P A PT = (1/(n-1)) ET (E D ET) E = (1/(n-1)) D, which is diagonal.

Notice that A (with the normalizing value of n-1) is the Covariance Matrix of our original X.
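A minimal numpy sketch of this diagonalization step, reusing the mean-centered Feature Set 3 data as a stand-in for X (my choice of example data, not prescribed by the slide):

```python
import numpy as np

X = np.array([[5.0, 20, 30, 10, 45, 71],
              [5.5, 19, 28, 9.5, 44, 70.5]])
Xc = X - X.mean(axis=1, keepdims=True)   # mean-centered data, features x instances

A = Xc @ Xc.T                            # A = X X^T, symmetric
D_vals, E = np.linalg.eigh(A)            # A = E D E^T; columns of E are the eigenvectors of A
P = E.T                                  # choose P = E^T: rows of P are the principal components
# note: np.linalg.eigh returns eigenvalues in ascending order; sort descending to rank components

Y = P @ Xc                               # transformed data
C_Y = (Y @ Y.T) / (Xc.shape[1] - 1)      # covariance matrix of the transformed data
print(np.round(C_Y, 6))                  # off-diagonal entries are (numerically) zero
```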
MACHINE INTELLIGENCE
Discussion: PCA (3)
Thus:
• Eigen vectors of XXT (with mean-centered X) are the principal components of X, and these are the rows of P
• The jth diagonal value of CY is the variance of X along Pj, the jth principal component
• By re-arranging the order of the Eigen Values of XXT, if necessary, so that they are in descending order, we
can get the principal components in the order of their importance in explaining the total variance of the
data set.
• Total Variance of the data = sum of the Eigen Values of the covariance matrix CX = λ1 + λ2 + λ3 + … + λm
• The Eigen Values are sorted; thus λ1 ≥ λ2 ≥ λ3 ≥ … ≥ λm

• By dropping the principal components of “low importance”, we can achieve dimensionality reduction.
• The transformed data is Y = P X
• In practice, we always work with mean-centered data to simplify the calculations.
MACHINE INTELLIGENCE
PCA: Example

• Consider the following collection of 6 points, each being 3-dimensional

• Assuming each component takes 4 bytes to store, these 6 points need (4*3)*6 = 72 bytes to store

• Do you observe any relationships between the 6 points? (hint: think of a number that, when multiplied with the 1st point, gives the 2nd one)
MACHINE INTELLIGENCE
PCA: Example

• All the 6 points can be expressed as multiples of the first one:

• If we only store the point (1,2,3) and the 6 multiplying factors, then we only
need (4*9) = 36 bytes (this is a 50% saving over storing all points as they are!)
MACHINE INTELLIGENCE
PCA: Example

• All the 6 points lie on the same line (along the vector (1,2,3)) in 3D Euclidean space

• Here p1, p2, p3 are the standard Euclidean axes (î, ĵ, k̂) that are familiar to us

Discussion:
It may be more helpful to think that the features are α, 2α, 3α – perfectly correlated features!
(Instead of thinking in terms of different data points!)
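A small sketch of this observation; the multiplying factors below are hypothetical (the slide's actual values are in a figure), but any choice makes the same point: the data matrix has rank 1, so a single direction carries all the information.

```python
import numpy as np

alphas = np.array([1.0, 2.0, 0.5, 3.0, 4.0, 2.5])   # hypothetical multipliers, not taken from the slide
base = np.array([1.0, 2.0, 3.0])
points = np.outer(alphas, base)                     # 6 points, each a multiple of (1, 2, 3)

print(np.linalg.matrix_rank(points))                # 1: all points lie on one line through the origin
# storing 'base' (3 values) plus 'alphas' (6 values) = 9 values reproduces all 6*3 = 18 stored values
```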
MACHINE INTELLIGENCE
Principal Component Analysis: Stepwise

1. Center the data by subtracting the mean of each feature (i.e., replace X with X − μ_X)

2. Find the covariance matrix C of the centered data

3. Find the Eigenvalues and Eigenvectors of C

4. Select the k largest Eigenvalues, and combine their corresponding Eigenvectors into a matrix P

5. Multiply the data X with this matrix to get the lower-dimensional X’ (a minimal sketch of these steps follows below)
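A minimal end-to-end sketch of the five steps on hypothetical data (rows are instances and columns are features, matching the (10, 2) shape used in the stepwise example that follows):

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=(10, 1))
X = np.hstack([2.0 * t, 1.0 * t]) + 0.1 * rng.normal(size=(10, 2))   # 10 instances, 2 correlated features

# 1. Center the data feature-wise
Xc = X - X.mean(axis=0)

# 2. Covariance matrix of the centered data
C = (Xc.T @ Xc) / (Xc.shape[0] - 1)

# 3. Eigenvalues and eigenvectors of C
eigvals, eigvecs = np.linalg.eigh(C)

# 4. Keep the eigenvectors belonging to the k largest eigenvalues
k = 1
order = np.argsort(eigvals)[::-1][:k]
P = eigvecs[:, order]            # shape (2, k)

# 5. Project the data to get the lower-dimensional representation
X_reduced = Xc @ P               # shape (10, k)
print(X_reduced.shape)
```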


MACHINE INTELLIGENCE
Principal Component Analysis: Stepwise Example

The data to the left is in 2 Dimensions

We will find the principal components and plot them


MACHINE INTELLIGENCE
PCA: Step 1

Center the data so that all features have mean 0


MACHINE INTELLIGENCE
PCA: Step 2

Find the covariance matrix of the centered X

Notice how the matrix is symmetric


MACHINE INTELLIGENCE
PCA: Step 2

Find the Eigenvalues and Eigenvectors of covariance matrix

(Note: there is a typo in the matrix of Eigen Vectors shown on the slide!)


Notice that 1.284 is the largest Eigenvalue
And its corresponding Eigenvector is the second column
MACHINE INTELLIGENCE
PCA: Step 2

The Eigenvalues of the covariance matrix are:

How much information is captured along Principal Component i? This is given by the ratio
eigval_i / Σ_k eigval_k
MACHINE INTELLIGENCE
PCA: Step 2

Data plotted along with the principal component


vectors overlaid (the diagonal dotted lines)
MACHINE INTELLIGENCE
PCA: Step 2

Plotting the data along with the Principal


Components (the two dotted lines)

This is simply a rotation of the previous figure


MACHINE INTELLIGENCE
PCA: Step 3

Since we choose to retain only 1 PC, we keep only the corresponding Eigenvector and
call it P

(Note: there is a typo in the matrix of Eigen Vectors shown on the slide!)

Find the product XP to get the new reduced 1D data (note that X is of
shape (10,2) and P is of shape (2, 1))

On the next slide, we see the reduced 1D approximation


MACHINE INTELLIGENCE
PCA: Step 3

Transformed Data:

Example Calculations:
• Original Mean-Centered Point: 0.69, 0.49
• Transformed Point:
(-0.67787 * 0.69) + (-0.73517 * 0.49) = -0.82796
(-0.73517 * 0.69) + (0.67787 * 0.49) = -0.175115
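A quick numerical check of this projection, using the principal-component values quoted above:

```python
import numpy as np

P = np.array([[-0.67787, -0.73517],   # first principal component (the one that is retained)
              [-0.73517,  0.67787]])  # second principal component
x = np.array([0.69, 0.49])            # first mean-centered point

print(P @ x)                          # ~[-0.828, -0.175]; keeping only the first row gives the 1-D value
```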
MACHINE INTELLIGENCE
PCA: Step 3

The variation along principal component 1 is preserved.
The variation along the other principal component is lost!
MACHINE INTELLIGENCE
Principal Component Analysis

• Popular
• Easy to use.
• But, non-parametric!
MACHINE INTELLIGENCE
Singular Value Decomposition (SVD)

• Diagonalization is at the heart of many applications.


• Eigen Value / Eigen Vector based diagonalization works only with Square Matrices!
• SVD works with arbitrary n x m matrices!
The pinnacle of Linear Algebra –
The most general diagonalization theorem!!
MACHINE INTELLIGENCE
SVD (2)

• Let A be an m x n matrix.
• When A is square, if λi is an eigen value and Vi is the corresponding eigen vector, we have A Vi = λi Vi
• When we try to form a similar equation with A (m x n), we have the problem that Vi has to be n x 1 (for
compatible matrix multiplication), but then A Vi will be an m x 1 vector!
• We are mapping from Rn space to Rm space, unlike the case of square matrices!
• Thus, in general, A Vi must get equated to some constant times an m x 1 vector!
• We need two sets of “eigen vectors” !!!!!!!!

***

• Let us defer this problem for the present and instead consider the matrix ATA (why?)
• This is a symmetric, square matrix and hence can be diagonalized using eigen value decomposition!
• We will try to decompose A via decomposition of ATA!!!
MACHINE INTELLIGENCE
SVD (3)

• Let λ1 , λ2, …, λr be the eigen values of ( ATA) with V1,V2, …,Vr being the corresponding
orthonormal eigen vectors. Thus ( ATA) Vi = λi Vi for i = 1, 2, …, r
• Pre-multiply by ViT to get ViT ( ATA) Vi = λi ViT Vi → (ViT AT) (A Vi ) = λi (ViT Vi ) →
(A Vi )T (A Vi ) = λi (ViT Vi ) → || A Vi ||² = λi ||Vi ||² = λi , as Vi is a unit vector.
Thus || A Vi || = √λi = σi (say)
• By rearranging if necessary, we can assume that λ1 ≥ λ2 ≥ … ≥ λr
• Let Ui = (1 / σi ) (A Vi ). Thus Ui is a unit vector, and
A Vi = σi Ui for i = 1, 2, 3, …, r
• Extend {U1, U2,…, Ur} to an orthonormal basis of Rm by adding (m-r) further orthonormal vectors, giving an m x m matrix
U = [ U1, U2,…, Ur, Ur+1, …, Um ]
• Extend the Vs similarly to an orthonormal basis of Rn by adding (n-r) further orthonormal vectors (these span the null space of A), giving an n x n matrix
V = [ V1, V2,…, Vr, Vr+1, …, Vn ]
• Now, we can extend the above A Vi = σi Ui for i = 1, 2, 3, …, r as follows:
A V = [AV1, AV2, …, AVr, AVr+1, …, AVn] = [σ1 U1 , σ2 U2 , …, σr Ur , 0, 0, …, 0] (since A Vi = 0 for i > r)
MACHINE INTELLIGENCE
SVD (4)

We got:
A V = [AV1, AV2, …, AVr, AVr+1, …, AVn] = [σ1 U1 , σ2 U2 , …, σr Ur , 0, 0, …, 0]
The RHS, [σ1 U1 , σ2 U2 , …, σr Ur , 0, 0, …, 0] , can be written as U times a second matrix that has the
following “diagonal” shape: σ1, σ2, …, σr on the leading diagonal and zeros everywhere else.

Let this “diagonal” matrix be ∑

We can now write: A V = U ∑ ; post-multiplying by V-1 , we get A = U ∑ V-1


As V is orthonormal, its inverse V-1 is VT. Hence, we get: A = U ∑ VT - the complete SVD!!!!!
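A minimal numpy sketch on an arbitrary (hypothetical) m x n matrix, checking A = U ∑ VT and A Vi = σi Ui:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))                      # an arbitrary m x n matrix (m=5, n=3)

U, s, Vt = np.linalg.svd(A)                      # full SVD: U is 5x5, Vt is 3x3, s holds sigma_1..sigma_r
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)             # the m x n "diagonal" matrix

print(np.allclose(A, U @ Sigma @ Vt))            # True: A = U Sigma V^T
print(np.allclose(A @ Vt[0], s[0] * U[:, 0]))    # True: A V_1 = sigma_1 U_1
```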
MACHINE INTELLIGENCE
SVD (5) – Finally – The Full SVD
MACHINE INTELLIGENCE
SVD (6) – The Full SVD

A = U ∑ VT
(m x n) = (m x m)(m x n)(n x n)

U = [ U1 U2 … Ur | Ur+1 … Um ], where the first block is m x r and the second is m x (m - r)
∑ has σ1, σ2, …, σr on the leading diagonal of its top-left r x r block and zeros everywhere else
VT stacks the rows V1T, V2T, …, VrT (an r x n block) on top of the remaining (n - r) x n block
MACHINE INTELLIGENCE
SVD (7) - Discussion
• In the development of SVD, we saw that ( ATA) Vi = λi Vi
• Thus, Vi is an eigen vector of ATA with the corresponding eigen value λi = σi²,
where σi is a singular value of A.
• (ATA) Vi = λi Vi ; pre-multiply both sides by A to get
A (ATA) Vi = λi ( A Vi )
(AAT) ( A Vi ) = λi ( A Vi ); since A Vi = σi Ui , this gives (AAT) Ui = λi Ui
λi is an eigen value of ( AAT) also, with Ui as the corresponding eigen vector!
THUS
• Both AAT and ATA have the same set of non-zero eigen values, which are the squares of the singular
values of A.
• Vi s are the eigen vectors of ATA (n x n)
• Ui s are the eigen vectors of AAT (m x m)
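A short numerical check of these facts, again on a hypothetical matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(A)
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]   # eigenvalues of A^T A, descending
eig_AAt = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]   # eigenvalues of A A^T, descending

print(np.allclose(eig_AtA, s**2))                      # True: squared singular values
print(np.allclose(eig_AAt[:len(s)], s**2))             # True: A A^T shares them (its remaining eigenvalues are 0)
```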
MACHINE INTELLIGENCE
SVD and PCA

Our original X was an m (number of features) x n (number of instances) matrix. Let:

Y = (1/√(n-1)) XT , so that YT Y = (1/(n-1)) X XT = CX , the covariance matrix of X.
MACHINE INTELLIGENCE
SVD and PCA (2)
• If we calculate SVD of Y, the columns of matrix V contain the eigen vectors of YT Y = CX
• Thus we calculate the SVD of XT (not X) as U ∑ VT and then the columns of V are the principal
components of X !!!! We do not need to explicitly form the Covariance Matrix.
(We defined X as features (m) by instances (n) matrix and we are interested in the eigen
vectors of the covariance matrix of features! So, we do SVD of XT and use the column vectors
of V. This was done to be compatible with the earlier derivation of PCA. However, we can get
the SVD of X itself but then we need to use the vectors of U instead of V! Both approaches
work as AAT and ATA have the same eigen values, as already discussed! In fact, it may often be better
to work with X, as typically n >> m. Example: If m = 10^3 and n = 10^6, XXT is 10^3 by 10^3. On
the other hand, XTX is 10^6 by 10^6 – a huge difference in the cost of computing the eigen values!)
• Further, in practice we need to use iterative numerical algorithms for PCA or SVD.
• Iterative numerical algorithms for SVD have been shown to converge much faster than those for the
direct PCA! → Thus
PCA via SVD is the most commonly used approach for PCA,
a very popular technique for dimensionality reduction!
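A minimal sketch of PCA via SVD under the convention above (X is an m features x n instances matrix, mean-centered); the data is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 100))                        # m = 3 features, n = 100 instances
Xc = X - X.mean(axis=1, keepdims=True)

Y = Xc.T / np.sqrt(Xc.shape[1] - 1)                  # Y = (1/sqrt(n-1)) X^T, so Y^T Y = C_X
U, s, Vt = np.linalg.svd(Y, full_matrices=False)     # no covariance matrix is formed explicitly
principal_components = Vt                            # rows of Vt (columns of V): principal components of X
explained_variances = s**2                           # eigenvalues of C_X, already in descending order

# cross-check against the explicit covariance matrix
C = (Xc @ Xc.T) / (Xc.shape[1] - 1)
print(np.allclose(np.sort(np.linalg.eigvalsh(C))[::-1], explained_variances))   # True
```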
MACHINE INTELLIGENCE
(SVD Again) Reduced SVD

• Full SVD: A = U ∑ VT
A :m x n U :m x m ∑: mxn VT : n x n
• We extended U and V to full orthonormal bases and padded ∑ with zeros as required to get the Full SVD form (because the full form
presents a complete picture from the perspective of Linear Algebra and also because the Numerical
Algorithms work with such a general full form).
• If we discard those extra columns and the zero padding, we get the Reduced SVD, which is also correct as we have
already seen that A Vi = σi Ui for i = 1, 2, 3, …, r.
• The Reduced SVD can be represented as:

A = Ur ∑r VrT
where Ur is m x r; ∑r is r x r and VrT is r x n.
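In numpy, the reduced form corresponds to full_matrices=False; a small sketch (for a full-rank A, min(m, n) equals r):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 4))                           # m = 6, n = 4, generically rank r = 4

Ur, s, Vrt = np.linalg.svd(A, full_matrices=False)    # Ur: 6x4, s: 4 singular values, Vrt: 4x4
print(np.allclose(A, Ur @ np.diag(s) @ Vrt))          # True: A = Ur Sigma_r Vr^T
```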
MACHINE INTELLIGENCE
Reduced SVD (2)
MACHINE INTELLIGENCE
Applications of SVD
• SVD is quite a general and powerful technique for decomposing matrices.
• It has applications in several diverse domains, PCA being just one of them!
• We already discussed PCA. We will discuss only two more applications here.
Application in Data Mining:
Reduced SVD: A = Ur ∑r VrT where A is m x n; Ur is m x r; ∑r is r x r and VrT is r x n.
• Both AAT and ATA have the same set of non-zero eigen values, which are the squares of the singular values of A.
• Rows of VrT (i.e., columns of Vr) are the eigen vectors of ATA (n x n)
• Columns of Ur are the eigen vectors of AAT (m x m)
Thus:
• V is useful in explaining the covariance among the columns of A
• U is useful in explaining the covariance among the rows of A
• Depending on how we interpret the rows and columns of A, both of the above explanations can be
useful in revealing patterns in the data.
• Any examples?
• Can be used in applications like Recommender Systems!
MACHINE INTELLIGENCE
One interpretation of SVD

• An example where we are interested in the covariance among rows as well as columns.
• The following matrix A lists out the ratings of 5 movies by 7 users, followed by its SVD
on the right.
• Number of non-zero Singular Values of A = 2 → Rank of A = 2. Inference? Rows as well as columns
have essentially 2 “dimensions”!
MACHINE INTELLIGENCE
One interpretation of SVD (2)

• There are two concepts in this data: science fiction and romance

• The U matrix connects people to concepts (each row represents how much one person likes the two concepts)

• The V matrix connects movies to concepts (each row of VT represents how much the 5 movies are connected to the two concepts)

• The Σ matrix outlines the relative strength of the two concepts
MACHINE INTELLIGENCE
One interpretation of SVD (3)

• Now we have two additional ratings for the movies Matrix and Star Wars
from Jill and Jane in the matrix called M’

• Note that the rank of M’ is 3 (not 2 anymore like M)


MACHINE INTELLIGENCE
One Interpretation of SVD (4)

• What does the third concept mean? Don’t know!

• What is the relative strength of the third concept? It’s much


less than the other two

• Can we ignore it?


MACHINE INTELLIGENCE
One Interpretation of SVD (5)

• Reconstructing the matrix M’ after removing the last ‘concept’

• Compare with the actual M’ on the right.


MACHINE INTELLIGENCE
One Interpretation of SVD: Summary

• SVD gives us a set of hidden concepts that the data represents

• By eliminating the concepts that are less important, we can achieve a
representation of the data that contains most of the information in the original data, but
in a lower-dimensional space
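A small sketch of this idea on a hypothetical user x movie ratings matrix (the actual matrices on these slides are in figures, so the numbers below are purely illustrative):

```python
import numpy as np

# hypothetical ratings: rows = users, columns = movies (first 3 sci-fi, last 2 romance)
M = np.array([[5, 5, 4, 0, 0],
              [4, 5, 5, 0, 0],
              [0, 0, 0, 4, 5],
              [0, 1, 0, 5, 4],
              [5, 4, 5, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(np.round(s, 2))                            # a few strong "concepts", then much weaker ones

k = 2                                            # keep only the two strongest concepts
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]         # rank-k reconstruction
print(np.round(M_k, 1))                          # close to M, with the weak concept(s) discarded
```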
MACHINE INTELLIGENCE
SVD and Image Compression
• Reduced SVD again: A = Ur ∑r VrT where A is m x n; Ur is m x r; ∑r is r x r and VrT is r x n.
• Expanding the above, we get:
A = σ1 U1 V1T + σ2 U2 V2T + σ3 U3 V3T + … … + σr Ur VrT
• A is the sum of r Rank 1 matrices! One Rank 1 matrix requires a storage of only n + m values, instead
of n x m values!
• These r matrices are in decreasing order of ”importance”.
• If we use only the first Rank 1 matrix - σ1 U1 V1T , we do achieve great compression but the
approximation may not be that good. (Low picture quality perhaps).
• As we keep adding more of the above matrices, approximation improves very fast and we may be able
to get extremely good quality approximation to the original picture with a relatively small number of
matrices leading to great reduction in the space requirements!
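A sketch of this rank-r approximation; it assumes a grayscale image is available as a 2-D array (the synthetic array below stands in for a real photo, which could instead be loaded with PIL or matplotlib):

```python
import numpy as np

def rank_r_approximation(img, r):
    """Sum of the top-r rank-1 terms sigma_i * U_i * V_i^T of the image's SVD."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r]

# synthetic stand-in for a grayscale image
rng = np.random.default_rng(5)
img = np.outer(np.sin(np.linspace(0, 3, 200)), np.cos(np.linspace(0, 5, 300))) \
      + 0.05 * rng.normal(size=(200, 300))

approx = rank_r_approximation(img, r=5)
storage_fraction = 5 * (img.shape[0] + img.shape[1]) / img.size   # r * (m + n) values vs m * n values
print(round(storage_fraction, 4))                                 # ~0.042 for this 200 x 300 example
```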
MACHINE INTELLIGENCE
SVD and Image Compression (2) - Example

Original Image (2000 x 1500) r = 5; 0.57% Storage r = 100; 11.67% Storage

Source: Data-Driven Science and Engineering by Steven L. Brunton and J. Nathan Kutz
MACHINE INTELLIGENCE
SVD - Conclusions

• SVD is an extremely powerful result from Linear Algebra


• Can be used in any context with Large Matrices
(Examples: Latent Semantic Indexing, Eigenfaces, Recommender Systems etc)
• Thus, has applications in several diverse domains apart from Machine Learning
(Example: Pseudoinverses (Linear Algebra), Signal Processing, Weather Modeling etc)
THANK YOU

Aronya Baksy
abaksy@gmail.com
+91 80 2672 1983 Extn 701
