
MACHINE INTELLIGENCE
Dimensionality Reduction

Aronya Baksy
Department of Computer Science and Engineering
MACHINE INTELLIGENCE

Introduction to Dimensionality Reduction

Aronya Baksy
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
Outline

• What is dimensionality reduction?

• Motivation behind dimensionality reduction

• Algorithm for Dimensionality Reduction: PCA

• Mathematical toolbox: the Singular Value Decomposition


MACHINE INTELLIGENCE
Discussion: Features for Learning Task

Comment on the usefulness of the following feature vectors for classification:


Feature Set 1:
Feature 1: 5 17 13 29 72 7
Feature 2: 0.9 1.1 1.0 0.95 1.05 0.91

Feature Set 2:
Feature 1: 5 17 13 29 72 7
Feature 2: 1.9 35 65 3.8 11.8 45.8

Feature Set 3:
Feature 1: 5 20 30 10 45 71
Feature 2: 5.5 19 28 9.5 44 70.5
MACHINE INTELLIGENCE
Discussion: Features for Learning Task (2)

Feature Set 1:
The second feature does not seem to be of much value in distinguishing the instances. Why?

Feature Set 2:
Both the features seem useful in distinguishing the instances. Why?

Feature Set 3:
Both the features seem useful in distinguishing the instances at first glance. But let us transform the
coordinate axes (why?) using the Transformation Matrix whose rows are (0.707, 0.707) and (-0.707, 0.707).
Example: The point (30, 28) gets transformed into
[  0.707  0.707 ] [ 30 ]   [ 41.0 ]
[ -0.707  0.707 ] [ 28 ] = [ -1.4 ]
The features now (linear combinations of the original features) are:
Feature 1’: 7.4 27.5 41 13.6 63 100
Feature 2’: 0.35 0.7 1.4 0.35 0.70 0.35
MACHINE INTELLIGENCE
Discussion: Features for Learning Task (3)

Feature Set 3 (continued):


The features now (linear combinations of the original features) are:
Feature 1’: 7.4 27.5 41 13.6 63 100
Feature 2’: 0.35 0.7 1.4 0.35 0.70 0.35
(The original coordinates can be recovered by a rotation of coordinate axes in the opposite direction!)
How good are these transformed features for discriminating the instances?

(Informed) Assumptions:
• Features with high variance are generally more useful than features with low variance.
• Only one feature from a set of highly correlated features is useful; the rest of the features do not add
much value in discriminating the data instances!

How do we capture these ideas in a quantitative manner?


COVARIANCE MATRIX!
MACHINE INTELLIGENCE
Discussion: Covariance Matrix

Variance: Var(F) = (1/(n-1)) Σ_i (f_i − μ_F)²

Covariance: Cov(F, G) = (1/(n-1)) Σ_i (f_i − μ_F)(g_i − μ_G)

Covariance Matrix: the matrix C whose (j, k) entry is Cov(Feature j, Feature k); the diagonal entries are the variances

Unbiased Estimate of Covariance Matrix with Mean-Centered Data:
C_X = (1/(n-1)) X X^T, where X is the m x n mean-centered data matrix (features along rows, instances along columns)
MACHINE INTELLIGENCE
Discussion: Covariance Matrix (2)

Covariance between 2 features:


• Positive: Both increase / decrease together
• Negative: While one increases (decreases), the other decreases (increases)
• Zero: The features are uncorrelated (no linear relationship)
Correlation between 2 features:
Covariance (Feature 1, Feature 2) / [ StdDev (Feature 1) * StdDev (Feature 2) ]
Covariance = 0 → Uncorrelated (independence implies zero covariance, but zero covariance does not imply independence in general)

Examples of Covariance Matrices with Mean-Centered Data:


• Feature Set 3 considered earlier:
Feature 1: 5 20 30 10 45 71 Mean = 30.2
Feature 2: 5.5 19 28 9.5 44 70.5 Mean = 29.4
Mean-Centered Data:
Feature 1: -25.2  -10.2  -0.2  -20.2  15.2  41.2
Feature 2: -23.9  -10.4  -1.4  -19.9  14.6  41.1
Covariance Matrix:
[ 615  605 ]
[ 605  596 ]
What is the inference?
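Below is a minimal numpy sketch of this computation for Feature Set 3; the small differences from the numbers above come from the slide rounding the mean-centered values before squaring.

```python
import numpy as np

# Feature Set 3: one row per feature, one column per instance
X = np.array([[5.0, 20, 30, 10, 45, 71],
              [5.5, 19, 28, 9.5, 44, 70.5]])

Xc = X - X.mean(axis=1, keepdims=True)   # mean-center each feature
C = (Xc @ Xc.T) / (X.shape[1] - 1)       # unbiased estimate: divide by n - 1
print(np.round(C))                       # ~[[606, 601], [601, 596]]: off-diagonal almost as large as the diagonals
print(np.round(np.cov(X)))               # same result with numpy's built-in covariance
```

The correlation is roughly 601 / sqrt(606 * 596) ≈ 1, i.e., the two features are almost perfectly correlated, which is the inference the question points at.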
MACHINE INTELLIGENCE
Discussion: Covariance Matrix (3)

• Feature Set 3 considered earlier:


Covariance Matrix:
[ 615  605 ]
[ 605  596 ]
• Same feature set after change of coordinate axes:
Feature 1’: 7.4 27.5 41 13.6 63 100 Mean = 42
Feature 2’: 0.35 0.7 1.4 0.35 0.70 0.35 Mean = 0.64
Mean-Centered Data:
Feature 1’: -34.6 -14.5 -1.0 -28.4 21 58
Feature 2’: -0.29 0.06 0.76 -0.29 0.06 -0.29
Covariance Matrix:
[ 1204   0.216 ]
[ 0.216  0.167 ]
Inference: Only Feature 1’ can be used without much loss of useful information! Drop the second
feature completely! Dimensionality is reduced from 2 to 1. The flip side is that we will not be able to
recover the original data in an error-free way! But that is OK for our purpose of learning to classify!
MACHINE INTELLIGENCE
Discussion: Covariance Matrix (4)

Desirable Covariance Matrix:


• Off-Diagonal Entries must be 0
• Diagonal entries are sorted by magnitude.
→Desirable features, sorted by the order of their importance!
• We could drop features of “very low” value → Dimensionality reduction!

Can we change the coordinate axes to achieve this?


MACHINE INTELLIGENCE
Discussion: PCA

• Can we change the coordinate axes to achieve this (in the Covariance Matrix, the off-diagonal
entries are 0 and the diagonal entries are sorted by magnitude)? (Have a re-look at the example!)

Find some orthonormal matrix P that transforms the original feature space X into Y = PX, such that
the Covariance Matrix of Y has the desired form indicated above! In other words:

C_Y = (1/(n-1)) Y Y^T is diagonalized!

The rows of P are called the Principal Components!


• These are the directions along which the data has maximum variance, in decreasing order, and the
covariance between any two distinct transformed features is 0!
• If we wish, we can drop the “less important” principal axes and this will lead to a reduction
in the dimensionality!
MACHINE INTELLIGENCE
Discussion: PCA (2)

• Let A = X XT (with mean-centered X). A is symmetric and thus can be diagonalized!


• A = E D ET
where D is diagonal and
the columns of E are the Eigen Vectors of A.
• Now, select P as ET !!!
• This completes the diagonalization: C_Y = (1/(n-1)) P A PT = (1/(n-1)) ET (E D ET) E = (1/(n-1)) D, which is diagonal.

Notice that A (with the normalizing value of n-1) is the Covariance Matrix of our original X.
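A minimal numpy sketch of this diagonalization step, reusing the mean-centered Feature Set 3 data as a stand-in for X (my choice of example data, not prescribed by the slide):

```python
import numpy as np

X = np.array([[5.0, 20, 30, 10, 45, 71],
              [5.5, 19, 28, 9.5, 44, 70.5]])
Xc = X - X.mean(axis=1, keepdims=True)   # mean-centered data, features x instances

A = Xc @ Xc.T                            # A = X X^T, symmetric
D_vals, E = np.linalg.eigh(A)            # A = E D E^T; columns of E are the eigenvectors of A
P = E.T                                  # choose P = E^T: rows of P are the principal components
# note: np.linalg.eigh returns eigenvalues in ascending order; sort descending to rank components

Y = P @ Xc                               # transformed data
C_Y = (Y @ Y.T) / (Xc.shape[1] - 1)      # covariance matrix of the transformed data
print(np.round(C_Y, 6))                  # off-diagonal entries are (numerically) zero
```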
MACHINE INTELLIGENCE
Discussion: PCA (3)
Thus:
• Eigen vectors of XXT (with mean-centered X) are the principal components of X, and these are the rows of P
• The jth diagonal value of CY is the variance of X along Pj, the jth principal component
• By re-arranging the order of the Eigen Values of XXT, if necessary, so that they are in descending order, we
can get the principal components in the order of their importance in explaining the total variance of the
data set.
• Total Variance of the data = sum of the Eigen Values of the covariance matrix CX = λ1 + λ2 + λ3 + … + λm
• The Eigen Values are sorted; thus λ1 ≥ λ2 ≥ λ3 ≥ … ≥ λm

• By dropping the principal components of “low importance”, we can achieve dimensionality reduction.
• The transformed data is Y = P X
• In practice, we always work with mean-centered data to simplify the calculations.
MACHINE INTELLIGENCE
PCA: Example

• Consider the following collection of 6 points, each being 3-dimensional

• Assuming each component takes 4 bytes to store, these 6 points need (4*3)*6 = 72 bytes to store

• Do you observe any relationships between the 6 points? (hint: think of a number that, when multiplied with the 1st point, gives the 2nd one)
MACHINE INTELLIGENCE
PCA: Example

• All the 6 points can be expressed as multiples of the first one:

• If we only store the point (1,2,3) and the 6 multiplying factors, then we only
need (4*9) = 36 bytes (this is a 50% saving over storing all points as they are!)
MACHINE INTELLIGENCE
PCA: Example

• All the 6 points lie on the same line (along the vector (1,2,3)) in 3D Euclidean space

• Here p1, p2, p3 are the standard Euclidean axes (î, ĵ, k̂) that are familiar to us

Discussion:
It may be more helpful to think that the features are α, 2α, 3α – perfectly correlated features!
(Instead of thinking in terms of different data points!)
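A small sketch of this observation; the multiplying factors below are hypothetical (the slide's actual values are in a figure), but any choice makes the same point: the data matrix has rank 1, so a single direction carries all the information.

```python
import numpy as np

alphas = np.array([1.0, 2.0, 0.5, 3.0, 4.0, 2.5])   # hypothetical multipliers, not taken from the slide
base = np.array([1.0, 2.0, 3.0])
points = np.outer(alphas, base)                     # 6 points, each a multiple of (1, 2, 3)

print(np.linalg.matrix_rank(points))                # 1: all points lie on one line through the origin
# storing 'base' (3 values) plus 'alphas' (6 values) = 9 values reproduces all 6*3 = 18 stored values
```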
MACHINE INTELLIGENCE
Principal Component Analysis: Stepwise

1. Center the data by subtracting the mean of each feature (i.e., replace X with X − μ_X)

2. Find the covariance matrix C of the centered data

3. Find the Eigenvalues and Eigenvectors of C

4. Select the k largest Eigenvalues, and combine their corresponding Eigenvectors into a matrix P

5. Multiply the data X with this matrix to get the lower-dimensional X’ (a minimal sketch of these steps follows below)
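A minimal end-to-end sketch of the five steps on hypothetical data (rows are instances and columns are features, matching the (10, 2) shape used in the stepwise example that follows):

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=(10, 1))
X = np.hstack([2.0 * t, 1.0 * t]) + 0.1 * rng.normal(size=(10, 2))   # 10 instances, 2 correlated features

# 1. Center the data feature-wise
Xc = X - X.mean(axis=0)

# 2. Covariance matrix of the centered data
C = (Xc.T @ Xc) / (Xc.shape[0] - 1)

# 3. Eigenvalues and eigenvectors of C
eigvals, eigvecs = np.linalg.eigh(C)

# 4. Keep the eigenvectors belonging to the k largest eigenvalues
k = 1
order = np.argsort(eigvals)[::-1][:k]
P = eigvecs[:, order]            # shape (2, k)

# 5. Project the data to get the lower-dimensional representation
X_reduced = Xc @ P               # shape (10, k)
print(X_reduced.shape)
```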


MACHINE INTELLIGENCE
Principal Component Analysis: Stepwise Example

The data to the left is in 2 Dimensions

We will find the principal components and plot them


MACHINE INTELLIGENCE
PCA: Step 1

Center the data so that all features have mean 0


MACHINE INTELLIGENCE
PCA: Step 2

Find the covariance matrix of the centered X

Notice how the matrix is symmetric


MACHINE INTELLIGENCE
PCA: Step 2

Find the Eigenvalues and Eigenvectors of covariance matrix

(Note: there is a typo in the matrix of Eigen Vectors shown on the slide!)


Notice that 1.284 is the largest Eigenvalue
And its corresponding Eigenvector is the second column
MACHINE INTELLIGENCE
PCA: Step 2

The Eigenvalues of the covariance matrix are:

How much information is captured along Principal Component i? This is given by the ratio
eigval_i / Σ_k eigval_k
MACHINE INTELLIGENCE
PCA: Step 2

Data plotted along with the principal component


vectors overlaid (the diagonal dotted lines)
MACHINE INTELLIGENCE
PCA: Step 2

Plotting the data along with the Principal


Components (the two dotted lines)

This is simply a rotation of the previous figure


MACHINE INTELLIGENCE
PCA: Step 3

Since we choose to retain only 1 PC, we keep only the corresponding Eigenvector and
call it P

(Note: there is a typo in the matrix of Eigen Vectors shown on the slide!)

Find the product XP to get the new reduced 1D data (note that X is of
shape (10,2) and P is of shape (2, 1))

On the next slide, we see the reduced 1D approximation


MACHINE INTELLIGENCE
PCA: Step 3

Transformed Data:

Example Calculations:
• Original Mean-Centered Point: 0.69, 0.49
• Transformed Point:
(-0.67787 * 0.69) + (-0.73517 * 0.49) = -0.82796
(-0.73517 * 0.69) + (0.67787 * 0.49) = -0.175115
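A quick numerical check of this projection, using the principal-component values quoted above:

```python
import numpy as np

P = np.array([[-0.67787, -0.73517],   # first principal component (the one that is retained)
              [-0.73517,  0.67787]])  # second principal component
x = np.array([0.69, 0.49])            # first mean-centered point

print(P @ x)                          # ~[-0.828, -0.175]; keeping only the first row gives the 1-D value
```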
MACHINE INTELLIGENCE
PCA: Step 3

The variation along principal component 1 is preserved.
The variation along the other principal component is lost!
MACHINE INTELLIGENCE
Principal Component Analysis

• Popular
• Easy to use.
• But, non-parametric!
MACHINE INTELLIGENCE
Singular Value Decomposition (SVD)

• Diagonalization is at the heart of many applications.


• Eigen Value / Eigen Vector based diagonalization works only with Square Matrices!
• SVD works with arbitrary n x m matrices!
The pinnacle of Linear Algebra –
The most general diagonalization theorem!!
MACHINE INTELLIGENCE
SVD (2)

• Let A be an m x n matrix.
• When A is square, if λi is an eigen value and Vi is the corresponding eigen vector, we have A Vi = λi Vi
• When we try to form a similar equation with A (m x n), we have the problem that Vi has to be n x 1 (for
compatible matrix multiplication), but then A Vi will be an m x 1 vector!
• We are mapping from Rn space to Rm space, unlike the case of square matrices!
• Thus, in general, A Vi must get equated to some constant times an m x 1 vector!
• We need two sets of “eigen vectors” !!!!!!!!

***

• Let us defer this problem for the present and instead consider the matrix ATA (why?)
• This is a symmetric, square matrix and hence can be diagonalized using eigen value decomposition!
• We will try to decompose A via decomposition of ATA!!!
MACHINE INTELLIGENCE
SVD (3)

• Let λ1 , λ2, …, λr be the eigen values of ( ATA) with V1,V2, …,Vr being the corresponding
orthonormal eigen vectors. Thus ( ATA) Vi = λi Vi for i = 1, 2, …, r
• Pre-multiply by ViT to get ViT ( ATA) Vi = λi ViT Vi → (ViT AT) (A Vi ) = λi (ViT Vi ) →
(A Vi )T (A Vi ) = λi (ViT Vi ) → || A Vi ||² = λi ||Vi ||² = λi , as Vi is a unit vector.
Thus || A Vi || = √λi = σi (say)
• By rearranging if necessary, we can assume that λ1 ≥ λ2 ≥ … ≥ λr
• Let Ui = (1 / σi ) (A Vi ). Thus Ui is a unit vector, and
A Vi = σi Ui for i = 1, 2, 3, …, r
• Extend {U1, U2,…, Ur} to an orthonormal basis of Rm by adding (m-r) further orthonormal vectors, giving an m x m matrix
U = [ U1, U2,…, Ur, Ur+1, …, Um ]
• Extend the Vs similarly to an orthonormal basis of Rn by adding (n-r) further orthonormal vectors (these span the null space of A), giving an n x n matrix
V = [ V1, V2,…, Vr, Vr+1, …, Vn ]
• Now, we can extend the above A Vi = σi Ui for i = 1, 2, 3, …, r as follows:
A V = [AV1, AV2, …, AVr, AVr+1, …, AVn] = [σ1 U1 , σ2 U2 , …, σr Ur , 0, 0, …, 0] (since A Vi = 0 for i > r)
MACHINE INTELLIGENCE
SVD (4)

We got:
A V = [AV1, AV2, …, AVr, AVr+1, …, AVn] = [σ1 U1 , σ2 U2 , …, σr Ur , 0, 0, …, 0]
The RHS, [σ1 U1 , σ2 U2 , …, σr Ur , 0, 0, …, 0] , can be written as U times a second matrix that has the
following “diagonal” shape: σ1, σ2, …, σr on the leading diagonal and zeros everywhere else.

Let this “diagonal” matrix be ∑

We can now write: A V = U ∑ ; post-multiplying by V-1 , we get A = U ∑ V-1


As V is orthonormal, its inverse V-1 is VT. Hence, we get: A = U ∑ VT - the complete SVD!!!!!
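A minimal numpy sketch on an arbitrary (hypothetical) m x n matrix, checking A = U ∑ VT and A Vi = σi Ui:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))                      # an arbitrary m x n matrix (m=5, n=3)

U, s, Vt = np.linalg.svd(A)                      # full SVD: U is 5x5, Vt is 3x3, s holds sigma_1..sigma_r
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)             # the m x n "diagonal" matrix

print(np.allclose(A, U @ Sigma @ Vt))            # True: A = U Sigma V^T
print(np.allclose(A @ Vt[0], s[0] * U[:, 0]))    # True: A V_1 = sigma_1 U_1
```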
MACHINE INTELLIGENCE
SVD (5) – Finally – The Full SVD
MACHINE INTELLIGENCE
SVD (6) – The Full SVD

A = U ∑ VT
(m x n) = (m x m)(m x n)(n x n)

U = [ U1 U2 … Ur | Ur+1 … Um ], where the first block is m x r and the second is m x (m - r)
∑ has σ1, σ2, …, σr on the leading diagonal of its top-left r x r block and zeros everywhere else
VT stacks the rows V1T, V2T, …, VrT (an r x n block) on top of the remaining (n - r) x n block
MACHINE INTELLIGENCE
SVD (7) - Discussion
• In the development of SVD, we saw that ( ATA) Vi = λi Vi
• Thus, Vi is an eigen vector of ATA with the corresponding eigen value λi = σi²,
where σi is a singular value of A.
• (ATA) Vi = λi Vi ; pre-multiply both sides by A to get
A (ATA) Vi = λi ( A Vi )
(AAT) ( A Vi ) = λi ( A Vi ); since A Vi = σi Ui , this gives (AAT) Ui = λi Ui
λi is an eigen value of ( AAT) also, with Ui as the corresponding eigen vector!
THUS
• Both AAT and ATA have the same set of non-zero eigen values, which are the squares of the singular
values of A.
• Vi s are the eigen vectors of ATA (n x n)
• Ui s are the eigen vectors of AAT (m x m)
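A short numerical check of these facts, again on a hypothetical matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(A)
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]   # eigenvalues of A^T A, descending
eig_AAt = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]   # eigenvalues of A A^T, descending

print(np.allclose(eig_AtA, s**2))                      # True: squared singular values
print(np.allclose(eig_AAt[:len(s)], s**2))             # True: A A^T shares them (its remaining eigenvalues are 0)
```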
MACHINE INTELLIGENCE
SVD and PCA

Our original X was an m (number of features) x n (number of instances) matrix. Let:

Y = (1/√(n-1)) XT , so that YT Y = (1/(n-1)) X XT = CX , the covariance matrix of X.
MACHINE INTELLIGENCE
SVD and PCA (2)
• If we calculate SVD of Y, the columns of matrix V contain the eigen vectors of YT Y = CX
• Thus we calculate the SVD of XT (not X) as U ∑ VT and then the columns of V are the principal
components of X !!!! We do not need to explicitly form the Covariance Matrix.
(We defined X as features (m) by instances (n) matrix and we are interested in the eigen
vectors of the covariance matrix of features! So, we do SVD of XT and use the column vectors
of V. This was done to be compatible with the earlier derivation of PCA. However, we can get
the SVD of X itself but then we need to use the vectors of U instead of V! Both approaches
work as AAT and ATA have the same eigen values, as already discussed! In fact, it may often be better
to work with X, as typically n >> m. Example: If m = 10^3 and n = 10^6, XXT is 10^3 by 10^3. On
the other hand, XTX is 10^6 by 10^6 – a huge difference in the cost of computing the eigen values!)
• Further, in practice we need to use iterative numerical algorithms for PCA or SVD.
• Iterative numerical algorithms for SVD have been shown to converge much faster than those for the
direct PCA! → Thus
PCA via SVD is the most commonly used approach for PCA,
a very popular technique for dimensionality reduction!
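A minimal sketch of PCA via SVD under the convention above (X is an m features x n instances matrix, mean-centered); the data is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 100))                        # m = 3 features, n = 100 instances
Xc = X - X.mean(axis=1, keepdims=True)

Y = Xc.T / np.sqrt(Xc.shape[1] - 1)                  # Y = (1/sqrt(n-1)) X^T, so Y^T Y = C_X
U, s, Vt = np.linalg.svd(Y, full_matrices=False)     # no covariance matrix is formed explicitly
principal_components = Vt                            # rows of Vt (columns of V): principal components of X
explained_variances = s**2                           # eigenvalues of C_X, already in descending order

# cross-check against the explicit covariance matrix
C = (Xc @ Xc.T) / (Xc.shape[1] - 1)
print(np.allclose(np.sort(np.linalg.eigvalsh(C))[::-1], explained_variances))   # True
```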
MACHINE INTELLIGENCE
(SVD Again) Reduced SVD

• Full SVD: A = U ∑ VT
A :m x n U :m x m ∑: mxn VT : n x n
• We extended U and V to full orthonormal bases and padded ∑ with zeros as required to get the Full SVD form (because the full form
presents a complete picture from the perspective of Linear Algebra and also because the Numerical
Algorithms work with such a general full form).
• If we discard those extra columns and the zero padding, we get the Reduced SVD, which is also correct as we have
already seen that A Vi = σi Ui for i = 1, 2, 3, …, r.
• The Reduced SVD can be represented as:

A = Ur ∑r VrT
where Ur is m x r; ∑r is r x r and VrT is r x n.
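In numpy, the reduced form corresponds to full_matrices=False; a small sketch (for a full-rank A, min(m, n) equals r):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 4))                           # m = 6, n = 4, generically rank r = 4

Ur, s, Vrt = np.linalg.svd(A, full_matrices=False)    # Ur: 6x4, s: 4 singular values, Vrt: 4x4
print(np.allclose(A, Ur @ np.diag(s) @ Vrt))          # True: A = Ur Sigma_r Vr^T
```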
MACHINE INTELLIGENCE
Reduced SVD (2)
MACHINE INTELLIGENCE
Applications of SVD
• SVD is quite a general and powerful technique for decomposing matrices.
• It has applications in several diverse domains, PCA being just one of them!
• We already discussed PCA. We will discuss only two more applications here.
Application in Data Mining:
Reduced SVD: A = Ur ∑r VrT where A is m x n; Ur is m x r; ∑r is r x r and VrT is r x n.
• Both AAT and ATA have the same set of non-zero eigen values, which are the squares of the singular values of A.
• Rows of VrT (i.e., columns of Vr) are the eigen vectors of ATA (n x n)
• Columns of Ur are the eigen vectors of AAT (m x m)
Thus:
• V is useful in explaining the covariance among the columns of A
• U is useful in explaining the covariance among the rows of A
• Depending on how we interpret the rows and columns of A, both of the above explanations can be
useful in revealing patterns in the data.
• Any examples?
• Can be used in applications like Recommender Systems!
MACHINE INTELLIGENCE
One interpretation of SVD

• An example where we are interested in the covariance among rows as well as columns.
• The following matrix A lists out the ratings of 5 movies by 7 users, followed by its SVD
on the right.
• Number of non-zero Singular Values of A = 2 → Rank of A = 2. Inference? Rows as well as columns
have essentially 2 “dimensions”!
MACHINE INTELLIGENCE
One interpretation of SVD (2)

• There are two concepts in this data: science fiction and romance

• The U matrix connects people to concepts (each row represents how much one person likes the two concepts)

• The V matrix connects movies to concepts (each row of VT represents how much the 5 movies are connected to the two concepts)

• The Σ matrix outlines the relative strength of the two concepts
MACHINE INTELLIGENCE
One interpretation of SVD (3)

• Now we have two additional ratings for the movies Matrix and Star Wars
from Jill and Jane in the matrix called M’

• Note that the rank of M’ is 3 (not 2 anymore like M)


MACHINE INTELLIGENCE
One Interpretation of SVD (4)

• What does the third concept mean? Don’t know!

• What is the relative strength of the third concept? It’s much


less than the other two

• Can we ignore it?


MACHINE INTELLIGENCE
One Interpretation of SVD (5)

• Reconstructing the matrix M’ after removing the last ‘concept’

• Compare with the actual M’ on the right.


MACHINE INTELLIGENCE
One Interpretation of SVD: Summary

• SVD gives us a set of hidden concepts that the data represents

• By eliminating the concepts that are less important, we can achieve a
representation of the data that contains most of the information in the original data, but
in a lower-dimensional space
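A small sketch of this idea on a hypothetical user x movie ratings matrix (the actual matrices on these slides are in figures, so the numbers below are purely illustrative):

```python
import numpy as np

# hypothetical ratings: rows = users, columns = movies (first 3 sci-fi, last 2 romance)
M = np.array([[5, 5, 4, 0, 0],
              [4, 5, 5, 0, 0],
              [0, 0, 0, 4, 5],
              [0, 1, 0, 5, 4],
              [5, 4, 5, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(np.round(s, 2))                            # a few strong "concepts", then much weaker ones

k = 2                                            # keep only the two strongest concepts
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]         # rank-k reconstruction
print(np.round(M_k, 1))                          # close to M, with the weak concept(s) discarded
```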
MACHINE INTELLIGENCE
SVD and Image Compression
• Reduced SVD again: A = Ur ∑r VrT where A is m x n; Ur is m x r; ∑r is r x r and VrT is r x n.
• Expanding the above, we get:
A = σ1 U1 V1T + σ2 U2 V2T + σ3 U3 V3T + … … + σr Ur VrT
• A is the sum of r Rank 1 matrices! One Rank 1 matrix requires a storage of only n + m values, instead
of n x m values!
• These r matrices are in decreasing order of ”importance”.
• If we use only the first Rank 1 matrix - σ1 U1 V1T , we do achieve great compression but the
approximation may not be that good. (Low picture quality perhaps).
• As we keep adding more of the above matrices, approximation improves very fast and we may be able
to get extremely good quality approximation to the original picture with a relatively small number of
matrices leading to great reduction in the space requirements!
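A sketch of this rank-r approximation; it assumes a grayscale image is available as a 2-D array (the synthetic array below stands in for a real photo, which could instead be loaded with PIL or matplotlib):

```python
import numpy as np

def rank_r_approximation(img, r):
    """Sum of the top-r rank-1 terms sigma_i * U_i * V_i^T of the image's SVD."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r]

# synthetic stand-in for a grayscale image
rng = np.random.default_rng(5)
img = np.outer(np.sin(np.linspace(0, 3, 200)), np.cos(np.linspace(0, 5, 300))) \
      + 0.05 * rng.normal(size=(200, 300))

approx = rank_r_approximation(img, r=5)
storage_fraction = 5 * (img.shape[0] + img.shape[1]) / img.size   # r * (m + n) values vs m * n values
print(round(storage_fraction, 4))                                 # ~0.042 for this 200 x 300 example
```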
MACHINE INTELLIGENCE
SVD and Image Compression (2) - Example

Original Image (2000 x 1500) r = 5; 0.57% Storage r = 100; 11.67% Storage

Source: Data-Driven Science and Engineering by Steven L. Brunton and J. Nathan Kutz
MACHINE INTELLIGENCE
SVD - Conclusions

• SVD is an extremely powerful result from Linear Algebra


• Can be used in any context with Large Matrices
(Examples: Latent Semantic Indexing, Eigenfaces, Recommender Systems etc)
• Thus, has applications in several diverse domains apart from Machine Learning
(Example: Pseudoinverses (Linear Algebra), Signal Processing, Weather Modeling etc)
THANK YOU

Aronya Baksy
abaksy@gmail.com
+91 80 2672 1983 Extn 701
