Unit 4: PCA-SVD
MACHINE INTELLIGENCE
Dimensionality Reduction
Aronya Baksy
Department of Computer Science and Engineering
Outline
Discussion: Features for Learning Task (1)
Feature Set 2:
Feature 1: 5 17 13 29 72 7
Feature 2: 1.9 35 65 3.8 11.8 45.8
Feature Set 3:
Feature 1: 5 20 30 10 45 71
Feature 2: 5.5 19 28 9.5 44 70.5
Discussion: Features for Learning Task (2)
Feature Set 1:
The second feature does not seem to be of much value in distinguishing the instances. Why?
Feature Set 2:
Both the features seem useful in distinguishing the instances. Why?
Feature Set 3:
Both the features seem useful in distinguishing the instances at first glance. But let us transform the
coordinate axes (why?) using the transformation matrix whose rows are (0.707, 0.707) and (-0.707, 0.707).
Example: the point (30, 28) gets transformed into

$$\begin{pmatrix} 0.707 & 0.707 \\ -0.707 & 0.707 \end{pmatrix} \begin{pmatrix} 30 \\ 28 \end{pmatrix} = \begin{pmatrix} 41 \\ -1.4 \end{pmatrix}$$

(A short numerical check follows the transformed features below.)
The features now (linear combinations of the original features) are:
Feature 1’: 7.4 27.5 41 13.6 63 100
Feature 2’: 0.35 0.7 1.4 0.35 0.70 0.35
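A minimal numerical check of this transformation (not part of the original slides; NumPy is assumed):

```python
import numpy as np

# Feature Set 3, with features as rows and the 6 instances as columns.
X = np.array([[5.0, 20.0, 30.0, 10.0, 45.0, 71.0],     # Feature 1
              [5.5, 19.0, 28.0,  9.5, 44.0, 70.5]])    # Feature 2

# Transformation matrix from the slide (a 45-degree rotation of the axes).
P = np.array([[ 0.707, 0.707],
              [-0.707, 0.707]])

Y = P @ X
print(np.round(Y[0], 1))   # Feature 1': large spread, roughly 7.4 ... 100
print(np.round(Y[1], 2))   # Feature 2': very small magnitudes (about 0.35 to 1.4)
```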
Discussion: Features for Learning Task (3)
(Informed) Assumptions:
• Features with high variance are generally more useful than features with low variance.
• Only one feature from a set of highly correlated features is useful; the rest of the features do not add
much value in discriminating the data instances!
Variance: $\mathrm{Var}(x) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Covariance: $\mathrm{Cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
Covariance Matrix (for mean-centered data X with features in rows): $C_X = \frac{1}{n-1} X X^T$
• Can we change the coordinate axes to achieve this (a Covariance Matrix whose off-diagonal
entries are 0 and whose diagonal entries are sorted by magnitude)? (Have a re-look at the example!)
Find some orthonormal matrix P that transforms the original feature space X into
Y = PX such that the Covariance Matrix of Y is of the desired nature indicated
above! In other words:
$C_Y = \frac{1}{n-1} Y Y^T$ is diagonalized! (See the sketch after this list.)
• By dropping the principal components of “low importance”, we can achieve dimensionality reduction.
• The transformed data is Y = P X
• In practice, we always work with mean-centered data to simplify the calculations.
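A minimal sketch of this idea (not from the slides; NumPy assumed): take the rows of P to be the eigenvectors of the covariance matrix of the mean-centered data (features in rows, instances in columns, matching Y = PX), so that the covariance matrix of Y comes out diagonal with sorted diagonal entries.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))                 # 2 features x 100 instances
X[1] = 0.9 * X[0] + 0.1 * X[1]                # make the two features highly correlated
X = X - X.mean(axis=1, keepdims=True)         # mean-centering

C_X = (X @ X.T) / (X.shape[1] - 1)            # covariance matrix of X
eigvals, eigvecs = np.linalg.eigh(C_X)        # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]             # sort eigenvalues in decreasing order
P = eigvecs[:, order].T                       # rows of P = principal directions

Y = P @ X
C_Y = (Y @ Y.T) / (Y.shape[1] - 1)
print(np.round(C_Y, 6))                       # off-diagonal ~ 0, diagonal sorted by magnitude
```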
PCA: Example
• If we only store the point (1, 2, 3) and the 6 multiplying factors, then we only
need 9 values × 4 bytes = 36 bytes (a 50% saving over the 72 bytes needed to store all 6 points as they are!)
PCA: Example
• All the 6 points lie on the same line (along the vector (1, 2, 3)) in 3D
Euclidean space
Discussion: It may be more helpful to think that the features are α, 2α, 3α, i.e. perfectly
correlated features (instead of thinking in terms of different data points)!
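A small sketch of this view (the six multipliers below are illustrative choices, not the slide's actual values; NumPy assumed):

```python
import numpy as np

alphas = np.array([1.0, 2.0, 0.5, 3.0, 1.5, 2.5])   # hypothetical multipliers alpha
points = np.outer(alphas, [1, 2, 3])                 # each point is alpha * (1, 2, 3)

print(np.linalg.matrix_rank(points))                 # 1: a single direction describes all points

Xc = points - points.mean(axis=0)                    # mean-centre
C = (Xc.T @ Xc) / (len(alphas) - 1)                  # 3 x 3 covariance matrix
print(np.round(np.linalg.eigvalsh(C), 6))            # only one non-zero eigenvalue
```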
Principal Component Analysis: Stepwise
How much information is captured along Principal Component i? This is given by

$$\frac{\lambda_i}{\sum_k \lambda_k}$$

where $\lambda_i$ denotes the i-th eigenvalue of the covariance matrix.
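For example (the eigenvalue magnitudes below are assumptions for illustration, not values quoted on this slide):

```python
import numpy as np

eigvals = np.array([1.284, 0.049])        # assumed eigenvalues of a 2-D covariance matrix
ratio = eigvals / eigvals.sum()
print(np.round(ratio, 3))                  # [0.963, 0.037]: PC1 captures ~96% of the information
```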
PCA: Step 2
Since we choose to retain only 1 PC, we choose only that Eigenvector and
call it P
Find the product XP to get the new reduced 1D data (note that X is of
shape (10,2) and P is of shape (2, 1))
Transformed Data:
Example Calculations:
• Original Mean-Centered Point: 0.69, 0.49
• Transformed Point:
along PC1: (-0.67787 * 0.69) + (-0.73517 * 0.49) ≈ -0.82796
along PC2 (discarded when only 1 PC is retained): (-0.73517 * 0.69) + (0.67787 * 0.49) ≈ -0.17511
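The same calculation in NumPy (a sketch using the slide's numbers):

```python
import numpy as np

x = np.array([0.69, 0.49])               # original mean-centered point
P = np.array([-0.67787, -0.73517])       # retained eigenvector (PC1)

print(x @ P)                              # ~ -0.82796: the point's 1-D coordinate along PC1
```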
PCA: Step 3
The variation along principal component 1 is preserved. The variation along the other principal
component is lost!
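A sketch of this effect on a hypothetical 2-D dataset (not the slide's data; NumPy assumed): project onto PC1, reconstruct, and compare the variance along each principal component.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2)) @ np.array([[2.0, 0.5],
                                         [0.5, 0.6]])   # hypothetical correlated data
X = X - X.mean(axis=0)                                    # mean-centre

C = np.cov(X, rowvar=False)
vals, vecs = np.linalg.eigh(C)
p1 = vecs[:, np.argmax(vals)]            # principal component 1
p2 = vecs[:, np.argmin(vals)]            # principal component 2

coords = X @ p1                          # Step 2: reduce to 1-D
X_back = np.outer(coords, p1)            # Step 3: map back to the original 2-D space

print(np.var(X @ p1), np.var(X_back @ p1))   # variance along PC1: preserved
print(np.var(X @ p2), np.var(X_back @ p2))   # variance along PC2: becomes 0 (lost)
```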
Principal Component Analysis
• Popular
• Easy to use.
• But, non-parametric!
Singular Value Decomposition (SVD)
• Let A be an m x n matrix.
• When A is square, if λi is an eigenvalue and Vi is the corresponding eigenvector, we have A Vi = λi Vi
• When we try to form a similar equation with A (m x n), we have the problem that Vi has to be n x 1 (for
compatible matrix multiplication), but then A Vi will be an m x 1 vector!
• We are mapping from Rn space to Rm space, unlike the case of square matrices!
• Thus, in general, A Vi must get equated to some constant times an m x 1 vector!
• We need two sets of “eigenvectors”!
***
• Let us defer this problem for the present and instead consider the matrix ATA (why?)
• This is a symmetric, square matrix and hence can be diagonalized using eigenvalue decomposition!
• We will try to decompose A via the decomposition of ATA!
SVD (3)
• Let λ1, λ2, …, λr be the eigenvalues of (ATA), with V1, V2, …, Vr being the corresponding
orthonormal eigenvectors. Thus (ATA) Vi = λi Vi for i = 1, 2, …, r
• Pre-multiply by ViT to get ViT (ATA) Vi = λi ViT Vi → (ViT AT)(A Vi) = λi (ViT Vi) →
(A Vi)T (A Vi) = λi (ViT Vi) → ||A Vi||2 = λi ||Vi||2 = λi, as Vi is an orthonormal vector.
Thus ||A Vi|| = √λi = σi (say)
• By rearranging if necessary, we can assume that λ1 ≥ λ2 ≥ … ≥ λr
• Let Ui = (1/σi)(A Vi). Thus Ui is a unit vector, and
A Vi = σi Ui for i = 1, 2, 3, …, r
• Extend {U1, U2, …, Ur} by adding (m - r) further orthonormal vectors of size m x 1 (completing an
orthonormal basis of Rm) to get an m x m orthogonal matrix U
U = [U1, U2, …, Ur, Ur+1, …, Um] = [U1, U2, ……… Um]
• Extend the Vs similarly by adding (n - r) orthonormal vectors of size n x 1 (an orthonormal basis of
the null space of A, so that A Vi = 0 for i > r) to get an n x n orthogonal matrix V
V = [V1, V2, …, Vr, Vr+1, …, Vn] = [V1, V2, ……… Vn]
• Now, we can extend the above A Vi = σi Ui for i = 1, 2, 3, …, r as follows:
A V = [A V1, A V2, …, A Vr, 0, 0, …, 0] = [σ1 U1, σ2 U2, …, σr Ur, 0, 0, …, 0]
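A numerical check of this construction (the 2 x 3 matrix below is an illustrative choice, not from the slides; NumPy assumed):

```python
import numpy as np

A = np.array([[ 3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])                  # illustrative 2 x 3 matrix

lam, V = np.linalg.eigh(A.T @ A)                  # eigenvalues/eigenvectors of A^T A
order = np.argsort(lam)[::-1]                     # so that lambda_1 >= lambda_2 >= ...
lam, V = lam[order], V[:, order]

r = np.linalg.matrix_rank(A)
for i in range(r):
    sigma_i = np.sqrt(lam[i])
    # ||A V_i|| equals sqrt(lambda_i), i.e. the singular value sigma_i
    print(np.isclose(np.linalg.norm(A @ V[:, i]), sigma_i))

print(np.round(np.sqrt(lam[:r]), 4))                      # sqrt of eigenvalues of A^T A ...
print(np.round(np.linalg.svd(A, compute_uv=False), 4))    # ... match the singular values of A
```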
SVD (4)
We got:
A V = [A V1, A V2, …, A Vr, 0, 0, …, 0] = [σ1 U1, σ2 U2, …, σr Ur, 0, 0, …, 0]
The RHS, [σ1 U1, σ2 U2, …, σr Ur, 0, 0, …, 0], can be written as the product U ∑, where:
• U = [U1 U2 … Ur | Ur+1 … Um] is the m x m matrix whose first r columns are the Ui (an m x r block),
completed to m x m as described above.
• ∑ is the m x n matrix with the following "diagonal" shape (σ1, …, σr in the top-left r x r block,
zeros everywhere else):

$$\Sigma = \begin{pmatrix} \mathrm{diag}(\sigma_1, \ldots, \sigma_r) & 0 \\ 0 & 0 \end{pmatrix}_{m \times n}$$

Post-multiplying A V = U ∑ by VT (V VT = I, since V is orthogonal) gives:

A = U ∑ VT
(m x n) = (m x m)(m x n)(n x n)
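A quick check of the Full SVD shapes and the reconstruction A = U ∑ VT in NumPy (a sketch, using the same illustrative matrix as above):

```python
import numpy as np

A = np.array([[ 3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])                  # illustrative 2 x 3 matrix

U, s, Vt = np.linalg.svd(A)                       # Full SVD: U is m x m, Vt is n x n
Sigma = np.zeros(A.shape)                         # m x n "diagonal" matrix
Sigma[:len(s), :len(s)] = np.diag(s)              # sigma_1 >= sigma_2 on the diagonal

print(U.shape, Sigma.shape, Vt.shape)             # (2, 2) (2, 3) (3, 3)
print(np.allclose(A, U @ Sigma @ Vt))             # True: A = U Sigma V^T
```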
SVD (7) - Discussion
• In the development of SVD, we saw that (ATA) Vi = λi Vi
• Thus, Vi is an eigenvector of ATA with the corresponding eigenvalue λi = σi2,
where σi is a singular value of A.
• (ATA) Vi = λi Vi ; pre-multiply both sides by A to get
A (ATA) Vi = λi (A Vi)
(AAT)(A Vi) = λi (A Vi), i.e. (AAT)(σi Ui) = λi (σi Ui), so (AAT) Ui = λi Ui
λi is an eigenvalue of (AAT) also, with Ui as the corresponding eigenvector!
THUS
• Both AAT and ATA have the same set of non-zero eigenvalues, which are the squares of the singular
values of A.
• The Vi s are the eigenvectors of ATA (n x n)
• The Ui s are the eigenvectors of AAT (m x m)
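This relationship is easy to verify numerically (a sketch with a random matrix; NumPy assumed):

```python
import numpy as np

A = np.random.default_rng(2).normal(size=(4, 6))             # random 4 x 6 matrix

s = np.linalg.svd(A, compute_uv=False)                       # singular values of A
eig_AAt = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]         # eigenvalues of A A^T (4 values)
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]         # eigenvalues of A^T A (6 values)

print(np.allclose(eig_AAt, s**2))                            # True
print(np.allclose(eig_AtA[:4], s**2))                        # True: same non-zero eigenvalues
print(np.allclose(eig_AtA[4:], 0))                           # the remaining eigenvalues are ~0
```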
SVD and PCA
• Full SVD: A = U ∑ VT
A: m x n, U: m x m, ∑: m x n, VT: n x n
• We completed U and V to square orthogonal matrices and padded ∑ with zeros as required to get the
Full SVD form (because the full form presents a complete picture from the perspective of Linear
Algebra, and also because numerical algorithms work with such a general full form).
• If we discard that padding, we get the Reduced SVD, which is also correct as we have
already seen that A Vi = σi Ui for i = 1, 2, 3, …, r.
• The Reduced SVD can be represented as A = Ur ∑r VrT, where Ur is m x r, ∑r is r x r with
σ1, …, σr on its diagonal, and Vr is n x r.
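In NumPy this corresponds to the "economy" SVD (a sketch; the 7 x 5 shape below is a placeholder mirroring the ratings example that follows):

```python
import numpy as np

A = np.random.default_rng(3).normal(size=(7, 5))          # placeholder 7 x 5 matrix

U_r, s_r, Vt_r = np.linalg.svd(A, full_matrices=False)    # Reduced SVD: no padding
print(U_r.shape, s_r.shape, Vt_r.shape)                    # (7, 5) (5,) (5, 5)
print(np.allclose(A, U_r @ np.diag(s_r) @ Vt_r))           # True
```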
• An example where we are interested in the covariance among rows as well as columns:
• The following matrix A lists the ratings of 5 movies by 7 users, followed by its SVD
on the right.
• Number of non-zero Singular Values of A = 2 → Rank of A = 2. Inference? The rows as well as the
columns have essentially 2 "dimensions"!
One interpretation of SVD (2)
• There are two concepts in this data: science fiction and romance
• Now we have two additional ratings for the movies Matrix and Star Wars
from Jill and Jane in the matrix called M’
Source: Data-Driven Science and Engineering by Steven L. Brunton and J. Nathan Kutz
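A sketch of this interpretation (the ratings matrix, column ordering, and the new user's ratings below are hypothetical stand-ins; the actual numbers from the slide are not reproduced here):

```python
import numpy as np

# Hypothetical 7 users x 5 movies ratings matrix of rank 2.
# Columns: [Matrix, Alien, Star Wars, Casablanca, Titanic] (assumed ordering).
A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 0, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 3))                        # only 2 non-zero singular values: 2 "concepts"

# A new user (say, a "Jill"-like user) who rated only Matrix and Star Wars can still
# be mapped into concept space using V:
q = np.array([4.0, 0.0, 5.0, 0.0, 0.0])      # hypothetical ratings for Matrix and Star Wars
V = Vt.T
print(np.round(q @ V[:, :2], 3))             # large magnitude along the sci-fi concept, ~0 along romance
```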
SVD - Conclusions
Aronya Baksy
abaksy@gmail.com
+91 80 2672 1983 Extn 701