
Data Mining course by K K Singh is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Dimensionality Reduction using PCA (Principal Component Analysis)

K K SINGH,
DEPT OF CSE, RGUKT NUZVID
Outline
 Introduction
 Mathematical background
 PCA
 Applications
Suppose there are 1,000 students in 20 classrooms of the department.

We want to call a meeting to discuss the students' performance and build some inference/model.
Should we call all 1,000 students, or only a set of student representatives (CRs)?
In the same way, if a data set has 1,000 variables/dimensions, a set of 'p' representative variables (called principal components in PCA) may be enough to build the model.
Introduction
 Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
 PCA can supply the user with a lower-dimensional picture: a projection of the object when viewed from its most informative viewpoint.
 Look at the different 2-D views of a 3-D cuboid.
 Check: which one is the most informative? (The fourth one?)

 The transformation is defined in such a way that the first principal component has the largest possible variance (see the sketch below).
 Source: https://en.wikipedia.org/wiki/Principal_component_analysis
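
As a quick illustration of the "largest possible variance" property (not from the original slides; a minimal sketch assuming NumPy and synthetic correlated data), the following compares the variance of the data projected on the first eigenvector of the covariance matrix with the variance along the original axes and along a random direction:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.stack((x, 2*x + 0.3*rng.normal(size=500)), axis=-1)   # correlated 2-D data
data = data - data.mean(axis=0)                                  # mean-center

cov = np.cov(data, rowvar=False)          # 2 x 2 covariance matrix
evals, evecs = np.linalg.eigh(cov)        # eigh returns eigenvalues in ascending order
pc1 = evecs[:, -1]                        # eigenvector of the largest eigenvalue

rand_dir = rng.normal(size=2)
rand_dir /= np.linalg.norm(rand_dir)      # unit-length random direction

print(np.var(data @ pc1, ddof=1))         # largest of the three variances
print(np.var(data[:, 0], ddof=1))         # variance along the original x-axis
print(np.var(data @ rand_dir, ddof=1))    # variance along the random direction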
Mathematical Background
 If we have one-dimensional data (x):
 Variance: the average squared distance from the mean of the data set to its points.
 Var(X) = ∑(xᵢ - x̄)² / (n-1)
 Covariance: always measured between two dimensions.
 Cov(X,Y) = ∑(xᵢ - x̄)(yᵢ - ȳ) / (n-1)
 Cov(X, X) = Var(X)
 If X and Y are independent (uncorrelated), Cov(X,Y) = 0.
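
A minimal numerical check of these two formulas (assuming NumPy; the small arrays below are just illustration data) against the library functions np.var and np.cov:

import numpy as np

x = np.array([5., 8., 6., 9., 1.])
y = np.array([55., 75., 50., 85., 12.])
n = len(x)

var_x  = np.sum((x - x.mean())**2) / (n - 1)                 # Var(X) = ∑(xᵢ - x̄)²/(n-1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # Cov(X,Y) = ∑(xᵢ - x̄)(yᵢ - ȳ)/(n-1)

print(var_x,  np.var(x, ddof=1))       # the two values agree
print(cov_xy, np.cov(x, y)[0, 1])      # off-diagonal entry of np.cov is Cov(X,Y)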
Mathematical Background
 Covariance Matrix

        [ Cov(x,x)  Cov(x,y) ]   [ Var(x)    Cov(x,y) ]
   C =  [ Cov(x,y)  Cov(y,y) ] = [ Cov(x,y)  Var(y)   ]

 C is a square, symmetric matrix.
• The diagonal values are the variances of each dimension and the off-diagonal values are the covariances between measurement types.
• Large values on the diagonal correspond to interesting dimensions, whereas large values off the diagonal correspond to high correlations (redundancy).
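
The same structure can be verified numerically; a small sketch (assuming NumPy, with made-up 2-D data) that builds C and reads off its diagonal and off-diagonal entries:

import numpy as np

x = np.array([5., 8., 6., 9., 1., 2.])
y = np.array([55., 75., 50., 85., 12., 24.])

C = np.cov(x, y)                        # 2 x 2 covariance matrix
print(C[0, 0], np.var(x, ddof=1))       # diagonal entry = Var(x)
print(C[1, 1], np.var(y, ddof=1))       # diagonal entry = Var(y)
print(C[0, 1], C[1, 0])                 # equal off-diagonal entries = Cov(x,y): C is symmetric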
Mathematical Background
 For A V = λ V, the eigenvalues are found from (A - λI) V = 0. With

   A = [  0   1 ]
       [ -2  -3 ]

 |A - λI| = | -λ    1   | = λ² + 3λ + 2 = 0
            | -2  -3-λ  |

 => λ₁ = -1, λ₂ = -2

 The corresponding eigenvectors are

   V₁ = k₁ [ +1 ]      V₂ = k₂ [ +1 ]
            [ -1 ]               [ -2 ]
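
The worked example above can be checked with a few lines of NumPy (a quick verification sketch, not part of the original slide):

import numpy as np

A = np.array([[ 0.,  1.],
              [-2., -3.]])
evals, evecs = np.linalg.eig(A)           # A is not symmetric, so use eig rather than eigh
print(evals)                              # -1 and -2 (possibly in a different order)
print(evecs)                              # columns are eigenvectors, scaled to unit length
for lam, v in zip(evals, evecs.T):
    print(np.allclose(A @ v, lam * v))    # each column satisfies A v = λ v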

Mathematical Background

Eigenvector Properties
 Eigenvectors can only be found for square matrices.
 Not every square matrix has eigenvectors.
 A symmetric matrix S (n × n) satisfies two properties:
  I. It has exactly n (linearly independent) eigenvectors.
  II. All the eigenvectors are orthogonal (perpendicular).
 Any vector is an eigenvector of the identity matrix.
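
A short sketch (assuming NumPy, with an arbitrary symmetric matrix) illustrating properties I and II: eigh returns n eigenvectors whose columns are mutually orthogonal (in fact orthonormal):

import numpy as np

S = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])                      # a symmetric 3 x 3 matrix
evals, evecs = np.linalg.eigh(S)                  # eigh is meant for symmetric matrices
print(evecs.shape[1])                             # 3 eigenvectors (property I)
print(np.allclose(evecs.T @ evecs, np.eye(3)))    # orthogonal columns (property II)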
Example

Consider a small data set of nine software modules with their size (KLOC), cyclomatic complexity (CC) and number of defects (D):

Sl No   Size (KLOC)   CC   D
1       5             55   4
2       8             75   8
3       6             50   5
4       9             85   10
5       1             12   3
6       2             24   4
7       4             30   3
8       6             70   7
9       8             85   6

Step 1: Get some data (here the size and CC columns).

from scipy import linalg as LA
import numpy as np
import matplotlib.pyplot as plt

x1 = [5, 8, 6, 9, 1, 2, 4, 6, 8]
x2 = [55, 75, 50, 85, 12, 24, 30, 70, 85]
X = np.stack((x1, x2), axis=-1)      # 9 x 2 data matrix

Step 2: Calculate the covariance matrix C.

n, m = X.shape
M = np.mean(X, axis=0)               # column means
X = X - M                            # mean-center the data
cov = np.dot(X.T, X) / (n - 1)

which gives approximately

cov ≈ [  7.53   71.75 ]
      [ 71.75  734.50 ]
Step 3: Plot the data.

plt.scatter(x1, x2, color="r")

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix.

evals, evecs = LA.eigh(cov)
idx = np.argsort(evals)[::-1]        # sort from the largest to the smallest eigenvalue
evecs = evecs[:, idx]
evals = evals[idx]

Step 5: Project on the p eigenvectors that correspond to the highest p eigenvalues.

PC = np.dot(X, evecs)
plt.scatter(PC[:, 0], PC[:, 1], color="b")
plt.ylim(-2, 10)

Step 6: Get the data back.

XX = np.dot(PC, evecs.T)             # recovers the mean-centered X; add M back to restore the original values
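
To actually reduce the dimension rather than just rotate the data, only the first k columns of evecs are kept. A minimal continuation sketch (it reuses X, M and evecs from the steps above; the choice k = 1 is just for illustration):

k = 1
W = evecs[:, :k]                     # top-k eigenvectors (here only the first principal direction)
PC_k = np.dot(X, W)                  # 9 x 1 scores: the 2-D data reduced to 1-D
X_approx = np.dot(PC_k, W.T) + M     # approximate reconstruction in the original units
print(np.round(X_approx, 1))         # close to the original table, since PC1 carries most of the variance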
 Why do we use PCA or SVD?
 A powerful tool for analyzing data and finding patterns.
 Used for data compression and feature selection.
 One can reduce the number of dimensions without much loss of information.
 PCA can be done either by eigenvalue decomposition of the data covariance matrix or by singular value decomposition (SVD) of the data matrix, usually after a normalization step of the initial data (mean centering); see the sketch after this list.
 Some other terminology used with PCA:
 Component scores (or factor scores): new variables (PCs) are constructed as weighted averages of the original variables (X). These new variables are called the principal components, latent variables, or factors. Their values on a specific row are referred to as the factor scores, the component scores, or simply the scores.
 Loadings: the weight by which each normalized original variable should be multiplied to get the transformed variable.
 Loadings = Eigenvectors ⋅ √(Eigenvalues)
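
A small sketch (assuming NumPy, with made-up 2-D data) showing that the SVD route gives the same principal directions and eigenvalues as the covariance-eigendecomposition route, and how the loadings are formed:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
X = np.stack((x, 3*x + rng.normal(size=50)), axis=-1)
X = X - X.mean(axis=0)                       # mean centering

# route 1: eigendecomposition of the covariance matrix
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
evals, evecs = evals[::-1], evecs[:, ::-1]   # largest eigenvalue first

# route 2: SVD of the mean-centered data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(s**2 / (len(X) - 1), evals))   # squared singular values /(n-1) = eigenvalues
print(np.allclose(np.abs(Vt.T), np.abs(evecs)))  # same principal directions, up to sign

loadings = evecs * np.sqrt(evals)            # Loadings = Eigenvectors ⋅ √(Eigenvalues)
print(loadings)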
PCA Applications

 Dimensionality reduction
 Image compression
 Feature selection
 PageRank and HITS algorithms
 LSI: Latent Semantic Indexing (document term-to-concept similarity matrix)
Assignments:

 Can we compute the eigenvectors of a square matrix A where det(A) = 0?
 Explain why XᵀX/(n-1) = cov(X), where X is the mean-centered data matrix.
 The principal components P (n × m) and the eigenvectors V (m × m) are given, and the first k eigenvectors capture 99% of the variance in the data set. How can the original data set be reconstructed (approximately) using only the first k columns of V and PC?

Thank you for listening

Questions or thoughts?
