
Data Mining course by K K Singh is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Dimensionality Reduction using PCA (Principal Component Analysis)

K K SINGH,
DEPT OF CSE, RGUKT NUZVID
Outline
 Introduction
 Mathematical background
 PCA
 Applications
Suppose there are 1,000 students in 20 classrooms of the department.

We want to call a meeting to discuss the students' performance and build some inference/model.
Should we call all 1,000 students, or only a set of student representatives (CRs)?
In the same way, if a data set has 1,000 variables/dimensions, a set of 'p' representative variables (called principal components in PCA) may be enough to build the model.
Introduction
 Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
 PCA can supply the user with a lower-dimensional picture: a projection of the object when viewed from its most informative viewpoint.
 Look at the different 2-D views of a 3-D cuboid.
 Check: which one is the most informative? (The fourth one?)

 The transformation is defined in such a way that the first principal component has the largest possible variance (see the sketch below).
 Source: https://en.wikipedia.org/wiki/Principal_component_analysis
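
As a quick illustration of the "largest possible variance" property (not from the original slides; a minimal sketch assuming NumPy and synthetic correlated data), the following compares the variance of the data projected on the first eigenvector of the covariance matrix with the variance along the original axes and along a random direction:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.stack((x, 2*x + 0.3*rng.normal(size=500)), axis=-1)   # correlated 2-D data
data = data - data.mean(axis=0)                                  # mean-center

cov = np.cov(data, rowvar=False)          # 2 x 2 covariance matrix
evals, evecs = np.linalg.eigh(cov)        # eigh returns eigenvalues in ascending order
pc1 = evecs[:, -1]                        # eigenvector of the largest eigenvalue

rand_dir = rng.normal(size=2)
rand_dir /= np.linalg.norm(rand_dir)      # unit-length random direction

print(np.var(data @ pc1, ddof=1))         # largest of the three variances
print(np.var(data[:, 0], ddof=1))         # variance along the original x-axis
print(np.var(data @ rand_dir, ddof=1))    # variance along the random direction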
Mathematical Background
 If we have one-dimensional data (x):
 Variance: the average squared distance from the mean of the data set to its points.
 Var(X) = ∑(xᵢ - x̄)² / (n-1)
 Covariance: always measured between two dimensions.
 Cov(X,Y) = ∑(xᵢ - x̄)(yᵢ - ȳ) / (n-1)
 Cov(X, X) = Var(X)
 If X and Y are independent (uncorrelated), Cov(X,Y) = 0.
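
A minimal numerical check of these two formulas (assuming NumPy; the small arrays below are just illustration data) against the library functions np.var and np.cov:

import numpy as np

x = np.array([5., 8., 6., 9., 1.])
y = np.array([55., 75., 50., 85., 12.])
n = len(x)

var_x  = np.sum((x - x.mean())**2) / (n - 1)                 # Var(X) = ∑(xᵢ - x̄)²/(n-1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # Cov(X,Y) = ∑(xᵢ - x̄)(yᵢ - ȳ)/(n-1)

print(var_x,  np.var(x, ddof=1))       # the two values agree
print(cov_xy, np.cov(x, y)[0, 1])      # off-diagonal entry of np.cov is Cov(X,Y)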
Mathematical Background
 Covariance Matrix

        [ Cov(x,x)  Cov(x,y) ]   [ Var(x)    Cov(x,y) ]
   C =  [ Cov(x,y)  Cov(y,y) ] = [ Cov(x,y)  Var(y)   ]

 C is a square, symmetric matrix.
• The diagonal values are the variances of each dimension and the off-diagonal values are the covariances between measurement types.
• Large values on the diagonal correspond to interesting dimensions, whereas large values off the diagonal correspond to high correlations (redundancy).
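
The same structure can be verified numerically; a small sketch (assuming NumPy, with made-up 2-D data) that builds C and reads off its diagonal and off-diagonal entries:

import numpy as np

x = np.array([5., 8., 6., 9., 1., 2.])
y = np.array([55., 75., 50., 85., 12., 24.])

C = np.cov(x, y)                        # 2 x 2 covariance matrix
print(C[0, 0], np.var(x, ddof=1))       # diagonal entry = Var(x)
print(C[1, 1], np.var(y, ddof=1))       # diagonal entry = Var(y)
print(C[0, 1], C[1, 0])                 # equal off-diagonal entries = Cov(x,y): C is symmetric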
Mathematical Background
 For A V = λ V, the eigenvalues are found from (A - λI) V = 0. With

   A = [  0   1 ]
       [ -2  -3 ]

 |A - λI| = | -λ    1   | = λ² + 3λ + 2 = 0
            | -2  -3-λ  |

 => λ₁ = -1, λ₂ = -2

 The corresponding eigenvectors are

   V₁ = k₁ [ +1 ]      V₂ = k₂ [ +1 ]
            [ -1 ]               [ -2 ]
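
The worked example above can be checked with a few lines of NumPy (a quick verification sketch, not part of the original slide):

import numpy as np

A = np.array([[ 0.,  1.],
              [-2., -3.]])
evals, evecs = np.linalg.eig(A)           # A is not symmetric, so use eig rather than eigh
print(evals)                              # -1 and -2 (possibly in a different order)
print(evecs)                              # columns are eigenvectors, scaled to unit length
for lam, v in zip(evals, evecs.T):
    print(np.allclose(A @ v, lam * v))    # each column satisfies A v = λ v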

Mathematical Background

Eigenvector Properties
 Eigenvectors can only be found for square matrices.
 Not every square matrix has eigenvectors.
 A symmetric matrix S (n × n) satisfies two properties:
  I. It has exactly n (linearly independent) eigenvectors.
  II. All the eigenvectors are orthogonal (perpendicular).
 Any vector is an eigenvector of the identity matrix.
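
A short sketch (assuming NumPy, with an arbitrary symmetric matrix) illustrating properties I and II: eigh returns n eigenvectors whose columns are mutually orthogonal (in fact orthonormal):

import numpy as np

S = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])                      # a symmetric 3 x 3 matrix
evals, evecs = np.linalg.eigh(S)                  # eigh is meant for symmetric matrices
print(evecs.shape[1])                             # 3 eigenvectors (property I)
print(np.allclose(evecs.T @ evecs, np.eye(3)))    # orthogonal columns (property II)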
Example

Consider a small data set of nine software modules with their size (KLOC), cyclomatic complexity (CC) and number of defects (D):

Sl No   Size (KLOC)   CC   D
1       5             55   4
2       8             75   8
3       6             50   5
4       9             85   10
5       1             12   3
6       2             24   4
7       4             30   3
8       6             70   7
9       8             85   6

Step 1: Get some data (here the size and CC columns).

from scipy import linalg as LA
import numpy as np
import matplotlib.pyplot as plt

x1 = [5, 8, 6, 9, 1, 2, 4, 6, 8]
x2 = [55, 75, 50, 85, 12, 24, 30, 70, 85]
X = np.stack((x1, x2), axis=-1)      # 9 x 2 data matrix

Step 2: Calculate the covariance matrix C.

n, m = X.shape
M = np.mean(X, axis=0)               # column means
X = X - M                            # mean-center the data
cov = np.dot(X.T, X) / (n - 1)

which gives approximately

cov ≈ [  7.53   71.75 ]
      [ 71.75  734.50 ]
Step 3: Plot the data.

plt.scatter(x1, x2, color="r")

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix.

evals, evecs = LA.eigh(cov)
idx = np.argsort(evals)[::-1]        # sort from the largest to the smallest eigenvalue
evecs = evecs[:, idx]
evals = evals[idx]

Step 5: Project on the p eigenvectors that correspond to the highest p eigenvalues.

PC = np.dot(X, evecs)
plt.scatter(PC[:, 0], PC[:, 1], color="b")
plt.ylim(-2, 10)

Step 6: Get the data back.

XX = np.dot(PC, evecs.T)             # recovers the mean-centered X; add M back to restore the original values
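
To actually reduce the dimension rather than just rotate the data, only the first k columns of evecs are kept. A minimal continuation sketch (it reuses X, M and evecs from the steps above; the choice k = 1 is just for illustration):

k = 1
W = evecs[:, :k]                     # top-k eigenvectors (here only the first principal direction)
PC_k = np.dot(X, W)                  # 9 x 1 scores: the 2-D data reduced to 1-D
X_approx = np.dot(PC_k, W.T) + M     # approximate reconstruction in the original units
print(np.round(X_approx, 1))         # close to the original table, since PC1 carries most of the variance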
 Why do we use PCA or SVD?
 A powerful tool for analyzing data and finding patterns.
 Used for data compression and feature selection.
 One can reduce the number of dimensions without much loss of information.
 PCA can be done either by eigenvalue decomposition of the data covariance matrix or by singular value decomposition (SVD) of the data matrix, usually after a normalization step of the initial data (mean centering); see the sketch after this list.
 Some other terminology used with PCA:
 Component scores (or factor scores): new variables (PCs) are constructed as weighted averages of the original variables (X). These new variables are called the principal components, latent variables, or factors. Their values on a specific row are referred to as the factor scores, the component scores, or simply the scores.
 Loadings: the weight by which each normalized original variable should be multiplied to get the transformed variable.
 Loadings = Eigenvectors ⋅ √(Eigenvalues)
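
A small sketch (assuming NumPy, with made-up 2-D data) showing that the SVD route gives the same principal directions and eigenvalues as the covariance-eigendecomposition route, and how the loadings are formed:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
X = np.stack((x, 3*x + rng.normal(size=50)), axis=-1)
X = X - X.mean(axis=0)                       # mean centering

# route 1: eigendecomposition of the covariance matrix
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
evals, evecs = evals[::-1], evecs[:, ::-1]   # largest eigenvalue first

# route 2: SVD of the mean-centered data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(s**2 / (len(X) - 1), evals))   # squared singular values /(n-1) = eigenvalues
print(np.allclose(np.abs(Vt.T), np.abs(evecs)))  # same principal directions, up to sign

loadings = evecs * np.sqrt(evals)            # Loadings = Eigenvectors ⋅ √(Eigenvalues)
print(loadings)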
PCA Applications

 Dimensionality reduction
 Image compression
 Feature selection
 PageRank and HITS algorithms
 LSI: Latent Semantic Indexing (document term-to-concept similarity matrix)
Assignments:

 Can we compute the eigenvectors of a square matrix A where det(A) = 0?
 Explain why XᵀX/(n-1) = cov(X), where X is the mean-centered data matrix.
 The principal components P (n × m) and the eigenvectors V (m × m) are given, and the first k eigenvectors capture 99% of the variance in the data set. How can the original data set be reconstructed (approximately) using only the first k columns of V and PC?

Thank you for listening

Questions or thoughts?
