K K SINGH,
DEPT OF CSE, RGUKT NUZVID
Outline
Introduction
Mathematical background
PCA
Applications
Suppose there are 1000 students in the 20 classrooms of the department.
We want to call a meeting to discuss the performance of the students and build
some inference/model.
Should we call all 1000 students, or only a set of student representatives
(CRs)?
In the same way, if there are 1000 variables/dimensions in a data set, a set of p
representative variables (called principal components in PCA) may be enough to
build the model.
Introduction
Principal component analysis (PCA) is a statistical procedure that uses
an orthogonal transformation to convert a set of observations of possibly
correlated variables into a set of values of linearly uncorrelated variables
called principal components.
PCA can supply the user with a lower-dimensional picture: a projection of
the object when viewed from its most informative viewpoint.
Look at the different 2-D views of a 3-D cuboid.
Which one is the most informative? (The fourth one?)
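The decorrelation claim above can be checked directly. The sketch below (not from the slides; the data is made up) builds two correlated variables, applies the orthogonal transformation given by the eigenvectors of their covariance matrix, and shows that the resulting components are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated variables: x2 is roughly 3*x1 plus noise.
x1 = rng.normal(size=200)
x2 = 3 * x1 + rng.normal(scale=0.5, size=200)
X = np.stack((x1, x2), axis=-1)
X = X - X.mean(axis=0)                # centre the data

cov = X.T @ X / (len(X) - 1)          # covariance of the original variables
evals, evecs = np.linalg.eigh(cov)    # orthogonal transformation (eigenvectors)
PC = X @ evecs                        # principal components

cov_pc = PC.T @ PC / (len(PC) - 1)    # covariance of the components
# The off-diagonal entries are ~0: the components are uncorrelated.
print(np.round(cov_pc, 6))
```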
Eigenvector Properties
Eigenvectors can only be found for square matrices.
Not every square matrix has (real) eigenvectors.
A symmetric matrix S (n x n) satisfies two properties:
I. It has exactly n linearly independent eigenvectors.
II. All the eigenvectors are orthogonal (perpendicular).
Every vector is an eigenvector of the identity matrix.
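Both properties can be verified numerically; a minimal sketch with a small symmetric matrix of my own choosing:

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # a symmetric 2x2 matrix
evals, evecs = np.linalg.eigh(S)    # eigh is intended for symmetric matrices

# S has n = 2 eigenvectors (the columns of evecs), and they are
# orthogonal: evecs.T @ evecs is the identity matrix.
print(evals)            # [1. 3.]
print(evecs.T @ evecs)  # ~identity
```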
Example

Step 1: get some data.

Sl No   Size (KLOC)   Cyclomatic Complexity (CC)   No. of Defects (D)
1       5             55                           4
2       8             75                           8
3       6             50                           5
4       9             85                           10
5       1             12                           3
6       2             24                           4
7       4             30                           3
8       6             70                           7
9       8             85                           6

from scipy import linalg as LA
import numpy as np
x1 = [5, 8, 6, 9, 1, 2, 4, 6, 8]           # size (KLOC)
x2 = [55, 75, 50, 85, 12, 24, 30, 70, 85]  # cyclomatic complexity (CC)
X = np.stack((x1, x2), axis=-1)

Step 2: calculate the covariance matrix C.

n, m = X.shape
M = np.mean(X, axis=0)
X = X - M
cov = np.dot(X.T, X) / (n - 1)
# cov ≈ [[ 7.53  71.75]
#        [71.75 734.5 ]]
Steps 3 and 4: compute the eigenvalues and eigenvectors of C, and sort them in
decreasing order of eigenvalue (the projection below uses evecs).

evals, evecs = LA.eigh(cov)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

Step 5: project on the p eigenvectors that correspond to the p highest
eigenvalues.

import matplotlib.pyplot as plt
plt.scatter(x1, x2, color="r")              # original data
PC = np.dot(X, evecs)
plt.scatter(PC[:, 0], PC[:, 1], color="b")  # principal components
plt.ylim(-2, 10)
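Putting the steps together, a self-contained sketch using the slide's data, keeping p = 1 component; the explained-variance ratio (not computed in the slides) shows one component is enough here:

```python
import numpy as np
from scipy import linalg as LA

# The slide's data: size (KLOC) and cyclomatic complexity (CC).
x1 = [5, 8, 6, 9, 1, 2, 4, 6, 8]
x2 = [55, 75, 50, 85, 12, 24, 30, 70, 85]
X = np.stack((x1, x2), axis=-1).astype(float)

n, m = X.shape
X = X - X.mean(axis=0)                 # centre the data
cov = X.T @ X / (n - 1)                # covariance matrix

evals, evecs = LA.eigh(cov)            # eigenvalues in ascending order
order = np.argsort(evals)[::-1]        # re-sort in descending order
evals, evecs = evals[order], evecs[:, order]

PC1 = X @ evecs[:, :1]                 # keep only the first component (p = 1)

# Fraction of total variance captured by the first component (close to 1).
explained = evals[0] / evals.sum()
print(round(explained, 4))
```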
Applications
Dimensionality reduction
Image compression
Feature selection
PageRank and HITS algorithms
LSI: Latent Semantic Indexing (document term-to-concept similarity matrix)
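The image-compression application amounts to a low-rank approximation, commonly done with the SVD (which is closely related to PCA). A minimal sketch with a made-up 8x8 "image" (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
# A hypothetical 8x8 grayscale "image" that is nearly rank 2.
A = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 8)) \
    + 0.01 * rng.normal(size=(8, 8))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                  # keep only the top-k singular values
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]   # rank-k approximation of the image

# Storing U[:, :k], s[:k], Vt[:k, :] needs far fewer numbers than A,
# yet the relative reconstruction error is small.
err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(err)
```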
Assignments
Questions or Thoughts ??