
Pattern Recognition

Module 3
PCA
Subrata Datta
Dept. of AIML
NSEC
PCA: Introduction
• Principal component analysis, or PCA, is a dimensionality
reduction method that is often applied to large data sets: it
transforms a large set of variables into a smaller one that still
contains most of the information in the large set.
• PCA is defined as an orthogonal linear transformation.
• PCA is a non-parametric method of extracting relevant
information from datasets.
• Principal Component Analysis (PCA) is an unsupervised
learning method.
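
As a quick illustration of PCA as an unsupervised dimensionality-reduction method, here is a minimal sketch assuming scikit-learn and NumPy are available (the random data and the choice of two components are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative toy data: 100 samples, 5 features, two of them correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)

# Orthogonal linear transformation onto the top-2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2): reduced dimensionality
print(pca.explained_variance_ratio_)  # share of variance retained by each PC
```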
PCA: Uses
• PCA is used to visualize multidimensional data.
• It is used to reduce the number of dimensions in healthcare data.
• PCA can help resize an image.
• It can be used in finance to analyze stock data and forecast returns.
Common terms used in PCA
• Dimensionality: It is the number of features or variables present in
the given dataset; more simply, it is the number of columns present in
the dataset.
• Correlation: It signifies how strongly two variables are related
to each other, i.e., if one changes, the other variable also gets
changed. The correlation value ranges from -1 to +1. Here, -1 occurs
when the variables are inversely related, and +1 indicates
that the variables are directly related.
• Orthogonal: It means that variables are uncorrelated with each
other, and hence the correlation between each pair of variables is zero.
• Eigenvectors: Given a square matrix A and a non-zero vector v,
v is an eigenvector of A if Av is a scalar multiple of v, i.e.,
Av = λv for some scalar λ, called the corresponding eigenvalue
(see the numeric check after this list).
• Covariance Matrix: A matrix containing the covariance between
each pair of variables is called the covariance matrix.
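
To make the eigenvector definition concrete, here is a minimal NumPy check (the symmetric matrix A below is an arbitrary stand-in, not taken from these slides) that verifies Av is a scalar multiple of v for each eigenpair:

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])            # arbitrary symmetric (covariance-like) matrix

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are the eigenvectors

for i in range(len(eigvals)):
    v = eigvecs[:, i]
    # Eigenvector property: A v equals lambda * v
    assert np.allclose(A @ v, eigvals[i] * v)
    print(f"lambda = {eigvals[i]:.4f}, v = {v}")
```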
Steps in PCA
• Getting the dataset
First, take the input dataset and divide it into two subparts X and Y,
where X is the training set and Y is the validation set.
• Representing data in a structure
Now represent the dataset as a structure, namely a two-dimensional matrix
of the independent variable X, in which each row corresponds to a data
item and each column corresponds to a feature. The number of columns is
the dimensionality of the dataset.
• Standardizing the data
In this step, standardize the dataset. Within a given column, features
with high variance would otherwise count as more important than features
with lower variance. If the importance of a feature should be independent
of its variance, divide each data item in a column by the standard
deviation of that column. Name the resulting matrix Z (a minimal sketch
of this step follows below).
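
A minimal sketch of the representation and standardization steps, assuming NumPy (the matrix X below reuses the blood-pressure data from the worked example later in this module):

```python
import numpy as np

# Rows = data items, columns = features (SBP, DBP)
X = np.array([[126., 78.],
              [128., 80.],
              [128., 82.],
              [130., 82.],
              [130., 84.],
              [132., 86.]])

# Center each column, then divide by its (sample) standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(Z)
```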
Steps in PCA contd..
• Calculating the Covariance of Z
To calculate the covariance matrix of Z, take the matrix Z and transpose it. After
transposing, multiply the transpose by Z. The output matrix is the covariance matrix of
Z.
• Calculating the Eigenvalues and Eigenvectors
Now calculate the eigenvalues and eigenvectors of the resulting covariance matrix.
The eigenvectors of the covariance matrix are the directions of the axes with the
highest information (variance), and the corresponding eigenvalues give the amount of
variance along each of those directions.
• Sorting the Eigenvectors
In this step, take all the eigenvalues and sort them in decreasing order, i.e., from
largest to smallest, and simultaneously sort the eigenvectors accordingly into a matrix
P. The resulting matrix is named P*.
• Calculating the new features, or Principal Components
Here the new features are calculated by multiplying the matrix P* with Z. In the
resulting matrix Z*, each observation is a linear combination of the original features,
and the columns of Z* are independent of each other.
• Removing less important features from the new dataset
The new feature set is in place, so we decide what to keep and what to remove: only
the relevant or important features are kept in the new dataset, and the unimportant
ones are removed. (A from-scratch sketch of these steps follows this list.)
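
Putting these steps together, here is a from-scratch sketch assuming NumPy. The names Z, P_star, and Z_star mirror the slide's notation; since the data here is stored row-wise, the projection is written Z @ P_star, which is equivalent to the slide's multiplication of P* with Z for column-wise data:

```python
import numpy as np

def pca(X, k):
    """Project X (rows = data items, columns = features) onto its top-k PCs."""
    Z = X - X.mean(axis=0)                  # standardize (center) the data
    cov = (Z.T @ Z) / (Z.shape[0] - 1)      # covariance of Z via Z^T Z
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues, decreasing
    P_star = eigvecs[:, order[:k]]          # keep the top-k eigenvectors
    Z_star = Z @ P_star                     # new features: principal components
    return Z_star, eigvals[order]
```

Calling pca(X, 1) on the blood-pressure data of the worked example below returns the PC1 scores along with the sorted eigenvalues, approximately 12.08 and 0.32.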
Types of PCA
• Sparse PCA
• Nonlinear PCA
• Robust PCA
PCA: Mathematical Example

Systolic BP Diastolic BP
126 78
128 80
128 82
130 82
130 84
132 86
Steps
• Center the data
• Calculate the covariance matrix
• Calculate eigenvalues of covariance matrix
• Calculate eigenvectors of covariance matrix
• Order the eigenvectors
• Calculate the principal components
Center the data

             SBP    DBP
Mean / Avg   129     82

SBP        Diff (SBP)    DBP      Diff (DBP)
126-129    -3            78-82    -4
128-129    -1            80-82    -2
128-129    -1            82-82     0
130-129     1            82-82     0
130-129     1            84-82     2
132-129     3            86-82     4
Center the data

Centered SBP (cSBP)    Centered DBP (cDBP)
-3                     -4
-1                     -2
-1                      0
 1                      0
 1                      2
 3                      4
Calculate the covariance matrix
$$\mathrm{var}(cSBP) = \frac{1}{n-1}\sum_{j=1}^{n} cSBP_j^2 \qquad \mathrm{cov}(cSBP, cDBP) = \frac{1}{n-1}\sum_{j=1}^{n} cSBP_j \cdot cDBP_j$$

$$\mathrm{var}(cSBP) = \frac{1}{6-1}\left\{(-3)^2 + (-1)^2 + (-1)^2 + (1)^2 + (1)^2 + (3)^2\right\} = \frac{22}{5} = 4.4$$

$$\mathrm{var}(cDBP) = \frac{1}{6-1}\left\{(-4)^2 + (-2)^2 + (0)^2 + (0)^2 + (2)^2 + (4)^2\right\} = \frac{40}{5} = 8.0$$

$$\mathrm{cov}(cSBP, cDBP) = \frac{1}{6-1}\left\{(-3)(-4) + (-1)(-2) + (-1)(0) + (1)(0) + (1)(2) + (3)(4)\right\} = \frac{28}{5} = 5.6$$

Covariance matrix:

        SBP    DBP
SBP     4.4    5.6
DBP     5.6    8.0
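
These numbers can be verified in one call, assuming NumPy (np.cov uses the same 1/(n-1) normalization as the sample formulas above):

```python
import numpy as np

cSBP = np.array([-3., -1., -1., 1., 1., 3.])
cDBP = np.array([-4., -2., 0., 0., 2., 4.])

# np.cov divides by n-1 by default, matching the formulas above
print(np.cov(cSBP, cDBP))
# [[4.4 5.6]
#  [5.6 8. ]]
```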
Calculate the eigenvalues of the covariance matrix

$$\det(A - \lambda I) = 0, \qquad A = \begin{bmatrix} 4.4 & 5.6 \\ 5.6 & 8.0 \end{bmatrix}, \quad I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

$$\det \begin{bmatrix} 4.4 - \lambda & 5.6 \\ 5.6 & 8.0 - \lambda \end{bmatrix} = 0$$

$$(4.4 - \lambda)(8.0 - \lambda) - 5.6 \times 5.6 = 0 \;\Rightarrow\; \lambda^2 - 12.4\lambda + 3.84 = 0$$

$$\lambda_1 = 0.32; \qquad \lambda_2 = 12.08$$
Calculate the eigenvectors of the covariance matrix

$$A\,v = \lambda\,v, \qquad v = \begin{bmatrix} x \\ y \end{bmatrix}$$

For $\lambda_2 = 12.08$:

$$\begin{bmatrix} 4.4 & 5.6 \\ 5.6 & 8.0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = 12.08 \begin{bmatrix} x \\ y \end{bmatrix}$$

$$\Rightarrow\; y = 1.37\,x, \qquad v_2 = \begin{bmatrix} 1 \\ 1.37 \end{bmatrix}$$
Calculation of the previous slide

$$\begin{bmatrix} 4.4 & 5.6 \\ 5.6 & 8.0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = 12.08 \begin{bmatrix} x \\ y \end{bmatrix}
\;\Rightarrow\;
\begin{bmatrix} 4.4x + 5.6y \\ 5.6x + 8.0y \end{bmatrix} = \begin{bmatrix} 12.08x \\ 12.08y \end{bmatrix}$$

Therefore, $4.4x + 5.6y = 12.08x$
or, $5.6y = 7.68x$
or, $y = (7.68/5.6)\,x$
or, $y = 1.37\,x$
Normalize the vector v

$$v_2 = \begin{bmatrix} 1 \\ 1.37 \end{bmatrix}, \qquad
\text{normalizing factor} = \sqrt{1^2 + 1.37^2} = \sqrt{1 + 1.8769} = \sqrt{2.8769} \approx 1.696$$

$$\text{normalized } v_2 = \begin{bmatrix} 1/\sqrt{2.8769} \\ 1.37/\sqrt{2.8769} \end{bmatrix} = \begin{bmatrix} 0.59 \\ 0.81 \end{bmatrix}$$

Similarly we get $v_1 = \begin{bmatrix} -0.81 \\ 0.59 \end{bmatrix}$ for $\lambda_1 = 0.32$.

$$\text{Therefore, } V = \begin{bmatrix} 0.59 & -0.81 \\ 0.81 & 0.59 \end{bmatrix}$$
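
Numerically, assuming NumPy (note that np.linalg.eigh may return eigenvectors with flipped signs, which is harmless since an eigenvector is only determined up to a scalar):

```python
import numpy as np

A = np.array([[4.4, 5.6],
              [5.6, 8.0]])

eigvals, eigvecs = np.linalg.eigh(A)  # ascending eigenvalues for symmetric A
print(eigvals)    # approximately [0.32, 12.08]
print(eigvecs)    # columns ~ [-0.81, 0.59] and [0.59, 0.81], signs may flip

# Manual normalization of v2 = [1, 1.37]
v2 = np.array([1.0, 1.37])
print(v2 / np.linalg.norm(v2))        # approximately [0.59, 0.81]
```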
Calculate the principal component

$$V = \begin{bmatrix} 0.59 & -0.81 \\ 0.81 & 0.59 \end{bmatrix}, \qquad
D = \begin{bmatrix} -3 & -4 \\ -1 & -2 \\ -1 & 0 \\ 1 & 0 \\ 1 & 2 \\ 3 & 4 \end{bmatrix}$$

$$D \cdot V = \begin{bmatrix} -3 & -4 \\ -1 & -2 \\ -1 & 0 \\ 1 & 0 \\ 1 & 2 \\ 3 & 4 \end{bmatrix}
\begin{bmatrix} 0.59 & -0.81 \\ 0.81 & 0.59 \end{bmatrix}
= \begin{bmatrix} -5.0 & 0.1 \\ -2.2 & -0.4 \\ -0.6 & 0.8 \\ 0.6 & -0.8 \\ 2.2 & 0.4 \\ 5.0 & -0.1 \end{bmatrix}
\quad (\text{columns } PC_1, PC_2)$$

$$PC_1 = 0.59 \cdot cSBP + 0.81 \cdot cDBP$$
$$PC_2 = -0.81 \cdot cSBP + 0.59 \cdot cDBP$$

cSBP   cDBP   PC1_i                                PC2_i
 -3     -4    0.59*(-3) + 0.81*(-4) = -5.01        -0.81*(-3) + 0.59*(-4) =  0.07
 -1     -2    0.59*(-1) + 0.81*(-2) = -2.21        -0.81*(-1) + 0.59*(-2) = -0.37
 -1      0    0.59*(-1) + 0.81*(0)  = -0.59        -0.81*(-1) + 0.59*(0)  =  0.81
  1      0    0.59*(1)  + 0.81*(0)  =  0.59        -0.81*(1)  + 0.59*(0)  = -0.81
  1      2    0.59*(1)  + 0.81*(2)  =  2.21        -0.81*(1)  + 0.59*(2)  =  0.37
  3      4    0.59*(3)  + 0.81*(4)  =  5.01        -0.81*(3)  + 0.59*(4)  = -0.07
Variance:     12.08                                 0.32
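
The whole table can be reproduced in a few lines, assuming NumPy. Because V is rounded to two decimals, the PC1 variance comes out near 12.13 rather than exactly 12.08; exact eigenvectors recover the eigenvalues exactly:

```python
import numpy as np

D = np.array([[-3., -4.], [-1., -2.], [-1., 0.],
              [ 1.,  0.], [ 1.,  2.], [ 3.,  4.]])  # centered data
V = np.array([[0.59, -0.81],
              [0.81,  0.59]])                       # rounded eigenvector matrix

PC = D @ V                     # columns: PC1 and PC2 scores
print(PC)
print(PC.var(axis=0, ddof=1))  # approximately [12.13, 0.32] (V is rounded)
```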
Result
The total variance of the dataset is
var(cSBP) + var(cDBP) = 4.4 + 8.0 = 12.4.
Since the first principal component (PC1) captures almost all of this
variance (12.08 out of 12.4, about 97%), we neglect the second
principal component (PC2).

Answer: PC1
