
CS-13410

Introduction to Machine Learning


Lecture # 18

Dimensionality Reduction – PCA


Data Preprocessing
 Dimensionality reduction is a part of Data Preprocessing.
 Data Preprocessing has the following four major steps:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data Transformation and Discretization
Data Reduction
 Obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
 The different strategies are:
• Dimensionality Reduction
• Numerosity reduction
• Data Compression

Dimensionality Reduction (DR)
 Process of reducing the number of random
attributes under consideration.
 Two very common methods are Wavelet Transforms and Principal Components Analysis (PCA).

Numerosity Reduction
 Replace the original data volume by
alternative smaller forms of data
representation.
 Regression and log-linear models
(parametric methods)
 Histograms, clustering, sampling and data
cube aggregation (nonparametric methods)

Data Compression
 Transformations are applied to obtain a reduced or
compressed representation
 Lossless: The original data can be reconstructed
from the compressed data without any information
loss.
 Lossy: Only an approximation of the original data
can be reconstructed.

DR: Principal Components Analysis (PCA)

 Why PCA?
 PCA is a useful statistical technique that has found applications in:
 Face recognition
 Image Compression
 Reducing dimension of data

PCA Goal:
Removing Dimensional Redundancy
 The major goal of PCA in Data Science and Machine
Learning is to remove the “dimensional redundancy”
from data.
 What does that mean?
 A typical dataset contains several dimensions (variables) that
may or may not correlate.
 Dimensions that correlate vary together.
 The information represented by a set of dimensions with high
correlation can be extracted by studying just one dimension
that represents the whole set.
 Hence the goal is to reduce the dimensions of a dataset to a
smaller set of representative dimensions that do not correlate.
PCA Goal:
Removing Dimensional Redundancy
[Figure: a data set with 12 dimensions, Dim 1 through Dim 12. Analyzing 12-dimensional data is challenging!]
PCA Goal:
Removing Dimensional Redundancy

[Figure: the same 12 dimensions, where some dimensions represent redundant information. Can we "reduce" these?]
PCA Goal:
Removing Dimensional Redundancy
[Figure: assume a "PCA black box" that can reduce the correlating dimensions. Passing the 12-dimensional data set through the black box yields a 3-dimensional data set.]
PCA Goal:
Removing Dimensional Redundancy
[Figure: the 12-dimensional data set (Dim 1 ... Dim 12) passes through the PCA black box and comes out as a 3-dimensional data set (Dim A, Dim B, Dim C). Given an appropriate reduction, analyzing the reduced data set is much more efficient than analyzing the original "redundant" data.]
Mathematics inside PCA Black box: Bases
 Let's now give the "black box" a mathematical form.
 In linear algebra, the dimensions of a space are described by a linearly independent set of vectors, called a basis, that spans the space: each point in the space is a linear combination of the basis vectors.
 For example, consider the simplest case, the standard basis of R^n, consisting of the coordinate axes.
 Every point in R^3 is a linear combination of the standard basis of R^3:
(2, 3, 3) = 2(1, 0, 0) + 3(0, 1, 0) + 3(0, 0, 1)
PCA Goal: Change of Basis
 Assume X is the 6-dimensional data set given as input, with data points along one axis and dimensions along the other.
 A naïve basis for X is the standard basis of R^6 (the identity matrix B = I), and hence
BX = X
 Here, we want to find a new (reduced) basis P such that
PX = Y
 Y will be the resultant reduced data set.
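As a rough sketch of the shapes involved in PX = Y (a minimal illustration; the random 6x100 data matrix and the placeholder P are assumptions, not the lecture's example):

import numpy as np

# X holds the data with dimensions along rows and data points along columns,
# matching the slide's notation.
X = np.random.default_rng(0).normal(size=(6, 100))   # 6-dimensional data, 100 points

# A placeholder reduced basis P with 3 rows (one row per new dimension).
# PCA will tell us how to choose these rows; here they are arbitrary.
P = np.eye(6)[:3]

Y = P @ X          # Y has shape (3, 100): the data expressed in the reduced basis
print(Y.shape)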
PCA Goal
 Change of Basis

QUESTION: What is a good choice for P ?


 Let's park this question for now and revisit it after studying some related concepts.
Background Stats/Maths
 Mean and Standard Deviation
 Variance and Covariance
 Covariance Matrix
 Eigenvectors and Eigenvalues
Mean and Standard Deviation
 Mean: the average of the values in a data set. On its own, it doesn't tell us a lot about the data set.
 Different data sets can have the same mean.
 The standard deviation (SD) of a data set is a measure of how spread out the data is.
 Variance is another measure of the spread of the data in a data set; it is almost identical to the SD (it is the square of the SD).
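For reference, the standard formulas, written with the sample (n - 1) denominator that the worked example later uses:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},
\qquad
s = \sqrt{s^2}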
Covariance
 SD and variance are 1-dimensional measures.
 1-D data sets could be:
 Heights of all the people in the room
 Salaries of the employees in a company
 Marks in a quiz
 However, many data sets have more than one dimension.
 Our aim is to find relationships between the different dimensions, e.g. finding the relationship between students' results and their hours of study.
 Covariance is used to measure the relationship between two dimensions.
Covariance Interpretation
 We have a data set of students' study hours (H) and marks achieved (M).
 We find cov(H, M).
 The exact value of the covariance is not as important as its sign (i.e. positive or negative).
 Positive: both dimensions increase together.
 Negative: as one dimension increases, the other decreases.
 Zero: there is no (linear) relationship between the two dimensions.
Covariance Matrix
 Covariance is always measured between 2 dimensions.
 What if we have a data set with more than 2 dimensions?
 We have to calculate more than one covariance measurement.
 E.g. from a 3-dimensional data set (dimensions x, y, z) we could calculate cov(x,y), cov(x,z) and cov(y,z).
Covariance Matrix
 We can use a covariance matrix to collect the covariances of all the possible pairs of dimensions, as shown below.
 Since cov(a,b) = cov(b,a), the matrix is symmetrical about the main diagonal.
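For example, for a 3-dimensional data set with dimensions x, y and z, the covariance matrix is:

C =
\begin{pmatrix}
\operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z) \\
\operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z) \\
\operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z)
\end{pmatrix}

The diagonal entries cov(x,x), cov(y,y), cov(z,z) are simply the variances of the individual dimensions.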
Eigenvectors
 Consider the two multiplications of a matrix with a vector shown below.
 In the first example, the resulting vector is not an integer multiple of the original vector.
 In the second example, the resulting vector is 4 times the original vector.
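As a small illustrative example (these particular matrices are chosen here for illustration, not necessarily the ones shown on the original slide):

\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}
\begin{pmatrix} 1 \\ 3 \end{pmatrix}
=
\begin{pmatrix} 11 \\ 5 \end{pmatrix}
\qquad\text{(not a multiple of the original vector)}

\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}
\begin{pmatrix} 3 \\ 2 \end{pmatrix}
=
\begin{pmatrix} 12 \\ 8 \end{pmatrix}
= 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}
\qquad\text{(4 times the original vector)}

So (3, 2) is an eigenvector of this matrix, with eigenvalue 4.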
Eigenvectors and Eigenvalues
 More formally defined:
 Let A be an n×n matrix. A nonzero vector v that satisfies
Av = λv
for some scalar λ is called an eigenvector of A, and λ is the eigenvalue of A corresponding to the eigenvector v.
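In practice the eigenvalues and eigenvectors are computed numerically; a minimal NumPy sketch, reusing the illustrative 2x2 matrix above:

import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])

# eig returns the eigenvalues and a matrix whose columns are unit-length eigenvectors;
# the order of the eigenvalues is not guaranteed.
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)     # for this matrix: 4 and -1 (in some order)
print(eigenvectors)    # eigenvectors[:, i] corresponds to eigenvalues[i]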
Principal Components Analysis (PCA)
PCA is a method to identify a new set of predictors, as linear combinations of the original ones, that captures the "maximum amount" of variance in the observed data.
It is a technique for identifying patterns in data.
It is also used to express data in such a way as to highlight its similarities and differences.
PCA is used to reduce the dimensionality of data without losing the integrity of the information.

Principal Components Analysis (PCA)
Definition
Principal Components Analysis (PCA) produces a list of p principal components (Y1, . . . , Yp) such that
 Each Yi is a linear combination of the original predictors, and its vector norm is 1.
 The Yi's are pairwise orthogonal.
 The Yi's are ordered in decreasing order of the amount of observed variance they capture.
That is, the observed data shows more variance in the direction of Y1 than in the direction of Y2.
To perform dimensionality reduction we select the top m principal components as our new predictors and express our observed data in terms of these predictors.
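Since the lecture later notes that PCA is easy to use in Python, here is a minimal sketch with scikit-learn (the library choice and the random data are assumptions for illustration):

import numpy as np
from sklearn.decomposition import PCA

# Any numeric feature matrix: samples in rows, the original predictors in columns.
X = np.random.default_rng(0).normal(size=(100, 12))

pca = PCA(n_components=3)       # keep the top m = 3 principal components
Y = pca.fit_transform(X)        # the new predictors Y1, Y2, Y3 for each sample

# The explained variance ratios are in decreasing order: Y1 captures the most
# observed variance, then Y2, then Y3.
print(pca.explained_variance_ratio_)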
The Intuition Behind PCA
The top PCA components capture the largest amount of variation (the interesting features) of the data.
Each component is a linear combination of the original predictors; we can visualize them as vectors in the feature space.

The Intuition Behind PCA
Transforming our observed data means projecting our dataset onto the space defined by the top m PCA components; these components become our new predictors.

Using PCA for Regression
PCA is easy to use in Python, so how do we then use it for
regression modeling in a real-life problem?
If we use all p of the new Yj , then we have not improved
the dimensionality. Instead, we select the first M PCA
variables, Y1, ..., YM , to use as predictors in a regression
model.
The choice of M is important and can vary from
application to application. It depends on various things, like
how collinear the predictors are, how truly related they are
to the response, etc...
What would be the best way to check for a specified
problem?
Train and Test!!!
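A minimal sketch of this train-and-test workflow with scikit-learn (the synthetic data, the choice M = 3, and the library itself are assumptions for illustration):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))                       # 200 samples, p = 12 predictors
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize, keep the first M = 3 principal components, then regress on them.
model = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
model.fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))      # compare different M by this score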
Step by Step
 Step 1:
 We need to have some data for PCA

 Step 2:
 Subtract the mean of each dimension from the data points
Step 1 & Step 2
  X1     X2    X1 - mean  X2 - mean  (X1 - mean)^2  (X2 - mean)^2
  2.5    2.4      0.69       0.49        0.4761         0.2401
  0.5    0.7     -1.31      -1.21        1.7161         1.4641
  2.2    2.9      0.39       0.99        0.1521         0.9801
  1.9    2.2      0.09       0.29        0.0081         0.0841
  3.1    3.0      1.29       1.09        1.6641         1.1881
  2.3    2.7      0.49       0.79        0.2401         0.6241
  2.0    1.6      0.19      -0.31        0.0361         0.0961
  1.0    1.1     -0.81      -0.81        0.6561         0.6561
  1.5    1.6     -0.31      -0.31        0.0961         0.0961
  1.1    0.9     -0.71      -1.01        0.5041         1.0201
Sums:
 18.1   19.1      0          0           5.549          6.449
(mean(X1) = 18.1/10 = 1.81, mean(X2) = 19.1/10 = 1.91)
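A minimal NumPy sketch of Steps 1 and 2 on the example data above:

import numpy as np

# Step 1: the example data (columns X1, X2).
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Step 2: subtract the mean of each column from every data point.
means = X.mean(axis=0)        # [1.81, 1.91]
X_centered = X - means        # matches the "X - mean" columns of the table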
Step 3: Calculate the covariance matrix
 Calculate the covariance matrix:
Var(x1) = 5.549/9 = 0.6166
Var(x2) = 6.449/9 = 0.71656
Cov(x1, x2) = 5.539/9 = 0.6154

cov = A = [ 0.6166  0.6154 ]
          [ 0.6154  0.7166 ]

 The non-diagonal elements in the covariance matrix are positive,
 so the variables x1 and x2 increase together.
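The same covariance matrix can be computed directly with NumPy (a sketch; X is the example data from Step 1):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# rowvar=False treats the columns as variables; np.cov divides by n - 1 = 9 by default.
A = np.cov(X, rowvar=False)
print(A)    # approximately [[0.6166, 0.6154], [0.6154, 0.7166]]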
Step 4: Calculate the eigenvalues and eigenvectors of the covariance matrix using the following equation:
det(A - λI) = 0
 where A is the covariance matrix and I is the identity matrix.
 Solving the above equation gives a second-degree (quadratic) equation in λ. Solving that quadratic gives two values of λ; these are the eigenvalues of A.

 Now we will find the eigenvectors by solving the following equation for each eigenvalue λ:
Av = λv
 Since the covariance matrix is square, we can calculate its eigenvectors and eigenvalues, using the constraint that each eigenvector has unit length.
 We solve this equation one by one for both eigenvalues and obtain the two corresponding (unit-length) eigenvectors; a numerical sketch follows below.
What does this all mean?
[Figure: scatter plot of the mean-adjusted data points with the two eigenvectors of the covariance matrix overlaid.]
Conclusion
 The eigenvectors give us information about the patterns in the data.
 Looking at the plot on the previous slide, see how one of the eigenvectors goes through the middle of the points.
 The second eigenvector tells us about another, weaker pattern in the data.
 So by finding the eigenvectors of the covariance matrix we are able to extract lines that characterize the data.
Step 5: Choosing components and forming a feature vector
 The eigenvector with the highest eigenvalue is the principal component of the data set.
 In our example, the eigenvector with the largest eigenvalue is the one that points down the middle of the data.
 So, once the eigenvectors are found, the next step is to order them by eigenvalue, highest to lowest.
 This gives the components in order of significance.
Cont’d
 Now, here comes the idea of dimensionality
reduction and data compression
 You can decide to ignore the components of
least significance.
 You do lose some information, but if the discarded eigenvalues are small you don't lose much.
 More formally stated (see the next slide):
Cont’d
 We have n dimensions,
 so we will find n eigenvectors.
 But if we choose only the first p eigenvectors,
 then the final dataset has only p dimensions.
Step 6: Deriving the new dataset
 Now, we have chosen the components (eigenvectors) that we want to keep.
 We can write them in the form of a matrix of vectors (the "feature vector", with one eigenvector per column).
 In our example we have two eigenvectors, so we have two choices:
Choice 1: keep both eigenvectors.
Choice 2: keep only one eigenvector, i.e. the first (principal) eigenvector.
Cont’d
 To obtain the final dataset we multiply the transposed feature vector with the transpose of the (mean-adjusted) original data matrix, i.e.
FinalData = FeatureVector^T × MeanAdjustedData^T
 For the first eigenvector alone, we get the first principal component.
 The final dataset will have the data items in columns and the dimensions along rows.
 So we have the original data set represented in transformed form (a numerical sketch follows below).
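A minimal NumPy sketch of Step 6, following the slide's convention of data items in columns and dimensions in rows (the variable names are my own):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
means = X.mean(axis=0)
X_centered = X - means

# Steps 3-5: covariance matrix, eigen-decomposition, order by eigenvalue.
eigvals, eigvecs = np.linalg.eig(np.cov(X_centered, rowvar=False))
order = np.argsort(eigvals)[::-1]
feature_vector = eigvecs[:, order]            # eigenvectors as columns, principal first

row_data_adjust = X_centered.T                # dimensions in rows, data items in columns

final_data_2d = feature_vector.T @ row_data_adjust          # choice 1: keep both eigenvectors
final_data_1d = feature_vector[:, :1].T @ row_data_adjust   # choice 2: first component only

# Approximate restoration of the original data from the single kept component
# (this is what the "restored" plot on the next slide shows).
restored = (feature_vector[:, :1] @ final_data_1d).T + means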
[Figure: the original data set represented using both eigenvectors.]
[Figure: the original data set restored using only one eigenvector.]
PCA – Mathematical Working
 The naïve basis (I) of the input matrix X spans a large-dimensional space.
 A change of basis P is required so that X can be projected onto a lower-dimensional space containing only the significant dimensions.
 A properly selected P will generate a projection Y = PX.
 P is built from the eigenvectors of the covariance matrix; keeping fewer eigenvectors in P gives a reduced-dimension projection.
PCA Procedure
 Step 1
 Get data

 Step 2
 Subtract the mean

 Step 3
 Calculate the covariance matrix

 Step 4
 Calculate the eigenvectors and eigenvalues of the
covariance matrix

 Step 5
 Choose components and form a feature vector
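Putting the five steps together, a minimal self-contained sketch (not a library-grade implementation):

import numpy as np

def pca(X, n_components):
    """Project X (samples in rows) onto its top n_components principal components."""
    X_centered = X - X.mean(axis=0)                   # Step 2: subtract the mean
    C = np.cov(X_centered, rowvar=False)              # Step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)              # Step 4: eigh, since C is symmetric
    order = np.argsort(eigvals)[::-1]                 # Step 5: order by eigenvalue, highest first
    components = eigvecs[:, order[:n_components]]     # the chosen feature vector (columns)
    return X_centered @ components                    # the data expressed in the new basis

# Example usage: reduce random 12-dimensional data to 3 dimensions.
X = np.random.default_rng(0).normal(size=(100, 12))
Y = pca(X, 3)
print(Y.shape)    # (100, 3)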
