Dimensionality Reduction (DR)
The process of reducing the number of random variables (attributes) under consideration.
Two very common methods are Wavelet Transforms and Principal Components Analysis (PCA).
Numerosity Reduction
Replaces the original data volume with alternative, smaller forms of data representation.
Regression and log-linear models (parametric methods).
Histograms, clustering, sampling, and data cube aggregation (nonparametric methods).
Data Compression
Transformations are applied to obtain a reduced or compressed representation of the original data.
Lossless: the original data can be reconstructed from the compressed data without any information loss.
Lossy: only an approximation of the original data can be reconstructed.
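For instance, a minimal Python sketch of the lossless case, using the standard-library zlib module: the original bytes come back exactly from the compressed representation.

```python
# Lossless compression round-trips exactly: decompress(compress(x)) == x.
import zlib

original = b"data reduction " * 200
compressed = zlib.compress(original)
print(len(original), "->", len(compressed))    # compressed form is far smaller
assert zlib.decompress(compressed) == original  # no information loss
```

A lossy method (e.g. JPEG for images) would fail this assertion by design, trading exactness for a smaller representation.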
DR: Principal Components Analysis (PCA)
Why PCA?
PCA is a useful statistical technique that has found applications in:
Face recognition
Image compression
Reducing the dimensionality of data
PCA Goal: Removing Dimensional Redundancy
The major goal of PCA in Data Science and Machine Learning is to remove the "dimensional redundancy" from data.
What does that mean?
A typical dataset contains several dimensions (variables) that may or may not correlate.
Dimensions that correlate vary together.
The information represented by a set of highly correlated dimensions can therefore be extracted by studying just one dimension that represents the whole set.
Hence the goal is to reduce the dimensions of a dataset to a smaller set of representative dimensions that do not correlate, as the small sketch below illustrates.
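A small synthetic illustration (the variables here are assumptions for the example, not from the slides): two of three dimensions vary together, so one representative direction can stand in for the pair.

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=500)              # hypothetical variable
weight = 0.9 * height + rng.normal(0, 5, size=500)  # varies together with height
shoe = rng.normal(42, 2, size=500)                  # uncorrelated with both

data = np.column_stack([height, weight, shoe])
print(np.round(np.corrcoef(data, rowvar=False), 2))
# height and weight correlate strongly (~0.87), so one representative
# dimension can stand in for the pair; shoe size stays as its own dimension.
```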
[Figure: a 12-dimensional dataset, Dim 1 through Dim 12.]
Analyzing 12-dimensional data is challenging!
But some dimensions represent redundant information. Can we "reduce" these?
Let's assume we have a "PCA black box" that can reduce the correlating dimensions. Pass the 12-dimensional dataset through the black box to get a three-dimensional dataset.
[Figure: the 12 dimensions enter the PCA black box and come out as three dimensions, Dim A, Dim B, Dim C.]
Given an appropriate reduction, analyzing the reduced dataset is much more efficient than analyzing the original "redundant" data.
Mathematics inside the PCA Black Box: Bases
Let's now give the "black box" a mathematical form.
In linear algebra, the dimensions of a space are described by a linearly independent set of vectors, called a basis, that spans the space.
That is, each point in the space is a linear combination of the basis vectors.
For example, consider the simplest case: the standard basis of R^n, consisting of the coordinate axes.
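Concretely, in R^3 the standard basis is
e1 = (1, 0, 0), e2 = (0, 1, 0), e3 = (0, 0, 1),
and every point is a linear combination of these basis vectors:
(x, y, z) = x·e1 + y·e2 + z·e3.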
Principal Components Analysis (PCA)
Definition
Principal Components Analysis (PCA) produces a list of p principal components (Y1, . . . , Yp) such that:
Each Yi is a linear combination of the original predictors, and its vector norm is 1.
The Yi's are pairwise orthogonal.
The Yi's are ordered in decreasing order of captured observed variance. That is, the observed data shows more variance in the direction of Y1 than in the direction of Y2.
These three properties can be checked directly in code, as sketched below.
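A quick numeric check of the three defining properties, using scikit-learn's PCA on synthetic data (the data itself is an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))

pca = PCA().fit(X)
V = pca.components_                  # rows are the principal directions Y_i

print(np.linalg.norm(V, axis=1))     # each row has vector norm 1
print(np.round(V @ V.T, 6))          # identity matrix: pairwise orthogonal
print(pca.explained_variance_)       # decreasing captured variance
```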
The Intuition Behind PCA
Transforming our observed data means projecting our dataset onto the space defined by the top m PCA components; these components become our new predictors.
Using PCA for Regression
PCA is easy to use in Python, so how do we then use it for regression modeling in a real-life problem?
If we use all p of the new Yj, then we have not reduced the dimensionality. Instead, we select the first M PCA variables, Y1, ..., YM, to use as predictors in a regression model.
The choice of M is important and can vary from application to application. It depends on various things, such as how collinear the predictors are and how strongly they are truly related to the response.
What would be the best way to check for a specific problem?
Train and test!
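A minimal sketch of this idea (principal components regression) with scikit-learn: keep only the first M components as predictors. The data X, y and the choice M = 3 are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))                  # 200 samples, p = 12 predictors
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

M = 3                                           # validate this choice per problem
pcr = make_pipeline(PCA(n_components=M), LinearRegression())
pcr.fit(X_train, y_train)
print("test R^2:", pcr.score(X_test, y_test))   # the "train and test" check
```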
Step by Step
Step 1: We need some data for PCA.
Step 2: Subtract the mean from each of the data points.
Step 1 & Step 2
X1     X2     X1 - 1.81   X2 - 1.91   (X1 - 1.81)^2   (X2 - 1.91)^2
2.5    2.4     0.69        0.49        0.4761          0.2401
0.5    0.7    -1.31       -1.21        1.7161          1.4641
2.2    2.9     0.39        0.99        0.1521          0.9801
1.9    2.2     0.09        0.29        0.0081          0.0841
3.1    3.0     1.29        1.09        1.6641          1.1881
2.3    2.7     0.49        0.79        0.2401          0.6241
2.0    1.6     0.19       -0.31        0.0361          0.0961
1.0    1.1    -0.81       -0.81        0.6561          0.6561
1.5    1.6    -0.31       -0.31        0.0961          0.0961
1.1    0.9    -0.71       -1.01        0.5041          1.0201
Sum:
18.1   19.1    0           0           5.549           6.449
(The means are 18.1/10 = 1.81 for X1 and 19.1/10 = 1.91 for X2.)
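Steps 1 and 2 in code, on the slide's own ten (X1, X2) points: load the data, then subtract each column's mean.

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

X_adjusted = X - X.mean(axis=0)    # column means are (1.81, 1.91)
print(X_adjusted.sum(axis=0))      # ~[0, 0], matching the table's sum row
```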
Step 3: Calculate the covariance matrix
Var(X1) = 5.549/9 = 0.6166
Var(X2) = 6.449/9 = 0.7166
Cov(X1, X2) = 5.539/9 = 0.6154
(The denominator is n - 1 = 9 because these are sample variances and covariances.)

cov = A = [ 0.6166  0.6154 ]
          [ 0.6154  0.7166 ]
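Step 3 in code: numpy's np.cov reproduces the slide's matrix (it subtracts the means itself and also uses the n - 1 denominator).

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

A = np.cov(X, rowvar=False)    # each column treated as one variable
print(np.round(A, 4))          # [[0.6166 0.6154]
                               #  [0.6154 0.7166]]
```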
Step 4: Calculate the eigenvalues and eigenvectors of the covariance matrix
Since the covariance matrix A is square, we can calculate its eigenvalues using the characteristic equation
det(A - λI) = 0
Now we will find the eigenvectors by solving the equation
A v = λ v
for each eigenvalue λ. We solve these two equations one by one for both eigenvalues and find two eigenvectors. For our covariance matrix the final answer is
λ1 ≈ 1.2840 with eigenvector v1 ≈ (0.6779, 0.7352)
λ2 ≈ 0.0491 with eigenvector v2 ≈ (-0.7352, 0.6779)
where each eigenvector has been scaled to unit length.
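Step 4 in code: np.linalg.eigh is the numpy routine for symmetric matrices such as a covariance matrix; it returns the eigenvalues in ascending order, with unit-length eigenvectors as columns.

```python
import numpy as np

A = np.array([[0.6166, 0.6154],
              [0.6154, 0.7166]])

eigvals, eigvecs = np.linalg.eigh(A)
print(eigvals)     # ~[0.049, 1.284]
print(eigvecs)     # columns match v2 and v1 above (up to sign)
```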
What does this all mean?
[Figure: scatter plot of the mean-adjusted data points with the two eigenvectors drawn on top.]
Conclusion
Eigenvectors give us information about the patterns in the data.
Looking at the graph on the previous slide, see how one of the eigenvectors goes through the middle of the points.
The second eigenvector tells us about another, weaker pattern in the data.
So by finding the eigenvectors of the covariance matrix, we are able to extract the lines that characterize the data.
Step 5: Choosing components and forming a feature vector
The eigenvector with the highest eigenvalue is the principal component of the data set.
In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data.
So, once the eigenvectors are found, the next step is to order them by eigenvalue, highest to lowest.
This gives the components in order of significance.
Cont'd
Now here comes the idea of dimensionality reduction and data compression:
You can decide to ignore the components of least significance.
You do lose some information, but if the eigenvalues are small you don't lose much.
More formally stated (see next slide):
Cont'd
We have n dimensions, so we will find n eigenvectors.
But if we choose only the first p eigenvectors, then the final dataset has only p dimensions.
Step 6: Deriving the new dataset
Now we have chosen the components (eigenvectors) that we want to keep, and we can write them in the form of a matrix of vectors (the feature vector).
In our example we have two eigenvectors, so we have two choices:
Choice 1: keep both eigenvectors.
Choice 2: keep only one eigenvector, i.e. the first (principal) eigenvector.
Cont'd
To obtain the final dataset, we multiply the transposed feature vector by the transpose of the mean-adjusted data matrix (from Step 2), i.e.
FinalData = FeatureVector^T x MeanAdjustedData^T
Both steps are sketched in code below.
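Steps 5 and 6 in code, on the slide's data: order the eigenvectors by eigenvalue, keep the top p (here p = 1, i.e. Choice 2), and project the mean-adjusted data onto them.

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
X_adjusted = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(X_adjusted, rowvar=False))
order = np.argsort(eigvals)[::-1]          # highest eigenvalue first
feature_vector = eigvecs[:, order[:1]]     # keep p = 1 eigenvector (Choice 2)

# FinalData = FeatureVector^T x MeanAdjustedData^T, as on the slide
final_data = feature_vector.T @ X_adjusted.T
print(final_data.shape)                    # (1, 10): one dimension, ten points
```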
Summary
Step 2: Subtract the mean.
Step 3: Calculate the covariance matrix.
Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix.
Step 5: Choose components and form a feature vector.