
Dimensionality Reduction

(PCA, LDA, Kernel PCA)
Background

● Let’s discuss the basic idea behind principal component analysis (PCA).
● It is an unsupervised statistical technique used to examine the interrelations among a set of variables in order to identify the underlying structure of those variables.
● It is sometimes also known as general factor analysis.
Background
● Where regression determines a line of best fit to a data set, factor analysis determines several orthogonal lines of best fit to the data set.
● Orthogonal means “at right angles”.
○ Actually, the lines are perpendicular to each other in n-dimensional space.
● n-dimensional space is the variable sample space.
○ There are as many dimensions as there are variables, so in a data set with 4 variables the sample space is 4-dimensional.
Background

● Here we have some data plotted along two features, x and y.
Background

● We can add an orthogonal line.
● Now we can begin to understand the components!
Background

● Components are a linear transformation that chooses a new coordinate system for the data set, such that the greatest variance of the data set comes to lie on the first axis.
Background

● The second greatest variance comes to lie on the second axis, and so on.
● This process allows us to reduce the number of variables used in an analysis.
Background

● Note that components are uncorrelated, since in the sample space they are orthogonal to each other.
Background

● We can continue this analysis into higher dimensions.
Background

● If we use this technique on a data set with a large number of variables, we can compress the amount of explained variation to just a few components.
● The most challenging part of PCA is interpreting the components.
Background

● For our work with Python, we’ll walk through an example of how to perform PCA with scikit-learn.
● We usually want to standardize our data to a common scale before PCA, so we’ll cover how to do this as well.
● Since this algorithm is usually used for the analysis of data rather than as a fully deployable model, there won’t be a portfolio project for this topic.
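As a preview of that workflow, here is a minimal sketch of PCA with scikit-learn, assuming a generic numeric dataset (the built-in breast cancer data is used only as a stand-in; the course's own dataset may differ). The key steps are standardizing the features and then fitting PCA.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load a dataset with many numeric features (illustrative stand-in)
X = load_breast_cancer().data          # shape (569, 30)

# Standardize every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Keep only the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                     # (569, 2)
print(pca.explained_variance_ratio_)   # share of variance captured by each component
```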
PCA in Detail
What is Principal Component Analysis?

First of all, Principal Component Analysis is a good name: it does what it says on the tin. PCA finds the principal components of data.

It is often useful to measure data in terms of its principal components rather than on a normal x-y axis. So what are principal components, then? They’re the underlying structure in the data. They are the directions where there is the most variance, the directions where the data is most spread out. This is easiest to explain by way of example. Here are some triangles in the shape of an oval:
Imagine that the triangles are points of data. To find the direction where there is most variance, find the straight line along which the data is most spread out when projected onto it. A vertical straight line with the points projected onto it will look like this:

The data isn’t very spread out here, so it doesn’t have a large variance. It is probably not the principal component.
A horizontal line with the points projected onto it will look like this:

On this line the data is much more spread out; it has a large variance. In fact there isn’t a straight line you can draw that has a larger variance than a horizontal one. A horizontal line is therefore the principal component in this example.

Luckily we can use maths to find the principal component, rather than drawing lines and unevenly shaped triangles. This is where eigenvectors and eigenvalues come in.
Eigenvectors and Eigenvalues

When we get a set of data points, like the triangles above, we can deconstruct the set into eigenvectors and eigenvalues. Eigenvectors and eigenvalues exist in pairs: every eigenvector has a corresponding eigenvalue. An eigenvector is a direction; in the example above the eigenvector was the direction of the line (vertical, horizontal, 45 degrees, etc.). An eigenvalue is a number telling you how much variance there is in the data in that direction; in the example above the eigenvalue is a number telling us how spread out the data is on the line.

The eigenvector with the highest eigenvalue is therefore the principal component.
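To make this concrete, here is a small numerical sketch, assuming made-up 2-D data for age versus hours on the internet (the variable names and numbers are purely illustrative): the eigenvectors of the covariance matrix are the candidate directions, and the one paired with the largest eigenvalue is the principal component.

```python
import numpy as np

# Made-up, correlated 2-D data: age and hours spent online
rng = np.random.default_rng(0)
age = rng.normal(40, 12, size=200)
hours = 0.5 * age + rng.normal(0, 4, size=200)
X = np.column_stack([age, hours])

# Covariance matrix of the data
cov = np.cov(X, rowvar=False)

# One eigenvector/eigenvalue pair per dimension (2-D data -> 2 pairs)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# np.linalg.eigh returns eigenvalues in ascending order,
# so the last column is the direction with the most variance.
principal_component = eigenvectors[:, -1]
print(eigenvalues)            # variance along each eigenvector
print(principal_component)    # direction of greatest spread
```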
Okay, so even though in the last example I could point my line in any direction, it turns out there are not many eigenvectors/values in a data set. In fact, the number of eigenvectors/values that exist equals the number of dimensions the data set has. Say I’m measuring age and hours on the internet: there are 2 variables, it’s a 2-dimensional data set, and therefore there are 2 eigenvectors/values. If I’m measuring age, hours on the internet and hours on a mobile phone, there are 3 variables, a 3-D data set, and so 3 eigenvectors/values. The reason for this is that the eigenvectors put the data into a new set of dimensions, and these new dimensions have to be equal in number to the original dimensions. This sounds complicated, but again an example should make it clear.
Here’s a graph with the oval:

At the moment the oval is on an x-y axis. x could be age and y could be hours on the internet. These are the two dimensions that my data set is currently being measured in. Now remember that the principal component of the oval was a line splitting it lengthways:
It turns out the other eigenvector (remember there are only two of them, as it’s a 2-D problem) is perpendicular to the principal component. As we said, the eigenvectors have to be able to span the whole x-y area; in order to do this (most effectively), the two directions need to be orthogonal (i.e. at 90 degrees) to one another. This is why the x and y axes are orthogonal to each other in the first place. It would be really awkward if the y axis were at 45 degrees to the x axis. So the second eigenvector would look like this:
The eigenvectors have given us a much more useful set of axes to frame the data in. We can now re-frame the data in these new dimensions. It would look like this:

Note that nothing has been done to the data itself. We’re just looking at it from a different angle. So getting the eigenvectors gets you from one set of axes to another. These axes are much more intuitive to the shape of the data now. These directions are where there is the most variation, and that is where there is the most information (think about this the reverse way round: if there were no variation in the data [e.g. everything was equal to 1] there would be no information; it would be a very boring statistic, and in this scenario the eigenvalue for that dimension would equal zero, because there is no variation).
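Continuing the hypothetical age/hours example, here is a short sketch of this re-framing, assuming the same made-up data as before: projecting the centred data onto the eigenvector axes is just a change of basis, and in the new axes the features come out uncorrelated.

```python
import numpy as np

# Same made-up 2-D data as in the earlier sketch
rng = np.random.default_rng(0)
age = rng.normal(40, 12, size=200)
hours = 0.5 * age + rng.normal(0, 4, size=200)
X = np.column_stack([age, hours])

# Eigenvectors of the covariance matrix define the new axes
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))

# Re-frame the data: centre it, then project onto the eigenvector axes
X_centred = X - X.mean(axis=0)
X_new = X_centred @ eigenvectors

# In the new axes the off-diagonal covariance is (numerically) zero,
# i.e. the re-framed features are uncorrelated.
print(np.round(np.cov(X_new, rowvar=False), 3))
```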

But what do these eigenvectors represent in real life? The old axes were well defined (age and hours on the internet, or any two things that you’ve explicitly measured), whereas the new ones are not. This is where you need to think. There is often a good reason why these axes represent the data better, but maths won’t tell you why; that’s for you to work out.

How do PCA and eigenvectors help in the actual analysis of data? Well, there are quite a few uses, but a main one is dimension reduction.
Dimension Reduction

PCA can be used to reduce the dimensions of a data set. Dimension reduction is analogous to being philosophically reductionist: it reduces the data down into its basic components, stripping away any unnecessary parts.

Let’s say you are measuring three things: age, hours on the internet and hours on mobile. There are 3 variables, so it is a 3D data set. 3 dimensions give an x, y and z graph, measuring width, depth and height (like the dimensions in the real world). Now imagine that the data forms into an oval like the ones above, but that this oval lies on a plane, i.e. all the data points lie on a piece of paper within this 3D graph (having width and depth, but no height). Like this:
When we find the 3 eigenvectors/values of the data set (remember 3D problem = 3
eigenvectors), 2 of the eigenvectors will have large eigenvalues, and one of the
eigenvectors will have an eigenvalue of zero. The first two eigenvectors will show
the width and depth of the data, but because there is no height on the data (it is
on a piece of paper) the third eigenvalue will be zero. On the picture below ev1 is
the first eigenvector (the one with the biggest eigenvalue, the principal component),
ev2 is the second eigenvector (which has a non-zero eigenvalue) and ev3 is the third
eigenvector, which has an eigenvalue of zero.
We can now rearrange our axes to be along the eigenvectors, rather than age, hours on the internet and hours on mobile. However, we know that ev3, the third eigenvector, is pretty useless. Therefore, instead of representing the data in 3 dimensions, we can get rid of the useless direction and only represent it in 2 dimensions, like before:
This is dimension reduction. We have reduced the problem from a 3D to a 2D
problem, getting rid of a dimension. Reducing dimensions helps to simplify the
data and makes it easier to visualize.

Note that we can reduce dimensions even if there isn’t a zero eigenvalue. Imagine we did the example again, except instead of the oval lying on a 2D plane, it had a tiny amount of height to it. There would still be 3 eigenvectors, but this time none of the eigenvalues would be exactly zero. The values would be something like 10, 8 and 0.1. The eigenvectors corresponding to 10 and 8 are the dimensions where there is a lot of information; the eigenvector corresponding to 0.1 will not have much information at all, so we can discard that third eigenvector again in order to make the data set simpler.
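Here is a minimal sketch of this idea with scikit-learn, assuming synthetic data that spreads out a lot in two directions and barely at all in the third (the "thickness of the paper"); the near-zero component is simply dropped. All names and numbers are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 3-D data: wide, deep, but almost flat
rng = np.random.default_rng(1)
n = 300
width = rng.normal(0, 10, n)       # lots of spread
depth = rng.normal(0, 8, n)        # lots of spread
height = rng.normal(0, 0.1, n)     # nearly flat: the "thickness of the paper"
X = np.column_stack([width, depth, height])

pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_)       # roughly [100, 64, 0.01]; the last is negligible

# Discard the near-zero direction by keeping only the first 2 components
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)                    # (300, 2)
```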
What are the common methods to perform Dimension Reduction?

There are many methods to perform dimension reduction. I have listed the most common methods below:

1. Missing Values: While exploring data, if we encounter missing values, what do we do? Our first step should be to identify the reason, then impute the missing values or drop the variables using appropriate methods. But what if we have too many missing values? Should we impute the missing values or drop the variables?

I would prefer the latter, because a variable that is mostly missing does not carry much detail about the data set, and it would not help in improving the power of the model. The next question: is there any threshold of missing values for dropping a variable? It varies from case to case. If the information contained in the variable is not that important, you can drop the variable when it has more than ~40-50% missing values.
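A small sketch of this check, assuming a hypothetical pandas DataFrame (the column names and the ~50% threshold are only illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one mostly-missing column
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [np.nan, np.nan, np.nan, 52000, np.nan],   # 80% missing
    "city": ["A", "B", "B", np.nan, "C"],
})

missing_fraction = df.isna().mean()            # fraction of missing values per column
print(missing_fraction)

threshold = 0.5
df_reduced = df.loc[:, missing_fraction <= threshold]
print(df_reduced.columns.tolist())             # 'income' is dropped
```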
2. Low Variance: Let’s think of a scenario where we have a constant variable (all observations have the same value, say 5) in our data set. Do you think it can improve the power of the model? Of course NOT, because it has zero variance. In the case of a high number of dimensions, we should drop variables having low variance compared to the others, because these variables will not explain the variation in the target variable.
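A quick sketch of this using scikit-learn's VarianceThreshold on made-up data (a threshold of 0 simply removes constant features):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [5, 1.2, 10],
    [5, 3.4, 12],
    [5, 2.1, 11],
    [5, 0.9, 13],
])  # the first column is constant (always 5)

selector = VarianceThreshold(threshold=0.0)   # drop zero-variance features
X_reduced = selector.fit_transform(X)

print(selector.variances_)   # per-feature variances; the first is 0
print(X_reduced.shape)       # (4, 2): the constant column is gone
```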
 
3. Decision Trees: This is one of my favorite techniques. It can be used as a one-stop solution to tackle multiple challenges like missing values, outliers and identifying significant variables. It worked well in our Data Hackathon too; several data scientists used decision trees and they worked well for them.
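As a small illustration of the "identifying significant variables" part, here is a sketch using a single decision tree's built-in feature importances (the dataset is only a stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(data.data, data.target)

# Importance of each feature in the fitted tree (zero means the feature was never used);
# print the five most significant variables
ranked = sorted(zip(data.feature_names, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```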

4. Random Forest: Similar to decision trees is the Random Forest. I would also recommend using the in-built feature importance provided by random forests to select a smaller subset of input features. Just be careful: random forests have a tendency to be biased towards variables that have a greater number of distinct values, i.e. they favor numeric variables over binary/categorical ones.
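A minimal sketch of this, assuming an illustrative dataset, using the forest's feature importances together with scikit-learn's SelectFromModel to keep only the more important features:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# Keep only the features whose importance exceeds the mean importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    threshold="mean",
)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)   # e.g. (569, 30) -> (569, ~10)
```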
5. High Correlation: Dimensions exhibiting high correlation with each other can lower the performance of the model. Moreover, it is not good to have multiple variables carrying similar information or variation, also known as “multicollinearity”. You can use a Pearson (continuous variables) or polychoric (discrete variables) correlation matrix to identify the variables with high correlation, and select among them using the VIF (Variance Inflation Factor). Variables having a high value (VIF > 5) can be dropped.
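A rough sketch of this check on made-up data, using pandas for the correlation matrix and statsmodels' variance_inflation_factor for the VIF (all variable names are hypothetical):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up telecom-style data: the monthly bill is strongly tied to the number of calls
rng = np.random.default_rng(2)
calls = rng.normal(100, 20, 300)
bill = 0.5 * calls + rng.normal(0, 2, 300)
tenure = rng.normal(24, 6, 300)
df = pd.DataFrame({"calls": calls, "bill": bill, "tenure": tenure})

# Pearson correlation matrix flags the highly correlated pair
print(df.corr().round(2))

# VIF per variable (a constant column is added so the intercept is accounted for)
X = df.assign(const=1.0).values
for i, name in enumerate(df.columns):
    print(name, round(variance_inflation_factor(X, i), 1))
# calls/bill come out with VIF well above 5, so one of the pair can be dropped
```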
 
6. Backward Feature Elimination: In this method, we start with all n dimensions. We compute the sum of squared residuals (SSR) after eliminating each variable in turn (n times), then identify the variable whose removal produced the smallest increase in the SSR and remove it, leaving us with n-1 input features.

Repeat this process until no other variables can be dropped.

The reverse of this is the “Forward Feature Selection” method. In this method, we start with one variable and analyse the performance of the model as we add another variable. Here, the selection of a variable is based on which one gives the larger improvement in model performance.
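Here is a minimal sketch of both directions using scikit-learn's SequentialFeatureSelector, a greedy wrapper that approximates the procedure described above (the dataset and feature counts are illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)   # 10 input features

# Backward: start from all features, repeatedly drop the least useful one
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward"
)
backward.fit(X, y)
print(backward.get_support())   # mask of the 5 retained features

# Forward: start from nothing, repeatedly add the most useful feature
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward"
)
forward.fit(X, y)
print(forward.get_support())
```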
7. Factor Analysis: Let’s say some variables are highly correlated. These variables can be grouped by their correlations, i.e. all variables in a particular group can be highly correlated among themselves but have low correlation with variables of other group(s). Here each group represents a single underlying construct or factor. These factors are small in number compared to the large number of dimensions. However, these factors are difficult to observe.

There are basically two methods of performing factor analysis:

EFA (Exploratory Factor Analysis)
CFA (Confirmatory Factor Analysis)
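For the exploratory flavour, here is a minimal sketch with scikit-learn's FactorAnalysis (CFA is not part of scikit-learn and would need a dedicated package; the dataset here is only a stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)

fa = FactorAnalysis(n_components=3, random_state=0)
scores = fa.fit_transform(X)         # per-sample factor scores

print(scores.shape)                  # (569, 3)
print(fa.components_.shape)          # (3, 30): loading of each variable on each factor
```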
8. Principal Component Analysis (PCA): In this technique, the variables are transformed into a new set of variables, which are linear combinations of the original variables. This new set of variables is known as the principal components. They are obtained in such a way that the first principal component accounts for as much as possible of the variation in the original data, and each succeeding component in turn has the highest possible remaining variance.
The second principal component must be orthogonal to the first principal component. In other words, it does its best to capture the variance in the data that is not captured by the first principal component. For a two-dimensional dataset, there can be only two principal components. Below is a snapshot of the data and its first and second principal components. You can notice that the second principal component is orthogonal to the first principal component.
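A tiny check of that orthogonality claim, on an illustrative dataset: the dot product of the first two component directions returned by scikit-learn's PCA is (numerically) zero.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)
pca = PCA(n_components=2).fit(X)

pc1, pc2 = pca.components_                   # each row is one principal direction
print(round(float(np.dot(pc1, pc2)), 10))    # ~0.0: the components are orthogonal
```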

The principal components are sensitive to the scale of measurement; to fix this issue we should always standardize the variables before applying PCA. Also note that applying PCA makes the original variables lose their individual meaning: if interpretability of the results is important for your analysis, PCA is not the right technique for your project.
Is Dimensionality Reduction Good or Bad?

There is no global right or wrong answer; it varies from situation to situation. Let me cover a few examples to bring out the spectrum of possibilities. These may not be comprehensive, but they should give you a good flavor of what should be done.

Example 1: You are building a regression model with 10 variables, some of which are mildly correlated. In this case, there is no need to perform any dimensionality reduction. Since there is little correlation among the variables, all of them are bringing in new information. You should keep all the variables in the mix and build a model.

Example 2: Again, you are building a regression model, this time with 20 variables, some of which are highly correlated.
For example, in the case of telecom: the number of calls made by a customer in a month and the monthly bill he / she receives. Or in the case of insurance, it could be the number of policies and the total premium sold by an agent / branch. In these cases, because you only have a limited number of variables, you should add only one variable from each such pair to your model. There is limited or no value to be gained by bringing in all the variables, and a lot of what they add can be noise. You can still do without applying any formal dimensionality reduction technique in this case (even though, for all practical purposes, you are doing it by hand).
Example 3: Now assume that you have 500 such variables, with some of them being correlated with each other.

For example, the data output of sensors from a smartphone. Or a retail chain looking at the performance of a store manager with tons of similar variables, such as total number of SKUs sold, number of bills created, number of customers sold to, number of days present, time spent on the aisle, etc. Many of these will be correlated, and there are far too many to figure out individually which ones are correlated with which. In this case, you should definitely apply dimension reduction techniques. You may or may not be able to make actual sense of the resulting vectors, but you can still understand them by looking at the result.

In a lot of scenarios like this, you will also see that you retain more than 90% of the information with less than 15-20% of the variables. Hence, these can be good applications of dimensionality reduction.
