(PCA, LDA, Kernel PCA)
Background
● We can add an orthogonal line.
● Now we can begin to understand the components!
● We can continue this analysis into higher dimensions.
The data isn't very spread out here, therefore it doesn't have a large
variance. It is probably not the principal component.
A horizontal line, with the points projected onto it, will look like this:
On this line the data is much more spread out; it has a large variance. In fact,
there isn't a straight line you can draw that has a larger variance than a
horizontal one. A horizontal line is therefore the principal component in this
example.
Luckily we can use maths to find the principal component rather than
drawing lines and unevenly shaped triangles. This is where eigenvectors
and eigenvalues come in.
Eigenvectors and Eigenvalues
When we get a set of data points, like the triangles above, we can
deconstruct the set into eigenvectors and eigenvalues. Eigenvectors and
eigenvalues exist in pairs: every eigenvector has a corresponding eigenvalue. An
eigenvector is a direction; in the example above, the eigenvector was the
direction of the line (vertical, horizontal, 45 degrees etc.). An eigenvalue
is a number telling you how much variance there is in the data in that
direction; in the example above, the eigenvalue is a number telling us how
spread out the data is on the line.
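To make this concrete, here is a minimal Python/NumPy sketch, on a made-up two-column data set (e.g. age and hours on the internet; the numbers are purely illustrative), of deconstructing a set of points into the eigenvectors and eigenvalues of its covariance matrix:

import numpy as np

# Made-up 2-D data set (e.g. age vs. hours on the internet).
X = np.array([[22, 10], [25, 12], [47, 3], [52, 2], [46, 4],
              [56, 1], [55, 2], [60, 1], [62, 0], [61, 1]], dtype=float)

# Centre the data and compute its covariance matrix.
X_centred = X - X.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)

# Eigenvectors are the directions, eigenvalues the variance along them.
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: cov is symmetric

# Sort from largest to smallest eigenvalue; the first eigenvector is the
# principal component.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print(eigenvalues)         # how spread out the data is in each direction
print(eigenvectors[:, 0])  # direction of maximum variance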
But what do these eigenvectors represent in real life? The old axes were well defined (age
and hours on the internet, or any two things that you've explicitly measured), whereas the new
ones are not. This is where you need to think. There is often a good reason why these axes
represent the data better, but maths won't tell you why; that's for you to work out.
How do PCA and eigenvectors help in the actual analysis of data? Well, there are quite a
few uses, but a main one is dimension reduction.
Dimension Reduction
Note that we can reduce dimensions even if there isn't a zero eigenvalue.
Imagine we did the example again, except that instead of the oval being on a 2D
plane, it had a tiny amount of height to it. There would still be 3 eigenvectors,
but this time none of the eigenvalues would be zero. The values would be
something like 10, 8 and 0.1. The eigenvectors corresponding to 10 and 8 are
the dimensions where there is a lot of information; the eigenvector
corresponding to 0.1 will not have much information at all, so we can
therefore discard the third eigenvector again in order to make the data set
simpler.
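As a rough sketch of that step (using NumPy on a hypothetical 3-D data set built so its variances come out at roughly 10, 8 and 0.1), we keep the two informative eigenvectors and project the data onto them:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-D data: two spread-out directions and a tiny bit of "height",
# so the eigenvalues come out at roughly 10, 8 and 0.1.
X = rng.normal(size=(500, 3)) * np.sqrt([10.0, 8.0, 0.1])

X_centred = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centred, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
print(eigenvalues)  # roughly [10, 8, 0.1]

# Discard the third eigenvector and project onto the first two:
# 3 dimensions become 2 with hardly any information lost.
X_reduced = X_centred @ eigenvectors[:, :2]
print(X_reduced.shape)  # (500, 2)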
What are the common methods to perform Dimension Reduction?
1. Missing Values: I would prefer the latter (dropping the variable), because a variable with
so many missing values would not carry much information about the data set. Also, it
would not help in improving the power of the model. Next question: is there any
threshold of missing values for dropping a variable? It varies from case to case. If
the information contained in the variable is not that much, you can drop the
variable if it has more than ~40-50% missing values.
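A minimal pandas sketch of that rule of thumb (the data frame and the 40% cut-off are just illustrative):

import numpy as np
import pandas as pd

# Hypothetical data frame in which 'income' is mostly missing.
df = pd.DataFrame({
    "age":    [22, 25, 47, 52, np.nan, 56],
    "income": [np.nan, np.nan, np.nan, 40000, np.nan, np.nan],
})

# Fraction of missing values per variable; drop anything above ~40-50%.
missing_ratio = df.isnull().mean()
df_reduced = df.loc[:, missing_ratio <= 0.4]

print(missing_ratio)
print(df_reduced.columns.tolist())  # ['age']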
2. Low Variance: Let's think of a scenario where we have a constant variable (all
observations have the same value, 5) in our data set. Do you think it can improve the
power of the model? Of course not, because it has zero variance. In the case of a high
number of dimensions, we should drop variables having low variance compared to
the others, because these variables will not explain the variation in the target variable.
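A quick sketch of this filter with scikit-learn's VarianceThreshold (the toy data and the zero threshold are illustrative; with real data you would scale the variables first, since raw variances depend on units):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data set: the third column is constant (all 5s), so it has zero variance.
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 12.0, 5.0],
              [3.0,  9.0, 5.0],
              [4.0, 11.0, 5.0]])

# Remove every variable whose variance does not exceed the threshold.
selector = VarianceThreshold(threshold=0.0)
X_filtered = selector.fit_transform(X)

print(selector.get_support())  # [ True  True False]
print(X_filtered.shape)        # (4, 2) -- the constant column is gone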
3. Decision Trees: This is one of my favourite techniques. It can be used as an ultimate
solution to tackle multiple challenges like missing values, outliers and identifying
significant variables. It worked well in our Data Hackathon too; several data
scientists used decision trees and it worked well for them.
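As a sketch of how a decision tree can flag the significant variables (the scikit-learn built-in data set and the tree depth are arbitrary choices):

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Fit a shallow tree and rank variables by impurity-based importance.
data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(data.data, data.target)

ranked = sorted(zip(data.feature_names, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:  # the five most significant variables
    print(f"{name}: {score:.3f}")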
Conversely, we can use the “Forward Feature Selection” method. In this method, we select
one variable and analyse the performance of the model as we add further variables. Here,
each variable is selected based on which one gives the larger improvement in model performance.
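A minimal sketch of forward selection using scikit-learn's SequentialFeatureSelector (the estimator, data set and choice of 5 variables are illustrative):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Start from no variables and greedily add the one that most improves
# cross-validated performance, until 5 variables have been selected.
X, y = load_diabetes(return_X_y=True)
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=5,
                                     direction="forward",
                                     cv=5)
selector.fit(X, y)

print(selector.get_support())  # mask of the selected variables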
7. Factor Analysis: Let's say some variables are highly correlated. These variables can be
grouped by their correlations, i.e. all variables in a particular group can be highly
correlated among themselves but have low correlation with variables of other group(s).
Here each group represents a single underlying construct or factor. These factors
are small in number compared to the large number of dimensions. However, these factors
are difficult to observe.
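A short sketch with scikit-learn's FactorAnalysis (the data set and the choice of 2 factors are illustrative):

from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

# Re-express 4 correlated measurements as 2 underlying latent factors.
X, _ = load_iris(return_X_y=True)
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)

print(fa.components_.shape)  # (2, 4): loadings of each variable on each factor
print(scores.shape)          # (150, 2): each observation described by 2 factors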
There is no global right or wrong answer; it depends on the situation. Let me cover a few
examples to bring out the spectrum of possibilities. These may not be comprehensive, but they
should give you a good flavour of what can be done.
Example 1: You are building a regression model with 10 variables, some of which are mildly correlated.
In this case, there is no need to perform any dimensionality reduction. Since there is little correlation
among the variables, all of them bring in new information. You should keep all the variables in the
mix and build the model.
Example 2: Again, you are building a regression model, this time with 20 variables, some of
which are highly correlated.
For example, in telecom: the number of calls made by a customer in a month and the monthly bill
he/she receives. Or in insurance: the number of policies and the total premium sold by an
agent/branch. In these cases, because you only have a limited number of variables, you should add
only one variable from each such correlated pair to your model. There is limited or no value to be
gained by bringing in all of the variables, and a lot of that additional signal can be noise. You can
still do without applying any formal dimensionality reduction technique in this case (even though you
are doing it informally for all practical purposes).
Example 3: Now assume that you have 500 such variables, with some of them correlated
with each other.
For example, the data output of the sensors on a smartphone. Or a retail chain looking at the
performance of a store manager with tons of similar variables: total number of SKUs
sold, number of bills created, number of customers sold to, number of days present, time
spent on the aisle, etc. All of these would be correlated, and there are far too many of them to
individually figure out which are correlated with which. In this case, you should definitely
apply dimension reduction techniques. You may or may not be able to make real-world sense of
the resulting vectors, but you can still understand them by looking at the results.
In a lot of scenarios like this, you will also see that you retain more than 90% of the
information with less than 15-20% of the variables. Hence, these can be good applications of
dimensionality reduction.
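A minimal scikit-learn sketch of that idea (the digits data set stands in for a wide table of correlated variables; passing a fraction to PCA keeps just enough components to retain that share of the variance):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardise, then keep just enough principal components for 90% of variance.
X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.90)     # a float < 1 asks for a variance fraction
X_reduced = pca.fit_transform(X_scaled)

print(X.shape[1], "->", pca.n_components_)   # 64 original variables -> far fewer
print(pca.explained_variance_ratio_.sum())   # at least 0.90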