You are on page 1of 33

# 23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

(http://www.analyticsvidhya.com)

(http://datahack.analyticsvidhya.com/contest/practice-problem-time-series)

## Practical Guide to Principal Component Analysis

(PCA) in R & Python
PYTHON (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/CATEGORY/PYTHON-2/)

(HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/CATEGORY/R/)

harer.php?u=http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-

20Principal%20Component%20Analysis%20(PCA)%20in%20R%20&%20Python)

rincipal%20Component%20Analysis%20(PCA)%20in%20R%20&%20Python+http://www.analyticsvidhya.com/blog/2016/03/practical-

hon/)

com/pin/create/button/?url=http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-

dhya.com/wp-

iption=Practical%20Guide%20to%20Principal%20Component%20Analysis%20(PCA)%20in%20R%20&%20Python)

utm_source=AV&utm_medium=Banner&utm_campaign=AVBanner)

Introduction
http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

1/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

Introduction
Too much of anything is good for nothing!

What happens when a data set has too many variables ? Here are few possible situations
which you might come across:
1. You find thatmost of thevariables are correlated.
2. You lose patience and decide to run a model on whole data. This returnspoor accuracy
andyou feel terrible.
3. You become indecisive about what to do
4. You start thinking of some strategic method to find few important variables

Trust me, dealing with such situations isnt as difficult as it sounds. Statistical techniques
such asfactor analysis, principal component analysis help to overcome such difficulties.
In this post, Ive explained the concept of principal component analysis in detail. Ive kept
the explanation to be simple and informative. For practical understanding, Ive also
demonstrated using this technique in R with interpretations.

## What is Principal Component Analysis ?

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

2/33

23/03/2016

## What is Principal Component Analysis ?

In simple words, principal component analysis is a method ofextracting important variables
from a large set of variables available in a data set. It extracts low dimensional set of
features from a high dimensional data set with a motive tocapture as much information as
possible. With fewer variables, visualization also becomes much more meaningful. PCA is
more useful when dealing with 3 or higher dimensional data.
It is always performed on a symmetric correlation or covariance matrix. This means the
matrix should be numeric and have standardized data.
Lets understand it using an example:
Lets say we have a data set of dimension 300 (n) 50 (p). n represents the number of
observations and p represents number of predictors. Since we have a large p = 50, therecan
be p(p1)/2 scatter plots i.e more than 1000 plots possible to analyze the variable
relationship.Wouldnt is be a tedious job to perform exploratory analysis on this data ?
In this case, it would be a lucid approach to select a subset of p (p << 50) predictor which
captures as much information. Followed by plotting the observation in the resultant low
dimensional space.
The image below shows the transformation of a high dimensional data (3 dimension) to low
dimensional data (2 dimension) using PCA. Not to forget, each resultant dimension is a linear
combination of p features

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

3/33

23/03/2016

## What are principal components ?

A principal component is a normalized linear combination of theoriginal predictors in a data
set. In image above, PC1 and PC2 are the principal components. Lets say we have a set of
predictors as X,X...,Xp
The principal component can be writtenas:
Z=X+X+X+....+p Xp

where,
Z isfirst principal component

## ,..) of first principal component.

The loadings are constrained to a sum of square equals to 1. This is because large
magnitude of loadings may lead to large variance. It also defines the direction of the
principal component (Z) along which data varies the most. It results in a line in p
dimensional space which is closest to the n observations. Closeness is measured using
average squared euclidean distance.
X..Xpare normalized predictors. Normalized predictors havemean equals to zero and

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

4/33

23/03/2016

## standard deviation equals to one.

Therefore,
First principal component is a linear combination of original predictor variables which
captures the maximum variance in the data set. It determines the direction of highest
variability in the data. Larger the variability captured in first component, larger the
information captured by component.No other component can have variability higher than
first principal component.
The first principal component results in a line which is closest to the data i.e. it minimizes the
sum of squared distance between a data point and the line.
Similarly, we can compute the second principal component also.

## Second principal component ( Z) is also a linear combination of original predictors which

captures the remaining variance in the data set and is uncorrelated with Z.In other words,
the correlation between first and second component should iszero. It can be represented
as:
Z=X+X+X+....+p2Xp

If the two components are uncorrelated, their directions should be orthogonal (image
below). This image is based on a simulated data with 2 predictors. Notice the direction of the
components, as expected they are orthogonal. This suggests the correlation b/w these
components in zero.

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

5/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

All succeeding principal component follows a similar concept i.e. they capture the
remaining variation without being correlated with the previous component. In general, for

## n pdimensional data, min(n-1, p) principal component can be constructed.

The directions of these components are identified in an unsupervised way i.e. the response
variable(Y) is not used to determine the component direction. Therefore, it is an
unsupervised approach.

Note: Partial least square (PLS) is a supervised alternative to PCA. PLS assigns higher weight
to variables which are strongly related to response variable to determine principal
components.

## Why is normalization of variables necessary ?

The principal components are supplied with normalized version of original predictors. This is
because, the original predictors may have different scales. For example: Imagine a data set
with variables measuring units as gallons, kilometers, light years etc. It is definite that the
scale of variances in these variables will be large.

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

6/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

Performing PCA on un-normalized variables will lead to insanely large loadings for variables
with high variance. In turn, this will lead to dependence of a principal component on the
variable with high variance. This is undesirable.
As shown in image below, PCA was run on a data set twice (with unscaled and scaled
predictors). This data set has ~40 variables. You can see, first principal component is
dominated by a variable Item_MRP. And, second principal component is dominated by a
variable Item_Weight. This domination prevails due to high value of variance associated
with a variable. When the variables are scaled, we get a much better representation of
variables in 2D space.

## Implement PCA in R & Python (with interpretation)

How many principal components to choose ? I could dive deep in theory, but it would be
better to answer these question practically.
For this demonstration, Ill be using the data set from Big Mart Prediction Challenge
(http://datahack.analyticsvidhya.com/contest/practice-problem-bigmart-sales-prediction).

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

7/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

Remember, PCA can be applied only on numerical data. Therefore, if the data has
categorical variables they must be converted to numerical. Also, make sure you have done
the basic data cleaning prior to implementing this technique. Lets quickly finish with initial
#directorypath
>path<".../Data/Big_Mart_Sales"
#setworkingdirectory
>setwd(path)
>test\$Item_Outlet_Sales<1
#combinethedataset
>combi<rbind(train,test)
#imputemissingvalueswithmedian

>combi\$Item_Weight[is.na(combi\$Item_Weight)]<median(combi\$Item_Weight,na.rm=TR
#impute0withmedian

>combi\$Item_Visibility<ifelse(combi\$Item_Visibility==0,median(combi\$Item_Visib
#findmodeandimpute
>table(combi\$Outlet_Size,combi\$Outlet_Type)
>levels(combi\$Outlet_Size)[1]<"Other"

Till here, weve imputed missing values. Now we are left with removing the dependent
(response) variable and other identifier variables( if any). As we said above, we are practicing
an unsupervised learning technique, hence response variable must be removed.

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

8/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

#removethedependentandidentifiervariables

>my_data<subset(combi,select=c(Item_Outlet_Sales,Item_Identifier,

Lets check the available variables ( a.k.a predictors) in the data set.
#checkavailablevariables
>colnames(my_data)

Since PCA works on numeric variables, lets see if we have any variable other than numeric.
#checkvariableclass
>str(my_data)
'data.frame':14204obs.of9variables:
\$Item_Weight:num9.35.9217.519.28.93...
\$Item_Fat_Content:Factorw/5levels"LF","lowfat",..:3535355355...
\$Item_Visibility:num0.0160.01930.01680.0540.054...
\$Item_Type:Factorw/16levels"BakingGoods",..:515117101141466...
\$Item_MRP:num249.848.3141.6182.153.9...

\$Outlet_Establishment_Year:int1999200919991998198720091987198520022007..
\$Outlet_Size:Factorw/4levels"Other","High",..:3331232311...

\$Outlet_Location_Type:Factorw/3levels"Tier1","Tier2",..:1313333322
\$Outlet_Type:Factorw/4levels"GroceryStore",..:2321232422...

Sadly, 6 out of 9 variables are categorical in nature. We have some additional work to do
now. Well convert these categorical variables into numeric using one hot encoding.
>library(dummies)

>new_my_data<dummy.data.frame(my_data,names=c("Item_Fat_Content","Item_Type",
"Outlet_Establishment_Year","Outlet_Size",
"Outlet_Location_Type","Outlet_Type"))

## To check, if we now have a data set of integer values, simple write:

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

9/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

#checkthedataset
>str(new_my_data)

And, we now have all the numerical values. We can now go ahead with PCA.
The base R function prcomp() is used to performPCA. By default, it centers the variable to
have mean equals to zero. With parameter scale.=T, we normalize the variables to have
standard deviation equals to 1.
#principalcomponentanalysis
>prin_comp<prcomp(new_my_data,scale.=T)
>names(prin_comp)
[1]"sdev""rotation""center""scale""x"

## The prcomp() function results in 5 useful measures:

1. center and scale refers to respective mean and standard deviation of the variables that
are used for normalization prior to implementing PCA
#outputsthemeanofvariables
prin_comp\$center
#outputsthestandarddeviationofvariables
prin_comp\$scale

2. The rotation measure provides the principal component loading. Each column of rotation
matrix contains the principal component loading vector. This is the most important measure
we should be interested in.
>prin_comp\$rotation

This returns 44 principal components loadings. Is that correct ? Absolutely. In a data set, the
maximum number of principal component loadings is a minimum of (n-1, p). Lets look at
first 4 principal components and first 5 rows.

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

10/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

>prin_comp\$rotation[1:5,1:4]
PC1PC2PC3PC4
Item_Weight0.00544292250.0012856660.0112461940.011887106
Item_Fat_ContentLF0.00219833140.0037685570.0097900940.016789483
Item_Fat_Contentlowfat0.00190427100.0018669050.0030664150.018396143
Item_Fat_ContentLowFat0.00279364670.0022343280.0283098110.056822747
Item_Fat_Contentreg0.00029363190.0011209310.0090332540.001026615

3. In order to compute the principal component score vector, we dont need to multiply the
loading with data. Rather, the matrix x has the principal component score vectors in a
14204 44 dimension.
(http://discuss.analyticsvidhya.com)
> dim(prin_comp\$x)
[1] 14204 44
Lets plot the resultant principal components.
>biplot(prin_comp,scale=0)

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

11/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

The parameter scale=0 ensures that arrows are scaled to represent the loadings. To
make inference from image above, focus on the extreme ends (top, bottom, left, right) of
this graph.
We

infer

than

first

principal

component

corresponds

to

measure

of

## Outlet_TypeSupermarket, Outlet_Establishment_Year 2007. Similarly, it can be said that the

second

component

corresponds

to

measure

of

Outlet_Location_TypeTier1,

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

12/33

23/03/2016

## Outlet_Sizeother. For exact measure of a variable in a component, you should look at

rotation matrix(above) again.
4. The prcomp() function also provides the facility to compute standard deviation of each
principal component. sdev refers to the standard deviation of principal components.
#computestandarddeviationofeachprincipalcomponent
>std_dev<prin_comp\$sdev
#computevariance
>pr_var<std_dev^2
#checkvarianceoffirst10components
>pr_var[1:10]
[1]4.5636153.2177022.7447262.5410912.1981522.0153201.9320761.256831
[9]1.2037911.168101

We aim to find the components which explain the maximum variance. This is because, we
want to retain as much information as possible using these components. So, higher is the
explained variance, higher will be the information contained in those components.
To compute the proportion of variance explained by each component, we simply divide the
variance by sum of total variance. This results in:
#proportionofvarianceexplained
>prop_varex<pr_var/sum(pr_var)
>prop_varex[1:20]
[1]0.103718530.073129580.062380140.057752070.049958000.04580274
[7]0.043910810.028564330.027358880.026547740.025598760.02556797
[13]0.025495160.025088310.024939320.024909380.024683130.02446016
[19]0.023903670.02371118

This shows that first principal component explains 10.3% variance. Second component
explains 7.3% variance. Third component explains 6.2% variance and so on. So, how do we
decide how many components should we select for modeling stage ?

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

13/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

The answer to this question is provided by a scree plot. A scree plot is used to access
components or factors which explains the most of variability in the data. It represents values
in descending order.
#screeplot
>plot(prop_varex,xlab="PrincipalComponent",
ylab="ProportionofVarianceExplained",
type="b")

The plot above shows that ~ 30 components explains around 98.4% variance in the data set.
In order words, using PCA we have reduced 44 predictors to 30 without compromising on
explained variance. This is the power of PCA> Lets do a confirmation check, by plotting a
cumulative variance plot. This will give us a clear picture of number of components.
#cumulativescreeplot
>plot(cumsum(prop_varex),xlab="PrincipalComponent",
ylab="CumulativeProportionofVarianceExplained",
type="b")

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

14/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

This plot shows that 30 components results in variance close to ~ 98%. Therefore, in this
case, well select number of components as 30 [PC1 to PC30] and proceed to the modeling
stage. This completes the steps to implement PCA in R. For modeling, well use these 30
components as predictor variables and follow the normal procedures.

For Python Users: To implement PCA in python, simply import PCA from sklearn library. The
interpretation remains same as explained for R users above. Ofcourse, the result is some as
derived after using R. The data set used for Python is a cleaned version where missing
values have been imputed, and categorical variables are converted into numeric.
importnumpyasnp
fromsklearn.decompositionimportPCA
importpandasaspd
importmatplotlib.pyplotasplt
fromsklearn.preprocessingimportscale
%matplotlibinline

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

15/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

#convertittonumpyarrays
X=data.values
#Scalingthevalues
X=scale(X)
pca=PCA(n_components=44)
pca.fit(X)
#TheamountofvariancethateachPCexplains
var=pca.explained_variance_ratio_
#CumulativeVarianceexplains
var1=np.cumsum(np.round(pca.explained_variance_ratio_,decimals=4)*100)
printvar1
[10.3717.6823.9229.734.739.2843.6746.5349.27
51.9254.4857.0459.5962.164.5967.0869.5572.
74.3976.7679.181.4483.7786.0688.3390.5992.7
94.7696.7898.44100.01100.01100.01100.01100.01100.01
100.01100.01100.01100.01100.01100.01100.01100.01]
plt.plot(var1)

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

16/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

#LookingataboveplotI'mtaking30variables
pca=PCA(n_components=30)
pca.fit(X)
X1=pca.fit_transform(X)
printX1

For more information on PCA in python, visit scikit learn documentation (http://scikitlearn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

Points to Remember
1. PCA is used to extract important features from a data set.
2. These features are low dimensional in nature.
3. These features a.k.a components are a resultant of normalized linear combination of original
predictor variables.
4. These components aim to capture as much information as possible with high explained
variance.
5. The first component has the highest variance followed by second, third and so on.
6. The components must be uncorrelated (remember orthogonal direction ? ). See above.
7. Normalizing data becomesextremely important when the predictors are measured in
different units.
8. PCA works best on data set having 3 or higher dimensions. Because, with higher dimensions,
it becomes increasingly difficult to make interpretations from the resultant cloud of data.
9. PCA is applied on a data set with numeric variables.

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

17/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

10. PCA is a tool which helps to produce better visualizations of high dimensional data.

End Notes
This brings me to the end of this tutorial. Without delving deep into mathematics, Ive tried
to make you familiar with most important concepts required to use this technique. Its
simple but needs special attention while deciding the number of components. Practically,
we should strive to retain only first few k components
The idea behind pca is to construct some principal components( Z << Xp ) which
satisfactorily explains most of the variability in the data, as well as relationship with the
response variable.
Did you like reading this article ? Did you understand this technique ? Do share your
suggestions / opinions in the comments section below.

You can test your skills and knowledge. Check out Live Competitions
(http://datahack.analyticsvidhya.com/contest/all) and compete with
bestData Scientists from all over the world.

(http://www.analyticsvidhya.com/blog/2016/03/practicalguideprincipalcomponentanalysispython/?

146

(http://www.analyticsvidhya.com/blog/2016/03/practicalguideprincipalcomponentanalysispython/?

165

1&nb=1)

(http://www.analyticsvidhya.com/blog/2016/03/practicalguideprincipalcomponentanalysispython/?

(http://www.analyticsvidhya.com/blog/2016/03/practicalguideprincipalcomponentanalysispython/?
share=pocket&nb=1)

(http://www.analyticsvidhya.com/blog/2016/03/practicalguideprincipalcomponentanalysispython/?share=reddit&nb=1)

RELATED

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

18/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

(http://www.analyticsvidhya.co

(http://www.analyticsvidhya.co

(http://www.analyticsvidhya.co

m/blog/2016/01/guide-dataexploration/)

m/blog/2016/01/completetutorial-learn-data-science-

data-science/)

## A Comprehensive guide to Data

Exploration
(http://www.analyticsvidhya.co
m/blog/2016/01/guide-dataexploration/)

python-scratch-2/)

## Free Must Read Books on

Statistics & Mathematics for
Data Science
(http://www.analyticsvidhya.co

## A Complete Tutorial to Learn

Data Science with Python from
Scratch
(http://www.analyticsvidhya.co
m/blog/2016/01/completetutorial-learn-data-sciencepython-scratch-2/)

In "Machine Learning"

In "Machine Learning"

## TAGS: EXPLAINED VARIANCE (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/TAG/EXPLAINED-VARIANCE/), FACTOR ANALYSIS

(HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/TAG/FACTOR-ANALYSIS/), FIRST COMPONENTS
(HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/TAG/FIRST-COMPONENTS/), NORMALIZATION
(HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/TAG/NORMALIZATION/), PCA IN PYTHON (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/TAG/PCA-INPYTHON/), PCA IN R (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/TAG/PCA-IN-R/), PRINCIPAL COMPONENT ANALYSIS
(HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/TAG/PRINCIPAL-COMPONENT-ANALYSIS/), SCREE PLOT
(HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/TAG/SCREE-PLOT/), STATISTICS (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/TAG/STATISTICS/)

Previous Article

Next Article

## Winning Solutions of DYD

Competition R and XGBoost Ruled

## Course Review Big data and

(http://www.analyticsvidhya.com/blog/2016/03/winning- Course by Simplilearn
solutions-dyd-competition-xgboostruled/)

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

19/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

(http://www.analyticsvidhya.com/blog/author/manish-saraswat/)
Author

Manish Saraswat
(http://www.analyticsvidhya.com/blog/author/manishsaraswat/)
I believe education can change this world. Knowledge is the most powerful asset one can
build. It builds up like compound interest. I am passionate about helping people. I care
about animals, unprivileged people, sharing knowledge, health and books. R, Data Science
and Machine Learning keep me busy. Try. Bleed. Succeed.

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

20/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

MARCH 21, 2016 AT 6:19 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALCOMPONENT-ANALYSIS-PYTHON/#COMMENT-107888)

Excellent Manish

## Manish Saraswat says:

MARCH 21, 2016 AT 6:37 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107890#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107890)

## Your appreciation means a lot.

Thank you Tuhin Sir

Surobhi says:
MARCH 21, 2016 AT 7:24 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107894#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107894)

Hi Manish,
Information given about PCA in your article was very comprehensive as you have
covered both the theoretical and the implementation part very well. It was fun and
simple to understand too. Can you please write a similar one for Factor Analysis? How is
it different from PCA and how to decide on the method of dimensional reduction case
to case. Thanks

## Manish Saraswat says:

MARCH 21, 2016 AT 10:39 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107909#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107909)

Thanks Surobhi !
I already have it in my plan to write soon one detailed post on Factor Analysis. Wish me
luck!

## Prasoon Saxena says:

MARCH 21, 2016 AT 7:32 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107897#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107897)

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

21/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

This is good explanation Manish and thank you for sharing it.
Quick question, model created using these 30pca will have all 50 independent variable
but if I want to figure out what among those 50 independent variables which are most
critical one then how we figure that so that we can build model using those specific
variables.
Will appreciate your help.
Thanks

## Manish Saraswat says:

MARCH 21, 2016 AT 10:47 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107912#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107912)

Hello
For model building, well use the resultant 30 components as independent variables.
Remember, each component is a vector comprising of principal component score
derived from each predictor variable (in this case we have 50). Check
prin_comp\$rotation for principal component scores in each vector. This technique is
used to shrink the dimension of a data set such that it becomes easier to analyze,
visualize and interpret.
By critical, I assume you are talking about measuring variable importance. If thats the
case, you can look for p values, t statistics in regression. For variable selection,
regression is equipped with various approaches such as forward selection, backward
selection, step wise selection etc.

Muthupandi says:
MARCH 21, 2016 AT 12:45 PM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107922#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107922)

Hi Manish
Thanks for the article, had good explanation and walk through. In PCA on R u have
shown hot to get the name of the variable and its missing in python, can u explain how i
can get the names of the variables after reduction so that we can use it for model
building? or am i understanding it wrongly

## Manish Saraswat says:

MARCH 21, 2016 AT 4:31 PM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107951#RESPOND)

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

22/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

COMPONENT-ANALYSIS-PYTHON/#COMMENT-107951)

Hi
I didnt understand when you say name of the variable. Did you mean rotation matrix ?
Please elaborate on that.
Secondly, once you have done pca, you dont need to use variables for modeling,
instead use the resultant combination of components as independent variables which
leads to highest explained variance.

Muthupandi says:
MARCH 22, 2016 AT 5:24 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALCOMPONENT-ANALYSIS-PYTHON/#COMMENT-107981)

## Variable name am referring to is Item_Weight,Item_Fat_ContentLF

Item_Fat_Contentlow fat Item_Fat_ContentLow Fat,Item_Fat_Contentreg etc.,
so we have to use X1 to train the model?

## Hunaidkhan (http://none) says:

MARCH 21, 2016 AT 8:40 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107901#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107901)

Really informative Manish, Also variables derived from PCA can be used for Regression
analysis. Regression analysis with PCA gives a better prediction and less error.

## Manish Saraswat says:

MARCH 21, 2016 AT 11:00 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107914#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107914)

Rightly said. PCA when used for regression takes a form of supervised approach known
as PLS (partial least squares). In PLS, the response variable is used for identification of
principal components.

Debarshi says:
MARCH 21, 2016 AT 8:46 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107902#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107902)

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

23/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

I have used PCA recently in one projects, and would like to add few points:
-PCA reduce the dimension but the the result is not very intuitive, as each PCs are
combination of all the original variables. So use Factor Analysis (Factor Rotation) on top
of PCA to get a better relationship between PCs (rather Factors) and original Variable,
this result was brilliant in an Insurance data.
-If you have perfectly correlated variables (A & B) then also PCA will not suggest you to
drop one, rather it will suggest to use a combination of these two (A+B), but off course it
will reduce the dimension
-This is different from feature selection, dont mix these two concept
-There is a concept of Nonlinear PCA which helps to include non Numeric values as
well.
-If you want to reduce the dimension (or numbers) of predictors (X) remember PCA does
not consider response (Y) while reducing the dimension, your original variables may be
(??) a better predictors.

## Manish Saraswat says:

MARCH 21, 2016 AT 11:02 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107915#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107915)

## Thanks a lot Debarshi. These are so much helpful.

Venu says:
MARCH 21, 2016 AT 8:49 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107903#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107903)

Good One

Pallavi says:
MARCH 21, 2016 AT 10:42 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107910#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107910)

Hi Manish,
Another good article! I have always found it difficult to explain the principle components
to business users. Would really appreciate, if you also write how do you explain the PCA
to business users What general questions you get from business users and how to
handle those.

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

24/33

23/03/2016

Thanks

## Dox vK (http://none) says:

MARCH 21, 2016 AT 12:17 PM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107920#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107920)

I understand there is a PCA for qualitative data could some one provide me with a
good intutive resource for suvh?

Krishna says:
MARCH 21, 2016 AT 1:17 PM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107925#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107925)

Hi Manish,
The article is very helpful. While we normalize the data for numeric variables, do we
need to remove outliers if any exists in the data before performing PCA?
Also looks like , implementation of final model in production is quite tedious, as we
always have to compute components prior scoring.
Thanks,
Krishna

## Manish Saraswat says:

MARCH 21, 2016 AT 4:36 PM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107952#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107952)

Hello
Also mentioned in the article, data cleaning (removal of outliers, imputing missing
values) are important prior to implementing principal component analysis. Such things
only adds noise and inconsistency in the data. Hence, it is a good practice to sort them
out first.
I beg to differ on this procedure being tedious. For the sake of understanding, Ive
explained each and every step used in this technique, may be this makes it tedious.
However, if you think you have understood it correctly, just pick the important
commands (codes) and get to the scree plot stage in no time.

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

25/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

sandy says:
MARCH 21, 2016 AT 3:48 PM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107945#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107945)

## nice one explanation

Ankur says:
MARCH 22, 2016 AT 8:58 AM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107989#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107989)

Hi Manish,
while running the command >prin_comp <- prcomp(new_my_data, scale. = T)
it giving error "Error in svd(x, nu = 0) : infinite or missing values in 'x'"
how to rectify it.
BTW a GREAT article..

## Manish Saraswat says:

MARCH 22, 2016 AT 12:52 PM (HTTP://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/03/PRACTICAL-GUIDE-PRINCIPALREPLYTOCOM=107996#RESPOND)
COMPONENT-ANALYSIS-PYTHON/#COMMENT-107996)

Hello
This error says Your data set has missing values. Please Impute.
To rectify, run the code once again where I dealt with missing values. Good Luck!

Connect with:

guide-principal-component-analysis-python%2F)

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

26/33

23/03/2016

Comment

Name (required)

Email (required)

Website

SUBMIT COMMENT

## Notify me of new posts by email.

TOP USERS
Rank
1

Name

Points
Nalin Pasricha
(http://datahack.analyticsvidhya.com/user/profile/Nalin)

3789

SRK (http://datahack.analyticsvidhya.com/user/profile/SRK)

3696

Aayushmnit
(http://datahack.analyticsvidhya.com/user/profile/aayushmnit)

3304

binga (http://datahack.analyticsvidhya.com/user/profile/binga)

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

2954

27/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

vikash (http://datahack.analyticsvidhya.com/user/profile/vikash)

2312

## More Rankings (http://datahack.analyticsvidhya.com/users)

(http://pgpba.greatlakes.edu.in/?

utm_source=AVM&utm_medium=Banner&utm_campaign=Pgpba_decjan)

POPULAR POSTS
A Complete Tutorial to learn Data Science in R from Scratch
(http://www.analyticsvidhya.com/blog/2016/02/complete-tutorial-learn-data-sciencescratch/)
Free Must Read Books on Statistics & Mathematics for Data Science
A Complete Tutorial to Learn Data Science with Python from Scratch
(http://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-sciencepython-scratch-2/)
Essentials of Machine Learning Algorithms (with Python and R Codes)
(http://www.analyticsvidhya.com/blog/2015/08/common-machine-learning-algorithms/)
A Complete Tutorial on Time Series Modeling in R
(http://www.analyticsvidhya.com/blog/2015/12/complete-tutorial-time-series-modeling/)
7 Types of Regression Techniques you should know!
(http://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/)

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

28/33

23/03/2016

## SAS vs. R (vs. Python) which tool should I learn?

(http://www.analyticsvidhya.com/blog/2014/03/sas-vs-vs-python-tool-learn/)
Quick Insights: India Analytics and Big Data Salary Report 2016
(http://www.analyticsvidhya.com/blog/2016/02/quick-insights-analytics-big-data-salaryreport-2016/)

(http://praxis.ac.in/fulltime-bigdata-analytics?

utm_source=avbanner)

professional/)

RECENT POSTS

(http://www.analyticsvidhya.com/blog/2016/03/select-important-variables-

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

29/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

boruta-package/)

How to perform feature selection (i.e. pick important variables) using Boruta Package in R ?
(http://www.analyticsvidhya.com/blog/2016/03/select-important-variables-borutapackage/)
GUEST BLOG , MARCH 22, 2016

Course Review Big data and Hadoop Developer Certification Course by Simplilearn
KUNAL JAIN , MARCH 22, 2016

(http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principalcomponent-analysis-python/)

## Practical Guide to Principal Component Analysis (PCA) in R & Python

(http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-componentanalysis-python/)
MANISH SARASWAT , MARCH 21, 2016

(http://www.analyticsvidhya.com/blog/2016/03/winning-solutions-dydcompetition-xgboost-ruled/)

## Winning Solutions of DYD Competition R and XGBoost Ruled

(http://www.analyticsvidhya.com/blog/2016/03/winning-solutions-dyd-competitionxgboost-ruled/)
MANISH SARASWAT , MARCH 18, 2016

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

30/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

GET CONNECTED

4,372

FOLLOWERS

959

FOLLOWERS

12,675
FOLLOWERS

Email

SUBSCRIBE

uri=analyticsvidhya)

For those of you, who are wondering what is Analytics Vidhya, Analytics can be defined as the
science of extracting insights from raw data. The spectrum of analytics starts from capturing data
and evolves into using insights / trends from this data to make informed decisions.

STAY CONNECTED

4,372

12,675

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

31/33

23/03/2016

## Practical Guide to Principal Component Analysis (PCA) in R & Python

4,372

FOLLOWERS

959

FOLLOWERS

LATEST POSTS

12,675
FOLLOWERS

Email

SUBSCRIBE

uri=analyticsvidhya)

(http://www.analyticsvidhya.com/blog/2016/03/select-important-variablesboruta-package/)

How to perform feature selection (i.e. pick important variables) using Boruta Package in R ?
(http://www.analyticsvidhya.com/blog/2016/03/select-important-variables-borutapackage/)
GUEST BLOG , MARCH 22, 2016

Course Review Big data and Hadoop Developer Certification Course by Simplilearn
KUNAL JAIN , MARCH 22, 2016

(http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principalcomponent-analysis-python/)

## Practical Guide to Principal Component Analysis (PCA) in R & Python

(http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-componentanalysis-python/)
MANISH SARASWAT , MARCH 21, 2016

(http://www.analyticsvidhya.com/blog/2016/03/winning-solutions-dydcompetition-xgboost-ruled/)

## Winning Solutions of DYD Competition R and XGBoost Ruled

http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

32/33

23/03/2016

## Winning Solutions of DYD Competition R and XGBoost Ruled

(http://www.analyticsvidhya.com/blog/2016/03/winning-solutions-dyd-competitionxgboost-ruled/)
MANISH SARASWAT , MARCH 18, 2016

Home (http://www.analyticsvidhya.com/)