
2017/4/9 ade4 and factoextra : Principal Component Analysis ­ R software and data mining ­ Easy Guides ­ Wiki ­ STHDA


Required packages
Prepare the data
Principal component analysis
Variances of the principal components
Extract the eigenvalues
Make a scree plot using ade4 base graphics
Make the scree plot using the package factoextra
Graph of variables : the circle of correlations
Coordinates of variables on the principal components
Graph of variables using ade4 base graph
Graph of variables using factoextra
Cos2 : quality of the representation for variables on the factor map
Contributions of the variables to the principal components
Graph of individuals
Coordinates of individuals on the principal components
Cos2 : quality of the representation for individuals on the principal components
Contribution of the individuals to the principal components
Graph of individuals using ade4 base graph
Biplot of individuals and variables using ade4
Graph of individuals using factoextra
Change the color of individuals by groups
Principal component analysis using supplementary individuals and variables
Supplementary individuals
Supplementary quantitative variables
Infos
http://www.sthda.com/english/wiki/ade4­and­factoextra­principal­component­analysis­r­software­and­data­mining 1/42

This R tutorial describes how to perform a Principal Component Analysis (PCA) using R and the ade4
package.

Required packages
The package ade4 can be installed and loaded as follows :

install.packages("ade4")
library("ade4")

The package factoextra is used to visualize the results of the principal component analysis.
factoextra can be installed as follows :

# install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Load it :

library("factoextra")

Prepare the data

We’ll use the data set decathlon2 from the package factoextra :

library("factoextra")
data(decathlon2)
head(decathlon2[, 1:6])

          X100m Long.jump Shot.put High.jump X400m X110m.hurdle
SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69
CLAY      10.76      7.40    14.26      1.86 49.37        14.05
BERNARD   11.02      7.23    14.25      1.92 48.93        14.99
YURKOV    11.34      7.09    15.19      2.10 50.42        15.31
ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17
McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38

This data set is a subset of the decathlon data in the FactoMineR package.


As illustrated below, the data used here describe athletes’ performance during two sporting events (Decastar and
OlympicG). The data set contains 27 individuals (athletes) described by 13 variables :


 Only some of these individuals and variables will be used to perform the principal component analysis
(PCA).

The coordinates of the remaining individuals and variables on the factor map will be predicted after the
PCA.

In PCA terminology, our data contains :

Active individuals (in blue, rows 1:23) : Individuals that are used during the principal component
analysis.
Supplementary individuals (in green, rows 24:27) : The coordinates of these individuals will be predicted
using the PCA information and parameters obtained with the active individuals/variables.
Active variables (in pink, columns 1:10) : Variables that are used for the principal component analysis.
Supplementary variables : As with supplementary individuals, the coordinates of these variables will also be
predicted.
Supplementary continuous variables : Columns 11 and 12, corresponding respectively to the rank and the
points of athletes.
Supplementary qualitative variable : Column 13, corresponding to the two athletic meetings (2004
Olympic Games or 2004 Decastar). This factor variable will be used to color individuals by groups.

Extract only active individuals and variables for principal component analysis:

decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6])

          X100m Long.jump Shot.put High.jump X400m X110m.hurdle
SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69
CLAY      10.76      7.40    14.26      1.86 49.37        14.05
BERNARD   11.02      7.23    14.25      1.92 48.93        14.99
YURKOV    11.34      7.09    15.19      2.10 50.42        15.31
ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17
McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38

Principal component analysis


The function dudi.pca() [in ade4 package] can be used. A simplified format is :

dudi.pca(df, center = TRUE, scale = TRUE,
         scannf = TRUE, nf = 2)

df : a data frame. Rows are individuals and columns are numeric variables.
center : a logical value specifying whether the variables should be shifted to be zero centered.
scale : a logical value. If TRUE, the data are scaled to unit variance before the analysis. This
standardization to the same scale prevents some variables from becoming dominant just because of their large
measurement units.
scannf : a logical value specifying whether the scree plot should be displayed.
nf : number of dimensions kept in the final results.

In the R code below, the PCA is performed only on the active individuals/variables :

library("ade4")
res.pca <- dudi.pca(decathlon2.active, scannf = FALSE, nf = 5)
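
To make the roles of center and scale concrete, the sketch below reproduces the core of a normed PCA with base R only. It is an illustration, not the ade4 implementation: the built-in USArrests data stand in for decathlon2.active, the object names (Z, li, co) are chosen for this example, and it assumes uniform row weights 1/n, which is my understanding of the dudi.pca() default.

```r
# Illustrative data set (an assumption; the tutorial uses decathlon2.active)
X <- as.matrix(USArrests)
n <- nrow(X)

# Center each column, then scale by the standard deviation computed
# with a 1/n divisor (uniform row weights, assumed here)
centred <- sweep(X, 2, colMeans(X))
sd.n <- sqrt(colSums(centred^2) / n)
Z <- sweep(centred, 2, sd.n, "/")

# Eigen-decomposition of the weighted cross-product matrix
# (for standardized data this is the correlation matrix)
e <- eigen(crossprod(Z) / n, symmetric = TRUE)
eig <- e$values

# Row coordinates (individuals) and column coordinates (variables)
li <- Z %*% e$vectors
co <- sweep(e$vectors, 2, sqrt(eig), "*")

sum(eig)             # total inertia = number of variables
colSums(li^2) / n    # variance of each axis = its eigenvalue
```

With standardized data the eigenvalues sum to the number of variables, which is consistent with the total inertia of 10 reported for the 10 active variables.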

Variances of the principal components

Extract the eigenvalues

Eigenvalues measure the amount of variation retained by a principal component :

summary(res.pca)

Class: pca dudi
Call: dudi.pca(df = decathlon2.active, scannf = FALSE, nf = 5)
Total inertia: 10
Eigenvalues:
   Ax1    Ax2    Ax3    Ax4    Ax5
4.1242 1.8385 1.2391 0.8194 0.7016
Projected inertia (%):
   Ax1    Ax2    Ax3    Ax4    Ax5
41.242 18.385 12.391  8.194  7.016
Cumulative projected inertia (%):
  Ax1 Ax1:2 Ax1:3 Ax1:4 Ax1:5
41.24 59.63 72.02 80.21 87.23
(Only 5 dimensions (out of 10) are shown)

You can also use the package factoextra to extract the eigenvalues :

library("factoextra")
eig.val <- get_eigenvalue(res.pca)
head(eig.val)

      eigenvalue variance.percent cumulative.variance.percent
Dim 1  4.1242133        41.242133                    41.24213
Dim 2  1.8385309        18.385309                    59.62744
Dim 3  1.2391403        12.391403                    72.01885
Dim 4  0.8194402         8.194402                    80.21325
Dim 5  0.7015528         7.015528                    87.22878
Dim 6  0.4228828         4.228828                    91.45760
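
The two percentage columns can be recomputed by hand from the eigenvalues, which makes their meaning explicit. The sketch below copies the six eigenvalues printed above and uses the total inertia of 10 reported by summary(res.pca); the object names are chosen for this illustration.

```r
# Eigenvalues copied from the output above (first six of ten)
eig <- c(4.1242133, 1.8385309, 1.2391403, 0.8194402, 0.7015528, 0.4228828)
total.inertia <- 10   # total inertia reported by summary(res.pca)

# Share of the total inertia carried by each component, and its running sum
variance.percent <- 100 * eig / total.inertia
cumulative.percent <- cumsum(variance.percent)

round(data.frame(eigenvalue = eig, variance.percent, cumulative.percent), 3)
```

Dividing by the total inertia rather than by the sum of the six listed eigenvalues matters: the remaining four components still carry part of the variance.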

Make a scree plot using ade4 base graphics

The function screeplot() can be used to represent the amount of inertia (variance) associated with each principal
component (PC).
A simplified format is :

screeplot(x, npcs = length(x$eig), type = c("barplot", "lines"))

x : an object of class dudi
npcs : the number of components to be plotted
type : the type of plot

Example of usage :

screeplot(res.pca, main ="Screeplot - Eigenvalues")


You can also customize the plot using the standard barplot() function. In the R code below, we’ll draw the
percentage of variances retained by each component :

# barplot() returns the bar midpoints, used below to position the line
bp <- barplot(eig.val[, 2], names.arg = 1:nrow(eig.val),
              main = "Variances",
              xlab = "Principal Components",
              ylab = "Percentage of variances",
              col = "steelblue")
# Add connected line segments to the plot
lines(x = bp, y = eig.val[, 2],
      type = "b", pch = 19, col = "red")


 ~60% of the information (variances) contained in the data are retained by the first two principal
components.

Make the scree plot using the package factoextra

fviz_screeplot(res.pca, ncp=10)


Graph of variables : the circle of correlations

Coordinates of variables on the principal components

The coordinates of the variables on the factor map are :

# Column coordinates
head(res.pca$co)

                  Comp1       Comp2      Comp3       Comp4      Comp5
X100m         0.8506257 -0.17939806 -0.3015564  0.03357320  0.1944440
Long.jump    -0.7941806  0.28085695  0.1905465 -0.11538956 -0.2331567
Shot.put     -0.7339127  0.08540412 -0.5175978  0.12846837  0.2488129
High.jump    -0.6100840 -0.46521415 -0.3300852  0.14455012 -0.4027002
X400m         0.7016034  0.29017826 -0.2835329  0.43082552 -0.1039085
X110m.hurdle  0.7641252 -0.02474081 -0.4488873 -0.01689589 -0.2242200

Graph of variables using ade4 base graph

The function s.corcircle() can be used to plot the correlation circle. A simplified format is :

s.corcircle(dfxy, label = row.names(dfxy), grid = TRUE,
            box = FALSE)

dfxy : a data frame specifying the coordinates of variables
label : a vector of strings specifying point labels
grid : a logical value specifying whether a grid in the background of the plot should be drawn
box : a logical value indicating whether a box should be drawn

# Graph of variables
s.corcircle(res.pca$co)

Graph of variables using factoextra

The function fviz_pca_var() is used to visualize variables :

# Default plot
fviz_pca_var(res.pca)

# Change color and theme
fviz_pca_var(res.pca, col.var = "steelblue") +
  theme_minimal()


Read more about the function fviz_pca_var() : Graph of variables ‐ Principal Component Analysis

 How to calculate the cos2 and the contribution of variables?

The cos2 and the contributions of variables (columns) / individuals (rows) are calculated using the function
inertia.dudi() as follows :

inertia <- inertia.dudi(res.pca, row.inertia = TRUE,
                        col.inertia = TRUE)

Note that the contributions and the cos2 are printed in 1/10 000. The sign is the sign of the coordinates.

Cos2 : quality of the representation for variables on the factor map

The squared coordinates of variables are called cos2.

A high cos2 indicates a good representation of the variable on the principal component. In this case the
variable is positioned close to the circumference of the correlation circle.
A low cos2 indicates that the variable is poorly represented by the PCs. In this case the variable is
close to the center of the circle.
The cos2 of the variables are :

# relative contributions of columns
var.cos2 <- abs(inertia$col.rel/10000)
head(var.cos2)

              Comp1  Comp2  Comp3  Comp4  Comp5 con.tra
X100m        0.7236 0.0322 0.0909 0.0011 0.0378     0.1
Long.jump    0.6307 0.0789 0.0363 0.0133 0.0544     0.1
Shot.put     0.5386 0.0073 0.2679 0.0165 0.0619     0.1
High.jump    0.3722 0.2164 0.1090 0.0209 0.1622     0.1
X400m        0.4922 0.0842 0.0804 0.1856 0.0108     0.1
X110m.hurdle 0.5839 0.0006 0.2015 0.0003 0.0503     0.1

It can also be calculated as follows :

# squared coordinates
head(res.pca$co^2)

                 Comp1        Comp2      Comp3        Comp4      Comp5
X100m        0.7235641 0.0321836641 0.09093628 0.0011271597 0.03780845
Long.jump    0.6307229 0.0788806285 0.03630798 0.0133147506 0.05436203
Shot.put     0.5386279 0.0072938636 0.26790749 0.0165041211 0.06190783
High.jump    0.3722025 0.2164242070 0.10895622 0.0208947375 0.16216747
X400m        0.4922473 0.0842034209 0.08039091 0.1856106269 0.01079698
X110m.hurdle 0.5838873 0.0006121077 0.20149984 0.0002854712 0.05027463

Using the factoextra package, the color of variables can be automatically controlled by the value of their
cos2.

fviz_pca_var(res.pca, col.var = "cos2") +
  scale_color_gradient2(low = "white", mid = "blue",
                        high = "red", midpoint = 0.5) + theme_minimal()
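
The link between cos2 and the correlation circle can be checked numerically. The sketch below is a base R illustration under stated assumptions: the built-in USArrests data replace decathlon2, and the variable coordinates are rebuilt from an eigen-decomposition of the correlation matrix rather than taken from ade4.

```r
X <- as.matrix(USArrests)          # illustrative data, not decathlon2
e <- eigen(cor(X), symmetric = TRUE)

# Variable coordinates = eigenvectors scaled by sqrt(eigenvalue);
# in a standardized PCA these are the variable-component correlations
coord <- sweep(e$vectors, 2, sqrt(e$values), "*")
rownames(coord) <- colnames(X)
colnames(coord) <- paste0("Comp", seq_along(e$values))

cos2 <- coord^2
round(cos2, 3)
rowSums(cos2)            # 1 for every variable when all axes are kept
rowSums(cos2[, 1:2])     # quality on the first factor map (Comp1 x Comp2)
```

Because the cos2 of a variable sums to 1 over all axes, its squared distance to the origin on a two-axis map is at most 1: a well-represented variable lies near the circle, a poorly represented one near the center.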


Contributions of the variables to the principal components

The contributions can be printed in % as follows :

# absolute contribution of columns
var.contrib <- inertia$col.abs/100
head(var.contrib)

             Comp1 Comp2 Comp3 Comp4 Comp5
X100m        17.54  1.75  7.34  0.14  5.39
Long.jump    15.29  4.29  2.93  1.62  7.75
Shot.put     13.06  0.40 21.62  2.01  8.82
High.jump     9.02 11.77  8.79  2.55 23.12
X400m        11.94  4.58  6.49 22.65  1.54
X110m.hurdle 14.16  0.03 16.26  0.03  7.17

Note that you can also use the function get_pca_var() [from factoextra package]. It provides a list of
matrices containing all the results for the active variables (coordinates, correlation between variables and
axes, squared cosine and contributions).

var <- get_pca_var(res.pca)
names(var)

[1] "coord"   "cor"     "cos2"    "contrib"

# Contributions of variables
head(var$contrib)

                 Dim.1      Dim.2     Dim.3       Dim.4     Dim.5
X100m        17.544293  1.7505098  7.338659  0.13755240  5.389252
Long.jump    15.293168  4.2904162  2.930094  1.62485936  7.748815
Shot.put     13.060137  0.3967224 21.620432  2.01407269  8.824401
High.jump     9.024811 11.7715838  8.792888  2.54987951 23.115504
X400m        11.935544  4.5799296  6.487636 22.65090599  1.539012
X110m.hurdle 14.157544  0.0332933 16.261261  0.03483735  7.166193
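
The contribution of a variable to a component is its cos2 on that component divided by the component's eigenvalue, expressed in %. As a quick check, the sketch below recomputes the Dim.1 column from numbers printed earlier in this tutorial (squared coordinates on Comp1 and the first eigenvalue). The object names are chosen for the illustration, and the formula assumes unit column weights, which is consistent with the printed values.

```r
# cos2 (squared coordinates) of the first six variables on Comp1,
# copied from head(res.pca$co^2) above
cos2.comp1 <- c(X100m = 0.7235641, Long.jump = 0.6307229, Shot.put = 0.5386279,
                High.jump = 0.3722025, X400m = 0.4922473, X110m.hurdle = 0.5838873)
eig1 <- 4.1242133   # eigenvalue of the first component

# Contribution (in %) of each variable to the first component
contrib.comp1 <- 100 * cos2.comp1 / eig1
round(contrib.comp1, 2)
```

The six values add up to less than 100 because only the first six of the ten active variables are listed.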

 Using factoextra package, the color of variables can be automatically controlled by the value of their
contributions

fviz_pca_var(res.pca, col.var = "contrib") +
  scale_color_gradient2(low = "white", mid = "blue",
                        high = "red", midpoint = 50) + theme_minimal()

This is helpful to highlight the most important variables for the principal components.
The most important variables for a given PC can be visualized using the function fviz_pca_contrib() [factoextra
package] (factoextra >= 1.0.1 is required) :

# Contributions of variables on PC1
fviz_pca_contrib(res.pca, choice = "var", axes = 1)

# Contributions of variables on PC2
fviz_pca_contrib(res.pca, choice = "var", axes = 2)

Read more about fviz_pca_contrib() : Principal Component Analysis: How to reveal the most important variables
in your data?

Graph of individuals

Coordinates of individuals on the principal components

The coordinates of the individuals on the factor maps can be extracted as follows :

# The row coordinates
head(res.pca$li)

               Axis1      Axis2      Axis3       Axis4       Axis5
SEBRLE    -0.1955047  1.5890567 -0.6424912  0.08389652 -1.16829387
CLAY      -0.8078795  2.4748137  1.3873827  1.29838232  0.82498206
BERNARD    1.3591340  1.6480950 -0.2005584 -1.96409420 -0.08419345
YURKOV     0.8889532 -0.4426067 -2.5295843  0.71290837 -0.40782264
ZSIVOCZKY  0.1081216 -2.0688377  1.3342591 -0.10152796  0.20145217
McMULLEN  -0.1212195 -1.0139102  0.8625170  1.34164291 -1.62151286

Cos2 : quality of the representation for individuals on the principal components

# relative contributions of rows
ind.cos2 <- abs(inertia$row.rel)/10000
head(ind.cos2)

           Axis1  Axis2  Axis3  Axis4  Axis5 con.tra
SEBRLE    0.0075 0.4975 0.0813 0.0014 0.2689  0.0221
CLAY      0.0487 0.4570 0.1436 0.1258 0.0508  0.0583
BERNARD   0.1972 0.2900 0.0043 0.4118 0.0008  0.0407
YURKOV    0.0961 0.0238 0.7782 0.0618 0.0202  0.0357
ZSIVOCZKY 0.0016 0.5764 0.2398 0.0014 0.0055  0.0323
McMULLEN  0.0022 0.1522 0.1101 0.2665 0.3893  0.0294

Contribution of the individuals to the principal components

The contributions can be printed in % as follows :

# absolute contributions of rows
ind.contrib <- inertia$row.abs/100
head(ind.contrib)

          Axis1 Axis2 Axis3 Axis4 Axis5
SEBRLE     0.04  5.97  1.45  0.04  8.46
CLAY       0.69 14.48  6.75  8.94  4.22
BERNARD    1.95  6.42  0.14 20.47  0.04
YURKOV     0.83  0.46 22.45  2.70  1.03
ZSIVOCZKY  0.01 10.12  6.25  0.05  0.25
McMULLEN   0.02  2.43  2.61  9.55 16.29

It’s also possible to use the function get_pca_ind() [from factoextra package]. It provides a list
of matrices containing all the results for the active individuals (coordinates, squared cosine and
contributions).

ind <- get_pca_ind(res.pca)
names(ind)

[1] "coord"   "cos2"    "contrib"

# Contributions of individuals
head(ind$contrib)

               Dim.1      Dim.2      Dim.3       Dim.4       Dim.5
SEBRLE    0.04029447  5.9714533  1.4483919  0.03734589  8.45894063
CLAY      0.68805664 14.4839248  6.7537381  8.94458283  4.21794385
BERNARD   1.94740183  6.4234107  0.1411345 20.46819433  0.04393073
YURKOV    0.83308415  0.4632733 22.4517396  2.69663605  1.03075263
ZSIVOCZKY 0.01232413 10.1217143  6.2464325  0.05469230  0.25151025
McMULLEN  0.01549089  2.4310854  2.6102794  9.55055888 16.29493304
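
For individuals, the contribution to an axis is the weighted squared coordinate divided by the eigenvalue of that axis, in %. The sketch below recomputes the Dim.2 column from values printed earlier (coordinates on Axis2 and the second eigenvalue). The object names are chosen for the illustration, and the uniform row weight 1/23 is an assumption that is consistent with the printed values.

```r
# Coordinates of the first six individuals on Axis2,
# copied from head(res.pca$li) above
axis2 <- c(SEBRLE = 1.5890567, CLAY = 2.4748137, BERNARD = 1.6480950,
           YURKOV = -0.4426067, ZSIVOCZKY = -2.0688377, McMULLEN = -1.0139102)
n <- 23              # number of active individuals (uniform weight 1/n assumed)
eig2 <- 1.8385309    # eigenvalue of the second component

# Contribution (in %) of each individual to the second component
contrib.axis2 <- 100 * (axis2^2 / n) / eig2
round(contrib.axis2, 2)
```

An individual far from the origin along an axis therefore contributes strongly to it, whatever the sign of its coordinate.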

Use the function fviz_pca_contrib() [factoextra package] to visualize the most contributing individuals
(factoextra >= 1.0.1 is required) :

# Contributions of individuals on PC1
fviz_pca_contrib(res.pca, choice = "ind", axes = 1)

# Contributions of individuals on PC2
fviz_pca_contrib(res.pca, choice = "ind", axes = 2)

Read more about fviz_pca_contrib() : Principal Component Analysis: How to reveal the most important variables
in your data?

Graph of individuals using ade4 base graph

The function s.label() can be used. A simplified format is :

s.label(dfxy, xax = 1, yax = 2)

dfxy : a data frame with at least two coordinates
xax : a numeric value specifying the column number containing x values
yax : a numeric value specifying the column number containing y values

Factor map of individuals :

s.label(res.pca$li, xax = 1, yax = 2)


Biplot of individuals and variables using ade4

A biplot can be drawn by combining the two functions below :
s.label() to plot individuals
s.arrow() to add variables

# Plot of individuals
s.label(res.pca$li, xax = 1, yax = 2)
# Add variables
s.arrow(7*res.pca$c1, add.plot = TRUE)


It’s also possible to use the function scatter() or biplot() :

scatter(res.pca)

# Remove the scree plot (posieig = "none")
# Remove row labels (clab.row = 0)
scatter(res.pca, posieig = "none", clab.row = 0)

NULL

 Note that, to remove variable labels the argument clab.col = 0 can be used.

Graph of individuals using factoextra

The function fviz_pca_ind() is used to visualize individuals :

fviz_pca_ind(res.pca)


The color of individuals can be controlled automatically using their cos2 values (the quality of representation
of the individuals on the factor map) :

fviz_pca_ind(res.pca, col.ind="cos2") +
scale_color_gradient2(low="white", mid="blue",
high="red", midpoint=0.50)


Change the theme :

fviz_pca_ind(res.pca, col.ind="cos2") +
scale_color_gradient2(low="white", mid="blue",
high="red", midpoint=0.50) + theme_minimal()


Read more about fviz_pca_ind() : Graph of individuals ‐ principal component analysis


Make a biplot of individuals and variables :

fviz_pca_biplot(res.pca, geom = "text") +
  theme_minimal()


Read more about fviz_pca_biplot() : Biplot of individuals and variables ‐ principal component analysis

Change the color of individuals by groups

The data set decathlon2 contains a supplementary qualitative variable in column 13, corresponding to the type
of competition.
Qualitative variables can be helpful for interpreting the data and for coloring individuals by groups :

# Data for the supplementary qualitative variables
quali.sup <- as.factor(decathlon2[1:23, 13])
head(quali.sup)

[1] Decastar Decastar Decastar Decastar Decastar Decastar
Levels: Decastar OlympicG

The function s.class() can be used to visualize the classes (groups) of points :

s.class(dfxy, fac, xax = 1, yax = 2, col)

dfxy : a data frame containing the two columns for x and y axes
fac : a factor variable partitioning the individuals in classes
xax, yax : a numeric value specifying the column number containing x and y values
col : a vector of colors used to draw each class in a different color

Color individuals by groups :

s.class(res.pca$li, fac = quali.sup, xax = 1, yax = 2)

# Change the colors
s.class(res.pca$li, fac = quali.sup, col = c("blue", "red"))


# Make a biplot
# clab.row : hide the label for rows (individuals)
res <- scatter(res.pca, clab.row = 0, posieig = "none")
s.class(res.pca$li, fac = quali.sup, col = c("blue", "red"),
add.plot = TRUE)

# Customize the biplot
# - remove row labels (clab.row = 0)
# - hide the scree plot (posieig = "none")
# - remove stars (cstar = 0)
# - remove ellipse (cellipse = 0)
res <- scatter(res.pca, clab.row = 0, posieig = "none")
s.class(res.pca$li, fac = quali.sup, col = c("blue", "red"),
        add.plot = TRUE, cstar = 0, cellipse = 0)

# remove labels for classes (clabel = 0)
res <- scatter(res.pca, clab.row = 0, posieig = "none")
s.class(res.pca$li, fac = quali.sup, col = c("blue", "red"),
        add.plot = TRUE, cstar = 0, cellipse = 0, clabel = 0)


It’s also possible to use factoextra :

fviz_pca_ind(res.pca, habillage = quali.sup,
             addEllipses = TRUE, ellipse.level = 0.68) +
  theme_minimal()


Elegant biplot using factoextra and iris data :

data(iris)
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

# The variable Species (index = 5) is removed
# before PCA analysis
iris.pca <- dudi.pca(iris[,-5], scannf = FALSE, nf = 2)

Now, let’s :
make a biplot of individuals and variables
change the color of individuals by groups
change the transparency of variables by their cos2 values
show only the labels for variables

fviz_pca_biplot(iris.pca,
                habillage = iris$Species, addEllipses = TRUE,
                col.var = "red", alpha.var = "cos2",
                label = "var") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

Principal component analysis using supplementary individuals and variables

As described above, the data set decathlon2 contains supplementary continuous variables (quanti.sup,
columns 11:12), a supplementary qualitative variable (quali.sup, column 13) and supplementary
individuals (ind.sup, rows 24:27).

Supplementary variables / individuals are not used to compute the principal components. Their coordinates are
predicted using only the information provided by the principal component analysis performed on the active
variables / individuals.
The functions suprow() and supcol() [in ade4 package] are used to calculate the coordinates of supplementary
rows (individuals) and columns (variables), respectively.
The simplified formats are :

# For supplementary individuals (rows)
suprow(x, Xsup)
# For supplementary variables (columns)
supcol(x, Xsup)

Supplementary individuals

# Data for the supplementary individuals
ind.sup <- decathlon2[24:27, 1:10, drop = FALSE]
ind.sup[, 1:6]

        X100m Long.jump Shot.put High.jump X400m X110m.hurdle
KARPOV  11.02      7.30    14.77      2.04 48.37        14.09
WARNERS 11.11      7.60    14.31      1.98 48.68        14.23
Nool    10.80      7.53    14.26      1.88 48.81        14.80
Drews   10.87      7.38    13.07      1.88 48.51        14.01

Predict the coordinates of the supplementary individuals :

ind.sup.pca <- suprow(res.pca, ind.sup)
names(ind.sup.pca)

[1] "tabsup" "lisup"

# coordinates
ind.sup.coord <- ind.sup.pca$lisup
head(ind.sup.coord)

             Axis1       Axis2     Axis3      Axis4      Axis5
KARPOV  -0.7947206  0.77951227 1.6330203  1.7242283 0.75070396
WARNERS  0.3864645 -0.12159237 1.7387332 -0.7063341 0.03230011
Nool     0.5591306  1.97748871 0.4830358 -2.2784526 0.25461493
Drews    1.1092038  0.01741477 3.0488182 -1.5343468 0.32642192
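
Predicting the coordinates of supplementary individuals amounts to standardizing the new rows with the means and standard deviations of the active individuals, then projecting them onto the principal axes. The sketch below illustrates the idea with base R only. It is not the suprow() implementation: USArrests stands in for decathlon2, the split into active and supplementary rows is arbitrary, and prcomp() uses an n - 1 divisor for the standard deviation, so the numbers would differ slightly from an ade4 analysis of the same data.

```r
active <- USArrests[1:45, ]    # illustrative active individuals
sup    <- USArrests[46:50, ]   # illustrative supplementary individuals

pca <- prcomp(active, center = TRUE, scale. = TRUE)

# Standardize the supplementary rows with the ACTIVE means and
# standard deviations, never with their own
sup.std <- scale(sup, center = pca$center, scale = pca$scale)

# Project onto the principal axes
sup.coord <- sup.std %*% pca$rotation
sup.coord

# predict() performs the same computation
all.equal(sup.coord, predict(pca, newdata = sup))
```

The supplementary rows play no part in the fit: dropping or changing them leaves pca$rotation and the active coordinates untouched.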

 How to visualize supplementary individuals on the factor map?

The function fviz_add() is used :

# Plot of active individuals
p <- fviz_pca_ind(res.pca)
# Add supplementary individuals
fviz_add(p, ind.sup.coord, color = "blue")

 How to calculate the cos2 (quality of the representation) for supplementary individuals?

cos2.func <- function(x){x^2/sum(x^2)}
ind.sup.cos2 <- t(apply(ind.sup.coord, 1, cos2.func))
head(ind.sup.cos2)

             Axis1        Axis2      Axis3     Axis4        Axis5
KARPOV  0.08486144 8.164458e-02 0.35831467 0.3994579 0.0757214366
WARNERS 0.04050537 4.009646e-03 0.81989704 0.1353050 0.0002829447
Nool    0.03218782 4.026179e-01 0.02402281 0.5344967 0.0066747159
Drews   0.09473792 2.335268e-05 0.71575477 0.1812793 0.0082046453
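
Note that cos2.func() divides by the sum of squares over the retained axes only, so each row above sums to 1 across the five axes. A stricter definition divides by the squared distance of the individual to the origin in the full standardized space. The sketch below shows that variant in base R, under the same assumptions as any base R stand-in: USArrests replaces decathlon2, the active/supplementary split is arbitrary, and prcomp() is used instead of ade4.

```r
active <- USArrests[1:45, ]    # illustrative data, not decathlon2
sup    <- USArrests[46:50, ]

pca <- prcomp(active, center = TRUE, scale. = TRUE)
sup.std   <- scale(sup, center = pca$center, scale = pca$scale)
sup.coord <- sup.std %*% pca$rotation     # coordinates on all 4 axes

# Squared distance of each supplementary individual to the origin
d2 <- rowSums(sup.std^2)

# cos2 relative to the full space (row i is divided by d2[i])
cos2.full <- sup.coord^2 / d2

rowSums(cos2.full)           # 1 when every axis is kept
rowSums(cos2.full[, 1:2])    # quality on the first two axes only
```

With this definition, the cos2 summed over a subset of axes is below 1 whenever the individual is partly explained by the discarded axes.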

Supplementary quantitative variables

# Data for the supplementary quantitative variables
quanti.sup <- decathlon2[1:23, 11:12, drop = FALSE]
head(quanti.sup)

          Rank Points
SEBRLE       1   8217
CLAY         2   8122
BERNARD      4   8067
YURKOV       5   8036
ZSIVOCZKY    7   8004
McMULLEN     8   7995

Remember that rows 24:27 are supplementary individuals. We don’t want them in the current analysis,
which is why only rows 1:23 are extracted.

Predict the coordinates of the supplementary variables :

(You have to scale the supplementary variables before the analysis, as the PCA has been performed on scaled
data.)

quanti.pca <- supcol(res.pca, scale(quanti.sup))
names(quanti.pca)

[1] "tabsup" "cosup"

# coordinates
quanti.coord <- quanti.pca$cosup
head(quanti.coord)

            Comp1      Comp2      Comp3      Comp4      Comp5
Rank    0.6860587 -0.2398049  0.1793975  0.0545264 0.07220371
Points -0.9425246  0.0759751 -0.1545490 -0.1625770 0.03046248
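
In a standardized PCA, the coordinate of a supplementary quantitative variable on an axis can be read as its correlation with the scores of the individuals on that axis, up to the choice of variance divisor. The sketch below illustrates this with base R. It is not the supcol() implementation: the built-in mtcars data replace decathlon2, and the choice of active variables and of mpg as the supplementary variable is arbitrary.

```r
# Illustrative active variables and supplementary variable (assumptions)
active  <- mtcars[, c("disp", "hp", "drat", "wt", "qsec")]
sup.var <- mtcars[, "mpg", drop = FALSE]

pca <- prcomp(active, center = TRUE, scale. = TRUE)

# Coordinates of the supplementary variable = correlations between
# the variable and the component scores of the active individuals
sup.coord <- cor(sup.var, pca$x)
round(sup.coord, 3)

# Squared coordinates give the cos2 on each axis
round(sup.coord^2, 3)
```

Since every coordinate is a correlation, the supplementary variable can be drawn inside the same correlation circle as the active variables.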

Visualize supplementary variables on the factor map using factoextra :

# Plot of active variables


p <- fviz_pca_var(res.pca)
# Add supplementary active variables
fviz_add(p, quanti.coord, geom="arrow", color ="blue")


# Get the cos2 of the supplementary quantitative variables
(quanti.coord^2)[, 1:4]

           Comp1       Comp2      Comp3       Comp4
Rank   0.4706766 0.057506383 0.03218347 0.002973128
Points 0.8883526 0.005772216 0.02388540 0.026431296
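As a quick check of the arithmetic, the sketch below re-enters the coordinates printed earlier by hand (so it runs without ade4 or the decathlon2 data) and squares them. Summing the cos2 over the first two components gives the quality of representation of each supplementary variable on the Comp1-Comp2 factor map. The object name quanti.cos2 is introduced here for illustration only.

```r
# Coordinates copied from the printed output of head(quanti.coord)
quanti.coord <- matrix(
  c( 0.6860587, -0.2398049,  0.1793975,  0.0545264, 0.07220371,
    -0.9425246,  0.0759751, -0.1545490, -0.1625770, 0.03046248),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Rank", "Points"), paste0("Comp", 1:5))
)

# cos2 of a variable on a component is its squared coordinate
quanti.cos2 <- quanti.coord^2

# Quality of representation on the first factor map (Comp1 and Comp2)
rowSums(quanti.cos2[, 1:2])
```

Points is much better represented on the first two components than Rank, mostly because of its large coordinate on Comp1.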

Infos

This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. 1.0.0).
