
Dimension Reduction Techniques

Introduction
• The number of input variables or features for a dataset
is referred to as its dimensionality.
• Dimensionality reduction refers to techniques that
reduce the number of input variables in a dataset.
• More input features often make a predictive modeling
task more challenging, a problem more generally
referred to as the curse of dimensionality.
• High-dimensionality statistics and dimensionality
reduction techniques are often used for data
visualization. Nevertheless, these techniques can also be
used in applied machine learning to simplify a
classification or regression dataset in order to better fit
a predictive model.
What is Dimensionality Reduction?
In machine learning, the final classification is
often based on a large number of factors. These
factors are known as features or variables.
The higher the number of features, the harder it
gets to visualize the training set and then work
with it. Moreover, many of these features are
often correlated with each other, and hence
redundant. This is where dimensionality
reduction algorithms come into play.
Motivation
• When we deal with real problems and real
data, we often deal with high-dimensional
data whose number of features can run into
the millions.
• In its original high-dimensional form the
data represents itself fully; nevertheless,
we sometimes need to reduce its
dimensionality.
• Often the reduction is needed for
visualization, although that is not always the case.
Dimensionality Reduction Methods
The various methods used for dimensionality
reduction include:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
• Factor Analysis (FA)
Principal Component Analysis
What Is Principal Component Analysis (PCA)?
Principal component analysis (PCA) is a
dimensionality reduction technique that enables
you to identify correlations and patterns in a data
set so that it can be transformed into a data set of
significantly lower dimension with minimal loss
of important information.
Concept
• The main idea behind PCA is to figure out patterns
and correlations among the various features in the data
set. On finding strong correlations between
different variables, a decision is made to
reduce the dimensions of the data in such a way
that the significant information is still retained.
• Such a process is essential in solving complex
data-driven problems that involve high-
dimensional data sets. PCA is achieved via a
series of steps; let's discuss the whole end-to-end
process.
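To make this concrete, here is a minimal sketch of what "finding patterns and correlations" means mechanically: PCA eigen-decomposes the correlation (or covariance) matrix of the standardized data and keeps the directions with the largest variance. The example below uses the built-in USArrests data purely for illustration; the full R workflow is covered in the following sections.
# Illustration only: PCA "by hand" via the correlation matrix
X <- scale(USArrests)                # centre and scale the numeric variables
e <- eigen(cor(USArrests))           # eigenvectors = directions of maximal variance
round(e$values/sum(e$values), 3)     # proportion of variance captured by each direction
scores <- X %*% e$vectors[, 1:2]     # keep only the first two principal components
head(scores)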
Why use Principal Components Analysis?
The main aim of principal components analysis in R is to
reveal hidden structure in a data set. In doing so, we may
be able to do the following things:
• Identify how different variables work together to create
the dynamics of the system.
• Reduce the dimensionality of the data.
• Decrease redundancy in the data.
• Filter some of the noise in the data.
• Compress the data.
• Prepare the data for further analysis using other
techniques.
Functions to Perform Principal Analysis in R
• prcomp() (stats)
• princomp() (stats)
• PCA() (FactoMineR)
• dudi.pca() (ade4)
• acp() (amap)
Methods for Principal Component Analysis in R
There are two methods for principal component analysis in R:
1. Spectral Decomposition
It examines the covariances/correlations between variables.
2. Singular Value Decomposition
It examines the covariances/correlations between individuals.
The function princomp() uses the spectral decomposition approach,
while the functions prcomp() and PCA() use singular value decomposition.
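As a quick illustrative check (on the built-in USArrests data), the two routes give essentially the same components; the small differences in the standard deviations come only from the divisor used (n versus n-1) and the arbitrary sign of each component.
# Spectral decomposition of the correlation matrix
sp <- princomp(USArrests, cor = TRUE)
# Singular value decomposition of the centred and scaled data
sv <- prcomp(USArrests, scale. = TRUE)
round(sp$sdev, 3)   # component standard deviations, spectral route
round(sv$sdev, 3)   # component standard deviations, SVD route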
prcomp(x, scale = FALSE)
Arguments
x: A numeric matrix or data frame.
scale: A logical value indicating whether the variables
should be scaled to have unit variance before the
analysis takes place.
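A minimal illustrative call, using the built-in USArrests data in place of your own data frame:
pca <- prcomp(USArrests, scale. = TRUE)  # scale the variables to unit variance first
summary(pca)                             # standard deviation and proportion of variance per PC
head(pca$x)                              # coordinates of the individuals on the PCs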
princomp(x, cor = FALSE, scores = TRUE)
Arguments
x: A numeric matrix or data frame.
cor: A logical value. If TRUE, the analysis uses the
correlation matrix instead of the covariance matrix,
i.e. the data are effectively centred and scaled before the analysis.
scores: A logical value. If TRUE, the
coordinates on each principal component are
calculated.
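For example, on the built-in USArrests data, setting cor = TRUE makes the result comparable to a scaled prcomp():
pc <- princomp(USArrests, cor = TRUE)  # work with the correlation matrix
summary(pc)                            # variance explained by each component
head(pc$scores)                        # individual coordinates (scores = TRUE by default)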
acp(x, center = TRUE, reduce = TRUE, wI = rep(1, nrow(x)), wV = rep(1, ncol(x)))
Arguments
x: A matrix or data frame.
center: A logical value indicating whether we centre the data.
reduce: A logical value indicating whether we "reduce" the data,
i.e. divide each column by its standard deviation.
wI, wV: Weight vectors for the individuals / the variables.

PCA(X, scale.unit = TRUE, ncp = 5, ind.sup = NULL, quanti.sup = NULL, quali.sup =
NULL, row.w = NULL, col.w = NULL, graph = TRUE, axes = c(1,2))
Arguments
X: a data frame with n rows (individuals) and p columns (numeric variables)
ncp: number of dimensions kept in the results (by default 5)
scale.unit: a boolean; if TRUE (the default), the data are scaled to unit variance
ind.sup: a vector indicating the indexes of the supplementary individuals
quanti.sup: a vector indicating the indexes of the quantitative supplementary variables
quali.sup: a vector indicating the indexes of the categorical supplementary variables
row.w: an optional vector of row weights (by default, a vector of 1s for uniform row
weights); the weights are given only for the active individuals
col.w: an optional vector of column weights (by default, uniform column weights); the
weights are given only for the active variables
graph: a boolean; if TRUE a graph is displayed
axes: a length-2 vector specifying the components to plot
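A minimal illustrative call on the built-in USArrests data; graph = FALSE suppresses the default plots, and the eigenvalues can then be read from the returned object:
library(FactoMineR)
res <- PCA(USArrests, scale.unit = TRUE, ncp = 4, graph = FALSE)
res$eig        # eigenvalues, percentage of variance and cumulative percentage
res$var$coord  # coordinates of the variables on the components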
dudi.pca(df, row.w = rep(1, nrow(df))/nrow(df), col.w = rep(1, ncol(df)), center = TRUE,
scale = TRUE, scannf = TRUE, nf = 2)
Arguments:
df: a data frame with n rows (individuals) and p columns (numeric variables)
row.w: an optional vector of row weights (by default, uniform row weights)
col.w: an optional vector of column weights (by default, unit column weights)
center: a logical or numeric value; if TRUE, columns are centred by their means, if FALSE
no centring is done, and if a numeric vector (whose length must equal the number of
columns of df) it gives the values used for centring
scale: a logical value indicating whether the column vectors should be normed for the
row.w weighting
scannf: a logical value indicating whether the screeplot should be displayed so the
number of axes can be chosen
nf: if scannf is FALSE, an integer indicating the number of axes kept
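A minimal illustrative call on the built-in USArrests data; setting scannf = FALSE skips the interactive screeplot prompt and nf fixes the number of axes kept:
library(ade4)
res <- dudi.pca(USArrests, center = TRUE, scale = TRUE, scannf = FALSE, nf = 2)
res$eig       # eigenvalues of the analysis
head(res$li)  # row (individual) coordinates on the kept axes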


Example
Data=mtcars
dim(Data)
head(Data)
str(Data)
New.Data = Data[, c(-8, -9)]  # drop the binary columns vs and am (columns 8 and 9)
head(New.Data)

#Using function PCA from package FactoMineR
install.packages("FactoMineR")
library(FactoMineR)
pca <- PCA(New.Data, scale.unit = TRUE)
names(pca)
summary(pca)
library(ggbiplot)
ggbiplot(pca)
#Using function prcomp from package stats
pca1 <- prcomp(New.Data, scale. = TRUE)
names(pca1)
summary(pca1)
ggbiplot(pca1)

#Using function princomp from package stats
pca2 = princomp(New.Data, cor = TRUE)
summary(pca2)
ggbiplot(pca2)
#using function dudi.pca from package "ade4"
library(ade4)
pca3 = dudi.pca(New.Data, scannf = FALSE, nf = 2)  # scannf = FALSE skips the interactive screeplot prompt
names(pca3)
summary(pca3)

#using function acp from package "amap"
library(amap)
pca4 = acp(New.Data, center = TRUE, reduce = TRUE)
names(pca4)
screeplot(pca4)
PCA in R

Step 1: Importing libraries and reading the dataset "Pizza" (from the
open data platform data.world)
library(dplyr)
library(data.table)
library(datasets)
library(ggplot2)
#Import dataset "Pizza" (assumed to be already loaded into the session, e.g. via read.csv)
Data = Pizza
Step 2: Making sense of the data
dim(Data)
head(Data)
str(Data)
Step 3: Getting the principal components
#removing column of brand
New.Data <- Data[, -1]
#PCA
pca <- prcomp(New.Data, scale. = TRUE)
summary(pca)
names(pca)

The important thing to know about prcomp() is that it returns three things:
• x: stores the coordinates of the data on the principal components, which we
can use to plot graphs and understand the relationships between the PCs.
• sdev: the standard deviation of each PC, which tells us how much of the
variation in the data each PC captures.
• rotation: the loadings, which tell us which variables have the largest effect
(i.e. the largest coordinates, in absolute terms) on each PC.
Step 4: Using x
Even though our data has more than two dimensions,
we can plot it using x. Usually the first few PCs
capture most of the variance, so PC1 and PC2 are
plotted below to understand the data.
plot(pca$x[,1], pca$x[,2])
NOTE: This plot clearly shows how the first
two PCs divide the data into four clusters (the
A, B, C and D pizza brands) depending on the
characteristics that define them.
Step 5: Using sdev
Here we square sdev and calculate the percentage of
variation each PC accounts for.
pca_var <- pca$sdev^2
pca_var_perc <- round(pca_var/sum(pca_var)*100, 1)
barplot(pca_var_perc, main = "Variation Plot", xlab = "PCs", ylab =
"Percentage Variance", ylim = c(0, 100))
#OR
screeplot(pca, type = "l")
abline(1, 0, col = "red")  # horizontal reference line at variance = 1
NOTE: This barplot tells us that almost 60% of the
variation in the data is captured by PC1, with PC2 and PC3
accounting for most of the remainder and very little captured by the rest of the PCs.
Step 6: Using rotation
This part explains which features matter the most in separating
the pizza brands from each other; rotation assigns weights
(technically called loadings) to the features, and the array of
loadings for a PC is called an eigenvector.
PC1 <- pca$rotation[,1]
PC1_scores <- abs(PC1)
PC1_scores_ordered <- sort(PC1_scores, decreasing = TRUE)
names(PC1_scores_ordered)
Output:
[1] "ash" "fat" "sodium" "carb" "prot" "cal" "mois" "id"
We see that the variable 'ash' is the most important feature in
differentiating between the brands, with 'fat' next, and so on.
Step 7: Differentiating between brands using
the two most important features
ggplot(Data, aes(x = ash, y = fat, color = brand)) +
geom_point() + labs(title = "Pizza brands by two variables")
This plot clearly shows that, instead of the 8
columns given to us in the dataset, only two
variables were enough to see that we had different
types of pizzas.
