
Dimension Reduction Techniques

Introduction
• The number of input variables or features for a dataset
is referred to as its dimensionality.
• Dimensionality reduction refers to techniques that
reduce the number of input variables in a dataset.
• More input features often make a predictive modeling
task more challenging, a problem more generally
referred to as the curse of dimensionality.
• High-dimensionality statistics and dimensionality
reduction techniques are often used for data
visualization. Nevertheless, these techniques can also be
used in applied machine learning to simplify a
classification or regression dataset in order to better fit
a predictive model.
What is Dimensionality Reduction?
In machine learning, the final classification is
often based on a large number of factors. These
factors are known as features or variables.
The higher the number of features, the harder it
gets to visualize the training set and then work
with it. Moreover, many of these features are
often correlated with each other, and hence
redundant. This is where dimensionality
reduction algorithms come into play.
Motivation
• When we deal with real problems and real
data, we often deal with high-dimensional
data whose number of features can run into
the millions.
• In its original high-dimensional form the
data represents itself fully; nevertheless,
we sometimes need to reduce its
dimensionality.
• Often the reduction is needed for
visualization, although that is not always the case.
Dimensionality Reduction Methods
The various methods used for dimensionality
reduction include:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
• Factor Analysis (FA)
Principal Component Analysis
What Is Principal Component Analysis (PCA)?
Principal component analysis (PCA) is a
dimensionality reduction technique that enables
you to identify correlations and patterns in a data
set so that it can be transformed into a data set of
significantly lower dimension with minimal loss
of important information.
Concept
• The main idea behind PCA is to figure out patterns
and correlations among the various features in the data
set. On finding strong correlations between
different variables, a decision is made to
reduce the dimensions of the data in such a way
that the significant information is still retained.
• Such a process is essential in solving complex
data-driven problems that involve high-
dimensional data sets. PCA is achieved via a
series of steps; let's discuss the whole end-to-end
process.
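To make this concrete, here is a minimal sketch of what "finding patterns and correlations" means mechanically: PCA eigen-decomposes the correlation (or covariance) matrix of the standardized data and keeps the directions with the largest variance. The example below uses the built-in USArrests data purely for illustration; the full R workflow is covered in the following sections.
# Illustration only: PCA "by hand" via the correlation matrix
X <- scale(USArrests)                # centre and scale the numeric variables
e <- eigen(cor(USArrests))           # eigenvectors = directions of maximal variance
round(e$values/sum(e$values), 3)     # proportion of variance captured by each direction
scores <- X %*% e$vectors[, 1:2]     # keep only the first two principal components
head(scores)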
Why use Principal Components Analysis?
The main aim of principal components analysis in R is to
reveal hidden structure in a data set. In doing so, we may
be able to do the following things:
• Identify how different variables work together to create
the dynamics of the system.
• Reduce the dimensionality of the data.
• Decrease redundancy in the data.
• Filter some of the noise in the data.
• Compress the data.
• Prepare the data for further analysis using other
techniques.
Functions to Perform Principal Analysis in R
• prcomp() (stats)
• princomp() (stats)
• PCA() (FactoMineR)
• dudi.pca() (ade4)
• acp() (amap)
Methods for Principal Component Analysis in R
There are two methods for principal component analysis in R:
1. Spectral Decomposition
It examines the covariances/correlations between variables.
2. Singular Value Decomposition
It examines the covariances/correlations between individuals.
The function princomp() uses the spectral decomposition approach,
while the functions prcomp() and PCA() use singular value decomposition.
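As a quick illustrative check (on the built-in USArrests data), the two routes give essentially the same components; the small differences in the standard deviations come only from the divisor used (n versus n-1) and the arbitrary sign of each component.
# Spectral decomposition of the correlation matrix
sp <- princomp(USArrests, cor = TRUE)
# Singular value decomposition of the centred and scaled data
sv <- prcomp(USArrests, scale. = TRUE)
round(sp$sdev, 3)   # component standard deviations, spectral route
round(sv$sdev, 3)   # component standard deviations, SVD route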
prcomp(x, scale = FALSE)
Arguments
x: A numeric matrix or data frame.
scale: A logical value indicating whether the variables
should be scaled to have unit variance before the
analysis takes place.
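A minimal illustrative call, using the built-in USArrests data in place of your own data frame:
pca <- prcomp(USArrests, scale. = TRUE)  # scale the variables to unit variance first
summary(pca)                             # standard deviation and proportion of variance per PC
head(pca$x)                              # coordinates of the individuals on the PCs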
princomp(x, cor = FALSE, scores = TRUE)
Arguments
x: A numeric matrix or data frame.
cor: A logical value. If TRUE, the analysis uses the
correlation matrix instead of the covariance matrix,
i.e. the data are effectively centred and scaled before the analysis.
scores: A logical value. If TRUE, the
coordinates on each principal component are
calculated.
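For example, on the built-in USArrests data, setting cor = TRUE makes the result comparable to a scaled prcomp():
pc <- princomp(USArrests, cor = TRUE)  # work with the correlation matrix
summary(pc)                            # variance explained by each component
head(pc$scores)                        # individual coordinates (scores = TRUE by default)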
acp(x, center = TRUE, reduce = TRUE, wI = rep(1, nrow(x)), wV = rep(1, ncol(x)))
Arguments
x: A matrix or data frame.
center: A logical value indicating whether we centre the data.
reduce: A logical value indicating whether we "reduce" the data,
i.e. divide each column by its standard deviation.
wI, wV: Weight vectors for the individuals / the variables.

PCA(X, scale.unit = TRUE, ncp = 5, ind.sup = NULL, quanti.sup = NULL, quali.sup =
NULL, row.w = NULL, col.w = NULL, graph = TRUE, axes = c(1,2))
Arguments
X: a data frame with n rows (individuals) and p columns (numeric variables)
ncp: number of dimensions kept in the results (by default 5)
scale.unit: a boolean; if TRUE (the default), the data are scaled to unit variance
ind.sup: a vector indicating the indexes of the supplementary individuals
quanti.sup: a vector indicating the indexes of the quantitative supplementary variables
quali.sup: a vector indicating the indexes of the categorical supplementary variables
row.w: an optional vector of row weights (by default, a vector of 1s for uniform row
weights); the weights are given only for the active individuals
col.w: an optional vector of column weights (by default, uniform column weights); the
weights are given only for the active variables
graph: a boolean; if TRUE a graph is displayed
axes: a length-2 vector specifying the components to plot
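A minimal illustrative call on the built-in USArrests data; graph = FALSE suppresses the default plots, and the eigenvalues can then be read from the returned object:
library(FactoMineR)
res <- PCA(USArrests, scale.unit = TRUE, ncp = 4, graph = FALSE)
res$eig        # eigenvalues, percentage of variance and cumulative percentage
res$var$coord  # coordinates of the variables on the components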
dudi.pca(df, row.w = rep(1, nrow(df))/nrow(df), col.w = rep(1, ncol(df)), center = TRUE,
scale = TRUE, scannf = TRUE, nf = 2)
Arguments:
df: a data frame with n rows (individuals) and p columns (numeric variables)
row.w: an optional vector of row weights (by default, uniform row weights)
col.w: an optional vector of column weights (by default, unit column weights)
center: a logical or numeric value; if TRUE, columns are centred by their means, if FALSE
no centring is done, and if a numeric vector (whose length must equal the number of
columns of df) it gives the values used for centring
scale: a logical value indicating whether the column vectors should be normed for the
row.w weighting
scannf: a logical value indicating whether the screeplot should be displayed so the
number of axes can be chosen
nf: if scannf is FALSE, an integer indicating the number of axes kept
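A minimal illustrative call on the built-in USArrests data; setting scannf = FALSE skips the interactive screeplot prompt and nf fixes the number of axes kept:
library(ade4)
res <- dudi.pca(USArrests, center = TRUE, scale = TRUE, scannf = FALSE, nf = 2)
res$eig       # eigenvalues of the analysis
head(res$li)  # row (individual) coordinates on the kept axes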


Example
Data=mtcars
dim(Data)
head(Data)
str(Data)
New.Data = Data[, c(-8, -9)]  # drop the binary columns vs and am (columns 8 and 9)
head(New.Data)

#Using function PCA from package FactoMineR
install.packages("FactoMineR")
library(FactoMineR)
pca <- PCA(New.Data, scale.unit = TRUE)
names(pca)
summary(pca)
library(ggbiplot)
ggbiplot(pca)
#Using function prcomp from package stats
pca1 <- prcomp(New.Data, scale. = TRUE)
names(pca1)
summary(pca1)
ggbiplot(pca1)

#Using function princomp from package stats
pca2 = princomp(New.Data, cor = TRUE)
summary(pca2)
ggbiplot(pca2)
#using function dudi.pca from package "ade4"
library(ade4)
pca3 = dudi.pca(New.Data, scannf = FALSE, nf = 2)  # scannf = FALSE skips the interactive screeplot prompt
names(pca3)
summary(pca3)

#using function acp from package "amap"
library(amap)
pca4 = acp(New.Data, center = TRUE, reduce = TRUE)
names(pca4)
screeplot(pca4)
PCA in R

Step 1: Importing libraries and reading the dataset "Pizza" (from the
open data platform data.world)
library(dplyr)
library(data.table)
library(datasets)
library(ggplot2)
#Import dataset "Pizza" (assumed to be already loaded into the session, e.g. via read.csv)
Data = Pizza
Step 2: Making sense of the data
dim(Data)
head(Data)
str(Data)
Step 3: Getting the principal components
#removing column of brand
New.Data <- Data[, -1]
#PCA
pca <- prcomp(New.Data, scale. = TRUE)
summary(pca)
names(pca)

The important thing to know about prcomp() is that it returns three things:
• x: stores the coordinates of the data on the principal components, which we
can use to plot graphs and understand the relationships between the PCs.
• sdev: the standard deviation of each PC, which tells us how much of the
variation in the data each PC captures.
• rotation: the loadings, which tell us which variables have the largest effect
(i.e. the largest coordinates, in absolute terms) on each PC.
Step 4: Using x
Even though our data has more than two dimensions,
we can plot it using x. Usually the first few PCs
capture most of the variance, so PC1 and PC2 are
plotted below to understand the data.
plot(pca$x[,1], pca$x[,2])
NOTE: This plot clearly shows how the first
two PCs divide the data into four clusters (the
A, B, C and D pizza brands) depending on the
characteristics that define them.
Step 5: Using sdev
Here we square sdev and calculate the percentage of
variation each PC accounts for.
pca_var <- pca$sdev^2
pca_var_perc <- round(pca_var/sum(pca_var)*100, 1)
barplot(pca_var_perc, main = "Variation Plot", xlab = "PCs", ylab =
"Percentage Variance", ylim = c(0, 100))
#OR
screeplot(pca, type = "l")
abline(1, 0, col = "red")  # horizontal reference line at variance = 1
NOTE: This barplot tells us that almost 60% of the
variation in the data is captured by PC1, with PC2 and PC3
accounting for most of the remainder and very little captured by the rest of the PCs.
Step 6: Using rotation
This part explains which features matter the most in separating
the pizza brands from each other; rotation assigns weights
(technically called loadings) to the features, and the array of
loadings for a PC is called an eigenvector.
PC1 <- pca$rotation[,1]
PC1_scores <- abs(PC1)
PC1_scores_ordered <- sort(PC1_scores, decreasing = TRUE)
names(PC1_scores_ordered)
Output:
[1] "ash" "fat" "sodium" "carb" "prot" "cal" "mois" "id"
We see that the variable 'ash' is the most important feature in
differentiating between the brands, with 'fat' next, and so on.
Step 7: Differentiating between brands using
the two most important features
ggplot(Data, aes(x = ash, y = fat, color = brand)) +
geom_point() + labs(title = "Pizza brands by two variables")
This plot clearly shows that, instead of the 8
columns given to us in the dataset, only two
variables were enough to see that we had different
types of pizzas.
