
Feature Selection in R with the Boruta R Package

Tackle feature selection in R: explore the Boruta algorithm, a wrapper built around the Random Forest classification algorithm, and its implementation!

Mar 2018 · 17 min read

DataCamp Team
Making data science accessible to everyone


High-dimensional data, in terms of the number of features, is increasingly common in machine learning problems. To extract useful information from such high volumes of data, you have to use statistical techniques to reduce the noise or redundant data, because you often don't need every feature at your disposal to train a model. You can improve your model by feeding in only those features that are uncorrelated and non-redundant. This is where feature selection plays an important role. Not only does it help to train your model faster, it also reduces the complexity of the model, makes it easier to interpret, and improves accuracy, precision, or recall, whatever your performance metric may be.

In this tutorial, you'll tackle the following concepts:

First, you'll learn more about feature selection: you'll see when you can use it, and what
types of methods you have available to select the most important features for your
model!

Then, you'll get introduced to the Boruta algorithm. You'll see how you can use it to perform a top-down search for relevant features by comparing the importance of the original attributes with the importance achievable at random, estimated using their permuted copies, and progressively eliminating irrelevant features.

You'll also take a brief look at the dataset on which you'll be performing the feature
selection. You'll see how you can easily impute missing values with the help of the
Amelia package.

Lastly, you'll learn more about the Boruta package, which you can use to run the
algorithm.

Feature Selection

Generally, whenever you want to reduce the dimensionality of the data, you come across methods like Principal Component Analysis, Singular Value Decomposition, etc. So it's natural to ask why you need other feature selection methods at all. The thing with these techniques is that they are unsupervised ways of feature selection: take, for example, PCA, which uses the variance in the data to find the components. These techniques don't take into account the relationship between the feature values and the target class or values. Also, there are certain assumptions, such as normality, associated with such methods, which require some kind of transformation before you apply them, and these assumptions don't hold for all kinds of data.

There are three types of feature selection methods in general:

Filter Methods: filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm. Instead, the features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable. Some common filter methods are correlation metrics (Pearson, Spearman, distance), the Chi-Squared test, ANOVA, Fisher's score, etc. (a quick sketch of a filter and an embedded method follows this list).

Wrapper Methods: in wrapper methods, you try to use a subset of features to train a model. Based on the inferences that you draw from the previous model, you decide to add or remove features from the subset. Forward Selection and Backward Elimination are some examples of wrapper methods.

Embedded Methods : these are the algorithms that have their own built-in feature
selection methods. LASSO regression is one such example.
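To make the filter and embedded categories a bit more concrete, here is a minimal, illustrative sketch; my_data, some_feature, and outcome are placeholder names, not columns from this tutorial's dataset:

# Filter method: score a categorical feature against the outcome with a chi-squared test
chisq.test(table(my_data$some_feature, my_data$outcome))

# Embedded method: LASSO via glmnet selects features while fitting the model,
# shrinking the coefficients of uninformative features to exactly zero
library(glmnet)
x <- model.matrix(outcome ~ ., data = my_data)[, -1]
lasso_fit <- cv.glmnet(x, my_data$outcome, family = "binomial", alpha = 1)
coef(lasso_fit, s = "lambda.min")  # features with zero coefficients are dropped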

In this tutorial, you will use one of the wrapper methods, which is readily available in R through a package called Boruta .

The Boruta Algorithm


The Boruta algorithm is a wrapper built around the random forest classification algorithm. It
tries to capture all the important, interesting features you might have in your dataset with
respect to an outcome variable.

First, it duplicates the dataset and shuffles the values in each column. These shuffled values are called shadow features. Then, it trains a classifier, such as a Random Forest classifier, on the dataset. By doing this, you get an idea of the importance -via the Mean Decrease Accuracy or Mean Decrease Impurity- of each of the features of your data set. The higher the score, the better or more important the feature.

Then, the algorithm checks, for each of your real features, whether it has higher importance than the best of the shadow features. That is, whether the feature has a higher Z-score than the maximum Z-score of its shadow features. If it does, this is recorded in a vector. These records are called hits. Next, the algorithm continues with another iteration. After a predefined number of iterations, you end up with a table of these hits. Remember: a Z-score is the number of standard deviations a data point lies from the mean, z = (x − μ) / σ.

At every iteration, the algorithm compares the Z-scores of the shuffled copies of the features with those of the original features to see whether the latter performed better than the former. If they did, the algorithm marks the feature as important. In essence, the algorithm is trying to validate the importance of a feature by comparing it with randomly shuffled copies, which increases the robustness. This is done by simply comparing the number of times a feature did better than the shadow features using a binomial distribution.

If a feature hasn't been recorded as a hit in, say, 15 iterations, you reject it and also remove it from the original matrix. After a set number of iterations -or once all the features have been either confirmed or rejected- you stop.
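To make the shadow-feature idea concrete, here is a minimal, hand-rolled sketch of a single Boruta-style iteration. It is illustrative only, not the Boruta package's implementation, and it assumes a predictor data frame df and a factor outcome y:

library(randomForest)

shadow_iteration <- function(df, y) {
  # Shadow features: a shuffled copy of every real predictor
  shadows <- as.data.frame(lapply(df, sample))
  names(shadows) <- paste0("shadow_", names(df))

  # Fit a random forest on real + shadow features and extract
  # the Mean Decrease Accuracy importance for each column
  fit <- randomForest(x = cbind(df, shadows), y = y, importance = TRUE)
  imp <- importance(fit, type = 1)[, 1]

  # A real feature scores a "hit" if it beats the best shadow feature
  best_shadow <- max(imp[grepl("^shadow_", names(imp))])
  imp[names(df)] > best_shadow
}

Repeating such iterations, counting the hits per feature, and comparing the counts against a binomial distribution is, in essence, how Boruta decides which features to confirm or reject.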

Boruta Algorithm in R

Let's use the Boruta algorithm on one of the most commonly available datasets: the Bank Marketing data. This data represents the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether or not the client will subscribe to a term deposit.

Tip: don't forget to check out the detailed description of the different features here.

read_file <- read.csv('./bank_bank.csv', header=TRUE, sep=';', stringsAsFactors = FALSE)

str(read_file)

## 'data.frame': 4521 obs. of 17 variables:


## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : chr "unemployed" "services" "management" "management" ...
## $ marital : chr "married" "married" "single" "married" ...
## $ education: chr "primary" "secondary" "tertiary" "tertiary" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : chr "no" "yes" "yes" "yes" ...
## $ loan : chr "no" "yes" "no" "yes" ...
## $ contact : chr "cellular" "cellular" "cellular" "unknown" ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : chr "oct" "may" "apr" "jun" ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : chr "unknown" "failure" "failure" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...


Let's use the summary() function to summarize the common descriptive statistics of
different features in your dataset.

summary(read_file)

## age job marital education


## Min. :19.00 Length:4521 Length:4521 Length:4521
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :41.17
## 3rd Qu.:49.00
## Max. :87.00
## default balance housing loan
## Length:4521 Min. :-3313 Length:4521 Length:4521
## Class :character 1st Qu.: 69 Class :character Class :character
## Mode :character Median : 444 Mode :character Mode :character
## Mean : 1423
## 3rd Qu.: 1480
## Max. :71188
## contact day month duration
## Length:4521 Min. : 1.00 Length:4521 Min. : 4

## Class :character 1st Qu.: 9.00 Class :character 1st Qu.: 104
## Mode :character Median :16.00 Mode :character Median : 185
## Mean :15.92 Mean : 264
## 3rd Qu.:21.00 3rd Qu.: 329
## Max. :31.00 Max. :3025
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.00 Min. : 0.0000 Length:4521
## 1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.00 Median : 0.0000 Mode :character
## Mean : 2.794 Mean : 39.77 Mean : 0.5426
## 3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000
## Max. :50.000 Max. :871.00 Max. :25.0000
## y
## Length:4521
## Class :character
## Mode :character
##
##
##


The summary() function gives descriptive statistics for the continuous features, such as the mean, median, and quartiles. If you have any categorical features in your data set, you'll also get to see the Class and Mode of those features.

Now let's convert the categorical features into factor data type:

convert <- c(2:5, 7:9,11,16:17)


read_file[,convert] <- data.frame(apply(read_file[convert], 2, as.factor))
str(read_file)

## 'data.frame': 4521 obs. of 17 variables:


## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2
## $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...

## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...


## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...


Since the data points are shuffled in order to create shadow features and a Z-score is calculated for each of them, it's important to treat missing or blank values prior to using the Boruta package; otherwise it throws an error.

(Un)fortunately, this dataset has neither. However, for educational purposes, you'll introduce some NAs in the data.
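You can verify this with a quick base R check before moving on (a small sketch using the read_file data frame loaded above):

anyNA(read_file)           # FALSE: the raw data contains no missing values
colSums(is.na(read_file))  # per-column NA counts, all zero at this point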

Let's seed missing values in your data set using the prodNA() function. You can access this function by installing the missForest package.

Remember that you can use install.packages() to install missing packages if needed!

library(missForest)

# Generate 5% missing values at random


bank.mis <- prodNA(read_file, noNA = 0.05)


You can again call the summary() function on your new data frame to get the number of NAs introduced, but let's get a little bit more creative instead!

Let's visualize the missingness in the data using the following ggplot2 code:

library(reshape2)
library(ggplot2)
library(dplyr)
ggplot_missing <- function(x){
  x %>%
    is.na %>%                       # logical matrix: TRUE where a value is missing
    melt %>%                        # long format: Var1 = row, Var2 = column, value = missing?
    ggplot(data = .,
           aes(x = Var2,
               y = Var1)) +
    geom_raster(aes(fill = value)) +
    scale_fill_grey(name = "",
                    labels = c("Present","Missing")) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle=45, vjust=0.5)) +
    labs(x = "Variables in Dataset",
         y = "Rows / observations")
}
ggplot_missing(bank.mis)

The white lines in the plot show you visually that you have seeded missing values in every feature. But you see how much pain went into writing that ggplot_missing function? Well, the Amelia package in R, which you will be using in the next section, provides a one-line alternative for plotting a similar figure:

library(Amelia)
missmap(bank.mis)


Try it out on your own!

Imputing Missing Values with Amelia

You can now impute the missing values in several ways, such as imputing with the mean, median, or mode (for categorical features), but let's use another powerful package for imputing missing values: Amelia.

Amelia takes m bootstrap samples and applies the EMB (expectation-maximization with bootstrapping) algorithm to each sample. The m estimates of the means and variances will be different. Finally, the first set of estimates is used to impute the first set of missing values using regression, the second set of estimates is used for the second set, and so on. Multiple imputation helps to reduce bias and increase efficiency. Amelia also supports parallel imputation using multicore CPUs.

It has 3 important parameters:


https://www.datacamp.com/tutorial/feature-selection-R-boruta 8/18
11/26/22, 2:48 PM Boruta Feature Selection in R | DataCamp

m : the number of imputed datasets to create.

idvars : keep all ID variables and other variables that you don't want to impute.

noms : keep nominal variables here.

library(Amelia)
# noms lists the nominal (factor) columns seen in str(read_file)
amelia_bank <- amelia(bank.mis, m=3, parallel = "multicore",
                      noms=c('job','marital','education','default','housing',
                             'loan','contact','month','poutcome','y'))

## -- Imputation 1 --
##
## 1 2 3 4 5 6
##
## -- Imputation 2 --
##
## 1 2 3 4 5 6
##
## -- Imputation 3 --
##
## 1 2 3 4 5


To access the imputed data frames, you can use the following subsetting:

amelia_bank$imputations[[1]]


To export the imputed datasets to CSV files, use:

write.amelia(amelia_bank, file.stem = "imputed_bank_data_set")


The Boruta R Package


Now let's use the Boruta algorithm on one of the imputed datasets. You can make use of the Boruta package to do this:

library(Boruta)
set.seed(111)
boruta.bank_train <- Boruta(y~., data = amelia_bank$imputations[[1]], doTrace = 0)
print(boruta.bank_train)

## Boruta performed 99 iterations in 18.97234 mins.


## 10 attributes confirmed important: age, contact, day, duration,
## housing and 5 more;

## 3 attributes confirmed unimportant: education, job, marital;


## 3 tentative attributes left: balance, campaign, default;


Boruta makes a call on the significance of the features in a data set. Many of them have already been classified as important or unimportant, but you see that there are some that have been assigned to the tentative category.

But what does this mean?

Tentative features have an importance that is so close to their best shadow features that
Boruta is not able to make a decision with the desired confidence in the default number of
Random Forest runs.

What do you then do about this?

You could consider increasing the maxRuns parameter if tentative features are left. However, note that you can also provide values for the mtry and ntree parameters, which will be passed to the randomForest() function. The former allows you to specify the number of variables that are randomly sampled as candidates at each split, while the latter specifies the number of trees you want to grow. With these arguments specified, the Random Forest classifier can converge at a minimal value of the out-of-bag error.

Remember that the out-of-bag error is an estimate of the prediction error of classifiers that use bootstrap aggregation to subsample the data used for training. It is the mean prediction error on each training sample x, using only the trees that did not have x in their bootstrap sample.

Alternatively, you can also set the doTrace argument to 1 or 2, which allows you to get a report on the progress of the process.
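For example, a re-run aimed at resolving the tentative features could look like the sketch below. maxRuns and doTrace are genuine Boruta() arguments, while the ntree value is assumed to be forwarded to the underlying random forest implementation (check ?Boruta for the version you have installed):

set.seed(111)
boruta.bank_long <- Boruta(y ~ ., data = amelia_bank$imputations[[1]],
                           maxRuns = 200,  # more iterations to decide on tentative features
                           doTrace = 2,    # report progress while running
                           ntree = 500)    # assumed to be passed through to the forest
print(boruta.bank_long)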

The Boruta package also contains a TentativeRoughFix() function, which can be used to fill in missing decisions by a simple comparison of each tentative feature's median Z-score with the median Z-score of the most important shadow feature:

#take a call on tentative features
boruta.bank <- TentativeRoughFix(boruta.bank_train)


print(boruta.bank)

## Boruta performed 99 iterations in 18.97234 mins.


## Tentatives roughfixed over the last 99 iterations.
## 12 attributes confirmed important: age, campaign, contact, day,

## default and 7 more;


## 4 attributes confirmed unimportant: balance, education, job,
## marital;

Boruta has now done its job: it has successfully classified each feature as important or
unimportant.

You can now plot the Boruta variable importance chart by calling plot(boruta.bank). However, the x axis labels will be horizontal, which won't look very neat.

That's why you will add the feature labels to the x axis vertically, just like in the following code chunk:

plot(boruta.bank, xlab = "", xaxt = "n")

# Collect the finite importance scores for every attribute (and shadow feature)
lz <- lapply(1:ncol(boruta.bank$ImpHistory), function(i)
  boruta.bank$ImpHistory[is.finite(boruta.bank$ImpHistory[, i]), i])
names(lz) <- colnames(boruta.bank$ImpHistory)

# Order the labels by median importance and draw them vertically on the x axis
Labels <- sort(sapply(lz, median))
axis(side = 1, las = 2, labels = names(Labels),
     at = 1:ncol(boruta.bank$ImpHistory), cex.axis = 0.7)


The y axis label Importance represents the Z score of every feature across the Boruta runs.

The blue boxplots correspond to the minimal, average, and maximum Z score of a shadow feature, while the red and green boxplots represent the Z scores of rejected and confirmed features, respectively. As you can see, the red boxplots have a lower Z score than the maximum Z score of the shadow features, which is precisely why they were put in the unimportant category.

You can confirm the importance of the features by typing:

getSelectedAttributes(boruta.bank, withTentative = F)

## [1] "age" "default" "housing" "loan" "contact" "day"
## [7] "month" "duration" "campaign" "pdays" "previous" "poutcome"

bank_df <- attStats(boruta.bank)


print(bank_df)

## meanImp medianImp minImp maxImp normHits decision

## age 11.4236197 11.3760979 8.4250222 15.518420 1.00000000 Confirmed


## job 0.0741753 0.3002281 -1.7651336 1.566687 0.01010101 Rejected
## marital 1.8891283 2.0043568 -1.0276720 4.804499 0.22222222 Rejected
## education 1.5969540 1.6188117 -1.6836346 4.629572 0.28282828 Rejected
## default 2.3721979 2.3472820 -0.1434933 5.044653 0.50505051 Confirmed
## balance 2.3349682 2.3214378 -0.8098151 5.567993 0.51515152 Rejected
## housing 8.4147808 8.4384240 4.7392059 10.404609 1.00000000 Confirmed
## loan 4.1872186 4.2797591 2.0325838 6.263155 0.87878788 Confirmed
## contact 18.9482180 18.9757719 16.0937657 22.121461 1.00000000 Confirmed
## day 9.5645192 9.5828766 6.1842233 13.495442 1.00000000 Confirmed
## month 24.1475736 24.2067940 20.0621966 27.200679 1.00000000 Confirmed
## duration 71.5232213 71.1785055 64.3941499 78.249830 1.00000000 Confirmed
## campaign 2.6221456 2.6188180 -0.4144493 4.941482 0.65656566 Confirmed
## pdays 26.5650528 26.7123730 23.7902945 29.067476 1.00000000 Confirmed
## previous 20.9569022 20.9703991 18.7273357 23.117672 1.00000000 Confirmed
## poutcome 28.5166889 28.4885934 25.9855974 31.527154 1.00000000 Confirmed


You can easily validate the result: the feature duration has been given the highest importance, which is already highlighted in the data description if you read it carefully!

Conclusion
Voila! You have successfully filtered out the most important features from your dataset just by typing a few lines of code. With this, you have reduced the noise in your data, which will be really beneficial for any classifier that has to assign a label to an observation. Training a model on these important features will improve your model's performance, which was the point of doing feature selection in the first place!
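As a final sketch of that last step, the Boruta package's getConfirmedFormula() helper returns a formula containing only the confirmed features, which you can hand to a classifier of your choice; the randomForest call below is just one illustrative option:

library(randomForest)

# Formula with only the features Boruta confirmed as important
bank_formula <- getConfirmedFormula(boruta.bank)
print(bank_formula)

# Train any classifier on the reduced feature set; a random forest is used here
# purely as an illustration
set.seed(111)
final_model <- randomForest(bank_formula, data = amelia_bank$imputations[[1]])
print(final_model)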

If you want to check out the resources that have been used to make this tutorial, check out
the following:

Amelia II: A Program for Missing Data; James Honaker, Gary King, and Matthew Blackwell

Feature Selection with the Boruta Package; Miron B. Kursa, Witold R. Rudnicki

RDocumentation

UCI Machine Learning Repository
