Boruta Feature Selection in R
DataCamp Team
First, you'll learn more about feature selection: you'll see when you can use it, and what
types of methods you have available to select the most important features for your
model!
Then, you'll get introduced to the Boruta algorithm. You'll see how you can use it to perform a top-down search for relevant features by comparing the original attributes' importance with the importance achievable at random, estimated using their permuted copies, and progressively eliminating irrelevant features.
You'll also take a brief look at the dataset on which you'll be performing the feature
selection. You'll see how you can easily impute missing values with the help of the
Amelia package.
Lastly, you'll learn more about the Boruta package, which you can use to run the
algorithm.
Feature Selection
Generally, whenever you want to reduce the dimensionality of your data, you come across methods like Principal Component Analysis (PCA), Singular Value Decomposition (SVD), etc. So it's natural to ask why you need other feature selection methods at all. The thing with these techniques is that they are unsupervised ways of feature selection: take, for example, PCA, which uses the variance in the data to find the components. These techniques don't take into account the relationship between feature values and the target class or values. Also, there are certain assumptions, such as normality, associated with such methods, which require some kind of transformation before you can apply them. These assumptions don't hold for all kinds of data.
Filter Methods: filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm. Instead, features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable. Some common filter methods are correlation metrics (Pearson, Spearman, distance), the chi-squared test, ANOVA, Fisher's score, etc.
Wrapper Methods: in wrapper methods, you try to use a subset of features and train a model using them. Based on the inferences that you draw from the previous model, you decide to add or remove features from the subset. Forward selection and backward elimination are examples of wrapper methods.
Embedded Methods: these are algorithms that have their own built-in feature selection methods. LASSO regression is one such example.
In this tutorial, you will use one of the wrapper methods, which is readily available in R through a package called Boruta.
Here is how the Boruta algorithm works. First, it duplicates the dataset and shuffles the values in each column. These shuffled values are called shadow features. Then, it trains a classifier, such as a Random Forest classifier, on the dataset. By doing this, you get an idea of the importance (via the Mean Decrease Accuracy or Mean Decrease Impurity) of each of the features of your data set. The higher the score, the better or more important.
Then, the algorithm checks, for each of your real features, whether it has a higher importance than the best of the shadow features. That is, whether the feature has a higher Z-score than the maximum Z-score of its shadow features. If it does, the algorithm records this in a vector. These are called hits. Next, it continues with another iteration. After a predefined number of iterations, you end up with a table of these hits. Remember: a Z-score is the number of standard deviations a data point is from the mean.
At every iteration, the algorithm compares the Z-scores of the shuffled copies of the features with those of the original features to see if the latter performed better than the former. If so, the algorithm marks the feature as important. In essence, the algorithm tries to validate the importance of each feature by comparing it with randomly shuffled copies, which increases the robustness. This is done by simply comparing the number of times a feature did better than the shadow features, using a binomial distribution.
If a feature hasn't been recorded as a hit in, say, 15 iterations, you reject it and also remove it from the original matrix. After a set number of iterations, or once all the features have been either confirmed or rejected, you stop.
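To make this concrete, here is a minimal sketch of a single iteration of the shadow-feature comparison. It is illustrative only, not Boruta's actual implementation: the helper make_shadows is invented for this sketch, and the built-in iris data and the randomForest package stand in for your own data and classifier.

library(randomForest)

# duplicate every column and permute the copies: these are the shadow features
make_shadows <- function(X) {
  shadows <- as.data.frame(lapply(X, sample))
  names(shadows) <- paste0("shadow_", names(X))
  cbind(X, shadows)
}

set.seed(1)
X <- iris[, 1:4]
ext <- make_shadows(X)
fit <- randomForest(ext, iris$Species, importance = TRUE)
imp <- importance(fit, type = 1)[, 1]   # Mean Decrease Accuracy Z-scores
max_shadow <- max(imp[grepl("^shadow_", names(imp))])
hits <- imp[!grepl("^shadow_", names(imp))] > max_shadow
hits  # TRUE = the real feature beat the best shadow in this iteration

Boruta repeats this comparison over many Random Forest runs and tallies the hits, which is what makes the final decision robust.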
Boruta Algorithm in R
Let's use the Boruta algorithm on one of the most commonly available datasets: the Bank Marketing data. This data represents the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit or not.
Tip: don't forget to check out the detailed description of the different features here.
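The loading step itself is not shown above, so here is a sketch of one way to do it. It assumes you have downloaded the UCI Bank Marketing file bank.csv, which is semicolon-separated; the file name and path are assumptions, so adjust them to your copy.

# assumed loading step: bank.csv is semicolon-separated
read_file <- read.csv("bank.csv", sep = ";", stringsAsFactors = FALSE)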
str(read_file)
Let's use the summary() function to summarize the common descriptive statistics of
different features in your dataset.
summary(read_file)
## Class :character 1st Qu.: 9.00 Class :character 1st Qu.: 104
## Mode :character Median :16.00 Mode :character Median : 185
## Mean :15.92 Mean : 264
## 3rd Qu.:21.00 3rd Qu.: 329
## Max. :31.00 Max. :3025
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.00 Min. : 0.0000 Length:4521
## 1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.00 Median : 0.0000 Mode :character
## Mean : 2.794 Mean : 39.77 Mean : 0.5426
## 3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000
## Max. :50.000 Max. :871.00 Max. :25.0000
## y
## Length:4521
## Class :character
## Mode :character
##
##
##
The summary() function gives the measures of central tendency for continuous features, such as the mean, median, quantiles, etc. If you have any categorical features in your data set, you'll also get to see the Class and Mode of those features.
Now let's convert the categorical features into factor data type:
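A minimal sketch of this step, assuming the converted data frame is called bank (the name the later chunks use):

bank <- read_file
chr_cols <- sapply(bank, is.character)               # find the character columns
bank[chr_cols] <- lapply(bank[chr_cols], as.factor)  # convert them to factors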
Since the data points are shuffled in order to create the shadow features, and a Z-score is calculated for each of them, it's important to treat missing or blank values prior to using the Boruta package; otherwise it throws an error.
(Un)fortunately, this dataset has neither. However, for educational purposes, you'll introduce some NAs into the data.
Let's seed missing values in your data set using the prodNA() function. You can access this function by installing the missForest package.
Remember that you can use install.packages() to install missing packages if needed!
library(missForest)
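# a sketch of the seeding step: the seed, the NA fraction, and the name
# `bank.mis` (used in the chunks below) are assumptions
set.seed(123)
bank.mis <- prodNA(bank, noNA = 0.1)  # randomly replace ~10% of values with NA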
You can again call the summary() function on your new data frame to count the NAs you just introduced, but let's get a little bit more creative instead!
Let's visualize the missingness in the data using the following ggplot2 code:
library(reshape2)
library(ggplot2)
library(dplyr)
ggplot_missing <- function(x){
  x %>%
    is.na %>%                        # logical matrix: TRUE where a value is missing
    melt %>%                         # long format: Var1 = row, Var2 = variable
    ggplot(data = .,
           aes(x = Var2,
               y = Var1)) +
    geom_raster(aes(fill = value)) +
    scale_fill_grey(name = "",
                    labels = c("Present", "Missing")) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
    labs(x = "Variables in Dataset",
         y = "Rows / observations")
}
ggplot_missing(bank.mis)
The white lines in the plot show you visually that you have seeded missing values in every feature. But you see how much pain went into writing that ggplot_missing function? Well, the Amelia package in R, which you will be using in the next section, provides a one-line alternative for plotting a similar figure:
library(Amelia)
missmap(bank.mis)
Imputing Missing Values with Amelia
You can now impute the missing values in several ways, such as imputing with the mean, median, or mode (for categorical features), but let's use another powerful package for imputing missing values: Amelia.
Amelia takes m bootstrap samples and applies the EMB (expectation-maximization with bootstrapping) algorithm to each sample. The m estimates of the means and variances will be different. Finally, the first set of estimates is used to impute the first set of missing values using regression, the second set of estimates is used for the second set, and so on. Multiple imputation helps to reduce bias and increase efficiency. Amelia also supports parallel imputation using multicore CPUs.
A few arguments of the amelia() function are worth knowing:
m : the number of imputed datasets to create.
noms : the nominal (categorical) variables in the data.
idvars : keep all ID variables and other variables that you don't want to impute.
library(Amelia)
# noms lists the dataset's nominal (categorical) columns
amelia_bank <- amelia(bank.mis, m = 3, parallel = "multicore",
                      noms = c('job','marital','education','default','housing',
                               'loan','contact','month','poutcome','y'))
## -- Imputation 1 --
##
## 1 2 3 4 5 6
##
## -- Imputation 2 --
##
## 1 2 3 4 5 6
##
## -- Imputation 3 --
##
## 1 2 3 4 5
To access the imputed data frames, you can use the following subsetting:
amelia_bank$imputations[[1]]
With the missing values imputed, you're all set to run Boruta on the completed data set:
library(Boruta)
set.seed(111)
boruta.bank_train <- Boruta(y ~ ., data = amelia_bank$imputations[[1]], doTrace = 2)
print(boruta.bank_train)
Boruta makes a call on the significance of the features in a data set. Many of them are already classified as important or unimportant, but you see that there are some that have been assigned to the tentative category.
Tentative features have an importance that is so close to that of their best shadow features that Boruta is not able to make a decision with the desired confidence in the default number of Random Forest runs.
You could consider increasing the maxRuns parameter if tentative features are left. However, note that you can also provide values for the mtry and ntree parameters, which will be passed on to the randomForest() function. The former lets you specify the number of variables that are randomly sampled as candidates at each split, while the latter specifies the number of trees you want to grow. With these arguments specified, the Random Forest classifier can converge at a minimal value of the out-of-bag error.
Remember that the out-of-bag error is an estimate of the prediction error of classifiers that use bootstrap aggregation to subsample the data used for training. It is the mean prediction error on each training sample x, using only the trees that did not have x in their bootstrap sample.
Alternatively, you can also set the doTrace argument to 1 or 2, which gives you a report of the progress of the process.
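For instance, a hypothetical call combining these arguments (the parameter values here are purely illustrative, not tuned recommendations):

# maxRuns, ntree, and mtry values below are illustrative assumptions
boruta.tuned <- Boruta(y ~ ., data = amelia_bank$imputations[[1]],
                       maxRuns = 200, ntree = 500, mtry = 4, doTrace = 1)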
The Boruta package also contains a TentativeRoughFix() function, which can be used to fill the missing decisions by simply comparing the median feature Z-score with the median Z-score of the most important shadow feature:
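(A sketch of that call; storing the result as boruta.bank, the name the chunks below use, is an assumption.)

boruta.bank <- TentativeRoughFix(boruta.bank_train)  # resolve the tentative features
print(boruta.bank)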
Boruta has now done its job: it has successfully classified each feature as important or unimportant.
You can now plot the Boruta variable importance chart by calling plot(boruta.bank). However, the x-axis labels will be horizontal, which won't look really neat. That's why you will add the feature labels to the x axis vertically, just like in the following code chunk:
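The sketch below is a commonly used recipe for this, built on the ImpHistory matrix that Boruta result objects carry; it assumes your fixed result is named boruta.bank, as above.

# draw the importance boxplots without x-axis labels first
plot(boruta.bank, xlab = "", xaxt = "n")

# collect each attribute's finite importance history and sort by median,
# matching the order in which plot.Boruta draws the boxes
lz <- lapply(1:ncol(boruta.bank$ImpHistory), function(i)
  boruta.bank$ImpHistory[is.finite(boruta.bank$ImpHistory[, i]), i])
names(lz) <- colnames(boruta.bank$ImpHistory)
Labels <- sort(sapply(lz, median))

# add the feature labels vertically (las = 2)
axis(side = 1, las = 2, labels = names(Labels),
     at = 1:ncol(boruta.bank$ImpHistory), cex.axis = 0.7)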
The y-axis label Importance represents the Z-score of each feature across the Random Forest runs.
The blue boxplots correspond to the minimal, average, and maximum Z-scores of the shadow features, while the red and green boxplots represent the Z-scores of the rejected and confirmed features, respectively. As you can see, the red boxplots have a lower Z-score than the maximum Z-score of the shadow features, which is precisely the reason they were put in the unimportant category.
You can list just the confirmed attributes with getSelectedAttributes():
getSelectedAttributes(boruta.bank, withTentative = F)
You can easily validate the result: the feature duration has been given the highest importance, which is already mentioned in the data description if you read it carefully!
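If you want the numbers behind the plot, Boruta's attStats() function summarizes each attribute's importance history and final decision; a quick sketch, again assuming your object is named boruta.bank:

# mean/median importance, hit rate (normHits), and decision per attribute
stats <- attStats(boruta.bank)
head(stats[order(-stats$meanImp), ])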
Conclusion
Voila! You have successfully filtered out the most important features from your dataset just
by typing a few lines of code. With this you have reduced the noise from your data which
will be really beneficial for any classifier to assign a label to an observation. Training a
model on these important features will definitely improve your model's performance which
was the point of doing feature selection in the first place!
If you want to check out the resources that have been used to make this tutorial, check out
the following:
Amelia II: A Program for Missing Data; James Honaker, Gary King, and Matthew Blackwell
Feature Selection with the Boruta Package; Miron B. Kursa and Witold R. Rudnicki
RDocumentation