Bagging and Boosting Regression Algorithms
Introduction to Ensemble Learning:
When you want to buy a new bike or cycle, will you go to the shop and purchase it right
away?
The straightforward answer is no. You will browse a hundred reviews before buying a new
product.
Ensemble models in machine learning operate on a similar idea.
They combine the decisions from multiple models to improve the overall performance.
There are several algorithms used in ensemble learning.
A diverse set of models generally performs better than any single model.
This diversification in machine learning is achieved by a technique called ensemble learning.
Ensemble Techniques:
Max Voting.
Averaging.
Weighted Averaging.
Max Voting:
This voting method is generally used for classification problems.
In this technique, multiple models are used to make predictions for each data point.
The prediction made by each model is taken as a vote.
The prediction given by the majority of the models is used as the final prediction.
For example, suppose we ask five colleagues to rate a movie: three of them rate it 4 and
two of them rate it 5.
Since the majority rating is 4, this is taken as the final rating.
We can consider this as taking the mode of all the predictions.
Max Voting - Result

Colleague 1  Colleague 2  Colleague 3  Colleague 4  Colleague 5  Final Rating
     5            4            5            4            4             4
EXAMPLE CODE:
x_train consists of the independent variables of the training data, and y_train is the
corresponding target variable.
The validation set consists of x_test (independent variables) and y_test (target
variable).
EXAMPLE CODE:
# Imports (x_train, y_train, x_test, y_test are assumed to be defined already)
import numpy as np
from statistics import mode
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()
model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)
pred1 = model1.predict(x_test)
pred2 = model2.predict(x_test)
pred3 = model3.predict(x_test)

# For each test sample, take the mode (majority vote) of the three predictions
final_pred = np.array([])
for i in range(len(x_test)):
    final_pred = np.append(final_pred, mode([pred1[i], pred2[i], pred3[i]]))
EXAMPLE CODE:
Alternatively, the VotingClassifier module in sklearn can be used.
from sklearn.ensemble import VotingClassifier
from sklearn import tree
from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression(random_state=1)
model2 = tree.DecisionTreeClassifier(random_state=1)
model = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='hard')
model.fit(x_train, y_train)
model.score(x_test, y_test)
AVERAGING:
In averaging, multiple models make a prediction for each data point.
We take the average of the predictions from all models and use it as the final
prediction.
It can be used for regression problems, or for averaging the predicted probabilities in
classification problems.
In the case below, the averaging method would take the average of all the ratings,
i.e. (5+4+5+4+4)/5 = 4.4.
Output:

Colleague 1  Colleague 2  Colleague 3  Colleague 4  Colleague 5  Final Rating
     5            4            5            4            4            4.4
Sample Code:
# Imports and train/validation data as in the max voting example
model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()
model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

# Average the predicted class probabilities of the three models
pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)
finalpred = (pred1 + pred2 + pred3) / 3
Weighted Average:
This is an extension of the averaging method.
All the models are assigned different weights that define the importance of each model
for the prediction.
If two of the colleagues are critics while the others have no prior experience in this field,
then the answers of these two colleagues are given more importance than those of the others.
The result is calculated as [(5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18)] = 4.41.
Output:

           Colleague 1  Colleague 2  Colleague 3  Colleague 4  Colleague 5  Final Rating
Weight        0.23         0.23         0.18         0.18         0.18
Rating         5            4            5            4            4            4.41
Sample Code:
# Imports and train/validation data as in the previous examples
model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()
model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

# Weighted average of the predicted probabilities (weights sum to 1)
pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)
finalpred = (pred1 * 0.3 + pred2 * 0.3 + pred3 * 0.4)
Advanced Ensemble Techniques
Stacking:
Stacking is an ensemble learning technique that uses the predictions of multiple
models (for example, a decision tree, kNN and SVM) to build a new model.
This new model is used for making predictions on the test set.
The following steps are used to create a stacked ensemble:
Steps:
1. The train set is split into 10 parts.
2. A base model (say, a decision tree) is fitted on 9 parts and predictions are made on the
10th part.
This is done for each part of the train set.
3. The base model (here, the decision tree) is then fitted on the whole train dataset.
4. Using this model, predictions are made on the test set.
5. Steps 2 to 4 are repeated for another base model (say, kNN),
resulting in another set of predictions for the train set and the test set.
6. The predictions from the train set are used as features to build a
new model.
7. This model is used to make the final predictions on the test prediction set.
Sample Code:
First, define a function that makes predictions on n folds of the train and
test datasets.
This function returns the predictions on the train and test sets for each model.
Sample Code:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def Stacking(model, train, y, test, n_fold):
    # shuffle=True is required when passing a random_state to StratifiedKFold
    folds = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=1)
    test_pred = np.empty((test.shape[0], 1), float)
    train_pred = np.empty((0, 1), float)
    for train_indices, val_indices in folds.split(train, y.values):
        x_train, x_val = train.iloc[train_indices], train.iloc[val_indices]
        y_train, y_val = y.iloc[train_indices], y.iloc[val_indices]
        model.fit(X=x_train, y=y_train)
        train_pred = np.append(train_pred, model.predict(x_val))
        test_pred = np.append(test_pred, model.predict(test))
    return test_pred.reshape(-1, 1), train_pred
Sample Code:
Create two base models, a decision tree and kNN:
model1 = tree.DecisionTreeClassifier(random_state=1)
test_pred1, train_pred1 = Stacking(model=model1, n_fold=10,
    train=x_train, test=x_test, y=y_train)
train_pred1 = pd.DataFrame(train_pred1)
test_pred1 = pd.DataFrame(test_pred1)
Sample Code:
model2 = KNeighborsClassifier()
test_pred2, train_pred2 = Stacking(model=model2, n_fold=10,
    train=x_train, test=x_test, y=y_train)
train_pred2 = pd.DataFrame(train_pred2)
test_pred2 = pd.DataFrame(test_pred2)
Sample Code:
Create a third model, a logistic regression, on the predictions of the
decision tree and kNN models.
df = pd.concat([train_pred1, train_pred2], axis=1)
df_test = pd.concat([test_pred1, test_pred2], axis=1)
model = LogisticRegression(random_state=1)
model.fit(df,y_train)
model.score(df_test, y_test)
The stacking model we have created has only two levels.
The decision tree and kNN models are built at level zero, and the logistic
regression model is built at level one.
Multiple levels can be created in a stacking model.
Blending:
It follows the same approach as stacking, but uses only a holdout (validation) set
from the train set to make predictions.
In the case of blending, the predictions are made on the validation set only.
The validation set and its predictions are used to build a model, which is then run
on the test set.
The following is the process involved in blending:
1. The train set is split into training and validation sets.
2. Models are fitted on the training set.
3. Predictions are made on the validation set and the test set.
4. The validation set and its predictions are used as features to build a
new model.
5. This model is used to make the final predictions on the test set and the
meta-features.
Sample Code:
Build two models, a decision tree and kNN, on the train set in order to make
predictions on the validation set.
# Assumes the train set has been split into x_train/y_train and a holdout x_val/y_val
model1 = tree.DecisionTreeClassifier()
model1.fit(x_train, y_train)
val_pred1 = model1.predict(x_val)
test_pred1 = model1.predict(x_test)
val_pred1 = pd.DataFrame(val_pred1)
test_pred1 = pd.DataFrame(test_pred1)
Sample Code:
model2 = KNeighborsClassifier()
model2.fit(x_train, y_train)
val_pred2 = model2.predict(x_val)
test_pred2 = model2.predict(x_test)
val_pred2 = pd.DataFrame(val_pred2)
test_pred2 = pd.DataFrame(test_pred2)
Blending:
Combining the meta-features and the validation set, a logistic regression
model is built to make predictions on the test set.
df_val = pd.concat([x_val, val_pred1, val_pred2], axis=1)
df_test = pd.concat([x_test, test_pred1, test_pred2], axis=1)
model = LogisticRegression()
model.fit(df_val, y_val)
model.score(df_test, y_test)
Bagging:
The main idea behind bagging is to combine the results of multiple models (for example,
decision trees) to get a generalized result.
If all the models are built on the same data, there is a high chance that they will give
the same result, since they receive the same input.
One of the techniques used to solve this problem is bootstrapping.
Bootstrapping is a sampling technique in which we create subsets of observations from the
original dataset, with replacement.
The size of each such subset is the same as the size of the original set.
Bagging uses these subsets (bags) to get a fair idea of the distribution of the complete set.
In practice, the size of the subsets created for bagging may be smaller than the original set.
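As a minimal illustration of bootstrapping (a hypothetical NumPy example, not part of the
original material), a subset of the same size can be drawn with replacement as follows:

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                        # a toy "original dataset" of 10 observations

# Draw a bootstrap sample: same size as the original, sampled with replacement,
# so some observations repeat and others are left out
bag = rng.choice(data, size=len(data), replace=True)
print(bag)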
Bagging:
1. Multiple subsets are created from the original dataset by selecting
observations with replacement.
2. A base model is created on each of these subsets.
3. The models run in parallel and are independent of each other.
4. The final predictions are determined by combining the predictions
from all the models, as in the sketch below.
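A minimal sketch of bagging with scikit-learn's BaggingClassifier, assuming the same
x_train/y_train/x_test/y_test splits used in the earlier examples:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 100 decision trees, each trained on a bootstrap sample of the training data
# (the estimator argument is called base_estimator in scikit-learn versions before 1.2)
model = BaggingClassifier(estimator=DecisionTreeClassifier(),
                          n_estimators=100,
                          random_state=1)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))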
Boosting:
Various boosting techniques are used in building an ensemble
model.
Boosting methods build an ensemble model in an incremental way.
The main principle is to build the model incrementally by training each base
estimator sequentially.
In order to build a powerful ensemble, these methods combine several weak
learners, which are trained sequentially over multiple iterations of the training data.
The sklearn.ensemble module provides two boosting methods: AdaBoost and Gradient Tree Boosting.
AdaBoost:
AdaBoost is one of the most powerful boosting ensemble methods.
The key lies in the way weights are given to the instances in the
dataset.
Instances that are misclassified receive higher weights, so that the algorithm pays more
attention to them while constructing subsequent models.
Classification With AdaBoost:
The scikit-learn library provides sklearn.ensemble.AdaBoostClassifier.
While building this classifier, the main parameter this module uses is
base_estimator.
base_estimator is the base estimator from which the boosted
ensemble is built.
If this parameter's value is None, the base estimator is
DecisionTreeClassifier(max_depth=1).
Implementation Example:
In the following example, we build an AdaBoost classifier by using
sklearn.ensemble.AdaBoostClassifier and also predict and check its score.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples = 1000, n_features = 10,n_informative =
2, n_redundant = 0,random_state = 0, shuffle = False)
ADBclf = AdaBoostClassifier(n_estimators = 100, random_state = 0)
ADBclf.fit(X, y)
Output:
AdaBoostClassifier(algorithm = 'SAMME.R', base_estimator = None, learning_rate = 1.0, n_estimators =
100, random_state = 0)
Example:
Once fitted, we can predict the new values as follows:
print(ADBclf.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))
Output:
[1]
Example:
Now, we can check the score as follows:
ADBclf.score(X, y)
Output:
0.995
Example:
We can also build an AdaBoost classifier on an external dataset.
In the example below, we use the Pima Indians Diabetes dataset.
Example:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names = headernames)
Example:
array = data.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 5
# shuffle=True is required when passing a random_state to KFold
kfold = KFold(n_splits = 10, shuffle = True, random_state = seed)
num_trees = 100
# Note: AdaBoostClassifier has no max_features parameter, so only n_estimators is set
ADBclf = AdaBoostClassifier(n_estimators = num_trees, random_state = seed)
results = cross_val_score(ADBclf, X, Y, cv = kfold)
print(results.mean())
Output:
0.7851435406698566
Regression With AdaBoost:
For creating a regressor with the AdaBoost method, the scikit-learn
library provides sklearn.ensemble.AdaBoostRegressor.
While building a regressor, it uses the same parameters as
sklearn.ensemble.AdaBoostClassifier.
Implementation Example:
In the following example, we build an AdaBoostRegressor and also predict
new values by using the predict() method.
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle =
False)
ADBregr = AdaBoostRegressor(random_state = 0, n_estimators = 100)
ADBregr.fit(X, y)
Output:
AdaBoostRegressor(base_estimator = None, learning_rate = 1.0, loss = 'linear', n_estimators = 100,
random_state = 0)
Example:
Once fitted, we can predict from the regression model as follows:
print(ADBregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))
Output:
[85.50955817]
Gradient Tree Boosting:
It is also called Gradient Boosted Regression Trees (GBRT).
It is a generalization of boosting to arbitrary differentiable loss functions.
It produces a prediction model in the form of an ensemble of weak prediction
models.
It can be used for regression as well as classification problems.
Its main advantage is that it naturally handles data of mixed
types.
Classification With Gradient Tree Boost:
For creating a Gradient Tree Boost classifier, the scikit-learn module provides
sklearn.ensemble.GradientBoostingClassifier.
While building this classifier, the main parameter the module uses is 'loss'.
'loss' is the loss function to be optimized.
If we choose loss = 'deviance' (renamed 'log_loss' in recent scikit-learn versions), it refers
to deviance for classification with probabilistic outputs.
If we set the parameter's value to 'exponential', it recovers the AdaBoost algorithm.
The parameter n_estimators controls the number of weak learners.
A hyperparameter named learning_rate (in the range (0.0, 1.0]) controls overfitting via
shrinkage.
Implementation Example:
In the following example, we build a Gradient Boosting classifier by using
sklearn.ensemble.GradientBoostingClassifier.
This classifier is fitted with 50 weak learners.
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_hastie_10_2(random_state = 0)
X_train, X_test = X[:5000], X[5000:]
y_train, y_test = y[:5000], y[5000:]
GDBclf = GradientBoostingClassifier(n_estimators = 50, learning_rate = 1.0, max_depth = 1,
   random_state = 0).fit(X_train, y_train)
GDBclf.score(X_test, y_test)
Output:
0.8724285714285714
Example:
The classifier can also be built on the Pima Indians Diabetes dataset:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names = headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Example:
seed = 5
# shuffle=True is required when passing a random_state to KFold
kfold = KFold(n_splits = 10, shuffle = True, random_state = seed)
num_trees = 100
max_features = 5
GDBclf = GradientBoostingClassifier(n_estimators = num_trees, max_features =
max_features)
results = cross_val_score(GDBclf, X, Y, cv = kfold)
print(results.mean())
Output:
0.7946582356674234
Regression With Gradient Tree Boost:
For creating a regressor with the Gradient Tree Boost method, the scikit-
learn library provides sklearn.ensemble.GradientBoostingRegressor.
The loss function for the regressor can be specified via the parameter named
'loss'.
The default value of loss is 'ls' (least squares; renamed 'squared_error' in recent
scikit-learn versions).
Implementation Example:
A Gradient Boosting regressor is built by using
sklearn.ensemble.GradientBoostingRegressor.
The mean squared error is found by using the mean_squared_error() function.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
X, y = make_friedman1(n_samples = 2000, random_state = 0, noise = 1.0)
X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]
GDBreg = GradientBoostingRegressor(n_estimators = 80, learning_rate = 0.1, max_depth = 1,
   random_state = 0, loss = 'ls').fit(X_train, y_train)
Example:
Once fitted, we can find the mean squared error as follows:
mean_squared_error(y_test, GDBreg.predict(X_test))
Output:
5.391246106657164
Gradient Boosting:
Gradient Boosting is an ensemble machine learning algorithm that works well for regression as
well as classification problems.
It uses the boosting technique: it combines a number of weak learners to form a strong learner.
Regression trees are used as the base learner.
Each subsequent tree in the series is built on the errors calculated by the previous
tree.
As an example, suppose we need to predict the age of a group of people from a given dataset.
Gradient Boosting:
1. The mean age is assumed to be the predicted value for all the
observations in the dataset.
2. The errors are calculated using this mean prediction and the actual values
of age.
3. A tree model is created using the errors calculated above as the target
variable. The main objective is to find the best split in order to minimize the
error.
4. The predictions made by this model are combined with the predictions of step 1;
the value calculated in this way is the new prediction.
5. New errors are calculated using this predicted value and the actual value.
6. Steps 2 to 5 are repeated until the maximum number of iterations is reached
or the error function stops changing (a minimal sketch of these steps follows below).
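A minimal sketch of these steps for a numeric target, fitting each tree on the residuals of
the running prediction with scikit-learn's DecisionTreeRegressor (the toy data, learning rate
and number of rounds are illustrative assumptions, not part of the original example):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: one feature, continuous target (e.g. age)
X = np.array([[6], [12], [18], [25], [32], [40], [48], [55]])
y = np.array([7.0, 13.0, 19.0, 26.0, 33.0, 41.0, 49.0, 56.0])

learning_rate = 0.5
prediction = np.full_like(y, y.mean())       # step 1: start from the mean
for _ in range(10):                          # steps 2-5, repeated
    residuals = y - prediction               # step 2: errors of the current prediction
    tree = DecisionTreeRegressor(max_depth=1)
    tree.fit(X, residuals)                   # step 3: tree fitted on the errors
    prediction += learning_rate * tree.predict(X)   # steps 4-5: combine into new prediction
print(prediction)                            # close to y after a few iterations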
Code:
from sklearn.ensemble import GradientBoostingClassifier
model= GradientBoostingClassifier(learning_rate=0.01,random_state=1)
model.fit(x_train, y_train)
model.score(x_test,y_test)
0.81621621621621621
Sample Code – Regression Problem
from sklearn.ensemble import GradientBoostingRegressor
model= GradientBoostingRegressor()
model.fit(x_train, y_train)
model.score(x_test,y_test)
Parameters
min_samples_split:
Defines the minimum number of samples required in a node for it to be
considered for splitting.
It is used to control over-fitting.
min_samples_leaf:
Defines the minimum number of samples required in a leaf node.
Generally, lower values should be chosen for imbalanced class problems.
min_weight_fraction_leaf:
Similar to min_samples_leaf, but defined as a fraction of the total number of observations
instead of an integer.
max_depth:
Denotes the maximum depth of a tree.
It is used to control over-fitting, since a higher depth allows the model to learn relations
specific to a particular sample.
max_leaf_nodes:
The maximum number of terminal nodes (leaves) in a tree.
It can be defined in place of max_depth.
max_features:
The maximum number of features to be considered while searching for the best split.
As a rule of thumb, the square root of the total number of features works well.
Higher values can lead to over-fitting.
A sketch showing how these parameters are passed follows below.
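A minimal sketch of passing these parameters to GradientBoostingClassifier, assuming the
same x_train/y_train/x_test/y_test splits as in the earlier examples (the specific values
are illustrative, not tuned recommendations):

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    min_samples_split=50,
    min_samples_leaf=25,
    max_depth=4,
    max_features='sqrt',     # square root of the number of features
    random_state=1,
)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))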
XGBoost (Extreme Gradient Boosting):
XGBoost is an advanced implementation of the gradient boosting algorithm.
It is one of the most powerful algorithms used for making
predictions.
The XGBoost algorithm has high predictive power and is roughly ten times faster than other
gradient boosting techniques.
It also includes a variety of regularization techniques,
which reduce overfitting and improve overall performance.
It is therefore also known as a regularized boosting technique.
It compares favorably with other techniques in the following ways:
Regularization:
XGBoost has better regularization than a standard GBM
implementation.
This helps to reduce overfitting.
Parallel Processing:
It implements parallel processing and is therefore faster than GBM.
It also supports implementation on Hadoop.
High Flexibility:
It allows users to define custom optimization objectives.
Handling Missing Values:
XGBoost has in-built routines to handle missing values.
Tree Pruning:
XGBoost makes splits up to the max_depth specified and then starts pruning the tree
backwards.
It removes splits beyond which there is no positive gain.
Built-in Cross-Validation:
XGBoost allows a user to run cross-validation at each iteration of the boosting process.
This makes it easy to get the exact optimum number of boosting iterations in a single run,
as in the sketch below.
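A minimal sketch of the built-in cross-validation via xgb.cv, assuming the same
x_train/y_train as in the earlier examples and a binary target (the parameter values
are illustrative):

import xgboost as xgb

dtrain = xgb.DMatrix(x_train, label=y_train)
params = {'objective': 'binary:logistic', 'eta': 0.1, 'max_depth': 3}

# 5-fold cross-validation at each boosting iteration; stop early once the
# validation metric has not improved for 10 rounds
cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                    metrics='logloss', early_stopping_rounds=10, seed=1)
print(len(cv_results))   # number of boosting rounds actually kept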
XGBoost: Sample Code
import xgboost as xgb
model=xgb.XGBClassifier(random_state=1,learning_rate=0.01)
model.fit(x_train, y_train)
model.score(x_test,y_test)
0.82702702702702702
XGBoost: Sample Code
import xgboost as xgb
model=xgb.XGBRegressor()
model.fit(x_train, y_train)
model.score(x_test,y_test)
XGBoost: Parameters
nthread:
It is used for parallel processing; the number of cores in the system should be entered.
If you wish to use all the cores, do not enter a value and the algorithm will detect
the number of cores automatically.
eta:
Analogous to the learning rate in GBM.
It makes the model more robust by shrinking the weights at each step.
min_child_weight:
Defines the minimum sum of weights of all observations required in a child.
It is used to control over-fitting.
Higher values prevent the model from learning relations that are highly specific to a
particular sample.
max_depth:
It is used to define the maximum depth of a tree.
A higher depth allows the model to learn relations very specific to a
particular sample.
max_leaf_nodes:
The maximum number of terminal nodes or leaves in a tree.
Can be defined in place of max_depth; since binary trees are created, a
depth of n would produce a maximum of 2^n leaves.
gamma:
A node is split only when the resulting split gives a positive
reduction in the loss function; gamma specifies the minimum loss
reduction required to make a split.
It makes the algorithm conservative; the values can vary depending
on the loss function and should be tuned.
subsample:
Same as subsample in GBM; denotes the fraction of
observations to be randomly sampled for each tree.
Lower values make the algorithm more conservative and prevent
overfitting, but values that are too small might lead to under-fitting.
colsample_bytree:
Similar to max_features in GBM.
Denotes the fraction of columns to be randomly sampled for each
tree.
A sketch showing how several of these parameters are passed follows below.
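A minimal sketch of passing several of these parameters to XGBClassifier, assuming the same
x_train/y_train/x_test/y_test splits as in the earlier examples (the values are illustrative,
not tuned recommendations):

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,       # eta
    max_depth=4,
    min_child_weight=3,
    gamma=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    n_jobs=-1,               # use all cores (nthread in the native API)
    random_state=1,
)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))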