
HW2

January 28, 2023

1 Homework 2
1.1 Large Scale Data Analysis / Aalto University, Spring 2023
This homework set consists of 3 questions. You will implement the Bagging algorithm, Random
Forest and AdaBoost.M1 by yourself.

1.2 Import packages


Note: you do not need any other packages, so if you import something else, please specify why you need it.
[1]: import numpy as np
import time
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
from sklearn.metrics import accuracy_score
mpl.style.use('default')

1.2.1 Create the data


[2]: from scipy.stats import chi2

     def generate_data(n, p):
         x = np.random.normal(size=(n, p))
         sdX = np.sum(x ** 2, axis=1)      # squared norm of each sample
         c = chi2.ppf(q=0.5, df=p)         # median of the chi-square(p) distribution
         y = np.ones(n)
         y[sdX <= c] = -1                  # label -1 inside the median radius, +1 outside
         return x, y

     N = 2000       # number of training samples
     Nt = 10000     # number of test samples
     p = 10         # number of features
     np.random.seed(0)
     X, y = generate_data(N, p)
     Xt, yt = generate_data(Nt, p)
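Since $\|\mathbf{x}\|^2$ follows a $\chi^2_p$ distribution for standard normal features and $c$ is its median, the two classes are balanced in expectation. An optional sanity check (assuming the cell above has been run):

[ ]: # optional sanity check: classes should be roughly 50/50 by construction
     print("fraction of class +1 (train):", np.mean(y == 1))
     print("fraction of class +1 (test) :", np.mean(yt == 1))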

1.3 Question 1
In this problem you will implement the bagging algorithm for the binary classification problem and use it to redo Figure 3.3.
You can compare your results with scikit-learn:
[3]: B=200 # number of bootstrapped trees

[4]: from sklearn.ensemble import BaggingClassifier
     %time sk_Bag = BaggingClassifier(n_estimators=B, random_state=123).fit(X, y)
     errBag = 1 - accuracy_score(yt, sk_Bag.predict(Xt))
     print("Sklearn Bagging error rate : {:5.2f}%".format(100*errBag))

CPU times: user 4.94 s, sys: 16.8 ms, total: 4.96 s
Wall time: 4.96 s
Sklearn Bagging error rate : 14.12%

1.4 1 (a)
The bagging algorithm for classification is described in Algorithm 3.1 of the lecture notes. Implement it yourself by writing a function named MyBagging whose outputs are a list Trees containing the B bagged decision trees and an array or list err containing the B out-of-bag (OOB) training errors, recorded each time a new tree is added. The input majorityvoting is an optional flag that selects whether predictions are made by majority voting or by the largest mean probability.
Note: You should use DecisionTreeClassifier (with default options) to compute the classification decision trees. You are not allowed to use BaggingClassifier in the implementation of MyBagging.

[5]: def coinflip():
         # random tie-breaker: returns -1 or +1 with equal probability
         return np.sign(np.random.rand(1) - 0.5)

     def MyBagging(X, y, B, majorityvoting=False):
         """
         MyBagging

         params:
             X, y            (input, output) data. Outputs y need to have values in {-1, +1}.
             B               positive integer, the number of bagged trees
             majorityvoting  flag (False/True) selecting how predictions are combined
                             (majority vote or mean probability)
         """
         # helper func to translate the higher mean probability to the correct class label
         proba_to_sign = lambda z: np.argmax(z, axis=1) * 2 - 1
         majority = np.zeros(len(X))
         mean_prob = np.zeros((len(X), 2))
         Trees = []
         Err = []
         for i in range(B):
             # bootstrap
             clf = DecisionTreeClassifier()
             indices_for_boot = np.random.randint(len(X), size=len(X))
             # indices for OOB calculation
             OOB_indices = np.setxor1d(range(len(X)), indices_for_boot)
             boot_X, boot_y = X[indices_for_boot], y[indices_for_boot]
             # train tree on bootstrap data
             clf.fit(boot_X, boot_y)
             Trees.append(clf)
             # accumulate OOB class votes for majority voting
             majority[OOB_indices] += clf.predict(X[OOB_indices])
             # accumulate OOB probabilities for mean-probability voting
             mean_prob[OOB_indices] += clf.predict_proba(X[OOB_indices])
             if majorityvoting:
                 # majority vote; classify undecided or never-OOB samples randomly
                 OOB_predictions = np.where(majority, np.sign(majority), coinflip())
             else:
                 # for each sample return the class with the higher accumulated probability;
                 # classify undecided or never-OOB samples randomly
                 OOB_predictions = np.where(mean_prob[:, 0] != mean_prob[:, 1],
                                            proba_to_sign(mean_prob), coinflip())
             # compare OOB predictions to y for the OOB error rate
             Err.append(np.count_nonzero(OOB_predictions != y) / len(y))
         return Trees, Err

1.5 1 (b)
Write a function PredictBagging that computes the predicted class labels yhat of the input data X for each bagged tree, as well as the error rate Err at each iteration (i.e., when a new tree is added), given the true labels y. The input Trees is the output from MyBagging. For prediction, the flag majorityvoting selects whether predictions are returned by majority voting or by the largest mean probability over the trees. In the case of majority voting, handle ties by random guessing.
[6]: def PredictBagging(Trees, X, y, majorityvoting=False):
         """
         PredictBagging

         params:
             Trees           a list of bagged trees
             X, y            (input, output) data. Outputs y need to have values in {-1, +1}.
             majorityvoting  flag (False/True) selecting how predictions are combined
                             (majority vote or mean probability)
         """
         # helper func to translate the higher mean probability to the correct class label
         proba_to_sign = lambda z: np.argmax(z, axis=1) * 2 - 1
         Err = []
         # trees x samples
         yhat = np.array([t.predict(X) for t in Trees])
         majority = np.zeros(len(X))
         mean_prob = np.zeros((len(X), 2))
         for i in range(len(Trees)):
             majority += yhat[i]
             mean_prob += Trees[i].predict_proba(X)
             if majorityvoting:
                 # majority vote; classify undecided samples randomly
                 predictions = np.where(majority, np.sign(majority), coinflip())
             else:
                 # for each sample return the class with the higher accumulated probability;
                 # classify undecided samples randomly
                 predictions = np.where(mean_prob[:, 0] != mean_prob[:, 1],
                                        proba_to_sign(mean_prob), coinflip())
             Err.append(np.count_nonzero(predictions != y) / len(y))
         return yhat, Err

1.6 1 (c)
Use the functions you made in parts 1(a) and 1(b) to redo Figure 3.3a in the lecture notes.
Use np.random.seed(123) before running MyBagging for reproducible results; this fixes the randomness of the bootstrap samples.
[7]: np.random.seed(123)

[8]: Trees, OOB_Err = MyBagging(X, y, B)
     preds, test_Err = PredictBagging(Trees, Xt, yt)
     plt.plot(range(B), OOB_Err, label="OOB training error")
     plt.plot(range(B), test_Err, label="test error")
     stump = DecisionTreeClassifier(max_depth=1)
     stump.fit(X, y)
     stump_error = np.count_nonzero(stump.predict(Xt) != yt) / len(yt)
     nodes_245 = DecisionTreeClassifier(max_leaf_nodes=245)
     nodes_245.fit(X, y)
     nodes_245_error = np.count_nonzero(nodes_245.predict(Xt) != yt) / len(yt)
     plt.ylabel('Error rates')
     plt.xlabel('Number of trees')
     plt.axhline(y=nodes_245_error, color='r', linestyle='--', label='245-node error')
     plt.axhline(y=stump_error, color='c', linestyle='--', label='stump error')
     plt.legend()
     plt.show()

[9]: print("MyBagging error rate : {:.2f}%".format(test_Err[-1] * 100))

MyBagging error rate : 14.26%
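For comparison, the same list of bagged trees can also be scored with the majority-voting rule via the majorityvoting flag; a minimal optional sketch (not part of the required figure, output not shown):

[ ]: # optional: score the same trees with majority voting instead of mean probability
     _, test_Err_vote = PredictBagging(Trees, Xt, yt, majorityvoting=True)
     print("MyBagging (majority vote) error rate : {:.2f}%".format(test_Err_vote[-1] * 100))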

1.7 Question 2
In this problem you will implement the random forest algorithm for the binary classification problem and use it to redo Figure 3.3b in the lecture notes.
Hint: Again you should use DecisionTreeClassifier to compute the classification decision trees. You can use the PredictBagging function you implemented in problem 1(b) to compute the predicted class labels for an input data X.
[10]: B = 200
d = 2
nmin = 3
RANDOM_STATE=123

You can compare with scikit-learn:
[11]: from sklearn.ensemble import RandomForestClassifier
      %time sk_RF = RandomForestClassifier(n_estimators=B, max_features=d, min_samples_leaf=nmin, random_state=RANDOM_STATE).fit(X, y)
      errRF_skl = 1 - accuracy_score(yt, sk_RF.predict(Xt))
      print("Sklearn Random Forest error rate : {:5.2f}%".format(100*errRF_skl))

CPU times: user 1.39 s, sys: 4.2 ms, total: 1.39 s
Wall time: 1.39 s
Sklearn Random Forest error rate : 12.77%
Note: the scikit-learn implementation of RandomForest combines classifiers by averaging their probabilistic predictions (not by majority voting!)
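To see the difference, the two aggregation rules can be applied by hand to the fitted forest; a minimal optional sketch (assuming the sk_RF object from the cell above):

[ ]: # optional illustration: soft voting vs. hard (majority) voting over the forest's trees
     # column order of predict_proba follows sk_RF.classes_ (here [-1., 1.])
     probas = np.stack([est.predict_proba(Xt) for est in sk_RF.estimators_])  # (B, Nt, 2)
     # (i) soft voting: average the class probabilities over trees, pick the larger one
     soft = sk_RF.classes_[np.argmax(probas.mean(axis=0), axis=1)]
     # (ii) hard voting: each tree casts one vote, the majority wins (an exact tie gives 0)
     votes = sk_RF.classes_[np.argmax(probas, axis=2)]                         # (B, Nt)
     hard = np.sign(votes.sum(axis=0))
     print("soft voting matches sk_RF.predict :", np.all(soft == sk_RF.predict(Xt)))
     print("hard-voting error rate : {:5.2f}%".format(100 * np.mean(hard != yt)))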

1.7.1 2 (a)
Implement the random forest algorithm for classification described in Algorithm 3.2 yourself by writing a function named MyRandomForest. The outputs of this function are an object Trees which contains the B decision tree classifiers and a vector or list err that contains the B out-of-bag (OOB) training errors, one for each newly added tree.

[12]: def MyRandomForest(X, y, B, d, nmin, majorityvoting=False):
          """
          MyRandomForest function

          params:
              X, y            training data
              B               the number of learners
              d               number of randomized features considered in each split
              nmin            minimum node size
              majorityvoting  flag (False/True) selecting how predictions are combined
                              (majority vote or mean probability)
          """
          # helper func to translate the higher mean probability to the correct class label
          proba_to_sign = lambda z: np.argmax(z, axis=1) * 2 - 1
          majority = np.zeros(len(X))
          mean_prob = np.zeros((len(X), 2))
          Trees = []
          Err = []
          for i in range(B):
              # bootstrap; the difference to bagging is the randomized feature subset per split
              clf = DecisionTreeClassifier(min_samples_leaf=nmin, max_features=d)
              indices_for_boot = np.random.randint(len(X), size=len(X))
              # indices for OOB calculation
              OOB_indices = np.setxor1d(range(len(X)), indices_for_boot)
              boot_X, boot_y = X[indices_for_boot], y[indices_for_boot]
              # train tree on bootstrap data
              clf.fit(boot_X, boot_y)
              Trees.append(clf)
              # accumulate OOB class votes for majority voting
              majority[OOB_indices] += clf.predict(X[OOB_indices])
              # accumulate OOB probabilities for mean-probability voting
              mean_prob[OOB_indices] += clf.predict_proba(X[OOB_indices])
              if majorityvoting:
                  # majority vote; classify undecided or never-OOB samples randomly
                  OOB_predictions = np.where(majority, np.sign(majority), coinflip())
              else:
                  # for each sample return the class with the higher accumulated probability;
                  # classify undecided or never-OOB samples randomly
                  OOB_predictions = np.where(mean_prob[:, 0] != mean_prob[:, 1],
                                             proba_to_sign(mean_prob), coinflip())
              # compare OOB predictions to y for the OOB error rate
              Err.append(np.count_nonzero(OOB_predictions != y) / len(y))
          return Trees, Err

1.7.2 2 (b)
Use the functions you have made to redo Figure 3.3b. Use the following parameter values: d=2, nmin=3, and B=200.
[13]: d = 2
      nmin = 3
      Trees, OOB_Err = MyRandomForest(X, y, B, d, nmin)
      preds, test_Err = PredictBagging(Trees, Xt, yt)
      plt.plot(range(B), OOB_Err, label="OOB training error")
      plt.plot(range(B), test_Err, label="test error")
      stump = DecisionTreeClassifier(max_depth=1)
      stump.fit(X, y)
      stump_error = np.count_nonzero(stump.predict(Xt) != yt) / len(yt)
      nodes_245 = DecisionTreeClassifier(max_leaf_nodes=245)
      nodes_245.fit(X, y)
      nodes_245_error = np.count_nonzero(nodes_245.predict(Xt) != yt) / len(yt)
      plt.ylabel('Error rates')
      plt.xlabel('Number of trees')
      plt.axhline(y=nodes_245_error, color='r', linestyle='--', label='245-node error')
      plt.axhline(y=stump_error, color='c', linestyle='--', label='stump error')
      plt.legend()
      plt.show()

[14]: print("MyRandomForest error rate : {:.2f}%".format(test_Err[-1] * 100))

MyRandomForest error rate : 12.74%

2 Question 3
In this problem you will implement the AdaBoost.M1 algorithm and use it to redo Figure 4.1.
Note: Again you should use DecisionTreeClassifier to compute the stumps.
You can compare your algorithm's performance to scikit-learn. Note that your function should produce the exact same result.
[15]: from sklearn.ensemble import AdaBoostClassifier
      %time sk_AdaM1 = AdaBoostClassifier(n_estimators=600, algorithm='SAMME').fit(X, y)
      %time sk_AdaR = AdaBoostClassifier(n_estimators=600, algorithm='SAMME.R').fit(X, y)
      errAdaM1 = 1 - accuracy_score(yt, sk_AdaM1.predict(Xt))
      errAdaR = 1 - accuracy_score(yt, sk_AdaR.predict(Xt))
      print("Sklearn AdaBoost.M1 error rate : {:5.2f}%".format(100*errAdaM1))
      print("Sklearn AdaBoost.R error rate : {:5.2f}%".format(100*errAdaR))

CPU times: user 2.43 s, sys: 112 µs, total: 2.43 s
Wall time: 2.43 s
CPU times: user 2.61 s, sys: 3.97 ms, total: 2.61 s
Wall time: 2.61 s
Sklearn AdaBoost.M1 error rate : 10.25%
Sklearn AdaBoost.R error rate : 5.63%

2.1 (a)
Implement the AdaBoost.M1 algorithm described in Algorithm 4.1 using stumps (classification decision trees with two terminal nodes) as the base learner. Write a function MyAdaBoostM1 whose outputs are a list G containing the M trees and a list or array alpha containing the weight of each boosting iteration. The input node is the number of leaves to use in the base learner; the default is node=2, i.e., stumps.
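For reference, the quantities computed in the implementation below are the weighted training error, the tree weight, and the sample-weight update (the standard AdaBoost.M1 updates, which Algorithm 4.1 is assumed to follow):

$$\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i\, I\{y_i \neq G_m(\mathbf{x}_i)\}}{\sum_{i=1}^{N} w_i}, \qquad \alpha_m = \log\frac{1-\mathrm{err}_m}{\mathrm{err}_m}, \qquad w_i \leftarrow w_i \exp\bigl(\alpha_m\, I\{y_i \neq G_m(\mathbf{x}_i)\}\bigr).$$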
[16]: def MyAdaBoostM1(X, y, M, node=2):
          # a tree with `node` leaves corresponds to depth log2(node); cast to int for sklearn
          G = [DecisionTreeClassifier(max_depth=int(np.log2(node))) for i in range(M)]
          # hack: initialize all columns of the weight matrix to w0 so that weights[:, i-1]
          # returns the correct (uniform) weights on the first iteration (i = 0)
          weights = np.ones((len(X), M)) / len(X)
          alphas = []
          for i in range(M):
              G[i].fit(X, y, sample_weight=weights[:, i-1])
              # weighted misclassification error of the current tree
              err = np.dot(weights[:, i-1], G[i].predict(X) != y) / np.sum(weights[:, i-1])
              alpha_i = np.log((1 - err) / err)
              # up-weight the misclassified samples for the next iteration
              weights[:, i] = np.multiply(weights[:, i-1],
                                          np.exp(alpha_i * (G[i].predict(X) != y)))
              alphas.append(alpha_i)
          return G, alphas

2.2 (b)
Write a function PredictAdaBoostM1 that computes the predicted data labels and the error rate at each boosting iteration for a given input test data. The inputs of the function are G and alpha, which are the outputs from MyAdaBoostM1, as well as X and y, the feature matrix and class labels. The outputs are the predicted labels (yhat) and the error rates (Err) at each boosting iteration.
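At boosting iteration $m$, the function below returns the sign of the accumulated weighted vote (with exact ties broken randomly):

$$\hat{y}^{(m)}(\mathbf{x}) = \operatorname{sign}\Bigl(\sum_{k=1}^{m} \alpha_k\, G_k(\mathbf{x})\Bigr).$$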
[17]: def PredictAdaBoostM1(G, alpha, X, y):
          Err = []
          yhat = np.zeros((len(X), len(G)))
          f_x = np.zeros(len(X))
          for i in range(len(G)):
              # accumulate the weighted votes of the first i+1 trees
              f_x += alpha[i] * G[i].predict(X)
              # prediction is the sign of the vote; break exact ties randomly
              yhat[:, i] = np.where(f_x, np.sign(f_x), coinflip())
              Err.append(np.count_nonzero(yhat[:, i] != y) / len(y))
          return yhat, Err

2.3 (c)
Write a function PredictProbaAdaBoostM1 that computes the class prediction probability $\hat{p}(\mathbf{x}) = \Pr(Y = 1 \mid X = \mathbf{x})$ for the cases in a given feature matrix X. The inputs of the function are G and alpha, which are the outputs from MyAdaBoostM1, as well as the feature matrix X. The function gives as its output the predicted class probabilities for all cases in the feature matrix. How to compute $\hat{p}(\mathbf{x})$ is explained in Remark 4.1 of the lecture notes.

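For reference, the implementation below passes the normalized weighted vote through a logistic sigmoid (one reading of Remark 4.1; check against the lecture notes):

$$\hat{p}(\mathbf{x}) = \frac{1}{1+\exp\bigl(-\hat{f}(\mathbf{x})\bigr)}, \qquad \hat{f}(\mathbf{x}) = \frac{\sum_{m=1}^{M}\alpha_m\, G_m(\mathbf{x})}{\sum_{m=1}^{M}\alpha_m}.$$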
[18]: def PredictProbaAdaBoostM1(G, alpha, X):
          sigmoid = lambda x: 1 / (1 + np.exp(-x))
          # trees x samples matrix of {-1, +1} predictions
          yhat = np.array([t.predict(X) for t in G])
          # normalized weighted vote in [-1, 1]
          fhat = alpha @ yhat / np.sum(alpha)
          return sigmoid(fhat)


2.4 (d)
Use the functions you made in parts (a)-(c) to redo Figure 4.1a and Figure 4.3a.
When making the histogram of probability predictions, you should use plt.hist( , **kwargs) with the following keyword arguments:
kwargs = dict(alpha=0.7, bins=50, density=True, stacked=True)

[19]: M = 600   # use 600 boosting iterations
      node = 2  # for stumps

Plot for test/training errors


[20]: G, alphas = MyAdaBoostM1(X, y, M, node)
      yhat_train, train_Err = PredictAdaBoostM1(G, alphas, X, y)
      yhat_test, test_Err = PredictAdaBoostM1(G, alphas, Xt, yt)
      plt.plot(range(M), train_Err, label="training error")
      plt.plot(range(M), test_Err, label="test error")
      stump = DecisionTreeClassifier(max_depth=1)
      stump.fit(X, y)
      stump_error = np.count_nonzero(stump.predict(Xt) != yt) / len(yt)
      nodes_245 = DecisionTreeClassifier(max_leaf_nodes=245)
      nodes_245.fit(X, y)
      nodes_245_error = np.count_nonzero(nodes_245.predict(Xt) != yt) / len(yt)
      plt.ylabel('Error rates')
      plt.xlabel('Boosting iterations')
      plt.axhline(y=nodes_245_error, color='r', linestyle='--', label='245-node error')
      plt.axhline(y=stump_error, color='c', linestyle='--', label='stump error')
      plt.legend()
      plt.grid()
      plt.show()

[21]: print("MyAdaBoostM1 error rate : {:.2f}%".format(test_Err[-1] * 100))

MyAdaBoostM1 error rate : 10.25%

plot of histograms
[22]: # density-normalized, stacked histograms of the predicted probabilities
      kwargs = dict(alpha=0.7, bins=50, density=True, stacked=True)
      preds = PredictProbaAdaBoostM1(G, alphas, Xt)
      class_1 = yt == 1
      plt.hist(preds[class_1], **kwargs, label="y=1")
      other_class = np.invert(class_1)
      plt.hist(preds[other_class], **kwargs, label="y=-1")
      plt.legend(loc='upper right')
      plt.show()
