1 Homework 2
1.1 Large Scale Data Analysis / Aalto University, Spring 2023
This homework set consists of 3 questions. You will implement the Bagging algorithm, Random
Forest and AdaBoost.M1 by yourself.
N = 2000
Nt = 10000
p = 10
np.random.seed(0)
X, y = generate_data(N, p)
Xt, yt = generate_data(Nt, p)
1.3 Question 1
In this problem you will implement the bagging algorithm for a binary classification problem and
use it to reproduce Figure 3.3.
You can compare your results with scikit-learn:
[3]: B=200 # number of bootstrapped trees
1.4 1 (a)
The bagging algorithm for classification is described in Algorithm 3.1 of the lecture notes. Implement it
yourself by writing a function named MyBagging whose outputs are a list Trees that contains
the B bagged decision trees and an array or list err that contains the B out-of-bag (OOB) training
errors, one for each time a new tree is added. The argument majorityvoting is an optional flag that
selects between majority voting and largest mean probability prediction.
Note: You should use DecisionTreeClassifier to compute the classification decision tree (with default
options). You are not allowed to use BaggingClassifier in the implementation of MyBagging.
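As a rough illustration of the bootstrap step only (not the required MyBagging function), the sketch below draws one bootstrap sample with NumPy and recovers the out-of-bag indices. The helper name `bootstrap_indices` and the use of `np.random.default_rng` are assumptions made for this example.

```python
import numpy as np

def bootstrap_indices(n, rng):
    """Draw n indices with replacement and return them with the OOB indices."""
    boot = rng.integers(0, n, size=n)        # sampled with replacement
    oob = np.setdiff1d(np.arange(n), boot)   # indices that were never drawn
    return boot, oob

rng = np.random.default_rng(0)               # hypothetical seed for this sketch
boot, oob = bootstrap_indices(2000, rng)
oob_fraction = len(oob) / 2000               # roughly (1 - 1/n)^n for large n
```

Each tree is then evaluated only on its own out-of-bag samples, which is why the OOB error can serve as a validation estimate without a separate hold-out set.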
params:
X,y (input,output) data. Outputs y need to have elements in {-1, 1}.
"""
#helper func to translate higher probability_mean to correct class label
proba_to_sign = lambda z: np.argmax(z,axis=1)*2 - 1
majority = np.zeros(len(X))
mean_prob = np.zeros((len(X),2))
Trees = []
Err = []
for i in range(B):
#bootstrap
clf = DecisionTreeClassifier()
indices_for_boot = np.random.randint(len(X),size=len(X))
#indices for OOB calculation
OOB_indices = np.setxor1d(range(len(X)),indices_for_boot)
boot_X, boot_y = X[indices_for_boot], y[indices_for_boot]
#train tree on bootstrap data
clf.fit(boot_X, boot_y)
Trees.append(clf)
#classify OOB samples for majority vote
majority[OOB_indices] += clf.predict(X[OOB_indices])
#compute probabilities for mean probability
mean_prob[OOB_indices] += clf.predict_proba(X[OOB_indices])
if majorityvoting:
#do majority vote, classify all undecided or unseen samples randomly
OOB_predictions = np.where(majority, np.sign(majority),coinflip())
else:
# for each sample return class with higher accumulated probability, classify undecided or unseen samples randomly
1.5 1 (b)
Write a function PredictBagging that computes the predicted class labels yhat for input data X
for each bagged tree as well as the error rate Err at each iteration, i.e., when a new tree is added,
given the true labels y. The input Trees is the output from MyBagging. For prediction, the
flag majorityvoting is used to return prediction either by using majority voting or largest mean
probability prediction of trees. In the case of majority voting, handle the case of ties by random
guessing.
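The tie-handling rule can be sketched as follows. This is a minimal example under stated assumptions: the helper `majority_vote` is hypothetical, and it draws its own random labels rather than relying on the `coinflip()` helper used in the solution cells, which is defined outside this excerpt.

```python
import numpy as np

def majority_vote(votes, rng):
    """votes: (n_trees, n_samples) array of -1/+1 labels. Ties get a random label."""
    total = votes.sum(axis=0)                            # net vote per sample
    random_labels = rng.choice([-1, 1], size=total.shape)
    return np.where(total != 0, np.sign(total), random_labels).astype(int)

rng = np.random.default_rng(0)                           # hypothetical seed
votes = np.array([[1, -1,  1],
                  [1, -1, -1]])
pred = majority_vote(votes, rng)   # the third sample is a tie, resolved at random
```

Ties can only occur when the number of voting trees is even, so the random guess matters mostly in the early iterations.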
[6]: def PredictBagging(Trees, X, y, majorityvoting=False):
"""
PredictBagging
params:
Trees A list of Bagged trees
X,y (input,output) data. Outputs y need to have elements in {-1, 1}.
"""
#helper func to translate higher probability_mean to correct class label
proba_to_sign = lambda z: np.argmax(z,axis=1)*2 - 1
Err = []
# trees x samples
yhat = np.array([t.predict(X) for t in Trees])
majority = np.zeros(len(X))
mean_prob = np.zeros((len(X),2))
for i in range(len(Trees)):
majority += yhat[i]
mean_prob += Trees[i].predict_proba(X)
if majorityvoting:
#do majority vote, classify all undecided or unseen samples randomly
predictions = np.where(majority, np.sign(majority),coinflip())
else:
# for each sample return sign of class with higher accumulated probability, classify undecided or unseen samples randomly
Err.append(np.count_nonzero(predictions != y) / len(y))
return yhat, Err
1.6 1 (c)
Use the functions you made in parts 1(a) and 1(b) to reproduce Figure 3.3a in the lecture notes.
Call np.random.seed(123) before running MyBagging. The bootstrap samples are drawn at random,
so fixing the seed first makes the results reproducible.
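A small check of what seeding does, using only NumPy's legacy global generator (the same one `np.random.seed` controls). The variable names are illustrative.

```python
import numpy as np

np.random.seed(123)
first = np.random.randint(10, size=5)

np.random.seed(123)   # resetting the seed replays the same random stream
second = np.random.randint(10, size=5)

same = np.array_equal(first, second)   # True: both draws are identical
```

Because the bootstrap indices come from this global stream, any extra random call between seeding and MyBagging would shift the stream and change the trees.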
[7]: np.random.seed(123)
plt.show()
1.7 Question 2
In this problem you will implement the random forest algorithm for a binary classification problem
and use it to reproduce Figure 3.3b in the lecture notes.
Hint: Again you should use DecisionTreeClassifier to compute the classification decision tree. You
can use the PredictBagging function you implemented in problem 1(b) to compute the predicted
class labels for input data X.
[10]: B = 200
d = 2
nmin = 3
RANDOM_STATE=123
You can compare your results with scikit-learn:
[11]: from sklearn.ensemble import RandomForestClassifier
%time sk_RF =␣
↪RandomForestClassifier(n_estimators=B,max_features=d,min_samples_leaf=nmin,random_state=RAND
↪fit(X,y)
errRF_skl = 1-accuracy_score(yt,sk_RF.predict(Xt))
print("Sklearn Random Forest error rate : {:5.2f}%".format(100*errRF_skl))
1.7.1 2 (a)
Implement the random forest algorithm for classification described in Algorithm 3.2 yourself
by writing a function named MyRandomForest. The outputs of this function are an object Trees
which contains the B decision tree classifiers and a vector or list err that contains the B out-of-bag
(OOB) training errors, one for each time a new tree is added.
params:
X,y training data
B the number of learners
d number of randomized features in each split
nmin minimum node size
majorityvoting flag (False/True) to handle predictions (majority vote or mean probability)
"""
#helper func to translate higher probability_mean to correct class label
proba_to_sign = lambda z: np.argmax(z,axis=1)*2 - 1
majority = np.zeros(len(X))
mean_prob = np.zeros((len(X),2))
Trees = []
Err = []
for i in range(B):
#bootstrap, difference- use randomized features for splitting
clf = DecisionTreeClassifier(min_samples_leaf=nmin,max_features=d)
indices_for_boot = np.random.randint(len(X),size=len(X))
#indices for OOB calculation
OOB_indices = np.setxor1d(range(len(X)),indices_for_boot)
boot_X, boot_y = X[indices_for_boot], y[indices_for_boot]
#train tree on bootstrap data
clf.fit(boot_X, boot_y)
Trees.append(clf)
#classify OOB samples for majority vote
majority[OOB_indices] += clf.predict(X[OOB_indices])
#compute probabilities for mean probability
mean_prob[OOB_indices] += clf.predict_proba(X[OOB_indices])
if majorityvoting:
#do majority vote, classify all undecided or unseen samples randomly
OOB_predictions = np.where(majority, np.sign(majority),coinflip())
else:
# for each sample return class with higher accumulated probability, classify undecided or unseen samples randomly
1.7.2 2 (b)
Use the functions you have made to reproduce Figure 3.3b. Use the following parameter values: d=2,
nmin=3 and B=200.
[13]: d=2
nmin = 3
Trees, OOB_Err = MyRandomForest(X,y,B,d,nmin)
preds, test_Err = PredictBagging(Trees,Xt,yt)
plt.plot(range(B),OOB_Err,label="OOB training error")
plt.plot(range(B),test_Err,label="test error")
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X,y)
stump_error = np.count_nonzero(stump.predict(Xt) != yt) / len(yt)
nodes_245 = DecisionTreeClassifier(max_leaf_nodes=245)
nodes_245.fit(X,y)
nodes_245_error = np.count_nonzero(nodes_245.predict(Xt) != yt) / len(yt)
plt.ylabel('Error rates')
plt.xlabel('Number of trees')
plt.axhline(y=nodes_245_error, color='r', linestyle='--', label='245-node error')
2 Question 3
In this problem you will implement the AdaBoost.M1 algorithm and use it to reproduce Figure 4.1.
Note: Again you should use DecisionTreeClassifier to compute the stumps.
You can compare your algorithm’s performance to Sklearn. Note that your function should
produce the exact same result.
[15]: from sklearn.ensemble import AdaBoostClassifier
%time sk_AdaM1 = AdaBoostClassifier(n_estimators=600,algorithm='SAMME').fit(X,y)
%time sk_AdaR = AdaBoostClassifier(n_estimators=600,algorithm='SAMME.R').fit(X,y)
errAdaM1 = 1-accuracy_score(yt,sk_AdaM1.predict(Xt))
errAdaR = 1-accuracy_score(yt,sk_AdaR.predict(Xt))
print("Sklearn AdaBoost.M1 error rate : {:5.2f}".format(100*errAdaM1))
print("Sklearn AdaBoost.R error rate : {:5.2f}".format(100*errAdaR))
2.1 (a)
Implement the AdaBoost.M1 algorithm described in Algorithm 4.1 using stumps (classification
decision trees with two terminal nodes) as the base learner. Write a function MyAdaBoostM1
whose outputs are a list G that contains the M trees and a list or array alpha that contains the weight
of each boosting iteration. The input node is the number of leaves you wish to use in your base
learner. The default is node=2, which gives stumps.
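A single boosting round can be sketched on plain arrays, without fitting a tree. The helper `adaboost_m1_update` is hypothetical, and the weighted error formula is the standard AdaBoost.M1 one, assumed here to match Algorithm 4.1; the alpha and weight update mirror the lines visible in the solution cell.

```python
import numpy as np

def adaboost_m1_update(w, y, pred):
    """One AdaBoost.M1 round: weighted error, learner weight, new sample weights."""
    miss = (pred != y)
    err = np.sum(w * miss) / np.sum(w)   # weighted misclassification rate
    alpha = np.log((1.0 - err) / err)    # assumes 0 < err < 1
    w_new = w * np.exp(alpha * miss)     # up-weight misclassified samples only
    return err, alpha, w_new

y    = np.array([1,  1, -1, -1])
pred = np.array([1, -1, -1, -1])         # one mistake out of four
w    = np.full(4, 0.25)
err, alpha, w_new = adaboost_m1_update(w, y, pred)
```

With uniform starting weights and one error in four, the error is 0.25, alpha is log(3), and only the misclassified sample's weight grows (from 0.25 to 0.75). In a full implementation the new weights would be passed to the next stump, for example through the `sample_weight` argument of `DecisionTreeClassifier.fit`.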
[16]: def MyAdaBoostM1(X, y, M, node=2):
alpha_i = np.log((1-err)/err)
weights[:,i] = np.multiply(weights[:,i-1],np.exp(alpha_i * (G[i].predict(X) != y)))
alphas.append(alpha_i)
return G, alphas
2.2 (b)
Write a function PredictAdaBoostM1 that computes the predicted data labels and error rate at each
boosting iteration for a given input test data. The inputs of the function are G and alpha which
are the outputs from MyAdaBoostM1 as well as X and y which are the input data of features and
class labels. The outputs are the predicted labels (yhat) and the error rates (Err) at each boosting
iteration.
[17]: def PredictAdaBoostM1(G, alpha, X, y):
Err = []
yhat = np.zeros((len(X),len(G)))
f_x = np.zeros(len(X))
for i in range(len(G)):
f_x += alpha[i] * G[i].predict(X)
yhat[:,i] = np.where(f_x, np.sign(f_x),coinflip())
Err.append(np.count_nonzero(yhat[:,i] != y) / len(y))
return yhat, Err
2.3 (c)
Write a function PredictProbaAdaBoostM1 that computes the class prediction probability
$\hat{p}(\mathbf{x}) = \Pr(Y = 1 \mid X = \mathbf{x})$ for cases in a given feature matrix X. The inputs of the
function are G and alpha, which are the outputs from MyAdaBoostM1, as well as the feature matrix X.
The function gives as its output the predicted class probabilities for all cases in the feature matrix.
How to compute $\hat{p}(\mathbf{x})$ is explained in Remark 4.1 of the lecture notes.
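Remark 4.1 is not reproduced in this document, so the following is only a sketch under an assumption: it maps the boosted score f(x) = sum_m alpha_m G_m(x) to a probability through a logistic link, 1 / (1 + exp(-scale * f(x))). The value scale=2 comes from the population minimizer of the exponential loss; the lecture notes may use a different scaling. The helper `proba_from_scores` and its inputs are hypothetical.

```python
import numpy as np

def proba_from_scores(base_preds, alpha, scale=2.0):
    """base_preds: (M, n) array of -1/+1 predictions; alpha: length-M weights.

    Returns an estimate of Pr(Y = 1 | x) via a logistic link (assumed form).
    """
    f = np.asarray(alpha) @ np.asarray(base_preds)   # f(x) = sum_m alpha_m G_m(x)
    return 1.0 / (1.0 + np.exp(-scale * f))

base_preds = np.array([[1, -1],
                       [1,  1]])
alpha = np.array([1.0, 0.5])
p_hat = proba_from_scores(base_preds, alpha)         # scores are 1.5 and -0.5
```

A positive score gives a probability above 0.5 and a negative score one below 0.5, so thresholding at 0.5 agrees with taking the sign of f(x).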
[ ]:
2.4 (d)
Use the functions you made in parts (a)-(c) to reproduce Figure 4.1a and Figure 4.3a.
When making the histogram of probability predictions, you should use plt.hist( , **kwargs) with the
following keyword arguments:
kwargs = dict(alpha=0.7, bins=50,density=True,stacked=True)
plt.axhline(y=nodes_245_error, color='r', linestyle='--', label='245-node error')
plt.show()
plot of histograms
[22]: # Normalize
kwargs = dict(alpha=0.7, bins=50,density=True,stacked=True)
preds = PredictProbaAdaBoostM1(G, alphas, Xt)
class_1 = yt == 1
plt.hist(preds[class_1],**kwargs, label = "y=1")
other_class = np.invert(class_1)
plt.hist(preds[other_class],**kwargs, label = "y=-1")
plt.legend(loc='upper right')
plt.show()