
boosting-mllab

April 13, 2024

1 21BCE5695

2 M. Ashwin

3 AdaBoost and XGBoost


[1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from sklearn.model_selection import train_test_split
import sklearn
from sklearn.ensemble import AdaBoostClassifier

4 Importing and visualizing data


[2]: dataf = pd.read_csv('loan_approval_dataset.csv')

[3]: print(dataf.shape)

(4269, 8)

[4]: dataf.head()

[4]:    no_of_dependents     education self_employed  income_annum  loan_amount  \
0                      2      Graduate            No       9600000     29900000
1                      0  Not Graduate           Yes       4100000     12200000
2                      3      Graduate            No       9100000     29700000
3                      3      Graduate            No       8200000     30700000
4                      5  Not Graduate           Yes       9800000     24200000

   loan_term  cibil_score loan_status
0         12          778    Approved
1          8          417    Rejected
2         20          506    Rejected
3          8          467    Rejected
4         20          382    Rejected

[5]: print(dataf['loan_status'].value_counts())

loan_status
Approved 2656
Rejected 1613
Name: count, dtype: int64

[6]: dataf['loan_status'].value_counts().plot.bar()

[6]: <Axes: xlabel='loan_status'>

5 Independent variables (Categorical)


A categorical variable (also called a qualitative variable) is a variable that can take on one of a
limited, and usually fixed, number of possible values.
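
In this dataset the categorical columns arrive as pandas object dtype. A minimal sketch (not part of the original run, assuming the column names used elsewhere in this notebook) to list them and count their distinct values:

[ ]: # Sketch: list the object-dtype (categorical) columns and their number of distinct values
cat_cols = dataf.select_dtypes(include='object').columns
print(list(cat_cols))             # expected: ['education', 'self_employed', 'loan_status']
print(dataf[cat_cols].nunique())  # distinct values per categorical column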

[7]: dataf['education'].value_counts(normalize=True).plot.bar(title='Education')
plt.show()
dataf['no_of_dependents'].value_counts(normalize=True).plot.bar(title='No_of_Dependents')
plt.show()
dataf['self_employed'].value_counts(normalize=True).plot.bar(title='Self_Employed')
plt.show()

6 Independent variables (Numerical)
Visualizing the distribution of annual income.
[8]: sns.displot(dataf['income_annum'])

[8]: <seaborn.axisgrid.FacetGrid at 0x7956dcbcaa70>

We can see that the annual income is fairly evenly distributed.
CIBIL score distribution plot.
[9]: sns.displot(dataf['cibil_score'])

[9]: <seaborn.axisgrid.FacetGrid at 0x7956defd2680>

We can see that the CIBIL score is also fairly evenly distributed; hence, no normalization will be required.
Loan Amount distribution plot.
[10]: sns.displot(dataf['loan_amount'])

[10]: <seaborn.axisgrid.FacetGrid at 0x7956dca363e0>

Encoding data
[11]: from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
obj = (dataf.dtypes == 'object')
print(type(obj))
for col in list(obj[obj].index):
    dataf[col] = label_encoder.fit_transform(dataf[col])

<class 'pandas.core.series.Series'>

[12]: edu = []
for i in dataf['education']:
    if i==0:
        edu.append(1)
    else:
        edu.append(0)
dataf['education'] = edu
[13]: l = []
for i in dataf['loan_status']:
    if i==0:
        l.append(1)
    else:
        l.append(0)
dataf['loan_status'] = l
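
A note on the two re-mapping cells above: LabelEncoder assigns integer codes in sorted order, so 'Approved' and 'Graduate' both become 0, and cells [12] and [13] flip the codes so that 1 means Graduate / Approved. A minimal, self-contained sketch (not part of the original run) of that default mapping:

[ ]: # Sketch: LabelEncoder codes classes in sorted (alphabetical) order
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder().fit(['Approved', 'Rejected'])
print(dict(zip(le.classes_, le.transform(le.classes_))))  # {'Approved': 0, 'Rejected': 1}
# which is why cells [12] and [13] flip the codes so that 1 = Graduate / Approved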

Correlation matrix
[14]: matrix = dataf.corr()
f, ax = plt.subplots(figsize=(10,10))
sns.heatmap(matrix,vmax=.8,square=True,cmap="BuPu", annot = True)

[14]: <Axes: >
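
The heatmap can also be read off numerically; a minimal sketch (not part of the original run) that sorts each feature's correlation with the target:

[ ]: # Sketch: correlation of every feature with loan_status, strongest first
print(matrix['loan_status'].drop('loan_status').sort_values(ascending=False))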

7 Model Building
Splitting data into training and testing
[15]: Xval = dataf.drop(['loan_status'], axis=1)
Yval = dataf['loan_status']

X_train, X_test, Y_train, Y_test = train_test_split(Xval, Yval, train_size=0.8, random_state=5)

[16]: print(np.array(X_train).shape)
print(np.array(Y_train).shape)
print(np.array(X_test).shape)
print(np.array(Y_test).shape)

(3415, 7)
(3415,)
(854, 7)
(854,)
AdaBoost Model
[17]: model = AdaBoostClassifier(n_estimators=50,learning_rate=1)

model.fit(X_train, Y_train)

y_pred = model.predict(X_test)

[18]: from sklearn import metrics

val = metrics.accuracy_score(Y_test, y_pred)

[19]: print("Accuracy:", val)

Accuracy: 0.9765807962529274
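
Accuracy alone can hide class-wise behaviour; a minimal sketch (not part of the original run, assuming the 0 = Rejected / 1 = Approved encoding from cell [13]) of a fuller evaluation of the same predictions:

[ ]: # Sketch: per-class evaluation of the AdaBoost predictions
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(Y_test, y_pred))
print(classification_report(Y_test, y_pred, target_names=['Rejected', 'Approved']))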
XGBoost Model
Creating the training data for the XGBoost model
[20]: import xgboost as xgb
dtrain_reg = xgb.DMatrix(X_train, Y_train, enable_categorical=False)
dtest_reg = xgb.DMatrix(X_test, Y_test, enable_categorical=False)

[21]: params = {"objective": "reg:squarederror"}
n = 100
model = xgb.train(
    params=params,
    dtrain=dtrain_reg,
    num_boost_round=n,
)

Evaluating the model's root mean squared error (RMSE)


[22]: from sklearn.metrics import mean_squared_error

preds = model.predict(dtest_reg)
rmse = mean_squared_error(Y_test, preds, squared=False)
print(f"RMSE of the base model: {rmse:.3f}")

RMSE of the base model: 0.101
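
Because the objective here is reg:squarederror on a 0/1 target, the model outputs continuous scores rather than class labels. A minimal sketch (not part of the original run, assuming a 0.5 decision threshold) of turning those scores into labels and measuring accuracy:

[ ]: # Sketch: threshold the regression-style scores at 0.5 to obtain class predictions
from sklearn.metrics import accuracy_score
class_preds = (preds > 0.5).astype(int)
print("Accuracy:", accuracy_score(Y_test, class_preds))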


Training the XGBoost model using validation data
[23]: evals = [(dtrain_reg, "train"), (dtest_reg, "validation")]

[25]: # Increasing the n value
n = 5000
model = xgb.train(
    params=params,
    dtrain=dtrain_reg,
    num_boost_round=n,
    evals=evals,
    verbose_eval=1000
)

[0] train-rmse:0.35046 validation-rmse:0.35153
[1000] train-rmse:0.00294 validation-rmse:0.10157
[2000] train-rmse:0.00294 validation-rmse:0.10157
[3000] train-rmse:0.00294 validation-rmse:0.10157
[4000] train-rmse:0.00294 validation-rmse:0.10157
[4999] train-rmse:0.00294 validation-rmse:0.10157
We can see that the validation RMSE barely improves even with many more boosting rounds.
Training the model with early stopping: this stops training once the validation RMSE has stopped
improving.
[26]: n = 5000
model = xgb.train(
    params=params,
    dtrain=dtrain_reg,
    num_boost_round=n,
    evals=evals,
    verbose_eval=1000,
    early_stopping_rounds=50
)

[0] train-rmse:0.35046 validation-rmse:0.35153
[77] train-rmse:0.01354 validation-rmse:0.10060
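
With early stopping the booster records where the validation metric stopped improving; a minimal sketch (not part of the original run, assuming the best_iteration / best_score attributes that xgboost sets when early_stopping_rounds is used) of reading that back:

[ ]: # Sketch: best round and best validation RMSE recorded by early stopping
print("Best iteration:", model.best_iteration)
print("Best validation RMSE:", model.best_score)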
XGBoost Classifier model
[27]: dtrain_clf = xgb.DMatrix(X_train, Y_train, enable_categorical=True)
dtest_clf = xgb.DMatrix(X_test, Y_test, enable_categorical=True)

[28]: params = {"objective": "multi:softprob", "num_class": 2}
n = 1000

results = xgb.cv(
    params, dtrain_clf,
    num_boost_round=n,
    nfold=5,
    metrics=["mlogloss", "auc", "merror"],
)

[29]: results.keys()

[29]: Index(['train-mlogloss-mean', 'train-mlogloss-std', 'train-auc-mean',
       'train-auc-std', 'train-merror-mean', 'train-merror-std',
       'test-mlogloss-mean', 'test-mlogloss-std', 'test-auc-mean',
       'test-auc-std', 'test-merror-mean', 'test-merror-std'],
      dtype='object')

AUC score of the model


[30]: results['test-auc-mean'].max()

[30]: 0.9963477687005426
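
The cross-validation results also contain the multi-class error rate, from which a mean accuracy follows directly; a minimal sketch (not part of the original run):

[ ]: # Sketch: best cross-validated accuracy implied by test-merror-mean
print("CV accuracy:", 1 - results['test-merror-mean'].min())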
