
boosting-mllab

April 13, 2024

1 21BCE5695

2 M. Ashwin

3 AdaBoost and XGBoost


[1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from sklearn.model_selection import train_test_split
import sklearn
from sklearn.ensemble import AdaBoostClassifier

4 Importing and visualizing data


[2]: dataf = pd.read_csv('loan_approval_dataset.csv')

[3]: print(dataf.shape)

(4269, 8)

[4]: dataf.head()

[4]:    no_of_dependents     education self_employed  income_annum  loan_amount  \
0                      2      Graduate            No       9600000     29900000
1                      0  Not Graduate           Yes       4100000     12200000
2                      3      Graduate            No       9100000     29700000
3                      3      Graduate            No       8200000     30700000
4                      5  Not Graduate           Yes       9800000     24200000

   loan_term  cibil_score loan_status
0         12          778    Approved
1          8          417    Rejected
2         20          506    Rejected
3          8          467    Rejected
4         20          382    Rejected

[5]: print(dataf['loan_status'].value_counts())

loan_status
Approved 2656
Rejected 1613
Name: count, dtype: int64

[6]: dataf['loan_status'].value_counts().plot.bar()

[6]: <Axes: xlabel='loan_status'>

5 Independent variables (Categorical)


A categorical variable (also called a qualitative variable) is a variable that can take on one of a
limited, and usually fixed, number of possible values.
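
In this dataset the categorical columns arrive as pandas object dtype. A minimal sketch (not part of the original run, assuming the column names used elsewhere in this notebook) to list them and count their distinct values:

[ ]: # Sketch: list the object-dtype (categorical) columns and their number of distinct values
cat_cols = dataf.select_dtypes(include='object').columns
print(list(cat_cols))             # expected: ['education', 'self_employed', 'loan_status']
print(dataf[cat_cols].nunique())  # distinct values per categorical column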

[7]: dataf['education'].value_counts(normalize=True).plot.bar(title='Education')
plt.show()
dataf['no_of_dependents'].value_counts(normalize=True).plot.bar(title='No_of_Dependents')
plt.show()
dataf['self_employed'].value_counts(normalize=True).plot.bar(title='Self_Employed')
plt.show()

6 Independent variables (Numerical)
Visualizing the distribution of annual income.
[8]: sns.displot(dataf['income_annum'])

[8]: <seaborn.axisgrid.FacetGrid at 0x7956dcbcaa70>

We can see that the annual income is fairly evenly distributed.
CIBIL score distribution plot.
[9]: sns.displot(dataf['cibil_score'])

[9]: <seaborn.axisgrid.FacetGrid at 0x7956defd2680>

We can see that the CIBIL score is also fairly evenly distributed; hence, no normalization will be required.
Loan Amount distribution plot.
[10]: sns.displot(dataf['loan_amount'])

[10]: <seaborn.axisgrid.FacetGrid at 0x7956dca363e0>

Encoding data
[11]: from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
obj = (dataf.dtypes == 'object')
print(type(obj))
for col in list(obj[obj].index):
    dataf[col] = label_encoder.fit_transform(dataf[col])

<class 'pandas.core.series.Series'>

[12]: edu = []
for i in dataf['education']:
    if i==0:
        edu.append(1)
    else:
        edu.append(0)
dataf['education'] = edu
[13]: l = []
for i in dataf['loan_status']:
    if i==0:
        l.append(1)
    else:
        l.append(0)
dataf['loan_status'] = l
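
A note on the two re-mapping cells above: LabelEncoder assigns integer codes in sorted order, so 'Approved' and 'Graduate' both become 0, and cells [12] and [13] flip the codes so that 1 means Graduate / Approved. A minimal, self-contained sketch (not part of the original run) of that default mapping:

[ ]: # Sketch: LabelEncoder codes classes in sorted (alphabetical) order
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder().fit(['Approved', 'Rejected'])
print(dict(zip(le.classes_, le.transform(le.classes_))))  # {'Approved': 0, 'Rejected': 1}
# which is why cells [12] and [13] flip the codes so that 1 = Graduate / Approved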

Correlation matrix
[14]: matrix = dataf.corr()
f, ax = plt.subplots(figsize=(10,10))
sns.heatmap(matrix,vmax=.8,square=True,cmap="BuPu", annot = True)

[14]: <Axes: >
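
The heatmap can also be read off numerically; a minimal sketch (not part of the original run) that sorts each feature's correlation with the target:

[ ]: # Sketch: correlation of every feature with loan_status, strongest first
print(matrix['loan_status'].drop('loan_status').sort_values(ascending=False))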

7 Model Building
Splitting data into training and testing
[15]: Xval = dataf.drop(['loan_status'], axis=1)
Yval = dataf['loan_status']

X_train, X_test, Y_train, Y_test = train_test_split(Xval, Yval, train_size=0.8, random_state=5)

[16]: print(np.array(X_train).shape)
print(np.array(Y_train).shape)
print(np.array(X_test).shape)
print(np.array(Y_test).shape)

(3415, 7)
(3415,)
(854, 7)
(854,)
AdaBoost Model
[17]: model = AdaBoostClassifier(n_estimators=50,learning_rate=1)

model.fit(X_train, Y_train)

y_pred = model.predict(X_test)

[18]: from sklearn import metrics

val = metrics.accuracy_score(Y_test, y_pred)

[19]: print("Accuracy:", val)

Accuracy: 0.9765807962529274
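
Accuracy alone can hide class-wise behaviour; a minimal sketch (not part of the original run, assuming the 0 = Rejected / 1 = Approved encoding from cell [13]) of a fuller evaluation of the same predictions:

[ ]: # Sketch: per-class evaluation of the AdaBoost predictions
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(Y_test, y_pred))
print(classification_report(Y_test, y_pred, target_names=['Rejected', 'Approved']))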
XGBoost Model
Creating the training data for the XGBoost model
[20]: import xgboost as xgb
dtrain_reg = xgb.DMatrix(X_train, Y_train, enable_categorical=False)
dtest_reg = xgb.DMatrix(X_test, Y_test, enable_categorical=False)

[21]: params = {"objective": "reg:squarederror"}
n = 100
model = xgb.train(
    params=params,
    dtrain=dtrain_reg,
    num_boost_round=n,
)

Evaluating the model's root mean squared error (RMSE)


[22]: from sklearn.metrics import mean_squared_error

preds = model.predict(dtest_reg)
rmse = mean_squared_error(Y_test, preds, squared=False)
print(f"RMSE of the base model: {rmse:.3f}")

RMSE of the base model: 0.101
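
Because the objective here is reg:squarederror on a 0/1 target, the model outputs continuous scores rather than class labels. A minimal sketch (not part of the original run, assuming a 0.5 decision threshold) of turning those scores into labels and measuring accuracy:

[ ]: # Sketch: threshold the regression-style scores at 0.5 to obtain class predictions
from sklearn.metrics import accuracy_score
class_preds = (preds > 0.5).astype(int)
print("Accuracy:", accuracy_score(Y_test, class_preds))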


Training the XGBoost model using validation data
[23]: evals = [(dtrain_reg, "train"), (dtest_reg, "validation")]

[25]: # Increasing the n value
n = 5000
model = xgb.train(
    params=params,
    dtrain=dtrain_reg,
    num_boost_round=n,
    evals=evals,
    verbose_eval=1000
)

[0] train-rmse:0.35046 validation-rmse:0.35153
[1000] train-rmse:0.00294 validation-rmse:0.10157
[2000] train-rmse:0.00294 validation-rmse:0.10157
[3000] train-rmse:0.00294 validation-rmse:0.10157
[4000] train-rmse:0.00294 validation-rmse:0.10157
[4999] train-rmse:0.00294 validation-rmse:0.10157
We can see that the validation RMSE barely improves even with many more boosting rounds.
Training the model with early stopping: this stops training once the validation RMSE has stopped
improving.
[26]: n = 5000
model = xgb.train(
    params=params,
    dtrain=dtrain_reg,
    num_boost_round=n,
    evals=evals,
    verbose_eval=1000,
    early_stopping_rounds=50
)

[0] train-rmse:0.35046 validation-rmse:0.35153
[77] train-rmse:0.01354 validation-rmse:0.10060
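
With early stopping the booster records where the validation metric stopped improving; a minimal sketch (not part of the original run, assuming the best_iteration / best_score attributes that xgboost sets when early_stopping_rounds is used) of reading that back:

[ ]: # Sketch: best round and best validation RMSE recorded by early stopping
print("Best iteration:", model.best_iteration)
print("Best validation RMSE:", model.best_score)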
XGBoost Classifier model
[27]: dtrain_clf = xgb.DMatrix(X_train, Y_train, enable_categorical=True)
dtest_clf = xgb.DMatrix(X_test, Y_test, enable_categorical=True)

[28]: params = {"objective": "multi:softprob", "num_class": 2}
n = 1000

results = xgb.cv(
    params, dtrain_clf,
    num_boost_round=n,
    nfold=5,
    metrics=["mlogloss", "auc", "merror"],
)

[29]: results.keys()

[29]: Index(['train-mlogloss-mean', 'train-mlogloss-std', 'train-auc-mean',
       'train-auc-std', 'train-merror-mean', 'train-merror-std',
       'test-mlogloss-mean', 'test-mlogloss-std', 'test-auc-mean',
       'test-auc-std', 'test-merror-mean', 'test-merror-std'],
      dtype='object')

AUC score of the model


[30]: results['test-auc-mean'].max()

[30]: 0.9963477687005426
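
The cross-validation results also contain the multi-class error rate, from which a mean accuracy follows directly; a minimal sketch (not part of the original run):

[ ]: # Sketch: best cross-validated accuracy implied by test-merror-mean
print("CV accuracy:", 1 - results['test-merror-mean'].min())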
