1 21BCE5695
2 M. Ashwin
[3]: print(dataf.shape)
(4269, 8)
[4]: dataf.head()
(head() output truncated in the export; the surviving trailing columns end in loan_status, e.g. rows 3 and 4 both showing "Rejected")
[5]: print(dataf['loan_status'].value_counts())
loan_status
Approved 2656
Rejected 1613
Name: count, dtype: int64
[6]: dataf['loan_status'].value_counts().plot.bar()
[7]: dataf['education'].value_counts(normalize=True).plot.bar(title='Education')
plt.show()
dataf['no_of_dependents'].value_counts(normalize=True).plot.bar(title='No_of_Dependents')
plt.show()
dataf['self_employed'].value_counts(normalize=True).plot.bar(title='Self_Employed')
plt.show()
6 Independent variables (Numerical)
Visualizing the distribution of annual income.
[8]: sns.displot(dataf['income_annum'])
We can see that the annual income is evenly distributed.
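The "evenly distributed" claim can be checked numerically rather than only by eye: for a roughly uniform distribution, the quartiles are evenly spaced. A minimal sketch on synthetic data (the uniform income range here is an assumption standing in for the real `income_annum` column):

```python
import numpy as np
import pandas as pd

# Hypothetical income column standing in for dataf['income_annum']
rng = np.random.default_rng(0)
income = pd.Series(rng.uniform(200_000, 9_900_000, size=4269))

# For an even spread, the gap between Q1 and the median should match
# the gap between the median and Q3, so their ratio should be near 1
q1, q2, q3 = income.quantile([0.25, 0.5, 0.75])
print((q2 - q1) / (q3 - q2))
```

A heavily skewed column (e.g. log-normal incomes) would push this ratio well away from 1 and would be a candidate for a log transform.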
CIBIL score distribution plot.
[9]: sns.displot(dataf['cibil_score'])
We can see that the CIBIL score is also evenly distributed; hence, no normalization will be required.
Loan Amount distribution plot.
[10]: sns.displot(dataf['loan_amount'])
Encoding data
[11]: from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()

# Select the object-dtype (categorical) columns and label-encode each one
obj = (dataf.dtypes == 'object')
print(type(obj))
for col in list(obj[obj].index):
    dataf[col] = label_encoder.fit_transform(dataf[col])

<class 'pandas.core.series.Series'>
[12]: edu = []
for i in dataf['education']:
    if i == 0:
        edu.append(1)
    else:
        edu.append(0)
dataf['education'] = edu
[13]: l = []
for i in dataf['loan_status']:
    if i == 0:
        l.append(1)
    else:
        l.append(0)
dataf['loan_status'] = l
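The two loops above simply invert the label-encoded 0/1 values. The same remapping can be written vectorized in one step per column; a minimal sketch on a stand-in frame:

```python
import pandas as pd

# Stand-in for the label-encoded dataf columns
df = pd.DataFrame({'education': [0, 1, 0], 'loan_status': [1, 0, 1]})

# 1 - x maps 0 -> 1 and 1 -> 0, replacing each append loop
df['education'] = 1 - df['education']
df['loan_status'] = 1 - df['loan_status']
print(df['education'].tolist())    # [1, 0, 1]
print(df['loan_status'].tolist())  # [0, 1, 0]
```

The vectorized form avoids building intermediate Python lists and is the idiomatic pandas way to flip a binary column.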
Correlation matrix
[14]: matrix = dataf.corr()
f, ax = plt.subplots(figsize=(10,10))
sns.heatmap(matrix,vmax=.8,square=True,cmap="BuPu", annot = True)
7 Model Building
Splitting data into training and testing
[15]: Xval = dataf.drop(['loan_status'], axis=1)
Yval = dataf['loan_status']
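The cell that actually performs the split is missing from the export; the shapes printed in the next cell are consistent with an 80/20 `train_test_split`. A minimal sketch on a stand-in frame with the same row count (the `test_size` and `random_state` values are assumptions, not from the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with the notebook's 4269 rows and 8 columns
df = pd.DataFrame(np.zeros((4269, 8)),
                  columns=[f'c{i}' for i in range(7)] + ['loan_status'])
Xval = df.drop(['loan_status'], axis=1)
Yval = df['loan_status']

# An 80/20 split yields the 3415/854 row counts shown in the notebook
X_train, X_test, Y_train, Y_test = train_test_split(
    Xval, Yval, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (3415, 7) (854, 7)
```

scikit-learn rounds the test partition up (ceil(4269 × 0.2) = 854), which matches the printed shapes exactly.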
[16]: print(np.array(X_train).shape)
print(np.array(Y_train).shape)
print(np.array(X_test).shape)
print(np.array(Y_test).shape)
(3415, 7)
(3415,)
(854, 7)
(854,)
Ada Boost Model
[17]: from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

model = AdaBoostClassifier(n_estimators=50, learning_rate=1)
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(Y_test, y_pred))

Accuracy: 0.9765807962529274
XGBoost Model
Creating the training data for the XGBoost model
[20]: import xgboost as xgb
dtrain_reg = xgb.DMatrix(X_train, Y_train, enable_categorical=False)
dtest_reg = xgb.DMatrix(X_test, Y_test, enable_categorical=False)
[21]: from sklearn.metrics import mean_squared_error

params = {"objective": "reg:squarederror"}
n = 100
model = xgb.train(
    params=params,
    dtrain=dtrain_reg,
    num_boost_round=n,
)
preds = model.predict(dtest_reg)
rmse = mean_squared_error(Y_test, preds, squared=False)
print(f"RMSE of the base model: {rmse:.3f}")
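Because `loan_status` is binary 0/1, the `reg:squarederror` model's continuous predictions can be turned back into class labels by thresholding, and accuracy computed from those. A minimal sketch (the 0.5 threshold is an assumption, not something shown in the notebook):

```python
import numpy as np

# Hypothetical continuous predictions from the regression-objective model
preds = np.array([0.1, 0.9, 0.4, 0.8])
Y_test = np.array([0, 1, 0, 1])

# Threshold at 0.5 to recover hard class labels from regression output
labels = (preds > 0.5).astype(int)
accuracy = (labels == Y_test).mean()
print(accuracy)  # 1.0
```

Using the `binary:logistic` objective instead would make XGBoost output probabilities directly and report classification metrics natively.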
(cell truncated in the export; only the trailing arguments of a training call survive)
    verbose_eval=1000,
    early_stopping_rounds=50
)
results = xgb.cv(
    params, dtrain_clf,
    num_boost_round=n,
    nfold=5,
    metrics=["mlogloss", "auc", "merror"],
)
[29]: results.keys()
[30]: 0.9963477687005426
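The score above appears to come from the cross-validation results frame: `xgb.cv` returns a DataFrame with one row per boosting round and columns named `<train|test>-<metric>-<mean|std>`. A sketch with stand-in numbers (the exact expression used in the hidden cell is an assumption):

```python
import pandas as pd

# Stand-in for the DataFrame returned by xgb.cv with metrics=["auc", ...]
results = pd.DataFrame({
    'test-auc-mean': [0.91, 0.95, 0.97],
    'test-auc-std':  [0.02, 0.01, 0.01],
})

# The best cross-validated AUC across rounds is the column maximum
print(results['test-auc-mean'].max())  # 0.97
```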