
[Akash Kumar] [Dev Bhoomi Uttarakhand University] [Date: 15/06/23]

[Predicting Whether the Customer Will Subscribe to a Term Deposit]

Abstract: This project report aims to analyze the factors influencing a customer's decision to
subscribe to a term deposit and develop a predictive model to forecast the likelihood of
subscription. The report outlines the data collection process, data preprocessing techniques, feature
engineering approaches, model development, evaluation metrics, and interpretation of results. The
project findings provide insights for financial institutions to optimize their marketing strategies
and improve the subscription rate.

Problem Statement

Business Use Case


A Portuguese bank has seen its revenue decline and would like to know what actions to
take. After investigation, they found that the root cause is that their clients are not
depositing as frequently as before. Term deposits allow banks to hold onto a deposit for a
specific amount of time, so banks can invest those funds in higher-gain financial products to
make a profit. In addition, banks have a better chance of persuading term deposit clients into
buying other products, such as funds or insurance, to further increase their revenue. As a result,
the Portuguese bank would like to identify existing clients that have a higher chance of
subscribing to a term deposit and focus marketing efforts on such clients.

Data Science Problem Statement


Predict whether the client will subscribe to a term deposit, based on analysis of the marketing
campaigns the bank performed.

Evaluation Metric
We will be using ROC-AUC for evaluation.
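For intuition: ROC-AUC measures how well the model's scores rank positive cases above negative ones, independently of any classification threshold (0.5 corresponds to random ranking, 1.0 to perfect ranking). A tiny illustrative sketch with made-up values:

from sklearn.metrics import roc_auc_score

# Illustrative toy values only: AUC rewards ranking positives above negatives.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities of class 1
print(roc_auc_score(y_true, y_prob))  # 0.75: three of the four pos/neg pairs are ordered correctly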

Objective of this template notebook


The main objective of this template is to take you through the entire working pipeline that you
may follow while approaching a Machine Learning problem.

We will define a task to be performed and write the code to solve it.

The tasks performed below should serve as a good guide to the steps you might take when
working through a Machine Learning problem. But kindly do not restrict yourself to only
the tasks that have been performed in this notebook; feel free to bring your own ideas, skills
and strategies and implement them as well.
Word of caution
This template is just an example of a data-science pipeline; every data science problem is unique
and there are multiple ways to tackle it. Go through this template and try to leverage the
information in it while solving your hackathon problems, but you may not be able to use all of the
functions created here.

Understanding the dataset


Data Set Information

The data is related to direct marketing campaigns of a Portuguese banking institution. The
marketing campaigns were based on phone calls. Often, more than one contact with the same client
was required in order to assess whether the product (bank term deposit) would be subscribed to
('yes') or not ('no').

There are two datasets:

train.csv, with all 32,950 examples and 21 columns including the target feature, ordered by date
(from May 2008 to November 2010); this data is very close to the data analyzed in
[Moro et al., 2014].

test.csv, the test data, which consists of 8,238 observations and 20 features, without the
target feature.

Goal: The classification goal is to predict whether the client will subscribe (yes/no) to a term
deposit (variable y).

Features

| Feature | Feature Type | Description |
| --- | --- | --- |
| age | numeric | age of a person |
| job | categorical, nominal | type of job ('admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown') |
| marital | categorical, nominal | marital status ('divorced','married','single','unknown'; note: 'divorced' means divorced or widowed) |
| education | categorical, nominal | ('basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown') |
| default | categorical, nominal | has credit in default? ('no','yes','unknown') |
| housing | categorical, nominal | has housing loan? ('no','yes','unknown') |
| loan | categorical, nominal | has personal loan? ('no','yes','unknown') |
| contact | categorical, nominal | contact communication type ('cellular','telephone') |
| month | categorical, ordinal | last contact month of year ('jan', 'feb', 'mar', ..., 'nov', 'dec') |
| day_of_week | categorical, ordinal | last contact day of the week ('mon','tue','wed','thu','fri') |
| duration | numeric | last contact duration, in seconds. Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no') |
| campaign | numeric | number of contacts performed during this campaign and for this client (includes last contact) |
| pdays | numeric | number of days that passed by after the client was last contacted from a previous campaign (999 means client was not previously contacted) |
| previous | numeric | number of contacts performed before this campaign and for this client |
| poutcome | categorical, nominal | outcome of the previous marketing campaign ('failure','nonexistent','success') |

Target variable (desired output):

| Feature | Feature_Type | Description |
| --- | --- | --- |
| y | binary | has the client subscribed to a term deposit? ('yes','no') |
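One feature-engineering idea suggested by this table (not part of the original preprocessing; a hedged sketch, where dataframe is an assumed name for the training DataFrame): the pdays sentinel value 999 encodes "not previously contacted" rather than a magnitude, so a binary indicator can separate the flag from the numeric scale.

# Hedged sketch: split the pdays sentinel into an explicit indicator.
# `dataframe` is an assumed name for the training DataFrame.
dataframe['previously_contacted'] = (dataframe['pdays'] != 999).astype(int)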

Importing necessary libraries


The following code is written in Python 3.x. Libraries provide pre-written functionality to
perform necessary tasks.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

Loading Data Modelling Libraries


We will use the popular scikit-learn library to develop our machine learning algorithms. In
sklearn, algorithms are called Estimators and implemented in their own classes. For data
visualization, we will use the matplotlib and seaborn libraries. Below are common classes to load.
In [2]:
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import (roc_auc_score, mean_squared_error, accuracy_score,
                             classification_report, roc_curve, confusion_matrix)
from scipy.stats.mstats import winsorize
from sklearn.feature_selection import RFE
pd.set_option('display.max_columns', None)

# Compatibility shim: some older libraries still import sklearn.externals.six
import six
import sys
sys.modules['sklearn.externals.six'] = six

Data Loading and Cleaning
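The loading and cleaning cells themselves are not reproduced in this report. As a minimal, hedged sketch of what the following sections assume (a DataFrame named dataframe read from the training file, with categorical columns, including the target, label encoded; the train file path is an assumption mirroring the test path used later):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hedged sketch: assumed train path, mirroring the test path used later.
dataframe = pd.read_csv('../input/banking-project-term-deposit/train.csv')

# Label encode every object-typed column (including the target y)
# so that the models below receive purely numeric inputs.
for col in dataframe.select_dtypes(include='object').columns:
    dataframe[col] = LabelEncoder().fit_transform(dataframe[col])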


Applying vanilla models on the data
Since we have performed preprocessing on our data and are also done with the EDA part, it is now
time to apply vanilla machine learning models to the data and check their performance.

Fit vanilla classification models


Since we have label encoded our categorical variables, our data is now ready for applying
machine learning algorithms.

There are many classification algorithms in machine learning, used for different classification
applications. Some of the main classification algorithms are as follows:

• Logistic Regression
• Decision Tree Classifier
• Random Forest Classifier

The code we have written below splits the data into training data and validation data, fits the
classification model on the train data, makes a prediction on the validation data, and outputs
the scores for this prediction.

PREPARING THE TRAIN AND TEST DATA


In [4]:
# Predictors
X = dataframe.iloc[:,:-1]

# Target
y = dataframe.iloc[:,-1]

# Dividing the data into train and test subsets


x_train,x_val,y_train,y_val = train_test_split(X,y,test_size=0.2,random_state=42)

FITTING THE MODEL AND PREDICTING THE VALUES


In [5]:
# run Logistic Regression model
model = LogisticRegression()
# fitting the model
model.fit(x_train, y_train)
# predicting the values
y_scores = model.predict(x_val)

GETTING THE METRICS TO CHECK OUR MODEL PERFORMANCE


In [6]:
# getting the auc roc curve
auc = roc_auc_score(y_val, y_scores)
print('Classification Report:')
print(classification_report(y_val,y_scores))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_scores)
print('ROC_AUC_SCORE is',roc_auc_score(y_val, y_scores))

#fpr, tpr, _ = roc_curve(y_test, predictions[:,1])

plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.98      0.93      5798
           1       0.50      0.17      0.26       792

    accuracy                           0.88      6590
   macro avg       0.70      0.57      0.60      6590
weighted avg       0.85      0.88      0.85      6590

ROC_AUC_SCORE is 0.5742166403601381
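Note that the AUC above was computed from the hard 0/1 labels returned by predict(), which collapses the ROC curve to a single operating point. A hedged variant that scores with predicted probabilities instead, usually giving a more faithful (and typically higher) ROC-AUC:

# Score with class-1 probabilities instead of hard labels.
y_prob = model.predict_proba(x_val)[:, 1]
print('ROC_AUC_SCORE (probabilities):', roc_auc_score(y_val, y_prob))

fpr, tpr, _ = roc_curve(y_val, y_prob)
plt.plot(fpr, tpr)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve (probability scores)')
plt.show()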
The above two steps are combined and run in a single cell for each of the remaining models.
In [7]:
# Run Decision Tree Classifier
model = DecisionTreeClassifier()

model.fit(x_train, y_train)
y_scores = model.predict(x_val)
auc = roc_auc_score(y_val, y_scores)
print('Classification Report:')
print(classification_report(y_val,y_scores))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_scores)
print('ROC_AUC_SCORE is',roc_auc_score(y_val, y_scores))

#fpr, tpr, _ = roc_curve(y_test, predictions[:,1])

plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.93      0.93      5798
           1       0.46      0.46      0.46       792

    accuracy                           0.87      6590
   macro avg       0.69      0.69      0.69      6590
weighted avg       0.87      0.87      0.87      6590

ROC_AUC_SCORE is 0.6924298608715649

In [8]:
from sklearn import tree
from sklearn.tree import export_graphviz # display the tree within a Jupyter notebook
from IPython.display import SVG
from graphviz import Source
from IPython.display import display
from ipywidgets import interactive, IntSlider, FloatSlider, interact
import ipywidgets
from IPython.display import Image
from subprocess import call
import matplotlib.image as mpimg
In [9]:
@interact
def plot_tree(crit=["gini", "entropy"],
              split=["best", "random"],
              depth=IntSlider(min=1, max=30, value=2, continuous_update=False),
              min_split=IntSlider(min=2, max=5, value=2, continuous_update=False),
              min_leaf=IntSlider(min=1, max=5, value=1, continuous_update=False)):
    estimator = DecisionTreeClassifier(random_state=0,
                                       criterion=crit,
                                       splitter=split,
                                       max_depth=depth,
                                       min_samples_split=min_split,
                                       min_samples_leaf=min_leaf)
    estimator.fit(x_train, y_train)
    print('Decision Tree Training Accuracy: {:.3f}'.format(
        accuracy_score(y_train, estimator.predict(x_train))))
    print('Decision Tree Test Accuracy: {:.3f}'.format(
        accuracy_score(y_val, estimator.predict(x_val))))

    # render the fitted tree as an image inside the notebook
    graph = Source(tree.export_graphviz(estimator,
                                        out_file=None,
                                        feature_names=x_train.columns,
                                        class_names=['0', '1'],
                                        filled=True))
    display(Image(data=graph.pipe(format='png')))

    return estimator
Decision Tree Training Accuracy: 0.896
Decision Tree Test Accuracy: 0.889

DecisionTreeClassifier(max_depth=2, random_state=0)
In [10]:
# run Random Forest Classifier
model = RandomForestClassifier()

model.fit(x_train, y_train)
y_scores = model.predict(x_val)
auc = roc_auc_score(y_val, y_scores)
print('Classification Report:')
print(classification_report(y_val,y_scores))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_scores)
print('ROC_AUC_SCORE is',roc_auc_score(y_val, y_scores))

#fpr, tpr, _ = roc_curve(y_test, predictions[:,1])

plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.97      0.94      5798
           1       0.64      0.35      0.45       792

    accuracy                           0.90      6590
   macro avg       0.78      0.66      0.70      6590
weighted avg       0.88      0.90      0.89      6590

ROC_AUC_SCORE is 0.662138372340166

Feature Selection
Now that we have applied vanilla models on our data, we have a basic understanding of
what our predictions look like. Let's now use feature selection methods to identify the best
set of features for each model.

Using RFE for feature selection


In this task, let's use Recursive Feature Elimination (RFE) for selecting the best features. RFE is a
wrapper method that uses the model itself to identify the best features.
• For the task below, we have selected 8 features. You can change this value to the
number of features you want to retain for your model.
In [11]:
# Selecting 8 features with Logistic Regression as the estimator
models = LogisticRegression()
# using RFE and selecting 8 features
rfe = RFE(models, n_features_to_select=8)
# fitting the model
rfe = rfe.fit(X, y)
# ranking features (rank 1 = selected)
feature_ranking = pd.Series(rfe.ranking_, index=X.columns)
print('Features to be selected for Logistic Regression model are:')
print(feature_ranking[feature_ranking.values == 1].index.tolist())
print('====' * 30)
Features to be selected for Logistic Regression model are:
['job', 'marital', 'education', 'housing', 'contact', 'day_of_week', 'campaign', 'poutcome']
=====================================================================
===================================================
In [12]:
# Selecting 8 features with a Random Forest classifier as the estimator
models = RandomForestClassifier()
# using RFE and selecting 8 features
rfe = RFE(models, n_features_to_select=8)
# fitting the model
rfe = rfe.fit(X, y)
# ranking features (rank 1 = selected)
feature_ranking = pd.Series(rfe.ranking_, index=X.columns)
print('Features to be selected for Random Forest Classifier are:')
print(feature_ranking[feature_ranking.values == 1].index.tolist())
print('====' * 30)
Features to be selected for Random Forest Classifier are:
['age', 'job', 'education', 'month', 'day_of_week', 'duration', 'campaign', 'poutcome']
=====================================================================
===================================================
Feature Selection using Random Forest
Random Forests are often used for feature selection in a data science workflow. This is because
the tree-based strategies that random forests use rank features by how well they improve the
purity of a node: features that produce the largest impurity reductions are used for splits near
the top of the trees, while features that improve purity very little are used towards the end of
the trees, or not at all. Hence, by considering only the features used in the top splits, we can
create a subset of the most important features.
In [13]:
# splitting the data into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# selecting the model
rfc = RandomForestClassifier(random_state=42)
# fitting the data
rfc.fit(X_train, y_train)
# predicting the data
y_pred = rfc.predict(X_test)
# feature importances (10 most important)
rfc_importances = pd.Series(rfc.feature_importances_, index=X.columns).sort_values().tail(10)
# plotting bar chart according to feature importance
rfc_importances.plot(kind='bar')
plt.show()

Observations:
We can test the feature sets obtained from both feature selection techniques by feeding each
set to the model; whichever set of features performs better can be retained for the model.

Feature selection techniques can differ from problem to problem, and the techniques
applied for this problem may or may not work for other problems. In those cases, feel
free to try out other methods like PCA, SelectKBest(), SelectPercentile(), t-SNE, etc.;
a SelectKBest sketch follows below.
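As a concrete example of one of the alternatives above, a hedged SelectKBest sketch (k=8 chosen only to mirror the RFE setting; mutual information is one reasonable score function for label-encoded inputs):

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hedged sketch: univariate selection of the 8 highest-scoring features.
selector = SelectKBest(score_func=mutual_info_classif, k=8)
selector.fit(X, y)
print('SelectKBest features:', X.columns[selector.get_support()].tolist())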

Grid-Search & Hyperparameter Tuning


Hyperparameters are function attributes that we have to specify for an algorithm. By now, you
should know that grid search is performed to find the best set of hyperparameters for your
model.
Grid Search for Random Forest
In the task below, we write code that performs hyperparameter tuning for a random forest
classifier. We have used the hyperparameters max_features, max_depth and criterion for this
task. Feel free to play around with this function by introducing a few more hyperparameters and
changing their values.

In [14]:
# splitting the data
x_train,x_val,y_train,y_val = train_test_split(X,y, test_size=0.3, random_state=42, stratify=y)
# selecting the classifier
rfc = RandomForestClassifier()
# selecting the parameter
param_grid = {
'max_features': ['auto', 'sqrt', 'log2'],
'max_depth' : [4,5,6,7,8],
'criterion' :['gini', 'entropy']
}
# using grid search with respective parameters
grid_search_model = GridSearchCV(rfc, param_grid=param_grid)
# fitting the model
grid_search_model.fit(x_train, y_train)
# printing the best parameters
print('Best Parameters are:',grid_search_model.best_params_)
Best Parameters are: {'criterion': 'gini', 'max_depth': 8, 'max_features': 'log2'}
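RandomizedSearchCV was imported earlier but not used. As a hedged alternative for larger search spaces, the same tuning can sample parameter settings at random and score them directly on ROC-AUC (the distributions below are illustrative choices, not the original settings):

from scipy.stats import randint

# Illustrative sketch: sample 20 parameter settings at random, scored on ROC-AUC.
param_dist = {
    'max_features': ['sqrt', 'log2'],
    'max_depth': randint(3, 12),
    'criterion': ['gini', 'entropy'],
    'n_estimators': randint(50, 300),
}
random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                   param_distributions=param_dist,
                                   n_iter=20, scoring='roc_auc', random_state=42)
random_search.fit(x_train, y_train)
print('Best Parameters are:', random_search.best_params_)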

Applying the best parameters obtained using Grid Search on the Random Forest model
In the task below, we fit a random forest model using the best parameters obtained from Grid
Search. Since the target is imbalanced, we apply the Synthetic Minority Oversampling Technique
(SMOTE), which balances the classes by synthesizing new examples of the minority class (it
oversamples the minority class rather than undersampling the majority).

Kindly note that SMOTE should always be applied only on the training data, never on the
validation and test data.

You can try experimenting with and without SMOTE and check the difference in recall; a quick
sketch of its effect on the class distribution follows below.
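To make the imbalance and SMOTE's effect concrete, a hedged sketch that checks the class distribution before and after resampling (on the training split only):

from collections import Counter
from imblearn.over_sampling import SMOTE

# SMOTE synthesizes new minority-class samples; apply it to the train split only.
print('Before SMOTE:', Counter(y_train))
X_sm, y_sm = SMOTE(random_state=42).fit_resample(x_train, y_train)
print('After SMOTE :', Counter(y_sm))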

In [15]:
from sklearn.metrics import roc_auc_score, roc_curve, classification_report
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from yellowbrick.classifier import roc_auc

# A function that trains the tuned random forest with SMOTE applied to the train split
def grid_search_random_forrest_best(dataframe, target):
    # splitting the data
    x_train, x_val, y_train, y_val = train_test_split(dataframe, target, test_size=0.3, random_state=42)

    # Applying SMOTE on train data for dealing with class imbalance
    smote = SMOTE()
    X_sm, y_sm = smote.fit_resample(x_train, y_train)

    rfc = RandomForestClassifier(n_estimators=11, max_features='auto', max_depth=8,
                                 criterion='entropy', random_state=42)
    rfc.fit(X_sm, y_sm)
    y_pred = rfc.predict(x_val)
    print(classification_report(y_val, y_pred))
    print(confusion_matrix(y_val, y_pred))
    visualizer = roc_auc(rfc, X_sm, y_sm, x_val, y_val)

grid_search_random_forrest_best(X, y)
              precision    recall  f1-score   support

           0       0.96      0.78      0.86      8723
           1       0.31      0.73      0.43      1162

    accuracy                           0.77      9885
   macro avg       0.63      0.76      0.65      9885
weighted avg       0.88      0.77      0.81      9885

[[6801 1922]
 [ 309  853]]
Applying the grid search function for random forest only on the best features obtained
using Random Forest
In [16]:
grid_search_random_forrest_best(X[['age', 'job', 'education', 'month', 'day_of_week',
                                   'duration', 'campaign', 'poutcome']], y)
              precision    recall  f1-score   support

           0       0.96      0.81      0.88      8723
           1       0.36      0.78      0.49      1162

    accuracy                           0.81      9885
   macro avg       0.66      0.80      0.69      9885
weighted avg       0.89      0.81      0.84      9885

[[7099 1624]
 [ 258  904]]
Ensembling
Ensemble learning uses multiple machine learning models to obtain better predictive
performance than could be obtained from any of the constituent learning algorithms alone. In the
task below, we have used an ensemble of three models:
RandomForestClassifier(), GradientBoostingClassifier() and LogisticRegression(). Feel free to
modify this function as per your requirements, fit more models, or change the parameters of
each model.

In [17]:
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import VotingClassifier

# splitting the data
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
# using SMOTE on the train split
smote = SMOTE()
X_sm, y_sm = smote.fit_resample(x_train, y_train)
# models to use for ensembling
model1 = RandomForestClassifier()
model2 = LogisticRegression()
model3 = GradientBoostingClassifier()
# fitting the soft-voting ensemble (class probabilities are averaged across models)
model = VotingClassifier(estimators=[('rf', model1), ('lr', model2), ('gb', model3)], voting='soft')
model.fit(X_sm, y_sm)
# predicting values and getting the metrics
y_pred = model.predict(x_val)
In [18]:
print(classification_report(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
visualizer = roc_auc(model,X_sm,y_sm,x_val,y_val)
              precision    recall  f1-score   support

           0       0.95      0.85      0.90      8723
           1       0.38      0.69      0.49      1162

    accuracy                           0.83      9885
   macro avg       0.67      0.77      0.70      9885
weighted avg       0.89      0.83      0.85      9885

[[7420 1303]
 [ 358  804]]

Prediction on the test data


In the task below, we perform a prediction on the test data. We have used the tuned Random
Forest model from the grid search above for this prediction. You can use whichever model gives
you the best metric score on the validation data.

We first read the test file; if the submission format requires an Id column, it should be stored
from the test file in a variable Id, since the submission file must carry the same Id for each
observation in the test data.
We have to perform the same preprocessing operations on the test data that we performed
on the train data (a sketch of this encoder reuse follows below). For demonstration purposes,
the test data has already been preprocessed, and this preprocessed data is present in the csv
file new_test.csv.

We then make a prediction on the preprocessed test data using the Grid Search Random Forest
model. As the final step, we concatenate this prediction with the Id column and convert the
result into a csv file, which becomes submission.csv.
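The preprocessing of the raw test file is not reproduced here. A hedged sketch of the key requirement, namely fitting each encoder on the training column and only transforming the test column (file paths are assumptions; unseen test categories would need separate handling):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed raw file paths, for illustration only.
raw_train = pd.read_csv('../input/banking-project-term-deposit/train.csv')
raw_test = pd.read_csv('../input/banking-project-term-deposit/test.csv')

for col in raw_test.select_dtypes(include='object').columns:
    le = LabelEncoder().fit(raw_train[col])      # fit on train only
    raw_test[col] = le.transform(raw_test[col])  # transform test consistently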

In [19]:
# Preprocessed Test File
test = pd.read_csv('../input/banking-project-term-deposit/new_test.csv')
test.head()
Out[19]:

   age  job  marital  education  default  housing  loan  contact  month  day_of_week  duration  campaign  poutcome
0   32    4        0          6        0        0     0        0      3            3       131         5         1
1   37   10        3          6        0        0     0        0      4            3       100         1         1
2   55    5        0          5        1        2     0        0      3            2       131         2         1
3   44    2        1          0        1        0     0        1      4            3        48         2         1
4   28    0        2          3        0        0     0        0      5            0       144         2         1

Fitting the final model and predicting on the preprocessed test file

In [21]:
# Refit the tuned model: apply SMOTE to the training data
smote = SMOTE()
X_sm, y_sm = smote.fit_resample(x_train, y_train)

rfc = RandomForestClassifier()
# selecting the parameters
param_grid = {
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
# using grid search with the respective parameters
grid_search_model = GridSearchCV(rfc, param_grid=param_grid)

# fitting the model
grid_search_model.fit(X_sm, y_sm)

# Predict on the preprocessed test file
y_pred = grid_search_model.predict(test)
In [22]:
prediction = pd.DataFrame(y_pred, columns=['y'])
# If the submission format requires an Id column, it can be concatenated here,
# e.g. submission = pd.concat([Id, prediction['y']], axis=1)
submission = prediction

submission.to_csv('submission.csv', index=False)

Conclusion: This project report outlines the steps taken to develop a predictive model for
forecasting whether a customer will subscribe to a term deposit. By leveraging machine learning
algorithms and analyzing customer attributes, the report provides valuable insights to financial
institutions for optimizing their marketing efforts and improving the subscription rate.
