Abstract: This project report aims to analyze the factors influencing a customer's decision to
subscribe to a term deposit and develop a predictive model to forecast the likelihood of
subscription. The report outlines the data collection process, data preprocessing techniques, feature
engineering approaches, model development, evaluation metrics, and interpretation of results. The
project findings provide insights for financial institutions to optimize their marketing strategies
and improve the subscription rate.
Problem Statement
Evaluation Metric
We will be using ROC-AUC for evaluation.
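To make the metric concrete: ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on made-up labels and probabilities (the helper name `roc_auc` and the toy values are illustrative assumptions, not part of the notebook). It also shows that scoring thresholded labels instead of probabilities can give a lower value, which matters because the cells in this notebook pass `model.predict(...)` outputs to `roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy validation labels and hypothetical predicted probabilities of 'yes'.
y_val = [0, 0, 0, 1, 0, 1]
proba = [0.10, 0.40, 0.35, 0.80, 0.60, 0.45]

# Scoring the probabilities keeps the full ranking of the examples.
auc_from_proba = roc_auc(y_val, proba)

# Thresholding at 0.5 first throws most of the ranking information away.
labels = [1 if p >= 0.5 else 0 for p in proba]
auc_from_labels = roc_auc(y_val, labels)

print(auc_from_proba, auc_from_labels)
```

On this toy data the probability-based score is higher than the label-based one. For a fitted scikit-learn classifier, the analogous probability input would usually be `model.predict_proba(x_val)[:, 1]`.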
For each step, we define the task to be performed and then write the code to solve it.
The tasks below should serve as a good guide to the steps involved in approaching a
machine learning problem. However, do not restrict yourself to the tasks performed in this
notebook; feel free to bring your own ideas, skills, and strategies and implement them as
well.
Word of caution
This template is just one example of a data science pipeline. Every data science problem is
unique, and there are multiple ways to tackle each one. Go through this template and try to
leverage the information in it while solving your hackathon problems, but note that you may not
be able to use all the functions created here.
The data is related to direct marketing campaigns of a Portuguese banking institution. The
marketing campaigns were based on phone calls. Often, more than one contact with the same client
was required in order to assess whether the product (bank term deposit) would be subscribed
('yes') or not ('no').
There are two datasets:
train.csv, with all 32,950 training examples and 21 columns including the target feature,
ordered by date (from May 2008 to November 2010), very close to the data analyzed in
[Moro et al., 2014].
test.csv, the test data, which consists of 8,238 observations and 20 features without the
target feature.
Goal: predict whether the client will subscribe (yes/no) to a term deposit (variable y).
Features
Feature      Feature Type          Description
default      categorical, nominal  has credit in default? ('no','yes','unknown')
housing      categorical, nominal  has housing loan? ('no','yes','unknown')
loan         categorical, nominal  has personal loan? ('no','yes','unknown')
contact      categorical, nominal  contact communication type ('cellular','telephone')
month        categorical, ordinal  last contact month of year ('jan', 'feb', 'mar', ..., 'nov', 'dec')
day_of_week  categorical, ordinal  last contact day of the week ('mon','tue','wed','thu','fri')
campaign     numeric               number of contacts performed during this campaign and for this client (includes last contact)
pdays        numeric               number of days that passed by after the client was last contacted from a previous campaign (999 means client was not previously contacted)
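Because 999 in pdays is a sentinel rather than a real day count, treating the column as an ordinary number can mislead scale-sensitive models. One possible encoding, shown as a minimal sketch and not necessarily what this notebook did, splits the column into a "previously contacted" flag and a day count that is missing when there was no earlier contact. The helper name `split_pdays` and the sample values are illustrative assumptions.

```python
NOT_CONTACTED = 999  # sentinel value documented in the feature table


def split_pdays(pdays):
    """Return (previously_contacted, days_since_contact) for one pdays value.

    days_since_contact is None when the client was never contacted before.
    """
    if pdays == NOT_CONTACTED:
        return 0, None
    return 1, pdays


# Made-up pdays values for four clients.
rows = [999, 3, 999, 6]
encoded = [split_pdays(value) for value in rows]
print(encoded)
```

Keeping the flag separate lets a model learn "was contacted before" without treating 999 as an unusually long gap.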
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Machine learning offers many classification algorithms, which are used for different
classification applications. Some of the main ones are as follows:
• Logistic Regression
• DecisionTree Classifier
• RandomForest Classifier
The code we have written below internally splits the data into training data and validation data. It
then fits the classification model on the train data and then makes a prediction on the validation
data and outputs the scores for this prediction.
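The internal split described above can be sketched in plain Python. This is a simplified illustration of a stratified hold-out split, meaning each class keeps roughly the same share in both parts; it is not scikit-learn's `train_test_split` implementation. The function name `stratified_split`, the 30% validation fraction, and the toy labels are assumptions chosen for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.3, seed=42):
    """Split row indices into train/validation parts, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomise the order within each class
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy imbalanced target: 90 'no' (0) and 10 'yes' (1).
labels = [0] * 90 + [1] * 10
train_idx, val_idx = stratified_split(labels)

print(len(train_idx), len(val_idx), sum(labels[i] for i in val_idx))
```

With an imbalanced target like this one, stratifying keeps the minority class from being under-represented in the validation part by chance.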
# Target
y = dataframe.iloc[:,-1]
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
Classification Report:
precision recall f1-score support
ROC_AUC_SCORE is 0.5742166403601381
The above two steps are combined and run in a single cell for each of the remaining models.
In [7]:
# Run Decision Tree Classifier
model = DecisionTreeClassifier()
model.fit(x_train, y_train)
# predict() returns hard class labels, so the ROC curve below has a single interior point;
# model.predict_proba(x_val)[:, 1] would give a threshold-free score instead
y_scores = model.predict(x_val)
auc = roc_auc_score(y_val, y_scores)
print('Classification Report:')
print(classification_report(y_val, y_scores))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_scores)
print('ROC_AUC_SCORE is', auc)
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
Classification Report:
precision recall f1-score support
ROC_AUC_SCORE is 0.6924298608715649
In [8]:
from sklearn import tree
from sklearn.tree import export_graphviz # display the tree within a Jupyter notebook
from IPython.display import SVG
from graphviz import Source
from IPython.display import display
from ipywidgets import interactive, IntSlider, FloatSlider, interact
import ipywidgets
from IPython.display import Image
from subprocess import call
import matplotlib.image as mpimg
In [9]:
@interact
def plot_tree(crit=["gini", "entropy"],
              split=["best", "random"],
              depth=IntSlider(min=1, max=30, value=2, continuous_update=False),
              min_split=IntSlider(min=2, max=5, value=2, continuous_update=False),
              min_leaf=IntSlider(min=1, max=5, value=1, continuous_update=False)):
    estimator = DecisionTreeClassifier(random_state=0,
                                       criterion=crit,
                                       splitter=split,
                                       max_depth=depth,
                                       min_samples_split=min_split,
                                       min_samples_leaf=min_leaf)
    estimator.fit(x_train, y_train)
    print('Decision Tree Training Accuracy: {:.3f}'.format(accuracy_score(y_train, estimator.predict(x_train))))
    print('Decision Tree Test Accuracy: {:.3f}'.format(accuracy_score(y_val, estimator.predict(x_val))))
    graph = Source(tree.export_graphviz(estimator,
                                        out_file=None,
                                        feature_names=x_train.columns,
                                        class_names=['0', '1'],
                                        filled=True))
    display(Image(data=graph.pipe(format='png')))
    return estimator
Decision Tree Training Accuracy: 0.896
Decision Tree Test Accuracy: 0.889
DecisionTreeClassifier(max_depth=2, random_state=0)
In [10]:
# Run Random Forest Classifier
model = RandomForestClassifier()
model.fit(x_train, y_train)
# predict() returns hard class labels, so the ROC curve below has a single interior point;
# model.predict_proba(x_val)[:, 1] would give a threshold-free score instead
y_scores = model.predict(x_val)
auc = roc_auc_score(y_val, y_scores)
print('Classification Report:')
print(classification_report(y_val, y_scores))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_scores)
print('ROC_AUC_SCORE is', auc)
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
Classification Report:
precision recall f1-score support
ROC_AUC_SCORE is 0.662138372340166
Feature Selection
Now that we have applied vanilla models to our data, we have a basic understanding of
what our predictions look like. Let's now use feature selection methods to identify the best
set of features for each model.
Observations :
We can test the features obtained from both feature selection techniques by feeding them to
the model and retaining whichever set of features performs better.
Feature selection techniques can differ from problem to problem, and the techniques
applied here may or may not work for other problems. In those cases, feel free to try out
other methods such as SelectKBest() and SelectPercentile(), or dimensionality reduction
methods such as PCA and t-SNE.
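As a rough illustration of what a univariate filter in the spirit of SelectKBest() does, the sketch below scores each feature independently against the target and keeps the k best. It uses the absolute Pearson correlation as the score, which is a simple stand-in and not one of scikit-learn's scoring functions. The helper names and the toy columns are assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / sqrt(var_x * var_y)


def top_k_features(columns, target, k):
    """Rank features by |correlation with the target| and keep the k best."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Made-up feature columns and a binary target.
columns = {
    "duration": [100, 300, 120, 400, 90, 350],
    "campaign": [1, 2, 5, 1, 3, 2],
    "constant": [7, 7, 7, 7, 7, 7],
}
target = [0, 1, 0, 1, 0, 1]

selected, scores = top_k_features(columns, target, k=2)
print(selected)
```

A filter like this looks at one feature at a time, so it can miss features that are only useful in combination. That is one reason to compare it with model-based rankings before settling on a feature set.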
In [14]:
# splitting the data
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# selecting the classifier
rfc = RandomForestClassifier()
# selecting the parameter
# ('auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3)
param_grid = {
    'max_features': ['sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
# using grid search with respective parameters
grid_search_model = GridSearchCV(rfc, param_grid=param_grid)
# fitting the model
grid_search_model.fit(x_train, y_train)
# printing the best parameters
print('Best Parameters are:', grid_search_model.best_params_)
Best Parameters are: {'criterion': 'gini', 'max_depth': 8, 'max_features': 'log2'}
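Conceptually, a grid search tries every combination of the listed parameter values and keeps the best-scoring one. The sketch below shows only that enumeration in plain Python, using a grid similar to the one above. The scoring function `toy_score` is an arbitrary stand-in invented for this example; in the notebook, the score comes from cross-validated model performance inside GridSearchCV.

```python
from itertools import product

param_grid = {
    "max_features": ["sqrt", "log2"],
    "max_depth": [4, 5, 6, 7, 8],
    "criterion": ["gini", "entropy"],
}


def expand_grid(grid):
    """List every parameter combination in the grid as a dict."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def grid_search(grid, score_fn):
    """Return the combination with the highest score, and that score."""
    best_params, best_score = None, float("-inf")
    for params in expand_grid(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Arbitrary stand-in for a cross-validated score (illustration only)."""
    bonus = 0.5 if params["max_features"] == "log2" else 0.0
    bonus += 0.25 if params["criterion"] == "gini" else 0.0
    return params["max_depth"] + bonus


candidates = expand_grid(param_grid)
best_params, best_score = grid_search(param_grid, toy_score)
print(len(candidates), best_params)
```

The number of candidates is the product of the list lengths, and each candidate is fitted once per cross-validation fold, so the cost grows quickly as values are added to the grid.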
Applying the best parameters obtained using Grid Search on Random Forest model
In the task below, we fit a random forest model using the best parameters obtained from Grid
Search. Since the target is imbalanced, we apply the Synthetic Minority Oversampling Technique
(SMOTE), which oversamples the minority class by generating synthetic examples; it does not
undersample the majority class.
Note that SMOTE should be applied only to the training data, never to the validation or
test data.
You can experiment with and without SMOTE and compare the recall.
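The core idea behind SMOTE is to create each synthetic minority example by interpolating between a real minority example and one of its nearest minority-class neighbours. The sketch below is a simplified plain-Python illustration of that idea, not the imblearn implementation. The function name `smote_like`, the neighbour count, and the toy points are assumptions made for the example.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (squared Euclidean distance)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Made-up minority-class training points with two numeric features.
minority = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Because the synthetic points are built from training examples, generating them before the train/validation split would leak information into the validation data. This is why resampling is restricted to the training split.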
In [15]:
from sklearn.metrics import roc_auc_score,roc_curve,classification_report
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from yellowbrick.classifier import roc_auc
rfc.fit(X_sm, y_sm)
y_pred = rfc.predict(x_val)
print(classification_report(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
visualizer = roc_auc(rfc,X_sm,y_sm,x_val,y_val)
grid_search_random_forrest_best(X,y)
precision recall f1-score support
[[6801 1922]
[ 309 853]]
Applying the grid search function for random forest only on the best features obtained
using Random Forest
In [16]:
grid_search_random_forrest_best(X[['age', 'job', 'education', 'month', 'day_of_week', 'duration', 'campaign', 'poutcome']],y)
precision recall f1-score support
[[7099 1624]
[ 258 904]]
Ensembling
Ensemble learning uses multiple machine learning models to obtain better predictive
performance than could be obtained from any of the constituent learning algorithms alone. In the
task below, we use an ensemble of three models: RandomForestClassifier(),
GradientBoostingClassifier(), and LogisticRegression(). Feel free to modify this function as per
your requirements, fit more models, or change the parameters of each model.
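One common way to combine such models is soft voting: average the probability each model assigns to the positive class, then threshold the average. The sketch below illustrates that rule on made-up probabilities. scikit-learn's VotingClassifier supports both 'hard' (majority of labels) and 'soft' voting, and which one this notebook used is not shown, so this sketch assumes soft voting. The function name `soft_vote` and the numbers are hypothetical.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-model positive-class probabilities, per example."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    n_examples = len(probabilities[0])
    return [
        sum(w * model[i] for w, model in zip(weights, probabilities)) / total
        for i in range(n_examples)
    ]


# Hypothetical probabilities of 'yes' from three fitted models for three clients.
random_forest = [0.9, 0.2, 0.6]
gradient_boosting = [0.8, 0.4, 0.3]
logistic_regression = [0.7, 0.3, 0.3]

averaged = soft_vote([random_forest, gradient_boosting, logistic_regression])
predictions = [1 if p >= 0.5 else 0 for p in averaged]
print(predictions)
```

For the third client, one model leans towards 'yes' but the averaged probability falls below 0.5. This shows how an ensemble can overrule a single confident model.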
In [17]:
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import VotingClassifier
[[7420 1303]
[ 358 804]]
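The confusion matrices printed above can be turned into comparable numbers with a little arithmetic. The sketch below assumes the layout used by scikit-learn's confusion_matrix, where rows are true classes and columns are predicted classes, so each matrix reads [[TN, FP], [FN, TP]]. It also assumes class 1 is the 'yes' (subscribed) class. The dictionary keys are descriptive labels added for this example.

```python
def positive_class_metrics(matrix):
    """Precision, recall and accuracy from a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = matrix
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
    }


# Matrices copied from the outputs above.
matrices = {
    "random forest, all features": [[6801, 1922], [309, 853]],
    "random forest, selected features": [[7099, 1624], [258, 904]],
    "ensemble": [[7420, 1303], [358, 804]],
}

results = {name: positive_class_metrics(m) for name, m in matrices.items()}
for name, metrics in results.items():
    print(f"{name}: precision={metrics['precision']:.3f} "
          f"recall={metrics['recall']:.3f} accuracy={metrics['accuracy']:.3f}")
```

Under these assumptions, the random forest on the selected features has the highest recall on the positive class, while the ensemble has the highest precision and accuracy. Which model is preferable depends on whether missed subscribers or wasted calls cost more.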
In the task below, we read the test file and store its Id column in a variable Id. This column
is needed for the submission, because the submission file must contain an Id column with the
same Ids as the observations in the test data.
We have to perform the same preprocessing operations on the test data that we performed
on the train data. For demonstration purposes, we have preprocessed the test data, and this
preprocessed data is present in the csv file test_preprocessed.csv.
We then make a prediction on the preprocessed test data using the Grid Search Logistic
regression model. As the final step, we concatenate this prediction with the Id column and
convert the result into a csv file, which becomes submission.csv.
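The final assembly step amounts to pairing each Id with its prediction and writing the pairs out as CSV. The notebook does this with pandas; the sketch below shows the same idea with only the standard library, writing to an in-memory buffer. The column names `Id` and `y`, the function name `build_submission`, and the sample values are assumptions for illustration.

```python
import csv
import io


def build_submission(ids, predictions, id_col="Id", target_col="y"):
    """Return CSV text with one (Id, prediction) row per test observation."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([id_col, target_col])
    writer.writerows(zip(ids, predictions))
    return buffer.getvalue()


# Made-up Ids and predicted labels for three test rows.
submission_text = build_submission([101, 102, 103], [0, 1, 0])
print(submission_text)
```

The length check guards against the most common slip at this stage, which is predicting on a test frame whose row count no longer matches the stored Id column.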
In [19]:
# Preprocessed Test File
test = pd.read_csv('../input/banking-project-term-deposit/new_test.csv')
test.head()
Out[19]:
   age  job  marital  education  default  housing  loan  contact  month  day_of_week  duration  campaign  poutcome
0   32    4        0          6        0        0     0        0      3            3       131         5         1
1   37   10        3          6        0        0     0        0      4            3       100         1         1
2   55    5        0          5        1        2     0        0      3            2       131         2         1
3   44    2        1          0        1        0     0        1      4            3        48         2         1
4   28    0        2          3        0        0     0        0      5            0       144         2         1
rfc = RandomForestClassifier()
# selecting the parameter
# ('auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3)
param_grid = {
    'max_features': ['sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
# using grid search with respective parameters
grid_search_model = GridSearchCV(rfc, param_grid=param_grid)
submission.to_csv('submission.csv',index=False)
Conclusion: This project report outlines the steps taken to develop a predictive model for
forecasting whether a customer will subscribe to a term deposit. By leveraging machine learning
algorithms and analyzing customer attributes, the report provides valuable insights to financial
institutions for optimizing their marketing efforts and improving the subscription rate.