
BUSINESS DATA MINING AND DECISION MAKING
Telemarketing

Submitted by:

Soumya Ojha (P18084)

Krati Sharma (P18041)

Upasana Ghosh (P18066)

Swapnil Jain (P18070)

Ajitesh Sahoo (P18092)
Problem Statement:

• A Portuguese banking institution wants to increase the number of subscribers to its product (bank term deposits).

• Clients are contacted through phone calls.

• The bank wants a model that predicts, based on a client's details, whether the client will subscribe to the product.

The banking institution expects the following benefits from solving this problem:

• Reduced cost: Fewer unproductive calls reduce the effort employees spend on calling customers, and thus the cost of operation.

• Focused attention on the right clients: Employees can concentrate on clients with a higher probability of converting.

• Increased profits: Higher conversion combined with lower operating cost would increase profits.

Data description:

Input Variables:
1 - age (numeric)

2 - job: type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student","blue-collar","self-employed","retired","technician","services")

3 - marital: marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

# related with the last contact of the current campaign:

9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric,
includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous
campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Output Variable:

y - has the client subscribed to a term deposit? (binary: "yes","no")

• There are no missing values in the dataset.

• 4521 rows, of which only 521 have a "yes" for product subscribed -> high class imbalance.

Insights from the data:


• The majority of contacted customers are in the 30-40 age range.

• Most clients with a housing loan do not subscribe to the product.
• The highest success rates are among students and retired clients.
• The longer the call duration, the greater the chance of success.

• If the customer was contacted in a previous campaign, the chance of a positive outcome is higher.

• There is also some correlation between success and the number of contacts performed during the campaign.

• Clients older than 60 have the highest subscription rate, followed by clients younger than 25.

• The subscription rate is higher when the number of calls is lower.

• The most contacts were made in May; the highest subscription rate was in March.



Approach taken for data cleaning:

1. Converted all non-numeric variables to numeric using label encoding.

2. Created X (the set of independent variables) and y (yes/no) for the entire dataset.

3. Split the dataset into train and test sets in an 80:20 ratio.

4. Ran a random forest on the training data and obtained feature importances.

5. Assumption made: features with an importance score below 0.005 are not significant and can be dropped.

6. Dropped the columns default and loan.
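
A minimal sketch of these steps, assuming the data sits in a semicolon-separated bank.csv with the column names listed above (the file name, separator, and random_state are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("bank.csv", sep=";")  # assumed file name and separator

# 1. Label-encode every non-numeric column, including the target y
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

# 2.-3. Independent variables X, target y, and an 80:20 train/test split
X = df.drop(columns="y")
y = df["y"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 4.-6. Feature importances from a random forest; drop features below 0.005
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X.columns)
to_drop = importances[importances < 0.005].index  # e.g. default, loan
X_train = X_train.drop(columns=to_drop)
X_test = X_test.drop(columns=to_drop)
```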

Logistic Regression:

1. After loading the data, we identified all the categorical variables.

2. We created a new dataset with one-hot encoding of all categorical variables into dummy variables.

3. We divided the data into input and output.

4. We split this new dataset into train and test sets in an 80:20 ratio: 80% of the data for training and 20% for testing.

5. We trained the model and evaluated it on the test data.
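
A minimal sketch of this pipeline, assuming pandas' get_dummies for the one-hot encoding and scikit-learn's LogisticRegression (file name and max_iter are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

df = pd.read_csv("bank.csv", sep=";")  # assumed file name and separator

# One-hot encode all categorical inputs; map the target to 0/1
X = pd.get_dummies(df.drop(columns="y"))
y = (df["y"] == "yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = logit.predict(X_test)

print(classification_report(y_test, y_pred))  # precision, recall, F1, accuracy
print("AUC:", roc_auc_score(y_test, logit.predict_proba(X_test)[:, 1]))
```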

Key Findings

• Precision: 0.60 - High

• Recall: 0.39 - Low

• F1-score: 0.48 - Low

• Accuracy: 90.49% - High

• AUC: 68% - Medium

Although accuracy is above 90%, recall is low: the model misses most of the clients who would actually subscribe. So this model is not good.

KNN Classifier:

• We built the model on the label-encoded, cleaned data (see the sketch below).
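
A minimal sketch, reusing the label-encoded split from the data-cleaning sketch above; the report does not state the number of neighbours, so scikit-learn's default of 5 is assumed:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# X_train/X_test: the label-encoded split from the data-cleaning step
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))
```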


Key findings

• Precision: 0.48

• Recall: 0.15

• Accuracy: 88.95%
• F1-score: 0.23

• AUC: 57%

Naïve Bayes:

Key Findings

• Precision: 0.33

• Recall: 0.46

• F1-score: 0.39

• Accuracy: 83.97%

• AUC: 68%
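
The report does not state which Naive Bayes variant was used; a minimal sketch assuming scikit-learn's GaussianNB on the same split:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

nb = GaussianNB().fit(X_train, y_train)
print(classification_report(y_test, nb.predict(X_test)))
```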

Support Vector Machine:


Linear kernel

• Created a new dataset with one-hot encoding for all categorical variables

• Split the dataset into train and test

• Trained the model

• Predicted output for the unseen test data
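
A minimal sketch, assuming scikit-learn's SVC on the one-hot-encoded split from the logistic-regression sketch:

```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Linear-kernel SVM on the one-hot-encoded train/test split
svm_linear = SVC(kernel="linear").fit(X_train, y_train)
print(classification_report(y_test, svm_linear.predict(X_test)))
```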

Data Insights

• Precision: 0.43
• Recall: 0.29
• F1-score: 0.35
• Accuracy: 87.9%

Support Vector Machine:

Radial basis function (RBF) kernel

• Created a new dataset with one-hot encoding for all categorical variables

• Split the dataset into train and test

• Trained the model

• Predicted output for the unseen test data


Data Insights

• Precision: 0

• Recall: 0

• F1-score: 0

• Accuracy: 89.0%

Zero precision, recall, and F1 together with 89% accuracy mean the RBF model predicts "no" for every client, i.e. it collapses to the majority class.
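
A minimal sketch of the RBF variant; class_weight="balanced" is shown only as one possible mitigation for the majority-class collapse, not a setting the report states was tried:

```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# RBF-kernel SVM; class_weight="balanced" is a hypothetical fix for the
# all-"no" predictions, not something reported by the authors
svm_rbf = SVC(kernel="rbf", class_weight="balanced").fit(X_train, y_train)
print(classification_report(y_test, svm_rbf.predict(X_test)))
```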

Decision Tree:

We ran a decision tree model with a maximum tree depth of 3 (see the sketch below).
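
A minimal sketch, assuming scikit-learn's DecisionTreeClassifier with max_depth=3 on the same split:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, tree.predict(X_test)))
```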


Key findings

• Precision: 0.51

• Recall: 0.24

• Accuracy: 89.17%

• F1-score: 0.33

• AUC: 61%

Random Forest:
Key Findings:

• Precision: 0.64

• Recall: 0.39

• Accuracy: 90.93%

• F1-score: 0.49

• AUC: 68%
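
A minimal sketch, assuming scikit-learn's RandomForestClassifier with default hyperparameters on the same split:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, forest.predict(X_test)))
```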

Synthetic minority oversampling (SMOTE):

• Used to handle the class imbalance issue

• imblearn was used; its combine module provides combined over- and under-sampling methods

• Oversampling was used to make the number of entries with y = "yes" equal to the number with y = "no"

• All the models were retrained on the oversampled train and test data (see the sketch below)
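
A minimal sketch using SMOTE from imblearn (the report mentions imblearn.combine, whose SMOTEENN and SMOTETomek combine over- and under-sampling; plain oversampling is shown here). Resampling only the training split is the usual practice, so that is what the sketch does:

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Oversample the minority class until y="yes" and y="no" counts are equal;
# only the training split is resampled here (the report resampled both)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
forest_smote = RandomForestClassifier(random_state=42).fit(X_res, y_res)
```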

Comparison of models:
Based on the metrics above, the Random Forest classifier was chosen.

Prediction for new customers:

• The trained random forest model was used for predicting new customer behaviour

• Overall accuracy of prediction -> 97.32%

• Recall of 0.92 and F1-score of 0.89

Issues Faced:

• Deciding which variables to drop, given the insignificant change in F1-score

• We were not able to plot the decision tree; even graphviz did not work (a possible workaround is sketched after this list)

• We were not able to plot the SVM ROC curve (also addressed in the sketch below)

• We faced issues while trying to implement cross-validation

• Logistic regression needs the same set of attributes on the test data and the unseen data. Here, the contact column of the unseen data did not contain values such as "cellular" and "telephone", so the model could not be used for prediction.

• 100% of the new customer data has poutcome = "unknown", whereas less than 12% of the data used to train the model was of this kind, leading to poor classification
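
Possible workarounds for the two plotting issues, assuming the fitted models from the sketches above: sklearn.tree.plot_tree avoids the graphviz dependency entirely, and an SVM ROC curve can be drawn from the decision_function scores:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.metrics import RocCurveDisplay

# Plot the depth-3 decision tree without needing graphviz
plot_tree(tree, feature_names=list(X_train.columns), filled=True)
plt.show()

# ROC curve for the linear SVM from decision-function scores
# (works even though SVC has no predict_proba by default)
RocCurveDisplay.from_predictions(y_test, svm_linear.decision_function(X_test))
plt.show()
```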
