
BUSINESS DATA MINING AND DECISION MAKING
Telemarketing

Submitted by:

Soumya Ojha (P18084)

Krati Sharma (P18041)

Upasana Ghosh (P18066)

Swapnil Jain (P18070)

Ajitesh Sahoo (P18092)
Problem Statement:

• A Portuguese banking institution wants to increase the number of subscribers to its product (bank term deposits).

• Clients are contacted through phone calls.

• The bank wants a model that predicts, based on a client's details, whether the client will subscribe to the product.

The banking institution expects the following benefits from solving this problem:

• Reduced cost: Fewer unproductive calls reduce the effort employees spend on calling customers, and thus the cost of operation.

• Focused attention on the right clients: Employees can concentrate on clients with a higher probability of converting.

• Increased profits: Higher conversion combined with lower operating cost would increase profits.

Data description:

Input Variables:
1 - age (numeric)

2 - job: type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student","blue-collar","self-employed","retired","technician","services")

3 - marital: marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

# related with the last contact of the current campaign:

9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric,
includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous
campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Output Variable:

y - has the client subscribed to a term deposit? (binary: "yes","no")

• There are no missing values in the dataset.

• 4521 rows, of which only 521 have a "yes" for product subscribed -> high class imbalance.

Insights from the data:


• The majority of contacted customers are in the 30-40 age range.

• Most clients with a housing loan do not subscribe to the product.
• The highest success rates are among students and retired clients.
• The longer the call duration, the greater the chance of success.

• If the customer was contacted in a previous campaign, the chance of a positive outcome is higher.

• There is also some correlation between success and the number of contacts performed during the campaign.

• Clients older than 60 have the highest subscription rate, followed by clients younger than 25.

• The subscription rate is higher when the number of calls is lower.

• The most contacts were made in May; the highest subscription rate was in March.



Approach taken for data cleaning:

1. Converted all non-numeric variables to numeric using label encoding.

2. Created X (the set of independent variables) and y (yes/no) for the entire dataset.

3. Split the dataset into train and test sets in an 80:20 ratio.

4. Ran a random forest on the training data and obtained feature importances.

5. Assumption made: features with an importance score below 0.005 are not significant and can be dropped.

6. Dropped the columns default and loan.
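
A minimal sketch of these steps, assuming the data sits in a semicolon-separated bank.csv with the column names listed above (the file name, separator, and random_state are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("bank.csv", sep=";")  # assumed file name and separator

# 1. Label-encode every non-numeric column, including the target y
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

# 2.-3. Independent variables X, target y, and an 80:20 train/test split
X = df.drop(columns="y")
y = df["y"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 4.-6. Feature importances from a random forest; drop features below 0.005
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X.columns)
to_drop = importances[importances < 0.005].index  # e.g. default, loan
X_train = X_train.drop(columns=to_drop)
X_test = X_test.drop(columns=to_drop)
```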

Logistic Regression:

1. After loading the data, we identified all the categorical variables.

2. We created a new dataset with one-hot encoding of all categorical variables into dummy variables.

3. We divided the data into input and output.

4. We split this new dataset into train and test sets in an 80:20 ratio: 80% of the data for training and 20% for testing.

5. We trained the model and evaluated it on the test data.
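
A minimal sketch of this pipeline, assuming pandas' get_dummies for the one-hot encoding and scikit-learn's LogisticRegression (file name and max_iter are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

df = pd.read_csv("bank.csv", sep=";")  # assumed file name and separator

# One-hot encode all categorical inputs; map the target to 0/1
X = pd.get_dummies(df.drop(columns="y"))
y = (df["y"] == "yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = logit.predict(X_test)

print(classification_report(y_test, y_pred))  # precision, recall, F1, accuracy
print("AUC:", roc_auc_score(y_test, logit.predict_proba(X_test)[:, 1]))
```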

Key Findings

• Precision: 0.60 - High

• Recall: 0.39 - Low

• F1-score: 0.48 - Low

• Accuracy: 90.49% - High

• AUC: 68% - Medium

Although accuracy is above 90%, recall is low: the model misses most of the clients who would actually subscribe. So this model is not good.

KNN Classifier:

• We built the model on the label-encoded, cleaned data (see the sketch below).
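
A minimal sketch, reusing the label-encoded split from the data-cleaning sketch above; the report does not state the number of neighbours, so scikit-learn's default of 5 is assumed:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# X_train/X_test: the label-encoded split from the data-cleaning step
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))
```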


Key findings

• Precision: 0.48

• Recall: 0.15

• Accuracy: 88.95%
• F1-score: 0.23

• AUC: 57%

Naïve Bayes:

Key Findings

• Precision: 0.33

• Recall: 0.46

• F1-score: 0.39

• Accuracy: 83.97%

• AUC: 68%
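
The report does not state which Naive Bayes variant was used; a minimal sketch assuming scikit-learn's GaussianNB on the same split:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

nb = GaussianNB().fit(X_train, y_train)
print(classification_report(y_test, nb.predict(X_test)))
```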

Support Vector Machine:


Linear kernel

• Created a new dataset with one-hot encoding for all categorical variables

• Split the dataset into train and test

• Trained the model

• Predicted output for the unseen test data
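
A minimal sketch, assuming scikit-learn's SVC on the one-hot-encoded split from the logistic-regression sketch:

```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Linear-kernel SVM on the one-hot-encoded train/test split
svm_linear = SVC(kernel="linear").fit(X_train, y_train)
print(classification_report(y_test, svm_linear.predict(X_test)))
```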

Data Insights

• Precision: 0.43
• Recall: 0.29
• F1-score: 0.35
• Accuracy: 87.9%

Support Vector Machine:

Radial basis function (RBF) kernel

• Created a new dataset with one-hot encoding for all categorical variables

• Split the dataset into train and test

• Trained the model

• Predicted output for the unseen test data


Data Insights

• Precision: 0

• Recall: 0

• F1-score: 0

• Accuracy: 89.0%

Zero precision, recall, and F1 together with 89% accuracy mean the RBF model predicts "no" for every client, i.e. it collapses to the majority class.
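
A minimal sketch of the RBF variant; class_weight="balanced" is shown only as one possible mitigation for the majority-class collapse, not a setting the report states was tried:

```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# RBF-kernel SVM; class_weight="balanced" is a hypothetical fix for the
# all-"no" predictions, not something reported by the authors
svm_rbf = SVC(kernel="rbf", class_weight="balanced").fit(X_train, y_train)
print(classification_report(y_test, svm_rbf.predict(X_test)))
```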

Decision Tree:

We ran a decision tree model with a maximum tree depth of 3 (see the sketch below).
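
A minimal sketch, assuming scikit-learn's DecisionTreeClassifier with max_depth=3 on the same split:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, tree.predict(X_test)))
```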


Key findings

• Precision: 0.51

• Recall: 0.24

• Accuracy: 89.17%

• F1-score: 0.33

• AUC: 61%

Random Forest:
Key Findings:

• Precision: 0.64

• Recall: 0.39

• Accuracy: 90.93%

• F1-score: 0.49

• AUC: 68%
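
A minimal sketch, assuming scikit-learn's RandomForestClassifier with default hyperparameters on the same split:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, forest.predict(X_test)))
```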

Synthetic minority oversampling (SMOTE):

• Used to handle the class imbalance issue

• imblearn was used; its combine module provides combined over- and under-sampling methods

• Oversampling was used to make the number of entries with y = "yes" equal to the number with y = "no"

• All the models were retrained on the oversampled train and test data (see the sketch below)
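
A minimal sketch using SMOTE from imblearn (the report mentions imblearn.combine, whose SMOTEENN and SMOTETomek combine over- and under-sampling; plain oversampling is shown here). Resampling only the training split is the usual practice, so that is what the sketch does:

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Oversample the minority class until y="yes" and y="no" counts are equal;
# only the training split is resampled here (the report resampled both)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
forest_smote = RandomForestClassifier(random_state=42).fit(X_res, y_res)
```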

Comparison of models:
Based on the metrics above, the Random Forest classifier was chosen.

Prediction for new customers:

• The trained random forest model was used for predicting new customer behaviour

• Overall accuracy of prediction -> 97.32%

• Recall of 0.92 and F1-score of 0.89

Issues Faced:

• Deciding which variables to drop, given the insignificant change in F1-score

• We were not able to plot the decision tree; even graphviz did not work (a possible workaround is sketched after this list)

• We were not able to plot the SVM ROC curve (also addressed in the sketch below)

• We faced issues while trying to implement cross-validation

• Logistic regression needs the same set of attributes on the test data and the unseen data. Here, the contact column of the unseen data did not contain values such as "cellular" and "telephone", so the model could not be used for prediction.

• 100% of the new customer data has poutcome = "unknown", whereas less than 12% of the data used to train the model was of this kind, leading to poor classification
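
Possible workarounds for the two plotting issues, assuming the fitted models from the sketches above: sklearn.tree.plot_tree avoids the graphviz dependency entirely, and an SVM ROC curve can be drawn from the decision_function scores:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.metrics import RocCurveDisplay

# Plot the depth-3 decision tree without needing graphviz
plot_tree(tree, feature_names=list(X_train.columns), filled=True)
plt.show()

# ROC curve for the linear SVM from decision-function scores
# (works even though SVC has no predict_proba by default)
RocCurveDisplay.from_predictions(y_test, svm_linear.decision_function(X_test))
plt.show()
```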
