DECISION MAKING
Telemarketing
Submitted by:
Upasana Ghosh (P18066)
Swapnil Jain (P18070)
Ajitesh Sahoo (P18092)
Problem Statement:
• The bank wants a model that predicts, from certain client details, whether a client will
subscribe to the product or not.
The banking institution is seeking the following benefits from solving this problem:
• Reduced cost: Fewer employee hours spent calling customers who are unlikely to subscribe,
which lowers the cost of operation.
• Increased profits: With a higher conversion rate and a lower cost of operation, the company
would be able to earn more profit.
Data description:
Input Variable
1 - age (numeric)
"blue-collar","self-employed","retired","technician","services")
11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric,
includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous
campaign (numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
Output Variable:
• Most people with a housing loan do not subscribe to the product.
• The highest success rate is among students and retired clients.
• The longer the call, the greater the chance of success.
• If the customer has been contacted previously, the chance of a positive outcome
is high.
• There is also some correlation between success and the number of contacts performed
during the campaign.
• Customers older than 60 subscribe at the highest rate, followed by customers
younger than 25.
2. Then, we created X (the set of independent variables) and y (yes/no) for the entire dataset.
3. Then, we divided the entire dataset into train and test sets using an 80:20 split.
4. We ran a random forest on the train and test sets and obtained the feature importances.
5. Assumption made: features with an importance score below 0.005 are not
significant and can be dropped.
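The split and the importance cut-off described in the steps above can be sketched in plain Python. This is a minimal illustration, not the report's actual code: the function names and the importance scores below are hypothetical, and only the 80:20 ratio and the 0.005 threshold come from the text.

```python
import random


def train_test_split_80_20(rows, seed=0):
    """Shuffle the rows and split them 80:20 into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def select_features(importances, threshold=0.005):
    """Keep features whose importance score is at least `threshold`."""
    return [name for name, score in importances.items() if score >= threshold]


# Hypothetical importance scores, for illustration only (not the report's values).
example_importances = {"age": 0.10, "campaign": 0.04, "pdays": 0.03, "previous": 0.002}
kept = select_features(example_importances)

# 100 placeholder row ids stand in for the dataset rows.
train, test = train_test_split_80_20(range(100))
```

With these made-up scores, `previous` falls below the threshold and is dropped, while the other three features are kept.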
Logistic Regression:
2. Then we created a new dataset in which all categorical variables were one-hot encoded into
dummy variables.
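As a rough sketch of what one-hot encoding does to a categorical column such as `month`, the snippet below replaces the column with one 0/1 dummy column per category. The helper `one_hot_encode` and the two sample rows are hypothetical and only illustrate the idea; the report does not show the code that was actually used.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row dict with one 0/1 dummy column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Two made-up rows, for illustration only.
rows = [{"age": 30, "month": "jan"}, {"age": 45, "month": "feb"}]
encoded = one_hot_encode(rows, "month")
```

Each encoded row keeps `age` and gains the columns `month_feb` and `month_jan`, exactly one of which is 1.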
Key Findings
Although accuracy is above 90%, recall is low, which means the model misses many of the
clients who actually subscribe. The model is therefore not a good fit for this problem.
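To see why high accuracy can coexist with poor recall, consider a hypothetical imbalanced sample in which a model predicts "no" for every client. The counts below are invented for illustration and are not taken from the report's dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical sample: 1000 clients, 100 subscribers, model always predicts "no".
accuracy, precision, recall = classification_metrics(tp=0, fp=0, fn=100, tn=900)
```

Here accuracy is 0.9 even though the model finds none of the subscribers, so recall is 0.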
KNN Classifier:
• Precision: 0.48
• Recall: 0.15
• Accuracy: 88.95%
• F1-score: 0.23
• AUC: 57%
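The F1-score is the harmonic mean of precision and recall, so it can be checked directly from the two values reported above. The function name `f1_from_precision_recall` is only illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the KNN classifier.
knn_f1 = f1_from_precision_recall(0.48, 0.15)  # about 0.23
```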
Naïve Bayes:
Key Findings
Precision: 0.33
Recall: 0.46
F1-score: 0.39
Accuracy: 83.97%
AUC: 68%
Data Insights
Precision 0.43
Recall 0.29
F1 Score 0.35
Accuracy 87.9%
Precision 0
Recall 0
F1 Score 0
Accuracy 89.0%
Decision Tree:
• Precision: 0.51
• Recall: 0.24
• Accuracy: 89.17%
• F1-score: 0.33
• AUC: 61%
Random Forest:
Key Findings:
• Precision: 0.64
• Recall: 0.39
• Accuracy: 90.93%
• F1-score: 0.49
• AUC: 68%
• Used oversampling to make the number of entries with y=yes equal to the number
with y=no
• Trained all the models on the oversampled train and test data
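One simple way to balance the classes is random oversampling, which duplicates minority-class rows at random until both classes have the same count. The report does not say which oversampling method was used, so the sketch below is an assumption, and the function name and sample rows are hypothetical.

```python
import random


def oversample_minority(rows, label_key="y", seed=0):
    """Randomly duplicate rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw with replacement to fill the gap up to the majority-class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Made-up sample: 9 "no" rows and 1 "yes" row.
sample = [{"y": "no"} for _ in range(9)] + [{"y": "yes"}]
balanced = oversample_minority(sample)
yes_count = sum(1 for row in balanced if row["y"] == "yes")
no_count = sum(1 for row in balanced if row["y"] == "no")
```

After oversampling, the sample holds 9 rows of each class.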
Comparison of models:
Based on the above data, the Random Forest Classifier was chosen.
• The model trained with random forest was used to predict new customer behaviour
Issues Faced:
• Logistic regression needs the same number of attributes in the test data and the unseen data.
Here, the contact column in the unseen data did not contain values such as cellular and
telephone, so the model could not be used for prediction.
• In the new customer data, poutcome is unknown for 100% of records, whereas less than 12% of
the test data available to train the model had this value, leading to poor classification.
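One possible way around the attribute mismatch described in the first issue is to fix the list of categories from the training data and reuse it when encoding unseen data, so that both end up with the same dummy columns. This is a sketch of that idea, not something the report says was done; the helper names are hypothetical, and the value `"unknown"` for the contact column is a placeholder chosen for illustration.

```python
def fit_categories(rows, column):
    """Collect the sorted set of categories seen in the training rows."""
    return sorted({row[column] for row in rows})


def encode_with_categories(rows, column, categories):
    """One-hot encode `column` using a fixed category list, so columns always match."""
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Made-up rows; "unknown" is a placeholder value, not taken from the report.
train_rows = [{"contact": "cellular"}, {"contact": "telephone"}, {"contact": "unknown"}]
unseen_rows = [{"contact": "unknown"}]

categories = fit_categories(train_rows, "contact")
train_encoded = encode_with_categories(train_rows, "contact", categories)
unseen_encoded = encode_with_categories(unseen_rows, "contact", categories)
```

Because the category list comes from the training rows, the unseen row still gets `contact_cellular` and `contact_telephone` columns (both 0), and a model fitted on the training columns can accept it.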