Machine Learning in Business
MIS710 – A2
Part A. Case Study Report
Table of Contents
1. Introduction
1.1 Objective
2. Methodology
2.1 Overview of the machine learning approach
3. Data preparation and Exploratory Data Analysis (EDA)
3.1 Data sources
4. Model development and evaluation
4.1 Supervised Machine Learning
4.2 Unsupervised Machine Learning
5. Solution recommendation
5.1 Supervised Machine Learning
5.2 Unsupervised Machine Learning
6. Technical recommendations
References
Executive Summary
This report provides a strategic analysis and machine learning solution for Great Ocean Bank aimed at
enhancing marketing campaign effectiveness and customer understanding. Using the GoBank dataset,
which includes customer demographics, banking relationships, and economic indicators, the study
predicts 'Sale' or 'No Sale' outcomes and identifies key influencing factors.
Objectives and Approach: The primary objective was to develop predictive and clustering models to
optimize marketing strategies. The approach included:
● Data preparation and exploratory data analysis (EDA) to uncover key patterns.
● Development and evaluation of two predictive machine learning models.
● Implementation of DBSCAN clustering analytics to segment customers based on similarities.
Key Findings:
● Demographic factors such as age and qualification have less predictive power for sales
outcomes than the other variables.
● Existing account types affect customers' likelihood of engaging with new banking products.
● The method of last contact and past campaign results are crucial predictors of future sales.
● Economic indicators significantly impact sales outcomes.
Recommendations:
● Model Deployment: The Logistic Regression model was chosen for its simplicity,
explainability, and reasonable accuracy.
● Customer Segmentation: DBSCAN clustering showed the most promise, allowing for more
targeted marketing strategies.
The proposed solutions are expected to significantly improve marketing efficiency and customer
satisfaction for Great Ocean Bank.
1. Introduction
1.1 Objective
Great Ocean Banking Group, serving over 1 million customers in Victoria, Australia, seeks to enhance
the effectiveness of its marketing campaigns by understanding the factors influencing campaign
outcomes. The business problem is to predict potential 'Sale' or 'No Sale' outcomes and segment
customers for targeted marketing efforts. This project leverages data analytics and machine learning
to provide insights aiming to optimize marketing strategies, improve customer engagement, and drive
higher returns on investment. The value proposition lies in more efficient resource allocation,
personalized customer interactions, and data-driven decision-making, ultimately fostering stronger
customer relationships and increasing overall satisfaction.
2. Methodology
2.1 Overview of the machine learning approach
Data Preprocessing: Clean and preprocess the dataset (GoBank.csv) by handling missing values,
encoding categorical variables, and scaling numerical features.
Feature Engineering: Select and engineer relevant features that could influence the prediction, such
as customer demographics, previous interactions, and economic indicators.
Model Selection: Experiment with various classification algorithms, including Logistic Regression and
Random Forest, to identify the best-performing model.
Model Evaluation: Evaluate the models using cross-validation techniques and metrics such as
accuracy, precision, recall, and F1 score to ensure robustness and reliability.
Hyper-parameter Tuning: Optimize the selected model's hyper-parameters using Grid Search with
Cross-Validation to achieve the best possible performance.
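The tuning step above could be sketched roughly as follows. This is a minimal illustration on synthetic data, not the report's actual configuration: the data from `make_classification`, the grid of `C` values, and the choice of F1 as the scoring metric are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features (assumption: not the GoBank data).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling sits inside the pipeline so each cross-validation fold is scaled
# using only its own training portion.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

best_params = search.best_params_
test_f1 = search.score(X_test, y_test)  # F1 on the held-out split
```

Keeping the scaler inside the pipeline is one way to avoid leaking test-fold statistics into training during cross-validation.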
Image D
3. Data preparation and Exploratory Data Analysis (EDA)
3.1 Data sources
Image 1A (Source: Self-Created)
The dataset from Great Ocean Bank contains 22,940 entries and 19 columns, capturing customer
demographics, banking relationships, last contact details from marketing campaigns, and economic
indicators. Key columns, such as 'Qualification' and 'Previous Campaign Outcome', contain some null
values, which need to be addressed. Overall, the dataset is relatively clean but requires preprocessing.
Several preprocessing steps were undertaken to enhance the dataset's quality and suitability.
3.2. Handling Missing Values:
Given that the proportion of missing values was less than 1%, the decision was made to drop these
rows. This approach was chosen to avoid tampering with the data or introducing any bias that
imputation methods might cause. By removing these few instances, the dataset's integrity and
reliability were maintained.
Image 1B
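A minimal sketch of the row-dropping step, assuming pandas is used. The column names 'Qualification' and 'Previous Campaign Outcome' come from the dataset description; the rows themselves are invented for illustration, so the missing share in this toy frame is far larger than the under-1% figure reported for the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values (assumption: not the GoBank data).
df = pd.DataFrame(
    {
        "Age": [34, 51, 28, 45],
        "Qualification": ["Bachelor", np.nan, "Diploma", "Master"],
        "Previous Campaign Outcome": ["Success", "Failure", np.nan, "Failure"],
    }
)

# Share of missing values per column, used to judge whether dropping is safe.
missing_share = df.isna().mean()

# Drop every row that has at least one missing value.
df_clean = df.dropna().reset_index(drop=True)
```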
3.3. Encoding Categorical Variables:
Machine learning models require numerical input, necessitating the conversion of categorical
variables into numerical format. Two primary methods were used:
● Label Encoding: For ordinal categorical variables, label encoding was used to maintain the
inherent order.
● One-Hot Encoding: For nominal categorical variables, one-hot encoding was applied to create
binary columns for each category, preventing any ordinal relationships from being inferred
where none exist.
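The two encoding approaches can be illustrated as below. The category values, the assumed ordering of qualifications, and the 'Contact Method' column name are hypothetical. An explicit category order is used for the ordinal column because scikit-learn's `LabelEncoder` assigns codes alphabetically, which would not necessarily match the intended order.

```python
import pandas as pd

# Invented sample values (assumption: not the GoBank categories).
df = pd.DataFrame(
    {
        "Qualification": ["Diploma", "Master", "Bachelor", "Diploma"],
        "Contact Method": ["Mobile", "Landline", "Mobile", "Mobile"],
    }
)

# Ordinal variable: map categories to integers in an explicit, assumed order.
qualification_order = ["Diploma", "Bachelor", "Master"]
df["Qualification_encoded"] = pd.Categorical(
    df["Qualification"], categories=qualification_order, ordered=True
).codes

# Nominal variable: one binary column per category.
encoded = pd.get_dummies(df, columns=["Contact Method"], dtype=int)
```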
3.4. Scaling Numerical Features:
Feature scaling is essential to ensure that numerical features contribute equally to the model,
especially when using algorithms sensitive to feature magnitudes.
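As a brief illustration, standardization with scikit-learn's `StandardScaler` (listed among the libraries used) might look like this. The two columns and their values are invented; the point is that the scaler is fitted on the training split only and then reused on the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two invented numerical features on very different scales (e.g. age, balance).
X_train = np.array([[25, 1000.0], [40, 5000.0], [55, 9000.0]])
X_test = np.array([[40, 5000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```

After scaling, each training column has zero mean and unit variance, so neither feature dominates purely because of its units.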
Before the categorical variables were encoded, a univariate analysis of the 18 variables was
conducted to understand the distribution and characteristics of each feature. This analysis helped
identify the nature of the variables, such as their central tendency, variability, and the presence of
any outliers.
Image 1C
Image 1D
The initial analysis revealed a class imbalance in the target variable, which is critical to address for
accurate model performance.
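The report does not state how the imbalance was handled. One common first step, shown here purely as an assumption, is to measure the class shares and use a stratified split so that the training and test sets keep the same proportion of each class. The 80/20 class ratio below is invented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented imbalanced target: 80 'No Sale' (0) and 20 'Sale' (1).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

train_share = y_train.mean()  # share of class 1 in the training split
test_share = y_test.mean()    # share of class 1 in the test split
```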
3.5. Feature selection:
The feature selection utilizes the chi-square (χ²) statistical test to identify the top k features most
relevant to the target variable. This is achieved using the SelectKBest class from
sklearn.feature_selection, which ranks features based on their chi-square scores and selects the top k
highest-scoring ones. After fitting this selector to the scaled training data (X_train_scaled) and
applying it to both training and test datasets, the method retrieves the indices of the selected
features, which are then used to print their names. This process aids in dimensionality reduction by
retaining only the most statistically significant features, potentially enhancing model performance and
interpretability.
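The selection procedure described above can be sketched as follows. The data is synthetic and k = 4 is an arbitrary choice. Because scikit-learn's `chi2` requires non-negative inputs, this sketch assumes the features were scaled to the [0, 1] range with `MinMaxScaler`; the report does not say which scaler produced `X_train_scaled`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# chi2 needs non-negative values, so scale to [0, 1] (assumption).
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Rank features by chi-square score and keep the top k.
selector = SelectKBest(score_func=chi2, k=4)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

# Recover the names of the selected features from their indices.
selected_names = [feature_names[i] for i in selector.get_support(indices=True)]
```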
Image 2A
Image 2B
4. Model development and evaluation
4.1 Supervised Machine Learning
LOGISTIC REGRESSION
Logistic regression is a statistical method used for binary classification tasks, where the goal is to
predict the probability of an instance belonging to one of two classes. It's called "logistic" because it's
based on the logistic function, also known as the sigmoid function, which maps any real-valued
number to a value between 0 and 1.
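For reference, the sigmoid function is σ(z) = 1 / (1 + e^(−z)). A small standalone sketch follows; the two-branch form is simply a way to avoid floating-point overflow for large negative inputs.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite the formula so exp() never receives a large positive value.
    e = math.exp(z)
    return e / (1.0 + e)


midpoint = sigmoid(0.0)  # a score of 0 corresponds to a probability of 0.5
```

In logistic regression, z is a weighted sum of the input features, and the sigmoid output is read as the probability of the positive class.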
Important parameters:
· penalty: Type of regularization.
· C: Inverse of regularization strength.
· solver: Optimization algorithm.
· max_iter: Maximum number of iterations.
· multi_class: Handling of multiple classes.
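A minimal sketch of how some of these parameters are passed to scikit-learn's `LogisticRegression`. The data is synthetic and the parameter values are illustrative, not the settings used in the report. `penalty` and `multi_class` are left at their library defaults to keep the sketch short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption: not the GoBank data).
X, y = make_classification(n_samples=400, n_features=6, random_state=1)

model = LogisticRegression(
    C=1.0,           # inverse of regularization strength
    solver="lbfgs",  # optimization algorithm
    max_iter=500,    # maximum number of iterations
)
model.fit(X, y)

sale_probabilities = model.predict_proba(X)[:, 1]  # probability of class 1
coefficients = model.coef_[0]                      # one weight per feature
```

The sign and size of each coefficient indicate the direction and strength of a feature's effect on the predicted log-odds, which is the basis for the model's interpretability.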
Image 4A
Image 4B
The accuracy of the logistic regression model is 86.89%. In addition, Table 3A, generated by the
software, shows a precision of 86%, meaning that most of the positive predictions are correct.
Similarly, the recall is 87%, meaning that some positive instances were not identified. The F1
score is 87%. The confusion matrix shows that 3,627 instances of class 0 and 846 instances of
class 1 were correctly classified.
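For clarity, the relationship between confusion-matrix counts and these metrics can be written out as a small function. The counts used below are hypothetical and are not the report's confusion matrix.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard metrics from confusion-matrix counts.

    Assumes the denominators are non-zero (at least one predicted and one
    actual positive).
    """
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=50, fp=20, fn=50, tn=380)
```

With an imbalanced target, accuracy can remain high even when recall for the minority class is low, which is why the per-class figures matter.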
DECISION TREE CLASSIFIER
A decision tree classifier is like a flowchart that makes decisions based on the features of data. It
starts at the root and asks questions about the features, splitting the data into smaller groups at each
node. These questions are based on the most informative features for predicting the target variable.
Eventually, it reaches leaf nodes where no more questions are needed, and a prediction is made. To
classify new data, you follow the path in the tree based on its features until you reach a leaf node,
which gives the predicted class. Decision trees are easy to understand and interpret, making them
useful for various classification tasks.
Important parameters:
· criterion: Impurity measure for split quality.
· splitter: Strategy for choosing splits.
· max_depth: Maximum depth of the tree.
· min_samples_split: Minimum samples required to split a node.
· min_samples_leaf: Minimum samples required at a leaf node.
· max_features: Number of features considered for the best split.
· random_state: Seed for random number generation.
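The parameters listed above map directly onto scikit-learn's `DecisionTreeClassifier`. The sketch below uses synthetic data and illustrative values, not the report's tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: not the GoBank data).
X, y = make_classification(n_samples=400, n_features=6, random_state=1)

tree = DecisionTreeClassifier(
    criterion="gini",      # impurity measure for split quality
    splitter="best",       # strategy for choosing splits
    max_depth=4,           # maximum depth of the tree
    min_samples_split=10,  # minimum samples required to split a node
    min_samples_leaf=5,    # minimum samples required at a leaf node
    max_features=None,     # consider all features at each split
    random_state=42,       # seed for reproducibility
)
tree.fit(X, y)

depth = tree.get_depth()
importances = tree.feature_importances_  # relative contribution of each feature
```

Limiting `max_depth` and raising `min_samples_leaf` are typical ways to reduce overfitting in a single tree.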
Image 4C
Image 4D
The accuracy of the decision tree classifier is 87.96%. In addition, Table 4A, generated by the
software, shows a precision of 87%, meaning that most of the positive predictions are correct.
Similarly, the recall is 88%, meaning that some positive instances were not identified. The F1
score is 88%. The confusion matrix shows that 5,437 instances of class 0 and 1,272 instances of
class 1 were correctly classified.
4.2 Unsupervised Machine learning:
Clustering models
K-MEANS:
KMeans is a widely-used centroid-based clustering algorithm that iteratively partitions data into K
clusters. It achieves this by assigning each data point to the nearest cluster centroid and updating
centroids based on the mean of the data points in each cluster.
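The assign-and-update loop can be sketched in a few lines of NumPy. This is an illustrative from-scratch version with hand-picked starting centroids and invented points, not the scikit-learn `KMeans` implementation used in the project.

```python
import numpy as np


def kmeans(points: np.ndarray, initial_centroids: np.ndarray, n_iter: int = 20):
    """Plain K-means: assign points to the nearest centroid, then recompute means."""
    centroids = initial_centroids.astype(float).copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # New centroid = mean of assigned points (keep the old one if a cluster is empty).
        new_centroids = np.array(
            [
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(len(centroids))
            ]
        )
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids


# Two well-separated groups of invented points.
points = np.array(
    [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
)
labels, centroids = kmeans(points, initial_centroids=points[[0, 3]])
```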
HIERARCHICAL CLUSTERING
Agglomerative Clustering, in contrast, is a hierarchical clustering algorithm that begins with each data
point as a separate cluster. It then iteratively merges the closest pairs of clusters until only one cluster
remains.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a density-based clustering algorithm that groups together closely packed points based on
their density. DBSCAN does not require the user to specify the number of clusters beforehand. It
relies on two parameters: epsilon (eps), which defines the radius of the neighborhood around a point,
and minPts, the minimum number of points within that radius to define a cluster.
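A small illustration with scikit-learn's `DBSCAN`, where the `min_samples` argument plays the role of minPts. The points and parameter values are invented; in practice `eps` has to be tuned to the scale of the data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (invented data).
points = np.array(
    [
        [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
        [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
        [10.0, 0.0],
    ]
)

model = DBSCAN(eps=0.5, min_samples=3)
labels = model.fit_predict(points)  # noise points receive the label -1

n_clusters = len(set(labels) - {-1})
n_noise = int((labels == -1).sum())
```

The number of clusters is discovered from the data rather than specified up front, and points that fit no dense region are reported as noise.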
5. Solution recommendation
Supervised Machine learning
MODEL                CLASS  F1-SCORE  PRECISION  RECALL  ACCURACY (%)
LOGISTIC REGRESSION  0      0.92      0.89       0.96    86.8
                     1      0.60      0.73       0.50
DECISION TREE        0      0.93      0.91       0.94    87.9
                     1      0.66      0.72       0.62
Logistic Regression emerges as the preferred choice over the Decision Tree for the following reasons:
Precision for the 'Sale' class: Logistic Regression achieves slightly higher precision (0.73) for the
'Sale' class than the Decision Tree (0.72). This indicates that a marginally larger share of the
customers it flags as 'Sale' are actual sales.
Model Robustness and Interpretability: Logistic Regression models are known for their simplicity and
interpretability. They offer clear insights into the impact of each feature through coefficients. This
transparency aids in understanding the driving factors behind predictions.
Lower risk of overfitting: Logistic Regression tends to be less prone to overfitting compared to
Decision Trees. With fewer hyperparameters to tune and a simpler model structure, Logistic
Regression offers more robust generalization to unseen data.
Interpretability and Ease of Deployment: Logistic Regression's simplicity and interpretability make it
an attractive choice for deployment in real-world scenarios. Clients can easily grasp and trust the
model's predictions, facilitating smoother integration into decision-making processes.
Unsupervised Machine learning:
MODEL                    SILHOUETTE SCORE
K-MEANS                  0.2959911036256011
HIERARCHICAL CLUSTERING  0.1787855930595059
DBSCAN                   0.9894963986146712
Table 1B
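A comparison of this kind could be produced along the following lines. The data is synthetic and all parameter values are illustrative. Excluding DBSCAN noise points (label -1) before scoring is an assumption of this sketch; the report does not state how noise was treated.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=7)
X_scaled = StandardScaler().fit_transform(X)

models = {
    "K-MEANS": KMeans(n_clusters=3, n_init=10, random_state=7),
    "HIERARCHICAL CLUSTERING": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    mask = labels != -1  # drop noise points before scoring (assumption)
    # The silhouette score is only defined when at least two clusters remain.
    if len(set(labels[mask])) >= 2:
        scores[name] = silhouette_score(X_scaled[mask], labels[mask])
```

When comparing scores this way, it is worth checking how many points DBSCAN assigned to clusters, since a score computed on a small subset of points is not directly comparable with one computed on the full dataset.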
4.5 Clustering analytics results and justification of the number of clusters:
The clustering analysis provided insights into customer segmentation using K-means and
hierarchical clustering. The K-means model defined four clusters based on scaled numerical
features related to the customers, such as age and consumer confidence index. The choice of four
clusters was justified by the elbow plot, in which the within-cluster sum of squared distances
stopped decreasing significantly beyond four clusters. The outcome was further supported by the
hierarchical clustering method, where the number of clusters was read from the
dendrogram.
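The elbow evaluation can be sketched as below: K-means is fitted for a range of k and the within-cluster sum of squared distances (`inertia_` in scikit-learn) is recorded for each. The data is synthetic and the range of k is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled numerical features.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.5, random_state=3)

# Within-cluster sum of squared distances for each candidate k.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=3).fit(X)
    inertias[k] = model.inertia_

# Reduction gained by adding one more cluster; the elbow is where this flattens.
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 9)}
```

Plotting `inertias` against k gives the elbow plot; the chosen k is the point after which the additional drop becomes small.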
5. Solution recommendation
The analysis shows that Great Ocean Banking Group can benefit from using both supervised and
unsupervised machine learning models to improve its marketing strategies. Both the logistic
regression and decision tree models demonstrate strong predictive power, allowing prospective
customers to be identified for marketing campaigns. Furthermore, the clustering analysis shows
that several customer segments exhibit distinct behaviors and can therefore be targeted
separately. The bank can improve customer engagement by focusing on these demographics and
customer behaviors.
6. Technical recommendations
Summary of Development and Testing Environment:
● Programming Language: Python
● Computing Environment: Jupyter Notebook running on the Kaggle platform; VS Code
● Software Libraries:
o Pandas: Data manipulation and analysis (pd)
o NumPy: Numerical computations (np)
o Scikit-learn: Machine learning library (StandardScaler, LabelEncoder, KMeans,
AgglomerativeClustering, DBSCAN, silhouette_score)
o Matplotlib: Plotting and visualization (plt)
o Seaborn: Statistical data visualization (sns)
Suggestions for Maintenance of Accuracy and Relevance Over Time:
Regular Data Updates:
1. Periodically update the dataset to capture new trends and customer behaviors.
2. Automate the data ingestion process to ensure fresh data is always available.
Model Retraining:
1. Regularly retrain clustering models to adjust to new data patterns.
2. Implement a retraining schedule (e.g., quarterly) based on data volume and business
needs.
Parameter Tuning:
1. Continuously monitor clustering performance metrics like silhouette score.
2. Perform periodic hyperparameter tuning for algorithms like DBSCAN to adapt to data
changes.
Monitoring and Evaluation:
1. Set up a monitoring system to track clustering performance over time.
2. Evaluate clusters against business metrics to ensure they remain meaningful and
actionable.
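One simple way to turn this monitoring into a retraining trigger is sketched below. The function name, the 10% tolerance, and the example scores are all hypothetical; a real threshold would need to be agreed with the business.

```python
def needs_retraining(
    baseline_score: float, current_score: float, max_relative_drop: float = 0.10
) -> bool:
    """Flag retraining when the silhouette score falls too far below its baseline."""
    if baseline_score <= 0:
        # A non-positive baseline already signals poorly separated clusters.
        return True
    relative_drop = (baseline_score - current_score) / baseline_score
    return relative_drop > max_relative_drop


# Hypothetical scores from two monitoring runs.
small_change = needs_retraining(baseline_score=0.30, current_score=0.29)
large_change = needs_retraining(baseline_score=0.30, current_score=0.20)
```

A check like this could run after each scheduled data refresh, with the result logged alongside the business metrics used to evaluate the clusters.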
Data Preprocessing Enhancements:
1. Refine preprocessing steps to handle new data anomalies or emerging patterns.
2. Update encoding schemes and standardization techniques as necessary.
Documentation and Knowledge Sharing:
1. Maintain comprehensive documentation of preprocessing steps, model parameters,
and evaluation metrics.
2. Foster a collaborative environment where insights and improvements are shared
among team members.
Scalability and Performance:
1. Ensure the computational environment can scale with data growth.
2. Optimize code for performance, particularly for large datasets and complex algorithms.
By implementing these recommendations, the clustering models will maintain accuracy and
relevance, adapting to the evolving nature of the data and providing valuable insights for decision-
making.
References
-International Institute of Business Analysis. (2022). Business Analysis Core Concept Model (BACCM).
IIBA.
https://www.iiba.org/business-analysis-blogs/6-steps-to-applying-the-baccm/
-Zakrzewska, D., & Murlewski, J. (2005). Clustering algorithms for bank customer segmentation. In
Intelligent Systems Design and Applications, 2005. ISDA '05. Proceedings. 5th International
Conference on (pp. 197-202). IEEE Xplore. DOI:10.1109/ISDA.2005.33