
MARCH 24, 2024

Contents

Problem 1: Model building and interpretation
  a) Build various models (You can choose to build models for either or all of descriptive, predictive or prescriptive purposes)
  b) Test your predictive model against the test set using various appropriate performance metrics
  c) Interpretation of the model(s)

Problem 2: Model Tuning and business implication
  a) Ensemble modelling, wherever applicable
  b) Any other model tuning measures (if applicable)
  c) Interpretation of the most optimum model and its implication on the business

Problem - 1
1. Model building and interpretation
Data modeling facilitates the integration of business processes with data rules
and structures, enabling the technical implementation of data. This involves training
machine learning algorithms to predict labels from features, fine-tuning them to meet
business requirements, and validating their performance on holdout data. The
outcome of modeling is a trained model capable of making predictions on new data
points.
Following the exploratory data analysis (EDA) in phase 1, our dataset is primed
for model building. We will now leverage various machine learning algorithms to train
models, refining them to align with our business objectives. This iterative process
ensures that our models are robust and reliable, empowering us to derive actionable
insights and drive informed decision-making.

a) Build various models (You can choose to build models for either or all
of descriptive, predictive or prescriptive purposes)
To achieve our objective of predicting whether a product will be selected or
not, we employed several classification algorithms:

o Logistic Regression
o Random Forest
o K-Nearest Neighbors
o XGB Classifier

These models are designed to predict categorical binary outcomes based on input data. With the dataset divided into a 70:30 ratio for training and testing, we embarked on model building. The training data, comprising 70% of the dataset, was utilized to train the models, while the remaining 30% was reserved for testing their performance.

Post-training, we evaluated each model using metrics such as classification reports, accuracy scores, and ROC curves. These assessments provided insights into the models' predictive capabilities and allowed us to gauge their effectiveness in fulfilling our prediction task. Furthermore, we explored the potential benefits of employing Bagging and Boosting techniques, along with tuning methodologies, to enhance model performance.

Train and Test Split:
The dataset was partitioned into two segments: a training set and a testing set. The
training set comprised 70% of the observations, while the testing set held the
remaining 30%. The training set served as the basis for fitting our model, which would
subsequently be evaluated using the testing set.
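
For illustration, a minimal sketch of this split in scikit-learn is shown below; the DataFrame name df and the label column name "target" are assumptions for illustration, not names taken from the original notebook.

```python
# A minimal sketch of the 70:30 split described above, assuming a pandas
# DataFrame `df` whose binary label lives in a hypothetical column "target".
from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])   # predictor features
y = df["target"]                  # binary label (selected / not selected)

# 70% of observations for training, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
```

Stratifying on y keeps the class balance identical in both splits; whether the original split was stratified is not stated.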

Logistic Regression Model:

1. Logistic Regression, a supervised Machine Learning method, categorizes elements into two groups based on calculated probabilities.
2. It's specifically designed for classification tasks, offering discrete outputs.
3. Logistic Regression aims to fit values onto a sigmoid curve, advancing one step
beyond linear regression.
4. The model's coefficients are estimated by maximum likelihood; equivalently, the loss function minimized is the log-loss (binary cross-entropy).
5. This Logistic Regression model is evaluated based on Accuracy, Confusion
Matrix and Classification Report.
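
As a hedged sketch of this step, the snippet below fits scikit-learn's LogisticRegression on the split defined earlier; the hyperparameters shown are assumptions, since the original configuration is not given.

```python
# Sketch: fit and score a Logistic Regression model on the existing split.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

lr = LogisticRegression(max_iter=1000)   # higher max_iter helps convergence
lr.fit(X_train, y_train)

print(accuracy_score(y_train, lr.predict(X_train)))   # ~0.84 reported below
print(accuracy_score(y_test, lr.predict(X_test)))     # ~0.86 reported below
print(classification_report(y_test, lr.predict(X_test)))
```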

➢ Classification report for Trained Data and Test Data

Accuracy score for Train data is 0.84 and Test data is 0.86

➢ Confusion Matrix for Train and Test Data

➢ ROC Curve for Train (Green) and Test (Orange) Data:

Tuned Logistic Regression Model:
Model tuning involves optimizing the model's performance by adjusting hyperparameters. The aim is to identify the parameter values that yield the best model performance. Techniques such as grid search (e.g., GridSearchCV and its best_estimator_ attribute in scikit-learn) are commonly employed for this purpose, facilitating the search for the most effective model configuration through systematic experimentation and evaluation.
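
A sketch of such a search with GridSearchCV follows; the parameter grid is an assumption for illustration, not the exact grid from the original run.

```python
# Sketch: tune the Logistic Regression with an assumed hyperparameter grid.
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.01, 0.1, 1, 10],          # inverse regularization strength
    "solver": ["lbfgs", "liblinear"],
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)           # the "Best Parameters" reported below
tuned_lr = grid.best_estimator_    # best configuration, refit on the training data
```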

Best Parameters for Tuned Logistic Regression:

➢ Classification report for Tuned Trained Data and Test Data

Accuracy score for Train data is 0.86 and Test data is 0.87

➢ Confusion Matrix for Tuned Train and Test Data

➢ ROC Curve for Tuned Train (Green) and Test (Orange) Data:

Note: There is no major difference between the tuned Logistic Regression model and the base LR model.

Random Forest Model:
1. A Random Forest (RF) is an ensemble classifier comprised of multiple Decision
Trees, analogous to a forest containing many trees.
2. Deeply grown Decision Trees often lead to overfitting on training data, resulting
in high variability in classification outcomes for slight input changes.
3. Due to their sensitivity to training data, Decision Trees can be error-prone when
applied to test datasets.
4. In an RF, individual Decision Trees are trained on different subsets of the
training dataset.
5. Classification of a new sample involves passing its input vector through each
tree, with each tree providing a classification outcome.
6. The RF aggregates these outcomes, either by majority vote for discrete
classification or averaging for numeric classification, effectively reducing
variance compared to a single Decision Tree. Additionally, the randomForest
package in R facilitates RF model building and tuning by adjusting parameters
such as the number of trees (ntree) and the number of randomly sampled
variables.
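
A minimal sketch of this step in scikit-learn (the Python analogue of the R package mentioned above) is shown below; the n_estimators and random_state values are assumptions.

```python
# Sketch: fit a Random Forest on the existing split.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print(accuracy_score(y_train, rf.predict(X_train)))   # 1.0 reported below
print(accuracy_score(y_test, rf.predict(X_test)))     # ~0.98 reported below
```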

➢ Classification report for Trained Data and Test Data

Accuracy score for Train data is 1.0 and Test data is 0.98

➢ Confusion Matrix for Train and Test Data

➢ ROC Curve for Train and Test Data:

Bagging for Random Forest Model (Tuned RF Model):


Bagging, or Bootstrap Aggregating, is a method utilized in ensemble learning to bolster
the stability and accuracy of machine learning models. Specifically applied to Random
Forests, it involves training multiple instances of Decision Trees on different subsets
of the training data. Each tree in the Random Forest is trained on a random sample of
the dataset, and during prediction, the outputs of these trees are combined to yield
the final prediction. By leveraging the diversity among the trees, bagging mitigates
overfitting and reduces variance in the model. This is achieved by allowing each tree
to capture unique aspects of the data's underlying patterns. Consequently, Random
Forests exhibit strong generalization capabilities and are less susceptible to overfitting
compared to individual Decision Trees. Bagging for Random Forests thus serves to
enhance the model's predictive performance and robustness across various
classification and regression tasks in machine learning.
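
The sketch below illustrates one way to express this with scikit-learn's BaggingClassifier; using a Random Forest as the base estimator mirrors the "Bagged RF" results reported below, but the base estimator and counts are assumptions.

```python
# Sketch: bagging with an RF base estimator. Note the parameter is named
# `base_estimator` in scikit-learn < 1.2 and `estimator` from 1.2 onward.
from sklearn.ensemble import BaggingClassifier

bagged_rf = BaggingClassifier(
    estimator=RandomForestClassifier(random_state=42),
    n_estimators=10,          # assumed; the original count is not stated
    random_state=42,
)
bagged_rf.fit(X_train, y_train)
print(accuracy_score(y_test, bagged_rf.predict(X_test)))  # ~0.93 reported below
```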

➢ Classification report for Tuned Trained Data and Test Data

Accuracy score for Train data is 0.98 and Test data is 0.93

➢ Confusion Matrix for Tuned Train and Test Data

➢ ROC Curve for Tuned Train and Test Data:

Note: Comparing the bagged (tuned) Random Forest with the base RF model, test accuracy falls slightly (from 0.98 to 0.93), so bagging does not improve on the base Random Forest here.

KNN Model:
1. K-Nearest Neighbors (KNN) is a popular supervised machine learning algorithm used
for classification and regression tasks.
2. It operates on the principle of similarity, where the class or value of a new data
point is determined by the majority class or average value of its k nearest neighbors
in the feature space.
3. KNN is a non-parametric algorithm, meaning it does not make any assumptions
about the underlying data distribution, making it particularly useful for nonlinear
and complex datasets.
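
A minimal sketch of the base KNN fit follows; k = 5 is scikit-learn's default and an assumption here, since the original k is not stated.

```python
# Sketch: fit a KNN classifier on the existing split.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))   # ~0.86 reported below
```

Because KNN is distance-based, features are usually standardized first (e.g., with StandardScaler); whether scaling was applied in the original run is not stated.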

➢ Classification report for Trained Data and Test Data

Accuracy score for Train data is 0.92 and Test data is 0.86

➢ Confusion Matrix for Train and Test Data

➢ ROC Curve for Train and Test Data:

Tuned KNN Model:
A tuned K-Nearest Neighbors (KNN) model involves optimizing its hyperparameters,
such as the number of neighbors (k), to enhance its performance. This process aims to
improve the model's accuracy and generalization ability, ensuring it can effectively
classify or regress on new data instances with greater precision.
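
A sketch of tuning k with the grid-search machinery introduced earlier is shown below; the candidate range is an assumption.

```python
# Sketch: search over the number of neighbors; odd values avoid voting ties.
knn_grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(3, 21, 2))},
    cv=5, scoring="accuracy",
)
knn_grid.fit(X_train, y_train)
print(knn_grid.best_params_)
tuned_knn = knn_grid.best_estimator_
```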

➢ Classification report for Tuned Trained Data and Test Data

Accuracy score for Train data is 0.92 and Test data is 0.90

➢ Confusion Matrix for Tuned Train and Test Data

➢ ROC Curve for Tuned Train and Test Data:

Difference between Bagging and Boosting Techniques:
Bagging and boosting are both ensemble learning techniques used to improve the
performance of machine learning models, but they differ in their approach:

1. Bagging (Bootstrap Aggregating)

• In bagging, multiple instances of the same base learning algorithm are trained
on different subsets of the training data, typically using random sampling
with replacement.
• Each model is trained independently, and during prediction, the outputs of
these models are aggregated (e.g., through majority voting for classification or
averaging for regression) to make the final prediction.
• Bagging helps reduce overfitting and variance in the model by introducing
diversity among the models.

2. Boosting

• In boosting, base learners are trained sequentially, with each new model
focusing on the mistakes made by the previous ones.
• The models are built in a sequential manner, with each subsequent model
attempting to correct the errors made by the ensemble up to that point.
• Boosting optimizes a loss function by iteratively minimizing the errors made
by the ensemble, typically using gradient descent or other optimization
techniques.
• Boosting tends to produce highly accurate predictive models and often
outperforms bagging in terms of predictive performance, but it can be more
sensitive to noise and outliers in the data.

In summary, while both bagging and boosting are ensemble learning techniques used
to improve model performance, they differ in their approach to combining multiple
models and addressing the weaknesses of individual models. Bagging focuses on
reducing variance by introducing diversity among models, while boosting aims to
iteratively improve the ensemble by focusing on reducing bias and improving
accuracy.
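
The contrast can be seen directly in code; the sketch below pairs a bagged tree ensemble with AdaBoost, with base learners and counts chosen purely for illustration.

```python
# Sketch: bagging trains independent trees on bootstrap samples; boosting
# (AdaBoost) trains trees sequentially, reweighting previously misclassified rows.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=50, random_state=42)
boost = AdaBoostClassifier(n_estimators=50, random_state=42)

for name, model in [("bagging", bag), ("boosting", boost)]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```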

XGB Classifier
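
XGBoost is a gradient-boosted ensemble of decision trees, where each new tree is fit to the errors of the current ensemble. A hedged sketch of this step is shown below; it assumes the xgboost package is installed, the binary label is encoded as 0/1, and default hyperparameters, since the original configuration is not shown.

```python
# Sketch: fit an XGBoost classifier on the existing split.
from xgboost import XGBClassifier

xgb = XGBClassifier(random_state=42, eval_metric="logloss")
xgb.fit(X_train, y_train)

print(accuracy_score(y_train, xgb.predict(X_train)))   # 1.0 reported below
print(accuracy_score(y_test, xgb.predict(X_test)))     # ~0.93 reported below
```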

➢ Classification report for Trained Data and Test Data

Accuracy score for Train data is 1.0 and Test data is 0.93

➢ Confusion Matrix for Train and Test Data

➢ ROC Curve for Train and Test Data:

b) Test your predictive model against the test set using various
appropriate performance metrics
c) Interpretation of the model(s)

We checked performance on the train and test dataset predictions using the ROC AUC score, confusion matrix, and accuracy score for each of the LR, Random Forest, KNN Classifier, and XGB Classifier models.
Metric                 Accuracy¹       Recall²         Precision³      F1 Score⁴       AUC Score⁵
Models used            Train   Test    Train   Test    Train   Test    Train   Test    Train   Test
Logistic Regression    0.84    0.86    1       1       0.84    0.86    0.92    0.92    0.65    0.651
Tuned LR               0.86    0.87    0.17    0.17    0.73    0.77    0.27    0.28    0.776   0.776
Random Forest          1       0.98    1       0.88    1       1       1       0.93    1       0.998
Bagged RF              0.98    0.93    0.98    0.96    0.98    0.91    0.98    0.93    0.998   0.971
KNN Classifier         0.92    0.86    0.93    0.89    0.91    0.83    0.92    0.86    0.976   0.906
Tuned KNN              0.92    0.90    0.94    0.94    0.91    0.88    0.92    0.91    0.975   0.943
Gradient Booster       1       0.93    1       0.96    1       0.91    1       0.93    1       0.970

1. Accuracy: The proportion of correctly classified instances among the total number of instances in a dataset.
2. Recall: The proportion of true positive predictions among all actual positive instances in the dataset.
3. Precision: The proportion of true positive predictions among all positive predictions made by the model.
4. F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics.
5. AUC (Area Under the ROC Curve): The area under the receiver operating characteristic curve, which quantifies the model's ability to discriminate between positive and negative instances across different threshold values.
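
For reference, the sketch below shows how each tabulated metric can be computed for one of the fitted models (here the Random Forest); the same calls apply to every model in the table.

```python
# Sketch: compute the tabulated metrics for the Random Forest on the test set.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]   # scores for the positive class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("F1 Score :", f1_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))   # uses scores, not labels
print(confusion_matrix(y_test, y_pred))
```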

Based on the above findings, Random Forest is the best-performing model in terms of accuracy compared to the other models.

Problem - 2
2. Model Tuning and business implication

a) Ensemble modelling, wherever applicable


b) Any other model tuning measures (if applicable)

Model Ensemble means:


1. Ensemble methods integrate multiple base models to create an optimal predictive
model, often resulting in superior accuracy compared to individual models.
2. Simple averaging or weighted methods combine predictions from different models
by averaging their outputs.
3. Bagging methods, similar to averaging, use multiple versions of the same model to
enhance performance.
4. Boosting techniques incrementally construct an ensemble by emphasizing training
instances that previous models misclassified, with Adaboost and XGBoost being
common implementations.
5. Despite boosting methods yielding accuracy scores surpassing those of logistic
regression and KNN, they fall short of the accuracy achieved by Random Forest
models.
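
As a sketch of the simple averaging described in point 2 above, a soft-voting ensemble can average the predicted probabilities of the individual models; the member models here are assumptions.

```python
# Sketch: soft voting averages class probabilities; voting="hard" would
# instead take a majority vote over predicted labels.
from sklearn.ensemble import VotingClassifier

vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=42)),
                ("knn", KNeighborsClassifier())],
    voting="soft",
)
vote.fit(X_train, y_train)
print(accuracy_score(y_test, vote.predict(X_test)))
```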

Model Tuning means:

1. Model tuning optimizes a model's performance by adjusting hyperparameters to balance accuracy and prevent overfitting.
2. Hyperparameters, distinct from model parameters, dictate the behavior of the
algorithm during training and cannot be directly learned from the data.
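
Besides grid search, randomized search is a common tuning alternative; the sketch below samples a fixed number of hyperparameter combinations rather than exhausting a grid, with a search space chosen purely for illustration.

```python
# Sketch: randomized hyperparameter search over an assumed RF search space.
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": randint(50, 300), "max_depth": randint(3, 15)},
    n_iter=20, cv=5, scoring="accuracy", random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```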

In our scenario, we achieved satisfactory accuracy with all of our models; however, we sought to validate our findings empirically.

Upon tuning, the accuracy obtained was lower than the original model's performance. Therefore, it was concluded that tuning is unnecessary for this dataset, as the initial model already attained satisfactory accuracy without the need for further adjustments.

c) Interpretation of the most optimum model and its implication on the
business

1. The highest correlation with daily average time spent on the page is found in total
likes on outstation check-ins and yearly average views on the travel page.
2. Over 8000 customers do not follow the company page. Around 26% of users checked
in outstation a week ago, indicating a lower inclination towards travel.
3. Beach and financial locations are preferred by many customers, followed by
historical sites and others. Approximately 90% of customers use mobile devices, with
tablets being the preferred option.
4. The majority (84%) are not employed, and 57% are adults. About 85% of customers
have not previously purchased the product.
5. Random Forest proves to be the best model, achieving 96% accuracy for both mobile
and laptop devices. Implementing the Random Forest approach can lead to better
predictions, aiding the aviation company in cost savings.

*****

