Contents

Problem 1: Model building and interpretation
    a) Build various models (for either or all of descriptive, predictive or prescriptive purposes)
    b) Test your predictive model against the test set using various appropriate performance metrics
    c) Interpretation of the model(s)
Problem - 1
1. Model building and interpretation
Data modeling integrates business processes with data rules and structures,
enabling their technical implementation. It involves training machine learning
algorithms to predict labels from features, fine-tuning them to meet business
requirements, and validating their performance on holdout data. The outcome of
modeling is a trained model capable of making predictions on new data points.
Following the exploratory data analysis (EDA) in phase 1, our dataset is primed
for model building. We will now leverage various machine learning algorithms to train
models, refining them to align with our business objectives. This iterative process
ensures that our models are robust and reliable, empowering us to derive actionable
insights and drive informed decision-making.
a) Build various models (You can choose to build models for either or all
of descriptive, predictive or prescriptive purposes)
To achieve our objective of predicting whether a product will be selected or
not, we employed several classification algorithms:
o Logistic Regression
o Random Forest
o K-Nearest Neighbors
o XGB Classifier
Train and Test Split:
The dataset was partitioned into two segments: a training set and a testing set. The
training set comprised 70% of the observations, while the testing set held the
remaining 30%. The training set served as the basis for fitting our model, which would
subsequently be evaluated using the testing set.
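The 70/30 split described above can be sketched as follows. This is a minimal illustration using scikit-learn's train_test_split on a synthetic dataset, since the report's actual code and data are not shown here.

```python
# Minimal sketch of a 70/30 train/test split (assumption: scikit-learn
# workflow; make_classification stands in for the project dataset).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 70% of observations for training, 30% held out for testing;
# stratify keeps the class balance identical in both segments.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

print(len(X_train), len(X_test))  # 700 300
```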
➢ Classification report for Train and Test Data
Accuracy score for Train data is 0.84 and Test data is 0.86
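A hedged sketch of how these accuracy scores and classification reports are typically produced with scikit-learn; the synthetic dataset below stands in for the project data, so the printed numbers will not match the report's values.

```python
# Fit a Logistic Regression model and report accuracy, a classification
# report, and a confusion matrix on both segments (illustrative data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

train_acc = accuracy_score(y_train, lr.predict(X_train))
test_acc = accuracy_score(y_test, lr.predict(X_test))

print(classification_report(y_test, lr.predict(X_test)))
print(confusion_matrix(y_test, lr.predict(X_test)))
```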
➢ Confusion Matrix for Train and Test Data
Tuned Logistic Regression Model:
Model tuning involves optimizing the model's performance by adjusting
hyperparameters. This process entails systematically trying candidate
hyperparameter values and evaluating the model for each combination. The aim
is to identify the parameter values that yield the best model performance.
Techniques such as GridSearchCV, whose best_estimator_ attribute exposes the
best configuration found, are commonly employed for this purpose, facilitating
the search for the most effective model configuration through systematic
experimentation and evaluation.
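A minimal grid-search sketch for tuning Logistic Regression. The parameter grid below is illustrative, not the report's actual grid, and the data is synthetic.

```python
# Hyperparameter tuning via cross-validated grid search (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Illustrative grid: regularization strength C with l2 penalty.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l2"]}
grid = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
grid.fit(X_train, y_train)

tuned_lr = grid.best_estimator_  # best configuration found by the search
test_acc = tuned_lr.score(X_test, y_test)
print(grid.best_params_, test_acc)
```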
➢ Classification report for Tuned Model — Train and Test Data
Accuracy score for Train data is 0.86 and Test data is 0.87
➢ Confusion Matrix for Tuned Train and Test Data
Note: There is no major difference between the tuned Logistic Regression
model and the baseline LR model.
Random Forest Model:
1. A Random Forest (RF) is an ensemble classifier comprised of multiple Decision
Trees, analogous to a forest containing many trees.
2. Deeply grown Decision Trees often overfit the training data, resulting in
high variability in classification outcomes for slight input changes.
3. Due to their sensitivity to training data, Decision Trees can be error-prone
when applied to test datasets.
4. In an RF, individual Decision Trees are trained on different subsets of the
training dataset.
5. Classification of a new sample involves passing its input vector through each
tree, with each tree providing a classification outcome.
6. The RF aggregates these outcomes, by majority vote for discrete
classification or averaging for numeric prediction, effectively reducing
variance compared to a single Decision Tree.
7. The randomForest package in R facilitates RF model building and tuning by
adjusting parameters such as the number of trees (ntree) and the number of
randomly sampled variables (mtry).
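The same procedure can be sketched in Python with scikit-learn, where n_estimators and max_features play the roles of R's ntree and mtry. Synthetic data stands in for the project dataset.

```python
# Random Forest sketch: each tree is trained on a bootstrap sample of the
# training data, and predictions are aggregated by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# n_estimators ~ ntree; max_features="sqrt" ~ the default mtry heuristic.
rf = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", random_state=42
)
rf.fit(X_train, y_train)

train_acc = rf.score(X_train, y_train)
test_acc = rf.score(X_test, y_test)
```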
Accuracy score for Train data is 1.0 and Test data is 0.98
➢ ROC Curve for Train and Test Data:
➢ Classification report for Tuned Model — Train and Test Data
Accuracy score for Train data is 0.98 and Test data is 0.93
➢ Confusion Matrix for Tuned Train and Test Data
Note: There is no major difference between the tuned Random Forest model
and the base RF model.
KNN Model:
1. K-Nearest Neighbors (KNN) is a popular supervised machine learning algorithm used
for classification and regression tasks.
2. It operates on the principle of similarity, where the class or value of a new data
point is determined by the majority class or average value of its k nearest neighbors
in the feature space.
3. KNN is a non-parametric algorithm, meaning it does not make any assumptions
about the underlying data distribution, making it particularly useful for nonlinear
and complex datasets.
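The neighbor-vote principle described above can be sketched as follows; the data is synthetic and k=5 is illustrative, not the report's chosen value.

```python
# KNN sketch: each test point is classified by a majority vote among its
# k nearest training points in feature space.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Illustrative choice of k; scaling features first is usual practice,
# since KNN relies on distances in feature space.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)
```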
Accuracy score for Train data is 0.92 and Test data is 0.86
➢ ROC Curve for Train and Test Data:
Tuned KNN Model:
A tuned K-Nearest Neighbors (KNN) model involves optimizing its hyperparameters,
such as the number of neighbors (k), to enhance its performance. This process aims to
improve the model's accuracy and generalization ability, ensuring it can effectively
classify or regress on new data instances with greater precision.
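Tuning k can be sketched with a cross-validated grid search, as below; the search range is illustrative and the data synthetic.

```python
# Tuned KNN sketch: search over candidate values of k with 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(1, 21))},  # illustrative search range
    cv=5,
)
grid.fit(X_train, y_train)

best_k = grid.best_params_["n_neighbors"]
test_acc = grid.best_estimator_.score(X_test, y_test)
```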
Accuracy score for Train data is 0.92 and Test data is 0.90
➢ Confusion Matrix for Tuned Train and Test Data
➢ ROC Curve for Tuned Train and Test Data:
Difference between Bagging and Boosting Techniques:
Bagging and boosting are both ensemble learning techniques used to improve the
performance of machine learning models, but they differ in their approach:
1. Bagging
• In bagging, multiple instances of the same base learning algorithm are trained
on different subsets of the training data, typically using random sampling
with replacement.
• Each model is trained independently, and during prediction, the outputs of
these models are aggregated (e.g., through majority voting for classification or
averaging for regression) to make the final prediction.
• Bagging helps reduce overfitting and variance in the model by introducing
diversity among the models.
2. Boosting
• In boosting, base learners are trained sequentially, with each new model
focusing on the mistakes made by the previous ones.
• The models are built in a sequential manner, with each subsequent model
attempting to correct the errors made by the ensemble up to that point.
• Boosting optimizes a loss function by iteratively minimizing the errors made
by the ensemble, typically using gradient descent or other optimization
techniques.
• Boosting tends to produce highly accurate predictive models and often
outperforms bagging in terms of predictive performance, but it can be more
sensitive to noise and outliers in the data.
In summary, while both bagging and boosting are ensemble learning techniques used
to improve model performance, they differ in their approach to combining multiple
models and addressing the weaknesses of individual models. Bagging focuses on
reducing variance by introducing diversity among models, while boosting aims to
iteratively improve the ensemble by focusing on reducing bias and improving
accuracy.
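The contrast above can be sketched side by side. This sketch uses scikit-learn's BaggingClassifier (independent trees on bootstrap samples) and GradientBoostingClassifier (trees fitted sequentially) on synthetic data; the accuracies are illustrative only.

```python
# Bagging vs. boosting on the same synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Bagging: each tree sees a bootstrap sample; predictions are combined
# by majority vote, reducing variance.
bag = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, random_state=42
)

# Boosting: trees are added sequentially, each fitted to the errors the
# ensemble has made so far, reducing bias.
boost = GradientBoostingClassifier(n_estimators=50, random_state=42)

bag_acc = bag.fit(X_train, y_train).score(X_test, y_test)
boost_acc = boost.fit(X_train, y_train).score(X_test, y_test)
```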
XGB Classifier
Accuracy score for Train data is 1.0 and Test data is 0.93
➢ ROC Curve for Train and Test Data:
b) Test your predictive model against the test set using various
appropriate performance metrics
c) Interpretation of the model(s)
1. Accuracy: The proportion of correctly classified instances among the total number of
instances in a dataset.
2. Precision: The proportion of true positive predictions among all positive predictions made
by the model.
3. Recall: The proportion of true positive predictions among all actual positive instances in
the dataset.
4. F1 Score: The harmonic mean of precision and recall, providing a balance between the two
metrics.
5. AUC (Area Under the ROC Curve): The area under the receiver operating characteristic
curve, which quantifies the model's ability to discriminate between positive and negative
instances across different threshold values.
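The five metrics above are typically computed as follows; this is a hedged sketch on synthetic data with a Logistic Regression stand-in, so the values will not match the report's tables.

```python
# Computing accuracy, precision, recall, F1, and AUC for a fitted model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # scores for the positive class

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_prob),  # AUC needs scores, not labels
}
print(metrics)
```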
Based on the above findings, in terms of accuracy, Random Forest is the
best-performing model.
Problem - 2
2. Model Tuning and business interpretation
c) Interpretation of the most optimum model and its implication on the
business
1. The highest correlation with daily average time spent on the page is found in total
likes on outstation check-ins and yearly average views on the travel page.
2. Over 8000 customers do not follow the company page. Around 26% of users checked
in outstation a week ago, indicating a lower inclination towards travel.
3. Beach and financial locations are preferred by many customers, followed by
historical sites and others. Approximately 90% of customers use mobile devices, with
tablets being the preferred option.
4. The majority (84%) are not employed, and 57% are adults. About 85% of customers
have not previously purchased the product.
5. Random Forest proves to be the best model, achieving 96% accuracy for both mobile
and laptop devices. Implementing the Random Forest approach can lead to better
predictions, aiding the aviation company in cost savings.
*****