MBAS901-FinalAssessment-Sabina K
Before the data can be used by machine learning models, it must be pre-processed. First, we
impute missing values, replacing each one with the "mode" of its feature. Next, we scale all
numerical variables and encode all categorical variables. Because categorical and numerical
variables require different pre-processing steps, we build two separate pipelines. A column
transformer object is then created so that each pipeline is applied to the appropriate
columns.
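The steps above can be sketched with scikit-learn as follows. This is a minimal illustration rather than the notebook's actual code: the column names (`lead_time`, `adr`, `country`, `market_segment`) and the toy data are assumptions, as is the choice of `StandardScaler` for feature scaling.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical feature lists; the notebook's real column lists are not shown.
numeric_features = ["lead_time", "adr"]
categorical_features = ["country", "market_segment"]

# Numerical pipeline: fill gaps with the mode, then scale.
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("scale", StandardScaler()),  # assumed scaler
])

# Categorical pipeline: fill gaps with the mode, then ordinal-encode.
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder()),
])

# The column transformer routes each pipeline to its own columns.
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])

# Toy data with missing values, for illustration only.
toy = pd.DataFrame({
    "lead_time": [10.0, np.nan, 30.0, 10.0],
    "adr": [80.0, 120.0, np.nan, 80.0],
    "country": pd.Series(["PRT", "GBR", np.nan, "PRT"], dtype="object"),
    "market_segment": pd.Series(
        ["Online TA", np.nan, "Direct", "Online TA"], dtype="object"
    ),
})

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

The output has one column per input feature, because ordinal encoding maps each categorical feature to a single integer-coded column.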
Although we know that the categorical data must be encoded, we still need to choose the
type of encoding. The result above shows that most of the categorical features have a large
number of unique values, so one-hot encoding would create a very large number of additional
features, which is undesirable. We therefore use ordinal encoding instead of one-hot
encoding in order to avoid the curse of dimensionality.
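The difference in feature count can be illustrated with a small comparison. The data frame below is synthetic and its column names (`country`, `agent`) and cardinalities are assumptions chosen only to show the effect.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic high-cardinality columns (hypothetical stand-ins for real features).
df = pd.DataFrame({
    "country": [f"C{i:03d}" for i in range(150)],     # 150 unique values
    "agent": [f"A{i % 75:02d}" for i in range(150)],  # 75 unique values
})

# One-hot encoding creates one output column per unique value.
one_hot_width = OneHotEncoder().fit_transform(df).shape[1]

# Ordinal encoding keeps one output column per input feature.
ordinal_width = OrdinalEncoder().fit_transform(df).shape[1]

print(one_hot_width, ordinal_width)  # 225 2
```

One trade-off to keep in mind: ordinal encoding assigns an arbitrary numeric order to the categories. Tree-based models are largely insensitive to this, whereas a linear model such as logistic regression will treat the codes as magnitudes.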
The 'is_cancelled' column is the dependent variable within the dataset. The number 1 represents a
booking that has been cancelled while the number 0 represents a booking that has not been
cancelled.
To predict whether a hotel reservation will be canceled, we use several machine learning
classifiers: a decision tree, a random forest, and a logistic regression. The data is split
into training and test sets. Each model is trained on 70% of the data and then tested on the
remaining 30%, which lets us assess how well it performs on unseen data. Before any model is
trained, a complete set of pipelines is built. Each pipeline combines the column transformer
that holds the pre-processing steps with one of the machine learning models.
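A minimal sketch of the split and the per-model pipelines is shown below. The synthetic data, the column names (`lead_time`, `deposit_type`), the simplified column transformer, and all hyperparameters are assumptions made for illustration; they are not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the booking data (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "lead_time": rng.integers(0, 365, size=n).astype(float),
    "deposit_type": rng.choice(["No Deposit", "Non Refund"], size=n),
})
y = (X["lead_time"] > 180).astype(int)  # synthetic stand-in for the target

# 70% of the rows for training, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

def make_preprocessor():
    """Return a fresh column transformer (simplified for this sketch)."""
    return ColumnTransformer([
        ("num", StandardScaler(), ["lead_time"]),
        ("cat", OrdinalEncoder(), ["deposit_type"]),
    ])

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# One full pipeline per model: pre-processing followed by the classifier.
pipelines = {
    name: Pipeline([("preprocess", make_preprocessor()), ("model", model)])
    for name, model in models.items()
}

for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    print(name, round(pipe.score(X_test, y_test), 3))
```

Wrapping the pre-processing inside each pipeline means the imputation, scaling, and encoding parameters are learned from the training rows only, which avoids leaking information from the test set.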
To display and compare the evaluation metrics for each model, we define an evaluation
function. The function takes a machine learning model object together with the training and
test values for x and y. For the supplied model, it then outputs an accuracy score, ROC AUC
score, confusion matrix, classification report, and ROC curve. We also examine the
importance of each feature in the top-performing tree-based model.
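An evaluation function of this kind might look like the sketch below. The function name `evaluate_model`, the choice to return a dictionary rather than print and plot, and the synthetic data from `make_classification` are all assumptions. The ROC curve is returned as arrays of false and true positive rates; scikit-learn's `RocCurveDisplay.from_estimator` could be used to draw it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit the model and collect the metrics described in the text."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1
    fpr, tpr, _ = roc_curve(y_test, y_score)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "classification_report": classification_report(y_test, y_pred),
        "roc_curve": (fpr, tpr),
    }

# Synthetic binary classification data, for illustration only.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
results = evaluate_model(forest, X_train, X_test, y_train, y_test)
print(results["classification_report"])

# Impurity-based feature importances of the fitted tree-based model,
# listed from most to least important.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

Calling the same function once per model makes the metrics directly comparable, since every model sees the identical training and test split.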
The Random Forest is the top-performing classification model, as shown by the evaluation
metrics and ROC curve. It performs significantly better than Logistic Regression and
slightly better than the Decision Tree. With an accuracy score of 90.5%, the model can
correctly predict in most cases whether a booking will be canceled or not. This is valuable
to hotel owners because it supports more precise revenue estimates, which in turn enable
the effective implementation of revenue management strategies. This could result in better
pricing and, consequently, greater revenue.
Part B [20 marks]
Using the same dataset, develop a different type of predictive model.
As a MINIMUM, you are required to do the following:
Identify and name the technique used and its usefulness
Explain how this model can be evaluated and used for practical purposes
Your answers MUST address ALL the above parts and include text and figures/charts
from SAS Viya or any other software approved by the lecturer.
Grading Rubric