
MBAS901

Techniques and Tools for Business Analytics:


Final Assignment
Total marks: 100
Weight: 50%
Submission Mode: One PDF File (with ALL content) on the subject site on Moodle
Due date: Displayed on the subject site on Moodle
Late Penalty: 1 mark per day late

Part A [80 marks]


You are given a dataset of hotel bookings with many of their features. You may use the dataset
in SAS Viya called HOTEL_BOOKING_DEMAND_DATASET.

As a MINIMUM, you are required to do the following:


• Build 2 suitable models to predict whether a booking will be cancelled or not (or
  otherwise),
    • NOTE: You MUST NOT include the column ‘ReservationStatus’ as your
      predictors/features for obvious reasons; this data will not be available
      beforehand, and so cannot be used to predict whether a booking will be
      cancelled or not.
    • NOTE: adr = Average Daily Rate for a hotel room
• For each of the 2 models: Identify and list the important predictors
• For each of the 2 models: List and explain at least 3 model evaluation techniques
  required for your model
• For each of the 2 models: Report the evaluation results of your model
• For each of the 2 models: Suggest how the model can be improved further.
  Demonstrate it.
• For each of the 2 models: List some limitations of the modeling technique you used
• Compare and contrast the two models. Recommend which model is better for use in
  this case and why.
We remove several features for a variety of reasons. The "company," "agent," and "country"
features are eliminated because they contain too many null values. The "children" feature is
retained because only a small number of its values are null; the observations that have null
values for "children" are dropped instead.
The "reservation status" feature is removed because it would cause data leakage. It takes the
values No-Show, Check-Out, and Canceled, so the target class label could be inferred
directly from it.
“Arrival date” is created by combining “arrival date month, year and day”. Thereafter,
“arrival date month, year and day” and “arrival week number" are removed, because keeping
them alongside the combined feature would introduce multicollinearity.
Finally, "booking changes" is removed because its values can change over time after the
booking is made, which could result in data leakage.
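
The feature clean-up described above can be sketched in pandas as follows. This is a minimal illustration on a tiny made-up sample, not the code used for the analysis: the snake_case column names are assumptions and may differ from those in HOTEL_BOOKING_DEMAND_DATASET.

```python
import pandas as pd

# Tiny made-up sample; column names are assumed, not taken from the real dataset.
df = pd.DataFrame({
    "company": [None, None, 40.0, None],
    "agent": [9.0, None, None, 240.0],
    "country": ["PRT", None, "GBR", "ESP"],
    "children": [0.0, None, 1.0, 2.0],
    "reservation_status": ["Check-Out", "Canceled", "No-Show", "Check-Out"],
    "booking_changes": [0, 1, 0, 2],
    "arrival_date_year": [2016, 2016, 2017, 2017],
    "arrival_date_month": ["July", "August", "January", "March"],
    "arrival_date_day_of_month": [1, 15, 31, 9],
    "arrival_date_week_number": [27, 33, 5, 10],
    "is_cancelled": [0, 1, 1, 0],
})

# Combine the year, month and day parts into a single arrival date.
date_text = (
    df["arrival_date_year"].astype(str) + "-"
    + df["arrival_date_month"] + "-"
    + df["arrival_date_day_of_month"].astype(str)
)
df["arrival_date"] = pd.to_datetime(date_text, format="%Y-%B-%d")

# Drop the sparse, leaky and now-redundant columns.
df = df.drop(columns=[
    "company", "agent", "country",            # too many nulls
    "reservation_status", "booking_changes",  # leakage risk
    "arrival_date_year", "arrival_date_month",
    "arrival_date_day_of_month", "arrival_date_week_number",
])

# Drop the few observations with a missing 'children' value.
df = df.dropna(subset=["children"]).reset_index(drop=True)
```

Most estimators cannot consume a datetime column directly, so the combined date would still need to be converted to a numeric form before modelling.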
The results show that the classes are imbalanced. Although the imbalance is not severe, we
assess the machine learning classifiers using metrics beyond accuracy, such as precision and
recall scores, since these metrics remain informative when the classes are imbalanced.
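
As a small illustration of the class-balance check and the two metrics, the sketch below uses made-up labels and predictions rather than the assignment's results. It assumes scikit-learn is available.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = cancelled, 0 = not cancelled (illustrative only).
y_true = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = pd.Series([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Share of each class, used to judge how imbalanced the target is.
class_share = y_true.value_counts(normalize=True)

# Precision: of the bookings predicted as cancelled, how many really were.
precision = precision_score(y_true, y_pred)
# Recall: of the bookings that were cancelled, how many the model caught.
recall = recall_score(y_true, y_pred)

print(class_share)
print(f"precision={precision:.2f} recall={recall:.2f}")
```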

In order to quickly identify any instances of high multicollinearity, we generate a correlation
heatmap.
The correlation heatmap reveals that there is no severe multicollinearity.
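
A heatmap of this kind is drawn from a pairwise correlation matrix. The sketch below computes that matrix with pandas and lists strongly correlated pairs. The data are synthetic, the `high_corr_pairs` helper is hypothetical, and the 0.8 threshold is an assumed rule of thumb, not a value taken from the analysis.

```python
import numpy as np
import pandas as pd

# Synthetic numeric features; 'lead_time_copy' is deliberately collinear.
rng = np.random.default_rng(0)
n = 200
lead_time = rng.normal(100, 30, n)
frame = pd.DataFrame({
    "lead_time": lead_time,
    "adr": rng.normal(100, 20, n),
    "lead_time_copy": lead_time * 2 + rng.normal(0, 1, n),
})

# Pearson correlation matrix: this is what a correlation heatmap visualises.
corr = frame.corr()


def high_corr_pairs(corr, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


flagged = high_corr_pairs(corr)
print(flagged)
```

An empty list from such a check would correspond to the finding that no severe multicollinearity is present.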

We must carry out some pre-processing steps to get the data ready for the machine learning
models. First, we impute missing values, replacing them with the feature's "mode". Then we
scale all numerical variables and encode all categorical variables. Because categorical and
numerical variables require different pre-processing steps, we develop two distinct
pipelines. After that, a column transformer object is created so that each pipeline is
applied to the appropriate columns.
Although we know that the categorical data needs to be encoded, we still need to choose
the kind of categorical encoding to use. It is clear from the result above that the
majority of the categorical features have a large number of unique values. One-hot
encoding would therefore create a huge number of additional features, which is
undesirable. As a result, we use ordinal encoding rather than one-hot encoding in order to
avoid the curse of dimensionality.
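
The two pipelines and the column transformer can be sketched with scikit-learn as below. This is an illustration under stated assumptions: the sample data and column names are made up, and `StandardScaler` is an assumed choice, since the text does not say which scaler was used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Made-up sample with missing values; column names are assumptions.
X = pd.DataFrame({
    "lead_time": [10.0, 200.0, np.nan, 45.0],
    "adr": [80.0, 120.0, 95.0, np.nan],
    "meal": ["BB", "HB", np.nan, "BB"],
    "market_segment": ["Online TA", "Direct", "Online TA", np.nan],
})
numeric_cols = ["lead_time", "adr"]
categorical_cols = ["meal", "market_segment"]

# Numerical pipeline: impute with the mode, then scale.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("scale", StandardScaler()),  # assumed scaler
])

# Categorical pipeline: impute with the mode, then ordinal-encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                              unknown_value=-1)),
])

# Column transformer: route each pipeline to its own columns.
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

X_ready = preprocess.fit_transform(X)
print(X_ready.shape)
```

Ordinal encoding keeps one output column per categorical feature, which is why the transformed matrix has the same number of columns as the input.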

The 'is_cancelled' column is the dependent variable within the dataset. The number 1 represents a
booking that has been cancelled while the number 0 represents a booking that has not been
cancelled.

To predict whether a hotel reservation will be canceled, we use several machine learning
classifiers: a decision tree, a random forest, and a logistic regression. The data are
split into a training set and a test set. Each model is trained using 70% of the data and
then tested using the remaining 30%, which lets us assess how well it works with unseen
data. Before any model is fitted, a complete set of pipelines is established, each
combining the column transformations that contain the pre-processing steps with one of the
machine learning models.
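
A minimal sketch of the 70/30 split and the three full pipelines is shown below. The data are synthetic, the column names and `random_state` values are assumptions, and the models use near-default settings, so the scores printed here say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bookings data (illustrative only).
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "lead_time": rng.integers(0, 365, n).astype(float),
    "adr": rng.normal(100, 25, n),
    "deposit_type": rng.choice(["No Deposit", "Non Refund"], n),
})
# Synthetic target: bookings with long lead times cancel more often.
y = (X["lead_time"] + rng.normal(0, 40, n) > 180).astype(int)

# 70% of the rows for training, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)


def make_preprocess():
    """Build a fresh column transformer for each pipeline."""
    return ColumnTransformer([
        ("num", StandardScaler(), ["lead_time", "adr"]),
        ("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                               unknown_value=-1), ["deposit_type"]),
    ])


models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# One full pipeline per model: pre-processing followed by the classifier.
pipelines = {
    name: Pipeline([("preprocess", make_preprocess()), ("model", model)])
    for name, model in models.items()
}

scores = {}
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)  # accuracy on the test set

print(scores)
```

Wrapping the pre-processing inside each pipeline means the scaler and encoder are fitted on the training rows only, so no information from the test set reaches the model.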

To present and compare the evaluation metrics for each model, we define an evaluation
function. The function takes as inputs a machine learning model object together with the
training and test values for x and y. For the supplied model, it outputs an accuracy
score, ROC score, confusion matrix, classification report, and ROC graph. We also examine
how important each feature is in the top-performing tree-based model.
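
An evaluation function of this kind might look like the sketch below. The name `evaluate` and its return format are hypothetical, the data come from `make_classification` rather than the bookings dataset, and the ROC graph is left out: the function returns the curve's points, which can be passed to a plotting library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and collect the evaluation outputs (hypothetical helper)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Probability of the positive class, needed for the ROC metrics.
    y_score = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "report": classification_report(y_test, y_pred),
        "roc_points": (fpr, tpr),  # x and y values of the ROC graph
    }


# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
results = evaluate(forest, X_train, X_test, y_train, y_test)
print(results["report"])

# Impurity-based feature importances of the fitted tree ensemble.
importances = forest.feature_importances_
print(importances)
```
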
The Random Forest is the top-performing classification model, as shown by the evaluation
metrics and ROC curve. The model performs significantly better than Logistic Regression
and only slightly better than the Decision Tree. With an accuracy score of 90.5%, the model
will usually predict correctly whether a booking will be canceled or not.
This model is advantageous to hotel owners because it can be used to generate more precise
revenue estimates, which in turn enable the effective implementation of revenue management
strategies. This could result in better pricing, which generates greater revenues.
Part B [20 marks]
Using the same dataset, develop a different kind/type of predictive model.
As a MINIMUM, you are required to do the following:
 Identify and name the technique used and its usefulness
 Explain how this model can be evaluated and used for practical purposes
Your answers MUST include answers to ALL the above parts including texts and figures/charts
from SAS Viya/ any other software approved by the lecturer.
Grading Rubric
