Final Assessment MBAS901-Sabina.K
Predicting whether a booking cancellation will occur is the main objective of this paper. Machine
learning techniques such as Logistic Regression, Decision Tree and Random Forest are used to solve
this binary classification problem.
The models will not only forecast whether a reservation will be cancelled, but will also provide insight into which independent variables are crucial in determining whether a reservation is cancelled. Hotel owners and managers will find this material useful: it enables more precise income projections and the effective use of revenue management strategies.
Importing and reading the data frame is the first step. After that, we examine the data frame’s
attributes.
Due to having too many null values, the "company", "agent", and "country" features are all eliminated. Since only a small number of its values are null, the "children" feature is retained; observations with a null value for "children" are dropped instead. The "reservation_status" feature has been removed since it would cause data leakage: it takes the values No-Show, Check-Out, and Canceled, so the target class label could be inferred directly from it. The feature "arrival_date_week_number" is likewise removed because it is tied to "arrival_date_month", and keeping it would introduce multicollinearity. Finally, "booking_changes" is removed because its values may change over time and leak information.
In addition, all duplicates were removed from the dataset.
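The cleaning steps above might be sketched in Python as follows. The column names follow the common hotel-bookings dataset naming, and the helper name `clean_bookings` is ours, not from the original code:

```python
import pandas as pd

# Columns removed for excess nulls or leakage risk, per the steps above.
DROP_COLS = ["company", "agent", "country", "reservation_status",
             "arrival_date_week_number", "booking_changes"]

def clean_bookings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described in the text."""
    df = df.drop(columns=[c for c in DROP_COLS if c in df.columns])
    df = df.dropna(subset=["children"])  # keep "children", drop its null rows
    return df.drop_duplicates()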
The results show that the class quantities are imbalanced. Even though the imbalance is not severe, the machine learning classifiers are evaluated using metrics such as precision and recall, since these take class imbalance into consideration.
Some pre-processing steps were carried out to get the data ready for the machine learning models. Missing values are imputed first, using each feature's mode as the replacement. Feature scaling is then applied to all numerical variables while all categorical variables are encoded. Because categorical and numerical variables require different pre-processing steps, two distinct pipelines are developed. After that, a column transformer object is created so that the different pipelines can be applied to the appropriate columns.
It is clear from the results that the majority of the categorical features have a large number of unique values. One-hot encoding would therefore create a huge number of additional features, which is undesirable. As a result, ordinal encoding is used instead of one-hot encoding in order to escape the curse of dimensionality.
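A minimal sketch of the two pipelines and the column transformer, assuming scikit-learn; `num_cols` and `cat_cols` here are illustrative placeholders, not the full column selection:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Example column lists -- the actual selections come from the dataset.
num_cols = ["lead_time", "adr"]
cat_cols = ["hotel", "deposit_type"]

# Numerical pipeline: mode imputation, then feature scaling.
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("scale", StandardScaler()),
])

# Categorical pipeline: mode imputation, then ordinal encoding
# (chosen over one-hot to avoid exploding the feature count).
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                              unknown_value=-1)),
])

# Apply each pipeline to its own set of columns.
preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])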
To establish the ideal number of clusters, it is necessary to choose the value of k after which the distortion starts to decrease linearly (the "elbow"). On this basis, four clusters is the ideal number for the given data. To display the cluster centers, the K-means algorithm is run on lead time and ADR with the number of clusters set to 4.
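The elbow procedure can be sketched as follows. The data here is synthetic (four well-separated blobs standing in for the lead_time/adr pairs), so only the mechanics, not the actual centers, carry over:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the lead_time / adr columns (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=5.0, size=(50, 2))
               for c in ([20, 60], [200, 60], [20, 160], [200, 160])])

# Elbow method: compute distortion (inertia) for each k and look for the
# point where the curve starts to flatten out.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 8)}

# Fit the final model with the chosen k = 4 and read off the centers.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
centers = km.cluster_centers_      # cluster centers in (lead_time, adr) space
counts = np.bincount(km.labels_)   # observations per cluster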
The number of observations belonging to each cluster is then displayed above.
The clients in the red cluster, those with the shortest lead times and the highest ADR, are considered the most profitable, while the orange cluster shows the longest lead time and the lowest ADR (least lucrative).
Using SAS
The dataset cleansed in Python is used in SAS to predict whether a booking will be cancelled or not. In Python three models were trained and tested, whereas in SAS Logistic Regression and Decision Tree are used for prediction.
The response variable is is_canceled for all models analyzed; the remaining variables are the predictors.
The KS (Youden) statistic from the ROC chart for the first model is 0.9949, which is considered good; it will be compared later with the other model's values.
From the Fit Summary of the Logistic Regression, it is illustrated that assigned_room_type and arrival_date_month are the most critical variables, since their p-values are less than .05.
From the residual plot there are some negative outliers.
According to the confusion matrix, the True Negative and True Positive counts are quite good, at 60,883 and 23,106 respectively. From a business perspective, the model correctly identifies 99.98% of the customers who will not cancel their reservation, which matches what is seen in practice. The False Positives can be problematic because the model incorrectly forecasts that 12 customers will cancel their bookings; still, the overall figure is not significant, and the risk can be tolerated. In addition, the 799 False Negatives are predicted not to cancel their bookings yet do cancel, which carries a risk of lost sales opportunities.
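The reported counts can be turned into the usual classification metrics directly. This is simple arithmetic on the figures quoted above; note that 60,883 / (60,883 + 12) ≈ 99.98%, matching the percentage in the text:

```python
# Confusion-matrix counts reported for the SAS Logistic Regression.
tn, tp, fp, fn = 60_883, 23_106, 12, 799

total = tn + tp + fp + fn
accuracy = (tn + tp) / total      # overall share of correct predictions
specificity = tn / (tn + fp)     # non-cancellers identified correctly
precision = tp / (tp + fp)       # predicted cancellations that were right
recall = tp / (tp + fn)          # actual cancellations that were caught

print(f"accuracy={accuracy:.4f} specificity={specificity:.4f} "
      f"precision={precision:.4f} recall={recall:.4f}")
# -> accuracy=0.9904 specificity=0.9998 precision=0.9995 recall=0.9666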
The misclassification plot visually depicts the accuracy of the prediction at the chosen cutoff value. The predicted response classification depends on whether the predicted probability of level 1 of the response is_canceled is greater than or equal to the cut-off value: if it is, the predicted classification is an event; otherwise, it is a non-event. For this data, the segment of the bar colored "Correct" corresponds to true positives for the event level of is_canceled, 1.
Cumulative lift measures model effectiveness; on this measure the model is considered effective enough.
The Decision Tree has a KS (Youden) of 0.38, which is poor compared to the Logistic Regression.
As for the variable importance analysis, lead_time is considered the most crucial predictor.
According to the Confusion Matrix, the True Negative and True Positive counts represent decent results, at 56,066 and 10,067 respectively. However, the model incorrectly classifies 4,889 people as False Positives, predicting that they will cancel their bookings, which is many times more than in the Logistic Regression. Conversely, 13,838 customers are wrongly predicted to stay as hotel guests when in fact they cancel.
The segment of the bar colored "Correct" on this data's bar chart corresponds to true positives for the event level of is_canceled, 1.
According to the Decision Tree model's cumulative lift at the 10% quantile, the first two quantiles contain approximately 2.39 times more events than would be expected by chance (10% of all events).
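Cumulative lift at a quantile is the share of all events captured in the top-scored fraction of observations, divided by that fraction. A small sketch, assuming NumPy (the `cumulative_lift` helper is ours):

```python
import numpy as np

def cumulative_lift(y_true, y_score, quantile=0.10):
    """Events captured in the top-scored `quantile` of observations,
    relative to the `quantile` expected by chance."""
    order = np.argsort(-np.asarray(y_score, dtype=float))  # best scores first
    y = np.asarray(y_true)[order]
    n_top = int(np.ceil(quantile * len(y)))
    capture_rate = y[:n_top].sum() / y.sum()  # share of all events captured
    return capture_rate / quantile
```

A perfect ranking of a 10%-prevalence outcome gives a lift of 10 at the top decile; a ranking that buries every event at the bottom gives 0.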
An attempt was made to improve the Logistic Regression by changing the variable selection to the Backward method, and the Decision Tree by using a Random Forest, but based on the results these changes did not move performance in a positive direction. Therefore, we keep the Logistic Regression without changing its variable selection method.
In contrast to the Python approach, the Logistic Regression is the top-performing classification model, as shown by the KS (Youden), the ROC curve and the Confusion Matrix. The model performs considerably better than the Decision Tree and the Random Forest.
Based on the two variables lead_time (the number of days between the date the reservation was entered and the client's arrival date) and adr (Average Daily Rate, defined as the sum of all accommodation transactions divided by the total number of nights), clustering allowed us to visualize the most profitable customers (Cluster 2) and the least profitable customers (Cluster 3).
The following are some of the Logistic Regression modeling technique's limitations:
The foundation of Logistic Regression is a linear relationship between the predictor (independent) variables and the log-odds of the predicted (dependent) variable. Real-world data is unlikely to be linearly separable, so although logistic regression assumes linearly separable classes, that assumption is not always accurate. Overfitting may occur if there are fewer observations than features. Only discrete outcomes can be predicted with it: the dependent variable of a Logistic Regression is limited to a set of discrete values. Finally, logistic regression assumes minimal or no multicollinearity among the independent variables.
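One of the assumptions listed above, minimal multicollinearity, can be checked with variance inflation factors: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on the remaining columns. A plain-NumPy sketch (the `vif` helper is ours, not from the original analysis):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add an intercept term
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)  # VIF_j = 1 / (1 - R^2_j)
    return out
```

Columns with VIF well above 5-10 are the usual red flags for multicollinearity.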
The following are some limitations of the Random Forest modeling technique:
It can be challenging to strike a balance between tree count and training time (and space). Adding more trees can increase forecast accuracy, but training the model then requires more time and space. For data with few features, a random forest may not yield adequate results, because the benefit of the injected randomness is significantly reduced. For data with significant noise, a random forest may overfit: although it lessens overfitting through majority voting, its predictions can still be overfitted compared with a linear model, which is distinguished by a good fit to the existing data. Individual decision trees tend to overfit in prediction.