
Using Python

Predicting whether a booking will be canceled is the main objective of this paper. Machine
learning techniques such as Logistic Regression, Decision Tree, and Random Forest are used to solve
this binary classification problem.
The models not only forecast whether a reservation will be canceled, but also provide insight into
which independent variables are crucial in determining whether a reservation is canceled. Hotel
owners and managers will find this information useful: it enables more precise income projections
and the effective use of revenue management strategies.
The first step is to import and read the data frame and then examine its attributes.
The "company," "agent," and "country" features are eliminated because they contain too many null
values. The "children" feature is retained since only a small number of its values are null; the
observations with null values for "children" are dropped instead. The "reservation status" feature is
removed because it would cause data leakage: it takes the values No-Show, Check-Out, and Canceled,
so the target class label could be inferred directly from it. The "arrival date week number" feature is
likewise removed because it is correlated with "arrival date month," and its presence would introduce
multicollinearity. Finally, "booking changes" is removed because its values can change over time and
would also leak information.
In addition, all duplicate rows were removed from the dataset.
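The cleaning steps above can be sketched in pandas; this is a minimal illustration on a tiny hypothetical frame, with column names following the common hotel bookings dataset:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the hotel bookings data.
df = pd.DataFrame({
    "company": [None, None, None],
    "agent": [None, 9.0, None],
    "country": [None, "PRT", None],
    "children": [0.0, None, 2.0],
    "reservation_status": ["Check-Out", "Canceled", "Check-Out"],
    "arrival_date_week_number": [27, 27, 28],
    "booking_changes": [0, 1, 0],
    "is_canceled": [0, 1, 0],
})

# Drop columns with too many nulls, plus the leakage-prone and
# collinear columns described in the text.
df = df.drop(columns=["company", "agent", "country",
                      "reservation_status",
                      "arrival_date_week_number",
                      "booking_changes"])

# Drop the few rows where "children" is null, then deduplicate.
df = df.dropna(subset=["children"]).drop_duplicates()
```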

The results show that the class counts are imbalanced.
Even though the imbalance is not severe, the machine
learning classifiers are evaluated using metrics such
as precision and recall, since these metrics take
class imbalance into consideration.
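Why precision and recall matter under imbalance can be shown with a small hand calculation on made-up labels (assumed data, not the bookings set): accuracy looks healthy while precision and recall reveal the weakness on the minority class.

```python
# Hypothetical true labels (8 negatives, 2 positives) and predictions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # of predicted cancellations, how many were real
recall = tp / (tp + fn)     # of real cancellations, how many were caught
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, accuracy)  # 0.5 0.5 0.8
```

Here accuracy is 80% even though only half of the cancellations are identified.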

Some pre-processing steps were carried out to get the data ready for the machine learning models.
First, missing values are imputed, using the feature's mode as the replacement value. Feature
scaling is applied to all numerical variables while all categorical variables are encoded. Because
categorical and numerical variables require different pre-processing steps, two distinct pipelines
are developed. A column transformer object is then created so that the different pipelines can be
applied to the appropriate columns.
The results show that the majority of the categorical features have a large number of unique
values. One-hot encoding would therefore create a huge number of additional features, which is
undesirable. For this reason, ordinal encoding is used instead of one-hot encoding, avoiding the
curse of dimensionality.
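The two pipelines and the column transformer described above can be sketched with scikit-learn; the column lists and the demo frame are assumptions for illustration, not the report's actual code:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical column lists; the real ones come from the bookings frame.
num_cols = ["lead_time", "adr"]
cat_cols = ["hotel", "customer_type"]

# Numerical pipeline: mode imputation then scaling, as described above.
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("scale", StandardScaler()),
])

# Categorical pipeline: mode imputation then ordinal encoding
# (chosen over one-hot to avoid a blow-up in feature count).
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                              unknown_value=-1)),
])

# The column transformer routes each pipeline to its columns.
preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])

demo = pd.DataFrame({
    "lead_time": [10, 200, 35],
    "adr": [80.0, 120.0, np.nan],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
    "customer_type": ["Transient", np.nan, "Group"],
})
X = preprocess.fit_transform(demo)
```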

The dependent variable in the dataset is the 'is canceled' column. A canceled reservation is
represented by the value 1, while an uncanceled reservation is represented by the value 0.

To determine whether a hotel reservation will be canceled, Logistic Regression, Decision Tree,
and Random Forest are utilized. The data is split into training data and test data: the model is
trained using 70% of the data and then tested using the remaining 30%. Before any models are
deployed, a complete set of pipelines is established, comprising both the individual machine
learning models and the column transformer that houses the pre-processing steps.
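A minimal sketch of the split and the per-model pipelines, on synthetic stand-in data (a plain scaler stands in for the full column transformer above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic features and target (the real X, y come from the cleaned
# bookings frame).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# 70/30 train/test split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# One pipeline per model: pre-processing step plus classifier.
models = {
    "logreg": Pipeline([("prep", StandardScaler()),
                        ("clf", LogisticRegression())]),
    "tree": Pipeline([("prep", StandardScaler()),
                      ("clf", DecisionTreeClassifier(random_state=42))]),
    "forest": Pipeline([("prep", StandardScaler()),
                        ("clf", RandomForestClassifier(random_state=42))]),
}
for pipe in models.values():
    pipe.fit(X_train, y_train)
```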

To show and contrast the assessment metrics for each model, an evaluation function is developed.
The function accepts a machine learning model object as well as the training and test values for
x and y. For the supplied model, the function then outputs an accuracy score, ROC score,
confusion matrix, classification report, and ROC graph. The feature importances are also
evaluated.
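An evaluation function of the kind described could look like this sketch (the name `evaluate` and the demo data are assumptions; the real function also draws the ROC graph, which is omitted here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             confusion_matrix, classification_report)

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the given model and report the metrics described in the text."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred))
    return acc, roc

# Tiny synthetic demo (assumed data, not the bookings set).
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=80) > 0).astype(int)
acc, roc = evaluate(LogisticRegression(), X[:60], y[:60], X[60:], y[60:])
```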
The Random Forest is the top-performing classification
model, as shown by the evaluation metrics and ROC
curve. The model performs significantly better than
Logistic Regression and just slightly better than
Decision Tree. With its high accuracy score of 91%, the
algorithm will typically be able to forecast whether a
booking will be canceled. This model is advantageous to
hotel owners because it can be used to generate more
precise revenue estimates, which in turn enables the
effective implementation of revenue management
strategies. This could result in better pricing, which
generates greater revenues.

Given that Random Forest is the top-performing tree-based model, the discussion concentrates on
its findings.
The feature importances show that the most important factors in determining whether a
reservation will be canceled are "reservation status date," "hotel," and "is repeated guest."
This information is useful to hotel owners since it enables them to identify the underlying
causes of canceled reservations. Even though some of the factors that lead to cancellations are
outside their control or influence, hotel owners should develop a plan that focuses on the
variables they can manage in order to reduce booking cancellations.
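Ranking features by a fitted Random Forest's importances can be sketched as follows; the data here is synthetic, with one deliberately informative feature (the column names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: one informative feature among noise (assumed data).
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(200, 3)),
                 columns=["lead_time", "adr", "noise"])
y = (X["lead_time"] > 0).astype(int)

forest = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances, sorted most important first.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```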
After the supervised algorithms of the first section, an unsupervised problem is now addressed:
a clustering problem based on K-Means. The outcome of each cluster will be examined to determine
the most profitable customers in the data set based on lead time and ADR.
First, the optimal number of clusters is determined:

To establish the ideal number of clusters, it is necessary to choose the value of k after which
the distortion starts to decrease linearly. By this criterion, four clusters is the ideal number
for the given data. To display the cluster centers, the K-means algorithm is run on lead time
and ADR with the number of clusters set to 4.
The number of observations belonging to each cluster is then displayed.
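The elbow method and the final four-cluster fit can be sketched with scikit-learn; the lead-time/ADR points below are synthetic blobs for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (lead_time, adr) points in four blobs (assumed data).
rng = np.random.default_rng(3)
centers = np.array([[20, 150], [200, 60], [60, 90], [300, 40]])
X = np.vstack([c + rng.normal(scale=5, size=(50, 2)) for c in centers])

# Elbow method: distortion (inertia) for k = 1..8.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0)
                 .fit(X).inertia_
            for k in range(1, 9)}

# Final model with the chosen k = 4; cluster centres and sizes.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_)
print(km.cluster_centers_, sizes)
```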
The clients in the red cluster, those with the shortest
lead times and highest ADR, are considered the most
profitable, while the orange cluster shows the longest
lead times and the lowest ADR and is therefore the
least profitable.

Using SAS
The dataset cleansed in Python is used in SAS to predict whether a booking will be cancelled or
not. Whereas three models were trained and tested in Python, in SAS Logistic Regression and
Decision Tree are utilized for prediction.
The response variable is is_canceled for all models analyzed; the remaining variables are the
predictors.
The KS (Youden) value of the ROC chart for the first model is a good .9949 and will be compared
later with the other model's value.
The Fit Summary of the Logistic Regression shows that assigned_room_type and arrival_date_month
are the most critical variables, since their p-values are less than .05.
The residual plot shows some negative outliers.
According to the confusion matrix, the True Negative and True Positive counts are quite good,
with frequencies of 60,883 and 23,106 respectively. From a business perspective, the model
correctly identifies 99.98% of the customers who do not cancel their reservation. The False
Positives can be problematic because the model incorrectly forecasts that 12 customers will
cancel their booking; yet the figure is small and the risk is negligible. In addition, there are
799 False Negatives, customers predicted not to cancel who do cancel, which carries a risk of
lost sales opportunities.
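The rates quoted above can be reproduced directly from the confusion-matrix counts; a quick arithmetic check in Python (counts taken from the SAS output, metric names are standard definitions):

```python
# Confusion-matrix counts reported for the SAS Logistic Regression.
tn, fp, fn, tp = 60_883, 12, 799, 23_106

# Specificity: share of actual non-cancellers predicted correctly --
# the 99.98% figure quoted in the text.
specificity = tn / (tn + fp)

# Sensitivity (recall): share of actual cancellations the model catches.
sensitivity = tp / (tp + fn)

# Overall accuracy across all scored bookings.
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(round(specificity, 4), round(sensitivity, 4), round(accuracy, 4))
# -> 0.9998 0.9666 0.9904
```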

The misclassification plot depicts the accuracy of the
prediction at the chosen cutoff value. The predicted
response classification depends on whether the predicted
probability of level 1 of the response is_canceled is
greater than or equal to the cut-off value: if it is,
the predicted classification is an event; otherwise it
is a non-event. For this data, the segment of the bar
colored "Correct" corresponds to true positives for the
event level of is_canceled, 1.
Cumulative lift measures model effectiveness, and by this measure the model is considered effective enough.

The Decision Tree has a KS (Youden) of .38, which is poor compared to the Logistic Regression.
The variable importance analysis shows that lead_time is the most crucial predictor.

According to the confusion matrix, the True Negatives and True Positives represent decent
results, at 56,066 and 10,067 respectively. However, the model incorrectly classifies 4,889
people as False Positives, predicting that they will cancel their bookings, which is many times
more than the Logistic Regression. Conversely, 13,838 customers are wrongly classified as
staying guests when in fact they cancel.

As before, the segment of the bar colored "Correct" corresponds to true positives for the event
level of is_canceled, 1.
According to the Decision Tree model's cumulative lift, the first two quantiles contain
approximately 2.39 times more events than would be expected by chance (10% of all events per
quantile).
Attempts were made to improve the Logistic Regression by changing the variable selection to the
Backward method, and the Decision Tree by a Random Forest; however, the results did not move in
a positive direction. Therefore, the Logistic Regression is left without changing the variable
selection method.
In contrast to the Python approach, in SAS the Logistic
Regression is the top-performing classification model,
as shown by the KS (Youden), ROC curve, and confusion
matrix. The model performs considerably better than the
Decision Tree and the Random Forest.

In addition, clustering is performed in SAS, as it was in Python:

Based on the two variables lead time (the number of days between the date of entering the
reservation and the client's arrival date) and adr (the Average Daily Rate, defined as the sum
of all accommodation transactions divided by the total number of nights), clustering allows us
to identify the most profitable customers (Cluster 2) and the least profitable customers
(Cluster 3).
The Logistic Regression modeling technique has some limitations:
Logistic Regression is founded on a linear relationship between the predictor (independent) and
predicted (dependent) variables, yet real-world data is rarely linearly separable; the
assumption of linear separability is therefore not always accurate. Overfitting may occur if
there are fewer observations than features. Logistic Regression can only predict discrete
outcomes, so its dependent variable is limited to a set of discrete values. It also assumes
minimal or no multicollinearity among the independent variables.
The Random Forest modeling technique also has limitations:
It can be challenging to strike a balance between tree count and training time (and space).
Growing more trees can increase forecast accuracy, but training the model then requires more
time and space. For data with few features, a random forest may not yield adequate results,
because the reduction in randomness is significant. For data with significant noise, a random
forest may overfit: although the ensemble lessens overfitting through majority voting, its
predictions can still be overfitted compared with a linear model, which is distinguished by a
good fit to existing data, and the individual decision trees tend to overfit.
