
Machine Learning Techniques Final Report

(Fall 2020)
Team:
ML_Explorer

Member:
Min-Hung Lo (B05502062)
Pai-Hung Cheng (R09522717)
Ting-Yi Chen (R09543028)

Instructor:
Hsuan-Tien Lin

Abstract—This report presents three different machine learning models for predicting the daily revenue of a specific hotel booking platform. The selected models are Deep Learning (DL), Gradient Boosted Machine (GBM) and Support Vector Machine (SVM). Apart from these methods, linear regression (LR) was tried out first to check whether the task could easily be solved linearly. A brief theoretical review of these models and a comparison from different perspectives are presented. We also describe how we pre-process the dataset, split the original data into training and validation subsets, and choose the best tuning parameters by the minimum validation error calculated with the mean absolute error (MAE). The best result is achieved by GBM, with a public score of 0.407895 and a private score of 0.402597.

I. INTRODUCTION

Machine learning is now a widely used technique in several fields. In this competition, we are asked to help a hotel booking platform predict its future daily revenue, which is the average daily rate (ADR) multiplied by the number of days the customer is going to stay in the room. The goal of the prediction is to accurately infer the future daily revenue of the platform, where the daily revenue is quantized into 10 levels. For example, revenue between 0 and 10000 is labeled 1, revenue between 10001 and 20000 is labeled 2, and so on.

To solve this problem, DL is a natural candidate, since it is gaining popularity due to its supremacy in accuracy when dealing with huge amounts of data; we therefore considered DL to be one of the three candidates to conquer these 90000 training samples. On the other hand, we are also impressed by the beautiful mathematics of the traditional machine learning models. Hence, we came up with the idea of an inner competition between DL and two traditional models. Here, SVM and GBM are chosen because the former has no aggregation tricks, while the latter has.

II. MODEL SELECTION

A. Deep Learning

DL is a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from raw data, and it has been shown to model complex non-linear relationships. There are diverse types of neural networks, but they always consist of the same components: neurons, weights, biases, activation functions, a learning rate and an optimizer. A variety of outcomes are generated, resulting from different combinations of parameters. Since it takes time to train a neural network, the vital task for us is to select the best combination of hyper-parameters.

B. Gradient Boosted Machine

GBM is a supervised learning model which combines gradient descent with boosting and can deal with both regression and classification problems [1]. Taking a classification problem as an example, GBM starts with a weak learner such as a decision stump, and then increases the weights of misclassified samples while lowering the weights of those correctly classified [2]. We repeat this process for a specified number of iterations.
The misclassifications arising from the previous iteration are identified as gradients, which tell us how to improve the model in the next iteration. The error function is given as

    \min_{\eta}\,\min_{h}\ \frac{1}{N}\sum_{n=1}^{N}\mathrm{err}\left(\sum_{\tau=1}^{t-1}\alpha_{\tau}\,\mathbf{g}_{\tau}(\mathbf{x}_{n})+\eta\,h(\mathbf{x}_{n}),\ y_{n}\right),    (1)

where η is the step size of gradient descent, h is called the functional gradient, N is the number of rows in the dataset, t is the number of boosting iterations, α_τ represents the weight of the misclassification mentioned above and is defined as ln√((1 − ε_τ)/ε_τ), ε_τ is the prediction error rate, g_τ is the hypothesis, x_n is the training data and y_n is the training label. The error here is calculated with the 0/1 error [6].
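As a quick numerical illustration of this weight (our own example, not taken from the report): a hypothesis with prediction error rate ε_τ = 0.25 receives

    \alpha_{\tau} = \ln\sqrt{\frac{1-0.25}{0.25}} = \ln\sqrt{3} = \tfrac{1}{2}\ln 3 \approx 0.549,

so the more accurate a hypothesis is, the larger its vote, while a hypothesis with ε_τ = 0.5 (no better than random guessing) receives weight 0.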
C. Support Vector Machine

SVM has been used to define a space in which different classes are maximally separable; that is, the SVM classifier (SVC) learns from the training dataset and projects it into a higher-dimensional space where two classes can be separated by a hyperplane that maximizes the margin between them. Combined with the kernel trick, SVC can solve non-linear problems as well.

As a branch of SVM, SVM regression (SVR) supports regression problems. Instead of finding a hyperplane with the largest margin, SVR aims to place a decision boundary at a distance from the original hyperplane such that the data points closest to the hyperplane, the support vectors, fall within that boundary [7].
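For reference, the tube described above is usually expressed with the ε-insensitive loss; this is the standard textbook formulation rather than a detail taken from the report:

    \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\sum_{n=1}^{N}\max\bigl(0,\ \lvert y_{n}-(\mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}_{n})+b)\rvert-\varepsilon\bigr),

where points predicted within ε of their target contribute no loss, and C is the same regularization parameter we tune in Section V.D.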
D. Linear Regression

LR is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables [8]. This method is mostly used for forecasting and for finding cause-and-effect relationships between variables.
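In its simplest form (stated here only as a generic reminder, not as a detail of our implementation), LR fits

    y \approx \mathbf{w}^{\top}\mathbf{x} + b,

where the learned weight vector w directly exposes how strongly each feature contributes to the prediction; this property is exploited in Section V.A to judge feature importance.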
E. Discussion

Due to the variety of machine learning models, three specific perspectives should be pointed out in order to select the most suitable model for this problem.

1) Efficiency
The computational time a model consumes on an input of size N to arrive at the output.

2) Scalability
The ability to deal with any amount of data.

3) Interpretability
The ability of a model to let a human go through each step of the algorithm and check whether each step is reasonable [10].

In addition, popularity is considered as a reference. Table I compares DL, GBM and SVM on these four aspects, where GBM shows the highest efficiency together with proper scalability and interpretability. Besides those superiorities, the tree-based structure also makes GBM a well-known model for handling categorical data. Thus, we recommend GBM as the best model. The pros and cons of GBM and SVM are summarized in Table II.

TABLE I
COMPARISON BETWEEN DL, GBM AND SVM

    Performance        DL       GBM         SVM
    Efficiency         Low (a)  High (b)    Medium (c)
    Scalability        High     Medium (d)  Low (d)
    Interpretability   Low      Medium (d)  High (d)
    Popularity         High     Medium      Low

    (a) The time complexity of DL is rather complicated and is affected by several factors [3].
    (b) The time complexity of GBM is O(N log N * d * m) [4].
    (c) The training time complexity of SVM is O(N^2 * d) and the testing time complexity is O(N_SV * d) [5].
    (d) The structure of GBM is tree-based, while that of SVM is a generalized linear model.

TABLE II
ADVANTAGES AND DISADVANTAGES OF GBM AND SVM

    GBM
      Advantages: makes weak learners powerful and provides predictive accuracy; can optimize different error functions; works well with categorical and numerical values; suitable for medium-size datasets.
      Disadvantages: keeps improving to minimize all errors, which causes overfitting; trains sequentially with respect to the errors, which consumes time; has high flexibility, which results in complicated interactions between parameters; weak resistance to noise.

    SVM
      Advantages: more effective in high-dimensional spaces.
      Disadvantages: the classifying hyperplane gives no specific explanation [9].

III. DATA PREPROCESSING

A. Preliminary work

1) Extract stays_in_week_nights and stays_in_weekend_nights from the original train.csv before feature transformation.
2) Create the necessary .csv files by popping out the ADR and cancellation labels.
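A minimal pandas sketch of step 2 above; train.csv is the file provided by the competition, while the output file names are our own illustration:

    import pandas as pd

    # Load the raw training data provided by the competition.
    train = pd.read_csv("train.csv")

    # Pop the two prediction targets out of the feature table.
    adr = train.pop("adr")                  # revenue component: average daily rate
    is_canceled = train.pop("is_canceled")  # cancellation label

    # Save features and labels separately for the later pipelines.
    train.to_csv("train_features.csv", index=False)
    adr.to_csv("label_adr.csv", index=False)
    is_canceled.to_csv("label_is_canceled.csv", index=False)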
TABLE III
PARAMETERS AND VALIDATION ERRORS OF DL

    batch_size   epochs   init_mode       Validation Error
    10           100      uniform         0.390
    10           200      uniform         0.403
    10           100      lecun_uniform   0.426

B. Feature Transformation

1) Drop invalid rows
   a) No revenue contribution
   b) adr < 0
   c) stays_in_week_nights + stays_in_weekend_nights = 0
   d) adults + children + babies = 0
2) Drop useless fields
   a) Data without meaning
      i) ID
   b) Data not given in the test data
      i) reservation_status
      ii) reservation_status_date
3) Pop important fields
   a) Revenue components
      i) is_canceled
      ii) adr
4) Numerical features
   a) Time period, people or status change
      i) lead_time
      ii) arrival_date_week_number
      iii) stays_in_weekend_nights / stays_in_week_nights
      iv) days_in_waiting_list
      v) adults, children, babies
      vi) previous_cancellations
      vii) previous_bookings_not_canceled
      viii) booking_changes
      ix) total_of_special_requests
      x) required_car_parking_spaces
   b) Method: for each feature, combine all the numerical values in the train and test data to fit the scaler.
      i) fillna(0)
      ii) MinMaxScaler(0, 1)
5) Unknown range categorical features
   a) Categories, time
      i) hotel
      ii) country
      iii) market_segment
      iv) distribution_channel
      v) reserved_room_type
      vi) assigned_room_type
      vii) deposit_type
      viii) customer_type
      ix) agent
      x) company
      xi) arrival_date_year
   b) Method: for each feature, combine all the categories in the train and test data to fit the encoder (an encoding sketch is given at the end of this section).
      i) fillna('N/A')
      ii) LabelEncoder + MinMaxScaler(-0.5, 0.5)
      iii) OneHotEncoder
      iv) TargetEncoder
6) Given range categorical features
   a) Repeated time
      i) month
   b) Method
      i) Map with the dictionary {'Jan': 1, ..., 'Dec': 12} + MinMaxScaler(0, 1)

C. Encoding Effectiveness and Discussion

1) OneHotEncoding
Several features include plenty of categories, such as country and agent. Compared with the 28 original features, OneHotEncoding brings the number of features up to 900, which creates redundancy and causes memory issues.

2) TargetEncoding
In target encoding, features are replaced with a blend of the posterior probability of the target given the categorical value and the prior probability of the target over all the training data [11]. This method carries target-explainable values into the encoding, but requires careful validation to guard against overfitting.

Apparently, TargetEncoding is preferred over OneHotEncoding to reduce training time and increase physical meaning. Numerical features are processed with MinMaxScaler, which eliminates the effect of different numerical intervals.
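The following sketch illustrates the preferred combination from Sections III.B and III.C, MinMaxScaler for the numerical columns and TargetEncoder for the categorical ones. The report does not name the library used for target encoding, so the category_encoders import, the file names and the column subsets below are assumptions for illustration:

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler
    from category_encoders import TargetEncoder

    # Illustrative column subsets; the full lists appear in Section III.B.
    num_cols = ["lead_time", "adults", "children", "babies", "booking_changes"]
    cat_cols = ["hotel", "country", "market_segment", "agent", "company"]

    train = pd.read_csv("train_features.csv")
    test = pd.read_csv("test.csv")
    adr = pd.read_csv("label_adr.csv").squeeze()

    # Numerical features: fill blanks with 0, then scale into [0, 1] using
    # the value range of the train and test data combined.
    both = pd.concat([train[num_cols], test[num_cols]]).fillna(0)
    scaler = MinMaxScaler(feature_range=(0, 1)).fit(both)
    train[num_cols] = scaler.transform(train[num_cols].fillna(0))
    test[num_cols] = scaler.transform(test[num_cols].fillna(0))

    # Categorical features: fill blanks with 'N/A', then target-encode
    # against the ADR label (fitted on the training rows only).
    train[cat_cols] = train[cat_cols].fillna("N/A")
    test[cat_cols] = test[cat_cols].fillna("N/A")
    encoder = TargetEncoder(cols=cat_cols).fit(train[cat_cols], adr)
    train[cat_cols] = encoder.transform(train[cat_cols])
    test[cat_cols] = encoder.transform(test[cat_cols])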
IV. IMPLEMENTATION

A. Validation

Validation is a touchstone of model performance. We find the best model and the corresponding parameters by calculating the validation error before uploading our results, so we can improve the correctness without being restricted by the entry limit.
TABLE IV
PARAMETERS AND VALIDATION ERRORS OF GBM

    n_estimators   max_depth   Validation Error   Public Score   Private Score
    50             3           0.317              -              -
    50             10          0.232              0.329          0.442
    50             50          0.298              -              -
    50             100         0.304              -              -
    100            3           0.291              0.408          0.403
    100            10          0.245              0.355          0.455
    100            50          0.291              -              -
    100            100         0.291              -              -
    200            3           0.298              -              -
    200            10          0.218              0.355          0.455
    200            50          0.298              -              -
    200            100         0.298              -              -

sklearn.model_selection.train_test_split provides an efficient way to compute the validation error. We choose to split the data with respect to the date into two folds, where the first fold contains 80% of the original data and the second fold contains the remaining 20%. Setting train_test_split(shuffle=False) brings an extra benefit: the older data is used for training while the newer data is used for validation, which makes sense for this time-dependent problem.

In this problem, we take advantage of this validation trick twice, once for predicting cancellation and once for predicting ADR.
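A minimal sketch of the split described above; the 80/20 ratio and shuffle=False follow the text, while the variable names continue the hypothetical pipeline from Section III:

    from sklearn.model_selection import train_test_split

    # Keep the chronological order: the first 80% (older bookings) trains the
    # model, and the last 20% (newer bookings) measures the validation error.
    X_train, X_valid, y_train, y_valid = train_test_split(
        train, adr, test_size=0.2, shuffle=False)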
B. Data Grouping

Initially, the rows are grouped date by date, so all the features are summed up for each single day. For example, the total number of adults in one day is condensed into one cell. The categorical features are also added up after label encoding, and the prediction target is the daily revenue label. Training in this manner is very efficient, since the classification label corresponds directly to the training label.
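A rough pandas sketch of this per-day grouping, given only as an illustration of the idea; the date columns follow the public hotel-booking dataset and the aggregation is simplified:

    # Sum every booking arriving on the same day into a single row;
    # `train` is the raw feature table from the Section III sketch.
    date_cols = ["arrival_date_year", "arrival_date_month",
                 "arrival_date_day_of_month"]
    daily = train.groupby(date_cols).sum(numeric_only=True).reset_index()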
However, we found that this method somehow misses or mixes the features. Thus, we instead keep each row of data as one unit, and the predicted terms become cancellation and ADR rather than the integer revenue labels.

C. Cancellation Prediction

Since the test data has no cancellation records, the prior work is to predict whether an order is canceled. We aim for the smallest validation error, which is calculated using MAE.

D. ADR and Label Prediction

Next, with the cancellation features added, we can predict the ADR of the valid orders and extract the output labels. We again aim for the smallest validation error, calculated using MAE.

E. Parameters Tuning

1) GridSearchCV for DL
DL has hyper-parameters that we can adjust to customize the model to fit the data. Hyper-parameter optimization, available in the scikit-learn Python library, is an effective approach to objectively search different combinations of hyper-parameters and choose the subset that yields the best-performing model.

First, we set up our model with KerasClassifier, where the neural network inside is built with 5 layers plus a Dropout layer with a dropout rate of 0.05. By searching over different batch sizes, epochs and initial modes of the weights, GridSearchCV selects the combination with the minimum loss.
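A hedged sketch of that search, assuming the TensorFlow 2.x Keras scikit-learn wrapper that was current in Fall 2020; the layer widths and the parameter grid (taken from Table III) are illustrative rather than the exact network used:

    from sklearn.model_selection import GridSearchCV
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout
    from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

    def build_model(init_mode="uniform"):
        # Five Dense layers plus one Dropout layer, as described in the text.
        model = Sequential([
            Dense(64, activation="relu", kernel_initializer=init_mode),
            Dense(64, activation="relu", kernel_initializer=init_mode),
            Dropout(0.05),
            Dense(32, activation="relu", kernel_initializer=init_mode),
            Dense(16, activation="relu", kernel_initializer=init_mode),
            Dense(10, activation="softmax", kernel_initializer=init_mode),
        ])
        model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
        return model

    param_grid = {
        "batch_size": [10],
        "epochs": [100, 200],
        "init_mode": ["uniform", "lecun_uniform"],
    }
    # X_train, y_train: pre-processed features and integer labels,
    # split as in Section IV.A.
    search = GridSearchCV(KerasClassifier(build_fn=build_model, verbose=0),
                          param_grid, scoring="neg_mean_absolute_error", cv=3)
    search.fit(X_train, y_train)
    print(search.best_params_)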
2) For loops for GBM and SVM
GBM and SVM are rather simple models, so we execute for loops to find the parameter combinations which lead to the best score, for example, the n_estimators of GBM and the kernel function of SVM.
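A minimal sketch of such a loop for GBM, using the grid from Table IV and the MAE criterion from Section IV.A; the estimator class is our assumption, since the report only says GBM:

    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error

    # X_train, X_valid, y_train, y_valid: from the split in Section IV.A.
    best = None
    for n_estimators in [50, 100, 200]:
        for max_depth in [3, 10, 50, 100]:
            model = GradientBoostingRegressor(n_estimators=n_estimators,
                                              max_depth=max_depth)
            model.fit(X_train, y_train)
            mae = mean_absolute_error(y_valid, model.predict(X_valid))
            if best is None or mae < best[0]:
                best = (mae, n_estimators, max_depth)

    print("best validation MAE %.3f with n_estimators=%d, max_depth=%d" % best)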
V. EXPERIMENTAL RESULTS

A. Linear Regression

Starting with the easiest approach, some information about the data is revealed and a baseline is set up. One advantage of LR is that the weights can be obtained, which tells us the importance of each feature.
TABLE V
PARAMETERS AND VALIDATION ERRORS OF SVM

    C      Validation Error   Public Score   Private Score
    1      0.464              0.687          0.753
    10     0.464              0.592          0.688
    100    0.497              -              -
    500    0.630              -              -

TABLE VI
PARAMETERS AND VALIDATION ERRORS OF SVM (FOR THE IS_CANCELED LABEL)

    C      Kernel   Validation Error
    1      linear   0.181
    1      poly     0.139
    1      rbf      0.144
    10     linear   0.174
    10     poly     0.132
    10     rbf      0.133
    100    linear   0.174
    100    poly     0.126
    100    rbf      0.124
    200    linear   0.174
    200    poly     0.126
    200    rbf      0.122
It is also from this weight observation that we find OneHotEncoder unfavorable, which settles us on TargetEncoder. The reason is given below.

Consider using MinMaxScaler and OneHotEncoder for the numerical and categorical features, respectively. Due to the sparsity of the data (some entries are left blank), the linear model tends to compensate for the blanks by increasing the weights of the other columns. Within one single row, a one-hot-encoded value is normally much higher than a min-max-scaled value in [0, 1]. Hence, from LR's perspective, modifying the weight of a one-hot-encoded feature is more efficient, which drives some of the one-hot-encoded weights to extremely high or extremely negative values.

B. Deep Learning

With the help of GridSearchCV, we adjust several parameters (batch_size, epochs and init_mode), trying every combination to find the minimum validation error. Table III shows the searched combinations of parameters. After checking the validation error of each combination, we find that there is a trade-off between epochs and init_mode for our pre-processed dataset. By adjusting the pre-processing method, we can find the best combination to better fit our data [12].

C. Gradient Boosted Machine

We adjust two parameters in this experiment, n_estimators and max_depth, while holding the others fixed. Table IV shows the tuned parameters and the validation error of each model, where the public and private scores are reported for the smallest validation error and for the default setting. We can observe that the validation error becomes smaller as the two terms increase, until they reach certain values. Therefore, we can draw a short conclusion: appropriate model complexity helps to strengthen the model, but care must be taken to avoid overfitting.

D. Support Vector Machine

SVM offers several types of kernel function, such as linear, sigmoid, polynomial and radial basis function. Since the radial basis function is the most popular one, we only vary the regularizer C while the other minor parameters remain unchanged. Table V gives a summary of the results, where the public and private scores are shown under the same conditions as in Table IV. Besides these results, the cancellation prediction is also shown in Table VI for different combinations of C and kernel.
E. Discussion

From the discussion above, we can see that GBM is superior to the other methods. From the whole practical experience, we learned that data pre-processing is a key point for model performance. If the data pre-processing were set up properly, SVM or DL might also reach the same level as GBM.

VI. CONCLUSION

This work demonstrates three different methods to solve the prediction problem. The processes of feature transformation and implementation are also described in detail. Although DL and SVM are admittedly powerful tools, GBM achieves the lowest error in this problem, as we expected, because our straightforward data preprocessing is unfavorable to the other two models. In addition, the efficiency of GBM gives us more chances for trial and error with respect to the validation error.

VII. WORK LOADS

A. Min-Hung Lo
Data preprocessing, LR, some parts of SVM and report writing.

B. Pai-Hung Cheng
Data preprocessing, DL and report writing.

C. Ting-Yi Chen
Data preprocessing, GBM, some parts of SVM and report writing.

REFERENCES

[1] A Gentle Introduction to Gradient Boosting. Available from: http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf
[2] Understanding Gradient Boosted Machines. Available from: https://towardsdatascience.com/understanding-gradient-boosting-machines-9be756fe76ab
[3] Rich Lee and Ing-yi Chen, Time Complexity of DL. 2020 International Conference on Mathematics and Computers in Science and Engineering (MACISE).
[4] Time Complexity of GBM. Available from: https://medium.com/towards-artificial-intelligence/time-and-space-complexity-of-machine-learning-models-df9b704e3e9c
[5] Marc Claesen and Frank De Smet, Time Complexity of SVM with RBF Kernel. 2014, Katholieke Universiteit Leuven.
[6] Machine Learning Techniques Lecture 11: Gradient Boosted Decision Tree. Available from: https://www.csie.ntu.edu.tw/~htlin/mooc/doc/211_handout.pdf
[7] SVM Introduction. Available from: https://www.sciencedirect.com/topics/engineering/support-vector-machine
[8] Linear Regression Introduction. Available from: https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a
[9] Advantages and Disadvantages of SVM. Available from: https://dhirajkumarblog.medium.com/top-4-advantages-and-disadvantages-of-support-vector-machine-or-svm-a3c06a2b107
[10] Zachary C. Lipton, The Mythos of Model Interpretability. 2016, Carnegie Mellon University.
[11] Target Encoding. Available from: https://medium.com/analytics-vidhya/target-encoding-vs-one-hot-encoding-with-simple-examples-276a7e7b3e64
[12] Grid Search. Available from: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html
