
Data Analytics MSc Dissertation MTH775P, 2019/20

Disquisitiones Arithmeticæ
Predicting the prices for breakfasts and beds

Hai Nam Nguyen, ID 161136118


Supervisor: Dr. Martin Benning

A thesis presented for the degree of


Master of Science in Data Analytics

School of Mathematical Sciences


Queen Mary University of London
Declaration of original work

This declaration is made on August 17, 2020.

Student's Declaration: I, Hai Nam Nguyen, hereby declare that the work
in this thesis is my original work. I have not copied from any other students'
work, work of mine submitted elsewhere, or from any other sources except
where due reference or acknowledgement is made explicitly in the text, nor
has any part been written for me by another person.
Referenced text has been flagged by:

1. Using italic fonts, and

2. using quotation marks “. . . ”, and

3. explicitly mentioning the source in the text.

This work is dedicated to my niece Nguyen Le Tue An (Mochi), who has
brought a great source of joy to me and my family recently.
Abstract

Pricing and guessing the right prices are vital for both hosts and renters on internet-based
home-sharing platforms. To contribute to the growing interest and immense literature
on applying Artificial Intelligence to predicting rental prices, this paper attempts to build
machine learning models for that purpose using the Luxstay listings in Hanoi. The R2 score
is used as the main criterion for model performance, and the results show that Extreme
Gradient Boosting (XGB) is the model with the best performance, with R2 = 0.62, beating
the most sophisticated machine learning model considered: Neural Networks.

Contents

Declaration of original work

Abstract

1 Introduction

2 Literature Review

3 Experimental Design
  3.1 Dataset
  3.2 K-Fold Cross Validation
  3.3 Measuring Model Accuracy

4 Methods
  4.1 LASSO
    4.1.1 FISTA
  4.2 Random Forest
  4.3 Gradient Boosting
  4.4 Extreme Gradient Boosting
  4.5 LightGBM
    4.5.1 Gradient-based One-sided Sampling
    4.5.2 Exclusive Feature Bundling
  4.6 Neural Networks
    4.6.1 Adam Algorithm
    4.6.2 Backpropagation

5 Experiments and Results

6 Conclusion and Outlook

A Some special mathematical notations
  A.1 Vector Norm
  A.2 The Hadamard product

B The Chain Rule

References
Chapter 1

Introduction

Since its establishment in 2016, Luxstay has become one of the most popular home-sharing
platforms in Vietnam, along with Airbnb, with a network of more than 15,000 listings. The
platform connects guests looking to rent villas, houses, apartments and other properties
with hosts, and vice versa. Hence, a reasonable price helps hosts gain a high and stable
income and guests get great experiences in new places. Therefore, working on a sensible
predictor and suggestion of Luxstay prices can generate real-life value and practical
applications.
Hanoi is the capital of Vietnam and has the second most listings on Luxstay. The city has
also been ranked among the top 10 destinations to visit by TripAdvisor. As a dynamic city
with active bookings and listings, Hanoi is a great example for the study of Luxstay pricing.
In this paper, we build a price prediction model and compare the performance of different
methods using R2 as the main measure. The input to our models is the data scraped from
the Hanoi page of the website, which includes continuous and categorical records about
listings. A number of methods, including traditional machine learning models (LASSO,
random forest, gradient boosting), Extreme Gradient Boosting, LightGBM and neural
networks, are then applied to predict the prices of listings.

Chapter 2

Literature Review

The sharing economy is a socio-economic system that arranges “the peer-to-peer-based
activity of obtaining, giving, or sharing the access to goods and services” through
“community-based online services” (J. Hamari 2015). Home-sharing is one of these sharing
activities, and it has experienced significant growth due to high demand from tourism
(Guttentag 2015). Given that Luxstay is a startup from an emerging economy, the platform
has not received as much attention from the academic community as Airbnb, the leading
company for this service (Wang & Nicolau 2017). Nevertheless, the Vietnamese home-sharing
platform has characteristics similar to Airbnb, as it is also an internet-based company that
coordinates the demand of short-term renters and hosts. Therefore, it is worth reviewing
some findings on Airbnb from recent papers.
Gibbs et al. (2017) stated that one of the biggest challenges for Airbnb was setting the
right prices, and identified two key reasons for this issue. Firstly, unlike the hotel business,
where prices are set by trained experts and industry benchmarks, rental prices on Airbnb
are normally determined by regular hosts with limited support. Secondly, instead of letting
an algorithm control prices as Uber and Lyft do, Airbnb leaves prices to hosts to decide,
even though they might not be well informed. Consequently, these two factors may cause
a potential financial loss, and empirical evidence shows that incompetent pricing causes a
loss of 46% of additional revenue on Airbnb. Hence, there has been an interest in the study
of rental price prediction on the leading platform. The two trends in this topic are
hedonic-based regression and artificial intelligence techniques.
The term hedonic is used to describe “the weighting of the relative importance of various
components among others in constructing an index of usefulness and desirability”
(Goodman 1998). In other words, hedonic pricing identifies the factors and characteristics
affecting an item's price (Investopedia.com 2020). Wang & Nicolau (2017) aimed to design
a system to understand which features are important inputs for an automated price
suggestion on Airbnb using a hedonic-based regression approach. The functional forms used
were Ordinary Least Squares and Quantile Regression to analyse 25 variables of 180,533
listings in 33 cities. The results show that features related to host attributes, such as the
number of their listings and their profile pictures, are the most important. Among those,
super-host status, which indicates experienced hosts on the platform, is the best one.
However, the authors also discussed the limitations of this analysis. The approach rests
on some economic assumptions that need to be examined; for example, the assumption of
hosts' rationality requires a qualitative check, which is skipped in the study. Generally,
the effectiveness of hedonic-based regression for price prediction is restricted by the
model assumptions and estimation (Selim 2009).
Another approach to price prediction is to apply artificial intelligence techniques, which
mainly include machine learning and neural network models. Tang & Sangani (2015)
produced a model for price prediction for San Francisco listings. To reduce the complexity
of the task, they turned the regression problem into a classification task that predicts
both the neighbourhood and the price range of a listing, and Support Vector Machine was
the main model to be tuned. Uniquely, they included images as inputs for the model by
creating a visual dictionary to categorise the image of a listing. The results show that
while the price prediction achieves a high accuracy on the test set at 81.2%, the
neighbourhood prediction suffers from overfitting, with a big gap between the train and
test sets. Alternatively, Cai & Han (2019) attempted to work on the regression problem
using the listings in Melbourne. The study implemented l1 regularisation as feature
selection for all traditional machine learning methods and then compared to models
without it. The results show that the latter perform better overall and the gradient
boosting algorithm produces the best precision with R2 = 0.6914 on the test set.
Recently, another study of the listings in New York obtained an interesting result with
the highest R2 of 0.7768 (Kalehbasti et al. 2019). To reach that score, they performed a
logarithmic transformation of the prices and then trained their models. Additionally, they
also compared three feature selection methods: manual selection, p-values and LASSO. The
analysis shows that p-values and LASSO outperformed manual selection, and the best method
applied in the paper is LASSO.
In this paper, we apply the knowledge of the last three studies to build our price
predictor for the listings on Luxstay. Apart from widely used traditional machine learning
methods and neural networks, we also attempted to code an algorithm to compute LASSO
regression ourselves and used two recent gradient boosting techniques, Extreme Gradient
Boosting and LightGBM. The project worked on the original rental prices to produce a
price prediction without any logarithmic transformation.
Chapter 3

Experimental Design

3.1 Dataset

Figure 3.1: Example of Luxstay Listings

Our dataset of Luxstay listings was scraped using the BeautifulSoup package in Python
(Richardson 2007). It includes 2675 listings posted in Hanoi on 27 December 2019. Each
listing contains fields describing the offered price (in dollars), district, type of home,
name of its building, and the numbers of guests allowed, bedrooms and bathrooms.
In order to make the dataset a suitable input for machine learning models, we went through
a few pre-processing steps. Firstly, we dropped features that are not related to the prices,
such as the listing id, listing name and listing link.
Secondly, we used dummy variable encoding to deal with categorical features, which some
machine learning algorithms cannot work with directly. A categorical variable is a variable
that assigns an observation to a specific group or nominal category on the basis of some
qualitative property (Yates et al. 2003). A dummy variable is a binary variable that takes
the values 0 and 1, where the former represents the absence of a category and the latter
its presence (James H. Stock 2020, p. 186). The number of dummy variables depends on the
number of distinct categories: a feature with K categories requires K-1 dummy variables,
so that the data matrix remains invertible and the dummy variable trap is avoided
(James H. Stock 2020, p. 230).
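As an illustration of this step, the snippet below gives a minimal sketch of dummy variable encoding with the pandas package; the column names and values are hypothetical, and drop_first=True keeps K-1 dummies per categorical feature to avoid the dummy variable trap.

import pandas as pd

# Hypothetical listings with two categorical features
listings = pd.DataFrame({
    "district":  ["Ba Dinh", "Hoan Kiem", "Ba Dinh", "Tay Ho"],
    "home_type": ["Apartment", "Villa", "Studio", "Apartment"],
    "guests":    [2, 6, 2, 4],
    "price":     [35.0, 120.0, 28.0, 55.0],
})

# drop_first=True encodes a feature with K categories into K-1 dummy columns
encoded = pd.get_dummies(listings, columns=["district", "home_type"], drop_first=True)
print(encoded.head())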
We ended up with 78 explanatory features. As we have a limited number of
listings in our dataset, we attempted to solve this problem by using K-Fold Cross
Validation for model selection since this method is considered to be useful when
the number of records is low (Bishop 2006, p. 32).

3.2 K-Fold Cross Validation

Figure 3.2: The technique of K-Fold Cross Validation with K=4 (Bishop 2006,
p. 33)

The method involves splitting the dataset into K different groups. K-1 groups are then used
to train a given model, which is evaluated on the remaining group. This step is repeated K
times, until each of the K groups has been used once as the test set. Finally, the
performance score of a model, which is discussed in the section below, is the average of
the scores from the K runs.
A major drawback of this technique is that it is computationally expensive, as a model has
to be trained and tested K times. This issue is critical in our case, as there are machine
learning algorithms with a large number of hyper-parameters whose combinations need to be
tested. For instance, there are more than 10 hyper-parameters to tune in Extreme Gradient
Boosting, which makes it infeasible to use K-Fold Cross Validation for all of the
combinations. Therefore, we only tuned the parameters expected to have the largest impact
on model performance, while leaving the others at the default values set by their packages.
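As a brief illustration, the sketch below runs 5-fold cross validation for a single model with Scikit-learn; the synthetic data stand in for our pre-processed feature matrix and price target, and the model and its settings are placeholders.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data with the same shape as the pre-processed listings (2675 records, 78 features)
X, y = make_regression(n_samples=2675, n_features=78, noise=10.0, random_state=0)

# 5-fold cross validation: train on 4 folds, evaluate on the held-out fold, repeat 5 times
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=0),
                         X, y, cv=cv, scoring="r2")
print("R2 per fold:", np.round(scores, 3), "mean:", scores.mean().round(3))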

3.3 Measuring Model Accuracy


In this project, we used several machine learning algorithms with different sets of
parameters to build a price predictor. In order to choose the best candidate for this task,
there needs to be a metric that assesses how those models perform. Performance is
quantified by how close the predicted value for a given observation is to the true value of
that observation. For a regression problem, the most commonly used metric is the mean
squared error (MSE) (Hastie 2017, p. 29), which is given by

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}(x_i) \right)^2    (3.1)

where \hat{f}(x_i) is the prediction produced by \hat{f} for the i-th observation. The MSE
will be small if a model generates precise predictions, and vice versa. In general, the MSE
is computed on the training dataset to optimise a model, and the model's performance is
then evaluated on the testing dataset. The MSE in the above formula is not bounded to any
range. The smallest possible MSE is 0, the result of a model with perfect predictions,
which is nearly impossible in reality. Therefore, by choosing the smallest MSE among our
models, we do not know whether that model can become a practical tool for price suggestion.
This is where the R2 statistic comes in as an alternative measure.
The R2 statistic shows the fraction of variance in the target that can be predicted using
the features (James H. Stock 2020, p. 153). This metric always takes a value between 0 and
1, where an R2 near 0 indicates a model with poor accuracy, while an R2 close to 1
indicates a model that predicts the target well. The formula for this metric is given by

R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{f}(x_i) \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}    (3.2)

where \bar{y} is the mean of the target that we try to predict. Additionally, the formula
can be rewritten as

R^2 = 1 - \frac{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}(x_i) \right)^2}{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}

which, with regard to (3.1), we can write as

R^2 = 1 - \frac{\text{MSE of the model}}{\text{MSE of predicting the mean of the data}}    (3.3)

As the MSE of the model gets smaller towards 0, the R2 gets bigger towards 1. Therefore, we
can interpret the R2 as a rescaling of the MSE. This is the reason we chose the R2 as the
main metric for model selection: its intuitive scale is descriptively better.
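For concreteness, the short sketch below computes both metrics with Scikit-learn on a toy set of predictions; it is only meant to illustrate the relationship in (3.3).

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([30.0, 55.0, 120.0, 28.0, 75.0])   # actual prices
y_pred = np.array([35.0, 50.0, 100.0, 30.0, 80.0])   # model predictions

mse_model = mean_squared_error(y_true, y_pred)
mse_mean = mean_squared_error(y_true, np.full_like(y_true, y_true.mean()))

# R2 as a rescaling of the MSE, as in equation (3.3)
print("MSE:", mse_model)
print("R2 :", r2_score(y_true, y_pred), "=", 1 - mse_model / mse_mean)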
Chapter 4

Methods

4.1 LASSO
Least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996) is a regression
analysis technique that was introduced to improve prediction accuracy and to perform
feature selection for regression models. LASSO seeks the solution of the following problem,

\arg\min_{w \in \mathbb{R}^d} \left\{ \|Y - Xw\|_2^2 + α \|w\|_1 \right\}    (4.1)

where X ∈ R^{N×d} is the data matrix with N records and d features, Y ∈ R^N is the target
vector, w ∈ R^d is the vector of weight parameters, and the subscripts 1 and 2 indicate the
l1 and l2 norms respectively (Appendix A). The problem above is non-differentiable, so we
cannot apply Gradient Descent, the common algorithm for regression models, to compute
LASSO. However, there are various methods to compute the LASSO solution, including
Coordinate Descent (Tibshirani et al. 2010) and Least Angle Regression (Efron et al. 2004),
which are implemented in popular machine learning packages such as Scikit-learn (see the
LASSO user guide in the Scikit-learn documentation). Instead of using those two from a
pre-written package, we attempted to write an alternative algorithm, the FISTA algorithm
(Beck & Teboulle 2009), to solve the LASSO problem ourselves using the NumPy package
(Oliphant 2006).

4.1.1 FISTA
Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) is an iterative algorithm based on
the application of proximal operators to solve non-differentiable convex optimisation
problems. In particular, the general optimisation problem is

\hat{X} = \arg\min_{X \in \mathbb{R}^n} \{ f(X) + g(X) \}    (4.2)

where:

• g : R^n → R is a continuous convex function, which is possibly non-smooth, i.e. non-differentiable;

• f : R^n → R is a smooth convex function with a Lipschitz continuous gradient with constant L(f).

  – Lipschitz constant: if ‖∇f(x) − ∇f(y)‖ ≤ L(f) ‖x − y‖ for all x, y ∈ R^n, then L(f) is a Lipschitz constant of ∇f.

FISTA can be applied to many problems of the form (4.2), and LASSO is among the best known.
Hence, we can apply this algorithm to the following LASSO loss function,

Loss = \min_{w} \left\{ \frac{1}{2} \|Xw - Y\|_2^2 + α \|w\|_1 \right\}

This is a slightly modified version of (4.1), as we add a factor of 1/2 to the first term
for mathematical convenience. We then put the loss function into the form

Loss = \min_{w} \{ f(w) + g(w) \}

For this problem, our job is to find ∇f(w) and a Lipschitz constant L(f) in order to run
the algorithm. Firstly, we compute the gradient.

We have f(w) = \frac{1}{2} \|Xw - Y\|_2^2. Applying the chain rule (Appendix B), we get the
gradient with respect to the weights,

\nabla f(w) = X^T (Xw - Y)

Now we find the Lipschitz constant through ‖∇f(a) − ∇f(b)‖. Expanding this expression and
factorising the common term X^T X, we have

\|X^T (Xa - Y) - X^T (Xb - Y)\| = \|X^T X (a - b)\|

Applying the norm inequality ‖A(a − b)‖ ≤ ‖A‖ ‖a − b‖ (Benning 2019), we find that
L = ‖X^T X‖ is a Lipschitz constant of ∇f.
FISTA is a refined version of the Iterative Shrinkage-Thresholding Algorithm (ISTA); both
methods seek the solution of the following proximal problem (Beck & Teboulle 2009):

p_L(v) = \arg\min_{w} \left\{ g(w) + \frac{L}{2} \left\| w - \left( v - \frac{1}{L} \nabla f(v) \right) \right\|^2 \right\}

For our problem, we substitute g(w) = λ \|w\|_1 and set z = v - \frac{1}{L} \nabla f(v).
Then it becomes

p_L(z) = \arg\min_{w} \left\{ λ \|w\|_1 + \frac{L}{2} \|w - z\|^2 \right\}

With some calculus, we obtain the soft-thresholding operator

S_{λ/L}(z) = \begin{cases} z - \frac{λ}{L}, & z > \frac{λ}{L} \\ 0, & |z| \le \frac{λ}{L} \\ z + \frac{λ}{L}, & z < -\frac{λ}{L} \end{cases}    (4.3)

The formula (4.3) can also be written as

S_{τ}(z) = \mathrm{sign}(z) \, (|z| - τ)_{+}



where the “sign” function is defined as

\mathrm{sign}(x) = \begin{cases} +1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases}

and (a)_{+} is defined as

(a)_{+} = \begin{cases} 0, & a < 0 \\ a, & a \ge 0 \end{cases}

As a consequence, we have all the ingredients to run FISTA. Since the Lipschitz constant in
this case is easy to compute, we follow the algorithm with constant step size from the
original paper (Beck & Teboulle 2009). The algorithm proceeds as below.

Algorithm 1 FISTA with constant step size

Input: L = L(f), a Lipschitz constant of ∇f; λ: regularisation parameter
Initialise: w_0 ∈ R^n, v_1 = w_0, t_1 = 1
Iterate:
for k = 1, ..., K − 1 do
    compute z_k = v_k − (1/L) ∇f(v_k)
    compute w_k = S_{λ/L}(z_k)
    compute t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2
    compute v_{k+1} = w_k + ((t_k − 1) / t_{k+1}) (w_k − w_{k−1})
end for
return w_K
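The following is a minimal NumPy sketch of Algorithm 1 for the LASSO loss above, written under the same constant-step-size assumption; it is an illustration of the idea rather than the exact code used in this project.

import numpy as np

def soft_threshold(z, tau):
    # S_tau(z) = sign(z) * max(|z| - tau, 0)
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista_lasso(X, Y, lam, n_iter=500):
    # Lipschitz constant of the gradient X^T (Xw - Y): the spectral norm of X^T X
    L = np.linalg.norm(X.T @ X, 2)
    w = np.zeros(X.shape[1])
    v, t = w.copy(), 1.0
    for _ in range(n_iter):
        z = v - (X.T @ (X @ v - Y)) / L                     # gradient step on f
        w_next = soft_threshold(z, lam / L)                 # proximal step on g
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        v = w_next + ((t - 1.0) / t_next) * (w_next - w)    # momentum step
        w, t = w_next, t_next
    return w

# Small usage example with random data and a sparse true weight vector
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 0, 0, 1.5])
Y = X @ true_w + 0.1 * rng.normal(size=100)
print(np.round(fista_lasso(X, Y, lam=5.0), 3))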

4.2 Random Forest


Random Forest (Breiman 2001) is an ensemble method that uses the bagging technique. The
idea of ensemble learning is to construct a prediction model by combining multiple machine
learning algorithms in order to achieve better predictive power than using those algorithms
alone (Trevor Hastie 2009, p. 605). Bagging (or Bootstrap Aggregating) averages the results
of the models in the ensemble equally. This technique trains each model in the ensemble on
a bootstrap sample, a subset that is randomly drawn with replacement from the training
dataset (Trevor Hastie 2009, p. 282), thereby reducing the variance. The Random Forest
algorithm contains a collection of decision trees. Its output is the class with the most
votes for a classification problem, or the mean prediction of the individual trees for
regression.

Figure 4.1: Random Forest structure (Source)

The Random Forest regression in our study operates through the following algorithm
(Trevor Hastie 2009, p. 588):

Algorithm 2 Random Forest for Regression

Iterate:
for b = 1, ..., B do
    1. Draw a bootstrap sample Z of size N from the training data.
    2. Grow a decision tree T_b on the bootstrapped data by recursively repeating the
       following steps for each terminal node of the tree, until the minimum node size
       n_min is reached:
        i. Select m variables at random from the p variables.
        ii. Pick the best variable/split-point among the m using the Mean Squared Error.
        iii. Split the node into two daughter nodes.
end for
Output: the ensemble of trees {T_b}_{b=1}^{B}
return the prediction at a new point x: \hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)

The construction of the decision trees is described in more detail in the original paper
(Breiman et al. 1984). Nonetheless, the Random Forest algorithm only uses a limited number
of features, fewer than the total amount, selected at random to decide the candidate split
at each node. This removes the problem of the ensemble over-relying on an individual
feature and makes fair use of all features, making the model more robust.
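As a usage sketch, the Scikit-learn implementation used later in this project can be fitted as below; the data are synthetic stand-ins and the hyper-parameter values are placeholders rather than the tuned ones reported in Chapter 5.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2675, n_features=78, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features controls how many randomly selected features are considered at each split
forest = RandomForestRegressor(n_estimators=300, max_features=10, max_depth=50, random_state=0)
forest.fit(X_train, y_train)
print("Test R2:", round(r2_score(y_test, forest.predict(X_test)), 3))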

4.3 Gradient Boosting


Gradient Boosting is another form of ensemble method, one that applies the boosting
technique. The boosting method combines the predictions of many weak learners to generate
a powerful “committee”. Different from Random Forest, which builds a forest of decision
trees simultaneously, Gradient Boosting generates trees sequentially, each of which aims to
improve on the errors made by the previous trees in the series.
The Gradient Boosting Regression algorithm is as follows (Friedman 2002):

Algorithm 3 Gradient Boosting Regression

Initialise: f_0(x) = \arg\min_{γ} \sum_{i=1}^{N} L(y_i, γ); learning rate ν
for m = 1, ..., M do
    1. For i = 1, ..., N compute
        r_{im} = - \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}
    2. Fit a regression tree to the targets r_{im}, giving terminal regions R_{jm},
       j = 1, ..., J_m.
    3. For j = 1, ..., J_m compute
        γ_{jm} = \arg\min_{γ} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + γ)
    4. Update f_m(x) = f_{m-1}(x) + ν \sum_{j=1}^{J_m} γ_{jm} I(x \in R_{jm})
end for
return \hat{f}(x) = f_M(x)

Algorithm 3 is determined by the choice of the loss function L(y, f(x)). In this study, our
choice of loss criterion is least squares, L(y, f(x)) = \frac{1}{2}(y - f(x))^2. By
optimising this function, we find the first model f_0, which is a single terminal node tree
predicting the mean of the target y in the training set. Moreover, the negative gradient of
the loss function computed in each iteration is called the pseudo-residual r. The
succeeding trees are built following the construction in (Breiman et al. 1984); thus each
tree corrects the mistakes of the previous trees. The corrections are scaled by the
learning rate ν to avoid the problem of high variance, increasing the robustness.
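To make the sequential residual-fitting idea concrete, the sketch below runs a few boosting rounds by hand with shallow regression trees and the least-squares loss; it mirrors Algorithm 3 in a simplified form on synthetic data, rather than reproducing Scikit-learn's GradientBoostingRegressor.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

nu = 0.2                                  # learning rate
prediction = np.full_like(y, y.mean())    # f_0: a single leaf predicting the mean of y
trees = []

for m in range(50):
    residual = y - prediction             # pseudo-residuals of the squared loss
    tree = DecisionTreeRegressor(max_depth=3, random_state=m)
    tree.fit(X, residual)                 # fit the next tree to the residuals
    prediction += nu * tree.predict(X)    # scaled correction
    trees.append(tree)

print("Training MSE after boosting:", round(float(np.mean((y - prediction) ** 2)), 2))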

4.4 Extreme Gradient Boosting


Extreme Gradient Boosting (XGBoost) (Chen & Guestrin 2016) is another boosting method
applied in this study. XGBoost is a variant of Gradient Boosting in which the embedded
trees are structured with a different algorithm, for example by changing the splitting
method.
For a given dataset with n examples and m features, D = {(x_i, y_i)} (|D| = n, x_i ∈ R^m,
y_i ∈ R), the objective at iteration t is to minimise

L^{(t)} = \sum_{i=1}^{n} l\left( y_i, \hat{y}_i^{(t-1)} + w_j \right) + Ω(w_j)

where I_j = {i | q(x_i) = j} is the set of data points assigned to the j-th leaf, q is the
structure of each tree, which maps an example to the corresponding leaf index, and w is the
vector of scores on the leaves. Here \hat{y}_i^{(t)} is the prediction for the i-th instance
at iteration t and l is the loss function to be chosen. We chose the common Mean Squared
Error as the loss function in this case.
In XGBoost, a second-order Taylor approximation of the loss function is used for
computational efficiency:

l\left( y_i, \hat{y}_i^{(t-1)} + w_j \right) \approx l\left( y_i, \hat{y}_i^{(t-1)} \right) + g_i w_j + \frac{1}{2} h_i w_j^2

where g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)}) and
h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)}) are the first and second order
derivatives of the loss function. The symbols g and h come from the facts that the first
derivative of a function is often called the gradient and the second derivative the
Hessian.
Removing the constant l(y_i, \hat{y}_i^{(t-1)}) from the approximation, the objective
function becomes

L^{(t)} = \sum_{i=1}^{n} \left( g_i w_j + \frac{1}{2} h_i w_j^2 \right) + Ω(w_j)

We then set the regularisation term as

Ω(w_j) = γT + \frac{1}{2} λ \sum_{j=1}^{T} w_j^2

Here γ is the pruning parameter and λ is the l2 regularisation parameter. Hence, our
objective function is given as

L^{(t)} = \sum_{i=1}^{n} \left( g_i w_j + \frac{1}{2} h_i w_j^2 \right) + γT + \frac{1}{2} λ \sum_{j=1}^{T} w_j^2
        = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + λ \right) w_j^2 \right] + γT

We can compress the whole expression into a simpler form by letting
G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i:

L^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + λ) w_j^2 \right] + γT    (4.4)

We then optimise equation (4.4) using the first-order condition with respect to w_j, which
gives

w_j = -\frac{G_j}{H_j + λ}

Hence, the corresponding objective value is

L^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + λ} + γT    (4.5)

Equation (4.5) can be used as a metric to measure the quality of a tree structure q. It is
impossible in practice to enumerate all possible tree structures q, so a greedy algorithm
that starts from a single leaf and iteratively adds branches to the tree is used instead.
Starting from a single leaf node, we split it into two nodes such that I_L and I_R are the
instance sets of the left and right nodes after splitting.

Therefore, with I = I_L ∪ I_R, the loss reduction after the split is

L_{split} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + λ} + \frac{G_R^2}{H_R + λ} - \frac{G_j^2}{H_j + λ} \right] - γ    (4.6)

Equation (4.6) can be decomposed into four parts: the score on the new left leaf, the score
on the new right leaf, the score on the original leaf, and the regularisation put on the
additional leaf. We can observe that if the term in brackets is smaller than γ, we would
not get a better result from adding that branch. The branch is then removed, and this is
how the pruning technique in tree-based methods works.
In order to find the best candidate split, we need an algorithm to perform this search. In
this work, we applied the so-called exact greedy algorithm, a split finding algorithm that
“enumerates over all the possible splits on all the features” (Chen & Guestrin 2016):

Algorithm 4 Exact Greedy Algorithm

Input: I, instance set of the current node
Input: d, feature dimension
Initialise: score = 0, G = \sum_{i \in I} g_i, H = \sum_{i \in I} h_i
Iterate:
for k = 1, ..., m do
    G_L = 0, H_L = 0
    for j in sorted(I, by x_{jk}) do
        G_L = G_L + g_j, H_L = H_L + h_j
        G_R = G − G_L, H_R = H − H_L
        score = max\left( score, \frac{G_L^2}{H_L + λ} + \frac{G_R^2}{H_R + λ} - \frac{G^2}{H + λ} \right)
    end for
end for
return the split with maximum score

The algorithm is computationally demanding, as it requires sorting the data with respect to
the feature values and then scanning them to compute the structure scores of all possible
splits in order to find the optimal one. The original paper (Chen & Guestrin 2016) can be
referred to for additional information on other splitting algorithms and computations.
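In practice we do not implement this split-finding ourselves but call the XGBoost library; a minimal usage sketch on synthetic data with placeholder hyper-parameters is shown below.

from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2675, n_features=78, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# reg_lambda is the l2 regularisation term lambda from the derivation above,
# and gamma is the pruning parameter gamma
model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=4,
                     reg_lambda=0.01, gamma=0.0)
model.fit(X_train, y_train)
print("Test R2:", round(r2_score(y_test, model.predict(X_test)), 3))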

4.5 LightGBM

Figure 4.2: How LightGBM and other boosting algorithms work (Source)

LightGBM (Ke et al. 2017) is another variant of the Gradient Boosting method that mainly
focuses on speeding up training. Instead of building trees horizontally, level-wise,
LightGBM grows trees vertically, leaf-wise, adding a new tree leaf at each iteration.
Figure 4.2 displays the implementations of LightGBM and other Gradient Boosting techniques.
The computational efficiency of LightGBM comes from the fact that this method combines two
techniques, Gradient-based One-sided Sampling (GOSS) and Exclusive Feature Bundling (EFB).
These two techniques are described in the sections below.

4.5.1 Gradient-based One-sided Sampling


This technique focuses training on the instances with large gradients, which generate large
errors. After each iteration, the instances with large gradients are kept, while a
proportion of the instances with small gradients are dropped randomly. The authors claim
that this treatment provides a more accurate estimation than the traditional method, where
instances are sampled uniformly at random. The algorithm for this technique is given as:

Algorithm 5 Gradient-based One-sided Sampling

Input: I: training data, d: number of iterations
Input: a: sampling ratio of large-gradient data
Input: b: sampling ratio of small-gradient data
Input: loss: loss function, L: weak learner
Initialise: models = {}, fact = (1 − a)/b, topN = a × len(I), randN = b × len(I)
for i = 1, ..., d do
    preds = models.predict(I)
    g = loss(I, preds), w = {1, 1, ...}
    sorted = GetSortedIndices(abs(g))
    topSet = sorted[1:topN]
    randSet = RandomPick(sorted[topN:len(I)], randN)
    usedSet = topSet + randSet
    w[randSet] ×= fact        (assign the weight fact to the small-gradient data)
    newModel = L(I[usedSet], −g[usedSet], w[usedSet])
    models.append(newModel)
end for

4.5.2 Exclusive Feature Bundling


In practical applications, high-dimensional data normally have problem with spar-
sity. Such problems that many features are (almost) exclusive in sparse feature
space. In other words, they hardly get nonzero values at the same time. Thus,
EFB is designed to correct this problem by bundling those exclusive features into
a single feature. However, there are two issues when implementing this technique.
The first issue is to choose which features should be bundled and the second one
CHAPTER 4. METHODS 21

is how they are bundled.

Algorithm 6 Greedy Bundling

Input: F: features, K: max conflict count
Construct graph G
searchOrder = G.sortByDegree()
bundles = {}, bundlesConflict = {}
for i in searchOrder do
    needNew = True
    for j = 1, ..., len(bundles) do
        cnt = ConflictCnt(bundles[j], F[i])
        if cnt + bundlesConflict[i] ≤ K then
            bundles[j].add(F[i]), needNew = False
            break
        end if
    end for
    if needNew then
        add F[i] as a new bundle to bundles
    end if
end for
return bundles

Algorithm 6 determines which features are bundled. The graph G in this algorithm is
constructed as follows: we take features as vertices and add an edge between every two
features that are not mutually exclusive. The term “conflict” here refers to features that
are not 100% mutually exclusive. We can tolerate a small number of such conflicts while
bundling features in order to enhance computational efficiency.
For the second issue, Algorithm 7 shows how the features in a bundle are merged.
The original paper can be referred to for more details on the theoretical analysis of the
algorithms shown in this chapter and some experimental results for LightGBM.
In this chapter, we only show the key points in which LightGBM differs from the other
Gradient Boosting techniques. The construction of a decision tree in each iteration is
similar to the algorithm shown in Section 4.3.

Algorithm 7 Merge Exclusive Features

Input: numData: number of data points
Input: F: one bundle of exclusive features
binRanges = {0}, totalBin = 0
for f in F do
    totalBin += f.numBin
    binRanges.append(totalBin)
end for
newBin = new Bin(numData)
for i = 1, ..., numData do
    newBin[i] = 0
    for j = 1, ..., len(F) do
        if F[j].bin[i] ≠ 0 then
            newBin[i] = F[j].bin[i] + binRanges[j]
        end if
    end for
end for
return newBin, binRanges
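As with XGBoost, we rely on the library implementation of GOSS and EFB rather than coding them; a minimal sketch of fitting the lightgbm regressor on synthetic data with placeholder hyper-parameters is given below.

from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2675, n_features=78, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# num_leaves controls the leaf-wise growth described above; keeping it small
# helps to limit overfitting on a small dataset
model = LGBMRegressor(n_estimators=100, learning_rate=0.08, max_depth=8, num_leaves=10)
model.fit(X_train, y_train)
print("Test R2:", round(r2_score(y_test, model.predict(X_test)), 3))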

4.6 Neural Networks

Figure 4.3: An example of Neural Networks architecture (Source: VIASAT)

The term Neural Network is biologically inspired by how humans process information in their
brains. Figure 4.3 displays the architecture of a Neural Network. An Artificial Neural
Network consists of nodes, which can be called neurons, and activation functions. Each node
has a collection of weights and a bias, which are learned through the training process. An
activation function produces an output for a node depending on its input values. Every
Neural Network includes three kinds of layers: the input layer receives information from
the records of a dataset, the output layer produces the network's prediction, and hidden
layers connect the input and the output with one another. Mathematically, a Neural Network
model is defined as follows (Benning 2020, p. 28):

f_w(x) = ϕ_L(ϕ_{L−1}(... ϕ_2(ϕ_1(x, w_1, b_1), w_2, b_2) ..., w_{L−1}, b_{L−1}), w_L, b_L)    (4.7)

Equation (4.7) shows how a Neural Network of L layers is represented mathematically. Here
{ϕ_l}_{l=1}^{L} is a set of L activation functions that contain the weights
w = {w_l}_{l=1}^{L} and biases b = {b_l}_{l=1}^{L}.
Typically, an activation function takes the form of an affine-linear transformation, which
is defined as

ϕ(x, W, b) = W^T x + b    (4.8)

where x ∈ R^n is the vector of inputs, W ∈ R^{n×m} is a weight matrix and b ∈ R^m is a bias
vector. In this way, the activation function maps n inputs onto m outputs. In this study,
we only used this type of activation function in the output layer. For the input and hidden
layers, the Rectified Linear Unit (ReLU) is chosen. The function is the following:

ϕ(x, W, b) = max(0, W^T x + b)    (4.9)

The function is a combination of the affine-linear transformation and the rectifier,
ϕ(x) = max(0, x). An advantage of using ReLU is that it is computationally efficient due to
its simplicity. This has contributed to making ReLU one of the most popular activation
functions for Neural Networks. However, the function is not flawless, as it suffers from a
problem called 'Dying ReLU'.
When the input to the function is negative, the output is zero and the gradient of the
function is also zero. Thus, the process of backpropagation, which is described in the
subsection below, cannot update that neuron, effectively turning it off. If we train a
network containing such neurons, we may end up with a large part of the network doing
nothing. A solution to the Dying ReLU problem is to use the so-called Leaky ReLU, where the
function is adjusted so that its gradient has a small slope for negative values.
Nevertheless, we ignored this issue and only used ReLU for the hidden layers in this study.
We would like to try the Leaky ReLU in our future work for further experiments with Neural
Networks.
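For reference, both activation functions are available in PyTorch; the small sketch below contrasts them on negative inputs.

import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.0])
print(nn.ReLU()(x))                          # negative inputs are clamped to zero
print(nn.LeakyReLU(negative_slope=0.01)(x))  # a small slope keeps a nonzero gradient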

In order to produce accurate predictions, the output of the network needs to be matched
closely to the actual values through a loss function. Combined with the choice of loss
function, training a Neural Network can be generalised into the problem

\arg\min_{w,b} \left\{ \frac{1}{s} \sum_{i=1}^{s} l_i\left( f_w(x_i), y_i \right) \right\}    (4.10)

where i ∈ {1, ..., s} and {l_i}_{i=1}^{s} is a family of loss functions. For the regression
problem of this study, we choose least squares, l_i = \frac{1}{2} \left( f_w(x_i) - y_i \right)^2,
so that problem (4.10) becomes

\arg\min_{w,b} \left\{ \frac{1}{2s} \sum_{i=1}^{s} \left( f_w(x_i) - y_i \right)^2 \right\}

Since we have a differentiable neural network with differentiable activation functions, we
can choose an algorithm to train the model. Our choice for this task is Adam.

4.6.1 Adam Algorithm


Adaptive Moment Estimation (Adam) is an extension of Stochastic Gradient Descent (SGD)
(Kingma & Ba 2014). While SGD keeps only one learning rate for all weights and does not
adjust it during training, Adam maintains a learning rate for each weight and separately
adapts it throughout the training process. The authors also claim that Adam inherits the
advantages of the Adaptive Gradient Algorithm (Adagrad) and Root Mean Square Propagation
(RMSProp): the benefit of the former is improved performance with sparse gradients, and the
benefit of the latter is improved performance on online and non-stationary problems. As a
result, Adam is claimed to be effective for practical problems involving Neural Networks
(Kingma & Ba 2014).
The Adam algorithm, following the original paper (Kingma & Ba 2014), is as follows:

Algorithm 8 Adam

Specify: f(θ): stochastic objective function with parameters θ
Specify: step size α, decay rates β_1, β_2 ∈ [0, 1), ε
Initialise: θ_0, m_0 = 0, v_0 = 0
for t = 1, ..., T do
    compute g_t = ∇_θ f_t(θ_{t−1})
    compute g_t^2 = g_t ⊙ g_t   (⊙: Hadamard product, Appendix A)
    compute m_t = β_1 m_{t−1} + (1 − β_1) g_t
    compute v_t = β_2 v_{t−1} + (1 − β_2) g_t^2
    compute m̂_t = m_t / (1 − β_1^t)
    compute v̂_t = v_t / (1 − β_2^t)
    compute θ_t = θ_{t−1} − α m̂_t / (√(v̂_t) + ε)
end for
return θ_T
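A minimal NumPy sketch of Algorithm 8 on a toy objective is given below, assuming the gradient is available in closed form; in our experiments we instead rely on PyTorch's built-in torch.optim.Adam.

import numpy as np

def grad(theta):
    # gradient of the toy objective f(theta) = 0.5 * ||theta - 3||^2
    return theta - 3.0

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta = np.zeros(5)
m = np.zeros_like(theta)
v = np.zeros_like(theta)

for t in range(1, 1001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # first moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second moment estimate (Hadamard square)
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(np.round(theta, 3))                    # converges towards 3.0 in every coordinate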

In order to implement Algorithm 8, we need to compute the gradients with


respect to the weights in different layers. The process of this computation is called
backpropagation.

4.6.2 Backpropagation
Backward propagation of errors (backpropagation) is the practice of readjusting the weights
and biases of a Neural Network based on the error obtained in the previous epoch of the
training process. The term “backwards” here means that the algorithm goes through the
network backwards: it computes the gradient of the final layer first and that of the first
layer last. The backpropagation algorithm is given as (Benning 2020, p. 32):

Algorithm 9 Backpropagation

Specify: activation functions ϕ, sample {(x_i, y_i)}_{i=1}^{s}, weight and bias dimensions
and number of layers L
Iterate:
for i = 1, ..., s do
    for l = 1, ..., L do
        Forward pass: compute z_i^l = W_l^T x_i^{l−1} + b_l
        Forward pass: compute x_i^l = ϕ(z_i^l)
    end for
end for
for i = 1, ..., s do
    for l = L, ..., 1 do
        Backward pass: compute
            δ_i^l = ϕ'(z_i^l) ⊙ (1/s) ∇ l_i(x_i^L, y_i),    if l = L
            δ_i^l = ϕ'(z_i^l) ⊙ (W_{l+1} δ_i^{l+1}),         if l ∈ {1, ..., L − 1}
    end for
end for
Partial derivatives: compute
    ∂L/∂b_j^l = δ_j^l,   j ∈ {1, ..., n_l}
Partial derivatives: compute
    ∂L/∂w_{jk}^l = δ_j^l x_k^{l−1},   j ∈ {1, ..., n_l} and k ∈ {1, ..., n_{l−1}}
return {W_l}_{l=1}^{L}, {b_l}_{l=1}^{L}
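To tie the pieces of this section together, the sketch below defines a small fully connected network in PyTorch with ReLU hidden layers, an affine-linear output layer, the mean squared error loss and the Adam optimiser, in the spirit of the setup used in Chapter 5; the data, layer sizes and number of epochs are illustrative assumptions.

import torch
import torch.nn as nn

# Stand-in data: 78 pre-processed features per listing, one price target
X = torch.randn(2000, 78)
y = X[:, :3].sum(dim=1, keepdim=True) + 0.1 * torch.randn(2000, 1)

model = nn.Sequential(
    nn.Linear(78, 64), nn.ReLU(),     # hidden layers with ReLU activations
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),                 # affine-linear output layer
)
loss_fn = nn.MSELoss()
optimiser = torch.optim.Adam(model.parameters())   # default Adam parameters

for epoch in range(500):              # full-batch training, as in our experiments
    optimiser.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()                   # backpropagation computes the gradients
    optimiser.step()                  # Adam update of the weights and biases

print("Final training MSE:", round(loss.item(), 4))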


Chapter 5

Experiments and Results

Model Name Test MSE Test R2


LASSO (baseline) 1519.35 0.5580
Random Forest 1430.54 0.5891
Gradient Boosting 1351.38 0.6168
Extreme Gradient Boosting 1328.01 0.6252
LightGBM 1488.84 0.5652
Neural Networks 1442.60 0.5877

Table 5.1: Results of the trained models

Table 5.1 shows the results of our models. For the Random Forest and Gradient Boosting
algorithms, we used the Scikit-learn library (Pedregosa et al. 2011). In this package, we
used grid search, a method that tries every combination of a selected list of parameter
values, for hyper-parameter tuning with 5-fold cross validation on the whole dataset. For
LASSO, we chose the regularisation parameter λ = 8. For Random Forest (Scikit-learn's
RandomForestRegressor), we chose max_features = 10, max_depth = 50, 900 estimators and left
the other parameters at their default values. For Gradient Boosting
(GradientBoostingRegressor), we chose learning rate = 0.2, max_depth = 7, max_features = 8,
50 estimators and the other parameters at their default values. For Extreme Gradient
Boosting, we used the XGBoost library (Chen & Guestrin 2016) and chose learning rate = 0.1,
max_depth = 4, ridge parameter = 0.01, 100 estimators and the other parameters at their
default values (XGBRegressor). For LightGBM, we used the lightgbm library (Ke et al. 2017)
and chose learning rate = 0.081, max_depth = 8, 100 estimators, 10 leaves and the other
parameters at their default values (LGBMRegressor). For the Neural Network, we used the
PyTorch library (Paszke et al. 2019) to build the network architecture. As we had a small
number of records in the dataset, we set the batch size to the number of training examples
and trained for 500 iterations. The Adam optimiser (Kingma & Ba 2014) was chosen to
optimise the model, with the default parameter values set in the package.
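The sketch below illustrates the tuning procedure for one of the models on synthetic data; the parameter grid is a small illustrative subset, not the full grid searched in this project.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=2675, n_features=78, noise=10.0, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [10, 50],
    "max_features": [8, 10],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=5, scoring="r2")        # 5-fold cross validation, R2 score
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))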
In this study, Extreme Gradient Boosting achieves the best performance among all models,
scoring R2 = 0.6252. As for LightGBM, this model is only better than the baseline LASSO
model. This can be explained by our limited number of records: with an insufficient number
of training examples the model may overfit, since leaf-wise tree growth can be sensitive to
overfitting on small datasets.

Chapter 6

Conclusion and Outlook

This paper attempted to build an Artificial Intelligence tool for predicting the prices of
Luxstay listings based on scraped data with a limited set of features, including the number
of guests allowed, bathrooms, bedrooms, type of home, name of building and district of a
listing. The machine learning models used are LASSO regression, Random Forest, Gradient
Boosting, Extreme Gradient Boosting, LightGBM and Neural Networks. The best model is chosen
using K-Fold cross validation with K = 5, and the results are assessed in terms of the Mean
Squared Error and the R2 statistic. Among the models trained and tested, Extreme Gradient
Boosting achieved the best performance, with an R2 of 0.6252 and an MSE of 1328.01 under
cross validation.
Nevertheless, we are aware that there are limitations which affect our study negatively,
and we believe these three are the major ones. Firstly, the price in the dataset is the
sticker price (the advertised price). This is the price advertised to potential guests
rather than the actual price paid by previous ones. Hence, some prices may not reflect the
real situation, and this might introduce noise into our model performance (Lewis 2019).
Secondly, we skipped outlier analysis, which consists of techniques for treating data
points that are significantly different from the other observations. Figure 6.1 shows the
actual prices against the predictions of Extreme Gradient Boosting, with the red line
displaying the flawless prediction. From this figure we see that as the actual price goes
up, the precision of our best model decreases, especially for records above $500. Thus, we
believe that if we apply some special treatment to those outliers, our model performance
can improve remarkably. Thirdly, as mentioned several times above, our study is restricted
by data limitations. We acknowledge that we need to include more listings as well as more
explanatory features, such as review scores and the services included in an accommodation.

Figure 6.1: Predicted Values vs Actual Prices for Extreme Gradient Boosting
As for future work, we will attempt to address the limitations above. We will also run some
further experiments with the neural network architecture: we will tune different numbers of
hidden layers and add other techniques such as early stopping and batch normalisation. We
would also like to integrate text and image data as inputs to our models. Lastly, we would
like to add listings in other cities and attempt to produce a price suggestion tool for all
Vietnamese cities on Luxstay.
Appendix A

Some special mathematical


notations

A.1 Vector Norm


A function ‖·‖ : R^n → R is called a vector norm if it has the following properties:

1. ‖x‖ ≥ 0 for any vector x ∈ R^n, and ‖x‖ = 0 if and only if x = 0;

2. ‖ax‖ = |a| ‖x‖ for any scalar a ∈ R;

3. ‖x + y‖ ≤ ‖x‖ + ‖y‖ for any vectors x, y ∈ R^n.

In general, for any p ≥ 1 we have the p-norm

\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}

Example:

• p = 1: the l1-norm

\|x\|_1 = |x_1| + |x_2| + ... + |x_n|


• p = 2: the l2-norm

\|x\|_2 = \sqrt{x_1^2 + x_2^2 + ... + x_n^2}

A.2 The Hadamard product


The Hadamard product, or elementwise product, denoted ⊙, is an operator that multiplies
elements with the same index between two matrices or vectors. Suppose x and y are two
matrices of the same dimension; then

(x ⊙ y)_{ij} = x_{ij} y_{ij}


Appendix B

The Chain Rule

Suppose we have two differentiable functions f (x) and g(x). Then to differentiate
y = f (g(x)), let u = g(x) and then y = f (u). We have

\frac{dy}{dx} = \frac{dy}{du} \times \frac{du}{dx}

Bibliography

Beck, A. & Teboulle, M. (2009), ‘A fast iterative shrinkage-thresholding algorithm


for linear inverse problems’, SIAM J. Imaging Sciences 2, 183–202.

Benning, M. (2019), MTH786P Machine Learning with Python Coursework 3,


Queen Mary, University of London.

Benning, M. (2020), MTH793P - Advanced Machine Learning, Lecture Notes (last updated:
May 6, 2020), Queen Mary, University of London.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Information


science and statistics, 1st ed. 2006. corr. 2nd printing edn, Springer.

Breiman, L. (2001), ‘Random forests’, Machine Learning 45, 5–32.

Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984), Classification and
regression trees.

Cai, T. & Han, K. (2019), Melbourne airbnb price prediction.

Chen, T. & Guestrin, C. (2016), Xgboost: A scalable tree boosting system,


pp. 785–794.

Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), ‘Least angle regression’,
The Annals of Statistics 32(2), 407–451.
URL: http://www.jstor.org/stable/3448465

Friedman, J. (2002), ‘Stochastic gradient boosting’, Computational Statistics Data


Analysis 38, 367–378.


Gibbs, C., Guttentag, D., Gretzel, U., Yao, L. & Morton, J. (2017), ‘Use of dynamic pricing
strategies by airbnb hosts’, International Journal of Contemporary Hospitality
Management 30, 00–00.

Goodman, A. C. (1998), ‘Andrew court and the invention of hedonic price analysis’,
Journal of Urban Economics 44(2), 291 – 298.
URL: http://www.sciencedirect.com/science/article/pii/S0094119097920714

Guttentag, D. (2015), ‘Airbnb: disruptive innovation and the rise of an informal


tourism accommodation sector’, Current Issues in Tourism 18, 1192–1217.

Hastie, T., James, G., Witten, D. & Tibshirani, R. (2017), An introduction to statistical
learning: with applications in R, Springer Texts in Statistics, corrected at 8th printing
edn, Springer Science+Business Media.

Investopedia.com (2020), ‘Using hedonic pricing to determine the factors impacting home
prices’.
URL: https://www.investopedia.com/terms/h/hedonicpricing.asp (accessed:
17.07.2020)

Hamari, J., Sjöklint, M. & Ukkonen, A. (2015), ‘The sharing economy: why people participate
in collaborative consumption’, Journal of the Association for Information Science and
Technology.

Stock, J. H. & Watson, M. W. (2020), Introduction to Econometrics, Global Edition, 4th edn,
Pearson Education Limited.

Kalehbasti, P. R., Nikolenko, L. & Rezaei, H. (2019), ‘Airbnb price prediction


using machine learning and sentiment analysis’.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. & Liu,
T.-Y. (2017), Lightgbm: A highly efficient gradient boosting decision tree, in
I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan
& R. Garnett, eds, ‘Advances in Neural Information Processing Systems 30’,
Curran Associates, Inc., pp. 3146–3154.

URL: http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf

Kingma, D. & Ba, J. (2014), ‘Adam: A method for stochastic optimization’, In-
ternational Conference on Learning Representations .

Lewis, L. (2019), ‘Predicting airbnb prices with machine learning and deep learning’.
URL: https://towardsdatascience.com/predicting-airbnb-prices-with-machine-learning-and-deep-learning-f46d44afb8a6 (accessed: 30.05.2020)

Oliphant, T. E. (2006), A guide to NumPy, Vol. 1, Trelgol Publishing USA.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen,
T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E.,
DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai,
J. & Chintala, S. (2019), Pytorch: An imperative style, high-performance deep
learning library, in H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc,
E. Fox & R. Garnett, eds, ‘Advances in Neural Information Processing Systems
32’, Curran Associates, Inc., pp. 8024–8035.
URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,
Cournapeau, D., Brucher, M., Perrot, M. & Duchesnay, E. (2011), ‘Scikit-learn: Machine
learning in Python’, Journal of Machine Learning Research 12, 2825–2830.

Richardson, L. (2007), ‘Beautiful soup documentation’, April .

Selim, H. (2009), ‘Determinants of house prices in turkey: Hedonic regression


versus artificial neural network’, Expert Syst. Appl. 36, 2843–2852.

Tang, E. & Sangani, K. (2015), ‘Neighborhood and price prediction for San Francisco airbnb
listings’, CS 229 Final Project Report.

Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal
of the Royal Statistical Society. Series B (Methodological) 58(1), 267–288.
URL: http://www.jstor.org/stable/2346178

Tibshirani, R., Hastie, T. & Friedman, J. (2010), ‘Regularized paths for generalized
linear models via coordinate descent’, Journal of Statistical Software 33.

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The elements of statistical learning:
Data mining, inference, and prediction, Springer Series in Statistics, 2nd edn, Springer.

Wang, D. & Nicolau, J. (2017), ‘Price determinants of sharing economy based accommodation
rental: A study of listings from 33 cities on airbnb.com’, International Journal of
Hospitality Management 62, 120–131.

Yates, D. S., Moore, D. S. & Starnes, D. S. (2003), The practice of statistics:


TI-83/89 graphing calculator enhanced, W.H. Freeman.
