
Predicting NBA Player Bets using

Machine Learning

Edward Thomas Sciberras

Supervisor: Dr. Kristian Guillaumier

June 2023

Submitted in partial fulfilment of the requirements


for the degree of B.Sc. IT Artificial Intelligence.
Abstract
Sports betting is the activity of attempting to predict the outcome of a sports event and placing a wager on said outcome. With the rise of the Internet and e-commerce, a new form of betting has been popularised: online sports betting. Bookmakers create their odds using machine learning and mathematical models that draw on a variety of data to set the odds that will maximise their profits. Although bookmakers likely have extremely accurate models for predicting sports event outcomes [1], creating a model that outperforms theirs is theoretically possible. Thus, if a model can outperform a bookmaker's model, it can predict sports event outcomes with higher accuracy, meaning that a bettor who consistently places bets based on this model over an extended period of time will make a profit.
In this FYP, we propose to create a model that can rival that of the bookmakers.
We chose to focus on a specific type of bet in the National Basketball Association
(NBA) known as player props. The nature of these bets is to bet on a specific player’s
performance as opposed to a team winning or losing a game.
To accomplish this task we needed a dataset large enough to train a machine learning model. Unfortunately, there are no publicly available datasets, prompting us to create our own using web scraping. We also needed a dataset that included player performances. Luckily, an NBA stats Application Programming Interface (API) is publicly available, which allowed us to gather the data we needed for training. We then decided to use three models that can handle sequential data, since they are trained on the previous performances of a player: a recurrent neural network (RNN), a long short-term memory network (LSTM), and a transformer.
Once each model was trained and hyperparameter tuned, we used each of them with different betting strategies to see what could give a bettor an advantage when betting. The final results showed that, overall, the transformer was the most accurate, followed by the LSTM, and lastly the RNN. Although the transformer was the strongest, it was still not capable of toppling the accuracies of the bookmakers' models and was not able to bet profitably. However, when using a metric called peak, which measured the highest balance reached during a betting simulation, the transformer reached heights of seven times the initial balance, indicating that it is possible for it to go on a profitable streak of hits.

Acknowledgements
I would like to express my deepest gratitude to my supervisor, Dr Kristian Guillaumier,
for his invaluable guidance, support, and encouragement throughout my research. His
expertise, dedication, and patience have been instrumental in shaping this FYP. I am
grateful for the time and effort that he put into reviewing my work, providing
constructive feedback, and most of all, for pushing me to reach my full potential.
I would also like to thank my parents for their unyielding love, unwavering faith,
and constant motivation throughout my academic pursuit. Their dedication, guidance
and sacrifices have been a pillar of strength, giving me the bravery to face challenges
and overcome the obstacles that I encountered not only while working on this paper,
but also in life. Their belief in my abilities and their constant presence have been the
driving factor behind my academic growth. I am, and always will be, thankful for their
commitment towards my well-being.
Finally, I want to express my sincere appreciation to my friends and colleagues for
their invaluable contributions to my academic pursuits. Their generosity, insights, and
willingness to share their expertise have been critical in helping me to develop my ideas
and refine my research. I am grateful for their camaraderie, kindness, and humour,
which have helped me maintain my balanced life and keep things in perspective. I am
extremely lucky to have such a supportive and dynamic network of friends and
colleagues, and I will always cherish the memories that we have created together.

Contents

Abstract i

Acknowledgements ii

Contents iv

List of Figures v

List of Tables vi

List of Abbreviations vii

Glossary of Symbols 1

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Aims and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Proposed Solution and Results . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background and Literature Review 4


2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 A Simple Bet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Player Betting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 The Mathematics behind Sports Betting . . . . . . . . . . . . . . . . 5
2.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 NBA Moneyline Betting . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 NBA Fantasy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 NFL Spread Betting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.4 Sports Betting with Sequential Models . . . . . . . . . . . . . . . . 13
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 Methodology 15
3.1 The Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.1.1 Player Bets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.2 Player Performances . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.3 Preparing the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 The Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Training the Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.2 Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.3 Long Short-Term Memory Network . . . . . . . . . . . . . . . . . . 20
3.2.4 Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Evaluation Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.1 Simulating Betting & Strategies . . . . . . . . . . . . . . . . . . . . . 23
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4 Evaluation 26
4.1 Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Long Short-Term Memory Network . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3 Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 Discussion of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.1 Betting Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5 Conclusion 34
5.1 Revisiting Our Aims and Objectives . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3 Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

List of Figures

Figure 4.1 RNN using different betting strategies. . . . . . . . . . . . . . . . . . . 27


Figure 4.2 LSTM using different betting strategies. . . . . . . . . . . . . . . . . . . 28
Figure 4.3 Transformer using different betting strategies. . . . . . . . . . . . . . . 30

List of Tables

Table 3.1 Description of Basketball Statistics Used for Model Training. . . . . . . 17


Table 3.2 Two examples of bets and a model’s predictions. . . . . . . . . . . . . . 24
Table 3.3 Different configurations for testing betting strategies. . . . . . . . . . . 25

Table 4.1 Confusion matrix for RNN . . . . . . . . . . . . . . . . . . . . . . . . . . 26


Table 4.2 Confusion matrix for LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Table 4.3 Confusion matrix for transformer . . . . . . . . . . . . . . . . . . . . . . 29
Table 4.4 Table of betting strategy results with different models (in Euros). . . . . 31
Table 4.5 Table of peaks of betting strategy results with different models (in Euros). 32

List of Abbreviations
ANN Artificial Neural Network.

API Application Programming Interface.

ATP Association of Tennis Professionals.

EMH Efficient Market Hypothesis.

KCV K-Fold Cross Validation.

LMT Logistic Model Tree.

LSTM Long Short-Term Memory Network.

MAPE Mean Absolute Percentage Error.

MSE Mean Squared Error.

NBA National Basketball Association.

NFL National Football League.

NLP Natural Language Processing.

PIE Player Impact Estimate.

RNN Recurrent Neural Network.

SVM Support Vector Machine.

1 Introduction
The NBA is the largest basketball league and one of the largest sports associations in
the world [2]. The entire league consists of 30 teams and around 450 players. Due to
its popularity, analytics and data collection have increased in relation to teams and players. Data and analytics have always been driving factors in the world of sports betting, and with the recent rise of artificial intelligence, they are more advanced than ever.
A newer type of bet in the NBA is the player bet, where the bettor places a wager on the performance of an individual player rather than on whether or not the team will win the game. This FYP aims to analyse and use vast amounts of historical data to predict the number of points a player will score in an upcoming game. The models will then compare the prediction to the bets that the bookmakers have posted and bet accordingly.

1.1 Motivation
In recent years, sports betting has exploded in popularity with the rise of online sports
betting. Betting companies and bookmakers have all the power when it comes to
setting the odds of an event and they also take a cut from every bet that is placed,
even if the bet is won by the bettor. Creating a model that can rival the models used by
bookmakers is a complex task due to the high-stakes nature of the betting business.
We expect that the models used by bookmakers are close to the state-of-the-art [1].
However, we hypothesise that the player betting market has a vulnerability that can be exploited using machine learning. If we are successful, it will imply that the bookmakers' models for creating odds on this specific type of bet are not the most profitable and could be improved.
Additionally, we have not encountered any literature that indicates success over an extended period of time when it comes to this specific type of betting. Another reason why this problem is an interesting one is that there is no publicly available dataset of historical player bets, which is needed to train models to predict these bets. Therefore, we need to create a dataset with the properties required to train a model on this data. Along with this dataset, another dataset containing the historical performances of players is needed. This data can be gathered from an API, though we would need to combine multiple endpoints to obtain all the data required.
The model selection portion of this FYP involves selecting various neural networks that fit the specifications of this problem and are able to take sequences of data as input. Once the models are selected, they will all be hyperparameter tuned and compared to one another.
Finally, we aim to explore distinct betting strategies, as there are many different ways that a bettor can choose to bet. Some bettors prefer to play it safe and bet big money on bets that are likely to hit. Others prefer to bet on multiple risky, unlikely bets, in the hope that one of them hits and makes up for the other bets as well. Each strategy will be compared against the others and we will see which technique performs the best.

1.2 Aims and Objectives


The goal of this FYP is to identify and implement several machine learning algorithms
that can effectively predict the outcome of bets over a long period of time. In order to
achieve this goal, the following objectives have been identified:

1. Collecting, transforming, and cleaning a dataset of previous player points bets


along with their outcome. No such dataset exists publicly, so we will employ
techniques such as web scraping to gather this dataset.

2. Gathering player data on a game-by-game basis. The only games needed are the
ones that have their respective bets.

3. Investigating a number of models and hyperparameter tuning them using the


dataset. Success will be measured by the accuracy that the model reaches. The models will be compared with one another.

4. Using each model with different betting techniques and comparing profits. We
aim to find the most efficient way to bet.

1.3 Proposed Solution and Results


The proposed solution starts with web scraping to gather historical bets on which to
train the models. Once all the data is scraped, another dataset of player performances
is required. This data needs to be collected on a game-by-game basis so that the model
can easily retrieve the previous matches of a particular player when training. An NBA
stats API exists that allows users to collect information about already played games.
We will combine different endpoints in this API to get as many different stats to
experiment with what works best with the models.
Once the necessary data is collected, the development of the models will start. We
plan on using three models that all support sequential data so that we can input the
data game by game into them. The models are an RNN, an LSTM, and a transformer.
Once each model is made and trained on the data, we will tune the hyperparameters
and the layers within each model.


After each model has been hyperparameter tuned and tested, we plan to insert the models into a simulated betting scenario where each model starts with a certain balance and bets depending on its output, either gaining additional balance from winning the bet or losing the stake. Our goal is also to try different betting strategies with each model to see which betting strategy is the most efficient.
The results suggest that the transformer performs the best, followed by the LSTM
and lastly the RNN. The metric used for performance is the accuracy of correctly
predicting bets, where the transformer reached 52.8% accuracy over a series of player
bets. Even when the betting strategies were used, the transformer had the highest
average final balance with the other two models in the same order. When using a
different metric called peak, the results were more promising, with the transformer
managing to reach seven times its initial balance in some cases. This indicates that the
model is capable of correctly predicting multiple bets in a row and possibly being
profitable.

1.4 Document Structure


The rest of this document is organised into the following chapters:

Background: The foundations and critical mathematical concepts behind betting are
discussed. This chapter will contain equations and examples regarding betting
and NBA betting to get an understanding of this project.

Literature Review: In the literature review, we will explore similar approaches to


machine learning in betting. This will include papers that try to predict different
types of NBA bets as well as other sports and the use of sequential models in
sports betting.

Methodology: In this chapter, we will go over the specifics and details of how this
FYP was developed. We will clearly describe the implementation decisions that
were taken for each model and the manner in which we test the results of each
of the models.

Evaluation: The exact evaluation procedures are outlined and the results are
presented. This will be followed by a discussion of the results, which will display
the strengths and weaknesses of each model.

Conclusion: This FYP is concluded by summarising the work that has been done, the
results that were obtained, and proposals for improvements in the future.

2 Background and Literature Review
In this chapter, we discuss the mathematics and nature of player betting more in detail
as well as give an overview of academic papers and literature that have similar
objectives.

2.1 Background
Before explaining the models and the data more in detail, we need an understanding of
betting. In this section, we discuss the most important mathematical concepts of
betting that are used in this FYP. For a more thorough exposition of this topic, we
suggest reading through [3].

2.1.1 A Simple Bet


Bookmakers will create odds for every outcome of an event depending on the likelihood
of the outcome happening. A bet works by a bettor placing a bet on an outcome of a
sports event that they think will happen. The stake is the amount of money that the
bettor is comfortable putting on their bet. There are only two outcomes after the bet is placed: either the bettor hits (predicts correctly) the bet and gets their stake back together with the profit (which is determined by the odds), or the bettor misses (predicts incorrectly) the bet and loses their money.
Suppose that the LA Lakers and the Chicago Bulls will be playing against each other. The LA Lakers have odds of 2.15 to win the game whilst the Chicago Bulls have odds of 1.74. The odds are usually denoted as follows:

LA Lakers 2.15 — 1.74 Chicago Bulls

The odds show the bettor that for every €1 bet, they will receive 1 × odds if the bet hits (meaning they predicted the outcome correctly). In our example, if a bettor placed €10 on the LA Lakers winning and they do win, then the bettor will receive 10 × 2.15 = €21.50. This means that the bettor will get their stake back in addition to another €11.50 as pure profit.
The higher the odds of an outcome, the less likely it is to happen. In the previous example, the bookmakers thought that it was more likely for the Chicago Bulls to win (favourites) and that there was a higher chance of the LA Lakers losing the game (underdogs).
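To make the payout arithmetic concrete, the following minimal Python sketch (our own illustration, not part of any bookmaker's or the project's code) computes the return and profit of a winning bet from its decimal odds and stake.

```python
def winning_return(stake: float, odds: float) -> tuple[float, float]:
    """Return (total payout, profit) for a bet that hits, given decimal odds."""
    payout = stake * odds      # the stake multiplied by the decimal odds
    profit = payout - stake    # the bettor also gets their stake back
    return payout, profit

# The example from the text: 10 euros on the LA Lakers at odds of 2.15
payout, profit = winning_return(10, 2.15)
print(payout, profit)  # 21.5 11.5
```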


2.1.2 Player Betting


Bookmakers offer many different types of bets. In this FYP we will be focusing on
player points betting for NBA players. In every NBA game, there is a group of officials known as the scorer's table dedicated to keeping track of what happens in the game. For instance, if a player scores a basket, they will record it in the stat sheet. At the end of the game, every player will have totals of their stats throughout the game.
Bookmakers will release odds on how many of each stat a player will record in every game. The format the bookmakers use to present the bet involves a number that we will call the mid-point. Player betting falls under a type of betting known as Over/Under betting, in which the bettor is provided with the mid-point and the only two options are whether the player will finish over or under the mid-point.
Suppose that Lebron James (an NBA player) will be playing a game. The bookmakers release the bet for his points in that game. The mid-point is 26.5 points (the mid-point always ends in .5 so that the player can never finish with the same value as the mid-point), and the odds for him scoring under 26.5 points are 1.83 whilst the odds for scoring over are 1.9. This means that the bookmakers think it is slightly more likely that Lebron James will score under the mid-point in this particular game. The bets are formatted in the following way:

Lebron James (Points) — Over 26.5 (1.9) / Under 26.5 (1.83)

Bookmakers will release these player bets for most of the players and most of the
different types of stats in the league, but in this project, we will be focusing on
specifically player points bets.
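As a small illustration of how such a bet settles (a hypothetical helper, not taken from the project code), the sketch below returns the payout of a player points bet given the side taken, the mid-point, and the points the player actually scored.

```python
def settle_player_points_bet(side: str, mid_point: float, actual_points: int,
                             stake: float, odds: float) -> float:
    """Return the amount paid back to the bettor; 0.0 if the bet misses."""
    if side == "over":
        hit = actual_points > mid_point
    else:  # "under"
        hit = actual_points < mid_point
    return stake * odds if hit else 0.0

# Lebron James (Points), Over 26.5 at odds 1.9: if he scores 30, the over hits
print(settle_player_points_bet("over", 26.5, 30, stake=10, odds=1.9))  # 19.0
```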

2.1.3 The Mathematics behind Sports Betting


Implied Win Probability

From the odds of a bet, one can calculate the win probability which is usually referred
to as the implied probability [4]. To calculate the implied probability for a bet, we simply
take the inverse of the decimal value for the odds [5]. The following formula
demonstrates this:

Implied Probability = 1 / Odds    (2.1)

The Vigorish

Now that we can calculate the implied probability of any bet given its odds, we can calculate the implied probability of both sides of a bet. Since player betting only has two outcomes, i.e. either the player goes over or under the mid-point, we would expect the implied probabilities of both sides to sum to 100%. In this example, we will use the aforementioned bet and equation.

Calculating Implied Probability for Lebron James (Points) — Over 26.5 (1.9)

Implied Probability = 1 / 1.9 ≈ 0.5263

Calculating Implied Probability for Lebron James (Points) — Under 26.5 (1.83)

Implied Probability = 1 / 1.83 ≈ 0.5464

If we total these two implied probabilities, we get 1.0728. This value is not consistent with the laws of probability, as P(A) + P(A^c) = 1, where A is an event and A^c is its complement.
The reason why the implied probabilities derived from the bookmakers' odds are not accurate is the vigorish. The vigorish is the percentage removed from a bettor's winnings by the bookmakers [6]. In this case, the vigorish of the bet calculated above is approximately 7.28%, as this is the extra percentage over 100% when tallying the implied probabilities. The overround is the term used for the sum of the implied probabilities when it is greater than 1, which in this example would be 107.28%.

Removing the Vigorish

In order to get the actual win probabilities based on the models of the bookmakers, we
would have to remove the vigorish that is baked into their odds when they release
them. Since probability theory requires that the probabilities of all possible outcomes sum to 1, we just need to scale the implied probabilities so that they sum to 1. Dividing the implied probability by the overround gives the actual probability.

Actual Probability = Implied Probability / Overround    (2.2)

The following are the calculations to gather the actual probabilities for the previous
bet.


Actual Probability for Over ≈ 0.4906


Actual Probability for Under ≈ 0.5093
0.4906 + 0.5093 ≈ 1

It is important to note that these are not the true probabilities of the events
happening, as it is impossible to calculate the true probability. This is simply an
educated guess based on the odds given to us by the bookmakers.
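Equations 2.1 and 2.2 can be summarised in a few lines of Python. This is a minimal sketch of the calculations above, not the project's implementation.

```python
def implied_probability(odds: float) -> float:
    """Equation 2.1: the implied win probability of a decimal price."""
    return 1 / odds

def remove_vigorish(over_odds: float, under_odds: float) -> tuple[float, float]:
    """Equation 2.2: scale both implied probabilities by the overround so they sum to 1."""
    p_over = implied_probability(over_odds)
    p_under = implied_probability(under_odds)
    overround = p_over + p_under              # roughly 1.0728 for the Lebron James bet
    return p_over / overround, p_under / overround

print(remove_vigorish(1.9, 1.83))  # approximately (0.491, 0.509)
```

The same implied_probability function also gives the break-even win rates discussed in the next subsection: implied_probability(1.91) is roughly 0.524 and implied_probability(1.86) is roughly 0.538.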

The Significance of 53.8%

Suppose that we place a €10 wager on a 50/50 bet and win; we would expect to double our money. However, as previously mentioned, bookmakers will always take their cut. Normally, for an even-odds (50/50) game, the bookmaker will take 10% of the winnings [4]. Therefore, in order to make €10 on a coin flip, one would need to bet €11 (essentially a bet with 1.91 odds).
To find out what percentage of even-odds bets we would have to win on a consistent basis in order to be profitable, we simply plug the 1.91 odds into the formula for implied probability [4].

Implied Probability for a bet with 1.91 Odds

Implied Probability = 1 / 1.91 ≈ 0.5236

This indicates that for a bettor to break even on their even-odds bets, they would have to win 52.4% of the time to overcome the vigorish of the bookmakers. However, with player betting the bookmakers tend to take slightly more vigorish from the bets, meaning that the odds for an even-odds player bet would typically be 1.86 rather than 1.91. With this in mind, we can calculate the new percentage needed to overcome the vigorish using the same calculation.

Implied Probability for a bet with 1.86 Odds

Implied Probability = 1 / 1.86 ≈ 0.5376

Thus, to beat the bookmakers and make a profit over an extended period of time in player betting, we need to reach an accuracy of 53.8%.


The Balanced Book Hypothesis

The Balanced Book Hypothesis states that bookmakers aim to split equally the amount of money that is wagered on both sides of any even-odds bet [7]. It would be impossible for the bookmakers to lose money in this scenario: no matter which side wins, they collect the stakes of the bettors that lost and take their vigorish from the winners as a cut.
In the case of player betting, bookmakers aim to achieve a balanced book by adjusting the mid-point in their player bets. For instance, in the previously mentioned Lebron James bet, if the mid-point were lower than 26.5, then chances are that more people would have bet on the over. In a situation where there is more money on one side, the bookmaker is vulnerable to losing money [7].

2.2 Literature Review


In the following section, the papers are organised under the following themes: NBA moneyline betting, NBA fantasy, NFL spread betting, and sports betting using sequential models.

2.2.1 NBA Moneyline Betting


Moneyline betting is a type of wager in which the bettor simply bets on the team that they think will win the game outright. An example of this is the previously described bet between the Chicago Bulls and the LA Lakers.

In Hubáček et al. [8], the goal was to predict NBA moneyline bets. If a team is clearly better than its opponent, the odds will be more lopsided. For instance, suppose the Chicago Bulls are playing the LA Lakers. In the current season, the Chicago Bulls have been playing better and have a better record. Thus, their odds will be placed at 1.38 since they are the favourites, and the LA Lakers' odds would be 3.15, implying that they are the underdogs.
The data they gathered consisted of various quantitative measures from both teams' matches in that season. When predicting the outcome of a game, only the games that happened before the date of said game are used. They used a logistic regression model as a baseline prediction model. They then developed two variants of a neural network: the first was a standard feed-forward network, while the second used a convolutional layer in tandem with three dense layers. The convolutional layer is specifically designed to deal with player-related data due to the high number of features.

The authors also investigated whether it is beneficial to exclude the bookmakers' odds from the models themselves. If the odds are included as a feature, then it is more likely that the predictions will be similar to the bookmakers'. Therefore, they did not include the odds in their models in order to decorrelate the models from those of the bookmakers.
Testing was done over 9,093 games that were played from 2006 to 2014. From their results, it is clear that including the bookmakers' odds in the models is to the detriment of the bettor. The convolutional neural network achieved higher accuracies and profits than the rest of the models. Their results ended with the neural models being slightly less accurate than those of the bookmakers, whilst the logistic regression baseline was the least accurate.

Dotan [4] also aims to beat the moneyline market for NBA bets. The data they collected involved the game-by-game statistics for each team in the league. This data was available from an R package called nbastatR. The historical odds for moneyline bets on each game were found at the publicly available online resource [9]. The statistics and odds covered the 2007 to 2020 NBA seasons.
An interesting part of this paper is that they attempted to scale each statistic to its respective season. Each season in the NBA is different; basketball is not always played the same way and the sport evolves over the years. In some seasons the pace of a game is much faster, meaning that both teams will have more possessions since the average duration of a possession goes down. To account for such intricacies, Dotan adjusted each team stat to a standardised per-100-possessions basis to avoid this "phenomenon of statistical inflation" [4].
The testing phase of this project used different games from the same 2007 to 2020 seasons to keep the environment as similar and fair as possible. The authors used logistic regression, random forest, XGBoost, and an ANN model. All models performed similarly, between 61.6% and 65.9% accuracy, with the XGBoost and logistic regression models both at the top. It is important to note that moneyline bets are not even-odds bets, so there is usually a favourite in a game. This means that the accuracy will always be higher than that of a model attempting to predict an even-odds bet, since it is always safer to bet on the favourite.

Thabtah et al. [10] also set out to predict the outcome of NBA matches. They
attempted to anticipate the result of the NBA finals games from the year 1980 to
2017. The data they used for their project was gathered from [11] and included the
game-by-game team totals from the same time span. Their dataset incorporated 22
features which each represented a different aspect of each team.
Apart from testing their hypothesis with different models, the authors also opted to test different datasets. They used two filter methods and one rule induction algorithm to select different combinations of attributes, these being: multiple regression, correlation feature set, and the Ripper algorithm. These datasets were then used with the following models: an ANN, naive Bayes, and a logistic model tree (LMT).
Among all the datasets considered, the one chosen using the Ripper algorithm performed the best, since all three models had their best accuracy score using that combination of features. When using the full dataset with all the features, the models had a tendency to drop in accuracy; the authors attributed this drop to irrelevant features that only make the models more complex without adding any important information to the prediction of the game. On average, the best performing model was the LMT, followed by the naive Bayes and lastly the ANN. Although the ANN achieved the lowest accuracy on average, it had the highest accuracy when the dataset had all its features, including the irrelevant ones. This could hint towards this type of machine learning model being more robust to features which do not give as much information as others.

2.2.2 NBA Fantasy


NBA fantasy is a type of sports game in which each participant picks their own virtual basketball team made up of real NBA players. Each participant then competes with the others based on the statistical performance of their chosen players in real-life NBA games. Fantasy points are single numbers that describe a player's performance in a specific game and are calculated using an equation in which some stats are weighted more heavily than others. Different companies that offer fantasy basketball use slightly different equations and weightings of stats.

Papageorgiou [12] set out to solve the problem of predicting fantasy points for each player on a daily basis. To complete the task, they proposed having two datasets, one for individual player statistics and one for team statistics. Both datasets were gathered by scraping the nba.com website. The data was organised in a game-by-game manner for the players and the teams alike, and spans from the 2011 season to the 2021 season. The player dataset had 106 columns, each representing a different stat, whilst the team dataset had 100 columns. They also decided to use two combinations of features for training their models to test which dataset would perform better: one dataset held only basic features whilst the other also included advanced statistics. This tested whether having more features would result in a more accurate model or whether the increased number of features would be to the detriment of the model due to the higher complexity.
For the model implementation, the author used PyCaret, an open-source machine learning library in Python. Using this library, the author could try many different models easily, since the workflows are automated and not much code needs to be written. Papageorgiou then proposed to use each model that the library offers and test which works best in this situation. An interesting design choice made in this project is that a model was made and trained for each individual player, as opposed to having one model for all players.
When using the advanced features dataset, the random forest regressor had the highest accuracy for the largest number of players, whilst when using the basic features dataset it was the voting regressor that outshone its competitors. Moving forward with these models, the author evaluated them using mean absolute percentage error (MAPE), obtaining a 30.7% MAPE for the advanced dataset and 31.1% MAPE for the basic dataset. Everything was tested on an unseen dataset to keep fairness standards high. After some optimisations using a single dataset, the author managed to get the MAPE down to 25.9% on the test set.

In [13], the main objective was similar to the previous paper: to forecast the number of fantasy points that a player will achieve in an upcoming game. However, they investigated a slightly different way to predict the number of fantasy points. Instead of the models outputting a single number for the fantasy points, they built a model for each base statistic used in the equation that calculates the fantasy points, and used the outputs of those models in the equation to calculate the final score.
The data that they used to train their models was gathered by web scraping player game stat lines from [14]. An interesting design choice by the authors is that they used rolling averages of each individual player stat over 3, 5, and 10 game windows. This means that they had a feature for a player's stat over the last 3, 5, and 10 games. This tripled the number of continuous features but could provide more information to the models. The data collected starts from 2014 and ends in 2020.
The criteria for their model selection were that they prioritised models that perform better on highly structured data and that are capable of interpreting feature importance. With these criteria in mind, the authors used random forest and XGBoost. A random forest model that simply predicts the fantasy score directly was used as a benchmark. The several random forest models, each predicting an individual stat, outperformed the single model predicting the fantasy score, with a mean squared error (MSE) of 93.18 on the test set as opposed to 94.61. However, after hyperparameter tuning all the models, the XGBoost model won the battle and achieved a final MSE of 92.27 on the test set.


2.2.3 NFL Spread Betting


Donnelly [7] asked an important question: are the betting markets efficient? In other words, are the odds given to us by the bookmakers optimal or not? The efficient market hypothesis (EMH) was originally formulated for the financial markets and assumes that all publicly available information is reflected in the price of an investment. This was hard to prove in the financial world, but researchers have found that betting markets offer adequate conditions to test the hypothesis, since the problems are easier to define [15].
The investigated type of betting is the spread bet in the National Football League (NFL). A spread bet involves a game between two teams that are not equally matched in skill. The underdog gets a handicap to even out the playing field. As an example, suppose the Philadelphia Eagles and the New York Giants (two NFL teams) are playing against each other. Since the Eagles are a better team, the Giants get a +7 handicap. For betting purposes, this adds 7 points to the Giants' final score. When a bettor bets on the Eagles, they think that the Eagles will win by more than 7 points. If a bettor places money on the Giants, they think that the Giants will either win, draw, or lose by less than 7 points. In the case that the Giants lose by exactly 7 points, this results in a push and the stake that the bettor placed is returned.
The two main categories of information that were gathered for the paper were (1)
point spreads and the actual outcome of each game and (2) the adjusted statistics for
each game. A type of team evaluation known as adjusted statistics compares each play
to a baseline that is representative of the league as a whole while accounting for the
opponent’s difficulty. The seasons that were included in this paper were from the
2003-2004 season to the 2011-2012 season.
Instead of using machine learning to predict the outcome of a spread, they decided to use a weighted equation that takes into account both teams' previously played games and how many points they scored and conceded. Using this equation to determine the result of a spread, they managed to reach 52.22% accuracy over three seasons.

Gimpel [16] presents a novel approach to beating the bookmakers in NFL spread bets. The writer managed to extract 230 distinct features from the 1992 season up to the 2001 season. The novelty of their approach lies in the feature selection portion of the project, as they opted to use a randomised feature search. This works by selecting random features and training a logistic regression model on that temporary dataset; if the accuracy is higher than with the previous temporary dataset, the new feature set is kept, otherwise a new random set is tried. The logistic regression model was selected due to its speed.
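The randomised feature search can be sketched as follows. This is our own illustration using scikit-learn's LogisticRegression and cross-validated accuracy rather than Gimpel's original code, and X and y are placeholder arrays standing in for the 230 features and the spread outcomes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def random_feature_search(X: np.ndarray, y: np.ndarray, n_iter: int = 200,
                          subset_size: int = 30, seed: int = 0):
    """Keep the random feature subset that yields the best cross-validated accuracy."""
    rng = np.random.default_rng(seed)
    best_features, best_score = None, -np.inf
    for _ in range(n_iter):
        features = rng.choice(X.shape[1], size=subset_size, replace=False)
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, features], y, cv=5).mean()
        if score > best_score:   # keep the new subset only if it improves on the previous best
            best_features, best_score = features, score
    return best_features, best_score
```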
The focus of this project was the dataset rather than the models; however, two models were tested using the final dataset, these being a logistic regression model and a support vector machine (SVM). The final results showed the former achieving 54.07% accuracy and the latter 50.76% over two seasons of NFL games.

2.2.4 Sports Betting with Sequential Models


In Bucquet and Sarukkai's work [17], they set out to beat the bookmakers in a different type of NBA betting known as over/under. This type of betting is more similar to player betting, as the odds are made to be as even as possible (as in the balanced book hypothesis). This type of wager involves predicting the total number of points scored in a game. The bookmakers provide the mid-point and bettors bet on whether they think both teams combined will score more or less than the mid-point.
For their features, they decided to look at each team's past three games and gather simple features such as points scored and points conceded, as well as other, more complicated features. To account for the opponent's strength, they added each opponent's season averages in certain selected stats. Finally, they included the number of days since the last game and the distance travelled to account for player fatigue. The data collected spanned six seasons (2012 to 2018), using the first four of these seasons as a training set, 2016 to 2017 as a validation set, and the 2017 to 2018 season as a test set. All this data was taken from [18]. They also needed a dataset for the odds and the mid-point of every game, which was gathered from [9].
They investigated the following models: random forest, collaborative filtering, ANN,
and an LSTM. The model evaluation was done using MSE between the over/under value
predicted by each model and the actual point outcome of the game. In the end, the
best-performing model was the neural network with an accuracy of 51.5% on average.

The authors in [19] aim to provide insight into predicting the outcome of tennis
matches using neural networks and sequential models. The data they needed for this
task was available to them from the Association of Tennis Professionals (ATP), which
included all the raw statistics from most tennis games that were played in the past
decades. However, since some of the earlier matches tended not to have complete data, the writers opted to only use matches played after the year 2000. To improve the quality of the dataset, only matches where both players had played at least 25 matches were considered, so that there was enough data to train the models on. When possible, the models would use the previous 50 matches of a player to more accurately predict the outcome; however, 25 was the minimum number of matches, and if a player did not have 50 played matches, then the algorithm would pad the rest of the feature vector with zeros.
When training the models, the authors chose to train a feed-forward network with a dataset consisting of the players' averages over their past 50 games and with another dataset which concatenated the players' statistics from their last 50 games. As a baseline for all their tests, they simply chose the winner depending on which player had more ATP rank points (i.e. which player was ranked higher). This naive approach resulted in 65.6% accuracy. The feed-forward network trained on the averaged match statistics dataset achieved 64.4%, whilst when trained on the concatenated dataset it achieved a mere 53.2%. Lastly, the authors used an LSTM. This model impressively beat the naive approach with an accuracy of 69.6%.

2.3 Summary
In this chapter, we discussed the important mathematics and principles behind betting, and more specifically player betting. The bookmakers' aim of keeping both sides of a bet as equal as possible is one of the main principles that player betting is based on. Moreover, we covered literature that gave us ideas and inspiration for this FYP and helped throughout the development process.
From this literature, we gathered some important takeaways. Firstly, from [8] we realised the importance of not using the bookmakers' odds, so as to decorrelate our models as much as possible from theirs. Using their odds does help to get closer to their results, but would prevent our models from exceeding their accuracy. The NBA fantasy papers showed the most suitable way to predict the stats of a player using their previous performances. Bucquet and Sarukkai's work [17] showed that sequential models work well for this type of task.

3 Methodology
This chapter is divided into three main sections. In the first section, we explain how the
two datasets were created. The second section explains the designs of the three
models used, and the last section explains the evaluation design that we proposed for
these models.

3.1 The Datasets


The specifications of this project called for two datasets that were not publicly available; consequently, we had to tailor-make the datasets for the models to be trained on. Both datasets were stored in an SQLite database so that the data could be queried with ease.

3.1.1 Player Bets


A model learns patterns in data by seeing many different training examples and recognising patterns over time. To do this, there must be as much data as possible, since the more data there is, the better trained the model will be.
Unfortunately, after thoroughly searching online, there were no available datasets that included the characteristics we needed to train our models for this specific type of bet. To bypass this issue, it was decided that we had to make our own dataset. We found the website [20], which started collecting player bets in March 2022 and still publishes new bets every day to keep up with the evolving betting market. Web scraping was then employed to gather the relevant information that we needed from this website. This involved getting the player name that the bet was about, the mid-point of the points, the date that the game was played, and the number of points that the player scored in the actual game. This data was saved and organised in a JSON file format that could easily be processed when preparing the training data.
For this FYP, we gathered data starting from March 2022 and ending in March 2023, which is the span of games that the models will be predicting. This data also spans two seasons, since the NBA season usually finishes in June and a new season starts in October. In the end, we managed to scrape 14,124 usable player bets for the models to be trained on. This number of instances is larger than some of the datasets in the literature review, since those only have one bet per game (such as moneyline), whereas an advantage of player betting is that there are several bets in every game. This allows us to have plenty of bets even though we only have bets from two seasons.
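A simplified sketch of the scraping step is shown below. The URL, HTML structure, CSS selectors, and output file name are illustrative placeholders, since the layout of the source website [20] is not reproduced here.

```python
import json
import requests
from bs4 import BeautifulSoup

def scrape_player_bets(url: str) -> list[dict]:
    """Parse one page of historical player points bets into a list of dictionaries."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    bets = []
    for row in soup.select("tr.player-bet"):  # hypothetical row selector
        bets.append({
            "player": row.select_one(".player-name").text.strip(),
            "mid_point": float(row.select_one(".mid-point").text),
            "date": row.select_one(".game-date").text.strip(),
            "actual_points": int(row.select_one(".points-scored").text),
        })
    return bets

# Hypothetical usage: save the scraped bets in the JSON format described above
with open("player_bets.json", "w") as f:
    json.dump(scrape_player_bets("https://example.com/nba/player-points"), f, indent=2)
```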


3.1.2 Player Performances


The second type of data that is needed for this task is that of the individual players’
performance in each game. The performance of a player can be quantified using a
number of different statistics that are taken down during an NBA game. There are
many distinct statistics that a player gets after a game is finished. The most basic stats
of a player’s performance include the countable actions that a player takes in a game,
these are stats such as points (number of baskets scored in a game depending on
location), rebounds (number of times the player gained possession after a missed shot),
and assists (number of times the player passed the ball to a player that then scores).
Then there are also more advanced statistics that are derived from the basic stats, such as Player Impact Estimate (PIE), a metric that measures a player's all-around contribution to the match [21], or True Shooting%, which gauges a player's scoring efficiency after adjusting for each shot's worth [22]. This means that if a player is more inclined to shoot 3-pointers (which are further away from the basket and harder to make), their True Shooting% would not decline as much as that of a player who only takes 2-pointers, assuming both attempted and missed the same number of shots.
The models will predict how many points a player will score in their upcoming game by being trained on these basic and advanced statistics, with every player having this data for each game that they played.
This data was gathered from an open-sourced NBA API [23]. The API has many end-points that can be queried for data about the current and historic NBA; however, the ones of use to us are those relating to the individual stats of each player in every game for which we have a corresponding bet, more specifically the stats relating to scoring.
After determining which end-points were potentially useful for training models in
this task, we queried and combined the stats of each to create a dataset for every
player that played in a game during the timespan of bets that we have available.
Out of all the potential features that were gathered, we narrowed down the ones in use to nine distinct features. We only selected features that we felt have an impact on the number of points a player can score in a game, thus keeping the number of features to a minimum to avoid the Curse of Dimensionality [24] having an effect on the outcome. This phenomenon states that the higher the dimensionality of the training data, the more prone the models are to perform poorly. With a large number of dimensions, the models also need more data to find patterns, since the number of possible data configurations grows by orders of magnitude. To combat this, we made it a point to keep the number of features to a minimum and avoid reductions in accuracy, especially since we do not have an enormous amount of data to use. The following table describes the different basketball statistics that were used to train the models:

TCHS (Touches): The number of times a player touches and possesses the ball during a game.
UFGM (Unassisted Field Goals Made): The number of shots made by a player without receiving an assist from a teammate.
UFG PCT (Unassisted Field Goals Percentage): The percentage of shots made by a player without receiving an assist from a teammate.
OFF RATING (Offensive Rating): The team points scored per 100 possessions while the player is on the court.
TS PCT (True Shooting Percentage): A shooting percentage that factors in the value of three-point field goals and free throws in addition to conventional two-point field goals.
USG PCT (Usage Percentage): The percentage of team plays used by a player when they are on the floor.
PIE (Player Impact Estimate): Measures a player's overall statistical contribution against the total statistics in games that they play in.
FGM (Field Goals Made): The number of shots made by a player in a game.
PTS (Points): The number of points scored by a player in a game.

Table 3.1 Description of Basketball Statistics Used for Model Training.
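As an illustration of how the per-game stats from several endpoints could be combined and stored, the sketch below merges the per-player dataframes returned by the stats API and appends them to the SQLite database. The table name, column names, and merge keys shown here are assumptions rather than the project's exact schema.

```python
import sqlite3
import pandas as pd

FEATURE_COLUMNS = ["TCHS", "UFGM", "UFG_PCT", "OFF_RATING", "TS_PCT",
                   "USG_PCT", "PIE", "FGM", "PTS"]  # the nine statistics from Table 3.1

def store_game_stats(db_path: str, basic: pd.DataFrame, advanced: pd.DataFrame,
                     tracking: pd.DataFrame) -> None:
    """Merge per-player box-score dataframes for one game and append them to SQLite."""
    keys = ["GAME_ID", "PLAYER_ID"]
    merged = basic
    for frame in (advanced, tracking):
        # keep only the key columns plus columns not seen yet, to avoid duplicate columns
        new_cols = keys + [c for c in frame.columns if c not in merged.columns]
        merged = merged.merge(frame[new_cols], on=keys)
    keep = keys + ["PLAYER_NAME", "GAME_DATE"] + FEATURE_COLUMNS
    with sqlite3.connect(db_path) as conn:
        merged[keep].to_sql("player_performances", conn, if_exists="append", index=False)
```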

3.1.3 Preparing the Data


Even though both datasets were created, the data itself was raw and not yet ready to be fed into the models for training. To prepare the data, we needed to alter its properties slightly.
Regarding the bets dataset, there was nothing of note that we needed to change, as the data was already suitable for model training.
However, the player performances dataset needed some small tweaks. The first was that most of the stats were in their absolute forms, with large values, and were not normalised. Thus, for each feature that was not normalised, we created a normalised column counterpart that would be used and queried later on. Secondly, there were games that a player did not play in, due to injury or because their coach decided not to play them. In these cases, their stats were all null, which cannot be used by a model, so all the null values were replaced with zeros.
We were then tasked with creating a representation of a player's previous game performances in a way that can be understood by machine learning models. From here on, this representation of a player's historical performances, paired with the points that the player scored in the upcoming game, will be called an instance. To create all the instances needed to train the models, we first queried the player bets database and stored all the bets in a dataframe. We then looped through all the bets one at a time, extracting the player name and the date of the bet. Once we had the name and the date, we queried our player performance database to return the chosen stats for the seven games that the player played prior to said date. Each game performance is represented as a vector of nine normalised numerical values, each representing a statistic. Therefore, the collection of historical performances for one instance is a 2D array of shape (7, 9). If a player does not have seven games played prior to the bet, then we pad the 2D array with the required number of zero vectors.
Seven previous performances were chosen as we felt that this is a good representation of the current form of a player. If we decreased the number of games to, say, three, the models might be overly sensitive to changes in the player's performance and could capture more volatility. On the other hand, if we increased this number, the models might not adjust well to the player's recent changes in performance, since they would be considering a longer period of time.
There were some instances, when gathering the performances, where the player name in the bet database did not match any player name in the player performance database, since the two came from different sources. When this happened, we simply ignored that bet and removed it from the final dataset. The aforementioned 14,124 instances form the final dataset and already exclude these unmatchable bets.
The second part of the instance was the number of points that the player scored on the day of the bet. This part was simpler: we queried the performances database with the player name and the date of the game and returned the points. Once both parts of the instance were set, we stored them in a tuple.
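A condensed sketch of the instance-building step is shown below. The table name, column names, and padding position are assumed for illustration and are not the exact schema used in the project.

```python
import sqlite3
import numpy as np

FEATURES = ["TCHS", "UFGM", "UFG_PCT", "OFF_RATING", "TS_PCT",
            "USG_PCT", "PIE", "FGM", "PTS"]  # in the project these map to the normalised counterparts

def build_instance(conn: sqlite3.Connection, player: str, bet_date: str):
    """Return a (7, 9) array of the player's last seven games and the points scored on the bet date."""
    history = conn.execute(
        f"SELECT {', '.join(FEATURES)} FROM player_performances "
        "WHERE player_name = ? AND game_date < ? "
        "ORDER BY game_date DESC LIMIT 7",
        (player, bet_date)).fetchall()
    x = np.zeros((7, len(FEATURES)), dtype=np.float32)        # zero-padding when fewer than 7 games exist
    if history:
        x[-len(history):] = np.array(history[::-1], dtype=np.float32)  # oldest of the seven games first
    target = conn.execute(
        "SELECT pts FROM player_performances WHERE player_name = ? AND game_date = ?",
        (player, bet_date)).fetchone()
    return x, (target[0] if target else None)                 # the tuple stored for training
```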

3.2 The Models


In this FYP, since we are using sequential data in the form of previous games that follow one another, it was decided to use models that can take advantage of this data type. Consequently, the models that were trained and tested are a recurrent neural network (RNN), a long short-term memory network (LSTM), and a transformer.

3.2.1 Training the Models


Before describing the models’ architectures, we believe it is important to briefly explain
how they were trained and which hyperparameters were used for each of them.
Starting with the data split, we decided to use 85% of the data for training and 15% for the test set. We think that slightly favouring the training set with more data is important, since we do not have an abundance of data to train the models on, and having just over 2,000 bets to test on will still yield an accurate result; in reality, a normal bettor would take years to place this many bets. Regarding the validation set, we used K-Fold Cross Validation (KCV), so there is no fixed validation set and it changes every fold. This path was chosen because, during development, we encountered cases where the accuracy on the validation set was high but, once the model finished training and was tested on the test set, the accuracy was much lower. This technique also reduces the risk of over-fitting since the training set is not constant [25].
Regarding the loss function and optimiser, it was decided to use MSE and the Adam optimiser respectively. MSE was chosen since it is a differentiable function [26] that is commonly used for regression tasks [27] and is easy to compute, which speeds up runtime and allows us to test different hyperparameter configurations. The Adam optimiser was selected mainly due to its ability to converge quickly [28] when compared to other optimisation algorithms, owing to its adaptive learning rate [29]. Moreover, it computes a learning rate for each individual parameter, which allows the optimiser to adapt to different gradient scales.
After the KCV split, the data is divided into batches before training starts. This implementation choice was mainly due to its robustness, as the model can adapt better to changes in the distribution of the data, which makes it more adaptable and less prone to overfitting [27]. Furthermore, this technique can also improve training speed, since the gradients are computed more efficiently, which leads to the model updating more frequently and converging faster [26]. In order to cater for each model individually, we developed the training loop with a modular approach that allowed us to use different values for all the techniques and hyperparameters above, to try to squeeze out every last bit of performance that we could.
During every training step, the model is fed a batch, the loss is calculated, and the weights are then updated with backpropagation, in that order. The accuracy represents how well the model is performing: the number of bets it predicts correctly as a percentage of the total number of bets. This can be represented with the following formula:

Accuracy = (Correct Answers / Total Number of Bets) × 100    (3.1)

This accuracy is the main metric used to measure the performance of the models, since it represents the reality of the task at hand as closely as possible. At the end of each epoch (a cycle through the training data), we calculate the accuracy of the model on the current validation set (which changes with every fold), and if the accuracy in the current epoch is the largest seen so far, we save the model. With this technique, when training is done, we will have saved the best-performing model of the whole training cycle, since the models sometimes tend to overfit towards the end of training and it is extremely difficult to predict when this will start to happen. Each model's best-performing weights are then tested on the test set, which is completely unseen by the models.
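The following is a compressed sketch of the training procedure just described, assuming a PyTorch implementation. The framework choice, the helper names, and the hyperparameter values shown (k, epochs, learning rate, batch size) are illustrative rather than the exact configuration used for each model.

```python
import copy
import numpy as np
import torch
from torch import nn
from sklearn.model_selection import KFold

def bet_accuracy(pred_points, actual_points, mid_points):
    """Equation 3.1: percentage of bets where the predicted side of the mid-point is correct."""
    correct = (pred_points > mid_points) == (actual_points > mid_points)
    return correct.float().mean().item() * 100

def train_with_kcv(model, x, y, mids, k=5, epochs=50, lr=1e-3, batch_size=64):
    """K-fold training with MSE loss, the Adam optimiser, and best-model checkpointing."""
    loss_fn = nn.MSELoss()
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    best_acc, best_state = 0.0, copy.deepcopy(model.state_dict())
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True).split(np.arange(len(x))):
        train_idx, val_idx = torch.as_tensor(train_idx), torch.as_tensor(val_idx)
        loader = torch.utils.data.DataLoader(
            torch.utils.data.TensorDataset(x[train_idx], y[train_idx]),
            batch_size=batch_size, shuffle=True)
        for _ in range(epochs):
            model.train()
            for xb, yb in loader:               # one batch: forward pass, loss, backpropagation
                optimiser.zero_grad()
                loss = loss_fn(model(xb).squeeze(-1), yb)
                loss.backward()
                optimiser.step()
            model.eval()
            with torch.no_grad():               # validation accuracy on the held-out fold
                preds = model(x[val_idx]).squeeze(-1)
                acc = bet_accuracy(preds, y[val_idx], mids[val_idx])
            if acc > best_acc:                  # checkpoint the best model seen so far
                best_acc, best_state = acc, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model, best_acc
```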

3.2.2 Recurrent Neural Network


An RNN is one of the most fundamental types of neural networks that allows data to be processed sequentially. It is designed for a sequence of inputs, maintaining a ”memory” of sorts of the previous inputs it has seen [30], which bases the model's predictions on the context of the entire sequence. Unfortunately, this model is also prone to the vanishing gradient problem, which occurs when the gradients of the loss function with respect to the network weights become very small during backpropagation, resulting in a loss of information for the model [31].
When designing this model, we kept the vanishing gradient problem in mind. The number of recurrent layers was kept to one, as even with two layers performance decreased, very possibly because of the vanishing gradient problem. To compensate, the recurrent layer is followed by four hidden dense layers that lead to the single output: the points prediction.
Since this model is less computationally expensive, we could afford to train it for more epochs. This also allowed us to use a lower initial learning rate on the optimiser to reduce the chance of overfitting; note that with Adam the effective learning rate is adaptive and changes as training progresses. After testing different hidden layer sizes, the sweet spot for this model was 1,024 nodes: going higher made overfitting degrade performance greatly, while going lower hurt performance, possibly because the model was not complex enough to learn the intricate patterns in the data. The four dense layers shrink in size as they approach the output.
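The following is a minimal PyTorch sketch consistent with the description above; the exact layer widths, activation functions, and feature count are assumptions for illustration, not the precise architecture we trained.

```python
import torch
import torch.nn as nn

class PointsRNN(nn.Module):
    """Sketch: one recurrent layer (hidden size 1,024) followed by four
    shrinking hidden dense layers down to a single points prediction."""
    def __init__(self, n_features: int = 12, hidden: int = 1024):
        super().__init__()
        self.rnn = nn.RNN(n_features, hidden, num_layers=1, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),                 # singular points output
        )

    def forward(self, x):                     # x: (batch, past_games, n_features)
        _, h_n = self.rnn(x)                  # h_n: (1, batch, hidden) final hidden state
        return self.head(h_n[-1])

print(PointsRNN()(torch.randn(8, 7, 12)).shape)   # torch.Size([8, 1])
```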
For this model, we chose a relatively small batch size compared to the other two models, since smaller batches may help the model generalise better and further reduce the chance of overfitting. Although smaller batches slow down training, the model's low computational cost more than makes up for the time lost. Additionally, we used a slightly higher k value in KCV to try to capture the subtle but meaningful differences in the data and to reduce bias, making the validation more representative of unseen data.

3.2.3 Long Short-Term Memory Network


LSTMs are often considered a step up in power from the traditional RNN. RNNs process sequential data by passing hidden states from one time step to the next, which can make them struggle to maintain long-term dependencies because of the vanishing gradient problem. LSTMs address this by introducing a more complex architecture with additional components in the form of memory cells and gating mechanisms [32]. These allow the network to selectively store or forget specific pieces of information at each time step, making it much better at handling long-term dependencies. However, this added complexity comes at the cost of higher computational expense and potentially greater difficulty in understanding the model's behaviour.
When designing this model, we kept the same hidden layer size, since going above or below 1,024 had the same consequences as for the RNN. This model differs from the previous one in that it has one fewer hidden dense layer, to reduce the chance of overfitting; to compensate, we used an additional (second) LSTM layer in the hope of the model learning more complex patterns in the data.
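A minimal sketch consistent with this description is shown below; as before, the layer widths, activations, and feature count are assumptions rather than the exact trained architecture.

```python
import torch
import torch.nn as nn

class PointsLSTM(nn.Module):
    """Sketch: two stacked LSTM layers and three hidden dense layers
    before the single points output."""
    def __init__(self, n_features: int = 12, hidden: int = 1024):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x):                     # x: (batch, past_games, n_features)
        _, (h_n, _) = self.lstm(x)            # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])             # use the top layer's final hidden state

print(PointsLSTM()(torch.randn(8, 7, 12)).shape)   # torch.Size([8, 1])
```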
Since this model takes more time to train because of its computational complexity, the batch size was increased, at the expense of a higher risk of overfitting. Even so, we took several measures to keep overfitting in check, and we believe the trade-off was worthwhile because the time saved let us explore more hyperparameter configurations.
This model was trained for half as many epochs as its predecessor, for two reasons: firstly, to offset the increase in training time caused by the more complex architecture, and secondly, because the model converges more quickly thanks to its greater capacity and therefore does not need as many epochs. In practice, when trained for longer the model simply plateaued without learning anything new. With fewer epochs, the learning rate was also set higher to help the model converge faster while reducing the risk of getting stuck in a local minimum. The k value in KCV was kept the same, for the same reasons as the previous model.

3.2.4 Transformer
A transformer is a type of neural network architecture that has gained significant popularity in recent years, particularly in the natural language processing (NLP) domain. This can be attributed to the transformer's ability to process the input as a whole and use self-attention mechanisms to express relationships between different parts of the input [33], which increases the model's power. The self-attention mechanisms work by assigning weights to all parts of the input, giving higher weights to the parts that are more relevant to the problem at hand. This is in contrast to the previous two models, which rely on sequential processing of the data. The increase in power comes at the cost of an even more complex architecture and longer training times. Even though these models are primarily used for NLP tasks, they can be applied to sequential data problems such as this one and still offer significant advantages over more traditional architectures.
The layers of this model differ slightly from its counterparts. It starts with an embedding layer that maps the input data into a higher-dimensional space that the model can process more easily [33]. The encoder layer is then responsible for processing the input using the aforementioned self-attention. Specifically, this layer contains a multi-head self-attention mechanism that allows the model to attend to several parts of the input concurrently, capturing the global context of the whole sequence [33]. The encoder layer also includes a standard feed-forward neural network that processes the output of the self-attention sub-layer to produce a new representation of the input, and applies layer normalisation to stabilise training. The final distinctive component is the encoder itself, which stacks a number of encoder layers, with the count set as a hyperparameter; stacking these layers allows the model to capture more complex relationships between the input data and the output. After these layers there are three fully-connected linear layers, similar to the LSTM described previously.
Due to the complexity of this architecture, we could afford to increase the hidden layer size without the model overfitting on the training data. For the encoder layer we used eight parallel self-attention heads, a number chosen after testing as a balance between model capacity on one hand and training time and the risk of overfitting on the other. The encoder was then built from four of these layers, which improved performance considerably without overfitting, although it increased training time substantially.
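A minimal sketch of such a model in PyTorch is shown below. The model dimension, head widths, and the mean-pooling over the sequence are assumptions made for illustration; only the eight attention heads, the four stacked encoder layers, and the three fully-connected layers follow the description above.

```python
import torch
import torch.nn as nn

class PointsTransformer(nn.Module):
    """Sketch: an input projection ("embedding"), a stack of encoder layers
    with multi-head self-attention, and three fully-connected layers."""
    def __init__(self, n_features: int = 12, d_model: int = 128):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)            # per-game embedding
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)        # 8 parallel attention heads
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.head = nn.Sequential(
            nn.Linear(d_model, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, x):                     # x: (batch, past_games, n_features)
        z = self.encoder(self.embed(x))       # self-attention over the whole sequence
        return self.head(z.mean(dim=1))       # pool over the sequence, then regress

print(PointsTransformer()(torch.randn(8, 7, 12)).shape)   # torch.Size([8, 1])
```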
Because this model takes much longer to train than the other two, we had to compromise on other hyperparameters to cut down training time without sacrificing too much performance. For instance, the batch size was double that of the LSTM and the k in KCV was reduced slightly, which might have affected the model's accuracy, but overall we believe these concessions are worth the large increase in performance gained from the encoder layers.
All models were developed in Python using the PyTorch library. The hardware used for training was an AMD Ryzen 5 3600 CPU, a Radeon RX 580 GPU, and 16GB of RAM running at 1600MHz. On average, the RNN took around two hours to train, the LSTM took 12 hours, and the transformer took two days per training loop.

3.3 Evaluation Strategy


Once each model was trained, we needed a fair way to test its performance that also reflects the reality of betting. To achieve this we started by taking the models' outputs (the number of points they think a player will score) and comparing them to the mid-point of the bet. Since a player bet only has two possible outcomes (over or under the mid-point), each prediction is classified as either over or under. This answer is then compared to what actually happened, i.e., whether the player scored more or fewer points than the mid-point in the real NBA game. We only consider the accuracy on the test set, since these are bets the models have not seen and are the most representative of the bets a real bettor would see day to day. This evaluation strategy was inspired by the literature, as most of the papers we reviewed also measured the number of bets their models predict correctly.
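Concretely, this comparison reduces to a few lines; the arrays below are toy values chosen only to illustrate the over/under logic.

```python
import numpy as np

# illustrative arrays: model point predictions, bet mid-points, actual points scored
preds  = np.array([19.0, 22.4, 30.1])
mids   = np.array([20.5, 21.5, 28.5])
actual = np.array([18, 25, 27])

model_picks = preds > mids          # True = over, False = under
outcomes    = actual > mids         # what really happened in the game
accuracy = 100.0 * (model_picks == outcomes).mean()
print(f"test accuracy: {accuracy:.1f}%")   # 66.7% in this toy example
```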

3.3.1 Simulating Betting & Strategies


Whilst accuracy is a good baseline for a model's performance, it does not perfectly mimic how a bettor actually bets. To obtain a final accuracy percentage, the models place a bet on every single available bet; a real bettor, however, does not bet on everything, but browses the available bets and selects a handful of them. Usually, a bettor picks the bets they are confident will go through, based on their knowledge of the sport.

Simulating Betting

To further create an environment that reflects the reality of a bettor, we built a simulation of a betting scenario in which the models "choose" the bets that they "think" will go through. We first separate the bets in the test set into chunks; each chunk represents the selection of bets a bettor chooses from on a given day. We decided that each chunk would contain ten bets, which, with just over 2,000 unseen bets, simulates more than 200 days of betting.
The models are then given a starting balance (€10,000) to simulate the amount a bettor might deposit when starting out. The models then loop through each chunk and the bets within it. To decide which bets to place their simulated money on, we developed a system that looks at the relative difference between a model's prediction for a bet and the mid-point set by the bookmakers: if the prediction is much higher or lower than the mid-point, the model is essentially more "confident" of that bet hitting than of a bet whose predicted points are only slightly above or below the mid-point. It is important to use the relative difference, rather than the outright difference, when selecting bets, as the following scenario of two bets illustrates.

Bet Number    Bet Mid-point    Prediction
1             10.5             19
2             25.5             35

Table 3.2 Two examples of bets and a model's predictions.

From the above bets, the outright difference is 8.5 for the first bet and 9.5 for the second. Although the second difference is larger in absolute terms, relative to the mid-point it is much smaller (roughly 0.37 versus 0.81), because the first bet's mid-point is far lower than the second's. Thus, if the model had to choose between these two bets, the betting algorithm would choose the first one. The system uses the following equation to determine the level of confidence (relative difference) in a bet going through:

\[
\text{RelativeDifference} = \frac{\lvert \text{Prediction} - \text{Midpoint} \rvert}{\text{Midpoint}} \tag{3.2}
\]

The absolute value in this equation caters for predictions that fall below the mid-point rather than above it.
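Equation (3.2) translates directly into code; the values below are the toy examples from Table 3.2.

```python
def relative_difference(prediction: float, midpoint: float) -> float:
    # equation (3.2): absolute distance from the mid-point, scaled by the mid-point
    return abs(prediction - midpoint) / midpoint

print(relative_difference(19, 10.5))   # bet 1: ~0.81
print(relative_difference(35, 25.5))   # bet 2: ~0.37 -> bet 1 is picked first
```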
The relative difference of each bet in the chunk is then calculated, and the model selects the bets with the highest values. The number of bets chosen is a parameter that we vary for testing. Once the selected bets are placed (the model picks over or under), they are compared against the number of points the player actually scored. The amount of virtual money staked on each bet is known as a unit and is also a parameter varied for testing; the unit size is dynamic, since it is a percentage of the current balance. If the model's prediction is correct it receives its stake back plus profit; if it is wrong it loses the stake. The odds for every bet were set to 1.86, the average odds for an even over/under player bet, so a correct bet returns unit × 1.86.
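The simulation logic can be summarised by the sketch below. The function and field names ('prediction', 'midpoint', 'actual') are hypothetical, and details such as updating the balance after every individual bet are simplifying assumptions rather than the exact implementation.

```python
def simulate_betting(bets, n_bets_per_day, unit_pct, start_balance=10_000.0, odds=1.86):
    """Hedged sketch of the betting simulation. `bets` is a list of dicts with
    hypothetical keys 'prediction', 'midpoint' and 'actual'."""
    balance, peak = start_balance, start_balance
    for day_start in range(0, len(bets), 10):          # chunks of ten bets = one "day"
        day = sorted(bets[day_start:day_start + 10],
                     key=lambda b: abs(b['prediction'] - b['midpoint']) / b['midpoint'],
                     reverse=True)                      # most "confident" bets first (eq. 3.2)
        for bet in day[:n_bets_per_day]:
            stake = balance * unit_pct                  # unit size is a % of the current balance
            balance -= stake
            picked_over = bet['prediction'] > bet['midpoint']
            hit = picked_over == (bet['actual'] > bet['midpoint'])
            if hit:
                balance += stake * odds                 # stake back plus profit
        peak = max(peak, balance)                       # track the highest balance reached
    return balance, peak

toy_bets = [{'prediction': 19, 'midpoint': 10.5, 'actual': 14},
            {'prediction': 35, 'midpoint': 25.5, 'actual': 29}]
print(simulate_betting(toy_bets, n_bets_per_day=1, unit_pct=0.20))   # (11720.0, 11720.0)
```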

Betting Strategies

Every bettor is unique and bets differently in the hope of making a profit. Some prefer to play it safe on low-odds bets with higher stakes, at the risk of the "safe" event not happening and the whole stake being lost. Others place smaller amounts on higher-odds bets, hoping that one of them hits and returns a bigger payout. This form of evaluation was inspired by [8], whose authors also compared different betting strategies to determine which was optimal.
Thus, we set out to test which style of betting is the most profitable by varying two parameters: the number of bets placed per day and the unit size of each bet. The configurations used for these tests, which let us compare which style works best for a bettor, are the following:

Config No.    Number of Bets    Unit Size
1             1                 20%
2             3                 10%
3             5                 5%
4             10                2%

Table 3.3 Different configurations for testing betting strategies.
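As a usage example, the four configurations can be swept with the simulate_betting sketch from the previous section; the list of test bets here is a small placeholder, since the real unseen bets are loaded from our scraped dataset.

```python
# placeholder list standing in for the unseen test bets (hypothetical field names)
test_bets = [{'prediction': 22.0, 'midpoint': 20.5, 'actual': 24},
             {'prediction': 12.0, 'midpoint': 15.5, 'actual': 11}]

configs = [(1, 0.20), (3, 0.10), (5, 0.05), (10, 0.02)]   # (number of bets, unit size)
for n_bets, unit in configs:
    final, peak = simulate_betting(test_bets, n_bets, unit)
    print(f"{n_bets} bet(s)/day at {unit:.0%} units -> final {final:,.0f}, peak {peak:,.0f}")
```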

3.4 Summary
In this chapter, we described how the two datasets that form the foundation for training the models were created. We then presented the architecture of all three models, along with the hyperparameter choices made when training them, and the design decisions in the training loop aimed at reducing factors such as overfitting. Finally, we explained the procedure and metrics used for testing the models, and how we intend to simulate a real betting scenario with different betting strategies to extract the best possible results from the models.

4 Evaluation
One of the main objectives of this FYP was to compare different models and test their performance in a realistic scenario for betting on NBA player bets. In this chapter, we determine which models achieved the best scores and discuss why they performed the way they did.

An important note on the baseline comparisons in this section: since we found no literature attempting our approach of predicting these specific types of bets, the only firm benchmark we can compare against is the 53.8% accuracy that a model needs to exceed to be profitable against the bookmakers over the long term.

4.1 Recurrent Neural Network


In this section, we present the results of the RNN model for predicting NBA player points bets. This model mostly served as a baseline for the other two, since it is one of the most basic sequential models available. It reached 50.7% accuracy on the test set, compared to the 53.8% needed to win bets consistently over a long period and be profitable. While using this model is still better than simply guessing a side, it is not enough to overcome the vigorish set by the bookmakers. Below is a confusion matrix of the results:

                        Actual
                    Over    Under
Prediction  Over     559      562
            Under    484      514

Table 4.1 Confusion matrix for RNN.

We speculate that one reason this model did not perform well is the vanishing gradient problem. Even though we took measures when designing the model, such as keeping the number of layers to a minimum, it is still possible that the architecture suffered from this issue, to which it is particularly vulnerable. Another likely cause is the difficulty RNNs have in capturing long-term dependencies: although they can process sequential data, the network naturally loses the influence of earlier inputs, which leads to more prediction errors. The characteristics of the dataset may also have been less than ideal for this type of model; for instance, seven historical games per player may not have been enough for the model to properly predict the points a player will score in an upcoming game.
The following graphs show the balance over time when this model was used in the betting simulation with the different strategies:

[Four panels: RNN balance over time using betting strategies 1–4.]

Figure 4.1 RNN using different betting strategies.

From the graphs, we can see that the RNN has a run of hits between roughly day 10 and day 20. Ultimately, because its accuracy is below the 53.8% break-even threshold, the model will never make a long-term profit, even though it guesses more than half of the bets correctly.

4.2 Long Short-Term Memory Network


The LSTM is regarded as a step up in power among sequential models. Our results reflect this: the model reached an accuracy of 52.3%, 1.6 percentage points higher than the RNN. The confusion matrix below compares this model's predictions with the actual outcomes:


                        Actual
                    Over    Under
Prediction  Over     706      415
            Under    595      403

Table 4.2 Confusion matrix for LSTM.

This jump in accuracy can be attributed to the architecture's handling of the vanishing gradient problem, which is particularly important to mitigate in time-series prediction problems such as this one. The architecture not only addresses that issue but also captures long-term dependencies, which could also contribute to the improved performance. However, even this increase was not enough to overcome the bookmakers' vigorish, as shown in the following four graphs of the model's performance in the simulated betting scenario:

[Four panels: LSTM balance over time using betting strategies 1–4.]

Figure 4.2 LSTM using different betting strategies.

Regardless of betting strategy, a common theme in the LSTM's balance over time is that, compared to the RNN, it takes longer to reach the plateau at the bottom of the graph (which represents losing almost all the money). This can be attributed to the model's higher accuracy, since a higher accuracy means more bets are predicted correctly. The graphs also show that the model has a good run of bets towards the latter part of the simulation, where there is a spike in balance for all the strategies.

4.3 Transformer
Even though the transformer was introduced specifically for NLP tasks, it has proven effective for other sequential and regression tasks, such as this one. Interestingly, this model's raw accuracy was 47.2%, far behind the other two models. We therefore decided to negate its predictions and bet the opposite of what it predicts; doing so gives an accuracy of 52.8%, making it the best-performing model. A possible reason the model ended up learning this negated relationship is that it found a pattern the other two models could not see; we also speculate that the inductive bias of this architecture favours the negated predictions for this task. The confusion matrix below shows the negated predictions made by this model:

                        Actual
                    Over    Under
Prediction  Over     697      424
            Under    577      421

Table 4.3 Confusion matrix for transformer.
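Taking the negation is trivial in code. Continuing the toy arrays from the earlier evaluation sketch, the picks are simply inverted, and since every mid-point ends in .5 there are no pushes, so the negated accuracy is exactly 100% minus the original accuracy.

```python
negated_picks = ~model_picks                 # bet the opposite of what the model says
negated_accuracy = 100.0 * (negated_picks == outcomes).mean()
assert abs(negated_accuracy - (100.0 - accuracy)) < 1e-9   # e.g. 47.2% becomes 52.8%
```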

One explanation for this model's performance is that, unlike RNNs and LSTMs, the architecture does not rely on recurrence or hidden states. Instead, it uses self-attention to identify connections among the different parts of the sequence, so the model can learn which parts matter most for a prediction and more easily capture long-term dependencies. The following graphs show how this model performed in the simulated betting scenario:


[Four panels: Transformer balance over time using betting strategies 1–4.]

Figure 4.3 Transformer using different betting strategies.

This model performed better with the latter two strategies: the first two ended with almost everything lost, whilst strategies three and four kept the balance above €3,000. This is still a major loss compared to the €10,000 starting point, but better than the other two models. The transformer had a good run of bets around the 100-day mark but went swiftly downhill after that.

4.4 Discussion of Results


To begin this discussion, the table below shows the final balance of each model at the end of every betting strategy simulation.


              RNN     LSTM    Transformer    Averages
Strategy 1     33     1,056             7         366
Strategy 2     45       763           921         576
Strategy 3    656     2,426         3,525       2,202
Strategy 4    633     2,248         3,144       2,009
Averages      342     1,623         1,899

Table 4.4 Betting strategy results (final balances) with different models (in Euros).

From the table, it is clear that the transformer had the highest average final balance of all the models. This is consistent with it having the highest accuracy, and it also shows that the model is the most effective in a realistic betting scenario. It is notable that the only model whose predictions had to be negated still performed the best. One might look at the highest accuracy achieved in this project (52.8%) and think it is not even close to beating the bookmakers; however, on even-odds bets the margins are always extremely tight because of the vigorish taken by the bookmakers.

4.4.1 Betting Strategies


The first strategy was the worst of those tested, achieving an average final balance of only €366 across the three models. This number would be even lower had the LSTM not finished with a balance above €1,000. Since the LSTM fared comparatively well, it is possible that, for this model, the bet with the highest relative difference genuinely tended to be a good bet. This style of betting is very risky: only one bet is placed per day, so a miss wipes out a big chunk of the balance, while a hit returns a large sum. A bettor losing a few consecutive bets with this strategy would quickly lose their money, but a string of hits could grow the balance substantially in a short period of time.
The second strategy performed better than the first, but its results are still below those of the final two strategies. Overall, it was usually the first to flatline (i.e., the balance stays more or less the same for a long period), even though the first strategy is riskier. A likely reason is that this strategy wagers the highest percentage of the balance each day: the first strategy places one bet a day with a 20% unit, whilst this one places three daily bets at 10% each, for 30% of the balance in total. Since none of the models are profitable over time, the more of the balance that is bet, the faster the money is lost.


The third strategy appears to be the sweet spot for all the models, as each achieved its best result when using it. Even though it wagers more of the balance daily than the first strategy (25% versus 20%), it is less risky: the models pick five different bets instead of one, which spreads the exposure, since anything can go wrong in these bets, such as a player getting injured and leaving the game, or simply having an off night. Even if the confidence levels are not always as high as with the first strategy, spreading the bets appears to be the least likely way to lose the balance. Lastly, the fourth strategy yielded only slightly less than the third, probably for similar reasons, namely spreading bets out rather than staking one large sum on a single bet. However, it did not reach the heights of the third strategy because the more bets that have to be chosen, the less confident the models are in the last couple of picks. This further shows that the models' confidence levels make a difference.
This scenario is still not entirely realistic, since a real bettor can choose to cash out all of their balance or a portion of it at any time. To account for this we introduced a new metric called the peak: the highest balance a model reached during a simulation. The table below shows the peak for each model under each strategy.

              RNN       LSTM      Transformer    Averages
Strategy 1    16,586    13,735         73,107      34,476
Strategy 2    11,266    13,595         37,708      20,856
Strategy 3    11,242    11,512         25,139      15,964
Strategy 4    10,217    10,217          9,900      10,111
Averages      12,328    12,265         36,464

Table 4.5 Peaks of betting strategy results with different models (in Euros).

From this table it is clear that the transformer achieved the best peaks, in line with its stronger performance over the other two models. Since the strategies decrease in riskiness from one to four, it also makes sense that, although the first strategy produced the worst final balances, it reached the highest peaks on average, with the transformer at one point exceeding seven times the starting balance before losing most of it. A shrewd bettor who reached those heights could have withdrawn some or most of the balance and made a large profit using this model. However, this could equally have been a spell of good luck; it could just as easily have gone the other way, with almost all the money lost immediately.
Impressively, the transformer's peak of over seven times the starting balance came with the first betting strategy. This shows that, although the models were not profitable over a long period, they could still go on a long string of hits and generate a sizeable profit, especially the transformer. It also makes sense that the first strategy produced the highest peaks, since its unit size is the largest of all the strategies, so a run of consecutive hits increases the balance dramatically.

5 Conclusion
In this FYP, we developed several models that predict NBA player points bets. Although we did not manage to create a model that is profitable over an extended period, the transformer reached seven times the initial balance on the back of a sequence of correct predictions. After creating the datasets needed for this task, we developed the models and tuned them according to the data and their performance. Once all the models were trained, we compared their results and discussed the differences between them. We also placed the models in a more realistic betting scenario, giving them a starting balance and letting them bet on bets they had not seen before, to check whether they could turn a profit.

5.1 Revisiting Our Aims and Objectives


The main aim of this FYP was to create a model that could potentially rival the bookmakers'. The following objectives, laid out at the beginning of the project, were aimed at accomplishing this task.

1. A database of historic player points bets was successfully web-scraped and stored.

2. The game-by-game performances of players were gathered and organised.

3. We successfully implemented multiple sequential models and compared them to one another using the accuracy metric.

4. We developed a betting simulation that put the models to the test in a realistic betting scenario and used different betting strategies to see which is optimal.

5.2 Future Work


Regarding future work, we propose a few directions that, without the time constraints of this project, could have improved the models' performance or yielded better results.

• Implement a mechanism that lets the models take into consideration the player's previous games against the specific team he will face that day. This could be fruitful since a player's style may be effective against the way a particular team plays, or conversely, the player may usually struggle against them.


• In many cases, more data leads to better-trained models, so we could continually web-scrape new bets as they are released day by day throughout the season.

• Experiment with different features in the dataset to see what works best, since certain features may correlate more strongly with the final result. On the data side, our dataset was relatively sparse because the website we scraped only had bets going back to February 2022; as more bets are released every day, they could be used to train the models on larger datasets.

• Test different model architectures and perform more hyperparameter tuning on the models that were used.

5.3 Final Remarks


Many people have attempted to beat the bookmakers on many different types of bets, using different models and approaches. We have not encountered any literature that attempts to predict NBA player points bets using machine learning. However, online sports betting continues to grow day by day, meaning the bookmakers will have to stay sharp, as someone is bound to eventually build a model that can beat theirs on a consistent basis.

In this FYP, we attempted to beat the bookmakers by predicting more than 53.8% of player points bets correctly. Our best model was the transformer, which reached 52.8%, echoing the claim of A. Vaswani et al. in [33] that "attention is all you need". This gap of one percentage point is the difference between a model that will consistently lose money over a long period and one that could be very profitable over time if the bookmakers do nothing about it.

References
[1] L. Egidi, F. Pauli, and N. Torelli, “Combining historical data and bookmakers’ odds
in modelling football scores,” Statistical Modelling, vol. 18, no. 5-6, pp. 436–459,
2018.
[2] Statista, National basketball association total league revenue from 2001/02 to
2021/22, https://www.statista.com/statistics/193467/total-league-
revenue-of-the-nba-since-2005/.
[3] E. W. Packel, Mathematics of Games and Gambling. MAA, 2006, vol. 28.
[4] G. Dotan, Beating the Book: A Machine Learning Approach to Identifying an Edge in
NBA Betting Markets. University of California, Los Angeles, 2020.
[5] D. Cortis, “Expected values and variances in bookmaker payouts: A theoretical
approach towards setting limits on odds,” The Journal of Prediction Markets, vol. 9,
no. 1, pp. 1–14, 2015.
[6] S. R. Clarke, “Adjusting true odds to allow for vigorish,” in Proceedings of the 13th
Australasian Conference on Mathematics and Computers in Sport. R. Stefani and A.
Shembri, Eds, 2016, pp. 111–115.
[7] J. P. Donnelly, “Nfl betting market: Using adjusted statistics to test market
efficiency and build a betting model,” 2013.
[8] O. Hubáček, G. Šourek, and F. Železný, “Exploiting sports-betting market using
machine learning,” International Journal of Forecasting, vol. 35, no. 2,
pp. 783–796, 2019.
[9] Sportsbook reviews online, https://sportsbookreviewsonline.com/
scoresoddsarchives/scoresoddsarchives.htm.
[10] F. Thabtah, L. Zhang, and N. Abdelhamid, “Nba game result prediction using
feature analysis and machine learning,” Annals of Data Science, vol. 6, no. 1,
pp. 103–116, 2019.
[11] Kaggle, https://www.kaggle.com/.
[12] G. Papageorgiou, “Data mining in sports: Daily nba player performance
prediction,” 2022.
[13] C. Young, A. Koo, S. Gandhi, and C. Tech, “Final project: Nba fantasy score
prediction,” 2020.
[14] Stathead, https://stathead.com/.
[15] P. K. Gray and S. F. Gray, “Testing market efficiency: Evidence from the nfl sports
betting market,” The Journal of Finance, vol. 52, no. 4, pp. 1725–1737, 1997.


[16] K. Gimpel, Beating the nfl football point spread, 2006.


[17] A. Bucquet and V. Sarukkai, The bank is open: Ai in sports gambling.
[18] Basketball reference, https://www.basketball-reference.com/.
[19] M. Dumovic and T. Howarth, “Tennis match predictions using neural networks,”
[20] Bettingpros, https://www.bettingpros.com/nba/picks/prop-bets/.
[21] N. Stuffer, Player impact estimate (pie),
https://www.nbastuffer.com/analytics101/player-impact-estimate-pie/.
[22] Statistical analysis primer, https://www.nba.com/thunder/news/stats101.html.
[23] S. Patel, Nba api, https://github.com/swar/nba_api.
[24] M. Köppen, “The curse of dimensionality,” in 5th online world conference on soft
computing in industrial applications (WSC5), vol. 1, 2000, pp. 4–8.
[25] D. Berrar, Cross-validation. 2019.
[26] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
[27] T. Hastie, R. Tibshirani, J. H. Friedman, and J. H. Friedman, The elements of
statistical learning: data mining, inference, and prediction. Springer, 2009, vol. 2.
[28] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv
preprint arXiv:1609.04747, 2016.
[29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv
preprint arXiv:1412.6980, 2014.
[30] L. R. Medsker and L. Jain, “Recurrent neural networks,” Design and Applications,
vol. 5, pp. 64–67, 2001.
[31] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent
neural networks,” in International conference on machine learning, PMLR, 2013,
pp. 1310–1318.
[32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[33] A. Vaswani et al., “Attention is all you need,” Advances in neural information
processing systems, vol. 30, 2017.
