June 2023
Acknowledgements
I would like to express my deepest gratitude to my supervisor, Dr Kristian Guillaumier,
for his invaluable guidance, support, and encouragement throughout my research. His
expertise, dedication, and patience have been instrumental in shaping this FYP. I am
grateful for the time and effort that he put into reviewing my work, providing
constructive feedback, and most of all, for pushing me to reach my full potential.
I would also like to thank my parents for their unyielding love, unwavering faith,
and constant motivation throughout my academic pursuit. Their dedication, guidance
and sacrifices have been a pillar of strength, giving me the bravery to face challenges
and overcome the obstacles that I encountered not only while working on this paper,
but also in life. Their belief in my abilities and their constant presence have been the
driving factor behind my academic growth. I am, and always will be, thankful for their
commitment towards my well-being.
Finally, I want to express my sincere appreciation to my friends and colleagues for
their invaluable contributions to my academic pursuits. Their generosity, insights, and
willingness to share their expertise have been critical in helping me to develop my ideas
and refine my research. I am grateful for their camaraderie, kindness, and humour,
which have helped me maintain my balanced life and keep things in perspective. I am
extremely lucky to have such a supportive and dynamic network of friends and
colleagues, and I will always cherish the memories that we have created together.
Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Glossary of Symbols

1 Introduction
1.1 Motivation
1.2 Aims and Objectives
1.3 Proposed Solution and Results
1.4 Document Structure

3 Methodology
3.1 The Datasets
3.1.1 Player Bets
3.1.2 Player Performances
3.1.3 Preparing the Data
3.2 The Models
3.2.1 Training the Models
3.2.2 Recurrent Neural Network
3.2.3 Long Short-Term Memory Network
3.2.4 Transformer
3.3 Evaluation Strategy
3.3.1 Simulating Betting & Strategies
3.4 Summary

4 Evaluation
4.1 Recurrent Neural Network
4.2 Long Short-Term Memory Network
4.3 Transformer
4.4 Discussion of Results
4.4.1 Betting Strategies

5 Conclusion
5.1 Revisiting Our Aims and Objectives
5.2 Future Work
5.3 Final Remarks
List of Figures
List of Tables
List of Abbreviations
ANN Artificial Neural Network.
1 Introduction
The NBA is the largest basketball league and one of the largest sports associations in
the world [2]. The league consists of 30 teams and around 450 players. Owing to its
popularity, data collection and analytics around teams and players have increased
substantially. Data and analytics have always been driving factors in the world of sports
betting, and with the recent rise of artificial intelligence, they are more advanced than
ever.
A newer type of NBA bet is the player bet, where the bettor places a wager on the
performance of an individual player rather than on whether a team will win the game.
This FYP aims to analyse vast amounts of historical data to predict the number of
points a player will score in an upcoming game. The predictions are then compared to
the bets that the bookmakers have posted, and wagers are placed accordingly.
1.1 Motivation
In recent years, sports betting has exploded in popularity with the rise of online
betting. Betting companies and bookmakers hold all the power when it comes to
setting the odds of an event, and they also take a cut from every bet that is placed,
even when the bettor wins. Creating a model that can rival those used by bookmakers
is a complex task given the high-stakes nature of the betting business; we expect the
bookmakers' models to be close to the state of the art [1]. However, we hypothesise
that the player betting market contains a vulnerability that can be exploited using
machine learning. If we are successful, it will imply that the bookmakers' models for
setting odds on this specific type of bet are not the most profitable possible and could
be improved.
Additionally, we have not encountered any literature that demonstrates success over
an extended period of time for this specific type of betting. Another reason this
problem is interesting is that there is no publicly available dataset of historical player
bets, which is needed to train models to predict these bets. We therefore need to
create a dataset with the required properties ourselves. Alongside it, a second dataset
of historical player performances is needed. This data can be gathered from an API,
though multiple endpoints must be combined to obtain everything required.
The model selection portion of this FYP involves choosing neural networks that fit the
specifications of this problem and are able to process sequences of data. Once the
models are selected, each is hyperparameter-tuned and compared to the others.
Finally, we aim to explore distinct betting strategies, as there are many ways a bettor
can choose to bet. Some bettors prefer to play it safe and stake large amounts on bets
that are likely to hit. Others prefer to place several risky, unlikely bets in the hope that
one of them hits and covers the losses of the others. Each strategy is compared against
the others to see which performs best.
2. Gathering player data on a game-by-game basis. Only the games that have
corresponding bets are needed.
4. Using each model with different betting techniques and comparing profits,
aiming to find the most efficient way to bet.
After each model has been hyperparameter-tuned and tested, we plan to insert it into
a simulated betting scenario: the model starts with a certain balance and bets
depending on its output, either gaining additional balance from winning a bet or losing
the stake. Our goal is also to try different betting strategies with each model to see
which strategy is the most efficient.
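The simulated betting loop described above can be sketched as follows; the bets, the fixed stake, and the `pick_side` decision function are illustrative placeholders, not the actual FYP implementation:

```python
# Minimal sketch of the simulated betting scenario: walk through bets in order,
# staking a fixed amount per bet. All figures below are illustrative.

def simulate_bankroll(bets, pick_side, start_balance=100.0, stake=10.0):
    balance = start_balance
    for bet in bets:
        if balance < stake:            # stop if we can no longer cover a stake
            break
        side = pick_side(bet)          # model decides 'over' or 'under'
        balance -= stake
        if side == bet["winning_side"]:
            balance += stake * bet["odds"][side]   # stake returned times odds
    return balance

# Toy example: two player-points bets with decimal odds.
bets = [
    {"odds": {"over": 1.90, "under": 1.83}, "winning_side": "over"},
    {"odds": {"over": 1.86, "under": 1.95}, "winning_side": "under"},
]
always_over = lambda bet: "over"
final = simulate_bankroll(bets, always_over)
```

Different betting strategies can then be compared by swapping the staking rule or the decision function while keeping the same sequence of bets.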
The results suggest that the transformer performs best, followed by the LSTM and
lastly the RNN. The metric used for performance is the accuracy of correctly predicted
bets, on which the transformer reached 52.8% over a series of player bets. When the
betting strategies were applied, the transformer again had the highest average final
balance, with the other two models following in the same order. Under a different
metric, the peak balance, the results were more promising: the transformer managed
to reach seven times its initial balance in some cases. This indicates that the model is
capable of correctly predicting multiple bets in a row and could potentially be
profitable.
Background: The foundations and critical mathematical concepts behind betting are
discussed. This chapter contains equations and worked examples of betting and
NBA betting to build an understanding of this project.
Implementation: In this chapter, we go over the specifics of how this FYP was
developed. We clearly describe the implementation decisions taken for each
model and the manner in which the results of each model are tested.
Evaluation: The exact evaluation procedures are outlined and the results are
presented, followed by a discussion of the results that highlights the strengths
and weaknesses of each model.
Conclusion: This FYP is concluded by summarising the work that has been done, the
results that were obtained, and proposals for improvements in the future.
2 Background and Literature Review
In this chapter, we discuss the mathematics and nature of player betting in more
detail, and give an overview of academic papers and literature with similar objectives.
2.1 Background
Before explaining the models and the data more in detail, we need an understanding of
betting. In this section, we discuss the most important mathematical concepts of
betting that are used in this FYP. For a more thorough exposition of this topic, we
suggest reading through [3].
The odds tell the bettor that for every €1 staked, they will receive 1 × odds if the bet
hits (meaning they predicted the outcome correctly). In our example, if a bettor placed
€10 on the LA Lakers winning and they do win, the bettor will receive
€10 × 2.15 = €21.50. This means that the bettor gets their stake back in addition to a
further €11.50 as pure profit.
The higher the odds of an outcome, the less likely it is to happen. In the previous
example, the bookmakers thought it more likely that the Chicago Bulls would win (the
favourites) and that the LA Lakers would lose the game (the underdogs).
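As a quick sketch, the return calculation for decimal odds can be expressed as a small helper; the figures mirror the LA Lakers example above:

```python
# Total return and profit from a winning bet at decimal odds.
def payout(stake: float, odds: float) -> tuple[float, float]:
    total = stake * odds        # the stake is returned together with the winnings
    profit = total - stake      # pure profit on top of the returned stake
    return total, profit

total, profit = payout(10, 2.15)   # the LA Lakers example: €21.50, €11.50 profit
```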
Bookmakers release these player bets for most players and most of the different
statistics in the league, but in this project we focus specifically on player points bets.
From the odds of a bet, one can calculate the win probability, usually referred to as
the implied probability [4]. To calculate the implied probability of a bet, we simply take
the inverse of the decimal odds [5]. The following formula demonstrates this:
ImpliedProbability = 1 / Odds    (2.1)
The Vigorish
Now that we can calculate the implied probability of any bet from its odds, we can
calculate the implied probability of both sides of a bet. Since a player bet has only two
outcomes, i.e. the player goes either over or under the midpoint, one might expect the
implied probabilities of the two sides to total 100%. In this example, we use the
aforementioned bet and equation.
Calculating Implied Probability for Lebron James (Points) — Over 26.5 (1.9)
ImpliedProbability = 1 / 1.9 ≈ 0.5263 (52.63%)
Calculating Implied Probability for Lebron James (Points) — Under 26.5 (1.83)
ImpliedProbability = 1 / 1.83 ≈ 0.5464 (54.64%)
If we total these two implied probabilities, we get 1.0728. This value is not consistent
with the laws of probability, since P(A) + P(A^c) = 1, where A is an event and A^c is its
complement.
The implied probabilities derived from the bookmakers' odds are inaccurate because
of the vigorish: the percentage removed from a bettor's winnings by the bookmakers
[6]. In this case, the vigorish of the bet calculated above is approximately 7.28%, the
excess over 100% when tallying the implied probabilities. The overround is the term
for summed implied probabilities greater than 1, which in this example would be
107.28%.
In order to recover the actual win probabilities implied by the bookmakers' models, we
have to remove the vigorish baked into the odds when they are released. Since
probability theory requires all possible outcomes to sum to 1, we simply scale the
implied probabilities so that they do: dividing each implied probability by the
overround yields the actual probability.
ActualProbability = ImpliedProbability / Overround    (2.2)
The following are the calculations of the actual probabilities for the previous bet:

ActualProbability(Over) = 0.5263 / 1.0728 ≈ 0.4906
ActualProbability(Under) = 0.5464 / 1.0728 ≈ 0.5093
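The chain from odds to vig-free probabilities can be sketched in a few lines; the figures mirror the Lebron James example above:

```python
# Implied probability, overround, vigorish, and vig-free ("actual") probabilities
# for the Lebron James points bet used in the text (over 26.5 at 1.9, under at 1.83).

def implied_probability(odds: float) -> float:
    return 1 / odds                                # Equation 2.1

over = implied_probability(1.9)
under = implied_probability(1.83)
overround = over + under                           # ≈ 1.0728
vig = overround - 1                                # ≈ 7.28%
actual_over = over / overround                     # Equation 2.2
actual_under = under / overround                   # the two now sum to exactly 1
```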
It is important to note that these are not the true probabilities of the events, as the
true probability cannot be calculated. They are simply an educated estimate based on
the odds given to us by the bookmakers.
Suppose we place a €10 wager on a 50/50 bet and win; we would expect to double
our money. However, as previously mentioned, bookmakers always take their cut.
Normally, for an even-odds (50/50) bet, the bookmaker takes 10% of the winnings [4].
Therefore, in order to make €10 on a coin flip, one would need to bet €11 (essentially
a bet at odds of 1.91).
To find what percentage of even-odds bets we would have to win on a consistent
basis in order to be profitable, we simply plug the 1.91 odds into the formula for
implied probability [4].
ImpliedProbability = 1 / 1.91 ≈ 0.5236 (52.36%)
This indicates that for a bettor to break even on their even-odds bets, they would
have to win 52.4% of the time to overcome the bookmakers' vigorish. However, with
player betting the bookmakers tend to take slightly more vigorish, meaning that an
even-odds player bet is typically priced not at 1.91 but at 1.86. With this in mind, we
can calculate the new break-even percentage using the same calculation.
ImpliedProbability = 1 / 1.86 ≈ 0.5376 (53.76%)
Thus, to beat the bookmakers and make a profit over an extended period of time in
player betting, we need to reach an accuracy of 53.8%.
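The break-even thresholds above follow directly from Equation 2.1, since the implied probability is exactly the win rate at which stakes and winnings cancel out. A minimal sketch:

```python
# Break-even win rate needed to overcome the vigorish at a given even-odds price:
# the implied probability of the quoted odds.
def break_even_accuracy(odds: float) -> float:
    return 1 / odds

moneyline = break_even_accuracy(1.91)    # ≈ 0.5236, i.e. ~52.4%
player_bet = break_even_accuracy(1.86)   # ≈ 0.5376, i.e. ~53.8%
```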
The Balanced Book Hypothesis states that bookmakers aim to split the money
wagered on both sides of any even-odds bet equally [7]. In this scenario it is
impossible for the bookmakers to lose money: whichever side wins, they collect the
stakes of the losing bettors and take their vigorish from the winners as a cut.
In the case of player betting, bookmakers aim to achieve a balanced book by adjusting
the midpoint of their player bets. For instance, in the previously mentioned Lebron
James bet, if the midpoint had been lower than 26.5, chances are that more people
would have bet the over. Whenever there is more money on one side, the bookmaker
is vulnerable to losing money [7].
In Hubáček et al. [8], the goal was to predict moneyline NBA bets. If a team is clearly
much better than its opponent, the odds will be more lopsided. For instance, suppose
the Chicago Bulls are playing the LA Lakers, and the Bulls have been playing better
and hold the better record this season. Their odds might then be set at 1.38, making
them the favourites, with the LA Lakers at 3.15, implying that they are the underdogs.
The data they gathered consisted of various quantitative measures from both teams'
matches that season. When predicting the outcome of a game, only the games played
before that game's date are used. They used a logistic regression model as a baseline
and then developed two variants of a neural network: the first a standard
feed-forward network, the second using a convolutional layer in tandem with three
dense layers. The convolutional layer is specifically designed to deal with
player-related data due to the high number of features.
The authors also investigated whether it is beneficial to exclude the bookmakers' odds
from the models. If the odds are included as a feature, the predictions are more likely
to resemble the bookmakers'. They therefore excluded the odds in order to
decorrelate their models from those of the bookmakers.
Testing was done over 9,093 games played from 2006 to 2014. From their results, it is
clear that including the bookmakers' odds in the models is to the bettor's detriment.
The convolutional neural network achieved higher accuracy and profit than the other
models. Overall, the neural models were slightly less accurate than the bookmakers,
whilst the logistic regression baseline was the least accurate.
Dotan [4] also aims to beat the moneyline market for NBA bets. The data collected
comprised game-by-game statistics for each team in the league, available through the
R package nbastatR. The historical odds for moneyline bets on each game were found
at the publicly available online resource [9]. The statistics and odds cover the NBA
seasons from 2007 to 2020.
An interesting aspect of this paper is that each statistic was scaled to its respective
season. Every NBA season is different: basketball is not always played the same way
and the sport evolves over the years. In some seasons the pace of the game is much
faster, meaning both teams have more possessions since the average duration of a
possession goes down. To account for such intricacies, Dotan standardised each team
statistic to a per-100-possessions basis to avoid this "phenomenon of statistical
inflation" [4].
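Dotan's per-100-possessions scaling can be sketched as follows; the possession counts are illustrative and not taken from the paper:

```python
# Per-100-possessions scaling: raw team totals are rescaled by the team's
# possessions so that seasons with different pace become comparable.
def per_100(stat_total: float, possessions: float) -> float:
    return stat_total / possessions * 100

# Two teams scoring the same raw points at different paces are not equivalent:
fast = per_100(110, 105)   # fast-paced team, fewer points per possession
slow = per_100(110, 95)    # slow-paced team, more points per possession
```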
The testing phase was carried out on held-out games from the same 2007 to 2020
span to keep the environment as similar and fair as possible. The author used logistic
regression, random forest, XGBoost, and an ANN. All models performed similarly,
within the bracket of 61.6% to 65.9%, with the XGBoost and logistic regression
models at the top. It is important to note that moneyline bets are not even-odds bets,
so there is usually a favourite in a game. The accuracy will therefore always be higher
than that of a model predicting an even-odds bet, since it is always safer to bet on the
favourite.
Thabtah et al. [10] also set out to predict the outcome of NBA matches, attempting to
anticipate the results of NBA finals games from 1980 to 2017. The data used for their
project was gathered from [11] and included game-by-game team totals over the same
span. Their dataset comprised 22 features, each representing a different aspect of
each team.
Apart from testing their hypothesis with different models, the authors also opted to
test different datasets. They used two filter methods and one rule-induction algorithm
to select different combinations of attributes: multiple regression, correlation feature
set, and the RIPPER algorithm. These datasets were used with the following models:
an ANN, naive Bayes, and a logistic model tree (LMT).
Among all the datasets considered, the one chosen using the RIPPER algorithm
performed best, as all three models achieved their best accuracy score with that
combination of features. When using the full dataset with all the features, the models
tended to drop in accuracy; the authors attributed this drop to irrelevant features that
only make the models more complex without adding useful information to the
prediction. On average, the best-performing model was the LMT, followed by naive
Bayes and lastly the ANN. Although the ANN achieved the lowest average accuracy, it
had the highest accuracy when the dataset retained all its features, including the
irrelevant ones. This could hint that this type of machine learning model is more
robust to features that carry less information than others.
Papageorgiou [12] set out to predict fantasy points for each player on a daily basis. To
complete the task they proposed two datasets, one for individual player statistics and
one for team statistics. Both datasets were gathered by scraping the nba.com website,
with the data organised game by game for players and teams alike, spanning the 2011
to 2021 seasons. The player dataset had 106 columns, each representing a different
statistic, whilst the team dataset had 100 columns. They also decided on two
combinations of features for training their models, to test which dataset would
perform better: one held only basic features whilst the other also included advanced
statistics. This tested whether more features would produce a more accurate result, or
whether the increased complexity would instead be to the model's detriment.
For the model implementation, the author used PyCaret, an open-source machine
learning library for Python. With this library, many different models can be applied
easily, since the workflows are automated and little code needs to be written.
Papageorgiou then proposed using each model that the library offers and testing
which works best in this situation. An interesting design choice in this project is that a
separate model was built and trained for each individual player, as opposed to a single
model shared across all players.
When using the advanced-features dataset, the random forest regressor achieved the
highest accuracy for the largest number of players, whilst on the basic-features
dataset it was the voting regressor that outshone its competitors. Moving forward
with these models, the author tested them using the mean absolute percentage error
(MAPE), obtaining 30.7% MAPE for the advanced dataset and 31.1% for the basic
dataset, with everything evaluated on an unseen dataset to keep the comparison fair.
After some optimisations using a single dataset, the author managed to reduce the
MAPE to just 25.9% on the test set.
The main objective of [13] was similar to the previous paper: to forecast the number
of fantasy points that a player will achieve in an upcoming game. However, they
investigated a slightly different approach. Instead of the models outputting a single
number for the fantasy points, they built a model for each base statistic used in the
fantasy-points equation, then fed those outputs into the equation to calculate the
final score.
The data used to train their models was gathered by web scraping player game stat
lines from [14]. An interesting design choice by the authors is that they used rolling
averages of each individual player statistic over 3, 5, and 10 game windows, i.e. a
feature for a player's statistic over each of their last 3, 5, and 10 games. This tripled
the number of continuous features but could provide more information to the models.
The data collected spans 2014 to 2020.
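The rolling-window feature construction can be sketched with pandas; the column name and points values here are illustrative:

```python
# Rolling 3/5/10-game averages of a single player statistic, mirroring the
# feature construction in [13]. Values are illustrative.
import pandas as pd

games = pd.DataFrame({"pts": [22, 31, 18, 27, 25, 30, 19, 24]})
for w in (3, 5, 10):
    # shift(1) ensures a game's features only use *previous* games
    games[f"pts_avg_{w}"] = games["pts"].shift(1).rolling(w, min_periods=1).mean()
```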
Their model selection criteria prioritised models that perform well on highly
structured data and that are capable of interpreting feature importance. With these
criteria in mind, the authors used random forest and XGBoost, with a random forest
model that simply predicts the fantasy score directly serving as a benchmark. The
ensemble of per-statistic random forest models outperformed the single direct model,
with a mean squared error (MSE) of 93.18 on the test set as opposed to 94.61.
However, after hyperparameter tuning all the models, the XGBoost model came out
on top with a final MSE of 92.27 on the test set.
Gimpel [16] presents a novel approach to beating the bookmakers in NFL spread bets.
The author extracted 230 distinct features covering the 1992 to 2001 seasons. The
novelty of the approach lies in the feature selection portion of the project: a
randomised feature search. Random features are selected and a logistic regression
model is trained on that temporary dataset; if its accuracy is higher than that of the
previous temporary dataset, the subset is kept, otherwise a new random subset is
tried. The logistic regression model was selected due to its speed.
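The randomised feature search can be sketched as follows; `score_subset` is a stand-in for training the fast model (e.g. logistic regression) and measuring validation accuracy, and the toy scorer is purely illustrative:

```python
# Randomised feature search in the spirit of Gimpel [16]: repeatedly sample a
# random feature subset, score it, and keep it only if it beats the best so far.
import random

def random_feature_search(features, score_subset, iterations=100, seed=0):
    rng = random.Random(seed)
    best_subset, best_score = [], float("-inf")
    for _ in range(iterations):
        k = rng.randint(1, len(features))
        subset = rng.sample(features, k)        # random temporary dataset
        score = score_subset(subset)
        if score > best_score:                  # keep only improving subsets
            best_subset, best_score = subset, score
    return best_subset, best_score

# Toy scorer: pretend only 'a' and 'c' are informative, with a size penalty.
score = lambda s: sum(f in ("a", "c") for f in s) - 0.1 * len(s)
subset, best = random_feature_search(["a", "b", "c", "d"], score)
```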
The focus of this project was more on the dataset than on the models; however, two
models were tested using the final dataset: a logistic regression model and a support
vector machine (SVM). The former achieved 54.07% accuracy and the latter 50.76%
over two seasons of NFL games.
The authors in [19] aim to predict the outcome of tennis matches using neural
networks and sequential models. The data for this task was available from the
Association of Tennis Professionals (ATP), which holds the raw statistics of most
tennis matches played over the past decades. However, since some of the earlier
matches often had incomplete data, the authors opted to use only matches played
after the year 2000. To improve the quality of the dataset, only matches in which both
players had played at least 25 matches were considered to have enough data to train
the models on. Where possible, the models use a player's previous 50 matches to
predict the outcome more accurately; 25 is the minimum number of games, and if a
player did not have 50 played games, the algorithm pads the rest of the feature vector
with zeros.
When training the models, the authors trained a feed-forward network on two
datasets: one consisting of the players' averages over their past 50 games, and
another which concatenated the players' statistics from their last 50 games.
As a baseline for all their tests, they simply chose the winner according to which
player had more ATP ranking points (i.e. whichever player was ranked higher). This
naive approach achieved 65.6% accuracy. The feed-forward network achieved 64.4%
on the averaged match statistics and a mere 53.2% on the concatenated dataset.
Lastly, the authors used an LSTM, which impressively beat the naive approach with an
accuracy of 69.6%.
2.3 Summary
In this chapter, we discussed the key mathematics and principles behind betting, and
player betting specifically. The bookmakers' aim of keeping both sides of a bet as
balanced as possible is one of the main principles on which player betting rests.
Moreover, we covered literature that provided ideas and inspiration for this FYP
throughout the development process.

From this literature we gathered several important takeaways. Firstly, [8] showed us
the importance of excluding the bookmakers' odds so as to decorrelate as much as
possible from their models: using their odds does help to get closer to their results,
but prevents our models from ever outperforming theirs. The NBA fantasy papers
showed effective ways to predict a player's statistics from their previous
performances. Finally, Bucquet and Sarukkai [17] showed that sequential models work
best for this type of task.
3 Methodology
This chapter is divided into three main sections. The first explains how the two
datasets were created, the second the designs of the three models used, and the last
the evaluation design we propose for these models.
OFF RATING (Offensive Rating): The team points scored per 100 possessions while
the player is on the court.
TS PCT (True Shooting Percentage): A shooting percentage that factors in the value of
three-point field goals and free throws in addition to conventional two-point field
goals.
FGM (Field Goals Made): The number of shots made by a player in a game.
We stored all the bets in a dataframe. We then looped through the bets one at a time,
extracting the player name and the date of the bet. Once we had the name and the
date, we queried our player performance database for the chosen statistics from the
seven games the player played prior to that date. Each game performance is
represented as a vector of nine normalised numerical values, one per statistic. The
historical performances for one instance therefore form a 2D array of shape (7, 9). If a
player does not have seven games played prior to the bet, we pad the 2D array with
the required zero vectors.
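The construction of one input instance can be sketched with NumPy; the function name and the example values are illustrative:

```python
# Building one model input as described above: the player's last seven game
# vectors (nine normalised statistics each), zero-padded when fewer games exist.
import numpy as np

def build_instance(performances, n_games=7, n_stats=9):
    """performances: list of per-game stat vectors, ordered oldest to newest."""
    window = np.array(performances[-n_games:], dtype=float).reshape(-1, n_stats)
    pad = n_games - len(window)
    if pad > 0:                      # pad the missing games with zero vectors
        window = np.vstack([np.zeros((pad, n_stats)), window])
    return window                    # shape (7, 9)

x = build_instance([[0.5] * 9] * 4)  # a player with only four prior games
```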
Seven previous performances were chosen as we felt this was a good representation
of a player's current form. If we decreased the number of games, to say three, the
models might become overly sensitive to changes in the player's performance and
capture more volatility. On the other hand, if we increased this number, the models
might adjust poorly to a player's recent changes in form, since a longer period of time
would be considered.
In some instances, the player name in the bet database did not match any name in the
player performance database, since the two came from different sources. When this
happened, we simply discarded that bet from the final dataset. The aforementioned
14,124 instances form the final dataset and already exclude these unmatchable
instances.
The second part of the instance was acquiring the number of points the player scored
on the day of the bet. This part was simpler: we queried the performance database
with the player name and game date and retrieved the points. Once both parts of the
instance were assembled, we stored them as a tuple.
We argue that having just over 2,000 bets to test on will still yield an accurate result,
since in reality a typical bettor would take years to place this many bets. Regarding
the validation set, we used K-Fold Cross Validation (KCV), so there is no fixed
validation set and it changes with every fold. This approach was chosen because,
during development, we encountered cases where the accuracy on the validation set
was high, yet after training finished the accuracy on the test set was much lower. This
technique also reduces the risk of over-fitting, since the training set is not constant
[25].
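A minimal sketch of generating the K-fold splits, assuming the instances are indexed into a NumPy array; the fold count and seed are illustrative:

```python
# K-fold cross-validation over sample indices: each fold serves once as the
# validation set while the remaining folds form the training set.
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)            # shuffle once, then partition
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

splits = list(kfold_indices(100, k=5))
```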
Regarding the loss function and optimiser, we decided to use MSE and the Adam
optimiser respectively. MSE was chosen because it is a differentiable function [26]
commonly used for regression tasks [27] and is cheap to compute, which speeds up
runtime and allows us to test more hyperparameter configurations. The Adam
optimiser was selected mainly for its speed of convergence [28] compared to other
optimisation algorithms, thanks to its adaptive learning rate [29]. Moreover, it
computes a learning rate for each individual parameter, which allows the optimiser to
adapt to gradients of different scales.
After the KCV split, the data is divided into batches before training starts. This
implementation choice was made mainly for its robustness: the model adapts better to
changes in the distribution of the data, making it more adaptable and less prone to
overfitting [27]. Furthermore, this technique can also improve training speed, since the
gradients are computed more efficiently, which leads to the model updating more
frequently and converging faster [26]. To cater for each model individually, we
developed the training loop with a modular approach that allowed us to use different
values for all the techniques and hyperparameters above, to squeeze out every last bit
of performance that we could.
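As a rough sketch of this splitting scheme (the sample count, k, and batch size below are illustrative only, since the exact values differ per model), the fold rotation and batching can be expressed as:

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index lists; the validation fold
    changes on every loop, so no single split is ever fixed."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold = n_samples // k
    for i in range(k):
        val = idx[i * fold:(i + 1) * fold]
        train = idx[:i * fold] + idx[(i + 1) * fold:]
        yield train, val

def batches(indices, batch_size):
    """Split the training indices of a fold into mini-batches."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]

folds = list(k_fold_indices(100, k=5))
first_train, first_val = folds[0]
```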
During training, the model is fed one batch at a time; for each batch, the accuracy is
calculated and the model is then optimised with backpropagation, in that order. The
accuracy represents how well the model is performing by measuring how many bets it
gets right as a percentage of the total number of bets. This can be represented with
the following formula:
Accuracy = (CorrectAnswers / TotalNumberOfBets) × 100        (3.1)
This equation is the main metric used to measure the performance of the models,
since it represents the reality of the task at hand as closely as possible. At the end of
each epoch (one cycle through the training data), we calculate the accuracy of the
model on the current validation set (which changes with every fold), and if the accuracy
of the current model is the highest it has ever been, we save it. With this technique,
when the model is done training, we will have saved the best-performing model of the
whole training cycle, since the models sometimes tend to overfit towards the end of
training and it is extremely difficult to predict when overfitting will start. The models
are then tested, using this best-performing checkpoint, on the test set, which is
completely unseen by the models.
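Equation 3.1 and the best-model saving rule can be sketched as follows; the per-epoch validation accuracies are made-up numbers purely for illustration:

```python
def accuracy(correct_answers, total_number_of_bets):
    """Equation 3.1: correct bets as a percentage of all bets."""
    return correct_answers / total_number_of_bets * 100

best_accuracy = 0.0
best_epoch = None

# Hypothetical validation accuracies per epoch (the validation set
# itself changes with every fold).
for epoch, val_acc in enumerate([48.0, 51.5, 53.0, 52.1]):
    # Keep the best checkpoint seen so far, so late-training
    # overfitting cannot overwrite a better earlier model.
    if val_acc > best_accuracy:
        best_accuracy = val_acc
        best_epoch = epoch  # in practice: save the model's weights here
```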
step to the next, which can make the RNN struggle to maintain a long-term
dependency due to the vanishing gradient problem. LSTMs address this problem by
introducing a more complex architecture that has supplementary components in the
form of memory cells and gating mechanisms [32]. This allows the network to selectively
store or forget specific pieces of information at each time step, making it more
powerful at handling long-term dependencies. However, this increased complexity in the
model architecture comes at the cost of greater computational expense, as well as
potentially greater difficulty in understanding the model’s behaviour.
When designing this model, we opted to keep the same hidden layer size, since
going above or below 1,024 yielded the same consequences as with the RNN. This
model differs from the previous one in that it has one fewer hidden layer, to reduce
the chance of overfitting; to make up for this loss, we added an extra layer to the
LSTM in the hope that the model would learn more complex patterns in the data.
Seeing as this model takes more time to train due to its computational complexity, the
batch size was increased at the expense of a higher probability of overfitting. In spite
of that, we took many measures to ensure that the model does not overfit, and we
believe this trade-off in training time is worth it, since we explored more
hyperparameter configurations with the time saved.
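A minimal PyTorch sketch of the LSTM just described, assuming a hypothetical input feature size (the text specifies the 1,024 hidden size and the extra stacked LSTM layer, but not the input or fully-connected dimensions):

```python
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, n_features=20, hidden_size=1024):
        super().__init__()
        # num_layers=2: the extra LSTM layer added to compensate for
        # having one fewer fully-connected hidden layer.
        self.lstm = nn.LSTM(n_features, hidden_size,
                            num_layers=2, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, 128)
        self.fc2 = nn.Linear(128, 1)  # predicted points

    def forward(self, x):              # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.fc2(torch.relu(self.fc1(out[:, -1, :])))

pred = LSTMModel()(torch.randn(4, 7, 20))  # e.g. 7 historical games
```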
This model was trained on half the epochs of its predecessor, for two reasons.
Firstly, we were considering the increase in training time given its more complex
architecture; secondly, it does not need as many epochs, since the more powerful
model learns the patterns and converges quicker. Essentially, when the model was
trained for more epochs, it stabilised and did not learn anything new. With the
decrease in epochs, the learning rate was also raised, to help the model converge
faster whilst reducing the risk of getting stuck in a local minimum. The k value in KCV
was kept the same, for the same reasons as the previous model.
3.2.4 Transformer
A transformer is a type of neural network architecture that has gained a significant
level of popularity in recent years, particularly in the natural language processing (NLP)
domain. This can be attributed to transformers’ unique ability to process the input
data as a whole and use self-attention mechanisms to express relationships between
different parts of the input [33], which increases the power of such a model. The
model uses these self-attention mechanisms by assigning weights to all parts of the
input, giving higher weights to the parts that are more relevant to the problem at
hand. This is in contrast to the previous two models, which rely on sequential
processing of the data. The increase in power also comes with the
detriment of an even more complex architecture along with longer training times. Even
though these types of models are primarily used for NLP tasks, they can be applied to
sequential data problems such as this one and still have significant advantages over
other traditional model architectures.
The layers of this model are slightly different to those of its counterparts, since it
starts off with an embedding layer that maps the input data into a higher-dimensional
space that can be more easily processed by the model [33]. Subsequently, the encoder
layer is responsible for processing the input data using the aforementioned
self-attention mechanisms. Specifically, this layer accommodates the multi-head
self-attention mechanism, which allows the model to attend to several parts of the
input concurrently and thereby capture the global context of the whole input [33]. The
encoder layer also includes a standard feed-forward neural network, which processes
the output of the self-attention layer to produce a new representation of the input.
Finally, the encoder layer incorporates layer normalisation, which stabilises training by
normalising the output. The last distinct component of this transformer is the encoder
itself, which stacks a number of encoder layers, the count being a predetermined
hyperparameter. Stacking multiple of these layers allows the model to capture more
complex relationships between the input data and the output. After all the distinct
layers, there are three fully-connected linear layers, similar to the LSTM previously.
Due to the complexity of this model’s architecture, we could afford to increase the
hidden layer size without the model overfitting on the training data. When creating the
encoder layer, we went with eight parallel self-attention heads. This number was
chosen after testing, as a balance between model capacity on the one hand and
increased training time and risk of overfitting on the other. After the layer itself was
created, we built the encoder from it. We used four of these layers in the encoder, as
we felt this greatly improved performance without the model overfitting. However, it
did increase training time by a substantial amount.
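The transformer just described can be sketched with PyTorch's built-in encoder modules; the eight heads and four encoder layers come from the text, while the embedding width, feature count, and fully-connected sizes are placeholder assumptions:

```python
import torch
import torch.nn as nn

class TransformerModel(nn.Module):
    def __init__(self, n_features=20, d_model=256):
        super().__init__()
        # Embedding layer: maps the input into a higher-dimensional space.
        self.embed = nn.Linear(n_features, d_model)
        # Encoder layer: multi-head self-attention + feed-forward network
        # + layer normalisation, as described above.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        # Encoder: a stack of four encoder layers.
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Three fully-connected linear layers, as in the LSTM.
        self.fc1 = nn.Linear(d_model, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, x):              # x: (batch, seq_len, n_features)
        h = self.encoder(self.embed(x)).mean(dim=1)  # pool over sequence
        return self.fc3(torch.relu(self.fc2(torch.relu(self.fc1(h)))))

pred = TransformerModel()(torch.randn(4, 7, 20))
```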
In view of the fact that this model takes much longer to train than the other two,
we had to make sacrifices in other hyperparameters to cut this time down as much as
we could without sacrificing too much performance. For instance, the batch size was
double that of the LSTM, and the k in KCV was reduced slightly, which might have
affected the accuracy of the model; overall, however, we believe that these forfeitures
in other parts of the model are worth the great increase in performance from the
encoder layers.
All these models were developed in Python using the PyTorch library. The
hardware used to train them was an AMD Ryzen 5 3600 CPU, a Radeon RX 580 GPU,
and 16GB of RAM running at 1600MHz. On average, the RNN took around two hours
to train, whilst the LSTM took 12 hours, and the transformer took the longest of the
three.

3.3 Simulating Betting
that will look at the relative difference between a model’s prediction for a bet and the
mid-point given out by the bookmakers, since, theoretically, if the prediction is much
higher or lower than the mid-point, the model is essentially more “confident” of that
bet hitting than of a bet whose predicted points are only slightly higher or lower than
the mid-point. It is important that we use the relative difference when determining
which bets to select. If we look closely at the following scenario of two bets, we can
see the importance of using the relative difference as opposed to just the outright
difference.
Bet   Mid-point   Prediction
1     10.5        19
2     25.5        35
From the above bets, we can see that the difference for the first bet is 8.5 and for
the second bet is 9.5. Even though the difference for the second bet is greater than
that of the first, the difference relative to the mid-point is much smaller, since the
mid-point of the first bet is far lower than that of the second: 8.5/10.5 ≈ 0.81 against
9.5/25.5 ≈ 0.37. Thus, in this case, if the model had to choose which of these two bets
to bet on, the behaviour of this betting algorithm would make it choose the first one.
This system uses the following equation to determine the level of confidence
(relative difference) in a bet going through:

RelativeDifference = |Prediction − Midpoint| / Midpoint        (3.2)

The absolute value in this equation caters for the models’ predictions, which are
sometimes under the mid-point instead of over it.
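Equation 3.2 and the selection of the most confident bet can be sketched as follows, using the two example bets from the text:

```python
def relative_difference(prediction, midpoint):
    """Equation 3.2: the absolute difference between prediction and
    mid-point, relative to the mid-point."""
    return abs(prediction - midpoint) / midpoint

# (mid-point, prediction) for the two example bets above.
bets = [(10.5, 19), (25.5, 35)]
confidences = [relative_difference(p, m) for m, p in bets]
# Bet 1: 8.5 / 10.5 ≈ 0.81; bet 2: 9.5 / 25.5 ≈ 0.37, so bet 1 is picked.
most_confident = max(range(len(bets)), key=lambda i: confidences[i])
```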
Subsequently, the relative difference of each bet in the chunk is calculated, and the
model selects the bets with the highest relative difference. The number of bets chosen
depends on a parameter that is altered multiple times for testing purposes. Once the
selected bets are placed (the model choosing over or under), the actual number of
points that the player scored is compared against them. The amount of virtual money
that the model places as the stake for each bet is known as a unit, and is also
determined by a parameter that is changed for testing. The unit size is dynamic and
always changing, since it is a percentage of the balance. If the model’s prediction was
correct, it receives its stake back as well as additional profit; if the prediction was
wrong, it loses the stake it placed. The odds for each bet were set to
1.86 since this is the average even-odds bet for player betting. For example, if the
model was correct for a bet it would receive unit × 1.86 back.
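The staking and payout rule can be sketched as follows, using the 1.86 odds and the percentage-of-balance unit size from the text (the helper name is ours):

```python
ODDS = 1.86  # average even-odds price for player bets

def settle_bet(balance, unit_fraction, won):
    """Stake `unit_fraction` of the current balance on one bet.
    A win returns the stake times the odds; a loss forfeits the stake."""
    stake = balance * unit_fraction  # dynamic unit: a % of the balance
    return balance - stake + (stake * ODDS if won else 0.0)

win_balance = settle_bet(10000.0, 0.05, won=True)    # 9500 + 500 * 1.86
loss_balance = settle_bet(10000.0, 0.05, won=False)  # 9500
```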
Betting Strategies
Every bettor is unique and uses a different method of betting in the hope of turning
a profit. Some bettors prefer to play it safe on low-odds bets and stake higher
amounts, at the risk of this safe event not happening and losing their whole stake.
Other bettors place smaller amounts of money on higher-odds bets, thinking that one
of them will hit and result in a bigger payout. This form of evaluation was inspired by
[8], as in their paper they also used different betting strategies to display which
strategy was the most optimal.
Thus, we set out to test which type of betting is the most profitable. To do this, we
used two parameters, which we varied while checking the results: the number of bets
to bet on, and the unit size for each bet. Using different configurations of these two
parameters allowed us to check which type of betting is most profitable for a bettor.
The configurations that we used for these tests are the following:
Configuration   Number of Bets   Unit Size
1               1                20%
2               3                10%
3               5                5%
4               10               2%
3.4 Summary
In this chapter, we went over how the two datasets that formed the foundation for
training the models were created. In addition, we described the architecture of all
three models, along with the hyperparameter choices we made when training them.
We also went over the design decisions in the training loop that reduce certain factors
such as overfitting. Finally, the procedure for testing the models, along with the
metrics used, was explained. Moreover, we discussed how we intend to simulate a real
betting scenario, using different betting strategies, to try and yield the best results
possible from the models.
4 Evaluation
One of the main objectives of this FYP was to compare different models and test their
performance in a real-life scenario of betting on NBA player bets. In this chapter, we
determine which models achieved the best scores and discuss why they performed the
way they did.
An important note regarding the baseline comparisons in this section is that since we
encountered no literature that has tried our approach of predicting these specific types
of bets, the only sure comparison that we can make is to the 53.8% accuracy that the
models need to achieve to be profitable over the long term against the bookmakers.
Prediction \ Actual   Over   Under
Over                   559     562
Under                  484     514
We speculate that a potential reason why this model did not perform very well is
the vanishing gradient problem. Even though we implemented measures when
designing the model, such as keeping the number of layers to a minimum, it is still
possible that this architecture suffered from the issue to which it is so vulnerable.
Another likely cause of this model’s lack of performance is its difficulty in capturing
long-term dependencies. In spite of the fact that RNNs are able to process sequential
data, they can struggle to capture these long-term dependencies, because the network
naturally loses the impact of the earlier inputs, which leads to more errors in the
prediction. The dataset characteristics might also have been
not ideal for this type of model to train on. For instance, it is possible that seven
historical games of a player were not enough for the model to properly predict the
player’s points for an upcoming game.
The following graphs show the balance over time when this model was simulated
betting with the different strategies:

[Four graphs: RNN balance over time using betting strategies 1–4]

From the graphs, we can see that the RNN model had a run of hits between roughly
day 10 and day 20. Ultimately, with the accuracy of the bets being this low, the model
will never make a profit long-term, even when it guesses more than half the bets
correctly, as this is still short of the 53.8% needed to beat the bookmakers.
[Confusion matrix of the LSTM model’s predictions (Over/Under) against the actual outcomes]
This jump in accuracy can be attributed to this model architecture’s handling of the
vanishing gradient problem, which is particularly important to mitigate in time-series
prediction problems such as this one. The architecture of this model not only solves
the prior issue but also caters for long-term dependencies, which could also play a part
in its increased performance. However, even this increase in performance could not
overcome the vigorish taken by the bookmakers, as shown in the following four graphs
of the model facing a simulation of real-life betting:
[Four graphs: LSTM balance over time using betting strategies 1–4]
Regardless of betting strategy, a common theme in the balance over time for the
LSTM is that, compared to the RNN, it takes longer to reach the plateau at the bottom
of the graph (which represents losing almost all money). This can be attributed to this
model’s higher accuracy, since the higher the accuracy the
more bets it is able to predict correctly. From the graphs, we can also notice that this
model had a good run of bets towards the latter part of the period, since there is a
spike in balance for all the strategies.
4.3 Transformer
Even though the transformer was introduced specifically for NLP tasks, it has proven
effective for other sequential and regression tasks, such as this one. An interesting
point for this model is that its accuracy was 47.2%, which is far behind the other two
models. In spite of that, we decided to negate this model’s predictions and do the
opposite of what it predicts. In this way, the accuracy becomes 52.8%, making it the
best-performing model. A possible reason why this model learnt the negation is that it
managed to find a pattern that the other two models were not able to see. We also
speculate that the inductive bias of this model architecture tends to favour the
negated predictions on this task. The confusion matrix below represents the negated
predictions made by this model architecture:
Prediction \ Actual   Over   Under
Over                   697     424
Under                  577     421
This model could have performed the way it did because, unlike RNNs and LSTMs,
this type of model architecture does not employ recurrence or hidden states. Instead,
it makes use of a self-attention process to identify connections among the sequence’s
various pieces. As a result, the model may learn which pieces of the sequence are most
important for making predictions, and can more easily capture long-term dependencies.
The following graphs show how this model performed under a real-life betting scenario:
This model seemed to perform better with the latter two strategies: the first two
strategies ended with almost everything lost, whilst with strategies three and four it
managed to keep the balance above €3,000. This is still a major loss compared to the
€10,000 starting point, but still better than the other two models. The transformer
seemed to have a good run of bets around the 100-day mark, but went swiftly
downhill after that.
Table 4.4 Table of betting strategy results with different models (in Euros).
From the table, it is clear that the transformer had the highest average final balance
of all the models. Not only does this concur with the accuracy results, since it also had
the highest accuracy, but it also shows that the model is the most effective in a real
betting scenario. It is also of note that the only model to learn the negation function
still performed the best.
One might look at the highest accuracy achieved in this project (52.8%) and think
that it is not even close to beating the bookmakers. However, in even-odds bets the
margins are always extremely tight, due to the vigorish that is taken by the bookmakers.
Table 4.5 Table of peaks of betting strategy results with different models (in Euros).
From this table, it is clear that the transformer had the best peaks, owing to its
performance advantage over the other two models. Since the strategies decrease in
riskiness from one to four, it makes sense that although the first strategy was the
worst performing by final balance, it reached the highest peaks on average, with the
transformer going over seven times the starting balance at one point before losing
most of it. Thus, if a bettor reached those heights of balance and was smart enough to
withdraw most or some of it from their account, they could have made a large profit
using this model. However, it could have been a spell of good luck, as it could easily
have gone the other way, losing almost all their money immediately.
Impressively, the transformer reached a peak of over seven times the starting
balance when using the first betting strategy. This shows that although the models
were not profitable over a long period of time, there was still potential for them to go
on a long string of hits and obtain a big profit; this is especially true for the
transformer. It also makes sense that the first betting strategy had the highest peaks,
since its unit size is the largest of all the strategies: if the models hit many bets
consecutively, the balance increases significantly.
5 Conclusion
In this FYP, we developed several models that predict NBA player points bets. Even
though we did not manage to create a model that is profitable over an extended
period of time, the transformer model managed to reach seven times the initial starting
balance through a sequence of hits on its predictions. After creating the datasets
needed for this task, we developed the models and tuned them according to the data
and their performance. Once all the models were trained, we compared their results
and discussed the differences between them. We also used the models in a more
realistic betting scenario, which gave each model a starting balance and let it bet on
bets that it had not seen before. This effort was made to check whether the models
would be capable of reaching a profit.
1. A database for historic player points bets was successfully web-scraped and
stored.
4. We developed a betting simulation that put the models to the test in a realistic
betting scenario and used different betting strategies to see which is optimal.
• In many cases, the more data there is, the better trained the models will be; thus,
we can continually web-scrape more bets as they come out day by day throughout
the season.
• Experimenting with different features in the dataset to see what works best, since
certain features can correlate with the final result. Regarding the data, we only
had a sparse dataset, since the website we web-scraped only listed bets going
back to February 2022. Thus, since more bets come out each and every day, we
could use them to further train our models on larger datasets.
• Testing different models and more hyperparameter tuning for the models that
were used.
References
[1] L. Egidi, F. Pauli, and N. Torelli, “Combining historical data and bookmakers’ odds
in modelling football scores,” Statistical Modelling, vol. 18, no. 5-6, pp. 436–459,
2018.
[2] Statista, National basketball association total league revenue from 2001/02 to
2021/22, https://www.statista.com/statistics/193467/total-league-
revenue-of-the-nba-since-2005/.
[3] E. W. Packel, Mathematics of Games and Gambling. MAA, 2006, vol. 28.
[4] G. Dotan, Beating the Book: A Machine Learning Approach to Identifying an Edge in
NBA Betting Markets. University of California, Los Angeles, 2020.
[5] D. Cortis, “Expected values and variances in bookmaker payouts: A theoretical
approach towards setting limits on odds,” The Journal of Prediction Markets, vol. 9,
no. 1, pp. 1–14, 2015.
[6] S. R. Clarke, “Adjusting true odds to allow for vigorish,” in Proceedings of the 13th
Australasian Conference on Mathematics and Computers in Sport. R. Stefani and A.
Shembri, Eds, 2016, pp. 111–115.
[7] J. P. Donnelly, “NFL betting market: Using adjusted statistics to test market
efficiency and build a betting model,” 2013.
[8] O. Hubáček, G. Šourek, and F. Železný, “Exploiting sports-betting market using
machine learning,” International Journal of Forecasting, vol. 35, no. 2,
pp. 783–796, 2019.
[9] Sportsbook reviews online, https://sportsbookreviewsonline.com/
scoresoddsarchives/scoresoddsarchives.htm.
[10] F. Thabtah, L. Zhang, and N. Abdelhamid, “NBA game result prediction using
feature analysis and machine learning,” Annals of Data Science, vol. 6, no. 1,
pp. 103–116, 2019.
[11] Kaggle, https://www.kaggle.com/.
[12] G. Papageorgiou, “Data mining in sports: Daily NBA player performance
prediction,” 2022.
[13] C. Young, A. Koo, S. Gandhi, and C. Tech, “Final project: NBA fantasy score
prediction,” 2020.
[14] Stathead, https://stathead.com/.
[15] P. K. Gray and S. F. Gray, “Testing market efficiency: Evidence from the NFL sports
betting market,” The Journal of Finance, vol. 52, no. 4, pp. 1725–1737, 1997.