Using cost-sensitive learning to forecast football matches

by

J. M. Raats

Master of Science in Computer Science
Technische Universiteit Delft
Delft, July 2018
Contents
List of Figures vii
List of Tables ix
List of Algorithms xi
Acronyms xiii
1 Introduction 1
2 Background: Sports betting 3
2.1 Football . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Bookmakers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Betting on football matches: the use of odds . . . . . . . . . . . . . . . . . . 4
2.2.2 Unfair odds: overround and margin. . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Efficient Market Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Related work 9
3.1 Football market efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Statistical models applied on football matches . . . . . . . . . . . . . . . . . . . . . 10
3.3 Cost-sensitive learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 Methods: Decision trees 13
4.1 A decision tree that minimizes error rate . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.1 Splitting criterion: gini impurity . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.2 Pruning the tree to prevent overfitting . . . . . . . . . . . . . . . . . . . . . . 15
4.1.3 Predicting the outcome of an upcoming match . . . . . . . . . . . . . . . . . 18
4.2 Decision tree optimized for costs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.1 Creating a cost-sensitive decision tree . . . . . . . . . . . . . . . . . . . . . . 19
4.2.2 Pruning a cost-sensitive decision tree. . . . . . . . . . . . . . . . . . . . . . . 20
4.2.3 Prediction process for cost-sensitive decision trees . . . . . . . . . . . . . . . 20
4.3 Using incremental decision trees to accommodate changes . . . . . . . . . . . . . . 21
4.3.1 The process of incrementally updating the decision tree. . . . . . . . . . . . 21
4.3.2 Storing requisite data elements in the decision tree . . . . . . . . . . . . . . 22
4.4 Training and testing through cross-validation. . . . . . . . . . . . . . . . . . . . . . 23
4.4.1 Standard cross-validation process. . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4.2 Incorporating time through time series cross-validation . . . . . . . . . . . . 24
5 Experimental setup 25
5.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.1 Sources used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.2 Quick analysis of the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Cost-combining methods that are tested in this project . . . . . . . . . . . . . . . . 28
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.3.1 Preparing the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.3.2 Building the cost-sensitive models . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3.3 Betting on matches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.3.4 Selecting which matches to bet on . . . . . . . . . . . . . . . . . . . . . . . . 31
6 Results 35
6.1 Betting on all matches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.1.1 Profit and accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.1.2 Cost-confusion matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.1.3 Statistical significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
List of Figures
6.1 Profit over time for all methods when betting on all matches. . . . . . . . . . . . . . . 36
6.2 Cost-confusion matrices for all methods when betting on all matches. . . . . . . . . . . 37
6.3 Profit over time for all methods when value betting. . . . . . . . . . . . . . . . . . . . 39
6.4 Cost-confusion matrices for all methods when value betting. . . . . . . . . . . . . . . . 40
G.1 Profit over time when betting on all matches for the b365 method. . . . . . . . . . . . 67
G.2 Profit over time when betting on all matches for the bw method. . . . . . . . . . . . . 68
G.3 Profit over time when betting on all matches for the avg method. . . . . . . . . . . . . 68
G.4 Profit over time when betting on all matches for the max method. . . . . . . . . . . . . 68
G.5 Profit over time when betting on all matches for the min method. . . . . . . . . . . . . 69
G.6 Profit over time when betting on all matches for the rnd method. . . . . . . . . . . . . 69
I.1 Normality check statistics for all methods when betting on all matches. . . . . . . . . . 73
J.1 Profit over time when value betting for the b365 method. . . . . . . . . . . . . . . . . 75
J.2 Profit over time when value betting for the bw method. . . . . . . . . . . . . . . . . . 76
J.3 Profit over time when value betting for the avg method. . . . . . . . . . . . . . . . . . 76
J.4 Profit over time when value betting for the max method. . . . . . . . . . . . . . . . . . 76
J.5 Profit over time when value betting for the min method. . . . . . . . . . . . . . . . . . 77
J.6 Profit over time when value betting for the rnd method. . . . . . . . . . . . . . . . . . 77
L.1 Normality check statistics for all methods when value betting. . . . . . . . . . . . . . . 81
List of Tables
6.1 Profit and accuracy for all methods when betting on all matches. . . . . . . . . . . . . 35
6.2 𝑝-values of the One-Way ANOVA and Kruskal-Wallis tests when betting on all matches. 36
6.3 Profit and accuracy for all methods when value betting. . . . . . . . . . . . . . . . . . 38
6.4 Results of the One-Way ANOVA and Kruskal-Wallis tests when value betting. . . . . . . 39
6.5 Results of the Tukey HSD test between cost-combining methods when value betting. . 41
J.1 Number of matches selected to bet on for each method and for each run. . . . . . . . 75
List of Algorithms
1 Pseudocode for creating a decision tree. . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Getting cost-complexity alpha values of a decision tree. . . . . . . . . . . . . . . . . . 17
3 Cost-complexity pruning process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Pseudocode to create models for each week. . . . . . . . . . . . . . . . . . . . . . . . 30
5 Pseudocode to prune models for each week. . . . . . . . . . . . . . . . . . . . . . . . 31
6 Match selection pseudocode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Acronyms
ANOVA Analysis of variance. 36, 38
MDL Minimum Description Length. 15, 16, 18, 19, 23, 31, 45
SD Standard Deviation. 35
1 Introduction
Sports betting is becoming increasingly popular due to the possibility of betting online. Online betting makes it very easy to bet on sports matches even when you are not present at the match yourself, which encourages more people to bet. In the Netherlands, sports betting turnover increased by 34% between 2015 and 2016, while the Dutch gambling market as a whole grew only 3.8% [1]. This growth in sports betting is also occurring globally, albeit at a slower rate [2, 3].
At the same time, machine learning is also growing in popularity. It is used in a wide variety of fields to improve many applications, such as predicting Oscar winners [4], stock prices [5] or football world cup matches [6–9]. The strength of these models is that they can find patterns in huge amounts of data that no human could detect. Once a pattern is learned, models can automatically apply this knowledge to new, unseen data. The hardest part is having enough quality data available so that the model can find these patterns in the first place.
Since both fields are becoming increasingly popular, researchers have investigated whether the two can
be combined. For example, some of the biggest companies, like Microsoft [6], Bloomberg [7] and
Goldman Sachs [8], applied machine learning to predict the outcome of all 2014 football world cup
matches. We also use football data in our project, but instead of predicting the outcome of world
cup matches we use Dutch league matches for our experiments. For football (also known as soccer)
leagues there is more data available, since more teams compete against each other and for a longer
period of time. The world cup is organized only once per four years and lasts for a month whereas the
Eredivisie is active for almost a full year every year. Since it would be too ambitious to analyze football
leagues from various countries, we limited this study to one case: that of the Dutch Eredivisie, the
national football competition. Although most previous research into football markets focuses on the
English Premier League, it is also worthwhile to study other markets. This research is a first attempt
at broadening the field of study into these other international leagues.
Some research has already been done on correctly predicting as many league matches as possible [10–12], but we will investigate a different approach. While optimizing for accuracy has been explored before, how to acquire the highest profit has not been studied. Thus, for this project we not only want to predict the right outcome of football matches, but we also want to place bets on these matches and attempt to make the highest profit possible. Since the return for each match can differ considerably, as we will show in later chapters, a higher prediction accuracy will not automatically lead to a higher profit. The machine learning models used in these studies all try to optimize for accuracy, so we have to use a different kind of machine learning technique to optimize for profit. The technique that we use in this project is called cost-sensitive learning. With cost-sensitive learning it is possible to incorporate the different costs for each match and use these costs to optimize machine learning models for profit instead of accuracy.
The costs for each match depend on the bookmaker’s odds and the stake placed by the bettor.
The odds of a bookmaker depict how probable each of the outcome possibilities is according to the
bookmaker. Since there are three different outcomes possible for football matches (home team wins,
a draw or away team wins), odds consist of three values: [o_home, o_draw, o_away]. For example, if the
odds for a match are [1.3, 4.33, 7.50], then this means that if you bet on the home team and they
win, you would get 1.3 times your stake back from the bookmaker. This same logic applies to draws
and away wins as well. However, if you bet on an outcome and this does not happen, then you lose
your stake. In this example, the outcome that is most likely to occur according to the bookmaker is
the home team winning, since this event has the so-called short odds or the lowest value attached to
it. On the other hand, the probability that the away team wins is very small, since a bettor would get
7.5 times his stake back from the bookmaker.
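As a small illustration of how such decimal odds translate into payouts and implied probabilities, consider the following Python sketch (the variable names and the script itself are ours and purely illustrative; the odds are the example above):

# Illustrative sketch: payouts and implied probabilities for the example odds above.
odds = {"home": 1.30, "draw": 4.33, "away": 7.50}
stake = 1.0  # one unit is bet on each outcome in turn

for outcome, o in odds.items():
    payout = stake * o            # returned by the bookmaker if the bet wins
    net_profit = payout - stake   # losing the bet instead costs the full stake
    implied_prob = 1.0 / o        # probability implied by the odds (ignoring the margin)
    print(f"{outcome}: payout {payout:.2f}, net profit {net_profit:.2f}, "
          f"implied probability {implied_prob:.1%}")

# The implied probabilities sum to more than 100%; the excess is the bookmaker's
# overround (margin), which is the topic of section 2.2.2.
print(f"overround: {sum(1.0 / o for o in odds.values()) - 1.0:.1%}")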
There are usually multiple bookmakers who publish their odds for the same upcoming matches.
These odds differ slightly per bookmaker, since each bookmaker uses their own method or algorithm to
calculate the probabilities for each outcome to occur. This means that there are different costs involved
for the same football match. They depend on which bookmaker is used to place the bets. In standard
cost-sensitive learning this introduces a problem, because it is assumed that for each instance there
is only one ground truth cost possible per outcome. Using the multiple odds of all bookmakers could
add valuable information to our models. What would be the most optimal way to apply cost-sensitive
learning, specifically cost-combining methods, to this case? To this end, we will explore various cost
combining methods. Instead of using the odds of only one bookmaker in our cost-sensitive model, we
will combine the odds of all bookmakers and use the combined odds in our model. Our hypothesis is
that when the odds from various bookmakers can be combined in some mathematical way, this will
result in a higher profit than using only the information of one single bookmaker.
But before we dive into the cost-combining methods, we will first give some background information
of sports betting in chapter 2. There the basic principles of sports betting will be explained and this
information will be useful to understand the related work in chapter 3. After the related work we will
show the cost-sensitive model that we used to do our experiments in chapter 4. We discuss which
cost-combining methods we applied in this research, and how the experiments were conducted, in
chapter 5 while the results of these experiments can be found in chapter 6. These results will be
discussed and a conclusion will be given in chapter 7.
2 Background: Sports betting
Before we dive into the technical aspects of predicting football match outcomes, we will first present
some background information about football and sports betting. This information is necessary to un-
derstand the rest of the report more easily. Since we are trying to predict football match outcomes,
it is useful to understand how a football match looks like and how a football season is structured. We
will discuss this in section 2.1. To be able to make a profit from these predictions, we need someone
or some organization to place our bets. This can be done by a so-called bookmaker. We will show
how a bookmaker operates in section 2.2. Finally, there is a theory about how efficiently a market behaves, called the Efficient Market Hypothesis (EMH), which describes how accurately market prices reflect the value of the goods being sold. This theory will be explained in section 2.3.
2.1. Football
Football, as a game, has been played for many years. According to the Fédération Internationale de
Football Association (FIFA), some variants of the game were already practised in the second and third
centuries BC in China [13]. Today, it is the most popular game in the world: it is estimated that over
four billion people around the world consider themselves a football fan [14].
Since explaining the rules of a football match is not the point of this research, we suggest those
interested consult other sources if they wish to learn more about the game [15]. In essence, a team,
consisting of eleven players, attempts to score goals on the side of the field of the opposing team. To
be sure that international readers are aware we are discussing European football, or soccer, a map of
the field is displayed in figure 2.1. Relevant to this research is that there are three possible outcomes
of a match: the home team wins, there is a draw, or the away team wins. If the home team won, then
that means that the team that played in their home stadium scored more goals than the away team
and vice versa. In case of a draw, both teams ended up with the same number of goals. The winning
team is awarded three points for the competition, the losing team gets nothing and when tied both
teams get one point.
In a football season multiple teams play against each other throughout the year. Every country
has its own league. In the Netherlands, the focus of our research, it is called the Eredivisie, while in
e.g. England it is called the Premier League and in Spain they have the Primera División. The teams
compete for one year (season) and after all matches are played, the team with the most points is
declared the champion. In each league the teams compete twice against each other: one time in their
home stadium and once at the opponent’s. In football terms this is called home and away respectively.
There are 18 teams in the Eredivisie, which means that each team plays 34 matches in one season
and thus there are 306 matches in total. Other leagues may have different numbers of teams, and therefore of matches, but for this project only the Eredivisie is investigated.
There is something special about playing at the home ground. Historical data shows that teams
are more likely to win at home than at the opponent’s stadium [16]. Although there are three possible
outcomes of a match (home team wins, there is a draw, or the away team wins), our data shows
that in general, home teams have a 47.6% chance to win, as opposed to a 23.6% chance of a draw and a 28.8% chance of an away win (for more information about our dataset, please consult chapter 5). These
percentages may vary slightly per season and per league, but they put the home team at a major advantage, which is important to keep in mind for the rest of this project.
2.2. Bookmakers
To bet on football matches, sports matches or other events, you need to find a company or person that
accepts bets against specified odds (we will further explain the term ’odds’ in section 2.2.1).
This company or person is called a bookmaker. The first bookmaker ever recorded stood at Newmarket
horse racecourse in 1795 [17] and since then the market has kept growing. It is estimated that by
2020 the betting market will reach 66.59 billion USD [2]. One of the reasons for this growth is the introduction of online betting [1]. Though it is currently illegal for Dutch citizens to participate in
online betting except for one bookmaker (Toto), there are other online betting sites that publish odds
for the Dutch football market, and their odds are used in this research [18, 19]. It is expected that
this growth will increase even further when more countries legalize online betting in the near future,
including the Netherlands [20, 21].
P(X) = 1 / o_X

where P(X) is the probability of event X, o_X is the odds of event X and X can be home, draw or away. So in the example, the probabilities according to the bookmaker are:
Figure 2.2: Margin per bookmaker per season for the Dutch Eredivisie.
played after the technical staff of one team is replaced. In different leagues, they looked at the first few matches after the staff had changed and bet on that team to win in each of those matches. They found that the odds underestimate the true probabilities, indicating the market
is semi-strong inefficient, because the profits were higher than expected. However, since those few
matches might not be representative of the rest of the season, this study does not show whether the
whole market would be semi-strong inefficient. Since we are interested in making a profit throughout
a season, we need to know about the state of the entire market.
One element of the inefficient market of football bookmakers is the favorite-longshot bias. There are
a few different explanations for this phenomenon. Either the bookmaker is unable to calculate the true
probabilities for each match outcome; the bookmakers apply unbalanced margins to each outcome;
or a combination of both. In [28] they assumed that the bookmakers could accurately forecast the
outcomes, but altered the odds to be in their benefit. They assert that the bookmaker influences these
odds in such a way that the odds for favorite teams are more valuable than longshot odds. They analyze
the odds of the bookmaker with the lowest margins, seen in figure 2.2, as these should be closer to
the bookmaker’s true odds (if no margin had been applied), compared to those of other bookmakers
with higher margins. They find it to be unlikely that the investigated bookmaker creates their odds
by applying the margin equally over each outcome. First, they distilled the bookmaker’s ’actual’ odds
by removing the margin equally from each odds. Then, they converted these odds to probabilities
(see section 2.2). They found that these probabilities were very different from the real outcomes of
the football matches, which violated their assumption of bookmakers’ predictive accuracy. Their other
tests, where bookmakers were assumed to tamper with the odds by applying the margins proportionally
to the size of the true odds, resulted in a smaller difference between probabilities. Apparently, the
bookmakers employ these more biased methods to apply the margins to their true odds. Since the
shortest odds are those of the favorite, they receive the smallest margin in proportion to the other two
outcomes, while the longshot receives the biggest margin using that same logic. This confirms the
abovementioned difference in profit for betting on favorites versus betting on longshots.
The previous article assumed that bookmakers are able to correctly predict the probability of the
different outcomes. The authors of [29] employed a large-scale research set-up to check whether this
assumption is correct and whether bookmakers’ odds could be used as forecasts. They found that bookmak-
ers were indeed getting closer to the real probabilities over time. However, they converted odds to
probabilities under the assumption that the bookmakers applied margins to their true odds equally.
The work of [28] showed that this is unlikely to be a fair representation of how bookmakers create
their odds. It would thus be interesting to conduct the research of [29] again with the methods shown
in [28] and compare the results.
models for two seasons and only for the matches in which Tottenham Hotspur (of the English Premier
League) played. The model that had the highest prediction rates predicted almost 60% of all Tottenham
matches correctly. There is however one major drawback to this model: it uses information about the
key players in the Tottenham selection, so after two years the model could not be used anymore
because the players left or retired. The method in this paper is thus highly sensitive to changes in
teams and only relevant for a specific period of time.
In another paper multiple models were also used and compared [11]. Training data contained ten
seasons of the English Premier League and the models were tested on the two following seasons. Their
three best-performing models had prediction rates around 50%. One of the reasons that the prediction
rates were not very high was that almost all models had trouble predicting draws correctly. One
model never predicted a draw for a single match, even though 29% of the investigated matches in the
English Premier League ended in a draw. This shows how difficult it may be to correctly predict a draw
using machine learning. Though another blog post achieved similar accuracy rates, it performed better
on draws, suggesting that there may be ways to improve draw predictions, but that this may require
decreasing the accuracy of home and away predictions [12].
The most recent paper that we could find about machine learning applied to football matches is [34].
In this paper they, again, compared multiple machine learning models to each other. Here the training
data consists of nine seasons, with the model being tested on weeks 6 to 38 of seasons 2014/2015 and
2015/2016. This resulted in accuracy values between 52.7% and 56.7%, which is quite high compared to previous papers. Unfortunately, their paper was published close to our own publication date, preventing us from applying some of their findings (specifically on feature engineering) to this project. We will reflect on this in the discussion in chapter 7.
Figure 4.1: Decision tree trained on the example data. (a) Training data and decision tree boundaries. (b) Corresponding decision tree structure; the first split tests X₂ < 0.36.
in the dataset, which is the case in our example, it will make twenty splits before all possible splits
are considered. Then, the best combination of variable and split value is chosen to become the first
decision. To know which split is the best, it will calculate the gini impurity score for each split. This is
done through the following formula:

I_G(p) = 1 − Σ_{i=1}^{J} p_i²

where I_G(p) is the gini impurity, J is the number of classes and p_i is the fraction of items labeled with class i. The worst possible gini for this dataset is 0.48 and occurs when the data is kept in a single group containing all instances, i.e. when the other group receives no instances at all. If we fill in the formula, we get the following calculation: 1 − ((6/10)² + (4/10)²) = 0.48. First we take the proportion of the red class, which has 6 of the ten instances, and then the proportion of the blue class is added. The best split has a gini of 0.444, which is not much better than the worst split. We can see from the decision tree model in figure 4.1b that this split is at X₂ < 0.36, so that the bottom group of instances all belong to the red class (gini impurity of 0) and the top group has a gini of 1 − ((4/6)² + (2/6)²) = 0.444. Adding the gini scores of both groups gives the total gini for this split, which is 0.444. There is no other split where the gini is lower than 0.444, so the model chooses this split to be the first decision.
Since the group of red objects has a perfect gini score of 0, there is nothing to improve. The model
is done for this group and thus creates a leaf node where the red class will be predicted for future
objects. The other group still has mixed objects, so the process repeats until the groups are pure, just
like in the first group.
The pseudocode of this process can be seen in algorithm 1. The function split_group takes data
as a parameter, which contains all the information about the data points. Then it checks if the data is
already pure, which is the case when all data points belong to the same class. If the data is pure, then
no decision needs to be made, so the code returns. If there are mixed classes in the data, then for each
variable and for each unique value of the data points for that variable, the impurity is calculated. This
is where the gini impurity could be used as a metric. The best cutpoint, where the impurity is smallest,
is saved. After all cutpoints have been processed, the data is split into two groups: one group where
the data is below the best cutpoint value and the other group contains the rest of the data. Then the
function split_group is called recursively for both groups until all groups are pure.
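To make the splitting procedure concrete, the following Python sketch mirrors the steps described above (it is our own illustration, not the thesis implementation; it assumes the data is a list of (features, label) pairs and uses the weighted average of the two groups' impurities, a common variant of the criterion):

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(data):
    # Find the (variable, cutpoint) pair with the lowest weighted impurity.
    best = None  # (impurity, variable index, cutpoint)
    n_vars = len(data[0][0])
    for var in range(n_vars):
        for cut in sorted({x[var] for x, _ in data}):
            left = [y for x, y in data if x[var] < cut]
            right = [y for x, y in data if x[var] >= cut]
            if not left or not right:
                continue
            imp = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
            if best is None or imp < best[0]:
                best = (imp, var, cut)
    return best

def split_group(data):
    # Recursively split until every group is pure, returning a nested dict tree.
    labels = [y for _, y in data]
    if len(set(labels)) == 1:                       # pure group: create a leaf
        return {"class": labels[0]}
    split = best_split(data)
    if split is None:                               # cannot be separated any further
        return {"class": max(set(labels), key=labels.count)}
    _, var, cut = split
    left = [(x, y) for x, y in data if x[var] < cut]
    right = [(x, y) for x, y in data if x[var] >= cut]
    return {"variable": var, "cutpoint": cut,
            "left": split_group(left), "right": split_group(right)}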
Cost-complexity pruning
Cost-Complexity Pruning (CCP) looks at multiple subtrees of the original tree and calculates the test
error for each one of these subtrees to see which one has the best test error [42]. However, if all
subtrees would be checked this would take a very long time. Let |T̃| be the number of leaves in the original tree T; then there are ⌊1.5028369^|T̃|⌋ possible subtrees to check [44]. For example, this means
that a tree with 25 leaves already has 26,472 possible pruned subtrees. Complex and large datasets
(e.g. containing many football matches, the outcomes of which are difficult to predict, see section 3),
can easily result in a tree with well over 25 leaves.
This means that checking each subtree is not feasible within a reasonable amount of time. This is
why CCP only deals with a fraction of the possible subtrees. The idea of CCP is that the deeper a tree grows, the more complex it gets, and this should be penalized because a more complex tree is more likely to overfit. You could say each subtree of T has a complexity cost attached to it. If the subtree is deep, then the complexity cost is high, and if it is shallow, then there is almost no cost.
Let us define the complexity cost of a tree T as follows:

C_α(T) = R(T) + α|T̃|

where C_α(T) is the complexity cost of tree T given an α value, R(T) is the resubstitution error of T and |T̃| is the number of leaves. So the cost of the tree is the number of incorrectly classified instances
plus an additional complexity error based on an 𝛼 value. But there is also a cost for individual nodes
in the tree. Let us define this as:

C_α({t}) = R(t) + α
where C_α({t}) is the complexity cost of node t and R(t) is the resubstitution error of node t. This resubstitution error R(t) is a combination of the misclassification rate and the proportion of data that falls into node t, so R(t) = r(t)p(t). This means that for the whole tree R(T) = Σ_{t∈T̃} R(t) = Σ_{t∈T̃} r(t)p(t).
To get a feel for what 𝛼 does to the tree, let 𝛼 be 0 for a moment. This means that there is no
penalty for the complexity of the tree, thus the tree will not be pruned. The whole tree would correctly
classify all instances (because that is guaranteed by the building procedure of the tree), so the complexity
cost of the whole tree will be 0.
However, when α increases, there will be a point where C_α(T_t) = C_α({t}) for some decision node t in the tree, where T_t is the subtree rooted at t. When this happens, it means that it is better to prune the tree at node t, because the pruned tree has the same complexity cost but is smaller, which is preferable given the risk of overfitting. To find the α value at which this happens, we need to solve

R(T_t) + α|T̃_t| = R(t) + α

which gives

α = (R(t) − R(T_t)) / (|T̃_t| − 1)
Now we have everything to start our pruning process. The pseudocode for getting all alpha values is shown in algorithm 2. It begins by considering the whole tree T, and for each decision node in T we calculate at which α value it would be pruned. The lowest α value is chosen and the corresponding decision node is pruned. We now know that T_1 > T_2, where T_1 is the original tree and T_2 is the pruned tree after the first step. This process is repeated for T_2 and subsequent trees until T_k = {t_1}, so that it only contains the root node of tree T. This results in a sequence of subtrees of T where T_1 > T_2 > ⋯ > T_k with corresponding α values α_1 < α_2 < ⋯ < α_k where α_1 = 0. Now we know that T_i is the best smallest subtree when α ∈ [α_i, α_{i+1}), which is very helpful when looking for the best α value.
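The α value at which a given node becomes worth pruning follows directly from the formula above; a small sketch (the function and its arguments are our own naming):

def weakest_link_alpha(R_t, R_Tt, n_leaves_Tt):
    # R_t: resubstitution error of node t itself.
    # R_Tt: resubstitution error of the subtree rooted at t.
    # n_leaves_Tt: number of leaves of that subtree.
    return (R_t - R_Tt) / (n_leaves_Tt - 1)

# Example: pruning node t becomes attractive once alpha reaches 0.02.
print(weakest_link_alpha(R_t=0.10, R_Tt=0.04, n_leaves_Tt=4))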
After all 𝛼 values are calculated for the whole tree, the next step is to find the best 𝛼 value. This
can be found in two ways: using a validation dataset or using cross-validation. In this project we used
cross-validation, because if we used a validation dataset we would have to separate some matches
from our training dataset, which means that we would not be able to use that valuable information. With cross-validation all of the data can still be used, as we will explain in further detail in section 4.4.
For each fold in the cross-validation method a new tree is built based on the training data for that
fold. This tree is then pruned according to the 𝛼 values obtained from the original tree. We know
that all values of α ∈ [α_i, α_{i+1}) result in subtree T_i, so in this case we will choose a β value that lies between α_i and α_{i+1} with which to prune the tree. This results in a sequence of β values that all lie between the corresponding α values so that β_1 < β_2 < ⋯ < β_k, where β_1 = 0 to ensure no pruning takes place. The other β values are set as β_i = √(α_i · α_{i+1}) and β_k = ∞.
After the test error is calculated for each fold and for each 𝛽 value, the errors are added up and
the β value with the lowest test error is selected. Let us say that β_j is the one with the lowest test error; then we know that this corresponds to α_j, so the original tree will be pruned with α_j. The
whole cost-complexity pruning process is also depicted as pseudocode in algorithm 3. Note that the
split_group function refers to the code in algorithm 1 and the get_alpha_values function refers
to algorithm 2.
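The β values follow mechanically from the α sequence; a small sketch, assuming the α values are already sorted in increasing order (our own helper, not the thesis code):

import math

def beta_values(alphas):
    # alphas: sorted values alpha_1 < alpha_2 < ... < alpha_k, with alpha_1 = 0.
    betas = [0.0]                               # beta_1 = 0 ensures no pruning
    for a, a_next in zip(alphas[1:-1], alphas[2:]):
        betas.append(math.sqrt(a * a_next))     # beta_i = sqrt(alpha_i * alpha_{i+1})
    betas.append(math.inf)                      # the last beta prunes down to the root
    return betas

print(beta_values([0.0, 0.01, 0.04, 0.09]))     # approximately [0.0, 0.02, 0.06, inf]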
To conclude this section, let us summarize how a decision tree is created. First a splitting criterion
must be chosen to define what a good split is. We used the gini impurity as an example to show how
a splitting criterion based on misclassification works. Since we are building the whole tree so that all
leaves are pure, two post-pruning methods were discussed: MDL and CCP. These pruning techniques
are applied to the tree after the whole tree is built. This usually means that the leaves of the decision tree are no longer pure and thus the training error is no longer optimal. More importantly, however, the test error is likely to benefit from this step, and thus the decision tree will perform better on unseen data than before pruning. Since we are more interested in maximizing profit than in maximizing the number of correct predictions, we will now discuss how to alter the decision tree process to incorporate the costs for each match.
4.2. Decision tree optimized for costs
(a) Generic cost-matrix for football betting:

                    Actual
                 H       D       A
 Predicted  H   -p_H     1       1
            D    1      -p_D     1
            A    1       1      -p_A

(b) Cost-matrix for a match with odds [1.53, 4, 6]:

                    Actual
                 H       D       A
 Predicted  H   -0.53    1       1
            D    1      -3.00    1
            A    1       1      -5.00

Table 4.1: Cost-matrices used for football matches. Please note that negative values signify profit.
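A cost-matrix like table 4.1b can be generated directly from a match's decimal odds; a minimal sketch (our own helper, with costs rounded to two decimals):

def cost_matrix(odds):
    # odds: decimal odds [o_home, o_draw, o_away]. Rows are predictions (H, D, A),
    # columns are actual outcomes. A wrong bet costs the one-unit stake (+1);
    # a correct bet yields a negative cost of -(odds - 1), i.e. the net profit.
    matrix = [[1.0] * 3 for _ in range(3)]
    for i, o in enumerate(odds):
        matrix[i][i] = round(-(o - 1.0), 2)
    return matrix

print(cost_matrix([1.53, 4, 6]))
# [[-0.53, 1.0, 1.0], [1.0, -3.0, 1.0], [1.0, 1.0, -5.0]]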
For the cost-sensitive tree, the complexity cost of a node becomes

C_α({t}) = c(t) + α

and the new cost for the whole tree:

C_α(T_t) = C(T_t) + α|T̃_t|

where c(t) is the minimal cost (CSL) in node t and C(T_t) is the total cost of tree T_t, so C(T_t) = Σ_{t′∈T̃_t} c(t′). To get the best α values, we need to use:

α = (c(t) − C(T_t)) / (|T̃_t| − 1)
This formula is a variant of the cost-complexity formula used in [41], but it leads to the same pruned trees. However, the α values obtained with our formula will be positive instead of negative, which is more in line with the original CCP method explained in section 4.1.2. So if our formula is used, it is only necessary to change this formula in the original process, at line 7 of algorithm 2. The rest of the steps of the algorithm work exactly the same.
Changing the splitting criterion and the cost-complexity pruning method has turned the standard decision tree into a cost-sensitive learning model. The costs of each instance now reflect the importance of predicting that instance correctly, based on the odds. In this cost-sensitive model, it may take e.g. five matches where the favorite team won to balance out a match where the longshot won. This better reflects the real-life situation of varying profits per prediction, instead of assuming uniform costs throughout the model. However, since we use a lot of data and need to retrain the model multiple times, the training process takes a very long time. To speed up this process, we use an incremental version of the decision tree model, which is explained in the next section. This technique is not a vital part of our investigation, so readers who are not interested in it can skip to section 4.4, which covers the methods that actually influence the results of our experiments.
      X₁     X₂     class
 1    0.20   0.99   blue
 2    0.14   0.09   red
 3    0.15   0.34   red
 4    0.41   0.37   blue
 5    0.07   0.71   red
 6    0.57   0.63   blue
 7    0.61   0.48   red
 8    0.74   0.14   red
 9    0.82   0.02   red
 10   0.52   0.63   blue

Table 4.2: Data used for training the incremental decision tree.
tree restructuring and the same tree as figure 4.1b is created. Even though, for ease of reading, we
used an example where no costs are mentioned, it is also possible to use this tree building process for
cost-sensitive trees.
Figure 4.3: Data used during the training of the incremental decision tree. (a) Data used at step 1. (b) Data used at step 2. (c) Data used at step 7.
Key Description
best_variable Empty if leaf, else stores which variable is best
class_counts Saves how many instances of each class are inside this node
flags Empty if leaf, else tracks whether a node is stale or pruned
instance_costs Empty if decision node, else stores the cost-matrix of each instance
instances Empty if decision node, else stores information of each instance
left Empty if leaf, else refers to the left node
mdl Stores the MDL value of this node
n_instances Stores the number of instances
n_variables Stores the number of variables of each instance
right Empty if leaf, else refers to the right node
variables Empty if leaf, else saves the unique values for each variable
Table 4.3: Information stored in an incremental decision tree with a short description. Keys used in both leaves and decision
nodes cover both cells.
is a decision node, then this means that this node has to have a left node and a right node, since a
decision is made in this node. To be able to store the information of these two ’children’ of this node,
the left and right keys are used to store the information for the nodes below this node. These nodes
can also be decision nodes, leaves, or a combination of both. The mdl key stores the MDL value of this
node, but only if the MDL pruning method is used. The n_instances key stores the number of instances
that fall into this node and the n_variables stores the number of variables or features used by these
instances. The last key is the variables key, which saves the unique values of each variable used in the
data. This is very important to store, because without it, it is not possible to determine the best split when new data is added to the model. This is also where the impurity values for each possible split and the best cutpoint for each variable are stored. Appendix A shows that a lot of information is stored in this part of the tree: it takes up almost half of the total information stored in the tree, and this grows very quickly when newly added data contains new variable values.
For a leaf node less data is stored. There is no information stored in best_variable, flags, left, right
or variables. The primary goal is to save the instances and the costs of these instances in the instances
and instance_costs keys of the tree so that we are able to use this information later.
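As an illustration of the dictionary layout of table 4.3, a leaf and a decision node might look as follows (all values are toy examples; the exact contents of the flags, instances and variables entries are our assumption, not taken from the thesis code):

leaf = {
    "best_variable": None,                     # empty: this node is a leaf
    "class_counts": {"H": 1, "D": 0, "A": 0},
    "flags": None,                             # empty for leaves
    "instances": [[0.20, 0.99]],               # training instances stored in the leaf
    "instance_costs": [[[-0.53, 1, 1], [1, -3.0, 1], [1, 1, -5.0]]],  # one cost-matrix per instance
    "left": None,
    "right": None,
    "mdl": None,                               # only filled when MDL pruning is used
    "n_instances": 1,
    "n_variables": 2,
    "variables": None,                         # empty for leaves
}

decision_node = {
    "best_variable": 1,                        # index of the variable used for the split
    "class_counts": {"H": 10, "D": 4, "A": 6},
    "flags": {"stale": False, "pruned": False},
    "instances": None,                         # instances live in the leaves, not here
    "instance_costs": None,
    "left": leaf,                              # children are nested dictionaries of the same shape
    "right": leaf,
    "mdl": None,
    "n_instances": 20,
    "n_variables": 2,
    "variables": {0: [0.07, 0.14, 0.20], 1: [0.09, 0.36, 0.99]},  # unique values seen per variable
}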
Figure 4.4: Difference between traditional evaluation, standard cross-validation and time series cross-validation. The blue dots are used for training a model, the red dots for testing that model. In this case the order of the dots matters: the dots on the left occurred earlier in time than those on the right.
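Time series cross-validation as sketched in the figure can be reproduced with scikit-learn's TimeSeriesSplit; a generic illustration (not the thesis implementation):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

matches = np.arange(12)  # stand-in for twelve chronologically ordered match weeks
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(matches):
    # training data always precedes the test data in time
    print("train:", matches[train_idx], "test:", matches[test_idx])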
5.1. Data
There is a lot of data available regarding football match outcomes. For this project we need historical
data on football match outcomes as well as the odds of multiple bookmakers. First, we will look at the data sources for this information, and after that a quick analysis of the acquired data is performed.
Figure 5.1: Number of odds published per match for seasons 2006/2007-2016/2017.
Figure 5.2: Number of odds published per bookmaker for seasons 2006/2007-2016/2017.
(a) Bookmakers’ favorite accuracy. (b) Profit when betting on bookmakers’ favorites.
Figure 5.3: Favorite accuracy and profit per bookmaker for seasons 2006/2007-2016/2017.
With the help of this dataset it is also possible to calculate the prediction accuracy of the bookmakers
for seasons 2006/2007 until 2016/2017. As mentioned in chapter 3, accuracy rates around 50% seem to be quite common when researchers attempt to predict future matches. The bookmakers indicate their predicted outcome by giving that outcome the lowest relative odds, since the bookmaker assumes it has the highest probability of occurring, as shown in section 2.2.1. Using this metric, we observe that the bookmaker with the highest accuracy (57.1%) is Gamebookers. In figure 5.3a the results for all bookmakers are shown. The bookmaker with the lowest accuracy is Pinnacle, which predicts 53.1% of matches correctly. While Pinnacle has the lowest prediction accuracy of the ten bookmakers, they still
perform better than most of the accuracy values reported in chapter 3, where the highest accuracy
over multiple seasons is 56.7% [34]. This shows that while it can be hard to beat the bookmakers,
more recent machine learning research is becoming increasingly accurate.
Figure 5.3b shows what the relative profit would be if one were to bet on these bookmakers’ favorites (determining the profit of the bookmakers themselves is outside the scope of this project). We see that using Gamebookers’ favorites would result in the best return, a relative profit of -3.6%. Although Pinnacle had the lowest relative accuracy in its predictions, it is not the bookmaker with the lowest profit. This is William Hill with a profit of -6.9%, whereas Pinnacle resulted in -5.6%. This shows that a higher accuracy does not automatically lead to a higher relative profit. Since Pinnacle’s margins are much lower than those of the other bookmakers, correctly predicting the outcome of a match yields a higher return, whereas the loss for misclassifying a match’s outcome is the same. In real-life betting it is thus important to consider where you place your bets if you wish to have a better chance at a profit.
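These favorite-accuracy and favorite-profit numbers can be computed from a table of historical odds and results along the following lines (a sketch; the column names are our assumption, not the actual dataset schema):

import pandas as pd

def favorite_performance(df, prefix="odds_"):
    # Assumed columns: decimal odds 'odds_H', 'odds_D', 'odds_A' and the result 'FTR' (H/D/A).
    odds = df[[prefix + "H", prefix + "D", prefix + "A"]]
    favorite = odds.idxmin(axis=1).str[-1]      # outcome with the shortest odds
    hit = favorite == df["FTR"]
    favorite_odds = odds.min(axis=1)
    # one unit staked per match: a win returns (odds - 1), a loss costs the stake
    profit = (hit * (favorite_odds - 1) - (~hit) * 1).sum()
    return hit.mean(), profit / len(df)         # accuracy, relative profit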
If we look at the number of times each possible outcome occurs, then we see roughly the same
pattern for each season. This pattern is shown in figure 5.4. We observe that almost half of the matches
are won by the home team and the away team wins just over 25% of the time every season. The two
grey dotted lines are placed at 50% and 75% of the total amount of matches in a season. The only
season where predicting only home wins would result in an accuracy that comes close to bookmakers’
standards is season 2010/2011. If only home wins were predicted, then the accuracy would be 52.3%.
This shows that bookmakers, and in particular Gamebookers, do a good job in forecasting the match
outcomes.
Figure 5.4: Outcome distribution for seasons 2006/2007-2016/2017. The two grey dotted lines are placed at 50% and 75% of
the total amount of matches in a season.
An example for each method can be found in figures 5.5 and 5.6. In the first figure there are
cost-matrices constructed from four different bookmakers and the result for each combining method is
shown in the second figure. In this example the rnd method uses the odds from bookmaker B.
Figure 5.5: Cost-matrices constructed from the odds of four example bookmakers. In every matrix the off-diagonal costs are 1; the diagonal costs for (H, D, A) are:
(a) Bookmaker A: -1.8, -2.5, -1.4
(b) Bookmaker B: -2, -2.1, -1.3
(c) Bookmaker C: -1.9, -2.7, -1.3
(d) Bookmaker D: -1.6, -2.5, -1.4

Figure 5.6: The four different cost-combining methods applied to the four example cost-matrices above. The costs in the matrices are rounded to two decimals; again the off-diagonal costs are 1 and the diagonal costs for (H, D, A) are:
(a) avg method: -1.83, -2.45, -1.35
(b) max method: -1.6, -2.1, -1.3
(c) min method: -2, -2.7, -1.4
(d) rnd method: -2, -2.1, -1.3
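Because only the diagonal entries differ between bookmakers, the four combining methods reduce to combining those diagonals; a sketch with the example values above (our own code, for illustration):

import random
import numpy as np

# Each row holds one bookmaker's diagonal costs (H, D, A); off-diagonal costs stay 1.
diagonals = np.array([[-1.8, -2.5, -1.4],    # bookmaker A
                      [-2.0, -2.1, -1.3],    # bookmaker B
                      [-1.9, -2.7, -1.3],    # bookmaker C
                      [-1.6, -2.5, -1.4]])   # bookmaker D

combined = {
    "avg": diagonals.mean(axis=0),                       # [-1.825, -2.45, -1.35]
    "max": diagonals.max(axis=0),                        # highest cost, i.e. smallest profit
    "min": diagonals.min(axis=0),                        # lowest cost, i.e. biggest profit
    "rnd": diagonals[random.randrange(len(diagonals))],  # one bookmaker picked at random
}

def to_cost_matrix(diagonal):
    # Expand a combined diagonal back into a full 3x3 cost-matrix.
    matrix = np.ones((3, 3))
    np.fill_diagonal(matrix, diagonal)
    return matrix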
In total we will create six models: two models that only use a single bookmaker’s odds (Bet365
and Bet&Win) and four models that use a combination of odds. First we will investigate whether the single
bookmaker models perform better than the combined odds models. We think that the avg and min
models will perform better than the single bookmaker models, but max and rnd will perform worse. We
will also investigate which of the four combining methods performs best. We expect the avg method
to be the best method of all methods and the max to perform worst.
5.3. Experiments
With the goal and general structure of this project in mind, it is time to decide how to conduct the
experiments. The process, from acquiring the data to storing the results on disk, is shown in figure 5.7.
First we have to combine the different data sources into one container and extract features from this
dataset. After this is done, we need to build and prune the incremental cost-sensitive decision trees
for each of the six methods for every week in the dataset. These trees are then used to predict the
outcomes for upcoming matches. Since the models are trained on season 2006/2007, this decreases
the number of matches we will predict to 3,057 (compared to 3,363 in the whole dataset). The results
of these predictions are stored on disk and the last step is to select which matches to bet on, instead
of betting on all matches. We will now walk through the whole process in more detail.
After this intermediate dataset is made, it is easy to extract the features for each match. For
each match, we will search for the necessary information of the features in the team information
dataset. It is important to look for information originating from before the match is played, because
we have to simulate that we do not know anything about the match’s results: we only know what
happened in previously played matches. For more information about the features used in our models,
see appendix E. In short: for each match we want to know the difference in points, goals and results
between the two playing teams at the moment in time before the match is played.
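As an example of such a feature, the following sketch computes the difference in accumulated competition points between the two teams, using only matches played before kick-off (the column names 'date', 'home_team', 'away_team' and 'FTR' are our assumption; appendix E lists the actual features):

import pandas as pd

def add_point_difference(matches: pd.DataFrame) -> pd.DataFrame:
    matches = matches.sort_values("date").reset_index(drop=True).copy()
    points = {}   # running points total per team
    diffs = []
    for _, m in matches.iterrows():
        home, away = m["home_team"], m["away_team"]
        # only information from previously played matches may be used as a feature
        diffs.append(points.get(home, 0) - points.get(away, 0))
        # update the totals afterwards: 3 points for a win, 1 for a draw, 0 for a loss
        home_pts, away_pts = {"H": (3, 0), "D": (1, 1), "A": (0, 3)}[m["FTR"]]
        points[home] = points.get(home, 0) + home_pts
        points[away] = points.get(away, 0) + away_pts
    matches["point_diff"] = diffs
    return matches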
Creating models
In algorithm 4 it is shown which pseudocode is used to create the models for each week. The name
parameter is used to save the model to disk (’avg’, ’max’, ’min’ and ’rnd’), data refers to the dataset
where all features are stored, seasons specifies for which seasons the models should be created and
finally, cost_matrix_function is the cost-matrix combination method to be used. This is where
the average, maximum, minimum and random cost-matrices are created. We use the odds as they
are: we do not try to acquire the true odds of the bookmakers such as in [29].
The function incremental_update(model, data, costs) refers to an incremental decision
tree, which we built in Python. There was already code available for this incremental decision tree
at [49], but this code was written in C and it is not possible to use this code in a cost-sensitive manner,
so it was decided to convert the code to Python so that the required alterations could be made. The
function save_model(name, model) writes the decision tree, which is a Python dictionary (see
appendix A), to disk using the Pickle library to save memory.
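The save and load steps around the Pickle library can be sketched as follows (a minimal illustration; the file name and the toy dictionary are ours):

import pickle

def save_model(filename, model):
    # the decision tree is a plain Python dictionary, so pickling it is straightforward
    with open(filename, "wb") as f:
        pickle.dump(model, f)

def load_model(filename):
    with open(filename, "rb") as f:
        return pickle.load(f)

model = {"best_variable": None, "n_instances": 0}   # toy stand-in for a tree dictionary
save_model("avg_week_01.pkl", model)
assert load_model("avg_week_01.pkl") == model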
Prune models
By loading the unpruned models using the Pickle library, we can run the pruning steps asynchronously
in two different Jupyter notebooks. This speeds up the process of creating the unpruned models, then
pruning the models and finally getting the results for the pruned models. The pseudocode for pruning
the models can be found in algorithm 5.
The structure of the code is quite similar to the code for creating the models, as shown in algorithm 4,
because normally we would have pruned the trees immediately after creating them. However, during
the process of pruning, more data is needed to be able to use Time Series Cross-Validation (TSCV),
which was discussed in section 4.4.2.
First, those weeks intended for validation are sampled from all weeks until this_week. A maximum
of 60 weeks is used to keep the cross-validation within a reasonable amount of time, but if fewer
weeks are available, then we use them all. The validation weeks will be sorted based on time, so that
the TSCV will be correctly applied. The validation data and costs are then split into six equally sized
parts. This will lead to 5-fold TSCV, because the last part of data is not used for training. The model
is loaded from disk and then the function ccp will apply Cost-Complexity Pruning (CCP) on the model
given the validation data and cost datasets. (For more about CCP, see section 4.2.2.) After the pruning
is completed, the model is then saved to disk again. In these experiments the MDL method is not used.
We will discuss why this was decided in chapter 7.

val_data_chunks ← split(val_data, 6)
val_cost_chunks ← split(val_cost, 6)
model ← load_model(name)
model ← ccp(model, val_data_chunks, val_cost_chunks)
save_model(name, model)
end for
end for
end function
a bookmaker does. This is called ’value betting’ [50]. If this is the case, the bookmaker’s odds of
that event will return a higher profit than the punter thinks that event is worth. To determine whether
there is indeed value in betting on an event, the punter may use the following formula: V(X) = E_p(X) − E_b(X) = E_p(X) − 1/o_X, where V(X) is the value of event X, E_p(X) is the probability of event X expected by the punter, E_b(X) is the probability of event X expected by the bookmaker and o_X are the odds of event X. If we fill in the formula using the values of an example, then we get: V(X) = 0.19 − 1/6.07 ≈ 0.19 − 0.165 = 0.025 = 2.5%. Thus, for example, if a punter were to think
that the chance of ADO Den Haag beating FC Utrecht is 19% and the bookmaker’s odds for that event
are 6.07 (resulting in a probability of 16.5%), then this outcome has a value of 2.5%. (Converting
the odds to the probability of that event happening was explained in section 2.2.1.) Every punter has
his/her own threshold value or preference for when to bet on a match or not, and can use this formula
to determine whether this threshold is met.
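The worked example can be reproduced directly (a one-off illustration; the function name is ours):

def value(punter_prob, odds):
    # V(X) = E_p(X) - 1/o_X: a positive value means the punter sees value in the bet.
    return punter_prob - 1.0 / odds

print(f"{value(punter_prob=0.19, odds=6.07):.3f}")   # 0.025, i.e. 2.5% of value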
We can also apply this value betting strategy in this project. When a model predicts the out-
come of an upcoming match it can also show what the expected costs are for this match, just like a
cost-insensitive decision tree can show the probability of the predicted outcome (see sections 4.1.3
and 4.2.3). Recall that the prediction takes place at the leaves of a decision tree and the information
of the training instances that are in these leaves are responsible for the prediction.
Since we use cost-sensitive models, we will use the expected cost of the predicted instances to
find matches that have ’value’. To see how to get the expected costs of an upcoming match, take
a look at section 4.2.3. In our case, to find matches with value, we will compare the odds of the
predicted outcome with the expected costs. This is calculated through the following formula: 𝑉(𝑋) =
𝑜 −(−1∗𝐸 (𝑋)), where 𝐸 (𝑋) is the expected cost of event 𝑋. In this case, we want to bet on matches
where the odds of event 𝑋 would result in a higher profit if predicted correctly than the expected profit
of that event 𝑋. Recall that profit is a negative cost in cost-sensitive models, so we have to multiply
the expected cost by -1 to get the expected profit.
The last step involves setting a threshold for which matches we would want to bet on. Setting this
threshold too high could mean betting on almost zero matches, but setting this threshold too low could
result in betting on too many misclassified matches, which lowers our relative profit. The algorithm which we
used to get our optimal threshold values is shown in algorithm 6.

best_threshold ← oed[max_index(threshold_profit)]
thresholds.append(best_threshold)
end for
end function
To obtain the optimal value betting strategy for each week, we first load the data of previous weeks,
since we cannot use the data of upcoming or future matches. The idea is that previous matches will
allow us to calculate the best threshold value, which results in the highest relative profit. Although this
threshold value may not be the optimal one, it was used as an approximation during this project due to
time constraints. After the data of previous weeks is obtained, the difference between the odds of the
predicted outcomes and the expected costs of these predictions is calculated. Then, for each of these
𝑜𝑒𝑑 (odds-expectation difference) values, the profit of using this as a threshold is calculated. If this
threshold results in enough matches to bet on, then we save the relative profit, otherwise we ignore
this threshold. (’Enough’ is determined by dividing the number of previous matches by 30, to ensure a
proportional representation of previous data. By the last week, this means a minimum of 100 matches
is considered.) After the profits for all oed values are calculated, we select the best threshold value for
that week and this threshold is then applied to the upcoming matches.
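A sketch of this threshold search (our own simplification of algorithm 6; oed holds the odds-expectation difference of previously predicted matches and profit the realised profit of betting on each of them):

import numpy as np

def best_threshold(oed, profit):
    oed, profit = np.asarray(oed), np.asarray(profit)
    min_matches = len(oed) // 30              # demand a proportional share of past matches
    best, best_profit = None, -np.inf
    for threshold in np.unique(oed):
        selected = oed >= threshold           # only bet on matches above the threshold
        if selected.sum() < min_matches:
            continue                          # too few matches: ignore this threshold
        rel_profit = profit[selected].sum() / selected.sum()   # relative profit of the selection
        if rel_profit > best_profit:
            best, best_profit = threshold, rel_profit
    return best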
To see how this works in practice, we display all 𝑜𝑒𝑑 values and their corresponding relative profit
in figure 5.8 for one of the runs of the avg method. The blue dots represent the relative profit and the
red line shows on how many matches would be bet. We want to bet on a minimum of 100 matches, so
when a threshold results in less than 100 matches we ignore it. We see that a profit of almost 12.5%
can be achieved when an 𝑜𝑒𝑑 threshold of 5.5 is used. However, this result is actually very hard to
achieve, since this is only the result after incorporating the data of all ten seasons. In reality, we could
only base our threshold on previous matches, so on less data than the full ten seasons. When a selection
is to be made for upcoming matches, we can only look back at matches until then, so we have to use
the best threshold up to that point in time.
Figure 5.8: Relative profit per threshold value. Blue dots depict the profit and the red line shows how many matches are
bet on.
6 Results
This chapter contains the results of the experiments as described in the previous chapter. Recall that for
each cost-combining method five runs were conducted. The results in this chapter show the average
profit values (including standard deviations), cost-confusion matrices, and statistical significance checks
over these five runs per method. The results will be split over two sections. In section 6.1 the results
for betting on all matches is shown and in section 6.2 the results for ’value betting’, betting only on
selected matches, are presented.
Table 6.1: Profit and accuracy for all methods when betting on all matches.
Figure 6.1: Profit over time for all methods when betting on all matches.
Recall that in the cost-matrices negative values signify profit, but in these figures we display profit as positive and loss as negative to avoid any misunderstanding.
On the left side the averaged cost-confusion matrices are shown and on the right the standard
deviation is displayed. Some noteworthy results: when betting only on those matches where the home
team is predicted to win using the avg method, we would have ended up with a mean profit of 28.08
units. The avg method has the biggest standard deviation for matches predicted as a draw (40.73),
but has the lowest standard deviation for matches predicted as away winning (11.47).
In most cases, the standard deviation in the diagonal of the matrices is the highest value. This is
because the profit for correctly predicting an outcome varies a lot, whereas misclassifying a match always has the same cost.
Table 6.2: p-values of the One-Way ANOVA and Kruskal-Wallis tests when betting on all matches. The tests are performed
twice: once for all six methods and once for only the four cost-combining methods. Values in bold depict significance.
Predicted Predicted
H D A tot H D A
H 873.53 -353.20 -214.00 306.33 H 55.18 17.15 17.12
Actual D -429.80 541.32 -119.00 -7.48 Actual D 16.00 27.17 10.43
A -485.20 -224.00 303.52 -405.68 A 11.27 12.02 16.96
tot -41.47 -35.88 -29.48 -106.83 tot 37.83 24.02 24.83
(a) Cost-confusion matrix for the b365 method. (b) Standard deviation for the b365 method.
Predicted Predicted
H D A tot H D A
H 892.63 -331.20 -218.20 343.23 H 29.47 31.15 19.83
Actual D -437.40 516.19 -119.00 -40.21 Actual D 5.71 41.96 10.83
A -485.20 -215.60 336.89 -363.91 A 6.97 7.89 15.63
tot -29.97 -30.61 -0.31 -60.89 tot 19.59 15.38 30.39
(c) Cost-confusion matrix for the bw method. (d) Standard deviation for the bw method.
Predicted Predicted
H D A tot H D A
H 920.88 -324.00 -226.00 370.88 H 42.24 16.38 13.10
Actual D -419.20 509.56 -140.20 -49.84 Actual D 8.82 26.54 4.96
A -473.60 -221.80 339.68 -355.72 A 15.84 14.82 25.54
tot 28.08 -36.24 -26.52 -34.67 tot 29.33 40.73 11.47
(e) Cost-confusion matrix for the avg method. (f) Standard deviation for the avg method.
Predicted Predicted
H D A tot H D A
H 883.06 -283.20 -307.80 292.06 H 23.43 12.86 17.55
Actual D -424.00 454.65 -153.00 -122.35 Actual D 16.04 40.86 5.02
A -466.40 -205.20 375.09 -296.51 A 15.84 11.12 19.05
tot -7.34 -33.75 -85.71 -126.80 tot 21.79 22.79 22.03
(g) Cost-confusion matrix for the max method. (h) Standard deviation for the max method.
Predicted Predicted
H D A tot H D A
H 875.47 -318.00 -241.40 316.07 H 66.97 24.52 22.53
Actual D -440.60 469.52 -134.20 -105.28 Actual D 22.20 36.49 14.46
A -497.80 -182.80 337.40 -343.20 A 28.10 20.06 33.81
tot -62.93 -31.28 -38.2 -132.41 tot 21.58 38.72 15.07
(i) Cost-confusion matrix for the min method. (j) Standard deviation for the min method.
Predicted Predicted
H D A tot H D A
H 899.88 -321.60 -257.60 320.68 H 31.89 12.19 14.35
Actual D -420.40 480.89 -147.80 -87.31 Actual D 22.51 45.09 13.67
A -462.20 -223.40 364.74 -320.86 A 16.70 19.71 20.15
tot 17.28 -64.11 -40.66 -87.48 tot 22.97 32.22 36.12
(k) Cost-confusion matrix for the rnd method. (l) Standard deviation for the rnd method.
Figure 6.2: Cost-confusion matrices for all methods when betting on all matches.
6.2. Value betting
Table 6.3: Profit and accuracy for all methods when value betting.
Figure 6.3: Profit over time for all methods when value betting.
assumption when all methods are compared. If only the four cost-combining methods are compared to
each other, then we see a fairly equal standard deviation and 𝑝-values below 0.05, which means that
at least one of these four methods performs significantly better or worse than the others. Some post-hoc
testing is necessary to discover which method this is.
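Both tests can be run with scipy, as in the sketch below; the per-run profits per method are made-up placeholder numbers, not the values behind tables 6.2 and 6.4.

from scipy import stats

# Hypothetical total profits of the five runs for each cost-combining method
# (method names are real, the numbers are illustrative placeholders).
profits = {
    "avg": [25.1, 28.4, 22.0, 27.3, 23.9],
    "max": [-38.2, -30.5, -41.0, -35.7, -37.9],
    "min": [-15.3, -20.1, -12.8, -18.4, -17.2],
    "rnd": [-16.0, -14.2, -19.5, -13.8, -17.0],
}

groups = list(profits.values())
f_stat, p_anova = stats.f_oneway(*groups)    # One-Way ANOVA
h_stat, p_kruskal = stats.kruskal(*groups)   # Kruskal-Wallis (non-parametric alternative)
print(f"ANOVA p = {p_anova:.4f}, Kruskal-Wallis p = {p_kruskal:.4f}")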
One of the post-hoc tests that can be used to see which method behaves differently is the Tukey Hon-
est Significant Difference (HSD) test [51]. This test checks, for each pair of methods, whether the two
methods have an equal mean. The results of the four cost-combining methods, paired in all possible
ways, can be found in table 6.5. This table shows, for each method pair, the lower and upper bounds
of the confidence interval and whether the null hypothesis should be rejected. In this case, we only
observe a significant difference between the avg and max methods. Combined with the results in
table 6.3, this means that it is very likely that the avg method performs better than max. For the other
method pairs it is not possible to draw a similar conclusion.
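A minimal sketch of the Tukey HSD test with statsmodels, reusing the same kind of made-up per-run profits as in the previous sketch; names and numbers are illustrative only.

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

profits = {
    "avg": [25.1, 28.4, 22.0, 27.3, 23.9],
    "max": [-38.2, -30.5, -41.0, -35.7, -37.9],
    "min": [-15.3, -20.1, -12.8, -18.4, -17.2],
    "rnd": [-16.0, -14.2, -19.5, -13.8, -17.0],
}
values = np.concatenate(list(profits.values()))
labels = np.repeat(list(profits.keys()), [len(v) for v in profits.values()])

# Compare every pair of methods with a family-wise error rate of 0.05.
result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(result.summary())  # mean difference, confidence interval and reject flag per pair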
Table 6.4: Results of the One-Way ANOVA and Kruskal-Wallis tests when value betting. The tests are performed twice: once for
all six methods and once for only the four cost-combining methods. Values in bold depict significance.
              Predicted
              H          D          A          tot
Actual   H    106.77     -50.80     -60.20     -4.23
         D    -37.20     78.97      -14.80     26.97
         A    -81.40     -23.60     58.73      -46.27
         tot  -11.83     4.57       -16.27     -23.53
(a) Cost-confusion matrix for the b365 method.

              Predicted
              H        D        A
Actual   H    20.86    30.92    4.66
         D    11.05    57.12    3.37
         A    15.05    26.58    11.76
         tot  21.61    7.42     13.57
(b) Standard deviation for the b365 method.

              Predicted
              H          D          A          tot
Actual   H    154.73     -56.00     -69.40     29.33
         D    -59.80     93.36      -22.20     11.36
         A    -97.60     -26.40     88.51      -35.49
         tot  -2.67      10.96      -3.09      5.20
(c) Cost-confusion matrix for the bw method.

              Predicted
              H         D         A
Actual   H    113.89    62.41     35.88
         D    65.39     112.20    16.73
         A    70.85     43.47     41.10
         tot  23.66     14.40     22.34
(d) Standard deviation for the bw method.

              Predicted
              H          D          A          tot
Actual   H    161.02     -49.00     -69.20     42.82
         D    -44.80     60.30      -24.00     -8.50
         A    -79.80     -12.40     83.02      -9.18
         tot  36.42      -1.10      -10.18     25.14
(e) Cost-confusion matrix for the avg method.

              Predicted
              H        D        A
Actual   H    34.91    10.39    6.43
         D    11.11    28.14    4.56
         A    16.94    5.68     29.25
         tot  19.70    18.59    21.77
(f) Standard deviation for the avg method.

              Predicted
              H          D          A          tot
Actual   H    156.50     -69.80     -100.80    -14.10
         D    -54.40     90.49      -24.80     11.29
         A    -89.80     -28.20     84.15      -33.85
         tot  12.30      -7.51      -41.45     -36.66
(g) Cost-confusion matrix for the max method.

              Predicted
              H        D        A
Actual   H    57.26    31.78    20.90
         D    38.88    44.92    9.68
         A    35.16    28.81    21.82
         tot  20.10    21.46    11.13
(h) Standard deviation for the max method.

              Predicted
              H          D          A          tot
Actual   H    267.33     -130.80    -109.60    26.93
         D    -102.40    202.42     -40.40     59.62
         A    -161.40    -61.80     119.86     -103.34
         tot  3.53       9.82       -30.14     -16.79
(i) Cost-confusion matrix for the min method.

              Predicted
              H        D        A
Actual   H    63.98    46.67    32.67
         D    44.78    81.31    19.37
         A    42.87    36.66    54.14
         tot  35.70    21.71    19.56
(j) Standard deviation for the min method.

              Predicted
              H          D          A          tot
Actual   H    232.77     -86.00     -94.60     52.17
         D    -95.00     115.99     -32.20     -11.21
         A    -125.00    -43.40     111.92     -56.48
         tot  12.77      -13.41     -14.88     -15.52
(k) Cost-confusion matrix for the rnd method.

              Predicted
              H         D         A
Actual   H    171.63    77.91     40.53
         D    88.07     109.66    23.76
         A    87.51     54.07     64.24
         tot  20.04     27.11     19.85
(l) Standard deviation for the rnd method.
Figure 6.4: Cost-confusion matrices for all methods when value betting.
Table 6.5: Results of the Tukey HSD test between cost-combining methods when value betting. The mean difference, the lower
and upper limits of the confidence interval, and whether the null hypothesis can be rejected (in bold) are shown.
7
Discussion and conclusion
This research tested different cost-sensitive methods on football data to see whether there is a method that
performs better than the others and, as a result, leads to more profit. Two methods use the costs of
two individual bookmakers, while the other four use different ways to combine the costs of multiple
bookmakers. The results of these methods were gathered both for when the model was tasked
to bet on all matches using these costs, and for when the model employed ’value betting’: betting
only on specific matches that were expected to pass a threshold and yield a profit. The first research
objective was to investigate whether cost-combining methods work better than single-cost methods
and the second objective was to check which of those cost-combining methods works best.
We will dive into the first research question in section 7.1. Then in section 7.2 we will discuss
the results of all cost-combining methods. In section 7.3 more general findings from this project are
described. A conclusion is given in section 7.4.
performs significantly differently, so this means that there is also no significant difference between a
single-cost and a cost-combining method. This counters our hypothesis that a cost-combining method
would outperform the predictions of a single bookmaker. However, we should keep in mind that these
tests were run on only five runs per method in total, which is hardly enough to state conclusively that
there can be no (significant) difference between such methods. Furthermore, this project was restricted
to analyzing the odds of ten specific bookmakers for ten seasons of the Dutch national football league.
Other data sources may yield different outcomes. For the experiments conducted in this project, we
cannot conclude that combining the costs of multiple bookmakers is better than using the costs of a
single bookmaker.
method due to time constraints. Since we use the data of ten seasons to test the cost-combining
methods and it takes almost two days to create, prune and obtain the results of one method, it could
have been better to run more experiments for fewer seasons. This trade-off is unfortunately inevitable
when considering deadlines and relatively limited hardware. Although focusing on fewer seasons would
have resulted in short-term-based models, it would have been possible to run many more experiments,
and perhaps more robust conclusions could have been drawn. For example, if data of only five seasons
had been tested, it would have been possible to run around fifteen runs for each method instead of
five, because updating the decision trees for later weeks takes much more time than for weeks in the
first season, when the tree still contains little data. If we could start this project from scratch, we might
have used the data of only five seasons. However, there is also value in building models over longer
periods of time. After all, the bookmakers themselves probably base their predictions on long-term
information as well, to reach the level of accuracy we have seen. As the outcomes of football matches
are highly variable, it may be the case that using as many seasons as possible leads to the most
predictive models (though it may also be the case that predictions stabilize after a set number of
seasons). Since we saw the best results when value betting, there is reason to suspect that using more
seasons would provide more robust results, as more data can lead to more stable threshold values. In
conclusion, we do not see an optimal way to decide on these experimental limitations given time
constraints; we only hope that explaining our perspective can inform the choices of future work.
When looking at the accuracy of the methods in tables 6.1 and 6.3, we see that these are (much)
lower than the accuracy values reported in chapter 3. When betting on all matches the accuracy of
all our methods lies around 40%, and this drops to around 20-30% when value betting. However, the
profits of most methods are higher when the accuracy is lower: the avg method managed to achieve a
profit of 6.96% with an overall accuracy of only 23.76%, far below the accuracy values reported in
[11, 12, 33]. While these papers had a different primary goal than we had, it is interesting to see that
correctly predicting as many matches as possible need not be the preferred goal when the objective of
a project is to reach the highest profit. In future research, this contrast should be taken into account
when deciding on the objective: increasing accuracy and increasing profit may not be related.
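A purely illustrative back-of-the-envelope calculation of why the two objectives can diverge (the odds below are hypothetical, not averages from our data): with one unit staked per match, a less accurate strategy on longer odds can out-earn a more accurate strategy on short odds.

def total_profit(n_bets, accuracy, decimal_odds, stake=1.0):
    # Each correct bet returns stake * (odds - 1); each wrong bet loses the stake.
    wins = n_bets * accuracy
    losses = n_bets - wins
    return wins * stake * (decimal_odds - 1) - losses * stake

print(total_profit(100, 0.50, 1.80))  # -10.0 units despite 50% accuracy
print(total_profit(100, 0.25, 4.60))  # +15.0 units despite 25% accuracy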
Additionally, when betting on all matches, it was interesting that no method was able to achieve
a net profit. Betting in this project only turned profitable when selecting specific matches to bet on,
i.e. when value betting. In fact, four out of the six investigated methods improved their relative profit
when value betting, even though their accuracy dropped significantly. Once again, we see that accuracy
rates are not necessarily linked to higher profits. Furthermore, this shows that punters should decide
carefully on which matches they should bet, as this can determine whether they make a profit or not,
potentially more so than e.g. betting on more likely outcomes (such as home wins). Matches that
are bet on must have enough ’value’ to expect a profit in the long run. This may also prove an
interesting area for further machine learning research; future work can attempt to (further) exploit
this phenomenon.
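The exact selection rule used in this project may differ, but a common way to express such a ’value’ criterion, sketched below under that assumption, is to require the expected profit per unit staked to exceed a threshold.

def is_value_bet(model_probability, decimal_odds, threshold=0.05):
    # Bet only if the expected profit per unit staked exceeds the threshold;
    # the 0.05 threshold is an illustrative choice, not the project's value.
    expected_value = model_probability * decimal_odds - 1.0
    return expected_value > threshold

print(is_value_bet(0.40, 3.00))  # True: expected value of +0.20 units per unit staked
print(is_value_bet(0.40, 2.20))  # False: expected value of -0.12 units per unit staked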
In section 4.1.2 the MDL method was described, but this pruning method was not used for the
results shown in chapter 6. After the first cost-sensitive decision trees were pruned with this method,
we noticed that the trees were not optimally pruned. The MDL technique might work decently for
standard cost-insensitive decision trees, but when implementing cost-sensitive learning we noticed
that the trees were pruned proportionally to the depth of the tree. This is to be expected considering
how the technique works. When CCP pruning was used instead, the cost-sensitive decision trees were
often pruned back to the root node for some weeks. This can also be seen in appendix H: all methods
predicted many more home wins than draws or away wins, because a tree that is pruned down to the
root node predicts every upcoming match as a home win. This never occurred when using MDL as the
pruning method, since its proportionate pruning leaves a larger part of the tree intact and never cuts
down to the root; nevertheless, the results obtained with MDL suffered because the trees remained
insufficiently pruned. However, MDL was not tested in combination with the value betting technique,
so this might be interesting to investigate in future work.
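For reference, the collapsing behaviour of cost-complexity pruning can be reproduced with a standard, cost-insensitive scikit-learn tree; this is only a sketch of CCP on synthetic data, not of the incremental cost-sensitive tree used in this project.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the match features; with a large enough ccp_alpha
# the pruned tree collapses to the root and predicts only the majority class.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
for alpha in (path.ccp_alphas[0], path.ccp_alphas[-1]):
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"ccp_alpha={alpha:.4f}  node count={tree.tree_.node_count}")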
In chapter 3 we mentioned that paper [34] was released too close to our own publication date to
take advantage of its findings. The best model presented in that paper managed to reach an accuracy
of 56.7%, which is the highest value we could find among experiments done over multiple seasons
while betting on all matches. These results were obtained after carefully constructing a feature dataset
that, in addition to features comparable to ours, contained information about the streak and the form
of the teams. These extra features could also have been very useful in our research, but unfortunately
it was not possible to incorporate them into our experiments. For future research we would thus also
recommend spending more time on feature engineering, as this may lead to better results. In the case
of [34], feature engineering resulted in a higher accuracy score, but we do not know whether it would
also have resulted in a higher profit.
Our results show a relatively high variance and this is mainly due to using a single decision tree
for predicting the outcomes [52]. One way to lower this variance would be to use an ensemble of
decision tree models and combine their predictions in some way into a single prediction. If the
experiment were run again and one decision tree gave a different outcome, this would hardly influence
the final (aggregated) outcome, resulting in a lower variance. The authors of [41], which was used to
create the cost-sensitive decision tree, also showed how to use an ensemble of cost-sensitive decision
trees [53]. In that paper they tested multiple cost-sensitive ensemble techniques and showed that
using these ensemble models indeed improved the results.
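A minimal sketch of the general idea, using plain scikit-learn trees, bootstrap samples and a majority vote; the cost-sensitive ensemble techniques of [53] are more involved than this.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data stands in for the H/D/A match outcomes.
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
predictions = []
for _ in range(25):
    idx = rng.integers(0, len(X_train), size=len(X_train))   # bootstrap sample
    tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
    predictions.append(tree.predict(X_test))

# Majority vote over the individual trees gives the ensemble prediction.
votes = np.stack(predictions)
ensemble_pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print((ensemble_pred == y_test).mean())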
One small note about our dataset: although we simply disregarded the missing odds of bookmak-
ers, there are other ways to deal with such missing data. In future research, these missing odds may
be imputed through various methods, e.g. regression, so that we have a more complete dataset of the
bookmakers’ odds. We were only able to compare two single bookmakers’ odds, since only two bookmakers
published odds for all matches in our dataset. If the missing odds in the other bookmakers’ data are
imputed, the results of more individual bookmakers can be compared both with each other and with
our cost-combining methods. Additionally, there are of course many other cost-combining methods
one can conceive to combine these bookmakers’ odds. One example may be attempting to distill the
so-called ’true’ odds mentioned in chapter 3 and averaging those to remove the bookmakers’ margins
from the equation.
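One such additional cost-combining method could look like the sketch below: normalise each bookmaker's implied probabilities to strip the margin, then average the resulting ’true’ probabilities. The odds shown are illustrative, not taken from our dataset.

import numpy as np

bookmaker_odds = np.array([
    [2.10, 3.40, 3.60],   # bookmaker 1: home / draw / away (illustrative odds)
    [2.05, 3.50, 3.75],   # bookmaker 2
    [2.15, 3.30, 3.55],   # bookmaker 3
])

implied = 1.0 / bookmaker_odds
true_probs = implied / implied.sum(axis=1, keepdims=True)  # remove each bookmaker's overround
avg_true_probs = true_probs.mean(axis=0)                   # combine the margin-free probabilities
fair_odds = 1.0 / avg_true_probs

print(avg_true_probs.round(3), fair_odds.round(2))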
Finally, there are some other ways to optimize for profit worth mentioning. It may be possible to better
predict future matches by relying more heavily on recent information in the dataset, prioritizing it
over older observations. After all, a match played last year may tell us more about a football team
than one played nine years ago. In our model, all previous matches were weighted equally, which may
increasingly influence the results as the dataset spans a longer time frame. There are also other ways
to optimize for profit than directly incorporating the costs in the model. It is, for example, also possible
to predict how high a stake should be for each match, or to create a model that predicts for which
matches a bet should be placed and for which matches it is better not to place a bet, rather than doing
this post hoc using value betting. To our knowledge, these strategies have not yet been explored and
published, so investigating multiple techniques to optimize for profit might be interesting for future
work.
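As a hedged sketch of the first idea, per-match weights could decay exponentially with age and be passed to a learner that accepts sample weights; the two-season half-life below is an arbitrary illustrative choice, as are the dates.

import numpy as np
import pandas as pd

matches = pd.DataFrame({"date": pd.to_datetime(["2008-08-30", "2012-03-04", "2016-05-08"])})
reference_date = pd.Timestamp("2017-05-14")   # date of the match to be predicted
half_life_days = 2 * 365                      # weight halves every two seasons

age_days = (reference_date - matches["date"]).dt.days
matches["weight"] = np.power(0.5, age_days / half_life_days)
print(matches)   # older matches receive smaller weights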
7.4. Conclusion
This project compared two types of cost-sensitive methods. The first type used the odds of a single
bookmaker to achieve a profit by predicting the outcome of future matches correctly, and the second
type used a combination of the odds of multiple bookmakers to achieve this goal. The first objective
was to investigate whether the cost-combining methods performed better than the single-cost methods.
After reviewing the results, we could not find a significant difference between these two.
The second objective was to find the best cost-combining method of the four that were tested.
From our experiments we can conclude that using the average of all costs works better than using the
maximal costs when value betting, i.e. betting on only a selected number of matches, but no other
conclusions could be drawn for the other methods due to a lack of significance in the results.
We have also shown that a higher accuracy rate does not always result in a higher profit.
While other papers had predicting as many matches correctly as possible as their primary goal, we
wanted to optimize our model for the highest profit. This resulted in accuracy levels between 20% and
30%, which is much lower than the accuracy values of other papers (around 50%), but in the end we
managed to obtain a profit of 6.96%, which to our knowledge has not been achieved before.
Since these results were obtained with only five runs per method, our conclusions are unfortunately
not very robust. More runs are needed to achieve a better understanding of these various methods,
although we hope to have provided a starting point for further research. Due to time constraints we
were unable to conduct more of these experiments, since obtaining the results for one run took us
almost two days. With the help of faster hardware, or by limiting the size of the data used (though
there are some trade-offs to consider), this could be resolved.
Finally, we have created an incremental, cost-sensitive decision tree in Python with which to
conduct our experiments. Without this model, it would have taken even longer to run the experiments.
To our knowledge, no incremental decision tree model that can incorporate costs existed, so we had
to create one ourselves. It took us quite some time to get this working, but in the end it saved more
time than building a standard cost-sensitive decision tree from scratch for every week would have.
Future research may benefit from this incremental cost-sensitive decision tree model, as it will greatly
increase the speed with which that research can be conducted.
Bibliography
[1] Marktscan landgebonden kansspelen 2016, https://www.kansspelautoriteit.nl/
publish/pages/4264/marktscan_landbased_2016.pdf (2017), accessed: 7 June 2018.
[2] Global 35.97 billion online gambling market growth at cagr of 10.81%, 2014-2020 - mar-
ket to reach 66.59 billion with growth very geography specific - research and markets,
https://www.prnewswire.com/news-releases/global-3597-billion-online-
gambling-market-growth-at-cagr-of-1081-2014-2020---market-to-reach-
6659-billion-with-growth-very-geography-specific---research-and-
markets-300348358.html (2016), accessed: 8 April 2018.
[3] Sports betting market to rise at nearly 8.62% cagr to 2022, https://www.prnewswire.com/
news-releases/sports-betting-market-to-rise-at-nearly-862-cagr-to-
2022-669711413.html (2018), accessed: 7 June 2018.
[4] A comparison of forecasting methods: fundamentals, polling, prediction markets, and experts,
http://www.researchdmr.com/ForecastingOscar.pdf (2014), accessed: 9 May 2018.
[5] A simple deep learning model for stock price prediction using tensorflow, https:
//medium.com/mlreview/a-simple-deep-learning-model-for-stock-price-
prediction-using-tensorflow-30505541d877 (2017), accessed: 23 May 2018.
[6] With germany’s win microsoft perfectly predicted the world cup’s knockout round, https://
qz.com/233830/world-cup-germany-argentina-predictions-microsoft/ (2014),
accessed: 1 May 2018.
[9] A. Groll, C. Ley, G. Schauberger, and H. Van Eetvelde, Prediction of the FIFA World Cup 2018 - A
random forest approach with an emphasis on estimated team ability parameters, ArXiv e-prints
(2018), arXiv:1806.03208 [stat.AP] .
[10] J. S. Xu, Online sports gambling: A look into the efficiency of bookmakers’ odds as forecasts in the
case of english premier league, Unpublished undergraduate dissertation, University of California,
Berkeley (2011).
[11] B. Ulmer, M. Fernandez, and M. Peterson, Predicting Soccer Match Results in the English Premier
League, Doctoral dissertation, Stanford (2013).
[12] J. (@opisthokonta), Predicting football results with adaptive boosting, https://
opisthokonta.net/?p=809 (2014), accessed: 18 May 2018.
[16] N. Vlastakis, G. Dotsis, and R. N. Markellos, How efficient is the european football betting market?
evidence from arbitrage and trading strategies, Journal of Forecasting 28, 426 (2009).
[17] N. S. Barrett, The Daily Telegraph chronicle of horse racing (Guinness Pub., 1995).
[20] A. Sacks and A. Ryan, Economic impact of legalized sports betting, https://
www.americangaming.org/sites/default/files/AGA-Oxford%20-%20Sports%
20Betting%20Economic%20Impact%20Report1.pdf (2007), accessed: 2 July 2018.
[22] E. F. Fama, Efficient capital markets: A review of theory and empirical work, The journal of Finance
25, 383 (1970).
[23] M. Cain, D. Law, and D. Peel, The favourite-longshot bias and market efficiency in uk football
betting, Scottish Journal of Political Economy 47, 25 (2000).
[24] J. Goddard and I. Asimakopoulos, Forecasting football results and the efficiency of fixed-odds
betting, Journal of Forecasting 23, 51 (2004).
[25] H. O. Stekler, D. Sendor, and R. Verlander, Issues in sports forecasting, International Journal of
Forecasting 26, 606 (2010).
[26] B. Deschamps and O. Gergaud, Efficiency in betting markets: evidence from english football, The
Journal of Prediction Markets 1, 61 (2012).
[27] G. Bernardo, M. Ruberti, and R. Verona, Testing semi-strong efficiency in a fixed odds betting
market: Evidence from principal European football leagues, Tech. Rep. (University Library of
Munich, Germany, 2015).
[28] J. Buchdahl, Squares & Sharps, Suckers & Sharks: The Science, Psychology & Philosophy of
Gambling, Vol. 16 (Oldcastle Books, 2016).
[29] E. Štrumbelj and M. R. Šikonja, Online bookmakers’ odds as forecasts: The case of european
soccer leagues, International Journal of Forecasting 26, 482 (2010).
[31] M. J. Dixon and S. G. Coles, Modelling association football scores and inefficiencies in the football
betting market, Journal of the Royal Statistical Society: Series C (Applied Statistics) 46, 265
(1997).
[32] R. H. Koning, Balance in competition in dutch soccer, Journal of the Royal Statistical Society:
Series D (The Statistician) 49, 419 (2000).
[33] A. Joseph, N. E. Fenton, and M. Neil, Predicting football results using bayesian nets and other
machine learning techniques, Knowledge-Based Systems 19, 544 (2006).
[34] R. Baboota and H. Kaur, Predictive analysis and modelling football results using machine learning
approach for english premier league, International Journal of Forecasting (2018).
[35] P. Domingos, Metacost: A general method for making classifiers cost-sensitive, in Proceedings of
the fifth ACM SIGKDD international conference on Knowledge discovery and data mining (ACM,
1999) pp. 155–164.
[36] C. Elkan, The foundations of cost-sensitive learning, in International joint conference on artificial
intelligence, Vol. 17 (Lawrence Erlbaum Associates Ltd, 2001) pp. 973–978.
[37] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan, Adacost: misclassification cost-sensitive boosting,
in Icml, Vol. 99 (1999) pp. 97–105.
[38] J. Du, Z. Cai, and C. X. Ling, Cost-sensitive decision trees with pre-pruning, in Advances in
Artificial Intelligence (Springer, 2007) pp. 171–179.
[39] B. Zhou and Q. Liu, A comparison study of cost-sensitive classifier evaluations, in International
Conference on Brain Informatics (Springer, 2012) pp. 360–371.
[40] P. D. Turney, Types of cost in inductive concept learning, arXiv preprint cs/0212034 (2002).
[41] A. C. Bahnsen, D. Aouada, and B. Ottersten, Example-dependent cost-sensitive decision trees,
Expert Systems with Applications 42, 6609 (2015).
[42] L. Breiman, J. Friedman, R. Olshen, and C. J. Stone, Classification and regression trees (Chapman
and Hall/CRC, 1984).
[43] P. E. Utgoff, N. C. Berkman, and J. A. Clouse, Decision tree induction based on efficient tree
restructuring, Machine Learning 29, 5 (1997).
[44] Classification trees, www.cs.uu.nl/docs/vakken/mdm/trees.pdf (2017), accessed: 14
February 2018.
[45] P. E. Utgoff, Incremental induction of decision trees, Machine learning 4, 161 (1989).
[46] P. E. Utgoff, An improved algorithm for incremental induction of decision trees, in Machine Learning
Proceedings 1994 (Elsevier, 1994) pp. 318–325.
[47] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning, Vol. 1 (Springer
series in statistics New York, 2001).
[48] R. J. Hyndman and G. Athanasopoulos, Forecasting: principles and practice (OTexts, 2014).
[49] Machine Learning Laboratory incremental decision tree induction, https://www-
ml.cs.umass.edu/iti/index.html (2001), accessed: 26 February 2018.
[50] V. Y. Kotlyar and O. Smyrnova, Betting market: analysis of arbitrage situations, Cybernetics and
Systems Analysis 48, 912 (2012).
[51] Comparing more than two means: One-way anova, https://brownmath.com/stat/
anova1.htm (2016), accessed: 3 July 2018.
[52] T. G. Dietterich and E. B. Kong, Machine learning bias, statistical bias, and statistical variance of
decision tree algorithms, Tech. Rep. (Technical report, Department of Computer Science, Oregon
State University, 1995).
[53] A. C. Bahnsen, D. Aouada, and B. Ottersten, Ensemble of example-dependent cost-sensitive
decision trees, arXiv preprint arXiv:1505.04637 (2015).
A
Incremental decision tree example
Below, the data structure of an incremental decision tree is depicted. This example is based on the
decision tree shown in figure 4.2b, since showing a deeper decision tree would take up too many
pages. The red class is represented as class 0 and the blue class as class 1.
{'best_variable': 0,
 'class_counts': {'costs': 0, 'count': 1, 'height': 1, 'key': 1.0,
                  'left': {'costs': 0, 'count': 1, 'height': 0, 'key': 0.0,
                           'left': None, 'right': None},
                  'right': None},
 'flags': [False, False],
 'instance_costs': [],
 'instances': [],
 'left': {'best_variable': None,
          'class_counts': {'costs': 0, 'count': 1, 'height': 0, 'key': 0.0,
                           'left': None, 'right': None},
          'flags': [False, False],
          'instance_costs': [0],
          'instances': [array([0.14, 0.09, 0.])],
          'left': None,
          'mdl': 0,
          'n_instances': 1,
          'n_variables': 2,
          'right': None,
          'variables': []},
 'mdl': 0,
 'n_instances': 2,
 'n_variables': 2,
 'right': {'best_variable': None,
           'class_counts': {'costs': 0, 'count': 1, 'height': 0, 'key': 1.0,
                            'left': None, 'right': None},
           'flags': [False, False],
           'instance_costs': [0],
           'instances': [array([0.2, 0.99, 1.])],
           'left': None,
           'mdl': 0,
           'n_instances': 1,
           'n_variables': 2,
           'right': None,
           'variables': []},
 'variables': [
     {'class_counts': {'costs': 0, 'count': 1, 'height': 1, 'key': 1.0,
                       'left': {'costs': 0, 'count': 1, 'height': 0, 'key': 0.0,
                                'left': None, 'right': None},
                       'right': None},
      'count': 2,
      'cutpoint': 0.17,
      'metric_value': 0.0,
      'new_cutpoint': 0.17,
      'numeric_value_counts': {'class_counts': {'costs': 0, 'count': 1, 'height': 0,
                                                'key': 1.0, 'left': None, 'right': None},
                               'count': 1,
                               'height': 1,
                               'key': 0.2,
                               'left': {'class_counts': {'costs': 0, 'count': 1, 'height': 0,
                                                         'key': 0.0, 'left': None, 'right': None},
                                        'count': 1,
                                        'height': 0,
                                        'key': 0.14,
                                        'left': None,
                                        'metric_value': 0.5,
                                        'right': None},
                               'metric_value': 0.0,
                               'right': None}},
     {'class_counts': {'costs': 0, 'count': 1, 'height': 1, 'key': 1.0,
                       'left': {'costs': 0, 'count': 1, 'height': 0, 'key': 0.0,
                                'left': None, 'right': None},
                       'right': None},
      'count': 2,
      'cutpoint': 0.0,
      'metric_value': 0.0,
      'new_cutpoint': 0.54,
      'numeric_value_counts': {'class_counts': {'costs': 0, 'count': 1, 'height': 0,
                                                'key': 1.0, 'left': None, 'right': None},
                               'count': 1,
                               'height': 1,
                               'key': 0.99,
                               'left': {'class_counts': {'costs': 0, 'count': 1, 'height': 0,
                                                         'key': 0.0, 'left': None, 'right': None},
                                        'count': 1,
                                        'height': 0,
                                        'key': 0.09,
                                        'left': None,
                                        'metric_value': 0.5,
                                        'right': None},
                               'metric_value': 0.0,
                               'right': None}}]}
B
Football team names
Table B.1 lists the assigned team abbreviations next to the full names of the teams that competed in
the Dutch Eredivisie league between the 2006/2007 and 2016/2017 seasons.
C
Information about the intermediate dataset
Table C.1 depicts which information is collected about each football team for the intermediate dataset.
The first column shows the attribute name, the second column contains a short description of the
attribute and the last column shows which (Python) data type the attribute is. This intermediate
dataset was created to make feature extraction easier.
D
Combining data from different sources
In this project, we only had to deal with two different data sources, but even within one data source
some preprocessing is required. For example, data downloaded from [19] is split into different seasons.
This means that the data for the seasons 2006/2007, 2007/2008 up to 2016/2017 are stored in separate
CSV files. Between these files, there are various naming conventions for the different teams. For
example, some CSV files use the name ’ADO Den Haag’, others only use ’ADO’. These refer to the same
team, so when combining the data we must ensure that the information about this team is correctly
combined. This is also an issue when we want to combine these files with the data from the other
source [18]. We solved this by creating a mapping from all possible ways teams are denoted in the
data to a chosen acronym. Each team was assigned its own three-letter acronym for usability purposes.
These naming conventions can be found in appendix B.
After this mapping is completed, the different CSV files and the SQL database from Kaggle are
stored in separate pandas dataframes. A pandas dataframe is part of the popular Python library
Pandas and is often used in data science projects. The created dataframes have the same columns,
so it is easy to concatenate them into one big dataframe. This project only created a dataframe for
the matches of the Dutch Eredivisie football league, but this would also be possible for more countries.
Our dataframe now contains all the matches from season 2006/2007 until 2016/2017, but this raw
data is not very useful for the decision tree model to use as features.
To be able to extract the features from this dataframe, we first created a helper dataset which con-
tains all required information per team per date. This makes it easier to acquire all necessary features.
For each team and match date we store this date, the opponent and other match-specific information
such as goals and points scored. For more information about the dataset, see appendix C. After this
process was finished, our dataset contained 10,404 rows by 12 columns, including the information of
3,363 football matches.
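A condensed sketch of this preprocessing is shown below; the column names and the (partial) name-to-acronym mapping are illustrative examples, not the exact ones used in the project, and the two tiny dataframes stand in for the season CSV files.

import pandas as pd

# Two tiny stand-ins for season CSV files downloaded from [19]; the real files
# contain many more columns (column names and spellings here are illustrative).
season_2006 = pd.DataFrame({"HomeTeam": ["ADO Den Haag", "Ajax"],
                            "AwayTeam": ["Feyenoord", "ADO Den Haag"]})
season_2007 = pd.DataFrame({"HomeTeam": ["ADO", "Feyenoord"],
                            "AwayTeam": ["Ajax", "ADO"]})

# Map every naming variant of a team onto one chosen three-letter acronym.
name_to_acronym = {"ADO Den Haag": "ADO", "ADO": "ADO",
                   "Ajax": "AJA", "Feyenoord": "FEY"}

frames = []
for df in (season_2006, season_2007):
    df = df.copy()
    df["HomeTeam"] = df["HomeTeam"].map(name_to_acronym)
    df["AwayTeam"] = df["AwayTeam"].map(name_to_acronym)
    frames.append(df)

# All seasons share the same columns, so they concatenate into one big dataframe.
all_matches = pd.concat(frames, ignore_index=True)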
E
Features
Table E.1 on page 64 shows which features are used when training the models. These features describe
the difference between the home team and the away team: the statistic of the away team is subtracted
from that of the home team. For example, for feature 𝑟_𝑤_𝑡 the total number of wins in the current
season of the away team is subtracted from the total number of wins in the current season of the home
team. This means that if this feature is positive, the home team won more matches than the away
team; if this feature is zero, both teams won the same number of matches; and if this feature is
negative, the away team won more matches than the home team.
When a feature has an 𝑥 in its name, this feature is used three times, where 𝑥 ∈ {1, 2, 3}. When
a feature has an 𝑟 in its name, only those matches are used where the team played on the same side
as in this match. So for the home team the statistic is calculated over matches the home team played
at home, and for the away team over matches the away team played away. When a feature has an 𝑚
in its name, only the matches are considered where these two teams competed against each other.
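As an illustration of how such a difference feature can be computed from the intermediate dataset, the sketch below derives a single home-minus-away win difference; the table layout, column names and values are simplified examples, not the project's exact code.

import pandas as pd

# A tiny stand-in for the per-team statistics in the intermediate dataset.
team_stats = pd.DataFrame(
    {"wins": [20, 14]},                     # win counts used for this feature
    index=pd.Index(["AJA", "FEY"], name="team"),
)

home_team, away_team = "AJA", "FEY"
r_w_t = team_stats.loc[home_team, "wins"] - team_stats.loc[away_team, "wins"]
print(r_w_t)   # > 0: the home team has won more matches than the away team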
Feature Description
year Calendar year when match is played
month Calendar month when match is played
outcome ’H’ when home team won, ’D’ for a draw or ’A’ when away team won
season B365 BS BW GB IW LB PS SJ VC WH
2000/2001 306 306 306 1 4 306 306 306 306 3
2001/2002 306 306 306 79 13 306 306 306 306 13
2002/2003 6 306 306 0 12 306 306 306 306 9
2003/2004 1 306 306 0 7 306 306 306 306 4
2004/2005 0 306 1 0 4 51 306 306 306 5
2005/2006 0 306 0 0 4 6 306 0 6 4
2006/2007 0 306 0 0 2 4 306 0 6 0
2007/2008 0 1 0 0 1 1 306 0 2 81
2008/2009 1 1 1 1 4 1 306 1 3 4
2009/2010 1 2 1 1 3 1 306 1 1 4
2010/2011 0 0 0 0 0 0 306 0 1 0
2011/2012 0 0 0 0 3 0 306 0 0 0
2012/2013 1 1 1 1 1 2 23 2 1 2
2013/2014 0 306 0 306 0 1 4 0 0 1
2014/2015 0 306 0 306 0 4 2 261 0 0
2015/2016 0 306 0 306 2 0 1 306 1 0
2016/2017 0 306 0 306 0 0 3 306 1 0
G
Profit over time for betting on all matches
Figures G.1 to G.6 show the profit from the start until season 2016/2017 when betting on all matches
for all five runs of each method. The dotted black line depicts the average of all five runs.
Figure G.1: Profit over time when betting on all matches for the b365 method.
Figure G.2: Profit over time when betting on all matches for the bw method.
Figure G.3: Profit over time when betting on all matches for the avg method.
Figure G.4: Profit over time when betting on all matches for the max method.
Figure G.5: Profit over time when betting on all matches for the min method.
Figure G.6: Profit over time when betting on all matches for the rnd method.
H
Confusion matrices for betting on all matches
The confusion matrices for each method when betting on all matches are shown in figure H.1 on
page 72. The left side shows the confusion matrices while the right side shows the standard deviation
values. The values inside the confusion matrices are the averaged values of the five runs of each
method.
              Predicted
              H         D        A
Actual   H    886.8     353.2    214.0
         D    429.8     174.2    119.0
         A    485.2     224.0    170.8
         tot  1801.8    751.4    503.8
(a) Confusion matrix for the b365 method.

              Predicted
              H       D       A
Actual   H    32.8    17.2    17.1
         D    16.0    8.6     10.4
         A    11.3    12.0    10.5
(b) Standard deviation for the b365 method.

              Predicted
              H         D        A
Actual   H    904.6     331.2    218.1
         D    437.4     166.6    119.0
         A    485.2     215.6    179.2
         tot  1827.2    713.4    516.3
(c) Confusion matrix for the bw method.

              Predicted
              H       D       A
Actual   H    24.7    31.2    19.8
         D    5.7     14.8    10.8
         A    7.0     7.8     5.5
(d) Standard deviation for the bw method.

              Predicted
              H         D        A
Actual   H    904.0     324.0    226.0
         D    419.2     163.6    140.2
         A    473.6     221.8    184.6
         tot  1796.8    709.4    550.8
(e) Confusion matrix for the avg method.

              Predicted
              H       D       A
Actual   H    26.3    16.4    13.1
         D    8.8     7.1     5.0
         A    15.8    14.8    6.1
(f) Standard deviation for the avg method.

              Predicted
              H         D        A
Actual   H    863.0     283.2    307.8
         D    424.0     146.0    153.0
         A    466.4     205.2    208.4
         tot  1753.4    634.4    669.2
(g) Confusion matrix for the max method.

              Predicted
              H       D       A
Actual   H    25.7    12.9    17.6
         D    16.0    12.1    5.0
         A    15.8    11.1    10.2
(h) Standard deviation for the max method.

              Predicted
              H         D        A
Actual   H    894.6     318.0    241.4
         D    440.6     148.2    134.2
         A    497.8     182.8    199.4
         tot  1833.0    649.0    575.0
(i) Confusion matrix for the min method.

              Predicted
              H       D       A
Actual   H    42.2    24.5    22.5
         D    22.2    12.8    14.5
         A    28.1    20.1    15.3
(j) Standard deviation for the min method.

              Predicted
              H         D        A
Actual   H    874.8     321.6    257.6
         D    420.4     154.8    147.8
         A    462.2     223.4    194.4
         tot  1757.4    699.8    599.8
(k) Confusion matrix for the rnd method.

              Predicted
              H       D       A
Actual   H    19.7    12.2    14.3
         D    22.5    11.8    13.7
         A    16.7    19.7    6.8
(l) Standard deviation for the rnd method.
Figure H.1: Confusion matrices for all methods when betting on all matches.
Figure I.1: Normality check statistics for all methods when betting on all matches. The test statistic, the 𝑝-value and the size
of the data are shown.
J
Profit over time for value betting
Table J.1 depicts how many matches were selected to bet on for each method and each run.
Figures J.1 to J.6 show the profit from the start until season 2016/2017 when value betting for all five
runs of each method. The dotted black line depicts the average of all five runs.
Table J.1: Number of matches selected to bet on for each method and for each run.
Figure J.1: Profit over time when value betting for the b365 method.
Figure J.2: Profit over time when value betting for the bw method.
Figure J.3: Profit over time when value betting for the avg method.
Figure J.4: Profit over time when value betting for the max method.
Figure J.5: Profit over time when value betting for the min method.
Figure J.6: Profit over time when value betting for the rnd method.
K
Confusion matrices for value betting
The confusion matrices for each method when value betting are shown in figure K.1 on page 80. The
left side shows the confusion matrices while the right side shows the standard deviation values. The
values inside the confusion matrices are the averaged values of the five runs of each method.
              Predicted
              H        D       A
Actual   H    46.4     50.8    60.2
         D    37.2     19.6    14.8
         A    81.4     23.6    11.2
         tot  165.0    94.0    86.2
(a) Confusion matrix for the b365 method.

              Predicted
              H       D       A
Actual   H    20.4    30.9    4.7
         D    11.1    18.2    3.4
         A    15.1    26.6    6.3
(b) Standard deviation for the b365 method.

              Predicted
              H        D        A
Actual   H    90.8     56.0     69.4
         D    59.8     23.4     22.2
         A    97.6     26.4     24.4
         tot  248.2    105.8    116.0
(c) Confusion matrix for the bw method.

              Predicted
              H        D       A
Actual   H    135.1    62.4    35.9
         D    65.4     34.0    16.7
         A    70.9     43.5    30.3
(d) Standard deviation for the bw method.

              Predicted
              H        D       A
Actual   H    58.0     49.0    69.2
         D    44.8     13.2    24.0
         A    79.8     12.4    15.8
         tot  182.6    74.6    109.0
(e) Confusion matrix for the avg method.

              Predicted
              H       D       A
Actual   H    24.7    10.4    6.4
         D    11.1    6.8     4.6
         A    16.9    5.7     7.7
(f) Standard deviation for the avg method.

              Predicted
              H        D        A
Actual   H    59.4     69.8     100.8
         D    54.4     21.4     24.8
         A    89.8     28.2     20.4
         tot  203.6    119.4    146.0
(g) Confusion matrix for the max method.

              Predicted
              H       D       A
Actual   H    58.5    31.8    20.9
         D    38.9    15.2    9.7
         A    35.2    28.8    17.0
(h) Standard deviation for the max method.

              Predicted
              H        D        A
Actual   H    163.6    130.8    109.6
         D    102.4    56.4     40.4
         A    161.4    61.8     44.0
         tot  427.4    249.0    194.0
(i) Confusion matrix for the min method.

              Predicted
              H       D       A
Actual   H    84.9    46.7    32.7
         D    44.8    25.8    19.4
         A    42.9    36.7    27.2
(j) Standard deviation for the min method.

              Predicted
              H        D        A
Actual   H    162.6    86.0     94.6
         D    95.0     33.6     32.2
         A    125.0    43.4     36.6
         tot  382.6    163.0    163.4
(k) Confusion matrix for the rnd method.

              Predicted
              H        D       A
Actual   H    162.3    77.9    40.5
         D    88.1     34.4    23.8
         A    87.5     54.1    39.6
(l) Standard deviation for the rnd method.
Figure K.1: Confusion matrices for all methods when value betting.
Figure L.1: Normality check statistics for all methods when value betting. The test statistic, the 𝑝-value and the size of the
data are shown.