
Using cost-sensitive learning to forecast football matches

J. M. Raats
Technische Universiteit Delft
Using cost-sensitive learning to
forecast football matches
by

J. M. Raats

in partial fulfillment of the requirements for the degree of

Master of Science
in Computer Science

at the Delft University of Technology,


to be defended publicly on Thursday July 19, 2018 at 13:00.

Student number: 4083563


Supervisor: Prof. dr. M. Loog
Thesis committee: Prof. dr. ir. M.J.T. Reinders, TU Delft
Prof. dr. M. Loog, TU Delft
Dr. J.A. Pouwelse, TU Delft

An electronic version of this thesis is available at http://repository.tudelft.nl/.


Preface
This thesis is my final piece of work as a student at TU Delft. This project started with the primary
goal of using machine learning models to forecast football matches as accurately as possible, but
transformed into comparing different methods that incorporate the odds of multiple bookmakers to
see which method performs best (with profit in mind instead of accuracy). This made it much more
interesting for me personally, and because of this, I often had to suppress the urge to investigate
unexplored areas to maximize profit instead of actually investigating my research questions. This
turned out to be more like a luxury problem than an obstacle.
I would like to thank my supervisor Marco Loog for guiding me through this long journey, and for your patience and critical thinking. Also, I want to thank my family and friends for supporting me and
bringing me the (sometimes much needed) joy in the weekends so that I could continue focusing on
my project during the week. Finally, I want to especially thank my girlfriend, Georgina, for offering me new ideas and perspectives, and for proofreading this thesis. Without you, this project would never have had
the quality that it has right now.

J. M. Raats
Delft, July 2018

Contents
List of Figures vii
List of Tables ix
List of Algorithms xi
Acronyms xiii
1 Introduction 1
2 Background: Sports betting 3
2.1 Football . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Bookmakers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Betting on football matches: the use of odds . . . . . . . . . . . . . . . . . . 4
2.2.2 Unfair odds: overround and margin. . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Efficient Market Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Related work 9
3.1 Football market efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Statistical models applied on football matches . . . . . . . . . . . . . . . . . . . . . 10
3.3 Cost-sensitive learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 Methods: Decision trees 13
4.1 A decision tree that minimizes error rate . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.1 Splitting criterion: gini impurity . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.2 Pruning the tree to prevent overfitting . . . . . . . . . . . . . . . . . . . . . . 15
4.1.3 Predicting the outcome of an upcoming match . . . . . . . . . . . . . . . . . 18
4.2 Decision tree optimized for costs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.1 Creating a cost-sensitive decision tree . . . . . . . . . . . . . . . . . . . . . . 19
4.2.2 Pruning a cost-sensitive decision tree. . . . . . . . . . . . . . . . . . . . . . . 20
4.2.3 Prediction process for cost-sensitive decision trees . . . . . . . . . . . . . . . 20
4.3 Using incremental decision trees to accommodate changes . . . . . . . . . . . . . . 21
4.3.1 The process of incrementally updating the decision tree. . . . . . . . . . . . 21
4.3.2 Storing requisite data elements in the decision tree . . . . . . . . . . . . . . 22
4.4 Training and testing through cross-validation. . . . . . . . . . . . . . . . . . . . . . 23
4.4.1 Standard cross-validation process. . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4.2 Incorporating time through time series cross-validation . . . . . . . . . . . . 24
5 Experimental setup 25
5.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.1 Sources used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.2 Quick analysis of the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Cost-combining methods that are tested in this project . . . . . . . . . . . . . . . . 28
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.3.1 Preparing the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.3.2 Building the cost-sensitive models . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3.3 Betting on matches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.3.4 Selecting which matches to bet on . . . . . . . . . . . . . . . . . . . . . . . . 31
6 Results 35
6.1 Betting on all matches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.1.1 Profit and accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.1.2 Cost-confusion matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.1.3 Statistical significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36


6.2 Value betting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38


6.2.1 Profit and accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.2.2 Cost-confusion matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.2.3 Statistical significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7 Discussion and conclusion 43
7.1 Comparing single-cost and cost-combining methods. . . . . . . . . . . . . . . . . . 43
7.2 Best cost-combining method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
7.3 General discussion and recommendations . . . . . . . . . . . . . . . . . . . . . . . . 44
7.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Bibliography 49
A Incremental decision tree example 53
B Football team names 57
C Information about the intermediate dataset 59
D Combining data from different sources 61
E Features 63
F Missing odds for each bookmaker per season 65
G Profit over time for betting on all matches 67
H Confusion matrices for betting on all matches 71
I Normality check of results for betting on all matches 73
J Profit over time for value betting 75
K Confusion matrices for value betting 79
L Normality check of results for value betting 81
List of Figures

2.1 Map of a football field. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4


2.2 Margin per bookmaker per season for the Dutch Eredivisie. . . . . . . . . . . . . . . . 5
2.3 The three different efficient market forms. . . . . . . . . . . . . . . . . . . . . . . . . . 7

4.1 Decision tree trained on example data. . . . . . . . . . . . . . . . . . . . . . . . . . . 14


4.2 Incremental decision tree structures during training. . . . . . . . . . . . . . . . . . . . 22
4.3 Data used during the training of the incremental decision tree. . . . . . . . . . . . . . 22
4.4 Difference between traditional evaluation, standard cross-validation and time series cross-
validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5.1 Number of odds published per match for seasons 2006/2007-2016/2017. . . . . . . . . 26


5.2 Number of odds published per bookmaker for seasons 2006/2007-2016/2017. . . . . . 26
5.3 Favorite accuracy and profit per bookmaker for seasons 2006/2007-2016/2017. . . . . 27
5.4 Outcome distribution for seasons 2006/2007-2016/2017. . . . . . . . . . . . . . . . . . 28
5.5 Cost-matrices of four bookmakers for the same match. . . . . . . . . . . . . . . . . . . 29
5.6 The four different cost-combining methods applied to four example odds. . . . . . . . . 29
5.7 Experiment process of this project. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.8 Relative profit per 𝑜𝑒𝑑 threshold value. . . . . . . . . . . . . . . . . . . . . . . . . . . 33

6.1 Profit over time for all methods when betting on all matches. . . . . . . . . . . . . . . 36
6.2 Cost-confusion matrices for all methods when betting on all matches. . . . . . . . . . . 37
6.3 Profit over time for all methods when value betting. . . . . . . . . . . . . . . . . . . . 39
6.4 Cost-confusion matrices for all methods when value betting. . . . . . . . . . . . . . . . 40

G.1 Profit over time when betting on all matches for the b365 method. . . . . . . . . . . . 67
G.2 Profit over time when betting on all matches for the bw method. . . . . . . . . . . . . 68
G.3 Profit over time when betting on all matches for the avg method. . . . . . . . . . . . . 68
G.4 Profit over time when betting on all matches for the max method. . . . . . . . . . . . . 68
G.5 Profit over time when betting on all matches for the min method. . . . . . . . . . . . . 69
G.6 Profit over time when betting on all matches for the rnd method. . . . . . . . . . . . . 69

H.1 Confusion matrices when betting on all matches. . . . . . . . . . . . . . . . . . . . . . 72

I.1 Normality check statistics for all methods when betting on all matches. . . . . . . . . . 73

J.1 Profit over time when value betting for the b365 method. . . . . . . . . . . . . . . . . 75
J.2 Profit over time when value betting for the bw method. . . . . . . . . . . . . . . . . . 76
J.3 Profit over time when value betting for the avg method. . . . . . . . . . . . . . . . . . 76
J.4 Profit over time when value betting for the max method. . . . . . . . . . . . . . . . . . 76
J.5 Profit over time when value betting for the min method. . . . . . . . . . . . . . . . . . 77
J.6 Profit over time when value betting for the rnd method. . . . . . . . . . . . . . . . . . 77

K.1 Confusion matrices when betting on selected matches. . . . . . . . . . . . . . . . . . . 80

L.1 Normality check statistics for all methods when value betting. . . . . . . . . . . . . . . 81

List of Tables

2.1 Bookmakers used in this research. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

4.1 Cost-matrices used for football matches. . . . . . . . . . . . . . . . . . . . . . . . . . . 19


4.2 Data used for training the incremental decision tree. . . . . . . . . . . . . . . . . . . . 21
4.3 Information stored in an incremental decision tree with a short description. . . . . . . . 23

6.1 Profit and accuracy for all methods when betting on all matches. . . . . . . . . . . . . 35
6.2 𝑝-values of the One-Way ANOVA and Kruskal-Wallis tests when betting on all matches. 36
6.3 Profit and accuracy for all methods when value betting. . . . . . . . . . . . . . . . . . 38
6.4 Results of the One-Way ANOVA and Kruskal-Wallis tests when value betting. . . . . . . 39
6.5 Results of the Tukey HSD test between cost-combining methods when value betting. . 41

B.1 Football team abbreviations and their full names. . . . . . . . . . . . . . . . . . . . . . 57

C.1 Intermediate dataset attributes, description and data types. . . . . . . . . . . . . . . . 59

E.1 Features used when training the models. . . . . . . . . . . . . . . . . . . . . . . . . . 64

F.1 Missing odds for each bookmaker per season. . . . . . . . . . . . . . . . . . . . . . . . 65

J.1 Number of matches selected to bet on for each method and for each run. . . . . . . . 75

List of Algorithms
1 Pseudocode for creating a decision tree. . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Getting cost-complexity alpha values of a decision tree. . . . . . . . . . . . . . . . . . 17
3 Cost-complexity pruning process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Pseudocode to create models for each week. . . . . . . . . . . . . . . . . . . . . . . . 30
5 Pseudocode to prune models for each week. . . . . . . . . . . . . . . . . . . . . . . . 31
6 Match selection pseudocode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

Acronyms
ANOVA Analysis of variance. 36, 38

CCP Cost-Complexity Pruning. 15, 16, 18–20, 31, 45


CSL Cost-Sensitive Learning. 19, 20

EMH Efficient Market Hypothesis. 3, 6, 9

FIFA Fédération Internationale de Football Association. 3

HSD Honest Significant Difference. 39, 44

MDL Minimum Description Length. 15, 16, 18, 19, 23, 31, 45

SD Standard Deviation. 35

TSCV Time Series Cross-Validation. 24, 31

1
Introduction
Sports betting is becoming increasingly popular due to the possibility to bet online. This makes it very easy to bet on sports matches even when you are not present at the match yourself, which encourages more people to participate. In the Netherlands, an increase in sports betting turnover of
34% was visible between 2015 and 2016, while the Dutch gambling market as a whole increased only
3.8% [1]. This growth in sports betting is also occurring globally, albeit at a slower rate [2, 3].
At the same time, machine learning is also getting more popular. It is used in a wide variety of fields
to improve many applications, such as predicting Oscar winners [4], stock prices [5] or football
world cup matches [6–9]. The strength of these models is that they can find patterns in huge amounts of data that no human could detect. Once a pattern is learned, models are then able to automatically
apply this knowledge to new, unseen data. The hardest part is to have enough quality data available
so that the model can find these patterns in the first place.
Since both fields are becoming increasingly popular, researchers have investigated if the two can
be combined. For example, some of the biggest companies, like Microsoft [6], Bloomberg [7] and
Goldman Sachs [8], applied machine learning to predict the outcome of all 2014 football world cup
matches. We also use football data in our project, but instead of predicting the outcome of world
cup matches we use Dutch league matches for our experiments. For football (also known as soccer)
leagues there is more data available, since more teams compete against each other and for a longer
period of time. The world cup is organized only once every four years and lasts for a month, whereas the
Eredivisie is active for almost a full year every year. Since it would be too ambitious to analyze football
leagues from various countries, we limited this study to one case: that of the Dutch Eredivisie, the
national football competition. Although most previous research into football markets focuses on the
English Premier League, it is also worthwhile to study other markets. This research is a first attempt
at broadening the field of study into these other international leagues.
Some research has already been done on predicting as many league match outcomes correctly as possible [10–12], but we will investigate a different approach. While optimizing for accuracy has been explored before, how to achieve the highest profit has not been studied. Thus, for this project we not only want to predict the right outcome of football matches, but we also want to place bets on these matches to attempt to make as much profit as possible. Since the return for each match can be quite
different, as we will show in future chapters, a higher prediction accuracy will not automatically lead
to a higher profit. The machine learning models used in the studies all try to optimize for accuracy,
so we have to use a different kind of machine learning technique to optimize for profit. The technique
that we use in this project is called cost-sensitive learning. With cost-sensitive learning it is possible
to incorporate the different costs for each match and use these costs to optimize machine learning
models for profit instead of accuracy.
The costs for each match depend on the bookmaker’s odds and the stake placed by the bettor.
The odds of a bookmaker depict how probable each of the outcome possibilities is according to the
bookmaker. Since there are three different outcomes possible for football matches (home team wins,
a draw or away team wins), odds consist of three values: [𝑜_H, 𝑜_D, 𝑜_A]. For example, if the
odds for a match are [1.3, 4.33, 7.50], then this means that if you bet on the home team and they
win, you would get 1.3 times your stake back from the bookmaker. This same logic applies to draws


and away wins as well. However, if you bet on an outcome and this does not happen, then you lose
your stake. In this example, the outcome that is most likely to occur according to the bookmaker is
the home team winning, since this event has the so-called short odds or the lowest value attached to
it. On the other hand, the probability that the away team wins is very small, since a bettor would get
7.5 times his stake back from the bookmaker.
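To make the payout calculation concrete, the following sketch in Python computes the return of a single 1X2 bet; the function and variable names are illustrative only and do not belong to any bookmaker API or to the code used later in this thesis.

# Minimal sketch (illustrative helper): net return of a 1X2 bet with decimal odds.
def bet_return(odds, bet_on, outcome, stake=1.0):
    """odds: dict with decimal odds for 'H', 'D', 'A';
    bet_on and outcome: one of 'H', 'D', 'A'; returns the net profit."""
    if bet_on == outcome:
        return stake * odds[bet_on] - stake  # winnings minus the stake
    return -stake                            # a losing bet forfeits the stake

odds = {'H': 1.3, 'D': 4.33, 'A': 7.50}
print(bet_return(odds, bet_on='H', outcome='H'))  # 0.30 profit on a home win
print(bet_return(odds, bet_on='A', outcome='H'))  # -1.00, the stake is lost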
There are usually multiple bookmakers who publish their odds for the same upcoming matches.
These odds differ slightly per bookmaker, since each bookmaker uses their own method or algorithm to
calculate the probabilities for each outcome to occur. This means that there are different costs involved
for the same football match. They depend on which bookmaker is used to place the bets. In standard
cost-sensitive learning this introduces a problem, because it is assumed that for each instance there
is only one ground truth cost possible per outcome. Using the multiple odds of all bookmakers could
add valuable information to our models. What would be the optimal way to apply cost-sensitive
learning, specifically cost-combining methods, to this case? To this end, we will explore various cost
combining methods. Instead of using the odds of only one bookmaker in our cost-sensitive model, we
will combine the odds of all bookmakers and use the combined odds in our model. Our hypothesis is
that when the odds from various bookmakers can be combined in some mathematical way, this will
result in a higher profit than using only the information of one single bookmaker.
But before we dive into the cost-combining methods, we will first give some background information
on sports betting in chapter 2. There the basic principles of sports betting will be explained and this
information will be useful to understand the related work in chapter 3. After the related work we will
show the cost-sensitive model that we used to do our experiments in chapter 4. We discuss which
cost-combining methods we applied in this research, and how the experiments were conducted, in
chapter 5 while the results of these experiments can be found in chapter 6. These results will be
discussed and a conclusion will be given in chapter 7.
2
Background: Sports betting
Before we dive into the technical aspects of predicting football match outcomes, we will first present
some background information about football and sports betting. This information will make the rest of the report easier to understand. Since we are trying to predict football match outcomes, it is useful to understand what a football match looks like and how a football season is structured. We
will discuss this in section 2.1. To be able to make a profit from these predictions, we need someone
or some organization to place our bets. This can be done by a so-called bookmaker. We will show
how a bookmaker operates in section 2.2. And finally, there is a theory about how efficiently a mar-
ket behaves called the Efficient Market Hypothesis (EMH), which suggests how accurately the market
presents the value of the goods being sold. This theory will be explained in section 2.3.

2.1. Football
Football, as a game, has been played for many years. According to the Fédération Internationale de
Football Association (FIFA), some variants of the game were already practised in the second and third
centuries BC in China [13]. Today, it is the most popular game in the world: it is estimated that over
four billion people around the world consider themselves a football fan [14].
Since explaining the rules of a football match is not the point of this research, we suggest those
interested consult other sources if they wish to learn more about the game [15]. In essence, a team,
consisting of eleven players, attempts to score goals on the side of the field of the opposing team. To
be sure that international readers are aware we are discussing European football, or soccer, a map of
the field is displayed in figure 2.1. Relevant to this research is that there are three possible outcomes
of a match: the home team wins, there is a draw, or the away team wins. If the home team won, then
that means that the team that played in their home stadium scored more goals than the away team
and vice versa. In case of a draw, both teams ended up with the same number of goals. The winning
team is awarded three points for the competition, the losing team gets nothing and when tied both
teams get one point.
In a football season multiple teams play against each other throughout the year. Every country
has its own league. In the Netherlands, the focus of our research, it is called the Eredivisie, while in
e.g. England it is called the Premier League and in Spain they have the Primera División. The teams
compete for one year (season) and after all matches are played, the team with the most points is
declared the champion. In each league the teams compete twice against each other: one time in their
home stadium and once at the opponent’s. In football terms this is called home and away respectively.
There are 18 teams in the Eredivisie, which means that each team plays 34 matches in one season
and thus there are 306 matches in total. In other leagues, different numbers of teams, and therefore of matches, are possible, but for this project only the Eredivisie is investigated.
There is something special about playing at the home ground. Historical data shows that teams
are more likely to win at home than at the opponent’s stadium [16]. Although there are three possible
outcomes of a match (home team wins, there is a draw, or the away team wins), our data shows
that in general, home teams have a 47.6% chance to win, as opposed to a 23.6% chance of a draw and a
28.8% chance to win away (for more information about our dataset, please consult chapter 5). These


Figure 2.1: Map of a football field.

percentages may vary slightly per season and per league, but this puts the home team at a major
advantage, which is important to keep in mind for the rest of this project.

2.2. Bookmakers
To bet on football matches, sports matches or other events, you need to find a company or person that
accepts bets against specified odds (we will further explain the term ’odds’ below in paragraph 2.2.1).
This company or person is called a bookmaker. The first bookmaker ever recorded stood at Newmarket
horse racecourse in 1795 [17] and since then the market has kept growing. It is estimated that by
2020 the betting market will reach 66.59 billion USD [2]. One of the reasons for this growth is because
of the introduction of online betting [1]. Though it is currently illegal for Dutch citizens to participate in
online betting except for one bookmaker (Toto), there are other online betting sites that publish odds
for the Dutch football market, and their odds are used in this research [18, 19]. It is expected that
this growth will increase even further when more countries legalize online betting in the near future,
including the Netherlands [20, 21].

2.2.1. Betting on football matches: the use of odds


There are dozens of events one could bet on in one football match: not only on who wins the match,
but also on what the end score will be, or on which players will score a goal, etc. In this project,
however, we are only interested in predicting the outcome of the match. This is called 1X2 betting:
there are three possible outcomes for a football match, home wins (1), draw (X) or away team wins
(2). The bookmaker will publish its odds for all three events before the match is played and the models
in this research will then choose one of the three events to bet on.
To explain how these odds (always plural, never singular) work, we will show an example. For the
match between the two Dutch football clubs FC Utrecht and ADO Den Haag in the national competition
the Eredivisie, the odds of one bookmaker were [1.53, 4.00, 6.00]. The first odds are for home winning,
the second for draw and the last one for away winning. This means that if you, the punter (bettor),
bet €1 on home winning, and they win then you will receive €1.53 from the bookmaker. Your profit
will then be €0.53. If you are wrong however, then you lose your stake, the €1. This might make it
tempting to bet on ’underdog’ ADO Den Haag, because you would get €6 back if they win, but the
chance that they will win is very small, as reflected by the higher odds. This set of odds thus shows
that the bookmaker estimates FC Utrecht to be the favorite to win this match (since this outcome has
the lowest odds), and ADO Den Haag is the so-called longshot.
In fact it is possible to convert the odds of the bookmaker to the probability of each event occurring
according to the bookmaker. This can easily be done through the following formula:

𝑃(𝑋) = 1/𝑜_X,  𝑋 ∈ {𝐻, 𝐷, 𝐴}

where 𝑃(𝑋) is the probability of event 𝑋, 𝑜_X is the odds for event 𝑋 and 𝑋 can be home, draw or
away. So in the example, the probabilities according to the bookmaker are:

Figure 2.2: Margin per bookmaker per season for the Dutch Eredivisie.

𝑃(𝐻) = 1/1.53 ≈ 0.65 𝑃(𝐷) = 1/4 = 0.25 𝑃(𝐴) = 1/6 ≈ 0.17


However, when adding all probabilities you end up with a probability higher than 1. As we will
explain below, this is done by the bookmaker to make a profit.

2.2.2. Unfair odds: overround and margin


In the example the probabilities add up to 1.07 or 107%. This is called the overround. With the
overround the bookmaker expects to make a profit in the long run. In this case the bookmaker uses a
7% margin, which is the overround minus 100%.
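As an illustration, the following Python sketch converts the example odds to implied probabilities and derives the overround and margin; the variable names are our own.

# Sketch: implied probabilities, overround and margin for the example odds.
odds = [1.53, 4.00, 6.00]                 # home, draw, away
probs = [1 / o for o in odds]             # implied probabilities P(X) = 1/o_X
overround = sum(probs)                    # ~1.07, i.e. 107%
margin = overround - 1.0                  # ~0.07, the bookmaker's 7% margin
print([round(p, 2) for p in probs], round(overround, 2), round(margin, 2))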
As a punter, you want this margin to be as low as possible, because the higher the margin of the
bookmaker, the more unfair it is for you, the punter, to bet on the match. In betting terminology,
’fair’ odds signify that both parties take on the same risk: they are not in anyone’s favor, meaning
both bookmaker and punter would not be expected to make a profit or a loss in the long run. Since
bookmakers want their business to be profitable, they employ ’unfair’ odds, as this allows them to
make a profit in the long term (e.g. during a season), regardless of the outcome of a single match,
which may yield them a short-term loss.
Say you find another bookmaker that published the following odds for the same match: [1.564,
4.42, 6.07], which results in a margin of only 3%. This would be much more interesting to bet on,
because for each possible event you would get more money back, if you bet on the right result of
course.
In the dataset that we use (see section 5.1 for more information), we can clearly see differences
between different bookmakers and which margins they use. This is shown in figure 2.2. The figure
displays the margins of ten bookmakers compared per season and also the average of the margin of
all bookmakers per season is shown (the dotted black line). To see the full names of the bookmakers,
see table 2.1. There is a visible trend that the margins keep getting smaller: in season 2000/2001 the
average margin is 14.6%, but in season 2016/2017 the average margin is just under 6.6%, which is less than half of what it was in the first season. This is a good sign for punters, because the bets will be more fair
for them.

Abbreviation Full name


B365 Bet365
BS Blue Square
BW Bet&Win
GB Gamebookers
IW Interwetten
LB Ladbrokes
PS Pinnacle
SJ Stan James
VC VC Bet
WH William Hill

Table 2.1: Bookmakers used in this research.

2.3. Efficient Market Hypothesis


In capital markets there is the assumption that the price of a stock or security represents the real value
of that stock or security. Various researchers have tested whether this assumption holds. If the prices
do reflect the value of a stock, then this market is called efficient. This theory can also be applied to
sports betting: the odds set by the bookmaker can reflect the true probabilities of the match outcomes.
If this is the case, then this would be called an efficient market. However, if the bookmaker publishes
odds that do not align with the true probabilities (e.g. because their models are unable to produce such
true probabilities), then the market is deemed inefficient. How to convert the odds of a bookmaker to
the assumed probabilities is explained in section 2.2. In section 2.2.2, we saw that bookmakers employ
a margin to increase the likelihood of their long-term profit. We would thus expect the bookmakers’
odds not to reflect true probabilities, and thus that the market is inefficient.
There are three ways to check if a market is efficient [22] and these are all shown in figure 2.3.
The first domain in the EMH is called weak-form efficiency. This occurs when the price of a stock fully
reflects all information of past prices. This means that it is not possible to predict the price of the stock
in the future based on past prices. To translate this to our project: the odds of past matches are not
useful to predict the outcome of the next match. So if we would look at the odds of matches in the
past, it would not be possible to come up with a winning strategy for upcoming matches.
The second way is called semi-strong efficiency. This occurs when the price of a stock fully reflects
all information of past prices and also all public information. In this case, only people with insider
information could have an advantage in the market. This would for example occur when an important
football player is injured, but only a couple of people (insiders) know this before the match is played.
If this information becomes public, then the odds will likely be altered to reflect new outcomes more
realistically.
The last domain is called strong-form efficiency. This is when the price of a stock fully reflects all
public and private information. In this market, no one would have an advantage, because the price
is perfectly reflective of the probabilities and there is no additional information available for anyone to
spot unforeseen value.
This project uses predictive odds that were published in the past, but supplements this with ad-
ditional information about how the matches ended. This falls under the semi-strong efficiency form, so it would be interesting to know if this betting market is semi-strong efficient or not. If it
is, then the long-term expected profit would be zero, but if it is not, then there might be a winning
strategy. As mentioned above, the bookmakers temper with the odds to expect a profit in the long
run. This suggests that the market is inefficient. We will discuss several papers that have investigated
if this is the case or not in the next chapter.

Figure 2.3: The three different efficient market forms.


3
Related work
This chapter discusses some related work in three different areas within the realm of our research
project. First we look at the football market efficiency research in section 3.1 to establish how difficult
it may be to make a profit. After that, we discuss additional papers that apply statistical or machine
learning models to football matches in section 3.2. Finally, we reflect on cost-sensitive learning papers
relevant to our project in section 3.3.

3.1. Football market efficiency


The Efficient Market Hypothesis (EMH) has already been discussed in section 2.3, but we have not
established whether the football betting market is efficient. Recall that if the market is deemed efficient,
then this means that the prices of the odds fully reflect the real value of the possible events (home,
draw and away). In that case, it is really hard to create a betting strategy that consistently outperforms
the bookmakers.
There has been a lot of research on proving weak-form market efficiency in the football betting
market [16, 23–26]. Their results are rather mixed. While some papers conclude that this market is
weak-form inefficient, others report that it is efficient. However, most of the research was conducted for the
English Premier League or Spanish Primera División. To our knowledge, no paper discussed the market
efficiency for the Dutch Eredivisie, but this will not be investigated in depth in this project because it
is not the objective of this research. If we look at our own data, we are tempted to conclude that for
the Eredivisie the market is weak-form inefficient. This is because we see the same favorite-longshot
bias appear as in [23].
The favorite-longshot bias occurs when bookmakers slightly increase the probability of the favorite
winning but more strongly exaggerate the chance of the longshot winning. Whether this bias exists in
our data can be checked by using the odds of a bookmaker to both bet all matches on the favorite team
for each match, and then similarly bet on the longshot. If the market is efficient, then the odds would
reflect true probabilities and the profit of both situations should be equal, but this is not the case. The
strategy of betting on the favorite results in a profit of -3.5%, but for the longshot this results in -17.2%.
This shows that it is much better to bet on the favorite rather than the longshot, which would not
happen in an efficient market. Bookmakers artificially increase the probability of the longshot winning
(reflected in the higher negative percentage), encouraging punters to bet on the longshot, even though
this does not reflect the true probability of them winning. The bookmakers thus seem to make the
market inefficient on purpose, so that punters are led astray and the bookmakers can expect more
profit in the end, especially since they have to pay less to punters when the longshot actually wins.
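A check along these lines could be sketched as follows; the match records and their fields are hypothetical placeholders for the dataset described in chapter 5, and the function is only an illustration of the idea, not the exact procedure behind the numbers above.

# Sketch: relative profit of always betting on the favorite vs. always on the longshot.
def strategy_profit(matches, pick):
    """pick selects the 'favorite' (lowest odds) or 'longshot' (highest odds) per match."""
    profit, staked = 0.0, 0.0
    for m in matches:                      # m = {'odds': {'H': .., 'D': .., 'A': ..}, 'result': 'H'|'D'|'A'}
        odds = m['odds']
        choice = min(odds, key=odds.get) if pick == 'favorite' else max(odds, key=odds.get)
        staked += 1.0
        if choice == m['result']:
            profit += odds[choice] - 1.0   # a winning 1-unit bet returns the odds minus the stake
        else:
            profit -= 1.0
    return profit / staked                 # relative profit, e.g. -0.035 vs. -0.172 in our data

matches = [{'odds': {'H': 1.53, 'D': 4.00, 'A': 6.00}, 'result': 'H'}]
print(strategy_profit(matches, 'favorite'), strategy_profit(matches, 'longshot'))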
As we stated in section 2.3, our project falls under the semi-strong market efficiency instead of the
weak-form, so we should check if it is semi-strong efficient or not. After all, in an efficient market, the
odds are reflective of the true probabilities, meaning that betting for profit becomes much harder, as
explained above. Whether a market is semi-strong efficient is much harder to check than weak-form,
due to the enormous amount of extra data that can be used to check if the odds show the true value
of the events happening. We could only find one study that investigated semi-strong efficiency in
football betting [27]. This paper claims that the market is semi-strong inefficient by looking at matches


played after the technical staff of one team is replaced. In different leagues, they looked at the first
few matches after the staff had changed, and then, for all those matches, they would bet on that team to win. They found that the odds underestimate the true probabilities, indicating the market
is semi-strong inefficient, because the profits were higher than expected. However, since those few
matches might not be representative of the rest of the season, this study does not show whether the
whole market would be semi-strong inefficient. Since we are interested in making a profit throughout
a season, we need to know about the state of the entire market.
One element of the inefficient market of football bookmakers is the favorite-longshot bias. There are
a few different explanations for this phenomenon. Either the bookmaker is unable to calculate the true
probabilities for each match outcome; the bookmakers apply unbalanced margins to each outcome;
or a combination of both. In [28] they assumed that the bookmakers could accurately forecast the
outcomes, but altered the odds to be in their benefit. They assert that the bookmaker influences these
odds in such a way that the odds for favorite teams are more valuable than longshot odds. They analyze
the odds of the bookmaker with the lowest margins, seen in figure 2.2, as these should be closer to
the bookmaker’s true odds (if no margin had been applied), compared to those of other bookmakers
with higher margins. They find it to be unlikely that the investigated bookmaker creates their odds
by applying the margin equally over each outcome. First, they distilled the bookmaker’s ’actual’ odds
by removing the margin equally from each odds. Then, they converted these odds to probabilities
(see section 2.2). They found that these probabilities were very different from the real outcomes of
the football matches, which violated their assumption of bookmakers’ predictive accuracy. Their other
tests, where bookmakers were assumed to tamper with the odds by applying the margins proportionally
to the size of the true odds, resulted in a smaller difference between probabilities. Apparently, the
bookmakers employ these more biased methods to apply the margins to their true odds. Since the
shortest odds are those of the favorite, they receive the smallest margin in proportion to the other two
outcomes, while the longshot receives the biggest margin using that same logic. This confirms the
abovementioned difference in profit for betting on favorites versus betting on longshots.
The previous article assumed that bookmakers are able to correctly predict the probability of the
different outcomes. The authors of [29] employed a large-scale research set-up to check whether this
assumption is correct, and bookmakers’ odds could be used as forecasts. They found that bookmak-
ers were indeed getting closer to the real probabilities over time. However, they converted odds to
probabilities under the assumption that the bookmakers applied margins to their true odds equally.
The work of [28] showed that this is unlikely to be a fair representation of how bookmakers create
their odds. It would thus be interesting to conduct the research of [29] again with the methods shown
in [28] and compare the results.

3.2. Statistical models applied on football matches


Using statistical models, or machine learning models for that matter, to forecast football match results is
not new. One of the oldest and most used methods to predict the outcome of a match is to use a Poisson
distribution to predict how many goals both teams will score [25, 30, 31]. Using a Poisson model allows
the researchers to estimate the attack and defence strength of each team over time and also estimates
either team’s home advantage. To predict the outcome of an upcoming match, the Poisson distribution
is calculated for either team using the attack, defence and home advantage parameters of both teams.
When viewing these two distributions together, one can calculate the maximum likelihood for each
event - home, draw and away - occurring. These maximum likelihood estimations were however only
optimized to achieve a high accuracy, and did not investigate whether this would also result in a profit.
Another method that is often used is an ordered probit model [10, 24, 32]. This is a linear model
that is useful when there are more than two different outcomes that have a natural ordering. It can
be applied to football matches, because every goal in football hinders or creates the possibility of a
specific outcome. For example, if the home team leads during a match, then the away team has to
first score an equalizer goal, resulting in a draw, before they are able to score another goal and win
themselves. Probit models seem to improve their success rate during specific parts of the season, as
shown in [24]. However, the aim of these types of research is primarily to show market (in-)efficiency,
and not to predict the outcomes of football matches or increase betting profits.
It is more difficult to find research where more advanced machine learning models are used. One
of the first papers that uses machine learning models is [33]. In this paper they investigated multiple

models for two seasons and only for the matches in which Tottenham Hotspur (of the English Premier
League) played. The model that had the highest prediction rates predicted almost 60% of all Tottenham
matches correctly. There is however one major drawback to this model: it uses information about the
key players in the Tottenham selection, so after two years the model could not be used anymore
because the players left or retired. The method in this paper is thus highly sensitive to changes in
teams and only relevant for a specific period of time.
In another paper multiple models were also used and compared [11]. Training data contained ten
seasons of the English Premier League and the models were tested on the two following seasons. Their
three best-performing models had prediction rates around 50%. One of the reasons that the prediction
rates were not very high, was because almost all models had trouble predicting draws correctly. One
model never predicted a draw for a single match, even though 29% of the investigated matches in the
English Premier League ended in a draw. This shows how difficult it may be to correctly predict a draw
using machine learning. Though another blog post achieved similar accuracy rates, it performed better
on draws, suggesting that there may be ways to improve draw predictions, but that this may require
decreasing the accuracy of home and away predictions [12].
The most recent paper that we could find about machine learning applied to football matches is [34].
In this paper they, again, compared multiple machine learning models to each other. Here the training
data consists of nine seasons, with the model being tested on weeks 6 to 38 of seasons 2014/2015 and
2015/2016. This resulted in accuracy values between 52.7% and 56.7%, which is quite high compared to previous papers. Unfortunately, their paper was published close to our own publication date,
prohibiting us from applying some of their findings (specifically on feature engineering) to this project.
This will be reflected on in the discussion in chapter 7.

3.3. Cost-sensitive learning


To our knowledge, no research has been conducted on maximizing profit from betting on football matches. As shown above, previous work focuses on accurately predicting the outcomes of
football matches. Although some may additionally look at the obtained profit, this is not their primary
objective. Since our goal is primarily to optimize for profit, accuracy may not be as high (see both the
results in chapter 6 and the discussion of this topic in chapter 7). We saw in section 3.2 that football
match outcomes can be difficult to predict. However, predicting the outcomes of more matches correctly
does not automatically lead to a higher profit. After all, predicting the win of one longshot correctly
may compensate for multiple errors in prediction, if the odds for that longshot are sufficiently large.
Thus, a focus on profit over accuracy requires a different approach than has been applied before. In
this project, we apply cost-sensitive learning to allow for this focus on profit. We will explain in more
detail what cost-sensitive learning is and how it works in section 4.2. For now, let us reflect on whether
previous work can be relevant to this project.
Previous work has applied cost-sensitive learning methods in various contexts, e.g. [35–40]. How-
ever, the results in these papers are not (directly) relevant to our project. More recent work has
tested which type of cost-sensitive learning method yields the most profit. This research can inform
the way in which we apply cost-sensitive learning. In fact, [41] went beyond comparing previous
methods, constructing a new method that focuses on minimizing costs in general, effectively maximiz-
ing profit. They built a more elegant method than previously reported, that encompasses the costs
of (mis-)classification within the model from the start, instead of artificially including those costs by
altering the dataset. They show that this new method is financially successful. Furthermore, they
compare their method with previous cost-sensitive learning techniques, showing that theirs is indeed
more effective. This project will thus employ their newly-created cost-sensitive learning method, as
will be explained further in section 4.2.
4
Methods: Decision trees
As previously mentioned, correctly predicting the outcome of some matches yields more profit than doing so for
other matches. Since we are interested in optimizing for profit, it would make no sense to use machine
learning models that are optimized for minimizing the error rate. Instead we use a cost-sensitive
model to predict the outcomes. As shown in the previous chapter, there are multiple ways to apply
cost-sensitive learning. For our project we have chosen to use a cost-sensitive decision tree model
to do our experiments. We will explain how this model works and why we have decided to use this
technique in this chapter.
To be able to see the difference between a standard decision tree model and a cost-sensitive
version, we will first show how the standard model works in section 4.1. After this is explained, the
inner workings of the cost-sensitive decision tree will be shown in section 4.2. In section 4.3 we will
discuss an incremental decision tree technique to increase the speed with which we achieve our results,
without changing the actual results of our experiments. And finally, in section 4.4, we show how to
optimize our decision trees (specifically through pruning), through time series cross-validation.

4.1. A decision tree that minimizes error rate


A decision tree is a very flexible model, meaning that it can fit data easily without much overhead. An example of this is shown in figure 4.1: figure 4.1b depicts what the decision tree model looks like when trained on the data from figure 4.1a. The data consists of two classes: a red class and a blue class. The goal is to find cutpoints or decisions that will help to
classify the classes correctly. In this project we will only discuss the binary decision tree, which means
that every decision leads to two different outcomes.
The decision tree in figure 4.1b has three decision nodes and four leaves. The decision nodes
contain the decisions and the classification takes place within the leaves. The first decision is to split
the first variable 𝑋 into two different regions: one is below 0.35 and the other is above it. If the
decision is true, then the tree is followed to the left and this means that the model will classify the
object as the red class. If it is not true, then the tree is followed to the right and a new decision is
made. This process is followed until a leaf node is reached and classification is complete.
In this example we see that it takes three decisions to correctly classify all objects. The decision
boundaries of the model are shown as the blue lines in figure 4.1a. But this does not explain how the
model got to its decisions. The model uses something called a splitting criterion to get the best splits
possible according to some metric.

4.1.1. Splitting criterion: gini impurity


There are multiple ways to get to the best decisions. It all depends on which metric is used. As an
example, we consider the metric called gini impurity. This splitting criterion has previously been applied
to decision trees in [34, 42].
When the model starts to train on a new dataset, no decisions are known beforehand. It has to
use the gini impurity to find the best split possible for all objects. To do this, it will go through all variables and all possible splits within each variable. If there are ten objects and two variables


(a) Training data and decision tree boundaries. (b) Corresponding decision tree structure.

Figure 4.1: Decision tree trained on data from left figure. The structure of the decision tree is shown on the right.

in the dataset, which is the case in our example, it will make twenty splits before all possible splits
are considered. Then, the best combination of variable and split value is chosen to become the first
decision. To know which split is the best, it will calculate the gini impurity score for each split. This is
done through the following formula:

𝐼_G(𝑝) = 1 − ∑_{𝑖=1}^{𝐽} 𝑝_𝑖²

where 𝐼_G(𝑝) is the gini impurity, 𝐽 is the number of classes and 𝑝_𝑖 is the fraction of items labeled
with class 𝑖. The worst possible gini for this dataset is 0.48 and occurs when the dataset is split in
two groups where one group consists of all instances and the other group has no instances at all.
If we fill in the formula, we get the following calculation: 1 − ((6/10)² + (4/10)²) = 0.48. First we get the
proportion of the red class, which has 6 instances of the total ten instances and then the proportion
of the blue class is added. The best split has a gini of 0.444, which is not much better than the worst
split. We can see from the decision tree model in figure 4.1b that this split is at 𝑋 < 0.35, so that the
bottom group of instances all belong to the red class (gini impurity of 0) and the top group has a gini
of 1 − ((2/6)² + (4/6)²) = 0.444. Adding the gini scores of both groups gets the total gini for this split, which
is 0.444. There is no other split where the gini is lower than 0.444, so the model chooses this split to
be the first decision.
Since the group of red objects has a perfect gini score of 0, there is nothing to improve. The model
is done for this group and thus creates a leaf node where the red class will be predicted for future
objects. The other group still has mixed objects, so the process repeats until the groups are pure, just
like in the first group.
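For reference, a small Python sketch of the gini computation used in this example; the function is our own illustration, and the group counts assume that the best split leaves four red instances in one group and two red plus four blue instances in the other, which reproduces the 0.444 reported above.

# Sketch: gini impurity of a group of class labels.
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0                          # an empty group contributes no impurity
    fractions = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(f * f for f in fractions)

print(gini(['red'] * 6 + ['blue'] * 4))     # 0.48, the unsplit (worst) case
print(gini(['red'] * 4))                    # 0.0, the pure bottom group
print(gini(['red'] * 2 + ['blue'] * 4))     # ~0.444, the mixed top group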
The pseudocode of this process can be seen in algorithm 1. The function split_group takes data
as a parameter, which contains all the information about the data points. Then it checks if the data is
already pure, which is the case when all data points belong to the same class. If the data is pure, then
no decision needs to be made, so the code returns. If there are mixed classes in the data, then for each
variable and for each unique value of the data points for that variable, the impurity is calculated. This
is where the gini impurity could be used as a metric. The best cutpoint, where the impurity is smallest,
is saved. After all cutpoints have been processed, the data is split into two groups: one group where
the data is below the best cutpoint value and the other group contains the rest of the data. Then the
function split_group is called recursively for both groups until all groups are pure.

Algorithm 1 Pseudocode for creating a decision tree.


1: function split_group(𝑑𝑎𝑡𝑎)
2: if is_pure(𝑑𝑎𝑡𝑎) then
3: return
4: end if
5:
6: for all 𝑣𝑎𝑟 in variables do
7: for all 𝑣𝑎𝑙𝑢𝑒 in unique(𝑑𝑎𝑡𝑎[𝑣𝑎𝑟]) do
8: 𝑙𝑒𝑓𝑡_𝑔𝑟𝑜𝑢𝑝, 𝑟𝑖𝑔ℎ𝑡_𝑔𝑟𝑜𝑢𝑝 ← get_groups(𝑑𝑎𝑡𝑎, 𝑣𝑎𝑟, 𝑣𝑎𝑙𝑢𝑒)
9: 𝑖𝑚𝑝𝑢𝑟𝑖𝑡𝑦_𝑙𝑒𝑓𝑡 ← impurity_function(𝑙𝑒𝑓𝑡_𝑔𝑟𝑜𝑢𝑝)
10: 𝑖𝑚𝑝𝑢𝑟𝑖𝑡𝑦_𝑟𝑖𝑔ℎ𝑡 ← impurity_function(𝑟𝑖𝑔ℎ𝑡_𝑔𝑟𝑜𝑢𝑝)
11: 𝑖𝑚𝑝𝑢𝑟𝑖𝑡𝑦 ← 𝑖𝑚𝑝𝑢𝑟𝑖𝑡𝑦_𝑙𝑒𝑓𝑡 + 𝑖𝑚𝑝𝑢𝑟𝑖𝑡𝑦_𝑟𝑖𝑔ℎ𝑡
12:
13: if 𝑖𝑚𝑝𝑢𝑟𝑖𝑡𝑦 < 𝑏𝑒𝑠𝑡_𝑖𝑚𝑝𝑢𝑟𝑖𝑡𝑦 then
14: 𝑏𝑒𝑠𝑡_𝑖𝑚𝑝𝑢𝑟𝑖𝑡𝑦 ← 𝑖𝑚𝑝𝑢𝑟𝑖𝑡𝑦
15: 𝑏𝑒𝑠𝑡_𝑣𝑎𝑟 ← 𝑣𝑎𝑟
16: 𝑏𝑒𝑠𝑡_𝑐𝑢𝑡𝑝𝑜𝑖𝑛𝑡 ← 𝑣𝑎𝑙𝑢𝑒
17: end if
18: end for
19: end for
20:
21: 𝑙𝑒𝑓𝑡_𝑔𝑟𝑜𝑢𝑝, 𝑟𝑖𝑔ℎ𝑡_𝑔𝑟𝑜𝑢𝑝 ← get_groups(𝑑𝑎𝑡𝑎, 𝑏𝑒𝑠𝑡_𝑣𝑎𝑟, 𝑏𝑒𝑠𝑡_𝑐𝑢𝑡𝑝𝑜𝑖𝑛𝑡)
22: 𝑡𝑟𝑒𝑒[𝑙𝑒𝑓𝑡] ← split_group(𝑙𝑒𝑓𝑡_𝑔𝑟𝑜𝑢𝑝)
23: 𝑡𝑟𝑒𝑒[𝑟𝑖𝑔ℎ𝑡] ← split_group(𝑟𝑖𝑔ℎ𝑡_𝑔𝑟𝑜𝑢𝑝)
24: return 𝑡𝑟𝑒𝑒
25: end function
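A compact Python sketch of the same procedure is given below. It mirrors Algorithm 1, including the unweighted sum of the left and right group impurities, but it is only an illustration and not the implementation used in this thesis.

# Sketch of Algorithm 1: recursively split features X and labels y on the best gini cutpoint.
def gini(y):
    return 1.0 - sum((y.count(c) / len(y)) ** 2 for c in set(y)) if y else 0.0

def split_group(X, y):
    if len(set(y)) <= 1:                               # pure (non-empty) group: create a leaf
        return {'leaf': True, 'label': y[0]}
    best = None
    for var in range(len(X[0])):                       # every variable
        for value in sorted({row[var] for row in X}):  # every unique value as candidate cutpoint
            left = [i for i, row in enumerate(X) if row[var] < value]
            right = [i for i in range(len(X)) if i not in left]
            if not left or not right:
                continue
            impurity = gini([y[i] for i in left]) + gini([y[i] for i in right])
            if best is None or impurity < best[0]:
                best = (impurity, var, value, left, right)
    if best is None:                                   # no valid split (identical points): make a leaf
        return {'leaf': True, 'label': max(set(y), key=y.count)}
    _, var, value, left, right = best
    return {'leaf': False, 'var': var, 'cutpoint': value,
            'left': split_group([X[i] for i in left], [y[i] for i in left]),
            'right': split_group([X[i] for i in right], [y[i] for i in right])}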

4.1.2. Pruning the tree to prevent overfitting


The decision tree created in figure 4.1b has a perfect classification score for the ten objects in our
example dataset. However, this does not mean that the model performs just as well on unseen objects.
The problem with decision trees is that they usually fit really well on the training data, but do not perform
well on test data. This is called overfitting. Often it is better to increase the training error in order to
get a lower test error. For decision trees this can be achieved through a process called pruning.
The goal of pruning is to decrease the depth of the decision tree so that the model does not overfit.
For example, it is possible to remove the last decision node of the decision tree in figure 4.1b and
convert it to a leaf node. This means that one training object is misclassified, the red object, but that
the tree might perform better for future objects.
There are two main ways in which a decision tree can be pruned. First there is pre-pruning. Pre-
pruning is when the decision tree stops growing before it perfectly classifies the training instances.
This can be done by specifying beforehand how deeply the tree may grow or selecting an impurity
threshold for each decision node to stop the tree from growing further. For example, if the gini score
of the best split possible is below 0.1, do not grow further, because then the probability of overfitting
could be significant. This method is very fast, but will not always lead to the best results, because it
might miss good splits that happen at a later stage. The other pruning method is called post-pruning.
Post-pruning is when the whole decision tree is built first and then afterwards is reduced in size. This
usually takes a lot more time than pre-pruning, but it can lead to better results. After all the decision
nodes are constructed, the best splits can be kept in place and the bad ones can be removed from the
tree.
Since we have to build the whole decision tree anyway (more on that in section 4.3), applying a
post-pruning method better fits the structure of our project. We attempted two different post-pruning
methods to prune the decision trees in this project. The first method is called Minimum Description
Length (MDL) pruning and the second method is Cost-Complexity Pruning (CCP). We will discuss them
both now.

Minimum Description Length


The MDL principle is an important concept in information theory and computational learning theory.
It is based on Occam’s razor principle: when multiple solutions for a hypothesis are presented, one
should select the solution that makes the fewest assumptions. But what does this have
to do with pruning of a decision tree?
When a decision tree is completely built (this is a post-pruning method), it takes up disk space. The
amount of space that is needed depends on the structure of the tree and the training instances. Each
leaf of the tree needs to be encoded in the following amount of bits:

𝑒 = 1 + 𝑙𝑜𝑔(𝑐) + 𝑥(𝑙𝑜𝑔(𝑖) + 𝑙𝑜𝑔(𝑐 − 1))


where 𝑐 is the number of classes, 𝑥 is the number of instances that do not belong to the default
class and 𝑖 is the total number of instances at the leaf [43]. The first bit is necessary to encode that
the node is a leaf, then 𝑙𝑜𝑔(𝑐) bits are needed to encode which class is default, and then for each
instance that does not belong to this default class it is necessary to specify the exception (𝑙𝑜𝑔(𝑖)) and
to which class it belongs (𝑙𝑜𝑔(𝑐 − 1)).
To encode decision nodes of the tree, the following amount of bits is needed:

𝑒 = 1 + 𝑙𝑜𝑔(𝑡) + 𝑚𝑑𝑙(𝑙) + 𝑚𝑑𝑙(𝑟)


where 𝑡 is the number of possible cutpoints at the node and 𝑚𝑑𝑙(𝑙) and 𝑚𝑑𝑙(𝑟) are the MDL of the
left and right subtrees respectively. As seen from the formulas, this method calculates the MDL of the
leaves first and then it will calculate the MDL of the nodes above them until the root of the tree is
processed.
For each node, both the MDL as decision node and the MDL as leaf will be calculated. So for decision
nodes, this means that the MDL will be calculated as if it were a leaf. If the MDL as a leaf is smaller
than the MDL as a decision node, then this node is pruned, because this means that the encoding of
the node in bits is smaller as a leaf. This will result in more compact trees without losing too much
information, because all relevant information is still represented in the encoded bits.
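To illustrate the pruning decision, the sketch below compares the two encodings for a single node. Base-2 logarithms and the example counts are assumptions for illustration; the helper names are ours and not taken from [43].

import math

# Sketch: MDL cost of encoding a node as a leaf vs. keeping it as a decision node.
def mdl_leaf(c, i, x):
    """c: number of classes, i: instances at the leaf, x: instances not in the default class."""
    return 1 + math.log2(c) + x * (math.log2(i) + math.log2(c - 1))

def mdl_node(t, mdl_left, mdl_right):
    """t: number of possible cutpoints at the node; mdl_left/right: MDL of the subtrees."""
    return 1 + math.log2(t) + mdl_left + mdl_right

# Prune the node when its encoding as a leaf is cheaper than as a decision node (arbitrary example counts).
as_leaf = mdl_leaf(c=3, i=40, x=12)
as_node = mdl_node(t=25, mdl_left=mdl_leaf(3, 25, 5), mdl_right=mdl_leaf(3, 15, 7))
print(as_leaf, as_node, 'prune' if as_leaf <= as_node else 'keep')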
While MDL works reasonably well, it also has some drawbacks. For example, the method does not
look at training or test errors; it just prunes the tree when the encoding of a leaf takes less space than
as a decision node. There might be a better prune possible, one where the pruned tree has a lower
test error. A pruning method that takes the test error into account is Cost-Complexity Pruning.

Cost-complexity pruning
Cost-Complexity Pruning (CCP) looks at multiple subtrees of the original tree and calculates the test
error for each one of these subtrees to see which one has the best test error [42]. However, if all
subtrees were checked, this would take a very long time. Let |𝑇̃| be the number of leaves in the original tree 𝑇, then there are ⌊1.5028369^{|𝑇̃|}⌋ possible subtrees to check [44]. For example, this means that a tree with 25 leaves already has 26,472 possible pruned subtrees. Complex and large datasets
(e.g. containing many football matches, the outcomes of which are difficult to predict, see section 3),
can easily result in a tree with well over 25 leaves.
This means that checking each subtree is not feasible within a reasonable amount of time. This is
why CCP only deals with a fraction of the possible subtrees. The idea of CCP is that the deeper a tree
grows, the more complex the tree gets and this should be punished because the more complex the
tree, the more likely it is to overfit. You could say each subtree of 𝑇 has an amount of complexity cost
attached to it. If the subtree is deep, then the complexity cost is high and if it is shallow, then there is
almost no cost.
Let us define the complexity cost of a tree 𝑇 as follows:

C_α(T) = R(T) + α|T̃|
where C_α(T) is the complexity cost of tree T given an α value, R(T) is the resubstitution error of T
and |T̃| is the number of leaves. So the cost of the tree is the amount of incorrectly classified instances
plus an additional complexity error based on an 𝛼 value. But there is also a cost for individual nodes
in the tree. Let us define this as:

C_α({t}) = R(t) + α

where C_α({t}) is the complexity cost of node t and R(t) is the resubstitution error of node t. This
resubstitution error R(t) is a combination of the incorrectly classified instances and the proportion of
data that falls into node t, so R(t) = r(t)p(t). This means that for the whole tree R(T) = Σ_{t'∈T̃} R(t') =
Σ_{t'∈T̃} r(t')p(t').
To get a feel for what 𝛼 does to the tree, let 𝛼 be 0 for a moment. This means that there is no
penalty for the complexity of the tree, thus the tree will not be pruned. The whole tree would correctly
classify all instances (because that is given from the building procedure of the tree), so the complexity
cost of the whole tree will be 0.
However, when α increases, there will be a point where C_α(T_t) = C_α({t}) for some decision
node t in the tree, where T_t denotes the subtree rooted at t. When this happens, it means that it is better to prune the tree at node t, because the
tree has the same complexity cost but is smaller, which is preferable due to the risk of overfitting.
To find this 𝛼 value for when this happens, we need to find when

R(T_t) + α|T̃_t| = R(t) + α

If we solve for 𝛼, we get

α = (R(t) − R(T_t)) / (|T̃_t| − 1)
Now we have everything to start our pruning process. The pseudocode for getting all alpha values
is shown in algorithm 2. It begins by considering the whole tree 𝑇 and for each decision node in 𝑇 we
calculate for which 𝛼 value it would be pruned. The lowest 𝛼 value is chosen and the corresponding
decision node is pruned. We now know that T_0 > T_1, where T_0 is the original tree and T_1 is the
pruned tree after the first step. This process is repeated for T_1 and subsequent trees until T_k = {t_0}, so it
only contains the root node of tree T. This process will result in a sequence of subtrees of T where
T_0 > T_1 > ⋯ > T_k with corresponding α values α_0 < α_1 < ⋯ < α_k where α_0 = 0. Now we know that
T_i is the best smallest subtree when α ∈ [α_i, α_{i+1}), so this is very helpful when looking for the best α
value.

Algorithm 2 Getting cost-complexity alpha values of a decision tree.


1: function get_alpha_values(𝑡𝑟𝑒𝑒)
2: 𝑡𝑟𝑒𝑒 ← unprune_tree(𝑡𝑟𝑒𝑒)
3: 𝑎𝑙𝑝ℎ𝑎_𝑣𝑎𝑙𝑢𝑒𝑠 ← [0]
4:
5: while root of 𝑡𝑟𝑒𝑒 is not pruned do
6: for all decision nodes 𝑛𝑜𝑑𝑒 in 𝑡𝑟𝑒𝑒 that are not pruned do
7: 𝑎𝑙𝑝ℎ𝑎[𝑛𝑜𝑑𝑒] ← (R(𝑛𝑜𝑑𝑒) - R(subtree(𝑛𝑜𝑑𝑒))) / (n_leaves(subtree(𝑛𝑜𝑑𝑒)) - 1)
8: end for
9: 𝑚𝑖𝑛_𝑎𝑙𝑝ℎ𝑎_𝑛𝑜𝑑𝑒 ← index(min(𝑎𝑙𝑝ℎ𝑎))
10: 𝑡𝑟𝑒𝑒 ← prune_node(𝑡𝑟𝑒𝑒, 𝑚𝑖𝑛_𝑎𝑙𝑝ℎ𝑎_𝑛𝑜𝑑𝑒)
11: 𝑎𝑙𝑝ℎ𝑎_𝑣𝑎𝑙𝑢𝑒𝑠.append(min(𝑎𝑙𝑝ℎ𝑎))
12: end while
13:
14: return 𝑎𝑙𝑝ℎ𝑎_𝑣𝑎𝑙𝑢𝑒𝑠
15: end function

After all 𝛼 values are calculated for the whole tree, the next step is to find the best 𝛼 value. This
can be found in two ways: using a validation dataset or using cross-validation. In this project we used
cross-validation, because if we used a validation dataset we would have to separate some matches
from our training dataset and that means that we would not be able to use valuable information. This
is still possible when using cross-validation, which we will explain in further detail in section 4.4.
For each fold in the cross-validation method a new tree is built based on the training data for that
fold. This tree is then pruned according to the 𝛼 values obtained from the original tree. We know
that all values of α ∈ [α_i, α_{i+1}) result in subtree T_i, so in this case we will choose a β_i value that lies
between α_i and α_{i+1} with which to prune the tree. This results in a sequence of β values that all lie
between the corresponding α values, so that β_0 < β_1 < ⋯ < β_k, where β_0 = 0 to ensure no pruning
takes place. The other β values are set as β_i = √(α_i α_{i+1}) and β_k = ∞.
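As an illustration, the geometric-mean step can be written as a small Python helper. This is a sketch of the idea only; the function name merely mirrors the get_beta_values call used in algorithm 3.

from math import sqrt, inf

def get_beta_values(alpha_values):
    # alpha_values = [alpha_0, alpha_1, ..., alpha_k] with alpha_0 = 0.
    betas = [0.0]  # beta_0 = 0 ensures the first subtree is not pruned at all
    for a_i, a_next in zip(alpha_values[1:], alpha_values[2:]):
        betas.append(sqrt(a_i * a_next))  # geometric mean of consecutive alphas
    betas.append(inf)  # beta_k = infinity prunes the tree down to its root
    return betas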
After the test error is calculated for each fold and for each 𝛽 value, the errors are added up and
the β value with the lowest test error is selected. Let us say that β_j is the one with the lowest test
error; then we know that this corresponds to α_j, so the original tree will be pruned with α_j. The
whole cost-complexity pruning process is also depicted as pseudocode in algorithm 3. Note that the
split_group function refers to the code in algorithm 1 and the get_alpha_values function refers
to algorithm 2.

Algorithm 3 Cost-complexity pruning process.


1: function cost_complexity_pruning(𝑡𝑟𝑒𝑒, 𝑑𝑎𝑡𝑎)
2: 𝑎𝑙𝑝ℎ𝑎_𝑣𝑎𝑙𝑢𝑒𝑠 ← 𝑔𝑒𝑡_𝑎𝑙𝑝ℎ𝑎_𝑣𝑎𝑙𝑢𝑒𝑠(𝑡𝑟𝑒𝑒)
3: 𝑏𝑒𝑡𝑎_𝑣𝑎𝑙𝑢𝑒𝑠 ← 𝑔𝑒𝑡_𝑏𝑒𝑡𝑎_𝑣𝑎𝑙𝑢𝑒𝑠(𝑎𝑙𝑝ℎ𝑎_𝑣𝑎𝑙𝑢𝑒𝑠)
4:
5: for (𝑡𝑟𝑎𝑖𝑛_𝑑𝑎𝑡𝑎, 𝑡𝑒𝑠𝑡_𝑑𝑎𝑡𝑎) in cross_validation(𝑑𝑎𝑡𝑎) do
6: 𝑡𝑒𝑚𝑝_𝑡𝑟𝑒𝑒 ← 𝑠𝑝𝑙𝑖𝑡_𝑔𝑟𝑜𝑢𝑝(𝑡𝑟𝑎𝑖𝑛_𝑑𝑎𝑡𝑎)
7: for all 𝑏𝑒𝑡𝑎_𝑣𝑎𝑙𝑢𝑒 in 𝑏𝑒𝑡𝑎_𝑣𝑎𝑙𝑢𝑒𝑠 do
8: 𝑡𝑒𝑚𝑝_𝑡𝑟𝑒𝑒 ← 𝑝𝑟𝑢𝑛𝑒_𝑡𝑟𝑒𝑒_𝑤𝑖𝑡ℎ_𝑎𝑙𝑝ℎ𝑎(𝑡𝑒𝑚𝑝_𝑡𝑟𝑒𝑒, 𝑏𝑒𝑡𝑎_𝑣𝑎𝑙𝑢𝑒)
9: 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠 ← 𝑝𝑟𝑒𝑑𝑖𝑐𝑡_𝑖𝑛𝑠𝑡𝑎𝑛𝑐𝑒𝑠(𝑡𝑒𝑚𝑝_𝑡𝑟𝑒𝑒, 𝑡𝑒𝑠𝑡_𝑑𝑎𝑡𝑎)
10: 𝑝𝑟𝑜𝑓𝑖𝑡[𝑏𝑒𝑡𝑎_𝑣𝑎𝑙𝑢𝑒]+ = 𝑔𝑒𝑡_𝑝𝑟𝑜𝑓𝑖𝑡(𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠)
11: end for
12: end for
13:
14: 𝑏𝑒𝑠𝑡_𝑎𝑙𝑝ℎ𝑎_𝑣𝑎𝑙𝑢𝑒 ← 𝑎𝑙𝑝ℎ𝑎_𝑣𝑎𝑙𝑢𝑒𝑠[𝑚𝑎𝑥_𝑖𝑛𝑑𝑒𝑥(𝑝𝑟𝑜𝑓𝑖𝑡)]
15: 𝑡𝑟𝑒𝑒 ← 𝑝𝑟𝑢𝑛𝑒_𝑤𝑖𝑡ℎ_𝑎𝑙𝑝ℎ𝑎(𝑡𝑟𝑒𝑒, 𝑏𝑒𝑠𝑡_𝑎𝑙𝑝ℎ𝑎_𝑣𝑎𝑙𝑢𝑒)
16: return 𝑡𝑟𝑒𝑒
17: end function

4.1.3. Predicting the outcome of an upcoming match


Now that the tree is pruned, it is ready to be used to predict the outcome of upcoming matches. When
the outcome of a match needs to be predicted, the same features that were used in the training data
are extracted for this match. The match will then be guided through the decision tree based on the
information of the instance and the decisions in the tree. When all decisions are processed, the instance
’falls’ in a leaf of the tree. Here, the prediction will take place.
There are two situations possible in the leaf: the leaf is pure or it is not. If the leaf is pure, which
means that the training instances in this leaf belong to the same class, then the predicted outcome
of the tested match is the same as these training instances. The tree will have a confidence of 100%
that this is the right prediction, since there are no known instances that belong to another class. If the
leaf is not pure, this confidence will be lower, as the leaf contains instances belonging to mixed
classes. Whether the leaf is pure or not, the class with the highest proportion is assigned to the tested
match. For example, if the leaf contains five training instances of class 𝐻 and three instances of class 𝐴,
then the model will assign class 𝐻 with a probability of 5/(5 + 3) = 0.625 = 62.5% of it being the right
prediction.
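As a small illustration of this prediction step, the snippet below returns the majority class of a leaf together with the proportion used as confidence. Representing a leaf as a plain list of class labels is an assumption made for the example only.

from collections import Counter

def predict_leaf(leaf_labels):
    # Majority class of the training instances in the leaf and the
    # proportion of that class, used as the confidence of the prediction.
    counts = Counter(leaf_labels)
    predicted, n_majority = counts.most_common(1)[0]
    return predicted, n_majority / len(leaf_labels)

print(predict_leaf(["H"] * 5 + ["A"] * 3))  # ('H', 0.625)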

To conclude this section, let us summarize how a decision tree is created. First a splitting criterion
must be chosen to define what a good split is. We used the gini impurity as an example to show how
a splitting criterion based on misclassification works. Since we are building the whole tree so that all
leaves are pure, two post-pruning methods were discussed: MDL and CCP. These pruning techniques
are applied to the tree after the whole tree is built. This usually means that the leaves of the decision
tree are no longer pure and thus the training error is no longer optimal. More importantly, however,
the test error is likely to benefit from this step, so the decision tree will perform better
on unseen data than before pruning. Since we are more interested in maximizing profit than in
getting the largest number of predictions right, we will now discuss how to alter the decision tree process
to incorporate the costs for each match.

4.2. Decision tree optimized for costs


The decision tree model that was discussed in the previous section follows the pattern of most standard
machine learning models: there is a training dataset, the model is trained and then applied to test data.
However, if we take a closer look at the splitting criterion and cost-complexity pruning, we notice that
they assume that the best situation is the one where the fewest instances are misclassified. This
is not always the case. In some cases one instance may have a bigger impact on the outcome
than other instances. This also happens in sports betting: if you correctly predict that a longshot is
going to win, you will make more profit than when correctly predicting a win for the favorite. In this
case it would make sense to incorporate these costs or profit for each match in the model and optimize
for these costs instead. This is called Cost-Sensitive Learning (CSL).
Normally you only need training data to train a standard model. The training data consists of the
features and labels for each instance (in case of classification). For our purpose we chose to use a cost-
sensitive technique which uses an additional element for each instance: the cost-matrix. A cost-matrix
is a matrix in which the costs for each possible situation are defined. Since we are applying this technique
to football betting and each match has different odds, our cost-matrices look like the cost-matrix in
table 4.1a.
There are nine possible combinations of predicted and actual outcome when predicting football matches. That is why the cost-matrix is
a three-by-three matrix which depicts the costs for each of these situations. Since this is cost-sensitive
learning, we assume that positive values are costs and negative values are profit. That is why the
values on the diagonal of the matrix are negative. If the model predicts a home win and the home team
actually wins, then the profit will be p_H, which is the odds for home o_H minus the stake, which is 1
unit for each match in our experiments. So for each event the formula is p_X = o_X − 1, X ∈ {H, D, A}.
If the model makes a false prediction, we lose our one unit stake, so the cost is one. The goal in
cost-sensitive learning is to use these matrices to minimize the costs.
If we consider the same example from section 2.2.1, where the match between FC Utrecht and ADO
Den Haag is used with odds [1.53, 4, 6], then we end up with the cost-matrix shown in table 4.1b.
Here we can see that the importance of the costs for this match depends on the actual outcome of the
match. If the home team wins, then this match will not bring much profit, but if the away team manages
to win, then the match will be much more important for the model to classify correctly. In that case,
the profit is almost ten times greater, a difference that is not visible to cost-insensitive machine learning
models.
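A minimal sketch of how such a cost-matrix can be constructed from a set of odds is given below. The function name and the nested-dictionary representation are illustrative choices, not the data structure used in our implementation.

def cost_matrix_from_odds(odds, stake=1.0):
    # Build the 3x3 cost-matrix of table 4.1a from the odds [o_H, o_D, o_A]:
    # a correct prediction yields a negative cost of -(odds - stake),
    # a wrong prediction costs the 1 unit stake.
    outcomes = ("H", "D", "A")
    return {pred: {actual: (-(odds[i] - stake) if pred == actual else stake)
                   for actual in outcomes}
            for i, pred in enumerate(outcomes)}

# The FC Utrecht - ADO Den Haag example with odds [1.53, 4, 6] gives a
# diagonal of -0.53, -3.00 and -5.00, as in table 4.1b.
print(cost_matrix_from_odds([1.53, 4, 6]))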

4.2.1. Creating a cost-sensitive decision tree


To incorporate costs in the decision tree model, it is necessary to change some of the steps in the
process of building and pruning the tree. A new splitting criterion must be used and the Cost-Complexity
Pruning method needs to be altered. The Minimum Description Length does not need to change,
because it does not look at classification errors.
In section 4.1.1 we explained how the gini impurity can be used as splitting criterion for the decision
tree model. This method was used to find the best split possible based on the misclassification rate. This
works well in the traditional classification situation, but for cost-sensitive learning a different approach
is needed.
In a cost-sensitive approach the same process is used, but with a different metric to identify the
best split [41]. So instead of calculating the gini score for each possible split, we now calculate the
costs and select the split where the costs are the lowest. To calculate the costs for a split, all instances
in the group are predicted as one class. This can be expressed in the following formula:

Actual Actual
H D A H D A
H −p_H 1 1 H −0.53 1 1
Predicted D 1 −p_D 1 Predicted D 1 −3.00 1
A 1 1 −p_A A 1 1 −5.00
(a) Generic cost-matrix for football betting. (b) Cost-matrix for a match with odds [1.53, 4, 6].

Table 4.1: Cost-matrices used for football matches. Please note that negative values signify profit.

I_cost(S) = min{Cost(f_H(S)), Cost(f_D(S)), Cost(f_A(S))}


where I_cost(S) is the cost impurity for split S, f_X(S) is a function which predicts all instances in split S
as class X and Cost(f_X(S)) returns the cost of predicting all instances as class X. This is equal to summing
all costs for each instance in the split using the cost-matrix of each of these instances. Remember
that a split results in two different groups, so for both groups the impurity I_cost(S) is calculated. This
is shown in lines 9, 10 and 11 in algorithm 1. After all splits are considered, the split with the lowest
amount of costs is then chosen.
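A sketch of this computation is shown below. It assumes that each instance in a group is stored as a pair of its actual label and its cost-matrix (the same nested-dictionary form as in the earlier sketch); both are assumptions made for the example, not the exact representation used in algorithm 1.

def cost_impurity(group):
    # Cost of predicting every instance in the group as one single class:
    # sum the cost-matrix entries for that predicted class and take the
    # minimum over the three possible classes.
    totals = {}
    for predicted in ("H", "D", "A"):
        totals[predicted] = sum(cost_matrix[predicted][actual]
                                for actual, cost_matrix in group)
    return min(totals.values())

def split_cost(left_group, right_group):
    # A candidate split is scored by the summed impurity of its two groups;
    # the split with the lowest total cost is selected.
    return cost_impurity(left_group) + cost_impurity(right_group)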

4.2.2. Pruning a cost-sensitive decision tree


Cost-Complexity Pruning also needs to be altered to work in cost-sensitive learning. In this case there
are two different costs in play: one is the cost of the tree in CCP and the other is the cost for each
instance in Cost-Sensitive Learning (CSL), which can make it quite confusing. We will show which costs
we are referring to using the acronym between brackets.
Recall that the cost of a node in the tree (CCP) was C_α({t}) = R(t) + α and the cost of the whole
subtree T_t was C_α(T_t) = R(T_t) + α|T̃_t| (section 4.1.2). These formulas will no longer work due to the R(·)
term in both of them, because this is where the misclassification rate is used as a metric.
So now the new cost for a node (CCP) is defined as:

C_α({t}) = c(t) + α
and the new cost for the whole tree:

C_α(T_t) = C(T_t) + α|T̃_t|
where c(t) is the minimal cost (CSL) in node t and C(T_t) is the total cost of subtree T_t, so
C(T_t) = Σ_{t'∈T̃_t} c(t'). To get the best α values, we need to use:

α = (c(t) − C(T_t)) / (|T̃_t| − 1)
This formula is a variant of the cost-complexity formula used in [41], but it leads to the same
pruned trees. However, the 𝛼 values with our formula will be positive instead of negative, which is
more in line with the original CCP method explained in section 4.1.2. So if our formula is used, it is
only necessary to change this formula in the original process, which is at line 7 of algorithm 2. The
rest of the steps of the algorithm will work exactly the same.
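For illustration, the cost-sensitive variant of the α computation can be written as a one-line helper; the argument names are descriptive only and do not correspond to identifiers in our implementation.

def cost_sensitive_alpha(node_cost, subtree_cost, n_leaves):
    # node_cost    : c(t), the minimal cost when node t is collapsed into a leaf
    # subtree_cost : C(T_t), the summed minimal leaf costs of the subtree below t
    # n_leaves     : |T~_t|, the number of leaves of that subtree
    return (node_cost - subtree_cost) / (n_leaves - 1)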

4.2.3. Prediction process for cost-sensitive decision trees


Since the tree uses costs to check which split is best, it also uses a different method when predicting
the outcome of new instances. Let us consider an example where a prediction needs to be made for a
match and the leaf is not pure. If the leaf contains five training instances of class 𝐻 with a total cost of
1 and three instances of class 𝐴 with a total cost of -3.2, then the model will predict that the match will
end in an away win, because it looks at the class with the minimal cost instead of the class with the
highest number of occurrences. Instead of a probability of being correct, the model now outputs
the expected cost for this upcoming match, calculated as follows: −3.2/(5 + 3) = −0.4. It is
important to divide the costs of the minimal class by the total number of instances in the leaf, because
the prediction is for one instance only and the expected cost for this instance should have the same
weight as the instances in the leaf.
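The example above can be reproduced with the short sketch below. Passing the summed costs and the instance counts per class is an assumption made for readability, not the layout of the actual leaf data.

def predict_leaf_costs(class_costs, class_counts):
    # Predict the class with the lowest summed cost in the leaf and divide
    # that cost by the total leaf size to obtain the expected cost of a
    # single upcoming match.
    predicted = min(class_costs, key=class_costs.get)
    total_instances = sum(class_counts.values())
    return predicted, class_costs[predicted] / total_instances

print(predict_leaf_costs({"H": 1.0, "A": -3.2}, {"H": 5, "A": 3}))  # ('A', -0.4)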

Changing the splitting criterion and the cost-complexity pruning method has turned the standard decision
tree into a cost-sensitive learning model. The costs of each instance now depict the importance
of predicting that instance correctly, based on the odds. In this cost-sensitive model, it may take, for example,
five matches where the favorite team won to balance out a match where the longshot won. This
better reflects the real-life situation of varying profits per prediction, instead of assuming similar costs
throughout the model. However, since we use a lot of data and need to retrain the model multiple
times, the training process takes a very long time. To speed up this process, we use an incremental
version of the decision tree model, which will be explained in the next section. This is not
a vital part of our investigation, so readers who are not interested in this technique can skip to
section 4.4, which covers the methods that actually influence the results of our experiments.

4.3. Using incremental decision trees to accommodate changes


The process of building a decision tree that we discussed earlier assumes the whole dataset is known
before the splitting criterion assigns which variable and cutpoint combination is best. However, when
new data becomes available this means that the whole tree needs to be built from the ground up, which
can take a long time. Since we want to build a decision tree for each week, because new data becomes
available after each week, this means that we have to build many decision trees from scratch. This is
not ideal, so we will use an incremental version of the decision tree model which is able to incrementally
add instances to an existing model. After the new instances are added, the tree will look exactly the
same as when a decision tree is built from scratch, so this will not affect the predictions and thus the
results [43, 45, 46].

4.3.1. The process of incrementally updating the decision tree


To be able to update a decision tree incrementally, the building process has to change. In the standard
algorithm it takes just two steps: building the tree and then pruning it. Now it is necessary to split the
building step into two different steps: first we add new instances to the original tree as is, and after
these are added we have to check whether the best splits are the same as before, or whether the
tree needs to be restructured to represent the best splits possible. We will use an example to explain
how the whole process works.
First we start with an empty tree, no training data has been given to the model. Each training
instance that will be presented to the model will be processed individually. So when the first instance is
added, a leaf node is created as our root node. Let us assume that we want to create a model for the
same dataset as used in figure 4.1. The instance information can be found in table 4.2, where each
instance is added to the model in order from top to bottom.
After the first instance is added, the leaf node looks like figure 4.2a. In this case, every test instance
would be assigned to the blue class. This is not strange: the tree has only seen a blue instance,
so it does not know that there is another class in the dataset. This is shown in figure 4.3a. However, if
we add the second instance, the decision tree needs to change because now there are two different
classes. If we were to add the red instance to the leaf node, this node would no longer be pure, so we add a
decision node which splits into two leaf nodes, each containing its own class instance. In this
case, the best split is at variable X_1 at 0.17, which lies right between the two instances. This is shown
in figures 4.2b and 4.3b.
The next couple of instances can easily be added without restructuring the decision tree, because
there is no better split available until instance 7. After the seventh instance is added, the decision
node does not make a perfect split anymore, so the model needs to add another decision node. This
is shown in figures 4.2c and 4.3c. Now there are two decision nodes necessary to get the optimal
training error.
The last change happens when instance 9 is added to the model. This instance will trigger the last

X_1 X_2 class
1 0.20 0.99 𝑏𝑙𝑢𝑒
2 0.14 0.09 𝑟𝑒𝑑
3 0.15 0.34 𝑟𝑒𝑑
4 0.41 0.37 𝑏𝑙𝑢𝑒
5 0.07 0.71 𝑟𝑒𝑑
6 0.57 0.63 𝑏𝑙𝑢𝑒
7 0.61 0.48 𝑟𝑒𝑑
8 0.74 0.14 𝑟𝑒𝑑
9 0.82 0.02 𝑟𝑒𝑑
10 0.52 0.63 𝑏𝑙𝑢𝑒

Table 4.2: Data used for training the incremental decision tree.

tree restructuring and the same tree as figure 4.1b is created. Even though, for ease of reading, we
used an example where no costs are mentioned, it is also possible to use this tree building process for
cost-sensitive trees.

(a) Step 1. (b) Step 2. (c) Step 7.

Figure 4.2: Incremental decision tree structures during training.

(a) Data used at step 1. (b) Data used at step 2. (c) Data used at step 7.

Figure 4.3: Data used during the training of the incremental decision tree.

4.3.2. Storing requisite data elements in the decision tree


To be able to update the structure of the model, some information needs to be stored inside the model
besides the best variable and cutpoint combinations. Otherwise valuable information is lost and it would
not be possible to find the best split anymore. So what we win in training speed, we lose in memory
consumption. In appendix A it is shown which data needs to be saved inside the incremental decision
tree, and in table 4.3 it is depicted which information is stored and why it is stored. We will now discuss
which information is needed for the decision nodes in the incremental decision tree, and after that we
will do the same for the leaves. Due to the nature of the incremental design, this paragraph is quite
technical and presumes some knowledge about dictionaries in programming.
For the decision node it is important to store the best_variable. This value will be used when a
prediction needs to be made. With the help of this value, a prediction can be made quickly, because
it is now only necessary to look up the cutpoint of this variable (in the variables key) instead of looking
through the whole tree to determine which variable is best. The class_counts key stores, for each class,
how many instances there are in this decision node and what the costs are for this class. As shown in
appendix A, this information is also stored in a tree-like data type. This is especially useful when many
classes exist in the data, but since we only have three possible outcomes (home, draw and away), this
tree will never be deeper than one level. The flags key stores a two-value boolean array, where the
first boolean tracks whether the node is ’stale’, meaning that new information has been added to this node. If a
node is stale, the model needs to verify that the best cutpoint is still being used, whereas if a node is not
stale, the best split is already known. The second boolean shows whether this node is pruned. If the node

Key Description
best_variable Empty if leaf, else stores which variable is best
class_counts Saves how many instances of each class are inside this node
flags Empty if leaf, else tracks whether a node is stale or pruned
instance_costs Empty if decision node, else stores the cost-matrix of each instance
instances Empty if decision node, else stores information of each instance
left Empty if leaf, else refers to the left node
mdl Stores the MDL value of this node
n_instances Stores the number of instances
n_variables Stores the number of variables of each instance
right Empty if leaf, else refers to the right node
variables Empty if leaf, else saves the unique values for each variable

Table 4.3: Information stored in an incremental decision tree with a short description. Keys used in both leaves and decision
nodes cover both cells.

is a decision node, then this means that this node has to have a left node and a right node, since a
decision is made in this node. To be able to store the information of these two ’children’ of this node,
the left and right keys are used to store the information for the nodes below this node. These nodes
can also be decision nodes, leaves, or a combination of both. The mdl key stores the MDL value of this
node, but only if the MDL pruning method is used. The n_instances key stores the number of instances
that fall into this node and the n_variables stores the number of variables or features used by these
instances. The last key is the variables key, which saves the unique values of each variable used in the
data. This is very important data to store, because without this data it is not possible to determine the
best split when new data is added to the model. This is also the place where the impurity values for
each possible split and best cutpoint for each variable are stored. In appendix A it is depicted that a lot
of information is stored in this part of the tree. It almost takes up half of the total information stored
in the tree, and this will increase very quickly when new data is added to the tree if this data contains
new variable values.
For a leaf node less data is stored. There is no information stored in best_variable, flags, left, right
or variables. The primary goal is to save the instances and the costs of these instances in the instances
and instance_costs keys of the tree so that we are able to use this information later.
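A rough sketch of what one decision-node dictionary could look like is given below. The exact layout used in our implementation is shown in appendix A, so the field values here, including the feature name, are purely illustrative.

decision_node = {
    "best_variable": "points_difference",   # hypothetical feature name
    "class_counts": {"H": {"count": 12, "cost": -4.1}},  # counts and costs per class
    "flags": [True, False],                  # [stale, pruned]
    "instance_costs": None,                  # only filled for leaves
    "instances": None,                       # only filled for leaves
    "left": {},                              # dictionary of the left child node
    "right": {},                             # dictionary of the right child node
    "mdl": None,                             # only used when MDL pruning is applied
    "n_instances": 12,
    "n_variables": 8,
    "variables": {"points_difference": {"cutpoints": [0.5, 2.5]}},  # unique values and candidate cutpoints
}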

4.4. Training and testing through cross-validation


In machine learning it is important to use train and test data to make sure that the results show how
well the model performs on unseen data. However, doing this means that the test data is only used
for testing and thus the valuable information in this data cannot be used in the model. It would be
preferable if all data could be used while still being able to evaluate the model in a correct way. This
is where cross-validation comes into play [47].

4.4.1. Standard cross-validation process


A straightforward form of cross-validation is the so-called k-fold cross-validation. The dataset is split
into 𝑘 parts (folds). In practice 𝑘 = 5 or 𝑘 = 10 is usually used, but other values also work. The data
will be randomly assigned to one of the 𝑘 folds and the size of all folds is roughly the same. This is
shown in figure 4.4. After each instance is assigned to a fold, one of the folds is used for test data
(the red dots in the figure) and the other 𝑘 − 1 folds are used for training data (the blue dots in the
figure). A model is built on the 𝑘 − 1 folds and then tested on the test data. This process is repeated 𝑘
times so that each fold is used once as a test set. The results for each set are summed up or averaged,
depending on the evaluation metric that is used, and this will give a rough estimate of how well the
model performs on unseen data.
There is however one problem when using traditional cross-validation on temporal data: data is
randomly assigned to a fold. If the data is time-sensitive then the model might be trained with data
instances from the future to predict instances that occurred earlier in time. This is also the case for our
project, as each week new matches are played and there might be a relation to matches played before.
So in our case we will use another version of cross-validation called time series cross-validation.
24 4. Methods: Decision trees

Figure 4.4: Difference between traditional evaluation, standard cross-validation and time series cross-validation. The blue dots
are used for training a model, the red dots for testing that model. In this case the order of the dots matter: the dots on the left
occurred earlier in time than the right dots.

4.4.2. Incorporating time through time series cross-validation


A sequence of data points that is temporal is also referred to as a time series. In this case you need
to be careful when using cross-validation, because in the traditional version the data is scrambled
into 𝑘 different folds, regardless of when the instance occurs. This can be fixed by using time series
cross-validation [48].
Time Series Cross-Validation (TSCV) works slightly differently from traditional cross-validation, as it
ensures that only instances that occurred prior to the validation set are used for training. So, if we
want to do 5-fold TSCV, then we actually need to divide the data into 6 equal parts. A model is trained
on the first fold, tested on the second fold and the results are saved. After that a model is trained on
the first two folds and tested on the third fold. The results are saved again and the process continues
until the last fold is used as a test set. After that, it makes no sense to continue, because there is
no further data left to validate on. Five models have then been created and there are five results to aggregate in the
proper way, depending on the evaluation method used. This makes it possible to use all the temporal
data for training and still get an estimate of how well the model performs. This process is visualized
and compared with traditional evaluation and standard 5-fold cross-validation in figure 4.4.
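To make the fold construction concrete, a minimal Python sketch is given below. It assumes the instances are already sorted in time, and any remainder after the last full part is simply left unused; both are simplifications made for illustration only.

def time_series_folds(instances, k=5):
    # Cut the (time-ordered) data into k + 1 roughly equal parts and let every
    # fold train on all parts that occur before its test part.
    fold_size = len(instances) // (k + 1)
    for i in range(1, k + 1):
        train = instances[: i * fold_size]
        test = instances[i * fold_size : (i + 1) * fold_size]
        yield train, test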
5
Experimental setup
As mentioned in the introduction of this paper, the aim of this research project is to compare multiple
cost-sensitive methods. These different methods will be introduced in this chapter. But before doing
that we will first look at the data that is used in our experiments in section 5.1. After that, in section 5.2,
we will present the different cost-sensitive methods that are compared in this paper. These will be
investigated to determine which one is optimal, i.e. which one leads to the most profit. Finally,
we will show which experiments were conducted to be able to draw a conclusion in section 5.3.

5.1. Data
There is a lot of data available regarding football match outcomes. For this project we need historical
data for football match outcomes but also the odds of multiple bookmakers. First, we will look at the
data sources for this information and after that a quick analysis of the acquired data is performed.

5.1.1. Sources used


The first source that is used is the website football-data.co.uk [19]. This site hosts a lot of
football-related data for multiple countries and leagues. For the Dutch Eredivisie there is data available
for seasons 2000/2001 until 2016/2017 that meets our requirements. These seasons are downloaded
in CSV format, one file per season.
When we started this project we used the Kaggle website [18] to get the required data. Instead
of using separate CSV files, a SQL database is available where all data is already linked and cleaned.
However, only seasons 2008/2009 until 2015/2016 are present in this dataset. Since we wanted to test
our hypotheses on as many seasons as possible, we decided to use this dataset and append the necessary
seasons from the first source to this database. This resulted in a dataset containing seasons 2006/2007
until 2016/2017. This way we could train our models on 2006/2007 and test on the other ten seasons.
The process of linking these datasets to each other will be explained in section 5.3.1.
In both datasets there are odds available from ten different bookmakers. The names of all book-
makers can be found in table 2.1. The margins that each bookmaker applies were already shown in
figure 2.2. In season 2006/2007 the average margin of all bookmakers combined is 11.9% and this
decreases each consecutive year until an average margin of 6.6% is reached in 2016/2017.

5.1.2. Quick analysis of the data


The final dataset that we used contains 3,366 matches in total. For only three matches there were no
odds available from any bookmaker, so these were removed, leaving a dataset
with a total of 3,363 matches. To see how many bookmakers’ odds are available for the remaining
matches, see figure 5.1 (and appendix F). This figure shows that at least five bookmakers have
published their odds for each of the 3,363 matches and that in season 2012/2013 all ten bookmakers’ odds can
often be used in the experiments.
Figure 5.2 shows for how many matches each bookmaker has published odds. Only Bet365
and Bet&Win cover all 3,363 matches; for the other bookmakers we do not have all the data. The
bookmaker with the lowest number of matches available is Pinnacle, with 1,497.


Figure 5.1: Number of odds published per match for seasons 2006/2007-2016/2017.

Figure 5.2: Number of odds published per bookmaker for seasons 2006/2007-2016/2017.

(a) Bookmakers’ favorite accuracy. (b) Profit when betting on bookmakers’ favorites.

Figure 5.3: Favorite accuracy and profit per bookmaker for seasons 2006/2007-2016/2017.

With the help of this dataset it is also possible to calculate the prediction accuracy of the bookmakers
for seasons 2006/2007 until 2016/2017. As mentioned in chapter 3, accuracy rates around 50%
seem to be quite common when researchers attempt to predict future matches. The bookmakers
indicate their prediction of an outcome by giving that outcome the lowest relative odds, since the
bookmaker assumes it has the highest probability of occurring, as shown in section 2.2.1. Using this metric,
we can observe that the bookmaker with the highest accuracy (57.1%) is Gamebookers. In figure 5.3a
the results for all bookmakers are shown. The bookmaker with the lowest accuracy is Pinnacle with
53.1%. While Pinnacle has the lowest prediction accuracy of the ten bookmakers, they still
perform better than most of the accuracy values reported in chapter 3, where the highest accuracy
over multiple seasons is 56.7% [34]. This shows that while it can be hard to beat the bookmakers,
more recent machine learning research is becoming increasingly accurate.
Figure 5.3b shows what the relative profit would be if one were to bet on these bookmakers’
favorites (determining the profit of the bookmakers themselves is outside the scope of this project).
We see that betting on Gamebookers’ favorites would result in the best return, at -3.6% profit. Although
Pinnacle had the lowest relative accuracy in its predictions, it is not the bookmaker with the lowest
profit. This is William Hill with a profit of -6.9%, whereas Pinnacle resulted in -5.6%. This shows that
a higher accuracy does not automatically lead to a higher relative profit. Since Pinnacle’s margins
are much lower than those of the other bookmakers, correctly predicting the outcome of a match
yields a higher profit, whereas the loss for misclassifying a match’s outcome is the same. In real-life
betting, it is thus important to consider where you place your bets if you wish to have a higher
chance of a profit.
If we look at the number of times each possible outcome occurs, then we see roughly the same
pattern for each season. This pattern is shown in figure 5.4. We observe that almost half of the matches
are won by the home team and the away team wins just over 25% of the time every season. The two
grey dotted lines are placed at 50% and 75% of the total amount of matches in a season. The only
season where predicting only home wins would result in an accuracy that comes close to bookmakers’
standards is season 2010/2011. If only home wins were predicted, then the accuracy would be 52.3%.
This shows that bookmakers, and in particular Gamebookers, do a good job in forecasting the match
outcomes.

Figure 5.4: Outcome distribution for seasons 2006/2007-2016/2017. The two grey dotted lines are placed at 50% and 75% of
the total amount of matches in a season.

5.2. Cost-combining methods that are tested in this project


As previously mentioned, the goal of this project is to investigate if there is a way to combine the odds
of different bookmakers that results in a better performing model: one that would (at least) result
in more profit than if one were to rely on the odds of one bookmaker alone. In figure 5.3 it was
shown that a higher accuracy does not automatically lead to a higher profit. It really depends on which
matches you predict correctly. In chapter 4 we have shown how to convert the standard decision
tree algorithm to incorporate the costs for each match so that the model optimizes for profit (costs)
instead of accuracy.
In our data we have the odds of ten bookmakers available to us, but in the cost-sensitive model
only one cost-matrix can be used. So we have to choose which odds to use to create the most
profitable model. Since only two bookmakers, Bet365 and Bet&Win, have published their odds
for all matches, we will create two models, each using only one of these bookmakers' odds. But as shown in figure 5.1, for
each match at least five bookmakers have released their odds, so we would lose valuable information
if we ignored the other bookmakers. Additionally, there may be a way to combine the odds of multiple
bookmakers that results in a more profitable model. That is why we will also create four more
models that use a combination of odds from all bookmakers available for those matches.
In section 4.2 we showed how to convert odds to a cost-matrix. Combining these cost-matrices
might lead to more profit than using only one bookmaker as our cost-matrix source. So for each match
where multiple odds are available, we will create multiple cost-matrices for the given odds and then
combine these matrices so that in the end we will still have only one cost-matrix that can be used in
the cost-sensitive models.
The four different combining methods that will be investigated are: average (avg), maximum (max),
minimum (min) and random (rnd) cost-matrices. Averaging cost-matrices is derived from the wisdom
of the crowd phenomenon, where combined knowledge usually leads to better predictions than an
individual’s prediction [28]. Our hypothesis is that this method will perform best. Using the maximum
costs for each outcome looks at the most pessimistic outcome, so our hypothesis is that this method will
choose the longshot teams more often than the other methods, because the cost difference between
losing and winning is small. Important to note is that using the maximum costs in the cost-matrices
is in fact equal to using the lowest odds and vice versa. So on the other side of the spectrum there is
combining the minimal costs into a cost-matrix. Our hypothesis is that this method will take less risk
than the other methods, so it will bet on favorites more often. The final method simply takes one
bookmaker's cost-matrix at random. Due to this random nature, we cannot
predict how well this method will do. In general, we may hypothesize it to do worse than the average,
but the random model will probably have a high variance in its results.

An example for each method can be found in figures 5.5 and 5.6. In the first figure there are
cost-matrices constructed from four different bookmakers and the result for each combining method is
shown in the second figure. In this example the rnd method uses the odds from bookmaker B.

H D A H D A H D A H D A
H -1.8 1 1 H -2 1 1 H -1.9 1 1 H -1.6 1 1
D 1 -2.5 1 D 1 -2.1 1 D 1 -2.7 1 D 1 -2.5 1
A 1 1 -1.4 A 1 1 -1.3 A 1 1 -1.3 A 1 1 -1.4
(a) Bookmaker A. (b) Bookmaker B. (c) Bookmaker C. (d) Bookmaker D.

Figure 5.5: Cost-matrices of four bookmakers for the same match.

H D A H D A H D A H D A
H -1.83 1 1 H -1.6 1 1 H -2 1 1 H -2 1 1
D 1 -2.45 1 D 1 -2.1 1 D 1 -2.7 1 D 1 -2.1 1
A 1 1 -1.35 A 1 1 -1.3 A 1 1 -1.4 A 1 1 -1.3
(a) avg method. (b) max method. (c) min method. (d) rnd method.

Figure 5.6: The four different cost-combining methods applied to four example odds. The costs in the matrices are rounded to
two decimals.
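A sketch of the four combining methods, applied to the diagonal (profit) entries of the bookmakers' cost-matrices, is given below. Representing a match by a list of [c_H, c_D, c_A] diagonals per bookmaker is an assumption made for the example, since the off-diagonal costs are 1 in every method.

import random

def combine_diagonals(diagonals, method):
    # Group the H, D and A costs of all bookmakers together, then combine.
    per_outcome = list(zip(*diagonals))
    if method == "avg":
        return [sum(costs) / len(costs) for costs in per_outcome]
    if method == "max":   # highest cost = lowest odds, the pessimistic view
        return [max(costs) for costs in per_outcome]
    if method == "min":   # lowest cost = highest odds, the optimistic view
        return [min(costs) for costs in per_outcome]
    if method == "rnd":   # keep the matrix of one randomly chosen bookmaker
        return list(random.choice(diagonals))
    raise ValueError("unknown method: " + method)

# The four bookmakers of figure 5.5:
bookmakers = [[-1.8, -2.5, -1.4], [-2.0, -2.1, -1.3],
              [-1.9, -2.7, -1.3], [-1.6, -2.5, -1.4]]
print([round(c, 3) for c in combine_diagonals(bookmakers, "avg")])
# [-1.825, -2.45, -1.35], cf. figure 5.6a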

In total we will create six models: two models that only use a single bookmaker’s odds (Bet365
and Bet&Win) and four models that use a combination of odds. First we will investigate if the single
bookmaker models perform better than the combined odds models. We think that the avg and min
models will perform better than the single bookmaker models, but max and rnd will perform worse. We
will also investigate which of the four combining methods performs best. We expect the avg method
to be the best method of all methods and the max to perform worst.

5.3. Experiments
With the goal and general structure of this project in mind, it is time to decide how to conduct the
experiments. The process from acquiring the data to storing the results on disk is shown in figure 5.7.
First we have to combine the different data sources into one container and extract features from this
dataset. After this is done, we need to build and prune the incremental cost-sensitive decision trees
for each of the six methods for every week in the dataset. These trees are then used to predict the
outcomes for upcoming matches. Since the models are trained on season 2006/2007, this decreases
the number of matches we will predict to 3,057 (compared to 3,363 in the whole dataset). The results
of these predictions are stored on disk and the last step is to select which matches to bet on, instead
of betting on all matches. We will now walk through the whole process in more detail.

5.3.1. Preparing the data


The first step is to combine the different data sources into one dataset. This process is shown in
appendix D.

Figure 5.7: Experiment process of this project.



After this intermediate dataset is made, it is easy to extract the features for each match. For
each match, we will search for the necessary information of the features in the team information
dataset. It is important to look for information originating from before the match is played, because
we have to simulate that we do not know anything about the match’s results: we only know what
happened in previously played matches. For more information about the features used in our models,
see appendix E. In short: for each match we want to know the difference in points, goals and results
between the two playing teams at the moment in time before the match is played.

5.3.2. Building the cost-sensitive models


The feature dataset is ready to use, but since we are applying cost-sensitive learning, we also need to
create the cost-matrices for each match and available bookmaker. After this is done, it will be possible
to start building the models. To compare the results, we will run the experiments multiple times for
each cost method. While running the first couple of experiments, it became clear that it would take too long
to build the models and prune them immediately, so it was decided to split the work into two
steps. Even though the work was split, it still took more than half a day to create each model and one
full day to prune it. Due to lack of time, we were unable to run more than five experiments for each
method. This resulted in a total of 30 models. First we will explain how the models are created without
pruning them, after that we will show how the pruning was done.

Creating models
In algorithm 4 it is shown which pseudocode is used to create the models for each week. The name
parameter is used to save the model to disk (’avg’, ’max’, ’min’ and ’rnd’), data refers to the dataset
where all features are stored, seasons specifies for which seasons the models should be created and
finally, cost_matrix_function is the cost-matrix combination method to be used. This is where
the average, maximum, minimum and random cost-matrices are created. We use the odds as they
are: we do not try to acquire the true odds of the bookmakers such as in [29].
The function incremental_update(model, data, costs) refers to an incremental decision
tree, which we built in Python. There was already code available for this incremental decision tree
at [49], but this code was written in C and it is not possible to use this code in a cost-sensitive manner,
so it was decided to convert the code to Python so that the required alterations could be made. The
function save_model(name, model) writes the decision tree, which is a Python dictionary (see
appendix A), to disk using the Pickle library to save memory.

Algorithm 4 Pseudocode to create models for each week.


function create_models(𝑛𝑎𝑚𝑒, 𝑑𝑎𝑡𝑎, 𝑠𝑒𝑎𝑠𝑜𝑛𝑠, 𝑐𝑜𝑠𝑡_𝑚𝑎𝑡𝑟𝑖𝑥_𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛)
for all 𝑠𝑒𝑎𝑠𝑜𝑛 in 𝑠𝑒𝑎𝑠𝑜𝑛𝑠 do
𝑤𝑒𝑒𝑘𝑠 ← unique(𝑑𝑎𝑡𝑎[𝑠𝑒𝑎𝑠𝑜𝑛])

for all 𝑡ℎ𝑖𝑠_𝑤𝑒𝑒𝑘 in 𝑤𝑒𝑒𝑘𝑠 do


𝑡𝑟𝑛_𝑑𝑎𝑡𝑎 ← 𝑑𝑎𝑡𝑎[𝑦𝑒𝑎𝑟𝑤𝑒𝑒𝑘 == 𝑡ℎ𝑖𝑠_𝑤𝑒𝑒𝑘]
𝑡𝑟𝑛_𝑜𝑑𝑑𝑠 ← get_available_odds(𝑡𝑟𝑛_𝑑𝑎𝑡𝑎)
𝑡𝑟𝑛_𝑐𝑜𝑠𝑡 ← 𝑐𝑜𝑠𝑡_𝑚𝑎𝑡𝑟𝑖𝑥_𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛(𝑡𝑟𝑛_𝑜𝑑𝑑𝑠)

𝑚𝑜𝑑𝑒𝑙 ← incremental_update(𝑚𝑜𝑑𝑒𝑙, 𝑡𝑟𝑛_𝑑𝑎𝑡𝑎, 𝑡𝑟𝑛_𝑐𝑜𝑠𝑡)


save_model(𝑛𝑎𝑚𝑒, 𝑚𝑜𝑑𝑒𝑙)
end for
end for
end function

Prune models
By loading the unpruned models using the Pickle library, we can run the pruning steps asynchronously
in two different Jupyter notebooks. This speeds up the process of creating the unpruned models, then
pruning the models and finally getting the results for the pruned models. The pseudocode for pruning
the models can be found in algorithm 5.

The structure of the code is quite similar to the code for creating the models, as shown in algorithm 4,
because normally we would have pruned the trees immediately after creating them. However, during
the process of pruning, more data is needed to be able to use Time Series Cross-Validation (TSCV),
which was discussed in section 4.4.2.
First, those weeks intended for validation are sampled from all weeks until this_week. A maximum
of 60 weeks is used to keep the cross-validation within a reasonable amount of time, but if fewer
weeks are available, then we use them all. The validation weeks will be sorted based on time, so that
the TSCV will be correctly applied. The validation data and costs are then split into six equally sized
parts. This will lead to 5-fold TSCV, because the last part of data is not used for training. The model
is loaded from disk and then the function ccp will apply Cost-Complexity Pruning (CCP) on the model
given the validation data and cost datasets. (For more about CCP, see section 4.2.2.) After the pruning
is completed, the model is then saved to disk again. In these experiments the MDL method is not used.
We will discuss why this was decided in chapter 7.

Algorithm 5 Pseudocode to prune models for each week.


function prune_models(𝑛𝑎𝑚𝑒, 𝑑𝑎𝑡𝑎, 𝑠𝑒𝑎𝑠𝑜𝑛𝑠, 𝑝𝑟𝑒𝑣_𝑤𝑒𝑒𝑘𝑠, 𝑐𝑜𝑠𝑡_𝑚𝑎𝑡𝑟𝑖𝑥_𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛)
for all 𝑠𝑒𝑎𝑠𝑜𝑛 in 𝑠𝑒𝑎𝑠𝑜𝑛𝑠 do
𝑤𝑒𝑒𝑘𝑠 ← unique(𝑑𝑎𝑡𝑎[𝑠𝑒𝑎𝑠𝑜𝑛])

for all 𝑡ℎ𝑖𝑠_𝑤𝑒𝑒𝑘 in 𝑤𝑒𝑒𝑘𝑠 do


𝑣𝑎𝑙_𝑤𝑒𝑒𝑘𝑠 ← sample(𝑝𝑟𝑒𝑣_𝑤𝑒𝑒𝑘𝑠, min(60, len(𝑝𝑟𝑒𝑣_𝑤𝑒𝑒𝑘𝑠)))
𝑣𝑎𝑙_𝑑𝑎𝑡𝑎 ← 𝑑𝑎𝑡𝑎[𝑦𝑒𝑎𝑟𝑤𝑒𝑒𝑘 ∈ 𝑣𝑎𝑙_𝑤𝑒𝑒𝑘𝑠]
𝑣𝑎𝑙_𝑜𝑑𝑑𝑠 ← get_available_odds(𝑣𝑎𝑙_𝑑𝑎𝑡𝑎)
𝑣𝑎𝑙_𝑐𝑜𝑠𝑡 ← 𝑐𝑜𝑠𝑡_𝑚𝑎𝑡𝑟𝑖𝑥_𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛(𝑣𝑎𝑙_𝑜𝑑𝑑𝑠)

𝑣𝑎𝑙_𝑑𝑎𝑡𝑎_𝑐ℎ𝑢𝑛𝑘𝑠 ← split(𝑣𝑎𝑙_𝑑𝑎𝑡𝑎, 6)
𝑣𝑎𝑙_𝑐𝑜𝑠𝑡_𝑐ℎ𝑢𝑛𝑘𝑠 ← split(𝑣𝑎𝑙_𝑐𝑜𝑠𝑡, 6)

𝑚𝑜𝑑𝑒𝑙 ← load_model(𝑛𝑎𝑚𝑒)
𝑚𝑜𝑑𝑒𝑙 ← ccp(𝑚𝑜𝑑𝑒𝑙, 𝑣𝑎𝑙_𝑑𝑎𝑡𝑎_𝑐ℎ𝑢𝑛𝑘𝑠, 𝑣𝑎𝑙_𝑐𝑜𝑠𝑡_𝑐ℎ𝑢𝑛𝑘𝑠)
save_model(𝑛𝑎𝑚𝑒, 𝑚𝑜𝑑𝑒𝑙)
end for
end for
end function

5.3.3. Betting on matches


Now that the decision trees are created and pruned for each week, it is time to bet on upcoming
matches. For each week a tree was created, so we will predict the outcomes of the next week by
applying the data of that week to the current tree. This will result in either an ’H’ for a home win,
a ’D’ for a draw or an ’A’ for an away win. Recall that the model does not contain any information about
these (’future’) matches before predicting the outcomes to make sure that the real life situation is
simulated.
After the model predicts the outcome of a match, it then has a choice between five to ten book-
makers to place its bet (see figure 5.1 to see how many bookmakers’ odds are available per match).
We will place our bet at the bookmaker that offers the highest odds for the predicted outcome. This
means that multiple bookmakers will be used to place bets, since for each match the odds differ slightly
from bookmaker to bookmaker. We will not take into account additional costs that may result from
this strategy (e.g. taxes, transaction costs). The results reflect the situation where the model bets on
a match with a symbolic stake of 1 unit (which can be thought of as €1), and the profit of this match
depends on the bookmaker with the highest odds for the predicted outcome.
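The sketch below illustrates this betting rule for a single match; the tuple-based representation of the available odds is an assumption made for the example.

def profit_for_match(prediction, actual_outcome, available_odds, stake=1.0):
    # Place a 1-unit bet at the bookmaker offering the highest odds for the
    # predicted outcome; return the resulting profit (or the lost stake).
    index = {"H": 0, "D": 1, "A": 2}[prediction]
    best_odds = max(odds[index] for odds in available_odds)
    if prediction == actual_outcome:
        return best_odds * stake - stake
    return -stake

print(round(profit_for_match("A", "A", [(1.53, 4.0, 6.0), (1.50, 4.20, 6.07)]), 2))  # 5.07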

5.3.4. Selecting which matches to bet on


Besides betting on every match, there is another situation that will be investigated in this project. In
real life betting it is not common to bet on all matches: punters (bettors) usually bet on matches
that have ’value’. This means that a punter assumes a higher probability of an event occurring than

a bookmaker does. This is called ’value betting’ [50]. If this is the case, the bookmaker’s odds of
that event will return a higher profit than the punter thinks that event is worth. To determine whether
there is indeed value in betting on an event, the punter may use the following formula: V(X) =
E_p(X) − E_b(X) = E_p(X) − 1/o_X, where V(X) is the value of event X, E_p(X) is the expected probability
of event X according to the punter, E_b(X) is the expected probability of event X according to the bookmaker and o_X
are the odds of event X. If we fill in the formula using the values of an example, then we get:
V(X) = 0.19 − 1/6.07 ≈ 0.19 − 0.165 = 0.025 = 2.5%. Thus, for example, if a punter were to think
that the chance of ADO Den Haag beating FC Utrecht is 19% and the bookmaker's odds for that event
are 6.07 (resulting in a probability of 16.5%), then this outcome has a value of 2.5%. (Converting
the odds to the probability of that event happening was explained in section 2.2.1.) Every punter has
his/her own threshold value or preference for when to bet on a match or not, and can use this formula
to determine whether this threshold is met.
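A direct translation of this formula is shown below, reproducing the ADO Den Haag example; it is a plain illustration of the calculation, not part of our betting pipeline.

def value(punter_probability, bookmaker_odds):
    # Punter's probability minus the probability implied by the bookmaker's odds.
    return punter_probability - 1.0 / bookmaker_odds

print(round(value(0.19, 6.07), 3))  # 0.025, i.e. a value of 2.5%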
We can also apply this value betting strategy in this project. When a model predicts the out-
come of an upcoming match it can also show what the expected costs are for this match, just like a
cost-insensitive decision tree can show the probability of the predicted outcome (see sections 4.1.3
and 4.2.3). Recall that the prediction takes place at the leaves of a decision tree and the information
of the training instances that are in these leaves are responsible for the prediction.
Since we use cost-sensitive models, we will use the expected cost of the predicted instances to
find matches that have ’value’. To see how to get the expected costs of an upcoming match, take
a look at section 4.2.3. In our case, to find matches with value, we will compare the odds of the
predicted outcome with the expected costs. This is calculated through the following formula: V(X) =
o_X − (−1 · E_c(X)), where E_c(X) is the expected cost of event X. In this case, we want to bet on matches
where the odds of event X would result in a higher profit if predicted correctly than the expected profit
of that event X. Recall that profit is a negative cost in cost-sensitive models, so we have to multiply
the expected cost by -1 to get the expected profit.
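The corresponding calculation for our cost-sensitive models is sketched below; the helper name anticipates the oed (odds-expectation difference) value used in algorithm 6 and is chosen for this illustration only.

def odds_expectation_difference(predicted_odds, expected_cost):
    # V(X) = o_X - (-1 * E_c(X)): the profit offered by the best bookmaker
    # compared with the profit the model itself expects.
    return predicted_odds - (-1 * expected_cost)

# A match predicted as an away win with odds 6.07 and an expected cost of -0.4:
print(round(odds_expectation_difference(6.07, -0.4), 2))  # 5.67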
The last step involves setting a threshold for which matches we would want to bet on. Setting this
threshold too high could mean betting on almost no matches, but setting it too low could mean betting
on too many misclassified matches, which lowers our relative profit. The algorithm which we
used to get our optimal threshold values is shown in algorithm 6.

Algorithm 6 Match selection pseudocode.


function match_selection(match_data)
for all 𝑤𝑒𝑒𝑘 in 𝑤𝑒𝑒𝑘𝑠 do
𝑝𝑟𝑒𝑣_𝑑𝑎𝑡𝑎 ← 𝑚𝑎𝑡𝑐ℎ_𝑑𝑎𝑡𝑎[𝑝𝑟𝑒𝑣_𝑤𝑒𝑒𝑘𝑠]
𝑜𝑒𝑑 ← sort(𝑝𝑟𝑒𝑣_𝑑𝑎𝑡𝑎[𝑜𝑑𝑑𝑠] − (−1 ∗ 𝑝𝑟𝑒𝑣_𝑑𝑎𝑡𝑎[𝑒𝑥𝑝𝑒𝑐𝑡𝑒𝑑_𝑐𝑜𝑠𝑡]))

for all 𝑜𝑒𝑑_𝑣𝑎𝑙𝑢𝑒 in 𝑜𝑒𝑑 do


𝑝𝑟𝑜𝑓𝑖𝑡 ← sum(𝑝𝑟𝑒𝑣_𝑑𝑎𝑡𝑎[𝑝𝑟𝑜𝑓𝑖𝑡] if 𝑝𝑟𝑒𝑣_𝑑𝑎𝑡𝑎[𝑜𝑒𝑑] > 𝑜𝑒𝑑_𝑣𝑎𝑙𝑢𝑒)
𝑛_𝑚𝑎𝑡𝑐ℎ𝑒𝑠 ← count(𝑝𝑟𝑒𝑣_𝑑𝑎𝑡𝑎[𝑜𝑒𝑑] > 𝑜𝑒𝑑_𝑣𝑎𝑙𝑢𝑒)
if 𝑛_𝑚𝑎𝑡𝑐ℎ𝑒𝑠 > ceil(len(𝑝𝑟𝑒𝑣_𝑑𝑎𝑡𝑎) / 30) then
𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑_𝑟𝑒𝑙_𝑝𝑟𝑜𝑓𝑖𝑡.append(𝑝𝑟𝑜𝑓𝑖𝑡 / 𝑛_𝑚𝑎𝑡𝑐ℎ𝑒𝑠 ∗ 100)
end if
end for

𝑏𝑒𝑠𝑡_𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 ← 𝑜𝑒𝑑[max_index(𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑_𝑟𝑒𝑙_𝑝𝑟𝑜𝑓𝑖𝑡)]
𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑𝑠.append(𝑏𝑒𝑠𝑡_𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑)
end for
end function

To obtain the optimal value betting strategy for each week, we first load the data of previous weeks,
since we cannot use the data of upcoming or future matches. The idea is that previous matches will
allow us to calculate the best threshold value, which results in the highest relative profit. Although this
threshold value may not be the optimal one, it was used as an approximation during this project due to
time constraints. After the data of previous weeks is obtained, the difference between the odds of the
predicted outcomes and the expected costs of these predictions is calculated. Then, for each of these
𝑜𝑒𝑑 (odds-expectation difference) values, the profit of using this as a threshold is calculated. If this
threshold results in enough matches to bet on, then we save the relative profit, otherwise we ignore

this threshold. (’Enough’ is determined by dividing the number of previous matches by 30, to ensure a
proportional representation of previous data. By the last week, this means a minimum of 100 matches
is considered.) After the profits for all 𝑜𝑒𝑑 values are calculated, we select the best threshold value for
that week and this threshold is then applied to the upcoming matches.
To see how this works in practice, we display all 𝑜𝑒𝑑 values and their corresponding relative profit
in figure 5.8 for one of the runs of the avg method. The blue dots represent the relative profit and the
red line shows on how many matches would be bet. We want to bet on a minimum of 100 matches, so
when a threshold results in less than 100 matches we ignore it. We see that a profit of almost 12.5%
can be achieved when an 𝑜𝑒𝑑 threshold of 5.5 is used. However, this result is actually very hard to
achieve, since this is only the result after incorporating the data of all ten seasons. In reality, we could
only base our threshold on previous events, so less data than those full ten seasons. When a selection
is to be made for upcoming matches, we can only look back at matches until then, so we have to use
the best threshold up to that point in time.

Figure 5.8: Relative profit per threshold value. Blue dots depict the profit and the red line shows how many matches are
bet on.
6
Results
This chapter contains the results of the experiments as described in the previous chapter. Recall that for
each cost-combining method five runs were conducted. The results in this chapter show the average
profit values (including standard deviations), cost-confusion matrices, and statistical significance checks
over these five runs per method. The results are split over two sections. In section 6.1 the results
for betting on all matches are shown and in section 6.2 the results for ’value betting’, betting only on
selected matches, are presented.

6.1. Betting on all matches


This section contains all results for the situation where the models bet on all 3,057 matches. Since the
model is trained on season 2006/2007 and starts to bet on matches beginning in season 2007/2008,
the results will cover seasons 2007/2008 until 2016/2017.

6.1.1. Profit and accuracy


The cumulative profit over time for each method is shown in figure 6.1. In this figure we see that
the avg method has the highest profit after ten seasons, with the min method achieving the lowest
profit. No method has managed to end with a profit, but some did manage to achieve a positive return
for a short period of time, like the rnd method in seasons 2013/2014 and 2014/2015. Since no method
achieves an overall profit, this seems to support the idea that selective betting may be more profitable (see
section 6.2). To see the results for all runs of each method, see appendix G.
The relative profit after season 2016/2017 is shown in table 6.1. This table also displays the accuracy
for each method and the Standard Deviation (SD). We can see that avg has the best overall accuracy.
The bw method has the best accuracy for home wins, b365 performs best for draws and max is best in
predicting away wins. For an in-depth look at how often each method predicts each of the different
outcomes, the confusion matrices showing each method’s performance can be found in appendix H.

6.1.2. Cost-confusion matrices


In figure 6.2 the cost-confusion matrices for each method are shown. A cost-confusion matrix depicts
how much cost is incurred for each possible combination of predicted and actual outcome. Normally costs are displayed as positive values,

Method Profit SD Accuracy SD Home Draw Away


b365 -3.49% (2.0) 40.29% (0.8) 61.0% 24.1% 19.4%
bw -1.99% (1.5) 40.90% (0.6) 62.2% 23.0% 20.4%
avg -1.13% (1.7) 40.96% (0.8) 62.2% 22.6% 21.0%
max -4.15% (1.7) 39.82% (0.7) 59.4% 20.2% 23.7%
min -4.33% (2.3) 40.63% (1.0) 61.5% 20.5% 22.7%
rnd -2.86% (1.5) 40.04% (0.5) 60.2% 21.4% 22.1%

Table 6.1: Profit and accuracy for all methods when betting on all matches.


Figure 6.1: Profit over time for all methods when betting on all matches.

but in these figures we display profit as positive and loss as negative to avoid any misunderstanding.
On the left side the averaged cost-confusion matrices are shown and on the right the standard
deviations are displayed. Some noteworthy results: when betting only on those matches where the home
team is predicted to win using the avg method, we would have ended up with a mean profit of 28.08
units. The avg method has the largest standard deviation for matches predicted as a draw (40.73),
but the lowest standard deviation for matches predicted as an away win (11.47).
In most cases, the standard deviation on the diagonal of the matrices is the highest value. This is
because the profit for a correctly predicted outcome varies a lot, whereas misclassifying a match
always incurs the same cost.
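
The sketch below illustrates how such a cost-confusion matrix can be filled, assuming unit stakes and decimal odds: a correct prediction adds the net payout (odds minus the stake) to its diagonal cell, while a wrong prediction subtracts the lost stake from the corresponding off-diagonal cell. The function and variable names are hypothetical.

import numpy as np

OUTCOMES = ["H", "D", "A"]

def cost_confusion_matrix(actual, predicted, predicted_odds, stake=1.0):
    # actual, predicted: outcome labels per match; predicted_odds: the decimal
    # odds that would be paid out for the predicted outcome of each match.
    matrix = np.zeros((3, 3))
    for a, p, odds in zip(actual, predicted, predicted_odds):
        i, j = OUTCOMES.index(a), OUTCOMES.index(p)
        matrix[i, j] += (odds - 1.0) * stake if a == p else -stake
    return matrix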

6.1.3. Statistical significance


To check whether any of the methods performs significantly differently from the others, two tests were
performed. Before these tests were conducted, it was checked whether the samples were normally
distributed; this was indeed the case, as shown in appendix I. The first test is the one-way Analysis of
Variance (ANOVA) and the second is the Kruskal-Wallis test.
Both tests were conducted twice: once for all six methods and once for only the four cost-combining
methods, fitting with the two research questions. The 𝑝-values of these tests can be found in table 6.2.
If a 𝑝-value lower than 0.05 is obtained for one of the tests, then there is at least
one method that performs significantly differently from the other methods, which would tell us that
this method has better or worse results than the others. Unfortunately, no 𝑝-value below 0.05 is
obtained, so we cannot reject the null hypothesis that the means of all methods are equal.
We ran both the one-way ANOVA and the Kruskal-Wallis test because the two tests make different
assumptions about the data. The one-way ANOVA assumes that the different samples have the same
standard deviation; when this is not the case, the Kruskal-Wallis test is the more appropriate choice.
If we look at table 6.1, we see that in this case the assumption of equal
standard deviations could be a fair one. This results in 𝑝-values that lie close to each other in table 6.2.
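
Both tests are readily available in SciPy; the sketch below shows how 𝑝-values such as those in table 6.2 can be computed from the per-run profits of each method. The dictionary profits_per_method is a hypothetical variable holding one list of run profits per method.

from scipy import stats

def significance_tests(profits_per_method):
    # profits_per_method: dict mapping a method name to its per-run relative profits.
    samples = list(profits_per_method.values())
    _, p_anova = stats.f_oneway(*samples)   # one-way ANOVA
    _, p_kruskal = stats.kruskal(*samples)  # Kruskal-Wallis
    return p_anova, p_kruskal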

Test All methods Only cost-combining methods


One-way ANOVA 0.124 0.081
Kruskal–Wallis 0.125 0.069

Table 6.2: 𝑝-values of the One-Way ANOVA and Kruskal-Wallis tests when betting on all matches. The tests are performed
twice: once for all six methods and once for only the four cost-combining methods. Values in bold depict significance.

Predicted Predicted
H D A tot H D A
H 873.53 -353.20 -214.00 306.33 H 55.18 17.15 17.12
Actual D -429.80 541.32 -119.00 -7.48 Actual D 16.00 27.17 10.43
A -485.20 -224.00 303.52 -405.68 A 11.27 12.02 16.96
tot -41.47 -35.88 -29.48 -106.83 tot 37.83 24.02 24.83
(a) Cost-confusion matrix for the b365 method. (b) Standard deviation for the b365 method.

Predicted Predicted
H D A tot H D A
H 892.63 -331.20 -218.20 343.23 H 29.47 31.15 19.83
Actual D -437.40 516.19 -119.00 -40.21 Actual D 5.71 41.96 10.83
A -485.20 -215.60 336.89 -363.91 A 6.97 7.89 15.63
tot -29.97 -30.61 -0.31 -60.89 tot 19.59 15.38 30.39
(c) Cost-confusion matrix for the bw method. (d) Standard deviation for the bw method.

Predicted Predicted
H D A tot H D A
H 920.88 -324.00 -226.00 370.88 H 42.24 16.38 13.10
Actual D -419.20 509.56 -140.20 -49.84 Actual D 8.82 26.54 4.96
A -473.60 -221.80 339.68 -355.72 A 15.84 14.82 25.54
tot 28.08 -36.24 -26.52 -34.67 tot 29.33 40.73 11.47
(e) Cost-confusion matrix for the avg method. (f) Standard deviation for the avg method.

Predicted Predicted
H D A tot H D A
H 883.06 -283.20 -307.80 292.06 H 23.43 12.86 17.55
Actual D -424.00 454.65 -153.00 -122.35 Actual D 16.04 40.86 5.02
A -466.40 -205.20 375.09 -296.51 A 15.84 11.12 19.05
tot -7.34 -33.75 -85.71 -126.80 tot 21.79 22.79 22.03
(g) Cost-confusion matrix for the max method. (h) Standard deviation for the max method.

Predicted Predicted
H D A tot H D A
H 875.47 -318.00 -241.40 316.07 H 66.97 24.52 22.53
Actual D -440.60 469.52 -134.20 -105.28 Actual D 22.20 36.49 14.46
A -497.80 -182.80 337.40 -343.20 A 28.10 20.06 33.81
tot -62.93 -31.28 -38.2 -132.41 tot 21.58 38.72 15.07
(i) Cost-confusion matrix for the min method. (j) Standard deviation for the min method.

Predicted Predicted
H D A tot H D A
H 899.88 -321.60 -257.60 320.68 H 31.89 12.19 14.35
Actual D -420.40 480.89 -147.80 -87.31 Actual D 22.51 45.09 13.67
A -462.20 -223.40 364.74 -320.86 A 16.70 19.71 20.15
tot 17.28 -64.11 -40.66 -87.48 tot 22.97 32.22 36.12
(k) Cost-confusion matrix for the rnd method. (l) Standard deviation for the rnd method.

Figure 6.2: Cost-confusion matrices for all methods when betting on all matches.

6.2. Value betting


This section discusses the results that were obtained when betting on only a selected number of matches,
i.e. value betting. The process of selecting which matches to bet on is explained in section 5.3.4. While
in the previous section each method bet on the same number of matches, this is no longer the case:
each of the five runs of every method selected its own matches to bet on.

6.2.1. Profit and accuracy


Figure 6.3 shows the profit over time for each method when value betting. To see the results
for each run of the methods, see appendix J. Four methods achieve a higher profit when value betting
than when betting on all matches, while the other two perform worse. The avg method again has the
highest profit, and the worst-performing method in this case is max.
It is slightly more difficult to compare the different methods in this figure, because each method
bets on a different number of matches due to its selection process. The average number of matches
per method is listed in table 6.3, which shows that the min method bets on the most matches and
b365 on the fewest. The table also shows the relative profit, standard deviation
and accuracy. While the bw method achieves a profit similar to avg, its variance (reflected in the standard
deviation) is a lot higher than that of the avg method.
The accuracy for each method is a lot lower than when betting on all matches (compare to table 6.1).
The method with the highest accuracy is rnd, while max has the lowest. We see the same pattern for
the accuracy’s standard deviation between bw and avg as we saw for the profit, where bw has a much
higher variance than avg. To see the confusion matrices for an in-depth view of the accuracy results,
see appendix K.

6.2.2. Cost-confusion matrices


Figure 6.4 shows the cost-confusion matrices when value betting. Once again, the left side shows the
costs for each possible outcome and on the right side the standard deviations are shown. Recall that
these values are absolute costs, so it is hard to compare the results between methods since they
were achieved by betting on different numbers of matches. We see that avg performs best on home
wins and bw on draws and away wins. When value betting, apparently no method can achieve a
profit when betting only on predicted away wins.
We see a high variation among the standard deviations in the cost-confusion matrices when value
betting. Since the number of matches bet on may differ between runs, the variance
for individual outcomes (in the cells) is higher for almost all methods, but the variance for betting only
on a specific outcome (in the 𝑡𝑜𝑡 row) is rather steady. For example, the standard deviation for bw for
correctly predicted home wins is 113.89, but when looking at the costs for all matches predicted as a
home win it is only 23.66.

6.2.3. Statistical significance


To check whether there is a significant difference between any of the methods, both the one-way ANOVA and
the Kruskal-Wallis tests are conducted once again. (The samples appeared to be normally distributed,
as shown in appendix L.) The 𝑝-values of the one-way ANOVA and the Kruskal-Wallis tests when
value betting are displayed in table 6.4. In this case there are three 𝑝-values lower than 0.05, which
means that the tests measure a significant difference. However, if the assumption of
the one-way ANOVA test (that the standard deviations of all samples must be equal) is compared with the standard
deviation values in table 6.3, then we have to conclude that the data does not match this
Method Matches Profit SD Accuracy SD Home Draw Away


b365 345.2 -5.68% (10.6) 22.36% (4.9) 29.4% 23.1% 9.8%
bw 470.0 6.90% (10.0) 29.49% (7.9) 29.5% 15.5% 13.7%
avg 366.2 6.96% (4.7) 23.76% (4.8) 31.3% 16.2% 14.0%
max 469.0 -5.81% (4.5) 21.58% (5.4) 21.6% 20.9% 13.4%
min 870.4 -0.26% (4.6) 30.33% (5.3) 37.6% 28.0% 15.5%
rnd 709.0 -0.59% (4.1) 32.83% (9.2) 38.3% 19.3% 14.6%

Table 6.3: Profit and accuracy for all methods when value betting.

Figure 6.3: Profit over time for all methods when value betting.

assumption when all methods are compared. If only the four cost-combining methods are compared to
each other, then we see fairly equal standard deviations and 𝑝-values below 0.05, which means that
at least one of these four methods performs better or worse than the others. Some post-hoc testing
is necessary to discover which method this is.
One of the post-hoc tests that can be used to see which method behaves differently is the Tukey Honest
Significant Difference (HSD) test [51]. This test checks per pair of methods whether the two
methods have an equal mean. The results for the four cost-combining methods, paired in all
possible ways, can be found in table 6.5. This table shows for each method pair the lower and
upper bounds of the confidence interval and whether the null hypothesis should be rejected. In this case,
we only observe a significant difference between the avg and max methods. Combined with the results
in table 6.3, this means that it is very likely that the avg method performs better than max.
For the other method pairs it is not possible to draw a similar conclusion.

Test All methods Only cost-combining methods


One-way ANOVA 0.047 0.009
Kruskal–Wallis 0.108 0.046

Table 6.4: Results of the One-Way ANOVA and Kruskal-Wallis tests when value betting. The tests are performed twice: once for
all six methods and once for only the four cost-combining methods. Values in bold depict significance.

Predicted Predicted
H D A tot H D A
H 106.77 -50.80 -60.20 -4.23 H 20.86 30.92 4.66
Actual D -37.20 78.97 -14.80 26.97 Actual D 11.05 57.12 3.37
A -81.40 -23.60 58.73 -46.27 A 15.05 26.58 11.76
tot -11.83 4.57 -16.27 -23.53 tot 21.61 7.42 13.57
(a) Cost-confusion matrix for the b365 method. (b) Standard deviation for the b365 method.

Predicted Predicted
H D A tot H D A
H 154.73 -56.00 -69.40 29.33 H 113.89 62.41 35.88
Actual D -59.80 93.36 -22.20 11.36 Actual D 65.39 112.20 16.73
A -97.60 -26.40 88.51 -35.49 A 70.85 43.47 41.10
tot -2.67 10.96 -3.09 5.2 tot 23.66 14.40 22.34
(c) Cost-confusion matrix for the bw method. (d) Standard deviation for the bw method.

Predicted Predicted
H D A tot H D A
H 161.02 -49.00 -69.20 42.82 H 34.91 10.39 6.43
Actual D -44.80 60.30 -24.00 -8.50 Actual D 11.11 28.14 4.56
A -79.80 -12.40 83.02 -9.18 A 16.94 5.68 29.25
tot 36.42 -1.10 -10.18 25.14 tot 19.70 18.59 21.77
(e) Cost-confusion matrix for the avg method. (f) Standard deviation for the avg method.

Predicted Predicted
H D A tot H D A
H 156.50 -69.80 -100.80 -14.10 H 57.26 31.78 20.90
Actual D -54.40 90.49 -24.80 11.29 Actual D 38.88 44.92 9.68
A -89.80 -28.20 84.15 -33.85 A 35.16 28.81 21.82
tot 12.30 -7.51 -41.45 -36.66 tot 20.10 21.46 11.13
(g) Cost-confusion matrix for the max method. (h) Standard deviation for the max method.

Predicted Predicted
H D A tot H D A
H 267.33 -130.80 -109.60 26.93 H 63.98 46.67 32.67
Actual D -102.40 202.42 -40.40 59.62 Actual D 44.78 81.31 19.37
A -161.40 -61.80 119.86 -103.34 A 42.87 36.66 54.14
tot 3.53 9.82 -30.14 -16.79 tot 35.70 21.71 19.56
(i) Cost-confusion matrix for the min method. (j) Standard deviation for the min method.

Predicted Predicted
H D A tot H D A
H 232.77 -86.00 -94.60 52.17 H 171.63 77.91 40.53
Actual D -95.00 115.99 -32.20 -11.21 Actual D 88.07 109.66 23.76
A -125.00 -43.40 111.92 -56.48 A 87.51 54.07 64.24
tot 12.77 -13.41 -14.88 -15.52 tot 20.04 27.11 19.85
(k) Cost-confusion matrix for the rnd method. (l) Standard deviation for the rnd method.

Figure 6.4: Cost-confusion matrices for all methods when value betting.

Method 1 Method 2 MeanDiff Lower Upper Reject


avg max -0.1277 -0.2182 -0.0373 True
avg min -0.0722 -0.1627 0.0182 False
avg rnd -0.0756 -0.1660 0.0149 False
max min 0.0555 -0.0350 0.1460 False
max rnd 0.0522 -0.0383 0.1426 False
min rnd -0.0033 -0.0938 0.0871 False

Table 6.5: Results of the Tukey HSD test between cost-combining methods when value betting. The mean difference; lower and
upper limits of the confidence interval; and if the null hypothesis can be rejected or not (in bold) are shown.
7
Discussion and conclusion
This research tested different cost-sensitive methods on football data to see whether any method
performs better than the others and, as a result, leads to more profit. Two methods use the costs of
two individual bookmakers while the other four use different ways to combine the costs of multiple
bookmakers. The results of these methods were gathered both for when the model was tasked
to bet on all matches using these costs, and for when the model employed ’value betting’: betting
only on specific matches that pass a threshold and are therefore expected to yield a profit. The first research
objective was to investigate whether cost-combining methods work better than single-cost methods
and the second was to check which of those cost-combining methods works best.
We will dive into the first research question in section 7.1. Then in section 7.2 we will discuss
the results of all cost-combining methods. In section 7.3 more general findings from this project are
described. A conclusion will be given in section 7.4.

7.1. Comparing single-cost and cost-combining methods


The performance of the two single-cost methods differs greatly. The bw method, based on the odds of
the bookmaker Bet&Win, finished in second place in both betting scenarios (betting on all matches and
value betting), whereas the b365 method, based on bookmaker Bet365, ended fourth and fifth respectively. In the
cost-confusion matrices we see that the strength of the single-cost methods lies in predicting draws.
Looking at figures 6.4a and 6.4c, this is especially the case when value betting: this resulted in a profit of
4.57 units for the b365 method and 10.96 units for the bw method. Matches predicted as home or away wins
resulted in a loss, which lowered the total profits for both methods. However, the bw method had a
much lower loss and ended up with a net profit when value betting. The b365 method was not able to achieve
this, as it performed even worse when selecting which matches to bet on (-3.49% versus -5.68%).
As for the cost-combining methods, we see a diverse picture. The max method could not achieve
a profit. When betting on all matches it ended up with a profit of -4.15%, and in the value betting scenario
it resulted in -5.81%. The other three cost-combining methods finished with a higher profit when
value betting. The method which improved its results most by value betting is the avg method: it
increased its profit from -1.13% to +6.96% after ten seasons. Looking at figure 6.4e, this was achieved
by acquiring 36.42 units of profit from correctly predicted home wins, which is by far the highest profit
achieved on a single predicted outcome. It only lost 1.10 units on matches predicted as a
draw and it also managed to limit its losses (-10.18) on matches predicted as away wins, compensating
these losses with its correctly predicted home wins.
We suggested before that a higher accuracy does not have to result in a higher profit, and we can
see that in action here. For example, the max method has the highest accuracy for away wins
(table 6.1), but lost 85.71 units on those matches (figure 6.2g).
So which technique works better: using only a single bookmaker as the cost source in cost-sensitive
learning, or combining the odds of multiple bookmakers? If we look at the results of the best-performing
single-cost and cost-combining method, bw and avg respectively, then it is quite difficult to answer this
question. The avg method performed better when betting on all matches and also when value betting,
but the difference is minimal. Looking at tables 6.2 and 6.4 (and recalling that the all-methods ANOVA
result is not reliable, since the standard deviations of the single-cost and cost-combining methods differ),
we see no method that performs significantly differently, so there is also no significant difference between a
single-cost and a cost-combining method. This counters our hypothesis that a cost-combining method
would outperform the predictions of a single bookmaker. However, we should keep in mind that these
tests were run on only five runs per method in total, which is hardly enough to state conclusively that
there can be no (significant) difference between such methods. Furthermore, this project was restricted
to analyzing the odds of ten specific bookmakers for ten seasons of the Dutch national football league.
Other data sources may yield different outcomes. Based on the experiments conducted in this
project, we cannot conclude that combining the costs of multiple bookmakers is better than using the
costs of a single bookmaker.

7.2. Best cost-combining method


Now, let us see whether there is a difference in the comparative performance of the cost-combining
methods.
First, the results when betting on all matches are considered. If only the four cost-combining
methods are compared to each other, then we see in table 6.1 that the avg method performs best with
a relative profit of -1.13% and the min method performs worst with a profit of -4.33%.
The avg method managed to obtain a net profit of 28.08 units (see figure 6.2) when
betting only on matches predicted as home wins and it lost the fewest units on matches predicted as
away wins (-26.52 units). The only other method that managed to obtain a net profit is the rnd method
when betting only on matches predicted as home wins.
However, when testing for significance (see table 6.2), there is no significant difference between
these four methods. This means that no method performs better or worse than the other
methods when betting on all matches. Ideally, more runs would be conducted to test whether this really is
the case and to provide a more robust conclusion.
When value betting (see table 6.3), avg has the highest relative profit (+6.96%) and the
max method has the lowest (-5.81%). The max method is also the only one of the four methods that
performs worse when value betting. The other two methods, min and rnd, are very close to breaking
even (-0.26% and -0.59% respectively). These methods bet on many more
matches than the max method does: they bet on around 870 and 709 matches respectively, compared to the
469 matches that max bets on. Since they each lost only around 16 units in total, their relative profit
is much better. The min and rnd methods thus seem to vastly outperform the max method
on this metric.
Looking at all the runs of the methods when value betting in appendix J, we see that the avg method
managed to achieve a net profit for four out of five runs, max does not have a single run with a profit,
min has three runs with a profit and the rnd method ended positive in two runs. This shows that
the avg method is the most stable when it comes to achieving a profit.
If we look at table 6.4, where the significance of these four methods is tested, it is shown that
there is a noteworthy difference for at least one of the methods (𝑝-values of both tests are below
0.05). A post-hoc test was conducted to see which of these methods was responsible for these results.
In table 6.5, where the results of the Tukey HSD test are shown, we see a significant
difference between the avg method and the max method. This means that it is very likely that the avg
method performs better than the max method, but we cannot conclude this for the other method pairs.
To conclusively say that the avg method indeed outperforms all other cost-combining methods, fitting
with our hypothesis, it would also need to show a significant difference when paired with the other two
methods. Our results seem to reflect our hypotheses that the avg method is the most successful way of
combining the odds of multiple bookmakers and that the max method is the
least successful, but unfortunately we cannot conclusively answer our research question. Once again,
this project conducted only a limited number of runs on a comparatively specific data source, so it
may yet be possible that our hypotheses do hold if the research were to be repeated and/or extended
in the future.

7.3. General discussion and recommendations


While we discussed the main findings related to our research questions above, we encountered some
other findings that we would like to discuss.
One of the shortcomings of this project is the limited number of experiments per investigated
method due to time constraints. Since we use the data of ten seasons to test the cost-combining
methods and it takes almost two days to create, prune and obtain the results of one method, it could
have been better to run more experiments for fewer seasons. This trade-off is unfortunately inevitable
when considering deadlines and relatively limited hardware. Although focusing on fewer seasons would
have resulted in short-term-based models, it would have been possible to run many more experiments
and perhaps more robust conclusions could have been drawn. For example, if data of only five seasons
had been tested, it would have been possible to run around fifteen runs for each method
instead of five. This is because updating the decision trees takes much more time for later weeks
than for weeks in the first season, when the tree still contains little data. If we could start this project
from scratch, we might have used the data of only five seasons. However, there is also value in
building models over longer periods of time. After all, the bookmakers themselves probably base their
predictions on long-term information as well, to reach the level of accuracy we have seen. As the
outcomes of football matches are highly variable, it may be the case that using as many
seasons as possible leads to the most predictive models (though it may also be the case that predictions
become stable after a set number of seasons). Since we saw the best results when value betting, there
is a reason to suspect that using more seasons would provide more robust results, as more data can
lead to more stable threshold values. In conclusion, we do not see an optimal way to decide on
these types of experiment limitations considering time constraints; we only hope that explaining our
perspective can inform the choices of future work.
When looking at the accuracy of the methods in tables 6.1 and 6.3, we see that these are (much)
lower than the accuracy values reported in chapter 3. When betting on all matches the accuracy of
all our methods lies around 40% and this drops to around 20-30% when value betting. However, the
profits of most methods are higher when the accuracy is lower: the avg method managed to achieve a
profit of 6.96% with an overall accuracy of only 23.76%. This is less than half of the accuracy values
reported in [11, 12, 33]. While these papers had a different primary goal than ours, it is interesting
to see that correctly predicting as many matches as possible should not be the preferred goal when
the objective of a project is to reach the highest amount of profit. In future research, this contrast
should be taken into account when deciding on the objective: increasing accuracy and profit may not
be related.
Additionally, when betting on all matches, it was interesting that no method was able to achieve
a net profit. Betting in this project only turned profitable when selecting specific matches to bet on,
i.e. when value betting. In fact, four out of the six investigated methods improved their relative profit
when value betting, even though their accuracy dropped significantly. Once again, we see that accuracy
rates are not necessarily linked to higher profits. Furthermore, this shows that punters should decide
carefully which matches to bet on, as this can determine whether they make a profit or not,
potentially more so than e.g. betting on more likely outcomes (such as home wins). Matches that
are bet on must have enough ’value’ to expect a profit in the long run. Similarly, this may prove an
interesting field for further machine learning research. Future work can attempt to (further) exploit
this phenomenon.
In section 4.1.2 the MDL pruning method was described, but this method was not used for the
results shown in chapter 6. After the first cost-sensitive decision trees were pruned with this method,
we noticed that the trees were not optimally pruned. The MDL technique might work decently for
standard cost-insensitive decision trees, but when implementing cost-sensitive learning we noticed
that the trees were pruned proportionally to the depth of the tree. This is to be expected considering
how the technique works, but when CCP pruning was used the cost-sensitive decision trees were often
pruned at the root node of the tree for some weeks. This can also be seen in appendix H: all methods
predicted many more home wins than the other outcomes, because a tree that is pruned down to its
root node predicts every upcoming match as a home win. This never occurred
when using MDL as the pruning method (since proportionate pruning leaves a larger part of the tree
intact and never cuts down to the root), and the results using MDL suffered because of
this. However, MDL was not tested in combination with the value betting technique, so this might be
interesting to investigate in future work.
In chapter 3 we mentioned that paper [34] was released too close to our own publication date to
take advantage of its findings. The best model presented in that paper achieved an accuracy
of 56.7%, which is the highest value we could find among experiments conducted over
multiple seasons while betting on all matches. These results were obtained after carefully constructing a
feature dataset that contained information about the streak and the form of the teams in addition to
features that are comparable to ours. These extra features could also have been very useful
in our research, but unfortunately it was not possible to incorporate them into our experiments. For
future research we would thus also recommend spending more time on feature engineering, as this
may lead to better results. In the case of [34], feature engineering resulted in a higher accuracy score,
but we do not know whether it would also have resulted in a higher profit.
Our results show a relatively high variance and this is mainly due to using a single decision tree
for predicting the outcomes [52]. One way to lower this variance would be to use an ensemble of
decision tree models and combine their predictions in some way into a single prediction. If the experiment
were run again and one decision tree gave a different outcome, this would not influence the
final (combined) outcome much, which results in a lower variance. The same authors that wrote [41],
which was used to create the cost-sensitive decision tree, also showed how to use an ensemble of cost-
sensitive decision trees [53]. In that paper they tested multiple cost-sensitive ensemble techniques and
showed that using these ensemble models indeed improved the results.
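
A generic bagging sketch of this idea is shown below. It is not the ensemble method of [53]; the train_tree and predict functions are hypothetical stand-ins for the interface of the incremental cost-sensitive tree used in this project.

import random
from collections import Counter

def bagged_prediction(train_tree, predict, matches, costs, new_match, n_trees=25):
    # train_tree(sample_matches, sample_costs) -> tree and
    # predict(tree, match) -> "H" | "D" | "A" are assumed interfaces.
    votes = []
    for _ in range(n_trees):
        idx = [random.randrange(len(matches)) for _ in range(len(matches))]  # bootstrap sample
        tree = train_tree([matches[i] for i in idx], [costs[i] for i in idx])
        votes.append(predict(tree, new_match))
    return Counter(votes).most_common(1)[0][0]  # majority vote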
One small note about our dataset: although we simply disregarded the missing odds of bookmakers,
there are other ways to deal with such missing data. In future research, these missing odds may
be imputed through various methods, e.g. regression, so that we have a more complete dataset of the
bookmakers’ odds. We were only able to compare two single bookmakers’ odds, since only two bookmakers
published odds for all matches in our dataset. If the missing odds in the other bookmakers’ data are
imputed, the results of more individual bookmakers can be compared both with each other and with
our cost-combining methods. Additionally, there are of course many other cost-combining methods
one can conceive to combine these bookmakers’ odds. One example may be attempting to distill the
so-called ’true’ odds mentioned in chapter 3, and averaging those to remove the bookmakers’ margins
from the equation.
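
As a simpler stand-in for the regression idea, the sketch below imputes a bookmaker's missing quote with the average quote of the other bookmakers for the same match. The layout of the odds frame (one row per match, one column per bookmaker) is an assumption made for this illustration.

import pandas as pd

def impute_missing_odds(odds: pd.DataFrame) -> pd.DataFrame:
    # odds: one row per match, one column per bookmaker, NaN for a missing quote.
    row_mean = odds.mean(axis=1, skipna=True)
    return odds.apply(lambda column: column.fillna(row_mean))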
Finally, there are some other ways to optimize for profit to mention. It may be possible to better
predict future matches by relying more heavily on recent information in the dataset, prioritizing this
information over older observations. After all, a match played last year may tell us more about a
football team than one played nine years ago. In our model, all previous matches were weighted equally,
which may increasingly influence the results as our dataset spans a longer time frame. Also,
there are other ways to optimize for profit than directly incorporating the costs in the model. It is for
example also possible to predict how high a stake should be placed on each match, or to create a model
that predicts for which matches a bet should be placed and for which matches it is better not to place a
bet, rather than doing this post hoc using value betting. To our knowledge, these strategies have not yet
been explored and published, so investigating multiple techniques to optimize for profit might be
interesting for future work.
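
One simple way to prioritise recent information would be to attach an exponentially decaying sample weight to each match, as sketched below. This is a hypothetical illustration; in this project all matches were weighted equally.

from datetime import date

def recency_weight(match_date, current_date, half_life_days=365.0):
    # Exponential decay: a match played half_life_days ago counts half as much
    # as one played today.
    age_in_days = (current_date - match_date).days
    return 0.5 ** (age_in_days / half_life_days)

# e.g. recency_weight(date(2017, 7, 1), date(2018, 7, 1)) is approximately 0.5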

7.4. Conclusion
This project compared two types of cost-sensitive methods. The first type used the odds of a single
bookmaker to achieve a profit by predicting the outcome of future matches correctly and the second
type used a combination of the odds of multiple bookmakers to achieve this goal. The first objective
was to investigate whether the cost-combining methods performed better than the single-cost methods. After
reviewing the results, we could not find a significant difference between the two types.
The second objective was to find the best cost-combining method of the four that were tested.
From our experiments we can conclude that using the average of all costs works better than using the
maximal costs when value betting, i.e. betting on only a selected number of matches, but no other
conclusions could be drawn for the other methods due to a lack of significance in the results.
We have also shown that a higher accuracy rate does not always result in a higher profit.
While other papers had the primary goal of predicting as many matches correctly as possible, we wanted to
optimize our model for the highest profit. This resulted in accuracy levels between 20-30%,
which is much lower than the accuracy values of other papers (around 50%), but in the end we
managed to achieve a profit of 6.96%, which to our knowledge has not been achieved before.
Since these results were obtained with only five runs per method, our conclusions are unfortunately
not very robust. More runs are needed to achieve a better understanding of these various methods,
although we hope to have provided a starting point for further research. Due to time constraints we
were unable to conduct more of these experiments, since obtaining the results for one run took us
almost two days. With the help of faster hardware, or by limiting the size of the data used (though there
are some trade-offs to consider), this could be resolved.
And finally, we have created an incremental, cost-sensitive decision tree in Python with which to
conduct our experiments. Without this model, it would have taken even longer to run the experiments.
To our knowledge, no other incremental decision tree model that can incorporate costs exists, so we had
to create one ourselves. It took us quite some time to get this working, but in the end using this version
saved us more time than building a standard cost-sensitive decision tree from scratch for every week
would have. Future research may benefit from this incremental
cost-sensitive decision tree model, as it greatly increases the speed with which such research can be
conducted.
Bibliography
[1] Marktscan landgebonden kansspelen 2016, https://www.kansspelautoriteit.nl/
publish/pages/4264/marktscan_landbased_2016.pdf (2017), accessed: 7 June 2018.

[2] Global 35.97 billion online gambling market growth at cagr of 10.81%, 2014-2020 - mar-
ket to reach 66.59 billion with growth very geography specific - research and markets,
https://www.prnewswire.com/news-releases/global-3597-billion-online-
gambling-market-growth-at-cagr-of-1081-2014-2020---market-to-reach-
6659-billion-with-growth-very-geography-specific---research-and-
markets-300348358.html (2016), accessed: 8 April 2018.

[3] Sports betting market to rise at nearly 8.62% cagr to 2022, https://www.prnewswire.com/
news-releases/sports-betting-market-to-rise-at-nearly-862-cagr-to-
2022-669711413.html (2018), accessed: 7 June 2018.

[4] A comparison of forecasting methods: fundamentals, polling, prediction markets, and experts,
http://www.researchdmr.com/ForecastingOscar.pdf (2014), accessed: 9 May 2018.

[5] A simple deep learning model for stock price prediction using tensorflow, https:
//medium.com/mlreview/a-simple-deep-learning-model-for-stock-price-
prediction-using-tensorflow-30505541d877 (2017), accessed: 23 May 2018.

[6] With germany’s win microsoft perfectly predicted the world cup’s knockout round, https://
qz.com/233830/world-cup-germany-argentina-predictions-microsoft/ (2014),
accessed: 1 May 2018.

[7] World cup 2014 predictions & results, https://www.bloomberg.com/graphics/2014-


world-cup/#0,0,-1 (2014), accessed: 9 May 2018.

[8] The world cup and economics 2014, http://www.goldmansachs.com/our-thinking/


outlook/world-cup-and-economics-2014-folder/world-cup-economics-
report.pdf (2014), accessed: 9 May 2018.

[9] A. Groll, C. Ley, G. Schauberger, and H. Van Eetvelde, Prediction of the FIFA World Cup 2018 - A
random forest approach with an emphasis on estimated team ability parameters, ArXiv e-prints
(2018), arXiv:1806.03208 [stat.AP] .

[10] J. S. Xu, Online sports gambling: A look into the efficiency of bookmakers’ odds as forecasts in the
case of english premier league, Unpublished undergraduate dissertation). University of California,
Berkeley (2011).

[11] B. Ulmer, M. Fernandez, and M. Peterson, Predicting Soccer Match Results in the English Premier
League, Ph.D. thesis, Stanford (2013).
[12] J. (@opisthokonta), Predicting football results with adaptive boosting, https://
opisthokonta.net/?p=809 (2014), accessed: 18 May 2018.

[13] History of football - the origins, https://www.fifa.com/about-fifa/who-we-are/the-


game/index.html (2018), accessed: 11 May 2018.

[14] The most popular sports in the world, https://www.worldatlas.com/articles/what-


are-the-most-popular-sports-in-the-world.html (2018), accessed: 11 May 2018.

[15] Football, https://www.britannica.com/sports/football-soccer (2017), accessed: 2


July 2018.


[16] N. Vlastakis, G. Dotsis, and R. N. Markellos, How efficient is the european football betting market?
evidence from arbitrage and trading strategies, Journal of Forecasting 28, 426 (2009).

[17] N. S. Barrett, The Daily Telegraph chronicle of horse racing (Guinness Pub., 1995).

[18] European soccer database, https://www.kaggle.com/hugomathien/soccer (2017), ac-


cessed: 8 November 2017.

[19] Historical football results and betting odds data, http://www.football-data.co.uk/


data.php (2018), accessed: 6 February 2018.

[20] A. Sacks and A. Ryan, Economic impact of legalized sports betting, https://
www.americangaming.org/sites/default/files/AGA-Oxford%20-%20Sports%
20Betting%20Economic%20Impact%20Report1.pdf (2007), accessed: 2 July 2018.

[21] Ontwikkelingen legalisering online gokken, https://www.onlinecasinoground.nl/is-


online-gokken-legaal/#belangrijke-ontwikkelingen-2017 (2018), accessed: 2
July 2018.

[22] E. F. Fama, Efficient capital markets: A review of theory and empirical work, The journal of Finance
25, 383 (1970).

[23] M. Cain, D. Law, and D. Peel, The favourite-longshot bias and market efficiency in uk football
betting, Scottish Journal of Political Economy 47, 25 (2000).

[24] J. Goddard and I. Asimakopoulos, Forecasting football results and the efficiency of fixed-odds
betting, Journal of Forecasting 23, 51 (2004).

[25] H. O. Stekler, D. Sendor, and R. Verlander, Issues in sports forecasting, International Journal of
Forecasting 26, 606 (2010).

[26] B. Deschamps and O. Gergaud, Efficiency in betting markets: evidence from english football, The
Journal of Prediction Markets 1, 61 (2012).

[27] G. Bernardo, M. Ruberti, and R. Verona, Testing semi-strong efficiency in a fixed odds betting
market: Evidence from principal European football leagues, Tech. Rep. (University Library of
Munich, Germany, 2015).

[28] J. Buchdahl, Squares & Sharps, Suckers & Sharks: The Science, Psychology & Philosophy of
Gambling, Vol. 16 (Oldcastle Books, 2016).

[29] E. Štrumbelj and M. R. Šikonja, Online bookmakers’ odds as forecasts: The case of european
soccer leagues, International Journal of Forecasting 26, 482 (2010).

[30] M. Moroney, Facts from figures, 472 pp, (1956).

[31] M. J. Dixon and S. G. Coles, Modelling association football scores and inefficiencies in the football
betting market, Journal of the Royal Statistical Society: Series C (Applied Statistics) 46, 265
(1997).

[32] R. H. Koning, Balance in competition in dutch soccer, Journal of the Royal Statistical Society:
Series D (The Statistician) 49, 419 (2000).

[33] A. Joseph, N. E. Fenton, and M. Neil, Predicting football results using bayesian nets and other
machine learning techniques, Knowledge-Based Systems 19, 544 (2006).

[34] R. Baboota and H. Kaur, Predictive analysis and modelling football results using machine learning
approach for english premier league, International Journal of Forecasting (2018).

[35] P. Domingos, Metacost: A general method for making classifiers cost-sensitive, in Proceedings of
the fifth ACM SIGKDD international conference on Knowledge discovery and data mining (ACM,
1999) pp. 155–164.

[36] C. Elkan, The foundations of cost-sensitive learning, in International joint conference on artificial
intelligence, Vol. 17 (Lawrence Erlbaum Associates Ltd, 2001) pp. 973–978.
[37] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan, Adacost: misclassification cost-sensitive boosting,
in Icml, Vol. 99 (1999) pp. 97–105.

[38] J. Du, Z. Cai, and C. X. Ling, Cost-sensitive decision trees with pre-pruning, in Advances in
Artificial Intelligence (Springer, 2007) pp. 171–179.
[39] B. Zhou and Q. Liu, A comparison study of cost-sensitive classifier evaluations, in International
Conference on Brain Informatics (Springer, 2012) pp. 360–371.
[40] P. D. Turney, Types of cost in inductive concept learning, arXiv preprint cs/0212034 (2002).
[41] A. C. Bahnsen, D. Aouada, and B. Ottersten, Example-dependent cost-sensitive decision trees,
Expert Systems with Applications 42, 6609 (2015).
[42] L. Breiman, J. Friedman, R. Olshen, and C. J. Stone, Classification and regression trees (Chapman
and Hall/CRC, 1984).

[43] P. E. Utgoff, N. C. Berkman, and J. A. Clouse, Decision tree induction based on efficient tree
restructuring, Machine Learning 29, 5 (1997).
[44] Classification trees, www.cs.uu.nl/docs/vakken/mdm/trees.pdf (2017), accessed: 14
February 2018.

[45] P. E. Utgoff, Incremental induction of decision trees, Machine learning 4, 161 (1989).
[46] P. E. Utgoff, An improved algorithm for incremental induction of decision trees, in Machine Learning
Proceedings 1994 (Elsevier, 1994) pp. 318–325.
[47] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning, Vol. 1 (Springer
series in statistics New York, 2001).

[48] R. J. Hyndman and G. Athanasopoulos, Forecasting: principles and practice (OTexts, 2014).
[49] Machine Learning Laboratory incremental decision tree induction, https://www-
ml.cs.umass.edu/iti/index.html (2001), accessed: 26 February 2018.

[50] V. Y. Kotlyar and O. Smyrnova, Betting market: analysis of arbitrage situations, Cybernetics and
Systems Analysis 48, 912 (2012).
[51] Comparing more than two means: One-way anova, https://brownmath.com/stat/
anova1.htm (2016), accessed: 3 July 2018.
[52] T. G. Dietterich and E. B. Kong, Machine learning bias, statistical bias, and statistical variance of
decision tree algorithms, Tech. Rep. (Technical report, Department of Computer Science, Oregon
State University, 1995).
[53] A. C. Bahnsen, D. Aouada, and B. Ottersten, Ensemble of example-dependent cost-sensitive
decision trees, arXiv preprint arXiv:1505.04637 (2015).
A
Incremental decision tree example
Below, the data structure of an incremental decision tree is depicted. This example is
based on the decision tree shown in figure 4.2b, since showing a deeper decision tree would take up
too many pages. The red class is represented as class 0 and the blue class as class 1.
{'best_variable': 0,
 'class_counts': {
     'costs': 0,
     'count': 1,
     'height': 1,
     'key': 1.0,
     'left': {
         'costs': 0,
         'count': 1,
         'height': 0,
         'key': 0.0,
         'left': None,
         'right': None},
     'right': None},
 'flags': [False, False],
 'instance_costs': [],
 'instances': [],
 'left': {
     'best_variable': None,
     'class_counts': {
         'costs': 0,
         'count': 1,
         'height': 0,
         'key': 0.0,
         'left': None,
         'right': None},
     'flags': [False, False],
     'instance_costs': [0],
     'instances': [array([0.14, 0.09, 0.])],
     'left': None,
     'mdl': 0,
     'n_instances': 1,
     'n_variables': 2,
     'right': None,
     'variables': []},
 'mdl': 0,
 'n_instances': 2,
 'n_variables': 2,
 'right': {
     'best_variable': None,
     'class_counts': {
         'costs': 0,
         'count': 1,
         'height': 0,
         'key': 1.0,
         'left': None,
         'right': None},
     'flags': [False, False],
     'instance_costs': [0],
     'instances': [array([0.2, 0.99, 1.])],
     'left': None,
     'mdl': 0,
     'n_instances': 1,
     'n_variables': 2,
     'right': None,
     'variables': []},
 'variables': [
     {'class_counts': {
          'costs': 0,
          'count': 1,
          'height': 1,
          'key': 1.0,
          'left': {
              'costs': 0,
              'count': 1,
              'height': 0,
              'key': 0.0,
              'left': None,
              'right': None},
          'right': None},
      'count': 2,
      'cutpoint': 0.17,
      'metric_value': 0.0,
      'new_cutpoint': 0.17,
      'numeric_value_counts': {
          'class_counts': {
              'costs': 0,
              'count': 1,
              'height': 0,
              'key': 1.0,
              'left': None,
              'right': None},
          'count': 1,
          'height': 1,
          'key': 0.2,
          'left': {
              'class_counts': {
                  'costs': 0,
                  'count': 1,
                  'height': 0,
                  'key': 0.0,
                  'left': None,
                  'right': None},
              'count': 1,
              'height': 0,
              'key': 0.14,
              'left': None,
              'metric_value': 0.5,
              'right': None},
          'metric_value': 0.0,
          'right': None}},
     {'class_counts': {
          'costs': 0,
          'count': 1,
          'height': 1,
          'key': 1.0,
          'left': {
              'costs': 0,
              'count': 1,
              'height': 0,
              'key': 0.0,
              'left': None,
              'right': None},
          'right': None},
      'count': 2,
      'cutpoint': 0.0,
      'metric_value': 0.0,
      'new_cutpoint': 0.54,
      'numeric_value_counts': {
          'class_counts': {
              'costs': 0,
              'count': 1,
              'height': 0,
              'key': 1.0,
              'left': None,
              'right': None},
          'count': 1,
          'height': 1,
          'key': 0.99,
          'left': {
              'class_counts': {
                  'costs': 0,
                  'count': 1,
                  'height': 0,
                  'key': 0.0,
                  'left': None,
                  'right': None},
              'count': 1,
              'height': 0,
              'key': 0.09,
              'left': None,
              'metric_value': 0.5,
              'right': None},
          'metric_value': 0.0,
          'right': None}}]}
B
Football team names
Table B.1 lists the assigned team abbreviations next to the full names of the teams that competed in
the Dutch Eredivisie league between seasons 2006/2007 and 2016/2017.

Abbr. Full name


AJA AFC Ajax
ALK AZ Alkmaar
CAM SC Cambuur
DOR FC Dordrecht
EXC S.B.V. Excelsior
FEY Feyenoord
GAE Go Ahead Eagles
GRA De Graafschap
GRO FC Groningen
HAA ADO Den Haag
HEE SC Heerenveen
HER Heracles Almelo
NAC NAC Breda
NEC N.E.C.
PSV PSV Eindhoven
RKC RKC Waalwijk
ROD Roda JC
SPA Sparta Rotterdam
TWE FC Twente
UTR FC Utrecht
VEN VVV-Venlo
VIT SBV Vitesse
VOL FC Volendam
WII Willem II
ZWO PEC Zwolle

Table B.1: Football team abbreviations and their full names.

C
Information about the intermediate dataset
Table C.1 depicts which information is collected about each football team for the intermediate dataset.
The first column shows the attribute name, the second column contains a short description of the
attribute and the last column shows which (Python) data type the attribute has. This intermediate
dataset was created to make feature extraction easier.

Attribute Description Type


date date of match datetime
season season of match object (string)
team team abbreviation object (string)
opponent opponent abbreviation object (string)
side home or away object (string)
won 1 if team won, otherwise 0 int64
draw 1 if match ended in draw, otherwise 0 int64
loss 1 if team lost, otherwise 0 int64
points 3 if team won, 1 if match ended in a draw or 0 when team lost int64
goals_for Number of goals team scored int64
goals_against Number of goals opponent scored int64
goals_diff Difference in goals between team and opponent int64

Table C.1: Intermediate dataset attributes, description and data types.

D
Combining data from different sources
In this project, we only had to deal with two different data sources, but even within one data source
some preprocessing is required. For example, data downloaded from [19] is split into different seasons.
This means that the data for seasons 2006/2007, 2007/2008 and 2016/2017 are stored in separate
CSV files. Between these files, there are various naming conventions for the different teams. For
example, some CSV files use the name ’ADO Den Haag’, while others only use ’ADO’. These refer to the same
team, so when combining the data we must ensure that the information about this team is merged
correctly. This is also an issue when we want to combine these files with the data from the other
source [18]. The way we solved this was to create a mapping from all possible ways teams are written
in the data to a chosen acronym. Each team was designated its own three-letter acronym for usability
purposes. These naming conventions can be found in appendix B.
After this mapping is completed, the different CSV files and the SQL database from Kaggle are
stored in separate pandas dataframes. A pandas dataframe is part of the popular Python library called
Pandas and is used often in data science projects. The created dataframes have the same column
information so that it is easy to concatenate the dataframes into one big dataframe. This project has
only created a dataframe for the matches of the Dutch Eredivisie football league, but this would also be
possible for more countries. Our one dataframe now contains all the matches from season 2006/2007
until 2016/2017, but this raw data is not very useful for the decision tree model to use as features.
To be able to extract the features from this dataframe, we first created a helper dataset which con-
tains all required information per team per date. This makes it easier to acquire all necessary features.
For each team and match date we store this date, opponent and other match-specific information like
goals and points scored. For more information about the dataset, see appendix C. After this process
was finished, our dataset contained 10,404 rows by 12 columns including the information of 3,363
football matches.
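
A minimal sketch of this mapping-and-concatenation step is given below. The column names, the csv_paths argument and the contents of NAME_MAP are hypothetical; the mapping only shows a few of the spellings actually encountered, with the acronyms taken from appendix B.

import pandas as pd

# A few example spellings mapped to the acronyms of appendix B (hypothetical subset).
NAME_MAP = {"ADO Den Haag": "HAA", "ADO": "HAA", "AFC Ajax": "AJA", "Ajax": "AJA"}

def load_season(csv_path):
    season = pd.read_csv(csv_path)
    # Translate both (assumed) team columns to the shared acronyms.
    season["home_team"] = season["home_team"].map(NAME_MAP)
    season["away_team"] = season["away_team"].map(NAME_MAP)
    return season

def combine_seasons(csv_paths):
    # All per-season frames share the same columns, so they can simply be concatenated.
    return pd.concat([load_season(path) for path in csv_paths], ignore_index=True)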

E
Features
Table E.1 on page 64 shows which features are used when training the models. These features describe
the difference between the home team and the away team, obtained by subtracting the statistic of the away team
from that of the home team. For example, for feature 𝑟_𝑤_𝑡 the total number of wins in the current season of
the away team is subtracted from the total number of wins in the current season of the home team. This
means that if this feature is positive, the home team won more matches than the away team; if
this feature is zero, both teams won the same number of matches; and if this feature is negative,
the away team won more matches than the home team.
When a feature has an 𝑥 in its name, this feature is used three times, where 𝑥 ∈ {1, 2, 3}. When
a feature has an 𝑟 in its name, only those matches are used where the team played
on the same side as in this match: for the home team the statistic is calculated over matches in which it
played at home, and for the away team over matches in which it played away.
When a feature has an 𝑚 in its name, only the matches are considered where these two teams
competed against each other.


Feature Description
year Calendar year when match is played
month Calendar month when match is played

r_w_t Total number of wins in current season


r_d_t Total number of draws in current season
r_l_t Total number of losses in current season
r_w_t_r Total number of wins for current side in current season
r_d_t_r Total number of draws for current side in current season
r_l_t_r Total number of losses for current side in current season
r_w_x Number of wins in last 𝑥 matches
r_d_x Number of draws in last 𝑥 matches
r_l_x Number of losses in last 𝑥 matches
r_w_x_r Number of wins for current side in last 𝑥 matches
r_d_x_r Number of draws for current side in last 𝑥 matches
r_l_x_r Number of losses for current side in last 𝑥 matches
r_w_x_m Number of wins in last 𝑥 meetings
r_d_x_m Number of draws in last 𝑥 meetings
r_l_x_m Number of losses in last 𝑥 meetings

p_t Total number of points in current season


p_t_r Total number of points for current side in current season
p_t_a Average number of points since season 2006/2007
p_t_a_r Average number of points for current side since season 2006/2007
p_x Number of points in last 𝑥 matches
p_x_r Number of points for current side in last 𝑥 matches

g_f_t Total number of goals scored in current season


g_a_t Total number of goals conceded in current season
g_d_t Total number of goals scored minus goals conceded in current season
g_f_x Number of goals scored in last 𝑥 matches
g_a_x Number of goals conceded in last 𝑥 matches
g_d_x Number of goals scored minus goals conceded in last 𝑥 matches
g_f_x_r Number of goals scored for current side in last 𝑥 matches
g_a_x_r Number of goals conceded for current side in last 𝑥 matches
g_d_x_r Number of goals scored minus goals conceded for current side in last 𝑥 matches
g_f_x_m Number of goals scored in last 𝑥 meetings
g_a_x_m Number of goals conceded in last 𝑥 meetings
g_d_x_m Number of goals scored minus goals conceded in last 𝑥 meetings

o_h Average home odds for this match


o_d Average draw odds for this match
o_a Average away odds for this match

outcome ’H’ when home team won, ’D’ for a draw or ’A’ when away team won

Table E.1: Features used when training the models. 𝑥 ∈ {1, 2, 3}.
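
As an illustration of how such difference features can be derived from the intermediate dataset of appendix C, the sketch below computes r_w_t, the home team's season win total minus that of the away team. The DataFrame columns are those of appendix C; everything else is a hypothetical simplification.

def r_w_t(intermediate, home, away, season, match_date):
    # intermediate: the per-team, per-date dataset of appendix C (a pandas
    # DataFrame with columns date, season, team and won, among others).
    def season_wins(team):
        rows = intermediate[(intermediate["team"] == team)
                            & (intermediate["season"] == season)
                            & (intermediate["date"] < match_date)]
        return rows["won"].sum()
    return season_wins(home) - season_wins(away)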


F
Missing odds for each bookmaker per season
Table F.1 shows for each bookmaker how many matches per season are missing odds. Note that a
season has 306 matches. It is not known why these odds are missing from the dataset.

season B365 BS BW GB IW LB PS SJ VC WH
2000/2001 306 306 306 1 4 306 306 306 306 3
2001/2002 306 306 306 79 13 306 306 306 306 13
2002/2003 6 306 306 0 12 306 306 306 306 9
2003/2004 1 306 306 0 7 306 306 306 306 4
2004/2005 0 306 1 0 4 51 306 306 306 5
2005/2006 0 306 0 0 4 6 306 0 6 4
2006/2007 0 306 0 0 2 4 306 0 6 0
2007/2008 0 1 0 0 1 1 306 0 2 81
2008/2009 1 1 1 1 4 1 306 1 3 4
2009/2010 1 2 1 1 3 1 306 1 1 4
2010/2011 0 0 0 0 0 0 306 0 1 0
2011/2012 0 0 0 0 3 0 306 0 0 0
2012/2013 1 1 1 1 1 2 23 2 1 2
2013/2014 0 306 0 306 0 1 4 0 0 1
2014/2015 0 306 0 306 0 4 2 261 0 0
2015/2016 0 306 0 306 2 0 1 306 1 0
2016/2017 0 306 0 306 0 0 3 306 1 0

Table F.1: Missing odds for each bookmaker per season.

G
Profit over time for betting on all matches
Figures G.1 to G.6 show the profit from the start until season 2016/2017 when betting on all matches
for all five runs of each method. The dotted black line depicts the average of all five runs.

Figure G.1: Profit over time when betting on all matches for the b365 method.


Figure G.2: Profit over time when betting on all matches for the bw method.

Figure G.3: Profit over time when betting on all matches for the avg method.

Figure G.4: Profit over time when betting on all matches for the max method.

Figure G.5: Profit over time when betting on all matches for the min method.

Figure G.6: Profit over time when betting on all matches for the rnd method.
H
Confusion matrices for betting on all matches
The confusion matrices for each method when betting on all matches are shown in figure H.1 on
page 72. The left side shows the confusion matrices while the right side shows the standard deviation
values. The values inside the confusion matrices are the averaged values of the five runs of each
method.


Predicted
Predicted
H D A
H D A
H 886.8 353.2 214.0
H 32.8 17.2 17.1
Actual D 429.8 174.2 119.0
Actual D 16.0 8.6 10.4
A 485.2 224.0 170.8
A 11.3 12.0 10.5
tot 1801.8 751.4 503.8
(b) Standard deviation for the b365 method.
(a) Confusion matrix for the b365 method.

Predicted
Predicted
H D A
H D A
H 904.6 331.2 218.1
H 24.7 31.2 19.8
Actual D 437.4 166.6 119.0
Actual D 5.7 14.8 10.8
A 485.2 215.6 179.2
A 7.0 7.8 5.5
tot 1827.2 713.4 516.3
(d) Standard deviation for the bw method.
(c) Confusion matrix for the bw method.

Predicted
Predicted
H D A
H D A
H 904.0 324.0 226.0
H 26.3 16.4 13.1
Actual D 419.2 163.6 140.2
Actual D 8.8 7.1 5.0
A 473.6 221.8 184.6
A 15.8 14.8 6.1
tot 1796.8 709.4 550.8
(f) Standard deviation for the avg method.
(e) Confusion matrix for the avg method.

Predicted
Predicted
H D A
H D A
H 863.0 283.2 307.8
H 25.7 12.9 17.6
Actual D 424.0 146.0 153.0
Actual D 16.0 12.1 5.0
A 466.4 205.2 208.4
A 15.8 11.1 10.2
tot 1753.4 634.4 669.2
(h) Standard deviation for the max method.
(g) Confusion matrix for the max method.

Predicted
Predicted
H D A
H D A
H 894.6 318.0 241.4
H 42.2 24.5 22.5
Actual D 440.6 148.2 134.2
Actual D 22.2 12.8 14.5
A 497.8 182.8 199.4
A 28.1 20.1 15.3
tot 1833.0 649.0 575.0
(j) Standard deviation for the min method.
(i) Confusion matrix for the min method.

Predicted
Predicted
H D A
H D A
H 874.8 321.6 257.6
H 19.7 12.2 14.3
Actual D 420.4 154.8 147.8
Actual D 22.5 11.8 13.7
A 462.2 223.4 194.4
A 16.7 19.7 6.8
tot 1757.4 699.8 599.8
(l) Standard deviation for the rnd method.
(k) Confusion matrix for the rnd method.

Figure H.1: Confusion matrices when betting on all matches.


I
Normality check of results for betting on all matches
Figure I.1 shows the statistic values of the Shapiro-Wilk and Anderson-Darling tests for all methods.
Both tests check whether the results of the methods are normally distributed. The Shapiro-Wilk
test takes as its null hypothesis that the values are normally distributed, so if the 𝑝-value is below 0.05
then we have to reject this hypothesis. The Anderson-Darling test returns a statistic and if this value
is higher than the critical value of 0.984, which corresponds to a significance level of 0.05, then we have to
reject the null hypothesis that the results of the method are normally distributed. In these tables it
is shown that the null hypotheses never have to be rejected, so we treat our results as
normally distributed.
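
Both checks are available in SciPy; the sketch below computes them for the per-run profits of one method. Note that SciPy reports its own critical values for the Anderson-Darling statistic, which may differ from the 0.984 quoted above depending on the sample size and test variant.

from scipy import stats

def normality_checks(run_profits):
    # run_profits: the per-run relative profits of one method (n=5 in this project).
    shapiro_stat, shapiro_p = stats.shapiro(run_profits)
    anderson = stats.anderson(run_profits, dist="norm")
    # Pick the critical value belonging to the 5% significance level.
    crit_5 = anderson.critical_values[list(anderson.significance_level).index(5.0)]
    return (shapiro_stat, shapiro_p), (anderson.statistic, crit_5)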

Test b365 bw avg


Shapiro-Wilk 𝑠=0.927, 𝑝=0.579 𝑠=0.904, 𝑝=0.433 𝑠=0.821, 𝑝=0.118
Anderson–Darling 𝑠=0.252 𝑠=0.305 𝑠=0.503
(a) Normality check statistics for methods b365, bw and avg (𝑛=5).

Test max min rnd


Shapiro-Wilk 𝑠=0.893, 𝑝=0.373 𝑠=0.901, 𝑝=0.416 𝑠=0.883, 𝑝=0.325
Anderson–Darling 𝑠=0.322 𝑠=0.303 𝑠=0.363
(b) Normality check statistics for methods max, min and rnd ( =5).

Figure I.1: Normality check statistics for all methods when betting on all matches. 𝑠 is the statistic value, 𝑝 is the 𝑝-value and 𝑛 is
the size of the data.

J. Profit over time for value betting
Table J.1 shows how many matches were selected to bet on for each method and each run. Figures J.1
to J.6 show the profit over time, from the start until season 2016/2017, when value betting for all five
runs of each method. The dotted black line depicts the average of the five runs.

Run    b365     bw    avg    max    min    rnd
1       239    212    232    293    345    269
2       500   1453    347    434    861   1866
3       374    215    379    315   1400    523
4       237    195    443    341   1004    211
5       376    275    430    962    742    676

Table J.1: Number of matches selected to bet on for each method and for each run.

Figure J.1: Profit over time when value betting for the b365 method.


Figure J.2: Profit over time when value betting for the bw method.

Figure J.3: Profit over time when value betting for the avg method.

Figure J.4: Profit over time when value betting for the max method.

Figure J.5: Profit over time when value betting for the min method.

Figure J.6: Profit over time when value betting for the rnd method.
K. Confusion matrices for value betting
The confusion matrices for each method when value betting are shown in figure K.1. The left side
shows the confusion matrices, while the right side shows the corresponding standard deviations. The
values inside the confusion matrices are averaged over the five runs of each method.


               Predicted
             H        D        A
Actual  H    46.4     50.8     60.2
        D    37.2     19.6     14.8
        A    81.4     23.6     11.2
tot         165.0     94.0     86.2
(a) Confusion matrix for the b365 method.

               Predicted
             H        D        A
Actual  H    20.4     30.9      4.7
        D    11.1     18.2      3.4
        A    15.1     26.6      6.3
(b) Standard deviation for the b365 method.

               Predicted
             H        D        A
Actual  H    90.8     56.0     69.4
        D    59.8     23.4     22.2
        A    97.6     26.4     24.4
tot         248.2    105.8    116.0
(c) Confusion matrix for the bw method.

               Predicted
             H        D        A
Actual  H   135.1     62.4     35.9
        D    65.4     34.0     16.7
        A    70.9     43.5     30.3
(d) Standard deviation for the bw method.

               Predicted
             H        D        A
Actual  H    58.0     49.0     69.2
        D    44.8     13.2     24.0
        A    79.8     12.4     15.8
tot         182.6     74.6    109.0
(e) Confusion matrix for the avg method.

               Predicted
             H        D        A
Actual  H    24.7     10.4      6.4
        D    11.1      6.8      4.6
        A    16.9      5.7      7.7
(f) Standard deviation for the avg method.

               Predicted
             H        D        A
Actual  H    59.4     69.8    100.8
        D    54.4     21.4     24.8
        A    89.8     28.2     20.4
tot         203.6    119.4    146.0
(g) Confusion matrix for the max method.

               Predicted
             H        D        A
Actual  H    58.5     31.8     20.9
        D    38.9     15.2      9.7
        A    35.2     28.8     17.0
(h) Standard deviation for the max method.

               Predicted
             H        D        A
Actual  H   163.6    130.8    109.6
        D   102.4     56.4     40.4
        A   161.4     61.8     44.0
tot         427.4    249.0    194.0
(i) Confusion matrix for the min method.

               Predicted
             H        D        A
Actual  H    84.9     46.7     32.7
        D    44.8     25.8     19.4
        A    42.9     36.7     27.2
(j) Standard deviation for the min method.

               Predicted
             H        D        A
Actual  H   162.6     86.0     94.6
        D    95.0     33.6     32.2
        A   125.0     43.4     36.6
tot         382.6    163.0    163.4
(k) Confusion matrix for the rnd method.

               Predicted
             H        D        A
Actual  H   162.3     77.9     40.5
        D    88.1     34.4     23.8
        A    87.5     54.1     39.6
(l) Standard deviation for the rnd method.

Figure K.1: Confusion matrices when betting on selected matches.


L. Normality check of results for value betting
Figure L.1 shows the statistic values of the Shapiro-Wilk and Anderson-Darling tests for all methods.
Both tests check whether the results of the methods are normally distributed. The Shapiro-Wilk test
takes as its null hypothesis that the values are normally distributed, so the hypothesis has to be
rejected if the 𝑝-value is below 0.05. The Anderson-Darling test returns a statistic; the null hypothesis
that the results are normally distributed has to be rejected if this statistic exceeds the critical value
of 0.984, which corresponds to a significance level of 0.05. The tables show that neither null
hypothesis has to be rejected for any method, so the results are treated as normally distributed.

Test                b365                  bw                    avg
Shapiro-Wilk        𝑠=0.848, 𝑝=0.187      𝑠=0.913, 𝑝=0.485      𝑠=0.938, 𝑝=0.649
Anderson–Darling    𝑠=0.433               𝑠=0.293               𝑠=0.242
(a) Normality check statistics for methods b365, bw and avg (𝑛=5).

Test                max                   min                   rnd
Shapiro-Wilk        𝑠=0.864, 𝑝=0.242      𝑠=0.908, 𝑝=0.454      𝑠=0.886, 𝑝=0.338
Anderson–Darling    𝑠=0.400               𝑠=0.294               𝑠=0.347
(b) Normality check statistics for methods max, min and rnd (𝑛=5).

Figure L.1: Normality check statistics for all methods when value betting. 𝑠 is the statistic value, 𝑝 is the 𝑝-value and 𝑛 is the size
of the data.
