Deriving A Model To Calculate The Probability of Scoring A Goal From Every Shooting Position in The Football Pitch and Applying It To Predict The XG For Different Matches.

Downloaded
from www.clastify.com by Thiago Zambrano Caceres
Page numbers: 20
Introduction
Mathematics is all around us, even in our free time, although we rarely notice it. Personally, I am a big fan of soccer and never miss a game of my favorite team. After each game, the presenters discuss the match statistics, which I find very interesting, because they explain the performance and the result of a specific match using mathematics. This shows how widely mathematics is used. Over time, I noticed that they often talk about expected goals (xG), a measure that estimates the number of goals each team should have scored by considering every shot in the match. I did some research and found the idea behind it very logical, because the success of a shot depends on many factors, such as distance and angle. I wondered whether I could use mathematics to build my own model that estimates the probability of scoring a goal from a specific location on the football pitch. With such a model I would gain a better overview of the performance of a team and could even predict future developments. After all, scoring goals does not necessarily mean that a team played well. Coaches therefore need an alternative way to assess the performance or quality of the team from statistics after the match. I would like to offer this with my model and show how useful mathematics can be in soccer strategy and analysis.
So, the aim of this research is to find an equation into which the location of a shot can be inserted in order to obtain the probability of a goal. With it, I as a fan, and even coaches, could judge whether the players were just lucky or whether the team really performed well. The model will evaluate the actual quality of the team and give predictions for its future performance. That is why I obtained the data of the Premier League season 2017/2018 from Wyscout (www.wyscout.com). The data contained the location of every shot taken during that season, together with whether the shot resulted in a goal or not. A logistic regression analysis was then performed on the data, which gave me the coefficients of the equation that relates the position of a shot to its probability of success. The equation was then applied to two games and compared with the actual results to test its quality.
Collected Data
The data from Wyscout describe the result and the location of each shot. The position is the independent variable and consists of an X coordinate and a Y coordinate. The X coordinate represents the distance from the right sideline, and the Y coordinate represents the vertical nearness to the opponent's goal. Both coordinates are given as a percentage of the total width or length of the pitch, since different soccer fields can have different sizes. The raw X coordinate was not suitable for my study, because I assume that it makes no difference whether a shot is taken from the left or the right side; only the distance from the vertical line through the midpoint of the pitch matters. I therefore calculated the absolute value of the difference between the X coordinate and 50% and used this distance from the center as the X coordinate, ignoring the side of the shot.
The dependent variable in the Wyscout data consists of categories such as "accurate", "not accurate", "block", "opportunity" or "goal" that describe the result of the shot (https://www.nature.com/articles/s41597-019-0247-7). If a shot is labelled "goal", it is regarded as successful; otherwise it is regarded as a failure. I converted these labels to numbers: shots that led to a goal received the value 1 and all other shots the value 0. In this way all variables are numerical.
Because there were 32698 shots in that season, only the first 5 and last 5 shots are shown in the table below. The full table of raw data can be accessed at https://footballdata.wyscout.com/download-samples/.
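Table 1: The location of each shot in the season and the result of that shot

| Shot | Result (successful = 1, unsuccessful = 0) | Y coordinate (%) | X coordinate (%) |
|---|---|---|---|
| 1 | 1 | 88 | 9 |
| 2 | 0 | 88 | 9 |
| 3 | 0 | 85 | 2 |
| 4 | 1 | 96 | 2 |
| 5 | 1 | 98 | 9 |
| … | … | … | … |
| 32694 | 0 | 73 | 31 |
| 32695 | 0 | 73 | 31 |
| 32696 | 0 | 94 | 1 |
| 32697 | 0 | 92 | 5 |
| 32698 | 0 | 93 | 1 |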
$y \mid (0 \le y \le 100)$: The origin of the Y axis is at the line of the own goal, and the axis continues towards the line of the opponent's goal.
$x \mid (0 \le x \le 50)$: The origin of the X axis is at the center of the football pitch, and the axis continues in both directions towards the right and the left sideline of the pitch.
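The two preparation steps described above (folding the raw X coordinate around the 50% line and turning the result label into 1 or 0) can be sketched in a few lines of Python. This is only an illustration: the function names and the three example shots are hypothetical and are not taken from the Wyscout file.

```python
def centre_distance(raw_x: float) -> float:
    """Distance (in % of pitch width) from the line through the pitch midpoint."""
    return abs(raw_x - 50.0)


def encode_result(label: str) -> int:
    """Return 1 for a goal and 0 for every other result label."""
    return 1 if label.strip().lower() == "goal" else 0


def preprocess(raw_shots):
    """Turn (raw_x, y, label) rows into (result, y, x) rows as used in Table 1."""
    return [(encode_result(label), y, centre_distance(raw_x))
            for raw_x, y, label in raw_shots]


# Hypothetical raw rows: (raw X in %, Y in %, result label)
example = [(59.0, 88.0, "goal"), (41.0, 88.0, "accurate"), (52.0, 85.0, "block")]
for row in preprocess(example):
    print(row)
```

Shots at raw X = 59 and raw X = 41 end up with the same X value of 9, which is exactly the left/right symmetry assumed in this study.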
First, a linear regression analysis was performed on all 32698 shots using Excel in order to estimate a preliminary relationship between the dependent variable (the result of the shot) and both independent variables (the X and Y coordinates). Excel fits the results "1" and "0" and returns the coefficients of the regression equation (https://www.southampton.ac.uk/passs/confidence_in_the_police/multivariate_analysis/linear_regression.page):

| | Coefficients |
|---|---|
| Intercept | -0.13488 |
| Y variable | 0.002074 |
| X variable | -0.00132 |

The intercept and the coefficients of the variables belong to a linear regression model:

$$\hat{z} = b_0 + b_1 y + b_2 x$$

$\hat{z}$: the predicted result of the shot
$b_0$: the intercept
$b_1$: the coefficient of the Y variable
$b_2$: the coefficient of the X variable
X variable: the horizontal distance from the vertical line through the midpoint of the pitch (%)
Y variable: the vertical nearness to the goal (%) (https://www.nature.com/articles/s41597-019-0247-7)
The dependent variable I am looking for is the probability of scoring a goal from a specific location. Since a probability lies between 0 and 1, I cannot use a linear regression, because the predicted values can become greater than 1 or less than 0 for some positions on the pitch. The expression $b_0 + b_1 y + b_2 x$ can take any real value and does not have a limited range, so it cannot directly represent a probability. I did a lot of research to figure out how to solve this problem, looking for a function that describes how a probability is affected by one or more factors. I found the logistic function, introduced by Pierre François Verhulst (originally to model population growth), which maps any real number to a value between 0 and 1 and can therefore relate a probability to one or more independent variables (https://mathworld.wolfram.com/LogisticEquation.html):

$$P(\text{success}) = \frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}}$$
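The range problem described above can be checked numerically. The short Python sketch below uses the linear-regression coefficients reported earlier; the function names and the chosen test positions are only illustrative assumptions.

```python
import math

# Coefficients of the preliminary linear regression reported in the text
B0, B1, B2 = -0.13488, 0.002074, -0.00132


def linear_prediction(y: float, x: float) -> float:
    """Value of b0 + b1*y + b2*x; nothing forces it to stay inside [0, 1]."""
    return B0 + B1 * y + B2 * x


def logistic(t: float) -> float:
    """Logistic function e^t / (1 + e^t); always strictly between 0 and 1."""
    return math.exp(t) / (1.0 + math.exp(t))


# A shot from the own goal line at the sideline gives a negative "probability"
print(linear_prediction(y=0, x=50))

# The logistic function keeps even extreme inputs inside (0, 1)
for t in (-20, -1, 0, 1, 20):
    print(t, logistic(t))
```

The first printed value is below zero, which is not a valid probability, whereas every logistic value stays inside the interval from 0 to 1.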
The coefficients of the linear regression we performed on the data belong to a linear equation. Therefore, we need to linearize the logistic equation, so that some dependent variable has a linear relationship with the independent variables. In other words, we need to find the quantity that is equal to $b_0 + b_1 y + b_2 x$ (https://xaktly.com/LogisticFunctions.html). I therefore rearranged the logistic equation until $b_0 + b_1 y + b_2 x$ was isolated on one side, where $P$ is the probability of success:

$$
\begin{aligned}
\frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}} &= P \\
e^{b_0 + b_1 y + b_2 x} &= P\left(1 + e^{b_0 + b_1 y + b_2 x}\right) \\
e^{b_0 + b_1 y + b_2 x} &= P + P\,e^{b_0 + b_1 y + b_2 x} \\
e^{b_0 + b_1 y + b_2 x} - P\,e^{b_0 + b_1 y + b_2 x} &= P \\
e^{b_0 + b_1 y + b_2 x}\,(1 - P) &= P \\
e^{b_0 + b_1 y + b_2 x} &= \frac{P}{1 - P} \\
\ln\left(\frac{P}{1 - P}\right) &= b_0 + b_1 y + b_2 x
\end{aligned}
$$
That means that $\ln\left(\frac{P}{1-P}\right)$ has a linear relationship with $x$ and $y$. If I insert the coefficients in the equation I get:

$$\ln\left(\frac{P}{1-P}\right) = -0.13488 + 0.002074y - 0.00132x$$

Therefore, $\ln\left(\frac{P}{1-P}\right)$ is calculated by inserting the location of each shot for $x$ and $y$. Since this process must be repeated for each of the 32698 shots, the formula is entered in Excel, which calculates $\ln\left(\frac{P}{1-P}\right)$ for every shot in seconds. Again, only the same 10 sample shots of the table are displayed.

| Shot | Result | y (%) | x (%) | $\ln\left(\frac{P}{1-P}\right)$ |
|---|---|---|---|---|
| 1 | 1 | 88 | 9 | 0.035752 |
| 2 | 0 | 88 | 9 | 0.035752 |
| 3 | 0 | 85 | 2 | 0.03877 |
| 4 | 1 | 96 | 2 | 0.061584 |
| 5 | 1 | 98 | 9 | 0.056492 |
| … | … | … | … | … |
| 32694 | 0 | 73 | 31 | -0.0244 |
| 32695 | 0 | 73 | 31 | -0.0244 |
| 32696 | 0 | 94 | 1 | 0.058756 |
| 32697 | 0 | 92 | 5 | 0.049328 |
| 32698 | 0 | 93 | 1 | 0.056682 |
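The column of log-odds values can be reproduced outside Excel. The following Python sketch applies the equation with the coefficients above to the sample shots, taking the coordinates from Table 1; the function name `log_odds` is an illustrative choice.

```python
# Coefficients inserted into ln(P / (1 - P)) = b0 + b1*y + b2*x
B0, B1, B2 = -0.13488, 0.002074, -0.00132


def log_odds(y: float, x: float) -> float:
    """ln(P / (1 - P)) for a shot at Y coordinate y and X coordinate x (both in %)."""
    return B0 + B1 * y + B2 * x


# (shot number, y, x) for the sample rows of Table 1
sample = [(1, 88, 9), (3, 85, 2), (4, 96, 2), (5, 98, 9),
          (32694, 73, 31), (32696, 94, 1), (32697, 92, 5), (32698, 93, 1)]
for shot, y, x in sample:
    print(shot, round(log_odds(y, x), 6))
```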
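The actual aim of this research is to find an equation that relates the probability of a goal to the location of the shot. Therefore, I had to solve the equation $\ln\left(\frac{P}{1-P}\right) = -0.13488 + 0.002074y - 0.00132x$ for $P$, so that $P$ stands alone on one side of the equation. Since $\ln(x) = \log_e(x)$:

$$
\begin{aligned}
\log_e\left(\frac{P}{1-P}\right) &= b_0 + b_1 y + b_2 x \\
\frac{P}{1-P} &= e^{b_0 + b_1 y + b_2 x}
\end{aligned}
$$

I know that $b_0 + b_1 y + b_2 x$ is equal to $\ln\left(\frac{P}{1-P}\right)$. Therefore, I raised $e$ to the power of the value of $\ln\left(\frac{P}{1-P}\right)$ calculated from the $x$ and $y$ coordinates of each shot and obtained $\frac{P}{1-P}$:

$$e^{\ln\left(\frac{P}{1-P}\right)} = \frac{P}{1-P}$$

| Shot | Result | y (%) | x (%) | $\ln\left(\frac{P}{1-P}\right)$ | $\frac{P}{1-P}$ |
|---|---|---|---|---|---|
| 1 | 1 | 88 | 9 | 0.035752 | 1.036399 |
| 2 | 0 | 88 | 9 | 0.035752 | 1.036399 |
| 3 | 0 | 85 | 2 | 0.03877 | 1.039531 |
| 4 | 1 | 96 | 2 | 0.061584 | 1.06352 |
| 5 | 1 | 98 | 9 | 0.056492 | 1.058118 |
| … | … | … | … | … | … |
| 32694 | 0 | 73 | 31 | -0.0244 | 0.975897 |
| 32695 | 0 | 73 | 31 | -0.0244 | 0.975897 |
| 32696 | 0 | 94 | 1 | 0.058756 | 1.060516 |
| 32697 | 0 | 92 | 5 | 0.049328 | 1.050565 |
| 32698 | 0 | 93 | 1 | 0.056682 | 1.058319 |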
From $\frac{P}{1-P}$ the probability of success of each shot is easily calculated. Let the variable $N$ be equal to $\frac{P}{1-P}$. Then:

$$
\begin{aligned}
N &= \frac{P}{1-P} \\
N - NP &= P \\
N &= P + NP \\
N &= P(1 + N) \\
P &= \frac{N}{1+N}
\end{aligned}
$$

For the shots that resulted in a goal I calculated $P$ of success in this way. For the shots that did not result in a goal I calculated the probability of failure. As mentioned before, $P$ in this research means the probability of success, so for the shots that missed the goal I calculated $1 - P(\text{success})$ in order to work out the probability that those shots miss. In this way, the probability assigned to each shot corresponds to its actual result (https://towardsdatascience.com/a-simple-interpretation-of-logistic-regression-coefficients-e3a40a62e8cf).
The formula $P = \frac{N}{1+N}$ is entered in Excel and applied to all shots that resulted in a goal. The formula $1 - \frac{N}{1+N}$ is used for all shots that missed the goal.

| Shot | Result | y (%) | x (%) | $\ln\left(\frac{P}{1-P}\right)$ | $\frac{P}{1-P}$ | Probability of the actual result |
|---|---|---|---|---|---|---|
| 1 | 1 | 88 | 9 | 0.035752 | 1.036399 | 0.508937 |
| 2 | 0 | 88 | 9 | 0.035752 | 1.036399 | 0.491063 |
| 3 | 0 | 85 | 2 | 0.03877 | 1.039531 | 0.490309 |
| 4 | 1 | 96 | 2 | 0.061584 | 1.06352 | 0.515391 |
| 5 | 1 | 98 | 9 | 0.056492 | 1.058118 | 0.514119 |
| … | … | … | … | … | … | … |
| 32694 | 0 | 73 | 31 | -0.0244 | 0.975897 | 0.506099 |
| 32695 | 0 | 73 | 31 | -0.0244 | 0.975897 | 0.506099 |
| 32696 | 0 | 94 | 1 | 0.058756 | 1.060516 | 0.485315 |
| 32697 | 0 | 92 | 5 | 0.049328 | 1.050565 | 0.48767 |
| 32698 | 0 | 93 | 1 | 0.056682 | 1.058319 | 0.485833 |
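The three spreadsheet columns (log-odds, odds and probability of the actual result) can be chained into one small Python function. This is a sketch of the calculation described above, with an assumed function name and the starting coefficients from the text; the coordinates of the sample shots come from Table 1.

```python
import math

B0, B1, B2 = -0.13488, 0.002074, -0.00132  # starting coefficients from the text


def probability_of_actual_result(result: int, y: float, x: float) -> float:
    """P(success) if the shot was a goal (result 1), otherwise 1 - P(success)."""
    log_odds = B0 + B1 * y + B2 * x      # ln(P / (1 - P))
    odds = math.exp(log_odds)            # N = P / (1 - P)
    p_success = odds / (1.0 + odds)      # P = N / (1 + N)
    return p_success if result == 1 else 1.0 - p_success


print(round(probability_of_actual_result(1, 88, 9), 6))   # shot 1 (goal)
print(round(probability_of_actual_result(0, 88, 9), 6))   # shot 2 (no goal)
print(round(probability_of_actual_result(0, 73, 31), 6))  # shot 32694 (no goal)
```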
Now we need to find the parameters that best explain the observed data. A common method to do that is maximum likelihood estimation. The likelihood function is the product of the probabilities of all shots resulting in their actual, known outcome (1 or 0). But first we need an expression for the probability of a shot based on its actual result (https://machinelearningmastery.com/logistic-regression-with-maximum-likelihood-estimation/):

$$f(z) = P^{z}(1-P)^{1-z}$$

$z$: a specific outcome (1 or 0)

This expression is the probability function of the Bernoulli distribution. It is very convenient for calculating the probability of a shot based on its real result, because if $z = 0$, then $P^{z}$ equals 1 while $(1-P)^{1-z} = 1 - P$ remains, so the probability of failure is calculated (https://www.perfectlyrandom.org/2019/04/27/bernoulli-distribution-as-a-tiny-nn/). On the other hand, if $z = 1$, then $(1-P)^{1-z}$ equals 1 and $P$ remains, so the probability of success is calculated. In other words, the expression does not give the probability of scoring a goal but the probability of the shot resulting in the outcome that has actually taken place.
If we take the product of the term $P^{z}(1-P)^{1-z}$ over all shots, we obtain the likelihood function, which indicates how plausible the model is given all my data points. In that way I find out to what extent the model predicts the correct outcomes: a larger product represents a better model, because the probability of predicting the correct outcomes increases. I can multiply the probabilities because I assume that all data points, in other words all shots, are independent of each other, so that the multiplication rule applies.

$$L = \prod_{i=1}^{n} P_i^{\,z_i}(1-P_i)^{1-z_i}$$

Here $n = 32698$ is the number of shots, and $P_i$ and $z_i$ are the probability of success and the actual outcome of shot $i$.
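The Bernoulli term and the likelihood product can be illustrated with a few lines of Python. The three probabilities and outcomes below are hypothetical numbers chosen only to show the mechanics; they are not taken from the shot data.

```python
import math


def bernoulli(p: float, z: int) -> float:
    """P^z * (1 - P)^(1 - z): p if the shot was a goal (z = 1), else 1 - p."""
    return p ** z * (1.0 - p) ** (1 - z)


def likelihood(probabilities, outcomes) -> float:
    """Product of the Bernoulli terms, assuming independent shots."""
    return math.prod(bernoulli(p, z) for p, z in zip(probabilities, outcomes))


# Hypothetical example: three shots with model probabilities and actual results
probabilities = [0.10, 0.05, 0.20]
outcomes = [1, 0, 0]
print(likelihood(probabilities, outcomes))  # 0.10 * 0.95 * 0.80
```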
However, since the probabilities are numbers between 0 and 1, in other words fractions, the multiplication of 32698 fractions creates a number so small that Excel cannot calculate it. Therefore, the natural logs of the probabilities are taken, so that the sum of the natural logs over all shots is calculated instead of the product (https://towardsdatascience.com/calculating-maximum-likelihood-estimation-by-hand-step-by-step-3a740c637c20). For a single shot $i$:

$$\ln f(z_i) = z_i \ln(P_i) + (1 - z_i)\ln(1 - P_i)$$

Summing over all shots gives the log-likelihood:

$$\ln(L) = \sum_{i=1}^{n} \left[ z_i \ln(P_i) + (1 - z_i)\ln(1 - P_i) \right]$$

where

$$P_i = \frac{e^{b_0 + b_1 y_i + b_2 x_i}}{1 + e^{b_0 + b_1 y_i + b_2 x_i}}$$

If we substitute this expression for $P_i$ we get:

$$\ln(L) = \sum_{i=1}^{n} \left[ z_i \ln\left(\frac{e^{b_0 + b_1 y_i + b_2 x_i}}{1 + e^{b_0 + b_1 y_i + b_2 x_i}}\right) + (1 - z_i)\ln\left(1 - \frac{e^{b_0 + b_1 y_i + b_2 x_i}}{1 + e^{b_0 + b_1 y_i + b_2 x_i}}\right) \right]$$
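Both points made above, that the raw product becomes too small to represent and that the sum of logs avoids this, can be seen in a short Python sketch. The function names are illustrative, and the constant probability of 0.5 is a hypothetical stand-in for the real per-shot probabilities.

```python
import math


def success_probability(b, y: float, x: float) -> float:
    """Logistic probability e^t / (1 + e^t) with t = b0 + b1*y + b2*x."""
    t = b[0] + b[1] * y + b[2] * x
    return math.exp(t) / (1.0 + math.exp(t))


def log_likelihood(b, shots) -> float:
    """Sum of z*ln(P) + (1 - z)*ln(1 - P) over shots given as (z, y, x)."""
    total = 0.0
    for z, y, x in shots:
        p = success_probability(b, y, x)
        total += z * math.log(p) + (1 - z) * math.log(1.0 - p)
    return total


# Multiplying 32698 fractions underflows in floating-point arithmetic ...
tiny_product = math.prod([0.5] * 32698)
# ... while the equivalent sum of logs is an ordinary number.
log_sum = 32698 * math.log(0.5)
print(tiny_product, log_sum)

# With all coefficients at zero every shot has P = 0.5
print(log_likelihood([0.0, 0.0, 0.0], [(1, 88, 9), (0, 88, 9)]))
```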
Now the natural log of the Bernoulli expression for a single shot,

$$z_i \ln\left(\frac{e^{b_0 + b_1 y_i + b_2 x_i}}{1 + e^{b_0 + b_1 y_i + b_2 x_i}}\right) + (1 - z_i)\ln\left(1 - \frac{e^{b_0 + b_1 y_i + b_2 x_i}}{1 + e^{b_0 + b_1 y_i + b_2 x_i}}\right),$$

is evaluated for each data point by inserting the location (x and y coordinates) of the shot. Trial values for the coefficients $b_0$, $b_1$ and $b_2$ are inserted so that this term can be calculated for each shot, and the coefficients are then changed again and again in search of values that make the term as large as possible. As there are a lot of data points (32698 shots), each data point on its own would be maximized by a different combination of the three unknown coefficients. Therefore, we need to determine the single set of coefficients that results in the greatest sum of the log-likelihood function over all data points:

$$\sum_{i=1}^{n} \left[ z_i \ln\left(\frac{e^{b_0 + b_1 y_i + b_2 x_i}}{1 + e^{b_0 + b_1 y_i + b_2 x_i}}\right) + (1 - z_i)\ln\left(1 - \frac{e^{b_0 + b_1 y_i + b_2 x_i}}{1 + e^{b_0 + b_1 y_i + b_2 x_i}}\right) \right]$$

Because there are so many data points and so many possible combinations of values for the three unknown coefficients, the chance of finding the best parameters by trial and error is very low and the process would take a long time. Therefore, I used the Solver in Excel, which accelerates the process and finds the parameters that best explain the data and fit the model. Solver uses the same equations and adjusts the coefficients step by step until it finds the parameters that give the largest possible value of the log-likelihood sum for all data points.
Largest sum possible: -1688.45
This largest possible sum came out when the intercept and coefficients are:

| | Coefficients |
|---|---|
| Intercept | -12.1034 |
| Y Variable | 0.103273 |
| X Variable | -0.0585 |
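The idea of repeatedly adjusting the coefficients until the log-likelihood sum stops growing can be sketched in Python with plain gradient ascent. This is an assumption-laden illustration and not the algorithm that Excel's Solver runs: the twelve toy shots are hypothetical, the coordinates are rescaled to the range 0 to 1 to keep the steps stable, and the function names, step size and number of steps are arbitrary choices.

```python
import math


def sigmoid(t: float) -> float:
    """Logistic function written as 1 / (1 + e^-t)."""
    return 1.0 / (1.0 + math.exp(-t))


def log_likelihood(b, shots) -> float:
    """Sum of z*ln(P) + (1 - z)*ln(1 - P) over shots given as (z, y, x)."""
    total = 0.0
    for z, y, x in shots:
        p = sigmoid(b[0] + b[1] * y + b[2] * x)
        total += z * math.log(p) + (1 - z) * math.log(1.0 - p)
    return total


def fit_by_gradient_ascent(shots, steps: int = 5000, rate: float = 0.5):
    """Move the coefficients uphill along the gradient of the log-likelihood."""
    b = [0.0, 0.0, 0.0]
    n = len(shots)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for z, y, x in shots:
            error = z - sigmoid(b[0] + b[1] * y + b[2] * x)
            grad[0] += error
            grad[1] += error * y
            grad[2] += error * x
        b = [bi + rate * gi / n for bi, gi in zip(b, grad)]
    return b


# Hypothetical toy shots: (result, y / 100, x / 50)
toy_shots = [
    (1, 0.96, 0.04), (0, 0.94, 0.02), (1, 0.93, 0.00), (0, 0.90, 0.20),
    (0, 0.88, 0.18), (1, 0.86, 0.24), (0, 0.85, 0.30), (0, 0.80, 0.40),
    (0, 0.75, 0.26), (0, 0.92, 0.44), (0, 0.72, 0.22), (0, 0.97, 0.36),
]
start = [0.0, 0.0, 0.0]
fitted = fit_by_gradient_ascent(toy_shots)
print(log_likelihood(start, toy_shots), log_likelihood(fitted, toy_shots))
```

The second printed value is larger (less negative) than the first, which mirrors the search for the largest possible sum described above.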
The coefficient of the X variable is negative. Because the X variable represents the horizontal distance from the center of the field, a negative coefficient indicates that the probability of a goal decreases with increasing distance from the center of the football field. In contrast, the coefficient of the Y variable is positive: the closer the player is to the goal, the higher the probability of scoring. Furthermore, the absolute value of the coefficient of the Y (vertical) variable is larger than that of the X (horizontal) variable. That means that a change of one percentage point in Y has a larger effect on the probability than a change of one percentage point in X.
With the coefficients and the intercept inserted, only two unknown variables remain, namely $x$ and $y$. Inserting the coordinates of a shot gives the probability of a goal from this position:

$$P = \frac{e^{-12.1034 + 0.103273y - 0.0585x}}{1 + e^{-12.1034 + 0.103273y - 0.0585x}}$$
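As a quick check, the final equation can be written as a small Python function and compared with values listed in Table 3. The function name `goal_probability` is an illustrative choice; the coefficients are the ones reported above.

```python
import math


def goal_probability(x: float, y: float) -> float:
    """Probability of a goal for a shot at X coordinate x and Y coordinate y (in %)."""
    t = -12.1034 + 0.103273 * y - 0.0585 * x
    return math.exp(t) / (1.0 + math.exp(t))


# Two positions that also appear in Table 3
print(goal_probability(x=0, y=95))  # listed as 0.091749
print(goal_probability(x=8, y=93))  # listed as 0.048938655
```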
A 3D graph was created using Excel in order to visualize the trend of the equation and determine how the probability is affected by changing the X or Y variable.

[Figure: 3D surface plot "Probability of scoring a goal at each point in the pitch". Axes: Y coordinate (%) from 0 to 100, X coordinate (%), and probability from 0 to 0.15.]
The horizontal axis of the graph represents the vertical nearness to the goal. We can see how the probability increases as the shot gets closer to the goal. Shots taken far away from the goal rarely result in a goal, whereas shots very close to the goal have the highest probability of going in. The depth axis represents the horizontal distance from the midpoint of the pitch. As the shot moves further from the midpoint, the probability of scoring a goal decreases. This is because the angle between the two goalposts increases as we get closer to the center, so that the player has a wider view and a better opportunity to shoot on target. The graph shows only one side of the football pitch, as I assumed before that it makes no difference whether the player shoots the ball from the left-hand side or the right-hand side (https://support.minitab.com/en-us/minitab/19/help-and-how-to/graphs/3d-scatterplot/interpret-the-results/key-results/). A reflection in the horizontal axis shows the other side of the field. The Y variable of the graph has a domain of $0 \le y \le 100$, because the length of the pitch cannot exceed one hundred percent; beyond that the ball is out. The probability on the vertical axis has a range of $0 < P \le 0.145$, because no shot from inside the field has a probability of zero. The maximum probability occurs at $(x, y) = (0, 100)$ and equals approximately 14.5 percent. The probability decreases as we move further away from the opponent's goal and becomes a very small fraction, but it never reaches zero.
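The domain and range statements can be checked by evaluating the equation on a grid that covers the whole half pitch. The sketch below assumes a grid of whole percentage points and uses an illustrative function name; the coefficients are the fitted ones reported earlier.

```python
import math


def goal_probability(x: float, y: float) -> float:
    """Fitted logistic model for a shot at X coordinate x and Y coordinate y (in %)."""
    t = -12.1034 + 0.103273 * y - 0.0585 * x
    return math.exp(t) / (1.0 + math.exp(t))


# Every whole-percent position with 0 <= x <= 50 and 0 <= y <= 100
grid = [(x, y) for x in range(0, 51) for y in range(0, 101)]
probabilities = {point: goal_probability(*point) for point in grid}

best = max(probabilities, key=probabilities.get)
worst = min(probabilities, key=probabilities.get)
print("highest:", best, probabilities[best])
print("lowest:", worst, probabilities[worst])
```

The highest value is found at the center of the opponent's goal line and the lowest at the corner of the own goal line, where the probability is tiny but still greater than zero.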
To test the accuracy of the equation, it is used to calculate the expected goals of real matches, which are then compared with the actual results. I have selected two matches from the Premier League, because the equation was created from Premier League data. In a different league, the position of the shot might affect the probability of a goal differently, so that the equation could show inaccurate or illogical results. In addition, both selected games are from 2018, as I only had the data of this year available.
Application Examples
Chelsea vs. Manchester United 25.02.2018 (https://www.manutd.com/en/matches/matchcenter/man-utd-vs-chelsea-match-919170)
Table 3: The coordinates of each shot of each team and the probability of that shot resulting in a goal
Manchester United Chelsea
𝑥 Coordinate 𝑦 coordinate 𝑃 𝑥 Coordinate 𝑦 coordinate 𝑃
1 87 0.040035 8 93 0.048938655
2 87 0.037846 6 95 0.066392935
3 89 0.043622 3 90 0.048138833
4 90 0.045528 4 91 0.050232404
8 94 0.053976 6 91 0.044935232
0 90 0.056849 5 88 0.035301982
8 90 0.036375 8 92 0.044349789
7 90 0.038482 6 89 0.036858893
0 95 0.091749 23 84 0.008376046
1 94 0.079129 19 79 0.006328566
0 92 0.068992 30 81 0.00409744
0 90 0.056849 23 81 0.00615824
2 92 0.061845 14 89 0.023405442
13 73 0.004845 14 89 0.023405442
11 79 0.010067 14 89 0.023405442
13 73 0.004845 14 89 0.023405442
13 75 0.00595 14 89 0.023405442
11 88 0.025114 15 86 0.016311771
16 82 0.010241 15 86 0.016311771
19 90 0.019449 15 86 0.016311771
9 78 0.010205 15 86 0.016311771
1 91 0.059298 17 84 0.011856309
1 91 0.059298 20 81 0.007330988
1 91 0.059298 16 85 0.01390924
1 91 0.059298 23 83 0.007560408
0 93 0.075928 5 86 0.028904655
0 94 0.083498 4 90 0.045528034
0 93 0.075928 11 90 0.03069945
0 93 0.075928 4 90 0.045528034
0 93 0.075928 16 90 0.02309372
7 89 0.034837 16 90 0.02309372
13 88 0.022404 16 90 0.02309372
9 92 0.041935 18 91 0.022786052
6 94 0.060271 27 75 0.002631871
19 70 0.002507921
26 75 0.002789985
18 71 0.002946998
22 71 0.002333576
18 71 0.002946998
18 71 0.002946998
11 75 0.006683384
15 75 0.00529636
11 72 0.00491154
16 90 0.02309372
16 90 0.02309372
16 90 0.02309372
16 90 0.02309372
13 81 0.011000131
4 81 0.018482291
10 81 0.01308281
The probability of each shot resulting in a goal was calculated using the equation:

$$P = \frac{e^{-12.1034 + 0.103273y - 0.0585x}}{1 + e^{-12.1034 + 0.103273y - 0.0585x}}$$

I add up the probabilities of all shots of each team separately (https://fbref.com/en/expected-goals-model-explained/):

$$\text{Expected Goals (xG)} = P(\text{shot}_1) + P(\text{shot}_2) + \dots + P(\text{shot}_n)$$

1. Chelsea:

$$\text{xG} = P(\text{shot}_1) + P(\text{shot}_2) + \dots + P(\text{shot}_{50}) = \sum_{i=1}^{50} P_i = 1.04$$

2. Manchester United:

$$\text{xG} = P(\text{shot}_1) + P(\text{shot}_2) + \dots + P(\text{shot}_{34}) = \sum_{i=1}^{34} P_i = 1.59$$

Expected Result
Manchester United 1.59 – 1.04 Chelsea
Actual Result
Manchester United 2 – 1 Chelsea
According to my equation, Manchester United should theoretically win and Chelsea lose, which is consistent with the actual result. So, my equation shows logical and trustworthy results for this game, which does not contain an extreme number of goals.
Leicester City vs. Manchester City 10.02.2018
I have used the same approach as in the previous example to calculate the expected number of goals for each team during the match. The data for this application is given in the appendix.

$$P(\text{success}) = \frac{e^{-12.1034 + 0.103273y - 0.0585x}}{1 + e^{-12.1034 + 0.103273y - 0.0585x}}$$

I add up the probabilities of all shots of each team separately:

$$\text{Expected Goals (xG)} = P(\text{shot}_1) + P(\text{shot}_2) + \dots + P(\text{shot}_n)$$

1. Leicester City:

$$\text{xG} = P(\text{shot}_1) + P(\text{shot}_2) + \dots + P(\text{shot}_{9}) = \sum_{i=1}^{9} P_i = 0.215$$

2. Manchester City:

$$\text{xG} = P(\text{shot}_1) + P(\text{shot}_2) + \dots + P(\text{shot}_{75}) = \sum_{i=1}^{75} P_i = 2.55$$

Expected Result
Leicester City 0.215 – 2.55 Manchester City
Actual Result
Leicester City 1 – 5 Manchester City
Manchester City's xG is far below the number of goals actually scored, because most shots have very small probabilities. Nevertheless, the xG values rank the two teams in the same order as the actual result, although the xG ratio of roughly 12:1 is larger than the actual goal ratio of 5:1. My equation confirms the actual result, namely that Manchester City played much better and thus had a higher probability of scoring goals than Leicester City.
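The summation used for both matches can be sketched in Python. The shot coordinates below are the nine Leicester City shots listed in Table 2 of the appendix; the function names are illustrative choices.

```python
import math


def goal_probability(x: float, y: float) -> float:
    """Fitted logistic model for a shot at X coordinate x and Y coordinate y (in %)."""
    t = -12.1034 + 0.103273 * y - 0.0585 * x
    return math.exp(t) / (1.0 + math.exp(t))


def expected_goals(shots) -> float:
    """xG of a team: the sum of the goal probabilities of all its shots."""
    return sum(goal_probability(x, y) for x, y in shots)


# (x, y) coordinates of Leicester City's nine shots from Table 2
leicester_shots = [(11, 88), (9, 87), (11, 87), (7, 91), (12, 86),
                   (13, 90), (17, 84), (15, 86), (13, 89)]
leicester_xg = expected_goals(leicester_shots)
print(leicester_xg)
```

Passing a different list of coordinates, for example the shots of the other team, gives that team's xG in the same way.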
Conclusion
Through this research I was able to find an equation that relates the probability of a goal to the location of the shot. The equation estimates the probability of scoring a goal from any location on the football pitch and gives trustworthy results for normal matches that do not contain an extreme number of goals. However, the equation is less accurate for matches with an extreme number of goals, because most of the shots have a very small probability, so that the resulting expected goals are usually lower than the actual result. Nevertheless, my expected goals model was able to predict which team performed better by considering all its shots during the match. Therefore, the aim of the research was achieved, as I was able to work out an equation that can help me and a coach to estimate the performance of the players in a specific match. This investigation also shows how theories and approaches learned in one field of mathematics can be used to interpret situations and make decisions in other areas of life, which indicates that all areas of knowledge are connected to each other. Combining mathematics and sports science linked statistics with tactics and enabled a broader overview of the football pitch.
Evaluation
Limitations
The data is rounded to two significant figures: The $x$ and $y$ coordinates of the shots are given to two significant figures and further decimals are neglected. Therefore, the coefficients resulting from the logistic regression are not perfectly accurate. Increasing the sample size would reduce these inaccuracies.
No variety of seasons or leagues is available: Only one season of the Premier League is considered in this research, so the equation might only fit that specific season or league. Combining data from other seasons and leagues with the current season would generalize the equation to more matches worldwide.
Unable to present the full data in the research: The data consists of more than 30000 shots, so there is no space to show them all in the paper. As a result, some steps might be unclear or hard to follow. The data should be presented in a way that enables the reader to understand and follow the calculations involving it.
Only old data is available: The data I have belongs to the Premier League season 2017/2018. The relationship between the probability and the position of the shot might have changed over time. More recent data that fits current matches should be used.
Assumptions
1. It doesn't make any difference if the player is shooting the ball from the left side or the right side of the pitch
2. The probability of scoring a goal depends only on angle and distance
3. All players have equal probability of scoring a goal
4. All shots are independent from each other

Strength
I considered a very large number of shots, which should increase the accuracy of the equation and the results. Besides, I used an algorithm in many steps to prevent errors that might appear when calculating a specific step manually.
Extensions
• Use additional data from other leagues
• Consider the player that is shooting
• Consider the goalkeeper that is saving the shot (expected saves)
• Consider the number and position of the defenders that might prevent the ball from reaching the goal
• Expected assists
Bibliography
https://mathworld.wolfram.com/LogisticEquation.html
https://xaktly.com/LogisticFunctions.html
https://towardsdatascience.com/a-simple-interpretation-of-logistic-regression-coefficients-e3a40a62e8cf
https://machinelearningmastery.com/logistic-regression-with-maximum-likelihood-estimation/
https://www.perfectlyrandom.org/2019/04/27/bernoulli-distribution-as-a-tiny-nn/
https://towardsdatascience.com/calculating-maximum-likelihood-estimation-by-hand-step-by-step-3a740c637c20
https://lbelzile.github.io/timeseRies/manual-maximum-likelihood-estimation.htm
https://support.minitab.com/en-us/minitab/19/help-and-how-to/graphs/3d-scatterplot/interpret-the-results/key-results/
https://www.manutd.com/en/matches/matchcenter/man-utd-vs-chelsea-match-919170
https://fbref.com/en/expected-goals-model-explained/
Appendix
Table 1

| Shot | Result | X | Y | $\ln\left(\frac{P}{1-P}\right)$ | $\frac{P}{1-P}$ | P | ln P |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 88 | 9 | 0.035752 | 1.036399 | 0.508937 | -0.29334 |
| 2 | 0 | 88 | 9 | 0.035752 | 1.036399 | 0.491063 | -0.30886 |
| 3 | 0 | 85 | 2 | 0.03877 | 1.039531 | 0.490309 | -0.30953 |
| 4 | 1 | 96 | 2 | 0.061584 | 1.06352 | 0.515391 | -0.28786 |
| 5 | 1 | 98 | 9 | 0.056492 | 1.058118 | 0.514119 | -0.28894 |
| … | … | … | … | … | … | … | … |
| 32690 | 0 | 2 | 85 | 0.056492 | 1.058118 | 0.485881 | -0.31347 |
| 32691 | 0 | 2 | 85 | 0.056492 | 1.058118 | 0.485881 | -0.31347 |
| 32692 | 0 | 2 | 85 | 0.033114 | 1.033668 | 0.491722 | -0.30828 |
| 32693 | 0 | 2 | 85 | -0.0244 | 0.975897 | 0.506099 | -0.29576 |
| 32694 | 0 | 2 | 96 | -0.0244 | 0.975897 | 0.506099 | -0.29576 |
| 32695 | 0 | 2 | 96 | -0.0244 | 0.975897 | 0.506099 | -0.29576 |
| 32696 | 0 | 2 | 96 | 0.058756 | 1.060516 | 0.485315 | -0.31398 |
| 32697 | 0 | 2 | 96 | 0.049328 | 1.050565 | 0.48767 | -0.31187 |
| 32698 | 0 | 2 | 96 | 0.056682 | 1.058319 | 0.485833 | -0.31351 |
Table 2
Manchester City Leicester City
x y P x y P
7 96 0.069223 11 88 0.025114467
17 97 0.07618 9 87 0.025452755
5 96 0.077152 11 87 0.022706207
7 96 0.069223 7 91 0.042490424
6 96 0.073088 12 86 0.019380407
15 80 0.008845 13 90 0.027402662
0 80 0.021009 17 84 0.011856309
14 90 0.025886 15 86 0.016311771
15 80 0.008845 13 89 0.024780561
3 82 0.021657
2 90 0.050891
3 83 0.023957
1 80 0.019839
12 93 0.039128
9 91 0.037977
11 91 0.033926
17 91 0.024126
13 96 0.049751
5 96 0.077152
13 96 0.049751
11 96 0.055583
19 88 0.015877
15 94 0.036501
19 88 0.015877
13 88 0.022404
4 96 0.081421
4 96 0.081421
1 96 0.095549
2 96 0.090611
4 96 0.081421
3 83 0.023957
7 83 0.019054
1 83 0.026851
7 83 0.019054
16 90 0.023094
1 90 0.053792
6 90 0.040706
15 90 0.024451
16 90 0.023094
5 86 0.028905
5 86 0.028905
5 86 0.028905
11 88 0.025114
11 88 0.025114
8 88 0.029789
11 88 0.025114
22 95 0.027134
22 95 0.027134
19 95 0.032172
22 95 0.027134
23 84 0.008376
14 84 0.014099
23 84 0.008376
23 84 0.008376
15 87 0.018054
11 87 0.022706
3 87 0.035772
22 91 0.018118
9 91 0.037977
17 91 0.024126
0 91 0.062646
14 88 0.021158
11 88 0.025114
14 93 0.034958
7 88 0.031527
8 88 0.029789
14 88 0.021158
3 72 0.00782
11 80 0.01115
11 72 0.004912
6 89 0.036859
7 77 0.010345
7 90 0.038482
7 77 0.010345
2 77 0.013811