
Downloaded

from www.clastify.com by Thiago Zambrano Caceres

Deriving a model to calculate the probability of scoring a goal from every shooting position on the football pitch and applying it to predict the xG for different matches.

Page numbers: 20


Introduction

Mathematics is all around us. We may not notice it, but we encounter mathematics even in our free time. Personally, I am a big fan of soccer and never miss a game of my favorite team. After each game, the presenters discuss statistics, which I find very interesting, because they explain the performance and the result of a match using mathematics. This shows how indispensable mathematics is in every field. Over time, I noticed how they talk about the expected goals (xG) philosophy, which tries to estimate the number of goals each team should have scored by considering every shot in the match. I did some research and found the idea behind it very logical, because the success of a shot depends on many factors, such as distance and angle. I have often wondered why I should not use mathematics to build my own model that estimates the probability of scoring a goal from a specific location on the football pitch. With such a model I would get a better overview of a team's performance and could even predict future developments. After all, scoring goals does not necessarily mean that a team played well. Coaches therefore need an alternative way of judging the performance or quality of the team, based on different statistics after the match. I would like to offer this with my model and show how useful mathematics can be in soccer strategy and analysis.

So, the aim of this research is to find an equation into which the location of a shot can be inserted to obtain the probability of a goal. With it, I as a fan, and even coaches, could judge whether the players were just lucky or whether the team really performed well. The model evaluates the actual quality of the team and supports predictions of its future performance. For this purpose I obtained the data of the Premier League season 2017/2018 from Wyscout.1 The data consisted of all the shots taken during that season, the location of each shot, and whether the shot resulted in a goal. A logistic analysis was then performed on the data, from which I obtained the coefficients of the equation that relates the position of a shot to its probability of success. The equation was applied to two games and compared with the actual results to test its quality.

1
www.wyscout.com

Collected Data

The data from Wyscout described the result and the location of each shot. The position was the independent variable and consisted of an X coordinate and a Y coordinate. The X coordinate represented the distance from the right sideline, and the Y coordinate represented the vertical nearness to the goal. Both coordinates were given as a percentage of the total horizontal or vertical length of the pitch, since different soccer fields can have different sizes. In my opinion the raw X coordinate was not suitable for my study: it makes no difference whether the shot was taken from the left or the right side, only how far it was from the center of the field. I therefore calculated the absolute value of the difference between the X coordinate and 50%. In this way I ignored the side of the shot and used the distance from the center as the X coordinate.
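The folding of the X coordinate described above can be sketched in a few lines of Python. This is only an illustration: the function name `fold_x` and the sample values are assumptions and do not come from the Wyscout data.

```python
def fold_x(raw_x_percent: float) -> float:
    """Distance from the centre of the pitch, in percent of its width (0 to 50)."""
    return abs(raw_x_percent - 50.0)

# Hypothetical raw values: two mirrored shots and one central shot.
for raw_x in (41.0, 59.0, 50.0):
    print(raw_x, "->", fold_x(raw_x))
```

Two shots that are mirror images of each other receive the same folded coordinate, which is exactly the assumption that the side of the shot does not matter.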

The dependent variable in the Wyscout data consists of categories such as “accurate, not accurate, block, opportunity, or goal” that describe the result of the shot.2 If “goal” was written next to the shot, the shot was regarded as successful; otherwise it was regarded as a failure. I converted these categories to numbers: shots that led to a goal received the number “1”, and all other shots received the number “0”. In this way all my variables became numerical. A small section of the original Wyscout table is inserted below:

2
https://www.nature.com/articles/s41597-019-0247-7

Because there were 32698 shots in that season, only the first 5 and the last 5 shots are shown in the table below. The full table of raw data can be accessed from the link “https://footballdata.wyscout.com/download-samples/”.

Table 1: The location of each shot in the season and the result of that shot

Shot     Result (successful = 1, unsuccessful = 0)   Y coordinate (%)   X coordinate (%)
1        1                                           88                 9
2        0                                           88                 9
3        0                                           85                 2
4        1                                           96                 2
5        1                                           98                 9
…        …                                           …                  …
32694    0                                           73                 31
32695    0                                           73                 31
32696    0                                           94                 1
32697    0                                           92                 5
32698    0                                           93                 1


$y \mid (0 \le y \le 100)$: The origin of the Y axis is at the line of the team's own goal, and the axis runs towards the line of the opponent’s goal.

$x \mid (0 \le x \le 50)$: The origin of the X axis is at the center of the football pitch, and the axis runs in both directions towards the right and the left sideline of the pitch.

First, a linear regression analysis was performed on all 32698 shots using Excel in order to estimate a preliminary relationship between the dependent variable (probability) and the two independent variables (X and Y coordinates). Excel was given the results “1” and “0” as the dependent variable, and the coefficients of the regression equation were noted3:

Coefficients

Y Intercept -0.13488

Y Variable 0.002074

X Variable -0.00132

The coefficients of the variables and the Y intercept belong to a linear regression model:

$$P = b_0 + b_1 y + b_2 x$$

$b_0$: Y intercept

$b_1$: coefficient of the Y variable

$b_2$: coefficient of the X variable

X variable ($x$): the horizontal distance from the vertical line through the midpoint of the pitch (%)

Y variable ($y$): the vertical nearness to the goal (%)4
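As a rough illustration of how such coefficients are obtained, the following sketch fits the same linear model by least squares with NumPy instead of Excel. The function name `fit_linear` and the six toy shots are assumptions made for this example; they are not the Wyscout data, so the printed coefficients will differ from the ones quoted above.

```python
import numpy as np

def fit_linear(y_coord, x_coord, result):
    """Least-squares estimate of (b0, b1, b2) in result ≈ b0 + b1*y + b2*x."""
    design = np.column_stack([np.ones(len(y_coord)), y_coord, x_coord])
    coefficients, *_ = np.linalg.lstsq(design, result, rcond=None)
    return coefficients

# Hypothetical toy data (not the Wyscout shots): six made-up shots.
y_toy = np.array([88.0, 85.0, 96.0, 73.0, 94.0, 92.0])
x_toy = np.array([9.0, 2.0, 2.0, 31.0, 1.0, 5.0])
goal_toy = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])

b0, b1, b2 = fit_linear(y_toy, x_toy, goal_toy)
print(b0, b1, b2)
```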

3
https://www.southampton.ac.uk/passs/confidence_in_the_police/multivariate_analysis/linear_regression.page
4
https://www.nature.com/articles/s41597-019-0247-7

The dependent variable I was looking for is the probability of scoring a goal from a specific location. Since a probability lies between 0 and 1, I cannot use a linear regression, because the predicted values become greater than 1 or less than 0 for some positions on the pitch. The dependent variable of the linear equation therefore cannot represent a probability: $b_0 + b_1 y + b_2 x$ can take any value and does not have a limited range. I did a lot of research to solve this problem, looking for a function that describes how a probability is affected by one or more factors. I found the logistic function, introduced by Pierre François Verhulst, which can relate a probability to one or more independent variables5:

$$P(\text{success}) = \frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}}$$
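The difference between the two models can be checked numerically. The sketch below uses the linear-regression coefficients quoted in the text; the function names `linear_value` and `logistic` are chosen for this illustration only.

```python
import math

B0, B1, B2 = -0.13488, 0.002074, -0.00132  # linear-regression coefficients from the text

def linear_value(y, x):
    """Value of the linear model b0 + b1*y + b2*x."""
    return B0 + B1 * y + B2 * x

def logistic(t):
    """Logistic function e^t / (1 + e^t)."""
    return math.exp(t) / (1.0 + math.exp(t))

print(linear_value(0, 50))   # negative, so it cannot be a probability
for t in (-20.0, -1.0, 0.0, 1.0, 20.0):
    print(t, logistic(t))    # always strictly between 0 and 1
```

For a shot at y = 0 and x = 50 the linear model returns a negative number, whereas the logistic function stays between 0 and 1 for every input.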

The coefficients of the linear regression we performed on the data belong to a linear equation. Therefore, we need to linearize the logistic equation so that its dependent variable has a linear relationship with the independent variables. In other words, we need to find the dependent variable that is equal to $b_0 + b_1 y + b_2 x$.6 I therefore rearranged the logistic equation until $b_0 + b_1 y + b_2 x$ was isolated on one side of the equation:

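$P$ = probability of success

$$\frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}} = P$$

$$e^{b_0 + b_1 y + b_2 x} = P\left(1 + e^{b_0 + b_1 y + b_2 x}\right)$$

$$e^{b_0 + b_1 y + b_2 x} = P + P\,e^{b_0 + b_1 y + b_2 x}$$

$$e^{b_0 + b_1 y + b_2 x} - P\,e^{b_0 + b_1 y + b_2 x} = P$$

$$e^{b_0 + b_1 y + b_2 x}(1 - P) = P$$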

5
https://mathworld.wolfram.com/LogisticEquation.html
6
https://xaktly.com/LogisticFunctions.html

$$e^{b_0 + b_1 y + b_2 x} = \frac{P}{1 - P}$$

$$\ln\left(\frac{P}{1 - P}\right) = b_0 + b_1 y + b_2 x$$

That means that the expression $b_0 + b_1 y + b_2 x$ has a linear relationship with $\ln\left(\frac{P}{1 - P}\right)$. If I insert the coefficients into the equation, I get:

$$\ln\left(\frac{P}{1 - P}\right) = -0.13488 + 0.002074y - 0.00132x$$

Therefore $\ln\left(\frac{P}{1 - P}\right)$ is calculated by inserting the location of each shot for $x$ and $y$. Since this process must be repeated for each of the 32698 shots, the formula is entered in Excel, which can calculate $\ln\left(\frac{P}{1 - P}\right)$ for every shot in seconds. Again, only a sample of the same 10 shots of the table is displayed.

Shot     Result   y    x    ln(P/(1−P))
1        1        88   9    0.035752
2        0        88   9    0.035752
3        0        85   2    0.03877
4        1        96   2    0.061584
5        1        98   9    0.056492
…        …        …    …    …
32694    0        73   31   -0.0244
32695    0        73   31   -0.0244
32696    0        94   1    0.058756
32697    0        92   5    0.049328
32698    0        93   1    0.056682
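The log-odds column can be reproduced outside Excel. The following sketch assumes the coefficients quoted in the text and the (y, x) coordinates of the first five shots in Table 1; the function name `log_odds` is chosen for this illustration.

```python
B0, B1, B2 = -0.13488, 0.002074, -0.00132  # coefficients quoted in the text

def log_odds(y, x):
    """ln(P / (1 - P)) for vertical nearness y and central distance x (both in %)."""
    return B0 + B1 * y + B2 * x

# (y, x) pairs of the first five shots in Table 1
for y, x in [(88, 9), (88, 9), (85, 2), (96, 2), (98, 9)]:
    print(y, x, round(log_odds(y, x), 6))
```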


The actual aim of this research is to find an equation that relates the probability of a goal to the location of the shot. Therefore, I had to solve for $P$ by rearranging the equation $\ln\left(\frac{P}{1 - P}\right) = -0.13488 + 0.002074y - 0.00132x$ so that $P$ stands alone on one side of the equation.

$$\ln(x) = \log_e(x)$$

$$\log_e\left(\frac{P}{1 - P}\right) = b_0 + b_1 y + b_2 x$$

$$\frac{P}{1 - P} = e^{b_0 + b_1 y + b_2 x}$$

I know that $b_0 + b_1 y + b_2 x$ is equal to $\ln\left(\frac{P}{1 - P}\right)$. Therefore, I calculated $e^{\ln\left(\frac{P}{1 - P}\right)}$ with the $x$ and $y$ coordinates of each shot and obtained $\frac{P}{1 - P}$:

$$e^{\ln\left(\frac{P}{1 - P}\right)} = \frac{P}{1 - P}$$

Shot     Result   y    x    ln(P/(1−P))   P/(1−P)
1        1        88   9    0.035752      1.036399
2        0        88   9    0.035752      1.036399
3        0        85   2    0.03877       1.039531
4        1        96   2    0.061584      1.06352
5        1        98   9    0.056492      1.058118
…        …        …    …    …             …
32694    0        73   31   -0.0244       0.975897
32695    0        73   31   -0.0244       0.975897
32696    0        94   1    0.058756      1.060516
32697    0        92   5    0.049328      1.050565
32698    0        93   1    0.056682      1.058319


From $\frac{P}{1 - P}$ the probability of success of each shot is easily calculated. Let a variable $N$ be equal to $\frac{P}{1 - P}$. For the shots that resulted in a goal, I calculated the probability of success $P$ as follows:

$$N = \frac{P}{1 - P}$$

$$N - NP = P$$

$$N = P + NP$$

$$N = P(1 + N)$$

Therefore:

$$P = \frac{N}{1 + N}$$
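The two steps, from the log-odds to $N$ and from $N$ to $P$, can be sketched as follows. The function names are assumptions made for this example; the input 0.035752 is the log-odds value listed for shot 1.

```python
import math

def odds_from_log_odds(log_odds):
    """N = P / (1 - P), recovered from ln(P / (1 - P))."""
    return math.exp(log_odds)

def probability_from_odds(n):
    """P = N / (1 + N)."""
    return n / (1.0 + n)

n = odds_from_log_odds(0.035752)  # log-odds of shot 1 in the tables
print(round(n, 6), round(probability_from_odds(n), 6))
```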

For the shots that did not result in a goal, I calculated the probability of failure, because, as mentioned before, $P$ in this research denotes the probability of success. For the shots that missed the goal I therefore calculated $1 - P(\text{success})$ to obtain the probability that those shots miss. In this way, the probability assigned to each shot corresponds to its actual result.7

The formula $P = \frac{N}{1 + N}$ is entered in Excel and applied to all shots that resulted in a goal. The formula $1 - \frac{N}{1 + N}$ is used for all shots that missed the goal.

Shot     Result   y    x    ln(P/(1−P))   P/(1−P)    P
1        1        88   9    0.035752      1.036399   0.508937
2        0        88   9    0.035752      1.036399   0.491063
3        0        85   2    0.03877       1.039531   0.490309
4        1        96   2    0.061584      1.06352    0.515391
5        1        98   9    0.056492      1.058118   0.514119
…        …        …    …    …             …          …
32694    0        73   31   -0.0244       0.975897   0.506099
32695    0        73   31   -0.0244       0.975897   0.506099
32696    0        94   1    0.058756      1.060516   0.485315
32697    0        92   5    0.049328      1.050565   0.48767
32698    0        93   1    0.056682      1.058319   0.485833

7
https://towardsdatascience.com/a-simple-interpretation-of-logistic-regression-coefficients-e3a40a62e8cf

Now we need to find the parameters that best explain the observed data. A common method for this is maximum likelihood estimation. The likelihood function is the product of the probabilities of all shots resulting in their actual, known outcome (1 or 0). But first we need an equation that gives the probability of a shot based on its actual result.8

$$f(z) = P^{z}(1 - P)^{1 - z}$$

$z$ = a specific outcome (1 or 0)

This expression, called the Bernoulli equation, is very convenient for calculating the probability of a shot based on its real result: if $z = 0$, then $P^{z}$ is equal to 1, whereas $(1 - P)^{1 - z}$ remains, so the probability of failure is calculated9. On the other hand, if $z = 1$, then $(1 - P)^{1 - z}$ is equal to 1 and the probability of success is calculated. This equation therefore gives the probability of a shot given its actual outcome: it does not calculate the probability of scoring a goal, but the probability of the shot resulting in the outcome that has already taken place in reality.
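A minimal sketch of this expression in Python is given below. The function name `bernoulli_probability` is an assumption for this illustration; the value 0.508937 is the probability of success listed in the tables for a shot at y = 88, x = 9.

```python
def bernoulli_probability(p_success, z):
    """f(z) = P^z * (1 - P)^(1 - z) for an outcome z of 1 (goal) or 0 (no goal)."""
    return (p_success ** z) * ((1.0 - p_success) ** (1 - z))

p = 0.508937  # P(success) for a shot at y = 88, x = 9, as listed in the tables
print(bernoulli_probability(p, 1))  # probability assigned to the outcome "goal"
print(bernoulli_probability(p, 0))  # probability assigned to the outcome "no goal"
```

With z = 1 the function returns P itself, and with z = 0 it returns 1 − P, which matches the two cases described above.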

If we take the product of the term $P^{z}(1 - P)^{1 - z}$ over all shots, we obtain the likelihood function, which indicates how plausible the model is in predicting the real results of the shots given all my data points. In that way I find out to what extent the model predicts the correct outcomes. A bigger product therefore represents a better model, as the probability of predicting the correct outcomes increases. I can take the product because I am assuming that all data points, in other words all shots, are independent of each other, so that the multiplication rule applies.

8
https://machinelearningmastery.com/logistic-regression-with-maximum-likelihood-estimation/
9
https://www.perfectlyrandom.org/2019/04/27/bernoulli-distribution-as-a-tiny-nn/

$$L(\text{likelihood}) = \prod_{i=1}^{n} P^{z_i}(1 - P)^{1 - z_i}$$

However, since the probabilities are numbers between 0 and 1, in other words fractions, the product of 32698 fractions is such a small number that Excel cannot calculate it. Therefore, the natural logs of the probabilities are taken, so that the sum of the natural logs of $P$ over all shots is calculated instead of the product of $P$.10
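The numerical problem is not specific to Excel; it appears in any program that uses double-precision numbers. The sketch below assumes, purely for illustration, that every shot has a probability of 0.5, which is roughly the size of the values in the tables above.

```python
import math

n_shots = 32698
per_shot_probability = 0.5  # hypothetical value, used only to show the effect

product = math.prod([per_shot_probability] * n_shots)
log_sum = sum(math.log(per_shot_probability) for _ in range(n_shots))

print(product)   # the product is too small to be represented
print(log_sum)   # the sum of the logs is an ordinary negative number
```

Because the logarithm is an increasing function, the coefficients that maximise the sum of the logs also maximise the product.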

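$$\ln(f(z)) = z_i \ln\left(\frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}}\right) + (1 - z_i)\ln\left(1 - \frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}}\right)$$

$$\ln(L) = \sum_{i=1}^{n} \left[ z_i \ln(P) + (1 - z_i)\ln(1 - P) \right]$$

$$P = \frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}}$$

If we substitute $P$ with $\frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}}$ we get:

$$\ln(L) = \sum_{i=1}^{n} \left[ z_i \ln\left(\frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}}\right) + (1 - z_i)\ln\left(1 - \frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}}\right) \right]$$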

Now the natural log of the Bernoulli equation

$$z_i \ln\left(\frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}}\right) + (1 - z_i)\ln\left(1 - \frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}}\right)$$

is evaluated for each data point by inserting the location of the shot for $x$ and $y$. Random values for the coefficients $b_0$, $b_1$ and $b_2$ are inserted into the expression so that the natural log of the Bernoulli equation is calculated for each shot. We then keep changing the coefficients $b_0$, $b_1$ and $b_2$, trying new values, until we find the coefficients that give the largest value of this expression for a specific shot. As there are a lot of data points (32698 shots), each data point has a different combination of values of the three unknown coefficients that gives its maximum possible value. Therefore, we need to determine the single set of coefficients that results in the greatest sum of the log-likelihood function

$$\sum_{i=1}^{n} \left[ z_i \ln\left(\frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}}\right) + (1 - z_i)\ln\left(1 - \frac{e^{b_0 + b_1 y + b_2 x}}{1 + e^{b_0 + b_1 y + b_2 x}}\right) \right]$$

over all data points. Because there are so many data points and so many possible combinations of values for the three unknown coefficients, the chance of finding the correct parameters by trial and error is very low, and the process would take a long time. Therefore, I used the Solver in Excel, which accelerates the process and finds the parameters that best explain the data and fit the model. The algorithm uses the same equations and keeps adjusting the values of the coefficients until it finds the parameters that give the largest possible sum of the log-likelihood over all data points.

10
https://towardsdatascience.com/calculating-maximum-likelihood-estimation-by-hand-step-by-step-3a740c637c20
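For readers without Excel, the same maximisation can be sketched in Python. Note the assumptions: this sketch uses Newton's method rather than the Excel Solver, and it runs on synthetic shots generated from made-up coefficients, not on the Wyscout data. The names `log_likelihood` and `fit_logistic` are chosen for this illustration.

```python
import numpy as np

def log_likelihood(b, X, z):
    """Sum over shots of z*ln(P) + (1 - z)*ln(1 - P), with P = e^t / (1 + e^t)."""
    t = X @ b
    # z*ln(P) + (1 - z)*ln(1 - P) simplifies to z*t - ln(1 + e^t)
    return float(np.sum(z * t - np.logaddexp(0.0, t)))

def fit_logistic(X, z, iterations=25):
    """Maximise the log-likelihood with Newton's method, starting from b = 0."""
    b = np.zeros(X.shape[1])
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-(X @ b)))
        gradient = X.T @ (z - p)
        hessian = -(X.T * (p * (1.0 - p))) @ X
        b = b - np.linalg.solve(hessian, gradient)
    return b

# Hypothetical synthetic shots generated from made-up coefficients (not the Wyscout data).
rng = np.random.default_rng(0)
n = 2000
y = rng.uniform(0.0, 100.0, n)
x = rng.uniform(0.0, 50.0, n)
X = np.column_stack([np.ones(n), y, x])
true_b = np.array([-4.0, 0.05, -0.03])
z = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_b)))).astype(float)

b_hat = fit_logistic(X, z)
print("estimated coefficients:", b_hat)
print("log-likelihood at estimate:", log_likelihood(b_hat, X, z))
print("log-likelihood at zeros:   ", log_likelihood(np.zeros(3), X, z))
```

The design choice here is to replace trial and error by a method that uses the slope of the log-likelihood; the quantity being maximised is the same sum as above.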

Largest sum possible: -1688.45

This largest possible sum was obtained with the following intercept and coefficients:


Coefficients

Intercept -12.1034

Y Variable 0.103273

X Variable -0.0585

The coefficient of the X variable is negative. Because the X variable represents the horizontal distance from the center of the field, a negative coefficient indicates that the probability of a goal decreases with increasing distance from the center of the football field. In contrast, the coefficient of the Y variable is positive: the closer the player is to the goal, the higher the probability of scoring a goal. Furthermore, the absolute value of the coefficient of the Y (vertical) variable is higher than that of the X (horizontal) variable. That means that, per percentage point, the Y factor has a larger effect on the probability than the X factor.

If we insert the coefficients and the intercept, only two unknown variables remain, namely $x$ and $y$, into which we can insert the coordinates of a shot to obtain the probability of a goal from this position.

$$P = \frac{e^{-12.1034 + 0.103273y - 0.0585x}}{1 + e^{-12.1034 + 0.103273y - 0.0585x}}$$
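The final equation translates directly into a small function. The sketch below uses the fitted intercept and coefficients from the text; the function name `goal_probability` is an assumption for this illustration.

```python
import math

B0, B1, B2 = -12.1034, 0.103273, -0.0585  # intercept, Y and X coefficients from the text

def goal_probability(x, y):
    """P(goal) for central distance x and vertical nearness y, both in percent."""
    t = B0 + B1 * y + B2 * x
    return math.exp(t) / (1.0 + math.exp(t))

print(goal_probability(x=1, y=87))
print(goal_probability(x=0, y=95))
```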

A 3D graph was created in Excel to visualize the trend of the equation and to determine how the probability is affected by changing the X or Y variable.


[Figure: Probability of scoring a goal at each point in the pitch. Axes: X coordinate (%), Y coordinate (%), probability]

The horizontal axis represents the vertical nearness to the goal. We can see how the probability increases as the shot gets closer to the goal. Shots from far away rarely result in a goal, whereas shots very close to the goal have the highest probability of going in. The depth axis represents the horizontal distance from the midpoint of the pitch. As the shot moves further from the midpoint, the probability of scoring a goal decreases. This is because, closer to the center, the angle between the two goalposts increases, so that the player has a wider view and a better opportunity to score. The graph shows only one side of the football pitch, as I assumed before that it makes no difference whether the player shoots from the left-hand side or the right-hand side.11 A reflection in the horizontal axis shows the other side of the field. The Y variable of the graph has a domain of $y \mid (0 \le y \le 100)$, because the length of the pitch cannot exceed one hundred percent; beyond that the ball is out. The Z variable on the vertical axis has a range of $z \mid 0 < z \le 0.145$. This is because no shot from inside the field has a probability of exactly zero. The maximum probability is reached at (0, 100) and equals about 14.5 percent, which is the value the equation gives for $x = 0$ and $y = 100$. The probability decreases as we move further from the opponent’s goal and becomes a very small fraction, but it never reaches zero.
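The shape of the surface can also be checked numerically. The sketch below evaluates the fitted equation on a coarse grid (steps of 10 percent, an arbitrary choice for this illustration) and reports where the probability is largest and how small it becomes.

```python
import math

def goal_probability(x, y):
    t = -12.1034 + 0.103273 * y - 0.0585 * x  # coefficients of the fitted equation
    return math.exp(t) / (1.0 + math.exp(t))

# Coarse grid over the half pitch: x from 0 to 50 and y from 0 to 100, in steps of 10 percent.
grid = {(x, y): goal_probability(x, y)
        for x in range(0, 51, 10)
        for y in range(0, 101, 10)}

best_position = max(grid, key=grid.get)
print("largest probability at (x, y) =", best_position, "value:", grid[best_position])
print("smallest probability on the grid:", min(grid.values()))
```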

11
https://support.minitab.com/en-us/minitab/19/help-and-how-to/graphs/3d-scatterplot/interpret-the-results/key-results/

To test the accuracy of the equation, it is used to calculate the expected goals of real matches and thereby estimate the performance and the result of each match. I selected two matches from the Premier League, because the equation was created from Premier League data. In a different league the probability of a goal might depend differently on the position of the shot, so the equation could give inaccurate or illogical results there. In addition, all selected games are from 2018, as I only had the data of that year available.

Application Examples

Chelsea vs. Manchester United 25.02.2018 (footnote 12)

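Table 3: The coordinates of each shot of each team and the probability of that shot resulting in a goal. Manchester United took 34 shots and Chelsea 50, so the rows with only three values list Chelsea shots.

Manchester United (first three columns), Chelsea (last three columns)

x coordinate   y coordinate   P   x coordinate   y coordinate   P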

1 87 0.040035 8 93 0.048938655

2 87 0.037846 6 95 0.066392935

3 89 0.043622 3 90 0.048138833

4 90 0.045528 4 91 0.050232404

8 94 0.053976 6 91 0.044935232

0 90 0.056849 5 88 0.035301982

8 90 0.036375 8 92 0.044349789

7 90 0.038482 6 89 0.036858893

0 95 0.091749 23 84 0.008376046

1 94 0.079129 19 79 0.006328566

0 92 0.068992 30 81 0.00409744

0 90 0.056849 23 81 0.00615824

12
https://www.manutd.com/en/matches/matchcenter/man-utd-vs-chelsea-match-919170

2 92 0.061845 14 89 0.023405442

13 73 0.004845 14 89 0.023405442

11 79 0.010067 14 89 0.023405442

13 73 0.004845 14 89 0.023405442

13 75 0.00595 14 89 0.023405442

11 88 0.025114 15 86 0.016311771

16 82 0.010241 15 86 0.016311771

19 90 0.019449 15 86 0.016311771

9 78 0.010205 15 86 0.016311771

1 91 0.059298 17 84 0.011856309

1 91 0.059298 20 81 0.007330988

1 91 0.059298 16 85 0.01390924

1 91 0.059298 23 83 0.007560408

0 93 0.075928 5 86 0.028904655

0 94 0.083498 4 90 0.045528034

0 93 0.075928 11 90 0.03069945

0 93 0.075928 4 90 0.045528034

0 93 0.075928 16 90 0.02309372

7 89 0.034837 16 90 0.02309372

13 88 0.022404 16 90 0.02309372

9 92 0.041935 18 91 0.022786052

6 94 0.060271 27 75 0.002631871

19 70 0.002507921

26 75 0.002789985


18 71 0.002946998

22 71 0.002333576

18 71 0.002946998

18 71 0.002946998

11 75 0.006683384

15 75 0.00529636

11 72 0.00491154

16 90 0.02309372

16 90 0.02309372

16 90 0.02309372

16 90 0.02309372

13 81 0.011000131

4 81 0.018482291

10 81 0.01308281

The probability of a goal was calculated for each shot using the equation:

$$P = \frac{e^{-12.1034 + 0.103273y - 0.0585x}}{1 + e^{-12.1034 + 0.103273y - 0.0585x}}$$

I add up the probabilities of all shots from each team separately13:

$$\text{Expected Goals (xG)} = P(\text{shot } 1) + P(\text{shot } 2) + \dots + P(\text{shot } n)$$

1. Chelsea:

$$\text{Expected Goals (xG)} = P(\text{shot } 1) + P(\text{shot } 2) + \dots + P(\text{shot } 50)$$

$$\sum_{i=1}^{50} P_i = 1.04$$

13
https://fbref.com/en/expected-goals-model-explained/

2. Manchester United:

$$\text{Expected Goals (xG)} = P(\text{shot } 1) + P(\text{shot } 2) + \dots + P(\text{shot } 34)$$

$$\sum_{i=1}^{34} P_i = 1.59$$

Expected Result

Manchester United 1.59 – 1.04 Chelsea

Actual Result

Manchester United 2 – 1 Chelsea

According to my equation, Manchester United should theoretically win and Chelsea lose, which is consistent with the actual result. So, my equation shows logical and trustworthy results for this game, which did not contain an extreme number of goals.

Leicester City vs. Manchester City 10.02.2018

I have used the same approach as the previous example to calculate the expected number of goals for each

team during the match using the equation. The data for this application is given in the appendix.

$$P(\text{success}) = \frac{e^{-12.1034 + 0.103273y - 0.0585x}}{1 + e^{-12.1034 + 0.103273y - 0.0585x}}$$

I add up the probabilities of all shots from each team separately:

$$\text{Expected Goals (xG)} = P(\text{shot } 1) + P(\text{shot } 2) + \dots + P(\text{shot } n)$$

1. Leicester City:

$$\text{Expected Goals (xG)} = P(\text{shot } 1) + P(\text{shot } 2) + \dots + P(\text{shot } 9)$$

$$\sum_{i=1}^{9} P_i = 0.215$$

2. Manchester City:

$$\text{Expected Goals (xG)} = P(\text{shot } 1) + P(\text{shot } 2) + \dots + P(\text{shot } 75)$$

$$\sum_{i=1}^{75} P_i = 2.55$$

Expected Result

Leicester City 0.215 – 2.55 Manchester City

Actual Result

Leicester City 1 – 5 Manchester City

Although Manchester City’s xG is far below the number of goals actually scored, because most shots have very small probabilities, both the xG values and the actual result show a clear advantage for Manchester City. My equation thus confirms the actual result: Manchester City played much better and had a higher probability of scoring than Leicester City.
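The xG totals can be reproduced by summing the shot probabilities in a short program. The sketch below uses the fitted equation and Leicester City's nine shots as listed in Table 2 of the appendix; the function names are assumptions for this illustration.

```python
import math

def goal_probability(x, y):
    t = -12.1034 + 0.103273 * y - 0.0585 * x  # fitted coefficients from the text
    return math.exp(t) / (1.0 + math.exp(t))

def expected_goals(shots):
    """Sum of the goal probabilities of a list of (x, y) shot positions."""
    return sum(goal_probability(x, y) for x, y in shots)

# Leicester City's nine shots (x, y) as listed in Table 2 of the appendix
leicester_shots = [(11, 88), (9, 87), (11, 87), (7, 91), (12, 86),
                   (13, 90), (17, 84), (15, 86), (13, 89)]
print(expected_goals(leicester_shots))
```

The same function can be applied to any other list of shot positions, for example the 75 Manchester City shots in the appendix.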

Conclusion

Through this research I was able to find an equation that relates the probability of a goal to the location of the shot. The equation estimates the probability of scoring a goal from any location on the football pitch and gave trustworthy results for a normal match that did not contain an extreme number of goals. However, the equation is not very accurate for matches with an extreme number of goals, because most of the shots have a very small probability, so that the resulting expected goals are usually lower than the actual result. Nevertheless, my expected goals model was able to identify which team performed better by considering all its shots during the match. Therefore, the aim of the research was achieved, as I was able to work out an equation that can help me and a coach to estimate the performance of the players in a specific match. This investigation also shows the direct impact of using theories and approaches learned in one field of mathematics to interpret situations and make decisions in various fields of life, which indicates that all areas of knowledge are connected to each other. The integration of mathematics and sport science created a link between statistics from mathematics and tactics from sport, which enabled a broader overview of the football pitch.

Evaluation

Limitations

The data is rounded to two significant figures: The $x$ and $y$ coordinates of the shots are given to two significant figures and further decimals are neglected. Therefore, the coefficients resulting from the logistic regression are not absolutely accurate. Increasing the sample size would reduce the inaccuracy.

No variety of seasons or leagues is available: Only one season of the Premier League is considered in this research, so the equation might only fit a specific season or league. Combining data from other seasons and leagues with the current season would make the equation more general for matches worldwide.

Unable to present the full data in the research: The data consists of more than 30000 shots, so there is no space to show them in the paper. As a result, some steps might be unclear or hard to follow. The data should be presented in a way that enables the reader to understand and follow the calculations involving it.

Only old data is available: The data I have belongs to the Premier League season 2017/2018. The relationship between the probability and the position of the shot might have changed over time. More recent data that fits current matches should be found.

Assumptions

1. It doesn’t make any difference if the player is shooting the ball from the left side or right side of the

pitch

2. The probability of scoring a goal depends only on angle and distance

3. All players have equal probability of scoring a goal

4. All shots are independent from each other



Strength

I considered a huge number of shots, which might increase the accuracy of the equation and the results. Besides, I used an algorithm in many steps to prevent errors that might appear when calculating a specific step manually.

Extensions

• Use additional data from other leagues

• Consider the player that is shooting

• Consider the goalkeeper that is saving the shot (expected savings)

• Expected assists

• Consider the number and position of defenders that might block the shot


Bibliography

https://mathworld.wolfram.com/LogisticEquation.html

https://xaktly.com/LogisticFunctions.html

https://towardsdatascience.com/a-simple-interpretation-of-logistic-regression-coefficients-e3a40a62e8cf

https://machinelearningmastery.com/logistic-regression-with-maximum-likelihood-estimation/

https://www.perfectlyrandom.org/2019/04/27/bernoulli-distribution-as-a-tiny-nn/

https://towardsdatascience.com/calculating-maximum-likelihood-estimation-by-hand-step-by-step-3a740c637c20

https://lbelzile.github.io/timeseRies/manual-maximum-likelihood-estimation.htm

https://support.minitab.com/en-us/minitab/19/help-and-how-to/graphs/3d-scatterplot/interpret-the-results/key-results/

https://www.manutd.com/en/matches/matchcenter/man-utd-vs-chelsea-match-919170

https://fbref.com/en/expected-goals-model-explained/


Appendix

Table 1

Shot     Result   X    Y    ln(P/(1−P))   P/(1−P)    P          ln P
1        1        88   9    0.035752      1.036399   0.508937   -0.29334
2        0        88   9    0.035752      1.036399   0.491063   -0.30886
3        0        85   2    0.03877       1.039531   0.490309   -0.30953
4        1        96   2    0.061584      1.06352    0.515391   -0.28786
5        1        98   9    0.056492      1.058118   0.514119   -0.28894
…        …        …    …    …             …          …          …
32690    0        2    85   0.056492      1.058118   0.485881   -0.31347
32691    0        2    85   0.056492      1.058118   0.485881   -0.31347
32692    0        2    85   0.033114      1.033668   0.491722   -0.30828
32693    0        2    85   -0.0244       0.975897   0.506099   -0.29576
32694    0        2    96   -0.0244       0.975897   0.506099   -0.29576
32695    0        2    96   -0.0244       0.975897   0.506099   -0.29576
32696    0        2    96   0.058756      1.060516   0.485315   -0.31398
32697    0        2    96   0.049328      1.050565   0.48767    -0.31187
32698    0        2    96   0.056682      1.058319   0.485833   -0.31351

Table 2

Manchester City Leicester City

x y P x y P

7 96 0.069223 11 88 0.025114467

17 97 0.07618 9 87 0.025452755

5 96 0.077152 11 87 0.022706207

7 96 0.069223 7 91 0.042490424

6 96 0.073088 12 86 0.019380407

15 80 0.008845 13 90 0.027402662

0 80 0.021009 17 84 0.011856309

14 90 0.025886 15 86 0.016311771

15 80 0.008845 13 89 0.024780561

3 82 0.021657

2 90 0.050891


3 83 0.023957

1 80 0.019839

12 93 0.039128

9 91 0.037977

11 91 0.033926

17 91 0.024126

13 96 0.049751

5 96 0.077152

13 96 0.049751

11 96 0.055583

19 88 0.015877

15 94 0.036501

19 88 0.015877

13 88 0.022404

4 96 0.081421

4 96 0.081421

1 96 0.095549

2 96 0.090611

4 96 0.081421

3 83 0.023957

7 83 0.019054

1 83 0.026851

7 83 0.019054

16 90 0.023094


1 90 0.053792

6 90 0.040706

15 90 0.024451

16 90 0.023094

5 86 0.028905

5 86 0.028905

5 86 0.028905

11 88 0.025114

11 88 0.025114

8 88 0.029789

11 88 0.025114

22 95 0.027134

22 95 0.027134

19 95 0.032172

22 95 0.027134

23 84 0.008376

14 84 0.014099

23 84 0.008376

23 84 0.008376

15 87 0.018054

11 87 0.022706

3 87 0.035772

22 91 0.018118

9 91 0.037977


17 91 0.024126

0 91 0.062646

14 88 0.021158

11 88 0.025114

14 93 0.034958

7 88 0.031527

8 88 0.029789

14 88 0.021158

3 72 0.00782

11 80 0.01115

11 72 0.004912

6 89 0.036859

7 77 0.010345

7 90 0.038482

7 77 0.010345

2 77 0.013811
