
Data Analytics in the Sports World: Assessing

2017/2018 English Premier League Data through


Machine Learning and Passing Networks
Fabrizio Lucero Fernández1,*
1 GRI Sports Analytics Researcher, MUN Engineering, St John’s, A1B 3X6, Canada
* fabriziolufe@unisabana.edu.co

ABSTRACT

This investigation analyzes the impact of passing networks as a driver that can inform machine learning models. The study first wrangles and performs feature engineering on football event data from Premier League matches of the 2017/2018 season. Using the variables given in the event log data, we portray the randomness of the large number of events that occur during a football match. Given this problem, the study develops a Python tool for filtering and deriving data for different match situations and for building graph matrices that represent the passes within a specific team. It also finds the importance of average node connectivity, which, when compared with a team's opposition in every match, presents a correlation of 0.85 with goals scored. This work sets a basis for further investigation of the use of graphs to study the passing behavior of football teams.

Introduction
The following report explains the outcome of the research process titled "Data Analytics in the Sports World: Assessing 2017/2018 English Premier League Data through Machine Learning and Passing Networks".
For this investigation we wanted to see how data analysis has developed in the football world, and how, through freely accessible data, machine learning and more advanced features could build a rational reasoning behind the analysis of a match outcome.
To start the assignment, the first step involved selecting a database containing sufficiently detailed events of the actions that happened during a match. The most important requirement at this point is to have a large enough sample of matches, if possible between the same group of teams, such as the league games of one country. For this, we considered both the StatsBomb and Wyscout commercial data sets as possible options for our project. Even though StatsBomb, a football analysis giant, had free data, the available data sets, such as Lionel Messi's full matches for Barcelona, did not provide a similar number of games for the rest of the clubs in La Liga, so we had to think otherwise. In the end, Wyscout provided fully detailed events of the 2017/2018 season for Europe's top five league competitions, with nearly a million rows of data for every single league. This was the most suitable data set for analysis that we encountered, thanks to Luca Pappalardo et al. and the article "A public data set of spatio-temporal match events in soccer competitions"1.
This project's natural goal was to provide a set of events understandable enough to allow the result of a specific match to be predicted as a binary classification task using machine learning techniques, where 1 denotes a victory and 0 indicates a loss. As is typical in data analysis, the raw data was not ready for a machine learning modeling process, which is why data pre-processing was considered in the project's early stage. Following that, we needed to establish what level of detail we desired for the modeling. Football event log analyses can range from straightforward, such as grouping incidents into generic categories like passes, shots, fouls and free kicks, to intricate, such as classifying each incident separately.
Finally, we created a different way to measure pass importance by employing graph networks. In relation to the pass structure, we examined the impact of a team's network and its effectiveness as a unit. The data was filtered using a Python script to support any required team or game analyses. The average node connectivity (a measure of how strongly, on average, each pair of players is connected within a team's passing graph) was found to be correlated with the final position and goals scored over a season. Given that correlation does not necessarily imply causality, further investigation into particular matches could be conducted, and the data could be modeled using a combination of ML methods and graph network analysis.
Methods
Every function used during this stage is explained in detail at the following link: Functions - Fabriziolufe

Pre-processing the data


In this first stage we converted the data from a nested JSON format, available at this link:
https://figshare.com/collections/Soccer_match_event_dataset/4415000
The data was also merged to create a base table suitable for starting the feature pre-processing. More detail about the code behind this and the specific functions used can be found in the following .ipynb file: Data Exploration
The feature pre-processing started with the creation of three levels of data depth. In the first one we only included events described by their event name: Pass, Shot, Foul, Duel, Free Kick. The second level also included a sub-event name, so each event was accompanied by its detail, such as Pass - Simple Pass, Shot - Penalty, or Duel - Air Duel1. Finally, we added a third level that included the outcome of every action, labeled accurate or inaccurate. (Here we started to have problems with the time window of the data, as the outcome of every action could introduce information that would not be reproducible or known when analyzing a match in progress.)
Link to the jupyter notebook for feature pre-processing: Pre-Processing
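As an illustration of this flattening and level construction, the following sketch (not the exact notebook code) loads one league's nested event file with pandas. The field names eventName, subEventName and tags follow the Wyscout format, while the file name and the accuracy tag id are assumptions.

```python
import json
import pandas as pd

# Load one league's nested event file and flatten it into a table, then derive
# the three feature-depth levels described above.
with open("events_England.json") as f:          # assumed file name
    events = pd.json_normalize(json.load(f))

ACCURATE_TAG = 1801  # assumption: Wyscout tag id marking an accurate event

events["level1"] = events["eventName"]                                    # e.g. Pass, Shot
events["level2"] = events["eventName"] + " - " + events["subEventName"]   # e.g. Pass - Simple pass
events["accurate"] = events["tags"].apply(
    lambda tags: any(t.get("id") == ACCURATE_TAG for t in tags))
events["level3"] = events["level2"] + events["accurate"].map(
    {True: " - accurate", False: " - inaccurate"})
```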

Model Creation
Starting with the first model for the three cases, we created comparative statistics between the team analyzed and its opposition. For example, Arsenal F.C. v Leicester City yields one row of data with Arsenal as the team analyzed and Leicester as the opposition, while the immediately following row has Leicester City as the main team and Arsenal as the opposition (a change of roles in the model).
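A minimal sketch of this role swap is shown below; it assumes a hypothetical per-match summary table whose feature columns carry home_ and away_ prefixes, which is not necessarily the exact layout used in the notebooks.

```python
import pandas as pd

def mirror_rows(matches: pd.DataFrame) -> pd.DataFrame:
    """One row per team per match: 'team_*' features paired with 'opp_*' features.
    Label is 1 for a win and 0 otherwise (draws count as non-wins in this sketch)."""
    as_home = matches.rename(columns=lambda c: c.replace("home_", "team_")
                                                .replace("away_", "opp_"))
    as_home["label"] = (matches["home_goals"] > matches["away_goals"]).astype(int)
    as_away = matches.rename(columns=lambda c: c.replace("away_", "team_")
                                                .replace("home_", "opp_"))
    as_away["label"] = (matches["away_goals"] > matches["home_goals"]).astype(int)
    return pd.concat([as_home, as_away], ignore_index=True)
```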
With the help of the GridSearchCV function, we tuned the different models that could be useful for the classification task. By iterating over the hyperparameters, we searched through different types of ML models (Logistic Regression, Decision Trees, Support Vector Machines, Random Forest Classifier, XGBoost, etc.) and recorded the F1-score and accuracy of each. In order to have a random sample, we also varied the match weeks analyzed by each model at every iteration and tried to determine the results of the next 3 to 5 matches for every team. The evaluated test window is therefore moving: if the final match week analyzed is, for example, 30, the test set consists of the matches of the three following match weeks, i.e. match weeks 31, 32 and 33.
Inside the functions used, we first normalize the features and create a pipeline. The code also splits the data set according to the desired "Origin" and "Border" margins of analysis, which in this case are the indexes of the data frame that the model will capture. We then try the model on a randomly generated match week, after which the code takes a selection of classifier models and tests each of them, through iterations, with the F1-score and accuracy measures. Our built-in function iterates by adjusting the number of features and evaluating the chosen model on each feature set because, even though these functions produce a heuristic answer, we can still tune the model by studying the feature importance.
The best classifiers after iterating over 100 samples were the XGB Classifier and Logistic Regression.
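The following sketch illustrates, in simplified form, the kind of pipeline and search described above; the column names game_week and label, the feature list and the hyperparameter grids are placeholders rather than the notebooks' actual configuration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

def evaluate_window(df, feature_cols, final_week, horizon=3):
    """Train on match weeks up to final_week, test on the next `horizon` weeks."""
    train = df[df["game_week"] <= final_week]
    test = df[(df["game_week"] > final_week) & (df["game_week"] <= final_week + horizon)]
    candidates = {
        "log_reg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1, 10]}),
        "xgb": (XGBClassifier(eval_metric="logloss"), {"clf__max_depth": [3, 5]}),
    }
    results = {}
    for name, (clf, grid) in candidates.items():
        # Normalize the features and tune hyperparameters inside one pipeline.
        pipe = Pipeline([("scaler", StandardScaler()), ("clf", clf)])
        search = GridSearchCV(pipe, grid, scoring="f1", cv=5)
        search.fit(train[feature_cols], train["label"])
        preds = search.predict(test[feature_cols])
        results[name] = {"accuracy": accuracy_score(test["label"], preds),
                         "f1": f1_score(test["label"], preds)}
    return results
```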
Link to Jupyter Notebooks:
Model Selection at Level 1
Model Selection at Level 2
Model Selection at Level 3

Results
Machine Learning Models

Level and Model      Accuracy (%)   F1 Score (%)
Level 1 - XGB            72.4            72.0
Level 1 - Log-Reg        65.6            58.8
Level 2 - XGB            69.4            67.3
Level 2 - Log-Reg        72.6            72.3
Level 3 - SGD            66.8            66.7
Level 3 - Log-Reg        69.1            65.3

Table 1. Accuracy and F1 score for each model and feature level.

At the first level, during the first run with XGBoost, the model reached more than 70 percent accuracy, but it gave great importance to the save attempts made by the goalkeepers of each side, as we can see in Figure 1.

Figure 1. 1st Level Feature Importance with XGB.

This is not a great feature, as the data had bias in the event recording of save attempts (they were usually logged only when a goal or a decisive shot happened). What is more, the model gave importance to more than 15 specific features, which raises the question of how well it would transfer to other data (it will be interesting to see its performance on events from other leagues). Results from the second run, using Logistic Regression, were more encouraging; they demonstrate that the most practical method is often what is required, rather than the most sophisticated one. Although it had an accuracy of over 88 percent in predicting when a team would lose, it correctly classified only 31.3 percent of the situations in which a team won, dropping the overall accuracy to 65 percent (room for improvement). The confusion matrix in Figure 2 illustrates this difficulty in predicting when a team wins. The model also recognizes that a team's number of free kicks, whether accurate or imprecise, may have a detrimental impact on its odds of winning; it is impossible to distinguish between the two at this level.

Figure 2. 1st Level Confusion Matrix with Logistic Regression.
The second level, even though more detailed, showed an accuracy of around 68 percent for the XGBoost model; the number of features remained vast, and free kick crosses had the most impactful feature coefficient in the model result, as shown in Figure 3. Whether that importance is positive or negative is inconclusive because of the black-box nature of the gradient boosting classifier. On the other hand, Logistic Regression resulted in an accuracy of 72.5 percent, and in this case it more clearly assigned a negative impact to the number of crosses a team makes, which can contribute to a team losing, as shown in Figure 4.
There is in fact some research on the subject, indicating how, without a clear strategy and team flow, teams tend to attempt an exaggerated number of crosses, resulting in an increase in lost possession.2 In this case, by having more depth in the features, the classifier managed to raise its accuracy in identifying when a team won to 46.9 percent. The other case (when a team loses, label 0) reaches 79.4 percent accuracy, with the most important feature being the number of crosses made during the game.
At the third level (now also using the SGD Classifier as an alternative), the one with the most detailed feature description, we get results similar to the first two runs. The third run, with the SGD Classifier, gives a different set of results in which accuracy for the 1-label (predicting a win) reaches 81.3 percent, with about 50 percent accuracy for the 0-label.

Figure 3. 2nd Level Feature importance with XGB.

Figure 4. 2nd Level Feature importance with Logistic Regression.

Given the quantity of features, a useful application beyond this data set seems difficult, although passes appear to be the key driver at this level of detail; see Figure 5 for a better understanding. Nonetheless, the feature coefficients change significantly from one iteration to the next, as if the model's accuracy were limited by the randomness of the data across match weeks.

Discussion regarding Pass Networks


As noted above, a useful application beyond this data set seems difficult given the quantity of features, and the feature coefficients change significantly from one iteration to the next, as if the model's accuracy were limited by the randomness of the data across match weeks. As an alternative, using graph theory and game connectivity, pass graph networks are suggested in this research as a characteristic that can help us better forecast a match's outcome. What is more, when we randomized the prediction sample, different match weeks produced different results; specific match weeks contained matches whose results did not follow the model's expected outcome. This created difficulty in the final analysis, as the machine learning model was missing a valuable feature that could regularly explain some portion of a match's outcome. Football being a sport, we cannot leave aside the randomness involved in it. What we wanted at this point was to find a feature stable enough to act as a clear indicator of a team's tendency to win, and to set an initial starting point for a useful tool to inspect and analyze match factors with data analytics.
The average connectivity is computed from the maximum flow between each pair of nodes. According to Beineke et al.,3 it is the average, over all pairs of vertices, of the maximum number of internally disjoint paths connecting those vertices. This can act as an indicator of how important every node, or in football terms every player, is to the complete graph (or team connection). This means that if the average connectivity is high, the team involves more of its players during a game; if the average connectivity tends toward less than 5, only five or fewer players are mathematically much more involved than the remaining six. A small code sketch of how this metric can be computed is given after Figure 5 below.
The jupyter notebook can be found at this link: Passing Networks

Figure 5. 3rd Level Feature importance with SGD.
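As an illustration, the sketch below builds a directed passing graph with networkx and calls its average_node_connectivity function. The column names player_id and next_player_id for the pass sender and receiver are assumptions; the linked notebook implements the actual pipeline.

```python
import networkx as nx
import pandas as pd

def average_pass_connectivity(passes: pd.DataFrame) -> float:
    """Average node connectivity of one team's accurate passes in one match."""
    G = nx.DiGraph()
    for _, row in passes.iterrows():
        sender, receiver = row["player_id"], row["next_player_id"]
        if G.has_edge(sender, receiver):
            G[sender][receiver]["weight"] += 1   # pass counts, kept for plotting
        else:
            G.add_edge(sender, receiver, weight=1)
    # Average, over all ordered pairs of players, of the maximum number of
    # internally disjoint paths connecting them (edge weights are not used here).
    return nx.average_node_connectivity(G)
```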

The first step, in order to avoid any problems regarding collinearity, was to handle only the pass events (which have no direct relation to shot or save attempts) and to segment the data into accurate and inaccurate passes.
At this stage we wanted to create a tool that could filter the data set by different parameters. We first created a Home and Away indicator, which allows analysis of a team playing at its home venue. Then, one immense problem we encountered during the ML stage was analyzing a match as a single row: 90 minutes of football, with all the situations that happen during that time, treated as a single entity. We now believe it is more valuable to divide the match by the different occurrences within it, such as changes in the score, and to give the user the possibility of choosing the match period and the game weeks of analysis. Rather than only seeing how a certain team acted over the whole season, we can segment the data to see how a team acted when it was winning or losing, in the first or second half, at a home or away venue, over n game weeks. As a bonus, since it is very difficult to imagine how these networks function without visualizing them, we created a visualization, shown in Figure 6, as a built-in function that plots on a football pitch the passing relationships for either a specific match or a team's performance over several matches.

Figure 6. Built-in Tool for visualizing Pass Networks.
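A minimal sketch of such a filter is given below, assuming a flat pass-event table with hypothetical columns team, venue, match_period, score_state and game_week; the real tool and its plotting function live in the linked notebook.

```python
from typing import Optional

import pandas as pd

def filter_passes(passes: pd.DataFrame,
                  team: str,
                  venue: Optional[str] = None,        # "Home" or "Away"
                  period: Optional[str] = None,       # "1H" or "2H"
                  score_state: Optional[str] = None,  # "winning", "drawing", "losing"
                  game_weeks: Optional[range] = None) -> pd.DataFrame:
    """Return the pass events of one team under the chosen match situation."""
    mask = passes["team"] == team
    if venue is not None:
        mask &= passes["venue"] == venue
    if period is not None:
        mask &= passes["match_period"] == period
    if score_state is not None:
        mask &= passes["score_state"] == score_state
    if game_weeks is not None:
        mask &= passes["game_week"].isin(game_weeks)
    return passes[mask]

# Example: a team's passes while losing, in the second half, away, weeks 1-19.
# subset = filter_passes(df, "Arsenal", "Away", "2H", "losing", range(1, 20))
```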


For every team and game in the 2017/2018 Premier League, the data on pass senders and receivers has now been stored in these graphs. By developing the average node connectivity feature, we can determine whether there is any correlation between this connectivity and the success of a match. For the sake of introducing the topic, we developed the feature for every team, in every game they played throughout the 38 game weeks, and examined whether having a higher average connectivity than the opposition helps predict whether a team will win or lose the match. The average team connectivity for the season and the sum of the average node connectivity differences are summarized in Figure 7 (a positive value indicates that the team's connectedness was higher than its rival's); a sketch of this computation follows the figure.

Figure 7. Head of Statistics including Average Node connectivity difference.
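A sketch of how this comparative feature could be derived is shown below; it assumes a per-match table with hypothetical columns match_id, team and avg_node_connectivity (one row per team per match), in the spirit of the role swap used for the ML models.

```python
import pandas as pd

def add_connectivity_difference(match_stats: pd.DataFrame) -> pd.DataFrame:
    """Add the average node connectivity difference (ANCD): each team's value
    minus its opponent's value in the same match."""
    totals = match_stats.groupby("match_id")["avg_node_connectivity"].transform("sum")
    out = match_stats.copy()
    # With exactly two rows per match, the opponent's value is the match total
    # minus the team's own value, so own - opponent = 2 * own - total.
    out["ancd"] = 2 * out["avg_node_connectivity"] - totals
    return out

# Season summary per team: summed ANCD and mean connectivity over 38 game weeks.
# season = (add_connectivity_difference(df)
#           .groupby("team")
#           .agg(ancd_sum=("ancd", "sum"),
#                avg_connectivity=("avg_node_connectivity", "mean")))
```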

If we now compare the average node connectivity difference with goals scored (presuming there may be a relationship, in either direction, between the stability of a team's passing behavior and the goals it creates), we can cluster the data using the elbow method. As Humaira states,4 as k increases, the average distortion decreases, each cluster has fewer constituent instances, and the instances are closer to their respective centroids; however, the improvement in average distortion declines as k increases. The value of k at which the improvement in distortion declines the most is called the elbow, and at that point we should stop dividing the data into further clusters.
The number of clusters into which these values should be partitioned is 3; see Figure 8, which is followed by a small code sketch of this procedure.

Figure 8. Elbow Method for deciding number of clusters.
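A minimal sketch of the elbow computation with scikit-learn is shown below, assuming a two-column array of goals scored and summed ANCD per team (variable names hypothetical).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def elbow_distortions(features, max_k=10):
    """Average within-cluster distortion (inertia / n) for k = 1..max_k."""
    X = StandardScaler().fit_transform(features)
    distortions = []
    for k in range(1, max_k + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        distortions.append(km.inertia_ / len(X))
    return distortions

# teams_xy = np.column_stack([goals_scored, ancd_sum])   # shape (20, 2)
# Plot elbow_distortions(teams_xy) against k and look for the bend; here k = 3.
```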

Plotting each team as part of its own cluster and comparing goals scored with the connectivity difference, we can clearly observe three groups. Let us remember that goals are not 100 percent related to the final position in the table, which is why Newcastle United (44 points in the actual competition, mid-table) appears to outperform West Ham United (42 points, lower in the table) in the clustering division. Despite that, it is striking that the six teams with the overall greatest average node connectivity are exactly the Premier League's Big Six, with a clear gap up to the record-breaking title holder, Manchester City (Figure 9).

Conclusion and further analysis


To wrap up this study on the examination of English Premier League data using machine learning and passing networks, we must mention the intriguing finding brought to our attention by the use of graph network techniques. Goals and the average node connectivity difference (ANCD) have an R-squared of 0.85 (Figure 9), indicating a strong association between these two factors that merits further investigation in both directions.

Figure 9. Regression Analysis with Premier League Teams.

The data set presented at the beginning also includes information from the other European leagues. A fascinating direction would be to determine whether these relationships hold true for those leagues as well and, if so, to attempt to go deeper into each match condition and its associated connectivity. Who knows: as a team sport, a team is as strong as its weakest link, and in this case it shows how important it might be for every member of a team's structure to participate in the game.

References
1. Pappalardo, L. & Massucco, E. Soccer match event dataset, DOI: 10.6084/m9.figshare.c.4415000.v5 (2019).
2. Vecer, J. Crossing in soccer has a strong negative impact on scoring: Evidence from the english premier league. SSRN
Electron. J. DOI: 10.2139/ssrn.2225728 (2013).
3. Beineke, L. W., Oellermann, O. R. & Pippert, R. E. The average connectivity of a graph. Discret. Math. 252, 31–45, DOI:
https://doi.org/10.1016/S0012-365X(01)00180-7 (2002).
4. Humaira, H. & Rasyidah, R. Determining the appropiate cluster number using elbow method for k-means algorithm. DOI:
10.4108/eai.24-1-2018.2292388 (2020).

Acknowledgements
Even though this is a draft, I have to thank Amílcar Soares for his unwavering help in this process. Thank you for letting me do and learn from what I love the most: football.

Additional information
GitHub source for Jupyter notebooks: Fabrizio Lufe

