ABSTRACT
This investigation analyzes the impact of passing networks as a clear driver that could influence machine learning
models. The study first wrangles and performs feature engineering on football event data from Premier League matches of the
2017/2018 season. Using the variables given in the football event log data, we portray the randomness across the great
number of events that occur during a football match. Given this problem, the study creates a Python tool for filtering the data
by different match situations and building graph matrices that represent the passes within a specific team. It also finds the
importance of average node connectivity, which, when compared with a team's opposition in every match, presents a correlation
of 0.85 with goals scored. This work sets a base for further investigation into the passing behavior of football teams using graphs.
Introduction
The following report explains the outcome of the research process titled "Data Analytics in the Sports
World: Assessing 2017/2018 English Premier League Data through Machine Learning and Passing Networks".
For this investigation we wanted to see how data analysis has developed in the football world, and how, through
freely accessible data, machine learning, and more advanced features, we could build rational reasoning behind a
match outcome analysis.
To start the assignment, the first step was selecting a database containing specifically detailed
events of the actions that happened during a match. The most important requirement at this stage is a sufficiently large sample of
matches, ideally between the same group of teams, such as the league games of one country. We considered both the StatsBomb
and Wyscout commercial data sets as options for our project. Even though StatsBomb, a football analysis giant, offers free
data, the available sets, such as Lionel Messi's full matches for Barcelona, did not provide a similar amount of
games for the rest of the clubs in La Liga, so we had to look elsewhere. In the end, Wyscout provided fully detailed events of
the 2017/2018 season for Europe's top five league competitions, with nearly a million rows of data for every single league. This was
the most suitable data set for analysis that we encountered, with a big thank you to Luca Pappalardo et al. for the article "A
public data set of spatio-temporal match events in soccer competitions"1 .
This project's natural goal was to provide a set of events understandable enough to allow a first prediction of the
result of a specific match as a binary classification using machine learning techniques, where 1 denotes a victory and
0 indicates a loss. As is typical in data analysis, the raw data was not ready for a machine learning modeling process,
which is why pre-processing was considered in the project's early stage. Following that, we needed to establish
what level of detail we desired for the modeling. Football event log analyses can range from being straightforward,
such as grouping incidents into generic categories like passes, shots, fouls, and free kicks, to being intricate, such as classifying
each incident separately.
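As an illustration of the two extremes, the following minimal sketch groups events at both a generic and a detailed level. The column names (`eventName`, `subEventName`) follow the public Wyscout schema described by Pappalardo et al., but the rows here are placeholders, not our actual data:

```python
import pandas as pd

# Toy event log with Wyscout-style columns (placeholder rows).
events = pd.DataFrame({
    "eventName": ["Pass", "Pass", "Shot", "Free Kick"],
    "subEventName": ["Simple pass", "Cross", "Shot", "Free kick cross"],
})

# Straightforward level: count generic categories (passes, shots, free kicks, ...).
level1 = events["eventName"].value_counts()

# Intricate level: count each (event, sub-event) combination separately.
level2 = events.groupby(["eventName", "subEventName"]).size()
```

The deeper levels of our analysis follow this second pattern, progressively splitting categories into their sub-events.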
Finally, we created a different way to measure pass importance by employing graph networks. In relation to the pass
structure, we examined the impact of a team's network and its effectiveness as a unit. The data was filtered using a Python script
to provide any necessary team or game analyses. The average node connectivity (a weight assigned to the average amount of
passes each player generated within a team's total passes) was found to be correlated with the final position and goals scored
over a season. Given that correlation does not necessarily imply causality, further investigation into particular matches could be
conducted, and the data could be modeled using a combination of ML methods and graph network analysis.
Methods
Every function used during this stage is explained in detail in the following link: Functions - Fabriziolufe
Model Creation
Starting with the first model for the three cases, we created comparative statistics between the team analyzed and the opposition.
For example, Arsenal F.C. v Leicester City will have one row of data with Arsenal as the team analyzed and Leicester as the
opposition, while the immediately following row will have Leicester City as the main team and Arsenal as the opposition (a
change of roles in the model).
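A minimal sketch of this "change of roles" row construction, with hypothetical column names and invented match statistics standing in for our real feature set:

```python
import pandas as pd

# Placeholder match record; the real pipeline derives many more statistics.
match = {"home": "Arsenal", "away": "Leicester",
         "home_passes": 520, "away_passes": 430,
         "home_goals": 4, "away_goals": 3}

# Each match yields two rows: the same statistics seen from both sides.
rows = [
    {"team": match["home"], "opposition": match["away"],
     "team_passes": match["home_passes"], "opp_passes": match["away_passes"],
     "win": int(match["home_goals"] > match["away_goals"])},
    {"team": match["away"], "opposition": match["home"],
     "team_passes": match["away_passes"], "opp_passes": match["home_passes"],
     "win": int(match["away_goals"] > match["home_goals"])},
]
df = pd.DataFrame(rows)
```

Doubling the rows in this way keeps every feature symmetric between the two teams, so the model never learns a spurious home/away column ordering.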
With the help of the GridSearchCV function, we tuned the different models that could be useful for the classification task.
By iterating over the hyperparameters, we searched through different types of ML models (Logistic Regression, Decision Trees,
Support Vector Machines, Random Forest Classifier, XGBoost, etc.) and recorded the F1-score and accuracy for each. In
order to have a random sample, we also varied the match weeks analyzed by each model at every iteration and tried to
predict the results of the next 3 to 5 matches for every team. The evaluated test window therefore moves: if the final
match week analyzed is, for example, 30, the test set consists of the matches from the three following match weeks, i.e. MW 31, 32, and 33.
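The search over candidate classifiers can be sketched as follows. Synthetic features and labels stand in for our engineered match statistics, and only two of the candidate model families are shown; the split index plays the role of the "final match week" boundary, with the rows after it acting as the moving test window:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))        # placeholder match features, ordered by match week
y = rng.integers(0, 2, size=300)     # placeholder win (1) / loss (0) labels

# Candidate models with small illustrative hyperparameter grids.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "rf": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
}

split = 240  # rows up to the "final match week"; the rest is the test window
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, scoring="f1", cv=3)
    search.fit(X[:split], y[:split])
    pred = search.predict(X[split:])
    print(name, f1_score(y[split:], pred), accuracy_score(y[split:], pred))
```

In the real runs the split point itself is re-drawn at every iteration, which is what produces the random sample of match weeks described above.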
Inside the functions used, we first normalize the features and create a pipeline. The code also splits the data set
according to the desired "origin" and "border" margins of analysis, which in this case are the indexes of the data
frame that the model will capture. We then try the model on a randomly generated match week, after which the code takes a
selection of classifier models and evaluates each of them with F1-score and accuracy measures through iterations. Our built-in
function iterates by adjusting the number of features and evaluating the chosen model with each set of features because, even
if these functions produce a heuristic answer, we can still tune the model by studying the feature importance.
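The feature-count loop can be sketched as follows, again on synthetic data, with a scaler-plus-Logistic-Regression pipeline standing in for our actual built-in functions. Features are ranked by coefficient magnitude and the model is re-scored with the top-k subsets:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 12))
# Labels depend on features 0 and 1 only, so the ranking has something to find.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Normalization and model in one pipeline, as in the text.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

# Rank features by absolute coefficient, largest first.
coefs = np.abs(pipe.named_steps["logisticregression"].coef_[0])
order = np.argsort(coefs)[::-1]

# Re-evaluate with progressively larger top-k feature subsets.
for k in (2, 4, 8, 12):
    score = cross_val_score(pipe, X[:, order[:k]], y, cv=5).mean()
    print(k, round(score, 3))
```

Watching how the score plateaus as k grows is what lets us prune the feature set even when the underlying search is only heuristic.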
The best classifiers after the iteration of 100 samples were the XGBClassifier and Logistic Regression.
Link to Jupyter Notebooks:
Model Selection at Level 1
Model Selection at Level 2
Model Selection at Level 3
Results
Machine Learning Modeling
At the first level, during the first run with XGBoost, the model reached more than 70 percent accuracy, but it gave
great importance to the save attempts made by the goalkeepers on each side, as we can see in Figure 1.
Figure 1. 1st Level Feature Importance with XGB.
This is not a great feature, as the data was biased in the recording of save attempts (they were usually logged only
when a goal or a decisive shot happened). What is more, the model gave importance to more than 15 specific features, which
calls into question how useful it would be on other data (it would be interesting to see its performance on other leagues'
events). Results from the second run, using Logistic Regression, were more encouraging; they demonstrate that the most
practical method is often preferable to the more sophisticated one. Although it had an accuracy of over 88 percent in
predicting when a team would lose, it properly allocated only 31.3 percent of the situations when a team won, dropping the
overall accuracy to 65 percent (room for improvement). The confusion matrix in Figure 2 demonstrates the problem of allocating
the prediction for a team winning. The model recognizes that a team's number of free kicks, whether accurate or imprecise,
may have a detrimental impact on its odds of winning; it is impossible to distinguish between the two at this level.
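The per-class accuracies quoted above are the recall values read off the diagonal of the confusion matrix, normalized by each true class. A minimal sketch with invented predictions (not our actual model output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder ground truth (0 = loss, 1 = win) and model predictions.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)
# Diagonal over row sums: fraction of each true class predicted correctly.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
# per_class_recall[1] is e.g. the share of true wins the model called wins.
```

Reporting both entries separately, rather than one global accuracy, is what exposes the weak win-detection described in the text.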
The second level, even though more detailed, showed an accuracy of around 68 percent for the XGBoost model; the number
of features continued to be vast, and free kick crosses had the most impactful feature coefficient in the model result, as shown
in Figure 3. Whether that importance is negative or positive is inconclusive because of the black-box nature of the gradient
boosting classifier. On the other hand, Logistic Regression resulted in an accuracy of 72.5 percent, and in this case it more
clearly assigned the negative impact of the number of crosses a team makes, which can result in a team losing, as shown in
Figure 4. There is actually some research on the subject, indicating how, without a clear strategy and team flow, teams tend to
go for an exaggerated number of crosses, resulting in an increase in lost possession.2 In this case, by having more depth inside
the features, the classifier managed to raise its accuracy of assigning when a team won up to 46.9 percent. The other case
(when a team loses - (0)) reached 79.4 percent accuracy, with the most important feature being the number of crosses made
during the game.
At the third level (now also using the SGD Classifier as another alternative), the one with the most detailed feature
description, we get results similar to the first two runs. The third classifier run, with the SGD Classifier, gives a different set
of results, in which the 1-label (predicting a win) accuracy reaches 81.3 percent, with about 50 percent accuracy for the 0-label.
Figure 3. 2nd Level Feature importance with XGB.
A useful application outside of the data seems impossible given the quantity of features, but passes appear to be the
key driver at this level of detail; see Figure 5 for more understanding. Nonetheless, the feature coefficients change significantly
from one iteration to the next, as if the model's accuracy were limited by the data's randomness over match weeks. As an
alternative, drawing on graph theory and game connectivity, pass graph networks are suggested in this research as a characteristic
that can help us better forecast the match's outcome.
Figure 5. 3rd Level Feature importance with SGD.
The first step, in order to avoid any problems regarding collinearity, was to handle only the pass events (no direct relation
to shot or save attempts) and segment the data into accurate and inaccurate passes.
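That filtering step can be sketched as follows, with Wyscout-style column names as placeholders for the real schema:

```python
import pandas as pd

# Toy event log; "eventName" mirrors the Wyscout field, "accurate" stands in
# for the accuracy tag attached to each pass in the real data.
events = pd.DataFrame({
    "eventName": ["Pass", "Shot", "Pass", "Pass"],
    "accurate": [True, False, False, True],
    "teamId": [1, 1, 2, 2],
})

# Keep only pass events, then split them by accuracy.
passes = events[events["eventName"] == "Pass"]
accurate = passes[passes["accurate"]]
inaccurate = passes[~passes["accurate"]]
```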
At this stage we wanted to create a tool that could filter the data set by different parameters. We first created a home
and away indicator, which allows an analysis of a team playing at its home venue. One immense problem we encountered
during the ML stage was analyzing a match as a single row: 90 minutes of football (with all the situations that happen during
this time) as a single entity. We now believe it is more valuable to divide the match by the different occurrences within it,
such as changes in score, and to give the user the possibility to choose the match period and game weeks of analysis. Rather
than only seeing how a certain team acted during the whole season, we can segment the data to see how a team acted when it
was winning or losing, in the first or second match period, at a home or away venue, over n game weeks. As a bonus, since it
is very difficult to imagine how these networks function without visualizing them, we created a visualization, shown in
Figure 6, as a built-in function that plots on a football field the passing relationships of every team, in every game they played
throughout the course of the 38 game weeks. We examine whether having a higher average connectedness than the opposition
helps predict whether a team will win or lose a match. The average team connectivity for the season and the sum of the
average node connectivity differences are summarized in Figure 7 (a positive value indicates that the team's connectedness
was higher than its rival's).
If we now compare the average node connectivity difference with the goals scored (presuming there exists a relationship, in
either direction, between the stability of a team's passing behavior and the goals created), we can cluster the data using the
elbow method. As Humaira states4, "if k increases, average distortion will decrease, each cluster will have fewer constituent
instances, and the instances will be closer to their respective centroids. However, the improvements in average distortion will
decline as k increases. The value of k at which improvement in distortion declines the most is called the elbow, at which we
should stop dividing the data into further clusters."
The number of clusters these values should be partitioned into is 3. See Figure 8.
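The elbow search described above can be sketched with k-means, using synthetic two-dimensional points (connectivity difference vs. goals scored) with three loose groups standing in for the 20 Premier League teams:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three loose clouds of placeholder points around different centers.
pts = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in (0.0, 4.0, 8.0)])

# Fit k-means for increasing k and record the distortion (inertia).
inertia = {k: KMeans(n_clusters=k, n_init=10, random_state=0)
              .fit(pts).inertia_ for k in range(1, 7)}
# The "elbow" is where inertia stops dropping sharply; for data built around
# three centers that happens near k = 3.
```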
Plotting each team as part of its own cluster and comparing goals scored with the connectivity difference, we can
clearly observe three groups. Let us remember that goals are not 100 percent related to the final position in the table;
that is why Newcastle United (44 points in the actual competition, mid-table) appears to outperform West Ham United
(42 points, lower in the table) in the clustering division. Despite that, it is striking that the six teams with the greatest
overall average node connectivity are exactly the Big Six of the Premier League, with a clear difference in favor of the
record-breaking title holder: Manchester City! (Figure 9)
Figure 9. Regression Analysis with Premier League Teams.
This analysis could be extended to the other leagues included in the data set that was presented at the beginning. A
fascinating direction would be to determine whether these relationships hold true for those leagues as well and, if so, to
attempt to go deeper into each match condition and its associated connectivity. Who knows: in a team sport, a team is as
strong as its weakest link, and in this case our results show how important it might be for every member of a team's structure
to participate in the game.
References
1. Pappalardo, L. & Massucco, E. Soccer match event dataset, DOI: 10.6084/m9.figshare.c.4415000.v5 (2019).
2. Vecer, J. Crossing in soccer has a strong negative impact on scoring: Evidence from the english premier league. SSRN
Electron. J. DOI: 10.2139/ssrn.2225728 (2013).
3. Beineke, L. W., Oellermann, O. R. & Pippert, R. E. The average connectivity of a graph. Discret. Math. 252, 31–45, DOI:
https://doi.org/10.1016/S0012-365X(01)00180-7 (2002).
4. Humaira, H. & Rasyidah, R. Determining the appropiate cluster number using elbow method for k-means algorithm. DOI:
10.4108/eai.24-1-2018.2292388 (2020).
Acknowledgements
Even though this is a draft, I have to thank Amílcar Soares for his unwavering help in this process. Thank you for letting
me do and learn from what I love the most: football.
Additional information
Git-Hub Source for jupyter notebooks: Fabrizio Lufe