
Industrial Big Data Analytics and Machine Learning

Project Report
FIFA Video Game – Players Classification
By:

Imran A. Khan (iak26)


Abstract

Over the last decades, video games have become one of the largest segments of the
entertainment industry. One of the best-selling and most representative titles is FIFA by
EA Sports, in which players take part in soccer matches. Soccer is the most popular sport in
the world, so representing players’ skills correctly is very important, and video game
companies are under considerable pressure to deliver a product that accurately reflects the
elements that make up the game. This project studies how soccer players and their skills are
represented in the video game. The set of skills chosen by EA Sports to translate real skills
into virtual ones is used to classify the players and to predict match results based on the
composition of the teams. For these purposes, K-means clustering, neural networks, and
decision tree classifiers are used. The conclusions show how the skills of individual players
help to determine a player’s position and ‘world category’: each player is labelled either as
‘top’ in his position or as belonging to a second line (non-top players) in that position.
Introduction

In line with the objective of the course, and keeping in mind the methods and techniques
studied in it, this project applies several of those methods to a problem that is
particularly interesting for the video game industry and its fans.

Every year, video game companies invest huge sums of money to innovate and stay at the
forefront of the industry. Reference [1] gives a full list of video games, some of which
have required more than US$250 million for development and marketing.

One of the most popular video games around the world is FIFA by Electronic Arts, which
takes advantage of the passion for football. FIFA 19 is one of the best-selling video games
in the world: according to [2], it reached 22nd position just 3 months after its release
date. Although the exact amount spent on the development and commercialization of FIFA 19
is not known, some references indicate that the game could cost as much as a Hollywood
production; [3] gives an estimate of US$350 million for the production and distribution of
FIFA 16.

Naturally, the game is not cheap for customers, since the owners aim to cover the production
costs and make a profit on their large investment. As a result, customers are very
demanding, and every year they expect a game that is closer to the real-life sport.

One of the hardest tasks in the development of the game is to study the players and create
virtual skills that faithfully represent their performances in real-life games. This context
motivates us to explore how that classification is made. In this project, multiple machine
learning techniques are applied to classify football players according to their skills. In a
second stage, the project uses that classification to examine its correlation with the
results of real-life games.
Football context

Football is well known around the world and is probably the most famous sport and the one
most deeply rooted in popular culture. It is played by two teams of eleven players each. The
players are distributed on a pitch, which typically measures about 110 meters in length and
70 meters in width. Each player in a team takes a specific role according to tactical
concepts. The positions on the pitch are listed below:

• Goalkeeper
• Full-back
• Wing-back
• Center back
• Center defensive midfield
• Center attacking midfield
• Wide midfield
• Wings
• Centre Forward
• Striker

The previous classification could be extended according to the side of the pitch where the
player takes the role, e.g. ‘Left Wing’ or ‘Right Wing’.

Figures 1 and 2 show examples of line-ups in the FIFA video game, where the positions
mentioned above appear as acronyms that are explained later.
Figure 1. Line-up football example from FIFA 15

Figure 2. Line-up 4-4-2 example


Here, we can see that managers have many options (in practice, almost unlimited) to
line up their teams. Nonetheless, some classical formations are commonly used, and they
are summarized in the following list:

• 4-4-2
• 4-3-3
• 4-1-2-1-2
• 4-3-1-2
• 4-2-2-2
• 5-3-2
• 3-4-3

The goalkeeper is not mentioned in a formation. Each formation gives the number of
defenders as the first element and the number of attackers as the last one; the
intermediate elements are the midfielders, whose arrangement is quite flexible. For
example, 4-1-2-1-2 has 4 midfielders (1 Center Defensive Midfielder, 2 Wide Midfielders
and 1 Center Attacking Midfielder). This makes clear how the number of possible line-ups
is practically unlimited, depending on the particular positions assigned to the players
on the pitch.

Electronic Arts has defined certain criteria to rank players inside the game. It does so
through a set of categories that represent the different skills each player can have.
These categories include defensive and attacking skills, among others, which EA considers
useful to differentiate the players. Figure 3 shows a summary of the skill categories in
the FIFA video game.
Figure 3. Player skills for FIFA video game
Background
Background for the kind of methods used in this work is given by Safavian and Landgrebe [4],
Boser et al. [5], and LeCun et al. [6], who describe the foundations of decision tree
classifiers, optimal margin classifiers (the basis of support vector machines) and
gradient-based training of neural networks, including convolutional neural networks (CNN),
respectively. These three papers are classics of great importance for people working in
Artificial Intelligence and Machine Learning. Since the 2000s, ML and AI have grown very
rapidly, and the citation counts of these papers confirm this; for example, LeCun et al. has
more than 17,000 citations.

Cotta et al. [7] present a good review of how the information available for the FIFA video
game can be used by the football community to characterize several aspects of the game.
Fortuna et al. [8] present an unsupervised classifier that categorizes top players based on
Google Trends, essentially by looking at the most searched terms on the internet. Fu et al.
[9] present a model to predict the performance of youth players based on neural networks and
ML techniques. Much more research in this field deserves mention; nonetheless, it is outside
the scope of the project to list exhaustively the many papers that use ML techniques to
build classifiers or apply classification techniques in the video game industry. The
objective of this section is to highlight this research field as one where multiple
applications may be developed and presented to the industry, giving it more tools to improve
future versions of its video games.

Throughout the project, every tool used to build the proposed model is referenced, in
recognition of the authors and developers who dedicated their time to making those tools
freely available.
Methods and data

Before describing the methods in full, it is important to mention how the data for the
project was obtained. After exploring Kaggle databases, the European Soccer Database [10]
was found to fit the requirements of the project. This database contains the history of more
than 25,000 matches, more than 180,000 player information records and much more, and can be
found at https://www.kaggle.com/hugomathien/soccer.

It is important to highlight that the player information contains repeated elements, since
one player can appear more than once due to his presence in multiple versions of the game.
To apply the algorithms and techniques proposed in this document, the database was filtered
by removing repeated data and keeping only the data for the most recent version of the game.
In the end, more than 10,000 player records are available to carry out the project.
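The filtering step described above can be sketched as follows. This is only a minimal illustration, not the code used in the project: the key names `player_id` and `date` are hypothetical, and the dates are assumed to be ISO-formatted strings so that plain string comparison orders them correctly.

```python
from typing import Dict, List


def latest_record_per_player(rows: List[dict]) -> List[dict]:
    """Keep only the most recent row for each player.

    Assumes every row has a 'player_id' key and an ISO-formatted 'date'
    string (YYYY-MM-DD). Both key names are hypothetical.
    """
    latest: Dict[int, dict] = {}
    for row in rows:
        current = latest.get(row["player_id"])
        # Replace the stored row only when this one is more recent.
        if current is None or row["date"] > current["date"]:
            latest[row["player_id"]] = row
    return list(latest.values())


# Made-up rows: player 1 appears in two versions of the game.
rows = [
    {"player_id": 1, "date": "2014-09-18", "finishing": 70},
    {"player_id": 1, "date": "2016-02-04", "finishing": 74},
    {"player_id": 2, "date": "2015-10-16", "finishing": 55},
]
filtered = latest_record_per_player(rows)
```

After the call, `filtered` holds one row per player, and for player 1 it is the 2016 row.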

Since the available data does not contain any labels, the first two steps are related to
classification. The proposed algorithms yield a classification of the players that is useful
for the third part of the project, where it helps to determine whether a player has a better
overall performance than another one occupying the same position on the field.

K-means clustering

To categorize the players, a K-means algorithm is implemented and run for several numbers
of clusters. The appropriate number of categories is then selected according to technical
observations and the specific features of the problem.

Recall the idea behind the K-means algorithm: K centers are placed at random in the
attribute space (whose dimension equals the number of attributes), and each data row is
assigned to its nearest center. The centers are then iteratively relocated so that the total
distance between the data points and their centers is minimized. The algorithm keeps
iterating until the reduction in total distance between iterations is small enough to
consider that it has converged.

It is important to remember that the assignment of data points to centers depends strongly
on the initialization of the centers, and so does the solution the algorithm converges to.
This means that running the algorithm multiple times can yield different categorizations,
even with the same number of clusters.
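The procedure just described can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the implementation used in the project: the function name `kmeans`, the `seed` parameter and the toy points are hypothetical, the initial centers are drawn from the data points, and the total is computed with squared Euclidean distances, as in the standard formulation.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def kmeans(points: List[Point], k: int, seed: int = 0, tol: float = 1e-6,
           max_iter: int = 100) -> Tuple[List[int], List[List[float]], float]:
    """Return (labels, centers, total squared distance)."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    prev_total = total = math.inf
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        total = sum(math.dist(p, centers[lab]) ** 2
                    for p, lab in zip(points, labels))
        # Stop once the improvement between iterations is negligible.
        if prev_total - total < tol:
            break
        prev_total = total
    return labels, centers, total


# Toy data with 2 attributes and two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers, total = kmeans(points, k=2, seed=0)
```

On data that is less clearly separated than this toy set, calling `kmeans` with different `seed` values can return different labels, which is the sensitivity to initialization mentioned above.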

Figure 4. K-means example for a data set with 2 attributes and 3 different categories

Once the data is labeled, two different classification algorithms are developed in order to
establish the most important attributes for every position on the pitch and to characterize
the players in terms of their skill sets.
Decision trees

This technique is the most intuitive way to identify the skills that define clear division
rules for a specific position on the field. Decision trees handle numerical variables
naturally through threshold splits, which makes them very suitable in our case, since the
skills are numerical variables taking values between 0 and 100 for each attribute.

As can be seen, the set of skills is quite large, so there are many attributes that can be
used to classify the players. The decision tree gives an initial idea of how players can be
classified, which helps to motivate the next method (the neural network).
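To make the idea of a division rule concrete, the sketch below searches for the best threshold on a single numerical attribute. It is only an illustration under assumptions: the report does not state which split criterion was used, so Gini impurity is assumed here, and the `gk_diving` ratings and position labels are made up.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini impurity of a group of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values: List[float], labels: List[str]) -> Tuple[float, float]:
    """Return (threshold, weighted Gini) of the best 'value <= threshold' split."""
    best = (float("nan"), float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are the midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Made-up ratings: outfield players have low goalkeeper-diving values.
gk_diving = [8, 12, 10, 15, 70, 78, 82]
position = ["field", "field", "field", "field", "GK", "GK", "GK"]
threshold, impurity = best_threshold(gk_diving, position)
```

A full tree repeats this search over all attributes and then recursively inside each resulting branch.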

Neural network

Based on the previous analysis, a neural network is trained to classify the players. The
decision tree becomes important when the number of classes must be defined: the branches of
the tree give an intuition about how many classes there are and how the classification
should be set up to get accurate results.

Another important consideration for the neural network is the standardization of variables.
Before training the network, each category of attributes from Figure 3 should be analyzed to
determine whether standardization is necessary.
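As an illustration of what standardization means for one attribute, the sketch below rescales a column of ratings to zero mean and unit standard deviation. The report does not specify which scaling was applied, so the z-score form shown here is an assumption, and the ratings are made up.

```python
import statistics
from typing import List


def standardize(column: List[float]) -> List[float]:
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# Made-up ratings on the game's 0 to 100 scale.
ratings = [40.0, 50.0, 60.0, 70.0, 80.0]
z = standardize(ratings)
```

Each attribute would be treated the same way, column by column, so that no single skill dominates the training only because of its scale.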
Results and discussion
This section is organized in the same way as the previous one, in order to analyze how
labelling and classification work with the proposed algorithms.

K-means

The labelling procedure was carried out for several numbers of clusters. Figure 5 summarizes
the results in terms of the minimum total distance obtained for each number of clusters from
2 to 20.

Figure 5. Total distance vs. Number of clusters (K-means algorithm)
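One possible way to locate the bends in a curve like the one in Figure 5 is to look at its discrete second differences: a large positive value at k means the curve flattens noticeably after k clusters. The sketch below is only an illustration of that idea. The numbers in `curve` are made up and are NOT the values behind Figure 5, and the helper name `second_differences` is hypothetical.

```python
from typing import Dict


def second_differences(total_distance: Dict[int, float]) -> Dict[int, float]:
    """Discrete curvature d[k-1] - 2*d[k] + d[k+1] for each interior k."""
    ks = sorted(total_distance)
    return {
        k: total_distance[k_prev] - 2 * total_distance[k] + total_distance[k_next]
        for k_prev, k, k_next in zip(ks, ks[1:], ks[2:])
    }


# Made-up total distances per number of clusters, for illustration only.
curve = {2: 100.0, 3: 80.0, 4: 50.0, 5: 45.0, 6: 30.0, 7: 28.0, 8: 26.5}
bends = second_differences(curve)
# The two sharpest bends are the candidate numbers of clusters.
candidates = sorted(bends, key=bends.get, reverse=True)[:2]
```

Such candidates are only a starting point; the composition of the clusters still has to be inspected by hand.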

Initially, the expected number of clusters was 9, according to the grouping shown in
Figure 6.
Figure 6. Expected clusterization according to players distribution on the field

However, after analyzing the results for 9 clusters and running the algorithm several times,
the behavior was very unstable, and the resulting classification did not show any
correspondence with the positions shown in Figure 6.

Then, for the breaks at 4 and 6 clusters (shown in Figure 5), the composition of the
clusters was analyzed manually, and more stable results were observed for 6 clusters.
Figure 7 shows a 2-dimensional representation of the results for this labelling.

Figure 7. Projection in 2D of clusterization made for 6 categories


Results for 7, 8 and 9 clusters were also explored and found to be inconsistent; the most
consistent results were obtained with 6 clusters. As Figure 7 shows, one category is
clearly separated from the others (this is very intuitive, since goalkeepers have a very
different set of skills from their teammates). Although the 5 remaining categories appear
heavily overlapped, we need to remember that this is a projection of 33 attributes onto just
2 dimensions. In the full 33-dimensional space, those 5 categories can be expected to be
much better separated.

After analyzing a portion of the clustered data, it was possible to identify the composition
of the clusters. Table 1 shows a summary of these results.

Table 1. Composition of categories according to k-means algorithm for 6 clusters

Category   Players in                           Observations
           Majority           Also include
1          ST                 LW-RW             Second line Attackers
2          Midfielders        Defenders         TOP on these positions
3          TOP ST-AM-LW-RW                      Top and best attacking players
4          Midfielders                          Second line midfielders
5          GK
6          Defenders                            Second line defenders

An additional analysis of particular attributes was also carried out, which helped to
identify the composition of the clusters in terms of the players who belong to each category.
Figure 8. Attacking skills comparison between TOP and non-TOP players categories

Figure 8 shows a deeper comparison between non-top attackers and top players. Although all
of them are supposed to be very good in attack-related skills (like dribbling or finishing),
we can see that the top players (shown in green) generally have much better skills than the
others.
Figure 9. Defensive skills comparison between TOP Mid and Def vs. non-TOP defenders

Figure 10. General skills comparison between TOP Mid and Def vs. non-TOP defenders

Since top players who are not attackers are in category 2, this category contains both
midfielders and defenders. Comparing these top players with non-top defenders, we can see
that they have very similar skills for preventing goals on their own goal line. However,
they differ greatly in the abilities typical of midfield players, like vision, dribbling or
shot power. This difference explains why these two categories were created.
Figure 11. Defensive skills comparison between TOP Mid and Def vs. non-TOP midfielders

Figure 12. General skills comparison between TOP Mid and Def vs. non-TOP midfielders

On the other hand, we also compare the top players of category 2 against the non-top
midfielders. Here we observe the opposite of the previous analysis (Figures 9 and 10): in
skills like vision, dribbling and shooting, they perform similarly, but in defensive skills
the top players are clearly much better than the non-top midfielders. Together, these two
comparisons explain why there is only one category for both top defenders and top
midfielders.

Decision tree classifier

After the analysis of the clusters, the data set was divided into training and test subsets,
and a tree classifier was trained to identify the skills that matter most for determining
the position of the players on the field.

Figure 13 shows a summary of what was obtained from the algorithm.

Figure 13. Decision tree classifier trained for FIFA video game players

One branch on which the algorithm never misclassifies a player combines the standing
attribute (less than 56) with goalkeeper diving. In essence, non-goalkeeper players have
gk_diving under 39, which makes a lot of sense given what we already discovered during the
clustering procedure.

This tree was then evaluated on the test data set, and the results are summarized in the
following confusion matrix.

Table 2. Confusion matrix for decision tree classifier trained

Real/Algorithm      1      2      3      4      5      6
1                 283      1     19     27      0      0
2                   9    758     12     18      0    100
3                  59     81    301     48      0      0
4                  34     37     68    303      0      5
5                   0      0      0      0    276      0
6                   6    111      0     18      0    601

Applying this classification tree to a test set of 30% of the data, the classification
success rate was around 79.5%. The hardest categories for the algorithm to predict were 3
and 4, which probably means that non-top midfielders and top attacking players are not very
well represented by the branches of the tree (Figure 13).

The second part consisted of developing another classification algorithm, in this case a
neural network.

Neural network

Figure 14 shows the structure of the network that was implemented: 33 input nodes
corresponding to the attributes of the players, 5 nodes in the intermediate layer and 6
possible outputs, which correspond to the 6 categories established previously.
Figure 14. Neural network structure trained for players classification
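A forward pass through a 33-5-6 network of this shape can be sketched as follows. This is only an illustration under assumptions: the report does not state the activation functions, so a tanh hidden layer and a softmax output are assumed, the weights are random rather than trained, and the input vector is made up.

```python
import math
import random
from typing import List


def dense(inputs: List[float], weights: List[List[float]],
          biases: List[float]) -> List[float]:
    """One fully connected layer; weights[j] holds the input weights of unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, w1, b1, w2, b2) -> List[float]:
    hidden = [math.tanh(h) for h in dense(x, w1, b1)]  # assumed activation
    return softmax(dense(hidden, w2, b2))


rng = random.Random(0)
N_IN, N_HIDDEN, N_OUT = 33, 5, 6
# Random, untrained weights: the output below is not a meaningful prediction.
w1 = [[rng.uniform(-0.1, 0.1) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
b1 = [0.0] * N_HIDDEN
w2 = [[rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)] for _ in range(N_OUT)]
b2 = [0.0] * N_OUT

player = [rng.random() for _ in range(N_IN)]  # made-up attributes scaled to [0, 1]
probs = forward(player, w1, b1, w2, b2)
category = probs.index(max(probs)) + 1  # categories are numbered 1 to 6
```

Training would adjust `w1`, `b1`, `w2` and `b2` so that the largest probability falls on the category assigned by the clustering step.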

Normalization

One important thing to clarify is that the variables were normalized in order to improve the
convergence of the neural network training. When the network was trained without any
normalization, convergence was very poor: even after multiple runs, the training did not
converge.

The result on the test data set is summarized as follows:

Table 3. Confusion matrix for neural network trained

Real/Algorithm      1      2      3      4      5      6
1                 310      4      9      7      0      0
2                   1    870      3      5      0     18
3                  10     10    454     15      0      0
4                   5      1      3    436      0      2
5                   0      0      0      0    276      0
6                   0      1      0      1      0    734

Neural networks often outperform decision tree classifiers, and that is the case here.
Testing on the same data set, we obtained a success rate of 97%, which is very good.
However, we can still see that the hardest category to classify is the very TOP players
(category 3). This probably means that some features of that category are very hard to
handle when deciding whether or not a player belongs there.
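The success rates quoted for both classifiers can be recomputed directly from Tables 2 and 3. The sketch below does so, reading rows as the real category and columns as the category returned by the algorithm, as the table headers indicate; the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def accuracy(cm: Matrix) -> float:
    """Share of test players on the diagonal (correctly classified)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(map(sum, cm))


def recall_per_class(cm: Matrix) -> List[float]:
    """For each real category, the share of its players that was recovered."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


tree_cm = [  # Table 2
    [283, 1, 19, 27, 0, 0],
    [9, 758, 12, 18, 0, 100],
    [59, 81, 301, 48, 0, 0],
    [34, 37, 68, 303, 0, 5],
    [0, 0, 0, 0, 276, 0],
    [6, 111, 0, 18, 0, 601],
]
nn_cm = [  # Table 3
    [310, 4, 9, 7, 0, 0],
    [1, 870, 3, 5, 0, 18],
    [10, 10, 454, 15, 0, 0],
    [5, 1, 3, 436, 0, 2],
    [0, 0, 0, 0, 276, 0],
    [0, 1, 0, 1, 0, 734],
]

tree_acc, nn_acc = accuracy(tree_cm), accuracy(nn_cm)
tree_recall, nn_recall = recall_per_class(tree_cm), recall_per_class(nn_cm)
# Category with the lowest recall for the neural network (1-based numbering).
hardest_nn = nn_recall.index(min(nn_recall)) + 1
```

With these matrices the tree reaches about 79.4% and the network about 97.0% on the same 3,175 test players. The lowest recalls of the tree fall on categories 3 and 4, and the lowest recall of the network falls on category 3, in line with the discussion above.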
Conclusions and discussion

An algorithm to classify players from the FIFA video game has been developed, considering
the set of skills that the owners of the game have identified as key to representing the
behavior of a player on the field.

The clustering and classification look very consistent, despite some errors in classifying
a small number of players.

Although at the beginning we thought that a classification by specific position could be
obtained, the algorithm showed that the set of skills represents the general behavior of a
player (for example, whether he belongs in a defensive, midfield or attacking position)
rather than an exact classification by position. Although this may look like a drawback for
our purposes, in practice it is very intuitive, since players in real life can usually play
well in more than one position close to their original one. For example, a player like
Andres Iniesta from Spain has played in multiple midfield positions rather than in a single
fixed role.

Additional analysis

As a final part, an analysis of important matches has been carried out, which allows us to
draw conclusions about the classification.

World Cup final Brazil 2014

Figure 15 shows the line-ups of Germany and Argentina, who played the final match of the
2014 World Cup.

Figure 15. World Cup Final Brazil 2014 Line-ups

The classification accuracy of the algorithm (neural network) here was 100%.

Composition of the teams was:

Germany

1 Goalkeeper (Cat 5)

2 Top Players (Cat 3)

1 Non-top Attacker (Cat 1)

6 Top def and mid (Cat 2)

1 Non-top mid (Cat 4)

Argentina

1 Goalkeeper (Cat 5)

3 Top Players (Cat 3)

6 Top def and mid (Cat 2)

1 Non-top def (Cat 4)


The match was tied at the end of the regulation 90 minutes, which is very consistent with
the composition of the teams: each has only 1 player not classified as TOP in his position
(Howedes and Demichelis respectively).

Champions League final 2014-15

Figure 16 shows the line-ups corresponding to the final match of the UEFA Champions League.

Figure 16. Line-ups UEFA Champions League 2014-15

Once again, the classification made by the neural network was 100% accurate.

The composition of the teams is as follows:

Juventus

1 Goalkeeper (Cat 5)

2 Top Players (Cat 3)

8 Top def and mid (Cat 2)


Barcelona

1 Goalkeeper (Cat 5)

4 Top Players (Cat 3)

6 Top def and mid (Cat 2)

The match ended in favor of Barcelona, which once again makes sense given the composition of
the teams, since Barcelona has 2 more TOP players (4 in total) than Juventus (2 in total).
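The comparison above is a simple tally of categories per line-up, which can be written down as follows. This is only a sketch of the counting, not a prediction model; the category numbers follow Table 1 and the line-ups are the compositions listed above.

```python
from collections import Counter
from typing import Dict, List

TOP_ATTACKER = 3  # "Top Players (Cat 3)" in the compositions above

# One category number per player, as listed for the 2014-15 final.
lineups: Dict[str, List[int]] = {
    "Juventus": [5] + [3] * 2 + [2] * 8,
    "Barcelona": [5] + [3] * 4 + [2] * 6,
}

top_counts = {team: Counter(cats)[TOP_ATTACKER] for team, cats in lineups.items()}
difference = top_counts["Barcelona"] - top_counts["Juventus"]
```

Here `difference` is the gap of 2 TOP players mentioned in the text. Two matches are far too few to treat such a tally as a reliable predictor.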

As a final conclusion, the two matches analyzed suggest that the classification is accurate
and could be a useful tool for predicting match results from the teams’ composition. Further
research could develop another Machine Learning algorithm that predicts the result of a
match based on the categories of the players in the line-up. Of course, soccer always has an
extra stochastic component that comes from unexpected events, like a sending-off or specific
mistakes that players may make.
References

[1] Wikipedia, "List of most expensive video games to develop," 25 02 2019. [Online].
Available: https://en.wikipedia.org/wiki/List_of_most_expensive_video_games_to_develop.
[Accessed 18 03 2019].

[2] M. B. Sauter, "USA TODAY," 13 12 2018. [Online]. Available:
https://www.usatoday.com/story/tech/gaming/2018/12/13/popular-video-games-2018-25-best-selling-titles-year/38672903/.
[Accessed 18 03 2019].

[3] FIFACOINSHOT, "FIFA 16 Cost The Same As A Hollywood Movie," 27 10 2015. [Online].
Available: http://www.fifacoinshot.com/news/208--fifa-16-cost-the-same-as-a-hollywood-movie.
[Accessed 18 03 2019].

[4] S. R. Safavian and D. Landgrebe, "A survey of decision tree classifier methodology,"
IEEE transactions on systems, man, and cybernetics, vol. 21, no. 3, pp. 660-674, 1991.

[5] B. E. Boser, I. M. Guyon and V. N. Vapnik, "A training algorithm for optimal margin
classifiers," Proceedings of the fifth annual workshop on Computational learning
theory, pp. 144-152, 1992.

[6] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to
document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.

[7] L. Cotta, P. Melo, F. Benevenuto and A. Loureiro, "Using fifa soccer video game data
for soccer analytics," Workshop on Large Scale Sports Analytics, 2016.

[8] F. Fortuna, F. Maturo and T. Di Battista, "Clustering functional data streams:
Unsupervised classification of soccer top players based on Google trends," Quality and
Reliability Engineering International, vol. 34, no. 7, pp. 1448-1460, 2018.

[9] W. Fu, Y. Sun, F. Zhang and B. Guo, "An Intelligent System to Predict Future
Performance of Youth Football Players using Machine Learning," Proceedings on the
International Conference on Artificial Intelligence (ICAI), pp. 63-66, 2018.

[10] H. Mathien, "European Soccer Database," Kaggle, 16 10 2016. [Online]. Available:
https://www.kaggle.com/hugomathien/soccer. [Accessed 18 03 2019].
