
ANALYSIS OF THE FACTORS THAT ALTER THE NUMBER OF MEDALS OBTAINED BY A COUNTRY ON THE OLYMPIC GAMES

I. INTRODUCTION

The Olympic Games are the biggest sporting event in the world today. More than 28 sports are
contested, and nearly 12,000 athletes took part in the most recent Games, held in Tokyo.
Interest in sport and in practising it has also grown considerably over the last decades, and
with it the interest in the Olympic Games.

Governments are aware of this, and many have increased their investment in sport. The Olympic
Games have become an opportunity to display a country’s strength by winning more medals.
But is there a magic formula for winning more medals at the Olympic Games? Today’s
technological society gives us access to a vast amount of data, which provides an effective
way to approach this question.

II. OBJECTIVES

This thesis highlights the factors that determine a country’s success at the Olympic Games and
studies the trends and correlations between those factors. In addition, I will examine whether
countries with more participants have an advantage over those with few. Intuitively, a country
with more participants has a higher chance of having at least one outstanding athlete and,
consequently, of winning more medals. Therefore, using a hypothesis test, I will try to reject
the null hypothesis, which states that the number of participants has no effect on the number
of medals won.

In addition, with the data collected, I will try to build an estimator that predicts the
number of medals a country will win at future Olympic Games as a function of the selected
variables.

III. DATA DESCRIPTION

To keep the study reliable, I have based it on the most recent Olympic Games, which took
place in Tokyo. The data were extracted from the Olympic Games website and from the
Worldometers world data website. More than 200 countries are analysed, namely all those that
participated in the 2020 Olympic Games. However, I had to exclude Venezuela, Eritrea and two
more countries because their data could not be found. All this information is processed with
RStudio. It is important to stress that the medals won at the last Olympic Games are not
necessarily representative of past or future Games. Hence, this study focuses only on the
Tokyo 2020 Olympic Games, which are the most recent.

The dataset contains the following 5 variables for each of the 200 countries analysed:

- Number of medals
- Number of participants
- Population
- GDP
- GDP per capita

IV. RESULTS

As mentioned above, the Tokyo 2020 Olympic Games dataset has been imported into RStudio for
analysis.

The first thing to establish is how each variable will be treated, and why. Population,
participants and GDP per capita are related to medals through a linear regression. Since this
study also aims to predict the medals a country will win in the future, it has to take into
account how each variable grows globally from one Olympic Games to the next. As Figure 1
shows, the percentage change in GDP (red) is always higher than that of medals (blue) and
population (green). Consequently, GDP is related to medals through a logarithmic function, as
its average change is clearly higher than the change in medals.
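
The study was carried out in RStudio and its code is not reproduced in the text. As a minimal Python sketch of what regressing medals on the logarithm of GDP could look like, assuming an ordinary least-squares fit and using made-up values rather than the study's data (`fit_line` and `fit_log_line` are illustrative names):

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_log_line(xs, ys):
    """Fit y = intercept + slope * ln(x); every x must be positive."""
    return fit_line([math.log(x) for x in xs], ys)


# Made-up illustration values, not taken from the study's dataset.
gdp = [10.0, 100.0, 1000.0, 10000.0]
medals = [1.0, 3.0, 5.0, 7.0]

intercept, slope = fit_log_line(gdp, medals)
print(intercept, slope)
```

With this transformation, each tenfold increase in GDP adds the same number of predicted medals, which is one way to keep a fast-growing variable from dominating the fit.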

Although the change in participants (black) is not constant, its average is very close to that
of medals. This is why the regression on participants is linear as well.

Figure 1. Percentage change of each variable between Olympic Games

Let us now focus on the medal counts, which are summarised in the following histogram.

The histogram reveals that almost all countries won between 0 and 10 medals. This shows an
interesting fact: medals are not evenly distributed, as a few countries win the vast majority
of them. Moreover, only 93 of the 205 countries won a medal. What factors are driving this
inequality? Is it happening at random?
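
As a rough Python sketch of how such a histogram and the share of medal-winning countries could be computed (the study itself used RStudio; the bin width of 10, the function names and the medal counts below are assumptions made for illustration):

```python
from collections import Counter


def medal_histogram(medal_counts, bin_width=10):
    """Count countries per medal bin; bin k covers k*width .. (k+1)*width - 1."""
    bins = Counter(count // bin_width for count in medal_counts)
    return {
        (k * bin_width, (k + 1) * bin_width - 1): bins[k]
        for k in sorted(bins)
    }


def share_with_medal(medal_counts):
    """Fraction of countries that won at least one medal."""
    winners = sum(1 for count in medal_counts if count > 0)
    return winners / len(medal_counts)


# Made-up medal counts for ten hypothetical countries, not the study's data.
medals = [0, 0, 0, 1, 2, 4, 7, 12, 35, 90]

print(medal_histogram(medals))
print(share_with_medal(medals))
```

Applied to the figures quoted in the text, 93 medal-winning countries out of 205 correspond to a share of about 45%.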

Medals are the “Y” variable in this study, the one that depends on the other factors.

Figure 2. Medals histogram

Analysing one of the explanatory variables, participants, on its own reveals an important
similarity. The boxplot shows that the values are tightly packed between the first and the
second quartile. As with the medals, a much larger share of the data lies in the lower part of
the boxplot. Consequently, outliers appear only in the upper part, since 0 is already included
in the boxplot.

This observation, also visible in the medals histogram, indicates that many countries have few
participants, while a few countries have a great number of them.
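
The boxplot reading above can be sketched in Python as follows. This is only an illustration: it assumes the common 1.5 × IQR rule for flagging outliers (the text does not state how the boxplot was drawn), and the participant counts are made up.

```python
import statistics


def boxplot_summary(values, whisker=1.5):
    """Return the quartiles and the points outside the whisker * IQR fences."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "low_outliers": [v for v in values if v < lower_fence],
        "high_outliers": [v for v in values if v > upper_fence],
    }


# Made-up participant counts for eight hypothetical countries.
participants = [0, 2, 3, 4, 5, 6, 8, 60]

summary = boxplot_summary(participants)
print(summary)
```

Because counts cannot be negative and most values sit near zero, the lower fence falls below 0 and only the large delegations can be flagged as outliers.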

This similarity, observed when looking at each variable on its own, has to be confirmed with an
individual linear regression of each variable against medals. Are medals and participants
related? What about the other factors?

Figure 3. Linear regression medals-participants

Figure 4. Logarithmic regression medals-GDP

At first glance, participants appear to be a decisive factor in the number of medals won by a
country. The points lie close to the regression line, with little dispersion. By contrast,
Figure 4 suggests that GDP might not be an important factor for the medals won, or at least
not as important as participants: the data are clearly more dispersed than in Figure 3.

To confirm that participants have an impact on the medals won, a multiple regression and
hypothesis tests will be carried out. The confidence level used to affirm that the evidence
shown above is not the result of chance is 95%. Furthermore, with the multiple linear
regression, this study will try to come up with a predictor of medals won, using confidence
intervals at the same 95% level as the hypothesis tests.

To find the p-value of each factor, a linear model is fitted in RStudio.

From this output, extracted from RStudio, it can be affirmed that the p-value is clearly lower
than 0.05. Hence, the null hypothesis can be rejected at the 0.05 level. Participants and
medals are shown to be strongly and positively correlated.

Furthermore, the adjusted R-squared is very high: 79.56% of the variance in medals can be
explained by participants alone. In addition, the intercept estimate is lower than 0, which
supports this relationship: with no participants it is impossible to win a medal, and the
fitted line indeed predicts no positive medal count in that case.

Figure 5. Results obtained for the linear regression between participants and medals
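
The quantities discussed here (slope, intercept, adjusted R-squared and the test of the null hypothesis that the slope is zero) can be sketched in Python as below. This is not the RStudio code used in the study; the function name and the data are made up for illustration, and because the standard library has no Student t distribution, the sketch compares the t statistic with a tabulated critical value instead of computing a p-value.

```python
import math


def simple_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x with basic diagnostics."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalises the number of predictors (one here).
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    # t statistic for H0: slope = 0, with n - 2 degrees of freedom.
    slope_se = math.sqrt(ss_res / (n - 2) / sxx)
    return {
        "intercept": intercept,
        "slope": slope,
        "r2": r2,
        "adj_r2": adj_r2,
        "t_stat": slope / slope_se,
    }


# Made-up values for eight hypothetical countries, not the study's data.
participants = [2, 5, 10, 20, 50, 100, 200, 400]
medals = [0, 0, 1, 1, 4, 9, 20, 45]

result = simple_regression(participants, medals)
# 2.447 is the two-sided 5% critical value of Student's t with 6 degrees of freedom.
reject_null = abs(result["t_stat"]) > 2.447
print(result, reject_null)
```

Rejecting the null hypothesis when the absolute t statistic exceeds the critical value is equivalent to the p-value being below 0.05.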

The following table shows the results for the other three variables, calculated in the same
way:

FACTOR           P-VALUE     ADJUSTED R-SQUARED   INTERCEPT
Population       2.2e-16     0.6822               2.802
GDP              1.669e-08   0.1446               -30.704
GDP per capita   1.34e-05    0.08691              1.284

Table 1. Results for all the correlations computed in RStudio


Having confirmed that all four factors are individually significantly related to the medals won
by a country (every p-value is below 0.05), the multiple linear regression has to be computed.

In the scatter plot of three variables, the same patterns can still be observed as when each
variable was analysed individually.

The factors used are those with the lowest p-values, and there is still a linear and positive
relationship.

The data are clearly grouped together, with little dispersion, which may indicate that the
correlation is even higher than when studied individually.

Figure 6. Scatter plot of 3 variables

To test this idea, the multiple linear regression is computed in RStudio. The following
results are obtained:

Figure 7. Results obtained with linear multiple regression

In the multiple linear regression, the only two factors that have a significant effect on the
medals won by a country are participants and population. The p-value, as in the individual
analysis, is very small. In addition, the adjusted R-squared is very high, which lends
credibility to this work. However, the intercept value is not fully realistic, because it is
impossible to win medals without participants or population.

However, if the multiple linear regression is computed again, this time considering only the
factors that showed a p-value lower than 0.05, the following results are obtained:

Figure 8. Best multiple linear regression results


The adjusted R-squared remains practically the same, as does the p-value. However, the
intercept estimate has changed considerably. In this new multiple regression it is below zero,
which, for the reasons explained above, makes the results more plausible.
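
A multiple linear regression on two predictors, such as participants and population, can be sketched in Python by solving the normal equations. This is an illustrative stand-in for the RStudio fit, not the study's code: the helper names are hypothetical and the numbers are made up.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple(rows, ys):
    """Least squares with an intercept; returns [intercept, b1, b2, ...]."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up rows of (participants, population in millions) and medal counts.
rows = [(5, 1.2), (20, 10.5), (60, 8.0), (120, 45.0), (300, 60.0), (450, 210.0)]
medals = [0, 1, 3, 8, 25, 40]

intercept, b_participants, b_population = fit_multiple(rows, medals)
print(intercept, b_participants, b_population)
```

Dropping a non-significant predictor simply means refitting with fewer columns in `rows`, which is why the intercept can change noticeably between the two models.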

Only one thing remains to achieve all the purposes of this study: the prediction model.

With the multiple linear regression, we have created a prediction function. It should be able
to predict the results for each country from its population and participants. However, to check
whether it works, we will try to predict the medals obtained by a country at the previous
Olympic Games.
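
As a simplified Python sketch of a prediction with a 95% interval, the example below uses a single predictor (participants) instead of the two used in the study, and computes a prediction interval for an individual country; the text does not say whether the RStudio output is a confidence or a prediction interval. The data, the function name and the hard-coded critical value are assumptions made for illustration.

```python
import math


def predict_with_interval(xs, ys, x_new, t_crit):
    """Point prediction and prediction interval for x_new in simple regression.

    t_crit is the Student t critical value for n - 2 degrees of freedom; it is
    passed in because the standard library has no t distribution.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    resid_sd = math.sqrt(ss_res / (n - 2))
    prediction = intercept + slope * x_new
    half_width = t_crit * resid_sd * math.sqrt(
        1 + 1 / n + (x_new - mean_x) ** 2 / sxx
    )
    return prediction, prediction - half_width, prediction + half_width


# Made-up values for eight hypothetical countries, not the study's data.
participants = [2, 5, 10, 20, 50, 100, 200, 400]
medals = [0, 0, 1, 1, 4, 9, 20, 45]

# 2.447 is the two-sided 5% critical value of Student's t with 6 degrees of freedom.
near = predict_with_interval(participants, medals, 60, 2.447)
far = predict_with_interval(participants, medals, 1000, 2.447)
print(near)
print(far)
```

The interval widens as the new value moves away from the mean of the observed participants, so predictions for unusually large delegations are less reliable than those for typical ones.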

The Olympic Games before Tokyo (Rio 2016) were the last ones of one of the greatest athletes
of all time, Usain Bolt. In his honour, the country selected to check whether this prediction
model works is Jamaica.

Figure 9. Medals obtained by Jamaica in the 2016 Olympic Games

Although the estimated value, 13.25436, does not match the real value, 11 lies within the
confidence interval or very close to it. Therefore, the predictor seems to be quite accurate.

Figure 10. Jamaica medals prediction

V. CONCLUSIONS

The main objective of this study was to determine whether the number of participants had an
impact on the medals won by a country. It has been shown that they are highly correlated, both
individually and in the multiple linear regression.

Not only have we found that participants are an important factor, but population is as well. In
both cases the computed p-value was very close to 0, so the relationship in the multiple
regression was very strong. In addition, the adjusted R-squared was very high, meaning that a
high percentage of the variance in the data could be explained by this multiple regression.
Furthermore, the intercept was realistic, as it was below 0, which is consistent with the fact
that no medals can be won without participants and population. Taking all this into account, I
can affirm that I have found a very strong correlation.

However, the prediction did not work perfectly. Although the Jamaica prediction was quite
accurate, further predictions (not included in this work) showed that the predictor was not
accurate for large numbers of participants. This is because, as seen in the boxplot, most
countries had few participants. Therefore, when we try to predict the medals of a country with
many participants, the predicted number of medals is significantly higher than the real one.
Large participant counts are effectively out of range for this model, as only a few countries
had many participants compared with those that had few.

One way to improve this work would be to analyse more than one Olympic Games. The accuracy and
reliability of the study would then be clearly higher, as more data would be available. Taking
all the data from a single Olympic Games has the problem that the variance can be very high, as
results can change significantly from one edition to another. Analysing several Olympic Games
would reduce this variance.

VI. REFERENCES

International Olympic Committee. (2022). Olympics. Retrieved from https://olympics.com/es

Worldometers. (2022). worldometers. Retrieved from https://www.worldometers.info/

UPF Probability and Statistics lectures
