Semana1 Tarea
A previous course already worked with the Yelp data set, so I found it more interesting to
work with a new one and chose "SportsStats". According to the information provided,
SportsStats is a sports analysis firm partnering with local news and elite personal trainers to
provide "interesting" insights to help their partners. Insights could be patterns or trends
highlighting certain groups, events, countries, etc., for the purpose of developing a news story
or discovering key health insights. I chose it because I love sports and because the
characteristics of its data suit the work to be done.
Describe the steps you took to import and clean the data.
First I download the data set and import it into Jupyter. To do this I use pandas to read the
CSV files and store the whole set in a MySQL database. As for data cleaning, I am not sure
that I will do any, since the data set already has surrogate values for empty fields.
First install pandasql with the following instruction, if you have not installed it already:
In [ ]:
pip install pandasql
In [7]:
import pandas as pd
athlete_events_ds = pd.read_csv("athlete_events.csv")
noc_regions_ds = pd.read_csv("noc_regions.csv")
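The import and cleaning steps described above could be sketched as follows. This is only a minimal illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the MySQL database mentioned in the text, the table name `athlete_events` and the helper `clean_and_store` are hypothetical, and the three-row frame is made up so the example runs without the CSV files.

```python
import sqlite3

import pandas as pd


def clean_and_store(events: pd.DataFrame, conn: sqlite3.Connection) -> pd.DataFrame:
    """Drop exact duplicate rows, report missing values, and save the table."""
    cleaned = events.drop_duplicates().reset_index(drop=True)
    # Count missing values per column so the gaps are visible before analysis.
    print(cleaned.isna().sum())
    # "athlete_events" is a hypothetical table name chosen for this sketch.
    cleaned.to_sql("athlete_events", conn, if_exists="replace", index=False)
    return cleaned


# Made-up rows standing in for athlete_events.csv (illustrative only).
sample = pd.DataFrame(
    {
        "Sex": ["M", "F", "M"],
        "Age": [24.0, None, 24.0],
        "Year": [1992, 2000, 1992],
    }
)

# SQLite in memory is an assumption here; the text itself mentions MySQL.
conn = sqlite3.connect(":memory:")
cleaned = clean_and_store(sample, conn)
stored = pd.read_sql_query("SELECT COUNT(*) AS n FROM athlete_events", conn)
```

With the real data, `sample` would be replaced by `athlete_events_ds`, and the connection by one that points at the actual database.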
Perform initial exploration of data and provide some screenshots or display some
stats of the data you are looking at.
In [11]:
import pandas as pd
df = pd.read_csv('athlete_events.csv')
target = "Age"
df[target].plot(kind='hist')
# The histogram suggests that most athletes are between 20 and 30 years old.
Out[11]: <AxesSubplot:ylabel='Frequency'>
In [35]:
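import matplotlib.pyplot as plt
%matplotlib inline
# Count the non-null "Year" entries for each sex.
group = df.groupby("Sex").count().reset_index()
sizes = group["Year"]
labels = group["Sex"]
# Assumed intent: a pie chart, since sizes and labels match the arguments of plt.pie.
plt.pie(sizes, labels=labels, autopct="%1.1f%%")
plt.show()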
#We can appreciate the diversity that exists in general lines in terms of participat
Create an ERD or proposed ERD to show the relationships of the data you are
exploring.
It is a referential ERD
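The relationship between the two files can be illustrated with a join. The sketch below assumes that both tables share an `NOC` column and that the regions table has a `region` column; neither column name is confirmed in this notebook, and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; the real tables come from athlete_events.csv and noc_regions.csv.
events = pd.DataFrame({"Name": ["A", "B", "C"], "NOC": ["CHI", "USA", "CHI"]})
regions = pd.DataFrame({"NOC": ["CHI", "USA"], "region": ["Chile", "USA"]})

# Many athlete rows point to one region row. validate="many_to_one" makes
# pandas raise an error if NOC is not unique in the regions table.
joined = events.merge(regions, on="NOC", how="left", validate="many_to_one")
```

A left join keeps every athlete row even when its code has no match in the regions table, which makes unmatched codes easy to spot as missing values.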
Next, I will review and analyze the information in the SportsStats data set. The objectives I
consider relevant are to identify the relationship between age, Games and country of
origin, and likewise participation by gender broken down by sport. I think fans of the Olympic
Games, or people doing research related to the Games, would be
interested.
Questions
How has the participation of men and women changed over time?
Hypothesis
Yes, I think it has changed over time.
It is likely that they do, having participants with an age range between 20 and 30 years
of age.
Probably yes
Approach
Display a histogram plot
Calculate Pearson's correlation coefficient.
Calculate the standard deviation in the progress of the country over time.
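The last two steps of the approach could be sketched as follows for the question about participation of men and women over time. This is a minimal illustration, not the final analysis: the helper `participation_by_year` is hypothetical, the rows are made up, and the standard deviation is applied here to the female share per year, whereas a per-country version would group by country first.

```python
import pandas as pd


def participation_by_year(df: pd.DataFrame) -> pd.DataFrame:
    """Count rows per Year and Sex, with one column per sex."""
    return df.groupby(["Year", "Sex"]).size().unstack(fill_value=0)


# Made-up rows standing in for athlete_events.csv (illustrative only).
sample = pd.DataFrame(
    {
        "Year": [1900, 1900, 1900, 1950, 1950, 1950, 2000, 2000, 2000, 2000],
        "Sex": ["M", "M", "M", "M", "M", "F", "M", "M", "F", "F"],
    }
)

counts = participation_by_year(sample)
# Share of rows per year that belong to women.
female_share = counts["F"] / counts.sum(axis=1)

# Pearson's correlation between the year and the female share.
# Series.corr uses the Pearson method by default.
years = counts.index.to_series().astype(float)
r = years.corr(female_share)

# Sample standard deviation of the female share across years.
spread = female_share.std()
```

A value of `r` close to 1 would support the hypothesis that the female share grows over time, while `spread` indicates how much that share varies between editions of the Games.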