
Preparing for Your Proposal

Which client/dataset did you select and why?

A previous course already worked with the Yelp dataset, so I found it more interesting
to work with a new one and chose "SportsStats". According to the information provided,
SportsStats is a sports analysis firm partnering with local news and elite personal trainers to
provide “interesting” insights to help their partners. Insights could be patterns or trends
highlighting certain groups, events, countries, etc., for the purpose of developing a news story
or discovering key health insights. I chose it because I love sports and because the
characteristics of its data are well suited to the work to be done.

Describe the steps you took to import and clean the data.

First I download the data set and then import it into Jupyter. To do this I use pandas to read
the CSV files and store the whole set in a MySQL database. As for data cleaning, I am not sure
any is needed, since the data set uses surrogate values for empty fields.

If you have not installed pandasql before, install it first with the following command:

In [ ]:
pip install pandasql

In [7]:
import pandas as pd

from pandasql import sqldf

pysqldf = lambda q: sqldf(q, globals())

athlete_events_ds = pd.read_csv("athlete_events.csv")

noc_regions_ds = pd.read_csv("noc_regions.csv")
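Whether cleaning is needed can be checked by counting missing values per column. The following is a minimal sketch, not the notebook's actual cleaning step. The `sample` frame and its rows are hypothetical stand-ins for `athlete_events_ds`, and treating a missing `Medal` as "No medal" is an assumption. By default, `pd.read_csv` parses placeholder strings such as `NA` as missing values, so the same check should apply to the loaded CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; the notebook would use athlete_events_ds instead.
sample = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Age": [24.0, np.nan, 31.0, 27.0],
        "Height": [180.0, 170.0, np.nan, np.nan],
        "Medal": ["Gold", None, None, "Bronze"],
    }
)

# Count missing entries per column to decide whether cleaning is needed.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Assumption: a missing Medal means no medal was won, so label it explicitly.
medal_filled = sample["Medal"].fillna("No medal")
print(medal_filled.value_counts())
```

Numeric columns such as `Age` are left untouched here, because filling them with an invented value would distort later statistics.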

Perform initial exploration of data and provide some screenshots or display some
stats of the data you are looking at.

Screenshots of the dataset in Excel

Statistical plots of some data in the dataset

In [11]:
import pandas as pd

df = pd.read_csv('athlete_events.csv')

target = "Age"

df[target].plot(kind='hist')

# The histogram shows that most athletes are between 20 and 30 years old.

Out[11]: <AxesSubplot:ylabel='Frequency'>
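The reading of the histogram can be backed with numbers. The sketch below uses a small hypothetical `ages` series in place of `df["Age"]`, so the printed figures are illustrative only and say nothing about the real data set.

```python
import pandas as pd

# Hypothetical ages standing in for df["Age"] from athlete_events.csv.
ages = pd.Series([19, 22, 23, 24, 25, 26, 27, 29, 33, 41, None], name="Age")

# Summary statistics (count, mean, quartiles); missing values are skipped.
print(ages.describe())

# Share of athletes with a known age between 20 and 30, bounds included.
known = ages.dropna()
share_20_30 = known.between(20, 30).mean()
print(f"Share aged 20-30: {share_20_30:.0%}")
```

In the notebook, replacing `ages` with `df["Age"]` would give the corresponding figures for the full data set.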

In [35]:
import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

colors = ["#1f77b4", "#ff7f0e"]

group = df.groupby("Sex").count().reset_index()
sizes = group["Year"]

labels = group["Sex"]

plt.pie(sizes, labels = labels, colors = colors)

plt.show()

#We can appreciate the diversity that exists in general lines in terms of participat

Create an ERD or proposed ERD to show the relationships of the data you are
exploring.

It is a referential ERD
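The relationship between the two files can also be checked in code. This is a sketch under assumptions: the column names `NOC` and `region` are taken from the public Olympic data set these files appear to come from, and the rows below are hypothetical. It treats `noc_regions` as the "one" side and `athlete_events` as the "many" side of the relationship.

```python
import pandas as pd

# Hypothetical rows; column names are assumed, not confirmed by the notebook.
athlete_events = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "Name": ["A", "B", "C"],
        "NOC": ["USA", "FRA", "XXX"],
    }
)
noc_regions = pd.DataFrame({"NOC": ["USA", "FRA"], "region": ["USA", "France"]})

# Many-to-one join: each athlete row points to at most one region row.
# validate raises an error if NOC is duplicated in noc_regions.
merged = athlete_events.merge(
    noc_regions, on="NOC", how="left", validate="many_to_one", indicator=True
)
print(merged)

# Athlete rows whose NOC code has no match in noc_regions.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["NOC"].tolist())
```

A non-empty `unmatched` result would point to NOC codes that need attention before the tables are joined for analysis.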

Develop Project Proposal


Description

This project reviews and analyzes the information in the SportsStats data set. The objectives
I consider relevant are to identify the relationship between age, Games and country of
origin, as well as participation by gender broken down by sport. I think fans of the Olympic
Games, or people doing research related to the Games, would be interested.

Questions

How has the participation of men and women changed over time?

Is there any relationship between athletes' age and the medals they win?

Does the performance of countries vary by season?

Hypothesis
Yes, I think participation has changed over time.
Probably yes, given that most participants are between 20 and 30 years of age.
Probably yes.

Approach
Display a histogram plot
Calculate Pearson's correlation coefficient.
Calculate the standard deviation in the progress of the country over time.
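
The three steps above can be sketched with pandas as follows. This is one possible reading of the approach, not a finished analysis. The rows are hypothetical, and the column names (`Year`, `Sex`, `Age`, `NOC`, `Medal`) are assumed. Since `Medal` is categorical, the sketch assumes it is first encoded as a 0/1 indicator before Pearson's coefficient is calculated.

```python
import pandas as pd

# Hypothetical rows standing in for athlete_events.csv.
df = pd.DataFrame(
    {
        "Year": [1996, 1996, 1996, 2000, 2000, 2000, 2004, 2004],
        "Sex": ["M", "M", "F", "M", "F", "F", "M", "F"],
        "Age": [22, 30, 25, 27, 24, 35, 21, 28],
        "NOC": ["USA", "FRA", "USA", "USA", "FRA", "USA", "FRA", "USA"],
        "Medal": ["Gold", None, None, "Silver", "Gold", None, None, "Bronze"],
    }
)

# Step 1: entries per year and sex, ready to plot, e.g. with .plot(kind="bar").
by_year_sex = df.groupby(["Year", "Sex"]).size().unstack(fill_value=0)
print(by_year_sex)

# Step 2: Pearson correlation between age and a 0/1 medal indicator.
won_medal = df["Medal"].notna().astype(int)
age_medal_r = df["Age"].corr(won_medal)  # method defaults to "pearson"
print(age_medal_r)

# Step 3: standard deviation of each country's medal count across years.
medals_by_year = (
    df.assign(won=won_medal)
    .groupby(["NOC", "Year"])["won"]
    .sum()
    .unstack(fill_value=0)
)
medal_std = medals_by_year.std(axis=1)
print(medal_std)
```

A larger standard deviation in step 3 means a country's medal count changes more from one edition to the next. Grouping by a season column instead of `Year` would address the third question more directly, if such a column is available.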
