DS100-1
STATISTICAL THINKING IN PYTHON
APPLIED DATA SCIENCE
Name:
Write the code in a Jupyter notebook as required by each problem. Copy both the code and its output as a screenshot and paste
them here.
1 Which of the following conclusions could you draw from the following bee swarm plot of iris petal lengths?
2 Create a function that computes the empirical cumulative distribution function (ECDF) of an array. Use the function to compute the ECDFs
of the three species of Iris (you will need the following datasets: setosa_sepal_length.csv,
versicolor_sepal_length.csv, and virginica_sepal_length.csv). Plot the ECDFs on a single axis.
Code and Output
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# The files appear to have no header row (the first measurement, e.g. "5.1",
# was being read as the column name), so read with header=None to keep every value.
setosa_sepal_length = pd.read_csv("setosa_sepal_length(1).csv", header=None)[0]
versicolor_sepal_length = pd.read_csv("versicolor_sepal_length(1).csv", header=None)[0]
virginica_sepal_length = pd.read_csv("virginica_sepal_length(1).csv", header=None)[0]

def ecdf(data):
    """Compute the ECDF of a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)            # measurements in ascending order
    y = np.arange(1, n + 1) / n  # fraction of points <= each x
    return x, y

x_set, y_set = ecdf(setosa_sepal_length)
x_vers, y_vers = ecdf(versicolor_sepal_length)
x_vir, y_vir = ecdf(virginica_sepal_length)

# Plot the three ECDFs as points on a single axis
plt.plot(x_set, y_set, marker='.', linestyle='none')
plt.plot(x_vers, y_vers, marker='.', linestyle='none')
plt.plot(x_vir, y_vir, marker='.', linestyle='none')
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
plt.xlabel('sepal length (cm)')
plt.ylabel('ECDF')
plt.show()
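As a quick sanity check of the ecdf function, the sketch below runs it on a small hand-made array. The toy values are hypothetical and are not taken from the iris files; the point is only that x comes back sorted and y rises in equal steps of 1/n up to 1.

```python
import numpy as np

def ecdf(data):
    """Compute the ECDF of a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)            # measurements in ascending order
    y = np.arange(1, n + 1) / n  # fraction of points <= each x
    return x, y

# Hypothetical toy measurements, deliberately unsorted
toy = np.array([3.0, 1.0, 2.0, 4.0])
x_toy, y_toy = ecdf(toy)

print(x_toy)  # sorted values: 1, 2, 3, 4
print(y_toy)  # cumulative fractions: 0.25, 0.5, 0.75, 1.0
```

Reading the result: y_toy[1] is 0.5 because half of the values are less than or equal to x_toy[1], which is 2.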
3 Without plotting the data, determine the 25th, 50th, and 75th percentiles of sepal length for each of the three iris species.
Code and Output
import numpy as np
import pandas as pd

# The files appear to have no header row (the first measurement, e.g. "5.1",
# was being read as the column name), so read with header=None to keep every value.
setosa = pd.read_csv("setosa_sepal_length(1).csv", header=None)[0]
versi = pd.read_csv("versicolor_sepal_length(1).csv", header=None)[0]
virg = pd.read_csv("virginica_sepal_length(1).csv", header=None)[0]

# 25th, 50th (median), and 75th percentiles of each species
percentiles = [25, 50, 75]
print("Versicolor: ", np.percentile(versi, percentiles))
print("Setosa: ", np.percentile(setosa, percentiles))
print("Virginica: ", np.percentile(virg, percentiles))
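To see what np.percentile returns, here is a minimal sketch on a hypothetical five-value array (not the iris data). It assumes NumPy's default linear interpolation, and it checks that the 50th percentile agrees with np.median.

```python
import numpy as np

# Hypothetical toy measurements, deliberately unsorted
values = np.array([5.0, 1.0, 4.0, 2.0, 3.0])

# np.percentile sorts internally and interpolates linearly by default
q25, q50, q75 = np.percentile(values, [25, 50, 75])
print(q25, q50, q75)  # 2.0 3.0 4.0

# The 50th percentile is the median
median = np.median(values)
print(q50 == median)

# The interquartile range is the spread of the middle half of the data
iqr = q75 - q25
print(iqr)
```

The same three numbers mark where an ECDF of the data crosses heights of roughly 0.25, 0.5, and 0.75, which is why they can be found without plotting.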
4 Let’s say a bank made 100 mortgage loans. It is possible that anywhere between 0 and 100 of the loans will be defaulted
upon. We would like to know the probability of getting a given number of defaults, given that the probability of a default is
0.05. Draw 10,000 samples of this binomial distribution and plot the CDF using our ecdf function. Do not forget to use
np.random.seed(42).
Code and Output
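import numpy as np
import matplotlib.pyplot as plt

def ecdf(data):
    """Compute the ECDF of a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y

# Seed the generator before drawing; seeding after the draw has no effect on the samples
np.random.seed(42)

# 10,000 draws of the number of defaults among 100 loans with p = 0.05
samples = np.random.binomial(100, 0.05, size=10000)

x, y = ecdf(samples)
plt.plot(x, y, marker='.', linestyle='none')
plt.xlabel('number of defaults out of 100 loans')
plt.ylabel('CDF')
plt.show()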
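One way to check the simulated CDF is to compare it with the exact binomial CDF at a single point. The sketch below is an illustration under stated assumptions: the helper binom_cdf and the cut-off k = 5 are hypothetical choices made here and are not part of the assignment.

```python
import math
import numpy as np

n_loans, p_default = 100, 0.05

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p), summed from the PMF."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Seed first so the draw is reproducible
np.random.seed(42)
samples = np.random.binomial(n_loans, p_default, size=10000)

k = 5  # hypothetical cut-off: at most 5 defaults
empirical = np.mean(samples <= k)          # fraction of simulated draws with <= k defaults
exact = binom_cdf(k, n_loans, p_default)   # value from the binomial formula

print("empirical P(X <= 5):", empirical)
print("exact P(X <= 5):", exact)
print("sample mean:", samples.mean(), "expected n*p:", n_loans * p_default)
```

With 10,000 draws the two probabilities should agree to within a few hundredths, and the sample mean should sit close to n*p = 5.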