
Worksheet 3.7
DS100-1
STATISTICAL THINKING IN PYTHON
APPLIED DATA SCIENCE
Name:

Enrico, Dionne Marc L. Page 1 of 4

Write code in a Jupyter notebook as required by each problem. Copy both the code and its output as a screenshot and paste
them here.

1 Which of the following conclusions could you draw from the following bee swarm plot of iris petal lengths?

A. All I. versicolor petals are shorter than I. virginica petals.
B. I. setosa petals have a broader range of lengths than the other two species.
C. I. virginica petals tend to be the longest, and I. setosa petals tend to be the shortest of the three species.
D. I. versicolor is a hybrid of I. virginica and I. setosa.

Answer: C

2 Create a function that calculates the empirical cumulative distribution function (ECDF) of an array. Use the function to
calculate the ECDFs of the three species of Iris (you will need the following datasets: setosa_sepal_length.csv,
versicolor_sepal_length.csv, and virginica_sepal_length.csv). Plot the ECDFs on a single axis.
Code and Output

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# The CSV files appear to have no header row (the first measurement was being
# read as the column name), so read them with header=None and take column 0.
setosa_sepal_length = pd.read_csv("setosa_sepal_length(1).csv", header=None)[0]
versicolor_sepal_length = pd.read_csv("versicolor_sepal_length(1).csv", header=None)[0]
virginica_sepal_length = pd.read_csv("virginica_sepal_length(1).csv", header=None)[0]

def ecdf(data):
    """Compute the ECDF of a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)              # sorted values for the x-axis
    y = np.arange(1, n + 1) / n    # fraction of points <= each x
    return x, y

x_set, y_set = ecdf(setosa_sepal_length)
x_vers, y_vers = ecdf(versicolor_sepal_length)
x_vir, y_vir = ecdf(virginica_sepal_length)

# Plot all three ECDFs on a single axis
plt.plot(x_set, y_set, marker='.', linestyle='none')
plt.plot(x_vers, y_vers, marker='.', linestyle='none')
plt.plot(x_vir, y_vir, marker='.', linestyle='none')
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
plt.xlabel('sepal length (cm)')
plt.ylabel('ECDF')
plt.show()
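
As a quick sanity check on the ecdf function, the sketch below runs it on a small made-up array (the values are hypothetical and are not taken from the iris files). With four points, each sorted value should step the ECDF up by 1/4.

```python
import numpy as np

def ecdf(data):
    """Return sorted values x and cumulative fractions y for a 1-D array."""
    x = np.sort(np.asarray(data))
    n = len(x)
    y = np.arange(1, n + 1) / n
    return x, y

# Hypothetical toy measurements, used only to illustrate the function.
toy = [5.0, 4.0, 6.0, 4.5]
x_toy, y_toy = ecdf(toy)

print(x_toy.tolist())  # [4.0, 4.5, 5.0, 6.0]
print(y_toy.tolist())  # [0.25, 0.5, 0.75, 1.0]
```

Reading the result: 75% of the toy values are less than or equal to 5.0, which is what the third (x, y) pair says.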

3 Without plotting the data, determine the 25th, 50th and 75th percentiles of the three iris species.
Code and Output

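import pandas as pd
import numpy as np

# The CSV files appear to have no header row (the first measurement was being
# read as the column name), so read them with header=None and take column 0.
setosa = pd.read_csv("setosa_sepal_length(1).csv", header=None)[0]
versi = pd.read_csv("versicolor_sepal_length(1).csv", header=None)[0]
virg = pd.read_csv("virginica_sepal_length(1).csv", header=None)[0]

# 25th, 50th and 75th percentiles of sepal length for each species
percentiles = [25, 50, 75]
versicolor_ptiles = np.percentile(versi, percentiles)
setosa_ptiles = np.percentile(setosa, percentiles)
virginica_ptiles = np.percentile(virg, percentiles)

print("Versicolor: ", versicolor_ptiles)
print("Setosa: ", setosa_ptiles)
print("Virginica: ", virginica_ptiles)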
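
To see what np.percentile is doing, here is a minimal sketch on a made-up array of four values (hypothetical data, not from the iris files). By default NumPy interpolates linearly between neighboring sorted values; the helper linear_percentile is a name introduced here only to reproduce that rule by hand.

```python
import numpy as np

# Hypothetical toy data, already sorted for readability.
values = np.array([1.0, 2.0, 3.0, 4.0])

# NumPy's default method is linear interpolation between sorted neighbors.
quartiles = np.percentile(values, [25, 50, 75])

def linear_percentile(sorted_vals, q):
    """Hand-rolled version of the default rule: position q/100 * (n - 1)."""
    pos = q / 100 * (len(sorted_vals) - 1)
    lo = int(np.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])

manual = [linear_percentile(values, q) for q in (25, 50, 75)]

print(quartiles)  # values 1.75, 2.5, 3.25
print(manual)
```

The 50th percentile is the median, so np.median(values) gives the same middle number.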

4 Let’s say a bank made 100 mortgage loans. It is possible that anywhere between 0 and 100 of the loans will be defaulted
upon. We would like to know the probability of getting a given number of defaults, given that the probability of a default is
0.05. Draw 10,000 samples of this binomial distribution and plot the CDF using our ecdf function. Do not forget to use
np.random.seed(42).
Code and Output

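import numpy as np
import matplotlib.pyplot as plt

def ecdf(data):
    """Compute the ECDF of a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y

# Seed the random number generator BEFORE drawing, so the samples are reproducible
np.random.seed(42)

# 10,000 draws of the number of defaults among 100 loans, each with p = 0.05
samples = np.random.binomial(100, 0.05, size=10000)

x, y = ecdf(samples)
plt.plot(x, y, marker='.', linestyle='none')
plt.xlabel('number of defaults out of 100 loans')
plt.ylabel('CDF')
plt.show()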
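
One way to check the simulated CDF without a plot is to compare it with the exact binomial CDF at a few points. The sketch below assumes the same setup as the problem (100 loans, default probability 0.05, seed 42); binom_cdf and the chosen check points 2, 5 and 10 are introduced here only for illustration.

```python
import numpy as np
from math import comb

n_loans, p_default = 100, 0.05

np.random.seed(42)  # seed first so the draws are reproducible
samples = np.random.binomial(n_loans, p_default, size=10000)

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p), summed from the PMF."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Compare empirical and exact CDF values at a few illustrative points.
comparison = {}
for k in (2, 5, 10):
    empirical = float(np.mean(samples <= k))   # fraction of draws <= k
    exact = binom_cdf(k, n_loans, p_default)
    comparison[k] = (empirical, exact)
    print(k, round(empirical, 3), round(exact, 3))
```

With 10,000 draws the two columns should agree to roughly two decimal places; the expected number of defaults is 100 × 0.05 = 5, so the CDF rises most steeply around 5.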
