By Prajwal Singh
3. Perform basic EDA which should include the following and print out your insights at every step.
This is the distribution of BMI. Most people have a BMI close to 30.
Below is the distribution of ‘Age’ and ‘charges’ in our data.
f. Measure of skewness of ‘bmi’, ‘age’ and ‘charges’ columns
print('Age:', data['age'].skew(), 'BMI:', data['bmi'].skew(), 'Charges:', data['charges'].skew())
Age: 0.05567251565299186 BMI: 0.2840471105987448 Charges: 1.5158796580240388
Charges are strongly right-skewed (skewness ≈ 1.52), which the plot also shows. BMI is only mildly right-skewed (≈ 0.28). The age distribution is close to uniform, so its skewness is near zero (≈ 0.06).
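As a rough illustration of how the skewness value relates to shape, the sketch below uses two made-up columns (hypothetical toy values, not the insurance data): a symmetric column gives a skewness of zero, while a column with one large value on the right gives a clearly positive skewness.

```python
import pandas as pd

# Hypothetical toy values, NOT the insurance data used in this report.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],        # mirror image around its mean
    "right_skewed": [1, 1, 2, 2, 50],    # one large value stretches the right tail
})

# DataFrame.skew() returns the sample skewness of each numeric column.
skewness = toy.skew()
print(skewness)
```

A value near zero indicates a roughly symmetric distribution; a large positive value indicates a long right tail, as seen for charges.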
i. Pair plot that includes all the columns of the data frame
sns.pairplot(data)
However, this does not plot the categorical columns. To include every column, we first need to encode the categorical columns. Plotting after encoding gives the following result:
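The report does not show which encoding was used. One possible approach, sketched here on a few hypothetical rows that only mimic the column names, is to replace each non-numeric column with integer category codes:

```python
import pandas as pd

# Hypothetical rows that mimic the column names used in this report.
toy = pd.DataFrame({
    "age": [19, 33, 45],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 22.7, 30.1],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northwest", "southeast"],
})

encoded = toy.copy()
# Replace every non-numeric column with integer category codes
# (categories are numbered in alphabetical order).
for col in encoded.select_dtypes(exclude="number").columns:
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
# Passing `encoded` to sns.pairplot would now include every column.
```

Integer codes impose an arbitrary order on a column such as region, so they are only suitable here as a quick way to get every column onto the plot.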
4. Answer the following questions with statistical evidence
a. Do charges of people who smoke differ significantly from the people who don't?
From the graph, we can infer that the charges of smokers are much higher than those of non-smokers.
A t-test confirms this: the charges of smokers and non-smokers are not the same, since p_value (8.271435842179102e-283) < 0.05.
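The test code for this question is not shown in the report. A minimal sketch of a two-sample t-test, assuming scipy's ttest_ind with its default equal-variance setting and using synthetic charges rather than the real data, could look like this:

```python
import numpy as np
from scipy import stats

# Synthetic charges, NOT the assignment's data; the means and spreads are made up.
rng = np.random.default_rng(0)
smoker_charges = rng.normal(loc=32000, scale=11000, size=200)
non_smoker_charges = rng.normal(loc=8500, scale=6000, size=800)

# Two-sample t-test. H0: both groups have the same mean charges.
t_stat, p_value = stats.ttest_ind(smoker_charges, non_smoker_charges)

if p_value < 0.05:  # significance level of 5%
    print(f"Reject H0: mean charges differ (p = {p_value:.3g})")
else:
    print(f"Fail to reject H0 (p = {p_value:.3g})")
```

On the real data, the two arrays would be the charges column filtered by the smoker column. If the two groups have very different variances, passing equal_var=False (Welch's t-test) may be the safer choice.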
The plot shows no clear difference between the BMI of males and females.
A t-test agrees: gender has no significant effect on BMI, since p_value (0.09) > 0.05.
c. Is the proportion of smokers significantly different in different genders?
The plot shows that a larger proportion of males than females are smokers.
A chi-square test confirms that gender has an effect on smoking habits, since p_value (0.007) < 0.05.
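The chi-square code is not shown in the report. A minimal sketch, assuming a contingency table built with pd.crosstab and tested with scipy's chi2_contingency, on hypothetical counts rather than the real data:

```python
import pandas as pd
from scipy import stats

# Hypothetical counts, NOT the assignment's data:
# 40 of 100 males smoke, 20 of 100 females smoke.
toy = pd.DataFrame({
    "sex": ["male"] * 100 + ["female"] * 100,
    "smoker": ["yes"] * 40 + ["no"] * 60 + ["yes"] * 20 + ["no"] * 80,
})

# Contingency table of sex versus smoking status.
table = pd.crosstab(toy["sex"], toy["smoker"])

# Chi-square test of independence. H0: smoking status is independent of sex.
chi2, p_value, dof, expected = stats.chi2_contingency(table)

if p_value < 0.05:  # significance level of 5%
    print(f"Reject H0: smoking is associated with sex (p = {p_value:.3g})")
else:
    print(f"Fail to reject H0 (p = {p_value:.3g})")
```

The test compares the observed counts in the table with the counts expected if sex and smoking status were independent.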
d. Is the distribution of bmi across women with no children, one child and two children the same?
import copy
from scipy import stats

Ho = "children has no effect on bmi"   # Null Hypothesis
Ha = "children has an effect on bmi"   # Alternate Hypothesis
# Keep only the female rows, then split bmi by number of children
female_data = copy.deepcopy(data[data['sex'] == 'female'])
zero = female_data[female_data.children == 0]['bmi']
one = female_data[female_data.children == 1]['bmi']
two = female_data[female_data.children == 2]['bmi']
# One-way ANOVA across the three groups
f_stat, p_value = stats.f_oneway(zero, one, two)
if p_value < 0.05:  # Setting our significance level at 5%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.05')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.05')
We got p_value > 0.05, so we fail to reject the null hypothesis: BMI does not depend significantly on the number of children.