
Statistical Learning Project

By Prajwal Singh

1. Import the necessary libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import copy
import scipy.stats as stats

2. Read the data as a data frame


data = pd.read_csv(r'G:\Downloads personal\insurance.csv')  # raw string so the backslashes are not treated as escapes

3. Perform basic EDA which should include the following and print out your insights at every step.

a. Shape of the data


data.shape
(1338, 7)
There are 1338 rows and 7 columns in our dataset.

b. Data type of each attribute


data.info()
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)

c. Checking the presence of missing values


data.isnull().sum()
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
There are no missing values in any of the columns.
d. 5 point summary of numerical attributes
data.describe()
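data.describe() returns the 5 point summary (minimum, first quartile, median, third quartile and maximum) of each numerical column, along with the count, mean and standard deviation.
We can also use a boxplot from seaborn to get a graphical view of the 5 point summary.
Below is the boxplot for age.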
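
The five values can also be pulled out directly with quantile(). The sketch below uses a small hypothetical series in place of the real 'age' column (the values are made up for illustration) and checks that describe() reports the same quartiles.

```python
import pandas as pd

# Hypothetical stand-in for data['age']; the real values come from insurance.csv.
ages = pd.Series([18, 19, 23, 27, 31, 39, 45, 52, 60, 64], name="age")

# The 5 point summary: minimum, first quartile, median, third quartile, maximum.
five_point = ages.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
five_point.index = ["min", "Q1", "median", "Q3", "max"]
print(five_point)

# describe() reports the same five values plus count, mean and std.
summary = ages.describe()
print(summary)
```

On the real data frame the same call would be data['age'].quantile([...]).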

e. Distribution of ‘bmi’, ‘age’ and ‘charges’ columns.


We can use a histogram to view the distribution of the above columns, using seaborn's displot (or the older, now deprecated, distplot).

This is the distribution of BMI. We can see that most people have a BMI close to 30.
Below are the distributions of ‘age’ and ‘charges’ in our data.

f. Measure of skewness of ‘bmi’, ‘age’ and ‘charges’ columns
print('Age:', data['age'].skew(), 'BMI:', data['bmi'].skew(), 'Charges:', data['charges'].skew())
Age: 0.05567251565299186 BMI: 0.2840471105987448 Charges: 1.5158796580240388
Charges are highly right-skewed (positive skewness), as we can also see from the plot. The age distribution is
close to uniform, hence its skewness is very small.
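
To make the sign convention concrete, here is a minimal sketch on made-up values (not the insurance data). A symmetric sample has skewness 0 and a long right tail gives a positive value. It also checks that pandas' skew() agrees with scipy.stats.skew when the bias correction is enabled.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical samples, chosen only to illustrate the sign of the skewness.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 10.0, 25.0])

symmetric_skew = symmetric.skew()        # 0 for a perfectly symmetric sample
pandas_skew = right_tailed.skew()        # positive: long tail on the right

# pandas uses the bias-corrected estimator, i.e. scipy's skew with bias=False.
scipy_skew = stats.skew(right_tailed.to_numpy(), bias=False)

print(symmetric_skew, pandas_skew, scipy_skew)
```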

g. Checking the presence of outliers in ‘bmi’, ‘age’ and ‘charges’ columns


Number of outliers in bmi: 63
Number of outliers in age: 0
Number of outliers in charges: 973
Hence there are no outliers in the ‘age’ column, while ‘charges’ has a high number of outliers and is
also highly skewed.
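
The report does not state which rule produced these counts. One common choice is the 1.5 × IQR rule that boxplots use, sketched below under that assumption. The helper count_iqr_outliers and the sample values are hypothetical, and a different rule (for example a z-score threshold) would give different counts.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up sample: Q1 = 13, Q3 = 17, so the fences are 7 and 23.
sample = pd.Series([10, 12, 13, 14, 15, 16, 17, 18, 95])
n_outliers = count_iqr_outliers(sample)
print(f"Number of outliers in sample: {n_outliers}")
```

On the real data the helper would be applied per column, e.g. count_iqr_outliers(data['bmi']).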

h. Distribution of categorical columns (include children)
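
As a minimal sketch, the frequencies behind count plots of the categorical columns can be tabulated with value_counts(). The toy frame below is hypothetical and only mirrors the column names of the dataset.

```python
import pandas as pd

# Hypothetical rows using the dataset's categorical column names.
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male", "female"],
    "smoker": ["no", "yes", "no", "no", "yes", "no"],
    "region": ["southwest", "southeast", "southeast",
               "northwest", "northeast", "southeast"],
    "children": [0, 1, 0, 2, 1, 0],
})

# One frequency table per categorical column (children is treated as categorical).
category_counts = {col: toy[col].value_counts() for col in toy.columns}

for col, counts in category_counts.items():
    print(f"--- {col} ---")
    print(counts)
```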

i. Pair plot that includes all the columns of the data frame
sns.pairplot(data)
However, this does not plot the categorical columns. We first need to encode them so that
all columns can be plotted. After encoding, the result is as follows:
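
The encoding step itself is not shown in the report. Since LabelEncoder is imported at the top, a plausible sketch, assuming each categorical column is label-encoded independently, looks like this (the toy rows are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows with the dataset's column names.
toy = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["no", "yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Encode on a copy so the original string labels stay available.
encoded = toy.copy()
for col in ["sex", "smoker", "region"]:
    # LabelEncoder maps the sorted unique labels to 0, 1, 2, ...
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

print(encoded)
```

Passing the encoded frame to sns.pairplot would then include every column. Note that the integer codes follow alphabetical order and carry no real ordering, so the scatter panels for encoded columns show group membership rather than magnitude.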

4. Answer the following questions with statistical evidence
a. Do charges of people who smoke differ significantly from the people who don't?

From the graph we can infer that the charges of smokers are much higher than those of
non-smokers.
After performing a t-test:
Charges of smokers and non-smokers are not the same, since p_value (8.271435842179102e-283) < 0.05
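
The test call is not shown in the report. A minimal sketch of an independent two-sample t-test with scipy.stats.ttest_ind follows. The two samples are randomly generated stand-ins, and equal_var=False (Welch's variant) is an assumption, since the report does not say whether equal variances were assumed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical charges; on the real data these would be
# data[data['smoker'] == 'yes']['charges'] and data[data['smoker'] == 'no']['charges'].
smoker_charges = rng.normal(loc=32000, scale=11000, size=50)
nonsmoker_charges = rng.normal(loc=8500, scale=6000, size=50)

# Welch's t-test: does not assume the two groups share a variance (assumption).
t_stat, p_value = stats.ttest_ind(smoker_charges, nonsmoker_charges, equal_var=False)

if p_value < 0.05:  # 5% significance level, as used elsewhere in this report
    print(f"Mean charges differ (t = {t_stat:.2f}, p = {p_value:.3g})")
else:
    print(f"No significant difference (t = {t_stat:.2f}, p = {p_value:.3g})")
```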

b. Does the BMI of males differ significantly from that of females?

The plot shows no notable difference between the BMI of males and females.
After performing a t-test we get: gender has no effect on BMI, since p_value (0.09) > 0.05

c. Is the proportion of smokers significantly different in different genders?

From the plot it is clear that a larger proportion of males are smokers compared to females.
The chi-square test agrees: gender has an effect on smoking habits, since
p_value (0.007) < 0.05
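
A chi-square test of independence works on a contingency table of the two categorical columns. The sketch below builds one with pd.crosstab and passes it to scipy.stats.chi2_contingency. The counts are hypothetical, not those of the insurance data.

```python
import pandas as pd
from scipy import stats

# Hypothetical sample: 100 females (20 smokers) and 100 males (40 smokers).
toy = pd.DataFrame({
    "sex": ["female"] * 100 + ["male"] * 100,
    "smoker": ["no"] * 80 + ["yes"] * 20 + ["no"] * 60 + ["yes"] * 40,
})

# Rows: sex, columns: smoker; each cell holds an observed count.
table = pd.crosstab(toy["sex"], toy["smoker"])
print(table)

# Returns the statistic, the p-value, the degrees of freedom and the expected counts.
chi2, p_value, dof, expected = stats.chi2_contingency(table)

if p_value < 0.05:
    print(f"Smoking proportion differs by gender (p = {p_value:.3g})")
else:
    print(f"No significant difference by gender (p = {p_value:.3g})")
```

On the real data the table would be pd.crosstab(data['sex'], data['smoker']).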

d. Is the distribution of bmi across women with no children, one child and two children the same?
Ho = "children has no effect on bmi"   # Null Hypothesis
Ha = "children has an effect on bmi"   # Alternate Hypothesis
# Work on an independent copy of the female rows only
female_data = copy.deepcopy(data[data['sex'] == 'female'])
# BMI samples for women with zero, one and two children
zero = female_data[female_data.children == 0]['bmi']
one = female_data[female_data.children == 1]['bmi']
two = female_data[female_data.children == 2]['bmi']
# One-way ANOVA across the three groups
f_stat, p_value = stats.f_oneway(zero, one, two)
if p_value < 0.05:  # Setting our significance level at 5%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.05')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.05')
We got p_value > 0.05, hence we cannot conclude that BMI depends on the number of children.
