Univariate Analysis: It is the simplest form of data analysis. “Uni” means one, i.e. the data
consist of only one variable. Because only one variable is involved, this analysis cannot explain
cause-and-effect relationships between two or more variables. The main purpose of univariate
analysis is to describe and summarize the data and find the pattern of its distribution using
measures of central tendency (mean, median, mode), dispersion (standard deviation) and shape
(skewness and kurtosis). If we want to determine the relationship between variables or the
impact of one variable on another, bivariate or multivariate analysis is used. Bivariate or
multivariate analysis can be carried out using the following techniques: t-test, F-test, ANOVA,
chi square test, correlation test, regression analysis etc.
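The descriptive measures listed above can be sketched in base R as follows. This is a minimal illustration on a small hypothetical vector called ages (not the insurance data). Base R has no built-in function for the statistical mode, skewness or kurtosis, so the sketch assumes the simple moment-based definitions; add-on packages may use slightly different formulas.

```r
# Hypothetical ages, for illustration only (not taken from insurance.csv)
ages <- c(18, 22, 22, 27, 31, 39, 45, 51, 64)

m   <- mean(ages)     # arithmetic mean
med <- median(ages)   # middle value of the sorted data

# Mode: the most frequent value, found from a frequency table
freq <- table(ages)
mode_value <- as.numeric(names(freq)[which.max(freq)])

s <- sd(ages)         # sample standard deviation (divides by n - 1)

# Moment-based skewness and kurtosis (moments divide by n)
m2   <- mean((ages - m)^2)
skew <- mean((ages - m)^3) / m2^1.5   # > 0 means a longer right tail
kurt <- mean((ages - m)^4) / m2^2     # 3 for a normal distribution

print(c(mean = m, median = med, mode = mode_value,
        sd = s, skewness = skew, kurtosis = kurt))
```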
Code:
#Step-1: Importing the data from the folder D:/Data; the file name is insurance.csv
Insurance_data<-read.csv("D:/Data/insurance.csv")
View(Insurance_data)
Output:
Step-2: Checking the structure of the data
Code:
str(Insurance_data)
summary(Insurance_data)
Output:
> str(Insurance_data)
Step-3: Checking the summary of the chosen variable “Age” for the analysis
Code:
#Step-3: Checking the summary of the chosen variable “Age” for the analysis
summary(Insurance_data$age)
Output:
> #Step-3: Checking the summary of the chosen variable “Age” for the analysis
> summary(Insurance_data$age)
Interpretation:
The minimum age of the respondents is 18 years and the maximum age is 64 years. The mean
age of the respondents is 39.21 years. The total number of records in the data set is 1338. Half
of the respondents are younger than 39 years and the other half are older (median = 39).
The 1st quartile is 27 years, which indicates that 25% of the respondents are between 18 and
27 years of age.
The 3rd quartile is 51 years, which indicates that 75% of the respondents are between 18 and
51 years of age.
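The quartiles reported by summary() can also be obtained directly. The sketch below uses a small hypothetical vector ages (not the insurance data) and R's default quantile method; with so few observations the share of values at or below the 1st quartile is only approximately 25%.

```r
# Hypothetical ages, for illustration only
ages <- c(18, 22, 22, 27, 31, 39, 45, 51, 64)

# 1st quartile, median and 3rd quartile (default method, type 7)
q <- quantile(ages, probs = c(0.25, 0.50, 0.75))
print(q)

# Interquartile range: spread of the middle half of the data
iqr <- IQR(ages)
print(iqr)

# Share of observations at or below the 1st quartile
share_below_q1 <- mean(ages <= q[["25%"]])
print(share_below_q1)
```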
Standard deviation
Code:
sd(Insurance_data$age)
Output:
> sd(Insurance_data$age)
[1] 14.04996
Interpretation:
On average, an individual observation deviates from the mean age by around 14 years. A smaller
standard deviation is preferable, because the closer the observations lie to the mean, the better
the mean represents the whole population.
Bivariate analysis
Correlation analysis: This technique measures the strength and direction of the linear
relationship between two numerical variables. The functions used in R to compute correlation
are “cor” (the coefficient only) and “cor.test” (the coefficient together with a significance test).
The variables we are taking for bivariate analysis are “Age” and “BMI”.
H0: There is no relation between the variables under the study i.e., age and BMI
H1: There is a relation between the variables under the study i.e., age and BMI
Code:
cor(Insurance_data$age,Insurance_data$bmi)
[Note: The two variables are stored in separate columns (unstacked data, not grouped data), so they are passed as two arguments separated by a comma.]
Output:
cor(Insurance_data$age,Insurance_data$bmi)
[1] 0.1092719
Interpretation:
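There is a weak positive correlation between the variables “age” and “bmi”. The value of
r = 0.1092719 is close to 0, which indicates that the variables are only weakly related. Note that
“cor” returns only the coefficient; deciding between H0 and H1 requires the p-value reported by
“cor.test”.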
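To decide between H0 and H1, the significance test from cor.test is needed in addition to the coefficient. The following is a minimal sketch on a small hypothetical data frame called demo (its values are invented for illustration and are not the insurance data); it assumes the default Pearson method and a 0.05 significance level.

```r
# Hypothetical data, for illustration only
demo <- data.frame(
  age = c(19, 25, 31, 38, 44, 52, 60, 64),
  bmi = c(27.9, 24.1, 30.2, 26.5, 31.8, 28.4, 33.0, 29.7)
)

# Correlation coefficient only
r <- cor(demo$age, demo$bmi)

# Coefficient plus t-based significance test (Pearson by default)
result <- cor.test(demo$age, demo$bmi)

print(result$estimate)   # same value as r
print(result$p.value)    # p-value for H0: true correlation is 0
print(result$conf.int)   # 95% confidence interval for the correlation

# Decision at the 5% level
decision <- if (result$p.value < 0.05) "reject H0" else "fail to reject H0"
print(decision)
```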
Chi square test: This technique compares the observed frequencies with the frequencies expected
under independence. The chi square test is widely used to analyze whether there is an association
between two categorical variables (such as gender, smoker, region etc.).
In R the chi square test can be computed using the function “chisq.test()”.
The categorical variables in the given data set are gender, number of children, smoker and region.
H0: There is no impact of smoking habit on premium charges (The variables are independent)
H1: There is an impact of smoking habit on premium charges (The variables are dependent)
Code:
chisq.test(Insurance_data$smoker,Insurance_data$charges)
Output:
> chisq.test(Insurance_data$smoker,Insurance_data$charges)
Interpretation:
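The obtained p-value is 0.4794, which is greater than 0.05, so we fail to reject H0, i.e. the test
gives no evidence that smoking habit and premium charges are dependent. This result should be
read with caution: charges is a continuous variable, so almost every value forms its own category
and the expected frequencies are far too small for the chi square approximation to be reliable.
The chi square test is better suited to two categorical variables; a categorical and a numerical
variable call for a different tool, such as the two sample t test.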
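For comparison, here is a minimal sketch of the chi square test applied to two categorical variables. The data frame demo and its counts are hypothetical, built only so that the contingency table is easy to follow. Note that for a 2 x 2 table chisq.test applies Yates' continuity correction by default.

```r
# Hypothetical data: 40 smokers and 60 non-smokers in two regions
demo <- data.frame(
  smoker = rep(c("yes", "no"), times = c(40, 60)),
  region = c(rep(c("north", "south"), times = c(25, 15)),   # smokers
             rep(c("north", "south"), times = c(25, 35)))   # non-smokers
)

# Contingency table of observed frequencies
counts <- table(demo$smoker, demo$region)
print(counts)

# Chi square test of independence on the table
result <- chisq.test(counts)

print(result$expected)   # expected frequencies under independence
print(result$p.value)    # compare with 0.05 to decide about H0
```

Checking result$expected is a useful habit: the approximation is usually considered reliable only when the expected frequencies are not too small (a common rule of thumb is at least 5 per cell).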
Hypothesis Testing: A hypothesis is an assumption to be tested. Before testing, we have to
define two hypotheses, called the null and the alternative hypothesis. Based on the alternative
hypothesis, we can define the type of the test, i.e. one tailed or two tailed.
For testing the hypothesis, we have various statistical tools. The tools are grouped based on the
type of data distribution.
If the data is normally distributed, use a parametric test for testing the hypothesis (t-test,
F-test, ANOVA); otherwise use a non-parametric test (Wilcoxon signed-rank test, Mann-Whitney
U test, Kruskal-Wallis (K-W) test, Kolmogorov-Smirnov (K-S) test, chi square test and Spearman
correlation test).
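The choice between a parametric and a non-parametric test can be sketched as below. The source does not say how normality should be checked; this example assumes the Shapiro-Wilk test (shapiro.test) and a 0.05 cut-off, and uses simulated, hypothetical data in x rather than the insurance data.

```r
# Simulated, hypothetical sample for illustration only
set.seed(1)
x <- rnorm(50, mean = 39, sd = 14)

# Shapiro-Wilk test; H0: the data come from a normal distribution
normality <- shapiro.test(x)
print(normality$p.value)

# Parametric test if normality is not rejected, otherwise non-parametric
if (normality$p.value > 0.05) {
  result <- t.test(x, mu = 35)        # one sample t test
} else {
  result <- wilcox.test(x, mu = 35)   # Wilcoxon signed-rank test
}

print(result$method)
print(result$p.value)
```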
Parametric test
The data set has 1338 records and 7 variables for evaluation: Age, Gender, BMI, Children,
Smoker, Region and Charges.
One sample t test: testing whether the true mean age of the population is 35 years.
Code:
t.test(Insurance_data$age,mu=35)
Output:
> t.test(Insurance_data$age,mu=35)
95 percent confidence interval:
38.45352 39.96053
sample estimates:
mean of x
39.20703
Interpretation:
The true mean age of the population lies between 38.45352 and 39.96053 at the 95% confidence level.
The obtained p-value is less than 2.2e-16 (practically 0), which is less than 0.05, so we reject H0
in favour of H1. We can conclude that the true mean age of the population is not 35 years.
[The hypothesized mean age (35) does not fall within the confidence interval of 38.45 to 39.96.]
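To show where the t statistic and p-value come from, the sketch below repeats a one sample t test by hand on a small hypothetical vector ages with a hypothetical test value mu0 (neither is taken from the insurance data) and compares the result with t.test.

```r
# Hypothetical ages and hypothesized mean, for illustration only
ages <- c(18, 22, 22, 27, 31, 39, 45, 51, 64)
mu0  <- 30

# Built-in one sample t test (two sided by default)
res <- t.test(ages, mu = mu0)

# The same test by hand: t = (mean - mu0) / (sd / sqrt(n))
n        <- length(ages)
t_manual <- (mean(ages) - mu0) / (sd(ages) / sqrt(n))
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)   # two sided p-value

print(c(t = unname(res$statistic), t_manual = t_manual))
print(c(p = res$p.value, p_manual = p_manual))
print(res$conf.int)   # 95% confidence interval for the true mean
```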
A) Two sample t test (one variable has to be categorical and the other numerical)
The categorical variables in the given data set are gender (Male/Female), number of children (0,1,2,3),
smoking habit (Yes/No) and region (northeast/northwest/southeast/southwest).
We can compute a two sample t-test by considering one categorical variable with two
categories and one continuous variable.
Case 1: Test whether the mean age of female group is equal to the mean age of the male
group or not.
Hypothesis:
H0: Mean age of female group is equal to mean age of male group
H1: Mean age of female group is not equal to mean age of male group
Code:
t.test(Insurance_data$age~Insurance_data$sex)
Output:
> t.test(Insurance_data$age~Insurance_data$sex)
alternative hypothesis: true difference in means between group female and group male is not
equal to 0
95 percent confidence interval:
-0.9214814 2.0932042
sample estimates:
39.50302 38.91716
Interpretation:
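The sample mean age of the female group is 39.50302 and that of the male group is 38.91716.
The 95% confidence interval for the difference in means (-0.9214814 to 2.0932042) contains 0, so
we fail to reject H0: there is no significant difference between the mean ages of the two groups.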
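The link between the confidence interval and the decision about H0 can be sketched as follows. The data frame demo is hypothetical (ten invented records, not the insurance data). The formula is written with the data argument, which is equivalent to the $ notation used above, and t.test performs the Welch test (unequal variances) by default.

```r
# Hypothetical data, for illustration only
demo <- data.frame(
  age = c(23, 35, 41, 52, 60, 28, 33, 47, 55, 62),
  sex = rep(c("female", "male"), each = 5)
)

# Two sample t test: numerical variable ~ categorical variable with two groups
res <- t.test(age ~ sex, data = demo)

print(res$estimate)   # sample mean of each group
print(res$conf.int)   # 95% CI for the difference in means
print(res$p.value)

# If the 95% CI contains 0, the two sided p-value is above 0.05
ci <- res$conf.int
ci_contains_zero <- ci[1] <= 0 && ci[2] >= 0
decision <- if (ci_contains_zero) "fail to reject H0" else "reject H0"
print(decision)
```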
MULTIVARIATE ANALYSIS
ANOVA: Used when one categorical variable has more than two categories (here either region or
children) and the mean of a numerical variable is compared across them.
H0: There is no significant difference in the mean charges paid based on regional location/ there
is no impact of region on premium charges.
H1: There is a significant difference in the mean charges paid based on regional location/ there is
an impact of region on premium charges.
Code:
##Multivariate
##ANOVA
Anova<-aov(Insurance_data$charges~Insurance_data$region)
summary(Anova)
Output:
> ##Multivariate
> ##ANOVA
> Anova<-aov(Insurance_data$charges~Insurance_data$region)
> summary(Anova)
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpretation:
The obtained p-value is 0.0309, which is less than 0.05, so we reject H0 in favour of H1, meaning
there is a difference in the mean premium charges with respect to region.
As we reject the null hypothesis, we use the function “aggregate” to estimate the mean premium
charges for each region.
Syntax: aggregate(numerical variable,by=list(categorical variable),FUN=mean)
Code:
aggregate(Insurance_data$charges,by=list(Insurance_data$region),FUN=mean)
Output:
> aggregate(Insurance_data$charges,by=list(Insurance_data$region),FUN=mean)
    Group.1        x
1 northeast 13406.38
2 northwest 12417.58
3 southeast 14735.41
4 southwest 12346.94
Interpretation:
From the above result, the mean premium charges are highest in the southeast region (14735.41)
and lowest in the southwest region (12346.94).
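The group means show which regions pay more on average, but not which pairs of regions differ significantly. One common follow-up, which is not part of the original walkthrough, is Tukey's HSD test (TukeyHSD in base R). The sketch below runs it after aov on simulated, hypothetical data; the data frame demo, the group sizes and the spread are all assumptions made for illustration.

```r
# Simulated, hypothetical charges for four regions (30 records each)
set.seed(42)
regions <- c("northeast", "northwest", "southeast", "southwest")
demo <- data.frame(
  region  = factor(rep(regions, each = 30)),
  charges = rnorm(120,
                  mean = rep(c(13400, 12400, 14700, 12300), each = 30),
                  sd   = 3000)
)

# One way ANOVA: numerical variable ~ categorical variable
fit <- aov(charges ~ region, data = demo)
print(summary(fit))

# Mean charges per region (formula interface of aggregate)
group_means <- aggregate(charges ~ region, data = demo, FUN = mean)
print(group_means)

# Pairwise comparisons with adjusted p-values (column "p adj")
posthoc <- TukeyHSD(fit)
print(posthoc)
```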