
HYPOTHESIS TESTING

Univariate Analysis: This is the simplest form of data analysis. “Uni” means one, i.e. the data
consist of only one variable. Because only one variable is involved, this analysis cannot explain
cause-and-effect relationships between two or more variables. Its main purpose is to describe and
summarize the data and to find the pattern of its distribution using measures of central tendency
(mean, median, mode), dispersion (standard deviation) and shape (skewness and kurtosis). To
determine the relationship between variables, or the impact of one variable on another, bivariate
or multivariate analysis is used. These analyses can be carried out using techniques such as the
t-test, F-test, ANOVA, chi-square test, correlation test and regression analysis.
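The descriptive measures listed above can be sketched in base R as follows. This is only an illustration on a small made-up vector of ages, not on the insurance file. The mode helper and the moment-based skewness and kurtosis formulas are assumptions chosen for the sketch, since base R has no built-in function for them.

```r
# Illustrative ages (assumed values, not read from insurance.csv)
ages <- c(19, 18, 28, 33, 32, 31, 46, 37, 37, 60)

# Central tendency
age_mean   <- mean(ages)
age_median <- median(ages)
# Base R has no statistical mode function: tabulate and take the most frequent value
age_mode   <- as.numeric(names(which.max(table(ages))))

# Dispersion
age_sd <- sd(ages)

# Shape: moment-based skewness and kurtosis (central moments with divisor n)
n  <- length(ages)
m2 <- sum((ages - age_mean)^2) / n
m3 <- sum((ages - age_mean)^3) / n
m4 <- sum((ages - age_mean)^4) / n
age_skewness <- m3 / m2^1.5   # > 0 means a longer right tail
age_kurtosis <- m4 / m2^2     # about 3 for a normal distribution

print(c(mean = age_mean, median = age_median, mode = age_mode,
        sd = age_sd, skewness = age_skewness, kurtosis = age_kurtosis))
```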

Case 1: Demonstration of Univariate analysis on insurance database

Step-1: Importing the data

Code:

#Step-1: Importing the data from the folder on drive D; the file name is insurance

Insurance_data<-read.csv("D:/Data/insurance.csv")

View(Insurance_data)

Output:

[Screenshot of the data viewer]

Step-2: Checking the structure of the data

Code:

#Step-2: Checking the structure of the data

str(Insurance_data)

summary(Insurance_data)

Output:

> #Step-2: Checking the structure of the data

> str(Insurance_data)

'data.frame': 1338 obs. of 7 variables:

$ age : int 19 18 28 33 32 31 46 37 37 60 ...

$ sex : chr "female" "male" "male" "male" ...

$ bmi : num 27.9 33.8 33 22.7 28.9 ...

$ children: int 0 1 3 0 0 0 1 3 2 0 ...


$ smoker : chr "yes" "no" "no" "no" ...

$ region : chr "southwest" "southeast" "southeast" "northwest" ...

$ charges : num 16885 1726 4449 21984 3867 ...

Step-3: Checking the summary of the chosen variable “Age” for the analysis

Code:

#Step-3: Checking the summary of the chosen variable “Age” for the analysis

summary(Insurance_data$age)
Output:

> #Step-3: Checking the summary of the chosen variable “Age” for the analysis

> summary(Insurance_data$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.00   27.00   39.00   39.21   51.00   64.00

Interpretation:

The minimum age of the respondents is 18 years and the maximum age is 64 years. The mean age of
the respondents is 39.21 years. The total number of records in the given database is 1338. The
median is 39, so about 50% of the respondents are aged 39 or younger and the remaining 50% are
aged 39 or older.

The 1st quartile of the given data set is 27 years, which indicates that about 25% of the
respondents are between 18 and 27 years old.

The 3rd quartile is 51 years, which indicates that about 75% of the respondents are between 18
and 51 years old.
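The reading of the quartiles above can be checked with a short sketch. The vector below is an assumed, evenly spread set of ages from 18 to 64, used only to show that roughly 25%, 50% and 75% of the observations fall at or below the three quartiles.

```r
# Assumed ages: one respondent for every age from 18 to 64 (illustration only)
ages <- 18:64

# Quartiles as reported by summary(): 25%, 50% (median) and 75%
q <- quantile(ages, probs = c(0.25, 0.50, 0.75))
print(q)

# Share of observations at or below each quartile
share_below <- sapply(q, function(cut_point) mean(ages <= cut_point))
print(round(share_below, 3))
```

With a finite sample the shares are close to, but not exactly, 0.25, 0.50 and 0.75.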

Step-4: Calculation of standard deviation

Code:

#Calculation of standard deviation

sd(Insurance_data$age)

Output:

> #Calculation of standard deviation

> sd(Insurance_data$age)

[1] 14.04996
Interpretation:

On average, an individual observation deviates from the mean age by about 14 years. A smaller
standard deviation is preferable, because the observations then lie closer to the mean and the
mean represents the whole data set better.
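What “sd” computes can be made explicit with a small sketch. The ages are assumed values for illustration. The manual formula is the usual sample standard deviation with divisor n - 1, which is what R's sd() uses.

```r
# Illustrative ages (assumed values, not read from insurance.csv)
ages <- c(19, 18, 28, 33, 32, 31, 46, 37, 37, 60)

n <- length(ages)
deviations <- ages - mean(ages)               # distance of each age from the mean
manual_sd  <- sqrt(sum(deviations^2) / (n - 1))
builtin_sd <- sd(ages)

print(c(manual = manual_sd, builtin = builtin_sd))
```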

Case 2: Demonstration of Bivariate analysis on insurance dataset

Bivariate analysis

Correlation analysis: This technique is used to measure the strength and direction of the linear
relationship between two numeric variables. In R, “cor” computes the correlation coefficient and
“cor.test” additionally tests whether the coefficient differs from zero.

The variables we are taking for Bivariate analysis are “Age” & “BMI”

H0: There is no relation between the variables under the study i.e., age and BMI

H1: There is a relation between the variables under the study i.e., age and BMI

Code:

#Correlation analysis between the variables “Age” & “BMI”

cor(Insurance_data$age,Insurance_data$bmi)

[Note: The two variables are stored in separate columns (unstacked data, not grouped data), so
they are passed as two arguments separated by a comma.]

Output:

> #Correlation analysis between the variables “Age” & “BMI”

> cor(Insurance_data$age,Insurance_data$bmi)

[1] 0.1092719

Interpretation:

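There is a positive correlation between the variables “age” and “bmi”. The value r = 0.1092719
is close to 0, which indicates that the linear relationship is weak. Note that “cor” returns only
the coefficient; a formal decision on H0 requires the p-value reported by “cor.test”.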
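To decide between H0 and H1, the coefficient has to be converted into a test statistic. The sketch below does this by hand from the two figures reported above (r = 0.1092719 and 1338 records), using the standard t transformation of a Pearson correlation. The variable names are assumptions for the sketch; running cor.test on the two columns should report the same quantities directly.

```r
r <- 0.1092719   # correlation reported by cor() above
n <- 1338        # number of records reported by str()

# t statistic for H0: true correlation is 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t = t_stat, df = n - 2, p.value = p_value))

# With the data loaded, the equivalent call would be:
# cor.test(Insurance_data$age, Insurance_data$bmi)
```

A weak correlation can still be statistically significant when the sample is large, so the size of r and the p-value answer different questions.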
Chi square test: This technique compares the observed frequencies with the frequencies expected
under independence. It is widely used to analyze whether there is an association between two
categorical variables (such as sex, smoker or region).

In R, the chi-square test can be computed using the function “chisq.test()”.

The categorical variables in the given database are sex, number of children, smoker and region.

Question: Assess the impact of ‘smoker’ on ‘charges’

H0: There is no impact of smoking habit on premium charges (The variables are independent)

H1: There is an impact of smoking habit on premium charges (The variables are dependent)

Code:

#Question: Assess the impact of ‘smoker’ on ‘charges’

chisq.test(Insurance_data$smoker,Insurance_data$charges)

Output:

> #Question: Assess the impact of ‘smoker’ on ‘charges’

> chisq.test(Insurance_data$smoker,Insurance_data$charges)

Pearson's Chi-squared test

data: Insurance_data$smoker and Insurance_data$charges

X-squared = 1338, df = 1336, p-value = 0.4794

Interpretation:

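The obtained p-value is 0.4794, which is greater than 0.05, so on this output we fail to reject
H0. However, this result should not be read as evidence that smoking habit and premium charges
are independent. “charges” is a continuous variable, so almost every value forms its own category
(df = 1336), nearly all cell counts are 0 or 1, and the chi-square approximation is not valid.
The chi-square test is appropriate only when both variables are categorical; for one categorical
and one continuous variable, a two sample t test is the suitable tool.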
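The behaviour of the chi-square test on a continuous variable can be reproduced on toy data. In the sketch below all values and counts are made up for illustration. The first call mimics the smoker/charges situation: every charge is unique, so the statistic simply equals the number of rows. The second call shows the intended use on a table of counts for two categorical variables.

```r
# Toy data (assumed): one categorical variable, one continuous variable with unique values
smoker  <- c("yes", "no", "no", "yes", "no")
charges <- c(1100.5, 2200.7, 3300.2, 4400.9, 5500.1)

# Each charge becomes its own column, so X-squared collapses to the sample size (5 here)
bad <- suppressWarnings(chisq.test(smoker, charges))
print(bad$statistic)
print(bad$parameter)   # df = (2 - 1) * (5 - 1)

# Intended use: a contingency table of two categorical variables (hypothetical counts)
counts <- matrix(c(30, 120, 50, 100), nrow = 2,
                 dimnames = list(smoker = c("yes", "no"),
                                 sex    = c("female", "male")))
good <- chisq.test(counts, correct = FALSE)
print(good)
```

This matches the pattern in the output above, where X-squared equals the number of records (1338).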
Hypothesis Testing: A hypothesis is an assumption to be tested. Before testing, we have to define
two hypotheses, called the null and the alternative hypothesis. The alternative hypothesis
determines the type of the test, i.e. one-tailed or two-tailed.

For testing the hypothesis, we have various statistical tools. The tools are grouped based on the
type of data distribution.

If the data are normally distributed, use a parametric test for testing the hypothesis (t test,
F test, ANOVA); otherwise use a non-parametric test (Wilcoxon signed-rank test, Mann-Whitney U
test, Kruskal-Wallis (K-W) test, Kolmogorov-Smirnov (K-S) test, chi-square test and Spearman
correlation test).
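The choice between a parametric and a non-parametric test can be sketched as below. This is one possible workflow, not a fixed rule: it assumes the Shapiro-Wilk test (shapiro.test) as the normality check and a 0.05 cut-off, and it runs on simulated samples rather than on the insurance data. The helper name choose_test is hypothetical.

```r
set.seed(1)
normal_sample <- rnorm(100, mean = 50, sd = 10)   # simulated, roughly bell-shaped
skewed_sample <- rexp(100, rate = 0.1)            # simulated, strongly right-skewed

# Hypothetical helper: one sample t test if normality is not rejected,
# otherwise the Wilcoxon signed-rank test (which tests location, not the mean)
choose_test <- function(x, mu) {
  if (shapiro.test(x)$p.value > 0.05) {
    t.test(x, mu = mu)
  } else {
    wilcox.test(x, mu = mu)
  }
}

res_normal <- choose_test(normal_sample, mu = 50)
res_skewed <- choose_test(skewed_sample, mu = 10)

print(res_normal$method)
print(res_skewed$method)
```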

Parametric test

A) One sample t test:


This test is used to test whether the mean of the population from which the sample is drawn
equals a hypothesized value. The sample mean serves as the estimate of the population mean.
H0: Population mean is equal to the hypothesized value
H1: Population mean is not equal to the hypothesized value

The total number of records is 1338. There are 7 variables for evaluation: age, sex, bmi,
children, smoker, region and charges.

Case 1: Test whether the mean age of population is 35 years.

Code:

#Case 1: Test whether the mean age of population is 35 years.

t.test(Insurance_data$age,mu=35)
Output:

> #Case 1: Test whether the mean age of population is 35 years.

> t.test(Insurance_data$age,mu=35)

One Sample t-test


data: Insurance_data$age

t = 10.953, df = 1337, p-value < 2.2e-16

alternative hypothesis: true mean is not equal to 35

95 percent confidence interval:

38.45352 39.96053

sample estimates:

mean of x

39.20703

Interpretation:

The calculated t value is 10.953.

The 95% confidence interval for the mean age of the population is 38.45352 to 39.96053.

The obtained p-value is less than 2.2e-16 (practically 0), which is less than 0.05, so we reject
H0 in favour of H1. We can conclude that the true mean age of the population is not 35 years.

[The hypothesized mean age (35) lies outside the 95% confidence interval of 38.45 to 39.96.]
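The figures in the t.test output can be traced back to the summary statistics reported earlier. The sketch below recomputes the t value, the p-value and the 95% confidence interval by hand from the sample mean (39.20703), the standard deviation (14.04996) and the number of records (1338). The variable names are assumptions for the sketch, and the results should land close to the values reported above, up to rounding of the inputs.

```r
n     <- 1338       # number of records
x_bar <- 39.20703   # sample mean of age
s     <- 14.04996   # sample standard deviation of age
mu0   <- 35         # hypothesized population mean

se      <- s / sqrt(n)                          # standard error of the mean
t_stat  <- (x_bar - mu0) / se                   # one sample t statistic
p_value <- 2 * pt(-abs(t_stat), df = n - 1)     # two-tailed p-value
ci      <- x_bar + c(-1, 1) * qt(0.975, df = n - 1) * se

print(round(t_stat, 2))
print(p_value)
print(round(ci, 2))
```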
B) Two sample t test (one variable has to be categorical and the other numerical)

The categorical variables in the given data set are sex (male/female), number of children
(0, 1, 2, 3), smoking habit (yes/no) and region (northeast/northwest/southeast/southwest).

Continuous variables are age, bmi and charges.

We can compute a two sample t-test by considering one categorical variable with exactly two
categories and one continuous variable.

Case 1: Test whether the mean age of female group is equal to the mean age of the male
group or not.

Hypothesis:

H0: Mean age of female group is equal to mean age of male group

H1: Mean age of female group is not equal to mean age of male group

Code:

t.test(Insurance_data$age~Insurance_data$sex)

Output:

> t.test(Insurance_data$age~Insurance_data$sex)

Welch Two Sample t-test

data: Insurance_data$age by Insurance_data$sex

t = 0.76247, df = 1335.4, p-value = 0.4459

alternative hypothesis: true difference in means between group female and group male is not equal to 0

95 percent confidence interval:

-0.9214814 2.0932042

sample estimates:

mean in group female mean in group male

39.50302 38.91716

Interpretation:

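The calculated t value is 0.76247.

The mean age of the female group is 39.50302 and the mean age of the male group is 38.91716. The
95% confidence interval for the difference in means (-0.9214814 to 2.0932042) includes 0.

The obtained p-value is 0.4459, which is greater than 0.05, so we fail to reject H0. There is no
significant difference between the mean age of the female group and the mean age of the male
group.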
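The link between the confidence interval and the p-value can be illustrated on simulated data. The data frame below is made up (50 ages per group drawn at random); only the form of the call mirrors the case above. The sketch also uses the data = argument of t.test, an alternative to writing Insurance_data$ in front of each variable.

```r
set.seed(7)
# Simulated data (assumed values): 50 females and 50 males with similar mean ages
toy <- data.frame(
  sex = rep(c("female", "male"), each = 50),
  age = c(rnorm(50, mean = 39.5, sd = 14), rnorm(50, mean = 38.9, sd = 14))
)

# Welch two sample t test, written with the formula and data arguments
res <- t.test(age ~ sex, data = toy)
print(res)

# For a two-tailed test at the 5% level, these two statements always agree
ci_contains_zero <- res$conf.int[1] <= 0 && res$conf.int[2] >= 0
fail_to_reject   <- res$p.value >= 0.05
print(c(ci_contains_zero = ci_contains_zero, fail_to_reject = fail_to_reject))
```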

MULTIVARIATE ANALYSIS

ANOVA: One categorical variable with more than two categories (either region or children)

One continuous variable: age/BMI/charges

Function: aov(continuous variable~categorical variable)

Case 1: Assess the impact of region on charges

H0: There is no significant difference in the mean charges paid based on regional location/ there
is no impact of region on premium charges.

H1: There is a significant difference in the mean charges paid based on regional location/ there is
an impact of region on premium charges.

Code:

##Multivariate

##ANOVA

Anova<-aov(Insurance_data$charges~Insurance_data$region)
summary(Anova)

Output:

> ##Multivariate

> ##ANOVA

> Anova<-aov(Insurance_data$charges~Insurance_data$region)

> summary(Anova)

                        Df    Sum Sq   Mean Sq F value Pr(>F)
Insurance_data$region    3 1.301e+09 433586560    2.97 0.0309 *
Residuals             1334 1.948e+11 146007093

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Interpretation:

The obtained p-value is 0.0309, which is less than 0.05, so we reject H0 in favour of H1: there
is a significant difference in the mean premium charges with respect to region.

As we reject the null hypothesis, we use the function “aggregate” to compare the mean premium
charges across the regions.
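The F value and p-value in the ANOVA table follow from the two mean squares and their degrees of freedom. The sketch below recomputes them from the figures printed in the table. The variable names are assumptions for the sketch.

```r
ms_region   <- 433586560   # Mean Sq for region (between groups)
ms_residual <- 146007093   # Mean Sq for residuals (within groups)
df_region   <- 3           # number of regions minus 1
df_residual <- 1334        # number of records minus number of regions

f_value <- ms_region / ms_residual
p_value <- pf(f_value, df_region, df_residual, lower.tail = FALSE)

print(round(f_value, 2))
print(round(p_value, 4))
```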

Function: aggregate(continuous variable,by=list(categorical variable),FUN=mean)

Code:

#estimating the difference in premium with respect to region.

aggregate(Insurance_data$charges,by=list(Insurance_data$region),FUN=mean)
Output:

> #estimating the difference in premium with respect to region.

> aggregate(Insurance_data$charges,by=list(Insurance_data$region),FUN=mean)

    Group.1        x
1 northeast 13406.38
2 northwest 12417.58
3 southeast 14735.41
4 southwest 12346.94

Interpretation:

From the above result, the highest mean premium charges are paid in the southeast region
(14735.41) and the lowest mean premium charges are paid in the southwest region (12346.94).
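ANOVA shows only that at least one regional mean differs; it does not say which pairs of regions differ. One common follow-up, not used in the original walkthrough, is Tukey's HSD test via TukeyHSD. The sketch below applies it to simulated charges (made-up values, 30 per region) and also shows the formula form of aggregate.

```r
set.seed(123)
# Simulated data (assumed values): 30 charges per region
toy <- data.frame(
  region  = factor(rep(c("northeast", "northwest", "southeast", "southwest"), each = 30)),
  charges = c(rnorm(30, 13400, 3000), rnorm(30, 12400, 3000),
              rnorm(30, 14700, 3000), rnorm(30, 12300, 3000))
)

fit <- aov(charges ~ region, data = toy)
print(summary(fit))

# Group means, using the formula interface of aggregate
group_means <- aggregate(charges ~ region, data = toy, FUN = mean)
print(group_means)

# Pairwise comparisons: difference, confidence limits and adjusted p-value per pair
posthoc <- TukeyHSD(fit)
print(posthoc$region)
```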
