
NAME: - RAEESA ALI

COURSE TITLE: - ADVANCE IN BIOSTATISTICS


ROLL NO: - 22204014-003
Project of biostatistics: Regression, all sample T-Test, Anova
CLASS: - MS. BIOTECHNOLOGY 3rd SEMESTER
DEPARTMENT: - ADVANCE IN BIOTECHNOLOGY
UNIVERSITY NAME: - UNIVERSITY OF SIALKOT
ASSIGNMENT SUBMITTED TO: - SIR MALIK WAQAR
DATE OF ASSIGNMENT SUBMISSION: - 04/06/2023
1.Multiple linear regression:
Q: In the following data, Y is the dependent variable, the life expectancy of the Australian
population, and the factors affecting it are x1 (alcohol), x2 (hepatitis-B), x3 (measles), x4 (polio) and
x5 (diphtheria) for the years 2015 to 2006. Analyse the data in R and interpret the results.
Data of the Australian population by year:

Year   Y (life expectancy)   x1 (alcohol)   x2 (hepatitis-B)   x3 (measles)   x4 (polio)   x5 (diphtheria)
2015   82.8                  0              93                 74             93           93
2014   82.7                  9.71           91                 340            92           91
2013   82.3                  9.87           91                 158            91           92
2012   82                    10.03          91                 199            92           92
2011   81.9                  10.3           92                 190            92           92
2010   81.8                  10.52          92                 70             89           86
2009   81.7                  10.62          94                 104            86           83
2008   81.3                  10.76          94                 65             83           83
2007   81.2                  10.56          94                 11             83           85
2006   81                    10.31          95                 0              85           83

https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

Answer:
Command given in R-Studio
1.data=read.csv(choose.files())
Output results:
> data=read.csv(choose.files())
> data
y x1 x2 x3 x4 x5
1 82.8 0.00 93 74 93 93
2 82.7 9.71 91 340 92 91
3 82.3 9.87 91 158 91 92
4 82.0 10.03 91 199 92 92
5 81.9 10.30 92 190 92 92
6 81.8 10.52 92 70 89 86
7 81.7 10.62 94 104 86 83
8 81.3 10.76 94 65 83 83
9 81.2 10.56 94 11 83 85
10 81.0 10.31 95 0 85 83

2.y=data$y
Output results
> y=data$y
> data
y x1 x2 x3 x4 x5
1 82.8 0.00 93 74 93 93
2 82.7 9.71 91 340 92 91
3 82.3 9.87 91 158 91 92
4 82.0 10.03 91 199 92 92
5 81.9 10.30 92 190 92 92
6 81.8 10.52 92 70 89 86
7 81.7 10.62 94 104 86 83
8 81.3 10.76 94 65 83 83
9 81.2 10.56 94 11 83 85
10 81.0 10.31 95 0 85 83
Commands:
3.lifexpectancy=data$y;alcohol=data$x1;hep.b=data$x2;measels=data$x3;polio=data$x4;diphtheria=data$x5
4.reg=lm(lifexpectancy~alcohol+hep.b+measels+polio+diphtheria)
5.summary(reg)
Output results:
> lifexpectancy=data$y;alcohol=data$x1;hep.b=data$x2;measels=data$x3;polio=data$x4;diphtheria=data$x5
> reg=lm(lifexpectancy~alcohol+hep.b+measels+polio+diphtheria)
> summary(reg)
Call:
lm(formula = lifexpectancy ~ alcohol + hep.b + measels + polio +
diphtheria)

Residuals:
1 2 3 4 5 6 7 8 9 10
-0.0026923 0.0005238 0.1958168 -0.2024294 0.0114413 -0.0270730 0.1471499 -0.1007616 0.0302466 -0.0522222

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.101e+02 1.360e+01 8.097 0.00126 **
alcohol -1.490e-01 3.262e-02 -4.568 0.01028 *
hep.b -2.485e-01 1.074e-01 -2.313 0.08178 .
measels 2.798e-03 9.937e-04 2.816 0.04804 *
polio 7.370e-03 4.476e-02 0.165 0.87719
diphtheria -5.486e-02 4.046e-02 -1.356 0.24666
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.17 on 4 degrees of freedom
Multiple R-squared: 0.9652, Adjusted R-squared: 0.9217
F-statistic: 22.17 on 5 and 4 DF, p-value: 0.005121

1.Interpretation:
According to the results, the Pr(>|t|) values of x1 (alcohol) = 0.01028 and x3 (measles) = 0.04804 are
both less than 0.05, so these two predictors have a statistically significant effect on y, the life
expectancy of the Australian population from 2015 to 2006. Hepatitis-B (0.08178), polio (0.87719) and
diphtheria (0.24666) are not significant at the 0.05 level.
2.Significance of model:
Hypothesis:
Ho: β1=β2=β3=β4=β5=0
H1: At least one regression coefficient is not equal to zero
P-value: 0.005121, which is less than the 0.05 level of significance, so we reject Ho and accept H1.
The regression model as a whole is statistically significant.
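As a cross-check, the overall p-value can be recovered from the F statistic and its degrees of freedom printed by summary(reg), using the upper tail of the F distribution. This is only a minimal sketch: the variable names are illustrative, and the inputs are the rounded values copied from the output above.

```r
# Values copied from the summary(reg) output (rounded as printed)
f_stat   <- 22.17
df_model <- 5    # number of predictors
df_resid <- 4    # n - p - 1 = 10 - 5 - 1

# Upper-tail probability of the F distribution gives the overall p-value
p_overall <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)
p_overall

# Decision at the 0.05 level of significance: TRUE means reject Ho
reject_h0 <- p_overall < 0.05
reject_h0
```

A p-value below 0.05 here leads to rejecting Ho, i.e. at least one coefficient is not zero.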
3.SIGNIFICANT PREDICTORS ACCORDING TO Pr(>|t|)
Pr(>|t|) : x1 alcohol = 0.01028, x3 measles = 0.04804
Hypothesis (tested separately for each coefficient):
Ho: βi=0
H1: βi≠0
The p-values of β1 and β3 are less than 0.05, so we reject Ho and accept H1 for both, which means alcohol
and measles significantly affect the life expectancy of the Australian population between 2015 and 2006.
With the other predictors in the model, these are the only two coefficients whose p-values are below the
0.05 level of significance.
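4.Goodness of fit:
Multiple R-squared: 0.9652, Adjusted R-squared: 0.9217
The R-squared value is close to 1: the model explains about 96.5% of the variation in life expectancy
(92.2% after adjusting for the number of predictors), so the model fits the data well.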
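The adjusted R-squared can be checked by hand from the multiple R-squared with the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1). A small sketch, assuming n = 10 observations and p = 5 predictors as in the data above (the variable names are illustrative):

```r
r2 <- 0.9652   # multiple R-squared from summary(reg)
n  <- 10       # number of observations (years 2006 to 2015)
p  <- 5        # number of predictors (x1 to x5)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

The result agrees with the Adjusted R-squared printed by R.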

Commands:
6.reg$residuals
7.sum(reg$residuals)
8.reg$fitted.values
9.reg=lm(y~x1+x2+x3+x4,data=data)

Output of commands:

> reg$residuals
1 2 3 4 5 6 7 8
-0.002692313 0.000523759 0.195816814 -0.202429400 0.011441349 -0.027073038 0.147149935 -0.100761586
9 10
0.030246648 -0.052222170
> sum(reg$residuals)
[1] -3.361027e-18
> reg$fitted.values
1 2 3 4 5 6 7 8 9 10
82.80269 82.69948 82.10418 82.20243 81.88856 81.82707 81.55285 81.40076 81.16975 81.05222

5.Interpretation:
1. > reg$residuals: the residual (observed minus fitted life expectancy) for each of the 10 observations.
2. > sum(reg$residuals): the sum of the residuals, which is zero up to rounding error (-3.36e-18) because the model includes an intercept.
3. > reg$fitted.values : the life expectancy predicted by the model for each of the 10 observations.
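The relation between residuals and fitted values can be verified directly. The sketch below rebuilds the data frame from the table instead of reading a CSV file; the names life, fit and res_by_hand are illustrative and not part of the original commands.

```r
# Data typed in from the table above (y and x1..x5 for 2015 down to 2006)
life <- data.frame(
  y  = c(82.8, 82.7, 82.3, 82.0, 81.9, 81.8, 81.7, 81.3, 81.2, 81.0),
  x1 = c(0, 9.71, 9.87, 10.03, 10.30, 10.52, 10.62, 10.76, 10.56, 10.31),
  x2 = c(93, 91, 91, 91, 92, 92, 94, 94, 94, 95),
  x3 = c(74, 340, 158, 199, 190, 70, 104, 65, 11, 0),
  x4 = c(93, 92, 91, 92, 92, 89, 86, 83, 83, 85),
  x5 = c(93, 91, 92, 92, 92, 86, 83, 83, 85, 83)
)

fit <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = life)

# A residual is the observed value minus the fitted value
res_by_hand <- life$y - fitted(fit)
all.equal(unname(res_by_hand), unname(residuals(fit)))

# With an intercept in the model the residuals sum to zero (up to rounding)
sum(residuals(fit))
```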

6.Graphical interpretation of data:

R-Studio code for the correlation coefficient between two quantitative variables:

commands
10.plot(alcohol,measels)
11.cor(alcohol,measels)

Output of commands:
> plot(alcohol,measels)
> cor(alcohol,measels)
[1] 0.08143148

Interpretation of commands:
Strength of correlation: -1 ≤ r ≤ 1

1. Graphically, the scatter plot does not show a clear increasing or decreasing trend, which indicates
little or no linear correlation between alcohol and measles.
2. cor(alcohol,measels)
[1] 0.08143148
The value r = 0.081 is very close to 0, so there is almost no linear correlation between the two
variables (r ≈ 0).
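Whether such a small correlation differs from zero can also be tested formally with cor.test() from base R. This is a sketch added for illustration, with the two vectors typed in from the data table; the name ct is an assumption of this example.

```r
# Alcohol (x1) and measles (x3) values from the data table
alcohol <- c(0, 9.71, 9.87, 10.03, 10.30, 10.52, 10.62, 10.76, 10.56, 10.31)
measles <- c(74, 340, 158, 199, 190, 70, 104, 65, 11, 0)

# Pearson correlation with a test of Ho: true correlation = 0
ct <- cor.test(alcohol, measles)
ct$estimate   # sample correlation coefficient r
ct$p.value    # p-value for Ho: correlation is zero
```

A p-value greater than 0.05 means the hypothesis of zero correlation is not rejected, which agrees with the interpretation above.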
2.One sample T-TEST:
Q: Carry out a one sample t-test on the life expectancy of the Australian population from
2015 to 2007 using R-Studio. Give a detailed interpretation.
Year   Australian population life expectancy
2015 82.8
2014 82.7
2013 82.5
2012 82.3
2011 82
2010 81.9
2009 81.7
2008 81.3
2007 81.3
https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

Answer:
Commands:
1.aus_life_expectancy=c(82.8,82.7,82.5,82.3,82,81.9,81.7,81.3,81.3)
2.life_expectancy=c(82.8,82.7,82.5,82.3,82,81.9,81.7,81.3,81.3)
3.hist(life_expectancy)
Results of commands:
> aus_life_expectancy=c(82.8,82.7,82.5,82.3,82,81.9,81.7,81.3,81.3)
> life_expectancy=c(82.8,82.7,82.5,82.3,82,81.9,81.7,81.3,81.3)
> hist(life_expectancy)
Commands :
4.shapiro.test(life_expectancy)
5.t.test(life_expectancy,mu=0)
Results of commands:
> shapiro.test(life_expectancy)

Shapiro-Wilk normality test

data: life_expectancy
W = 0.93244, p-value = 0.505

> t.test(life_expectancy,mu=0)

One Sample t-test

data: life_expectancy
t = 438.41, df = 8, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
81.62395 82.48716
sample estimates:
mean of x
82.05556

Interpretation:

1. Two tailed: the life expectancy of the Australian population is equal to 82.

Ho: mu=82
H1: mu≠82

2. Normality of data:
p-value = 0.505, which is greater than the 0.05 level of significance, so we do not reject Ho; the data
can be treated as normally distributed.

Ho: data is normal
H1: data is not normal

3. T-test: The command was run with mu=0, so the reported p-value < 2.2e-16 tests Ho: mu=0. It is less
than 0.05, so that hypothesis is rejected: the mean life expectancy is not equal to 0. For the stated
hypothesis Ho: mu=82, the 95 percent confidence interval (81.62395, 82.48716) contains 82, so at the 0.05
level we do not reject Ho. The mean life expectancy of the Australian population between 2015 and 2007 is
not significantly different from 82 years.
Ho: mu=82 (not rejected)
H1: mu≠82
Degrees of freedom: v = n-1 = 9-1 = 8.
Lower confidence limit = 81.62395, upper confidence limit = 82.48716
The sample mean 82.05556 lies between the lower and upper confidence limits.
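To test the stated hypothesis Ho: mu=82 directly, t.test() can be run with mu = 82 instead of mu = 0. This is a sketch for illustration; the name tt82 is an assumption of this example and was not part of the original commands.

```r
life_expectancy <- c(82.8, 82.7, 82.5, 82.3, 82, 81.9, 81.7, 81.3, 81.3)

# Two-sided one sample t-test of Ho: mu = 82
tt82 <- t.test(life_expectancy, mu = 82)
tt82$statistic   # t value for the difference between the sample mean and 82
tt82$p.value     # compare with the 0.05 level of significance
tt82$conf.int    # same 95 percent interval as before; it does not depend on mu
```

The confidence interval is identical to the one reported for mu = 0; only the t value and the p-value change with mu.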

3. Paired sample T-test:


Q: A training program was conducted to improve “participant” knowledge on ICT. Data were collected from a
sample both pre and post the ICT training program.
1. Test the hypothesis that the training is effective to improve participant knowledge on ICT at α=0.05?

One or Two- tailed data?


PRE POST
12 13
14 15
13 13
11 12
12 13
10 11
15 16
13 13
9 8
14 14
https://statisticstechs.weebly.com/inferential-statistics/paired-sample-t-test-or-repeated-measures

commands:
#ICT knowledge before the training program#
1.pre<-c(12,14,13,11,12,10,15,13,9,14)
#ICT knowledge after the training program#
2.post<-c(13,15,13,12,13,11,16,13,8,14)
3.d=(pre-post)
#normality check of the paired differences#
4.shapiro.test(d)
5.t.test(pre,post,paired=TRUE)
Results of commands:
> pre<-c(12,14,13,11,12,10,15,13,9,14)
> post<-c(13,15,13,12,13,11,16,13,8,14)
> d=(pre-post)
> shapiro.test(d)

Shapiro-Wilk normality test

data: d
W = 0.73087, p-value = 0.002088

> t.test(pre,post,paired=TRUE)

Paired t-test

data: pre and post
t = -2.2361, df = 9, p-value = 0.05218
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-1.005833719 0.005833719
sample estimates:
mean difference
-0.5

Interpretations:
Assumptions:
1. To check normality of the paired differences:
p = 0.002, which is less than the 0.05 level of significance, so we reject Ho and accept H1: the paired
differences (d) are not normally distributed. The t-test result should therefore be read with caution.

Hypothesis :
Ho: the differences d are normally distributed
H1: the differences d are not normally distributed

2. To check the mean difference between the samples:

Hypothesis :
Ho: m=0
H1: m≠0

p-value = 0.05218, which is slightly greater than the 0.05 level of significance, so we do not reject Ho.
The 95 percent confidence interval (-1.0058, 0.0058) includes 0, which means the two-tailed test does not
show a significant difference between pre- and post-training ICT knowledge.
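The question asks whether the training improves knowledge, which is a directional (one-tailed) hypothesis: H1 is that the mean of pre - post is less than 0. A sketch of both versions follows, assuming the direction was chosen before looking at the data; the names two_sided and one_sided are illustrative.

```r
pre  <- c(12, 14, 13, 11, 12, 10, 15, 13, 9, 14)
post <- c(13, 15, 13, 12, 13, 11, 16, 13, 8, 14)

# Two-tailed test, as run in the original commands
two_sided <- t.test(pre, post, paired = TRUE)

# One-tailed test: H1 is mean(pre - post) < 0, i.e. scores increase after training
one_sided <- t.test(pre, post, paired = TRUE, alternative = "less")

two_sided$p.value
one_sided$p.value   # half of the two-tailed p-value, because t is negative
```

Because the one-tailed p-value is half of 0.05218, it falls below 0.05, so the conclusion depends on which alternative is used. The non-normal differences remain a limitation in either case.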

4. Independent 2 sample T-Test:


QUESTION: Interpret the life expectancy data of the Australian and Pakistani populations from
2015 to 2007 using an independent two-sample t-test in R-Studio.

Year   Australian population life expectancy   Pakistani population life expectancy
2015   82.8                                    66.4
2014   82.7                                    66.2
2013   82.5                                    66
2012   82.3                                    65.7
2011   82                                      65.5
2010   81.9                                    65.1
2009   81.7                                    64.8
2008   81.3                                    64.6
2007   81.3                                    64.4
https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

ANSWER:
Interpretation of above data by using R-studio commands:
1.aus_life_expectancy<-c(82.8,82.7,82.5,82.3,82,81.9,81.7,81.3,81.3)
2.pak_life_expectancy<-c(66.4,66.2,66,65.7,65.5,65.1,64.8,64.6,64.4)
3.life_expectancy=c(aus_life_expectancy,pak_life_expectancy)
4.group=rep(c("aus_life_expectancy","pak_life_expectancy"),each=9)
5.data.frame(aus_life_expectancy,pak_life_expectancy)
RESULTS:

> aus_life_expectancy<-c(82.8,82.7,82.5,82.3,82,81.9,81.7,81.3,81.3)
> pak_life_expectancy<-c(66.4,66.2,66,65.7,65.5,65.1,64.8,64.6,64.4)
> life_expectancy=c(aus_life_expectancy,pak_life_expectancy)
> group=rep(c("aus_life_expectancy","pak_life_expectancy"),each=9)
> data.frame(aus_life_expectancy,pak_life_expectancy)
aus_life_expectancy pak_life_expectancy
1 82.8 66.4
2 82.7 66.2
3 82.5 66.0
4 82.3 65.7
5 82.0 65.5
6 81.9 65.1
7 81.7 64.8
8 81.3 64.6
9 81.3 64.4
1. 1st assumption: check normality of the data with the Shapiro-Wilk test

COMMANDS:-
1.my_data=data.frame(group,life_expectancy)
2.shapiro.test(aus_life_expectancy)
3.shapiro.test(pak_life_expectancy)

RESULTS:

> my_data=data.frame(group,life_expectancy)
> shapiro.test(aus_life_expectancy)

Shapiro-Wilk normality test

data: aus_life_expectancy
W = 0.93244, p-value = 0.505

> shapiro.test(pak_life_expectancy)

Shapiro-Wilk normality test

data: pak_life_expectancy
W = 0.94637, p-value = 0.6503

INTERPRETATION OF SHAPIRO-TEST RESULT:

The p-values for Australian population life expectancy (0.505) and Pakistani population life
expectancy (0.6503) are both greater than the 0.05 level of significance, so we do not reject the
null hypothesis for either sample; both can be treated as normally distributed.

Null hypothesis: data is normal
Alternative hypothesis: data is not normal

2. 2nd assumption: equality of variances (F test):

Command:

1.var.test(life_expectancy~group,data=my_data)

Results:

> var.test(life_expectancy~group,data=my_data)

F test to compare two variances

data: weight by group


F = 2.7675, num df = 8, denom df = 8, p-value = 0.1714
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.6242536 12.2689506
sample estimates:
ratio of variances
2.767478

Interpretation of the variance test results:

The reported p-value = 0.1714 is greater than the 0.05 level of significance, so we do not reject the
null hypothesis of equal variances.

Null hypothesis: sigma1^2 = sigma2^2
Alternative hypothesis: sigma1^2 ≠ sigma2^2

According to the above 2 assumptions, both samples can be treated as normally distributed with equal
variances, so we now move to the final test, the independent two-sample t-test.
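The printed F test output is headed "data: weight by group", which does not match the variable names in the command, so it may be worth recomputing the variance ratio directly from the two vectors. A minimal sketch, using the shortened names aus and pak (assumptions of this example) for the two samples:

```r
aus <- c(82.8, 82.7, 82.5, 82.3, 82, 81.9, 81.7, 81.3, 81.3)
pak <- c(66.4, 66.2, 66, 65.7, 65.5, 65.1, 64.8, 64.6, 64.4)

# The F statistic of var.test() is the ratio of the two sample variances
ratio <- var(aus) / var(pak)
ratio

# Same ratio, with its two-sided p-value
vt <- var.test(aus, pak)
vt$statistic
vt$p.value
```

With the data entered as above, the ratio comes out near 0.60 rather than 2.7675, so the printed output appears to belong to a different data set. The p-value from this check is still above 0.05, so the decision not to reject equal variances is unchanged.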
3.T-test of the independent sample:
Command:
t.test(aus_life_expectancy,pak_life_expectancy,var.equal=TRUE)
Results:
> t.test(aus_life_expectancy,pak_life_expectancy,var.equal=TRUE)

Two Sample t-test

data: aus_life_expectancy and pak_life_expectancy


t = 54.518, df = 16, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
15.99723 17.29166
sample estimates:
mean of x mean of y
82.05556 65.41111

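Interpretation of T-test results:

The p-value of the t-test (< 2.2e-16) is less than the 0.05 level of significance, so we reject the
null hypothesis and accept the alternative hypothesis. The mean life expectancy of the Australian
population (82.06 years) is significantly different from that of the Pakistani population (65.41 years);
the 95 percent confidence interval for the difference is 15.997 to 17.292 years.

Null hypothesis: µ1=µ2
Alternative hypothesis: µ1≠µ2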
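The t value reported by t.test(...,var.equal=TRUE) can be reproduced by hand from the pooled variance. The sketch below is for illustration only; aus, pak, sp2 and the other names are assumptions of this example.

```r
aus <- c(82.8, 82.7, 82.5, 82.3, 82, 81.9, 81.7, 81.3, 81.3)
pak <- c(66.4, 66.2, 66, 65.7, 65.5, 65.1, 64.8, 64.6, 64.4)

n1 <- length(aus)
n2 <- length(pak)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(aus) + (n2 - 1) * var(pak)) / (n1 + n2 - 2)

# t statistic for the difference in means, and its degrees of freedom
t_stat <- (mean(aus) - mean(pak)) / sqrt(sp2 * (1 / n1 + 1 / n2))
df     <- n1 + n2 - 2

# Two-sided p-value from the t distribution
p_val <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

t_stat
df
p_val
```

The hand-computed t value and degrees of freedom agree with the t.test() output (t = 54.518, df = 16).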

5. One-way ANOVA:


Q: According to the following data, compare the population means and
variances of the life expectancy of 3 populations for the years 2015 to
2011 using R-Studio. Give the R-Studio commands and interpret the
results.

Armenia=µ1 Australia=µ2 Austria=µ3


74.8 82.8 81.5
74.6 82.7 81.4
74.4 82.5 81.1
74.4 82.3 88
73.9 82 88
https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

Commands:
1.a=c(74.8,74.6,74.4,74.4,73.9)
2.b=c(82.8,82.7,82.5,82.3,82)
3.c=c(81.5,81.4,81.1,88,88)
4.countaries=c(74.8,74.6,74.4,74.4,73.9,82.8,82.7,82.5,82.3,82,81.5,81.4,81.1,88,88)
5.group=rep(c("armania","australia","austria"),each=5)
6.dat=data.frame(countaries,group)
7.anova=aov(countaries~group,data=dat)
8.summary(anova)
9.TukeyHSD(anova)
10.library(car)
11.leveneTest(countaries~group,data=dat)
12. shapiro.test(anova$residuals)
Commands result:

> a=c(74.8,74.6,74.4,74.4,73.9)
> b=c(82.8,82.7,82.5,82.3,82)
> c=c(81.5,81.4,81.1,88,88)
> countaries=c(74.8,74.6,74.4,74.4,73.9,82.8,82.7,82.5,82.3,82,81.5,81.4,81.1,88,88)
> group=rep(c("armania","australia","austria"),each=5)
> dat=data.frame(countaries,group)
> dat
countaries group
1 74.8 armania
2 74.6 armania
3 74.4 armania
4 74.4 armania
5 73.9 armania
6 82.8 australia
7 82.7 australia
8 82.5 australia
9 82.3 australia
10 82.0 australia
11 81.5 austria
12 81.4 austria
13 81.1 austria
14 88.0 austria
15 88.0 austria
> anova=aov(countaries~group,data=dat)
> anova
Call:
aov(formula = countaries ~ group, data = dat)

Terms:
group Residuals
Sum of Squares 264.6493 54.2800
Deg. of Freedom 2 12

Residual standard error: 2.126813


Estimated effects may be unbalanced
> summary(anova)
Df Sum Sq Mean Sq F value Pr(>F)
group 2 264.65 132.32 29.25 2.43e-05 ***
Residuals 12 54.28 4.52
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> TukeyHSD(anova)
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = countaries ~ group, data = dat)

$group
diff lwr upr p adj
australia-armania 8.04 4.451418 11.628582 0.0001761
austria-armania 9.58 5.991418 13.168582 0.0000334
austria-australia 1.54 -2.048582 5.128582 0.5063531
> library(car)
> leveneTest(countaries~group,data=dat)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 2.5129 0.1226
12
> shapiro.test(anova$residuals)

Shapiro-Wilk normality test

data: anova$residuals
W = 0.83775, p-value = 0.0117

Interpretations:
1.HYPOTHESIS:
Ho:µ1=µ2=µ3
H1: at least one mean is different.
The p-value is 2.43e-05 (0.0000243), which is less than 0.05, so we reject Ho and accept H1.
It means that at least one population mean is significantly different from the others.
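The F value in the ANOVA table can be rebuilt from the between-group and within-group sums of squares. A sketch under the same data follows; the list name groups and the other variable names are assumptions of this example.

```r
groups <- list(
  armania   = c(74.8, 74.6, 74.4, 74.4, 73.9),
  australia = c(82.8, 82.7, 82.5, 82.3, 82),
  austria   = c(81.5, 81.4, 81.1, 88, 88)
)

all_vals <- unlist(groups)
grand    <- mean(all_vals)

# Between-group sum of squares: group size times squared distance of each group mean
ss_between <- sum(sapply(groups, function(g) length(g) * (mean(g) - grand)^2))

# Within-group (residual) sum of squares
ss_within <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))

df_between <- length(groups) - 1                 # 3 - 1 = 2
df_within  <- length(all_vals) - length(groups)  # 15 - 3 = 12

f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

f_value
p_value
```

The sums of squares match the "Sum of Squares" row printed by aov() (264.6493 and 54.2800), and the F value matches summary(anova).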
3. Between the samples:
TukeyHSD(anova)
1.australia-armania: adj-p = 0.0001761; its p-value is less than 0.05, so we reject Ho and
accept H1.
µ2≠µ1
2.austria-armania: adj-p = 0.0000334; its p-value is less than 0.05, so we reject Ho and
accept H1.
µ3≠µ1
3.austria-australia: adj-p = 0.5063531; its p-value is greater than 0.05, so we do not
reject Ho.
µ3=µ2
The mean of Armenia is significantly different from the means of Australia and Austria, while the
means of Australia and Austria are not significantly different from each other.
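The "diff" column of the Tukey output is simply the difference between the group means, which can be checked with tapply(). A small sketch, keeping the original variable names countaries and group; means and tukey are illustrative names added here.

```r
countaries <- c(74.8, 74.6, 74.4, 74.4, 73.9,
                82.8, 82.7, 82.5, 82.3, 82,
                81.5, 81.4, 81.1, 88, 88)
group <- factor(rep(c("armania", "australia", "austria"), each = 5))
dat   <- data.frame(countaries, group)

# Mean life expectancy per country
means <- tapply(dat$countaries, dat$group, mean)
means

# Pairwise differences of means, as listed in the Tukey table
unname(means["australia"] - means["armania"])
unname(means["austria"]   - means["armania"])
unname(means["austria"]   - means["australia"])

# The same differences taken from TukeyHSD()
tukey <- TukeyHSD(aov(countaries ~ group, data = dat))
tukey$group[, "diff"]
```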

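4. For checking variance:

leveneTest(countaries~group,data=dat) from library(car)
Ho: sigma1^2=sigma2^2=sigma3^2
H1: at least one variance is different.

p-value is 0.1226, which is greater than the 0.05 level of significance, so we do not reject Ho;
the variances can be treated as equal.
5. Normality of data:
p-value = 0.0117 < 0.05, so we reject Ho and accept H1: the residuals are not normally distributed.
The normality assumption of ANOVA is not met, so the ANOVA results should be interpreted with caution.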

REFERENCES:

https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who
https://statisticstechs.weebly.com/inferential-statistics/paired-sample-t-test-or-repeated-measures
