
Data analysis using R: Lesson 1


You need to write commands after the prompt (>).
If the prompt cannot be seen, press the Esc key to reset.
If you want to repeat a command you used before, press the PgUp key.
R commands are highlighted in yellow.
Simple calculation
log₁₀(10)
log10(10)
logₑ(10)
log(10)

3*5

4/5

3*5;4/5;3-7
quotient of 119 ÷ 13
119%/%13
remainder of 119 ÷ 13
119%%13
take an arithmetic mean of 145, 287, 210, 189, and 204
y<- c(145,287,210,189,204)
mean(y)
arithmetic mean of the consecutive integers 10, 11, 12, 13, 14, 15, 16
y2<-10:16
mean(y2)

median of the consecutive integers 10, 11, 12, 13, 14, 15, 16
median(y2)
standard deviation of the consecutive integers 10, 11, 12, 13, 14, 15, 16
sd(y2)
How to read your data set (import)
1) copy & paste from an Excel file
dat<-read.delim("clipboard")
"dat" in the above can be any name.
When you create the data set in an Excel file, you should keep the variable names in the top line.
Variable names should be simple but unique within your data set.

age   male   cr_hd   cr_ft
 63    ...    ...     ...
 84    ...    ...     ...
 70    ...    ...     ...

2) import the data directly from a csv file

dat<-read.csv("c:/Rwork/data/tsunagi.csv")
In the parentheses, you have to write the location where your data exist on your PC.
Notice!
After importing the data by either 1) or 2), you have to run the following command to analyze
the data.
attach(dat)
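As a side note (not in the original handout), you can also refer to variables without attach() by using the $ operator or with():
mean(dat$age)          # refer to a variable inside dat directly
with(dat, mean(age))   # the same, evaluated inside the data set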

Check your data set (you need to import the tsunagi data set for the following exercise)
1) Data browse
fix(dat)
Notice
You should close the browse window before you analyze the data.

2) to see the variable list
summary(dat)
3) to draw a histogram of age
hist(age)
4) box-whisker plot of age
boxplot(age)
5) scatter plot of systolic blood pressure and diastolic blood pressure
plot(sbp, dbp)
6) contingency table of sex (male) and alcohol drinking habit
table(male, alc)

The variable list of the tsunagi data set
age      : age (years)
male     : sex (0 = female, 1 = male)
alc      : alcohol drinking habit (0 = no, 1 = yes)
dur_smk  : duration of smoking (years)
hgt      : height
wgt      : weight
grip_r   : grip strength of the right hand
grip_l   : grip strength of the left hand
sbp      : systolic blood pressure
dbp      : diastolic blood pressure
hb       : hemoglobin level
wbc      : white cell count
platelet : platelet count
GOT      : aspartate aminotransferase (AST)
GPT      : alanine aminotransferase (ALT)
gGTP     : γ-glutamyl transpeptidase
tp       : total protein level
alb      : albumin
agratio  : ratio of albumin to globulin
chl      : total cholesterol
hdl      : HDL cholesterol
tgl      : triglyceride
HbA1C    : HbA1C
cr_hd    : arm cramps (0 = no, 1 = yes)
cr_ft    : calf cramps (0 = no, 1 = yes)
cancer   : cancer history (0 = no, 1 = yes)

Data analysis using R: Lesson 2


1. Binomial distribution
1) Suppose that you toss a coin ten times and you get nine heads and one tail. Is this a coincidence?
Null hypothesis:
Alternative hypothesis:
pbinom(1,10,0.5)
(number of tails obtained, number of trials, expected probability)
You will obtain the P value of a one-sided test.
In this statistical test, you get the probability of obtaining one tail or less in 10 trials under
the assumption that the null hypothesis is true.
Statistical significance: 0.05 is conventionally used as the significance level.
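As a cross-check (a minimal sketch, not in the original handout; binom.test is standard R), the same one-sided P value can be obtained from the exact binomial test:
binom.test(1, 10, p=0.5, alternative="less")   # exact test of 1 tail in 10 tosses; one-sided, same P as pbinom(1,10,0.5)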
2. Normal distribution
1) Let's obtain 5000 random numbers from the standard normal distribution.
x<-rnorm(5000)
2) Calculate the mean and variance of x.
mean(x)
var(x)
3) Confirm the distribution of x
hist(x)
4) Statistical test for normal distribution (Shapiro-Wilk normality test)
shapiro.test(x)

Kolmogorov-Smirnov test

ks.test(x, "pnorm", mean=mean(x), sd=sqrt(var(x)))


Null hypothesis: the distribution of the sample is equal to that of the source population, which
is normally distributed.

3. Case-control study
Import the tsunagi data set.

Suppose that cases are subjects with a history of cancer and controls are subjects without a
cancer history.

We want to examine case-control differences in the distributions of other factors.

1) Comparison of the distribution of HDL between cases and controls


by(hdl, cancer, mean)
unpaired t test (Student's t-test)
t.test(hdl~cancer, var.equal=TRUE)
(unpaired, variances of the two samples are equal)
Welch t test
t.test(hdl~cancer)
(unpaired, variances of the two samples are NOT equal)
By the way, is HDL normally distributed?
Parametric test: a statistical test that depends on assumption(s) about the distribution of
the data. (The distribution of the data is defined by parameter(s).)
What should we do if the data do not follow the assumption?
i) Data transformation, e.g., log transformation
ii) Use a non-parametric test


lhdl<- log(hdl)
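Having created lhdl, the comparison can be repeated on the log scale (a minimal sketch following the commands above):
by(lhdl, cancer, mean)   # group means of log-transformed HDL
t.test(lhdl~cancer)      # Welch t test on the log scale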
Notice! The tsunagi data set is not used in the following exercise.
2) Non-parametric test for the comparison of a continuous variable

Case: breast cancer cases, N=5
Control: healthy women, N=5
Assume that systolic blood pressure was examined in both groups.
We will create new variables named cases and controls.

cases<-c(140,150,160,180,190)
controls<-c(110,120,130,140,150)

Wilcoxon rank sum test (= Mann-Whitney U test)


wilcox.test(cases,controls)
3) Comparison of the distributions of categorical variables
Use the data of the rheumatic patients on the previous page.
(Pearson's) chi-squared test
chisq.test(matrix(c(73,50,27,50),nr=2))
(nr stands for the number of rows.)
chisq.test(matrix(c(73,50,27,50),nc=2))
(nc stands for the number of columns.)
If you want to check your contingency table,
matrix(c(73,50,27,50),nr=2)
4) Comparison of the distribution of categorical variables with a small sample size

                      History of rheumatism in parents
                          Yes     No
Rheumatic patients          5      3
Patients' siblings          1      3

Fisher's exact test

fisher.test(matrix(c(5,1,3,3),nr=2))

Data analysis using R: Lesson 3


Data set: tsunagi
1. Check the data distribution
Draw a histogram of systolic blood pressure (sbp).
hist(sbp)
qqnorm(sbp)
to check the normality of the sbp distribution. If the plots line up on a straight line, the data are normally distributed.
Check the normality using statistical tests
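For example (a minimal sketch applying the tests from Lesson 2 to the raw variable):
shapiro.test(sbp)
ks.test(sbp, "pnorm", mean=mean(sbp), sd=sqrt(var(sbp)))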
After transformation, check the histogram of sbp and its normality.
*log-transformation
lsbp<- log10(sbp)
hist(lsbp)
qqnorm(lsbp)
shapiro.test(lsbp)
ks.test(lsbp, "pnorm", mean=mean(lsbp), sd=sqrt(var(lsbp)))
*square-root-transformation
qsbp<- sqrt(sbp)
qqnorm(qsbp)
shapiro.test(qsbp)
ks.test(qsbp, "pnorm", mean=mean(qsbp), sd=sqrt(var(qsbp)))
2. Paired t test
To compare the grip strength between the right and left hands (matched data)
cf. To compare the grip strength between males and females (non-matched data)
boxplot(grip_l,grip_r)
t.test(grip_l,grip_r,paired=T)
cf. unpaired t test
t.test(grip_l,grip_r)

Let's see the difference in the results between the paired and unpaired tests.

3. Non-parametric test for a matched data set
wilcox.test(grip_l,grip_r,paired=T)

Wilcoxon signed rank test

ANOVA (analysis of variance)
Suppose that you want to compare the means among three (or more) groups. How do you test?
1) Let's create a new variable for smoking status, smk_grp, using the variable dur_smk. The
new variable has three categories: non-smokers, smokers for 1-19 years, and smokers for
20 years or more.
smk_grp<-(0 < dur_smk)+(20 <= dur_smk)
factor(smk_grp)
displays smk_grp as a categorical variable (note that this alone does not store the factor; to
keep it, assign it, e.g., smki<-factor(smk_grp), as used below). Now the new variable, smk_grp,
has three categories: 0 (non-smokers), 1 (smokers for 1-19 years), and 2 (smokers for 20 years or more).
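To confirm the grouping (a minimal sketch, mirroring the tapply(..., range) check used later for age groups):
tapply(dur_smk, smk_grp, range)   # range of smoking duration within each smk_grp category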
2) Estimate the mean and standard deviation for sbp by smk_grp.
by(sbp,smk_grp,mean)
by(sbp,smk_grp,sd)
You can obtain the same results using the following command.
tapply(sbp,smk_grp,mean)
The tapply command is useful for creating a cross table.
tapply(sbp,list(male,smk_grp),mean)
3) Create the box-whisker plot of sbp by smk_grp.
boxplot(sbp~smk_grp)
4) One way ANOVA
oneway.test(sbp~smk_grp, var.equal=TRUE)

What is the null hypothesis for this test?

5) One way ANOVA not assuming equal variances


oneway.test(sbp~smk_grp)

6) Non-parametric test for the mean comparison among three or more groups
kruskal.test(sbp~smk_grp)

Kruskal-Wallis test

Post-hoc testing of ANOVA (multiple comparison)

If there is a statistically significant difference among the three groups, you may ask, "Which
group differs significantly from the others?" In this situation, it is NOT appropriate to repeat
t-tests between the groups (A vs B, B vs C, A vs C).
There are many tests for multiple comparison.
Tukey's HSD: for normally distributed data
Holm's method: a modified Bonferroni test, applicable to non-parametric tests
Bonferroni correction: a conservative method with a lower statistical power
Scheffé's method: lower statistical power
Williams' method: when you have a control group and there is a tendency among the comparison
groups, this method is the best.
Dunnett's method: multiple comparison with a control group
NOTICE! The following method is NOT recommended: Duncan's method
Bonferroni correction
pairwise.t.test(sbp, smk_grp, p.adjust.method="bonferroni")
You will obtain the following output.
----------------------------------------------------------------
        Pairwise comparisons using t tests with non-pooled SD

data:  sbp and smk_grp

  0 1
1 1
2 1 1

P value adjustment method: bonferroni
----------------------------------------------------------------
The values in the table are the P values after Bonferroni correction. Since all P values are 1,
there was no statistical significance in any of the comparisons.

11
Holm's method
pairwise.t.test(sbp, smk_grp, p.adjust.method="holm")
or
pairwise.t.test(sbp, smk_grp)
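When the data do not meet the assumptions of the t test, a pairwise Wilcoxon rank sum test with the same P value adjustment is available (a sketch; pairwise.wilcox.test is standard R but does not appear in the original handout):
pairwise.wilcox.test(sbp, smk_grp, p.adjust.method="holm")   # non-parametric pairwise comparisons, Holm-adjusted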
Tukey's HSD
smki<-factor(smk_grp)
TukeyHSD(aov(sbp~smki))
------------------------------------------------------------
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = sbp ~ smki)

$smki
         diff       lwr      upr     p adj
1-0 0.3043848 -5.985119 6.593888 0.9929162
2-0 0.6947262 -3.039939 4.429391 0.9003393
2-1 0.3903413 -6.489470 7.270153 0.9902769
------------------------------------------------------------

Data analysis using R: Lesson 4


Analysis of Variance (ANOVA)
One-way ANOVA

One variable is used to divide the study subjects into groups, as in the example in Lesson 3.

Non-matched data

For example, you may want to test the performance of a machine. Under different conditions
(temperature), you repeated a performance test ten times using the same machine, and obtained
the following results. (We ignore the effect of the machine's fatigue.)
Table 1
                 Factor A (temperature)
test    A1 (-20)   A2 (0)   A3 (20)   A4 (40)
  1        63        64       59        78
  2        58        63       62        82
  3        60        63       60        85
  4        59        61       64        80
  5        61        59       65        83
  6        60        65       71        81
  7        57        61       65        79
  8        62        64       68        80
  9        50        62       74        76
 10        61        70       63        83

If you want to input the data manually, you need to run the following commands.
A1<- c(63,58,60,59,61,60,57,62,50,61)
A2<- c(64,63,63,61,59,65,61,64,62,70)
A3<- c(59,62,60,64,65,71,65,68,74,63)
A4<- c(78,82,85,80,83,81,79,80,76,83)
Dat1<-data.frame(A=factor(c(rep("A1", 10), rep("A2", 10), rep("A3", 10), rep("A4", 10))),
y=c(A1,A2,A3,A4))
Dat1
Your output will be as follows:
    A  y
1  A1 63
2  A1 58
3  A1 60
4  A1 59
.
.
37 A4 79
38 A4 80
39 A4 76
40 A4 83
boxplot(y~A, data=Dat1, col="lightblue")
to draw the box-whisker plot
summary(aov(y~A, data=Dat1))
or
oneway.test(y~A, data=Dat1, var.equal=TRUE)

Matched data

Suppose that you want to test the performance of a machine. Under different conditions
(temperature), you tested ten machines (of the same model), one measurement per machine under
each condition, and obtained the following results.
Table 2
                     Factor A (temperature)
Machine No.   A1 (-20)   A2 (0)   A3 (20)   A4 (40)
No.1             63        64       59        78
No.2             58        63       62        82
No.3             60        63       60        85
No.4             59        61       64        80
No.5             61        59       65        83
No.6             60        65       71        81
No.7             57        61       65        79
No.8             62        64       68        80
No.9             50        62       74        76
No.10            61        70       63        83

Dat2<-data.frame(A=factor(c(rep("A1", 10), rep("A2", 10), rep("A3", 10), rep("A4", 10))),
No=factor(rep(1:10, 4)), y=c(A1,A2,A3,A4))
Dat2
    A No  y
1  A1  1 63
2  A1  2 58
3  A1  3 60
4  A1  4 59
.
.
37 A4  7 79
38 A4  8 80
39 A4  9 76
40 A4 10 83
summary(aov(y~A + No, Dat2))


You will get the following output.
            Df  Sum Sq Mean Sq F value    Pr(>F)
A            3 2681.47  893.82 62.2353 2.972e-12 ***
No           9   77.72    8.64  0.6013    0.7846
Residuals   27  387.78   14.36
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This result indicates that the measurement is significantly related to factor A (temperature).
On the other hand, there is no significant difference in the performance of the ten machines
(P = 0.7846).

Two-way ANOVA

Suppose that you want to test the performance of a machine under different conditions of
temperature and humidity. You repeated the test five times for each combination, and obtained
the following results.

Table 3  Results by factor A (temperature) and factor B (humidity)

                             Factor A (temp.)
                      A1 (-20)  A2 (0)  A3 (20)  A4 (40)
Factor B     B1          63       64      63       68
(humidity)   (< 50%)     58       63      62       72
                         60       63      67       80
                         59       61      64       70
                         61       59      65       75
             B2          60       65      71       72
             (50% <)     57       59      62       68
                         62       64      68       63
                         64       72      64       61
                         61       67      73       63

B1<- c(63,58,60,59,61,64,63,63,61,59,63,62,67,64,65,68,72,80,70,75)
B2<- c(60,57,62,64,61,65,59,64,72,67,71,62,68,64,73,72,68,63,61,63)
Dat3<- data.frame(A=factor(rep(c(rep("A1", 5), rep("A2", 5), rep("A3", 5), rep("A4", 5)), 2)),
B=factor(c(rep("B1", 20), rep("B2",20))), y=c(B1,B2))
Dat3
    A  B  y
1  A1 B1 63
2  A1 B1 58
3  A1 B1 60
.
.
38 A4 B2 63
39 A4 B2 61
40 A4 B2 63
summary(aov(y~A + B, Dat3))
How do you interpret the results?
summary(aov(y~A*B, Dat3))
to see the interaction between A and B

----------------------------------------------------------------
            Df Sum Sq Mean Sq F value    Pr(>F)
A            3 402.67  134.22 10.3250 6.588e-05 ***
B            1   0.03    0.03  0.0019  0.965294
A:B          3 203.07   67.69  5.2071  0.004833 **
Residuals   32 416.00   13.00
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
----------------------------------------------------------------
The interaction term is A:B. Since the interaction term is statistically significant, we can say
that there is an interaction between A and B for the performance.
To understand this association, let's draw a graph.
attach(Dat3)
interaction.plot(B,A,y)

3) Cohort study (using the Tsunagi data set)
To create a new variable named smk (smokers vs. never-smokers):
smk<-(0 < dur_smk)
factor(smk)
smokers = TRUE and never-smokers = FALSE
We followed up all subjects with information on smoking status for ten years and obtained
information on cancer incidence.
This is a cohort study. According to the Dictionary of Epidemiology, the definition of a cohort
study is "The analytic method of epidemiologic study in which subsets of a defined population
can be identified who are, have been, or in the future may be exposed or not exposed, or exposed
in different degrees, to a factor or factors hypothesized to influence the probability of
occurrence of a given disease or other outcome."
To check the number of cancer patients in smokers and never-smokers,
table(cancer,smk)
        smk
cancer  FALSE TRUE
     0    908  325
     1     49   28
Thus, the proportion of cancer patients in smokers (28/353 = 7.9%) is higher than that in
never-smokers (49/957 = 5.1%). Is this difference statistically significant or not?
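R can compute these proportions directly (a minimal sketch; prop.table with margin 2 gives column proportions):
prop.table(table(cancer, smk), 2)   # proportion of cancer within never-smokers (FALSE) and smokers (TRUE)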
4) Chi-squared test
chisq.test(cancer,smk)
5) Alternative method to examine the difference in proportions (Tsunagi data are not used)
Using the same contingency table of cancer frequency in smokers and never-smokers,
ca<-c(49, 28)
t<-c(957,353)
prop.test(ca,t)
Please compare this result with that of the chi-squared test.

6) n × m table (chi-squared test)

               A     B     C
Smokers       50    20    30
Non-smokers   50    80    70
All          100   100   100

1) To examine the difference in the proportion of smokers among the three groups (A, B, and C),
s<-c(50,20,30)
t<-c(100,100,100)
prop.test(s,t)
---------------------------------------------------------------------
        3-sample test for equality of proportions without continuity correction

data:  s out of t
X-squared = 21, df = 2, p-value = 2.754e-05
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3
   0.5    0.2    0.3
---------------------------------------------------------------------
2) You can apply the chi-squared test.
chisq.test(matrix(c(50,20,30,50,80,70),nc=2))
-------------------------------------------------------
        Pearson's Chi-squared test

data:  matrix(c(50, 20, 30, 50, 80, 70), nc = 2)
X-squared = 21, df = 2, p-value = 2.754e-05
-------------------------------------------------------


Data analysis using R: Lesson 5


McNemar test
Let's assume that there are 100 cases with a disease and 100 sex- and age-matched controls.
We are going to compare the proportion of smokers between cases and controls. (In other words,
we want to examine the association between smoking and the risk of the disease.)
Since sex- and age-matched controls were selected, you may want to keep the 100 pairs (a matched
data set) in the statistical analysis.

                             Cases
                     Non-smoker   Smoker
Controls  Non-smoker     20 (a)   40 (b)
          Smoker         10 (c)   30 (d)

The 20 pairs in cell (a) and the 30 pairs in cell (d) cannot contribute to the analysis, since
there is no difference in smoking status between the cases and controls. Thus, only cells (b)
and (c) contribute to the analysis of the association between smoking status and the risk of
the disease.
Under the null hypothesis, the number of pairs should be equal between cells (b) and (c), and we
apply the McNemar test for this analysis.
The test statistic is χ² = (b - c)² / (b + c), and the degree of freedom is always 1.

Here, the test is run as a binomial proportion test on the discordant pairs:
prop.test(numerator, denominator), where the numerator is the count in cell (b) and the
denominator is the total number of discordant pairs, b + c.
prop.test(40,50)
What happens if you replace the numerator, cell (b), with cell (c)?
prop.test(10,50)
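R also provides mcnemar.test, which applies the χ² statistic above to the whole 2x2 table of pairs (a hedged alternative to the prop.test approach; the cell layout follows the table above):
mcnemar.test(matrix(c(20,10,40,30), nr=2))   # rows: controls, columns: cases; uses discordant cells b=40 and c=10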

3. Correlation
Using the Tsunagi data set, let's see the association between systolic and diastolic blood
pressure.
scatter plot
plot(sbp,dbp)

Do you see any association between systolic and diastolic blood pressures?

correlation coefficient (parametric method)
cor.test(sbp,dbp)
This is called the (Pearson's) correlation coefficient, which is a measure of association that
indicates the degree to which two variables have a linear relationship.
The coefficient, represented by the letter r, can vary between +1 and -1; when r = +1, there
is a perfect positive linear relationship in which one variable varies directly with the other.
Non-parametric methods
Spearman's rank correlation
cor.test(sbp, dbp, method="spearman")
Kendall's tau
cor.test(sbp, dbp, method="kendall")

4. Regression analysis
Let's see the association between age and systolic blood pressure.
plot(age,sbp)

Systolic blood pressure tends to increase with age. Can we predict systolic blood pressure
from age using a statistical model?

Univariate regression model

glm(sbp~age)
This model can be applied under the assumption that the dependent variable is normally
distributed.

Call:  glm(formula = sbp ~ age)

Coefficients:
(Intercept)          age
    92.0454       0.5236

Degrees of Freedom: 1308 Total (i.e. Null);  1307 Residual
  (1 observation deleted due to missingness)
Null Deviance:      700600
Residual Deviance: 635700        AIC: 11820

AIC: Akaike's Information Criterion (used for the evaluation of statistical models)

summary(glm(sbp~age))

Call:
glm(formula = sbp ~ age)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-49.837  -15.031   -2.512   12.865   88.257

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.04541    2.87076   32.06   <2e-16 ***
age          0.52357    0.04531   11.55   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 486.3577)

    Null deviance: 700607 on 1308 degrees of freedom
Residual deviance: 635670 on 1307 degrees of freedom
  (1 observation deleted due to missingness)
AIC: 11817

Number of Fisher Scoring iterations: 2

Residual plot: used to check the linearity of the association between X and Y, the normal
distribution of the Y variable, and that the distribution of Y does not change with X
(homoscedasticity).

Draw the scatter plot and the line obtained from the regression model.
plot(age,sbp)
abline(glm(sbp~age))
Plotting the residuals for each subject:
rslt<-glm(sbp~age)
plot(resid(rslt))
yp<-predict(rslt)                # yp is the predicted value for each age
rsd<-residuals(rslt)             # rsd is the residual
comp<- complete.cases(yp, rsd)   # to exclude subjects with missing data
yp<- yp[comp]
rsd<- rsd[comp]
plot(yp,rsd)

[Figure: residual plot, residuals against the fitted values of y; a fan shape of the residual
plot indicates heteroscedasticity]


Data analysis using R: Lesson 6


Regression analysis using the Tsunagi data set
Categorical variable as an explanatory variable
In the previous model, we used age as an explanatory variable. How about categorical variables?
Can we use a categorical variable as an explanatory variable? The answer is "Yes!" Let's
see an example using sex as an explanatory variable.
summary(glm(sbp~male))
Output will be as follows:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.5988     0.8095  151.45  < 2e-16 ***
male          4.8212     1.3103    3.68 0.000243 ***

Using this regression model, systolic blood pressure can be expressed as:
systolic blood pressure = 122.5988 + 4.8212 * male
Please note that male is coded as 1 and female as 0 in the Tsunagi data.
Thus, the mean systolic blood pressure in females is 122.5988, and that in males is 127.42
(= 122.5988 + 4.8212).
Let's confirm these mean values by another method.
by(sbp,male,mean)
The P value of the regression analysis above indicates that the association between
systolic blood pressure and sex is statistically significant. In other words, there is a
significant difference in the mean systolic blood pressure between males and females.
Actually, you would obtain the same P value by a t test assuming the same variance between males
and females.
t.test(sbp~male, var.equal=TRUE)
* The above command is different from that mentioned in Lesson 2.
Without the option var.equal=TRUE, you will obtain the result of a t-test assuming different
variances between males and females.
t.test(sbp~male)

In the case of sex, male is coded as 1 and female as 0. What happens if we do not use
number(s) for coding categorical variable(s)?
The following command creates a new variable, smk.
smk<-(0 < dur_smk)
factor(smk)
The code of smk is TRUE or FALSE (TRUE means ever-smoker, FALSE means non-smoker).
Let's conduct a regression analysis using smk.
summary(glm(sbp~smk))
Grouping a continuous variable (converting a continuous variable to a categorical variable)
ag<-(50 <= age)+(60 <= age)+(70 <= age)
tapply(age,ag,range)

(ag=0: age < 50, ag=1: 50 <= age < 60, ag=2: 60 <= age < 70, ag=3: 70 <= age)

summary(glm(sbp~ag))
R still recognizes the variable ag as a continuous variable, since the data of ag range
from 0 to 3.
Thus, we need to specify this variable as a categorical variable. To avoid confusion, the
categorical variable is named agi (ag is still kept as a continuous variable).
agi<-factor(ag)
summary(glm(sbp~agi))

In the output, you can see new terms: agi1, agi2, and agi3. These terms are automatically
created by the program and are called dummy variables.

The obtained regression model is
systolic blood pressure = 111.249 + 9.272*agi1 + 16.746*agi2 + 19.650*agi3.
Using this model, we can obtain the mean systolic blood pressure for each age group. For example,
to obtain the mean SBP for age group 1 (50-59 years old), you substitute 1 into agi1
and 0 into agi2 and agi3. Thus, the mean SBP for age group 1 is 111.249 + 9.272. In a similar
way, to obtain the mean SBP for age group 2 (60-69 years old), you substitute 1 into
agi2 and 0 into agi1 and agi3 (= 111.249 + 16.746). All terms (agi1-3) are 0 for age group
0 (less than 50 years old). Thus, the intercept (111.249) is the mean SBP for age group 0.
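You can verify these model-based means against the observed group means (a minimal sketch):
tapply(sbp, agi, mean)   # observed mean SBP per age group; compare with 111.249, 111.249+9.272, etc.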

4) Multivariate regression analysis (multiple regression model)

According to the previous analyses, systolic blood pressure is related not only to age but also
to sex. In this situation, the mean age might differ between males and females. Let's see
the difference in the mean age between males and females.
t.test(age~male, var.equal=T)
        Two Sample t-test

data:  age by male
t = -1.6976, df = 1307, p-value = 0.08982
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.7994433  0.2020984
sample estimates:
mean in group 0 mean in group 1
       61.41533        62.71400

There is a marginally significant difference in the mean age between males and females.
Multiple regression analysis provides a statistical model using two or more covariates.
The following command estimates the model using sex and age as covariates.
summary(glm(sbp~male+age))
Call:
glm(formula = sbp ~ male + age)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-52.284  -14.934   -2.683   12.583   89.900

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 90.90451    2.88096  31.554  < 2e-16 ***
male         4.11746    1.25124   3.291  0.00103 **
age          0.51660    0.04519  11.431  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 482.7276)

    Null deviance: 700607 on 1308 degrees of freedom
Residual deviance: 630442 on 1306 degrees of freedom
  (1 observation deleted due to missingness)
AIC: 11809

Number of Fisher Scoring iterations: 2
The coefficient of sex indicates the sex difference in systolic blood pressure after adjusting
for the effect of age. Since the P value for sex is less than 0.05, the sex difference in
systolic blood pressure is statistically significant even after adjusting for the effect of age.
5) Multicollinearity
Multicollinearity is a statistical phenomenon in which two or more predictor (explanatory)
variables in a multiple regression model are highly correlated. In this situation, the coefficient
estimates may change erratically in response to small changes in the model or the data.
Multicollinearity does not reduce the predictive power or reliability of the model as a whole;
it only affects calculations regarding individual predictors. That is, a multiple regression
model with correlated predictors can indicate how well the entire bundle of predictors predicts
the outcome variable, but it may not give valid results about any individual predictor, or about
which predictors are redundant with others.
For example, multicollinearity occurs when weight, height, and BMI are all used in the same regression model.
How can we avoid multicollinearity?
Examine the associations among the explanatory variables by calculating correlation coefficients,
as in the sketch below. If you find multicollinearity, pick one of the correlated variables as a covariate.
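For example (a sketch using the height and weight variables of the tsunagi data set; use="complete.obs" drops subjects with missing values):
cor(cbind(hgt, wgt), use="complete.obs")   # correlation matrix of candidate explanatory variables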

Analysis of covariance: ANCOVA (effect modification, synergistic effect)

In the tsunagi data set, the association between age and systolic blood pressure may be different
between males and females. How can we show that?
Let's see the associations between age and sbp for males and females, separately.
plot(age, sbp, pch=as.integer(male))
(pch=1, circles: males; pch=0, squares: females)
abline(lm1<- glm(sbp[male==0]~age[male==0]))
regression line for females
abline(lm2<- glm(sbp[male==1]~age[male==1]), lty=2)
regression line for males (dashed, lty=2)
*Results of the regression analyses by sex

summary(lm1)

Call:
glm(formula = sbp[male == 0] ~ age[male == 0])

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-51.034  -14.539   -2.127   12.897   87.461

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    76.74885    3.78235   20.29   <2e-16 ***
age[male == 0]  0.74709    0.06026   12.40   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 494.2336)

    Null deviance: 474820 on 808 degrees of freedom
Residual deviance: 398847 on 807 degrees of freedom
  (1 observation deleted due to missingness)
AIC: 7318.1

Number of Fisher Scoring iterations: 2
summary(lm2)

Call:
glm(formula = sbp[male == 1] ~ age[male == 1])

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-48.311  -13.398   -2.858   12.966   61.318

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    114.64958    4.21475  27.202  < 2e-16 ***
age[male == 1]   0.20363    0.06556   3.106  0.00200 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 430.8135)

    Null deviance: 218702 on 499 degrees of freedom
Residual deviance: 214545 on 498 degrees of freedom
AIC: 4455.8

Number of Fisher Scoring iterations: 2
The coefficients of age for males and females are 0.2 and 0.7, respectively.
Is this difference statistically significant?

summary(glm(sbp~male+age+male*age))
or
summary(glm(sbp~male*age))
This regression model can be expressed as follows:
sbp = a + b(male) + c(age) + d(male * age)
For females, male = 0, so
sbp = a + c(age)
For males, male = 1, so
sbp = (a + b) + (c + d)(age)
Thus, the difference in the slope between males and females is d.
H0: d = 0, HA: d ≠ 0
If there is a statistically significant difference between males and females, we should report
the results of the regression analyses for males and females separately.

Data analysis using R: Lesson 7

In this session, we use the kasari data set.
Kasari data set
MMS   : Mini Mental State (a score of cognitive function)
r_gr  : hand grip of the right hand
l_gr  : hand grip of the left hand
hg    : mercury level in the hair
male  : sex (male = 1, female = 0)
age   : age
sbp   : systolic blood pressure
dbp   : diastolic blood pressure
cr_hd : cramp of the upper limb (0 = no, 1 = yes)
cr_ft : calf cramp (0 = no, 1 = yes)
A case-control study was conducted. Cases had a history of calf cramp, and controls did not.
Please try to answer the following questions by yourself.
How many cases and controls are there in this data set?
table(cr_ft)
Check the distribution of the mercury level in the hair by drawing a graph.
hist(hg)
Check the normality of the mercury distribution in the hair using an appropriate method. If the
mercury level is not normally distributed, conduct a logarithmic transformation and then
check the normality again.
shapiro.test(hg)
lhg<-log(hg)
shapiro.test(lhg)
Classify the data of the mercury level in the hair into four groups (<20, 20-, 30-, and 40-).
hgg=(20 <= hg)+(30 <= hg)+(40<= hg)
factor(hgg)
Create a contingency table using the variables of mercury level in the hair (hgg) and the
presence of calf cramp (cr_ft).

table(cr_ft,hgg)
Conduct a chi-squared test to check the association between the mercury level in the hair and
the presence of calf cramp. Furthermore, conduct an appropriate t-test to compare the mean
mercury level in the hair between cases and controls after logarithmic transformation.
chisq.test(cr_ft,hgg)
t.test(lhg~cr_ft)
According to these results, there might be an association between the mercury level in the hair
and the presence of calf cramp. Let's examine this association taking into account the effects
of other factors.
Logistic regression analysis
In this statistical model, the dependent variable is cr_ft (the presence of calf cramp: yes = 1,
no = 0).
Univariate logistic regression analysis
rst<-glm(cr_ft~ hgg, family=binomial)
summary(rst)

Note that the variable hgg has four groups (0, 1, 2, 3), but R does not recognize it as a
categorical variable yet. This model shows the trend of the calf cramp risk with increasing
mercury level.

Conduct a logistic regression analysis after specifying hgg as a categorical variable (newly
named hggi).
hggi<-factor(hgg)
rst2<-glm(cr_ft~ hggi, family=binomial)
summary(rst2)
Conduct a chi-squared test to see the association between sex (male) and the presence of
calf cramp (cr_ft).
table(cr_ft,male)
chisq.test(cr_ft,male)
Conduct an appropriate t-test to compare the mean mercury level in the hair between males
and females after logarithmic transformation.
t.test(lhg~male)

wilcox.test(lhg~male)
There might be a sex difference in the mercury level in the hair. Let's see the association
between the mercury level in the hair and the calf cramp risk after adjusting for the effect
of sex.
rst3<-glm(cr_ft~male+ hggi, family=binomial)
summary(rst3)
Estimate the maximum-likelihood odds ratios and the corresponding 95% confidence intervals.
print(exp(coef(rst3)))
print(exp(confint(rst3)))

How to interpret the coefficients of a logistic regression model

log(P/(1-P)) = B0 + B1*X1 + B2*X2 + ... + Bp*Xp
where logit(P) = log(P/(1-P))
In a univariate model using male (male = 1, female = 0) as an explanatory variable,
logit(P) = B0 + a(male)
For males,
logit(P for males) = B0 + a
For females,
logit(P for females) = B0
To express the calf cramp risk in males relative to that in females, the odds ratio is calculated.
The definition of the odds is P/(1-P), and logit(P) = log(P/(1-P)).
Odds ratio for males = {Pm/(1-Pm)} / {Pf/(1-Pf)}
(odds for males = Pm/(1-Pm); odds for females = Pf/(1-Pf))
log(odds ratio) = log[{Pm/(1-Pm)} / {Pf/(1-Pf)}]
                = log(Pm/(1-Pm)) - log(Pf/(1-Pf))
                = logit(P for males) - logit(P for females)
                = (B0 + a) - B0
                = a
Thus, exp(a) is the odds ratio (= the calf cramp risk in males relative to that in females).
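As a numerical check on this interpretation (a minimal sketch using the commands from this lesson):
rst_m<-glm(cr_ft~male, family=binomial)   # univariate model with male only
exp(coef(rst_m))                          # exp(a) = odds ratio for males vs. females
exp(confint(rst_m))                       # corresponding 95% confidence intervals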
