3*5
4/5
3*5;4/5;3-7
quotient of 119 ÷ 13
119%/%13
remainder of 119 ÷ 13
119%%13
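To see what these integer-division operators return for 119 and 13:

```r
119 %/% 13  # integer quotient: 9 (since 13 * 9 = 117)
119 %% 13   # remainder: 2
```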
take an arithmetic mean of 145, 287, 210, 189, 204
y<- c(145,287,210,189,204)
mean(y)
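The result can be checked by hand, since the arithmetic mean is just the sum divided by the count:

```r
y <- c(145, 287, 210, 189, 204)
sum(y) / length(y)  # 1035 / 5 = 207, identical to mean(y)
```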
arithmetic mean of consecutive integers 10, 11, 12, 13, 14, 15, 16
y2<-10:16
mean(y2)
median of consecutive integers 10, 11, 12, 13, 14, 15, 16
median(y2)
standard deviation of consecutive integers 10, 11, 12, 13, 14, 15, 16
sd(y2)
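Closely related to sd() is var(), the sample variance; for the same consecutive integers:

```r
y2 <- 10:16
var(y2)        # sample variance: 28/6, about 4.67
sqrt(var(y2))  # identical to sd(y2), about 2.16
```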
How to read your data set (import)
copy & paste: from excel file
dat<-read.delim("clipboard")
dat in the above can be any name.
When you create the data set in excel file, you should keep variable names at the top line.
Variable names should be simple but unique within your data set.
Example (top rows of the Excel sheet):
age  male  cr_hd  cr_ft
 63   ...    ...    ...
 84   ...    ...    ...
 70   ...    ...    ...
3. Check your data set (you need to import the tsunagi data set for the following exercise)
1) Data browse
fix(dat)
Notice
You should close the browse window when you analyze the data.
2) To see the variable list
summary(dat)
3) To draw a histogram of age
hist(age)
4) Box-whisker plot of age
boxplot(age)
5) Scatter plot of systolic blood pressure and diastolic blood pressure
plot(sbp, dbp)
6) Contingency table of male and alcohol drinking habit
table(male, alc)
The variable list of tsunagi data set
variable name:
age      : age (year)
male     : sex (0=female, 1=male)
alc      : alcohol drinking habit
dur_smk  : duration of smoking (year)
hgt      : height
wgt      : weight
grip_r   : grip strength (right)
grip_l   : grip strength (left)
sbp      : systolic blood pressure
dbp      : diastolic blood pressure
hb       : hemoglobin level
wbc      : white blood cell count
platelet : platelet count
GOT      : aspartate aminotransferase (AST)
GPT      : alanine aminotransferase (ALT)
gGTP     : γ-glutamyl transpeptidase
tp       : total protein
alb      : albumin
agratio  : albumin/globulin ratio
chl      : total cholesterol
hdl      : HDL cholesterol
tgl      : triglyceride
HbA1C    : HbA1C
cr_hd    : arm cramps (0=no, 1=yes)
cr_ft    : calf cramps (0=no, 1=yes)
cancer   : cancer history (0=no, 1=yes)
Kolmogorov-Smirnov test
Case-control study
Import the data set of tsunagi
Suppose that cases are subjects with a history of cancer and controls are subjects
without a cancer history.
Pair No.   Rheumatic patients (cases)   Patients' siblings (controls)
Let's see the difference in the results between paired and unpaired tests.
2. Non-parametric test for matched data set
wilcox.test(grip_l,grip_r,paired=T)
ANOVA (analysis of variance)
Suppose that you want to compare the mean among three (or more) groups. How do you test?
1) Let's create a new variable for smoking status, smk_grp, using the variable dur_smk. The
new variable has three categories: non-smokers, smokers for 1-19 years, and smokers for
20 years or more.
smk_grp<-(0 < dur_smk)+(20 <= dur_smk)
factor(smk_grp)
to assign smk_grp as a categorical variable. Now the new variable, smk_grp, has
three categories: 0 (non-smokers), 1 (smokers for 1-19 years), and 2 (smokers for 20 years or more).
2) Estimate the mean and standard deviation for sbp by smk_grp.
by(sbp,smk_grp,mean)
by(sbp,smk_grp,sd)
You can obtain the same results using the following command.
tapply(sbp,smk_grp,mean)
The tapply command is also useful for creating a cross table.
tapply(sbp,list(male,smk_grp),mean)
3) Create the box-whisker plot of sbp by smk_grp.
boxplot(sbp~smk_grp)
4) One way ANOVA
oneway.test(sbp~smk_grp, var.equal=TRUE)
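The equal-variance assumption behind var.equal=TRUE can be checked beforehand, for example with Bartlett's test. A minimal sketch with simulated data standing in for the tsunagi variables (sbp_sim and grp_sim are made-up stand-ins, not variables from the data set):

```r
set.seed(1)
sbp_sim <- rnorm(300, mean = 130, sd = 15)          # stand-in for sbp
grp_sim <- factor(sample(0:2, 300, replace = TRUE)) # stand-in for smk_grp
bartlett.test(sbp_sim ~ grp_sim)                    # test of equal variances across groups
oneway.test(sbp_sim ~ grp_sim, var.equal = TRUE)    # one-way ANOVA as in the text
```

If Bartlett's test suggests unequal variances, oneway.test without var.equal=TRUE (Welch's ANOVA) is an alternative.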
6) Non-parametric test for the mean comparison among three or more groups
kruskal.test(sbp~smk_grp)
Kruskal-Wallis test
pairwise.t.test(sbp, smk_grp, p.adjust.method="bonferroni")
----------------------------------------------------------------
  0 1
1 1 -
2 1 1

P value adjustment method: bonferroni
----------------------------------------------------------------
The values in the table are P values after Bonferroni correction. Since all P values are 1,
none of the pairwise comparisons is statistically significant.
Holm's method
pairwise.t.test(sbp, smk_grp, p.adjust.method="holm")
or
pairwise.t.test(sbp, smk_grp)
Tukey's HSD
smki<-factor(smk_grp)
TukeyHSD(aov(sbp~smki))
-----------------------------------------------------------
Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = sbp ~ smki)

$smki
    diff    lwr    upr    p adj
-----------------------------------------------------------
One variable is used to divide the study subjects into groups, as in the example
in Lesson 3.
Non-matched data
For example, you may want to test the performance of a machine. Under different conditions
(temperature), you repeated the performance test ten times using the same machine and obtained
the following results. (We ignore the effect of the machine's fatigue.)
Table 1
        Factor A (temperature: -20, 20, 40, ...)
test     A1    A2    A3    A4
  1      63    64    59    78
  2      58    63    62    82
  3      60    63    60    85
  4      59    61    64    80
  5      61    59    65    83
  6      60    65    71    81
  7      57    61    65    79
  8      62    64    68    80
  9      50    62    74    76
 10      61    70    63    83
If you want to input the data manually, you need to run the following commands.
A1<- c(63,58,60,59,61,60,57,62,50,61)
A2<- c(64,63,63,61,59,65,61,64,62,70)
A3<- c(59,62,60,64,65,71,65,68,74,63)
A4<- c(78,82,85,80,83,81,79,80,76,83)
Dat1<-data.frame(A=factor(c(rep("A1", 10), rep("A2", 10), rep("A3", 10), rep("A4", 10))),
y=c(A1,A2,A3,A4))
Dat1
Your output will be as follows:
    A  y
1  A1 63
2  A1 58
3  A1 60
4  A1 59
.
.
37 A4 79
38 A4 80
39 A4 76
40 A4 83
boxplot(y~A, data=Dat1, col="lightblue")
to draw the box-whisker plot
summary(aov(y~A, data=Dat1)) or
oneway.test(y~A, data=Dat1, var.equal=TRUE)
Matched data
Suppose that you want to test the performance of a machine. Under different conditions
(temperature), you repeated the performance test ten times using ten machines (of the same
model) and obtained the following results.
Table 2
              Factor A (temperature: -20, 20, 40, ...)
Machine No.    A1    A2    A3    A4
  No.1         63    64    59    78
  No.2         58    63    62    82
  No.3         60    63    60    85
  No.4         59    61    64    80
  No.5         61    59    65    83
  No.6         60    65    71    81
  No.7         57    61    65    79
  No.8         62    64    68    80
  No.9         50    62    74    76
  No.10        61    70    63    83
Dat2<-data.frame(A=factor(c(rep("A1", 10), rep("A2", 10), rep("A3", 10), rep("A4", 10))),
No=factor(rep(1:10, 4)), y=c(A1,A2,A3,A4))
Dat2
    A No  y
1  A1  1 63
2  A1  2 58
3  A1  3 60
4  A1  4 59
.
.
37 A4  7 79
38 A4  8 80
39 A4  9 76
40 A4 10 83
summary(aov(y~A+No, data=Dat2))
------------------------------------------------------------
            Df  Sum Sq Mean Sq F value  Pr(>F)
A            3 2681.47  893.82   62.24  <0.001 ***
No           9   77.72    8.64    0.60  0.7846
Residuals   27  387.78   14.36
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------
This result indicates that the measurement is significantly related to factor A (temperature).
On the other hand, there is no significant difference in the performance of the ten machines
(P=0.7846).
Two-way ANOVA
Suppose that you want to test the performance of a machine under different conditions of
temperature and humidity. You repeated the test five times for each combination and obtained
the following results.
Table 3
                          Factor A (temperature: -20, 20, 40, ...)
                         A1    A2    A3    A4
Factor B    B1 (< 50%)   63    64    63    68
(humidity)               58    63    62    72
                         60    63    67    80
                         59    61    64    70
                         61    59    65    75
            B2 (50% <)   60    65    71    72
                         57    59    62    68
                         62    64    68    63
                         64    72    64    61
                         61    67    73    63
B1<- c(63,58,60,59,61,64,63,63,61,59,63,62,67,64,65,68,72,80,70,75)
B2<- c(60,57,62,64,61,65,59,64,72,67,71,62,68,64,73,72,68,63,61,63)
Dat3<- data.frame(A=factor(rep(c(rep("A1", 5), rep("A2", 5), rep("A3", 5), rep("A4", 5)), 2)),
B=factor(c(rep("B1", 20), rep("B2",20))), y=c(B1,B2))
Dat3
    A  B  y
1  A1 B1 63
2  A1 B1 58
3  A1 B1 60
.
.
38 A4 B2 63
39 A4 B2 61
40 A4 B2 63
summary(aov(y~A + B, Dat3))
How do you interpret the results?
summary(aov(y~A*B, Dat3))
to see the interaction between A and B
------------------------------------------------------------------
            Df Sum Sq Mean Sq F value   Pr(>F)
A            3 402.67  134.22 10.3249   <0.001 ***
B            1   0.03    0.03  0.0023 0.965294
A:B          3 203.07   67.69  5.2071 0.004833 **
Residuals   32 416.00   13.00
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
The interaction term is A:B. Since the interaction term is statistically significant, we
can say there is an interaction between A and B for the performance.
To understand this association, let's draw a graph.
attach(Dat3)
interaction.plot(B,A,y)
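Since attach() can leave copies of variables in the search path, the same plot can also be drawn without attaching; a self-contained sketch that rebuilds Dat3 from the vectors given above:

```r
# Rebuild Dat3 from the vectors in the text, then plot without attach()
B1 <- c(63,58,60,59,61,64,63,63,61,59,63,62,67,64,65,68,72,80,70,75)
B2 <- c(60,57,62,64,61,65,59,64,72,67,71,62,68,64,73,72,68,63,61,63)
Dat3 <- data.frame(A = factor(rep(c(rep("A1",5), rep("A2",5), rep("A3",5), rep("A4",5)), 2)),
                   B = factor(c(rep("B1",20), rep("B2",20))),
                   y = c(B1, B2))
with(Dat3, interaction.plot(B, A, y))  # mean y at each B level, one line per A level
```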
3. Cohort study (using Tsunagi data set)
To create a new variable named smk, dividing subjects into smokers and never-smokers:
smk<-(0 < dur_smk)
factor(smk)
smokers=TRUE and never-smokers=FALSE
We followed up all subjects, with information of smoking status, for ten years and obtained the
information of cancer incidence.
This is a cohort study. According to the Dictionary of Epidemiology, the definition of a cohort
study is "The analytic method of epidemiologic study in which subsets of a defined population
can be identified who are, have been, or in the future may be exposed or not exposed, or exposed
in different degrees, to a factor or factors hypothesized to influence the probability of
occurrence of a given disease or other outcome."
To check the number of cancer patients in smokers and never-smokers,
table(cancer,smk)
       smk
cancer FALSE TRUE
     0   908  325
     1    49   28
Thus, the proportion of cancer patients in smokers (28/353 = 7.9%) is higher than that in
never-smokers (49/957 = 5.1%). Is this difference statistically significant or not?
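The two proportions quoted above can be reproduced directly from the contingency table; a short sketch (tab is a made-up name for the table from the text):

```r
# 2x2 table from the text: rows = cancer (0/1), columns = smk (FALSE/TRUE)
tab <- matrix(c(908, 49, 325, 28), nrow = 2,
              dimnames = list(cancer = c("0", "1"), smk = c("FALSE", "TRUE")))
round(prop.table(tab, margin = 2)[2, ], 3)  # 0.051 in never-smokers, 0.079 in smokers
```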
4. Chi-squared test
chisq.test(cancer,smk)
5. Alternative method to examine the rate difference (Tsunagi data is not used)
Using the same contingency table of cancer frequency in smokers and never-smokers,
ca<-c(49, 28)
t<-c(957,353)
prop.test(ca,t)
Please compare this result with that of the chi-squared test.
6. n × m table (chi-squared test)
               A     B     C
Smokers       50    20    30
Non-smokers   50    80    70
All          100   100   100
1) To examine the difference in the proportion of smokers among three groups (A, B, and C),
s<-c(50,20,30)
t<-c(100,100,100)
prop.test(s,t)
---------------------------------------------------------------------
3-sample test for equality of proportions without continuity correction

data:  s out of t
...
sample estimates:
prop 1 prop 2 prop 3
   0.5    0.2    0.3
---------------------------------------------------------------------
                         Controls
                     Smoker      Non-smoker
Cases  Smoker        20 (a)      40 (b)
       Non-smoker    10 (c)      30 (d)
The 20 pairs in cell (a) and the 30 pairs in cell (d) cannot contribute to the analysis since
there is no difference in the smoking status between cases and controls. Thus, only cells (b)
and (c) contribute to the analysis of the association between smoking status and the risk of
disease.
Under the null hypothesis, the number of pairs should be equal between cells (b) and (c), and
we apply McNemar's test for this analysis: χ² = (b − c)² / (b + c), with 1 degree of freedom.
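In R, McNemar's test is available as mcnemar.test(); a sketch using the pair counts from the table above (m is a made-up name):

```r
# Matched-pair 2x2 table from the text: a=20, b=40, c=10, d=30
m <- matrix(c(20, 10, 40, 30), nrow = 2,
            dimnames = list(cases = c("smoker", "non-smoker"),
                            controls = c("smoker", "non-smoker")))
mcnemar.test(m)  # uses only the discordant cells b=40 and c=10
```

With the default continuity correction, the statistic is (|40 − 10| − 1)² / 50 = 16.82.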
3. Correlation
Using the Tsunagi data set, let's see the association between systolic and diastolic blood
pressures.
scatter plot
plot(sbp,dbp)
Do you see any association between systolic and diastolic blood pressures?
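The visual impression from the scatter plot can be quantified with a correlation coefficient. A sketch with simulated stand-ins for the tsunagi variables (sbp_sim and dbp_sim are made-up names, not variables from the data set):

```r
set.seed(1)
sbp_sim <- rnorm(100, 130, 15)                 # stand-in for sbp
dbp_sim <- 0.5 * sbp_sim + rnorm(100, 15, 5)   # stand-in for dbp, correlated by design
cor(sbp_sim, dbp_sim)       # Pearson correlation coefficient
cor.test(sbp_sim, dbp_sim)  # adds a P value and 95% confidence interval
```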
4. Regression analysis
Let's see the association between age and systolic blood pressure.
plot(age,sbp)
Systolic blood pressure tends to increase with age. Can we predict systolic blood pressure
by age using a statistical model?
glm(sbp~age)

Call:  glm(formula = sbp ~ age)

Coefficients:
(Intercept)          age
    92.0454       0.5236

...
AIC: 11820
summary(glm(sbp~age))

Call:
glm(formula = sbp ~ age)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-49.837  -15.031   -2.512   12.865   88.257

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.04541    2.87076   32.06   <2e-16 ***
age          0.52357    0.04531   11.55   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual plot: check
 - linearity of the association between X and Y
 - normal distribution of the Y variable
 - the distribution of Y does not change with X (homoscedasticity)
Draw the scatter plot and a line obtained from the regression model
plot(age,sbp)
abline(glm(sbp~age))
plotting the residuals for each subject
rslt<-glm(sbp~age)
plot(resid(rslt))
yp<-predict(rslt)
rsd<-residuals(rslt)
plot(yp,rsd)
summary(glm(sbp~male))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.5988     0.8095     ...      ...
male          4.8212     1.3103     ...      ...

Using this regression model, systolic blood pressure can be expressed as:
Systolic blood pressure = 122.5988 + 4.8212 × male
Please notice that male is coded as 1 and female as 0 in the Tsunagi data.
Thus, the mean systolic blood pressure in females is 122.5988, and that in males is 127.42
(=122.5988 + 4.8212).
Let's confirm these mean values by another method.
by(sbp,male,mean)
The P value of the regression analysis above indicates that the association between
systolic blood pressure and sex is statistically significant. In other words, there is a
significant difference in the mean systolic blood pressure between males and females.
Actually, you would obtain the same P value by a t-test assuming the same variance between
males and females.
t.test(sbp~male, var.equal=TRUE)
* The above command is different from that mentioned in Lesson 2. Without the option
var.equal=TRUE, you will obtain the result of a t-test assuming different variances between
males and females (Welch's t-test).
t.test(sbp~male)
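The choice between the two t-test forms can be informed by an F test of equal variances (var.test). A sketch with simulated stand-ins for sbp and male (x and g are made-up names):

```r
set.seed(1)
g <- rbinom(200, 1, 0.5)          # stand-in for male (0/1)
x <- rnorm(200, 130 + 5 * g, 18)  # stand-in for sbp
var.test(x ~ g)                   # F test for equality of variances
t.test(x ~ g, var.equal = TRUE)   # classical two-sample t-test
t.test(x ~ g)                     # Welch's t-test (the default)
```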
In the case of sex, male is coded as 1 and female as 0. What happens if we do not use
number(s) for coding categorical variable(s)?
The following command is to create a new variable, smk.
smk<-(0 < dur_smk)
factor(smk)
The code of smk is TRUE or FALSE (TRUE means ever-smoker, FALSE means non-smoker).
Let's conduct a regression analysis using smk.
summary(glm(sbp~smk))
Grouping a continuous variable (convert a continuous variable to categorical variable)
ag<-(50 <= age)+(60 <= age)+(70 <= age)
tapply(age,ag,range)
(ag=0: age<50, ag=1: 50<=age<60, ag=2: 60<=age<70, ag=3: 70<=age)
summary(glm(sbp~ag))
R still recognizes the variable ag as a continuous variable, since the data of ag
range from 0 to 3.
Thus, we need to specify this variable as a categorical variable. To avoid confusion, I named
the categorical variable agi (ag is still kept as a continuous variable).
agi<-factor(ag)
summary(glm(sbp~agi))
In the output, you can see new terms: agi1, agi2, and agi3. These terms are dummy (indicator)
variables for each age category, with agi=0 (age < 50) as the reference group.
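How factor() expands a variable into dummy terms can be seen with model.matrix; a sketch using made-up ages (age_sim is a hypothetical vector, not a tsunagi variable):

```r
age_sim <- c(45, 52, 63, 71, 58, 66)                       # hypothetical ages
ag <- (50 <= age_sim) + (60 <= age_sim) + (70 <= age_sim)  # 0, 1, 2, 3, 1, 2
agi <- factor(ag)
model.matrix(~ agi)  # columns agi1, agi2, agi3 are 0/1 indicators; ag=0 is the reference
```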
t.test(age~male)

data:  age by male
...
sample estimates:
mean in group 0 mean in group 1
       62.71400             ...
There is a marginally significant difference in the mean age between males and females.
Multiple regression analysis provides a statistical model using two or more covariates.
The following command estimates the model using sex and age as covariates.
summary(glm(sbp~male+age))
Call:
glm(formula = sbp ~ male + age)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-52.284  -14.934   -2.683   12.583   89.900

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 90.90451    2.88096  31.554   <2e-16 ***
male         4.11746    1.25124   3.291  0.00103 **
age          0.51660    0.04519  11.432   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(1 observation deleted due to missingness)
AIC: 11809
Number of Fisher Scoring iterations: 2
The coefficient of sex indicates the sex difference of systolic blood pressure after adjusting
for the effect of age. Since the P value of sex is less than 0.05, the sex difference of
systolic blood pressure is statistically significant even after adjusting for the effect of age.
5) Multicollinearity
Multicollinearity is a statistical phenomenon in which two or more predictor (explanatory)
variables in a multiple regression model are highly correlated. In this situation, the coefficient
estimates may change erratically in response to small changes in the model or the data.
Multicollinearity does not reduce the predictive power or reliability of the model as a whole;
it only affects calculations regarding individual predictors. That is, a multiple regression
model with correlated predictors can indicate how well the entire bundle of predictors predicts
the outcome variable, but it may not give valid results about any individual predictor, or about
which predictors are redundant with others.
For example, multicollinearity occurs when weight, height, and BMI are all used in one
regression model, because BMI is calculated from weight and height.
How to avoid multicollinearity?
Examine the associations among explanatory variables by calculating correlation coefficients.
If you find multicollinearity, pick one of the correlated variables as a covariate.
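A correlation matrix makes such redundancy visible; a sketch with simulated height, weight, and BMI (hgt, wgt, and bmi here are simulated stand-ins, not the tsunagi variables):

```r
set.seed(1)
hgt <- rnorm(100, 165, 8)                  # height in cm
wgt <- 0.8 * hgt - 70 + rnorm(100, 0, 8)   # weight correlated with height
bmi <- wgt / (hgt / 100)^2                 # BMI computed from the other two
round(cor(cbind(hgt, wgt, bmi)), 2)        # bmi is strongly correlated with wgt
```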
lm1<-glm(sbp[male==0]~age[male==0])
summary(lm1)

Call:
glm(formula = sbp[male == 0] ~ age[male == 0])

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-51.034  -14.539   -2.127   12.897   87.461

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    76.74885    3.78235   20.29   <2e-16 ***
age[male == 0]  0.74709    0.06026   12.40   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(glm(sbp[male==1]~age[male==1]))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-48.311  -13.398   -2.858   12.966   61.318

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    114.64958        ...     ...      ...
age[male == 1]   0.20363        ...   3.106  0.00200 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(glm(sbp~male+age+male*age))
or
summary(glm(sbp~male*age))
This regression model can be expressed as follows:
sbp = a + b(male) + c(age) + d(male × age)
For females, male=0, then
sbp = a + c(age)
For males, male=1, then
sbp = a + b + (c + d)(age)
Thus, the difference in the slope between males and females is d.
H0: d = 0, HA: d ≠ 0
If there is a statistically significant difference between males and females, we should report
the results of regression analysis for males and females, separately.
Variables used: age, sbp, dbp
table(cr_ft,hgg)
Conduct a chi-squared test to check the association between the mercury level in the hair and
the presence of calf cramp. Furthermore, conduct an appropriate t-test to compare the mean
mercury level in the hair between cases and controls after logarithmic transformation.
chisq.test(cr_ft,hgg)
t.test(lhg~cr_ft)
According to these results, there might be an association between the mercury level in the hair
and the presence of calf cramp. Let's examine this association taking into account the effects
of other factors.
Logistic regression analysis
In this statistical model, the dependent variable is cr_ft (the presence of calf cramp: yes=1,
no=0).
Univariate logistic regression analysis
rst<-glm(cr_ft~ hgg, family=binomial)
summary(rst)
Note that the variable hgg has four groups (0, 1, 2, 3), but R does not recognize it as a
categorical variable yet. This model shows the trend of calf cramp risk with increasing
mercury level.
Conduct a logistic regression analysis after specifying hgg as a categorical variable (newly
named as hggi).
hggi<-factor(hgg)
rst2<-glm(cr_ft~ hggi, family=binomial)
summary(rst2)
Conduct a chi-squared test to see the association between sex (male) and the presence of
calf cramp (cr_ft).
table(cr_ft,male)
chisq.test(cr_ft,male)
Conduct an appropriate t-test to compare the mean of mercury level in the hair between males
and females after logarithmic transformation.
t.test(lhg~male)
wilcox.test(lhg~male)
There might be a sex difference in the mercury level in the hair. Let's see the association
between the mercury level in the hair and the calf cramp risk after adjusting for the effect
of sex.
rst3<-glm(cr_ft~male+ hggi, family=binomial)
summary(rst3)
Estimate the maximum likelihood estimates of the odds ratios and the corresponding 95%
confidence intervals.
print(exp(coef(rst3)))
print(exp(confint(rst3)))
log(P/(1-P)) = B0 + B1X1 + B2X2 + .... + BpXp
where logit(P) = log(P/(1-P))
In a univariate model using male (male=1, female=0) as an explanatory variable,
logit(P) = B0 + a(male)
For males,
logit(P for males) = B0 + a
For females,
logit(P for females) = B0
To know the calf cramp risk in males relative to that of females, the odds ratio is calculated.
The definition of odds is P/(1-P), and logit(P) = log(P/(1-P)).
Odds ratio for males = (Pm/(1-Pm)) / (Pf/(1-Pf))
Since logit(Pm) − logit(Pf) = a, the odds ratio equals exp(a).
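This is why exp() was applied to the coefficients above: the coefficient of a logistic model is a log odds ratio. A sketch with simulated data (all names here are made up):

```r
set.seed(1)
male <- rbinom(500, 1, 0.5)              # simulated 0/1 exposure
p <- plogis(-2 + 0.7 * male)             # true log odds ratio = 0.7
y <- rbinom(500, 1, p)                   # simulated binary outcome
fit <- glm(y ~ male, family = binomial)
exp(coef(fit)["male"])                   # estimated odds ratio, near exp(0.7)
```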