Introduction to R
Nguyễn Văn Tuấn
Garvan Institute of Medical Research
Sydney, Australia
Contents
1 Downloading and installing R
2 Downloading and installing R packages
3 R syntax
3.1 Naming objects in R
3.2 Getting help in R
5 Data manipulation
5.1 Splitting data: subset
5.2 Extracting data from a data.frame
5.3 Merging two data.frames into one: merge
5.4 Recoding data (data coding)
5.5 Recoding data with replace
5.6 Converting to a factor
5.7 Grouping data with cut2 (Hmisc)
8 Graphics
8.1 Data for graphical analysis
8.2 Plot for one discrete variable: barplot
8.3 Plot for two discrete variables: barplot
8.4 Pie chart
8.5 Plot for one continuous variable: stripchart and hist
8.5.1 Stripchart
8.5.2 Histogram
8.6 Box plot (boxplot)
8.7 Graphical analysis for two continuous variables
8.7.1 Scatter plot
8.8 Graphical analysis for several variables: pairs
Introduction to R
Data analysis and graphics are usually carried out with popular software such as SAS, SPSS, Stata, Statistica, and S-Plus. These packages have been developed and marketed by software companies over roughly the past three decades, and are used for teaching and research by universities, research centres and industrial companies around the world. But because they are relatively expensive to use (sometimes up to hundreds of thousands of dollars a year), some universities in developing countries (and even in some developed countries) cannot afford to use them over the long term. Statistical researchers around the world therefore cooperated to develop a new, open-source piece of software that everyone in the statistics and mathematics community could use in a uniform way and completely free of charge.
In 1996, in an important paper on statistical computing, two statisticians, Ross Ihaka and Robert Gentleman, then at the University of Auckland, New Zealand, sketched a new language for statistical analysis, which they named R [1]. Many statisticians around the world endorsed the initiative and joined in the development of R.
Today, after less than 10 years of development, more and more statisticians, mathematicians and researchers in every field are switching to R to analyse scientific data. Worldwide there is a network of more than one million R users, and the number is growing fast. It is fair to say that within the next 10 years commercial statistical software will no longer play as large a role as it has until now.
So what is R? Briefly, R is software for statistical analysis and graphics. In essence, R is a general-purpose programming language that can be used for many different purposes, from simple calculations, recreational mathematics and matrix computation to complex statistical analyses. Because it is a language, R can also be used to build specialised software for a particular computational problem.
For this reason, those who do scientific research, especially in poor countries such as ours, need to learn how to use R for statistical analysis and graphics. This short article shows the reader how to use R. I assume that the reader knows nothing about R, but I do expect the reader to be familiar with using a computer.
1. Downloading and installing R
To use R, the first thing we must do is install R on our computer. To do this, we go online to the website of the Comprehensive R Archive Network (CRAN) at:
http://cran.R-project.org.
2. Downloading and installing R packages
R provides us with a programming language and a number of functions for basic, simple analyses. If we want to do more complex analyses, we need to download some additional packages. A package is a small piece of software developed by statisticians to solve a specific problem, and it runs within the R system. For linear regression, for example, R has the function lm, but for deeper and more complex analyses we need packages such as lme4. These packages must be downloaded and installed on the computer.
The address for downloading packages is again http://cran.r-project.org; then click on Packages, which appears on the left of the web page's menu. In my opinion, some packages worth downloading for epidemiological analyses are:
Package     Function
trellis     Draws graphs and makes graphs look better
lattice     Draws graphs and makes graphs look better
Hmisc       A number of data-modelling methods by F. Harrell
Design      A number of study-design models by F. Harrell
Epi         For epidemiological analyses
epitools    Another package dedicated to epidemiological analyses
foreign     Imports data from other software such as SPSS, Stata, SAS, etc.
Rmeta       For meta-analysis
meta        Another package for meta-analysis
survival
Zelig
Genetics
BMA
These packages can be installed online by choosing Install packages in the packages menu of R, as in the figure below. Alternatively, if a package has already been downloaded to your own computer, installation can be faster by choosing Install package(s) from local zip file, also in the packages menu (see the figure below).
3. R syntax
R is an interactive language: when we issue a command, and the command follows the correct syntax, R responds with a result. The interaction continues until we have what we need. The basic unit of R syntax is a command, or function. A function must have arguments, so the function name is followed by the arguments that we have to supply. The general syntax of R is as follows:
object <- function(argument 1, argument 2, ..., argument n)
For example:
> reg <- lm(y ~ x)
Here reg is an object, lm is a function, and y ~ x is the argument of the function.
Or:
> setwd("c:/works/stats")
Here setwd is a function, and "c:/works/stats" is its argument.
To find out which arguments a function needs, we use the command args(x) (args is short for arguments), where x is the function we want to know about:
> args(lm)
function (formula, data, subset, weights, na.action, method = "qr",
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
    contrasts = NULL, offset, ...)
NULL
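Expression    Meaning
x == 5        x equals 5
x != 5        x is not equal to 5
y < x         y is less than x
x > y         x is greater than y
z <= 7        z is less than or equal to 7
p >= 1        p is greater than or equal to 1
is.na(x)      Is x a missing value?
A & B         A and B (AND)
A | B         A or B (OR)
!             Not (NOT)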
Some points to note when naming objects in R are:
3.2 Getting help in R
In this command, we tell R to combine two columns (or two objects), age and insulin, into one object named tuan.
We now have a complete object with which to carry out statistical analysis. To check what tuan contains, we simply type:
> tuan
And R will report:
   age insulin
1   50    16.5
2   62    10.8
3   60    32.3
4   40    19.3
5   48    14.2
6   47    11.3
7   57    15.5
8   70    15.8
9   48    16.2
10  67    11.2
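The object printed above can be rebuilt from two vectors. The following is a minimal sketch, assuming the values are entered by hand with c() and combined with data.frame(); the numbers are copied from the printout, and the exact commands used originally are not shown in this text.

```r
# Values copied from the printed data frame above
age     <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)
insulin <- c(16.5, 10.8, 32.3, 19.3, 14.2, 11.3, 15.5, 15.8, 16.2, 11.2)

# Combine the two vectors column-wise into one object
tuan <- data.frame(age, insulin)

print(tuan)   # same as typing tuan at the prompt
dim(tuan)     # number of rows and number of columns
```

Typing the object name and calling print() give the same listing; dim() is a quick way to confirm that all ten rows were entered.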
We will see a window like the following:
sex  age  bmi    hdl  ldl   tc   tg
Nam   57   17  5.000  2.0  4.0  1.1
Nu    64   18  4.380  3.0  3.5  2.1
Nu    60   18  3.360  3.0  4.7  0.8
Nam   65   18  5.920  4.0  7.7  1.1
Nam   47   18  6.250  2.1  5.0  2.1
Nu    65   18  4.150  3.0  4.2  1.5
Nam   76   19  0.737  3.0  5.9  2.6
Nam   61   19  7.170  3.0  6.1  1.5
Nam   59   19  6.942  3.0  5.9  5.4
Nu    57   19  5.000  2.0  4.0  1.9
...
Nu    52   24  3.360  2.0  3.7  1.2
Nam   64   24  7.170  1.0  6.1  1.9
Nam   45   24  7.880  4.0  6.7  3.3
Nu    64   25  7.360  4.6  8.1  4.0
Nu    62   25  7.750  4.0  6.2  2.5
We want to read these data into R for later analysis. We use the read.table command as follows:
> setwd("c:/works/insulin")
> chol <- read.table("chol.txt", header=TRUE)
Or
> names(chol)
R reports the following columns in the data (names is the command that asks which columns the data contain and what they are called):
[1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"
Age  Sex  Ethnicity    IGFI  IGFBP3     ALS    PINP  ICTP  P3NP
 18    1          1  148.27    5.14  316.00   61.84  5.81  4.21
 28    1          1  114.50    5.23  296.42   98.64  4.96  5.33
 20    1          1  109.82    4.33  269.82   93.26  7.74  4.56
 21    1          1  112.13    4.38  247.96  101.59  6.66  4.61
 28    1          1  102.86    4.04  240.04   58.77  4.62  4.95
 23    1          4  129.59    4.16  266.95   48.93  5.32  3.82
 20    1          1  142.50    3.85  300.86  135.62  8.78  6.75
 20    1          1  118.69    3.44  277.46   79.51  7.19  5.11
 20    1          1  197.69    4.12  335.23   57.25  6.21  4.44
 20    1          1  163.69    3.96  306.83   74.03  4.95  4.84
 22    1          1  144.81    3.63  295.46   68.26  4.54  3.70
 27    0          2  141.60    3.48  231.20   56.78  4.47  4.07
 26    1          1  161.80    4.10  244.80   75.75  6.27  5.26
 33    1          1   89.20    2.82  177.20   48.57  3.58  3.68
 34    1          3  161.80    3.80  243.60   50.68  3.52  3.35
 32    1          1  148.50    3.72  234.80   83.98  4.85  3.80
 28    1          1  157.70    3.98  224.80   60.42  4.89  4.09
 18    0          2  222.90    3.98  281.40   74.17  6.43  5.84
 26    0          2  186.70    4.64  340.80   38.05  5.12  5.77
 27    1          2  167.56    3.56  321.12   30.18  4.78  6.12
> library(foreign)
Tell R that we want to work with chol by using the command attach(arg), where arg is the name of the data.
> attach(chol)
We can check whether chol is a data.frame with the command is.data.frame(arg), where arg is the name of the data. For example:
> is.data.frame(chol)
[1] TRUE
R reports that chol is indeed a data.frame.
How many columns (variables) and rows (observations) are there in these data? We use the command dim(arg), where arg is the name of the data (dim is short for dimension). For example (R's output appears right after we type the command):
> dim(chol)
[1] 50 8
> names(chol)
[1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"
> table(sex)
sex
nam Nam  Nu
  1  21  28
Create a vector of the numbers from 1 to 12:
[1]  1  2  3  4  5  6  7  8  9 10 11 12
Create a vector of the numbers from 12 down to 7:
> seq(12,7)
[1] 12 11 10  9  8  7
The general form of the seq function is seq(from, to, by= ) or seq(from, to, length.out= ). Its use is illustrated by the following examples:
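The two general forms of seq can be sketched as follows. This is a small illustration only; the end points, the step of 0.5 and the length of 5 are arbitrary choices made here, not values from the text.

```r
# Counting up and counting down in steps of 1
up   <- seq(12)        # 1, 2, ..., 12
down <- seq(12, 7)     # 12, 11, ..., 7

# Fixed step size: from 5 to 10 in steps of 0.5
by_step <- seq(5, 10, by = 0.5)

# Fixed number of points: 5 equally spaced values between 0 and 1
by_length <- seq(0, 1, length.out = 5)

print(up)
print(down)
print(by_step)
print(by_length)
```

Use by when the spacing matters and length.out when the number of points matters; R works out the other quantity.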
Using rep
The form of the rep function is rep(x, times, ...), where x is a value (or vector) and times is the number of repetitions. For example:
Repeat the number 10 three times:
> rep(10, 3)
[1] 10 10 10
Repeat the numbers 1 to 4 three times:
> rep(c(1:4), 3)
 [1] 1 2 3 4 1 2 3 4 1 2 3 4
Using gl
gl is used to create a categorical variable (a factor), that is, a variable that is counted rather than computed with. The general form of the gl function is gl(n, k, length = n*k, labels = 1:n, ordered = FALSE), and its use is illustrated by the following examples:
Create a variable with levels 1 and 2, each level repeated 8 times:
> gl(2, 8)
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
Levels: 1 2
Or:
> gl(2, 2, length=20)
 [1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
Levels: 1 2
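The gl calls above, together with a four-level case and a labelled case, can be sketched as below. The labels "Nam" and "Nu" are borrowed from the sex variable used elsewhere in this text; using them as gl labels is an assumption made here for illustration.

```r
g1 <- gl(2, 8)                             # levels 1 and 2, each repeated 8 times
g2 <- gl(2, 2, length = 20)                # pattern 1 1 2 2 recycled to length 20
g3 <- gl(4, 2)                             # levels 1 to 4, each repeated twice
g4 <- gl(2, 5, labels = c("Nam", "Nu"))    # text labels instead of 1 and 2

print(g3)
table(g3)     # how many observations fall in each level
levels(g4)
```

Because the result is a factor, functions such as table() count the levels rather than treating the codes as numbers.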
Create a variable with the 4 levels 1, 2, 3, 4, each level repeated 2 times.
With dates and times:
5. Data manipulation
5.1 Splitting data: subset
We return to the chol data of example 1. To make the discussion easier to follow, recall that we read the data into an R data set named chol from a text file named chol.txt:
> setwd("c:/works/insulin")
> chol <- read.table("chol.txt", header=TRUE)
> attach(chol)
If, for some reason, we want to analyse males separately, we can split chol into two data.frames, say nam and nu. To do this, we use the command subset(data, cond), where data is the data.frame we want to split and cond is the condition. For example:
> nam <- subset(chol, sex=="Nam")
> nu <- subset(chol, sex=="Nu")
After these two commands, we have 2 new data sets (two data.frames) named nam and nu. Note that in the conditions sex == "Nam" and sex == "Nu" we use == rather than = to express an exact-match condition.
Of course, we can also split the data into other data.frames with conditions based on other variables. For example, the following command creates a new data.frame named old containing the patients aged 60 or over:
> old <- subset(chol, age>=60)
> dim(old)
[1] 25
[1] 9
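Since chol.txt itself is not available here, the subset calls can be tried on a small stand-in. The sketch below assumes a data.frame named d built from the sex and age values of the first ten rows listed earlier; the name d and the combined condition at the end are illustrative additions.

```r
# Stand-in data: sex and age of the first ten subjects listed earlier
d <- data.frame(
  sex = c("Nam", "Nu", "Nu", "Nam", "Nam", "Nu", "Nam", "Nam", "Nam", "Nu"),
  age = c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57)
)

nam <- subset(d, sex == "Nam")                    # males only
nu  <- subset(d, sex == "Nu")                     # females only
old <- subset(d, age >= 60)                       # aged 60 or over
old_nam <- subset(d, sex == "Nam" & age >= 60)    # both conditions at once

dim(nam)
dim(old)
print(old_nam)
```

Conditions are combined with & (and) or | (or), and dim() confirms how many rows each subset kept.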
   id sex  tc
1   1 Nam 4.0
2   2  Nu 3.5
3   3  Nu 4.7
4   4 Nam 7.7
5   5 Nam 5.0
6   6  Nu 4.2
7   7 Nam 5.9
8   8 Nam 6.1
9   9 Nam 5.9
10 10  Nu 4.0
Note that the command print(arg) simply lists all the data in the data.frame arg. In fact, we only need to type data3; the result is exactly the same as print(data3).
5.3 Merging two data.frames into one: merge
Suppose we have data stored in two data.frames. The first, named d1, has 3 columns: id, sex, tc, as follows:
id sex tc
1 Nam 4.0
2 Nu 3.5
3 Nu 4.7
4 Nam 7.7
5 Nam 5.0
6 Nu 4.2
7 Nam 5.9
8 Nam 6.1
9 Nam 5.9
10 Nu 4.0
The second, named d2, has 3 columns: id, sex, tg, as follows:
id sex tg
1 Nam 1.1
2 Nu 2.1
3 Nu 0.8
4 Nam 1.1
5 Nam 2.1
6 Nu 1.5
7 Nam 2.6
8 Nam 1.5
9 Nam 5.4
10 Nu 1.9
11 Nu 1.7
   id sex.x  tc sex.y  tg
1   1   Nam 4.0   Nam 1.1
2   2    Nu 3.5    Nu 2.1
3   3    Nu 4.7    Nu 0.8
4   4   Nam 7.7   Nam 1.1
5   5   Nam 5.0   Nam 2.1
6   6    Nu 4.2    Nu 1.5
7   7   Nam 5.9   Nam 2.6
8   8   Nam 6.1   Nam 1.5
9   9   Nam 5.9   Nam 5.4
10 10    Nu 4.0    Nu 1.9
11 11  <NA>  NA    Nu 1.7
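A merged table of this shape can be produced with merge. The sketch below rebuilds d1 and d2 from the listings above and assumes the merge is done on id with all = TRUE, so that subject 11, who is missing from d1, is kept with NA values; the exact command used originally is not shown in this text.

```r
sex10 <- c("Nam", "Nu", "Nu", "Nam", "Nam", "Nu", "Nam", "Nam", "Nam", "Nu")

d1 <- data.frame(id = 1:10, sex = sex10,
                 tc = c(4.0, 3.5, 4.7, 7.7, 5.0, 4.2, 5.9, 6.1, 5.9, 4.0))
d2 <- data.frame(id = 1:11, sex = c(sex10, "Nu"),
                 tg = c(1.1, 2.1, 0.8, 1.1, 2.1, 1.5, 2.6, 1.5, 5.4, 1.9, 1.7))

# Match rows on id; all = TRUE keeps ids that appear in only one table
d <- merge(d1, d2, by = "id", all = TRUE)
print(d)
```

Because sex occurs in both tables but is not used as a key, merge keeps both copies and renames them sex.x and sex.y.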
To classify subjects into 3 groups, osteoporosis, osteopenia and normal, we can use the codes 1, 2 and 3. In other words, we want to create another variable (call it diagnosis) that takes these 3 values depending on the value of bmd. To do this, we use the commands:
# temporarily set the variable diagnosis equal to bmd
> diagnosis <- bmd
     bmd diagnosis
1  -0.92         3
2   0.21         3
3   0.17         3
4  -3.21         1
5  -1.80         2
6  -2.60         1
7  -2.00         2
8   1.71         3
9   2.12         3
10 -2.11         2
> diagnosis <- bmd
> diagnosis <- replace(diagnosis, bmd <= -2.5, 1)
> diagnosis <- replace(diagnosis, bmd > -2.5 & bmd <= -1.0, 2)
> diagnosis <- replace(diagnosis, bmd > -1.0, 3)
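The recoding can be checked end to end with the ten bmd values used in this section. The sketch below runs the replace version and, as an alternative that is an assumption here rather than something shown in this text, the same recoding with logical indexing.

```r
bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)

# Version 1: replace(), testing the original bmd values each time
diagnosis <- bmd
diagnosis <- replace(diagnosis, bmd <= -2.5, 1)
diagnosis <- replace(diagnosis, bmd > -2.5 & bmd <= -1.0, 2)
diagnosis <- replace(diagnosis, bmd > -1.0, 3)

# Version 2: logical indexing on an empty vector
diagnosis2 <- rep(NA, length(bmd))
diagnosis2[bmd <= -2.5] <- 1
diagnosis2[bmd > -2.5 & bmd <= -1.0] <- 2
diagnosis2[bmd > -1.0] <- 3

data.frame(bmd, diagnosis)
identical(diagnosis, diagnosis2)
```

The conditions are always written in terms of bmd, not diagnosis, so an already recoded value is never tested against the cut-offs a second time.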
> mean(diagnosis)
[1] 2.3
but this result of 2.3 has no meaning in practice.
5.7 Grouping data with cut2 (Hmisc)
In statistical analysis, we sometimes need to divide a continuous variable into several groups based on its distribution. For example, we can cut the variable bmd into 2 groups of equal size using the function cut2 (in the Hmisc library) as follows:
> # load the Hmisc library so that the function cut2 can be used
> library(Hmisc)
> bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,-2.00,1.71,2.12,-2.11)
> # divide bmd into 2 groups and store the result in the object group
> group <- cut2(bmd, g=2)
> table(group)
group
[-3.21,-0.92) [-0.92, 2.12]
            5             5
Addition and subtraction:
> 15+2997
[1] 3012
> 15+2997-9768
[1] -6756
Multiplication and division:
> -27*12/21
[1] -15.42857
Square root of 10:
> sqrt(10)
[1] 3.162278
The number pi:
> pi
[1] 3.141593
> 2+3*pi
[1] 11.42478
Natural logarithm (base e):
> log(10)
[1] 2.302585
Logarithm to base 10:
> log10(100)
[1] 2
> log10(2+3*pi)
[1] 1.057848
Exponential, e^2.7689:
> exp(2.7689)
[1] 15.94109
Trigonometric functions:
> cos(pi)
[1] -1
Vectors:
> x <- c(2,3,1,5,4,6,7,6,8)
> x
[1] 2 3 1 5 4 6 7 6 8
> sum(x)
[1] 42
> x*2
[1]  4  6  2 10  8 12 14 12 16
> exp(x/10)
[1] 1.221403 1.349859 1.105171 1.648721 1.491825 1.822119 2.013753 1.822119
[9] 2.225541
> exp(cos(x/10))
[1] 2.664634 2.599545 2.704736 2.405079 2.511954 2.282647 2.148655 2.282647
[9] 2.007132
Sum of squares: $1^2 + 2^2 + 3^2 + 4^2 + 5^2 = ?$
> x <- c(1,2,3,4,5)
> sum(x^2)
[1] 55
Adjusted sum of squares: $\sum_{i=1}^{n} (x_i - \bar{x})^2 = ?$
Variance: $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1) = ?$
Standard deviation: $s = \sqrt{s^2}$:
> sd(x)
[1] 1.581139
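The quantities above can be computed step by step and compared with R's built-in functions. This is a small sketch; the variable names ss, css, v and s are chosen here for illustration.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)

ss  <- sum(x^2)                 # sum of squares
css <- sum((x - mean(x))^2)     # adjusted (corrected) sum of squares
v   <- css / (n - 1)            # variance from the formula
s   <- sqrt(v)                  # standard deviation from the formula

c(ss = ss, css = css, variance = v, sd = s)

# The built-in functions should agree with the formulas
all.equal(v, var(x))
all.equal(s, sd(x))
```

var() and sd() divide by n - 1, so they match the sample formulas given above.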
$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$
And with R:
> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
then the result will be:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
A scalar matrix is a square matrix (the number of rows equals the number of columns) in which all off-diagonal elements are 0; here the diagonal elements are all 1, which gives the identity matrix. We can create such a matrix in R as follows:
> # create a 3 x 3 matrix with all elements equal to 0
> A <- matrix(0, 3, 3)
> # set the diagonal elements to 1
> diag(A) <- 1
> diag(A)
[1] 1 1 1
> # the matrix A is now:
> A
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
[1] 1 4 7
> # row 3 of matrix A
> A[3,]
[1] 7 8 9
> # row 1 of matrix A
> A[1,]
[1] 1 2 3
> # row 2, column 3 of matrix A
> A[2,3]
[1] 6
> # all rows of matrix A except row 2
> A[-2,]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    3    6    9
> # all columns of matrix A except column 1
> A[,-1]
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9
> # see which elements are greater than 3
> A>3
      [,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] FALSE TRUE TRUE
[3,] FALSE TRUE TRUE
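What an index returns depends on how the matrix was filled. The sketch below assumes A is filled column by column, as matrix(y, nrow=3) does, so row 1 is 1 4 7; a matrix filled by row (byrow = TRUE) would instead have 1 2 3 as its first row. The object Arow is introduced here only for that comparison.

```r
A    <- matrix(1:9, nrow = 3)                 # filled column by column
Arow <- matrix(1:9, nrow = 3, byrow = TRUE)   # filled row by row

A[1, ]      # row 1 of A
A[, 1]      # column 1 of A
A[2, 3]     # element in row 2, column 3
A[-2, ]     # every row except row 2
A[, -1]     # every column except column 1
A > 3       # logical matrix: which elements exceed 3

Arow[1, ]   # first row of the row-filled matrix
```

The index before the comma selects rows and the index after it selects columns; a negative index drops that row or column.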
[2,]   -2   -5   -8  -11
[3,]   -3   -6   -9  -12
Or A-B:
> D <- A-B
> D
     [,1] [,2] [,3] [,4]
[1,]    2    8   14   20
[2,]    4   10   16   22
[3,]    6   12   18   24
$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$
We want to compute AB, which can be done in R with the %*% operator as follows:
> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> B <- t(A)
> AB <- A%*%B
> AB
     [,1] [,2] [,3]
[1,]   66   78   90
[2,]   78   93  108
[3,]   90  108  126
Or to compute BA, again using %*%:
> BA <- B%*%A
> BA
     [,1] [,2] [,3]
[1,]   14   32   50
[2,]   32   77  122
[3,]   50  122  194
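$$3x_1 + 4x_2 = 4$$
$$x_1 + 6x_2 = 2$$
This system of equations can be written in matrix notation as $AX = Y$, where:
$$A = \begin{pmatrix} 3 & 4 \\ 1 & 6 \end{pmatrix}, \quad X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad Y = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$$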
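One way to solve a system written as AX = Y in R is the base function solve. The sketch below is an illustration under that assumption; the commands used in the original text for this step are not shown here.

```r
# Coefficient matrix, filled column by column: column 1 = (3, 1), column 2 = (4, 6)
A <- matrix(c(3, 1, 4, 6), nrow = 2)
Y <- c(4, 2)

X <- solve(A, Y)   # solves A %*% X = Y for X
X

# Check: multiplying back should reproduce Y
A %*% X

# Equivalent route through the inverse of A
solve(A) %*% Y
```

Solving by hand gives x1 = 8/7 and x2 = 1/7, which is what the check A %*% X confirms. solve(A, Y) is generally preferred to forming the inverse explicitly.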
> det(E)
[1] 0
Beyond these simple operations, R can also be used for other, more complex calculations. A notable advantage of R is that it leaves users free to build calculations suited to each specific problem. R has a package, Matrix, designed specifically for matrix computation. Readers can download the package, install it, and use it if needed. The download address is: http://cran.au.r-project.org/bin/windows/contrib/rrelease/Matrix_0.995-8.zip together with the user manual (about 80 pages long):
http://cran.au.r-project.org/doc/packages/Matrix.pdf.
We know that $3! = 3 \cdot 2 \cdot 1 = 6$, and $0! = 1$. In general, the number of permutations of n items is $n! = n(n-1)(n-2)(n-3) \cdots 1$. In R this is computed very simply with the prod() function, as follows:
Find 3!
> prod(3:1)
[1] 6
Find 10!
> prod(10:1)
[1] 3628800
Find 10.9.8.7.6.5.4
> prod(10:4)
[1] 604800
Find (10.9.8.7.6.5.4) / (40.39.38.37.36)
> prod(10:4) / prod(40:36)
[1] 0.007659481
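7.2 Combinations
The number of ways of choosing k people from n is: $\binom{n}{k} = \frac{n!}{k!(n-k)!}$. This formula is also sometimes written $C_k^n$ instead of $\binom{n}{k}$. In R, this calculation is done simply with the function choose(n, k). Here are a few illustrative examples:
Find $\binom{5}{2}$:
> choose(5, 2)
[1] 10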
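The link between the factorial formula and choose can be checked directly. A minimal sketch, using n = 5 and k = 2 from the example above; factorial() is a base R function that the text does not mention, used here as a cross-check on prod().

```r
n <- 5
k <- 2

# Combination from the factorial formula, with factorials built by prod()
by_prod <- prod(n:1) / (prod(k:1) * prod((n - k):1))

# The same number from the built-in function
by_choose <- choose(n, k)

c(by_prod = by_prod, by_choose = by_choose)

# prod(10:1) and factorial(10) are the same quantity
prod(10:1) == factorial(10)
```

The prod() form breaks down when n - k is 0, because 0:1 contains a zero, whereas factorial(0) correctly returns 1; choose() handles all these cases.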
The abbreviated names are norm (normal distribution), binom (binomial distribution), pois (Poisson distribution), etc. The following table summarises the functions and the arguments of each function:
Distribution        Density                         Cumulative                      Quantile                        Simulation
Normal              dnorm(x, mean, sd)              pnorm(q, mean, sd)              qnorm(p, mean, sd)              rnorm(n, mean, sd)
Binomial            dbinom(x, size, prob)           pbinom(q, size, prob)           qbinom(p, size, prob)           rbinom(n, size, prob)
Poisson             dpois(x, lambda)                ppois(q, lambda)                qpois(p, lambda)                rpois(n, lambda)
Uniform             dunif(x, min, max)              punif(q, min, max)              qunif(p, min, max)              runif(n, min, max)
Negative binomial   dnbinom(x, size, prob)          pnbinom(q, size, prob)          qnbinom(p, size, prob)          rnbinom(n, size, prob)
Beta                dbeta(x, shape1, shape2)        pbeta(q, shape1, shape2)        qbeta(p, shape1, shape2)        rbeta(n, shape1, shape2)
Gamma               dgamma(x, shape, rate, scale)   pgamma(q, shape, rate, scale)   qgamma(p, shape, rate, scale)   rgamma(n, shape, rate, scale)
Geometric           dgeom(x, prob)                  pgeom(q, prob)                  qgeom(p, prob)                  rgeom(n, prob)
Exponential         dexp(x, rate)                   pexp(q, rate)                   qexp(p, rate)                   rexp(n, rate)
Weibull             dweibull(x, shape, scale)       pweibull(q, shape, scale)       qweibull(p, shape, scale)       rweibull(n, shape, scale)
Cauchy              dcauchy(x, location, scale)     pcauchy(q, location, scale)     qcauchy(p, location, scale)     rcauchy(n, location, scale)
F                   df(x, df1, df2)                 pf(q, df1, df2)                 qf(p, df1, df2)                 rf(n, df1, df2)
t                   dt(x, df)                       pt(q, df)                       qt(p, df)                       rt(n, df)
Chi-squared         dchisq(x, df)                   pchisq(q, df)                   qchisq(p, df)                   rchisq(n, df)
Note: In the table above, df = degrees of freedom; prob = probability; n = sample size. For the other arguments, see the documentation of each distribution. The F, t and Chi-squared distributions also take one more argument, the non-centrality parameter (ncp), which is set to 0 by default. Users can supply another suitable value if needed.
As the name suggests, a binomially distributed outcome has only two values: male / female, alive / dead, yes / no, etc. The binomial distribution is stated as follows: if an experiment is carried out n times, each time resulting in either success or failure, with a known probability of success p, then the probability of k successes is:
$$P(k \mid n, p) = C_k^n \, p^k (1 - p)^{n-k}$$
where k = 0, 1, 2, ..., n. In R, the function dbinom(k, n, p) computes this formula quickly.
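The formula and dbinom can be compared directly. The sketch below assumes n = 20 and p = 0.2, the values of the hypertension example used in this section, and k = 4 as an illustrative count.

```r
k <- 4      # number of successes
n <- 20     # number of trials
p <- 0.2    # probability of success in each trial

# Probability from the formula C(n, k) * p^k * (1 - p)^(n - k)
by_formula <- choose(n, k) * p^k * (1 - p)^(n - k)

# The same probability from the built-in density function
by_dbinom <- dbinom(k, n, p)

all.equal(by_formula, by_dbinom)

# The probabilities of all possible outcomes 0, 1, ..., n add up to 1
sum(dbinom(0:n, n, p))
```

dbinom is vectorised over its first argument, so the whole distribution can be obtained in one call, as in the last line.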
...   7   8   9  10
...  68  23  13   3
The first row of numbers (0, 5, 6, …, 10) is the number of hypertensive patients among the 20 people we selected. The second row tells us how many times each count occurred among the 1000 samples. Thus 6 samples contained no hypertensive patient, 45 samples contained only 1 hypertensive patient, and so on. Perhaps the easiest way to understand this is to plot these frequencies with the hist command, as follows:
> hist(b, main="Number of hypertensive patients")
[Figure: histogram "Number of hypertensive patients"; vertical axis: Frequency]
From the plot above, we see that the probability of having 4 hypertensive patients (in each sample of 20 people) is the highest (22.9%). This is understandable: because the prevalence of hypertension is 20%, we expect that on average 4 of the 20 people selected will be hypertensive. An important point that the plot also shows, however, is that we sometimes observe as many as 10 hypertensive patients, even though the probability of such a sample is very low (only 3/1000).
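A simulation of this kind can be sketched with rbinom. The code below assumes 1000 samples of 20 people with a prevalence of 20%, as described above; the seed is an arbitrary choice made here so that the run is repeatable, and the counts it produces will differ from those quoted in the text.

```r
set.seed(1)   # arbitrary seed, only to make the run repeatable

# 1000 samples; each value is the number of hypertensive patients among 20 people
b <- rbinom(1000, 20, 0.20)

table(b)      # how often each count occurred
mean(b)       # should be close to the expected value 20 * 0.2 = 4

hist(b, main = "Number of hypertensive patients")
```

Comparing table(b) / 1000 with dbinom(0:20, 20, 0.2) shows how closely the simulated frequencies follow the theoretical probabilities.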
7.3.2 The Poisson distribution
$$P(X = k \mid \lambda) = \frac{e^{-\lambda} \lambda^{k}}{k!}$$
Therefore, the answer to the question above is: $P(X = 2 \mid \lambda = 1) = \frac{e^{-1} 1^{2}}{2!} = 0.1839$. This answer can be computed more quickly in R with the dpois function, as follows:
> dpois(2, 1)
[1] 0.1839397
We can also compute the probability of exactly 1 error, and the probability of no error at all:
> dpois(1, 1)
[1] 0.3678794
> dpois(0, 1)
[1] 0.3678794
$$P(X > 2) = 1 - P(X \le 2) = 1 - 0.3678 - 0.3678 - 0.1839 = 0.08$$
With R, we can compute this as follows:
# P(X ≤ 2)
> ppois(2, 1)
[1] 0.9196986
# 1 - P(X ≤ 2)
> 1-ppois(2, 1)
[1] 0.0803014
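The relation between the Poisson formula, dpois and ppois can be verified in a few lines. A minimal sketch with lambda = 1, as in the example above; the lower.tail argument in the last line is a standard option of ppois that the text does not use.

```r
lambda <- 1

# P(X = 2) from the formula exp(-lambda) * lambda^k / k!
p2 <- exp(-lambda) * lambda^2 / factorial(2)
all.equal(p2, dpois(2, lambda))

# P(X > 2) as 1 minus the sum of P(X = 0), P(X = 1), P(X = 2)
upper_by_sum <- 1 - sum(dpois(0:2, lambda))

# The same tail probability from the cumulative function
upper_by_ppois <- 1 - ppois(2, lambda)
all.equal(upper_by_sum, upper_by_ppois)

# ppois can return the upper tail directly
ppois(2, lambda, lower.tail = FALSE)
```

ppois(q, lambda) is the running sum of dpois(0:q, lambda), which is why the two routes agree.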
The two distributions we have just examined belong to the group of distributions for discrete variables (discrete distributions), in which the variable takes ordered or categorical values. For continuous variables there are several other suitable distributions, the most important of which is the normal distribution. The normal distribution is the most important foundation of statistical analysis; it is no exaggeration to say that most of statistical theory is built on it. The normal density function has two parameters: the mean $\mu$ and the variance $\sigma^2$ (or the standard deviation $\sigma$). Let X be a variable (such as height); the normal density function states that the probability density at X = x is:
$$P(X = x \mid \mu, \sigma^2) = f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$$
[Figure: "Probability distribution of height in Vietnamese women"; f(height) plotted against Height, from 130 to 200 cm]
The plot above was drawn with the following two commands. The first creates a variable height with the values 130, 131, 132, …, 200 cm. The second draws the plot, given a mean of 156 cm and a standard deviation of 4.6 cm.
> height <- seq(130, 200, 1)
> plot(height, dnorm(height, 156, 4.6),
       type="l",
       ylab="f(height)",
       xlab="Height",
       main="Probability distribution of height in Vietnamese women")
$$P(X = 160 \mid \mu = 156, \sigma = 4.6) = \frac{1}{4.6\sqrt{2 \times 3.1416}} \exp\left[-\frac{(160 - 156)^2}{2(4.6)^2}\right] = 0.0594$$
The function dnorm(x, mean, sd) in R computes this for us concisely:
> dnorm(160, mean=156, sd=4.6)
[1] 0.05942343
$$P(a \le X \le b) = \int_a^b f(x)\,dx$$
Thus $P(150 \le X \le 160)$ is the area under the curve between 150 and 160 on the horizontal axis of the plot. R has the function pnorm(x, mean, sd), which computes the cumulative probability for a normal distribution and is very useful:
$$\text{pnorm}(a, \text{mean}, \text{sd}) = \int_{-\infty}^{a} f(x)\,dx = P(X \le a \mid \text{mean}, \text{sd})$$
For example, the probability that a Vietnamese woman's height is 150 cm or less is 9.6%:
> pnorm(150, 156, 4.6)
[1] 0.0960575
Or the probability that a Vietnamese woman's height is 164 cm or more is:
> 1-pnorm(164, 156, 4.6)
[1] 0.04100591
In other words, only about 4.1% of Vietnamese women are 164 cm or taller.
Example 6: applying the normal distribution. In a population, we know that the mean blood pressure is 100 mmHg and the standard deviation is 13 mmHg. What proportion of people in this population have a blood pressure of 120 mmHg or higher? The answer with R is:
> 1-pnorm(120, mean=100, sd=13)
[1] 0.0619679
That is, about 6.2% of people in this population have a blood pressure of 120 mmHg or higher.
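The density formula, dnorm and pnorm can be tied together in a short check. The sketch below uses the height example (mean 156 cm, standard deviation 4.6 cm); the helper function f and the use of integrate() for the area are illustrative additions, not part of the original text.

```r
mu    <- 156    # mean height in cm
sigma <- 4.6    # standard deviation in cm

# Normal density written out from the formula
f <- function(x) exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
all.equal(f(160), dnorm(160, mu, sigma))

# P(150 <= X <= 160) as a difference of two cumulative probabilities
p_between <- pnorm(160, mu, sigma) - pnorm(150, mu, sigma)

# The same probability as the area under the density curve
area <- integrate(dnorm, 150, 160, mean = mu, sd = sigma)$value

c(p_between = p_between, area = area)
```

An interval probability is always the difference of two pnorm values, and an upper-tail probability is 1 - pnorm(a, mean, sd), as in the examples above.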
7.3.4 The standardized normal distribution
A variable X that follows a normal distribution with mean $\mu$ and variance $\sigma^2$ is usually abbreviated as:
$$X \sim N(\mu, \sigma^2)$$
Here $\mu$ and $\sigma^2$ depend on the unit of measurement of the variable. For example, height is measured in cm (or m), blood pressure in mmHg, age in years, etc., so describing a variable in its original units sometimes makes comparison difficult. A simpler way is to standardize X so that the mean is 0 and the variance is 1. After a little arithmetic, it is easy to show that the transformation of X that meets this condition is:
$$Z = \frac{X - \mu}{\sigma}$$
[Figure: standard normal density f(z) plotted against z]
In other words, there is a 95% probability that z lies between -1.96 and 1.96. (Note that in the command above I did not supply mean=0, sd=1, because in pnorm the default value of mean is 0 and of sd is 1.)
Example 5 (continued). As a reminder, the mean height of Vietnamese women is 156 cm and the standard deviation is 4.6 cm. A woman who is 170 cm tall is therefore z = (170 − 156) / 4.6 = 3.04 standard deviations above the mean, and the proportion of Vietnamese women taller than 170 cm is very low, only about 0.1%.
> 1-pnorm(3.04)
[1] 0.001182891
$$P(Z < z) = p$$
Or $P(Z < z) = 0.975$ for the normal distribution with mean 0 and standard deviation 1:
> qnorm(0.975, mean=0, sd=1)
[1] 1.959964
etc.
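Standardization and the quantile function can be checked together. A minimal sketch using the height example (mean 156, standard deviation 4.6); z is computed here without rounding, so it differs slightly from the rounded 3.04 used above.

```r
mu    <- 156
sigma <- 4.6

# Standardize a height of 170 cm
z <- (170 - mu) / sigma

# The upper-tail probability is the same on either scale
upper_raw <- 1 - pnorm(170, mu, sigma)
upper_z   <- 1 - pnorm(z)            # mean = 0, sd = 1 are the defaults
all.equal(upper_raw, upper_z)

# qnorm is the inverse of pnorm: it returns the z with P(Z < z) = p
q <- qnorm(0.975)
pnorm(q)                             # gives back 0.975

# Central 95% of the standard normal distribution
pnorm(q) - pnorm(-q)
```

pnorm goes from a value to a probability and qnorm goes from a probability back to a value, which is how the familiar 1.96 arises.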
The command above draws a random sample without replacement, that is, each time we draw, we do not put the selected subjects back into the population.
But suppose we want to sample with replacement (each time we draw some subjects, we put them back into the population before the next draw). For example, to select 10 people from a population of 50 by random sampling with replacement, we only need to add the argument replace = TRUE:
> sample(1:50, 10, replace=TRUE)
[1] 31 44  8 47 50 10 16 29 23
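The difference between the two sampling schemes can be seen by checking for repeated values. A small sketch; the seed and the object names are arbitrary choices made here, and the numbers drawn will differ from those shown above.

```r
set.seed(2)   # arbitrary seed, only to make the run repeatable

# Without replacement: every selected id is distinct
s_without <- sample(1:50, 10)

# With replacement: the same id may be drawn more than once
s_with <- sample(1:50, 10, replace = TRUE)

s_without
s_with

anyDuplicated(s_without)   # 0 means no value is repeated
```

Without replacement the sample size cannot exceed the population size; with replacement it can, for example sample(1:5, 10, replace = TRUE).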
8. Graphics
The R language offers many ways to design a neat and attractive plot. Most plotting functions are built into R, but some more sophisticated and complex plots can be produced with dedicated packages such as lattice or trellis, which can be downloaded from the R website. In this chapter I will only show how to draw common plots using popular R functions.
8.1 Data for graphical analysis
The data contain the columns (variables) id, sex, age, bmi, hdl, ldl, tc, and tg. (Note: id is the code of the 50 study subjects; sex is the sex (male or female); age is age; bmi is the body mass index; hdl is high density cholesterol; ldl is low density cholesterol; tc is total cholesterol; and tg is triglycerides.) The data are stored in the directory c:\works\insulin under the name chol.txt. Before plotting, we begin by reading these data into R.
> setwd("c:/works/insulin")
> cong <- read.table("chol.txt", header=TRUE, na.strings=".")
> attach(cong)
age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
         ..., 51, 60, 42, 64, 49, 44, 45, 80, 48,
         ..., 45, 70, 51, 63, 54, 57, 70, 47, 60,
         ..., 50, 60, 55, 74, 48, 46, 49, 69, 72,
         ..., 58, 60, 45, 63, 52, 64, 45, 64, 62)
bmi <- c(17, 18, 18, 18, 18, 18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
         ..., 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22,
         ..., 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24,
         24, 24, 24, 25, 25)
ldl <- c(2.0, 3.0, 3.0, 4.0, 2.1, 3.0, 3.0, 3.0, 3.0, 2.0,
         ..., 1.3, 1.2, 0.7, 4.0, 4.1, 4.3, 4.0, 4.3, 4.0,
         ..., 3.0, 1.7, 2.0, 2.1, 4.0, 4.1, 4.0, 4.2, 4.2,
         ..., 4.3, 2.3, 6.0, 3.0, 3.0, 2.6, 4.4, 4.3, 4.0,
         ..., 4.1, 4.4, 2.8, 3.0, 2.0, 1.0, 4.0, 4.6, 4.0)
tc <- c(4.0, 3.5, 4.7, 7.7, 5.0, 4.2, 5.9, 6.1, 5.9, 4.0,
        6.2, 4.1, 3.0, 4.0, 6.9, 5.7, 5.7, 5.3, 7.1, 3.8,
        4.3, 4.8, 4.0, 3.0, 3.1, 5.3, 5.3, 5.4, 4.5, 5.9,
        5.6, 8.3, 5.8, 7.6, 5.8, 3.1, 5.4, 6.3, 8.2, 6.2,
        6.2, 6.7, 6.3, 6.0, 4.0, 3.7, 6.1, 6.7, 8.1, 6.2)
tg <- c(1.1, 2.1, 0.8, 1.1, 2.1, 1.5, 2.6, 1.5, 5.4, 1.9,
        1.7, 1.0, 1.6, 1.1, 1.5, 1.0, 2.7, 3.9, 3.0, 3.1,
        2.2, 2.7, 1.1, 0.7, 1.0, 1.7, 2.9, 2.5, 6.2, 1.3,
        3.3, 3.0, 1.0, 1.4, 2.5, 0.7, 2.4, 2.4, 1.4, 2.7,
        2.4, 3.3, 2.0, 2.6, 1.8, 1.2, 1.9, 3.3, 4.0, 2.5)
The variable sex in the data above has two values (Nam and Nu), that is, it is a discrete variable. We want to know the frequency of each sex (how many males and how many females) and draw a simple plot. To do this, we first use the table function to get the frequencies:
> sex.freq <- table(sex)
> sex.freq
sex
Nam  Nu
 22  28
[Figure: bar plots of sex.freq, with one bar for Nam and one for Nu]
Instead of showing the frequencies of males and females as 2 columns, we can show them as two horizontal bars with the argument horiz = TRUE, as follows (see the result in Figure 6b):
> barplot(sex.freq,
          horiz = TRUE,
          col = rainbow(length(sex.freq)),
          main="Frequency of males and females")
Age is a continuous variable. We can divide the patients into several groups based on age. The cut function cuts a continuous variable into several discrete groups. For example:
> ageg <- cut(age, 3)
> table(ageg)
ageg
  (42,54.7] (54.7,67.3]   (67.3,80]
         19          24           7
The result above shows that there are 10 male and 9 female patients in the first age group, 10 males and 14 females in the second age group, and so on. To display the frequencies of these two variables, we again use barplot:
> barplot(age.sex, main="Number of males and females in each age group")
[Figures not recovered: plots by age group, with the group labels (42,54.7], (54.7,67.3], (67.3,80] and (42,49.6], (49.6,57.2], (57.2,64.8], (64.8,72.4], (72.4,80]; one axis labelled "Age group" and one labelled "mg/L"]
We see that the variable tg is discontinuous, especially among subjects with high tg. While most subjects have a tg below 5, there are 2 subjects with a very high tg (>5).
8.5.2 Histogram
[Figures 11a and 11b: histograms of age; vertical axis: Frequency / No of patients; horizontal axis: age / Age group]
Figure 11a. The vertical axis is the number of patients (study subjects) and the horizontal axis is age. For example, there are 6 patients aged 40 to 45, and 4 patients aged 70 to 80.
Figure 11b. A plot title and names for the vertical and horizontal axes added with xlab and ylab.
[Figures 12a and 12b: "density.default(x = age)" (N = 50, Bandwidth = 3.806) and "Histogram of age"; vertical axis: Density; horizontal axis: age]
Figure 12a. Probability density for the variable age.
Figure 12b. Probability density for the variable age with multiple interquartile ranges.
[Figures 14a and 14b: box plots of total cholesterol (mg/L) for Nam and Nu]
Figure 14a. In this plot, we see that the median total cholesterol in females is lower than in males, but the spread in the two groups does not differ much.
Figure 14b. Total cholesterol for each sex, with colours and horizontal boxes.
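[Figure: scatter plot of hdl and tc]
We want to distinguish the sexes (male and female) in the plot above. To draw that plot, we use the ifelse function. In the following command, if sex=="Nam" the point is drawn with plotting character 16 (a filled circle); otherwise it is drawn with character 22 (a square):
> plot(hdl, tc, pch=ifelse(sex=="Nam", 16, 22))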
[Figures 16a and 16b: scatter plots of tc and hdl by sex, the second using the letters M and F as plotting characters]
Figure 16a. The relationship between tc and hdl by sex, shown with two symbols.
Figure 16b. The relationship between tc and hdl by sex, shown with two characters.
We can also draw a linear regression line through the points above by continuing with the following commands:
> plot(hdl ~ tc, pch=16, main="Total cholesterol and HDL cholesterol",
       xlab="Total cholesterol", ylab="HDL cholesterol", bty="l")
> reg <- lm(hdl ~ tc)
> abline(reg)
[Figures 17a and 17b: scatter plots of HDL cholesterol against Total cholesterol, each with a fitted line]
Figure 17a. In the command above, reg <- lm(hdl~tc) finds the equation relating hdl and tc with a linear model (lm) and puts the result in the object reg. The second command, abline(reg), asks R to draw the straight line from the equation in reg.
Figure 17b. Instead of abline, we use the lowess function to show the relationship between tc and hdl.
The result will be:
[Figure: scatter-plot matrix (pairs) of age, bmi, hdl, ldl and tc]
 [1] "id"        "sex"       "age"       "weight"    "height"    "ethnicity"
 [7] "igfi"      "igfbp3"    "als"       "pinp"      "ictp"      "p3np"
> igfdata
     id    sex age weight height ethnicity    igfi  igfbp3     als    pinp    ictp    p3np
1     1 Female  15     42    162     Asian 189.000 4.00000 323.667 353.970 11.2867  8.3367
2     2   Male  16     44    160 Caucasian 160.000 3.75000 333.750 375.885 10.4300  6.7450
3     3 Female  15     43    157     Asian 146.833 3.43333 248.333 199.507  8.3633 12.5000
4     4 Female  15     42    155     Asian 185.500 3.40000 251.000 483.607 13.3300 14.2767
5     5 Female  16     47    167     Asian 192.333 4.23333 322.000 105.430  7.9233  4.5033
6     6 Female  25     45    160     Asian 110.000 3.50000 284.667  76.487  4.9833  4.9367
7     7 Female  19     45    161     Asian 157.000 3.20000 274.000  75.880  6.3500  5.3200
8     8 Female  18     43    153     Asian 146.000 3.40000 303.000  86.360  7.3700  4.6700
9     9 Female  15     41    149     Asian 197.667 3.56667 308.500 254.803 11.8700  6.8200
10   10 Female  24     45    157   African 148.000 3.40000 273.000  44.720  3.7400  6.1600
...
97   97 Female  17     54    ...       ...     ...     ...     ...     ...     ...  4.4367
98   98   Male  18     55    ...       ...     ...     ...     ...     ...     ...  8.8333
99   99 Female  18     48    ...       ...     ...     ...     ...     ...     ...  5.6600
100 100   Male  15     54    ...       ...     ...     ...     ...     ...     ...  6.5933
Statistic              Theory                                                       R function
Mean                   $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$                   mean(x)
Variance               $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$       var(x)
Standard deviation     $s = \sqrt{s^2}$                                             sd(x)
Standard error         $SE = \frac{s}{\sqrt{n}}$                                    none
Lowest value                                                                        min(x)
Highest value                                                                       max(x)
Range                                                                               range(x)
However, R has the summary command, which gives us all the statistical information about a variable:
> summary(age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.00   16.00   19.00   19.17   21.25   34.00
SE
5.898719
      age            weight          height          ethnicity
 Min.   :13.00   Min.   :41.00   Min.   :149.0   African  : 8
 1st Qu.:16.00   1st Qu.:47.00   1st Qu.:157.0   Asian    :60
 Median :19.00   Median :50.00   Median :162.0   Caucasian:30
 Mean   :19.17   Mean   :49.91   Mean   :163.1   Others   : 2
 3rd Qu.:21.25   3rd Qu.:53.00   3rd Qu.:168.0
 Max.   :34.00   Max.   :60.00   Max.   :196.0
      igfi            igfbp3           als
 Min.   : 85.71   Min.   :2.000   Min.   :192.7
 1st Qu.:137.17   1st Qu.:3.292   1st Qu.:256.8
 Median :161.50   Median :3.550   Median :292.5
 Mean   :165.59   Mean   :3.617   Mean   :301.8
 3rd Qu.:186.46   3rd Qu.:3.875   3rd Qu.:331.2
 Max.   :427.00   Max.   :5.233   Max.   :471.7
      pinp             ictp             p3np
 Min.   : 26.74   Min.   : 2.697   Min.   : 2.343
 1st Qu.: 68.10   1st Qu.: 4.878   1st Qu.: 4.433
 Median :103.26   Median : 6.338   Median : 5.445
 Mean   :167.17   Mean   : 7.420   Mean   : 6.341
 3rd Qu.:196.45   3rd Qu.: 8.423   3rd Qu.: 7.150
 Max.   :742.68   Max.   :21.237   Max.   :16.303
     sex          age            weight          height
 Female:69   Min.   :13.00   Min.   :41.00   Min.   :149.0
 Male  : 0   1st Qu.:17.00   1st Qu.:47.00   1st Qu.:156.0
             Median :19.00   Median :50.00   Median :162.0
             Mean   :19.59   Mean   :49.35   Mean   :161.9
             3rd Qu.:22.00   3rd Qu.:52.00   3rd Qu.:166.0
             Max.   :34.00   Max.   :60.00   Max.   :196.0
      igfi            igfbp3           als
 Min.   : 85.71   Min.   :2.767   Min.   :204.3
 1st Qu.:136.67   1st Qu.:3.333   1st Qu.:263.8
 Median :163.33   Median :3.567   Median :302.7
 Mean   :167.97   Mean   :3.695   Mean   :311.5
 3rd Qu.:186.17   3rd Qu.:3.933   3rd Qu.:361.7
 Max.   :427.00   Max.   :5.233   Max.   :471.7
      pinp             ictp             p3np
 Min.   : 26.74   Min.   : 2.697   Min.   : 2.343
 1st Qu.: 62.75   1st Qu.: 4.717   1st Qu.: 4.337
 Median : 78.50   Median : 5.537   Median : 5.143
 Mean   :108.74   Mean   : 6.183   Mean   : 5.643
 3rd Qu.:115.26   3rd Qu.: 7.320   3rd Qu.: 6.143
 Max.   :502.05   Max.   :13.633   Max.   :14.420
------------------------------------------------------------
sex: Male
       id             sex          age            weight          height
 Min.   :  2.00   Female: 0   Min.   :14.00   Min.   :44.00   Min.   :155.0
 1st Qu.: 34.50   Male  :31   1st Qu.:15.00   1st Qu.:48.50   1st Qu.:161.5
 Median : 56.00               Median :17.00   Median :51.00   Median :164.0
 Mean   : 55.61               Mean   :18.23   Mean   :51.16   Mean   :165.6
 3rd Qu.: 75.00               3rd Qu.:20.00   3rd Qu.:53.50   3rd Qu.:169.0
 Max.   :100.00               Max.   :27.00   Max.   :59.00   Max.   :191.0
     ethnicity       igfi            igfbp3           als
 African  : 4   Min.   : 94.67   Min.   :2.000   Min.   :192.7
 Asian    :17   1st Qu.:138.67   1st Qu.:3.183   1st Qu.:249.8
 Caucasian: 8   Median :160.00   Median :3.500   Median :276.0
 Others   : 2   Mean   :160.29   Mean   :3.443   Mean   :280.2
                3rd Qu.:183.00   3rd Qu.:3.775   3rd Qu.:311.3
                Max.   :274.00   Max.   :4.500   Max.   :388.7
      pinp             ictp             p3np
 Min.   : 56.28   Min.   : 3.650   Min.   : 3.390
 1st Qu.:135.07   1st Qu.: 6.900   1st Qu.: 5.375
 Median :245.92   Median : 9.513   Median : 7.140
 Mean   :297.21   Mean   :10.173   Mean   : 7.895
 3rd Qu.:450.38   3rd Qu.:13.517   3rd Qu.:10.010
 Max.   :742.68   Max.   :21.237   Max.   :16.303
op <- par(mfrow=c(2,3))
hist(igfi)
hist(igfbp3)
hist(als)
hist(pinp)
hist(ictp)
hist(p3np)
[Figure: histograms of igfi, igfbp3, als, pinp, ictp and p3np arranged in a 2 x 3 grid; the vertical axis of each panel is Frequency.]
If we want to compute the mean of a variable such as igfi separately for men and women, the tapply function in R can do this:
> tapply(igfi, list(sex), mean)
  Female     Male
167.9741 160.2903
In this command, igfi is the variable to be summarized, sex is the grouping variable, and the statistic we want is the mean. The result shows that the mean igfi of women (167.97) is higher than that of men (160.29).
If we want the mean for each combination of sex and ethnicity, we only need to add one more variable to the list function:
> tapply(igfi, list(ethnicity, sex), mean)
            Female     Male
African   145.1252 120.9168
Asian     165.6589 160.4999
Caucasian 176.6536 169.4790
Others          NA 200.5000
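As a minimal sketch of what tapply does, the example below uses a small made-up data frame (the object `toy` and its values are hypothetical, not the igf data) and computes means by one and by two grouping variables.

```r
# Hypothetical toy data, for illustration only (not the igf data set)
toy <- data.frame(
  sex   = factor(c("Female", "Female", "Male", "Male", "Female", "Male")),
  group = factor(c("A", "B", "A", "B", "A", "A")),
  value = c(10, 20, 30, 40, 50, 60)
)

# Mean of value within each sex
by_sex <- tapply(toy$value, list(toy$sex), mean)

# Mean of value within each group-by-sex cell; an empty cell would give NA
by_cell <- tapply(toy$value, list(toy$group, toy$sex), mean)

print(by_sex)
print(by_cell)
```

A cell with no observations is reported as NA, which is why the Others/Female cell is NA in the table above.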
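$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
Here, $\bar{x}$ is the sample mean, $\mu$ is the hypothesized mean (30 in this case), s is the standard deviation, and n is the sample size (100). If the t value exceeds the theoretical value of the t distribution at a chosen significance level, 5% for example, we have grounds to declare the difference statistically significant. For a sample of 100, this theoretical value can be computed with R's qt function:
> qt(0.95, 100)
[1] 1.660234
(Strictly speaking, a one-sample test has n - 1 = 99 degrees of freedom, and a two-sided test at the 5% level uses the 0.975 quantile, which is about 1.98.)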
data: age
t = -27.6563, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
18.39300 19.94700
sample estimates:
mean of x
19.17
In the command above, age is the variable being tested and mu=30 is the hypothesized value. R reports t = -27.66 with 99 degrees of freedom and a p-value < 2.2e-16 (that is, extremely small). R also gives the 95% confidence interval for mean age, from 18.4 to 19.9 years (30 years lies far outside this interval). In other words, we have grounds to state that the mean age in this sample is truly lower than the mean age of the population.
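The link between the t formula and the t.test output can be checked by hand. The sketch below uses simulated ages (the vector `x` is hypothetical, generated with rnorm, not the study data) and compares the manual statistic with what t.test reports.

```r
set.seed(1)
x   <- rnorm(100, mean = 19, sd = 4)  # simulated ages, hypothetical data
mu0 <- 30                             # hypothesized mean
n   <- length(x)

# t = (sample mean - hypothesized mean) / (s / sqrt(n))
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

res <- t.test(x, mu = mu0)
c(manual = t_manual, t.test = unname(res$statistic))
```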
9.3.2 Two-sample t test
Example 11. From the descriptive analysis above (the summary function) we saw that women have a higher igfi level than men (167.97 versus 160.29). The question is whether this is a systematic difference or one caused by random factors. To answer it, we need to consider the mean difference between the two groups and the standard deviation of that difference:
$$t = \frac{\bar{x}_2 - \bar{x}_1}{SED}$$
Here $\bar{x}_1$ and $\bar{x}_2$ are the means of the two groups (men and women), and SED is the standard deviation of $(\bar{x}_1 - \bar{x}_2)$. In fact, SED can be estimated by the formula:
$$SED = \sqrt{SE_1^2 + SE_2^2}$$
where $SE_1$ and $SE_2$ are the standard errors of the two groups. According to probability theory, t follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom, where $n_1$ and $n_2$ are the sample sizes of the two groups. We can use R to answer the question with the t.test function:
> t.test(igfi ~ sex)

        Welch Two Sample t-test

data:  igfi by sex
t = 0.8412, df = 88.329, p-value = 0.4025
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.46855  25.83627
sample estimates:
mean in group Female   mean in group Male
            167.9741             160.2903

df is the degrees of freedom. The p-value of 0.4025 shows that the difference between men and women is not statistically significant (because it is above 0.05, or 5%).
95 percent confidence interval:
 -10.46855  25.83627
is the 95% confidence interval for the difference between the two groups. It indicates that igfi in women could be as much as 10.5 ng/L lower than in men, or about 25.8 ng/L higher. The interval is very wide and includes 0, which is further evidence that there is no statistically significant difference between the two groups.
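To make the role of SED concrete, here is a minimal sketch with simulated data (the vectors `women` and `men` are hypothetical draws, not the igfi measurements). It builds the t statistic from the two standard errors and compares it with the default (Welch) output of t.test.

```r
set.seed(2)
women <- rnorm(69, mean = 168, sd = 55)  # hypothetical values
men   <- rnorm(31, mean = 160, sd = 40)  # hypothetical values

# Standard error of each group mean
se_w <- sd(women) / sqrt(length(women))
se_m <- sd(men) / sqrt(length(men))

# Standard deviation of the difference between the two means
sed <- sqrt(se_w^2 + se_m^2)

t_manual <- (mean(women) - mean(men)) / sed
res <- t.test(women, men)   # Welch test by default

c(manual = t_manual, t.test = unname(res$statistic))
```

Only the statistic is reproduced here; the Welch test uses an approximate (usually fractional) number of degrees of freedom rather than n1 + n2 - 2.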
The test above (Welch's test) assumes that the two groups have different variances, which is why its degrees of freedom (88.329) are not n1 + n2 - 2 = 98. If we have reason to believe that the two groups have the same variance, we only need to change one argument of t.test, var.equal=TRUE, as follows:
> t.test(igfi ~ sex, var.equal=TRUE)

        Two Sample t-test

data:  igfi by sex
t = 0.7071, df = 98, p-value = 0.4812
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -13.88137  29.24909
sample estimates:
mean in group Female   mean in group Male
            167.9741             160.2903
The t test rests on the assumption that the variable follows a normal distribution. If this assumption does not hold, the result of the t test may not be valid. To test the distribution of igfi, we can use the shapiro.test function:
> shapiro.test(igfi)

        Shapiro-Wilk normality test

data:  igfi
W = 0.8528, p-value = 1.504e-08
The very small p-value indicates that igfi does not follow a normal distribution.
A p-value of 0.682 shows that the difference in igfi between men and women is indeed not statistically significant. This conclusion does not differ from the result of the t test.
9.5 t test for paired data (paired t-test, t.test)
Before: 180, 140, 160, 160, 220, 185, 145, 160, 160, 170
After:  170, 145, 145, 125, 205, 185, 150, 150, 145, 155
> # enter the data
> before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
> after <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)
> bp <- data.frame(before, after)
> # t test
> t.test(before, after, paired=TRUE)

        Paired t-test

data:  before and after
t = 2.7924, df = 9, p-value = 0.02097
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  1.993901 19.006099
sample estimates:
mean of the differences
                   10.5

The result shows that after treatment blood pressure fell by 10.5 mmHg, with a 95% confidence interval from 2.0 mmHg to 19 mmHg and p = 0.021. We therefore have evidence that the reduction in blood pressure is statistically significant.
Note that if we wrongly analyze these data with the test for two independent groups, as below, the p-value of 0.32 would suggest that the reduction is not statistically significant!
> t.test(before, after)

        Welch Two Sample t-test

data:  before and after
t = 1.0208, df = 17.998, p-value = 0.3209
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.11065  32.11065
sample estimates:
mean of x mean of y
    168.0     157.5
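The paired test is equivalent to a one-sample t test on the within-subject differences. The sketch below reuses the before and after values given in the text; the helper names `d`, `t_manual`, `paired` and `one_sample` are introduced here only for illustration.

```r
before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
after  <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)

# Within-subject differences
d <- before - after

# One-sample t statistic on the differences, testing a mean difference of 0
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

paired     <- t.test(before, after, paired = TRUE)
one_sample <- t.test(d, mu = 0)

c(mean_diff = mean(d), t_manual = t_manual,
  paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
```

Because each patient serves as their own control, the between-patient variation drops out of the differences, which is why the paired analysis is much more sensitive than the two-independent-groups analysis of the same numbers.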
9.7 Frequencies
Others
2
Note that in the tables above, the table function does not give percentages. To compute percentages we need the prop.table function, whose use can be illustrated as follows:
# create an object named freq holding the frequencies
> freq <- table(sex, ethnicity)
# check the result
> freq
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2
# use margin.table to see the marginal totals
> margin.table(freq, 1)
sex
Female   Male
    69     31
> margin.table(freq, 2)
ethnicity
  African     Asian Caucasian    Others
        8        60        30         2
Called with margin 1, as in prop.table(freq, 1), prop.table computes the ethnic distribution within each sex. For example, among women (Female), 5.8% are African, 62.3% are Asian and 31.9% are Caucasian, for a total of 100%. Similarly, among men 12.9% are African, 54.8% are Asian, and so on.
# compute percentages with prop.table
> prop.table(freq, 2)
        ethnicity
sex        African     Asian Caucasian    Others
  Female 0.5000000 0.7166667 0.7333333 0.0000000
  Male   0.5000000 0.2833333 0.2666667 1.0000000
In this table, prop.table computes the sex distribution within each ethnic group. For example, in the Asian group 71.7% are women and 28.3% are men.
# compute percentages over the whole table
> freq/sum(freq)
        ethnicity
sex      African Asian Caucasian Others
  Female    0.04  0.43      0.22   0.00
  Male      0.04  0.17      0.08   0.02
$$z = \frac{x - n\pi}{\sqrt{n\pi(1 - \pi)}}$$
Here, z follows a normal distribution with mean 0 and variance 1. Equivalently, $z^2$ follows a Chi-squared distribution with 1 degree of freedom.
The method for comparing two proportions follows directly from the theory of testing a single proportion presented above. Consider two samples of sizes $n_1$ and $n_2$ in which the numbers of events are $x_1$ and $x_2$. We can then estimate the two proportions $p_1 = x_1/n_1$ and $p_2 = x_2/n_2$. Probability theory allows us to state that, if the two true proportions are equal, the difference between the two samples $d = p_1 - p_2$ follows a normal distribution with mean 0 and variance:
$$V_d = \left(\frac{1}{n_1} + \frac{1}{n_2}\right) p (1 - p)$$
where:
$$p = \frac{x_1 + x_2}{n_1 + n_2}$$
Hence $z = d / \sqrt{V_d}$ follows a normal distribution with mean 0 and variance 1. In other words, $z^2$ follows a Chi-squared distribution with 1 degree of freedom. We can therefore also use prop.test to compare two proportions.
Example 14. A study was carried out to compare the efficacy of an anti-fracture drug. Patients were divided into two groups: group A, which was treated, comprised 100 patients, and group B, which was not treated, comprised 110 patients. After 12 months of follow-up, 7 people in group A and 20 people in group B had sustained a fracture. The question is whether the fracture rates in the two groups are equal (that is, the drug has no effect).
To test whether the two proportions really differ, we can use the function prop.test(x, n, ...) as follows:
> fracture <- c(7, 20)
> total <- c(100, 110)
> prop.test(fracture, total)

        2-sample test for equality of proportions with continuity correction

data:  fracture out of total
X-squared = 4.8901, df = 1, p-value = 0.02701
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.20908963 -0.01454673
sample estimates:
   prop 1    prop 2
0.0700000 0.1818182

The analysis shows that the fracture rate is 0.07 in group 1 and 0.18 in group 2. It also gives a 95% confidence interval for the difference between the two groups, which runs from about 0.01 to 0.21 (that is, 1 to 21 percentage points). With p = 0.027, we can say that the fracture rate in group A is indeed lower than in group B.
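The z statistic for two proportions can be computed by hand from the fracture counts in Example 14. This is a sketch under one stated assumption: it omits the continuity correction, so it is compared with prop.test(..., correct = FALSE) and will differ slightly from the corrected X-squared of 4.8901 printed above.

```r
x <- c(7, 20)      # fractures in groups A and B (from Example 14)
n <- c(100, 110)   # group sizes

p_hat  <- x / n
d      <- p_hat[1] - p_hat[2]
p_pool <- sum(x) / sum(n)                          # pooled proportion
v_d    <- (1 / n[1] + 1 / n[2]) * p_pool * (1 - p_pool)

z       <- d / sqrt(v_d)
p_value <- 2 * pnorm(-abs(z))

# Without continuity correction, X-squared equals z^2
res <- prop.test(x, n, correct = FALSE)
c(z_squared = z^2, X_squared = unname(res$statistic), p = p_value)
```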
9.10 Comparing several proportions (prop.test, chisq.test)
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2
We want to know whether the proportion of women differs among the 4 ethnic groups. To answer this question we again use prop.test:
> female <- c(4, 43, 22, 0)
> total <- c(8, 60, 30, 2)
> prop.test(female, total)

        4-sample test for equality of proportions without continuity correction

data:  female out of total
X-squared = 6.2646, df = 3, p-value = 0.09942
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4
0.5000000 0.7166667 0.7333333 0.0000000

Warning message:
Chi-squared approximation may be incorrect in: prop.test(female, total)

Although the proportion of women appears to differ considerably among the groups (73% in group 3 (Caucasian) compared with 50% in group 1 (African) and 71.7% in the Asian group), the Chi-squared test indicates that, statistically, these proportions do not differ, because p = 0.099.
9.10.1 Chi-squared test (chisq.test)
In fact, the Chi-squared test can also be computed with the chisq.test function:
> chisq.test(sex, ethnicity)

        Pearson's Chi-squared test

data:  sex and ethnicity
X-squared = 6.2646, df = 3, p-value = 0.09942

Warning message:
Chi-squared approximation may be incorrect in: chisq.test(sex, ethnicity)
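The agreement between prop.test and chisq.test, and the reason for the warning, can be seen by working from the table of counts. This sketch enters the counts as a matrix (an assumption made here because the raw sex and ethnicity vectors are not reproduced) and inspects the expected counts.

```r
# Counts copied from the sex-by-ethnicity table in the text
freq <- matrix(
  c(4, 43, 22, 0,
    4, 17,  8, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("Female", "Male"),
                  ethnicity = c("African", "Asian", "Caucasian", "Others"))
)

# Both calls warn about small expected counts; the warnings are silenced here
chi  <- suppressWarnings(chisq.test(freq))
prop <- suppressWarnings(prop.test(freq["Female", ], colSums(freq)))

c(chisq.test = unname(chi$statistic), prop.test = unname(prop$statistic))

# Expected counts under independence; values below 5 trigger the warning
round(chi$expected, 2)
```

When some expected counts are this small, the Chi-squared approximation is unreliable, and an exact method such as fisher.test(freq) is a commonly used alternative.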
ID (id)   Age (age)   BMI (bmi)   Cholesterol (chol)
  1         46          25.4          3.5
  2         20          20.6          1.9
  3         52          26.2          4.0
  4         30          22.6          2.6
  5         57          25.4          4.5
  6         25          23.1          3.0
  7         28          22.7          2.9
  8         36          24.9          3.8
  9         22          19.8          2.1
 10         43          25.3          3.8
 11         57          23.2          4.1
 12         33          21.8          3.0
 13         22          20.9          2.5
 14         63          26.7          4.6
 15         40          26.4          3.2
 16         48          21.2          4.2
 17         28          21.2          2.3
 18         49          22.8          4.0
A quick look at the data suggests that the older a person is, the higher their cholesterol. Let us enter these data into R and draw a scatter plot as follows:
> age <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
> bmi <- c(25.4,20.6,26.2,22.6,25.4,23.1,22.7,24.9,19.8,25.3,23.2,
           21.8,20.9,26.7,26.4,21.2,21.2,22.8)
> chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
            2.5,4.6,3.2,4.2,2.3,4.0)
> plot(chol ~ age, pch=16)
[Figure: scatter plot of chol (vertical axis) against age (horizontal axis).]
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
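The correlation formula can be evaluated term by term on the age and cholesterol data from the table above and compared with R's cor function. The intermediate names `num`, `den` and `r_manual` are introduced here for illustration.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Numerator: sum of cross-products of deviations from the means
num <- sum((age - mean(age)) * (chol - mean(chol)))

# Denominator: square root of the product of the two sums of squares
den <- sqrt(sum((age - mean(age))^2) * sum((chol - mean(chol))^2))

r_manual <- num / den
c(manual = r_manual, cor = cor(age, chol))
```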
Kendall's correlation coefficient (also a non-parametric method) is estimated by looking for pairs of observations (x, y) that are "concordant". A pair is defined as concordant when the difference along the horizontal axis has the same sign (positive or negative) as the difference along the vertical axis. If the two variables x and y are unrelated, the number of concordant pairs is equal or close to the number of discordant pairs.
Because many pairs have to be examined, computing Kendall's correlation coefficient is fairly demanding in computer time. However, for a data set of fewer than 5000 subjects a personal computer can do the calculation quite easily. R uses the cor.test function with the argument method="kendall" to estimate Kendall's correlation coefficient:
> cor.test(age, chol, method="kendall")

        Kendall's rank correlation tau

data:  age and chol
z = 4.755, p-value = 1.984e-06
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau
0.8333333

Warning message:
Cannot compute exact p-value with ties in: cor.test.default(age, chol, method = "kendall")
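Counting concordant and discordant pairs can be sketched on a tiny example. The vectors `x` and `y` below are made up and contain no ties, so the simple ratio (concordant minus discordant, over the number of pairs) coincides with what cor reports; with tied values, as in the age data, R applies a tie correction and the simple ratio would differ.

```r
# Hypothetical tie-free data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# All pairs of observations (columns are index pairs i < j)
idx <- combn(length(x), 2)

# Sign of the x-difference times sign of the y-difference for each pair
sgn <- sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]])

concordant <- sum(sgn > 0)
discordant <- sum(sgn < 0)
tau <- (concordant - discordant) / ncol(idx)

c(concordant = concordant, discordant = discordant, tau = tau)
```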
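The least-squares estimates of $\alpha$ and $\beta$ are the values that make
$$\sum_{i=1}^{n}\left[y_i - (\alpha + \beta x_i)\right]^2$$
as small as possible. They are:
$$\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
and
$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$
Here, $\bar{x}$ and $\bar{y}$ are the means of x and y. Note that I write $\hat{\alpha}$ and $\hat{\beta}$ (with a hat on top) as a reminder that these are estimates of $\alpha$ and $\beta$, not $\alpha$ and $\beta$ themselves (we do not know $\alpha$ and $\beta$ exactly; we can only estimate them).
Once we have the estimates $\hat{\alpha}$ and $\hat{\beta}$, we can estimate the mean cholesterol for each age as follows:
$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$
The residual variance is estimated by:
$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - 2}$$
Here, $s^2$ is the estimate of $\sigma^2$.
The lm function (short for linear model) in R computes $\hat{\alpha}$ and $\hat{\beta}$, as well as $s^2$, quickly. We continue the example in R as follows:
> lm(chol ~ age)

Call:
lm(formula = chol ~ age)

Coefficients:
(Intercept)          age
    1.08922      0.05779

That is, the fitted equation is chol = 1.089 + 0.0578 x age. This equation means that when age increases by 1 year, cholesterol increases by about 0.058 mmol/L.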
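The estimates can be computed directly from the formulas and compared with lm, using the age and cholesterol data given earlier. The names `beta_hat`, `alpha_hat` and `s2` are introduced here for illustration.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Slope: cross-products of deviations over the sum of squares of x
beta_hat <- sum((age - mean(age)) * (chol - mean(chol))) /
            sum((age - mean(age))^2)

# Intercept: the fitted line passes through the point of means
alpha_hat <- mean(chol) - beta_hat * mean(age)

# Residual variance with n - 2 degrees of freedom
fitted_chol <- alpha_hat + beta_hat * age
s2 <- sum((chol - fitted_chol)^2) / (length(chol) - 2)

reg <- lm(chol ~ age)
rbind(manual = c(alpha_hat, beta_hat), lm = coef(reg))
```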
In fact, lm provides much more information, but we must first store the results in an object. If we call the object reg, the commands are:
> reg <- lm(chol ~ age)
> summary(reg)

Call:
lm(formula = chol ~ age)

Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The second command, summary(reg), asks R to list the information computed in reg. The output has 3 parts:
(a) The first part summarizes the residuals:
Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040
We know that the mean of the residuals must be 0; here the median is -0.04, which is not far from 0. The 25% (1Q) and 75% (3Q) quantiles are also fairly balanced around the median, which shows that the residuals of this equation are reasonably symmetric.
(b) The second part presents the estimates $\hat{\alpha}$ and $\hat{\beta}$ together with their standard errors and t statistics. The t statistic for $\hat{\beta}$ is 10.70 with p = 1.06e-08, which shows that $\beta$ is not equal to 0. In other words, we have evidence that there is a relationship between cholesterol and age, and that this relationship is statistically significant.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(c) The third part gives information about the residual variance (residual mean square). Here the residual standard error is s = 0.3027, so $s^2 \approx 0.092$. This part also contains an F test, which again simply tests whether $\beta$ is equal to 0, so it has the same meaning as the t test in the previous part (indeed F = 114.6 is the square of t = 10.704). In general, in simple linear regression (with one predictor) we do not need to pay attention to the F test.
Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775,     Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF,  p-value: 1.058e-08
In addition, part 3 gives us an important piece of information, the value of $R^2$, or coefficient of determination. It equals the sum of squares of the fitted values about the mean divided by the sum of squares of the observed values about the mean. In this example $R^2$ is 0.8775, which means that the linear equation (with age as the only predictor) explains about 88% of the differences in cholesterol between individuals. Naturally, $R^2$ takes values from 0 to 1 (or 100%). The higher $R^2$ is, the stronger the relationship between age and cholesterol.
Another coefficient worth mentioning is the adjusted coefficient of determination (which R calls Adjusted R-squared in the output above). This coefficient tells us how much the residual variance improves because age is included in the linear model. In general it differs little from the coefficient of determination, and we do not need to pay undue attention to it.
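Both coefficients can be recomputed from sums of squares. The sketch below refits the age model on the data given earlier and uses the standard adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - k - 1) with k predictors; the variable names are illustrative.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
reg <- lm(chol ~ age)

# Sum of squares of observed values and of fitted values about the mean
ss_total <- sum((chol - mean(chol))^2)
ss_model <- sum((fitted(reg) - mean(chol))^2)

r2 <- ss_model / ss_total

n <- length(chol)
k <- 1                                   # number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

c(R2 = r2, adjusted_R2 = adj_r2)
```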
Assumptions of linear regression analysis

          6          12          18
0.466072660 0.003765579 0.079151419

[Figure: four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours).]
Figure 19. Residual analysis to check the assumptions of linear regression analysis.
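Once the model for predicting cholesterol has been checked and its validity established, we can draw the line describing the relationship between age and cholesterol with the abline command, for example abline(reg) added to the scatter plot of chol against age (recall that the object holding the analysis is reg).
[Figure: scatter plot of chol against age with the fitted regression line.]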
But each value $\hat{y}_i$ is computed from the estimates $\hat{\alpha}$ and $\hat{\beta}$, and these estimates have standard errors, so the predicted value $\hat{y}_i$ also carries error. In other words, $\hat{y}_i$ is only an average; in practice it may be higher or lower depending on the sample. Its 95% confidence interval can be computed in R with the following commands:
> reg <- lm(chol ~ age)
> new <- data.frame(age = seq(15, 70, 5))
> pred.w.plim <- predict(reg, new, interval="prediction")
> pred.w.clim <- predict(reg, new, interval="confidence")
> resc <- cbind(pred.w.clim, new)
> resp <- cbind(pred.w.plim, new)
> plot(chol ~ age, pch=16)
> lines(resc$fit ~ resc$age)
> lines(resc$lwr ~ resc$age, col=2)
> lines(resc$upr ~ resc$age, col=2)
> lines(resp$lwr ~ resp$age, col=4)
> lines(resp$upr ~ resp$age, col=4)
[Figure: scatter plot of chol against age with the fitted line, the 95% confidence band and the 95% prediction band.]
The plot shows the predicted mean $\hat{y}_i$ (the black straight line); the red lines are the 95% confidence interval of this mean. The blue lines are the interval for the predicted cholesterol of a new individual of a given age in the population.
[Figure: scatter-plot matrix (pairs) of age, bmi and chol.]
As with age and cholesterol, the relationship between bmi and cholesterol also roughly follows a straight line. The plot above also shows that age and bmi are related to each other. Indeed, a simple linear regression of cholesterol on bmi shows that this relationship is statistically significant:
> summary(lm(chol ~ bmi))

Call:
lm(formula = chol ~ bmi)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9403 -0.3565 -0.1376  0.3040  1.4330

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.83187    1.60841  -1.761  0.09739 .
bmi          0.26410    0.06861   3.849  0.00142 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.623 on 16 degrees of freedom
Multiple R-Squared: 0.4808,     Adjusted R-squared: 0.4483
F-statistic: 14.82 on 1 and 16 DF,  p-value: 0.001418

BMI explains about 48% of the variation in cholesterol between individuals. But because BMI is also related to age, we want to know which factor is more important when the two are analyzed together. To assess the effect of both age ($x_1$) and bmi ($x_2$) on cholesterol (y), we use a multiple linear regression model:
$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$$
     3Q     Max
 0.1698  0.5679

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.455458   0.918230   0.496    0.627
age         0.054052   0.007591   7.120 3.50e-06 ***
bmi         0.033364   0.046866   0.712    0.487
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3074 on 15 degrees of freedom
Multiple R-Squared: 0.8815,     Adjusted R-squared: 0.8657
F-statistic: 55.77 on 2 and 15 DF,  p-value: 1.132e-07
The equation shows that when age increases by 1 year, cholesterol rises by 0.054 mmol/L (an estimate that differs little from the 0.0578 in the equation with age alone), and that for each 1 kg/m2 increase in BMI cholesterol rises by 0.0334 mmol/L. Together, these two factors explain about 88.2% ($R^2$ = 0.8815) of the variation in cholesterol between individuals.
Recall that the equation with age alone (in the previous analysis) explained about 87.8% of the variation in cholesterol between individuals. When we add BMI, this coefficient rises to 88.2%, an increase of only about 0.4 percentage points. The question is whether this increase is statistically significant. The answer can be read from the test of the bmi term, with p = 0.487. Thus bmi provides no information for predicting cholesterol beyond what we already have from age. In other words, once age has been taken into account, the effect of bmi is no longer statistically significant. This is understandable, because Figure 10.5 shows that age and bmi are fairly strongly related. Because the two variables are correlated, we do not need both of them in the equation. (However, this example is only meant to illustrate how to carry out a multiple linear regression analysis in R; it does not attempt to model the data according to biological considerations.)
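Whether adding bmi improves on the age-only model can also be checked by comparing the two nested models with anova. The sketch below uses the age, bmi and cholesterol values listed earlier; the object names `m_age` and `m_both` are introduced here. For a single added term, the F test of the comparison is equivalent to the t test of that term.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
bmi  <- c(25.4, 20.6, 26.2, 22.6, 25.4, 23.1, 22.7, 24.9, 19.8, 25.3, 23.2,
          21.8, 20.9, 26.7, 26.4, 21.2, 21.2, 22.8)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

m_age  <- lm(chol ~ age)         # model with age only
m_both <- lm(chol ~ age + bmi)   # model with age and bmi

# F test for the improvement in fit from adding bmi
cmp <- anova(m_age, m_both)
p_f <- cmp[["Pr(>F)"]][2]

# t test of the bmi coefficient in the larger model
p_t <- summary(m_both)$coefficients["bmi", "Pr(>|t|)"]

c(p_from_F = p_f, p_from_t = p_t, cor_age_bmi = cor(age, bmi))
```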
[Figure: four diagnostic panels for the multiple regression model: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours).]
Although BMI is not statistically significant in this case, Figure 10.6 shows that the assumptions of the linear regression model are met.
Group 1: Crohn's disease (n = 9)
  1343, 1393, 1420, 1641, 1897, 2160, 2169, 2279, 2890
  Mean: 1910; SD: 516
Group 2: colitis (n = 11)
  1264, 1314, 1399, 1605, 2385, 2511, 2514, 2767, 2827, 2895, 3011
  Mean: 2226; SD: 727
Group 3: control (n = 20)
  1809, 1926, 2283, 2384, 2447, 2479, 2495, 2525, 2541, 2769,
  2850, 2964, 2973, 3171, 3257, 3271, 3288, 3358, 3643, 3657
  Mean: 2804; SD: 527
Note: SD is the standard deviation.
Ho: $\mu_1 = \mu_2 = \mu_3$
HA: there is a difference among the 3 $\mu_j$ (j = 1, 2, 3)
At first the reader, having learned how to compare two groups with the t test, may think that we need to make 3 comparisons with the t test: between groups 1 and 2, groups 2 and 3, and groups 1 and 3. But this approach is not appropriate, because there are three different variances. The appropriate method for this comparison is the analysis of variance, which can be used to compare several groups at the same time (simultaneous comparisons).
To illustrate the analysis of variance we need some notation. Let the galactose value of patient i in group j (j = 1, 2, 3) be $x_{ij}$. The analysis of variance model states that:
$$x_{ij} = \mu + \alpha_j + \varepsilon_{ij}$$
Or, more specifically:
$$x_{i1} = \mu + \alpha_1 + \varepsilon_{i1}$$
$$x_{i2} = \mu + \alpha_2 + \varepsilon_{i2}$$
$$x_{i3} = \mu + \alpha_3 + \varepsilon_{i3}$$
First of all, we need to enter the data into R. The first step is to tell R that we have three groups of patients (1, 2 and 3): group 1 has 9 people, group 2 has 11 people, and group 3 has 20 people:
> group <- c(1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,
             3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3)
For the analysis of variance, we must define the variable group as a factor:
> group <- as.factor(group)
Next, we enter the galactose values for each group as defined above (the object is called galactose):
> galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
                 1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
                 1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,2479,3271,2495,3288,
                 2525,3358,2541,3643,2769,3657)
With the data ready, we use the lm() function to carry out the analysis of variance as follows:
> analysis <- lm(galactose ~ group)
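The F statistic that the analysis of variance reports can be rebuilt from the between-group and within-group sums of squares. The sketch below re-enters the galactose data from the text and compares a hand calculation with anova(analysis); the names `ss_between`, `ss_within` and `f_manual` are illustrative.

```r
group <- factor(rep(1:3, times = c(9, 11, 20)))
galactose <- c(1343, 1393, 1420, 1641, 1897, 2160, 2169, 2279, 2890,
               1264, 1314, 1399, 1605, 2385, 2511, 2514, 2767, 2827, 2895, 3011,
               1809, 2850, 1926, 2964, 2283, 2973, 2384, 3171, 2447, 3257,
               2479, 3271, 2495, 3288, 2525, 3358, 2541, 3643, 2769, 3657)

analysis <- lm(galactose ~ group)
tab <- anova(analysis)

# Group sizes and means
n_j     <- tapply(galactose, group, length)
means_j <- tapply(galactose, group, mean)

# Between-group and within-group sums of squares
ss_between <- sum(n_j * (means_j - mean(galactose))^2)
ss_within  <- sum((galactose - ave(galactose, group))^2)

k <- nlevels(group)
N <- length(galactose)
f_manual <- (ss_between / (k - 1)) / (ss_within / (N - k))

c(manual = f_manual, anova = tab[["F value"]][1])
```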
in reality there is no difference or effect). Therefore, when many comparisons are made, we need to adjust the p-values appropriately.
There are quite a few methods for adjusting p-values, and the 4 most commonly used are Bonferroni, Scheffé, Holm and Tukey (named after 4 well-known statisticians). Which method is the most appropriate? There is no definitive answer to this question, but the following two points may help the reader decide:
(a)
(b)
  1      2
2 0.6805 -
3 0.0012 0.0321

P value adjustment method: bonferroni
The result shows that the p-value between group 1 (Crohn) and colitis is 0.6805 (not statistically significant); between the Crohn and control groups it is 0.0012 (statistically significant); and between the colitis and control groups it is 0.0321 (also statistically significant).
Another method for adjusting p-values is the Holm method:
> pairwise.t.test(galactose, group)

        Pairwise comparisons using t tests with pooled SD

data:  galactose and group

  1      2
2 0.2268 -
3 0.0012 0.0214

P value adjustment method: holm

  1      2
2 0.2557 -
3 0.0017 0.0544

P value adjustment method: holm

         diff        lwr      upr     p adj
2-1 316.3232 -312.09857  944.745 0.4439821
3-1 894.2778  333.07916 1455.476 0.0011445
3-2 577.9545   53.11886 1102.790 0.0281768
The result above shows that groups 3 and 1 differ by about 894 units, with a 95% confidence interval from 333 to 1455 units. Similarly, galactose in the colitis group is lower than in the control group (group 3) by about 578 units, with a 95% confidence interval from 53 to 1103.
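How the Bonferroni and Holm adjustments act on the same unadjusted p-values can be sketched with p.adjust, using the galactose data from the text. The intermediate names (`raw`, `p_raw`, `p_bonf`, `p_holm`) are introduced here for illustration.

```r
group <- factor(rep(1:3, times = c(9, 11, 20)))
galactose <- c(1343, 1393, 1420, 1641, 1897, 2160, 2169, 2279, 2890,
               1264, 1314, 1399, 1605, 2385, 2511, 2514, 2767, 2827, 2895, 3011,
               1809, 2850, 1926, 2964, 2283, 2973, 2384, 3171, 2447, 3257,
               2479, 3271, 2495, 3288, 2525, 3358, 2541, 3643, 2769, 3657)

# Unadjusted pairwise p-values (pooled SD); keep the three comparisons
raw   <- pairwise.t.test(galactose, group, p.adjust.method = "none")$p.value
p_raw <- raw[lower.tri(raw, diag = TRUE)]   # order: 2 vs 1, 3 vs 1, 3 vs 2

# Bonferroni multiplies every p-value by the number of comparisons (capped at 1);
# Holm multiplies the smallest by 3, the next by 2, and the largest by 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_holm <- p.adjust(p_raw, method = "holm")

round(rbind(raw = p_raw, bonferroni = p_bonf, holm = p_holm), 4)
```

Holm-adjusted values are never larger than Bonferroni-adjusted ones, which is why the Holm method is generally the less conservative of the two.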
[Figure: 95% confidence intervals for the pairwise differences between groups (3-2, 3-1 and 2-1).]
The non-parametric method for comparing several groups that corresponds to the analysis of variance is the Kruskal-Wallis test. Like the Wilcoxon method for comparing two groups non-parametrically, the Kruskal-Wallis method converts the data to ranks and analyzes the differences in these ranks between the groups. The kruskal.test function in R carries out this test:
> kruskal.test(galactose ~ group)

        Kruskal-Wallis rank sum test

data:  galactose by group
Kruskal-Wallis chi-squared = 12.1381, df = 2, p-value = 0.002313
11.4 Two-way analysis of variance (two-way ANOVA)

Condition (i)   Material (j) = 1   Material (j) = 2   Material (j) = 3
1               4.1, 3.9, 4.3      3.1, 2.8, 3.3      3.5, 3.2, 3.6
2               2.7, 3.1, 2.6      1.9, 2.2, 2.3      2.7, 2.3, 2.5

To analyze these data in R, we need to organize them into 4 variables as follows:

Condition   Material   Subject   Score
1           1           1        4.1
1           1           2        3.9
1           1           3        4.3
1           2           4        3.1
1           2           5        2.8
1           2           6        3.3
1           3           7        3.5
1           3           8        3.2
1           3           9        3.6
2           1          10        2.7
2           1          11        3.1
2           1          12        2.6
2           2          13        1.9
2           2          14        2.2
2           2          15        2.3
2           3          16        2.7
2           3          17        2.3
2           3          18        2.5
And create the 18 ID codes (from 1 to 18):
> id <- 1:18
The data are now ready for analysis. For a two-way analysis of variance we again use the lm command, with the following arguments:
> twoway <- lm(score ~ condition + material)
> anova(twoway)
Analysis of Variance Table

Response: score
          Df Sum Sq Mean Sq F value    Pr(>F)
condition  1 5.0139  5.0139  95.575 1.235e-07 ***
material   2 2.1811  1.0906  20.788 6.437e-05 ***
Residuals 14 0.7344  0.0525
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Three sources of variation in score are analyzed in the table above. From the mean squares, we see that the effect of condition appears to be more important than the effect of the test material. However, both effects are statistically significant, because the p-values are very low for both factors. We ask R to summarize the estimates from the analysis with the summary command:
> summary(twoway)

Call:
lm(formula = score ~ condition + material)

Residuals:
     Min       1Q   Median       3Q      Max
-0.32778 -0.16389  0.03333  0.16111  0.32222

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.9778     0.1080  36.841 2.43e-15 ***
condition2   -1.0556     0.1080  -9.776 1.24e-07 ***
material2    -0.8500     0.1322  -6.428 1.58e-05 ***
material3    -0.4833     0.1322  -3.655   0.0026 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

$$R^2 = \frac{5.0139 + 2.1811}{5.0139 + 2.1811 + 0.7344} = 0.9074$$
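The ratio above can be reproduced from the analysis of variance table. The sketch below re-enters the 18 scores from the table in the text, building condition and material with rep (an assumption about how the lost data-entry commands looked), and recomputes the ratio from the sums of squares.

```r
condition <- factor(rep(1:2, each = 9))
material  <- factor(rep(rep(1:3, each = 3), times = 2))
score <- c(4.1, 3.9, 4.3, 3.1, 2.8, 3.3, 3.5, 3.2, 3.6,
           2.7, 3.1, 2.6, 1.9, 2.2, 2.3, 2.7, 2.3, 2.5)

twoway <- lm(score ~ condition + material)
tab <- anova(twoway)

# Sums of squares: condition, material, residuals
ss <- tab[["Sum Sq"]]

# Share of the total variation explained by the two factors
r2 <- sum(ss[1:2]) / sum(ss)

c(from_anova = r2, from_summary = summary(twoway)$r.squared)
```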
          diff         lwr        upr     p adj
2-1 -0.8500000 -1.19610279 -0.5038972 0.0000442
3-1 -0.4833333 -0.82943612 -0.1372305 0.0068648
3-2  0.3666667  0.02056388  0.7127695 0.0374069
[Figure: 95% confidence intervals for the pairwise differences between materials (3-2, 3-1 and 2-1).]
 id   fx   age   bmi       bmd     ictp    pinp
  1   1    79    24.7252   0.818   9.170   37.383
  2   1    89    25.9909   0.871   7.561   24.685
  3   1    70    25.3934   1.358   5.347   40.620
  4   1    88    23.2254   0.714   7.354   56.782
  5   1    85    24.6097   0.748   6.760   58.358
  6   0    68    25.0762   0.935   4.939   67.123
  7   0    70    19.8839   1.040   4.321   26.399
  8   0    69    25.0593   1.002   4.212   47.515
  9   0    74    25.6544   0.987   5.605   26.132
 10   0    79    19.9594   0.863   5.204   60.267
...
137   0    64    38.0762   1.086   5.043   32.835
138   1    80    23.3887   0.875   4.086   23.837
139   0    67    25.9455   0.983   4.328   71.334
For an independent variable x (x may be continuous or not), the logistic regression model states that:
$$\operatorname{logit}(p) = \alpha + \beta x$$
As in the linear regression model, $\alpha$ and $\beta$ are two linear parameters that must be estimated from the study data. But the meaning of these parameters, especially $\beta$, is very different from the meaning we are used to in linear regression. To understand these two parameters, I will return to example 19.
What we want to know is the relationship between bone mineral density (bmd) and the risk of fracture (fx). For convenience, let bmd be x; the problem can then be written in model language as follows:
$$\operatorname{logit}(p) = \log\frac{p}{1 - p} = \alpha + \beta x$$
In other words:
$$odds(p) = \frac{p}{1 - p} = e^{\alpha + \beta x}$$
That is, the logistic regression model just presented states that the relationship between the probability of fracture (p) and bone mineral density (bmd) follows an S-shaped curve.
The model also shows that the probability of fracture p depends on the value of x. The model can therefore be written more precisely as the odds of fracture given x:
$$odds(p \mid x) = e^{\alpha + \beta x}$$
When $x = x_0$, the odds of fracture are: $odds(p \mid x = x_0) = e^{\alpha + \beta x_0}$
When $x = x_0 + 1$ (an increase of 1 unit from $x_0$), the odds of fracture are:
$$odds(p \mid x = x_0 + 1) = e^{\alpha + \beta (x_0 + 1)}$$
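$$\frac{odds(p \mid x = x_0 + 1)}{odds(p \mid x = x_0)} = \frac{e^{\alpha + \beta (x_0 + 1)}}{e^{\alpha + \beta x_0}} = e^{\beta}$$
$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n}\left[1 + e^{-(\alpha + \beta x_i)}\right]^{-1}$$
$$\sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{n} x_i \left[1 + e^{-(\alpha + \beta x_i)}\right]^{-1}$$
$$\hat{p} = \frac{e^{\hat{\alpha} + \hat{\beta} x}}{1 + e^{\hat{\alpha} + \hat{\beta} x}} = \frac{1}{1 + e^{-(\hat{\alpha} + \hat{\beta} x)}}$$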
"age"
"bmi"
"bmd"
"ictp" "pinp"
[Figure: box plot of BMD by fracture status (Fracture: 1 = yes, 0 = no).]
The result above shows that bmd in the group of patients with a fracture is lower than in the group without a fracture (0.90 versus 0.94). However, the following t test shows that this difference is not statistically significant (p = 0.15).
> t.test(bmd~fx)
Welch Two Sample t-test
data: bmd by fx
t = 1.4572, df = 53.952, p-value = 0.1508
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.01609226 0.10172922
sample estimates:
mean in group 0 mean in group 1
      0.9444851       0.9016667
     3Q     Max
 1.3780  2.0709

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.063      1.342   0.792    0.428
bmd           -2.270      1.455  -1.560    0.119

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 157.81  on 136  degrees of freedom
Residual deviance: 155.27  on 135  degrees of freedom

     3Q     Max
 1.3780  2.0709
Deviance, as explained above, reflects the difference between the model and the data (much like the residual mean square in linear regression analysis). For a single model such as the one in this example, the value of the deviance does not mean very much.
(c) The next part gives the estimates of $\alpha$ (which R calls the intercept) and $\beta$ (bmd), together with their standard errors.
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.063      1.342   0.792    0.428
bmd           -2.270      1.455  -1.560    0.119
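$$e^{-2.27 \times 0.1406} = 0.7267$$
That is, when bmd increases by one standard deviation, the odds of fracture fall by about 27%. Put the other way round, when bmd decreases by one standard deviation the odds rise by a factor of $e^{2.27 \times 0.1406} = 1.376$, or about 38%.
Another way to see the effect of bmd is to estimate the probability of fracture from the equation:
$$\hat{p} = \frac{e^{1.063 - 2.27 \times bmd}}{1 + e^{1.063 - 2.27 \times bmd}}$$
Accordingly, when bmd = 1.00, p = 0.23. When bmd = 0.86 (a decrease of 1 standard deviation), p = 0.291. That is, if BMD decreases by 1 standard deviation, the probability of fracture increases by a factor of 0.291/0.23 = 1.265, or about 26%.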
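These figures can be reproduced from the reported coefficients. The sketch below plugs in the estimates from the output above (1.063 and -2.270) and the standard deviation of bmd used in the text (0.1406); the function names `p_at` and `odds_at` are introduced here for illustration. plogis is R's inverse logit, 1 / (1 + exp(-x)).

```r
alpha_hat <- 1.063    # intercept reported in the output
beta_hat  <- -2.270   # coefficient of bmd reported in the output
sd_bmd    <- 0.1406   # standard deviation of bmd used in the text

# Odds ratio for an increase of one standard deviation in bmd, and its inverse
or_per_sd_up   <- exp(beta_hat * sd_bmd)
or_per_sd_down <- exp(-beta_hat * sd_bmd)

# Predicted probability and odds of fracture at a given bmd
p_at    <- function(bmd) plogis(alpha_hat + beta_hat * bmd)
odds_at <- function(bmd) exp(alpha_hat + beta_hat * bmd)

c(or_up = or_per_sd_up, or_down = or_per_sd_down,
  p_bmd_1.00 = p_at(1.00), p_bmd_0.86 = p_at(0.86))

# The odds ratio for a one-unit increase in x is exp(beta), whatever the starting value
odds_at(1.5) / odds_at(0.5)
```

Note that the odds ratio (about 1.38 per standard deviation decrease) and the ratio of probabilities (about 1.27) are different quantities; they are close only when the probability is small.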
(d) The last part of the output gives the deviance of two models: the model with no independent variable (null deviance), and the model with the independent variable, which is bmd in this example (residual deviance).
    Null deviance: 157.81  on 136  degrees of freedom
Residual deviance: 155.27  on 135  degrees of freedom
AIC: 159.27
From these two numbers we see that bmd contributes very little to predicting fracture: it only reduces the deviance from 157.81 to 155.27, and this reduction is not statistically significant.
In addition, R gives the value of AIC (Akaike Information Criterion), which is computed from the deviance and the number of parameters. I will return to the meaning of AIC in the next section, when comparing models.
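The statement that the reduction in deviance is not significant can be checked with a likelihood-ratio test, and the AIC can be recovered from the residual deviance. The sketch below uses only the numbers printed in the output; it assumes ungrouped 0/1 outcomes, for which the residual deviance equals minus twice the log-likelihood.

```r
null_dev  <- 157.81   # deviance of the model without bmd (136 df)
resid_dev <- 155.27   # deviance of the model with bmd (135 df)

# Likelihood-ratio statistic: drop in deviance, compared with Chi-squared on 1 df
lr   <- null_dev - resid_dev
p_lr <- pchisq(lr, df = 136 - 135, lower.tail = FALSE)

# AIC = -2 log-likelihood + 2 * (number of parameters); two parameters here
aic <- resid_dev + 2 * 2

c(LR = lr, p_value = p_lr, AIC = aic)
```

The p-value from this test is close to the one from the z test of the bmd coefficient (0.119), as expected for a single added term.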
12.3 Estimating probabilities with R
> predict(logistic)
1
The numbers above are log(p / (1 - p)), that is, log odds, which are not very meaningful in practice. We want the predicted probability p computed from the equation
$$\hat{p} = \frac{e^{1.063 - 2.27 \times bmd}}{1 + e^{1.063 - 2.27 \times bmd}}$$
To obtain this value for each patient, we pass the argument type="response" to the predict function as follows:
> predict(logistic, type="response")
         1          2          3          4          5          6          7
0.91510135 0.74757001 0.10516416 0.81650178 0.72419767 0.28064726 0.15011664
         8          9         10         11         12         13         14
0.15767295 0.33955387 0.37588624 0.28052582 0.34327343 0.44305196 0.23830776
...
In the output above (only partly printed), the estimated probability of fracture for patient 1 is 0.915, for patient 2 it is 0.747, and so on.
[Figure: predicted probability of fracture plotted as points against bmd.]
The plot above can be improved by using more closely spaced bmd values (such as 0.50, 0.55, 0.60, ..., 1.20) and by drawing a line instead of points. The following commands produce the improved plot:
> logistic <- glm(fx ~ bmd, family=binomial)
> fnbmd <- seq(0.5, 1.2, 0.05)  # fnbmd takes the values 0.50, 0.55, 0.60, ..., 1.20
> new.data <- data.frame(bmd = fnbmd)  # put the values in a new data frame
> predicted <- predict(logistic, new.data, type="response")
> plot(predicted ~ fnbmd, type="l")
[Figure: predicted probability of fracture (predicted) drawn as a line against fnbmd.]
the probability of situation (c) is called the type I error (or significance level), and is usually denoted by $\alpha$. In other words, $\alpha$ is the probability that the statistical test gives a result of p < 0.05 given that hypothesis H is false;
the probability of situation (d) is not a matter of concern, so it has no special term, although it can be called a true negative.
True (the drug is effective)
False (the drug is not effective)
As for error probabilities, a study usually accepts a type I error of about 1% or 5% (that is, $\alpha$ = 0.01 or 0.05), and a type II error of about $\beta$ = 0.1 to $\beta$ = 0.2 (that is, the power must be between 0.8 and 0.9).
n=
( / )
Trong trng hp c hai nhm i tng, s lng i tng (n) cn thit cho
nghin cu c th tnh ton nh sau:
n = 2
( / )
106
Nguyn Vn Tun
                            α = 0.10   α = 0.05   α = 0.01
β = 0.20 (Power = 0.80)       6.15       7.85      13.33
β = 0.10 (Power = 0.90)       8.53      10.51      16.74
β = 0.05 (Power = 0.95)      10.79      13.00      19.84
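The constants in this table are close to the familiar normal-approximation quantity (z_{α/2} + z_β)². That the table was built this way is an assumption, since the derivation is not shown here; the sketch below computes the quantity with qnorm so the two can be compared. It matches the α = 0.05 column to two decimals. The other two columns differ from the printed values (slightly for α = 0.10, more for α = 0.01), so the printed table remains the reference for the worked examples.

```r
# Constant C = (z_{alpha/2} + z_beta)^2 for a two-sided test
# (assumed definition; the text only tabulates the values)
C_const <- function(alpha, beta) {
  (qnorm(1 - alpha / 2) + qnorm(1 - beta))^2
}

alphas <- c(0.10, 0.05, 0.01)
betas  <- c(0.20, 0.10, 0.05)

# Rows: beta (power = 1 - beta); columns: alpha
C_table <- outer(betas, alphas, function(b, a) C_const(alpha = a, beta = b))
dimnames(C_table) <- list(paste0("beta=", betas), paste0("alpha=", alphas))

print(round(C_table, 2))
```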
13.4 Estimating sample size
13.4.1 Sample size for estimating a single mean
Example 20: We want to estimate the height of Vietnamese men, accepting an error within 1 cm (d = 1), with a confidence level of 0.95 (that is, α = 0.05) and power = 0.8 (or β = 0.2). Previous studies indicate that the standard deviation of height among Vietnamese people is about 4.6 cm. We can apply formula [1] to estimate the sample size needed for the study:

$$ n = \frac{C}{(\Delta/\sigma)^2} = \frac{7.85}{(1/4.6)^2} = 166 $$

In other words, we need to measure the height of 166 subjects to estimate the height of Vietnamese men with an error within 1 cm.
If the acceptable error is 0.5 cm (instead of 1 cm), the number of subjects needed is

$$ n = \frac{7.85}{(0.5/4.6)^2} = 664. $$

If the error we accept is 0.1 cm, the number of study subjects rises to 16610! These estimates make it easy to see that the sample size depends heavily on the error we are willing to accept. The more precise the estimate we want, the more study subjects we need.
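The arithmetic in this example is easy to script. The helper below (the name n_one_mean is made up for illustration) implements formula [1] with C = 7.85, the tabulated constant for α = 0.05 and power = 0.80, and reproduces the three sample sizes quoted above before rounding.

```r
# Sample size from formula [1]: n = C / (d / sd)^2
# d = acceptable error, sd = standard deviation, C = constant from the table
n_one_mean <- function(d, sd, C = 7.85) {
  C / (d / sd)^2
}

errors <- c(1, 0.5, 0.1)                   # acceptable errors in cm
n_raw  <- n_one_mean(d = errors, sd = 4.6) # unrounded sample sizes

print(data.frame(error_cm = errors, n = n_raw))
```

The text reports these values truncated to whole numbers (166, 664 and 16610); in practice one would usually round up.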
R provides the function power.t.test, which can be used to estimate the sample size for the example above, as follows. Note that we tell R that the problem involves a single group, that is, type="one.sample":
# error of 1 cm, standard deviation 4.6, alpha=0.05, power=0.8
> power.t.test(delta=1, sd=4.6, sig.level=.05, power=.80,
type='one.sample')

     One-sample t test power calculation

              n = 168.0131
          delta = 1
             sd = 4.6
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
              n = 666.2525
          delta = 0.5
             sd = 4.6
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

              n = 198.1513
          delta = 3
             sd = 15
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
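The two blocks of output above (n = 666.2525 and n = 198.1513) appear without the commands that produced them. As a hedged sketch, calls consistent with those values are shown below. The choice of type="one.sample" is an assumption inferred from the reported n; a paired design with the same delta and sd would give the same n.

```r
# Error of 0.5 cm, SD 4.6 cm (follow-up to Example 20); type is assumed
res_half <- power.t.test(delta = 0.5, sd = 4.6, sig.level = 0.05,
                         power = 0.80, type = "one.sample")

# delta = 3, sd = 15; a one-sample design is assumed here as well
res_3_15 <- power.t.test(delta = 3, sd = 15, sig.level = 0.05,
                         power = 0.80, type = "one.sample")

print(c(res_half$n, res_3_15$n))
```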
In practice, many studies aim to compare two groups. Sample size estimation for such studies is based mainly on formula [2], as presented in section 15.3.1.
$$ n = \frac{2C}{(\Delta/\sigma)^2} = \frac{2 \times 10.51}{(0.04/0.12)^2} = 189 $$

              n = 190.0991
          delta = 0.04
             sd = 0.12
      sig.level = 0.05
          power = 0.9
    alternative = two.sided
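As a hedged sketch, the hand calculation with formula [2] and the corresponding power.t.test call can be put side by side. The argument type="two.sample" is an assumption inferred from the reported values (the command itself is not shown above); 10.51 is the tabulated constant for α = 0.05 and power = 0.90.

```r
delta   <- 0.04    # difference to detect
sigma   <- 0.12    # standard deviation
C_value <- 10.51   # constant for alpha = 0.05, power = 0.90

# Formula [2]: subjects per group by the normal approximation
n_formula <- 2 * C_value / (delta / sigma)^2

# Same design using the t distribution
res <- power.t.test(delta = delta, sd = sigma, sig.level = 0.05,
                    power = 0.90, type = "two.sample")

print(c(formula = n_formula, power.t.test = res$n))
```

The result from power.t.test is about one subject per group larger than the formula because it works with the t distribution rather than the normal approximation.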
The method for estimating sample size when comparing two groups can be extended to the case of more than two groups. When there are several groups, as discussed in Chapter 11, the method of comparison is analysis of variance. In that method, the residual mean square (RMS) is an estimate of the variability of the measurement within each group, and this quantity is very important for estimating sample size.
The theory behind sample size estimation for analysis of variance is fairly complex and beyond the scope of this chapter, but the main principle is no different from that for comparing two groups. Let the means of the k groups be μ1, μ2, μ3, ..., μk. We can compute the between-group sum of squares as

$$ SS = \sum_{i=1}^{k} (\mu_i - \bar{\mu})^2, \quad \text{where } \bar{\mu} = \sum_{i=1}^{k} \mu_i / k. $$

Let $\lambda = \dfrac{SS}{(k-1)\,RMS}$. The problem is then to find the sample size n such that $z_\beta$ meets the required power of 0.80 or 0.90, where

$$ z_\beta = \frac{\cdots}{\sqrt{(k-1)(1+n\lambda)F + k(n-1)(1+2n\lambda)}} $$
              n = 12.81152
    between.var = 3.486667
     within.var = 8.7
      sig.level = 0.05
          power = 0.9
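The command that produced this output is not shown, so the following is only a sketch of how such a calculation could be run with power.anova.test. The values of between.var and within.var are taken from the output above; groups = 4 is an assumption, because the number of groups is not visible. The first lines also compute the ratio SS/((k − 1)RMS) defined above (called lambda in the code), using the fact that SS/(k − 1) is the variance of the group means.

```r
between_var <- 3.486667   # variance of the group means, SS / (k - 1)
within_var  <- 8.7        # residual mean square (RMS)
k           <- 4          # number of groups: assumed for illustration

# lambda = SS / ((k - 1) * RMS)
lambda <- between_var / within_var

# Solve for n (per group) at 5% significance and 90% power
res <- power.anova.test(groups = k, between.var = between_var,
                        within.var = within_var, sig.level = 0.05,
                        power = 0.90)

print(lambda)
print(res$n)   # subjects needed in each group
```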
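$\sqrt{p(1-p)/n}$.

$$ n \ge \left(\frac{1.96}{m}\right)^2 p(1-p) $$

The sample size therefore depends on the margin of error m and on the proportion p that we want to estimate. The smaller the margin of error, the larger the sample size.

$$ n \ge \left(\frac{1.96}{0.02}\right)^2 \times 0.7 \times 0.3 $$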
In other words, we need to study at least 2017 subjects.
If we want to reduce the margin of error from 2% to 1% (that is, m = 0.01), the number of subjects becomes 8067! Gaining just 1% in precision can add more than 6000 people to the sample. Sample size estimation must therefore be done with great care, balancing the precision of the information to be collected against the cost.
R has no function for estimating the sample size for a single proportion, but with the formula above the reader can easily write one.
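A minimal sketch of such a function follows. The name n_one_prop and its argument names are made up for illustration; the body is the formula above, with 1.96 as the default z value for a 95% confidence level.

```r
# Sample size for estimating one proportion p within margin of error m
n_one_prop <- function(p, m, z = 1.96) {
  (z / m)^2 * p * (1 - p)
}

n_2pct <- n_one_prop(p = 0.7, m = 0.02)   # margin of error 2%
n_1pct <- n_one_prop(p = 0.7, m = 0.01)   # margin of error 1%

print(c(n_2pct, n_1pct))
print(ceiling(n_2pct))   # round up to whole subjects
```

Halving the margin of error multiplies the required sample size by four, because m enters the formula squared.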
13.4.5 Sample size for comparing two proportions
Many inferential studies have two [or more than two] groups to compare. In section 15.4.2 we became familiar with estimating the sample size for comparing two means with the t test. Those are studies whose outcomes are continuous variables. But there are also studies whose outcome is not continuous but binary, as I have just discussed in section 15.4.3. To compare two proportions, the most commonly used tests are the binomial test and the Chi-squared test (χ² test). In this section I discuss how to calculate the sample size for these two statistical tests.
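Let the two proportions [which we do not know but want to investigate] be p1 and p2. The number of subjects needed in each group is

$$ n = \frac{\left( z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_\beta\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^2}{(p_1 - p_2)^2} $$

where $\bar{p} = (p_1 + p_2)/2$, $z_{\alpha/2}$ is the z value of the normal distribution for probability α/2 (for example, when α = 0.05, $z_{\alpha/2}$ = 1.96; when α = 0.01, $z_{\alpha/2}$ = 2.57), and $z_\beta$ is the z value of the normal distribution for probability β (for example, when β = 0.10, $z_\beta$ = 1.28; when β = 0.20, $z_\beta$ = 0.84).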
Example 25: A randomized controlled clinical trial is designed to evaluate the efficacy of a drug for preventing vertebral fractures. Two groups of patients will be recruited. Group 1 is treated with the drug, and group 2 is the control group (untreated). The researchers hypothesize that the fracture rate in group 2 is about 10%, and that the drug can reduce this rate to about 6%. If the researchers want to test this hypothesis with a type I error of α = 0.01 and power = 0.90, how many patients need to be recruited for the study?
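$$ n = \frac{\left( 2.57\sqrt{2 \times 0.08 \times 0.92} + 1.28\sqrt{0.10 \times 0.90 + 0.06 \times 0.94} \right)^2}{(0.04)^2} = 1361 $$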
The study therefore needs to recruit at least 2722 patients (1361 in each of the two groups) to test the hypothesis above.
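The hand calculation can be reproduced with a short function. The name n_two_prop is hypothetical; the z values default to the rounded figures used in the example (2.57 and 1.28).

```r
# Sample size per group for comparing two proportions
n_two_prop <- function(p1, p2, z_alpha = 2.57, z_beta = 1.28) {
  p_bar <- (p1 + p2) / 2
  num <- (z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
          z_beta  * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2
  num / (p1 - p2)^2
}

# Example 25 with the rounded z values used in the text
n_rounded <- n_two_prop(p1 = 0.10, p2 = 0.06)

# Same calculation with unrounded normal quantiles (alpha = 0.01, beta = 0.10)
n_exact <- n_two_prop(p1 = 0.10, p2 = 0.06,
                      z_alpha = qnorm(1 - 0.01 / 2),
                      z_beta  = qnorm(1 - 0.10))

print(c(rounded_z = n_rounded, exact_z = n_exact))
```

With the rounded z values the result truncates to 1361 per group, as in the text. With unrounded quantiles it is about 1366, so small differences from software output can come from rounding of the z values alone.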
The R function power.prop.test can be used to calculate the sample size for this case. power.prop.test needs the following information: power, sig.level, p1, and p2. For the example above, we can write:
> power.prop.test(p1=0.10, p2=0.06, power=0.90, sig.level=0.01)

     Two-sample comparison of proportions power calculation

              n = 1366.430
             p1 = 0.1
             p2 = 0.06
      sig.level = 0.01
          power = 0.9
    alternative = two.sided
Linear Models with R (Chapman & Hall/CRC, 2004) by Julian Faraway. The book can currently be downloaded free of charge from the following websites: http://www.stat.lsa.umich.edu/~faraway/book/pra.pdf or http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf. The document is 213 pages long.

R Graphics (Computer Science and Data Analysis) (Chapman & Hall/CRC, 2005) by Paul Murrell. This book is devoted to graphical analysis with R. It contains a great deal of code that readers can use to design complex and colorful graphs of their own.

Modern Applied Statistics with S-Plus (Springer, 4th Edition, 2003) by W. N. Venables and B. D. Ripley was written for the S-Plus language, but all the commands and code in the book can be used in R without modification. (S-Plus is the predecessor of R, but S-Plus is commercial software, whereas R is completely free!) This can be regarded as the reference book for anyone who wants to go further with R. The two authors are also authoritative experts on the R language. The book is intended for readers with an advanced background in computing and statistics.

Finally, there is a guide to using R for data analysis and graphics (about 50 pages, updated regularly) that I wrote myself in Vietnamese. The website www.R.ykhoa.net is essentially a summary of the main chapters of this book. The site also holds all the datasets and code used in the book, which readers can download to their own computers.
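English                                   Tiếng Việt
95% confidence interval                   Khoảng tin cậy 95%
Akaike Information Criterion (AIC)        Tiêu chuẩn thông tin Akaike
Analysis of covariance                    Phân tích hiệp biến
Analysis of variance                      Phân tích phương sai
Bar chart                                 Biểu đồ thanh
Binomial distribution                     Phân phối nhị phân
Box plot                                  Biểu đồ hình hộp
Categorical variable                      Biến thứ bậc
Clock chart                               Biểu đồ đồng hồ
Coefficient of correlation                Hệ số tương quan
Coefficient of determination              Hệ số xác định bội
Coefficient of heterogeneity              Hệ số bất đồng nhất
Combination                               Tổ hợp
Continuous variable                       Biến liên tục
Correlation                               Tương quan
Covariance                                Hợp biến
Cross-over experiment                     Thí nghiệm giao chéo
Cumulative probability distribution       Hàm phân phối tích lũy
Degrees of freedom                        Bậc tự do
Determinant                               Định thức
Discrete variable                         Biến rời rạc
Dot chart                                 Biểu đồ điểm
Estimate                                  Ước số
Estimator                                 Hàm ước lượng thống kê
Factorial analysis of variance            Phân tích phương sai cho thí nghiệm giai thừa
Fixed effects                             Ảnh hưởng bất biến
Frequency                                 Tần số
Function                                  Hàm
Heterogeneity                             Bất đồng nhất
Histogram                                 Biểu đồ tần số
Homogeneity                               Đồng nhất
Hypothesis test                           Kiểm định giả thiết
Inverse matrix                            Ma trận nghịch đảo
Latin square experiment                   Thí nghiệm hình vuông Latin
Least squares method                      Phương pháp bình phương nhỏ nhất
Logistic linear regression analysis       Phân tích hồi qui tuyến tính logistic
Linear regression analysis                Phân tích hồi qui tuyến tính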
Matrix                                    Ma trận
Maximum likelihood method                 Phương pháp hợp lí cực đại
Mean                                      Số trung bình
Median                                    Số trung vị
Meta-analysis                             Phân tích tổng hợp
Missing value                             Giá trị không
Model                                     Mô hình
Multiple linear regression analysis       Phân tích hồi qui tuyến tính đa biến
Normal distribution                       Phân phối chuẩn
Object                                    Đối tượng
Parameter                                 Thông số
Permutation                               Hoán vị
Pie chart                                 Biểu đồ hình tròn
Poisson distribution                      Phân phối Poisson
Polynomial regression                     Hồi qui đa thức
Probability                               Xác suất
Probability density distribution          Hàm mật độ xác suất
P-value                                   Trị số P
Quantile                                  Hàm định bậc
Random effects                            Ảnh hưởng ngẫu nhiên
Random variable                           Biến ngẫu nhiên
Relative risk                             Tỉ số nguy cơ tương đối
Repeated measure experiment               Thí nghiệm tái đo lường
Residual                                  Phần dư
Residual mean square                      Trung bình bình phương phần dư
Residual sum of squares                   Tổng bình phương phần dư
Scalar matrix                             Ma trận vô hướng
Scatter plot                              Biểu đồ tán xạ
Significance                              Có ý nghĩa thống kê
Simulation                                Mô phỏng
Standard deviation                        Độ lệch chuẩn
Standard error                            Sai số chuẩn
Standardized normal distribution          Phân phối chuẩn chuẩn hóa
Survival analysis                         Phân tích biến cố
Transposed matrix                         Ma trận chuyển vị
Variable                                  Biến (biến số)
Variance                                  Phương sai
Weight                                    Trọng số
Weighted mean                             Trung bình trọng số