Data Analysis and Graphics with the R Language (Phân tích số liệu và tạo biểu đồ bằng ngôn ngữ R)
Contents
1  Preface
3  Data entry
   3.1  Entering data directly: c()
   3.2  Entering data directly: edit(data.frame())
   3.3  Reading data from a text file: read.table()
   3.4  Reading data from Excel: read.csv
   3.5  Reading data from SPSS: read.spss
   3.6  Getting basic information about the data
4  Data editing
   4.1  Checking for missing values: na.omit()
   4.2  Splitting data: subset
   4.3  Extracting data from a data.frame
   4.4  Merging two data.frames into one: merge
   4.5  Data coding
        4.5.1  Coding with the replace function
        4.5.2  Converting a continuous variable into a discrete variable
   4.6  Dividing a continuous variable into groups: cut
   4.7  Grouping data with cut2 (Hmisc)
CHAPTER I
PREFACE
1 Preface
Contrary to what many people think, statistics is a science in its own right: Statistical Science. Its analytical methods rest on mathematics and probability, but that is only the technical part; the more important part is the design of studies and the interpretation of what the data mean. A statistician is therefore not merely someone who analyzes data, but must be a scientist, a thinker about scientific research. For that reason, statistical science plays an extremely important, indeed indispensable, role in scientific research, especially in the experimental sciences. It is fair to say that today, without statistics, genetic experiments with millions upon millions of data points would be nothing but lifeless, meaningless numbers.
A piece of scientific research, however costly or important, has no scientific meaning if it is not analyzed with the correct methods. That is why, looking through research journals around the world today, almost every medical paper has a "Statistical Analysis" section, in which the authors must carefully describe their methods of analysis and computation and briefly explain why those methods were used, so as to back up, or add scientific weight to, the claims made in the paper. The more prestigious the medical journal, the heavier its requirements for statistical analysis. To repeat for emphasis: without a statistical analysis section, a paper has no scientific meaning.
One of the most important developments in statistical science is the use of computers for statistical analysis and computation. It is no exaggeration to say that without computers, statistical science would still be a dry and dreary discipline, full of complicated formulas with little practical application. Computers have helped statistical science carry out the greatest revolution in its history: bringing statistics into practice, solving the thorniest problems, and contributing to the development of the experimental sciences.
The author still remembers how, more than 20 years ago, as a student in a master's program in statistics in Australia, a respected professor told a story about the renowned American statistician Fred Mosteller, who received a research contract from the US Department of Defense to improve the accuracy of American weapons during the Second World War. In that work he had to solve a statistical problem involving about 30 parameters. He had to hire 20 graduate students for the job: 10 students spent the whole day calculating by hand, while the other 10 checked the calculations of the first 10. The work lasted nearly a month. Today, with a modest personal computer, that statistical analysis can be solved in about 1 second.
... in English for the reader's reference. In addition, at the end of the book, I have listed the English-Vietnamese terms mentioned in the book.
All the data used in this book can be downloaded from the internet to a personal computer, or accessed directly at the web page: http://www.ykhoa.net/R.
I hope the reader will find in this book some useful information, and some techniques or calculations that help with their own study, teaching and research. But there is probably no book that is perfect or free of shortcomings; so if the reader finds an error in the book, please let me know by e-mail at t.nguyen@garvan.org.au or rknguyen@gmail.com. My sincere thanks to readers in advance.
I would like to take this opportunity to thank Dr. Nguyễn Hoàng Dzũng of the Faculty of Chemistry, Đại học Bách khoa Thành phố Hồ Chí Minh (Ho Chi Minh City University of Technology), who suggested the idea and helped me print this book in Vietnam. I thank Dr. Nguyễn Đình Nguyên, who read a large part of the manuscript, offered many practical comments, and designed the book cover. I also thank the publishing house of Đại học Bách khoa Thành phố Hồ Chí Minh for helping me print this book.
Now, I invite the reader to join me on a short "statistical journey" with R.
CHAPTER II
2 Introduction to the R language
2.1 What is R?
In short, R is a piece of software for statistical analysis and graphics. In essence, though, R is a general-purpose computer language that can be used for many different purposes, from simple calculations and recreational mathematics to matrix computation and complex statistical analyses. Because it is a language, R can also be used to develop specialized software for a particular computational problem.
The two creators of R are the statisticians Ross Ihaka and Robert Gentleman. Since R was born, many statisticians and mathematicians around the world have supported it and taken part in its development. Its creators chose an open approach (Open Access), and partly for that reason R is completely free. Anyone, anywhere in the world, can access and download the entire R source code to their own computer and use it. So far, after less than 5 years of development, more and more statisticians, mathematicians and researchers in every field have been switching to R for analyzing scientific data. Worldwide there is a network of nearly one million R users, and the number is growing exponentially. It may well be that within 10 years we will no longer need expensive statistical software such as SAS, SPSS or Stata (these packages are very costly, up to 100,000 USD a year) for statistical analysis, because all such analyses can be carried out in R.
Therefore, anyone doing scientific research, especially in countries that are still poor like ours, should learn to use R for statistical analysis and graphics. This short text will guide the reader in using R. I assume that the reader knows nothing about R, but I do expect the reader to have some familiarity with using a computer.
R-2.2.1-win32.zip
This file is about 26 MB, and the exact download address is:
http://cran.r-project.org/bin/windows/base/R-2.2.1-win32.exe
On this website we can find a great deal of documentation on using R, at every level from beginner to advanced. For readers not yet comfortable with English, this text can supply the information needed to use R without having to read the other documents.
Once R has been downloaded, the next step is to install (set up) it on the computer. To do this, we simply click on the downloaded file and follow the installation instructions on the screen. This is a very simple step; the installation can be completed in about 1 minute.
Function
Drawing graphs and making them look better
Drawing graphs and making them look better
Some data modelling methods by F. Harrell
Some study design models by F. Harrell
For epidemiological analyses
Another package specialized for epidemiological analyses
For importing data from other software such as SPSS, Stata, SAS, etc.
For meta-analysis
Another package for meta-analysis
Specialized for analysis with the Cox model (Cox's proportional hazards model)
splines
Zelig
genetics
BMA
leaps
But if we type
> R is great
R will not "agree" with this command, because this language is not in R's library, and the following message appears:
Error: syntax error
>
To leave R, we can simply click the close (x) button in the top-right corner of the window, or type the command q().
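NULL
x == 5      x equals 5
x != 5      x is not equal to 5
y < x       y is less than x
x > y       x is greater than y
z <= 7      z is less than or equal to 7
p >= 1      p is greater than or equal to 1
is.na(x)    is x a missing value?
A & B       A and B (AND)
A | B       A or B (OR)
!           is not (NOT)
A few points to keep in mind when naming objects in R: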
2.7 Getting help in R
Besides the command args(), R also provides the command help() so that users can understand the "grammar" of each function. For example, to find out which arguments the function lm takes, we simply type:
> help(lm)
or
> ?lm
A window will appear on the right of the screen showing how the function is used, even with examples. Readers can simply copy and paste the examples into R to see how they work.
Before using R, besides this book, readers can if necessary read the guide built into R by choosing the help menu and then Html help, as in the figure below
And R will report the functions containing the characters lm that are available in R, as follows:
 [1] ".__C__anova.glm"      ".__C__anova.glm.null" ".__C__glm"
 [4] ".__C__glm.null"       ".__C__lm"             ".__C__mlm"
 [7] "anova.glm"            "anova.glmlist"        "anova.lm"
[10] "anova.lmlist"         "anova.mlm"            "anovalist.lm"
[13] "contr.helmert"        "glm"                  "glm.control"
[16] "glm.fit"              "glm.fit.null"         "hatvalues.lm"
[19] "KalmanForecast"       "KalmanLike"           "KalmanRun"
[22] "KalmanSmooth"         "lm"                   "lm.fit"
[25] "lm.fit.null"          "lm.influence"         "lm.wfit"
[28] "lm.wfit.null"         "model.frame.glm"      "model.frame.lm"
[31] "model.matrix.lm"      "nlm"                  "nlminb"
[34] "plot.lm"              "plot.mlm"             "predict.glm"
[37] "predict.lm"           "predict.mlm"          "print.glm"
[40] "print.lm"             "residuals.glm"        "residuals.lm"
[43] "rstandard.glm"        "rstandard.lm"         "rstudent.glm"
[46] "rstudent.lm"          "summary.glm"          "summary.lm"
[49] "summary.mlm"          "kappa.lm"
CHAPTER III
DATA ENTRY
3 Data entry
To analyze data with R, the data must be available in a form that R can understand and process. For the analyses in this book, that means data held in a data.frame. There are many ways to get data into a data.frame in R, from typing them in directly to importing them from other sources. The most common are described below:
In this command, we tell R to put the two columns (or two objects) age and insulin into an object named tuan.
R will report:
   age insulin
1   50    16.5
2   62    10.8
3   60    32.3
4   40    19.3
5   48    14.2
6   47    11.3
7   57    15.5
8   70    15.8
9   48    16.2
10  67    11.2
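The direct-entry route with c() and data.frame() can be sketched as follows. This is a minimal illustration that reuses the age and insulin values from the listing in this section; the exact commands typed in the original are an assumption.

```r
# Values copied from the age/insulin listing in this section
age     <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)
insulin <- c(16.5, 10.8, 32.3, 19.3, 14.2, 11.3, 15.5, 15.8, 16.2, 11.2)

# Combine the two vectors into one data.frame with two columns
tuan <- data.frame(age, insulin)
dim(tuan)    # number of rows and columns
```

Typing tuan at the prompt then prints the ten rows with their row numbers.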
id  sex  age  bmi  hdl    ldl  tc   tg
 1  Nam   57   17  5.000  2.0  4.0  1.1
 2  Nu    64   18  4.380  3.0  3.5  2.1
 3  Nu    60   18  3.360  3.0  4.7  0.8
 4  Nam   65   18  5.920  4.0  7.7  1.1
 5  Nam   47   18  6.250  2.1  5.0  2.1
 6  Nu    65   18  4.150  3.0  4.2  1.5
 7  Nam   76   19  0.737  3.0  5.9  2.6
 8  Nam   61   19  7.170  3.0  6.1  1.5
 9  Nam   59   19  6.942  3.0  5.9  5.4
10  Nu    57   19  5.000  2.0  4.0  1.9
11  Nu    63   20  4.217  5.0  6.2  1.7
12  Nam   51   20  4.823  1.3  4.1  1.0
13  Nu    60   20  3.750  1.2  3.0  1.6
14  Nam   42   20  1.904  0.7  4.0  1.1
15  Nam   64   20  6.900  4.0  6.9  1.5
16  Nu    49   20  0.633  4.1  5.7  1.0
17  Nu    44   21  5.530  4.3  5.7  2.7
18  Nu    45   21  6.625  4.0  5.3  3.9
19  Nu    80   21  5.960  4.3  7.1  3.0
20  Nu    48   21  3.800  4.0  3.8  3.1
21  Nu    61   21  5.375  3.1  4.3  2.2
22  Nu    45   21  3.360  3.0  4.8  2.7
23  Nu    70   21  5.000  1.7  4.0  1.1
24  Nu    51   21  2.608  2.0  3.0  0.7
25  Nam   63   22  4.130  2.1  3.1  1.0
26  Nam   54   22  5.000  4.0  5.3  1.7
27  Nu    57   22  6.235  4.1  5.3  2.9
28  Nam   70   22  3.600  4.0  5.4  2.5
29  Nu    47   22  5.625  4.2  4.5  6.2
30  Nu    60   22  5.360  4.2  5.9  1.3
31  Nu    60   22  6.580  4.4  5.6  3.3
32  Nam   50   22  7.545  4.3  8.3  3.0
33  Nam   60   22  6.440  2.3  5.8  1.0
34  Nu    55   22  6.170  6.0  7.6  1.4
35  Nu    74   23  5.270  3.0  5.8  2.5
36  Nam   48   23  3.220  3.0  3.1  0.7
37  Nu    46   23  5.400  2.6  5.4  2.4
38  Nam   49   23  6.300  4.4  6.3  2.4
39  Nu    69   23  9.110  4.3  8.2  1.4
40  Nu    72   23  7.750  4.0  6.2  2.7
41  Nam   51   23  6.200  3.0  6.2  2.4
42  Nu    58   23  7.050  4.1  6.7  3.3
43  nam   60   24  6.300  4.4  6.3  2.0
44  Nam   45   24  5.450  2.8  6.0  2.6
45  Nam   63   24  5.000  3.0  4.0  1.8
46  Nu    52   24  3.360  2.0  3.7  1.2
47  Nam   64   24  7.170  1.0  6.1  1.9
48  Nam   45   24  7.880  4.0  6.7  3.3
49  Nu    64   25  7.360  4.6  8.1  4.0
50  Nu    62   25  7.750  4.0  6.2  2.5
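We want to read these data into R for later analysis. We will use the read.table command as follows:
> setwd("c:/works/stats")
> chol <- read.table("chol.txt", header=TRUE)
> names(chol)
R will show which columns are in the data (names is the command that asks which columns the data contain and what they are called):
[1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"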
    Age Sex Ethnicity   IGFI IGFBP3    ALS   PINP ICTP P3NP
 1   18   1         1 148.27   5.14 316.00  61.84 5.81 4.21
 2   28   1         1 114.50   5.23 296.42  98.64 4.96 5.33
 3   20   1         1 109.82   4.33 269.82  93.26 7.74 4.56
 4   21   1         1 112.13   4.38 247.96 101.59 6.66 4.61
 5   28   1         1 102.86   4.04 240.04  58.77 4.62 4.95
 6   23   1         4 129.59   4.16 266.95  48.93 5.32 3.82
 7   20   1         1 142.50   3.85 300.86 135.62 8.78 6.75
 8   20   1         1 118.69   3.44 277.46  79.51 7.19 5.11
 9   20   1         1 197.69   4.12 335.23  57.25 6.21 4.44
10   20   1         1 163.69   3.96 306.83  74.03 4.95 4.84
11   22   1         1 144.81   3.63 295.46  68.26 4.54 3.70
12   27   0         2 141.60   3.48 231.20  56.78 4.47 4.07
13   26   1         1 161.80   4.10 244.80  75.75 6.27 5.26
14   33   1         1  89.20   2.82 177.20  48.57 3.58 3.68
15   34   1         3 161.80   3.80 243.60  50.68 3.52 3.35
16   32   1         1 148.50   3.72 234.80  83.98 4.85 3.80
17   28   1         1 157.70   3.98 224.80  60.42 4.89 4.09
18   18   0         2 222.90   3.98 281.40  74.17 6.43 5.84
19   26   0         2 186.70   4.64 340.80  38.05 5.12 5.77
20   27   1         2 167.56   3.56 321.12  30.18 4.78 6.12
Tell R that we want to work with chol by using the command attach(arg), where arg is the name of the data set.
> attach(chol)
We can check whether chol is a data.frame with the command is.data.frame(arg), where arg is the name of the data set. For example:
> is.data.frame(chol)
[1] TRUE
How many columns (variables) and rows (observations) does this data set have? We use the command dim(arg), where arg is the name of the data set (dim is short for dimension). For example (R's result is shown right after the command we type):
> dim(chol)
[1] 50  8
> names(chol)
[1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"
> table(sex)
sex
nam Nam  Nu
  1  21  28
CHAPTER IV
DATA EDITING
4 Data editing
Editing data here does not mean changing the original data (that would be a serious offence, an unacceptable form of scientific fraud); it only means organizing the data so that R can analyze them efficiently. In statistical analysis we often need to gather data into one group, split them into separate groups, or convert characters into numbers (numeric) to make calculation easier. In this chapter, I will go over some basic commands for editing data.
We return to the chol data from example 1. To make the "story" easier to follow, recall that we read the data into an R data set named chol from a text file named chol.txt:
> setwd("c:/works/stats")
> chol <- read.table("chol.txt", header=TRUE)
> attach(chol)
After these two commands, we have 2 new data sets (two data.frames) named nam and nu. Note that in the conditions sex == "Nam" and sex == "Nu" we use == rather than = to state an exact-match condition.
Of course, we can also split the data into several different data.frames with conditions based on other variables. For example, the following command creates a new data.frame named old containing the patients aged 60 or older:
> old <- subset(chol, age>=60)
> dim(old)
[1] 25  8
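The splitting described here can be sketched with subset() on a small stand-in data frame. The five rows below are hypothetical (they are not the chol file), and the object names nam, nu and old simply follow the text.

```r
# Hypothetical five-row stand-in for the chol data (not the book's file)
demo <- data.frame(id  = 1:5,
                   sex = c("Nam", "Nu", "Nu", "Nam", "Nam"),
                   age = c(57, 64, 60, 65, 47))

nam <- subset(demo, sex == "Nam")   # rows where sex is exactly "Nam"
nu  <- subset(demo, sex == "Nu")    # rows where sex is exactly "Nu"
old <- subset(demo, age >= 60)      # rows aged 60 or older
dim(old)                            # rows and columns kept in old
```

Because == tests equality, a value typed as "nam" in lower case would fall into neither nam nor nu, which is one reason to check table(sex) first.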
   id sex  tc
1   1 Nam 4.0
2   2  Nu 3.5
3   3  Nu 4.7
4   4 Nam 7.7
5   5 Nam 5.0
6   6  Nu 4.2
7   7 Nam 5.9
8   8 Nam 6.1
9   9 Nam 5.9
10 10  Nu 4.0
Note that the command print(arg) simply lists all the data in the data.frame arg. In fact, we only need to type data3, and the result is exactly the same as print(data3).
sex   tg
Nam  1.1
Nu   2.1
Nu   0.8
Nam  1.1
Nam  2.1
Nu   1.5
Nam  2.6
Nam  1.5
Nam  5.4
Nu   1.9
Nu   1.7

   id sex.x  tc sex.y  tg
1   1   Nam 4.0   Nam 1.1
2   2    Nu 3.5    Nu 2.1
3   3    Nu 4.7    Nu 0.8
4   4   Nam 7.7   Nam 1.1
5   5   Nam 5.0   Nam 2.1
6   6    Nu 4.2    Nu 1.5
7   7   Nam 5.9   Nam 2.6
8   8   Nam 6.1   Nam 1.5
9   9   Nam 5.9   Nam 5.4
10 10    Nu 4.0    Nu 1.9
11 11  <NA>  NA    Nu 1.7
To classify the three groups (osteoporosis, osteopenia and normal), we can use the codes 1, 2 and 3. In other words, we want to create another variable (call it diagnosis) that takes these 3 values according to the value of bmd. To do this, we use the commands:
# temporarily set the variable diagnosis equal to bmd
> diagnosis <- bmd
# recode bmd into diagnosis
> diagnosis <- replace(diagnosis, bmd <= -2.5, 1)
> diagnosis <- replace(diagnosis, bmd > -2.5 & bmd <= -1.0, 2)
> diagnosis <- replace(diagnosis, bmd > -1.0, 3)
> data <- data.frame(bmd, diagnosis)
> data
     bmd diagnosis
1  -0.92         3
2   0.21         3
3   0.17         3
4  -3.21         1
5  -1.80         2
6  -2.60         1
7  -2.00         2
8   1.71         3
9   2.12         3
10 -2.11         2
> mean(diagnosis)
[1] 2.3
but this result of 2.3 has no meaning in practice.
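The same three-way coding can also be obtained in one step with cut(), which returns a factor, so R will not silently average the codes. This is a sketch of an alternative, not the method used in the text; the bmd values are the ten shown in the example, and the thresholds -2.5 and -1.0 are the ones used in the replace commands.

```r
# bmd values from the example in this section
bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)

# Right-closed intervals: (-Inf, -2.5], (-2.5, -1.0], (-1.0, Inf)
diagnosis <- cut(bmd, breaks = c(-Inf, -2.5, -1.0, Inf), labels = c(1, 2, 3))
table(diagnosis)    # number of subjects per diagnostic group
```

Calling mean() on this factor gives NA with a warning, which is a useful reminder that the codes are labels rather than measurements.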
... the lowest age is 8 and the highest is 51. If we want to divide age into 2 groups:
> cut(age, 2)
[1] (7.96,29.5] (7.96,29.5] (7.96,29.5] (29.5,51]
(7.96,29.5] (7.96,29.5]
[9] (7.96,29.5]
(29.5,51]
(29.5,51]
(7.96,29.5]
(7.96,29.5]
(7.96,29.5] (7.96,29.5]
(7.96,29.5]
(29.5,51]
cut divides the variable age into 2 groups: group 1 from 7.96 to 29.5 years, and group 2 from 29.5 to 51 years. We can count the number of subjects in each age group with the table function as follows:
> table(cut(age, 2))

(7.96,29.5]   (29.5,51]
         11           4
low
high
Of course, we can also divide age into 4 groups (quartiles) by supplying the probabilities 0, 0.25, 0.50, 0.75 and 1 as follows:
cut(age,
    breaks=quantile(age, c(0, 0.25, 0.50, 0.75, 1)),
    labels=c("q1", "q2", "q3", "q4"),
    include.lowest=TRUE)
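A runnable sketch of both ways of grouping is given below. The age vector is hypothetical: it was chosen only so that it has 15 values with a minimum of 8 and a maximum of 51, like the example in the text, and the names two_groups and quartile are illustrative.

```r
# Hypothetical ages (15 values, minimum 8, maximum 51)
age <- c(17, 19, 22, 43, 14, 8, 12, 19, 20, 51, 8, 12, 27, 31, 44)

# Two groups of equal width
two_groups <- cut(age, 2)
table(two_groups)

# Four groups delimited by the quartiles of age
quartile <- cut(age,
                breaks = quantile(age, c(0, 0.25, 0.50, 0.75, 1)),
                labels = c("q1", "q2", "q3", "q4"),
                include.lowest = TRUE)   # keep the minimum in group q1
table(quartile)
```

Without include.lowest = TRUE the smallest age would fall outside the first interval and be coded NA.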
CHAPTER V
5 Using R for simple calculations and matrices
One advantage of R is that it can be used like a hand-held calculator. Beyond that, R can be used for matrix calculations and programming. In this chapter I present only some simple calculations that school or university students can use immediately while reading these lines.
Addition and subtraction:
> 15+2997
[1] 3012
> 15+2997-9768
[1] -6756
Multiplication and division:
> -27*12/21
[1] -15.42857
Square root of 10:
> sqrt(10)
[1] 3.162278
The number pi (π):
> pi
[1] 3.141593
> 2+3*pi
[1] 11.42478
Natural logarithm (log_e):
> log(10)
[1] 2.302585
Base-10 logarithm (log10):
> log10(100)
[1] 2
> log10(2+3*pi)
[1] 1.057848
Exponential, e^2.7689:
> exp(2.7689)
[1] 15.94109
Trigonometric functions:
> cos(pi)
[1] -1
Vector
> x <- c(2,3,1,5,4,6,7,6,8)
> x
[1] 2 3 1 5 4 6 7 6 8
> sum(x)
[1] 42
> x*2
[1]  4  6  2 10  8 12 14 12 16
> exp(x/10)
[1] 1.221403 1.349859 1.105171 1.648721 1.491825 1.822119 2.013753 1.822119
[9] 2.225541
> exp(cos(x/10))
[1] 2.664634 2.599545 2.704736 2.405079 2.511954 2.282647 2.148655 2.282647
[9] 2.007132
Sum of squares: $1^2 + 2^2 + 3^2 + 4^2 + 5^2 = ?$
Adjusted sum of squares: $\sum_{i=1}^{n} (x_i - \bar{x})^2 = ?$
Adjusted sum of squares divided by n: $\sum_{i=1}^{n} (x_i - \bar{x})^2 / n = ?$
Variance: $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1) = ?$
Standard deviation, $\sqrt{s^2}$:
> sd(x)
[1] 1.581139
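The quantities asked for above can be computed step by step as in the sketch below. It assumes that x holds the numbers 1 to 5, which is consistent with the standard deviation of 1.581139 shown; the variable names are illustrative.

```r
x <- 1:5                             # assumed data: 1, 2, 3, 4, 5

ss     <- sum(x^2)                   # sum of squares: 1^2 + ... + 5^2
adj_ss <- sum((x - mean(x))^2)       # adjusted sum of squares
ms     <- adj_ss / length(x)         # adjusted sum of squares divided by n
v      <- adj_ss / (length(x) - 1)   # variance, the same value as var(x)
s      <- sqrt(v)                    # standard deviation, the same value as sd(x)
```

The built-in var() and sd() use the n - 1 denominator, so they agree with v and s rather than with ms.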
Note that we entered the two dates with the day, month and year in different orders, but we also told R exactly how to read them using %d (day), %m (month) and %y (year). We can compute the number of days between the two time points:
 [1] "2005-01-01" "2005-01-15" "2005-01-29" "2005-02-12" "2005-02-26"
 [6] "2005-03-12" "2005-03-26" "2005-04-09" "2005-04-23" "2005-05-07"
[11] "2005-05-21" "2005-06-04" "2005-06-18" "2005-07-02" "2005-07-16"
[16] "2005-07-30" "2005-08-13" "2005-08-27" "2005-09-10" "2005-09-24"
[21] "2005-10-08" "2005-10-22" "2005-11-05" "2005-11-19" "2005-12-03"
[26] "2005-12-17" "2005-12-31"
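Reading dates written in different orders and counting the days between them can be sketched as follows. The two dates are hypothetical, and the last command is only one assumed way of producing a two-weekly sequence like the 2005 listing.

```r
# Two hypothetical dates typed in different orders
d1 <- as.Date("01/02/06", format = "%d/%m/%y")   # day/month/year: 1 February 2006
d2 <- as.Date("08/21/06", format = "%m/%d/%y")   # month/day/year: 21 August 2006

days <- d2 - d1          # a difftime object measured in days
as.numeric(days)         # the plain number of days

# A sequence of dates every 2 weeks through 2005 (assumed command)
fortnights <- seq(as.Date("2005-01-01"), as.Date("2005-12-31"), by = "2 weeks")
```

Note that %y reads a two-digit year, while %Y would be needed for a four-digit year.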
Create a vector of the numbers from 1 to 12:
[1]  1  2  3  4  5  6  7  8  9 10 11 12
[1]  1  2  3  4  5  6  7  8  9 10 11 12
Create a vector of the numbers from 12 down to 7:
> seq(12,7)
[1] 12 11 10  9  8  7
The general form of the seq function is seq(from, to, by= ) or seq(from, to, length.out= ). Its use is illustrated by a few examples below:
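The two forms of seq can be sketched as follows; the endpoints and step sizes are arbitrary choices for illustration.

```r
a <- seq(4, 6, by = 0.25)           # fixed step of 0.25 from 4 to 6
b <- seq(2, 15, length.out = 10)    # 10 equally spaced values from 2 to 15
d <- seq(12, 7)                     # counts down by 1 when from > to
```

With by the number of values follows from the step, whereas with length.out the step follows from the number of values requested.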
Using rep
Create the number 10, 3 times:
> rep(10, 3)
[1] 10 10 10
Create the numbers 1 to 4, 3 times:
> rep(c(1:4), 3)
[1] 1 2 3 4 1 2 3 4 1 2 3 4
Using gl
gl is used to create a categorical variable (a factor), that is, a variable used for counting rather than for calculation. The general form of the gl function is gl(n, k, length = n*k, labels = 1:n, ordered = FALSE), and its use is illustrated by a few examples below:
Create a variable with levels 1 and 2, each level repeated 8 times:
> gl(2, 8)
[1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
Levels: 1 2
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2
Or:
> gl(2, 2, length=20)
[1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
Levels: 1 2
Create a variable with the 4 levels 1, 2, 3, 4, each level repeated 2 times.
With dates and times:
$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$
And with R:
> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
then the result will be:
[1,]
[2,]
[3,]
A scalar matrix is a square matrix (the number of rows equals the number of columns) in which all off-diagonal elements are 0 and the diagonal elements all share one value; here that value is 1, which gives the identity matrix. We can create such a matrix in R as follows:
> # create a 3 x 3 matrix with all elements equal to 0
> A <- matrix(0, 3, 3)
> # set the diagonal elements to 1
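One way to carry out the step of setting the diagonal to 1 is the replacement form of diag(), sketched below; the name I3 is illustrative, and whether the original used exactly this command is an assumption.

```r
A <- matrix(0, 3, 3)   # 3 x 3 matrix with every element equal to 0
diag(A) <- 1           # overwrite the main diagonal with 1

I3 <- diag(3)          # the same 3 x 3 identity matrix in a single call
```

diag(5, 3) would instead give a scalar matrix with 5 on the diagonal.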
Or A-B:
> D <- A-B
> D
     [,1] [,2] [,3] [,4]
[1,]    2    8   14   20
[2,]    4   10   16   22
[3,]    6   12   18   24
$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$
We want to compute AB, which can be done in R with the %*% operator as follows:
> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> B <- t(A)
> AB <- A%*%B
> AB
     [,1] [,2] [,3]
[1,]   66   78   90
[2,]   78   93  108
[3,]   90  108  126
Or to compute BA, again with the %*% operator:
> BA <- B%*%A
> BA
     [,1] [,2] [,3]
[1,]   14   32   50
[2,]   32   77  122
[3,]   50  122  194
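$$3x_1 + 4x_2 = 4$$
$$x_1 + 6x_2 = 2$$
This system of equations can be written in matrix notation as AX = Y, where:
$$A = \begin{pmatrix} 3 & 4 \\ 1 & 6 \end{pmatrix}, \quad X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad Y = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$$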
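The system AX = Y above can be solved numerically with solve(), as in this sketch; the variable names follow the matrix notation, and X_inv is an illustrative name.

```r
# matrix() fills column by column, so the rows of A are (3, 4) and (1, 6)
A <- matrix(c(3, 1, 4, 6), nrow = 2)
Y <- c(4, 2)

X <- solve(A, Y)              # solves A %*% X = Y for X
X_inv <- solve(A) %*% Y       # same answer through the inverse of A
```

Here the determinant of A is 3*6 - 4*1 = 14, so the solution is x1 = 16/14 and x2 = 2/14.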
           [,1]       [,2]
[1,] -0.7071068 -0.9701425
[2,] -0.7071068  0.2425356
          [,1]      [,2]       [,3]
[1,]  1.291667 -2.166667  0.9305556
[2,] -1.166667  1.666667 -0.6111111
[3,]  0.375000 -0.500000  0.1805556
Besides these simple calculations, R can also be used for other, more complex computations. A notable advantage of R is that it gives users the freedom to create calculations suited to each specific problem. In later chapters, I will return to this topic in more detail.
R has a package, Matrix, designed specifically for matrix computation. Readers can download the package, install it, and use it if needed. The download address is:
http://cran.au.r-project.org/bin/windows/contrib/r-release/Matrix_0.995-8.zip
together with the documentation on how to use it (about 80 pages long):
http://cran.au.r-project.org/doc/packages/Matrix.pdf
CHAPTER VI
6 Probability calculations and simulation
Probability is the foundation of statistical analysis. All methods of data analysis and statistical inference rest on probability theory. Probability theory is concerned with describing and expressing the distribution law of a random variable. In practice, "describing" here also simply means counting the cases or possibilities that can occur for one or more variables. For example, when we randomly select 2 subjects, and these 2 subjects can be classified by two characteristics such as sex and preference, the question is how many combinations of these two characteristics there are in total. For a continuous variable such as blood pressure, describing means computing summary statistics of the variable such as the mean, median, variance, standard deviation, and so on. From these descriptive statistics, probability theory provides us with models for establishing the distribution functions of those variables. In this chapter, I will discuss two main areas: counting and distribution functions.
6.1 Counting
6.1.1 Permutation
By definition, a permutation of n elements is an arrangement of the n elements in a definite order. This definition is admittedly hard to grasp. A concrete example may make it clearer. Imagine an emergency centre with 3 doctors (x, y and z) and 3 patients (a, b and c) waiting to be seen. Each of the three doctors can see any of the patients a, b or c. The question is: how many ways are there to assign doctors to patients? To answer this question, we consider the following cases:
Find 3!
> prod(3:1)
[1] 6
Find 10!
> prod(10:1)
[1] 3628800
Find 10.9.8.7.6.5.4
> prod(10:4)
[1] 604800
Find (10.9.8.7.6.5.4) / (40.39.38.37.36)
> prod(10:4) / prod(40:36)
[1] 0.007659481
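The products above can also be written with the built-in factorial() function. The sketch below is an equivalent formulation for checking the results, not the approach used in the text; the variable names are illustrative.

```r
f3      <- factorial(3)                    # 3!, same as prod(3:1)
f10     <- factorial(10)                   # 10!, same as prod(10:1)
partial <- factorial(10) / factorial(3)    # 10 x 9 x ... x 4, same as prod(10:4)
```

For large n, factorial(n) overflows quickly, so ratios are often safer to compute with prod() over the needed range, as the text does.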
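6.1.2 Combination
A combination of k elements from n is any subset of k elements of a set of n elements. This definition is, it must be said, hard to follow and cumbersome. The easiest way to understand it is through an example: given 3 people (call them A, B and C) who are candidates for the 2 positions of president and vice president, in how many ways can these 2 positions be filled from the 3 people? We can picture 2 chairs to be filled from 3 people:
Choice  President  Vice president
1       A          B
2       B          A
3       A          C
4       C          A
5       B          C
6       C          B
$$\binom{3}{2} = \frac{3!}{2!\,(3-2)!} = \frac{6}{2} = 3 \text{ ways.}$$
In general, the number of ways to choose k people from n people is:
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
This formula is also sometimes written $C_k^n$ instead of $\binom{n}{k}$. In R, this calculation is very simple with the function choose(n, k). Here are a few illustrative examples:
Find $\binom{5}{2}$
> choose(5, 2)
[1] 10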
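As a check on choose(), the sketch below computes the same number from the factorial formula and lists the unordered pairs for the A, B, C example with combn(); the variable names are illustrative.

```r
n <- 5; k <- 2
by_formula <- factorial(n) / (factorial(k) * factorial(n - k))   # n! / (k! (n-k)!)
by_choose  <- choose(n, k)

# The unordered pairs that can be drawn from A, B, C (one pair per column)
pairs <- combn(c("A", "B", "C"), 2)
```

pairs has 3 columns, matching the 3 combinations, whereas the table of president and vice president lists 6 ordered choices.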
 1  Female  156
 2  Female  160
 3  Male    175
 4  Female  145
 5  Female  165
 6  Female  158
 7  Male    170
 8  Male    167
 9  Female  178
10  Male    155
... the binomial distribution, the Poisson distribution and the normal distribution. For each distribution, there are 4 kinds of functions that we need to know:
Distribution: density; cumulative; quantile; simulation
Normal: dnorm(x, mean, sd); pnorm(q, mean, sd); qnorm(p, mean, sd); rnorm(n, mean, sd)
Binomial: dbinom(k, n, p); pbinom(q, n, p); qbinom(p, n, p); rbinom(k, n, prob)
Poisson: dpois(k, lambda); ppois(q, lambda); qpois(p, lambda); rpois(n, lambda)
Uniform: dunif(x, min, max); punif(q, min, max); qunif(p, min, max); runif(n, min, max)
Negative binomial: dnbinom(x, k, p); pnbinom(q, k, p); qnbinom(p, k, prob); rnbinom(n, k, prob)
Beta: dbeta(x, shape1, shape2); pbeta(q, shape1, shape2); qbeta(p, shape1, shape2); rbeta(n, shape1, shape2)
Gamma: dgamma(x, shape, rate, scale); pgamma(q, shape, rate, scale); qgamma(p, shape, rate, scale); rgamma(n, shape, rate, scale)
Geometric: dgeom(x, p); pgeom(q, p); qgeom(p, prob); rgeom(n, prob)
Exponential: dexp(x, rate); pexp(q, rate); qexp(p, rate); rexp(n, rate)
Weibull: dweibull(x, shape, scale); pweibull(q, shape, scale); qweibull(p, shape, scale); rweibull(n, shape, scale)
Cauchy: dcauchy(x, location, scale); pcauchy(q, location, scale); qcauchy(p, location, scale); rcauchy(n, location, scale)
F: df(x, df1, df2); pf(q, df1, df2); qf(p, df1, df2); rf(n, df1, df2)
T: dt(x, df); pt(q, df); qt(p, df); rt(n, df)
Chi-squared: dchisq(x, df); pchisq(q, df); qchisq(p, df); rchisq(n, df)
Note: In the table above, df = degrees of freedom; prob = probability; n = sample size. Other parameters can be looked up for each distribution. The F, t and Chi-squared distributions also take one more parameter, the non-centrality parameter (ncp), which is set to 0 by default. The user can, however, supply another suitable value if needed.
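The four kinds of function share a naming pattern: a prefix d, p, q or r followed by the distribution name. The sketch below illustrates the pattern for the standard normal distribution; the seed and the variable names are arbitrary choices.

```r
dens <- dnorm(0)         # d: density at x = 0
cum  <- pnorm(1.96)      # p: cumulative probability P(Z <= 1.96)
quan <- qnorm(0.975)     # q: quantile, the z for which P(Z <= z) = 0.975

set.seed(1)              # arbitrary seed, only to make the draws repeatable
draws <- rnorm(5)        # r: five simulated values
```

The p and q functions are inverses of each other, so pnorm(qnorm(0.975)) returns 0.975.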
... is carried out n times, each time giving a result that is either a success or a failure, with a known probability of success p, then the probability of k successful trials is:
$$P(k \mid n, p) = C_k^n\, p^k (1 - p)^{n-k}$$
where k = 0, 1, 2, ..., n. To understand the theorem more clearly ...
Friend 1  Friend 2  Friend 3  Probability
Male      Male      Male      (0.4)(0.4)(0.4) = 0.064
Male      Male      Female    (0.4)(0.4)(0.6) = 0.096
Male      Female    Male      (0.4)(0.6)(0.4) = 0.096
Male      Female    Female    (0.4)(0.6)(0.6) = 0.144
Female    Male      Male      (0.6)(0.4)(0.4) = 0.096
Female    Male      Female    (0.6)(0.4)(0.6) = 0.144
Female    Female    Male      (0.6)(0.6)(0.4) = 0.144
Female    Female    Female    (0.6)(0.6)(0.6) = 0.216
Total                         1.000
... we need only the simple command:
> dbinom(2, 3, 0.60)
[1] 0.432
  7   8   9  10
 68  23  13   3
The first row of numbers (0, 5, 6, ..., 10) is the number of patients with hypertension among the 20 people we selected. The second row tells us how many times each count occurred in the 1000 samples. Thus, 6 samples had no hypertensive patient, 45 samples had only 1 hypertensive patient, and so on. Perhaps the easiest way to see this is to plot the frequencies above with the hist command, as follows:
> hist(b, main="Number of hypertensive patients")
[Figure: histogram titled "Number of hypertensive patients"; vertical axis: Frequency]
The answer is a probability of about 0.006, or 0.6%. In other words, if the two beers really were the same, the probability that 16 or more of 20 people prefer beer A would be only about 0.6%. That is, we have evidence that beer A may truly be preferred by more people than beer B, rather than this being due to chance. Note that we use 15 (rather than 16) because P(X ≥ 16) = 1 − P(X ≤ 15), and in the case under discussion, P(X ≤ 15) = pbinom(15, 20, 0.5).
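The tail probability for the beer example can be computed in two equivalent ways, sketched below: through the complement of the cumulative probability, or by summing the individual binomial terms. The variable names are illustrative.

```r
# P(X >= 16) for X ~ Binomial(20, 0.5), via the complement of P(X <= 15)
p_upper <- 1 - pbinom(15, 20, 0.5)

# The same probability, adding P(X = 16) + ... + P(X = 20)
p_sum <- sum(dbinom(16:20, 20, 0.5))
```

Both give 6196 / 2^20, since there are 4845 + 1140 + 190 + 20 + 1 = 6196 favourable outcomes out of 2^20.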
6.3.2 Poisson distribution
... the average rate of spelling errors is 1 (λ = 1). The Poisson distribution states that the probability that X = k, given the average rate λ, is:
$$P(X = k \mid \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}$$
Therefore, the answer to the question above is: $P(X = 2 \mid \lambda = 1) = \frac{e^{-1} 1^2}{2!} = 0.1839$. This answer can be computed more quickly in R with the dpois function as follows:
> dpois(2, 1)
[1] 0.1839397
We can also compute the probability of 1 error, and the probability of no error at all:
> dpois(1, 1)
[1] 0.3678794
> dpois(0, 1)
[1] 0.3678794
$$P(X > 2) = 1 - P(X \le 2) = 1 - 0.3678 - 0.3678 - 0.1839 = 0.08$$
In R, we can compute this as follows:
# P(X <= 2)
> ppois(2, 1)
[1] 0.9196986
# 1-P(X <= 2)
> 1-ppois(2, 1)
[1] 0.0803014
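The Poisson formula can be checked directly against dpois and ppois, as in this sketch with λ = 1; the variable names are illustrative.

```r
lambda <- 1
k <- 2

# e^(-lambda) * lambda^k / k!, written out term by term
by_formula <- exp(-lambda) * lambda^k / factorial(k)

# P(X > 2) as 1 minus the sum of P(X = 0), P(X = 1) and P(X = 2)
p_more_than_2 <- 1 - sum(dpois(0:2, lambda))
```

by_formula agrees with dpois(2, 1), and p_more_than_2 agrees with 1 - ppois(2, 1).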
The two distributions we have just considered belong to the family of distributions for discrete variables (discrete distributions), in which the variable takes values that are counts or categories. For continuous variables there are several other suitable distributions, of which the most important is the normal distribution. The normal distribution is the most important foundation of statistical analysis. It is no exaggeration to say that most of statistical theory is built on the normal distribution. The normal density function has two parameters: the mean μ and the variance σ² (or the standard deviation σ). Let X be a variable (height, for example); the normal density function states that the probability density at X = x is:
$$P(X = x \mid \mu, \sigma^2) = f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$$
[Figure: normal density curve; horizontal axis: Height (130 to 200); vertical axis: f(height)]
The plot above was drawn with the two commands below. The first command creates a variable height with the values 130, 131, 132, ..., 200 cm. The second draws the plot for a mean of 156 cm and a standard deviation of 4.6 cm.
> height <- seq(130, 200, 1)
> plot(height, dnorm(height, 156, 4.6),
       type="l",
       ylab="f(height)",
       xlab="Height")
$$f(160) = \frac{1}{4.6\sqrt{2 \times 3.1416}} \exp\left[-\frac{(160 - 156)^2}{2 (4.6)^2}\right] = 0.0594$$
The function dnorm(x, mean, sd) in R computes this value for us concisely:
> dnorm(160, mean=156, sd=4.6)
[1] 0.05942343
$$P(a \le X \le b) = \int_a^b f(x)\,dx$$
Thus P(150 ≤ X ≤ 160) is the area under the curve of the plot between 150 and 160 on the horizontal axis. R has the function pnorm(x, mean, sd) for computing the cumulative probability of a normal distribution, which is very useful.
$$\text{pnorm}(a, \text{mean}, \text{sd}) = P(X \le a) = \int_{-\infty}^{a} f(x)\,dx$$
For example, the probability that the height of a Vietnamese woman is 150 cm or less is 9.6%:
> pnorm(150, 156, 4.6)
[1] 0.0960575
Or the probability that the height of a Vietnamese woman is 164 cm or more is:
> 1-pnorm(164, 156, 4.6)
[1] 0.04100591
In other words, only about 4.1% of Vietnamese women are 164 cm or taller.
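The probability of a height between 150 and 160 cm, described above as an area under the density curve, can be sketched in two ways: as a difference of two pnorm values, or by numerical integration of dnorm. The mean of 156 cm and standard deviation of 4.6 cm are the values used in the text; the variable names are illustrative.

```r
mu <- 156
sigma <- 4.6

# P(150 <= X <= 160) as a difference of cumulative probabilities
p_between <- pnorm(160, mu, sigma) - pnorm(150, mu, sigma)

# The same area obtained by integrating the density from 150 to 160
area <- integrate(dnorm, lower = 150, upper = 160, mean = mu, sd = sigma)$value
```

The two numbers agree up to the numerical error of integrate().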
Example 7: Application of the normal distribution. In a population, we know that the mean blood pressure is 100 mmHg and the standard deviation is 13 mmHg. What proportion of people in this population have a blood pressure of 120 mmHg or higher? The answer with R is:
> 1-pnorm(120, mean=100, sd=13)
[1] 0.0619679
That is, about 6.2% of people in this population have a blood pressure of 120 mmHg or higher.
6.3.4 Standardized normal distribution
A variable X that follows the normal distribution with mean μ and variance σ² is usually written as:
X ~ N(μ, σ²)
Here μ and σ² depend on the unit of measurement of the variable. For example, height is measured in cm (or m), blood pressure in mmHg, age in years, and so on, so describing a variable in its original units sometimes makes comparison difficult. A simpler approach is to standardize X so that its mean is 0 and its variance is 1. After a little arithmetic, it is easy to show that the transformation of X that satisfies this condition is:
$$Z = \frac{X - \mu}{\sigma}$$
[Figure: standard normal density curve; vertical axis: f(z)]
In other words, there is a 95% probability that z lies between -1.96 and 1.96. (Note that in the command above I did not supply mean=0, sd=1, because the default values of pnorm's parameters mean and sd are 0 and 1.)
Example 6 (continued). To recap, the mean height of Vietnamese women is 156 cm and the standard deviation is 4.6 cm. A woman who is 170 cm tall is therefore z = (170 − 156) / 4.6 = 3.04 standard deviations above the mean, and the proportion of Vietnamese women taller than 170 cm is very low, only about 0.1%.
> 1-pnorm(3.04)
[1] 0.001182891
P(Z < z) = p
Or P(Z < z) = 0.975 for the normal distribution with mean 0 and standard deviation 1:
> qnorm(0.975, mean=0, sd=1)
[1] 1.959964
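The statements about the standardized normal distribution can be verified with pnorm and qnorm, as sketched below; the variable names are illustrative.

```r
# P(-1.96 < Z < 1.96), using the defaults mean = 0 and sd = 1
p95 <- pnorm(1.96) - pnorm(-1.96)

# Standardizing a height of 170 cm (mean 156 cm, standard deviation 4.6 cm)
z <- (170 - 156) / 4.6
p_taller <- 1 - pnorm(z)                  # same as 1 - pnorm(170, 156, 4.6)

# qnorm is the inverse of pnorm: the z for which P(Z < z) = 0.975
q <- qnorm(0.975)
```

Working with z instead of the original units gives the same probability, which is the point of standardizing.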
Chi-squared distribution (χ²). The χ² distribution arises from a sum of squares ... u follows the Chi-squared distribution with n degrees of freedom (usually abbreviated df). In mathematical notation, $u \sim \chi^2_n$.
... a non-central Chi-squared distribution with n degrees of freedom and non-centrality parameter λ as follows:
$$\lambda = \sum_{i=1}^{n} \mu_i^2$$
It is written $u \sim \chi^2_{n,\lambda}$, ... the variance of u is 2(n + 2λ).
Find the probability that u is less than or equal to 21, given 13 degrees of freedom and a non-centrality parameter of 5.4:
> pchisq(21, 13, 5.4)
[1] 0.6837649
F distribution. The ratio of two independent variables that each follow a χ² distribution, each divided by its degrees of freedom, can be shown to follow the F distribution. In other words, if $u \sim \chi^2_n$ and $v \sim \chi^2_m$, then $(u/n)/(v/m) \sim F_{n,m}$, where n is the numerator degrees of freedom and m is the denominator degrees of freedom.
Example 11: Find the probability that an F value is greater than 3.24, given that the variable follows the F distribution with 3 and 15 degrees of freedom and a non-centrality parameter of 5:
> 1-pf(3.24, 3, 15, 5)
[1] 0.3558721
... with the values 1, 3 and 5, with the following probabilities:
x      1     3     5
P(x)   0.60  0.30  0.10
From these data we know that the mean is (1×0.60)+(3×0.30)+(5×0.10) = 2.0 and the variance (which the reader can verify) is 1.8.
Now we use these two parameters to try simulating 500 draws. The first command creates the 3 values of x. The second enters the probability of each value of x. The sample command asks R to generate 500 random numbers and store them in the object draws.
x <- c(1, 3, 5)
px <- c(0.6, 0.3, 0.1)
draws <- sample(x, size=500, replace=T, prob=px)
hist(draws, breaks=seq(1,5, by=0.25), main="500 draws")
[Figure: histogram titled "500 draws"; horizontal axis: draws; vertical axis: Frequency]
From the probability distribution we know that, on average, 60% of the draws will have the value 1, 30% the value 3, and 10% the value 5. We therefore expect to observe 300, 150 and 50 draws for the respective values. The plot above shows that the distribution of these values is close to what we expect. In addition, we also know that the variance of this variable is about 1.8. Now we check whether that is as expected:
> var(draws)
[1] 1.835671
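The theoretical mean and variance quoted above, and a repeatable version of the simulation, can be sketched as follows. The seed is an arbitrary choice, so the simulated figures will differ from the 1.835671 shown in the text.

```r
x  <- c(1, 3, 5)
px <- c(0.6, 0.3, 0.1)

mu     <- sum(x * px)             # expected value: sum of x * P(x)
sigma2 <- sum((x - mu)^2 * px)    # variance: sum of (x - mu)^2 * P(x)

set.seed(123)                     # arbitrary seed, only for repeatability
draws <- sample(x, size = 500, replace = TRUE, prob = px)
sim <- c(mean(draws), var(draws)) # should land near mu and sigma2
```

With 500 draws the simulated mean and variance fluctuate around 2.0 and 1.8 from one run to the next.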
[Figure: histogram of drawmeans; vertical axis: Frequency]
We see that the variance of this distribution is smaller. In fact, the variance of these 500 means is 0.45.
> var(drawmeans)
[1] 0.4501112
[Figure: histogram; vertical axis: Number of samples]
> mean(bin)
[1] 3.97
Plot the density distribution:
[Figure: "Histogram of pois"; horizontal axis: pois; vertical axis: Frequency]
The simulation approach above can also be applied to other distributions such as the negative binomial (rnbinom), gamma (rgamma), beta (rbeta), Chi-squared (rchisq), exponential (rexp), t (rt), F (rf), and so on. The parameters of these simulation functions can be found in the first part of this chapter.
The following commands illustrate the common distributions:
[Figure: Chi-squared densities for df=1, df=2, df=3 and df=5; vertical axis: dchisq(x, 1)]
t distribution:
> curve(dt(x, 1), xlim=c(-3,3), ylim=c(0,0.4), col="red", lwd=3)
> curve(dt(x, 2), add=T, col="blue", lwd=3)
> curve(dt(x, 5), add=T, col="green", lwd=3)
> curve(dt(x, 10), add=T, col="orange", lwd=3)
> curve(dnorm(x), add=T, lwd=4, lty=3)
> title(main="Student T distributions")
> legend(par("usr")[2], par("usr")[4],
         xjust=1,
         c("df=1", "df=2", "df=5", "df=10", "Normal distribution"),
         lwd=c(2,2,2,2,2),
         lty=c(1,1,1,1,3),
         col=c("red", "blue", "green", "orange", par("fg")))
[Figure: "Student T distributions"; curves for df=1, df=2, df=5, df=10 and the normal distribution; vertical axis: dt(x, 1)]
F distribution:
> curve(df(x,1,1), xlim=c(0,2), ylim=c(0,0.8), lwd=3)
> curve(df(x,3,1), add=T)
> curve(df(x,6,1), add=T, lwd=3)
> curve(df(x,3,3), add=T, col="red")
> curve(df(x,6,3), add=T, col="red", lwd=3)
> curve(df(x,3,6), add=T, col="blue")
> curve(df(x,6,6), add=T, col="blue", lwd=3)
> title(main="Fisher F distributions")
> legend(par("usr")[2], par("usr")[4],
         xjust=1,
         c("df=1,1", "df=3,1", "df=6,1", "df=3,3", "df=6,3",
           "df=3,6", "df=6,6"),
         lwd=c(1,1,3,1,3,1,3),
         lty=c(2,1,1,1,1,1,1),
         col=c(par("fg"), par("fg"), par("fg"), "red", "red", "blue", "blue"))
[Figure: "Fisher F distributions"; curves for df=1,1; df=3,1; df=6,1; df=3,3; df=6,3; df=3,6; df=6,6; vertical axis: df(x, 1, 1)]
>
>
>
>
>
>
>
1.0
0.6
0.4
0.2
0.0
dgamma(x, 1, 1)
0.8
23
>
>
>
>
>
>
>
>
>
>
>
>
Beta distribution
2
0
dbeta(x, 1, 1)
(1,1)
(2,1)
(3,1)
(4,1)
(2,2)
(3,2)
(4,2)
(2,3)
(3,3)
(4,3)
0.0
0.2
0.4
0.6
0.8
1.0
24
2.0
1.0
0.0
0.5
dexp(x)
1.5
Exponential
Weibull, shape=1
Weibull, shape=2
Weibull, shape=.8
0.0
0.5
1.0
1.5
2.0
2.5
3.0
25
0.5
0.3
0.2
0.0
0.1
dcauchy(x)
0.4
Cauchy distribution
Gaussian distribution
-4
-2
v.v
26
The command above draws a random sample without replacement (random sampling without replacement): once an element has been drawn, it is not put back into the population.
Sometimes we want to sample with replacement instead, so that each element drawn is returned to the population before the next draw. For example, to select 10 people from a population of 50 by random sampling with replacement, we only need to add the argument replace = TRUE:
> sample(1:50, 10, replace=T)
[1] 31 44 6 8 47 50 10 16 29 23
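The difference between the two sampling schemes can be checked directly. The sketch below assumes a hypothetical population of 50 numbered people; the object names wo and wr are illustrative only.

```r
set.seed(1)                                   # arbitrary seed
population <- 1:50                            # ids of a hypothetical population

wo <- sample(population, 10)                  # without replacement: no id repeats
wr <- sample(population, 60, replace = TRUE)  # with replacement: more draws than people is allowed

no_repeats_wo  <- anyDuplicated(wo) == 0      # always TRUE without replacement
has_repeats_wr <- anyDuplicated(wr) > 0       # TRUE: 60 draws from 50 ids must repeat

# asking for more than 50 draws without replacement is an error
too_many <- try(sample(population, 60), silent = TRUE)
is_error <- inherits(too_many, "try-error")
```

Sampling with replacement is the scheme behind the bootstrap, which is why replace = TRUE is worth remembering.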
CHAPTER VII
HYPOTHESIS TESTING AND THE P-VALUE
7
Statistical hypothesis testing and the meaning of the P-value
7.1 The P-value
In scientific research, apart from data presented as numbers, charts and images, the figure we meet most often is the P-value. In the chapters that follow the reader will meet the P-value many times, and the great majority of statistical and scientific inferences rest on it. So, before turning to methods of statistical analysis with R, I think a few words on the meaning of this number are needed.
The P-value is a probability; the name is short for "probability value". We often meet statements accompanied by such a number, for example "The analysis shows that the fracture rate among patients treated with Alendronate was 2%, lower than the rate among untreated patients (5%), and this difference is statistically significant (p = 0.01)", or "After 3 months of treatment, blood pressure in the patient group fell by 10% (p < 0.05)". In contexts like these, most scientists take the P-value to reflect the probability that Alendronate, or some treatment, is effective; they read the first statement as saying that the probability that Alendronate is better than placebo is 0.99 (1 minus 0.01). That reading is completely wrong!
In the English-Vietnamese Dictionary of Mathematical Economics, Statistics and Econometrics (Science and Technology Publishing House, 2004), the author defines the P-value as follows: "P-value (or probability value). The P-value is the lowest level of statistical significance at which the observed value of the test statistic is significant" (page 690). This definition is hard to follow! In fact it is the common definition found in Western textbooks. Open any English-language textbook and we find nearly identical definitions, such as "The P value is the probability that the observed difference arose by chance". This definition is incomplete, if not wrong. It is because of this vagueness that so many scientists misunderstand the P-value.
Indeed, many people, not only readers but the authors of scientific papers themselves, do not understand the meaning of the P-value. According to a study published in the respected journal Statistics in Medicine [1], 85% of scientific authors and research physicians either do not understand or misunderstand the P-value. The reader may well be surprised, because this means that many researchers sometimes do not understand, or misunderstand, what they themselves have written! The question must therefore be put seriously: what does the P-value mean? To answer it, we need to look at the concept of falsification and at the process of a scientific study.
bin
 14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  1   1   2  11  16  24  47  60  83  94 107 132 114  98  65  44  44  26  14  12   2   3
[Figure: Histogram of bin; y-axis: Frequency]
bin
   12    13    14    15    16    17    18    19    20    21
   17    40    83   197   462   946  1592  2719  4098  5892
   22    23    24    25    26    27    28    29    30    31
 7937  9733 10822 11191 10799  9497  7925  5904  4185  2682
   32    33    34    35    36    37    38    39    40
 1562   893   455   223    98    31     5     7     1
In other words, because the probability P(X ≥ 35 | p=0.50) is so low (only 0.3%), we have evidence that the result above may not be due to chance alone; that is, customers do differ in their preference for the two kinds of coffee.
The figure P = 0.0035 is the P-value. By scientific convention, any P-value below 0.05 (that is, below 5%) is regarded as "significant", meaning statistically significant.
It is worth stressing once more how the P-value should be understood. The purpose of the analysis above is to answer the question: if the two kinds of coffee are equally likely to be preferred (p = 0.5, the null hypothesis), what is the probability that the result above (35 of 50 customers preferring A) would occur? That is precisely how the P-value is obtained. The interpretation of a P-value is therefore conditional, and the condition here is p = 0.50. The reader can repeat the experiment with p = 0.6 or p = 0.7 to see how the result changes.
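The conditional nature of this probability can be made concrete with a short sketch. It computes the exact tail probability P(X ≥ 35) for 50 customers under different assumed values of p, and then approximates the p = 0.5 case by simulation; the helper name p_tail, the seed and the number of simulations are assumptions made for illustration.

```r
# exact tail probability P(X >= x) for X ~ Binomial(n, p)
p_tail <- function(p, n = 50, x = 35) {
  pbinom(x - 1, size = n, prob = p, lower.tail = FALSE)
}

p_null <- p_tail(0.5)   # probability of 35 or more if the two coffees are equally liked
p_06   <- p_tail(0.6)   # the same event if the true preference for A were 0.6
p_07   <- p_tail(0.7)   # ... and if it were 0.7

# approximate the p = 0.5 case by simulating many surveys of 50 customers
set.seed(2024)
sims  <- rbinom(100000, size = 50, prob = 0.5)
p_sim <- mean(sims >= 35)

print(c(exact = p_null, simulated = p_sim, p0.6 = p_06, p0.7 = p_07))
```

The same observed result (35 out of 50) is rare under p = 0.5 but quite ordinary under p = 0.7, which is why a P-value always has to be read together with the hypothesis it was computed under.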
In practice, the P-value has a very large influence on the fate of a scientific paper. Many journals and scientists regard a study with a P-value above 0.05 as a "negative result", and the paper may be rejected for publication. For that reason, to most scientists the figure "P < 0.05" has become a passport to publication. If the result has P < 0.05, the paper has a chance of appearing in some journal and the author may become well known; if P > 0.05, the paper and the research behind it are likely to be forgotten!
7.4 The logic problem of the P-value
But from the standpoint of reason and serious science, should we attach that much importance to the P-value? In my view, the answer is no. The P-value has many problems, and reliance on it, past and present, has been sharply criticised by many people. The first shortcoming of the P-value is that it lacks logic. Indeed, if we take the trouble to re-examine the example above, we can summarise the process of a medical study (based on the P-value) as follows:
Patient characteristic     Calcium and vitamin D (1)   Placebo (1)    Relative risk and 95% confidence interval (2)
Age
  50-59                    29 (0.06)                   13 (0.03)      2.17 (1.13-4.18)
  60-69                    53 (0.09)                   71 (0.13)      0.74 (0.52-1.06)
  70-79                    93 (0.44)                   115 (0.54)     0.82 (0.62-1.08)
(row labels not preserved)
                           69 (0.20)                   66 (0.19)      1.05 (0.75-1.47)
                           63 (0.14)                   74 (0.16)      0.87 (0.62-1.22)
                           43 (0.09)                   59 (0.13)      0.73 (0.49-1.09)
Smoking
  Non-smoker               159 (0.14)                  178 (0.15)     0.90 (0.71-1.11)
  Current smoker           14 (0.14)                   16 (0.17)      0.85 (0.41-1.74)
Notes: (1) The number outside the parentheses is the number of patients with a femoral fracture during follow-up (7 years); the number in parentheses is the fracture rate in percent per year. (2) The relative risk (RR, explained in a later chapter) is estimated by dividing the fracture rate in the intervention group by the rate in the placebo group. If the 95% confidence interval includes 1, the difference between the 2 groups is not statistically significant; if the 95% confidence interval excludes 1, the difference is regarded as statistically significant (p<0.05).
analysis by Berger and Sellke, about 25% of findings with p < 0.05 are false positives [2].
We should therefore not rely too heavily on the P-value. It is not true that every study with p<0.05 is a success and every study with p>0.05 a failure. A finding with p>0.05 can still be a meaningful one. What matters is how to estimate the plausibility of a hypothesis once real data are in hand, that is, to estimate P(H+ | D). Estimating P(H+ | D) requires Bayes' theorem, an approach that lies outside the scope of this book. Readers who want to pursue it can consult some of my papers or those of James Berger listed in the references below.
References:
CHAPTER VIII
8
Data analysis with charts
The visual element is very important. The Chinese have a saying that a chart is worth ten thousand written words. A good chart can indeed make a strong impression on the reader of a scientific paper, and it often stands for the whole study. A chart is therefore one of the most effective ways to emphasise the message of a paper. Charts are usually used to show trends and results by group, but they can also present data compactly. Charts that are easy to read and rich in content are invaluable. Researchers therefore need to think creatively about how to present important data graphically, and graphical analysis plays an extremely important role in statistical analysis. One could say that a statistical analysis without graphs is meaningless.
R offers many ways to design a compact and attractive chart. Most charting functions are built into R, but some more sophisticated and complex charts can be produced with specialised packages such as lattice or trellis, which can be downloaded from the R website. In this chapter I show how to draw the common charts using the most widely used functions in R.
par(mfrow=c(2,2))
N <- 200
x <- runif(N, -4, 4)
y <- sin(x) + 0.5*rnorm(N)
plot(x, y, main="Scatter plot of y and x")
hist(x, main="Histogram of x")
boxplot(y, main="Box plot of y")
[Figure: panels titled Scatter plot of y and x, Histogram of x, Box plot of y and Bar chart of x]
par(mfrow=c(1,2))
N <- 200
x <- runif(N, -4, 4)
y <- sin(x) + 0.5*rnorm(N)
plot(x, y)
plot(x, y, xlab="X factor",
     ylab="Production",
     main="Production and x factor \n Second line of title here")
> par(mfrow=c(1,1))
In the commands above, xlab (short for x label) and ylab (short for y label) name the horizontal and vertical axes, and main gives the chart its title. Note the \n inside main, which starts a second line (useful when the title is too long).
[Figure: scatter plots of y against x, without and with the axis labels X factor and Production]
Figure 1
par(mfrow=c(2,2))
plot(y, type="l"); title("lines")
plot(y, type="b"); title("both")
plot(y, type="o"); title("overstruck")
plot(y, type="h"); title("high density")
[Figure: four panels of y against Index, titled lines, both, overstruck and high density]
> par(mfrow=c(2,2))
> plot(y, type="l", lty=1); title(main="Production data", sub="lty=1")
> plot(y, type="l", lty=2); title(main="Production data", sub="lty=2")
> plot(y, type="l", lty=3); title(main="Production data", sub="lty=3")
> plot(y, type="l", lty=4); title(main="Production data", sub="lty=4")
[Figure: four panels titled Production data, drawn with lty=1, lty=2, lty=3 and lty=4; x-axis: Index]
[Figure: plot types illustrated with runif(5) against Index: (points), (lines), (stair steps), (histogram), (no plot)]
[Figure: Available symbols]
N <- 200
x <- runif(N, -4, 4)
y <- x + 0.5*rnorm(N)
plot(x, y, pch=16, main="Scatter plot of y and x")
reg <- lm(y~x)
abline(reg)
legend(2, -2, c("Production","Regression line"), pch=16, lty=c(0,1))
The arguments in legend(2, -2) place the legend at 2 on the horizontal axis (x-axis) and -2 on the vertical axis (y-axis).
[Figure: scatter plot with regression line and a legend (Production, Regression line)]
text(15, 4.3)
Suppose we want to mark the point x=0, y=0 as the "centre point". We first use arrows to draw an arrow. In the command below, arrows(-1, 1, 1.5, 1.5) means that the arrow starts at the coordinates x=-1, y=1 and ends at x=1.5, y=1.5. The call text(0, 1) asks R to write the label at x=0, y=1.
> arrows(-1, 1.0, 1.5, 1.5)
> text(0, 1, "Trung tam", cex=0.7)
[Figure: the scatter plot annotated with an arrow and the label Trung tam]
64,
51,
45,
50,
58,
60,
60,
70,
60,
60,
18,
21,
22,
24,
65,
42,
51,
55,
45,
18,
21,
22,
24,
47,
64,
63,
74,
63,
18,
21,
22,
25,
65,
49,
54,
48,
52,
76,
44,
57,
46,
64,
61,
45,
70,
49,
45,
59,
80,
47,
69,
64,
57,
48,
60,
72,
62)
18, 18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22,
23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24,
25)
3.0,
1.3,
3.0,
4.3,
4.1,
3.0,
1.2,
1.7,
2.3,
4.4,
4.0,
0.7,
2.0,
6.0,
2.8,
2.1,
4.0,
2.1,
3.0,
3.0,
3.0,
4.1,
4.0,
3.0,
2.0,
3.0,
4.3,
4.1,
2.6,
1.0,
3.0,
4.0,
4.0,
4.4,
4.0,
3.0,
4.3,
4.2,
4.3,
4.6,
tc <- c(4.0, 3.5, 4.7, 7.7, 5.0, 4.2, 5.9, 6.1, 5.9, 4.0,
        6.2, 4.1, 3.0, 4.0, 6.9, 5.7, 5.7, 5.3, 7.1, 3.8,
        4.3, 4.8, 4.0, 3.0, 3.1, 5.3, 5.3, 5.4, 4.5, 5.9,
        5.6, 8.3, 5.8, 7.6, 5.8, 3.1, 5.4, 6.3, 8.2, 6.2,
        6.2, 6.7, 6.3, 6.0, 4.0, 3.7, 6.1, 6.7, 8.1, 6.2)
tg <- c(1.1, 2.1, 0.8, 1.1, 2.1, 1.5, 2.6, 1.5, 5.4, 1.9,
        1.7, 1.0, 1.6, 1.1, 1.5, 1.0, 2.7, 3.9, 3.0, 3.1,
        2.2, 2.7, 1.1, 0.7, 1.0, 1.7, 2.9, 2.5, 6.2, 1.3,
        3.3, 3.0, 1.0, 1.4, 2.5, 0.7, 2.4, 2.4, 1.4, 2.7,
        2.4, 3.3, 2.0, 2.6, 1.8, 1.2, 1.9, 3.3, 4.0, 2.5)
2.0,
4.0,
4.2,
4.0,
4.0)
With the data in place, we are ready to analyse them with charts, as follows:
[Figure: bar chart of the frequency of males (Nam) and females (Nu)]
Instead of showing the male and female frequencies as two columns, we can show them as two horizontal bars with the argument horiz = TRUE, as follows (see the result in Figure 6b):
> barplot(sex.freq,
          horiz = TRUE,
          col = rainbow(length(sex.freq)),
          main="Frequency of males and females")
(67.3,80]
7
The result above shows 10 male and 9 female patients in the first age group, 10 males and 14 females in the second age group, and so on. To display the frequencies of these two variables we again use barplot:
> barplot(age.sex, main="Number of males and females in each age group")
[Figure: bar charts of the number of males and females in each age group; x-axis: Age group]
[Figure: plot of tg values by group; axis label: mg/L]
We see gaps in the distribution of tg, particularly among subjects with high tg. While most subjects have tg below 5, there are 2 subjects with very high tg (>5).
8.6.2 Histogram
Age is a continuous variable. To draw the frequency chart of age, we simply call hist(age). As mentioned above, we can improve the graph by adding a main title (main) and titles for the horizontal axis (xlab) and the vertical axis (ylab):
> hist(age)
> hist(age, main="Frequency distribution by age group", xlab="Age group", ylab="No of patients")
[Figure: Histogram of age; x-axis: age, y-axis: Frequency]
[Figure: the same histogram with x-axis: Age group and y-axis: No of patients]
Figure 9a. The vertical axis is the number of patients (study subjects) and the horizontal axis is age. For example, there are 6 patients aged 40 to 45 and 4 patients aged 70 to 80.
Figure 9b. The chart title and the titles of the vertical and horizontal axes added with xlab and ylab.
We can also turn the chart into a probability density plot with plot(density()), as follows (result in Figure 10a):
> plot(density(age))
[Figure: density.default(x = age); N = 50, Bandwidth = 3.806; y-axis: Density]
[Figure: Histogram of age with density curves; x-axis: age, y-axis: Density]
Figure 10a. Probability density of the variable age.
Figure 10b. Probability density of the variable age with several interquartile settings.
In the graph above we used a spacing of 0.5*iqr (relatively close together). We can change this setting to 1.5*iqr to make the distribution look more realistic:
[Figure: Histogram of age with a density curve; x-axis: age, y-axis: Density]
[Figure: plot of (1:n)/n against sort(age)]
[Figure: normal Q-Q plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
Figure 11. Probability density distribution of the variable age.
Figure 12. Checking whether age follows a normal distribution.
In the graph above, the vertical axis is the cumulative probability and the horizontal axis is age from lowest to highest. For example, the graph shows that about 50% of the subjects are younger than 60.
To find out whether age follows a normal distribution, we can use the function qqnorm.
> qqnorm(age)
The horizontal axis of the chart above shows the quantiles expected under the normal distribution (theoretical quantiles) and the vertical axis shows the quantiles of the data (sample quantiles). If age followed a normal distribution, the points would lie along a 45-degree straight line (that is, the theoretical and sample quantiles would agree). From Figure 12 we see that the distribution of age is not quite normal.
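To see what "following the line" looks like in practice, the sketch below compares a Q-Q plot for simulated normal data with one for simulated skewed data. Both samples, the seed and the helper qq_cor are assumptions introduced for illustration; the book's age variable is not used here.

```r
set.seed(42)
x_norm <- rnorm(200, mean = 60, sd = 10)   # simulated, approximately normal values
x_skew <- rexp(200, rate = 1/60)           # simulated, strongly right-skewed values

op <- par(mfrow = c(1, 2))
qqnorm(x_norm, main = "Normal sample"); qqline(x_norm)   # points hug the reference line
qqnorm(x_skew, main = "Skewed sample"); qqline(x_skew)   # points curve away from it
par(op)

# a rough numeric summary: correlation between theoretical and sample quantiles
qq_cor <- function(x) {
  q <- qqnorm(x, plot.it = FALSE)
  cor(q$x, q$y)
}
c(normal = qq_cor(x_norm), skewed = qq_cor(x_skew))
```

qqline adds a reference line through the first and third quartiles, which makes departures from normality easier to judge than the bare points.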
8.6.3 Box plot (boxplot)
To draw a box plot of the variable tc, we simply use the command:
> boxplot(tc, main="Box plot of total cholesterol", ylab="mg/L")
[Figure: box plots of total cholesterol (mg/L) for males (Nam) and females (Nu), vertical and horizontal]
Figure 14a. In this chart we see that the median total cholesterol is lower in women than in men, but the spread within the two groups does not differ much.
Figure 14b. Total cholesterol by sex, with colour and horizontal boxes.
8.6.4 Bar chart
To draw a bar chart of the variable bmi, we simply use the command:
[Figure: Distribution of BMI; axis label: kg/m^2]
[Figure: scatter plots of hdl against tc, with sex shown by plotting symbols and by the letters M and F]
Figure 18a. The relationship between tc and hdl by sex, shown with two plotting symbols.
Figure 18b. The relationship between tc and hdl by sex, shown with two characters.
We can also draw a linear regression line through the points above by issuing the following commands:
> plot(hdl ~ tc, pch=16, main="Total cholesterol and HDL cholesterol",
       xlab="Total cholesterol", ylab="HDL cholesterol", bty="l")
> reg <- lm(hdl ~ tc)
> abline(reg)
[Figure: two scatter plots of HDL cholesterol against Total cholesterol, one with a regression line and one with a lowess curve]
Figure 19a. In the command above, reg <- lm(hdl~tc) finds the equation relating hdl and tc with a linear model (lm) and stores the result in the object reg. The second command, abline(reg), asks R to draw the straight line from the equation in reg.
Figure 19b. Instead of abline, we use the lowess function to show the relationship between tc and hdl.
The reader can experiment with settings such as f=1/2, f=2/5, or even f=1/10 and will see the curve change in interesting ways.
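A minimal sketch of that experiment follows. It uses simulated values in place of the book's tc and hdl (the names tc_sim and hdl_sim, the seed and the relationship between the two variables are all assumptions) and overlays lowess curves for two values of the smoother span f.

```r
set.seed(7)
tc_sim  <- runif(50, 3, 8)                          # simulated total cholesterol
hdl_sim <- 0.5 + 0.6 * tc_sim + rnorm(50, sd = 1)   # simulated HDL, loosely related to tc_sim

plot(hdl_sim ~ tc_sim, pch = 16,
     xlab = "Total cholesterol", ylab = "HDL cholesterol")

fit_smooth <- lowess(tc_sim, hdl_sim, f = 2/3)    # wide span (the default): smooth curve
fit_wiggly <- lowess(tc_sim, hdl_sim, f = 1/10)   # narrow span: follows local noise

lines(fit_smooth, lwd = 2)
lines(fit_wiggly, lty = 2)
legend("topleft", c("f = 2/3", "f = 1/10"), lwd = c(2, 1), lty = c(1, 2))
```

lowess returns a list with components x (sorted) and y (fitted values), which is why it can be passed straight to lines().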
The result is:
[Figure: scatter-plot matrix of age, bmi, hdl, ldl and tc, with correlation coefficients and significance stars in the upper panels]
As shown above, a scatter plot helps us picture the relationship between two continuous variables, such as age and hdl, and for that we use the plot function. To examine the distribution of each variable, age or hdl, we can use boxplot. But if we want to see the distributions of both variables and their relationship at the same time, we need to write a few commands. The following commands draw a scatter plot of age and hdl together with a box plot for each variable.
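op <- par(no.readonly=TRUE)    # save only the graphics parameters that can be restored
# 2 x 2 layout: box plot of hdl (2) on the left, scatter plot (1) on the right,
# box plot of age (3) underneath the scatter plot
layout( matrix( c(2,1,0,3), 2, 2, byrow=TRUE ),
        c(1,6), c(4,1)
      )
par(mar=c(1,1,5,2))
plot(hdl ~ age,
     xlab='', ylab='',
     las = 1,
     pch=16)
rug(side=1, jitter(age, 5) )
rug(side=2, jitter(hdl, 20) )
title(main = "Age and HDL")
par(mar=c(1,2,5,1))
boxplot(hdl, axes=FALSE)
title(ylab='HDL', line=0)
par(mar=c(5,1,1,2))
boxplot(age, horizontal=TRUE, axes=FALSE)
title(xlab='Age', line=1)
par(op)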
And the result is:
[Figure: scatter plot of HDL against Age with marginal box plots]
[Figure: Bubble plot of HDL against Age]
print(names(x))
axis(1, at=1:length(x), labels=names(x))
par(op)
"(1.25,1.8]"
"(5.1,5.65]"
"(2.9,3.45]"
"(4.55,5.1]"
"(1.8,2.35]"
"(4,4.55]"
[Figure: chart of tg groups with two vertical axes; left axis: Value, right axis: Cumulated frequency]
In this chart there are two vertical axes. The left axis is the frequency (number of patients) in each tg group, and the right axis is the cumulative frequency expressed as a probability (so its highest value is 1).
8.9.4 Clock plot
A clock plot, as the name suggests, displays a continuous variable as the hands of a clock. That is, instead of columns or bars, the values are shown as clock hands. The following function (clock.plot) was written to draw a clock plot:
clock.plot <- function (x, col = rainbow(n), ...) {
And the result is:
[Figure: clock plot titled Distribution of LDL, with spokes numbered 1 to 49]
[Figure: plot of mean by group]
x <- rnorm(10)
y <- rnorm(10)
se.x <- runif(10)
se.y <- runif(10)
plot(x, y, pch=22)
# vertical error bars of length se.y above and below each point
arrows(x, y-se.y, x, y+se.y, code=3, angle=90, length=0.1)
[Figure: scatter plot of y against x with vertical error bars]
N <- 50
x <- seq(-1, 1, length=N)
y <- seq(-1, 1, length=N)
xx <- matrix(x, nr=N, nc=N)
yy <- matrix(y, nr=N, nc=N, byrow=TRUE)
z <- 1 / (1 + xx^2 + (yy + .2 * sin(10*yy))^2)
contour(x, y, z, main = "Contour plot")
[Figure: Contour plot]
> image(z)
[Figure: output of image(z)]
)
> par(op)
[Figure: perspective plot titled The sinc function]
example(Japanese)
[Figure: output of example(Japanese), with the labels English, Kanji, Katakana and Hiragana]
CHAPTER IX
DESCRIPTIVE STATISTICS
9
Descriptive statistical analysis
In this chapter we use R for descriptive statistical analysis. Descriptive statistics means describing data with the common calculations and indices we have known since secondary school, such as the mean, the median, the variance and the standard deviation for continuous variables, and the proportion for non-continuous variables. But before turning to descriptive analysis, I want the reader to be able to distinguish two concepts: the population and the sample.
10))
10))
10))
10))
10))
10))
> mean(sample(height, 15))
[1] 158.6667
> mean(sample(height, 15))
[1] 159.4
> mean(sample(height, 15))
[1] 158.0667
> mean(sample(height, 15))
[1] 158.1333
> mean(sample(height, 15))
[1] 156.4667
> mean(sample(height, 18))
[1] 158.2222
> mean(sample(height, 18))
[1] 158.7222
> mean(sample(height, 18))
[1] 158.0556
> mean(sample(height, 18))
[1] 158.4444
> mean(sample(height, 18))
[1] 158.6667
> mean(sample(height, 18))
[1] 159.0556
> mean(sample(height, 18))
[1] 159
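The repeated draws above can be automated to show how the spread of the sample mean shrinks as the sample grows. The sketch below uses a hypothetical population of 20 simulated heights (height_pop is not the book's height vector) and 2000 repeated samples per sample size; both numbers are assumptions.

```r
set.seed(10)
height_pop <- rnorm(20, mean = 158, sd = 5)   # hypothetical population of 20 heights (cm)

# standard deviation of the sample mean for samples of size 5, 10 and 15,
# each drawn without replacement from the same population
sd_of_means <- sapply(c(5, 10, 15), function(k) {
  sd(replicate(2000, mean(sample(height_pop, k))))
})
names(sd_of_means) <- c("n=5", "n=10", "n=15")
round(sd_of_means, 3)
```

The larger the sample, the closer its mean tends to be to the population mean, which is the pattern visible in the outputs above.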
0.0008. In this chapter we get to know some R commands for carrying out the simple calculations above.
"weight"
"pinp"
"height"
"ictp"
"ethnicity"
"p3np"
> igfdata
id
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
10
10
...
...
97
97
98
98
99
99
100 100
17
18
18
15
54
55
48
54
ictp
p3np
11.2867 8.3367
10.4300 6.7450
8.3633 12.5000
13.3300 14.2767
7.9233 4.5033
4.9833 4.9367
6.3500 5.3200
7.3700 4.6700
11.8700 6.8200
3.7400 6.1600
4.4367
8.8333
5.6600
6.5933
Theory                                                                R function
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$                       mean(x)
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$       var(x)
Standard deviation: $s = \sqrt{s^2}$                                  sd(x)
Standard error: $SE = \frac{s}{\sqrt{n}}$                             none
Lowest value                                                          min(x)
Highest value                                                         max(x)
Range                                                                 range(x)
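Because R has no built-in function for the standard error, it can be written in one line from the formula in the table. The helper name se and the small data vector below are illustrative assumptions.

```r
# standard error of the mean: s / sqrt(n)  (hypothetical helper, not part of base R)
se <- function(x) sd(x) / sqrt(length(x))

x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # small made-up data set

mean(x)    # arithmetic mean
var(x)     # variance, with n - 1 in the denominator
sd(x)      # standard deviation, the square root of var(x)
se(x)      # standard error of the mean
range(x)   # lowest and highest values
```

Note that var and sd divide by n - 1, matching the formula in the table, and that this simple helper does not handle missing values.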
However, R has the summary command, which gives all the basic statistics of a variable at once:
> summary(age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.00   16.00   19.00   19.17   21.25   34.00
SE
5.898719
              age   igfbp3   weight      als   height     pinp     ictp     p3np
Min.        13.00    2.000    41.00    192.7    149.0    26.74    2.697    2.343
1st Qu.     16.00    3.292    47.00    256.8    157.0    68.10    4.878    4.433
Median      19.00    3.550    50.00    292.5    162.0   103.26    6.338    5.445
Mean        19.17    3.617    49.91    301.8    163.1   167.17    7.420    6.341
3rd Qu.     21.25    3.875    53.00    331.2    168.0   196.45    8.423    7.150
Max.        34.00    5.233    60.00    471.7    196.0   742.68   21.237   16.303
ethnicity: African 8, Asian 60, Caucasian 30, Others 2
sex: Female:69, Male: 0
              age   weight   height     igfi   igfbp3      als     pinp     ictp     p3np
Min.        13.00    41.00    149.0    85.71    2.767    204.3    26.74    2.697    2.343
1st Qu.     17.00    47.00    156.0   136.67    3.333    263.8    62.75    4.717    4.337
Median      19.00    50.00    162.0   163.33    3.567    302.7    78.50    5.537    5.143
Mean        19.59    49.35    161.9   167.97    3.695    311.5   108.74    6.183    5.643
3rd Qu.     22.00    52.00    166.0   186.17    3.933    361.7   115.26    7.320    6.143
Max.        34.00    60.00    196.0   427.00    5.233    471.7   502.05   13.633   14.420
------------------------------------------------------------
sex: Male
sex: Female: 0, Male:31
ethnicity: African 4, Asian 17, Caucasian 8, Others 2
               id      age   weight   height     igfi   igfbp3      als     pinp     ictp     p3np
Min.         2.00    14.00    44.00    155.0    94.67    2.000    192.7    56.28    3.650    3.390
1st Qu.     34.50    15.00    48.50    161.5   138.67    3.183    249.8   135.07    6.900    5.375
Median      56.00    17.00    51.00    164.0   160.00    3.500    276.0   245.92    9.513    7.140
Mean        55.61    18.23    51.16    165.6   160.29    3.443    280.2   297.21   10.173    7.895
3rd Qu.     75.00    20.00    53.50    169.0   183.00    3.775    311.3   450.38   13.517   10.010
Max.       100.00    27.00    59.00    191.0   274.00    4.500    388.7   742.68   21.237   16.303
> op <- par(mfrow=c(2,3))
> hist(igfi)
> hist(igfbp3)
> hist(als)
> hist(pinp)
> hist(ictp)
> hist(p3np)
[Figure: six panels: Histogram of igfi, Histogram of igfbp3, Histogram of als, Histogram of pinp, Histogram of ictp, Histogram of p3np; y-axis: Frequency]
[Figure: Histogram of weight; x-axis: weight, y-axis: Frequency]
If we want the mean of a variable such as igfi for each of the male and female groups, the tapply function in R can be used:
> tapply(igfi, list(sex), mean)
  Female     Male
167.9741 160.2903
In the command above, igfi is the variable to be summarised, sex is the grouping variable, and the statistic we want is the mean. The result shows that the mean igfi for women (167.97) is higher than for men (160.29).
If we want the means by both sex and ethnicity, we only need to add one more variable to list:
> tapply(igfi, list(ethnicity, sex), mean)
            Female     Male
African   145.1252 120.9168
Asian     165.6589 160.4999
Caucasian 176.6536 169.4790
Others          NA 200.5000
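The behaviour of tapply, including the NA that appears for an empty cell, can be reproduced on a small made-up data frame. Everything in the sketch below (the data frame d, its ten rows and the simulated igfi values) is an assumption for illustration and is unrelated to the igfdata set.

```r
set.seed(3)
d <- data.frame(
  sex       = factor(rep(c("Female", "Male"), times = c(6, 4))),
  ethnicity = factor(c("Asian", "Asian", "African", "Caucasian", "Asian", "Caucasian",
                       "Asian", "African", "Asian", "Others")),
  igfi      = round(rnorm(10, mean = 165, sd = 30), 1)   # simulated hormone values
)

m1 <- tapply(d$igfi, d$sex, mean)                      # one grouping variable
m2 <- tapply(d$igfi, list(d$ethnicity, d$sex), mean)   # two grouping variables

m1
m2   # the Others/Female cell is NA because no such subject exists in d
```

A cell is NA whenever a combination of factor levels has no observations, which is the same reason the Others/Female cell is NA in the output above.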
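$$ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} $$
Here $\bar{x}$ is the sample mean, $\mu$ is the hypothesised mean (30 in this case), $s$ is the standard deviation, and $n$ is the sample size (100). If the value of t exceeds the theoretical value from the t distribution at a chosen significance level, 5% for instance, we have reason to state that the difference is statistically significant. For a sample of 100, this critical value can be computed with R's qt function as follows: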
> qt(0.95, 100)
[1] 1.660234
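The formula can be checked against t.test on simulated data. In the sketch below, age_sim is a simulated stand-in for the book's age variable, and the critical value is taken two-sided with n - 1 degrees of freedom; both choices are assumptions, not the text's.

```r
set.seed(5)
age_sim <- rnorm(100, mean = 19, sd = 4)   # simulated ages, not the igfdata values
mu0 <- 30                                  # hypothesised mean
n   <- length(age_sim)

# t statistic from the formula: (sample mean - mu0) / (s / sqrt(n))
t_manual <- (mean(age_sim) - mu0) / (sd(age_sim) / sqrt(n))

# two-sided 5% critical value with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# the same test done by R
res <- t.test(age_sim, mu = mu0)

c(manual = t_manual, t.test = unname(res$statistic), critical = t_crit)
```

Here |t| is far larger than the critical value, so the hypothesised mean of 30 is rejected, in line with the t.test output.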
In the command above, age is the variable being tested and mu=30 is the hypothesised value. R reports t = -27.66 with 99 degrees of freedom and p < 2.2e-16 (that is, very small). R also reports that the 95% confidence interval for age runs from 18.4 to 19.9 years (30 lies far outside this interval). In other words, we have reason to state that the mean age in this sample is truly lower than the mean age of the population.
9.4.2 Two-sample t test
Example 3. From the descriptive analysis above (the summary output) we saw that women have higher igfi than men (167.97 versus 160.29). The question is whether this is a systematic difference or one caused by chance. To answer it, we need to consider the mean difference between the two groups and the standard deviation of that difference:
$$ t = \frac{\bar{x}_2 - \bar{x}_1}{SED} $$
Here $\bar{x}_1$ and $\bar{x}_2$ are the means of the male and female groups, and SED is the standard deviation of $(\bar{x}_1 - \bar{x}_2)$. In fact, SED can be estimated with the formula:
$$ SED = \sqrt{SE_1^2 + SE_2^2} $$
where $SE_1$ and $SE_2$ are the standard errors of the male and female groups. According to probability theory, t follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom, where $n_1$ and $n_2$ are the sample sizes of the two groups. We can use R to answer the question above with the t.test function, as follows:
> t.test(igfi~ sex)
Welch Two Sample t-test
data: igfi by sex
t = 0.8412, df = 88.329, p-value = 0.4025
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.46855 25.83627
sample estimates:
mean in group Female   mean in group Male
            167.9741             160.2903
df is the degrees of freedom. The value p = 0.4025 shows that the difference between men and women is not statistically significant (because it is above 0.05, or 5%).
95 percent confidence interval:
-10.46855 25.83627
is the 95% confidence interval for the difference between the two groups. The result says that igfi in women may be as much as 10.5 ng/L lower, or about 25.8 ng/L higher, than in men. Because the interval is this wide and includes 0, it is further evidence that there is no statistically significant difference between the two groups.
The test above assumes that the male and female groups have different variances. If we have reason to believe that the two groups share the same variance, we change just one argument of t.test, var.equal=TRUE, as follows:
> t.test(igfi~ sex, var.equal=TRUE)
Two Sample t-test
data: igfi by sex
t = 0.7071, df = 98, p-value = 0.4812
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.88137 29.24909
sample estimates:
mean in group Female
167.9741
The result above shows that the variances of the two groups differ by a factor of 2.62. The value p = 0.0045 shows that this difference in variance is statistically significant. We therefore accept the result of t.test(igfi~ sex).
The value p = 0.682 shows that the difference in igfi between men and women is indeed not statistically significant. This conclusion agrees with the result of the t test.
Before: 180, 140, 160, 160, 220, 185, 145, 160, 160, 170
After:  170, 145, 145, 125, 205, 185, 150, 150, 145, 155
# enter the data
before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
after <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)
bp <- data.frame(before, after)
> # t test
> t.test(before, after, paired=TRUE)
Paired t-test
data: before and after
t = 2.7924, df = 9, p-value = 0.02097
The result above shows that after treatment blood pressure fell by 10.5 mmHg, with a 95% confidence interval from 2.0 mmHg to 19 mmHg and p = 0.0209. We therefore have evidence that the reduction in blood pressure is statistically significant.
Note that if we wrongly analyse these data with the test for two independent groups shown below, the value p = 0.32 would suggest that the reduction is not statistically significant!
> t.test(before, after)
Welch Two Sample t-test
data: before and after
t = 1.0208, df = 17.998, p-value = 0.3209
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.11065  32.11065
sample estimates:
mean of x mean of y
    168.0     157.5
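Why the paired analysis is so much more sensitive can be seen by computing it by hand: a paired t test is a one-sample t test on the within-person differences. The sketch below reuses the before and after values given in the text; the intermediate names d, t_paired and p_paired are illustrative.

```r
before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
after  <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)

d <- before - after                 # within-person change
n <- length(d)

mean_d   <- mean(d)                 # average reduction in blood pressure
se_d     <- sd(d) / sqrt(n)         # standard error of the mean difference
t_paired <- mean_d / se_d           # one-sample t statistic on the differences
p_paired <- 2 * pt(abs(t_paired), df = n - 1, lower.tail = FALSE)

# the same numbers from R's paired test
res <- t.test(before, after, paired = TRUE)
c(mean_d = mean_d, t = t_paired, p = p_paired)
```

Working with differences removes the large variation between people, which is exactly what the independent-groups analysis fails to do.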
9.9 Frequency
The table function in R gives the frequencies of a categorical variable such as sex and ethnicity.
> table(sex)
sex
Female   Male
    69     31
> table(ethnicity)
ethnicity
  African     Asian Caucasian    Others
        8        60        30         2
Note that in the tables above, table does not give percentages. To compute percentages we need the prop.table function, which can be used as follows:
# create an object named freq that holds the frequency table
> freq <- table(sex, ethnicity)
# inspect the result
> freq
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2
# use margin.table to see the marginal totals
> margin.table(freq, 1)
sex
Female   Male
    69     31
> margin.table(freq, 2)
ethnicity
  African     Asian Caucasian    Others
        8        60        30         2
In the table above, prop.table gives the ethnic composition within each sex. For example, among women (Female), 5.8% are African, 62.3% are Asian and 31.9% are Caucasian, for a total of 100%. Similarly, among men the proportion of Africans is 12.9%, of Asians 54.8%, and so on.
# compute percentages with prop.table
> prop.table(freq, 2)
        ethnicity
sex        African     Asian Caucasian    Others
  Female 0.5000000 0.7166667 0.7333333 0.0000000
  Male   0.5000000 0.2833333 0.2666667 1.0000000
In the table above, prop.table gives the sex composition within each ethnic group. For example, in the Asian group, 71.7% are female and 28.3% are male.
# compute percentages over the whole table
> freq/sum(freq)
        ethnicity
sex      African Asian Caucasian Others
  Female    0.04  0.43      0.22   0.00
  Male      0.04  0.17      0.08   0.02
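The three kinds of percentage can be reproduced without the original data by typing the frequency table in directly. The sketch below rebuilds freq as a matrix from the counts shown above; the names by_row, by_col and overall are illustrative.

```r
# the sex-by-ethnicity counts from the text, entered as a matrix
freq <- matrix(c(4, 43, 22, 0,
                 4, 17,  8, 2),
               nrow = 2, byrow = TRUE,
               dimnames = list(sex = c("Female", "Male"),
                               ethnicity = c("African", "Asian", "Caucasian", "Others")))

by_row  <- prop.table(freq, 1)   # within each sex: every row sums to 1
by_col  <- prop.table(freq, 2)   # within each ethnic group: every column sums to 1
overall <- prop.table(freq)      # share of the whole table, same as freq / sum(freq)

round(100 * by_row, 1)           # row percentages
round(100 * by_col, 1)           # column percentages
```

The second argument of prop.table chooses the margin that is held fixed: 1 for rows, 2 for columns, and nothing for the grand total.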
$$ z = \frac{x - n\pi}{\sqrt{n\pi(1-\pi)}} $$
Here z follows the normal distribution with mean 0 and variance 1. Equivalently, $z^2$ follows the chi-squared distribution with 1 degree of freedom.
$$ V_d = \left( \frac{1}{n_1} + \frac{1}{n_2} \right) p(1-p) $$
where:
$$ p = \frac{x_1 + x_2}{n_1 + n_2} $$
Hence $z = d / \sqrt{V_d}$ follows the normal distribution with mean 0 and variance 1. In other words, $z^2$ follows the chi-squared distribution with 1 degree of freedom. We can therefore also use prop.test to test two proportions.
Example 6. A study was carried out to compare the efficacy of an anti-fracture drug. Patients were divided into two groups: group A, treated, with 100 patients, and group B, untreated, with 110 patients. After 12 months of follow-up, 7 people in group A and 20 people in group B had had a fracture. The question is whether the fracture rates in the two groups are equal (that is, whether the drug has no effect). To test whether the two proportions truly differ, we can use prop.test(x, n, ) as follows:
> fracture <- c(7, 20)
> total <- c(100, 110)
> prop.test(fracture, total)
2-sample test for equality of proportions with continuity correction
data: fracture out of total
X-squared = 4.8901, df = 1, p-value = 0.02701
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.20908963 -0.01454673
sample estimates:
   prop 1    prop 2
0.0700000 0.1818182
The analysis shows that the fracture rate is 0.07 in group 1 and 0.18 in group 2. The 95% confidence interval for the difference between the two groups runs from about 0.01 to 0.21 (that is, 1% to 21%). With p = 0.027, we can say that the fracture rate in group A is indeed lower than in group B.
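The link between the z statistic and prop.test can be verified with the numbers of Example 6. In the sketch below the pooled-variance z is computed by hand and its square is compared with the chi-squared statistic that prop.test reports when the continuity correction is switched off; correct = FALSE is used here only to make the two numbers comparable, and the variable names are illustrative.

```r
x <- c(7, 20)      # fractures in groups A and B
n <- c(100, 110)   # patients in groups A and B

p_hat  <- x / n                      # observed fracture rates
d      <- p_hat[1] - p_hat[2]        # difference between the two rates
p_pool <- sum(x) / sum(n)            # pooled rate under the null hypothesis
V_d    <- (1 / n[1] + 1 / n[2]) * p_pool * (1 - p_pool)   # variance of d
z      <- d / sqrt(V_d)

# prop.test without continuity correction returns z squared
res <- prop.test(x, n, correct = FALSE)
c(z = z, z_squared = z^2, prop_test = unname(res$statistic))
```

With the default continuity correction the statistic is somewhat smaller (4.8901 in the output above), so the hand calculation matches only the uncorrected version.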
We want to know whether the proportion of females differs among the 4 ethnic groups. To answer this question, we again use prop.test as follows:
> female <- c( 4, 43, 22, 0)
> total <- c(8, 60, 30, 2)
> prop.test(female, total)
4-sample test for equality of proportions without continuity correction
data: female out of total
X-squared = 6.2646, df = 3, p-value = 0.09942
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4
0.5000000 0.7166667 0.7333333 0.0000000
Warning message:
Chi-squared approximation may be incorrect in: prop.test(female, total)
Although the proportion of females appears to differ considerably between groups (73% in group 3 (Caucasian), against 50% in group 1 (African) and 71.7% in the Asian group), the chi-squared test indicates that, statistically, these proportions do not differ, because p = 0.099.
9.12.1 Chi-squared test (chisq.test)
In fact, the chi-squared test can also be computed with the chisq.test function, as follows:
> chisq.test(sex, ethnicity)
Pearson's Chi-squared test
data: sex and ethnicity
X-squared = 6.2646, df = 3, p-value = 0.09942
Warning message:
Chi-squared approximation may be incorrect in: chisq.test(sex, ethnicity)
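The warning arises because some expected counts are small. The sketch below rebuilds the table from the counts given earlier, reproduces the statistic, inspects the expected counts, and then uses a Monte Carlo P-value as one possible alternative. Using simulate.p.value here is a suggestion of this sketch rather than something the text prescribes, and the seed and number of replicates are arbitrary.

```r
freq <- matrix(c(4, 43, 22, 0,
                 4, 17,  8, 2),
               nrow = 2, byrow = TRUE,
               dimnames = list(sex = c("Female", "Male"),
                               ethnicity = c("African", "Asian", "Caucasian", "Others")))

res <- suppressWarnings(chisq.test(freq))   # same test as chisq.test(sex, ethnicity)
res$statistic                               # Pearson chi-squared statistic
round(res$expected, 2)                      # expected counts; several are below 5
expected_min <- min(res$expected)

# P-value by simulation, which does not rely on the large-sample approximation
set.seed(11)
res_sim <- chisq.test(freq, simulate.p.value = TRUE, B = 10000)
res_sim$p.value
```

When many expected counts fall below 5, the simulated P-value or fisher.test is usually preferred to the asymptotic chi-squared P-value.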
CHAPTER X
LINEAR REGRESSION ANALYSIS
10
Linear regression analysis
Linear regression analysis is perhaps one of the most widely used methods of data analysis in statistics. Anon once wrote, "Give a person 3 weapons, the correlation coefficient, linear regression and a pen, and the person will use all three!" In this chapter I introduce the use of R for linear regression analysis and related methods such as the correlation coefficient and statistical hypothesis testing.
Example 1. To illustrate the problem, consider the following study, in which the researchers measured blood cholesterol in 18 male subjects. The body mass index (BMI) was also estimated for each subject, calculated as weight (in kg) divided by height squared (m2). The measurements are as follows:
Table 1. Age, body mass index and cholesterol

ID (id)   Age (age)   BMI (bmi)   Cholesterol (chol)
1         46          25.4        3.5
2         20          20.6        1.9
3         52          26.2        4.0
4         30          22.6        2.6
5         57          25.4        4.5
6         25          23.1        3.0
7         28          22.7        2.9
8         36          24.9        3.8
9         22          19.8        2.1
10        43          25.3        3.8
11        57          23.2        4.1
12        33          21.8        3.0
13        22          20.9        2.5
14        63          26.7        4.6
15        40          26.4        3.2
16        48          21.2        4.2
17        28          21.2        2.3
18        49          22.8        4.0
A quick look at the data suggests that the older the person, the higher the cholesterol. Let us enter these data into R and draw a scatter plot as follows:
> age <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
[Figure: scatter plot of chol (y-axis) against age (x-axis).]
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
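As a check on the correlation formula, the following sketch enters age and chol from Table 1, draws the scatter plot, and computes r term by term before comparing it with R's built-in cor(). The helper names dx and dy are introduced here for illustration only.

```r
# Data from Table 1
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,2.5,4.6,3.2,4.2,2.3,4.0)

plot(chol ~ age, pch = 16)          # scatter plot of cholesterol against age

dx <- age  - mean(age)              # deviations of x from its mean
dy <- chol - mean(chol)             # deviations of y from its mean
r  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

c(by_formula = r, builtin = cor(age, chol))
```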
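$$ y_i = \alpha + \beta x_i + \varepsilon_i \qquad [1] $$

The least-squares estimates are the values of $\alpha$ and $\beta$ for which $\sum_{i=1}^{n} [y_i - (\alpha + \beta x_i)]^2$ is smallest:

$$ \hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad [2] $$

$$ \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} \qquad [3] $$

Here, $\bar{x}$ and $\bar{y}$ are the means of the variables x and y. Note that I write $\hat{\alpha}$ and $\hat{\beta}$ (with a hat on top) as a reminder that these are two estimates of $\alpha$ and $\beta$, not $\alpha$ and $\beta$ themselves (we do not know $\alpha$ and $\beta$ exactly; we can only estimate them).
Once we have the estimates $\hat{\alpha}$ and $\hat{\beta}$, we can estimate the mean cholesterol for each age as follows:

$$ \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i $$

$$ s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2} \qquad [4] $$

$s^2$ is the estimate of $\sigma^2$.
In linear regression analysis, we usually want to know whether the coefficient $\beta$ equals 0 or differs from 0. If $\beta$ equals 0, there is no relationship between x and y; if $\beta$ differs from 0, we have evidence that x and y are related. To test the hypothesis $\beta = 0$ we use the following t test:

$$ t = \frac{\hat{\beta}}{SE(\hat{\beta})} \qquad [5] $$

$SE(\hat{\beta})$ is the standard error of the estimate $\hat{\beta}$. In the equation above, t follows the t distribution with n-2 degrees of freedom (if $\beta$ is truly 0).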
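Equations [2] to [5] can be evaluated by hand on the Table 1 data before turning to lm. The sketch below is illustrative; the formula used for the standard error of the slope, sqrt(s^2 / sum((x - mean(x))^2)), is the standard one for simple regression and is not spelled out in the text above.

```r
# Data from Table 1
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,2.5,4.6,3.2,4.2,2.3,4.0)
n    <- length(age)

sxx   <- sum((age - mean(age))^2)
beta  <- sum((age - mean(age)) * (chol - mean(chol))) / sxx   # equation [2]
alpha <- mean(chol) - beta * mean(age)                        # equation [3]

fitted_y <- alpha + beta * age
s2       <- sum((chol - fitted_y)^2) / (n - 2)                # equation [4]

se_beta <- sqrt(s2 / sxx)            # standard error of the slope (standard formula)
t_stat  <- beta / se_beta                                     # equation [5]
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(c(alpha = alpha, beta = beta, s = sqrt(s2), t = t_stat), 4)
coef(lm(chol ~ age))                 # should agree with alpha and beta
```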
10.2.2 Simple linear regression analysis with R
The function lm (short for linear model) in R computes the values of $\hat{\alpha}$ and $\hat{\beta}$, as well as $s^2$, quickly. We continue the example in R as follows:
Call:
lm(formula = chol ~ age)

Coefficients:
(Intercept)          age
    1.08922      0.05779

Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The second command, summary(reg), asks R to list the information computed in reg. The output has 3 parts:
(a) Part 1 describes the residuals of the regression model:

Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

We know that the mean of the residuals must be 0; here the median is -0.04, which is not far from 0. The 25% (1Q) and 75% (3Q) quantiles are also fairly symmetric around the median, which indicates that the residuals of this equation are reasonably balanced.
(b) Part 2 presents the estimates $\hat{\alpha}$ and $\hat{\beta}$ together with their standard errors and t values. The t value for $\hat{\beta}$ is 10.704 with p = 1.06e-08, indicating that $\beta$ is not 0. In other words, we have evidence of a relationship between cholesterol and age, and this relationship is statistically significant.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(c) Part 3 gives information on the variance of the residuals (residual mean square). Here the residual standard error is s = 0.3027, so s2 = 0.3027^2, or about 0.0916. The output also contains an F test, which likewise tests whether $\beta$ is truly 0 and therefore has the same meaning as the t test above. In general, in simple linear regression (with one predictor) we need not pay attention to the F test.

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775,     Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF,  p-value: 1.058e-08
In addition, part 3 gives an important quantity, R2, the coefficient of determination. It is estimated by the formula:

$$ R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad [6] $$

That is, the sum of squared differences between the fitted values and the mean, divided by the sum of squared differences between the observed values and the mean. R2 in this example is 0.8775, meaning that the linear equation (with age as the single predictor) explains about 88% of the differences in cholesterol between individuals. R2 takes values from 0 to 1 (0 to 100%). The higher R2 is, the closer the relationship between age and cholesterol.
Another coefficient worth mentioning is the adjusted coefficient of determination (labelled Adjusted R-squared in the R output above). It indicates how much the residual variance improves because age is in the linear model. In general, it differs little from the coefficient of determination, and we need not dwell on it.
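Formula [6] can be verified numerically. The sketch below recomputes R2 from the fitted values and then the adjusted R2; the adjustment formula 1 - (1 - R2)(n - 1)/(n - p - 1), with p the number of predictors, is the standard definition and is an addition here, not something stated in the text.

```r
# Data from Table 1
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,2.5,4.6,3.2,4.2,2.3,4.0)
reg  <- lm(chol ~ age)

ss_reg <- sum((fitted(reg) - mean(chol))^2)   # numerator of [6]
ss_tot <- sum((chol - mean(chol))^2)          # denominator of [6]
r2     <- ss_reg / ss_tot

n <- length(chol)
p <- 1                                        # one predictor (age)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(R2 = r2, adjusted_R2 = adj_r2)
```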
10.2.3 Assumptions of linear regression analysis
All of the analyses above rest on several important assumptions, as follows:
          6          12          18
0.466072660 0.003765579 0.079151419
#ask R to set aside 4 plotting windows
#draw the plots for reg
[Figure: four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance).]
Figure 10.2. Residual analysis to check the assumptions of linear regression analysis.
(a) The plot at the left of row 1 shows the residuals $e_i$ against the predicted cholesterol values $\hat{y}_i$. The residuals cluster around the line y = 0, so assumption (c), that $\varepsilon_i$ has mean 0, is acceptable.
(b) The plot at the right of row 1 shows the residuals against their expected values under the normal distribution. The residuals lie very close to the normal line, so assumption (b), that $\varepsilon_i$ follows a normal distribution, can also be considered met.
(c) The plot at the left of row 2 shows the square root of the standardized residuals against $\hat{y}_i$. It shows no systematic difference in the standardized residuals across values of $\hat{y}_i$, so assumption (d), that $\varepsilon_i$ has constant variance $\sigma^2$ for all $x_i$, can also be considered met.
Overall, the residual analysis lets us conclude that the linear regression model describes the relationship between age and cholesterol fairly completely and reasonably.
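The residual checks described in (a) to (c) can be sketched as follows. The four panels come from plot() applied to an lm object; the Shapiro-Wilk test at the end is a numerical companion added here for illustration and is not part of the original analysis.

```r
# Data from Table 1
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,2.5,4.6,3.2,4.2,2.3,4.0)
reg  <- lm(chol ~ age)

op <- par(mfrow = c(2, 2))   # split the device into a 2 x 2 grid
plot(reg)                    # residuals vs fitted, normal Q-Q, scale-location, leverage
par(op)                      # restore the previous layout

res <- resid(reg)
mean(res)                    # essentially 0 by construction of least squares
shapiro.test(res)            # added check of normality (not used in the text)
```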
10.2.4 Prediction model
Once the model for predicting cholesterol has been checked and its validity established, we can draw the line describing the relationship between age and cholesterol with the command abline, as follows (recall that the analysis object is reg):
[Figure: chol (y-axis) against age (x-axis), with the fitted regression line.]
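But each value $\hat{y}_i$ is computed from the estimates $\hat{\alpha}$ and $\hat{\beta}$, both of which have standard errors, so the predicted value $\hat{y}_i$ also carries error. In other words, $\hat{y}_i$ is only a mean; in practice it may be higher or lower depending on the sample. The 95% confidence interval can be computed in R with the following commands: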
> reg <- lm(chol ~ age)
> new <- data.frame(age = seq(15, 70, 5))
[Figure: chol (y-axis) against age (x-axis).]
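One way to obtain and draw the intervals is with predict(). This is a sketch under the assumption that both the confidence interval for the mean and the wider prediction interval for an individual are of interest; the object names ci and pred_int, the colours and the line types are illustrative choices, not the original commands.

```r
# Data from Table 1
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,2.5,4.6,3.2,4.2,2.3,4.0)

reg <- lm(chol ~ age)
new <- data.frame(age = seq(15, 70, 5))

ci       <- predict(reg, new, interval = "confidence")   # interval for the mean at each age
pred_int <- predict(reg, new, interval = "prediction")   # interval for a new individual

plot(chol ~ age, xlim = c(15, 70), ylim = range(pred_int), pch = 16)
abline(reg)
matlines(new$age, ci[, c("lwr", "upr")], lty = 2, col = "blue")
matlines(new$age, pred_int[, c("lwr", "upr")], lty = 3, col = "red")
```

The prediction band is always wider than the confidence band because it adds the residual variance of a single observation to the uncertainty of the fitted mean.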
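linear regression model). In practice, we can extend this model to several variables rather than a single one as above, for example:

$$ y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i \qquad [7] $$

Note that the equation above has several x variables (x1, x2, ..., xk), each with a parameter $\beta_j$ (j = 1, 2, ..., k) to be estimated. This model is therefore also called the multiple linear regression model.
Estimation of $\beta_j$ also relies mainly on least squares. Let $\hat{y}_i = \hat{\alpha} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \dots + \hat{\beta}_k x_{ki}$ be the estimate of $y_i$; least squares finds the values $\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_k$ for which $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is smallest. For the multiple linear regression model, the most compact way to write and describe the model is matrix notation. Model [7] can be expressed in matrix notation as follows:

$$ Y = X\beta + \varepsilon $$

$$ X = \begin{pmatrix} 1 & x_{11} & x_{21} & \dots & x_{k1} \\ 1 & x_{12} & x_{22} & \dots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \dots & x_{kn} \end{pmatrix}, \quad \beta = \begin{pmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} $$

Least squares solves for the vector $\hat{\beta}$ with the following equation:

$$ \hat{\beta} = (X^T X)^{-1} X^T Y $$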
T = Y Y
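The matrix solution can be checked directly in R on the Table 1 data. The sketch below builds X with a leading column of ones, so the first element of the solution is the intercept; solve(A, b) is used rather than forming the inverse explicitly, which is a numerical choice made here and not something the text prescribes.

```r
# Data from Table 1
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
bmi  <- c(25.4,20.6,26.2,22.6,25.4,23.1,22.7,24.9,19.8,25.3,23.2,21.8,20.9,26.7,26.4,21.2,21.2,22.8)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,2.5,4.6,3.2,4.2,2.3,4.0)

X <- cbind(1, age, bmi)        # design matrix: column of ones, then the predictors
Y <- chol

# Solve the normal equations (X'X) b = X'Y
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

# Side-by-side comparison with lm()
cbind(matrix = drop(beta_hat), lm = coef(lm(chol ~ age + bmi)))
```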
[Figure: pairwise scatter plots of age, bmi and chol.]
As with age and cholesterol, the relationship between bmi and cholesterol also roughly follows a straight line. The figure above also shows that age and bmi are related to each other. Indeed, a simple linear regression of cholesterol on bmi shows that this relationship is statistically significant:
> summary(lm(chol ~ bmi))

Call:
lm(formula = chol ~ bmi)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9403 -0.3565 -0.1376  0.3040  1.4330

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.83187    1.60841  -1.761  0.09739 .
bmi          0.26410    0.06861   3.849  0.00142 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.623 on 16 degrees of freedom
Multiple R-Squared: 0.4808,     Adjusted R-squared: 0.4483
F-statistic: 14.82 on 1 and 16 DF,  p-value: 0.001418
BMI explains about 48% of the variation in cholesterol between individuals. But because BMI is also related to age, we want to know which of the two factors matters more when they are analysed together. To assess the effect of both age (x1) and bmi (call it x2) on cholesterol (y), we use a multiple linear regression model:

$$ y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i $$

or, in the matrix notation presented above, $Y = X\beta + \varepsilon$. Here, Y is an 18 x 1 vector, X is an 18 x 3 matrix (a column of ones, age and bmi), $\beta$ is a 3 x 1 vector ($\alpha$, $\beta_1$, $\beta_2$), and $\varepsilon$ is an 18 x 1 vector. To estimate the two regression coefficients $\beta_1$ and $\beta_2$, we again use the function lm() in R as follows:
> mreg <- lm(chol ~ age + bmi)
> summary(mreg)
Call:
lm(formula = chol ~ age + bmi)

Residuals:
    Min      1Q  Median      3Q     Max
-0.3762 -0.2259 -0.0534  0.1698  0.5679

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.455458   0.918230   0.496    0.627
age         0.054052   0.007591   7.120 3.50e-06 ***
bmi         0.033364   0.046866   0.712    0.487
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3074 on 15 degrees of freedom
Multiple R-Squared: 0.8815,     Adjusted R-squared: 0.8657
F-statistic: 55.77 on 2 and 15 DF,  p-value: 1.132e-07
The equation indicates that cholesterol rises by 0.054 mg/L for each additional year of age (an estimate not much different from 0.0578 in the equation with age alone), and by 0.033 mg/L for each 1 kg/m2 increase in BMI. Together the two factors explain about 88.2% (R2 = 0.8815) of the variation in cholesterol between individuals.
Recall that the equation with age alone (in the previous analysis) explained about 87.7% of the variation in cholesterol between individuals. When we add BMI, this rises to 88.2%, a gain of only about 0.5 percentage points. The question is whether this gain is statistically significant. The answer can be read from the test for bmi, with p = 0.487. Thus, bmi provides no additional information for predicting cholesterol beyond what we already have from age. In other words, once age is taken into account, the effect of bmi is no longer statistically significant. This is understandable, because Figure 10.5 shows that age and bmi are fairly strongly related. Since the two variables are correlated, we do not need both in the equation. (However, this example is meant only to illustrate how to carry out multiple linear regression in R; it is not intended to model the data along biological lines.)
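The question of whether adding bmi improves the model can also be put as a comparison of two nested models. The sketch below uses anova() for this; with a single added variable, the F test it reports is equivalent to the t test for bmi in summary(mreg), so both give the same p-value.

```r
# Data from Table 1
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
bmi  <- c(25.4,20.6,26.2,22.6,25.4,23.1,22.7,24.9,19.8,25.3,23.2,21.8,20.9,26.7,26.4,21.2,21.2,22.8)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,2.5,4.6,3.2,4.2,2.3,4.0)

reg  <- lm(chol ~ age)          # model with age only
mreg <- lm(chol ~ age + bmi)    # model with age and bmi

cmp <- anova(reg, mreg)         # F test for the contribution of bmi
cmp

# Gain in R-squared from adding bmi
r2_gain <- summary(mreg)$r.squared - summary(reg)$r.squared
r2_gain
```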
[Figure 10.6: four diagnostic panels for the multiple regression model: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance).]
Although BMI is not statistically significant in this case, Figure 10.6 shows that the assumptions of the linear regression model can be considered met.
Id    Hardwood concentration (x)    Tensile strength (y)
1     1.0                           6.3
2     1.5                           11.1
3     2.0                           20.0
4     3.0                           24.0
5     4.0                           26.1
6     4.5                           30.0
7     5.0                           33.8
8     5.5                           34.0
9     6.0                           38.1
10    6.5                           39.9
11    7.0                           42.0
12    8.0                           46.1
13    9.0                           53.1
14    10.0                          52.0
15    11.0                          52.5
16    12.0                          48.0
17    13.0                          42.8
18    14.0                          27.8
19    15.0                          21.9
 Median      3Q     Max
  2.938   7.675  15.840

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.3213     5.4302   3.926  0.00109 **
conc          1.7710     0.6478   2.734  0.01414 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.82 on 17 degrees of freedom
Multiple R-Squared: 0.3054,     Adjusted R-squared: 0.2645
F-statistic: 7.474 on 1 and 17 DF,  p-value: 0.01414
The results show that this simple linear regression model (strength = 21.32 + 1.77*conc) explains about 31% of the variance of strength. The estimated variance of this model is s2 = (11.82)^2 = 139.7.
Now let us look at the plot and the fitted line of this model:
> plot(strength ~ conc,
       xlab="Concentration of hardwood",
       ylab="Tensile strength",
       main="Relationship between hardwood concentration \n and tensile strength",
       pch=16)
> abline(simple.model)
[Figure: scatter plot of tensile strength against concentration of hardwood, with the fitted straight line.]

$$ y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 $$

We now use R to estimate the three parameters above.
> quadratic <- lm(strength ~ poly(conc, 2))
> summary(quadratic)
Call:
     3Q     Max
 4.1350  6.5506
Pr(>|t|)
2.73e-16 ***
1.76e-06 ***
1.89e-08 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = strength ~ poly(conc, 3))

Residuals:
     Min       1Q   Median       3Q      Max
-4.62503 -1.61085  0.04125  1.58922  5.02159

Coefficients:
# Draw the three fits: straight line, quadratic and cubic
> plot(strength ~ conc, pch=16,
       main="Hardwood concentration and tensile strength",
       sub="Linear, quadratic, and cubic fits")
> abline(linear, col="black")
> lines(xnew, y2, col="blue", lwd=3)
> lines(xnew, y3, col="red", lwd=4)
[Figure: strength (y-axis) against conc (x-axis), with linear, quadratic, and cubic fits.]
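A self-contained version of the polynomial comparison might look like the sketch below. The data are taken from the hardwood table; the definitions of xnew, y2 and y3 are assumptions about how those objects were built (a fine grid of concentrations and the corresponding predictions), since their original definitions are not shown in this section.

```r
# Data from the hardwood concentration table
conc <- c(1, 1.5, 2, 3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 8, 9, 10, 11, 12, 13, 14, 15)
strength <- c(6.3, 11.1, 20.0, 24.0, 26.1, 30.0, 33.8, 34.0, 38.1, 39.9,
              42.0, 46.1, 53.1, 52.0, 52.5, 48.0, 42.8, 27.8, 21.9)

linear    <- lm(strength ~ conc)
quadratic <- lm(strength ~ poly(conc, 2))
cubic     <- lm(strength ~ poly(conc, 3))

# Assumed construction of the plotting grid and predicted curves
xnew <- seq(min(conc), max(conc), length.out = 100)
y2   <- predict(quadratic, data.frame(conc = xnew))
y3   <- predict(cubic, data.frame(conc = xnew))

plot(strength ~ conc, pch = 16,
     main = "Hardwood concentration and tensile strength",
     sub  = "Linear, quadratic, and cubic fits")
abline(linear, col = "black")
lines(xnew, y2, col = "blue", lwd = 3)
lines(xnew, y3, col = "red", lwd = 4)

# Sequential F tests: does each added polynomial term reduce the residual sum of squares?
anova(linear, quadratic, cubic)
```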
$$ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

$$ AIC = \log\left(\frac{RSS}{n}\right) + \frac{2k}{n} $$

The model with the lowest AIC is regarded as the optimal model. In the following example, we use the function step to search for an optimal model based on AIC.
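To connect the AIC formula with what step reports, the sketch below evaluates it on the small cholesterol data set from Example 1 (used here only because it is short). It assumes that k counts all estimated regression coefficients including the intercept; under that assumption the value that step prints, obtained from extractAIC(), is n times the formula above, so both rank models identically.

```r
# Cholesterol data from Example 1 (Table 1), used here for illustration
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
bmi  <- c(25.4,20.6,26.2,22.6,25.4,23.1,22.7,24.9,19.8,25.3,23.2,21.8,20.9,26.7,26.4,21.2,21.2,22.8)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,2.5,4.6,3.2,4.2,2.3,4.0)

fit <- lm(chol ~ age + bmi)
n   <- length(chol)
k   <- length(coef(fit))           # assumed: k includes the intercept
rss <- sum(resid(fit)^2)

aic_text <- log(rss / n) + 2 * k / n    # formula as written above
aic_step <- extractAIC(fit)[2]          # n*log(RSS/n) + 2k, the scale used by step()

c(aic_text = aic_text, aic_step_over_n = aic_step / n)
```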
Example 4. A study of the effects of factors such as temperature, time, and chemical composition on CO2 output. The data from this study are summarised in Table 2. The main aim of the study is to find a linear regression model that predicts CO2 output, and to assess the effects of these factors.
Table 2. CO2 output and some factors that may affect CO2

Id   y      X1    X2   X3     X4     X5       X6       X7
1    36.98  5.1   400  51.37  4.24   1484.83  2227.25  2.06
2    13.74  26.4  400  72.33  30.87  289.94   434.90   1.33
3    10.08  23.8  400  71.44  33.01  320.79   481.19   0.97
4    8.53   46.4  400  79.15  44.61  164.76   247.14   0.62
5    36.42  7.0   450  80.47  33.84  1097.26  1645.89  0.22
6    26.59  12.6  450  89.90  41.26  605.06   907.59   0.76
7    19.07  18.9  450  91.48  41.88  405.37   608.05   1.71
8    5.96   30.2  450  98.60  70.79  253.70   380.55   3.93
9    15.52  53.8  450  98.05  66.82  142.27   213.40   1.97
10   56.61  5.6   400  55.69  8.92   1362.24  2043.36  5.08
11   26.72  15.1  400  66.29  17.98  507.65   761.48   0.60
12   20.80  20.3  400  58.94  17.79  377.60   566.40   0.90
13   6.99   48.4  400  74.74  33.94  158.05   237.08   0.63
14   45.93  5.8   425  63.71  11.95  130.66   1961.49  2.04
15   43.09  11.2  425  67.14  14.73  682.59   1023.89  1.57
16   15.79  27.9  425  77.65  34.49  274.20   411.30   2.38
17   21.60  5.1   450  67.22  14.48  1496.51  2244.77  0.32
18   35.19  11.7  450  81.48  29.69  652.43   978.64   0.44
19   26.14  16.7  450  83.88  26.33  458.42   687.62   8.82
20   8.60   24.8  450  89.38  37.98  312.25   468.38   0.02
21   11.63  24.9  450  79.77  25.66  307.08   460.62   1.72
22   9.59   39.5  450  87.93  22.36  193.61   290.42   1.88
23   4.42   29.0  450  79.50  31.52  155.96   233.95   1.43
24   38.89  5.5   460  72.73  17.86  1392.08  2088.12  1.35
25   11.19  11.5  450  77.88  25.20  663.09   994.63   1.61
26   75.62  5.2   470  75.50  8.66   1464.11  2196.17  4.78
27   36.03  10.6  470  83.15  22.39  720.07   1080.11  5.88

Note: y = CO2 output; X1 = time (minutes); X2 = temperature (C); X3 = percent solvation; X4 = oil yield (g/100g); X5 = coal total; X6 = solvent total; X7 = hydrogen consumption.
Before analysing the data, we need to enter them into R with the usual commands. The data are stored in the object REGdata.
> y <- c(36.98,13.74,10.08, 8.53,36.42,26.59,19.07, 5.96,15.52,56.61,
26.72,20.80, 6.99,45.93,43.09,15.79,21.60,35.19,26.14, 8.60,
11.63, 9.59, 4.42,38.89,11.19,75.62,36.03)
> x1 <- c(5.1,26.4,23.8,46.4, 7.0,12.6,18.9,30.2,53.8,5.6,15.1,20.3,48.4,
5.8,11.2,27.9,5.1,11.7,16.7,24.8,24.9,39.5,29.0, 5.5, 11.5,
5.2,10.6)
> x2 <- c(400,400, 400, 400, 450, 450, 450, 450, 450, 400, 400, 400,
400, 425, 425, 425, 450, 450, 450, 450, 450, 450, 450, 460,
450, 470, 470)
> x3 <- c(51.37,72.33,71.44,79.15,80.47,89.90,91.48,98.60,98.05,55.69,
66.29,58.94,74.74,63.71,67.14,77.65,67.22,81.48,83.88,89.38,
79.77,87.93,79.50,72.73,77.88,75.50,83.15)
> x4 <- c(4.24,30.87,33.01,44.61,33.84,41.26,41.88,70.79,66.82,
8.92,17.98,17.79,33.94,11.95,14.73,34.49,14.48,29.69,26.33,
37.98,25.66,22.36,31.52,17.86,25.20, 8.66,22.39)
> x5 <- c(1484.83, 289.94, 320.79, 164.76, 1097.26, 605.06, 405.37,
253.70, 142.27,1362.24, 507.65, 377.60, 158.05, 130.66,
682.59, 274.20, 1496.51, 652.43, 458.42, 312.25, 307.08,
193.61, 155.96,1392.08, 663.09,1464.11, 720.07)
> x6 <- c(2227.25, 434.90, 481.19, 247.14,1645.89, 907.59, 608.05,
380.55, 213.40,2043.36, 761.48, 566.40, 237.08,1961.49,1023.89,
411.30,2244.77, 978.64, 687.62, 468.38, 460.62, 290.42,
233.95,2088.12, 994.63,2196.17,1080.11)
> x7 <- c(2.06,1.33,0.97,0.62,0.22,0.76,1.71,3.93,1.97,5.08,0.60,0.90,
0.63,2.04,1.57,2.38,0.32,0.44,8.82,0.02,1.72,1.88,1.43,
1.35,1.61,4.78,5.88)
> REGdata <- data.frame(y, x1,x2,x3,x4,x5,x6,x7)
We now begin the analysis. The first model includes all 7 independent variables, as follows:
> reg <- lm(y ~ x1+x2+x3+x4+x5+x6+x7, data=REGdata)
> summary(reg)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = REGdata)

Residuals:
    Min      1Q  Median      3Q     Max
-20.035  -4.681  -1.144   4.072  21.214

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 53.937016  57.428952   0.939   0.3594
x1          -0.127653   0.281498  -0.453   0.6553
x2          -0.229179   0.232643  -0.985   0.3370
x3           0.824853   0.765271   1.078   0.2946
x4          -0.438222   0.358551  -1.222   0.2366
x5          -0.001937   0.009654  -0.201   0.8431
x6           0.019886   0.008088   2.459   0.0237 *
x7           1.993486   1.089701   1.829   0.0831 .
---
Signif. codes:  0 '***'
The results show that all 7 variables together explain about 73% of the variance of y. But of these 7 variables, only x6 is statistically significant (p = 0.024). Let us try reducing the model to a simple linear regression with x6 alone.
> summary(lm(y ~ x6, data=REGdata))
Call:
lm(formula = y ~ x6, data = REGdata)

Residuals:
    Min      1Q  Median      3Q     Max
-28.081  -5.829  -0.839   5.522  26.882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.144181   3.483064   1.764     0.09 .
x6          0.019395   0.002932   6.616 6.24e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.7 on 25 degrees of freedom
Multiple R-Squared: 0.6365,     Adjusted R-squared: 0.6219
F-statistic: 43.77 on 1 and 25 DF,  p-value: 6.238e-07
[Figure: scatterplot matrix of the variables in REGdata.]
The figure shows that y is related to variables such as x1, x5 and x6. In addition, x5 and x6 are very closely related (almost a straight line), with a correlation coefficient of 0.88. Moreover, x1 is related to x5 and to x6, but through an inverse function. This means that x5 and x6 provide much the same information for predicting y, so we do not need both in the model.
To find an optimal model in the presence of so many correlations, we use step as follows. Note the way the model is specified, lm(y ~ .): the "." asks R to consider all variables in the object REGdata.
> reg <- lm(y ~ ., data=REGdata)
> step(reg, direction="both")
Start:  AIC= 134.07
 y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7

       Df Sum of Sq     RSS    AIC
- x5    1      4.54 2145.37 132.13
- x1    1     23.17 2164.00 132.36
- x2    1    109.34 2250.18 133.42
- x3    1    130.90 2271.74 133.68
<none>              2140.83 134.07
- x4    1    168.31 2309.14 134.12
- x7    1    377.09 2517.92 136.45
- x6    1    681.09 2821.92 139.53

       Df Sum of Sq    RSS   AIC
- x1    1      22.7 2168.1 130.4
- x2    1     113.8 2259.1 131.5
- x3    1     133.5 2278.9 131.8
<none>              2145.4 132.1
- x4    1     170.8 2316.2 132.2
+ x5    1       4.5 2140.8 134.1
- x7    1     375.7 2521.1 134.5
- x6    1    1058.5 3203.8 141.0

       Df Sum of Sq    RSS   AIC
- x2    1      96.8 2264.9 129.6
- x3    1     122.0 2290.0 129.9
<none>              2168.1 130.4
- x4    1     187.4 2355.5 130.7
+ x1    1      22.7 2145.4 132.1
+ x5    1       4.1 2164.0 132.4
- x7    1     385.0 2553.1 132.8
- x6    1    1526.2 3694.3 142.8

       Df Sum of Sq    RSS   AIC
- x3    1      25.4 2290.3 127.9
- x4    1      90.9 2355.8 128.7
<none>              2264.9 129.6
+ x2    1      96.8 2168.1 130.4
+ x5    1       8.3 2256.5 131.5
+ x1    1       5.7 2259.1 131.5
- x7    1     384.9 2649.7 131.8
- x6    1    2015.6 4280.5 144.8

                Df Sum of Sq    RSS   AIC
- x4             1      73.5 2363.8 126.7
<none>                       2290.3 127.9
+ x3             1      25.4 2264.9 129.6
+ (x1 or x5)     1      11.3 2279.0 129.8
+ (x1 or x5)     1       6.3 2284.0 129.8
+ x2             1       0.3 2290.0 129.9
- x7             1     486.6 2776.9 131.1
- x6             1    1993.8 4284.1 142.8

       Df Sum of Sq    RSS   AIC
<none>              2363.8 126.7
+ x4    1      73.5 2290.3 127.9
+ x1    1      33.4 2330.4 128.4
+ x3    1       8.1 2355.8 128.7
+ x5    1       7.7 2356.1 128.7
+ x2    1       7.3 2356.6 128.7
- x7    1     497.3 2861.2 129.9
- x6    1    4477.0 6840.8 153.4
Call:
lm(formula = y ~ x6 + x7, data = REGdata)

Coefficients:
(Intercept)           x6           x7
    2.52646      0.01852      2.18575

  Median       3Q      Max
  0.2513   4.9339  21.9682

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.526460   3.610055   0.700   0.4908
x6          0.018522   0.002747   6.742 5.66e-07 ***
x7          2.185753   0.972696   2.247   0.0341 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.924 on 24 degrees of freedom
Multiple R-Squared: 0.6996,     Adjusted R-squared: 0.6746
F-statistic: 27.95 on 2 and 24 DF,  p-value: 5.391e-07
The detailed analysis (results above) shows that these two variables explain about 70% of the variance of y.
We are now ready to analyse the data with BMA. The function bicreg is written specifically for linear regression analysis. It is applied as follows:
> bma <- bicreg(xvars, co2, strict=FALSE, OR=20)
Call:
bicreg(x = xvars, y = co2, strict = FALSE, OR = 20)

  16  models were selected
 Best  5  models (cumulative posterior probability =  0.6599 ):

           EV        SD       model 1   model 2   model 3   model 4   model 5
Intercept   5.75672  14.6244    2.5264    6.1441    8.6120    7.5936    7.3537
x1         -0.01807   0.1008    .         .         .        -0.1393    .
x2         -0.00075   0.0282    .         .         .         .         .
x3          0.00011   0.0791    .         .         .         .        -0.0572
x4         -0.03059   0.1020    .         .        -0.1419    .         .
x5         -0.00023   0.0030    .         .         .         .         .
x6          0.01815   0.0040    0.0185    0.0193    0.0164    0.0162    0.0179
x7          1.60766   1.2821    2.1857    .         2.1628    2.1233    2.2382

nVar                            2         1         3         3         3
r2                              0.700     0.636     0.709     0.704     0.701
BIC                           -25.8832  -24.0238  -23.4412  -22.9721  -22.6801
post prob                       0.311     0.123     0.092     0.072     0.063
[Figure: "Models selected by BMA": variables x1 to x7 (y-axis) against Model # (x-axis).]
Raftery, Adrian E. (1995). Bayesian model selection in social research (with Discussion).
Sociological Methodology 1995 (Peter V. Marsden, ed.), pp. 111-196, Cambridge, Mass.:
Blackwells.
Some papers related to BMA can be downloaded from the following web page:
www.stat.colostate.edu/~jah/papers.
CHAPTER XI
ANALYSIS OF VARIANCE
11
Analysis of variance
Analysis of variance, as the name suggests, is a set of statistical methods whose focus is the variance (rather than the mean). It belongs to the large family of methods known as linear models (or general linear models), which includes the linear regression we met in the previous chapter. In this chapter, we become familiar with using R for analysis of variance. We begin with a simple analysis, then move on to two-way analysis of variance and commonly used non-parametric methods.
Ho: $\mu_1 = \mu_2 = \mu_3$
HA: there is a difference among the three $\mu_j$ (j = 1, 2, 3)
Table 11.2. Galactose in 3 groups: patients with Crohn's disease, patients with colitis, and controls

Group 1: Crohn's disease   Group 2: colitis   Group 3: controls
1343                       1264               1809   2850
1393                       1314               1926   2964
1420                       1399               2283   2973
1641                       1605               2384   3171
1897                       2385               2447   3257
2160                       2511               2479   3271
2169                       2514               2495   3288
2279                       2767               2525   3358
2890                       2827               2541   3643
                           2895               2769   3657
                           3011
n=9                        n=11               n=20
Mean: 1910                 Mean: 2226         Mean: 2804
SD: 516                    SD: 688            SD: 527
Note: SD is the standard deviation.
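At first the reader, having learned how to compare two groups with the t test, may think that we need 3 comparisons by t test: between groups 1 and 2, groups 2 and 3, and groups 1 and 3. But this approach is not appropriate, because there are three different variances, and because running three separate tests inflates the overall chance of a false positive. The appropriate method for this comparison is analysis of variance. Analysis of variance can be used to compare several groups at the same time (simultaneous comparisons).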
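11.1.1 The analysis of variance model
To illustrate the method, we need some notation. Let the galactose of patient i in group j (j = 1, 2, 3) be $x_{ij}$. The analysis of variance model states that:

$$ x_{ij} = \mu + \alpha_j + \varepsilon_{ij} \qquad [1] $$

Or, more specifically:

$$ x_{i1} = \mu + \alpha_1 + \varepsilon_{i1} $$
$$ x_{i2} = \mu + \alpha_2 + \varepsilon_{i2} $$
$$ x_{i3} = \mu + \alpha_3 + \varepsilon_{i3} $$

That is, the galactose value of any patient equals the mean of the whole population ($\mu$), plus or minus the effect of group j as measured by the effect coefficient $\alpha_j$, plus an error $\varepsilon_{ij}$. A further assumption is that $\varepsilon_{ij}$ follows a normal distribution with mean 0 and variance $\sigma^2$. The two parameters to be estimated are $\mu$ and $\alpha_j$. As in linear regression, they are estimated by least squares, that is, by finding the estimates $\hat{\mu}$ and $\hat{\alpha}_j$ for which $\sum (x_{ij} - \hat{\mu} - \hat{\alpha}_j)^2$ is smallest.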
Group            Number of subjects (nj)   Mean        Variance
1 Crohn          n1 = 9                    x1 = 1910   s1^2 = 265944
2 Colitis        n2 = 11                   x2 = 2226   s2^2 = 473387
3 Controls       n3 = 20                   x3 = 2804   s3^2 = 277500
Whole sample     n = 40                    x = 2444
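Note that:

$$ x_{ij} = \bar{x} + (\bar{x}_j - \bar{x}) + (x_{ij} - \bar{x}_j) \qquad [2] $$

The part $(\bar{x}_j - \bar{x})$ reflects the difference between each group mean and the mean of the whole sample, and the part $(x_{ij} - \bar{x}_j)$ reflects the difference between the galactose of one subject and the mean of that subject's group.
Accordingly, the variation splits into a sum of squares between groups, $\sum_{j} n_j (\bar{x}_j - \bar{x})^2$, and a sum of squares within groups (SSW), $\sum_{j} (n_j - 1) s_j^2$.
SSW is computed from every patient in the 3 groups, so the within-group mean square (MSW) is:
MSW = SSW / (N - k) = 12133922 / (40-3) = 327944
and the mean square between groups is:
[3]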
Source of variation   Degrees of freedom   Sum of squares   Mean square   F test
Between groups        2                    5683620          2841810       8.6655
Within groups         37                   12133923         327944
Total                 39                   17817543
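The entries of the analysis of variance table can be recomputed from the group sizes, means and variances. In the sketch below, the group labels are generated with rep() on the assumption that the galactose values are listed group by group (9 Crohn, 11 colitis, 20 controls), which matches the order of the data vector given later in the text.

```r
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,2479,3271,
               2495,3288,2525,3358,2541,3643,2769,3657)
# Assumed coding: values are ordered by group (sizes 9, 11 and 20)
group <- factor(rep(1:3, times = c(9, 11, 20)))

n_j   <- tapply(galactose, group, length)
m_j   <- tapply(galactose, group, mean)
v_j   <- tapply(galactose, group, var)
grand <- mean(galactose)

ssb <- sum(n_j * (m_j - grand)^2)     # sum of squares between groups
ssw <- sum((n_j - 1) * v_j)           # sum of squares within groups

k <- nlevels(group)
N <- length(galactose)
F_stat  <- (ssb / (k - 1)) / (ssw / (N - k))
p_value <- pf(F_stat, k - 1, N - k, lower.tail = FALSE)

c(SSB = ssb, SSW = ssw, F = F_stat, p = p_value)
anova(lm(galactose ~ group))          # same decomposition from R
```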
For the analysis of variance, we must define the variable group as a factor.
> group <- as.factor(group)
Next, we enter the galactose data for each group as defined above (call the object galactose):
> galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,2479,3271,2495,3288,
2525,3358,2541,3643,2769,3657)
Once the data are ready, we use the function lm() to carry out the analysis of variance as follows:
> analysis <- lm(galactose ~ group)
    3Q    Max
 456.0  979.8

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1910.2      190.9  10.007  4.5e-12 ***
group2         316.3      257.4   1.229 0.226850
group3         894.3      229.9   3.891 0.000402 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(b)
  1      2
2 0.6805 -
3 0.0012 0.0321

P value adjustment method: bonferroni
The results show that the p-value between group 1 (Crohn) and colitis is 0.6805 (not statistically significant); between Crohn and controls it is 0.0012 (statistically significant); and between colitis and controls it is 0.0321 (also statistically significant).
Another method of adjusting the p-value is the Holm method:
> pairwise.t.test(galactose, group)
	Pairwise comparisons using t tests with pooled SD

data:

  1      2
2 0.2268 -
3 0.0012 0.0214

P value adjustment method: holm

data:

  1      2
2 0.2557 -
3 0.0017 0.0544

P value adjustment method: holm

        diff        lwr      upr     p adj
2-1 316.3232 -312.09857  944.745 0.4439821
3-1 894.2778  333.07916 1455.476 0.0011445
3-2 577.9545   53.11886 1102.790 0.0281768
The results show that groups 3 and 1 differ by about 894 units, with a 95% confidence interval from 333 to 1455 units. Similarly, galactose in patients with colitis is lower than in controls (group 3) by about 578 units, with a 95% confidence interval from 53 to 1103.
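The three sets of pairwise comparisons can be reproduced together. As before, the group coding via rep() is an assumption based on the order of the galactose vector; the object names pw_bonf, pw_holm and tk are illustrative.

```r
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,2479,3271,
               2495,3288,2525,3358,2541,3643,2769,3657)
# Assumed coding: values are ordered by group (sizes 9, 11 and 20)
group <- factor(rep(1:3, times = c(9, 11, 20)))

# Pairwise t tests with pooled SD and two p-value adjustments
pw_bonf <- pairwise.t.test(galactose, group, p.adjust.method = "bonferroni")
pw_holm <- pairwise.t.test(galactose, group)   # Holm is the default adjustment

# Tukey's honest significant differences need an aov object
tk <- TukeyHSD(aov(galactose ~ group))

pw_bonf$p.value
pw_holm$p.value
tk$group        # columns: diff, lwr, upr, p adj
plot(tk)        # confidence intervals for the three differences
```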
[Figure: 95% family-wise confidence intervals for the differences 2-1, 3-1 and 3-2.]
Condition (i)    Material (j)
                 1                2                3
1                4.1, 3.9, 4.3    3.1, 2.8, 3.3    3.5, 3.2, 3.6
2                2.7, 3.1, 2.6    1.9, 2.2, 2.3    2.7, 2.3, 2.5
These data can be summarised by the mean for each condition and material in the following table:
Table 11.3. Summary of data from the paint durability experiment

                         Material (j)
Condition (i)            1        2        3        Mean over 3 materials
Mean
1                        4.10     3.07     3.43     3.533
2                        2.80     2.13     2.50     2.478
Mean of 2 conditions     3.450    2.600    2.967    3.00
Variance
1                        0.040    0.063    0.043
2                        0.070    0.043    0.040
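These preliminary calculations suggest that there may be differences (effects) due to condition and to the material tested.
Let $x_{ij}$ be the score for condition i (i = 1, 2) and material j (j = 1, 2, 3). (To keep things simple, we temporarily ignore the subject index k.) The two-way analysis of variance model states that:

$$ x_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij} \qquad [4] $$

Or, more specifically:

$$ x_{11} = \mu + \alpha_1 + \beta_1 + \varepsilon_{11} $$
$$ x_{12} = \mu + \alpha_1 + \beta_2 + \varepsilon_{12} $$
$$ x_{13} = \mu + \alpha_1 + \beta_3 + \varepsilon_{13} $$
$$ x_{21} = \mu + \alpha_2 + \beta_1 + \varepsilon_{21} $$
$$ x_{22} = \mu + \alpha_2 + \beta_2 + \varepsilon_{22} $$
$$ x_{23} = \mu + \alpha_2 + \beta_3 + \varepsilon_{23} $$

$\mu$ is the mean of the whole population; the coefficients $\alpha_i$ (effect of condition i) and $\beta_j$ (effect of material j) must be estimated from the data. $\varepsilon_{ij}$ is assumed to follow a normal distribution with mean 0 and variance $\sigma^2$.
In two-way analysis of variance, we partition the total sum of squares into 3 sources: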
Source       Degrees of freedom   Sum of squares   Mean square   F test
Condition    1                    5.01             5.01          95.6
Material     2                    2.18             1.09          20.8
Residual     14                   0.73             0.052
Total        17                   7.92
Condition   Material   Subject   Score
1           1          1         4.1
1           1          2         3.9
1           1          3         4.3
1           2          4         3.1
1           2          5         2.8
1           2          6         3.3
1           3          7         3.5
1           3          8         3.2
1           3          9         3.6
2           1          10        2.7
2           1          11        3.1
2           1          12        2.6
2           2          13        1.9
2           2          14        2.2
2           2          15        2.3
2           3          16        2.7
2           3          17        2.3
2           3          18        2.5
And create 18 id codes (from 1 to 18):
> id <- 1:18
Response: score
          Df Sum Sq Mean Sq F value    Pr(>F)
condition  1 5.0139  5.0139  95.575 1.235e-07 ***
material   2 2.1811  1.0906  20.788 6.437e-05 ***
Residuals 14 0.7344  0.0525
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The three sources of variation in score are analysed in the table above. Judging by the mean squares, the effect of condition appears more important than the effect of the material tested. However, both effects are statistically significant, since the p-values are very small for both factors.
(c) Estimates. We ask R to summarise the estimates with the command summary:
> summary(twoway)
Call:
lm(formula = score ~ condition + material)

Residuals:
     Min       1Q   Median       3Q      Max
-0.32778 -0.16389  0.03333  0.16111  0.32222

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.9778     0.1080  36.841 2.43e-15 ***
condition2   -1.0556     0.1080  -9.776 1.24e-07 ***
material2    -0.8500     0.1322  -6.428 1.58e-05 ***
material3    -0.4833     0.1322  -3.655   0.0026 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01
$$ R^2 = \frac{5.0139 + 2.1811}{5.0139 + 2.1811 + 0.7344} = 0.9074 $$

To complete the analysis, we need to consider the possibility that the effects of the two factors interact (interactive effects). That is, the model for score becomes:

$$ x_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ij} $$

In the results of this analysis, p = 0.297 for the interaction effect. We therefore have no evidence of an interaction between material and condition, and we accept model [4], that is, the model without interaction.
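The additive and interaction models can be fitted side by side. The sketch below rebuilds condition and material with rep() from the layout of the data table (condition 1 for subjects 1 to 9, condition 2 for subjects 10 to 18, three subjects per material); the object name inter for the interaction model is introduced here for illustration.

```r
score <- c(4.1, 3.9, 4.3, 3.1, 2.8, 3.3, 3.5, 3.2, 3.6,
           2.7, 3.1, 2.6, 1.9, 2.2, 2.3, 2.7, 2.3, 2.5)
condition <- factor(rep(1:2, each = 9))
material  <- factor(rep(rep(1:3, each = 3), times = 2))

twoway <- lm(score ~ condition + material)   # model [4], no interaction
inter  <- lm(score ~ condition * material)   # adds the condition:material term

anova(twoway)          # sums of squares for condition, material, residuals
anova(inter)           # includes the interaction row
anova(twoway, inter)   # direct test of the interaction

summary(twoway)$r.squared
```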
(e) Comparisons between groups. We estimate the differences between the two conditions and the three materials with the function TukeyHSD applied to aov:
> res <- aov(score ~ condition + material)
> TukeyHSD(res)
  Tukey multiple comparisons of means
    95% family-wise confidence level

          diff         lwr        upr     p adj
2-1 -0.8500000 -1.19610279 -0.5038972 0.0000442
3-1 -0.4833333 -0.82943612 -0.1372305 0.0068648
3-2  0.3666667  0.02056388  0.7127695 0.0374069
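[Figure: 95% family-wise confidence intervals for the material differences 2-1, 3-1 and 3-2.]
[Figure: mean of score (y-axis) against material (x-axis), with one line per condition.]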
The question is whether there is any difference in height between urban and rural children. In other words, does place of residence affect height, and if so, by how much?
One factor with a large effect on height is age. During the growing years, height increases with age. A comparison of height between the two groups is therefore objective only if the ages of the two groups are comparable. To ensure an objective comparison, we need to analyse the data with an analysis of covariance model.
The first step is to enter the data into R with the following commands:
Area     Id    Age (months)   Height (cm)
urban    5     119            132.7
urban    6     120            145.4
urban    7     121            135.0
urban    8     124            133.0
urban    9     126            148.5
urban    10    129            148.3
urban    11    130            147.5
urban    12    133            148.8
urban    13    134            133.2
urban    14    135            148.7
urban    15    137            152.0
urban    16    139            150.6
urban    17    141            165.3
urban    18    142            149.9
rural    1     121            139.0
rural    2     121            140.9
rural    3     128            134.9
rural    4     129            149.5
rural    5     131            148.7
rural    6     132            131.0
rural    7     133            142.3
rural    8     134            139.9
rural    9     138            142.9
rural    10    138            147.7
rural    11    138            147.7
rural    12    140            134.6
rural    13    140            135.8
rural    14    140            148.5
> # create the id sequence
> id <- c(1:18, 1:14)
> # group: 1=urban, 2=rural; group must be defined as a factor
> group <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
             2,2,2,2,2,2,2,2,2,2,2,2,2,2)
> group <- as.factor(group)
> # enter the data
> age <- c(109,113,115,116,119,120,121,124,126,129,130,133,134,135,
           137,139,141,142,
           121,121,128,129,131,132,133,134,138,138,138,140,140,140)
>
The results show that urban pupils are about 6.3 months younger than rural pupils (126.8 vs 133.1). However, urban pupils are about 2.8 cm taller than rural pupils (144.5 vs 141.7). The reader can use a t test to verify that the age difference between the two groups is statistically significant (p = 0.045).
In addition, the following figure shows that age and height are correlated:
[Figure: scatter plot of height (y-axis) against age (x-axis).]
Because the two groups differ in age, and age is related to height, we cannot state or compare the heights of the 2 groups of pupils without adjusting for age. To adjust for age, we use analysis of covariance.
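11.5.1 The analysis of covariance model
of the two groups in this relationship do not differ. In other words, written in the notation of linear regression, we have:

$$ y_1 = \alpha_1 + \beta x + e_1 \quad \text{in group 1} $$
$$ y_2 = \alpha_2 + \beta x + e_2 \quad \text{in group 2.} \qquad [5] $$

Where:

$$ \bar{y}_{1a} = \bar{y}_1 - \beta(\bar{x}_1 - x^*) $$

$\bar{y}_{1a}$ can be regarded as an estimate of the mean height of group 1 (urban) at the value $x = x^*$. Similarly,

$$ \bar{y}_{2a} = \bar{y}_2 - \beta(\bar{x}_2 - x^*) $$

is the estimate of the mean height of group 2 (rural) at the same value $x^*$. From this, we can estimate the effect of urban versus rural residence with the following formula:

$$ \bar{y}_{1a} - \bar{y}_{2a} = \bar{y}_1 - \bar{y}_2 - \beta(\bar{x}_1 - \bar{x}_2) \qquad [6] $$

In other words, the model above states that the height of a pupil is affected by 3 factors: age ($\beta$), urban or rural residence ($\gamma$), and the interaction between the two factors ($\delta$). If $\delta = 0$ (that is, the interaction effect is not statistically significant), the model reduces to:

$$ y = \alpha + \beta x + \gamma g + e \qquad [7] $$

If $\gamma = 0$ (that is, the effect of urban residence is not statistically significant), the model reduces to:

$$ y = \alpha + \beta x + e \qquad [8] $$

The discussion above may seem rather complicated, but in practice, with R, estimation is very simple using the function lm. We will analyse the three models [6], [7] and [8]: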
> # model 6
> model6 <- lm(height ~ group + age + group:age)
> # model 7
> model7 <- lm(height ~ group + age)
> # model 8
> model8 <- lm(height ~ age)
--Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model [7] has 3 parameters (leaving 29 degrees of freedom), so its residual sum
of squares is higher than that of model [6]. However, in terms of probability, the
residual mean square of this model, 1338.02 / 29 = 46.13, is not much different
from that of model [6] (mean square: 1270.44 / 28 = 45.36), and the p-value
is 0.2325, i.e. not statistically significant. In other words, dropping the interaction
coefficient does not appreciably change the predictive ability of the model.
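The comparison of nested models described above can be sketched in R with anova(). This is only an illustration: the ages are those listed earlier, but the heights are simulated (the original height values are not reproduced here), and the split of 18 and 14 students between the two groups is an assumption, so the numbers will not match those quoted in the text.

```r
# Ages as listed in the text; heights are SIMULATED for illustration only
set.seed(1)
age <- c(109,113,115,116,119,120,121,124,126,129,130,133,134,135,
         137,139,141,142,
         121,121,128,129,131,132,133,134,138,138,138,140,140,140)
group <- factor(rep(c(1, 2), times = c(18, 14)))   # assumed group sizes
height <- 92 + 0.42 * age - 5.5 * (group == 2) + rnorm(length(age), sd = 6.8)

model6 <- lm(height ~ group + age + group:age)  # separate slopes (interaction)
model7 <- lm(height ~ group + age)              # common slope
model8 <- lm(height ~ age)                      # no group effect

# Sequential F tests: does each added term reduce the residual sum of squares?
cmp <- anova(model8, model7, model6)
print(cmp)
```

Each row of the printed table reports the residual degrees of freedom and residual sum of squares of one model, and the F test compares it with the model on the previous row.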
 Median     3Q    Max
  0.879  3.956 14.866

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  91.8171    17.9294   5.121 1.81e-05 ***
group2       -5.4663     2.5749  -2.123  0.04242 *
age           0.4157     0.1408   2.953  0.00619 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.793 on 29 degrees of freedom
Multiple R-Squared: 0.2588,     Adjusted R-squared: 0.2077
F-statistic: 5.063 on 2 and 29 DF,  p-value: 0.01300
From the parameter estimates shown above, we see that on average a student's
height increases by about 0.42 cm for each month of age. Note that in the results
above, group2 is the regression coefficient for group 2 (i.e. rural), because R sets
the coefficient for group 1 to 0 for computational convenience. We therefore have
two equations (two regression lines) for the two groups of students:
For urban students:
Height = 91.817 + 0.4157(age)
For rural students:
Height = 91.817 - 5.4663 + 0.4157(age) = 86.351 + 0.4157(age)
In other words, after adjusting for age, the rural students are about 5.5 cm
shorter than the urban students, and this difference is statistically significant,
with p = 0.0424. (Note that before adjusting for age, the difference was 2.8 cm.)
The following plots illustrate the models above:
> par(mfrow=c(2,2))
> plot(age, height, pch=as.character(group),
       main="Mo hinh 1")
[Figure: four panels titled Mo hinh 1, Mo hinh 2, Mo hinh 3 and Mo hinh 4, each plotting height (vertical axis) against age (horizontal axis), with points labelled 1 or 2 according to group]
Orange variety    Pesticide                         Total
(variety)           1       2       3       4
B1                 29      50      43      53        175
B2                 41      58      42      73        214
B3                 66      85      63      85        305
Total             136     193     154     211        694

The model for analysing a factorial experiment is no different from the two-way
analysis of variance presented in the previous section. More specifically, the model
we consider is:
       upr      p adj
 33.202864  0.0140509
 20.202864  0.5106152
 39.202864  0.0036109
  1.202864  0.0704233
 20.202864  0.5106152
 33.202864  0.0140509
The comparison between varieties shows that variety B3 yields about 32 units
more than variety B1, with a 95% confidence interval from 21 to 43 (p = 0.0002).
Variety B3 is also better than variety B2, with a mean difference of about 22 units
(p = 0.0017). But there is no appreciable difference between varieties B2 and B1.
Comparing the pesticides, the results above tell us that pesticide 4 is more
effective than pesticides 1 and 3. In addition, pesticide 2 is more effective than
pesticide 1. The other comparisons are not statistically significant. The following
Tukey plot illustrates these conclusions.
> plot(TukeyHSD(analysis), ordered=TRUE)
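[Figure: Tukey plot of the 95% confidence intervals for the pairwise differences 4-3, 4-2, 3-2, 4-1, 3-1 and 2-1 (horizontal axis from -20 to 40)]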
11.7 Analysis of variance for a Latin square experiment
Example 5. To compare the effect of 2 fertilizers (A and B) together with 2
cultivation methods (a and b), the researchers carried out a Latin square experiment.
There are therefore 4 intervention groups, formed by combining the two fertilizers and
the two cultivation methods: Aa, Ab, Ba and Bb (coded, respectively, as 1=Aa, 2=Ab,
3=Ba, 4=Bb). The four methods (treatment) were applied in 4 field plots
(sample = 1, 2, 3, 4) and to 4 crop varieties (variety = 1, 2, 3, 4). In total, the
experiment has 4x4 = 16 samples. The outcome is yield, and the yields are summarised
in the following table:
Table 11.6. Yield for 2 fertilizers and 2 cultivation methods
Field plot     Variety (variety)
(sample)       1           2           3           4
1              175 (Aa)    143 (Ba)    128 (Bb)    166 (Ab)
2              170 (Ab)    178 (Aa)    140 (Ba)    131 (Bb)
3              135 (Bb)    173 (Ab)    169 (Aa)    141 (Ba)
4              145 (Ba)    136 (Bb)    165 (Ab)    173 (Aa)
The first source is the difference between cultivation methods and fertilizers;
The second source is the difference between crop varieties;
The third source is the difference between field plots;

Mean by variety:      1: 156.25   2: 157.50   3: 150.50   4: 152.75   Overall mean: 154.25
Mean by field plot:   1: 153.00   2: 154.75   3: 154.50   4: 154.75   Overall mean: 154.25
Mean by method:       1: 173.75   2: 168.50   3: 142.25   4: 132.50   Overall mean: 154.25
The summary above allows us to compute the sum of squares for each source of variation.
We begin with the sum of squares for the whole experiment (which I will call SStotal):
$$SS_{total} = \sum_{i=1}^{16} (y_i - 154.25)^2 = 4941.0$$
Sum of squares due to differences between varieties (SSvariety). Note that because
each variety mean is computed from 4 values, we must multiply by 4 when computing
the sum of squares:
$$SS_{variety} = 4(156.25 - 154.25)^2 + 4(157.50 - 154.25)^2 + 4(150.50 - 154.25)^2 + 4(152.75 - 154.25)^2 = 123.5$$
Because there are 4 varieties and one parameter, the degrees of freedom are 4-1=3.
The mean square is therefore 123.5 / 3 = 41.2.
Sum of squares due to differences between field plots (SSsample). Note that because
each field plot mean is computed from 4 values, we must again multiply by 4:
$$SS_{sample} = 4(153.00 - 154.25)^2 + 4(154.75 - 154.25)^2 + 4(154.50 - 154.25)^2 + 4(154.75 - 154.25)^2 = 8.5$$
The estimates above can be presented in an analysis of variance table as follows:
Source of variation       Degrees of   Sum of     Mean       F test
                          freedom      squares    square
Between 4 field plots     3            8.5        2.8        2.3
Between 4 varieties       3            123.5      41.2       32.9
Between 4 methods         3            4801.5     1600.5     1280.4
Residual                  6            7.5
Total                     15           4941.0
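The manual calculation above can be checked with a few lines of base R. This is a minimal sketch: the helper ss_factor is a hypothetical name introduced here for illustration, and the yields are typed in from Table 11.6 row by row.

```r
# Yields from Table 11.6, row by row (field plot 1 to 4, variety 1 to 4)
y <- c(175, 143, 128, 166,
       170, 178, 140, 131,
       135, 173, 169, 141,
       145, 136, 165, 173)
sample  <- factor(rep(1:4, each = 4))
variety <- factor(rep(1:4, times = 4))
method  <- factor(c(1, 3, 4, 2,
                    2, 1, 3, 4,
                    4, 2, 1, 3,
                    3, 4, 2, 1))   # 1=Aa, 2=Ab, 3=Ba, 4=Bb

grand <- mean(y)   # overall mean

# Hypothetical helper: each level mean is based on 4 values, hence the factor 4
ss_factor <- function(f) sum(4 * (tapply(y, f, mean) - grand)^2)

ss_total    <- sum((y - grand)^2)
ss_sample   <- ss_factor(sample)
ss_variety  <- ss_factor(variety)
ss_method   <- ss_factor(method)
ss_residual <- ss_total - ss_sample - ss_variety - ss_method

print(round(c(sample = ss_sample, variety = ss_variety, method = ss_method,
              residual = ss_residual, total = ss_total), 1))
```

The residual sum of squares is obtained by subtraction, exactly as in the table.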
From this simple manual analysis, we can readily see that the cultivation method
and the variety have a large effect on yield. To compute exact p-values, we can use
R to carry out the analysis of variance for the Latin square experiment.
How the data are organised so that R can process them is very important. Briefly,
each data value must be a unique observation, in the sense that it has its own
identity. In the experiment above, we have 4 varieties and 4 field plots, giving 16
data values in total. Each of these 16 values must be identified by variety, by field
plot and, more importantly, by cultivation method. For example, in Table 11.6 above,
175 is the yield of cultivation method 1 (i.e. Aa), variety 1 and field plot 1; whereas
173 (the value in the bottom corner of the table) is also the yield of cultivation
method 1, but from variety 4 and field plot 4; and so on.
> sample <- c(1,1,1,1,
              2,2,2,2,
              3,3,3,3,
              4,4,4,4)
> sample <- as.factor(sample)
Enter the data for the cultivation method, method, which also has 4 levels (1,2,3,4),
one for each data value (and also declare method as a factor, i.e. a categorical variable):
> method <- c(1, 3, 4, 2,
              2, 1, 3, 4,
              4, 2, 1, 3,
              3, 4, 2, 1)
> method <- as.factor(method)
> data
   sample variety method   y
1       1       1      1 175
2       1       2      3 143
3       1       3      4 128
4       1       4      2 166
5       2       1      2 170
6       2       2      1 178
7       2       3      3 140
8       2       4      4 131
9       3       1      4 135
10      3       2      2 173
11      3       3      1 169
12      3       4      3 141
13      4       1      3 145
14      4       2      4 136
15      4       3      2 165
16      4       4      1 173
$variety
           upr      p adj
2-1  3.9867231  0.4528549
3-1 -3.0132769  0.0014152
4-1 -0.7632769  0.0173206
3-2 -4.2632769  0.0004803
4-2 -2.0132769  0.0038827
4-3  4.9867231  0.1034761

$method
           lwr         upr      p adj
2-1  -7.986723   -2.513277  0.0023016
3-1 -34.236723  -28.763277  0.0000001
4-1 -43.986723  -38.513277  0.0000000
3-2 -28.986723  -23.513277  0.0000004
4-2 -38.736723  -33.263277  0.0000000
4-3 -12.486723   -7.013277  0.0000730
The comparison between varieties shows differences between varieties 3 and 1, 4 and 1,
3 and 2, and 4 and 2.
All comparisons between cultivation methods are statistically significant. But which
method gives the highest yield? To answer this question, we use a box plot:
xlab="Methods (1=Aa, 2=Ab, 3=Ba, 4=Bb)",
[Figure: box plot of Production (vertical axis, 130 to 180) for each of the four methods]
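The whole Latin square analysis can be sketched as follows. The data are those of Table 11.6; the object names latin, tab and tk are illustrative choices, and the model formula (three additive factors) is an assumption consistent with the manual analysis of variance table, not necessarily the exact call used to produce the output above.

```r
# Data from Table 11.6 (field plot 1 to 4, variety 1 to 4 within each plot)
y <- c(175, 143, 128, 166,
       170, 178, 140, 131,
       135, 173, 169, 141,
       145, 136, 165, 173)
sample  <- factor(rep(1:4, each = 4))
variety <- factor(rep(1:4, times = 4))
method  <- factor(c(1, 3, 4, 2,
                    2, 1, 3, 4,
                    4, 2, 1, 3,
                    3, 4, 2, 1))

# Additive model: field plot + variety + method
latin <- aov(y ~ sample + variety + method)
tab <- anova(latin)
print(tab)

# Pairwise comparisons between methods
tk <- TukeyHSD(latin, which = "method")
print(tk)

# Which method yields most?
boxplot(y ~ method,
        xlab = "Methods (1=Aa, 2=Ab, 3=Ba, 4=Bb)",
        ylab = "Production")
```

With these data the F statistic for method and the Tukey interval for the comparison 2-1 agree with the values shown in the text.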
11.8 Analysis of variance for a crossover experiment
Example 6. To test the effect of a new drug on sweating (the drug was formulated
to treat heart disease, but sweating is a side effect), researchers conducted a study
of 16 patients. The patients were randomly divided into 2 groups (called AB and BA),
each of 8 patients. Patients were followed on two occasions: month 1 and month 2.
Patients in group AB were treated with the drug in the first month and given a
placebo in the second month. Conversely, patients in group BA received the placebo in
the first month and were treated with the drug in the second month. The outcome is
the time to sweating on the forehead (measured from taking the drug until sweating
begins) after taking the drug or placebo. The results are presented in the following table:
Table 11.7. Results of the study of the sweating effect of a heart-disease drug
Group   Patient id   Month 1    Month 2
AB                   A          Placebo
        1            6          4
        3            8          7
        5            12         6
        6            7          8
        9            9          10
        10           6          4
        13           11         6
        15           8          8
BA                   Placebo    A
        2            5          7
        4            9          6
        7            7          11
        8            4          7
        11           9          8
        12           5          4
        14           8          9
        16           9          13
The main question is whether the time to sweating differs between treatment with the
drug and with the placebo.
To answer this question, we need an analysis of variance. But because the study
design is rather special (two groups of patients receiving the interventions in two
different orders), the methods described above cannot be applied directly.
A commonly used approach is to analyse the variance within each group and then
compare the two groups. One issue to keep in mind is the possibility of a carry-over
effect: in group AB, the response in month 2 may be affected by a lingering effect of
the active drug received in month 1. First, let us summarise the data in the
following table:
Table 11.8. Summary of results of the study of the sweating effect of a heart-disease drug
Group   Patient id            Month 1    Month 2    Patient mean
AB                            A          Placebo
        1                     6          4          5.0
        3                     8          7          7.5
        5                     12         6          9.0
        6                     7          8          7.5
        9                     9          10         9.5
        10                    6          4          5.0
        13                    11         6          8.5
        15                    8          8          8.0
        Mean                  8.375      6.625      7.50
BA                            Placebo    A
        2                     5          7          6.0
        4                     9          6          7.5
        7                     7          11         9.0
        8                     4          7          5.5
        11                    9          8          8.5
        12                    5          4          4.5
        14                    8          9          8.5
        16                    9          13         11.0
        Mean                  7.000      8.125      7.5625
Mean of the 2 groups          7.6875     7.3750     7.5312
Sum of squares due to the difference between treatment with the drug and with placebo:
$$SS_{treat} = 16(8.25 - 7.5312)^2 + 16(6.8125 - 7.5312)^2 = 16.53$$
Sum of squares due to the difference between groups AB and BA (sequence):
$$SS_{seq} = 16(7.50 - 7.5312)^2 + 16(7.5625 - 7.5312)^2 = 0.031$$
Sum of squares due to differences between patients within the same group AB or BA
(each patient mean is based on 2 values, hence the factor 2):
$$SS_w = 2\left[(5.0 - 7.50)^2 + (7.5 - 7.50)^2 + (9.0 - 7.50)^2 + \dots + (8.0 - 7.50)^2 + (6.0 - 7.5625)^2 + (7.5 - 7.5625)^2 + (9.0 - 7.5625)^2 + \dots + (11.0 - 7.5625)^2\right] = 103.44$$
Source of variation             Degrees of   Sum of     Mean      F test
                                freedom      squares    square
Treatment (drug vs placebo)     1            16.53      16.53     4.90
Period (month 1 vs month 2)     1            0.781      0.781     0.23
Sequence (AB vs BA)             1            0.031      0.031     0.004
Patients within sequence        14           103.44     7.39
Residual                        14           47.19      3.37
Total                           31           167.97
From the analysis above, we see that the difference between drug and placebo is
larger than the difference between the two months or between the two groups AB and
BA. The F test of the hypothesis that drug and placebo have the same effect is
F = 16.53 / 3.37 = 4.90 with 1 and 14 degrees of freedom. From probability theory, the
critical F value with 1 and 14 degrees of freedom (at the 5% level) is 4.60. We can
therefore conclude that the time to sweating is longer with the drug than with the
placebo.
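The manual sums of squares and the critical F value quoted above can be reproduced in base R. This is a sketch under the layout of Table 11.7; all object names (ab_drug, ss_treat and so on) are illustrative.

```r
# Data from Table 11.7
ab_drug    <- c(6, 8, 12, 7, 9, 6, 11, 8)    # group AB, month 1 (drug A)
ab_placebo <- c(4, 7, 6, 8, 10, 4, 6, 8)     # group AB, month 2 (placebo)
ba_placebo <- c(5, 9, 7, 4, 9, 5, 8, 9)      # group BA, month 1 (placebo)
ba_drug    <- c(7, 6, 11, 7, 8, 4, 9, 13)    # group BA, month 2 (drug A)

all_y <- c(ab_drug, ab_placebo, ba_placebo, ba_drug)
grand <- mean(all_y)

# Each treatment, period and sequence mean is based on 16 values
ss_treat  <- 16 * (mean(c(ab_drug, ba_drug)) - grand)^2 +
             16 * (mean(c(ab_placebo, ba_placebo)) - grand)^2
ss_period <- 16 * (mean(c(ab_drug, ba_placebo)) - grand)^2 +
             16 * (mean(c(ab_placebo, ba_drug)) - grand)^2
mean_ab <- mean(c(ab_drug, ab_placebo))
mean_ba <- mean(c(ba_placebo, ba_drug))
ss_seq  <- 16 * (mean_ab - grand)^2 + 16 * (mean_ba - grand)^2

# Patients within sequence: each patient mean is based on 2 values
pat_ab <- (ab_drug + ab_placebo) / 2
pat_ba <- (ba_placebo + ba_drug) / 2
ss_within <- 2 * (sum((pat_ab - mean_ab)^2) + sum((pat_ba - mean_ba)^2))

ss_total <- sum((all_y - grand)^2)
ss_resid <- ss_total - ss_treat - ss_period - ss_seq - ss_within

f_treat <- ss_treat / (ss_resid / 14)   # F with 1 and 14 degrees of freedom
f_crit  <- qf(0.95, df1 = 1, df2 = 14)  # 5% critical value

print(round(c(treat = ss_treat, period = ss_period, seq = ss_seq,
              within = ss_within, residual = ss_resid, total = ss_total), 3))
print(round(c(F = f_treat, critical = f_crit), 2))
```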
All the manual calculations above are only meant to illustrate how the analysis of
variance works for a crossover experiment. In practice, we can use R to carry out the
calculations just as for simpler experiments. The main issue is how to organise the
data for analysis. R (like many other programs) requires the user to enter the data
value by value, and each value must be linked to a patient, a treatment, a month
(or period) and a sequence group. This is a very important requirement, because if
the data are organised incorrectly, the results of the analysis may be wrong.
In what follows, I describe each step in turn:
# step 1: enter the data and name the object y
> y <- c(6,8,12,7,9,6,11,8,
         4,7,6,8,10,4,6,8,
         5,9,7,4,9,5,8,9,
         7,6,11,7,8,4,9,13)
# step 2: for each value in step 1, indicate group AB or BA (coded 1 and 2)
> seq <- c(1,1,1,1,1,1,1,1,
           1,1,1,1,1,1,1,1,
           2,2,2,2,2,2,2,2,
           2,2,2,2,2,2,2,2)
> seq <- as.factor(seq)
   seq period treat id  y
22   2      2     1 12  4
23   2      2     1 14  9
24   2      2     1 16 13
25   2      1     2  2  5
26   2      1     2  4  9
27   2      1     2  7  7
28   2      1     2  8  4
29   2      1     2 11  9
30   2      1     2 12  5
31   2      1     2 14  8
32   2      1     2 16  9
$treat
       diff       lwr         upr     p adj
2-1 -1.4375 -2.829658 -0.04534186 0.0438783

$seq
      diff       lwr      upr    p adj
2-1 0.0625 -1.329658 1.454658 0.924656

$period
       diff       lwr      upr     p adj
2-1 -0.3125 -1.704658 1.079658 0.6376395

Note the result:
$treat
       diff       lwr         upr     p adj
2-1 -1.4375 -2.829658 -0.04534186 0.0438783
which indicates that, on average, the time to sweating under the drug is about 1.44
minutes longer than under the placebo, with a 95% confidence interval from 0.05 to
2.8 minutes. The comparisons between groups AB and BA (seq) and between month 1
and month 2 (period) are not statistically significant.
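The steps above can be sketched end to end as follows. The variable names seq_group, period, treat and id, and the model formula with id entered as a factor, are assumptions chosen to be consistent with the degrees of freedom and the interval shown in the output; they are not confirmed as the exact call used in the text. For a factor with two levels, the Tukey interval coincides with the usual t interval, which is what the sketch computes by hand.

```r
# Data in the order used in step 1: AB drug, AB placebo, BA placebo, BA drug
y <- c(6, 8, 12, 7, 9, 6, 11, 8,
       4, 7, 6, 8, 10, 4, 6, 8,
       5, 9, 7, 4, 9, 5, 8, 9,
       7, 6, 11, 7, 8, 4, 9, 13)
id <- factor(c(rep(c(1, 3, 5, 6, 9, 10, 13, 15), 2),
               rep(c(2, 4, 7, 8, 11, 12, 14, 16), 2)))
seq_group <- factor(rep(c(1, 2), each = 16))     # 1 = AB, 2 = BA
period    <- factor(rep(c(1, 2, 1, 2), each = 8))
treat     <- factor(rep(c(1, 2, 2, 1), each = 8)) # 1 = drug, 2 = placebo

# Patient (id) is nested in sequence, so it absorbs 14 degrees of freedom
fit <- aov(y ~ treat + seq_group + period + id)
tab <- anova(fit)
print(tab)

# Placebo minus drug, with a 95% interval based on the residual mean square
diff_pd <- mean(y[treat == 2]) - mean(y[treat == 1])
mse     <- tab["Residuals", "Mean Sq"]
half    <- qt(0.975, tab["Residuals", "Df"]) * sqrt(2 * mse / 16)
ci <- c(diff = diff_pd, lwr = diff_pd - half, upr = diff_pd + half)
print(round(ci, 4))
```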
Group      Patient id   Month 1   Month 2   Month 3
Vaccine    1            6         3         0
           2            7         3         1
           3            4         1         2
           4            8         4         3
Placebo    5            6         5         5
           6            9         4         6
           7            5         3         4
           8            6         2         3
To simplify the analysis of variance for a repeated-measures experiment, I will avoid
mathematical notation and illustrate with a few manual calculations that the reader can
follow. First, we need to summarise the data by computing the mean for each patient,
each treatment group and each month, as follows:
Table 11.11. Summary of the data from the study of a vaccine against arthritis pain
Treatment   id        Month 1   Month 2   Month 3   Mean
Vaccine     1         6         3         0         3.000
            2         7         3         1         3.667
            3         4         1         2         2.333
            4         8         4         3         5.000
            Mean      6.25      2.75      1.50      3.500
            SD        1.71      1.26      1.29
Placebo     5         6         5         5         5.333
            6         9         4         6         6.333
            7         5         3         4         4.000
            8         6         2         3         3.667
            Mean      6.50      3.50      4.50      4.833
            SD        1.73      1.29      1.29
Mean of the two groups 6.375    3.125     3.000     4.167
From the table above, we can see at once that there are 5 sources that make the
results of the experiment differ:
(a) between vaccine and placebo (probably the source we most want to know about!);
(b) between the 3 months of follow-up;
(c) between the three months within each treatment group, which statisticians usually
call interaction, in this case the interaction between treatment group and time;
(d) between patients within the same treatment group;
(e) and finally the residual, i.e. the part we cannot explain after considering
sources (a) to (d) above.
The fourth source is the sum of squares due to differences between patients within each
treatment group, which I will call SSpatient(treat):
$$SS_{patient(treat)} = 3(3.000 - 3.500)^2 + 3(3.667 - 3.500)^2 + 3(2.333 - 3.500)^2 + 3(5.000 - 3.500)^2 + 3(5.333 - 4.833)^2 + 3(6.333 - 4.833)^2 + 3(4.000 - 4.833)^2 + 3(3.667 - 4.833)^2 = 25.333$$
Source of variation              Degrees of   Sum of     Mean      F test
                                 freedom      squares    square
Between vaccine and placebo      1            10.667     10.667    2.53
Patients (treatment group)       6            25.333     4.222
Between 3 months                 2            58.583     29.292    28.89
Time and treatment group         2            8.583      4.292     4.23
Residual                         12           12.167     1.014     -
Total                            23           115.333
First, we enter the data for each patient. As with any statistical software, each
value must be accompanied by variables identifying the patient, the group and the
time:
y <- c(6,7,4,8,
3,3,1,4,
0,1,2,3,
6,9,5,6,
5,4,3,2,
5,6,4,3)
Error: id:time
           Df Sum Sq Mean Sq F value    Pr(>F)
time        2 58.583  29.292 28.8904 2.586e-05 ***
treat:time  2  8.583   4.292  4.2329   0.04064 *
Residuals  12 12.167   1.014
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first part of the analysis table above shows that the difference between the
vaccine and placebo groups is not statistically significant (p = 0.16). Can we
therefore conclude that the treatment is not effective in reducing arthritis pain?
The answer is no, because the second part of the analysis of variance table shows
an interaction between treat and time (p = 0.041). This means that the difference
between vaccine and placebo depends on the month of treatment. Indeed, if we look
again at Table 11.11, we see that in month 1 the means of the vaccine and placebo
groups are hardly different (6.25 and 6.50), but in month 2, and especially in month 3,
the difference between the two groups is large (in month 3: 1.50 for vaccine and 4.50
for placebo). Thus the effect in the treated group increases over time, whereas in the
placebo group there is hardly any difference between the 3 months. In short, from this
preliminary experiment we can say that the vaccine appears to be effective in reducing
pain in patients with arthritis.
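The repeated-measures analysis can be sketched as follows, using the y vector shown earlier. The coding of treat, time and id, and the Error(id) term, are assumptions made for illustration (the output above was produced with a stratum labelled id:time, so the original call may have been written differently). The second half recomputes the between-group F test by hand from an ordinary lm fit.

```r
# Pain scores: vaccine months 1-3 (patients 1-4), then placebo months 1-3 (patients 5-8)
y <- c(6, 7, 4, 8,
       3, 3, 1, 4,
       0, 1, 2, 3,
       6, 9, 5, 6,
       5, 4, 3, 2,
       5, 6, 4, 3)
treat <- factor(rep(c(1, 2), each = 12))              # 1 = vaccine, 2 = placebo
time  <- factor(rep(rep(1:3, each = 4), times = 2))   # month
id    <- factor(c(rep(1:4, times = 3), rep(5:8, times = 3)))

# Repeated-measures ANOVA: patients form the error stratum for treat
fit <- aov(y ~ treat * time + Error(id))
print(summary(fit))

# Same sums of squares from a single lm fit (id is nested within treat)
tab <- anova(lm(y ~ treat + id + time + treat:time))
f_treat <- tab["treat", "Mean Sq"] / tab["id", "Mean Sq"]  # tested against patients
p_treat <- pf(f_treat, 1, 6, lower.tail = FALSE)
print(round(c(F_treat = f_treat, p_treat = p_treat), 3))
```

The within-patient tests (time and treat:time) use the residual mean square, while the treatment effect is tested against the between-patient mean square.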
***
The above are some ways of carrying out an analysis of variance for commonly
used experiments. The design and analysis of experiments (experimental design) is a
relatively specialised field, and the guidance given here cannot, and does not aim to,
describe all the calculations and methods for every experiment. In practice, however,
experimental methods are very commonly used in experimental science. R has a package
named nlme (non-linear mixed-effects) that can also be used for the analyses above and
for more complex models with multiple variables and multiple levels. This package can
be downloaded free of charge from the R website: http://cran.R-project.org.
CHAPTER XIII
13
Event history analysis
(event history or survival analysis)
In the previous three chapters, we became familiar with statistical models for
continuous dependent variables (such as blood pressure) and categorical variables
(such as yes/no, diseased or not diseased). In scientific research, and particularly in
medicine and engineering, researchers sometimes want to study what influences
dependent variables that are times. The economist John Maynard Keynes once made a
remark relevant to the topic of this chapter: in the long run we are all dead; the
only difference is whether we die sooner or later. Thus, following or describing a
categorical variable such as alive or dead is important, but not precise. The more
important and more precise variable is the time until the event occurs.
In medical research, including clinical research, researchers often follow
patients over a period of time, sometimes up to several decades. Events occurring
during that time, such as disease or no disease, survival or death, and so on, have a
definite clinical meaning, but the time until a patient develops the disease or dies
is even more important for assessing the effect of a treatment or a risk factor. These
times differ between patients. For example, the time from cancer treatment to death
differs greatly between patients, and the difference may depend on factors such as
age, sex and disease status, and on factors that we sometimes cannot or have not yet
measured, such as interactions between genes.
The main model relating the time to disease (or no disease) to risk factors is
called survival analysis. The term survival analysis originated in insurance research,
and medical researchers adopted it for their own discipline. But as noted above,
alive/dead is not the only variable; in practice we also have variables such as
diseased or not diseased, occurred or did not occur, and for this reason psychologists
use the term event history analysis, which I find more appropriate than survival
analysis. In engineering, yet another term is used, reliability analysis, for the same
concept. In this chapter, however, I will use the term event history analysis.
Subject   Time      Status
          (weeks)   (stopped=1 or continuing=0)
1         18        0
2         10        1
3         13        0
4         30        1
5         19        1
6         23        0
7         38        0
8         54        0
9         36        1
10        107       1
11        104       0
12        97        1
13        107       0
14        56        0
15        59        1
16        107       0
17        75        1
18        93        1

The question is to describe the time to stopping use of the device. The term
"describe" here means estimating the median time to stopping, or the probability that
a woman stops using the device at a given time point. The state of continuing use is
sometimes called survival.
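$$F(t) = \int_0^t f(s)\,ds$$
$$h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$$
where $T$ denotes the time to stopping, so that $h(t)\Delta t$ is the probability that an
individual stops using the device within a short interval $\Delta t$, given that the
individual has survived to time $t$. From the relation:
Pr(survive to t+Δt) = Pr(survive to t) · Pr(survive a further Δt | survived to t)
we have:
$$1 - F(t + \Delta t) = (1 - F(t))(1 - h(t)\Delta t)$$
From this, we have:
$$\Delta t\,F'(t) = (1 - F(t))\,h(t)\,\Delta t$$
$$h(t) = \frac{f(t)}{1 - F(t)}$$
$$\Lambda(t) = \int_0^t h(u)\,du$$
From the definition of the hazard function $h(t) = \frac{f(t)}{1 - F(t)}$, we can write:
$$\Lambda(t) = -\log(1 - F(t))$$
Several hazard functions can be used to describe this time. The simplest is a
constant, which leads to a Poisson model (belonging to the family of exponential distributions):
$$f(t) = \lambda e^{-\lambda t} \qquad (t \ge 0)$$
Therefore:
$$F(t) = 1 - e^{-\lambda t}$$
Hence:
$$h(t) = \lambda$$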
10  13*  18*  19  23*  30  36  38*  54*  56*  59  75  93  97  104*  107  107*  107*
(* = still using the device at that time, i.e. status 0)
Time    Interval   Number of    Number of    Probability   Probability    Cumulative
point   (weeks)    women at     women who    of stopping   of continuing  probability
(t)                start (nt)   stop (dt)    h(t)          pt             S(t)
1       0-9        18           0            0.0000        1.0000         1.0000
2       10-18      18           1            0.0555        0.9445         0.9445
3       19-29      15           1            0.0667        0.9333         0.8815
4       30-35      13           1            0.0769        0.9231         0.8137
5       36-58      12           1            0.0833        0.9167         0.7459
6       59-74      8            1            0.1250        0.8750         0.6526
7       75-92      7            1            0.1428        0.8572         0.5594
8       93-96      6            1            0.1667        0.8333         0.4662
9       97-106     5            1            0.2000        0.8000         0.3729
10      107        3            1            0.3333        0.6667         0.2486

$$S(t) = \prod_{j=1}^{t} p_j$$
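The life table above can be rebuilt step by step in base R, which makes the product-limit idea explicit. This is a sketch: the column names n_t, d_t, h_t and S_t are chosen here to mirror the table headings, and only time points at which someone stops are listed (so the first row of the table, with no events, is omitted).

```r
# Follow-up times (weeks) and status (1 = stopped, 0 = still using)
weeks  <- c(10, 13, 18, 19, 23, 30, 36, 38, 54,
            56, 59, 75, 93, 97, 104, 107, 107, 107)
status <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0)

# Distinct times at which at least one woman stopped
event_times <- sort(unique(weeks[status == 1]))

# Number still under observation just before each event time, and number stopping
n_t <- sapply(event_times, function(t) sum(weeks >= t))
d_t <- sapply(event_times, function(t) sum(weeks == t & status == 1))

h_t <- d_t / n_t            # probability of stopping at that time
p_t <- 1 - h_t              # probability of continuing
S_t <- cumprod(p_t)         # cumulative probability of still using

km <- data.frame(time = event_times, n_t, d_t,
                 h_t = round(h_t, 4), p_t = round(p_t, 4), S_t = round(S_t, 4))
print(km)
```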
Next, we create two variables: the first contains the times (call it weeks in this
case), and the second is an indicator of whether the subject stopped using the device
(value 1) or is still using it (value 0); we name this variable status. We then
combine the two variables into a data frame (called data) for convenience of analysis.
> weeks <- c(10, 13, 18, 19, 23, 30, 36, 38, 54,
             56, 59, 75, 93, 97, 104, 107, 107, 107)
> status <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0)
> data <- data.frame(weeks, status)
18+  19  23+  30  36  38+  54+  56+  59  75  93  97  107+  107+
events  18  93  59  Inf
..., $se[S(t)] = S(t)\sqrt{\sum_{t=1}^{k} \frac{d_t}{n_t (n_t - d_t)}}$. This standard error formula
is also known as Greenwood's formula. We can display the results above in a plot
using the plot function as follows:
> plot(kp,
xlab="Time (weeks)",
ylab="Cumulative survival probability")
[Figure: Kaplan-Meier curve; vertical axis cumulative survival probability (0.0 to 1.0), horizontal axis Time (weeks), 20 to 100]
In the plot above, the horizontal axis is time (in weeks) and the vertical axis is the
cumulative probability of still using the device. The middle line is the cumulative
probability S(t), and the two dotted lines are the 95% confidence interval of S(t). From
this analysis, we can state that the probability of still using the device at week 107
is about 25%, with a confidence interval from 8% to 74.5%. The rather wide confidence
interval indicates that the estimate is highly variable, simply because the number of
subjects studied is relatively small.
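A compact sketch of the same analysis with the survival package is given below. It assumes the package is installed; the object names kp, s and greenwood are illustrative. The last lines apply Greenwood's formula by hand so that the result can be compared with the std.err column printed by summary().

```r
library(survival)

weeks  <- c(10, 13, 18, 19, 23, 30, 36, 38, 54,
            56, 59, 75, 93, 97, 104, 107, 107, 107)
status <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0)

# Kaplan-Meier estimate for a single group
kp <- survfit(Surv(weeks, status == 1) ~ 1)
s  <- summary(kp)          # one row per event time
print(s)

# Greenwood's formula applied to the event-time rows
greenwood <- s$surv * sqrt(cumsum(s$n.event / (s$n.risk * (s$n.risk - s$n.event))))
print(data.frame(time = s$time, surv = round(s$surv, 4), se = round(greenwood, 4)))

plot(kp,
     xlab = "Time (weeks)",
     ylab = "Cumulative survival probability")
```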
Table 13.1. Time to infection in patients with chronic disease, for the gd2 and placebo groups
Group 1                                 Group 2
id   episodes   time   infected         id   episodes   time   infected
1    12         8      1                2    9          15     1
3    10         12     0                4    10         44     0
6    7          52     0                5    12         2      0
7    10         28     1                9    7          8      1
8    6          44     1                11   7          12     1
10   8          14     1                13   7          52     0
12   8          3      1                16   7          21     1
14   9          52     1                17   11         19     1
15   11         35     1                19   16         6      1
18   13         6      1                21   16         10     1
20   7          12     1                22   6          15     0
23   13         7      0                25   15         4      1
24   9          52     0                27   9          9      0
26   12         52     0                29   10         27     1
28   13         36     1                30   17         1      1
31   8          52     0                32   8          12     1
33   10         9      1                35   8          20     1
34   16         11     0                37   8          32     0
36   6          52     0                38   8          15     1
39   14         15     1                41   14         5      1
40   13         13     1                43   13         35     1
42   13         21     1                45   9          28     1
44   16         24     0                47   15         6      1
46   13         52     0
48   9          28     1
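$$e_{1j} = \frac{n_{1j} d_j}{n_j} \qquad e_{2j} = \frac{n_{2j} d_j}{n_j}$$
$$v_j = \frac{n_{1j}\, n_{2j}\, d_j (n_j - d_j)}{n_j^2 (n_j - 1)}$$
$$O_1 = \sum_{j=1}^{k} d_{1j} \qquad O_2 = \sum_{j=1}^{k} d_{2j}$$
And the total number of patients expected to develop the disease if both groups shared
the same probability of disease:
$$E_1 = \sum_{j=1}^{k} e_{1j}$$
$$V = \sum_{j=1}^{k} v_j$$
$$\chi^2 = \frac{(O_1 - E_1)^2}{V}$$
If $\chi^2 > \chi^2_{1,\alpha}$ (where $\chi^2_{1,\alpha}$ is the Chi-square value with 1 degree of freedom at
the $\alpha = 0.95$ level), we have evidence to conclude that the difference in S(t) between the
two groups is statistically significant.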
12,20,32,15, 5,35,28, 6)
> infected <- c(1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1,
1, 1, 0, 1, 1, 1, 1, 1)
> data <- data.frame(group, episode, time, infected)
(a) We apply the survfit function to estimate the cumulative probability S(t) for each
group of patients and store the result in the object kp.by.group as follows (note how the
argument ~ group is supplied):
> library(survival)
> kp.by.group <- survfit(Surv(time, infected==1) ~ group)
> summary(kp.by.group)
Call: survfit(formula = Surv(time, infected == 1) ~ group)

                group=1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    3     25       1    0.960  0.0392        0.886        1.000
    6     24       1    0.920  0.0543        0.820        1.000
    8     22       1    0.878  0.0660        0.758        1.000
    9     21       1    0.836  0.0749        0.702        0.997
   12     19       1    0.792  0.0829        0.645        0.973
   13     17       1    0.746  0.0902        0.588        0.945
   14     16       1    0.699  0.0958        0.534        0.915
   15     15       1    0.653  0.1001        0.483        0.882
   21     14       1    0.606  0.1033        0.434        0.846
   28     12       2    0.505  0.1080        0.332        0.768
   35     10       1    0.454  0.1083        0.285        0.725
   36      9       1    0.404  0.1074        0.240        0.680
   44      8       1    0.353  0.1052        0.197        0.633
   52      7       1    0.303  0.1016        0.157        0.584

                group=2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     23       1    0.957  0.0425       0.8767        1.000
    4     21       1    0.911  0.0601       0.8004        1.000
    5     20       1    0.865  0.0723       0.7346        1.000
    6     19       2    0.774  0.0889       0.6183        0.970
    8     17       1    0.729  0.0946       0.5650        0.940
   10     15       1    0.680  0.1000       0.5099        0.907
   12     14       2    0.583  0.1067       0.4072        0.835
   15     12       2    0.486  0.1088       0.3132        0.754
   19      9       1    0.432  0.1093       0.2630        0.709
   20      8       1    0.378  0.1082       0.2156        0.662
   21      7       1    0.324  0.1053       0.1712        0.613
   27      6       1    0.270  0.1007       0.1300        0.561
   28      5       1    0.216  0.0939       0.0921        0.506
   35      3       1    0.144  0.0859       0.0447        0.463
xlab="Time",
ylab="Cum. survival probability",
col=c("black", "red"))
[Figure: cumulative survival probability (vertical axis, 0.0 to 1.0) against Time (horizontal axis, 10 to 50) for the two groups]
The log-rank analysis gives a p-value of 0.056. Because p > 0.05, we still do not have
convincing evidence to conclude that gd2 is truly effective in reducing the risk of
recurrence of the disease.
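The log-rank comparison can be sketched with survdiff() from the survival package. The data are typed in from Table 13.1 (the 25 patients of group 1 followed by the 23 patients of group 2); the object names lr and p_value are illustrative, and the call shown is an assumption about how the test was run rather than a reproduction of the original code.

```r
library(survival)

# Table 13.1: group 1 (25 patients) followed by group 2 (23 patients)
time <- c(8, 12, 52, 28, 44, 14, 3, 52, 35, 6, 12, 7, 52, 52, 36,
          52, 9, 11, 52, 15, 13, 21, 24, 52, 28,
          15, 44, 2, 8, 12, 52, 21, 19, 6, 10, 15, 4, 9, 27, 1,
          12, 20, 32, 15, 5, 35, 28, 6)
infected <- c(1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
              0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
              1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1,
              1, 1, 0, 1, 1, 1, 1, 1)
group <- rep(c(1, 2), times = c(25, 23))

# Log-rank test: observed versus expected numbers of infections in each group
lr <- survdiff(Surv(time, infected == 1) ~ group)
print(lr)

# p-value from the Chi-square statistic with 1 degree of freedom
p_value <- pchisq(lr$chisq, df = 1, lower.tail = FALSE)
print(round(p_value, 3))
```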
As in the study above, the number of infections a patient has previously had (variable
episode) is thought to influence the risk of recurrence. The question therefore is
whether, once we consider and adjust for the effect of episode, the difference in S(t)
between the two groups really exists.
In the early 1970s, David R. Cox, professor of statistics at Imperial College
(London, UK), developed a regression-based method of analysis to answer this question
(D.R. Cox, Regression models and life tables (with discussion), Journal of the Royal
Statistical Society series B, 1972; 34:187-220). That method was later called the Cox
model. The Cox model is regarded as one of the most important developments in science
in general (not just in statistical science) in the 20th century! It is impossible to
count how many times David Cox's paper has been cited, because the paper has
influenced the whole of scientific research.
Because a detailed description of the Cox model is beyond the scope of this chapter,
I will only sketch the main points so that the reader can grasp the issue. Let
$x_1, x_2, x_3, \dots, x_p$ be p risk factors. x may be continuous or non-continuous
variables. The Cox model states that:
$$h(t) = \lambda(t)\, e^{\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_p x_p}$$
Rsquare= 0.071
...  on 1 df,   p=0.0597
...  on 1 df,   p=0.0596
...  on 1 df,   p=0.0553

Rsquare= 0.196   (max possible= 0.986 )
Likelihood ratio test= 10.5  on 2 df,   p=0.00537
Wald test            = 10.4  on 2 df,   p=0.00555
Score (logrank) test = 10.6  on 2 df,   p=0.00489
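The analysis above gives us a different, and perhaps more accurate, interpretation.
The model h(t) is now:
$$\frac{h(t \mid group = 2)}{h(t \mid group = 1)}$$
[Figure: cumulative survival probability (vertical axis, 0.0 to 1.0) against Time (horizontal axis, 10 to 50)]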
# Create 5 independent variables
> x1 <- (1:50)/2 - 3
> x2 <- rnorm(50)
> x3 <- rnorm(50)
> x4 <- rnorm(50)
> x5 <- rnorm(50)
Rsquare= 0.992   (max possible= 0.997 )
Likelihood ratio test= 241  on 5 df,   p=0
Wald test            = 33.3  on 5 df,   p=3.36e-06
Score (logrank) test = 107  on 5 df,   p=0

  n= 50
    coef exp(coef) se(coef)    z       p
x1 3.126     22.79    0.529 5.91 3.4e-09
x5 0.429      1.54    0.297 1.45 1.5e-01

Rsquare= 0.992   (max possible= 0.997 )
Likelihood ratio test= 240  on 2 df,   p=0
Wald test            = 35.3  on 2 df,   p=2.18e-08
Score (logrank) test = 104  on 2 df,   p=0
... 0.8911 ):

           p!=0    EV      SD      model 1    model 2    model 3    model 4    model 5
x1        100.0  3.0360  0.509    2.98048    3.12625    3.03900    2.98288    2.98098
x2          9.6  0.0008  0.096      .          .          .          .        0.02136
x3         14.6  0.0410  0.155      .          .        0.27046      .          .
x4         10.0  0.0063  0.092      .          .          .        0.02497      .
x5         31.0  0.1349  0.261      .        0.42920      .          .          .

nVar                              1          2          2          2          2
BIC                            -233.774   -232.126   -230.713   -229.933   -229.930
post prob                         0.458      0.201      0.099      0.067      0.067

[Figure: "Models selected by BMA"; rows x1 to x5, horizontal axis Model #]
CHAPTER XIV
14
Meta-analysis
Our forebears used to say "One tree cannot make a hill; three trees together make
a high mountain" to praise the spirit of cooperation and unity needed to complete an
important task that requires many people. In scientific research in general, and in
medicine in particular, we often need to consider many research results from many
different sources to resolve a specific question.
Or, in general:
$$x_i = M + e_i$$
Of course $e_i$ can be < 0 or > 0. If M and $e_i$ are independent of each other (i.e. uncorrelated),
the variance of $x_i$ (written $var[x_i]$) can be expressed as:
$$var[x_i] = var[M] + var[e_i] = 0 + s_e^2$$
$$x_i = m_i + e_i$$
Where:
$$m_i = M + \varepsilon_i$$
Hence:
$$x_i = M + \varepsilon_i + e_i$$
As this formula shows, $s_\varepsilon^2$ reflects the variation between studies (between-study
variation), whereas $s_e^2$ reflects the variation within each study (within-study
variation). The aim of a random-effects meta-analysis is to estimate M, $s_e^2$
and $s_\varepsilon^2$.
In summary, fixed-effects meta-analysis and random-effects meta-analysis differ
only in the variance. Whereas the fixed-effects analysis takes $s_\varepsilon^2 = 0$, the
random-effects analysis requires $s_\varepsilon^2$ to be estimated. Of course, if $s_\varepsilon^2 = 0$ the two
analyses give the same result. In this article I concentrate on the fixed-effects
meta-analysis.
Just as the statistical analysis of an individual study depends on the type of
outcome (continuous variables or dichotomous variables), the method of meta-analysis
also depends on the outcomes of the studies. I will describe in turn the two main
methods, for continuous and for dichotomous variables.
9        60     30     27     52     23     20
Total    548                  610
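$$d_i = LOS_{1i} - LOS_{2i}$$
The variance of $d_i$ (which I denote $s_i^2$) is estimated with a standard formula based on the
standard deviations and the numbers of subjects in each study. For each study i (i = 1, 2, 3,
..., 9), we have:
$$s_i^2 = \frac{(N_{1i} - 1)s_{1i}^2 + (N_{2i} - 1)s_{2i}^2}{N_{1i} + N_{2i} - 2}\left(\frac{1}{N_{1i}} + \frac{1}{N_{2i}}\right)$$
$$d_1 = 75 - 55 = 20$$
and the variance of $d_1$:
$$s_1^2 = \frac{\cdots}{155 + 156 - 2}\left(\frac{1}{155} + \frac{1}{156}\right) = 40.59$$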
Table 1a. Difference in time between the two groups and 95% confidence interval
Study (i)   di     si2      si      di-1.96*si   di+1.96*si
1           20     40.6     6.37    7.51         32.49
2           2      2.0      1.43    -0.80        4.80
3           55     15.3     3.91    47.34        62.66
4           71     150.2    12.26   46.98        95.02
5           4      20.2     4.49    -4.81        12.81
6           -1     1.2      1.11    -3.17        1.17
7           -11    95.4     9.77    -30.14       8.14
8           10     8.0      2.83    4.45         15.55
9           -7     20.7     4.55    -15.92       1.92
$$W_i = 1 / s_i^2$$
For example, for study 1 we have: $W_1 = \frac{1}{40.59} = 0.0246$

Study       di     si2      Wi
1           20     40.6     0.0246
2           2      2.0      0.4886
3           55     15.3     0.0654
4           71     150.2    0.0067
5           4      20.2     0.0495
6           -1     1.2      0.8173
7           -11    95.4     0.0105
8           10     8.0      0.1245
9           -7     20.7     0.0483
Total                       1.6354
Therefore, to compute the mean d over all the studies, we must take the weights
$W_i$ into account. With each $d_i$ and $W_i$ we can compute the weighted mean
by the standard method as follows:
$$\bar{d} = \frac{\sum_{i=1}^{9} W_i d_i}{\sum_{i=1}^{9} W_i}$$
with variance $s_d^2 = 1 / \sum_{i=1}^{9} W_i$.
The standard error (SE) of $\bar{d}$ is therefore $SE(\bar{d}) = s_d$. According to the theory of the
Normal distribution, the 95% confidence interval (95% CI) can be estimated as follows:
$$95\%\ \text{CI of}\ \bar{d} = \bar{d} \pm 1.96\, s_d$$
To compute $\bar{d}$ we need one more column: the column $W_i d_i$. For example, for study
1 we have $W_1 d_1 = 0.0246 \times 20 = 0.4928$. Continuing in this way gives one additional column.
Table 1c. Calculation of the mean
Study       di     si2      Wi        Wi di
1           20     40.6     0.0246    0.4928
2           2      2.0      0.4886    0.9771
3           55     15.3     0.0654    3.5993
4           71     150.2    0.0067    0.4726
5           4      20.2     0.0495    0.1981
6           -1     1.2      0.8173    -0.8173
7           -11    95.4     0.0105    -0.1153
8           10     8.0      0.1245    1.2450
9           -7     20.7     0.0483    -0.3383
Total                       1.6354    5.7140
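$$\bar{d} = \frac{\sum_{i=1}^{9} W_i d_i}{\sum_{i=1}^{9} W_i} = \frac{5.7140}{1.6354} = 3.49$$
with variance $s_d^2 = \frac{1}{1.6354} = 0.61$, so that $s_d = \sqrt{0.61} = 0.782$.
The 95% confidence interval (95% CI) can be estimated as follows:
3.49 ± 1.96*0.782 = 1.96 to 5.02.
At this point, we can say that on average the length of stay in general hospitals is
3.49 days longer than in specialist hospitals, with a 95% confidence interval from 1.96
days to 5.02 days.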
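The weighted mean and its confidence interval can be reproduced in a few lines of R. This sketch takes the differences d and the weights W directly from Table 1c (the weights there were computed from unrounded variances, so they are entered as printed rather than recomputed); the names d_bar, se_d and ci are illustrative.

```r
# Differences and weights as printed in Table 1c
d <- c(20, 2, 55, 71, 4, -1, -11, 10, -7)
W <- c(0.0246, 0.4886, 0.0654, 0.0067, 0.0495, 0.8173, 0.0105, 0.1245, 0.0483)

d_bar <- sum(W * d) / sum(W)    # weighted mean difference
se_d  <- sqrt(1 / sum(W))       # standard error of the weighted mean
ci    <- d_bar + c(-1, 1) * 1.96 * se_d

print(round(c(mean = d_bar, se = se_d, lower = ci[1], upper = ci[2]), 3))
```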
Step 5: Estimate the index of homogeneity and heterogeneity between the studies [3].
In practice, this index measures the difference between each study and the weighted
mean. The index of homogeneity is computed with the following formula:
$$Q = \sum_{i=1}^{k} W_i (d_i - \bar{d})^2$$
$$I^2 = \frac{Q - (k - 1)}{Q}$$
In the example above, to estimate Q and $I^2$ we need to compute $W_i (d_i - \bar{d})^2$ for each
study. For example, for study 1:
$$W_1 (d_1 - \bar{d})^2 = 0.0246 \times (20 - 3.49)^2 = 6.71$$
Study       di     si2      Wi        Wi di      Wi (di - d)^2
1           20     40.6     0.0246    0.4928     6.7129
2           2      2.0      0.4886    0.9771     1.0903
3           55     15.3     0.0654    3.5993     173.6080
4           71     150.2    0.0067    0.4726     30.3356
5           4      20.2     0.0495    0.1981     0.0127
6           -1     1.2      0.8173    -0.8173    16.5054
7           -11    95.4     0.0105    -0.1153    2.2026
8           10     8.0      0.1245    1.2450     5.2701
9           -7     20.7     0.0483    -0.3383    5.3215
Total                       1.6354    5.7140     241.05
After computing $W_i (d_i - \bar{d})^2$ for each study, we add these values up (see the last
column), and the total is Q:
$$Q = \sum_{i=1}^{9} W_i (d_i - \bar{d})^2 = 241.05$$
From this, $I^2$ can be estimated as follows:
$$I^2 = \frac{241.05 - 8}{241.05} = 0.967$$
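The heterogeneity statistics can be checked in the same way. The sketch below again uses d and W as printed in the tables, so Q differs from 241.05 only by rounding; truncating a negative value of I2 to 0 is a common convention and is included here as an assumption.

```r
# Differences and weights as printed in the tables above
d <- c(20, 2, 55, 71, 4, -1, -11, 10, -7)
W <- c(0.0246, 0.4886, 0.0654, 0.0067, 0.0495, 0.8173, 0.0105, 0.1245, 0.0483)

d_bar <- sum(W * d) / sum(W)        # weighted mean
Q     <- sum(W * (d - d_bar)^2)     # index of homogeneity
k     <- length(d)                  # number of studies
I2    <- max(0, (Q - (k - 1)) / Q)  # proportion of variation due to heterogeneity

print(round(c(Q = Q, I2 = I2), 3))
```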
$$precision = \frac{1}{s_{d_i}}$$
In other words, the funnel plot displays precision against $d_i$. For example, for study
1 we have: $precision = 1 / \sqrt{40.6} = 0.157$. Computing this for each study gives
the following table and funnel plot:
Table 1e. Estimating publication bias
Study       di     si2      1/si
1           20     40.6     0.1570
2           2      2.0      0.6990
3           55     15.3     0.2558
4           71     150.2    0.0816
5           4      20.2     0.2225
6           -1     1.2      0.9041
7           -11    95.4     0.1024
8           10     8.0      0.3528
9           -7     20.7     0.2198
Study       di     Ni
1           20     311
2           2      63
3           55     146
4           71     36
5           4      21
6           -1     109
7           -11    67
8           10     293
9           -7     112
To carry out a meta-analysis in R, we must load the package meta into the R working
environment (provided, of course, that the reader has downloaded and installed meta in R).
> library(meta)
Using the function ...
                       95%-CI       z  p.value
[ -4.9626;  -1.9646]  -4.5286  < 0.0001
[-24.0299;  -3.9336]  -2.7272    0.0064

Quantifying heterogeneity:
tau^2 = 205.4094; H = 5.46 [4.54; 6.58]; I^2 = 96.7% [95.2%; 97.7%]

Test of heterogeneity:
      Q d.f.  p.value
 238.92    8 < 0.0001

Method: Inverse variance method
[Figure: forest plot; horizontal axis Weighted mean difference, from -100 to 20]
Study    Beta-blocker             Placebo
         N1      Deaths (d1)      N2      Deaths (d2)
1        25      5                25      6
2        9       1                16      2
3        194     23               189     21
4        25      1                25      2
5        105     4                34      2
6        320     53               321     67
7        33      3                16      2
8        261     12               84      13
9        133     6                145     11
10       232     2                134     5
11       1327    156              1320    228
12       1990    145              2001    217
13       214     8                212     17
Total    4879    420              4516    612
N: number of patients studied; Deaths: number of patients who died during follow-up.
$$RR = \frac{p_1}{p_2}$$
For study 1, $p_1 = \frac{5}{25} = 0.20$ and $p_2 = \frac{6}{25} = 0.24$, so that
$RR = \frac{0.20}{0.24} = 0.833$. Carrying out the same calculation for
the remaining studies gives the following table:
Study (i)   Death rate,       Death rate,         Risk ratio
            BB group (p1)     placebo group (p2)  (RR)
1           0.200             0.240               0.833
2           0.111             0.125               0.889
3           0.119             0.111               1.067
4           0.040             0.080               0.500
5           0.038             0.059               0.648
6           0.166             0.209               0.794
7           0.091             0.125               0.727
8           0.046             0.155               0.297
9           0.045             0.076               0.595
10          0.009             0.037               0.231
11          0.118             0.173               0.681
12          0.073             0.108               0.672
13          0.037             0.080               0.466
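$$Var[\log RR] = \frac{1}{d_1} - \frac{1}{N_1 - d_1} + \frac{1}{d_2} - \frac{1}{N_2 - d_2}$$
and the standard error:
$$SE[\log RR] = \sqrt{\frac{1}{d_1} - \frac{1}{N_1 - d_1} + \frac{1}{d_2} - \frac{1}{N_2 - d_2}}$$
For study 1:
$$Var[\log RR_1] = \frac{1}{5} - \frac{1}{25 - 5} + \frac{1}{6} - \frac{1}{25 - 6} = 0.264$$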
Study   Risk ratio   Log[RR]   Var[logRR]   SE        Lower 95% CI   Upper 95% CI
(i)     (RR)                                          of RR          of RR
1       0.833        -0.182    0.264        ...       ...            2.28
2       0.889        -0.118    1.304        ...       ...            8.33
3       1.067        0.065     0.079        ...       ...            1.85
4       0.500        -0.693    1.415        ...       ...            5.15
5       0.648        -0.434    0.709        ...       ...            3.37
6       0.794        -0.231    0.026        ...       ...            1.09
7       0.727        -0.318    0.729        ...       ...            3.87
8       0.297        -1.214    0.142        ...       ...            0.62
9       0.595        -0.520    0.242        0.492     0.23           1.56
10      0.231        -1.465    0.688        0.829     0.05           1.17
11      0.681        -0.385    0.009        0.095     0.56           0.82
12      0.672        -0.398    0.010        0.102     0.55           0.82
13      0.466        -0.763    0.174        0.417     0.21           1.06
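$$W_i = \frac{1}{var[\log RR_i]}$$
$$\log wRR = \frac{\sum W_i \log[RR_i]}{\sum W_i}$$
With variance:
$$Var[\log wRR] = \frac{1}{\sum W_i}$$
and standard error:
$$SE[\log wRR] = \sqrt{\frac{1}{\sum W_i}}$$
For example, for study 1: $W_1 = \frac{1}{0.264} = 3.79$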
Study (i)   Log[RR]   Var[logRR]   Wi        Wi log[RRi]
1           -0.182    0.264        3.79      -0.69
2           -0.118    1.304        0.77      -0.09
3           0.065     0.079        12.61     0.82
4           -0.693    1.415        0.71      -0.49
5           -0.434    0.709        1.41      -0.61
6           -0.231    0.026        38.30     -8.86
7           -0.318    0.729        1.37      -0.44
8           -1.214    0.142        7.03      -8.54
9           -0.520    0.242        4.13      -2.15
10          -1.465    0.688        1.45      -2.13
11          -0.385    0.009        110.78    -42.63
12          -0.398    0.010        96.13     -38.23
13          -0.763    0.174        5.75      -4.39
Total                              284.24    -108.42
We have:
$$\log wRR = \frac{\sum W_i \log[RR_i]}{\sum W_i} = \frac{-108.42}{284.24} = -0.38$$
With variance:
$$Var[\log wRR] = \frac{1}{284.24} = 0.0035$$
and standard error:
$$SE[\log wRR] = \sqrt{0.0035} = 0.06$$
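The pooled risk ratio can be recomputed from the raw counts as follows. This is a sketch: the variance expression is the one whose values appear in the tables above (it reproduces 0.264 for study 1), and the object names log_rr, W, log_wrr and pooled are illustrative.

```r
# Counts from example 2: beta-blocker (n1, d1) and placebo (n2, d2)
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

log_rr <- log((d1 / n1) / (d2 / n2))                     # log risk ratio per study
var_log_rr <- 1/d1 - 1/(n1 - d1) + 1/d2 - 1/(n2 - d2)    # variance as used in the text
W <- 1 / var_log_rr                                      # inverse-variance weights

log_wrr <- sum(W * log_rr) / sum(W)   # weighted mean of log RR
se_wrr  <- sqrt(1 / sum(W))           # its standard error

# Back-transform to the RR scale: estimate, lower and upper 95% limits
pooled <- exp(log_wrr + c(0, -1.96, 1.96) * se_wrr)
names(pooled) <- c("RR", "lower", "upper")

print(round(c(sumW = sum(W), log_wRR = log_wrr, SE = se_wrr), 3))
print(round(pooled, 3))
```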
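To estimate the index $I^2$, we need to compute $W_i (\log RR_i - \log wRR)^2$ for each study.
For example, for study 1 we have:
$$W_1 (\log RR_1 - \log wRR)^2 = 3.79 \times (-0.182 + 0.38)^2 = 0.15$$
Study (i)   Log[RRi]
1           -0.182
2           -0.118
3           0.065
4           -0.693
5           -0.434
6           -0.231
7           -0.318
8           -1.214
9           -0.520
10          -1.465
11          -0.385
12          -0.398
13          -0.763
Example 2 has k = 13 studies. Therefore:
$$Q = \sum_{i=1}^{k} W_i (\log RR_i - \log wRR)^2 = 11.18$$
And:
$$I^2 = \frac{Q - (k - 1)}{Q} = \frac{11.18 - 12}{11.18} = -0.07$$
Because this value is negative, $I^2$ is taken to be 0.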
23
Study (i)   Log[RRi]   Ni
1           -0.182     50
2           -0.118     25
3            0.065     383
4           -0.693     50
5           -0.434     139
6           -0.231     641
7           -0.318     49
8           -1.214     345
9           -0.520     278
10          -1.465     366
11          -0.385     2647
12          -0.398     3991
13          -0.763     426
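The regression check for publication bias can be sketched in R as follows. This is only a minimal illustration: it assumes the log risk ratios and total sample sizes are typed in from the table above, the names `log_rr`, `N` and `fit` are made up for the example, and the plain unweighted `lm()` fit is one simple choice rather than a method the book prescribes.

```r
# Log risk ratios and total sample sizes (N = n1 + n2) for the 13 studies
log_rr <- c(-0.182, -0.118, 0.065, -0.693, -0.434, -0.231, -0.318,
            -1.214, -0.520, -1.465, -0.385, -0.398, -0.763)
N <- c(50, 25, 383, 50, 139, 641, 49, 345, 278, 366, 2647, 3991, 426)

# Regress the effect size on the study size: log_rr = a + b * N
fit <- lm(log_rr ~ N)

# The row labelled N holds the estimate of b, its standard error and p-value
print(summary(fit)$coefficients)
```

A slope b that is clearly different from zero would suggest that the effect size changes with study size, which is the pattern one would look at when publication bias is suspected.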
# Data from example 2
n1 <- c(25,9,194,25,105,320,33,261,133,232,1327,1990,214)
d1 <- c(5,1,23,1,4,53,3,12,6,2,156,145,8)
n2 <- c(25,16,189,25,34,321,16,84,145,134,1320,2001,212)
d2 <- c(6,2,21,2,2,67,2,13,11,5,228,217,17)
# Create a data frame named bb
bb <- data.frame(n1,d1,n2,d2)
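The hand calculations of this chapter can be reproduced from the raw counts with a few vectorized lines. The sketch below is an illustration under stated assumptions: the variable names (`rr`, `var_lrr`, `w`, `log_wrr`, and so on) are invented here, and the variance line follows the arithmetic of the chapter's worked example (which gives 0.264 for study 1), so the figures may differ slightly from what a dedicated meta-analysis package reports.

```r
# Data from Example 2: sample sizes (n) and deaths (d) in each group
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

rr      <- (d1 / n1) / (d2 / n2)   # risk ratio for each study
log_rr  <- log(rr)
# Variance as used in the chapter's worked example (assumption, see lead-in)
var_lrr <- 1 / d1 - 1 / (n1 - d1) + 1 / d2 - 1 / (n2 - d2)
w       <- 1 / var_lrr             # inverse-variance weights

log_wrr <- sum(w * log_rr) / sum(w)          # weighted (pooled) log RR
se_wrr  <- sqrt(1 / sum(w))                  # its standard error
q       <- sum(w * (log_rr - log_wrr)^2)     # heterogeneity statistic Q
i2      <- max(0, (q - (length(rr) - 1)) / q)  # I^2, truncated at 0

# Per-study table and pooled summary with a 95% confidence interval
print(round(data.frame(rr, log_rr, var_lrr, w), 3))
print(c(pooled_rr = exp(log_wrr),
        lower = exp(log_wrr - 1.96 * se_wrr),
        upper = exp(log_wrr + 1.96 * se_wrr),
        Q = q, I2 = i2))
```

Keeping every step as a named vector makes it easy to compare each column with the printed tables.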
        RR            95%-CI  %W(fixed)  %W(random)
7   0.7273  [0.1346; 3.9282]       0.49        0.49
8   0.2971  [0.1410; 0.6258]       2.49        2.49
9   0.5947  [0.2262; 1.5632]       1.48        1.48
10  0.2310  [0.0454; 1.1744]       0.52        0.52
11  0.6806  [0.5635; 0.8221]      38.81       38.81
12  0.6719  [0.5496; 0.8214]      34.31       34.31
13  0.4662  [0.2056; 1.0570]       2.07        2.07

          95%-CI        z   p.value
[0.6064; 0.7672]  -6.3741  < 0.0001
[0.6064; 0.7672]  -6.3741  < 0.0001

Quantifying heterogeneity:
tau^2 = 0; H = 1 [1; 1.45]; I^2 = 0% [0%; 52.6%]

Test of heterogeneity:
  Q  d.f.  p.value
 11    12   0.5292

Method: Inverse variance method
[Figure: forest plot of the 13 studies; x-axis: Relative Risk, log scale from 0.05 to 10.00]
***
In fact, science in general has a long tradition of reviewing research evidence and current knowledge. But such reviews are usually qualitative, and because they are qualitative it is hard to know exactly how the studies differ quantitatively. Meta-analysis gives us a means of systematizing the evidence quantitatively. With meta-analysis, we have the opportunity to:
Educational
[2] Normand SL. Meta-analysis: formulating, evaluating, combining, and reporting. Stat Med. 1999;18(3):321-59.
[3] Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21:1539-1558.
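For continuous variables (effect size $d_i$)

Variance of $d_i$:
$$s_{d_i}^2 = \frac{(n_{1i} - 1)s_{1i}^2 + (n_{2i} - 1)s_{2i}^2}{n_{1i} + n_{2i} - 2}\left(\frac{1}{n_{1i}} + \frac{1}{n_{2i}}\right)$$
Standard error of $d_i$:
$$s_{d_i} = \sqrt{s_{d_i}^2}$$
Weight: $W_i = \frac{1}{s_{d_i}^2}$
Pooled effect estimate:
$$d = \sum_{i=1}^{k} W_i d_i \Big/ \sum_{i=1}^{k} W_i$$
Variance of $d$:
$$s_d^2 = 1 \Big/ \sum_{i=1}^{k} W_i$$
$$Q = \sum_{i=1}^{k} W_i (d_i - d)^2$$
Index of heterogeneity:
$$I^2 = \frac{Q - (k - 1)}{Q}$$
Checking for publication bias: linear regression $d_i = a + b N_i$ ($N_i$ is the total sample size of study i). Examine the statistical significance of b.
$$\tau^2 = \max\left\{0,\ \frac{Q - (k - 1)}{\sum_{i=1}^{k} W_i - \sum_{i=1}^{k} W_i^2 \Big/ \sum_{i=1}^{k} W_i}\right\}$$

For binary variables

Group 1 (sample size, number of events): $n_{1i}$, $x_{1i}$; i = 1, 2, 3, ..., k
Group 2 (sample size, number of events): $n_{2i}$, $x_{2i}$; i = 1, 2, 3, ..., k
Effect size (ES) expressed as the risk ratio RR:
$$RR_i = \frac{x_{1i} / n_{1i}}{x_{2i} / n_{2i}}$$
Transformation to the logarithm:
$$\theta_i = \log(RR_i)$$
Variance of $\theta_i$:
$$s_{\theta_i}^2 = \frac{1}{x_{1i}} - \frac{1}{n_{1i} - x_{1i}} + \frac{1}{x_{2i}} - \frac{1}{n_{2i} - x_{2i}}$$
Standard error of $\theta_i$:
$$s_{\theta_i} = \sqrt{s_{\theta_i}^2}$$
Weight: $W_i = \frac{1}{s_{\theta_i}^2}$
Pooled effect estimate:
$$\theta = \sum_{i=1}^{k} W_i \theta_i \Big/ \sum_{i=1}^{k} W_i$$
Variance of $\theta$:
$$s_\theta^2 = 1 \Big/ \sum_{i=1}^{k} W_i$$
$$Q = \sum_{i=1}^{k} W_i (\theta_i - \theta)^2$$
Index of heterogeneity:
$$I^2 = \frac{Q - (k - 1)}{Q}$$
Checking for publication bias: linear regression $\theta_i = a + b N_i$ ($N_i$ is the total sample size of study i). Examine the statistical significance of b.
$$\tau^2 = \max\left\{0,\ \frac{Q - (k - 1)}{\sum_{i=1}^{k} W_i - \sum_{i=1}^{k} W_i^2 \Big/ \sum_{i=1}^{k} W_i}\right\}$$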
CHAPTER XV
SAMPLE SIZE ESTIMATION
15
Sample size estimation
A study is usually based on a sample. One of the most important questions before a study begins is how many samples, or how many subjects, it needs. A "subject" here is the basic unit of the study: a patient, a volunteer, a field plot, a plant, a device, and so on. Estimating the number of subjects needed is extremely important, because it can decide whether the study succeeds or fails. If there are too few subjects, the conclusions drawn from the study will not be precise, or no conclusion can be drawn at all. Conversely, if there are far more subjects than needed, resources, money and time are wasted. The key task before a study is therefore to estimate a number of subjects that is just sufficient for its aims. That number depends on three main factors:
simplest are the null hypothesis (the phenomenon does not exist, denoted H-) and the main hypothesis (the phenomenon exists, denoted H+).
We use statistical tests such as the t, F, z and $\chi^2$ tests to assess how plausible a hypothesis is. The result of a statistical test can be reduced to two values: either statistically significant (statistical significance) or not statistically significant (non-significance). As mentioned in Chapter 7, statistical significance is usually judged by the P value: if P < 0.05 we declare the result statistically significant; if P > 0.05 we say it is not. Significance can also be thought of as having or not having a signal. Let T+ denote a statistically significant result and T- a test result that is not statistically significant.
Consider a concrete example. To find out whether risedronate is effective in treating osteoporosis, we run a study with 2 groups of patients (one group treated with risedronate and one group given only a placebo). We follow the patients, collect fracture data, estimate the fracture rate in each group, and compare the two rates with a statistical test. The test result is either statistically significant (P<0.05) or not statistically significant (P>0.05). Remember that we do not know whether risedronate truly prevents fractures; we can only pose hypothesis H. So when a hypothesis is set against the result of a statistical test, there are four situations:
(a) Hypothesis H is true (risedronate is effective) and the test gives P<0.05;
(b) Hypothesis H is true, but the test result is not statistically significant;
(c) Hypothesis H is false (risedronate is not effective), but the test result is statistically significant;
(d) Hypothesis H is false and the test result is not statistically significant.
Here, cases (a) and (d) pose no problem, because the test result agrees with the true state of the phenomenon. In cases (b) and (c), however, we make an error, because the test result does not agree with the hypothesis. Statistics has several terms for these situations:
the probability of situation (c) is called the type I error (or significance level) and is usually denoted $\alpha$. In other words, $\alpha$ is the probability that the statistical test gives p<0.05 given that hypothesis H is false;
the probability of situation (d) is not a matter of concern and so has no special term, although it can be called a true negative.
                                     H true                  H false
                                     (drug is effective)     (drug is not effective)
Statistically significant (p<0.05)   (a) true positive       (c) false positive
Not statistically significant        (b) false negative      (d) true negative

Medical diagnosis
                          Diseased                       Not diseased
Test result +ve           True positive (sensitivity)    False positive
Test result -ve           False negative                 True negative (specificity)
As for error probabilities, a study usually accepts a type I error of about 1% or 5% (that is, $\alpha$ = 0.01 or 0.05) and a type II error of about $\beta$ = 0.1 to $\beta$ = 0.2 (that is, power must be 0.8 to 0.9).
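$$n = \frac{C}{(\Delta / \sigma)^2} \qquad [1]$$
When there are two groups of subjects, the number of subjects (n) needed for the study can be calculated as follows:
$$n = \frac{2C}{(\Delta / \sigma)^2} \qquad [2]$$
Values of the constant C:
alpha     beta = 0.20       beta = 0.10       beta = 0.05
          (Power = 0.80)    (Power = 0.90)    (Power = 0.95)
0.10      6.15              8.53              10.79
0.05      7.85              10.51             13.00
0.01      13.33             16.74             19.84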
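The constant C can also be computed rather than looked up. The sketch below assumes C is built as $(z_{\alpha/2} + z_{\beta})^2$ for a two-sided test; this reproduces the printed 7.85 and 10.51, but it is an assumption about how the table was constructed and need not match every printed cell. The function names `sample_size_constant` and `n_one_group` are illustrative.

```r
# Assumed construction of the constant: C = (z_{alpha/2} + z_beta)^2
sample_size_constant <- function(alpha, power) {
  (qnorm(1 - alpha / 2) + qnorm(power))^2
}

c_80 <- sample_size_constant(alpha = 0.05, power = 0.80)  # about 7.85
c_90 <- sample_size_constant(alpha = 0.05, power = 0.90)  # about 10.51

# Formula [1]: n = C / (Delta / sigma)^2 for a single group
n_one_group <- function(delta, sigma, alpha = 0.05, power = 0.80) {
  sample_size_constant(alpha, power) / (delta / sigma)^2
}

# Error of 1 cm with a standard deviation of 4.6 cm
n_height <- n_one_group(delta = 1, sigma = 4.6)
print(c(C_80 = c_80, C_90 = c_90, n = n_height))
```

Formula [2] for two groups would simply double the value returned by `n_one_group`.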
15.4 Sample size estimation
15.4.1 Sample size for a mean
Example 1: We want to estimate the height of Vietnamese men and accept an error within 1 cm (d = 1), with a confidence level of 0.95 (that is, $\alpha$ = 0.05) and power = 0.8 (or $\beta$ = 0.2). Previous studies indicate that the standard deviation of height in Vietnamese people is about 4.6 cm. We can apply formula [1] to estimate the sample size needed for the study:
$$n = \frac{C}{(\Delta / \sigma)^2} = \frac{7.85}{(1 / 4.6)^2} = 166$$
In other words, we need to measure the height of 166 subjects to estimate the height of Vietnamese men within an error of 1 cm.
If the acceptable error is 0.5 cm (instead of 1 cm), the number of subjects needed is $n = \frac{7.85}{(0.5 / 4.6)^2} = 664$. If the error we accept is 0.1 cm, the number of subjects rises to 16610! These estimates show that sample size depends heavily on the error we accept. The more precise an estimate we want, the more subjects we need.
          n = 168.0131
      delta = 1
         sd = 4.6
  sig.level = 0.05
      power = 0.8
alternative = two.sided

          n = 666.2525
      delta = 0.5
         sd = 4.6
  sig.level = 0.05
      power = 0.8
alternative = two.sided

          n = 198.1513
      delta = 3
         sd = 15
  sig.level = 0.05
      power = 0.8
alternative = two.sided
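Output of this form comes from R's `power.t.test` function. The commands themselves are not shown here, so the calls below are a reconstruction under an assumption: the printed values of n are consistent with a one-sample test (`type = "one.sample"`), and the object names `res_1cm` and `res_half_cm` are illustrative.

```r
# Sample size for estimating a mean within 1 cm, sd = 4.6 cm
# (type = "one.sample" is an assumption that matches the printed n)
res_1cm <- power.t.test(delta = 1, sd = 4.6, sig.level = 0.05,
                        power = 0.80, type = "one.sample")

# Same calculation with an acceptable error of 0.5 cm
res_half_cm <- power.t.test(delta = 0.5, sd = 4.6, sig.level = 0.05,
                            power = 0.80, type = "one.sample")

print(res_1cm)
print(res_half_cm)
```

The values are slightly larger than the 166 and 664 from formula [1] because `power.t.test` works with the t distribution rather than the normal approximation.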
In practice, many studies aim to compare two groups. Sample size estimation for such studies relies mainly on formula [2], as presented in section 15.3.1.
Example 3: A study is designed to test the drug alendronate in the treatment of osteoporosis in postmenopausal women. Two groups of patients are recruited: group 1 is the intervention group (treated with alendronate) and group 2 is the control group (not treated). The criterion for assessing the drug's efficacy is bone mineral density (BMD). Data from epidemiological studies show that mean BMD in postmenopausal women is 0.80 g/cm2, with a standard deviation of 0.12 g/cm2. The question is how many subjects we need to study to show that after 12 months of treatment, BMD in group 1 is about 5% higher than in group 2.
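Here $\Delta = 0.05 \times 0.80 = 0.04$ g/cm2 and $\sigma = 0.12$ g/cm2, so:
$$n = \frac{2C}{(\Delta / \sigma)^2} = \frac{2 \times 10.51}{(0.04 / 0.12)^2} = 189$$

          n = 190.0991
      delta = 0.04
         sd = 0.12
  sig.level = 0.05
      power = 0.9
alternative = two.sided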
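The same figure can be obtained directly in R. The call below is a sketch that assumes the default two-sample design of `power.t.test`; the object name `res_bmd` is illustrative.

```r
# Two-sample comparison of mean BMD: difference 0.04, sd 0.12, power 0.90
res_bmd <- power.t.test(delta = 0.04, sd = 0.12, sig.level = 0.05,
                        power = 0.90)

# n is the number of subjects required in EACH group
print(res_bmd)
print(ceiling(res_bmd$n))
```

Since n refers to each group, the whole study would need roughly twice that number of subjects.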
The method of estimating sample size for comparing two groups can be extended to comparisons among more than two groups. With several groups, as discussed in Chapter 11, the method of comparison is analysis of variance. In this method, the residual mean square (RMS) is the estimate of the variability of the measurement within each group, and this quantity is very important in estimating sample size.
The theory behind sample size estimation for analysis of variance is fairly complex and is beyond the scope of this chapter, but the main principle does not differ from that for comparing two groups. Suppose the means of the k groups are $\mu_1, \mu_2, \mu_3, \ldots, \mu_k$. We can compute the between-group sum of squares as
$$SS = \sum_{i=1}^{k} (\mu_i - \bar{\mu})^2, \quad \text{where } \bar{\mu} = \sum_{i=1}^{k} \mu_i / k.$$
Let $\lambda = \frac{SS}{(k - 1)\,RMS}$. The problem is to find the sample size n such that $z_\beta$ meets the required power of 0.80 or 0.9, where
z =
( k 1)(1 + n ) F + k ( n 1)(1 + 2n )
     groups = 4
          n = 12.81152
between.var = 3.486667
 within.var = 8.7
  sig.level = 0.05
      power = 0.9
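Output with these fields is what R's `power.anova.test` function prints. The call below is a sketch that feeds the printed values back into the function; the object name `res_anova` is illustrative, and the inputs are taken from the output above rather than from a worked example.

```r
# Sample size per group for a one-way analysis of variance with 4 groups
res_anova <- power.anova.test(groups = 4,
                              between.var = 3.486667,  # variance of the group means
                              within.var = 8.7,        # residual mean square
                              sig.level = 0.05,
                              power = 0.90)

print(res_anova)
print(ceiling(res_anova$n))  # subjects needed in each group, rounded up
```

Here `between.var` plays the role of SS/(k - 1) and `within.var` the role of RMS in the notation of this section.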
$\sqrt{p(1 - p)/n}$.
$$n = \left(\frac{1.96}{m}\right)^2 p(1 - p)$$
$$n = \left(\frac{1.96}{0.02}\right)^2 \times 0.7 \times 0.3$$
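The calculation for a single proportion is short enough to wrap in a function. This sketch assumes the usual normal-approximation margin of error, $m = 1.96\sqrt{p(1-p)/n}$, solved for n; the function name `n_for_proportion` and its arguments are illustrative.

```r
# Sample size to estimate a proportion p within a margin of error m
# (assumes the normal approximation with a two-sided confidence level)
n_for_proportion <- function(p, m, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)   # 1.96 for a 95% confidence level
  ceiling((z / m)^2 * p * (1 - p))
}

# Expected proportion 0.7, margin of error 0.02
n_prop <- n_for_proportion(p = 0.7, m = 0.02)
print(n_prop)
```

Rounding up with `ceiling` keeps the margin of error at or below the requested value.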
Many inferential studies have two [or more than two] groups to compare. In section 15.4.2 we became familiar with sample size estimation for comparing two means with the t test. Those are studies whose outcomes are continuous variables. But some studies have outcomes that are not continuous but binary, as I have just discussed in section 15.4.3. To compare two proportions, the most commonly used tests are the binomial test and the Chi-squared ($\chi^2$) test. In this section I go through sample size calculation for these two kinds of test.
Call the two proportions [which we do not know but want to learn about] $p_1$ and $p_2$, and let $\Delta = p_1 - p_2$. The hypothesis we want to test is $\Delta = 0$. The theory behind sample size estimation for this test is rather lengthy, but it can be summarized by the following formula:
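$$n = \frac{\left(z_{\alpha/2}\sqrt{2\bar{p}(1 - \bar{p})} + z_{\beta}\sqrt{p_1(1 - p_1) + p_2(1 - p_2)}\right)^2}{\Delta^2}$$
where $\bar{p} = (p_1 + p_2)/2$, $z_{\alpha/2}$ is the z value of the standard normal distribution for probability $\alpha/2$ (for example, when $\alpha$ = 0.05, $z_{\alpha/2}$ = 1.96; when $\alpha$ = 0.01, $z_{\alpha/2}$ = 2.57), and $z_{\beta}$ is the z value of the standard normal distribution for probability $\beta$ (for example, when $\beta$ = 0.10, $z_{\beta}$ = 1.28; when $\beta$ = 0.20, $z_{\beta}$ = 0.84).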
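Example 6: A randomized controlled clinical trial is designed to assess the efficacy of a drug against vertebral fracture. Two groups of patients will be recruited. Group 1 is treated with the drug, and group 2 is the control group (not treated). The researchers hypothesize that the fracture rate in group 2 is about 10%, and that the drug can reduce this rate to about 6%. If the researchers want to test this hypothesis with a type I error of $\alpha$ = 0.01 and power = 0.90, how many patients need to be recruited for the study?
$$n = \frac{\left(2.57\sqrt{2 \times 0.08 \times 0.92} + 1.28\sqrt{0.10 \times 0.90 + 0.06 \times 0.94}\right)^2}{(0.04)^2} = 1361$$
This is the number needed in each group, so the study must recruit at least 2722 patients to test the hypothesis above.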
The R function power.prop.test can be used to compute the sample size for this case. power.prop.test needs information such as power, sig.level, p1, and p2. For the example above, we can write:
> power.prop.test(p1=0.10, p2=0.06, power=0.90, sig.level=0.01)
     Two-sample comparison of proportions power calculation

              n = 1366.430
             p1 = 0.1
             p2 = 0.06
      sig.level = 0.01
          power = 0.9
    alternative = two.sided
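To see where the small gap between 1361 and 1366 comes from, the formula can be coded by hand and compared with `power.prop.test`. This is a sketch under the assumption that the formula is the one given earlier in this section; the function name `n_two_props` is illustrative.

```r
# Hand-coded sample size per group for comparing two proportions
n_two_props <- function(p1, p2, alpha, power) {
  p_bar <- (p1 + p2) / 2
  z_a <- qnorm(1 - alpha / 2)   # exact quantile instead of the rounded 2.57
  z_b <- qnorm(power)           # exact quantile instead of the rounded 1.28
  (z_a * sqrt(2 * p_bar * (1 - p_bar)) +
     z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p1 - p2)^2
}

n_hand <- n_two_props(p1 = 0.10, p2 = 0.06, alpha = 0.01, power = 0.90)
n_r <- power.prop.test(p1 = 0.10, p2 = 0.06, power = 0.90,
                       sig.level = 0.01)$n

print(c(by_hand = n_hand, power.prop.test = n_r))
```

With unrounded normal quantiles the hand formula lands close to the value R reports, which suggests the difference in the worked example is mostly due to rounding the z values.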
CHAPTER XVI
PROGRAMMING AND FUNCTIONS
16
Appendix 1: Programming and functions with R
R was developed so that users can write functions suited to their own analysis and computation needs. Indeed, as mentioned at the start of the book, R can be regarded as a statistical language, and we can use the language to solve problems not usually found in textbooks. In this section I present only a few simple functions so that readers can see how R works, in the hope that this helps readers develop their own functions later.
A function (sometimes called a macro in other software) is essentially a set of commands stored under a name. At its simplest, a function is shorthand for a group of commands.
Example 1. In the following commands we create two data sets (data1 and data2). Each has two columns of numbers simulated from the normal distribution. We then plot the two data sets with annotations.
# Wrap the commands in a function so that they can be reused
plotfigure <- function() {
  data1 <- cbind(rnorm(100, 1), rnorm(100, 0))
  data2 <- cbind(rnorm(100, -1), rnorm(100, 0))
  # Common axis ranges so both data sets fit on the same plot
  xr <- range(rbind(data1, data2)[, 1])
  yr <- range(rbind(data1, data2)[, 2])
  plot(data1, xlim=xr, ylim=yr, col=1, xlab="", ylab="")
  par(new=TRUE)
  plot(data2, xlim=xr, ylim=yr, col=2, xlab="", ylab="")
  title(main="My simulated data", xlab="Weight", ylab="Yield")
  legend(-3.0, -1.5, c("Big", "Small"), col=1:2, pch=1)
}
Once the function has been entered into R, we simply call it as many times as we like:
> plotfigure()
> plotfigure()
and the results look like this:
[Figure: two scatter plots titled "My simulated data"; x-axis: Weight, y-axis: Yield; legend: Big, Small]
When using the function, we simply change n and mean. In the two commands below, we first draw a scatter plot with 200 data points and means of -2 and 2. In the second command, we increase the number of data points to 500, but the means stay the same as in the previous simulation:
> plotfigure(200, 2, -2)
> plotfigure(500, 2, -2)
And the results differ from those above:
[Figure: two scatter plots titled "My simulated data"; x-axis: Weight, y-axis: Yield; legend: Big, Small]
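A version of plotfigure that accepts the number of points and the two means could look like the sketch below. This is an illustration only: the argument names `n`, `mean1` and `mean2`, their default values, and the invisibly returned list are assumptions made here, not the function as printed in the book.

```r
# Hypothetical parameterized version: n points per data set,
# mean1 and mean2 are the means of the first column of each data set
plotfigure <- function(n = 100, mean1 = 1, mean2 = -1) {
  data1 <- cbind(rnorm(n, mean1), rnorm(n, 0))
  data2 <- cbind(rnorm(n, mean2), rnorm(n, 0))
  both <- rbind(data1, data2)
  xr <- range(both[, 1])   # common x range for both data sets
  yr <- range(both[, 2])   # common y range for both data sets
  plot(data1, xlim = xr, ylim = yr, col = 1,
       main = "My simulated data", xlab = "Weight", ylab = "Yield")
  points(data2, col = 2)   # add the second data set to the same plot
  legend("bottomleft", c("Big", "Small"), col = 1:2, pch = 1)
  # Return the simulated data invisibly so it can be inspected if needed
  invisible(list(data1 = data1, data2 = data2))
}

sim <- plotfigure(200, 2, -2)
```

Using `points()` instead of `par(new=TRUE)` avoids drawing the axes twice, and placing the legend by keyword keeps it inside the plot whatever the simulated ranges are.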
add <- function(a, b)
{
  sum <- a + b
  ans <- "Answer = "
  cat(ans, sum, "\n")
}
for the user of the function, where \n means that after the output is shown, the user is given a new prompt. You can paste the commands above into R and try:
> add(3, 9)
Answer = 12
> add(sqrt(5), exp(10))
Answer = 22028.7
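$$x_i \sim N(\mu, \sigma^2)$$
If we have prior information that $\mu$ follows a normal distribution with mean $\mu_0$ and variance $\sigma_0^2$, that is:
$$\mu \sim N(\mu_0, \sigma_0^2)$$
then by Bayes' theorem we can estimate the mean
$$\mu_p = \frac{\dfrac{\mu_0}{\sigma_0^2} + \dfrac{n\bar{x}}{\sigma^2}}{\dfrac{1}{\sigma_0^2} + \dfrac{n}{\sigma^2}}$$
and the variance
$$\frac{1}{\sigma_p^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}.$$
Here $\bar{x}$ is the mean of the sample of size n. $\mu_p$ and $\sigma_p^2$ are called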
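The two formulas above translate directly into a small R function. This is a minimal sketch assuming the data variance is known; the function name `posterior_normal` and the argument names `sigma2`, `mu0` and `sigma02` are illustrative, not taken from the book.

```r
# Mean and variance of mu after combining a normal prior N(mu0, sigma02)
# with data x whose variance sigma2 is assumed known
posterior_normal <- function(x, sigma2, mu0, sigma02) {
  n <- length(x)
  precision <- 1 / sigma02 + n / sigma2              # 1 / sigma_p^2
  mu_p <- (mu0 / sigma02 + n * mean(x) / sigma2) / precision
  c(mean = mu_p, var = 1 / precision)
}

# Three observations with mean 2, unit data variance, prior N(0, 1)
post <- posterior_normal(c(1, 2, 3), sigma2 = 1, mu0 = 0, sigma02 = 1)
print(post)
```

In this small case the result lies between the prior mean 0 and the sample mean 2, pulled toward the data because the three observations carry more weight than the prior.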
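CHAPTER XVII
SOME COMMONLY USED R COMMANDS
17
Appendix 2
Some commonly used commands in R
Commands for the R working environment
getwd()                  Show the current working directory
setwd("c:/works")        Set the working directory to c:/works
options(prompt="R>")     Change the prompt to R>
options(width=100)       Set the output width to 100 characters
options(scipen=3)        Make R less inclined to use scientific notation
options()                Show all current options
Basic commands
ls()                     List the objects in the workspace
rm(object)               Remove an object
search()                 Show the search path
Arithmetic operators
+        Addition
-        Subtraction
*        Multiplication
/        Division
^        Exponentiation
%/%      Integer division
%%       Remainder from dividing two integers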
Logical symbols
==           Equal to
!=           Not equal to
<            Less than
>            Greater than
<=           Less than or equal to
>=           Greater than or equal to
is.na(x)     Is x a missing value?
&            And (AND)
|            Or (OR)
!            Not (NOT)
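Generating numbers
numeric(n)        Gives n zeros
character(n)      Gives n empty character strings
logical(n)        Gives n FALSE values
seq(-4,3,0.5)     The sequence -4.0, -3.5, -3.0, ..., 3.0
1:10              Same as the command seq(1, 10, 1)
c(5,7,9,1)        Enter the numbers 5, 7, 9 and 1
rep(1, 5)         Gives five 1s: 1, 1, 1, 1, 1
gl(3,2,12)        Factor with 3 levels, each repeated 2 times, 12 values in total:
                  1 1 2 2 3 3 1 1 2 2 3 3
Data frames
data.frame(x,y)   Combine x and y into a data frame
tuan$age          Select the variable age from the data frame tuan
attach(tuan)      Put the data frame tuan on the search path
detach(tuan)      Remove the data frame tuan from the search path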
Mathematical functions
log(x)            Logarithm to base e
log10(x)          Logarithm to base 10
exp(x)            Exponential
sin(x)            Sine
cos(x)            Cosine
tan(x)            Tangent
asin(x)           Arcsine (inverse sine)
acos(x)           Arccosine (inverse cosine)
atan(x)           Arctangent (inverse tangent)
Statistical functions
min(x)            Smallest value of the variable x
max(x)            Largest value of the variable x
which.max(x)      Find which row holds the largest value of x
which.min(x)      Find which row holds the smallest value of x
Indexing
x[1]              First value of the variable x
x[1:5]            First five values of the variable x
x[y<=30]          Select x where y is less than or equal to 30
x[sex=="male"]    Select x where sex equals male
Data input
data(name)           Load a data set
read.table(name)     Read data from the file name
read.csv(name)       Read Excel-style data (separated by ",") from the file name
read.delim(name)     Read tab-delimited data
read.delim2(name)    Read tab-delimited data with "," as the decimal mark
read.csv2(name)      Read csv data separated by ";" with "," as the decimal mark
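Statistical tests
t.test(x, y)                         t test
t.test(x, y, paired=TRUE)            t test for a paired design
cor.test(x, y)                       Test of the correlation coefficient
cor.test(x, y, method="kendall")     Test of Kendall's rank correlation
cor.test(x, y, method="spearman")    Test of Spearman's rank correlation
var.test                             Test of variances
bartlett.test                        Test of several variances
wilcox.test                          Wilcoxon test
kruskal.test                         Kruskal test
friedman.test                        Friedman test
Linear models
lm(y ~ x)                            Simple linear regression
lm(y ~ factor)                       One-way analysis of variance
lm(y ~ factor+x)                     Analysis of covariance
lm(y ~ x1+x2+x3)                     Multiple linear regression
Proportions and tables
binom.test                           Exact binomial test
prop.test                            Test of proportions
prop.trend.test                      Test for trend in proportions
fisher.test                          Fisher's exact test
chisq.test                           Chi-squared test
glm(y ~ x1+x2+x3)                    Generalized linear model
Survival analysis
s <- Surv(time,event)                Create a survival object
survfit(s ~ 1)                       Estimate the survival curve
survdiff(s ~ g)                      Compare survival between the groups of g
coxph(s ~ x1+x2)                     Cox proportional hazards model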
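Graphics
plot(y~x)            Plot y against x (scatter plot)
plot(x, y)           Plot y against x (scatter plot)
coplot(y ~ x | z)    Plot y against x for each group of z
pie(x)               Pie chart
boxplot(x)           Box plot
qqnorm(x)            Quantile plot of the variable x
qqplot(x, y)         Quantile plot of the variable y against x
barplot(x)           Bar chart of the variable x
hist(x)              Histogram of the variable x
stars(x)             Star plot of the variable x
abline(a, b)         Draw a straight line with intercept=a and slope=b
abline(h=y)          Draw a horizontal line
abline(v=x)          Draw a vertical line
abline(lm.object)    Draw the line fitted by a linear model
Some graphical parameters
pch                  Plotting symbol
mfrow, mfcol         Layout of several plots on one page, filled by row or by column
xlim, ylim           Limits of the x and y axes
xlab, ylab           Labels of the x and y axes
lty, lwd             Line type and line width
cex, mex             Size of text and symbols; scaling of the margins
col                  Colour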
CHAPTER XVIII
GLOSSARY
18
Appendix 3
Terms used in the book
English                                  Vietnamese
95% confidence interval                  Khoảng tin cậy 95%
Akaike Information criterion (AIC)       Tiêu chuẩn thông tin Akaike
Analysis of covariance                   Phân tích hiệp biến
Analysis of variance (ANOVA)             Phân tích phương sai
Bar chart                                Biểu đồ thanh
Binomial distribution                    Phân phối nhị phân
Box plot                                 Biểu đồ hình hộp
Categorical variable                     Biến thứ bậc
Clock chart                              Biểu đồ đồng hồ
Coefficient of correlation               Hệ số tương quan
Coefficient of determination             Hệ số xác định bội
Coefficient of heterogeneity             Hệ số bất đồng nhất
Combination                              Tổ hợp
Continuous variable                      Biến liên tục
Correlation                              Tương quan
Covariance                               Hợp biến
Cross-over experiment                    Thử nghiệm giao chéo
Cumulative probability distribution      Hàm phân phối tích lũy
Degree of freedom                        Bậc tự do
Determinant                              Định thức
Discrete variable                        Biến rời rạc
Dot chart                                Biểu đồ điểm
Estimate                                 Ước số
Estimator                                Hàm ước lượng thống kê
Factorial analysis of variance           Phân tích phương sai cho thí nghiệm giai thừa
Fixed effects                            Ảnh hưởng bất biến
Frequency                                Tần số
Function                                 Hàm
Heterogeneity                            Bất đồng nhất
Histogram                                Biểu đồ tần số
Homogeneity                              Đồng nhất
Hypothesis test                          Kiểm định giả thiết
Inverse matrix                           Ma trận nghịch đảo
Latin square experiment                  Thí nghiệm hình vuông Latin
Weighted mean
CHAPTER XIX
19
Afterword
(references and further reading)
Through 15 chapters and 3 appendices, you have travelled with me on a fairly long journey through statistical analysis and graphics. Before we part, I should say a few words of farewell.
My own teaching and research experience shows that for most students, a first encounter with statistical science is hardly an exciting experience, if not a difficult one, simply because the textbooks written for the subject are far removed from practice, or touch on practice only through trivial, bland examples. Abstract concepts, tangled formulas, and complicated, tedious calculations leave learners disoriented and uninterested in pursuing the subject. Indeed, when reading textbooks or research papers we sometimes come across good methods and models that suit our own research, but do not know how to compute them. In this book I want to give readers a practical means of analysis to fill that methodological gap.
Learning must go hand in hand with practice. The best way to learn, in my view, is [to put it plainly] to imitate. R offers a very convenient way to learn by imitation. While reading these chapters and their examples, you can type the commands into the computer and check whether the results agree with what you have read. Once you know how to use a function or command, you can add (or remove) its arguments to see how the results change. Only by learning this way will you master the concepts and the use of R.
We learn from mistakes. In this book I want readers to take a rather bumpy road, that is, to interact with the computer through R commands. During that interaction some commands will fail to run because of a mistyped variable name or a spelling error, because upper and lower case were not respected, because the data are incomplete or wrong, and so on. Every one of those mistakes gives you experience and makes you more proficient. This is the way of learning that the English call "trial and error": learning from mistakes and experiments.
A data analysis project needs many R commands and functions. However, because R is interactive, these commands disappear when R is closed. The question is how to save them in a file for later reuse. An extremely useful program for this purpose is Tinn-R (it can also be downloaded and installed completely free of charge).
The website for downloading Tinn-R and its documentation is:
http://www.sciviews.org/Tinn-R.
Tinn-R is essentially an editor for R (and many other programs). It lets us save all the commands for an analysis project in one file. With Tinn-R we have online help on how to use the commands and functions in R. When a command is typed with incorrect R syntax, Tinn-R reports it immediately and suggests a fix! The Tinn-R interface looks like this:
Linear Models with R (Chapman & Hall/CRC, 2004) by Julian Faraway. The book can currently be downloaded free from the internet at http://www.stat.lsa.umich.edu/~faraway/book/pra.pdf or http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf. The document is 213 pages long.
R Graphics (Computer Science and Data Analysis) (Chapman & Hall/CRC, 2005) by Paul Murrell. This book is devoted to graphical analysis with R. It contains a great deal of code that readers can use to design complex and colourful graphs themselves.
Modern Applied Statistics with S-Plus (Springer, 4th Edition, 2003) by W. N. Venables and B. D. Ripley was written for the S-Plus language, but all the commands and code in the book can be applied to R without change. (S-Plus is the predecessor of R, but S-Plus is commercial software, whereas R is completely free!) This can be called the reference book for anyone who wants to develop further in R. The two authors are also experts with
And finally, a guide to using R for data analysis and graphics (about 50 pages, regularly updated) written by me in Vietnamese. Website: www.R.ykhoa.net essentially summarizes some of the main chapters of