
NGUYỄN VĂN TUẤN

DATA ANALYSIS AND GRAPHICS WITH R

A Practical Guide

Contents

1 Preface

2 Introduction to the R language
2.1 What is R?
2.2 Downloading and installing R
2.3 Packages for specialized analyses
2.4 Starting and quitting R
2.5 The grammar of the R language
2.6 Naming objects in R
2.7 Getting help in R
2.8 The working environment

3 Entering data
3.1 Entering data directly: c()
3.2 Entering data directly: edit(data.frame())
3.3 Reading data from a text file: read.table()
3.4 Reading data from Excel: read.csv
3.5 Reading data from SPSS: read.spss
3.6 Basic information about a data set

4 Editing data
4.1 Checking for missing values: na.omit()
4.2 Splitting data: subset
4.3 Extracting data from a data.frame
4.4 Merging two data.frames into one: merge
4.5 Data coding
4.5.1 Recoding with the replace function
4.5.2 Converting a continuous variable into a discrete variable
4.6 Dividing a continuous variable into groups: cut
4.7 Grouping data with cut2 (Hmisc)

5 Using R for simple calculations and matrices
5.1 Simple calculations
5.2 Date data
5.3 Generating sequences with seq, rep and gl
5.4 Using R for matrix calculations
5.4.1 Extracting elements from a matrix
5.4.2 Calculations with matrices

6 Probability calculations and simulation
6.1 Simple calculations
6.1.1 Permutations
6.1.2 Combinations
6.2 Random variables and distribution functions
6.3 Probability distribution functions
6.3.1 The binomial distribution
6.3.2 The Poisson distribution
6.3.3 The normal distribution
6.3.4 The standardized normal distribution
6.3.5 The t, F and χ² distributions
6.4 Simulation
6.4.1 Simulating the binomial distribution
6.4.2 Simulating the Poisson distribution
6.4.3 Simulating the χ², t, F, gamma, beta, Weibull and Cauchy distributions
6.5 Random sampling

7 Statistical hypothesis testing and the meaning of the P value
7.1 The P value
7.2 Scientific hypotheses and falsification
7.3 The meaning of the P value through simulation
7.4 The logical problem of the P value
7.5 The problem of multiple tests of hypothesis

8 Analyzing data with graphs
8.1 The graphics environment and graph design
8.1.1 Several graphs in one window
8.1.2 Labeling the vertical and horizontal axes
8.1.3 Setting limits for the vertical and horizontal axes
8.1.4 Plot types and line types
8.1.5 Colors, frames and symbols
8.1.6 Legends
8.1.7 Writing text in a graph
8.2 Data for graphical analysis
8.3 Graphs for one discrete variable: barplot
8.4 Graphs for two discrete variables: barplot
8.5 Pie charts
8.6 Graphs for one continuous variable: stripchart and hist
8.6.1 Stripchart
8.6.2 Histogram
8.6.3 Boxplot
8.6.4 Barchart
8.6.5 Dotchart
8.7 Graphical analysis of two continuous variables
8.7.1 Scatter plot
8.8 Graphical analysis of several variables: pairs
8.9 Some multi-purpose graphs
8.9.1 Scatter plot with boxplots
8.9.2 Scatter plot with a third variable shown as point size
8.9.3 Bar chart with cumulative probability
8.9.4 Clock plot
8.9.5 Graphs with standard errors
8.9.6 Contour plot
8.9.10 Graphs with mathematical symbols

9 Descriptive statistical analysis
9.0 The concepts of population and sample
9.1 Descriptive statistics: summary
9.2 Testing whether a variable is normally distributed
9.3 Descriptive statistics by group
9.4 The t test (t.test)
9.4.1 One-sample t test
9.4.2 Two-sample t test
9.5 Comparing variances (var.test)
9.6 The Wilcoxon test for two samples (wilcox.test)
9.7 The t test for paired variables (paired t-test, t.test)
9.8 The Wilcoxon test for paired variables (wilcox.test)
9.9 Frequencies
9.10 Testing a proportion (proportion test, prop.test, binom.test)
9.11 Comparing two proportions (prop.test, binom.test)
9.12 Comparing several proportions (prop.test, chisq.test)
9.12.1 The chi-squared test
9.12.2 Fisher's test

10 Linear regression analysis
10.1 Correlation coefficients
10.1.1 Pearson's correlation coefficient
10.1.2 Spearman's correlation coefficient
10.1.3 Kendall's correlation coefficient
10.2 The simple linear regression model
10.2.1 A few lines of theory
10.2.2 Simple linear regression analysis with R
10.2.3 Assumptions of linear regression analysis
10.2.4 Prediction models
10.3 Multiple linear regression
10.4 Polynomial regression analysis
10.5 Building a linear model from many variables
10.6 Building a linear model with Bayesian Model Average (BMA)

11 Analysis of variance
11.1 One-way analysis of variance (ANOVA)
11.1.1 The analysis of variance model
11.1.2 One-way analysis of variance with R
11.2 Multiple comparisons and adjusting p values
11.2.1 Multiple comparisons by Tukey's method
11.2.2 Graphical analysis
11.3 Nonparametric analysis
11.4 Two-way analysis of variance (ANOVA)
11.4.1 Two-way analysis of variance with R
11.5 Analysis of covariance (ANCOVA)
11.5.1 The analysis of covariance model
11.5.2 Analysis with R
11.6 Analysis of variance for factorial experiments
11.7 Analysis of variance for Latin square experiments
11.8 Analysis of variance for cross-over experiments
11.9 Analysis of variance for repeated measure experiments

12 Logistic regression analysis
12.1 The logistic regression model
12.2 Logistic regression analysis with R
12.3 Estimating probabilities with R
12.4 Logistic regression analysis from summarized data with R
12.5 Multiple logistic regression and model selection
12.6 Selecting a logistic regression model with Bayesian Model Average
12.7 Data used for the analysis

13 Survival analysis
13.1 Models for time-to-event data
13.2 Kaplan-Meier estimation with R
13.3 Comparing two cumulative probability functions: the log-rank test
13.4 The log-rank test with R
13.5 The Cox model (Cox's proportional hazards model)
13.6 Building a Cox model with Bayesian Model Average (BMA)

14 Meta-analysis
14.1 The need for meta-analysis
14.2 Fixed effects and random effects
14.3 The process of a meta-analysis
14.4 Fixed-effects meta-analysis for a continuous outcome
14.4.1 Meta-analysis by manual calculation
14.4.2 Meta-analysis with R
14.5 Fixed-effects meta-analysis for a dichotomous outcome
14.5.1 The analysis model
14.5.2 Analysis with R

15 Estimation of sample size
15.1 The concept of power
15.2 Statistical hypothesis testing and disease diagnosis
15.3 Data needed to estimate sample size
15.4 Estimating sample size
15.4.1 Sample size for a mean
15.4.2 Sample size for comparing two means
15.4.3 Sample size for analysis of variance
15.4.4 Sample size for estimating a proportion
15.4.5 Sample size for comparing two proportions

16 Appendix 1: Programming and writing functions in R

17 Appendix 2: Some commonly used R commands

18 Appendix 3: Terminology used in the book

19 Afterword (references and further reading)

CHAPTER I

PREFACE

1 Preface
Contrary to what many people think, statistics is a scientific discipline in its own right:
statistical science. Its analytical methods rest on mathematics and probability, but that is
only the "technical" part; the more important part is designing studies and interpreting what
the data mean. A statistician is therefore not merely someone who analyzes data, but a
scientist, a thinker about scientific research. This is why statistical science plays an
extremely important, indeed indispensable, role in scientific research, above all in the
experimental sciences. It is fair to say that today, without statistics, genetic experiments
with their millions upon millions of data points would be nothing but lifeless, meaningless
numbers.
A scientific study, however expensive and important, has no scientific value if it is not
analyzed with the right methods. That is why, in scientific journals around the world, almost
every medical paper has a "Statistical Analysis" section in which the authors must describe
carefully how the analysis and calculations were done and explain briefly why those methods
were used, so as to back up, or add scientific weight to, the claims made in the paper. The
more prestigious the medical journal, the heavier its requirements for statistical analysis.
To repeat for emphasis: without a statistical analysis section, a paper has no scientific
value.
One of the most important developments in statistical science is the use of computers for
statistical analysis and calculation. It is no exaggeration to say that without computers,
statistical science would still be a dull, dry discipline full of complicated formulas with
little practical application. Computers have helped statistical science carry out the
greatest revolution in its history: bringing statistics into practice, solving the thorniest
problems and contributing to the development of experimental science.
The author still remembers, more than 20 years ago as a student in a master's program in
statistics in Australia, a respected professor telling a story about the distinguished
American statistician Fred Mosteller. During World War II he received a research contract
from the US Department of Defense to improve the accuracy of American weapons, for which he
had to solve a statistical problem with about 30 parameters. He had to hire 20 graduate
students for the job: 10 spent all day calculating by hand, while the other 10 checked the
calculations of the first 10. The work took nearly a month. Today, with a modest personal
computer, that statistical analysis can be solved in about 1 second.

But a computer without software is just a "lifeless" and useless pile of iron and silicon.
One piece of software that has been, is, and will be revolutionizing statistics is R. It has
been developed and refined over roughly the past 10 years by a number of statisticians and
scientists around the world, for use in learning, teaching and research. This book introduces
the reader to using R for statistical analysis and graphics.
Why R? Software for statistical analysis has long existed and been widely used. But the
well-known packages, from "old-timers" such as MINITAB and BMD-P to relatively recent ones
such as STATISTICA, SPSS, SAS and Stata, are usually very expensive (the price for a
university can reach hundreds of thousands of dollars a year), beyond the means of an
individual and sometimes even of a university. R has changed this situation, because R is
completely free. Contrary to the common perception, free does not mean poor quality. Indeed,
besides being completely free, R can do all (to repeat: all) of the analyses that commercial
software can do, and more. R can be downloaded to anyone's personal computer, at any time,
anywhere in the world. After a few minutes of installation it is ready for use. That is why
most universities in the West and around the world are increasingly switching to R for
learning, research and teaching. In that trend, this book has the modest aim of introducing
R to readers in Vietnam so that they can keep up with developments in statistical computing
and analysis around the world.
This book is written mainly for university students and scientific researchers who need
software for learning statistics, analyzing data, or drawing graphs from scientific data. It
is not a textbook on statistical theory, nor does it set out to teach statistical analysis as
such, but it will help the reader do statistical analysis more effectively and with more
enjoyment. My main aim is to give the reader a basic knowledge of statistics and of how to
apply R to solve problems, and thereby a foundation for exploring or further developing R.
I believe that, as in any profession, the best way to learn statistical analysis is to do the
analysis yourself. This book is therefore written with many examples and real data. The
reader can follow the instructions while reading (by typing the commands into the computer)
and will find it more engaging. If you already have research data of your own, learning will
be more effective if you apply the calculations in the book to it right away. Students who
have no data yet can use simulation methods to gain a better understanding of statistics.
Statistical science is still relatively new in our country, so some terms have not yet been
translated in a consistent and complete way. The reader will therefore find a few
"unfamiliar" terms here and there in the book; in such cases I try to give the original
English term for reference. In addition, at the end of the book I list the English-Vietnamese
terms used in the book.
All the data used in this book can be downloaded from the Internet to a personal computer, or
accessed directly at the web page: http://www.ykhoa.net/R.
I hope the reader will find in this book some useful information and some techniques or
calculations that help with learning, teaching and research. But probably no book is perfect
or free of shortcomings; so if you find an error in the book, please let me know by e-mail at
t.nguyen@garvan.org.au or rknguyen@gmail.com. My sincere thanks in advance.
I would like to take this opportunity to thank Dr. Nguyễn Hoàng Dzũng of the Faculty of
Chemistry, Ho Chi Minh City University of Technology, who suggested and helped me print this
book in Vietnam. I thank Dr. Nguyễn Đình Nguyên, who read a large part of the manuscript,
contributed many practical comments, and designed the cover. I also thank the Ho Chi Minh
City University of Technology Publishing House for helping me print this book.
Now I invite the reader to join me on a short "statistical journey" with R.

Sydney, 31 March 2006

Nguyễn Văn Tuấn

CHAPTER II

INTRODUCTION TO THE R LANGUAGE

2 Introduction to the R language
2.1 What is R?
In short, R is software for statistical analysis and graphics. In essence, however, R is a
general-purpose computer language that can be used for many different purposes, from simple
calculations, recreational mathematics and matrix computation to complex statistical
analyses. Because it is a language, R can also be used to develop specialized software for a
particular computational problem.
R was created by two statisticians, Ross Ihaka and Robert Gentleman. Since R appeared, a
great many statisticians and mathematicians around the world have supported it and taken part
in its development. The creators of R follow an open philosophy (Open Access), and partly for
that reason R is completely free. Anyone, anywhere in the world, can access and download the
entire source code of R to their own computer. To date, after less than 5 years of
development, more and more statisticians, mathematicians and researchers in every field have
switched to R for analyzing scientific data. Worldwide there is a network of nearly one
million R users, and the number is growing exponentially. It may well be that within 10
years we will no longer need expensive statistical software such as SAS, SPSS or Stata (which
can cost up to 100,000 USD a year), because all of those analyses can be carried out in R.
Therefore anyone doing scientific research, especially in countries that are still poor such
as ours, should learn to use R for statistical analysis and graphics. This short text will
guide the reader in using R. I assume that the reader knows nothing about R, but I do expect
some familiarity with using a computer.

2.2 Downloading and installing R

To use R, we must first install it on our computer. To do so, go to the website of the
Comprehensive R Archive Network (CRAN):
http://cran.R-project.org.
The name of the file to download depends on the version, but usually begins with the letter
R followed by the version number. For example, the version I used at the end of 2005 was
2.2.1, so the file to download is:

R-2.2.1-win32.exe
This file is about 26 MB, and the exact download address is:
http://cran.r-project.org/bin/windows/base/R-2.2.1-win32.exe
The website also offers a great deal of documentation on using R, at every level from
elementary to advanced. For readers not yet comfortable with English, this book provides the
information needed to use R without having to read those documents.
Once R has been downloaded, the next step is to install (set up) it. Simply double-click the
downloaded file and follow the installation instructions on the screen. This is a very simple
step; the installation can be completed in about 1 minute.

2.3 Packages for specialized analyses

R provides a computer language and a number of functions for basic, simple analyses. For
more complex analyses we need to download additional packages. A package is a small piece of
software, developed by statisticians to solve a specific problem, that runs inside R. For
linear regression, for example, R has the function lm, but for deeper and more complex
analyses we need packages such as lme4. Such packages must be downloaded and installed.
The packages are available from the same address, http://cran.r-project.org: click on
"Packages" in the menu on the left of the web page. Some packages needed for the examples in
this book are:

Package     Purpose
Trellis     Drawing graphs and making them look better
lattice     Drawing graphs and making them look better
Hmisc       F. Harrell's data modeling methods
Design      F. Harrell's study design models
Epi         Epidemiological analyses
epitools    Another package for epidemiological analyses
foreign     Importing data from other software such as SPSS, Stata, SAS, etc.
Rmeta       Meta-analysis
meta        Another package for meta-analysis
survival    Analysis with the Cox model (Cox's proportional hazards model)
splines     Package needed for survival to run
Zelig       Statistical analyses in the social sciences
genetics    Analysis of genetic data
BMA         Bayesian Model Average
leaps       Package used by BMA
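
As a minimal sketch of how one might check which of the packages above are already installed, the snippet below uses requireNamespace(). The helper name has_package is hypothetical, and install.packages(), which needs an Internet connection, is shown only as a comment.

```r
# Hypothetical helper: TRUE if a package is installed, FALSE otherwise.
has_package <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Check a few of the packages listed in the table above.
wanted <- c("foreign", "survival", "Hmisc")
status <- vapply(wanted, has_package, logical(1))
print(status)

# To install whatever is missing (requires an Internet connection):
# install.packages(wanted[!status])

# Once installed, a package is loaded with library(), for example:
# library(foreign)
```

Checking first avoids reinstalling packages that are already present.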

2.4 Starting and quitting R

After installation, an icon named "R 2.2.1" appears on the desktop, and R is ready to use.
Double-click the icon to open the R console window.

R is normally used in "command line" mode, meaning that we type commands directly at the
red prompt in that window. Commands must strictly follow the "grammar" and language of R;
indeed, the whole purpose of this text is to help the reader understand and write in that
language. One rule is that R distinguishes between "Library" and "library"; in other words,
R is case-sensitive. Another convention is that where a name consists of two words, R usually
joins them with a dot instead of a space, as in data.frame, t.test, read.table, etc. Keeping
this in mind saves a lot of time.
If a command is grammatically correct, R gives us another prompt or prints a result
(depending on the command); if not, R prints a short message saying that the command is
wrong or not understood. For example, if we type:
> x <- rnorm(20)
>
R understands and carries out the command, then gives us another prompt: >. But if we type:

> R is great
R does not "agree" with this command, because it is not valid R, and the following message
appears:
Error: syntax error
>
To leave R, simply click the close button (x) in the top-right corner of the window, or type
the command q().

2.5 The grammar of the R language

The basic unit of R's grammar is a command, or function. A function takes arguments, so the
function name is followed by the arguments we must supply. For example, in:
> reg <- lm(y ~ x)
reg is an object, lm is a function, and y ~ x is the argument of the function. Likewise, in:
> setwd("c:/works/stats")
setwd is a function and "c:/works/stats" is its argument.
To find out which arguments a function takes, use args(x) (args is short for arguments),
where x is the function of interest:
> args(lm)
function (formula, data, subset, weights, na.action, method = "qr",
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
    contrasts = NULL, offset, ...)
NULL

R is an object-oriented language, which means that data in R are held in objects. This has
some consequences for how R is written. For example, to test whether x equals 5 we do not
write x = 5 as we would in ordinary notation; R requires x == 5.
In R, x = 5 is an assignment, equivalent to x <- 5. The latter form (using the symbol <-) is
preferred to the former (=). For example:
> x <- rnorm(10)
simulates 10 values and stores them in the object x. We could also write x = rnorm(10).
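
The difference between assignment and comparison can be sketched as follows. The variable names and the fixed seed are assumptions made only for illustration.

```r
# Assignment: both forms store the value 5.
x <- 5
y = 5

# Comparison: == asks a question and returns TRUE or FALSE.
same <- (x == y)
print(same)

# rnorm(10) simulates 10 values from a standard normal distribution.
set.seed(1)   # assumption: a fixed seed, only to make the run repeatable
z <- rnorm(10)
print(length(z))
```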
Some symbols commonly used in R are:

x == 5      x equals 5
x != 5      x is not equal to 5
y < x       y is less than x
x > y       x is greater than y
z <= 7      z is less than or equal to 7
p >= 1      p is greater than or equal to 1
is.na(x)    is x a missing value?
A & B       A and B (AND)
A | B       A or B (OR)
!           not (NOT)
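
A short sketch of these operators applied to a vector that contains a missing value. The vector v is hypothetical.

```r
v <- c(3, 5, NA, 8)

v == 5      # element-wise equality; the NA stays NA
v != 5
is.na(v)    # TRUE only for the missing value

# Combine conditions with & (AND), | (OR) and ! (NOT).
in_range <- !is.na(v) & v >= 4 & v <= 8
either   <- v < 4 | v > 7

print(in_range)
print(sum(in_range))   # number of non-missing values between 4 and 8
```

Comparisons involving NA give NA rather than TRUE or FALSE, which is why is.na() is used to test for missing values.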

In R, anything written after the symbol # on a line is ignored; # is reserved for comments
added by the user. For example:
> # the following command simulates 10 normal values
> x <- rnorm(10)

2.6 Naming objects in R

Naming an object or a variable in R is quite flexible, because R has fewer restrictions than
other software. An object name must be written as one word, with no spaces. For example, R
accepts myobject but not my object.
> myobject <- rnorm(10)
> my object <- rnorm(10)
Error: syntax error in "my object"

A name such as myobject can be hard to read, so it is better to separate the words with a
dot, as in my.object.
> my.object <- rnorm(10)
An important point is that R distinguishes between upper-case and lower-case letters, so
My.object is different from my.object. For example:
> My.object.u <- 15
> my.object.L <- 5
> My.object.u + my.object.L
[1] 20

A few things to keep in mind when naming objects in R:

- Avoid the underscore "_" and the hyphen "-" in names, as in my_object or my-object. (A
hyphen is read as a minus sign, so my-object means "my minus object".)

- Do not give an object the same name as a variable inside a data set. For example, if we
have a data.frame with a variable age in it, we should not also have a separate object called
age; that is, we should not write age <- age. However, if the data.frame is called data, we
can refer to its variable age with the $ symbol: data$age (the variable age in the data.frame
data), and in that case age <- data$age is acceptable.
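
The $ notation can be sketched with a small data.frame. The values are hypothetical, chosen only to show that a copied column is independent of the original.

```r
# Hypothetical data, for illustration only.
data <- data.frame(age = c(50, 62, 60), insulin = c(16.5, 10.8, 32.3))

# Refer to a column explicitly with $.
mean_age <- mean(data$age)

# Copying the column into a separate object is acceptable when done
# deliberately, but later changes to one do not affect the other.
age <- data$age
age[1] <- 99

print(data$age[1])   # the data.frame still holds its original value
print(age[1])
```

This is one reason to avoid duplicate names: it quickly becomes unclear which age a command refers to.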

2.7 Getting help in R
Besides args(), R provides help() so that users can learn the "grammar" of each function.
For example, to see which arguments the function lm takes, simply type:
> help(lm)
or
> ?lm
A window appears showing how the function is used, often with examples. You can copy and
paste the examples into R to see how they work.
Before using R, in addition to this book you may wish to read the documentation that comes
with R, by choosing the Help menu and then "Html help". You can also copy and paste commands
from that documentation into R to see how R works.

Instead of using the menu, you can simply type:

> help.start()
and a window opens with documentation for the whole R system.
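
Alongside args() and help(), a function's arguments can also be inspected programmatically. The sketch below uses formals(), a base R function that this text does not otherwise cover; the variable names are illustrative.

```r
# args() prints a function's argument list; formals() returns it as a
# list that can be inspected in code.
lm_args <- names(formals(lm))
print(lm_args)

# Does lm accept an argument called "weights"?
has_weights <- "weights" %in% lm_args
print(has_weights)

# example(lm) would run the examples from lm's help page; it is
# commented out here because it produces a lot of output.
# example(lm)
```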
The function apropos is also very useful: it lists all objects in R whose names contain the
characters we are looking for. For example, to find the functions whose names contain "lm",
simply type:
> apropos("lm")

and R reports the following:
 [1] ".__C__anova.glm"      ".__C__anova.glm.null" ".__C__glm"
 [4] ".__C__glm.null"       ".__C__lm"             ".__C__mlm"
 [7] "anova.glm"            "anova.glmlist"        "anova.lm"
[10] "anova.lmlist"         "anova.mlm"            "anovalist.lm"
[13] "contr.helmert"        "glm"                  "glm.control"
[16] "glm.fit"              "glm.fit.null"         "hatvalues.lm"
[19] "KalmanForecast"       "KalmanLike"           "KalmanRun"
[22] "KalmanSmooth"         "lm"                   "lm.fit"
[25] "lm.fit.null"          "lm.influence"         "lm.wfit"
[28] "lm.wfit.null"         "model.frame.glm"      "model.frame.lm"
[31] "model.matrix.lm"      "nlm"                  "nlminb"
[34] "plot.lm"              "plot.mlm"             "predict.glm"
[37] "predict.lm"           "predict.mlm"          "print.glm"
[40] "print.lm"             "residuals.glm"        "residuals.lm"
[43] "rstandard.glm"        "rstandard.lm"         "rstudent.glm"
[46] "rstudent.lm"          "summary.glm"          "summary.lm"
[49] "summary.mlm"          "kappa.lm"

2.8 The working environment

Data must be kept in a directory on the computer. Before using R, it is probably best to
create a directory for the data, for example c:\works\stats. To tell R where the data are, we
use the command setwd (set working directory):
> setwd("c:/works/stats")
This tells R that the data are stored in the directory c:\works\stats. Note that R uses the
forward slash "/" rather than the backslash "\" used by Windows.
To find out which directory R is currently working in, type:
> getwd()
[1] "C:/Program Files/R/R-2.2.1"
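
The working directory can be changed and restored as sketched below. A folder inside R's temporary directory stands in for c:/works/stats, so the example does not depend on any real folder; the names project_dir and note.txt are hypothetical.

```r
# Hypothetical project folder created inside R's temporary directory.
project_dir <- file.path(tempdir(), "stats")
dir.create(project_dir, showWarnings = FALSE)

old_dir <- setwd(project_dir)   # setwd() returns the previous directory
print(getwd())

# Files are now read and written relative to the working directory.
writeLines("hello", "note.txt")
found <- file.exists(file.path(project_dir, "note.txt"))

setwd(old_dir)                  # restore the original working directory
```

Building paths with file.path() uses forward slashes, which sidesteps the backslash issue noted above.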
The default R prompt is ">". If we prefer a different prompt, it is easy to change:
> options(prompt="R> ")
R>
Or:
> options(prompt="Tuan> ")
Tuan>
By default R formats its output for a line width of 80 characters. For a wider display, type:
> options(width=100)
To make R less inclined to print numbers in scientific notation:
> options(scipen=3)
(Note that scipen does not set the number of decimal places; the number of significant
digits printed is controlled by options(digits=...).)

All of these settings are changed with the command options(). To see the current settings,
type:
> options()
To find the current date:
> Sys.Date()
[1] "2006-03-31"
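
Because options() returns the previous values of whatever it changes, settings such as those above can be tried and then restored. A minimal sketch, with illustrative values:

```r
# options() returns the old values of the settings it changes.
old <- options(digits = 3, width = 100)

shown <- format(pi)          # formatted with 3 significant digits
print(shown)

current_width <- getOption("width")
print(current_width)

options(old)                 # restore the earlier settings
```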
For readers who need more information, several documents available online (in English) are
also very useful. They can be downloaded free of charge:
R for beginners (by Emmanuel Paradis):
http://cran.r-project.org/doc/contrib/rdebuts_en.pdf
Using R for data analysis and graphics (by John Maindonald):
http://cran.r-project.org/doc/contrib/usingR.pdf

CHAPTER III

ENTERING DATA

3 Entering data
To analyze data with R, the data must be in a form that R can understand and process. In
this book that form is a data.frame. There are many ways to get data into a data.frame, from
typing the values in directly to importing them from various sources. The most common ways
are described below.

3.1 Entering data directly: c()

Example 1: We have the age and insulin values of 10 patients, shown below, and want to enter
them into R.

age   insulin
50    16.5
62    10.8
60    32.3
40    19.3
48    14.2
47    11.3
57    15.5
70    15.8
48    16.2
67    11.2

We can use the function named c as follows:

> age <- c(50,62,60,40,48,47,57,70,48,67)
> insulin <- c(16.5,10.8,32.3,19.3,14.2,11.3,15.5,15.8,16.2,11.2)

The first command tells R to create a column of data (from now on called a variable) named
age, and the second creates another column named insulin. We could of course choose any
other names we like.
The function c (short for concatenate, meaning "join together") is used to enter the data.
Note that the values for the individual patients are separated by commas.
The notation insulin <- (which can also be written insulin =) means that the values that
follow are stored in the variable insulin. We will meet this notation many times when using
R.
R is an object-oriented language, and each column of data, like each data.frame, is an
object to R. Thus age and insulin are two separate objects. We now need to combine these two
objects into one data.frame so that R can process them later. For this we use the function
data.frame:
> tuan <- data.frame(age, insulin)

This command tells R to put the two columns (or objects) age and insulin into one object
named tuan.

We now have a complete object ready for statistical analysis. To see what tuan contains,
simply type:
> tuan

and R reports:
   age insulin
1   50    16.5
2   62    10.8
3   60    32.3
4   40    19.3
5   48    14.2
6   47    11.3
7   57    15.5
8   70    15.8
9   48    16.2
10  67    11.2

To save these data in a file in R format, we use the command save. Suppose we want to save
the data in the directory c:\works\stats; we type:
> setwd("c:/works/stats")
> save(tuan, file="tuan.rda")

The first command (setwd, where wd stands for working directory) tells R that we want to
save the data in the directory c:\works\stats. Note that Windows normally uses the backslash
"\", but in R we use the forward slash "/".
The second command (save) tells R to store the object tuan in a file named tuan.rda. After
these two commands have been run, a file named tuan.rda is present in that directory.
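
A file written with save() is read back with load(). The round trip can be sketched as follows, using a temporary file in place of c:/works/stats/tuan.rda (an assumption made so that the example runs anywhere).

```r
age     <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)
insulin <- c(16.5, 10.8, 32.3, 19.3, 14.2, 11.3, 15.5, 15.8, 16.2, 11.2)
tuan    <- data.frame(age, insulin)

# Assumption: a temporary file stands in for c:/works/stats/tuan.rda.
rda_file <- file.path(tempdir(), "tuan.rda")
save(tuan, file = rda_file)

original <- tuan
rm(tuan)                     # remove the object from the workspace ...
loaded <- load(rda_file)     # ... and bring it back; load() returns the names

print(loaded)
print(dim(tuan))
```

load() restores objects under the names they had when saved, so there is no need to assign its result to tuan.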

3.2 Entering data directly: edit(data.frame())

Example 1 (continued): We can also enter the age and insulin values of the 10 patients with
a very handy function: edit(data.frame()). With this function R opens a new window with rows
and columns like an Excel spreadsheet, in which we can type the data. For example:
> ins <- edit(data.frame())

A spreadsheet-like window opens.

At this point R does not know what our variables are, so it labels them var1, var2, etc.
Click on the column heading var1 and change it by typing age; click on var2 and change it to
insulin. Then type the values into each column. When finished, click the close button (X) in
the top-right corner of the spreadsheet; we then have a data.frame named ins with the two
variables age and insulin.

3.3 Reading data from a text file: read.table

Example 2: We have collected data on age and cholesterol from a study of 50 patients with
hypertension. The data are stored in a text file named chol.txt in the directory
c:\works\stats. The columns are: patient identification number (id), sex, age, body mass
index (bmi), HDL cholesterol (hdl), LDL cholesterol (ldl), total cholesterol (tc) and
triglycerides (tg).

id  sex  age  bmi  hdl    ldl  tc   tg
1   Nam  57   17   5.000  2.0  4.0  1.1
2   Nu   64   18   4.380  3.0  3.5  2.1
3   Nu   60   18   3.360  3.0  4.7  0.8
4   Nam  65   18   5.920  4.0  7.7  1.1
5   Nam  47   18   6.250  2.1  5.0  2.1
6   Nu   65   18   4.150  3.0  4.2  1.5
7   Nam  76   19   0.737  3.0  5.9  2.6
8   Nam  61   19   7.170  3.0  6.1  1.5
9   Nam  59   19   6.942  3.0  5.9  5.4
10  Nu   57   19   5.000  2.0  4.0  1.9
11  Nu   63   20   4.217  5.0  6.2  1.7
12  Nam  51   20   4.823  1.3  4.1  1.0
13  Nu   60   20   3.750  1.2  3.0  1.6
14  Nam  42   20   1.904  0.7  4.0  1.1
15  Nam  64   20   6.900  4.0  6.9  1.5
16  Nu   49   20   0.633  4.1  5.7  1.0
17  Nu   44   21   5.530  4.3  5.7  2.7
18  Nu   45   21   6.625  4.0  5.3  3.9
19  Nu   80   21   5.960  4.3  7.1  3.0
20  Nu   48   21   3.800  4.0  3.8  3.1
21  Nu   61   21   5.375  3.1  4.3  2.2
22  Nu   45   21   3.360  3.0  4.8  2.7
23  Nu   70   21   5.000  1.7  4.0  1.1
24  Nu   51   21   2.608  2.0  3.0  0.7
25  Nam  63   22   4.130  2.1  3.1  1.0
26  Nam  54   22   5.000  4.0  5.3  1.7
27  Nu   57   22   6.235  4.1  5.3  2.9
28  Nam  70   22   3.600  4.0  5.4  2.5
29  Nu   47   22   5.625  4.2  4.5  6.2
30  Nu   60   22   5.360  4.2  5.9  1.3
31  Nu   60   22   6.580  4.4  5.6  3.3
32  Nam  50   22   7.545  4.3  8.3  3.0
33  Nam  60   22   6.440  2.3  5.8  1.0
34  Nu   55   22   6.170  6.0  7.6  1.4
35  Nu   74   23   5.270  3.0  5.8  2.5
36  Nam  48   23   3.220  3.0  3.1  0.7
37  Nu   46   23   5.400  2.6  5.4  2.4
38  Nam  49   23   6.300  4.4  6.3  2.4
39  Nu   69   23   9.110  4.3  8.2  1.4
40  Nu   72   23   7.750  4.0  6.2  2.7
41  Nam  51   23   6.200  3.0  6.2  2.4
42  Nu   58   23   7.050  4.1  6.7  3.3
43  nam  60   24   6.300  4.4  6.3  2.0
44  Nam  45   24   5.450  2.8  6.0  2.6
45  Nam  63   24   5.000  3.0  4.0  1.8
46  Nu   52   24   3.360  2.0  3.7  1.2
47  Nam  64   24   7.170  1.0  6.1  1.9
48  Nam  45   24   7.880  4.0  6.7  3.3
49  Nu   64   25   7.360  4.6  8.1  4.0
50  Nu   62   25   7.750  4.0  6.2  2.5

We want to read these data into R for later analysis, using the command read.table:
> setwd("c:/works/stats")
> chol <- read.table("chol.txt", header=TRUE)

The first command makes sure that R is working in the directory where the data are stored.
The second asks R to read the data from the file chol.txt (in the directory c:\works\stats)
and store them in the object chol. In this command, header=TRUE tells R to treat the first
line of the file as the column names.
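
The effect of header=TRUE can be seen in a small sketch. A tiny temporary file with the same layout as the first rows of chol.txt is assumed here in place of the real file.

```r
# Assumption: a tiny file with the same layout as the first rows of chol.txt.
txt_file <- file.path(tempdir(), "chol_demo.txt")
writeLines(c("id sex age bmi hdl ldl tc tg",
             "1 Nam 57 17 5.000 2.0 4.0 1.1",
             "2 Nu 64 18 4.380 3.0 3.5 2.1"),
           txt_file)

with_header    <- read.table(txt_file, header = TRUE)
without_header <- read.table(txt_file, header = FALSE)

print(names(with_header))      # column names taken from the first line
print(names(without_header))   # default names V1, V2, ...
print(dim(with_header))
```

Without header=TRUE the first line is read as data, which gives one extra row and turns every column into text.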
We can check whether R has read all the data by typing:
> chol

or
> names(chol)

R then lists the columns in the data (names asks which columns the data set contains and
what they are called):
[1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"

We can now save the data in R format for later use:
> save(chol, file="chol.rda")

3.4 Reading data from Excel: read.csv

Reading data from Excel takes two steps:

Step 1: Use "Save as" in Excel to save the data in csv format;

Step 2: Use R (the command read.csv) to read the csv file.

Example 3: A data set with the following columns is stored in Excel, in a file named
excel.xls, and we want to move it into R for analysis.

ID  Age  Sex  Ethnicity  IGFI    IGFBP3  ALS     PINP    ICTP  P3NP
1   18   1    1          148.27  5.14    316.00  61.84   5.81  4.21
2   28   1    1          114.50  5.23    296.42  98.64   4.96  5.33
3   20   1    1          109.82  4.33    269.82  93.26   7.74  4.56
4   21   1    1          112.13  4.38    247.96  101.59  6.66  4.61
5   28   1    1          102.86  4.04    240.04  58.77   4.62  4.95
6   23   1    4          129.59  4.16    266.95  48.93   5.32  3.82
7   20   1    1          142.50  3.85    300.86  135.62  8.78  6.75
8   20   1    1          118.69  3.44    277.46  79.51   7.19  5.11
9   20   1    1          197.69  4.12    335.23  57.25   6.21  4.44
10  20   1    1          163.69  3.96    306.83  74.03   4.95  4.84
11  22   1    1          144.81  3.63    295.46  68.26   4.54  3.70
12  27   0    2          141.60  3.48    231.20  56.78   4.47  4.07
13  26   1    1          161.80  4.10    244.80  75.75   6.27  5.26
14  33   1    1          89.20   2.82    177.20  48.57   3.58  3.68
15  34   1    3          161.80  3.80    243.60  50.68   3.52  3.35
16  32   1    1          148.50  3.72    234.80  83.98   4.85  3.80
17  28   1    1          157.70  3.98    224.80  60.42   4.89  4.09
18  18   0    2          222.90  3.98    281.40  74.17   6.43  5.84
19  26   0    2          186.70  4.64    340.80  38.05   5.12  5.77
20  27   1    2          167.56  3.56    321.12  30.18   4.78  6.12

The first thing to do, as mentioned above, is to save the data from Excel in csv format:

In Excel, choose File, then Save as
Under "Save as type", choose "CSV (Comma delimited)"

We then have a file named excel.csv in the directory c:\works\stats.
The second step is to go to R and type the following commands:
> setwd("c:/works/stats")
> gh <- read.csv("excel.csv", header=TRUE)

The second command, read.csv, asks R to read the data from excel.csv, treat the first line
as column names, and store the data in an object named gh.
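
The csv step can be sketched without Excel. Here the first rows of the example table are written to a temporary csv file that stands in for excel.csv (an assumption made for illustration).

```r
# Assumption: a few rows of the example table, written to a temporary
# csv file in place of the excel.csv file saved from Excel.
demo <- data.frame(ID   = 1:3,
                   Age  = c(18, 28, 20),
                   IGFI = c(148.27, 114.50, 109.82))
csv_file <- file.path(tempdir(), "excel_demo.csv")
write.csv(demo, csv_file, row.names = FALSE)

gh_demo <- read.csv(csv_file, header = TRUE)
str(gh_demo)   # shows the column names and the type of each column
```

Running str() after reading a file is a quick way to confirm that numeric columns were read as numbers rather than text.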

We can now save gh in R format for later use with the following command:
> save(gh, file="gh.rda")

3.5 Reading data from SPSS: read.spss

The statistical package SPSS stores data in sav format. Suppose we have a data set named
testo.sav in the directory c:\works\stats and want to convert it into a form R can
understand. For this we use the command read.spss in the package foreign.
First we load foreign with the command library:
> library(foreign)

Then we use read.spss:
> setwd("c:/works/stats")
> testo <- read.spss("testo.sav", to.data.frame=TRUE)

The second command, read.spss, asks R to read the data from testo.sav and put them into a
data.frame named testo.
We can now save testo in R format for later use:
> save(testo, file="testo.rda")

3.6 Basic information about a data set

Suppose we have read the data into a data.frame named chol, as in Example 2. To find out
what the data set contains, we can proceed as follows.

Tell R that we want to work with chol by using attach(arg), where arg is the name of the
data set.

> attach(chol)

Check whether chol is a data.frame with is.data.frame(arg), where arg is the name of the
data set. For example:

> is.data.frame(chol)
[1] TRUE

R confirms that chol is indeed a data.frame.

How many columns (variables) and rows (observations) does the data set have? Use dim(arg),
where arg is the name of the data set (dim is short for dimension). For example (R's result
is shown right after the command):

> dim(chol)
[1] 50  8

So we have 50 rows and 8 columns (variables). What are the variables called? Use names(arg),
where arg is the name of the data set. For example:

> names(chol)
[1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"

How many men and women are there in the variable sex? Use table(arg), where arg is the name
of the variable. For example:

> table(sex)
sex
nam Nam  Nu
  1  21  28

The result shows 21 patients coded Nam (male) and 28 coded Nu (female), plus one coded nam
in lower case. Because R is case-sensitive it counts nam as a separate category, so that
entry must be recoded before the men are counted correctly (22 in all).
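
The stray lower-case category in the table(sex) output can be merged by recoding. A minimal sketch on a short hypothetical vector, not the full chol data:

```r
# Hypothetical values mimicking the coding problem in the sex column.
sex <- c("Nam", "Nu", "nam", "Nu", "Nam")

before <- table(sex)     # "nam" and "Nam" are counted separately
print(before)

# Recode the stray lower-case entry, then count again.
sex_clean <- replace(sex, sex == "nam", "Nam")
after <- table(sex_clean)
print(after)
```

Tabulating a categorical variable right after import is a cheap way to catch coding inconsistencies of this kind.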

CHAPTER IV

DATA EDITING

4
Data editing
Editing data here does not mean altering the original data (that would be a serious offence, a scientific fraud that cannot be accepted); it only means organising the data so that R can analyse them efficiently. In statistical analysis we often need to pool data into one group, split them into separate groups, or recode characters as numbers to make computation easier. In this chapter I go through some basic commands for editing data.
We return to the chol data of example 1. To make the discussion easier to follow, recall that we read the data into an R data set named chol from a text file named chol.txt:
> setwd("c:/works/stats")
> chol <- read.table("chol.txt", header=TRUE)
> attach(chol)

4.1 Checking for missing values

In research, for many reasons, data cannot be collected for every subject, or not every variable can be measured on a given subject. Such empty entries are called missing values. R represents missing values as NA. Some statistical tests require missing values to be removed before the analysis, because they cannot be used in the computation. R has a very useful command for this, na.omit, which is used as follows:
> chol.new <- na.omit(chol)
In this command we ask R to drop every row of the data.frame chol that contains a missing value and to put the complete rows into a new data.frame named chol.new.
Note that the command above is only an illustration, because the chol data contain no missing values.
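Because chol has no missing values, the effect of na.omit is easier to see on a small made-up data.frame. The sketch below assumes a hypothetical object named toy; the values are invented for illustration.

```r
# Hypothetical toy data: two of the four rows contain a missing value
toy <- data.frame(id  = 1:4,
                  age = c(57, NA, 64, 60),
                  tc  = c(4.0, 3.5, NA, 7.7))

# Count the missing values in each column
n_missing <- colSums(is.na(toy))
print(n_missing)

# Keep only the rows that have no NA at all
toy_complete <- na.omit(toy)
print(toy_complete)
```

Counting the NA values first shows how many rows will be lost before they are actually dropped.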

4.2 Splitting data: subset

If, for some reason, we want to analyse men separately, we can split chol into two data.frames, say nam and nu. To do this we use the command subset(data, cond), where data is the data.frame to be split and cond is the condition. For example:
> nam <- subset(chol, sex=="Nam")
> nu <- subset(chol, sex=="Nu")

After these two commands we have two new data sets (two data.frames) named nam and nu. Note that in the conditions sex == "Nam" and sex == "Nu" we use == rather than =, because == tests for equality.
We can of course split the data into other data.frames using conditions on other variables. For example, the following command creates a new data.frame named old containing the patients aged 60 or over:
> old <- subset(chol, age>=60)
> dim(old)
[1] 25 8

Or a new data.frame with the male patients aged 60 or over:

> n60 <- subset(chol, age>=60 & sex=="Nam")
> dim(n60)
[1] 9 8
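subset can also restrict the columns in the same call through its select argument. The sketch below uses a hypothetical data.frame named toy in place of chol; the values are invented.

```r
# Hypothetical toy data standing in for chol
toy <- data.frame(id  = 1:6,
                  sex = c("Nam", "Nu", "Nu", "Nam", "Nam", "Nu"),
                  age = c(57, 64, 60, 65, 47, 65))

# Rows: men aged 60 or over; columns: only id and age
old_men <- subset(toy, age >= 60 & sex == "Nam", select = c(id, age))
print(old_men)
```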

4.3 Extracting data from a data.frame

chol has 8 variables. We can extract from chol only the variables we need, such as the identifier (id), age (age) and total cholesterol (tc). From names(chol) we see that id is column 1, age is column 3, and tc is column 7. We can use the following command:
> data2 <- chol[, c(1,3,7)]

Here we tell R that we want columns 1, 3 and 7, and that all the data in these three columns should go into a new data.frame named data2. Note that we use square brackets [] rather than parentheses (), because chol is not a function. The comma before c means that we select all rows of the data.frame chol.
If we want only the first 10 rows, we put the row numbers before the comma. The command below takes the first 10 rows of columns 1, 2 and 7 (id, sex and tc):
> data3 <- chol[1:10, c(1,2,7)]
> print(data3)
   id sex  tc
1   1 Nam 4.0
2   2  Nu 3.5
3   3  Nu 4.7
4   4 Nam 7.7
5   5 Nam 5.0
6   6  Nu 4.2
7   7 Nam 5.9
8   8 Nam 6.1
9   9 Nam 5.9
10 10  Nu 4.0
Note that print(arg) simply lists all the data in the data.frame arg. In fact, typing data3 alone gives exactly the same result as print(data3).

4.4 Merging two data.frames into one: merge


Suppose we have data stored in two data.frames. The first, named d1, has 3 columns, id, sex and tc, as follows:
id sex tc
1 Nam 4.0
2 Nu 3.5
3 Nu 4.7
4 Nam 7.7
5 Nam 5.0
6 Nu 4.2
7 Nam 5.9
8 Nam 6.1
9 Nam 5.9
10 Nu 4.0
The second data set, named d2, has 3 columns, id, sex and tg, as follows:
id sex tg
1 Nam 1.1
2 Nu 2.1
3 Nu 0.8
4 Nam 1.1
5 Nam 2.1
6 Nu 1.5
7 Nam 2.6
8 Nam 1.5
9 Nam 5.4
10 Nu 1.9
11 Nu 1.7

The two data sets share the variables id and sex, but d1 has 10 rows whereas d2 has 11. We can merge them into a single data.frame with the merge command:
> d <- merge(d1, d2, by="id", all=TRUE)
> d
   id sex.x  tc sex.y  tg
1   1   Nam 4.0   Nam 1.1
2   2    Nu 3.5    Nu 2.1
3   3    Nu 4.7    Nu 0.8
4   4   Nam 7.7   Nam 1.1
5   5   Nam 5.0   Nam 2.1
6   6    Nu 4.2    Nu 1.5
7   7   Nam 5.9   Nam 2.6
8   8   Nam 6.1   Nam 1.5
9   9   Nam 5.9   Nam 5.4
10 10    Nu 4.0    Nu 1.9
11 11  <NA>  NA    Nu 1.7

In the merge command we ask R to combine d1 and d2 into a new data.frame named d, using the variable id as the key. The option all=TRUE keeps rows that appear in only one of the two data sets. Because sex occurs in both data sets but is not part of the key, R keeps both copies and names them sex.x (from d1) and sex.y (from d2). Patient 11 has no tc value, so R records it as NA ("not available").
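If the shared variable sex should appear only once in the result, both id and sex can be given as keys. The sketch below rebuilds d1 and d2 from the listings in this section and is only one possible way to do it.

```r
# Rebuild the two data sets listed in this section
sex10 <- c("Nam", "Nu", "Nu", "Nam", "Nam", "Nu", "Nam", "Nam", "Nam", "Nu")
d1 <- data.frame(id = 1:10, sex = sex10,
                 tc = c(4.0, 3.5, 4.7, 7.7, 5.0, 4.2, 5.9, 6.1, 5.9, 4.0))
d2 <- data.frame(id = 1:11, sex = c(sex10, "Nu"),
                 tg = c(1.1, 2.1, 0.8, 1.1, 2.1, 1.5, 2.6, 1.5, 5.4, 1.9, 1.7))

# Merge on both shared variables; all = TRUE keeps unmatched rows
d <- merge(d1, d2, by = c("id", "sex"), all = TRUE)
print(d)
```

With both keys the result has the four columns id, sex, tc and tg, and patient 11 still shows NA for tc.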

4.5 Data coding


In epidemiological data analysis we often need to convert a continuous variable into a categorical one. In the diagnosis of osteoporosis, for example, women whose bone mineral density (BMD) T-score is at or below -2.5 are classified as "osteoporosis", those with BMD between -2.5 and -1.0 as "osteopenia", and those above -1.0 as "normal". Suppose we have BMD data from 10 patients:
-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11

To enter these data into R we can use the function c:

> bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,-2.00,1.71,2.12,-2.11)

To classify the 3 groups (osteoporosis, osteopenia and normal) we can use the codes 1, 2 and 3. In other words, we want to create another variable (call it diagnosis) that takes these 3 values according to the value of bmd. To do this we use the commands:
# temporarily set diagnosis equal to bmd
> diagnosis <- bmd
# recode bmd into diagnosis
> diagnosis[bmd <= -2.5] <- 1
> diagnosis[bmd > -2.5 & bmd <= -1.0] <- 2
> diagnosis[bmd > -1.0] <- 3
# combine into a data frame
> data <- data.frame(bmd, diagnosis)
# list the data to check that the commands worked
> data
     bmd diagnosis
1  -0.92         3
2   0.21         3
3   0.17         3
4  -3.21         1
5  -1.80         2
6  -2.60         1
7  -2.00         2
8   1.71         3
9   2.12         3
10 -2.11         2
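The same three-way coding can be written in one step with cut (described in section 4.6) by giving the cut points explicitly. This is a sketch of an alternative, not the method used above; the labels are chosen here for illustration.

```r
bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)

# Intervals are closed on the right by default: (-Inf,-2.5], (-2.5,-1], (-1,Inf)
diagnosis <- cut(bmd,
                 breaks = c(-Inf, -2.5, -1.0, Inf),
                 labels = c("osteoporosis", "osteopenia", "normal"))

# The underlying integer codes correspond to the codes 1, 2 and 3
codes <- as.integer(diagnosis)
print(data.frame(bmd, diagnosis, codes))
```

The result is a factor, so it carries readable labels and cannot be averaged by mistake.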

4.5.1 Recoding data with replace


Another way to recode data is replace, although it looks slightly more cumbersome. Continuing the example above, we convert bmd into diagnosis as follows:
> diagnosis <- bmd
> diagnosis <- replace(diagnosis, bmd <= -2.5, 1)
> diagnosis <- replace(diagnosis, bmd > -2.5 & bmd <= -1.0, 2)
> diagnosis <- replace(diagnosis, bmd > -1.0, 3)

4.5.2 Converting to a factor


In statistical analysis we distinguish between factor variables and ordinary numeric variables. A factor cannot be used in arithmetic such as addition, subtraction, multiplication and division, whereas a numeric variable can. In the bmd and diagnosis example above, diagnosis is really a factor, because a value halfway between 1 and 2 has no practical meaning, whereas bmd is numeric.
At this point, however, R still treats diagnosis as a numeric variable. To turn it into a factor we use the function factor:
> diag <- factor(diagnosis)
> diag
[1] 3 3 3 1 2 1 2 3 3 2
Levels: 1 2 3

Note that R now reports that diag has 3 levels: 1, 2 and 3. If we ask R for the mean of diag, R refuses, because diag is not a numeric variable:
> mean(diag)
[1] NA
Warning message:
argument is not numeric or logical: returning NA in: mean.default(diag)

We can of course compute the mean of diagnosis:

> mean(diagnosis)
[1] 2.3
but the result, 2.3, has no practical meaning.

4.6 Grouping with cut


A continuous variable can be divided into groups with the function cut. Suppose we have the variable age:
> age <- c(17,19,22,43,14,8,12,19,20,51,8,12,27,31,44)

The lowest age is 8 and the highest is 51. If we want 2 age groups:
> cut(age, 2)
 [1] (7.96,29.5] (7.96,29.5] (7.96,29.5] (29.5,51]   (7.96,29.5] (7.96,29.5] (7.96,29.5] (7.96,29.5]
 [9] (7.96,29.5] (29.5,51]   (7.96,29.5] (7.96,29.5] (7.96,29.5] (29.5,51]   (29.5,51]
Levels: (7.96,29.5] (29.5,51]

cut divides age into 2 groups of equal width: group 1 from 7.96 to 29.5 and group 2 from 29.5 to 51. We can count the subjects in each age group with table:
> table(cut(age, 2))
(7.96,29.5]   (29.5,51]
         11           4

We can also divide age into 3 groups and give them labels:
> ageg <- cut(age, 3, labels=c("low", "medium", "high"))
> ageg
 [1] low    low    low    high   low    low    low    low    low    high   low    low    medium medium
[15] high
Levels: low medium high

> table(ageg)
ageg
   low medium   high
    10      2      3

We can of course also divide age into 4 groups (quartiles) by supplying the quantiles at 0, 0.25, 0.50, 0.75 and 1 as break points:
> cut(age,
      breaks=quantile(age, c(0, 0.25, 0.50, 0.75, 1)),
      labels=c("q1", "q2", "q3", "q4"),
      include.lowest=TRUE)
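The quartile grouping can be checked by printing the break points and counting the members of each group. The sketch below uses the base function quantile; the object names breaks, ageq and counts are chosen here for illustration.

```r
age <- c(17, 19, 22, 43, 14, 8, 12, 19, 20, 51, 8, 12, 27, 31, 44)

# Break points at the minimum, the three quartiles and the maximum
breaks <- quantile(age, c(0, 0.25, 0.50, 0.75, 1))
print(breaks)

# include.lowest = TRUE keeps the minimum value inside the first group
ageq <- cut(age, breaks = breaks,
            labels = c("q1", "q2", "q3", "q4"),
            include.lowest = TRUE)
counts <- table(ageq)
print(counts)
```

Without include.lowest=TRUE the two subjects aged 8 would fall outside the first interval and be coded NA.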

4.7 Grouping data with cut2 (Hmisc)


The cut function above divides a variable according to its values, not according to the number of observations, so the groups do not have equal sizes. In statistical analysis, however, we sometimes need to divide a continuous variable into groups that follow its distribution but contain equal or similar numbers of observations. For the variable bmd, for example, we can "cut" the data into groups of similar size with the function cut2 (in the Hmisc library):
> # load the Hmisc library so that cut2 is available
> library(Hmisc)
> bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,-2.00,1.71,2.12,-2.11)
> # divide bmd into 2 groups and store the result in the object group
> group <- cut2(bmd, g=2)
> table(group)
group
[-3.21,-0.92) [-0.92, 2.12]
            5             5

As the example shows, g = 2 means dividing into 2 groups (g = group). R automatically forms group 1 with bmd values from -3.21 up to (but not including) -0.92, and group 2 from -0.92 to 2.12. Each group has 5 values.
We can of course also divide into 3 groups:
> group <- cut2(bmd, g=3)

With table we see that there are 3 groups: group 1 has 4 values, and groups 2 and 3 have 3 values each:
> table(group)
group
[-3.21,-1.80) [-1.80, 0.21) [ 0.21, 2.12]
            4             3             3

CHAPTER V

SIMPLE CALCULATIONS AND MATRICES

5
Using R for simple calculations and matrices
One of the advantages of R is that it can be used as a pocket calculator. Beyond that, R can be used for matrix calculations and for programming. In this chapter I present only some simple calculations that a student can try immediately while reading these lines.

5.1 Simple calculations


Adding two or more numbers:
> 15+2997
[1] 3012

Addition and subtraction:
> 15+2997-9768
[1] -6756

Multiplication and division:
> -27*12/21
[1] -15.42857

Powers: $(25 - 5)^3$
> (25 - 5)^3
[1] 8000

Square root: $\sqrt{10}$
> sqrt(10)
[1] 3.162278

The number pi ($\pi$):
> pi
[1] 3.141593
> 2+3*pi
[1] 11.42478

Natural logarithm ($\log_e$):
> log(10)
[1] 2.302585

Base-10 logarithm ($\log_{10}$):
> log10(100)
[1] 2
> log10(2+3*pi)
[1] 1.057848

Exponential: $e^{2.7689}$
> exp(2.7689)
[1] 15.94109

Trigonometric functions:
> cos(pi)
[1] -1

Vectors:
> x <- c(2,3,1,5,4,6,7,6,8)
> x
[1] 2 3 1 5 4 6 7 6 8
> sum(x)
[1] 42
> x*2
[1]  4  6  2 10  8 12 14 12 16
> exp(x/10)
[1] 1.221403 1.349859 1.105171 1.648721 1.491825 1.822119 2.013753 1.822119
[9] 2.225541
> exp(cos(x/10))
[1] 2.664634 2.599545 2.704736 2.405079 2.511954 2.282647 2.148655 2.282647
[9] 2.007132

Sum of squares: $1^2 + 2^2 + 3^2 + 4^2 + 5^2 = ?$
> x <- c(1,2,3,4,5)
> sum(x^2)
[1] 55

Adjusted sum of squares: $\sum_{i=1}^{n} (x_i - \bar{x})^2 = ?$
> x <- c(1,2,3,4,5)
> sum((x-mean(x))^2)
[1] 10

In this expression, mean(x) is the mean of the vector x.

Mean square: $\sum_{i=1}^{n} (x_i - \bar{x})^2 / n = ?$
> x <- c(1,2,3,4,5)
> sum((x-mean(x))^2)/length(x)
[1] 2

In this expression, length(x) is the number of elements in the vector x.

Variance and standard deviation:

Variance: $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1) = ?$
> x <- c(1,2,3,4,5)
> var(x)
[1] 2.5

Standard deviation: $s = \sqrt{s^2}$
> sd(x)
[1] 1.581139
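The variance and standard deviation formulas can be checked against var and sd by computing them step by step. The object names below are chosen for illustration only.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)

# Adjusted sum of squares, then divide by n - 1 for the sample variance
ss <- sum((x - mean(x))^2)
variance_by_hand <- ss / (n - 1)
sd_by_hand <- sqrt(variance_by_hand)

print(c(variance_by_hand, var(x)))
print(c(sd_by_hand, sd(x)))
```

Note that var divides by n - 1, not by n, which is why it returns 2.5 here rather than the mean square of 2.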

5.2 Date data


In statistical analysis, dates are sometimes a difficult problem because there are so many ways to write them. The date 01/02/2003, for example, may also be written 1/2/2003, 01/02/03, 01FEB2003, 2003-02-01, and so on. There is in fact a standard for writing dates, ISO 8601 (although very few people follow it!). Under this standard we write:
2003-02-01
The reasoning behind this format is that we write the largest unit first and then move to progressively smaller units, just as in the number "123" we read "one hundred and twenty-three": hundreds first, then tens, and so on. This is also the standard date format in R.
> date1 <- as.Date("01/02/06", format="%d/%m/%y")
> date2 <- as.Date("06/03/01", format="%y/%m/%d")

Note that the two dates are entered with day, month and year in different orders, but in each case we tell R how to read them using %d (day), %m (month) and %y (two-digit year). We can compute the number of days between the two dates:

> days <- date2-date1
> days
Time difference of 28 days
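The difference between two dates is stored as a difftime object. When a plain number is needed for further calculation, it can be converted, as in the following sketch (the object names are chosen for illustration).

```r
date1 <- as.Date("01/02/06", format = "%d/%m/%y")   # 1 February 2006
date2 <- as.Date("06/03/01", format = "%y/%m/%d")   # 1 March 2006

# Plain number of days
n_days <- as.numeric(date2 - date1)

# The same interval expressed in weeks
n_weeks <- as.numeric(difftime(date2, date1, units = "weeks"))

# Write a date back out in ISO 8601 form
iso <- format(date1, "%Y-%m-%d")
print(c(n_days, n_weeks))
print(iso)
```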

We can also generate a sequence of dates:

> seq(as.Date("2005-01-01"), as.Date("2005-12-31"), by="month")
 [1] "2005-01-01" "2005-02-01" "2005-03-01" "2005-04-01" "2005-05-01"
 [6] "2005-06-01" "2005-07-01" "2005-08-01" "2005-09-01" "2005-10-01"
[11] "2005-11-01" "2005-12-01"

> seq(as.Date("2005-01-01"), as.Date("2005-12-31"), by="2 weeks")
 [1] "2005-01-01" "2005-01-15" "2005-01-29" "2005-02-12" "2005-02-26"
 [6] "2005-03-12" "2005-03-26" "2005-04-09" "2005-04-23" "2005-05-07"
[11] "2005-05-21" "2005-06-04" "2005-06-18" "2005-07-02" "2005-07-16"
[16] "2005-07-30" "2005-08-13" "2005-08-27" "2005-09-10" "2005-09-24"
[21] "2005-10-08" "2005-10-22" "2005-11-05" "2005-11-19" "2005-12-03"
[26] "2005-12-17" "2005-12-31"

5.3 Generating sequences with seq, rep and gl


R can also generate sequences of numbers, which is very convenient for simulation and for experimental design. The usual functions for this are seq (sequence), rep (repetition) and gl (generating levels).

Using seq

Create a vector of the numbers 1 to 12:

> x <- (1:12)
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
> seq(12)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

Create a vector of the numbers 12 down to 5, and 12 down to 7:

> x <- (12:5)
> x
[1] 12 11 10  9  8  7  6  5
> seq(12,7)
[1] 12 11 10  9  8  7

The general form of seq is seq(from, to, by= ) or seq(from, to, length.out= ). Its use is illustrated by the following examples.

Create a vector from 4 to 6 in steps of 0.25:

> seq(4, 6, 0.25)
[1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00

Create a vector of 10 equally spaced numbers, the smallest being 2 and the largest 15:

> seq(length=10, from=2, to=15)
 [1]  2.000000  3.444444  4.888889  6.333333  7.777778  9.222222 10.666667
 [8] 12.111111 13.555556 15.000000

Using rep

The general form of rep is rep(x, times, ...), where x is a value or vector and times is the number of repetitions. For example:

Repeat the number 10 three times:

> rep(10, 3)
[1] 10 10 10

Repeat the numbers 1 to 4 three times:

> rep(c(1:4), 3)
[1] 1 2 3 4 1 2 3 4 1 2 3 4

Repeat the numbers 1.2, 2.7, 4.8 five times:

> rep(c(1.2, 2.7, 4.8), 5)
[1] 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8

Using gl
gl is used to create a categorical variable (a factor), that is, a variable used for counting rather than for arithmetic. The general form of gl is gl(n, k, length = n*k, labels = 1:n, ordered = FALSE), where n is the number of levels and k the number of consecutive repetitions of each level. Its use is illustrated by the following examples.

Create a variable with levels 1 and 2, each repeated 8 times:

> gl(2, 8)
[1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
Levels: 1 2

Or a variable with levels 1, 2 and 3, each repeated 5 times:

> gl(3, 5)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3

Create a variable with levels 1 and 2, each repeated 10 times (so length=20):

> gl(2, 10, length=20)
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2

Or, with blocks of 2 repeated until the length reaches 20:
> gl(2, 2, length=20)
[1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
Levels: 1 2

Adding labels:

> gl(2, 5, labels=c("C", "T"))
[1] C C C C C T T T T T
Levels: C T

Create a vector of the 4 values 1, 2, 3, 4, each repeated 2 times:

> rep(1:4, c(2,2,2,2))
[1] 1 1 2 2 3 3 4 4

This is equivalent to:

> rep(1:4, each = 2)
[1] 1 1 2 2 3 3 4 4
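As an illustration of how gl and seq help with experimental design, the sketch below lays out a hypothetical trial with 10 subjects, 5 in a control group (C) and 5 in a treatment group (T). The data.frame name design and the column names are assumptions made for this example.

```r
# Two-level factor: 5 controls followed by 5 treated subjects
treatment <- gl(2, 5, labels = c("C", "T"))

# One row per subject
design <- data.frame(subject = seq(10), treatment = treatment)
print(design)

# Check the group sizes
counts <- table(design$treatment)
print(counts)
```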

With dates and times (the time zone shown in the output depends on the computer's settings):

> x <- .leap.seconds[1:3]


> rep(x, 2)
[1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-12-31 16:00:00
Pacific Standard Time"
[3] "1973-12-31 16:00:00 Pacific Standard Time" "1972-06-30 17:00:00
Pacific Standard Time"
[5] "1972-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00
Pacific Standard Time"
> rep(as.POSIXlt(x), rep(2, 3))
[1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-06-30 17:00:00
Pacific Standard Time"
[3] "1972-12-31 16:00:00 Pacific Standard Time" "1972-12-31 16:00:00
Pacific Standard Time"
[5] "1973-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00
Pacific Standard Time"

5.4 Using R for matrix calculations


As we know, a matrix consists, simply put, of rows and columns. When we write A[m, n], we mean that the matrix A has m rows and n columns. R uses the same notation. For example, suppose we want to create a square matrix A with 3 rows and 3 columns and the elements 1, 2, 3, 4, 5, 6, 7, 8, 9:

$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$

In R:
> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

R fills the matrix column by column. If instead we type:

> A <- matrix(y, nrow=3, byrow=TRUE)
> A

the result is:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

that is, the transpose of the previous matrix. Another way to obtain a transposed matrix is to use t(). For example:
> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

and B = A' can be written in R as:

> B <- t(A)
> B
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

The identity matrix is a square matrix (the number of rows equals the number of columns) in which all off-diagonal elements are 0 and all diagonal elements are 1. We can create such a matrix in R as follows:
> # create a 3 x 3 matrix with all elements equal to 0
> A <- matrix(0, 3, 3)
> # set the diagonal elements to 1
> diag(A) <- 1
> diag(A)
[1] 1 1 1
> # the matrix A is now:
> A
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

5.4.1 Extracting elements from a matrix


> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> # column 1 of matrix A
> A[,1]
[1] 1 2 3
> # column 3 of matrix A
> A[,3]
[1] 7 8 9
> # row 1 of matrix A
> A[1,]
[1] 1 4 7
> # row 2, column 3 of matrix A
> A[2,3]
[1] 8
> # all rows of matrix A except row 2
> A[-2,]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    3    6    9
> # all columns of matrix A except column 1
> A[,-1]
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9

> # which elements are greater than 3?
> A>3
      [,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] FALSE TRUE TRUE
[3,] FALSE TRUE TRUE
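The logical matrix produced by A>3 can itself be used as an index, either to pull out the matching values or to locate them. A brief sketch (the names big and pos are chosen for illustration):

```r
A <- matrix(1:9, nrow = 3)

# Values greater than 3, returned in column order
big <- A[A > 3]
print(big)

# Row and column positions of those values
pos <- which(A > 3, arr.ind = TRUE)
print(pos)
```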

5.4.2 Calculations with matrices


Adding and subtracting two matrices. Take the two matrices A and B:
> A <- matrix(1:12, 3, 4)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> B <- matrix(-1:-12, 3, 4)
> B
     [,1] [,2] [,3] [,4]
[1,]   -1   -4   -7  -10
[2,]   -2   -5   -8  -11
[3,]   -3   -6   -9  -12

We can add them, A+B:

> C <- A+B
> C
     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0

Or subtract them, A-B:
> D <- A-B
> D
     [,1] [,2] [,3] [,4]
[1,]    2    8   14   20
[2,]    4   10   16   22
[3,]    6   12   18   24

Multiplying two matrices. Take the two matrices:

$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$

We want to compute AB, which is done in R with the operator %*%:
> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> B <- t(A)
> AB <- A%*%B
> AB
     [,1] [,2] [,3]
[1,]   66   78   90
[2,]   78   93  108
[3,]   90  108  126

Or BA, again with %*%:
> BA <- B%*%A
> BA
     [,1] [,2] [,3]
[1,]   14   32   50
[2,]   32   77  122
[3,]   50  122  194

Matrix inversion and solving systems of equations. Suppose we have the following system of equations:

$$3x_1 + 4x_2 = 4$$
$$x_1 + 6x_2 = 2$$

This system can be written in matrix notation as AX = Y, where:

$$A = \begin{pmatrix} 3 & 4 \\ 1 & 6 \end{pmatrix}, \qquad X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad Y = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$$

The solution of the system is $X = A^{-1}Y$, or in R:

> A <- matrix(c(3,1,4,6), nrow=2)
> Y <- matrix(c(4,2), nrow=2)
> X <- solve(A)%*%Y
> X
          [,1]
[1,] 1.1428571
[2,] 0.1428571

We can check the first equation:

> 3*X[1,1]+4*X[2,1]
[1] 4
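The function solve also accepts the right-hand side as a second argument, which solves AX = Y directly without forming the inverse first. The sketch below compares the two routes and checks both equations at once; the object names are chosen for illustration.

```r
A <- matrix(c(3, 1, 4, 6), nrow = 2)
Y <- matrix(c(4, 2), nrow = 2)

# Solve AX = Y directly
X_direct <- solve(A, Y)

# Same solution via the explicit inverse
X_inverse <- solve(A) %*% Y

# Residual A X - Y should be zero (up to rounding) in both equations
residual <- A %*% X_direct - Y
print(X_direct)
print(residual)
```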

Eigenvalues and eigenvectors can be computed with the function eigen:

> eigen(A)
$values
[1] 7 2
$vectors
           [,1]       [,2]
[1,] -0.7071068 -0.9701425
[2,] -0.7071068  0.2425356

Determinant. How do we know whether a matrix can be inverted? A matrix whose determinant is 0 is a singular matrix and cannot be inverted. To compute the determinant, R provides det():
> E <- matrix((1:9), 3, 3)
> E
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> det(E)
[1] 0

The following matrix F, however, can be inverted:

> F <- matrix((1:9)^2, 3, 3)
> F
     [,1] [,2] [,3]
[1,]    1   16   49
[2,]    4   25   64
[3,]    9   36   81
> det(F)
[1] -216

The inverse of F ($F^{-1}$) can be computed with the function solve():

> solve(F)
          [,1]      [,2]       [,3]
[1,]  1.291667 -2.166667  0.9305556
[2,] -1.166667  1.666667 -0.6111111
[3,]  0.375000 -0.500000  0.1805556

Beyond these simple calculations, R can be used for more complex ones. A notable advantage of R is that it leaves users free to build calculations suited to each specific problem. I return to this in more detail in later chapters.
R has a package, Matrix, designed specifically for matrix computation. Readers can download the package, install it, and use it if needed. The download address is:

http://cran.au.r-project.org/bin/windows/contrib/r-release/Matrix_0.995-8.zip
together with the reference manual (about 80 pages):
http://cran.au.r-project.org/doc/packages/Matrix.pdf

CHAPTER VI

PROBABILITY CALCULATIONS

6
Probability calculations
and simulation
Probability is the foundation of statistical analysis. All methods of data analysis and statistical inference rest on probability theory. Probability theory is concerned with describing and expressing the distribution of a random variable. In practice, "describing" often simply means counting the cases or outcomes that can occur for one or more variables. For example, if we select 2 subjects at random and each can be classified by two characteristics, such as sex and preference, the question is how many "combinations" of these two characteristics there are in total. For a continuous variable such as blood pressure, describing means computing summary statistics such as the mean, median, variance, standard deviation, and so on. From these descriptive statistics, probability theory gives us models for setting up distribution functions for the variables. In this chapter I cover two main areas: counting, and distribution functions.

6.1 Counting
6.1.1 Permutations
By definition, a permutation of n elements is an arrangement of the n elements in a definite order. This definition is rather hard to follow, almost like a riddle! A concrete example may make it clearer. Imagine an emergency centre with 3 doctors (x, y and z) and 3 patients (a, b and c) waiting to be seen. Any of the three doctors can see any of the patients a, b or c. How many ways are there to pair doctors with patients? To answer, consider the following:

Doctor x has 3 choices: patient a, b or c;
Once doctor x has chosen a patient, doctor y has the two remaining choices;
Finally, once the other 2 doctors have chosen, doctor z has only 1 choice.
In total, we have 3 x 2 x 1 = 6 arrangements.

As another example, at a party of 6 friends, how many ways are there to seat them at a table with 6 chairs? By the reasoning above, the answer is 6.5.4.3.2.1 = 720 ways. (The dot "." denotes multiplication.) This is counting permutations.
We know that 3! = 3.2.1 = 6, and 0! = 1. In general, the number of permutations of n elements is $n! = n(n-1)(n-2)(n-3) \cdots 1$. In R this is easily computed with prod():

Find 3!
> prod(3:1)
[1] 6
Find 10!
> prod(10:1)
[1] 3628800
Find 10.9.8.7.6.5.4
> prod(10:4)
[1] 604800
Find (10.9.8.7.6.5.4) / (40.39.38.37.36)
> prod(10:4) / prod(40:36)
[1] 0.007659481
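R also has a built-in factorial function, which gives the same results as the prod calls above. The following is a short sketch comparing the two; the object names are chosen for illustration.

```r
# n! computed directly
f3  <- factorial(3)
f10 <- factorial(10)

# 10.9.8.7.6.5.4 written as a ratio of factorials: 10! / 3!
partial <- factorial(10) / factorial(3)

print(c(f3, f10, partial))
```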
6.1.2 Combinations
A combination of k elements out of n is any subset of k elements taken from a set of n elements. This definition is admittedly hard to follow and cumbersome! The easiest way to understand it is through an example. Three people (say A, B and C) are candidates for the 2 positions of chairman and vice-chairman. How many ways are there to fill the 2 positions from the 3 people? We can picture 2 seats to be filled from 3 people:

Choice   Chairman   Vice-chairman
1        A          B
2        B          A
3        A          C
4        C          A
5        B          C
6        C          B

So there are 6 ordered choices. Note, however, that choices 1 and 2 involve the same pair of people, so if the order does not matter they count as 1, not 2. Likewise, 3 and 4, and 5 and 6, each count as 1 pair. In total there are 3 ways to choose 2 people out of 3 for the 2 positions. This number is called a combination.
In fact, the number of choices can be computed with the following formula:

$$\binom{3}{2} = \frac{3!}{2!\,(3-2)!} = \frac{6}{2} = 3$$

In general, the number of ways to choose k people from n people is:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

This is sometimes written $C_n^k$ instead of $\binom{n}{k}$. In R the calculation is very simple with the function choose(n, k). Here are some examples:

Find $\binom{5}{2}$
> choose(5, 2)
[1] 10

Find the probability that the pair A and B is elected to the two positions out of 5 candidates:

> 1/choose(5, 2)
[1] 0.1
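When the combinations themselves are wanted, not just their number, the base function combn lists them. A small sketch for the three candidates A, B and C (object names chosen for illustration):

```r
candidates <- c("A", "B", "C")

# Each column of the result is one unordered pair
pairs <- combn(candidates, 2)
print(pairs)

# The number of columns agrees with choose(3, 2)
n_pairs <- ncol(pairs)
print(n_pairs)
```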

6.2 Random variables and distribution functions


Most statistical analysis relies on probability distributions for inference. The word "distribution" probably deserves a few lines of explanation here. If we select 10 students at random from a class and record their height and sex, we might obtain data such as these:

Student   Sex      Height (cm)
1         Female   156
2         Female   160
3         Male     175
4         Female   145
5         Female   165
6         Female   158
7         Male     170
8         Male     167
9         Female   178
10        Male     155

Taken together, we have 6 girls and 4 boys. In percentages, that is 60% female and 40% male. In the language of probability, the probability of female is 0.6 and of male is 0.4.
For height, the mean is 162.9 cm, with a minimum of 145 cm and a maximum of 178 cm.
In the language of probability and statistics, sex and height are two random variables. They are random because we cannot predict their values exactly; we can only predict their central value, their mean, and their variability. Sex has only two "values" (male or female) and is called a discrete variable or categorical variable. Height can take any value from low to high and is therefore called a continuous variable.
"Distribution" refers to the values a variable can take. Distribution functions describe the variable in a systematic way, where "systematic" means according to a specific mathematical model with given parameters. Probability and statistics have many distribution functions; here we look at some of the most important and most widely used: the binomial distribution, the Poisson distribution, and the normal distribution. For each distribution there are 4 important kinds of function that we need to know:

the probability density function;
the cumulative probability distribution function;
the quantile function; and
the simulation function.

R has built-in functions for all of these that can be used for probability calculations. Each function name consists of a prefix indicating the kind of function, followed by an abbreviation of the distribution's name. The prefixes are d (density, that is, the probability or density function), p (cumulative probability), q (quantile), and r (random, for generating random numbers). The abbreviations are norm (normal distribution), binom (binomial distribution), pois (Poisson distribution), and so on. The following table summarises the functions and their parameters:
Distribution        Density                         Cumulative                      Quantile                        Simulation
Normal              dnorm(x, mean, sd)              pnorm(q, mean, sd)              qnorm(p, mean, sd)              rnorm(n, mean, sd)
Binomial            dbinom(k, n, p)                 pbinom(q, n, p)                 qbinom(p, n, prob)              rbinom(k, n, prob)
Poisson             dpois(k, lambda)                ppois(q, lambda)                qpois(p, lambda)                rpois(n, lambda)
Uniform             dunif(x, min, max)              punif(q, min, max)              qunif(p, min, max)              runif(n, min, max)
Negative binomial   dnbinom(x, k, p)                pnbinom(q, k, p)                qnbinom(p, k, prob)             rnbinom(n, k, prob)
Beta                dbeta(x, shape1, shape2)        pbeta(q, shape1, shape2)        qbeta(p, shape1, shape2)        rbeta(n, shape1, shape2)
Gamma               dgamma(x, shape, rate, scale)   pgamma(q, shape, rate, scale)   qgamma(p, shape, rate, scale)   rgamma(n, shape, rate, scale)
Geometric           dgeom(x, p)                     pgeom(q, p)                     qgeom(p, prob)                  rgeom(n, prob)
Exponential         dexp(x, rate)                   pexp(q, rate)                   qexp(p, rate)                   rexp(n, rate)
Weibull             dweibull(x, shape, scale)       pweibull(q, shape, scale)       qweibull(p, shape, scale)       rweibull(n, shape, scale)
Cauchy              dcauchy(x, location, scale)     pcauchy(q, location, scale)     qcauchy(p, location, scale)     rcauchy(n, location, scale)
F                   df(x, df1, df2)                 pf(q, df1, df2)                 qf(p, df1, df2)                 rf(n, df1, df2)
t                   dt(x, df)                       pt(q, df)                       qt(p, df)                       rt(n, df)
Chi-squared         dchisq(x, df)                   pchisq(q, df)                   qchisq(p, df)                   rchisq(n, df)

Note: In the table above, df = degrees of freedom; prob = probability; n = sample size. In the simulation functions the first argument is the number of values to generate. R's own argument names for the binomial functions are size (number of trials) and prob (probability of success). Other parameters can be looked up for each distribution. The F, t and Chi-squared distributions have one further parameter, the non-centrality parameter (ncp), which is set to 0 by default; users can supply another suitable value if needed.

6.3 Probability distribution functions

6.3.1 The binomial distribution

As the name suggests, a binomial variable has only two values: male / female, alive / dead, yes / no, and so on. The binomial distribution is stated as follows: if an experiment is carried out n times, each time resulting in either success or failure, with a known probability of success p, then the probability of k successes is:

$$P(k \mid n, p) = C_n^k \, p^k (1-p)^{n-k}, \qquad k = 0, 1, 2, \ldots, n$$

To make this clearer, we go through a few examples.

Example 1: Binomial probability function. In the example above, the class has 10 students, 6 of whom are girls. If 3 students are chosen at random, what is the probability that 2 of them are girls? We can answer this more or less by hand by listing all possible outcomes, treating each pick as independent with a probability of 0.6 of choosing a girl. Each pick has 2 possibilities (male or female), so with 3 picks we have $2^3 = 8$ outcomes:

Student 1   Student 2   Student 3   Probability
Male        Male        Male        (0.4)(0.4)(0.4) = 0.064
Male        Male        Female      (0.4)(0.4)(0.6) = 0.096
Male        Female      Male        (0.4)(0.6)(0.4) = 0.096
Male        Female      Female      (0.4)(0.6)(0.6) = 0.144
Female      Male        Male        (0.6)(0.4)(0.4) = 0.096
Female      Male        Female      (0.6)(0.4)(0.6) = 0.144
Female      Female      Male        (0.6)(0.6)(0.4) = 0.144
Female      Female      Female      (0.6)(0.6)(0.6) = 0.216
All outcomes                        1.000

We know that the group of 10 students has 6 girls, so the probability of a girl is 0.60 (and the probability of choosing a boy is 0.4). Hence the probability that all 3 chosen students are boys is 0.4 x 0.4 x 0.4 = 0.064. In the table there are 3 outcomes with exactly 2 girls: Male-Female-Female, Female-Female-Male, and Female-Male-Female, each with probability 0.144. So the probability of choosing exactly 2 girls among the 3 students is 3 x 0.144 = 0.432.
In R, the function dbinom(k, n, p) computes the formula $P(k \mid n, p) = C_n^k \, p^k (1-p)^{n-k}$ quickly. In the case above we simply type:
> dbinom(2, 3, 0.60)
[1] 0.432
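Because dbinom accepts a vector for its first argument, the whole distribution of the number of girls (0, 1, 2 or 3) can be obtained in one call and compared with the hand-built table. A short sketch (the object name probs is chosen for illustration):

```r
# Probability of 0, 1, 2 and 3 girls among 3 picks with p = 0.60
probs <- dbinom(0:3, size = 3, prob = 0.60)
names(probs) <- 0:3
print(probs)

# The probabilities over all possible outcomes add up to 1
print(sum(probs))
```

The value for 1 girl, 0.288, equals the three table rows with probability 0.096 added together.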

Example 2: Cumulative binomial probability distribution. The probability that an anti-osteoporosis drug is effective is about 70% (that is, p = 0.70). If we treat 10 patients, what is the probability that at least 8 of them respond? In other words, if X is the number of patients treated successfully, we need $P(X \geq 8)$. To answer this we use the function pbinom(k, n, p). Recall that pbinom(k, n, p) gives $P(X \leq k)$. Therefore $P(X \geq 8) = 1 - P(X \leq 7)$, and the answer in R is:

> 1-pbinom(7, 10, 0.70)
[1] 0.3827828
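The same upper-tail probability can be obtained in two other ways: with the lower.tail argument of pbinom, or by summing the individual probabilities for 8, 9 and 10 successes. The sketch below shows that all three routes agree (object names chosen for illustration).

```r
# Route 1: complement of the cumulative probability
p_complement <- 1 - pbinom(7, 10, 0.70)

# Route 2: ask pbinom for the upper tail P(X > 7) directly
p_upper <- pbinom(7, 10, 0.70, lower.tail = FALSE)

# Route 3: add up P(X = 8), P(X = 9) and P(X = 10)
p_sum <- sum(dbinom(8:10, 10, 0.70))

print(c(p_complement, p_upper, p_sum))
```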

Example 3: Simulating the binomial distribution. Suppose that about 20% of a population have hypertension. If we draw a sample 1000 times, each time choosing 20 people at random from the population, what will the distribution of the number of hypertensive patients look like? To answer this we can use the function rbinom(n, k, p) in R, where n is the number of samples, k the size of each sample and p the probability:
> b <- rbinom(1000, 20, 0.20)

In this command the simulation result is stored in an object named b. To see what b contains, we tabulate it with table:
> table(b)
b
  0   1   2   3   4   5   6   7   8   9  10
  6  45 147 192 229 169 105  68  23  13   3

The first row (0, 1, 2, ..., 10) is the number of hypertensive patients among the 20 people chosen. The second row is the number of samples, out of 1000, in which that count occurred. So 6 samples had no hypertensive patient, 45 samples had just 1 hypertensive patient, and so on. Because the samples are random, the counts will differ slightly each time the command is run. Perhaps the easiest way to understand the result is to plot these frequencies with hist:
> hist(b, main="Number of hypertensive patients")

[Figure: histogram titled "Number of hypertensive patients"; vertical axis: Frequency]

Figure 1. Distribution of the number of hypertensive patients among 20 people chosen at random from a population with 20% hypertensive patients, with sampling repeated 1000 times.

From the figure we see that 4 hypertensive patients (per sample of 20 people) is the most likely outcome (22.9%). This makes sense: because the prevalence of hypertension is 20%, we expect on average 4 of the 20 people chosen to have hypertension. The important point the figure makes, however, is that we occasionally observe as many as 10 hypertensive patients, even though the probability of such a sample is very low
(only 3/1000).
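The simulated frequencies can be set beside the theoretical binomial probabilities from dbinom. The sketch below fixes the random seed so that it gives the same numbers on every run; the seed value 123 is an arbitrary assumption, so the counts will not match the table shown above.

```r
set.seed(123)  # arbitrary seed, assumed here only for reproducibility

# 1000 samples of 20 people, each person hypertensive with probability 0.20
b <- rbinom(1000, size = 20, prob = 0.20)

# Observed proportion of samples for each possible count 0..20
observed <- table(factor(b, levels = 0:20)) / length(b)

# Theoretical probability of each count
expected <- dbinom(0:20, size = 20, prob = 0.20)

comparison <- cbind(observed = as.vector(observed), expected = expected)
rownames(comparison) <- 0:20
print(round(comparison[1:11, ], 3))
```

With 1000 samples the observed proportions should be close to the expected ones, and the agreement improves as the number of samples grows.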
Example 4: An application of the binomial distribution. Twenty customers are invited to taste two beers, A and B, and are asked which they prefer. The result is that 16 people prefer beer A. Is this enough to conclude that beer A is preferred by more people than beer B, or could the result be due to chance alone?

We start by assuming that, if there is no difference, the probability of preferring beer A is p = 0.50 and of preferring beer B is q = 0.5. If this assumption is true, what is the probability of observing 16 or more of the 20 people preferring beer A? We can compute this very simply in R:
> 1- pbinom(15, 20, 0.5)
[1] 0.005908966

The answer is a probability of about 0.006, or 0.6%. In other words, if the two beers really were equally liked, the probability that 16 or more of 20 people prefer beer A would be only about 0.6%. We therefore have evidence that beer A really is preferred by more people than beer B, and that the result is unlikely to be due to chance. Note that we use 15 (rather than 16) because $P(X \geq 16) = 1 - P(X \leq 15)$, and in the present case $P(X \leq 15)$ = pbinom(15, 20, 0.5).
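The calculation in example 4 is what the base R function binom.test reports as a one-sided exact test. The sketch below is one way to confirm the agreement; it is not part of the original example.

```r
# Upper-tail probability computed by hand, as in the text
p_manual <- 1 - pbinom(15, 20, 0.5)

# Exact binomial test: 16 successes in 20 trials against p = 0.5,
# with the one-sided alternative that the true proportion is greater
test <- binom.test(16, 20, p = 0.5, alternative = "greater")
p_test <- test$p.value

print(c(p_manual, p_test))
```

A two-sided question (is either beer preferred?) would use the default alternative = "two.sided" and give a larger P value.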
6.3.2 Hm phn phi Poisson (Poisson distribution)

Hm phn phi Poisson, ni chung, rt ging vi hm nh phn, ngoi tr thng


s p thng rt nh v n thng rt ln. V th, hm Poisson thng c s dng
m t cc bin s rt him xy ra (nh s ngi mc ung th trong mt dn s chng
hn). Hm Poisson cn c ng dng kh nhiu v thnh cng trong cc nghin cu k
thut v th trng nh s lng khch hng n mt nh hng mi gi.
V d 5: Hm mt Poisson (Poisson density probability function). Qua
theo di nhiu thng, ngi ta bit c t l nh sai chnh t ca mt th k nh my.
Tnh trung bnh c khong 2.000 ch th th k nh sai 1 ch. Hi xc sut m th k
nh sai chnh t 2 ch, hn 2 ch l bao nhiu?

V tn s kh thp, chng ta c th gi nh rng bin s sai chnh t (tm t


tn l bin s X) l mt hm ngu nhin theo lut phn phi Poisson. y, chng ta c

t l sai chnh t trung bnh l 1( = 1). Lut phn phi Poisson pht biu rng xc sut
m X = k, vi iu kin t l trung bnh , :
e k
P( X = k | ) =
k!
e 212
= 0.1839 . p s ny c th
2!
tnh bng R mt cch nhanh chng hn bng hm dpois nh sau:

Do , p s cho cu hi trn l: P ( X = 2 | = 1) =

> dpois(2, 1)
[1] 0.1839397

Chng ta cng c th tnh xc sut sai 1 ch, v xc sut khng sai ch no:
> dpois(1, 1)
[1] 0.3678794
> dpois(0, 1)
[1] 0.3678794

Note that in the function above we simply supply the parameters k = 2 and λ = 1. That is the probability that the typist makes exactly 2 errors. The probability that she makes more than 2 errors (that is 3, 4, 5, ... errors) can be computed as:

\[
\begin{aligned}
P(X > 2) &= P(X = 3) + P(X = 4) + P(X = 5) + \dots \\
         &= 1 - P(X \le 2) \\
         &= 1 - 0.3678 - 0.3678 - 0.1839 \\
         &= 0.08
\end{aligned}
\]

In R, this can be computed as follows:
# P(X <= 2)
> ppois(2, 1)
[1] 0.9196986

# 1 - P(X <= 2)
> 1-ppois(2, 1)
[1] 0.0803014
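As a check on the formula, the sketch below evaluates the Poisson probability by hand and compares it with dpois. It also compares the Poisson result with a binomial model using n = 2000 words and p = 1/2000 per word, which is an assumption made here only to illustrate the "small p, large n" remark above.

```r
# Poisson probability by hand versus dpois(), for k = 2 and lambda = 1.
lambda <- 1
k <- 2
by_hand <- exp(-lambda) * lambda^k / factorial(k)
builtin <- dpois(k, lambda)

# P(X > 2) as the complement of P(X = 0) + P(X = 1) + P(X = 2).
p_more_than_2 <- 1 - sum(dpois(0:2, lambda))

# Assumed binomial counterpart: 2000 words, error probability 1/2000 each.
binom_version <- dbinom(2, size = 2000, prob = 1 / 2000)

print(c(by_hand = by_hand, builtin = builtin,
        binomial = binom_version, more_than_2 = p_more_than_2))
```

The binomial and Poisson values should be very close, which is why the Poisson distribution is a convenient stand-in for rare events.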

6.3.3 The normal distribution

The two distributions we have just examined belong to the family of discrete distributions, in which the variable takes countable or categorical values. For continuous variables there are several other suitable distributions, the most important of which is the normal distribution. The normal distribution is the most important foundation of statistical analysis; it is no exaggeration to say that most of statistical theory is built on it. The normal density function has two parameters: the mean μ and the variance σ² (or the standard deviation σ). Let X be a variable (height, for example); the normal density at X = x is:

\[ f(x \mid \mu, \sigma^{2}) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x-\mu)^{2}}{2\sigma^{2}} \right] \]

Example 6: The normal density function. The current mean height of Vietnamese women is 156 cm, with a standard deviation of 4.6 cm. Height is also known to follow the normal distribution. With the two parameters μ = 156 and σ = 4.6, we can construct a height distribution for the whole population of Vietnamese women, which has the following shape:

[Figure 2. Distribution of height in Vietnamese women with mean 156 cm and standard deviation 4.6 cm. The horizontal axis is height (130 to 200 cm) and the vertical axis is the probability density f(height) at each height.]

The figure above was drawn with the following two commands. The first creates a variable height with the values 130, 131, 132, ..., 200 cm. The second draws the curve for a mean of 156 cm and a standard deviation of 4.6 cm.
> height <- seq(130, 200, 1)
> plot(height, dnorm(height, 156, 4.6),
       type="l",
       ylab="f(height)",
       xlab="Height",
       main="Probability distribution of height in Vietnamese women")

With these two parameters (and the figure), we can evaluate the probability density for any height. For example, the density at a height of 160 cm for a Vietnamese woman is:

\[ f(160 \mid \mu = 156, \sigma = 4.6) = \frac{1}{4.6\sqrt{2 \times 3.1416}} \exp\left[ -\frac{(160-156)^{2}}{2(4.6)^{2}} \right] = 0.0594 \]

The R function dnorm(x, mean, sd) does this calculation for us concisely:
> dnorm(160, mean=156, sd=4.6)
[1] 0.05942343
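To make the link between the formula and dnorm explicit, here is a small sketch that codes the density by hand. The helper name normal_density is hypothetical and introduced only for this illustration.

```r
# Hand-coded normal density, following the formula for f(x | mu, sigma^2).
# normal_density() is an illustrative helper, not a built-in R function.
normal_density <- function(x, mu, sigma) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

by_hand <- normal_density(160, mu = 156, sigma = 4.6)
builtin <- dnorm(160, mean = 156, sd = 4.6)
print(c(by_hand, builtin))
```

Both values should match the 0.05942343 reported above.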

The cumulative normal probability function. Because height is a continuous variable, in practice we rarely want the probability of one specific value x; we usually want the probability of a range of values from a to b. For example, we may want the probability that height lies between 150 and 160 cm, that is P(150 ≤ X ≤ 160), or the probability that height is below 145 cm, that is P(X < 145). To answer such questions we need the cumulative normal probability function, defined as:

\[ P(a \le X \le b) = \int_{a}^{b} f(x)\,dx \]

Thus P(150 ≤ X ≤ 160) is the area under the curve in Figure 2 between 150 and 160 on the horizontal axis. R provides the very useful function pnorm(x, mean, sd) for computing cumulative probabilities of a normal distribution:

\[ \mathrm{pnorm}(a, \mathrm{mean}, \mathrm{sd}) = \int_{-\infty}^{a} f(x)\,dx = P(X \le a \mid \mathrm{mean}, \mathrm{sd}) \]

For example, the probability that a Vietnamese woman is 150 cm tall or shorter is 9.6%:
> pnorm(150, 156, 4.6)
[1] 0.0960575

Or the probability that a Vietnamese woman is 164 cm tall or taller:
> 1-pnorm(164, 156, 4.6)
[1] 0.04100591

In other words, only about 4.1% of Vietnamese women are 164 cm tall or taller.
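The interval probability P(150 ≤ X ≤ 160) mentioned earlier follows from two calls to pnorm. The sketch below (variable names are illustrative) also confirms the result by numerically integrating the density with integrate, which is simply the definition of the cumulative function.

```r
# P(150 <= X <= 160) for height ~ N(156, 4.6^2), computed two ways.
mu <- 156
sigma <- 4.6

# Difference of two cumulative probabilities.
p_interval <- pnorm(160, mu, sigma) - pnorm(150, mu, sigma)

# Numerical integration of the density over the same interval.
p_numeric <- integrate(dnorm, lower = 150, upper = 160,
                       mean = mu, sd = sigma)$value

# P(X < 145), the other question raised in the text.
p_below_145 <- pnorm(145, mu, sigma)

print(c(p_interval, p_numeric, p_below_145))
```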


Example 7: Applying the normal distribution. In a population, blood pressure is known to have a mean of 100 mmHg and a standard deviation of 13 mmHg. What proportion of people in this population have a blood pressure of 120 mmHg or higher? The answer in R is:
> 1-pnorm(120, mean=100, sd=13)
[1] 0.0619679

That is, about 6.2% of the people in this population have a blood pressure of 120 mmHg or higher.
6.3.4 The standardized normal distribution

A variable X that follows the normal distribution with mean μ and variance σ² is usually written as:
X ~ N(μ, σ²)
Here μ and σ² depend on the unit of measurement of the variable. Height is measured in cm (or m), blood pressure in mmHg, age in years, and so on, so variables described in their original units are sometimes hard to compare. A simpler approach is to standardize X so that its mean is 0 and its variance is 1. After a little algebra, it is easy to show that the transformation of X that satisfies this condition is:

\[ Z = \frac{X - \mu}{\sigma} \]

In mathematical terms: if X ~ N(μ, σ²), then (X - μ)/σ ~ N(0, 1). By this formula, Z is simply the difference between a value and the mean, expressed in numbers of standard deviations. If Z = 0, X equals the mean μ. If Z = -1, X is exactly 1 standard deviation below the mean. Similarly, if Z = 2.5, X is exactly 2.5 standard deviations above the mean, and so on.
The height distribution of Vietnamese women can be described in this new unit, the z score, as follows:

[Figure 3. Standardized distribution of height in Vietnamese women. The horizontal axis is z (-4 to 4) and the vertical axis is the density f(z).]

The figure above was drawn with the following two commands:
> height <- seq(-4, 4, 0.1)
> plot(height, dnorm(height, 0, 1),
       type="l",
       ylab="f(z)",
       xlab="z",
       main="Probability distribution of height in Vietnamese women")

One convenience of the standardized normal distribution is that it can be used to describe and compare the distribution of any variable, because everything is converted to the z score. In the figure above, the vertical axis is the probability density of z and the horizontal axis is z itself. The probability that z is less than some constant is easy to compute in R. For example, suppose we want P(z ≤ -1.96) for a distribution with mean 0 and standard deviation 1:
> pnorm(-1.96, mean=0, sd=1)
[1] 0.02499790

Or P(z ≤ 1.96):


> pnorm(1.96, mean=0, sd=1)
[1] 0.9750021

Therefore, P(-1.96 < z < 1.96) is:


> pnorm(1.96) - pnorm(-1.96)
[1] 0.9500042

In other words, there is a 95% probability that z lies between -1.96 and 1.96. (Note that the command above does not supply mean=0, sd=1, because the default values of the pnorm parameters mean and sd are 0 and 1.)
Example 6 (continued). To recap, the mean height of Vietnamese women is 156 cm and the standard deviation is 4.6 cm. A woman who is 170 cm tall is therefore z = (170 - 156) / 4.6 = 3.04 standard deviations above the mean, and the proportion of Vietnamese women taller than 170 cm is very low, only about 0.1%.
> 1-pnorm(3.04)
[1] 0.001182891
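The standardization step can be verified numerically. This sketch (illustrative variable names) shows that the tail probability is the same whether it is computed on the original centimetre scale or on the z scale; the small difference from the value above comes only from rounding z to 3.04.

```r
# Tail probability for a height of 170 cm, on the raw and standardized scales.
mu <- 156
sigma <- 4.6
x <- 170

z <- (x - mu) / sigma              # standardized value, about 3.04

p_raw <- 1 - pnorm(x, mu, sigma)   # original scale
p_std <- 1 - pnorm(z)              # standard normal scale

print(c(z = z, p_raw = p_raw, p_std = p_std))
```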

Finding a quantile of a normal distribution. Sometimes we need the inverse calculation. For example: if the probability that Z is less than some constant z equals a given p, what is z? In probability notation, we want to find z such that:

P(Z < z) = p

To answer this question, we use the function qnorm(p, mean=, sd=).


Example 8: Given Z ~ N(0, 1) and P(Z < z) = 0.95, we want to find z.
> qnorm(0.95, mean=0, sd=1)
[1] 1.644854

Or P(Z < z) = 0.975 for a normal distribution with mean 0 and standard deviation 1:
> qnorm(0.975, mean=0, sd=1)
[1] 1.959964
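Since qnorm is the inverse of pnorm, a round trip should return the starting probability. The sketch below checks this and, as an illustration, converts the 95% quantile back to the height scale of Example 6 using X = μ + σZ.

```r
# qnorm() inverts pnorm(): a quantile fed back into pnorm() returns p.
z95 <- qnorm(0.95)
round_trip <- pnorm(z95)

# The same quantile on the height scale of Example 6 (mean 156, sd 4.6).
h95_direct <- qnorm(0.95, mean = 156, sd = 4.6)
h95_from_z <- 156 + 4.6 * z95

print(c(z95 = z95, round_trip = round_trip,
        h95_direct = h95_direct, h95_from_z = h95_from_z))
```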

6.3.5 The t, F and χ² distributions

The t, F and χ² distributions are in fact functions of the normal distribution. Their relationships and how to compute them are described in the notes below.

The Chi-squared distribution (χ²). The χ² distribution arises from the sum of squares of standard normal variables. If x_i ~ N(0, 1) independently, and we let

\[ u = \sum_{i=1}^{n} x_i^{2} \]

then u follows the Chi-squared distribution with n degrees of freedom (usually abbreviated df). In mathematical notation, u ~ χ²_n.

Example 9: Probabilities for a Chi-squared variable therefore need only two parameters, u and n. For example, to find the probability density at u = 21 with df = 13, we simply use the function dchisq as follows:
> dchisq(21, 13)
[1] 0.01977879

Find the probability that a variable u is less than or equal to 21 with 13 degrees of freedom, that is P(u ≤ 21 | df=13):
> pchisq(21, 13)
[1] 0.9270714

We can also say that the result above gives P(χ²_13 < 21) = 0.927.

Find the quantile of u corresponding to 95% of a χ² distribution with 15 degrees of freedom:
> qchisq(0.95, 15)
[1] 24.99579

In other words, P(χ²_15 < 24.99) = 0.95.
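The definition of the Chi-squared distribution as a sum of squared standard normal variables can be illustrated by simulation. The following is a minimal sketch; the seed and the number of simulations are arbitrary choices made here for reproducibility.

```r
# Simulate u = sum of 13 squared N(0, 1) variables and compare the
# empirical P(u <= 21) with pchisq(21, 13).
set.seed(1)                      # arbitrary seed, for reproducibility only
n_df  <- 13
n_sim <- 20000

z <- matrix(rnorm(n_sim * n_df), nrow = n_sim)  # one row per simulated u
u <- rowSums(z^2)

empirical   <- mean(u <= 21)
theoretical <- pchisq(21, n_df)
print(c(empirical = empirical, theoretical = theoretical))
```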


Non-centrality. Note that in the definition above, the χ² distribution arises from the sum of squares of normal variables with mean 0 and variance 1. If the normal variables have means other than 0 (with variance still 1), we obtain a non-central Chi-squared distribution. If x_i ~ N(μ_i, 1) and we let

\[ u = \sum_{i=1}^{n} x_i^{2} \]

then u follows the non-central Chi-squared distribution with n degrees of freedom and non-centrality parameter

\[ \lambda = \sum_{i=1}^{n} \mu_i^{2} \]

written u ~ χ²_{n,λ}. It can be added that the mean of u is n + λ and the variance of u is 2(n + 2λ).
Find the probability that u is less than or equal to 21, given 13 degrees of freedom and a non-centrality parameter of 5.4:
> pchisq(21, 13, 5.4)
[1] 0.6837649

That is, P(χ²_{13,5.4} < 21) = 0.684.

Find the quantile corresponding to 50% of a χ² distribution with 7 degrees of freedom and a non-centrality parameter of 3:
> qchisq(0.5, 7, 3)
[1] 9.180148

Therefore, P(χ²_{7,3} < 9.180148) = 0.50.
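The non-central case can be illustrated in the same way. In this sketch the vector of means is an assumption chosen so that the non-centrality parameter is 3 with 7 degrees of freedom, matching the last example; the simulated mean, variance and median can then be compared with n + λ, 2(n + 2λ) and the quantile from qchisq.

```r
# Non-central Chi-squared by simulation: 7 normal variables with unit
# variance, three of which have mean 1 (an assumed choice), so lambda = 3.
set.seed(2)                          # arbitrary seed
mu     <- c(1, 1, 1, 0, 0, 0, 0)
lambda <- sum(mu^2)                  # non-centrality parameter
n_df   <- length(mu)
n_sim  <- 20000

# Column j of x holds draws from N(mu[j], 1).
x <- matrix(rnorm(n_sim * n_df, mean = rep(mu, each = n_sim)), nrow = n_sim)
u <- rowSums(x^2)

print(c(mean = mean(u), expected_mean = n_df + lambda))
print(c(var = var(u), expected_var = 2 * (n_df + 2 * lambda)))
print(c(median = median(u), qchisq = qchisq(0.5, n_df, ncp = lambda)))
```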

The t distribution. We have just seen that if X ~ N(μ, σ²) then (X - μ)/σ ~ N(0, 1). That statement is true (or exact) only when the variance σ² is known. In practice we rarely know the variance exactly and can only estimate it from empirical data. When the variance is estimated from the study data, and we call this estimate s², we can state that (X - μ)/s ~ t(0, v), where v is the degrees of freedom.
Example 10. Find the probability that x is greater than 1.1, where x follows the t distribution with 6 degrees of freedom:
> 1-pt(1.1, 6)
[1] 0.1567481

That is, P(t_6 > 1.1) = 1 - P(t_6 < 1.1) = 0.157.

Find the quantile corresponding to 95% of a t distribution with 15 degrees of freedom:
> qt(0.95, 15)
[1] 1.753050

In other words, P(t_15 < 1.753050) = 0.95.
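Because the t distribution accounts for the extra uncertainty of an estimated variance, its quantiles are larger than the normal quantile and approach it as the degrees of freedom grow. The sketch below shows this for a few illustrative degrees of freedom chosen here.

```r
# 95% quantiles of the t distribution for increasing degrees of freedom,
# compared with the standard normal 95% quantile.
dfs <- c(1, 5, 15, 30, 100)      # illustrative degrees of freedom
t_q <- qt(0.95, dfs)
z_q <- qnorm(0.95)

print(data.frame(df = dfs, t_quantile = t_q, normal_quantile = z_q))
```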

The F distribution. The ratio of two independent variables following the χ² distribution, each divided by its degrees of freedom, can be shown to follow the F distribution. In other words, if u ~ χ²_n and v ~ χ²_m, then (u/n)/(v/m) ~ F_{n,m}, where n is the numerator degrees of freedom and m is the denominator degrees of freedom.
Example 11: Find the probability that an F value is greater than 3.24, given that the variable follows the F distribution with 3 and 15 degrees of freedom and a non-centrality parameter of 5:
> 1-pf(3.24, 3, 15, 5)
[1] 0.3558721

Therefore, P(F_{3,15,5} > 3.24) = 1 - P(F_{3,15,5} ≤ 3.24) = 0.356.

With 3 and 15 degrees of freedom, find C such that P(F_{3,15} > C) = 0.05. The solution in R is:

> qf(1-0.05, 3, 15)


[1] 3.287382

In other words, P(F_{3,15} > 3.287382) = 1 - P(F_{3,15} ≤ 3.287382) = 1 - 0.95 = 0.05.
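The relationship between the F and χ² distributions can also be checked by simulation. In this sketch (arbitrary seed and simulation size), each Chi-squared variable is divided by its degrees of freedom before taking the ratio, and the proportion of ratios above the 95% quantile from qf should be close to 0.05.

```r
# F as a ratio of two independent Chi-squared variables, each scaled by its df.
set.seed(3)                       # arbitrary seed
n_sim <- 20000

u <- rchisq(n_sim, df = 3)        # numerator, 3 df
v <- rchisq(n_sim, df = 15)       # denominator, 15 df
f_ratio <- (u / 3) / (v / 15)

critical  <- qf(1 - 0.05, 3, 15)  # the value C found above
empirical <- mean(f_ratio > critical)
print(c(critical = critical, empirical_tail = empirical))
```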

6.4 Simulation


In statistical analysis, a limited sample size sometimes makes it hard to estimate parameters precisely, and in such uncertain cases we turn to simulation to learn how much one or more parameters vary. Simulation usually relies on probability distributions. This is a fairly complex field that I do not intend to cover fully in this chapter. Here I present only a few illustrative simulation models that readers can build on.
Example 11: A simulation showing that the variance of the mean equals the variance divided by n, \( \mathrm{var}(\bar{X}) = \sigma^{2}/n \). Consider a discrete variable taking the values 1, 3 and 5 with the following probabilities:

x     P(x)
1     0.60
3     0.30
5     0.10

From these figures, the mean is (1×0.60) + (3×0.30) + (5×0.10) = 2.0 and the variance (readers can verify this themselves) is 1.8.
We now use these two parameters to simulate 500 draws. The first command creates the 3 values of x. The second enters the probability of each value of x. The sample command asks R to generate 500 random numbers and store them in the object draws.
> x <- c(1, 3, 5)
> px <- c(0.6, 0.3, 0.1)
> draws <- sample(x, size=500, replace=T, prob=px)
> hist(draws, breaks=seq(1,5, by=0.25), main="500 draws")

[Figure: histogram titled "500 draws"; horizontal axis draws, vertical axis Frequency.]

From the probability distribution we know that, on average, 60% of draws have the value 1, 30% the value 3, and 10% the value 5. We therefore expect to observe these values 300, 150 and 50 times respectively. The figure above shows that the distribution of the simulated values is close to what we expect. We also know that the variance of this variable is 1.8. Let us check whether the simulation matches that expectation:
> var(draws)
[1] 1.835671

The result shows that the variance of the 500 draws is 1.836, not far from the expected value.
We now simulate 500 means x̄ (each x̄ being the mean of 4 simulated values) from the population above:
> draws <- sample(x, size=4*500, replace=T, prob=px)
> draws = matrix(draws, 4)
> drawmeans = apply(draws, 2, mean)

The first and second commands create the object draws with 4 rows, each row holding 500 values from the distribution above. In other words, we have 4*500 = 2000 numbers arranged in 500 columns (1 to 500), each column containing 4 numbers. The third command computes the mean of each column, producing 500 means stored in the object drawmeans. The following figure shows the distribution of the 500 means:
> hist(drawmeans, breaks=seq(1,5,by=0.25), main="500 means of 4 draws")

[Figure: histogram of the 500 means of 4 draws; horizontal axis drawmeans, vertical axis Frequency.]

We see that the variance of this distribution is smaller. In fact, the variance of these 500 means is 0.45.
> var(drawmeans)
[1] 0.4501112

This is equivalent to the value 0.45 that we expect from the formula \( \mathrm{var}(\bar{X}) = \sigma^{2}/4 = 1.8/4 = 0.45 \).
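The theoretical values quoted in this example (mean 2.0, variance 1.8, and variance of the mean 0.45) can be computed directly from x and px. The sketch below does so and then repeats the simulation with a larger, arbitrarily chosen number of replications and a fixed seed.

```r
# Theoretical mean and variance of the discrete variable in Example 11.
x  <- c(1, 3, 5)
px <- c(0.6, 0.3, 0.1)

mu       <- sum(x * px)            # expected value
sigma2   <- sum((x - mu)^2 * px)   # variance
var_mean <- sigma2 / 4             # variance of the mean of 4 draws

# Simulated check: 5000 means of 4 draws each (arbitrary seed and size).
set.seed(11)
means4 <- replicate(5000, mean(sample(x, 4, replace = TRUE, prob = px)))

print(c(mu = mu, sigma2 = sigma2, var_mean = var_mean,
        simulated_var_mean = var(means4)))
```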
6.4.1 Simulating the binomial distribution
Example 12: Simulating samples from a population with a binomial distribution. Suppose we know that 20% of a population have diabetes (probability p = 0.2). We want to draw samples from this population, each of 20 subjects, and repeat the sampling 100 times:
> bin <- rbinom(100, 20, 0.2)
> bin
[1] 4 4 5 3 2 2 3 2 5 4 3 6 7 3 4 4 1 5 3 5 3 4 4 5 1 4 4 4 4 3 2 4 2 2 5 4 5
[38] 7 3 5 3 3 4 3 2 4 5 2 4 5 5 4 2 2 2 8 5 5 5 3 4 5 7 4 3 6 4 6 6 8 8 3 3 1
[75] 1 4 4 2 3 9 7 4 4 0 0 8 6 9 3 1 4 5 6 4 5 3 2 4 3 2

The result shows that the first sample contains 4 people with the disease; the second also 4; the third 5; and so on. These results can be summarized in a figure as follows:
> hist(bin,
       xlab="Number of diabetic patients",
       ylab="Number of samples",
       main="Distribution of the number of diabetic patients")

[Figure: histogram titled "Distribution of the number of diabetic patients"; horizontal axis Number of diabetic patients, vertical axis Number of samples.]

> mean(bin)
[1] 3.97

This is as expected: because each sample has 20 subjects and the probability is 20%, we predict an average of 4 diabetic patients per sample.
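With more simulated samples, the sample mean and variance settle close to the binomial expectations np = 4 and np(1 - p) = 3.2. The following sketch uses an arbitrary seed and 10,000 samples (both assumptions made here) to illustrate this.

```r
# Larger binomial simulation: 10000 samples of 20 subjects, p = 0.2.
set.seed(4)                        # arbitrary seed
n <- 20
p <- 0.2
bin_large <- rbinom(10000, n, p)

expected_mean <- n * p             # 4
expected_var  <- n * p * (1 - p)   # 3.2

print(c(mean = mean(bin_large), expected_mean = expected_mean))
print(c(var = var(bin_large), expected_var = expected_var))
```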
6.4.2 Simulating the Poisson distribution
Example 13: Simulating samples from a population with a Poisson distribution. In the following example, we simulate 100 samples from a population following the Poisson distribution with mean λ = 3:
> pois <- rpois(100, lambda=3)
> pois
[1] 4 3 2 4 2 3 4 4 0 7 5 0 3 3 4 2 2 6 1 4 2 3 3 5 4 2 1 4 0 2 1 5 1 2 2 2 6
[38] 1 3 6 3 3 5 4 3 2 2 5 3 3 3 1 4 7 3 4 3 2 6 1 4 1 0 5 2 2 2 3 6 8 4 4 1 4
[75] 1 0 0 4 3 3 2 3 3 3 4 1 5 4 4 1 3 1 6 4 4 4 2 2 2 4

Plotting the distribution:

[Figure: histogram titled "Histogram of pois"; horizontal axis pois, vertical axis Frequency.]

The Poisson and exponential distributions. In the following example, we simulate the arrival times of patients at a hospital. Patients are known to arrive at random according to a Poisson distribution, with an average of 15 patients per 150 minutes. It can be easily shown that the time between two consecutive arrivals follows the exponential distribution. We want to know when patients visit the hospital; we therefore simulate 15 inter-arrival times from the exponential distribution with rate 15/150 = 0.1 per minute. The following commands do this:
# Generate the times between arrivals at the hospital
> appoint <- rexp(15, 0.1)
> times <- round(appoint,0)
> times
[1] 37 5 8 10 24 5 1 7 8 6 12 3 25 15
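The link between exponential waiting times and Poisson counts can be sketched as follows. Cumulative sums of the simulated gaps give arrival times, and counting how many arrivals fall within 150 minutes, over many repetitions, should give counts with mean and variance close to 15. The seed, the number of repetitions and the cap of 60 gaps per repetition are assumptions made only for this illustration.

```r
# Arrival times as cumulative sums of exponential gaps (rate 0.1 per minute).
set.seed(5)                               # arbitrary seed
gaps     <- rexp(15, rate = 0.1)          # minutes between consecutive arrivals
arrivals <- cumsum(gaps)                  # clock time of each arrival

# Repeat many times: number of arrivals within the first 150 minutes.
# 60 gaps per repetition is far more than needed to cover 150 minutes.
counts <- replicate(5000, sum(cumsum(rexp(60, rate = 0.1)) <= 150))

print(round(arrivals))
print(c(mean_count = mean(counts), var_count = var(counts)))
```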

6.4.3 Simulating the χ², t and F distributions

The simulation approach above can also be applied to other distributions such as the negative binomial (rnbinom), gamma (rgamma), beta (rbeta), Chi-squared (rchisq), exponential (rexp), t (rt), F (rf), and so on. The parameters of these simulation functions can be found in the first part of the chapter.
The following commands illustrate these common distributions:

Chi-squared distribution with several degrees of freedom:
> curve(dchisq(x, 1), xlim=c(0,10), ylim=c(0,0.6), col="red", lwd=3)
> curve(dchisq(x, 2), add=T, col="green", lwd=3)
> curve(dchisq(x, 3), add=T, col="blue", lwd=3)
> curve(dchisq(x, 5), add=T, col="orange", lwd=3)
> abline(h=0, lty=3)
> abline(v=0, lty=3)
> legend(par("usr")[2], par("usr")[4],
         xjust=1,
         c("df=1", "df=2", "df=3", "df=5"), lwd=3, lty=1,
         col=c("red", "green", "blue", "orange"))

[Figure 4. Chi-squared distributions with 1, 2, 3 and 5 degrees of freedom.]

t distribution:
> curve(dt(x, 1), xlim=c(-3,3), ylim=c(0,0.4), col="red", lwd=3)
> curve(dt(x, 2), add=T, col="blue", lwd=3)
> curve(dt(x, 5), add=T, col="green", lwd=3)
> curve(dt(x, 10), add=T, col="orange", lwd=3)
> curve(dnorm(x), add=T, lwd=4, lty=3)
> title(main="Student T distributions")
> legend(par("usr")[2], par("usr")[4],
         xjust=1,
         c("df=1", "df=2", "df=5", "df=10", "Normal distribution"),
         lwd=c(2,2,2,2,2),
         lty=c(1,1,1,1,3),
         col=c("red", "blue", "green", "orange", par("fg")))

[Figure 5. t distributions with 1, 2, 5 and 10 degrees of freedom, compared with the normal distribution.]

F distribution:
> curve(df(x,1,1), xlim=c(0,2), ylim=c(0,0.8), lwd=3)
> curve(df(x,3,1), add=T)
> curve(df(x,6,1), add=T, lwd=3)
> curve(df(x,3,3), add=T, col="red")
> curve(df(x,6,3), add=T, col="red", lwd=3)
> curve(df(x,3,6), add=T, col="blue")
> curve(df(x,6,6), add=T, col="blue", lwd=3)
> title(main="Fisher F distributions")
> legend(par("usr")[2], par("usr")[4],
         xjust=1,
         c("df=1,1", "df=3,1", "df=6,1", "df=3,3", "df=6,3",
           "df=3,6", "df=6,6"),
         lwd=c(3,1,3,1,3,1,3),
         lty=1,
         col=c(par("fg"), par("fg"), par("fg"), "red", "red", "blue", "blue"))

[Figure 6. F distributions with various degrees of freedom.]

Gamma distribution:
> curve( dgamma(x,1,1), xlim=c(0,5) )
> curve( dgamma(x,2,1), add=T, col='red' )
> curve( dgamma(x,3,1), add=T, col='green' )
> curve( dgamma(x,4,1), add=T, col='blue' )
> curve( dgamma(x,5,1), add=T, col='orange' )
> title(main="Gamma probability distribution function")
> legend(par('usr')[2], par('usr')[4], xjust=1,
         c('k=1 (Exponential distribution)', 'k=2', 'k=3', 'k=4', 'k=5'),
         lwd=1, lty=1,
         col=c(par('fg'), 'red', 'green', 'blue', 'orange') )

[Figure 7. Gamma distributions with various shapes.]

Beta distribution:
> curve( dbeta(x,1,1), xlim=c(0,1), ylim=c(0,4) )
> curve( dbeta(x,2,1), add=T, col='red' )
> curve( dbeta(x,3,1), add=T, col='green' )
> curve( dbeta(x,4,1), add=T, col='blue' )
> curve( dbeta(x,2,2), add=T, lty=2, lwd=2, col='red' )
> curve( dbeta(x,3,2), add=T, lty=2, lwd=2, col='green' )
> curve( dbeta(x,4,2), add=T, lty=2, lwd=2, col='blue' )
> curve( dbeta(x,2,3), add=T, lty=3, lwd=3, col='red' )
> curve( dbeta(x,3,3), add=T, lty=3, lwd=3, col='green' )
> curve( dbeta(x,4,3), add=T, lty=3, lwd=3, col='blue' )
> title(main="Beta distribution")
> legend(par('usr')[1], par('usr')[4], xjust=0,
         c('(1,1)', '(2,1)', '(3,1)', '(4,1)',
           '(2,2)', '(3,2)', '(4,2)',
           '(2,3)', '(3,3)', '(4,3)' ),
         lwd=1, #c(1,1,1,1, 2,2,2, 3,3,3),
         lty=c(1,1,1,1, 2,2,2, 3,3,3),
         col=c(par('fg'), 'red', 'green', 'blue',
               'red', 'green', 'blue',
               'red', 'green', 'blue' ))

[Figure 8. Beta distributions with various shapes.]

Weibull distribution:
> curve(dexp(x), xlim=c(0,3), ylim=c(0,2))
> curve(dweibull(x,1), lty=3, lwd=3, add=T)
> curve(dweibull(x,2), col='red', add=T)
> curve(dweibull(x,.8), col='blue', add=T)
> title(main="Weibull Probability Distribution Function")
> legend(par('usr')[2], par('usr')[4], xjust=1,
         c('Exponential', 'Weibull, shape=1',
           'Weibull, shape=2', 'Weibull, shape=.8'),
         lwd=c(1,3,1,1),
         lty=c(1,3,1,1),
         col=c(par("fg"), par("fg"), 'red', 'blue'))

[Figure 9. Weibull distributions.]

Cauchy distribution:
> curve(dcauchy(x),xlim=c(-5,5), ylim=c(0,.5), lwd=3)
> curve(dnorm(x), add=T, col='red', lty=2)
> legend(par('usr')[2], par('usr')[4], xjust=1,
         c('Cauchy distribution', 'Gaussian distribution'),
         lwd=c(3,1),
         lty=c(1,2),
         col=c(par("fg"), 'red'))

[Figure 10. Cauchy distribution compared with the normal distribution.]

6.5 Random sampling


In probability and statistics, random sampling is very important because it ensures the validity of statistical analysis and inference. In R, we can draw a random sample with the function sample.
Example: We have a population of 40 people (numbered 1, 2, 3, ..., 40). If we want to select 5 subjects from this population, who will be chosen? We can use sample() to answer this as follows:
> sample(1:40, 5)
[1] 32 26 6 18 9

The result shows that subjects 32, 26, 6, 18 and 9 were selected. Each time this command is run, R selects a different sample, not the same one as above. For example:
> sample(1:40, 5)
[1] 5 22 35 19 4
> sample(1:40, 5)
[1] 24 26 12 6 22
> sample(1:40, 5)
[1] 22 38 11 6 18

and so on.

The commands above perform random sampling without replacement, meaning that once subjects have been drawn they are not returned to the population.
Alternatively we can sample with replacement (each time subjects are drawn, they are put back into the population before the next draw). For example, to select 10 people from a population of 50 by random sampling with replacement, we only need to add the argument replace = TRUE:
> sample(1:50, 10, replace=T)
[1] 31 44 6 8 47 50 10 16 29 23

Or toss a coin 10 times; each toss of course has 2 possible outcomes, H and T; and the result of the 10 tosses might be:
> sample(c("H", "T"), 10, replace=T)
[1] "H" "T" "H" "H" "H" "T" "H" "H" "T" "T"

We can also imagine 5 blue balls (X) and 5 red balls (D) in a bag. We draw 1 ball, record its color, and return it to the bag; then draw another ball, record its color, and return it again. Repeating this 20 times, the result might be:
> sample(c("X", "D"), 20, replace=T)
[1] "X" "D" "D" "D" "D" "D" "X" "X" "X" "X" "X" "D" "X" "X" "D" "X" "X" "X" "X"
[20] "D"

In addition, we can sample with prespecified probabilities. In the following call, we select 10 subjects from the numbers 1 to 5 with unequal probabilities:
> sample(5, 10, prob=c(0.3, 0.4, 0.1, 0.1, 0.1), replace=T)
[1] 3 1 3 2 2 2 2 2 5 1

Subject 1 is selected 2 times, subject 2 is selected 5 times, subject 3 is selected 2 times, and so on. Because the sample is small, this does not match the supplied probabilities 0.3, 0.4, 0.1 exactly, but it is not far from expectation.
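With a larger number of draws the observed proportions move much closer to the supplied probabilities. The sketch below, using an arbitrary seed and 10,000 draws (both assumptions made here), tabulates the proportions. Calling set.seed before sample is also how a random sample can be made reproducible.

```r
# Weighted sampling with many draws: observed proportions versus probabilities.
set.seed(6)                              # arbitrary seed, makes the sample reproducible
probs <- c(0.3, 0.4, 0.1, 0.1, 0.1)
draws <- sample(5, 10000, replace = TRUE, prob = probs)

# factor() with explicit levels keeps all five values in the table.
freq <- as.numeric(table(factor(draws, levels = 1:5))) / length(draws)

print(rbind(supplied = probs, observed = freq))
```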

CHAPTER VII
HYPOTHESIS TESTING AND THE P-VALUE

7 Statistical hypothesis testing and the meaning of the P-value
7.1 The P-value
In scientific research, apart from numerical data, charts and images, the number we encounter most often is the P-value. In the following chapters readers will meet the P-value many times, and the great majority of statistical and scientific inferences rely on it. Therefore, before discussing methods of statistical analysis with R, I think a few words about the meaning of this number are needed.
The P-value is a probability; the name is short for "probability value". We often see statements accompanied by such a number, for example "The analysis shows that the fracture rate in the group of patients treated with Alendronate was 2%, lower than the rate in the untreated group (5%), and this difference was statistically significant (p = 0.01)", or a statement such as "After 3 months of treatment, the reduction in blood pressure in the patient group was 10% (p < 0.05)". In this context, the great majority of scientists take the P-value to reflect the probability that Alendronate or a treatment is effective; they read the sentence above as meaning "the probability that Alendronate is better than placebo is 0.99" (1 minus 0.01). But that reading is completely wrong!
In an English-Vietnamese dictionary of mathematical economics, statistics and econometrics (Nhà xuất bản Khoa học và Kỹ thuật, 2004), the author defines the P-value as follows: "P-value (or probability value). The P-value is the lowest level of statistical significance at which the observed value of the test statistic is significant" (page 690). This definition is truly hard to understand! In fact it is also the general definition that Western textbooks usually give. Open any textbook in English and you will find a broadly similar definition of the P-value, such as "the P value is the probability that the observed difference arose by chance". In fact this definition is incomplete, if not wrong. It is precisely because of the vagueness of the definition that so many scientists misunderstand the meaning of the P-value.
Indeed, a great many people, not only readers but even the authors of scientific papers themselves, do not understand the meaning of the P-value. According to a study published in the prestigious journal Statistics in Medicine [1], 85% of scientific authors and research physicians do not understand, or misunderstand, the meaning of the P-value. Readers may be very surprised at this point, because it means that many researchers sometimes do not understand, or misunderstand, what they themselves have written! So the question must be asked seriously: what is the meaning of the P-value? To answer it, we need to look at the concept of falsification and the process of a scientific study.

7.2 Scientific hypotheses and falsification


A hypothesis is regarded as "scientific" if it is "falsifiable". According to Karl Popper, the philosopher of science, the only feature that distinguishes a genuine scientific theory from pseudoscience is that a scientific theory can always be "refuted" (or falsified) by simple experiments. He called this "falsifiability". Falsification is a way of conducting experiments not to verify scientific theories but to criticize them, and it can be regarded as a foundation of genuine science. For example, the hypothesis "All crows are black" can be refuted if we find a single red crow.
The falsification process can be seen as a way of learning from mistakes! Indeed, in science we learn from mistakes. No one in the scientific community disputes that science has advanced in large part by learning from mistakes. Mistakes are a strength of science. Scientific research can be defined as a process of hypothesis testing, with the following steps:
Step 1: the researcher defines a null hypothesis, that is, a hypothesis opposite to what the researcher believes to be true. For example, in a clinical study with two groups of patients, one treated with drug A and one treated with placebo, the researcher may state a null hypothesis that the efficacy of drug A is equivalent to that of placebo (meaning that drug A does not have the desired effect).
Step 2: the researcher defines an alternative hypothesis, that is, a hypothesis that the researcher thinks is true and that needs to be supported by data. In the example above, the researcher may state the alternative hypothesis that drug A is more efficacious than placebo.
Step 3: after collecting all the relevant data, the researcher uses one or more statistical methods to check which of the two hypotheses is plausible. The check is carried out to answer the question: if the null hypothesis is true, what is the probability that the collected data are consistent with the null hypothesis? The value of this probability is usually reported in scientific papers as the "P value". The point to note here is that the researcher does not test any other hypothesis, only the null hypothesis.
Step 4: decide whether to accept or reject the null hypothesis, based on the probability obtained in step 3. For example, by the convention used in medical research, if the probability is less than 5% the researcher is prepared to reject the null hypothesis and conclude that the efficacy of drug A differs from that of placebo. If the probability is greater than 5%, however, the researcher can only state that there is not yet enough evidence to reject the null hypothesis, and this does not mean that the null hypothesis is true. In other words, absence of evidence is not evidence of absence.
Step 5: if the null hypothesis is rejected, the researcher accepts the alternative hypothesis by default. But this is where the problem begins, because there are many different alternative hypotheses. For example, compared with the initial alternative hypothesis (A differs from placebo), the researcher could pose many alternative hypotheses, such as that the efficacy of drug A is 5%, 10% or in general X% higher than that of placebo. In short, once the researcher rejects the null hypothesis, the alternative hypothesis is accepted by default, but the researcher cannot determine which alternative hypothesis is actually true.

7.3 The meaning of the P-value through simulation


To understand the practical meaning of the P-value, I will give a simple example:
Example 1. An experiment is conducted to study consumers' preference between two kinds of coffee (call them coffee A and coffee B). The researchers let 50 customers taste both kinds under the same conditions and ask which they prefer. The result is that 35 people prefer coffee A and 15 prefer coffee B. The question is whether, from this result, the researchers can conclude that coffee A is preferred over coffee B, or whether the result arose merely by chance.
"By chance" means: under the binomial distribution, how likely is the result above? Binomial probability theory applies in this case because the outcome of the study has only two "values" (prefer A or prefer B).
In the language of falsification, the null hypothesis is that if there is no difference in preference, the probability that a customer prefers one kind of coffee is 0.5. If this hypothesis is true (that is, p = 0.5, where p is the probability of preferring coffee A), and if the study is repeated (say) 1000 times, each time with 50 customers, how many times will 35 customers prefer coffee A? Call a study in which 35 (or more) of 50 prefer coffee A the "event" X; in the language of probability, we want to find P(X | p=0.50).
To answer this question, we can use the function rbinom to simulate, because as noted above the problem is essentially a binomial distribution:
> bin <- rbinom(1000, 50, 0.5)

In the command above, we ask R to simulate 1000 studies, each with 50 customers, where under the null hypothesis the probability of preferring A is 0.50. To see the result of that simulation, we use the function table as follows:
> table(bin)

bin
 14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  1   1   2  11  16  24  47  60  83  94 107 132 114  98  65  44  44  26  14  12   2   3

From this result, of the 1000 studies only 3 had 35 customers preferring coffee A (on the condition that there is no difference between the two coffees, or more precisely that p = 0.5). In other words:
P(X ≥ 35 | p=0.50) = 3/1000 = 0.003
We can also display these frequencies in a histogram:

[Figure: histogram titled "Histogram of bin"; horizontal axis bin (15 to 35), vertical axis Frequency.]

Of course we can run another simulation with 100,000 repetitions (instead of 1000) and compute P(X ≥ 35 | p=0.50).
> bin <- rbinom(100000, 50, 0.5)
> table(bin)
bin
   11    12    13    14    15    16    17    18    19    20    21    22    23    24    25
    4    17    40    83   197   462   946  1592  2719  4098  5892  7937  9733 10822 11191
   26    27    28    29    30    31    32    33    34    35    36    37    38    39    40
10799  9497  7925  5904  4185  2682  1562   893   455   223    98    31     5     7     1

This time there are more possible outcomes (because the number of simulations has increased). For example, some studies yield as few as 11 or as many as 40 customers preferring coffee A. But what we want is the number of studies in which 35 or more customers prefer coffee A, and the result above gives that probability as:
> (223+98+31+5+7+1)/100000
[1] 0.00365

In other words, the probability P(X ≥ 35 | p=0.50) is very low (only about 0.4%), so we have evidence that the observed result is probably not due to chance; that is, there is a difference in customers' preference between the two coffees.
The number P = 0.0037 is the P-value. By scientific convention, all P-values below 0.05 (that is, below 5%) are regarded as "significant", meaning statistically significant.
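The simulated P-value can be compared with the exact binomial tail probability. The sketch below computes it with pbinom and, as a cross-check, with the one-sided exact test binom.test from base R; the choice of a one-sided ("greater") alternative is an assumption made here to match the event "35 or more prefer A".

```r
# Exact P(X >= 35) for 50 customers under the null hypothesis p = 0.5.
p_exact <- 1 - pbinom(34, 50, 0.5)

# The same probability from the exact one-sided binomial test.
p_test <- binom.test(35, 50, p = 0.5, alternative = "greater")$p.value

print(c(p_exact = p_exact, p_test = p_test))
```

The exact value is close to the simulated one; the small gap reflects simulation noise, which shrinks as the number of simulated studies grows.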
It must be stressed once more that the meaning of the P-value is as follows. The purpose of the analysis above is to answer the question: if the two coffees are equally likely to be preferred (p = 0.5, the null hypothesis), what is the probability that a result as extreme as the one observed (35 or more of 50 customers preferring A) occurs? That, in other words, is how the P-value is found. The interpretation of a P-value must therefore be conditional, and the condition here is p = 0.50. Readers can experiment further with p = 0.6 or p = 0.7 to see how the results differ.
In practice, the P-value has a very large influence on the fate of a scientific paper. Many journals and scientists regard a study with a P-value above 0.05 as a "negative result", and the paper may be rejected for publication. For that reason, for the great majority of scientists "P < 0.05" has become a kind of "passport" for publishing research results. If the result has P < 0.05, the paper has a chance of appearing in some journal and the author may become well known; if the result has P > 0.05, the paper and the research behind it may well fall into oblivion!

7.4 The logical problem of the P-value
But from the standpoint of reason and rigorous science, should we attach so much importance to the P-value? In my view, the answer is no. The P-value has many problems, and reliance on it in the past (as well as today) has been severely criticized by many people. The number one defect of the P-value is that it lacks logic. Indeed, if we look again at the example above, we can summarize the process of a medical study (based on the P-value) as follows:

• State a main hypothesis (H+)
• From the main hypothesis, state a null hypothesis (H-)
• Collect the data (D)
• Analyze the data: compute the probability that D occurs if H- is true. In the language of probability, this step determines P(D | H-).

The P-value is therefore the probability of the data D occurring if (emphasis: "if") the null hypothesis H- is true. Thus the P-value gives us no direct idea about the truth of the main hypothesis H+; it only indirectly provides evidence for accepting the main hypothesis and rejecting the null hypothesis.
The logic behind the P-value can be understood as a proof by contradiction:

• Proposition 1: If the null hypothesis is true, then these data cannot occur;
• Proposition 2: The data occurred;
• Proposition 3 (conclusion): The null hypothesis cannot be true.

If readers find this reasoning hard to follow, here is another example from medicine to illustrate the process:

• If Mr. Tuấn has hypertension, then he cannot have hair loss (these two biological phenomena are unrelated, at least according to current medical knowledge);
• Mr. Tuấn has hair loss;
• Therefore, Mr. Tuấn cannot have hypertension.

The P-value therefore only indirectly reflects the probability of proposition 3. And that is precisely an important defect of the P-value, because it estimates the plausibility of the data and does not tell us the plausibility of a hypothesis. This makes inference based on the P-value far removed from reality and from experimental science. In experimental science, what the researcher wants to know is, given the data obtained, the probability of the main hypothesis; the researcher does not want to know the probability of the data if the null hypothesis is true. In other words, using the notation above, the researcher wants to know P(H+ | D), not P(D | H+) or P(D | H-).
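The difference between P(D | H-) and P(H+ | D) can be made concrete with Bayes' rule. The sketch below reuses the coffee data (35 of 50), but everything else is a hypothetical assumption introduced only for illustration: the main hypothesis H+ is taken to mean p = 0.7, and the two hypotheses are given equal prior probability. Different assumptions would give a different posterior.

```r
# P(D | H) versus P(H | D) for the coffee data, under assumed hypotheses.
# ASSUMPTIONS (not from the text): H+ means p = 0.7; prior P(H+) = 0.5.
lik_null <- dbinom(35, 50, 0.5)   # P(D | H-), with H-: p = 0.5
lik_alt  <- dbinom(35, 50, 0.7)   # P(D | H+), with assumed H+: p = 0.7
prior_alt <- 0.5                  # assumed prior probability of H+

# Bayes' rule: P(H+ | D) = P(D | H+) P(H+) / [P(D | H+) P(H+) + P(D | H-) P(H-)]
post_alt <- lik_alt * prior_alt /
  (lik_alt * prior_alt + lik_null * (1 - prior_alt))

print(c(lik_null = lik_null, lik_alt = lik_alt, post_alt = post_alt))
```

The point of the sketch is only that P(H+ | D) requires a prior and a specific alternative, neither of which enters the calculation of a P-value.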

7.5 The problem of testing multiple hypotheses (multiple tests of hypothesis)
As noted above, medical research is a process of hypothesis testing. In a study we seldom test just one hypothesis; we usually test many at once. For example, in a study of the relationship between vitamin D and the risk of hip fracture, the researchers may analyze the relationship between vitamin D and bone mineral density, and between vitamin D and fracture risk by sex, by age group, or by the clinical characteristics of the patients, and so on (see the example below). Each such analysis can be regarded as a hypothesis test. Here we face the problem of multiple hypotheses (multiple tests of hypothesis, also called multiple comparisons).
Table 2. Effect of vitamin D and calcium by patient characteristics

Patient              Calcium and       Placebo        Relative risk
characteristic       vitamin D (1)     group (1)      (95% confidence interval) (2)

Age
  50-59              29 (0.06)         13 (0.03)      2.17 (1.13-4.18)
  60-69              53 (0.09)         71 (0.13)      0.74 (0.52-1.06)
  70-79              93 (0.44)         115 (0.54)     0.82 (0.62-1.08)

Body mass index
  <25                69 (0.20)         66 (0.19)      1.05 (0.75-1.47)
  25-30              63 (0.14)         74 (0.16)      0.87 (0.62-1.22)
  >30                43 (0.09)         59 (0.13)      0.73 (0.49-1.09)

Smoking
  Non-smoker         159 (0.14)        178 (0.15)     0.90 (0.71-1.11)
  Current smoker     14 (0.14)         16 (0.17)      0.85 (0.41-1.74)

Notes: (1) The number outside the parentheses is the number of patients who sustained a hip fracture during follow-up (7 years); the number in parentheses is the fracture rate in percent per year. (2) The relative risk (RR, explained in a later chapter) is estimated by dividing the fracture rate in the intervention group by the rate in the placebo group; if the 95% confidence interval includes 1, the difference between the 2 groups is not statistically significant; if the 95% confidence interval does not include 1, the difference between the 2 groups is considered statistically significant (p<0.05).

Recall that each time we test a hypothesis we accept an error rate of 5% (assuming we use p = 0.05 as the criterion for declaring a result statistically significant or not). The question that arises when many hypotheses are tested is this: if, out of n tests, we declare k of them statistically significant (that is, p<0.05), what is the probability that at least one of those declarations is wrong?
To answer this question I will start with a simple example. In each test we accept a probability of error of 0.05. In other words, the probability of being right is 0.95. If we test 3 hypotheses, the probability that we are right all three times (assuming the tests are independent) is 0.95 x 0.95 x 0.95 = 0.8574. Hence the probability of at least one error among the three declarations of statistical significance is 1 - 0.8574 = 0.1426 (about 14%).
In general, if we test n hypotheses and accept a probability of error p in each test, then the probability of at least one error in n tests is $1 - (1 - p)^n$. With n = 10 and p = 0.05, the probability of at least one error is as high as 40%.
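The calculation above is easy to reproduce in R. The following is a minimal sketch, assuming independent tests; the function name fwer is an illustrative choice, not something defined elsewhere in this book.

```r
# Probability of at least one false positive among n independent tests,
# each carried out at significance level p (name 'fwer' is hypothetical)
fwer <- function(n, p = 0.05) 1 - (1 - p)^n

n.tests <- c(1, 3, 10, 16, 20)
result  <- data.frame(tests = n.tests,
                      prob.at.least.one.error = round(fwer(n.tests), 4))
print(result)
```

The rows for 3 and 10 tests correspond to the two worked figures in the text (about 14% and about 40%).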

The lesson from this reasoning is as follows: if we read a scientific paper in which the researchers carried out many different tests and report results with p < 0.05, we have reason to believe that the probability that at least one of these so-called "significant" (statistically significant) results is a false positive is very high. We need to be cautious about such results.
For someone doing research, the implication of the multiple testing problem is: do not go fishing. Let me explain the notion of "fishing" in science. Imagine a researcher who wants to study the efficacy of a new treatment for patients with joint pain. After reviewing the published literature, the researcher decides to conduct a study of 300 patients: half are treated with the new therapy and half receive only placebo. After the follow-up period and data collection, the researcher analyzes the data and finds that the difference between the two groups is not statistically significant. In other words, the treatment is not effective. Unwilling to give up, the researcher looks for a way to obtain a statistically significant result. He divides the patients into subgroups by age (over or under 50), by sex (male or female), by economic status (high or low income), and by habit (playing sport or not). Altogether he has 16 different subgroups and can run 16 tests. He "discovers" that the treatment is statistically significant in women over 50 with a high income. And he publishes the result. This way of working is what the research community calls a "fishing expedition". Naturally, such a result has no scientific value and cannot be trusted. (With 16 different tests and p = 0.05, the probability that at least one test gives a "significant" result is about 56%, so it is no surprise that a fish gets caught!)
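The fishing expedition can also be imitated by simulation. The sketch below is only an illustration under stated assumptions: each of the 16 subgroup comparisons is treated as an independent two-sample t test with 20 hypothetical patients per arm, and there is no true treatment effect in any subgroup. The names n.sim, n.tests and one.study are made up for this example.

```r
set.seed(1)
n.sim   <- 1000   # number of simulated studies
n.tests <- 16     # subgroup tests per study

# One simulated study: 16 t tests on pure noise; TRUE if any p < 0.05
one.study <- function() {
  p <- replicate(n.tests, t.test(rnorm(20), rnorm(20))$p.value)
  any(p < 0.05)
}

hits <- replicate(n.sim, one.study())
mean(hits)           # simulated chance of at least one "significant" subgroup
1 - 0.95^n.tests     # theoretical value under independence
```

The two numbers should be close, showing that a "significant" subgroup turns up in more than half of the studies even though the treatment does nothing.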
To restore the original meaning of the P value in the setting of multiple testing, researchers recommend the Bonferroni adjustment (named after the Italian statistician who proposed it). Under this proposal, before carrying out the study the researcher must specify which hypotheses are primary and which are secondary. In addition, the researcher must plan how many hypotheses will be tested before starting the data analysis. For example, if the researcher plans to test 20 comparisons and wants to keep the overall P value at 0.05, then instead of using 0.05 as the criterion for declaring a result "significant", the researcher must use 0.0025 (that is, 0.05 divided by 20). In other words, only when a result has a p value below 0.0025 (or, in general, p/n) is the researcher entitled to declare the result statistically significant.
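In R the Bonferroni rule can be applied in two equivalent ways: compare each raw p value with 0.05/n, or multiply the p values by n with p.adjust and compare with 0.05. The sketch below uses 20 made-up p values purely for illustration; they do not come from any study in this book.

```r
# Hypothetical p values from 20 planned comparisons (invented for illustration)
p.raw <- c(0.001, 0.002, 0.004, 0.012, 0.03, 0.04, 0.049,
           seq(0.06, 0.96, length.out = 13))
alpha <- 0.05
n     <- length(p.raw)

# (a) compare each raw p value with the stricter threshold alpha / n
sig.threshold <- p.raw < alpha / n

# (b) equivalently, adjust the p values and compare with alpha
p.bonf     <- p.adjust(p.raw, method = "bonferroni")
sig.adjust <- p.bonf < alpha

data.frame(p.raw, p.bonf, significant = sig.adjust)
```

With these invented values, seven comparisons have a raw p below 0.05 but only the two below 0.0025 survive the adjustment.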
The P value, although extremely widely used in scientific research, is not a final verdict on a study or a hypothesis. Yet in practice scientists depend too heavily on the P value for inference and for announcing discoveries that are later shown to be wrong. It is no exaggeration to say that it is the abuse of, and blind reliance on, the P value that has impoverished science, biomedicine in particular. Every day we read or hear contradictory scientific findings (one study shows that coffee is good for health, another that it is harmful; aspirin is reported to reduce the risk of cancer, yet a recent study suggests that aspirin may increase the risk of breast cancer, and so on). Sometimes the public cannot tell which finding is real and which is a false positive. According to the analysis of Berger and Sellke, about 25% of findings with p < 0.05 are false positives [2].
Therefore, we should not depend too much on the P value. It is not the case that every study with p<0.05 is a success and every study with p>0.05 a failure. Sometimes a finding with p>0.05 is nevertheless a meaningful one. The important question is how to estimate the likelihood of a hypothesis once real data are in hand, that is, how to estimate P(H+ | D). To estimate P(H+ | D) we must apply Bayes' theorem, and that approach is beyond the scope of this book. Readers who wish to learn more may read some of my papers or the papers of James Berger listed in the references below.
References:

[1] Wulff et al. Statistics in Medicine 1987; 6:3-10.
[2] Berger JO, Sellke T. Testing a point null hypothesis: the irreconcilability of P-values and evidence. Journal of the American Statistical Association 1987; 82:112-20.

CHAPTER VIII

8
Data analysis with graphs
Visual elements are very important. The Chinese have a saying that "a picture is worth ten thousand words". Indeed, a good graph can make a strong impression on the readers of a scientific paper, and it often stands for the whole study. A graph is therefore one of the most effective means of emphasizing the message of a paper. Graphs are usually used to show trends and results for each group, but they can also be used to present data concisely. Graphs that are easy to understand and rich in content are invaluable tools. Researchers therefore need to think creatively about how to present important data graphically. Graphical analysis thus plays an extremely important role in statistical analysis. One could say that a statistical analysis without graphs is a meaningless one.
The R language offers many ways to design a neat and attractive graph. Most graphics functions are built into R, but some more sophisticated and complex graphs can be produced with specialized packages such as lattice or trellis, which can be downloaded from the R website. In this chapter I will show how to draw the commonly used graphs with the popular functions in R.

8.1 The graphics environment and graph design

8.1.1 Several graphs in one window (windows)
Normally R draws one graph per window. But we can draw several graphs in one window with the function par. For example, par(mfrow=c(1,2)) divides the window into 1 row and 2 columns, so that two graphs can be shown side by side. Likewise par(mfrow=c(2,3)) divides the window into 2 rows and 3 columns, so that 6 graphs can be shown in one window. When finished, we can return to a single-graph window with the command par(mfrow=c(1,1)).
The following example creates a data set with two variables x and y by simulation (that is, the data are generated entirely by R). We then divide the window into 2 rows and 2 columns and display four types of graph from the simulated data:
> par(mfrow=c(2,2))
> N <- 200
> x <- runif(N, -4, 4)
> y <- sin(x) + 0.5*rnorm(N)
> plot(x, y, main="Scatter plot of y and x")
> hist(x, main="Histogram of x")
> boxplot(y, main="Box plot of y")
> barplot(x, main="Bar chart of x")
> par(mfrow=c(1,1))
Figure 1. Dividing the window into 2 rows and 2 columns and displaying 4 graphs in the same window (panels: "Scatter plot of y and x", "Histogram of x", "Box plot of y", "Bar chart of x").
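A common variation on this pattern, shown here only as a sketch, is to save the value returned by par and restore it afterwards instead of resetting mfrow by hand. The variable name op is an arbitrary choice.

```r
set.seed(123)
x <- runif(200, -4, 4)

op <- par(mfrow = c(1, 2))   # par() returns the previous values of the settings it changes
hist(x, main = "Histogram of x")
boxplot(x, main = "Box plot of x")
par(op)                      # restore the settings saved in op
```

Restoring from op also works when several settings (for example mfrow and mar) were changed in the same call.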
8.1.2 Naming the vertical and horizontal axes
A graph usually has a vertical axis (y-axis) and a horizontal axis (x-axis). Because variables are often referred to by abbreviations, the graph needs a name for each variable so that it is easy to follow. In the following example, the graph on the left has no labels other than the original variable names (x and y), whereas the one on the right has more readable labels.
> par(mfrow=c(1,2))
> N <- 200
> x <- runif(N, -4, 4)
> y <- sin(x) + 0.5*rnorm(N)
> plot(x, y)
> plot(x, y, xlab="X factor",
       ylab="Production",
       main="Production and x factor \n Second line of title here")
> par(mfrow=c(1,1))

In the commands above, xlab (short for x label) and ylab (short for y label) name the horizontal and vertical axes, while main gives the graph its title. Note that in main the symbol \n starts a second line (useful when the title is too long).

Figure 2. The graph on the left has no labels; the graph on the right has names for the vertical axis, the horizontal axis and the graph itself.
In addition, we can use the function title, with its arguments main and sub, to add titles:
> plot(x, y, xlab="Time",
       ylab="Production")
> title(main="Plot of production and x factor",
        sub="Figure 1")

[Figure: "Plot of production and x factor", with the subtitle "Figure 1" below the horizontal axis.]

8.1.3 Setting the limits of the vertical and horizontal axes
If the limits of the two axes are not supplied, R chooses them automatically from the data. However, we can also control the graph by using xlim and ylim to tell R the exact limits of the two axes:
> plot(x, y, xlab="X factor",
       ylab="Production",
       main="Plot of production and x factor",
       xlim=c(-5, 5),
       ylim=c(-3, 3))

8.1.4 Plot types and line styles
Within a series of graphs we can ask R to draw different plot types and line styles.
> par(mfrow=c(2,2))
> plot(y, type="l"); title("lines")
> plot(y, type="b"); title("both")
> plot(y, type="o"); title("overstruck")
> plot(y, type="h"); title("high density")

Figure 3. Plot types and line styles (panels: "lines", "both", "overstruck", "high density").


In addition, we can draw different line styles with lty as follows:
> par(mfrow=c(2,2))
> plot(y, type="l", lty=1); title(main="Production data", sub="lty=1")
> plot(y, type="l", lty=2); title(main="Production data", sub="lty=2")
> plot(y, type="l", lty=3); title(main="Production data", sub="lty=3")
> plot(y, type="l", lty=4); title(main="Production data", sub="lty=4")

Figure 4. Effect of lty.


8.1.5 Colours, frames, and symbols
We can control the colour of a graph with the argument col. The default value of col is 1. However, we can change the colour as we wish, either by giving a number or by writing a colour name such as "red", "blue", "green", "orange", "yellow", "cyan", etc.
The following example uses a for loop to draw three lines in red, blue, and green:
> plot(runif(10), ylim=c(0,1), type='l')
> for (i in c('red', 'blue', 'green'))
  {
     lines(runif(10), col=i)
  }
> title(main="Lines in various colours")

[Figure: "Lines in various colours".]

In addition, we can vary the thickness of each line with lwd:

> plot(runif(5), ylim=c(0,1), type='n')
> for (i in 5:1)
  {
     lines(runif(5), col=i, lwd=i)
  }
> title(main="Varying the line thickness")

[Figure: "Varying the line thickness".]

The shape of the graph can also be changed with type, as follows:

> op <- par(mfrow=c(3,2))
> plot(runif(5), type = 'p',
       main = "plot type 'p' (points)")
> plot(runif(5), type = 'l',
       main = "plot type 'l' (lines)")
> plot(runif(5), type = 'b',
       main = "plot type 'b' (both points and lines)")
> plot(runif(5), type = 's',
       main = "plot type 's' (stair steps)")
> plot(runif(5), type = 'h',
       main = "plot type 'h' (histogram)")
> plot(runif(5), type = 'n',
       main = "plot type 'n' (no plot)")
> par(op)

[Figure: six panels, one for each of the plot types 'p', 'l', 'b', 's', 'h' and 'n'.]

The frame of the graph can be controlled with the argument bty, with the following values:

bty="n"    No box around the graph
bty="o"    A box on all 4 sides of the graph
bty="c"    A 3-sided box around the graph, shaped like the letter C
bty="l"    A 2-sided box around the graph, shaped like the letter L
bty="7"    A 2-sided box around the graph, shaped like the digit 7

The best way for the reader to become familiar with these options is to try them in R.
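One quick way to try the frame options listed above is to loop over them and draw one small graph per value. This is only a sketch with simulated data; the variable names are arbitrary.

```r
set.seed(1)
y <- rnorm(30)

op <- par(mfrow = c(2, 3))            # room for up to six panels
for (b in c("n", "o", "c", "l", "7")) {
  # same data in every panel; only the box type changes
  plot(y, bty = b, main = paste0('bty = "', b, '"'))
}
par(op)
```

The sixth panel is simply left empty.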
The plotting symbol can also be changed by supplying a number to pch (plotting character). The commonly used symbols are:

[Figure: "Available symbols", showing each plotting symbol with its pch number.]

> plot(x, y, col="red", pch=16, bty="l")

Figure 4. Effect of pch=16, col="red" and bty="l".
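A chart of the available symbols can be drawn directly in R. The sketch below lays out the symbols pch = 0 to 25 on a small grid and writes each number next to its symbol; the layout and the grey fill are illustrative choices.

```r
op <- par(mar = c(1, 1, 3, 1))
k  <- 0:25                                   # the numbered plotting symbols

# place symbol k in column k %% 5 and row k %/% 5
plot(k %% 5, k %/% 5, pch = k, cex = 2, bg = "grey",
     axes = FALSE, xlab = "", ylab = "",
     xlim = c(-0.5, 4.5), ylim = c(-0.5, 5.5),
     main = "Plotting symbols pch = 0 to 25")
text(k %% 5 + 0.25, k %/% 5, labels = k, cex = 0.8)
par(op)
```

Symbols 21 to 25 are the ones that can be filled with a background colour (bg).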

8.1.6 Legends (legend)

The function legend is very useful for annotating a graph and helps the reader understand it better. Its use can be illustrated by the following example:
> N <- 200
> x <- runif(N, -4, 4)
> y <- x + 0.5*rnorm(N)
> plot(x, y, pch=16, main="Scatter plot of y and x")
> reg <- lm(y~x)
> abline(reg)
> legend(2, -2, c("Production","Regression line"), pch=16, lty=c(0,1))

The arguments in legend(2, -2) place the legend at 2 on the horizontal axis (x-axis) and -2 on the vertical axis (y-axis); by default these coordinates mark the top-left corner of the legend box.

Figure 5. Effect of legend.
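Instead of numeric coordinates, legend also accepts a position keyword such as "bottomright", which avoids having to guess where the empty space is. The sketch below repeats the example with simulated data; using pch=c(16, NA) so that the regression-line entry shows no point is an illustrative refinement, not part of the original example.

```r
set.seed(42)
N <- 200
x <- runif(N, -4, 4)
y <- x + 0.5 * rnorm(N)

plot(x, y, pch = 16, main = "Scatter plot of y and x")
reg <- lm(y ~ x)
abline(reg)

# keyword position; NA suppresses the point symbol for the line entry
legend("bottomright", legend = c("Production", "Regression line"),
       pch = c(16, NA), lty = c(0, 1), bty = "n")
```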


8.1.7 Writing text in a graph
Most graphing tools provide no means of writing text or notes inside a graph, or provide only very limited ones. R has the function mtext(), which lets us place text or explanations beside or in a graph.

Starting from the bottom of the graph (side=1), we move clockwise to side 4. The plot command in the following example prints no axis names and no title; it only provides a frame. In this example we use cex (character expansion) to control the size of the text. By default cex=1; with cex=2 the text is twice the default size. The text() command lets us place text at a specific position. The first text() command centres the quoted text at x=15, y=4.3. With adj we can also left-align the text (adj=0), so that the coordinates mark the starting point of the text.
> plot(y, xlab=" ", ylab=" ", type="n")
> mtext("Text on side 1, cex=1", side=1, cex=1)
> mtext("Text on side 2, cex=1.2", side=2, cex=1.2)
> mtext("Text on side 3, cex=1.5", side=3, cex=1.5)
> mtext("Text on side 4, cex=2", side=4, cex=2)
> text(15, 4.3, "text(15, 4.3)")
> text(35, 3.5, adj=0, "text(35, 3.5), left aligned")
> text(40, 5, adj=1, "text(40, 5), right aligned")
[Figure: text written with mtext() on sides 1 to 4 of the frame and with text() inside it.]

8.1.8 Adding symbols to a graph
abline() can be used to draw a straight line, with the following arguments:
abline(a, b): a straight line (for example a linear regression line) with a=intercept and b=slope.
abline(h=30) draws a horizontal line at y=30.
abline(v=12) draws a vertical line at x=12.
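The three forms of abline listed above can be tried on one graph. The data below are simulated only so that the lines at y=30 and x=12 fall inside the plotting region; the intercept 10 and slope 2 are arbitrary illustrative values.

```r
set.seed(7)
x <- runif(100, 0, 20)
y <- 10 + 2 * x + rnorm(100, sd = 4)

plot(x, y, pch = 16)
abline(a = 10, b = 2, col = "red")   # line with intercept 10 and slope 2
abline(h = 30, lty = 2)              # horizontal line at y = 30
abline(v = 12, lty = 3)              # vertical line at x = 12
```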
In addition, we can add an arrow to a graph to point out a particular data point.
> N <- 200
> x <- runif(N, -4, 4)
> y <- x + 0.5*rnorm(N)
> plot(x, y, pch=16, main="Scatter plot of y and x")

[Figure: "Scatter plot of y and x".]

Suppose we want to mark the point x=0, y=0 as the centre. We first use arrows to draw an arrow. In the command below, arrows(-1, 1, 1.5, 1.5) means: start the arrow at coordinates x=-1, y=1 and end it at coordinates x=1.5, y=1.5. The text(0, 1) part asks R to write the text at coordinates x=0, y=1.
> arrows(-1, 1.0, 1.5, 1.5)
> text(0, 1, "Trung tam", cex=0.7)

[Figure: "Scatter plot of y and x" with an arrow and the label "Trung tam".]

8.2 Data for graphical analysis

Now that we are familiar with the graphics environment and the design options, we can use some common functions to draw graphs from data. In my view, graphs fall into 2 main types: graphs that describe a single variable and graphs that show the relationship between two or more variables. Of course, a variable can be continuous or discrete, so in practice we have 4 types of graph. In the following sections I will go through the types of graph, from simple to complex.
Perhaps the best way to learn how to draw graphs in R is with a real data set. I return to example 2 of the previous chapter. In that example we have a data set with 8 columns (variables): id, sex, age, bmi, hdl, ldl, tc, and tg. (Here id is the identification number of the 50 study subjects; sex is the sex (male or female); age is the age; bmi is the body mass index; hdl is high density cholesterol; ldl is low density cholesterol; tc is total cholesterol; and tg is triglycerides.) The data are stored in the directory c:\works\insulin under the name chol.txt. Before drawing graphs, we start by reading the data into R.
> setwd("c:/works/stats")
> cong <- read.table("chol.txt", header=TRUE, na.strings=".")
> attach(cong)

Alternatively, for ease of following along, I enter the data with the commands below:


sex <- c("Nam", "Nu", "Nu","Nam","Nam", "Nu","Nam","Nam","Nam", "Nu",
          "Nu","Nam", "Nu","Nam","Nam", "Nu", "Nu", "Nu", "Nu", "Nu",
          "Nu", "Nu", "Nu", "Nu","Nam","Nam", "Nu","Nam", "Nu", "Nu",
          "Nu","Nam","Nam", "Nu", "Nu","Nam", "Nu","Nam", "Nu", "Nu",
         "Nam", "Nu","Nam","Nam","Nam", "Nu","Nam","Nam", "Nu", "Nu")

age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
         63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
         61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
         60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
         51, 58, 60, 45, 63, 52, 64, 45, 64, 62)

bmi <- c(17, 18, 18, 18, 18, 18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
         20, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22,
         22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24,
         24, 24, 24, 25, 25)

hdl <- c(5.000,4.380,3.360,5.920,6.250,4.150,0.737,7.170,6.942,5.000,
         4.217,4.823,3.750,1.904,6.900,0.633,5.530,6.625,5.960,3.800,
         5.375,3.360,5.000,2.608,4.130,5.000,6.235,3.600,5.625,5.360,
         6.580,7.545,6.440,6.170,5.270,3.220,5.400,6.300,9.110,7.750,
         6.200,7.050,6.300,5.450,5.000,3.360,7.170,7.880,7.360,7.750)

ldl <- c(2.0, 3.0, 3.0, 4.0, 2.1, 3.0, 3.0, 3.0, 3.0, 2.0,
         5.0, 1.3, 1.2, 0.7, 4.0, 4.1, 4.3, 4.0, 4.3, 4.0,
         3.1, 3.0, 1.7, 2.0, 2.1, 4.0, 4.1, 4.0, 4.2, 4.2,
         4.4, 4.3, 2.3, 6.0, 3.0, 3.0, 2.6, 4.4, 4.3, 4.0,
         3.0, 4.1, 4.4, 2.8, 3.0, 2.0, 1.0, 4.0, 4.6, 4.0)

tc <- c(4.0, 3.5, 4.7, 7.7, 5.0, 4.2, 5.9, 6.1, 5.9, 4.0,
        6.2, 4.1, 3.0, 4.0, 6.9, 5.7, 5.7, 5.3, 7.1, 3.8,
        4.3, 4.8, 4.0, 3.0, 3.1, 5.3, 5.3, 5.4, 4.5, 5.9,
        5.6, 8.3, 5.8, 7.6, 5.8, 3.1, 5.4, 6.3, 8.2, 6.2,
        6.2, 6.7, 6.3, 6.0, 4.0, 3.7, 6.1, 6.7, 8.1, 6.2)

tg <- c(1.1, 2.1, 0.8, 1.1, 2.1, 1.5, 2.6, 1.5, 5.4, 1.9,
        1.7, 1.0, 1.6, 1.1, 1.5, 1.0, 2.7, 3.9, 3.0, 3.1,
        2.2, 2.7, 1.1, 0.7, 1.0, 1.7, 2.9, 2.5, 6.2, 1.3,
        3.3, 3.0, 1.0, 1.4, 2.5, 0.7, 2.4, 2.4, 1.4, 2.7,
        2.4, 3.3, 2.0, 2.6, 1.8, 1.2, 1.9, 3.3, 4.0, 2.5)

cong <- data.frame(sex, age, bmi, hdl, ldl, tc, tg)

Once the data are available, we are ready to analyze them graphically as follows.

8.3 Graph for one discrete variable: barplot
The variable sex in the data above has two values (Nam and Nu, that is, male and female), so it is a discrete variable. We want to know the frequency of each sex (how many males and how many females) and draw a simple graph. To do this, we first use the function table to obtain the frequencies:
> sex.freq <- table(sex)
> sex.freq
sex
Nam  Nu
 22  28

There are 22 males and 28 females in the study. We then use the function barplot to display these frequencies as follows:
> barplot(sex.freq, main="Frequency of males and females")

The same graph can be obtained with a single, simpler command (Figure 8a):

> barplot(table(sex), main="Frequency of males and females")

Instead of showing the male and female frequencies as 2 columns, we can show them as two horizontal bars with the argument horiz = TRUE, as follows (see the result in Figure 8b):
> barplot(sex.freq,
          horiz = TRUE,
          col = rainbow(length(sex.freq)),
          main="Frequency of males and females")

Figure 8a. Frequency of each sex shown as vertical bars.
Figure 8b. Frequency of each sex shown as horizontal bars.

8.4 Graph for two discrete variables: barplot
Age is a continuous variable. We can divide the patients into several groups according to age. The function cut cuts a continuous variable into discrete groups. For example:
> ageg <- cut(age, 3)
> table(ageg)
ageg
  (42,54.7] (54.7,67.3]   (67.3,80]
         19          24           7

This divides the variable age into 3 groups: ages 42 to 54.7 form group 1, 54.7 to 67.3 form group 2, and 67.3 to 80 form group 3. Group 1 has 19 patients; groups 2 and 3 have 24 and 7 patients.
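When equal-width intervals such as (54.7,67.3] are awkward to report, cut also accepts explicit break points and labels. The sketch below re-enters the age data from section 8.2; the break points 40, 55, 67 and 80, the labels, and the name ageg3 are illustrative choices, not groupings used elsewhere in this chapter.

```r
age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
         63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
         61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
         60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
         51, 58, 60, 45, 63, 52, 64, 45, 64, 62)

# intervals are (40,55], (55,67], (67,80]; ages are whole years,
# so the labels below describe the same groups
ageg3 <- cut(age, breaks = c(40, 55, 67, 80),
             labels = c("41-55", "56-67", "68-80"))
table(ageg3)
```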
Now we want to know how many patients there are in each age group and each sex, using the table command:
> age.sex <- table(sex, ageg)
> age.sex
     ageg
sex   (42,54.7] (54.7,67.3] (67.3,80]
  Nam        10          10         2
  Nu          9          14         5

The result shows 10 male and 9 female patients in the first age group, 10 males and 14 females in the second age group, and so on. To display the frequencies of these two variables, we again use barplot:
> barplot(age.sex, main="Number of males and females in each age group")

In Figure 7a each column represents an age group; the dark part of the column is the frequency of females and the light part the frequency of males. Instead of stacking the male and female frequencies in one column, we can show them as 2 adjacent columns with beside=TRUE, as follows (Figure 7b):
> barplot(age.sex, beside=TRUE, xlab="Age group")

Figure 7a. Frequency by sex and age group, shown as stacked columns.
Figure 7b. Frequency by sex and age group, shown as two adjacent columns.
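Because the shading of the bars depends on R's default colours, adding a legend removes any doubt about which bar belongs to which sex. The sketch below re-enters sex and age from section 8.2 and uses the legend.text and args.legend arguments of barplot; the axis labels are illustrative.

```r
sex <- c("Nam","Nu","Nu","Nam","Nam","Nu","Nam","Nam","Nam","Nu",
         "Nu","Nam","Nu","Nam","Nam","Nu","Nu","Nu","Nu","Nu",
         "Nu","Nu","Nu","Nu","Nam","Nam","Nu","Nam","Nu","Nu",
         "Nu","Nam","Nam","Nu","Nu","Nam","Nu","Nam","Nu","Nu",
         "Nam","Nu","Nam","Nam","Nam","Nu","Nam","Nam","Nu","Nu")
age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
         63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
         61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
         60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
         51, 58, 60, 45, 63, 52, 64, 45, 64, 62)

ageg    <- cut(age, 3)
age.sex <- table(sex, ageg)

# legend.text = TRUE takes the legend labels from the row names (Nam, Nu)
barplot(age.sex, beside = TRUE, legend.text = TRUE,
        args.legend = list(x = "topright"),
        xlab = "Age group", ylab = "Number of patients")
```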

8.5 Pie chart

The frequencies of a discrete variable can also be shown as a pie chart. The following example draws pie charts of the age frequencies: Figure 8a for 3 age groups and Figure 8b for 5 age groups:
> pie(table(ageg))
> pie(table(cut(age,5)))

Figure 8a. Frequencies for 3 age groups: (42,54.7], (54.7,67.3], (67.3,80].
Figure 8b. Frequencies for 5 age groups: (42,49.6], (49.6,57.2], (57.2,64.8], (64.8,72.4], (72.4,80].

8.6 Graphs for one continuous variable: stripchart and hist

8.6.1 Stripchart
A strip chart shows how continuous a variable is. For example, to examine the continuity of triglycerides (tg), the function stripchart() helps:
> stripchart(tg,
             main="Strip chart for triglycerides", xlab="mg/L")

We see gaps in the variable tg, especially among the subjects with high tg. While most subjects have tg below 5, there are 2 subjects with very high tg (>5).
8.6.2 Histogram
Age is a continuous variable. To draw the frequency distribution of age, we simply issue the command hist(age). As mentioned above, we can improve the graph by adding a main title (main) and titles for the horizontal axis (xlab) and the vertical axis (ylab):
> hist(age)
> hist(age, main="Frequency distribution by age group", xlab="Age group", ylab="No of patients")

Figure 9a. The vertical axis is the number of patients (study subjects) and the horizontal axis is age. For example, there are 6 patients aged 40 to 45 and 4 patients aged 70 to 80.
Figure 9b. The same graph with a title and with names for the vertical and horizontal axes added through main, xlab and ylab.

We can also turn the histogram into a probability density plot with plot(density()) as follows (result in Figure 10a):
> plot(density(age))
Figure 10a. Probability density of the variable age (title "density.default(x = age)"; N = 50, bandwidth = 3.806).
Figure 10b. Probability density of the variable age, with a width based on the interquartile range, overlaid on the histogram of age.

We can overlay the two graphs, using the interquartile range to set the width of the density estimate, as follows (see the result in Figure 10b):
> iqr <- diff(summary(age)[c(2,5)])
> des <- density(age, width=0.5*iqr)
> hist(age, xlim=range(des$x), probability=TRUE)
> lines(des, lty=2)

In the graph above we used a width of 0.5*iqr (relatively narrow). We can change this to 1.5*iqr to obtain a smoother, more realistic distribution:
> iqr <- diff(summary(age)[c(2,5)])
> des <- density(age, width=1.5*iqr)
> hist(age, xlim=range(des$x), probability=TRUE)
> lines(des, lty=2)

[Figure: "Histogram of age" with the density curve for width=1.5*iqr.]

We can turn the graph into a cumulative probability distribution (cumulative distribution) plot with the functions plot and sort as follows:
> n <- length(age)
> plot(sort(age), (1:n)/n, type="s", ylim=c(0,1))

The result is shown in Figure 11.

Figure 11. Cumulative probability distribution of the variable age.
Figure 12. Normal Q-Q plot for checking whether age follows the normal distribution.

In Figure 11 the vertical axis is the cumulative probability and the horizontal axis is age from low to high. For example, from the graph we can see that about 50% of the subjects are younger than 60.
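R also has a built-in function, ecdf, that produces the same cumulative distribution and returns it as a function that can be evaluated at any age. The sketch below re-enters the age data from section 8.2; the object name Fn and the axis labels are arbitrary.

```r
age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
         63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
         61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
         60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
         51, 58, 60, 45, 63, 52, 64, 45, 64, 62)

Fn <- ecdf(age)                       # empirical cumulative distribution function
plot(Fn, main = "Empirical cumulative distribution of age",
     xlab = "Age", ylab = "Cumulative probability")

Fn(60)              # proportion of subjects aged 60 or younger
quantile(age, 0.5)  # the median age
```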
To see whether the distribution of age follows the normal distribution, we can use the function qqnorm.
> qqnorm(age)
The horizontal axis of this graph shows the quantiles expected under the normal distribution (theoretical quantiles) and the vertical axis the quantiles of the data (sample quantiles). If age follows the normal distribution, the points should lie along a straight line (a 45-degree line when the data are standardized, so that the theoretical and sample quantiles are equal). From Figure 12, however, we see that the distribution of age does not quite follow the normal distribution.
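Two additions make the Q-Q plot easier to judge: qqline draws a reference line through the first and third quartiles, and shapiro.test gives a formal test of normality. This is a sketch using the age data from section 8.2; the Shapiro-Wilk test is brought in here as a complementary check and is not part of the original example.

```r
age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
         63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
         61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
         60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
         51, 58, 60, 45, 63, 52, 64, 45, 64, 62)

qqnorm(age)
qqline(age, lty = 2)        # reference line through the quartiles

sw <- shapiro.test(age)     # Shapiro-Wilk test of normality
sw$statistic
sw$p.value                  # a small p value suggests departure from normality
```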
8.6.3 Box plot (boxplot)
To draw a box plot of the variable tc, we simply issue the command:
> boxplot(tc, main="Box plot of total cholesterol", ylab="mg/L")

Figure 13. In this graph we see that the median is about 5.6 mg/L, 25% of the total cholesterol values are below 4.1, and 75% are below 6.2. The lowest total cholesterol is about 3 and the highest is above 8 mg/L.
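The numbers read off the box plot can be obtained directly. The sketch below re-enters tc from section 8.2 and uses quantile and boxplot.stats; the object name bs is arbitrary.

```r
tc <- c(4.0, 3.5, 4.7, 7.7, 5.0, 4.2, 5.9, 6.1, 5.9, 4.0,
        6.2, 4.1, 3.0, 4.0, 6.9, 5.7, 5.7, 5.3, 7.1, 3.8,
        4.3, 4.8, 4.0, 3.0, 3.1, 5.3, 5.3, 5.4, 4.5, 5.9,
        5.6, 8.3, 5.8, 7.6, 5.8, 3.1, 5.4, 6.3, 8.2, 6.2,
        6.2, 6.7, 6.3, 6.0, 4.0, 3.7, 6.1, 6.7, 8.1, 6.2)

quantile(tc, probs = c(0.25, 0.50, 0.75))   # quartiles

bs <- boxplot.stats(tc)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # any points drawn individually as outliers
range(tc)  # lowest and highest values
```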
In the following graph we compare tc between males and females:
> boxplot(tc ~ sex, main="Box plot of total cholesterol by sex",
          ylab="mg/L")

The result is shown in Figure 14a. We can change the appearance of the graph with the argument horizontal=TRUE and change the colour with the argument col, as follows (Figure 14b):
> boxplot(tc~sex, horizontal=TRUE, main="Box plot of total cholesterol",
          ylab="mg/L", col = "pink")

Figure 14a. In this graph we see that the median total cholesterol of females is lower than that of males, but the spread of the two groups does not differ much.
Figure 14b. Total cholesterol by sex, with colour and horizontal boxes.
8.6.4 Bar chart
To draw a bar chart of the variable bmi, we simply issue the command:
> barplot(bmi, col="blue")

Figure 15. Bar chart of the variable bmi (vertical axis in kg/m^2).

8.6.5 Dot chart (dotchart)
Another graph that provides the same information as barplot is dotchart:
> dotchart(bmi, xlab="Body mass index (kg/m^2)", main="Distribution of BMI")

Figure 16. Dot chart of the variable bmi.

8.7 Graphical analysis for two continuous variables

8.7.1 Scatter plot
To explore the relationship between two variables, we use a scatter plot. To draw a scatter plot of the relationship between tc and hdl, we use the function plot. The first argument of plot is the horizontal axis (x-axis) and the second is the vertical axis (y-axis). To examine the relationship between tc and hdl we simply type:
> plot(tc, hdl)

Figure 17. Relationship between tc and hdl. In this graph hdl is plotted on the vertical axis and tc on the horizontal axis.
We want to distinguish the sexes (male and female) in the graph above. To do so we use the function ifelse. In the command below, if sex=="Nam" the point is drawn with symbol 16 (filled circle); otherwise it is drawn with symbol 22 (square):
> plot(hdl, tc, pch=ifelse(sex=="Nam", 16, 22))
The result is Figure 18a. We can also replace the symbols with the letters M (male) and F (female) (see Figure 18b):
> plot(hdl, tc, pch=ifelse(sex=="Nam", "M", "F"))

Figure 18a. Relationship between tc and hdl by sex, shown with two symbols.
Figure 18b. Relationship between tc and hdl by sex, shown with two letters.
We can also draw a linear regression line through the points by continuing with the following commands:
> plot(hdl ~ tc, pch=16, main="Total cholesterol and HDL cholesterol",
       xlab="Total cholesterol", ylab="HDL cholesterol", bty="l")
> reg <- lm(hdl ~ tc)
> abline(reg)

The result is Figure 19a. We can also use a smooth function to describe the relationship between two variables. The following graph uses lowess (one of the most commonly used smoothers) to smooth the tc and hdl data (Figure 19b). Note that lowess takes the x variable first, so tc must come before hdl to match the axes of the plot.
> plot(hdl ~ tc, pch=16,
       main="Total cholesterol and HDL cholesterol with LOWESS smooth function",
       xlab="Total cholesterol", ylab="HDL cholesterol", bty="l")
> lines(lowess(tc, hdl, f=2/3, iter=3), col="red")

Figure 19a. In the commands above, reg <- lm(hdl~tc) fits the equation relating hdl to tc with a linear model (lm) and stores the result in the object reg. The second command, abline(reg), asks R to draw the straight line from the equation in reg.
Figure 19b. Instead of abline, we use the function lowess to show the relationship between tc and hdl.

The reader may experiment with other values such as f=1/2, f=2/5, or even f=1/10 and will see the graph change in interesting ways.
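The effect of the span f can be seen by overlaying several smooths on one scatter plot. To keep the sketch short, the data below are simulated stand-ins for tc and hdl (they are not the study data), and the names spans and fit.narrow are illustrative.

```r
set.seed(10)
tc  <- runif(50, 3, 8.5)                      # simulated stand-in for total cholesterol
hdl <- 2 + 0.6 * tc + rnorm(50, sd = 1.3)     # simulated stand-in for HDL

plot(tc, hdl, pch = 16,
     xlab = "Total cholesterol", ylab = "HDL cholesterol")

spans <- c(2/3, 1/2, 2/5, 1/10)
for (i in seq_along(spans)) {
  # smaller f uses fewer neighbouring points, giving a wigglier curve
  lines(lowess(tc, hdl, f = spans[i], iter = 3), col = i + 1, lty = i)
}
legend("topleft", legend = paste("f =", round(spans, 2)),
       col = seq_along(spans) + 1, lty = seq_along(spans), bty = "n")

fit.narrow <- lowess(tc, hdl, f = 1/10)
```

With f=1/10 the curve follows local fluctuations closely, while f=2/3 gives a nearly straight trend.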

8.8 Graphical analysis for several variables: pairs

We can explore the relationships among variables such as age, bmi, hdl, ldl and tc with the command pairs. But first we must put these variables into a data.frame containing only the variables that can be plotted, and then apply the function pairs in R.
> lipid <- data.frame(age,bmi,hdl,ldl,tc)
> pairs(lipid, pch=16)

The result is a matrix of scatter plots, one for each pair of the variables age, bmi, hdl, ldl and tc.

The graph above can be improved with the function matrix.cor below (written by an author on the internet), which yields much interesting information.
matrix.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
   # remember the panel's coordinate system and restore it on exit
   usr <- par("usr"); on.exit(par(usr = usr))
   par(usr = c(0, 1, 0, 1))
   r <- abs(cor(x, y))
   txt <- format(c(r, 0.123456789), digits=digits)[1]
   txt <- paste(prefix, txt, sep="")
   # default text size: scale the label to the panel width
   cex <- if (missing(cex.cor)) 0.8/strwidth(txt) else cex.cor
   test <- cor.test(x,y)
   # borrowed from printCoefmat
   Signif <- symnum(test$p.value, corr = FALSE, na = FALSE,
                    cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                    symbols = c("***", "**", "*", ".", " "))
   text(0.5, 0.5, txt, cex = cex * r)
   text(.8, .8, Signif, cex=cex, col=2)
}

We return to the lipid data and call the function matrix.cor as follows:

pairs(lipid,lower.panel=panel.smooth, upper.panel=matrix.cor)

[Figure: scatter plot matrix of age, bmi, hdl, ldl and tc, with smoothed scatter plots below the diagonal and correlation coefficients with significance stars above it.]

This graph gives us the correlation coefficients between all pairs of variables. For example, the correlation between age and bmi is very low and not statistically significant; the correlation between age and hdl is likewise not statistically significant; but between age and tc it is 0.22. The highest correlations are between ldl and tc (0.65) and between hdl and tc (0.62). Between hdl and ldl the correlation is only 0.35, but it is statistically significant (it has a star!).
Note that the graph not only provides two main pieces of information (the correlation coefficient, and a scatter plot for each pair of variables) but also shows which correlation coefficients are statistically significant (the star symbols). The higher the correlation coefficient, the larger the font. A very impressive graph!
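The numbers shown in each panel can also be computed outside the plot, which is convenient for reporting. The sketch below re-enters hdl and tc from section 8.2 and repeats the calculation that matrix.cor performs for one pair of variables; the object names ct, r and stars are arbitrary.

```r
hdl <- c(5.000,4.380,3.360,5.920,6.250,4.150,0.737,7.170,6.942,5.000,
         4.217,4.823,3.750,1.904,6.900,0.633,5.530,6.625,5.960,3.800,
         5.375,3.360,5.000,2.608,4.130,5.000,6.235,3.600,5.625,5.360,
         6.580,7.545,6.440,6.170,5.270,3.220,5.400,6.300,9.110,7.750,
         6.200,7.050,6.300,5.450,5.000,3.360,7.170,7.880,7.360,7.750)
tc  <- c(4.0, 3.5, 4.7, 7.7, 5.0, 4.2, 5.9, 6.1, 5.9, 4.0,
         6.2, 4.1, 3.0, 4.0, 6.9, 5.7, 5.7, 5.3, 7.1, 3.8,
         4.3, 4.8, 4.0, 3.0, 3.1, 5.3, 5.3, 5.4, 4.5, 5.9,
         5.6, 8.3, 5.8, 7.6, 5.8, 3.1, 5.4, 6.3, 8.2, 6.2,
         6.2, 6.7, 6.3, 6.0, 4.0, 3.7, 6.1, 6.7, 8.1, 6.2)

ct <- cor.test(hdl, tc)              # Pearson correlation and its test
r  <- unname(ct$estimate)

# same star coding as in matrix.cor
stars <- symnum(ct$p.value, corr = FALSE, na = FALSE,
                cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                symbols = c("***", "**", "*", ".", " "))

cat("r =", round(r, 2), " p =", signif(ct$p.value, 2),
    " ", as.character(stars), "\n")
```

For the whole table at once, cor(lipid) returns the matrix of correlation coefficients for the data frame built in the text.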

8.9 Some multi-purpose graphs

8.9.1 Scatter plot with box plots

As described above, a scatter plot helps us visualize the relationship between two continuous variables such as age and hdl. For this we use the function plot. To examine the distribution of each variable, age or hdl, we can use the function boxplot. But if we want to see the distributions of the two variables and, at the same time, the relationship between them, we need to write a few commands. The commands below draw a scatter plot of the relationship between age and hdl together with a box plot for each variable.
op <- par(no.readonly=TRUE)   # save only the settings that can be restored
layout( matrix( c(2,1,0,3), 2, 2, byrow=TRUE ),
        c(1,6), c(4,1) )
par(mar=c(1,1,5,2))
plot(hdl ~ age,
     xlab='', ylab='',
     las = 1,
     pch=16)
rug(side=1, jitter(age, 5) )
rug(side=2, jitter(hdl, 20) )
title(main = "Age and HDL")
par(mar=c(1,2,5,1))
boxplot(hdl, axes=FALSE)
title(ylab='HDL', line=0)
par(mar=c(5,1,1,2))
boxplot(age, horizontal=TRUE, axes=FALSE)
title(xlab='Age', line=1)
par(op)

The result is:

[Figure: "Age and HDL", a scatter plot of hdl against age with a box plot of HDL on the left and a box plot of Age below.]

8.9.2 Scatter plot with point size given by a third variable

The graph above shows the relationship between age and hdl, with every point the same size. But we know that hdl is also related to triglycerides (tg). To show something of this three-way relationship, one approach is to draw the size of each point according to the value of tg. We use the argument cex inside plot to draw this three-way relationship as follows:
> plot(age, hdl, cex=tg,
       pch=16,
       col="red",
       xlab="Age", ylab="HDL",
       main="Bubble plot")
> points(age, hdl, cex=tg)

[Figure: "Bubble plot" of HDL against Age.]

8.9.3 Bar chart with cumulative probability

To plot the frequencies of a continuous variable we mainly use the hist function, which gives the frequency for each group (age groups, for example). Sometimes, however, we also want the cumulative probability for each group, and we want to show both results in one chart. To do this we need to write a function in R. The function below is called pareto (readers may of course give it another name) and was written for this purpose. The code for pareto is as follows:
pareto <- function (x, main = "", ylab = "Value")
{
  # widen the right margin for the second axis; rotate the axis labels
  op <- par(mar = c(5, 4, 4, 5) + 0.1,
            las = 2)
  if( ! inherits(x, "table") ) {
    x <- table(x)
  }
  x <- rev(sort(x))                 # groups in decreasing order of frequency
  plot( x, type = 'h', axes = FALSE, lwd = 16,
        xlab = "", ylab = ylab, main = main )
  axis(2)
  points( x, type = 'h', lwd = 12,
          col = heat.colors(length(x)) )
  y <- cumsum(x)/sum(x)             # cumulative proportion
  par(new = TRUE)                   # overlay the next plot on the same region
  plot(y, type = "b", lwd = 3, pch = 7,
       axes = FALSE,
       xlab='', ylab='', main='')
  points(y, type = 'h')
  axis(4)
  par(las=0)
  mtext("Cumulated frequency", side=4, line=3)
  print(names(x))
  axis(1, at=1:length(x), labels=names(x))
  par(op)
}

We now apply the pareto function to plot the frequencies of the variable tg (triglyceride). First, we divide tg into 10 groups with the cut function and store the result in the object tg.group.
> tg.group <- cut(tg, 10)

Next, we apply the pareto function:

> pareto(tg.group)
 [1] "(0.695,1.25]" "(2.35,2.9]"   "(1.25,1.8]"   "(2.9,3.45]"   "(1.8,2.35]"
 [6] "(3.45,4]"     "(5.65,6.21]"  "(5.1,5.65]"   "(4.55,5.1]"   "(4,4.55]"
> title(main="Pareto plot of Tg with cumulated frequencies")

[Figure: "Pareto plot of Tg with cumulated frequencies"; left axis "Value" (frequency), right axis "Cumulated frequency" (0 to 1.0), tg groups on the horizontal axis]

This chart has two vertical axes. The left axis shows the frequency (number of patients) in each tg group, and the right axis shows the cumulative frequency expressed as a probability (so its maximum is 1).

8.9.4 Clock plot
A clock plot, as the name suggests, displays a continuous variable on a clock face. Instead of showing the values as columns or lines, the chart shows them as sectors around a dial. The following function (clock.plot) was written to draw a clock plot:
clock.plot <- function (x, col = rainbow(n), ...) {
  # rescale x to the interval [0, 1]
  if( min(x)<0 ) x <- x - min(x)
  if( max(x)>1 ) x <- x/max(x)
  n <- length(x)
  if(is.null(names(x))) names(x) <- 0:(n-1)
  m <- 1.05
  plot(0,
       type = 'n', # do not plot anything
       xlim = c(-m,m), ylim = c(-m,m),
       axes = FALSE, xlab = '', ylab = '', ...)
  # outer circle
  a <- pi/2 - 2*pi/200*0:200
  polygon( cos(a), sin(a) )
  # tick marks and dotted spokes
  v <- .02
  a <- pi/2 - 2*pi/n*0:n
  segments( (1+v)*cos(a), (1+v)*sin(a),
            (1-v)*cos(a), (1-v)*sin(a) )
  segments( cos(a), sin(a),
            0, 0,
            col = 'light grey', lty = 3)
  # one sector per observation, with radius x[i]
  ca <- -2*pi/n*(0:50)/50
  for (i in 1:n) {
    a <- pi/2 - 2*pi/n*(i-1)
    polygon( c(0, x[i]*cos(a+ca), 0),
             c(0, x[i]*sin(a+ca), 0),
             col=col[i] )
    v <- .1
    text((1+v)*cos(a), (1+v)*sin(a), names(x)[i])
  }
}

We apply the clock.plot function to the variable ldl as follows:

> clock.plot(ldl,
main = "Distribution of LDL")

The result is:

[Figure: clock plot titled "Distribution of LDL", with one sector per observation, labelled around the dial]

8.9.5 Chart with standard errors

In the following chart we have 5 groups (the values are simulated, not real data); each group has a mean (mean) and a 95% confidence interval (lcl and ucl). Usually lcl = mean - 1.96*SE and ucl = mean + 1.96*SE, where SE is the standard error. We want to plot the 5 groups together with their error bars. The following commands are needed:
> group <- c(1,2,3,4,5)
> mean <- c(1.1, 2.3, 3.0, 3.9, 5.1)
> lcl <- c(0.9, 1.8, 2.7, 3.8, 5.0)
> ucl <- c(1.3, 2.4, 3.5, 4.1, 5.3)
> plot(group, mean, ylim=range(c(lcl, ucl)))
> arrows(group, ucl, group, lcl, length=0.5, angle=90, code=3)

[Figure: group means (vertical axis "mean", 1 to 5) against "group", with vertical error bars]
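The limits lcl and ucl above were typed in by hand. As a minimal sketch of the formula quoted in the text, the limits can instead be computed from the means and standard errors. The standard errors below are hypothetical values chosen only for illustration, and the vector of means is named m here (an illustrative choice) so that it does not mask the built-in function mean.

```r
group <- 1:5
m  <- c(1.1, 2.3, 3.0, 3.9, 5.1)          # group means, as in the text
se <- c(0.10, 0.15, 0.20, 0.08, 0.08)     # hypothetical standard errors

# 95% confidence limits: mean -/+ 1.96 * SE
lcl <- m - 1.96 * se
ucl <- m + 1.96 * se

plot(group, m, ylim = range(c(lcl, ucl)), pch = 16, ylab = "mean")
# code = 3 draws a flat head (angle = 90) at both ends of each bar
arrows(group, lcl, group, ucl, length = 0.05, angle = 90, code = 3)
```

A short arrow head (length = 0.05 inch) keeps the caps narrow; the value 0.5 used in the text gives much wider caps.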

Here is another simulation. We generate 10 values of x and y from a normal distribution, and 10 error values (se.x and se.y) from a uniform distribution.
> x <- rnorm(10)
> y <- rnorm(10)
> se.x <- runif(10)
> se.y <- runif(10)
> plot(x, y, pch=22)
> arrows(x, y-se.y, x, y+se.y, code=3, angle=90, length=0.1)

[Figure: scatter plot of y against x with vertical error bars of length se.y]

8.9.6 Contour plot

R can draw contour plots in many different forms, depending on preference and on the data. In the commands below we use simulation to draw a contour plot for three variables x, y and z.
> N <- 50
> x <- seq(-1, 1, length=N)
> y <- seq(-1, 1, length=N)
> xx <- matrix(x, nrow=N, ncol=N)
> yy <- matrix(y, nrow=N, ncol=N, byrow=TRUE)
> z <- 1 / (1 + xx^2 + (yy + .2 * sin(10*yy))^2)
> contour(x, y, z, main = "Contour plot")

[Figure: "Contour plot" of z; both axes run from -1.0 to 1.0]

This plot can be turned into an image with the image function.

> image(z)

[Figure: image of z; both axes run from 0.0 to 1.0]

A few small but important changes:

> image(x, y, z,
xlab="x",
ylab="y")

[Figure: image of z plotted against x and y; both axes run from -1.0 to 1.0]

> contour(x, y, z, lwd=3, add=TRUE)

[Figure: the same image with thick contour lines overlaid; both axes run from -1.0 to 1.0]

The following variation draws a three-dimensional surface based on the sine function. This kind of plot looks attractive but is probably seldom used in practice. I present it here as an example of R's versatility.
> x <- seq(-10, 10, length= 30)
> y <- x
> f <- function(x,y) { r <- sqrt(x^2+y^2); 10 * sin(r)/r }
> z <- outer(x, y, f)
> z[is.na(z)] <- 1
> op <- par(bg = "white", mar=c(0,2,3,0)+.1)
> persp(x, y, z,
theta = 30, phi = 30,
expand = 0.5,
col = "lightblue",
ltheta = 120,
shade = 0.75,
ticktype = "detailed",
xlab = "X", ylab = "Y", zlab = "Sinc(r)",
main = "The sinc function"
)
> par(op)

[Figure: perspective plot titled "The sinc function"; X and Y axes from -10 to 10, vertical axis "Sinc(r)"]

8.9.10 Charts with mathematical symbols

Sometimes we need a chart whose title or axis label contains mathematical notation. In the following plot we create a variable x with 200 values from -5 to 5, and $y = \sqrt{1 + x^2}$. To typeset this formula we use the expression function as follows:
> x <- seq(-5,5,length=200)
> y <- sqrt(1+x^2)
> plot(y~x, type='l', ylab=expression(sqrt(1+x^2)))
> title(main=expression("Graph of the function f"(x)==sqrt(1+x^2)))

[Figure: curve of $\sqrt{1 + x^2}$ against x, titled "Graph of the function f(x) = $\sqrt{1 + x^2}$"]

Even Japanese text can be displayed in R:


> plot(1:9, type="n", axes=FALSE, frame=TRUE, ylab="",
main= "example(Japanese)", xlab= "using Hershey fonts")
> par(cex=3)
> Vf <- c("serif", "plain")
> text(4, 2, "\\#J2438\\#J2421\\#J2451\\#J2473", vfont = Vf)
> text(4, 4, "\\#J2538\\#J2521\\#J2551\\#J2573", vfont = Vf)
> text(4, 6, "\\#J467c\\#J4b5c", vfont = Vf)
> text(4, 8, "Japan", vfont = Vf)
> par(cex=1)
> text(8, 2, "Hiragana")
> text(8, 4, "Katakana")
> text(8, 6, "Kanji")
> text(8, 8, "English")

[Figure: "example(Japanese)", showing Hiragana, Katakana, Kanji and English text; x-axis label "using Hershey fonts"]

This chapter has introduced only some of the charts commonly used in scientific research. Beyond these, R can draw considerably more complex and sophisticated graphics. R currently has a package called lattice that can produce higher-quality charts. lattice, like any other R package, is free and can be downloaded and installed whenever it is needed.

CHAPTER IX

DESCRIPTIVE STATISTICS

9
Descriptive statistical analysis
In this chapter we use R for descriptive statistical analysis. Descriptive statistics means describing data with the ordinary calculations and statistical indices we have known since secondary school, such as the mean, the median, the variance and the standard deviation for continuous variables, and the proportion for non-continuous variables. Before turning to the analysis itself, however, I want readers to be able to distinguish clearly between two concepts: the population and the sample.

9.0 The concepts of population and sample

Statistics textbooks often explain these two concepts vaguely, and sometimes meaninglessly. For example, "Modern Mathematical Statistics" (E. J. Dudewicz and S. N. Mishra, Wiley, 1988) explains that a "population is a set of n distinct elements (points) a1, a2, a3, ..., an" (page 24), while L. Fisher and G. van Belle, in "Biostatistics: A Methodology for the Health Science" (Wiley, 1993), explain that "The sample space or population is the set of all possible values of a variable" (page 38). To an experimental researcher, definitions of this kind are very abstract and hard to understand, and seem to have little to do with practice! In this section I explain the two concepts by simulation, in the hope that readers will understand them better.
The goal of experimental scientific research can be said to be to understand and discover what is not yet known (the unknown), including the laws by which nature operates. To make discoveries we rely on methods of classification, comparison and prediction. All scientific methods, statistics included, were developed to serve these three goals. To classify, we must measure a factor or criterion relevant to the problem under study. To compare and predict, we need methods of hypothesis testing and statistical modelling.
Like any model, a statistical model must have parameters. To obtain parameters we must first take measurements and then estimate the parameters from them. For example, to find out whether female students have the same intelligence quotient (IQ) as male students, we could conduct the study in one of two ways:
(a) Compile a list of all male and female students in the whole country, measure the IQ of each person, and then compare the two groups;
(b) Randomly select a sample of n male and m female students, measure the IQ of each person, and then compare the two groups.

Option (a) is very expensive and arguably impractical, because we would have to assemble every student in the country, which is very hard to do. But suppose we could: this option would then need no statistics at all. The mean IQ of female and male students obtained under option (a) is the final value and answers our question directly; we need no inference and no statistical test whatsoever!
Option (b) requires us to choose the n male and m female students so that they are representative of the whole student population of the country. "Representative" here means that these n male and m female students must have the same characteristics (age, educational level, socio-economic background, place of residence, and so on) as the student population as a whole. Because we do not know these characteristics for the entire student population, we cannot compare them directly, so a very effective approach is to sample at random. Many random sampling methods have been developed and I will not discuss them in detail, except to stress that if the sampling is not random, the estimates from the sample will have little scientific value, because the methods of statistical analysis rest on the assumption that the sample was chosen at random.
Let me give a concrete example of a population and a sample using R. Suppose we have a population of 20 people whose heights (in cm) are known to be: 162, 160, 157, 155, 167, 160, 161, 153, 149, 157, 159, 164, 150, 162, 168, 165, 156, 157, 154 and 157. We therefore know that the mean height of the population is 158.65 cm. I stress: of the population.
For lack of resources we cannot study the whole population and can only draw a sample from it to estimate the height. The sample() function lets us draw samples. The mean height estimated from a sample will naturally differ from the mean height of the population.
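The commands that follow use a vector named height, which the text does not define explicitly. A minimal sketch of the set-up, assuming the 20 values listed above are stored in that vector, could look like this. The seed is an arbitrary choice added only so that a run can be repeated; the samples printed in the text were drawn without a known seed and will not be reproduced.

```r
# The population of 20 heights (cm), copied from the text above
height <- c(162, 160, 157, 155, 167, 160, 161, 153, 149, 157,
            159, 164, 150, 162, 168, 165, 156, 157, 154, 157)

# The population parameter: the mean of all 20 values
pop.mean <- mean(height)
print(pop.mean)

# Draw 5 people at random, without replacement (the default of sample())
set.seed(1)                  # arbitrary seed, for repeatability only
s5 <- sample(height, 5)
print(s5)
print(mean(s5))              # a sample estimate of the population mean
```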

Select 5 people from the population:

> sample5 <- sample(height, 5)
> sample5
[1] 153 157 164 156 149

Estimate the mean height from this sample:

> mean(sample5)
[1] 155.8

Select another 5 people from the population and compute the mean height:

> sample5 <- sample(height, 5)
> sample5
[1] 157 162 167 161 150
> mean(sample5)
[1] 159.4

Note that the estimate from the second sample is 159.4 cm (rather than 155.8 cm). Because the selection is random, the people chosen the second time are not necessarily those chosen the first time, so the estimated means differ.

Now let us draw a sample of 10 people from the population and compute the mean height:

> sample10 <- sample(height, 10)
> sample10
[1] 153 160 150 165 159 160 164 156 162 157
> mean(sample10)
[1] 158.6

We can draw many samples of 10 people each and estimate the mean from each sample with a single, simpler command:
> mean(sample(height, 10))
[1] 156.7
> mean(sample(height, 10))
[1] 157.1
> mean(sample(height, 10))
[1] 159.3
> mean(sample(height, 10))
[1] 159.3
> mean(sample(height, 10))
[1] 158.3

Note that the sample means range from 156.7 to 159.3 cm.

Let us draw samples of 15 people from the population and compute the mean height:

> mean(sample(height, 15))
[1] 158.6667
> mean(sample(height, 15))
[1] 159.4
> mean(sample(height, 15))
[1] 158.0667
> mean(sample(height, 15))
[1] 158.1333
> mean(sample(height, 15))
[1] 156.4667

Note that three of the five sample means now fall between about 158.0 and 158.7 cm, closer to the population mean than with samples of 10, although individual samples (156.5 and 159.4 here) can still stray further.

Increase the sample size to 18 people (close to the number of people in the population):

> mean(sample(height, 18))
[1] 158.2222
> mean(sample(height, 18))
[1] 158.7222
> mean(sample(height, 18))
[1] 158.0556
> mean(sample(height, 18))
[1] 158.4444
> mean(sample(height, 18))
[1] 158.6667
> mean(sample(height, 18))
[1] 159.0556
> mean(sample(height, 18))
[1] 159

The estimated height is now fairly stable, with sample means ranging from about 158.1 to 159.1 cm, though the improvement over samples of 15 is modest.
From the examples above we can draw an important conclusion: estimates from randomly chosen samples differ from the population parameter, but the difference shrinks as the sample size grows. One of the key issues in study design is therefore to estimate a sample size large enough that the estimate computed from the sample is close to (or accurate for) the population parameter. I return to this issue in Chapter 15.
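The handful of draws shown above can be noisy. As a rough sketch of the same idea, the simulation below repeats the sampling 1000 times for each sample size and records how much the sample means vary. The number of repetitions, the seed and the name spread are illustrative choices, not part of the original text.

```r
# The population of 20 heights (cm) from the example above
height <- c(162, 160, 157, 155, 167, 160, 161, 153, 149, 157,
            159, 164, 150, 162, 168, 165, 156, 157, 154, 157)

set.seed(123)                       # arbitrary seed, for repeatability only
sizes <- c(5, 10, 15, 18)

# For each sample size: draw 1000 samples without replacement,
# then take the standard deviation of the 1000 sample means
spread <- sapply(sizes, function(n) {
  means <- replicate(1000, mean(sample(height, n)))
  sd(means)
})
names(spread) <- sizes
print(round(spread, 2))
```

The standard deviation of the sample means falls steadily as the sample size approaches the population size of 20, which is the pattern the text describes.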
In the example above the population mean is 158.65 cm. In statistics we call this a parameter. The means estimated from samples drawn from the population are called sample estimates. To repeat for emphasis: indices that refer to the population are parameters, while numbers estimated from samples are estimates. As we have seen, estimates fluctuate around the parameter, and because in practice we do not know the parameter, the main goal of statistical analysis is to use estimates to make inferences about parameters.
The main goal of descriptive statistical analysis is to find the sample estimates. There are two kinds of measurement: continuous and non-continuous, or discrete. Variables such as age, height and body weight are continuous, while categorical variables such as disease present or absent, like or dislike, white or black are non-continuous. The two kinds of variable are also summarised differently.
The estimate most commonly used to describe a continuous variable is the mean. Suppose the heights of group 1, consisting of 5 people, are 160, 160, 167, 156 and 161, so the mean is 160.8 cm. If the heights of group 2, also 5 people, are 142, 150, 187, 180 and 145, the mean is still 160.8. The mean therefore cannot fully reflect the distribution of a continuous variable: the two groups have the same mean, but the variation within group 2 is much greater than within group 1. We need another estimate, called the variance. The variance of group 1 is 15.7 cm² and that of group 2 is 443.7 cm².
For a non-continuous variable coded 0 and 1 (0 meaning alive and 1 meaning dead), the mean no longer has the sense of an "average", so we use the proportion instead. For example, if 2 of 10 people die, the mortality proportion is 0.2 (or 20%). If 40 of 200 people die, the proportion is still 0.2. So, as with the mean, the proportion alone cannot fully describe a non-continuous variable. We need the variance, together with the proportion, to describe it. In the 2/10 case the variance is 0.016, whereas in the 40/200 case it is 0.0008. In this chapter we become familiar with some R commands for carrying out these simple calculations.
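The figures quoted above can be checked directly in R. The sketch below recomputes them; it assumes that the variances quoted for the two proportions are variances of the estimated proportion, p(1 - p)/n, which is the formula that reproduces 0.016 and 0.0008. The object names are illustrative.

```r
# Two groups with the same mean but very different spread
g1 <- c(160, 160, 167, 156, 161)
g2 <- c(142, 150, 187, 180, 145)
print(c(mean(g1), mean(g2)))     # both 160.8
print(c(var(g1), var(g2)))       # sample variances (divisor n - 1)

# Variance of an estimated proportion, assumed here to be p(1 - p)/n
prop.var <- function(x, n) {
  p <- x / n
  p * (1 - p) / n
}
v10  <- prop.var(2, 10)          # 2 deaths out of 10
v200 <- prop.var(40, 200)        # 40 deaths out of 200
print(c(v10, v200))
```

Both proportions equal 0.2, but the estimate based on 200 people has a variance twenty times smaller.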

9.1 Descriptive statistics (summary)

To illustrate the use of R for descriptive statistics, I use a research data set named igfdata. In this study, besides sex, age, weight and height, we measured hormones related to growth (igfi, igfbp3 and als) and markers of bone turnover (pinp, ictp and p3np). There are 100 study subjects. The data are stored in the directory c:\works\stats. First we need to read the data into R with the following commands (text after the # sign is a comment for the reader):
> options(width=100)
# change directory
> setwd("c:/works/stats")
# read the data into R
> igfdata <- read.table("igf.txt", header=TRUE, na.strings=".")
> attach(igfdata)
# list the columns in the data
> names(igfdata)
 [1] "id"        "sex"       "age"       "weight"    "height"    "ethnicity"
 [7] "igfi"      "igfbp3"    "als"       "pinp"      "ictp"      "p3np"

> igfdata
     id    sex age weight height ethnicity    igfi  igfbp3     als    pinp    ictp    p3np
1     1 Female  15     42    162     Asian 189.000 4.00000 323.667 353.970 11.2867  8.3367
2     2   Male  16     44    160 Caucasian 160.000 3.75000 333.750 375.885 10.4300  6.7450
3     3 Female  15     43    157     Asian 146.833 3.43333 248.333 199.507  8.3633 12.5000
4     4 Female  15     42    155     Asian 185.500 3.40000 251.000 483.607 13.3300 14.2767
5     5 Female  16     47    167     Asian 192.333 4.23333 322.000 105.430  7.9233  4.5033
6     6 Female  25     45    160     Asian 110.000 3.50000 284.667  76.487  4.9833  4.9367
7     7 Female  19     45    161     Asian 157.000 3.20000 274.000  75.880  6.3500  5.3200
8     8 Female  18     43    153     Asian 146.000 3.40000 303.000  86.360  7.3700  4.6700
9     9 Female  15     41    149     Asian 197.667 3.56667 308.500 254.803 11.8700  6.8200
10   10 Female  24     45    157   African 148.000 3.40000 273.000  44.720  3.7400  6.1600
...
97   97 Female  17     54    168 Caucasian 204.667 4.96667 441.333  64.130  5.1600  4.4367
98   98   Male  18     55    169     Asian 178.667 3.86667 273.000 185.913  7.5267  8.8333
99   99 Female  18     48    151     Asian 237.000 3.46667 324.333 105.127  5.9867  5.6600
100 100   Male  15     54    168     Asian 130.000 2.70000 259.333 325.840 10.2767  6.5933

The listing above is only part of the data for the 100 subjects.

For a variable $x_1, x_2, x_3, \ldots, x_n$ we can compute a number of descriptive statistics, as follows:

Theory                                                                        R function
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$                               mean(x)
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$               var(x)
Standard deviation: $s = \sqrt{s^2}$                                          sd(x)
Standard error: $SE = \frac{s}{\sqrt{n}}$                                     (none)
Minimum                                                                       min(x)
Maximum                                                                       max(x)
Range                                                                         range(x)

Example 1: To find the mean age, we simply type:

> mean(age)
[1] 19.17

Or the variance and the standard deviation of age:


> var(age)
[1] 15.33444
> sd(age)
[1] 3.915922

However, R has the summary command, which gives the main descriptive statistics for a variable at once:
> summary(age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.00   16.00   19.00   19.17   21.25   34.00

On the whole this output is simple and the abbreviations are easy to follow. Note the two entries 1st Qu. and 3rd Qu., which stand for the first quartile (the 25% position) and the third quartile (the 75% position) of the variable. First quartile = 16 means that 25% of the study subjects are aged 16 or younger. Likewise, third quartile = 21.25 means that 75% of the subjects are aged 21.25 or younger. The median of 19 means that 50% of the subjects are aged 19 or younger (or 19 or older).

R has no built-in function for the standard error, and summary does not report the standard deviation either. To obtain these figures we can write a simple function of our own (call it desc) as follows:
desc <- function(x)
{
av <- mean(x)
sd <- sd(x)
se <- sd/sqrt(length(x))
c(MEAN=av, SD=sd, SE=se)
}

We can then call this function for any variable we like, for example als:
> desc(als)
      MEAN         SD         SE
301.841120  58.987189   5.898719
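The data were read with na.strings=".", which suggests that some columns may contain missing values, in which case mean and sd return NA. As a hedged variant of desc, the sketch below drops missing values first and also reports how many values were used. The name desc.na and the extra N element are illustrative additions, not part of the original function.

```r
# Variant of desc() that ignores missing values (hypothetical helper)
desc.na <- function(x)
{
  x  <- x[!is.na(x)]            # keep only the observed values
  n  <- length(x)
  s  <- sd(x)
  c(N = n, MEAN = mean(x), SD = s, SE = s / sqrt(n))
}

# Small made-up vector with one missing value
res <- desc.na(c(10, 12, NA, 14))
print(res)
```

The standard error is computed from the number of observed values, not from the original length of the vector.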

To get an overview of the whole igfdata data set we simply use the summary command as follows:
> summary(igfdata)
       id              sex          age            weight          height          ethnicity
 Min.   :  1.00   Female:69   Min.   :13.00   Min.   :41.00   Min.   :149.0   African  : 8
 1st Qu.: 25.75   Male  :31   1st Qu.:16.00   1st Qu.:47.00   1st Qu.:157.0   Asian    :60
 Median : 50.50               Median :19.00   Median :50.00   Median :162.0   Caucasian:30
 Mean   : 50.50               Mean   :19.17   Mean   :49.91   Mean   :163.1   Others   : 2
 3rd Qu.: 75.25               3rd Qu.:21.25   3rd Qu.:53.00   3rd Qu.:168.0
 Max.   :100.00               Max.   :34.00   Max.   :60.00   Max.   :196.0
      igfi            igfbp3           als             pinp             ictp             p3np
 Min.   : 85.71   Min.   :2.000   Min.   :192.7   Min.   : 26.74   Min.   : 2.697   Min.   : 2.343
 1st Qu.:137.17   1st Qu.:3.292   1st Qu.:256.8   1st Qu.: 68.10   1st Qu.: 4.878   1st Qu.: 4.433
 Median :161.50   Median :3.550   Median :292.5   Median :103.26   Median : 6.338   Median : 5.445
 Mean   :165.59   Mean   :3.617   Mean   :301.8   Mean   :167.17   Mean   : 7.420   Mean   : 6.341
 3rd Qu.:186.46   3rd Qu.:3.875   3rd Qu.:331.2   3rd Qu.:196.45   3rd Qu.: 8.423   3rd Qu.: 7.150
 Max.   :427.00   Max.   :5.233   Max.   :471.7   Max.   :742.68   Max.   :21.237   Max.   :16.303

R computes statistics for every variable it can! So it even summarises the id column (the subject identifier), although we know those results have no statistical meaning. For categorical variables such as sex and ethnicity, R reports only the frequency in each group.

The results above are for all study subjects. If we want results separately for men and women, the by function in R is very useful. In the following command we ask R to summarise the igfdata data by sex.
> by(igfdata, sex, summary)
sex: Female
       id           sex          age            weight          height          ethnicity
 Min.   : 1.0   Female:69   Min.   :13.00   Min.   :41.00   Min.   :149.0   African  : 4
 1st Qu.:21.0   Male  : 0   1st Qu.:17.00   1st Qu.:47.00   1st Qu.:156.0   Asian    :43
 Median :47.0               Median :19.00   Median :50.00   Median :162.0   Caucasian:22
 Mean   :48.2               Mean   :19.59   Mean   :49.35   Mean   :161.9   Others   : 0
 3rd Qu.:75.0               3rd Qu.:22.00   3rd Qu.:52.00   3rd Qu.:166.0
 Max.   :99.0               Max.   :34.00   Max.   :60.00   Max.   :196.0
      igfi            igfbp3           als             pinp             ictp             p3np
 Min.   : 85.71   Min.   :2.767   Min.   :204.3   Min.   : 26.74   Min.   : 2.697   Min.   : 2.343
 1st Qu.:136.67   1st Qu.:3.333   1st Qu.:263.8   1st Qu.: 62.75   1st Qu.: 4.717   1st Qu.: 4.337
 Median :163.33   Median :3.567   Median :302.7   Median : 78.50   Median : 5.537   Median : 5.143
 Mean   :167.97   Mean   :3.695   Mean   :311.5   Mean   :108.74   Mean   : 6.183   Mean   : 5.643
 3rd Qu.:186.17   3rd Qu.:3.933   3rd Qu.:361.7   3rd Qu.:115.26   3rd Qu.: 7.320   3rd Qu.: 6.143
 Max.   :427.00   Max.   :5.233   Max.   :471.7   Max.   :502.05   Max.   :13.633   Max.   :14.420
------------------------------------------------------------
sex: Male
       id             sex          age            weight          height          ethnicity
 Min.   :  2.00   Female: 0   Min.   :14.00   Min.   :44.00   Min.   :155.0   African  : 4
 1st Qu.: 34.50   Male  :31   1st Qu.:15.00   1st Qu.:48.50   1st Qu.:161.5   Asian    :17
 Median : 56.00               Median :17.00   Median :51.00   Median :164.0   Caucasian: 8
 Mean   : 55.61               Mean   :18.23   Mean   :51.16   Mean   :165.6   Others   : 2
 3rd Qu.: 75.00               3rd Qu.:20.00   3rd Qu.:53.50   3rd Qu.:169.0
 Max.   :100.00               Max.   :27.00   Max.   :59.00   Max.   :191.0
      igfi            igfbp3           als             pinp             ictp             p3np
 Min.   : 94.67   Min.   :2.000   Min.   :192.7   Min.   : 56.28   Min.   : 3.650   Min.   : 3.390
 1st Qu.:138.67   1st Qu.:3.183   1st Qu.:249.8   1st Qu.:135.07   1st Qu.: 6.900   1st Qu.: 5.375
 Median :160.00   Median :3.500   Median :276.0   Median :245.92   Median : 9.513   Median : 7.140
 Mean   :160.29   Mean   :3.443   Mean   :280.2   Mean   :297.21   Mean   :10.173   Mean   : 7.895
 3rd Qu.:183.00   3rd Qu.:3.775   3rd Qu.:311.3   3rd Qu.:450.38   3rd Qu.:13.517   3rd Qu.:10.010
 Max.   :274.00   Max.   :4.500   Max.   :388.7   Max.   :742.68   Max.   :21.237   Max.   :16.303

To look at the distributions of the hormones and biochemical markers all at once, we can plot all 6 variables. First we divide the screen into 6 windows (2 rows and 3 columns), then draw each histogram in turn:

> op <- par(mfrow=c(2,3))
> hist(igfi)
> hist(igfbp3)
> hist(als)
> hist(pinp)
> hist(ictp)
> hist(p3np)
[Figure: six histograms in a 2 x 3 grid: "Histogram of igfi", "Histogram of igfbp3", "Histogram of als", "Histogram of pinp", "Histogram of ictp" and "Histogram of p3np"; vertical axes show Frequency]

9.2 Testing whether a variable is normally distributed

In statistical analysis most procedures assume that the variable follows a normal distribution. One important step in examining the data is therefore to test the hypothesis of normality. In the plots above, variables such as igfi, pinp, ictp and p3np appear to be concentrated at low values and asymmetric, which is a sign of a non-normal distribution.

For a rigorous test we use the statistical test known as the Shapiro test, called shapiro.test in R. For example, to test the hypothesis that pinp is normally distributed:
> shapiro.test(pinp)
Shapiro-Wilk normality test
data: pinp
W = 0.748, p-value = 8.314e-12

Because the p-value is below 0.05, we can conclude that pinp does not follow a normal distribution.
For the variable weight (body weight), by contrast, the test gives p > 0.05, so there is no evidence against a normal distribution.
> shapiro.test(weight)
Shapiro-Wilk normality test
data: weight
W = 0.9887, p-value = 0.5587

In fact, this result agrees with the histogram of weight:


> hist(weight)

[Figure: "Histogram of weight"; horizontal axis weight (40 to 60), vertical axis Frequency]
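A histogram and shapiro.test are often complemented by a normal quantile plot. The sketch below uses simulated data, not igfdata: a right-skewed log-normal variable (named skewed, with invented parameters) stands in for a marker such as pinp. It shows qqnorm and qqline before and after a log transformation, one common way of handling right-skewed measurements.

```r
set.seed(10)                                  # arbitrary seed
# Simulated right-skewed variable (assumption: log-normal, made-up parameters)
skewed <- rlnorm(100, meanlog = 5, sdlog = 0.8)

# Shapiro-Wilk test on the raw values and on their logarithms
p.raw <- shapiro.test(skewed)$p.value
p.log <- shapiro.test(log(skewed))$p.value
print(c(raw = p.raw, log = p.log))

# Normal quantile plots: points close to the line suggest normality
op <- par(mfrow = c(1, 2))
qqnorm(skewed, main = "Raw values");  qqline(skewed)
qqnorm(log(skewed), main = "Log values");  qqline(log(skewed))
par(op)
```

For the raw values the points bend away from the reference line in the upper tail; after the log transformation they should lie much closer to it.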

9.3 Descriptive statistics by group

If we want the mean of a variable such as igfi separately for men and women, the tapply function in R can be used:
> tapply(igfi, list(sex), mean)
  Female     Male
167.9741 160.2903

In this command igfi is the variable to be summarised, sex is the grouping variable, and the statistic we want is the mean. The output shows that the mean igfi for women (167.97) is higher than for men (160.29).
If we want the means by both sex and ethnicity, we simply add another variable to the list:
> tapply(igfi, list(ethnicity, sex), mean)
            Female     Male
African   145.1252 120.9168
Asian     165.6589 160.4999
Caucasian 176.6536 169.4790
Others          NA 200.5000

In this output NA means "not available": there are no data for women in the Others ethnic group.

9.4 The t test (t.test)

The t test rests on the assumption of a normal distribution. There are two kinds: the one-sample t-test and the two-sample t-test. The one-sample t test answers the question of whether the mean of a sample is truly equal to some given parameter. The two-sample t test answers the question of whether two samples come from the same distribution or, more specifically, whether they truly have the same mean. I illustrate both tests in turn with the igfdata data.
9.4.1 One-sample t test
Example 2. From the analysis above, the mean age of the 100 subjects in this study is 19.17 years. Suppose we know from earlier work that the mean age in this population is 30 years. The question is whether our sample is representative of the population. In other words, we want to know whether the mean of 19.17 truly differs from the mean of 30.

To answer this question we use the t test. In statistical theory the t statistic is defined by the following formula:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

where $\bar{x}$ is the sample mean, $\mu$ is the hypothesised mean (30 in this case), $s$ is the standard deviation, and $n$ is the sample size (100). If the absolute value of t exceeds the theoretical value of the t distribution at a chosen significance level, say 5%, we have grounds to state that the difference is statistically significant. The theoretical value for a sample of 100 can be computed with R's qt function as follows:
> qt(0.95, 100)
[1] 1.660234

There is, however, a quicker way to answer the question, using the t.test function as follows:
> t.test(age, mu=30)
One Sample t-test
data: age
t = -27.6563, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
18.39300 19.94700
sample estimates:
mean of x
19.17

In this command age is the variable to be tested and mu=30 is the hypothesised value. R reports t = -27.66 with 99 degrees of freedom and p < 2.2e-16 (very small indeed). R also reports that the 95% confidence interval for the mean age runs from 18.4 to 19.9 years (30 years lies far outside this interval). In other words, we have grounds to state that the mean age in this sample is truly lower than the mean age of the population.
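To connect the formula with the t.test output, the following sketch recomputes the statistic by hand from the summary figures reported earlier (mean 19.17, standard deviation 3.915922, n = 100). It uses the two-sided critical value with n - 1 = 99 degrees of freedom, which is what t.test itself uses; the object names are illustrative.

```r
xbar <- 19.17        # sample mean of age (from the text)
s    <- 3.915922     # sample standard deviation of age (from the text)
n    <- 100          # sample size
mu0  <- 30           # hypothesised population mean

se     <- s / sqrt(n)                       # standard error of the mean
t.stat <- (xbar - mu0) / se                 # one-sample t statistic
p.val  <- 2 * pt(-abs(t.stat), df = n - 1)  # two-sided p-value

# 95% confidence interval: mean -/+ t(0.975, n - 1) * SE
t.crit <- qt(0.975, df = n - 1)
ci     <- xbar + c(-1, 1) * t.crit * se

print(round(t.stat, 4))
print(round(ci, 3))
```

The hand calculation agrees with the t.test output: t is about -27.66 and the interval runs from about 18.39 to 19.95.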
9.4.2 Two-sample t test
Example 3. From the descriptive analysis above we saw that women have a higher mean igfi than men (167.97 versus 160.29). The question is whether this is a systematic difference or one produced by chance. To answer it we need to consider the mean difference between the two groups and the standard error of that difference:

$$t = \frac{\bar{x}_2 - \bar{x}_1}{SED}$$

where $\bar{x}_1$ and $\bar{x}_2$ are the means of the two groups (men and women), and $SED$ is the standard error of the difference $(\bar{x}_2 - \bar{x}_1)$. $SED$ can be computed as:

$$SED = \sqrt{SE_1^2 + SE_2^2}$$

where $SE_1$ and $SE_2$ are the standard errors of the two groups. According to probability theory, when the two groups have equal variances t follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom, where $n_1$ and $n_2$ are the sample sizes of the two groups. We can use R to answer the question with the t.test function as follows:
> t.test(igfi~ sex)
Welch Two Sample t-test
data: igfi by sex
t = 0.8412, df = 88.329, p-value = 0.4025
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.46855 25.83627
sample estimates:
mean in group Female   mean in group Male
            167.9741             160.2903

R presents the key values first:

t = 0.8412, df = 88.329, p-value = 0.4025

df is the number of degrees of freedom. The p-value of 0.4025 shows that the difference between men and women is not statistically significant (it is above 0.05, or 5%).
95 percent confidence interval:
-10.46855 25.83627

This is the 95% confidence interval for the difference between the two groups. It indicates that igfi in women could be as much as 10.5 ng/L lower than in men or about 25.8 ng/L higher. The interval is wide and includes 0, which is further evidence that there is no statistically significant difference between the two groups.
The test above assumes that the two groups have different variances. If we have reason to believe that the two groups share the same variance, we change just one argument of t.test, var.equal=TRUE, as follows:
> t.test(igfi~ sex, var.equal=TRUE)
Two Sample t-test
data: igfi by sex
t = 0.7071, df = 98, p-value = 0.4812
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.88137 29.24909
sample estimates:
mean in group Female   mean in group Male
            167.9741             160.2903

Numerically, these results differ slightly from those obtained under the assumption of unequal variances, but the p-value leads to the same conclusion: the difference between the two groups is not statistically significant.
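The two outputs above report different degrees of freedom (88.329 and 98). The sketch below, run on simulated data rather than igfdata, recomputes the unequal-variance (Welch) statistic from the SED formula and checks it against t.test. The group sizes mimic the study (69 and 31), but the means and standard deviations are invented for illustration.

```r
set.seed(2024)                              # arbitrary seed
a <- rnorm(69, mean = 168, sd = 60)         # simulated "female" values
b <- rnorm(31, mean = 160, sd = 35)         # simulated "male" values

se1 <- sd(a) / sqrt(length(a))              # standard error, group 1
se2 <- sd(b) / sqrt(length(b))              # standard error, group 2
sed <- sqrt(se1^2 + se2^2)                  # standard error of the difference

t.stat <- (mean(a) - mean(b)) / sed

# Welch-Satterthwaite approximation to the degrees of freedom
df.welch <- sed^4 / (se1^4 / (length(a) - 1) + se2^4 / (length(b) - 1))

fit <- t.test(a, b)                         # Welch test is the default
print(c(by.hand = t.stat, t.test = unname(fit$statistic)))
print(c(by.hand = df.welch, t.test = unname(fit$parameter)))
```

With var.equal=TRUE, t.test instead pools the two variances and uses n1 + n2 - 2 degrees of freedom, which is why the second output in the text shows df = 98.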

9.5 Comparing variances (var.test)

Now let us test whether the variances of the two groups differ. To carry out the analysis we need only the command:
> var.test(igfi ~ sex)
F test to compare two variances
data: igfi by sex
F = 2.6274, num df = 68, denom df = 30, p-value = 0.004529
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.366187 4.691336
sample estimates:
ratio of variances
2.627396

The output shows that the variance in one group is 2.63 times that in the other. The p-value of 0.0045 shows that the variances of the two groups differ significantly. We therefore accept the results of the analysis from t.test(igfi~ sex), which does not assume equal variances.

9.6 Wilcoxon test for two samples (wilcox.test)

The t test assumes that the variable follows a normal distribution. If this assumption does not hold, the result of the t test may not be valid. To check the distribution of igfi we can use the shapiro.test function as follows:
> shapiro.test(igfi)
Shapiro-Wilk normality test
data: igfi
W = 0.8528, p-value = 1.504e-08

The p-value is far below 0.05, so we can say that igfi does not follow a normal distribution. In this case the comparison between the two groups can be based on a non-parametric method called the Wilcoxon test, which (unlike the t test) does not depend on the assumption of normality.
> wilcox.test(igfi ~ sex)
Wilcoxon rank sum test with continuity correction
data: igfi by sex
W = 1125, p-value = 0.6819
alternative hypothesis: true mu is not equal to 0

The p-value of 0.682 shows that the difference in igfi between men and women is indeed not statistically significant. This conclusion agrees with the result of the t test.

9.7 t test for paired variables (paired t-test, t.test)
The t test described above is for studies with two independent groups (such as men and women); it cannot be applied to studies in which one group of subjects is followed over time. I refer to such studies as paired studies. For these we need a t test known as the paired t-test.
Example 4. A group of 10 patients was treated with a blood-pressure-lowering drug. Each patient's blood pressure was measured at the start of the study (before treatment) and again after treatment. The blood pressure data for the 10 patients are as follows:

Before treatment (x0): 180, 140, 160, 160, 220, 185, 145, 160, 160, 170
After treatment (x1):  170, 145, 145, 125, 205, 185, 150, 150, 145, 155

The question is whether these changes in blood pressure allow us to conclude that the drug is effective in lowering blood pressure. To answer it we use the paired t test as follows:
> # enter the data
> before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
> after <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)
> bp <- data.frame(before, after)

> # t test
> t.test(before, after, paired=TRUE)
Paired t-test
data: before and after
t = 2.7924, df = 9, p-value = 0.02097
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.993901 19.006099
sample estimates:
mean of the differences
10.5

The output shows that after treatment blood pressure fell by 10.5 mmHg on average, with a 95% confidence interval from 2.0 mmHg to 19 mmHg and p = 0.0209. We therefore have evidence that the reduction in blood pressure is statistically significant.
Note that if we wrongly analyse these data with the test for two independent groups, as below, the p-value of 0.32 suggests that the reduction is not statistically significant!
> t.test(before, after)
Welch Two Sample t-test
data: before and after
t = 1.0208, df = 17.998, p-value = 0.3209
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.11065 32.11065
sample estimates:
mean of x mean of y
    168.0     157.5
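The reason the two analyses disagree is that the paired test works on the within-patient differences, which vary far less than the blood pressures themselves. The sketch below recomputes the paired statistic by hand from the data in the text and compares it with t.test; the object names d, t.stat and fit are illustrative.

```r
before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
after  <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)

d  <- before - after              # one difference per patient
n  <- length(d)
se <- sd(d) / sqrt(n)             # standard error of the mean difference

t.stat <- mean(d) / se            # paired t = one-sample t on the differences
p.val  <- 2 * pt(-abs(t.stat), df = n - 1)

fit <- t.test(before, after, paired = TRUE)
print(c(mean.diff = mean(d), t = t.stat, p = p.val))

# Spread of the differences versus spread of the raw measurements
print(c(sd.diff = sd(d), sd.before = sd(before), sd.after = sd(after)))
```

The paired t test is therefore equivalent to a one-sample t test of the differences against 0.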

9.8 Wilcoxon test for paired variables (wilcox.test)
Instead of the paired t test, we can also use the wilcox.test function for the same purpose:
> wilcox.test(before, after, paired=TRUE)
Wilcoxon signed rank test with continuity correction
data: before and after
V = 42, p-value = 0.02291
alternative hypothesis: true mu is not equal to 0

This result confirms once more that the reduction in blood pressure is statistically significant, with a p-value (p = 0.023) that hardly differs from that of the paired t test.

9.9 Frequencies
The table function in R gives the frequencies of a categorical variable such as sex or ethnicity.
> table(sex)
sex
Female   Male
    69     31
> table(ethnicity)
ethnicity
  African     Asian Caucasian    Others
        8        60        30         2

A two-way table:

> table(sex, ethnicity)
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2

Note that in the tables above the table function does not give percentages. To compute percentages we need the prop.table function, whose use is illustrated below:
# create an object named freq holding the frequencies
> freq <- table(sex, ethnicity)
# check the result
> freq
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2
# use the margin.table function to see the marginal totals
> margin.table(freq, 1)
sex
Female   Male
    69     31
> margin.table(freq, 2)
ethnicity
  African     Asian Caucasian    Others
        8        60        30         2

# compute row proportions with prop.table
> prop.table(freq, 1)
        ethnicity
sex         African      Asian  Caucasian     Others
  Female 0.05797101 0.62318841 0.31884058 0.00000000
  Male   0.12903226 0.54838710 0.25806452 0.06451613

In the table above, prop.table computes the distribution of ethnicity within each sex. For example, among women (Female), 5.8% are African, 62.3% are Asian and 31.9% are Caucasian; the row adds up to 100%. Similarly, among men the proportion of Africans is 12.9%, of Asians 54.8%, and so on.
# compute column proportions with prop.table
> prop.table(freq, 2)
        ethnicity
sex        African     Asian Caucasian    Others
  Female 0.5000000 0.7166667 0.7333333 0.0000000
  Male   0.5000000 0.2833333 0.2666667 1.0000000

In the table above, prop.table computes the distribution of sex within each ethnic group. For example, among Asians, 71.7% are women and 28.3% are men.
# compute proportions of the grand total
> freq/sum(freq)
        ethnicity
sex      African Asian Caucasian Others
  Female    0.04  0.43      0.22   0.00
  Male      0.04  0.17      0.08   0.02
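The original sex and ethnicity vectors are not reproduced in this section, so the following sketch rebuilds the same table from the printed counts (the object name freq follows the text; building it with matrix and as.table is an assumption made here for illustration) and checks what each margin of prop.table sums to:

```r
# Rebuild the 2 x 4 frequency table from the counts printed above
freq <- as.table(matrix(c(4, 4, 43, 17, 22, 8, 0, 2), nrow = 2,
                        dimnames = list(sex = c("Female", "Male"),
                                        ethnicity = c("African", "Asian",
                                                      "Caucasian", "Others"))))

row_prop   <- prop.table(freq, 1)   # proportions within each sex
col_prop   <- prop.table(freq, 2)   # proportions within each ethnic group
total_prop <- prop.table(freq)      # proportions of the grand total

print(round(100 * row_prop, 1))     # row percentages
print(rowSums(row_prop))            # each row sums to 1
print(colSums(col_prop))            # each column sums to 1
```

Calling prop.table without a margin argument gives the same result as freq/sum(freq).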

9.10 Test of a proportion (prop.test, binom.test)

A test of a single proportion is usually based on the binomial distribution. With a sample of size n and a proportion p, and if n is large (more than about 50, say), the binomial distribution can be approximated by a normal distribution with mean np and variance np(1 - p). Let x be the number of events of interest. To test the hypothesis p = π we can use the following statistic:

$$z = \frac{x - n\pi}{\sqrt{n\pi(1-\pi)}}$$

Here z follows a normal distribution with mean 0 and variance 1. Equivalently, z² follows a Chi-squared distribution with 1 degree of freedom.

Example 5. In the study above there are 69 women and 31 men, so the proportion of women is 0.69 (or 69%). To test whether this proportion really differs from 0.5, we can use the function prop.test(x, n, π) as follows:
> prop.test(69, 100, 0.50)
1-sample proportions test with continuity correction
data: 69 out of 100, null probability 0.5
X-squared = 13.69, df = 1, p-value = 0.0002156
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5885509 0.7766330
sample estimates:
p
0.69

In this output, prop.test estimates the proportion of women as 0.69, with a 95% confidence interval from 0.589 to 0.777. The Chi-squared value is 13.69, with p = 0.0002. The proportion of women in this study is therefore higher than 50%.
A more exact way to test a proportion is the binomial test binom.test(x, n, π):
> binom.test(69, 100, 0.50)
Exact binomial test
data: 69 and 100
number of successes = 69, number of trials = 100, p-value = 0.0001831
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5896854 0.7787112
sample estimates:
probability of success
0.69

Overall, the result of the binomial test hardly differs from that of the Chi-squared test. With p = 0.00018, we again have evidence that the proportion of women in this study is truly higher than 50%.
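The z statistic from the start of this section can be computed by hand and compared with prop.test. The sketch below is illustrative only; the names x, n, pi0, z, p_z and pt are introduced here, and the continuity correction is switched off so that X-squared equals z² exactly:

```r
# Observed data from Example 5: 69 women out of 100 subjects
x   <- 69
n   <- 100
pi0 <- 0.5            # hypothesised proportion

# z statistic from the normal approximation to the binomial
z <- (x - n * pi0) / sqrt(n * pi0 * (1 - pi0))

# Two-sided p-value from the standard normal distribution
p_z <- 2 * pnorm(-abs(z))

# prop.test without continuity correction reports z^2 as X-squared
pt <- prop.test(x, n, p = pi0, correct = FALSE)

print(c(z = z, z_squared = z^2, X_squared = unname(pt$statistic)))
print(c(p_from_z = p_z, p_from_prop.test = pt$p.value))
```

With the default correct=TRUE, as in the output above, prop.test subtracts 0.5 from |x - nπ| before squaring, which is why it reports 13.69 rather than z².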

9.11 Comparing two proportions (prop.test, binom.test)

The method for comparing two proportions follows directly from the theory for a single proportion presented above. Take two samples of sizes n1 and n2, with numbers of events x1 and x2, from which we estimate the two proportions p1 = x1/n1 and p2 = x2/n2. Probability theory tells us that, if the two true proportions are equal, the difference d = p1 - p2 is approximately normally distributed with mean 0 and variance:

$$V_d = \left(\frac{1}{n_1} + \frac{1}{n_2}\right) p\,(1 - p)$$

where:

$$p = \frac{x_1 + x_2}{n_1 + n_2}$$

Hence z = d/√V_d follows a normal distribution with mean 0 and variance 1. In other words, z² follows a Chi-squared distribution with 1 degree of freedom, so we can also use prop.test to compare two proportions.
Example 6. A study was carried out to compare the efficacy of an anti-fracture drug. Patients were divided into two groups: group A, which received treatment, with 100 patients, and group B, which received no treatment, with 110 patients. After 12 months of follow-up, 7 patients in group A and 20 patients in group B had sustained a fracture. The question is whether the fracture rates in the two groups are equal (that is, whether the drug has no effect). To test whether the two proportions really differ, we can use the function prop.test(x, n) as follows:
> fracture <- c(7, 20)
> total <- c(100, 110)
> prop.test(fracture, total)
2-sample test for equality of proportions with continuity correction
data: fracture out of total
X-squared = 4.8901, df = 1, p-value = 0.02701
alternative hypothesis: two.sided
95 percent confidence interval:
-0.20908963 -0.01454673
sample estimates:
   prop 1    prop 2
0.0700000 0.1818182

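The analysis shows that the fracture rate is 0.07 in group 1 (A) and 0.18 in group 2 (B). The 95% confidence interval for the difference between the two groups runs from about 0.015 to 0.21 in absolute value (that is, roughly 1.5 to 21 percentage points lower in group A). With p = 0.027, we can say that the fracture rate in group A is indeed lower than in group B.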
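The pooled-variance formula of this section can be checked by hand. This is a minimal sketch with illustrative names (p_pool, Vd, z, pt are not from the original); the continuity correction is disabled so that z² matches X-squared:

```r
# Example 6: fractures / patients in group A (treated) and group B (untreated)
x1 <- 7;  n1 <- 100
x2 <- 20; n2 <- 110

p1 <- x1 / n1
p2 <- x2 / n2
d  <- p1 - p2                               # observed difference

# Pooled proportion and variance of d under the null hypothesis
p_pool <- (x1 + x2) / (n1 + n2)
Vd     <- (1 / n1 + 1 / n2) * p_pool * (1 - p_pool)

# Standardised difference and its two-sided p-value
z   <- d / sqrt(Vd)
p_z <- 2 * pnorm(-abs(z))

# Same test through prop.test, without continuity correction
pt <- prop.test(c(x1, x2), c(n1, n2), correct = FALSE)

print(c(z_squared = z^2, X_squared = unname(pt$statistic)))
print(c(p_from_z = p_z, p_from_prop.test = pt$p.value))
```

Note that z divides d by the square root of V_d (the standard error), not by V_d itself.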

9.12 Comparing several proportions (prop.test, chisq.test)

The function prop.test can also be used to test several proportions at once. In the study above we have 4 ethnic groups, with the following frequencies by sex:

> table(sex, ethnicity)
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2

We want to know whether the proportion of women differs among the 4 ethnic groups. To answer this question we again use prop.test:
> female <- c(4, 43, 22, 0)
> total <- c(8, 60, 30, 2)
> prop.test(female, total)
4-sample test for equality of proportions without continuity correction
data: female out of total
X-squared = 6.2646, df = 3, p-value = 0.09942
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4
0.5000000 0.7166667 0.7333333 0.0000000
Warning message:
Chi-squared approximation may be incorrect in: prop.test(female, total)

Although the proportions of women appear to differ considerably between groups (73% in group 3 (Caucasian), compared with 50% in group 1 (African) and 71.7% in the Asian group), the Chi-squared test indicates that, statistically, these proportions do not differ, since p = 0.099.
9.12.1 Chi-squared test (chisq.test)

In fact, the Chi-squared test can also be computed with the function chisq.test:
> chisq.test(sex, ethnicity)
Pearson's Chi-squared test
data: sex and ethnicity
X-squared = 6.2646, df = 3, p-value = 0.09942
Warning message:
Chi-squared approximation may be incorrect in: chisq.test(sex, ethnicity)

This result is identical to the one from prop.test.

9.12.2 Fisher's exact test (fisher.test)

In the Chi-squared test above, note the warning:

Warning message:
Chi-squared approximation may be incorrect in: prop.test(female, total)

This is because group 4 contains no women, so the proportion is 0%, and moreover the group has only 2 subjects. Because the number of subjects is so small, the statistical estimates may be unreliable. An alternative method for studies with low frequencies such as this one is Fisher's exact test. Readers may consult the theory behind Fisher's test to understand the logic of the method; here we are only concerned with how to compute it in R. The command is simply:
> fisher.test(sex, ethnicity)
Fisher's Exact Test for Count Data
data: sex and ethnicity
p-value = 0.1048
alternative hypothesis: two.sided

Note that the p-value from Fisher's test is 0.1048, very close to the p-value of the Chi-squared test. We therefore have further evidence that the proportion of women does not differ appreciably among the ethnic groups.
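The warning can be understood by looking at the expected counts under independence. The sketch below rebuilds the table from the printed counts (an assumption made here because the raw sex and ethnicity vectors are not shown in this section; the names freq, chi and fis are illustrative):

```r
# 2 x 4 table of counts from the text
freq <- matrix(c(4, 4, 43, 17, 22, 8, 0, 2), nrow = 2,
               dimnames = list(sex = c("Female", "Male"),
                               ethnicity = c("African", "Asian",
                                             "Caucasian", "Others")))

# Chi-squared test; the warning is expected here, so it is suppressed
chi <- suppressWarnings(chisq.test(freq))

# Expected counts under independence: several cells fall below 5,
# which is what triggers the "approximation may be incorrect" warning
print(round(chi$expected, 2))

# Fisher's exact test does not rely on the large-sample approximation
fis <- fisher.test(freq)

print(c(chisq_p = chi$p.value, fisher_p = fis$p.value))
```

A common rule of thumb is to prefer an exact test when expected counts are this small, which is the situation in the "Others" column.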

CHAPTER X

LINEAR REGRESSION ANALYSIS

10
Linear regression analysis
Linear regression analysis is probably one of the most widely used methods of data analysis in statistics. An anonymous author once wrote: "Give a person three weapons (the correlation coefficient, linear regression and a pen) and they will use all three!" In this chapter I introduce the use of R for linear regression analysis and related methods such as the correlation coefficient and hypothesis testing.
Example 1. To illustrate, consider the following study, in which the investigator measured blood cholesterol in 18 male subjects. Body mass index (BMI) was also computed for each subject as weight (in kg) divided by height squared (in m²). The measurements are as follows:
Table 1. Age, body mass index and cholesterol

ID (id)   Age (age)   BMI (bmi)   Cholesterol (chol)
   1         46         25.4            3.5
   2         20         20.6            1.9
   3         52         26.2            4.0
   4         30         22.6            2.6
   5         57         25.4            4.5
   6         25         23.1            3.0
   7         28         22.7            2.9
   8         36         24.9            3.8
   9         22         19.8            2.1
  10         43         25.3            3.8
  11         57         23.2            4.1
  12         33         21.8            3.0
  13         22         20.9            2.5
  14         63         26.7            4.6
  15         40         26.4            3.2
  16         48         21.2            4.2
  17         28         21.2            2.3
  18         49         22.8            4.0

A quick look at the data suggests that older subjects tend to have higher cholesterol. We enter the data into R and draw a scatter plot as follows:
> age <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
> bmi <- c(25.4,20.6,26.2,22.6,25.4,23.1,22.7,24.9,19.8,25.3,23.2,
21.8,20.9,26.7,26.4,21.2,21.2,22.8)
> chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
2.5,4.6,3.2,4.2,2.3,4.0)
> data <- data.frame(age, bmi, chol)
> plot(chol ~ age, pch=16)

[Figure: scatter plot of chol (vertical axis) against age (horizontal axis)]

Figure 10.1. Relationship between age and cholesterol.


Figure 10.1 suggests that the relationship between age and cholesterol is a straight line (linear). To measure this relationship we can use the coefficient of correlation.

10.1 Correlation coefficients

The correlation coefficient (r) is a statistic that measures the association between two variables, such as age (x) and cholesterol (y). It takes values from -1 to 1. A coefficient of 0 (or close to 0) means that the two variables have no linear association; conversely, a coefficient of -1 or 1 means that the two variables have a perfect linear relationship. A negative coefficient (r < 0) means that as x increases y decreases (and vice versa, as x decreases y increases); a positive coefficient (r > 0) means that as x increases y also increases, and as x decreases y also decreases.

There are in fact many correlation coefficients in statistics, but here I present the 3 most commonly used: Pearson's r, Spearman's ρ and Kendall's τ.
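10.1.1 Pearson's correlation coefficient
Given two variables x and y measured on n subjects, Pearson's correlation coefficient is estimated by the following formula:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$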

Here x̄ and ȳ are the means of x and y. To estimate the correlation between age and cholesterol, we can use the function cor(x, y):
> cor(age, chol)
[1] 0.936726

We can test the hypothesis that the correlation coefficient equals 0 (that is, that x and y are not associated). R provides the function cor.test for this purpose; for Pearson's coefficient it reports a t statistic with n - 2 degrees of freedom, and the confidence interval is based on Fisher's transformation.
> cor.test(age, chol)
Pearson's product-moment correlation
data: age and chol
t = 10.7035, df = 16, p-value = 1.058e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8350463 0.9765306
sample estimates:
cor
0.936726

The analysis gives t = 10.70 with p = 1.058e-08; we therefore have evidence that the association between age and cholesterol is statistically significant. This is the same conclusion that the linear regression analysis in section 10.2 reaches.
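As an illustration, Pearson's formula can be evaluated term by term and compared with cor. This is only a sketch; the names sxy, sxx, syy, r_manual and t_manual are introduced here for illustration:

```r
# Data from Table 1
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

# Sums of cross-products and squares about the means
sxy <- sum((age - mean(age)) * (chol - mean(chol)))
sxx <- sum((age - mean(age))^2)
syy <- sum((chol - mean(chol))^2)

# Pearson's r from the formula
r_manual <- sxy / sqrt(sxx * syy)

# t statistic for the hypothesis of zero correlation, with n - 2 df
n        <- length(age)
t_manual <- r_manual * sqrt(n - 2) / sqrt(1 - r_manual^2)

print(c(r_manual = r_manual, r_cor = cor(age, chol)))
print(c(t_manual = t_manual,
        t_cor.test = unname(cor.test(age, chol)$statistic)))
```

The t statistic obtained this way is the value reported by cor.test above.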
10.1.2 Spearman's correlation coefficient
Inference on Pearson's correlation coefficient is strictly appropriate only when x and y are normally distributed. If they are not, we should use another coefficient, Spearman's, which is a non-parametric method. It is estimated by converting x and y into ranks and computing the correlation between the two series of ranks; hence its English name, Spearman's rank correlation. R estimates Spearman's coefficient with the function cor.test and the argument method="spearman":
> cor.test(age, chol, method="spearman")
Spearman's rank correlation rho
data: age and chol
S = 51.1584, p-value = 2.57e-09
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.947205
Warning message:
Cannot compute exact p-values with ties in: cor.test.default(age,
chol, method = "spearman")

The analysis gives rho = 0.947 with p = 2.57e-09. This result is consistent with the linear regression analysis: the association between age and cholesterol is very strong and statistically significant.
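The description "convert to ranks, then correlate" can be checked directly. In this sketch (the names rho_cor and rho_ranks are illustrative), ties receive average ranks, which is the default behaviour of rank:

```r
# Data from Table 1
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

# Spearman's rho as returned by cor
rho_cor <- cor(age, chol, method = "spearman")

# The same quantity as Pearson's r computed on the ranks
rho_ranks <- cor(rank(age), rank(chol))

print(c(rho_cor = rho_cor, rho_ranks = rho_ranks))

# The ties (repeated ages and cholesterol values) are the reason
# cor.test warns that an exact p-value cannot be computed
print(any(duplicated(age)) || any(duplicated(chol)))
```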
10.1.3 Kendall's correlation coefficient
Kendall's correlation coefficient (also a non-parametric method) is estimated by looking for pairs of observations that are "concordant". A pair is concordant when the difference along the horizontal axis has the same sign (positive or negative) as the difference along the vertical axis. If x and y are not associated, the number of concordant pairs equals, or is close to, the number of discordant pairs.
Because many pairs have to be examined, computing Kendall's coefficient is fairly demanding in computer time. However, for a data set with fewer than 5000 subjects a personal computer can do it quite easily. R uses the function cor.test with the argument method="kendall" to estimate Kendall's coefficient:
> cor.test(age, chol, method="kendall")
Kendall's rank correlation tau
data: age and chol
z = 4.755, p-value = 1.984e-06
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.8333333
Warning message:
Cannot compute exact p-value with ties in: cor.test.default(age, chol, method = "kendall")

Kendall's correlation coefficient confirms once more that the association between age and cholesterol is statistically significant, with tau = 0.833 and p = 1.98e-06.
The correlation coefficients above measure the degree of association between two variables, but they do not give us an equation linking the two. The task, then, is to find a linear equation that describes this relationship, and for that we use the linear regression model.

10.2 The simple linear regression model

10.2.1 A little theory
To make the model easier to follow, let the age of individual i be xi and the cholesterol level be yi, where i = 1, 2, 3, …, 18. The linear regression model states that:

$$y_i = \alpha + \beta x_i + \varepsilon_i$$ [1]

In other words, the equation assumes that an individual's cholesterol equals a constant α, plus a coefficient β multiplied by age, plus an error εi. In this equation α is the intercept (the value when xi = 0) and β is the slope (or gradient). α and β are two parameters (also called regression coefficients), and εi is a random variable that follows a normal distribution with mean 0 and variance σ².
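The parameters α, β and σ² have to be estimated from the data. The method used to estimate them is the least squares method. As its name suggests, the least squares method finds the values of α and β for which

$$\sum_{i=1}^{n} \left[ y_i - (\alpha + \beta x_i) \right]^2$$

is smallest. After some algebra, it is easy to show that the estimates of α and β that satisfy this condition are:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$ [2]

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$ [3]

Here x̄ and ȳ are the means of x and y. Note that I write α̂ and β̂ (with a hat) as a reminder that these are estimates of α and β, not α and β themselves (we never know α and β exactly; we can only estimate them).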

Once we have the estimates α̂ and β̂, we can estimate the mean cholesterol for each age as follows:

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$

Of course, ŷi is only the mean for age xi, and what remains (that is, yi - ŷi) is called the residual. The variance of the residuals can be estimated as:

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}$$ [4]

s² is the estimate of σ².
In linear regression analysis we usually want to know whether the coefficient β equals 0 or not. If β = 0, there is no relationship between x and y; if β differs from 0, we have evidence that x and y are related. To test the hypothesis β = 0 we use the following t-test:

$$t = \frac{\hat{\beta}}{SE(\hat{\beta})}$$ [5]

SE(β̂) denotes the standard error of the estimate β̂. In this equation, t follows a t distribution with n - 2 degrees of freedom (if indeed β = 0).
10.2.2 Simple linear regression analysis in R
The function lm (short for linear model) in R computes the values of α̂ and β̂, as well as s², quickly and conveniently. We continue the example in R as follows:

> lm(chol ~ age)

Call:
lm(formula = chol ~ age)
Coefficients:
(Intercept)          age
    1.08922      0.05779

In the command above, "chol ~ age" means: describe chol as a function of age. The output of lm shows that α̂ = 1.0892 and β̂ = 0.05779. In other words, with these two parameters we can estimate cholesterol for any age within the age range of the sample using the linear equation:

ŷi = 1.08922 + 0.05779 × age

This equation means that for each additional year of age, cholesterol increases by about 0.058 mmol/L.
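The least squares formulas [2] and [3] can be applied directly to the data and compared with lm. This is an illustrative sketch; beta_hat and alpha_hat are names chosen here:

```r
# Data from Table 1
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

# Equation [2]: slope = sum of cross-products / sum of squares of x
beta_hat <- sum((age - mean(age)) * (chol - mean(chol))) /
            sum((age - mean(age))^2)

# Equation [3]: intercept = mean(y) - slope * mean(x)
alpha_hat <- mean(chol) - beta_hat * mean(age)

# Compare with the coefficients estimated by lm
print(c(alpha_hat = alpha_hat, beta_hat = beta_hat))
print(coef(lm(chol ~ age)))

# Predicted cholesterol for a 46-year-old (subject 1)
print(alpha_hat + beta_hat * 46)
```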
In fact, lm provides much more information, but we have to store it in an object. Calling this object reg, the commands are:
> reg <- lm(chol ~ age)
> summary(reg)
Call:
lm(formula = chol ~ age)
Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08

The second command, summary(reg), asks R to list the information computed in reg. The output has 3 parts:
(a) Part 1 describes the residuals of the regression model:
Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

We know that the mean of the residuals must be 0, and here the median is -0.04, not far from 0. The 25% (1Q) and 75% (3Q) quantiles are also fairly symmetric around the median, indicating that the residuals of this equation are reasonably balanced.

(b) Part 2 presents the estimates α̂ and β̂ together with their standard errors and t values. The t value for β̂ is 10.70 with p = 1.06e-08, showing that β is not 0. In other words, we have evidence of a relationship between cholesterol and age, and this relationship is statistically significant.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(c) Part 3 gives information about the residual variance (residual mean square). R reports the residual standard error, s = 0.3027, so s² = (0.3027)² = 0.0916. This part also contains an F test, which is simply a test of whether β is truly 0 and so has the same meaning as the t-test in the previous part. In general, in a simple linear regression (with a single predictor) we need not pay attention to the F test.
Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08

Part 3 also gives another important quantity, R², the coefficient of determination. It is estimated by the formula:

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$ [6]

That is, the sum of squared deviations of the fitted values from the mean divided by the sum of squared deviations of the observed values from the mean. R² in this example is 0.8775, meaning that the linear equation (with age as the single predictor) explains about 88% of the variation in cholesterol between individuals. R² ranges from 0 to 100% (or 1). The higher the R², the tighter the relationship between age and cholesterol.
Another coefficient worth mentioning is the adjusted coefficient of determination (called Adjusted R-squared in the R output above). It tells us the relative reduction in residual variance achieved by including age in the linear model. In general it differs little from the coefficient of determination, and we need not dwell on it.
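Equations [4] and [6] can be reproduced from the fitted values. The sketch below uses illustrative names (s2, R2, adjR2); the adjusted R² is written as one minus the ratio of the residual variance to the sample variance of y, which is algebraically the same as the usual formula:

```r
# Data from Table 1 and the fitted model
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)
reg   <- lm(chol ~ age)
y_hat <- fitted(reg)
n     <- length(chol)

# Equation [4]: residual variance with n - 2 degrees of freedom
s2 <- sum((chol - y_hat)^2) / (n - 2)

# Equation [6]: explained sum of squares over total sum of squares
R2 <- sum((y_hat - mean(chol))^2) / sum((chol - mean(chol))^2)

# Adjusted R-squared: relative reduction in variance, 1 - s2 / var(y)
adjR2 <- 1 - s2 / var(chol)

sm <- summary(reg)
print(c(s = sqrt(s2), sigma_from_summary = sm$sigma))
print(c(R2 = R2, R2_from_summary = sm$r.squared))
print(c(adjR2 = adjR2, adjR2_from_summary = sm$adj.r.squared))
```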
10.2.3 Assumptions of linear regression analysis
All of the analyses above rest on several important assumptions:

(a) x is a fixed variable ("fixed" here means that there is no random error in its measurement);
(b) εi follows a normal distribution;
(c) εi has mean 0;
(d) εi has constant variance σ² for all xi; and
(e) successive values of εi are not correlated with one another (in other words, ε1 and ε2 are unrelated).
If these assumptions are not met, the validity of the estimated equation is in question. Therefore, before presenting and interpreting the model, we need to check whether the assumptions hold. In this case assumption (a) is not a problem, because age is not a random variable and there is no error in determining a person's age.
For assumptions (b) to (e), the simplest yet most effective check is to examine the relationships between ŷi, xi and the residuals ei (ei = yi - ŷi) using scatter plots.
With the command fitted() we can compute ŷi for each individual (for example, for individual 1, aged 46, the predicted cholesterol is 1.08922 + 0.05779 × 46 = 3.747).
> fitted(reg)
       1        2        3        4        5        6        7        8
3.747483 2.244985 4.094214 2.822869 4.383156 2.533927 2.707292 3.169600
       9       10       11       12       13       14       15       16
2.360562 3.574118 4.383156 2.996234 2.360562 4.729886 3.400753 3.863060
      17       18
2.707292 3.920849

With the command resid() we can compute the residual ei for each individual (for subject 1, e1 = 3.5 - 3.74748 = -0.24748):
> resid(reg)
           1            2            3            4            5            6
-0.247483426 -0.344985415 -0.094213736 -0.222869265  0.116844338  0.466072660
           7            8            9           10           11           12
 0.192707505  0.630400424 -0.260562185  0.225881729 -0.283155662  0.003765579
          13           14           15           16           17           18
 0.139437815 -0.129885972 -0.200753116  0.336939804 -0.407292495  0.079151419

To check the assumptions above, we can draw a set of 4 plots, which I explain below:

> op <- par(mfrow=c(2,2))   # ask R to split the window into 4 panels
> plot(reg)                 # draw the diagnostic plots for reg

[Figure: four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

Figure 10.2. Residual analysis to check the assumptions of linear regression analysis.
(a) The left-hand plot in row 1 shows the residuals ei against the predicted cholesterol values ŷi. The residuals are scattered around the line y = 0, so assumption (c), that εi has mean 0, is acceptable.
(b) The right-hand plot in row 1 shows the residuals against their expected values under a normal distribution. The residuals lie very close to the reference line, so assumption (b), that εi is normally distributed, is also acceptable.
(c) The left-hand plot in row 2 shows the square root of the standardized residuals against ŷi. The plot shows no systematic difference in the standardized residuals across values of ŷi, so assumption (d), that εi has constant variance σ² for all xi, is also acceptable.

Overall, the residual analysis allows us to conclude that the linear regression model describes the relationship between age and cholesterol adequately and reasonably.
10.2.4 The prediction model
Once the model for predicting cholesterol has been checked and its validity established, we can draw the fitted line relating age and cholesterol with the command abline (recall that the fitted object is reg):

> plot(chol ~ age, pch=16)
> abline(reg)

[Figure: scatter plot of chol against age with the fitted regression line]

Figure 10.3. Fitted line for the relationship between age and cholesterol.

However, each value ŷi is computed from the estimates α̂ and β̂, and these estimates have standard errors, so the predicted value ŷi also has error. In other words, ŷi is only a mean; in practice it may be higher or lower depending on the sample. Its 95% confidence interval can be estimated in R with the following commands:
> reg <- lm(chol ~ age)
> new <- data.frame(age = seq(15, 70, 5))
> pred.w.plim <- predict(reg, new, interval="prediction")
> pred.w.clim <- predict(reg, new, interval="confidence")
> resc <- cbind(pred.w.clim, new)
> resp <- cbind(pred.w.plim, new)
> plot(chol ~ age, pch=16)
> lines(resc$fit ~ resc$age)
> lines(resc$lwr ~ resc$age, col=2)
> lines(resc$upr ~ resc$age, col=2)
> lines(resp$lwr ~ resp$age, col=4)
> lines(resp$upr ~ resp$age, col=4)

[Figure: scatter plot of chol against age with the fitted line, confidence bands and prediction bands]

Figure 10.4. Predicted values and 95% confidence intervals.


The figure shows the mean predicted value ŷi (black line) and the 95% confidence interval of this mean (red lines). The blue lines give the 95% prediction interval for the cholesterol of a new individual of a given age in the population.
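The two kinds of interval can be compared at a single age. The sketch below uses age 50 purely as an illustrative choice (new50, ci and pi are names introduced here); the prediction interval is wider because it adds the variability of an individual around the mean to the uncertainty in the mean itself:

```r
# Data from Table 1 and the fitted model
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)
reg <- lm(chol ~ age)

# Hypothetical new subject aged 50
new50 <- data.frame(age = 50)

# 95% confidence interval for the MEAN cholesterol at age 50
ci <- predict(reg, new50, interval = "confidence")

# 95% prediction interval for ONE new individual aged 50
pi <- predict(reg, new50, interval = "prediction")

print(rbind(confidence = ci[1, ], prediction = pi[1, ]))

# Widths of the two intervals
print(c(ci_width = ci[1, "upr"] - ci[1, "lwr"],
        pi_width = pi[1, "upr"] - pi[1, "lwr"]))
```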

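10.3 Multiple linear regression model

The model expressed by equation [1], yi = α + βxi + εi, has a single predictor (x) and is therefore usually called the simple linear regression model. In practice we can extend this model to several variables rather than just one, for example:

$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$$ [7]

More explicitly:

$$\begin{aligned}
y_1 &= \alpha + \beta_1 x_{11} + \beta_2 x_{21} + \dots + \beta_k x_{k1} + \varepsilon_1 \\
y_2 &= \alpha + \beta_1 x_{12} + \beta_2 x_{22} + \dots + \beta_k x_{k2} + \varepsilon_2 \\
y_3 &= \alpha + \beta_1 x_{13} + \beta_2 x_{23} + \dots + \beta_k x_{k3} + \varepsilon_3 \\
&\;\;\vdots \\
y_n &= \alpha + \beta_1 x_{1n} + \beta_2 x_{2n} + \dots + \beta_k x_{kn} + \varepsilon_n
\end{aligned}$$

Note that in these equations we have several x variables (x1, x2, up to xk), and each has a parameter βj (j = 1, 2, …, k) to be estimated. This is why the model is called the multiple linear regression model.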
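The estimation of βj again relies on the least squares method. Let ŷi = α̂ + β̂1x1i + β̂2x2i + ... + β̂kxki be the estimate of yi; the least squares method finds the values α̂, β̂1, β̂2, ..., β̂k for which

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

is smallest. For the multiple linear regression model, the most compact way to write and describe the model is matrix notation. Model [7] can be written in matrix notation as:

$$Y = X\beta + \varepsilon$$

where Y is an n × 1 vector, X is an n × (k+1) matrix (a column of ones for the intercept followed by the k predictors), β is a (k+1) × 1 vector containing α and β1, …, βk, and ε is an n × 1 vector:

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},\quad
X = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{pmatrix},\quad
\beta = \begin{pmatrix} \alpha \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix},\quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

The least squares method solves for the vector β with the following equation:

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

and the residual sum of squares is:

$$\hat{\varepsilon}^T \hat{\varepsilon} = (Y - X\hat{\beta})^T (Y - X\hat{\beta})$$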

Example 2. We return to the study of the relationship between age, bmi and cholesterol. In Example 1 we considered only the relationship between age and cholesterol, and did not examine the relationship of both age and bmi with cholesterol. The following figure shows the relationships among these three variables:
> pairs(data)

[Figure: scatter plot matrix of age, bmi and chol]

Figure 10.5. Pairwise relationships among age, bmi and cholesterol.

As with age and cholesterol, the relationship between bmi and cholesterol also follows approximately a straight line. The figure also shows that age and bmi are related to each other. Indeed, a simple linear regression of cholesterol on bmi shows that this relationship is statistically significant:
> summary(lm(chol ~ bmi))

Call:
lm(formula = chol ~ bmi)
Residuals:
    Min      1Q  Median      3Q     Max
-0.9403 -0.3565 -0.1376  0.3040  1.4330

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.83187    1.60841  -1.761  0.09739 .
bmi          0.26410    0.06861   3.849  0.00142 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.623 on 16 degrees of freedom
Multiple R-Squared: 0.4808, Adjusted R-squared: 0.4483
F-statistic: 14.82 on 1 and 16 DF, p-value: 0.001418

BMI explains about 48% of the variation in cholesterol between individuals. But because BMI is also related to age, we want to know which factor is more important when the two are analyzed together. To assess the effect of both age (x1) and bmi (call it x2) on cholesterol (y) we use a multiple linear regression model:

$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$$

which can also be written in the matrix notation Y = Xβ + ε presented above. Here Y is an 18 × 1 vector, X is an 18 × 3 matrix (a column of ones, age and bmi), β is a 3 × 1 vector, and ε is an 18 × 1 vector. To estimate the two regression coefficients β1 and β2, we again use the function lm() in R:
> mreg <- lm(chol ~ age + bmi)
> summary(mreg)
Call:
lm(formula = chol ~ age + bmi)
Residuals:
    Min      1Q  Median      3Q     Max
-0.3762 -0.2259 -0.0534  0.1698  0.5679

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.455458   0.918230   0.496    0.627
age         0.054052   0.007591   7.120 3.50e-06 ***
bmi         0.033364   0.046866   0.712    0.487
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3074 on 15 degrees of freedom
Multiple R-Squared: 0.8815, Adjusted R-squared: 0.8657
F-statistic: 55.77 on 2 and 15 DF, p-value: 1.132e-07

The analysis gives the estimates α̂ = 0.455, β̂1 = 0.054 and β̂2 = 0.0334. In other words, the equation for predicting cholesterol from age and bmi is:

Cholesterol = 0.455 + 0.054(age) + 0.0334(bmi)

The equation says that for each additional year of age cholesterol increases by 0.054 mmol/L (an estimate not very different from the 0.0578 in the equation with age alone), and for each 1 kg/m² increase in BMI cholesterol increases by 0.0334 mmol/L. Together the two factors explain about 88.2% (R² = 0.8815) of the variation in cholesterol between individuals.
Recall that the equation with age alone (in the previous analysis) explained about 87.7% of the variation in cholesterol. When BMI is added, this rises to 88.2%, an increase of only 0.5 percentage points. The question is whether this 0.5% increase is statistically significant. The answer can be read from the test for bmi, with p = 0.487. Thus bmi provides no information for predicting cholesterol beyond what we already have from age. In other words, once age has been taken into account, the effect of bmi is no longer statistically significant. This is understandable, because Figure 10.5 shows that age and bmi are quite strongly related. Since the two variables are correlated, we do not need both in the equation. (This example is, however, only meant to illustrate how to carry out a multiple linear regression in R; it is not intended to model the data according to biological reasoning.)
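The matrix formula for the least squares estimates can be applied to the same data and compared with lm. This is a sketch for illustration (X, beta_hat and rss are names chosen here); solve(A, b) is used instead of forming the inverse explicitly, which is numerically preferable but gives the same estimates:

```r
# Data from Table 1
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
bmi  <- c(25.4,20.6,26.2,22.6,25.4,23.1,22.7,24.9,19.8,25.3,23.2,
          21.8,20.9,26.7,26.4,21.2,21.2,22.8)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

# Design matrix: a column of ones (intercept), then age and bmi (18 x 3)
X <- cbind(Intercept = 1, age = age, bmi = bmi)
Y <- chol

# Least squares solution of (X'X) beta = X'Y
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

# Residual sum of squares and residual standard error (n - 3 df)
res <- Y - X %*% beta_hat
rss <- sum(res^2)
s   <- sqrt(rss / (nrow(X) - ncol(X)))

print(round(t(beta_hat), 6))
print(coef(lm(chol ~ age + bmi)))
print(s)
```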

[Figure: four diagnostic panels for the model with age and bmi: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

Figure 10.6. Residual analysis to check the assumptions of multiple linear regression analysis.

Although BMI is not statistically significant in this case, Figure 10.6 shows that the assumptions of the linear regression model are reasonably met.

10.4 Polynomial regression analysis

A natural extension of regression with several independent variables is polynomial regression. The multiple regression model describes a dependent variable as a linear function of several independent variables, whereas the polynomial regression model describes a dependent variable as a non-linear (polynomial) function of a single independent variable.
In mathematical terms, the polynomial regression model relates the dependent variable y to the independent variable x through functions of the form:

$$y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_p x_i^p + \varepsilon_i$$

Here the parameters βj (j = 1, 2, 3, …, p) are coefficients measuring the relationship between y and x, and εi is the error term of the model, assumed to follow a normal distribution with mean 0 and variance σ². Given a series of pairs (y1, x1), (y2, x2), (y3, x3), …, (yn, xn), we can apply the least squares method to estimate βj and σ².
It is easy to see that the polynomial regression model is a direct extension of the simple linear regression model. If β2 = 0, β3 = 0, …, and βp = 0, the model reduces to the single-variable linear regression model we met at the beginning of this chapter. If yi = α + β1xi + β2xi² + εi, the model is simply a quadratic equation, and so on.
Example 3. The following experiment investigates the relationship between hardwood concentration and the tensile strength of a material. Nineteen materials with different hardwood concentrations were tested for tensile strength, and the results are summarized in the table below:

Id   Hardwood concentration (x)   Tensile strength (y)
 1              1.0                      6.3
 2              1.5                     11.1
 3              2.0                     20.0
 4              3.0                     24.0
 5              4.0                     26.1
 6              4.5                     30.0
 7              5.0                     33.8
 8              5.5                     34.0
 9              6.0                     38.1
10              6.5                     39.9
11              7.0                     42.0
12              8.0                     46.1
13              9.0                     53.1
14             10.0                     52.0
15             11.0                     52.5
16             12.0                     48.0
17             13.0                     42.8
18             14.0                     27.8
19             15.0                     21.9

Before analyzing these data, we enter them into R with the usual commands:

> id <- 1:19
> conc <- c(1.0, 1.5, 2.0, 3.0, 4.0, 4.5, 5.0, 5.5, 6.0,
6.5, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0)
> strength <- c(6.3, 11.1, 20.0, 24.0, 26.1, 30.0, 33.8, 34.0, 38.1,
39.9, 42.0, 46.1, 53.1, 52.0, 52.5, 48.0, 42.8, 27.8, 21.9)
> data <- data.frame(id, conc, strength)

We first try a simple linear regression model:

> simple.model <- lm(strength ~ conc)
> summary(simple.model)
Call:
lm(formula = strength ~ conc)
Residuals:
    Min      1Q  Median      3Q     Max
-25.986  -3.749   2.938   7.675  15.840

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.3213     5.4302   3.926  0.00109 **
conc          1.7710     0.6478   2.734  0.01414 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.82 on 17 degrees of freedom
Multiple R-Squared: 0.3054, Adjusted R-squared: 0.2645
F-statistic: 7.474 on 1 and 17 DF, p-value: 0.01414

The output shows that this simple linear regression model (strength = 21.32 + 1.77*conc) explains about 31% of the variance of strength. The estimated residual variance of this model is s² = (11.82)² = 139.7.
Now let us look at the scatter plot and the fitted line of this model:
> plot(strength ~ conc,
xlab="Concentration of hardwood",
ylab="Tensile strength",
main="Relationship between hardwood concentration \n and tensile strength", pch=16)
> abline(simple.model)

[Figure: scatter plot of Tensile strength against Concentration of hardwood with the fitted straight line; title "Relationship between hardwood concentration and tensile strength"]

Figure 10.7. Relationship between hardwood concentration and tensile strength of the material. The straight line is the fitted line of the simple linear regression model.

The figure shows clearly that the linear regression model does not suit these data, because the relationship between the two variables does not follow a straight line but a curve. In other words, a quadratic model is probably more appropriate. Letting y be strength and x be conc, we can write the model as:

$$y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$$

We now use R to estimate the three parameters above.
> quadratic <- lm(strength ~ poly(conc, 2))
> summary(quadratic)
Call:
lm(formula = strength ~ poly(conc, 2))
Residuals:
    Min      1Q  Median      3Q     Max
-5.8503 -3.2482 -0.7267  4.1350  6.5506

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      34.184      1.014  33.709 2.73e-16 ***
poly(conc, 2)1   32.302      4.420   7.308 1.76e-06 ***
poly(conc, 2)2  -45.396      4.420 -10.270 1.89e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.42 on 16 degrees of freedom
Multiple R-Squared: 0.9085, Adjusted R-squared: 0.8971
F-statistic: 79.43 on 2 and 16 DF, p-value: 4.912e-09

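Thus the new model, y = 34.18 + 32.30*p1(x) - 45.4*p2(x), explains about 91% of the variance of y. Note that poly() builds orthogonal polynomials, so p1(x) and p2(x) are the orthogonal linear and quadratic terms and the coefficients do not apply directly to x and x²; the fitted values and R² are nevertheless the same as for a quadratic in x. The residual variance is now s² = (4.42)² = 19.5. Compared with the linear model, this model is clearly much better.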
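To obtain coefficients that apply directly to x and x², the quadratic can be refitted with raw powers. The sketch below (orth and raw are names introduced here) uses the raw=TRUE argument of poly and shows that the two parameterisations describe the same fitted curve:

```r
# Data from Example 3
conc <- c(1.0, 1.5, 2.0, 3.0, 4.0, 4.5, 5.0, 5.5, 6.0,
          6.5, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0)
strength <- c(6.3, 11.1, 20.0, 24.0, 26.1, 30.0, 33.8, 34.0, 38.1,
              39.9, 42.0, 46.1, 53.1, 52.0, 52.5, 48.0, 42.8, 27.8, 21.9)

# Orthogonal polynomial terms (as in the text)
orth <- lm(strength ~ poly(conc, 2))

# Raw powers: coefficients now multiply conc and conc^2 directly
raw <- lm(strength ~ poly(conc, 2, raw = TRUE))

print(coef(orth))
print(coef(raw))

# Same fitted curve and same R-squared under both parameterisations
print(max(abs(fitted(orth) - fitted(raw))))
print(c(orth = summary(orth)$r.squared, raw = summary(raw)$r.squared))
```

The orthogonal form is numerically more stable and makes the terms uncorrelated; the raw form is easier to write down as an equation in x.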
Chng ta th xt mt m hnh cubic (bc ba) yi = + 1x + 2x2 + 3x3 xem c
m t y tt hn m hnh phng trnh bc hai hay khng.
> cubic <- lm(strength ~ poly(conc, 3))
> summary(cubic)

Call:
lm(formula = strength ~ poly(conc, 3))
Residuals:
     Min       1Q   Median       3Q      Max
-4.62503 -1.61085  0.04125  1.58922  5.02159

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     34.1842     0.5931  57.641  < 2e-16 ***
poly(conc, 3)1  32.3021     2.5850  12.496 2.48e-09 ***
poly(conc, 3)2 -45.3963     2.5850 -17.561 2.06e-11 ***
poly(conc, 3)3 -14.5740     2.5850  -5.638 4.72e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.585 on 15 degrees of freedom
Multiple R-Squared: 0.9707,     Adjusted R-squared: 0.9648
F-statistic: 165.4 on 3 and 15 DF,  p-value: 1.025e-11

The cubic model describes y even better than the two previous models, with a coefficient of determination (R^2) of 0.97, and all parameters in the model are statistically significant. The following plot compares the three models:
# refit the models above
> linear <- lm(strength ~ conc)
> quadratic <- lm(strength ~ poly(conc, 2))
> cubic <- lm(strength ~ poly(conc, 3))

# create a fine grid of x values
> xnew <- (0:160)/10

# compute the predicted values of y
> y2 <- predict(quadratic, data.frame(conc=xnew))
> y3 <- predict(cubic, data.frame(conc=xnew))

# draw the linear, quadratic and cubic fits
> plot(strength ~ conc, pch=16,
       main="Hardwood concentration and tensile strength",
       sub="Linear, quadratic, and cubic fits")
> abline(linear, col="black")
> lines(xnew, y2, col="blue", lwd=3)
> lines(xnew, y3, col="red", lwd=4)

[Figure: Hardwood concentration and tensile strength, with linear, quadratic, and cubic fits (x-axis: conc; y-axis: strength).]
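Besides comparing R^2 values and plots, nested polynomial models can be compared formally with an F test. The sketch below uses made-up data (not the hardwood measurements) and assumes the three models are fitted to the same observations; anova() then tests whether each added polynomial term reduces the residual sum of squares by more than chance would.

```r
# Hypothetical data; the real conc and strength values are defined earlier in the book
set.seed(2)
conc     <- seq(1, 15, length.out = 19)
strength <- 5 + 9 * conc - 0.6 * conc^2 + rnorm(19, sd = 3)

linear    <- lm(strength ~ conc)
quadratic <- lm(strength ~ poly(conc, 2))
cubic     <- lm(strength ~ poly(conc, 3))

# Sequential F tests: linear vs quadratic, then quadratic vs cubic
cmp <- anova(linear, quadratic, cubic)
print(cmp)
rss <- cmp$RSS   # residual sum of squares of each model
```

A small p-value in a given row suggests that the extra term in that model is worth keeping.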

10.5 Building a linear model from many variables

A typical study has one dependent variable and many independent variables x1, x2, x3, ..., xk, where k may run to dozens or even hundreds. The independent variables are usually correlated with one another, and many combinations of them may be able to predict the dependent variable y. For example, with 3 independent variables x1, x2 and x3, building a model to predict y may require considering the following models: y = f1(x1), y = f2(x2), y = f3(x3), y = f4(x1, x2), y = f5(x1, x3), y = f6(x2, x3), y = f7(x1, x2, x3), and so on, where each fk is a function defined by the coefficients of the variables involved. As k grows, the number of candidate models grows very quickly.
The question is which of these models predicts y adequately, simply and sensibly. I will return to these three criteria in the chapter on logistic regression. Here I only discuss a statistical criterion for building a linear regression model. When there are many candidate models, the statistical criterion for choosing an optimal one is usually the Akaike Information Criterion (AIC).
For a linear regression model $y_i = \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k$ we have k+1 parameters ($\alpha, \beta_1, \beta_2, ..., \beta_k$) and can compute the residual sum of squares (RSS):

$$ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

where n is the sample size. The formula shows that if the model describes y well, RSS is small, because the predicted values $\hat{y}$ and the observed values y are close. A general rule in linear regression is that, for nested models, a model with k independent variables has an RSS no larger than a model with k-1 variables, which in turn has an RSS no larger than a model with k-2 variables, and so on. In other words, the more independent variables a model has, the better it appears to explain y. But because some of the independent variables are correlated, adding variables does not mean that RSS decreases meaningfully. A measure that balances RSS against the number of independent variables in a model is AIC, defined as:

$$ AIC = \log\left(\frac{RSS}{n}\right) + \frac{2k}{n} $$

The model with the lowest AIC is regarded as the optimal model. In the following example we use the step function to find an optimal model based on AIC.
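As a rough check of the definition, AIC can be computed by hand from RSS. This sketch uses R's built-in mtcars data purely for illustration; the variable names aic_step and aic_text are hypothetical. The value that step() prints is n*log(RSS/n) + 2p, where p counts all coefficients including the intercept, so it equals n times the per-observation form in the text plus a constant 2, and both scales rank models identically.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # built-in data, for illustration only

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
p   <- length(coef(fit))                  # number of coefficients, intercept included
k   <- p - 1                              # number of independent variables

aic_step <- n * log(rss / n) + 2 * p      # the scale printed by step()
aic_text <- log(rss / n) + 2 * k / n      # the per-observation form given in the text

print(c(by_hand = aic_step, extractAIC = extractAIC(fit)[2]))
print(aic_text)
```

Note that AIC(fit) in R uses the full log-likelihood and therefore differs from both values by an additive constant that does not depend on the model.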
Example 4. A study examined the effect of factors such as temperature, time and chemical composition on CO2 yield. The data are summarised in Table 2. The main aim of the study is to find a linear regression model that predicts CO2 yield, and to assess the influence of these factors.

Table 2. CO2 yield and several factors that may affect it

Id      y     X1    X2     X3     X4       X5       X6     X7
 1  36.98    5.1   400  51.37   4.24  1484.83  2227.25   2.06
 2  13.74   26.4   400  72.33  30.87   289.94   434.90   1.33
 3  10.08   23.8   400  71.44  33.01   320.79   481.19   0.97
 4   8.53   46.4   400  79.15  44.61   164.76   247.14   0.62
 5  36.42    7.0   450  80.47  33.84  1097.26  1645.89   0.22
 6  26.59   12.6   450  89.90  41.26   605.06   907.59   0.76
 7  19.07   18.9   450  91.48  41.88   405.37   608.05   1.71
 8   5.96   30.2   450  98.60  70.79   253.70   380.55   3.93
 9  15.52   53.8   450  98.05  66.82   142.27   213.40   1.97
10  56.61    5.6   400  55.69   8.92  1362.24  2043.36   5.08
11  26.72   15.1   400  66.29  17.98   507.65   761.48   0.60
12  20.80   20.3   400  58.94  17.79   377.60   566.40   0.90
13   6.99   48.4   400  74.74  33.94   158.05   237.08   0.63
14  45.93    5.8   425  63.71  11.95   130.66  1961.49   2.04
15  43.09   11.2   425  67.14  14.73   682.59  1023.89   1.57
16  15.79   27.9   425  77.65  34.49   274.20   411.30   2.38
17  21.60    5.1   450  67.22  14.48  1496.51  2244.77   0.32
18  35.19   11.7   450  81.48  29.69   652.43   978.64   0.44
19  26.14   16.7   450  83.88  26.33   458.42   687.62   8.82
20   8.60   24.8   450  89.38  37.98   312.25   468.38   0.02
21  11.63   24.9   450  79.77  25.66   307.08   460.62   1.72
22   9.59   39.5   450  87.93  22.36   193.61   290.42   1.88
23   4.42   29.0   450  79.50  31.52   155.96   233.95   1.43
24  38.89    5.5   460  72.73  17.86  1392.08  2088.12   1.35
25  11.19   11.5   450  77.88  25.20   663.09   994.63   1.61
26  75.62    5.2   470  75.50   8.66  1464.11  2196.17   4.78
27  36.03   10.6   470  83.15  22.39   720.07  1080.11   5.88

Notes: y = CO2 yield; X1 = time (minutes); X2 = temperature (C); X3 = percent solvation; X4 = oil yield (g/100g); X5 = coal total; X6 = solvent total; X7 = hydrogen consumption.

Before analysing the data we need to enter them into R with the usual commands. The data will be stored in the object REGdata.
> y <- c(36.98,13.74,10.08, 8.53,36.42,26.59,19.07, 5.96,15.52,56.61,
26.72,20.80, 6.99,45.93,43.09,15.79,21.60,35.19,26.14, 8.60,
11.63, 9.59, 4.42,38.89,11.19,75.62,36.03)
> x1 <- c(5.1,26.4,23.8,46.4, 7.0,12.6,18.9,30.2,53.8,5.6,15.1,20.3,48.4,
5.8,11.2,27.9,5.1,11.7,16.7,24.8,24.9,39.5,29.0, 5.5, 11.5,
5.2,10.6)
> x2 <- c(400,400, 400, 400, 450, 450, 450, 450, 450, 400, 400, 400,
400, 425, 425, 425, 450, 450, 450, 450, 450, 450, 450, 460,
450, 470, 470)
> x3 <- c(51.37,72.33,71.44,79.15,80.47,89.90,91.48,98.60,98.05,55.69,
66.29,58.94,74.74,63.71,67.14,77.65,67.22,81.48,83.88,89.38,
79.77,87.93,79.50,72.73,77.88,75.50,83.15)
> x4 <- c(4.24,30.87,33.01,44.61,33.84,41.26,41.88,70.79,66.82,
8.92,17.98,17.79,33.94,11.95,14.73,34.49,14.48,29.69,26.33,
37.98,25.66,22.36,31.52,17.86,25.20, 8.66,22.39)
> x5 <- c(1484.83, 289.94, 320.79, 164.76, 1097.26, 605.06, 405.37,
253.70, 142.27,1362.24, 507.65, 377.60, 158.05, 130.66,
682.59, 274.20, 1496.51, 652.43, 458.42, 312.25, 307.08,
193.61, 155.96,1392.08, 663.09,1464.11, 720.07)
> x6 <- c(2227.25, 434.90, 481.19, 247.14,1645.89, 907.59, 608.05,
380.55, 213.40,2043.36, 761.48, 566.40, 237.08,1961.49,1023.89,
411.30,2244.77, 978.64, 687.62, 468.38, 460.62, 290.42,
233.95,2088.12, 994.63,2196.17,1080.11)
> x7 <- c(2.06,1.33,0.97,0.62,0.22,0.76,1.71,3.93,1.97,5.08,0.60,0.90,
0.63,2.04,1.57,2.38,0.32,0.44,8.82,0.02,1.72,1.88,1.43,
1.35,1.61,4.78,5.88)
> REGdata <- data.frame(y, x1,x2,x3,x4,x5,x6,x7)

We now begin the analysis. The first model includes all 7 independent variables:
> reg <- lm(y ~ x1+x2+x3+x4+x5+x6+x7, data=REGdata)
> summary(reg)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = REGdata)
Residuals:
    Min      1Q  Median      3Q     Max
-20.035  -4.681  -1.144   4.072  21.214

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 53.937016  57.428952   0.939   0.3594
x1          -0.127653   0.281498  -0.453   0.6553
x2          -0.229179   0.232643  -0.985   0.3370
x3           0.824853   0.765271   1.078   0.2946
x4          -0.438222   0.358551  -1.222   0.2366
x5          -0.001937   0.009654  -0.201   0.8431
x6           0.019886   0.008088   2.459   0.0237 *
x7           1.993486   1.089701   1.829   0.0831 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.61 on 19 degrees of freedom
Multiple R-Squared: 0.728,      Adjusted R-squared: 0.6278
F-statistic: 7.264 on 7 and 19 DF,  p-value: 0.0002674

The results show that all 7 variables together explain about 73% of the variance of y. Of the 7 variables, however, only x6 is statistically significant (p = 0.024). Let us try reducing the model to a simple linear regression with x6 alone.
> summary(lm(y ~ x6, data=REGdata))
Call:
lm(formula = y ~ x6, data = REGdata)
Residuals:
    Min      1Q  Median      3Q     Max
-28.081  -5.829  -0.839   5.522  26.882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.144181   3.483064   1.764     0.09 .
x6          0.019395   0.002932   6.616 6.24e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.7 on 25 degrees of freedom
Multiple R-Squared: 0.6365,     Adjusted R-squared: 0.6219
F-statistic: 43.77 on 1 and 25 DF,  p-value: 6.238e-07

With the single variable x6 the model already explains about 64% of the variance of y. Should we accept this model? Before doing so, we need to examine the correlations among the independent variables:
> pairs(REGdata)

[Figure: scatterplot matrix of y and x1 to x7 produced by pairs(REGdata).]

The plot shows that y is related to variables such as x1, x5 and x6. In addition, x5 and x6 are very closely related (almost a straight line), with a correlation coefficient of 0.88. x1 is also related to x5 and to x6, but through an inverse relationship. This means that x5 and x6 carry nearly the same information for predicting y, so we do not need both of them in the model.
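The correlation quoted above can be reproduced directly from the two vectors. The sketch below copies the x5 and x6 values from the data-entry commands earlier in this example; the names r56 and ratio are hypothetical, and the ratio check is only an exploratory aid to see how closely one variable tracks the other.

```r
x5 <- c(1484.83, 289.94, 320.79, 164.76, 1097.26, 605.06, 405.37,
        253.70, 142.27, 1362.24, 507.65, 377.60, 158.05, 130.66,
        682.59, 274.20, 1496.51, 652.43, 458.42, 312.25, 307.08,
        193.61, 155.96, 1392.08, 663.09, 1464.11, 720.07)
x6 <- c(2227.25, 434.90, 481.19, 247.14, 1645.89, 907.59, 608.05,
        380.55, 213.40, 2043.36, 761.48, 566.40, 237.08, 1961.49, 1023.89,
        411.30, 2244.77, 978.64, 687.62, 468.38, 460.62, 290.42,
        233.95, 2088.12, 994.63, 2196.17, 1080.11)

r56 <- cor(x5, x6)           # Pearson correlation between x5 and x6
print(round(r56, 2))

ratio <- x6 / x5             # how x6 scales with x5, observation by observation
print(round(summary(ratio), 2))
```

For the full picture, cor(REGdata) returns the correlation matrix of all eight variables at once.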
To find an optimal model in the presence of such correlations, we apply step as follows. Note the formula lm(y ~ .): the dot asks R to consider all the variables in REGdata.
> reg <- lm(y ~ ., data=REGdata)
> step(reg, direction="both")
Start: AIC= 134.07
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7

       Df Sum of Sq     RSS    AIC
- x5    1      4.54 2145.37 132.13
- x1    1     23.17 2164.00 132.36
- x2    1    109.34 2250.18 133.42
- x3    1    130.90 2271.74 133.68
<none>              2140.83 134.07
- x4    1    168.31 2309.14 134.12
- x7    1    377.09 2517.92 136.45
- x6    1    681.09 2821.92 139.53

Step 1: AIC= 132.13
y ~ x1 + x2 + x3 + x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x1    1      22.7 2168.1 130.4
- x2    1     113.8 2259.1 131.5
- x3    1     133.5 2278.9 131.8
<none>              2145.4 132.1
- x4    1     170.8 2316.2 132.2
+ x5    1       4.5 2140.8 134.1
- x7    1     375.7 2521.1 134.5
- x6    1    1058.5 3203.8 141.0

Step 2: AIC= 130.42
y ~ x2 + x3 + x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x2    1      96.8 2264.9 129.6
- x3    1     122.0 2290.0 129.9
<none>              2168.1 130.4
- x4    1     187.4 2355.5 130.7
+ x1    1      22.7 2145.4 132.1
+ x5    1       4.1 2164.0 132.4
- x7    1     385.0 2553.1 132.8
- x6    1    1526.2 3694.3 142.8

Step 3: AIC= 129.59
y ~ x3 + x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x3    1      25.4 2290.3 127.9
- x4    1      90.9 2355.8 128.7
<none>              2264.9 129.6
+ x2    1      96.8 2168.1 130.4
+ x5    1       8.3 2256.5 131.5
+ x1    1       5.7 2259.1 131.5
- x7    1     384.9 2649.7 131.8
- x6    1    2015.6 4280.5 144.8

Step 4: AIC= 127.9
y ~ x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x4    1      73.5 2363.8 126.7
<none>              2290.3 127.9
+ x3    1      25.4 2264.9 129.6
+ x1    1      11.3 2279.0 129.8
+ x5    1       6.3 2284.0 129.8
+ x2    1       0.3 2290.0 129.9
- x7    1     486.6 2776.9 131.1
- x6    1    1993.8 4284.1 142.8

Step 5: AIC= 126.75
y ~ x6 + x7

       Df Sum of Sq    RSS   AIC
<none>              2363.8 126.7
+ x4    1      73.5 2290.3 127.9
+ x1    1      33.4 2330.4 128.4
+ x3    1       8.1 2355.8 128.7
+ x5    1       7.7 2356.1 128.7
+ x2    1       7.3 2356.6 128.7
- x7    1     497.3 2861.2 129.9
- x6    1    4477.0 6840.8 153.4

Call:
lm(formula = y ~ x6 + x7, data = REGdata)

Coefficients:
(Intercept)           x6           x7
    2.52646      0.01852      2.18575

The search stops at the model with the two variables x6 and x7, which has the lowest AIC. The linear prediction equation is: y = 2.526 + 0.0185(x6) + 2.186(x7).
> summary(lm(y ~ x6+x7, data=REGdata))
Call:
lm(formula = y ~ x6 + x7, data = REGdata)
Residuals:
     Min       1Q   Median       3Q      Max
-23.2035  -4.3713   0.2513   4.9339  21.9682

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.526460   3.610055   0.700   0.4908
x6          0.018522   0.002747   6.742 5.66e-07 ***
x7          2.185753   0.972696   2.247   0.0341 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.924 on 24 degrees of freedom
Multiple R-Squared: 0.6996,     Adjusted R-squared: 0.6746
F-statistic: 27.95 on 2 and 24 DF,  p-value: 5.391e-07

The detailed results above show that these two variables explain about 70% of the variance of y.
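When only the final model is of interest, the long trace can be suppressed and the selected formula extracted from the returned object. A minimal sketch, using the built-in mtcars data as a stand-in for REGdata (the object names full and best are hypothetical):

```r
full <- lm(mpg ~ ., data = mtcars)                  # start from all predictors
best <- step(full, direction = "both", trace = 0)   # trace = 0 hides the step-by-step output

print(formula(best))                                # the selected model
print(c(full     = extractAIC(full)[2],
        selected = extractAIC(best)[2]))            # AIC before and after selection
```

The object returned by step() is an ordinary lm fit, so summary(best) and predict(best, ...) work as usual.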

10.6 Building a linear model with Bayesian Model Averaging (BMA)

One problem with the model-building approach above is that the model with x6 and x7 is treated as the final model, even though we know that a model with x5 and x7 could be equally plausible, because x5 and x6 are highly correlated. If the study were continued and new data added, a different model might well emerge.
To account for this uncertainty in building a statistical model, a more promising approach than the one above is BMA (Bayesian Model Averaging). Readers who want to learn more can consult the references listed below. Briefly, BMA considers all possible models (with 7 independent variables there are 2^7 = 128 possible models, not counting models with interactions!) and reports the models judged best in the long run. The optimality criterion here is BIC (Bayesian Information Criterion), a close relative of AIC.
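The idea of scoring every possible subset can be sketched in base R without any extra package. This is only an illustration of the principle, not the algorithm the BMA package actually runs: it uses the built-in mtcars data with 7 arbitrarily chosen predictors, and it assumes the common approximation that a model's posterior probability is proportional to exp(-BIC/2). The names preds, incl, w and incl_prob are hypothetical.

```r
preds <- c("cyl", "disp", "hp", "drat", "wt", "qsec", "gear")  # 7 predictors from mtcars

# Every subset of the 7 predictors, as a logical inclusion matrix (128 rows x 7 columns)
incl <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(preds))))

# BIC of the linear model for each subset (the empty subset is the intercept-only model)
bic <- apply(incl, 1, function(use) {
  rhs <- if (any(use)) paste(preds[use], collapse = " + ") else "1"
  BIC(lm(as.formula(paste("mpg ~", rhs)), data = mtcars))
})

# Approximate posterior model probabilities from BIC differences
w <- exp(-0.5 * (bic - min(bic)))
w <- w / sum(w)

# Approximate probability that each predictor belongs in the model
incl_prob <- setNames(colSums(incl * w), preds)
print(length(bic))
print(round(100 * incl_prob, 1))
```

The inclusion probabilities play the same role as the p!=0 column in the summary of a bicreg fit.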
To run BMA we need the BMA package (which can be downloaded from the R website http://cran.R-project.org). Once the package is installed, we load it into the R session with the command:
> library(BMA)

Next, we create a data frame containing only the independent variables. We know that REGdata has 8 variables, the first of which is y. The command REGdata[, -1] therefore creates a new data frame that leaves out the first column (y).
> xvars <- REGdata[,-1]

Next, we define the dependent variable co2 from REGdata:


> co2 <- REGdata[,1]

We are now ready to run the BMA analysis. The function bicreg is written specifically for linear regression. It is used as follows:
> bma <- bicreg(xvars, co2, strict=FALSE, OR=20)

We use the summary function to see the results:


> summary(bma)

Call:
bicreg(x = xvars, y = co2, strict = FALSE, OR = 20)

  16 models were selected
 Best 5 models (cumulative posterior probability = 0.6599):

           p!=0    EV        SD       model 1   model 2   model 3   model 4   model 5
Intercept  100.0   5.75672   14.6244    2.5264    6.1441    8.6120    7.5936    7.3537
x1          12.4  -0.01807    0.1008     .         .         .       -0.1393     .
x2          10.4  -0.00075    0.0282     .         .         .         .         .
x3          10.7   0.00011    0.0791     .         .         .         .       -0.0572
x4          20.2  -0.03059    0.1020     .         .       -0.1419     .         .
x5          10.5  -0.00023    0.0030     .         .         .         .         .
x6         100.0   0.01815    0.0040    0.0185    0.0193    0.0164    0.0162    0.0179
x7          73.7   1.60766    1.2821    2.1857     .        2.1628    2.1233    2.2382

nVar                                      2         1         3         3         3
r2                                      0.700     0.636     0.709     0.704     0.701
BIC                                   -25.8832  -24.0238  -23.4412  -22.9721  -22.6801
post prob                               0.311     0.123     0.092     0.072     0.063

BMA reports the 5 models judged best for predicting y (model 1, model 2, ..., model 5).

The first column lists the independent variables.

Column 2 (p!=0) gives the posterior probability, in percent, that each independent variable has an effect on y. For example, the probability that x6 affects y is 100%, while the probability that x7 affects y is 73.7%. The probabilities for the other variables are much lower, about 20% or less. We can therefore say that the model with x6 and x7 is probably the best one.

Columns 3 (EV) and 4 (SD) give the mean and standard deviation of the coefficient of each independent variable, averaged over the models.

Column 5 gives the estimated regression coefficients of model 1. As the column shows, model 1 consists of the intercept and the two variables x6 and x7. This model explains (as we saw in the analysis above) 70% of the variance of y. Its BIC (Bayesian Information Criterion) is the lowest. Among all the models that BMA considered, this one has a posterior probability of 31.1%.

Column 6 gives the estimated regression coefficients of model 2. As the column shows, model 2 consists of the intercept and x6 only. This model explains 64% of the variance of y. Among all the models that BMA considered, this one has a posterior probability of only 12.3%.

The other models can be interpreted in the same way.

The results can also be displayed graphically:


> imageplot.bma(bma)

[Figure: Models selected by BMA. Rows: x1 to x7; x-axis: Model #.]

This plot displays 13 models. Across these 13 models, x6 appears consistently. Next comes x7, which appears in a number of models, with a probability of 74% as noted above.
In this example the two approaches (stepwise selection and BMA) give consistent results, but in many cases they may lead to different models. Many recent theoretical studies suggest that BMA results are very reliable, and it may well become a standard method of model building in the future.

References for BMA

Raftery, Adrian E. (1995). Bayesian model selection in social research (with Discussion). Sociological Methodology 1995 (Peter V. Marsden, ed.), pp. 111-196, Cambridge, Mass.: Blackwells.
A number of papers on BMA can be downloaded from the following website: www.stat.colostate.edu/~jah/papers.

CHAPTER XI

ANALYSIS OF VARIANCE

11
Analysis of variance

Analysis of variance, as the name suggests, is a family of statistical methods whose focus is the variance (rather than the mean). Analysis of variance belongs to the large family of methods known as linear models (or general linear models), which also includes the linear regression we met in the previous chapter. In this chapter we become familiar with using R for analysis of variance. We begin with a simple analysis, then move on to two-way analysis of variance and to the commonly used non-parametric methods.

11.1 One-way analysis of variance (ANOVA)

Example 1. Table 11.1 below compares galactose levels in 3 groups of patients: group 1 consists of 9 patients with Crohn's disease; group 2 of 11 patients with colitis; and group 3 of 20 subjects without disease (the control group). The question is whether galactose levels differ among the 3 groups. Let the group means be μ1, μ2 and μ3. In the language of hypothesis testing, the null hypothesis is:

Ho: μ1 = μ2 = μ3

and the alternative hypothesis is:

HA: at least one μj (j = 1, 2, 3) differs from the others

Table 11.1. Galactose levels in 3 groups: Crohn's disease, colitis and controls

Group 1: Crohn    Group 2: colitis    Group 3: control
1343              1264                1809   2850
1393              1314                1926   2964
1420              1399                2283   2973
1641              1605                2384   3171
1897              2385                2447   3257
2160              2511                2479   3271
2169              2514                2495   3288
2279              2767                2525   3358
2890              2827                2541   3643
                  2895                2769   3657
                  3011
n = 9             n = 11              n = 20
Mean: 1910        Mean: 2226          Mean: 2804
SD: 516           SD: 688             SD: 527

Note: SD is the standard deviation.
At first the reader, having learned how to compare two groups with the t test, may think that we need three t tests: between groups 1 and 2, groups 2 and 3, and groups 1 and 3. But this approach is not sound, because each comparison would rely on a different variance estimate. The appropriate method is analysis of variance, which can compare several groups at the same time (simultaneous comparisons).

11.1.1 The analysis of variance model

To illustrate analysis of variance we need some notation. Let x_ij be the galactose level of patient i in group j (j = 1, 2, 3). The analysis of variance model states that:

$$ x_{ij} = \mu + \alpha_j + \varepsilon_{ij} \qquad [1] $$

Or, more specifically:

$$ x_{i1} = \mu + \alpha_1 + \varepsilon_{i1} $$
$$ x_{i2} = \mu + \alpha_2 + \varepsilon_{i2} $$
$$ x_{i3} = \mu + \alpha_3 + \varepsilon_{i3} $$

That is, the galactose level of any patient equals the overall population mean (μ), plus or minus the effect of group j, measured by the coefficient α_j, plus an error term ε_ij. A further assumption is that ε_ij follows a normal distribution with mean 0 and variance σ^2. The parameters to be estimated are μ and α_j. As in linear regression, they are estimated by least squares, that is, by finding the estimates $\hat{\mu}$ and $\hat{\alpha}_j$ that minimise

$$ \sum_{j}\sum_{i} (x_{ij} - \hat{\mu} - \hat{\alpha}_j)^2 $$

Returning to the study data, we have the following summary statistics:

Group            Number of subjects (n_j)   Mean               Variance
1 Crohn          n1 = 9                     x̄1 = 1910          s1^2 = 265944
2 Colitis        n2 = 11                    x̄2 = 2226          s2^2 = 473387
3 Control        n3 = 20                    x̄3 = 2804          s3^2 = 277500
Whole sample     n = 40                     x̄ = 2444

Note that:

$$ x_{ij} = \bar{x} + (\bar{x}_j - \bar{x}) + (x_{ij} - \bar{x}_j) \qquad [2] $$

where $\bar{x}$ is the mean of the whole sample and $\bar{x}_j$ is the mean of group j. In other words, the term $(\bar{x}_j - \bar{x})$ reflects the difference between each group mean and the overall mean, and the term $(x_{ij} - \bar{x}_j)$ reflects the difference between a subject's galactose level and the mean of his or her group.

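Accordingly,

the total sum of squares for the whole sample is:

$$ SST = \sum_{j}\sum_{i} (x_{ij} - \bar{x})^2 $$
= (1343 - 2444)^2 + (1393 - 2444)^2 + (1420 - 2444)^2 + ... + (3657 - 2444)^2
= 17817543

the sum of squares for differences between the groups is:

$$ SSB = \sum_{j} n_j (\bar{x}_j - \bar{x})^2 $$
= 9(1910 - 2444)^2 + 11(2226 - 2444)^2 + 20(2804 - 2444)^2
= 5681168

the sum of squares for variation within each group is:

$$ SSW = \sum_{j}\sum_{i} (x_{ij} - \bar{x}_j)^2 = \sum_{j} (n_j - 1) s_j^2 $$
= (9-1)(265944) + (11-1)(473387) + (20-1)(277500)
= 12133922

It is easy to show that SST = SSB + SSW. (The hand-calculated SSB above uses means rounded to whole numbers; with unrounded means SSB = 5683620 and the identity holds exactly.)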

SSW is computed from every patient in the 3 groups, so the within-group mean square (MSW) is:

MSW = SSW / (N - k) = 12133922 / (40-3) = 327944

and the between-group mean square is:

MSB = SSB / (k - 1) = 5681168 / (3-1) = 2840584

where N is the total number of patients in the three groups (N = 40) and k = 3 is the number of groups. If the groups differ, we expect MSB to be larger than MSW. To test the hypothesis we can therefore use the F test:

$$ F = MSB / MSW = 8.66 \qquad [3] $$

with k-1 and N-k degrees of freedom. These calculations can be presented in an analysis of variance table (ANOVA table) as follows (values computed with unrounded means):

Source of variation             Degrees of freedom   Sum of squares   Mean square   F test
Between groups                   2                    5683620          2841810       8.6655
Within groups                   37                   12133923           327944
Total                           39                   17817543

11.1.2 One-way analysis of variance with R

All the calculations above are rather tedious and time-consuming. With R, however, they take about a second once the data have been prepared properly.
(a) Data entry. First we need to enter the data into R. The first step is to tell R that we have three groups of patients (1, 2 and 3): group 1 has 9 subjects, group 2 has 11 and group 3 has 20:
> group <- c(1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,
             3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3)

For analysis of variance, the variable group must be defined as a factor:
> group <- as.factor(group)

Next, we enter the galactose data for each group in the order defined above (calling the object galactose):
> galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
                 1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
                 1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
                 2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)

Put the two variables group and galactose into a data frame called data:
> data <- data.frame(group, galactose)
> attach(data)

Once the data are ready, we use the function lm() to carry out the analysis of variance:
> analysis <- lm(galactose ~ group)

In this call we tell R that galactose is a function of group. The result of the analysis is stored in analysis.
(b) Analysis of variance results. We now use the anova command to see the results:
> anova(analysis)
Analysis of Variance Table

Response: galactose
          Df   Sum Sq Mean Sq F value    Pr(>F)
group      2  5683620 2841810  8.6655 0.0008191 ***
Residuals 37 12133923  327944
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The columns of this table are: Df, the degrees of freedom; Sum Sq, the sum of squares; Mean Sq, the mean square; F value, the F statistic defined in [3] above; and Pr(>F), the p-value of the F test.
The row group gives the between-groups sum of squares, and the row Residuals gives the within-group sum of squares. Here we have:
SSB = 5683620 and MSB = 2841810
and:
SSW = 12133923 and MSW = 327944
Hence F = 2841810 / 327944 = 8.6655.
The p-value of 0.00082 is evidence that galactose levels differ among the three groups.
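The sums of squares reported by anova() can be reproduced from their definitions, which is a useful way to connect the table to the formulas of section 11.1.1. The sketch below re-enters the galactose data so that it runs on its own; rep() is used as a compact alternative to typing the group labels, and the names SSB, SSW, SST and Fval are chosen here to mirror the text.

```r
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
               2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)
group <- factor(rep(1:3, times = c(9, 11, 20)))   # 9, 11 and 20 subjects

grand <- mean(galactose)                      # overall mean
xbar  <- tapply(galactose, group, mean)       # group means
n_j   <- tapply(galactose, group, length)     # group sizes

SSB <- sum(n_j * (xbar - grand)^2)            # between-group sum of squares
SSW <- sum((galactose - xbar[group])^2)       # within-group sum of squares
SST <- sum((galactose - grand)^2)             # total sum of squares

Fval <- (SSB / (3 - 1)) / (SSW / (40 - 3))    # F = MSB / MSW
tab  <- anova(lm(galactose ~ group))
print(c(SSB = SSB, SSW = SSW, SST = SST, F = Fval))
```

Because unrounded means are used, SSB here agrees with the R table rather than with the hand calculation based on rounded means.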
(c) Estimates. For more detail on the results, we use the summary command as follows:
> summary(analysis)
Call:
lm(formula = galactose ~ group)
Residuals:
   Min     1Q Median     3Q    Max
-995.5 -437.9  102.0  456.0  979.8

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1910.2      190.9  10.007  4.5e-12 ***
group2         316.3      257.4   1.229 0.226850
group3         894.3      229.9   3.891 0.000402 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 572.7 on 37 degrees of freedom
Multiple R-Squared: 0.319,      Adjusted R-squared: 0.2822
F-statistic: 8.666 on 2 and 37 DF,  p-value: 0.0008191

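In these results the intercept corresponds to μ in model [1] under R's default parameterisation, in which group 1 is the reference group. In other words, $\hat{\mu}$ = 1910 (the mean of group 1), with a standard error of 190.9.
To estimate the parameters α_j, R sets $\hat{\alpha}_1 = 0$, so that $\hat{\alpha}_2 = \bar{x}_2 - \bar{x}_1 = 316.3$, with a standard error of 257.4 and a t statistic of 316.3 / 257.4 = 1.229 with p = 0.2268. In other words, compared with group 1 (Crohn patients), colitis patients have a mean galactose level that is higher by about 316, but this difference is not statistically significant.

Similarly, $\hat{\alpha}_3 = \bar{x}_3 - \bar{x}_1 = 894.3$, with a standard error of 229.9, a t statistic of 894.3 / 229.9 = 3.89 and p = 0.00040. Compared with Crohn patients, the control group has a galactose level that is higher by about 894, and this difference is statistically significant.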

11.2 Multiple comparisons and p-value adjustment

With k groups there are k(k-1)/2 possible pairwise comparisons. The example above has 3 groups, so there are 3 possible comparisons (group 1 vs 2, group 1 vs 3, and group 2 vs 3). With k = 10 the number of comparisons already rises to 45. As mentioned in Chapter 7, when many comparisons are made, the p-values from the individual tests no longer carry their original meaning, because the tests can produce false positives (results with p < 0.05 when in reality there is no difference or effect). With multiple comparisons, therefore, the p-values need to be adjusted appropriately.
There are quite a few methods for adjusting p-values, and the 4 most commonly used are Bonferroni, Scheffé, Holm and Tukey (named after 4 distinguished statisticians). Which method is most appropriate? There is no definitive answer, but the following two points may help the reader decide:

(a) If k < 10, any of the methods can be used to adjust the p-values. Personally, I find Tukey's method usually very useful for such comparisons.

(b) If k > 10, the Bonferroni method can become very conservative. Conservative here means that the method rarely declares a comparison statistically significant, even when a real difference exists. In this case the Tukey, Holm and Scheffé methods can be applied.

I will not explain the theory behind these methods here (readers can consult statistics textbooks), but will only show how to use R to carry out the comparisons.
Returning to the example above, the p-values reported so far have not been adjusted for multiple comparisons. In the chapter on p-values I noted that such values overstate statistical significance and do not reflect the nominal level (0.05). As a first adjustment for multiple comparisons, we use the Bonferroni method.
The command pairwise.t.test gives all the p-values for the comparisons among the three groups:
> pairwise.t.test(galactose, group, p.adj="bonferroni")
        Pairwise comparisons using t tests with pooled SD

data:  galactose and group

  1      2
2 0.6805 -
3 0.0012 0.0321

P value adjustment method: bonferroni

The results show that the p-value between group 1 (Crohn) and the colitis group is 0.6805 (not statistically significant); between the Crohn and control groups it is 0.0012 (statistically significant); and between the colitis and control groups it is 0.0321 (also statistically significant).
Another method of adjusting p-values is the Holm method, which pairwise.t.test uses by default:
> pairwise.t.test(galactose, group)
        Pairwise comparisons using t tests with pooled SD

data:  galactose and group

  1      2
2 0.2268 -
3 0.0012 0.0214

P value adjustment method: holm

These results lead to the same conclusions as the Bonferroni method.
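The two adjustments can also be applied directly to a vector of unadjusted p-values with p.adjust, which makes the arithmetic behind the tables visible. In the sketch below the first two unadjusted values are taken from the summary(analysis) output; the third (0.0107, colitis vs control) is an assumption, back-calculated here as the Bonferroni value 0.0321 divided by 3.

```r
# Unadjusted p-values for the comparisons 2 vs 1, 3 vs 1 and 3 vs 2.
# 0.0107 is back-calculated (0.0321 / 3), not taken from printed output.
p_raw <- c("2-1" = 0.2268, "3-1" = 0.000402, "3-2" = 0.0107)

bonf <- p.adjust(p_raw, method = "bonferroni")  # multiply every p by the number of tests (capped at 1)
holm <- p.adjust(p_raw, method = "holm")        # multiply the smallest p by 3, the next by 2, the largest by 1

print(round(rbind(p_raw, bonf, holm), 4))
```

Holm's adjusted values are never larger than Bonferroni's, which is why Holm is usually preferred when either would be acceptable.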


All the comparisons above use a pooled standard deviation common to the three groups. If we want to use a separate standard deviation for each group, the following command (with pool.sd=FALSE) does so:
> pairwise.t.test(galactose, group, pool.sd=FALSE)
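        Pairwise comparisons using t tests with non-pooled SD

data:  galactose and group

  1      2
2 0.2557 -
3 0.0017 0.0544

P value adjustment method: holm

Once again the overall conclusions are essentially unchanged, although the colitis versus control comparison (p = 0.0544) now falls just above the 0.05 threshold.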


11.2.1 Multiple comparisons with Tukey's method

With the methods above we obtain only the p-values of the comparisons between groups, not the size of the differences or their 95% confidence intervals. To obtain these estimates we need another function called aov (short for analysis of variance) together with the function TukeyHSD (HSD stands for Honest Significant Difference), as follows:
> res <- aov(galactose ~ group)
> TukeyHSD (res)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = galactose ~ group)

$group
        diff        lwr      upr     p adj
2-1 316.3232 -312.09857  944.745 0.4439821
3-1 894.2778  333.07916 1455.476 0.0011445
3-2 577.9545   53.11886 1102.790 0.0281768

These results show that groups 3 and 1 differ by about 894 units, with a 95% confidence interval from 333 to 1455 units. Similarly, galactose in the colitis group is about 578 units lower than in the control group (group 3), with a 95% confidence interval from 53 to 1103.
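The object returned by TukeyHSD is a list holding one matrix per factor, so the differences and intervals can be extracted programmatically instead of being read off the printout. A small sketch, re-entering the galactose data so that it runs on its own (the names tk and sig are hypothetical):

```r
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
               2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)
group <- factor(rep(1:3, times = c(9, 11, 20)))

res <- aov(galactose ~ group)
tk  <- TukeyHSD(res)$group                  # matrix with columns diff, lwr, upr, p adj

sig <- rownames(tk)[tk[, "p adj"] < 0.05]   # comparisons significant at the 5% family-wise level
print(round(tk, 4))
print(sig)
```

Calling plot(TukeyHSD(res)) draws the same intervals as a chart.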

[Figure 11.1. Mean differences and 95% family-wise confidence intervals between groups 1 and 2, 1 and 3, and 3 and 2. The horizontal axis is the difference in mean galactose level; the vertical axis lists the three comparisons.]
11.2.2 Graphical analysis

A statistical analysis is not complete without a graph illustrating the results. The following commands draw a plot showing the mean galactose level and its standard error for each group of patients. The plot shows that the Crohn group has the lowest galactose level (though not significantly lower than the colitis group), and that both groups are lower than the control group, a difference that is statistically significant.
> xbar <- tapply(galactose, group, mean)
> s <- tapply(galactose, group, sd)
> n <- tapply(galactose, group, length)
> sem <- s/sqrt(n)
> stripchart(galactose ~ group, method="jitter", jitter=0.05, pch=16, vertical=TRUE)
> arrows(1:3, xbar+sem, 1:3, xbar-sem, angle=90, code=3, length=0.1)
> lines(1:3, xbar, pch=4, type="b", cex=2)

[Figure 11.2. Galactose levels in group 1 (Crohn patients), group 2 (colitis patients) and group 3 (controls).]

11.3 Non-parametric analysis

The non-parametric counterpart of analysis of variance for comparing several groups is the Kruskal-Wallis test. Like the Wilcoxon test, which compares two groups non-parametrically, the Kruskal-Wallis test transforms the data into ranks and analyses the differences in ranks between the groups. The function kruskal.test in R carries out this test:
> kruskal.test(galactose ~ group)
Kruskal-Wallis rank sum test
data: galactose by group
Kruskal-Wallis chi-squared = 12.1381, df = 2, p-value = 0.002313

The p-value from this test is quite low (p = 0.002313), indicating a difference among the three groups, in agreement with the analysis of variance carried out with lm above. One drawback of the Kruskal-Wallis test, however, is that it does not tell us which groups differ from each other; it gives only a single overall p-value. In many cases non-parametric analyses such as the Kruskal-Wallis test are less efficient than parametric statistical methods.
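To see what the rank transformation amounts to, the Kruskal-Wallis statistic can be computed by hand as H = 12/(N(N+1)) * sum(R_j^2/n_j) - 3(N+1), where R_j is the sum of ranks in group j. This sketch assumes there are no tied values (true for these data, so no tie correction is needed) and re-enters the data so that it runs on its own; the names H and p_by_hand are hypothetical.

```r
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
               2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)
group <- factor(rep(1:3, times = c(9, 11, 20)))

r   <- rank(galactose)                 # ranks over the pooled sample
N   <- length(galactose)
R_j <- tapply(r, group, sum)           # rank sum per group
n_j <- tapply(r, group, length)        # group sizes

H <- 12 / (N * (N + 1)) * sum(R_j^2 / n_j) - 3 * (N + 1)
p_by_hand <- pchisq(H, df = nlevels(group) - 1, lower.tail = FALSE)

kt <- kruskal.test(galactose ~ group)
print(c(H = H, p = p_by_hand))
```

For follow-up pairwise comparisons on the rank scale, pairwise.wilcox.test(galactose, group) plays the role that pairwise.t.test plays for the parametric analysis.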

11.4 Two-way analysis of variance (ANOVA)

Simple, or one-way, analysis of variance involves a single factor. Two-way analysis of variance, as the name suggests, involves two factors. The method is a straightforward extension of one-way analysis of variance: instead of estimating the variance due to one factor, it estimates the variance due to two factors.
Example 2. To evaluate a new painting technique, researchers applied the paint to 3 types of material (1, 2 and 3) under two conditions (1, 2). For each combination of condition and material the experiment was repeated 3 times. Durability was measured by a durability index (called score here). In total there are 18 observations, as follows:
Table 11.2. Paint durability under 2 conditions and for 3 materials

                          Material (j)
Condition (i)    1                2                3
1                4.1, 3.9, 4.3    3.1, 2.8, 3.3    3.5, 3.2, 3.6
2                2.7, 3.1, 2.6    1.9, 2.2, 2.3    2.7, 2.3, 2.5

These data can be summarised by the mean for each condition and material in the following table:

Table 11.3. Summary of the data from the paint durability experiment

                               Material (j)
Condition (i)                  1        2        3        Mean over 3 materials
Mean
  1                            4.10     3.07     3.43     3.533
  2                            2.80     2.13     2.50     2.478
  Mean over 2 conditions       3.450    2.600    2.967    3.00
Variance
  1                            0.040    0.063    0.043
  2                            0.070    0.043    0.040

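These preliminary calculations suggest that both the condition and the material may have an effect.
Let x_ij be the score for condition i (i = 1, 2) and material j (j = 1, 2, 3). (To keep things simple, we ignore the replicate index k for the moment.) The two-way analysis of variance model states that:

$$ x_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij} \qquad [4] $$

Or, more specifically:

$$ x_{11} = \mu + \alpha_1 + \beta_1 + \varepsilon_{11} $$
$$ x_{12} = \mu + \alpha_1 + \beta_2 + \varepsilon_{12} $$
$$ x_{13} = \mu + \alpha_1 + \beta_3 + \varepsilon_{13} $$
$$ x_{21} = \mu + \alpha_2 + \beta_1 + \varepsilon_{21} $$
$$ x_{22} = \mu + \alpha_2 + \beta_2 + \varepsilon_{22} $$
$$ x_{23} = \mu + \alpha_2 + \beta_3 + \varepsilon_{23} $$

Here μ is the overall population mean, and the coefficients α_i (the effect of condition i) and β_j (the effect of material j) must be estimated from the data. The errors ε_ij are assumed to follow a normal distribution with mean 0 and variance σ^2.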
In two-way analysis of variance we partition the total sum of squares into 3 sources:

the first source is the sum of squares due to variation between the 2 conditions:

$$ SS_c = \sum_{i} n_i (\bar{x}_i - \bar{x})^2 $$
= 9(3.533 - 3.00)^2 + 9(2.478 - 3.00)^2
= 5.01

the second source is the sum of squares due to variation between the 3 materials:

$$ SS_m = \sum_{j} n_j (\bar{x}_j - \bar{x})^2 $$
= 6(3.45 - 3.00)^2 + 6(2.60 - 3.00)^2 + 6(2.967 - 3.00)^2
= 2.18

the third source is the residual sum of squares, that is, what remains of the total sum of squares (SST = 7.92) once the two effects have been removed:

$$ SS_e = SST - SS_c - SS_m $$
= 7.92 - 5.01 - 2.18
= 0.73

Note that this is larger than the pure within-cell sum of squares, $\sum (n_{ij} - 1) s_{ij}^2$ = 2(0.040) + 2(0.063) + 2(0.043) + 2(0.070) + 2(0.043) + 2(0.040) = 0.60, because model [4] has no interaction term, so any condition-by-material interaction is absorbed into the residual.
In the equations above each combination of condition and material has 3 replicates, there are a = 2 conditions and m = 3 materials, $\bar{x}$ is the mean of the whole sample, $\bar{x}_i$ is the mean for each condition (n_i = 9) and $\bar{x}_j$ is the mean for each material (n_j = 6). SSc has a-1 = 1 degree of freedom, SSm has m-1 = 2 degrees of freedom, and SSe has N-a-m+1 = 14 degrees of freedom, where N is the total number of observations (18). The mean squares are therefore:

between the two conditions:   MSc = SSc / (a-1) = 5.01 / 1 = 5.01
between the three materials:  MSm = SSm / (m-1) = 2.18 / 2 = 1.09
residual:                     MSe = SSe / (N-a-m+1) = 0.73 / 14 = 0.052

The difference between the two conditions is therefore tested with F = MSc/MSe on 1 and 14 degrees of freedom. Similarly, the difference among the three materials is tested with F = MSm/MSe on 2 and 14 degrees of freedom. These results can be presented in an analysis of variance table as follows:

Source of variation        Degrees of freedom   Sum of squares   Mean square   F test
Between 2 conditions        1                    5.01             5.01          95.6
Between 3 materials         2                    2.18             1.09          20.8
Residual                   14                    0.73             0.052
Total                      17                    7.92
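The three sums of squares can be checked numerically from their definitions. The sketch below enters the 18 scores in the order of Table 11.2 (condition 1 first, then condition 2; within each condition materials 1, 2, 3) and compares the hand computation with anova() for the additive model. The names SSc, SSm, SSe and SSwithin mirror the text and are otherwise arbitrary.

```r
score <- c(4.1,3.9,4.3, 3.1,2.8,3.3, 3.5,3.2,3.6,
           2.7,3.1,2.6, 1.9,2.2,2.3, 2.7,2.3,2.5)
condition <- gl(2, 9, 18)   # 9 scores per condition
material  <- gl(3, 3, 18)   # 3 scores per material, cycling 1, 2, 3

grand <- mean(score)
SSc <- sum(9 * (tapply(score, condition, mean) - grand)^2)  # between conditions
SSm <- sum(6 * (tapply(score, material,  mean) - grand)^2)  # between materials
SST <- sum((score - grand)^2)                               # total
SSe <- SST - SSc - SSm                                      # residual of the additive model

# Pure within-cell variation: (3 - 1) times each cell variance
SSwithin <- sum(2 * tapply(score, list(condition, material), var))

tab <- anova(lm(score ~ condition + material))
print(round(c(SSc = SSc, SSm = SSm, SSe = SSe, SSwithin = SSwithin), 4))
```

The gap between SSe and SSwithin is the interaction sum of squares, which could be separated out by fitting score ~ condition * material instead.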

11.4.1 Two-way analysis of variance with R

(a) The first step is to enter the data from Table 11.2 into R. We need to organise the data into 4 variables as follows:

Condition   Material   Subject   Score
1           1           1        4.1
1           1           2        3.9
1           1           3        4.3
1           2           4        3.1
1           2           5        2.8
1           2           6        3.3
1           3           7        3.5
1           3           8        3.2
1           3           9        3.6
2           1          10        2.7
2           1          11        3.1
2           1          12        2.6
2           2          13        1.9
2           2          14        2.2
2           2          15        2.3
2           3          16        2.7
2           3          17        2.3
2           3          18        2.5

We can generate such sequences of levels with the function gl (generate levels). Its use can be illustrated as follows:
> gl(9, 1, 18)
 [1] 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9
Levels: 1 2 3 4 5 6 7 8 9

This command creates the sequence 1, 2, 3, ..., 9 twice (18 values in total); each run of 1 to 9 forms one group. By contrast, the command:
> gl(4, 9, 36)
 [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4

creates 4 levels (1, 2, 3, 4), each repeated 9 times (36 values in total). Therefore, to create the levels for condition and material, we use the following commands:
> condition <- gl(2, 9, 18)
> material <- gl(3, 3, 18)

And we create 18 identification numbers (from 1 to 18):
> id <- 1:18

Finally, the data for score:
> score <- c(4.1,3.9,4.3, 3.1,2.8,3.3, 3.5,3.2,3.6,
             2.7,3.1,2.6, 1.9,2.2,2.3, 2.7,2.3,2.5)

We put everything into a data frame named data:
> data <- data.frame(condition, material, id, score)
> attach(data)
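Before fitting a model it can help to confirm that the two gl() calls really produce the intended layout. The sketch below is one way to do that; the data frame name `d` and the object `cells` are hypothetical, and passing `data = d` to lm() is shown only as an alternative to attach(), not as the approach used in the text.

```r
# Rebuild the data frame with explicit column names (d is an illustrative name)
d <- data.frame(
  condition = gl(2, 9, 18),
  material  = gl(3, 3, 18),
  id        = 1:18,
  score     = c(4.1, 3.9, 4.3,  3.1, 2.8, 3.3,  3.5, 3.2, 3.6,
                2.7, 3.1, 2.6,  1.9, 2.2, 2.3,  2.7, 2.3, 2.5)
)

# Each condition x material cell should contain exactly 3 replicates
cells <- table(d$condition, d$material)
print(cells)

# Columns can be referenced through the data argument instead of attach()
fit <- lm(score ~ condition + material, data = d)
```

Using the `data` argument avoids name clashes when several data frames with the same column names are in use during one session.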

(b) Analysis and preliminary results. The data are now ready for analysis. For a two-way analysis of variance, we again use the lm function, with the following arguments:
> twoway <- lm(score ~ condition + material)
> anova(twoway)
Analysis of Variance Table

Response: score
          Df Sum Sq Mean Sq F value    Pr(>F)
condition  1 5.0139  5.0139  95.575 1.235e-07 ***
material   2 2.1811  1.0906  20.788 6.437e-05 ***
Residuals 14 0.7344  0.0525
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table above partitions the variation of score into three sources. Judging by the mean squares, the effect of condition appears to be more important than the effect of material. Nevertheless, both effects are statistically significant, because the p-values for both factors are very small.
(c) Estimates. We ask R to summarize the parameter estimates with the summary command:
> summary(twoway)
Call:
lm(formula = score ~ condition + material)

Residuals:
     Min       1Q   Median       3Q      Max
-0.32778 -0.16389  0.03333  0.16111  0.32222

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.9778     0.1080  36.841 2.43e-15 ***
condition2   -1.0556     0.1080  -9.776 1.24e-07 ***
material2    -0.8500     0.1322  -6.428 1.58e-05 ***
material3    -0.4833     0.1322  -3.655   0.0026 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.229 on 14 degrees of freedom
Multiple R-Squared: 0.9074,     Adjusted R-squared: 0.8875
F-statistic: 45.72 on 3 and 14 DF,  p-value: 1.761e-07

The results show that, compared with condition 1, condition 2 has a score that is lower by about 1.056, with a standard error of 0.108 and p = 1.24e-07, which is statistically significant. In addition, compared with material 1, the scores for materials 2 and 3 are also significantly lower, with the lowest score recorded for material 2, so the effect of material is statistically significant as well.
The value labelled Residual standard error is estimated from the residual mean square in part (b), that is $\sqrt{0.0525} = 0.229$, which is the estimate of $\sigma$.
The coefficient of determination ($R^2$) indicates that the two factors, condition and material, explain about 91% of the variation in the whole sample. This coefficient is computed from the sums of squares in part (b) as follows:

$$R^2 = \frac{5.0139 + 2.1811}{5.0139 + 2.1811 + 0.7344} = 0.9074$$

Finally, the adjusted $R^2$ reflects the "improvement" achieved by the model. To understand this coefficient better, note that the variance of the whole sample is $s^2$ = (5.0139 + 2.1811 + 0.7344) / 17 = 0.4664. After adjusting for the effects of condition and material, this variance falls to 0.0525 (the residual mean square). The two factors therefore reduce the variance by 0.4664 - 0.0525 = 0.4139, and the adjusted $R^2$ is:

Adj $R^2$ = 0.4139 / 0.4664 = 0.8875

That is, after adjusting for the two factors condition and material, the variance of score is reduced by about 89%.
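The two coefficients can be reproduced from the analysis of variance table itself. This is a minimal sketch, assuming the same data and model as in parts (a) to (c); the names `tab`, `ss_model`, `ss_res`, `s2_total`, `s2_res`, `r2` and `adj_r2` are introduced here for illustration.

```r
score <- c(4.1, 3.9, 4.3,  3.1, 2.8, 3.3,  3.5, 3.2, 3.6,
           2.7, 3.1, 2.6,  1.9, 2.2, 2.3,  2.7, 2.3, 2.5)
condition <- gl(2, 9, 18)
material  <- gl(3, 3, 18)

fit <- lm(score ~ condition + material)
tab <- anova(fit)

ss_model <- sum(tab[["Sum Sq"]][1:2])  # condition + material
ss_res   <- tab[["Sum Sq"]][3]         # residual
df_res   <- tab[["Df"]][3]

# R-squared: share of the total sum of squares explained by the model
r2 <- ss_model / (ss_model + ss_res)

# Adjusted R-squared: relative reduction in variance
s2_total <- (ss_model + ss_res) / (length(score) - 1)  # variance before adjustment
s2_res   <- ss_res / df_res                            # residual mean square
adj_r2   <- (s2_total - s2_res) / s2_total

round(c(R2 = r2, adjR2 = adj_r2), 4)
```

Both values should agree with the `Multiple R-Squared` and `Adjusted R-squared` lines printed by summary().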
(d) Interaction effects

To complete the analysis, we need to consider the possibility that the effects of the two factors interact with each other. That is, the model for score becomes:

$$x_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ij}$$

Note the term $(\alpha\beta)_{ij}$ in this equation, which reflects the interaction between the two factors. In R we simply type:
> anova(twoway <- lm(score ~ condition + material + condition*material))
Analysis of Variance Table

Response: score
                   Df Sum Sq Mean Sq  F value    Pr(>F)
condition           1 5.0139  5.0139 100.2778 3.528e-07 ***
material            2 2.1811  1.0906  21.8111 0.0001008 ***
condition:material  2 0.1344  0.0672   1.3444 0.2972719
Residuals          12 0.6000  0.0500
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for the interaction effect is 0.297. We therefore conclude that the interaction between material and condition is not statistically significant, and we accept model [4], that is, the model without interaction.
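The same question can be framed as a comparison of two nested models. The sketch below assumes the data of part (a); the object names `additive`, `full`, `cmp`, `f_int` and `p_int` are illustrative.

```r
score <- c(4.1, 3.9, 4.3,  3.1, 2.8, 3.3,  3.5, 3.2, 3.6,
           2.7, 3.1, 2.6,  1.9, 2.2, 2.3,  2.7, 2.3, 2.5)
condition <- gl(2, 9, 18)
material  <- gl(3, 3, 18)

additive <- lm(score ~ condition + material)   # model [4]
full     <- lm(score ~ condition * material)   # model with interaction

# F test for the extra interaction terms
cmp   <- anova(additive, full)
f_int <- cmp[2, "F"]
p_int <- cmp[2, "Pr(>F)"]

print(cmp)
```

The F statistic and p-value in the second row of `cmp` correspond to the `condition:material` row of the sequential table shown above.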
(e) Comparison between groups. We estimate the differences between the two conditions and among the three materials with the TukeyHSD function applied to an aov fit:
> res <- aov(score ~ condition + material + condition)
> TukeyHSD(res)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = score ~ condition + material + condition)

$condition
         diff       lwr        upr p adj
2-1 -1.055556 -1.287131 -0.8239797 1e-07

$material
          diff         lwr        upr     p adj
2-1 -0.8500000 -1.19610279 -0.5038972 0.0000442
3-1 -0.4833333 -0.82943612 -0.1372305 0.0068648
3-2  0.3666667  0.02056388  0.7127695 0.0374069

The following plot illustrates these results:
> plot(TukeyHSD(res), ordered=TRUE)
There were 16 warnings (use warnings() to see them)

[Plot: 95% family-wise confidence intervals for the pairwise differences 2-1, 3-1 and 3-2; x-axis: differences in mean levels of material]

Figure 11.3. Comparison among the 3 materials by Tukey's method.
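(f) Plot. To view the effects of the two factors, condition and material, we need a plot that in analysis of variance is called an interaction plot. The function interaction.plot draws this plot; its arguments are, in order, the factor on the horizontal axis, the factor that defines the lines, and the response:
> interaction.plot(material, condition, score)

[Plot: mean of score (y-axis) against material (x-axis), with one line for each condition]

Figure 11.4. Mean score for condition 1 (dashed line) and condition 2 across the 3 materials.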
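The points drawn in an interaction plot are simply the cell means, which can be tabulated as a quick numerical companion to the figure. This is a small sketch assuming the data of part (a); `cell_means` and `gap` are illustrative names.

```r
score <- c(4.1, 3.9, 4.3,  3.1, 2.8, 3.3,  3.5, 3.2, 3.6,
           2.7, 3.1, 2.6,  1.9, 2.2, 2.3,  2.7, 2.3, 2.5)
condition <- gl(2, 9, 18)
material  <- gl(3, 3, 18)

# Mean score in each condition x material cell (rows: condition, columns: material)
cell_means <- tapply(score, list(condition = condition, material = material), mean)
round(cell_means, 3)

# Difference between the two conditions within each material;
# similar differences across materials suggest little interaction
gap <- cell_means["1", ] - cell_means["2", ]
round(gap, 3)
```

If the gap between conditions were very different from one material to the next, the lines in the plot would not be roughly parallel and an interaction term would deserve a closer look.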

11.5 Analysis of covariance (ANCOVA)

Analysis of covariance (abbreviated ANCOVA) is a method of analysis that combines linear regression and analysis of variance. In linear regression, both the dependent variable (also called the response variable) and the independent variable (or predictor variable) are mostly continuous, such as cholesterol and age. In analysis of variance, the dependent variable is continuous while the independent variable is ordinal or categorical, such as galactose and patient group in Example 1. In analysis of covariance, the dependent variable is continuous, but the independent variables may be both continuous and categorical.
Example 3. In the study whose results are presented below, the researchers measured the height and age of 18 students from an urban area and 14 students from a rural area.
Table 11.4. Height of students from urban and rural areas

Area    ID   Age (months)   Height (cm)
urban    1       109           137.6
urban    2       113           147.8
urban    3       115           136.8
urban    4       116           140.7
urban    5       119           132.7
urban    6       120           145.4
urban    7       121           135.0
urban    8       124           133.0
urban    9       126           148.5
urban   10       129           148.3
urban   11       130           147.5
urban   12       133           148.8
urban   13       134           133.2
urban   14       135           148.7
urban   15       137           152.0
urban   16       139           150.6
urban   17       141           165.3
urban   18       142           149.9
rural    1       121           139.0
rural    2       121           140.9
rural    3       128           134.9
rural    4       129           149.5
rural    5       131           148.7
rural    6       132           131.0
rural    7       133           142.3
rural    8       134           139.9
rural    9       138           142.9
rural   10       138           147.7
rural   11       138           147.7
rural   12       140           134.6
rural   13       140           135.8
rural   14       140           148.5

The question is whether there is any difference in height between urban and rural children. In other words, does the living environment affect height, and if so, how large is the effect?
One factor with a strong influence on height is age. During the growing years, height increases with age. A comparison of height between the two groups can therefore be objective only if the ages of the two groups are comparable. To ensure an objective comparison, we need to analyse the data with an analysis of covariance model.
The first step is to enter the data into R with the following commands:
> # create the id sequence
> id <- c(1:18, 1:14)
> # group: 1=urban, 2=rural; group must be defined as a factor
> group <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
             2,2,2,2,2,2,2,2,2,2,2,2,2,2)
> group <- as.factor(group)
> # enter the data
> age <- c(109,113,115,116,119,120,121,124,126,129,130,133,134,135,
           137,139,141,142,
           121,121,128,129,131,132,133,134,138,138,138,140,140,140)
> height <- c(137.6,147.8,136.8,140.7,132.7,145.4,135.0,133.0,148.5,
              148.3,147.5,148.8,133.2,148.7,152.0,150.6,165.3,149.9,
              139.0,140.9,134.9,149.5,148.7,131.0,142.3,139.9,142.9,
              147.7,147.7,134.6,135.8,148.5)
> # create a data frame
> data <- data.frame(id, group, age, height)
> attach(data)

Let us first look at some descriptive statistics by computing the mean age and mean height for each group of students:
> tapply(age, group, mean)
       1        2
126.8333 133.0714
> tapply(height, group, mean)
       1        2
144.5444 141.6714

These results show that the urban students are younger than the rural students by about 6.2 months (126.8 versus 133.1). However, the urban students are taller than the rural students by about 2.9 cm (144.54 versus 141.67). Readers can use the t test to see that the difference in age between the two groups is statistically significant (p = 0.045).
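The t test mentioned above can be run as follows. This is a sketch that assumes the Welch (unequal variance) version, which is the default of t.test(); the text does not say which variant was used, and the object name `age_test` is illustrative.

```r
age <- c(109,113,115,116,119,120,121,124,126,129,130,133,134,135,
         137,139,141,142,
         121,121,128,129,131,132,133,134,138,138,138,140,140,140)
group <- factor(rep(1:2, times = c(18, 14)))   # 1 = urban, 2 = rural

# Two-sample t test of age by group (Welch version by default)
age_test <- t.test(age ~ group)
age_test$estimate   # mean age in each group
age_test$p.value    # two-sided p-value
```

Adding `var.equal = TRUE` would give the classical pooled-variance test instead, which can lead to a somewhat different p-value.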

In addition, the following plot shows that there is a correlation between age and height:

[Plot: height (cm, y-axis) against age (months, x-axis)]

Figure 11.5. Height (cm) and age (months) of the two groups of urban and rural students.

Because the two groups differ in age, and age is related to height, we cannot compare height between the 2 groups of students without adjusting for age. To adjust for age, we use analysis of covariance.
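11.5.1 The analysis of covariance model

Let y be height, x be age, and g be group. The basic ANCOVA model assumes that the relationship between y and x is a straight line, and that the slope (gradient) of this relationship does not differ between the two groups. In other words, in the notation of linear regression, we have:

$$
\begin{aligned}
y_1 &= \alpha_1 + \beta x + e_1 \quad \text{in group 1} \\
y_2 &= \alpha_2 + \beta x + e_2 \quad \text{in group 2}
\end{aligned}
\qquad [5]
$$

where:

$\alpha_1$: the mean value of y when x = 0 in group 1;
$\alpha_2$: the mean value of y when x = 0 in group 2;
$\beta$: the slope of the relationship between y and x;
$e_1$ and $e_2$: random variables with mean 0 and variance $\sigma^2$.

Let $\bar{x}$ be the mean age of both groups together, and $\bar{x}_1$ and $\bar{x}_2$ the mean ages of group 1 and group 2. As noted above, if $\bar{x}_1 \neq \bar{x}_2$, a comparison of the mean heights of groups 1 and 2 ($\bar{y}_1$ and $\bar{y}_2$) is not objective, because

$$
\begin{aligned}
\bar{y}_1 &= \alpha_1 + \beta \bar{x}_1 + \bar{e}_1 \\
\bar{y}_2 &= \alpha_2 + \beta \bar{x}_2 + \bar{e}_2
\end{aligned}
$$

and, apart from random error, the difference between the two groups now depends on the coefficient $\beta$:

$$\bar{y}_1 - \bar{y}_2 = \alpha_1 - \alpha_2 + \beta(\bar{x}_1 - \bar{x}_2)$$

Note that in model [5] we can interpret $\alpha_1 - \alpha_2$ as the difference in mean height between the two groups if both groups had the same mean age. This difference expresses the effect of group when no other factor related to y is at play. Hence, to estimate $\alpha_1 - \alpha_2$, we cannot simply subtract the two means, $\bar{y}_1 - \bar{y}_2$; we must adjust for x. Let $x^*$ be a common value for both groups; the adjusted value of y for group 1 (denoted $\bar{y}_{1a}$) can be estimated as follows:

$$\bar{y}_{1a} = \bar{y}_1 - \beta(\bar{x}_1 - x^*)$$

$\bar{y}_{1a}$ can be regarded as an estimate of the mean height of group 1 (urban) at x equal to $x^*$. Similarly,

$$\bar{y}_{2a} = \bar{y}_2 - \beta(\bar{x}_2 - x^*)$$

is the estimate of the mean height of group 2 (rural) at the same value $x^*$. From this, we can estimate the urban versus rural effect with the following formula:

$$\bar{y}_{1a} - \bar{y}_{2a} = \bar{y}_1 - \bar{y}_2 - \beta(\bar{x}_1 - \bar{x}_2)$$

The problem, then, is to estimate $\beta$. It can be shown that the least squares estimate also yields an unbiased estimate of $\alpha_1 - \alpha_2$. Written as a linear model, the analysis of covariance model can be described as follows:

$$y = \alpha + \beta x + \gamma g + \delta(xg) + e \qquad [6]$$

In other words, this model states that the height of a student is influenced by 3 factors: age ($\beta$), urban or rural residence ($\gamma$), and the interaction between the two ($\delta$). If $\delta = 0$ (that is, the interaction effect is not statistically significant), the model reduces to:

$$y = \alpha + \beta x + \gamma g + e \qquad [7]$$

If $\gamma = 0$ (that is, the effect of residence is not statistically significant), the model reduces further to:

$$y = \alpha + \beta x + e \qquad [8]$$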
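The adjustment formula can be tried out numerically on the height data of Example 3. The sketch below assumes that the common slope is taken from the least squares fit of the model without interaction; the names `fit`, `beta`, `ybar`, `xbar` and `adj_diff` are introduced here for illustration.

```r
age <- c(109,113,115,116,119,120,121,124,126,129,130,133,134,135,
         137,139,141,142,
         121,121,128,129,131,132,133,134,138,138,138,140,140,140)
height <- c(137.6,147.8,136.8,140.7,132.7,145.4,135.0,133.0,148.5,
            148.3,147.5,148.8,133.2,148.7,152.0,150.6,165.3,149.9,
            139.0,140.9,134.9,149.5,148.7,131.0,142.3,139.9,142.9,
            147.7,147.7,134.6,135.8,148.5)
group <- factor(rep(1:2, times = c(18, 14)))   # 1 = urban, 2 = rural

# Common slope estimated by least squares (model with group and age)
fit  <- lm(height ~ group + age)
beta <- unname(coef(fit)["age"])

# Group means of height and age
ybar <- tapply(height, group, mean)
xbar <- tapply(age, group, mean)

# Adjusted difference: (ybar1 - ybar2) - beta * (xbar1 - xbar2)
adj_diff <- unname((ybar["1"] - ybar["2"]) - beta * (xbar["1"] - xbar["2"]))
adj_diff
```

The adjusted difference equals the negative of the `group2` coefficient of the fitted model, because R reports the effect of group 2 relative to group 1.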

11.5.2 Analysis with R

The discussion above may seem rather complicated, but in practice the estimation is very simple in R with the lm function. We will fit the three models [6], [7] and [8]:
> # model 6
> model6 <- lm(height ~ group + age + group:age)
> # model 7
> model7 <- lm(height ~ group + age)
> # model 8
> model8 <- lm(height ~ age)

We can also compare all three models at once with the anova command as follows:
> anova(model6, model7, model8)
Analysis of Variance Table

Model 1: height ~ group + age + group:age
Model 2: height ~ group + age
Model 3: height ~ age
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)
1     28 1270.44
2     29 1338.02 -1    -67.57 1.4893 0.23251
3     30 1545.95 -1   -207.93 4.5827 0.04114 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that model 1 is model [6], model 2 is model [7], and model 3 is model [8]. RSS is the residual sum of squares of each model. The results of the analysis show the following:

The whole sample has 18 + 14 = 32 students and model [6] has 4 parameters ($\alpha$, $\beta$, $\gamma$ and $\delta$), so this model has 32 - 4 = 28 degrees of freedom. Its residual sum of squares is 1270.44.

Model [7] has 3 parameters (leaving 29 degrees of freedom), so its residual sum of squares (1338.02) is higher than that of model [6]. From a statistical standpoint, however, the residual mean square of this model, 1338.02 / 29 = 46.14, differs little from that of model [6] (1270.44 / 28 = 45.37), and p = 0.2325, which is not statistically significant. In other words, dropping the interaction coefficient does not materially change the predictive ability of the model.

Model [8] has only 2 parameters (and therefore 30 degrees of freedom), with a residual sum of squares of 1545.95. The residual mean square of this model is 51.53 (1545.95 / 30), which is significantly higher than that of model [7], with p = 0.0411.

From this analysis we see that model [7] is the best choice: with only 3 parameters it describes the data adequately. We will now concentrate on the results of this model.
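The F statistics in the model comparison table can be rebuilt from the residual sums of squares. In the sketch below, the names `rss`, `ms_full`, `f_76`, `f_87` and their p-values are illustrative; the calculation assumes, as anova() does when several models are compared, that the denominator is the residual mean square of the largest model.

```r
age <- c(109,113,115,116,119,120,121,124,126,129,130,133,134,135,
         137,139,141,142,
         121,121,128,129,131,132,133,134,138,138,138,140,140,140)
height <- c(137.6,147.8,136.8,140.7,132.7,145.4,135.0,133.0,148.5,
            148.3,147.5,148.8,133.2,148.7,152.0,150.6,165.3,149.9,
            139.0,140.9,134.9,149.5,148.7,131.0,142.3,139.9,142.9,
            147.7,147.7,134.6,135.8,148.5)
group <- factor(rep(1:2, times = c(18, 14)))

model6 <- lm(height ~ group + age + group:age)
model7 <- lm(height ~ group + age)
model8 <- lm(height ~ age)

# Residual sums of squares of the three models
rss <- c(m6 = deviance(model6), m7 = deviance(model7), m8 = deviance(model8))

# Residual mean square of the largest model (28 degrees of freedom)
ms_full <- rss[["m6"]] / df.residual(model6)

# Each step drops one parameter, so the numerator has 1 degree of freedom
f_76 <- (rss[["m7"]] - rss[["m6"]]) / ms_full   # dropping the interaction
f_87 <- (rss[["m8"]] - rss[["m7"]]) / ms_full   # dropping group

p_76 <- pf(f_76, 1, df.residual(model6), lower.tail = FALSE)
p_87 <- pf(f_87, 1, df.residual(model6), lower.tail = FALSE)

round(c(F_interaction = f_76, p_interaction = p_76, F_group = f_87, p_group = p_87), 4)
```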
> summary(model7)
Call:
lm(formula = height ~ group + age)

Residuals:
    Min      1Q  Median      3Q     Max
-14.324  -3.285   0.879   3.956  14.866

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  91.8171    17.9294   5.121 1.81e-05 ***
group2       -5.4663     2.5749  -2.123  0.04242 *
age           0.4157     0.1408   2.953  0.00619 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.793 on 29 degrees of freedom
Multiple R-Squared: 0.2588,     Adjusted R-squared: 0.2077
F-statistic: 5.063 on 2 and 29 DF,  p-value: 0.01300

From the parameter estimates shown above, we see that on average the height of students increases by about 0.42 cm for each month of age. Note that in this output, group2 is the regression coefficient for group 2 (the rural group); R sets the coefficient for group 1 to 0 for computational convenience. We therefore have two equations (two regression lines) for the two groups of students, as follows:
For urban students:
Height = 91.817 + 0.4157(age)

And for rural students (rural = 1):
Height = 91.817 - 5.4663(rural) + 0.4157(age)

In other words, after adjusting for age, the rural students are shorter than the urban students by about 5.5 cm, and this difference is statistically significant with p = 0.0424. (Note that before adjusting for age, the difference was about 2.9 cm.)
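The two regression lines can also be read off with predict(). This is a minimal sketch; the data frame names `d` and `new_students`, the objects `pred` and `gap`, and the choice of 130 months as the comparison age are assumptions made for illustration.

```r
d <- data.frame(
  group  = factor(rep(1:2, times = c(18, 14))),   # 1 = urban, 2 = rural
  age    = c(109,113,115,116,119,120,121,124,126,129,130,133,134,135,
             137,139,141,142,
             121,121,128,129,131,132,133,134,138,138,138,140,140,140),
  height = c(137.6,147.8,136.8,140.7,132.7,145.4,135.0,133.0,148.5,
             148.3,147.5,148.8,133.2,148.7,152.0,150.6,165.3,149.9,
             139.0,140.9,134.9,149.5,148.7,131.0,142.3,139.9,142.9,
             147.7,147.7,134.6,135.8,148.5)
)

fit <- lm(height ~ group + age, data = d)

# One hypothetical urban and one hypothetical rural student, both aged 130 months
new_students <- data.frame(
  group = factor(c("1", "2"), levels = levels(d$group)),
  age   = c(130, 130)
)

pred <- predict(fit, newdata = new_students)
gap  <- unname(pred[2] - pred[1])   # rural minus urban at the same age
pred
gap
```

Because the model has no interaction term, the gap is the same at every age and equals the `group2` coefficient.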
The following plots illustrate the models above:
> par(mfrow=c(2,2))
> # Model 1: separate means, no dependence on age
> plot(age, height, pch=as.character(group), main="Model 1")
> abline(144.54, 0)    # mean value for urban
> abline(141.67, 0)    # mean value for rural
> # Model 2: one line for both groups
> plot(age, height, pch=as.character(group), main="Model 2")
> abline(102.63, 0.3138)    # single line for dependence on age
> # Model 3: parallel lines
> plot(age, height, pch=as.character(group), main="Model 3")
> abline(91.8, 0.416)    # line for urban
> abline(91.8-5.46, 0.416)    # line for rural, parallel to urban
> # Model 4: separate slopes (interaction)
> plot(age, height, pch=as.character(group), main="Model 4")
> abline(79.7, 0.511)    # line for urban
> abline(79.7+47.08, 0.511-0.399)    # line for rural, different slope
> par(mfrow=c(1,1))

[Plot: four panels titled Model 1 to Model 4, each showing height (y-axis) against age (x-axis), with points labelled 1 (urban) or 2 (rural) and the fitted lines of the corresponding model]

Figure 11.6. Model 1: height is a function of place of residence but is unrelated to age. Model 2 assumes that height depends on age but does not differ between the urban and rural groups. Model 3 assumes that the relationship between height and age is the same in the urban and rural groups (two parallel lines), but that urban students are taller than rural students. Model 4 assumes that the difference in height between the two groups depends on age (that is, there is an interaction between age and place of residence): below 120 months of age the two groups differ little, but above 120 months the urban students are taller than the rural students. The analysis above shows that model 3, which corresponds to equation [7], is the best.

11.6 Analysis of variance for factorial experiments

Example 4. To study the effect of 4 pesticides (1, 2, 3 and 4) and three varieties (B1, B2 and B3) on the yield of oranges, the researchers carried out a factorial experiment. For each variety, 4 orange trees were selected at random, and the 4 pesticides were applied (also at random), one to each tree. The results (orange yield) for each variety and pesticide are as follows:

Table 11.5. Orange yield for 3 varieties and 4 pesticides

Variety            Pesticide                    Total
                1      2      3      4
B1             29     50     43     53          175
B2             41     58     42     73          214
B3             66     85     69     85          305
Total         136    193    154    211          694

The model for a factorial experiment is no different from the two-way analysis of variance presented in the previous section. Specifically, the model we consider is:

product = $\alpha$ + $\beta$(variety) + $\gamma$(pesticide) + $\epsilon$

where $\alpha$ is a constant representing the overall sample mean, $\beta$ is the effect of the three orange varieties, $\gamma$ is the effect of the 4 pesticides, and $\epsilon$ is the residual of the model.
We can use the aov function in R to estimate these parameters as follows:
> # first we enter the data
> variety <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
> pesticide <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
> product <- c(29,50,43,53,41,58,42,73,66,85,69,85)
> # define variety and pesticide as factors
> variety <- as.factor(variety)
> pesticide <- as.factor(pesticide)
> # put them into a data frame named data
> data <- data.frame(variety, pesticide, product)
> # analysis of variance with aov, stored in the object analysis
> analysis <- aov(product ~ variety + pesticide)
> anova(analysis)
Analysis of Variance Table

Response: product
          Df  Sum Sq Mean Sq F value   Pr(>F)
variety    2 2225.17 1112.58  44.063 0.000259 ***
pesticide  3 1191.00  397.00  15.723 0.003008 **
Residuals  6  151.50   25.25
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These results show that both factors, variety and pesticide, affect orange yield, because both p-values are below 0.05. For specific pairwise comparisons, we use the TukeyHSD function as follows:
> TukeyHSD(analysis)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = product ~ variety + pesticide)

$variety
     diff       lwr      upr     p adj
2-1  9.75 -1.152093 20.65209 0.0749103
3-1 32.50 21.597907 43.40209 0.0002363
3-2 22.75 11.847907 33.65209 0.0016627

$pesticide
    diff        lwr       upr     p adj
2-1   19   4.797136 33.202864 0.0140509
3-1    6  -8.202864 20.202864 0.5106152
4-1   25  10.797136 39.202864 0.0036109
3-2  -13 -27.202864  1.202864 0.0704233
4-2    6  -8.202864 20.202864 0.5106152
4-3   19   4.797136 33.202864 0.0140509

The comparisons among varieties show that variety B3 has a higher yield than variety B1 by about 32.5 units, with a 95% confidence interval from 21.6 to 43.4 (p = 0.0002). Variety B3 is also better than variety B2, with a mean difference of about 22.8 units (p = 0.0017). There is, however, no significant difference between varieties B2 and B1.
Comparing the pesticides, the results show that pesticide 4 is more effective than pesticides 1 and 3. In addition, pesticide 2 is more effective than pesticide 1. The other comparisons are not statistically significant. The following Tukey plot illustrates these conclusions.
> plot(TukeyHSD(analysis), ordered=TRUE)

[Plot: 95% family-wise confidence intervals for the pairwise differences 2-1, 3-1, 4-1, 3-2, 4-2 and 4-3; x-axis: differences in mean levels of pesticide]
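To see where the Tukey intervals come from, the half-width for the variety comparisons can be computed by hand. The sketch below assumes a balanced design with 4 observations per variety; `mse`, `n_per_variety`, `half_width` and `tk` are illustrative names.

```r
variety   <- factor(rep(1:3, each = 4))
pesticide <- factor(rep(1:4, times = 3))
product   <- c(29,50,43,53, 41,58,42,73, 66,85,69,85)

analysis <- aov(product ~ variety + pesticide)

# Residual mean square and its degrees of freedom
mse    <- deviance(analysis) / df.residual(analysis)
df_res <- df.residual(analysis)

# Tukey half-width for comparing 3 variety means, each based on 4 observations:
# studentized range quantile times sqrt(MSE / n)
n_per_variety <- 4
half_width <- qtukey(0.95, nmeans = 3, df = df_res) * sqrt(mse / n_per_variety)

# Compare with the interval reported by TukeyHSD for varieties 3 and 1
tk <- TukeyHSD(analysis)$variety
c(by_hand = half_width, from_TukeyHSD = unname(tk["3-1", "upr"] - tk["3-1", "diff"]))
```

The same half-width applies to every pair of varieties because each variety mean rests on the same number of observations; the pesticide comparisons use 4 means with 3 observations each and therefore have a wider interval.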

11.7 Analysis of variance for Latin square experiments

Example 5. To compare the effectiveness of 2 fertilizers (A and B) together with 2 cultivation methods (a and b), the researchers carried out a Latin square experiment. There are 4 treatment combinations formed from the two fertilizers and the two cultivation methods: Aa, Ab, Ba and Bb (coded, respectively, as 1=Aa, 2=Ab, 3=Ba, 4=Bb). The four treatments were applied in 4 field samples (sample = 1, 2, 3, 4) and 4 crop varieties (variety = 1, 2, 3, 4). In total, the experiment has 4x4 = 16 observations. The outcome is yield, and the results are summarized in the following table:

Table 11.6. Yield for 2 fertilizers and 2 cultivation methods

Field sample                 Variety
(sample)          1          2          3          4
1              175 Aa     143 Ba     128 Bb     166 Ab
2              170 Ab     178 Aa     140 Ba     131 Bb
3              135 Bb     173 Ab     169 Aa     141 Ba
4              145 Ba     136 Bb     165 Ab     173 Aa

The question is whether the cultivation methods and fertilizers affect yield. To answer this question, we must consider the sources that cause yield to vary. Looking at the experiment and the data table above, it is easy to identify 3 main sources of variation:

The first source is the difference among the cultivation methods and fertilizers;
The second source is the difference among the varieties;
The third source is the difference among the field samples.

What remains is the variation within each field sample and variety. To get an overview of the data, let us compute the mean for each group, as in the following table:

         Mean for each    Mean for each    Mean for each
         variety          field sample     method
1        156.25           153.00           173.75
2        157.50           154.75           168.50
3        150.50           154.50           142.25
4        152.75           154.75           132.50
Overall  154.25           154.25           154.25

This summary table allows us to compute the sum of squares for each source of variation. We start with the sum of squares for the whole experiment (which I will call SStotal):

Total sum of squares for the whole experiment:

$$SS_{total} = (175 - 154.25)^2 + (143 - 154.25)^2 + \dots + (165 - 154.25)^2 + (173 - 154.25)^2 = 4941$$

Sum of squares due to differences among varieties (SSvariety). Note that because the mean of each variety is computed from 4 values, we must multiply by 4 when computing the sum of squares:

$$SS_{variety} = 4(156.25 - 154.25)^2 + 4(157.50 - 154.25)^2 + 4(150.50 - 154.25)^2 + 4(152.75 - 154.25)^2 = 123.5$$

Because there are 4 varieties and one parameter, the degrees of freedom are 4 - 1 = 3. The mean square is therefore 123.5 / 3 = 41.2.

Sum of squares due to differences among field samples (SSsample). Note that because the mean of each sample is computed from 4 values, we must multiply by 4 when computing the sum of squares:

$$SS_{sample} = 4(153.00 - 154.25)^2 + 4(154.75 - 154.25)^2 + 4(154.50 - 154.25)^2 + 4(154.75 - 154.25)^2 = 8.5$$

Because there are 4 samples and one parameter, the degrees of freedom are 4 - 1 = 3, and the mean square is 8.5 / 3 = 2.8.

Sum of squares due to differences among methods (SSmethod). Note that because the mean of each method is computed from 4 values, we must multiply by 4 when computing the sum of squares:

$$SS_{method} = 4(173.75 - 154.25)^2 + 4(168.50 - 154.25)^2 + 4(142.25 - 154.25)^2 + 4(132.50 - 154.25)^2 = 4801.5$$

Because there are 4 methods and one parameter, the degrees of freedom are 4 - 1 = 3, and the mean square is 4801.5 / 3 = 1600.5.

Residual sum of squares:

$$SS_{residual} = SS_{total} - SS_{method} - SS_{sample} - SS_{variety} = 4941.0 - 4801.5 - 8.5 - 123.5 = 7.5$$
These estimates can be presented in an analysis of variance table as follows:

Source of variation        Degrees of   Sum of     Mean       F test
                           freedom      squares    square
Among 4 field samples          3           8.5        2.8        2.3
Among 4 varieties              3         123.5       41.2       32.9
Among 4 methods                3        4801.5     1600.5     1280.4
Residual                       6           7.5       1.25
Total                         15        4941.0

From this simple hand calculation, we can readily see that cultivation method and variety have a large effect on yield. To compute the p-values exactly, we can use R to carry out the analysis of variance for the Latin square experiment.
Organizing the data in a form that R can work with is very important. In short, each data value must be unique, in the sense that it carries its own "identity card". In this experiment we have 4 varieties and 4 samples, so there are 16 data values in total. Each of these 16 values must be identified by variety, by sample and, more importantly, by cultivation method. For example, in Table 11.6 above, 175 is the yield for cultivation method 1 (Aa), variety 1 and sample 1; whereas 173 (in the bottom right corner of the table) is the yield for cultivation method 1, but from variety 4 and sample 4; and so on.

First, we enter the yield data and call it y:

> y <- c(175, 143, 128, 166,
         170, 178, 140, 131,
         135, 173, 169, 141,
         145, 136, 165, 173)

Next, let variety be the variety with 4 levels (1,2,3,4) for each value in y (and also define variety as a factor, that is, a categorical variable):

> variety <- c(1,2,3,4,
               1,2,3,4,
               1,2,3,4,
               1,2,3,4)
> variety <- as.factor(variety)

Let sample be the field sample with 4 levels (1,2,3,4) for each value in y (and also define sample as a factor):

> sample <- c(1,1,1,1,
              2,2,2,2,
              3,3,3,3,
              4,4,4,4)
> sample <- as.factor(sample)

Enter the data for the method, method, also with 4 levels (1,2,3,4) for each value in y (and also define method as a factor):

> method <- c(1, 3, 4, 2,
              2, 1, 3, 4,
              4, 2, 1, 3,
              3, 4, 2, 1)
> method <- as.factor(method)

Combine all of these into a data frame called data:

> data <- data.frame(sample, variety, method, y)

Print data to check that the values are correct and properly organized:

> data
   sample variety method   y
1       1       1      1 175
2       1       2      3 143
3       1       3      4 128
4       1       4      2 166
5       2       1      2 170
6       2       2      1 178
7       2       3      3 140
8       2       4      4 131
9       3       1      4 135
10      3       2      2 173
11      3       3      1 169
12      3       4      3 141
13      4       1      3 145
14      4       2      4 136
15      4       3      2 165
16      4       4      1 173
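Beyond printing the data frame, one can check the defining property of a Latin square: every method appears exactly once in each field sample and exactly once in each variety. The following is a small sketch; `sample_id`, `once_per_sample` and `once_per_variety` are illustrative names.

```r
variety   <- factor(rep(1:4, times = 4))
sample_id <- factor(rep(1:4, each = 4))   # named sample_id to avoid masking sample()
method    <- factor(c(1, 3, 4, 2,
                      2, 1, 3, 4,
                      4, 2, 1, 3,
                      3, 4, 2, 1))

# Cross-tabulations: every cell should equal 1 in a valid Latin square
once_per_sample  <- all(table(sample_id, method) == 1)
once_per_variety <- all(table(variety, method) == 1)

c(once_per_sample = once_per_sample, once_per_variety = once_per_variety)
```

A FALSE in either position would point to a data entry error in the method vector.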

We are now ready to use the lm or aov function to analyse the data. Here I will use the aov function to compute the sources of variation described above (the results are stored in the object latin):
> latin <- aov(y ~ sample + variety + method)
> summary(latin)
            Df Sum Sq Mean Sq   F value    Pr(>F)
sample       3    8.5     2.8    2.2667 0.1810039
variety      3  123.5    41.2   32.9333 0.0004016 ***
method       3 4801.5  1600.5 1280.4000 8.293e-09 ***
Residuals    6    7.5     1.3
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All of these results are (of course) the same as those we summarized by hand in the analysis of variance table above. Here, however, R also gives us the p-values (under Pr(>F)) for statistical inference. From the p-values, we can state that the field sample has no effect on yield, whereas variety and cultivation method do affect yield.
To find the size of the differences among the cultivation methods and among the varieties, we use the TukeyHSD function as follows:
> TukeyHSD(latin)
$variety
     diff        lwr        upr     p adj
2-1  1.25 -1.4867231  3.9867231 0.4528549
3-1 -5.75 -8.4867231 -3.0132769 0.0014152
4-1 -3.50 -6.2367231 -0.7632769 0.0173206
3-2 -7.00 -9.7367231 -4.2632769 0.0004803
4-2 -4.75 -7.4867231 -2.0132769 0.0038827
4-3  2.25 -0.4867231  4.9867231 0.1034761

$method
      diff        lwr        upr     p adj
2-1  -5.25  -7.986723  -2.513277 0.0023016
3-1 -31.50 -34.236723 -28.763277 0.0000001
4-1 -41.25 -43.986723 -38.513277 0.0000000
3-2 -26.25 -28.986723 -23.513277 0.0000004
4-2 -36.00 -38.736723 -33.263277 0.0000000
4-3  -9.75 -12.486723  -7.013277 0.0000730

The comparisons among varieties show differences between varieties 3 and 1, 4 and 1, 3 and 2, and 4 and 2.
All comparisons among the cultivation methods are statistically significant. But which method gives the highest yield? To answer this question, we use a box plot:
> boxplot(y ~ method,
          xlab="Methods (1=Aa, 2=Ab, 3=Ba, 4=Bb)",
          ylab="Production")

[Plot: box plots of Production (y-axis, 130 to 180) for each of the four methods (x-axis: Methods, 1=Aa, 2=Ab, 3=Ba, 4=Bb)]

Figure. Comparison of yield among the four cultivation methods.

11.8 Analysis of variance for crossover experiments

Example 6. To test the effect of a new drug on sweating (the drug was developed to treat heart disease, but sweating is a side effect), the researchers carried out a study in 16 patients. The patients were randomly divided into 2 groups (called AB and BA), each with 8 patients. Patients were followed over two periods: month 1 and month 2. Patients in group AB were treated with the drug in the first month and given placebo in the second month. Conversely, patients in group BA received placebo in the first month and the drug in the second month. The outcome is the time to sweating on the forehead (measured from taking the drug until sweating begins) after taking the drug or placebo. The results of the study are presented in the following table:

Table 11.7. Results of the study of the sweating effect of a heart disease drug

Group   Patient   Time (minutes) to sweating on the forehead
        id        Month 1        Month 2
AB                A              Placebo
         1         6              4
         3         8              7
         5        12              6
         6         7              8
         9         9             10
        10         6              4
        13        11              6
        15         8              8
BA                Placebo        A
         2         5              7
         4         9              6
         7         7             11
         8         4              7
        11         9              8
        12         5              4
        14         8              9
        16         9             13

The main question is whether the time to sweating differs between treatment with the drug and with placebo.
To answer this question, we need to carry out an analysis of variance. However, because the study design is rather special (two groups of patients receiving the interventions in two different orders), the methods of analysis described so far cannot be applied directly. A commonly used approach is to analyse the variance within each group, and then compare the two groups. One issue that we must keep in mind is the possibility of a carry-over effect: in group AB, the outcome in month 2 may still be influenced by month 1, when the patients were treated with the active drug. First, let us summarize the data in the following table:
Table 11.8. Summary of the results of the study of the sweating effect of a heart disease drug

Group   Patient   Time (minutes) to sweating      Mean for
        id        Month 1        Month 2          each patient
AB                A              Placebo
         1         6              4                5.0
         3         8              7                7.5
         5        12              6                9.0
         6         7              8                7.5
         9         9             10                9.5
        10         6              4                5.0
        13        11              6                8.5
        15         8              8                8.0
Mean               8.375          6.625            7.50
BA                Placebo        A
         2         5              7                6.0
         4         9              6                7.5
         7         7             11                9.0
         8         4              7                5.5
        11         9              8                8.5
        12         5              4                4.5
        14         8              9                8.5
        16         9             13               11.0
Mean               7.000          8.125            7.5625
Mean of both groups  7.6875       7.3750           7.5312

Mean for treatment A = (8.375 + 8.125) / 2 = 8.25
Mean for treatment P (placebo) = (6.625 + 7.000) / 2 = 6.8125

From this summary table, we can compute several sums of squares:

Sum of squares due to the difference between the drug and placebo treatments:

$$SS_{Treat} = 16(8.25 - 7.5312)^2 + 16(6.8125 - 7.5312)^2 = 16.53$$

Sum of squares due to the difference between month 1 and month 2:

$$SS_{Period} = 16(7.6875 - 7.5312)^2 + 16(7.3750 - 7.5312)^2 = 0.781$$

Sum of squares due to the difference between the two groups AB and BA (sequence):

$$SS_{seq} = 16(7.50 - 7.5312)^2 + 16(7.5625 - 7.5312)^2 = 0.031$$

Sum of squares due to differences among patients within the same group, AB or BA. Each patient mean is based on 2 measurements, so the squared deviations are multiplied by 2:

$$SS_w = 2\left[(5.0 - 7.50)^2 + (7.5 - 7.50)^2 + (9.0 - 7.50)^2 + \dots + (8.0 - 7.50)^2 + (6.0 - 7.5625)^2 + (7.5 - 7.5625)^2 + (9.0 - 7.5625)^2 + \dots + (11.0 - 7.5625)^2\right] = 103.44$$

Total sum of squares for the whole sample:

$$SS_{total} = (6 - 7.5312)^2 + (8 - 7.5312)^2 + \dots + (13 - 7.5312)^2 + (9 - 7.5312)^2 = 167.97$$

Remaining sum of squares (the residual):

$$SS_{res} = 167.97 - 16.53 - 0.781 - 0.031 - 103.44 = 47.19$$
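These sums of squares can be reproduced in R. The sketch below is an illustration under stated assumptions: the data are entered block by block (AB on drug, AB on placebo, BA on drug, BA on placebo), and the names `seq_grp`, `ss_factor` and the `ss_*` objects are hypothetical. `seq_grp` is used rather than `seq` only to avoid masking R's built-in seq() function.

```r
# Time to sweating, in blocks of 8 patients:
# AB on drug (month 1), AB on placebo (month 2),
# BA on drug (month 2), BA on placebo (month 1)
y <- c(6, 8, 12, 7, 9, 6, 11, 8,
       4, 7, 6, 8, 10, 4, 6, 8,
       7, 6, 11, 7, 8, 4, 9, 13,
       5, 9, 7, 4, 9, 5, 8, 9)
seq_grp <- factor(rep(1:2, each = 16))            # 1 = AB, 2 = BA
period  <- factor(rep(c(1, 2, 2, 1), each = 8))   # month 1 or month 2
treat   <- factor(rep(c(1, 2, 1, 2), each = 8))   # 1 = drug A, 2 = placebo
ab <- c(1, 3, 5, 6, 9, 10, 13, 15)
ba <- c(2, 4, 7, 8, 11, 12, 14, 16)
id <- factor(c(ab, ab, ba, ba))

grand <- mean(y)

# Sum of squares between the levels of a factor
ss_factor <- function(f) {
  sum(tapply(y, f, length) * (tapply(y, f, mean) - grand)^2)
}

ss_treat  <- ss_factor(treat)
ss_period <- ss_factor(period)
ss_seq    <- ss_factor(seq_grp)
# Between patients within group: all between-patient variation minus the
# part explained by the AB/BA grouping
ss_w      <- ss_factor(id) - ss_seq
ss_total  <- sum((y - grand)^2)
ss_res    <- ss_total - ss_treat - ss_period - ss_seq - ss_w

round(c(treat = ss_treat, period = ss_period, seq = ss_seq,
        within = ss_w, residual = ss_res, total = ss_total), 3)
```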

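At this point, we can set up the analysis of variance table as follows:

Table 11.9. Analysis of variance for the data in Table 11.7

Source of variation             Degrees of   Sum of     Mean       F test
                                freedom      squares    square
Between the two treatments          1         16.53      16.53      4.90
Between the two months              1          0.781      0.781     0.23
Between AB and BA                   1          0.031      0.031     0.004
Within each group (patients)       14        103.44       7.39
Residual                           14         47.19       3.37
Total                              31        167.97

The F value for AB versus BA in this table is 0.031 / 7.39, that is, the sequence effect is tested against the between-patient mean square; the other two F values use the residual mean square.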

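From this analysis, we see that the difference between the drug and placebo is larger than the difference between the two months or between the two groups AB and BA. The F test of the hypothesis that the drug and placebo have the same effect is F = 16.53 / 3.37 = 4.90 with 1 and 14 degrees of freedom. From probability theory, the critical value of F with 1 and 14 degrees of freedom at the 5% significance level is 4.60. Because 4.90 exceeds this value, we can conclude that the time to sweating is longer with the drug than with placebo.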
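The critical value and the corresponding p-value can be obtained from R's F distribution functions. This is a brief sketch using the rounded sums of squares from the hand calculation; `f_obs`, `f_crit` and `p_value` are illustrative names.

```r
# Observed F: treatment mean square over residual mean square (14 df)
f_obs <- 16.53 / (47.19 / 14)

# Critical value of F with 1 and 14 degrees of freedom at the 5% level
f_crit <- qf(0.95, df1 = 1, df2 = 14)

# Upper-tail probability of the observed F
p_value <- pf(f_obs, df1 = 1, df2 = 14, lower.tail = FALSE)

round(c(F_observed = f_obs, F_critical = f_crit, p = p_value), 4)
```

Comparing the observed F with the critical value and comparing the p-value with 0.05 are two ways of stating the same decision.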
All of these hand calculations merely illustrate how analysis of variance works for a crossover experiment. In practice, we can use R to carry out the calculations in the same way as for simpler experiments. The main issue is how to organize the data for analysis. R (like many other software packages) requires the user to enter the data values one by one, and each value must be linked to a patient, a treatment, a month (or period), and a sequence group. This is a very important requirement, because if the data are organized incorrectly, the results of the analysis may be wrong.
In the following, I describe the process step by step:
> # step 1: enter the data and name the object y
> # (order: AB on drug, AB on placebo, BA on drug, BA on placebo)
> y <- c(6,8,12,7,9,6,11,8,
         4,7,6,8,10,4,6,8,
         7,6,11,7,8,4,9,13,
         5,9,7,4,9,5,8,9)
> # step 2: for each value in step 1, indicate group AB or BA (coded 1 and 2)
> seq <- c(1,1,1,1,1,1,1,1,
           1,1,1,1,1,1,1,1,
           2,2,2,2,2,2,2,2,
           2,2,2,2,2,2,2,2)
> seq <- as.factor(seq)
> # step 3: for each value in step 1, indicate month 1 or month 2
> period <- c(1,1,1,1,1,1,1,1,
              2,2,2,2,2,2,2,2,
              2,2,2,2,2,2,2,2,
              1,1,1,1,1,1,1,1)
> period <- as.factor(period)
> # step 4: for each value in step 1, indicate drug A or placebo (coded 1 and 2)
> treat <- c(1,1,1,1,1,1,1,1,
             2,2,2,2,2,2,2,2,
             1,1,1,1,1,1,1,1,
             2,2,2,2,2,2,2,2)
> treat <- as.factor(treat)
> # step 5: for each value in step 1, give the id of the patient
> id <- c(1,3,5,6,9,10,13,15,
          1,3,5,6,9,10,13,15,
          2,4,7,8,11,12,14,16,
          2,4,7,8,11,12,14,16)
> id <- as.factor(id)
> # step 6: build a data frame named data and print it to check once more
> data <- data.frame(seq, period, treat, id, y)
> data
   seq period treat id  y
1    1      1     1  1  6
2    1      1     1  3  8
3    1      1     1  5 12
4    1      1     1  6  7
5    1      1     1  9  9
6    1      1     1 10  6
7    1      1     1 13 11
8    1      1     1 15  8
9    1      2     2  1  4
10   1      2     2  3  7
11   1      2     2  5  6
12   1      2     2  6  8
13   1      2     2  9 10
14   1      2     2 10  4
15   1      2     2 13  6
16   1      2     2 15  8
17   2      2     1  2  7
18   2      2     1  4  6
19   2      2     1  7 11
20   2      2     1  8  7
21   2      2     1 11  8
22   2      2     1 12  4
23   2      2     1 14  9
24   2      2     1 16 13
25   2      1     2  2  5
26   2      1     2  4  9
27   2      1     2  7  7
28   2      1     2  8  4
29   2      1     2 11  9
30   2      1     2 12  5
31   2      1     2 14  8
32   2      1     2 16  9

We are now ready to use the lm function in R to analyse the data. Note that the way lm is used for the analysis of variance of a crossover experiment is no different from the way it is used for other experiments. The only difference lies in how the data are organized for analysis, as described above. The patient identifier id is included in the model so that the variation among patients is separated from the residual:
> xover <- lm(y ~ treat+seq+period+id)
> anova(xover)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value  Pr(>F)
treat      1  16.531  16.531  4.9046 0.04388 *
seq        1   0.031   0.031  0.0093 0.92466
period     1   0.781   0.781  0.2318 0.63764
id        14 103.438   7.388  2.1921 0.07711 .
Residuals 14  47.187   3.371
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This analysis confirms once more the manual calculation presented earlier. In short, the difference between the drug and the placebo is statistically significant, with F = 4.90 and p = 0.044.
We can also request the 95% confidence interval for the difference between the two groups with the TukeyHSD command, as follows (note that TukeyHSD works with the aov function, not with lm):
> TukeyHSD(aov(y ~ treat+seq+period+id))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = y ~ treat + seq + period + id)

$treat
       diff       lwr         upr     p adj
2-1 -1.4375 -2.829658 -0.04534186 0.0438783

$seq
      diff       lwr      upr    p adj
2-1 0.0625 -1.329658 1.454658 0.924656

$period
       diff       lwr      upr     p adj
2-1 -0.3125 -1.704658 1.079658 0.6376395

Note the result:

$treat
       diff       lwr         upr     p adj
2-1 -1.4375 -2.829658 -0.04534186 0.0438783

It shows that, on average, the sweating time in the treated group is about 1.44 minutes longer than in the placebo group, with a 95% confidence interval from 0.05 to 2.8 minutes. The comparisons between the two sequence groups AB and BA (seq) and between month 1 and month 2 (period) are not statistically significant.
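The difference and its interval can be cross-checked by hand. The sketch below uses the y values as listed in the printed data frame, and takes the residual sum of squares (47.187) and its 14 degrees of freedom from the ANOVA table. It relies on the fact that, with only two groups, the Tukey interval coincides with the ordinary t-based interval. The object names are illustrative.

```r
# Values in the order of the printed data frame (rows 1 to 32)
y <- c(6, 8, 12, 7, 9, 6, 11, 8,
       4, 7, 6, 8, 10, 4, 6, 8,
       7, 6, 11, 7, 8, 4, 9, 13,
       5, 9, 7, 4, 9, 5, 8, 9)
treat <- factor(rep(c(1, 2, 1, 2), each = 8))   # 1 = drug A, 2 = placebo

group_means <- tapply(y, treat, mean)
diff_21 <- unname(group_means["2"] - group_means["1"])   # placebo minus drug

# Residual mean square and degrees of freedom read from the ANOVA table
mse     <- 47.187 / 14
df_res  <- 14
se_diff <- sqrt(mse * (1 / 16 + 1 / 16))        # 16 observations per treatment
half_width <- qt(0.975, df_res) * se_diff
ci <- c(lower = diff_21 - half_width, upper = diff_21 + half_width)

diff_21
ci
```

The sign is negative because the difference is computed as group 2 minus group 1, matching the "2-1" row of the TukeyHSD output.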

11.9 Analysis of variance for a repeated measures experiment

Example 7. A pilot study was carried out to assess the efficacy of a new vaccine against rheumatoid arthritis. The study included 8 patients, divided at random into 2 groups. Group 1 consisted of 4 patients treated with the vaccine; group 2 also consisted of 4 patients, who received a placebo (control). The patients were followed for 3 months, and every month they were asked about the state of their disease. Disease status was measured by an index ranging from 0 (no efficacy, disease unchanged) to 10 (complete efficacy, disease gone). The results of the study are summarized in the following table:
Table 11.10. Results of the study of a vaccine against rheumatoid arthritis
                            Disease index by month
Group     Patient (id)   Month 1   Month 2   Month 3
Vaccine        1            6         3         0
               2            7         3         1
               3            4         1         2
               4            8         4         3
Placebo        5            6         5         5
               6            9         4         6
               7            5         3         4
               8            6         2         3

The main question is whether there is any difference between the vaccine and placebo groups.

To keep the analysis of variance for a repeated measures experiment simple, I will avoid mathematical notation and illustrate the method with a few manual calculations that the reader can follow. First, we need to summarize the data by computing the mean for each patient, each treatment group, and each month:
Table 11.11. Summary of the data from the rheumatoid arthritis vaccine study

                                Disease index by month
Treatment group   id        Month 1   Month 2   Month 3    Mean
Vaccine           1            6         3         0       3.000
                  2            7         3         1       3.667
                  3            4         1         2       2.333
                  4            8         4         3       5.000
                  Mean        6.25      2.75      1.50     3.500
                  SD          1.71      1.26      1.29
Placebo           5            6         5         5       5.333
                  6            9         4         6       6.333
                  7            5         3         4       4.000
                  8            6         2         3       3.667
                  Mean        6.50      3.50      4.50     4.833
                  SD          1.73      1.29      1.29
Mean of both groups           6.375     3.125     3.000    4.167

From this table we can see at once that there are 5 sources that make the experimental results vary:
(a) between vaccine and placebo (probably the source we most want to know about);
(b) between the 3 months of follow-up;
(c) between the three months within each treatment group, which statisticians call interaction; in this case, the interaction between treatment group and time;
(d) between patients within the same treatment group;
(e) and finally the residual, the part we cannot explain after accounting for sources (a) to (d).

First is the sum of squares between the two treatment groups (vaccine and placebo), which I will call SS_treat:

$$SS_{treat} = 12(3.500 - 4.167)^2 + 12(4.833 - 4.167)^2 = 10.667$$

Next is the sum of squares between the 3 months of treatment, which I will call SS_time:

$$SS_{time} = 8(6.375 - 4.167)^2 + 8(3.125 - 4.167)^2 + 8(3.000 - 4.167)^2 = 58.583$$

The third source is the sum of squares due to the interaction between treatment and time, which I will call SS_int:

$$SS_{int} = 4(6.25 - 4.167)^2 + 4(2.75 - 4.167)^2 + 4(1.50 - 4.167)^2 + 4(6.50 - 4.167)^2 + 4(3.50 - 4.167)^2 + 4(4.50 - 4.167)^2 - SS_{treat} - SS_{time}$$
$$= 77.833 - 10.667 - 58.583 = 8.583$$

The fourth source is the sum of squares due to patients within each treatment group, which I will call SS_patient(treat). Each patient mean is compared with the mean of that patient's own group (3.500 for vaccine, 4.833 for placebo):

$$SS_{patient(treat)} = 3(3.000 - 3.500)^2 + 3(3.667 - 3.500)^2 + 3(2.333 - 3.500)^2 + 3(5.000 - 3.500)^2$$
$$+ 3(5.333 - 4.833)^2 + 3(6.333 - 4.833)^2 + 3(4.000 - 4.833)^2 + 3(3.667 - 4.833)^2 = 25.333$$

In addition, the total sum of squares for the whole sample is:

$$SS_{total} = (6 - 4.167)^2 + (3 - 4.167)^2 + (0 - 4.167)^2 + \dots + (3 - 4.167)^2 = 115.333$$

From these we can obtain the residual sum of squares:

$$SSE = SS_{total} - SS_{treat} - SS_{time} - SS_{patient(treat)} - SS_{int}$$
$$= 115.333 - 10.667 - 58.583 - 25.333 - 8.583 = 12.167$$
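The manual sums of squares can be reproduced in a few lines of base R. This is a sketch under the assumption that the data are laid out as in Table 11.10; the helper ss_between() and the other object names are hypothetical and are not part of the original text.

```r
# Data of Table 11.10: vaccine patients 1-4, then placebo patients 5-8,
# each block of four values being one month
y <- c(6, 7, 4, 8,  3, 3, 1, 4,  0, 1, 2, 3,
       6, 9, 5, 6,  5, 4, 3, 2,  5, 6, 4, 3)
treat <- factor(rep(c(1, 2), each = 12))
time  <- factor(rep(rep(1:3, each = 4), times = 2))
id    <- factor(c(rep(1:4, 3), rep(5:8, 3)))

grand <- mean(y)

# Hypothetical helper: size-weighted squared deviations of the group means
# from the grand mean
ss_between <- function(g) {
  m <- tapply(y, g, mean)
  n <- tapply(y, g, length)
  sum(n * (m - grand)^2)
}

ss_treat   <- ss_between(treat)
ss_time    <- ss_between(time)
ss_int     <- ss_between(interaction(treat, time)) - ss_treat - ss_time
ss_patient <- ss_between(id) - ss_treat      # patients within treatment group
ss_total   <- sum((y - grand)^2)
sse        <- ss_total - ss_treat - ss_time - ss_patient - ss_int

round(c(treat = ss_treat, time = ss_time, interaction = ss_int,
        patient = ss_patient, total = ss_total, residual = sse), 3)
```

Subtracting ss_treat from the between-patient sum of squares works here because the design is balanced (three measurements per patient, four patients per group).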

At this point we can set up the analysis of variance table:

Source of variation                    Degrees of   Sum of      Mean       F test
                                       freedom      squares     square
Between vaccine and placebo                 1        10.667     10.667      2.53
Patients (within treatment group)           6        25.333      4.222
Between the 3 months                        2        58.583     29.292     28.89
Time by treatment group                     2         8.583      4.292      4.23
Residual                                   12        12.167      1.014       -
Total                                      23       115.333

As the reader can see, all these manual calculations are rather tedious and very error-prone. In R we can obtain the results within a second, once the data have been arranged appropriately. The repeated measures analysis of variance in R proceeds as follows.

First, we enter the data for each patient. As with any statistical software, each value must be accompanied by the variables that identify the patient, the group, and the time:

y <- c(6,7,4,8,
       3,3,1,4,
       0,1,2,3,
       6,9,5,6,
       5,4,3,2,
       5,6,4,3)

For each value above, tell R whether it belongs to the treated group (code 1) or the placebo group (code 2). We should also tell R that treat is a categorical variable (a factor), not a numerical variable:

treat <- c(1,1,1,1,
           1,1,1,1,
           1,1,1,1,
           2,2,2,2,
           2,2,2,2,
           2,2,2,2)
treat <- as.factor(treat)

For each value above, tell R which month it belongs to (codes 1, 2, 3), and define time as a categorical variable:

time <- c(1,1,1,1,
          2,2,2,2,
          3,3,3,3,
          1,1,1,1,
          2,2,2,2,
          3,3,3,3)
time <- as.factor(time)

For each value above, tell R which patient it belongs to (codes 1, 2, 3, ..., 8), and define id as a categorical variable:

id <- c(1,2,3,4, 1,2,3,4, 1,2,3,4,
        5,6,7,8, 5,6,7,8, 5,6,7,8)
id <- as.factor(id)

Put all the variables into a data frame and name it, for example, data. Check once more that the data are arranged as intended. To repeat: before analyzing any data, it is essential to check them carefully to make sure they are organized correctly and appropriately.

data <- data.frame(id, time, treat, y)
data
   id time treat y
1   1    1     1 6
2   2    1     1 7
3   3    1     1 4
4   4    1     1 8
5   1    2     1 3
6   2    2     1 3
7   3    2     1 1
8   4    2     1 4
9   1    3     1 0
10  2    3     1 1
11  3    3     1 2
12  4    3     1 3
13  5    1     2 6
14  6    1     2 9
15  7    1     2 5
16  8    1     2 6
17  5    2     2 5
18  6    2     2 4
19  7    2     2 3
20  8    2     2 2
21  5    3     2 5
22  6    3     2 6
23  7    3     2 4
24  8    3     2 3

We are now ready to run the analysis in R. The main function for the analysis of variance is aov (analysis of variance). Note how one of its arguments is supplied through another function called Error. Inside Error we tell R that each patient (id) belongs to one treatment group and is measured repeatedly over time. This is written Error(id/time). Specifically:
> repeated <- aov(y ~ treat*time + Error(id/time))

This command asks R to fit the model y = treat + time + treat:time (note that treat*time is shorthand for treat + time + treat:time), and to split the residual mean square into two parts: one between patients and one between months within patients (abbreviated by the notation id/time). All results are stored in an object named repeated. We then request a summary of the results from that object:
> summary(repeated)

Error: id
          Df  Sum Sq Mean Sq F value Pr(>F)
treat      1 10.6667 10.6667  2.5263 0.1631
Residuals  6 25.3333  4.2222

Error: id:time
           Df Sum Sq Mean Sq F value    Pr(>F)
time        2 58.583  29.292 28.8904 2.586e-05 ***
treat:time  2  8.583   4.292  4.2329   0.04064 *
Residuals  12 12.167   1.014
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first part of the table shows that the difference between the vaccine group and the placebo group is not statistically significant (p = 0.16). Can we therefore conclude that the vaccine has no effect on rheumatoid arthritis pain?
The answer is no, because the second part of the analysis of variance table shows an interaction between treat and time (p = 0.041). This means that the difference between vaccine and placebo depends on the month of treatment. Indeed, if we look again at Table 11.11, in month 1 the means of the vaccine and placebo groups hardly differ (6.25 and 6.50), but in month 2, and especially in month 3, the difference between the two groups is large (in month 3: 1.50 for the vaccine and 4.50 for the placebo). So the effect in the treated group grows steadily over time, whereas in the placebo group there is little difference between the 3 months. In short, this pilot experiment suggests that the vaccine appears to be effective in reducing pain in patients with rheumatoid arthritis.
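The interaction described above can be made concrete by tabulating the cell means directly. A minimal sketch in base R, using the data of Table 11.10 (object names are illustrative):

```r
y <- c(6, 7, 4, 8,  3, 3, 1, 4,  0, 1, 2, 3,
       6, 9, 5, 6,  5, 4, 3, 2,  5, 6, 4, 3)
treat <- factor(rep(c(1, 2), each = 12))              # 1 = vaccine, 2 = placebo
time  <- factor(rep(rep(1:3, each = 4), times = 2))   # month 1, 2, 3

# Mean index for every treatment-by-month cell
cell_means <- tapply(y, list(treat = treat, time = time), mean)
cell_means

# Placebo minus vaccine, month by month: a gap that changes with time
# is what the treat:time interaction term measures
gap <- cell_means["2", ] - cell_means["1", ]
gap
```

The same means can be drawn with interaction.plot(time, treat, y); lines that are not parallel indicate an interaction.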
***
The sections above show several ways of carrying out an analysis of variance for commonly used experiments. The design and analysis of experiments (experimental design) is a fairly specialized field, and these instructions cannot, and do not aim to, describe every calculation and method for every kind of experiment. In practice, however, experimental methods are used very widely in the empirical sciences. R has a package named nlme (linear and non-linear mixed-effects models) that can also be used for the analyses above and for more complex multivariable and multilevel models. This package can be downloaded free of charge from the R website: http://cran.R-project.org.

CHAPTER XIII

EVENT HISTORY ANALYSIS
(event history or survival analysis)

In the previous three chapters we became familiar with statistical models for continuous dependent variables (such as blood pressure) and categorical variables (such as yes/no, diseased or not diseased). In scientific research, especially in medicine and engineering, the researcher sometimes wants to study effects on dependent variables that involve time. The economist John Maynard Keynes once made a remark related to the subject of this chapter: in the long run we are all dead; the only difference is whether we die sooner or later. So monitoring or describing a categorical variable such as alive or dead is important but imprecise. The more important and more precise variable is the time until the event occurs.
In medical research, including clinical research, investigators usually follow patients over a period of time, sometimes for decades. Events that occur during that time, such as disease or no disease, survival or death, have a definite clinical meaning, but the time until a patient develops the disease or dies is even more important for assessing the effect of a treatment or a risk factor. These times differ between patients. For example, the time from cancer treatment to death varies greatly between patients, and the differences may depend on factors such as age, sex, disease status, and factors that we sometimes cannot or have not yet measured, such as interactions between genes.
The main model that expresses the relationship between the time to disease (or no disease) and risk factors is known as survival analysis. The term survival analysis originated in insurance research, and medical researchers later adopted it for their own field. But, as noted above, alive/dead is not the only such variable; in practice we also have variables such as diseased or not diseased, occurred or not occurred. For this reason psychologists use the term event history analysis, which I find more appropriate than survival analysis. In engineering, yet another term, reliability analysis, is used for the same concept. In this chapter I will use the term event history analysis.

13.1 A model for time-to-event data

Example 1. Time to discontinuation of IUD use. A study examined the effectiveness of a contraceptive device in 18 women aged 18 to 35. Some women stopped using the device because of bleeding; the others continued to use it. The following table gives the time (in weeks) from the start of use to bleeding (that is, discontinuation) or to the end of the study (that is, still in use when the study ended).
Table 13.1. Time to discontinuation, or continued use, of the IUD
Patient id   Time (weeks)   Status (stopped = 1, continuing = 0)
     1            18            0
     2            10            1
     3            13            0
     4            30            1
     5            19            1
     6            23            0
     7            38            0
     8            54            0
     9            36            1
    10           107            1
    11           104            0
    12            97            1
    13           107            0
    14            56            0
    15            59            1
    16           107            0
    17            75            1
    18            93            1

The task is to describe the time to discontinuation of the device. Here "describe" means estimating the median time to discontinuation, or the probability that a woman stops using the device by a given time. The state of continuing to use the device is sometimes called survival.

For the women who stopped using the device, estimating the time is not difficult. The important issue in time-to-event data is that some women are still using the device: we do not know how much longer they will continue, yet the study has to be closed at a predetermined date. Such cases are described by the rather obscure term "censored" (still alive, still using the device, or the event has not yet occurred).

Let T be the time of continued use (sometimes called the survival time). T is a random variable with probability density function f(t) and cumulative distribution function:

$$F(t) = \int_0^t f(s)\,ds$$

This is the probability that an individual has stopped using the device (or has experienced the event) by time t. The complementary function S(t) = 1 - F(t) is usually called the survival function.
Time data T are usually modelled by two functions: the survival function and the hazard function. The survival function, as defined above, is the probability that an individual is still surviving (in the example above, still using the device) at time t. The hazard function, usually written h(t) or \lambda(t), is the instantaneous risk that the individual stops using the device (or experiences the event) at time t, given survival up to t:

$$h(t) = \lim_{\delta t \to 0} \frac{\Pr(t \le T < t + \delta t \mid T \ge t)}{\delta t} = \frac{f(t)}{S(t)}$$

so that h(t)\delta t is the probability that an individual stops using the device within a short interval \delta t, given that the individual has survived to time t. From the relationship

Pr(survive to t + \delta t) = Pr(survive to t) x Pr(survive a further \delta t | survived to t)

we have:

$$1 - F(t + \delta t) = \bigl(1 - F(t)\bigr)\bigl(1 - h(t)\,\delta t\bigr)$$

from which:

$$\delta t\, F'(t) = \bigl(1 - F(t)\bigr)\, h(t)\, \delta t$$

Therefore the hazard function is:

$$h(t) = \frac{f(t)}{1 - F(t)}$$

and the cumulative hazard function is:

$$H(t) = \int_0^t h(u)\,du$$

From the definition h(t) = f(t) / (1 - F(t)) we can write:

$$H(t) = -\log\bigl(1 - F(t)\bigr)$$

Several hazard functions can be used to describe such times. The simplest is a constant, which leads to a Poisson model in which the times follow an exponential distribution:

$$f(t) = \lambda e^{-\lambda t} \qquad (t \ge 0)$$

Hence:

$$F(t) = 1 - e^{-\lambda t}$$

and therefore:

$$h(t) = \lambda$$
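The constant-hazard result can be checked numerically with R's built-in exponential distribution functions. In this sketch the rate of 0.02 per week and the time points are arbitrary illustrative values, not figures from the study.

```r
rate <- 0.02                 # hypothetical constant hazard (per week)
t    <- c(1, 10, 50, 100)    # arbitrary time points

f <- dexp(t, rate = rate)                       # density f(t)
S <- pexp(t, rate = rate, lower.tail = FALSE)   # survival function S(t) = 1 - F(t)

h <- f / S        # hazard h(t) = f(t) / (1 - F(t))
H <- -log(S)      # cumulative hazard H(t) = -log(1 - F(t))

h   # the same value at every t: the hazard equals the rate
H   # grows linearly in t: H(t) = rate * t
```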

The theory above may look rather complicated at first sight, but it is easier to follow with real data. Let us return to the data of Example 1. To make the calculations easier to follow, we first sort the data by time, regardless of whether the time is a discontinuation time or a time of continued use:

10 13* 18* 19 23* 30 36 38* 54* 56* 59 75 93 97 104* 107 107* 107*

In this sequence the symbol "*" marks a censored time (the woman was still using the IUD). The simplest approach is to divide the time from 10 weeks (the shortest) to 107 weeks (the longest) into several intervals, as in the following table:
Table 13.2. Estimating the cumulative probability for each time interval

Index   Interval   Women at start     Women who     Probability of   Probability of     Cumulative
 (t)    (weeks)    of interval (nt)   stopped (dt)  stopping h(t)    continuing (pt)    probability S(t)
  1       0-9           18                0            0.0000           1.0000             1.0000
  2      10-18          18                1            0.0555           0.9445             0.9445
  3      19-29          15                1            0.0667           0.9333             0.8815
  4      30-35          13                1            0.0769           0.9231             0.8137
  5      36-58          12                1            0.0833           0.9167             0.7459
  6      59-74           8                1            0.1250           0.8750             0.6526
  7      75-92           7                1            0.1428           0.8572             0.5594
  8      93-96           6                1            0.1667           0.8333             0.4662
  9      97-106          5                1            0.2000           0.8000             0.3729
 10      107             3                1            0.3333           0.6667             0.2486

In this table:

Column 1 is the time index (denoted t). It has no meaning other than to serve as an index.

Column 2 is the time interval (duration) in weeks. As mentioned above, we divide time into several intervals for the calculation, for example 0 to 9 weeks, 10 to 18 weeks, and so on. Note that we have no observations between 0 and 9 weeks; this interval is included only as the starting point for the later estimates. The division is fairly arbitrary and purely illustrative; in practice the computer does this for us.

Column 3 is the number of subjects nt (here, the number of women) at the start of the interval. For the interval 0-9, at the starting time 0 there are 18 women (equivalently, 18 women were followed for at least 0 weeks). For the interval 10-18, at the starting time 10 there are still 18 women; but for the interval 19-29, at the starting time 19 there are 15 women (namely 19 23* 30 36 38* 54* 56* 59 75 93 97 104* 107 107* 107*); and so on. In other words, this column gives the number of subjects with a follow-up time of at least t. Thus for the interval 97-106 there are 5 women with a follow-up time of 97 weeks or more (97 104* 107 107* 107*).

Column 4 gives the number of women who stopped using the device, dt (the number of events), in the interval. For example, in the interval 10-18 weeks one woman stopped (at 10 weeks); in the interval 19-29 weeks there was also one discontinuation (at 19 weeks), and so on.

Column 5 is the hazard h(t) within the interval. Put simply, h(t) is estimated by dividing dt by nt. For example, in the interval 10-18 one woman stopped out of 18, so the hazard is 1/18 = 0.0555. This probability is estimated for each interval.

Column 6 is the probability of continuing to use the device within the interval, that is, 1 minus h(t) from column 5. This probability is not very informative in itself; it is shown only to make the calculation in the next column easier to follow.

Column 7 is the cumulative probability of continued use S(t) (the cumulative survival probability). This is the most important column for the analysis. Because it is cumulative, it is estimated as the product of two or more probabilities.
For the interval 0-9, the cumulative probability is simply the probability of continuing in column 6 (nobody stopped).
For the interval 10-18, the cumulative probability is the probability of continuing during 0-9 multiplied by the probability of continuing during 10-18: 1.000 x 0.9445 = 0.9445. This estimate means that the probability of still using the device up to week 18 is 94.45%.
Similarly, for the interval 19-29 weeks, the cumulative probability is the cumulative probability up to weeks 10-18 multiplied by the probability of continuing during 19-29: 0.9445 x 0.9333 = 0.8815. That is, the probability of still using the device up to week 29 is 88.15%.
In general, the estimate of S(t) over the first k intervals is

$$\hat{S}(t) = \prod_{t=1}^{k} \frac{n_t - d_t}{n_t}$$

The hat on S(t) is a reminder that this is an estimate. If we write pt for the probability of continuing in interval t (column 6), S(t) can also be computed as:

$$\hat{S}(t) = \prod_{t=1}^{k} p_t$$

The estimation procedure described above is usually called the Kaplan-Meier estimate, and is sometimes also known as the product-limit estimate.
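The product-limit calculation in Table 13.2 can be sketched in base R without any package. The code below is an illustration rather than the author's code: it uses the sorted data of Example 1 and evaluates the estimate only at the weeks in which a discontinuation occurred, which is where the curve changes.

```r
weeks  <- c(10, 13, 18, 19, 23, 30, 36, 38, 54,
            56, 59, 75, 93, 97, 104, 107, 107, 107)
status <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0)  # 1 = stopped

# Distinct weeks at which at least one woman stopped using the device
event_times <- sort(unique(weeks[status == 1]))

# n_t: women still under observation at the start of each event time
n_risk  <- sapply(event_times, function(t) sum(weeks >= t))
# d_t: women who stopped at that time
n_event <- sapply(event_times, function(t) sum(weeks == t & status == 1))

p_t  <- 1 - n_event / n_risk   # probability of continuing through each event time
surv <- cumprod(p_t)           # Kaplan-Meier (product-limit) estimate

data.frame(time = event_times, n.risk = n_risk, n.event = n_event,
           survival = round(surv, 4))
```

The survival column should agree with the last column of Table 13.2 from the second row onward.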

13.2 Kaplan-Meier estimates in R

All the calculations above can of course be carried out in R. R has a package named survival (developed by Terry Therneau and Thomas Lumley) that can be used for event history analysis. In this section I show how to use it.
Returning to Example 1, the first thing to do is to enter the data into R. Before that, however, we must load the survival package into the working environment:
> library(survival)

Next we create two variables: the first holds the times (call it weeks in this case), and the second is an indicator showing whether the woman stopped using the device (value 1) or was still using it (value 0); we call this variable status. We then put both variables into a data frame (called data) for convenience.
> weeks  <- c(10, 13, 18, 19, 23, 30, 36, 38, 54,
              56, 59, 75, 93, 97, 104, 107, 107, 107)
> status <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0)
> data <- data.frame(weeks, status)

We are now ready for the analysis. To obtain Kaplan-Meier estimates we use two functions from the survival package, Surv and survfit. Surv creates a combined variable from the time and the status. For example, with the commands:
> survtime <- Surv(weeks, status==1)
> survtime
 [1]  10   13+  18+  19   23+  30   36   38+  54+  56+  59   75   93   97
[15] 104+ 107  107+ 107+

we obtain survtime, a variable containing the times, with a + sign marking a censored observation (still surviving or, in this case, still using the device). This variable is meaningful only for R's analysis; in practice we probably do not need it directly.
The survfit function is also quite simple. We supply the time and the status indicator in a formula, whose right-hand side ~ 1 indicates a single group, as in this example:
> survfit(Surv(weeks, status==1) ~ 1)

Or, if the object survtime already exists, we simply call:
> survfit(survtime ~ 1)
Call: survfit(formula = survtime ~ 1)

      n  events  median 0.95LCL 0.95UCL
     18       9      93      59     Inf

This output is not very exciting, because it gives information we already know: there were 9 events (discontinuations) among 18 subjects. The median time to discontinuation is 93 weeks, with a 95% confidence interval from 59 weeks to infinity (Inf = infinity). To obtain more detail, we store the result in an object, say kp, and use the summary function:
> kp <- survfit(Surv(weeks, status==1) ~ 1)
> summary(kp)
Call: survfit(formula = Surv(weeks, status == 1) ~ 1)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   10     18       1    0.944  0.0540        0.844        1.000
   19     15       1    0.881  0.0790        0.739        1.000
   30     13       1    0.814  0.0978        0.643        1.000
   36     12       1    0.746  0.1107        0.558        0.998
   59      8       1    0.653  0.1303        0.441        0.965
   75      7       1    0.559  0.1412        0.341        0.917
   93      6       1    0.466  0.1452        0.253        0.858
   97      5       1    0.373  0.1430        0.176        0.791
  107      3       1    0.249  0.1392        0.083        0.745

Part of this output (the columns time, n.risk, n.event, survival) is what we calculated by hand in the table above. In addition, R gives the standard error of S(t) and its 95% confidence interval.
The 95% confidence interval is estimated from the formula

$$\hat{S}(t) \pm 1.96 \times se\bigl[\hat{S}(t)\bigr], \qquad se\bigl[\hat{S}(t)\bigr] = \hat{S}(t)\sqrt{\sum_{t=1}^{k} \frac{d_t}{n_t\,(n_t - d_t)}}$$

This standard error formula is known as Greenwood's formula. We can display the result graphically with the plot function:
> plot(kp,
xlab="Time (weeks)",
ylab="Cumulative survival probability")

[Figure: Kaplan-Meier curve for Example 1. Horizontal axis: Time (weeks), 0 to 100; vertical axis: Cumulative survival probability, 0.0 to 1.0.]

In this figure the horizontal axis is time (in weeks) and the vertical axis is the cumulative probability of still using the device. The middle line is the estimated cumulative probability S(t), and the two dotted lines are its 95% confidence interval. From this analysis we can state that the probability of still using the device at week 107 is about 25%, with a confidence interval from 8% to 74.5%. The wide confidence interval shows that the estimate is highly variable, simply because the number of study subjects is rather small.
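Greenwood's standard error can also be reproduced in base R. The sketch below computes the plain interval given by the formula above and, for comparison, an interval built on the log scale. The log-scale version is included because the limits printed by summary(kp) (for example 0.844 at week 10 and 0.083 to 0.745 at week 107) appear to correspond to it rather than to the plain plus-or-minus formula; treat that reading as an assumption about the software's default.

```r
weeks  <- c(10, 13, 18, 19, 23, 30, 36, 38, 54,
            56, 59, 75, 93, 97, 104, 107, 107, 107)
status <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0)

event_times <- sort(unique(weeks[status == 1]))
n_risk  <- sapply(event_times, function(t) sum(weeks >= t))
n_event <- sapply(event_times, function(t) sum(weeks == t & status == 1))
surv    <- cumprod(1 - n_event / n_risk)

# Greenwood: se[S(t)] = S(t) * sqrt( sum d_t / (n_t (n_t - d_t)) )
gw <- sqrt(cumsum(n_event / (n_risk * (n_risk - n_event))))
se <- surv * gw

z <- qnorm(0.975)
# Plain interval S(t) +/- z * se, truncated to [0, 1]
ci_plain <- cbind(lower = pmax(surv - z * se, 0),
                  upper = pmin(surv + z * se, 1))
# Interval built on the log scale: S(t) * exp(+/- z * gw), truncated at 1
ci_log <- cbind(lower = surv * exp(-z * gw),
                upper = pmin(surv * exp(z * gw), 1))

round(data.frame(time = event_times, survival = surv, std.err = se,
                 ci_plain, log.lower = ci_log[, "lower"],
                 log.upper = ci_log[, "upper"]), 3)
```

The two kinds of interval share the same standard error and differ mainly when S(t) is close to 0 or 1, where the log-scale interval stays inside the admissible range.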

13.3 Comparing two cumulative probability functions: the log-rank test

The analysis above applies to a single group of subjects, and its main aim is to estimate S(t) for each time interval. In practice, many studies aim to compare S(t) between two or more groups. In clinical research, for example, and particularly in studies of cancer treatment, investigators often compare survival times between two groups of patients to assess the efficacy of a treatment.
Example 2. A study of 48 patients with genital herpes was conducted to test the efficacy of a new vaccine (referred to here by the code name gd2). The patients were divided at random into 2 groups: group 1 was treated with gd2 (25 patients), and the remaining 23 patients in group 2 received a placebo. Disease status was followed for 12 months. The following table shows the time (in weeks, abbreviated time) until the disease recurred. In addition, each patient provided the number of infections during the 12 months before entering the study (episodes). Clinical experience suggests that episodes is closely related to the probability of infection (we will return to the analysis of this variable in a later section). The question is whether gd2 is effective in reducing the risk of recurrence.

Table 13.3. Time to infection in patients with herpes, for the gd2 and placebo groups

          gd2 group                          Placebo group
 id  episodes  time  infected       id  episodes  time  infected
  1     12       8      1            2      9      15      1
  3     10      12      0            4     10      44      0
  6      7      52      0            5     12       2      0
  7     10      28      1            9      7       8      1
  8      6      44      1           11      7      12      1
 10      8      14      1           13      7      52      0
 12      8       3      1           16      7      21      1
 14      9      52      1           17     11      19      1
 15     11      35      1           19     16       6      1
 18     13       6      1           21     16      10      1
 20      7      12      1           22      6      15      0
 23     13       7      0           25     15       4      1
 24      9      52      0           27      9       9      0
 26     12      52      0           29     10      27      1
 28     13      36      1           30     17       1      1
 31      8      52      0           32      8      12      1
 33     10       9      1           35      8      20      1
 34     16      11      0           37      8      32      0
 36      6      52      0           38      8      15      1
 39     14      15      1           41     14       5      1
 40     13      13      1           43     13      35      1
 42     13      21      1           45      9      28      1
 44     16      24      0           47     15       6      1
 46     13      52      0
 48      9      28      1

Note: for the variable infected, 1 means infected and 0 means not infected.


Here we have two groups to compare. A simple approach is to estimate S(t) for each group and each time interval, and then compare the two groups with an appropriate statistical test. This approach has a drawback, however: it does not give an overall picture across all time intervals. Moreover, comparing two groups over many different intervals makes the results very hard to interpret.
To overcome these two drawbacks, a method called the log-rank test was developed. It is a non-parametric method for testing the hypothesis that the two groups have the same S(t). The method also divides time into k intervals, t1, t2, t3, ..., tk, where tj (j = 1, 2, 3, ..., k) is the j-th time at which one or more subjects in the two groups combined experience the event. Let dij be the number of patients in group i (i = 1, 2) who develop the disease at tj, and nij the number of patients in group i still under observation at tj. Let dj = d1j + d2j be the total number of patients who develop the disease, and nj = n1j + n2j the total number of patients in the two groups at tj. For j = 1, 2, 3, ..., k we can compute:

$$e_{1j} = \frac{n_{1j}\, d_j}{n_j}, \qquad e_{2j} = \frac{n_{2j}\, d_j}{n_j}, \qquad v_j = \frac{n_{1j}\, n_{2j}\, d_j\,(n_j - d_j)}{n_j^2\,(n_j - 1)}$$

(Here e1j and e2j are the numbers of patients in groups 1 and 2 that we would expect to develop the disease if the probability of disease were the same in both groups, that is, the pooled probability, and vj is the variance.) We can also compute the total number of patients who developed the disease in groups 1 and 2:

$$O_1 = \sum_{j=1}^{k} d_{1j}, \qquad O_2 = \sum_{j=1}^{k} d_{2j}$$

and the total expected number for group 1 under a common probability of disease, together with the total variance:

$$E_1 = \sum_{j=1}^{k} e_{1j}, \qquad V = \sum_{j=1}^{k} v_j$$

Let Ti be a random variable for the time from treatment to disease in group i, and let S_i(t) = Pr(T_i >= t). The log-rank test statistic is defined as:

$$\chi^2 = \frac{(O_1 - E_1)^2}{V}$$

If \chi^2 > \chi^2_{1,\alpha} (where \chi^2_{1,\alpha} is the critical value of the chi-square distribution with 1 degree of freedom at significance level \alpha = 0.05, that is, its 0.95 quantile), we have evidence to conclude that the difference in S(t) between the two groups is statistically significant.
13.4 The log-rank test in R

Example 2 (continued). We return to Example 2 and use R to compute the log-rank test. First we enter the necessary data with the usual commands:
> group <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2)
> episode <- c(12, 10, 7, 10, 6, 8, 8, 9, 11, 13, 7, 13, 9,
               12, 13, 8, 10, 16, 6, 14, 13, 13, 16, 13, 9,
               9, 10, 12, 7, 7, 7, 7, 11, 16, 16, 6, 15,
               9, 10, 17, 8, 8, 8, 8, 14, 13, 9, 15)
> time <- c(8, 12, 52, 28, 44, 14, 3, 52, 35, 6, 12, 7, 52,
            52, 36, 52, 9, 11, 52, 15, 13, 21, 24, 52, 28,
            15, 44, 2, 8, 12, 52, 21, 19, 6, 10, 15, 4, 9, 27, 1,
            12, 20, 32, 15, 5, 35, 28, 6)
> infected <- c(1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
                0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
                1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1,
                1, 1, 0, 1, 1, 1, 1, 1)
> data <- data.frame(group, episode, time, infected)

(a) We use the survfit function to estimate the cumulative probability S(t) for each group of patients and store the result in the object kp.by.group (note how the grouping is specified with ~ group):
> library(survival)
> kp.by.group <- survfit(Surv(time, infected==1) ~ group)
> summary(kp.by.group)
Call: survfit(formula = Surv(time, infected == 1) ~ group)

                group=1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    3     25       1    0.960  0.0392        0.886        1.000
    6     24       1    0.920  0.0543        0.820        1.000
    8     22       1    0.878  0.0660        0.758        1.000
    9     21       1    0.836  0.0749        0.702        0.997
   12     19       1    0.792  0.0829        0.645        0.973
   13     17       1    0.746  0.0902        0.588        0.945
   14     16       1    0.699  0.0958        0.534        0.915
   15     15       1    0.653  0.1001        0.483        0.882
   21     14       1    0.606  0.1033        0.434        0.846
   28     12       2    0.505  0.1080        0.332        0.768
   35     10       1    0.454  0.1083        0.285        0.725
   36      9       1    0.404  0.1074        0.240        0.680
   44      8       1    0.353  0.1052        0.197        0.633
   52      7       1    0.303  0.1016        0.157        0.584

                group=2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     23       1    0.957  0.0425       0.8767        1.000
    4     21       1    0.911  0.0601       0.8004        1.000
    5     20       1    0.865  0.0723       0.7346        1.000
    6     19       2    0.774  0.0889       0.6183        0.970
    8     17       1    0.729  0.0946       0.5650        0.940
   10     15       1    0.680  0.1000       0.5099        0.907
   12     14       2    0.583  0.1067       0.4072        0.835
   15     12       2    0.486  0.1088       0.3132        0.754
   19      9       1    0.432  0.1093       0.2630        0.709
   20      8       1    0.378  0.1082       0.2156        0.662
   21      7       1    0.324  0.1053       0.1712        0.613
   27      6       1    0.270  0.1007       0.1300        0.561
   28      5       1    0.216  0.0939       0.0921        0.506
   35      3       1    0.144  0.0859       0.0447        0.463

The Kaplan-Meier curves for each group can then be drawn as follows:
> plot(kp.by.group,
       xlab="Time",
       ylab="Cum. survival probability",
       col=c("black", "red"))

[Figure: Kaplan-Meier curves by group. Horizontal axis: Time, 0 to 50; vertical axis: Cum. survival probability, 0.0 to 1.0.]

The figure shows fairly clearly that the group treated with gd2 (black line, upper) has a lower probability of infection (or recurrence) than the placebo group (red line, lower). But this analysis does not give a p value on which to base a conclusion.
(b) To obtain a p value, we use the survdiff function:
> survdiff(Surv(time, infected==1) ~ group)
Call:
survdiff(formula = Surv(time, infected == 1) ~ group)

         N Observed Expected (O-E)^2/E (O-E)^2/V
group=1 25       15     20.0      1.26      3.65
group=2 23       17     12.0      2.11      3.65

 Chisq= 3.7  on 1 degrees of freedom, p= 0.056

The log-rank analysis gives p = 0.056. Since p > 0.05, we do not yet have convincing evidence to conclude that gd2 really is effective in reducing the risk of recurrence.

13.5 The Cox model (Cox's proportional hazards model)

The log-rank test allows us to compare S(t) between two or more groups. In practice, however, S(t) or the hazard function h(t) may not only differ between groups but may also be influenced by other factors. The problem is how to estimate the effect of risk factors on h(t). In the study above, for example, the number of previous infections (the variable episode) is thought to affect the risk of recurrence. The question, then, is whether the difference in S(t) between the two groups really exists once we take into account and adjust for the effect of episode.
In the early 1970s, David R. Cox, professor of statistics at Imperial College (London, England), developed a regression-based method to answer this kind of question (D.R. Cox, Regression models and life tables (with discussion), Journal of the Royal Statistical Society series B, 1972; 34:187-220). This method later became known as the Cox model. It is regarded as one of the most important developments in science in general (not only in statistics) in the 20th century. The paper has been cited countless times, because it has influenced scientific research as a whole.
Since a detailed description of the Cox model is beyond the scope of this chapter, I will only sketch its main features so that the reader can grasp the idea. Let x1, x2, x3, ..., xp be p risk factors; each x may be continuous or not. The Cox model states that:

$$h(t) = \lambda(t)\, e^{\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_p x_p}$$

Here h(t) is defined as above (the hazard function), \beta_j (j = 1, 2, 3, ..., p) is the coefficient for the effect of xj, and \lambda(t) is the hazard function when the risk factors x are absent (known as the baseline hazard function). Because the effect of a risk factor xj is usually expressed as a hazard ratio (HR, analogous to the odds ratio in logistic regression), exp(\beta_j) is the HR associated with a one-unit increase in xj.
The coxph function in the survival package can be used to estimate the coefficients \beta_j. In the following command:
> analysis <- coxph(Surv(time, infected==1) ~ group)

we test the effect of the two treatment groups on the hazard function h(t), and the result is stored in the object analysis. To summarize analysis, we use the summary function:
> summary(analysis)
Call:
coxph(formula = Surv(time, infected == 1) ~ group)

  n= 48
       coef exp(coef) se(coef)    z    p
group 0.684      1.98    0.363 1.88 0.06

      exp(coef) exp(-coef) lower .95 upper .95
group      1.98      0.505     0.973      4.04

Rsquare= 0.071   (max possible= 0.986 )
Likelihood ratio test= 3.55  on 1 df,   p=0.0597
Wald test            = 3.55  on 1 df,   p=0.0596
Score (logrank) test = 3.67  on 1 df,   p=0.0553

Remember that the treated group is coded 1 and the placebo group is coded 2. The result therefore shows that when group increases by 1 unit, h(t) increases 1.98-fold (with a 95% confidence interval from 0.97 to 4.04). In other words, the risk of recurrence in the placebo group is nearly twice that in the gd2 group. However, because the 95% confidence interval includes 1 and p = 0.06, we still cannot conclude that this effect is statistically significant.
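The hazard ratio, its confidence interval, and the Wald p value follow directly from the coefficient and its standard error. A short sketch using only the two numbers printed in the output above (0.684 and 0.363); the object names are illustrative:

```r
coef_group <- 0.684   # coefficient for group, read from the coxph output
se_group   <- 0.363   # its standard error

hr <- exp(coef_group)                                        # hazard ratio
ci <- exp(coef_group + c(-1, 1) * qnorm(0.975) * se_group)   # 95% confidence interval
z  <- coef_group / se_group                                  # Wald statistic
p  <- 2 * pnorm(-abs(z))                                     # two-sided p value

round(c(HR = hr, lower = ci[1], upper = ci[2], z = z, p = p), 3)
```

Because the interval is built on the coefficient scale and then exponentiated, it is not symmetric around the hazard ratio.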
We also need to take into account (and adjust for) the effect of the patient's disease history, as measured by the variable episode. To carry out this analysis, we add episode to the coxph call:
> analysis <- coxph(Surv(time, infected==1) ~ group + episode)
> summary(analysis)
Call:
coxph(formula = Surv(time, infected == 1) ~ group + episode)

  n= 48
         coef exp(coef) se(coef)    z      p
group   0.874      2.40   0.3712 2.35 0.0190
episode 0.172      1.19   0.0648 2.66 0.0079

        exp(coef) exp(-coef) lower .95 upper .95
group        2.40      0.417      1.16      4.96
episode      1.19      0.842      1.05      1.35

Rsquare= 0.196   (max possible= 0.986 )
Likelihood ratio test= 10.5  on 2 df,   p=0.00537
Wald test            = 10.4  on 2 df,   p=0.00555
Score (logrank) test = 10.6  on 2 df,   p=0.00489

This analysis gives a different, and probably more accurate, interpretation. The model for h(t) is now:

$$h(t \mid group, episode) = \lambda(t)\, e^{0.874\,(group) + 0.172\,(episode)}$$

If episode is held fixed, the ratio of h(t) between the two groups is:

$$\frac{h(t \mid group = 2)}{h(t \mid group = 1)} = e^{0.874\,(2 - 1)} = 2.40$$

Similarly, if group is held fixed, each one-unit increase in episode multiplies the hazard by exp(0.172) = 1.19.

In other words, each previous episode (that is, each one-unit increase in episode) raises the risk of recurrence by 19% (with a 95% confidence interval from 5% to 35%). The placebo group has a risk of recurrence 2.4 times that of the group treated with gd2 (with a 95% confidence interval from 1.2 to nearly 5). Both factors, treatment group and episode, are statistically significant, since their p values are below 0.05.
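Because the model is multiplicative, the two coefficients can be combined to compare any two patient profiles. The sketch below uses the coefficients printed above (0.874 and 0.172); the function rel_hazard() and the example profiles are hypothetical illustrations, not part of the original analysis.

```r
beta <- c(group = 0.874, episode = 0.172)   # coefficients from the coxph output

# Hypothetical helper: hazard of one profile relative to a reference profile.
# The baseline hazard lambda(t) cancels in the ratio.
rel_hazard <- function(group, episode, ref_group, ref_episode) {
  unname(exp(beta["group"] * (group - ref_group) +
             beta["episode"] * (episode - ref_episode)))
}

# Placebo versus gd2, same number of previous episodes
rel_hazard(group = 2, episode = 10, ref_group = 1, ref_episode = 10)

# One additional previous episode, same treatment group
rel_hazard(group = 1, episode = 11, ref_group = 1, ref_episode = 10)

# Placebo patient with 13 episodes versus gd2 patient with 10 episodes
rel_hazard(group = 2, episode = 13, ref_group = 1, ref_episode = 10)
```

The last ratio equals the product of the group effect and three times the per-episode effect on the log scale, exp(0.874 + 3 x 0.172).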
But episode is a continuous variable. The question is what S(t) looks like for each group after adjusting for episode. The most objective approach is to assume that both groups, gd2 and placebo, have the same number of episodes (the mean, for example); S(t) for each group can then be estimated with:
> Cox.model <- survfit(coxph(Surv(time, infected==1)~episode+strata(group)))
> plot(Cox.model,
       xlab="Time",
       ylab="Cumulative survival probability",
       col=c("black", "red"))

or more simply:
> plot(survfit(coxph(Surv(time, infected==1)~episode+strata(group))),
       xlab="Time",
       ylab="Cumulative survival probability",
       col=c("black", "red"))

[Figure: adjusted survival curves by group from the Cox model. Horizontal axis: Time, 0 to 50; vertical axis: Cumulative survival probability, 0.0 to 1.0.]

13.6 Building a Cox model with Bayesian Model Averaging (BMA)

As with multiple linear regression and multiple logistic regression, finding an optimal
model to predict an event when there are many independent variables is a difficult
problem. Most statistics textbooks present three main approaches to finding an optimal
model: the forward algorithm, the backward algorithm, and the AIC criterion.

With the forward algorithm, we start by finding the independent variable x with the
largest effect on the dependent variable y, then add the other independent variables x
one at a time until the model no longer improves.

With the backward algorithm, we start with all the independent variables x in the data
that might affect the dependent variable y, then remove them one at a time until only
statistically significant variables remain.
The two approaches above (forward and backward algorithms) rely on residuals
and P values to judge an optimal model. A third approach relies on the
Akaike Information Criterion (AIC), which I presented in the previous chapter. To understand
AIC-based model building, take a practical example. Suppose
we want to travel from province A to province B via district C, and for each leg we have 3
choices: car, boat, or motorbike. Of course, going by car costs more than going by
motorbike. On the other hand, going by boat is the cheapest but slower than going by
car or motorbike. If there are 6 travel plans in all, the problem is to find
the plan that costs the least yet takes the shortest time. Similarly,
AIC-based model building looks for the model with the fewest
parameters that still predicts the dependent variable as fully as possible.

But all three approaches share a problem: the optimal model is treated as the
final model, and all scientific inference rests on it. In
practice, every model (including the optimal one) carries its own uncertainty, and when
we obtain more data the optimal model may no longer be the final model, so the
inference can be wrong. A better and more promising way to take this uncertainty
into account is Bayesian Model Averaging (BMA).

With BMA, instead of asking whether an independent variable x has a statistically
significant effect on the dependent variable, we ask: what is the probability that x
affects y? To answer this question, BMA considers all the models capable of
explaining y, and looks at how often x appears in those models.
Example 3. In the following example we simulate a study with 5 independent
variables x1, x2, x3, x4 and x5. Apart from x1, the other 4 variables are simulated from the
normal distribution. The variable y is time, accompanied by a death indicator (death). Of the 5 x variables,
only x1 is related to the risk of death, through the relationship exp(3*x1 + 1); the
variables x2, x3, x4 and x5 are simulated completely independently of the risk of death. We will
build a model by the AIC criterion and by BMA, and compare the two.
# Load the survival and BMA packages for the analysis
> library(survival)
> library(BMA)

# Create 5 independent variables
> x1 <- (1:50)/2 - 3
> x2 <- rnorm(50)
> x3 <- rnorm(50)
> x4 <- rnorm(50)
> x5 <- rnorm(50)

# Simulate the relationship risk = exp(beta*x1 + 1)
> model <- exp(3*x1 + 1)

# Create the dependent variable y
> y <- rexp(50, rate = model)

# Create the censoring variable from an exponential distribution with rate 0.3
> censored <- rexp(50, rate=0.3)
> ycencored <- pmin(y, censored)
> death <- as.numeric(y <= censored)

# Put all the variables into a data frame named simdata
> simdata <- data.frame(y, death, x1,x2,x3,x4,x5)
# Fit a Cox model
> cox <- coxph(Surv(y, death) ~ ., data=simdata)
> summary(cox)
Call:
coxph(formula = Surv(y, death) ~ ., data = simdata)

  n= 50
      coef exp(coef) se(coef)       z       p
x1  3.2325    25.344    0.568  5.6908 1.3e-08
x2 -0.0319     0.969    0.331 -0.0963 9.2e-01
x3  0.3112     1.365    0.327  0.9518 3.4e-01
x4  0.1364     1.146    0.297  0.4600 6.5e-01
x5  0.4898     1.632    0.313  1.5643 1.2e-01

   exp(coef) exp(-coef) lower .95 upper .95
x1    25.344     0.0395     8.325     77.16
x2     0.969     1.0324     0.506      1.85
x3     1.365     0.7326     0.719      2.59
x4     1.146     0.8725     0.641      2.05
x5     1.632     0.6127     0.883      3.01

Rsquare= 0.992   (max possible= 0.997 )
Likelihood ratio test= 241  on 5 df,   p=0
Wald test            = 33.3  on 5 df,   p=3.36e-06
Score (logrank) test = 107  on 5 df,   p=0

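Judged by the printed p values, only x1 is statistically significant; x3 and x5 have p values of 0.34 and 0.12.
Yet the full model keeps all five variables, with estimated hazard ratios as large as 1.37 for x3 and 1.63 for x5,
although we know that only x1 truly affects y. Taken at face value, this model is misleading.
Now let us try building a model with the AIC criterion: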
# Search for a model using the AIC criterion
> searchAIC <- step(cox, direction="both")
> summary(searchAIC)
Call:
coxph(formula = Surv(y, death) ~ x1 + x5, data = simdata)

  n= 50
    coef exp(coef) se(coef)    z       p
x1 3.126     22.79    0.529 5.91 3.4e-09
x5 0.429      1.54    0.297 1.45 1.5e-01

   exp(coef) exp(-coef) lower .95 upper .95
x1     22.79     0.0439     8.080     64.27
x5      1.54     0.6510     0.858      2.75

Rsquare= 0.992   (max possible= 0.997 )
Likelihood ratio test= 240  on 2 df,   p=0
Wald test            = 35.3  on 2 df,   p=2.18e-08
Score (logrank) test = 104  on 2 df,   p=0

The AIC search retains x1 and x5 as the two predictors of y, even though x5 (p = 0.15)
was simulated independently of the outcome. Once again, the selected model is wrong! Now we apply BMA:
# Find the model with BMA
> time <- simdata$y
> death <- simdata$death
> xvars <- simdata[,c(3,4,5,6,7)]
> bma <- bic.surv(xvars, time, death)
> summary(bma)
> imageplot.bma(bma)
Call:
bic.surv.data.frame(x = xvars, surv.t = time, cens = death)

  8 models were selected
 Best 5 models (cumulative posterior probability = 0.8911 ):

           p!=0    EV      SD     model 1   model 2   model 3   model 4   model 5
x1         100.0  3.0360  0.509   2.98048   3.12625   3.03900   2.98288   2.98098
x2           9.6  0.0008  0.096     .         .         .         .       0.02136
x3          14.6  0.0410  0.155     .         .       0.27046     .         .
x4          10.0  0.0063  0.092     .         .         .       0.02497     .
x5          31.0  0.1349  0.261     .       0.42920     .         .         .

nVar                                1         2         2         2         2
BIC                              -233.774  -232.126  -230.713  -229.933  -229.930
post prob                           0.458     0.201     0.099     0.067     0.067

The BMA results show that the optimal model is model 1, which contains a single
variable: x1. The probability that this factor affects the risk of death is
100%. This is exactly the result we expect, because we simulated only x1
to affect y. Model 2 contains two variables, x1 and x5 (the same model
that the AIC criterion identified), but its probability is only 0.201.
Model 3 (x1 and x3), model 4 (x1 and x4) and model 5 (x1 and x2) are also possible,
but their probabilities are too low (below 0.1) for us to accept them. The following chart
displays these results:

[Figure: image plot titled "Models selected by BMA", with the variables x1 to x5 on the vertical axis and Model # on the horizontal axis.]

The chart shows 8 models, and in all 8 the variable x1 appears
consistently (probability 100%). The other variables appear, but not consistently.
Comparing the two approaches to model building shows clearly that BMA
gives us the most reliable model, and the one that seems to fit
reality best.
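The posterior probabilities in the summary can be recovered, to rounding, from the BIC values alone. The sketch below assumes that every model has the same prior probability, so that each model's weight is proportional to exp(-BIC/2); because only the best 5 of the 8 selected models are printed, the weights are rescaled to the printed cumulative probability of 0.8911. The object names are illustrative.

```r
# BIC values of the five best models, copied from the bic.surv summary
bic <- c(m1 = -233.774, m2 = -232.126, m3 = -230.713,
         m4 = -229.933, m5 = -229.930)

# Relative weight of each model (assumption: equal prior model probabilities)
w <- exp(-(bic - min(bic)) / 2)

# Rescale so the five models sum to the printed cumulative probability
post <- 0.8911 * w / sum(w)
round(post, 3)
```

A difference of about 1.6 BIC units between model 1 and model 2 is already enough to make model 1 more than twice as probable.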
These are the most commonly used methods for analyzing event data in
experimental science: the Cox model and the log-rank test. The Cox model can be extended
to more complex and sophisticated models for studies with many
variables and interactions among risk factors. The documentation of the
survival package can help readers explore further. It is available at
www.cran.R-project.org.

CHAPTER XIV

14
Meta-analysis
Our forebears used to say "One tree cannot make a hill; three trees together make a high
mountain" to praise the spirit of cooperation and unity needed to complete an important task
that requires many people. In scientific research in general, and in medicine in particular,
we often need to consider many research results from many different sources in order to
resolve a specific question.

14.1 The need for meta-analysis

In recent years a good many studies have appeared in the scientific literature under
the heading of meta-analysis. What meta-analysis is, what it is for, and how it is
carried out are questions that many readers want answered. In this chapter I briefly
describe a few concepts and the way a meta-analysis is carried out, in the hope that
readers can run one themselves without expensive software.
The idea of combining data dates back to the 17th century; it is by no means
new. At that time astronomers held that data from many sources had to be
systematized in order to reach a more accurate and reasonable decision than individual
studies could give. But modern methods of meta-analysis really began more than half
a century ago, in psychology. In 1952 the renowned psychologist Hans J. Eysenck
declared that psychotherapy had no effect whatsoever. More than twenty years
later, in 1976, Gene V. Glass, an American psychologist, wanted to show that
Eysenck was wrong, so he collected data from more than 375 past studies of
psychotherapy and combined them with a method he named
meta-analysis [1]. On the basis of this analysis, Glass declared that psychotherapy
is effective and benefits patients.
Meta-analysis has since been taken up by other scientific disciplines,
medicine above all, to address questions such as the efficacy of drugs in
treating patients. Methods of meta-analysis have come a long way and
have become a standard approach for appraising thorny questions on which
scientists have not yet reached consensus. Some regard meta-analysis as
able to provide the final answer to a medical question. The present author
is not that optimistic or confident, but still holds that meta-analysis is a very
useful method for resolving questions that remain in dispute. Meta-analysis
can also help us see which areas need more research
or more evidence.
The result of each single study is usually judged either positive (for example,
the treatment is effective) or negative (the treatment is
not effective), and this judgement rests on the P value. In English this procedure
is called significance testing. But statistical significance
depends on the sample size chosen for the study, and a negative result does not
mean that the study's hypothesis is wrong; it may instead signal that the sample was not
large enough to reach a reliable conclusion. The logic of meta-analysis, therefore, is
to shift from significance testing to estimation of the effect size. The
answer that meta-analysis aims to give is not simply whether a result is statistically
significant or not, but how large the effect is, whether it is large enough
to matter to us, and whether it is suitable for application in clinical practice in the
care of patients.

14.2 Fixed-effects and random-effects

Two terms that readers often meet in meta-analyses are fixed-effects and random-effects.
To understand them, I will use a fairly simple example. Imagine that
we want to estimate the height of Vietnamese adults (aged 18
and over). We could run 100 studies at many different sites across
the country; each study draws a random sample of anywhere from 10 people to several
tens of thousands; and for each study we calculate the mean height.
We then have 100 means, and they will certainly not be identical:
some studies give a low mean, others a high or middling one. The aim of
meta-analysis is to use these 100 means to estimate the height of the whole
Vietnamese population. There are two ways to do this: fixed-effects meta-analysis
and random-effects meta-analysis
[2].
Fixed-effects meta-analysis regards the differences among the 100 means
as caused by random factors within each study (the within-study
variance). The assumption behind this view is: if the 100 studies
were all carried out in exactly the same way (same number of subjects, same age,
same sex ratio, same diet, and so on), there would be no differences among the
means.
If we call the means of the 100 studies $x_1, x_2, \dots, x_{100}$, fixed-effects
meta-analysis holds that each $x_i$ consists of two parts:
one part reflecting the mean of the whole population (call it M), and the remainder
(the difference between $x_i$ and M), which is a variable $e_i$. In other words:

$$x_1 = M + e_1$$
$$x_2 = M + e_2$$
$$\vdots$$
$$x_{100} = M + e_{100}$$

Or, in general:

$$x_i = M + e_i$$

Of course $e_i$ can be < 0 or > 0. If M and $e_i$ are independent (that is, uncorrelated
with each other), the variance of $x_i$ (denoted $\mathrm{var}[x_i]$) can be written as:

$$\mathrm{var}[x_i] = \mathrm{var}[M] + \mathrm{var}[e_i] = 0 + s_e^2$$

Note that var[M] = 0 because M is a constant, and $s_e^2$ is the variance of $e_i$. The aim of
the meta-analysis is to estimate M and $s_e^2$.
Random-effects meta-analysis regards the differences (the variance) among the
means as caused by two groups of factors: factors
within each study (within-study variance) and factors between
studies (between-study variance). The factors that differ between studies, such as site,
age, sex, diet and so on, have to be considered and analyzed. In other words,
random-effects meta-analysis goes one step further than fixed-effects meta-analysis
by taking the differences between studies into account. As a result,
random-effects meta-analyses are usually more conservative than fixed-effects
meta-analyses.
Random-effects meta-analysis holds that each study has its own
mean to be estimated, called $m_i$. So $x_i$ consists of
two parts: one part reflecting the mean of the population from which the sample was drawn ($m_i$; note
that the subscript i denotes an individual study i), and the remainder (the difference between $x_i$ and $m_i$), which is
a variable $e_i$. In addition, random-effects meta-analysis states that each $m_i$
varies around the overall mean M by a random variable $\varepsilon_i$. In other
words:

$$x_i = m_i + e_i$$

where:

$$m_i = M + \varepsilon_i$$

so that:

$$x_i = M + \varepsilon_i + e_i$$

The variance of $x_i$ now has two components:

$$\mathrm{var}[x_i] = \mathrm{var}[M] + \mathrm{var}[\varepsilon_i] + \mathrm{var}[e_i] = 0 + s_\varepsilon^2 + s_e^2$$

As this formula shows, $s_\varepsilon^2$ reflects the variation between studies (between-study
variation), while $s_e^2$ reflects the variation within each study (within-study
variation). The aim of random-effects meta-analysis is to estimate M, $s_e^2$
and $s_\varepsilon^2$.
In short, fixed-effects and random-effects meta-analysis differ
only in the variance. Fixed-effects meta-analysis
takes $s_\varepsilon^2 = 0$, whereas random-effects meta-analysis requires $s_\varepsilon^2$ to be estimated. Of course,
if $s_\varepsilon^2 = 0$ the two analyses give the same result. In this chapter I concentrate
on fixed-effects meta-analysis.
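The variance decomposition above can be checked with a small simulation. This is only a sketch: the overall mean of 165 cm and the two standard deviations are hypothetical values chosen for illustration, and the number of simulated studies is made very large so that the sample variances settle near their theoretical values.

```r
set.seed(1)

k     <- 100000   # number of simulated study means (hypothetical)
M     <- 165      # overall mean height in cm (hypothetical)
s_e   <- 2        # within-study standard deviation (hypothetical)
s_eps <- 3        # between-study standard deviation (hypothetical)

e   <- rnorm(k, mean = 0, sd = s_e)     # within-study error e_i
eps <- rnorm(k, mean = 0, sd = s_eps)   # between-study deviation of m_i from M

x_fixed  <- M + e         # fixed-effects view:  x_i = M + e_i
x_random <- M + eps + e   # random-effects view: x_i = M + eps_i + e_i

# Expected variances: s_e^2 = 4 and s_eps^2 + s_e^2 = 13
c(fixed = var(x_fixed), random = var(x_random))
```

The extra between-study component is what makes the random-effects confidence interval wider than the fixed-effects one.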

14.3 The process of a meta-analysis

Like any other study, a meta-analysis proceeds through several
stages: collecting data, checking data, analyzing data, and checking the results
of the analysis.

Step 1: use the PubMed medical library system, or a specialist scientific
library system, to find the articles related to the research question.
Because many studies go unpublished for one reason or another (negative results,
for instance), the researcher sometimes needs to
include those studies as well. This is easy to say, but in practice
it is not easy at all!

Step 2: screen the retrieved studies to see how many
meet the predefined criteria. These criteria may concern the patient
population, disease status, age, sex, outcome, and so on. For example, among
hundreds of studies of the effect of vitamin D on osteoporosis, perhaps only
a few dozen meet criteria such as: the subjects must be postmenopausal
women with low bone density, the study must be a randomized controlled
clinical trial (RCT), the outcome must be hip fracture,
and so on. (These criteria must be set before the study begins.)

Step 3: extract the data (data extraction). Once the
studies have been identified, the next step is to plan how to extract data from
them. For RCTs, for example, we must find the
data for both the intervention and control groups. Sometimes these data are not
published or presented in the paper, in which case the researcher must
contact the authors directly to obtain them. A table summarizing the study results
might look like Table 1 below.

Step 4: carry out the statistical analysis. The aim of this step is to estimate
the overall effect across all studies and the variability of that
effect. In this chapter I explain in detail how to do this.

Step 5: review the results of the analysis, and compute some additional
measures to assess the reliability of the results.

Just as the statistical analysis of an individual study depends on the type of
outcome (continuous variables or dichotomous
variables), the method of meta-analysis also depends on the outcomes of the
studies. I will describe in turn the two main methods, for continuous and for
dichotomous variables.

14.4 Fixed-effects meta-analysis for a continuous outcome

14.4.1 Meta-analysis by manual calculation
Example 1. The length of hospital stay of stroke patients is an important
outcome in setting financial policy. Researchers want to know the
difference in length of stay between two types of hospital: specialist hospitals and
general hospitals. They screened and collected data from 9 studies, as follows (see Table
1). Some studies show that the length of stay in specialist hospitals is
shorter than in general hospitals (studies 1, 2, 3, 4, 5 and 8), while others
show the opposite (studies 6, 7 and 9). The question is whether these data are consistent
with the hypothesis that patients in specialist hospitals generally have a shorter length of stay than
patients in general hospitals. We can answer this question through the following steps:
Step 1: summarize the data in a table as follows:
Table 1. Length of hospital stay of stroke patients in specialist and general hospitals

             Specialist hospitals        General hospitals
Study (i)    N1i    LOS1i   SD1i         N2i    LOS2i   SD2i
1            155    55      47           156    75      64
2             31    27       7            32    29       4
3             75    64      17            71   119      29
4             18    66      20            18   137      48
5              8    14       8            13    18      11
6             57    19       7            52    18       4
7             34    52      45            33    41      34
8            110    21      16           183    31      27
9             60    30      27            52    23      20
Total        548                         610

Note: in this table, i indexes the studies, i = 1, 2, ..., 9. N1 and N2 are the numbers of patients
studied in each hospital group; LOS1 and LOS2 (length of stay): mean length of stay
(in days); SD1 and SD2: standard deviation of the length of stay.

Step 2: estimate the mean difference and its variance for
each study. Each study yields one effect estimate, or more precisely a
difference in length of stay between the two hospital groups, which I denote $d_i$.
This effect is simply:

$$d_i = LOS_{2i} - LOS_{1i}$$

The variance of $d_i$ (which I denote $s_i^2$) is estimated by a standard formula based on
the standard deviations and sample sizes of each study. For each study i (i = 1, 2, 3,
..., 9) we have:

$$s_i^2 = \frac{(N_{1i}-1)\,SD_{1i}^2 + (N_{2i}-1)\,SD_{2i}^2}{N_{1i}+N_{2i}-2}\left(\frac{1}{N_{1i}}+\frac{1}{N_{2i}}\right)$$

For study 1, for example:

$$d_1 = 75 - 55 = 20$$

and the variance of $d_1$ is:

$$s_1^2 = \frac{(155-1)(47)^2 + (156-1)(64)^2}{155+156-2}\left(\frac{1}{155}+\frac{1}{156}\right) = 40.59$$

or, as a standard deviation: $s_1 = \sqrt{40.59} = 6.37$

With the standard deviation $s_i$ we can estimate the 95% confidence interval (95% CI)
for $d_i$ using the theory of the Normal distribution. Recall
that if a variable follows the Normal distribution, 95% of its values
lie within 1.96 standard deviations of the mean. The 95% confidence interval for
the difference in study 1 therefore runs from

d1 - 1.96*s1 = 20 - 1.96*6.37 = 7.51 days

to

d1 + 1.96*s1 = 20 + 1.96*6.37 = 32.49 days

Continuing in the same way for the other studies gives four more columns, shown in the
following table:

Table 1a. Difference in length of stay between the two groups and 95% confidence interval

Study (i)    di     si2      si      di-1.96*si   di+1.96*si
1            20     40.6     6.37       7.51        32.49
2             2      2.0     1.43      -0.80         4.80
3            55     15.3     3.91      47.34        62.66
4            71    150.2    12.26      46.98        95.02
5             4     20.2     4.49      -4.81        12.81
6            -1      1.2     1.11      -3.17         1.17
7           -11     95.4     9.77     -30.14         8.14
8            10      8.0     2.83       4.45        15.55
9            -7     20.7     4.55     -15.92         1.92
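The columns of Table 1a can be reproduced in a few lines of base R. This is a sketch of the hand calculation only, using the pooled-variance formula of Step 2 and the data of Table 1; the vector names are chosen for illustration.

```r
# Data from Table 1 (specialist hospitals = group 1, general hospitals = group 2)
n1   <- c(155, 31, 75, 18, 8, 57, 34, 110, 60)
los1 <- c(55, 27, 64, 66, 14, 19, 52, 21, 30)
sd1  <- c(47, 7, 17, 20, 8, 7, 45, 16, 27)
n2   <- c(156, 32, 71, 18, 13, 52, 33, 183, 52)
los2 <- c(75, 29, 119, 137, 18, 18, 41, 31, 23)
sd2  <- c(64, 4, 29, 48, 11, 4, 34, 27, 20)

# Difference in length of stay, general minus specialist
d <- los2 - los1

# Pooled variance of each difference, as in Step 2
s2 <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2) * (1 / n1 + 1 / n2)
s  <- sqrt(s2)

# 95% confidence limits under the Normal approximation
lower <- d - 1.96 * s
upper <- d + 1.96 * s

round(data.frame(study = 1:9, d, s2, s, lower, upper), 2)
```

Because R is vectorized, the same expression handles all nine studies at once, which removes the risk of copying errors in a hand-built table.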

We can now display the effect di and its 95% confidence interval in a
chart known as a forest plot:

[Figure: forest plot showing the value of di and its 95% confidence interval for each study.]

The effects observed in studies 2, 5, 6, 7 and 9 are regarded as not statistically significant, because their
95% confidence intervals cross the value 0.

Step 3: estimate the weight for each study. The weight (Wi) is
simply the inverse of the variance $s_i^2$:

$$W_i = 1 / s_i^2$$

For study 1, for example: $W_1 = 1/40.59 = 0.0246$

This gives one more column for the table above:

Table 1b. Weight for each study

Study        di     si2      Wi
1            20     40.6     0.0246
2             2      2.0     0.4886
3            55     15.3     0.0654
4            71    150.2     0.0067
5             4     20.2     0.0495
6            -1      1.2     0.8173
7           -11     95.4     0.0105
8            10      8.0     0.1245
9            -7     20.7     0.0483
Total                        1.6354

Step 4: estimate the mean of d across all the studies. We
could simply average d by summing all the di and dividing by 9, but that
would not be objective, because each di has its own variance and weight (Wi).
Study 4, for instance, has the highest variance (150.2), which shows that this
study has few subjects or very high variability, and high variability means that we
cannot place much confidence in its estimate. That is why the weight for this
study is very low, only 0.0067. Conversely, study 6 has a high weight because its variability is low
(low variance), so its effect estimate carries more weight than those of the other
studies in the group.

So, to compute the mean of d over all the studies, we must take the weights
Wi into account. With each di and Wi we can compute the weighted mean
by the standard method:

$$\bar{d} = \frac{\sum_{i=1}^{9} W_i d_i}{\sum_{i=1}^{9} W_i}$$

Every statistical estimate must have a variance. For
$\bar{d}$, the variance (which I denote $s_d^2$) is simply the inverse of the sum of the weights
Wi:

$$s_d^2 = \frac{1}{\sum_{i=1}^{9} W_i}$$

The standard error (SE) of $\bar{d}$ is therefore $SE(\bar{d}) = s_d$. By the theory of the Normal
distribution, the 95% confidence interval (95% CI) can
be estimated as:

$$95\%\ \text{CI of } \bar{d} = \bar{d} \pm 1.96\, s_d$$

To compute $\bar{d}$ we need one more column: $W_i d_i$. For study
1, $W_1 d_1 = 0.02464 \times 20 = 0.4928$ (using the unrounded weight). Continuing in the same way gives the extra column.
Table 1c. Calculating the weighted mean

Study        di     si2      Wi        Wi*di
1            20     40.6     0.0246     0.4928
2             2      2.0     0.4886     0.9771
3            55     15.3     0.0654     3.5993
4            71    150.2     0.0067     0.4726
5             4     20.2     0.0495     0.1981
6            -1      1.2     0.8173    -0.8173
7           -11     95.4     0.0105    -0.1153
8            10      8.0     0.1245     1.2450
9            -7     20.7     0.0483    -0.3383
Total                        1.6354     5.7140

Next, sum all the Wi and the Wi*di (the Total row of the table above). The
weighted mean of d is then:

$$\bar{d} = \frac{\sum_{i=1}^{9} W_i d_i}{\sum_{i=1}^{9} W_i} = \frac{0.4928 + 0.9771 + \dots - 0.3383}{0.0246 + 0.4886 + \dots + 0.0483} = \frac{5.7140}{1.6354} = 3.49$$

and the variance of $\bar{d}$ is: $s_d^2 = \dfrac{1}{1.6354} = 0.61$.

In other words, the standard error of $\bar{d}$ is: $s_d = \sqrt{0.61} = 0.782$.

The 95% confidence interval (95% CI) can be estimated as:
3.49 ± 1.96*0.782 = 1.96 to 5.02.
We can now say that, on average, the length of stay in general hospitals
is 3.49 days longer than in specialist hospitals, with a 95% confidence interval from 1.96
days to 5.02 days.
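Steps 3 and 4 amount to an inverse-variance weighted mean. A minimal base-R sketch under the same assumptions as the hand calculation (pooled variances, d defined as general minus specialist); the object names are illustrative.

```r
# Data from Table 1
n1   <- c(155, 31, 75, 18, 8, 57, 34, 110, 60)
los1 <- c(55, 27, 64, 66, 14, 19, 52, 21, 30)
sd1  <- c(47, 7, 17, 20, 8, 7, 45, 16, 27)
n2   <- c(156, 32, 71, 18, 13, 52, 33, 183, 52)
los2 <- c(75, 29, 119, 137, 18, 18, 41, 31, 23)
sd2  <- c(64, 4, 29, 48, 11, 4, 34, 27, 20)

d  <- los2 - los1
s2 <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2) * (1 / n1 + 1 / n2)

# Step 3: weight = inverse of the variance
w <- 1 / s2

# Step 4: weighted mean, its standard error and 95% confidence interval
d_bar <- sum(w * d) / sum(w)
se    <- sqrt(1 / sum(w))
ci    <- d_bar + c(-1, 1) * 1.96 * se

round(c(sum_w = sum(w), d_bar = d_bar, se = se, lower = ci[1], upper = ci[2]), 3)
```

Studies 2 and 6 together carry about 80% of the total weight, so the pooled estimate sits much closer to their small differences than to the large differences of studies 3 and 4.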
Step 5: estimate the indices of homogeneity and
heterogeneity among the studies [3]. In practice, this index measures the
differences between each study and the weighted mean. The index of
homogeneity is computed with the following formula:

$$Q = \sum_{i=1}^{k} W_i (d_i - \bar{d})^2$$

Here k is the number of studies (k = 9 in the example above). By probability theory, Q follows
a Chi-square distribution with k-1 degrees of freedom (that is, $\chi^2_{k-1}$).
In other words, if Q exceeds the critical value of $\chi^2_{k-1}$, that signals statistically significant heterogeneity among the
studies.
Many studies in recent years have shown that Q often fails to detect
heterogeneity consistently, so nowadays few people use this index in
meta-analysis. Another index, used in place of Q, is the index of heterogeneity ($I^2$).
It is defined
as:

$$I^2 = \frac{Q - (k-1)}{Q}$$

$I^2$ ranges from negative values up to 1. If $I^2 < 0$ we set it to 0; if $I^2$ is close to 1, that is
a sign of heterogeneity among the studies.

In the example above, to estimate Q and $I^2$ we need $W_i (d_i - \bar{d})^2$ for each study.
For study 1, for example:

$$W_1 (d_1 - \bar{d})^2 = 0.0246 \times (20 - 3.49)^2 = 6.7129$$
Table 1d. Calculating the indices of homogeneity and heterogeneity

Study        di     si2      Wi        Wi*di      Wi*(di - d)^2
1            20     40.6     0.0246     0.4928       6.7129
2             2      2.0     0.4886     0.9771       1.0903
3            55     15.3     0.0654     3.5993     173.6080
4            71    150.2     0.0067     0.4726      30.3356
5             4     20.2     0.0495     0.1981       0.0127
6            -1      1.2     0.8173    -0.8173      16.5054
7           -11     95.4     0.0105    -0.1153       2.2026
8            10      8.0     0.1245     1.2450       5.2701
9            -7     20.7     0.0483    -0.3383       5.3215
Total                        1.6354     5.7140     241.05

After computing $W_i (d_i - \bar{d})^2$ for each study, we sum these values (see the last
column); the sum is Q:

$$Q = \sum_{i=1}^{9} W_i (d_i - \bar{d})^2 = 241.05$$

From this, $I^2$ can be estimated as:

$$I^2 = \frac{241.05 - 8}{241.05} = 0.967$$

The heterogeneity index $I^2$ is very high, showing that the variation in di among the studies is very large.
We can see this just by looking at the second column (di) of the table above.
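Q and I² follow directly from the weights and the weighted mean. The sketch below repeats the hand calculation in base R and also looks up the Chi-square tail probability for Q with `pchisq`; the object names are illustrative.

```r
# Data from Table 1
n1   <- c(155, 31, 75, 18, 8, 57, 34, 110, 60)
los1 <- c(55, 27, 64, 66, 14, 19, 52, 21, 30)
sd1  <- c(47, 7, 17, 20, 8, 7, 45, 16, 27)
n2   <- c(156, 32, 71, 18, 13, 52, 33, 183, 52)
los2 <- c(75, 29, 119, 137, 18, 18, 41, 31, 23)
sd2  <- c(64, 4, 29, 48, 11, 4, 34, 27, 20)

d     <- los2 - los1
s2    <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2) * (1 / n1 + 1 / n2)
w     <- 1 / s2
d_bar <- sum(w * d) / sum(w)

# Homogeneity statistic Q and its Chi-square p value on k - 1 degrees of freedom
k   <- length(d)
Q   <- sum(w * (d - d_bar)^2)
p_Q <- pchisq(Q, df = k - 1, lower.tail = FALSE)

# Heterogeneity index I^2, truncated at 0 as described in the text
I2 <- max(0, (Q - (k - 1)) / Q)

c(Q = Q, p_Q = p_Q, I2 = I2)
```

With 8 degrees of freedom the 5% critical value of the Chi-square distribution is about 15.5, so a Q above 240 leaves no doubt that the studies are heterogeneous.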
Step 6: assess the possibility of publication bias [4]. Publication bias
is a relatively new concept that can be explained with the following real-world
situation. We know that when a study produces a negative result (that is,
it finds no statistically significant effect or association),
it has little chance of being published in journals, because journal
editors generally do not like printing such papers. Conversely, a study with
a positive result (that is, a statistically significant one) is more likely to appear
in scientific journals than studies with negative results. Yet
most meta-analyses rely on results published in scientific
journals. The estimate from a meta-analysis may therefore lack objectivity, because
it does not fully account for the negative studies that were never published.

Some researchers propose using a funnel plot
to check for publication bias. A funnel plot is drawn by plotting
precision (vertical axis, y-axis) against the estimated effect of each study.
Here precision is defined as the inverse of the standard error:

$$\text{precision}_i = \frac{1}{s_i}$$

In other words, the funnel plot shows precision against di. For study
1, for example: $\text{precision} = 1/\sqrt{40.6} = 0.157$. Computing this for each study gives
the following table and funnel plot:
Table 1e. Assessing publication bias

Study        di     si2      1/si
1            20     40.6     0.1570
2             2      2.0     0.6990
3            55     15.3     0.2558
4            71    150.2     0.0816
5             4     20.2     0.2225
6            -1      1.2     0.9041
7           -11     95.4     0.1024
8            10      8.0     0.3528
9            -7     20.7     0.2198

[Figure: funnel plot, with precision on the vertical axis and d on the horizontal axis.]
The plot shows that most of the studies found a longer length of
stay in general hospitals than in specialist
hospitals.

The logic behind the funnel plot is that if large studies (that is, studies with
high precision) are more likely to be published, then journals will contain more studies with
positive results than small studies or studies with negative results.
If this happens, the funnel plot will show asymmetry. In
other words, asymmetry in a funnel plot is a sign of a problem with
publication bias. But the question is whether the publication bias is statistically significant.
The funnel plot cannot answer this question; we need more rigorous
quantitative methods.

Egger's test

In recent years it has been argued that funnel plots are hard to interpret and can
give a misleading impression of publication bias [5-6]. Indeed, some medical journals have a policy of
encouraging researchers to find another way of assessing publication bias
instead of the funnel plot.
One such method is Egger's test.
With this method we fit the model SND = a + b × precision, where SND
is obtained by dividing d by its standard error, that is, $SND_i = d_i / s_i$, and a and b are
two parameters to be estimated from this linear regression model. Here a gives
us information about the asymmetry of the funnel plot: a > 0 means that
larger studies tend to give effect estimates with higher
precision.
In the example above, we can use statistical software (such as
SAS or R) to estimate a and b as follows:
SNDi = 4.20 - 4.17084*precisioni
The estimate a = 4.20 is > 0 but not statistically significant, so these data
provide no evidence of publication bias.
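Egger's regression can be fitted with `lm` in base R. This sketch uses the pooled variances of the hand calculation and illustrative object names; on these data it gives an intercept near 4.2 and a slope near -4.2, close to the values quoted above (small differences are to be expected from rounding of the inputs).

```r
# Data from Table 1
n1   <- c(155, 31, 75, 18, 8, 57, 34, 110, 60)
los1 <- c(55, 27, 64, 66, 14, 19, 52, 21, 30)
sd1  <- c(47, 7, 17, 20, 8, 7, 45, 16, 27)
n2   <- c(156, 32, 71, 18, 13, 52, 33, 183, 52)
los2 <- c(75, 29, 119, 137, 18, 18, 41, 31, 23)
sd2  <- c(64, 4, 29, 48, 11, 4, 34, 27, 20)

d  <- los2 - los1
s2 <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2) * (1 / n1 + 1 / n2)
s  <- sqrt(s2)

precision <- 1 / s   # inverse of the standard error
snd       <- d / s   # standard normal deviate of each study

# Egger's regression: SND = a + b * precision
egger <- lm(snd ~ precision)
a <- unname(coef(egger)[1])
b <- unname(coef(egger)[2])

summary(egger)$coefficients
```

The coefficient table printed by `summary` includes the standard error and p value of the intercept, which is the quantity Egger's test examines.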
In practice, however, Egger's test is just another way of
expressing the funnel plot, with no major change. One way of assessing
publication bias that is so far regarded as the most reliable is
linear regression of di on the total sample size (Ni). In other words,
we find a and b in the model [7]:
di = a + b*Ni
If there is no publication bias, the value of b will be very close to 0 or not statistically
significant. If b differs from 0, that is a signal of publication bias. In the example
above, with the following data,

Study        di     Ni
1            20     311
2             2      63
3            55     146
4            71      36
5             4      21
6            -1     109
7           -11      67
8            10     293
9            -7     112

we obtain the equation:

di = 16.0 - 0.0009*Ni

and indeed the value of b is very small (and not statistically significant), so
we can conclude that there is no problem of publication bias in the analysis just
discussed.
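The regression of di on total sample size is a one-line `lm` call. A minimal sketch with the values of the table above (object names are illustrative):

```r
# Differences and total sample sizes (N1i + N2i) of the nine studies
d <- c(20, 2, 55, 71, 4, -1, -11, 10, -7)
N <- c(311, 63, 146, 36, 21, 109, 67, 293, 112)

# Linear regression d = a + b * N
fit <- lm(d ~ N)
a <- unname(coef(fit)[1])
b <- unname(coef(fit)[2])

round(c(a = a, b = b), 4)
summary(fit)$coefficients   # standard errors and p values for a and b
```

The slope is practically zero: an increase of 100 patients changes the expected difference by less than a tenth of a day.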
In summary, this meta-analysis gives us reliable evidence to
conclude that the length of stay of patients in general hospitals is about three and a half
days longer than in specialist hospitals, with a 95% confidence interval from about
2 days to 5 days. The results also show no evidence of publication
bias in the analysis.
14.4.2 Meta-analysis in R
R has two packages written and designed for meta-analysis. The one in
more common use is meta. Readers can download it free of charge from the R website (in
the packages section): http://cran.R-project.org.

To run a meta-analysis in R we must load the meta package into the R
environment (provided, of course, that the reader has already downloaded and installed meta).
> library(meta)

Next, we enter the data of Example 1 into R as follows:

Enter the data for each column of Table 1 and put them in a data frame called los:

> n1 <- c(155,31,75,18,8,57,34,110,60)
> los1 <- c(55,27,64,66,14,19,52,21,30)
> sd1 <- c(47,7,17,20,8,7,45,16,27)
> n2 <- c(156,32,71,18,13,52,33,183,52)
> los2 <- c(75,29,119,137,18,18,41,31,23)
> sd2 <- c(64,4,29,48,11,4,34,27,20)
> los <- data.frame(n1,los1,sd1,n2,los2,sd2)

Use the function metacont (for analyzing continuous variables, hence
cont = continuous variable) and store the result in the object res:

> res <- metacont(n1,los1,sd1,n2,los2,sd2,data=los)
> res
  WMD               95%-CI %W(fixed) %W(random)
1 -20 [-32.4744;  -7.5256]      1.44      10.69
2  -2 [ -4.8271;   0.8271]     28.11      12.67
3 -55 [-62.7656; -47.2344]      3.73      11.89
4 -71 [-95.0223; -46.9777]      0.39       7.39
5  -4 [-12.1539;   4.1539]      3.38      11.80
6   1 [ -1.1176;   3.1176]     50.11      12.72
7  11 [ -8.0620;  30.0620]      0.62       8.76
8 -10 [-14.9237;  -5.0763]      9.27      12.41
9   7 [ -1.7306;  15.7306]      2.95      11.67

Number of trials combined: 9

                          WMD               95%-CI       z  p.value
Fixed effects model   -3.4636 [ -4.9626;  -1.9646] -4.5286 < 0.0001
Random effects model -13.9817 [-24.0299;  -3.9336] -2.7272   0.0064

Quantifying heterogeneity:
tau^2 = 205.4094; H = 5.46 [4.54; 6.58]; I^2 = 96.7% [95.2%; 97.7%]

Test of heterogeneity:
      Q d.f.  p.value
 238.92    8 < 0.0001

Method: Inverse variance method

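meta gives us two results: one based on the fixed-effects model and
one based on the random-effects model. Note that metacont reports the difference as LOS1 - LOS2,
so the signs are the opposite of the di used in the manual calculation. As the output shows, the two
models differ considerably (-3.46 against -13.98 days), but the overall conclusion is the same: the results of both models
are statistically significant.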
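The pooled estimates in this output can be reproduced in base R without the meta package. The sketch below rests on two assumptions that are checked only against the printed numbers: each study's variance is taken as SD1²/N1 + SD2²/N2 (unpooled, unlike the hand calculation of section 14.4.1), and the between-study variance tau^2 is obtained with the DerSimonian-Laird moment estimator. The object names are illustrative.

```r
# Data from Table 1
n1   <- c(155, 31, 75, 18, 8, 57, 34, 110, 60)
los1 <- c(55, 27, 64, 66, 14, 19, 52, 21, 30)
sd1  <- c(47, 7, 17, 20, 8, 7, 45, 16, 27)
n2   <- c(156, 32, 71, 18, 13, 52, 33, 183, 52)
los2 <- c(75, 29, 119, 137, 18, 18, 41, 31, 23)
sd2  <- c(64, 4, 29, 48, 11, 4, 34, 27, 20)

wmd <- los1 - los2               # same sign convention as the metacont output
v   <- sd1^2 / n1 + sd2^2 / n2   # unpooled variance of each difference (assumption)

# Fixed-effects (inverse-variance) estimate
w        <- 1 / v
fixed    <- sum(w * wmd) / sum(w)
fixed_ci <- fixed + c(-1, 1) * qnorm(0.975) * sqrt(1 / sum(w))

# Heterogeneity and DerSimonian-Laird estimate of tau^2 (assumption)
k    <- length(wmd)
Q    <- sum(w * (wmd - fixed)^2)
tau2 <- max(0, (Q - (k - 1)) / (sum(w) - sum(w^2) / sum(w)))

# Random-effects estimate: tau^2 is added to every study's variance
w_re      <- 1 / (v + tau2)
random    <- sum(w_re * wmd) / sum(w_re)
random_ci <- random + c(-1, 1) * qnorm(0.975) * sqrt(1 / sum(w_re))

round(c(fixed = fixed, random = random, Q = Q, tau2 = tau2), 4)
round(100 * cbind(fixed = w / sum(w), random = w_re / sum(w_re)), 2)
```

Adding tau^2 to every variance makes the weights nearly equal, which is why studies 3 and 4, with their large differences, pull the random-effects estimate so far from the fixed-effects one.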

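We can of course also use the plot function to display these results as a forest
plot (in recent versions of meta the forest plot is drawn with forest(res)):
> plot(res, lwd=3)

[Figure: forest plot of the nine studies; horizontal axis "Weighted mean difference", from -100 to 20.]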
14.5 Fixed-effects meta-analysis for a dichotomous outcome

In the previous section I described the main steps of a meta-analysis of
studies whose outcome is a continuous variable. For
continuous variables, the mean and the standard deviation are the two statistics commonly
used as summaries. But these two statistics cannot be applied to categorical or
ordinal outcomes such as death, fracture, and so on, because such outcomes have only two
values: yes or no. A person is either alive or dead, has a fracture
or does not, has heart failure or does not, and so on. For
these variables we need a method of analysis different from the one used
for continuous variables.
14.5.1 The analysis model

For dichotomous outcomes (with only two values), the statistic that corresponds
to the mean is the proportion (which can be expressed as a percentage), and the statistic that
corresponds to the standard deviation is the standard error. For example, if a
study follows 25 patients for a period of time, and 5 of them develop the disease
during that time, the proportion (denoted p) is simply p = 5/25 = 0.20 (or 20%). By
probability theory, the variance of p (denoted var[p]) is: var[p] = p(1-p)/n = 0.2*(1 - 0.2)/25 = 0.0064.
Accordingly, the standard error of p (denoted SE[p]) is:

$SE[p] = \sqrt{\mathrm{var}[p]} = \sqrt{0.0064} = 0.08$. We can also estimate the 95% confidence interval

of the proportion as follows: $p \pm 1.96 \times SE[p] = 0.2 \pm 1.96 \times 0.08$ = 0.04 to 0.36.
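The same calculation for a proportion takes a few lines of base R. A minimal sketch using the Normal approximation of the text, with illustrative object names:

```r
events <- 5    # patients who developed the disease
n      <- 25   # patients followed

p     <- events / n          # proportion
var_p <- p * (1 - p) / n     # variance of the proportion
se_p  <- sqrt(var_p)         # standard error

# 95% confidence interval under the Normal approximation
ci <- p + c(-1, 1) * qnorm(0.975) * se_p

round(c(p = p, var = var_p, se = se_p, lower = ci[1], upper = ci[2]), 4)
```

With only 25 patients the interval is wide; the Normal approximation is also rough for samples this small, so the limits should be read as approximate.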

Because the calculations for dichotomous outcomes are rather specific, the method of
meta-analysis for studies with dichotomous variables also differs. To illustrate this kind of
meta-analysis, I will use an example (adapted from a real study).
Example 2: Beta-blockers (abbreviated BB) are a class of drugs used to treat
and prevent hypertension. It has been hypothesized that BB may also prevent
heart failure, or at least reduce the risk of heart failure. To test this hypothesis, a series of
randomized controlled clinical trials has been carried out over the past 20 years.
Each study has 2 groups of patients: a group treated with BB, and a group not
treated (the placebo group). Over 2 years of follow-up, the
researchers recorded the number of deaths in each group. Table 2 below summarizes 13
past studies:
Table 2. Beta-blockers and congestive heart failure

             Beta-blocker              Placebo
Study (i)    N1      Deaths (d1)       N2      Deaths (d2)
1              25      5                 25      6
2               9      1                 16      2
3             194     23                189     21
4              25      1                 25      2
5             105      4                 34      2
6             320     53                321     67
7              33      3                 16      2
8             261     12                 84     13
9             133      6                145     11
10            232      2                134      5
11           1327    156               1320    228
12           1990    145               2001    217
13            214      8                212     17
Total        4868    419               4522    593

N: number of patients studied; Deaths: number of patients who died during follow-up.

As we can see, some studies have fairly small samples, while others have
samples of nearly 4000 people! The question is whether, when these studies are combined, the results are
consistent with the hypothesis that BB reduces the risk of heart failure. To answer this
question, we proceed through the following steps:
Step 1: estimate the effect for each study. Each study has
two proportions: one for the BB group and one for the placebo group. I will call them p1 and p2.
The measure of the effect of BB is the relative
risk (RR), which can be estimated as:

$$RR = \frac{p_1}{p_2}$$

In study 1, for example, we have: $p_1 = \dfrac{5}{25} = 0.20$ and $p_2 = \dfrac{6}{25} = 0.24$.

The relative risk for study 1 is therefore: $RR = \dfrac{0.20}{0.24} = 0.833$. Repeating the calculation for
the remaining studies gives the following table:
Table 2a. Estimated death rates and relative risk

Study (i)    Death rate,       Death rate,          Relative risk
             BB group (p1)     placebo group (p2)   (RR)
1            0.200             0.240                0.833
2            0.111             0.125                0.889
3            0.119             0.111                1.067
4            0.040             0.080                0.500
5            0.038             0.059                0.648
6            0.166             0.209                0.794
7            0.091             0.125                0.727
8            0.046             0.155                0.297
9            0.045             0.076                0.595
10           0.009             0.037                0.231
11           0.118             0.173                0.681
12           0.073             0.108                0.672
13           0.037             0.080                0.466

Step 2: Transform RR to the logarithmic scale and compute its variance and standard error. Every statistical estimate, as mentioned several times before, follows a distribution, and that distribution can be characterized by its variance (or standard error). Computing the variance of RR directly is rather complicated, so we use an indirect method: we transform RR to log[RR] (here log means the natural logarithm, that is log_e, sometimes abbreviated ln), and then compute the variance of log[RR].

If N1 and N2 are the sample sizes of group 1 and group 2, respectively, and d1 and d2 are the numbers of deaths in group 1 and group 2 of a study, then the variance of log[RR] can be calculated with the following formula:
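$$\mathrm{Var}[\log RR] = \frac{1}{d_1} - \frac{1}{N_1 - d_1} + \frac{1}{d_2} - \frac{1}{N_2 - d_2}$$

and the standard error of log[RR] is:

$$SE[\log RR] = \sqrt{\frac{1}{d_1} - \frac{1}{N_1 - d_1} + \frac{1}{d_2} - \frac{1}{N_2 - d_2}}$$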

In the example above, for study 1 we have:

$$\log[RR] = \log_e(0.833) = -0.182$$

with variance:

$$\mathrm{Var}[\log RR] = \frac{1}{5} - \frac{1}{25 - 5} + \frac{1}{6} - \frac{1}{25 - 6} = 0.264$$

and standard error:

$$SE[\log RR] = \sqrt{0.264} = 0.514$$


Relying on the normal distribution, we can also compute the 95% confidence interval of RR for each study by back-transforming to the RR scale. For study 1, for example, the 95% confidence interval of log[RR] is:

$$\log RR \pm 1.96 \times SE[\log RR] = -0.182 \pm 1.96 \times 0.514 = -1.19 \text{ to } 0.82$$

or, back-transformed to the original RR scale:

exp(-1.19) = 0.30 to exp(0.82) = 2.28

Similar calculations for the other studies give a new table:
Table 2b. Estimated relative risk, variance, standard error and 95% confidence interval for each study

Study (i)   RR      Log[RR]   Var[logRR]   SE[logRR]   Lower 95% CI   Upper 95% CI
                                                       of RR          of RR
1           0.833   -0.182    0.264        0.514       0.30           2.28
2           0.889   -0.118    1.304        1.142       0.09           8.33
3           1.067    0.065    0.079        0.282       0.61           1.85
4           0.500   -0.693    1.415        1.189       0.05           5.15
5           0.648   -0.434    0.709        0.842       0.12           3.37
6           0.794   -0.231    0.026        0.162       0.58           1.09
7           0.727   -0.318    0.729        0.854       0.14           3.87
8           0.297   -1.214    0.142        0.377       0.14           0.62
9           0.595   -0.520    0.242        0.492       0.23           1.56
10          0.231   -1.465    0.688        0.829       0.05           1.17
11          0.681   -0.385    0.009        0.095       0.56           0.82
12          0.672   -0.398    0.010        0.102       0.55           0.82
13          0.466   -0.763    0.174        0.417       0.21           1.06
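
The columns of Table 2b can be recomputed as sketched below. The variable names are illustrative. `var_text` follows the variance formula exactly as it is printed in this chapter; `var_std` is the more commonly quoted large-sample formula, 1/d1 - 1/N1 + 1/d2 - 1/N2, included only for comparison (it gives 0.287 rather than 0.264 for study 1 and appears to be the one behind the per-study intervals printed by metabin in section 14.5.2).

```r
# Per-study counts from Table 2
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

log_rr <- log((d1 / n1) / (d2 / n2))                   # natural log of RR

var_text <- 1/d1 - 1/(n1 - d1) + 1/d2 - 1/(n2 - d2)   # formula as printed in the text
var_std  <- 1/d1 - 1/n1 + 1/d2 - 1/n2                 # usual large-sample formula (comparison only)

se    <- sqrt(var_text)
lower <- exp(log_rr - 1.96 * se)                      # back-transform to the RR scale
upper <- exp(log_rr + 1.96 * se)

tab2b <- data.frame(study = 1:13,
                    RR = round(exp(log_rr), 3),
                    logRR = round(log_rr, 3),
                    var = round(var_text, 3),
                    se = round(se, 3),
                    lower = round(lower, 2),
                    upper = round(upper, 2))
print(tab2b)
```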

We can display the RR and its 95% confidence interval for each study in a forest plot:

[Figure: forest plot showing the RR and its 95% confidence interval for each study.] Estimates whose 95% confidence interval crosses the value 1 are regarded as not statistically significant.
Step 3: Estimate the weight for each study and the RR for all studies combined. The plot above shows that some studies have very wide intervals for RR (indicating small samples or unstable RR estimates), whereas some large studies have more stable estimates. The weight for each study (W_i, where the subscript i indexes the study) measures this stability and is the inverse of the variance:

$$W_i = \frac{1}{\mathrm{Var}[\log RR_i]}$$

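The weighted mean of log[RR] (denoted logwRR) is computed from the sum of the products W_i log[RR_i]:

$$\log wRR = \frac{\sum W_i \log[RR_i]}{\sum W_i}$$

with variance:

$$\mathrm{Var}[\log wRR] = \frac{1}{\sum W_i}$$

and standard error:

$$SE[\log wRR] = \sqrt{\frac{1}{\sum W_i}}$$

In addition, the 95% confidence interval can be calculated as:

$$\log wRR \pm 1.96 \times SE[\log wRR]$$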

To compute the weighted mean of log RR, we need a column of W_i log[RR_i]. For study 1, for example, we have:

$$W_1 = \frac{1}{0.264} = 3.79$$

and

$$W_1 \log[RR_1] = 3.79 \times (-0.182) = -0.69$$

Similarly for the other studies:

Table 2c. Estimating the weights (W_i)

Study (i)   Log[RR]   Var[logRR]   W_i      W_i log[RR_i]
1           -0.182    0.264        3.79     -0.69
2           -0.118    1.304        0.77     -0.09
3            0.065    0.079        12.61     0.82
4           -0.693    1.415        0.71     -0.49
5           -0.434    0.709        1.41     -0.61
6           -0.231    0.026        38.30    -8.86
7           -0.318    0.729        1.37     -0.44
8           -1.214    0.142        7.03     -8.54
9           -0.520    0.242        4.13     -2.15
10          -1.465    0.688        1.45     -2.13
11          -0.385    0.009        110.78   -42.63
12          -0.398    0.010        96.13    -38.23
13          -0.763    0.174        5.75     -4.39
Total                              284.24   -108.42

We have:

$$\sum W_i = 3.79 + 0.77 + \ldots + 5.75 = 284.24$$

$$\sum W_i \log[RR_i] = -0.69 - 0.09 + \ldots - 4.39 = -108.42$$

Therefore, the weighted mean of log[RR] is:

$$\log wRR = \frac{\sum W_i \log[RR_i]}{\sum W_i} = \frac{-108.42}{284.24} = -0.38$$

with variance:

$$\mathrm{Var}[\log wRR] = \frac{1}{284.24} = 0.0035$$

and standard error:

$$SE[\log wRR] = \sqrt{0.0035} = 0.06$$

The 95% confidence interval of logwRR is therefore:

$$\log wRR \pm 1.96 \times SE[\log wRR] = -0.38 \pm 1.96 \times 0.06 = -0.498 \text{ to } -0.265$$

But we want to express the result in the original unit (a ratio), so these estimates must be transformed back:

RR = exp(logwRR) = exp(-0.38) = 0.68

with 95% confidence interval:

exp(-0.498) = 0.61 to exp(-0.265) = 0.77.
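
The pooled estimate of Step 3 can be checked with a short sketch. The variable names are illustrative, and the weights use the variance formula as printed in Step 2 of this chapter, so the result agrees with Table 2c rather than with any particular package.

```r
# Per-study counts from Table 2
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

log_rr <- log((d1 / n1) / (d2 / n2))
v      <- 1/d1 - 1/(n1 - d1) + 1/d2 - 1/(n2 - d2)   # variance formula as printed in Step 2
w      <- 1 / v                                     # inverse-variance weights

log_wrr <- sum(w * log_rr) / sum(w)                 # weighted mean of log RR
se_wrr  <- sqrt(1 / sum(w))                         # its standard error

rr_pooled <- exp(log_wrr)                           # pooled RR on the original scale
ci_pooled <- exp(log_wrr + c(-1.96, 1.96) * se_wrr) # 95% confidence interval

round(c(sum_w = sum(w), RR = rr_pooled,
        lower = ci_pooled[1], upper = ci_pooled[2]), 3)
```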

At this point we can say that the death rate among patients treated with BB is 0.68 times (that is, 32% lower than) the rate among placebo patients. Moreover, because the 95% confidence interval does not include 1, we can also say that this difference is statistically significant.

Step 4: Estimate the indices of homogeneity and heterogeneity. As discussed in part (1) on the analysis of continuous variables, after estimating the average relative risk we need to examine the I² index.

To estimate I², we need $W_i(\log RR_i - \log wRR)^2$ for each study. For study 1, for example, we have:

$$W_1(\log RR_1 - \log wRR)^2 = 3.79 \times (-0.182 + 0.38)^2 \approx 0.1502 \text{ (using unrounded values)}$$

and for the other studies:

Table 2d. Estimating the heterogeneity index (I²)

Study (i)   Log[RR_i]   W_i      W_i (log RR_i - log wRR)^2
1           -0.182      3.79     0.1502
2           -0.118      0.77     0.0533
3            0.065      12.61    2.5118
4           -0.693      0.71     0.0687
5           -0.434      1.41     0.0040
6           -0.231      38.30    0.8635
7           -0.318      1.37     0.0054
8           -1.214      7.03     4.8731
9           -0.520      4.13     0.0790
10          -1.465      1.45     1.7074
11          -0.385      110.78   0.0012
12          -0.398      96.13    0.0253
13          -0.763      5.75     0.8382
Total                   284.24   11.1811

Example 2 has k = 13 studies. Therefore,

$$Q = \sum_{i=1}^{k} W_i(\log RR_i - \log wRR)^2 = 11.1811$$

and

$$I^2 = \frac{Q - (k - 1)}{Q} = \frac{11.18 - 12}{11.18} = -0.07$$

Because I² < 0, we set I² = 0. In other words, the differences in RR between the studies are not statistically significant.
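
The Q statistic and I² of Step 4 can be sketched as follows. The variable names are illustrative, and the weights again use the variance formula as printed in Step 2, so Q should come out close to the 11.18 of Table 2d.

```r
# Per-study counts from Table 2
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

log_rr  <- log((d1 / n1) / (d2 / n2))
w       <- 1 / (1/d1 - 1/(n1 - d1) + 1/d2 - 1/(n2 - d2))  # weights as in Table 2c
log_wrr <- sum(w * log_rr) / sum(w)                        # pooled log RR

k  <- length(log_rr)                      # number of studies
Q  <- sum(w * (log_rr - log_wrr)^2)       # homogeneity statistic
I2 <- max(0, (Q - (k - 1)) / Q)           # negative values are set to 0

p_Q <- pchisq(Q, df = k - 1, lower.tail = FALSE)  # chi-square test of homogeneity
round(c(Q = Q, I2 = I2, p = p_Q), 4)
```

The chi-square p-value is an addition for illustration; the text itself only reports Q and I².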
Step 5: Assess the possibility of publication bias. As explained in part 1f, the most meaningful way to assess possible publication bias is a linear regression of log[RR] on the total sample size (N):

log[RR_i] = a + bN_i

Based on the following table,

Study (i)   Log[RR_i]   N_i
1           -0.182      50
2           -0.118      25
3            0.065      383
4           -0.693      50
5           -0.434      139
6           -0.231      641
7           -0.318      49
8           -1.214      345
9           -0.520      278
10          -1.465      366
11          -0.385      2647
12          -0.398      3991
13          -0.763      426

we can estimate a and b as follows:

log[RR_i] = -0.534 + 0.00003 N_i

The estimate b = 0.00003 is not statistically significant (p = 0.782). We can therefore state that publication bias is negligible in this meta-analysis.
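
The regression of Step 5 can be fitted with lm(). This is a minimal sketch with illustrative variable names; it regresses the unrounded log RR values on the total sample size of each study, so the coefficients should be close to, but not necessarily identical with, those quoted above.

```r
# Per-study counts from Table 2
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

log_rr <- log((d1 / n1) / (d2 / n2))  # effect size of each study
N      <- n1 + n2                     # total sample size of each study

fit   <- lm(log_rr ~ N)               # log[RR_i] = a + b * N_i
coefs <- summary(fit)$coefficients    # estimates, standard errors, t and p values
print(coefs)

b_hat <- coefs["N", "Estimate"]
p_b   <- coefs["N", "Pr(>|t|)"]
```

A slope that is close to zero and not statistically significant is read here as no evidence of publication bias.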

[Figure: funnel plot] The funnel plot also shows no sign of publication bias.


14.5.2 Analysis with R

The meta package has a function, metabin, that can be used to carry out a meta-analysis of binary outcomes such as the data in Example 2 above. To begin, we load the meta package (if not already done) into the working environment, and then enter the data into a data frame:
library(meta)

# Data from Example 2
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

# Create a data frame named bb
bb <- data.frame(n1, d1, n2, d2)

# Analyze with metabin and store the result in res
# (inverse variance method; older versions of meta wrote this as meth="I")
> res <- metabin(d1, n1, d2, n2, data=bb, sm="RR", method="Inverse")
> res
       RR            95%-CI  %W(fixed)  %W(random)
1  0.8333  [0.2918; 2.3799]       1.26        1.26
2  0.8889  [0.0930; 8.4951]       0.27        0.27
3  1.0670  [0.6116; 1.8617]       4.47        4.47
4  0.5000  [0.0484; 5.1677]       0.25        0.25
5  0.6476  [0.1240; 3.3814]       0.51        0.51
6  0.7935  [0.5731; 1.0986]      13.08       13.08
7  0.7273  [0.1346; 3.9282]       0.49        0.49
8  0.2971  [0.1410; 0.6258]       2.49        2.49
9  0.5947  [0.2262; 1.5632]       1.48        1.48
10 0.2310  [0.0454; 1.1744]       0.52        0.52
11 0.6806  [0.5635; 0.8221]      38.81       38.81
12 0.6719  [0.5496; 0.8214]      34.31       34.31
13 0.4662  [0.2056; 1.0570]       2.07        2.07

Number of trials combined: 13

                          RR            95%-CI        z   p.value
Fixed effects model   0.6821  [0.6064; 0.7672]  -6.3741  < 0.0001
Random effects model  0.6821  [0.6064; 0.7672]  -6.3741  < 0.0001

Quantifying heterogeneity:
tau^2 = 0; H = 1 [1; 1.45]; I^2 = 0% [0%; 52.6%]

Test of heterogeneity:
  Q  d.f.  p.value
 11    12   0.5292

Method: Inverse variance method

The results from the fixed-effects and random-effects models once again give us evidence to conclude that beta-blockers are effective in reducing the risk of death.
# Forest plot (in current versions of meta this plot is drawn with forest(res))
> plot(res, lwd=3)

[Figure: forest plot of studies 1 to 13; horizontal axis: Relative Risk, log scale from 0.05 to 10.00]

***
In science in general, we have a long tradition of reviewing research evidence and current knowledge. But such reviews are usually qualitative, and because they are qualitative it is hard to know precisely what quantitative differences exist between studies. Meta-analysis gives us a quantitative means of systematizing the evidence. With a meta-analysis, we have the opportunity to:

- examine which studies have been conducted to address the problem;
- see what the results of those studies were;
- systematize the clinical outcomes of interest;
- review the differences in characteristics between the studies;
- decide how to combine the results; and
- communicate the results in a scientific manner.

The purpose of a meta-analysis, to repeat once more, is to estimate an average effect size after considering all currently available study results. Such an overall result helps us reach a more accurate and more reliable conclusion.

I hope the two examples above help readers understand the mechanics and the meaning of a meta-analysis, and that readers can carry out such an analysis themselves when data are available. In fact, all of the calculations above can be done in a spreadsheet program such as Microsoft Excel. Specialized software (SAS, for example) can also carry out these analyses (I will explain this in the next section). The calculations are really quite simple. The difficulty of a meta-analysis is not the computation but the data behind it.

Meta-analysis is not without shortcomings. In research there is a saying, "garbage in, garbage out": if the data used in the analysis are not of high quality, the results of the meta-analysis have no scientific value either. The most important issue in a meta-analysis is therefore the selection of the data and studies to analyze; this must be weighed with great care to ensure that the results are valid and scientific.
References and notes

[1] Glass GV. Primary, secondary, and meta-analysis of research. Educational Researcher 1976; 5:3-8.
[2] Normand SL. Meta-analysis: formulating, evaluating, combining, and reporting. Stat Med. 1999;18(3):321-59.
[3] Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21:1539-1558.
[4] Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. Br Med J 1997;315:629-34.
[5] Tang JL, Liu JL. Misleading funnel plot for detection of bias in meta-analysis. J Clin Epidemiol. 2000;53(5):477-84.
[6] Peters JL, Sutton AJ, Jones DR, Abrams KR, Rushton L. Comparison of two methods to detect publication bias in meta-analysis. JAMA. 2006;295(6):676-80.
[7] Macaskill P, Walter SD, Irwig L. A comparison of methods to detect publication bias in meta-analysis. Stat Med. 2001;20:641-654.

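Summary of meta-analysis

For continuous variables

Group 1 (sample size, mean, standard deviation): $n_{1i}, \bar{x}_{1i}, s_{1i}$; $i = 1, 2, 3, \ldots, k$
Group 2 (sample size, mean, standard deviation): $n_{2i}, \bar{x}_{2i}, s_{2i}$
Effect size (ES): $d_i = \bar{x}_{2i} - \bar{x}_{1i}$
Variance of $d_i$: $s_{d_i}^2 = \dfrac{(n_{1i} - 1)s_{1i}^2 + (n_{2i} - 1)s_{2i}^2}{n_{1i} + n_{2i} - 2}\left(\dfrac{1}{n_{1i}} + \dfrac{1}{n_{2i}}\right)$
Standard error of $d_i$: $s_{d_i} = \sqrt{s_{d_i}^2}$
Weight: $W_i = 1 / s_{d_i}^2$
Overall effect estimate: $d = \sum_{i=1}^{k} W_i d_i \Big/ \sum_{i=1}^{k} W_i$
Variance of $d$: $s_d^2 = 1 \Big/ \sum_{i=1}^{k} W_i$
95% confidence interval: $d \pm 1.96 \sqrt{s_d^2}$
Index of homogeneity: $Q = \sum_{i=1}^{k} W_i (d_i - d)^2$
Index of heterogeneity: $I^2 = \dfrac{Q - (k - 1)}{Q}$
Publication bias: linear regression $d_i = a + bN_i$ ($N_i$ is the total sample size of study i); examine the statistical significance of b.
Between-study variance: $\tau^2 = \max\left(0, \dfrac{Q - (k - 1)}{\sum_{i=1}^{k} W_i - \sum_{i=1}^{k} W_i^2 \big/ \sum_{i=1}^{k} W_i}\right)$

For binary variables

Group 1 (sample size, number of events): $n_{1i}, x_{1i}$; $i = 1, 2, 3, \ldots, k$
Group 2 (sample size, number of events): $n_{2i}, x_{2i}$; $i = 1, 2, 3, \ldots, k$
Effect size (ES), expressed as the relative risk RR: $RR_i = \dfrac{x_{2i} / n_{2i}}{x_{1i} / n_{1i}}$
Transformation to the logarithmic scale: $\theta_i = \log(RR_i)$
Variance of $\theta_i$: $s_{\theta_i}^2 = \dfrac{1}{x_{1i}} - \dfrac{1}{n_{1i} - x_{1i}} + \dfrac{1}{x_{2i}} - \dfrac{1}{n_{2i} - x_{2i}}$
Standard error of $\theta_i$: $s_{\theta_i} = \sqrt{s_{\theta_i}^2}$
Weight: $W_i = 1 / s_{\theta_i}^2$
Overall effect estimate: $\theta = \sum_{i=1}^{k} W_i \theta_i \Big/ \sum_{i=1}^{k} W_i$
Variance of $\theta$: $s_\theta^2 = 1 \Big/ \sum_{i=1}^{k} W_i$
95% confidence interval: $\theta \pm 1.96 \sqrt{s_\theta^2}$
Index of homogeneity: $Q = \sum_{i=1}^{k} W_i (\theta_i - \theta)^2$
Index of heterogeneity: $I^2 = \dfrac{Q - (k - 1)}{Q}$
Publication bias: linear regression $\theta_i = a + bN_i$ ($N_i$ is the total sample size of study i); examine the statistical significance of b.
Between-study variance: $\tau^2 = \max\left(0, \dfrac{Q - (k - 1)}{\sum_{i=1}^{k} W_i - \sum_{i=1}^{k} W_i^2 \big/ \sum_{i=1}^{k} W_i}\right)$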
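
As an illustration of the last line of the summary, the between-study variance can be computed for the beta-blocker data of Example 2. This is a sketch under the assumption that the weights are the inverse variances from Step 2 of that example; all variable names are illustrative.

```r
# Per-study counts from Table 2 (Example 2)
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

theta <- log((d1 / n1) / (d2 / n2))                      # log RR per study
w     <- 1 / (1/d1 - 1/(n1 - d1) + 1/d2 - 1/(n2 - d2))   # inverse-variance weights
k     <- length(theta)

theta_bar <- sum(w * theta) / sum(w)                     # pooled effect
Q         <- sum(w * (theta - theta_bar)^2)              # homogeneity statistic

# Between-study variance, truncated at zero
tau2 <- max(0, (Q - (k - 1)) / (sum(w) - sum(w^2) / sum(w)))
tau2
```

Because Q is smaller than k - 1 for these data, the estimate is truncated to zero.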

CHAPTER XV

15
Sample size estimation
A research project is usually based on a sample. One of the most important questions before starting a study is how large a sample, or how many subjects, the study needs. "Subject" here means the basic unit of a study: patients, volunteers, field plots, plants, devices, and so on. Estimating the number of subjects needed for a study is extremely important, because it can decide whether the study succeeds or fails. If the number of subjects is insufficient, the conclusions drawn from the study will not be accurate, or no conclusion can be drawn at all. Conversely, if there are many more subjects than necessary, resources, money and time are wasted. The key issue before a study is therefore to estimate a number of subjects that is just sufficient for the objectives of the study. That number depends on three main factors:

- the errors that the researcher is willing to accept, specifically type I and type II errors;
- the variability of the measurement, specifically its standard deviation; and
- the difference or effect size that the researcher wants to detect.

Without information on these three factors, it is impossible to estimate the sample size. In the author's experience, many people who undertake research have no idea about these quantities, so when they consult a statistician the only answer they receive is: "it cannot be calculated!" In this chapter I discuss these three factors.

15.1 The concept of power

Statistics is a scientific method whose purpose is to discover, or to search for, things that can be grouped under the phrase "the unknown". The unknown here refers to phenomena that we cannot observe, or can observe only incompletely. It may be a parameter (such as the average height of Vietnamese people, or the weight of a molecule), the efficacy of a treatment, a gene that makes leaves green, people's preferences, and so on. We can measure height, or run an experiment to learn the efficacy of a drug, but such studies are carried out on a group of subjects only, not on the entire population.

At the simplest level, the unknown can take one of two forms: it either exists or it does not. For example, a treatment either is or is not effective against fractures; customers either like or do not like a soft drink. Because nobody knows the phenomenon completely, we have to set up hypotheses. The simplest are the null hypothesis (the phenomenon does not exist, denoted H-) and the main (alternative) hypothesis (the phenomenon exists, denoted H+).
We use statistical tests such as the t, F, z and χ² tests to assess the plausibility of a hypothesis. The result of a statistical test can be reduced to two values: either statistically significant or not statistically significant. As discussed in Chapter 7, statistical significance is usually judged by the P value: if P < 0.05 we declare the result statistically significant; if P > 0.05 we say it is not. One can also think of "significant" and "not significant" as "signal" and "no signal". Let T+ denote a statistically significant test result and T- a non-significant one.

Consider a concrete example: to find out whether risedronate is effective in treating osteoporosis, we conduct a study with two groups of patients (one treated with risedronate and one given placebo only). We follow the patients, collect fracture data, estimate the fracture rate in each group, and compare the two rates with a statistical test. The test result is either statistically significant (P < 0.05) or not (P > 0.05). Recall that we do not know whether risedronate truly prevents fractures; we can only posit a hypothesis H. So, when we consider a hypothesis together with a test result, there are four situations:

(a) Hypothesis H is true (risedronate is effective) and the test result is P < 0.05;
(b) Hypothesis H is true, but the test result is not statistically significant;
(c) Hypothesis H is false (risedronate is not effective), but the test result is statistically significant;
(d) Hypothesis H is false and the test result is not statistically significant.

Situations (a) and (d) pose no problem, because the test result agrees with the true state of the phenomenon. In situations (b) and (c) we make an error, because the test result does not agree with the truth. Statistics has a few terms for these:

- the probability of situation (b) is called the type II error, usually denoted β;

- the probability of situation (a) is called power. In other words, power is the probability that the test gives p < 0.05 given that hypothesis H is true: power = 1 - β;

- the probability of situation (c) is called the type I error (or significance level), usually denoted α. In other words, α is the probability that the test gives p < 0.05 given that hypothesis H is false;

- the probability of situation (d) is not of primary concern and has no special term, although it can be called the true negative.

The four situations can be summarized in Table 1:

Table 1. Situations in testing a scientific hypothesis

Test result                      Hypothesis H true               Hypothesis H false
                                 (drug is effective)             (drug is not effective)
Statistically significant        True positive (power),          Type I error,
(p < 0.05)                       1-β = P(s | H+)                 α = P(s | H-)
Not statistically significant    Type II error,                  True negative,
(p > 0.05)                       β = P(ns | H+)                  1-α = P(ns | H-)

Note: in this table s means significant; ns means non-significant; H+ means the hypothesis is true; and H- means the hypothesis is false. The four situations can therefore be described in the language of conditional probability as follows: power = 1 - β = P(s | H+); β = P(ns | H+); and α = P(s | H-).
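
The conditional probabilities in Table 1 can be made concrete by simulation. The sketch below is an illustration, not part of the original text: the group size (30 per group), the true difference under H+ (0.8 standard deviations), the seed and the function name one_study are all arbitrary choices made here.

```r
set.seed(123)     # arbitrary seed, for reproducibility only
n_sim <- 2000     # number of simulated studies
n     <- 30       # subjects per group (illustrative)

# P value of a two-sample t test when the true difference is delta
one_study <- function(delta) {
  x <- rnorm(n, mean = 0,     sd = 1)
  y <- rnorm(n, mean = delta, sd = 1)
  t.test(x, y)$p.value
}

p_null <- replicate(n_sim, one_study(0))     # H- holds: no true difference
p_alt  <- replicate(n_sim, one_study(0.8))   # H+ holds: true difference of 0.8 SD

alpha_hat <- mean(p_null < 0.05)   # estimate of P(s | H-), the type I error
power_hat <- mean(p_alt  < 0.05)   # estimate of P(s | H+), the power
beta_hat  <- 1 - power_hat         # estimate of P(ns | H+), the type II error

round(c(alpha = alpha_hat, power = power_hat, beta = beta_hat), 3)
```

With the 0.05 threshold, the estimated type I error should be close to 0.05, while the power depends on the sample size and on the assumed true difference.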

15.2 Hypothesis testing and medical diagnosis

The explanations above may still be rather abstract for some readers. One way to illustrate the concepts of power and P value is through medical diagnosis. Indeed, scientific research and statistical inference can be likened to the process of diagnosing a disease. In diagnosis, we do not know at first whether the patient has the disease, and we must gather information (medical history, lifestyle, habits, etc.) and run tests (X-ray, ultrasound, blood and urine analysis, etc.) to reach a conclusion.

There are two hypotheses: the patient does not have the disease (denoted H-) and the patient has the disease (H+). At the simplest level, a test result is either positive (+ve) or negative (-ve). Diagnosis also involves four situations, which I discuss below; to make matters clearer, let us look at a concrete example.

In cancer diagnosis, the standard way to know for certain whether cancer is present is a biopsy (surgically removing tissue and examining it under a microscope). But a biopsy is an invasive procedure, so it cannot be applied routinely to everyone. Instead, medicine has developed non-invasive tests for cancer, such as X-rays or blood tests. The result of an X-ray or a blood test can be summarized by two values: positive (+ve) or negative (-ve).

But no indirect test, however sophisticated, is perfect and absolutely accurate. Some people test positive but do not actually have cancer, and some people test negative but in fact have cancer. We thus have four possibilities:

- the patient has cancer, and the test result is positive. This is a true positive (the technical term is sensitivity);

- the patient does not have cancer, but the test result is positive. This is a false positive;

- the patient does not have cancer, and the test result is negative. This is a true negative (specificity); and

- the patient has cancer, but the test result is negative. This is a false negative.
The four situations can be summarized in Table 2:

Table 2. Situations in medical diagnosis: test result and disease status

Test result        Disease present     Disease absent
+ve (positive)     Sensitivity         False positive
-ve (negative)     False negative      Specificity

At this point we can see the parallel between medical diagnosis and statistical testing. Medical diagnosis has the true-positive rate, which corresponds to the concept of power in research. It also has the false-positive probability, which corresponds to the type I error (the role played by the P value) in scientific inference. The following table shows this correspondence:

Table 3. Correspondence between medical diagnosis and scientific inference

Medical diagnosis                        Testing a scientific hypothesis
Diagnosing a disease                     Testing a scientific hypothesis
Disease status (present or absent)       Scientific hypothesis (H+ or H-)
Test method                              Statistical test
Test result +ve                          P value < 0.05, statistically significant
Test result -ve                          P value > 0.05, not statistically significant
True positive (sensitivity)              Power; 1-β; P(s | H+)
False positive                           Type I error; P value; α; P(s | H-)
False negative                           Type II error; β; P(ns | H+)
True negative (specificity)              True negative; 1-α; P(ns | H-)

Just as medical tests are never perfect, statistical tests also make errors. Research results therefore always carry some uncertainty (like the uncertainty in a medical diagnosis). The task is to design the study so that the type I and type II errors are as low as possible.

15.3 Data for sample size estimation

As mentioned at the beginning of this chapter, to estimate the number of subjects needed for a study we need three pieces of information: the type I and type II error probabilities, the variability of the measurement, and the effect size.

- Error probabilities: a study usually accepts a type I error of about 1% or 5% (that is, α = 0.01 or 0.05) and a type II error of about β = 0.1 to β = 0.2 (that is, power of 0.8 to 0.9).

- Variability is the standard deviation of the measurement on which the study bases its analysis. For a study of hypertension, for example, the researcher needs the standard deviation of blood pressure. We denote the variability by σ.

- Effect size: in a study comparing two groups, this is the mean difference between the two groups that the researcher wants to detect. For example, the researcher may hypothesize that patients treated with drug A have blood pressure 10 mmHg lower than the placebo group. Here, 10 mmHg is the effect size. We denote the effect size by Δ.

A study may have one group of subjects, two groups, or sometimes more than two, and sample size estimation depends on which case applies.
With one group of subjects, the number of subjects (n) needed for the study can be calculated by hand as follows:

$$n = \frac{C}{(\Delta / \sigma)^2} \qquad [1]$$

With two groups of subjects, the number of subjects (n) needed in each group can be calculated as follows:

$$n = \frac{2C}{(\Delta / \sigma)^2} \qquad [2]$$

where the constant C is determined by the type I and type II error probabilities (or power) as follows:


Table 4: The constant C in relation to type I and type II errors

α        β = 0.20          β = 0.10          β = 0.05
         (Power = 0.80)    (Power = 0.90)    (Power = 0.95)
0.10     6.15              8.53              10.79
0.05     7.85              10.51             13.00
0.01     13.33             16.74             19.84
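
The constant C can also be computed rather than looked up. The sketch below assumes the usual normal-approximation definition for a two-sided test, C = (z_{1-α/2} + z_{1-β})²; this definition is not stated in the text. It reproduces the α = 0.05 row of the table (7.85, 10.51, 13.00), while the other printed rows differ somewhat from what this formula gives. The function names are hypothetical.

```r
# Assumed definition of C for a two-sided test: (z_{1-alpha/2} + z_{1-beta})^2
C_const <- function(alpha, beta) {
  (qnorm(1 - alpha / 2) + qnorm(1 - beta))^2
}

# Formula [1]: one group
n_one_group <- function(delta, sigma, alpha = 0.05, beta = 0.20) {
  C_const(alpha, beta) / (delta / sigma)^2
}

# Formula [2]: two groups, n per group
n_two_groups <- function(delta, sigma, alpha = 0.05, beta = 0.20) {
  2 * C_const(alpha, beta) / (delta / sigma)^2
}

round(C_const(0.05, c(0.20, 0.10, 0.05)), 2)   # alpha = 0.05 row of the table
n_one_group(delta = 1, sigma = 4.6)            # about 166
n_two_groups(delta = 0.04, sigma = 0.12, beta = 0.10)   # about 189 per group
```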

15.4 Estimating sample size
15.4.1 Sample size for estimating a mean

Example 1: We want to estimate the height of Vietnamese men, and we accept an error within 1 cm (d = 1) with a confidence level of 0.95 (that is, α = 0.05) and power = 0.8 (or β = 0.2). Previous studies indicate that the standard deviation of height in Vietnamese people is about 4.6 cm. We can apply formula [1] to estimate the sample size needed for the study:

$$n = \frac{C}{(\Delta / \sigma)^2} = \frac{7.85}{(1 / 4.6)^2} = 166$$

In other words, we need to measure the height of 166 subjects to estimate the height of Vietnamese men to within 1 cm.

If the acceptable error is 0.5 cm (instead of 1 cm), the number of subjects needed is $n = \frac{7.85}{(0.5 / 4.6)^2} = 664$. If the acceptable error is 0.1 cm, the number of subjects rises to 16610 people! These estimates show clearly that the sample size depends heavily on the error we accept. The more precise the estimate we want, the more subjects we need.
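
The three hand calculations above can be repeated in one step, which also makes the inverse-square relationship visible: halving the acceptable error quadruples the sample size. The variable names are illustrative; the constant 7.85 is the value of C used in the text.

```r
sigma  <- 4.6              # standard deviation of height (cm)
C      <- 7.85             # constant for alpha = 0.05, power = 0.80
errors <- c(1, 0.5, 0.1)   # acceptable errors (cm)

n <- C / (errors / sigma)^2   # formula [1], applied to each error

data.frame(error_cm = errors, n = round(n, 1))
```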

R has a function, power.t.test, that can be used to estimate the sample size for the example above. Note that we tell R that the problem involves one group with type="one.sample":

# error 1 cm, standard deviation 4.6, alpha=0.05, power=0.8
> power.t.test(delta=1, sd=4.6, sig.level=.05, power=.80,
type='one.sample')

     One-sample t test power calculation

              n = 168.0131
          delta = 1
             sd = 4.6
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

The result from R is 168, two subjects more than the hand calculation. R carries more decimal places and bases its calculation on the t distribution, so it is more accurate than the hand calculation. With an error of 0.5 cm:

# error 0.5 cm, standard deviation 4.6, alpha=0.05, power=0.8
> power.t.test(delta=0.5, sd=4.6, sig.level=.05, power=.80,
type='one.sample')

     One-sample t test power calculation

              n = 666.2525
          delta = 0.5
             sd = 4.6
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
Example 2: A treatment is able to increase alkaline phosphatase in patients with osteoporosis. The standard deviation of alkaline phosphatase is 15 U/l. A new study will be conducted in a population of Vietnamese patients, and the researchers want to know how many patients must be recruited to show that the drug can raise alkaline phosphatase from 60 to 65 U/l after 3 months of treatment, with type I error α = 0.05 and power = 0.8.

This is a before-after study, meaning that measurements are taken before and after treatment. Here we have only one group of patients, but they are measured twice (before and after taking the drug). The clinical endpoint for evaluating the efficacy of the drug is the change in alkaline phosphatase. In this case the expected mean increase is 5 U/l and the standard deviation is 15 U/l, or in the language of R, delta=5, sd=15, sig.level=.05, power=.80. The command as printed, however, uses delta=3, so the output below is the sample size for detecting an increase of 3 U/l:

> power.t.test(delta=3, sd=15, sig.level=.05, power=.80,
type='one.sample')

     One-sample t test power calculation

              n = 198.1513
          delta = 3
             sd = 15
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Thus, we need about 198 patients to detect an increase of 3 U/l; detecting the larger increase of 5 U/l would require fewer patients.


15.4.2 Sample size for comparing two means

In practice, many studies aim to compare two groups. Sample size estimation for these studies relies mainly on formula [2] presented in section 15.3.

Example 3: A study is designed to test the drug alendronate in the treatment of osteoporosis in postmenopausal women. Two groups of patients are recruited: group 1 is the intervention group (treated with alendronate), and group 2 is the control group (not treated). The criterion for evaluating the efficacy of the drug is bone mineral density (BMD). Data from epidemiological studies show that the mean BMD in postmenopausal women is 0.80 g/cm², with a standard deviation of 0.12 g/cm². The question is how many subjects we need to study in order to show that, after 12 months of treatment, BMD in group 1 is about 5% higher than in group 2.

In this example, denote the mean of group 2 by μ2 and the mean of group 1 by μ1. We have μ1 = 0.8 × 1.05 = 0.84 g/cm² (5% higher than group 2), and therefore Δ = 0.84 - 0.80 = 0.04 g/cm². The standard deviation is σ = 0.12 g/cm². With power = 0.90 and α = 0.05, the required sample size is:

$$n = \frac{2C}{(\Delta / \sigma)^2} = \frac{2 \times 10.51}{(0.04 / 0.12)^2} = 189$$

The solution from R with the function power.t.test is as follows:

> power.t.test(delta=0.04, sd=0.12, sig.level=0.05, power=0.90,
type="two.sample")

     Two-sample t test power calculation

              n = 190.0991
          delta = 0.04
             sd = 0.12
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group

Note that in power.t.test, besides the usual parameters delta (the hypothesized effect size or difference), sd (standard deviation), sig.level (type I error probability) and power, we must state explicitly that this is a study with two groups, using the parameter type="two.sample".

The result above says that we need 190 patients in each group (or 380 patients for the whole study). What do power = 0.90 and α = 0.05 mean in this case? Answer: if the true difference is as hypothesized and we were to run a large number of studies (say 1000), each with 380 patients, then 90% (or 900) of them would yield a result with P value < 0.05.
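
This interpretation of power can be checked by simulation. The sketch below is an illustration with an arbitrary seed: it generates 1000 hypothetical studies, each with 190 patients per group drawn from normal distributions with means 0.84 and 0.80 g/cm² and standard deviation 0.12 g/cm², and counts how many of them give p < 0.05.

```r
set.seed(2024)     # arbitrary seed, for reproducibility only
n_studies <- 1000  # number of simulated studies
n_group   <- 190   # patients per group, as estimated by power.t.test

p_values <- replicate(n_studies, {
  treated <- rnorm(n_group, mean = 0.84, sd = 0.12)  # group 1: alendronate
  control <- rnorm(n_group, mean = 0.80, sd = 0.12)  # group 2: control
  t.test(treated, control)$p.value
})

prop_significant <- mean(p_values < 0.05)  # empirical power
prop_significant
```

The proportion of significant studies should be close to 0.90, varying slightly from one seed to another.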
15.4.3 Sample size for analysis of variance

The method of estimating sample size for a comparison of two groups can be extended to comparisons of more than two groups. With several groups, as discussed in Chapter 11, the method of comparison is analysis of variance. In this method, the residual mean square (RMS) is the estimate of the variability of the measurement within each group, and this index is very important in estimating the sample size.

The details of the theory behind sample size estimation for analysis of variance are rather complicated and beyond the scope of this chapter. But the main principle is no different from that for the comparison of two groups. Let the means of the k groups be μ1, μ2, μ3, . . ., μk. The between-group sum of squares is

$$SS = \sum_{i=1}^{k} (\mu_i - \bar{\mu})^2, \quad \text{where } \bar{\mu} = \sum_{i=1}^{k} \mu_i / k.$$

Let $\lambda = \dfrac{SS}{(k - 1)\,RMS}$. The problem is to find the sample size n such that $z_\beta$ meets the required power of 0.80 or 0.90, where

$$z_\beta = \frac{\sqrt{k(n-1)}\,\sqrt{2(k-1)(1+n\lambda)^2 - (1+2n\lambda)} - \sqrt{F\,(k-1)(1+n\lambda)\,\bigl(2k(n-1)-1\bigr)}}{\sqrt{(k-1)(1+n\lambda)F + k(n-1)(1+2n\lambda)}}$$

Here F is the critical value of the F test. (See J. Fleiss, The Design and Analysis of Clinical Experiments, John Wiley & Sons, New York 1986, page 373.)
Example 4. To compare the sweetness of a drink among 4 groups of subjects differing in sex and age (call the 4 groups A, B, C and D), the researchers hypothesize that sweetness in groups A, B, C and D is 4.5, 3.0, 5.6 and 1.3, respectively. From a review of many previous studies, the researchers also know that the RMS of sweetness within each group is about 8.7. The question is how many subjects are needed to detect a statistically significant difference at α = 0.05 with power = 0.9.

The function power.anova.test in R can be applied to this problem. We simply supply the 4 hypothesized means and the RMS as follows:

# first put the 4 means in a vector
> groupmeans <- c(4.5, 3.0, 5.6, 1.3)
# then call power.anova.test:
> power.anova.test(groups = length(groupmeans),
between.var=var(groupmeans),
within.var=8.7, power=0.90, sig.level=0.05)

     Balanced one-way analysis of variance power calculation

         groups = 4
              n = 12.81152
    between.var = 3.486667
     within.var = 8.7
      sig.level = 0.05
          power = 0.9

NOTE: n is number in each group

The result shows that the researchers need about 13 subjects in each group (that is, 52 subjects for the whole study).
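
The same function can be turned around: given a sample size per group, it returns the power. The sketch below, with illustrative variable names and an arbitrary set of candidate sample sizes, shows how power grows with n for the four hypothesized means.

```r
groupmeans <- c(4.5, 3.0, 5.6, 1.3)   # hypothesized means of groups A to D
sizes      <- c(8, 10, 12, 13, 15, 20) # candidate sample sizes per group

# Power of the one-way ANOVA for each candidate sample size
pw <- sapply(sizes, function(n) {
  power.anova.test(groups = length(groupmeans), n = n,
                   between.var = var(groupmeans),
                   within.var = 8.7, sig.level = 0.05)$power
})

data.frame(n_per_group = sizes, power = round(pw, 3))
```

The power first exceeds 0.90 at 13 subjects per group, in agreement with the value n = 12.81 reported above.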
15.4.4 Sample size for estimating a proportion

Many descriptive studies have the rather simple aim of estimating a proportion. Health researchers, for instance, often want to know the prevalence of a disease in the community, and opinion and market researchers often want to know the proportion of the population that likes a product. In these cases we do not have continuous measurements; the outcomes are binary values such as yes/no or like/dislike. The way to estimate the sample size therefore differs from the examples above.

In 1991, an opinion poll in the United States showed that 45% of respondents were willing to encourage their children to donate a kidney to patients in need. The 95% confidence interval of this proportion was 42% to 48%, a width of 6 percentage points! This result is [relatively] imprecise, even though as many as 1000 people took part. Why? To answer this question, let us look at some theory of sample size estimation for a proportion.

We know from Chapters 6 and 9 that if p is estimated from n subjects, the 95% confidence interval of the proportion p [in the population] is $p \pm 1.96 \times SE(p)$, where

$$SE(p) = \sqrt{p(1 - p) / n}$$

Now turn the problem around: we want to estimate p such that the margin of error, $1.96 \times SE(p)$, does not exceed a constant m. In other words, we want:

$$1.96 \sqrt{p(1 - p) / n} \le m$$

We want to find the number of subjects n that meets this requirement. From the expression above, it is easy to see that:

$$n \ge \left(\frac{1.96}{m}\right)^2 p(1 - p)$$

The sample size therefore depends on the error m and on the proportion p that we want to estimate. The smaller the error, the larger the sample size.

Example 5: We want to estimate the proportion of men who smoke in Vietnam, such that the estimate is no more than 2% above or below the true proportion in the whole population. A previous study shows that the smoking rate among Vietnamese men may be as high as 70%. The question is how many men we need to study to meet this requirement.

In this example, the error is m = 0.02 and p = 0.70, and the sample size needed for the study is:

$$n \ge \left(\frac{1.96}{0.02}\right)^2 \times 0.7 \times 0.3 = 2017$$

In other words, we need to study at least 2017 men.

If we want to reduce the error from 2% to 1% (that is, m = 0.01), the number of subjects becomes 8067! Just one percentage point more precision may require over 6000 more people. Sample size estimation must therefore be done very carefully, balancing the precision of the information to be collected against the cost.

R has no function for estimating the sample size for a single proportion, but with the formula above readers can very easily write one.
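
One possible version of such a function is sketched below. The function name n_for_proportion and its arguments are hypothetical choices made here; the body implements the inequality above, using qnorm() for the normal quantile instead of the rounded value 1.96, and rounds up to the next whole subject.

```r
# Sample size needed to estimate a proportion p to within a margin of error m
n_for_proportion <- function(p, m, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)   # 1.96 for a 95% confidence level
  ceiling((z / m)^2 * p * (1 - p))       # round up to a whole number of subjects
}

n_for_proportion(p = 0.70, m = 0.02)   # Example 5: margin of error 2%
n_for_proportion(p = 0.70, m = 0.01)   # margin of error 1%
```

Because the result is rounded up, the second call returns 8068, one more than the rounded figure quoted in the text.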
15.4.5 Sample size for comparing two proportions

Many inferential studies have two [or more than two] groups to compare. In section 15.4.2 we became familiar with the method of estimating sample size for comparing two means with the t test. Those are studies whose endpoints are continuous variables. But there are studies in which the endpoint is not continuous but binary, as I have just discussed in section 15.4.4. To compare two proportions, the most commonly used tests are the binomial test and the chi-square (χ²) test. In this section, I discuss how to calculate the sample size for these two tests.

Let the two proportions [which we do not know but want to learn about] be p1 and p2, and let Δ = p1 - p2. The hypothesis we want to test is Δ = 0. The theory behind sample size estimation for testing this hypothesis is rather involved, but it can be summarized by the following formula:

$$n = \frac{\left(z_{\alpha/2}\sqrt{2\bar{p}(1 - \bar{p})} + z_{\beta}\sqrt{p_1(1 - p_1) + p_2(1 - p_2)}\right)^2}{\Delta^2}$$

where $\bar{p} = (p_1 + p_2)/2$; $z_{\alpha/2}$ is the z value of the normal distribution for probability α/2 (for example, when α = 0.05, $z_{\alpha/2}$ = 1.96; when α = 0.01, $z_{\alpha/2}$ = 2.57); and $z_{\beta}$ is the z value of the normal distribution for probability β (for example, when β = 0.10, $z_{\beta}$ = 1.28; when β = 0.20, $z_{\beta}$ = 0.84).

Example 6: A randomized controlled clinical trial is designed to evaluate the efficacy of a drug against vertebral fractures. Two groups of patients will be recruited. Group 1 is treated with the drug, and group 2 is the control group (not treated). The researchers hypothesize that the fracture rate in group 2 is about 10%, and that the drug can reduce this rate to about 6%. If the researchers want to test this hypothesis with type I error α = 0.01 and power = 0.90, how many patients must be recruited for the study?

Here we have Δ = 0.10 - 0.06 = 0.04, and $\bar{p}$ = (0.10 + 0.06)/2 = 0.08. With α = 0.01, $z_{\alpha/2}$ = 2.57, and with power = 0.90, $z_{\beta}$ = 1.28. The number of patients needed in each group is therefore:

$$n = \frac{\left(2.57\sqrt{2 \times 0.08 \times 0.92} + 1.28\sqrt{0.1 \times 0.90 + 0.06 \times 0.94}\right)^2}{(0.04)^2} = 1361$$

Thus, this study needs to recruit at least 2722 patients to test the hypothesis above.
The R function power.prop.test can be used to calculate the sample size for the case above. It needs the following information: power, sig.level, p1 and p2. In the example above, we can write:

> power.prop.test(p1=0.10, p2=0.06, power=0.90, sig.level=0.01)

     Two-sample comparison of proportions power calculation

              n = 1366.430
             p1 = 0.1
             p2 = 0.06
      sig.level = 0.01
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group

Note that the result from R (1366 subjects in each group) is somewhat more accurate, because R uses more decimal places in its calculation than the hand calculation does.
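
The role of the rounded z values can be seen by coding the formula of this section directly. The sketch below uses a hypothetical function name, n_two_props, and takes the z values from qnorm() rather than from the rounded constants 2.57 and 1.28.

```r
# Sample size per group for comparing two proportions (formula of this section)
n_two_props <- function(p1, p2, alpha = 0.05, power = 0.80) {
  pbar   <- (p1 + p2) / 2
  z_alpa <- qnorm(1 - alpha / 2)   # z value for alpha/2
  z_beta <- qnorm(power)           # z value for beta = 1 - power
  (z_alpa * sqrt(2 * pbar * (1 - pbar)) +
     z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p1 - p2)^2
}

n_formula <- n_two_props(0.10, 0.06, alpha = 0.01, power = 0.90)
n_r       <- power.prop.test(p1 = 0.10, p2 = 0.06,
                             power = 0.90, sig.level = 0.01)$n

round(c(formula = n_formula, power.prop.test = n_r), 2)
```

With unrounded z values the formula gives about 1366 per group, in line with power.prop.test; the figure 1361 obtained by hand comes from rounding the z values.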
Before leaving this chapter, I would like to take the opportunity to stress once more that sample size estimation is an extremely important step in designing a scientifically meaningful study, because it can determine whether the study succeeds or fails. Before estimating the sample size, the investigator must know (or at least have some specific hypotheses about) the problem of interest. Sample size estimation requires a number of parameters, as discussed at the beginning of this chapter, and without them no estimate is possible. For an entirely new study, one that nobody has done before, the parameters describing the effect size and the measurement variability may not be available, and the investigator will need to carry out some simulations or a pilot study to obtain them. Sample size estimation by simulation is a fairly specialized research area that lies outside the scope of this book, but readers can learn more about it in more advanced statistics textbooks.
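To give a flavour of the simulation approach (only as a rough sketch, not as the method of any particular textbook), power can be estimated by repeatedly generating trials and counting how often the test rejects. The function name sim.power, the number of simulated trials (500) and the seed are all assumptions made for illustration; the trial is the one from Example 6, with 1367 patients per group.

```r
# Monte Carlo estimate of power for a two-group comparison of proportions.
# sim.power is an illustrative name; nsim = 500 keeps the run short.
sim.power <- function(n, p1, p2, alpha, nsim = 500)
{
    reject <- logical(nsim)
    for (i in seq_len(nsim))
    {
        x1 <- rbinom(1, n, p1)    # number of fractures in group 1
        x2 <- rbinom(1, n, p2)    # number of fractures in group 2
        p.value <- prop.test(c(x1, x2), c(n, n))$p.value
        reject[i] <- p.value < alpha
    }
    mean(reject)                  # proportion of simulated trials that reject
}

set.seed(123)
est <- sim.power(n = 1367, p1 = 0.10, p2 = 0.06, alpha = 0.01)
est   # should be in the neighbourhood of the planned power of 0.90
```

The estimate varies from run to run, and prop.test applies a continuity correction by default, so the simulated power may fall slightly below the nominal value.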

CHAPTER XVI

PROGRAMMING AND FUNCTIONS

16
Appendix 1: Programming and functions in R
R was designed so that users can develop their own functions to suit their analytical and computational needs. Indeed, as mentioned at the beginning of the book, R can be regarded as a statistical language, and we can use it to solve problems that are not usually found in textbooks. In this section I present only a few simple functions so that readers can see how R works, in the hope that this will help them develop their own functions later.
A function (sometimes called a macro in other software) is essentially a set of commands stored under a name. At its simplest, a function is shorthand for a group of commands.
Example 1. In the following commands we create two datasets (data1 and data2). Each dataset has two columns of data simulated from the normal distribution. We then plot the two datasets with annotations.
data1 <- cbind(rnorm(100,1), rnorm(100,0))
data2 <- cbind(rnorm(100,-1), rnorm(100,0))
xr <- range(rbind(data1,data2)[,1])
yr <- range(rbind(data1,data2)[,2])
plot(data1, xlim=xr, ylim=yr, col=1, xlab="", ylab="")
par(new=T)
plot(data2, xlim=xr, ylim=yr, col=2, xlab="", ylab="")
title(main="My simulated data", xlab="Weight", ylab="Yield")
legend(-3.0, -1.5, c("Big", "Small"), col=1:2, pch=1)

One way to keep all these commands is to store them in a text file, for example. Each time we want to use them, we simply cut and paste them into R. A better way is to create a function containing these commands so that they can be reused many times.
Every R function must have a name. All of its commands are enclosed between the two symbols { and }. The symbol { indicates that all the commands that follow belong to the function, and the symbol } marks the end of the function. For the example above, we call the function plotfigure:
plotfigure <- function()
{
    # simulate two datasets of 100 points each
    data1 <- cbind(rnorm(100,1), rnorm(100,0))
    data2 <- cbind(rnorm(100,-1), rnorm(100,0))
    # common axis limits for both datasets
    xr <- range(rbind(data1,data2)[,1])
    yr <- range(rbind(data1,data2)[,2])
    plot(data1, xlim=xr, ylim=yr, col=1, xlab="", ylab="")
    par(new=TRUE)
    plot(data2, xlim=xr, ylim=yr, col=2, xlab="", ylab="")
    title(main="My simulated data", xlab="Weight", ylab="Yield")
    legend(-3.0, -1.5, c("Big", "Small"), col=1:2, pch=1)
}
Once the function has been entered into R, we can simply call it as many times as we like:
> plotfigure()
> plotfigure()

and the results look like this:

[Figure: two scatter plots, each titled "My simulated data", with Weight on the horizontal axis, Yield on the vertical axis, and a legend for Big and Small.]

In the function plotfigure above, we simulate 100 observations from the normal distribution. Every time it is called, the function generates exactly 100 observations, and we cannot change this (except by editing the function itself). In other words, the function has no parameters.
The convenient aspect of functions is that we can let the parameters vary according to the user's wishes. For example, if we want to vary the number of simulated observations and the means of the normal distributions, we only need to make these numbers parameters that the user can change. Calling the parameters n, mean1 and mean2, the function becomes:
plotfigure <- function(n, mean1, mean2)
{
    # n points per dataset; mean1 and mean2 are the means of the first column
    data1 <- cbind(rnorm(n,mean1), rnorm(n,0))
    data2 <- cbind(rnorm(n,mean2), rnorm(n,0))
    xr <- range(rbind(data1,data2)[,1])
    yr <- range(rbind(data1,data2)[,2])
    plot(data1, xlim=xr, ylim=yr, col=1, xlab="", ylab="")
    par(new=TRUE)
    plot(data2, xlim=xr, ylim=yr, col=2, xlab="", ylab="")
    title(main="My simulated data", xlab="Weight", ylab="Yield")
    legend(-3.0, -1.5, c("Big", "Small"), col=1:2, pch=1)
}
When using the function, we simply change n, mean1 and mean2. In the first of the two commands below, we draw a scatter plot with 200 observations and means 2 and -2. In the second, we raise the number of observations to 500 while keeping the same means as in the previous simulation:
> plotfigure(200, 2, -2)
> plotfigure(500, 2, -2)

And the results differ from those above:

[Figure: two scatter plots, each titled "My simulated data", with Weight on the horizontal axis, Yield on the vertical axis, and a legend for Big and Small.]
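A further convenience, sketched here under assumptions of my own rather than taken from the text, is to give the parameters default values so that the function can still be called without arguments. The name plotfigure2 is hypothetical; the sketch also uses points() to add the second dataset instead of par(new=TRUE), places the legend by keyword, and returns the simulated data invisibly so that they can be inspected.

```r
# Variant of plotfigure with default arguments (illustrative only).
plotfigure2 <- function(n = 100, mean1 = 1, mean2 = -1)
{
    data1 <- cbind(rnorm(n, mean1), rnorm(n, 0))
    data2 <- cbind(rnorm(n, mean2), rnorm(n, 0))
    both  <- rbind(data1, data2)
    plot(data1, xlim = range(both[, 1]), ylim = range(both[, 2]), col = 1,
         main = "My simulated data", xlab = "Weight", ylab = "Yield")
    points(data2, col = 2)                     # add the second dataset
    legend("bottomleft", c("Big", "Small"), col = 1:2, pch = 1)
    invisible(list(data1 = data1, data2 = data2))
}

plotfigure2()                     # uses n = 100, mean1 = 1, mean2 = -1
out <- plotfigure2(200, 2, -2)    # same call pattern as plotfigure(200, 2, -2)
dim(out$data1)                    # 200 rows, 2 columns
```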

Example 2. We want to write a function that adds two numbers. (R can of course do this already, but for the sake of illustration I will keep the task this simple.) We call the function add. Its two parameters, a and b, are the arguments. It is written as follows:
add <- function(a, b)
{
    sum <- a+b
    ans <- "Answer = "
    cat(ans, sum, "\n")
}

That is all! As you can see, the first step is to name the function add and define the parameters a and b. The body of the function opens with the symbol { and closes with }. sum is a variable holding a plus b. ans <- "Answer = " defines the label for the answer (it is optional). cat(ans, sum, "\n") collects the values and prints the result for the user of the function, where "\n" means that after printing, output moves to a new line and the user gets a fresh prompt. Readers can paste the commands above into R and try:
> add(3, 9)
Answer = 12
> add(sqrt(5), exp(10))
Answer = 22028.7
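Because add only prints its result with cat, the sum cannot be stored and reused in later calculations. A small variant that returns the value instead is sketched below; the name add2 is hypothetical.

```r
# Return the sum instead of printing it (add2 is an illustrative name).
add2 <- function(a, b)
{
    a + b    # the value of the last expression is what the function returns
}

total <- add2(3, 9)
total        # 12
total * 2    # 24: the result can be used in further calculations
```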

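Example 3. The following function carries out more computation than the functions in the earlier examples. Suppose we have a variable with n elements $x_1, x_2, x_3, \ldots, x_n$ that follows a normal distribution with mean $\mu$ and variance $\sigma^2$. In mathematical notation:

$$x_i \sim N(\mu, \sigma^2)$$

Suppose also that we have prior information that $\mu$ follows a normal distribution with mean $\mu_0$ and variance $\sigma_0^2$, that is:

$$\mu \sim N(\mu_0, \sigma_0^2)$$

By Bayes' theorem, we can estimate the mean

$$\mu_p = \frac{\dfrac{\mu_0}{\sigma_0^2} + \dfrac{n\bar{x}}{\sigma^2}}{\dfrac{1}{\sigma_0^2} + \dfrac{n}{\sigma^2}}$$

and the variance

$$\frac{1}{\sigma_p^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}$$

where $\bar{x}$ is the mean of the sample of size n. $\mu_p$ and $\sigma_p^2$ are called the posterior mean and posterior variance. We can write an R function to compute these two quantities as follows. We call the function bayes.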

bayes <- function(x, prior.mean, prior.var)
{
    n <- length(x)
    sample.mean <- mean(x)
    sample.var <- var(x)
    # precision-weighted combination of prior mean and sample mean
    numerator <- (prior.mean/prior.var) + (n*sample.mean/sample.var)
    denominator <- 1/prior.var + n/sample.var
    posterior.mean <- numerator/denominator
    posterior.var <- 1/denominator
    a <- "Posterior mean = "
    b <- "Posterior variance = "
    cat("Sample size = ", n, "\n")
    cat("Sample mean = ", sample.mean, "\n")
    cat("Sample var = ", sample.var, "\n")
    cat("Prior mean = ", prior.mean, "\n")
    cat("Prior var = ", prior.var, "\n")
    cat(a, posterior.mean, "\n")
    cat(b, posterior.var, "\n")
}

Example 4. Bone mineral density (bmd) in a population is usually normally distributed, with a mean of about 1.0 g/cm2 and a variance of 0.0144 g2/cm4. Suppose we measure the bone density of a group of patients as follows: 1.0, 1.5, 2.1, 1.7, 1.8, 0.9, 0.7. We want to know the mean and variance of this sample after adjusting for the known prior mean and variance. First, we store the data in bmd:
> bmd <- c(1.0, 1.5, 2.1, 1.7, 1.8, 0.9, 0.7)

and then call the function bayes as follows:


> bayes(bmd, 1.0, 0.0144)
Sample size = 7
Sample mean = 1.385714
Sample var = 0.2747619
Prior mean = 1
Prior var = 0.0144
Posterior mean = 1.103525
Posterior variance = 0.01053507
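As with add, bayes prints its results but does not return them. The sketch below, which is an assumption of mine and not from the text, returns the two posterior quantities in a list so that they can be reused; the name bayes2 and the names of the list components are hypothetical.

```r
# Variant of bayes that returns its results (illustrative only).
bayes2 <- function(x, prior.mean, prior.var)
{
    n <- length(x)
    sample.mean <- mean(x)
    sample.var  <- var(x)
    denominator <- 1/prior.var + n/sample.var
    numerator   <- prior.mean/prior.var + n*sample.mean/sample.var
    list(posterior.mean = numerator/denominator,
         posterior.var  = 1/denominator)
}

bmd <- c(1.0, 1.5, 2.1, 1.7, 1.8, 0.9, 0.7)
post <- bayes2(bmd, 1.0, 0.0144)
post$posterior.mean        # 1.103525, as printed by bayes
sqrt(post$posterior.var)   # posterior standard deviation
```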

The above is only a brief introduction to programming and writing functions in the R language. In practice, packages such as survival, BMA, meta, Hmisc and so on are all developed in the R language. Readers can consult An Introduction to R by W. Venables and B. Ripley (listed at the end of the book) for more technical details.

CHAPTER XVII

SOME COMMONLY USED R COMMANDS

17
Appendix 2
Some commonly used commands in R
Commands for the R working environment
getwd()                   Show the current working directory
setwd("c:/works")         Change the working directory to c:\works (note that R uses /)
options(prompt="R>")      Change the prompt to R>
options(width=100)        Set the width of R output to 100 characters
options(scipen=3)         Make R prefer fixed notation over scientific notation (such as 1.2E-04) when printing numbers
options()                 Show the current settings of the R environment

Basic commands
ls()                      List the objects in memory
rm(object)                Remove an object
search()                  Show the search path

Arithmetic operators
+                         Addition
-                         Subtraction
*                         Multiplication
/                         Division
^                         Exponentiation
%/%                       Integer division
%%                        Remainder from integer division
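A short illustration of the last two operators may help, since they are easily confused. This is a sketch with made-up numbers; the variable names are arbitrary.

```r
# Integer division and remainder
q <- 17 %/% 5      # 3, because 17 = 3*5 + 2
r <- 17 %% 5       # 2, the remainder
q * 5 + r          # 17: quotient and remainder rebuild the original number
p <- 2^10          # 1024
(-7) %/% 2         # -4: the quotient is rounded down, not toward zero
(-7) %% 2          # 1
```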

Logical operators
==                        Equal to
!=                        Not equal to
<                         Less than
>                         Greater than
<=                        Less than or equal to
>=                        Greater than or equal to
is.na(x)                  Is x a missing value?
&                         And (AND)
|                         Or (OR)
!                         Not (NOT)

Generating numbers
numeric(n)                Gives n zeros
character(n)              Gives n empty character strings
logical(n)                Gives n values of FALSE
seq(-4,3,0.5)             The sequence -4.0, -3.5, -3.0, ..., 3.0
1:10                      Same as seq(1, 10, 1)
c(5,7,9,1)                Combines the numbers 5, 7, 9 and 1
rep(1, 5)                 Gives five 1s: 1, 1, 1, 1, 1
gl(3,2,12)                Factor with 3 levels, each repeated 2 times, 12 values in total:
                          1 1 2 2 3 3 1 1 2 2 3 3
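The entries in this table can be checked directly at the prompt. The following sketch simply runs a few of them; the object names s, r5 and f are arbitrary.

```r
s <- seq(-4, 3, 0.5)
length(s)            # 15 values, from -4.0 to 3.0 in steps of 0.5
numeric(3)           # 0 0 0
logical(2)           # FALSE FALSE
r5 <- rep(1, 5)      # 1 1 1 1 1
f <- gl(3, 2, 12)    # factor with levels 1, 2, 3
as.integer(f)        # 1 1 2 2 3 3 1 1 2 2 3 3
```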

Generating random numbers by simulation from probability distributions
rnorm(n, mean=0, sd=1)            Normal distribution with mean = 0 and standard deviation = 1
rexp(n, rate=1)                   Exponential distribution
rgamma(n,shape,scale=1)           Gamma distribution
rpois(n, lambda)                  Poisson distribution
rweibull(n,shape,scale=1)         Weibull distribution
rcauchy(n,location=0,scale=1)     Cauchy distribution
rbeta(n, shape1, shape2)          Beta distribution
rt(n, df)                         t distribution
rchisq(n, df)                     Chi-squared distribution
rbinom(n, size, prob)             Binomial distribution
rgeom(n, prob)                    Geometric distribution
rhyper(nn, m, n, k)               Hypergeometric distribution
rlnorm(n,meanlog=0,sdlog=1)       Log-normal distribution
rlogis(n,location=0,scale=1)      Logistic distribution
rnbinom(n,size,prob)              Negative binomial distribution
runif(n,min=0,max=1)              Uniform distribution
Converting numbers to characters and vice versa
as.numeric(x)             Convert x to a numeric variable that can be used in calculations
as.character(x)           Convert x to a character variable, for classification
as.logical(x)             Convert x to a logical variable
factor(x)                 Convert x to a factor

Data frames
data.frame(x,y)           Combine x and y into a data frame
tuan$age                  Select the variable age from the data frame tuan
attach(tuan)              Attach the data frame tuan to the R search path
detach(tuan)              Remove the data frame tuan from the R search path

Mathematical functions
log(x)                    Logarithm to base e
log10(x)                  Logarithm to base 10
exp(x)                    Exponential
sin(x)                    Sine
cos(x)                    Cosine
tan(x)                    Tangent
asin(x)                   Arcsine (inverse sine)
acos(x)                   Arccosine (inverse cosine)
atan(x)                   Arctangent (inverse tangent)

Statistical functions
length(x)                 Number of elements in a variable (the sample size)
sum(x)                    Sum of the variable x
range(x)                  Minimum and maximum of x (their difference is diff(range(x)))
mean(x)                   Mean of the variable x
min(x)                    Smallest value of the variable x
max(x)                    Largest value of the variable x
which.max(x)              Position of the largest value of the variable x
which.min(x)              Position of the smallest value of the variable x
median(x)                 Median of the variable x
sd(x)                     Standard deviation of the variable x
var(x)                    Variance of the variable x
cov(x,y)                  Covariance between the two variables x and y
cor(x,y)                  Coefficient of correlation between the variables x and y
quantile(x)               Quantiles of the variable x
is.na(x)                  Check whether x is a missing value
complete.cases(x1,x2,...) Check that none of x1, x2, ... has missing values

Indexing
x[1]                      First value of the variable x
x[1:5]                    First five values of the variable x
x[y<=30]                  Select x where y is less than or equal to 30
x[sex=="male"]            Select x where sex equals "male"
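These selections are easiest to understand on a small example. The vectors below are made up purely for illustration.

```r
# Hypothetical data: six subjects
x   <- c(12, 15, 9, 20, 11, 18)
y   <- c(25, 40, 30, 28, 35, 22)
sex <- c("male", "female", "male", "female", "male", "female")

x[1]               # 12
x[1:5]             # 12 15 9 20 11
a <- x[y <= 30]        # 12 9 20 18: values of x whose y is at most 30
b <- x[sex == "male"]  # 12 9 11
```

Note that the value "male" must be quoted; without quotes R would look for an object named male.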

Data input
data(name)                Load the built-in dataset name
read.table(name)          Read data from the file name
read.csv(name)            Read Excel-style data (values separated by ",") from the file name
read.delim(name)          Read tab-delimited data
read.delim2(name)         Read tab-delimited data in which the decimal mark is ","
read.csv2(name)           Read csv data in which values are separated by ";" and the decimal mark is ","

Options in read.table
header=TRUE               The first row of the data contains the variable names
sep=","                   Values are separated by ","
dec=","                   The decimal mark is "," (as distinct from ".")
na.strings="."            Missing values are coded as "."
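The sketch below shows these options working together. It is only an illustration: the data are made up, they are supplied through the text argument of read.table instead of a file, and the separator is ";" so that it does not clash with the decimal comma.

```r
# Hypothetical data: ";" separates values, "," is the decimal mark, "." is missing
raw <- "id;weight;height
1;60,5;1,62
2;.;1,70
3;72,3;1,75"

d <- read.table(text = raw, header = TRUE, sep = ";", dec = ",", na.strings = ".")
d
mean(d$weight, na.rm = TRUE)   # mean of the two non-missing weights
```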

Probability distributions (cumulative distribution functions)
pnorm(x,mean,sd)          Normal distribution
plnorm(x,meanlog,sdlog)   Log-normal distribution
pt(x,df)                  t distribution
pf(x,n1,n2)               F distribution
pchisq(x,df)              Chi-squared distribution
ppois(x,lambda)           Poisson distribution
punif(x,min,max)          Uniform distribution
pexp(x,rate)              Exponential distribution
pgamma(x,shape,scale=1)   Gamma distribution
pbeta(x,a,b)              Beta distribution
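Each of these functions returns a cumulative probability P(X <= x). A few standard values serve as a quick check; the object names are arbitrary.

```r
p1 <- pnorm(1.96)                 # about 0.975 for the standard normal
z  <- qnorm(0.975)                # the q-function inverts it: about 1.96
p2 <- 1 - pchisq(3.84, df = 1)    # upper tail of chi-squared: about 0.05
p3 <- ppois(2, lambda = 3)        # P(X <= 2) = exp(-3) * (1 + 3 + 4.5)
p3                                # about 0.423
```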
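Statistical analysis
t.test                    t test
pairwise.t.test           t tests between all pairs of groups, with adjustment for multiple comparisons
cor.test                  Test of a correlation coefficient
   method="kendall"       (option of cor.test) Kendall rank correlation
   method="spearman"      (option of cor.test) Spearman rank correlation
var.test                  Test comparing two variances
bartlett.test             Test comparing several variances

wilcox.test               Wilcoxon test
kruskal.test              Kruskal-Wallis test
friedman.test             Friedman test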

lm(y ~ x)                 Linear regression analysis
lm(y ~ factor)            One-way analysis of variance
lm(y ~ factor+x)          Analysis of covariance
lm(y ~ x1+x2+x3)          Multiple linear regression analysis

binom.test                          Binomial test
prop.test                           Test comparing several proportions
prop.trend.test                     Test for a trend across several proportions
fisher.test                         Fisher's exact test
chisq.test                          Chi-squared test
glm(y~x1+x2+x3, family=binomial)    Logistic regression analysis

s <- Surv(time,event)     Survival analysis: create the survival object (requires the survival package)
survfit(s ~ 1)            Kaplan-Meier curve
survdiff(s ~ g)           Log-rank test between the groups defined by g
coxph(s ~ x1+x2)          Cox regression analysis

Graphs
plot(y~x)                 Plot y against x (scatter plot)
plot(x, y)                Plot y against x (scatter plot)
coplot(y ~ x | z)         Plot y against x separately for each group of z
pie(x)                    Pie chart
boxplot(x)                Box plot
qqnorm(x)                 Normal quantile plot of the variable x
qqplot(x, y)              Quantile plot of the variable y against x
barplot(x)                Bar chart of the variable x
hist(x)                   Histogram of the variable x
stars(x)                  Star plot of the variable x
abline(a, b)              Draw a straight line with intercept = a and slope = b
abline(h=y)               Draw a horizontal line
abline(v=x)               Draw a vertical line
abline(lm.object)         Draw the line fitted by a linear model

Some graphical parameters
pch                       Plotting symbol (pch = plotting character)
mfrow, mfcol              Divide the window into several panels to draw several plots at once (multiframe)
xlim, ylim                Limits of the horizontal and vertical axes
xlab, ylab                Labels of the horizontal and vertical axes
lty, lwd                  Type and width of the plotted line
cex, mex                  Size of and spacing between characters
col                       Colour

CHAPTER XVIII

TERMINOLOGY

18
Appendix 3
Terminology used in this book
English                                  Vietnamese
95% confidence interval                  Khoảng tin cậy 95%
Akaike Information criterion (AIC)       Tiêu chuẩn thông tin Akaike
Analysis of covariance                   Phân tích hiệp biến
Analysis of variance (ANOVA)             Phân tích phương sai
Bar chart                                Biểu đồ thanh
Binomial distribution                    Phân phối nhị phân
Box plot                                 Biểu đồ hình hộp
Categorical variable                     Biến thứ bậc
Clock chart                              Biểu đồ đồng hồ
Coefficient of correlation               Hệ số tương quan
Coefficient of determination             Hệ số xác định bội
Coefficient of heterogeneity             Hệ số bất đồng nhất
Combination                              Tổ hợp
Continuous variable                      Biến liên tục
Correlation                              Tương quan
Covariance                               Hợp biến
Cross-over experiment                    Thí nghiệm giao chéo
Cumulative probability distribution      Hàm phân phối tích lũy
Degree of freedom                        Bậc tự do
Determinant                              Định thức
Discrete variable                        Biến rời rạc
Dot chart                                Biểu đồ điểm
Estimate                                 Ước số
Estimator                                Hàm ước lượng thống kê
Factorial analysis of variance           Phân tích phương sai cho thí nghiệm giai thừa
Fixed effects                            Ảnh hưởng bất biến
Frequency                                Tần số
Function                                 Hàm
Heterogeneity                            Bất đồng nhất
Histogram                                Biểu đồ tần số
Homogeneity                              Đồng nhất
Hypothesis test                          Kiểm định giả thiết
Inverse matrix                           Ma trận nghịch đảo
Latin square experiment                  Thí nghiệm hình vuông Latin
Least squares method                     Phương pháp bình phương nhỏ nhất
Linear Logistic regression analysis      Phân tích hồi qui tuyến tính logistic
Linear regression analysis               Phân tích hồi qui tuyến tính
Matrix                                   Ma trận
Maximum likelihood method                Phương pháp hợp lí cực đại
Mean                                     Số trung bình
Median                                   Số trung vị
Meta-analysis                            Phân tích tổng hợp
Missing value                            Giá trị không
Model                                    Mô hình
Multiple linear regression analysis      Phân tích hồi qui tuyến tính đa biến
Normal distribution                      Phân phối chuẩn
Object                                   Đối tượng
Parameter                                Thông số
Permutation                              Hoán vị
Pie chart                                Biểu đồ hình tròn
Poisson distribution                     Phân phối Poisson
Polynomial regression                    Hồi qui đa thức
Probability                              Xác suất
Probability density distribution         Hàm mật độ xác suất
P-value                                  Trị số P
Quantile                                 Hàm định bậc
Random effects                           Ảnh hưởng ngẫu nhiên
Random variable                          Biến ngẫu nhiên
Relative risk                            Tỉ số nguy cơ tương đối
Repeated measure experiment              Thí nghiệm tái đo lường
Residual                                 Phần dư
Residual mean square                     Trung bình bình phương phần dư
Residual sum of squares                  Tổng bình phương phần dư
Scalar matrix                            Ma trận vô hướng
Scatter plot                             Biểu đồ tán xạ
Significance                             Có ý nghĩa thống kê
Simulation                               Mô phỏng
Standard deviation                       Độ lệch chuẩn
Standard error                           Sai số chuẩn
Standardized normal distribution         Phân phối chuẩn chuẩn hóa
Survival analysis                        Phân tích biến cố
Transposed matrix                        Ma trận chuyển vị
Variable                                 Biến (biến số)
Variance                                 Phương sai
Weight                                   Trọng số
Weighted mean                            Trung bình trọng số

CHAPTER XIX

REFERENCES AND FURTHER READING

19
Afterword
(references and further reading)
Over 15 chapters and 3 appendices, the reader and I have travelled a fairly long road through statistical analysis and graphics. Before we part, I think I should say a few words of farewell.
My own teaching and research experience suggests that, for most students, a first encounter with statistical science is not a very exciting experience, not to say a difficult one, simply because the textbooks written for the subject are far removed from practice, or touch on practice only through trivial, bland examples. Abstract concepts, complicated formulas and long, cumbersome calculations leave learners disoriented and, as a result, uninterested in pursuing the subject. Indeed, when reading textbooks or scientific papers, we sometimes come across good methods and models that suit our own research but do not know how to carry out the computations. In this book I have tried to give readers a practical means of analysis to fill that methodological gap.
Learning must go hand in hand with practice. The best way to learn methods, in my view, is (to put it plainly) to imitate. R offers readers a very convenient way of learning by imitation. While reading these chapters and their examples, readers can type the commands into the computer and see whether the results agree with what they have read. Once they know how to use a function or a command, they can add (or remove) arguments to see what happens to the results. Only by learning in this way will readers truly master the concepts and the use of R.
We learn from mistakes. In this book I wanted readers to take a rather bumpy road, that is, to interact with the computer through R commands. During that interaction, some commands may fail to run because a variable name was mistyped or misspelled, because upper and lower case were not respected, because the data were incomplete or wrong, and so on. Every one of those mistakes gives readers experience and makes them more proficient. This is the way of learning that the English call trial and error: learning from mistakes and experiments.
A data analysis project needs many R commands and functions. However, because of the interactive way of working that readers have followed so far, these commands disappear when R is closed. The question is how to store them in a file for later reuse. An extremely useful piece of software for this purpose is Tinn-R (which can also be downloaded and installed completely free of charge).
The website for downloading Tinn-R and its documentation is:
http://www.sciviews.org/Tinn-R.

Tinn-R is essentially an editor for R (and for many other programs). Tinn-R lets us store all the commands for an analysis project in one file. With Tinn-R, we have on-line guidance on how to use R commands and functions. When a command is typed with incorrect R syntax, Tinn-R flags it immediately and suggests a correction! The Tinn-R interface looks something like this:

For example, in the interface above, when we type read.table( a hint appears just below, showing all the arguments of the function read.table. With Tinn-R we rarely make the small mistakes that occur when typing directly into R. After writing a number of commands, we can use the mouse to highlight the commands to be run and send them to R. Note that we do not need to leave Tinn-R while R is running.
At this point readers may ask: is there a way to use R more easily, without having to type commands? The answer is yes. Why did I not introduce it earlier, right from the first chapter? Because I wanted readers to take the hard road before the easy one, which is why only now do I mention another add-on that can help readers use R more quickly, more easily and more conveniently with the mouse instead of the keyboard.
The software that automates R is called Rcmdr (short for R commander). Rcmdr is in fact a package, which readers can download from the official R website (http://cran.au.r-project.org/src/contrib/Descriptions/Rcmdr.html) or from the website of the author of Rcmdr: http://socserv.socsci.mcmaster.ca/jfox/Misc/Rcmdr. Note that Rcmdr works properly when the following packages are installed: relimp, multcomp, lmtest, effects, car, and abind. If these packages are not yet installed, readers should download them. The Rcmdr manual can also be downloaded from http://cran.R-project.org/doc/packages/Rcmdr.pdf.
Once Rcmdr has been downloaded and installed, readers simply type the command library(Rcmdr), and an interface like the one below appears. Using the menus (such as File, Edit, Data, Statistics, Graphs, Models, Distribution, Tool, Help), readers can explore how Rcmdr works with the mouse.

As for the content of this first edition, I did not intend to discuss multivariate analysis models such as factor analysis, cluster analysis, correspondence analysis, multivariate analysis of variance and so on, because these are relatively advanced methods that require the user not only to be well versed in statistical theory but also to understand thoroughly the basic methods of analysis presented in this book. Nevertheless, readers who need these methods can look on the R website for the packages dedicated to multivariate analysis.

References

At present the library of books on R is still relatively modest compared with that for commercial software such as SAS and SPSS. However, in this age of extraordinary progress in internet information and globalization, printed books and books published on websites are no longer very different. Most guidance on how to use R can be found scattered across the websites of universities and individuals around the world. In this section I list only some books that readers who need further reference should seek out. While writing the book that readers now hold in their hands, I also consulted a number of books and websites, which I list below with a few personal comments.
The main reference on R is the paper by its two creators: Ihaka R, Gentleman R. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 1996; 5:299-314.
18.1 Reference books on R

Data Analysis and Graphics Using R: An Example Approach (Cambridge University Press, 2003) by John Maindonald, now reprinted in a second edition with a new co-author, John Braun. This is a very useful book for anyone who wants to explore and learn R. The first five chapters are written for readers who have never used R, while the later chapters are written for readers who already use R proficiently.

Introductory Statistics With R (Springer, 2004) by Peter Dalgaard is a basic book on R aimed at readers who know nothing about R. The book is relatively short (only about 200 pages) but rather expensive!

Linear Models with R (Chapman & Hall/CRC, 2004) by Julian Faraway. The book can currently be downloaded free of charge from http://www.stat.lsa.umich.edu/~faraway/book/pra.pdf or http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf. The document is 213 pages long.

R Graphics (Computer Science and Data Analysis) (Chapman & Hall/CRC, 2005) by Paul Murrell. This book is devoted to graphics in R. It contains a great deal of code that lets readers design complex and colourful graphs for themselves.

Modern Applied Statistics with S-Plus (Springer, 4th Edition, 2003) by W. N. Venables and B. D. Ripley was written for the S-Plus language, but all the commands and code in the book can be applied in R without change. (S-Plus is the predecessor of R, but S-Plus is commercial software, whereas R is completely free!) This can be called the reference book for everyone who wants to go further with R. Its two authors are also authoritative experts on the R language. The book is intended for readers with an advanced background in computing and statistics.
18.2 Important or useful websites on R

Many reference documents can be downloaded from the official R website:
http://cran.R-project.org/other-docs.html
These include important documents such as An Introduction to R by W. N. Venables and B. D. Ripley.
Internet address: http://cran.r-project.org/doc/manuals/R-intro.pdf.

Several guides to using R can be downloaded (free of charge) and consulted, as follows:

R for Beginners (57 pages) by Emmanuel Paradis. Written for readers who are new to R.
Internet address: http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf.

Using R for Data Analysis and Graphics: Introduction, Code and Commentary (35 pages) by John Maindonald is a summary of the basic R commands and functions for data analysis and graphics. The subject of this document is very close to that of the book you are reading.
Internet address: http://cran.r-project.org/doc/contrib/usingR.pdf

Statistical Analysis with R: a quick start (46 pages) by Oleg Nenadic and Walter Zucchini. A guide to applying R to statistical analysis and graphics.
Internet address: http://www.statoek.wiso.uni-goettingen.de/mitarbeiter/ogi/pub/r_workshop.pdf

A Brief Guide to R for Beginners in Econometrics (31 pages) by M. Arai. Written mainly for those who work in econometric analysis.
Internet address: http://people.su.se/~ma/R_intro

Notes on the use of R for psychology experiments and questionnaires (39 pages) by Jonathan Baron and Yuelin Li. Written for researchers in psychology and sociology. It includes examples of log-linear models and of some analysis of variance models used in psychology.
Internet address: http://www.psych.upenn.edu/~baron/rpsych/rpsych.html

StatsRus is a collection of tips for using R more effectively (about 80 pages long). Internet address: http://lark.cc.ukans.edu/pauljohn/R/statsRus.html

And finally, a guide to using R for data analysis and graphics (about 50 pages, regularly updated), which I wrote myself in Vietnamese. The website www.R.ykhoa.net is essentially a summary of the main chapters of this book. It also holds all the datasets and code used in the book, so that readers can download them to their own computers.
