
Data Analysis and Graphics with R

Nguyễn Văn Tuấn
Garvan Institute of Medical Research
Sydney, Australia
Contents

1 Downloading and installing R
2 Downloading and installing R packages
3 R syntax
3.1 Naming objects in R
3.2 Getting help in R
4 Entering data into R
4.1 Entering data directly: c()
4.2 Entering data directly: edit(data.frame())
4.3 Reading data from a text file: read.table
4.4 Reading data from Excel
4.5 Reading data from SPSS
4.6 Information about the data
4.7 Generating sequences with seq, rep and gl
5 Editing data
5.1 Splitting data: subset
5.2 Extracting data from a data.frame
5.3 Merging two data.frames into one: merge
5.4 Recoding data (data coding)
5.5 Recoding data with replace
5.6 Converting to a factor
5.7 Grouping data with cut2 (Hmisc)
6 Using R for simple calculations
6.1 Simple calculations
6.2 Using R for matrix calculations
7 Using R for probability calculations
7.1 Permutations
7.2 Random variables and distribution functions
7.3 Random variables and distribution functions
7.3.1 The binomial distribution
7.3.2 The Poisson distribution
7.3.3 The normal distribution
7.3.4 The standardized normal distribution
7.4 Random sampling
8 Graphics
8.1 Data for graphical analysis
8.2 Charts for one discrete variable: barplot
8.3 Charts for two discrete variables: barplot
8.4 Pie charts
8.5 Charts for one continuous variable: stripchart and hist
8.5.1 Stripchart
8.5.2 Histogram
8.6 Boxplots
8.7 Graphical analysis of two continuous variables
8.7.1 Scatter plots
8.8 Graphical analysis of several variables: pairs
8.9 Charts with standard errors
9 Descriptive statistical analysis
9.1 Descriptive statistics (summary)
9.2 Descriptive statistics by group
9.3 The t test (t.test)
9.3.1 One-sample t test
9.3.2 Two-sample t test
9.4 Wilcoxon test for two samples (wilcox.test)
9.5 t test for paired data (paired t-test, t.test)
9.6 Wilcoxon test for paired data (wilcox.test)
9.7 Frequencies
9.8 Testing a proportion (prop.test, binom.test)
9.9 Comparing two proportions (prop.test, binom.test)
9.10 Comparing several proportions (prop.test, chisq.test)
9.10.1 Chi-squared test (chisq.test)
9.10.2 Fisher's exact test (fisher.test)
10 Linear regression analysis
10.1 Correlation coefficients
10.1.1 Pearson correlation coefficient
10.1.2 Spearman correlation coefficient
10.1.3 Kendall correlation coefficient
10.2 The simple linear regression model
10.3 The multiple linear regression model
11 Analysis of variance
11.1 One-way analysis of variance
11.2 Comparing several groups and adjusting p-values
11.3 Nonparametric analysis
11.4 Two-way analysis of variance (two-way ANOVA)
12 Logistic regression analysis
12.1 The logistic regression model
12.2 Logistic regression analysis with R
12.3 Estimating probabilities with R
13 Sample size estimation
13.1 The concept of power
13.2 Data needed for sample size estimation
13.4 Sample size estimation
13.4.1 Sample size for estimating a mean
13.4.2 Sample size for comparing two means
13.4.3 Sample size for analysis of variance
13.4.4 Sample size for estimating a proportion
13.4.5 Sample size for comparing two proportions
14 References
15 Glossary of terms used in the book


Introducing R
Data analysis and charting are usually carried out with widely used software such as SAS, SPSS, Stata, Statistica, and S-Plus. These packages have been developed and marketed by software companies over roughly the past three decades, and are used for teaching and research by universities, research centres, and industrial companies around the world. But because using this software is relatively expensive (sometimes up to hundreds of thousands of dollars a year), some universities in developing countries (and even in some developed countries) cannot afford to use it over the long term. Statistical researchers around the world therefore collaborated to develop a new piece of software, built on the open-source principle, so that everyone working in statistics and mathematics can use it in a uniform way and completely free of charge.
In 1996, in an important paper on statistical computing, two statisticians, Ross Ihaka and Robert Gentleman, [then] at the University of Auckland, New Zealand, outlined a new language for statistical analysis that they named R [1]. The initiative was endorsed by a great many statisticians around the world, who joined in the development of R.
After less than 10 years of development, more and more statisticians, mathematicians, and researchers in every field have switched to R for analysing scientific data. Worldwide there is now a network of more than a million R users, and the number is growing very fast. It is fair to say that within the next 10 years the role of commercial statistical software will no longer be as large as it has been.
So what is R? Put briefly, R is software for statistical analysis and for drawing charts. In essence, though, R is a general-purpose computer language that can serve many different goals, from simple calculation and recreational mathematics, through matrix computation, to complex statistical analyses. Because it is a language, R can also be used to build specialised software for a particular computational problem.
For these reasons, anyone doing scientific research, especially in countries that are still poor such as ours, should learn to use R for statistical analysis and graphics. This short text shows readers how to use R. I assume that readers know nothing about R, but I do expect them to have some familiarity with using a computer.

1. Downloading and installing R
To use R, the first step is to install it on our own computer. To do this, we go online and visit the website of the Comprehensive R Archive Network (CRAN):
http://cran.R-project.org.
The file to download depends on the version, but its name usually begins with the letter R followed by the version number. For example, the version I was using at the end of 2005 is 2.2.1, so the file to download is:
R-2.2.1-win32.exe
This file is about 26 MB, and it can be downloaded from:
http://cran.r-project.org/bin/windows/base/R-2.2.1-win32.exe
On this website we can also find a great deal of documentation on how to use R, at every level from introductory to advanced. For readers who are not yet comfortable with English, this text of mine provides the information needed to use R without having to read the other material.
Once R has been downloaded, the next step is to install (set up) it on the computer. To do so, we simply click on the downloaded file and follow the installation instructions on the screen. This is a very simple step; installation takes only about a minute.
After installation is complete, an icon (R 2.2.1.lnk) appears on the computer's desktop. At this point we are ready to use R. Clicking on this icon opens a window like the following:


2. Downloading and installing R packages
R provides a computer language and a set of functions for basic, simple analyses. For more complex analyses we need to download additional packages. A package is a small piece of software, developed by statisticians to solve a specific problem, that runs inside R. For example, R has the function lm for linear regression, but for deeper and more complex analyses we need packages such as lme4. These packages must be downloaded and installed on the computer.
The address for downloading packages is still http://cran.r-project.org; click on "Packages" in the menu on the left of the web page. In my view, some packages worth downloading for epidemiological analyses are:

Package     Purpose
trellis     Drawing graphs and making them look better
lattice     Drawing graphs and making them look better
Hmisc       A collection of data modelling methods by F. Harrell
Design      A collection of study design models by F. Harrell
Epi         Epidemiological analyses
epitools    Another package dedicated to epidemiological analyses
foreign     Importing data from other software such as SPSS, Stata, SAS, etc.
rmeta       Meta-analysis
meta        Another package for meta-analysis
survival    Analyses based on the Cox model (Cox's proportional hazards model)
Zelig       Statistical analyses in the social sciences
genetics    Analysis of genetic data
BMA         Bayesian Model Averaging

These packages can be installed online by choosing Install packages in the Packages menu of R, as shown in the figure below. Alternatively, if a package has already been downloaded to the local computer, installation may be faster by choosing Install package(s) from local zip file, also in the Packages menu (see the figure below).
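The same installation can also be done from the R prompt instead of the menu. The sketch below is only an illustration: `missing_packages` is a hypothetical helper name (not part of R), and the list of wanted packages is an arbitrary choice from the table above. The actual download line is left commented out because it needs an Internet connection.

```r
# Hypothetical helper: report which of the requested packages are not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  pkgs[!(pkgs %in% installed)]
}

wanted     <- c("lattice", "Hmisc", "foreign", "survival")
to_install <- missing_packages(wanted)
print(to_install)

# Uncomment to download the missing packages from CRAN (needs an Internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

Once a package is installed, `library(name)` makes its functions available in the current session.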

3. R syntax
R is an interactive language: when we issue a command, and the command is syntactically correct, R responds with a result. The interaction continues until we have obtained what we need. The basic unit of R syntax is a command, or function. A function takes arguments, so the function name is followed by the arguments we must supply. The general syntax of R is:
object <- function(argument 1, argument 2, ..., argument n)
For example:
> reg <- lm(y ~ x)
Here reg is an object, lm is a function, and y ~ x is the argument of the function. Or:
> setwd("c:/works/stats")
Here setwd is a function and "c:/works/stats" is its argument.
To find out which arguments a function takes, we use args(x) (args is short for arguments), where x is the function we are interested in:
> args(lm)
function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
NULL

R is an "object" language (object-oriented language). This means that data in R are held in objects. This orientation has some influence on how R code is written. For example, to test whether x equals 5 we do not write x = 5 as we ordinarily would; R requires x == 5.
In R, x = 5 is equivalent to x <- 5, that is, it assigns the value 5 to x. The second form (with <-) is preferred over the first (=). For example:
> x <- rnorm(10)
means: simulate 10 values and store them in the object x. We could also write x = rnorm(10).
Some commonly used operators in R are:

x == 5      x equals 5
x != 5      x is not equal to 5
y < x       y is less than x
x > y       x is greater than y
z <= 7      z is less than or equal to 7
p >= 1      p is greater than or equal to 1
is.na(x)    is x a missing value?
A & B       A and B (AND)
A | B       A or B (OR)
!           not (NOT)

In R, everything that follows the symbol # on a line is ignored, because # is reserved for users to add comments. For example:
> # the following command simulates 10 normal values
> x <- rnorm(10)
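As a brief illustration of the operators listed above, the sketch below applies them to a small made-up vector (the values and the object names are assumptions chosen only for the example). Each operator works element by element and returns a logical vector.

```r
x <- c(3, 5, NA, 8)

x == 5          # element-wise test for equality
x != 5          # element-wise test for inequality
is.na(x)        # TRUE where a value is missing

# Combine conditions with & (AND), | (OR) and ! (NOT)
big_or_missing <- is.na(x) | x > 4
not_missing    <- !is.na(x)

# Logical vectors can be used to pick out elements
x[not_missing & x >= 5]
```

Note that a comparison involving a missing value gives NA rather than TRUE or FALSE, which is why is.na() is needed to detect missing values.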
3.1 Naming objects in R
Naming an object or a variable in R is quite flexible, because R has fewer restrictions than other software. The name of an object must be written as one word, without spaces. For example, R accepts myobject but not my object.
> myobject <- rnorm(10)
> my object <- rnorm(10)
Error: syntax error in "my object"

A name such as myobject can be hard to read, however, so it is better to separate the words with ".", as in my.object.
> my.object <- rnorm(10)
An important point is that R distinguishes between upper-case and lower-case letters, so My.object is different from my.object. For example:
> My.object.u <- 15
> my.object.L <- 5
> My.object.u + my.object.L
[1] 20

A few points to keep in mind when choosing names in R:

Avoid the symbol "_" (underscore) in names, as in my_object, and do not use a hyphen, as in my-object, which R reads as a subtraction.

Do not give an object the same name as a variable inside a data set. For example, if we have a data.frame (a data set) containing a variable age, we should not also create an object named age, that is, we should not write age <- age. However, if the data.frame is named data, we can refer to the variable age with the $ symbol: data$age (meaning the variable age in the data.frame data), and in that case age <- data$age is acceptable.

3.2 Getting help in R
Besides args(), R provides help() so that users can understand the syntax of each function. For example, to find out which arguments the function lm takes, we simply type:
> help(lm)
or
> ?lm
A window appears on the right of the screen explaining how the function is used, often with examples. Readers can simply copy and paste the examples into R to see how they work.
Before using R, in addition to this book, readers may wish to read through the guide built into R by choosing the Help menu and then Html help, as shown in the figure below, for more detail. Readers can also copy and paste commands from that guide into R to see how R works.

4. Entering data into R
To analyse data with R, the data must be available in a form that R can understand and process. Data that R understands must be held in a data.frame. There are many ways to get data into a data.frame, from typing them in directly to importing them from various sources. The most common ways are described below.
4.1 Entering data directly: c()
Example 1: we have the following data on age and insulin for 10 patients and want to enter them into R.

age   insulin
50    16.5
62    10.8
60    32.3
40    19.3
48    14.2
47    11.3
57    15.5
70    15.8
48    16.2
67    11.2

We can use the function named c as follows:
> age <- c(50,62, 60,40,48,47,57,70,48,67)
> insulin <- c(16.5,10.8,32.3,19.3,14.2,11.3,15.5,15.8,16.2,11.2)

The first command tells R that we want to create a column of data (from now on I will call it a variable) named age, and the second command creates another column named insulin. Of course, we could choose any other names we like.
We use the function c (short for concatenation, meaning "to join together") to enter the data. Note that the values for successive patients are separated by commas.
The notation insulin <- (which can also be written insulin =) means that the values that follow are stored in the variable insulin. We will meet this notation many times when using R.
R is an object-oriented language, and each column of data, or each data.frame, is an object to R. So age and insulin are two separate objects. We now need to combine these two objects into one data.frame so that R can process them later. To do this we use the function data.frame:
> tuan <- data.frame(age, insulin)

With this command we tell R to put the two columns (or two objects) age and insulin into one object named tuan.
We now have a complete object ready for statistical analysis. To check what tuan contains, we simply type:
> tuan

and R reports:
   age insulin
1   50    16.5
2   62    10.8
3   60    32.3
4   40    19.3
5   48    14.2
6   47    11.3
7   57    15.5
8   70    15.8
9   48    16.2
10  67    11.2

If we want to save these data in a file in R format, we use the save command. Suppose we want to save the data in the directory c:\works\insulin; we type:
> setwd("c:/works/insulin")
> save(tuan, file="tuan.rda")

The first command (setwd; wd stands for working directory) tells R that we want to save the data in the directory c:\works\insulin. Note that Windows normally uses the backslash "\", but in R we use the forward slash "/".
The second command (save) tells R to save the data in the object tuan in a file named tuan.rda. After these two commands have been entered, a file named tuan.rda will be present in that directory.
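To see that a saved file really can be read back, the following sketch saves the data.frame from Example 1 and restores it with load(). It assumes a temporary file in place of c:/works/insulin/tuan.rda so that it can be run on any computer; the names rda_file and original are introduced only for this illustration.

```r
# Same data as Example 1
age     <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)
insulin <- c(16.5, 10.8, 32.3, 19.3, 14.2, 11.3, 15.5, 15.8, 16.2, 11.2)
tuan    <- data.frame(age, insulin)

# Assumption: a temporary file stands in for c:/works/insulin/tuan.rda
rda_file <- file.path(tempdir(), "tuan.rda")
save(tuan, file = rda_file)

# Remove the object, then restore it from disk;
# load() recreates it under the name it was saved with
original <- tuan
rm(tuan)
load(rda_file)

identical(tuan, original)
```

In a later R session, a single load() call on the .rda file is enough to get the data.frame back without retyping the data.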
4.2 Entering data directly: edit(data.frame())
Example 1 (continued): we can enter the age and insulin data for the 10 patients with a very useful function: edit(data.frame()). With this function, R opens a new window with a grid of columns and rows, much like Excel, and we can type the data into that grid. For example:
> ins <- edit(data.frame())

We will get a window like the following:

Here R does not know what our variables are, so it lists them as var1, var2, and so on. Click on the column var1 and change its name by typing age. Click on the column var2 and change its name by typing insulin. Then type in the data for each column. When finished, click the X button in the right-hand corner of the spreadsheet, and we will have a data.frame named ins with two variables, age and insulin.
4.3 Reading data from a text file: read.table
Example 2: We collected data on age and cholesterol from a study of 50 patients with hypertension. The data are stored in a text file named chol.txt in the directory c:\works\insulin. The data are laid out as follows: column 1 is the patient's identifier (id), column 2 is sex (Nam = male, Nu = female), column 3 is age, column 4 is body mass index (bmi), column 5 is HDL cholesterol (hdl), followed by LDL cholesterol (ldl), total cholesterol (tc) and triglycerides (tg).

id   sex  age  bmi  hdl    ldl  tc   tg
1    Nam  57   17   5.000  2.0  4.0  1.1
2    Nu   64   18   4.380  3.0  3.5  2.1
3    Nu   60   18   3.360  3.0  4.7  0.8
4    Nam  65   18   5.920  4.0  7.7  1.1
5    Nam  47   18   6.250  2.1  5.0  2.1
6    Nu   65   18   4.150  3.0  4.2  1.5
7    Nam  76   19   0.737  3.0  5.9  2.6
8    Nam  61   19   7.170  3.0  6.1  1.5
9    Nam  59   19   6.942  3.0  5.9  5.4
10   Nu   57   19   5.000  2.0  4.0  1.9
...
46   Nu   52   24   3.360  2.0  3.7  1.2
47   Nam  64   24   7.170  1.0  6.1  1.9
48   Nam  45   24   7.880  4.0  6.7  3.3
49   Nu   64   25   7.360  4.6  8.1  4.0
50   Nu   62   25   7.750  4.0  6.2  2.5

We want to read these data into R for later analysis. We use the read.table command as follows:
> setwd("c:/works/insulin")
> chol <- read.table("chol.txt", header=TRUE)

The first command makes sure that R is looking in the directory where the data are stored. The second command asks R to read the data from the file named chol.txt (in the directory c:\works\insulin) and put them in the object chol. In this command, header=TRUE asks R to read the first line of the file as the names of the columns.
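The effect of header=TRUE can be checked on a small scale. The sketch below is an illustration under stated assumptions: it writes the first three rows of the table above to a temporary file (standing in for c:/works/insulin/chol.txt) and reads them back; txt_file and demo are names made up for the example.

```r
# Assumption: a small temporary file stands in for c:/works/insulin/chol.txt
txt_file <- file.path(tempdir(), "chol_demo.txt")
writeLines(c("id sex age bmi hdl ldl tc tg",
             "1 Nam 57 17 5.000 2.0 4.0 1.1",
             "2 Nu 64 18 4.380 3.0 3.5 2.1",
             "3 Nu 60 18 3.360 3.0 4.7 0.8"),
           txt_file)

# header = TRUE: use the first line of the file as column names
demo <- read.table(txt_file, header = TRUE)

dim(demo)      # number of rows and columns
names(demo)    # column names taken from the header line
str(demo)      # type of each column
```

Without header=TRUE, R would treat the first line as data and name the columns V1, V2, and so on.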
We can check whether R has read all the data by typing:
> chol

or
> names(chol)

R then lists the columns in the data (names asks which columns the data contain and what they are called):
[1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"

We can now save the data in R format for later use with the command:
> save(chol, file="chol.rda")
4.4 Reading data from Excel: read.csv
To read data from Excel, we proceed in two steps:

Step 1: Use the Save as command in Excel to save the data in csv format;
Step 2: Use R (the read.csv command) to read the csv data.

Example 3: A data set with the following columns is stored in Excel, and we want to move it into R for analysis. The file is named excel.xls.

ID  Age  Sex  Ethnicity  IGFI    IGFBP3  ALS     PINP    ICTP  P3NP
1   18   1    1          148.27  5.14    316.00  61.84   5.81  4.21
2   28   1    1          114.50  5.23    296.42  98.64   4.96  5.33
3   20   1    1          109.82  4.33    269.82  93.26   7.74  4.56
4   21   1    1          112.13  4.38    247.96  101.59  6.66  4.61
5   28   1    1          102.86  4.04    240.04  58.77   4.62  4.95
6   23   1    4          129.59  4.16    266.95  48.93   5.32  3.82
7   20   1    1          142.50  3.85    300.86  135.62  8.78  6.75
8   20   1    1          118.69  3.44    277.46  79.51   7.19  5.11
9   20   1    1          197.69  4.12    335.23  57.25   6.21  4.44
10  20   1    1          163.69  3.96    306.83  74.03   4.95  4.84
11  22   1    1          144.81  3.63    295.46  68.26   4.54  3.70
12  27   0    2          141.60  3.48    231.20  56.78   4.47  4.07
13  26   1    1          161.80  4.10    244.80  75.75   6.27  5.26
14  33   1    1          89.20   2.82    177.20  48.57   3.58  3.68
15  34   1    3          161.80  3.80    243.60  50.68   3.52  3.35
16  32   1    1          148.50  3.72    234.80  83.98   4.85  3.80
17  28   1    1          157.70  3.98    224.80  60.42   4.89  4.09
18  18   0    2          222.90  3.98    281.40  74.17   6.43  5.84
19  26   0    2          186.70  4.64    340.80  38.05   5.12  5.77
20  27   1    2          167.56  3.56    321.12  30.18   4.78  6.12

The first thing to do, as mentioned above, is to save the data from Excel in csv format:

In Excel, choose File > Save as
Choose Save as type "CSV (Comma delimited)"

When this is done, we have a file named excel.csv in the directory c:\works\insulin.
The second thing to do is to go into R and enter the following commands:
> setwd("c:/works/insulin")
> gh <- read.csv("excel.csv", header=TRUE)

The second command, read.csv, asks R to read the data from excel.csv, use the first line as the column names, and store the data in an object named gh.
We can now save gh in R format for later use with the following command:
> save(gh, file="gh.rda")

4.5 Reading data from SPSS: read.spss
The statistical software SPSS stores data in sav format. Suppose we have a data set named testo.sav in the directory c:\works\insulin and want to convert it into a form R can understand. We need the read.spss command from the package named foreign. The following commands do this easily.
First, we load foreign with the library command:
> library(foreign)

Second, we use the read.spss command:
> setwd("c:/works/insulin")
> testo <- read.spss("testo.sav", to.data.frame=TRUE)

The read.spss command asks R to read the data from testo.sav and put them in a data.frame named testo.
We can now save testo in R format for later use with the following command:
> save(testo, file="testo.rda")

4.6 Information about the data
Suppose we have read data into a data.frame named chol, as in Example 2. To find out what these data contain, we can proceed in R as follows.

Tell R that we want to work with chol by using the command attach(arg), where arg is the name of the data:
> attach(chol)

We can check whether chol is a data.frame with the command is.data.frame(arg), where arg is the name of the data. For example:
> is.data.frame(chol)
[1] TRUE
R confirms that chol is indeed a data.frame.

How many columns (variables) and rows (observations) are there in the data? We use the command dim(arg), where arg is the name of the data (dim is short for dimension). For example (R's result is shown immediately after the command):
> dim(chol)
[1] 50  8

So we have 50 rows and 8 columns (variables). What are these variables called? We use the command names(arg), where arg is the name of the data. For example:
> names(chol)
[1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"

How many men and women are there in the variable sex? To answer this question we can use the command table(arg), where arg is the name of the variable. For example:
> table(sex)
sex
nam Nam  Nu
  1  21  28

The result shows 21 records coded Nam (male) and 28 coded Nu (female), plus one record coded nam in lower case; because R is case-sensitive, it counts nam as a separate category.
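The table(sex) output above contains a stray lower-case "nam". The sketch below shows, on made-up toy data, how such an inconsistency shows up in table() and one possible way to tidy it before counting; the vector contents and the name sex_clean are assumptions for illustration, not part of the chol data.

```r
# Hypothetical toy data: one value was typed as "nam" instead of "Nam"
sex <- c("Nam", "Nu", "nam", "Nu", "Nam")

table(sex)                       # "nam" and "Nam" are counted separately

# One possible clean-up: recode the stray spelling, then tabulate again
sex_clean <- replace(sex, sex == "nam", "Nam")
counts    <- table(sex_clean)
counts
```

Checking table() output for unexpected categories is a quick way to catch data-entry errors before analysis.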


4.7 Generating sequences with seq, rep and gl
R can also generate sequences of numbers, which is very convenient for simulation and for designing experiments. The usual functions for sequences are seq (sequence), rep (repetition) and gl (generating levels).
Using seq

Create a vector of the numbers from 1 to 12:
> x <- (1:12)
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
> seq(12)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

Create a vector of the numbers from 12 down to 5:
> x <- (12:5)
> x
[1] 12 11 10  9  8  7  6  5
> seq(12,7)
[1] 12 11 10  9  8  7

The general form of seq is seq(from, to, by= ) or seq(from, to, length.out= ). Its use is illustrated by the following examples.

Create a vector of the numbers from 4 to 6 in steps of 0.25:
> seq(4, 6, 0.25)
[1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00

Create a vector of 10 numbers, with smallest value 2 and largest value 15:
> seq(length.out=10, from=2, to=15)
 [1]  2.000000  3.444444  4.888889  6.333333  7.777778  9.222222
 [7] 10.666667 12.111111 13.555556 15.000000
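The two forms of seq differ in what is fixed: by= fixes the step, while length.out= fixes the number of values and lets R work out the step. The sketch below illustrates this with the same numbers as the examples above; the object names by_step and by_length are chosen only for the illustration.

```r
by_step   <- seq(from = 4, to = 6, by = 0.25)         # fixed step
by_length <- seq(from = 2, to = 15, length.out = 10)  # fixed number of values

length(by_step)     # how many values the fixed step produces
diff(by_length)     # the step R worked out: (15 - 2) / 9 for each gap

# seq() also counts downwards when 'by' is negative
seq(12, 5, by = -1)
```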

Using rep
The form of the rep function is rep(x, times, ...), where x is a variable and times is the number of repetitions. For example:

Repeat the number 10 three times:
> rep(10, 3)
[1] 10 10 10

Repeat the numbers 1 to 4 three times:
> rep(c(1:4), 3)
[1] 1 2 3 4 1 2 3 4 1 2 3 4

Repeat the numbers 1.2, 2.7, 4.8 five times:
> rep(c(1.2, 2.7, 4.8), 5)
[1] 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8

Using gl
gl is used to create a categorical variable, that is, a variable that is not meant for calculation but for counting. The general form of gl is gl(n, k, length = n*k, labels = 1:n, ordered = FALSE), and its use is illustrated by the following examples.

Create a variable with levels 1 and 2, each level repeated 8 times:
> gl(2, 8)
[1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
Levels: 1 2

Or a variable with levels 1, 2 and 3, each level repeated 5 times:
> gl(3, 5)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3

Create a variable with levels 1 and 2, each level repeated 10 times (hence length=20):
> gl(2, 10, length=20)
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2

Or:
> gl(2, 2, length=20)
[1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
Levels: 1 2

Adding labels:
> gl(2, 5, labels=c("C", "T"))
[1] C C C C C T T T T T
Levels: C T

Create a variable with 4 levels 1, 2, 3, 4, each level repeated 2 times:
> rep(1:4, c(2,2,2,2))
[1] 1 1 2 2 3 3 4 4

This is equivalent to:
> rep(1:4, each = 2)
[1] 1 1 2 2 3 3 4 4
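The difference between the times and each arguments of rep, and the link between rep and gl, can be seen side by side in the short sketch below. The object names g and f are assumptions made for the illustration.

```r
rep(1:3, times = 2)            # the whole vector repeated
rep(1:3, each = 2)             # each element repeated in turn
rep(1:3, times = c(3, 1, 2))   # a separate repeat count for each element

# gl(n, k) produces the same pattern as rep(1:n, each = k), stored as a factor
g <- gl(2, 3, labels = c("C", "T"))
f <- factor(rep(1:2, each = 3), labels = c("C", "T"))
all(as.character(g) == as.character(f))
```

Using gl saves the separate call to factor when the repeated values are meant to be group labels, for example treatment groups in an experimental design.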

With dates and times:

> x <- .leap.seconds[1:3]


> rep(x, 2)
[1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-12-31 16:00:00
Pacific Standard Time"
[3] "1973-12-31 16:00:00 Pacific Standard Time" "1972-06-30 17:00:00
Pacific Standard Time"
[5] "1972-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00
Pacific Standard Time"
> rep(as.POSIXlt(x), rep(2, 3))
[1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-06-30 17:00:00
Pacific Standard Time"
[3] "1972-12-31 16:00:00 Pacific Standard Time" "1972-12-31 16:00:00
Pacific Standard Time"
[5] "1973-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00
Pacific Standard Time"

5. Editing data
5.1 Splitting data: subset
We return to the chol data of Example 2. To make the story easier to follow, recall that we read the data into an R data set named chol from a text file named chol.txt:
> setwd("c:/works/insulin")
> chol <- read.table("chol.txt", header=TRUE)
> attach(chol)

If, for some reason, we want to analyse men separately, we can split chol into two data.frames, called nam and nu, say. To do this we use the command subset(data, cond), where data is the data.frame we want to split and cond is the condition. For example:
> nam <- subset(chol, sex=="Nam")
> nu <- subset(chol, sex=="Nu")

After these two commands we have two new data sets (two data.frames) named nam and nu. Note that in the conditions sex == "Nam" and sex == "Nu" we use == rather than = to test for exact equality.
Of course, we can also split the data into different data.frames using conditions on other variables. For example, the following command creates a new data.frame named old containing the patients aged 60 or over:
> old <- subset(chol, age>=60)
> dim(old)
[1] 25  8

Or a new data.frame containing the male patients aged 60 or over:
> n60 <- subset(chol, age>=60 & sex=="Nam")
> dim(n60)
[1] 9 8
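Because chol.txt is not available to every reader, the sketch below repeats the subset calls on a small made-up data.frame with the same column names. The data.frame toy and its values are hypothetical; the select argument, which keeps only some columns, is a standard argument of subset that the text above does not use.

```r
# Hypothetical toy data in the same layout as chol (only three of its columns)
toy <- data.frame(id  = 1:6,
                  sex = c("Nam", "Nu", "Nu", "Nam", "Nam", "Nu"),
                  age = c(57, 64, 60, 65, 47, 65))

men      <- subset(toy, sex == "Nam")                     # rows where sex is "Nam"
old_men  <- subset(toy, age >= 60 & sex == "Nam")         # both conditions must hold
ages_old <- subset(toy, age >= 60, select = c(id, age))   # keep only some columns

dim(men)
old_men
ages_old
```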

5.2 Extracting data from a data.frame
chol contains 8 variables. We can extract from chol only the variables we need, such as the identifier (id), age (age) and total cholesterol (tc). From the output of names(chol) we see that id is column 1, age is column 3, and tc is column 7. We can use the following command:
> data2 <- chol[, c(1,3,7)]

Here we tell R that we want columns 1, 3 and 7, and that all the data in these three columns should go into a new data.frame named data2. Note that we use square brackets [] rather than round brackets (), because chol is not a function. The comma before c means that we select all the rows of the data.frame chol.
If instead we want only the first 10 rows, the command is:
> data3 <- chol[1:10, c(1,3,7)]
> print(data3)
   id age  tc
1   1  57 4.0
2   2  64 3.5
3   3  60 4.7
4   4  65 7.7
5   5  47 5.0
6   6  65 4.2
7   7  76 5.9
8   8  61 6.1
9   9  59 5.9
10 10  57 4.0
Note that the command print(arg) simply lists all the data in the data.frame arg. In fact, we only need to type data3; the result is exactly the same as print(data3).
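Columns can be selected by name as well as by position, which avoids having to count columns. The sketch below shows both on a small hypothetical data.frame (toy and its values are made up for the illustration).

```r
toy <- data.frame(id  = 1:5,
                  sex = c("Nam", "Nu", "Nu", "Nam", "Nam"),
                  age = c(57, 64, 60, 65, 47),
                  tc  = c(4.0, 3.5, 4.7, 7.7, 5.0))

by_position <- toy[, c(1, 3, 4)]             # all rows; columns 1, 3 and 4
by_name     <- toy[, c("id", "age", "tc")]   # the same columns, chosen by name
first_three <- toy[1:3, c("id", "tc")]       # rows 1 to 3 only

all.equal(by_position, by_name)
first_three
```

Selecting by name keeps working even if the order of the columns in the file changes later.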
5.3 Merging two data.frames into one: merge
Suppose we have data held in two data.frames. The first, named d1, has 3 columns, id, sex and tc:

id  sex  tc
1   Nam  4.0
2   Nu   3.5
3   Nu   4.7
4   Nam  7.7
5   Nam  5.0
6   Nu   4.2
7   Nam  5.9
8   Nam  6.1
9   Nam  5.9
10  Nu   4.0

The second, named d2, has 3 columns, id, sex and tg:

id  sex  tg
1   Nam  1.1
2   Nu   2.1
3   Nu   0.8
4   Nam  1.1
5   Nam  2.1
6   Nu   1.5
7   Nam  2.6
8   Nam  1.5
9   Nam  5.4
10  Nu   1.9
11  Nu   1.7

The two data sets share the variables id and sex, but d1 has 10 rows while d2 has 11 rows. We can merge the two into one data.frame with the merge command:
> d <- merge(d1, d2, by="id", all=TRUE)
> d
   id sex.x  tc sex.y  tg
1   1   Nam 4.0   Nam 1.1
2   2    Nu 3.5    Nu 2.1
3   3    Nu 4.7    Nu 0.8
4   4   Nam 7.7   Nam 1.1
5   5   Nam 5.0   Nam 2.1
6   6    Nu 4.2    Nu 1.5
7   7   Nam 5.9   Nam 2.6
8   8   Nam 6.1   Nam 1.5
9   9   Nam 5.9   Nam 5.4
10 10    Nu 4.0    Nu 1.9
11 11  <NA>  NA    Nu 1.7

In the merge command we ask R to combine the two data sets d1 and d2 into a new data.frame named d, using the variable id as the key. Note that patient 11 has no value for tc, so R reports it as NA ("not available").
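The role of all=TRUE is easiest to see by comparing it with the alternatives. The sketch below uses two tiny hypothetical data.frames whose ids only partly overlap; the names inner, outer and left are chosen for the illustration, and all.x is a standard merge argument not used in the text above.

```r
d1 <- data.frame(id = 1:3, tc = c(4.0, 3.5, 4.7))
d2 <- data.frame(id = 2:4, tg = c(2.1, 0.8, 1.1))

inner <- merge(d1, d2, by = "id")                 # only ids found in both
outer <- merge(d1, d2, by = "id", all = TRUE)     # every id; gaps become NA
left  <- merge(d1, d2, by = "id", all.x = TRUE)   # every id in d1

inner
outer
left
```

Without all=TRUE, merge silently drops the rows that have no match, so patient 11 in the example above would disappear from d.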
5.4 Recoding data (data coding)
When processing epidemiological data, we often need to convert a continuous variable into a categorical one. For example, in the diagnosis of osteoporosis, women whose bone mineral density (BMD) T-score is equal to or below -2.5 are considered to have osteoporosis, those with BMD between -2.5 and -1.0 have osteopenia, and those above -1.0 are normal. Suppose we have BMD values from 10 patients:
-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11

To enter these values into R we can use the function c:
> bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,-2.00,1.71,2.12,-2.11)

To classify the 3 groups osteoporosis, osteopenia and normal, we can use the codes 1, 2 and 3. In other words, we want to create another variable (call it diagnosis) that takes these 3 values according to the value of bmd. To do this we use the commands:
> # temporarily set diagnosis equal to bmd
> diagnosis <- bmd
> # recode bmd into diagnosis
> diagnosis[bmd <= -2.5] <- 1
> diagnosis[bmd > -2.5 & bmd <= -1.0] <- 2
> diagnosis[bmd > -1.0] <- 3
> # combine into a data frame
> data <- data.frame(bmd, diagnosis)
> # list the data to check that the commands worked
> data
     bmd diagnosis
1  -0.92         3
2   0.21         3
3   0.17         3
4  -3.21         1
5  -1.80         2
6  -2.60         1
7  -2.00         2
8   1.71         3
9   2.12         3
10 -2.11         2
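As an alternative to the step-by-step assignments, base R's cut function can do the same recoding in a single call. This is a different technique from the one used in the text, shown here only as a sketch; the object name dx and the English labels are assumptions for the illustration.

```r
bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)

# Intervals are closed on the right by default: (-Inf,-2.5], (-2.5,-1], (-1,Inf)
dx <- cut(bmd,
          breaks = c(-Inf, -2.5, -1.0, Inf),
          labels = c("osteoporosis", "osteopenia", "normal"))

data.frame(bmd, code = as.integer(dx), dx)
table(dx)
```

The result of cut is a factor, so the groups carry readable labels while as.integer(dx) recovers the codes 1, 2 and 3.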

5.5 Recoding data with replace
Another way to recode data is to use replace, although it looks a little more cumbersome. Continuing the example above, we convert bmd into diagnosis as follows:
> diagnosis <- bmd
> diagnosis <- replace(diagnosis, bmd <= -2.5, 1)
> diagnosis <- replace(diagnosis, bmd > -2.5 & bmd <= -1.0, 2)
> diagnosis <- replace(diagnosis, bmd > -1.0, 3)

5.6 Converting to a factor
In statistical analysis we distinguish between a factor variable and an ordinary continuous variable. A factor cannot be used in arithmetic such as addition, subtraction, multiplication and division, whereas a numeric variable can. In the bmd and diagnosis example above, diagnosis is a factor, because the average of the values 1 and 2 has no practical meaning at all, while bmd is a numeric variable.
At present, however, R treats diagnosis as a numeric variable. To turn it into a factor, we use the function factor as follows:
> diag <- factor(diagnosis)
> diag
[1] 3 3 3 1 2 1 2 3 3 2
Levels: 1 2 3

Note that R now tells us that diag has 3 levels: 1, 2 and 3. If we ask R to compute the mean of diag, R will not do so, because diag is not a numeric variable:
> mean(diag)
[1] NA
Warning message:
argument is not numeric or logical: returning NA in: mean.default(diag)

Of course, we can still compute the mean of diagnosis:
> mean(diagnosis)
[1] 2.3
but the result, 2.3, has no practical meaning.
5.7 Grouping data with cut2 (Hmisc)
In statistical analysis we sometimes need to divide a continuous variable into several groups based on its distribution. For example, we can cut the variable bmd into groups of roughly equal size with the function cut2 (in the Hmisc library) as follows:
> # load the Hmisc library so that cut2 can be used
> library(Hmisc)
> bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,-2.00,1.71,2.12,-2.11)
> # divide bmd into 2 groups and store them in the object group
> group <- cut2(bmd, g=2)
> table(group)
group
[-3.21,-0.92) [-0.92, 2.12]
            5             5

As the example shows, g = 2 means divide into 2 groups (g = group). R automatically forms group 1 from the bmd values from -3.21 up to (but not including) -0.92, and group 2 from -0.92 to 2.12. Each group contains 5 values.
Of course, we can also divide the data into 3 groups with the command:
> group <- cut2(bmd, g=3)

With the table command we see the 3 groups: group 1 has 4 values, and groups 2 and 3 have 3 values each:
> table(group)
group
[-3.21,-1.80) [-1.80, 0.21) [ 0.21, 2.12]
            4             3             3

6. Using R for simple calculations
One of the advantages of R is that it can be used like a pocket calculator. In fact it goes further: R can be used for matrix calculations and for programming. In this chapter I present only some simple calculations that pupils and students can try straight away while reading these lines.
6.1 Simple calculations

Addition and subtraction of two or more numbers:
> 15+2997
[1] 3012
> 15+2997-9768
[1] -6756

Multiplication and division:
> -27*12/21
[1] -15.42857

Powers: $(25 - 5)^3$
> (25 - 5)^3
[1] 8000

Square root: $\sqrt{10}$
> sqrt(10)
[1] 3.162278

The number pi ($\pi$):
> pi
[1] 3.141593
> 2+3*pi
[1] 11.42478

Natural logarithm: $\log_e$
> log(10)
[1] 2.302585

Base-10 logarithm: $\log_{10}$
> log10(100)
[1] 2
> log10(2+3*pi)
[1] 1.057848

Exponential: $e^{2.7689}$
> exp(2.7689)
[1] 15.94109

Trigonometric functions:
> cos(pi)
[1] -1

Vectors
> x <- c(2,3,1,5,4,6,7,6,8)
> x
[1] 2 3 1 5 4 6 7 6 8
> sum(x)
[1] 42
> x*2
[1]  4  6  2 10  8 12 14 12 16
> exp(x/10)
[1] 1.221403 1.349859 1.105171 1.648721 1.491825 1.822119 2.013753 1.822119
[9] 2.225541
> exp(cos(x/10))
[1] 2.664634 2.599545 2.704736 2.405079 2.511954 2.282647 2.148655 2.282647
[9] 2.007132

Sum of squares: $1^2 + 2^2 + 3^2 + 4^2 + 5^2 = ?$
> x <- c(1,2,3,4,5)
> sum(x^2)
[1] 55

Adjusted sum of squares: $\sum_{i=1}^{n} (x_i - \bar{x})^2 = ?$
> x <- c(1,2,3,4,5)
> sum((x-mean(x))^2)
[1] 10
In the formula above, mean(x) is the mean of the vector x.

Mean square: $\sum_{i=1}^{n} (x_i - \bar{x})^2 / n = ?$
> x <- c(1,2,3,4,5)
> sum((x-mean(x))^2)/length(x)
[1] 2
In the formula above, length(x) is the number of elements in the vector x.

Variance and standard deviation.
Variance: $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1) = ?$
> x <- c(1,2,3,4,5)
> var(x)
[1] 2.5
Standard deviation: $\sqrt{s^2}$
> sd(x)
[1] 1.581139
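The relation between these quantities can be checked directly in R. The sketch below recomputes the adjusted sum of squares by hand and compares the results with the built-in var and sd; the names n, ss_adj, ms and s2 are chosen only for this illustration.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)

ss_adj <- sum((x - mean(x))^2)   # adjusted sum of squares
ms     <- ss_adj / n             # mean square (divide by n)
s2     <- ss_adj / (n - 1)       # sample variance (divide by n - 1)

all.equal(s2, var(x))            # var() uses the n - 1 divisor
all.equal(sqrt(s2), sd(x))       # sd() is the square root of var()
```

The only difference between the mean square and the variance here is the divisor, n versus n - 1.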

6.2 Using R for matrix computations

As we know, a matrix, simply put, consists of rows and columns. When we write A[m, n], we mean that matrix A has m rows and n columns. R expresses matrices in the same way. For example, suppose we want to create a square matrix A with 3 rows and 3 columns, with elements 1, 2, 3, 4, 5, 6, 7, 8, 9:

$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$

In R:

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

But if we type:

> A <- matrix(y, nrow=3, byrow=TRUE)
> A

the result will be:

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

That is, the transposed matrix. Another way to create a transposed matrix is to use t(). For example:

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

and B = A' can be expressed in R as follows:

> B <- t(A)
> B
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
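The equivalence between filling a matrix by row and transposing a matrix filled by column can be verified with a short check. This is only a sketch; A_col and A_row are illustrative names that do not appear in the text.

```r
y <- 1:9

# default: fill column by column
A_col <- matrix(y, nrow = 3)

# byrow = TRUE: fill row by row
A_row <- matrix(y, nrow = 3, byrow = TRUE)

# the row-filled matrix is the transpose of the column-filled one
all(A_row == t(A_col))   # TRUE
dim(A_col)               # 3 3
```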

An identity matrix is a square matrix (the number of rows equals the number of columns) in which all off-diagonal elements are 0 and all diagonal elements are 1; it is the special case of a scalar matrix whose diagonal value is 1. We can create such a matrix in R as follows:

> # create a 3 x 3 matrix with all elements equal to 0
> A <- matrix(0, 3, 3)
> # set the diagonal elements to 1
> diag(A) <- 1
> diag(A)
[1] 1 1 1
> # matrix A is now:
> A
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
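As a shorter alternative, base R's diag() builds the same matrix in one call when given a single integer. The sketch below compares both routes; the names I3 and S are illustrative.

```r
# two-step construction, as in the text
A <- matrix(0, 3, 3)
diag(A) <- 1

# one-step construction: diag(n) returns the n x n identity matrix
I3 <- diag(3)
all(I3 == A)    # TRUE

# a scalar matrix with 5 on the diagonal
S <- 5 * diag(3)
diag(S)
```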

6.2.1 Extracting elements from a matrix

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> # column 1 of matrix A
> A[,1]
[1] 1 2 3
> # row 3 of matrix A
> A[3,]
[1] 3 6 9
> # row 1 of matrix A
> A[1,]
[1] 1 4 7
> # row 2, column 3 of matrix A
> A[2,3]
[1] 8
> # all rows of matrix A except row 2
> A[-2,]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    3    6    9
> # all columns of matrix A except column 1
> A[,-1]
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9
> # check which elements are greater than 3
> A>3
      [,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] FALSE TRUE TRUE
[3,] FALSE TRUE TRUE
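The logical matrix produced by A>3 can itself be used as an index. A minimal sketch, using only base R; the names big, pos and n_big are illustrative.

```r
A <- matrix(1:9, nrow = 3)

# values greater than 3, returned as a vector in column order
big <- A[A > 3]

# row and column positions of those values
pos <- which(A > 3, arr.ind = TRUE)

# how many elements satisfy the condition
n_big <- sum(A > 3)

big
pos
n_big
```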

6.2.2 Matrix computations

Adding and subtracting two matrices. Given two matrices A and B as follows:

> A <- matrix(1:12, 3, 4)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> B <- matrix(-1:-12, 3, 4)
> B
     [,1] [,2] [,3] [,4]
[1,]   -1   -4   -7  -10
[2,]   -2   -5   -8  -11
[3,]   -3   -6   -9  -12

We can compute A+B:

> C <- A+B
> C
     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0

Or A-B:

> D <- A-B
> D
     [,1] [,2] [,3] [,4]
[1,]    2    8   14   20
[2,]    4   10   16   22
[3,]    6   12   18   24

Multiplying two matrices. Given two matrices:

$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$

We want to compute AB, which can be done in R with the %*% operator as follows:

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> B <- t(A)
> AB <- A%*%B
> AB
     [,1] [,2] [,3]
[1,]   66   78   90
[2,]   78   93  108
[3,]   90  108  126

Or BA, again with the %*% operator:

> BA <- B%*%A
> BA
     [,1] [,2] [,3]
[1,]   14   32   50
[2,]   32   77  122
[3,]   50  122  194
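A common slip is to use * where %*% is intended. The sketch below contrasts the two operators on the same A and B; it also uses base R's crossprod(), which computes t(A) %*% A. The names elementwise and product are illustrative.

```r
A <- matrix(1:9, nrow = 3)
B <- t(A)

# element-by-element product: each A[i, j] times B[i, j]
elementwise <- A * B

# true matrix product: rows of A times columns of B
product <- A %*% B

elementwise[1, 2]   # 4 * 2 = 8
product[1, 1]       # 1*1 + 4*4 + 7*7 = 66

# crossprod(A) equals t(A) %*% A, i.e. BA in the text
all(crossprod(A) == B %*% A)   # TRUE
```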


Matrix inversion and solving systems of equations. For example, consider the following system of equations:

$$3x_1 + 4x_2 = 4$$
$$x_1 + 6x_2 = 2$$

This system can be written in matrix notation as AX = Y, where:

$$A = \begin{pmatrix} 3 & 4 \\ 1 & 6 \end{pmatrix}, \quad X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad Y = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$$

The solution of this system is $X = A^{-1}Y$, or in R:

> A <- matrix(c(3,1,4,6), nrow=2)
> Y <- matrix(c(4,2), nrow=2)
> X <- solve(A)%*%Y
> X
          [,1]
[1,] 1.1428571
[2,] 0.1428571

We can check:

> 3*X[1,1]+4*X[2,1]
[1] 4
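The same system can also be solved without forming the inverse explicitly, because solve() accepts the right-hand side as a second argument. A minimal sketch; the name residual is introduced here for the check.

```r
A <- matrix(c(3, 1, 4, 6), nrow = 2)
Y <- c(4, 2)

# solve(A, Y) returns X such that A %*% X = Y
X <- solve(A, Y)
X

# check both equations at once: the residual should be (numerically) zero
residual <- A %*% X - Y
max(abs(residual))
```

Passing Y directly is usually preferred over solve(A) %*% Y, since it avoids computing the full inverse.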

Eigenvalues can also be computed with the function eigen, as follows:

> eigen(A)
$values
[1] 7 2

$vectors
           [,1]       [,2]
[1,] -0.7071068 -0.9701425
[2,] -0.7071068  0.2425356
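The output of eigen() can be checked against the defining relation $Av = \lambda v$. This is a sketch using the same 2 x 2 matrix A; the names e, v1, lhs and rhs are illustrative.

```r
A <- matrix(c(3, 1, 4, 6), nrow = 2)
e <- eigen(A)

# first eigenvalue and its eigenvector
v1  <- e$vectors[, 1]
lhs <- as.vector(A %*% v1)   # A v
rhs <- e$values[1] * v1      # lambda v
all.equal(lhs, rhs)          # TRUE

# the eigenvalues sum to the trace and multiply to the determinant
sum(e$values)
prod(e$values)
```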

Determinant. How do we determine whether a matrix can be inverted? A matrix whose determinant equals 0 is a singular matrix and cannot be inverted. To check the determinant, R provides the command det():

> E <- matrix((1:9), 3, 3)
> E
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> det(E)
[1] 0

In some R versions, floating-point rounding makes det(E) print a tiny non-zero value (on the order of 1e-16) rather than exactly 0; either result indicates a singular matrix.

But the following matrix F can be inverted:

> F <- matrix((1:9)^2, 3, 3)
> F
     [,1] [,2] [,3]
[1,]    1   16   49
[2,]    4   25   64
[3,]    9   36   81
> det(F)
[1] -216

And the inverse of matrix F ($F^{-1}$) can be computed with the function solve() as follows:

> solve(F)
          [,1]      [,2]       [,3]
[1,]  1.291667 -2.166667  0.9305556
[2,] -1.166667  1.666667 -0.6111111
[3,]  0.375000 -0.500000  0.1805556
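A computed inverse can be checked by multiplying it back with the original matrix, which should give the identity matrix up to rounding. In this sketch the matrix is named Fm rather than F, an assumption made here to avoid masking R's built-in F (shorthand for FALSE).

```r
Fm   <- matrix((1:9)^2, 3, 3)
Finv <- solve(Fm)

# Fm %*% Finv should equal the 3 x 3 identity matrix, up to rounding error
check <- Fm %*% Finv
all.equal(check, diag(3))   # TRUE

round(check, 10)
```

all.equal() is used instead of == because floating-point arithmetic rarely reproduces exact zeros and ones.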

Beyond these simple calculations, R can also be used for more complex computations. A notable advantage of R is that it lets users freely build computations suited to each specific problem. R has a package, Matrix, designed specifically for matrix computation. Readers can download the package, install it, and use it if needed. The download address is: http://cran.au.r-project.org/bin/windows/contrib/rrelease/Matrix_0.995-8.zip together with a user guide (about 80 pages):
http://cran.au.r-project.org/doc/packages/Matrix.pdf.

7. Using R for probability calculations

7.1 Permutations

We know that 3! = 3 x 2 x 1 = 6, and 0! = 1. In general, the factorial of a number n is $n! = n(n-1)(n-2)(n-3) \cdots 1$. In R this calculation is very simple with the command prod(), as follows:

Find 3!
> prod(3:1)
[1] 6

Find 10!
> prod(10:1)
[1] 3628800

Find 10 x 9 x 8 x 7 x 6 x 5 x 4
> prod(10:4)
[1] 604800

Find (10 x 9 x 8 x 7 x 6 x 5 x 4) / (40 x 39 x 38 x 37 x 36)
> prod(10:4) / prod(40:36)
[1] 0.007659481
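Base R also provides factorial(), which gives the same results as the prod() calls above. A minimal sketch comparing the two; the variable names are illustrative.

```r
# n! via factorial() and via prod()
f10_builtin <- factorial(10)
f10_prod    <- prod(10:1)
all.equal(f10_builtin, f10_prod)   # TRUE

# a partial product such as 10 x 9 x ... x 4 is a ratio of factorials
partial <- factorial(10) / factorial(3)
all.equal(partial, prod(10:4))     # TRUE

factorial(0)   # 1, matching 0! = 1
```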
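
7.2 Combinations

The number of ways of choosing k people from n elements is: $\binom{n}{k} = \frac{n!}{k!(n-k)!}$. This formula is sometimes also written $C_k^n$ instead of $\binom{n}{k}$. In R, this calculation is very simple with the function choose(n, k). Here are a few illustrative examples:

Find $\binom{5}{2}$
> choose(5, 2)
[1] 10

Find the probability that the pair A and B, out of 5 people, is elected to two positions:

> 1/choose(5, 2)
[1] 0.1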
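The value returned by choose() can be reproduced from the factorial formula, and base R's combn() can list the actual pairs. A sketch with n = 5 and k = 2; the names by_formula, pairs_5_2 and ordered are illustrative.

```r
n <- 5
k <- 2

# n! / (k! (n - k)!)
by_formula <- factorial(n) / (factorial(k) * factorial(n - k))
all.equal(by_formula, choose(n, k))   # TRUE

# enumerate every pair; each column is one combination
pairs_5_2 <- combn(n, k)
ncol(pairs_5_2)

# if the two positions were distinct (order matters), multiply by k!
ordered <- choose(n, k) * factorial(k)
ordered
```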

7.3 Random variables and distribution functions

A distribution refers to the values that a variable can take. Distribution functions describe variables systematically, where "systematically" means according to a specific mathematical model with given parameters. Probability and statistics offer quite a few distribution functions; here we look at the most important and most commonly used ones: the binomial distribution, the Poisson distribution, and the normal distribution. For each distribution, there are 4 kinds of function we need to know:

- the probability density function;
- the cumulative probability distribution function;
- the quantile function; and
- the simulation function.

R has built-in functions for all of these. The name of each function consists of a prefix indicating the kind of function, followed by an abbreviated name of the distribution. The prefixes are d (distribution, or probability density), p (cumulative probability), q (quantile), and r (random, for random numbers). The abbreviated names are norm (normal distribution), binom (binomial distribution), pois (Poisson distribution), and so on. The following table summarizes the functions and the parameters of each:

Distribution       Density                        Cumulative                     Quantile                       Simulation
Normal             dnorm(x, mean, sd)             pnorm(q, mean, sd)             qnorm(p, mean, sd)             rnorm(n, mean, sd)
Binomial           dbinom(k, n, p)                pbinom(q, n, p)                qbinom(p, n, p)                rbinom(k, n, prob)
Poisson            dpois(k, lambda)               ppois(q, lambda)               qpois(p, lambda)               rpois(n, lambda)
Uniform            dunif(x, min, max)             punif(q, min, max)             qunif(p, min, max)             runif(n, min, max)
Negative binomial  dnbinom(x, k, p)               pnbinom(q, k, p)               qnbinom(p, k, prob)            rnbinom(n, k, prob)
Beta               dbeta(x, shape1, shape2)       pbeta(q, shape1, shape2)       qbeta(p, shape1, shape2)       rbeta(n, shape1, shape2)
Gamma              dgamma(x, shape, rate, scale)  pgamma(q, shape, rate, scale)  qgamma(p, shape, rate, scale)  rgamma(n, shape, rate, scale)
Geometric          dgeom(x, p)                    pgeom(q, p)                    qgeom(p, prob)                 rgeom(n, prob)
Exponential        dexp(x, rate)                  pexp(q, rate)                  qexp(p, rate)                  rexp(n, rate)
Weibull            dweibull(x, shape, scale)      pweibull(q, shape, scale)      qweibull(p, shape, scale)      rweibull(n, shape, scale)
Cauchy             dcauchy(x, location, scale)    pcauchy(q, location, scale)    qcauchy(p, location, scale)    rcauchy(n, location, scale)
F                  df(x, df1, df2)                pf(q, df1, df2)                qf(p, df1, df2)                rf(n, df1, df2)
T                  dt(x, df)                      pt(q, df)                      qt(p, df)                      rt(n, df)
Chi-squared        dchisq(x, df)                  pchisq(q, df)                  qchisq(p, df)                  rchisq(n, df)

Note: In the table above, df = degrees of freedom; prob = probability; n = sample size. The other parameters are described in the documentation of each distribution. The F, t, and Chi-squared distributions also have one more parameter, the non-centrality parameter (ncp), which is set to 0 by default. Users can supply a different suitable value if needed.

7.3.1 Binomial distribution

As its name suggests, a binomial variable takes only two values: male / female, alive / dead, yes / no, and so on. The binomial distribution is stated by the following theorem: if an experiment is carried out n times, each time resulting in either success or failure, and the probability of success is known in advance to be p, then the probability that k of the experiments succeed is:

$$P(k \mid n, p) = C_k^n \, p^k (1 - p)^{n-k}$$

where k = 0, 1, 2, ..., n. In R, the function dbinom(k, n, p) computes this formula quickly. For example, with k = 2, n = 3 and p = 0.60, we simply type:

> dbinom(2, 3, 0.60)
[1] 0.432
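The value returned by dbinom() can be reproduced from the binomial formula with choose(). A minimal sketch for k = 2, n = 3, p = 0.60; the names by_formula and all_probs are illustrative.

```r
k <- 2
n <- 3
p <- 0.60

# C(n, k) * p^k * (1 - p)^(n - k)
by_formula <- choose(n, k) * p^k * (1 - p)^(n - k)
all.equal(by_formula, dbinom(k, n, p))   # TRUE

# probabilities for every possible k = 0, 1, ..., n add up to 1
all_probs <- dbinom(0:n, n, p)
sum(all_probs)
```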

Example 2: Cumulative binomial probability distribution. The probability that an osteoporosis drug is effective is about 70% (that is, p = 0.70). If we treat 10 patients, what is the probability that at least 8 patients have a positive outcome? In other words, if X is the number of successfully treated patients, we need to find $P(X \ge 8)$. To answer this question, we use the function pbinom(k, n, p). Recall that pbinom(k, n, p) gives $P(X \le k)$. Therefore, $P(X \ge 8) = 1 - P(X \le 7)$, and the answer in R is:

> 1-pbinom(7, 10, 0.70)
[1] 0.3827828
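The same upper-tail probability can be obtained in two other ways: with the lower.tail argument of pbinom(), or by summing the individual probabilities for 8, 9 and 10 successes. A sketch; the names p_upper and p_sum are illustrative.

```r
# P(X > 7) directly, without subtracting from 1
p_upper <- pbinom(7, 10, 0.70, lower.tail = FALSE)

# P(X = 8) + P(X = 9) + P(X = 10)
p_sum <- sum(dbinom(8:10, 10, 0.70))

all.equal(p_upper, 1 - pbinom(7, 10, 0.70))   # TRUE
all.equal(p_sum, p_upper)                     # TRUE
```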

Example 3: Simulating a binomial distribution. Suppose that about 20% of people in a population have hypertension. If we draw 1000 samples, each of 20 people selected at random from the population, what will the distribution of the number of hypertensive patients look like? To answer this question, we can use the function rbinom(n, k, p) in R with the following parameters:

> b <- rbinom(1000, 20, 0.20)

In this command, the simulation results are stored temporarily in an object named b. To see what b contains, we count its values with the command table:

> table(b)
b
  0   1   2   3   4   5   6   7   8   9  10
  6  45 147 192 229 169 105  68  23  13   3

The first row (0, 1, 2, ..., 10) is the number of hypertensive patients among the 20 people selected. The second row is the number of samples, out of 1000, in which that count occurred. So 6 samples contained no hypertensive patient, 45 samples contained only 1 hypertensive patient, and so on. Perhaps the easiest way to read these frequencies is to plot them with the command hist, as follows:

> hist(b, main="Number of hypertensive patients")

[Figure: histogram titled "Number of hypertensive patients"; vertical axis: Frequency]

Figure 1. Distribution of the number of hypertensive patients among 20 people selected at random from a population in which 20% have hypertension, with the sampling repeated 1000 times.

From the figure, we see that the probability of 4 hypertensive patients (in each sample of 20 people) is the highest (22.9%). This is understandable: because the prevalence of hypertension is 20%, we expect on average 4 of the 20 selected people to be hypertensive. However, the important point of the figure is that we sometimes observe as many as 10 hypertensive patients, even though the probability of such a sample is very low (only 3/1000).
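Because rbinom() draws random numbers, the table above will differ on every run. The sketch below fixes the random seed (the value 123 is an arbitrary assumption) and compares the simulated proportions with the theoretical probabilities from dbinom(); the names simulated and theoretical are illustrative.

```r
set.seed(123)   # arbitrary seed, only to make the run reproducible
b <- rbinom(1000, 20, 0.20)

# simulated proportion of samples with 0, 1, ..., 20 patients
simulated <- table(factor(b, levels = 0:20)) / 1000

# theoretical binomial probabilities for the same counts
theoretical <- dbinom(0:20, 20, 0.20)

# side-by-side comparison for counts 0 to 10
round(cbind(simulated, theoretical)[1:11, ], 3)

# the average count should be close to n * p = 4
mean(b)
```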

7.3.2 Poisson distribution

The Poisson distribution is, in general, very similar to the binomial distribution, except that the parameter p is usually very small and n is usually very large. For this reason, the Poisson distribution is often used to describe rare events (such as the number of people who develop cancer in a population). It is also applied widely and successfully in engineering and market research, for example to the number of customers arriving at a restaurant each hour.

Example 4: Poisson probability density function. From monitoring over many months, the typing error rate of a typist is known: on average, the typist makes 1 error in about 2,000 words. What is the probability that the typist makes 2 errors, or more than 2 errors?

Because the frequency is quite low, we can assume that the number of typing errors (call it X) is a random variable following the Poisson distribution. Here, the mean error rate is 1 ($\lambda = 1$). The Poisson distribution states that the probability that X = k, given the mean rate $\lambda$, is:

$$P(X = k \mid \lambda) = \frac{e^{-\lambda} \lambda^{k}}{k!}$$

Therefore, the answer to the question above is: $P(X = 2 \mid \lambda = 1) = \frac{e^{-1} 1^{2}}{2!} = 0.1839$. This answer can be computed more quickly in R with the function dpois, as follows:

> dpois(2, 1)
[1] 0.1839397

We can also compute the probability of 1 error and the probability of no error:

> dpois(1, 1)
[1] 0.3678794
> dpois(0, 1)
[1] 0.3678794

Note that in the function above we simply supply the parameters k = 2 and $\lambda = 1$. The result above is the probability that the typist makes exactly 2 errors. The probability that the typist makes more than 2 errors (that is, 3, 4, 5, ... errors) can be computed as:

$$P(X > 2) = P(X = 3) + P(X = 4) + P(X = 5) + \cdots = 1 - P(X \le 2) = 1 - 0.3678 - 0.3678 - 0.1839 = 0.08$$

In R, we can compute this as follows:

# P(X <= 2)
> ppois(2, 1)
[1] 0.9196986

# 1 - P(X <= 2)
> 1-ppois(2, 1)
[1] 0.0803014
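The remark that the Poisson distribution resembles a binomial with small p and large n can be illustrated numerically. The sketch below treats each of 2,000 words as a trial with error probability 1/2000, which is an assumption made here for illustration; the names pois, binom and upper are not from the text.

```r
lambda <- 1

# Poisson probabilities of 0, 1, 2, 3 errors
pois <- dpois(0:3, lambda)

# binomial probabilities with n = 2000 trials and p = 1/2000 (so n * p = 1)
binom <- dbinom(0:3, size = 2000, prob = 1 / 2000)

round(cbind(pois, binom), 5)
max(abs(pois - binom))   # the two sets of probabilities nearly coincide

# upper tail P(X > 2) without subtracting from 1
upper <- ppois(2, lambda, lower.tail = FALSE)
upper
```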

7.3.3 Normal distribution

The two distributions we have just examined belong to the group of distributions for discrete variables (discrete distributions), in which the variable takes ordered or categorical values. For continuous variables there are several other suitable distributions, the most important of which is the normal distribution. The normal distribution is the most important foundation of statistical analysis; it is no exaggeration to say that most statistical theory is built on it. The normal density function has two parameters: the mean $\mu$ and the variance $\sigma^2$ (or the standard deviation $\sigma$). Let X be a variable (such as height); the normal density function states that the probability (more precisely, the probability density) at X = x is:

$$P(X = x \mid \mu, \sigma^2) = f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$$

Example 5: Normal probability density function. The current mean height of Vietnamese women is 156 cm, with a standard deviation of 4.6 cm. We also know that height follows a normal distribution. With the two parameters $\mu = 156$ and $\sigma = 4.6$, we can construct the distribution of height for the whole population of Vietnamese women, which has the following shape:

[Figure: density curve titled "Probability distribution of height in Vietnamese women"; horizontal axis: Height (130 to 200); vertical axis: f(height) (0.00 to 0.08)]

Figure 2. Distribution of height in Vietnamese women with mean 156 cm and standard deviation 4.6 cm. The horizontal axis is height and the vertical axis is the probability density for each height.

The figure above was drawn with the following two commands. The first creates a variable height with values 130, 131, 132, ..., 200 cm. The second draws the curve with mean 156 cm and standard deviation 4.6 cm.

> height <- seq(130, 200, 1)
> plot(height, dnorm(height, 156, 4.6),
       type="l",
       ylab="f(height)",
       xlab="Height",
       main="Probability distribution of height in Vietnamese women")

With these two parameters (and the figure), we can estimate the probability density for any height. For example, for a Vietnamese woman with a height of 160 cm:

$$P(X = 160 \mid \mu = 156, \sigma = 4.6) = \frac{1}{4.6\sqrt{2 \times 3.1416}} \exp\left[-\frac{(160 - 156)^2}{2(4.6)^2}\right] = 0.0594$$

The function dnorm(x, mean, sd) in R computes this value concisely:

> dnorm(160, mean=156, sd=4.6)
[1] 0.05942343
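The hand calculation can be written as a small R function and compared with dnorm(). This is a sketch; normal_density is a hypothetical helper name, not a built-in function.

```r
# density of N(mu, sigma^2) at x, written out from the formula
normal_density <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

manual <- normal_density(160, mu = 156, sigma = 4.6)
manual

# agrees with the built-in function
all.equal(manual, dnorm(160, mean = 156, sd = 4.6))   # TRUE
```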


Cumulative normal probability function. Because height is a continuous variable, in practice we rarely want the probability for a specific value x; we usually want the probability for a range of values from a to b. For example, we may want to know the probability that height lies between 150 and 160 cm, that is $P(150 \le X \le 160)$, or the probability that height is below 145 cm, that is $P(X < 145)$. To answer such questions, we need the cumulative normal probability function, defined as follows:

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

So $P(150 \le X \le 160)$ is the area under the curve in Figure 2 between 150 and 160 on the horizontal axis. R has the function pnorm(x, mean, sd), which is very useful for computing cumulative probabilities of a normal distribution:

$$\text{pnorm}(a, \text{mean}, \text{sd}) = \int_{-\infty}^{a} f(x)\,dx = P(X \le a \mid \text{mean}, \text{sd})$$

For example, the probability that a Vietnamese woman's height is 150 cm or less is 9.6%:

> pnorm(150, 156, 4.6)
[1] 0.0960575

Or the probability that a Vietnamese woman's height is 164 cm or more:

> 1-pnorm(164, 156, 4.6)
[1] 0.04100591

In other words, only about 4.1% of Vietnamese women have a height of 164 cm or more.
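The interval probability for heights between 150 and 160 cm is the difference of two pnorm() values. The sketch below also checks it by numerical integration of the density with base R's integrate(); the names p_interval, p_below_145 and p_numeric are illustrative.

```r
mu    <- 156
sigma <- 4.6

# P(150 <= X <= 160) = P(X <= 160) - P(X <= 150)
p_interval <- pnorm(160, mu, sigma) - pnorm(150, mu, sigma)
p_interval

# P(X < 145)
p_below_145 <- pnorm(145, mu, sigma)
p_below_145

# the same interval probability as the area under the density curve
p_numeric <- integrate(dnorm, 150, 160, mean = mu, sd = sigma)$value
all.equal(p_interval, p_numeric, tolerance = 1e-6)
```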

Example 6: Applying the normal distribution. In a population, we know that the mean blood pressure is 100 mmHg and the standard deviation is 13 mmHg. What proportion of people in this population have a blood pressure of 120 mmHg or higher? The answer in R is:

> 1-pnorm(120, mean=100, sd=13)
[1] 0.0619679

That is, about 6.2% of people in this population have a blood pressure of 120 mmHg or higher.

7.3.4 Standardized normal distribution

A variable X that follows a normal distribution with mean $\mu$ and variance $\sigma^2$ is usually written as:

$$X \sim N(\mu, \sigma^2)$$

Here $\mu$ and $\sigma^2$ depend on the unit of measurement of the variable. For example, height is measured in cm (or m), blood pressure in mmHg, age in years, and so on, so describing a variable in its original unit sometimes makes comparison difficult. A simpler approach is to standardize X so that its mean is 0 and its variance is 1. After some algebra, it can easily be shown that the transformation of X that meets this condition is:

$$Z = \frac{X - \mu}{\sigma}$$

In mathematical terms: if $X \sim N(\mu, \sigma^2)$, then $(X - \mu)/\sigma \sim N(0, 1)$. Thus Z is essentially the difference between a value and the mean, expressed in units of standard deviation. If Z = 0, we know that X equals the mean $\mu$. If Z = -1, we know that X is exactly 1 standard deviation below the mean. Likewise, if Z = 2.5, we know that X is exactly 2.5 standard deviations above the mean, and so on.

The distribution of height in Vietnamese women can be described in a new unit, the z score, as follows:

[Figure: density curve titled "Probability distribution of height in Vietnamese women"; horizontal axis: z (-4 to 4); vertical axis: f(z) (0.0 to 0.4)]

Figure 3. Standardized distribution of height in Vietnamese women.

The figure above was drawn with the following two commands:

> height <- seq(-4, 4, 0.1)
> plot(height, dnorm(height, 0, 1),
       type="l",
       ylab="f(z)",
       xlab="z",
       main="Probability distribution of height in Vietnamese women")

The standardized normal distribution has the convenience that it can be used to describe and compare the distribution of any variable, because all are converted to z scores. In the figure above, the vertical axis is the probability density of z and the horizontal axis is z. We can easily compute in R the probability that z is less than some constant. For example, we want to find $P(z \le -1.96)$ for a distribution with mean 0 and standard deviation 1.

> pnorm(-1.96, mean=0, sd=1)
[1] 0.02499790

Or $P(z \le 1.96)$:

> pnorm(1.96, mean=0, sd=1)
[1] 0.9750021

Therefore, $P(-1.96 < z < 1.96)$ is:

> pnorm(1.96) - pnorm(-1.96)
[1] 0.9500042

In other words, the probability that z lies between -1.96 and 1.96 is 95%. (Note that in the command above I did not supply mean=0, sd=1, because the default values of the mean and sd parameters of pnorm are 0 and 1.)

Example 5 (continued). For ease of reference, recall that the mean height of Vietnamese women is 156 cm and the standard deviation is 4.6 cm. A woman with a height of 170 cm therefore corresponds to z = (170 - 156) / 4.6 = 3.04 standard deviations, and the proportion of Vietnamese women taller than 170 cm is very low, only about 0.1%.

> 1-pnorm(3.04)
[1] 0.001182891
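Standardizing by hand and passing the original scale to pnorm() give the same probability. A minimal sketch with the height example; the names z, p_raw and p_std are illustrative, and z is left unrounded here, so the result differs slightly from the value obtained with 3.04.

```r
mu    <- 156
sigma <- 4.6

# z score for a height of 170 cm
z <- (170 - mu) / sigma
z

# upper-tail probability on the original scale ...
p_raw <- 1 - pnorm(170, mean = mu, sd = sigma)

# ... and on the standardized scale
p_std <- 1 - pnorm(z)

all.equal(p_raw, p_std)   # TRUE
```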

Finding quantiles of a normal distribution. Sometimes we need to do the reverse calculation. For example, we may want to know: if the probability that Z is less than some constant z equals a given value p, what is z? In probability notation, we want to find z such that:

$$P(Z < z) = p$$

To answer this question, we use the function qnorm(p, mean=, sd=).

Example 7: Given that Z ~ N(0, 1) and P(Z < z) = 0.95, we want to find z.

> qnorm(0.95, mean=0, sd=1)
[1] 1.644854

Or P(Z < z) = 0.975 for a normal distribution with mean 0 and standard deviation 1:

> qnorm(0.975, mean=0, sd=1)
[1] 1.959964
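qnorm() also works on the original scale of a variable, and it is the inverse of pnorm(). The sketch below reuses the height parameters from Example 5 (mean 156 cm, standard deviation 4.6 cm); the names h975, same and back are illustrative.

```r
mu    <- 156
sigma <- 4.6

# height below which 97.5% of women fall
h975 <- qnorm(0.975, mean = mu, sd = sigma)
h975

# the same value obtained from the standard normal quantile
same <- mu + qnorm(0.975) * sigma
all.equal(h975, same)   # TRUE

# round trip: pnorm() undoes qnorm()
back <- pnorm(h975, mean = mu, sd = sigma)
back
```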

7.4 Random sampling

In probability and statistics, random sampling is very important because it ensures the validity of statistical analysis and inference. In R, we can draw a random sample with the function sample.

Example 8: We have a population of 40 people (coded 1, 2, 3, ..., 40). If we want to select 5 subjects from this population, who will be selected? We can use the command sample() to answer this question as follows:

> sample(1:40, 5)
[1] 32 26  6 18  9

The result shows that subjects 32, 26, 6, 18 and 9 were selected. Each time this command is issued, R selects a different sample, not exactly the same as the one above. For example:

> sample(1:40, 5)
[1]  5 22 35 19  4
> sample(1:40, 5)
[1] 24 26 12  6 22
> sample(1:40, 5)
[1] 22 38 11  6 18

and so on.

The command above performs random sampling without replacement, meaning that once a subject has been selected it is not returned to the population.

We may instead want sampling with replacement, in which each selected subject is returned to the population before the next draw. For example, to select 10 people from a population of 50 people by random sampling with replacement, we only need to add the argument replace = TRUE:

> sample(1:50, 10, replace=T)
[1] 31 44  8 47 50 10 16 29 23

Or toss a coin 10 times; each toss has 2 possible outcomes, H and T, and the result of 10 tosses might be:

> sample(c("H", "T"), 10, replace=T)
 [1] "H" "T" "H" "H" "H" "T" "H" "H" "T" "T"

We can also imagine a bag containing 5 blue balls (X) and 5 red balls (D). We draw 1 ball, note its colour, and put it back in the bag; then we draw another ball, note its colour, and put it back again. If we repeat this 20 times, the result might be:

> sample(c("X", "D"), 20, replace=T)
 [1] "X" "D" "D" "D" "D" "D" "X" "X" "X" "X" "X" "D" "X" "X" "D" "X" "X" "X" "X"
[20] "D"

In addition, we can sample with given probabilities. In the following command, we select 10 subjects from the numbers 1 to 5, but with unequal probabilities:

> sample(5, 10, prob=c(0.3, 0.4, 0.1, 0.1, 0.1), replace=T)
 [1] 3 1 3 2 2 2 2 2 5 1

Subject 1 was selected 2 times, subject 2 was selected 5 times, subject 3 was selected 2 times, and so on. The result does not match the supplied probabilities 0.3, 0.4, 0.1 exactly because the sample is small, but it is not far from what we expect.
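Two practical points can be sketched here: fixing the random seed with set.seed() makes a sample reproducible, and a larger sample brings the observed proportions closer to the supplied probabilities. The seed value 2024 and the sample size 10000 are arbitrary assumptions made for illustration.

```r
# the same seed gives the same sample
set.seed(2024)
s1 <- sample(1:40, 5)
set.seed(2024)
s2 <- sample(1:40, 5)
identical(s1, s2)   # TRUE

# with many draws, proportions approach the supplied probabilities
big <- sample(5, 10000, prob = c(0.3, 0.4, 0.1, 0.1, 0.1), replace = TRUE)
round(prop.table(table(big)), 2)
```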

8. Charts

R offers many ways to design a neat and attractive chart. Most charting functions are built into R, but some more sophisticated and complex charts can be produced with specialized packages such as lattice or trellis, which can be downloaded from the R website. In this chapter I show how to draw common charts using widely used R functions.

8.1 Data for graphical analysis

Having looked at the environment and the options for designing a chart, we can now use some common functions to draw charts from data. In my view, charts can be divided into 2 main types: charts that describe a single variable, and charts that show the relationship between two or more variables. Of course, a variable can be continuous or discrete, so in practice we have 4 types of chart. In the following sections I go through these types, from simple to complex.

Perhaps the best way to learn how to draw graphs in R is with a real dataset. I return to example 2 (section 4.2). In that example, the data consist of 8 columns (or variables): id, sex, age, bmi, hdl, ldl, tc, and tg. (Note that id is the code of the 50 study subjects; sex is gender (male or female); age is age; bmi is the body mass index; hdl is high density cholesterol; ldl is low density cholesterol; tc is total cholesterol; and tg is triglycerides.) The data are stored in the directory c:\works\insulin under the name chol.txt. Before drawing graphs, we start by reading the data into R.

> setwd("c:/works/stats")
> cong <- read.table("chol.txt", header=TRUE, na.strings=".")
> attach(cong)

Or, for ease of following along, I enter the data with the following commands:

sex <- c("Nam", "Nu", "Nu", "Nam", "Nam", "Nu", "Nam", "Nam", "Nam", "Nu",
         "Nu", "Nam", "Nu", "Nam", "Nam", "Nu", "Nu", "Nu", "Nu", "Nu",
         "Nu", "Nu", "Nu", "Nu", "Nam", "Nam", "Nu", "Nam", "Nu", "Nu",
         "Nu", "Nam", "Nam", "Nu", "Nu", "Nam", "Nu", "Nam", "Nu", "Nu",
         "Nam", "Nu", "Nam", "Nam", "Nam", "Nu", "Nam", "Nam", "Nu", "Nu")

age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
         63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
         61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
         60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
         51, 58, 60, 45, 63, 52, 64, 45, 64, 62)

bmi <- c(17, 18, 18, 18, 18, 18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
         20, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22,
         22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24,
         24, 24, 24, 25, 25)

hdl <- c(5.000,4.380,3.360,5.920,6.250,4.150,0.737,7.170,6.942,5.000,
         4.217,4.823,3.750,1.904,6.900,0.633,5.530,6.625,5.960,3.800,
         5.375,3.360,5.000,2.608,4.130,5.000,6.235,3.600,5.625,5.360,
         6.580,7.545,6.440,6.170,5.270,3.220,5.400,6.300,9.110,7.750,
         6.200,7.050,6.300,5.450,5.000,3.360,7.170,7.880,7.360,7.750)

ldl <- c(2.0, 3.0, 3.0, 4.0, 2.1, 3.0, 3.0, 3.0, 3.0, 2.0,
         5.0, 1.3, 1.2, 0.7, 4.0, 4.1, 4.3, 4.0, 4.3, 4.0,
         3.1, 3.0, 1.7, 2.0, 2.1, 4.0, 4.1, 4.0, 4.2, 4.2,
         4.4, 4.3, 2.3, 6.0, 3.0, 3.0, 2.6, 4.4, 4.3, 4.0,
         3.0, 4.1, 4.4, 2.8, 3.0, 2.0, 1.0, 4.0, 4.6, 4.0)

tc <- c(4.0, 3.5, 4.7, 7.7, 5.0, 4.2, 5.9, 6.1, 5.9, 4.0,
        6.2, 4.1, 3.0, 4.0, 6.9, 5.7, 5.7, 5.3, 7.1, 3.8,
        4.3, 4.8, 4.0, 3.0, 3.1, 5.3, 5.3, 5.4, 4.5, 5.9,
        5.6, 8.3, 5.8, 7.6, 5.8, 3.1, 5.4, 6.3, 8.2, 6.2,
        6.2, 6.7, 6.3, 6.0, 4.0, 3.7, 6.1, 6.7, 8.1, 6.2)

tg <- c(1.1, 2.1, 0.8, 1.1, 2.1, 1.5, 2.6, 1.5, 5.4, 1.9,
        1.7, 1.0, 1.6, 1.1, 1.5, 1.0, 2.7, 3.9, 3.0, 3.1,
        2.2, 2.7, 1.1, 0.7, 1.0, 1.7, 2.9, 2.5, 6.2, 1.3,
        3.3, 3.0, 1.0, 1.4, 2.5, 0.7, 2.4, 2.4, 1.4, 2.7,
        2.4, 3.3, 2.0, 2.6, 1.8, 1.2, 1.9, 3.3, 4.0, 2.5)

cong <- data.frame(sex, age, bmi, hdl, ldl, tc, tg)

8.2 Charts for one discrete variable: barplot

The variable sex in the data above has two values (Nam, male, and Nu, female), so it is a discrete variable. We want to know the frequency of each sex (how many males and how many females) and draw a simple chart. To do this, we first use the function table to obtain the frequencies:

> sex.freq <- table(sex)
> sex.freq
sex
Nam  Nu
 22  28

There are 22 males and 28 females in the study. We then use the function barplot to display these frequencies:

> barplot(sex.freq, main="Frequency of males and females")

The same chart can be obtained with a single, simpler command (Figure 8a):

> barplot(table(sex), main="Frequency of males and females")

[Figure: two bar charts titled "Frequency of males and females", one with vertical bars and one with horizontal bars, for Nam and Nu]

Figure 8a. Frequency of each sex shown as vertical bars.
Figure 8b. Frequency of each sex shown as horizontal bars.

Instead of showing the frequencies of males and females as 2 vertical bars, we can show them as two horizontal bars with the parameter horiz = TRUE, as follows (see the result in Figure 8b):

> barplot(sex.freq,
          horiz = TRUE,
          col = rainbow(length(sex.freq)),
          main="Frequency of males and females")


8.3 Charts for two discrete variables: barplot

Age is a continuous variable. We can divide patients into several groups based on age. The function cut cuts a continuous variable into several discrete groups. For example:

> ageg <- cut(age, 3)
> table(ageg)
ageg
  (42,54.7] (54.7,67.3]   (67.3,80]
         19          24           7

This divides the variable age into 3 groups: 42 to 54.7 years is group 1, 54.7 to 67.3 years is group 2, and 67.3 to 80 years is group 3. Group 1 has 19 patients; groups 2 and 3 have 24 and 7 patients.
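When rounder group boundaries are preferred, cut() also accepts explicit break points and labels. The sketch below uses a short hypothetical age vector (age_demo) and hypothetical cut points at 40, 55, 65 and 80; neither comes from the study data.

```r
# hypothetical ages, used only for illustration
age_demo <- c(42, 47, 51, 55, 60, 63, 65, 70, 74, 80)

# intervals are (40,55], (55,65], (65,80] because right = TRUE by default
groups <- cut(age_demo,
              breaks = c(40, 55, 65, 80),
              labels = c("41-55", "56-65", "66-80"))

table(groups)
```

Values outside the outermost breaks would become NA, so the breaks should span the full range of the data.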

Now we want to know how many patients there are in each age group and each sex, using the command table:

> age.sex <- table(sex, ageg)
> age.sex
     ageg
sex   (42,54.7] (54.7,67.3] (67.3,80]
  Nam        10          10         2
  Nu          9          14         5

The result shows that there are 10 male and 9 female patients in the first age group, 10 males and 14 females in the second age group, and so on. To display the frequencies of these two variables, we again use barplot:

> barplot(age.sex, main="Number of males and females in each age group")

[Figure: two bar charts titled "Number of males and females in each age group", for the age groups (42,54.7], (54.7,67.3] and (67.3,80]; horizontal axis of the second chart: Age group]

Figure 9a. Frequency of sex and age group shown as stacked bars.
Figure 9b. Frequency of sex and age group shown as two bars per age group.

In Figure 9a, each bar represents one age group; the dark part of the bar is the frequency of females and the light part is the frequency of males. Instead of showing the male and female frequencies in one bar, we can show them as 2 bars side by side with beside=TRUE, as follows (Figure 9b):

> barplot(age.sex, beside=TRUE, xlab="Age group")

8.4 Pie charts

The frequencies of a discrete variable can also be shown with a pie chart. The following example draws the frequencies of the age groups. Figure 10a shows 3 age groups, and Figure 10b shows 5 age groups:

> pie(table(ageg))
> pie(table(cut(age,5)))

[Figure: two pie charts; the first with slices (42,54.7], (54.7,67.3] and (67.3,80]; the second with slices (42,49.6], (49.6,57.2], (57.2,64.8], (64.8,72.4] and (72.4,80]]

Figure 10a. Frequencies for 3 age groups.
Figure 10b. Frequencies for 5 age groups.

8.5 Charts for one continuous variable: stripchart and hist

8.5.1 Stripchart

A strip chart shows the continuity of a variable. For example, to examine the continuity of triglycerides (tg), the function stripchart() helps:

> stripchart(tg,
             main="Strip chart for triglycerides", xlab="mg/L")

[Figure: strip chart titled "Strip chart for triglycerides"; horizontal axis: mg/L]

We see that the variable tg has gaps, especially among subjects with high tg. While most subjects have a tg below 5, there are 2 subjects with a very high tg (>5).
8.5.2 Histogram

Age is a continuous variable. To draw the frequency chart of the variable age, we simply type hist(age). As mentioned above, we can improve this graph by adding a main title (main) and titles for the horizontal axis (xlab) and the vertical axis (ylab):

> hist(age)
> hist(age, main="Frequency distribution by age group", xlab="Age group", ylab="No of patients")

[Figure: two histograms; the first titled "Histogram of age" (horizontal axis: age, vertical axis: Frequency); the second titled "Frequency distribution by age group" (horizontal axis: Age group, vertical axis: No of patients)]

Figure 11a. The vertical axis is the number of patients (study subjects) and the horizontal axis is age. For example, there are 6 patients aged 40 to 45, and 4 patients aged 70 to 80.
Figure 11b. The chart title and the titles of the vertical and horizontal axes are added with main, xlab and ylab.

We can also turn the chart into a probability density plot with plot(density()), as follows (result in Figure 12a):

> plot(density(age))

[Figure: two density plots; the first titled "density.default(x = age)" (N = 50, Bandwidth = 3.806; vertical axis: Density); the second titled "Histogram of age" (horizontal axis: age, vertical axis: Density)]

Figure 12a. Probability density of the variable age.
Figure 12b. Probability density of the variable age, overlaid on the histogram.

We can also overlay the two graphs, the histogram and the density curve, in a single plot (see Figure 12b).
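The command that produced the overlaid chart is not shown in the text. One common way to obtain such a chart, offered here only as a sketch and not as the author's original command, is to draw the histogram on the density scale with freq = FALSE and then add the density curve with lines(). The vector age_demo is simulated data standing in for age.

```r
set.seed(1)   # arbitrary seed for reproducibility
# simulated ages standing in for the study variable age
age_demo <- round(rnorm(50, mean = 57, sd = 9))

# histogram on the density scale, so both layers share the same vertical axis
hist(age_demo, freq = FALSE, main = "Histogram of age", xlab = "age")

# kernel density estimate drawn on top of the histogram
d <- density(age_demo)
lines(d, col = "red")
```

With freq = FALSE the bar areas sum to 1, which is what makes the density curve comparable to the bars.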

8.6 Box plots

To draw a box plot of the variable tc, we simply type:

> boxplot(tc, main="Box plot of total cholesterol", ylab="mg/L")

[Figure: box plot titled "Box plot of total cholesterol"; vertical axis: mg/L]

Figure 13. In this chart, we see that the median is about 5.6 mg/L, 25% of total cholesterol values are below 4.1, and 75% are below 6.2. The lowest total cholesterol is about 3, and the highest is above 8 mg/L.

In the following chart, we compare tc between the two groups, males and females:

> boxplot(tc ~ sex, main="Box plot of total cholesterol by sex", ylab="mg/L")

The result is shown in Figure 14a. We can change the appearance of the graph with the parameter horizontal=TRUE and change the colour with the parameter col, as follows (Figure 14b):

> boxplot(tc~sex, horizontal=TRUE, main="Box plot of total cholesterol", ylab="mg/L", col = "pink")

[Figure: two box plots of total cholesterol (mg/L) for Nam and Nu, titled "Box plot of total cholesterol by sex" and "Box plot of total cholesterol"]

Figure 14a. In this chart, we see that the median total cholesterol of females is lower than that of males, but the spread in the two groups does not differ much.
Figure 14b. Total cholesterol for each sex, with colour and horizontal boxes.

8.7 Graphical analysis for two continuous variables

8.7.1 Scatter plot

To examine the relationship between two variables, we use a scatter plot. To draw a scatter plot of the relationship between tc and hdl, we use the function plot. The first argument of plot is the horizontal axis (x-axis) and the second argument is the vertical axis. To examine the relationship between tc and hdl we simply type:

> plot(tc, hdl)

[Figure: scatter plot; horizontal axis: tc; vertical axis: hdl]

Figure 15. Relationship between tc and hdl. In this chart, hdl is drawn on the vertical axis and tc on the horizontal axis.

We want to distinguish sex (male and female) in the chart above. To do so, we use the function ifelse. In the following command, if sex=="Nam" the point is drawn with symbol 16 (a filled circle); otherwise it is drawn with symbol 22 (a square):

> plot(hdl, tc, pch=ifelse(sex=="Nam", 16, 22))

The result is Figure 16a. We can also replace the symbols with the letters M (male) and F (female) (see Figure 16b):

> plot(hdl, tc, pch=ifelse(sex=="Nam", "M", "F"))

[Figure: two scatter plots of tc (vertical axis) against hdl (horizontal axis); in the second, points are drawn as the letters M and F]

Figure 16a. Relationship between tc and hdl by sex, shown with two plotting symbols.
Figure 16b. Relationship between tc and hdl by sex, shown with two letters.


We can also draw a linear regression line through the points by continuing with the following commands:

> plot(hdl ~ tc, pch=16, main="Total cholesterol and HDL cholesterol",
       xlab="Total cholesterol", ylab="HDL cholesterol", bty="l")
> reg <- lm(hdl ~ tc)
> abline(reg)

The result is Figure 17a below. We can also use a smooth function to show the relationship between the two variables. The following graph uses lowess (one of the most common smoothers) to smooth the tc and hdl data (Figure 17b).

> plot(hdl ~ tc, pch=16,
       main="Total cholesterol and HDL cholesterol with LOWESS smooth function",
       xlab="Total cholesterol", ylab="HDL cholesterol", bty="l")
> lines(lowess(tc, hdl, f=2/3, iter=3), col="red")

[Figure: two scatter plots of HDL cholesterol (vertical axis) against Total cholesterol (horizontal axis), titled "Total cholesterol and HDL cholesterol" and "Total cholesterol and HDL cholesterol with LOWESS smooth function"]

Figure 17a. In the commands above, reg <- lm(hdl~tc) means finding the equation relating hdl and tc with a linear model (lm) and storing the result in the object reg. The second command, abline(reg), asks R to draw the straight line from the equation in reg.
Figure 17b. Instead of abline, we use the function lowess to show the relationship between tc and hdl.

Readers can experiment with other values of the parameter, such as f=1/2, f=2/5, or even f=1/10, to see how the graph changes in interesting ways.
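The effect of the smoothing parameter f can be seen by drawing several lowess curves on one plot. The sketch below uses simulated x and y values as stand-ins for tc and hdl (the generating formula is an assumption made purely for illustration), and the colours are arbitrary choices.

```r
set.seed(42)   # arbitrary seed for reproducibility
# simulated stand-ins for tc (x) and hdl (y)
x <- runif(50, 3, 8)
y <- 2 + 0.6 * x + rnorm(50, sd = 1)

plot(x, y, pch = 16, bty = "l")

# smaller f uses fewer neighbouring points, giving a wigglier curve
spans <- c(2/3, 1/2, 1/10)
cols  <- c("red", "blue", "darkgreen")
for (i in seq_along(spans)) {
  lines(lowess(x, y, f = spans[i], iter = 3), col = cols[i])
}

# lowess() returns the x values sorted, with the smoothed y for each
fit <- lowess(x, y, f = 2/3)
```

Note that lowess() takes x first and y second, matching the axes of the plot.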
8.8 Phn tch Biu cho nhiu bin: pairs

Chng ta c th tm hiu mi lin h gia cc bin s nh age, bmi, hdl, ldl v


tc bng cch dng lnh pairs. Nhng trc ht, chng ta phi a cc bin s ny
vo mt data.frame ch gm nhng bin s c th v c, v sau s dng hm
pairs trong R.
> lipid <- data.frame(age,bmi,hdl,ldl,tc)
> pairs(lipid, pch=16)

Kt qu s l:

[Figure: scatterplot matrix of age, bmi, hdl, ldl and tc produced by pairs(lipid).]

8.9 Plots with standard errors

In the following plot we have 5 groups (the variable is simulated, not real data), and each group has a mean and 95% confidence limits (lcl and ucl). Usually lcl = mean - 1.96*SE and ucl = mean + 1.96*SE, where SE is the standard error. We want to plot the 5 groups together with their error bars. The following commands are needed:
> group <- c(1,2,3,4,5)
> mean <- c(1.1, 2.3, 3.0, 3.9, 5.1)
> lcl <- c(0.9, 1.8, 2.7, 3.8, 5.0)
> ucl <- c(1.3, 2.4, 3.5, 4.1, 5.3)
> plot(group, mean, ylim=range(c(lcl, ucl)))
> arrows(group, ucl, group, lcl, length=0.5, angle=90, code=3)
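
The limits in the example above are typed in by hand. As a minimal sketch of where such limits could come from, assuming raw observations are available for each group, the code below derives them from the formula lcl = mean - 1.96*SE. The simulated data and the names g, x, m and se are hypothetical and are not part of the original example.

```r
# Hypothetical simulated data: 5 groups with 20 observations each
set.seed(1)
g <- rep(1:5, each = 20)
x <- rnorm(100, mean = g, sd = 0.5)

# Mean and standard error (SE = sd / sqrt(n)) within each group
m  <- tapply(x, g, mean)
se <- tapply(x, g, function(v) sd(v) / sqrt(length(v)))

# Approximate 95% confidence limits
lcl <- m - 1.96 * se
ucl <- m + 1.96 * se
round(cbind(mean = m, lcl = lcl, ucl = ucl), 2)
```

The resulting m, lcl and ucl vectors can be passed to plot and arrows exactly as in the commands above.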

[Figure: the mean of each group plotted against group, with vertical bars running from lcl to ucl.]

9. Descriptive statistical analysis

9.1 Descriptive statistics (summary)

To illustrate the use of R in descriptive statistics, I will use a research data set named igfdata. In this study, besides sex, age, weight and height, we measured hormones related to growth (igfi, igfbp3 and als) and markers of bone turnover (pinp, ictp and p3np). There are 100 study subjects. The data are stored in the directory c:\works\stats. First we need to read the data into R with the following commands (text after the # sign is a comment for the reader):
> options(width=100)
# change directory
> setwd("c:/works/stats")
# read the data into R
> igfdata <- read.table("igf.txt", header=TRUE, na.strings=".")
> attach(igfdata)
# list the columns in the data
> names(igfdata)
 [1] "id"        "sex"       "age"       "weight"    "height"    "ethnicity"
 [7] "igfi"      "igfbp3"    "als"       "pinp"      "ictp"      "p3np"

> igfdata
     id    sex age weight height ethnicity    igfi  igfbp3     als    pinp    ictp    p3np
1     1 Female  15     42    162     Asian 189.000 4.00000 323.667 353.970 11.2867  8.3367
2     2   Male  16     44    160 Caucasian 160.000 3.75000 333.750 375.885 10.4300  6.7450
3     3 Female  15     43    157     Asian 146.833 3.43333 248.333 199.507  8.3633 12.5000
4     4 Female  15     42    155     Asian 185.500 3.40000 251.000 483.607 13.3300 14.2767
5     5 Female  16     47    167     Asian 192.333 4.23333 322.000 105.430  7.9233  4.5033
6     6 Female  25     45    160     Asian 110.000 3.50000 284.667  76.487  4.9833  4.9367
7     7 Female  19     45    161     Asian 157.000 3.20000 274.000  75.880  6.3500  5.3200
8     8 Female  18     43    153     Asian 146.000 3.40000 303.000  86.360  7.3700  4.6700
9     9 Female  15     41    149     Asian 197.667 3.56667 308.500 254.803 11.8700  6.8200
10   10 Female  24     45    157   African 148.000 3.40000 273.000  44.720  3.7400  6.1600
...
97   97 Female  17     54    168 Caucasian 204.667 4.96667 441.333  64.130  5.1600  4.4367
98   98   Male  18     55    169     Asian 178.667 3.86667 273.000 185.913  7.5267  8.8333
99   99 Female  18     48    151     Asian 237.000 3.46667 324.333 105.127  5.9867  5.6600
100 100   Male  15     54    168     Asian 130.000 2.70000 259.333 325.840 10.2767  6.5933

The listing above shows only part of the data for the 100 subjects.

For a variable with values $x_1, x_2, x_3, \ldots, x_n$ we can compute a number of descriptive statistics as follows:

Theory                                                                    R function
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$                            mean(x)
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$            var(x)
Standard deviation: $s = \sqrt{s^2}$                                       sd(x)
Standard error: $SE = \frac{s}{\sqrt{n}}$                                  none
Minimum                                                                   min(x)
Maximum                                                                   max(x)
Range                                                                     range(x)

Example 9: To find the mean age, we simply type:
> mean(age)
[1] 19.17

Or the variance and the standard deviation of age:
> var(age)
[1] 15.33444
> sd(age)
[1] 3.915922


However, R has the summary command, which gives all the main descriptive statistics of a variable at once:
> summary(age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.00   16.00   19.00   19.17   21.25   34.00

In general this output is simple and the abbreviations are easy to understand. Note that 1st Qu. and 3rd Qu. stand for the first quartile (the 25% position) and the third quartile (the 75% position) of the variable. First quartile = 16 means that 25% of the subjects are aged 16 or younger. Similarly, third quartile = 21.25 means that 75% of the subjects are aged 21.25 or younger; the oldest subject is 34. The median of 19 likewise means that 50% of the subjects are aged 19 or younger (or 19 or older).

R has no built-in function for the standard error, and summary does not report the standard deviation. To obtain these numbers we can write a simple function of our own (call it desc) as follows:
desc <- function(x)
{
av <- mean(x)
sd <- sd(x)
se <- sd/sqrt(length(x))
c(MEAN=av, SD=sd, SE=se)
}

We can then call this function on any variable we like, for example als:
> desc(als)
      MEAN         SD         SE
301.841120  58.987189   5.898719
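
Because the data are read with na.strings=".", some columns may contain missing values, in which case desc would return NA. The following is a minimal sketch of a variant that drops missing values first. The name desc2 and the example vector v are hypothetical and are not part of the original text.

```r
# Variant of desc() that ignores missing values.
# Assumption: desc2 and v are illustrative names, not from the original.
desc2 <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  n <- length(x)             # number of non-missing observations
  s <- sd(x)                 # standard deviation
  c(N = n, MEAN = mean(x), SD = s, SE = s / sqrt(n))
}

# Hypothetical example with one missing value
v <- c(10, 12, 14, NA, 16, 18)
desc2(v)
```

Reporting N alongside the other statistics makes it visible how many observations each estimate is based on.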

To get an overview of the whole igfdata data set we simply use summary as follows:
> summary(igfdata)
       id             sex          age            weight          height          ethnicity
 Min.   :  1.00   Female:69   Min.   :13.00   Min.   :41.00   Min.   :149.0   African  : 8
 1st Qu.: 25.75   Male  :31   1st Qu.:16.00   1st Qu.:47.00   1st Qu.:157.0   Asian    :60
 Median : 50.50               Median :19.00   Median :50.00   Median :162.0   Caucasian:30
 Mean   : 50.50               Mean   :19.17   Mean   :49.91   Mean   :163.1   Others   : 2
 3rd Qu.: 75.25               3rd Qu.:21.25   3rd Qu.:53.00   3rd Qu.:168.0
 Max.   :100.00               Max.   :34.00   Max.   :60.00   Max.   :196.0
      igfi            igfbp3           als             pinp             ictp             p3np
 Min.   : 85.71   Min.   :2.000   Min.   :192.7   Min.   : 26.74   Min.   : 2.697   Min.   : 2.343
 1st Qu.:137.17   1st Qu.:3.292   1st Qu.:256.8   1st Qu.: 68.10   1st Qu.: 4.878   1st Qu.: 4.433
 Median :161.50   Median :3.550   Median :292.5   Median :103.26   Median : 6.338   Median : 5.445
 Mean   :165.59   Mean   :3.617   Mean   :301.8   Mean   :167.17   Mean   : 7.420   Mean   : 6.341
 3rd Qu.:186.46   3rd Qu.:3.875   3rd Qu.:331.2   3rd Qu.:196.45   3rd Qu.: 8.423   3rd Qu.: 7.150
 Max.   :427.00   Max.   :5.233   Max.   :471.7   Max.   :742.68   Max.   :21.237   Max.   :16.303

R summarises every variable that can be summarised. As a result, even the id column (the subject's identification number) is summarised, although we know that statistics of id have no meaning. For categorical variables such as sex and ethnicity, R reports only the frequency of each group.

The results above are for all subjects. If we want separate results for men and women, the by function is very useful. In the following command we ask R to summarise igfdata by sex.
> by(igfdata, sex, summary)
sex: Female
       id            sex          age            weight          height          ethnicity
 Min.   : 1.0   Female:69   Min.   :13.00   Min.   :41.00   Min.   :149.0   African  : 4
 1st Qu.:21.0   Male  : 0   1st Qu.:17.00   1st Qu.:47.00   1st Qu.:156.0   Asian    :43
 Median :47.0               Median :19.00   Median :50.00   Median :162.0   Caucasian:22
 Mean   :48.2               Mean   :19.59   Mean   :49.35   Mean   :161.9   Others   : 0
 3rd Qu.:75.0               3rd Qu.:22.00   3rd Qu.:52.00   3rd Qu.:166.0
 Max.   :99.0               Max.   :34.00   Max.   :60.00   Max.   :196.0
      igfi            igfbp3           als             pinp             ictp             p3np
 Min.   : 85.71   Min.   :2.767   Min.   :204.3   Min.   : 26.74   Min.   : 2.697   Min.   : 2.343
 1st Qu.:136.67   1st Qu.:3.333   1st Qu.:263.8   1st Qu.: 62.75   1st Qu.: 4.717   1st Qu.: 4.337
 Median :163.33   Median :3.567   Median :302.7   Median : 78.50   Median : 5.537   Median : 5.143
 Mean   :167.97   Mean   :3.695   Mean   :311.5   Mean   :108.74   Mean   : 6.183   Mean   : 5.643
 3rd Qu.:186.17   3rd Qu.:3.933   3rd Qu.:361.7   3rd Qu.:115.26   3rd Qu.: 7.320   3rd Qu.: 6.143
 Max.   :427.00   Max.   :5.233   Max.   :471.7   Max.   :502.05   Max.   :13.633   Max.   :14.420
------------------------------------------------------------
sex: Male
       id             sex          age            weight          height          ethnicity
 Min.   :  2.00   Female: 0   Min.   :14.00   Min.   :44.00   Min.   :155.0   African  : 4
 1st Qu.: 34.50   Male  :31   1st Qu.:15.00   1st Qu.:48.50   1st Qu.:161.5   Asian    :17
 Median : 56.00               Median :17.00   Median :51.00   Median :164.0   Caucasian: 8
 Mean   : 55.61               Mean   :18.23   Mean   :51.16   Mean   :165.6   Others   : 2
 3rd Qu.: 75.00               3rd Qu.:20.00   3rd Qu.:53.50   3rd Qu.:169.0
 Max.   :100.00               Max.   :27.00   Max.   :59.00   Max.   :191.0
      igfi            igfbp3           als             pinp             ictp             p3np
 Min.   : 94.67   Min.   :2.000   Min.   :192.7   Min.   : 56.28   Min.   : 3.650   Min.   : 3.390
 1st Qu.:138.67   1st Qu.:3.183   1st Qu.:249.8   1st Qu.:135.07   1st Qu.: 6.900   1st Qu.: 5.375
 Median :160.00   Median :3.500   Median :276.0   Median :245.92   Median : 9.513   Median : 7.140
 Mean   :160.29   Mean   :3.443   Mean   :280.2   Mean   :297.21   Mean   :10.173   Mean   : 7.895
 3rd Qu.:183.00   3rd Qu.:3.775   3rd Qu.:311.3   3rd Qu.:450.38   3rd Qu.:13.517   3rd Qu.:10.010
 Max.   :274.00   Max.   :4.500   Max.   :388.7   Max.   :742.68   Max.   :21.237   Max.   :16.303

To look at the distributions of the hormones and biochemical markers at the same time, we can draw plots for all 6 variables. First we divide the screen into 6 panels (2 rows and 3 columns), and then draw each histogram in turn:
> op <- par(mfrow=c(2,3))
> hist(igfi)
> hist(igfbp3)
> hist(als)
> hist(pinp)
> hist(ictp)
> hist(p3np)

[Figure: histograms of igfi, igfbp3, als, pinp, ictp and p3np in a 2 x 3 layout, each with Frequency on the vertical axis.]

9.2 Descriptive statistics by group

If we want the mean of a variable such as igfi separately for men and women, the tapply function in R can be used:
> tapply(igfi, list(sex), mean)
  Female     Male
167.9741 160.2903

In the command above, igfi is the variable to be summarised, sex is the grouping variable, and the statistic we want is the mean. The result shows that the mean igfi in women (167.97) is higher than in men (160.29).

If we want the mean for each combination of sex and ethnicity, we only need to add one more variable to the list:
> tapply(igfi, list(ethnicity, sex), mean)
            Female     Male
African   145.1252 120.9168
Asian     165.6589 160.4999
Caucasian 176.6536 169.4790
Others          NA 200.5000

In this result NA means "not available": there are no data for women in the Others ethnic group.
9.3 The t-test (t.test)

The t-test relies on the assumption of a normal distribution. There are two kinds of t-test: the one-sample t-test and the two-sample t-test. The one-sample t-test answers the question of whether the mean of a sample really equals a given value. The two-sample t-test answers the question of whether two samples come from the same distribution or, more specifically, whether the two samples really have the same mean. I will illustrate both tests in turn with the igfdata data above.

9.3.1 One-sample t-test

Example 10. From the analysis above, we see that the mean age of the 100 subjects in this study is 19.17 years. Suppose we know from earlier work that the mean age in this population is 30 years. The question is whether our sample is representative of the population. In other words, we want to know whether the mean of 19.17 really differs from the mean of 30.

To answer this question we use the t-test. In statistical theory the t statistic is defined as:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

Here $\bar{x}$ is the sample mean, $\mu$ is the hypothesised mean (30 in this case), $s$ is the standard deviation, and $n$ is the sample size (100). If the absolute value of t exceeds the theoretical value from the t distribution at a chosen significance level, say 5%, we have reason to declare the difference statistically significant. For a sample of 100 this theoretical value can be computed with R's qt function as follows (strictly, the degrees of freedom are n - 1 = 99, and a two-sided 5% test uses qt(0.975, 99), which is about 1.98):
> qt(0.95, 100)
[1] 1.660234

There is, however, a quicker way to answer the question, using the t.test function:
> t.test(age, mu=30)
One Sample t-test


data: age
t = -27.6563, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
18.39300 19.94700
sample estimates:
mean of x
19.17

In the command above, age is the variable to be tested and mu=30 is the hypothesised value. R reports t = -27.66 with 99 degrees of freedom and p < 2.2e-16 (that is, extremely small). R also reports that the 95% confidence interval for the mean age runs from 18.4 to 19.9 years (30 lies far outside this interval). In other words, we have reason to say that the mean age in this sample is truly lower than the mean age of the population.
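
As a cross-check on the formula, the sketch below recomputes the t statistic and the 95% confidence interval from the summary statistics reported earlier (mean 19.17, standard deviation 3.915922, n = 100). The variable names are illustrative only.

```r
# Summary statistics reported in the text for age
xbar <- 19.17       # sample mean
s    <- 3.915922    # sample standard deviation
n    <- 100         # sample size
mu   <- 30          # hypothesised mean

t_stat <- (xbar - mu) / (s / sqrt(n))                      # t statistic
df     <- n - 1                                            # degrees of freedom
p_val  <- 2 * pt(-abs(t_stat), df)                         # two-sided p-value
ci     <- xbar + c(-1, 1) * qt(0.975, df) * s / sqrt(n)    # 95% confidence interval

t_stat
ci
```

The values agree with the t.test output above (t = -27.6563, interval 18.393 to 19.947), which confirms that t.test uses n - 1 = 99 degrees of freedom.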
9.3.2 Two-sample t-test

Example 11. From the descriptive analysis above (the summary output) we saw that women have higher igfi than men (167.97 versus 160.29). The question is whether this is a systematic difference or one caused by chance. To answer it, we need to consider the mean difference between the two groups and the standard error of that difference:

$$t = \frac{\bar{x}_2 - \bar{x}_1}{SED}$$

Here $\bar{x}_1$ and $\bar{x}_2$ are the means of the male and female groups, and SED is the standard error of the difference $(\bar{x}_2 - \bar{x}_1)$. SED can be estimated with the formula:

$$SED = \sqrt{SE_1^2 + SE_2^2}$$

where $SE_1$ and $SE_2$ are the standard errors of the male and female groups. According to probability theory, when the two groups share a common variance, t follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom, where $n_1$ and $n_2$ are the sample sizes of the two groups. By default R does not assume equal variances and uses the Welch approximation, which gives approximate (usually non-integer) degrees of freedom. We can answer the question in R with the t.test function as follows:
> t.test(igfi ~ sex)
Welch Two Sample t-test
data: igfi by sex
t = 0.8412, df = 88.329, p-value = 0.4025
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.46855 25.83627
sample estimates:
mean in group Female   mean in group Male
            167.9741             160.2903
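
To make the roles of SED and the degrees of freedom concrete, here is a minimal sketch that computes the Welch t statistic by hand and compares it with t.test. The igfdata file is not available here, so the two samples a and b are simulated and purely hypothetical; only the group sizes (69 and 31) mirror the study.

```r
# Hypothetical simulated samples (not the igfdata values)
set.seed(42)
a <- rnorm(69, mean = 168, sd = 50)
b <- rnorm(31, mean = 160, sd = 35)

# Standard error of each mean and of the difference
se1 <- sd(a) / sqrt(length(a))
se2 <- sd(b) / sqrt(length(b))
sed <- sqrt(se1^2 + se2^2)
t_stat <- (mean(a) - mean(b)) / sed

# Welch-Satterthwaite approximation to the degrees of freedom
df <- sed^4 / (se1^4 / (length(a) - 1) + se2^4 / (length(b) - 1))

res <- t.test(a, b)   # Welch test is the default (var.equal = FALSE)
c(t_by_hand = t_stat, t_from_t.test = unname(res$statistic))
c(df_by_hand = df, df_from_t.test = unname(res$parameter))
```

The hand-computed values should match those reported by t.test, which explains why the degrees of freedom in the output above (88.329) are not the integer n1 + n2 - 2 = 98.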


R presents the key values first:
t = 0.8412, df = 88.329, p-value = 0.4025

df is the degrees of freedom. The p-value of 0.4025 shows that the difference between men and women is not statistically significant (it is above 0.05, or 5%).
95 percent confidence interval:
-10.46855 25.83627

is the 95% confidence interval for the difference between the two groups. It indicates that igfi in women may be as much as 10.5 ng/L lower than in men or as much as 25.8 ng/L higher. Because this interval is so wide and includes 0, it is further evidence that there is no statistically significant difference between the two groups.

The test above assumes that the male and female groups have different variances. If we have reason to believe that the two groups have the same variance, we only change one argument in the t.test call, var.equal=TRUE, as follows:
> t.test(igfi ~ sex, var.equal=TRUE)
Two Sample t-test
data: igfi by sex
t = 0.7071, df = 98, p-value = 0.4812
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.88137 29.24909
sample estimates:
mean in group Female   mean in group Male
            167.9741             160.2903

Numerically, these results differ slightly from those of the analysis that assumes unequal variances, but the p-value leads to the same conclusion: the difference between the two groups is not statistically significant.
9.4 Wilcoxon test for two samples (wilcox.test)

The t-test assumes that the variable follows a normal distribution. If this assumption does not hold, the result of the t-test may not be valid. To test the distribution of igfi we can use the shapiro.test function as follows:
> shapiro.test(igfi)
Shapiro-Wilk normality test
data: igfi
W = 0.8528, p-value = 1.504e-08

The p-value is far below 0.05, so we can say that the distribution of igfi does not follow a normal distribution. In this situation the comparison between the two groups can be based on a non-parametric method called the Wilcoxon test, because this test (unlike the t-test) does not depend on the assumption of normality.
> wilcox.test(igfi ~ sex)
Wilcoxon rank sum test with continuity correction
data: igfi by sex
W = 1125, p-value = 0.6819
alternative hypothesis: true mu is not equal to 0

The p-value of 0.682 shows that the difference in igfi between men and women is indeed not statistically significant. This conclusion is no different from that of the t-test.
9.5 t-test for paired data (paired t-test, t.test)

The t-test just described is for studies with two independent groups (such as men and women), but it cannot be applied to studies in which one group of subjects is followed over time. I will call these paired studies. In such studies we need a version of the t-test called the paired t-test.

Example 12. A group of 10 patients was treated with a drug intended to lower blood pressure. Each patient's blood pressure was measured at the start of the study (before treatment) and again after treatment. The blood pressure data for the 10 patients are as follows:

Before treatment (x0): 180, 140, 160, 160, 220, 185, 145, 160, 160, 170
After treatment (x1):  170, 145, 145, 125, 205, 185, 150, 150, 145, 155

The question is whether the change in blood pressure allows us to conclude that the drug is effective in lowering blood pressure. To answer this question we use the paired t-test as follows:
> # enter the data
> before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
> after <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)
> bp <- data.frame(before, after)
> # t-test
> t.test(before, after, paired=TRUE)


Paired t-test
data: before and after
t = 2.7924, df = 9, p-value = 0.02097
alternative hypothesis: true difference in means is not equal to
0
95 percent confidence interval:
1.993901 19.006099
sample estimates:
mean of the differences
10.5

The result shows that after treatment blood pressure fell by 10.5 mmHg on average, with a 95% confidence interval of 2.0 mmHg to 19.0 mmHg and p = 0.021. We therefore have evidence that the reduction in blood pressure is statistically significant.

Note that if we wrongly analyse these data with the test for two independent groups, as below, the p-value of 0.32 suggests that the reduction in blood pressure is not statistically significant!
> t.test(before, after)
Welch Two Sample t-test
data: before and after
t = 1.0208, df = 17.998, p-value = 0.3209
alternative hypothesis: true difference in means is not equal to
0
95 percent confidence interval:
-11.11065 32.11065
sample estimates:
mean of x mean of y
168.0
157.5
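
The reason the paired analysis is so much more sensitive is that it works on the within-patient differences. As a minimal sketch, the code below shows that the paired t-test is the same as a one-sample t-test on those differences; the names d and t_stat are illustrative.

```r
before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
after  <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)

# Within-patient differences
d <- before - after

# One-sample t statistic on the differences, testing mean difference = 0
t_stat <- mean(d) / (sd(d) / sqrt(length(d)))

paired <- t.test(before, after, paired = TRUE)   # paired t-test
onesmp <- t.test(d, mu = 0)                      # one-sample test on differences

c(mean_diff = mean(d), t_by_hand = t_stat,
  t_paired = unname(paired$statistic), t_one_sample = unname(onesmp$statistic))
```

All three t values coincide (about 2.79), whereas the independent-groups test ignores the pairing and is swamped by the large variation between patients.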

9.6 Wilcoxon test for paired data (wilcox.test)

Instead of the paired t-test, we can also use the wilcox.test function for the same purpose:
> wilcox.test(before, after, paired=TRUE)
Wilcoxon signed rank test with continuity correction
data: before and after
V = 42, p-value = 0.02291
alternative hypothesis: true mu is not equal to 0

This result confirms once more that the reduction in blood pressure is statistically significant, with a p-value (p = 0.023) that hardly differs from that of the paired t-test.

9.7 Frequencies

The table function in R gives the frequencies of a categorical variable such as sex or ethnicity.
> table(sex)
sex
Female   Male
    69     31
> table(ethnicity)
ethnicity
  African     Asian Caucasian    Others
        8        60        30         2

A two-way table:
> table(sex, ethnicity)
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2

Note that in the tables above the table function does not give percentages. To compute percentages we need the prop.table function, whose use is illustrated below:
# create an object named freq to hold the frequency table
> freq <- table(sex, ethnicity)
# check the result
> freq
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2
# use margin.table to view the marginal totals
> margin.table(freq, 1)
sex
Female   Male
    69     31
> margin.table(freq, 2)
ethnicity
  African     Asian Caucasian    Others
        8        60        30         2
# compute row proportions with prop.table
> prop.table(freq, 1)
        ethnicity
sex         African      Asian  Caucasian     Others
  Female 0.05797101 0.62318841 0.31884058 0.00000000
  Male   0.12903226 0.54838710 0.25806452 0.06451613

In the table above, prop.table computes the ethnic distribution within each sex. For example, among women (Female), 5.8% are African, 62.3% are Asian and 31.9% are Caucasian, for a total of 100%. Similarly, among men the proportion of Africans is 12.9%, of Asians 54.8%, and so on.
# compute column proportions with prop.table
> prop.table(freq, 2)
        ethnicity
sex        African     Asian Caucasian    Others
  Female 0.5000000 0.7166667 0.7333333 0.0000000
  Male   0.5000000 0.2833333 0.2666667 1.0000000

In the table above, prop.table computes the sex distribution within each ethnic group. For example, in the Asian group 71.7% are women and 28.3% are men.
# compute proportions of the whole table
> freq/sum(freq)
        ethnicity
sex      African Asian Caucasian Others
  Female    0.04  0.43      0.22   0.00
  Male      0.04  0.17      0.08   0.02
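
As a small sketch of the same calculations without access to the igfdata file, the table can be rebuilt from the counts shown above. Calling prop.table without a margin argument gives the proportions of the grand total, the same as freq/sum(freq). The construction with matrix and as.table is an illustrative assumption, not the author's code.

```r
# Rebuild the sex-by-ethnicity counts shown in the text
freq <- as.table(matrix(c(4, 43, 22, 0,
                          4, 17,  8, 2),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(sex = c("Female", "Male"),
                                        ethnicity = c("African", "Asian",
                                                      "Caucasian", "Others"))))

prop.table(freq)                       # proportions of the grand total
round(100 * prop.table(freq, 1), 1)    # row percentages
addmargins(freq)                       # counts with row and column totals
```

addmargins is a convenient way to see the counts and both sets of marginal totals in a single table.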

9.8 Test of a proportion (prop.test, binom.test)

The test of a single proportion is usually based on the binomial distribution. With a sample size n and a proportion p, and if n is large (say, over 50), the binomial distribution can be approximated by a normal distribution with mean np and variance np(1 - p). Let x be the number of events of interest. The hypothesis $p = \pi$ can be tested with the following statistic:

$$z = \frac{x - n\pi}{\sqrt{n\pi(1 - \pi)}}$$

Here z follows a normal distribution with mean 0 and variance 1. Equivalently, $z^2$ follows a Chi-squared distribution with 1 degree of freedom.


Example 13. In the study above there are 69 women and 31 men, so the proportion of women is 0.69 (or 69%). To test whether this proportion really differs from 0.5, we can use the function prop.test(x, n, π) as follows:
> prop.test(69, 100, 0.50)
1-sample proportions test with continuity correction
data: 69 out of 100, null probability 0.5
X-squared = 13.69, df = 1, p-value = 0.0002156
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5885509 0.7766330
sample estimates:
p
0.69

In this output, prop.test estimates the proportion of women as 0.69, with a 95% confidence interval of 0.589 to 0.777. The Chi-squared value is 13.69, with p = 0.0002. Thus the proportion of women in this study is higher than 50%.

A more exact way to test a proportion is the binomial test, binom.test(x, n, π), as follows:
> binom.test(69, 100, 0.50)
Exact binomial test
data: 69 and 100
number of successes = 69, number of trials = 100, p-value = 0.0001831
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5896854 0.7787112
sample estimates:
probability of success
0.69

In general, the result of the binomial test hardly differs from that of the Chi-squared test. With p = 0.00018, we again have evidence to conclude that the proportion of women in this study is truly higher than 50%.
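
The link between the z statistic and the X-squared value reported by prop.test can be checked directly. In this minimal sketch (variable names are illustrative), the continuity correction subtracts 0.5 from the absolute difference before squaring.

```r
x  <- 69     # number of women
n  <- 100    # sample size
p0 <- 0.5    # hypothesised proportion

# z statistic without and with the continuity correction
z_raw  <- (x - n * p0) / sqrt(n * p0 * (1 - p0))
z_corr <- (abs(x - n * p0) - 0.5) / sqrt(n * p0 * (1 - p0))

# z^2 follows a Chi-squared distribution with 1 degree of freedom
chi2  <- z_corr^2
p_val <- pchisq(chi2, df = 1, lower.tail = FALSE)

c(z = z_raw, z_corrected = z_corr, X_squared = chi2, p_value = p_val)
```

The corrected value, 3.7 squared = 13.69, reproduces the X-squared statistic in the prop.test output above.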
9.9 Comparing two proportions (prop.test, binom.test)

The method for comparing two proportions can be developed directly from the theory for testing one proportion described above. Consider two samples with $n_1$ and $n_2$ subjects, and numbers of events $x_1$ and $x_2$. From these we can estimate two proportions, $p_1$ and $p_2$. Probability theory allows us to state that, under the hypothesis of no difference, the difference between the two samples $d = p_1 - p_2$ follows a normal distribution with mean 0 and variance:

$$V_d = \left(\frac{1}{n_1} + \frac{1}{n_2}\right) p(1 - p)$$

where:

$$p = \frac{x_1 + x_2}{n_1 + n_2}$$

It follows that $z = d / \sqrt{V_d}$ follows a normal distribution with mean 0 and variance 1. In other words, $z^2$ follows a Chi-squared distribution with 1 degree of freedom. We can therefore also use prop.test to compare two proportions.

Example 14. A study was carried out to compare the efficacy of a drug for preventing fractures. Patients were divided into two groups: group A, which received treatment, had 100 patients, and group B, which received no treatment, had 110 patients. After 12 months of follow-up, 7 patients in group A and 20 patients in group B had a fracture. The question is whether the fracture rates in the two groups are equal (that is, the drug has no effect). To test whether the two proportions really differ, we can use the prop.test function as follows:
> fracture <- c(7, 20)
> total <- c(100, 110)
> prop.test(fracture, total)
2-sample test for equality of proportions with continuity
correction
data: fracture out of total
X-squared = 4.8901, df = 1, p-value = 0.02701
alternative hypothesis: two.sided
95 percent confidence interval:
-0.20908963 -0.01454673
sample estimates:
prop 1
prop 2
0.0700000 0.1818182

The analysis shows that the fracture rate is 0.07 in group 1 and 0.18 in group 2. It also gives a 95% confidence interval for the difference between the two groups of 0.01 to 0.21 (that is, 1 to 21 percentage points). With p = 0.027, we can say that the fracture rate in group A is indeed lower than in group B.
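
The formulas for $V_d$ and the pooled proportion can be checked numerically. The following is a minimal sketch with illustrative variable names; it compares $z^2$ with the statistic from prop.test when the continuity correction is switched off (correct = FALSE), because the formula in the text has no correction.

```r
x <- c(7, 20)       # fractures in groups A and B
n <- c(100, 110)    # group sizes

p_hat <- x / n                           # observed proportions
d     <- p_hat[1] - p_hat[2]             # difference in proportions
p     <- sum(x) / sum(n)                 # pooled proportion
v_d   <- (1 / n[1] + 1 / n[2]) * p * (1 - p)
z     <- d / sqrt(v_d)

# Without the continuity correction, prop.test reports z^2
res <- prop.test(x, n, correct = FALSE)
c(z = z, z_squared = z^2, X_squared = unname(res$statistic))
```

The X-squared value of 4.8901 in the output above is somewhat smaller than this $z^2$ because prop.test applies the continuity correction by default when comparing two groups.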
9.10 Comparing several proportions (prop.test, chisq.test)

prop.test can also be used to test several proportions at once. In the study above we have 4 ethnic groups, with the following frequencies for each sex:
> table(sex, ethnicity)
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2

We want to know whether the proportion of women differs among the 4 ethnic groups. To answer this question we again use prop.test, as follows:
> female <- c( 4, 43, 22, 0)
> total <- c(8, 60, 30, 2)
> prop.test(female, total)
4-sample test for equality of proportions without continuity
correction
data: female out of total
X-squared = 6.2646, df = 3, p-value = 0.09942
alternative hypothesis: two.sided
sample estimates:
prop 1
prop 2
prop 3
prop 4
0.5000000 0.7166667 0.7333333 0.0000000
Warning message:
Chi-squared approximation may be incorrect in: prop.test(female, total)

Although the proportions of women appear to differ considerably between groups (73% in group 3 (Caucasian) compared with 50% in group 1 (African) and 71.7% in the Asian group), the Chi-squared test indicates that, statistically, these proportions do not differ, because p = 0.099.

9.10.1 Chi-squared test (chisq.test)

In fact, the Chi-squared test can also be computed with the chisq.test function as follows:
> chisq.test(sex, ethnicity)
Pearson's Chi-squared test
data: sex and ethnicity
X-squared = 6.2646, df = 3, p-value = 0.09942
Warning message:
Chi-squared approximation may be incorrect in: chisq.test(sex, ethnicity)

This result is identical to the result from prop.test.
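
To see what the Chi-squared statistic measures, and why R issues a warning here, the sketch below recomputes it from the counts in the two-way table. The matrix construction and variable names are illustrative assumptions.

```r
# Observed counts from the sex-by-ethnicity table in the text
counts <- matrix(c(4, 43, 22, 0,
                   4, 17,  8, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("Female", "Male"),
                                 ethnicity = c("African", "Asian",
                                               "Caucasian", "Others")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson Chi-squared statistic and its p-value
x2 <- sum((counts - expected)^2 / expected)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)
p  <- pchisq(x2, df = df, lower.tail = FALSE)

round(expected, 2)
c(X_squared = x2, df = df, p_value = p)
```

Several expected counts are well below 5 (for example 1.38 and 0.62 in the Others column), which is the situation in which the Chi-squared approximation becomes unreliable.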

9.10.2 Fisher's exact test (fisher.test)

In the Chi-squared test above, note the warning:
Warning message:
Chi-squared approximation may be incorrect in: prop.test(female, total)

This is because group 4 contains no women, so the proportion is 0%. Moreover, this group has only 2 subjects. Because the number of subjects is so small, the statistical estimates may not be reliable. Another method that can be applied to studies with low frequencies such as this one is Fisher's exact test. Readers can consult the theory behind Fisher's test to understand the logic of the method, but here we are concerned only with how to compute it in R. We simply type:
> fisher.test(sex, ethnicity)
Fisher's Exact Test for Count Data
data: sex and ethnicity
p-value = 0.1048
alternative hypothesis: two.sided

Note that the p-value from Fisher's test is 0.1048, very close to the p-value from the Chi-squared test. We therefore have further evidence that the proportion of women does not differ significantly among the ethnic groups.

10. Linear regression analysis

Example 15. To illustrate the problem, consider the following study, in which the investigators measured blood cholesterol in 18 male subjects. Body mass index (BMI) was also calculated for each subject as weight (in kg) divided by height squared (m2). The measurements are as follows:

Age, body mass index and cholesterol

ID     Age     BMI     Cholesterol
(id)   (age)   (bmi)   (chol)
1      46      25.4    3.5
2      20      20.6    1.9
3      52      26.2    4.0
4      30      22.6    2.6
5      57      25.4    4.5
6      25      23.1    3.0
7      28      22.7    2.9
8      36      24.9    3.8
9      22      19.8    2.1
10     43      25.3    3.8
11     57      23.2    4.1
12     33      21.8    3.0
13     22      20.9    2.5
14     63      26.7    4.6
15     40      26.4    3.2
16     48      21.2    4.2
17     28      21.2    2.3
18     49      22.8    4.0

At a glance, the data suggest that the older the subject, the higher the cholesterol. Let us enter these data into R and draw a scatter plot as follows:
> age <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
> bmi <- c(25.4,20.6,26.2,22.6,25.4,23.1,22.7,24.9,19.8,25.3,23.2,
    21.8,20.9,26.7,26.4,21.2,21.2,22.8)
> chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
    2.5,4.6,3.2,4.2,2.3,4.0)

> data <- data.frame(age, bmi, chol)
> plot(chol ~ age, pch=16)

[Figure: scatter plot of chol (vertical axis) against age (horizontal axis).]

Figure 18. Relationship between age and cholesterol.


Figure 18 above suggests that the relationship between age and cholesterol is a straight line (linear). To measure this relationship we can use the coefficient of correlation.

10.1 Correlation coefficient

The correlation coefficient (r) is a statistic that measures the association between two variables, such as age (x) and cholesterol (y). The correlation coefficient takes values from -1 to 1. A coefficient of 0 (or close to 0) means that the two variables have no linear relationship with each other; conversely, a coefficient of -1 or 1 means that the two variables have a perfect relationship. If the coefficient is negative (r < 0), then as x increases y decreases (and vice versa, as x decreases y increases); if the coefficient is positive (r > 0), then as x increases y also increases, and as x decreases y also decreases.

There are in fact many correlation coefficients in statistics, but here I will present the 3 most commonly used: Pearson's r, Spearman's ρ, and Kendall's τ.
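10.1.1 Pearson correlation coefficient

For two variables x and y measured on n subjects, the Pearson correlation coefficient is estimated by the following formula:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Here, as defined earlier, $\bar{x}$ and $\bar{y}$ are the means of x and y. To estimate the correlation between age and cholesterol we can use the function cor(x, y) as follows:
> cor(age, chol)
[1] 0.936726

We can test the hypothesis that the correlation coefficient equals 0 (that is, x and y have no linear relationship). R provides the cor.test function for this: it reports a t statistic with n - 2 degrees of freedom for the hypothesis and uses Fisher's z transformation to compute the confidence interval.
> cor.test(age, chol)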
Pearson's product-moment correlation
data: age and chol
t = 10.7035, df = 16, p-value = 1.058e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8350463 0.9765306
sample estimates:
cor
0.936726
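
The numbers in this output can be reproduced from the formula for r. The following minimal sketch (variable names are illustrative) computes r by hand, converts it to the t statistic, and builds the confidence interval with Fisher's z transformation.

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)
n <- length(age)

# Pearson correlation from its definition
r <- sum((age - mean(age)) * (chol - mean(chol))) /
  sqrt(sum((age - mean(age))^2) * sum((chol - mean(chol))^2))

# t statistic with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via Fisher's z transformation
z  <- atanh(r)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

c(r = r, t = t_stat)
ci
```

The t value is the same as the t value for the slope in a simple linear regression of chol on age, because both tests address the same hypothesis.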

10.1.2 Spearman correlation coefficient

The Pearson correlation coefficient is valid only if x and y follow a normal distribution. If x and y do not follow a normal distribution, we should use another correlation coefficient named after Spearman, which is a non-parametric method. This coefficient is estimated by converting x and y into ranks and then examining the correlation between the two sets of ranks. For this reason it is known in English as Spearman's rank correlation. R estimates the Spearman correlation with the cor.test function and the argument method="spearman" as follows:
> cor.test(age, chol, method="spearman")
Spearman's rank correlation rho
data: age and chol
S = 51.1584, p-value = 2.57e-09
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.947205
Warning message:
Cannot compute exact p-values with ties in: cor.test.default(age,
chol, method = "spearman")
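
The description above (convert to ranks, then correlate) can be verified in a few lines. This is a minimal sketch with illustrative variable names; tied values receive average ranks, which is the default behaviour of rank.

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

# Spearman's rho is the Pearson correlation of the ranks
rho_from_ranks <- cor(rank(age), rank(chol))
rho_builtin    <- cor(age, chol, method = "spearman")

c(from_ranks = rho_from_ranks, builtin = rho_builtin)
```

The ties in age and chol are also the reason for the warning in the output above: with ties, an exact p-value cannot be computed and R falls back on an approximation.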

10.1.3 Kendall correlation coefficient

The Kendall correlation coefficient (also a non-parametric method) is estimated by looking for pairs of observations (x, y) that are "concordant". A pair is concordant if the difference on the horizontal axis has the same sign (positive or negative) as the difference on the vertical axis. If x and y are not related to each other, the number of concordant pairs equals, or is close to, the number of discordant pairs.

Because many pairs must be examined, computing the Kendall coefficient is fairly demanding in computer time. However, for a data set of fewer than 5000 subjects a personal computer can do the calculation quite easily. R uses the cor.test function with the argument method="kendall" to estimate the Kendall correlation:
> cor.test(age, chol, method="kendall")
Kendall's rank correlation tau
data: age and chol
z = 4.755, p-value = 1.984e-06
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.8333333
Warning message:
Cannot compute exact p-value with ties in: cor.test.default(age, chol, method = "kendall")

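10.2 The simple linear regression model

To make the model easier to follow, let the age of individual i be $x_i$ and the cholesterol level be $y_i$, where i = 1, 2, 3, ..., 18. The linear regression model states that:

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

In other words, the equation assumes that the cholesterol of an individual equals a constant $\alpha$, plus a coefficient $\beta$ multiplied by age, plus an error $\varepsilon_i$. In this equation $\alpha$ is the intercept (the value when $x_i = 0$) and $\beta$ is the slope (or gradient). In practice $\alpha$ and $\beta$ are two parameters (also called regression coefficients), and $\varepsilon_i$ is a random variable that follows a normal distribution with mean 0 and variance $\sigma^2$.

The parameters $\alpha$, $\beta$ and $\sigma^2$ must be estimated from the data. The method used to estimate them is the least squares method. As the name suggests, the least squares method finds the values of $\alpha$ and $\beta$ for which

$$\sum_{i=1}^{n} \left[ y_i - (\alpha + \beta x_i) \right]^2$$

is smallest. After some algebra, it can easily be shown that the estimates of $\alpha$ and $\beta$ that satisfy this condition are:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{and} \quad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$

Here $\bar{x}$ and $\bar{y}$ are the means of x and y. Note that I write $\hat{\alpha}$ and $\hat{\beta}$ (with a hat) as a reminder that these are estimates of $\alpha$ and $\beta$, not $\alpha$ and $\beta$ themselves (we never know $\alpha$ and $\beta$ exactly; we can only estimate them).

Once we have the estimates $\hat{\alpha}$ and $\hat{\beta}$, we can estimate the mean cholesterol for each age as follows:

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$

Of course, $\hat{y}_i$ is only the mean for age $x_i$, and the remainder (that is, $y_i - \hat{y}_i$) is called the residual. The variance of the residuals can be estimated as follows:

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}$$

Here $s^2$ is the estimate of $\sigma^2$.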

The lm function (short for linear model) in R can compute the values of $\hat{\alpha}$ and $\hat{\beta}$, as well as $s^2$, quickly. We continue the example in R as follows:
> lm(chol ~ age)

Call:
lm(formula = chol ~ age)

Coefficients:
(Intercept)          age
    1.08922      0.05779

In the command above, chol ~ age means "describe chol as a function of age". The result from lm shows that $\hat{\alpha}$ = 1.0892 and $\hat{\beta}$ = 0.05779. In other words, with these two parameters we can estimate cholesterol for any age within the age range of the sample using the linear equation:

$$\hat{y}_i = 1.08922 + 0.05779 \times \text{age}$$

This equation means that for each additional year of age, cholesterol increases by about 0.058 mmol/L.
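
As a check on the least squares formulas, the following minimal sketch (variable names are illustrative) computes the slope, the intercept and the residual standard deviation by hand and compares the coefficients with those from lm.

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)
n <- length(age)

# Least squares estimates from the closed-form formulas
beta_hat  <- sum((age - mean(age)) * (chol - mean(chol))) /
  sum((age - mean(age))^2)
alpha_hat <- mean(chol) - beta_hat * mean(age)

# Fitted values, residuals and residual variance (n - 2 degrees of freedom)
fitted_y <- alpha_hat + beta_hat * age
s2 <- sum((chol - fitted_y)^2) / (n - 2)

c(alpha = alpha_hat, beta = beta_hat, s = sqrt(s2))
coef(lm(chol ~ age))   # should agree with alpha_hat and beta_hat
```

The square root of s2 is what R labels the residual standard error.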
In fact, the lm function provides much more information, but we must store it in an object first. If we call the object reg, the commands are:
> reg <- lm(chol ~ age)
> summary(reg)

Call:
lm(formula = chol ~ age)

Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08

Lnh th hai, summary(reg), yu cu R lit k cc thng tin tnh ton trong reg. Phn
kt qu chia lm 3 phn:
76

Phn tch s liu v biu bng R

Nguyn Vn Tun

(a) Part 1 describes the residuals of the regression model:

Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

We know that the mean of the residuals must be 0, and here the median is -0.04, which is not far from 0. The 25% (1Q) and 75% (3Q) quantiles are also fairly symmetric around the median, indicating that the residuals of this equation are reasonably balanced.

(b) Part 2 presents the estimates of $\alpha$ and $\beta$ together with their standard errors and the values of the t test. The t value for $\hat{\beta}$ is 10.704 with p = 1.06e-08, indicating that $\beta$ is not equal to 0. In other words, we have evidence of an association between cholesterol and age, and this association is statistically significant.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(c) Part 3 gives information on the variability of the residuals. R reports the residual standard error, s = 0.3027, so the residual variance (residual mean square) is $s^2 = 0.3027^2 \approx 0.0916$. The output also contains an F test, which simply tests whether $\beta$ is truly 0, and so has the same meaning as the t test in the previous part. In general, for simple linear regression (with one factor) we do not need to pay attention to the F test.

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775,     Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF,  p-value: 1.058e-08

In addition, part 3 gives an important piece of information: the value of $R^2$, the coefficient of determination. It equals the sum of squared differences between the fitted values and the mean, divided by the sum of squared differences between the observed values and the mean. The $R^2$ in this example is 0.8775, which means that the linear equation (with age as the only factor) explains about 88% of the differences in cholesterol between individuals. Naturally, $R^2$ ranges from 0 to 100% (or 1). The higher the $R^2$, the stronger the relationship between age and cholesterol.

Another coefficient worth mentioning here is the adjusted coefficient of determination (which R calls Adjusted R-squared in the output above). This coefficient tells us how much the residual variance improves when age is included in the linear model. In general, it differs little from the coefficient of determination, and we need not pay too much attention to it.
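The two coefficients can be reproduced from sums of squares. The sketch below again uses hypothetical values (x and y are made up for illustration, not the chapter's cholesterol data); the names sst, ssr and sse are my own labels for the total, regression and residual sums of squares.

```r
# Hypothetical illustrative data (NOT the cholesterol data used in the text)
x <- c(20, 25, 30, 35, 40, 45, 50, 55, 60, 65)
y <- c(2.1, 2.6, 2.7, 3.2, 3.3, 3.9, 3.8, 4.3, 4.6, 4.8)
fit <- lm(y ~ x)

sst <- sum((y - mean(y))^2)            # observed values vs. mean
ssr <- sum((fitted(fit) - mean(y))^2)  # fitted values vs. mean
sse <- sum(resid(fit)^2)               # residual sum of squares

# Coefficient of determination
r2 <- ssr / sst

# Adjusted version: compares residual variance with total variance
n <- length(y)
k <- 1                                 # number of predictors
adj_r2 <- 1 - (sse / (n - k - 1)) / (sst / (n - 1))

print(c(r2, summary(fit)$r.squared))
print(c(adj_r2, summary(fit)$adj.r.squared))
```

Written this way, the adjusted coefficient is visibly the proportional reduction in variance (total variance versus residual variance), which is the interpretation given in the text.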
Assumptions of linear regression analysis


All of the analyses above rest on several important assumptions:

(a) x is a fixed variable ("fixed" here means there is no random error in its measurement);
(b) $\varepsilon_i$ follows a normal distribution;
(c) $\varepsilon_i$ has mean 0;
(d) $\varepsilon_i$ has a constant variance $\sigma^2$ for all $x_i$; and
(e) successive values of $\varepsilon_i$ are not correlated with one another (in other words, $\varepsilon_1$ and $\varepsilon_2$ are unrelated).

If these assumptions are not met, the validity of the equation we have estimated is in question. Therefore, before presenting and interpreting the model, we need to check whether the assumptions hold. In this case, assumption (a) is not a problem, because age is not a random variable and there is no error in determining a person's age.

For assumptions (b) to (e), the simplest and most effective check is to examine the relationship between $\hat{y}_i$, $x_i$ and the residuals $e_i$ ($e_i = y_i - \hat{y}_i$) using scatter plots.

With the fitted() function we can compute $\hat{y}_i$ for each individual as follows (for example, for individual 1, aged 46, the predicted cholesterol is 1.08922 + 0.05779 x 46 = 3.747).
> fitted(reg)
       1        2        3        4        5        6        7        8
3.747483 2.244985 4.094214 2.822869 4.383156 2.533927 2.707292 3.169600
       9       10       11       12       13       14       15       16
2.360562 3.574118 4.383156 2.996234 2.360562 4.729886 3.400753 3.863060
      17       18
2.707292 3.920849

With the resid() function we can compute the residual $e_i$ for each individual as follows (for subject 1, $e_1$ = 3.5 - 3.74748 = -0.24748):

> resid(reg)
           1            2            3            4            5            6
-0.247483426 -0.344985415 -0.094213736 -0.222869265  0.116844338  0.466072660
           7            8            9           10           11           12
 0.192707505  0.630400424 -0.260562185  0.225881729 -0.283155662  0.003765579
          13           14           15           16           17           18
 0.139437815 -0.129885972 -0.200753116  0.336939804 -0.407292495  0.079151419

To check the assumptions above, we can draw a set of four plots, which I explain below:

> op <- par(mfrow=c(2,2))   # ask R to set aside 4 plotting panels
> plot(reg)                 # draw the diagnostic plots for reg

[Figure: four diagnostic panels produced by plot(reg): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours); observation 17 is labelled in the panels.]

Figure 19. Residual analysis to check the assumptions of linear regression.

(a) The left-hand plot in row 1 shows the residuals $e_i$ against the predicted cholesterol values $\hat{y}_i$. The residuals cluster around the line y = 0, so assumption (c), that $\varepsilon_i$ has mean 0, is acceptable.
(b) The right-hand plot in row 1 shows the residuals against their expected values under the normal distribution. The residuals lie very close to the normal line, so assumption (b), that $\varepsilon_i$ is normally distributed, is also reasonable.
(c) The left-hand plot in row 2 shows the square root of the standardized residuals against $\hat{y}_i$. The plot shows no systematic difference in the standardized residuals across values of $\hat{y}_i$, so assumption (d), that $\varepsilon_i$ has a constant variance $\sigma^2$ for all $x_i$, is also reasonable.

Overall, the residual analysis allows us to conclude that the linear regression model describes the relationship between age and cholesterol fairly completely and reasonably.
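The graphical checks can be complemented by a few numerical summaries. The sketch below is only an illustration on hypothetical values (x and y are made up, not the chapter's data). It uses the Shapiro-Wilk test for assumption (b) and a hand-computed Durbin-Watson-type statistic for assumption (e); neither is used in the text above, so treat both as optional extras under my own choice of checks.

```r
# Hypothetical illustrative data (NOT the cholesterol data used in the text)
x <- c(20, 25, 30, 35, 40, 45, 50, 55, 60, 65)
y <- c(2.1, 2.6, 2.7, 3.2, 3.3, 3.9, 3.8, 4.3, 4.6, 4.8)
fit <- lm(y ~ x)
e <- resid(fit)

# (c) With an intercept in the model, the residual mean is zero by construction
mean_e <- mean(e)

# (b) Shapiro-Wilk test of normality of the residuals
sw <- shapiro.test(e)

# (e) Durbin-Watson-type statistic: values near 2 suggest little correlation
#     between successive residuals (computed by hand, no extra package)
dw <- sum(diff(e)^2) / sum(e^2)

print(mean_e)
print(sw$p.value)
print(dw)
```

Because the mean of the residuals is always zero for a model with an intercept, the residual plot is most informative about patterns (curvature, funnel shapes) rather than about the mean itself.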
Prediction model

Once the model for predicting cholesterol has been checked and its validity established, we can draw the line representing the relationship between age and cholesterol with the abline command as follows (recall that the analysis object is reg):

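> plot(chol ~ age, pch=16)
> abline(reg)

[Figure: scatter plot of chol (vertical axis, 2.0 to 4.5) against age (horizontal axis, 20 to 60) with the fitted regression line.]

Figure 20. Line representing the relationship between age and cholesterol.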

However, each value $\hat{y}_i$ is computed from the estimates $\hat{\alpha}$ and $\hat{\beta}$, and these estimates have standard errors, so the predicted value $\hat{y}_i$ also has an error. In other words, $\hat{y}_i$ is only a mean; in practice it may be higher or lower depending on the sample drawn. The 95% confidence interval can be estimated in R with the following commands:
> reg <- lm(chol ~ age)
> new <- data.frame(age = seq(15, 70, 5))
> pred.w.plim <- predict.lm(reg, new, interval="prediction")
> pred.w.clim <- predict.lm(reg, new, interval="confidence")
> resc <- cbind(pred.w.clim, new)
> resp <- cbind(pred.w.plim, new)
> plot(chol ~ age, pch=16)
> lines(resc$fit ~ resc$age)
> lines(resc$lwr ~ resc$age, col=2)
> lines(resc$upr ~ resc$age, col=2)
> lines(resp$lwr ~ resp$age, col=4)
> lines(resp$upr ~ resp$age, col=4)

[Figure: scatter plot of chol (2.0 to 4.5) against age (20 to 60) with the fitted line and two pairs of interval bands.]

Figure 21. Predicted values and 95% confidence intervals.

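The figure above shows the mean predicted value $\hat{y}_i$ (black straight line), and the 95% confidence interval of this mean is shown by the red lines. In addition, the blue lines show the interval for the predicted cholesterol of a new individual of a given age in the population (the prediction interval).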


10.3 Multiple linear regression model

The model expressed by the equation $y_i = \alpha + \beta x_i + \varepsilon_i$ has a single factor (namely x), and is therefore usually called the simple linear regression model. In practice, we can extend this model to several variables rather than limiting it to one, for example:

$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$$

Note that in this equation we have several x variables ($x_1$, $x_2$, up to $x_k$), and each has a parameter $\beta_j$ (j = 1, 2, ..., k) that must be estimated. This model is therefore called the multiple linear regression model.

Example 16. We return to the study of the relationship between age, bmi and cholesterol. In the earlier example we considered only the relationship between age and cholesterol, and did not yet consider the joint relationship of age and bmi with cholesterol. The following plot shows the relationships among these three variables:

> pairs(data)

[Figure: scatter-plot matrix with panels for age (20 to 60), bmi (20 to 26) and chol (2.0 to 4.5).]

Figure 22. Scatter-plot matrix of age, bmi and chol produced by pairs(data).

As with age and cholesterol, the relationship between bmi and cholesterol also roughly follows a straight line. The plot above also shows that age and bmi are related to each other. Indeed, a simple linear regression of cholesterol on bmi suggests that this relationship is statistically significant:
> summary(lm(chol ~ bmi))

Call:
lm(formula = chol ~ bmi)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9403 -0.3565 -0.1376  0.3040  1.4330

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.83187    1.60841  -1.761  0.09739 .
bmi          0.26410    0.06861   3.849  0.00142 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.623 on 16 degrees of freedom
Multiple R-Squared: 0.4808,     Adjusted R-squared: 0.4483
F-statistic: 14.82 on 1 and 16 DF,  p-value: 0.001418

BMI explains about 48% of the variation in cholesterol between individuals. But because BMI is also related to age, we want to know which factor is more important when the two are analysed together. To assess the effect of both age ($x_1$) and bmi ($x_2$) on cholesterol (y) we use a multiple linear regression model:

$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$$

The equation can also be written in matrix notation as $Y = X\beta + \varepsilon$. Here Y is an 18 x 1 vector; X is an 18 x 3 matrix (a column of ones for the intercept plus the columns for age and bmi); $\beta$ is the 3 x 1 vector $(\alpha, \beta_1, \beta_2)$; and $\varepsilon$ is an 18 x 1 vector. To estimate the two regression coefficients $\beta_1$ and $\beta_2$ we again use the lm() function in R:
> mreg <- lm(chol ~ age + bmi)
> summary(mreg)

Call:
lm(formula = chol ~ age + bmi)

Residuals:
    Min      1Q  Median      3Q     Max
-0.3762 -0.2259 -0.0534  0.1698  0.5679

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.455458   0.918230   0.496    0.627
age         0.054052   0.007591   7.120 3.50e-06 ***
bmi         0.033364   0.046866   0.712    0.487
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3074 on 15 degrees of freedom
Multiple R-Squared: 0.8815,     Adjusted R-squared: 0.8657
F-statistic: 55.77 on 2 and 15 DF,  p-value: 1.132e-07

The analysis above gives the estimates $\hat{\alpha}$ = 0.455, $\hat{\beta}_1$ = 0.054 and $\hat{\beta}_2$ = 0.0334. In other words, the equation for predicting cholesterol from age and bmi is:

Cholesterol = 0.455 + 0.054(age) + 0.0334(bmi)

The equation says that when age increases by 1 year, cholesterol increases by 0.054 mmol/L (an estimate not very different from the 0.0578 obtained in the equation with age alone), and that for each 1 kg/m2 increase in BMI, cholesterol increases by 0.0334 mmol/L. Together the two factors explain about 88.2% ($R^2$ = 0.8815) of the variation in cholesterol between individuals.

Recall that the equation with age alone (in the previous analysis) explained about 87.7% of the variation in cholesterol between individuals. When we add BMI, this rises to 88.2%, a gain of only about 0.5 percentage points. The question is whether this gain is statistically significant. The answer can be read from the test for bmi, with p = 0.487. BMI therefore does not add information for predicting cholesterol beyond what we already have from age. In other words, once age has been taken into account, the effect of bmi is no longer statistically significant. This is understandable, because Figure 22 shows that age and bmi are quite strongly related. Since the two variables are correlated, we do not need both in the equation. (This example, however, is only meant to illustrate how to carry out multiple linear regression in R; it is not intended to model the data from a biological standpoint.)
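The matrix form of the model can be made concrete with a short sketch. The values below are hypothetical (x1, x2 and y are made up for illustration and are not the chapter's age, bmi and chol data). The sketch builds the design matrix with a leading column of ones, solves the normal equations $X'X\beta = X'y$, and compares the solution with lm.

```r
# Hypothetical illustrative data (NOT the age/bmi/chol data used in the text)
x1 <- c(20, 25, 30, 35, 40, 45, 50, 55, 60, 65)
x2 <- c(21.5, 23.0, 20.8, 24.1, 22.7, 25.3, 23.9, 26.0, 24.4, 25.8)
y  <- c(2.1, 2.6, 2.7, 3.2, 3.3, 3.9, 3.8, 4.3, 4.6, 4.8)

# Design matrix: a column of ones for the intercept, then the two predictors
X <- cbind(1, x1, x2)

# Least-squares estimates from the normal equations X'X b = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# The same model fitted with lm()
fit <- lm(y ~ x1 + x2)

print(as.vector(beta_hat))
print(unname(coef(fit)))
```

lm itself uses a QR decomposition rather than solving the normal equations directly, which is numerically more stable; the direct solution is shown here only because it mirrors the matrix notation.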

[Figure: four diagnostic panels produced by plot() for the multiple regression model: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours); observations 6, 8, 15 and 16 are labelled.]

Figure 23. Residual analysis to check the assumptions of multiple linear regression.

Although BMI is not statistically significant in this case, Figure 23 shows that the assumptions of the linear regression model are reasonably met.

11. Analysis of variance

11.1 One-way analysis of variance (ANOVA)

Example 17. The table below compares galactose levels in 3 groups of patients: group 1 consists of 9 patients with Crohn's disease; group 2 consists of 11 patients with colitis; and group 3 consists of 20 subjects without disease (the control group). The question is whether galactose levels differ among the 3 groups.

Galactose levels in three groups: Crohn's disease, colitis and controls

Group 1:          Group 2:     Group 3:
Crohn's disease   colitis      control
1343              1264         1809   2850
1393              1314         1926   2964
1420              1399         2283   2973
1641              1605         2384   3171
1897              2385         2447   3257
2160              2511         2479   3271
2169              2514         2495   3288
2279              2767         2525   3358
2890              2827         2541   3643
                  2895         2769   3657
                  3011
n = 9             n = 11       n = 20
Mean: 1910        Mean: 2226   Mean: 2804
SD: 516           SD: 727      SD: 527

Note: SD is the standard deviation.

Let the means of the three groups be $\mu_1$, $\mu_2$ and $\mu_3$. In the language of hypothesis testing, the null hypothesis and the alternative hypothesis are:

Ho: $\mu_1 = \mu_2 = \mu_3$
HA: at least one of the three $\mu_j$ (j = 1, 2, 3) differs from the others

At first the reader, having learned how to compare two groups with the t test, may think that we need three t tests: between groups 1 and 2, groups 2 and 3, and groups 1 and 3. But this approach is not appropriate, because it involves three different variances. The appropriate method for this comparison is analysis of variance, which can compare several groups at the same time (simultaneous comparisons).

To illustrate the method we need some notation. Let $x_{ij}$ be the galactose level of patient i in group j (j = 1, 2, 3). The analysis of variance model states that:

$$x_{ij} = \mu + \alpha_j + \varepsilon_{ij}$$

Or, more specifically:

$$x_{i1} = \mu + \alpha_1 + \varepsilon_{i1}$$
$$x_{i2} = \mu + \alpha_2 + \varepsilon_{i2}$$
$$x_{i3} = \mu + \alpha_3 + \varepsilon_{i3}$$

First, we need to enter the data into R. The first step is to tell R that we have three groups of patients (1, 2 and 3); group 1 has 9 people, group 2 has 11 people, and group 3 has 20 people:
> group <- c(1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3)


For analysis of variance, we must define the variable group as a factor:

> group <- as.factor(group)

Next, we enter the galactose data for each group as defined above (calling the object galactose):

> galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
     1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
     1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,2479,3271,2495,3288,
     2525,3358,2541,3643,2769,3657)

Put the two variables group and galactose into a data frame called data:

> data <- data.frame(group, galactose)
> attach(data)

With the data ready, we use the lm() function to carry out the analysis of variance:

> analysis <- lm(galactose ~ group)

In the call above we tell R that galactose is a function of group, and we store the result in the object analysis.

Results of the analysis of variance. We now use the anova command to see the results:

> anova(analysis)
Analysis of Variance Table

Response: galactose
          Df   Sum Sq Mean Sq F value    Pr(>F)
group      2  5683620 2841810  8.6655 0.0008191 ***
Residuals 37 12133923  327944
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The columns of the output above are: Df (degrees of freedom); Sum Sq, the sum of squares; Mean Sq, the mean square; F value, the value of the F statistic; and Pr(>F), the P value associated with the F test.
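To see where each column of the table comes from, the sums of squares can be computed by hand from the galactose data given above. This is a minimal sketch; the object names (ss_between, ss_within, f_value and so on) are my own and do not appear in the text.

```r
# Galactose data from Example 17, entered group by group
group <- factor(rep(1:3, times = c(9, 11, 20)))
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
               2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)

grand_mean  <- mean(galactose)
group_means <- tapply(galactose, group, mean)
group_sizes <- tapply(galactose, group, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((galactose - ave(galactose, group))^2)

# Degrees of freedom: k - 1 and N - k
df_between <- nlevels(group) - 1
df_within  <- length(galactose) - nlevels(group)

# Mean squares, F statistic and its P value
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(ss_between, ss_within))
print(c(f_value, p_value))
```

The F statistic is simply the ratio of the between-group mean square to the within-group (residual) mean square, which is why a large F indicates that the group means differ by more than the within-group scatter would explain.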
11.2 Multiple comparisons and adjusting the p value

With k groups there are k(k-1)/2 possible pairwise comparisons. The example above has 3 groups, so there are 3 possible comparisons (between groups 1 and 2, groups 1 and 3, and groups 2 and 3). When k = 10 the number of comparisons is already 45. As mentioned in chapter 7, when there are many comparisons the p values computed from the statistical tests no longer have their original meaning, because the tests may produce false positive results (that is, results with p < 0.05 when in reality there is no difference or effect). Therefore, when there are many comparisons, we need to adjust the p values appropriately.
There are quite a few methods for adjusting p values, and the four most commonly used are Bonferroni, Scheffé, Holm and Tukey (named after four renowned statisticians). Which method is most appropriate? There is no definitive answer, but the following two points may help the reader decide:

(a) If k < 10, we can apply any of the methods to adjust the p values. Personally, I find the Tukey method very useful for comparisons.

(b) If k > 10, the Bonferroni method can become very conservative. "Conservative" here means that the method will rarely declare a comparison statistically significant, even when the difference is real. In this case the Tukey, Holm and Scheffé methods can be applied.

Returning to the example above, the p values obtained so far have not been adjusted for multiple comparisons. In the chapter on p values I said that such values exaggerate statistical significance and no longer reflect the nominal level (that is, 0.05). To adjust for multiple comparisons we can use, for example, the Bonferroni adjustment.

We can use the pairwise.t.test command to obtain all the p values for the comparisons among the three groups:
> pairwise.t.test(galactose, group, p.adj="bonferroni")

        Pairwise comparisons using t tests with pooled SD

data:  galactose and group

  1      2
2 0.6805 -
3 0.0012 0.0321

P value adjustment method: bonferroni

The output shows that the p value between group 1 (Crohn) and colitis is 0.6805 (not statistically significant); between Crohn and control it is 0.0012 (statistically significant); and between colitis and control it is 0.0321 (also statistically significant).

Another method of adjusting p values is the Holm method, which is the default of pairwise.t.test:
> pairwise.t.test(galactose, group)

        Pairwise comparisons using t tests with pooled SD

data:  galactose and group

  1      2
2 0.2268 -
3 0.0012 0.0214

P value adjustment method: holm

The conclusions from this result do not differ from those of the Bonferroni method.
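The relationship between the two adjustments can be seen by starting from the unadjusted p values and applying p.adjust, the function that pairwise.t.test calls internally. The sketch below uses the galactose data of Example 17; the object names p_raw, p_bonf and p_holm are my own.

```r
# Galactose data from Example 17
group <- factor(rep(1:3, times = c(9, 11, 20)))
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
               2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)

# Unadjusted pairwise p values (pooled SD); order: 1 vs 2, 1 vs 3, 2 vs 3
raw   <- pairwise.t.test(galactose, group, p.adjust.method = "none")$p.value
p_raw <- raw[!is.na(raw)]

# Bonferroni: multiply every p value by the number of comparisons (capped at 1)
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Holm: multiply the smallest p by 3, the next by 2, the largest by 1,
# then enforce monotonicity
p_holm <- p.adjust(p_raw, method = "holm")

print(round(rbind(raw = p_raw, bonferroni = p_bonf, holm = p_holm), 4))
```

Holm-adjusted values are never larger than the Bonferroni-adjusted ones, which is why the Holm method is usually preferred when the choice is between these two.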


All of the comparisons above use a common (pooled) standard deviation for the three groups. If we want to use the standard deviation of each group separately, the following command (with pool.sd=FALSE) does so:
> pairwise.t.test(galactose, group, pool.sd=FALSE)

        Pairwise comparisons using t tests with non-pooled SD

data:  galactose and group

  1      2
2 0.2557 -
3 0.0017 0.0544

P value adjustment method: holm

Once again the overall picture is similar, although the comparison between colitis and control (p = 0.0544) now falls just above the 0.05 threshold.

The methods above give only the p values for the comparisons between groups; they do not tell us the size of the differences or their 95% confidence intervals. To obtain these estimates we need another function, aov (short for analysis of variance), together with the TukeyHSD function (HSD stands for Honest Significant Difference):
> res <- aov(galactose ~ group)
> TukeyHSD(res)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = galactose ~ group)

$group
        diff        lwr      upr     p adj
2-1 316.3232 -312.09857  944.745 0.4439821
3-1 894.2778  333.07916 1455.476 0.0011445
3-2 577.9545   53.11886 1102.790 0.0281768

The output shows that groups 3 and 1 differ by about 894 units, with a 95% confidence interval from 333 to 1455 units. Similarly, galactose in the colitis group is lower than in the control group (group 3) by about 578 units, with a 95% confidence interval from 53 to 1103.

[Figure: Tukey plot titled "95% family-wise confidence level" showing intervals for the comparisons 2-1, 3-1 and 3-2; horizontal axis "Differences in mean levels of group".]

Figure 24. Mean differences and 95% confidence intervals between groups 1 and 2, 1 and 3, and 3 and 2. The horizontal axis is the difference in galactose; the vertical axis lists the three comparisons.

11.3 Analysis using a non-parametric method

The non-parametric counterpart of analysis of variance for comparing several groups is the Kruskal-Wallis test. Like the Wilcoxon method for comparing two groups non-parametrically, the Kruskal-Wallis method converts the data into ranks and analyses the differences in ranks between the groups. The kruskal.test function in R performs this test:
> kruskal.test(galactose ~ group)
Kruskal-Wallis rank sum test
data: galactose by group
Kruskal-Wallis chi-squared = 12.1381, df = 2, p-value = 0.002313

The p value from this test is quite low (p = 0.002313), indicating a difference among the three groups, in agreement with the analysis of variance carried out with lm above. However, one drawback of the non-parametric Kruskal-Wallis test is that it does not tell us which two groups differ; it gives only an overall p value. In many cases, non-parametric analyses such as the Kruskal-Wallis test are also less efficient than parametric statistical methods.
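One possible way to follow up a significant Kruskal-Wallis test, not used in the text above, is to run pairwise Wilcoxon rank-sum tests with an adjustment for multiple comparisons. The sketch below applies the base R function pairwise.wilcox.test with the Holm adjustment to the galactose data of Example 17; the choice of this follow-up and of the Holm method is my own assumption about what a reader might want, not the author's recommendation.

```r
# Galactose data from Example 17
group <- factor(rep(1:3, times = c(9, 11, 20)))
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
               2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)

# Overall rank-based test across the three groups
kw <- kruskal.test(galactose ~ group)

# Pairwise Wilcoxon rank-sum tests, Holm-adjusted
pw <- pairwise.wilcox.test(galactose, group, p.adjust.method = "holm")

print(kw$p.value)
print(pw$p.value)   # lower-triangular matrix of adjusted p values
```

The matrix of adjusted p values is laid out like the output of pairwise.t.test, with rows for groups 2 and 3 and columns for groups 1 and 2.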


11.4 Two-way analysis of variance (ANOVA)

Simple, or one-way, analysis of variance has only one factor. Two-way analysis of variance (two-way ANOVA), as the name suggests, has two factors. The two-way method is a straightforward extension of the one-way method: instead of estimating the variance due to one factor, it estimates the variance due to two factors.

Example 18. In the following example, to evaluate the effectiveness of a new paint technique, researchers applied the paint to 3 types of material (1, 2 and 3) under two conditions (1, 2). For each combination of condition and material the experiment was repeated 3 times. Durability was measured by a durability index (called score here). In total there are 18 measurements, as follows:

Durability of the paint for 2 conditions and 3 materials

                      Material (j)
Condition (i)   1               2               3
1               4.1, 3.9, 4.3   3.1, 2.8, 3.3   3.5, 3.2, 3.6
2               2.7, 3.1, 2.6   1.9, 2.2, 2.3   2.7, 2.3, 2.5

Let $x_{ij}$ be the score for condition i (i = 1, 2) and material j (j = 1, 2, 3). (To keep things simple, we temporarily omit the subscript k for the replicate.) The two-way analysis of variance model states that:

$$x_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$$

Here $\mu$ is the overall population mean, and the coefficients $\alpha_i$ (the effect of condition i) and $\beta_j$ (the effect of material j) must be estimated from the data. $\varepsilon_{ij}$ is assumed to follow a normal distribution with mean 0 and variance $\sigma^2$.

To analyse the data in R, we need to organise them into 4 variables as follows:
Condition   Material   Subject   Score
1           1           1        4.1
1           1           2        3.9
1           1           3        4.3
1           2           4        3.1
1           2           5        2.8
1           2           6        3.3
1           3           7        3.5
1           3           8        3.2
1           3           9        3.6
2           1          10        2.7
2           1          11        3.1
2           1          12        2.6
2           2          13        1.9
2           2          14        2.2
2           2          15        2.3
2           3          16        2.7
2           3          17        2.3
2           3          18        2.5

We can generate the sequences of factor levels with the gl function (generate levels):

> condition <- gl(2, 9, 18)
> material <- gl(3, 3, 18)

And create 18 identification numbers (from 1 to 18):

> id <- 1:18

Finally, the data for score:

> score <- c(4.1,3.9,4.3, 3.1,2.8,3.3, 3.5,3.2,3.6,
             2.7,3.1,2.6, 1.9,2.2,2.3, 2.7,2.3,2.5)

Put everything into a data frame called data:

> data <- data.frame(condition, material, id, score)
> attach(data)
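It is worth checking that the two gl calls really reproduce the layout of the table. In gl(n, k, length), n is the number of levels, k the number of consecutive repetitions of each level, and length the total length (the pattern is recycled if needed). The short sketch below prints the generated codes and cross-tabulates them; the rep-based equivalents are shown only for comparison.

```r
# Factor columns as generated in the text
condition <- gl(2, 9, 18)   # 1 repeated 9 times, then 2 repeated 9 times
material  <- gl(3, 3, 18)   # 1,1,1,2,2,2,3,3,3 recycled to length 18

print(as.integer(condition))
print(as.integer(material))

# Equivalent constructions with rep()
condition_rep <- rep(1:2, each = 9)
material_rep  <- rep(rep(1:3, each = 3), times = 2)

# Each condition/material combination should appear exactly 3 times
design <- table(condition, material)
print(design)
```

A balanced design such as this one (3 replicates in every cell) is also what makes the sums of squares for condition and material independent of the order in which they are entered in the model.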

The data are now ready for analysis. For two-way analysis of variance, we again use the lm command with the following arguments:
> twoway <- lm(score ~ condition + material)
> anova(twoway)
Analysis of Variance Table

Response: score
          Df Sum Sq Mean Sq F value    Pr(>F)
condition  1 5.0139  5.0139  95.575 1.235e-07 ***
material   2 2.1811  1.0906  20.788 6.437e-05 ***
Residuals 14 0.7344  0.0525
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Three sources of variation in score are analysed in the table above. From the mean squares, the effect of condition appears to be more important than the effect of the material tested. Both effects are nevertheless statistically significant, since the p values for the two factors are very low. We ask R to summarise the estimates with the summary command:
> summary(twoway)

Call:
lm(formula = score ~ condition + material)

Residuals:
     Min       1Q   Median       3Q      Max
-0.32778 -0.16389  0.03333  0.16111  0.32222

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.9778     0.1080  36.841 2.43e-15 ***
condition2   -1.0556     0.1080  -9.776 1.24e-07 ***
material2    -0.8500     0.1322  -6.428 1.58e-05 ***
material3    -0.4833     0.1322  -3.655   0.0026 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.229 on 14 degrees of freedom
Multiple R-Squared: 0.9074,     Adjusted R-squared: 0.8875
F-statistic: 45.72 on 3 and 14 DF,  p-value: 1.761e-07

The output shows that, compared with condition 1, condition 2 has a score that is lower by about 1.056, with a standard error of 0.108 and p = 1.24e-07, which is statistically significant. In addition, compared with material 1, the scores for materials 2 and 3 are also considerably lower, with the lowest recorded for material 2, and the effect of material is also statistically significant.

The value labelled Residual standard error is computed from the residual mean square in the analysis of variance table above, that is $\sqrt{0.0525} = 0.229$, which is the estimate of $\sigma$.

The coefficient of determination ($R^2$) shows that the two factors, condition and material, explain about 91% of the variation in the whole sample. It is computed from the sums of squares in the analysis of variance table as follows:

$$R^2 = \frac{5.0139 + 2.1811}{5.0139 + 2.1811 + 0.7344} = 0.9074$$

Finally, the adjusted $R^2$ reflects the improvement brought by the model. To understand it better, note that the variance of the whole sample is $s^2$ = (5.0139 + 2.1811 + 0.7344) / 17 = 0.4664. After adjusting for the effects of condition and material, this variance is 0.0525 (the residual mean square). The two factors therefore reduce the variance by about 0.4664 - 0.0525 = 0.4139, and the adjusted $R^2$ is:

Adj $R^2$ = 0.4139 / 0.4664 = 0.8875

That is, after adjusting for condition and material, the variance of score falls by about 89%.
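The arithmetic for the two coefficients can be reproduced from the analysis of variance table in R. The sketch below rebuilds the data of Example 18 and uses my own object names (ss, total_var, resid_ms); it is meant only to show where the numbers reported by summary come from.

```r
# Data of Example 18
condition <- gl(2, 9, 18)
material  <- gl(3, 3, 18)
score <- c(4.1,3.9,4.3, 3.1,2.8,3.3, 3.5,3.2,3.6,
           2.7,3.1,2.6, 1.9,2.2,2.3, 2.7,2.3,2.5)

twoway <- lm(score ~ condition + material)
ss <- anova(twoway)[["Sum Sq"]]      # condition, material, residuals

# R-squared: share of the total sum of squares explained by the two factors
r2 <- sum(ss[1:2]) / sum(ss)

# Adjusted R-squared: proportional reduction in variance
total_var <- sum(ss) / (length(score) - 1)   # total variance, 17 df
resid_ms  <- ss[3] / df.residual(twoway)     # residual mean square, 14 df
adj_r2    <- (total_var - resid_ms) / total_var

# Residual standard error is the square root of the residual mean square
rse <- sqrt(resid_ms)

print(round(c(r2 = r2, adj_r2 = adj_r2, rse = rse), 4))
```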
Comparisons between groups. We estimate the differences between the two conditions and among the three materials using the TukeyHSD function with aov:

> res <- aov(score ~ condition + material)
> TukeyHSD(res)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = score ~ condition + material)

$condition
         diff       lwr        upr p adj
2-1 -1.055556 -1.287131 -0.8239797 1e-07

$material
          diff         lwr        upr     p adj
2-1 -0.8500000 -1.19610279 -0.5038972 0.0000442
3-1 -0.4833333 -0.82943612 -0.1372305 0.0068648
3-2  0.3666667  0.02056388  0.7127695 0.0374069

The following plot illustrates these results:

> plot(TukeyHSD(res))

[Figure: Tukey plot titled "95% family-wise confidence level" showing intervals for the comparisons 2-1, 3-1 and 3-2; horizontal axis "Differences in mean levels of material", from -1.0 to 0.5.]

Figure 25. Comparison of the 3 types of material using the Tukey method.

12. Logistic regression analysis

In the previous sections on linear regression and analysis of variance, we looked for models of the relationship between a continuous dependent variable and one or more independent variables, which may be continuous or not. In many cases, however, the dependent variable is not continuous but binary: yes/no, diseased/not diseased, dead/alive, occurred/did not occur, and so on, while the independent variables may be continuous or not. We still want to study the relationship between the independent variables and the dependent variable.


Example 19. In a study that I conducted to examine the relationship between the risk of fracture (abbreviated fx) and bone density together with several other biochemical markers, 139 male patients (or, more precisely, study subjects) aged 60 and over were included. In 1990 the following data were collected for each subject: age, body mass index (BMI), bone mineral density (BMD), the bone resorption marker ICTP, and the bone formation marker PINP. The subjects were followed for 15 years, and during follow-up it was recorded whether or not each patient had a fracture. The initial question was whether there is any relationship between BMD and the risk of fracture. The data from this study are presented at the end of this chapter; part of them is shown below to give the reader an idea of the problem.

Part of the data from the study of risk factors for fracture
 id  fx  age      bmi    bmd   ictp    pinp
  1   1   79  24.7252  0.818  9.170  37.383
  2   1   89  25.9909  0.871  7.561  24.685
  3   1   70  25.3934  1.358  5.347  40.620
  4   1   88  23.2254  0.714  7.354  56.782
  5   1   85  24.6097  0.748  6.760  58.358
  6   0   68  25.0762  0.935  4.939  67.123
  7   0   70  19.8839  1.040  4.321  26.399
  8   0   69  25.0593  1.002  4.212  47.515
  9   0   74  25.6544  0.987  5.605  26.132
 10   0   79  19.9594  0.863  5.204  60.267
...
137   0   64  38.0762  1.086  5.043  32.835
138   1   80  23.3887  0.875  4.086  23.837
139   0   67  25.9455  0.983  4.328  71.334

Here, because the dependent variable (fracture) is not measured on a continuous scale (it is simply yes or no), linear regression cannot be used to analyse the relationship between the dependent variable and the independent variables. A method developed relatively recently (in the 1970s), called logistic regression analysis, can be applied in this situation.

In this study, after 15 years of follow-up, 38 patients had a fracture. As a proportion, the fracture rate is 38 / 139 = 0.273 (or 27.3%).
12.1 The logistic regression model

Given the frequency x of an event recorded among n subjects, the probability of the event is:

$$p = \frac{x}{n}$$

p can be regarded as a measure of the risk of an event. Another way of expressing risk is the odds (a noun which, if I am not mistaken, exists only in English; even French, German and Spanish have no equivalent noun). The odds of an event is defined simply as the ratio of the probability that the event occurs to the probability that it does not occur:

$$\text{odds} = \frac{p}{1 - p}$$

The logit function is defined as the logarithm of the odds:

$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right)$$

Given an independent variable x (which may be continuous or not), the logistic regression model states that:

$$\text{logit}(p) = \alpha + \beta x$$

As in the linear regression model, $\alpha$ and $\beta$ are two linear parameters that must be estimated from the study data. But the meaning of these parameters, especially $\beta$, is very different from the meaning we are used to in linear regression. To understand them, I return to Example 19.
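The three quantities just defined (probability, odds and logit) can be computed for the fracture proportion reported above, 38 out of 139. The sketch below uses only base R; qlogis and plogis are the built-in logit function and its inverse.

```r
# Fracture proportion from Example 19: 38 fractures among 139 subjects
p <- 38 / 139

# Odds: probability of the event divided by probability of no event
odds <- p / (1 - p)          # same as 38 / 101

# Logit: natural logarithm of the odds
logit_p <- log(odds)

# Base R equivalents: qlogis() is the logit, plogis() its inverse
logit_builtin <- qlogis(p)
p_back        <- plogis(logit_p)

print(round(c(p = p, odds = odds, logit = logit_p), 4))
```

The logit maps probabilities between 0 and 1 onto the whole real line, which is what allows the right-hand side of the model, $\alpha + \beta x$, to take any value.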
What we want to know is the relationship between bone density (bmd) and the risk of fracture (fx). For convenience, call bmd x; the question can then be written in the language of the model as:

$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \alpha + \beta x$$

Equivalently:

$$\text{odds}(p) = \frac{p}{1 - p} = e^{\alpha + \beta x}$$

In other words, the logistic regression model just presented states that the relationship between the probability of fracture (p) and bone density bmd follows an S-shaped curve. The model also shows that the probability of fracture p depends on the value of x. The model can therefore be written more precisely as the odds of fracture conditional on x:

$$\text{odds}(p \mid x) = e^{\alpha + \beta x}$$

When $x = x_0$, the odds of fracture are:

$$\text{odds}(p \mid x = x_0) = e^{\alpha + \beta x_0}$$

When $x = x_0 + 1$ (an increase of 1 unit from $x_0$), the odds of fracture are:

$$\text{odds}(p \mid x = x_0 + 1) = e^{\alpha + \beta (x_0 + 1)}$$

And the ratio of these two odds is:

$$\frac{\text{odds}(p \mid x = x_0 + 1)}{\text{odds}(p \mid x = x_0)} = \frac{e^{\alpha + \beta (x_0 + 1)}}{e^{\alpha + \beta x_0}} = e^{\beta}$$

In epidemiology, $e^{\beta}$ is called the odds ratio. The odds ratio, as its name says, is a ratio of two odds. In other words, the coefficient $\beta$ in the logistic regression model is the logarithm of the odds ratio.
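The method for estimating the parameters of model [3] is fairly complex (it uses the method of maximum likelihood) and is beyond the scope of this book, so I will not present it here (the reader can consult a textbook if needed). I do, however, want to mention briefly that maximum likelihood leads to the following system of equations:

$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \frac{1}{1 + e^{-(\alpha + \beta x_i)}}$$

$$\sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{n} \frac{x_i}{1 + e^{-(\alpha + \beta x_i)}}$$

Here $y_i$ is the dependent variable (fracture, with value 0 or 1), $x_i$ is the independent variable (bone density), and n is the sample size. To find the estimates of $\alpha$ and $\beta$, commonly used algorithms are iteratively reweighted least squares and Newton-Raphson. R's glm function uses iteratively reweighted least squares (Fisher scoring), which for the logistic model is equivalent to Newton-Raphson.

Once we have the estimates $\hat{\alpha}$ and $\hat{\beta}$, we can estimate the probability p for any value of x as follows (after a little algebra):

$$\hat{p} = \frac{e^{\hat{\alpha} + \hat{\beta} x}}{1 + e^{\hat{\alpha} + \hat{\beta} x}} = \frac{1}{1 + e^{-(\hat{\alpha} + \hat{\beta} x)}}$$

Note that I use the hat in $\hat{p}$ to indicate an estimate (predicted value), as opposed to p, the observed probability. If the model describes the data well and completely, the difference between p and $\hat{p}$ is small; if the model is inappropriate or poor, the difference can be large. The discrepancy between p and $\hat{p}$ is measured by the deviance. The calculation of the deviance is fairly complex and is not the subject here, so I only mention the concept. When we have several models of one or more relationships, the deviance can be used to assess how well a model fits, or to choose the best model.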
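Because the two likelihood equations have no closed-form solution, they are solved iteratively. The following is a minimal sketch of the Newton-Raphson iteration on simulated data; the data, the seed and the "true" parameter values -1 and 0.8 are hypothetical choices of mine and have nothing to do with the fracture study. The result is compared with glm.

```r
# Simulated data (hypothetical; NOT the fracture data of Example 19)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))

X    <- cbind(1, x)    # design matrix: intercept and x
beta <- c(0, 0)        # starting values for (alpha, beta)

for (iter in 1:25) {
  p     <- as.vector(plogis(X %*% beta))     # current fitted probabilities
  score <- t(X) %*% (y - p)                  # gradient of the log-likelihood
  info  <- t(X) %*% (X * (p * (1 - p)))      # information matrix X'WX
  step  <- solve(info, score)                # Newton-Raphson step
  beta  <- beta + as.vector(step)
  if (max(abs(step)) < 1e-10) break          # stop when the update is negligible
}

fit <- glm(y ~ x, family = binomial)
print(beta)
print(unname(coef(fit)))
```

At convergence the score vector is zero, and setting its two components to zero gives exactly the two equations shown above.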
12.2 Logistic regression analysis in R

We now return to Example 19 and use the data in Table 12.1 to estimate the two parameters $\alpha$ and $\beta$ in R. First we must read all the data into a data frame and give it a name, for example fracture. In my case, the data are stored in the directory c:\works\stats under the name fracture.txt, so the following commands are needed to read the data:
# tell R where the data are stored
> setwd("c:/works/stats")

# read the data into a data frame called fracture
> fracture <- read.table("fracture.txt", header=TRUE, na.strings=".")

# check which variables are in the fracture data
> names(fracture)
[1] "id"   "fx"   "age"  "bmi"  "bmd"  "ictp" "pinp"

# select the patients with complete data for the analysis
> fulldata <- na.omit(fracture)
> attach(fulldata)

The two variables of interest in this example are fx (fracture) and bmd (bone density). We check how many patients had a fracture:

> table(fx)
fx
  0   1
101  38

Next, we look at bone density in the fracture and non-fracture groups:

> tapply(bmd, fx, mean)
        0         1
0.9444851 0.9016667

> boxplot(bmd ~ fx,
          xlab="Fracture: 1=yes, 0=no",
          ylab="BMD")

[Figure: box plots of BMD (vertical axis, 0.6 to 1.2) by fracture status (horizontal axis, "Fracture: 1=yes, 0=no").]

The results show that bmd in the patients with a fracture is lower than in those without a fracture (0.90 versus 0.94). However, the following t test shows that this difference is not statistically significant (p = 0.15).
> t.test(bmd ~ fx)

        Welch Two Sample t-test

data:  bmd by fx
t = 1.4572, df = 53.952, p-value = 0.1508
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.01609226  0.10172922
sample estimates:
mean in group 0 mean in group 1
      0.9444851       0.9016667

To estimate the parameters of model [4], the glm function (short for generalized linear model) in R can be used, with the following syntax:
> logistic <- glm(fx ~ bmd, family=binomial)
> summary(logistic)

Call:
glm(formula = fx ~ bmd, family = "binomial")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0287  -0.8242  -0.7020   1.3780   2.0709

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.063      1.342   0.792    0.428
bmd           -2.270      1.455  -1.560    0.119

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 157.81  on 136  degrees of freedom
Residual deviance: 155.27  on 135  degrees of freedom
AIC: 159.27

Number of Fisher Scoring iterations: 4

I will explain the results above in turn.

(a) With the command logistic <- glm(fx ~ bmd, family=binomial) we ask R to fit the
model in which fx is a function of bmd, as in model [4]. glm supports several
distribution families; the binomial family is the standard one for logistic
regression, so family=binomial must be given to R.
(b) Deviance: the first part of the output gives a summary of the deviance residuals.

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0287  -0.8242  -0.7020   1.3780   2.0709

As explained above, deviance reflects the discrepancy between the model and the data
(it plays a role similar to the residual mean square in linear regression). For a single
model such as the one in this example, the deviance on its own is not very informative.
(c) The next part gives the estimates of α (which R labels the intercept) and β (bmd),
together with their standard errors.

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.063      1.342   0.792    0.428
bmd           -2.270      1.455  -1.560    0.119

From this output we have α = 1.063 and β = -2.27. The negative estimate of β shows that
the relationship between fracture risk and bmd is inverse: the probability of fracture
rises as bmd falls. However, the z test (the estimate divided by its standard error)
shows that the effect of bmd is not statistically significant, since p = 0.119.

Recall that the odds ratio (OR) is $e^{-2.27} = 0.1033$. In other words, when bmd
increases by 1 g/cm² (bmd is measured in g/cm²), the odds of fracture fall by 0.8967,
or 89.67%. But an increase of 1 g/cm² is a very large change in bone density and is not
realistic. An alternative is to express the effect per standard deviation of bmd. We
first find the standard deviation of bmd:
> sd(bmd)
[1] 0.1406543

The OR is therefore computed per 0.14 g/cm². The OR per standard deviation is:

$$e^{-2.27 \times 0.1406} = 0.7267$$

That is, when bmd increases by one standard deviation, the odds of fracture fall by
about 27%. Put the other way round, when bmd decreases by one standard deviation the
odds rise by a factor of $e^{2.27 \times 0.1406} = 1.376$, or about 38%.

Another way to see the effect of bmd is to estimate the probability of fracture from the
equation:

$$p = \frac{e^{1.063 - 2.27 \times bmd}}{1 + e^{1.063 - 2.27 \times bmd}}$$

According to this equation, p = 0.23 when bmd = 1.00, and p = 0.291 when bmd = 0.86
(one standard deviation lower). That is, if BMD falls by one standard deviation, the
probability of fracture rises by a factor of 0.291/0.23 = 1.265, or about 26%.
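
The odds ratios and probabilities quoted above can be reproduced directly from the
printed coefficients. The following is a minimal sketch: the variable names are
illustrative, and the numbers are copied from the output rather than taken from a fitted
model object.

```r
# Coefficients copied from the summary(logistic) output
alpha  <- 1.063     # intercept
beta   <- -2.27     # coefficient for bmd
sd_bmd <- 0.1406    # standard deviation of bmd, from sd(bmd)

or_per_unit        <- exp(beta)            # odds ratio per 1 g/cm^2 increase
or_per_sd          <- exp(beta * sd_bmd)   # odds ratio per 1 SD increase
or_per_sd_decrease <- exp(-beta * sd_bmd)  # odds ratio per 1 SD decrease

# plogis(x) is the inverse logit, exp(x) / (1 + exp(x))
p_at_100 <- plogis(alpha + beta * 1.00)
p_at_086 <- plogis(alpha + beta * 0.86)

round(c(or_per_unit, or_per_sd, or_per_sd_decrease), 4)
round(c(p_at_100, p_at_086, p_at_086 / p_at_100), 3)
```

With a fitted model at hand, coef(logistic) would supply alpha and beta without retyping
them.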
(d) The final part of the output gives the deviance of two models: the model with no
independent variable (null deviance) and the model with the independent variable, bmd
in this example (residual deviance).

    Null deviance: 157.81  on 136  degrees of freedom
Residual deviance: 155.27  on 135  degrees of freedom
AIC: 159.27

These two numbers show that bmd contributes very little to predicting fracture: it
lowers the deviance only from 157.81 to 155.27, and this reduction is not statistically
significant.

R also reports the AIC (Akaike Information Criterion), which is computed from the
deviance and the degrees of freedom. I will return to the meaning of AIC in a later
section on comparing models.
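
The claim that the drop in deviance is not significant can be checked with a likelihood
ratio test. This is a sketch under the usual assumption that the difference between the
two deviances follows a chi-square distribution, with degrees of freedom equal to the
number of added parameters; the variable names are illustrative.

```r
null_dev  <- 157.81   # null deviance, 136 degrees of freedom
resid_dev <- 155.27   # residual deviance, 135 degrees of freedom

lr_stat <- null_dev - resid_dev   # drop in deviance due to bmd
lr_df   <- 136 - 135              # one extra parameter
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# For binary (0/1) outcomes the deviance equals -2 * log-likelihood,
# so AIC = residual deviance + 2 * (number of parameters).
aic <- resid_dev + 2 * 2

round(c(lr_stat = lr_stat, p_value = p_value, aic = aic), 3)
```

With the model object available, anova(logistic, test="Chisq") performs the same
comparison.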
12.3 Estimating probabilities with R

Recall that in the analysis above we stored the results in the object logistic. This
object holds a good deal of useful information, but to see it we need functions such as
summary. In this section I present a few functions for examining the information
related to predicted probabilities.

predict lists the predicted values of the model
$\log\left(\frac{p}{1-p}\right) = \alpha + \beta x$ for each patient.

> predict(logistic)
           1            2            3            4            5            6
 2.377576584  1.085694014 -2.141117756  1.492824115  0.965379946 -0.941253280
           7            8            9           10           11           12
-1.733686514 -1.675645430 -0.665282957 -0.507046129 -0.941854868 -0.648740461
...

The numbers above are log(p / (1 - p)), that is, log odds, which have little practical
meaning. We want the predicted probability p computed from the equation

$$p = \frac{e^{1.063 - 2.27 \times bmd}}{1 + e^{1.063 - 2.27 \times bmd}}$$

To obtain this value for each patient, we pass the argument type="response" to predict:

> predict(logistic, type="response")
         1          2          3          4          5          6          7
0.91510135 0.74757001 0.10516416 0.81650178 0.72419767 0.28064726 0.15011664
         8          9         10         11         12         13         14
0.15767295 0.33955387 0.37588624 0.28052582 0.34327343 0.44305196 0.23830776
...

In the output above (only part of which is printed), the estimated probability of
fracture is 0.915 for patient 1, 0.747 for patient 2, and so on.
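
The link between the two kinds of prediction can be seen on a small simulated data set.
This is a sketch only: the data are generated here as a stand-in for the fracture data,
and the names sim and fit are illustrative.

```r
set.seed(1)
# Simulated stand-in for the fracture data (not the author's data set)
n   <- 200
bmd <- rnorm(n, mean = 0.93, sd = 0.14)
fx  <- rbinom(n, size = 1, prob = plogis(1.063 - 2.27 * bmd))
sim <- data.frame(fx = fx, bmd = bmd)

fit <- glm(fx ~ bmd, family = binomial, data = sim)

log_odds <- predict(fit)                     # default type = "link"
prob     <- predict(fit, type = "response")  # probability scale

# The inverse logit of the log odds reproduces the probabilities
all.equal(plogis(log_odds), prob)

# Predictions for new bmd values are requested through newdata
predict(fit, newdata = data.frame(bmd = c(0.86, 1.00)), type = "response")
```

Passing data = sim to glm avoids the need for attach, which keeps the example
self-contained.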

We can examine these predicted values against bmd with the usual plot function:

> plot(bmd, fitted(glm(fx ~ bmd, family=binomial)))

[Figure: predicted probability of fracture (vertical axis) against bmd (horizontal axis)
from the logistic regression model, drawn as points.]

The plot above can be improved by using evenly spaced bmd values (such as 0.50, 0.55,
0.60, ..., 1.20) and drawing a line instead of points. The following commands produce
the improved plot.

> logistic <- glm(fx ~ bmd, family=binomial)
> fnbmd <- seq(0.5, 1.2, 0.05)          # fnbmd = 0.50, 0.55, 0.60, ..., 1.20
> new.data <- data.frame(bmd = fnbmd)   # put the values in a new data frame
> predicted <- predict(logistic, new.data, type="response")
> plot(predicted ~ fnbmd, type="l")

[Figure: predicted probability of fracture (vertical axis, predicted) against bmd
(horizontal axis, fnbmd) from the logistic regression model, drawn as a line.]

13. Sample size estimation

A research study is usually based on a sample. One of the most important questions
before starting a study is how many samples, or how many subjects, it needs. A subject
here is the basic unit of the study: patients, volunteers, field plots, plants, devices,
and so on. Estimating the number of subjects needed is extremely important, because it
can decide whether the study succeeds or fails. If there are too few subjects, the
conclusions drawn from the study will not be precise, and it may not be possible to
conclude anything at all. Conversely, if there are many more subjects than needed,
resources, money and time are wasted. The key task before a study is therefore to
estimate a number of subjects that is just sufficient for its aims. That number depends
on three main factors:

- the errors the investigator is willing to accept, specifically type I and type II errors;
- the variability of the measurement, specifically its standard deviation; and
- the difference, or effect, that the investigator wants to detect.

Without data on these three factors, the sample size cannot be estimated. In the
author's experience, many people starting a study have no idea about these quantities,
so when they consult a statistician the only answer they get is: it cannot be
calculated! In this chapter I discuss these three factors.
13.1 The concept of power

Statistics is a scientific method whose purpose is to discover, or look for, things that
can be grouped under the term "unknown". The unknown here refers to phenomena that we
cannot observe, or can observe only incompletely. It may be an unknown quantity (such as
the mean height of Vietnamese people, or the weight of a molecule), the efficacy of a
treatment, a gene that makes leaves green, people's preferences, and so on. We can
measure height, or run tests to learn whether a drug works, but such studies are carried
out on a group of subjects, not on the whole population.

At the simplest level, these unknowns take one of two forms: present or absent. For
example, a treatment either does or does not prevent fractures; customers either like or
dislike a soft drink. Because nobody knows the phenomenon completely, we must state
hypotheses. The simplest are the null hypothesis (the phenomenon does not exist, denoted
H-) and the alternative hypothesis (the phenomenon exists, denoted H+).

We use statistical tests such as the t, F, z and χ² tests to assess how plausible a
hypothesis is. The result of a statistical test can be reduced to two values:
statistically significant or not statistically significant. As discussed in Chapter 7,
statistical significance is usually judged from the P value: if P < 0.05 we declare the
result statistically significant; if P > 0.05 we say it is not. One can also think of
significant versus non-significant as signal versus no signal. Let T+ denote a
statistically significant test result and T- a non-significant one.
Consider a concrete example: to find out whether risedronate is effective in treating
osteoporosis, we run a study with two groups of patients (one treated with risedronate,
the other given a placebo). We follow the patients and record fractures, estimate the
fracture rate in each group, and compare the two rates with a statistical test. The test
result is either statistically significant (P<0.05) or not (P>0.05). Recall that we do
not know whether risedronate truly prevents fractures; we can only state a hypothesis H.
So when we consider a hypothesis together with a test result, there are four scenarios:

(a) Hypothesis H is true (risedronate is effective) and the test gives P<0.05;
(b) Hypothesis H is true, but the test result is not statistically significant;
(c) Hypothesis H is false (risedronate is not effective), but the test result is
statistically significant;
(d) Hypothesis H is false and the test result is not statistically significant.

Scenarios (a) and (d) pose no problem, because the test result agrees with the truth. In
scenarios (b) and (c) we make an error, because the test result disagrees with the truth
about the hypothesis. In statistical language we have several terms:

- the probability of scenario (b) is called the type II error, usually denoted β;
- the probability of scenario (a) is called power. In other words, power is the
  probability that the test gives p<0.05 given that hypothesis H is true:
  power = 1 - β;
- the probability of scenario (c) is called the type I error (or significance level),
  usually denoted α. In other words, α is the probability that the test gives p<0.05
  given that hypothesis H is false;
- the probability of scenario (d) is not a concern and has no special name, although it
  can be called a true negative.

The four scenarios can be summarized in Table 1:

Table 1. Scenarios in testing a scientific hypothesis

                              Hypothesis H
Statistical test result       True (drug is effective)      False (drug is not effective)
----------------------------  ----------------------------  -----------------------------
Significant (p<0.05)          True positive (power)         Type I error
                              1-β = P(s | H+)               α = P(s | H-)
Not significant (p>0.05)      Type II error                 True negative
                              β = P(ns | H+)                1-α = P(ns | H-)

Note: in this table s means significant and ns non-significant; H+ means the hypothesis
is true and H- that it is false. The four scenarios can thus be written as conditional
probabilities: power = 1-β = P(s | H+); β = P(ns | H+); and α = P(s | H-).
13.2 Data for sample size estimation

As mentioned at the start of this chapter, to estimate the number of subjects needed for
a study we need three pieces of information: the type I and type II error probabilities,
the variability of the measurement, and the effect size.

- Error probabilities: a study usually accepts a type I error of about 1% or 5%
  (α = 0.01 or 0.05) and a type II error of about β = 0.1 to β = 0.2 (that is, power
  between 0.8 and 0.9).
- Variability is the standard deviation of the measurement on which the study bases its
  analysis. For example, in a study of hypertension the investigator needs the standard
  deviation of blood pressure. We denote the variability by σ.
- Effect size: in a study comparing two groups, this is the mean difference between the
  groups that the investigator wants to detect. For example, the investigator may
  hypothesize that patients treated with drug A have blood pressure 10 mmHg lower than
  the placebo group. Here 10 mmHg is the effect size. We denote the effect size by Δ.

A study may have one group of subjects, two groups, or sometimes more than two, and the
sample size calculation depends on which case applies.

With one group of subjects, the number of subjects (n) needed can be calculated by hand
as:

$$n = \frac{C}{(\Delta/\sigma)^2} \qquad [1]$$

With two groups of subjects, the number of subjects (n) needed in each group is:

$$n = \frac{2C}{(\Delta/\sigma)^2} \qquad [2]$$

where the constant C is determined by the type I and type II error probabilities (or
power) as follows:

Constant C in relation to type I and type II errors

          β = 0.20          β = 0.10          β = 0.05
α         (Power = 0.80)    (Power = 0.90)    (Power = 0.95)
0.10      6.15              8.53              10.79
0.05      7.85              10.51             13.00
0.01      13.33             16.74             19.84
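
The text does not say how C is obtained. A common normal-approximation choice for a
two-sided test is $C = (z_{\alpha/2} + z_{\beta})^2$; the sketch below assumes that
formula, and c_constant is a hypothetical helper name. Under this assumption the
α = 0.05 row of the table is reproduced closely, while other rows may differ somewhat
from the printed values.

```r
# Assumed formula (not stated in the text): C = (z_{alpha/2} + z_beta)^2
c_constant <- function(alpha, beta) {
  (qnorm(1 - alpha / 2) + qnorm(1 - beta))^2
}

alphas <- c(0.10, 0.05, 0.01)
betas  <- c(0.20, 0.10, 0.05)

# Rows are alpha, columns are beta
c_table <- outer(alphas, betas, c_constant)
dimnames(c_table) <- list(alpha = as.character(alphas),
                          beta  = as.character(betas))
round(c_table, 2)
```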

13.4 Sample size estimation
13.4.1 Sample size for estimating a mean

Example 20: We want to estimate the height of Vietnamese men and accept an error within
1 cm (d = 1), with 95% confidence (α = 0.05) and power = 0.8 (β = 0.2). Previous studies
report that the standard deviation of height in Vietnamese people is about 4.6 cm. We
can apply formula [1] to estimate the sample size needed for the study:

$$n = \frac{C}{(\Delta/\sigma)^2} = \frac{7.85}{(1/4.6)^2} = 166$$

In other words, we need to measure the height of 166 subjects to estimate the height of
Vietnamese men to within 1 cm.

If the acceptable error is 0.5 cm (instead of 1 cm), the number of subjects needed is
$n = \frac{7.85}{(0.5/4.6)^2} = 664$. If the acceptable error is 0.1 cm, the number of
subjects rises to 16610! These estimates make it clear that the sample size depends very
heavily on the error we accept. The more precise the estimate we want, the more subjects
we need.
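
The hand calculations of Example 20 can be scripted so that several error margins are
tried at once. This is a minimal sketch of formula [1]; n_one_group is a hypothetical
helper name, and C = 7.85 is the table value for α = 0.05 and power = 0.80.

```r
# Formula [1]: n = C / (Delta / sigma)^2, one group
n_one_group <- function(delta, sigma, C = 7.85) {
  C / (delta / sigma)^2
}

errors <- c(1, 0.5, 0.1)   # acceptable error in cm
data.frame(error_cm = errors,
           n = n_one_group(delta = errors, sigma = 4.6))
```

Halving the acceptable error multiplies the required n by four, because Δ enters the
formula squared.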
R provides the function power.t.test, which can be used to estimate the sample size for
the example above. Note that we tell R this is a one-group problem with
type='one.sample':

# error 1 cm, standard deviation 4.6, alpha=0.05, power=0.8
> power.t.test(delta=1, sd=4.6, sig.level=.05, power=.80,
               type='one.sample')

     One-sample t test power calculation

              n = 168.0131
          delta = 1
             sd = 4.6
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

R gives 168, two subjects more than the hand calculation. The main reason is that
power.t.test works with the t distribution, whereas the constant C comes from the normal
approximation; R also carries more decimal places. With an error of 0.5 cm:

# error 0.5 cm, standard deviation 4.6, alpha=0.05, power=0.8
> power.t.test(delta=0.5, sd=4.6, sig.level=.05, power=.80,
               type='one.sample')

     One-sample t test power calculation

              n = 666.2525
          delta = 0.5
             sd = 4.6
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Example 21: A drug is able to raise alkaline phosphatase in patients with osteoporosis.
The standard deviation of alkaline phosphatase is 15 U/l. A new study will be run in a
population of Vietnamese patients, and the investigators want to know how many patients
must be recruited to show that the drug can raise alkaline phosphatase from 60 to 65 U/l
after 3 months of treatment, with type I error α = 0.05 and power = 0.8.

This is a before-after study: measurements are taken before and after treatment. There
is only one group of patients, but each patient is measured twice (before and after
taking the drug). The clinical endpoint for judging the drug's efficacy is the change in
alkaline phosphatase. In this case the mean increase is 5 U/l and the standard deviation
is 15 U/l, or in R terms delta=5, sd=15, sig.level=.05, power=.80. The command printed
below, however, uses delta=3, and the output corresponds to that value:

> power.t.test(delta=3, sd=15, sig.level=.05, power=.80,
               type='one.sample')

     One-sample t test power calculation

              n = 198.1513
          delta = 3
             sd = 15
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

So, with a mean increase of 3 U/l as entered in the command, we need 198 patients to
meet the targets above. For the increase of 5 U/l stated in the example, formula [1]
gives a smaller number: 7.85/(5/15)^2, or about 71 patients.


13.4.2 Sample size for comparing two means

In practice, a great many studies aim to compare two groups. Sample size estimation for
such studies relies mainly on formula [2], presented in section 13.2.

Example 22: A study is designed to test alendronate for treating osteoporosis in
postmenopausal women. Two groups of patients are recruited: group 1 is the intervention
group (treated with alendronate) and group 2 is the control group (not treated). The
efficacy endpoint is bone mineral density (BMD). Epidemiological data show that mean BMD
in postmenopausal women is 0.80 g/cm², with a standard deviation of 0.12 g/cm². The
question is how many subjects we need to study to show that after 12 months of treatment
BMD in group 1 is about 5% higher than in group 2.

In this example, let the mean of group 2 be μ2 and that of group 1 be μ1. We have
μ1 = 0.80 × 1.05 = 0.84 g/cm² (5% higher than group 2), and therefore
Δ = 0.84 - 0.80 = 0.04 g/cm². The standard deviation is σ = 0.12 g/cm². With
power = 0.90 and α = 0.05, the sample size needed is:

$$n = \frac{2C}{(\Delta/\sigma)^2} = \frac{2 \times 10.51}{(0.04/0.12)^2} = 189$$

The solution from R using power.t.test is:

> power.t.test(delta=0.04, sd=0.12, sig.level=0.05, power=0.90,
               type="two.sample")

     Two-sample t test power calculation

              n = 190.0991
          delta = 0.04
             sd = 0.12
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group

Note that in power.t.test, besides the usual arguments delta (the hypothesized effect or
difference), sd (standard deviation), sig.level (type I error probability) and power, we
must also state that the study has two groups, with type="two.sample".

The result shows that we need 190 patients per group (380 for the whole study). What do
power = 0.90 and α = 0.05 mean here? They mean that if we ran a large number of studies
(say 1000), each with 380 patients, and the true difference were 0.04 g/cm², then about
90% of them (900 studies) would give p < 0.05.
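
This interpretation of power can be illustrated by simulation. The sketch below assumes
normally distributed BMD with the means and standard deviation of Example 22; the
variable names and the random seed are illustrative.

```r
set.seed(2006)
n_per_group <- 190    # per-group sample size from power.t.test
n_studies   <- 1000   # number of simulated studies

# Each simulated study draws two groups and returns the t-test p-value.
# Normally distributed BMD is an assumption of this sketch.
p_values <- replicate(n_studies, {
  control <- rnorm(n_per_group, mean = 0.80, sd = 0.12)
  treated <- rnorm(n_per_group, mean = 0.84, sd = 0.12)
  t.test(treated, control, var.equal = TRUE)$p.value
})

# Proportion of studies with p < 0.05: an empirical estimate of power
mean(p_values < 0.05)
```

The proportion should land near 0.90, with some simulation noise. Setting both means to
0.80 instead would estimate the type I error rate, which should be near 0.05.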

13.4.3 Sample size for analysis of variance

The method for comparing two groups can be extended to estimate sample size when more
than two groups are compared. With several groups, as discussed in Chapter 11, the
method of comparison is analysis of variance. In that method the residual mean square
(RMS) is the estimate of the measurement variability within each group, and it is very
important in estimating sample size.

The theory behind sample size estimation for analysis of variance is fairly complex and
outside the scope of this chapter, but the main principle is no different from that for
comparing two groups. Let the means of the k groups be μ1, μ2, μ3, ..., μk. The
between-group sum of squares is

$$SS = \sum_{i=1}^{k} (\mu_i - \bar{\mu})^2, \quad \text{where } \bar{\mu} = \sum_{i=1}^{k} \mu_i / k.$$

Let $\theta = \frac{SS}{(k-1)\,RMS}$. The problem is to find the sample size n such that
$z_\beta$ meets the required power (0.80 or 0.90), where

$$z_\beta = \frac{\sqrt{k(n-1)}\,\sqrt{2(k-1)(1+n\theta)^2 - (1+2n\theta)} - \sqrt{(k-1)(1+n\theta)\,F\,(2k(n-1)-1)}}{\sqrt{(k-1)(1+n\theta)F + k(n-1)(1+2n\theta)}}$$

Here F is the critical value of the F test. (See J. Fleiss, The Design and Analysis of
Clinical Experiments, John Wiley & Sons, New York 1986, page 373.)
Example 23. To compare the sweetness of a drink among 4 groups of subjects differing in
sex and age (call the groups A, B, C and D), the investigators hypothesize that the
sweetness scores in groups A, B, C and D are 4.5, 3.0, 5.6 and 1.3 respectively. From a
review of earlier studies, the investigators also know that the RMS of sweetness within
each group is about 8.7. The question is how many subjects are needed to detect a
statistically significant difference at α = 0.05 with power = 0.9.

The power.anova.test function in R can be used for this problem. We simply supply the 4
hypothesized means and the RMS:

# first put the 4 means in a vector
> groupmeans <- c(4.5, 3.0, 5.6, 1.3)
# then call power.anova.test:
> power.anova.test(groups = length(groupmeans),
                   between.var=var(groupmeans),
                   within.var=8.7, power=0.90, sig.level=0.05)

     Balanced one-way analysis of variance power calculation

         groups = 4
              n = 12.81152
    between.var = 3.486667
     within.var = 8.7
      sig.level = 0.05
          power = 0.9

NOTE: n is number in each group

The result shows that the investigators need about 13 subjects per group (52 subjects
for the whole study).
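
Because n must be a whole number, it can be useful to confirm the power actually
achieved after rounding up. A short sketch: power.anova.test solves for whichever of its
arguments is left out, so supplying n = 13 and omitting power returns the achieved
power. The name achieved is illustrative.

```r
groupmeans <- c(4.5, 3.0, 5.6, 1.3)

# Supplying n instead of power makes power.anova.test solve for power
achieved <- power.anova.test(groups = length(groupmeans),
                             n = 13,
                             between.var = var(groupmeans),
                             within.var = 8.7,
                             sig.level = 0.05)
achieved$power
```

The value should be slightly above the 0.90 target, since 13 is just above the computed
12.81 per group.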
13.4.4 Sample size for estimating a proportion

Many descriptive studies have the fairly simple aim of estimating a proportion. For
example, health professionals often want to know the prevalence of a disease in a
community, and pollsters or market researchers want the proportion of people who like a
product. In these cases we do not have continuous measurements; the outcomes are binary
values such as yes / no or like / dislike, and the method of estimating sample size
differs from the three examples above.

In 1991 an opinion poll in the United States found that 45% of respondents were willing
to encourage their children to donate a kidney to patients who needed one. The 95%
confidence interval for this proportion was 42% to 48%, a width of 6 percentage points!
This result is [relatively] imprecise, even though 1000 people took part. Why? To answer
this question, we look briefly at the theory of sample size estimation for a proportion.

We know from Chapters 6 and 9 that if $\hat{p}$ is estimated from n subjects, the 95%
confidence interval for the [population] proportion p is
$\hat{p} \pm 1.96 \times SE(\hat{p})$, where $SE(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$.

Now turn the problem around: we want to estimate p so that the margin of error,
$1.96 \times SE(\hat{p})$, does not exceed a constant m. In other words, we want:

$$1.96\sqrt{p(1-p)/n} \le m$$

We want the number of subjects n that meets this requirement. From the expression above
it is easy to see that:

$$n \ge \left(\frac{1.96}{m}\right)^2 p(1-p)$$

The sample size therefore depends on the error m and on the proportion p that we want to
estimate. The smaller the error, the larger the sample size.

Example 24: We want to estimate the proportion of men who smoke in Vietnam so that the
estimate is no more than 2 percentage points above or below the true proportion in the
whole population. An earlier study suggests that the smoking rate among Vietnamese men
may be as high as 70%. The question is how many men we need to study to meet this
requirement.

In this example the error is m = 0.02 and p = 0.70, so the sample size needed is:

$$n \ge \left(\frac{1.96}{0.02}\right)^2 \times 0.7 \times 0.3 = 2017$$

In other words, we need to study at least 2017 men. If we want to reduce the error from
2% to 1% (m = 0.01), the number of subjects becomes 8067! Gaining just 1% in precision
can add more than 6000 people to the sample. Sample size estimation therefore calls for
care, balancing the precision of the information to be collected against cost.

R has no built-in function for estimating the sample size for a single proportion, but
with the formula above the reader can easily write one.
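
One possible version of such a function is sketched below. The name n_for_proportion and
its arguments are hypothetical; it uses the exact normal quantile instead of the rounded
1.96 and rounds up to a whole number of subjects, so results can differ by one from the
hand calculation.

```r
# Hypothetical helper implementing n >= (z / m)^2 * p * (1 - p)
n_for_proportion <- function(p, m, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)   # about 1.96 for a 95% interval
  ceiling((z / m)^2 * p * (1 - p))       # round up to whole subjects
}

n_for_proportion(p = 0.70, m = 0.02)   # Example 24
n_for_proportion(p = 0.70, m = 0.01)   # halving the margin of error
n_for_proportion(p = 0.50, m = 0.02)   # p = 0.5 is the most demanding case
```

When no prior estimate of p is available, using p = 0.5 gives the largest, and therefore
safest, sample size.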
13.4.5 Sample size for comparing two proportions

Many inferential studies have two [or more] groups to compare. In section 13.4.2 we met
the method for estimating sample size when comparing two means with a t test. Those are
studies whose endpoints are continuous variables. But some studies have endpoints that
are not continuous but binary, as I have just discussed in section 13.4.4. To compare
two proportions, the most commonly used tests are the binomial test and the Chi-square
(χ²) test. In this section I discuss sample size calculation for these two tests.

Let the two proportions [which we do not know but want to learn about] be p1 and p2, and
let Δ = p1 - p2. The hypothesis we want to test is Δ = 0. The theory behind sample size
estimation for this test is rather involved, but it can be summarized in the following
formula:

$$n = \frac{\left( z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{\beta}\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^2}{\Delta^2}$$

where $\bar{p} = (p_1 + p_2)/2$; $z_{\alpha/2}$ is the z value of the standard normal
distribution for probability α/2 (for example, $z_{\alpha/2}$ = 1.96 when α = 0.05 and
2.57 when α = 0.01); and $z_{\beta}$ is the z value of the standard normal distribution
for probability β (for example, $z_{\beta}$ = 1.28 when β = 0.10 and 0.84 when
β = 0.20).
Example 25: A randomized controlled clinical trial is designed to evaluate a drug for
preventing vertebral fractures. Two groups of patients will be recruited. Group 1 is
treated with the drug and group 2 is the control group (not treated). The investigators
hypothesize that the fracture rate in group 2 is about 10% and that the drug can reduce
it to about 6%. If they want to test this hypothesis with type I error α = 0.01 and
power = 0.90, how many patients must be recruited for the study?

Here Δ = 0.10 - 0.06 = 0.04 and $\bar{p}$ = (0.10 + 0.06)/2 = 0.08. With α = 0.01,
$z_{\alpha/2}$ = 2.57, and with power = 0.90, $z_{\beta}$ = 1.28. The number of patients
needed in each group is therefore:

$$n = \frac{\left(2.57\sqrt{2 \times 0.08 \times 0.92} + 1.28\sqrt{0.1 \times 0.90 + 0.06 \times 0.94}\right)^2}{(0.04)^2} = 1361$$

So the study must recruit at least 2722 patients to test the hypothesis above.
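
The formula can be coded directly and compared with R's built-in power.prop.test. This
is a sketch: n_two_proportions is a hypothetical helper name, and it uses unrounded
normal quantiles from qnorm rather than 2.57 and 1.28, so its result is a little larger
than the hand calculation.

```r
# Hypothetical helper implementing the two-proportion formula
n_two_proportions <- function(p1, p2, alpha = 0.05, power = 0.80) {
  z_alpha <- qnorm(1 - alpha / 2)
  z_beta  <- qnorm(power)
  p_bar   <- (p1 + p2) / 2
  delta   <- p1 - p2
  (z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
     z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / delta^2
}

n_formula <- n_two_proportions(p1 = 0.10, p2 = 0.06, alpha = 0.01, power = 0.90)
n_builtin <- power.prop.test(p1 = 0.10, p2 = 0.06, power = 0.90,
                             sig.level = 0.01)$n
c(formula = n_formula, power.prop.test = n_builtin)
```

The two numbers should agree closely, which suggests that the gap between 1361 and the
built-in result comes from rounding the z values in the hand calculation.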
The power.prop.test function in R can be used to calculate the sample size for this
case. It needs power, sig.level, p1 and p2. For the example above we can write:

> power.prop.test(p1=0.10, p2=0.06, power=0.90, sig.level=0.01)

     Two-sample comparison of proportions power calculation

              n = 1366.430
             p1 = 0.1
             p2 = 0.06
      sig.level = 0.01
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group

Note that the result from R is somewhat more precise (1366 subjects per group) because R
uses more decimal places in the calculation than the hand calculation does.

Before leaving this chapter, I would like to stress once more that sample size
estimation is an extremely important step in designing a scientifically meaningful
study, because it can decide whether the study succeeds or fails. Before estimating
sample size, the investigator needs to know something (or at least have some specific
hypotheses) about the question of interest. Sample size estimation needs the parameters
discussed at the start of the chapter, and without them it cannot be done. For an
entirely new study, one that nobody has done before, the parameters for effect size and
measurement variability may not be available, and the investigator will need to run a
simulation or a pilot study to obtain them. Estimating sample size by simulation is a
fairly specialized research area that is beyond the scope of this book, but the reader
can learn more about it from more advanced statistics textbooks.

The above is a quick guide to help the reader use R for data analysis and graphics. This
document is essentially a summary of the book "Phân tích số liệu và tạo biểu đồ bằng R:
hướng dẫn và thực hành" (Data analysis and graphics with R: a guide with practice),
published by Nhà xuất bản Đại học Quốc gia Thành phố Hồ Chí Minh in 2006. Details of the
theory and of other methods such as survival analysis, statistical model building,
simulation, programming, and so on can be found in that book.

14. References

At present the library of books on R is still fairly modest compared with that for
commercial packages such as SAS and SPSS. However, in today's era of extraordinary
progress in internet information and globalization, printed books and books published on
websites are no longer so different. Most guidance on using R can be found scattered
across the websites of universities and individuals around the world. In this section I
list only some books that readers who need further reference should look for. In writing
the book that you are holding, I also consulted a number of books and websites, which I
list below with a few personal comments.

The main reference on R is the paper by its two creators: Ihaka R, Gentleman R. R: A
language for data analysis and graphics. Journal of Computational and Graphical
Statistics 1996; 5:299-314.

- Data Analysis and Graphics Using R: An Example Approach (Cambridge University Press,
  2003) by John Maindonald, now reprinted in a second edition with a new co-author, John
  Braun. This is a very useful book for anyone who wants to learn R. The first five
  chapters are written for readers who have never used R, and the later chapters for
  readers who already use R fluently.

- Introductory Statistics With R (Springer, 2004) by Peter Dalgaard is a basic book on R
  aimed at readers who know nothing about R. The book is fairly short (only about 200
  pages) but rather expensive!

- Linear Models with R (Chapman & Hall/CRC, 2004) by Julian Faraway. The book can
  currently be downloaded free of charge from
  http://www.stat.lsa.umich.edu/~faraway/book/pra.pdf or
  http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf. The document is 213 pages long.

- R Graphics (Computer Science and Data Analysis) (Chapman & Hall/CRC, 2005) by Paul
  Murrell. This book is devoted to graphics in R. It contains a great deal of code so
  that readers can design complex, colourful plots themselves.

- Modern Applied Statistics with S-Plus (Springer, 4th Edition, 2003) by W. N. Venables
  and B. D. Ripley was written for the S-Plus language, but all the commands and code in
  it can be used in R without change. (S-Plus is the predecessor of R, but S-Plus is
  commercial software whereas R is completely free!) It can be regarded as the reference
  book for anyone who wants to go further with R. The two authors are also authoritative
  experts on the R language. The book is for readers with a high level of computing and
  statistics.

Important or useful websites on R

- Many reference documents can be downloaded from the official R website:
  http://cran.R-project.org/other-docs.html
  They include important documents such as An Introduction to R by W. N. Venables and
  B. D. Ripley.
  Internet address: http://cran.r-project.org/doc/manuals/R-intro.pdf.

- Several guides to using R can be downloaded (free of charge) for reference:

  R for Beginners (57 pages) by Emmanuel Paradis. Written for readers who are new to R.
  Internet address: http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf.

  Using R for Data Analysis and Graphics: Introduction, Code and Commentary (35 pages)
  by John Maindonald is a summary of the basic R commands and functions for data
  analysis and graphics. Its subject is very close to that of the book you are reading.
  Internet address: http://cran.r-project.org/doc/contrib/usingR.pdf

  Statistical Analysis with R: a quick start (46 pages) by Oleg Nenadic and Walter
  Zucchini. A guide to applying R to statistical analysis and graphics.
  Internet address: http://www.statoek.wiso.uni-goettingen.de/mitarbeiter/ogi/pub/r_workshop.pdf

  A Brief Guide to R for Beginners in Econometrics (31 pages) by M. Arai. Written mainly
  for econometric analysts.
  Internet address: http://people.su.se/~ma/R_intro

  Notes on the use of R for psychology experiments and questionnaires (39 pages) by
  Jonathan Baron and Yuelin Li. Written for researchers in psychology and sociology.
  Includes examples of log-linear models and some analysis of variance models used in
  psychology.
  Internet address: http://www.psych.upenn.edu/~baron/rpsych/rpsych.html

- StatsRus is a collection of tips for using R more effectively (about 80 pages).
  Internet address: http://lark.cc.ukans.edu/pauljohn/R/statsRus.html

- Finally, there is a guide to using R for data analysis and graphics (about 50 pages,
  regularly updated) that I wrote in Vietnamese. The website www.R.ykhoa.net is
  essentially a summary of the main chapters of this book. It also hosts all the
  datasets and code used in the book, so that readers can download them to their own
  computers.

15. Glossary of terms used in the book

English                                  Vietnamese
95% confidence interval                  Khoảng tin cậy 95%
Akaike Information Criterion (AIC)       Tiêu chuẩn thông tin Akaike
Analysis of covariance                   Phân tích hiệp biến
Analysis of variance (ANOVA)             Phân tích phương sai
Bar chart                                Biểu đồ thanh
Binomial distribution                    Phân phối nhị phân
Box plot                                 Biểu đồ hình hộp
Categorical variable                     Biến thứ bậc
Clock chart                              Biểu đồ đồng hồ
Coefficient of correlation               Hệ số tương quan
Coefficient of determination             Hệ số xác định bội
Coefficient of heterogeneity             Hệ số bất đồng nhất
Combination                              Tổ hợp
Continuous variable                      Biến liên tục
Correlation                              Tương quan
Covariance                               Hợp biến
Cross-over experiment                    Thí nghiệm giao chéo
Cumulative probability distribution      Hàm phân phối tích lũy
Degree of freedom                        Bậc tự do
Determinant                              Định thức
Discrete variable                        Biến rời rạc
Dot chart                                Biểu đồ điểm
Estimate                                 Ước số
Estimator                                Hàm ước lượng thống kê
Factorial analysis of variance           Phân tích phương sai cho thí nghiệm giai thừa
Fixed effects                            Ảnh hưởng bất biến
Frequency                                Tần số
Function                                 Hàm
Heterogeneity                            Bất đồng nhất
Histogram                                Biểu đồ tần số
Homogeneity                              Đồng nhất
Hypothesis test                          Kiểm định giả thiết
Inverse matrix                           Ma trận nghịch đảo
Latin square experiment                  Thí nghiệm hình vuông Latin
Least squares method                     Phương pháp bình phương nhỏ nhất
Linear logistic regression analysis      Phân tích hồi qui tuyến tính logistic
Linear regression analysis               Phân tích hồi qui tuyến tính
Matrix                                   Ma trận
Maximum likelihood method                Phương pháp hợp lí cực đại
Mean                                     Số trung bình
Median                                   Số trung vị
Meta-analysis                            Phân tích tổng hợp
Missing value                            Giá trị không
Model                                    Mô hình
Multiple linear regression analysis      Phân tích hồi qui tuyến tính đa biến
Normal distribution                      Phân phối chuẩn
Object                                   Đối tượng
Parameter                                Thông số
Permutation                              Hoán vị
Pie chart                                Biểu đồ hình tròn
Poisson distribution                     Phân phối Poisson
Polynomial regression                    Hồi qui đa thức
Probability                              Xác suất
Probability density distribution         Hàm mật độ xác suất
P-value                                  Trị số P
Quantile                                 Hàm định bậc
Random effects                           Ảnh hưởng ngẫu nhiên
Random variable                          Biến ngẫu nhiên
Relative risk                            Tỉ số nguy cơ tương đối
Repeated measure experiment              Thí nghiệm tái đo lường
Residual                                 Phần dư
Residual mean square                     Trung bình bình phương phần dư
Residual sum of squares                  Tổng bình phương phần dư
Scalar matrix                            Ma trận vô hướng
Scatter plot                             Biểu đồ tán xạ
Significance                             Có ý nghĩa thống kê
Simulation                               Mô phỏng
Standard deviation                       Độ lệch chuẩn
Standard error                           Sai số chuẩn
Standardized normal distribution         Phân phối chuẩn chuẩn hóa
Survival analysis                        Phân tích biến cố
Transposed matrix                        Ma trận chuyển vị
Variable                                 Biến (biến số)
Variance                                 Phương sai
Weight                                   Trọng số
Weighted mean                            Trung bình trọng số