
10

Linear regression analysis


Linear regression analysis is perhaps one of the most widely used methods of
data analysis in statistics. Anon once wrote: "Give a person three weapons (the
correlation coefficient, linear regression, and a pen) and that person will use
all three!" In this chapter, I introduce how to use R for linear regression
analysis and for related methods such as the correlation coefficient and
statistical hypothesis testing.
Example 1. To illustrate the problem, consider the following study, in which
the investigators measured blood cholesterol in 18 male subjects. The body mass
index (BMI) of each subject was also estimated, using the formula BMI = weight
(in kg) divided by height squared (in m^2). The measurements are as follows:
Table 1. Age, body mass index and cholesterol

ID      Age     BMI     Cholesterol
(id)    (age)   (bmi)   (chol)
 1      46      25.4    3.5
 2      20      20.6    1.9
 3      52      26.2    4.0
 4      30      22.6    2.6
 5      57      25.4    4.5
 6      25      23.1    3.0
 7      28      22.7    2.9
 8      36      24.9    3.8
 9      22      19.8    2.1
10      43      25.3    3.8
11      57      23.2    4.1
12      33      21.8    3.0
13      22      20.9    2.5
14      63      26.7    4.6
15      40      26.4    3.2
16      48      21.2    4.2
17      28      21.2    2.3
18      49      22.8    4.0

A quick look at the data shows that the older the subject, the higher the
cholesterol. Let us enter these data into R and draw a scatter plot as follows:
> age <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
> bmi <- c(25.4,20.6,26.2,22.6,25.4,23.1,22.7,24.9,19.8,25.3,23.2,
           21.8,20.9,26.7,26.4,21.2,21.2,22.8)
> chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
            2.5,4.6,3.2,4.2,2.3,4.0)
> data <- data.frame(age, bmi, chol)
> plot(chol ~ age, pch=16)

[Scatter plot: chol (vertical axis) against age (horizontal axis)]

Figure 10.1. Relationship between age and cholesterol.


Figure 10.1 above suggests that the relationship between age and cholesterol is
a straight line (linear). To "measure" this relationship, we can use the
coefficient of correlation.

10.1 Correlation coefficient

The correlation coefficient (r) is a statistic that measures the association
between two variables, such as age (x) and cholesterol (y). It takes values
from -1 to 1. A coefficient of 0 (or close to 0) means that the two variables
are unrelated; conversely, a coefficient of -1 or 1 means that the two
variables have a perfect relationship. A negative value (r < 0) means that as x
increases y decreases (and vice versa, as x decreases y increases); a positive
value (r > 0) means that as x increases y also increases, and as x decreases y
also decreases.

There are in fact many correlation coefficients in statistics, but here I
present the three most commonly used: Pearson's r, Spearman's ρ and Kendall's τ.
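10.1.1 Pearson correlation coefficient
For two variables x and y measured on n samples, the Pearson correlation
coefficient is estimated by the following formula:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

Here $\bar{x}$ and $\bar{y}$ are the means of the variables x and y. To
estimate the correlation coefficient between age and chol, we can use the
function cor(x,y) as follows: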
> cor(age, chol)
[1] 0.936726
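The formula above can also be evaluated step by step. The following is a minimal sketch, assuming the Example 1 data; the names dx, dy and r_manual are illustrative and do not appear in the original text.

```r
# Example 1 data (age in years, blood cholesterol)
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

# Deviations of each observation from its mean
dx <- age  - mean(age)
dy <- chol - mean(chol)

# Pearson r: sum of cross-products over the square root of the
# product of the two sums of squares
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_manual          # hand-computed value
cor(age, chol)    # built-in value, for comparison
```

Both lines should print the same number, which shows that cor() is simply a shortcut for the formula.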

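We can test the hypothesis that the correlation coefficient equals 0 (that is,
that x and y are unrelated). R provides the function cor.test to carry out the
calculation: it reports a t statistic for the test and uses Fisher's z
transformation for the confidence interval.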
> cor.test(age, chol)
Pearson's product-moment correlation
data: age and chol
t = 10.7035, df = 16, p-value = 1.058e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8350463 0.9765306
sample estimates:
cor
0.936726

The analysis gives a test statistic of t = 10.70 with p = 1.058e-08; we
therefore have evidence to conclude that the relationship between age and
cholesterol is statistically significant. This is the same conclusion that the
linear regression analysis in section 10.2 reaches.
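To make the cor.test output less of a black box, the sketch below recomputes its t statistic, p-value and 95% confidence interval by hand. It assumes the Example 1 data; the variable names (t_stat, p_value, ci) are illustrative.

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

r <- cor(age, chol)
n <- length(age)

# t statistic for H0: correlation = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval through Fisher's z transformation:
# z = atanh(r) is approximately normal with standard error 1/sqrt(n - 3)
z  <- atanh(r)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

t_stat
p_value
ci
```

The three results should agree with the t, p-value and confidence interval lines printed by cor.test(age, chol).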
10.1.2 Spearman correlation coefficient
The Pearson correlation coefficient is valid only if the variables x and y
follow a normal distribution. If they do not, we must use another correlation
coefficient called Spearman's, a nonparametric method. This coefficient is
estimated by converting x and y into ranks and then computing the correlation
between the two series of ranks; hence its English name, Spearman's rank
correlation. R estimates the Spearman correlation coefficient with the function
cor.test and the argument method="spearman", as follows:
> cor.test(age, chol, method="spearman")
Spearman's rank correlation rho
data: age and chol
S = 51.1584, p-value = 2.57e-09
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.947205
Warning message:
Cannot compute exact p-values with ties in: cor.test.default(age,
chol, method = "spearman")

The analysis gives rho = 0.947 with p = 2.57e-09. This result agrees with the
linear regression analysis: the relationship between age and cholesterol is
very strong and statistically significant.
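The description of Spearman's coefficient as "the correlation between ranks" can be checked directly. A minimal sketch, assuming the Example 1 data (rho_manual is an illustrative name):

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

# Replace each value by its rank (tied values receive the average rank),
# then apply the ordinary Pearson formula to the ranks
rho_manual <- cor(rank(age), rank(chol))

rho_manual
cor(age, chol, method = "spearman")   # built-in value, for comparison
```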
10.1.3 Kendall correlation coefficient
The Kendall correlation coefficient (also a nonparametric method) is estimated
by finding the "concordant" pairs of observations. Two observations (x, y) form
a concordant pair when the difference between them along the horizontal axis
has the same sign (positive or negative) as the difference along the vertical
axis. If x and y are unrelated, the number of concordant pairs is equal to, or
about the same as, the number of discordant pairs.
Because many pairs have to be checked, computing the Kendall coefficient takes a
fair amount of computer time. However, for a data set of fewer than 5000
subjects a personal computer can do the calculation quite easily. R uses the
function cor.test with the argument method="kendall" to estimate the Kendall
correlation coefficient:
> cor.test(age, chol, method="kendall")
Kendall's rank correlation tau
data: age and chol
z = 4.755, p-value = 1.984e-06
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.8333333
Warning message:
Cannot compute exact p-value with ties in: cor.test.default(age,
chol, method = "kendall")
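The pair-counting idea behind Kendall's coefficient can be sketched as follows. This assumes the Example 1 data and uses the tie-corrected form (often called tau-b), since the data contain tied ages; the variable names are illustrative.

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

# All distinct pairs of subjects (i, j) with i < j
pairs_ij <- combn(length(age), 2)
dx <- age[pairs_ij[1, ]]  - age[pairs_ij[2, ]]
dy <- chol[pairs_ij[1, ]] - chol[pairs_ij[2, ]]

# Concordant: differences share a sign; discordant: opposite signs
concordant <- sum(sign(dx) * sign(dy) > 0)
discordant <- sum(sign(dx) * sign(dy) < 0)

# Tie-corrected denominator: pairs not tied on x times pairs not tied on y
tau_manual <- (concordant - discordant) /
  sqrt(sum(dx != 0) * sum(dy != 0))

c(concordant = concordant, discordant = discordant)
tau_manual
cor(age, chol, method = "kendall")   # built-in value, for comparison
```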

The Kendall analysis confirms once more that the relationship between age and
cholesterol is statistically significant, with tau = 0.833 and p = 1.98e-06.
The correlation coefficients above measure the degree of association between
two variables, but they do not give us an equation linking the two variables.
The task, then, is to find a linear equation that describes the relationship.
For this we use the linear regression model.

10.2 Simple linear regression model

10.2.1 A few lines of theory
To make the model easier to follow and describe, let the age of individual i be
$x_i$ and the cholesterol be $y_i$, where i = 1, 2, 3, ..., 18. The linear
regression model states that:

$$ y_i = \alpha + \beta x_i + \varepsilon_i $$    [1]

In other words, the equation assumes that an individual's cholesterol equals a
constant α, plus a coefficient β multiplied by age, plus an error
$\varepsilon_i$. In this equation α is the intercept (the value of y when
$x_i = 0$) and β is the slope (or gradient). α and β are two parameters (also
called regression coefficients), and $\varepsilon_i$ is a random variable that
follows a normal distribution with mean 0 and variance $\sigma^2$.
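The parameters α, β and $\sigma^2$ must be estimated from the data. They are
estimated by the least squares method. As the name suggests, the least squares
method finds the values of α and β for which

$$ \sum_{i=1}^{n} \left[ y_i - (\alpha + \beta x_i) \right]^2 $$

is smallest. After a few algebraic steps, it can be shown that the estimates of
α and β that satisfy this condition are:

$$ \hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$    [2]

$$ \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} $$    [3]

Here $\bar{x}$ and $\bar{y}$ are the means of x and y. Note that I write
$\hat{\alpha}$ and $\hat{\beta}$ (with a hat) as a reminder that these are
estimates of α and β, not α and β themselves (we never know α and β exactly;
we can only estimate them).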

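Once we have the estimates $\hat{\alpha}$ and $\hat{\beta}$, we can estimate
the mean cholesterol for each age as follows:

$$ \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i $$

Of course, $\hat{y}_i$ is only the mean for age $x_i$, and the remainder
($y_i - \hat{y}_i$) is called the residual. The variance of the residuals can
be estimated as follows:

$$ s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2} $$    [4]

$s^2$ is the estimate of $\sigma^2$.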
In linear regression analysis we usually want to know whether the coefficient β
is 0 or different from 0. If β equals 0, there is no relationship between x and
y; if β differs from 0, we have evidence that x and y are related. To test the
hypothesis β = 0 we use the following t test:

$$ t = \frac{\hat{\beta}}{SE(\hat{\beta})} $$    [5]

$SE(\hat{\beta})$ is the standard error of the estimate $\hat{\beta}$. In this
equation, t follows a t distribution with n-2 degrees of freedom (if β is
truly 0).
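Equations [2] to [5] can be evaluated directly in R before turning to the built-in regression function. The sketch below assumes the Example 1 data; the expression used for the standard error, SE = sqrt(s^2 / sum((x - mean(x))^2)), is the standard one for simple regression and is an addition here, since the text does not state it.

```r
x <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)   # age
y <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
       2.5,4.6,3.2,4.2,2.3,4.0)                                  # chol
n <- length(x)

sxx <- sum((x - mean(x))^2)

beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sxx   # equation [2]
alpha_hat <- mean(y) - beta_hat * mean(x)               # equation [3]

y_hat <- alpha_hat + beta_hat * x                       # fitted means
s2    <- sum((y - y_hat)^2) / (n - 2)                   # equation [4]

se_beta <- sqrt(s2 / sxx)        # standard error of beta_hat (assumed formula)
t_beta  <- beta_hat / se_beta    # equation [5]

c(alpha = alpha_hat, beta = beta_hat, s2 = s2, t = t_beta)
```

These values can be compared with the output of lm in section 10.2.2.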
10.2.2 Simple linear regression analysis with R
The function lm (short for linear model) in R computes the values of
$\hat{\alpha}$ and $\hat{\beta}$, as well as $s^2$, quickly. We continue the
example in R as follows:

> lm(chol ~ age)

Call:
lm(formula = chol ~ age)

Coefficients:
(Intercept)          age
    1.08922      0.05779

In the command above, chol ~ age means "describe chol as a function of age".
The output of lm shows that $\hat{\alpha}$ = 1.0892 and $\hat{\beta}$ =
0.05779. In other words, with these two parameters we can estimate cholesterol
for any age within the age range of the sample using the linear equation:

$$ \hat{y}_i = 1.08922 + 0.05779 \times age_i $$

This equation means that when age increases by 1 year, cholesterol increases by
about 0.058 mmol/L.
In fact, lm provides much more information, but we have to store that
information in an object. Calling the object reg, the commands are:
> reg <- lm(chol ~ age)
> summary(reg)

Call:
lm(formula = chol ~ age)

Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775,     Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF,  p-value: 1.058e-08

The second command, summary(reg), asks R to list the information computed in
reg. The output has 3 parts:
(a) Part 1 describes the residuals of the regression model:

Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

We know that the mean of the residuals must be 0, and here the median is
-0.045, not far from 0. The 25% (1Q) and 75% (3Q) quantiles are also fairly
symmetric about the median, which indicates that the residuals of this equation
are reasonably symmetric.

(b) Part 2 presents the estimates of α and β together with their standard
errors and the values of the t test. The t value for $\hat{\beta}$ is 10.70
with p = 1.06e-08, which indicates that β is not 0. In other words, we have
evidence that there is a relationship between cholesterol and age, and that
this relationship is statistically significant.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(c) Part 3 of the output gives information about the variance of the residuals
(residual mean square). R reports the residual standard error, s = 0.3027, so
here $s^2$ = 0.3027^2, or about 0.0916. The output also contains an F test,
which likewise tests whether β is truly 0 and therefore carries the same
meaning as the t test in the previous part (with a single predictor F equals
t squared: 10.704^2 is about 114.6). In general, in a simple linear regression
analysis (with one factor) we do not need to pay attention to the F test.

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775,     Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF,  p-value: 1.058e-08

In addition, part 3 gives us an important piece of information, the value of
$R^2$, or the coefficient of multiple determination. This coefficient is
estimated by the formula:

$$ R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$    [6]

That is, the sum of squared differences between the estimated values and the
mean, divided by the sum of squared differences between the observed values and
the mean. The value of $R^2$ in this example is 0.8775, which means that the
linear equation (with age as the single factor) explains about 88% of the
differences in cholesterol among individuals. $R^2$ naturally takes values from
0 to 1 (or 0 to 100%). The higher the value of $R^2$, the tighter the
relationship between the two variables age and cholesterol.
Another coefficient worth mentioning here is the adjusted coefficient of
determination (which R calls Adjusted R-squared in the output above). This
coefficient tells us how much the residual variance improves when the factor
age is present in the linear model. In general it differs little from the
coefficient of determination, and we do not need to pay undue attention to it.
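The quantities in part 3 of the summary can be recomputed from the fitted model. The following is a minimal sketch, assuming the Example 1 data; the adjusted R-squared expression 1 - (1 - R^2)(n - 1)/(n - k - 1), with k predictors, is the usual definition and is not given in the text.

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)
reg <- lm(chol ~ age)

n <- length(chol)
k <- 1                                   # number of predictors

# Equation [6]: explained sum of squares over total sum of squares
ss_model <- sum((fitted(reg) - mean(chol))^2)
ss_total <- sum((chol - mean(chol))^2)
r2       <- ss_model / ss_total

# Adjusted R-squared (usual definition, assumed here)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Residual variance s^2 and residual standard error s
s2 <- sum(resid(reg)^2) / (n - 2)

c(R2 = r2, adjR2 = adj_r2, s2 = s2, s = sqrt(s2))
```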
10.2.3 Assumptions of linear regression analysis
All of the analyses above rest on several important assumptions:

(a) x is a fixed variable ("fixed" here means that it is measured without
random error);
(b) $\varepsilon_i$ follows a normal distribution;
(c) $\varepsilon_i$ has mean 0;
(d) $\varepsilon_i$ has a constant variance $\sigma^2$ for all $x_i$; and
(e) successive values of $\varepsilon_i$ are uncorrelated (in other words,
$\varepsilon_1$ and $\varepsilon_2$ are unrelated).
If these assumptions are not met, the validity of the estimated equation is in
question. Therefore, before presenting and interpreting the model above, we
need to check whether the assumptions are met. In this case assumption (a) is
not a problem, because age is not a random variable and there is no error in
determining a person's age.
For assumptions (b) to (e), the simplest yet most effective check is to examine
the relationships among $\hat{y}_i$, $x_i$ and the residuals $e_i$
($e_i = y_i - \hat{y}_i$) with scatter plots.
With the command fitted() we can compute $\hat{y}_i$ for each individual as
follows (for example, for individual 1, aged 46, the predicted cholesterol is
1.08922 + 0.05779 x 46 = 3.747).
> fitted(reg)
       1        2        3        4        5        6        7        8
3.747483 2.244985 4.094214 2.822869 4.383156 2.533927 2.707292 3.169600
       9       10       11       12       13       14       15       16
2.360562 3.574118 4.383156 2.996234 2.360562 4.729886 3.400753 3.863060
      17       18
2.707292 3.920849

With the command resid() we can compute the residual $e_i$ for each individual
as follows (for subject 1, $e_1$ = 3.5 - 3.74748 = -0.24748):
> resid(reg)
           1            2            3            4            5            6
-0.247483426 -0.344985415 -0.094213736 -0.222869265  0.116844338  0.466072660
           7            8            9           10           11           12
 0.192707505  0.630400424 -0.260562185  0.225881729 -0.283155662  0.003765579
          13           14           15           16           17           18
 0.139437815 -0.129885972 -0.200753116  0.336939804 -0.407292495  0.079151419

To check the assumptions above, we can draw a set of 4 plots, which I explain
below:

> op <- par(mfrow=c(2,2))   # ask R to set aside 4 plotting panels
> plot(reg)                 # draw the diagnostic plots for reg

[Four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location,
Residuals vs Leverage]

Figure 10.2. Residual analysis to check the assumptions of linear regression
analysis.
(a) The plot on the left of row 1 shows the residuals $e_i$ against the
predicted cholesterol values $\hat{y}_i$. The residuals scatter around the line
y = 0, so assumption (c), that $\varepsilon_i$ has mean 0, is acceptable.
(b) The plot on the right of row 1 shows the residuals against the values
expected under a normal distribution. The residuals lie very close to the
normal line, so assumption (b), that $\varepsilon_i$ follows a normal
distribution, can also be considered met.
(c) The plot on the left of row 2 shows the square root of the standardized
residuals against $\hat{y}_i$. It shows no systematic difference in the
standardized residuals across the values of $\hat{y}_i$, so assumption (d),
that $\varepsilon_i$ has a constant variance $\sigma^2$ for all $x_i$, can also
be considered met.

Overall, the residual analysis lets us conclude that the linear regression
model describes the relationship between age and cholesterol reasonably
completely and validly.
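The graphical checks can be complemented by a couple of numerical ones. The sketch below is an addition to the text, not part of the original analysis: it assumes the Example 1 data and uses the Shapiro-Wilk test (shapiro.test) as one possible formal check of assumption (b).

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)
reg <- lm(chol ~ age)
e   <- resid(reg)

# Assumption (c): least squares residuals average to zero by construction
mean_e <- mean(e)

# Assumption (b): Shapiro-Wilk test of normality; a large p-value means
# the data give no evidence against normality
sw <- shapiro.test(e)

mean_e
sw
```

With only 18 observations such a test has limited power, so it is best read alongside the plots rather than instead of them.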
10.2.4 Predictive model
Once the model for predicting cholesterol has been checked and its validity
established, we can draw the line describing the relationship between age and
cholesterol with the command abline, as follows (recall that the object holding
the analysis is reg):

> plot(chol ~ age, pch=16)
> abline(reg)

[Scatter plot of chol against age with the fitted regression line]

Figure 10.3. Line describing the relationship between age and cholesterol.

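Each value $\hat{y}_i$ is computed from the estimates $\hat{\alpha}$ and
$\hat{\beta}$, and these estimates have standard errors, so the predicted value
$\hat{y}_i$ also has error. In other words, $\hat{y}_i$ is only a mean; in
practice it may be higher or lower depending on the sample. The 95% confidence
interval can be estimated in R with the following commands: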
> reg <- lm(chol ~ age)
> new <- data.frame(age = seq(15, 70, 5))
> pred.w.plim <- predict.lm(reg, new, interval="prediction")
> pred.w.clim <- predict.lm(reg, new, interval="confidence")
> resc <- cbind(pred.w.clim, new)
> resp <- cbind(pred.w.plim, new)
> plot(chol ~ age, pch=16)
> lines(resc$fit ~ resc$age)
> lines(resc$lwr ~ resc$age, col=2)
> lines(resc$upr ~ resc$age, col=2)
> lines(resp$lwr ~ resp$age, col=4)
> lines(resp$upr ~ resp$age, col=4)

[Scatter plot of chol against age with the fitted line, confidence limits and
prediction limits]

Figure 10.4. Predicted values and 95% confidence intervals.


The figure shows the predicted mean $\hat{y}_i$ (the black straight line), and
the red lines are the 95% confidence interval of that mean. The blue lines are
the interval for the predicted cholesterol of a new individual of a given age
in the population.
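The difference between the two kinds of interval is easiest to see for a single age. The following sketch assumes the Example 1 data and a hypothetical new subject aged 50; the names new_person, conf_int and pred_int are illustrative.

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)
reg <- lm(chol ~ age)

new_person <- data.frame(age = 50)   # hypothetical subject

# Interval for the MEAN cholesterol of all 50-year-olds
conf_int <- predict(reg, new_person, interval = "confidence")

# Interval for the cholesterol of ONE new 50-year-old
pred_int <- predict(reg, new_person, interval = "prediction")

conf_int
pred_int
```

Both calls return the same fitted value, but the prediction interval is wider because it adds the person-to-person variation $s^2$ to the uncertainty of the estimated mean.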

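10.3 Multiple linear regression model

The model expressed by equation [1], $y_i = \alpha + \beta x_i + \varepsilon_i$,
has a single factor (namely x), and for that reason it is usually called the
simple linear regression model. In practice we can extend this model to several
variables instead of just one, for example:

$$ y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i $$    [7]

or, written out in full:

$$ y_1 = \alpha + \beta_1 x_{11} + \beta_2 x_{21} + \dots + \beta_k x_{k1} + \varepsilon_1 $$
$$ y_2 = \alpha + \beta_1 x_{12} + \beta_2 x_{22} + \dots + \beta_k x_{k2} + \varepsilon_2 $$
$$ y_3 = \alpha + \beta_1 x_{13} + \beta_2 x_{23} + \dots + \beta_k x_{k3} + \varepsilon_3 $$
$$ \vdots $$
$$ y_n = \alpha + \beta_1 x_{1n} + \beta_2 x_{2n} + \dots + \beta_k x_{kn} + \varepsilon_n $$

Note that in this equation we have several x variables ($x_1$, $x_2$, up to
$x_k$), and each variable has a parameter $\beta_j$ (j = 1, 2, ..., k) that
must be estimated. For that reason the model is also called the multiple linear
regression model.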
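The method for estimating $\beta_j$ is again mainly the least squares method.
Let $\hat{y}_i = \hat{\alpha} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \dots + \hat{\beta}_k x_{ki}$
be the estimate of $y_i$; the least squares method finds the values
$\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_k$ for which

$$ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

is smallest. For the multiple linear regression model, the most compact way of
writing and describing the model is matrix notation. Model [7] can be written
in matrix notation as:

$$ Y = X\beta + \varepsilon $$

Here Y is an n x 1 vector, X is an n x (k+1) matrix whose first column of 1s
corresponds to the intercept, β is a (k+1) x 1 vector, and ε is an n x 1
vector:

$$
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & x_{11} & x_{21} & \dots & x_{k1} \\
1 & x_{12} & x_{22} & \dots & x_{k2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{1n} & x_{2n} & \dots & x_{kn}
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \alpha \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}
$$

The least squares method solves for the vector β with the following equation:

$$ \hat{\beta} = (X^T X)^{-1} X^T Y $$

and the residual sum of squares is:

$$ \hat{\varepsilon}^T \hat{\varepsilon} = (Y - X\hat{\beta})^T (Y - X\hat{\beta}) $$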
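The matrix solution can be tried out directly. The following is a minimal sketch, assuming the age, bmi and chol values of Example 1 as data; X, Y, beta_hat and rss are illustrative names. It solves the normal equations with solve() rather than forming the inverse explicitly, which gives the same result.

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
bmi  <- c(25.4,20.6,26.2,22.6,25.4,23.1,22.7,24.9,19.8,25.3,23.2,
          21.8,20.9,26.7,26.4,21.2,21.2,22.8)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

# Design matrix: a column of 1s for the intercept, then the predictors
X <- cbind(intercept = 1, age = age, bmi = bmi)
Y <- chol

# beta_hat solves (X'X) beta = X'Y
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

# Residual sum of squares (Y - X beta_hat)'(Y - X beta_hat)
res <- Y - X %*% beta_hat
rss <- as.numeric(t(res) %*% res)

beta_hat
rss
```

The coefficients should match those reported by lm(chol ~ age + bmi), which uses a numerically more stable QR decomposition internally.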

Example 2. We return to the study of the relationship between age, bmi and
cholesterol. In the earlier example we only examined the relationship between
age and cholesterol, and did not consider the relationship of both factors, age
and bmi, with cholesterol. The following figure shows the relationships among
the three variables:
> pairs(data)

[Scatter plot matrix of age, bmi and chol]

Figure 10.5. Pairwise relationships among age, bmi and cholesterol.

As with age and cholesterol, the relationship between bmi and cholesterol also
follows roughly a straight line. The figure also shows that age and bmi are
related to each other. Indeed, a simple linear regression of cholesterol on bmi
shows that this relationship is statistically significant:
> summary(lm(chol ~ bmi))

Call:
lm(formula = chol ~ bmi)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9403 -0.3565 -0.1376  0.3040  1.4330

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.83187    1.60841  -1.761  0.09739 .
bmi          0.26410    0.06861   3.849  0.00142 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.623 on 16 degrees of freedom
Multiple R-Squared: 0.4808,     Adjusted R-squared: 0.4483
F-statistic: 14.82 on 1 and 16 DF,  p-value: 0.001418

BMI explains about 48% of the variation in cholesterol among individuals. But
because BMI is also related to age, we want to know which factor is more
important when the two are analysed together. To assess the effect of both
factors, age ($x_1$) and bmi ($x_2$), on cholesterol (y) we use a multiple
linear regression model, and the model is:

$$ y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i $$

or, in the matrix notation just presented, $Y = X\beta + \varepsilon$. Here Y
is an 18 x 1 vector, X is an 18 x 3 matrix (a column of 1s for the intercept
plus the two predictors), β is a 3 x 1 vector, and ε is an 18 x 1 vector. To
estimate the two regression coefficients $\beta_1$ and $\beta_2$ we again use
the function lm() in R, as follows:
> mreg <- lm(chol ~ age + bmi)
> summary(mreg)

Call:
lm(formula = chol ~ age + bmi)

Residuals:
    Min      1Q  Median      3Q     Max
-0.3762 -0.2259 -0.0534  0.1698  0.5679

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.455458   0.918230   0.496    0.627
age         0.054052   0.007591   7.120 3.50e-06 ***
bmi         0.033364   0.046866   0.712    0.487
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3074 on 15 degrees of freedom
Multiple R-Squared: 0.8815,     Adjusted R-squared: 0.8657
F-statistic: 55.77 on 2 and 15 DF,  p-value: 1.132e-07

The analysis gives the estimates $\hat{\alpha}$ = 0.455, $\hat{\beta}_1$ =
0.054 and $\hat{\beta}_2$ = 0.0334. In other words, the equation for estimating
cholesterol from the two variables age and bmi is:

Cholesterol = 0.455 + 0.054(age) + 0.0334(bmi)

The equation says that when age increases by 1 year cholesterol increases by
0.054 mmol/L (an estimate that differs little from the 0.0578 in the equation
with age alone), and that each 1 kg/m2 increase in BMI raises cholesterol by
0.0334 mmol/L. The two factors together explain about 88.2% ($R^2$ = 0.8815) of
the variation in cholesterol among individuals.
Recall that the equation with age alone (in the previous analysis) explains
about 87.7% of the variation in cholesterol among individuals. When we add BMI,
the figure rises to 88.2%, an increase of only about 0.5 percentage points. The
question is whether this increase is statistically significant. The answer can
be read from the test of the bmi term, with p = 0.487. Thus bmi gives us no
additional information for predicting cholesterol beyond what we already have
from age. In other words, once age is taken into account, the effect of bmi is
no longer statistically significant. This is understandable, because Figure
10.5 shows that age and bmi are quite strongly related. Because the two
variables are correlated with each other, we do not need both in the equation.
(This example is meant only to illustrate how to carry out a multiple linear
regression analysis in R; it is not intended to model the data from a
biological standpoint.)
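The question of whether adding bmi improves the model can also be put to R as a comparison of two nested models. The sketch below assumes the Example 1 data; using anova() for this comparison is an addition to the text, and the names cmp, p_f and p_t are illustrative.

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
bmi  <- c(25.4,20.6,26.2,22.6,25.4,23.1,22.7,24.9,19.8,25.3,23.2,
          21.8,20.9,26.7,26.4,21.2,21.2,22.8)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

reg  <- lm(chol ~ age)         # model with age only
mreg <- lm(chol ~ age + bmi)   # model with age and bmi

# F test for the reduction in residual sum of squares due to bmi
cmp <- anova(reg, mreg)
cmp

# With a single added term, this F test and the t test for bmi in
# summary(mreg) give the same p-value
p_f <- cmp[["Pr(>F)"]][2]
p_t <- summary(mreg)$coefficients["bmi", "Pr(>|t|)"]
c(F_test = p_f, t_test = p_t)
```

Both p-values should correspond to the 0.487 shown for bmi in the summary output above.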

[Four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location,
Residuals vs Leverage]

Figure 10.6. Residual analysis to check the assumptions of multiple linear
regression analysis.

Although BMI is not statistically significant in this case, Figure 10.6 shows
that the assumptions of the linear regression model can be considered met.

10.4 Polynomial regression analysis

A natural extension of regression with several independent variables is
polynomial regression. The multiple regression model describes a dependent
variable as a linear function of several independent variables, whereas the
polynomial regression model describes a dependent variable as a non-linear
function of a single independent variable (the model remains linear in its
parameters).
In mathematical terms, the polynomial regression model looks for a relationship
between the dependent variable y and the independent variable x of the form:

$$ y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_p x_i^p + \varepsilon_i $$

Here the parameters $\beta_j$ (j = 1, 2, 3, ..., p) are the coefficients that
measure the relationship between y and x, and $\varepsilon_i$ is the error term
of the model, assumed to follow a normal distribution with mean 0 and variance
$\sigma^2$. Given a series of pairs ($y_1$, $x_1$), ($y_2$, $x_2$),
($y_3$, $x_3$), ..., ($y_n$, $x_n$), we can apply the least squares method to
estimate $\beta_j$ and $\sigma^2$.
It is easy to see that the polynomial regression model is also a direct
extension of the simple linear regression model. If $\beta_2$ = 0, $\beta_3$ =
0, ..., and $\beta_p$ = 0, the model reduces to the one-variable linear
regression model met at the beginning of this chapter. If
$y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$, the model is
simply a quadratic equation, and so on.
Example 3. The following experiment looks for a relationship between hardwood
concentration and the tensile strength of a material. Nineteen materials with
different hardwood concentrations were tested to measure their tensile
strength, and the results are summarized in the following table:

Id      Hardwood            Tensile
        concentration (x)   strength (y)
 1       1.0                 6.3
 2       1.5                11.1
 3       2.0                20.0
 4       3.0                24.0
 5       4.0                26.1
 6       4.5                30.0
 7       5.0                33.8
 8       5.5                34.0
 9       6.0                38.1
10       6.5                39.9
11       7.0                42.0
12       8.0                46.1
13       9.0                53.1
14      10.0                52.0
15      11.0                52.5
16      12.0                48.0
17      13.0                42.8
18      14.0                27.8
19      15.0                21.9

Before analysing these data, we need to enter them into R with the usual
commands:

> id <- 1:19
> conc <- c(1.0, 1.5, 2.0, 3.0, 4.0, 4.5, 5.0, 5.5, 6.0,
            6.5, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0)
> strength <- c(6.3, 11.1, 20.0, 24.0, 26.1, 30.0, 33.8, 34.0, 38.1,
                39.9, 42.0, 46.1, 53.1, 52.0, 52.5, 48.0, 42.8, 27.8, 21.9)
> data <- data.frame(id, conc, strength)

Let us first try a simple linear regression model with the commands:


> simple.model <- lm(strength ~ conc)
> summary(simple.model)

Call:
lm(formula = strength ~ conc)

Residuals:
    Min      1Q  Median      3Q     Max
-25.986  -3.749   2.938   7.675  15.840

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.3213     5.4302   3.926  0.00109 **
conc          1.7710     0.6478   2.734  0.01414 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.82 on 17 degrees of freedom
Multiple R-Squared: 0.3054,     Adjusted R-squared: 0.2645
F-statistic: 7.474 on 1 and 17 DF,  p-value: 0.01414

The output shows that this simple linear regression model (strength = 21.32 +
1.77*conc) explains about 31% of the variance of strength. The estimated
residual variance of this model is $s^2$ = (11.82)^2 = 139.7.
Now let us look at the plot and the fitted line of this model:
> plot(strength ~ conc,
       xlab="Concentration of hardwood",
       ylab="Tensile strength",
       main="Relationship between hardwood concentration \n and tensile strength",
       pch=16)
> abline(simple.model)

[Scatter plot of tensile strength against concentration of hardwood, with the
fitted straight line]

Figure 10.7. Relationship between hardwood concentration and the tensile
strength of the material. The straight line is the fit of the simple linear
regression model.

The figure shows clearly that the linear regression model is not appropriate
for these data, because the relationship between the two variables does not
follow a straight line but a curve. In other words, a quadratic model is
probably more appropriate. Letting y be strength and x be conc, we can write
the model as:

$$ y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i $$

We now use R to estimate the three parameters above.
> quadratic <- lm(strength ~ poly(conc, 2))
> summary(quadratic)

Call:
lm(formula = strength ~ poly(conc, 2))

Residuals:
    Min      1Q  Median      3Q     Max
-5.8503 -3.2482 -0.7267  4.1350  6.5506

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      34.184      1.014  33.709 2.73e-16 ***
poly(conc, 2)1   32.302      4.420   7.308 1.76e-06 ***
poly(conc, 2)2  -45.396      4.420 -10.270 1.89e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.42 on 16 degrees of freedom
Multiple R-Squared: 0.9085,     Adjusted R-squared: 0.8971
F-statistic: 79.43 on 2 and 16 DF,  p-value: 4.912e-09

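This new model, with intercept 34.18 and coefficients 32.30 and -45.40 on the
first- and second-degree polynomial terms, explains about 91% of the variance
of y. Note that poly() generates orthogonal polynomials by default, so 32.30
and -45.40 are the coefficients of those orthogonal terms and not of x and x^2
themselves. The residual variance is now $s^2$ = (4.42)^2 = 19.5. Compared with
the linear model, this model is clearly much better.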
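To obtain coefficients that apply directly to x and x^2, the quadratic can be refitted with raw powers. The following is a minimal sketch, assuming the Example 3 data; the object names orth and raw are illustrative.

```r
conc <- c(1.0, 1.5, 2.0, 3.0, 4.0, 4.5, 5.0, 5.5, 6.0,
          6.5, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0)
strength <- c(6.3, 11.1, 20.0, 24.0, 26.1, 30.0, 33.8, 34.0, 38.1,
              39.9, 42.0, 46.1, 53.1, 52.0, 52.5, 48.0, 42.8, 27.8, 21.9)

# Orthogonal polynomial terms (the default behaviour of poly)
orth <- lm(strength ~ poly(conc, 2))

# Raw powers: the coefficients multiply conc and conc^2 directly
raw <- lm(strength ~ conc + I(conc^2))

coef(raw)

# Same fitted curve, different parameterisation
all.equal(unname(fitted(orth)), unname(fitted(raw)))
```

The two fits describe the same curve and have the same $R^2$; only the meaning of the coefficients differs. The orthogonal form is numerically better behaved, while the raw form is easier to write out as an equation in x.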
Let us now try a cubic (third-degree) model,
$y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_i$,
to see whether it describes y better than the quadratic model.
> cubic <- lm(strength ~ poly(conc, 3))
> summary(cubic)

Call:
lm(formula = strength ~ poly(conc, 3))

Residuals:
     Min       1Q   Median       3Q      Max
-4.62503 -1.61085  0.04125  1.58922  5.02159

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     34.1842     0.5931  57.641  < 2e-16 ***
poly(conc, 3)1  32.3021     2.5850  12.496 2.48e-09 ***
poly(conc, 3)2 -45.3963     2.5850 -17.561 2.06e-11 ***
poly(conc, 3)3 -14.5740     2.5850  -5.638 4.72e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.585 on 15 degrees of freedom
Multiple R-Squared: 0.9707,     Adjusted R-squared: 0.9648
F-statistic: 165.4 on 3 and 15 DF,  p-value: 1.025e-11

The cubic model describes y even better than the two previous models, with a
coefficient of determination ($R^2$) of 0.97, and all the parameters in the
model are statistically significant. The following figure compares the 3
models:
# refit the models above:
> linear <- lm(strength ~ conc)
> quadratic <- lm(strength ~ poly(conc, 2))
> cubic <- lm(strength ~ poly(conc, 3))

# create a new x variable on a fine grid of values
> xnew <- (0:160)/10

# compute the predicted values of y
> y2 = predict(quadratic, data.frame(conc=xnew))
> y3 = predict(cubic, data.frame(conc=xnew))

# draw the 3 fits: straight line, quadratic and cubic
> plot(strength ~ conc, pch=16,
       main="Hardwood concentration and tensile strength",
       sub="Linear, quadratic, and cubic fits")
> abline(linear, col="black")
> lines(xnew, y2, col="blue", lwd=3)
> lines(xnew, y3, col="red", lwd=4)

[Plot of strength against conc, titled "Hardwood concentration and tensile
strength", showing the linear, quadratic, and cubic fits]
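Besides comparing the curves by eye, the three nested models can be compared formally. The sketch below assumes the Example 3 data; the use of anova() for a sequential F test is an addition to the text, and cmp and p_values are illustrative names.

```r
conc <- c(1.0, 1.5, 2.0, 3.0, 4.0, 4.5, 5.0, 5.5, 6.0,
          6.5, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0)
strength <- c(6.3, 11.1, 20.0, 24.0, 26.1, 30.0, 33.8, 34.0, 38.1,
              39.9, 42.0, 46.1, 53.1, 52.0, 52.5, 48.0, 42.8, 27.8, 21.9)

linear    <- lm(strength ~ conc)
quadratic <- lm(strength ~ poly(conc, 2))
cubic     <- lm(strength ~ poly(conc, 3))

# Sequential F tests: row 2 tests the quadratic against the linear model,
# row 3 tests the cubic against the quadratic model
cmp <- anova(linear, quadratic, cubic)
cmp

p_values <- cmp[["Pr(>F)"]][2:3]
p_values
```

Small p-values in both rows indicate that each added polynomial term reduces the residual sum of squares by more than chance would explain, in line with the summaries above.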

10.5 Building a linear model from many variables

A typical study has one dependent variable and many independent variables
$x_1$, $x_2$, $x_3$, ..., $x_k$, where k can reach several dozen or even
several hundred. The independent variables are usually related to one another,
and a great many combinations of them may be able to predict the dependent
variable y. For example, with 3 independent variables $x_1$, $x_2$ and $x_3$,
building a model to predict y may require us to consider the following models:
y = f1(x1), y = f2(x2), y = f3(x3), y = f4(x1, x2), y = f5(x1, x3),
y = f6(x2, x3), y = f7(x1, x2, x3), and so on, where each fk is a function
defined by the coefficients attached to the variables involved. As k grows, the
number of models grows very quickly.
The question is which of these models predicts y completely, simply and
validly. I will return to these three criteria in the chapter on logistic
regression analysis. Here I only want to discuss a statistical criterion for
building a linear regression model. When there are many candidate models, the
statistical criterion for choosing an optimal model is usually the Akaike
information criterion (also called AIC).
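For a linear regression model
$\hat{y}_i = \hat{\alpha} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \dots + \hat{\beta}_k x_{ki}$
we have k+1 parameters ($\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_k$),
and we can compute the residual sum of squares (RSS):

$$ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

Here n is the sample size. The formula shows that if the model describes y
well, RSS will be low, because the predicted values $\hat{y}$ and the observed
values y are close together. A general rule of linear regression analysis is
that a model with k independent variables has an RSS no higher than a model
with k-1 of those variables; likewise a model with k-1 variables has an RSS no
higher than a model with k-2 variables, and so on. In other words, the more
independent variables a model has, the better it explains y. But because some
of the independent variables are related to one another, adding more variables
does not necessarily reduce RSS by a meaningful amount. A measure that balances
RSS against the number of independent variables in a model is the AIC, defined
as follows:

$$ AIC = \log\left(\frac{RSS}{n}\right) + \frac{2k}{n} $$

The model with the lowest AIC is regarded as the optimal model. In the
following example, we use the function step to search for an optimal model
based on the value of AIC.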
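The AIC definition above can be sketched as a small function. The example reuses the age, bmi and chol data of Example 1 so that it is self-contained; aic_doc is a hypothetical helper name, and it assumes that k counts every estimated regression coefficient, intercept included. Under that assumption the result equals the value of R's extractAIC() divided by n.

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
bmi  <- c(25.4,20.6,26.2,22.6,25.4,23.1,22.7,24.9,19.8,25.3,23.2,
          21.8,20.9,26.7,26.4,21.2,21.2,22.8)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

# AIC as defined in the text: log(RSS/n) + 2k/n
# (assumption: k = number of estimated coefficients, intercept included)
aic_doc <- function(fit) {
  n   <- length(resid(fit))
  rss <- sum(resid(fit)^2)
  k   <- length(coef(fit))
  log(rss / n) + 2 * k / n
}

m1 <- lm(chol ~ age)
m2 <- lm(chol ~ age + bmi)

c(age_only = aic_doc(m1), age_and_bmi = aic_doc(m2))
```

The model with the smaller value is preferred. Here adding bmi lowers RSS slightly but not enough to offset the penalty for the extra parameter, so the age-only model has the lower AIC.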
Example 4. A study of the effect of factors such as temperature, time and
chemical composition on CO2 yield. The data from this study are summarized in
Table 2. The main aim of the study is to find a linear regression model that
predicts CO2 yield, and to assess the effect of each of these factors.
Table 2. CO2 yield and several factors that may affect it

Id      y      X1    X2     X3     X4       X5        X6     X7
 1   36.98    5.1   400  51.37   4.24  1484.83   2227.25   2.06
 2   13.74   26.4   400  72.33  30.87   289.94    434.90   1.33
 3   10.08   23.8   400  71.44  33.01   320.79    481.19   0.97
 4    8.53   46.4   400  79.15  44.61   164.76    247.14   0.62
 5   36.42    7.0   450  80.47  33.84  1097.26   1645.89   0.22
 6   26.59   12.6   450  89.90  41.26   605.06    907.59   0.76
 7   19.07   18.9   450  91.48  41.88   405.37    608.05   1.71
 8    5.96   30.2   450  98.60  70.79   253.70    380.55   3.93
 9   15.52   53.8   450  98.05  66.82   142.27    213.40   1.97
10   56.61    5.6   400  55.69   8.92  1362.24   2043.36   5.08
11   26.72   15.1   400  66.29  17.98   507.65    761.48   0.60
12   20.80   20.3   400  58.94  17.79   377.60    566.40   0.90
13    6.99   48.4   400  74.74  33.94   158.05    237.08   0.63
14   45.93    5.8   425  63.71  11.95   130.66   1961.49   2.04
15   43.09   11.2   425  67.14  14.73   682.59   1023.89   1.57
16   15.79   27.9   425  77.65  34.49   274.20    411.30   2.38
17   21.60    5.1   450  67.22  14.48  1496.51   2244.77   0.32
18   35.19   11.7   450  81.48  29.69   652.43    978.64   0.44
19   26.14   16.7   450  83.88  26.33   458.42    687.62   8.82
20    8.60   24.8   450  89.38  37.98   312.25    468.38   0.02
21   11.63   24.9   450  79.77  25.66   307.08    460.62   1.72
22    9.59   39.5   450  87.93  22.36   193.61    290.42   1.88
23    4.42   29.0   450  79.50  31.52   155.96    233.95   1.43
24   38.89    5.5   460  72.73  17.86  1392.08   2088.12   1.35
25   11.19   11.5   450  77.88  25.20   663.09    994.63   1.61
26   75.62    5.2   470  75.50   8.66  1464.11   2196.17   4.78
27   36.03   10.6   470  83.15  22.39   720.07   1080.11   5.88

Note: y = CO2 yield; X1 = time (minutes); X2 = temperature (C); X3 = percent
dissolved; X4 = amount of oil (g/100g); X5 = amount of coal; X6 = total amount
dissolved; X7 = hydrogen consumed.

Before analysing the data, we need to enter them into R with the usual
commands. The data are stored in the object REGdata.
> y <- c(36.98,13.74,10.08, 8.53,36.42,26.59,19.07, 5.96,15.52,56.61,
26.72,20.80, 6.99,45.93,43.09,15.79,21.60,35.19,26.14, 8.60,
11.63, 9.59, 4.42,38.89,11.19,75.62,36.03)
> x1 <- c(5.1,26.4,23.8,46.4, 7.0,12.6,18.9,30.2,53.8,5.6,15.1,20.3,48.4,
5.8,11.2,27.9,5.1,11.7,16.7,24.8,24.9,39.5,29.0, 5.5, 11.5,
5.2,10.6)
> x2 <- c(400,400, 400, 400, 450, 450, 450, 450, 450, 400, 400, 400,
400, 425, 425, 425, 450, 450, 450, 450, 450, 450, 450, 460,
450, 470, 470)
> x3 <- c(51.37,72.33,71.44,79.15,80.47,89.90,91.48,98.60,98.05,55.69,
66.29,58.94,74.74,63.71,67.14,77.65,67.22,81.48,83.88,89.38,
79.77,87.93,79.50,72.73,77.88,75.50,83.15)
> x4 <- c(4.24,30.87,33.01,44.61,33.84,41.26,41.88,70.79,66.82,
8.92,17.98,17.79,33.94,11.95,14.73,34.49,14.48,29.69,26.33,
37.98,25.66,22.36,31.52,17.86,25.20, 8.66,22.39)
> x5 <- c(1484.83, 289.94, 320.79, 164.76, 1097.26, 605.06, 405.37,
253.70, 142.27,1362.24, 507.65, 377.60, 158.05, 130.66,
682.59, 274.20, 1496.51, 652.43, 458.42, 312.25, 307.08,
193.61, 155.96,1392.08, 663.09,1464.11, 720.07)
> x6 <- c(2227.25, 434.90, 481.19, 247.14,1645.89, 907.59, 608.05,
380.55, 213.40,2043.36, 761.48, 566.40, 237.08,1961.49,1023.89,
411.30,2244.77, 978.64, 687.62, 468.38, 460.62, 290.42,
233.95,2088.12, 994.63,2196.17,1080.11)
> x7 <- c(2.06,1.33,0.97,0.62,0.22,0.76,1.71,3.93,1.97,5.08,0.60,0.90,
0.63,2.04,1.57,2.38,0.32,0.44,8.82,0.02,1.72,1.88,1.43,
1.35,1.61,4.78,5.88)
> REGdata <- data.frame(y, x1,x2,x3,x4,x5,x6,x7)

We now begin the analysis. The first model includes all 7 independent
variables, as follows:
> reg <- lm(y ~ x1+x2+x3+x4+x5+x6+x7, data=REGdata)
> summary(reg)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = REGdata)

Residuals:
    Min      1Q  Median      3Q     Max
-20.035  -4.681  -1.144   4.072  21.214

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 53.937016  57.428952   0.939   0.3594
x1          -0.127653   0.281498  -0.453   0.6553
x2          -0.229179   0.232643  -0.985   0.3370
x3           0.824853   0.765271   1.078   0.2946
x4          -0.438222   0.358551  -1.222   0.2366
x5          -0.001937   0.009654  -0.201   0.8431
x6           0.019886   0.008088   2.459   0.0237 *
x7           1.993486   1.089701   1.829   0.0831 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.61 on 19 degrees of freedom
Multiple R-Squared: 0.728,      Adjusted R-squared: 0.6278
F-statistic: 7.264 on 7 and 19 DF,  p-value: 0.0002674

The results above show that all 7 variables together explain about 73% of the variance of y. But
of these 7 variables, only x6 is statistically significant (p = 0.024). We now try reducing the
model to a simple linear regression with x6 as the only variable.
> summary(lm(y ~ x6, data=REGdata))
Call:
lm(formula = y ~ x6, data = REGdata)

Residuals:
    Min      1Q  Median      3Q     Max
-28.081  -5.829  -0.839   5.522  26.882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.144181   3.483064   1.764     0.09 .
x6          0.019395   0.002932   6.616 6.24e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.7 on 25 degrees of freedom
Multiple R-Squared: 0.6365,     Adjusted R-squared: 0.6219
F-statistic: 43.77 on 1 and 25 DF,  p-value: 6.238e-07
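
The adjusted R-squared values in the two summaries above can be checked by hand. The sketch below assumes the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), with n = 27 observations and p predictors. The helper name adj_r2 is hypothetical and not part of R. Small differences from the printed values come from using the rounded R-squared.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
# adj_r2 is a hypothetical helper used only for this illustration.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 27                               # number of rows in REGdata
full   <- adj_r2(0.7280, n, p = 7)    # model with x1, ..., x7
simple <- adj_r2(0.6365, n, p = 1)    # model with x6 only

# Compare with the "Adjusted R-squared" lines printed by summary()
round(c(full = full, simple = simple), 4)
```

Although R-squared drops from 0.728 to 0.6365 when six variables are removed, the adjusted values (0.6278 and 0.6219 in the output) are almost the same.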

With the single variable x6, the model already explains about 64% of the variance of y. Should
we accept this model? Before accepting it, we must examine the correlations among the
independent variables:
> pairs(REGdata)

[Figure: scatterplot matrix produced by pairs(REGdata), with panels labelled x1 to x7.]

The results above show that y is related to variables such as x1, x5 and x6. In addition, x5 and
x6 are very closely related (almost a straight line), with a correlation coefficient of
0.88. Furthermore, x5 and x1, or x6 and x1, are also related to each other, but through an
inverse function. This means that x5 and x6 provide about the same amount of information
for predicting y, so we do not need both of them in the model.
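
The correlation quoted above can be checked directly with cor(). The sketch below re-enters only x5 and x6, copied from the data-entry commands earlier in this example, so that it runs on its own.

```r
# x5 and x6 exactly as entered in the text.
x5 <- c(1484.83, 289.94, 320.79, 164.76, 1097.26, 605.06, 405.37,
        253.70, 142.27, 1362.24, 507.65, 377.60, 158.05, 130.66,
        682.59, 274.20, 1496.51, 652.43, 458.42, 312.25, 307.08,
        193.61, 155.96, 1392.08, 663.09, 1464.11, 720.07)
x6 <- c(2227.25, 434.90, 481.19, 247.14, 1645.89, 907.59, 608.05,
        380.55, 213.40, 2043.36, 761.48, 566.40, 237.08, 1961.49, 1023.89,
        411.30, 2244.77, 978.64, 687.62, 468.38, 460.62, 290.42,
        233.95, 2088.12, 994.63, 2196.17, 1080.11)

r <- cor(x5, x6)   # Pearson correlation coefficient
round(r, 2)        # the text reports 0.88
```

With the full data frame loaded, round(cor(REGdata), 2) prints the whole correlation matrix, which is a numeric companion to the pairs() plot.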
To find an optimal model when there are many such correlations, we use the step function
as follows. Note how the formula is given as lm(y ~ .): the dot tells R to
consider all the variables in the object REGdata.
> reg <- lm(y ~ ., data=REGdata)
> step(reg, direction="both")
Start: AIC= 134.07
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7

       Df Sum of Sq     RSS    AIC
- x5    1      4.54 2145.37 132.13
- x1    1     23.17 2164.00 132.36
- x2    1    109.34 2250.18 133.42
- x3    1    130.90 2271.74 133.68
<none>              2140.83 134.07
- x4    1    168.31 2309.14 134.12
- x7    1    377.09 2517.92 136.45
- x6    1    681.09 2821.92 139.53

Step 1: AIC= 132.13
y ~ x1 + x2 + x3 + x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x1    1      22.7 2168.1 130.4
- x2    1     113.8 2259.1 131.5
- x3    1     133.5 2278.9 131.8
<none>              2145.4 132.1
- x4    1     170.8 2316.2 132.2
+ x5    1       4.5 2140.8 134.1
- x7    1     375.7 2521.1 134.5
- x6    1    1058.5 3203.8 141.0

Step 2: AIC= 130.42
y ~ x2 + x3 + x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x2    1      96.8 2264.9 129.6
- x3    1     122.0 2290.0 129.9
<none>              2168.1 130.4
- x4    1     187.4 2355.5 130.7
+ x1    1      22.7 2145.4 132.1
+ x5    1       4.1 2164.0 132.4
- x7    1     385.0 2553.1 132.8
- x6    1    1526.2 3694.3 142.8

Step 3: AIC= 129.59
y ~ x3 + x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x3    1      25.4 2290.3 127.9
- x4    1      90.9 2355.8 128.7
<none>              2264.9 129.6
+ x2    1      96.8 2168.1 130.4
+ x5    1       8.3 2256.5 131.5
+ x1    1       5.7 2259.1 131.5
- x7    1     384.9 2649.7 131.8
- x6    1    2015.6 4280.5 144.8

Step 4: AIC= 127.9
y ~ x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x4    1      73.5 2363.8 126.7
<none>              2290.3 127.9
+ x3    1      25.4 2264.9 129.6
+ x1    1      11.3 2279.0 129.8
+ x5    1       6.3 2284.0 129.8
+ x2    1       0.3 2290.0 129.9
- x7    1     486.6 2776.9 131.1
- x6    1    1993.8 4284.1 142.8

Step 5: AIC= 126.75
y ~ x6 + x7

       Df Sum of Sq    RSS   AIC
<none>              2363.8 126.7
+ x4    1      73.5 2290.3 127.9
+ x1    1      33.4 2330.4 128.4
+ x3    1       8.1 2355.8 128.7
+ x5    1       7.7 2356.1 128.7
+ x2    1       7.3 2356.6 128.7
- x7    1     497.3 2861.2 129.9
- x6    1    4477.0 6840.8 153.4

Call:
lm(formula = y ~ x6 + x7, data = REGdata)

Coefficients:
(Intercept)           x6           x7
    2.52646      0.01852      2.18575
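
The AIC values printed by step can be reproduced from the residual sum of squares (RSS). For a linear model, step relies on extractAIC, which computes n*log(RSS/n) + 2*k, where k is the number of estimated coefficients including the intercept. The sketch below plugs in RSS values copied from the output above. The helper name aic_from_rss is hypothetical.

```r
# AIC as used by step() for lm fits: n * log(RSS / n) + 2 * k.
# aic_from_rss is a hypothetical helper used only for this illustration.
aic_from_rss <- function(rss, n, k) {
  n * log(rss / n) + 2 * k
}

n <- 27                                          # number of observations
start_aic <- aic_from_rss(2140.83, n, k = 8)     # intercept + x1, ..., x7
final_aic <- aic_from_rss(2363.8,  n, k = 3)     # intercept + x6 + x7

# Compare with "Start: AIC= 134.07" and "Step 5: AIC= 126.75"
round(c(start = start_aic, final = final_aic), 2)
```

With the fitted object available, extractAIC(reg) returns the same quantity. AIC(reg) is based on the full log-likelihood and therefore differs from it by a constant that does not depend on which variables are in the model.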

The search for an optimal model stops at the model with the two variables x6 and x7, because this
model has the lowest AIC. The linear equation for predicting y is: y = 2.526 + 0.0185(x6) +
2.186(x7).
> summary(lm(y ~ x6+x7, data=REGdata))
Call:
lm(formula = y ~ x6 + x7, data = REGdata)

Residuals:
     Min       1Q   Median       3Q      Max
-23.2035  -4.3713   0.2513   4.9339  21.9682

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.526460   3.610055   0.700   0.4908
x6          0.018522   0.002747   6.742 5.66e-07 ***
x7          2.185753   0.972696   2.247   0.0341 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.924 on 24 degrees of freedom
Multiple R-Squared: 0.6996,     Adjusted R-squared: 0.6746
F-statistic: 27.95 on 2 and 24 DF,  p-value: 5.391e-07

The detailed analysis (results above) shows that these two variables explain about 70% of the
variance of y.
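
The fitted equation can be turned into a small prediction function. In the sketch below the coefficients are copied from the summary above. The helper predict_y and the input values x6 = 1000, x7 = 2 are made up for illustration and are not taken from the data.

```r
# Coefficients of the model y ~ x6 + x7, copied from the summary output.
b0 <- 2.526460
b6 <- 0.018522
b7 <- 2.185753

# predict_y is a hypothetical helper; it evaluates the fitted equation.
predict_y <- function(x6, x7) {
  b0 + b6 * x6 + b7 * x7
}

# Hypothetical input values, chosen only to show the arithmetic.
predict_y(x6 = 1000, x7 = 2)
```

With the model stored as fit <- lm(y ~ x6 + x7, data=REGdata), the call predict(fit, newdata=data.frame(x6=1000, x7=2)) evaluates the same equation with unrounded coefficients.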

10.6 Building a linear model with Bayesian Model Averaging (BMA)

One problem with the model-building approach above is that the model with x6 and x7 is treated
as the final model, although we know that a model with x5 and x7 could also be
a plausible model, because x5 and x6 are very closely correlated. If the study
were continued and new data added, a different model might well emerge.
To assess this uncertainty in building a statistical model, an approach that is
more promising than the one above is BMA (Bayesian Model Averaging). Readers
who want to learn more about it can consult the papers listed at the end of this
section. Briefly, BMA considers all possible models (with 7 independent variables
the number of possible models is 2^7 = 128, not counting models with interactions!) and presents
the results of the models judged to be best in the long run. The optimality criterion is
again an information criterion: BIC, a close relative of AIC, in the bicreg function used below.
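
The count of 128 candidate models can be verified by counting subsets of the 7 independent variables. A minimal sketch, assuming main effects only (no interaction terms):

```r
p <- 7                        # number of candidate independent variables
by_size <- choose(p, 0:p)     # number of models with 0, 1, ..., 7 variables
names(by_size) <- 0:p

by_size                       # how many models there are of each size
sum(by_size)                  # total number of models, equal to 2^p
```

The model with no variables (intercept only) is included in this count.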
To run BMA, we need the BMA package (which can be downloaded
from the R website http://cran.R-project.org). Once the BMA package has been installed on
the computer, we must load it into the R working environment with the command:
> library(BMA)

Next, we create a matrix containing only the independent variables. We know that the data frame
REGdata has 8 variables, the first of which is y. Therefore the command REGdata[, -1] creates
a new data frame with every column except the first (that is, y).
> xvars <- REGdata[,-1]

Next, we define the dependent variable, named co2, from REGdata:
> co2 <- REGdata[,1]
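
Negative column indices drop columns instead of selecting them. A minimal sketch on a made-up three-column data frame (toy, predictors and response are hypothetical names, not objects from this chapter):

```r
# A small made-up data frame whose first column plays the role of y.
toy <- data.frame(y = c(1.2, 3.4, 5.6),
                  a = c(10, 20, 30),
                  b = c(0.1, 0.2, 0.3))

predictors <- toy[, -1]   # every column except the first: still a data frame
response   <- toy[, 1]    # the first column only: a plain numeric vector

names(predictors)
is.data.frame(predictors)
is.numeric(response)
```

The same pattern gives xvars (7 columns) and co2 (a vector of length 27) from REGdata.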

We are now ready to run the BMA analysis. The function bicreg is written
specifically for linear regression analysis. It is applied as follows:
> bma <- bicreg(xvars, co2, strict=FALSE, OR=20)

We use the summary function to see the results:
> summary(bma)

Call:
bicreg(x = xvars, y = co2, strict = FALSE, OR = 20)

  16  models were selected
 Best  5  models (cumulative posterior probability =  0.6599 ):

           p!=0    EV        SD        model 1   model 2   model 3   model 4   model 5
Intercept  100.0   5.75672   14.6244    2.5264    6.1441    8.6120    7.5936    7.3537
x1          12.4  -0.01807    0.1008     .         .         .        -0.1393    .
x2          10.4  -0.00075    0.0282     .         .         .         .         .
x3          10.7   0.00011    0.0791     .         .         .         .        -0.0572
x4          20.2  -0.03059    0.1020     .         .        -0.1419    .         .
x5          10.5  -0.00023    0.0030     .         .         .         .         .
x6         100.0   0.01815    0.0040    0.0185    0.0193    0.0164    0.0162    0.0179
x7          73.7   1.60766    1.2821    2.1857     .        2.1628    2.1233    2.2382

nVar                                     2         1         3         3         3
r2                                      0.700     0.636     0.709     0.704     0.701
BIC                                   -25.8832  -24.0238  -23.4412  -22.9721  -22.6801
post prob                               0.311     0.123     0.092     0.072     0.063

BMA presents the results of the 5 models judged best for predicting y
(model 1 through model 5).


Column 1 lists the independent variables.

Column 2 (p!=0) gives the probability, in percent, that an independent variable has an effect on y.
For example, the probability that x6 affects y is 100%, while the probability that x7
affects y is 73.7%. The probabilities for the other variables are much lower, at most
about 20%. We can therefore say that the model with x6 and x7 is probably the
best model.
Columns 3 (EV) and 4 (SD) give the mean and standard deviation of the coefficient
for each independent variable.
Column 5 gives the estimated regression coefficients of model 1. As this column
shows, model 1 consists of the intercept and the two variables x6 and x7. This
model explains (as we know from the analysis above) 70% of the variance
of y. Its BIC (Bayesian Information Criterion) is the lowest. Among all the
models found by BMA, this model has a probability (post prob) of 31.1%.
Column 6 gives the estimated regression coefficients of model 2. As this column
shows, model 2 consists of the intercept and the variable x6. This model explains 64% of
the variance of y. Among all the models found by BMA, this model has a probability
of only 12.3%.
The other models can be interpreted in the same way.
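
The BIC and post prob rows of the summary are linked, and both can be checked from the printed numbers. The sketch below makes two assumptions that are not stated in the text: that BIC has the form given in Raftery (1995), n*log(1 - R^2) + p*log(n), and that posterior model probabilities are proportional to exp(-BIC/2) (equal prior weight on every model). Since only 5 of the 16 selected models are printed, the code compares ratios relative to model 1 instead of renormalizing.

```r
n   <- 27                                       # number of observations
r2  <- c(0.700, 0.636, 0.709, 0.704, 0.701)     # r2 row, models 1 to 5
nv  <- c(2, 1, 3, 3, 3)                         # nVar row
bic <- c(-25.8832, -24.0238, -23.4412, -22.9721, -22.6801)  # BIC row

# BIC recomputed from the rounded r2 values, so agreement is approximate.
bic_check <- n * log(1 - r2) + nv * log(n)

# Weight of each model relative to model 1 (assumed proportional to exp(-BIC/2)).
rel <- exp(-(bic - bic[1]) / 2)

# Scale by the reported probability of model 1 to compare with the post prob row.
post <- 0.311 * rel

round(rbind(bic, bic_check, post), 3)
```

Under these assumptions a difference of about 1.86 in BIC between model 1 and model 2 is what makes model 1 roughly 2.5 times as probable as model 2.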

Another way to display the results above is with a plot, as follows:
> imageplot.bma(bma)

[Figure: "Models selected by BMA", produced by imageplot.bma(bma). Rows: x1 to x7; horizontal axis: Model #.]

This plot displays 13 models. Across these 13 models, the variable x6 appears
consistently. Next comes x7, which also appears in a number of models but, as
we already know, with a probability of about 74%.
In this example both methods give consistent results, but in
many cases the two methods can give different results. Many recent theoretical
studies show that results from BMA are very reliable, and in the future it may
well become the standard method for building models.
References for BMA

Raftery, Adrian E. (1995). Bayesian model selection in social research (with Discussion).
Sociological Methodology 1995 (Peter V. Marsden, ed.), pp. 111-196, Cambridge, Mass.:
Blackwells.
Some papers related to BMA can be downloaded from the following website:
www.stat.colostate.edu/~jah/papers.
