
VINCENT J. SAMUEL (MSc/STA/07/14)

GENERALIZED LINEAR MODELS (GLM)


MULTIPLE REGRESSION EXERCISE

6.30. Two Models for predicting the average length of patient stay in a hospital.

. corr length age infectio service
(obs=113)

             |   length      age infectio  service
-------------+------------------------------------
      length |   1.0000
         age |   0.1889   1.0000
    infectio |   0.5334   0.0011   1.0000
     service |   0.3555  -0.0405   0.4126   1.0000

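The table above shows that length is positively correlated with age (0.1889), infection risk
(0.5334) and service (0.3555). Among the predictors, infection risk and service are moderately
correlated (0.4126), while age is essentially uncorrelated with infection risk (0.0011) and
slightly negatively correlated with service (-0.0405).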

. corr length beds infectio service
(obs=113)

             |   length     beds infectio  service
-------------+------------------------------------
      length |   1.0000
        beds |   0.4093   1.0000
    infectio |   0.5334   0.3598   1.0000
     service |   0.3555   0.7945   0.4126   1.0000

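The table above shows positive linear association among length, beds, infection risk and
service: every pairwise correlation is positive. The correlation between beds and service is
high (0.7945), so these two predictors carry overlapping information.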
A. Prepare a stem and leaf plot for each of the predictor variables.
. stem age

Stem-and-leaf plot for age (Average age of all patients (in yrs))

age rounded to nearest multiple of .1


plot in units of .1

38* 8
39*
40*
41*
42* 0
43* 7
44* 2
45* 0257
46*
47* 12
48* 126
49* 01355579
50* 22456679
51* 0012355677789
52* 0011123458888
53* 02222788899
54* 00112245569
55* 00788
56* 0123355789999
57* 22668
58* 022
59* 0569
60* 9
61* 1
62* 2
63* 9
64* 1
65* 9

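The stem-and-leaf plot shows an approximately normal distribution: the values are roughly
symmetric about the mean of 53.23 years, with only a few isolated values in the tails
(38.8 at the low end and 65.9 at the high end).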
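
As a rough check of the symmetry claim, the stem-and-leaf display for age can be decoded back into values and the mean compared with the median. This is only a sketch in Python (not part of the original Stata session); the name STEM_LEAF and the helper decode are illustrative, and the leaves are copied by hand from the plot above.

```python
from statistics import mean, median

# Stems (whole years) and leaves (tenths of a year) copied from the plot of age.
STEM_LEAF = {
    38: "8", 42: "0", 43: "7", 44: "2", 45: "0257",
    47: "12", 48: "126", 49: "01355579", 50: "22456679",
    51: "0012355677789", 52: "0011123458888", 53: "02222788899",
    54: "00112245569", 55: "00788", 56: "0123355789999",
    57: "22668", 58: "022", 59: "0569",
    60: "9", 61: "1", 62: "2", 63: "9", 64: "1", 65: "9",
}


def decode(stem_leaf):
    """Turn each stem/leaf pair into a value: stem + leaf / 10."""
    return [stem + int(leaf) / 10
            for stem, leaves in stem_leaf.items()
            for leaf in leaves]


ages = decode(STEM_LEAF)
age_mean = mean(ages)
age_median = median(ages)

# A mean close to the median is consistent with a roughly symmetric distribution.
print(len(ages), round(age_mean, 2), round(age_median, 1))  # 113 53.23 53.2
```

The decoded mean agrees with the 53.23 quoted above, and it is almost equal to the median, which supports describing the distribution as roughly symmetric.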

. stem infectio

Stem-and-leaf plot for infectio (Infection risk)

infectio rounded to nearest multiple of .1


plot in units of .1

1* 334
1. 678
2* 0013
2. 5677899999
3* 011244
3. 5577778999
4* 0111122222333333344444
4. 5555555666778888999
5* 0000112233344
5. 5555666777889
6* 12334
6. 56
7*
7. 678

The stem-and-leaf plot shows an approximately normal distribution: the values are roughly
symmetric about the mean of 4.35.
. stem service

Stem-and-leaf plot for service (Available facilities and services)

service rounded to nearest multiple of .1


plot in units of .1

0** 57
1** 14,43
1** 71,71
2** 00,29,29,29,29,29,29,29
2** 57,57,57,57,57,86,86,86,86,86,86,86
3** 14,14,14,14,14,43,43,43,43,43,43,43
3** 71,71,71,71,71,71,71,71,71
4** 00,00,00,00,00,00,00,00,29,29,29
4** 57,57,57,57,57,57,57,57,57,57,86,86,86,86,86,86,86,86
5** 14,14,14,14,14,14,14,14,43,43,43,43,43,43
5** 71,71,71,71,71,71,71
6** 00,00,29,29,29,29,29
6** 57,57,57,86,86,86
7** 14,43
7** 71
8** 00

The stem-and-leaf plot shows an approximately normal distribution: the values are roughly
symmetric, with most of them between 20 and 60 and the peak in the 40s.

. stem beds

Stem-and-leaf plot for beds (Number of beds)

0** 29
0** 52,56,60,64,68,70,72,72,73,76,76,80,83,85,87,90,91,92,92,95, ... (26)
1** 00,06,07,08,13,15,15,19,29,30,30,33,34,43,47
1** 50,54,57,63,65,66,67,67,70,75,76,80,82,84,86,90,91,95,95,95,96,96,97
2** 10,21,35,37,46,48
2** 52,65,66,70,79,81,97,98,98,98
3** 04,05,06,12,18,18,22
3** 53,56,62,87
4** 24,45
4** 61,77,87,89
5** 08,35,46
5** 68,71,93,95
6** 00,20,40
6**
7**
7** 52,68
8** 31,33,35

The stem-and-leaf plot shows that the distribution of beds is not symmetric: it is positively
skewed (skewed to the right). Most hospitals have fewer than 300 beds, while a long upper tail
extends to 835 beds.

B. Obtain the scatter plot matrix and correlation matrix for the first model
. graph matrix length age infectio service, half

[Figure: scatter plot matrix (lower half) of Length of stay, Average age of all patients (in yrs), Infection risk, and Available facilities and services]

The scatter plot matrix suggests roughly linear relationships among the variables. Between
infection risk and available facilities and services, however, infection risk increases only
slowly as the available facilities and services increase.

. pwcorr length age infectio service, sig

             |   length      age infectio  service
-------------+------------------------------------
      length |   1.0000
             |
         age |   0.1889   1.0000
             |   0.0451
             |
    infectio |   0.5334   0.0011   1.0000
             |   0.0000   0.9908
             |
     service |   0.3555  -0.0405   0.4126   1.0000
             |   0.0001   0.6705   0.0000

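The table above shows that age (p = 0.0451), infection risk (p < 0.0001) and service
(p = 0.0001) are each significantly correlated with length of stay, since every one of these
p-values is less than 0.05. Among the predictors, the correlations of age with infection risk
(p = 0.9908) and with service (p = 0.6705) are not significant.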
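
The significance levels reported by pwcorr can be cross-checked with the usual t statistic for a correlation coefficient, t = r * sqrt(n - 2) / sqrt(1 - r^2), on n - 2 degrees of freedom. The sketch below is in Python rather than Stata; the critical value 1.98 is an assumed approximation taken from t tables for 111 degrees of freedom, and the names corr_t_stat and T_CRIT are illustrative.

```python
import math


def corr_t_stat(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


N = 113
T_CRIT = 1.98  # assumed approximate two-sided 5% critical value for 111 df

# Correlations of each predictor with length, copied from the table above.
correlations = {"age": 0.1889, "infectio": 0.5334, "service": 0.3555}
t_stats = {name: corr_t_stat(r, N) for name, r in correlations.items()}

for name, t in t_stats.items():
    print(name, round(t, 2), "significant" if abs(t) > T_CRIT else "not significant")

# Correlation between age and infectio (0.0011): far below the critical value.
t_age_infectio = corr_t_stat(0.0011, N)
print("age vs infectio", round(t_age_infectio, 2))
```

The statistic for age is only just above the critical value, which matches its p-value of 0.0451 being close to 0.05.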

Obtain the scatter plot matrix and correlation matrix for the second model
. graph matrix length beds infectio service, half

[Figure: scatter plot matrix (lower half) of Length of stay, Number of beds, Infection risk, and Available facilities and services]

The plots show that there is positive linear correlation among the variables length, number of
beds, infection risk and available facilities and services.

. pwcorr length beds infectio service, sig

             |   length     beds infectio  service
-------------+------------------------------------
      length |   1.0000
             |
        beds |   0.4093   1.0000
             |   0.0000
             |
    infectio |   0.5334   0.3598   1.0000
             |   0.0000   0.0001
             |
     service |   0.3555   0.7945   0.4126   1.0000
             |   0.0001   0.0000   0.0000

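The table above shows that beds (p < 0.0001), infection risk (p < 0.0001) and service
(p = 0.0001) are each significantly correlated with length of stay, and that the correlations
among the predictors are also significant, since every p-value is less than 0.05.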
C. Fit first order regression models with three variables
. regress length age infectio service

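      Source |       SS       df       MS              Number of obs =     113
-------------+------------------------------           F(  3,   109) =   19.12
       Model |  141.085149     3  47.0283829           Prob > F      =  0.0000
    Residual |  268.125232   109  2.45986451           R-squared     =  0.3448
-------------+------------------------------           Adj R-squared =  0.3267
       Total |  409.210381   112  3.65366411           Root MSE      =  1.5684

------------------------------------------------------------------------------
      length |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0837145   .0332502     2.52   0.013     .0178137    .1496152
    infectio |   .6584481    .121354     5.43   0.000     .4179284    .8989678
     service |   .0217363   .0107137     2.03   0.045     .0005021    .0429706
       _cons |   1.386462   1.866217     0.74   0.459    -2.312319    5.085243
------------------------------------------------------------------------------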

From the results above, the model will be:


Ŷ = 1.386462 + 0.0837145*age + 0.6584481*infection + 0.0217363*service
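
To illustrate how the fitted equation is used, the sketch below evaluates it in Python for one hospital. The coefficients are copied from the output above; the input values (age 50, infection risk 4, service 40) are hypothetical and chosen only for illustration, and predict_length is an illustrative name.

```python
# Coefficients of the first fitted model, copied from the regression output.
COEF = {"_cons": 1.386462, "age": 0.0837145,
        "infectio": 0.6584481, "service": 0.0217363}


def predict_length(age, infectio, service):
    """Predicted average length of stay from the first-order model."""
    return (COEF["_cons"]
            + COEF["age"] * age
            + COEF["infectio"] * infectio
            + COEF["service"] * service)


# Hypothetical hospital, not taken from the data set.
example = predict_length(age=50.0, infectio=4.0, service=40.0)
print(round(example, 2))  # 9.08

# Raising infection risk by one unit, other predictors fixed, adds the
# infectio coefficient to the prediction.
step = predict_length(50.0, 5.0, 40.0) - example
print(round(step, 4))  # 0.6584
```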

. regress length beds infectio service

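      Source |       SS       df       MS              Number of obs =     113
-------------+------------------------------           F(  3,   109) =   18.78
       Model |  139.432709     3  46.4775696           Prob > F      =  0.0000
    Residual |  269.777672   109  2.47502451           R-squared     =  0.3407
-------------+------------------------------           Adj R-squared =  0.3226
       Total |  409.210381   112  3.65366411           Root MSE      =  1.5732

------------------------------------------------------------------------------
      length |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        beds |   .0030178   .0012716     2.37   0.019     .0004976    .0055381
    infectio |   .6477069   .1219077     5.31   0.000     .4060899    .8893239
     service |  -.0092852   .0165237    -0.56   0.575    -.0420346    .0234643
       _cons |    6.46738   .6151573    10.51   0.000     5.248158    7.686602
------------------------------------------------------------------------------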

From the results above, the model will be:


Ŷ = 6.46738 + 0.0030178*beds + 0.6477069*infection -0.0092852*service
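
Each t value in the coefficient table is the estimate divided by its standard error. The Python sketch below recomputes the t column of the second model from the printed coefficients and standard errors; ROWS and t_values are illustrative names.

```python
# (coefficient, standard error) pairs copied from the second regression output.
ROWS = {
    "beds": (0.0030178, 0.0012716),
    "infectio": (0.6477069, 0.1219077),
    "service": (-0.0092852, 0.0165237),
    "_cons": (6.46738, 0.6151573),
}

# t statistic for H0: coefficient = 0 is the estimate over its standard error.
t_values = {name: coef / se for name, (coef, se) in ROWS.items()}

for name, t in t_values.items():
    print(f"{name:>8}  t = {t:6.2f}")
```

The small t value for service (-0.56) is what lies behind its large p-value of 0.575: once beds and infection risk are in the model, service adds little.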

D. Calculate R² for each model

First model: R² = 0.3448
Second model: R² = 0.3407
The coefficients of determination are nearly identical: each model explains about 34% of
the variation in length of stay. The first model is marginally higher (adjusted R² of 0.3267
against 0.3226), but the difference is too small to prefer one model on this measure alone.
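
The R² values can be reproduced from the sums of squares in the two ANOVA tables, using R² = SS(Model) / SS(Total) and adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1). The following is a minimal Python sketch; the function names are illustrative and the inputs are copied from the Stata output.

```python
def r_squared(ss_model, ss_total):
    """Coefficient of determination: share of total variation explained."""
    return ss_model / ss_total


def adj_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


N, K = 113, 3
SS_TOTAL = 409.210381
SS_MODEL = {"model 1": 141.085149, "model 2": 139.432709}

results = {}
for name, ss_model in SS_MODEL.items():
    r2 = r_squared(ss_model, SS_TOTAL)
    results[name] = (r2, adj_r_squared(r2, N, K))
    print(name, round(results[name][0], 4), round(results[name][1], 4))
```

Because both models have the same number of predictors and observations, comparing R² and comparing adjusted R² lead to the same ordering here.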
E. Obtain the residuals and plot them against Ŷ.

. regress length age infectio service


. predict res, residual
. predict yhat, xb
. twoway (scatter res yhat) (lfit res yhat)
[Figure: residuals against the linear prediction, with fitted line. Legend: Residuals, Fitted values]

. pnorm res

[Figure: normal probability plot of res. Axes: Normal F[(res-m)/s] against Empirical P[i] = i/(N+1)]

. regress length beds infectio service


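. predict res1, residual
. * fitted values must come from the second model, not the yhat saved after the first one
. predict yhat1, xb
. twoway (scatter res1 yhat1) (lfit res1 yhat1)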
[Figure: residuals against the linear prediction, with fitted line. Legend: Residuals, Fitted values]

. pnorm res1

[Figure: normal probability plot of res1. Axes: Normal F[(res1-m)/s] against Empirical P[i] = i/(N+1)]

Neither model is clearly more appropriate than the other, since both show the same pattern:
the residual plots with fitted lines and the normal probability plots look very similar for
the two models.
6.31.
A. First order regression models

. regress infectio age routine census service if region==1

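      Source |       SS       df       MS              Number of obs =      28
-------------+------------------------------           F(  4,    23) =    4.92
       Model |  20.1260532     4  5.03151329           Prob > F      =  0.0051
    Residual |  23.5007326    23  1.02177098           R-squared     =  0.4613
-------------+------------------------------           Adj R-squared =  0.3676
       Total |  43.6267857    27  1.61580688           Root MSE      =  1.0108

------------------------------------------------------------------------------
    infectio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .116954    .050215     2.33   0.029     .0130762    .2208317
     routine |   .0582402   .0226092     2.58   0.017     .0114694     .105011
      census |   .0015082   .0016514     0.91   0.371     -.001908    .0049244
     service |    .006613   .0230415     0.29   0.777    -.0410519    .0542779
       _cons |  -3.349576   2.709974    -1.24   0.229    -8.955585    2.256434
------------------------------------------------------------------------------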

Estimated regression function


Ŷ = -3.349576 + 0.116954*age + 0.0582402*routine + 0.0015082*census +
0.006613*service

. regress infectio age routine census service if region==2

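      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  4,    27) =    4.72
       Model |  22.8769903     4  5.71924758           Prob > F      =  0.0051
    Residual |  32.7217597    27  1.21191703           R-squared     =  0.4115
-------------+------------------------------           Adj R-squared =  0.3243
       Total |    55.59875    31  1.79350806           Root MSE      =  1.1009

------------------------------------------------------------------------------
    infectio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0047417   .0472126     0.10   0.921    -.0921305    .1016138
     routine |   .0580299   .0161918     3.58   0.001      .024807    .0912527
      census |   .0011725   .0020941     0.56   0.580    -.0031242    .0054692
     service |    .015018   .0215103     0.70   0.491    -.0291174    .0591535
       _cons |   2.291536   2.686553     0.85   0.401    -3.220815    7.803888
------------------------------------------------------------------------------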

Estimated regression function


Ŷ = 2.291536 + 0.0047417*age + 0.0580299*routine + 0.0011725*census +
0.015018*service
. regress infectio age routine census service if region==3

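      Source |       SS       df       MS              Number of obs =      37
-------------+------------------------------           F(  4,    32) =   12.45
       Model |  46.6579215     4  11.6644804           Prob > F      =  0.0000
    Residual |  29.9750515    32  .936720358           R-squared     =  0.6088
-------------+------------------------------           Adj R-squared =  0.5600
       Total |   76.632973    36  2.12869369           Root MSE      =  .96784

------------------------------------------------------------------------------
    infectio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0308476   .0391753     0.79   0.437    -.0489499    .1106451
     routine |   .1022811   .0212149     4.82   0.000     .0590679    .1454944
      census |    .004114   .0021598     1.90   0.066    -.0002854    .0085134
     service |   .0080387    .020362     0.39   0.696    -.0334373    .0495148
       _cons |  -.1438581    2.32319    -0.06   0.951    -4.876041    4.588325
------------------------------------------------------------------------------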

Estimated regression function


Ŷ = -0.1438581 + 0.0308476*age + 0.1022811*routine + 0.004114*census +
0.0080387*service

. regress infectio age routine census service if region==4

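      Source |       SS       df       MS              Number of obs =      16
-------------+------------------------------           F(  4,    11) =    0.27
       Model |  1.03252919     4  .258132298           Prob > F      =  0.8908
    Residual |  10.4918458    11  .953804164           R-squared     =  0.0896
-------------+------------------------------           Adj R-squared = -0.2415
       Total |   11.524375    15  .768291667           Root MSE      =  .97663

------------------------------------------------------------------------------
    infectio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0352413   .0743833     0.47   0.645    -.1284751    .1989578
     routine |   .0403277   .0587802     0.69   0.507    -.0890466     .169702
      census |   -.000664   .0031731    -0.21   0.838     -.007648    .0063201
     service |   .0127923   .0258659     0.49   0.631    -.0441381    .0697227
       _cons |   1.566549   4.382848     0.36   0.728    -8.080035    11.21313
------------------------------------------------------------------------------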

Estimated regression function


Ŷ = 1.566549 + 0.0352413*age + 0.0403277*routine -0.000664*census + 0.0127923*service
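
The overall F statistic reported for each region is the ratio of the model mean square to the residual mean square. The Python sketch below recomputes it from the MS column of the four ANOVA tables above; MEAN_SQUARES and f_stats are illustrative names.

```python
# (model mean square, residual mean square) copied from each regional ANOVA table.
MEAN_SQUARES = {
    1: (5.03151329, 1.02177098),
    2: (5.71924758, 1.21191703),
    3: (11.6644804, 0.936720358),
    4: (0.258132298, 0.953804164),
}

# F = MS(Model) / MS(Residual), testing whether all slopes are zero.
f_stats = {region: msr / mse for region, (msr, mse) in MEAN_SQUARES.items()}

for region, f in f_stats.items():
    print(f"region {region}: F = {f:.2f}")
```

An F value below 1, as in region 4, means the model mean square is smaller than the residual mean square, which is why that regression is far from significant (Prob > F = 0.8908).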

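B. The estimated coefficients differ somewhat across the four regions; for example, the
coefficient of age ranges from 0.0047 (region 2) to 0.1170 (region 1), and that of routine from
0.0403 (region 4) to 0.1023 (region 3). In every region, however, the coefficient of census is
close to zero (between -0.0007 and 0.0041), so a unit increase in the average number of patients
in the hospital per day during the study period makes little or no contribution to the infection
risk when age, routine culturing ratio and available facilities and services are held constant.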
C. Calculation of MSE and R² for each region.
Region 1 (NE):
√MSE = 1.0108
MSE = 1.0108*1.0108 = 1.02171664
R² = 0.4613
Region 2 (NC):
√MSE = 1.1009
MSE = 1.1009*1.1009 = 1.21198081
R² = 0.4115
Region 3 (S):
√MSE = 0.96784
MSE = 0.96784*0.96784 = 0.9367142656
R² = 0.6088
Region 4 (W):
√MSE = 0.97663
MSE = 0.97663*0.97663 = 0.9538061569
R² = 0.0896
The MSE values are similar in all four regions (about 0.94 to 1.21), but the R² values are not.
Regions 1 and 2 are similar (0.4613 and 0.4115), region 3 has the highest R² (0.6088), and
region 4 has by far the lowest (0.0896), where the regression is not significant
(F = 0.27, Prob > F = 0.8908).
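
Squaring the rounded Root MSE gives a slightly rounded MSE. The MSE can also be obtained directly as the residual sum of squares divided by its degrees of freedom, which is the MS entry in the Residual row of each ANOVA table. A short Python sketch of that check, with values copied from the regional outputs and illustrative names:

```python
import math

# (residual SS, residual df, printed Root MSE) for each region, from the output.
REGIONS = {
    "NE": (23.5007326, 23, 1.0108),
    "NC": (32.7217597, 27, 1.1009),
    "S": (29.9750515, 32, 0.96784),
    "W": (10.4918458, 11, 0.97663),
}

# MSE = SSE / (n - p), where the residual df is n - p.
mse = {name: sse / df for name, (sse, df, _) in REGIONS.items()}

for name, (_, _, root_mse) in REGIONS.items():
    # Compare the direct MSE with the square of the printed Root MSE.
    print(name, round(mse[name], 6), round(root_mse ** 2, 6),
          round(math.sqrt(mse[name]), 5))
```

The two routes agree to about four decimal places; the small differences come only from the rounding of Root MSE in the printed output.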
D. Obtain the residuals for each fitted model and prepare a box plot of the residuals for each
fitted model.

. regress infectio age routine census service if region==1


. * keep residuals only for the hospitals used in this fit (region 1)
. predict resid1 if e(sample), residual
. graph box resid1
[Figure: box plot of resid1 (Residuals)]

The box plot above shows no outliers, and most of the residuals fall between -1 and 1. This
is consistent with an approximately normal distribution of the residuals.
. regress infectio age routine census service if region==2


. * keep residuals only for the hospitals used in this fit (region 2)
. predict resid2 if e(sample), residual
. graph box resid2
[Figure: box plot of resid2 (Residuals)]

The box plot above shows one outlier. The bulk of the residuals lies in the lower part of the
range, while the extreme value is on the positive side, which points to positive (right)
skewness rather than a symmetric distribution.
. regress infectio age routine census service if region==3


. * keep residuals only for the hospitals used in this fit (region 3)
. predict resid3 if e(sample), residual
. graph box resid3
[Figure: box plot of resid3 (Residuals)]

The box plot shows slight positive skewness. The residuals are otherwise approximately
normally distributed, apart from one extreme value on the negative side.
. regress infectio age routine census service if region==4


. * keep residuals only for the hospitals used in this fit (region 4)
. predict resid4 if e(sample), residual
. graph box resid4
[Figure: box plot of resid4 (Residuals)]

The box plot is consistent with an approximately normal distribution, but two residuals lie
beyond the lower and upper whiskers and appear to be outliers.
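
For reference, a box plot conventionally flags a point as an outlier when it lies more than 1.5 times the interquartile range below the lower quartile or above the upper quartile, not merely outside the quartiles. The Python sketch below applies that rule to a small set of hypothetical residuals (invented for illustration, not the region 4 residuals); the quartile interpolation used here is an assumption and may differ slightly from the one Stata uses.

```python
from statistics import quantiles


def box_plot_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical residuals, made up for this example only.
hypothetical = [-2.1, -0.9, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4, 0.6, 0.8, 2.6]
flagged = box_plot_outliers(hypothetical)
print(flagged)  # [-2.1, 2.6]
```

In this example the quartiles are -0.4 and 0.5, so the fences are -1.75 and 1.85: values such as -0.9 and 0.8 are outside the quartiles but are not outliers.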
