Low Birth Weight

A Regression Analysis on the birth weight of Children born in Philadelphia, 1990











Christian Jones
Andrew Garrison

Introduction:
Low birth weight has been linked to mental retardation, epilepsy, deafness, and blindness.
It has furthermore been proven to increase chances of perinatal mortality (Bache et al 2008).
Because of this, we want to analyze the factors that may contribute to low birth weight in
children. Many of the suggested causes for low birth weight are manageable, even in poor
countries (Pojda & Kelley 2000). We therefore hope that by finding some of the key
contributing factors to low birth weight, we can shed some light on this issue.
We chose to analyze the birth weight of children born in Philadelphia in the
year 1990. We found this to be relevant because we and our peers were born close to this time,
and thus could have fit right into this dataset. While there are many factors that may affect birth
weight, we chose to focus on a select few for simplicity's sake. Smoking is often linked to low
birth weight, and thus we wanted to test that factor in our regression (Bache et al 2008). Other
predictor variables we chose to include were race, education level, and gestation period. Race
could possibly affect the outcome due to genetic differences. Education level may not have a
direct effect on birth weight, but higher education may lead to healthier decisions which improve
the child's birth weight. Finally, gestation period should also directly affect the birth weight.
We expect that smoking and gestation period will be strong predictors of birth weight in
our regression analysis. We are unsure at this time whether race and education level will be
significant to the model.

Analysis:
We will attempt to build a linear regression model to predict the birth weight of children
(measured in grams). Our pool of potential predictor variables includes race (other = 0, black = 1),
education level (number of years), whether or not the mother smoked during pregnancy (no = 0,
yes = 1) and gestation period (number of days).
We begin by examining the Minitab output regressing birth weight on all four of these
predictors (fig. 1). Using Minitab's data subsetting feature, we immediately see that there may be
an interaction between the race and smoking variables, and there may be curvature in the
gestation period variable. There is also a clear lack of fit (p = 0.000). In order to correct these
problems, we add an interaction term between race and smoking and add squared and cubed
terms of the gestation period variable. We run the regression analysis in Minitab again and these
issues are resolved. However, there is clearly collinearity present between the gestation period
variable and its squared and cubed terms (VIF = 6750.440, 30542.551 and 8773.925,
respectively). To address this concern, we center the gestation period variable by subtracting the
mean value from each value. We then square and cube this centered variable to create the polynomial
terms. These centered variables are represented by "Cgestation", "Cgest^2" and "Cgest^3." Fig.
2a shows the Minitab output for this regression analysis. We see that there are still legitimate
collinearity concerns for the squared and cubed terms of the gestation period variable (VIF
= 23.061 and 19.656, respectively). However, having already centered the variables, we choose to
move forward.
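
The effect of centering on collinearity can be illustrated with a short sketch. The Python snippet below uses made-up, evenly spaced gestation values (an assumption for illustration, not the Philadelphia data) and the two-predictor identity VIF = 1 / (1 - r^2). The helper names pearson_r and vif_two_predictors are hypothetical and not part of the Minitab analysis.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(r):
    """VIF when a model holds exactly two predictors with correlation r."""
    return 1.0 / (1.0 - r ** 2)

# Illustrative gestation values only (NOT the study data).
gestation = [float(g) for g in range(30, 45)]
mean_g = sum(gestation) / len(gestation)
centered = [g - mean_g for g in gestation]

# Correlation between the linear and squared terms, before and after centering.
r_raw = pearson_r(gestation, [g ** 2 for g in gestation])
r_centered = pearson_r(centered, [c ** 2 for c in centered])

print(f"raw:      r = {r_raw:.4f}, VIF = {vif_two_predictors(r_raw):.1f}")
print(f"centered: r = {r_centered:.4f}, VIF = {vif_two_predictors(r_centered):.1f}")
```

With symmetric toy values the linear and squared terms become uncorrelated after centering. Real gestation data are typically not symmetric, which is one plausible reason some collinearity remains among the higher-order terms in fig. 2a.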
Next, we check the assumption that the residual terms follow a normal distribution (fig.
2b). The Anderson-Darling test statistic is 0.365, with p = 0.437. Since p exceeds 0.05, we fail to
reject normality and conclude that it is plausible that the residuals follow a normal distribution.
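
As a rough check on the reported p-value, the sketch below converts an Anderson-Darling statistic into an approximate p-value. It assumes the small-sample adjustment and piecewise formulas usually attributed to D'Agostino and Stephens (1986) for the case where the mean and standard deviation are estimated; whether Minitab uses exactly these formulas is an assumption here, and the function name is hypothetical.

```python
import math

def anderson_darling_p_value(a2, n):
    """Approximate p-value for a normality A-D statistic (mean, sd estimated).

    Uses the adjusted statistic A* = A2 * (1 + 0.75/n + 2.25/n^2) and the
    piecewise approximations attributed to D'Agostino and Stephens (1986).
    """
    a_star = a2 * (1 + 0.75 / n + 2.25 / n ** 2)
    if a_star >= 0.6:
        return math.exp(1.2937 - 5.709 * a_star + 0.0186 * a_star ** 2)
    if a_star > 0.34:
        return math.exp(0.9177 - 4.279 * a_star - 1.38 * a_star ** 2)
    if a_star > 0.2:
        return 1 - math.exp(-8.318 + 42.796 * a_star - 59.938 * a_star ** 2)
    return 1 - math.exp(-13.436 + 101.14 * a_star - 223.73 * a_star ** 2)

# Values reported in the text: AD = 0.365 with N = 1115 residuals.
p = anderson_darling_p_value(0.365, 1115)
print(f"approximate p-value: {p:.3f}")
```

The result lands close to the reported p = 0.437; a small difference is expected because the printed statistic is rounded to three decimals.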
We also check the assumption that the residual terms have equal variance by applying
the modified Levene test (fig. 2c). We conclude that the error terms for the education level (p =
0.843) and gestation period (p = 0.843) variables, as well as the squared (p = 0.930) and cubed (p
= 0.843) terms for gestation period, all have constant variance. We cannot run this test for the
race and smoking variables because they are not continuous.
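
For readers unfamiliar with the modified Levene test, a minimal sketch of its common two-group form follows: the residuals are split into two groups (for example at the median of a predictor), absolute deviations from each group's median are computed, and the two sets of deviations are compared with a pooled two-sample t statistic. The exact variant Minitab applied is not stated in this report, so this form is an assumption, and the residual values are toy numbers rather than the study data.

```python
import math
from statistics import median

def modified_levene_t(group_low, group_high):
    """Pooled two-sample t statistic on absolute deviations from group medians."""
    med_low, med_high = median(group_low), median(group_high)
    d1 = [abs(e - med_low) for e in group_low]
    d2 = [abs(e - med_high) for e in group_high]
    n1, n2 = len(d1), len(d2)
    m1, m2 = sum(d1) / n1, sum(d2) / n2
    pooled_var = (sum((d - m1) ** 2 for d in d1)
                  + sum((d - m2) ** 2 for d in d2)) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Toy residuals (NOT the study data): one pair with equal spread,
# one pair where the second group is ten times as spread out.
low = [-2.0, -1.0, 0.0, 1.0, 2.0]
high_same = [-2.0, -1.0, 0.0, 1.0, 2.0]
high_wide = [-20.0, -10.0, 0.0, 10.0, 20.0]

t_same = modified_levene_t(low, high_same)
t_wide = modified_levene_t(low, high_wide)
print(f"equal spread:   t = {t_same:.3f}")
print(f"unequal spread: t = {t_wide:.3f}")
```

A t statistic near zero is consistent with constant variance, while a large absolute value suggests the spread of the residuals changes with the predictor.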
Having checked the assumptions and made the necessary transformations,
we begin the process of building the best model. We split our data into two equal sets: one to
select a few possible models and the other to validate our selection. We make use of all three of
Minitab's automatic procedures: forward selection, backward selection and stepwise
selection (figs. 3a-c). Both the forward and stepwise selection methods select the same model,
which includes the gestation, race, smoking, and race-smoking interaction
predictors. The backward selection method selects a model including all
of the predictors except education level. We also use Minitab's best subsets function to compare
models based on R², R²(adj) and Cp values (fig. 3d). We find that the model selected by the
backward selection method has the highest R² and R²(adj) values and the lowest Cp. Thus,
we choose these two models to compare using our validation data set.
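
The Cp values used above can be reproduced from the S column of fig. 3d. The sketch below assumes the standard definition Cp = SSE_p / MSE_full - (n - 2p), where p counts the intercept; the function name mallows_cp is hypothetical, and the inputs are the rounded S values printed in the appendix.

```python
def mallows_cp(s_sub, n_params_sub, s_full, n_obs):
    """Mallows Cp = SSE_p / MSE_full - (n - 2p), from residual std. errors S.

    n_params_sub counts the intercept, so p = (number of predictors) + 1.
    """
    sse_sub = s_sub ** 2 * (n_obs - n_params_sub)
    mse_full = s_full ** 2
    return sse_sub / mse_full - (n_obs - 2 * n_params_sub)

N = 557            # size of the model-building half (figs. 3a-d)
S_FULL = 439.07    # S for the full 7-predictor model (fig. 3d)

cp_six = mallows_cp(438.67, 7, S_FULL, N)   # model chosen by backward selection
cp_four = mallows_cp(439.82, 5, S_FULL, N)  # model chosen by forward/stepwise

print(f"six-predictor model:  Cp = {cp_six:.1f}")
print(f"four-predictor model: Cp = {cp_four:.1f}")
```

Both values agree with the Cp column in fig. 3d to one decimal place, which suggests Minitab's Cp follows this definition.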
Using the validation data set, we again use Minitab's best subsets feature to compare our
two possible models (fig. 4). The only difference between the models is whether the squared and
cubed terms for the gestation period are included. While including these terms yields only a
marginal increase in the R² and R²(adj) values, the Cp value decreases from 26.8 to 7.0. Since these
terms can be manufactured from the gestation period term, including them in the model will not
require any additional observations and will not be expensive. Therefore, we feel that
the large reduction in the Cp value justifies their inclusion in the model. Our final model includes
race, smoking, the race and smoking interaction, gestation period, and the squared and cubed
terms of gestation period as its predictors. Education level is excluded from the model. The
regression analysis of this model is shown in fig. 5.
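
To make the fitted equation concrete, the sketch below evaluates the rounded fig. 5 equation for each combination of race and smoking status at the mean gestation period (Cgestation = 0). The function name predict_weight is hypothetical, and the rounded coefficients are taken directly from the printed regression equation.

```python
def predict_weight(black, smoke, cgestation):
    """Predicted birth weight in grams from the rounded fig. 5 equation."""
    return (3439
            - 228 * black
            - 317 * smoke
            + 139 * cgestation
            + 215 * black * smoke
            - 10.3 * cgestation ** 2
            - 0.534 * cgestation ** 3)

# Evaluate at the mean gestation period (centered value 0).
for black in (0, 1):
    for smoke in (0, 1):
        weight = predict_weight(black, smoke, 0)
        print(f"black={black} smoke={smoke}: {weight:.0f} g")
```

Because of the positive interaction term, the estimated smoking effect is -317 g when black = 0 but -317 + 215 = -102 g when black = 1.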

Conclusion:
Our regression supported much of what we thought to be true, and also surprised us in some
areas. Our regression confirmed our beliefs that smoking and gestation period are key components in
predicting birth weight, although we did not expect to see an R² of only 54.2%. However, birth
weight is a complex outcome that cannot be easily explained by four variables. If four variables can
explain 54.2% of the variation in birth weight, that can be considered a success.
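
The 54.2% figure can be recovered from the analysis of variance table in fig. 5. The sketch below recomputes R², R²(adj) and the F statistic from the printed sums of squares, using the usual definitions; the variable names are our own.

```python
# Sums of squares and sample size as printed in fig. 5.
SSR = 242254609      # regression sum of squares
SSE = 204843542      # residual (error) sum of squares
SST = 447098151      # total sum of squares
N_OBS = 1115         # total DF 1114 plus one
N_PREDICTORS = 6

r_sq = 1 - SSE / SST
mse = SSE / (N_OBS - N_PREDICTORS - 1)
r_sq_adj = 1 - mse / (SST / (N_OBS - 1))
f_stat = (SSR / N_PREDICTORS) / mse

print(f"R-Sq      = {100 * r_sq:.1f}%")
print(f"R-Sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"F         = {f_stat:.2f}")
```

These match the R-Sq, R-Sq(adj) and F values Minitab reports for the final model.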
We were somewhat surprised to see that education did not contribute much to the model.
Education clearly should not directly impact birth weight; however, we thought that since it might impact
a mother's health choices, higher education might lead to higher birth weights.
It is interesting to note that race was a significant component of our selected model. It could
be that low birth weight is genetically more common in black people, but it is also very possible that it is
actually more closely linked to socioeconomic class.
The inclusion of the gestation period was no surprise to us. It seems intuitive that a
shorter gestation period would likely result in a smaller baby, and our results confirm our suspicions. As
we would have suspected, it is the strongest indicator of birth weight in our model.
Through running our regression, we now have a better understanding of some of the factors
contributing to low birth weight. Smoking and gestation period were factors as expected, but we were
surprised to learn that higher education levels may not be a reasonable solution to the problem.
Furthermore, we were surprised to find that race acted as a significant predictor of low birth weight. We
are satisfied with the results of our regression and have learned more about the causes of low birth weight.





Appendix

Figure 1: Model including race, education, smoking, gestation

The regression equation is
weight = - 2834 - 169 black + 9.57 edu - 175 smoke + 157 gestation


Predictor       Coef  SE Coef       T      P    VIF
Constant     -2834.5    215.6  -13.15  0.000
black        -168.97    27.26   -6.20  0.000  1.051
edu            9.572    6.458    1.48  0.139  1.074
smoke        -174.81    31.62   -5.53  0.000  1.073
gestation    156.512    5.014   31.22  0.000  1.051


S = 436.107 R-Sq = 52.8% R-Sq(adj) = 52.6%


Analysis of Variance

Source            DF         SS        MS       F      P
Regression         4  235987581  58996895  310.20  0.000
Residual Error  1110  211110570    190190
Total           1114  447098151


Source     DF     Seq SS
black       1   29407936
edu         1    3020604
smoke       1   18211075
gestation   1  185347966


Unusual Observations

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.


Lack of fit test
Possible interaction in variable black (P-Value = 0.000 )


Possible interaction in variable smoke (P-Value = 0.047 )


Possible curvature in variable gestation (P-Value = 0.000 )

Overall lack of fit test is significant at P = 0.000








Figure 2: Model including race, education, smoking, gestation (centered), interaction term
between race and smoking, and gestation (centered) squared and cubed

Fig. 2a Regression Analysis
The regression equation is
weight = 3352 - 222 black + 6.68 edu - 306 smoke + 139 Cgestation
+ 209 black·smoke - 10.2 Cgest^2 - 0.527 Cgest^3


Predictor        Coef  SE Coef      T      P     VIF
Constant      3352.28    86.85  38.60  0.000
black         -222.48    30.60  -7.27  0.000   1.363
edu             6.677    6.405   1.04  0.297   1.087
smoke         -305.87    50.38  -6.07  0.000   2.803
Cgestation    138.517    7.843  17.66  0.000   2.647
black·smoke    209.29    62.70   3.34  0.001   3.093
Cgest^2       -10.199    2.110  -4.83  0.000  23.061
Cgest^3       -0.5271   0.1208  -4.36  0.000  19.656


S = 429.956 R-Sq = 54.2% R-Sq(adj) = 53.9%


Analysis of Variance

Source            DF         SS        MS       F      P
Regression         7  242455509  34636501  187.36  0.000
Residual Error  1107  204642642    184862
Total           1114  447098151


Source       DF     Seq SS
black         1   29407936
edu           1    3020604
smoke         1   18211075
Cgestation    1  185347966
black·smoke   1    2077746
Cgest^2       1     871580
Cgest^3       1    3518602

No evidence of lack of fit (P = 0.1).


















Fig. 2b Checking Normality of Residuals

Anderson-Darling Test
H0: residuals follow a normal distribution
Ha: residuals do not follow a normal distribution
At the .05 significance level, we fail to reject H0 and conclude that the normality assumption is plausible.



Fig. 2c Checking Equal Variance Assumption (Modified Levene's Test)

Education: P = 0.843
Centered Gestation: P = 0.843
(Centered Gestation)^2: P = 0.930
(Centered Gestation)^3: P = 0.843

Note: We cannot run the test for the race and smoking variables because they are not continuous.

H0: residuals have constant variance
Ha: residuals do not have constant variance

For all continuous variables, we fail to reject H0 at the .05 significance level and conclude that the residuals have
equal variance.








[Fig. 2b plot: Probability Plot of RESI1, Normal - 95% CI. x-axis: RESI1; y-axis: Percent. Mean 3.164281E-12, StDev 428.6, N 1115, AD 0.365, P-Value 0.437]

Figure 3: Model Building

Fig. 3a Forward Selection
Forward selection. Alpha-to-Enter: 0.25


Response is weight on 7 predictors, with N = 557


Step              1      2      3      4
Constant       3208   3325   3368   3399

Cgestation    167.7  161.3  156.3  156.8
T-Value       23.83  22.98  22.39  22.56
P-Value       0.000  0.000  0.000  0.000

black                 -196   -181   -234
T-Value              -4.93  -4.64  -5.29
P-Value              0.000  0.000  0.000

smoke                        -199   -349
T-Value                     -4.61  -4.73
P-Value                     0.000  0.000

black·smoke                          225
T-Value                             2.50
P-Value                            0.013

S               459    450    442    440
R-Sq          50.57  52.65  54.40  54.91
R-Sq(adj)     50.48  52.48  54.15  54.58
Mallows Cp     54.2   30.7   11.2    6.9


























Fig. 3b Backward Selection
Backward elimination. Alpha-to-Remove: 0.1


Response is weight on 7 predictors, with N = 557


Step              1      2
Constant       3417   3423

black          -235   -235
T-Value       -5.17  -5.30
P-Value       0.000  0.000

edu             0.5
T-Value        0.05
P-Value       0.957

smoke          -353   -354
T-Value       -4.66  -4.82
P-Value       0.000  0.000

black·smoke     236    237
T-Value        2.59   2.62
P-Value       0.010  0.009

Cgest^2        -6.4   -6.4
T-Value       -2.16  -2.17
P-Value       0.031  0.030

Cgest^3       -0.35  -0.35
T-Value       -2.18  -2.19
P-Value       0.030  0.029

Cgestation      147    147
T-Value       12.83  12.84
P-Value       0.000  0.000

S               439    439
R-Sq          55.31  55.31
R-Sq(adj)     54.74  54.82
Mallows Cp      8.0    6.0



















Fig. 3c Stepwise Selection
Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15


Response is weight on 7 predictors, with N = 557


Step              1      2      3      4
Constant       3208   3325   3368   3399

Cgestation    167.7  161.3  156.3  156.8
T-Value       23.83  22.98  22.39  22.56
P-Value       0.000  0.000  0.000  0.000

black                 -196   -181   -234
T-Value              -4.93  -4.64  -5.29
P-Value              0.000  0.000  0.000

smoke                        -199   -349
T-Value                     -4.61  -4.73
P-Value                     0.000  0.000

black·smoke                          225
T-Value                             2.50
P-Value                            0.013

S               459    450    442    440
R-Sq          50.57  52.65  54.40  54.91
R-Sq(adj)     50.48  52.48  54.15  54.58
Mallows Cp     54.2   30.7   11.2    6.9





























Fig. 3d Criterion Selection
Response is weight

Predictor columns, in order: black, edu, smoke, black·smoke, Cgest^2, Cgest^3, Cgestation

Vars R-Sq R-Sq(adj) Mallows Cp S (X = predictor included)
1 50.6 50.5 54.2 459.24 X
1 30.4 30.2 302.3 545.05 X
2 52.6 52.5 30.7 449.90 X X
2 52.6 52.5 30.9 450.01 X X
3 54.4 54.2 11.2 441.90 X X X
3 53.1 52.8 27.4 448.25 X X X
4 54.9 54.6 6.9 439.82 X X X X
4 54.4 54.1 12.9 442.19 X X X X
5 54.9 54.5 8.7 440.15 X X X X X
5 54.9 54.5 8.8 440.18 X X X X X
6 55.3 54.8 6.0 438.67 X X X X X X
6 54.9 54.4 10.7 440.53 X X X X X X
7 55.3 54.7 8.0 439.07 X X X X X X X




































Figure 4: Model Validation

Response is weight

Predictor columns, in order: black, smoke, black·smoke, Cgest^2, Cgest^3, Cgestation

Vars R-Sq R-Sq(adj) Mallows Cp S (X = predictor included)
1 47.6 47.5 61.4 443.73 X
1 32.6 32.5 237.2 503.13 X
1 28.7 28.5 284.0 517.78 X
1 5.6 5.4 554.8 595.59 X
1 3.6 3.4 578.1 601.83 X
2 49.3 49.1 43.9 437.04 X X
2 48.8 48.6 49.7 439.16 X X
2 48.1 47.9 57.2 441.89 X X
2 48.1 47.9 57.8 442.09 X X
2 47.7 47.5 62.4 443.75 X X
3 50.5 50.2 31.3 432.03 X X X
3 49.9 49.6 38.5 434.70 X X X
3 49.5 49.2 43.2 436.44 X X X
3 49.4 49.1 44.2 436.80 X X X
3 49.4 49.1 44.5 436.89 X X X
4 51.4 51.1 22.6 428.40 X X X X
4 51.2 50.9 24.7 429.19 X X X X
4 51.1 50.7 26.8 429.98 X X X X
4 50.7 50.4 30.9 431.52 X X X X
4 50.6 50.3 31.7 431.83 X X X X
5 52.7 52.2 10.0 423.28 X X X X X
5 51.7 51.3 21.2 427.52 X X X X X
5 51.6 51.2 22.5 428.03 X X X X X
5 51.2 50.8 26.8 429.64 X X X X X
5 50.6 50.2 33.7 432.20 X X X X X
6 53.1 52.6 7.0 421.76 X X X X X X
















Figure 5: Regression Analysis of Final Model

The regression equation is
weight = 3439 - 228 black - 317 smoke + 139 Cgestation + 215 black·smoke
- 10.3 Cgest^2 - 0.534 Cgest^3


Predictor        Coef  SE Coef       T      P     VIF
Constant      3439.07    24.75  138.93  0.000
black         -227.52    30.22   -7.53  0.000   1.329
smoke         -316.64    49.31   -6.42  0.000   2.685
Cgestation    138.724    7.841   17.69  0.000   2.646
black·smoke    215.25    62.44    3.45  0.001   3.067
Cgest^2       -10.285    2.109   -4.88  0.000  23.026
Cgest^3       -0.5338   0.1206   -4.42  0.000  19.600


S = 429.973 R-Sq = 54.2% R-Sq(adj) = 53.9%


Analysis of Variance

Source            DF         SS        MS       F      P
Regression         6  242254609  40375768  218.39  0.000
Residual Error  1108  204843542    184877
Total           1114  447098151


Source       DF     Seq SS
black         1   29407936
smoke         1   20664220
Cgestation    1  185497639
black·smoke   1    2229039
Cgest^2       1     836653
Cgest^3       1    3619122

No evidence of lack of fit (P = 0.1).