
CLASS XIII

Controlling for Other Variables


Association versus Causation

[Scatterplot: “Storks and Births.” Number of storks on the x-axis (0 to 10), number of births on the y-axis (0 to 30): the classic illustration that association does not imply causation.]
Equations for bivariate and multiple regression

The bivariate regression equation is expressed as:

Y = a + bX

a = the Y intercept
b = the slope of the relationship between X and Y

The multiple regression equation is expressed as:

Y = a + b1X1 + b2X2

a = the Y intercept
b1 = the slope of the relationship between X1 and Y, holding X2 constant
b2 = the slope of the relationship between X2 and Y, holding X1 constant
What happens to the X1 → Y association
when we consider X2?

When X2 is introduced¹, there are three possible
effects on the X1 → Y association. Either:

1. The magnitude of the association changes
2. The association disappears (magnitude = 0)
3. The magnitude of the association stays the same

¹ “X2 is introduced” = “X2 is held constant” = “adjusting for X2”
--- these are used interchangeably in multiple regression.
(1) Changes, (2) Disappears, (3) Stays the same

• We’ll illustrate these three possibilities in three different ways:
  • Venn diagrams
  • Word problem examples
  • Regression equations
• Each time:
  • b reflects the magnitude of the association between X1 and Y
    (bivariate)
  • b1 reflects the magnitude of the association between X1 and Y,
    holding X2 constant (multivariate)
The bivariate association
Seeing the patterns in Venn diagrams

[Venn diagram: circles for X1 and Y overlap; the overlap is labeled b.]

b reflects the magnitude of the bivariate
association between X1 and Y.
Seeing the patterns in an example

Maternal Age → Infant birth weight

Suppose that you find that older mothers give birth to heavier
babies. That is, the slope of maternal age (X) on birthweight (Y)
is positive.

Is this relationship direct, or could other factors explain it?

Let’s look at how the relationship is modified by 3rd variables.


Seeing the patterns in a regression context

If maternal age and birth weight were measured as continuous
variables, then we could look at the association between the two
using regression.

Here is what the regression equation might look like:

BW = a + b (Maternal age)
where a = 2500 and b = 20

(BW in grams; AGE in years)

BW = 2500 + 20 (AGE)

[Plot: fitted regression line of birth weight (BW) against maternal age (AGE), sloping upward.]

“For every additional year of age, birthweight increases 20 grams”
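To make this concrete, here is a minimal sketch in Python (the course itself uses Stata; this is just for illustration). The data are simulated so that the true intercept and slope match the slide’s hypothetical values of 2500 and 20; all variable names and numbers here are invented.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500

# Hypothetical data generated from BW = 2500 + 20*AGE + noise
age = rng.uniform(18, 40, n)                   # maternal age in years
bw = 2500 + 20 * age + rng.normal(0, 300, n)   # birth weight in grams

# Bivariate regression of BW on AGE
fit = sm.OLS(bw, sm.add_constant(age)).fit()
print(fit.params)  # estimates should land near a = 2500, b = 20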


(1) Adding a third variable
changes the magnitude of the association
Seeing the patterns in Venn diagrams

[Venn diagram: circles for X1, X2, and Y all overlap; b1 is the part of the X1–Y overlap not shared with X2.]
b1 reflects the magnitude of the association
between X1 and Y, holding X2 constant.
In this case b1 changes: it is smaller than the bivariate b.
Seeing the patterns in an example

Maternal Age → Infant birth weight

Third variable: maternal smoking during pregnancy


Consider the 3rd variable “smoking during pregnancy”

Suppose: older women are less likely to smoke during pregnancy, and
women who smoke less have healthier (heavier) babies.

Holding smoking constant, there is still an association between maternal
age and birth weight, but it is smaller. We say that maternal smoking
confounds the association between maternal age and birth weight.
Seeing the patterns in a regression context

Let’s now look at the association between AGE and BW, holding
constant # of cigarettes smoked.

The form of our regression equation would be:

BW = a + b1 (AGE) + b2 (# cigs smoked/day)


And the values of the coefficients might be:
a = 2800, b1 = 15, and b2 = -25

or

BW = 2800 + 15 (AGE) - 25 (# cigs/day)

[Plot: fitted line of BW against AGE, with a shallower slope than in the bivariate case.]

“For every additional year of age, birthweight increases 15 grams,
holding constant number of cigarettes smoked”
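A minimal Python sketch (invented data, for illustration only) shows the confounding mechanically: smoking is built to decline with age and to lower birth weight, so the bivariate age slope is inflated, and it shrinks back toward 15 once smoking is held constant.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2000

age = rng.uniform(18, 40, n)
# Hypothetical: older mothers smoke fewer cigarettes per day
cigs = np.clip(rng.normal(15 - 0.4 * age, 2), 0, None)
# True model: BW = 2800 + 15*AGE - 25*CIGS + noise
bw = 2800 + 15 * age - 25 * cigs + rng.normal(0, 300, n)

# Bivariate: the age slope also absorbs the smoking effect (> 15)
print(sm.OLS(bw, sm.add_constant(age)).fit().params)

# Adjusted: with cigs in the model, the age slope returns to ~15
X = sm.add_constant(np.column_stack([age, cigs]))
print(sm.OLS(bw, X).fit().params)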
(2) Adding a third variable
makes the association disappear
Seeing the patterns in Venn diagrams

[Venn diagram: the X1–Y overlap lies entirely inside X2, so nothing is left over for b1.]

b1 reflects the magnitude of the association
between X1 and Y, holding X2 constant.
In this case b1 = 0 – the new variable has
swallowed up all of b’s space!
Seeing the patterns in an example

Maternal Age → Infant birth weight

Third variable: maternal socioeconomic status (SES)


Consider the 3rd variable “maternal socioeconomic status.”

Suppose: mothers of higher SES are older when they have their babies,
and mothers of higher SES have heavier babies
(perhaps because they have better prenatal care).

So when maternal SES is held constant, there is no longer
a relationship between maternal age and infant birth weight.
The bivariate relationship was spurious.
Seeing the patterns in a regression context

Now let’s look at the association between AGE and BW, holding constant
maternal SES.

The form of our regression equation would be:

BW = a + b1 (AGE) + b2 (SES)
And the values of the coefficients might be:
a = 2200, b1 = 0, and b2 = 200, or

BW = 2200 + 0 (AGE) + 200 (SES)

[Plot: with SES held constant, the fitted line of BW against AGE is flat.]

“Holding SES constant, there is no association between birthweight and age,”
or
“For every additional year of age, birthweight increases 0 grams,
holding SES constant”
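The same kind of sketch (again with invented data) shows the “disappearing” pattern: SES drives both maternal age and birth weight, while age has no direct effect at all.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000

ses = rng.normal(5, 2, n)                       # hypothetical SES score
age = 20 + 1.5 * ses + rng.normal(0, 2, n)      # higher SES -> older mothers
bw = 2200 + 200 * ses + rng.normal(0, 300, n)   # note: AGE does not appear

# Bivariate: age looks predictive only because it proxies for SES
print(sm.OLS(bw, sm.add_constant(age)).fit().params)

# Adjusted: the age slope collapses to ~0 once SES is held constant
X = sm.add_constant(np.column_stack([age, ses]))
print(sm.OLS(bw, X).fit().params)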
(3) Adding a third variable
doesn’t change the association at all
Seeing the patterns in Venn diagrams

[Venn diagram: X1 and X2 each overlap Y but not each other; b1 is untouched by X2.]

b1 reflects the magnitude of the association
between X1 and Y, holding X2 constant.
In this case b1 is the same size (magnitude)
as the bivariate b.
Seeing the patterns in an example

Maternal Age → Infant birth weight

Third variable: gender of the child

Consider the 3rd variable “gender of the child.”

Suppose that, while male children are heavier than female children,
there is no association between maternal age and gender of the child.

Gender of the child has no effect on the association between maternal age
and infant birth weight. We say that the relationship between maternal age
and infant birth weight is direct, or not confounded by the gender of
the child.
Seeing the patterns in a regression context

Now let’s look at the association between BW and AGE, holding
gender constant.

The form of our regression equation would be:

BW = a + b1 (AGE) + b2 (GENDER), where GENDER is a dummy variable (= 1 for male children)

and the values of the coefficients might be:
a = 2300, b1 = 20, and b2 = 400

or

BW = 2300 + 20 (AGE) + 400 (GENDER)

[Plot: fitted lines of BW against AGE for each gender, parallel and 400 grams apart.]

“For every additional year of age, birthweight increases 20 grams,
holding gender constant”
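A dummy X fits into the same sketch (invented data once more): because gender is generated independently of age, adding it leaves the age slope essentially unchanged.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 2000

age = rng.uniform(18, 40, n)
male = rng.integers(0, 2, n)   # dummy: 1 = male, 0 = female; unrelated to age
bw = 2300 + 20 * age + 400 * male + rng.normal(0, 300, n)

# The age slope is ~20 with or without the dummy in the model
print(sm.OLS(bw, sm.add_constant(age)).fit().params)
X = sm.add_constant(np.column_stack([age, male]))
print(sm.OLS(bw, X).fit().params)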
Multiple Regression --- The Movie

http://www.math.yorku.ca/SCS/spida/lm/visreg.html
Multiple regression in Stata:
interpreting the output
Example: Predicting the number of hours worked per week,
based on education and drinking habits

hrs_worked = number of hours worked per week
educ = number of years of education
repub = number of hours spent in the pub last night
Regression output for the example
Interpreting the regression output

hrs_worked = Number of hours worked per week
educ = Number of years of education
repub = Number of hours spent in the pub last night

N = 1725
Look under Number of obs

R2 = 0.0572
The proportion of variance in Y that is explained by both independent
variables together. In this case it is 5.72%.

Adjusted R2 = 0.0561
R2, adjusted for the number of independent variables in the model.

Sum of Squares – Related to the residuals, or the “unexplained” distance
between the regression line and the actual values of Y. FYI, one can
calculate R2 by dividing the Model Sum of Squares [= 21115.4807] by
the Total [Model + Residual] Sum of Squares [= 369471.478] – see the
ANOVA table.
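Checking this with the numbers from the ANOVA table:

R2 = Model SS / Total SS = 21115.4807 / 369471.478 ≈ 0.0572

which matches the R2 reported above.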
Interpreting the regression output, continued

F = 52.19, Sig. F = 0.000
Tests the statistical significance of the entire regression equation –
that is, the predictive value of all the variables in the equation. If the
significance is less than 0.05, then the equation is significant at the
0.05 level.

Constant = 38.49222
The constant is the Y intercept (a.k.a. “a”).

The Regression Equation

hrs_worked’ = 38.49 + 0.4831397 * [educ] - 6.556286 * [repub]

b, the Regression Coefficient
Represents the change in the value of the dependent variable [hrs_worked]
for each unit change in the independent variable.

“For each additional year of education, there is an average increase of
0.48 hours worked per week, holding constant the number of hours in the
pub last night.”

“For every additional hour spent in the pub last night, there is an
average decrease of 6.6 hours worked per week, holding constant
years of education.”
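To see the equation in action, here is a tiny Python sketch that plugs in the coefficients from the output; the person described (16 years of education, 2 hours in the pub) is hypothetical.

# Coefficients taken from the regression output above
a, b_educ, b_repub = 38.49222, 0.4831397, -6.556286

def predicted_hours(educ, repub):
    """Predicted hours worked per week from education and pub hours."""
    return a + b_educ * educ + b_repub * repub

# Hypothetical person: 16 years of education, 2 hours in the pub
print(round(predicted_hours(16, 2), 1))  # 38.49 + 7.73 - 13.11 ≈ 33.1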
Interpreting the regression output, continued

Std. Error [for b] – The standard error of b, which is needed to test
whether b is statistically significant.

t, Pr(t) – Tell you whether b is statistically significant, that is, different
from zero.

• t is the obtained t statistic for the regression coefficient. If the obtained
t falls in the critical region, then the coefficient is statistically
significantly different from zero. (For alpha = .05, the critical region is
beyond +/- 1.96.)
• Pr(t) is the p-value associated with the obtained t. If Pr(|t|) is less than
alpha, then the coefficient is statistically significantly different from zero.

If “t” were not given, you could still calculate t by using the
formula:

t = Statistic / Std. Error
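For example, with a made-up standard error (the actual one would come from the output table): if b = 0.4831 and its Std. Error were 0.11, then

t = 0.4831 / 0.11 ≈ 4.4

which is beyond +/- 1.96, so the coefficient would be significant at the 0.05 level.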
Rules for X’s in multiple regression

(For those who are aiming to finish the
final project analysis; we’ll discuss this
again next week)

• Continuous variables are OK
  • You knew that before today!
• Grouped variables with > 2 values are NOT OK
  • You knew that before today!
• Dummy variables are OK
  • These special grouped variables are permitted as X’s in regression
  • This is news – it breaks the “continuous/continuous” rule
  • We’ll talk a lot more about dummy variables in regression next week
Encore!

This week in Stat 1

• No lab this week
• For next lecture:
  • No Healey reading
  • Re-read Chapters 7, 8, and 9 of Wagner Way
  • Complete the homework
  • Complete the challenge problems
• Final assignment:
  • You can now complete almost all analyses
  • Dig into the assignment deeply before you attend next week’s
    lecture – I’ll take questions then
  • Always: analyze, then stop and think
Practice problems

