
ECON335 Wooldridge Ch. 7 2-19-24

Chapter 7
Multiple Regression Analysis with Qualitative Information

Talked about qualitative variables in Section 2-7 (7th: 51). These are variables which place a
person in a category. We use dummy variables to indicate which category a person is in.

Ex. Gender: x1 = 1 if Female and = 0 if Male.

We will now consider other issues involving use of dummy variables.

7-2 A Single Dummy Independent Variable (7th: 222)

We will consider next the situation in which we have a dummy variable & a quantitative variable
in the regression.

Ex. We propose to estimate the following model: y = β0 + β1∙x1 + β2∙x2 + u

Where y = food expenditures, x1 = 1 if female & = 0 if male, and x2 = income.

We might ask how we interpret the slope parameters in this context. The interpretation actually
turns out to be the same, with one change.

• Let's consider the expected food expenditures for males: x1 = 0

E(y|x1=0,x2) = β0 + β1∙0 + β2∙x2 = β0 + β2∙x2

So, mean food expenditures = the intercept plus the effect of income.

Graphically, we have a relationship between 2 variables that's linear. We get a line with an
intercept = β0.


If β2 > 0 we get something like (7th: 223 Fig. 7.1)

• Next, expected food expenditures for females: x1 = 1

E(y|x1=1,x2) = β0 + β1∙1 + β2∙x2 = (β0+β1) + β2∙x2

• What's the difference in the regressions? The intercept. So, the difference between male &
female expected expenditures is a shift in the intercept of the regression. Graphically,

Note this important interpretive point: the female intercept equals (β0+β1), the male
intercept plus β1. So, we can think of β1 as the difference between females and males at any given income.

We think of males as the base (benchmark) group (or the reference group), and we interpret the
parameter on the dummy variable with reference to this group. (7th: 223)
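
A minimal sketch of the intercept-shift interpretation, written in Python with NumPy purely for illustration (it is not the course's R code). The coefficient values and income grid are hypothetical, and the data are generated without an error term so that OLS recovers the coefficients exactly.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration.
b0, b1, b2 = 500.0, -228.0, 5.0    # intercept, female shift, income slope

income = np.tile(np.array([20.0, 40.0, 60.0, 80.0]), 2)   # x2, in thousands (assumed units)
female = np.repeat(np.array([0.0, 1.0]), 4)               # x1: 0 = male (base group), 1 = female
y = b0 + b1 * female + b2 * income                        # no error term, so the fit is exact

# OLS with columns: constant, dummy x1, income x2.
X = np.column_stack([np.ones_like(income), female, income])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(f, inc):
    """Fitted expected y for gender dummy f and income inc."""
    return beta_hat[0] + beta_hat[1] * f + beta_hat[2] * inc

# Female-minus-male gap at a low and a high income.
gap_low = predict(1.0, 20.0) - predict(0.0, 20.0)
gap_high = predict(1.0, 80.0) - predict(0.0, 80.0)
print(beta_hat)
print(gap_low, gap_high)
```

The gap is the same at every income level and equals the coefficient on the dummy, which is what "a shift in the intercept" means.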


R Code and Results

Ex. Female expenditures are $228 lower (on average).


We may note that the foregoing regression assumes that gender affects food expenditures only through a
shift in the intercept; it implicitly assumes that the relationship between income &
food expenditures is the same for females & males, with that impact measured by β2. In other
words, a one-unit increase in income will have the same impact on the expected food expenditures of
males and females.

We might believe that income will have differential impacts depending on whether one is
female or male. We will consider shortly how to incorporate possible differences in the effect of income on
male & female food expenditures.


• Redefine the dummy variable (7th: 223). [in Slide 1]


Suppose that we let x1 = 1 if male, and = 0 if female. As we saw in Chapter 2, the reference
group would change and the sign of the parameter on the dummy variable would be the opposite.

7-2a: Interpreting Coefficients on Dummy Explanatory Variables When the Dependent Variable is log(y) (7th: 226) skip

7-3 Using Dummy Variables for Multiple Categories (7th: 228)


The categorical variables we have considered so far have been binary in the sense that the
variable has only two categories. We might ask what happens when we have more than two
categories to consider.

Ex. Food Expenditures. Suppose that we are interested in food expenditures (FE) and theory implies that they depend
on gender (x1), family income (x2), and whether one lives in an urban, suburban or rural area.

We may note that the location variable is categorical, with three possible categories.

• The question that arises now is "how do we handle such categorical variables?"
Generally, we must create (# categories - 1) dummy variables.


Here, we have three categories. So, we must create (3-1) = 2 dummy variables.

We handle them by defining two dummy variables instead of one. So, for ex, we might define
x3 = 1 if urban & = 0 otherwise and
x4 = 1 if suburban & = 0 otherwise.

For reasons you will see below, we will drop the gender dummy variable temporarily.

Equation of interest: y = β0 + β2∙x2 + β3∙x3 + β4∙x4 + u


Reference Group: Rural.

• How would we interpret the regression results in this context? The interpretation follows that
in the binary variable case. We consider the various possible combinations of values of x3 & x4.

If we take expectations for given values of x3 and x4 we get

E(y|x3=0,x4 = 0,x2) = β0 + β2∙x2 + β3∙0 + β4∙0 = β0 + β2∙x2

E(y|x3=1,x4 = 0,x2) = β0 + β2∙x2 + β3∙1 + β4∙0 = (β0+β3) + β2∙x2

E(y|x3=0,x4 =1,x2) = β0 + β2∙x2 + β3∙0 + β4∙1 = (β0+β4) + β2∙x2

We see that the coefficient on each dummy variable represents the shift in the intercept of the
regression equation when that dummy variable = 1.

Interpretation of the coefficients on the dummy variables is with respect to the reference
category. In this case it is the area that is not covered by the two dummy variables; i.e., rural areas.


Thus, we interpret each coefficient with reference to rural areas. β3 reflects the increase
(decrease) in food expenditures of urban people, as compared to rural people. And, β4 reflects the
increase (decrease) in food expenditures of suburban people as compared with rural people.

R Code and output

Interpret the parameter estimates on the dv’s, as well as the intercept ….

• Why not three dummy variables?


One question we might ask is "why don't we include a dummy variable for rural areas?" After
all, we would then obtain a coefficient for each of the three areas.

Anyone remember why we cannot do so? Perfect multicollinearity. For every observation, the three
dummy columns would sum to one. Recalling that the variable for the intercept term is a column of ones,
we would have an independent variable (the intercept) which is an exact linear function of three
other independent variables (the dummy variables).
Intuitively, if we stick with our interpretation of the coefficients as being wrt some reference
area, we might simply say that including three d.v.'s would leave us w/o a reference area.

Note that x3 and x4 will never both = 1. One will = 1 & the other 0 or both will = 0.
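
The perfect multicollinearity argument can be checked numerically. The sketch below (Python with NumPy, used only for illustration; the six observations are made up) builds the design matrix with and without the redundant rural dummy and compares its rank with its number of columns.

```python
import numpy as np

# Six hypothetical observations: two rural, two urban, two suburban.
urban = np.array([0, 0, 1, 1, 0, 0], dtype=float)      # x3
suburban = np.array([0, 0, 0, 0, 1, 1], dtype=float)   # x4
rural = 1.0 - urban - suburban                         # the third, redundant dummy
const = np.ones(6)                                     # column of ones for the intercept

X_ok = np.column_stack([const, urban, suburban])           # rural is the reference group
X_trap = np.column_stack([const, urban, suburban, rural])  # intercept plus all three dummies

# The three dummies sum to the intercept column in every row.
dummies_sum_to_one = np.allclose(urban + suburban + rural, const)
rank_ok = np.linalg.matrix_rank(X_ok)
rank_trap = np.linalg.matrix_rank(X_trap)

print(dummies_sum_to_one)
print(rank_ok, X_ok.shape[1])      # rank equals the number of columns
print(rank_trap, X_trap.shape[1])  # rank is below the number of columns
```

When the rank falls short of the number of columns, the OLS coefficients are not uniquely determined, which is why one category must be left out.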


• Why not a variable taking on values 0, 1 & 2?


Suppose that we defined x3 as taking on those three values. Our regression model is
y = β0 + β2∙x2 + β3∙x3 + u
where we continue to ignore the Gender variable.

To see the implication of such a definition consider how we would interpret the regression
results
E(y|x3=0,x2) = β0 + β2∙x2 + β3∙0 = β0 + β2∙x2
E(y|x3=1,x2) = β0 + β2∙x2 + β3∙1 = (β0+β3) + β2∙x2
E(y|x3=2,x2) = β0 + β2∙x2 + β3∙2 = (β0+2∙β3) + β2∙x2

Again, the reference is a rural person (x3 = 0). We see that the difference in expected food expenditures
between rural & suburban (x3 = 1) is β3.

The difference between rural & urban (x3 = 2) is 2∙β3; i.e., twice the first
difference. Alternatively, we may interpret β3 as both the difference between rural & suburban
& the difference between suburban & urban.

We typically would not have reason to assume that those differences are the same. Indeed, we
would not want to do so because we might be interested in whether urban & suburban influences
are the same (as compared to rural).

We avoid making such an implicit assumption when we have a dummy variable for each
category but the reference category.
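
A small numerical sketch of the equal-steps restriction (Python with NumPy, for illustration only). The three group means are hypothetical, income is left out to keep it short, and the coding 0 = rural, 1 = suburban, 2 = urban is an assumption made for the example.

```python
import numpy as np

# Hypothetical mean food expenditures by area.
y = np.array([100.0, 110.0, 150.0])    # rural, suburban, urban
coded = np.array([0.0, 1.0, 2.0])      # single variable with assumed coding 0/1/2

# Model A: one coded variable. Each step between categories is forced to be the same size.
XA = np.column_stack([np.ones(3), coded])
bA, *_ = np.linalg.lstsq(XA, y, rcond=None)
fitted_A = XA @ bA

# Model B: separate dummies for suburban and urban, with rural as the reference.
suburban = np.array([0.0, 1.0, 0.0])
urban = np.array([0.0, 0.0, 1.0])
XB = np.column_stack([np.ones(3), suburban, urban])
bB, *_ = np.linalg.lstsq(XB, y, rcond=None)

print(bA)        # intercept 95, slope 25: both steps are forced to equal 25
print(fitted_A)  # 95, 120, 145: none of the group means is matched
print(bB)        # 100, 10, 50: each difference from rural is estimated freely
```

The true steps here are 10 and 40, but the coded variable can only report a single step of 25; the two-dummy version imposes no such restriction.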
[End 1st ½ of Class 16]
• More than one categorical Variable:
Let’s bring back the gender variable. Our equation of interest is
y = β0 + β1∙x1 + β2∙x2 + β3∙x3 + β4∙x4 + u

You will note that we now have two categorical variables: one for gender and one for location.


What impact does this have on interpreting our results? The base/reference group will now have
two characteristics: one for each category.

Ex Base group is (male, rural area)

Interpretations: consider some examples.


(1) (male, rural) - E(y|x1=0,x3=0, x4=0,x2) = β0 + β1∙0 + β2∙x2 + β3∙0 + β4∙0
= β0 + β2∙x2.
(2) (female, urban) E(y|x1=1,x3=1, x4=0,x2) = β0 + β1∙1 + β2∙x2 + β3∙1+ β4∙0
= (β0 + β1 + β3) + β2∙x2
Interpret: the intercept shifts by the female difference and the urban difference.

(3) (male, suburban) E(y|x1=0,x3=0, x4=1,x2) = β0 + β1∙0 + β2∙x2 + β3∙0 + β4∙1
= (β0 + β4) + β2∙x2

7-4a Interactions Among Dummy Variables (7th: 232)

Bring back the gender variable and interact gender with the location variables:

y = β0 + β1∙x1 + β2∙x2 + β3∙x3 + β4∙x4 + β5∙x1•x3 + β6∙x1•x4 + u

How do we interpret the interaction terms? They allow the impact of location to vary by gender
(or, the impact of gender to vary by location).

Consider some possibilities:


(1) (male, urban): E(y|x1=0,x3=1, x4=0,x2) = β0 + β1∙0 + β2∙x2 + β3∙1 + β4∙0 + β5∙0•1 + β6∙0•0
= (β0 + β3) + β2∙x2.
Impact of urbanicity for males is reflected in the β3 parameter.


(2) (female, urban): E(y|x1=1,x3=1, x4=0,x2) = β0 + β1∙1 + β2∙x2 + β3∙1 + β4∙0 + β5∙1•1 + β6∙1•0
= (β0 + β1) + (β3 + β5) + β2∙x2.

Interpretation: you will note that the intercept has two new components:

β1 is the impact (vs males) of being female in the reference (rural) area

β5 is the additional impact (vs males) of urban for females (with β3 being the impact of urban for males)

Talk about MEs: ME(x1) = β1 + β5•x3 + β6•x4


ME(x3) = β3 + β5•x1
[Relate to discussion in Ch 6]
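
The two marginal-effect formulas can be evaluated for each group. A minimal sketch in Python (for illustration only); all coefficient values are hypothetical.

```python
# Hypothetical coefficient values, for illustration only.
b1, b3, b4, b5, b6 = -30.0, 40.0, 25.0, 10.0, -5.0

def me_female(x3, x4):
    """ME(x1): change in expected y from being female, given the location dummies."""
    return b1 + b5 * x3 + b6 * x4

def me_urban(x1):
    """ME(x3): change in expected y from urban vs. rural, given the gender dummy."""
    return b3 + b5 * x1

print(me_female(0, 0), me_female(1, 0), me_female(0, 1))   # rural, urban, suburban
print(me_urban(0), me_urban(1))                            # males, females
```

With these made-up numbers the female effect is -30 in rural areas, -20 in urban areas and -35 in suburban areas, and the urban effect is 40 for males and 50 for females: the interaction terms are what let each effect depend on the other variable.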

Ex.


7-4b Allowing for Different Slopes (7th: 233)


In the Food Expenditures example, I noted that when we incorporated family income into the
regression which included gender that we were implicitly assuming that family income affected
an individual's expenditures in the same fashion for males & for females.

Suppose that we believe that family income has differential effects on the food expenditures of
males & females.

Graphically, ... we might believe that the slope of one relationship is > another.

We might ask how we could incorporate such differences into our model. It is actually easy to
incorporate differences in slopes on quantitative variables into our regressions.

Let’s ignore the Place variables for this discussion; we could include them if we wished to do so.

Our base equation is y = β0 + β1∙x1 + β2∙x2 + u

We believe that the slope of family income - β2 - should differ for males & females.
To allow for a diff in slopes we multiply the dummy variable by the income variable & introduce
the product into the regression equation. In other words, we run the following regression

y = β0 + β1∙x1 + β2∙x2 + β3∙(x1∙x2) + u

We now have 3 variables in the regression - the 3rd being an interaction tween the 1st two.

To see how the regression allows for differential slopes, let’s take the expected value of the
equation when x1 = 0 & x1 = 1.

E(y|x1 = 0) = β0 + β1∙0 + β2∙x2 + β3∙(0∙x2) = β0 + β2∙x2


The equation looks like the one we’ve seen before. Turn to

E(y|x1 = 1) = β0 + β1∙1 + β2∙x2 + β3∙(1∙x2) = (β0+β1) + (β2 + β3)∙x2


We see that the slope coefficient is the summation of two different coefficients. We also see that
the coefficients on the income variable now can vary for females & males. Graphically, ....

We should also note how we interpret the coefficient on the interaction term; it represents the
differential impact of after tax income on female and male food expenditures.

If it’s negative then income has a > impact on males; steeper slope for the line for men.
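
A minimal sketch of how the interaction term produces group-specific slopes (Python with NumPy, for illustration only). The coefficient values are hypothetical and the data contain no error term, so OLS recovers them exactly.

```python
import numpy as np

# Hypothetical parameter values.
b0, b1, b2, b3 = 500.0, -100.0, 5.0, -1.5

income = np.tile(np.array([20.0, 40.0, 60.0, 80.0]), 2)   # x2, in thousands (assumed units)
female = np.repeat(np.array([0.0, 1.0]), 4)               # x1
y = b0 + b1 * female + b2 * income + b3 * female * income  # no error term

# Regress y on a constant, x1, x2 and the product x1*x2.
X = np.column_stack([np.ones_like(income), female, income, female * income])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

slope_male = beta_hat[2]                  # β2
slope_female = beta_hat[2] + beta_hat[3]  # β2 + β3
print(slope_male, slope_female)
```

Here the interaction coefficient is negative, so the fitted line for males (slope 5) is steeper than the line for females (slope 3.5).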

Ex Food Expenditures Let’s estimate the model in STATA, ignoring the Place variables.

• Interpret the regression results.


7-4c: Testing for Differences in Regression Functions across Groups (7th: 237)

Suppose that we have run the above regression and that we wish to test whether there is a
difference between males and females in the relationship between food expenditures, income,
and location.

We would undertake the analysis by interacting the gender dummy variable with all of the other
regressors in the equation; i.e.,

y = β0 + β1∙x1 + β2∙x2 + β3∙x3 + β4∙x4 + β5∙(x1∙x2) + β6∙(x1∙x3) + β7∙(x1∙x4) + u



Interact x1 with all independent variables

• What types of tests to undertake?

It’s not straightforward because we have allowed males and females to differ in several ways: (1)
the intercept, and (2) the slope parameters.

There are several possibilities. The one that makes sense is a test for a Dissimilar Regression:
both the intercepts and slopes differ.

Estimate a model which allows both slopes and intercepts to vary and one which allows
neither to vary, and compare the two.

The test that makes most sense is the Test for Dissimilar Regression:
HO: β1 = β5 = β6 = β7 = 0.
HA: HO is not true.

A test for the intercept and all slopes being the same.
Undertake an F (Wald) test.


Ex. Test the simple interaction model above


HO: β1 = β3 = 0 (no difference in intercept or slope between females & males) HA: HO is not true
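
A sketch of the F test computed by hand from restricted and unrestricted sums of squared residuals, for the model with a gender dummy, income and their interaction (Python with NumPy, for illustration only). The null being tested is that the coefficients on the gender dummy and on its interaction with income are jointly zero. The data are simulated and every number in the data-generating process is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
female = np.repeat(np.array([0.0, 1.0]), n // 2)
income = rng.uniform(10.0, 50.0, size=n)
# Hypothetical process in which both the intercept and the slope differ by gender.
y = 10.0 + 0.5 * income + female * (-5.0 - 0.3 * income) + rng.normal(0.0, 1.0, size=n)

def ssr(X, y):
    """Sum of squared OLS residuals from regressing y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

X_r = np.column_stack([np.ones(n), income])                            # restricted: one line for everyone
X_ur = np.column_stack([np.ones(n), female, income, female * income])  # unrestricted: separate lines

q = X_ur.shape[1] - X_r.shape[1]   # number of restrictions under the null
df = n - X_ur.shape[1]             # unrestricted residual degrees of freedom (n - k - 1)
F = ((ssr(X_r, y) - ssr(X_ur, y)) / q) / (ssr(X_ur, y) / df)
print(q, df, F)
```

The statistic is compared with a critical value from the F(q, df) distribution; a large value leads us to reject the hypothesis that one regression line fits both groups.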

[start Cl 17 F21]

.______________________________ ?skip?__________________________________
Ex. The Savings-Income Relationship in the U.S. (4th: 196).
The relationship between savings & income is important to macro-economists.
We are focusing on y = β0 + β1∙x1
where y = savings and x1 = Income. Call it the Savings Equation.

☺ Suppose that we believe that savings behavior of hh’s in the U.S. changed in the early 1980s.
Suppose that we have relevant data for the 1970 - 1995 period (in Table 6-7 (4th: 197)).
We are interested in testing whether savings behavior in the 1970-1981 period differed from that
in the 1982 - 1995 period.

A change in savings behavior might manifest itself in a shift in the intercept of the savings
equation or a change in the MPS.

How might we capture these ideas? This is the same idea as distinguishing between males &
females wrt both the intercept & the impact of hh inc on food expenditures. To do that we created a
dummy variable for gender & interacted it with income.


In this case, we wish to distinguish 2 periods. So, we create a dummy variable. We might let it =
1 for the years 1982 - 1995 & let it = 0 for 1970 - 1981. Let that variable be x2.

Thus, to allow for different intercepts & slopes we estimate the following model:
y = β0 + β1∙x1 + β2∙x2 + β3∙(x1∙x2) + u

☺ How do we interpret the parameters in this model?

β1 is the MPS in the 1970-1981 period.

(β1 + β3) is the MPS in the 1982-1995 period.
β3 is the difference in the MPS in the 2nd period as compared to the 1st period.

β0 is the intercept in the 70-81 period & β2 is any shift in the intercept in the 2nd period.

☞ Have the following regression results for the foregoing model:


ŷi = 1.0161 + 0.0803∙x1 + 152.4786∙x2 - 0.0655∙(x1∙x2)

How do we interpret these results?


1st let’s consider the slope estimate. The estimate on x1 implies a MPS in the 1st period of 8%.
That seems pretty high.

The estimate on the interaction term suggests that there was a diff in the MPS in the 2 pds.
Indeed, it suggests that the MPS was roughly 82% lower in the 1982-1995 period (0.0803 - 0.0655 = 0.0148 vs 0.0803).
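
The arithmetic behind the two MPS figures can be checked directly from the reported estimates. A minimal sketch in Python, for illustration only; the variable names are made up.

```python
# Reported estimates from the fitted savings equation.
b1_hat = 0.0803    # coefficient on income (x1)
b3_hat = -0.0655   # coefficient on the interaction (x1*x2)

mps_first = b1_hat              # MPS when x2 = 0 (1970-1981)
mps_second = b1_hat + b3_hat    # MPS when x2 = 1 (1982-1995)
pct_drop = 100.0 * (mps_first - mps_second) / mps_first

print(round(mps_second, 4))   # 0.0148
print(round(pct_drop, 1))     # 81.6
```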
._______________________________ end skip__________________________________
[End Class 16 F21]


7-5 A Binary Dependent Variable: The Linear Probability Model (7th: 239)
What can we say about a model in which the dependent variable is a dummy variable? We will
now consider such a situation in the context of the LRM.

Ex 1. [7th: 240] Female spouse participation in the labor force.


y = 1 if in the labor force, y = 0 if not in the lf.
Independent variables: (i) sources of income (incl. husband’s earnings: x1),
(ii) years of education: x2, (iii) Age, (iv) # kids < 6,
(v) # kids between 6 & 18, (vi) experience: x6 (quadratic).

We have y = β0 + β1•x1 + β2•x2 + ..+ β6•x6 + β7•(x6)² + u

How to interpret the model? Notice that the dependent variable can take on only two values. So,
we have a discrete r.v. which takes on values {0,1}. The expected value equals

Pr(y=0|x)•0 + Pr(y=1|x)•1 = Pr(y=1|x) (Eq 7.27: 7th: 240).

So, the expected value equals the probability that y = 1.


Called the LPM because the response probability Pr(y=1|x) is linear in the beta parameters.

Ex 1. Wife Labor Force Participation: interpret the expected value and interpret the parameter
estimates on p. 225 (6th).
R Code and Output


Look at some predicted values


predict yh


Notes (i) experience enters in quadratically (? ask them to calculate the APE?), (ii) estimate on #
of kids < 6 is unrealistic for going from, e.g., 2 to 3 kids, or, generally, adding one more, (iii)

→Also, look at the graph of the expected value & education on page 241 (7th) (the values at
which the other regressors are fixed are nwifeinc = 50, exper = 5, age = 30, kidslt6 = 1, & kidsge6 =
0).
Note that the predicted probability is negative at low levels of education (and that predicted
probabilities can exceed one at times). This is a shortcoming of the model.

Also, the variance of y given x equals P(y=1|x)•(1 - P(y=1|x)), which depends on x. This violates the LRM assumption of homoskedasticity.
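
Both shortcomings can be seen in a tiny constructed example (Python with NumPy, for illustration only). The ten observations are hypothetical, not the labor force data from the text.

```python
import numpy as np

# Hypothetical data: y switches from 0 to 1 as x rises.
x = np.arange(10, dtype=float)
y = (x >= 5).astype(float)

# Fit the linear probability model by OLS.
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
p_hat = X @ beta_hat                       # fitted "probabilities"

print(beta_hat)                   # about -0.182 and 0.152
print(p_hat.min(), p_hat.max())   # about -0.18 and 1.18: outside the unit interval

# Var(y|x) = p(x)*(1 - p(x)) changes with p(x), so the error variance cannot be constant.
variances = [p * (1.0 - p) for p in (0.1, 0.5, 0.9)]
print(variances)
```

The fitted line dips below zero at the smallest x and rises above one at the largest, and the implied variance is largest when the probability is near one half.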

7-6 More on Policy Analysis and Program Evaluation (6th: 229)


Economists are often interested in the impact of some policy, program, or individual
characteristic on an outcome of interest.

Ex. 1 Impact of Going to College on One’s Income.


Income = f(College=1 if …, experience, age, family characteristics, …)
[write out the equation & note that want to interpret the parameter estimate on the d.v. as
reflecting the returns to education.]
We have y = β0 + β1•College + β2•x2 + … + u
And want to interpret β1 as capturing the impact of a college degree on earnings.

Ex. 2 Discrimination in Bank Small Business Loan Approvals


y = 1 if loan was approved, = 0 if rejected.
Independent Variables: Education, Experience, How long the business has existed, prior credit
history ….
Suppose that we include a d.v. for whether the loan applicant was a person of color and we
intend to interpret it as reflecting the presence of discrimination. [We would expect the
parameter estimate on the d.v. to be negative if discrimination existed.]

We have y = β0 + β1•Non-White + β2•x2 + … + u


• All of these are examples of situations in which we wish to use our regression techniques to
determine whether a policy is effective (the change in tort law) or might be effective (e.g., make
going to college cheaper), & whether some illegal behavior exists (e.g. loan applications & race).

For each of these examples, the parameter estimate is likely to be biased. Let’s consider why
for each example.
[called Selection Bias]

Ex College degree. Comparing the earnings of people who went to college with people who
didn’t get a college degree (e.g., stopped at a high school diploma). We want to attribute the
difference in income between the two groups as the returns to a college degree (reflected in the
value of the parameter estimate on the dummy variable).

Any problem with this? Ignores innate ability.

What if included, e.g., high school grades & IQ test results. Would that take care of it? While
they might capture some aspects of innate ability they won’t capture everything. There will be
factors relevant to college degree attainment which are not in the equation.
For the reasons cited earlier, we would have relevant variables missing from the regression
which would bias parameter estimates. If the missing variable (e.g., grit) is positively correlated with
both earnings and college attendance then the parameter estimate will overstate the returns to college.

Note that this is an example of what is called a Selection Problem. It is a SP in that the
individuals decide whether to go to college.

☞ To understand why it is a SP, let’s consider a situation in which it wouldn’t arise.


Experiment: suppose that we took a class of high school graduates and we randomly assigned
them to either college or no college. Random assignment would mean that grades, extracurricular
activities, family income, etc. would not be a factor in the college decision. Indeed, other than
having a hs diploma nothing else would matter in the decision.

Do you see how this situation differs from the situation we observe in the real world?
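
A simulation sketch of the contrast between self-selection and random assignment (Python with NumPy, for illustration only). Everything here is hypothetical: the true return of 1.0, the role of "ability", and the rule by which people choose college are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_return = 1.0                  # hypothetical effect of college on earnings
ability = rng.normal(size=n)       # unobserved by the analyst

def college_coef(college):
    """OLS coefficient on the college dummy when ability is left out of the regression."""
    earnings = true_return * college + 2.0 * ability + rng.normal(size=n)
    X = np.column_stack([np.ones(n), college])
    beta, *_ = np.linalg.lstsq(X, earnings, rcond=None)
    return beta[1]

# Self-selection: people with higher ability are more likely to attend.
chose = (ability + rng.normal(size=n) > 0).astype(float)
# Random assignment: attendance is unrelated to ability.
assigned = rng.integers(0, 2, size=n).astype(float)

b_self = college_coef(chose)
b_rand = college_coef(assigned)
print(b_self)   # well above 1.0: overstates the return to college
print(b_rand)   # close to 1.0
```

Under self-selection the dummy picks up the effect of the omitted ability as well as the effect of college; under random assignment the two are unrelated and the simple comparison recovers the true return.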


This comparison reflects an important characteristic of SPs; they arise when individual behavior
affects allocation to one of the two states being considered (college or no college).
So, if being in one of the categories might depend on the behavior of the individual (or some
other actor) a SP might exist.

Implication of the potential bias: if we wish to adopt a policy based on these results which is
designed to make college cheaper (expecting higher wages for the extra individuals who go to
college) then we might be disappointed.

Ex. Discrimination in Loan Acceptances


Okay, you know what I’m going to ask. Are there any confounding factors?
Look at the loan application process. If people are randomly assigned in the loan application
process then we’d be okay. Do we have any reason to believe that there would be systematic
differences in the loan application decision between people of color and white folks? Yes.
E.g., given the lower likelihood of obtaining a loan, the people of color who are on the margin in
terms of whether to apply for a loan might not apply. As a result, only the stronger candidates
will apply. Since they are stronger than the typical person of color they will be more likely to
have their loan application approved, making the loan acceptance rate appear higher than it
actually is for the population of people of color as a whole.

All three of these examples counsel caution when using dummy variables to try to draw
inferences about policies of interest. Later in the semester, we will consider various techniques
developed by econometricians to get around these problems. For now, we will simply
acknowledge them and go on … to a discussion of heteroscedasticity.


REDACTIONS

Aside (not in the text): Interaction Between Quantitative Variables


So far, we have discussed interactions between d.v.’s & quantitative vars. I will diverge from our
focus on d.v.’s to consider interactions between quantitative vars.

Electricity example. Suppose that we are focusing on the demand for electricity during the
winter months.

We believe that the demand for electricity during the winter depends on the price of electricity as
well as the temperature.

We might think of regressing Q demanded on price & temperature.

We’d get QE = β1 + β2∙Pi + β3∙Ti

What sign would we expect on β2? Expect it to be < 0.


What would we expect of β3? ↑T ⇒ ↓QE. So, β3 < 0.
This equation accounts for P & T. Can fix P (at, say, .1) and get a relationship between E & T.

Now, suppose that price increases to .2. What happens to the relationship between E & T?


Get an inward shift of the line: since β2 < 0, a higher price lowers demand at every temperature, leaving the slope unchanged.

We see that this regression implies that P does not affect the slope of the relationship between
E & T.

In other words, the impact of temp on demand will be the same regardless of the price of electricity.
We might, however, believe that the price affects the impact of temperature on the demand for
electricity. In other words, if the price of electricity is high the impact of temperature on our
demand for electricity will differ from the impact at a lower price. Graphically,

We might ask how we can take into acct interactions tween vars such as this interaction.

We do so by interacting the two quantitative variables; i.e., we include (Price*Temperature) in
the regression as an independent variable.

Note that this situation differs from what we’ve considered so far in that we are dealing with 2
quantitative variables.

The regression we’d estimate would be

Electricity D = β1 + β2∙Price + β3∙Temperature + β4∙(Price∙Temperature) + ui


How does this allow for differing impact of temperature across price?

Fix P at some level, say .10. What happens to the regression equation?

E= β1 + β2∙(0.1) + β3∙Temperature + β4∙((0.1)∙Temperature) + ui

= [β1 + β2∙(0.1)] + [ β3 + β4∙(0.1)]∙Temperature .... get a line w/ a different slope.

If we change the price to some other level we get another line with a different slope.

The impact of temperature on demand now depends on the price level; as prices change the
change in demand in response to a change in temperature varies.
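
A minimal sketch of the price-dependent temperature slope (Python, for illustration only). The two coefficient values are hypothetical.

```python
# Hypothetical coefficient values, for illustration only.
b3, b4 = -2.0, 5.0    # temperature coefficient and interaction coefficient

def temp_slope(price):
    """Change in expected electricity demand per one-degree rise in temperature."""
    return b3 + b4 * price

slope_at_10 = temp_slope(0.10)
slope_at_20 = temp_slope(0.20)
print(slope_at_10)   # -1.5
print(slope_at_20)   # -1.0
```

Each price gives a different line in the (temperature, demand) plane, with slope β3 + β4∙Price.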

What sign would we expect on β4? Positive. As price increases, we’d expect demand not to be as responsive to
temperature. That implies a line w/ a slope closer to 0. Since β3 is negative, we need a positive β4 for the
slope (β3 + β4∙Price) to get closer to zero.

Ex. 2: [Raman 263] As another ex, consider the consumption function that we have considered at
various times over the weeks. It focuses on the relationship tween hh inc & hh consump & the
MPC. Suppose that we believe that the MPC depends on a hh’s asset level (asset income is
included in total income). How might we incorporate this idea into a regression? Interact income
& asset levels; i.e.,

Consumption = β1 + β2∙Income + β3∙(Income∙Assets) + ui

What is the MPC in this case? ∂C/∂Inc = β2 + β3∙Assets


We see that the MPC of the household will vary as Assets vary.
