7 2-19-24
Chapter 7
Multiple Regression Analysis with Qualitative Information
We talked about qualitative variables in Section 2-7 (7th: 51). These are variables which place a
person in a category. We use dummy variables to record which category a person belongs to.
We will consider next the situation in which we have a dummy variable & a quantitative variable
in the regression.
We might ask how we interpret the slope parameters in this context. The interpretation actually
turns out to be the same, with one change.
So, the mean wage = the intercept plus the effect of income.
Graphically, we have a relationship between 2 variables that's linear. We get a line with an
intercept = β0.
ECON335 Wooldridge Ch. 7 2-19-24
What's the difference in the regressions? The intercept. So, the difference between male &
female expected expenditures is a shift in the intercept of the regression. Graphically,
Note this important interpretive point: the intercept for females equals (β0+β1), the male
intercept plus β1. So, we can think of β1 as the difference between females and males.
We think of males as the base (benchmark) group (or the reference group), and we interpret the
parameter on the dummy variable with reference to this group. (7th: 223)
We might believe that education will have differential impacts depending on whether one is
female or male. We will consider shortly how to incorporate possible differences in the effect of
education on male & female wages.
Ex. Food Expenditures. Suppose that we are interested in food expenditures (FE) and theory implies that they depend
on gender (x1), family income (x2), and whether one lives in an urban, suburban or rural area.
We may note that the location variable is categorical, with three possible categories.
The question that arises now is "how do we handle such categorical variables?"
Generally, we must create (# categories - 1) dummy variables.
Here, we have three categories. So, we must create (3-1) = 2 dummy variables.
We handle them by defining two dummy variables instead of one. So, for ex, we might define
x3 = 1 if urban & = 0 otherwise and
x4 = 1 if suburban & = 0 otherwise.
For reasons you will see below, we will drop the gender dummy variable temporarily.
How would we interpret the regression results in this context? The interpretation follows that
in the binary variable case. We consider the various possible combinations of values of x3 & x4.
We see that the coefficient on each dummy variable represents a shift in the intercept of the
regression equation that applies when the dummy variable = 1.
Interpretation of the coefficients on the dummy variables is with respect to the reference
category. In this case, it is the area that is not covered by the two dummy variables; i.e., rural areas.
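As a rough illustration of the (# categories - 1) rule, here is a minimal Python sketch that builds the urban and suburban dummies from a location label, leaving rural as the reference category. The five location labels are made up for the example; the names x3 and x4 follow the notes.

```python
# Hypothetical location labels for five households (illustrative data only).
locations = ["urban", "rural", "suburban", "urban", "rural"]

# Rural is the reference category, so it gets no dummy of its own.
x3 = [1 if loc == "urban" else 0 for loc in locations]      # urban dummy
x4 = [1 if loc == "suburban" else 0 for loc in locations]   # suburban dummy

for loc, d_urban, d_suburban in zip(locations, x3, x4):
    print(f"{loc:9s} x3={d_urban} x4={d_suburban}")
```

A rural household shows up as x3 = 0 and x4 = 0, which is why the coefficients on x3 and x4 are read relative to rural areas.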
Thus, we interpret each coefficient with reference to rural areas. β3 reflects the increase
(decrease) in food expenditures of urban people, as compared to rural people. And, β4 reflects the
increase (decrease) in food expenditures of suburban people as compared with rural people.
Why not include a third dummy variable for rural areas? Anyone remember why we cannot do so? Perfect multicollinearity. The three dummy
columns would sum to a column of ones. Recalling that the variable for the intercept term is a column of ones,
we would have an independent variable (the intercept) which is a linear function of three
independent variables (the dummy variables).
Intuitively, if we stick with our interpretation of the coefficients as being wrt some reference
area, we might simply say that including three d.v.'s would leave us w/o a reference area.
Note that x3 and x4 will never both = 1. One will = 1 & the other 0 or both will = 0.
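The perfect multicollinearity point can be checked numerically. The sketch below (Python with NumPy, using six made-up location labels) builds the design matrix with all three dummies plus the intercept and shows that it has fewer independent columns than it has columns, while dropping the rural dummy fixes the problem.

```python
import numpy as np

# Hypothetical location labels (illustrative data only).
locations = np.array(["urban", "rural", "suburban", "urban", "rural", "suburban"])

ones = np.ones(len(locations))                      # intercept column
urban = (locations == "urban").astype(float)
suburban = (locations == "suburban").astype(float)
rural = (locations == "rural").astype(float)

# All three dummies plus the intercept: the dummies sum to the column of ones.
X_trap = np.column_stack([ones, urban, suburban, rural])
# Drop the rural dummy so that rural becomes the reference category.
X_ok = np.column_stack([ones, urban, suburban])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print("columns:", X_trap.shape[1], "rank:", rank_trap)   # rank is below the column count
print("columns:", X_ok.shape[1], "rank:", rank_ok)       # full column rank
```

With a rank below the number of columns, the OLS coefficients are not uniquely determined, which is the sense in which we are left w/o a reference area.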
To see the implication of such a definition (a single location variable x3 that = 0 if rural, = 1 if suburban & = 2 if urban), consider how we would interpret the regression
results:
E(y|x3=0,x2) = β0 + β2∙x2 + β3∙0 = β0 + β2∙x2
E(y|x3=1,x2) = β0 + β2∙x2 + β3∙1 = (β0+β3) + β2∙x2
E(y|x3=2,x2) = β0 + β2∙x2 + β3∙2 = (β0+2∙β3) + β2∙x2
Again, the reference is with respect to a rural person. We see that the difference in expected food
expenditures between a rural & a suburban person is β3.
The difference between a rural & an urban person is 2∙β3; i.e., twice that of the first
difference. Alternatively, we may interpret β3 as both the diff between a rural & a suburban
person & the difference between a suburban & an urban person.
We typically would not have reason to assume that those differences are the same. Indeed, we
would not want to do so because we might be interested in whether urban & suburban influences
are the same (as compared to rural).
We avoid making such an implicit assumption when we have a dummy variable for each
category but the reference category.
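To make the equal-steps restriction concrete, here is a small Python/NumPy sketch with made-up mean expenditures for the three areas. It fits the single 0/1/2 variable and then the two-dummy version; the numbers and the coding (0 = rural, 1 = suburban, 2 = urban) are assumptions for illustration only.

```python
import numpy as np

# Hypothetical mean food expenditures by area (illustrative numbers only).
y = np.array([100.0, 130.0, 140.0])   # rural, suburban, urban

# Coding A: one variable equal to 0 (rural), 1 (suburban), 2 (urban).
X_coded = np.column_stack([np.ones(3), [0.0, 1.0, 2.0]])
b_coded, *_ = np.linalg.lstsq(X_coded, y, rcond=None)

# Coding B: separate urban and suburban dummies, rural as the reference.
X_dummy = np.column_stack([np.ones(3), [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
b_dummy, *_ = np.linalg.lstsq(X_dummy, y, rcond=None)

print("single coded variable (intercept, step):", b_coded)
print("dummies (intercept, urban, suburban):   ", b_dummy)
```

The coded variable forces one common step of 20 between adjacent areas, even though the true gaps here are 30 and 10. The dummies recover the separate gaps of 40 (urban vs. rural) and 30 (suburban vs. rural).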
[End 1st ½ of Class 16]
More than one categorical Variable:
Let’s bring back the gender variable. Our equation of interest is
y = β0 + β1∙x1 + β2∙x2 + β3∙x3 + β4∙x4 + u
You will note that we now have two categorical variables: one for gender and one for location.
What impact does this have on interpreting our results? The base/reference group will now have
two characteristics, one for each categorical variable; here, rural males.
Now interact gender with the location variables:
y = β0 + β1∙x1 + β2∙x2 + β3∙x3 + β4∙x4 + β5∙(x1∙x3) + β6∙(x1∙x4) + u
How do we interpret the interaction terms? They allow the impact of location to vary by gender
(or, the impact of gender to vary by location).
(2) (female, urban): E(y|x1=1,x3=1,x4=0,x2) = β0 + β1∙1 + β2∙x2 + β3∙1 + β4∙0 + β5∙1∙1 + β6∙1∙0
= (β0 + β1) + (β3 + β5) + β2∙x2.
Interpretation: you will note that the intercept now has two components: the female intercept (β0 + β1) and the urban shift for females (β3 + β5).
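The same bookkeeping can be done for all six gender/location groups at once. Here is a minimal Python sketch; the coefficient values are hypothetical and chosen only to illustrate the algebra, with x1 = 1 for female, x3 = 1 for urban and x4 = 1 for suburban, and β5 and β6 on the gender-location interactions.

```python
# Hypothetical coefficient values, chosen only to illustrate the algebra.
b0, b1, b3, b4, b5, b6 = 50.0, -5.0, 12.0, 8.0, 3.0, -2.0

def group_intercept(female, urban, suburban):
    """Intercept of the food-expenditure line for one gender/location group."""
    return (b0 + b1 * female + b3 * urban + b4 * suburban
            + b5 * female * urban + b6 * female * suburban)

# (female, urban, suburban) codes for each group; rural males are the base group.
groups = {
    ("male", "rural"): (0, 0, 0),
    ("male", "urban"): (0, 1, 0),
    ("male", "suburban"): (0, 0, 1),
    ("female", "rural"): (1, 0, 0),
    ("female", "urban"): (1, 1, 0),
    ("female", "suburban"): (1, 0, 1),
}
intercepts = {g: group_intercept(*codes) for g, codes in groups.items()}

for group, value in intercepts.items():
    print(group, value)
```

With these numbers the urban-rural gap is b3 = 12 for males but b3 + b5 = 15 for females, which is what it means for the impact of location to vary by gender.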
Ex.
Suppose that we believe that family income has differential effects on the food expenditures of
males & females.
Graphically, ... we might believe that the slope of one relationship is > another.
We might ask how we could incorporate such differences into our model. It is actually easy to
incorporate differences in slopes on quantitative variables into our regressions.
Let’s ignore the Place variables for this discussion; we could include them if we wished to do so.
We believe that the slope on family income, β2, should differ for males & females.
To allow for a diff in slopes we multiply the dummy variable by the income variable & introduce
the product into the regression equation. In other words, we run the following regression
We now have 3 variables in the regression, the 3rd being an interaction between the 1st two.
To see how the regression allows for differential slopes, let’s take the expected value of the
equation when x1 = 0 & x1 = 1.
We see that, for females, the slope coefficient is the sum of two coefficients. We also see that
the coefficient on the income variable can now differ for females & males. Graphically, ....
We should also note how we interpret the coefficient on the interaction term; it represents the
differential impact of after-tax income on female and male food expenditures.
If it’s negative then income has a greater impact on males; steeper slope for the line for men.
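A quick numerical check of the differential-slope idea, sketched in Python with NumPy. It assumes the model y = β0 + β1∙female + β2∙income + β3∙(female∙income); the label β3 for the interaction and all of the data are assumptions for illustration. The data are noise-free so that OLS recovers the coefficients exactly.

```python
import numpy as np

# Noise-free illustrative data; every number here is hypothetical.
female = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
income = np.array([100, 200, 300, 400, 100, 200, 300, 400], dtype=float)
y = 20 + 4 * female + 0.10 * income - 0.03 * female * income

# Regressors: intercept, gender dummy, income, and their product.
X = np.column_stack([np.ones_like(income), female, income, female * income])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

slope_male = beta[2]               # income slope when female = 0
slope_female = beta[2] + beta[3]   # income slope when female = 1
print("male slope:", slope_male, "female slope:", slope_female)
```

The interaction coefficient is negative here, so the male line is the steeper one, matching the interpretation above.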
Ex Food Expenditures Let’s estimate the model in STATA, ignoring the Place variables.
7-4c: Testing for Differences in Regression Functions across Groups (7th: 237)
Suppose that we have run the above regression and that we wish to test whether there is a
difference between males and females in the relationship between food expenditures, income,
and location.
We would undertake the analysis by interacting the gender dummy variable with all of the other
regressors in the equation; i.e.,
It’s not straightforward because we have allowed males and females to differ in several ways: (1)
the intercept, and (2) the slope parameters.
There are several possibilities. The one that makes the most sense is a test for a Dissimilar Regression:
both the intercepts and slopes differ.
Estimate a model which allows both slopes and intercepts to vary (the unrestricted model) and one
which allows neither to vary (the restricted model). We do this in the following regression:
The Test for Dissimilar Regression is
H0: β1 = β5 = β6 = β7 = 0.
HA: H0 is not true.
This is a test of the intercept and all slopes being the same for males and females.
Undertake an F (Wald) test.
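For a sense of the mechanics of the F test, here is a sketch in Python with NumPy on simulated data. Everything about the data is hypothetical (sample size, coefficient values, variable names). The restricted model omits the gender dummy and its three interactions, so there are q = 4 restrictions, and the statistic is F = [(SSR_r - SSR_ur)/q] / [SSR_ur/(n - k - 1)].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated, hypothetical data: gender, income, and two location dummies.
female = rng.integers(0, 2, n).astype(float)
income = rng.uniform(20, 100, n)
area = rng.integers(0, 3, n)            # 0 = rural, 1 = urban, 2 = suburban
urban = (area == 1).astype(float)
suburban = (area == 2).astype(float)
y = (10 + 5 * female + 0.3 * income + 2 * urban + 1 * suburban
     - 0.05 * female * income + rng.normal(0, 1, n))

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

ones = np.ones(n)
# Restricted model: same intercept and slopes for males and females.
X_r = np.column_stack([ones, income, urban, suburban])
# Unrestricted model: add the gender dummy and its interactions with every regressor.
X_ur = np.column_stack([X_r, female, female * income, female * urban, female * suburban])

ssr_r, ssr_ur = ssr(X_r, y), ssr(X_ur, y)
q = 4                                   # number of restrictions under H0
df = n - X_ur.shape[1]                  # n - k - 1 in the unrestricted model
f_stat = ((ssr_r - ssr_ur) / q) / (ssr_ur / df)
print("F statistic:", f_stat, "with", q, "and", df, "degrees of freedom")
```

We would compare the statistic with the critical value from the F(q, n-k-1) distribution. Since the simulated data build in a gender difference, the statistic comes out far above any conventional critical value.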
[start Cl 17 F21]
.______________________________ ?skip?__________________________________
Ex. The Savings-Income Relationship in the U.S. (4th: 196).
The relationship between savings & income is important to macro-economists.
We are focusing on y = β0 + β1∙x1
where y = savings and x1 = Income. Call it the Savings Equation.
Suppose that we believe that savings behavior of hh’s in the U.S. changed in the early 1980s.
Suppose that we have relevant data for the 1970 - 1995 period (in Table 6-7 (4th: 197)).
We are interested in testing whether savings behavior in the 1970-1981 period differed from that
in the 1982 - 1995 period.
A change in savings behavior might manifest itself in a shift in the intercept of the savings
equation or a change in the MPS.
How might we capture these ideas? This is the same idea as distinguishing between males &
females wrt both the intercept & the impact of hh inc on verbal scores. To do that we created a
dummy variable for male & ran with it.
In this case, we wish to distinguish 2 periods. So, we create a dummy variable. We might let it =
1 for the years between 1982 & 1995 & let it = 0 for the 1970 - 1981 period. Let that variable be x2.
Thus, to allow for different intercepts & slopes we estimate the following model:
y = β0 + β1∙x1 + β2∙x2 + β3∙(x1∙x2) + u
β0 is the intercept in the 70-81 period & β2 is any shift in the intercept in the 2nd period (82-95).
The estimate on the interaction term suggests that there was a diff in the MPS in the 2 pds.
Indeed, it suggests that the MPS was fully 75% lower in the 1982-1995 period.
._______________________________ end skip__________________________________
[End Class 16 F21]
7-5 A Binary Dependent Variable: The Linear Probability Model (7th: 239)
What can we say about a model in which the dependent variable is a dummy variable? We will
now consider such a situation in the context of the LRM.
How to interpret the model? Notice that the dependent variable can take on only two values. So,
we have a discrete r.v. which takes on values {0,1}. The expected value equals the probability that y = 1:
E(y|x) = 1∙P(y=1|x) + 0∙P(y=0|x) = P(y=1|x).
Ex 1. Wife Labor Force Participation: interpret the expected value and interpret the parameter
estimates on p. 225 (6th).
R Code and Output
Notes: (i) experience enters quadratically (? ask them to calculate the APE?), (ii) the estimate on #
of kids < 6 is unrealistic for going from, e.g., 2 to 3 kids, or, generally, adding one more, (iii)
also, look at the graph of the expected value & education on page 241 (7th) (the values at
which the other regressors are fixed are nwifeinc = 50, exper = 5, age = 30, kidslt6 = 1, & kidsge6 =
0).
Note that the predicted probability can be negative at low levels of education (and that it can exceed one at times). This is a shortcoming
of the model.
Also, the variance of y given x equals P(y=1|x)∙(1 - P(y=1|x)), which changes with x. This violates the LRM assumption of homoscedasticity.
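Both shortcomings of the linear probability model show up even in a tiny example. The sketch below (Python with NumPy) uses ten made-up observations, not the labor force participation data from the text, and fits OLS to a 0/1 outcome.

```python
import numpy as np

# Hypothetical data: y is a 0/1 outcome and x is a single regressor.
x = np.arange(10, dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
p_hat = X @ beta                      # fitted values = estimated P(y=1|x)

# Implied conditional variance p(1-p), computed only where 0 <= p_hat <= 1.
inside = (p_hat >= 0) & (p_hat <= 1)
var_hat = p_hat[inside] * (1 - p_hat[inside])

print("fitted probabilities:", np.round(p_hat, 3))
print("implied variances:   ", np.round(var_hat, 3))
```

The fitted probabilities at the smallest values of x fall below zero and those at the largest values exceed one. The implied variance also changes with x, which is the heteroscedasticity built into the model.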
All of these are examples of situations in which we wish to use our regression techniques to
determine whether a policy is effective (the change in tort law) or might be effective (e.g., make
going to college cheaper), & whether some illegal behavior exists (e.g. loan applications & race).
For each of these examples, the parameter estimated is likely to be biased. Let’s consider why
for each example.
[called Selection Bias]
Ex College degree. Comparing the earnings of people who went to college with people who
didn’t get a college degree (e.g., stopped at a high school diploma). We want to attribute the
difference in income between the two groups to the returns to a college degree (reflected in the
value of the parameter estimate on the dummy variable).
What if we included, e.g., high school grades & IQ test results? Would that take care of it? While
they might capture some aspects of innate ability they won’t capture everything. There will be
factors relevant to college degree attainment which are not in the equation.
For the reasons cited earlier, we would have relevant variables missing from the regression
which would bias parameter estimates. If the missing variable (e.g., grit) is positively correlated with
both earnings and college attendance then the parameter estimate will overstate the returns to college.
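A small simulation can show the direction of the bias. This Python/NumPy sketch is entirely hypothetical: an unobserved "ability" variable raises both earnings and the chance of going to college, and the true return to college is set to 10. We then compare the college coefficient with and without ability in the regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical data-generating process, for illustration only.
ability = rng.normal(0, 1, n)                                 # unobserved in practice
college = (ability + rng.normal(0, 1, n) > 0).astype(float)   # individuals self-select
true_return = 10.0
earnings = 30 + true_return * college + 8 * ability + rng.normal(0, 5, n)

def ols(X, y):
    """OLS coefficients from regressing y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

ones = np.ones(n)
b_short = ols(np.column_stack([ones, college]), earnings)[1]           # ability omitted
b_long = ols(np.column_stack([ones, college, ability]), earnings)[1]   # ability included

print("college coefficient, ability omitted: ", round(b_short, 2))
print("college coefficient, ability included:", round(b_long, 2))
```

With ability omitted, the coefficient on the college dummy picks up part of the effect of ability and overstates the return. In real data we cannot run the second regression because the missing variable is not observed.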
Note that this is an example of what is called a Selection Problem. It is a SP in that the
individuals decide whether to go to college.
Do you see how this situation differs from the situation we observe in the real world?
This comparison reflects an important characteristic of SPs; they arise when individual behavior
affects allocation to one of the two states being considered (college or no college).
So, if being in one of the categories might depend on the behavior of the individual (or some
other actor) a SP might exist.
Implication of the potential bias: if we wish to adopt a policy based on these results which is
designed to make college cheaper (expecting higher wages for the extra individuals who go to
college) then we might be disappointed.
All three of these examples counsel caution when using dummy variables to try to draw
inferences about policies of interest. Later in the semester, we will consider various techniques
developed by econometricians to get around these problems. For now, we will simply
acknowledge them and go on … to a discussion of heteroscedasticity.
REDACTIONS
Electricity example. Suppose that we are focusing on the demand for electricity during the
winter months.
We believe that the demand for electricity during the winter depends on the price of electricity as
well as the temperature.
Now, suppose that price increases to 2. What happens to the relationship between E & T?
We see that this regression implies that P does not affect the slope of the relationship between
E & T.
In other words, the impact of temp on demand will be the same regardless of the price of electricity.
We might, however, believe that the price affects the impact of temperature on the demand for
electricity. In other words, if the price of electricity is high the impact of temperature on our
demand for electricity will differ from that impact at a lower price. Graphically,
We might ask how we can take into acct interactions between vars such as this interaction.
Note that this situation differs from what we’ve considered so far in that we are dealing with 2
quantitative variables.
How does this allow for differing impact of temperature across price?
Fix P at some level, say .10. What happens to the regression equation?
If we change the price to some other level we get another line with a different slope.
The impact of temperature on demand now depends on the price level; as prices change the
change in demand in response to a change in temperature varies.
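To see the price-dependent slope in numbers, here is a minimal Python sketch. It assumes the interaction model has the form E = β1 + β2∙P + β3∙T + β4∙(P∙T), with β3 on temperature and β4 on the interaction; that labeling and all coefficient values are assumptions for illustration.

```python
# Hypothetical coefficients for E = b1 + b2*P + b3*T + b4*(P*T); illustrative only.
b1, b2, b3, b4 = 100.0, -10.0, -2.0, 0.5

def expected_demand(price, temp):
    """Expected electricity demand at a given price and temperature."""
    return b1 + b2 * price + b3 * temp + b4 * price * temp

def temp_slope(price):
    """Change in expected demand per one-degree rise in temperature at a given price."""
    return b3 + b4 * price

for price in (0.10, 2.0):
    print(f"price = {price}: slope of E wrt T = {temp_slope(price)}")
```

At the low price the slope is close to b3; at the higher price the slope is closer to zero, so we get a different line for each price level.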
What sign would we expect on β4? Positive. We’d expect demand not to respond to temperature
changes as much as price increases. That implies a line w/ a slope closer to 0. Since β3 is negative, we need a + β4 to get
closer to zero.
Ex. 2: [Raman 263] As another ex, consider the consumption function that we have considered at
various times over the weeks. It focuses on the relationship tween hh inc & hh consump & the
MPC. Suppose that we believe that the MPC depends on a hh’s asset level (asset income is
included in total income). How might we incorporate this idea into a regression? Interact income
& asset levels; i.e.,