Dummy Variable Regression

Uploaded by

naveen

REGRESSION ANALYSIS

WITH
DUMMY, DICHOTOMOUS OR INDICATOR
VARIABLES
Learning objectives
• Understand the role of dummy variables to represent
qualitative explanatory variables and use them in
regression.
• Test for differences between the categories of a
qualitative variable.
• Calculate and interpret confidence intervals and
prediction intervals, to allow inferences about the
regression coefficients.
• Explain the role of the assumptions on the OLS
estimators.
• Describe common violations of the assumptions and
offer remedies.
Categorical Independent Variables

In many situations we must work with categorical


independent variables such as gender (male, female),
method of payment (cash, check, credit card), etc.

For example, x2 might represent gender where x2 = 0


indicates male and x2 = 1 indicates female.

In this case, x2 is called a dummy or indicator variable.


Example

• Example: Programmer Salary Survey

A software firm collected data for a sample of 20
computer programmers. A suggestion was made that
regression analysis could be used to determine if
salary was related to the years of experience and the
score on the firm’s Programmer Aptitude Test.
The years of experience, score on the aptitude test,
and corresponding annual salary ($1000s) for the
sample of 20 programmers were shown in the previous
class.
Categorical Independent Variables

• Example: Programmer Salary Survey

As an extension of the problem involving the
computer programmer salary survey, suppose that
management also believes that the annual salary is
related to whether the individual has a graduate
degree in computer science or a related field.
The years of experience, the score on the
programmer aptitude test, whether the individual has
a relevant graduate degree, and the annual salary
($1000s) for each of the 20 sampled programmers are
shown on the next slide.
Categorical Independent Variables

Exper.  Test          Salary    |  Exper.  Test          Salary
(Yrs.)  Score  Degr.  ($1000s)  |  (Yrs.)  Score  Degr.  ($1000s)
  4      78    No      24.0     |    9      88    Yes     38.0
  7     100    Yes     43.0     |    2      73    No      26.6
  1      86    No      23.7     |   10      75    Yes     36.2
  5      82    Yes     34.3     |    5      81    No      31.6
  8      86    Yes     35.8     |    6      74    No      29.0
 10      84    Yes     38.0     |    8      87    Yes     34.0
  0      75    No      22.2     |    4      79    No      30.1
  1      80    No      23.1     |    6      94    Yes     33.9
  6      83    No      30.0     |    3      70    No      28.2
  6      91    Yes     33.0     |    3      89    No      30.0
Estimated Regression Equation

ŷ = b0 + b1x1 + b2x2 + b3x3

where:
ŷ = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if the individual does not have a graduate degree
     1 if the individual does have a graduate degree

x3 is a dummy variable.
Categorical Independent Variables

• ANOVA Output

Analysis of Variance

SOURCE           DF   SS         MS        F       P
Regression        3   507.8960   169.299   29.48   0.000
Residual Error   16    91.8895     5.743
Total            19   599.7855

R² = 507.896/599.7855 = .8468
(previously, R Square = .8342)

Adjusted R² = 1 − (1 − .8468)(20 − 1)/(20 − 3 − 1) = .8181
(previously, Adjusted R Square = .815)
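
The ANOVA arithmetic can be checked directly from the printed sums of squares. The snippet below is a minimal sketch, assuming only the SS and DF values from the output above; the variable names are illustrative.

```python
# Sums of squares and degrees of freedom as printed in the ANOVA output.
ss_regression, df_regression = 507.8960, 3
ss_residual, df_residual = 91.8895, 16

ss_total = ss_regression + ss_residual      # total sum of squares
n = df_regression + df_residual + 1         # 20 observations

# Each mean square is its sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_statistic = ms_regression / ms_residual

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - df_regression - 1)

print(f"MS regression = {ms_regression:.3f}")
print(f"MS residual   = {ms_residual:.3f}")
print(f"F             = {f_statistic:.2f}")
print(f"R^2           = {r_squared:.4f}")
print(f"Adjusted R^2  = {adj_r_squared:.4f}")
```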
Categorical Independent Variables

• Regression Equation Output

Predictor     Coef    SE Coef   T       p
Constant      7.945   7.382     1.076   0.298
Experience    1.148   0.298     3.856   0.001
Test Score    0.197   0.090     2.191   0.044
Grad. Degr.   2.280   1.987     1.148   0.268   (not significant)
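
The fit itself can be reproduced from the 20 observations in the data table. The sketch below codes the degree column as a 0/1 dummy and solves the normal equations (X'X)b = X'y in plain Python. The helper `solve` and the variable names are illustrative, not part of the original slides, and a statistics package would normally be used instead.

```python
# (experience, test score, graduate degree, salary in $1000s) from the data table
rows = [
    (4, 78, "No", 24.0), (7, 100, "Yes", 43.0), (1, 86, "No", 23.7),
    (5, 82, "Yes", 34.3), (8, 86, "Yes", 35.8), (10, 84, "Yes", 38.0),
    (0, 75, "No", 22.2), (1, 80, "No", 23.1), (6, 83, "No", 30.0),
    (6, 91, "Yes", 33.0), (9, 88, "Yes", 38.0), (2, 73, "No", 26.6),
    (10, 75, "Yes", 36.2), (5, 81, "No", 31.6), (6, 74, "No", 29.0),
    (8, 87, "Yes", 34.0), (4, 79, "No", 30.1), (6, 94, "Yes", 33.9),
    (3, 70, "No", 28.2), (3, 89, "No", 30.0),
]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# Design matrix: intercept, experience, test score, degree dummy (Yes = 1, No = 0).
X = [[1.0, float(exper), float(score), 1.0 if degree == "Yes" else 0.0]
     for exper, score, degree, _ in rows]
y = [row[3] for row in rows]

k = len(X[0])
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
b = solve(xtx, xty)

fitted = [sum(bi * xi for bi, xi in zip(b, r)) for r in X]
mean_y = sum(y) / len(y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sst = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - sse / sst

for name, coef in zip(["Constant", "Experience", "Test Score", "Grad. Degr."], b):
    print(f"{name:<12}{coef:8.3f}")
print(f"R^2 = {r_squared:.4f}")
```

Coding "Yes" as 1 makes "No" the reference category, so the last coefficient estimates the salary difference associated with holding a graduate degree, at fixed experience and test score.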
Dummy, Dichotomous or indicator
Variables
• Qualitative Explanatory variable with two
categories
• Qualitative Explanatory variable with multiple
categories
More Complex Categorical Variables

If a categorical variable has k levels, k - 1 dummy


variables are required, with each dummy variable
being coded as 0 or 1.

For example, a variable with levels A, B, and C could


be represented by x1 and x2 values of (0, 0) for A, (1, 0)
for B, and (0,1) for C.

Care must be taken in defining and interpreting the


dummy variables.
More Complex Categorical Variables

For example, a variable indicating level of


education could be represented by x1 and x2 values
as follows:

Highest Degree   x1   x2
Bachelor’s*      0    0
Master’s         1    0
Ph.D.            0    1

* Baseline indicator
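
The coding scheme in the table can be expressed as a small helper. This is a sketch only; the function name `dummy_code` and its arguments are hypothetical, chosen for illustration.

```python
def dummy_code(value, levels, baseline):
    """Return k - 1 zero/one indicators, one for each non-baseline level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return tuple(int(value == level) for level in levels if level != baseline)


levels = ["Bachelor's", "Master's", "Ph.D."]
for degree in levels:
    x1, x2 = dummy_code(degree, levels, baseline="Bachelor's")
    print(f"{degree:<12} x1 = {x1}  x2 = {x2}")
```

The baseline level is the one that maps to all zeros; choosing a different baseline changes what each coefficient is compared against, which is why care is needed when interpreting the dummies.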


Example: Is there evidence of gender pay
discrimination?
• Worldwide studies have documented gender
differences in wages and that female academics
received lower pay than their male colleagues.
• Numerous studies have focused on salary
differences between men and women, indigenous
and non-indigenous, and young and old Australians.
• Joanna Smith works in human resources at a large
university.
• After the release of the latest Australian Bureau of
Statistics data, the university asked her to test for
both gender and age discrimination in salaries.
Is there evidence of gender pay
discrimination?
• She gathers data on 42 professors, including the
salary, experience, gender and age of each.

• Using this data set, Joanna hopes to:


– Determine whether there is evidence of gender
discrimination in salaries
– Determine whether there is evidence of age discrimination
in salaries.
Dummy variables
LO :Understand the role of dummy variables to represent
qualitative explanatory variables and use them in
regression.

• Previously, all the variables used in regression
applications were quantitative.
• In empirical work it is common to have some
variables that are qualitative: the values represent
categories that may have no implied ordering.
• We can include these factors in a regression through
the use of dummy variables.

Dummy variables

• A dummy variable for a qualitative variable with two


categories assigns a value of 1 for one of the
categories and a value of 0 for the other.
• For example, suppose we are interested in teen
behaviour. We might first define a dummy variable d
that has the following structure:
Let d = 1 if age is between 13 and 19
and d = 0 if age is anything else.
• This would allow us to capture the role of being a
teenager in a regression model and quantify its
impact.
Dummy variables

• For the sake of simplicity, consider a model


containing one quantitative explanatory variable and
one dummy variable.
y = β0 + β1x1 + β2d + ε

• Conducting a standard ordinary least squares (OLS)


regression will yield an estimated equation of
ŷ = b0 + b1x1 + b2d.

Dummy variables

• For a given x, and d = 0, we compute ŷ as


ŷ = b0 + b1x1 + b2(0) = b0 + b1x1.

• Similarly, when d = 1
ŷ = b0 + b1x1 + b2(1) = (b0 + b2) + b1x1.
• The dummy variable allows a shift in the intercept
term, enabling us to use a single regression
equation to represent both categories of the
qualitative variable.
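
A short numerical check makes the intercept shift concrete. The coefficient values below are hypothetical, chosen only for illustration; the point is that the gap between the two fitted lines equals b2 at every value of x.

```python
# Hypothetical estimates, for illustration only.
B0, B1, B2 = 20.0, 1.5, 5.0


def predict(x, d):
    """Fitted value from y-hat = b0 + b1*x + b2*d."""
    return B0 + B1 * x + B2 * d


gaps = []
for x in (0, 5, 10):
    y_d0 = predict(x, 0)   # line with intercept b0
    y_d1 = predict(x, 1)   # line with intercept b0 + b2
    gaps.append(y_d1 - y_d0)
    print(f"x = {x:2d}: d=0 -> {y_d0:5.1f}, d=1 -> {y_d1:5.1f}, gap = {y_d1 - y_d0:.1f}")
```

Both lines share the slope b1, so they are parallel; only the intercept differs between the two categories.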

Dummy variables

Graphically, we can see how the dummy variable shifts


the intercept of the regression line.

Dummy variables

• Example: Evidence of gender pay discrimination?


– The introductory case has two qualitative variables, gender
and age group. To measure the impact of gender and age
on salary, we need to create two dummy variables.
Let d1 = 1 if the professor is male; 0 if female
Let d2 = 1 if the professor is 60 or over; 0 if under 60.

Dummy variables
• Example:

– The estimated equation is


ŷ = 54.011 + 1.503x + 18.541d1 + 5.772d2
– The difference in salary between a male and a female
professor is captured in the coefficient of d1. A male
professor, on average, makes $18,541 more than a female
with comparable experience.
– The age coefficient, though statistically insignificant in this
case, would have a similar interpretation.
Qualitative variables with two
categories
LO : Test for differences between the categories of a
qualitative variable.

• The statistical tests discussed earlier remain valid for
dummy variables as well.
• We can perform a t test for individual significance,
form a confidence interval using the parameter
estimate and its standard error, and conduct a partial
F test for joint significance.

Qualitative variables with two
categories
• Example: Evidence of gender pay discrimination?
– Is there a gender effect in the salary study?
H0: β2 = 0 (males and females are paid the same)
HA: β2 ≠ 0 (there is a difference due to gender)

– Given a t_df test statistic of 4.86 and a p-value of
approximately 0.00, we reject the null hypothesis and
conclude that the gender dummy variable is significant.

• For the age coefficient, t_df is 0.94 and the p-value is 0.36, so
we do not reject the null hypothesis. There is insufficient
evidence that professors aged 60 or over are paid differently
from those under 60.
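
The two-sided p-values can be approximated from the reported t statistics. The sketch below assumes that all 42 professors enter the regression with three explanatory variables, giving df = 42 − 3 − 1 = 38; it integrates the Student t density with Simpson's rule, and the function names are illustrative. Because the reported t statistics are rounded, the results may differ slightly from the slide's p-values.

```python
import math


def t_pdf(t, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(t * t / df))


def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| > |t_stat|), using Simpson's rule on [0, |t_stat|] (steps must be even)."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3          # area between 0 and |t_stat|
    return max(0.0, 1.0 - 2.0 * central_half)


df = 42 - 3 - 1  # assumed: n = 42 observations, 3 explanatory variables
p_gender = two_sided_p_value(4.86, df)
p_age = two_sided_p_value(0.94, df)

print(f"gender: p = {p_gender:.5f}")
print(f"age:    p = {p_age:.3f}")
```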

Qualitative variables with multiple
categories
• Sometimes a qualitative variable may be described
by more than two categories.
• In such cases we use multiple dummy variables to
capture the effect of the variable.
– For example, suppose we divide the mode of transport used
by commuters into three categories: public transport, driving
and park-and-ride.
– We then define two dummy variables, d1 and d2, where d1
equals 1 to denote public transport and 0 otherwise, and d2
equals 1 to denote driving and 0 otherwise. Park-and-ride is
captured when both d1 and d2 equal 0.

Qualitative variables with multiple
categories
• Our regression model for the mode of transport
example would then be

y = β0 + β1x + β2d1 + β3d2 + ε

and the estimated equation would be

ŷ = b0 + b1x + b2d1 + b3d2.

Qualitative variables with multiple
categories
• Given the intercept term, we exclude one of the
dummy variables from the regression.
• The excluded variable represents the reference
category (baseline indicator) against which the
others are assessed.
• If we included as many dummy variables as
categories, this would create perfect multicollinearity
in the data, and such a model cannot be estimated.
• So, we include one fewer dummy variable than the
number of categories of the qualitative variable.
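
The multicollinearity problem is easy to see on a small sample. The sketch below uses a hypothetical list of commuters; with one dummy per category, the dummies in every row sum to 1, which is exactly the intercept column, so the columns are linearly dependent. Dropping the park-and-ride dummy removes the dependence.

```python
# Hypothetical sample of commuters, for illustration only.
modes = ["public transport", "driving", "park-and-ride", "driving", "public transport"]
categories = ["public transport", "driving", "park-and-ride"]

# One dummy per category (too many when the model has an intercept).
full_dummies = [[int(mode == c) for c in categories] for mode in modes]
row_sums = [sum(row) for row in full_dummies]
print("row sums with all three dummies:", row_sums)  # always equal to the intercept column

# Drop the reference category (park-and-ride): keep d1 and d2 only.
reference = "park-and-ride"
kept = [c for c in categories if c != reference]
reduced_dummies = [[int(mode == c) for c in kept] for mode in modes]
for mode, (d1, d2) in zip(modes, reduced_dummies):
    print(f"{mode:<17} d1 = {d1}  d2 = {d2}")
```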
Interval estimates for the response
variable
LO : Calculate and interpret confidence intervals and
prediction intervals, to allow inferences about the
regression coefficients.
• Once we have developed a regression model, we
often want to use it to make predictions.
• In the academic salary example, what salary would
we predict for a male professor under 60 with 10 years
of experience? Inserting these values (x = 10, d1 = 1,
d2 = 0) into our estimated regression equation, we find:
Salary(predicted) = ŷ = 54.011 + 1.503(10) + 18.541(1) + 5.772(0)
= 87.582, that is, about $87,582.
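
The same substitution can be scripted for any combination of the dummies. This is a minimal sketch using the rounded coefficients as printed in the estimated equation; the function name `predicted_salary` is illustrative. With these rounded values the first case below evaluates to 87.582, and a figure computed from unrounded estimates could differ slightly in the last digits.

```python
def predicted_salary(experience, male, sixty_or_over):
    """Predicted salary in $1000s from the estimated equation (rounded coefficients)."""
    return 54.011 + 1.503 * experience + 18.541 * male + 5.772 * sixty_or_over


# All four gender/age-group combinations at 10 years of experience.
for male in (1, 0):
    for sixty_or_over in (0, 1):
        salary = predicted_salary(10, male, sixty_or_over)
        gender = "male" if male else "female"
        age = "60 or over" if sixty_or_over else "under 60"
        print(f"{gender:<6} {age:<10} {salary:.3f}")
```

This gives a point prediction only; an interval estimate would also need the standard error of the prediction, which the estimated equation alone does not provide.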
