
ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 12

Download the t12e1 Excel data file from the subject website and save it to your computer or
USB flash drive. Read this handout and try to complete the tutorial exercises before your
tutorial class, so that you can ask your tutor for help during the Zoom session if necessary.

Dummy Dependent Variable Regression Models

In the week 10 lectures you learnt about three dummy dependent variable regression
models: the linear probability model, the logit model and the probit model.

Linear Probability Model (LPM)

LPM is the simplest dummy dependent variable model. Assuming, for the sake of simplicity,
that there is only one independent variable,¹ it takes the form

D = β0 + β1X + ε

where the dependent variable D is a (0, 1) dummy variable that represents a qualitative
variable with two different categories (0 = failure, 1 = success), X is a quantitative
independent variable, and the random error ε is supposed to have zero conditional expected
value.² Consequently,

E(D | X) = β0 + β1X

In this sense, LPM is not different from any standard regression model that satisfies (LR2).
However, at the same time, since D is a Bernoulli random variable, its conditional expected
value is also equal to

E(D | X) = 0 · (1 − P) + 1 · P = P

where P is the conditional probability of success, i.e. P(D = 1 | X). Combining these two
results, we obtain that

P = β0 + β1X

¹ To keep the formulas relatively simple, every formula on LPM in this tutorial assumes that there is only one
independent variable. This restriction, however, can be removed and our discussions can be easily generalised
to the more realistic multiple regression case.
² The independent variable of a dummy dependent variable model does not have to be quantitative, but we
assume at this stage that it is, for the sake of simplicity. However, in the exercise that follows some of the
independent variables will be qualitative (i.e. dummy) variables.
L. Kónya, 2020, Semester 2 ECON20003 - Tutorial 12
i.e. the conditional probability of success depends on X and it can be modelled using a linear
regression (in this case a simple linear regression) model. For this reason, this dummy
dependent variable model is called a linear probability model (LPM).

Finally, note that the marginal effect of the independent variable on the probability of success
is

dP/dX = β1

The main advantage of LPM is its simplicity and that it can be estimated and interpreted just
like any other linear regression model. Namely, each slope parameter measures the
marginal change in the dependent variable, i.e. the marginal change in the probability of
success due to a one unit increase in the corresponding independent variable while all other
independent variables in the model are kept constant.

Exercise 1 (HGL, p. 332, ex. 7.7)

A researcher intends to model shoppers’ decisions between purchasing Coke or Pepsi with
dummy dependent variable models based on the following variable:

COKE: dummy variable for purchasing Coke (= 1 if Coke is chosen and 0 if Pepsi is
chosen).

She thinks that the most important factors behind this choice decision are the price of and
the presence of store displays for these products. She decides to capture these factors with
the following variables:

PRATIO: relative price of Coke to Pepsi (i.e. the price of Coke divided by the price of
Pepsi);
DISP_COKE: dummy variable for displaying Coke in the store (= 1 if Coke is on display
and 0 otherwise);
DISP_PEPSI: dummy variable for displaying Pepsi in the store (=1 if Pepsi is on display
and 0 otherwise).

The researcher collects data on 1140 individuals who purchased either Coke or Pepsi. This
data set is saved in the t12e1 Excel file.

a) Do you expect PRATIO, DISP_COKE and DISP_PEPSI to have positive or negative
effects on COKE?

The more expensive Coke is relative to Pepsi, the less likely it is that a shopper chooses
Coke, so PRATIO is expected to have a negative effect on COKE.

The presence of a Coke display in the store is expected to increase the purchases of
Coke, so DISP_COKE is expected to have a positive effect on COKE.

The presence of a Pepsi display in the store is expected to increase the purchases of
Pepsi and hence to decrease the purchases of Coke, so DISP_PEPSI is expected to
have a negative effect on COKE.

b) Estimate a linear probability model and briefly evaluate and interpret the results.

Launch RStudio, create a new project and script, name them t12e1, import the data from
the t12e1 Excel file and execute the following commands:

attach(t12e1)
lpm = lm(COKE ~ PRATIO + DISP_COKE + DISP_PEPSI)
summary(lpm)

You should obtain

The slope estimates have the logical signs and they are all significant in the logical
direction even at the 1.5% level. They suggest that

(i) given DISP_COKE and DISP_PEPSI, a unit increase of the relative price of Coke
to Pepsi (PRATIO) is expected to decrease the probability of purchasing Coke by
about 0.401;
(ii) given PRATIO and DISP_PEPSI, a store display for Coke (DISP_COKE = 1) is
expected to increase the probability of purchasing Coke by about 0.077;
(iii) given PRATIO and DISP_COKE, a store display for Pepsi (DISP_PEPSI = 1) is
expected to decrease the probability of purchasing Coke by about 0.166.

Since each independent variable is significant individually, unsurprisingly the F-test of
overall significance (F-statistic = 51.67, p-value = 0.0000) rejects the null hypothesis of
zero slopes at any level.

The coefficient of determination is quite small (the unadjusted and the adjusted R² are
both about 0.12). This is typical for LPM because it tries to estimate the 0 and 1 values
of the dependent variable with a straight-line sample regression equation.
For this reason, it is better to judge the quality of the fit by calculating the percent
correctly predicted based on the estimated dependent variable values rounded to zero
or one, i.e.

D̃ = 0 if D̂ < 0.5
D̃ = 1 if D̂ ≥ 0.5

A prediction is considered to be correct if D and D̃ (i.e. the corresponding observed
and rounded estimated values of D) are both equal to zero or if they are both equal to
one.

To obtain the percent correctly predicted statistic, first we need to predict D, round it to
the nearest integer and then develop a relative frequency table for the observed and
rounded predicted variables, COKE and COKE_lpm. This can be achieved by executing
the

COKE_lpm = round(fitted.values(lpm), digits = 0)
prop.table(table(COKE, COKE_lpm))

commands. They return

which shows that the relative frequency of COKE = COKE_lpm = 0 is 0.4447368 and
that of COKE = COKE_lpm = 1 is 0.2166667. Hence, after rounding to zero or one, the
percent correctly predicted is about (0.445 + 0.217) × 100% = 66.2%.

In addition, to check whether any predicted value of D is negative or greater than one,
and hence is an impossible value, execute

sum(fitted.values(lpm) < 0)
sum(fitted.values(lpm) > 1)

These commands return

and

meaning that there are 16 negative predictions, but none over one.

c) Estimate the probability of purchasing Coke (‘success’) assuming that PRATIO is equal
to its sample mean and

(i) Coke is on display, but Pepsi is not (DISP_COKE = 1, DISP_PEPSI = 0),


(ii) Coke is not on display, but Pepsi is (DISP_COKE = 0, DISP_PEPSI = 1).

Using R you can obtain that the sample mean of PRATIO is about 1.027. Given this
PRATIO value and the sample regression equation, the probability of purchasing Coke
is estimated as

P̂ = β̂0 + β̂1·pratio + β̂2·disp_coke + β̂3·disp_pepsi

(i) P̂ = 0.890 − 0.401 × 1.027 + 0.077 × 1 − 0.166 × 0 ≈ 0.555

and

(ii) P̂ = 0.890 − 0.401 × 1.027 + 0.077 × 0 − 0.166 × 1 ≈ 0.312
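These hand calculations can also be replicated outside R. The sketch below redoes the arithmetic in Python (purely illustrative; the rounded coefficient estimates and the PRATIO mean of 1.027 are taken from the output above):

```python
# LPM coefficient estimates (rounded to 3 decimals, from the printout above)
b0, b_pratio, b_coke, b_pepsi = 0.890, -0.401, 0.077, -0.166
pratio_mean = 1.027  # sample mean of PRATIO (rounded)

def lpm_prob(disp_coke, disp_pepsi):
    """Fitted probability of purchasing Coke under the LPM."""
    return b0 + b_pratio * pratio_mean + b_coke * disp_coke + b_pepsi * disp_pepsi

print(round(lpm_prob(1, 0), 3))  # scenario (i): Coke on display, Pepsi not
print(round(lpm_prob(0, 1), 3))  # scenario (ii): Pepsi on display, Coke not
```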

To obtain these estimates of P with R, first add the two new observations to a new data
frame called new_data,

new_data = data.frame(PRATIO = mean(PRATIO),
                      DISP_COKE = c(1, 0),
                      DISP_PEPSI = c(0, 1))

and then obtain the corresponding estimates of COKE

predict(lpm, new_data, interval = "none")

You should get the following printout:

It confirms our manual calculations and indicates that, given the average price ratio, the
probability of purchasing Coke is about 0.555 when Coke is on display, but Pepsi is not,
however, it is only 0.312 when Coke is not on display, but Pepsi is.

Although LPM is indeed simple and convenient to use, it has some disadvantages. First of
all, in this model ε is certainly not normally distributed because it has only two possible
values. In particular:

ε = −β0 − β1X if D = 0
ε = 1 − β0 − β1X if D = 1

Hence, just like D, ε is a binary random variable with the same probability of success, P.
Therefore, LPM certainly violates the sixth classical assumption (LR6, i.e. normality). This
is not a major problem though if the sample size is relatively large.

LPM also violates the homoskedasticity assumption (LR3) because the conditional variance
of  is

Var(ε | X) = P(1 − P) = (β0 + β1X)(1 − β0 − β1X)

i.e. it depends on X. This problem can be overcome or mitigated by using HC standard errors
instead of the usual standard errors, or by estimating the model with the WLS method instead
of OLS.

Exercise 1 (continued)

d) Does the White test detect heteroskedasticity?

Recall that the White test is based on an auxiliary regression of the squared residuals
on the independent variables, on the squared independent variables and on the cross-
products of the independent variables. Hence, the appropriate commands are³

library(lmtest)
bptest(lpm, ~ PRATIO + DISP_COKE + DISP_PEPSI +
I(PRATIO^2) + I(DISP_COKE^2) + I(DISP_PEPSI^2) +
I(PRATIO * DISP_COKE) + I(PRATIO * DISP_PEPSI) +
I(DISP_COKE * DISP_PEPSI))

and they return

The White test rejects the null hypothesis of homoskedasticity even at the 0.5% level,
so this regression indeed suffers from heteroskedasticity, as expected.

e) Obtain White’s heteroskedasticity-consistent (HC) standard errors and compare them
to the original standard errors. What do you observe?

We can obtain the HC standard errors like in part (d) of Exercise 4 of Tutorial 10.
Namely, by calling the coeftest function of the sandwich package:

library(sandwich)
coeftest(lpm, vcov = vcovHC(lpm, type = "HC"))

³ Since two of the independent variables are dummy variables and hence equal to their squares, two
squared terms are redundant in the bptest function. Luckily, R is intelligent enough to ignore them.
You should get the following printout:

By comparing this printout to the one in part (b), you can see that the point estimates
have not changed, but the standard errors, t-values and p-values have. However, the
differences are not big enough to alter the t-test results, so heteroskedasticity is not a
real problem this time.

On top of the violation of the third and the sixth classical assumptions (LR3: homoskedasticity,
LR6: normality), LPM has two further shortcomings that are potentially even more serious.

The first is that, as we have already seen in part (b) of Exercise 1, although D and P are
both restricted to the [0, 1] interval, the estimated dependent variable D̂ = β̂0 + β̂1X is not.
The second is that in LPM the marginal effect of X on D and P, which are both equal to β1,
is restricted to be a constant.

A possible solution for both of these problems is to change the model specification to

P = E(D | X) = F(β0 + β1X)

where F is a function that maps the unbounded Z = β0 + β1X variable to the (0, 1) interval.⁴
For example, F can be the cumulative distribution function (CDF) of some continuous
random variable that can assume any real number between negative infinity and positive
infinity and has a symmetric distribution around zero. In practice the two preferred CDFs are
the logistic and the standard normal CDFs, leading to the so called logit and probit models,
respectively.

Logit model

The logit model is based on the logistic CDF,⁵

F(v) = 1 / (1 + e^(−v))

Accordingly, in the logit model the probability of success is

⁴ Note that in this case Z does not denote a standard normal random variable or a standardized variable. Also,
although in this case we assume that there is only one independent variable, in general there can be more.
⁵ For this reason, the logit model is also referred to as the logistic model (for example, in the Selvanathan book).
P = F(Z) = 1 / (1 + e^(−Z))

The marginal effect of the independent variable on the probability of success is

dP/dX = f(Z) · β1

where  is the probability density function (PDF), i.e. the derivative of CDF. For the logit
model (logistic distribution), it is

f(Z) = dF(Z)/dZ = e^(−Z) / (1 + e^(−Z))²
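A useful fact about this PDF is that f(Z) = F(Z)(1 − F(Z)), so the PDF never has to be coded separately from the CDF. A short Python sketch (purely illustrative; the tutorial itself works in R) verifies that the two expressions for f agree:

```python
import math

def F(v):
    """Logistic CDF: F(v) = 1 / (1 + e^(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def f(v):
    """Logistic PDF written directly as e^(-v) / (1 + e^(-v))^2."""
    return math.exp(-v) / (1.0 + math.exp(-v)) ** 2

# f(v) equals F(v) * (1 - F(v)) for any v:
for v in (-0.858, 0.0, 0.225):
    assert abs(f(v) - F(v) * (1.0 - F(v))) < 1e-12
print(f(0.0))  # maximum of the logistic PDF, 0.25 at v = 0
```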

Probit model

The probit model is based on the standard normal CDF, which is

F(v) = (1/√(2π)) ∫_{−∞}^{v} e^(−u²/2) du

In this case the probability of success is given by

P = F(Z) = (1/√(2π)) ∫_{−∞}^{Z} e^(−u²/2) du

and the probability density function (PDF) is

f(Z) = dF(Z)/dZ = (1/√(2π)) e^(−Z²/2)

The standard normal CDF is clearly more complicated than the logistic CDF, but in practice
this does not pose any real problem because its values are tabulated in the standard normal
table and can be also obtained easily with programs like R.
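For instance, in Python the standard normal CDF can be evaluated from the standard library's error function via Φ(z) = (1 + erf(z/√2))/2 (in R the same job is done by pnorm). A minimal sketch:

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(round(std_normal_cdf(0.0), 4))    # 0.5 by symmetry
print(round(std_normal_cdf(1.645), 4))  # about 0.95, a familiar table value
```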

The logit and probit models are nonlinear regression models and they cannot be estimated
with OLS. Instead, they are estimated with the maximum likelihood (ML) method. We do not
discuss the details of this procedure, but fortunately with R it can be implemented as easily
as the OLS method.

The logit and probit regressions are interpreted differently, but usually they lead to very similar
inferences and conclusions, except under the tails of the distributions, i.e. for relatively small
and large values.

Exercise 2 (HGL, p. 694, ex. 16.6)

Using the same data as in Exercise 1, complete the following tasks.

a) Estimate a logit model and briefly evaluate and interpret the results.

In R logit and probit models can be estimated with the

glm(formula = y ~ x1 + x2 + …, family = familytype(link = linkfunction))

function, where formula is like in the lm() function and family is binomial(link = "logit") for the
logit model and binomial(link = "probit") for the probit model.

Hence, execute the

logit = glm(COKE ~ PRATIO + DISP_COKE + DISP_PEPSI,
            family = binomial(link = "logit"))
summary(logit)

commands. You should get

There are several details on this printout that warrant some explanation.

(i) The corresponding logit and LPM coefficients cannot be compared directly to each
other because they measure different things, but their logical signs are the same.

(ii) Note that, since ML is a large-sample method, instead of t-ratios, this time R reports
z-ratios. As you can see, just like in LPM, all three slopes are significant individually
in the logical direction even at the 1.5% level.⁶

(iii) Below Call, which reminds us of the command we just executed, R reports the usual
location statistics for the deviance residuals. Deviance generalises the idea of using
SSE to evaluate the goodness of fit from regressions estimated by OLS to regressions
estimated by the ML method. Like in the case of SSE, the smaller the deviance, the
better the fit.

Although in this course we do not rely directly on deviance, it is worth mentioning
that further down on the printout R reports the so-called null deviance and the
residual deviance. The former shows how well COKE is predicted by a model that
includes only the intercept (restricted model), while the latter shows how well COKE
is predicted by the unrestricted model. The difference between the two is the
improvement due to the independent variables in the unrestricted model. In this
example, the null deviance is 1567.7 with 1139 degrees of freedom and the residual
deviance is 1418.9 with 1136 degrees of freedom. Hence, by including the three
independent variables in the model the deviance is reduced by 148.8 with a loss of
3 degrees of freedom.

(iv) As you can see, the usual R² statistic is missing from the printout. We can, however,
evaluate the quality of the fit by comparing the null and residual deviances to each
other and calculating the relative improvement (i.e. decrease) in deviance. It is

(1567.7 − 1418.9) / 1567.7 ≈ 0.0949

which means that due to the three independent variables in the model, the fit improves
by about 9.5%. This relative improvement is measured by the so-called McFadden
R² (R²McF) statistic, which is similar to the usual R² statistic in the sense that it is
between 0 and 1 and larger values indicate better goodness of fit, but it is always
smaller than one and it cannot be interpreted in the same way as R² (i.e. as the
proportion of the total sample variation that is accounted for by the model).

R²McF is not provided directly by the glm function, but you can get it by executing

library(DescTools)
PseudoR2(logit, which = "McFadden")

which returns

What constitutes a “good” R²McF value depends on the area of application, but as a
rule of thumb, values between 0.2 and 0.4 indicate a fairly good model fit. Hence,
this logit model does not fit the data really well.
⁶ Recall that the reported p-values are for two-tail z-tests.
(v) The F-test of overall significance is not available either for models estimated by ML.
Instead, the overall significance of these models (H0: each slope parameter is zero)
can be tested with the LR (likelihood ratio) test. We do not discuss the details of this
test, just remember that it has the same hypotheses and serves the same purpose
as the F-test.

In R the LR test can be performed with the

lrtest(model)

function of the lmtest package.

Execute

library(lmtest)
lrtest(logit)

to obtain

The test statistic is Chisq = 148.83 and its p-value, i.e. Pr(> Chisq), is practically zero,
so although R²McF is only about 0.095, this logit model is still significant at any reasonable
level.

The percent correctly predicted can be obtained like in part (b) of Exercise 1. Hence,
execute

COKE_logit = round(fitted.values(logit), digits = 0)
prop.table(table(COKE, COKE_logit))

to get

This printout shows that the percent correctly predicted by the logit model is about (0.445
+ 0.217) × 100% = 66.2%, the same as in Exercise 1. This time, however, all predicted
values are restricted to be between 0 and 1.

b) Estimate the probability of purchasing Coke (‘success’) assuming that PRATIO is equal
to its sample mean and

(i) Coke is on display but Pepsi is not (disp_coke = 1, disp_pepsi = 0),


(ii) Coke is not on display but Pepsi is (disp_coke = 0, disp_pepsi = 1).

Based on the logit sample regression equation and on the logistic CDF given in the Logit
model section,

Ẑ = β̂0 + β̂1·pratio + β̂2·disp_coke + β̂3·disp_pepsi

(i) Ẑ = 1.923 − 1.996 × 1.027 + 0.352 × 1 − 0.731 × 0 ≈ 0.225

P = F(Z) = 1 / (1 + e^(−0.225)) ≈ 0.556

and

(ii) Ẑ = 1.923 − 1.996 × 1.027 + 0.352 × 0 − 0.731 × 1 ≈ −0.858

P = F(Z) = 1 / (1 + e^(0.858)) ≈ 0.298
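The same two probabilities can be reproduced with a few lines of Python as an arithmetic cross-check (illustrative only; the rounded logit coefficients are taken from the printout in part (a)):

```python
import math

# Logit coefficient estimates (rounded to 3 decimals, from the printout above)
b0, b_pratio, b_coke, b_pepsi = 1.923, -1.996, 0.352, -0.731
pratio_mean = 1.027  # sample mean of PRATIO (rounded)

def logit_prob(disp_coke, disp_pepsi):
    """Estimated probability of purchasing Coke under the logit model."""
    z = b0 + b_pratio * pratio_mean + b_coke * disp_coke + b_pepsi * disp_pepsi
    return 1.0 / (1.0 + math.exp(-z))

print(round(logit_prob(1, 0), 3))  # scenario (i): Coke on display, Pepsi not
print(round(logit_prob(0, 1), 3))  # scenario (ii): Pepsi on display, Coke not
```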

You can verify these results with R like in Exercise 1. First, add the two new observations
to the data frame,

newdata = data.frame(PRATIO = mean(PRATIO),
                     DISP_COKE = c(1, 0),
                     DISP_PEPSI = c(0, 1))

Then, obtain the corresponding Z values and estimates of the probability of success by
executing the

predict.glm(logit, newdata, type = "link")

and

predict.glm(logit, newdata, type = "response")

commands.

The first command returns the Z values

and the second the estimates of the probability of success

c) Estimate the marginal effect of PRATIO on P for the two scenarios in part (b).

From the Logit model section, this marginal effect is

dP/dX = f(Z) · β1 = [e^(−Z) / (1 + e^(−Z))²] · β1

Its estimates for the two scenarios are

(i) dP/dX = [e^(−0.225) / (1 + e^(−0.225))²] × (−1.996) ≈ −0.493

(ii) dP/dX = [e^(0.858) / (1 + e^(0.858))²] × (−1.996) ≈ −0.417
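These estimates, too, can be cross-checked in Python. The sketch below (illustrative; slope and fitted Z values rounded from above) uses the equivalent form f(Z) = F(Z)(1 − F(Z)) of the logistic PDF:

```python
import math

b_pratio = -1.996           # logit slope on PRATIO (rounded)
z_values = (0.225, -0.858)  # fitted index values from part (b)

def logit_marginal(z, slope):
    """Marginal effect dP/dX = f(z) * slope, using f(z) = F(z) * (1 - F(z))."""
    F = 1.0 / (1.0 + math.exp(-z))
    return F * (1.0 - F) * slope

for z in z_values:
    print(round(logit_marginal(z, b_pratio), 3))
```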

In R these marginal effects are provided by the

marginal_effects(model, data)

function of the margins package.

Install the margins package on your computer if you do not have it yet, and then execute

library(margins)
marginal_effects(logit, data = newdata)

You should get the following printout:

R reports the marginal effects of all three independent variables; the ones we calculated
manually are in the dydx_PRATIO column. They suggest that, given the average of
PRATIO, the marginal effect of the relative price of Coke to Pepsi on the probability of
purchasing Coke is -0.493 when only Coke is on display and it is -0.417 when only Pepsi
is on display.⁷

⁷ Note that in LPM the marginal effect of the relative price of Coke to Pepsi on the probability of purchasing
Coke is constant, -0.401.

Exercise 3 (HGL, p. 694, ex. 16.6)

Repeat Exercise 2 but use the probit model instead of the logit model.

a) Estimate a probit model and briefly evaluate and interpret the results.

The probit model is estimated like the logit model, except that this time in every
command we need to replace logit with probit. Hence, execute the following commands:

probit = glm(COKE ~ PRATIO + DISP_COKE + DISP_PEPSI,
             family = binomial(link = "probit"))
summary(probit)

library(DescTools)
PseudoR2(probit, which = "McFadden")

library(lmtest)
lrtest(probit)

COKE_probit = round(fitted.values(probit), digits = 0)
prop.table(table(COKE, COKE_probit))

They generate the following printouts.

These printouts show that the probit model is also significant (LR statistic = 145.82
and Pr(> Chisq) = 0.0000), but R²McF is small (0.093). Still, each slope coefficient is
significant in the logical direction at the 1.5% level.

The percent correctly predicted is about (0.445 + 0.217) × 100% = 66.2%, the same as
in Exercises 1 and 2. Therefore, this time the linear probability, logit and probit models
predict the probability of purchasing Coke with the same success rate.

b) Estimate the probability of purchasing Coke (‘success’) assuming that PRATIO is equal
to its sample mean and

(i) Coke is on display but Pepsi is not (disp_coke = 1, disp_pepsi = 0),


(ii) Coke is not on display but Pepsi is (disp_coke = 0, disp_pepsi = 1).

Based on the probit sample regression equation, on the standard normal CDF given in the
Probit model section, and on the standard normal table,

Ẑ = β̂0 + β̂1·pratio + β̂2·disp_coke + β̂3·disp_pepsi

(i) Ẑ = 1.108 − 1.146 × 1.027 + 0.217 × 1 − 0.447 × 0 ≈ 0.148

and

P = F(Z) = (1/√(2π)) ∫_{−∞}^{0.148} e^(−u²/2) du ≈ 0.5596

(ii) Ẑ = 1.108 − 1.146 × 1.027 + 0.217 × 0 − 0.447 × 1 ≈ −0.516

and

P = F(Z) = (1/√(2π)) ∫_{−∞}^{−0.516} e^(−u²/2) du ≈ 0.3015
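Instead of reading Φ(Z) off the table, the same probabilities can be cross-checked in Python via the error function (illustrative; the coefficients are rounded from the printout in part (a), so the results agree with the table values only to two or three decimals):

```python
import math

# Probit coefficient estimates (rounded to 3 decimals, from the printout above)
b0, b_pratio, b_coke, b_pepsi = 1.108, -1.146, 0.217, -0.447
pratio_mean = 1.027  # sample mean of PRATIO (rounded)

def probit_prob(disp_coke, disp_pepsi):
    """Estimated probability of purchasing Coke under the probit model."""
    z = b0 + b_pratio * pratio_mean + b_coke * disp_coke + b_pepsi * disp_pepsi
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

print(round(probit_prob(1, 0), 2))  # scenario (i), roughly 0.56
print(round(probit_prob(0, 1), 2))  # scenario (ii), roughly 0.30
```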

To verify these results with R, execute

newdata = data.frame(PRATIO = mean(PRATIO),
                     DISP_COKE = c(1, 0),
                     DISP_PEPSI = c(0, 1))
predict.glm(probit, newdata, type = "link")
predict.glm(probit, newdata, type = "response")

The second command returns the Z values

and the third the estimates of the probability of success

c) Estimate the marginal effect of PRATIO on P for the two scenarios in part (b).

From the Probit model section, the marginal effect of PRATIO on P is


dP/dX = f(Z) · β1 = (1/√(2π)) e^(−Z²/2) · β1

Its estimates for the two scenarios are

(i) dP/dX = (1/√(2π)) e^(−0.148²/2) × (−1.146) ≈ −0.452

(ii) dP/dX = (1/√(2π)) e^(−0.516²/2) × (−1.146) ≈ −0.400
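A quick Python check of the same arithmetic (illustrative; the standard normal PDF is coded directly, and the slope and Z values are the rounded figures from above):

```python
import math

b_pratio = -1.146           # probit slope on PRATIO (rounded)
z_values = (0.148, -0.516)  # fitted index values from part (b)

def probit_marginal(z, slope):
    """Marginal effect dP/dX = phi(z) * slope, phi being the standard normal PDF."""
    pdf = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
    return pdf * slope

for z in z_values:
    print(round(probit_marginal(z, b_pratio), 3))
```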

To verify these results in R, execute

library(margins)
marginal_effects(probit, data = newdata)

You should get the following printout:

Considering again just the dydx_PRATIO column, we can see that the marginal effect
of the relative price of Coke to Pepsi on the probability of purchasing Coke is -0.452
when only Coke is on display and it is -0.400 when only Pepsi is on display.

Compare Exercises 1, 2 and 3 to each other to see that, as far as the estimated probability
of success and estimated marginal effect of PRATIO are concerned, the results from the
linear probability, logit and probit models are qualitatively the same and even quantitatively
they are very similar to each other this time.
