
1. A survey of married couples finds that those who have been married more than 5 years
report higher happiness levels in their lives on average compared with those who have
been married less than 5 years. The researchers conclude that “marriage gets better
over time.”
a. Is survivor bias likely to play an important role here? Explain briefly.
Answer: Yes, couples with a marriage surviving at least 5 years are likely to be
the happy ones; the unhappy ones are likely to get divorced before 5 years and
would not show up in the survey in the long marriage group.

b. Someone suggests that age is very likely to be a confounding factor here. Do you
agree or disagree? Explain briefly.
Answer: Agree. Couples married over 5 years are more likely to be older couples
compared with couples married under 5 years, and older people may also have
higher happiness levels.

c. In the study the researchers mention that their sample consists of 1000
couples that have been married more than 5 years and only 300 couples that
have been married less than 5 years. Would the different group sizes be a
likely confounding factor here? Briefly explain why or why not.
Answer: No, because averages are being compared here, and an average already
adjusts for the different sample sizes.

2. A company studies its accounts receivable records and sees that there is a small
correlation (+.02) between the size of the bill and the number of days it takes to collect
on the bill. However, when it separately examined the relationship for residential and
commercial bills, it found that for residential bills, there was a positive correlation, +.35;
for commercial bills, there was a negative correlation, -.43. Draw a scatterplot that is
consistent with these findings.
Answer:
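One way to picture a consistent scatterplot: residential bills form a cloud of smaller bills that slopes upward, commercial bills form a cloud of larger bills that slopes downward, and the two clouds sit at nearly the same height, so the pooled cloud looks almost flat. The sketch below generates illustrative data with that shape. The group sizes, the standardized units, and the offsets between the two clouds are assumptions chosen for illustration, not values from the company's records.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def make_group(rng, n, rho, x_center, y_center):
    """Draw n (bill size, days to collect) points whose correlation is near rho."""
    xs, ys = [], []
    for _ in range(n):
        zx, zy = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(x_center + zx)
        ys.append(y_center + rho * zx + math.sqrt(1 - rho ** 2) * zy)
    return xs, ys

rng = random.Random(0)
# Assumed layout: commercial bills are larger (shifted right) and take slightly longer.
res_x, res_y = make_group(rng, 5000, 0.35, x_center=0.0, y_center=0.0)
com_x, com_y = make_group(rng, 5000, -0.43, x_center=2.0, y_center=0.14)

r_res = pearson(res_x, res_y)
r_com = pearson(com_x, com_y)
r_all = pearson(res_x + com_x, res_y + com_y)
print(f"residential {r_res:+.2f}, commercial {r_com:+.2f}, pooled {r_all:+.2f}")
```

Plotting `res_x` against `res_y` and `com_x` against `com_y` in two colors on the same axes would give a scatterplot of the kind the question asks for.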

3. An analyst gathers weekly market data for a Pillsbury breakfast food product over the
past 40 weeks. Let Q = weekly quantity of Pillsbury sold (in millions of units), P = weekly
average price ($) of the Pillsbury product, A = the fraction of the Pillsbury product sold
during the week during a promotion, Pc = average weekly price ($) of products from
Pillsbury’s direct competitors, Ac = fraction of Pillsbury’s direct competing products sold
during the week during a (competitor’s) promotion, t = week number (1 through 40) and
SAc = 1 if a major competitor ran a special high intensity promotion during that week
and 0 otherwise. The analyst computes two regression models along with the
corresponding R2 and standard errors:
Q = 12.45 – 2.77P, R2 = 61.3%, SE = .38
Q = 11.31 – 3.61P + 1.11*Pc, R2 = 67.4%, SE = .36
a. Give managerial interpretations for the numbers 12.45, -2.77, -3.61 and 1.11.
Answer: 12.45: no managerial interpretation (it is the predicted quantity at a price
of $0, which is far outside the data). -2.77: each additional dollar in average
weekly price of Pillsbury is associated with 2.77 million fewer units sold on
average. -3.61: in weeks when the competitor’s price is the same, each additional
dollar in average weekly price of Pillsbury is associated with 3.61 million fewer
units sold on average. 1.11: in weeks when Pillsbury’s price is the same, each
extra dollar in average weekly price of the competitor is associated with an extra
1.11 million units of Pillsbury product sold on average.
b. What could explain why the -3.61 coefficient is much more negative than the
corresponding -2.77 coefficient? Explain in language a manager would understand.
Answer: If Pillsbury raises its price and the competitor’s price stays the same (as in
the second equation), customers will switch to the competing product and the
quantity sold will drop much more dramatically than if Pillsbury’s competitor is
allowed to raise prices at the same time (as can happen in the first equation, which
does not hold the competitor’s price fixed).

c. The analyst computes a fourth model: Q = 4.19 - .47P + 3.67A - .013t - .36SAc, R2
= .81, SE = .28. Next week (week number 41) Pillsbury plans to set its price at
$3.12, sell 35% of product during promotions (A = 0.35) and a major competitor is
planning a special high intensity promotion campaign. What are forecasted sales
for Pillsbury and how would you communicate to a manager the uncertainty
associated with the forecast?
Answer: Plugging in the values gives the forecasted sales: 4.19 - .47(3.12) +
3.67(.35) - .013(41) - .36(1) = 3.1151, or about 3.12 million units. To
communicate the uncertainty, give the 95% prediction interval, or mention that
the margin of error is about 2 x SE = 0.56 million units.
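The arithmetic behind this forecast can be checked with a few lines of Python. This is a minimal sketch: the variable names are illustrative, and the margin of error uses the rough rule of two standard errors of the regression rather than an exact prediction interval.

```python
# Coefficients and standard error as reported for the model in part (c).
intercept, b_price, b_promo, b_week, b_special = 4.19, -0.47, 3.67, -0.013, -0.36
se_regression = 0.28

# Planned values for week 41.
price, promo_share, week, special_promo = 3.12, 0.35, 41, 1

forecast = (intercept + b_price * price + b_promo * promo_share
            + b_week * week + b_special * special_promo)
margin = 2 * se_regression  # rough 95% margin of error: about 2 standard errors
low, high = forecast - margin, forecast + margin
print(f"forecast {forecast:.2f} million units, roughly {low:.2f} to {high:.2f}")
```

For a manager, the range is usually easier to act on than the standard error itself: sales of roughly 2.6 to 3.7 million units are plausible next week.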

4. To study the benefits of upgrading information technology (IT), which many believe can
both reduce costs and improve quality, a number of hospitals are offered some
incentives to upgrade to a new state-of-the-art IT system, and two years later
researchers find that the hospitals that choose to accept the offer and upgrade had
lower costs per patient and higher levels of healthcare quality measures compared with
hospitals that did not upgrade to the new IT system.
(a) Analysts conclude from this that the upgraded IT system led to this boost in
performance. Can you think of a confounding factor that could play a large role here?
Please give an example of one such factor, explain why it's likely to be a confounding
factor, and please limit your answer to a couple sentences.

Hospitals that have the resources or interest needed to upgrade are probably more
likely to have better performance to begin with. In other words, the interest in boosting
performance is a confounding factor that would affect both performance and the
likelihood of upgrading the system.

(b) Without repeating the study, can you think of a better way to analyze the data already
gathered?
They could compare each hospital's performance before and after the upgrade, or they
could adjust for performance before the upgrade, so that they are comparing hospitals
that initially had the same performance.

Different researchers decide to choose a group of hospitals and randomly select half the
hospitals in the group to be offered the incentives. Two years later researchers find that the
hospitals that chose to accept the offer and upgrade had higher levels of healthcare quality
measures compared with hospitals that were not offered the incentives.

(c) A manager says that now the study is experimental and thus there are unlikely to be
confounding factors. Do you agree? Explain briefly.

No. Individual hospitals still get to choose whether or not they upgrade. Comparing the
ones that choose to upgrade with the ones that don't introduces the same confounding
factor as in part (a).

(d) Without repeating the study, can you think of a better way to analyze the data
already gathered?
They should compare the entire group that was offered the incentives with the entire
group that was not offered the incentives, regardless of whether or not the hospitals
actually upgraded, and not just the hospitals in the first group that upgraded. This way
the comparison is experimental, because hospitals don't get to decide whether or not
they are offered the incentives, so there shouldn't be confounding as a result.

5. You are hired by the Department of Health and Human Services to help understand the
determinants of the obesity epidemic in the US. You are given data on more than 20,000
individuals, aged 22-60. You have the following information:
- Weight in pounds
- Height in inches
- Gender
- Age
- Region: Northeast, South, Midwest, West (these are the only 4 regions in the US)
- Immigrant status (=1 if person is an immigrant, 0 if not)

With this data in hand you start by running regression models where the dependent variable is
weight.

Based on your results presented in the Table at the end of the document, answer the following
questions:
a. Explain why the coefficient of immigrant goes from more negative to less
negative from column (1) to column (2).

The first comparison is without controlling for height – if height and weight are positively
related and the coefficient on immigrant increases (becomes less negative) when you control
for height, it has to be that immigrants are on average shorter, and the coefficient of the simple
regression is capturing this indirect effect.

b. Interpret the intercept and the coefficient on midwest in regression 3.

Intercept: The average weight for non-immigrants in the West is about 177 pounds.
Coefficient on midwest: keeping immigrant status constant, people in the Midwest
weigh close to 4 pounds more on average than people in the West.

c. What is the predicted difference in weight between a female and a male who
have the same age, height, and immigrant status, and live in the same region? Give a
95% confidence interval for this prediction.

We use the male coefficient in regression (4): the female is predicted to weigh about
13 pounds less than the male.

95% confidence interval: 13.004 ± 2 × 0.656 = (11.69, 14.32)

d. Based on your models, and assuming no changes in other variables but age,
whom do you predict to have a larger increase in their weight (in pounds) from
this year to the next on average?

1. QM 716 students
2. Their professor
3. Same for both
4. Cannot tell

Answer: 1 (QM 716 students).

Explain: Because the coefficient on age is positive and that on age_sq is negative,
predicted weight increases with age at a decreasing rate, so the younger person gains
more per year. Note that your professor is older than all of you, but not old enough to
be on the other side of the inverted U (the slope changes sign at about age 66, which
is outside the sample).
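The turning point and the year-over-year comparison can be worked out from the age and age_sq coefficients in column (4). The sketch below does so. The ages 25 and 45 are assumptions standing in for a typical student and the professor, not values from the problem.

```python
# Age coefficients from column (4) of the table.
b_age, b_age_sq = 0.797, -0.006

def weight_change_next_year(age):
    """Predicted change in weight (lb) from this age to age + 1, other variables fixed."""
    now = b_age * age + b_age_sq * age ** 2
    later = b_age * (age + 1) + b_age_sq * (age + 1) ** 2
    return later - now

turning_point = -b_age / (2 * b_age_sq)       # age at which the fitted curve peaks
student_gain = weight_change_next_year(25)    # assumed student age
professor_gain = weight_change_next_year(45)  # assumed professor age
print(f"peak near age {turning_point:.1f}")
print(f"student +{student_gain:.2f} lb, professor +{professor_gain:.2f} lb")
```

Both predicted gains are positive, but the younger person's is larger, which is what "increasing at a decreasing rate" means here.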

e. Which variable has a stronger correlation with weight: immigrant status or
gender? Explain.

Gender, because the R-squared of regression 5 is much larger than the R-squared of
regression 1 (in a regression with a single explanatory variable, R-squared is the
squared correlation).

6. Air pollution in Beijing has been a very serious problem in recent years. Air pollution has
a negative impact on health, particularly the health of infants. As a policymaker in
Beijing, you want to investigate if providing air filters to households improves infant
health. You randomly sample 1,000 households with an infant in their home.

Experiment A

First, suppose you asked households if they would like to receive an air filter. Of the 1000
households, 500 households requested and received an air filter, while the other 500
households did not request one. Let’s call this “Experiment A”. You collect data on the
following variables:

InfantHealth: Health indicator for infants in each household (Worst score = 1, Best score
= 100)

AirFilter: Dummy variable = 1 if the household received an air filter, = 0 if no filter

MothersEducation: Mother’s years of schooling (minimum 9 years, and maximum 22 years)

a) Analyzing data from “Experiment A,” you find the following regression result. Standard
errors are in parentheses.

Regression 1: InfantHealth = 2.2 + 30.5*AirFilter + 3.5*MothersEducation
                            (0.8)  (8.4)            (0.9)

Interpret the coefficient on AirFilter in regression #1. (One sentence).


Answer: The coefficient on AirFilter in regression #1 means that having an air filter is
associated with an increase in InfantHealth by 30.5, holding MothersEducation constant
(or controlling for MothersEducation).

b) Calculate the t-statistic and 95% confidence interval for the coefficient on AirFilter in
regression (#1). Is it statistically significant?

t-statistic: Answer: 30.5/8.4 = 3.63

95% confidence interval: Answer: 30.5 ± 2 × 8.4 = (13.7, 47.3)

Is the AirFilter coefficient statistically significant?

Answer: Yes (the t-statistic is above 2 and the interval excludes zero).

c) Based on this result, your colleague says, “Providing air filters is a good public policy
because regression #1 shows that air filters cause better infant health.” Do you agree
that this evidence shows that air filters cause better infant health? Why or why not?

Answer: I disagree. The regression does not necessarily identify the causal effect of
having an air filter on infant health because the treatment group is self-selected
(households voluntarily sign up for the treatment) and, therefore, there could be other
factors that affect both AirFilter and InfantHealth.

Experiment B

Now, instead of asking households to choose whether they want an air filter, you randomly
assign air filters among another 1000 randomly selected households. Let’s call this “Experiment
B”. You provide air filters to 500 randomly selected households while the other 500 households
do not receive air filters.

d) Using the Experiment B data, you find the following regression result.
Regression #2: InfantHealth = α1 + 10.2*AirFilter + 4.5*MothersEducation
                                   (8.4)            (0.9)

Based on this result, your colleague says, “Providing air filters is a good public policy
because regression (2) shows that air filters cause better infant health.” Do you agree
that this evidence shows that air filters cause better infant health? Why or why not?

Answer: I disagree. Because AirFilter is randomly assigned, the regression coefficient
can be given a causal interpretation. However, the t-statistic for AirFilter is
10.2/8.4 = 1.2 < 2, which implies that the coefficient is not statistically significant
(we cannot reject the null hypothesis that the coefficient is zero). Therefore, this
result does not show that AirFilter improves InfantHealth.
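The significance checks for the AirFilter coefficient in the two regressions follow the same recipe, sketched below: divide the coefficient by its standard error, and form the interval as the coefficient plus or minus two standard errors. The helper name `t_and_interval` is illustrative, and the multiplier of 2 is the rule of thumb used in these answers rather than an exact critical value.

```python
def t_and_interval(coef, se, multiplier=2.0):
    """Return the t-statistic and the rough 95% interval coef +/- multiplier * se."""
    t_stat = coef / se
    return t_stat, (coef - multiplier * se, coef + multiplier * se)

results = {}
for label, coef, se in [("Regression 1", 30.5, 8.4), ("Regression 2", 10.2, 8.4)]:
    t_stat, (low, high) = t_and_interval(coef, se)
    significant = abs(t_stat) > 2  # equivalent to: the interval excludes zero
    results[label] = (t_stat, low, high, significant)
    print(f"{label}: t = {t_stat:.2f}, interval ({low:.1f}, {high:.1f}), "
          f"significant: {significant}")
```

The same standard error gives opposite conclusions because the Experiment B coefficient is only about a third the size of the Experiment A coefficient.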

e) Using the Experiment B data, you now run the following regression:

Regression #4: AirFilter = b0 + b1*MothersEducation

Which of the following statements is most likely to be correct? Circle one and explain
why.

i) b1 will be positive and statistically significant

ii) b1 will be negative and statistically significant

iii) b1 will be close to zero and not statistically significant

iv) Cannot tell

Explanation:

Answer: The answer is (iii). Because AirFilter is randomly assigned, AirFilter should be
uncorrelated with any pre-determined variable (a variable that is not affected by the
experiment), apart from chance differences between the groups.
Therefore, we expect that b1 will be close to zero and not statistically significant.

f) Using the Experiment B data, you now run another regression:
Regression #3: InfantHealth = θ0 + θ1*AirFilter

Which of the following statements is correct? Circle one and explain why. (Note: 10.2 is
the coefficient on AirFilter in Regression 2 above.)

i) θ1 will be larger than 10.2

ii) θ1 will be smaller than 10.2

iii) θ1 will be very close to 10.2

iv) Cannot tell

Explanation

Answer: The answer is (iii). The answer to the previous question means that AirFilter and
MothersEducation are essentially uncorrelated. Therefore, omitting MothersEducation does
not bias the coefficient on AirFilter, and that coefficient will not change much when
other variables are added or dropped.
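A small simulation can illustrate both answers. The sketch below uses a hypothetical data-generating process (a true filter effect of 10, an education effect of 4.5, and noise with standard deviation 10, all assumptions) and assigns filters by coin flip. It then computes the slope of AirFilter on MothersEducation and the AirFilter coefficient with and without MothersEducation in the regression, using the closed-form least-squares formulas for one and two regressors.

```python
import random

def centered_sum(us, vs):
    """Sum of products of deviations from the mean (n times the sample covariance)."""
    mu, mv = sum(us) / len(us), sum(vs) / len(vs)
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs))

rng = random.Random(1)
n = 20000
# Hypothetical process; the simulated health score is not capped at 100.
educ = [rng.randint(9, 22) for _ in range(n)]
filt = [1 if rng.random() < 0.5 else 0 for _ in range(n)]  # coin-flip assignment
health = [2 + 10 * f + 4.5 * e + rng.gauss(0, 10) for f, e in zip(filt, educ)]

s_ff, s_ee = centered_sum(filt, filt), centered_sum(educ, educ)
s_fe = centered_sum(filt, educ)
s_fy, s_ey = centered_sum(filt, health), centered_sum(educ, health)

b1 = s_fe / s_ee           # slope of AirFilter on MothersEducation
theta_short = s_fy / s_ff  # InfantHealth on AirFilter only
# InfantHealth on AirFilter, controlling for MothersEducation.
theta_long = (s_fy * s_ee - s_fe * s_ey) / (s_ff * s_ee - s_fe ** 2)
print(f"b1 = {b1:.4f}, short = {theta_short:.2f}, long = {theta_long:.2f}")
```

With random assignment, `b1` comes out near zero and the two AirFilter coefficients nearly coincide. Replacing the coin flip with a rule that depends on `educ` would break both results.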

7. The Dean of Questrom wants to encourage business students to study more. As a result,
the dean proposes a $1,000 incentive for all students who get an A grade in all of their
fall classes. You encourage the Dean to randomize the incentive across students – half
of the students are offered the incentive and another half are not offered the incentive.
(The ones offered the incentive are chosen by flipping a coin: heads they get it.)

After the fall term, you collect the data on the students and run the following regression to
see whether the incentive worked in increasing the GPA (t-statistics in parentheses):

GPA = 3.00 + 0.60*Incentive        adj. R2 = .12
      (2.5)  (2.02)
(t-statistics in parentheses)

a) Did the incentive accomplish its goal of increasing GPA? CIRCLE ONE:

YES NO CAN’T TELL


Explain how you know (1 sentence)

YES. The incentive was randomly assigned, and its coefficient has t > 2, so it is statistically significant.

b) The QM222 coordinator looks at all this analysis and says “You cannot tell at all from this
analysis whether GPAs would improve if you could get people to study more. There
are likely to be so many additional factors determining GPA.” Assuming that professors
never change their criteria for grading (i.e. don’t curve), could she be right? CIRCLE ONE:

YES NO MAYBE

Explain:

NO: The reasoning would be that since this is a randomized experiment, there is no bias
in the treatment coefficient and there are no confounding factors.

c) You also know the average daily study hours (studyhours) of each student that semester
and include it in the regression as well:
GPA = 1.40 + 0.60*Incentive + 0.15*studyhours        adj. R2 = .24
      (2.8)  (2.03)           (3.5)
(t-statistics in parentheses)

The Dean is confused as to why the coefficient on the Incentive dummy variable did not change
when you included study hours. How would you respond to the Dean? (1 sentence)

Since this is a randomized experiment, there is no bias in the treatment coefficient, so
it won’t change.

Table. Dependent variable is weight in pounds (standard errors in brackets)

              (1)       (2)       (3)       (4)       (5)
VARIABLES     weight    weight    weight    weight    weight

immigrant     -16.717   -6.351    -16.027   -7.479
              [0.757]   [0.663]   [0.765]   [0.673]
height                  5.392               4.295
                        [0.059]             [0.081]
northeast                         -0.481    -0.171
                                  [0.897]   [0.765]
midwest                           3.967     3.265
                                  [0.809]   [0.689]
south                             2.961     4.559
                                  [0.747]   [0.637]
age                                         0.797
                                            [0.189]
age_sq                                      -0.006
                                            [0.002]
male                                        13.004    37.123
                                            [0.656]   [0.506]
Constant      179.498   -183.437  177.399   -141.028  159.687
              [0.301]   [3.978]   [0.611]   [6.435]   [0.344]

Observations  24,407    24,407    24,407    24,407    24,407
R-squared     0.020     0.270     0.021     0.290     0.181
