
Sample Final Exam (SMMD)

Part A: Each question in this part is worth 1 point.

1. Suppose you are interested in examining the determinants of earnings. You have
information on the age of the individual as well as their level of education: high school
graduate, college graduate or graduate degree. Let Y = earnings, X1 = age, X2 = 1 if the
person has studied till high school or less and 0 otherwise, X3 = 1 if the person has
studied more than high school but earned less than a master's degree and 0 otherwise,
X4 = 1 if the person has earned a master's level or higher degree and 0 otherwise. Which of
the following model specifications cannot be estimated?

A. Y = β0 + β1X1 + β2X2 + β3X3
B. Y = β0 + β1X1 + β2X2 + β3X3 + β4X4
C. Y = β0 + β1X1
D. None of the above.
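
Worked check (not part of the original exam): one way to test a specification with dummy variables is to see whether the dummy columns add up to the intercept column of ones. The sketch below uses a small made-up sample; the education labels and the five rows are hypothetical.

```python
# Hypothetical sample: each person falls into exactly one education category.
education = ["high_school", "college", "graduate", "college", "high_school"]

# Dummy variables as defined in the question.
x2 = [1 if e == "high_school" else 0 for e in education]
x3 = [1 if e == "college" else 0 for e in education]
x4 = [1 if e == "graduate" else 0 for e in education]

# The intercept corresponds to a column of ones.
intercept = [1] * len(education)

# With all three dummies, X2 + X3 + X4 reproduces the intercept column in
# every row, so intercept and dummies are perfectly collinear.
all_three_collinear = [a + b + c for a, b, c in zip(x2, x3, x4)] == intercept

# With X4 dropped, X2 + X3 no longer reproduces the intercept column.
two_collinear = [a + b for a, b in zip(x2, x3)] == intercept

print(all_three_collinear, two_collinear)
```

When the dummies sum to the intercept column, least squares has no unique solution for that specification.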

2. A manager of a newly opened coffee shop is struggling to manage the waiting time of
customers. Based on preliminary data he has collected, he hypothesizes that there is a
linear relationship between the average waiting time experienced by customers to
receive their drink and number of customers sitting in the coffee shop. He does not have
access to any statistical software but only has the following descriptive statistics.

Variable              Mean      Covariance matrix
                                Waiting time    Number of customers
Waiting time          23 min    136 min²        86 cust-min
Number of customers   12 cust   86 cust-min     75 cust²

The average waiting time for a customer who walks into an empty coffee shop is
(ignore set-up time)

A. 8.5 min
B. 9.24 min
C. 12.07 min
D. Not enough information
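
Worked check (not part of the original exam): a simple least-squares line can be recovered from means, a variance and a covariance alone, since slope = cov(x, y) / var(x) and intercept = ȳ − slope · x̄. The sketch below applies this to the summary statistics in the table; the variable names are illustrative.

```python
# Summary statistics given in the question.
mean_wait = 23.0            # minutes
mean_customers = 12.0       # customers
var_customers = 75.0        # cust^2
cov_wait_customers = 86.0   # cust-min

# Least-squares slope and intercept for waiting time on number of customers.
slope = cov_wait_customers / var_customers
intercept = mean_wait - slope * mean_customers

# The predicted waiting time with zero customers is the intercept.
print(round(slope, 4), round(intercept, 2))
```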

Questions 3 through 6 are based on the following situation:

A researcher intended to study the relationship between the number of major natural
calamities such as tornadoes, hurricanes, earthquakes, floods that occurred during a year (X)
and the average profit (in millions of dollars) of all insurance companies in the country in
that year (Y). She took a random sample of 10 years in which number of calamities per year
varied from 10 through 23 and found that the estimated least squares regression line is
ŷ = 212.6 − 1.9x.

3. The number 212.6 in the above regression can be interpreted most reasonably as

A. The part of the average profit of the insurance companies that is not associated
with the number of natural calamities
B. Change in profit of insurance companies associated with an additional natural
calamity
C. The average profits for all the insurance companies in the country in a year with
no major natural calamities
D. None of the above

4. For the above regression equation, correlation between X and Y will be

A. -1.9
B. Negative but cannot determine the magnitude.
C. +1.9
D. Positive but cannot determine the magnitude.

5. A randomly selected year had 24 major calamities, and the actual average profits in that
year were $200 million. The residual associated with this year is

A. $200 million
B. $167 million
C. $33 million
D. -$33 million
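
Worked check (not part of the original exam): a residual is the actual value minus the fitted value. The sketch below evaluates the fitted line from the question at x = 24; the function name is illustrative. Note that 24 lies just outside the sampled range of 10 to 23, so the fitted value is an extrapolation.

```python
def predicted_profit(calamities):
    """Fitted average profit (in $ millions) from y-hat = 212.6 - 1.9 x."""
    return 212.6 - 1.9 * calamities

actual = 200.0                   # observed average profit, $ millions
fitted = predicted_profit(24)    # fitted value for 24 calamities
residual = actual - fitted       # residual = actual - fitted

print(round(fitted, 1), round(residual, 1))
```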

6. The reason for the residual in the previous question is:

A. Sampling variability – the coefficients were estimated from a random sample
B. Insurance company profits are determined by things other than number of
natural calamities
C. Both of the above
D. None of the above

7. While comparing two regression models for the same response variable from the same
dataset, it was found that R² of model A is 0.80 while that of model B is 0.512. Which of
the following is true about the ratio of the RMSE (s_e) for the two models (Model
A/Model B)?

A. Its exact value cannot be determined based on the information given
B. It will be greater than 1
C. It is equal to 0.64
D. Both A and B
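
Worked check (not part of the original exam): RMSE is sqrt(SSE / (n − k − 1)) and SSE = (1 − R²) · SST, so for two models of the same response the SST cancels in the ratio but the degrees of freedom do not. The sketch below assumes hypothetical values for the sample size n and the predictor counts k_a and k_b, none of which are given in the question.

```python
import math

def rmse_ratio(r2_a, r2_b, n, k_a, k_b):
    """RMSE(A) / RMSE(B) for two models of the same response on the same data."""
    mse_a_over_sst = (1 - r2_a) / (n - k_a - 1)
    mse_b_over_sst = (1 - r2_b) / (n - k_b - 1)
    return math.sqrt(mse_a_over_sst / mse_b_over_sst)

# Hypothetical case 1: both models use the same number of predictors.
same_k = rmse_ratio(0.80, 0.512, n=50, k_a=2, k_b=2)

# Hypothetical case 2: model A uses more predictors than model B.
diff_k = rmse_ratio(0.80, 0.512, n=50, k_a=5, k_b=1)

print(round(same_k, 4), round(diff_k, 4))
```

The two cases give different ratios from the same pair of R² values, which is the point of the comparison.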

Questions 8 through 10 are based on the following regression output which was obtained
from a study which linked Age and Smoking (1 = Smoker, 0 = Non-smoker) to the risk of
heart disease:

Analysis of Variance (ANOVA)

Source       Degrees of Freedom (df)   Sum of Squares (SS)   Mean Square (MS)   F-Ratio   p-Value
Regression   2                         2633.388              1316.694           14.371    0.00022
Error        17                        1557.562              91.621
Total        19                        4190.950

Regression Equation Results

Dependent Variable, Y: RISK
RISK = -28.086 + 0.689 AGE + 14.396 SMOKER

Indep. X Variables   Coefficient   Standard Error   t Statistic   p-Value   95% Conf. Lower   95% Conf. Upper   VIF
Intercept            -28.086       16.707           -1.6811       0.11103   -63.334           7.163
AGE                  0.689         0.25             2.7501        0.01367   0.16              1.217             1.203
SMOKER               14.396        4.695                                    4.49              24.302            1.203

R-squared
Multiple R
Adj. R-squared               58.46%
Standard Error of Estimate   9.572
Durbin-Watson                1.684
Number of Observations       20

8. The missing value of R² in the regression should be:

A. 62.84%
B. 72.97%
C. 59.46%
D. 58.46%

9. The missing value of the t-statistic for SMOKER should be:

A. 3.066
B. 5.863
C. 2.853
D. 2.750

10. Assuming that the OLS assumptions hold, an unbiased estimate for standard deviation
of the error term (σ_ε) is:

A. -28.086
B. 91.621
C. 9.572
D. 16.707
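
Worked check (not part of the original exam): the blank entries in the output for questions 8 through 10 follow from the numbers that are printed. The sketch below uses the standard definitions R² = SSR/SST, adjusted R² = 1 − MSE/(SST/(n − 1)), t = coefficient / standard error, and standard error of estimate = sqrt(MSE); the variable names are illustrative.

```python
import math

# Values read from the ANOVA table and the coefficient table.
ss_regression = 2633.388
ss_error = 1557.562
ss_total = 4190.950
n = 20   # number of observations
k = 2    # number of predictors (AGE, SMOKER)

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (ss_error / (n - k - 1)) / (ss_total / (n - 1))

# t statistic for SMOKER: coefficient divided by its standard error.
t_smoker = 14.396 / 4.695

# Standard error of estimate: square root of the mean squared error.
rmse = math.sqrt(ss_error / (n - k - 1))

print(round(100 * r_squared, 2), round(100 * adj_r_squared, 2))
print(round(t_smoker, 3), round(rmse, 3))
```

The adjusted R² and standard error computed this way agree with the 58.46% and 9.572 printed in the output, which is a useful consistency check on the reconstruction.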

11. Which of the following statements are TRUE in reference to a simple linear regression
line of y on x?

I. The regression line will always pass through at least one of the sample
points (xi, yi)
II. The regression line will always pass through the point (x̄, ȳ), where x̄
and ȳ are respectively the sample means of x and y
III. The point (x̄, ȳ), where x̄ and ȳ are as above, is always included as one
of the points in the sample data set that OLS uses to estimate the
parameters of the model

A. I., II. and III.
B. II. only
C. I. and II. only
D. None of the statements is TRUE

12. Consider a multiple regression model with two predictors. If the overall F-ratio is
significant, i.e., if the p-value associated with the F-ratio is less than the α-value, then we can
conclude that

A. β0, β1 and β2 are different from zero
B. β1 and β2 are different from zero
C. β1 or β2 is different from zero, but not both
D. Either β1 or β2 or both are different from zero

13. You did a multiple linear regression with a set of predictor variables and found that the
overall regression was significant, while the individual predictors were all insignificant.
The most likely explanation for this is that:

A. The response has no linear relationship with any of the predictors
B. The predictors are each cancelling out the other predictors’ effects
C. There is multicollinearity among the predictor variables
D. We’ll need to look at the value of R² before trying to explain this

14. The correlation between two variables in a sample equals zero. This implies that:

A. The two variables must be independent in the population
B. A regression with one of these variables as a response, and the other one as a
predictor will be significant
C. The adjusted R² for the regression in alternative B will be negative
D. None of the above

15. Consider the plot of residuals versus predictor below for a simple linear regression.
Which of the following statements is true for this regression?

[Figure: "Standardised Residuals v Food". Standardised residuals (vertical axis, ticks from -4 to 3) plotted against Food (horizontal axis, 0 to 30).]

A. This regression is insignificant
B. The prediction intervals based on this regression will be incorrect
C. RMSE is more than 3
D. The errors are likely to be autocorrelated

16. In a regression model involving 34 observations, the following estimated regression
model was obtained: ŷ = 48 [+/−] 2.5x1 [+/−] 1.2x2 [+/−] 0.7x3. For this model, the following
statistics were given: SST = 960 and SSE = 270. Then, the value of the F statistic for
testing the validity of this model is:

A. 25.56
B. 7.94
C. 28.24
D. 22.26
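
Worked check (not part of the original exam): the overall F statistic is MSR / MSE, with MSR = SSR / k and MSE = SSE / (n − k − 1). It depends only on the sums of squares, the sample size and the number of predictors, not on the coefficient values. The variable names below are illustrative.

```python
n = 34        # observations
k = 3         # predictors x1, x2, x3
sst = 960.0   # total sum of squares
sse = 270.0   # error sum of squares

ssr = sst - sse              # regression sum of squares
msr = ssr / k                # mean square for regression
mse = sse / (n - k - 1)      # mean squared error
f_stat = msr / mse

print(round(f_stat, 2))
```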

17. The following estimated regression equation compares total compensation among top
executives in a large set of US public corporations in the 1990s. The variables in the data
set are:

Earnings: Total compensation (in $ '000s)
Female: Dummy variable – Equals 1 for females and 0 for males
MarketValue: A measure of firm size (in $ millions)
Return: Stock return (a measure of firm performance in percentage points)

The estimated regression equation is (all predictors were significant):

ln(Earnings) = 3.86 − 0.28 Female + 0.37 ln(MarketValue) + 0.004 Return

We can conclude from the above regression equation that, controlling for return:

A. Females in larger companies suffer a smaller salary discount than in smaller
companies
B. Females in smaller companies suffer a smaller salary discount than in larger
companies
C. A 1% increase in size (as measured by Market Value) decreases the
salary discount of female executives, on average, by 0.37%
D. None of the above

Questions 18 – 20 are based on the following problem.

A professor decides to run an experiment to measure the effect of time pressure on final
exam scores. He gives each of the 285 students in his course the same final exam, but some
students have 90 minutes to complete the exam while others have 120 minutes to complete
it. Each student is randomly assigned one of the examination times based on the flip of a
coin. Consider a regression model of the form: Score = β0 + β1X + ε.

18. The professor is considering two different choices for X. The first choice would be to
treat X as the time given for the exam (in the sample data set, it would only take
values of 90 and 120 minutes). The second choice would be to make X into a dummy
variable, and code it as 0 for students who are given 90 minutes, and code it as 1
otherwise. Unable to choose between these two alternatives, the professor decided to
include both variables as X1 and X2 respectively in a multiple regression of Score on
these two predictors. Which of the following statements is true?

A. The estimated coefficients of both predictors will be identical
B. The estimated coefficient of the time variable will be 30 times the coefficient of
the dummy variable
C. The estimated coefficient of the dummy variable will be 30 times the coefficient of
the time variable
D. None of the above
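
Worked check (not part of the original exam): before putting two codings of the same treatment into one regression, it helps to check whether one is an exact linear function of the other. The sketch below uses a hypothetical handful of students; only the values 90 and 120 come from the question.

```python
# Hypothetical exam times for six students (minutes).
times = [90, 120, 120, 90, 120, 90]

# Dummy coding described in the question: 0 for 90 minutes, 1 otherwise.
dummy = [0 if t == 90 else 1 for t in times]

# If time = 90 + 30 * dummy holds for every student, the two predictors
# (together with the intercept) are perfectly collinear.
reconstructed = [90 + 30 * d for d in dummy]
exact_linear_dependence = reconstructed == times

print(exact_linear_dependence)
```

Under exact linear dependence, least squares cannot separate the two coefficients.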

19. After more deliberation, the professor decided to go with the time variable instead of
the dummy variable. The estimated simple linear equation was E(Score) = 60 + 0.5X.
Based on this, and the accompanying regression output, the professor estimated with
reasonable confidence that the extra 30 minutes resulted in an expected increase in
score of somewhere between 10 and 20 points. The uncertainty in this estimate is
primarily driven by:

A. Sampling variation
B. The difficulty of the exam
C. The choice of 90 and 120 minutes as the test times
D. None of the above

20. Which of the following might be the most likely driver of the intercept term 60 in the
above equation in question 19?

A. The random allocation of students to the two groups
B. The variation in intelligence levels among students
C. The variation in susceptibility to time pressure among students
D. The degree of difficulty of the exam

21. Consider a simple regression where the estimated coefficient of the x variable is
greater than 1. This necessarily implies that:

A. The variance of y-variable is higher than the variance of x-variable
B. The variance of y-variable is lower than the variance of x-variable
C. The variance of y-variable is equal to the variance of x-variable
D. We cannot conclude any of the above without further information

22. In the previous problem, suppose the coefficient of the x-variable is lower than 1.
This necessarily implies that

A. The variance of y-variable is higher than the variance of x-variable
B. The variance of y-variable is lower than the variance of x-variable
C. The variance of y-variable is equal to the variance of x-variable
D. We cannot conclude any of the above without further information
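
Worked check for questions 21 and 22 (not part of the original exam): in simple regression the slope equals r · (s_y / s_x), where r is the sample correlation, so the ratio of standard deviations is slope / r. The sketch below tries a few hypothetical slope and correlation pairs; none of these numbers come from the exam.

```python
def sd_ratio(slope, r):
    """s_y / s_x implied by the identity slope = r * (s_y / s_x)."""
    return slope / r

# Slope above 1 with a hypothetical correlation of 0.8.
ratio_steep = sd_ratio(1.2, 0.8)

# Slope below 1 with a weak and a strong hypothetical correlation.
ratio_flat_weak = sd_ratio(0.5, 0.25)
ratio_flat_strong = sd_ratio(0.5, 0.9)

print(round(ratio_steep, 3), round(ratio_flat_weak, 3), round(ratio_flat_strong, 3))
```

Because |r| can never exceed 1, dividing the slope by r can only move the ratio away from zero, which is worth keeping in mind when comparing the two questions.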

23. Let R²_YX be the R² value associated with the regression of Y (response) on X
(predictor). Let R²_XY be the R² value associated with the regression of X (response)
on Y (predictor). Which of the following is true?

A. R²_XY = R²_YX
B. R²_XY > R²_YX
C. R²_XY < R²_YX
D. None of the above have to be true
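
Worked check (not part of the original exam): the two R² values can be compared numerically by fitting both simple regressions on the same data. The sketch below hand-rolls least squares in plain Python; the six data points are made up for illustration.

```python
def r_squared(x, y):
    """R^2 from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - mean_y) ** 2 for b in y)
    return 1 - sse / sst

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9]

r2_yx = r_squared(x, y)   # Y regressed on X
r2_xy = r_squared(y, x)   # X regressed on Y

print(abs(r2_yx - r2_xy) < 1e-9)
```

In simple regression both quantities equal the squared sample correlation between X and Y.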

Part B: Each question in this part is worth 2 points.

24. A multiple regression model with a person’s weight as response and the person’s
height and the average number of calories consumed per day as predictors was
found to have both slopes positive and significant. Assume further that the height is
positively correlated with calorie consumption. If we consider a simple linear
regression model of weight (response) on height (predictor), the estimated slope
from this regression will be:

A. An unbiased estimate of the true population parameter
B. The estimate will be biased upwards
C. The estimate will be biased downwards
D. Cannot conclude any of the above without further information

25. A famous auction house in London is initiating a data analytics approach to
understand factors associated with the value of antique clocks. A regression of a
random sample of 32 clocks sold in the last 10 years with Price ('00 $) gave the
following output.

The general manager of the auction house claims that this is evidence against the
industry maxim that each additional year of a clock’s age is associated with an average
increase of $1500 in the value of the clock. Do you agree?

A. Yes, at 5% significance level
B. Yes, at 1% significance level
C. Yes, at 0.1% significance level
D. Cannot be determined from the output

Questions 26 – 27 are related to the following description and the JMP reports that follow.
The dataset comes from a set of 420 school districts in California. The response variable is
the test scores of 5th grade students in these districts and is calculated as the average of math
and reading scores for students in each district. The Superintendent of education is
considering whether to decrease class sizes (decrease the student to teacher ratio, labeled
STR hereafter), and wondering whether this would improve student performance on test
scores. Of course, the flip side is that decreasing STR would increase costs and take up much
of the scarce financial resources these school districts have. The regression output of a
simple linear regression of test scores on class sizes (STR) is shown below.

[Figure: plot of TestScr (vertical axis, 600 to 710) against STR (horizontal axis, 13 to 26).]

TestScr = 698.93295 - 2.2798083*STR


Summary of Fit
RSquare                      0.05124
RSquare Adj                  0.04897
Root Mean Square Error       18.58097
Mean of Response             654.1565
Observations (or Sum Wgts)   420

Analysis of Variance
Source     DF    Sum of Squares   Mean Square   F Ratio
Model      1     7794.11          7794.11       22.5751
Error      418   144315.48        345.25        Prob > F
C. Total   419   152109.59                      <.0001*

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   698.93295   9.467491    73.82     <.0001*
STR         -2.279808   0.479826    -4.75     <.0001*

26. Based on the regression output, we can say that

A. The regression is not statistically significant because the R² is only 5%
B. The regression is not statistically significant because the RMSE is much bigger
than the absolute value of slope
C. STR isn’t a significant predictor of test scores because the intercept term
dominates the slope term
D. None of the above
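
Worked check (not part of the original exam): the entries of the JMP report are linked by standard identities, which gives a way to read significance off the output. The sketch below recomputes the t ratio for STR as estimate / standard error, checks that its square matches the F ratio (as it does in a one-predictor regression), and recomputes R² as model SS over total SS. The variable names are illustrative.

```python
# Values read from the Parameter Estimates and Analysis of Variance tables.
estimate_str = -2.279808
std_error_str = 0.479826
ss_model = 7794.11
ss_total = 152109.59
f_ratio_reported = 22.5751

t_ratio = estimate_str / std_error_str   # t ratio for the STR slope
f_from_t = t_ratio ** 2                  # equals F when there is one predictor
r_squared = ss_model / ss_total

print(round(t_ratio, 2), round(f_from_t, 2), round(r_squared, 5))
```

Significance is judged from the t ratio and F ratio with their p-values, while R² measures how much of the variation is explained; the two answer different questions.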

27. A second predictor, the percentage of students whose native language is not English
(PctEL), is added to the regression. While admitting that this leads to an impressive
increase in R², the school superintendent decides to take a look at the scatterplot of
Test scores on PctEL. The scatterplot, reproduced below, MOST LIKELY indicates
that:

[Figure: "Bivariate Fit of TestScr By PctEL". Scatterplot of TestScr (vertical axis, 600 to 710) against PctEL (horizontal axis, 0 to 90).]

A. There are a number of influential observations
B. The R² number must have been misread because test scores seem to decrease
with PctEL
C. The errors in the multiple regression will be heteroskedastic
D. The errors are correlated.

28. Suppose that you fit a simple regression line between response variable Y and
predictor variable X. Further, suppose that you fit a second regression line between
response variable Y and the predicted values of the response variable Ŷ. Which of the
following statements will be true about this second fitted line?

a. Its slope will be the same as the slope between Y and X
b. Its slope will be the reciprocal of the slope between Y and X
c. Its slope will be 1
d. Its slope will be zero
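
Worked check (not part of the original exam): the second regression can be tried out numerically by fitting Y on X, computing the fitted values, and then fitting Y on those fitted values. The sketch below hand-rolls simple least squares in plain Python; the data points are made up and the first-stage slope is assumed to be nonzero.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9]

# First regression: Y on X, then the fitted values Y-hat.
b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * a for a in x]

# Second regression: Y on Y-hat.
c0, c1 = fit_line(y_hat, y)

print(round(c0, 6), round(c1, 6))
```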
