Antonio Fidalgo
Foreword

I Introduction
1 Statistical Intuition
1.1 A Few Questions in Statistics
1.1.3 Mean IQ
1.1.7 Armour
2 Statistical Statements
2.1 Introductory Example
2.5 R Code
2.6 Exercises
3.2 p-Hacking
3.5 Exercises

II Statistical Inference
4.2 Assumptions
4.4 Estimator
4.9 Exercises
5.2.1 Illustration
6.1.4 Example
6.2.8 P-Value
6.3.4 Implementation in R
6.4.2 Implementation in R
6.6 Exercises
7.2.2 Implementation in R
7.5 Exercises

V Visualizations
12 Bars
12.1 Bars for Proportions

VI Bridge
13 Correlation
13.1 Bivariate Relationships
18 Assumptions
18.1 When is the Model Valid?
20 Inference
20.1 Sampling Distributions of the 𝛽̂'s

IX Intermezzo
26 Presentations
26.1 "Conclude with a Conclusion" Approach
Why
28 Endogeneity
28.1 The Issue

Appendix
A Assignments
A.1 Assignment I
F.2 Q
F.3 Q
F.4 Q
F.5 Q
F.6 Q
F.7 Q
F.8 Q
F.9 Q
F.10 Q
F.11 Q
F.12 Q
F.13 Q
F.14 Q
F.15 Q
List of Tables

2.1 Inflammation levels in the two groups, the drug treated (D) and the control (C) group.
2.2 All combinations of the six observations into two groups.
2.3 Observed hotwings consumption of female individuals.
2.4 Group averages in hotwings consumption and difference between groups of males (M) and females (F).
2.5 Group averages in repair times and difference between groups of Verizon customers (V) and customers of other companies (C).
C.1 Practice quiz questions with elements of solution in this appendix.
List of Figures

13.1 Scatter plots of pairs of variables and their linear relationship.
13.2 Anscombe plots.
13.3 Assessing associations with base R.
13.4 Assessing associations with the corrgram package.
13.5 Assessing associations with the corrplot package.
15.1 Instance of simulated Income data along with true 𝑓() and errors.
15.2 Instance of simulated Income data along with true 𝑓() and errors (two predictors).
15.3 Wage as function of various variables.
15.4 Factors influencing the risk of a heart attack.
15.5 Frequencies for main words in email (to George).
19.1 Using the mean as the best fit and the resulting residuals.
19.2 Linear fit and residuals.
22.5 Scatter plot of sample with non-linear relationship along with OLS fit (blue).
30.1 Mita border and specific portion analyzed by Dell (2010).
These notes are intended as an introduction to the topics they cover. The varying levels of detail and comprehensiveness, within and across the lecture notes, reflect this introductory character. They replace the usual decks of slides in a format that allows for a general overview of the material thanks to the comprehensive table of contents.
The departure from the usual slides model towards a narrative, memo-like format for each lecture is a choice that calls for some explanation, if only because it is very uncommon.

I think that the style of typical slide shows, especially those built with MS PowerPoint (PP), is characterized by an excessive oversimplification of the arguments, which are reduced to bullet points, keywords and poor graphical representations. From a pedagogical point of view, these are not sufficient for conveying a nuanced line of argumentation and often result in a black-or-white misinterpretation.[1] For an in-depth critique of PP presentations, arguing that the cognitive style of PP is "making us stupid" and may be associated with tragic mistakes,[2] see the work of Edward Tufte (Tufte (2003)). See also the hilarious example[3] of the abuse of PP and its "AutoContent Wizard". PP presentations are also criticized in the business world[4] and are sometimes replaced by memos, e.g., at Amazon.[5]

As with slides, however, the notes must be complemented with elements emerging during the discussion in class. It is unreasonable to consider the words written here as the exclusive material covered in the exam. Most elements in these notes are mere placeholders for arguments and discussions held at greater length in various sources. In that sense, the main advantage of these notes is to provide a structure for the classes.

[1] Here, I only claim a reduction of that risk since it would be presumptuous and flatly wrong to pretend that the full-sentence format leaves no room for misunderstanding.
[2] https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001yB
[3] https://norvig.com/Gettysburg/
[4] https://www.inc.com/geoffrey-james/sick-of-powerpoint-heres-what-to-use-instead.html
[5] https://conorneill.com/2012/11/30/amazon-staff-meetings-no-powerpoint/
Part I
Introduction

1
Statistical Intuition
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.
Which of the following two alternatives is more probable?
a. Linda is a banker,
b. Linda is a banker and active in the feminist movement.
Suppose you’re on a game show, and you’re given the choice of three doors:
Behind one door is a car; behind the others, goats.
You pick a door, say No. 1, and the host, who knows what's behind the doors, always opens a door with a goat, say No. 3. He then says to you, "Do you want to pick door No. 2?"
What would you answer?
1.1.3 Mean IQ
The mean IQ of the population of high school students in a given big city is known to be 100.
You have selected a random sample of 50 of these students for a study. The first
of these students tested has an IQ of 150.
What do you expect the mean IQ to be in the whole sample of 50 students?
Which of the following sequences of X’s and O’s seems more likely to have been
generated by a random process (e.g., flipping a coin)?
a. XOXXXOOOOXOXXOOOXXXOX
b. XOXOXOOOXXOXOXOOXXXOX
The probability of breast cancer is 1% for a woman at age forty who participates
in routine screening.
If a woman has breast cancer, the probability is 95% that she will get a positive
mammography. If a woman does not have breast cancer, the probability is 8.6%
that she will also get a positive mammography.
A woman in this age group had a positive mammography in a routine screening.
What is, approximately, the probability that she has breast cancer?
1.1.7 Armour
During WWII, the Navy tried to determine where they needed to armor their aircraft to ensure they came back home. Once back, the planes were subjected to an analysis of where they had been shot. Figure 1.2 shows the results of these analyses.
Which areas of the plane (A to F) do you think would need the most armor?
turn the learning of the topic into a frustrating endeavor leading to the same desperate self-assessments as when we learn a language:
Every reader has already experienced all of these. And there will be no soothing
counter-argument here. Only a reminder that the benefits of understanding this
language are very numerous, too many indeed to be encapsulated into a few
sentences. Instead, their full list will be slowly uncovered throughout a life of
decisions, improved and not fooled by randomness.
A last word about this language. Contrary to popular belief, statistics is not a special dialect of mathematics. Sure enough, they share many expressions. And, more often than not, a good command of math allows one to get away with treating them as one. This view and this practice of statistics, however, are unfortunate and detrimental. I hope these notes will help make this clear.
In this introductory chapter, I would like to lay down a few elements of the learning strategy adopted here. These are given below in no particular order.
Did you know there exist books listing the most used words of a language? For instance, I have Jones and Tschirner (2015) on my bookshelf, listing die Statistik in the 2864th position.[1]
Similarly, this course mainly adopts this frequency approach: it touches on the core methods of empirical research. I trust it will allow you to tell many interesting stories.
The number of statistical techniques is very large. One may wonder if the par-
ticular one we use is the most appropriate for the problem at hand.
Here is a perspective from my experience. A statistical analysis is virtually never incorrect because it uses the wrong technique. Instead, it is often criticized because it fails to comply with basic principles.
Surprisingly, those who mostly fall into this trap are precisely those who know the smallest number of techniques, i.e., you. This course will put a particular emphasis on these principles in order to help you avoid disqualifying mistakes.
This section offers a few pointers to better understand the questions (and their
answers) of Section 1.1. Its conclusions must be understood by all, but its details
are meant for inquiring minds only.
It is easy to see that the second option, "Linda is a banker and active in the feminist movement", must represent a subset of the first option, "Linda is a banker", and therefore cannot be more probable. As for why the former nevertheless seems more probable than the latter to most people, see Tversky and Kahneman (1983) or Kahneman (2011). Arguably, the second option taps into our brain's love for stories.

[1] Glaube nur der Statistik, die du selbst gefälscht hast. ("Only trust the statistics you have falsified yourself.")
This is a question about which a great many stories have already been told. A main perspective emerges in all of them, namely how much it has fooled the overwhelming majority of those who attempted the question. Many of these stories also quote a letter written to a columnist who gave the right answer:
You blew it! Let me explain: If one door is shown to be a loser, that information changes the probability of
either remaining choice – neither of which has any reason to be more likely – to 1/2. As a professional
mathematician, I’m very concerned with the general public’s lack of mathematical skills. Please help by
confessing your error and, in the future, being more careful.
There are several ways of demonstrating that "switching doors" is the right thing to do: a theoretical demonstration based on Bayes' theorem, a simulation, and another attempt at intuition. I briefly describe the three below.
Theoretical demonstration based on Bayes' theorem.
We will show how to calculate the correct probabilities:
• the probability that the car is behind door No.2 given that Monty Hall opened
door No.3,
• the probability that the car is behind door No.1, the initially chosen door, given that Monty Hall opened door No.3; notice that, since the car must be behind one of these two doors, this probability is simply one minus the probability calculated just above.
Before any door is opened, each door is equally likely to hide the car:

P(C1) = P(C2) = P(C3) = 1/3
In the current configuration, the new information is that Monty Hall opens door
No.3, i.e., we observe the event 𝐷3 . We are looking to compare the posterior
probabilities:
We do not know these posterior probabilities but we know that they can be cal-
culated with Bayes’s rule thanks to the “inverted” probabilities:
• If the car is behind door No.1, then Monty Hall could open either door No.2 or
door No.3, with equal probability; hence
P(D3|C1) = 1/2
• If the car is behind door No.2, then Monty Hall could only open door No.3
since he cannot show a car or open your door; hence
𝑃 (𝐷3 |𝐶2 ) = 1
• If the car is behind door No.3, then Monty Hall cannot open door No.3 since he cannot show the car; hence

P(D3|C3) = 0
We can now calculate the correct probability mentioned above, the probability that the car is behind door No.2 given that Monty Hall opened door No.3, P(C2|D3). We do this by applying Bayes' rule:

P(C2|D3) = P(C2)P(D3|C2) / [P(C1)P(D3|C1) + P(C2)P(D3|C2) + P(C3)P(D3|C3)]
         = (1/3 × 1) / (1/3 × 1/2 + 1/3 × 1 + 1/3 × 0)
         = 2/3
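This calculation can also be checked numerically; here is a minimal sketch in R (the object names are mine, not from the text):

```r
p_c  <- rep(1/3, 3)   # priors: P(C1), P(C2), P(C3)
p_d3 <- c(1/2, 1, 0)  # likelihoods: P(D3|C1), P(D3|C2), P(D3|C3)

# Bayes' rule: posterior probability that the car is behind door No.2
post2 <- p_c[2] * p_d3[2] / sum(p_c * p_d3)
post2                 # 2/3: switching doubles the chance of winning
```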
doors <- 1:3          # the three doors
N <- 10000            # number of simulated games
print_games <- FALSE  # set to TRUE to print each game
wins <- 0             # wins when always switching
for(i in 1:N)
{
  prize <- floor(runif(1,1,4)) # randomize which door has the good prize
  guess <- floor(runif(1,1,4)) # guess a door at random
  ## Reveal one of the doors you didn't pick which has a bum prize
  if(prize!=guess)
    reveal <- doors[-c(prize,guess)]
  else
    reveal <- sample(doors[-c(prize,guess)],1)
  ## Always switch: select the remaining closed door
  select <- doors[-c(reveal,guess)]
  outcome <- ifelse(select==prize, "Win!", "Loss")
  if(select==prize) wins <- wins + 1
  if(print_games)
    cat(paste('Guess: ',guess,
              '\nRevealed: ',reveal,
              '\nSelection: ',select,
              '\nPrize door: ',prize,
              '\n',outcome,'\n\n',sep=''))
}
wins/N # proportion of wins when switching: close to 2/3
You can then run the simulation as many times as you want (by choosing N).
Suppose you’re on a game show, and you’re given the choice of 100 doors: Behind one door is a car; behind
the others, goats. You pick a door, say No.1, and the host, who knows what’s behind the doors, opens 98
doors, say No.3 to No.100, which have a goat. He then says to you, “Do you want to pick door No.2?”
Why is it more intuitive here to see that one should switch doors? The host knows where the car is and, out of the remaining 99 doors, he opens 98 showing a goat. With your initial choice, you had only a 1% chance of guessing the correct door. Of course, this means that with 99% probability the car is behind one of the other 99 doors. Out of these 99 doors, you now know which 98 are not winning. Can you still believe that the only one still closed has only a 50% probability of containing the car?
A misleading intuition here would suggest that the 49 students who were not tested will, on average, compensate for the very high score of the first student. That type of compensation does not exist. To expect it is similar to judging it highly probable that the ball of a roulette wheel will land on red after landing 10 times in a row on black.
What can be said, instead, is that the average of the 49 remaining students can be expected to be 100. Hence, over the 50 students, the average will be

(1/50) × (150 + 49 × 100) = 5050/50 = 101.
c. XXXOXXXXOXXXXOXXXXXX
d. XOOOOXXOXOXOOOXXOOOX
Obviously, both of these sequences can happen. It doesn't mean, however, that they are equally likely to have been formed by a fair die (coded as explained above).
This question is again an example of an application of Bayes' theorem. Recall that this is the kind of case that most eludes human intuition. Part of the explanation for our inability resides in our poor handling of probabilities, i.e., relative numbers: ratios of two numbers that are often equally difficult to estimate.
There is a way out that proves useful in many cases: replace the relative numbers, i.e., the probabilities, by absolute numbers.
In this situation, imagine 500 women in the relevant group take the test. How many of these have the disease? 1% of 500, i.e., 5. Of these 5, 95% will indeed test positive, i.e., a bit fewer than 5, but we'll round up to 5. How many of the 500 do not have the disease? 495. Now, out of these 495, 8.6% will also test positive, i.e., around 43.
Total of positive tests: 5 + 43 = 48. Of these 48, how many suffer from the disease? Only 5. Hence, the probability of having cancer after a positive test under these conditions is 5/48, around 10%.
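The same natural-frequency arithmetic can be written out in R; this is merely the calculation above restated, not code from the original notes:

```r
n    <- 500                 # imagined group of women taking the test
sick <- n * 0.01            # 5 women actually have the disease
tp   <- sick * 0.95         # true positives: 4.75, rounded up to 5 in the text
fp   <- (n - sick) * 0.086  # false positives among the 495 healthy: ~42.6
tp / (tp + fp)              # ~0.10: roughly a 10% chance of disease given a positive test
```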
The issue presented in this question has far-reaching implications, far more indeed than is usually recognized. This is why the issue will be brought back into the discussion multiple times in this course. In the literature, it is referred to as sample selection bias.
The core of the issue is that the planes that returned are not the relevant planes for a question about which parts to armor. They represent, actually, a sample of planes that was not randomly selected. And the very criterion for this selection is directly related to the question: only planes that were hit in places that did not need armor could return home.
Make sure you understand that the hits were in fact approximately uniformly distributed over all the planes that were hit: it is not likely that the enemy managed to hit some particular place at which they aimed.
This question is concerned with the evolution of the average wage of the whole city. We know that in both parts of the city the average wage increased. However, we cannot deduce anything about the overall average wage of the city. This is because the question gives no information about the composition of the city, say, how many people there were on each side in the initial period and in the final period.
Hence, the composition is key and can drive the average in any direction, as illustrated by this extreme example. Suppose that the North had lots of people with high wages in the initial period. But, then, this number dramatically decreased over the 10 years. Despite the increase in the wages of the (remaining) Northerners, and even despite the increase in the wages of the Southerners, the whole city simply ends up with far fewer rich individuals. In turn, this new composition will drive the city's average down.
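A hypothetical numeric sketch of this composition effect (all wage figures and group sizes below are invented for illustration):

```r
# Initial period: 1000 high earners in the North, 1000 low earners in the South
north0 <- rep(100, 1000); south0 <- rep(10, 1000)
# Final period: wages rose in BOTH groups, but only 10 high earners remain
north1 <- rep(110, 10);   south1 <- rep(12, 1000)

mean(c(north0, south0))   # 55: initial city-wide average
mean(c(north1, south1))   # ~12.97: the average fell although both groups gained
```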
2
Statistical Statements
This topic is often considered advanced and is therefore neglected in traditional statistics courses. However, it arguably contains several key elements of data analysis that we will develop in further lectures.
Suppose you obtain the results, shown on Table 2.1, for an experiment on a drug
aiming at reducing the levels of inflammation. The drug is administered to three
subjects while three others (control) take a placebo.
TABLE 2.1: Inflammation levels in the two groups, the drug treated (D) and the
control (C) group.
         Drug              Control
D1   D2   D3     C1   C2   C3     X̄_D     X̄_C    Δ_o
18   21   22     30   25   20     20.33   25     -4.67
The drug does seem to have an effect. The difference in the means of inflammation is -4.67. But does this constitute enough evidence in favor of the drug, or could it also be obtained by chance? The answer to that question requires a proper test.
The crucial step in making a proper test is to understand the following: what would the results of the experiment look like if, indeed, they were due entirely to chance, as opposed to an effect of the drug?
A natural but unsatisfactory answer would suggest that the means of the two samples should be relatively close and, in the limit, even equal. This is unsatisfactory because, while it correctly implies that the two means will be affected by some random variation, it does not help determine how close is "close enough" to be considered "equal".
Here is a more fruitful view. If the differences in the means across samples are simply due to randomness, i.e., the drug has no effect, then both samples (drug and control) are nothing but two random samples from the same population.
In that case, the above difference, call it Δ_o = -4.67, is one of the possible differences between two random groups of three subjects.
Now, is Δ_o so large that we cannot believe that only chance was at play and, therefore, would rather accept the idea that the drug played a role? To answer this question, I suggest listing all the possible Δ's.
We first obtain all the possible ways of splitting the six observations into two groups of three. Then, for each of these splits, we calculate the group means and the difference between them, i.e., Δ_i. The results are given in Table 2.2.
TABLE 2.2: All combinations of the six observations into two groups.
As if Drug        As if Control
D1 D2 D3   C1 C2 C3   X̄_D   X̄_C   Δ_i
18 21 20 22 25 30 19.67 25.67 -6.00
18 22 20 21 25 30 20.00 25.33 -5.33
18 21 22 20 25 30 20.33 25.00 -4.67
18 25 20 21 22 30 21.00 24.33 -3.33
21 22 20 18 25 30 21.00 24.33 -3.33
18 21 25 20 22 30 21.33 24.00 -2.67
18 22 25 20 21 30 21.67 23.67 -2.00
21 25 20 18 22 30 22.00 23.33 -1.33
22 25 20 18 21 30 22.33 23.00 -0.67
18 30 20 21 22 25 22.67 22.67 0.00
21 22 25 18 20 30 22.67 22.67 0.00
18 21 30 20 22 25 23.00 22.33 0.67
18 22 30 20 21 25 23.33 22.00 1.33
21 30 20 18 22 25 23.67 21.67 2.00
22 30 20 18 21 25 24.00 21.33 2.67
18 30 25 20 21 22 24.33 21.00 3.33
21 22 30 18 20 25 24.33 21.00 3.33
30 25 20 18 21 22 25.00 20.33 4.67
21 30 25 18 20 22 25.33 20.00 5.33
22 30 25 18 20 21 25.67 19.67 6.00
So, was the above difference Δ_o = -4.67 extreme? We can look at how it fits in the overall distribution of the possible differences between two groups, as shown in Figure 2.1.
What percentage of values are smaller than or equal to the observed value? As many as 3 out of 20, i.e., 15%.
Hence, if the drug had no effect, we would have a 15% chance of observing such a value in a sample of three treated subjects versus three controls. That is a small probability, but usually too high to be conclusive.
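The 3-out-of-20 count can be reproduced with a few lines of base R; this is a sketch of the computation, distinct from the code given in Section 2.5:

```r
obs <- c(18, 21, 22, 30, 25, 20)   # the six observations, drug group first
idx <- combn(6, 3)                 # all 20 ways of picking the "as if drug" group
deltas <- apply(idx, 2, function(i) mean(obs[i]) - mean(obs[-i]))

observed <- mean(obs[1:3]) - mean(obs[4:6])  # the observed difference, -4.67
mean(deltas <= observed)           # 0.15: 3 of the 20 differences are as small
```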
Computing the exact distribution may not always be feasible, often because the number of combinations is too high. In that case, we resort to randomly drawing a large number of values from the permutation distribution. We explore that case in the example below.
[Figure 2.1 about here: the exact permutation distribution of Δ, with the observed value marked.]
We use the dataset called Beerwings, containing the consumption of beer and hot wings of 30 individuals, with information on their gender. Notice that there are 15 males and 15 females in the sample. Table 2.3 shows the raw data.
We are interested in evaluating whether male individuals have the same consumption of hotwings as female individuals. Notice that, in our sample, we do observe a difference, see Table 2.4.
All individuals, by gender:
Female  4 5 5 6 7 7 8 9 11 12 12 13 13 14 14
Male    7 8 8 11 13 13 14 16 16 17 17 18 18 21 21

X̄_M = 14.53, X̄_F = 9.33, Δ_o = 5.2
But we can only evaluate whether that difference is due to chance or not once we have the sampling distribution of that difference. Again, the starting point is assuming that the difference we observe is simply due to chance. Under that view, gender doesn't matter, i.e., we could take any combination of 15 individuals, take the mean of their consumption and compare it with the mean consumption of the remaining 15 individuals.
Notice, however, that it is unrealistic to try the approach above with the exact permutation distribution. This is because the number of permutations in that case is far too large: there are C(30,15) = 155,117,520 ways of choosing 15 individuals out of 30.
[Figure 2.2 about here: the subsetted permutation distribution (density) of Δ for the hotwings data, with the observed value marked.]

2.4 Unbalanced, Skewed Case
The present case features highly unbalanced groups as well as a highly skewed permutation distribution.
The data set is about average repair times in the USA. In a given area, by law, Verizon must attend to all clients, whether they are clients of Verizon or clients of other companies.
We are interested in evaluating whether the average repair times are indeed equal across the two kinds of clients. Again, we will use the method above, i.e., drawing a subset of all the relevant possible permutations of the repair times.
Notice again, at the outset, that there seems to be a difference in the average repair time between the groups, as shown in Table 2.5.
TABLE 2.5: Group averages in repair times and difference between groups of Verizon customers (V) and customers of other companies (C).
X̄_V = 8.41, X̄_C = 16.51, Δ_o = -8.1
A word about these relevant permutations. The data set contains 1687 observations, but the distribution across groups is very unbalanced: there are 1664 observations for Verizon clients and only 23 observations for clients of other companies. In order to know whether the average repair times are the same across groups, we should compare the difference between any group of 23 observations and the group of the remaining 1664 observations.
Again, we cannot rely on all the possible permutations. There are 4.052292e+51 ways of choosing 23 observations out of 1687. That is way too many to compute! Therefore, we again subset a large number (99'999) of these permutations and proceed as if it were the actual permutation distribution.
So, was the observed difference Δ_o = -8.1 extreme? We can look at how it fits in the overall distribution of the possible differences between two groups, as shown in Figure 2.3.
Again, the observed value seems too extreme. Indeed, only 1.828% of the values are as small as or smaller than the value we observe. This percentage is so small that we do not believe that the observed difference is due to chance alone. The data indicate that there is a difference in repair times.
2.5 R Code
This section provides the R code used to obtain the results of this chapter, though not the code used to display them in tables and graphs. For clarity, it is separated by section/task.
[Figure 2.3 about here: the subsetted permutation distribution (density) of Δ for the repair-time data, with the observed value marked.]
library(gtools) # combinations()
library(magrittr) # %>%
library(tibble) # as_tibble()
library(dplyr) # rename(), mutate(), ...
library(resampledata) # Beerwings and Verizon data

## Section 2.2: exact permutation distribution
## (setup reconstructed: the lines building the data are missing in the source)
obs <- c(18, 21, 22, 30, 25, 20) # the six observations, drug group first
cmb <- combinations(6, 3) # indices of the "as if drug" group
perm <- t(apply(cmb, 1, function(i) c(obs[i], obs[-i])))
colnames(perm) <- c("d1", "d2", "d3", "c1", "c2", "c3")
perm <- as_tibble(perm) %>%
  rowwise() %>%
  mutate(md = round(mean(c(d1, d2, d3)), 2),
         mc = round(mean(c(c1, c2, c3)), 2),
         delta = round(mean(c(d1, d2, d3)) - mean(c(c1, c2, c3)), 2))
data("Beerwings")
nmen <- nrow(Beerwings[Beerwings$Gender=="M",])
n <- nrow(Beerwings)
observed <- round(mean(Beerwings[Beerwings$Gender=="M", "Hotwings"]) -
  mean(Beerwings[Beerwings$Gender=="F", "Hotwings"]),2)
hw <- Beerwings$Hotwings
n.s <- 99999 # number of subsetted permutations (value assumed)
delta <- numeric(n.s) # storage for the simulated differences
for (i in 1:n.s){
  index <- sample(1:length(hw), nmen, replace = FALSE)
  delta[i] <- mean(hw[index]) - mean(hw[-index])
}
data("Verizon") # Section 2.4 setup (reconstructed)
time <- Verizon$Time
n.ilec <- sum(Verizon$Type == "ILEC") # the 1664 Verizon (ILEC) clients
n.s <- 99999; delta <- numeric(n.s)
for (i in 1:n.s){
  index <- sample(1:length(time), n.ilec, replace = FALSE)
  delta[i] <- mean(time[index]) - mean(time[-index])
}
2.6 Exercises
Exercise 2.1. Follow the argument and slightly modify the R code in Section 2.4
to answer the following question.
Is there a significant difference in the median repair times between the Verizon
clients and the clients of other companies served by Verizon?
3
Paul the Octopus and 𝑝 < 0.05
Paul the Octopus was a common octopus living at the Sea Life Centre in
Oberhausen, Germany. It became famous worldwide, and even received death
threats, after managing to predict the outcomes of international football matches,
mainly involving the German team, at both the Euro 2008 and the 2010 FIFA World Cup. Overall, Paul correctly predicted 12 results out of the 14 matches it gave an opinion on.

[Figure 3.1 about here: probabilities of the Binomial(14, 0.5) distribution over the possible numbers of correct predictions (0 to 14), with the observed value of 12 marked.]

From a statistical point of view, how ought we to appreciate this remarkable feat of the octopus? In particular, how does it fit with the insights of Chapter 2?
A rather natural approach inspired by Chapter 2 consists in asking whether luck alone could explain this level of accuracy. To that end, we rely on the binomial distribution of the random variable X counting the number of successes in 14 trials, noted X ∼ B(n = 14, p = 0.5). Figure 3.1 shows the distribution over all the possible values that X could take.
It seems that if chance alone were to explain the observation, there would be only a 0.5554% probability of achieving this result, way below the usual 5% threshold for statistical significance. Hence, we are led to reject the role of luck alone and embrace the possibility of a truly psychic animal. Given the increasingly glowing picture of the octopus and its abilities, this alternative explanation may even seem solid ground. If you haven't seen it yet, I strongly recommend the documentary "My Octopus Teacher".[1]
[1] https://www.netflix.com/title/81045007
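The 0.5554% figure is simply the binomial probability of exactly 12 successes in 14 fair trials, which can be checked directly in R:

```r
dbinom(12, size = 14, prob = 0.5)  # 91/2^14 ≈ 0.005554, i.e., 0.5554%
sum(dbinom(12:14, 14, 0.5))        # probability of doing at least as well: ~0.0065
```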
Notice that the competition in the domain of animal oracles is rather fierce. Rabio the octopus had a perfect score for the group-stage games of Japan in the 2018 FIFA World Cup, but was chopped up for a meal before having a chance to fully explore its talent in the remaining games.
But octopuses are not the only animals with clairvoyant powers. At the time of the 2010 FIFA World Cup final, Paul fought for its place against Mani the parakeet, which itself correctly called all the results of the quarter-final games.
Sure enough, not all animals are so skilled. You had better not put your money on Flopsy the kangaroo, who tends to be biased towards Australia, Leon the porcupine, Petty the pygmy hippopotamus or Anton the tamarin.
But the contestants for replacing Paul are still numerous in the race, including Shaheen the camel, Madame Shiva the guinea pig, and Nelly the elephant. What is more, the BBC reported[2] that a full colony of penguins at the National Sea Life Centre in Birmingham is entering the competition, along with Big Head the loggerhead turtle, Alistair and Derek the miniature donkeys, and Sarge and Oscar the macaws.
3.2 p-Hacking
Consider the news reported in Figure 3.2. It clearly indicates that the result meets the usual standard of p < 5%, suggesting that the result is not due to chance alone. Hence, we are left to believe that green jelly beans are linked to acne. Teenagers, watch out!
But should we really believe that result? As it turns out, Figure 3.2 is only one
part of a larger humorous cartoon, given in Figure 3.3. And the fuller picture
gives the key to understand the origin of the result.
There seems initially to be no evidence linking jelly beans and acne. However,
if researchers multiply the experiments, say by changing the color of the jelly
beans, then, by the very nature of randomness, there will be one
experiment whose result lands sufficiently far from the true value 0, i.e., no
effect of jelly beans.
Unfortunately, this “significant” result, i.e., the case yielding a 𝑝 < 0.05, is the
one to be submitted to research journals or even to general-audience publica-
tions. To take it at face value is a great error indeed. Only by thoroughly exam-
ining the process through which the result emerged can we avoid it.
3.3 Efficient Markets Hypothesis
— Ioannidis (2005)
Suppose you are the only one to know with certainty something about the evo-
lution of a stock. Then you can make big money by buying/selling in the stock
market. By doing so, you would slightly affect the price of the stock
in that market.
As it turns out, the assumption that only you know, and nobody else does, is
too strong an assumption in finance. Instead, another assumption is taken as a
valid description, namely the Efficient Markets Hypothesis (EMH). Under that
assumption/theory, asset prices reflect all the information available to the actors in
the market. As a consequence, there is no way of making more money than the
market average.
Enter Bill Miller4 . This fund manager was referred to as “the greatest money
manager of our time”5 by CNN Money, among the several distinctions received
3 https://youtu.be/0Rnq1NpHdmw?t=195
4 https://en.wikipedia.org/wiki/Bill_Miller_(investor)
5 https://money.cnn.com/magazines/fortune/fortune_archive/2006/11/27/8394343/index.htm
in the financial media industry. The reason? Bill Miller managed to beat the mar-
ket 15 calendar years in a row, 1991 through 2005!
The man’s performance was seen as nothing short of genius! Why? Because of
the extremely low probability of observing such a performance by chance alone.
Mauboussin and Bartholdson (2003), see here,6 offered some estimates of this
probability when Miller was only in the 12th year of his feat. I adapted the numbers to
the 15-year streak.
• If, every year, beating the market was a 50%-50% game, similar to a flip of a
coin, then Miller’s correct predictions had a probability of 1 in 32’768 (i.e.,
2¹⁵).
• If you consider that less than 50% of the funds beat the market every year, say
around 44%, then the probability is 1 in 222’951.
• Some commentators (see Mauboussin and Bartholdson (2003)) infer that beat-
ing the market once is as likely as rolling a seven (when throwing two dice). If
that is true, then Miller’s roll of dice had 1 in 470’184’984’576 chances of hap-
pening.
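These odds can be checked with a couple of lines of R:

```r
# Odds of a 15-year streak under the three assumptions above
p_coin <- (1/2)^15   # fair coin: 1 in 32768
p_fund <- 0.44^15    # 44% of funds beat the market each year: about 1 in 222951
p_dice <- (1/6)^15   # rolling a seven with two dice (6/36 = 1/6 per year)
round(1 / c(p_coin, p_fund, p_dice))
```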
Now, is this all that impressive? Following the argument above, we can offer
a different perspective on the God-like abilities of Bill Miller. As it happens,
Mlodinow (2009) estimates that there were 3 chances out of 4 of observing such a
streak. How would we explain such a probability?
Consider again the first case above, where beating the market once is similar to
a flip of a coin, i.e., a 50%-50% proposition. The above estimate of the probability
is correct under a very narrow view. If, at the beginning of 1991, you had talked
to Miller and evaluated his probability of beating the market in the next 15 years,
then, yes, Miller achieving it would be impressive.
But a larger view needs to take into account that there are many firms active in
the market. As a very low estimate, say 1000. Also, these firms have been active
for a long time, say 40 years. Now, Miller’s performance takes on another color.
Miller is simply the one fund manager, out of the 1000, that managed to predict
15 throws of a coin in a row out of the 40 trials.
How likely is it that some manager beats the market for some 15-year period? 75%
according to Mlodinow (2009). Bill Miller was the lucky guy.
6 http://docplayer.net/52594811-On-streaks-perception-probability-and-skill-finding-the-hot-shot.html
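Mlodinow’s reasoning can be sketched by simulation. The sketch below takes the text’s rough numbers at face value (1000 managers, 40-year careers, a fair coin each year); the exact modelling behind the 75% figure may differ, so the simulated number need not match it.

```r
set.seed(1)  # arbitrary seed for reproducibility

# longest run of 1's in a 0/1 vector of yearly wins
longest_win_run <- function(wins) {
  r <- rle(wins)
  if (!any(r$values == 1)) return(0)
  max(r$lengths[r$values == 1])
}

# probability that one manager has a 15-year streak somewhere in a 40-year career
q <- mean(replicate(50000, longest_win_run(rbinom(40, 1, 0.5)) >= 15))

# probability that at least one of 1000 independent managers does
1 - (1 - q)^1000
```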
3.4 Rigorous Uncertainty and Moral Certainty
A couple of words to recall the arbitrariness of the usual confidence level, 95%,
that allows a result to have statistical significance.
First, we can credit one of the most important statisticians of the 20th century,
Ronald Fisher, for canonizing the 5% level (Stigler (2008)). Fisher suggested
the p-value as a measure of “rigorously specified uncertainty”. Hence, the 5% limit is
justified as follows:
The value for which P = .05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in
judging whether a deviation is to be considered significant or not. Deviations exceeding twice the
standard deviation are thus formally regarded as significant. Using this criterion we should be led to
follow up a negative result only once in 22 trials, even if the statistics are the only guide available. Small
effects would still escape notice if the data were insufficiently numerous to bring them out, but no
lowering of the standard of significance would meet this difficulty.
— Fisher (1925)
To be sure, Fisher did not impose this bar. It seems however that his suggestion
was shared and found useful enough to serve as a satisfactory compromise.
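Fisher’s numbers are easy to recover in R:

```r
qnorm(0.975)       # 1.959964: the value "for which P = .05", i.e. "1.96 or nearly 2"
2 * pnorm(-1.96)   # two-sided tail area beyond 1.96 standard deviations
```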
The necessity of agreeing on a workable level can be better appreciated when
looking at the limit used by the founding father of probability theory, Jakob
Bernoulli (Bernoulli (1713), see Figure 3.4).
Jakob7 Bernoulli would not settle for a confidence level lower than 99.9%, which
7 In the Bernoulli family, you must specify the first name.
he associates with moral certainty. Notice that this corresponds to a limit for the
p-value of 0.001 or, equivalently, to allowing an error less than 1 time in 1000!
Such a high bar was depressing even for Jakob Bernoulli. In Bernoulli (1713), he
calculates that, to obtain moral certainty about a relevant proportion in the
population of Basel, he would have to sample… more than the entire population
of Basel of that time!
3.5 Exercises
Exercise 3.1. Why are there 20 similar panels in the middle of Figure 3.3, as op-
posed to 25 or another number? Explain.
Exercise 3.3. Have a look at Mauboussin and Bartholdson (2003), that you can
find online here8 .
Reproduce the results of the last line of Exhibit 1, i.e., for # of years = 15 (p.3). In
other words, show how they were obtained.
Exercise 3.4. Again in Mauboussin and Bartholdson (2003), p.3. The authors use
Exhibit 2 for placing the probability of beating the market at 1 in 477’000.
Reproduce that number. Notice that the authors use the expression “about”
when referring to their result. My own estimate, given Exhibit 2, is 1 in 475’186.
The discrepancy may be due to the effect of rounding.
Exercise 3.5. Consider the R code below. Get inspiration from it in order to write
the R code to answer Exercise 3.4.
aa <- c(7, 6, 2)
prod(aa)
## [1] 84
8 http://docplayer.net/52594811-On-streaks-perception-probability-and-skill-finding-the-hot-shot.html
Exercise 3.6. In the quote from Fisher (1925), see Section 3.4, we read “P = .05,
or 1 in 20” and, further, “using this criterion we should be led to follow up a
negative result only once in 22 trials”.
Explain this apparent contradiction, i.e., 20 vs 22 trials.
Statistical Inference
4 A Blueprint for Inference
4.1 Introduction
and the hypothesis made is true, what is the probability of observing a value as
“extreme” as the one we calculated in the sample. Here, the term “extreme” is
ill-defined. The practice in statistical analysis has dictated a set of values, noted
𝛼, but one in particular, beyond which the probability of observing a value as ex-
treme as the one given in the sample is deemed too small to be compatible with
the hypothesis, given the assumptions. In that case, the hypothesis is rejected in
a statistical sense. This can be a correct decision or an error on the part of the
researcher.
4.2 Assumptions
The assumptions on the underlying data are essential in any analysis. The valid-
ity of the results depends crucially on them. It is therefore of utmost importance
that the researcher verifies that they are likely to apply.
Not all assumptions are equally reasonable and likely to be satisfied. We generally place them in a range from mild/weak to strong.
Examples of mild assumptions include the independence of the observations
or their identical distribution. The actual distribution can sometimes also be assumed when it does not overly affect the results.
Assumptions tend to be seen as strong the more structure they impose on the
data, e.g., a full model of relationships between variables.
The null hypothesis, denoted 𝐻0 , is assumed to be true unless the data provide
convincing evidence that it is false. It usually represents the “status quo” or some
statement about the population parameter that the researcher wants to test.
The alternative (research) hypothesis, denoted 𝐻𝑎 , represents the values of a
population parameter for which the researcher wants to gather evidence to sup-
port.
The following are examples of hypotheses.
a. Is the mean weight of a certain candy bar different from the desired 40
grams? 𝐻0 : 𝜇 = 40 vs. 𝐻𝑎 : 𝜇 ≠ 40.
b. Do men and women have different starting salaries after graduating
university? 𝐻0 : 𝜇𝑀 = 𝜇𝐹 vs. 𝐻𝑎 : 𝜇𝑀 ≠ 𝜇𝐹 .
c. Do three different production processes all have the same variance? 𝐻0 :
𝜎1 = 𝜎2 = 𝜎3 vs. 𝐻𝑎 : They are not all equal.
4.4 Estimator
The critical region is the portion of the sampling distribution that contains all the
values that allow you to reject a null hypothesis. For that reason, we refer to the
critical region as the region of rejection as it represents the set of possible values
of the test statistic for which the researcher will reject 𝐻0 in favor of 𝐻𝑎 .
The critical value is the point that marks the beginning of the critical region.
4.7 Deciding on an Hypothesis
1. Select the null hypothesis as the status quo, that which will be presumed
true unless the sampling experiment conclusively establishes the alter-
native hypothesis. The null hypothesis will be specified as that parame-
ter value closest to the alternative in one-tailed tests and as the complementary (or only unspecified) value in two-tailed tests, e.g., 𝐻0 ∶ 𝜇 = 𝜇0 .
2. Select the alternative hypothesis as that which the sampling experiment
is intended to establish. The alternative hypothesis will assume one of
three forms:
a. One-tailed, lower-tailed, e.g., 𝐻𝑎 ∶ 𝜇 < 𝜇0 ,
b. One-tailed, upper-tailed, e.g., 𝐻𝑎 ∶ 𝜇 > 𝜇0 ,
c. Two-tailed, e.g., 𝐻𝑎 ∶ 𝜇 ≠ 𝜇0 .
[Figure: rejection regions on the 𝑧 axis. For a lower-tailed test (𝐻𝑎 ∶ 𝜇 < 𝜇0 ), reject 𝐻0 below −𝑧𝛼 ; for an upper-tailed test (𝐻𝑎 ∶ 𝜇 > 𝜇0 ), reject 𝐻0 above 𝑧𝛼 ; for a two-tailed test (𝐻𝑎 ∶ 𝜇 ≠ 𝜇0 ), reject 𝐻0 beyond ±𝑧𝛼/2 .]
In particular, the marketer wants to collect data to show that fewer than 80% of
cigarette consumers fail to see the warning, i.e., 𝑝 < 0.80.
Consequently, 𝑝 < 0.80 represents the alternative hypothesis and 𝑝 = 0.80 (the
claim made by the FTC) represents the null hypothesis. That is, the marketer
desires the one-tailed (lower-tailed) test:
The observed significance level, or 𝑝-value, for a specific statistical test is the
probability (assuming 𝐻0 is true) of observing a value of the test statistic at least
as extreme as the one computed from the sample data.
Sections 4.7.1 and 4.7.3 describe two approaches that allow the researcher to
make a decision on 𝐻0 . These two approaches are equivalent. In other words,
they always give the same decision.
A small caveat must be noted, however:
• If the test is one-tailed, the 𝑝-value is equal to the tail area beyond 𝑧 in the same
direction as the alternative hypothesis.
• If the test is two-tailed, the 𝑝-value is equal to twice the tail area beyond the
observed 𝑧 -value in the direction of the sign of 𝑧 .
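As a small sketch of this caveat (the observed 𝑧-value here is an arbitrary illustration, not a number from the text):

```r
z <- 2.1             # hypothetical observed z-value
pnorm(-abs(z))       # one-tailed p-value: tail area beyond z
2 * pnorm(-abs(z))   # two-tailed p-value: twice that tail area
```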
4.8 Types of Error

What if we are wrong? When we do an hypothesis test there are two possibilities:
However, it is possible that a different sample would have yielded different re-
sults. When conducting hypothesis tests, we can make two kinds of mistakes:
• Type I error: False positive. You could read “positive” as “yes, existence of a
sign against 𝐻0 ”. A false positive would then mean “a misleading sign against
𝐻0 ”.
FIGURE 4.2: Types of error for case ’𝐻0 : the person is not pregnant’.
• Type II error: False negative. You could read “negative” as “no, nothing against
𝐻0 ”. A false negative would then mean “a misleading absence of sign against
𝐻0 ”.
Even if the null hypothesis is true, we may still get a test statistic that is extreme
just due to chance. In this situation, we would incorrectly reject the null hypoth-
esis.
A Type I error occurs if the researcher rejects the null hypothesis in favor of the
alternative hypothesis when, in fact, 𝐻0 is true. The probability of committing a
Type I error is denoted by 𝛼.
Unfortunately, we never know whether we’ve made a Type I error but we know
that the probability of making a Type I error is equal to the level of significance
(𝛼). A level of significance of 0.05 is simply a statement that you’re willing to
tolerate a 5% chance of making a Type I error.
Even if the null hypothesis is false, it is still possible, only by chance, to get a test
statistic that is not extreme compared to the value given by 𝐻0 . In this situation,
we would incorrectly fail to reject the null hypothesis.
A Type II error occurs if the researcher fails to reject the null hypothesis when, in
fact, 𝐻0 is false. The probability of committing a Type II error is denoted by 𝛽 .
                              States of Nature
Decision on 𝐻0       𝐻0 is true                          𝐻0 is false
Fail to reject 𝐻0    Correct decision (prob. 1 − 𝛼)      Type II error (prob. 𝛽)
Reject 𝐻0            Type I error (prob. 𝛼)              Correct decision (prob. 1 − 𝛽)
4.9 Exercises
Exercise 4.1. In a US court (as much as in other countries’ courts), the defendant
is either innocent (𝐻0 ), or guilty (𝐻𝑎 ).
How could we reduce the rate of errors of type I in US courts? What would that
mean/imply in real terms, i.e., in terms of the decisions of the judges?
What influence would that reduction in Type I errors have on the rate of errors
of type II? Again, explain in concrete terms, not general formulas.
5
Theoretical Sampling Distributions
5.1 Introduction
Before the advent of massive computational power, exercises such as the one in
Section 2.4 were not available to statisticians. How did they manage to have an
idea about the sampling distribution of a statistic?
What they couldn’t do brute-force, they did through heroic theoretical break-
throughs. These were impressive and very useful results achieved around one
century ago by the likes of Jerzy Neyman1 , Egon Pearson2 or their “enemy”
Ronald Fisher3 .
We can, however, point at two limitations. First, many of these results rely on
approximations that can be poor in small samples, and there is often no guide on the size of the
error. This concern is typically discarded by suggesting that the sample size is large enough.
And we don’t have much choice.
The second limitation is becoming increasingly stringent with the furious development of data science (see Efron and Hastie (2016)). The production of theoretical
results is simply not keeping pace. Arguably, it has become more and more
difficult to derive analytical results as the ground for statistical inference with all
1 https://en.wikipedia.org/wiki/Jerzy_Neyman
2 https://en.wikipedia.org/wiki/Egon_Pearson
3 https://en.wikipedia.org/wiki/Ronald_Fisher
the new estimators and techniques out there. So the practitioners didn’t wait for
them and went on with other methods for validating their claims. We have actually
arrived at a point where one wonders whether statistics is useful for data science
or not.4
In this chapter we offer a few examples of the theoretical results.
The central limit theorem shows that the mean of a random sample of size 𝑛,
drawn from a population with any probability distribution, will be approximately normally distributed with mean 𝜇 and variance 𝜎²/𝑛, given a large-enough
sample size.
In applied statistics the probability distribution for the population being sam-
pled is often not known, and there is no way to be certain that the underlying
distribution is normal.
The CLT allows us to use the normal distribution to compute probabilities for
sample means obtained from many different populations. Combined with the
law of large numbers it provides the basis for statistical inference.
Formally, as 𝑛 grows, the statistic

𝑍 = (𝑋̄ − 𝜇)/𝜎𝑋̄

approaches the standard normal distribution.
In other words, if repeated random samples of size 𝑛 are taken from a popula-
tion with mean 𝜇 and standard deviation 𝜎, the sampling distribution of sample
4 My view is that it is. If anything, because of the structure it imposes on an analysis, making it all the more reliable.
FIGURE 5.1: Illustration of the Central Limit Theorem: distribution of the means
of samples from uniform distributions for different sample sizes, sampled 1000
times.
means will have mean 𝜇 and standard error 𝜎𝑋̄ = 𝜎/√𝑛. And, as 𝑛 increases, the
sampling distribution will approach a normal distribution.
The central limit theorem can be applied to both discrete and continuous random
variables.
5.2.1 Illustration
Figures 5.1 and 5.2 provide an illustration of the Central Limit Theorem at work.
FIGURE 5.2: Illustration of the Central Limit Theorem: distribution of the means
of samples from Poisson distributions (𝜆 = 10 ) for different sample sizes, sam-
pled 1000 times.
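Figures like these can be reproduced in a few lines of base R (a minimal sketch; the Uniform(0, 10) population, sample size, and seed are arbitrary choices):

```r
set.seed(42)
n <- 50
means <- replicate(1000, mean(runif(n, min = 0, max = 10)))  # 1000 sample means

# the CLT predicts mean 5 and standard error (10 / sqrt(12)) / sqrt(n)
c(mean(means), sd(means))
hist(means)   # roughly bell-shaped around 5
```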
Definition 5.1 (Sample proportion). The sample proportion is simply the sum of
the success cases in our sample divided by the total number of elements in our
sample.
𝑝̂ = (1/𝑛) ∑ᵢ₌₁ⁿ 𝑋ᵢ
where each 𝑋𝑖 is an independent Bernoulli random variable with probability of
success 𝑝, i.e., 𝑋 ∼ 𝑏(1, 𝑝).
5.3 Sampling Distribution of the Sample Proportion
Proposition 5.1 (Expected value and variance of a sample proportion). The expected value of the sample proportion is

𝐸[𝑝̂] = 𝑝

and its variance is

𝑉 𝑎𝑟(𝑝̂) = 𝑝(1 − 𝑝)/𝑛
Proposition 5.2 (Sampling distribution of the sample proportion). The distribu-
tion of the sample proportion is approximately normal for large sample sizes (𝑛𝑝(1 −
𝑝) > 5).
𝑝̂ ∼̇ 𝑁 (𝑝, 𝑝(1 − 𝑝)/𝑛)

Thus,

𝑍 = (𝑝̂ − 𝑝)/√(𝑝(1 − 𝑝)/𝑛) ∼̇ 𝑁 (0, 1)
Example 5.1. Assume that 60% of all city voters are in favor of a particular can-
didate.
Question
In a random sample of 100 voters, what is the probability that fewer than half
are in favor of this candidate?
Answer
Since 𝑛 is large, we know that 𝑝̂ is normally distributed with mean 𝑝 = 0.6 and
standard error √(𝑝(1 − 𝑝)/𝑛) = 0.049.
𝑃 (𝑝̂ < 0.5) = 𝑃 (𝑧 < (0.5 − 0.6)/0.049) = 𝑃 (𝑧 < −2.04) = 0.021
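The same number can be obtained directly in R:

```r
p <- 0.6; n <- 100
se <- sqrt(p * (1 - p) / n)     # standard error: about 0.049
pnorm(0.5, mean = p, sd = se)   # P(p.hat < 0.5), about 0.021
```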
𝜎2 = 𝐸[(𝑋 − 𝜇)2 ]
Proposition 5.3 ((Adjusted) sample variance). The expected value of the adjusted
sample variance is the population variance.
𝐸[𝑆 ′2 ] = 𝜎2
Example 5.3. Suppose the weights of bags of flour are normally distributed with
a population standard deviation of 𝜎 = 1.2 ounces.
Question
Find the probability that a sample of 200 bags would have a standard deviation
between 1.1 and 1.3 ounces.
Answer
We evaluate the random variable (𝑛 − 1)𝑆²/𝜎² at the endpoints of the interval in question:
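The text breaks off before the final number; a sketch of the computation in R, using pchisq() for the chi-square cdf:

```r
n <- 200; sigma <- 1.2
lo <- (n - 1) * 1.1^2 / sigma^2   # lower endpoint, about 167.2
hi <- (n - 1) * 1.3^2 / sigma^2   # upper endpoint, about 233.5
pchisq(hi, df = n - 1) - pchisq(lo, df = n - 1)   # P(1.1 < S < 1.3)
```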
FIGURE 5.3: Standard normal (left) and Chi-square with one degree of freedom
(right).
FIGURE 5.4: Chi-square distributions for various degrees of freedom, 𝑟, pdf (left)
and cdf (right).
Here is how to use the Chi-square distribution table, see Figure 5.5.
6 Inference on Sample Proportions
6.1 Definitions
Recall that categorical variables are variables that can take a limited number of
values.1 In most cases, they are even dichotomous: yes/no, true/false, up/down,
correct/incorrect, red/not red, etc…
To fix ideas, here are a few examples:
Definition 6.1 (Bernoulli trial). A Bernoulli trial (or variable) is a random exper-
iment that can have only two possible, mutually exclusive outcomes: “success”
or “failure”. “Success” is noted with value 1 while “failure” is noted with the
value 0.
Notice that the term “success” is potentially misleading. It does not imply a victory
or achievement. It simply means “the condition is satisfied”. For instance, in a
Bernoulli trial where the outcomes are “arrived late (1)/didn’t arrive late (0)”,
the “success” is the case of arriving late, which simply means “the condition of
arriving late was met in the observation”.
1 These values are called by different names in different sources or contexts: categories, levels in R, etc.
𝑥 𝑃 (𝑋 = 𝑥)
0 1−𝑝
1 𝑝
This chapter takes a somewhat simple, yet very common and useful perspective.
What can we say about the number of observations in each (of the two) cate-
gories?
A subtle yet crucial step is to regard each observation as a Bernoulli trial with
probability 𝑝 of taking the value 1 (success) and probability 1 − 𝑝 of taking the
value 0 (failure). The actual 𝑝 is the true parameter of the population, and is generally unknown. We can write,
𝑋𝑖 ∼ 𝑏(𝑝), ∀𝑖
What we do observe is the proportion in a given sample, noted 𝑝̂. This sample
proportion is calculated as follows.
𝑝̂ = (1/𝑛) ∑ᵢ₌₁ⁿ 𝑋ᵢ
This last remark is of utmost importance. Indeed, it allows us to use the Central
Limit Theorem, provided that further conditions are met.
6.1.4 Example
Let’s work with an example. The dataset rosling_responses from the package
openintro has observations on adults with a 4-year college degree responding to
the following question:
How many of the world’s 1-year-old children today have been vaccinated against some disease?
a. 20%
b. 50%
c. 80%
The sample data could contain a series with the choice given by each respondent,
e.g.: a, b, a, b, b, a, a, c, a, a… However, the focus of interest will be on a more
relevant question, namely who’s got the answer right or wrong, and what are
the proportions of each group.
library(openintro)
data("rosling_responses")
rosling_responses %>%
filter(question ==
"children_with_1_or_more_vaccination") %>%
pull(response)
## [1] "correct" "correct" "incorrect" "incorrect" "incorrect" "incorrect"
## [7] "incorrect" "incorrect" "correct" "correct" "incorrect" "incorrect"
## [13] "incorrect" "incorrect" "incorrect" "incorrect" "incorrect" "incorrect"
## [19] "correct" "incorrect" "incorrect" "incorrect" "incorrect" "incorrect"
## [25] "incorrect" "incorrect" "incorrect" "correct" "correct" "incorrect"
## [31] "incorrect" "incorrect" "incorrect" "incorrect" "incorrect" "correct"
Again, notice that each observation is either “correct” or “incorrect”. The former could be branded “success” and the latter “failure”. The probability of success is the unknown 𝑝.
We still need to transform these values into numeric values in order to obtain an
actual number for the proportion. Here is a way using the function case_when()
from the dplyr package.
Notice that I coded “correct” as TRUE. I could just as well have coded it as “1”. Notice
that, in most coding languages, these are equivalent because TRUE is coerced to 1
and FALSE to 0.
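The coercion point can be checked on a toy vector (the vector here is a made-up stand-in, not the Rosling data):

```r
resp <- c("correct", "incorrect", "correct")  # toy stand-in data
hits <- resp == "correct"                     # logical vector: TRUE, FALSE, TRUE
sum(hits)                                     # TRUE coerced to 1, so this is 2
mean(hits)                                    # the sample proportion, 2/3
```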
Finally, we can now calculate the sample proportion. In the data set, only the
first 50 entries are about this question.
We now call the object that we want to see by writing its name.
p.hat1
## [1] 0.24
We assume that the observations are independent. They wouldn’t be if, for instance, the individuals had first discussed the question together. The random
sampling that was certainly applied by Rosling and colleagues makes us safe on
this ground.
The choice of the null hypothesis 𝐻0 is always context specific. In our case, one
possibility would be to test whether at least half of the individuals guessed correctly.
As it turns out, besides being arbitrary, this is also too optimistic a hypothesis. Instead,
we will ask how these adults with a 4-year degree perform against the mark
of random guessing, i.e., 33.3%. But, starting here, we will adopt a more colorful
comparison:
You’ve probably heard that one before. It’s famous–in some circles infamous. It has popped up in the
New York Times, the Wall Street Journal, the Financial Times, the Economist and other outlets
around the world. It goes like this: A researcher gathered a big group of experts–academics, pundits, and
the like–to make thousands of predictions about the economy, stocks, elections, wars, and other issues of
the day. Time passed, and when the researcher checked the accuracy of the predictions, he found that the
average expert did about as well as random guessing. Except that’s not the punch line because “random
guessing” isn’t funny. The punch line is about a dart-throwing chimpanzee. Because chimpanzees are
funny.
Hence, the hypothesis tested will be whether the respondents are performing
better or worse than dart-throwing chimps. Formally, we would write: 𝐻0 ∶ 𝑝0 = 1/3.
As alluded to above, we will use the sample proportion, 𝑝̂, a sort of mean calcu-
lated on the sample data.
Alternatively, we can use the standardized value of 𝑝̂ as a sample statistic under
the null, i.e.,
𝑍 = (𝑝̂ − 𝑝0 )/𝜎𝑝0 .
This is the core part of the test. What are the possible values and their associated
probabilities that could be observed in such a sample if the assumptions and
𝐻0 hold?
At this point we must point towards a theoretical result based on the Central
Limit Theorem, see Proposition 5.2.
The distribution of the sample proportion is approximately normal for large sam-
ple sizes (𝑛𝑝(1 − 𝑝) > 5).
𝑝̂ ∼̇ 𝑁 (𝑝0 , 𝑝0 (1 − 𝑝0 )/𝑛).

𝑍 = (𝑝̂ − 𝑝0 )/√(𝑝0 (1 − 𝑝0 )/𝑛) ∼̇ 𝑁 (0, 1).
The usual level of significance may be applied here without raising concerns.
Notice carefully that I wrote above that we wanted to know whether the respondents would do better or worse than the chimps. This, in turn, calls for a
bilateral test, with 𝐻𝑎 ∶ 𝑝0 ≠ 1/3.
Another alternative could also be used, namely that the humans do better than
the chimps. We would then carry out a unilateral test with a rejection region of probability 𝛼 all the way to the right of the sampling distribution.
𝑝̂ ∼̇ 𝑁 (1/3, (1/3)(1 − 1/3)/50).

Using 𝜎𝑝 = √((1/3)(1 − 1/3)/50) = 0.0667, we can compute the critical values for 𝛼 =
0.05, i.e., the points that are 1.96 standard deviations from the mean,

1/3 ± 1.96 × 0.0667 = [0.2026; 0.4641].
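These critical values are easy to check in R:

```r
p0 <- 1/3; n <- 50
se <- sqrt(p0 * (1 - p0) / n)       # about 0.0667
p0 + c(-1, 1) * qnorm(0.975) * se   # about [0.2027, 0.4640]
```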
Since the observed sample proportion does not fall in the rejection region, recall
𝑝̂ = 0.24, we do not reject 𝐻0 . Figure 6.1 represents the basis for the decision.
6.2.8 P-Value
Instead of the reject regions, we can calculate the p-value for the observed sample
proportion. With the values above for the mean and the standard deviation of the
sampling distribution, we obtain,
FIGURE 6.1: Rejection regions for the sample proportion of our example.
left.of.po
## [1] 0.081
This means that, if the assumptions and 𝐻0 hold, then there is a probability of
0.081 of observing a proportion in the sample that is equal or smaller than 0.24.
That is too high to reject 𝐻0 . Figure 6.2 illustrates this calculation.
Notice a subtle point: 0.081 is not the p-value. Since we are conducting a bilateral test, our decision would be to reject 𝐻0 if the area on the left of the
observed sample proportion were smaller than 𝛼/2, or, equivalently, if twice that area
were smaller than 𝛼. The p-value is the probability to be compared with 𝛼. Hence, in
this case, the p-value is 2 × 0.081 = 0.162.
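The code producing left.of.po is not shown in this excerpt; here is a sketch consistent with the numbers above:

```r
p0 <- 1/3
se <- sqrt(p0 * (1 - p0) / 50)
left.of.po <- pnorm(0.24, mean = p0, sd = se)   # about 0.081
p.value <- 2 * left.of.po                       # about 0.162
c(left.of.po, p.value)
```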
R has a dedicated function for this type of hypothesis testing, prop.test(). However, it is based on a different estimator and, therefore, on a different sampling
distribution.
An element to bear in mind is that it uses the absolute number of successes, x, and
the number of trials, n.
prop.test(x = p.hat1 * 50, n = 50, p = 1/3,
          correct = FALSE)
𝐻0 ∶ 𝑝1 − 𝑝2 = 0.
This section presents a formal test to answer that question (and an R implementation). The discussion will also be shorter, because I assume that many of the
elements above are mastered and need not be repeated.
As before, we assume that the observations are independent within each group.
We now require that they are also independent between the two groups. They
2
Of course, the word “say” implies a statement in the language of statistics.
would not be independent between groups if, for instance, the same individuals
were included in both groups.
As before, we will also assume and verify that the samples are sufficiently large.
The resulting sampling distribution in this context is again derived from the Cen-
tral Limit Theorem.
If these conditions hold, then, under the null 𝐻0 ∶ 𝑝1 − 𝑝2 = 0, the two probabilities are equal and they
can be pooled as follows:

𝑝̂ = (𝑝̂1 𝑛1 + 𝑝̂2 𝑛2 )/(𝑛1 + 𝑛2 ).

Therefore,

𝑝̂1 − 𝑝̂2 ∼̇ 𝑁 (0, 𝑝̂(1 − 𝑝̂)(1/𝑛1 + 1/𝑛2 ))

or, in the standardized version,

𝑍 = (𝑝̂1 − 𝑝̂2 )/√(𝑝̂(1 − 𝑝̂)(1/𝑛1 + 1/𝑛2 )) ∼̇ 𝑁 (0, 1)
Suppose 125 voters are surveyed from state A and 120 in state B. Assume the
survey uses simple random sampling. In state A, 52% of the respondents said
they vote Republican, and 48% vote Democrat. In state B, 45% declared voting
Republican, and 55% Democrat.
Given the information of the samples, is there evidence that the percentage of
Republicans is different in the two states?
Let,
• 𝑝1 and 𝑝1̂ be the proportion of Republican voters in the first state and in the
sample from the first state, respectively.
• 𝑝2 and 𝑝2̂ the proportion of Republican voters in the second state and in the
sample from the second state, respectively.
𝐻0 ∶ 𝑝1 − 𝑝2 = 0,
versus the alternative,
𝐻𝑎 ∶ 𝑝1 − 𝑝2 ≠ 0.
Under the null, the two proportions are equal. Hence, we can use the pooling
proportion,
𝑝̂ = (𝑝̂1 𝑛1 + 𝑝̂2 𝑛2 )/(𝑛1 + 𝑛2 ) = 0.4857,

to calculate the standard error of their difference,

𝜎𝑑 = √(𝑝̂(1 − 𝑝̂)(1/𝑛1 + 1/𝑛2 )) = 0.06387.
𝑃 (𝑝̂1 − 𝑝̂2 > 0.07) = 𝑃 (𝑍 > (0.07 − 0)/0.06387)
= 1 − 𝑃 (𝑍 < 1.096) = 0.1365
As for the p-value, recall that in a bilateral test it is calculated as twice the area
“beyond” the sample statistic. Hence, in this case, the p-value is 2 × 0.1365 = 0.273.
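The numbers in this example can be verified step by step:

```r
p1 <- 0.52; p2 <- 0.45; n1 <- 125; n2 <- 120
p.pool <- (p1 * n1 + p2 * n2) / (n1 + n2)           # pooled proportion: 0.4857
se <- sqrt(p.pool * (1 - p.pool) * (1/n1 + 1/n2))   # standard error: 0.06387
z <- (p1 - p2) / se                                 # observed z: about 1.096
2 * (1 - pnorm(z))                                  # p-value: about 0.27
```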
6.3.4 Implementation in R
Again, R has a dedicated function for this type of hypothesis testing, prop.test().
However, it is based on a different estimator and, therefore, on a different sampling distribution.
An element to bear in mind is that it uses the absolute number of successes, x, and
the number of trials, n.
For the example above, it would be called in the following way.
p1 <- 0.52
p2 <- 0.45
n1 <- 125
n2 <- 120
test.c <- prop.test(x = c(p1*n1, p2*n2),
n = c(n1, n2),
correct = FALSE)
test.c
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(p1 * n1, p2 * n2) out of c(n1, n2)
The interpretation of the result of the test, once again, is greatly facilitated by
recalling 𝐻0 and understanding its p-value.
About 𝐻0 , the output is very explicit. As calculated from the sample proportions, we tested whether the two true proportions are the same.
The p-value is 0.27. Since the p-value of the test of 𝐻0 is larger than 0.05, we do
not reject the null hypothesis of the equality of proportions.
Finally, notice that the p-value is identical to the value calculated above using
the Central Limit Theorem. This is to be expected in this case, where df = 1
and the test does not apply a continuity correction (we set correct = FALSE).
We use the above data to compare the results of the answers to the Rosling ques-
tions.
# recall
n1 <- rosling_responses %>%
  filter(question ==
           "children_with_1_or_more_vaccination") %>%
  nrow()
# n2, p.hat1 and p.hat2 are obtained analogously (not shown in the text)
test.fluke <- prop.test(x = c(p.hat1 * n1, p.hat2 * n2),
                        n = c(n1, n2),
                        correct = FALSE)
test.fluke
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(p.hat1 * n1, p.hat2 * n2) out of c(n1, n2)
## X-squared = 2.4525, df = 1, p-value = 0.1173
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.03621123 0.21796562
## sample estimates:
## prop 1 prop 2
## 0.2400000 0.1491228
Since 0.12 is larger than 0.05, we do not reject the null hypothesis of equality of the proportions. The respondents seem to maintain their level of knowledge, or lack thereof, across the two questions.
6.4 Goodness of Fit for Many Proportions
with
$$\sum_{i=1}^{k} p_{0i} = 1.$$
Suppose that you want to verify whether the regional origins in a given sample
for a poll are correctly matching the regional distribution in the population. The
proportions of each region are known from the official records. Table 6.2 provides
the relevant data.
TABLE 6.2: Representation by region in the poll and in the population.
6.4.2 Implementation in R
The formal test in this case is a 𝜒2 test like the one implemented in R and described in Section 6.5. Here is how the test would be carried out in R.
test.poll
##
## Chi-squared test for given probabilities
##
## data: x.poll
## X-squared = 2.8402, df = 2, p-value = 0.2417
The output of the test is not very detailed. Its 𝐻0 is given above. In this case,
since 0.242 is larger than 0.05, we do not reject the hypothesis that the sample for
the poll has the same distribution across regions as the population.
are the expected probabilities under the null, and the 𝑥’s are the number of ob-
servations in each case.
In our case,
$$\chi^2_1 = 50\left(\frac{\left(\frac{12}{50} - \frac{1}{3}\right)^2}{\frac{1}{3}} + \frac{\left(\frac{38}{50} - \frac{2}{3}\right)^2}{\frac{2}{3}}\right) = 1.96$$
The probability of observing such an extreme, i.e., large, value if the null is true is about 0.16.
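The same statistic can be computed by hand in R; a sketch using the example's counts:

```r
obs <- c(12, 38)                        # observed counts
p0  <- c(1/3, 2/3)                      # probabilities under the null
n   <- sum(obs)                         # 50 observations
chi2 <- n * sum((obs/n - p0)^2 / p0)    # 1.96
p.value <- 1 - pchisq(chi2, df = 1)     # about 0.16
```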
6.6 Exercises
“The support for closing the schools is higher among women, 61% are in favor
against 53% among men.”
“The field study took place in January 7 to 10 and January 14 to 18. For the sample, 629 interviews were carried out.”
Assume that both genders were equally represented (with one extra woman).
Using the usual significance level, is the percentage of support for the closing of the schools statistically different across genders?
Answer both with calculations and with R commands.
Exercise 6.5. Use the ncbirths data5 from the openintro package. Consider all the babies whose birth was classified as premature.
In that group, is the percentage of female babies equal to the percentage of male
babies? Calculate and verify with R code.
5 https://www.openintro.org/data/index.php?data=ncbirths
x.poll <- c(302, 406, 221)            # x.poll and p.0 are vectors for the numbers
p.0 <- c(.30, .45, .25)               # in each category and the expected
                                      # probabilities, respectively.
test.poll <- chisq.test(x = x.poll,   # They were both created with the
                        p = p.0)      # function c().
7 Inference for Numerical Data

Under the conditions stated in Section 5.2, the Central Limit Theorem allowed us to approximate the sampling distribution of a sample's mean, 𝑋̄, of size 𝑛,
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
where 𝜇 and 𝜎2 are the population's mean and variance, respectively. In this chapter we are interested in making hypotheses about the true value of the population mean, 𝜇.
At the outset, one difficulty should be apparent. There are two unknown parameters in the formula above. In order to make inference on one of them, the other must be known.
Our strategy to solve the issue is to first estimate the variance from the sample and use it in place of the true variance,
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$
Notice that 𝑠2 uses 𝑋̄. This will adversely affect the precision of the inference on 𝜇.
Formally, inference about 𝜇 is made through a null hypothesis such as 𝐻0 ∶ 𝜇 =
𝜇0 . Under the null, and if 𝜎2 was known, the sampling distribution of the stan-
dardized mean would be normally distributed,
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$
Now, since 𝜎2 is not known, replacing it by 𝑠2 results in a related but slightly
different sampling distribution, namely the 𝑡-distribution,
$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim t_{n-1},$$
where 𝑛 is the number of observations and (𝑛 − 1) is called the degrees of freedom of the distribution.
FIGURE 7.1: Normal distribution and 𝑡-distribution for various degrees of freedom (df = 1, 2, 4, 9, 10, and ∞, the normal).
As it turns out, the 𝑡-distribution becomes a normal distribution when the de-
grees of freedom, i.e., the sample size, becomes large, say 𝑛 > 30. Figure 7.1
depicts the normal distribution along with various 𝑡 distributions.
The one-sample 𝑡-test is built on the statistic that uses the sample variance in place of the unknown variance of the population.
Formally, we define a null hypothesis,
𝐻0 ∶ 𝜇 = 𝜇0,
as well as an alternative such as 𝐻𝑎 ∶ 𝜇 ≠ 𝜇0 , or 𝐻𝑎 ∶ 𝜇 > 𝜇0 , or, of course,
𝐻𝑎 ∶ 𝜇 < 𝜇 0 .
Under the null, we have,
$$T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}.$$
[From OpenIntro] In a sample of 100 runners in the Cherry Blossom Race in 2017,
the average run time was 95.61 minutes, while the standard deviation was 15.78
minutes. In 2006, the average run time was 93.29. We want to test whether the
runners were faster or slower in 2017 than they were in 2006.
Our null hypothesis is, naturally, 𝐻0 ∶ 𝜇 = 93.29, i.e., no difference between the years. The alternative, since runners in 2017 could be faster or slower, is 𝐻𝑎 ∶ 𝜇 ≠ 93.29.
The test is built on the statistic,
$$T = \frac{95.61 - 93.29}{15.78/\sqrt{100}} = 1.47.$$
If the null hypothesis is true, what is the probability of observing such an extreme
value in a 𝑡-distribution with 𝑑𝑓 = 𝑛 − 1 = 99? Checking the tables, or our
calculator, we have,
p.extreme <- 1 - pt(1.47, df = 99)   # one-tail area beyond t = 1.47
p.extreme
## [1] 0.07233725
We know that the p-value in a bilateral test must take into account the probabilities of the extremes over the two sides, hence, the p-value is 2 × 0.0723 ≈ 0.145.
The interpretation of the test is that, under the assumptions, we cannot reject 𝐻0. The runners in 2017 were not statistically faster or slower than in 2006.
7.2.2 Implementation in R
The implementation in R uses the theoretical result above with the function
t.test(). We illustrate it with the same example as above.
df <- read_csv(paste0("https://",
"www.openintro.org/data/csv/run10samp.csv"))
test.run <- t.test(x = df$time,
mu = 93.29)
test.run
##
## One Sample t-test
##
## data: df$time
## t = 1.4734, df = 99, p-value = 0.1438
## alternative hypothesis: true mean is not equal to 93.29
## 95 percent confidence interval:
## 92.48412 98.74508
## sample estimates:
## mean of x
## 95.6146
7.3 Test for Paired Data

Observations are paired in two data sets if each observation of the first data set has a particular connection with one observation in the other data set.
Such cases arise, for instance, when the same person is surveyed twice, when a
measurement is taken at the same place, when both data sets have information
on the same objects, etc.
In that case, any meaningful comparison of the means across the groups, say 𝐴 and 𝐵, should be based on the differences within pairs.
In other words, the variable analyzed is 𝑋𝐷, the difference across data sets for each observation. Under the same general conditions as above, we can perform a test on a hypothesis about 𝑋𝐷 such as the very common,
𝐻0 ∶ 𝜇𝐷 = 0.
We use another data set from OpenIntro about prices of books in two different
locations, the UCLA Bookstore and… Amazon.
df <- read_csv(paste0("https://",
"www.openintro.org/data/csv/textbooks.csv"))
x.bar <- mean(df$diff)
s.sample <- sd(df$diff)
n.sample <- length(df$diff)
We have,
$$T = \frac{12.7616438 - 0}{14.2553008/\sqrt{73}} = 7.6487711.$$
Now, the probability of observing such an extreme value, if 𝐻0 is true, is
p.value <- 2 * (1 - pt(7.6487711, df = 72))   # two-sided
p.value
## [1] 6.92757e-11
Close to impossible…
Verifying with the appropriate command in R:
test.book <- t.test(x = df$diff, mu = 0)
test.book
##
## One Sample t-test
##
## data: df$diff
## t = 7.6488, df = 72, p-value = 6.928e-11
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 9.435636 16.087652
## sample estimates:
## mean of x
## 12.76164
The 𝑡-test can be used to test hypotheses about the difference of the means be-
tween two groups, say 𝐴 and 𝐵.
If the observations are independent within and across groups, and the samples are not too “weird” (outliers, skewed, …), then, under a null such as
𝐻0 ∶ 𝜇 𝐴 − 𝜇 𝐵 = 𝜇 0
$$T = \frac{(\bar{X}_A - \bar{X}_B) - \mu_0}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}} \sim t_k,$$
where 𝐴 and 𝐵 refer to the two groups and where 𝑠2 and 𝑛 are the variance and
the number of observations of each group, respectively. As for 𝑘, this is a rather
complicated number to calculate. Let us use 𝑘 = min(𝑛𝐴 , 𝑛𝐵 ) − 1.
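The statistic above can be sketched as a small helper function; the function name is ours, not the book's, and t.test() remains the reference implementation:

```r
# sketch: two-sample t statistic and two-sided p-value,
# with the conservative degrees of freedom k = min(nA, nB) - 1
two_sample_t <- function(xA, xB, mu0 = 0) {
  t.stat <- (mean(xA) - mean(xB) - mu0) /
    sqrt(var(xA) / length(xA) + var(xB) / length(xB))
  k <- min(length(xA), length(xB)) - 1
  c(statistic = t.stat, p.value = 2 * (1 - pt(abs(t.stat), df = k)))
}
```

For the weight example below, t.test() computes the Welch version, which uses a more precise (and larger) df.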
We use a famous data set that is also included in the package openintro. It collects information on baby births in North Carolina.
We first collect a sample with 50 mothers who have the habit of smoking as well
as 100 who do not have that habit.
library(openintro)
data(ncbirths)
dfA <- ncbirths %>%
filter(habit == "smoker") %>%
sample_n(size = 50)
dfB <- ncbirths %>%
filter(habit == "nonsmoker") %>%
sample_n(size = 100)
df <- bind_rows(dfA, dfB)
𝐻0 ∶ 𝑤𝐴 − 𝑤𝐵 = 0.
The test in R will then be,
test.w
##
## Welch Two Sample t-test
##
## data: weight by habit
## t = 0.3713, df = 106.45, p-value = 0.7112
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3818629 0.5578629
## sample estimates:
## mean in group nonsmoker mean in group smoker
## 7.1194 7.0314
7.5 Exercises
Exercise 7.1. The file data-grades.Rdata (download here1 ) contains actual data on
the grades of the students at the midterm and at the endterm. You simply need to load it into R with the function load().
Evaluate whether or not the mean at the midterm is the same as the mean at the
endterm. Use the usual significance level.
Answer by coding in R.
Exercise 7.2. Use the ncbirths data2 from the openintro package. Take a sample of
240 babies. Test whether the average weight of female babies is the same as the
average weight of male babies.
1 https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335656
2 https://www.openintro.org/data/index.php?data=ncbirths
Part III
Confidence Intervals

8 Estimators and Confidence Intervals
In statistics jargon, we do not use the noun or verb “guess” but, instead, “estimate”. Loosely speaking, an estimator is a way of guessing the true value of a parameter (such as its mean) in the population by applying an algorithm to the available sample.
Definition 8.1 (Estimator and point estimator). An estimator of a population
parameter is a random variable obtained from a sample.
When the estimate is a single value, the estimator is called a point estimator
(that gives a point estimate). The point estimator is defined by a rule or formula
that tells us how to use the sample data to calculate a single number, the point
estimate, that can be used as an estimate of the population parameter.
There can be many different estimators for a given parameter (different statis-
tics can be used to estimate a certain parameter). The choice of the appropriate
estimator will depend on the parameter we are trying to estimate.
Note that an estimator is itself a random variable as it is a function of other ran-
dom variables. As such, it has an expected value and a probability density func-
tion.
The distribution of an estimator is called the sampling distribution and it characterizes the properties of the estimator. In order to compare estimators, we need to compare their sampling distributions.
Recall that the sampling distribution is simply the probability distribution func-
tion of sample statistics. The sampling distribution of a given statistic is the dis-
tribution we would get if we were to take all possible samples of a given size 𝑛
and for each of those samples calculate the same statistic.
As pdfs, sampling distributions have an expected value, variance, and often fol-
low known probability distributions (e.g. Normal, 𝑡, Chi-square, or 𝐹 distribu-
tions).
8.2 “Best” Statistic
FIGURE 8.1: Estimators with different expected values (left) and different variances (right).
8.2.1 Properties
8.2.2 Unbiasedness
Definition 8.2 (Unbiased estimator). Let 𝜃̂ be a point estimator for the true parameter 𝜃. Then 𝜃̂ is an unbiased estimator if
$$E[\hat{\theta}] = \theta.$$
The bias of the estimator is
$$\text{bias}(\hat{\theta}) = E[\hat{\theta}] - \theta.$$
Consider, for example, the following two estimators of a population's mean:
$$\hat{\theta}_1 = \frac{1}{n}\sum_{i=1}^{n} X_i$$
$$\hat{\theta}_2 = \frac{1}{2}\left(\min_i(X_i) + \max_i(X_i)\right)$$
We can show that both 𝜃1̂ and 𝜃2̂ are unbiased estimators of the population's mean. Intuitively, however, it is clear that one of them is more reliable than the other.
Definition 8.3 (Confidence interval). Let 𝜃 be the parameter searched for. Sup-
pose we can define two random variables, 𝐴 and 𝐵, based on the sample such
that 𝑃 (𝐴 < 𝜃 < 𝐵) = 1 − 𝛼, where 𝛼 is a small number between 0 and 1.
Then, the interval between 𝑎 and 𝑏 (values of the variables 𝐴 and 𝐵) is the (1−𝛼)
confidence interval of 𝜃. Also, (1 − 𝛼) is the confidence level.
These cases are considered in a similar way. This is why the confidence interval is symmetric around the point estimate 𝜃̂.
Another way of expressing that idea is to write the confidence interval as a point estimate with a symmetric margin of error, 𝑀𝐸:
𝜃̂ ± 𝑀𝐸.
The discussion above left open the question of how to estimate 𝑎 and 𝑏 such that, if the population were sampled a very large number of times, then in (1 − 𝛼) of the cases the true parameter would be included in the interval.
In this section, we introduce an estimator for such an interval, namely one for the most common case, i.e., based on the normal distribution.
Let 𝑋̄ be the mean of a sample of 𝑛 observations from a normally distributed population with unknown mean 𝜇 but known variance 𝜎2. Then we can say something about the 𝑍-score of the sample, i.e., the standardized version of the sample mean, 𝑋̄,
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}.$$
In particular for our discussion, we know the probability of 𝑍 being between two values in the standard normal distribution, see Figure 8.3. For a probability of 1 − 𝛼 around the mean, we know
$$1 - \alpha = P(-z_{\alpha/2} < Z < z_{\alpha/2}).$$
By developing this expression, we can find the confidence interval for the true
parameter.
FIGURE 8.3: The standard normal distribution, 𝑍, with area 1 − 𝛼 between −𝑧𝛼/2 and 𝑧𝛼/2, and 𝛼/2 in each tail.
For the 95% level, for instance, a confidence interval for the true mean when the population is normally distributed is given by
$$0.95 = P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$
Proposition 8.1 (Confidence interval for the mean). For a normally distributed pop-
ulation with unknown mean 𝜇 and a known variance 𝜎2 , a confidence interval of (1−𝛼)
can be found in the following way.
Use the sample mean, 𝑋̄, as an unbiased estimator to build the confidence interval as
$$\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
where the margin of error is
$$ME = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
The limits of the interval are called the upper and lower confidence limits.
TABLE 8.1: Values of 𝑧𝛼/2 for common significance levels, 𝛼.

𝛼      𝑧𝛼/2
1%     2.58
5%     1.96
10%    1.64
Figure 8.4 illustrates the concept of confidence interval for the mean. Table 8.1
gives the values for 𝑧𝛼/2 for common significance levels, 𝛼.
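These critical values are standard normal quantiles and can be recovered in R with qnorm() (note that the 10% value is, more precisely, 1.645):

```r
qnorm(1 - 0.01 / 2)   # 2.5758
qnorm(1 - 0.05 / 2)   # 1.9600
qnorm(1 - 0.10 / 2)   # 1.6449
```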
Notice that ways of reducing the margin of error follow straightforwardly:
FIGURE 8.4: Confidence interval for the mean, 𝑋̄ ± 𝑀𝐸, covering the true mean with probability 1 − 𝛼.
The confidence interval for a population proportion has an expression very similar to the expression for the mean of a sample.
As it happens, however, the normality assumption is no longer necessary. Instead, we will require that the sample size is large enough. A rule of thumb for that criterion is 𝑛𝑝(1 − 𝑝) > 5. Then, we can show that,
𝐸[𝑝]̂ = 𝑝
Also,
$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \approx \sqrt{\frac{p(1-p)}{n}}.$$
Hence, we can say that the 𝑍-score
$$Z = \frac{\hat{p} - p}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}}$$
is approximately standard normal.
$$1 - \alpha = P\left(-z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < \hat{p} - p < z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$$
$$= P\left(-\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < -p < -\hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$$
$$= P\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$$
Proposition 8.2 (Confidence interval for the population proportion). Let 𝑝̂ be the
observed proportion of “successes” in a random sample of 𝑛 observations from a popula-
tion with a proportion of successes 𝑝.
Then, for large enough samples, a confidence interval of (1 − 𝛼) can be found in the
following way.
Use the sample proportion, 𝑝̂, as an unbiased estimator of 𝑝, and as a good estimator of 𝑝 in the true variance, to build the confidence interval as
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
where the margin of error is
$$ME = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
The limits of the interval are called the upper and lower confidence limits.
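Proposition 8.2 translates into a few lines of R; a sketch with illustrative numbers (a sample of 𝑛 = 120 with 𝑝̂ = 0.45, not data from the book):

```r
p.hat <- 0.45; n <- 120; alpha <- 0.05
# the rule of thumb holds: n * p.hat * (1 - p.hat) is about 29.7 > 5
me <- qnorm(1 - alpha / 2) * sqrt(p.hat * (1 - p.hat) / n)   # margin of error
c(lower = p.hat - me, upper = p.hat + me)                    # the (1 - alpha) CI
```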
8.6 Extensions
In some situations, the knowledge that we would like to have about a true pa-
rameter in a population, such as its mean, is the probability that it exceeds some
value or that it falls below some value. These cases call for a one-sided confidence
interval.
The following illustrates that notion for the estimate of the mean of a normally distributed population. In a similar manner as above, we can show, for a 1 − 𝛼 level of confidence,
$$1 - \alpha = P\left(\mu < \bar{X} + z_{\alpha}\frac{\sigma}{\sqrt{n}}\right)$$
and
$$1 - \alpha = P\left(\bar{X} - z_{\alpha}\frac{\sigma}{\sqrt{n}} < \mu\right)$$
These expressions give the one-sided bounds for the true parameter, given the
level of confidence.
A confidence interval can also be found in the case of unknown variance. In that case, the value of the variance is replaced by an estimate of it.
The resulting distribution of the standardized value, however, does not follow a
normal distribution but a 𝑡−distribution.
Notice that the 𝑡−distribution matches the normal distribution when 𝑛 is large enough (greater than 30).
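That convergence can be checked numerically by comparing 𝑡 and normal critical values:

```r
qt(0.975, df = 10)     # 2.228
qt(0.975, df = 30)     # 2.042
qt(0.975, df = 1000)   # 1.962, close to qnorm(0.975) = 1.960
```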
Part IV

9 Curse, Blessing & Back
Consider again the standard deviation of the sampling distribution of the sample
proportion under the null,
$$\sigma_{p_0} = \sqrt{\frac{p_0(1-p_0)}{n}}.$$
For any given 𝑛, 𝜎𝑝0 is the highest when 𝑝0 = 0.5. For the remainder of this section we will then consider that value. This choice goes against our own argument that 𝜎𝑝0 can be surprisingly small in today's samples.
Suppose you want to test whether a sample proportion is around 𝑝0. For any given level of significance, 𝛼, you can guarantee that the sample proportion is arbitrarily close, i.e., within a given margin, 𝑚, of the true proportion, 𝑝0, by choosing a sample size at least as large as 𝑛∗. Actually, we can show that the relationship between these variables is given by,
$$n^* = \frac{\left(\Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right)^2 \cdot p_0(1-p_0)}{m^2}$$
For instance, we can achieve the usual statistical significance level (i.e., with 𝛼 = .05) to check whether the sample proportion is within 𝑚 = 0.05 of the true proportion 𝑝0 = 0.60 by analyzing 369 observations. Indeed, note that Φ−1(0.975) = 1.96, so that we can write
$$n^* = \frac{1.96^2 \cdot 0.60 \cdot 0.40}{0.05^2} \approx 369.$$
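The formula for 𝑛∗ is easily wrapped into a small R helper; a sketch (the function name is ours, not the book's):

```r
# minimal sample size n* for margin m at significance level alpha
n.star <- function(p0, m, alpha = 0.05) {
  ceiling(qnorm(1 - alpha / 2)^2 * p0 * (1 - p0) / m^2)
}
n.star(p0 = 0.60, m = 0.05)   # 369, as in the example
```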
FIGURE 9.1: Minimal 𝑛 (log scale) for various values of 𝛼 and margins of error, 𝑚, keeping 𝑝0 = 0.5.
For a long time the sample size was considered a curse. The sample sizes required for reasonable accuracy and significance were desperately high. Recall, for instance, that Jakob Bernoulli is believed to have ended his opus, Bernoulli (1713), after showing that his required sample size was 25,550, i.e., more than the actual population.1
The curse was lifted recently with the systematic collection of data and the con-
struction of very large datasets.
Notice, however, that big data has come with its own curse. As shown in Figure 9.1, one needs thousands of observations to establish, very accurately and significantly, a difference with respect to a given proportion. But the current situation has the problem reversed. The datasets are often so large that any small difference between proportions is deemed statistically significant!
1 The number was inaccurate, but the impression of too high a demand remains.
9.3 An Illustration
I use data on confirmed Covid-19 cases in Portugal built by the Portuguese DGS.
An advantage of this dataset over the publicly available data by Our World in
Data2 is that it provides details about the observations, such as gender, age, loca-
tion, etc. This, in turn, allows testing various hypotheses based on sample pro-
portions.
Figure 9.2 plots the evolution of the daily number of covid-19 cases in Portugal, on a rolling average over 7 days. The total number of observations in the DGS data is 419,910.
We now run a series of tests on sample proportions. Notice that these are purely
illustrative.
Since the DGS data is divided by gender, one could ask whether the two genders
are equally represented. Translated into a test of hypothesis, we would have,
2 https://ourworldindata.org/
FIGURE 9.2: Confirmed covid-19 cases in Portugal, daily 7-day rolling moving average.
𝐻0 ∶ 𝑝𝑀 = 0.5, where 𝑀 stands for male. Here is how we could proceed:
i. Calculate the number of observations for each group.
ii. Compute the test.
mf.all
## # A tibble: 2 x 2
## gender n
## <chr> <int>
## 1 F 231049
## 2 M 188711
mf.test.05
##
## 1-sample proportions test with continuity correction
##
## data: mf.all[2] out of sum(mf.all), null probability 0.5
## X-squared = 4270.1, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4480632 0.4510753
## sample estimates:
## p
## 0.4495688
Given the sample size and the observed difference between genders, it seems impossible not to reject the hypothesis of equal representation.
As it turns out, the proportions of each group were the closest in June. We could
test whether they were actually equal that month.
mf.june
## # A tibble: 2 x 2
## gender n
## <chr> <int>
## 1 F 5241
## 2 M 4969
mf.test.june
##
## 1-sample proportions test with continuity correction
##
## data: mf.june[2] out of sum(mf.june), null probability 0.5
## X-squared = 7.193, df = 1, p-value = 0.007319
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4769426 0.4964270
## sample estimates:
## p
## 0.4866797
Despite the lower sample sizes and the close proportions, we must reject again
the equal representation of genders in the sample, this time for the data of June
only.
“Playing” with the data can lead to surprising results. If we separate the observations in June into two age groups, 40 and above versus below 40, then we cannot reject equal proportions of the age groups in the sample.
oy.june
## # A tibble: 2 x 2
## old n
## <lgl> <int>
## 1 FALSE 3725
## 2 TRUE 3758
oy.test
##
## 1-sample proportions test with continuity correction
##
## data: oy.june[2] out of sum(oy.june), null probability 0.5
## X-squared = 0.13684, df = 1, p-value = 0.7114
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4908114 0.5135963
## sample estimates:
## p
## 0.502205
9.4 Exercises
Exercise 9.1. The following were the technical details for a poll by Obser-
vador/TVI/Pitagórica3 (the author’s translation from this source):
3 https://observador.pt/especiais/sondagem-observador-tvi-pitagorica-maioria-e-a-favor-do-fecho-das-escolas-e-defendia-adiamento-das-eleicoes/
“The field study took place in January 7 to 10 and January 14 to 18. For the sample, 629 interviews were carried out, resulting in a confidence level of 95.5%, with an implicit maximal margin of error of ± 4%.”
Verify that the margin of error was correctly calculated.
9.5 Commented R Code
# owid and dgs are the names of the data frames imported.
# read_csv() is from package readr. It reads the csv file into R.
# Here, the file is online; only its url is given.
# See `Intro R > §readr & readxl'.
owid <- read_csv(
  paste0("https://covid.ourworldindata.org/",
         "data/owid-covid-data.csv")) %>%
  filter(location == "Portugal")

# read_excel() is from the package readxl.
dgs <- read_excel("data/covid.xlsx") %>%
  mutate(date = ymd(data_notificacao),
         datet = ymd(data_colheita_amostra))
# rename() is self-explanatory. It is always better to have short,
# though sufficiently explicit, names for variables.
10 Field of Fools

TL;DR.
Theorem 10.1 (De Moivre’s Equation: Variance of the sample mean, 𝑋̄ , of ran-
dom variables). Let 𝑋1 , 𝑋2 , … , 𝑋𝑛 be independent and identically distributed vari-
ables with mean 𝜇 and variance 𝜎2 . Then,
$$Var(\bar{X}) \stackrel{\text{def}}{=} \sigma^2_{\bar{X}} = \frac{\sigma^2}{n}$$
or, for the standard error,
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
FIGURE 10.1: The counties with the highest 10 percent age-standardized death
rates for cancer of the kidney/ureter for U.S. males, 1980-89. (Source: Gelman
and Nolan (2017))
Proof.
$$Var(\bar{X}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2}\sum_{i=1}^{n} \sigma^2 = \frac{1}{n^2} n\sigma^2 = \frac{\sigma^2}{n}$$
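De Moivre's equation can also be illustrated by simulation; a sketch with assumed values 𝜇 = 0, 𝜎 = 2 and 𝑛 = 25:

```r
set.seed(1)
n <- 25; sigma <- 2
# 10,000 sample means, each from a sample of size n
means <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
sd(means)   # close to the theoretical sigma / sqrt(n) = 0.4
```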
Consider Figures 10.1 and 10.2 and try to evaluate the characteristics of the areas that are the most and the least prone to the type of cancer described. Test your explanations against the locations in Figure 10.3.
FIGURE 10.2: The counties with the lowest 10 percent age-standardized death rates for cancer of the kidney/ureter for U.S. males, 1980-89. (Source: Gelman and Nolan (2017))
FIGURE 10.3: The counties with both the highest and lowest 10 percent age-
standardized death rates for cancer of the kidney/ureter for U.S. males, 1980-89.
(Source: Wainer (2007))
Do small schools improve learning? Figure 10.5 provides a basis for discussing
this point.
FIGURE 10.4: Population versus age-standardized death rates for cancer of the
kidney/ureter for U.S. males, 1980-89. (Source: Wainer (2007))
FIGURE 10.5: Enrollment vs. math score, 5th grade (left) and 11th grade (right).
(Source: Wainer (2007))
FIGURE 10.6: Ten safest and most dangerous American cities for driving, and
ten largest American cities. (Source: Wainer (2007))
What are the safest and the most dangerous American cities for driving? Con-
sider Figure 10.6 for an answer.
Are there differences in performance between males and females? Figure 10.7
provides evidence on that question.
Note the following ratio between the standard errors of sample means based on 𝑐 and on 2𝑐 observations:
$$\frac{\sigma_{\bar{X}_c}}{\sigma_{\bar{X}_{2c}}} = \frac{\sigma/\sqrt{c}}{\sigma/\sqrt{2c}} = \sqrt{2} \approx 1.4$$
Tversky and Kahneman (1971) suggest the following about our belief in the Law
of Small Numbers.
[Form one of the belief:] We submit that people view a sample randomly drawn from a population as
highly representative, that is, similar to the population in all essential characteristics. Consequently, they
expect any two samples drawn from a particular population to be more similar to one another and to the
population than sampling theory predicts, at least for small samples.
When subjects are instructed to generate a random sequence of hypothetical tosses of a fair coin, for
example, they produce sequences where the proportion of heads in any short segment stays far closer to
.50 than the laws of chance would predict.
[Form two of the belief:] Subjects act as if every segment of the random sequence must reflect the true
proportion: if the sequence has strayed from the population proportion, a corrective bias in the other
direction is expected. This has been called the gambler’s fallacy.
Both [forms of the belief] generate expectations about characteristics of samples, and the variability of
these expectations is less than the true variability, at least for small samples.
People’s intuitions about random sampling appear to satisfy the law of small numbers, which asserts that
the law of large numbers applies to small numbers as well.
Part V
Visualizations

11 Data Visualization
Data visualization is an absolutely required skill for any data scientist. It is key
to:
• explore data,
• explain data,
• communicate quantitative information,
• convince specific audiences,
• …
1 https://www.edwardtufte.com/tufte/
12 Bars
The main focus of this chapter is to provide illustrations of the tools that we could use to visualize the values calculated in Chapters 6 and 7, as well as the margins of error in Chapter 8.
data("rosling_responses")
rosling_responses %>%
group_by(response) %>%
summarise(count = n()) %>%
mutate(percent = count/ sum(count)) %>%
ggplot(aes(x=response, y= percent)) +
geom_col(alpha=0.5)
data("rosling_responses")
rosling_responses %>%
  group_by(question, response) %>%
  summarise(count = n()) %>%
  group_by(question) %>%
  mutate(percent = count/ sum(count)) %>%          # plausible completion of
  ggplot(aes(x=response, y= percent, fill=response)) +  # the truncated code
  geom_col(alpha=0.5) +
  facet_wrap(.~question)
rosling_responses %>%
group_by(question, response) %>%
summarise(count = n()) %>%
group_by(question) %>%
mutate(percent = count/ sum(count)) %>%
ggplot(aes(x=response, y= percent, fill=response)) +
geom_col(alpha=0.5, position = "dodge") +
facet_wrap(.~question) +
scale_fill_manual(values = c("correct" = "#006400", "incorrect" = "#8B0000")) +
theme(legend.position = "none")
We now plot a confidence interval for each proportion following the discussion
in Chapter 8.
rosling_responses %>%
group_by(response) %>%
summarise(count = n()) %>%
mutate(percent = count/ sum(count),
se = sqrt(percent*(1-percent)/count)) %>%
ggplot(aes(x=response, y= percent)) +
geom_col(alpha=0.5) +
geom_errorbar(aes(ymin = percent-1.96*se, ymax = percent+1.96*se ), width = 0.2)
12.3 Bars for Numerical Data

library(openintro)
data(ncbirths)
set.seed(142)
dfA <- ncbirths %>%
filter(habit == "smoker") %>%
sample_n(size = 50)
dfB <- ncbirths %>%
filter(habit == "nonsmoker") %>%
sample_n(size = 100)
df <- bind_rows(dfA, dfB)
df %>%
group_by(habit) %>%
summarise(m.weight = mean(weight)) %>%
ggplot(aes(x= habit, y=m.weight)) +
geom_col(alpha=0.5)
df %>%
group_by(habit, gender, whitemom) %>%
summarise(m.weight = mean(weight)) %>%
ggplot(aes(x= habit, y=m.weight, fill = gender)) +
geom_col(alpha=0.5, position= "dodge") +
facet_wrap(.~whitemom) +
theme(legend.position = "bottom") +
xlab("Mother's habit") +
ylab("Average weight (pounds)")
df %>%
group_by(habit) %>%
summarise(m.weight = mean(weight),
se= sd(weight)/sqrt(n()) ) %>%
ggplot(aes(x= habit, y=m.weight)) +
geom_col(alpha=0.5) +
geom_errorbar(aes(ymin=m.weight-1.96*se, ymax= m.weight+1.96*se ), width = 0.5) +
xlab("Mother's habit") +
ylab("Average weight (pounds)")
12.5 Exercise
Reproduce the following plot with the given line for the data.
FIGURE 12.8: Mean arousal per film over gender with confidence interval.
Part VI
Bridge

13 Correlation
TL;DR.
One of the most natural exercises in the presence of two variables is to gauge their association. It is a perilous exercise, though, as one is often carried away into the realm of causal explanations.
Our approach here remains within the field of purely descriptive analysis. We introduce various ways of numerically assessing the relationship between two variables.
A first important step consists in visualizing the data in a scatter plot. Then, the first eyeball test evaluates whether a line, representing a linear relationship, can be drawn through the cloud of points. Figure 13.1 provides examples. These use the data frame df from the golf.Rdata file extracted from ESPN1 (download here2).
1 https://www.espn.com/golf/statistics
2 https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335657
load("data/golf.Rdata")
df
## # A tibble: 70 x 17
## rank surname name age events rounds cutsmade top10 wins cuppoints
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Bryson DeChambeau 27 7 26 7 4 2 1375
## 2 2 Dustin Johnson 36 6 24 6 4 1 1105
## 3 3 Viktor Hovland 23 10 40 10 4 1 1204
## 4 4 Xander Schauffele 27 9 36 9 5 0 1110
## 5 5 Patrick Cantlay 28 9 36 10 4 1 1234
## 6 7 Brooks Koepka 30 9 30 9 4 1 960
## 7 8 Tony Finau 31 10 40 10 5 0 980
## 8 9 Jason Kokrak 35 12 42 10 3 1 841
## 9 10 Justin Thomas 27 9 34 9 4 0 897
## 10 11 Max Homa 30 13 46 10 3 1 909
## # ... with 60 more rows, and 7 more variables: earnings <dbl>, yds-drive <dbl>,
## # drvacc <dbl>, drvetotal <dbl>, greensinreg <dbl>, puttavg <dbl>,
## # savepct <dbl>
p1 <- df %>%
ggplot(aes(x=events, y=rounds )) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x="Events", y="Rounds") +
ggtitle("Strong positive linear relationship")
p2 <- df %>%
ggplot(aes(x=drvacc, y=`yds-drive`)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x="Drive Accuracy", y="Yards per drive") +
ggtitle("Negative linear relationship")
p3 <- df %>%
ggplot(aes(x=age, y=cutsmade)) +
geom_point() +
# remaining layers lost in extraction; completed to match p1 and p2
geom_smooth(method = "lm", se = FALSE) +
labs(x="Age", y="Cuts made")
FIGURE 13.1: Scatter plots of pairs of variables and their linear relationship.
Notice that the eyeball test should always be attempted. This is because the numerical estimators of the strength of the relationship are summary point estimates. As such, they can hide various situations. This point was cleverly illustrated by Anscombe (1973) with the help of four scatter plots reproduced in Figure 13.2.
The particularity of the four plots is that they share the exact same linear relationship between the variables. Obviously, the nature of the real relationship differs greatly.
Fortunately, current software has rendered this visual examination extremely easy. Figures 13.3 through 13.5 show three possibilities in R.
[Figure 13.2: Anscombe's quartet, four scatter plots (x1 through x4 against y1 through y4) with visibly different patterns.]
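Anscombe's data ships with base R as the `anscombe` data frame, so the point is easy to verify: all four pairs share, to two decimal places, the same correlation despite their very different shapes. A minimal check:

```r
# the anscombe data frame is built into base R (columns x1..x4, y1..y4)
round(sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
}), 2)
## [1] 0.82 0.82 0.82 0.82
```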
13.2 Pearson's Correlation

Pearson's correlation, a.k.a. the correlation, is a measure of the linear relationship between two variables. Its formula is:

$$r_{xy} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum x_i^2 - n\bar{x}^2}\;\sqrt{\sum y_i^2 - n\bar{y}^2}}$$
In a population, it reflects the covariance between two variables, normalized by the product of their standard deviations. It can range from -1 to 1, indicating the direction and the strength of the linear relationship.
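As a sketch, the formula above can be checked against R's cor() on simulated data (the data here are made up for illustration):

```r
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
n <- length(x)

# correlation computed directly from the formula
r.manual <- (sum(x * y) - n * mean(x) * mean(y)) /
  (sqrt(sum(x^2) - n * mean(x)^2) * sqrt(sum(y^2) - n * mean(y)^2))

all.equal(r.manual, cor(x, y))
## [1] TRUE
```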
[Figure: scatter-plot matrix of rank, age, events, rounds, cutsmade, top10, wins, cuppoints, and earnings.]
The variables should be continuous and have constant variance across their range. As with other statistics, a large 𝑛 is needed for the normality of the sampling distribution to be reliable.
Estimation in R is straightforward. When assessing the correlation between two variables, we can use the base R function cor().
cor(df$events, df$rounds)
## [1] 0.8265256
cor(df$drvacc, df$`yds-drive`)
## [1] -0.522812
cor(df$age, df$cutsmade)
## [1] NA
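The NA arises because two players have a missing age: cor() returns NA as soon as either variable contains a missing value. The use argument restricts the computation to complete pairs; a small sketch with made-up vectors:

```r
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 9, 10)

cor(x, y)                       # NA: the fifth pair is incomplete
## [1] NA
cor(x, y, use = "complete.obs") # computed on the four complete pairs
```

Applied to the golf data, cor(df$age, df$cutsmade, use = "complete.obs") recovers the -0.17 reported by cor.test further down.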
[Figure: correlation matrix plot of rank, age, events, rounds, cutsmade, top10, wins, cuppoints, and earnings.]
To speed up the process, the comparison can be made over a selection of vari-
ables.
[Figure: correlogram of the selected variables with correlation coefficients on a colour scale from 1 to -1 (e.g., rank vs. cuppoints: -0.89; cuppoints vs. earnings: 0.98).]
The statistical significance of a correlation can be tested with cor.test(), under the null hypothesis:

$$H_0 : r = 0$$
cor.test(df$events, df$rounds)
##
## Pearson's product-moment correlation
##
## data: df$events and df$rounds
## t = 12.108, df = 68, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7341281 0.8888703
## sample estimates:
## cor
## 0.8265256
cor.test(df$drvacc, df$`yds-drive`)
##
## Pearson's product-moment correlation
##
## data: df$drvacc and df$`yds-drive`
## t = -5.0575, df = 68, p-value = 3.435e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6748790 -0.3281503
## sample estimates:
## cor
## -0.522812
cor.test(df$age, df$cutsmade) # incomplete pairs are dropped automatically
##
## Pearson's product-moment correlation
##
## data: df$age and df$cutsmade
## t = -1.4241, df = 66, p-value = 0.1591
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3948356 0.0685836
## sample estimates:
## cor
## -0.1726649
Again, an overview of the results of this test over many pairs can be obtained using the Hmisc package.
library(Hmisc)
rcorr(as.matrix(df[, c(1, 4:11)]))
## rank age events rounds cutsmade top10 wins cuppoints earnings
## rank 1.00 0.39 0.27 0.08 0.15 -0.77 -0.53 -0.88 -0.88
## age 0.39 1.00 -0.10 -0.17 -0.17 -0.31 -0.07 -0.36 -0.38
## events 0.27 -0.10 1.00 0.83 0.86 -0.32 -0.27 -0.26 -0.37
## rounds 0.08 -0.17 0.83 1.00 0.91 -0.05 -0.28 -0.17 -0.21
## cutsmade 0.15 -0.17 0.86 0.91 1.00 -0.12 -0.27 -0.20 -0.26
## top10 -0.77 -0.31 -0.32 -0.05 -0.12 1.00 0.30 0.67 0.74
## wins -0.53 -0.07 -0.27 -0.28 -0.27 0.30 1.00 0.72 0.68
## cuppoints -0.88 -0.36 -0.26 -0.17 -0.20 0.67 0.72 1.00 0.97
## earnings -0.88 -0.38 -0.37 -0.21 -0.26 0.74 0.68 0.97 1.00
##
## n
## rank age events rounds cutsmade top10 wins cuppoints earnings
## rank 70 68 70 70 70 70 70 70 70
## age 68 68 68 68 68 68 68 68 68
## events 70 68 70 70 70 70 70 70 70
## rounds 70 68 70 70 70 70 70 70 70
## cutsmade 70 68 70 70 70 70 70 70 70
## top10 70 68 70 70 70 70 70 70 70
## wins 70 68 70 70 70 70 70 70 70
## cuppoints 70 68 70 70 70 70 70 70 70
## earnings 70 68 70 70 70 70 70 70 70
##
## P
## rank age events rounds cutsmade top10 wins cuppoints earnings
## rank 0.0009 0.0244 0.5146 0.2264 0.0000 0.0000 0.0000 0.0000
## age 0.0009 0.4204 0.1622 0.1591 0.0113 0.5893 0.0025 0.0015
## events 0.0244 0.4204 0.0000 0.0000 0.0061 0.0261 0.0276 0.0016
13.3 Spearman's Rank Correlation
The idea behind Spearman's correlation is to evaluate the association between the ranks of the observations on two different variables.
This approach makes the test less sensitive to outliers. It is also better suited when the variables are ordinal in measurement.
Importantly, the Spearman correlation assesses the relationship between two
variables that is not necessarily linear but simply monotonic.
The statistic is, assuming the ranks are distinct (no ties),

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)},$$
where 𝑑𝑖 = rg(𝑋𝑖 ) − rg(𝑌𝑖 ) is the difference between the two ranks of each
observation and 𝑛 is the number of observations.
Here is an example of use in R that emphasizes the difference with respect to Pearson's coefficient.
cor(df1$x, df1$y)
## [1] 0.967074
cor(df1$x, df1$y, method = "spearman")
## [1] 1
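The data frame df1 is not shown here, but any strictly monotonic, non-linear pair behaves the same way; a sketch with made-up data:

```r
x <- 1:20
y <- exp(x / 4)  # strictly increasing, but clearly non-linear

cor(x, y)                        # Pearson: below 1, the curvature is penalized
cor(x, y, method = "spearman")   # Spearman: exactly 1, the ranks agree perfectly
## [1] 1

# Spearman is Pearson applied to the ranks
all.equal(cor(x, y, method = "spearman"), cor(rank(x), rank(y)))
## [1] TRUE
```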
14
Observational Versus Experimental Data

TL;DR.

Required Packages
df %>%
ggplot(aes(x = Gender, y = Admission, fill = Gender)) +
geom_col() +
geom_text(aes(label = percent(Admission)), vjust = -1) +
labs(y = "Admission rate") +
scale_y_continuous(labels = percent, limits = c(0,0.5)) +
geom_hline(yintercept = mean.Admission, linetype="dashed") +
annotate(geom = "text", x=0.85, y=mean.Admission+0.02, label = paste0("Average admission rate (", percent(mean.Admission), ")")) + # line truncated in source; label completed to match the plot
guides(fill = FALSE)
[Figure: admission rate by gender (Female 30.0%, Male 45.0%), with the average admission rate shown as a dashed line.]
p1 <- df$Admission[1]
p2 <- df$Admission[2]
n1 <- df$N[1]
n2 <- df$N[2]
penguins %>%
na.omit() %>%
ggplot(aes(x=bill_length_mm, y=bill_depth_mm)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE) +
labs(x="Bill length", y="Bill Depth")
[Figure: bill depth against bill length with a single overall linear fit.]
14.2 Covariates
df %>%
ggplot(aes(x=Gender, y=Admission, fill = Gender)) +
geom_col() +
geom_text(aes(label = paste0(percent(Admission), "\n (of ", N, ")") ), vjust = -0.15, size=3) +
labs(y = "Admission rate") +
scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
facet_wrap(~Dept) +
guides(fill = FALSE)
[Figure: admission rate by gender, faceted by department (A to F); e.g., in department A, 82.41% of 108 female applicants versus 62.06% of 825 male applicants were admitted.]
penguins %>%
na.omit() %>%
ggplot(aes(x=bill_length_mm, y=bill_depth_mm, color=species)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE) +
labs(x="Bill length", y="Bill Depth")
[Figure: bill depth against bill length with separate linear fits by species (Adelie, Chinstrap, Gentoo).]
a.
b.
c.
15
Statistical Learning

The original postulate is that there exists a relationship between a response variable 𝑌 and, jointly, a set of variables 𝑋 (independent variables, predictors, explanatory variables).
Then, the general form of the relationship between these variables is as follows:

$$Y = f(X) + \varepsilon$$
FIGURE 15.1: Instance of simulated Income data along with true 𝑓() and errors.
FIGURE 15.2: Instance of simulated Income data along with true 𝑓() and errors
(two predictors).
set.seed(2)
income <- read_csv("data/islr/Income1.csv") %>%
select(-X1, -Income) %>%
mutate(Income = 20 + 600* dnorm(Education, 22, 4) + rnorm(length(Education),0,4),
tIncome = 20 + 600* dnorm(Education, 22, 4))
income %>%
mutate(error = Income - tIncome)
## # A tibble: 30 x 4
## Education Income tIncome error
## <dbl> <dbl> <dbl> <dbl>
## 1 10 24.5 20.7 3.87
## 2 10.4 21.4 20.9 0.502
## 3 10.8 18.4 21.2 -2.84
## 4 11.2 18.0 21.6 -3.65
## 5 11.6 26.3 22.1 4.21
## 6 12.1 19.8 22.8 -3.01
## 7 12.5 17.8 23.5 -5.76
## 8 12.9 23.3 24.5 -1.14
## 9 13.3 21.5 25.6 -4.14
## 10 13.7 27.0 27.1 -0.113
## # ... with 20 more rows
There are two main reasons one would want to estimate 𝑓().
15.2.1 Prediction
On many occasions, the independent variables are known but the response is not. Therefore, $\hat{f}()$ can be used to predict these values. These predictions are noted

$$\hat{Y} = \hat{f}(X)$$
15.2.2 Inference
[Figure: Wage plotted against age, year, and education level (three panels).]
Statistical learning addresses a very large set of issues. The field has been expanding in recent decades thanks to the availability of computing power, data sets, new software, and theoretical developments. Here is a very short list of cases handled by data science.
15.4 Ideal f() vs f̂()
For a solution to be ideal or optimal, one has to first specify a criterion. In the context of prediction, for instance, a natural criterion arises.
If one wants to predict $Y$ given $X$, i.e., to get $\hat{Y} = \hat{f}(X)$, a common criterion used in statistical learning is the mean squared error, defined via the squared error:

$$\text{squared error} = (Y - \hat{Y})^2$$
It can be shown that, using that criterion, the optimal $f()$ minimizing the expected value of the squared error,

$$\min E\left[(Y - \hat{Y})^2 \mid X = x\right],$$

is given by

$$f(x) = E[Y \mid X = x]$$
15.5.1 Approaches
Parametric methods impose the functional form and estimate the parameters of that function. The simplest and most common of these is the linear model of the form:

$$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$
Non-parametric methods do not impose any functional form. But they have tun-
ing parameters, for instance, the level of smoothness.
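The contrast can be sketched with the built-in mtcars data (an illustration of the two approaches, not an example from this chapter): lm() imposes a line and returns its parameters, while loess() imposes no functional form but exposes a smoothness tuning parameter.

```r
fit.param <- lm(mpg ~ hp, data = mtcars)        # parametric: two parameters
coef(fit.param)                                 # beta0.hat and beta1.hat

fit.nonparam <- loess(mpg ~ hp, data = mtcars,  # non-parametric: no coefficients,
                      span = 0.75)              # only a smoothness tuning parameter
head(predict(fit.nonparam))                     # fitted values at the observed x's
```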
15.5.2 Trade-offs
The estimation techniques, as hinted in the discussion above, present the re-
searcher with various trade-offs:
Depending on the researcher's choices along these axes, different techniques are more or less appropriate.
From the examples above, and others, we can distinguish types of statistical
learning problems:
The most common measure of the quality of a regression fit is the mean squared error, MSE (or the square root of that number), given by

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2$$
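In R, the MSE of a fit follows directly from the formula; a sketch on the built-in mtcars data:

```r
fit <- lm(mpg ~ hp, data = mtcars)

# training MSE: average of the squared deviations from the fitted values
mse <- mean((mtcars$mpg - fitted(fit))^2)

# equivalently, the mean of the squared residuals
all.equal(mse, mean(residuals(fit)^2))
## [1] TRUE
```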
The U-shape of the test MSE curves is an important result in statistical learning. It derives from the following property. If the true model is $Y = f(X) + \varepsilon$ (with $f(x) = E[Y \mid X = x]$), then we can show

$$E\left[(y_0 - \hat{f}(x_0))^2\right] = Var(\hat{f}(x_0)) + \left(Bias(\hat{f}(x_0))\right)^2 + Var(\varepsilon)$$
In classification problems, one cannot calculate the MSE. Instead, the most common metric in the classification setting is the classification error rate,

$$\text{err}(\hat{C}, \text{data}) = \frac{1}{n}\sum_{i=1}^{n} I\left(y_i \neq \hat{C}(x_i)\right)$$
where 𝐼 is an indicator function taking the value 1 or 0. Thus, the error rate is the proportion of incorrect classifications.
$$I\left(y_i \neq \hat{C}(x_i)\right) = \begin{cases} 1 & y_i \neq \hat{C}(x_i) \\ 0 & y_i = \hat{C}(x_i) \end{cases}$$
Again in this setting, we often split the data into train and test sets. We can calculate the Train (Classification) Error, but the Test (Classification) Error is a better measure of how well a classifier will work on future unseen data.
$$\text{err}_{trn}(\hat{C}, \text{train data}) = \frac{1}{n_{trn}}\sum_{i\in trn} I\left(y_i \neq \hat{C}(x_i)\right)$$

$$\text{err}_{tst}(\hat{C}, \text{test data}) = \frac{1}{n_{tst}}\sum_{i\in tst} I\left(y_i \neq \hat{C}(x_i)\right)$$
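In R, the indicator and the average collapse into one expression, since the mean of a logical vector is the proportion of TRUEs; a sketch with made-up labels:

```r
y.obs  <- c("spam", "ham", "ham", "spam", "ham")   # observed classes
y.pred <- c("spam", "ham", "spam", "spam", "ham")  # classifier output

mean(y.obs != y.pred)  # one mistake out of five
## [1] 0.2
```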
Notice, however, that other criteria can be used depending on the type of error
that one wants to focus on.
15.9 Cross-Validation
Recall the fundamental issue of the training error not being a good guide for the test error. Figure 15.13 illustrates the problem and emphasizes the core trade-off for the optimal level of flexibility/complexity for a model.
This section focuses on how to achieve the minimal test error thanks to a set of methods based on holding out subsets of the training data.
The core idea is to achieve a better estimate of the error in the test data by eval-
uating the error in a subsample that was not used to train the model.
For illustration purposes in this section, I will mainly use the following example. We estimate a linear model for the relationship between a dependent variable, 𝑦, and an independent variable, 𝑥. In this example, 𝑦 is the miles per gallon and 𝑥 is the horsepower of the vehicle. Figure 15.14 plots the data at hand.
In order to take into account possible non-linearities, we estimate various models differing by the degree of the polynomial.
[Figure 15.14: miles per gallon (mpg) against horsepower.]
The difficulty here is to establish the best value for 𝑝, as illustrated in Figure
15.15.
FIGURE 15.15: Fits of mpg for various degrees of the polynomial of horsepower.
• randomly divide the available data into two groups, a training set and a validation set,
• estimate the model with the training data,
• apply the estimated model to the validation data and calculate the error.
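The three steps can be sketched in base R; the built-in mtcars data stands in here for the chapter's mpg-horsepower data:

```r
set.seed(42)
n     <- nrow(mtcars)
train <- sample(n, size = round(n / 2))             # 1. random split

fit <- lm(mpg ~ hp, data = mtcars, subset = train)  # 2. estimate on the training set

pred <- predict(fit, newdata = mtcars[-train, ])    # 3. error on the validation set
mean((mtcars$mpg[-train] - pred)^2)
```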
15.9.2 Leave-One-Out
This method shares the same approach as above but sets a specific choice for the validation data: each observation in turn serves as the validation data while the remaining observations form the training data. Figure 15.18 illustrates this technique.
Notice that this technique requires 𝑛 computations of the mean squared error, one for each observation. The estimate for the test mean squared error is simply the average of all of them.
$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \text{MSE}_i$$
This method can be computationally cumbersome. For our example, the results
are given in Figure 15.20.
15.9.3 𝑘-Fold
$$CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \text{MSE}_i$$
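The boot package's cv.glm() automates this computation; a hand-rolled sketch on the built-in mtcars data (again standing in for the chapter's data) makes the formula explicit:

```r
set.seed(1)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold labels

cv.mse <- sapply(1:4, function(p) {        # polynomial degrees 1 to 4
  fold.mse <- sapply(1:k, function(f) {
    fit  <- lm(mpg ~ poly(hp, p), data = mtcars[folds != f, ])
    pred <- predict(fit, newdata = mtcars[folds == f, ])
    mean((mtcars$mpg[folds == f] - pred)^2)
  })
  mean(fold.mse)                           # CV_(k): average of the k fold MSEs
})
cv.mse  # pick the degree with the smallest cross-validated MSE
```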
15.9.4 Comments
• One could wonder why use 𝑘 < 𝑛, given the speed of our modern computers. The reason is that the LOOCV method averages over many models that are based on essentially the same data. Hence, the average is over correlated values, which, in turn, increases the variance of the estimate of the test error. This might not be a desirable feature.
• In general, values of 𝑘 = 5 or 𝑘 = 10 are standard compromises.
Predictions are an essential part of the human experience as they permeate our day-to-day lives:
But predictions are more ubiquitous than these direct questions indicate. Indeed,
predictions are at the heart of judgment and decision making. Consider the fol-
lowing illustrations:
• “should parole be granted?”: the answer to that question depends on the committee's prediction about the future behavior of an inmate.
• “is this email spam?”: when services such as Gmail classify email as spam, they make a prediction about the classification that the user would make, had he/she read the email.
• “what candidate ought to be hired?”: the hiring process is based on the predic-
tion about each candidate’s future performance.
Seeing predictions as part of the judgment and decision process puts forward
the documented human fallibility in that matter.
Here, the background reference is the compelling literature on heuristics and biases, to which Kahneman and Tversky are two preeminent contributors.
In short, humans’ mental apparatus is prone to systematic and predictable er-
roneous judgments and decisions because of its reliance on (time-, energy-) effi-
cient but ultimately potentially misleading mental rules (heuristics) and biased
reasoning.
In that perspective, data science is a tool for the slow-thinking type of judgment. Because it follows rules (algorithms) that are separated from, though not independent of, human judgment, data science is potentially immune to these biases.
We encounter here a common theme: the struggle between humans and machines over intelligence status. As the top contender on the non-human side, AI is often depicted as an almighty force. It might be. Our understanding suggests a less grandiose vision of AI: it is simply the highest point on a scale that starts at heuristics and passes through algorithms. In other words, the true power of AI is its ability to make the best predictions.
A problem with that approach is that practitioners often aim at collecting the sweet fruits of causal statements without paying the harsh price of the modeling and the specification. As a consequence, the fruits of these empirical investigations, despite their glowing appearance, are actually not edible and sometimes even poisonous.
This is why we opt not to attempt incursions into the high spheres of inference but to remain on the ground of predictions.
15.12.1 Plus
It is important to understand data science even if you never intend to apply it yourself. Data-analytic
thinking enables you to evaluate proposals for data mining projects. For example, if an employee, a
consultant, or a potential investment target proposes to improve a particular business application by
extracting knowledge from data, you should be able to assess the proposal systematically and decide
whether it is sound or flawed. This does not mean that you will be able to tell whether it will actually
succeed - for data mining projects, that often requires trying - but you should be able to spot obvious
flaws, unrealistic assumptions, and missing pieces. (…)
The consulting firm McKinsey and Company estimates that “there will be a shortage of talent necessary
for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of
140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with
the know-how to use the analysis of big data to make effective decisions.” (Manyika, 2011). Why 10 times
as many managers and analysts than those with deep analytical skills? Surely data scientists aren’t so
difficult to manage that they need 10 managers! The reason is that a business can get leverage from a
data science team for making better decisions in multiple areas of the business. However, as McKinsey is
pointing out, the managers in those areas need to understand the fundamentals of data science to
effectively get that leverage.
Linear Regression
16
Simple Linear Regression
So far, the mean of a population has been treated as a constant, and we have
shown how to use sample data to estimate or to test hypotheses about this con-
stant mean.
In many applications, the mean of a population is not viewed as a constant, but
rather as a variable. For instance:
This formula implies that the mean sale price of 1 bedroom homes is $67’920;
the mean sale price of 2 bedroom homes is $85’840, and the mean sale price of 3
bedroom homes is $103’760.
We will now study these situations in which the mean of the population is treated
as a variable, dependent on the value of another variable.
We will learn how to use the sample data to estimate the relationship between
the mean value of one variable, 𝑌 , as it relates to a second variable, 𝑋 .
𝑌 is generally referred to as the response variable, outcome variable, dependent variable, or endogenous variable. 𝑋 is generally referred to as the covariate, regressor, explanatory variable, independent variable, or exogenous variable.
Examples include:
• A manager would like to know what mean level of sales (𝑌 ) can be expected
if the price (𝑋 ) is set at $10 per unit.
• If 250 workers (𝑋 ) are employed in a factory, how many units (𝑌 ) can be pro-
duced during an average day?
• If a developing country increases its fertilizer production (𝑋 ) by 1’000’000 tons,
how much increase in grain production (𝑌 ) should it expect?
• …
The relationship between 𝑋 and 𝑌 can take on linear or non-linear forms. Previ-
ously, we saw how the relationship between two variables can be described by
using scatter plots and correlation coefficients.
16.2 The Simple Linear Regression
$$Y = \underbrace{\beta_0 + \beta_1 X_1}_{\text{Deterministic component}} + \underbrace{\varepsilon}_{\text{Random error}}$$
This section builds around the example of the simple linear regression of sales on
the amount of TV advertising in the Advertising data set. To fix ideas, the linear
model estimated here is
sales = 𝛽0 + 𝛽1 × TV + 𝜀
We start by loading the data and manipulating it to make it usable. Figure 16.1 provides a scatter plot of the data.
16.2.2 Estimation in R
The estimation of the model is carried out with the function lm from the built-in stats package. The result of the estimation is an object assigned to a name.
The content of this linear regression object is best described with the function summary.
[Figure 16.1: scatter plot of sales against TV.]
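The assignment creating model.slr was lost in extraction; it can be reconstructed from the Call line of the summary output below:

```r
model.slr <- lm(sales ~ TV, data = advertising)
```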
summary(model.slr)
##
## Call:
## lm(formula = sales ~ TV, data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
[Figure 16.2: the linear fit of sales on TV and the resulting prediction errors.]
One of the main reasons for explaining the simple linear regression is its graphi-
cal appeal. Indeed, we can see what we are estimating. Figure 16.2 provides such
an illustration of the linear fit as well as its errors in prediction.
The least squares procedure obtains estimates of the linear equation coefficients
𝛽0 and 𝛽1 by minimizing the sum of the squared residuals 𝑒𝑖 :
$$\sum_{i=1}^{n}\hat{e}_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2$$
The coefficients 𝛽0̂ and 𝛽1̂ are chosen so that this sum is minimized. We use
differential calculus to obtain the coefficient estimators that minimize the sum
of squared residuals.
Early mathematicians struggled with the problem of developing a procedure for
estimating the coefficients for the linear equation. Various procedures have been
developed, but none has proven as useful or as popular as least squares regres-
sion. The coefficients developed using this procedure have very useful statistical
properties.
One way to decide quantitatively how well a straight line fits a set of data is to
note the extent to which the data points deviate from the line. Specifically, we
can calculate the magnitude of the deviations (i.e., the differences between the
observed and the predicted values of 𝑌 ).
These deviations, called residuals, are the vertical distances between observed
and predicted values. Note that for the best fitting line, the sum of residuals
equals 0 and so we square those residuals (or deviations) and compute the sum
of squares of the residuals.
By summing the squared residuals, we place a greater emphasis on large devi-
ations from the line. By shifting a ruler around the graph we can see that it is
possible to find many lines for which the sum of residuals is equal to 0.
However, it can be shown that there is one (and only one) line for which the sum
of the squared residuals is a minimum - the least squares line.
The “hats” indicate that the symbols are estimates of 𝐸[𝑌 |𝑋], 𝛽0 , and 𝛽1 , respec-
tively.
For a given data point (𝑥𝑖 , 𝑦𝑖 ), the observed value of 𝑌 is 𝑦𝑖 and the predicted
value of 𝑌 would be obtained by substituting 𝑥𝑖 into the prediction equation:
The deviation of the 𝑖𝑡ℎ value of 𝑦 from its predicted value is:
The sum of the squares of the deviations of the 𝑦 -values about their predicted
values for all the 𝑛 data points is:
$$\sum\left[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right]^2$$
The quantities 𝛽0̂ and 𝛽1̂ that make the sum of the squared residuals a minimum
are called the least squares estimates of the population parameters 𝛽0 and 𝛽1 .
The prediction equation
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
The OLS estimators are obtained from minimizing the sum of the squared resid-
uals:
$$\sum_{i=1}^{n}\hat{e}_i^2 = \sum_{i=1}^{n}\left[y_i - \hat{y}_i\right]^2 = \sum_{i=1}^{n}\left[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right]^2$$
$$\frac{\partial S(\hat{\beta}_0, \hat{\beta}_1)}{\partial\hat{\beta}_0} = \sum_{i=1}^{n} 2\left[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right](-1) = 0 \qquad (16.1)$$

$$\frac{\partial S(\hat{\beta}_0, \hat{\beta}_1)}{\partial\hat{\beta}_1} = \sum_{i=1}^{n} 2\left[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right](-x_i) = 0 \qquad (16.2)$$
$$\sum_{i=1}^{n} y_i - n\hat{\beta}_0 - \hat{\beta}_1\sum_{i=1}^{n} x_i = 0$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$
Therefore,
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{XY}}{S_X^2} = r\,\frac{S_Y}{S_X}$$
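The closed-form estimators can be checked against lm(); a sketch on the built-in mtcars data:

```r
x <- mtcars$hp
y <- mtcars$mpg

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

all.equal(unname(coef(lm(y ~ x))), c(b0, b1))  # matches lm()'s coefficients
## [1] TRUE

all.equal(b1, cor(x, y) * sd(y) / sd(x))       # slope as r * S_Y / S_X
## [1] TRUE
```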
16.6 Exercises
Exercise 16.1. Consider a regression predicting weight (kg) from height (cm) for
a sample of adult males. What are, respectively, the units of:
• the intercept,
• the slope.
For the following exercises, use the Advertising data set (download here1 ).
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖 .
a. Illustrate the fact that the linear fit passes through $(\bar{x}, \bar{y})$, i.e., the point defined by the mean of each variable. In other words, 𝛽0̂ and 𝛽1̂
satisfy:
𝑦 ̄ = 𝛽0̂ + 𝛽1̂ 𝑥̄
Hint: you can access a variable from a data frame, say df, by appending $ and the
name of the variable to df, e.g., df$TV.
b. Illustrate the fact that the sum of the OLS residuals is equal to 0, i.e.,
1 https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335655
$$\sum_{i=1}^{n}\hat{e}_i = 0$$
Hint: you can obtain the fitted values of an estimated model, say m1, by appending $fitted.values to its name, i.e., m1$fitted.values.
$$\hat{\beta}_1 = r\,\frac{s_y}{s_x}$$
where 𝑟 is the correlation between 𝑥 and 𝑦 , and 𝑠 is the sample’s standard devi-
ation of the given variable.
17
Multiple Linear Regression
In the simple linear regression model we considered that the dependent variable
was a function of a single independent variable.
But, in many practical economic, financial, and managerial situations, the depen-
dent variable is influenced by more than one factor:
where 𝛽0 is the intercept and 𝛽1 , 𝛽2 and 𝛽3 are the three parameters quantifying
the impact of the N. Bedrooms, Size and Year Built on Price.
Even though the multiple linear regression model is similar to the simple model,
the interpretation of some results is not exactly the same. The “House Price”
model now tries to capture the joint influence of three different factors, isolating the partial effect of each one.
Its general form is now,
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑘 𝑋𝑘 + 𝜀
The predicted value depends on the effect of the independent variables individ-
ually and their effect in combination with the other independent variables.
The coefficient 𝛽𝑗̂ estimates the change in 𝑌 , given a unit change in 𝑋𝑗 , while
controlling for the effect of the other independent variables.
Note that the model has now 𝑘 independent variables and 𝑘 + 1 parameters.
The parameter 𝛽𝑗 gives us the change in the expected value of 𝑌 ,
(𝐸[𝑌 |𝑋1 , 𝑋2 , … , 𝑋𝑘 ]), resulting from a unit increase in 𝑋𝑗 while holding
the other factors fixed (keeping the other independent variables unchanged):
$$\frac{\partial y}{\partial x_j} = \beta_j$$
17.2 OLS Estimated Model
𝑒𝑖̂ = 𝑦𝑖 − 𝑦𝑖̂
The (Ordinary) Least Squares (OLS) estimators are the solution to:
$$\min_{\hat{\beta}_j}\sum_{i=1}^{n}\hat{e}_i^2 = \sum_{i=1}^{n}\left[y_i - \hat{y}_i\right]^2$$
where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_k x_{ik}$.
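In R, additional regressors are simply added on the right-hand side of the formula; a sketch with the built-in mtcars data (not the chapter's house-price example):

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)
coef(fit)  # intercept, then the partial effect of hp holding wt fixed, and vice versa
```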
17.3 Exercises
For the following exercises, use the Advertising data set (download here1 ).
c. Illustrate the fact that the linear fit passes through the mean of each
variable.
Hint: you can access a variable from a data frame, say df, by appending $ and the
name of the variable to df, e.g., df$TV.
d. Based on the model that you estimated above, suppose that you want
to illustrate the linear fit in the sales (𝑦 ) - TV (𝑥) quadrant, i.e., with a
single line. What choice does it imply about the other explanatory vari-
ables?
e. Illustrate the fact that the sum of the OLS residuals is equal to 0, i.e.,
$$\sum_{i=1}^{n}\hat{e}_i = 0$$
Hint: you can obtain the fitted values of an estimated model, say m1, by appending $fitted.values to its name, i.e., m1$fitted.values.
1 https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335655
18
Assumptions
18.2 Assumption 0
The relation between 𝑌 and 𝑋 is linear in the parameters (in other words, the
𝑌 ’s are linear functions of 𝑋 plus a random error term).
This assumption addresses the functional form of the model. In statistics, a re-
gression model is linear when all terms in the model are either the constant or a
parameter multiplied by an independent variable.
The model equation is built by adding the terms together. These rules constrain
the model to one type:
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖
18.3 Assumption 1
𝐸[𝜀𝑖 |𝑋] = 0
In other words, the mean of the probability distribution of 𝜀𝑖 is 0. That is, the
average of the values of 𝜀𝑖 over an infinitely long series of experiments is 0 for
each setting of the independent variable 𝑋 .
This assumption implies that the mean value of 𝑌 , for a given value of 𝑋 , is:
𝐸[𝑦𝑖 |𝑋 = 𝑥𝑖 ] = 𝛽0 + 𝛽1 𝑥𝑖
18.4 Assumption 2
18.5 Assumption 3
The random errors 𝜀𝑖 are independent; they are not correlated with one another:
The values of 𝜀 associated with any two observed values of 𝑌 are independent.
That is, the value of 𝜀 associated with one value of 𝑌 has no effect on any of the
values of 𝜀 associated with any other 𝑌 values.
18.6 Assumption 4
𝐶𝑜𝑣(𝜀𝑖 , 𝑋𝑖 ) = 0
In other words, the 𝑋 values are fixed numbers, or realizations of the random
variable 𝑋 that are independent of the error terms, 𝜀𝑖 (𝑖 = 1, ..., 𝑛).
If an independent variable is correlated with the error term, we can use the in-
dependent variable to predict the error term, which violates the notion that the
error term represents unpredictable random error.
This assumption is also referred to as exogeneity. When this type of correlation
exists, and the assumption is violated, there is endogeneity.
18.7 Assumption 5
19
Goodness of the Fit

The objective of simple regression is to explain (part of) the variability of a dependent variable 𝑌 by an independent variable 𝑋 . That is to say that part of the observed changes in 𝑌 result from changes in 𝑋 .
The TSS is fixed and independent of the regression coefficients. We can use Figure
19.1 to get an illustration of the TSS which is calculated using the squares of the
distances shown in blue.
FIGURE 19.1: Using the mean as the best fit and the resulting residuals.
We can use Figure 19.2 to get an illustration of the RSS which is calculated using
the squares of the distances shown in red.
We expect 0 ≤ 𝐸𝑆𝑆 ≤ 𝑇 𝑆𝑆 . The ESS is the sum of the squared vertical distances between the fitted values on the estimated regression line and the mean $\bar{y}$.
Squaring both sides of the equality, summing for all sample elements and sim-
plifying, we obtain the decomposition of 𝑇 𝑆𝑆 into 𝐸𝑆𝑆 and 𝑅𝑆𝑆 .
$$\underbrace{\sum_{i=1}^{n}(y_i - \bar{y})^2}_{TSS} = \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{ESS} + \underbrace{\sum_{i=1}^{n}\hat{e}_i^2}_{RSS}$$
TSS = Total Sum of Squares, measures the variation of the 𝑦𝑖 values around
their mean 𝑦.̄
ESS = Explained Sum of Squares, measures the variation explained by the linear
regression model, that is, the variation attributable to the relationship between
𝑋 and 𝑌 .
RSS = Residual Sum of Squares, measures the amount of variation attributable
to factors other than the relationship between 𝑋 and 𝑌 .
The coefficient of determination (𝑅2 ) measures the proportion of the total vari-
ability of the dependent variable that is explained by the regression model, i.e.,
by the independent variable:
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$
Since 0 ≤ 𝐸𝑆𝑆 ≤ 𝑇𝑆𝑆, it follows that 0 ≤ 𝑅2 ≤ 1.
19.3.1 Adjusted 𝑅2
$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k}$$
We use this measure to correct for the fact that irrelevant independent variables will still result in some small reduction in the error sum of squares.
The adjusted 𝑅2 provides a better comparison between multiple regression mod-
els with different numbers of independent variables.
Recall that the population error 𝜀 is a random variable with zero mean and vari-
ance 𝜎2 . The variance of 𝜀 can be estimated using the residual sum of squares:
$$\hat{\sigma}^2 = \frac{RSS}{n-2}$$
with
$$RSS = \sum_{i=1}^{n} \hat{e}_i^2$$
The global quality of the model fit can also be assessed by the standard error
of the regression, which measures the variation of our observations around the
regression line.
$$\hat{\sigma} = \sqrt{\frac{RSS}{n-2}}$$
The standard error of the regression is also known as the standard error of the esti-
mate. It represents the average distance that the observed values fall from the
regression line. It tells us how wrong the regression model is on average, using the units of the response variable. Smaller values are better because they indicate that the observations are closer to the fitted line.
Unlike 𝑅2 , we can use the standard error of the regression to assess the preci-
sion of the predictions. Approximately 95% of the observations should fall within
±2 × 𝜎̂ from the regression line, which is also a quick approximation of a 95%
prediction interval. If we want to use a regression model to make predictions,
assessing the standard error of the regression might be more important than as-
sessing 𝑅2 .
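The decomposition and the measures above can be checked numerically. A minimal sketch in R on simulated data (all parameter values are assumed for illustration):

```r
# Sketch: TSS = ESS + RSS, R-squared and the standard error of the regression
# (simulated data; all parameter values are assumed for illustration)
set.seed(1)
n <- 50
x <- rnorm(n, mean = 100, sd = 25)
y <- 5 + 3 * x + rnorm(n, mean = 0, sd = 10)
m <- lm(y ~ x)

TSS <- sum((y - mean(y))^2)
ESS <- sum((fitted(m) - mean(y))^2)
RSS <- sum(residuals(m)^2)

all.equal(TSS, ESS + RSS)  # the decomposition holds
ESS / TSS                  # R-squared, matches summary(m)$r.squared
sqrt(RSS / (n - 2))        # standard error of the regression, summary(m)$sigma
```

The last two quantities agree with the `r.squared` and `sigma` components of `summary(m)`.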
20
Inference
20.1 Sampling Distributions of the 𝛽̂’s
Having developed estimators for the coefficients 𝛽0 and 𝛽1 and for 𝜎2 , we are
ready to make inferences about the population model. Specifically, we are inter-
ested in computing confidence intervals and conducting hypothesis tests for the
parameters of interest.
Recall our population model in the simple regression model,
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖
In short, and given the normality of the errors 𝜀𝑖 , the sampling distributions of
𝛽0̂ and 𝛽1̂ are:

$$\hat{\beta}_0 \sim N\left(\beta_0,\; \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{nS_X^2}\right]\right)$$

and

$$\hat{\beta}_1 \sim N\left(\beta_1,\; \frac{\sigma^2}{nS_X^2}\right)$$
Under assumptions A.0 to A.4, the OLS estimators are BLUE - they are the best
(lowest variance) among all linear estimators which are unbiased (Gauss-Markov
Theorem).
Notice that 𝜎2 is typically unknown and must be estimated. Hence, the sampling
distribution becomes,
$$\hat{\beta}_1 \sim N\left(\beta_1,\; \frac{\hat{\sigma}^2}{nS_X^2}\right)$$
It is important to note that the variance of 𝛽1̂ depends on two important quanti-
ties:
• The distance of the points from the regression line measured by ∑ 𝑒2𝑖̂ : higher
values imply greater variance for 𝛽1̂ .
• The total deviation of the 𝑋 values from the mean, which is measured by 𝑆𝑋²: greater deviations in the 𝑋 values and larger sample sizes result in smaller variance for 𝛽1̂.
20.2 Estimating 𝜎2
Note that the variance of both estimators depends on the error variance 𝜎2, a
population parameter, which is typically unknown.
Therefore, we will need an estimate of 𝜎2 . Previously, we have learned that we
can use
$$\hat{\sigma}^2 = \frac{\sum \hat{e}_i^2}{n-2}$$
as an estimator of 𝜎2 . It can be shown that:
20.3 Inference on the Slopes
$$E\left[\hat{\sigma}^2\right] = E\left[\sum_{i=1}^{n}\hat{e}_i^2/(n-2)\right] = \sigma^2$$
and
$$\frac{(n-2)\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-2)$$
The typical test on the slope is 𝐻0 ∶ 𝛽1 = 0 vs 𝐻𝑎 ∶ 𝛽1 ≠ 0.
Given that 𝛽1̂ is normally distributed, our test statistic will be:
$$t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}}$$
which we will compare against the appropriate critical value from a 𝑡 distribu-
tion with (𝑛 − 2) degrees of freedom.
Inference on the slope will then follow the usual rules for tests of hypothesis for
the mean of a single variable. For instance, if the test is bilateral, as it almost
always is in this context, we have
$$\text{Reject } H_0 \text{ if } \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} < -t_{(n-2),\alpha/2} \text{ or } \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} > t_{(n-2),\alpha/2}$$
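The rejection rule can be checked by hand against the output of summary(). A sketch on simulated data (parameter values assumed for illustration):

```r
# Sketch: two-sided t-test of H0: beta1 = 0 computed by hand (assumed DGP)
set.seed(2)
n <- 50
x <- rnorm(n, mean = 100, sd = 25)
y <- 5 + 3 * x + rnorm(n, mean = 0, sd = 10)
s <- summary(lm(y ~ x))

beta1.hat <- coef(s)["x", "Estimate"]
se.beta1  <- coef(s)["x", "Std. Error"]
t.stat <- beta1.hat / se.beta1        # matches coef(s)["x", "t value"]
crit <- qt(1 - 0.05 / 2, df = n - 2)  # critical value for alpha = 0.05
abs(t.stat) > crit                    # TRUE here: reject H0
```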
21
Categorical Predictors
21.1 Introduction
$$x = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}$$
𝑦 = 𝛽 0 + 𝛽1 𝑥 + 𝜀
𝛽0 represents the mean response associated with the level of the qualitative vari-
able assigned the value 0 (called the base level). In this example, it represents the
mean salary for females 𝜇𝐹 .
𝛽1 represents the difference between the mean response for the level assigned
the value 1 and the mean for the base level. In this example, it represents the
difference between the mean salary for males and the mean salary for females,
𝜇𝑀 − 𝜇 𝐹 .
$$D_A = \begin{cases} 1 & \text{if level A} \\ 0 & \text{if level B} \end{cases}$$
𝑦 = 𝛽0 + 𝛽1 𝐷𝐴 + 𝛾𝑍 + 𝜀
where 𝑍 is a set of other relevant regressors and 𝛾 is the coefficient associated with it.
21.2 Including a Dummy with Multiple Levels
The estimated model then provides predictions for all the levels of the categorical
variable:
• the group for which the dummy 𝐷𝐴 of the categorical variable is 1, 𝐴 in this
case, and,
• the group for which the dummy 𝐷𝐴 of the categorical variable is 0, 𝐵 in this
case.
$$D_A = \begin{cases} 1 & \text{if level A} \\ 0 & \text{if not level A} \end{cases}
\qquad
D_B = \begin{cases} 1 & \text{if level B} \\ 0 & \text{if not level B} \end{cases}
\qquad
D_C = \begin{cases} 1 & \text{if level C} \\ 0 & \text{if not level C} \end{cases}$$
The third dummy, 𝐷𝐶, is superfluous because the observations for level 𝐶 are already implicitly defined, i.e., they are those for which 𝐷𝐴 = 𝐷𝐵 = 0. In that case, we say that level 𝐶 is the reference level.
The estimated model now becomes,
𝑦 = 𝛽0 + 𝛽1 𝐷𝐴 + 𝛽2 𝐷𝐵 + 𝛾𝑍 + 𝜀
The estimated model then provides predictions for all the levels of the categorical variable:
• the group for which the dummy 𝐷𝐴 of the categorical variable is 1, 𝐴 in this case,
• the group for which the dummy 𝐷𝐵 of the categorical variable is 1, 𝐵 in this case, and
• the group for the reference level, 𝐶 in this case.
If a model requires it, then multiple dummy variables can be included. For in-
stance, we can add to the model above a dummy about the gender,
$$D_G = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}$$
𝑦 = 𝛽0 + 𝛽1 𝐷𝐴 + 𝛽2 𝐷𝐵 + 𝛽3 𝐷𝐺 + 𝛾𝑍 + 𝜀
Notice that the interpretation of the coefficients is the same as above. However, we can now create more subgroups, e.g., 𝐴-female, 𝐵-female, 𝐵-male, etc.
The general principle of dummy variables can be extended to cases where there are several (but not infinite) discrete groups/categories. In general, define a dummy variable for each category. For instance, if the number of groups is 3 (North, Midlands, South), then define a dummy variable for each region.
However, as a rule we always include one less dummy variable in the model than there are categories; otherwise we will introduce multicollinearity into the model. For a qualitative variable with 𝑘 levels, use 𝑘 − 1 dummy variables.
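In R, the 𝑘 − 1 rule is applied automatically when a factor enters a formula. A small sketch with hypothetical data:

```r
# Sketch: R creates k - 1 dummies for a factor with k levels (hypothetical data)
region <- factor(c("North", "Midlands", "South", "South", "North", "Midlands"))
y <- c(10, 12, 9, 11, 13, 8)
m <- lm(y ~ region)
model.matrix(m)  # intercept plus 2 dummies; "Midlands" (first level) is the base
```

By default, R takes the first level in alphabetical order as the reference level; relevel() changes it.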
21.5 Exercises
Exercise 21.1. Consider the dummy variables defined in Section 21.2. Suppose
the estimation of the (true) model finds the following values: 𝛽0̂ = 10, 𝛽1̂ = 5,
𝛽2̂ = 1, and 𝛾̂ = 0.
Now suppose that you estimate a model based on the same variables but where
you include 𝐷𝐶 .
𝑦 = 𝛼0 + 𝛼1 𝐷𝐴 + 𝛼3 𝐷𝐶 + 𝛾𝑍 + 𝜀
Place of sale: specialized wine shop or any other place (e.g., supermarket)
The estimated model has the following specification (with 𝜀 being a random
component):
𝑝 = 𝛽0 + 𝛽1 𝑇1 + 𝛽2 𝑇2 + 𝛽3 𝑅1 + 𝛽4 𝑅2 + 𝛽5 𝑅3 + 𝛽6 𝑆1 + 𝜀
𝛽0̂ = 5.43, 𝛽1̂ = 1.21, 𝛽2̂ = 1.54, 𝛽3̂ = 0.73, 𝛽4̂ = −0.22, 𝛽5̂ = 0.15, 𝛽6̂ = 1.02.
a. Calculate the estimated price of a bottle of white wine from the Dão region sold in a supermarket.
b. Calculate the estimated price of a bottle of ‘rosé’ wine from a region outside Douro, Alentejo or Dão sold in a specialized wine shop.
𝑝 = 𝛽0 + 𝛽1 𝑇1 + 𝛽2 𝑇2 + 𝛽3 𝑅1 + 𝛽4 𝑅2 + 𝛽5 𝑅3 + 𝛽6 𝑆1 + 𝛽7 𝑌 + 𝜀
c. What does the coefficient 𝛽7 measure? How would you interpret its
value?
d. Suggest an alternative way of incorporating in the regression model
a variable based on the year of production. Explain your choice.
22
Simulating Violations of Assumptions
22.1 Introduction
• Since the data is simulated, we know the real parameters of the model.
• We estimate a linear regression for that model.
• Finally we evaluate how well the linear regression estimation performed by
comparing various measures to what should be expected if the assumptions of
the linear model were satisfied.
22.2 Best Case Scenario
The first exercise is the best case scenario, where all the assumptions are satisfied and we try to uncover the true model.
I start by illustrating the case with one sample. The data is generated with the following parameters. Figure 22.1 gives an example of a simulated sample.
set.seed(43)
n <- 50
x <- rnorm(n, mean= 100, sd=25)
epsilon <- rnorm(n, mean=0, sd=10)
y <- 5 + 3*x + epsilon
I now try to uncover the model. Sure enough, the results are excellent, meaning that the linear regression model works wonders when the true model is indeed linear, see Figure 22.2.
FIGURE 22.2: Scatter plot of simulated data in best case scenario along with true relationship (red) and OLS fit (blue).
n <- 50
beta0 <- 5
beta1 <- 3
mean.x <- 100
sd.x <- 20
mean.error <- 0
sd.error <- 10
n.s <- 1000                # number of simulated samples; value assumed, not shown in the source
beta1.hat <- rep(NA, n.s)  # container for the estimated slopes
for (i in 1:n.s){
  x <- rnorm(n, mean = mean.x, sd = sd.x)
  epsilon <- rnorm(n, mean = mean.error, sd = sd.error)
  y <- beta0 + beta1 * x + epsilon
  m0 <- lm(y ~ x)
  beta1.hat[i] <- m0$coefficients[2]
}
A summary of the values obtained is the following (see Figure 22.3 for a visual
representation)
summary(beta1.hat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.759 2.950 2.994 2.994 3.041 3.190
Nothing prevents us from simulating a richer model. The code is messier because
I want explanatory variables to have some degree of correlation.
# two variables
library(MASS)
n <- 50
my.means <- c(100, 100)
# off-diagonal = cov
# corr <- cov/(sqrt(var1*var2))
sd1 <- 80
sd2 <- 40
r <- 0.6
my.cov <- r * sd1 * sd2              # assumed reconstruction: the definition
my.sigma <- matrix(c(sd1^2, my.cov,  # of my.sigma is missing from the source
                     my.cov, sd2^2), 2, 2)
X <- mvrnorm(n,
             mu = my.means,
             Sigma = my.sigma,
             empirical = TRUE)
x1 <- X[,1]
x2 <- X[,2]
epsilon <- rnorm(n, mean = 0, sd = 10)
y <- 20 + 1.2*x1 + 7.4*x2 + epsilon
22.3.1 𝑟>0
# two variables
library(MASS)
n <- 50
my.means <- c(100, 100)
# off-diagonal = cov
# corr <- cov/(sqrt(var1*var2))
sd1 <- 80
sd2 <- 40
r <- 0.6
my.cov <- r * sd1 * sd2              # assumed reconstruction (not in the source)
my.sigma <- matrix(c(sd1^2, my.cov,
                     my.cov, sd2^2), 2, 2)
X <- mvrnorm(n,
             mu = my.means,
             Sigma = my.sigma,
             empirical = TRUE)
x1 <- X[,1]
x2 <- X[,2]
##
## Residuals:
## Min 1Q Median 3Q Max
## -543.91 -166.73 -34.65 204.62 449.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 540.2932 54.7800 9.863 3.97e-13 ***
## x1 3.3811 0.4294 7.873 3.43e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 240.5 on 48 degrees of freedom
## Multiple R-squared: 0.5636, Adjusted R-squared: 0.5545
## F-statistic: 61.99 on 1 and 48 DF, p-value: 3.425e-10
22.3.2 𝑟<0
n <- 50
my.means <- c(100, 100)
sd1 <- 80
sd2 <- 40
r <- -0.6
my.cov <- r * sd1 * sd2              # assumed reconstruction (not in the source)
my.sigma <- matrix(c(sd1^2, my.cov,
                     my.cov, sd2^2), 2, 2)
X <- mvrnorm(n,
             mu = my.means,
             Sigma = my.sigma,
             empirical = TRUE)
x1 <- X[,1]
x2 <- X[,2]
22.4 Incorrect Specification Issue
Here, I attempt a simulation of a model where the true relationship among the variables is not linear.
n <- 50
mean.x <- 5
sd.x <- 5
x <- rnorm(n, mean = mean.x, sd = sd.x)
mean.epsilon <- 0
sd.epsilon <- 7
epsilon <- rnorm(n, mean = mean.epsilon, sd = sd.epsilon)
beta0 <- 20
beta1 <- 2
How did the linear regression perform in this case? Poorly, as shown in Figure 22.5.
m0 <- lm(y ~ x)
summary(m0)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.70 -52.02 -19.10 21.39 239.60
##
FIGURE 22.5: Scatter plot of sample with non-linear relationship along with OLS fit (blue).
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.849 11.754 4.326 7.65e-05 ***
## x 14.322 1.803 7.945 2.67e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.03 on 48 degrees of freedom
## Multiple R-squared: 0.568, Adjusted R-squared: 0.559
## F-statistic: 63.12 on 1 and 48 DF, p-value: 2.671e-10
23
Relevant Applications
23.1.1 Abstract
This paper examines the value of connections between German industry and
the Nazi movement in early 1933. Drawing on previously unused contemporary
sources about management and supervisory board composition and stock re-
turns, we find that one out of seven firms, and a large proportion of the biggest
companies, had substantive links with the National Socialist German Work-
ers’ Party. Firms supporting the Nazi movement experienced unusually high
returns, outperforming unconnected ones by 5% to 8% between January and
March 1933. These results are not driven by sectoral composition and are robust
to alternative estimators and definitions of affiliation.
1
https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335654
We systematically assess the value of prior ties with the new regime in 1933. To
do so, we combine two new data series: A new series of monthly stock prices,
collected from official publications of the Berlin stock exchange, and a second
series that uses hitherto unused contemporary data sources, in combination with
previous scholarship, to pin down ties between big business and the Nazis. We
consider both active managers (the Vorstand) and supervisory board members
(Aufsichtsrat).
We thus try to offer a quantitative answer to the question, How much was it
worth to have close, early connections with the Nazi party?
We identify businessmen and firms as connected to the NSDAP if they meet ei-
ther of two criteria. First, if business leaders or firms contributed financially to
the party or to Hitler or Göring, they qualify as connected. Second, certain busi-
nessmen provided political support for the Nazis at crucial moments, serving
on (or helping to finance) various groups that advised the party or Hitler on economic policy. We also count the latter as connected. Appendix I lists all relevant
individuals and firms, along with notes on the main scholarly sources for each.
The first group includes early contributors such as Thyssen and Kirdorf. […] In
the second group are businessmen whose ties to the party also pre-dated Feb. 20.
It includes the signatories of a famous petition to President Hindenburg, urging
him to appoint Hitler as Chancellor.
However, adding the betas to the basic regression setup as an additional explanatory variable does not change our main result.
The lower panel of Table III [Figure 23.2] documents significant outperformance
over the period from mid-January to mid-March. Nazi affiliated firms saw their
prices increase by almost 7% more than the rest.
(Hansen 1978, App. 6, 10). As Eq. (1), Table VI, shows, being on the Reichswehr list produced a positive return of 6.5% after January 30, but the coefficient is not significant.
(…) In this subsection, we show that our results are not sensitive to alternative
definitions of connection with the Nazi party.
(…) Using 73’815 possible combinations of regressors - including all sector dummies, market capitalization, the dividend yield, the Jewish dummy, size quintiles, Reichswehr association, beta, and twenty dummies of regional origin - the smallest coefficient we obtain for the Nazi variable is 0.059 (t-statistic 3.1) and the biggest is 0.11 (t-statistic 5.8). Despite using a large number of possible combinations of regressors, we consistently find a statistically significant and economically meaningful coefficient.
24
Linear Regression Lab
24.1.1 Estimation
library(MASS)
data(Boston)
# ?Boston
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
m1 <- lm(medv ~ lstat, data = Boston)
summary(m1)
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
24.1.2 Names
names(m1)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
24.1.3 Prediction
24.1.4 Plotting
library(ggplot2)
library(magrittr)
library(dplyr)
Boston %>%
mutate(fit.m = m1$coef[1] + m1$coef[2]*lstat,
fit.f = m1$fitted.values,
fit.p = predict(m1)) %>%
ggplot(aes(x=lstat, y=medv)) +
geom_point() +
geom_line(aes(y=fit.m), color="blue")
Boston %>%
ggplot(aes(x=lstat, y=medv)) +
geom_point() +
geom_smooth(method = "lm")
24.3 Estimation
24.3.1 Prediction
24.3.2 Plotting
Boston %>%
  mutate(fit.p = predict(m2)) %>%
  ggplot(aes(x=lstat, y=medv)) +
  geom_point() +
  geom_line(aes(y=fit.p), color="blue")
Boston %>%
mutate(fit.p = predict(m2, fit.data),
fit.ps = predict(m1)) %>%
ggplot(aes(x=lstat, y=medv)) +
geom_point() +
geom_line(aes(y=fit.p), color="blue") +
geom_line(aes(y=fit.ps), color="red")
24.4.1 Estimation
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.09412 0.56067 60.809 < 2e-16 ***
## lstat -0.94061 0.03804 -24.729 < 2e-16 ***
## chas 4.91998 1.06939 4.601 5.34e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.095 on 503 degrees of freedom
## Multiple R-squared: 0.5626, Adjusted R-squared: 0.5608
## F-statistic: 323.4 on 2 and 503 DF, p-value: < 2.2e-16
24.4.2 Plotting
Boston %>%
mutate(fit.p1 = predict(m3, fit.data1),
fit.p0 = predict(m3, fit.data0)) %>%
ggplot(aes(x=lstat, y=medv)) +
geom_point() +
geom_line(aes(y=fit.p1), color="blue") +
geom_line(aes(y=fit.p0), color="red")
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.549 -3.995 -1.202 1.972 25.305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.30325 0.68752 48.439 < 2e-16 ***
## lstat -0.90329 0.04123 -21.908 < 2e-16 ***
## ctaxlow 2.94649 0.84310 3.495 0.000516 ***
## ctaxmedium 1.26345 0.73136 1.728 0.084688 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.15 on 502 degrees of freedom
## Multiple R-squared: 0.5556, Adjusted R-squared: 0.5529
## F-statistic: 209.2 on 3 and 502 DF, p-value: < 2.2e-16
24.5.1 Estimation
24.5.2 Plotting
Boston %>%
mutate(fit.f = m5$fitted.values) %>%
ggplot(aes(x=lstat, y=medv)) +
geom_point() +
geom_line(aes(y=fit.f), color="blue") +
geom_smooth(method = "lm", se=FALSE, color="red")
Part VIII
Classification
25
Limited Dependent Variables
We describe here a method for building models when the dependent variable is a categorical variable with two possible values, e.g., 0 or 1, yes or no. It is referred to as the logistic model.
Extensions to more than two values exist, i.e., extensions with multinomial variables, either ordered (e.g., number of stars given to a movie) or unordered (e.g., preferred means of transportation). In that sense, the logistic model is a particular case.
The general group of methods to which the present one belongs is called classi-
fiers. This is because these tools are designed to assign each observation to a class
(or category). One of the most popular classifiers, one that you might have heard
of, is neural networks1 .
A sub-group of classifiers, including the logistic model, achieves the classifica-
tion by first estimating the probabilities of each observation of belonging to each
class and then, naturally, assigning the observation to the class for which the pre-
dicted probability is the highest. In that perspective, this sub-group behaves like
the regression models discussed previously, but with continuous probabilities
as dependent variables. This is why this sub-group is referred to as generalized
linear models.
This subsection introduces the data used to illustrate the techniques in this chap-
ter.
require(ISLR)
require(tidyverse)
data("Default")
as_tibble(Default)
## # A tibble: 10,000 x 4
## default student balance income
## <fct> <fct> <dbl> <dbl>
## 1 No No 730. 44362.
## 2 No Yes 817. 12106.
## 3 No No 1074. 31767.
## 4 No No 529. 35704.
## 5 No No 786. 38463.
1
https://youtu.be/aircAruvnKk
Notice that the response variable here is default, recording whether or not the individual defaulted on the credit card. Hence, it is a categorical variable (also called a factor in R).
We start by providing some plots of the data.
Explanatory variables, 𝑋, explain the outcome 𝑦 = 0 or 𝑦 = 1. However, since no other value is possible for 𝑦, the prediction of the model can be interpreted as a probability. The relationship between the explained variable 𝑦 and the explanatory 𝑋 could then be illustrated as in Figure 25.2.
Therefore, we will model the relationship as
𝑝 = 𝑃 (𝑦 = 1|𝑋) = 𝐹 (𝑋𝛽)
Beyond these two functions, a candidate is simply the linear regression model,
which we describe first.
25.3 OLS: the Linear Probability Model (LPM)
We might first think of using the OLS estimator to discover the DGP at hand. In
terms of the expression above, we would have:
𝐹 (𝑋𝛽) = 𝑋𝛽
𝑦 = 𝑋𝛽 + 𝜀 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑘 𝑋𝑘 + 𝜀 (25.1)
where
The OLS fit would then look like the one represented in Figure 25.3.
Technically, the estimated values are interpreted as probabilities that are used for
classification. For instance, we could have
$$\varepsilon = \begin{cases} 1 - X\beta & \text{with probability } X\beta \\ -X\beta & \text{with probability } 1 - X\beta \end{cases} \tag{25.3}$$
Hence,
Therefore, the variance of the error term in this model is not constant but depends
on 𝑋 , a violation of Assumption 2, see Section ??. The estimate of 𝛽 might be
consistent but won’t be efficient.
As seen in the Figure 25.3, predicted values of 𝑦 might lie outside the (0, 1) range.
Figure 25.4 illustrates that point.
Default %>%
mutate(fit_lm= cl.lm$fitted.values) %>%
ggplot(aes(x=balance, y=default)) +
geom_point() +
geom_line(aes(y=fit_lm), color="blue")
Clearly, the probability interpretation is jeopardized in such cases. It also calls for a better way to fit the data, through the use of appropriate cumulative distribution functions. For instance, the second fit in Figure 25.5 would be a better choice.
Both models give similar results. In practice, the choice between the two functional forms is often arbitrary.
25.4.1 Probit
$$p = P(y = 1|X) = \int_{-\infty}^{X\beta} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}t^2}\, dt = \int_{-\infty}^{X\beta} \phi(t)\, dt = \Phi(X\beta) \tag{25.4}$$
where 𝜙(⋅) and Φ(⋅) are the pdf and the cdf of a normal distribution, respectively.
25.4.2 Logit
$$p = P(y = 1|X) = \frac{e^{X\beta}}{1 + e^{X\beta}} = \Lambda(X\beta) \tag{25.5}$$
where Λ(𝑋𝛽) is the cdf for the logistic distribution. Its pdf is Λ(𝑋𝛽)(1 −
Λ(𝑋𝛽)).
Naturally, the prediction of the model will always be an S-shaped curve in the [0, 1] range, no matter the value of the explanatory variable, as shown in Figure 25.6.
The logit model is often preferred in some fields of research for the following
particular feature. We can rewrite the expression above to get,
$$\frac{p}{1-p} = e^{X\beta}$$
This ratio is called the odds. If the model estimates 𝑝̂ = 0.2, then the odds are 1/4 (1 to 4, or 1 in 5) since 0.2/(1 − 0.2) = 1/4. Moreover, taking logs of the odds, we obtain the log-odds,
$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$
This last expression, in turn, shows that the logit model is a linear model of the
log-odds.
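In R, both models are estimated with glm() and the binomial family. A sketch on simulated data (the DGP values are assumed; the Default data above would work the same way):

```r
# Sketch: logit and probit fits with glm() on simulated data (assumed DGP)
set.seed(4)
n <- 1000
balance <- rnorm(n, mean = 1000, sd = 400)
p <- 1 / (1 + exp(-(-8 + 0.006 * balance)))  # true logit probabilities
default <- rbinom(n, 1, p)
cl.logit  <- glm(default ~ balance, family = binomial)                 # logit
cl.probit <- glm(default ~ balance, family = binomial(link = "probit"))
coef(cl.logit)  # coefficients on the log-odds scale
```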
25.4.3 Illustration
25.5 Estimation
$$\frac{\partial P(y=1)}{\partial X} = \begin{cases} \beta & \text{LPM} \\ \phi(X\beta)\cdot\beta & \text{probit} \\ \Lambda(X\beta)\cdot(1-\Lambda(X\beta))\cdot\beta & \text{logit} \end{cases}$$
Notice that 𝛽 is not the measure of the marginal effect. In fact, the impact of a
variable on the probability depends both on the coefficients and on the actual
values of the variables.
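As a sketch of the logit row above (simulated data, assumed DGP values), the average marginal effect can be computed by averaging Λ(𝑋𝛽)(1 − Λ(𝑋𝛽)) ⋅ 𝛽 over the sample:

```r
# Sketch: average marginal effect in a logit model (assumed DGP values)
set.seed(5)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, 1 / (1 + exp(-(0.5 + 1.2 * x))))
m.logit <- glm(y ~ x, family = binomial)
p.hat <- predict(m.logit, type = "response")           # Lambda(Xb)
ame <- mean(p.hat * (1 - p.hat) * coef(m.logit)["x"])  # average marginal effect
ame  # smaller than the raw coefficient
```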
A natural prediction for a binary variable is also binary. Define the following confusion matrix:

Pred. \ Act.    0      1
     0          𝑛1     𝑛2
     1          𝑛3     𝑛4
where 𝑛1 and 𝑛4 are the numbers of correctly predicted 0’s and 1’s, respectively, while 𝑛2 and 𝑛3 are the numbers of misclassified observations. Denote by 𝑛 the total number of observations.
Goodness of fit measures can then be calculated based on the confusion matrix,
depending on the most particular perspective for the case at hand. Among them,
we could mention, assuming 0 is the ‘positive’ category:
• accuracy: (𝑛1 + 𝑛4)/𝑛,
• sensitivity: 𝑛1/(𝑛1 + 𝑛3),
• specificity: 𝑛4/(𝑛2 + 𝑛4).
Again in this setting, we should distinguish between train and test measures and
use cross-validation to better estimate the latter.
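A sketch of the confusion matrix and the accuracy computed by hand, on simulated data with an assumed DGP (the caret package loaded below provides confusionMatrix() for the same purpose):

```r
# Sketch: confusion matrix and accuracy by hand (simulated data, assumed DGP)
set.seed(6)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, 1 / (1 + exp(-(0.2 + 2 * x))))
m <- glm(y ~ x, family = binomial)
pred <- as.integer(predict(m, type = "response") > 0.5)
cm <- table(Predicted = pred, Actual = y)
cm
sum(diag(cm)) / sum(cm)  # accuracy: share of correctly classified observations
```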
25.8 An Example
##
## Number of Fisher Scoring iterations: 4
##
## Number of Fisher Scoring iterations: 5
library(caret)
library(e1071)
Intermezzo
26
Presentations
This section gathers a few notes on the presentations that students are asked to
perform in class. At the outset, please note that I shall limit the discussion to
some selected aspects, in particular aspects related to the plan of the presenta-
tion. Therefore, I shall not attempt a full discussion on best practices for presen-
tations.
FIGURE 26.1: Example of usual plan for presentation (Source: wiley.com (6 tips
for giving a fabulous academic presentation)).
A tentative alternative plan that students are encouraged to follow is the follow-
ing.
• It is wise to add some points if one wants to convince the audience of the con-
clusions reached. Usually useful are the following.
– robustness checks (what could be wrong… but it is not because the author
checked that the main results are immune to the possible problems),
– comparison with alternative results in the literature,
– implications for general understanding/ policy/ future research,
– Q&A…
Part X
Causality Claims
Why
Causal claims relating variables are of an extreme kind. They manage to be:
The following chapters gather some thoughts about making causal claims. For a
deep take on the issue, see the recent (!) contributions by Judea Pearl, e.g., Pearl
and Mackenzie (2018).
27
Sample Bias
Sample bias in an analysis arises when the data/sample was selected in a way that affects the answer to the research question, and therefore does not allow one to answer it reliably. This typically happens when the selected data is not representative of the population targeted by the research question.
There are several sources for this issue such as,
• non-random sampling,
• self-selection,
• survivorship bias,
• …
The following cases provide some illustrations while showing its relevance and
its ubiquity.
Several theses that I came to evaluate contain survey data obtained from Face-
book friends of the author. Clearly, this jeopardizes representativeness.
FIGURE 27.1: President Truman holding a copy of the Chicago Daily Tribune,
November 1948.
27.3 Self-Selection
When AIDS became a serious concern, in the 80’s, health officials realized the
lack of evidence on the sexual behavior of individuals. This knowledge would
prove crucial, for instance, to predict the spread of STDs.
Since then, several countries have conducted surveys on that topic, with questions such as how many sexual partners people report having had in their lifetime.
Consider the fact that the response rate is typically below 100%, say 60-70%, because some individuals decide to participate while others decide not to. One should clearly be concerned with potential biases in the calculation of the sampling distribution of any statistic based on the responses to the survey.
27.3.2 Heights
The apparent decline in heights in the United States, Great Britain, Sweden, and Habsburg - era central
Europe is indeed interesting, yet we question the reliability of the evidence adduced for this apparent
decline. These countries had fundamentally different economies at the time of their height reversals, but
they shared an important feature: they filled their military ranks with volunteers rather than conscripts.
A volunteer sample, which is the predominant type of sample in the literature, is selected in the sense
that such samples contain only individuals who chose to enlist in the military. Elsewhere we have shown
that the problem of inferring changes in population heights from a selected sample of volunteers can be
grave (Bodenhorn, Guinnane, and Mroz 2014). The implications of selection bias render the observed
“shrinking in a growing economy” less of an anomaly (Komlos 1998a). As the economy grows, the
outside option of military service becomes less attractive, especially to the productive and the tall.
Military heights declined because tall people increasingly chose non-military employment. Thus, we
cannot really say whether population heights declined; we can only be confident that the average height
of those willing to enlist in the military declined.
Consider the brief description offered in the web page of the popular Tim Ferriss
Show2 .3
Each episode, I deconstruct world-class performers from eclectic areas (investing, sports, business, art,
1
This is neither an endorsement of the show… nor a critique of the show.
2
https://tim.blog/podcast/
3
https://tim.blog/podcast/
etc.) to extract the tactics, tools, and routines you can use. This includes favorite books, morning
routines, exercise habits, time-management tricks, and much more.
From a statistical point of view, this admitted goal of the show, in italics (my
emphasis), is clearly a doubtful one.
This little video on BBC4 further illustrates the point.
The evidence we have about our prehistoric ancestors is based on artifacts that have come down to us, e.g., paintings. But these should not be considered representative of the real life of these people.
4
https://www.bbc.com/reel/video/p088rp00/the-dangers-of-idolising-successful-people
28
Endogeneity
This barbarous term is actually a star in economics. The reason for that is its
rank as Number-One-Threat to the validity of an estimated model. Recall that
its mathematical description amounts to a simple formulation,
𝐶𝑜𝑣(𝜀, 𝑋) ≠ 0
A model suffers from an endogeneity issue when the explanatory variable is cor-
related with the error term. The consequence of that correlation is dramatic. For
instance, in the linear regression model, the estimated coefficient in the defective
model will not converge to the true parameter of the relationship.
There are several causes of endogeneity, including:
• omitted regressor,
• measurement error,
• omitted common source,
• omitted selection,
• simultaneity,
• …
Importantly, notice that this is not primarily a highly technical issue. It is, above all, a defective way of making causal claims.
𝑦 = 𝛽 0 + 𝛽1 𝑥 + 𝛽 2 𝑧 + 𝜀
where 𝜀 is a true random shock. Assume as well that there is some level of correlation between 𝑥 and 𝑧, which we can express as,
𝑧 = 𝛾1 𝑥 + 𝜉
where 𝜉 is a true random shock. Now, suppose one goes along and forgets 𝑧, and estimates
𝑦 = 𝜙 0 + 𝜙1 𝑥 + 𝑢
Substituting, the actual estimated model is,

$$y = \beta_0 + \beta_1 x + \underbrace{\beta_2(\gamma_1 x + \xi) + \varepsilon}_{u}$$
or,

$$y = \beta_0 + \underbrace{(\beta_1 + \beta_2\gamma_1)}_{\phi_1}\, x + (\beta_2 \xi + \varepsilon)$$
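The bias is easy to reproduce by simulation. A sketch with assumed parameter values (𝛽1 = 2, 𝛽2 = 3, 𝛾1 = 0.8, so the short regression recovers 𝛽1 + 𝛽2𝛾1 = 4.4):

```r
# Sketch: omitted-variable bias (assumed values: beta1 = 2, beta2 = 3, gamma1 = 0.8)
set.seed(7)
n <- 10000
x <- rnorm(n)
z <- 0.8 * x + rnorm(n)            # z correlated with x
y <- 1 + 2 * x + 3 * z + rnorm(n)  # true model
coef(lm(y ~ x))["x"]      # approx. 2 + 3 * 0.8 = 4.4, not the true 2
coef(lm(y ~ x + z))["x"]  # approx. 2 once z is included
```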
This case is provided just as an illustration of the bias in the parameters. It is not
the most serious case. Consider next the case of measurement error. Suppose that the true model is
𝑦 = 𝛽0 + 𝛽1 𝑥∗ + 𝜀
where 𝜀 is a true random shock. Now, instead of the real 𝑥∗, one can only obtain
the imperfect measure,
𝑥 = 𝑥∗ + 𝜉
where 𝜉 is a true random shock. Substituting, the actual estimated model is,
$$y = \beta_0 + \beta_1 x + \underbrace{\varepsilon - \beta_1 \xi}_{u}$$
28.5 Omitted Common Source
Suppose that a common variable 𝑧 drives both 𝑦 and 𝑥, as in,
𝑦 = 𝛼 0 + 𝛼1 𝑧 + 𝜈
𝑥 = 𝛾 0 + 𝛾1 𝑧 + 𝜉
while one estimates only,
𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀
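A quick simulation (my own illustrative numbers) shows the danger: 𝑥 has no causal effect on 𝑦 whatsoever, yet the regression of 𝑦 on 𝑥 finds a clearly nonzero slope because both variables share the source 𝑧.

```r
set.seed(42)
n <- 1e5
z <- rnorm(n)               # the omitted common source
y <- 1 + 2 * z + rnorm(n)   # y is driven by z only
x <- 3 + 4 * z + rnorm(n)   # x is driven by z only

coef(lm(y ~ x))["x"]        # clearly nonzero although x does not cause y
```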
Another example is when variables grow independently over time. They cannot
be judged as the cause of one another simply on the basis of an estimated
relationship between them.
When the observations arise from a phenomenon of self-selection, then the esti-
mated relationship cannot be considered as causal.
28.6 Simultaneity
Here, 𝑦 and 𝑥 determine each other simultaneously, as in,
𝑦 = 𝛼0 + 𝛼1 𝑥 + 𝜈
𝑥 = 𝛾0 + 𝛾1 𝑦 + 𝜉
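With simultaneity, OLS on either single equation is biased. The sketch below (illustrative parameter values of my own, with the structural equations solved into their reduced form) shows the estimated slope landing away from 𝛼1:

```r
set.seed(42)
n  <- 1e5
a1 <- 0.5; g1 <- 0.5           # illustrative structural parameters
nu <- rnorm(n); xi <- rnorm(n)

# Solve y = a1 * x + nu and x = g1 * y + xi jointly (reduced form):
y <- (nu + a1 * xi) / (1 - a1 * g1)
x <- (xi + g1 * nu) / (1 - a1 * g1)

coef(lm(y ~ x))["x"]           # converges to 0.8 here, not to a1 = 0.5
```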
29 Regression to the Mean
Regression to the mean occurs when observations from two identical distributions
are linked to one another. The problem with such a link arises when extreme
observations of the first distribution are linked with observations of the second
distribution. Since the latter are less likely to be extreme, the unaware reader will
think that the two distributions are not identical. To compound the error, the
unaware reader will often pick an obvious explanation for the difference and assign
it a causal origin. This misinterpretation is a famous fallacy.
Nobel Prize winner Daniel Kahneman popularized the case of a flight instructor
claiming the following:
“On many occasions I have praised flight cadets for clean execution of some aerobatic maneuver, and in
general when they try it again, they do worse. On the other hand, I have often screamed at cadets for bad
execution, and in general they do better the next time. So please don’t tell us that reinforcement works
and punishment does not, because the opposite is the case.”
The first step to avoid the fallacy is to acknowledge the random nature of any
variable and to emphasize its random component. We could then think of any variable 𝑦 as,
$$y = \underbrace{f(X, \beta)}_{\text{Deterministic component}} + \underbrace{\varepsilon}_{\text{Random error}}$$
Suppose one wants to analyze the midterm and the endterm grades of the stu-
dents of a class. For instance, one could link these grades, for each student, in a
linear regression model as follows:
e-grade𝑖 = 𝛽0 + 𝛽1 m-grade𝑖 + 𝜀𝑖
where e-grade and m-grade are the grades at the endterm and midterm exams,
respectively, and 𝑖 refers to each student in the class.
Think of effect of luck on the grade at each test as the variance of the grade around
its expected value. Consider two cases about the effect of luck:
1. It is very small.
2. It is relatively large.
Argue that the first case would result in a slope coefficient 𝛽1 ≈ 1. Argue that
the second case would result in a slope coefficient 𝛽1 < 1. This is more difficult.
Here is a hint. Suppose a student is very lucky at a test. Think of what is likely
to happen at the next test.
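The two cases can also be explored with a simulation (all numbers are mine, for illustration): each grade is a student's underlying ability plus independent luck. With luck as large as the spread in ability, the slope settles around 0.5; shrink the standard deviation of the luck terms and it approaches 1.

```r
set.seed(42)
n_students <- 200
ability <- rnorm(n_students, mean = 12, sd = 3)  # expected grade per student
m_grade <- ability + rnorm(n_students, sd = 3)   # midterm = ability + luck
e_grade <- ability + rnorm(n_students, sd = 3)   # endterm = ability + fresh luck

coef(lm(e_grade ~ m_grade))["m_grade"]           # well below 1
```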
Fallacious conclusions derived from a regression to the mean plagued the in-
fancy of data analysis. The very name regression comes from these dismal be-
ginnings.
Sir Francis Galton measured human characteristics, e.g., height, and noticed that
when these characteristics were outstanding in parents, they tended to be much
less so in the children. Therefore, he claimed that there was a regression towards
mediocrity in human characteristics.
29.3.2 SI Jinx
Figure 29.1 shows the magazine cover referring to the Sports Illustrated Jinx,
which states that individuals or teams who appear on the cover of Sports
Illustrated will subsequently experience bad luck.
Goyal and Wahal (2008) analyzed how 3,400 retirement plans, endowments, and
foundations (plan sponsors) hired and fired firms that manage investment funds
over a 10-year period.
Their results can be illustrated by Figure 29.2. The researchers link the hir-
ing/firing decisions to the excess returns of the firms in the various periods be-
fore and after that decision. For instance, “-2:0” is the period of 2 years prior to the
decision while “0:1” is the first year after the decision, etc.
Plan sponsors, despite the important consequences of their choice, are clearly
falling for the fallacy.
FIGURE 29.2: Excess returns and the selection and termination decisions of plan
sponsors.
30 “Gold Standard”
FIGURE 30.1: Mita border and specific portion analyzed by Dell (2010).
Various authors have studied differences in institutions and their long-term im-
pact on economic development. Dell (2010) evaluates the effect of the mita forced
labor system. She uses a regression discontinuity design that is made possible by
the mita border shown in Figure 30.1.
This discrete change suggests a regression discontinuity (RD) approach for evaluating the long-term
effects of the mita, with the mita boundary forming a multidimensional discontinuity in
longitude–latitude space. Because validity of the RD design requires all relevant factors besides
treatment to vary smoothly at the mita boundary, I focus exclusively on the portion that transects the
Andean range in southern Peru. Much of the boundary tightly follows the steep Andean precipice, and
hence has elevation and the ethnic distribution of the population changing discretely at the boundary. In
contrast, elevation, the ethnic distribution, and other observables are statistically identical across the
segment of the boundary on which this study focuses. Moreover, specification checks using detailed
census data on local tribute (tax) rates, the allocation of tribute revenue, and demography—collected just
prior to the mita’s institution in 1573—do not find differences across this segment.
Approaching the Gold Standard
Results:
Abstract This study utilizes regression discontinuity to examine the long-run impacts of the mita, an
extensive forced mining labor system in effect in Peru and Bolivia between 1573 and 1812. Results
indicate that a mita effect lowers household consumption by around 25% and increases the prevalence of
stunted growth in children by around 6 percentage points in subjected districts today. Using data from
the Spanish Empire and Peruvian Republic to trace channels of institutional persistence, I show that the
mita’s influence has persisted through its impacts on land tenure and public goods provision. Mita
districts historically had fewer large landowners and lower educational attainment. Today, they are less
integrated into road networks and their residents are substantially more likely to be subsistence farmers.
Explanation:
To minimize the competition the state faced in accessing scarce mita labor, colonial policy restricted the
formation of haciendas in mita districts, promoting communal land tenure instead (Garrett (2005),
Larson (1988)). The mita’s effect on hacienda concentration remained negative and significant in 1940.
Second, econometric evidence indicates that a mita effect lowered education historically, and today mita
districts remain less integrated into road networks. Finally, data from the most recent agricultural
census provide evidence that a long-run mita impact increases the prevalence of subsistence farming.
Based on the quantitative and historical evidence, I hypothesize that the long-term presence of large
landowners in non-mita districts provided a stable land tenure system that encouraged public goods
provision.
A Assignments
A.1 Assignment I
General Instructions
• The goal of this assignment is threefold. First, it checks that the required soft-
ware is properly installed on your machine. Second, it illustrates several com-
ponents of the text editing language, Markdown. Finally, and arguably the
most important, it is a first example of a dynamic document.
• The assignment addresses exclusively elements of the format of the docu-
ment. This means that it lacks any specific content such as an analysis to carry out
or a question to answer. My apologies for this dry exercise.
• As much as possible, organize your answers in Sections following the present
format.
• This is the only assignment that you will have to do alone.
• Please check Moodle for the submission link and deadline.
Deliverables
This assignment requires that you deliver several files. Please put them in a
folder and compress the folder in one of the usual formats (.zip, .rar). The link
on Moodle will be set to accept only these compressed formats!
Make sure that you include all the required files. If files are missing, then we
cannot knit your Rmd file. There is a penalty in that case.
If it knits, it ships.
Please make sure that it knits on your machine… and on ours! Because of the
task in Section A.1.2, you must knit your document one last time shortly before
submitting it.
Include your pdf document in the deliverables.
1. The main file of your submission is an Rmd file. Follow the instructions
of the relevant chapter2 of the notes on the introduction to R.
2. Modify the YAML appropriately to a personalized version, e.g., change
the title.
3. Make sure the item ‘author’ in the YAML is filled as follows,
5. Paste the following three lines at the beginning of your Rmd file. Make
sure that the chunk options required for having the code evaluated,
echoed in the output file, and showing its result are all set to TRUE.
```{r}
getwd()
```
1 https://alison.rbind.io/post/2020-05-28-how-i-teach-r-markdown/
2 https://af-ucp.courses/introR/template.html
The output of the code above is the location of the current file in your computer.
This location will be printed in the output file. It is expected that the location
contains elements referring to your name. If it does not, please write a word to
explain why.
Here is the above code in my file, along with its output. As you can see, it gives
the sought-for indication about the author.
getwd()
## [1] "/Users/antoniofidalgo/Dropbox/brm"
Check Moodle for the key number, noted kn, on the day of submission.
Your time submission number, noted tsn, is simply the hour at the time of your
submission, on a 0-24 scale. For instance, if you submit your work in the morning
at 09:24, then your tsn is 9. If you submit it at 22:56, then the tsn is 22.
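If in doubt, you can let R compute your tsn at knitting time; the following one-liner (my suggestion, not a requirement of the assignment) extracts the current hour:

```r
# The hour of the current time, on a 0-23 scale
tsn <- as.integer(format(Sys.time(), "%H"))
tsn
```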
The present document will dynamically refer to the ‘dynamic number’, dn, built
as shown in the code below that you must include in your report.
dn <- kn + tsn + td
For this task, please find guidance in the dedicated Section of my introduction to
R3.
3 https://af-ucp.courses/introR/mrmd.html
Replicate Table A.1 as a Markdown table.4 The caption of the table must also be
included in your copy, see help here5 .
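Since Table A.1 itself is not reproduced here, the following is only a generic sketch of the Markdown pipe-table syntax, with made-up contents; the `Table:` line is the pandoc convention for attaching a caption:

```markdown
| Day     | Activity |
|---------|----------|
| Monday  | Lecture  |
| Tuesday | Reading  |

Table: A made-up weekly schedule.
```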
For this task, please find guidance in the dedicated Section of my introduction to
R7.
Google a picture related to Kahneman (2011) and include it in this section.
Find the picture in Figure A.1 on Tufte’s website8 and include it as well.
A.1.5 Cross-References
There are more than 345 (3 times dn) people in the picture of Figure A.1 of Section A.1.4.
4 The number of the footnote in your file will likely be different. Also, do not pay attention to the background colors or the differences that you may observe between the HTML version and a pdf version.
5 https://af-ucp.courses/introR/mrmd.html#crossrefbook
6 But maybe meeting with a colleague.
7 https://af-ucp.courses/introR/mrmd.html
8 https://www.edwardtufte.com/tufte/powerpoint
My schedule in Table A.1 of Section A.1.3 allows me to start reading Kahneman’s book on Tuesday.
A.1.6 Citations
The following are two masterpieces: Kahneman (2011) and Tufte (2003).
These are books. Look up the reference of the research paper by Reinhart and
Rogoff that we saw in the notes9 and quote it.
9 https://af-ucp.courses/introR/error.html
For this task, you need to create a .bib file as explained in the notes on my intro-
duction to R10 . Do not forget to create a header called “References”.
10 https://af-ucp.courses/introR/mrmd.html
B Bonus Assignments
You can answer either, both, or neither of the following questions. Their goal is
twofold: first, to provide practice of the concepts and techniques that we saw
together, and, second, to offer you the possibility of increasing your grade through
an extra effort.
You can work in groups of up to three people.
The deadline for submission is Monday, May 24, 10:59.
[This question awards up to one and a half extra points in the midterm.]
This Assignment is based on Bland (2009) (download here1 ).
I created simulated data that you must use in this case (download here2 ).
The data set contains observations on two variables: wrinkle.red, which measures
the wrinkle reduction at 6 months measured on the individual (in a made-up
unit), and group, which specifies whether the individual used the cream with the
active ingredient or the vehicle.
B.1.1 Task
Use the data as if it were the real data of Watson et al. (2009) to illustrate the
various elements discussed in Bland (2009), p. 183, middle column.
1 https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335658
2 https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335659
[This question awards up to one and a half extra points in Quiz II.]
For the following exercises, use the grades.csv data set (download here3 ).
The file contains the grades of the students, identified by an ID, for three dif-
ferent tests in a semester: a midterm, a quiz and an endterm. If needed for the
interpretation, consider that the tests were made in that order, i.e., starting with
the midterm and ending with the endterm.
a. Read the file into a tibble and assign it to the name df.
b. Estimate the following models with OLS. Show and comment on the out-
put of these estimated models.
Model 1:
quiz𝑖 = 𝛽0 + 𝛽1 midterm𝑖 + 𝜀𝑖
Model 2:
endterm𝑖 = 𝛼0 + 𝛼1 midterm𝑖 + 𝜉𝑖
c. For each model, provide the scatter plot of the data along with the linear
fit. Add a 45-degree line (geom_abline(slope = 1, intercept = 0)).
3 https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=336209
B.2 Grades and Luck
[Plot: grade versus student id, with one point set per test (endterm, midterm, quiz).]
d. Tidy the data such that every row is an observation and every column is
a variable. Assign the tidy tibble to the name t.df. Hint: Figure B.1 shows
the resulting format.
Model 3:
grade = 𝛿0 + 𝛿1 𝐷1 + 𝛿2 𝐷2 + 𝑢
where
$$D_1 = \begin{cases} 1 & \text{if endterm} \\ 0 & \text{otherwise} \end{cases}$$
and
$$D_2 = \begin{cases} 1 & \text{if quiz} \\ 0 & \text{otherwise} \end{cases}$$
This requires that you turn the variable test into a factor with as.factor(). It also
requires setting the right reference level for the variable with relevel().
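As a sketch of these two steps (the toy data frame below is mine; your t.df comes from the previous task), note how relevel() puts midterm first, so that the estimated dummies correspond to endterm and quiz:

```r
# Toy data standing in for the tidy t.df of the previous task
t.df <- data.frame(
  grade = c(12, 14, 11, 15, 13, 16),
  test  = c("midterm", "quiz", "endterm", "midterm", "quiz", "endterm")
)
t.df$test <- as.factor(t.df$test)
t.df$test <- relevel(t.df$test, ref = "midterm")  # midterm as the baseline
levels(t.df$test)                                 # "midterm" "endterm" "quiz"
coef(lm(grade ~ test, data = t.df))
```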
C Practice Quiz Questions
TABLE C.1: Practice quiz questions with elements of solution in this appendix.
Exercise Solution
C.1, C.2, C.3, C.4, C.5 b, e, b, a, b
C.6, C.7, C.8, C.9 a, a, b, a
C.10, C.11, C.12, C.13, C.14 c, a, a, b, d
C.15, C.16, C.17, C.18, C.19 b, c, a, g, b
C.20, C.21, C.22, C.23, C.24 b, b, b, a, a
C.25, C.26, C.27, C.28, C.29
C.30, C.31, C.32, C.33, C.34
C.35
C.36, C.37, C.38, C.39, C.40 c, b, a, c, d
C.41, C.42, C.43, C.44, C.45 e, a, b, n, a
C.46, C.47, C.48, C.49, C.50 a, a, b, a, a
C.51, C.52 b, f
C.53, C.54, C.55, C.56, C.57 a, g, h, b, b
C.58, C.59, C.60, C.61, C.62 c, d, b, a, c
C.63, C.64, C.65, C.66, C.67 c, b, c, c, d
C.1 Quiz I
Exercise C.1. Suppose you’re on a game show, and you’re given the choice of 8
doors. Behind one door is a car; behind the others, goats.
You pick a door, say No. 1, and the host, who knows what’s behind the doors
and cannot open a door with a car, opens 6 doors with a goat. Your door and
door No. 2 remain closed.
He then says to you, “Do you want to pick door No. 2?” What is the probability
of winning the car by picking door No. 2?
a. 0.83
b. 0.88
c. 0.50
d. 0.13
e. 0.24
f. None in the list
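The answer keys in Table C.1 are public, so there is no harm in verifying this one by simulation. In the sketch below you always pick door No. 1; after the host opens 6 goat doors, switching to the remaining closed door wins exactly when the car is not behind your initial door:

```r
set.seed(1)
n   <- 1e5
car <- sample(1:8, n, replace = TRUE)  # door hiding the car
# Switching wins whenever door No. 1 was the wrong initial pick:
mean(car != 1)                         # close to 7/8 = 0.875
```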
Exercise C.3. Suppose that you want to compare a given proportion between
two populations (population 1 and population 2). You obtain a sample from each
population and want to test:
𝐻0 ∶ 𝑝1 − 𝑝2 = 0.25
In that case, is using the pooled probability for the standard error, as we saw in
class, still appropriate?
a. True
b. False
a. Do not reject 𝐻0
b. Reject 𝐻0
c. Cannot be said (information is lacking)
Exercise C.5. The use of a chi-squared distribution for the sampling distribution
of a statistic in a small sample, instead of a normal distribution, is justified by its
higher accuracy.
a. True
b. False
Exercise C.6. The construction of the sampling distribution through the use of
permutations is an alternative to the use of analytical results (theorems, etc).
a. True
b. False
Exercise C.7. Suppose that, under the null, the relevant statistic for a two-tailed
test follows a standard normal distribution. Let 𝛼 = 5%.
The statistic in the sample is 0.01. What decision does the test recommend about
𝐻0 ?
a. Do not reject 𝐻0
b. Reject 𝐻0
c. Cannot be said (lack of information)
Exercise C.8. According to the central limit theorem, the sampling distribution
of the mean can be approximated by the normal distribution:
Exercise C.9. Which of the following properties is not true regarding the sam-
pling distribution of the sample mean?
b. $E[\bar{X}] = \mu$, i.e., the expected value of the sample mean is the true
mean no matter how large 𝑛 is.
c. When the population being sampled follows a normal distribution,
the distribution of the sample mean is normal no matter how large 𝑛 is.
d. $\sigma_{\bar{X}} = \sigma/\sqrt{n}$
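Property d can be checked numerically; the sketch below (parameter choices are my own) draws many samples and compares the standard deviation of the sample means with $\sigma/\sqrt{n}$:

```r
set.seed(42)
sigma <- 2; n <- 25
xbar  <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
sd(xbar)               # close to sigma / sqrt(n) = 0.4
```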
Exercise C.10 (Quiz I, 20-21). Suppose that a test for COVID-19 uses
Suppose as well that the test can be modified by increasing the 𝐶𝑇 , from 20 to
25 or even higher. When the 𝐶𝑇 is increased, even fragments of the virus, or
even dead virus, will be detected and the test will be flagged as positive.
Recall the terminology about the type of errors and their probabilities. What does
a policy of decreasing the 𝐶𝑇 of the test correspond to?
a. increasing 𝛽
b. decreasing 𝛼
c. increasing 𝛼
d. cannot be said
e. decreasing 𝛼 and 𝛽
f. increasing 𝛼 and 𝛽
a. the magician spent time trying until they achieved the streak
b. the tests on the video were not good enough
c. the magician has psychic powers
d. the die obeyed the voice of the magician
e. this was the result of the skills of the magician at their best, though
it is extremely unlikely that they will be able to repeat the feat
Exercise C.12 (Quiz I, 20-21). I obtain a sample of students answering this ques-
tion.
If I want to test whether the students are better or worse than my grandmother
at answering it, what is the most appropriate null hypothesis (for the probability
of knowing the correct answer) given that my grandmother knows absolutely
nothing about statistics?
a. 𝑝 = 1/5
b. 𝑝 = 1/4
c. 𝑝 = 1/3
d. 𝑝 = 1/2
e. 𝑝 = 2/3
Exercise C.13 (Quiz I, 20-21). A die is known to be fair. For 5 times in a row, it
showed a number larger than 3.
The probability that the next throw (the 6th, in an n=6 experiment) shows a num-
ber smaller than or equal to 3 is now $0.5 + \frac{1}{\sqrt{n}}$.
a. True
b. False
Exercise C.14 (Quiz I, 20-21). In R, the names of the functions for random vari-
ables (distributions) follow a pattern, in particular with respect to the first letter.
For instance, Xunif() does the “same” with a uniform distribution as Xchisq()
does with a chi-squared distribution (possibly needing more arguments,
though).
What is the function that gives the probability to the left of a value ‘q’ in a t-
distribution?
a. qt()
b. rt()
c. dt()
d. pt()
Exercise C.15 (Quiz I, 20-21). We take two samples, A and B, from a population
(with known variance). Sample A contains 200 observations and sample B, 250.
By coincidence, the mean of each sample is the same.
We then use each sample to make a test of hypothesis about the true mean in the
population, 𝜇. The null will be: 𝐻0 ∶ 𝜇 = 𝜇0 for some value of 𝜇0 . For instance,
we can use each sample to test whether the true mean in the population is 50.
Generally, in which sample will the null be rejected more often?
a. A
b. B
c. We cannot say: it depends on $\mu_0$
d. We cannot say: we need more information about the samples
e. [None in the list]
a. 𝑝2 ≤ 𝑝1
b. 𝑝1 = 𝑝2
c. 𝑝1 ≤ 𝑝2
d. 𝑝1 = 𝑝2 = 0
e. cannot be said
Exercise C.17 (Quiz I, 20-21). In a test of hypothesis, what is the minimum value
that the p-value can take?
Exercise C.18 (Quiz I, 20-21). When conducting a test of hypothesis, the rejection
region approach and the p-value approach suggest decisions about 𝐻0 that are…
Exercise C.19 (Quiz I, 20-21). The fact that virtually everybody can access
spreadsheet-based software, such as Excel, qualifies this tool as a tool for repro-
ducible research.
a. True
b. False
a. reject 𝐻0
b. not reject 𝐻0
c. cannot be said
Exercise C.21 (Quiz I, 20-21). Under the null, the sampling distribution of a test
statistic is a normal distribution with mean 0 and standard deviation 1, i.e., a
standard normal.
In our sample, the value of the test statistic is 1.
Assume 𝛼 = 0.05. What decision does the test recommend about 𝐻0 ?
a. reject 𝐻0
b. not reject 𝐻0
c. cannot be said
Exercise C.22 (Quiz I, 20-21). The standard deviation of the sampling distribu-
tion of the sample mean, i.e., the standard error of the mean, decreases linearly
with 𝑛, the sample size.
a. True
b. False
Exercise C.23 (Quiz I, 20-21). Recall the following question:
Which of the following sequences of X’s and O’s seems more likely to have been
generated by a random process (e.g., flipping a coin)?
• XOXXXOOOOXOXXOOOXXXOX
• XOXOXOOOXXOXOXOOXXXOX
a. True
b. False
Exercise C.24 (Quiz I, 20-21). You estimate a parameter in the population. The
more precise you want your point estimate to be, the larger the sample size has
to be.
a. True
b. False
C.2 Midterm Quiz
Exercise C.25 (Midterm, 20-21). We would like to investigate the rate of return
of two stocks, say of Company A and Company B. For this, we take a random
sample of a large number of days and record the value of the stock for each
company on each of these days. These data are paired.
a. True
b. False
Exercise C.26 (Midterm, 20-21). While manipulating the data on two variables,
𝑥 and 𝑦, the largest (and positive) value of 𝑥 is accidentally multiplied by 1.96.
Which of the following applies for the correlation between 𝑥 and 𝑦 after this
manipulation?
a. True
b. False
Exercise C.28 (Midterm, 20-21). In different samples, I obtain the following Pear-
son correlations that I use in the relevant test. Other things equal, for which is
the null most likely to not be rejected?
a. 0.45
b. -0.25
c. 0.04
d. -0.78
e. 0.85
Exercise C.29 (Midterm, 20-21). Consider the output below. You read the main
part line by line, i.e., for each variable (smoke, race, ht, ui, etc…). For each line,
there is a test with 𝐻0 ∶ 𝐶𝑜𝑒𝑓. = 0, where “Coef.” stands for coefficient. The
columns towards the right give the 𝑡 score and the p-value (𝑃 > |𝑡|). Which of
the following variables has a coefficient statistically different from 0 given the
model? Use 𝛼 = 1%.
a. smoke
b. race
c. age
d. [none in this list]
e. [cannot be said]
d. True, if 𝐻0 ∶ 𝜇 = 0
a. 2
b. 3
c. 4
d. 5
e. 6
Exercise C.33 (Midterm, 20-21). Which of the following would result in an in-
crease of the range of the confidence interval for the estimation of the population
mean?
Exercise C.34 (Midterm, 20-21). Consider the construction of the confidence in-
terval for a parameter such as the mean from a relatively small sample. For a
given confidence level, the absolute value of 𝑡𝑑𝑓,1−𝛼 is XXX than 𝑧1−𝛼 , making
the confidence interval YYY. What do XXX and YYY stand for, respectively?
a. larger, wider
b. larger, narrower
c. smaller, wider
d. smaller, narrower
e. [cannot be said]
Exercise C.35 (Midterm, 20-21). We are interested in testing the mean effect, 𝜇, of
a new drug on the inflammation level of an injury.
Arguably, the best formulation of the null and the alternative is
𝐻0 ∶ 𝜇 = 0 vs 𝐻𝑎 ∶ 𝜇 ≠ 0,
and not
𝐻0 ∶ 𝜇 ≥ 0 vs 𝐻𝑎 ∶ 𝜇 < 0.
a. True
b. False
C.3 Quiz II
Exercise C.36 (Quiz II, 20-21). A categorical variable can take three values: north,
center, and south. This is why we create three dummy variables.
𝑦 = 𝛽0 + 𝛽1 𝐷1 + 𝛽2 𝐷2 + 𝜀
We find the following coefficients: $\hat\beta_0 = 8.62$, $\hat\beta_1 = 2.89$, $\hat\beta_2 = -1.44$.
Suppose that instead we had estimated the model:
𝑦 = 𝛼0 + 𝛼1 𝐷2 + 𝛼2 𝐷3 + 𝜖
a. 8.62
b. 7.18
c. 11.51
d. 7.17
e. 12.95
f. 5.73
g. 10.06
h. 1.45
i. -4.29
j. -1.45
Exercise C.37 (Quiz II, 20-21). Suppose that you use a statistical model to predict
the value of the variables below. Determine whether the estimation refers to a
regression (R) or a classification problem (C).
1. a worker’s wage,
2. the commute time of workers,
3. the preferred type of transportation of individuals,
4. the self-assessed health level from 1 (very bad) to 5 (very good),
5. the risk of cancer group of individuals.
The answers below of R’s and C’s follow the order of these variables.
a. CRRCC
b. RRCCC
c. CCRCR
d. CCRRR
e. RRCRC
f. RCRCC
g. CCRRC
h. CCCRR
Exercise C.38 (Quiz II, 20-21). We can say that k-fold cross-validation is an ex-
tension/generalization of the simple validation set approach.
a. True
b. False
Exercise C.39 (Quiz II, 20-21). The following gives the estimates of a simple lin-
ear model.
𝑦 ̂ = 3.85 − 2.95𝑥
What is the predicted value of the model when the independent variable is equal
to 12?
a. 3.85
b. 15.85
c. -31.55
d. -35.40
e. 3.60
f. -2.76
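Since the answer key is given in Table C.1, here is the one-line check of the prediction:

```r
b0 <- 3.85; b1 <- -2.95
b0 + b1 * 12   # -31.55
```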
Exercise C.40 (Quiz II, 20-21). In a simple linear regression, the slope coefficient
is 1.240 and it has a 𝑡-value of 5.544 when testing the null hypothesis that the
true parameter is 0.
What is the standard error of the slope coefficient?
a. 6.00
b. 4.47
c. 1.00
d. 0.22
e. 6.87
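Recall that the t-value is the coefficient divided by its standard error, so the standard error is recovered by inverting that ratio:

```r
coef_hat <- 1.240; t_value <- 5.544
coef_hat / t_value   # about 0.224
```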
Exercise C.41 (Quiz II, 20-21). In the context of linear regression estimation for a
model of the variable 𝑦 , suppose that the usual significance test on a coefficient
allows us to reject the null hypothesis for the variable 𝑥1 .
My grandmother claims that this result implies that 𝑥1 has a causal effect (i.e.,
it’s a cause of) 𝑦 .
What is your best evaluation of my grandmother’s claim?
Exercise C.42 (Quiz II, 20-21). Suppose you regress a normal random variable
𝑦 on another, unrelated, “explanatory” random normal variable 𝑥, i.e., you estimate the
model
𝑦 = 𝛽 0 + 𝛽1 𝑥 + 𝜀
Then, you carry the usual test of the hypothesis: 𝐻0 ∶ 𝛽1 = 0. What is the
probability that you reject 𝐻0 ?
a. 𝛼
b. cannot be determined
c. 𝛽1
d. 𝛽0
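This is simply the type I error rate at work; a simulation with two independent normal variables (sample sizes and repetition counts are my own choices) rejects the true null in about 5% of the samples:

```r
set.seed(42)
pvals <- replicate(2000, {
  x <- rnorm(50); y <- rnorm(50)             # y and x are independent
  summary(lm(y ~ x))$coefficients["x", 4]    # p-value for H0: beta1 = 0
})
mean(pvals < 0.05)   # close to alpha = 0.05
```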
Exercise C.43 (Quiz II, 20-21). A researcher’s effort to continuously increase the
ability of their model to fit their data at hand is rewarded with increased accuracy
in predictions.
This statement is generally…
a. True
b. False
Exercise C.44 (Quiz II, 20-21). Consider a regression predicting weight (kg) from
height (cm) for a sample of adult males. What are, respectively, the units of:
A. the correlation coefficient, B. the intercept, C. the slope.
a. A. cm/kg. B. no units. C. cm
b. A. kg. B. kg. C. cm/kg
c. A. no units. B. no units. C. kg/cm
d. A. cm/kg. B. cm. C. kg
e. A. kg. B. kg. C. kg
f. A. kg. B. cm. C. no units
g. A. cm. B. no units. C. kg
h. A. kg/cm. B. kg/cm. C. kg/cm
i. A. kg. B. no units. C. kg/cm
j. A. cm. B. cm. C. cm
k. A. kg/cm. B. kg. C. no units
l. A. cm/kg. B. cm/kg. C. kg/cm
m. A. cm/kg. B. no units. C. kg/cm
n. A. no units. B. kg. C. kg/cm
o. A. no units. B. kg. C. no units.
Exercise C.45 (Quiz II, 20-21). The 𝑅2 of a multiple regression model does not
use cross-validation. Therefore, it is not a reliable measure of the 𝑅2 of the same
regression model in another sample.
a. True
b. False
Exercise C.46 (Quiz II, 20-21). You define a dummy variable 𝐷𝑀 which takes the
value 1 if the individual is a man and 0 otherwise (the individual is a woman).
You then estimate a linear regression model of an individual’s wage using sev-
eral variables, including 𝐷𝑀 . Other things equal, you find that the value of the
coefficient on the dummy variable 𝐷𝑀 is equal to 20.
Now, suppose that instead of 𝐷𝑀 , you had used 𝐷𝑊 , a dummy variable taking
the value 1 if the individual is a woman and 0 otherwise.
Other things equal, what would be the value of the coefficient on the dummy
variable 𝐷𝑊 ?
a. -20
b. It cannot be determined
c. It depends on the intercept
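The sign flip can be seen in a simulation (the wage equation and its numbers are invented for illustration): with $D_W = 1 - D_M$, the dummy's coefficient simply changes sign while the intercept absorbs the difference.

```r
set.seed(42)
n    <- 1000
man  <- rbinom(n, 1, 0.5)
wage <- 100 + 20 * man + rnorm(n)  # men earn 20 more, other things equal

d_m <- man
d_w <- 1 - man
coef(lm(wage ~ d_m))["d_m"]  # close to  20
coef(lm(wage ~ d_w))["d_w"]  # close to -20
```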
Exercise C.47 (Quiz II, 20-21). The mean squared error (MSE) criterion cannot
correctly be used in the context of classification problems.
a. True
b. False
Exercise C.48 (Quiz II, 20-21). In the Netflix challenge, competitors were asked
to provide a model and an estimation technique in order to predict an aspect of
the clients’ choice based on some training data. Since the competitors could not
access the data on which their model will be evaluated, the problem is one of
unsupervised learning.
a. True
b. False
Exercise C.49 (Quiz II, 20-21). Suppose you assume that the true relationship
between the explained variable, the weight of a baby, (in grams), and the ex-
planatory variable, the baby’s age (in months) is linear. You estimate a simple
linear regression model and obtain 𝛽0̂ and 𝛽1̂ .
You later come to realize that the scale that was used to weigh the babies was
deficient in the sense that it was always 52 grams above the real weight.
Does this result in a bias of your 𝛽1̂ ?
a. No
b. Yes
c. Cannot be determined
Exercise C.50 (Quiz II, 20-21). The following estimated model of relationship
between variables is a linear regression model.
log(𝑦) = 𝛽0 + 𝛽1 log(𝑥) + 𝜀
a. True
b. False
Exercise C.51 (Quiz II, 20-21). When estimating and presenting the results of a
linear regression model, a high value of the 𝑅2 is a necessary requirement for
the validity of the model and its publication in a good research journal.
a. True
b. False
Exercise C.52 (Quiz II, 20-21). Let 𝑦 be the total weight of the individuals in a
sample. The total number of individuals is denoted 𝑥 and the total number of
kids among these individuals is denoted 𝑤.
We estimated the following linear regressions:
𝑦 = 𝛽 0 + 𝛽1 𝑥 + 𝜀
𝑦 = 𝛾 0 + 𝛾1 𝑤 + 𝜖
𝑦 = 𝛼0 + 𝛼1 𝑥 + 𝛼2 𝑤 + 𝜉
What are, respectively, the likely signs of $\hat\beta_1$, $\hat\gamma_1$, $\hat\alpha_1$, $\hat\alpha_2$?
(A hat over a variable denotes the estimate through the OLS procedure.)
C.4 Endterm Quiz
Exercise C.53 (Candidate Endterm Quiz, 20-21). Consider the linear fit in Figure
C.3, assuming that it represents the true relationship between the variable ‘sales’
and the variable ‘TV’.
What assumption of the linear model seems to be violated?
a. Homoscedasticity
b. Normality of the errors
c. Correlation between errors
d. None in the suggested list
Exercise C.54 (Candidate Endterm Quiz, 20-21). A categorical variable can take
three values: north, center, and south. This is why we create three dummy vari-
ables.
𝑦 = 𝛽0 + 𝛽1 𝐷1 + 𝛽2 𝐷2 + 𝜀
We find the following coefficients: $\hat\beta_0 = -9.54$, $\hat\beta_1 = -1.31$, $\hat\beta_2 = 4.33$.
Suppose that instead we had estimated the model:
𝑦 = 𝛼0 + 𝛼1 𝐷2 + 𝛼2 𝐷3 + 𝜖
a. 3.90
b. -15.18
c. -13.87
d. -5.21
e. -3.02
f. -12.56
g. 5.64
h. 3.02
i. -9.54
j. -8.23
Exercise C.55 (Candidate Endterm Quiz, 20-21). Consider the regression output
in Figure C.4. Determine the value of XXX.
a. -4.358
b. 3.762
c. 7.852
d. 2.091
e. -0.782
f. -1.349
g. 2.876
h. none in the suggested list
Exercise C.56 (Candidate Endterm Quiz, 20-21). You use OLS to regress a vari-
able 𝑦 on a variable 𝑥 and find that the coefficient on the variable 𝑥 is not statisti-
cally significant, i.e., we cannot reject the hypothesis that the coefficient is equal
to 0. We do the same test on a large number of samples and the result of the test
is always the same.
This means that there is no relationship between these variables.
a. True
b. False
Exercise C.57 (Candidate Endterm Quiz, 20-21). If we knew the real/true func-
tion relating an explained variable 𝑌 to the set of explanatory variables 𝑋 , then,
given these explanatory variables, we could achieve 0 MSE in the test data.
a. True
b. False
Exercise C.58 (Candidate Endterm Quiz, 20-21). Recall the results of the paper
by Ferguson and Voth (2008) shown in Figure C.5.
Based on the table, call A the average (log) returns of a firm connected to the Nazi regime in the January-March 1933 period; call B the average (log) returns of a firm unconnected to the Nazi regime in the November 1932-January 1933 period. This implies that you must use the models without any explanatory variable beyond ‘Nazi’.
What is A-B (A minus B, i.e., the difference in the log returns for these two groups
of firms)?
a. 0.1215
b. -0.0343
c. none in the suggested list
d. -0.0522
e. 0.0522
f. 0.0865
g. -0.0673
h. 0.0673
i. 0.0343
Exercise C.59 (Candidate Endterm Quiz, 20-21). In the old times of low com-
puting power, which of the following would be the most affordable method of
cross-validation?
c. k-fold CV
d. Validation set
Exercise C.60 (Candidate Endterm Quiz, 20-21). Suppose individuals can have
three levels of education: 12 years of schooling, 17 years of schooling, or 21 years
of schooling.
Imagine we want to estimate a model of the wage. We could estimate a model
a. True
b. False
Exercise C.61 (Candidate Endterm Quiz, 20-21). Suppose we estimate the fol-
lowing model,
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝛽2 𝑥𝑖² + 𝜀𝑖
Is this still a linear model?
a. True
b. False
a. It depends/cannot be determined
b. Underestimate
c. Overestimate
Exercise C.63 (Candidate Endterm Quiz, 20-21). Suppose that you use a statis-
tical model to predict the value of the variables below. Determine whether the
estimation refers to a regression (R) or a classification problem (C).
The answers below, sequences of R’s and C’s, follow the order of these variables.
a. CRRCC
b. RRCCC
c. CCRCC
d. CCRRR
e. RRCRC
f. RCRCC
g. CCRRC
h. CCCRR
Exercise C.64 (Candidate Endterm Quiz, 20-21). Suppose you estimate a linear
model with a dummy variable 𝐷.
𝑦 = 𝛽 0 + 𝛽1 𝑥 + 𝛽 2 𝐷 + 𝜀
Exercise C.65 (Candidate Endterm Quiz, 20-21). Consider the two graphs in Fig-
ure C.6 showing the explained variable as a function of two explanatory vari-
ables, x1 and x2.
Which of the two (left or right) better illustrates a problem of unsupervised learning?
a. right
b. none of the two
c. left
d. it depends/cannot be determined
Exercise C.66 (Candidate Endterm Quiz, 20-21). Recall the results of the paper
by Ferguson and Voth (2008) shown in Figure C.5.
In all likelihood, for the November 1932-January 1933 period, for how many firms did the authors have information on the ‘Market Cap’ but not on the ‘Dividend Yield’?
a. 57
b. cannot be determined
c. 53
d. 22
e. 67
f. 0
g. 12
C.5 Selected Quiz I Solutions
a. higher; higher
b. cannot be determined
c. lower; higher
d. higher; lower
e. lower; lower
Recall that 𝛼 is the probability of the error of Type I that the researcher is ready
to accept when rejecting the null hypothesis, knowing that it could be true.
Since the value of 𝛼 is chosen by the researcher, there is always a value for which the null would be rejected. For instance, depending on the conditions, the researcher could say that 𝛼 is 20%. A weird choice, difficult to justify, but always a possible one.
Look at the p-value: it is larger than 𝛼. If the null is true, the probability of observing a test statistic as large as or larger than the statistic from the sample is too high to reject the null. Hence the test suggests not rejecting 𝐻0 .
The sentence does not really make sense. The sampling distribution is derived
analytically, following the assumptions on the distributions of the random vari-
ables involved.
Notice that the analytical results are based on assumptions about the random variables involved. If we are ready to make them to develop a theory, we can also make them and feed them to a computer. The computer will then be able to generate a very large number of samples and obtain the sampling distribution in this way.
The statistic falls very close to the true value under the null. Hence, we will certainly not reject the null. To better see this, recall Figure 4.1. In this question, the distribution is a normal around 0. Put it in the center of the distribution. The test statistic is 0.01, i.e., very close to 0. So, there is no chance it falls in the rejection region.
The question aims at evaluating your understanding of the difference between
the score/statistic in the sample and the p-value. The former can range from
minus (virtually) infinity to plus infinity and leads to a rejection of the null if
it falls in the rejection area (i.e., has a small p-value). The latter is a probability,
hence must be between 0 and 1.
See Section 5.2. The condition for the Central Limit Theorem to apply its magic
is that the sample size becomes sufficiently large.
The standard deviation in the population does not change, and its sample counterpart cannot be expected to change dramatically with 𝑛.
Once again, see Section 5.2. The condition for the Central Limit Theorem to apply
its magic is that the sample size becomes sufficiently large.
Notice at the outset that all the options are possible, including the magician having psychic powers. (Who knows!?) But the question asks about “the most likely explanation”.
This rules out psychic powers (if only because otherwise the magician would have been making millions of dollars long before YouTube videos) and the die obeying the voice of the magician.
The technical argument is defeated in the very question: “extremely sophisti-
cated tests” show no sign of editing.
In the end, the most likely explanation is the painful and long task of trying until succeeding. There is no skill involved (what would it be, if not psychic powers, in which case we would be back to the case above) and, even if there were, the explanation is defeated by the second part of the sentence: a skill, by definition, is an ability that we can use more than once.
The null hypothesis should reflect the situation where the student knows nothing
about the question, i.e., they answer randomly.
In that case, the probability of answering correctly is one in 𝑛, 𝑛 being the number
of possible choices.
Obviously, the probabilities in the 𝑛th throw do not change because of the results of the previous 𝑛 − 1 throws.
The function that gives the probability to the left of a value ‘q’ in a normal distri-
bution is pnorm(). Following the explanation in the question, the answer is pt().
To understand this answer, we can recall the test for the mean based on the statistic

𝑍 = (𝑋̄ − 𝜇0 ) / (𝜎/√𝑛)

The larger the 𝑛, the larger the statistic. Hence, the higher the chances that it falls in the rejection region.
Intuitively, the fewer observations, the less sure we will be about the true value.
Hence, when testing for a specific number given in 𝐻0 , we will be less able to
reject the hypothesis that the true value is equal to that given number.
Seen from the other side, imagine we have thousands and thousands of observa-
tions. Then, we will be much more certain about the true value. When testing for
a specific number given in 𝐻0 , if that number is not the mean that we obtained
in the sample, or very very close to it, then we will reject the null.
The first and main hint for this answer is that the R output says that
alternative hypothesis: greater. Therefore, the null involves the opposite, namely
≤. In the R command, we give the first proportion… first and the second after
that. So, the null of the test is 𝑝1 ≤ 𝑝2 .
Recall that the p-value is the probability, assuming that the statistical model and
𝐻0 are true, that we observe a test statistic as large as or more extreme than the
value we observe in the sample.
Read again… “as large as… the value we observe in the sample”. If we observe
it, then the probability of observing it cannot be zero! It can be extremely small,
yes, but not 0.
The two methods must be equivalent, otherwise we would need to discuss when
their results differ.
Mathematically, using a classic case, it is equivalent to evaluate the following
two comparisons:
Having the same software is not enough. For reproducibility, one needs to be able to obtain the same results in a reasonably easy way, i.e., without needing to check all the cells individually to see if there is a mistake. (This is an argument regarding Excel. Other arguments apply in general, e.g., availability of data, etc.)
The R output shows a p-value larger than 5%, i.e., the test statistic is not too extreme compared to the threshold that we chose (see the 95% confidence). Hence the test recommends not rejecting the null.
The statistic falls relatively close to the true value under the null. Hence, we will certainly not reject the null. To better see this, recall Figure 4.1. In this question, the distribution is a normal around 0. Put it in the center of the distribution. The test statistic is 1, i.e., somewhat close to 0. So, there is little chance it falls in the rejection region.
Actually, since the sampling distribution has a standard deviation of 1 (and mean
0), then a test statistic of 1 is exactly 1 standard deviation away from 0. We should
know that this is not in the rejection region. As a benchmark, recall that at the 5%
significance level, the rejection region starts around 2 standard deviations away
from the mean.
Recall that the standard deviation of the sampling distribution of the sample mean is given by

𝜎𝑋̄ = 𝜎/√𝑛
The relationship between 𝜎𝑋̄ and 𝑛 is therefore not linear. It would be if, for instance, we had

𝜎𝑋̄ = 𝜎 − (𝑛/100) 𝜎
The second sequence incorrectly looks more random because it fits the law of small numbers. The latter states that the law of large numbers ought to apply to small samples too.
As evidence of that, consider the first two observations of the second sequence,
i.e., 𝑛 = 2. By the “law of small numbers” we should expect 50%-50% distribu-
tion between X’s and O’s. That’s what we have.
The same applies to 𝑛 = 4, the first 4 observations. By the “law of small num-
bers” we should expect 50%-50% distribution between X’s and O’s. That’s what
we have. Same with 𝑛 = 6.
So, this example illustrates decisions about randomness based on the law of
small numbers.
This result should be pretty intuitive: the larger the sample, the more information we have, and the more precise (and certain) we can be.
Another way of looking at it is by recalling the formula for the margin of error,
𝑀𝐸 = 𝑧𝛼/2 ⋅ 𝜎/√𝑛
We can see that the larger the 𝑛, the smaller the 𝑀 𝐸 .
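This inverse relationship is easy to check numerically. A minimal sketch (in Python rather than the book's R, with a made-up 𝜎 = 10 and the usual 1.96 critical value for 𝛼 = 5%):

```python
import math

def margin_of_error(sigma, n, z=1.96):
    """Margin of error of a sample mean: z_{alpha/2} * sigma / sqrt(n)."""
    return z * sigma / math.sqrt(n)

# With sigma fixed, increasing n shrinks the margin of error.
assert margin_of_error(sigma=10, n=500) < margin_of_error(sigma=10, n=300)
```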
In the second model, 𝛼0̂ will be the predicted value for an observation where 𝐷2 and 𝐷3 are both 0. In other words, it is the predicted value for the variable when 𝐷1 is equal to 1.
From the first model, we can calculate the predicted value for the variable when 𝐷1 is equal to 1: it is 𝛽0̂ + 𝛽1̂ . Hence, 𝛼0̂ = 𝛽0̂ + 𝛽1̂ .
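Because a regression on category dummies simply reproduces group means, this identity can be verified without any regression machinery. A small sketch with hypothetical group data:

```python
from statistics import mean

# Hypothetical outcomes for the three categories (D1, D2, D3 groups).
g1, g2, g3 = [4.0, 5.0, 6.0], [9.0, 11.0], [1.0, 2.0, 3.0]

# Model 1: y = b0 + b1*D1 + b2*D2 (baseline category: D3).
b0 = mean(g3)             # prediction for the baseline group
b1 = mean(g1) - mean(g3)  # contrast of group 1 against the baseline

# Model 2: y = a0 + a1*D2 + a2*D3 (baseline category: D1).
a0 = mean(g1)             # prediction when D2 = D3 = 0

# The intercept of model 2 equals b0 + b1 from model 1.
assert a0 == b0 + b1
```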
“A worker’s wage” and “the commute time of workers” are measured with a continuous variable. Hence, they would imply a regression problem.
The remaining variables are categorical in nature, even if we can express each
category with a number, e.g., 1 to 5. Hence, they call for a classification tool.
Yes, we can say so. The simple validation set approach separates the train data into two sets, training and validation, using the former to train the models and the latter to estimate the MSE in test data.
The 𝑘-fold validation extends this approach by separating the train data 𝑘 times into two sets, training and validation, using the former to train the models and the latter to estimate the MSE in test data. Since it does this 𝑘 times, the estimated MSE in the test data will be the average of the 𝑘 estimates.
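The mechanics can be sketched as follows, under the simplifying assumption that the "model" is just the training-set mean (any fitted model could be slotted in; Python here rather than the book's R):

```python
import random
from statistics import mean

random.seed(1)
data = [random.gauss(10, 2) for _ in range(100)]

k = 5
idx = list(range(len(data)))
random.shuffle(idx)
folds = [idx[i::k] for i in range(k)]  # k roughly equal validation folds

fold_mses = []
for fold in folds:
    hold = set(fold)
    train = [data[i] for i in idx if i not in hold]
    pred = mean(train)  # stand-in for a fitted model's prediction
    fold_mses.append(mean((data[i] - pred) ** 2 for i in fold))

cv_mse = mean(fold_mses)  # the k-fold estimate: average of the k fold MSEs
```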
C.6 Selected Quiz II Solutions
Substitute 𝑥 = 12 in 𝑦̂ = 3.85 − 2.95𝑥 to obtain 𝑦̂ = −31.55.
𝑡 = 𝛽1̂ / 𝑠𝛽1̂ , so 𝑠𝛽1̂ = 𝛽1̂ / 𝑡. Here, 1.240/5.544 = 0.2236652.
Nothing in a linear model, or any other estimated model for that matter, guarantees that the relationship is of a causal nature. In some rare cases, it could be causal, but these are really exceptions.
If both variables (𝑦 and 𝑥) are truly random, then the true 𝛽1 is 0. Because of sampling error, however, some samples will have a 𝛽1̂ that is very different from 0, i.e., extreme, and will lead us to reject 𝐻0 ∶ 𝛽1 = 0. How many times these “extreme” cases happen depends on how we define “extreme”. In a test of hypotheses, this will happen in a fraction 𝛼 of the cases.
False. Fitting the data at hand, i.e., the train data, is not a good indicator of the model’s ability to fit test data, i.e., to make predictions.
The correlation coefficient ranges from −1 to 1, and is unit free. This is why it is used to compare the goodness of the fit for various models.
The intercept is the prediction when all the explanatory variables are set to 0.
Hence, it must be in the same units as the explained variable, i.e., kg.
A prediction must be in the same unit as the predicted variable. Hence, every 𝛽𝑗 𝑥𝑗 must be in this same unit. In this particular case, 𝑥𝑗 is in cm. Hence, for 𝛽𝑗 𝑥𝑗 to be in kg, it must be the case that 𝛽𝑗 is in kg/cm.
True. The 𝑅2 of the multiple linear regression is calculated only in train data.
Therefore, it is not a reliable estimate for the quality of the fit in test data.
For the two estimated models (one with 𝐷𝑀 and the other with 𝐷𝑊 ) to give the same estimates for each type of individual, it must be the case that the coefficient on 𝐷𝑀 equals minus the coefficient on 𝐷𝑊 .
Notice that in a regression with 𝐷𝑀 , the coefficient on 𝐷𝑀 is, all things equal, the difference in wage earned by the male individuals with respect to the female individuals. In a regression with 𝐷𝑊 , the coefficient on 𝐷𝑊 is, all things equal, the difference in wage earned by the female individuals with respect to the male individuals.
Hence, it should be clear that the two differences must be equal, though with a
different sign.
No, it cannot. This is because the MSE uses the numeric difference between the observed value and the prediction for that observation. In classification problems, the observed value is a category, e.g., “Yes/No”, “Train/Car/Bicycle”. Therefore, we cannot meaningfully calculate a difference between these values.
False. The problem is unsupervised learning if the explained variable is not ob-
served. In the Netflix challenge, the competitors had that information. What they
didn’t have was the test data, i.e., the observations including the values of 𝑦 , the
clients’ votes on the movies that the competing models had to predict.
It is linear in the log of the variables, but linear nevertheless. To convince your-
self, simply replace log(𝑦) by 𝑤 and log(𝑥) by 𝑧 . Then the model becomes,
𝑤 = 𝛽 0 + 𝛽1 𝑧 + 𝜀
As we saw in our discussion about the paper Ferguson and Voth (2008), a high
𝑅2 is not required for a publication in a prestigious outlet.
The positive value of 𝛽1̂ , 𝛾1̂ and even 𝛼1̂ is simple to understand and is not ques-
tioned.
The difficulty resides in the interpretation of 𝛼2̂ . Recall that a coefficient in the linear model is the marginal effect of the variable, i.e., the effect when the value of the other regressors is held constant.
Here, if the value of the number of people in the sample is kept constant, then
having more kids in this sample will result in a smaller overall weight, hence a
negative coefficient 𝛼2̂ .
In class, we discussed a similar issue when we related the amount of money in a
wallet with 1. the number of coins in the wallet, and, 2. the number of 1 cent coins
in the wallet. Keeping the number of coins constant, the more 1 cent coins in a
wallet, the lower the amount of money in the wallet. The following simulation
illustrates this point, if you need to “see” it.
## 8 16.1 30 3
## 9 8.3 23 5
## 10 19.0 37 0
## 11 18.4 47 8
## 12 11.0 27 8
## 13 16.9 40 8
## 14 10.7 25 3
## 15 15.8 21 2
## 16 22.0 39 5
## 17 11.3 28 2
## 18 27.2 43 3
## 19 16.3 32 5
## 20 23.0 46 5
## 21 17.2 24 2
## 22 23.3 43 5
## 23 13.5 38 4
## 24 28.4 45 3
## 25 25.5 41 5
## 26 8.34 20 3
## 27 15.3 40 7
## 28 16.4 32 5
## 29 18.2 48 4
## 30 17.0 31 2
## # ... with 9,970 more rows
summary(lm(sum ~ n.coins + one.c, data = df))
##
## Call:
## lm(formula = sum ~ n.coins + one.c, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2013 -2.5711 -0.1084 2.4089 14.5991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.169640 0.154048 1.101 0.271
## n.coins 0.549636 0.004875 112.741 <2e-16 ***
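The R output above comes from such a simulation. To "see" the sign flip without any regression machinery, here is a self-contained sketch of the same idea in Python: holding the total number of coins fixed, wallets with more 1-cent coins contain less money. (All denominations and counts below are made up; only the sign of the comparison matters.)

```python
import random
from statistics import mean

random.seed(0)
wallets = []
for _ in range(20000):
    n_coins = random.randint(20, 50)
    one_c = random.randint(0, 8)   # number of 1-cent coins
    others = n_coins - one_c
    # each remaining coin is worth 5, 10, 25 or 50 cents
    value = one_c * 1 + sum(random.choice([5, 10, 25, 50]) for _ in range(others))
    wallets.append((n_coins, one_c, value / 100))  # amount in currency units

# Condition on a fixed number of coins, then compare few vs many 1-cent coins.
fixed = [(c, v) for n, c, v in wallets if n == 35]
few = mean(v for c, v in fixed if c <= 2)
many = mean(v for c, v in fixed if c >= 6)
assert many < few  # more 1-cent coins, same coin count => less money
```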
The normality of the errors could not really be assessed in Figure C.3. It would take another type of plot to see it.
The correlation between errors would be seen through patterns in the residuals,
which we don’t really see.
The key point to notice here is that the residuals seem to have very little variance
for the low values of TV and large variance at the other end. This is a case of non-
constant variance of the errors. In terms of the assumptions that we saw for the
linear model, this corresponds to a violation of the homoscedasticity assumption.
The procedure to answer this type of problem is to match the predictions for each category across the two models. The prediction for an observation from the north (𝐷1 = 1) is 𝛽0̂ + 𝛽1̂ .
𝑡 = (𝛽1̂ − 0)/𝑠𝛽1̂ = 𝛽1̂ /𝑠𝛽1̂
Figure C.4 shows the 𝑡 and the standard error. It is therefore straightforward to
deduce the value of the coefficient.
𝛽1̂ = 𝑡 ⋅ 𝑠𝛽1̂ = −2.875808
This is not correct. Failing to reject the null in that case does not rule out the possibility of a relationship between the variables, albeit a nonlinear one. For instance, their relationship could be inverted-U shaped, and it would typically not be picked up by a linear fit (i.e., the estimated slope would be 0).
False. Knowing the true relationship does not eliminate the random shocks to the relationship, often noted 𝜀. Since these will still occur in the test data, even the model with perfect knowledge of the true function will not make perfect predictions. It just can’t predict the random shocks. Hence, the MSE will never be 0 in the test data.
𝑁𝑎𝑧𝑖 = 1 if the firm is connected, and 𝑁𝑎𝑧𝑖 = 0 if the firm is unconnected.
𝑙𝑜𝑔𝑟𝑒𝑡𝑢𝑟𝑛𝑖,𝑝 = 𝛽0 + 𝛽1 𝑁 𝑎𝑧𝑖𝑖,𝑝 + 𝜀𝑖
where 𝑖 refers to the firm and 𝑝 = 𝑝1 if the estimation is for the first period,
while 𝑝 = 𝑝2 if the estimation is for the second period.
The question asks for two values, A and B. Looking at the tables, A = 0.0024 + 0.0697 = 0.0721 and B = 0.104. Therefore, A − B = −0.0319.
The most affordable of the list would be the one requiring the fewest computations. This would be the ‘validation set’ method because it typically only requires computing the MSE on one validation set, while LOOCV would require 𝑛 such computations and k-fold CV would require 𝑘.
No, the two models would typically not give the same predictions. One of the
major causes of the difference lies in the fact that the model B, with the number
of years of schooling, imposes a constant effect of that variable over its values.
In other words, in that case, every year of schooling would increase wage by the
same amount, i.e., 𝛽𝑗 .
With the dummies, the predictions would be more flexible with respect to the
effect of years of schooling, i.e., they could allow for non-constant effects (e.g., the
group with 21 years of schooling could earn less than the group with 17 years).
Yes, it is still a linear model despite the inclusion of a power-2 term. To be convinced of it, just rename 𝑧 = 𝑥² and rewrite the model as

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝛽2 𝑧𝑖 + 𝜀𝑖
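"Linear" here means linear in the parameters: with exact data from a quadratic, the three coefficients are recovered by solving a linear system, as in any linear model. A sketch with hypothetical coefficients, using three points and finite differences:

```python
# Exact data from a quadratic: y = b0 + b1*x + b2*x^2 (made-up coefficients).
b0, b1, b2 = 2.0, 3.0, -0.5
y = {x: b0 + b1 * x + b2 * x ** 2 for x in (0, 1, 2)}

# Renaming z = x^2 makes the model linear in the parameters, so three
# points pin them down through a linear system:
#   y(0) = b0,  y(1) = b0 + b1 + b2,  y(2) = b0 + 2*b1 + 4*b2.
hat_b0 = y[0]
hat_b2 = (y[2] - 2 * y[1] + y[0]) / 2  # second finite difference over 2
hat_b1 = y[1] - y[0] - hat_b2

assert (hat_b0, hat_b1, hat_b2) == (b0, b1, b2)
```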
Recall that a residual is the difference between the observed value (data) and the predicted value (fit). If the residual is negative, as it is here, then it means that the fit is larger than the data, i.e., the fit makes an overestimation.
All but one of these are classification problems. The number of people in a household is a numerical variable, implying a regression estimation. All the others predict a category: yes/no, party, academic degree, VAT level.
Since 𝐷 is either 0 or 1, it does not have units. But 𝛽2 𝐷 must have units, and
it must be the same as 𝑦 since the right-hand side of the equation is a sum of
elements that must have the same units (we can’t add apples and oranges). To-
gether, these imply that 𝛽2 must have the same units as 𝑦 .
Now, notice that 𝑦 is not one of the options. We must choose one with the same units as 𝑦 . In this case, only 𝜀 checks out.
The graph on the left does not offer guidance about the value taken by the ex-
plained variable. Indeed, all the observations appear as a dot in the graph. This
implies that the problem is an unsupervised one.
In contrast, the observations on the right graph indicate, by means of a different shape, the value/category of the explained variable.
For this question, we must compare models 2 and 3. Indeed, model 3 uses both of the variables (‘Market Cap’ and ‘Dividend Yield’) while model 2 only uses ‘Market Cap’. Notice that the two models have different numbers of observations, 𝑁 . In all likelihood, this is because, for some firms, there were no observed values for one of these variables.
Model 2 uses 352 observations and model 3 uses 299. This suggests that there
were 53 observations for which the authors have information on the ‘Market
Cap’ but not on the ‘Dividend Yield’.
This question is illustrated by Figure C.7. The red dot is the outlier added.
Clearly, compared to the original fit (blue line), the linear fit after the inclusion
of the outlier (red line) has a higher intercept and a smaller slope.
D
Practice Exam Questions
TABLE D.1: Practice exam questions with elements of solution in this appendix.
Exercise Solution
D.1 sol.
D.2 sol.
D.3 sol.
D.4 sol.
D.5 sol.
D.10 sol.
D.11 sol.
D.12 sol.
D.13 sol.
D.1 Midterm
Exercise D.1 (Severe complications at birth [Midterm 19-20]). Table D.2 provides information related to the births and the rate of severe complications at birth (SCB) in three facilities in 2019.
“Doors in a lighter color have a positive impact on the rate of severe complications at birth. One possible
explanation is that lighter colors provide a calmer atmosphere which reduces stress and its related
adverse effects.”
a. Explain why a p-value of 0 for the observed test statistic does not
make sense.
b. Still in this context, what is the minimal p-value that your procedure should find?
Exercise D.4 (Reproducibility and data disclosure). A necessary element for re-
producibility in research is the availability of a publication’s original data.
Exercise D.5 (CT for Covid test). Suppose that a test for COVID-19 uses:
𝐻0 ∶ the virus is present and active in the individual.
Suppose as well that the test can be modified by increasing the 𝐶𝑇 , from 20 to 25 or even higher. When the 𝐶𝑇 is increased, the presence of mere pieces of the virus, or even dead virus, will be detected and the test will be flagged as positive.
Exercise D.7 (Animal testing [Midterm 20-21]). Suppose that a researcher has
completed a difficult and time-consuming (harmless) experiment on 30 animals.
He has scored and analyzed a large number of variables. His results are generally
inconclusive, but one test (say a comparison of means before and after treatment)
yields a highly significant 𝑡-score, 2.70, which is surprising and could be of major
theoretical significance.
Assume that the researcher has in fact repeated the initial study with 20 addi-
tional animals, and has obtained an insignificant result in the same direction, 𝑡
= 1.02.
Exercise D.8 (Reproducible workflow [Midterm 20-21]). All terms below refer to the discussions we carried out in class.
a. Explain how one could conclude that the average consumption is sta-
tistically different between the two countries.
b. The result above about the difference in the means has statistical
significance. What do you think of its economic significance? In other
words, do you think there is a substantial economic difference in milk
consumption between the two countries? Explain briefly.
c. What relation, then, can you make between statistical and economic
significance? Explain briefly.
D.2 Endterm
Exercise D.10 (Default data [Endterm 19-20]). Consider again the ‘Default’ data
set that we analyzed in class. In particular, recall the following variables:
We then run a logistic regression with two explanatory variables, ‘balance’ and
‘student’ (studentYes), call it Model 3. Figure D.3 gives the predicted probabili-
ties of default calculated with Model 3 and separated for students (red line) and
non-students (blue line).
e-grade𝑖 = 𝛽0 + 𝛽1 m-grade𝑖 + 𝜀𝑖
where e-grade and m-grade are the grades at the endterm and midterm exams,
respectively, and 𝑖 refers to each student in the class.
What would you expect from the estimation of this model? Explain. What would
you say about causality in this model? Explain.
Exercise D.12 (Wine price [Endterm 19-20]). Exercise 21.2 was in this endterm exam.
a. Argue that finding the correctly specified model for 𝑦 thanks to the
data at hand is an elusive quest that is not worth pursuing.
The statement linking the colors of the doors and SCB is not warranted. The default hypothesis should be that the rate is the same across facilities. The difference could simply be due to sampling error. Indeed, as we can see, two facilities have much smaller sample sizes, which could account for a larger variance in the sampling distribution of their average rate.
The second statement is an example of an ad hoc explanation, with no statistical support. Yet, it will be seen as likely because we crave explanations, and better wrong ones than none.
It follows that, since we observed at least one such value (the very test statistic that we calculated), the probability of observing it cannot be 0.
Then, the smallest 𝑝-value that the procedure should find is 1/𝑁 , where 𝑁 is the number of simulations made in the test.
Recall that we sample a large number of simulations, but do not list all the possible samples. Therefore, it is not guaranteed that we draw the simulation that matches the sample that we have at hand. Potentially, we could then obtain a p-value equal to 0. To avoid it, we correct the 𝑝-value by adding 1 to the numerator and the denominator in the calculation of the probability.
In this case, the smallest possible 𝑝-value becomes 1/(𝑁 + 1), which is strictly positive.
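The correction just described can be written as a one-line function; a sketch (with hypothetical inputs):

```python
def empirical_p(sim_stats, observed):
    """Simulation p-value with the +1 correction in numerator and denominator."""
    extreme = sum(1 for s in sim_stats if s >= observed)
    return (1 + extreme) / (1 + len(sim_stats))

# With N = 999 simulations and none as extreme as the observed statistic,
# the p-value is 1/(N + 1) = 0.001, never exactly 0.
assert empirical_p([0.1] * 999, observed=5.0) == 0.001
```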
This is a two-sided test because the effect of attending the class on the grade could be either positive (presumably expected) but also negative (if the class only adds confusion).
If we don’t want to rely on a (theoretical) normal distribution, then a permutation
test could be run.
The idea is to compare the averages of the two groups. If the difference between
the average of the 28 students who attended and the average of the 17 students
who didn’t is not extreme, then we do not reject 𝐻0 .
In order to know if it is extreme, we must obtain the sampling distribution of the
difference of the average between the two groups of these sizes.
For that, we randomly sample any 28 and put them in a group and the remain-
ing 17 are in the other group. We calculate the average for both groups and the
difference between the two.
We repeat this procedure a very large number of times in order to obtain a sampling distribution.
Finally, we check where the observed difference fits into that distribution. If it is
extreme, i.e., below some threshold level 𝛼, we reject 𝐻0 , otherwise we do not
reject it.
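The steps above can be sketched directly (hypothetical grades; the book would run this in R):

```python
import random
from statistics import mean

random.seed(42)
# Hypothetical grades: 28 attendees, 17 non-attendees.
attended = [random.gauss(14, 3) for _ in range(28)]
skipped = [random.gauss(12, 3) for _ in range(17)]
observed = mean(attended) - mean(skipped)

pooled = attended + skipped
diffs = []
for _ in range(5000):
    random.shuffle(pooled)  # random re-split into groups of 28 and 17
    diffs.append(mean(pooled[:28]) - mean(pooled[28:]))

# Two-sided p-value: share of re-splits at least as extreme as observed.
p = sum(1 for d in diffs if abs(d) >= abs(observed)) / len(diffs)
reject = p < 0.05
```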
For research results to be fully reproducible, the analysis code is almost useless
without the data to which it applies. Hence, the availability of the original data
is crucial for reproduction of the results.
Put differently, if one cannot obtain the original data, then any result will depend
on the good faith of the researcher. Rogue cases are not unseen (read here1 or
here2 ).
Currently, it is not always possible to obtain the data of the research (read for in-
stance, here3 for the case in psychology.) Beyond fraud, there are several reasons
explaining the reluctance of researchers in providing their data. For instance,
these might be confidential.
In most cases, however, this has to do with the fact that the data were difficult to obtain. For instance, this happens when the data collection involved expensive equipment or simply years for a researcher to dig through huge archives. Researchers sometimes then find it unfair to simply deliver for free the data that other researchers could use for their publications.
Research in economics is converging towards full availability of data and code
(read here4 ). One compromise with respect to the previous point is that re-
searchers commit themselves to provide the data to the editor of the journal
and/or the referees.
a. For this part, please read the explanation provided in the answer to the quiz question here.
Consider the increase of the CT. As described, this will result in more cases being flagged as positive. Among those, the positive test will only be due to the detection of “pieces of the virus, or even dead virus”. In other words, there will be more cases where we will fail to reject the null, i.e., more cases where we fail to tell the person that they don’t have covid. This represents a wrong diagnosis, an error of Type II (see Table 4.1).
Hence, the higher the CT, the higher the 𝛽 .
In Type I, you may make an error by rejecting the null when actually it is true.
In other words, the person has covid but you do not detect it and send out a
person with covid. Of course, this generates costs related to the larger spread of
the disease since that person will contaminate other people.
In Type II, you may make an error by failing to reject the null when actually it
is wrong. In other words, the person does not have covid but you detect it and
send home a person without covid. Of course, this generates costs related to the
reduced economic production since that person will not be able to work.
a. balance and studentYes have a significant and positive impact on the prob-
ability of default.
b. False, there is no direct reading of that kind. The coefficient enters into
a non-linear function.
c. The non-students have a higher probability of default.
We would expect the estimation of the model to show a very high correlation
between these two variables. This is because the observations are paired, one for
each student, making it possible that the reasons that determine one variable are
the same that determine the other.
In terms of causality, the most likely relation that we could draw is between a
common source/cause and these two variables. For instance, the student’s skills
could simultaneously explain both the result at the midterm and at the endterm.
Technically, the regression model would not satisfy the usual conditions. This is
because the random shocks to the model would be correlated with the “explana-
tory” variable. To see that, think of any shock affecting the real cause, say the
skills. Now, this shock will affect the e-grade, via 𝜀, and m-grade, rendering these
two correlated.
𝑝 = 𝛽 0 + 𝛽1 ⋅ 0 + 𝛽 2 ⋅ 1 + 𝛽 3 ⋅ 0 + 𝛽 4 ⋅ 0 + 𝛽 5 ⋅ 1 + 𝛽 6 ⋅ 0 + 𝜀
𝑝 = 𝛽 0 + 𝛽1 ⋅ 0 + 𝛽 2 ⋅ 0 + 𝛽 3 ⋅ 0 + 𝛽 4 ⋅ 0 + 𝛽 5 ⋅ 0 + 𝛽 6 ⋅ 1 + 𝜀
So, 𝑝 = 6.45.
d. Alternatively, we could use dummies for some years to see the effect
of particular years. This would make more sense given the nature of
the good. Indeed, the price of a wine bottle does not increase/decrease
D.4 Selected Endterm Solutions
regularly with time, but is affected by the quality of the wine obtained
in a few particular years.
a. We never know what the correct model really is. So, trying to find it is
not worth the effort.
Exercise Solution
3.6 sol.
4.1 sol.
6.1 sol.
6.2 sol.
16.1 sol.
16.2 sol.
17.1 sol.
E Solutions to Selected End-of-Chapter Exercises
The error of Type I occurs if we reject 𝐻0 when 𝐻0 is actually true. In this case,
it would amount to convicting an individual who is actually innocent. This deci-
sion would be based on false or incomplete evidence. Hence, to avoid that error,
one would need to advance with a conviction only in cases where the evidence is
plentiful and extremely good, e.g., footage of the person committing the crime,
a confession, etc.
The Type II error occurs when we fail to reject 𝐻0 when 𝐻0 is actually false. In
the present case, it would amount to letting a guilty person go free.
There is a relationship between the two types of error: reducing the first implies
increasing the second. Indeed, requiring a higher standard of evidence for con-
viction means that it will also be more difficult to convict actual criminals, plentiful
and extremely good evidence being harder to obtain.
The null hypothesis here is a bit different from usual. We don’t want to test
whether the true proportion, 𝑝, is equal to 0.5 or not. Instead, we want a
null whose alternative is that the true proportion is larger than 0.5. Hence, we write,

𝐻0 ∶ 𝑝 ≤ 0.5   vs.   𝐻𝑎 ∶ 𝑝 > 0.5.

To make the link with the generic expressions in the notes, this example has 𝑝0 =
0.5.
We know that, under the null, the CLT implies that the sampling distribution of
the sample proportion, 𝑝̂, is approximately normal around the true value 𝑝0 ,

𝑝̂ ∼̇ 𝑁(𝑝0 , 𝑝0 (1 − 𝑝0 )/𝑛).

Equivalently, after standardizing,

(𝑝̂ − 𝑝0 )/√𝑝0 (1 − 𝑝0 )/𝑛 = 𝑍 ∼̇ 𝑁(0, 1).
Formally, we would reject 𝐻0 if, under the null, the probability of observing a
sample proportion as extreme as ours or more extreme is too small, as determined
by 𝛼 and the type of test (two- or one-tailed). Hence, to evaluate our hypothesis,
we start by calculating the z-score of our sample proportion (aka the test statistic):

𝑧 = (𝑝̂ − 𝑝0 )/√𝑝0 (1 − 𝑝0 )/𝑛.
When 𝑛 = 300,

𝑧1 = (0.54 − 0.5)/√0.5(1 − 0.5)/300 = 1.39.

When 𝑛 = 500,

𝑧2 = (0.54 − 0.5)/√0.5(1 − 0.5)/500 = 1.79.
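As a quick check, the two z-scores can be reproduced in base R (a minimal sketch, using only the numbers above):

```r
# z-scores of the sample proportion under the null p0 = 0.5
z1 <- (0.54 - 0.5) / sqrt(0.5 * (1 - 0.5) / 300)
z2 <- (0.54 - 0.5) / sqrt(0.5 * (1 - 0.5) / 500)
round(c(z1, z2), 2)   # 1.39 1.79
```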
At this point, we saw that we can follow two routes: the rejection region approach
or the p-value approach.
Rejection region approach
As explained above, this is a one-tailed test. Hence, there is only one rejection
region, on the right tail of the Z distribution. And the probability of Type I error,
𝛼, concentrates in that tail.
The limit of the rejection region in this case, with 𝛼 = 0.05, is

Φ−1 (1 − 𝛼) = 𝑧𝛼 = 1.64.

Since 𝑧1 = 1.39 < 1.64, the test statistic does not fall in the rejection region when
𝑛 = 300, so we cannot reject 𝐻0 ; since 𝑧2 = 1.79 > 1.64, it does fall in the rejection
region when 𝑛 = 500, so we reject 𝐻0 .

P-value approach

For this one-tailed test, the p-value is the probability of observing a value of 𝑍
larger than the test statistic,

𝑃 (𝑍 > 𝑧𝑖 ) = 1 − 𝑃 (𝑍 < 𝑧𝑖 ).

Since this is a unilateral test, this probability is already the p-value of the test.
When 𝑛 = 300, p-value = 0.0823 > 0.05 = 𝛼. Therefore, based on this sample,
we cannot reject the null that the true proportion is smaller than or equal to 0.5.
When 𝑛 = 500, p-value = 0.0367 < 0.05 = 𝛼. Therefore, based on this sample,
we can reject the null and accept the alternative that the true proportion is larger
than 0.5.
For 𝑛 = 300.
For 𝑛 = 500.
## 0.5032207 1.0000000
## sample estimates:
## p
## 0.54
Notice how the results are fully in accordance with those obtained manually
above. In particular, the p-values are virtually identical to the analytical results.
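The R calls that produced the truncated output above are not shown; a plausible reconstruction (the success counts 162 = 0.54 × 300 and 270 = 0.54 × 500 are my assumption, and `correct = FALSE` drops the continuity correction so the result matches the manual z-test; the lower confidence bound 0.5032207 shown above matches the 𝑛 = 500 call) would be:

```r
# One-sided tests of H0: p <= 0.5 against Ha: p > 0.5
t1 <- prop.test(x = 162, n = 300, p = 0.5, alternative = "greater", correct = FALSE)
t2 <- prop.test(x = 270, n = 500, p = 0.5, alternative = "greater", correct = FALSE)
t1
t2
```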
The intercept will have the same units as the explained variable. To convince
yourself, consider the case where the slope coefficient or the explanatory variable
is 0. Then the value of the explained variable amounts to the intercept. Hence,
they must have the same units.
As for the slope, notice that the prediction must be in the same unit as the predicted
variable. Hence, 𝛽1 𝑥 must be in this same unit. In this particular case, 𝑥 is
in cm. Hence, for 𝛽1 𝑥 to be in kg, it must be the case that 𝛽1 is in kg/cm.
library(readr)
df <- read_csv("data/Advertising.csv")
m1 <- lm(sales ~ TV, data = df)
summary(m1)
##
## Call:
## lm(formula = sales ~ TV, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
b0 <- m1$coefficients[[1]]
b1 <- m1$coefficients[[2]]
Now, I try to verify the given relationship, involving the means of the variables.
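The verification itself is not reproduced above, but the property it relies on can be sketched with simulated data (all numbers below are arbitrary assumptions): with an intercept, an OLS fit always passes through the point of means, i.e., mean(y) = b0 + b1 · mean(x).

```r
# The OLS line passes through the point of means (simulated data)
set.seed(1)
x <- rnorm(100, mean = 150, sd = 50)
y <- 7 + 0.05 * x + rnorm(100)
fit <- lm(y ~ x)
c0 <- coef(fit)[[1]]; c1 <- coef(fit)[[2]]
all.equal(mean(y), c0 + c1 * mean(x))   # TRUE
```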
b. I first take a short-cut by using the residuals already stored inside m1.
Then, as suggested in the hint, I calculate them by hand (resid) and find
the same result.
mean(m1$residuals)
## [1] -6.464447e-17
resid <- df$sales - m1$fitted.values
mean(resid)
## [1] 8.597723e-18
b1
## [1] 0.04753664
r <- cor(df$TV, df$sales); s.y <- sd(df$sales); s.x <- sd(df$TV)
r * s.y / s.x
## [1] 0.04753664
a. Again, I load the data and estimate the model with lm().
library(readr)
df <- read_csv("data/Advertising.csv")
m2 <- lm(sales ~ TV + radio + newspaper, data = df)
summary(m2)
##
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
Recall that each coefficient gives the marginal effect on sales, i.e., keeping all the
rest constant. One reason that could explain this lack of association with
newspaper is that advertising campaigns in newspapers are always run at the
same time as advertising campaigns in other media. The “technical” implication
would be that “keeping the rest constant” does not really apply, leaving little
room for the model to pick up the effect of newspaper on sales.
b0 <- m2$coefficients[[1]]
b1 <- m2$coefficients[[2]]
b2 <- m2$coefficients[[3]]
b3 <- m2$coefficients[[4]]
d. In order to be able to draw that fit, some (constant) value must be as-
signed to the other variables. For instance, one could “fix” these other
variables at their mean.
e. Again, as suggested in the hint, I calculate the residuals (resid) and their
mean.
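For part d., the idea of fixing the other variables at their means can be sketched as follows. This is a self-contained illustration with simulated data: the variable names mimic the Advertising ones, but the data-generating process is entirely made up.

```r
set.seed(42)
n <- 200
TV <- runif(n, 0, 300)
radio <- runif(n, 0, 50)
newspaper <- runif(n, 0, 100)
sales <- 3 + 0.05 * TV + 0.2 * radio + rnorm(n)
fit <- lm(sales ~ TV + radio + newspaper)
# Fit of sales against TV, holding radio and newspaper at their means
grid <- data.frame(TV = seq(0, 300, length.out = 100),
                   radio = mean(radio),
                   newspaper = mean(newspaper))
grid$pred <- predict(fit, newdata = grid)
plot(TV, sales)
lines(grid$TV, grid$pred, lwd = 2)
```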
Here is a set of questions that I received and answered by email. They might be
of interest for everyone.
F.1 Q
My doubt is directed to the question B.15 to be exact. My answer was d (we cannot say: we need more
information about the samples). The rationale behind my choice was that even though the question states
the variances are known and I’m aware that a bigger sample assures more accurate results, it was not
clear to me how big the variances were within each sample. Therefore, if the variance in the bigger sample
turned out to be very extreme, I guessed the null could be rejected more often in the smaller sample with a
much smaller variance…
While I see virtues in your explanation, I would still stick to the answer provided,
and this for two reasons, one weaker than the other. First, the question uses the
term “generally”, which points at an explanation centered on the only element that
changes for sure across the samples, the number of observations, somehow
averaging over the possibilities for the other conditions. I’m not too happy with
this argument because it seems to rely heavily on the minutiae of the wording.
Second, your answer suggests that we should discard the true parameter of the
population and use the sample estimate instead. I reckon that it is unlikely that we
can know the true variance, but the question specifically says that we do. And
this point matters because it is not reasonable to throw away information. The
standard deviation of the sampling distribution of the mean will then be the same
for both sample means, except for the effect of 𝑛, which justifies my answer in
the quiz.
400 F Your Questions
F.2 Q
“green jelly beans” case. (…) what I am confused about now is that, when we are asking the computer to
calculate the risk of the firm with inductive models, we need to give as much data as we can (even the
logo shape and color) which we never consider as a factor of risk. by comparing these two concepts, could
we say that we need to prioritize the most important components regarding our budget and abilities? and
how can we be sure that we are not considering the wrong data? I want to give an example to make it
more clear, imagine that we are considering the most important factors for aggression in people and we
have come up with 2 factors of work stress and sleep time. how are we sure that there is not another factor
which is more important? (sleep time here might be like the color of the jelly - something useless) and
how can we be sure that these are not giving us wrong results?
F.3 Q
I get the fact that if the p-value is bigger than 0.05, we don’t reject the null hypothesis. if I understand
correctly, by expanding the sample size, we can reach a point that we are able to reject h0. so, there would
be a point where if we pass from it by 𝜖, h0 would not be accepted anymore and this point would be the
minimum size of our sample which we need to go for another hypothesis. in the republicans and
democrats example which we discussed in class, is it possible to say that this would be the minimum
population which we can for sure say that if we know the democrats in one state overweight the
republicans (or vice-versa) we can surely interpret the other state too?
I think we should clarify a point. What you say is correct IF you assume that
in the larger sample (increasing 𝑛) the observed difference between the samples
remains the same. In that case, if you keep that difference in larger samples, then
while getting a larger sample there will be a threshold 𝑛∗ for which you can reject
the null hypothesis, yes.
However, nothing guarantees that this will be the case, i.e., that the difference
will be the same in larger samples. You observe one sample with a difference.
Fine. But it could be the case that another sample would show another differ-
ence between the groups. And this idea is an even more fundamental point to
understand.
F.4 Q
This interpretation is actually not correct. We use the confidence level and the
margin of error in the same context but, despite their names, they are not comple-
ments of each other, nor even directly related.
The confidence level of, say, 95% means the following. If you were to apply the
same estimation procedure to a very large number of samples, then 95% of the
confidence intervals obtained would contain the true parameter.
The margin of error expresses the range around the estimate. To fix ideas,
keep in mind that you could be 95% confident that the true parameter is around 60
± 1%, or 60 ± 2%, or 60 ± 3%… The actual size of this margin depends
on other factors such as the real variance in the population or the number of
observations.
Furthermore, the margin of error is often expressed in absolute terms, not in
percentages. For instance, we could read that the researcher is 95% confident
that the true parameter is 1700 ± 150.
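The repeated-sampling reading of the confidence level can be illustrated with a small simulation (a sketch; the normal data and the t-based intervals are my choice, not part of the question):

```r
# Fraction of 95% CIs that contain the true mean (here, 0) across many samples
set.seed(1)
covers <- replicate(5000, {
  x <- rnorm(50)                 # a new sample each time
  ci <- t.test(x)$conf.int       # its 95% confidence interval
  ci[1] <= 0 && 0 <= ci[2]       # does it contain the true mean?
})
mean(covers)   # close to 0.95
```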
F.5 Q
I see that the beta1 is close to the value that we expected (3) but the intercept is 12 and not 5, is the model
still good?
The key with simulations is that we know the true model. In this case, there can
only be one source for the discrepancy between the estimated coefficients and the
true parameters, namely sampling error: it can just happen that we don’t get
exactly the true parameters even if we estimate the correct model.
Notice also that 𝛽0 is rarely of importance.
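A sketch of such a simulation (the true parameters 5 and 3 and the noise level are arbitrary assumptions, not the numbers from the question):

```r
# Estimating the correct model still leaves sampling error in the estimates
set.seed(4)
x <- runif(100, 0, 10)
y <- 5 + 3 * x + rnorm(100, sd = 5)
coef(lm(y ~ x))   # close to, but not exactly, 5 and 3
```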
F.6 Q
I am struggling to understand the difference between residuals and errors in practical terms, in a R code
for example. Where can we see this?
Fundamentally, we can never observe the errors. These errors, better understood
as random shocks to a variable 𝑦 , represent an influence on 𝑦 that nothing ac-
counts for. We can also see it as the difference between the value of the variable
𝑦 and the value that it would take on average under the true model for 𝑦. This
“true” model, however, is little short of a chimera: we will never know/see it.
Therefore, the true shock will similarly never be known/seen.
The residuals, on the other hand, can be observed because they are simply the
difference between the value of the variable 𝑦 and the value of the model’s pre-
diction, 𝑦.̂ This, in turn, implies that the residuals depend on the model that is
used to “explain” 𝑦 . For instance, a person earns a wage of 2000 euros. A given
model of wage predicts that, given some characteristics (the 𝑋 variables), she
would earn 1945. Then, the residual for that observation is 55.
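In R, this is exactly what residuals() returns — observed minus fitted values — as a quick simulated check shows (the data below are invented):

```r
# Residuals are the observed values minus the model's predictions
set.seed(7)
x <- rnorm(30)
y <- 2 + 3 * x + rnorm(30)
fit <- lm(y ~ x)
all.equal(residuals(fit), y - fitted(fit))   # TRUE
```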
F.7 Q
The 𝑅2 is only used in inferences and not predictions, am I correct to assume this? I also do not
understand why sometimes even if the 𝑅2 is small it is still a good model, so when do we look at it?
There is nothing that prevents using the 𝑅2 as a criterion in the context of pre-
diction; as you know, though, it would be a very poor indicator of predictive
performance there. So, in reality, it is used and reported in the exercises that I
referred to as inference, yes.
We should always have a look at it, but never give it more importance than it
deserves. This is because the quality of a paper should not be measured by that
criterion alone. There are many other criteria, in particular ones that require
thought and context-specific knowledge.
Another way of seeing it is that a high 𝑅2 is no guarantee whatsoever of the
quality of a paper, certainly not a substitute for real thinking.
F.8 Q
Regarding the assumptions, when one of these does not hold it means the model is not good for our data?
When one of the assumptions of a model does not hold, the inference based
on that model may be slightly or severely misleading, depending on which
assumption is violated and how seriously.
For instance, in the case of an endogeneity issue (𝑋 and 𝜀 correlated), the infer-
ence on the coefficients cannot be trusted. The hypothesis test may say that
we reject the null, but this conclusion is no longer reliable.
F.9 Q
What was the conclusion for simulated models when the variables were correlated or when we forgot a
variable in our estimation?
Also, did we say that when we forget a variable in our estimation the variables used become correlated?
If the omitted variable is correlated with the variables kept in the estimation,
then the coefficient that we obtain in our defective model will be biased, i.e., the
estimate is not reliable.
F.10 Q
In general, I just want to make sure that the explanatory variable is the dependent Y variable and the
explain variable is the independent X variable?
Not exactly. The following table gives the main terms used for these variables.
𝑌            𝑋
dependent    independent
explained    explanatory
endogenous   exogenous
predicted    predictor
response     regressor
outcome      characteristics
…            …
F.11 Q
Bias Variance Trade off: What is more important to minimize the bias or the variance? Further, is a
flexible model the same as a complex model? If not what does it mean to be a “flexible” model?
Additionally, how should we understand and interpret the graph in Figure 15.14?
The MSE is the sum of 1. the bias squared, and 2. the variance. The key point that
Figure 15.14 conveys is that there is a trade-off: when one decreases, the other in-
creases. Reducing the bias increases the variance, and reducing the variance in-
creases the bias. So, the optimal solution is where their sum, the MSE, is minimal.
Yes, in this context, the models with more variance are the flexible/complex/
high-degree-polynomial ones.
This is the opposite of rigid models such as the linear model. The latter is rigid
because, no matter the shape of the relationship, it will always produce a line (or
hyperplane), which is doomed never to get close to the data points when the true
relationship is not linear.
F.12 Q
Cross-Validation Method: Do I understand it correctly that the leave one out approach uses only one
variable as the training data and the rest will be validation data set? How can we estimate a model based
on only one value?
What is left out in the LOOCV is one observation. The model is then estimated
on the 𝑛 − 1 observations (training set) and the estimated model is used to make
a prediction on the left-out observation (validation set).
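A minimal LOOCV sketch in R makes the loop explicit (the data and the lm() model are my own illustrative choices):

```r
set.seed(3)
df <- data.frame(x = runif(50))
df$y <- 1 + 2 * df$x + rnorm(50, sd = 0.3)
errs <- sapply(seq_len(nrow(df)), function(i) {
  fit <- lm(y ~ x, data = df[-i, ])              # train on the other n - 1 rows
  (df$y[i] - predict(fit, newdata = df[i, ]))^2  # predict the left-out row
})
mean(errs)   # the LOOCV estimate of the test MSE
```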
F.13 Q
Further, in this part of the notes you talk a lot about polynomials, does the degree of polynomials refer to
the amount of values we have in each sample?
No. The degree of the polynomial, 𝑝, is the highest power of the explanatory
variable included in the model, not the number of values in each sample. For
𝑝 = 1,

𝑙𝑤𝑎𝑔𝑒 = 𝛽0 + 𝛽1 𝑎𝑔𝑒 + 𝜀,

for 𝑝 = 2,

𝑙𝑤𝑎𝑔𝑒 = 𝛽0 + 𝛽1 𝑎𝑔𝑒 + 𝛽2 𝑎𝑔𝑒² + 𝜀,

for 𝑝 = 3,

𝑙𝑤𝑎𝑔𝑒 = 𝛽0 + 𝛽1 𝑎𝑔𝑒 + 𝛽2 𝑎𝑔𝑒² + 𝛽3 𝑎𝑔𝑒³ + 𝜀,

etc…
[Figure F.1: polynomial fits of logwage on age, panels 𝑝 = 1 to 𝑝 = 6.]
In Figure F.1, we can “see” how the quality of the fit increases with the
degree of the polynomial 𝑝. The fit goes from a rigid line (𝑝 = 1) to a very flexible
curve for 𝑝 ≥ 3, though the gains in fit beyond 𝑝 = 4 do not seem to be large.
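The figure can be mimicked with simulated data (the data-generating process below is invented, merely shaped like the wage example). One point worth seeing numerically: the in-sample fit, as measured by 𝑅2, can only improve as 𝑝 grows, because the models are nested.

```r
set.seed(5)
age <- runif(200, 20, 60)
logwage <- 12 + 0.08 * age - 0.0009 * age^2 + rnorm(200, sd = 0.3)
# R^2 of polynomial fits of degree p = 1, ..., 6
r2 <- sapply(1:6, function(p) summary(lm(logwage ~ poly(age, p)))$r.squared)
round(r2, 3)   # non-decreasing in p
```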
F.14 Q
I am not quite sure if I understand the overall picture and difference of residuals, shocks, variances and
standard errors.
F.15 Q
Does ESS refer to variance and RSS to residuals or does it not have anything to do with each other?
The TSS is proportional to the variance of the variable 𝑦. The RSS is proportional
to the variance of the residuals, 𝑒̂.
We can show that,
𝑇 𝑆𝑆 = 𝐸𝑆𝑆 + 𝑅𝑆𝑆.
Recall that the residuals are the distance between the observations and the pre-
dicted value, i.e., they measure the failure in the prediction. Hence, the RSS is the
part of the variance of 𝑦 , i.e. of TSS, that the model did not manage to explain.
The ESS, on the other hand, is the part of the variance of 𝑦 , i.e., the part of TSS,
that our model managed to explain.
The measure of fit that we use, the 𝑅2, is the ratio of the variance explained by the
model, i.e.,

𝑅2 = 𝐸𝑆𝑆/𝑇 𝑆𝑆.
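Both the decomposition and the 𝑅2 formula are easy to verify numerically (a sketch with simulated data; the numbers are arbitrary):

```r
set.seed(9)
x <- rnorm(100)
y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
TSS <- sum((y - mean(y))^2)            # total sum of squares
ESS <- sum((fitted(fit) - mean(y))^2)  # explained sum of squares
RSS <- sum(residuals(fit)^2)           # residual sum of squares
all.equal(TSS, ESS + RSS)                     # TRUE
all.equal(summary(fit)$r.squared, ESS / TSS)  # TRUE
```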