
Prof. Antonio Fidalgo

Business Research Methods


Introductory Lecture Notes
Contents

List of Tables xxi

List of Figures xxiii

Foreword xxix

I Introduction 1

1 Statistical Intuition 3
1.1 A Few Questions in Statistics . . . . . . . . . . . . . . . . . . . 3

1.1.1 Linda (Tversky and Kahneman, 1983) . . . . . . . . . . . 4

1.1.2 Monty Hall . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.3 Mean IQ . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.4 Binary Sequence . . . . . . . . . . . . . . . . . . . . . . 4

1.1.5 Your Random Number . . . . . . . . . . . . . . . . . . . 6

1.1.6 Positive Cancer Test . . . . . . . . . . . . . . . . . . . . . 6

1.1.7 Armour . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.1.8 Average Wage Growth . . . . . . . . . . . . . . . . . . . 6

1.2 Learning Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.3 A Learning Strategy . . . . . . . . . . . . . . . . . . . . . . . . 8

1.3.1 Content Over Form . . . . . . . . . . . . . . . . . . . . . 8

1.3.2 Main Words . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.3.3 Principles Over Techniques . . . . . . . . . . . . . . . . 9


1.4 Strengthening your Intuition . . . . . . . . . . . . . . . . . . . 9

1.4.1 Question in 1.1.1 . . . . . . . . . . . . . . . . . . . . . . 9

1.4.2 Question in 1.1.2 . . . . . . . . . . . . . . . . . . . . . . 10

1.4.3 Question in 1.1.3 . . . . . . . . . . . . . . . . . . . . . . 15

1.4.4 Question in 1.1.4 . . . . . . . . . . . . . . . . . . . . . . 15

1.4.5 Question in 1.1.5 . . . . . . . . . . . . . . . . . . . . . . 16

1.4.6 Question in 1.1.6 . . . . . . . . . . . . . . . . . . . . . . 16

1.4.7 Question in 1.1.7 . . . . . . . . . . . . . . . . . . . . . . 17

1.4.8 Question in 1.1.8 . . . . . . . . . . . . . . . . . . . . . . 17

2 Statistical Statements 19
2.1 Introductory Example . . . . . . . . . . . . . . . . . . . . . . . 19

2.2 Exact Permutation Distribution . . . . . . . . . . . . . . . . . . 20

2.3 Subsetted Permutation Distribution . . . . . . . . . . . . . . . . 21

2.4 Unbalanced, Skewed Case . . . . . . . . . . . . . . . . . . . . . 23

2.5 R Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3 Paul the Octopus and 𝑝 < 0.05 29


3.1 Paul the Octopus… . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.1.1 … and Other Psychic Beasts . . . . . . . . . . . . . . . . 31

3.1.2 Still Randomness . . . . . . . . . . . . . . . . . . . . . . 31

3.2 p-Hacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.2.1 A Threat to Science . . . . . . . . . . . . . . . . . . . . . 33

3.3 Efficient Markets Hypothesis . . . . . . . . . . . . . . . . . . . 33

3.4 Rigorous Uncertainty and Moral Certainty . . . . . . . . . . . . 35

3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

II Statistical Inference 39

4 A Blueprint for Inference 41


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.3 Testable Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . 42

4.4 Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.5 Sampling Distribution . . . . . . . . . . . . . . . . . . . . . . . 44

4.6 Level of Significance . . . . . . . . . . . . . . . . . . . . . . . . 44

4.7 Deciding on an Hypothesis . . . . . . . . . . . . . . . . . . . . 44

4.7.1 Critical Rejection Region . . . . . . . . . . . . . . . . . . 44

4.7.2 One-Tailed and Two-Tailed Tests . . . . . . . . . . . . . . 45

4.7.3 The 𝑝-Value . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.7.4 Equivalence of Approaches . . . . . . . . . . . . . . . . 47

4.8 Types of Error . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.8.1 Type I Error . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.8.2 Type II Error . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5 Theoretical Sampling Distributions 51


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.2 The Central Limit Theorem . . . . . . . . . . . . . . . . . . . . 52

5.2.1 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . 53

5.3 Sampling Distribution of the Sample Proportion . . . . . . . . . 54

5.4 Sampling Distribution of the Sample Variance . . . . . . . . . . 56

5.4.1 Degrees of Freedom . . . . . . . . . . . . . . . . . . . . . 56

5.4.2 Expected Value of Sample Variance . . . . . . . . . . . . 57



5.4.3 Sampling Distribution when Sampling from a Normally


Distributed Population . . . . . . . . . . . . . . . . . . . 57

5.5 The Chi-Square Distribution . . . . . . . . . . . . . . . . . . . . 58

5.5.1 Using the Table . . . . . . . . . . . . . . . . . . . . . . . 60

6 Inference on Sample Proportions 63


6.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

6.1.1 Categorical Variables . . . . . . . . . . . . . . . . . . . . 64

6.1.2 Bernoulli Trial . . . . . . . . . . . . . . . . . . . . . . . . 64

6.1.3 Sample Proportion . . . . . . . . . . . . . . . . . . . . . 65

6.1.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

6.2 Inference for a Single Proportion . . . . . . . . . . . . . . . . . 68

6.2.1 Assumptions: Independence . . . . . . . . . . . . . . . . 68

6.2.2 Testable Hypothesis: Dart-Throwing Chimpanzees . . . . 68

6.2.3 Estimator: Sample Proportion . . . . . . . . . . . . . . . 69

6.2.4 Sampling Distribution . . . . . . . . . . . . . . . . . . . 69

6.2.5 Level of Significance: 0.05 . . . . . . . . . . . . . . . . . 70

6.2.6 Deciding on an Hypothesis: Bilateral Test . . . . . . . . . 70

6.2.7 Critical Regions . . . . . . . . . . . . . . . . . . . . . . . 70

6.2.8 P-Value . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

6.2.9 Implemented Test in R . . . . . . . . . . . . . . . . . . . 72

6.3 Comparing Two Proportions . . . . . . . . . . . . . . . . . . . 73

6.3.1 Assumptions: Extended Independence . . . . . . . . . . 73

6.3.2 Sampling Distribution . . . . . . . . . . . . . . . . . . . 74

6.3.3 Illustration: Percentage Republicans . . . . . . . . . . . . 74

6.3.4 Implementation in R . . . . . . . . . . . . . . . . . . . . 76

6.3.5 Illustration: One Question Fluke? . . . . . . . . . . . . . 77



6.4 Goodness of Fit for Many Proportions . . . . . . . . . . . . . . 79

6.4.1 Illustration: Representative Poll . . . . . . . . . . . . . . 79

6.4.2 Implementation in R . . . . . . . . . . . . . . . . . . . . 79

6.5 Extra Material . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.5.1 Explaining prop.test() Implemented in R . . . . . . . . . 80

6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

6.7 Commented R Code . . . . . . . . . . . . . . . . . . . . . . . . 84

7 Inference for Numerical Data 89


7.1 Sampling Distribution of 𝑋̄ . . . . . . . . . . . . . . . . . . . . 89

7.1.1 The 𝑡-Distribution . . . . . . . . . . . . . . . . . . . . . . 90

7.2 One-Sample 𝑡-Test . . . . . . . . . . . . . . . . . . . . . . . . . 91

7.2.1 A Hand-Calculated Illustration . . . . . . . . . . . . . . 92

7.2.2 Implementation in R . . . . . . . . . . . . . . . . . . . . 93

7.3 Test for Paired Data . . . . . . . . . . . . . . . . . . . . . . . . 93

7.3.1 An Illustration with R and in Calculation . . . . . . . . . 94

7.4 Testing the Difference of Two Means . . . . . . . . . . . . . . . 95

7.4.1 Illustration and Implementation in R . . . . . . . . . . . 96

7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.6 Commented R Code . . . . . . . . . . . . . . . . . . . . . . . . 98

III Confidence Intervals 103

8 Estimators and Confidence Intervals 105


8.1 Estimators and Estimates . . . . . . . . . . . . . . . . . . . . . 106

8.2 “Best” Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

8.2.1 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 107

8.2.2 Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . 107



8.3 Confidence Interval and Margin of Error . . . . . . . . . . . . . 108

8.4 CI for the Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

8.5 CI for the Population Proportion . . . . . . . . . . . . . . . . . 113

8.6 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

8.6.1 One-sided Confidence Interval . . . . . . . . . . . . . . . 115

8.6.2 Other extensions . . . . . . . . . . . . . . . . . . . . . . 115

IV Intermezzo: Sample Size 117

9 Curse, Blessing & Back 119


9.1 Sample Size and the Margin of Error . . . . . . . . . . . . . . . 119

9.2 The Curse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

9.3 An Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

9.3.1 Male and Female Equally Represented? . . . . . . . . . . 122

9.3.2 Male and Female Equally Represented in a Given Month? 124

9.3.3 Old and Young Equally Represented? . . . . . . . . . . . 125

9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

9.5 Commented R Code . . . . . . . . . . . . . . . . . . . . . . . . 129

10 Field of Fools 133


10.1 De Moivre’s Equation . . . . . . . . . . . . . . . . . . . . . . . 133

10.1.1 Cancer Prone Areas . . . . . . . . . . . . . . . . . . . . . 134

10.1.2 The Small-Schools Movement . . . . . . . . . . . . . . . 135

10.1.3 Safe Cities . . . . . . . . . . . . . . . . . . . . . . . . . . 137

10.1.4 Sex Differences in Performance . . . . . . . . . . . . . . 138

10.2 Law of Small Numbers . . . . . . . . . . . . . . . . . . . . . . . 138

V Visualizations 141

11 Data Visualization 143

12 Bars 145
12.1 Bars for Proportions . . . . . . . . . . . . . . . . . . . . . . . . 145

12.2 Adding Error Bars to Proportions . . . . . . . . . . . . . . . . . 147

12.3 Bars for Numerical Data . . . . . . . . . . . . . . . . . . . . . . 148

12.4 Adding Error Bars to Means . . . . . . . . . . . . . . . . . . . . 150

12.5 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

VI Bridge 153

13 Correlation 155
13.1 Bivariate Relationships . . . . . . . . . . . . . . . . . . . . . . . 155

13.1.1 Visualizing the Relationship . . . . . . . . . . . . . . . . 155

13.2 Pearson’s Correlation . . . . . . . . . . . . . . . . . . . . . . . 159

13.3 Spearman’s Rank Correlation . . . . . . . . . . . . . . . . . . . 165

14 Observational Versus Experimental Data 167


14.1 Descriptive Approach . . . . . . . . . . . . . . . . . . . . . . . 167

14.1.1 UCB Admissions . . . . . . . . . . . . . . . . . . . . . . 167

14.1.2 Palmer Penguins . . . . . . . . . . . . . . . . . . . . . . 170

14.2 Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

14.2.1 UCB Admissions, Again . . . . . . . . . . . . . . . . . . 171

14.2.2 Penguins, Again . . . . . . . . . . . . . . . . . . . . . . 173

14.3 Paradox Again . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

15 Statistical Learning 175


15.1 Statistical Learning . . . . . . . . . . . . . . . . . . . . . . . . . 175

15.2 Use of Statistical Learning . . . . . . . . . . . . . . . . . . . . . 178



15.2.1 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . 178

15.2.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

15.3 Universal Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 179

15.3.1 Wage vs Demographic Variables . . . . . . . . . . . . . . 179

15.3.2 Probability of Heart Attack . . . . . . . . . . . . . . . . . 179

15.3.3 Spam Detection . . . . . . . . . . . . . . . . . . . . . . . 180

15.3.4 Identifying Hand-Written Numbers . . . . . . . . . . . . 180

15.3.5 Classify LANDSAT Image . . . . . . . . . . . . . . . . . 181


15.4 Ideal 𝑓() vs 𝑓̂() . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

15.5 Important Distinctions . . . . . . . . . . . . . . . . . . . . . . . 182

15.5.1 Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 183

15.5.2 Trade-offs . . . . . . . . . . . . . . . . . . . . . . . . . . 183

15.5.3 Types of Statistical Problems . . . . . . . . . . . . . . . . 183

15.6 Quality of Regression Fit . . . . . . . . . . . . . . . . . . . . . . 184

15.7 Bias-Variance Trade-Off . . . . . . . . . . . . . . . . . . . . . . 184

15.8 Accuracy in Classification Setting . . . . . . . . . . . . . . . . . 186

15.9 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 187

15.9.1 Validation Set Approach . . . . . . . . . . . . . . . . . . 189

15.9.2 Leave-One-Out . . . . . . . . . . . . . . . . . . . . . . . 190

15.9.3 𝑘-Fold . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

15.9.4 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . 192

15.10 Ubiquity of Predictions . . . . . . . . . . . . . . . . . . . . . . 192

15.11 Heuristics, Algorithms and AI . . . . . . . . . . . . . . . . . . . 193

15.12 AI, Not Why: Predicting vs Understanding . . . . . . . . . . . . 194

15.12.1 Plus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

15.13 Important Perspective . . . . . . . . . . . . . . . . . . . . . . . 195



VII Linear Regression 197

16 Simple Linear Regression 199


16.1 A Classic Approach . . . . . . . . . . . . . . . . . . . . . . . . 199

16.2 The Simple Linear Regression . . . . . . . . . . . . . . . . . . . 201

16.2.1 Data and Scatter Plot . . . . . . . . . . . . . . . . . . . . 202

16.2.2 Estimation in R . . . . . . . . . . . . . . . . . . . . . . . 202

16.2.3 Fitted Values and Residuals . . . . . . . . . . . . . . . . 204

16.2.4 Residuals vs Errors/Shocks . . . . . . . . . . . . . . . . 204

16.3 Ordinary Least Squares Procedure . . . . . . . . . . . . . . . . 205

16.4 Finding the Least Squares Line . . . . . . . . . . . . . . . . . . 206

16.4.1 Features of the Least Squares Line . . . . . . . . . . . . . 207

16.5 Deriving the OLS Estimators . . . . . . . . . . . . . . . . . . . 207

16.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

17 Multiple Linear Regression 211


17.1 Multiple Linear Regression Model . . . . . . . . . . . . . . . . 211

17.1.1 Partial Effects . . . . . . . . . . . . . . . . . . . . . . . . 212

17.1.2 Analyzing a Multiple-Regression Model . . . . . . . . . 213

17.2 OLS Estimated Model . . . . . . . . . . . . . . . . . . . . . . . 213

17.2.1 Two Regressors Illustration . . . . . . . . . . . . . . . . 214

17.2.2 Properties of OLS Estimators in Multiple Regression . . . 214

17.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

18 Assumptions 217
18.1 When is the Model Valid? . . . . . . . . . . . . . . . . . . . . . 217

18.2 Assumption 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

18.3 Assumption 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218



18.4 Assumption 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

18.5 Assumption 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

18.6 Assumption 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

18.7 Assumption 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220

19 Goodness of the Fit 221


19.1 Sample Variability . . . . . . . . . . . . . . . . . . . . . . . . . 221

19.1.1 Total Sample Variability (TSS) . . . . . . . . . . . . . . . 221

19.1.2 Unexplained Sample Variability (RSS) . . . . . . . . . . . 222

19.1.3 Explained Sample Variability (ESS) . . . . . . . . . . . . 223

19.2 Decomposition of the Total Sample Variability . . . . . . . . . . 223

19.3 The Coefficient of Determination, 𝑅2 . . . . . . . . . . . . . . . 224

19.3.1 Adjusted 𝑅2 . . . . . . . . . . . . . . . . . . . . . . . . . 225

19.4 The Standard Error of the Regression . . . . . . . . . . . . . . . 225

20 Inference 227
20.1 Sampling Distributions of the 𝛽̂’s . . . . . . . . . . . . . . . . . . . 227

20.2 Estimating 𝜎2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

20.3 Inference on the Slopes . . . . . . . . . . . . . . . . . . . . . . . 229

21 Categorical Predictors 231


21.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

21.1.1 Simplest Illustration . . . . . . . . . . . . . . . . . . . . 231

21.1.2 Including a Dummy with Two Levels . . . . . . . . . . . 232

21.2 Including a Dummy with Multiple Levels . . . . . . . . . . . . 233

21.3 Including Multiple Dummies . . . . . . . . . . . . . . . . . . . 234

21.4 The Dummy Variable Trap . . . . . . . . . . . . . . . . . . . . . 235

21.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235



22 Simulating Violations of Assumptions 239


22.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239

22.2 Best Case Scenario . . . . . . . . . . . . . . . . . . . . . . . . . 240

22.2.1 Simulating One Occurrence . . . . . . . . . . . . . . . . 240

22.2.2 Simulating Several Occurrences . . . . . . . . . . . . . . 241

22.2.3 Simulating a Multiple Linear Regression . . . . . . . . . 243

22.3 Omitted Variable Issue . . . . . . . . . . . . . . . . . . . . . . . 245

22.3.1 𝑟 > 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245

22.3.2 𝑟 < 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246

22.4 Incorrect Specification Issue . . . . . . . . . . . . . . . . . . . . 247

23 Relevant Applications 251


23.1 Betting on Hitler . . . . . . . . . . . . . . . . . . . . . . . . . . 251

23.1.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

23.1.2 Main Explanatory Variable . . . . . . . . . . . . . . . . . 252

23.1.3 Other Variables . . . . . . . . . . . . . . . . . . . . . . . 252

23.1.4 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . 253

23.1.5 Results (Selection) . . . . . . . . . . . . . . . . . . . . . . 253

23.1.6 Robustness Checks . . . . . . . . . . . . . . . . . . . . . 253

24 Linear Regression Lab 257


24.1 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . 257

24.1.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 257

24.1.2 Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258

24.1.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . 258

24.1.4 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . 259

24.2 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . 261

24.3 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261



24.3.1 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . 262

24.3.2 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

24.4 Dummy Variables . . . . . . . . . . . . . . . . . . . . . . . . . 264

24.4.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 264

24.4.2 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . 265

24.4.3 Several Categories . . . . . . . . . . . . . . . . . . . . . 266

24.5 Non-linear Transformations . . . . . . . . . . . . . . . . . . . . 267

24.5.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 267

24.5.2 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

VIII Classification 271

25 Limited Dependent Variables 273


25.1 Motivation and Interpretation . . . . . . . . . . . . . . . . . . . 273

25.1.1 An Illustrative Case . . . . . . . . . . . . . . . . . . . . . 274

25.2 Choice of 𝐹 (⋅) . . . . . . . . . . . . . . . . . . . . . . . . . . . 276

25.3 OLS: the Linear Probability Model (LPM) . . . . . . . . . . . . 277

25.3.1 LPM Issues: Heteroskedasticity . . . . . . . . . . . . . . 278

25.3.2 LPM Issues: Linear Increase of Probability . . . . . . . . 278

25.3.3 LPM Issues: Interpretation as Probability . . . . . . . . . 278

25.4 Probit and Logit Models . . . . . . . . . . . . . . . . . . . . . . 280

25.4.1 Probit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280

25.4.2 Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281

25.4.3 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . 282

25.5 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

25.6 Marginal Effects . . . . . . . . . . . . . . . . . . . . . . . . . . 283

25.7 Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

25.7.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . 283



25.8 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284

25.8.1 Linear Fit . . . . . . . . . . . . . . . . . . . . . . . . . . 285

25.8.2 Logit Estimation . . . . . . . . . . . . . . . . . . . . . . 286

25.8.3 Probit Estimation . . . . . . . . . . . . . . . . . . . . . . 287

25.8.4 Confusion Matrices . . . . . . . . . . . . . . . . . . . . . 288

IX Intermezzo 291

26 Presentations 293
26.1 “Conclude with a Conclusion” Approach . . . . . . . . . . . . 293

26.2 “Say It” Approach . . . . . . . . . . . . . . . . . . . . . . . . . 294

X Causality Claims 297

Why 299

27 Sample Bias 301


27.1 The Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

27.2 Non-Random Sampling . . . . . . . . . . . . . . . . . . . . . . 301

27.2.1 Dewey Defeats Truman . . . . . . . . . . . . . . . . . . . 301

27.2.2 Surveys of Friends . . . . . . . . . . . . . . . . . . . . . 301

27.3 Self-Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302

27.3.1 Lifetime Sexual Partners . . . . . . . . . . . . . . . . . . 302

27.3.2 Heights . . . . . . . . . . . . . . . . . . . . . . . . . . . 302

27.4 Survivorship Bias . . . . . . . . . . . . . . . . . . . . . . . . . . 303

27.5 The Tim Ferriss Show . . . . . . . . . . . . . . . . . . . . . . . 303

27.5.1 Caveman Effect . . . . . . . . . . . . . . . . . . . . . . . 304



28 Endogeneity 305
28.1 The Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305

28.2 Omitted Regressor . . . . . . . . . . . . . . . . . . . . . . . . . 306

28.3 Measurement Error . . . . . . . . . . . . . . . . . . . . . . . . . 306

28.4 Omitted Common Source . . . . . . . . . . . . . . . . . . . . . 307

28.5 Omitted Selection . . . . . . . . . . . . . . . . . . . . . . . . . 308

28.6 Simultaneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308

29 Regression to the Mean 309


29.1 Tentative Definition . . . . . . . . . . . . . . . . . . . . . . . . 309

29.2 Skill & Luck, Always . . . . . . . . . . . . . . . . . . . . . . . . 309

29.2.1 Introductory Example . . . . . . . . . . . . . . . . . . . 310

29.3 Selected Gallery . . . . . . . . . . . . . . . . . . . . . . . . . . 310

29.3.1 Regression to Mediocrity . . . . . . . . . . . . . . . . . . 311

29.3.2 SI Jinx . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311

29.3.3 Hiring Stars . . . . . . . . . . . . . . . . . . . . . . . . . 311

30 “Gold Standard” 315


30.1 The “Gold Standard” . . . . . . . . . . . . . . . . . . . . . . . . 315

30.2 Approaching the Gold Standard . . . . . . . . . . . . . . . . . 315

30.2.1 Mita System . . . . . . . . . . . . . . . . . . . . . . . . . 316

Appendix 317

A Assignments 319
A.1 Assignment I . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319

A.1.1 Checking Installation on Your Computer . . . . . . . . . 320

A.1.2 Dynamic Number . . . . . . . . . . . . . . . . . . . . . . 321



A.1.3 Simple Markdown Table . . . . . . . . . . . . . . . . . . 321

A.1.4 Include Graphic . . . . . . . . . . . . . . . . . . . . . . . 322

A.1.5 Cross-References . . . . . . . . . . . . . . . . . . . . . . 322

A.1.6 Citations . . . . . . . . . . . . . . . . . . . . . . . . . . . 323

B Bonus Assignments 325


B.1 Keep Young and Beautiful . . . . . . . . . . . . . . . . . . . . . 325

B.1.1 Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325

B.2 Grades and Luck . . . . . . . . . . . . . . . . . . . . . . . . . . 326

C Practice Quiz Questions 329


C.1 Quiz I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329

C.2 Midterm Quiz . . . . . . . . . . . . . . . . . . . . . . . . . . . 337

C.3 Quiz II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341

C.4 Endterm Quiz . . . . . . . . . . . . . . . . . . . . . . . . . . . 347

C.5 Selected Quiz I Solutions . . . . . . . . . . . . . . . . . . . . . . 355

C.6 Selected Quiz II Solutions . . . . . . . . . . . . . . . . . . . . . 362

D Practice Exam Questions 373


D.1 Midterm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373

D.2 Endterm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377

D.3 Selected Midterm Solutions . . . . . . . . . . . . . . . . . . . . 380

D.4 Selected Endterm Solutions . . . . . . . . . . . . . . . . . . . . 383

E Solutions to Selected End-of-Chapter Exercises 387

F Your Questions 397


F.1 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397

F.2 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398

F.3 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399

F.4 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399

F.5 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400

F.6 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401

F.7 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401

F.8 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402

F.9 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402

F.10 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403

F.11 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404

F.12 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404

F.13 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405

F.14 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406

F.15 Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407

List of Tables

2.1 Inflammation levels in the two groups, the drug treated (D) and
the control (C) group. . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 All combinations of the six observations into two groups. . . . . 20
2.3 Observed hotwings consumption of female individuals. . . . . . 22
2.4 Group averages in hotwings consumption and difference between
groups of males (M) and females (F). . . . . . . . . . . . . . . . 22
2.5 Group averages in repair times and difference between groups of
Verizon customers (V) and customers of other companies (C). . . 24

4.1 Summary for types of errors. . . . . . . . . . . . . . . . . . . . . 49

6.2 Representation by region in the poll and in the population. . . . 79

8.1 Common values for 𝛼 and respective 𝑧𝛼/2 . . . . . . . . . . . . . 112

A.1 Table containing various formatting elements. . . . . . . . . . . . 322

C.1 Practice quiz questions with elements of solution in this appendix. 329

D.1 Practice exam questions with elements of solution in this ap-


pendix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
D.2 Severe complications at birth (SCB). . . . . . . . . . . . . . . . . 373

E.1 End-of-chapter exercises with elements of solution in this ap-


pendix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387

List of Figures

1.1 Illustration of the Monty Hall problem. . . . . . . . . . . . . . . 5


1.2 Scheme of impacts on returning plane. . . . . . . . . . . . . . . 7

2.1 Distribution of Δ over the real line. . . . . . . . . . . . . . . . . 22


2.2 Subset of the permutation distribution: hotwings case. . . . . . . 23
2.3 Subset of the permutation distribution: Verizon case. . . . . . . . 25

3.1 Binomial distribution 𝑋 ∼ 𝐵(14, 0.5) with associated probabili-


ties and emphasis of Paul’s 12 successes. . . . . . . . . . . . . . 30
3.2 News from research. (Source: xkcd.) . . . . . . . . . . . . . . . . 32
3.3 xkcd on significance (xkcd.com/882). . . . . . . . . . . . . . . . 37
3.4 Title page of Jakob Bernoulli 1713’s Ars Conjectandi. . . . . . . . 38

4.1 Rejection regions for three alternative hypotheses. . . . . . . . . 46


4.2 Types of error for case ’𝐻0 : the person is not pregnant’. . . . . . 48

5.1 Illustration of the Central Limit Theorem: distribution of the


means of samples from uniform distributions for different sam-
ple sizes, sampled 1000 times. . . . . . . . . . . . . . . . . . . . 53
5.2 Illustration of the Central Limit Theorem: distribution of the
means of samples from Poisson distributions (𝜆 = 10 ) for dif-
ferent sample sizes, sampled 1000 times. . . . . . . . . . . . . . 54
5.3 Standard normal (left) and Chi-square with one degree of freedom
(right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59


5.4 Chi-square distributions for various degrees of freedom, 𝑟, pdf


(left) and cdf (right). . . . . . . . . . . . . . . . . . . . . . . . . 60
5.5 Chi-square values for degrees of freedoms between 1 to 15 and for
main probabilities benchmarks. . . . . . . . . . . . . . . . . . . 61

6.1 Rejection regions for the sample proportion of our example. . . . 71


6.2 Probability on the left of observed sample proportion. . . . . . . 72
6.3 Probabilities in a chi-squared distribution with 1 degree of free-
dom. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

7.1 Normal distribution and 𝑡-distribution for various degrees of free-


dom. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

8.1 Estimators with different expected value (left) and different vari-


ance (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.2 Interpreting a confidence interval. . . . . . . . . . . . . . . . . . 109
8.3 Confidence interval in a standard normal. . . . . . . . . . . . . 111
8.4 Confidence interval for the mean. . . . . . . . . . . . . . . . . . 113

9.1 Minimal 𝑛 for various values of 𝛼 and margins of error, 𝑚, keep-


ing 𝑝0 = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.2 Confirmed covid-19 cases in Portugal, daily 7-day rolling moving
average. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

10.1 The counties with the highest 10 percent age-standardized death


rates for cancer of the kidney/ureter for U.S. males, 1980-89.
(Source: Gelman and Nolan (2017)) . . . . . . . . . . . . . . . . 134
10.2 The counties with the lowest 10 percent age-standardized death
rates for cancer of the kidney/ureter for U.S. males, 1980-89.
(Source: Gelman and Nolan (2017)) . . . . . . . . . . . . . . . . 135
10.3 The counties with both the highest and lowest 10 percent age-
standardized death rates for cancer of the kidney/ureter for U.S.
males, 1980-89. (Source: Wainer (2007)) . . . . . . . . . . . . . . 135

10.4 Population versus age-standardized death rates for cancer of the


kidney/ureter for U.S. males, 1980-89. (Source: Wainer (2007)) . 136
10.5 Enrollment vs. math score, 5th grade (left) and 11th grade (right).
(Source: Wainer (2007)) . . . . . . . . . . . . . . . . . . . . . . . 136
10.6 Ten safest and most dangerous American cities for driving, and
ten largest American cities. (Source: Wainer (2007)) . . . . . . . . 137
10.7 Data from the National Assessment of Educational Progress.
(Source: Wainer (2007)) . . . . . . . . . . . . . . . . . . . . . . . 138

12.1 Proportions over all responses. . . . . . . . . . . . . . . . . . . . 146


12.2 Proportions by question. . . . . . . . . . . . . . . . . . . . . . . 146
12.3 Proportion by question, in facets. . . . . . . . . . . . . . . . . . 147
12.4 Proportions over all responses with error bars. . . . . . . . . . . 148
12.5 Average weight per habit. . . . . . . . . . . . . . . . . . . . . . 149
12.6 Average weight per habit and other dimensions. . . . . . . . . . 150
12.7 Average weight per habit with confidence interval. . . . . . . . 151
12.8 Mean arousal per film over gender with confidence interval. . . 152

13.1 Scatter plots of pairs of variables and their linear relationship. . 157
13.2 Anscombe plots. . . . . . . . . . . . . . . . . . . . . . . . . . . 159
13.3 Assessing associations with base R. . . . . . . . . . . . . . . . . 160
13.4 Assessing associations with base corrgram package. . . . . . . . 161
13.5 Assessing associations with base corrplot package. . . . . . . . . 162

15.1 Instance of simulated Income data along with true 𝑓() and errors. 176
15.2 Instance of simulated Income data along with true 𝑓() and errors
(two predictors). . . . . . . . . . . . . . . . . . . . . . . . . . . 176
15.3 Wage as function of various variables. . . . . . . . . . . . . . . . 179
15.4 Factors influencing the risk of a heart attack. . . . . . . . . . . . 180
15.5 Frequencies for main words in email (to George). . . . . . . . . 180

15.6 Sample of hand-written numbers. . . . . . . . . . . . . . . . . . 181


15.7 LANDSAT images and classification. . . . . . . . . . . . . . . . 181
15.8 Linear, smooth non-parametric and rough non-parametric fit (left
to right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
15.9 B-V case 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
15.10 B-V case 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
15.11 B-V case 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
15.12 Bias-Variance trade-off. . . . . . . . . . . . . . . . . . . . . . . . 186
15.13 Training versus test data performance. . . . . . . . . . . . . . . 188
15.14 Scatter plot of data set. . . . . . . . . . . . . . . . . . . . . . . . 188
15.15 Fits of mpg for various degrees of the polynomial of horsepower. 189
15.16 Validation set approach. . . . . . . . . . . . . . . . . . . . . . . 189
15.17 Choice of polynomial in the validation set approach. . . . . . . . 190
15.18 LOOCV approach. . . . . . . . . . . . . . . . . . . . . . . . . . 190
15.19 5-fold example of a cross-validation approach. . . . . . . . . . . 191
15.20 Choice of polynomial with LOOCV and 10-fold CV. . . . . . . . 191

16.1 Scatter plot of the TV-Sales observations. . . . . . . . . . . . . . 203


16.2 Linear fit and residuals. . . . . . . . . . . . . . . . . . . . . . . . 204

19.1 Using the mean as the best fit and the resulting residuals. . . . . 222
19.2 Linear fit and residuals. . . . . . . . . . . . . . . . . . . . . . . . 222

22.1 Scatter plot of simulated data in best case scenario. . . . . . . . . 240


22.2 Scatter plot of simulated data in best case scenario along with true
relationship (red) and OLS fit (blue). . . . . . . . . . . . . . . . . 242
22.3 Density estimate for the simulated slope coefficient. . . . . . . . 243
22.4 Scatter plot of sample with non-linear relationship. . . . . . . . 248

22.5 Scatter plot of sample with non-linear relationship along with OLS
fit (blue). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249

23.1 Descriptive statistics. . . . . . . . . . . . . . . . . . . . . . . . . 253


23.2 Regressions results. . . . . . . . . . . . . . . . . . . . . . . . . . 254

25.1 Some plots on the Default data set. . . . . . . . . . . . . . . . . 275


25.2 Depiction of a limited dependent variable. . . . . . . . . . . . . 276
25.3 OLS fit for LDV. . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
25.4 OLS prediction for Default. . . . . . . . . . . . . . . . . . . . . . 279
25.5 A possible better fit for LDV. . . . . . . . . . . . . . . . . . . . . 280
25.6 Fit of logistic regression. . . . . . . . . . . . . . . . . . . . . . . 282
25.7 Normal and logistic cdf’s. . . . . . . . . . . . . . . . . . . . . . 282

26.1 Example of usual plan for presentation (Source: wiley.com (6 tips


for giving a fabulous academic presentation)). . . . . . . . . . . 294
26.2 Another example of usual plan for presentation (Source:
http://phdcomics.com/comics/archive.php?comicid=1553). . . 295

27.1 President Truman holding a copy of the Chicago Daily Tribune,


November 1948. . . . . . . . . . . . . . . . . . . . . . . . . . . . 302

29.1 Sports Illustrated cover about... its own myth. . . . . . . . . . . 312


29.2 Excess returns and the selection and termination decisions of plan
sponsors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313

30.1 Mita border and specific portion analyzed by Dell (2010). . . . . 316

A.1 Book cover of Tufte’s book. . . . . . . . . . . . . . . . . . . . . . 323

B.1 Grades at the three tests. . . . . . . . . . . . . . . . . . . . . . . 327

C.1 Estimation output. . . . . . . . . . . . . . . . . . . . . . . . . . 338



C.2 Aesthetics mappings. . . . . . . . . . . . . . . . . . . . . . . . . 340


C.3 Linear fit and residuals, again. . . . . . . . . . . . . . . . . . . . 348
C.4 Regression output for exercise with XXX. . . . . . . . . . . . . . 349
C.5 Regressions results. . . . . . . . . . . . . . . . . . . . . . . . . . 351
C.6 One is unsupervised learning. . . . . . . . . . . . . . . . . . . . 354
C.7 Illustrating the effect of an outlier. . . . . . . . . . . . . . . . . . 372

D.1 Summary of Model 1. . . . . . . . . . . . . . . . . . . . . . . . . 378


D.2 Summary of Model 2. . . . . . . . . . . . . . . . . . . . . . . . . 379
D.3 Plot for Model 3. . . . . . . . . . . . . . . . . . . . . . . . . . . 379

F.1 Polynomials of age to model logwage. . . . . . . . . . . . . . . 406


Foreword

These notes are intended as an introduction to the topics they cover. The varying
levels of detail and comprehensiveness, within and across the lecture notes,
reflect this characteristic. They replace the usual decks of slides with a format that
allows for a general overview of the material thanks to the comprehensive table
of contents.
The departure from the usual slides model towards a narrative, memo-like format
for each lecture is a choice that calls for some explanation, if only because it is
very uncommon.
I think that the style of typical slide shows, especially if built with MS Power-
Point (PP), is characterized by an excessive oversimplification of the arguments,
which are reduced to bullet points, key words and bad graphical representations.
From a pedagogical point of view, these are not sufficient for conveying a
nuanced line of argumentation and often result in a black-or-white misinterpre-
tation.1 For an in-depth critique of PP presentations arguing that the cognitive
style of PP is “making us stupid” and may be associated with tragic mistakes2,
see the work of Edward Tufte (Tufte (2003)). See also the hilarious example3 of
the abuse of PP and its “AutoContent Wizard”. PP presentations are also criticized
in the business world4 and are sometimes replaced by memos, e.g., at
Amazon5.
As with slides, however, the notes must be completed with elements emerging
during the discussion in class. It is unreasonable to consider the words written
here as the exclusive material covered in the exam. Most elements in these notes
are mere placeholders for arguments and discussions held at greater length in
various sources. In that sense, the main advantage of these notes is to provide a
structure for the classes.

1. Here, I only claim a reduction of that risk since it would be presumptuous and flatly wrong to pretend that the full-sentences format will leave no room for misunderstanding.
2. https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001yB
3. https://norvig.com/Gettysburg/
4. https://www.inc.com/geoffrey-james/sick-of-powerpoint-heres-what-to-use-instead.html
5. https://conorneill.com/2012/11/30/amazon-staff-meetings-no-powerpoint/
Part I

Introduction
1 Statistical Intuition

TL;DR A selection of questions/puzzles illustrates


our generally poor understanding of phenomena in-
volving random processes, casting doubts on our
ability to make good judgments and subsequent de-
cisions.[1.1]
Solutions to these questions hint at the length of the
gap to be filled.[1.4]
Statistics is presented as a set of guiding rules to make
sense of random processes, in a way similar to that of
a grammar textbook helping to correctly speak a lan-
guage.[1.2]
Grammatically correct sentences, however, are point-
less if they don’t carry a relevant message. Work to
achieve the latter remains the priority of any empir-
ical analysis.[1.3]

1.1 A Few Questions in Statistics

Please answer the following questions.


1.1.1 Linda (Tversky and Kahneman, 1983)

Linda is 31 years old, single, outspoken, and very bright. She majored in philos-
ophy. As a student, she was deeply concerned with issues of discrimination and
social justice, and also participated in anti-nuclear demonstrations.
Which of the following two alternatives is more probable?

a. Linda is a banker,
b. Linda is a banker and active in the feminist movement.

1.1.2 Monty Hall

Suppose you’re on a game show, and you’re given the choice of three doors:
Behind one door is a car; behind the others, goats.
You pick a door, say No. 1, and the host, who knows what’s behind the doors,
always opens a door with a goat, say No. 3. He then says to you, “Do you want
to pick door No. 2”?
What would you answer?

a. Keep door No. 1


b. Switch to door No. 2

1.1.3 Mean IQ

The mean IQ of the population of high school students in a given big city is
known to be 100.
You have selected a random sample of 50 of these students for a study. The first
of these students tested has an IQ of 150.
What do you expect the mean IQ to be in the whole sample of 50 students?

1.1.4 Binary Sequence

Which of the following sequences of X’s and O’s seems more likely to have been
generated by a random process (e.g., flipping a coin)?

FIGURE 1.1: Illustration of the Monty Hall problem.



a. XOXXXOOOOXOXXOOOXXXOX
b. XOXOXOOOXXOXOXOOXXXOX

1.1.5 Your Random Number

Randomly choose an (integer) number between 1 and 5.

1.1.6 Positive Cancer Test

The probability of breast cancer is 1% for a woman at age forty who participates
in routine screening.
If a woman has breast cancer, the probability is 95% that she will get a positive
mammography. If a woman does not have breast cancer, the probability is 8.6%
that she will also get a positive mammography.
A woman in this age group had a positive mammography in a routine screening.
What is, approximately, the probability that she has breast cancer?

1.1.7 Armour

During WWII, the Navy tried to determine where they needed to armor their
aircraft to ensure they came back home. Once back, the planes were subjected
to an analysis of where they had been shot. Figure 1.2 shows the results of these
analyses.
Which places on the plane (areas A to F) do you think most need armor?

1.1.8 Average Wage Growth

A city has two parts: North and South.


Over the last 10 years, the wage of the Northerners increased, on average, by
24%. For the Southerners, the wages increased by 12%.
Consider the evolution of the average wage in the city. Which of the following
could not have happened (more than one answer possible)?

FIGURE 1.2: Scheme of impacts on returning plane.

a. it doubled (100% increase),


b. it decreased,
c. it increased by 18%,
d. it increased by 24%,
e. it decreased by 12%,
f. none could happen
g. they could all happen.

1.2 Learning Statistics

Statistics is the grammar of nature’s language, randomness.

Strangely enough, humans are not native speakers of nature’s language, randomness. To a


large extent, they are even particularly ill-equipped to understand it. This turns

learning the topic into a frustrating endeavor, leading to the same desperate
self-assessments as when we learn a language:

• “I’ll never manage to speak properly.”


• “These rules do not make sense to me”.
• “I’d better not speak and embarrass myself ”.

Every reader has already experienced all of these. And there will be no soothing
counter-argument here. Only a reminder that the benefits of understanding this
language are very numerous, too many indeed to be encapsulated into a few
sentences. Instead, their full list will be slowly uncovered throughout a life of
decisions, improved and not fooled by randomness.
A last word about this language. Despite popular belief, statistics is not a
special dialect of mathematics. Sure enough, they share many expressions. And,
more often than not, a good command of math allows one to get by with it. This
view and this practice of statistics, however, are unfortunate and detrimental. I
hope these notes will help make that clear.

1.3 A Learning Strategy

In this introductory chapter, I would like to lay down a few elements of the learn-
ing strategy adopted here. These are given below in no particular order.

1.3.1 Content Over Form

Empirical research is a story we tell to others in order to convince them of a


particular point. The language of that story is statistics. Therefore, we must first
learn how to speak it.
However, the quality of a story does not essentially depend on the variety or
the exquisiteness of the words it uses. It’s first and above all about the content of
the story.
This content, in turn, is yours to find, based on your interests, your ability to
“see” issues, your experiences, etc…

1.3.2 Main Words

Did you know there exist books listing the most used words of a language? For
instance, I have Jones and Tschirner (2015) on my bookshelf, listing die Statistik
in the 2864th position.1
Similarly, this course mainly adopts this frequency approach. It touches on the
core methods of empirical research. I trust it will allow you to tell many interest-
ing stories.

1.3.3 Principles Over Techniques

The number of statistical techniques is very large. One may wonder if the par-
ticular one we use is the most appropriate for the problem at hand.
Here is a perspective from my experience. A statistical analysis is virtually never
incorrect because it uses the wrong technique. Instead, it is often criticized be-
cause it fails to comply with basic principles.
Surprisingly, those who most often fall into this trap are precisely those who know
the fewest techniques, i.e., you. This course will put a particular empha-
sis on these principles in order to help you avoid disqualifying mistakes.

1.4 Strengthening your Intuition

This section offers a few pointers to better understand the questions (and their
answers) of Section 1.1. Its conclusions must be understood by all, but its details
are meant for inquiring minds only.

1.4.1 Question in 1.1.1

It is easy to see that the second option, “Linda is a banker and active in
the feminist movement”, must represent a subset of the first option “Linda is
a banker”. As for why the former seems more probable than the latter, please
1. “Glaube nur der Statistik, die du selbst gefälscht hast.” (Only trust statistics you have falsified yourself.)

see Tversky and Kahneman (1983) or Kahneman (2011). Arguably, the second
option taps into our brain’s love for stories.

1.4.2 Question in 1.1.2

This is a question about which a great many stories have already been told. A
main perspective emerges in all of them, namely how much it has fooled the
overwhelming majority of those who attempted the question. Many of these sto-
ries also quote a letter written to a columnist who gave the right answer.

You blew it! Let me explain: If one door is shown to be a loser, that information changes the probability of
either remaining choice – neither of which has any reason to be more likely – to 1/2. As a professional
mathematician, I’m very concerned with the general public’s lack of mathematical skills. Please help by
confessing your error and, in the future, being more careful.

— Robert Sachs, Professor of Mathematics at George Mason University in Fairfax, Va.

There are several ways of demonstrating that “Switching doors” is the right thing
to do: a theoretical demonstration based on Bayes’ theorem, a simulation, and
another attempt at intuition. I briefly describe the three below.
Theoretical demonstration based on Bayes’ theorem.
We will show how to calculate the correct probabilities:

• the probability that the car is behind door No.2 given that Monty Hall opened
door No.3,
• the probability that the car is behind door No.1, the initially chosen door, given
that Monty Hall opened door No.3; notice that, since the car must be in one of
the two doors, this probability is simply one minus the probability calculated
just above.

We adopt the following notation:

• 𝐶𝑖 , the event of the car being behind door 𝑖,



• 𝐷𝑖 , the event of Monty Hall opening door 𝑖.

Notice the prior probabilities:

$$P(C_1) = P(C_2) = P(C_3) = \frac{1}{3}$$
In the current configuration, the new information is that Monty Hall opens door
No.3, i.e., we observe the event 𝐷3 . We are looking to compare the posterior
probabilities:

𝑃 (𝐶1 |𝐷3 ) and 𝑃 (𝐶2 |𝐷3 )

We do not know these posterior probabilities but we know that they can be cal-
culated with Bayes’s rule thanks to the “inverted” probabilities:

𝑃 (𝐷3 |𝐶1 ) and 𝑃 (𝐷3 |𝐶2 ) and 𝑃 (𝐷3 |𝐶3 )

These are easier to compute. We have:

• If the car is behind door No.1, then Monty Hall could open either door No.2 or
door No.3, with equal probability; hence

$$P(D_3 \mid C_1) = \frac{1}{2}$$

• If the car is behind door No.2, then Monty Hall could only open door No.3
since he cannot show a car or open your door; hence

𝑃 (𝐷3 |𝐶2 ) = 1

• If the car is behind door No.3, then Monty Hall cannot open door No.3 since
he cannot show the car; hence

𝑃 (𝐷3 |𝐶3 ) = 0
We can now calculate the correct probability mentioned above, the probabil-
ity that the car is behind door No.2 given that Monty Hall opened door No.3,
𝑃 (𝐶2 |𝐷3 ). We do this by applying Bayes’ rule:

$$P(C_2 \mid D_3) = \frac{P(C_2)\,P(D_3 \mid C_2)}{P(C_1)\,P(D_3 \mid C_1) + P(C_2)\,P(D_3 \mid C_2) + P(C_3)\,P(D_3 \mid C_3)}$$

By replacing with the values derived above, we have


$$P(C_2 \mid D_3) = \frac{\frac{1}{3}\cdot 1}{\frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot 1 + \frac{1}{3}\cdot 0} = \frac{2}{3}$$
Again by applying Bayes’ rule, we can also calculate the probability that the car
is behind door No.1 given that Monty Hall opened door No.3, i.e., the probabil-
ity of winning by sticking to the initial door. Notice that this is not a necessary
calculation but rather a check because it must be the case that this probability is
the complement to the previous one.

$$P(C_1 \mid D_3) = \frac{P(C_1)\,P(D_3 \mid C_1)}{P(C_1)\,P(D_3 \mid C_1) + P(C_2)\,P(D_3 \mid C_2) + P(C_3)\,P(D_3 \mid C_3)}$$

By replacing with the values derived above, we have,


$$P(C_1 \mid D_3) = \frac{\frac{1}{3}\cdot\frac{1}{2}}{\frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot 1 + \frac{1}{3}\cdot 0} = \frac{1}{3}$$
The conclusion from these calculations is clear. One should always switch doors
after acquiring the new information, because the posterior probability is 1/3 for
the initially chosen door and 2/3 for the remaining door.
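As a quick numerical check, the same posterior probabilities can be computed directly in R. This is a minimal sketch; the object names prior, likelihood and posterior are mine, not part of the original demonstration.

# Prior probabilities for the car being behind doors No.1, No.2 and No.3
prior <- c(1/3, 1/3, 1/3)
# Probability that Monty opens door No.3 given each car location,
# when the contestant initially picked door No.1
likelihood <- c(1/2, 1, 0)
# Bayes' rule: the posterior is proportional to prior times likelihood
posterior <- prior * likelihood / sum(prior * likelihood)
posterior  # 1/3, 2/3 and 0 for doors No.1, No.2 and No.3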
Simulation in R language
The following is R code from R-bloggers2 that provides a function to simulate
the Monty Hall problem.

2. https://www.r-bloggers.com/monty-hall-by-simulation-in-r/

monty <- function(strat = 'stay', N = 1000, print_games = TRUE)
{
  doors <- 1:3  # initialize the doors, behind one of which is a good prize
  win <- 0      # to keep track of the number of wins

  for (i in 1:N)
  {
    prize <- floor(runif(1, 1, 4))  # randomize which door has the good prize
    guess <- floor(runif(1, 1, 4))  # guess a door at random

    ## Reveal one of the doors you didn't pick which has a bum prize
    if (prize != guess)
      reveal <- doors[-c(prize, guess)]
    else
      reveal <- sample(doors[-c(prize, guess)], 1)

    ## Stay with your initial guess or switch
    if (strat == 'switch')
      select <- doors[-c(reveal, guess)]
    if (strat == 'stay')
      select <- guess
    if (strat == 'random')
      select <- sample(doors[-reveal], 1)

    ## Count up your wins
    if (select == prize)
    {
      win <- win + 1
      outcome <- 'Winner!'
    } else
      outcome <- 'Loser!'

    if (print_games)
      cat(paste('Guess: ', guess,
                '\nRevealed: ', reveal,
                '\nSelection: ', select,
                '\nPrize door: ', prize,
                '\n', outcome, '\n\n', sep = ''))
  }

  # Print the win percentage of your strategy
  cat(paste('Using the ', strat, ' strategy, your win percentage was ',
            win / N * 100, '%\n', sep = ''))
}

You can then use the function to simulate the game as many times as you want
(by choosing 𝑁).

# the strategy can be set to 'stay', 'switch' or 'random'
# N is the desired number of simulations
# set print_games = FALSE, otherwise your screen or sheet will be full

monty(strat='stay', N=10000, print_games = FALSE)


## Using the stay strategy, your win percentage was 32.19%
monty(strat='random', N=10000, print_games = FALSE)
## Using the random strategy, your win percentage was 50.91%
monty(strat='switch', N=10000, print_games = FALSE)
## Using the switch strategy, your win percentage was 66.16%

Another take at intuition


Here, I modify the rules of the game to allow the correct answer to emerge more
intuitively. If it still doesn’t, then I hope at least to cast doubt on the common but
wrong intuition that holds “there are two doors left, hence it’s 50% − 50%
chances”. The game is modified in the following way.

Suppose you’re on a game show, and you’re given the choice of 100 doors: Behind one door is a car; behind
the others, goats. You pick a door, say No.1, and the host, who knows what’s behind the doors, opens 98
doors, say No.3 to No.100, which have a goat. He then says to you, “Do you want to pick door No.2?”

Why is it more intuitive to see that one should switch doors? The host knows
where the car is and, out of the remaining 99 doors, he opens 98 showing a goat.
With your initial choice you had only a 1% chance of guessing the correct door. Of
course, this means that with 99% probability the car is behind one of the other 99
doors. Out of these 99 doors, you now know which 98 are not winning. Can you
still believe that the only one still closed has only 50% probability of containing
the car?

1.4.3 Question in 1.1.3

A misleading intuition here would suggest that the 49 students who were not
tested will, on average, compensate for the very high score of the first student.
That type of compensation does not exist. To expect it is similar to judging it highly
probable that the ball of a roulette wheel will land on red after landing 10 times
in a row on black.
What can be said, instead, is that the average of the 49 students can be expected
to be 100. Hence, over the 50 students, the average will be,

$$\frac{1}{50}\,(150 + 49 \times 100) = \frac{5050}{50} = 101.$$
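A short simulation in R makes the same point; this is a sketch that assumes, only for illustration, an IQ standard deviation of 15.

set.seed(1)
# The first student scores 150; the remaining 49 are drawn from the
# population, with mean 100 (and an assumed standard deviation of 15)
sample_means <- replicate(10000, mean(c(150, rnorm(49, mean = 100, sd = 15))))
mean(sample_means)  # close to 101, not 100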

1.4.4 Question in 1.1.4

Many respondents are influenced by a misleading intuition that goes as follows.


Randomness should be reflected in every part of the sequence. At the limit, in every
two draws we should see an X and an O. This is what sequence 𝑏. exhibits in the be-
ginning. Of course, this is the very opposite of a random sequence. The complete
alternation of X and O’s is actually a very deterministic pattern, not a random
one.
Another (wrong) reason to reject sequence 𝑎. is that it exhibits a few subse-
quences of the same symbol in a row, e.g., four O’s in a row. This is (incorrectly)
deemed as a sign of lack of randomness. In fact, nothing prevents a fair coin from
landing on Heads four times in a row. So such a run is no sign of a pattern.
So, what points in the direction of 50-50 chances in the first sequence? The
marginal distribution. After each observation, one should, on average, observe
an X half of the time and an O the other half. That’s the process at play in 𝑎.
Indeed, count how many X’s and O’s come after an X (5 X’s and 5 O’s, i.e., 50%
each) and after an O (5 X’s and 5 O’s, again 50% either way). One relevant way of
reading this result is in terms of conditional probability: knowing the symbol


of a sequence, does it help predict the next symbol? The answer is no.
How does sequence 𝑏. behave in that respect? After an X, there is an O 7 times
and an X only 3 times. Similarly, after an O, there is an X 7 times and an O only 3
times. These are not 50-50 chances! Instead, knowing the current symbol, you can
predict with 70% probability what the next symbol will be.
But this question still leaves students puzzled. A last objection is as follows. Since
both sequences can happen under the assumption of 50-50 chances, one cannot
tell which one is more likely to happen.
In order to discard that view, consider building a sequence by throwing a fair
die. Every time the die lands on the values 1 to 5, the sequence writes an X; when
it lands on a 6, the sequence writes an O. Now, consider the sequences 𝑐. and 𝑑.
below:

c. XXXOXXXXOXXXXOXXXXXX

d. XOOOOXXOXOXOOOXXOOOX

Obviously, both of these sequences can happen. It doesn’t mean, however, that they are equally likely to have been produced by a fair die (coded as explained above).
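For readers who want to check such counts themselves, here is a minimal R sketch (the helper’s name is mine) that tabulates which symbol follows which in a sequence; it is applied to sequences 𝑐. and 𝑑. above.

# Tabulate how often each symbol is followed by an X or an O
next_symbol_table <- function(s) {
  symbols <- strsplit(s, "")[[1]]
  table(current = head(symbols, -1), next_symbol = tail(symbols, -1))
}

next_symbol_table("XXXOXXXXOXXXXOXXXXXX")  # sequence c.
next_symbol_table("XOOOOXXOXOXOOOXXOOOX")  # sequence d.

In sequence 𝑐., an X is followed by another X far more often than by an O, which is what one expects when X has probability 5/6 on every draw.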

1.4.5 Question in 1.1.5

This is an easier question. If individuals were a good source of random numbers,


then the distribution of answers to this question should approximate a uniform distribution. It almost never does: either individuals excessively pick a particular value (often 3 or 4) and/or excessively neglect one of the values (often 1).

1.4.6 Question in 1.1.6

This question is again an example of an application of Bayes’ theorem. Recall that this is the kind of case that most eludes human intuition. Part of the explanation for our inability resides in our poor handling of probabilities, i.e., relative numbers: a ratio of two numbers that are often equally difficult to estimate.

There is a way out that proves useful in many cases. Replace the relative numbers, i.e., the probabilities, by absolute numbers.
In this situation, imagine 500 women in the relevant group take a test. How many of these have the disease? 1% of 500, i.e., 5. Of these 5, 95% will indeed test positive, i.e., a bit fewer than 5, but we’ll round up to 5. How many of the 500 do not have the disease? 495. Now, out of these 495, 8.6% will nevertheless test positive, i.e., around 43.
Total of positive tests: 5 + 43 = 48. Of these 48, how many suffer from the disease? Only 5. Hence, the probability of having cancer after a positive test under these conditions is 5/48, around 10%.
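The same reasoning can be written as a short R computation; the variable names are mine, and the inputs are those given in the question (1% prevalence, 95% sensitivity, 8.6% false positives).

prevalence  <- 0.01    # P(disease)
sensitivity <- 0.95    # P(positive test | disease)
false_pos   <- 0.086   # P(positive test | no disease)

p_positive <- prevalence * sensitivity + (1 - prevalence) * false_pos
prevalence * sensitivity / p_positive   # P(disease | positive test), about 0.10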

1.4.7 Question in 1.1.7

The issue presented in this question has far-reaching implications, far more, indeed, than is usually recognized. This is why the issue will be brought back into the discussion multiple times in this course. In the literature, it is referred to as sample selection bias.
The core of the issue is that the planes that returned are not the relevant planes
needed for a question on which parts to armor. They represent, actually, a sample
of planes that was not randomly selected. And the very criterion for this selection
is directly related to the question: only planes that were hit in places that did not
need armor could return home.
Make sure you understand that the hits were in fact approximately uniformly distributed over all the planes that were hit: it is unlikely that the enemy managed to hit some particular spot at which they aimed.

1.4.8 Question in 1.1.8

This question is concerned with the evolution of the average wage of the whole
city. We know that in both parts of the city the average wage increased. However,
we cannot deduce anything about the overall average wage of the city. This is
because the question gives no information about the composition of the city, say how many people there were on each side in the initial period and in the final period.
Hence, the composition is key and can drive the average in any direction, as illustrated by this extreme example. Suppose that the North had lots of people with high wages in the initial period. But, then, this number dramatically decreased over the 10 years. Despite the increase in wages of the (remaining) Northerners, and even despite the increase in wages of the Southerners, the whole city simply ends up with far fewer rich individuals. In turn, this new composition drives the city’s average down.
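To make the composition effect concrete, here is a small numerical sketch in R; the numbers are purely illustrative and are not taken from the question.

# Initial period: many high earners in the North
n_north <- 1000; wage_north <- 100
n_south <- 1000; wage_south <- 20
initial <- (n_north * wage_north + n_south * wage_south) / (n_north + n_south)

# Final period: wages rise in both parts, but most high earners have left
n_north2 <- 100;  wage_north2 <- 110
n_south2 <- 1900; wage_south2 <- 22
final <- (n_north2 * wage_north2 + n_south2 * wage_south2) / (n_north2 + n_south2)

c(initial = initial, final = final)   # 60 vs 26.4: the city-wide average drops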
2
Statistical Statements

TL;DR .

This topic is often considered advanced and is therefore neglected in traditional statistics courses. However, it arguably contains several key elements of data analysis that we will develop in further lectures.

2.1 Introductory Example

Suppose you obtain the results, shown on Table 2.1, for an experiment on a drug
aiming at reducing the levels of inflammation. The drug is administered to three
subjects while three others (control) take a placebo.

TABLE 2.1: Inflammation levels in the two groups, the drug treated (D) and the
control (C) group.

Drug Control
D1 D2 D3 C1 C2 C3 𝑋̄ 𝐷 𝑋̄ 𝐶 Δ𝑜
18 21 22 30 25 20 20.33 25 -4.67

The drug does seem to have an effect. The difference in the means of inflam-
mation is -4.67. But does this constitute enough evidence in favor of the drug
or could it also be obtained by chance? The answer to that question requires a
proper test.


The crucial step to make a proper test is to understand the following. What
would the results of the experiment look like if, indeed, they were due entirely
to chance, as opposed to an effect of the drug?
A natural but unsatisfactory answer would suggest that the means of the two samples should be relatively close and, at the limit, even equal. This is unsatisfactory because, while it correctly implies that the two means will be affected by some random variation, it does not help determine how close is “close enough” to be considered “equal”.
Here is a more fruitful view. If the differences of the means across samples are
simply due to randomness, i.e., the drug has no effect, then both samples (drug
and control) are nothing but two random samples of the same population.
In that case, the above difference, call it Δ𝑜 = -4.67, is one of the possible differences between two random groups of three subjects.
Now, is Δ𝑜 so extreme that we cannot believe that chance alone was at play and would rather accept the idea that the drug played a role? To answer this question, I suggest listing all the possible Δ’s.

2.2 Exact Permutation Distribution

We first obtain all the possible permutations for two groups of three out of the
six observations. Then, for each of these groups, we calculate their mean and the
difference between their means, i.e., Δ𝑖 . The results are given on Table 2.2.

TABLE 2.2: All combinations of the six observations into two groups.

As if Drug As if Control
D1 D2 D3 C1 C2 C3 𝑋̄ 𝐷 𝑋̄ 𝐶 Δ𝑖
18 21 20 22 25 30 19.67 25.67 -6.00
18 22 20 21 25 30 20.00 25.33 -5.33
18 21 22 20 25 30 20.33 25.00 -4.67
18 25 20 21 22 30 21.00 24.33 -3.33
21 22 20 18 25 30 21.00 24.33 -3.33
18 21 25 20 22 30 21.33 24.00 -2.67
18 22 25 20 21 30 21.67 23.67 -2.00
21 25 20 18 22 30 22.00 23.33 -1.33
22 25 20 18 21 30 22.33 23.00 -0.67
18 30 20 21 22 25 22.67 22.67 0.00
21 22 25 18 20 30 22.67 22.67 0.00
18 21 30 20 22 25 23.00 22.33 0.67
18 22 30 20 21 25 23.33 22.00 1.33
21 30 20 18 22 25 23.67 21.67 2.00
22 30 20 18 21 25 24.00 21.33 2.67
18 30 25 20 21 22 24.33 21.00 3.33
21 22 30 18 20 25 24.33 21.00 3.33
30 25 20 18 21 22 25.00 20.33 4.67
21 30 25 18 20 22 25.33 20.00 5.33
22 30 25 18 20 21 25.67 19.67 6.00

So, was the above difference Δ𝑜 = -4.67 extreme? We can look at how it fits in the overall distribution of the possible differences between two groups, as shown in Figure 2.1.
What percentage of values are smaller than or equal to the observed value? As many as 3 out of 20, i.e., 15%.
Hence, if the drug had no effect, we would have a 15% chance of observing such a value in a sample of three treated subjects versus three controls. That is a small probability, but usually too high to be conclusive.

2.3 Subsetted Permutation Distribution

The exact distribution may not always be feasible, often because the number of combinations is too high. In that case, we resort to randomly subsetting a large number of values from the permutation distribution. We explore that case in the example below.

FIGURE 2.1: Distribution of Δ over the real line.

We use the dataset called Beerwings containing the consumption of beer and hot
wings of 30 individuals, with information on the gender. Notice that there are 15
males and 15 females in the sample. Table 2.3 shows the raw data.
We are interested in evaluating whether male individuals have the same consumption of hotwings as female individuals. Notice that, in our sample, we do observe a difference, see Table 2.4.

TABLE 2.3: Observed hotwings consumption of all individuals, by gender.

All individuals
Gender
Female 4 5 5 6 7 7 8 9 11 12 12 13 13 14 14
Male 7 8 8 11 13 13 14 16 16 17 17 18 18 21 21

TABLE 2.4: Group averages in hotwings consumption and difference between


groups of males (M) and females (F).

𝑋̄ 𝑀 𝑋̄ 𝐹 Δ𝑜
14.53 9.33 5.2

But we can only evaluate whether that difference is due to chance or not once
we have the sampling distribution of that difference. Again, the starting point
is assuming that the difference that we observe is simply due to chance. Under
that view, gender doesn’t matter, i.e., we could take any combination of 15 in-
dividuals, take the mean of their consumption and compare it with the mean
consumption of the remaining 15 individuals.
Notice, however, that it is unrealistic to apply the approach above with the exact permutation distribution. This is because the number of permutations in that case is far too large: there are more than 155 million ways of choosing 15 individuals out of 30. Instead, we randomly subset a large number of these permutations and proceed as before.

FIGURE 2.2: Subset of the permutation distribution: hotwings case.

So, was the observed difference Δ𝑜 = 5.2 extreme? We can look at how it fits in the overall distribution of the possible differences between two groups, as shown in Figure 2.2.
The observed value does seem extreme: only 0.081% of the values are as large as or larger than the one we observe. This percentage is so small that we do not believe that the observed difference is due to chance alone. The data indicate that there is a difference in consumption.

2.4 Unbalanced, Skewed Case

The present case features highly unbalanced groups as well as a highly skewed
permutation distribution.
The data set is about average repair times in the USA. In a given area, by law, Verizon must perform repairs for all customers, whether they are Verizon’s own customers or customers of other companies.
We are interested in evaluating whether the average repair times are indeed equal across the two groups of customers. Again, we will use the method above, i.e., taking a random subset of all the relevant permutations of the repair times.

Notice again, at the outset, that there seems to be a difference in the average repair time between the groups, as shown in Table 2.5.

TABLE 2.5: Group averages in repair times and difference between groups of
Verizon customers (V) and customers of other companies (C).

𝑋̄ 𝑉 𝑋̄ 𝐶 Δ𝑜
8.41 16.51 -8.1

A word about these relevant permutations is as follows. The data set contains 1687 observations. But the distribution across groups is very unbalanced. Indeed, there are 1664 observations for Verizon clients and only 23 observations for clients of other companies. In order to know whether the average repair times are the same across groups, we should compute the difference between the mean of any group of 23 observations and the mean of the remaining 1664 observations.
Again, we cannot rely on all the possible permutations. There are 4.052292e+51 ways of choosing 23 observations out of 1687. That is way too many to compute! Therefore, we again subset a large number (99’999) of these permutations and proceed as if it were the actual permutation distribution.
So, was the observed difference Δ𝑜 = -8.1 extreme? We can look at how it fits in the overall distribution of the possible differences between two groups, as shown in Figure 2.3.
Again, the observed value seems too extreme. Indeed, only 1.828% of the values are as small as or smaller than the one we observe. This percentage is so small that we do not believe that the observed difference is due to chance alone. The data indicate that there is a difference in repair times.

2.5 R Code

This section provides the R code used to obtain the results of this chapter, though
not how to display them in tables and graphs. For clarity it is separated by sec-
tion/task.
FIGURE 2.3: Subset of the permutation distribution: Verizon case.

library(gtools) # combinations()
library(magrittr) # %>%
library(tibble) # as_tibble()
library(dplyr) # rename(), mutate(), ...
library(resampledata) # Beerwings and Verizon data

# The six observations from Table 2.1 and the observed difference in means
drug <- c(18, 21, 22)
control <- c(30, 25, 20)
all.obs <- c(drug, control)
n <- length(all.obs)
observed <- round(mean(drug) - mean(control), 2)

# All 20 ways of assigning three of the six observations to the "drug" group;
# the remaining three observations form the "control" group
all.comb <- combinations(n = n, r = 3, repeats.allowed = FALSE) %>%
  apply(2, function(x) all.obs[x]) %>%
  as_tibble() %>%
  rename(d1 = V1, d2 = V2, d3 = V3) %>%
  rowwise() %>%
  mutate(c1 = sort(setdiff(all.obs, c(d1, d2, d3)))[1],
         c2 = sort(setdiff(all.obs, c(d1, d2, d3)))[2],
         c3 = sort(setdiff(all.obs, c(d1, d2, d3)))[3])

# Group means and their difference for each permutation (Table 2.2)
df <- all.comb %>%
  rowwise() %>%
  mutate(md = round(mean(c(d1, d2, d3)), 2),
         mc = round(mean(c(c1, c2, c3)), 2),
         delta = round(mean(c(d1, d2, d3)) - mean(c(c1, c2, c3)), 2))

data("Beerwings")
nmen <- nrow(Beerwings[Beerwings$Gender=="M",])
n <- nrow(Beerwings)
observed <- round(mean(Beerwings[Beerwings$Gender=="M", "Hotwings"]) -
mean(Beerwings[Beerwings$Gender=="F", "Hotwings"]),2)
hw <- Beerwings$Hotwings

n.s <- 1e5 -1


delta <- numeric(n.s)

for (i in 1:n.s){
index <- sample(1:length(hw), 15, replace = FALSE)
delta[i] <- mean(hw[index]) - mean(hw[-index])
}

pvalue <- round((sum(delta >= observed) + 1 ) / (n.s + 1),6)*100

# Verizon example: observed difference in mean repair times (ILEC vs CLEC)
data("Verizon")
observed <- round(mean(Verizon[Verizon$Group == "ILEC", "Time"]) -
                    mean(Verizon[Verizon$Group == "CLEC", "Time"]), 2)

n.ilec <- nrow(Verizon[Verizon$Group == "ILEC", ])
n.clec <- nrow(Verizon[Verizon$Group == "CLEC", ])
time <- Verizon$Time

# Subset of the permutation distribution: 99,999 random splits of 1664 vs 23
n.s <- 1e5 - 1
delta <- numeric(n.s)
for (i in 1:n.s){
  index <- sample(1:length(time), n.ilec, replace = FALSE)
  delta[i] <- mean(time[index]) - mean(time[-index])
}

# One-sided p-value (in percent): share of splits at least as small as observed
pvalue <- round((sum(delta <= observed) + 1) / (n.s + 1), 6)*100

2.6 Exercises

Exercise 2.1. Follow the argument of Section 2.4 and slightly modify the corresponding R code in Section 2.5 to answer the following question.
Is there a significant difference in the median repair times between the Verizon clients and the clients of other companies served by Verizon?
3
Paul the Octopus and 𝑝 < 0.05

TL;DR The 5% significance level as a threshold for


statistical decision making ought not be blindly ap-
plied and accepted, despite it being ubiquitous in the
scientific literature ever since it was introduced by
statisticians such as R. Fisher.[3.4]
The process through which the statistically signifi-
cant result was obtained must always be carefully
scrutinized. Common sense and/or a solid theory
must be used to evaluate seemingly “extra”-ordinary
results.[3.1][3.3]
Researchers and scientific outlets alike ought to de-
fend against this threat to science by adopting norms
to avoid the publication of the currently too large
number of false positives.[3.2]

3.1 Paul the Octopus…

Paul the Octopus was a common octopus living at the Sea Life Centre in
Oberhausen, Germany. It became famous worldwide, and even received death
threats, after managing to predict the outcomes of international football matches,


FIGURE 3.1: Binomial distribution 𝑋 ∼ 𝐵(14, 0.5) with associated probabilities and emphasis of Paul’s 12 successes.

mainly involving the German team, at both the Euro 2008 and the 2010 FIFA World Cup. Overall, Paul correctly predicted 12 results out of the 14 matches it gave an opinion on.
From a statistical point of view, how ought we to appreciate this remarkable feat
of the octopus? In particular, how does it fit with the insights of Chapter 2?
A rather natural approach inspired by Chapter 2 consists in asking whether luck alone could explain this level of accuracy. To that end, we rely on the binomial distribution of the random variable 𝑋 counting the number of successes in 14 trials, noted 𝑋 ∼ 𝐵(𝑛 = 14, 𝑝 = 0.5). Figure 3.1 shows the distribution over all the possible values that 𝑋 could take.
It seems that if chance alone were to explain the observation, there would be only a 0.5554% probability of achieving this result, way below the usual 5% threshold for statistical significance. Hence, we are led to reject the role of luck alone and embrace the possibility of a truly psychic animal. Given the increasingly glowing picture of the octopus and its abilities, this alternative explanation may even seem to rest on solid ground. If you haven’t seen it yet, I strongly recommend the documentary “My Octopus Teacher”.1
1
https://www.netflix.com/title/81045007
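The probability quoted above can be checked in one line of R; the second line gives the probability of doing at least as well as Paul, which is the more conventional way of defining “as extreme”.

dbinom(12, size = 14, prob = 0.5)           # exactly 12 correct: ~0.005554
sum(dbinom(12:14, size = 14, prob = 0.5))   # 12 or more correct: ~0.0065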

3.1.1 … and Other Psychic Beasts

Notice that the competition in the domain of animal oracles is rather fierce. Rabio the octopus had a perfect score for the group stage games of Japan in the 2018 FIFA World Cup, but it was chopped up for a meal before having a chance to fully display its talent in the remaining games.
But octopuses are not the only animals with clairvoyant powers. At the time of the 2010 FIFA World Cup final, Paul fought for its place against Mani the parakeet, which itself correctly called all the results of the quarter-final games.
Sure enough, not all animals are so skilled. You had better not put your money on Flopsy the kangaroo, who tends to be biased towards Australia, Leon the porcupine, Petty the pygmy hippopotamus or Anton the tamarin.
But the contestants racing to replace Paul are still numerous, including Shaheen the camel, Madame Shiva the guinea pig, and Nelly the elephant. What is more, the BBC reported2 that a full colony of penguins at the National Sea Life Centre in Birmingham is entering the competition, along with Big Head the loggerhead turtle, Alistair and Derek the miniature donkeys, and Sarge and Oscar the macaws.

3.1.2 Still Randomness

What is to be taken from this frenetic search for psychic animals?


Notice that Paul’s score can and should be rewritten as follows. A probability of 0.5554% is equivalent to around 1/180.
Now, suppose that a very great number of animals, say 180 or more, all without any knowledge of sports, the European championship or the FIFA World Cup, are asked to make predictions. It should be obvious that it is very likely that at least one of them will achieve Paul’s performance.
2
https://www.bbc.com/news/uk-england-27810714
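A back-of-the-envelope check in R, under the assumption of 180 independent guessing animals, each with Paul’s roughly 1-in-180 chance:

p_paul    <- dbinom(12, size = 14, prob = 0.5)   # ~ 1/180
n_animals <- 180                                 # assumed number of animal oracles
1 - (1 - p_paul)^n_animals                       # ~ 0.63: a Paul-like feat is more likely than not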

FIGURE 3.2: News from research. (Source: xkcd.)

3.2 p-Hacking

Consider the news reported in Figure 3.2. It clearly indicates that the result meets the usual standard of 𝑝 < 5%, suggesting that the result is not due to chance alone. Hence, we are left to believe that green jelly beans are linked to acne. Teenagers, watch out!
But should we really believe that result? As it turns out, Figure 3.2 is only one
part of a larger humorous cartoon, given in Figure 3.3. And the fuller picture gives the key to understanding the origin of the result.
There seems initially to be no evidence linking jelly beans and acne. However, if researchers multiply the experiments, say by changing the color of the jelly beans, then, by the very nature of randomness, there will eventually be one experiment whose result lands sufficiently far away from the true value of 0, i.e., no effect of jelly beans.
Unfortunately, this “significant” result, i.e., the case yielding a 𝑝 < 0.05, is the
one to be submitted to research journals or even to general-audience publica-
tions. To take it at face value is a great error indeed. Only by thoroughly exam-
ining the process through which the result emerged can we avoid it.
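A quick R calculation shows why running many experiments almost guarantees one “significant” result. It assumes 20 independent tests (one per jelly-bean color) and no true effect anywhere:

alpha   <- 0.05
n_tests <- 20                 # one test per jelly-bean color
1 - (1 - alpha)^n_tests       # ~ 0.64: probability of at least one false positive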

3.2.1 A Threat to Science

An assessment of the current practices in the industry of scientific research by Ioannidis resulted in a devastating conclusion, expressed in the paper’s title:

Why most published research findings are false

— Ioannidis (2005)

The author identifies several conditions that make a research finding less likely to be true. The insidious technique of p-hacking described above is but one of them. For a more colorful description (with age restriction), watch John Oliver’s take.3

3.3 Efficient Markets Hypothesis

Suppose you are the only one to know with certainty something about the evolution of a stock. Then you can make big money by buying/selling in the stock market. By doing so, you would also slightly affect the price of the stock in that market.
As it turns out, the assumption that only you know, and nobody else does, is too strong an assumption in finance. Instead, another assumption is taken as a valid description, namely the Efficient Markets Hypothesis (EMH). Under that assumption/theory, asset prices reflect all the information available to the actors in the market. As a consequence, there is no way of making more money than the market average.
Enter Bill Miller.4 This fund manager was referred to as “the greatest money manager of our time”5 by CNN Money, among the several distinctions received
3
https://youtu.be/0Rnq1NpHdmw?t=195
4
https://en.wikipedia.org/wiki/Bill_Miller_(investor)
5
https://money.cnn.com/magazines/fortune/fortune_archive/2006/11/27/8394343/index.htm

in the financial media industry. The reason? Bill Miller managed to beat the mar-
ket 15 calendar years in a row, 1991 through 2005!
The man’s performance was seen as nothing short of genius! Why? Because of
the extremely low probability of observing such a performance by chance alone.
Mauboussin and Bartholdson (2003), see here,6 offered some estimates of this probability when Miller was only in the 12th year of his feat. I adapted the numbers to the 15-year streak.

• If, every year, beating the market was a 50%-50% game, similar to a flip of a coin, then Miller’s correct predictions had a probability of 1 in 32’768 (i.e., $2^{15}$).
• If you consider that less than 50% of the funds beat the market every year, say
around 44%, then the probability is 1 in 222’951.
• Some commentators (see Mauboussin and Bartholdson (2003)) infer that beat-
ing the market once is as likely as rolling a seven (when throwing two dice). If
that is true, then Miller’s roll of dice had 1 in 470’184’984’576 chances of hap-
pening.

Now, is this all that impressive? Following the argument above, we can offer a different perspective on the God-like abilities of Bill Miller. As it happens, Mlodinow (2009) estimates that there were 3 chances out of 4 of observing such a streak. How can we explain such a probability?
Consider again the first case above, where beating the market once is similar to a flip of a coin, i.e., a 50%-50% proposition. The above estimate of the probability is correct only under a very narrow view. If, at the beginning of 1991, you had talked to Miller and evaluated his probability of beating the market in each of the next 15 years, then, yes, Miller achieving it would have been impressive.
But a larger view needs to take into account that there are many firms active in the market, say 1000 as a very low estimate. Also, these firms have been active for a long time, say 40 years. Now, Miller’s performance takes on another color. Miller is simply the one fund manager, out of the 1000, who managed to predict 15 coin throws in a row somewhere within the 40 trials.
How likely is it that some manager beats the market for some 15-year period? 75%
according to Mlodinow (2009). Bill Miller was the lucky guy.
6
http://docplayer.net/52594811-On-streaks-perception-probability-and-skill-finding-the-hot-
shot.html
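Mlodinow’s reasoning can be sketched by simulation. The code below is only an illustration, not his calculation: the number of managers, the length of their careers, and the 50% chance of beating the market in any given year are all assumptions, and the resulting probability is very sensitive to them (with a few thousand managers instead of 1000, it climbs towards Mlodinow’s three out of four).

set.seed(1)
n_years <- 40     # assumed length of a manager's career
n_reps  <- 1e5    # number of simulated careers

# Does one simulated career contain a streak of 15 (or more) winning years?
has_streak <- function(years, len = 15) {
  beats <- rbinom(years, size = 1, prob = 0.5)   # beating the market = coin flip
  runs  <- rle(beats)
  any(runs$lengths[runs$values == 1] >= len)
}

p_single <- mean(replicate(n_reps, has_streak(n_years)))
p_single                   # chance that one given manager has such a streak
1 - (1 - p_single)^1000    # chance that at least one of 1000 managers does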

Mauboussin and Bartholdson (2003) understand this phenomenon but are more reluctant to attribute the result to luck alone. This is because they find it difficult to untangle luck from skill (as Mauboussin further explains in his nice book, Mauboussin (2012)).

3.4 Rigorous Uncertainty and Moral Certainty

A couple of words to recall the arbitrariness of the usual confidence level, 95%,
that allows a result to have statistical significance.
First, we can credit one of the most important statisticians of the XXth century, Ronald Fisher, for canonizing the 5% level (Stigler (2008)). The p-value is suggested as a measure of “rigorously specified uncertainty”. Hence, the 5% limit is justified as follows:

The value for which P = .05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in
judging whether a deviation is to be considered significant or not. Deviations exceeding twice the
standard deviation are thus formally regarded as significant. Using this criterion we should be led to
follow up a negative result only once in 22 trials, even if the statistics are the only guide available. Small
effects would still escape notice if the data were insufficiently numerous to bring them out, but no
lowering of the standard of significance would meet this difficulty.

— Fisher (1925)

To be sure, Fisher did not impose this bar. It seems, however, that his suggestion was shared and found useful enough to serve as a satisfactory compromise.
The necessity of agreeing on a workable level can be better appreciated when looking at the limit used by the founding father of probability theory, Jakob Bernoulli (Bernoulli (1713), see Figure 3.4).
Jakob7 Bernoulli would not settle for a confidence level lower than 99.9%, which
7
In the Bernoulli family, you must specify the first name.

he associates with moral certainty. Notice that this corresponds to a limit for the
p-value of 0.001, or equivalently, for allowing an error in less than 1 time in 1000!
Such a high bar was depressing even for Jakob Bernoulli. In Bernoulli (1713), he calculates that, to obtain moral certainty about a relevant proportion in the population of Basel, he would have had to sample… more than the entire population of Basel at that time!

3.5 Exercises

Exercise 3.1. Why are there 20 similar panels in the middle of Figure 3.3, as op-
posed to 25 or another number? Explain.

Exercise 3.2. Under reformulation, originally ambiguous.

Exercise 3.3. Have a look at Mauboussin and Bartholdson (2003), that you can
find online here8 .
Reproduce the results of the last line of Exhibit 1, i.e., for # of years = 15 (p.3). In
other words, show how they were obtained.

Exercise 3.4. Again in Mauboussin and Bartholdson (2003), p.3. The authors use
Exhibit 2 for placing the probability of beating the market at 1 in 477’000.
Reproduce that number. Notice that the authors use the expression “about”
when referring to their result. My own estimate, given Exhibit 2, is 1 in 475’186.
The discrepancy may be due to the effect of rounding.

Exercise 3.5. Consider the R code below. Get inspiration from it in order to write
the R code to answer Exercise 3.4.

aa <- c(7, 6, 2)
prod(aa)
## [1] 84

8
http://docplayer.net/52594811-On-streaks-perception-probability-and-skill-finding-the-hot-
shot.html

Exercise 3.6. In the quote from Fisher (1925), see Section 3.4, we read “P = .05,
or 1 in 20” and, further, “using this criterion we should be led to follow up a
negative result only once in 22 trials”.
Explain this apparent contradiction, i.e., 20 vs 22 trials.

FIGURE 3.3: xkcd on significance (xkcd.com/882).



FIGURE 3.4: Title page of Jakob Bernoulli 1713’s Ars Conjectandi.


Part II

Statistical Inference
4
A Blueprint for Inference

TL;DR This chapter lists the components required in


statistical inference, in particular in hypothesis test-
ing.

4.1 Introduction

Statistical inference is the procedure aiming at estimating various parameters


characterizing a population along with quantifying the uncertainty of these esti-
mates. In short, it is a procedure dictating the rules for valid statistical statements
(as put in Chapter 2).
This chapter aims at providing a blueprint for statistical inference, in particular in
the specific framework of hypothesis testing. It does so by describing its central
elements that remain constant beyond the specificities of every given case.
Here is a crucial guiding thread for the chapter. The sections below then elabo-
rate on each element. The ground of statistical inference is a set of assumptions,
made by the researcher, about the nature of the process that generated the data
at hand. Then, the researcher formulates a testable hypothesis about a specific
aspect of interest, often a key parameter to answer a research question. The test
of the hypothesis is then carried out from the following perspective. If the assumptions are correct and the hypothesis made is true, then what are all the values that an estimator could return in a sample? In other words, what is the sampling distribution of the chosen sample-based statistic, e.g., the sample mean? The actual value of the statistic observed in the sample at hand is then placed within that distribution for comparison. The relevant question is: if the assumptions are correct and the hypothesis made is true, what is the probability of observing a value as “extreme” as the one we calculated in the sample? Here, the term “extreme” has no absolute definition. The practice of statistical analysis has settled on a set of thresholds, noted 𝛼 (one in particular), beyond which the probability of observing a value as extreme as the one given in the sample is deemed too small to be compatible with the hypothesis, given the assumptions. In that case, the hypothesis is rejected in a statistical sense. This can be a correct decision or an error on the part of the researcher.

4.2 Assumptions

The assumptions on the underlying data are essential in any analysis. The valid-
ity of the results depends crucially on them. It is therefore of utmost importance
that the researcher verifies that they are likely to apply.
Not all assumptions are equally reasonable and likely to be satisfied. We generally place them on a range from mild/weak to strong.
Examples of mild assumptions include the independence of the observations or their identical distribution. The actual distribution can sometimes also be assumed when it does not overly affect the results.
Assumptions tend to be seen as strong the more structure they impose on the
data, e.g., a full model of relationships between variables.

4.3 Testable Hypothesis

A statistical hypothesis is a premise or claim that we want to test, a statement


about the numerical value of a population parameter(s).
In hypothesis testing, we turn a question of interest into hypotheses about the
value of a parameter or parameters.
We define two hypotheses.
The null hypothesis, denoted 𝐻0 , represents the hypothesis that will be assumed

to be true unless the data provide convincing evidence that it is false. This usually
represents the “status quo” or some statement about the population parameter
that the researcher wants to test.
The alternative (research) hypothesis, denoted 𝐻𝑎 , represents the values of a
population parameter for which the researcher wants to gather evidence to sup-
port.
The following are examples of hypotheses.

a. Is the mean weight of a certain candy bar different from the desired 40
grams? 𝐻0 : 𝜇 = 40 vs. 𝐻𝑎 : 𝜇 ≠ 40.
b. Do men and women have different starting salaries after graduating
university? 𝐻0 : 𝜇𝑀 = 𝜇𝐹 vs. 𝐻𝑎 : 𝜇𝑀 ≠ 𝜇𝐹 .
c. Do three different production processes all have the same variance? 𝐻0 :
𝜎1 = 𝜎2 = 𝜎3 vs. 𝐻𝑎 : They are not all equal.

4.4 Estimator

The goal of inferential statistics is to characterize a population using sample data.


Specifically, we are usually interested in estimating the parameters of a popula-
tion. To this purpose, we use sample statistics to make inferences about popula-
tion parameters.
For instance, we can use the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ to estimate the population mean 𝜇. The statistic 𝑋̄ is an example of a point estimator of the parameter 𝜇. A specific output (value) of the sample mean 𝑥̄ is an estimate of the parameter 𝜇.
An estimator of a population parameter is a random variable that depends on
the sample information; its value provides approximations of this unknown pa-
rameter. A specific value of that random variable is called an estimate.
A point estimator (𝜃)̂ of a population parameter (𝜃) is a rule or formula that tells
us how to use the sample data to calculate a single number, the point estimate,
that can be used as an estimate of the population parameter.

4.5 Sampling Distribution

The sampling distribution is simply the probability distribution function of the


sample statistics. In other words, the sampling distribution of a given statistic is
the distribution we would get if we were to take all possible samples of a given
size 𝑛 and for each of those samples calculate the same statistic.
Sampling distributions have an expected value and a variance, and they often follow known probability distributions (e.g., the Normal, 𝑡, Chi-square, or 𝐹 distributions).

4.6 Level of Significance

Statisticians determine what constitutes an extreme value by setting a level of


significance 𝛼 before they perform any calculations. Commonly set levels of sig-
nificance (𝛼) are: 1 percent or 0.01, 5 percent or 0.05, 10 percent or 0.1.
To say that we’re working at the 0.05 level of significance is to say that we’re
looking for a value of our test statistic that is so extreme it would occur by chance
only 5% of the time, or less, under the assumption that the null hypothesis is true.

4.7 Deciding on an Hypothesis

4.7.1 Critical Rejection Region

The critical region is the portion of the sampling distribution that contains all the
values that allow you to reject a null hypothesis. For that reason, we refer to the
critical region as the region of rejection as it represents the set of possible values
of the test statistic for which the researcher will reject 𝐻0 in favor of 𝐻𝑎 .
The critical value is the point that marks the beginning of the critical region.

4.7.2 One-Tailed and Two-Tailed Tests

1. Select the null hypothesis as the status quo, that which will be presumed
true unless the sampling experiment conclusively establishes the alter-
native hypothesis. The null hypothesis will be specified as that parame-
ter value closest to the alternative in one-tailed tests and as the comple-
mentary (or only unspecified) value in two-tailed tests, e.g., 𝐻0 ∶ 𝜇 = 𝜇0 .
2. Select the alternative hypothesis as that which the sampling experiment
is intended to establish. The alternative hypothesis will assume one of
three forms:
a. One-tailed, lower-tailed, e.g., 𝐻𝑎 ∶ 𝜇 < 𝜇0 ,
b. One-tailed, upper-tailed, e.g., 𝐻𝑎 ∶ 𝜇 > 𝜇0 ,
c. Two-tailed, e.g., 𝐻𝑎 ∶ 𝜇 ≠ 𝜇0 .

A two-tailed test of hypothesis is one in which the alternative hypothesis does


not specify departure from 𝐻0 in a particular direction and is written with the
symbol “≠”.
As a rule of thumb, we choose a two-sided alternative hypothesis unless we have
strong reasons to only be interested in one particular side.
The rejection region for a two-tailed test differs from that for a one-tailed test. A
one-tailed test of hypothesis is one in which the alternative hypothesis is direc-
tional and includes the symbol “>” or “<”, see Figure 4.1 for an illustration.
Example 4.1. Cigarette advertisements are required by federal law to carry
the following statement: “Warning: The surgeon general has determined that
cigarette smoking is dangerous to your health.” However, this warning is of-
ten located in inconspicuous corners of the advertisements and printed in small
type. Suppose the Federal Trade Commission (FTC) claims that 80% of cigarette
consumers fail to see the warning. A marketer for a large tobacco firm wants to
gather evidence to show that the FTC’s claim is too high, i.e., that fewer than 80%
of cigarette consumers fail to see the warning.
Question
Specify the null and alternative hypotheses for a test of the FTC’s claim.
Answer
The marketer wants to make an inference about 𝑝, the true proportion of all
cigarette consumers who fail to see the surgeon general’s warning.
FIGURE 4.1: Rejection regions for three alternative hypotheses.

In particular, the marketer wants to collect data to show that fewer than 80% of
cigarette consumers fail to see the warning, i.e., 𝑝 < 0.80.
Consequently, 𝑝 < 0.80 represents the alternative hypothesis and 𝑝 = 0.80 (the
claim made by the FTC) represents the null hypothesis. That is, the marketer
desires the one-tailed (lower-tailed) test:

• 𝐻0 ∶ 𝑝 = 0.8 (i.e. the FTC’s claim is true),


• 𝐻𝑎 ∶ 𝑝 < 0.8 (i.e. the FTC’s claim is false).

4.7.3 The 𝑝-Value

The observed significance level, or 𝑝-value, for a specific statistical test is the
probability (assuming 𝐻0 is true) of observing a value of the test statistic that is

at least as contradictory to the null hypothesis, and supportive of the alternative


hypothesis, as the actual one computed from the sample data.
Thus, the 𝑝-value is the smallest significance level at which a null hypothesis can
be rejected, given the observed sample statistic. When the 𝑝-value is calculated,
we can test the null hypothesis by using the following rule:

reject 𝐻0 if 𝑝-value < 𝛼.
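As a minimal illustration in R (the numbers are made up for the example), suppose a two-tailed test returned the statistic 𝑧 = 2.3 and we work at 𝛼 = 0.05:

z     <- 2.3
alpha <- 0.05
p_value <- 2 * pnorm(-abs(z))   # two-tailed p-value under H0
p_value                         # ~ 0.0214
p_value < alpha                 # TRUE: reject H0 at the 5% level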

4.7.4 Equivalence of Approaches

Sections 4.7.1 and 4.7.3 describe two approaches that allow the researcher to
make a decision on 𝐻0 . These two approaches are equivalent. In other words,
they always give the same decision.
A small caveat must be noted, however:

• If the test is one-tailed, the 𝑝-value is equal to the tail area beyond 𝑧 in the same
direction as the alternative hypothesis.
• If the test is two-tailed, the 𝑝-value is equal to twice the tail area beyond the
observed 𝑧 -value in the direction of the sign of 𝑧 .

4.8 Types of Error

What if we are wrong? When we do an hypothesis test there are two possibilities:

• If we find a significant difference, we reject the null hypothesis.


• If we don’t find a significant difference, we fail to reject the null hypothesis.

However, it is possible that a different sample would have yielded different re-
sults. When conducting hypothesis tests, we can make two kinds of mistakes:

• Type I error: False positive. You could read “positive” as “yes, existence of a
sign against 𝐻0 ”. A false positive would then mean “a misleading sign against
𝐻0 ”.

FIGURE 4.2: Types of error for case ’𝐻0 : the person is not pregnant’.

• Type II error: False negative. You could read “negative” as “no, nothing against
𝐻0 ”. A false negative would then mean “a misleading absence of sign against
𝐻0 ”.

4.8.1 Type I Error

Even if the null hypothesis is true, we may still get a test statistic that is extreme
just due to chance. In this situation, we would incorrectly reject the null hypoth-
esis.
A Type I error occurs if the researcher rejects the null hypothesis in favor of the
alternative hypothesis when, in fact, 𝐻0 is true. The probability of committing a
Type I error is denoted by 𝛼.
Unfortunately, we never know whether we’ve made a Type I error but we know
that the probability of making a Type I error is equal to the level of significance
(𝛼). A level of significance of 0.05 is simply a statement that you’re willing to tolerate a 5% chance of making a Type I error.

4.8.2 Type II Error

Even if the null hypothesis is false, it is still possible, only by chance, to get a test
statistic that is not extreme compared to the value given by 𝐻0 . In this situation,
we would incorrectly fail to reject the null hypothesis.
A Type II error occurs if the researcher fails to reject the null hypothesis when, in
fact, 𝐻0 is false. The probability of committing a Type II error is denoted by 𝛽 .

TABLE 4.1: Summary for types of errors.

                                        States of Nature
Decision on 𝐻0         𝐻0 is true                        𝐻0 is false
Fail to reject 𝐻0      Correct decision (prob. 1 − 𝛼)    Type II error (prob. 𝛽)
Reject 𝐻0              Type I error (prob. 𝛼)            Correct decision (prob. 1 − 𝛽)

4.9 Exercises

Exercise 4.1. In a US court (as much as in other countries’ courts), the defendant
is either innocent (𝐻0 ), or guilty (𝐻𝑎 ).
How could we reduce the rate of errors of type I in US courts? What would that
mean/imply in real terms, i.e., in terms of the decisions of the judges?
What influence would that reduction in Type I errors have on the rate of errors
of type II? Again, explain in concrete terms, not general formulas.
5
Theoretical Sampling Distributions

TL;DR .

5.1 Introduction

Before the advent of massive computational power, exercises such as the one in Section 2.4 were not available to statisticians. How did they manage to have an idea about the sampling distribution of a statistic?
idea about the sampling distribution of a statistic?
What they couldn’t do brute-force, they did through heroic theoretical break-
throughs. These were impressive and very useful results achieved around one
century ago by the likes of Jerzy Neyman1 , Egon Pearson2 or their “enemy”
Ronald Fisher3 .
We can, however, point at two limitations. First, many of these results rely on approximations whose accuracy in small samples is unknown: there is often no guide on the size of the error. This concern is usually dismissed by asserting that the sample size is large enough. And we don’t have much choice.
The second limitation is becoming increasingly stringent with the furious development of data science (see Efron and Hastie (2016)). The production of theoretical results is simply not keeping pace. Arguably, it has become more and more difficult to derive analytical results as the ground for statistical inference with all
1
https://en.wikipedia.org/wiki/Jerzy_Neyman
2
https://en.wikipedia.org/wiki/Egon_Pearson
3
https://en.wikipedia.org/wiki/Ronald_Fisher


the new estimators and techniques out there. So practitioners didn’t wait for them and went on with other methods for validating their claims. We actually arrive at a point where one wonders whether statistics is useful for data science or not.4
In this chapter we offer a few examples of the theoretical results.

5.2 The Central Limit Theorem

The central limit theorem shows that the mean of a random sample of size 𝑛, drawn from a population with any probability distribution, will be approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$, given a large-enough sample size.
In applied statistics the probability distribution for the population being sam-
pled is often not known, and there is no way to be certain that the underlying
distribution is normal.
The CLT allows us to use the normal distribution to compute probabilities for
sample means obtained from many different populations. Combined with the
law of large numbers it provides the basis for statistical inference.

Theorem 5.1 (Central Limit Theorem). Let 𝑋1 , 𝑋2 , ..., 𝑋𝑛 be a set of 𝑛 independent


random variables having identical distributions with mean 𝜇, variance 𝜎2 , and 𝑋̄ is the
mean of these random variables.
As 𝑛 becomes large, the central limit theorem states that the distribution of

$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$$
approaches the standard normal distribution.

In other words, if repeated random samples of size 𝑛 are taken from a popula-
tion with mean 𝜇 and standard deviation 𝜎, the sampling distribution of sample
4
My view is that it is, if only because of the structure it imposes on an analysis, making it all the more reliable.

FIGURE 5.1: Illustration of the Central Limit Theorem: distribution of the means of samples from uniform distributions for different sample sizes, sampled 1000 times.

means will have mean $\mu$ and standard error $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. And, as 𝑛 increases, the sampling distribution will approach a normal distribution.
The central limit theorem can be applied to both discrete and continuous random
variables.

5.2.1 Illustration

Figures 5.1 and 5.2 provide an illustration of the Central Limit Theorem at work.
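A figure of this kind can be reproduced with a few lines of R. The sketch below, in the spirit of Figure 5.1, draws 1000 samples from a uniform distribution on [0, 10] and looks at the distribution of their means; the sample size and the bounds are arbitrary choices.

set.seed(42)
n      <- 50      # sample size
n_reps <- 1000    # number of samples

sample_means <- replicate(n_reps, mean(runif(n, min = 0, max = 10)))

mean(sample_means)   # close to the population mean, 5
sd(sample_means)     # close to sigma / sqrt(n) = (10 / sqrt(12)) / sqrt(50), about 0.41
hist(sample_means)   # roughly bell-shaped, as the CLT predicts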

FIGURE 5.2: Illustration of the Central Limit Theorem: distribution of the means of samples from Poisson distributions (𝜆 = 10) for different sample sizes, sampled 1000 times.

5.3 Sampling Distribution of the Sample Proportion

Definition 5.1 (Sample proportion). The sample proportion is simply the sum of
the success cases in our sample divided by the total number of elements in our
sample.
$$\hat{p} = \frac{\sum_{i=1}^{n} X_i}{n}$$
where each 𝑋𝑖 is an independent Bernoulli random variable with probability of
success 𝑝, i.e., 𝑋 ∼ 𝑏(1, 𝑝).

Note that 𝑝̂ is the mean of a set of independent random variables. Therefore, we


can use the central limit theorem to argue that the probability distribution for 𝑝̂
can be modeled as a normally distributed random variable.
Recall that the mean and variance of a Bernoulli random variable 𝑋 are 𝐸[𝑋] =
𝑝 and 𝑉 𝑎𝑟(𝑋) = 𝑝(1 − 𝑝).

Proposition 5.1 (Expected value and variance of a sample proportion). The ex-
pected value of the sample proportion is:

$$E[\hat{p}] = p$$

The variance of the sample proportion is:

$$Var(\hat{p}) = \frac{p(1-p)}{n}$$
Proposition 5.2 (Sampling distribution of the sample proportion). The distribu-
tion of the sample proportion is approximately normal for large sample sizes (𝑛𝑝(1 −
𝑝) > 5).

$$\hat{p} \,\dot\sim\, N\left(p, \frac{p(1-p)}{n}\right)$$

Thus,

$$Z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \,\dot\sim\, N(0, 1)$$

Example 5.1. Assume that 60% of all city voters are in favor of a particular can-
didate.
Question
In a random sample of 100 voters, what is the probability that fewer than half
are in favor of this candidate?
Answer
Since 𝑛 is large, we know that 𝑝̂ is normally distributed with mean 𝑝 = 0.6 and standard error $\sqrt{\frac{p(1-p)}{n}} = 0.049$.

The desired probability is,

$$P(\hat{p} < 0.5) = P\left(z < \frac{0.5 - 0.6}{0.049}\right) = P(z < -2.04) = 0.021$$
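The same probability can be obtained directly in R:

p  <- 0.6
n  <- 100
se <- sqrt(p * (1 - p) / n)     # standard error, ~0.049
pnorm(0.5, mean = p, sd = se)   # ~0.021
pnorm((0.5 - p) / se)           # same value via the standardized statistic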

5.4 Sampling Distribution of the Sample Variance

Consider a random sample 𝑋1 , 𝑋2 , ...𝑋𝑛 of 𝑛 observations drawn from a pop-


ulation with unknown mean 𝜇 and unknown variance 𝜎2 . The population vari-
ance is,

$$\sigma^2 = E[(X - \mu)^2]$$

As 𝜇 is unknown, we use the sample mean 𝑋̄ to compute the sample variance.

Definition 5.2 (Sample variance). Let 𝑋1 , 𝑋2 , ...𝑋𝑛 be a random sample of ob-


servations from a population. The quantity:
$$S'^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$
is the (adjusted) sample variance and its square root, 𝑆 ′ , is the (adjusted) sample
standard deviation.

5.4.1 Degrees of Freedom

Degrees of freedom of an estimate is the number of independent pieces of in-


formation that went into calculating the estimate. It’s not quite the same as the
number of items in the sample. Another way to look at degrees of freedom is
that they are the number of values that are free to vary in a data set.

Example 5.2. In a random sample of 𝑛 observations, we have 𝑛 different inde-


pendent values or degrees of freedom. But, after we know the computed sample
mean, there are only 𝑛 − 1 different values that can be uniquely defined.

Formula                                              Advance knowledge required
$\bar{X} = \frac{1}{n}\sum X_i$                      Nothing
$S' = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n-1}}$     Sample mean

5.4.2 Expected Value of Sample Variance

Proposition 5.3 ((Adjusted) sample variance). The expected value of the adjusted
sample variance is the population variance.

𝐸[𝑆 ′2 ] = 𝜎2

5.4.3 Sampling Distribution when Sampling from a Normally Distributed Popula-


tion

Proposition 5.4 (Distribution of 𝑆 ′2 when sampling from a normal). Given a ran-


dom sample of 𝑛 observations from a normally distributed population whose population
variance is 𝜎2 and whose resulting sample variance is 𝑆 ′2 , it can be shown that
$$\frac{(n-1)S'^2}{\sigma^2} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2} \sim \chi^2_{(n-1)}$$

follows a $\chi^2$ distribution with $n - 1$ degrees of freedom.

Since the sample variance is not a linear transformation of independent normal


random variables, its distribution is not normal.
The 𝜒2 family of distributions is commonly used in applied statistics because it provides a link between the sample and the population variances. Note, however, that the preceding 𝜒2 distribution and the resulting computed probabilities for various values of 𝑆′2 require that the population distribution be normal.
Note that the assumption of an underlying normal distribution is more impor-
tant for determining probabilities of sample variances than it is for determining
probabilities of sample means.

Proposition 5.5 (Expectation and variance of 𝑆 ′2 when sampling from a normal).


When the underlying population distribution is normal, it can be shown that:
$$E[S'^2] = \sigma^2, \qquad Var(S'^2) = \frac{2\sigma^4}{n-1}$$

Example 5.3. Suppose the weights of bags of flour are normally distributed with
a population standard deviation of 𝜎 = 1.2 ounces.
Question
Find the probability that a sample of 200 bags would have a standard deviation
between 1.1 and 1.3 ounces.
Answer
We evaluate the random variable $\frac{(n-1)S'^2}{\sigma^2}$ at the endpoints of the interval in question:

$$\frac{(n-1)S'^2}{\sigma^2} = \frac{(200-1) \times 1.1^2}{1.2^2} \approx 167.22$$

$$\frac{(n-1)S'^2}{\sigma^2} = \frac{(200-1) \times 1.3^2}{1.2^2} \approx 233.55$$

The probability will be the area under the $\chi^2_{199}$ distribution between these values:

$$\chi^2 CDF(167.22, 233.55, 199) = 0.9037$$
There is a 90.37% probability that the standard deviation of the weights of the
sample of 200 bags of flour will fall between 1.1 and 1.3 ounces.
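In R, the same probability is obtained with pchisq():

df    <- 200 - 1
lower <- df * 1.1^2 / 1.2^2             # ~167.22
upper <- df * 1.3^2 / 1.2^2             # ~233.55
pchisq(upper, df) - pchisq(lower, df)   # ~0.90, the probability found above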

5.5 The Chi-Square Distribution

The Chi-square (𝜒2 ) distribution is a continuous distribution that is widely used


in statistical inference.
It is related to the standard normal distribution. Indeed, if a random variable 𝑍
follows the standard normal distribution, then 𝑍 2 follows a 𝜒2 distribution with
one degree of freedom, see Figure 5.3

FIGURE 5.3: Standard normal (left) and Chi-square with one degree of freedom (right).

More generally, the 𝜒2 distribution results when 𝑛 independent variables with


standard normal distributions are squared and summed.
Definition 5.3 (Chi-square distribution). Let 𝑍1 , 𝑍2 ..., 𝑍𝑛 be n independent
and identically 𝑁 (0, 1) distributed random variables. The sum of their squares,
$$\sum_{i=1}^{n} Z_i^2,$$

is $\chi^2$-distributed with $n$ degrees of freedom and is denoted as $\chi^2_n$.

Note the following:

• The 𝜒2 -distribution is not symmetric.


• It can only realize values greater than or equal to zero.
• The degrees of freedom specify the shape of the distribution.

Figure 5.4 plots 𝜒2 distributions for various degrees of freedom 𝑘.


Note that the minimum value that 𝜒2 can take is zero. It can take any non-
negative value. As the degrees of freedom increase the skewness of the distri-
bution decreases.

FIGURE 5.4: Chi-square distributions for various degrees of freedom, pdf (left) and cdf (right).

5.5.1 Using the Table

Here is how to use the Chi-square distribution table, see Figure 5.5.

1. Find the row that corresponds to the relevant degrees of freedom, 𝑛.
2. Find the column headed by the probability of interest.
3. Determine the 𝜒2 value where the 𝑛 row and the probability column intersect. (A short R equivalent is given right after these steps.)
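In R, the functions qchisq() and pchisq() replace the table entirely; for instance, for 10 degrees of freedom and an upper-tail probability of 5%:

qchisq(0.95, df = 10)        # ~18.31, the corresponding critical value
1 - pchisq(18.31, df = 10)   # ~0.05, going back to the probability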
FIGURE 5.5: Chi-square values for degrees of freedom between 1 and 15 and for the main probability benchmarks.
6
Inference on Sample Proportions

TL;DR A sample proportion is described as the mean


of the number of successes in 𝑛 Bernoulli trials.[6.1]
A series of tests based on the sample proportions are
presented. These answer the following types of ques-
tions.
Is the true proportion in the population equal to a
given value?[6.2]
Are proportions in two populations equal?[6.3]
Are proportions in a sample jointly equal to a given
set of proportions?[6.4]
For the inquiring minds, a technical section briefly
describes the theory behind the tests implemented in
R.[6.5]

6.1 Definitions


6.1.1 Categorical Variables

Recall that categorical variables are variables that can take a limited number of
values.1 In most cases, they are even dichotomous: yes/no, true/false, up/down,
correct/incorrect, red/not red, etc…
To fix ideas, here are a few examples:

• politically support policy: yes or no,


• customer satisfaction: 1 through 5,
• highest academic degree achieved: primary school, etc…
• gender: male or not male,
• etc…

Arguably, though this will remain an unsubstantiated claim here, categorical


variables are the most important to analyze. Indeed, I would suggest that they best fit our mental representations: friend or foe, not how much of a friend.
Also, notice that you can always create categories out of a continuous variable:
from income to rich/middle class/poor.

6.1.2 Bernoulli Trial

To simplify matters further, or because it is the overwhelmingly most common


case, we will focus on a dichotomous variable modeled as a Bernoulli trial.

Definition 6.1 (Bernoulli trial). A Bernoulli trial (or variable) is a random exper-
iment that can have only two possible, mutually exclusive outcomes: “success”
or “failure”. “Success” is noted with value 1 while “failure” is noted with the
value 0.

Notice that the term “success” is potentially misleading. It does not imply a victory or an achievement. It simply means “the condition is satisfied”. For instance, in a Bernoulli trial where the outcomes are “arrived late (1) / didn’t arrive late (0)”, the “success” is the case of arriving late, which simply means “the condition of arriving late was met in this observation”.
1
These values go by different names in different sources or contexts: categories, levels (in R), etc…

We denote by 𝑝 the probability of success. The probability of failure is (1 − 𝑝).


Notice that Bernoulli trials are, by definition, independent.

Definition 6.2 (Bernoulli distribution). Let 𝑋 be a random variable with value 1


if the outcome is success and 0 otherwise. Then, the Bernoulli distribution is

𝑥 𝑃 (𝑋 = 𝑥)
0 1−𝑝
1 𝑝

6.1.3 Sample Proportion

This chapter has a somewhat simple, yet very common and useful perspective.
What can we say about the number of observations in each (of the two) cate-
gories?
A subtle yet crucial step is to regard each observation as a Bernoulli trial with probability 𝑝 of taking the value 1 (success) and probability 1 − 𝑝 of taking the value 0 (failure). The actual 𝑝 is the true parameter of the population, and it is generally unknown. We can write,

𝑋𝑖 ∼ 𝑏(𝑝), ∀𝑖

What we do observe is the proportion in a given sample, noted 𝑝̂. This sample
proportion is calculated as follows.

$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

This formula should be no mystery.

• The 𝑋𝑖 ’s are a bunch of 0’s and 1’s.


• If we sum them, we simply obtain the number of 1’s.
• If we divide the number of 1’s by 𝑛, we have the proportion of 1’s in the sample, i.e., the proportion of that category.
• In appearance and in all formality, the sample proportion is a mean of the sample.

This last remark is of utmost importance. Indeed, it allows us to use the Central Limit Theorem, provided that further conditions are met.

6.1.4 Example

Let’s work with an example. The dataset rosling_responses from the package
openintro has observations on adults with a 4-year college degree responding to
the following question:

How many of the world’s 1-year-old children today have been vaccinated against some disease?

a. 20%
b. 50%
c. 80%

— Rosling et al. (2018), p.4.

The sample data could contain a series with the choice given by each respondent,
e.g.: a, b, a, b, b, a, a, c, a, a… However, the focus of interest will be on a more
relevant question, namely who’s got the answer right or wrong, and what are
the proportions of each group.

library(openintro)
library(dplyr) # filter(), pull(), mutate(), case_when()
data("rosling_responses")
rosling_responses %>%
  filter(question ==
           "children_with_1_or_more_vaccination") %>%
  pull(response)
## [1] "correct" "correct" "incorrect" "incorrect" "incorrect" "incorrect"
## [7] "incorrect" "incorrect" "correct" "correct" "incorrect" "incorrect"
## [13] "incorrect" "incorrect" "incorrect" "incorrect" "incorrect" "incorrect"
## [19] "correct" "incorrect" "incorrect" "incorrect" "incorrect" "incorrect"
## [25] "incorrect" "incorrect" "incorrect" "correct" "correct" "incorrect"
## [31] "incorrect" "incorrect" "incorrect" "incorrect" "incorrect" "correct"

## [37] "incorrect" "correct" "incorrect" "incorrect" "incorrect" "incorrect"


## [43] "incorrect" "correct" "correct" "incorrect" "correct" "incorrect"
## [49] "incorrect" "incorrect"

Again, notice that each observation is either a “correct” or “incorrect”. The former
could be branded “success” and the latter “failure”. The probability of success
is the unknown 𝑝.
We still need to transform these values into numeric values in order to obtain an
actual number for the proportion. Here is a way using the function case_when()
from the dplyr package.

rosling_responses <- rosling_responses %>%


mutate(r.w = case_when(
response == "correct" ~ TRUE,
response == "incorrect" ~ FALSE
))

Notice that I coded “correct” as TRUE. I could as well have coded it as 1. Notice
that, in most coding languages, these are equivalent because TRUE is coerced to 1
and FALSE to 0.
Finally, we can now calculate the sample proportion. In the data set, only the
first 50 entries are about this question.

p.hat1 <- rosling_responses %>%


filter(question ==
"children_with_1_or_more_vaccination") %>%
summarise(p.hat1 = mean(r.w)) %>%
pull(p.hat1)

We now call the object that we want to see by writing its name.

p.hat1
## [1] 0.24

6.2 Inference for a Single Proportion

The first hypothesis test related to a sample proportion amounts to testing
whether the latter is equal to a certain value. It is time, therefore, to use our
blueprint of Section 4. We will use the Rosling data above. Here are some
elements.

6.2.1 Assumptions: Independence

We assume that the observations are independent. They wouldn’t be if, for
instance, the individuals had first discussed the question together. The random
sampling that was certainly applied by Rosling and colleagues makes us safe on
this ground.

6.2.2 Testable Hypothesis: Dart-Throwing Chimpanzees

The choice of the null hypothesis 𝐻0 is always context specific. In our case, one
possibility would be to test whether at least half of the individuals guess correctly.
As it turns out, beyond being arbitrary, this is also too optimistic a hypothesis. Instead,
we will wonder how these adults with a 4-year degree perform against the mark
of random guessing, i.e., 33.3%. But, starting here, we will adopt a more colorful
comparison:

You’ve probably heard that one before. It’s famous–in some circles infamous. It has popped up in the
New York Times, the Wall Street Journal, the Financial Times, the Economist and other outlets
around the world. It goes like this: A researcher gathered a big group of experts–academics pundits, and
the like–to make thousands of predictions about the economy, stocks, elections, wars, and other issues of
the day. Time passed, and when the researcher checked the accuracy of the predictions, he found that the
average expert did about as well as random guessing. Except that’s not the punch line because “random
guessing” isn’t funny. The punch line is about a dart-throwing chimpanzee. Because chimpanzees are
funny.

— Tetlock and Gardner (2016), p.4.



Hence, the hypothesis tested will be whether the respondents are performing
better or worse than dart-throwing chimps. Formally, we would write: $H_0: p_0 = \frac{1}{3}$.

6.2.3 Estimator: Sample Proportion

As alluded to above, we will use the sample proportion, 𝑝̂, a sort of mean calcu-
lated on the sample data.
Alternatively, we can use the standardized value of 𝑝̂ as a sample statistic under
the null, i.e.,

$$Z = \frac{\hat{p} - p_0}{\sigma_{p_0}}.$$

6.2.4 Sampling Distribution

This is the core part of the test. What are the possible values and their associated
probabilities that could be observed in such a sample if the assumptions and
𝐻0 hold?
At this point we must point towards a theoretical result based on the Central
Limit Theorem, see Proposition 5.2.
The distribution of the sample proportion is approximately normal for large sam-
ple sizes (𝑛𝑝(1 − 𝑝) > 5).

$$\hat{p} \;\dot\sim\; N\!\left(p_0,\ \frac{p_0(1-p_0)}{n}\right).$$

Thus, the standardized statistic is described as,

$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \;\dot\sim\; N(0, 1).$$

The values in our case are simply 𝑝0 = 1/3, 𝑝̂ = 0.24, and 𝑛 = 50.



6.2.5 Level of Significance: 0.05

The usual level of significance may be applied here without raising concerns.

6.2.6 Deciding on an Hypothesis: Bilateral Test

Carefully notice that I wrote above that we wanted to know whether the respondents
would do better or worse than the chimps. This, in turn, calls for a
bilateral test, with $H_a: p_0 \neq \frac{1}{3}$.
Another alternative could also be used, namely that the humans do better than
the chimps. We would then carry out a unilateral test with a rejection region of
probability 𝛼 all the way to the right of the sampling distribution.

6.2.7 Critical Regions

Under the null, we have,

$$\hat{p} \;\dot\sim\; N\!\left(\frac{1}{3},\ \frac{\frac{1}{3}\left(1 - \frac{1}{3}\right)}{50}\right).$$

Using $\sigma_{p_0} = \sqrt{\frac{\frac{1}{3}(1-\frac{1}{3})}{50}} = 0.0667$, we can compute the critical values for 𝛼 = 0.05, i.e., the points that are 1.96 standard deviations from the mean,

$$\frac{1}{3} \pm 1.96 \times 0.0667 = [0.2026;\ 0.4641].$$
Since the observed sample proportion does not fall in the rejection region, recall
𝑝̂ = 0.24, we do not reject 𝐻0 . Figure 6.1 represents the basis for the decision.
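These critical values can be checked with a couple of lines of R; this is only a sketch reusing the quantities above.

s.p0 <- sqrt((1/3 * 2/3) / 50)         # standard deviation under the null
1/3 + c(-1, 1) * qnorm(0.975) * s.p0   # approximately [0.2026; 0.4641]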

6.2.8 P-Value

Instead of the reject regions, we can calculate the p-value for the observed sample
proportion. With the values above for the mean and the standard deviation of the
sampling distribution, we obtain,

FIGURE 6.1: Rejection regions for the sample proportion of our example.

left.of.po <- pnorm(q = .24,


mean = 1/3,
sd = sqrt((1/3*2/3)/50)) %>%
round(3)
# or, standardized,
# pnorm(q = (.24-1/3)/sqrt((1/3*2/3)/50),
# mean = 0,
# sd = 1)

left.of.po
## [1] 0.081

This means that, if the assumptions and 𝐻0 hold, then there is a probability of
0.081 of observing a proportion in the sample that is equal or smaller than 0.24.
That is too high to reject 𝐻0 . Figure 6.2 illustrates this calculation.
Notice a subtle point: 0.081 is not the p-value. Since we are conducting a bilateral
test, our decision would be to reject 𝐻0 if the area on the left of the
observed sample proportion was smaller than 𝛼/2, or if twice the area on the
left of the observed sample proportion was smaller than 𝛼. The p-value is the
probability to be compared with 𝛼. Hence, in this case,

FIGURE 6.2: Probability on the left of the observed sample proportion.

p-value = 2 × area ”beyond” the sample statistic = 2 × 0.081 = 0.162.

6.2.9 Implemented Test in R

R has a dedicated function for this type of hypothesis testing, prop.test(). However,
it is based on a different estimator and, therefore, on a different sampling
distribution.
An element to bear in mind is that it uses the absolute number of successes, x, and
the number of trials, n.

p.0 <- 1/3


n1 <- rosling_responses %>%
filter(question ==
"children_with_1_or_more_vaccination") %>%
nrow()

test.p <- prop.test(x = p.hat1 * n1,


n= n1,
p= p.0,

correct= FALSE)

## clearer but not dynamic alternative


# test.p <- prop.test(x = 24,
# n= 50,
# p= 1/3,
# correct= FALSE)

The output of the test, displayed by calling test.p, should be readily interpretable by noting the following.

• Always recall the 𝐻0 of a test. Here, it is “the true proportion is equal to 1/3”.
The output of the test is written in terms of the alternative hypothesis, 𝐻𝑎.
• The p-value is defined as always. Since p-value > 𝛼, here 0.16 > 0.05, we do
not reject 𝐻0, i.e., we do not reject that “the true proportion is equal to 1/3”.

6.3 Comparing Two Proportions

We now address a question related to sample proportions. What can we say2


about the difference between two proportions, 𝑝1̂ and 𝑝2̂ , in two different
groups? The center of our focus will be to test whether the two true proportions
are equal, i.e.,

𝐻0 ∶ 𝑝1 − 𝑝2 = 0.

This section presents a formal test to answer that question (and an R implementation).
The discussion will also be shorter, because I assume that many of the
elements above are mastered and need not be repeated.

6.3.1 Assumptions: Extended Independence

As before, we assume that the observations are independent within each group.
We now require that they are also independent between the two groups. They
would not be independent between groups if, for instance, the same individuals
were included in both groups.
² Of course, the word “say” implies a statement in the language of statistics.
As before, we will also assume and verify that the samples are sufficiently large.

6.3.2 Sampling Distribution

The resulting sampling distribution in this context is again derived from the Cen-
tral Limit Theorem.
If,

• the random variables 𝑋𝑖 ∼ 𝑏(𝑝1 ), 𝑖 = 1, 2, ..., 𝑛1 , are i.i.d., and


• the random variables 𝑌𝑗 ∼ 𝑏(𝑝2 ), 𝑗 = 1, 2, ..., 𝑛2 , are i.i.d., and
• 𝑋 and 𝑌 are independent,
• the samples are large, 𝑛1 𝑝1 (1 − 𝑝1 ) > 5 and 𝑛2 𝑝2 (1 − 𝑝2 ) > 5,

then, under the null 𝐻0 ∶ 𝑝1 − 𝑝2 = 0, the two probabilities are equal and they
can be pooled as follows:

$$\hat{p} = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2}.$$
Therefore,

$$\hat{p}_1 - \hat{p}_2 \;\dot\sim\; N\!\left(0,\ \hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right)$$
or, in the standardized version,
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \;\dot\sim\; N(0, 1)$$

6.3.3 Illustration: Percentage Republicans

Suppose 125 voters are surveyed from state A and 120 in state B. Assume the
survey uses simple random sampling. In state A, 52% of the respondents said
they vote Republican, and 48% Democrat. In state B, 45% declared voting
Republican, and 55% Democrat.
Given the information in the samples, is there evidence that the percentage of
Republicans is different in the two states?
Let,

• 𝑝1 and 𝑝1̂ be the proportion of Republican voters in the first state and in the
sample from the first state, respectively.
• 𝑝2 and 𝑝2̂ the proportion of Republican voters in the second state and in the
sample from the second state, respectively.

The sample sizes are 𝑛1 = 125 and 𝑛2 = 120.


We can use the normal approximation to the binomial distribution since

• 𝑛1 𝑝1 (1 − 𝑝1 ) = 125(0.52)(0.48) = 31.2 > 5, and,

• 𝑛2 𝑝2 (1 − 𝑝2 ) = 120(0.45)(0.55) = 29.7 > 5.

We want to test the null hypothesis,

𝐻0 ∶ 𝑝1 − 𝑝2 = 0,
versus the alternative,

𝐻𝑎 ∶ 𝑝1 − 𝑝2 ≠ 0.
Under the null, the two proportions are equal. Hence, we can use the pooled
proportion,

$$\hat{p} = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2} = 0.4857,$$

to calculate the standard error of their difference,

$$\sigma_d = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = 0.06387.$$

Hence, we can calculate the probability of observing a difference as large or
larger if the null is true.

$$P(\hat{p}_1 - \hat{p}_2 > 0.07) = P\!\left(Z > \frac{0.07 - 0}{0.06387}\right) = 1 - P(Z < 1.096) = 0.1365$$

As for the p-value, recall that in a bilateral test it is calculated as twice the area
“beyond” the sample statistic. Hence, in this case,

p-value = 2 × 0.1365 = 0.273.

6.3.4 Implementation in R

Again, R has a dedicated function for this type of hypothesis testing, prop.test().
However, it is based on a different estimator and, therefore, on a different sampling
distribution.
An element to bear in mind is that it uses the absolute number of successes, x, and
the number of trials, n.
For the example above, it would be called in the following way.

p1 <- 0.52
p2 <- 0.45
n1 <- 125
n2 <- 120
test.c <- prop.test(x = c(p1*n1, p2*n2),
n = c(n1, n2),
correct = FALSE)

test.c
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(p1 * n1, p2 * n2) out of c(n1, n2)

## X-squared = 1.201, df = 1, p-value = 0.2731


## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.05487447 0.19487447
## sample estimates:
## prop 1 prop 2
## 0.52 0.45

The interpretation of the result of the test, once again, is greatly facilitated by
recalling 𝐻0 and understanding its p-value.
About 𝐻0, the output is very explicit. Using the sample proportions, we tested
whether the two true proportions are the same.
The p-value is 0.27. Since the p-value of the test of 𝐻0 is larger than 0.05, we do
not reject the null hypothesis of the equality of proportions.
Finally, notice that the p-value is identical to the value calculated above using
the Central Limit Theorem. This is to be expected in this case (df = 1), provided
the test does not apply a continuity correction (we set correct = FALSE).
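As a side note, in this two-sample case the X-squared statistic reported by prop.test() is simply the square of the standardized statistic computed by hand. A short sketch, reusing the numbers above, makes the link explicit.

z <- 0.07 / 0.06387   # standardized difference computed above
z^2                   # approximately 1.20, the X-squared of prop.test()
2 * (1 - pnorm(z))    # approximately 0.27, the reported p-value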

6.3.5 Illustration: One Question Fluke?

We use the above data to compare the results of the answers to the Rosling ques-
tions.

# recall
n1 <- rosling_responses %>%
filter(question ==
"children_with_1_or_more_vaccination") %>%
nrow()

p.hat1 <- rosling_responses %>%


filter(question ==
"children_with_1_or_more_vaccination") %>%
summarise(p.hat1 = mean(r.w)) %>%
pull(p.hat1)

n2 <- rosling_responses %>%


filter(question ==
"children_in_2100") %>%
nrow()

p.hat2 <- rosling_responses %>%


filter(question ==
"children_in_2100") %>%
summarise(p.hat2 = mean(r.w)) %>%
pull(p.hat2)

test.fluke <- prop.test(


x = c(p.hat1 * n1, p.hat2 * n2),
n = c(n1, n2),
correct=FALSE)

test.fluke
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(p.hat1 * n1, p.hat2 * n2) out of c(n1, n2)
## X-squared = 2.4525, df = 1, p-value = 0.1173
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.03621123 0.21796562
## sample estimates:
## prop 1 prop 2
## 0.2400000 0.1491228

Since 0.12 is larger than 0.05, we do not reject the null hypothesis of equality of
the proportions. The respondents seem to maintain their level of knowledge, or lack
thereof, across the two questions.

6.4 Goodness of Fit for Many Proportions

The comparison described in Section 6.2 can be extended to multiple propor-


tions. The perspective is then the following. Instead of testing if one proportion
is equal to a value given by 𝐻0 , we test whether 𝑘 proportions are, jointly, equal
to 𝑘 values given by 𝐻0 :

𝐻0 ∶ 𝑝1 = 𝑝01 , 𝑝2 = 𝑝02 , … , 𝑝𝑘 = 𝑝0𝑘

with

$$\sum_{i=1}^{k} p_{0i} = 1.$$

6.4.1 Illustration: Representative Poll

Suppose that you want to verify whether the regional origins in a given sample
for a poll are correctly matching the regional distribution in the population. The
proportions of each region are known from the official records. Table 6.2 provides
the relevant data.
TABLE 6.2: Representation by region in the poll and in the population.

Region North Center South Total


Poll 302 406 221 929
Population share 0.3 0.45 0.25 1

6.4.2 Implementation in R

The formal test in this case is a 𝜒² test as the one implemented in R and described
in Section 6.5. Here is how the test would be carried out in R.

x.poll <- c(302, 406, 221)


p.0 <- c(.30, .45, .25)

test.poll <- chisq.test(x = x.poll,


p = p.0)

test.poll
##
## Chi-squared test for given probabilities
##
## data: x.poll
## X-squared = 2.8402, df = 2, p-value = 0.2417

The output of the test is not very detailed. Its 𝐻0 is given above. In this case,
since 0.242 is larger than 0.05, we do not reject the hypothesis that the sample for
the poll has the same distribution across regions as the population.
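The statistic can also be recovered by hand, which may help demystify the output. Here is a sketch using the objects x.poll and p.0 created above.

expected <- sum(x.poll) * p.0                 # expected counts under the null
x2 <- sum((x.poll - expected)^2 / expected)   # Pearson's statistic
x2                                            # approximately 2.84
1 - pchisq(x2, df = length(x.poll) - 1)       # approximately 0.24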

6.5 Extra Material

6.5.1 Explaining prop.test() Implemented in R

The test implemented in R is the so-called Pearson’s chi-squared (pronounced


“kai squared”) test. Roughly speaking, it uses a statistic based on the sum of the
squares of the differences between the observed probabilities and the expected
probabilities under the null. A statistic close to 0 indicates that the differences
are small, giving support to 𝐻0 .
The sampling distribution is a chi-squared with 𝑘 − 1 degrees of freedom (here, with two categories, 1 degree of freedom).
$$\chi^2_{k-1} = n \sum_{i=1}^{k} \frac{\left(\frac{x_i}{n} - p_{0,i}\right)^2}{p_{0,i}}$$

where 𝑖 = 1, … , 𝑘 refer to the outcomes of the random process (cases), the 𝑝0’s
are the expected probabilities under the null, and the 𝑥’s are the number of
observations in each case.
In our case,

• there are two cases, “correct” and “incorrect” (1 or 0),
• the expected probabilities under the null are 𝑝0,1 = 1/3 and 𝑝0,2 = 2/3,
• 𝑛 = 50,
• the numbers of observations in each case are 𝑥1 = 12 and 𝑥2 = 38.

The sample statistic is therefore,

$$\chi^2_1 = 50\left(\frac{\left(\frac{12}{50} - \frac{1}{3}\right)^2}{\frac{1}{3}} + \frac{\left(\frac{38}{50} - \frac{2}{3}\right)^2}{\frac{2}{3}}\right) = 1.96$$

The probability of observing such an extreme, i.e., large value if the null is true
is,

p.value <- 1 - pchisq(1.96, df = 1) %>% round(3)


p.value
## [1] 0.162

Again, a graphical perspective such as Figure 6.3 may help.

6.6 Exercises

Exercise 6.1. In a sample of 𝑛 randomly chosen individuals from a population,
54% declare that they support the government’s crisis management.
Suppose 𝑛 = 300 and use 𝛼 = 0.05. Specify the null and the alternative hypotheses
in answering the following question. Does the evidence allow you to
believe that the majority of the population shares that support for the government?
What if 𝑛 = 500? Calculate and explain.

FIGURE 6.3: Probabilities in a chi-squared distribution with 1 degree of freedom.

Exercise 6.2. Answer the questions in Exercise 6.1 with R commands.


Hint: you might want to use the argument alternative in your function. Check
the latter by typing ?prop.test in the console.
Exercise 6.3. According to a recent poll by Observador/TVI/Pitagórica3 (the au-
thor’s translation from this source):
“A bit more than half of the Portuguese answering in this survey think that the
elections should have been postponed. More precisely, they were 55% (346 individuals),
against 41% (257 individuals) who answered, instead, that they should
happen as planned, on January 24; 4%, i.e., 25 individuals, did not know or did
not want to answer.”
Does the evidence allow you to believe that the majority of the population thinks
that the elections should have been postponed? Use 𝛼 = 0.05.
Answer both with calculations and with R commands.
Exercise 6.4. According to a recent poll by Observador/TVI/Pitagórica4 (the au-
thor’s translation of this source):
3
https://observador.pt/especiais/sondagem-observador-tvi-pitagorica-maioria-e-a-favor-do-fecho-
das-escolas-e-defendia-adiamento-das-eleicoes/
4
https://observador.pt/especiais/sondagem-observador-tvi-pitagorica-maioria-e-a-favor-do-fecho-
das-escolas-e-defendia-adiamento-das-eleicoes/

“The support for closing the schools is higher among women, 61% are in favor
against 53% among men.”
“The field study took place in January 7 to 10 and January 14 to 18. For the sam-
ple, 629 interviews were carried.”
Assume that both genders were equally represented (with one extra woman).
Using the usual significance level, is the percentage of support for the closing of
the schools statistically different across genders?
Answer both with calculations and with R commands.

Exercise 6.5. Use the ncbirths data5 from the openintro package. Consider all
the babies whose birth was classified as premature.
In that group, is the percentage of female babies equal to the percentage of male
babies? Calculate and verify with R code.
5
https://www.openintro.org/data/index.php?data=ncbirths

6.7 Commented R Code

library(openintro)
data("rosling_responses")
rosling_responses %>%
  filter(question ==
    "children_with_1_or_more_vaccination") %>%
  pull(response)

data() brings the data set from the package to the environment.

%>% is an extremely useful function. It says: use the part on the left as
first argument of the function on the right. See `Intro R > §magrittr'.

filter() is from package dplyr(). It


selects rows of a data frame by using
conditions. See `Intro R > §dplyr'.

The selected rows are those for


which the variable question equals
"children_with_1_or_more_vaccination".

NB: == not = to express that the


value in the variable must be equal
to the given value.

pull() takes the variable response


from the data frame and spits it out
as a vector, not a data frame.

NB: here, we do not assign an object


to a name, since we do not use <-.
This means that we just want to dis-
play something, not use the object
later.

rosling_responses <- rosling_responses %>% Now we do assign an object to a name,


mutate(r.w = case_when( since we use <-. Actually, we are
response == "correct" ~ TRUE, even re-using a existing name, so,
response == "incorrect" ~ FALSE we are replacing the object called
)) by rosling_responses.

mutate() creates a new variable in


the data frame. The new variable is
called r.w. See `Intro R > §dplyr'.

The new variable is created with the


helper function case_when(). This
latter just sets values in r.w de-
pending… on the cases! E.g., when
response is (==) "correct", then the
value of r.w is set to (~) to TRUE.

p.hat1 <- rosling_responses %>% p.hat1 is the name assigned to the


filter(question == object created by the commands on
"children_with_1_or_more_vaccination") %>% the right of <-.
summarise(p.hat1 = mean(r.w)) %>%
Recall filter() selects rows with a
pull(p.hat1)
condition.

summarise() is a function from the


package dplyr. It creates a vari-
able which is the result of a func-
tion applied to another variable,
not to each of its elements indi-
vidually. The variable created is
p.hat1, which is the mean() of the
variable r.w.

NB: r.w is a vector of 1's and 0's,


so its mean is the percentage, i.e.,
the proportion of 1's in r.w.

pull() gets the variable p.hat1 as


a vector, not as a data frame.

left.of.po <- pnorm(q = .24, left.of.po is the name assigned to


mean = 1/3, the object created by the commands
sd = sqrt((1/3*2/3)/50)) %>% on the right of <-.
round(3)
pnorm() calculates the probability
# or, standardized,
in a normal distribution to the left
# pnorm(q = (.24-1/3)/sqrt((1/3*2/3)/50),
the value `q'. `mean' and `sd' are
# mean = 0,
the mean and standard deviations of
# sd = 1)
that normal distribution.

round() is a function that rounds


a number with the given number of
decimals, here 3.

p.0 <- 1/3 p.0 is the name of the probablity


n1 <- rosling_responses %>% for the null hypothesis.
filter(question ==
n1 is the name of the object created
"children_with_1_or_more_vaccination") %>%
by the commands on the right of <-,
nrow()
which, at the end, is the result of
the function nrow().
test.p <- prop.test(x = p.hat1 * n1,
n= n1, nrow() returns the number of rows of
p= p.0, a data frame.
correct= FALSE) test.p is the name of the ob-
ject created by the commands on the
## clearer but not dynamic alternative right of <- , i.e., by the function
# test.p <- prop.test(x = 24, prop.test().
# n= 50,
prop.test() is a test in R involving
# p= 1/3,
proportions. It uses, x the absolute
# correct= FALSE)
number of successes in the sample,
n, the number of trials, p, the null
proportion for the test.

NB: x is calculated thanks to the


variables previously created and
named, as it should be in a dynamic
document.

p1 <- 0.52 NB: the function is the same as


p2 <- 0.45 above, prop.test(), but requires no
n1 <- 125 null probability, p.
n2 <- 120
x now requires two numbers of suc-
test.c <- prop.test(x = c(p1*n1, p2*n2),
cesses, since we have two groups. n
n = c(n1, n2),
also requires two numbers of trials,
correct = FALSE)
one for each group.

correct specifies whether we want to


apply a continuity correction.

n2 <- rosling_responses %>% We repeat the procedure with the sec-


filter(question == ond question in the data frame to
"children_in_2100") %>% obtain p.hat2 and n2.
nrow()
NB: r.w was created for the whole
data set before. We need not create
p.hat2 <- rosling_responses %>%
it again.
filter(question ==
"children_in_2100") %>% test.fluke is the name of the object

summarise(p.hat2 = mean(r.w)) %>% created by the function prop.test().

pull(p.hat2)

test.fluke <- prop.test(


x = c(p.hat1 * n1, p.hat2 * n2),
n = c(n1, n2),
correct=FALSE)

x.poll <- c(302, 406, 221) x.poll and p.0 are vectors for the
p.0 <- c(.30, .45, .25) numbers in each category and the ex-
pected probabilities, respectively.
test.poll <- chisq.test(x = x.poll, They were both created with the func-
p = p.0) tion c().

c() simply puts together, i.e., com-


bines values in a vector.

chisq.test() is a function for the


test in R. It requires a x and a p,
as described above.

test.poll is the name of the


object created by the function
chisq.test().
7 Inference for Numerical Data

TL;DR Inference about the sample mean, 𝑋̄, could
use the normal distribution with the same mean as
the population, 𝜇, and a fraction of the population’s
variance, 𝜎²/𝑛. But only if 𝜎² were known. Since it usually
isn’t, we must use a 𝑡-distribution.[7.1]
We can answer whether the mean of one sample is
compatible with a given hypothesis.[7.2]
We compare two samples to test if their mean is the
same.[7.4]
If these two samples have their observations linked
to one another, then they are paired. We then make
tests on the average of the differences of the observa-
tions across the two samples.[7.3]

7.1 Sampling Distribution of 𝑋̄

Under the conditions stated in Section 5.2, the Central Limit Theorem allowed
us to approximate the sampling distribution of a sample’s mean, 𝑋̄ , of size 𝑛,
$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$$

or, in the standardized version,

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
where 𝜇 and 𝜎2 are the population’s parameters of its mean and its variance. In
this chapter we are interested in making hypotheses about the true value of the
population, 𝜇.
At the outset, one difficulty should be apparent. There are two unknown parameters
in the formula above. In order to make inference on one of them, the
other must be known.

7.1.1 The 𝑡-Distribution

Our strategy to solve the issue is to first estimate the variance from the sample and
use it in place of the true variance,

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2.$$

Notice that 𝑠² uses 𝑋̄. This will adversely affect the precision of the inference on
𝜇.
Formally, inference about 𝜇 is made through a null hypothesis such as 𝐻0 ∶ 𝜇 =
𝜇0 . Under the null, and if 𝜎2 was known, the sampling distribution of the stan-
dardized mean would be normally distributed,

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$
Now, since 𝜎2 is not known, replacing it by 𝑠2 results in a related but slightly
different sampling distribution, namely the 𝑡-distribution,

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}.$$
where 𝑛 is the number of observations and (𝑛 − 1) is called the degrees of free-
dom of the distribution.

FIGURE 7.1: Normal distribution and 𝑡-distribution for various degrees of freedom.

As it turns out, the 𝑡-distribution approaches the normal distribution as the degrees
of freedom, i.e., the sample size, become large, say 𝑛 > 30. Figure 7.1
depicts the normal distribution along with various 𝑡 distributions.
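A quick sketch in R illustrates the convergence: the 97.5% quantile of the 𝑡-distribution approaches the familiar 1.96 of the standard normal as the degrees of freedom grow.

qt(p = 0.975, df = c(5, 10, 30, 100))   # 2.57, 2.23, 2.04, 1.98
qnorm(p = 0.975)                        # 1.96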

7.2 One-Sample 𝑡-Test

The one-sample 𝑡-test is built on the statistic that uses the sample variance in
place of an unknown variance in the population.
Formally, we define a null hypothesis,

$$H_0: \mu = \mu_0,$$

as well as an alternative such as 𝐻𝑎 ∶ 𝜇 ≠ 𝜇0, or 𝐻𝑎 ∶ 𝜇 > 𝜇0, or, of course,
𝐻𝑎 ∶ 𝜇 < 𝜇0.
Under the null, we have,

$$T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}.$$

7.2.1 A Hand-Calculated Illustration

[From OpenIntro] In a sample of 100 runners in the Cherry Blossom Race in 2017,
the average run time was 95.61 minutes, while the standard deviation was 15.78
minutes. In 2006, the average run time was 93.29. We want to test whether the
runners were faster or slower in 2017 than they were in 2006.
Our null hypothesis is, naturally, 𝐻0 ∶ 𝜇 = 93.29, i.e., no difference between the
years. The alternative, since runners in 2017 could be faster or slower, is 𝐻𝑎 ∶
𝜇 ≠ 93.29.
The test is built on the statistic,

$$T = \frac{95.61 - 93.29}{15.78/\sqrt{100}} = 1.47.$$
If the null hypothesis is true, what is the probability of observing such an extreme
value in a 𝑡-distribution with 𝑑𝑓 = 𝑛 − 1 = 99? Checking the tables, or our
calculator, we have,

score <- (95.61-93.29)/(15.78/sqrt(100))


p.extreme <- 1- pt(q = score, df = 99)

p.extreme
## [1] 0.07233725

We know that the p-value in a bilateral test must take into account the probabil-
ities of the extremes over the two sides, hence,

p-value = 2 × 0.0723372 = 0.1446745.

The interpretation of the test is that, under the assumptions, we cannot reject 𝐻0.
The runners in 2017 were not statistically faster or slower than in 2006.

7.2.2 Implementation in R

The implementation in R uses the theoretical result above with the function
t.test(). We illustrate it with the same example as above.

df <- read_csv(paste0("https://",
"www.openintro.org/data/csv/run10samp.csv"))
test.run <- t.test(x = df$time,
mu = 93.29)

test.run
##
## One Sample t-test
##
## data: df$time
## t = 1.4734, df = 99, p-value = 0.1438
## alternative hypothesis: true mean is not equal to 93.29
## 95 percent confidence interval:
## 92.48412 98.74508
## sample estimates:
## mean of x
## 95.6146

Of course, the conclusion is similar to above.

7.3 Test for Paired Data

Observations are paired in two data sets if each observation of the first data set
has a particular connection with one observation in the other data set.
Such cases arise, for instance, when the same person is surveyed twice, when a
measurement is taken at the same place, when both data sets have information
on the same objects, etc.
In that case, any meaningful comparison of the means across the groups, say 𝐴
and 𝐵, must take into account that relationship. We do so by defining a new
variable of interest,

$$X_{D,i} = X_{A,i} - X_{B,i} \quad \forall i.$$

In other words, the variable analyzed is 𝑋𝐷 , the difference across data sets for
each observation. Under the same general conditions as above, we can perform
a test on an hypothesis about 𝑋𝐷 such as, the very common,

𝐻0 ∶ 𝜇𝐷 = 0.

Again, the most sensible alternative is 𝐻𝑎 ∶ 𝜇𝐷 ≠ 0.


Also, since, in reality, we only have one variable, the test is yet another instance
of the 𝑡-test.
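In R, this equivalence can be exploited in two ways: either build the difference variable yourself and run a one-sample test on it, or pass both variables to t.test() with the argument paired = TRUE. A sketch with purely hypothetical paired vectors a and b:

set.seed(42)
a <- rnorm(20, mean = 10)        # hypothetical "before" measurements
b <- a + rnorm(20, mean = 0.5)   # hypothetical "after" measurements, linked to a
t.test(a - b, mu = 0)                 # one-sample t-test on the differences
t.test(x = a, y = b, paired = TRUE)   # the same test, via the paired argument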

7.3.1 An Illustration with R and in Calculation

We use another data set from OpenIntro about prices of books in two different
locations, the UCLA Bookstore and… Amazon.

df <- read_csv(paste0("https://",
"www.openintro.org/data/csv/textbooks.csv"))
x.bar <- mean(df$diff)
s.sample <- sd(df$diff)
n.sample <- length(df$diff)

We have,

$$T = \frac{12.7616438 - 0}{14.2553008/\sqrt{73}} = 7.6487711.$$
Now, the probability of observing such an extreme value, if 𝐻0 is true, is

T <- x.bar/ (s.sample/sqrt(n.sample))


p.extreme <- 1 - pt(q = T,
df = n.sample - 1)
p.value <- 2 * p.extreme

p.value
## [1] 6.92757e-11

Close to impossible…
Verifying with the appropriate command in R.

test.book <- t.test(x = df$diff,


mu = 0)

test.book
##
## One Sample t-test
##
## data: df$diff
## t = 7.6488, df = 72, p-value = 6.928e-11
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 9.435636 16.087652
## sample estimates:
## mean of x
## 12.76164

7.4 Testing the Difference of Two Means

The 𝑡-test can be used to test hypotheses about the difference of the means be-
tween two groups, say 𝐴 and 𝐵.
If the observations are independent within and across groups, and the samples
are not too “weird” (outliers, skewed, …), then we can work under a null such as

$$H_0: \mu_A - \mu_B = \mu_0$$

where, typically, 𝜇0 = 0, meaning that we test whether there is a difference
between the two means.

Under the null,

$$T = \frac{(\bar{X}_A - \bar{X}_B) - \mu_0}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}} \sim t_k.$$

where 𝐴 and 𝐵 refer to the two groups and where 𝑠2 and 𝑛 are the variance and
the number of observations of each group, respectively. As for 𝑘, this is a rather
complicated number to calculate. Let us use 𝑘 = min(𝑛𝐴 , 𝑛𝐵 ) − 1.
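For reference, the 𝑘 actually used by R’s t.test() (the Welch test shown below) is the Welch–Satterthwaite approximation,

$$k = \frac{\left(\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}\right)^2}{\frac{(s_A^2/n_A)^2}{n_A - 1} + \frac{(s_B^2/n_B)^2}{n_B - 1}},$$

which we will not compute by hand.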

7.4.1 Illustration and Implementation in R

We use a famous data set that is also included in the package openintro. It collects
information on baby births in North Carolina.
We first collect a sample of 50 mothers who have the habit of smoking as well
as 100 who do not have that habit.

library(openintro)
data(ncbirths)
dfA <- ncbirths %>%
filter(habit == "smoker") %>%
sample_n(size = 50)
dfB <- ncbirths %>%
filter(habit == "nonsmoker") %>%
sample_n(size = 100)
df <- bind_rows(dfA, dfB)

We are interested in knowing whether there is a difference between the average
weight, 𝑤, of the babies across the groups. Hence,

𝐻0 ∶ 𝑤𝐴 − 𝑤𝐵 = 0.
The test in R will then be,

test.w <- t.test(weight ~ habit,


data = df)

test.w
##
## Welch Two Sample t-test
##
## data: weight by habit
## t = 0.3713, df = 106.45, p-value = 0.7112
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3818629 0.5578629
## sample estimates:
## mean in group nonsmoker mean in group smoker
## 7.1194 7.0314

7.5 Exercises

Exercise 7.1. The file data-grades.Rdata (download here1 ) contains actual data on
the grades of the students at the midterm and at the endterm. You simply need
to load it to R with the function load().
Evaluate whether or not the mean at the midterm is the same as the mean at the
endterm. Use the usual significance level.
Answer by coding in R.

Exercise 7.2. Use the ncbirths data2 from the openintro package. Take a sample of
240 babies. Test whether the average weight of female babies is the same as the
average weight of male babies.
1
https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335656
2
https://www.openintro.org/data/index.php?data=ncbirths

7.6 Commented R Code

score <- (95.61-93.29)/(15.78/sqrt(100)) score is the name of a vector with


p.extreme <- 1- pt(q = score, df = 99) one value. We assign a name to the
object, the result of the calcula-
tion, so that we can use it later.

pt() gives the probability in a 𝑡-


distribution to the left of the ar-
gument q. 1- pt(q = ...) means that
we want the probability to the right
of q.

The only other required argument is


the degrees of freedom, df.

df <- read_csv(paste0("https://", read_csv() is from package readr. It


"www.openintro.org/data/csv/run10samp.csv")) reads into R the csv file. Here, the
test.run <- t.test(x = df$time, file is online. I only give its url.
mu = 93.29) See `Intro R > §readr & readxl'.

paste0() glues together the two


strings. One big string would have
the most correct form. But it would
not print in this page. So I sep-
arated it and glued it again. Bad
coding!

NB: the name of the imported data


frame is chosen to be, simply, df.

The 𝑡-test is implemented with the


function t.test(). It is build under
the theory explained in the notes.
It uses the vector of all the ob-
servations for one variable in the
argument r. mu is the mean, under
𝐻0 , that we use in the test.

The sign $ is used to extract a


variable from a data frame (among
other things). See `Intro > Subset-
ting Data Structures'.

Here, we want the variable time from


the data frame df, hence df$time.

df <- read_csv(paste0("https://", For read_csv and paste0, see immedi-


"www.openintro.org/data/csv/textbooks.csv")) ately above. Also, for $, we now un-
x.bar <- mean(df$diff) derstand that the data frame, called
s.sample <- sd(df$diff) again df, has a variable diff that
n.sample <- length(df$diff) we want to use.

We create three objects and assign


them to names. x.bar is the mean of
diff, hence mean(). s.sample is the
standard deviation in the sample,
hence sd().

n.sample is the size of the sample.


length(), applied on a vector, re-
turns the number of elements in the
vector.

T <- x.bar/ (s.sample/sqrt(n.sample)) The first and third commands illus-


p.extreme <- 1 - pt(q = T, trate the use of R as a simple cal-
df = n.sample - 1) culator. But notice that we assign
p.value <- 2 * p.extreme the objects, i.e., the results of
the calculations to a name. Again,
this is to allow for calling these
results later in the code.

About pt() see above.

test.book <- t.test(x = df$diff, Same as above, a 𝑡-test for the


mu = 0) hypothesis `true mean of differ-
ences is 0'. Done with the function
t.test() on the variable df$diff
with the argument for the null value
mu set to 0.

library(openintro) library() loads the package to the


data(ncbirths) environment. This throws an error
dfA <- ncbirths %>% if the package was not previously
filter(habit == "smoker") %>% installed in the computer. data()
sample_n(size = 50) brings a data frame to the environ-
dfB <- ncbirths %>% ment. R will look for the data in all
filter(habit == "nonsmoker") %>% the packages loaded in the session.
sample_n(size = 100)
dfA is the name of a data frame cre-
df <- bind_rows(dfA, dfB)
ated in the series of piped commands.
First, with filter() which selects
rows based on one or more conditions,
such as `variable habit' takes the
value "smoker"'.

sample_n() is a useful function to


randomly sample n, a desired number,
of observations of the data frame.

Both dfA and dfB are subsets of the


original data frame. One is a sam-
ple of "smokers" and the other of
"nonsmokers". In particular, they
have the same variables.

At the end, I simply stack these data


frame on top of each other thanks the
the function bind_rows() which does
what its name says, to bind the rows
together.

test.w <- t.test(weight ~ habit, The formula in this t.test() func-


data = df) tion is a bit different. It's for
comparing the means of the variable
on the left of ~ across groups formed
by the values in the variable in
the right of ~. Both variables be-
ing in the same data frame, given
by data. Here, we compare the means
in weight for the different groups
defined by habit, i.e., "smoker" and
"nonsmoker".
Part III

Confidence Intervals
8 Estimators and Confidence Intervals

TL;DR Parameters of a population must be esti-


mated thanks to an estimator.[8.1]
The choice of the best estimator is made on the basis
of the properties (e.g., unbiasedness) of the sampling
distribution of the estimator.[8.2]
A point estimate, obtained thanks to a point estima-
tor, is our best guess for the true parameter in the
population. A confidence interval is our best guess
for where the true parameter falls. The confidence
level is the percentage of samples that would include
the true parameter if we could observe a very large
number of samples. The point estimate is typically at
the center of the confidence interval, with a margin
of error below and above it.[8.3]
The confidence interval for the very common case of
the estimation of a mean, resulting in a normal distri-
bution, is derived along with a formula for the mar-
gin of error.[8.4]
The CI for a proportion is also derived.[8.5]
Extensions for one-sided CIs and for dealing with
variables following a 𝑡-distribution are mentioned.[8.6]


8.1 Estimators and Estimates

In statistics jargon, we do not use the noun or verb “guess” but, instead, “esti-
mate”. Loosely speaking, an estimator is a way of guessing the true value of a pa-
rameter (such as its mean) in the population by applying an algorithm to the
available sample.
Definition 8.1 (Estimator and point estimator). An estimator of a population
parameter is a random variable obtained from a sample.
When the estimate is a single value, the estimator is called a point estimator
(that gives a point estimate). The point estimator is defined by a rule or formula
that tells us how to use the sample data to calculate a single number, the point
estimate, that can be used as an estimate of the population parameter.

8.2 “Best” Statistic

There can be many different estimators for a given parameter (different statis-
tics can be used to estimate a certain parameter). The choice of the appropriate
estimator will depend on the parameter we are trying to estimate.
Note that an estimator is itself a random variable as it is a function of other ran-
dom variables. As such, it has an expected value and a probability density func-
tion.
The distribution of an estimator is called the sampling distribution and it typifies
the properties of the estimator. In order to compare between estimators, we need
to compare their sampling distributions.
Recall that the sampling distribution is simply the probability distribution func-
tion of sample statistics. The sampling distribution of a given statistic is the dis-
tribution we would get if we were to take all possible samples of a given size 𝑛
and for each of those samples calculate the same statistic.
As pdfs, sampling distributions have an expected value, variance, and often fol-
low known probability distributions (e.g. Normal, 𝑡, Chi-square, or 𝐹 distribu-
tions).

FIGURE 8.1: Estimators with different expected value (left) and different variance (right).

It is usually easier to focus on a few features of the distribution of 𝜃 ̂ in evaluating


it as an estimator of the unknown population parameter 𝜃.

8.2.1 Properties

It is not straightforward to choose between two estimators. Their sampling distri-


bution is indicative but might not be enough.
In practice, we will not know the numerical value of the unknown parameter,
so we have to rely on our knowledge of the theoretical sampling distributions to
choose the best estimator.
No single mechanism exists for the determination of a uniquely “best” point
estimator. Instead we consider a set of criteria under which particular estimators
can be evaluated.

8.2.2 Unbiasedness

Because the sample is randomly selected, we understand that the for-


mula/function will produce different results in different samples. This is why
we say that an estimator is itself a random variable.

An often very important feature searched for in an estimator is that, in expectation,


it delivers the correct parameter of the population.

Definition 8.2 (Unbiased estimator). Let 𝜃 ̂ be a point estimator for the true pa-
rameter 𝜃, then 𝜃 ̂ is an unbiased estimator if

𝐸[𝜃]̂ = 𝜃
The bias of the estimator is

bias(𝜃)̂ = 𝐸[𝜃]̂ − 𝜃

Hence, an unbiased estimator has a bias of 0.

As two examples of estimators of the mean of a population, consider the following:

$$\hat\theta_1 = \frac{1}{n}\sum_{i=1}^{n} X_i$$

$$\hat\theta_2 = \frac{1}{2}\left(\min(X_i) + \max(X_i)\right)$$

We can show that both 𝜃1̂ and 𝜃2̂ are unbiased estimators of the population’s
mean. Intuitively, however, it appears clearly that one of them is more reliable
than the other.
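A small simulation sketch can make that intuition concrete. Assuming, for the sake of the sketch, a normally distributed population, both estimators are centered on the true mean, but 𝜃1̂ varies much less.

set.seed(1)
est <- replicate(10000, {
  x <- rnorm(30, mean = 5, sd = 2)   # hypothetical sample from the population
  c(theta1 = mean(x),
    theta2 = (min(x) + max(x)) / 2)
})
rowMeans(est)        # both close to 5: (roughly) unbiased
apply(est, 1, var)   # theta1 has the smaller variance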

8.3 Confidence Interval and Margin of Error

In sampling from a population, we obtain estimates thanks to an estimator. However,
because these estimates can greatly vary, the question arises of whether a
given estimate is close to the true parameter or not.
We will answer that question in the following way. Given the information in the
sample, we have a, say, 95% level of confidence that the true parameter is within
the interval between 𝑎 and 𝑏.

FIGURE 8.2: Interpreting a confidence interval.

Definition 8.3 (Confidence interval). Let 𝜃 be the parameter searched for. Sup-
pose we can define two random variables, 𝐴 and 𝐵, based on the sample such
that 𝑃 (𝐴 < 𝜃 < 𝐵) = 1 − 𝛼, where 𝛼 is a small number between 0 and 1.
Then, the interval between 𝑎 and 𝑏 (values of the variables 𝐴 and 𝐵) is the (1−𝛼)
confidence interval of 𝜃. Also, (1 − 𝛼) is the confidence level.

This definition is not as trivial as it seems. A correct interpretation is the follow-


ing. If the population was to be sampled a very large number of times with the
estimator for 𝐴 and 𝐵, then in (1 − 𝛼) of the cases the true parameter would be
between 𝑎 and 𝑏. Figure 8.2 illustrates this notion.
Notice that sometimes the point estimate would be lower than the true param-
eter and sometimes it would be higher. When computing a confidence interval,
these cases are considered in a similar way. This is why the confidence interval
is symmetric around the point estimate 𝜃.̂
Another way of expressing that idea is to write the confidence interval as a
point estimate with a symmetric margin of error, 𝑀𝐸:

𝜃 ̂ ± 𝑀 𝐸.

8.4 CI for the Mean

The discussion above left open the question of how to estimate 𝑎 and 𝑏 such that,
if the population was to be sampled a very large number of times, then in (1 − 𝛼)
of the cases the true parameter would be included in the interval.
In this section, we introduce an estimator for such an interval for the most
common case, i.e., one based on the normal distribution.
Let 𝑋̄ be the mean of a sample of 𝑛 observations from a normally distributed population
with unknown mean 𝜇 but known variance 𝜎². Then we can say something
about the 𝑍-score of the sample, i.e., the standardized version of the sample
mean, 𝑋̄,

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}.$$
In particular for our discussion, we know the probability of 𝑍 being between
two values in the standard normal distribution, see Figure 8.3. For a probability
of 1 − 𝛼 around the mean, we know

1 − 𝛼 = 𝑃 ( − 𝑧𝛼/2 < 𝑍 < 𝑧𝛼/2 )

By developing this expression, we can find the confidence interval for the true
parameter.

FIGURE 8.3: Confidence interval in a standard normal.

$$
\begin{aligned}
1 - \alpha &= P\left(-z_{\alpha/2} < Z < z_{\alpha/2}\right) \\
&= P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) \\
&= P\left(-z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \\
&= P\left(-\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < -\mu < -\bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \\
&= P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)
\end{aligned}
$$

For the 95% level, for instance, a confidence interval for the true mean when the
population is normally distributed is given by

$$0.95 = P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$

Proposition 8.1 (Confidence interval for the mean). For a normally distributed population
with unknown mean 𝜇 and a known variance 𝜎², a confidence interval of (1 − 𝛼)
can be found in the following way.
Use the sample mean, 𝑋̄, as an unbiased estimator to build the confidence interval as

$$\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

where the margin of error is

$$ME = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

The limits of the interval are called the upper and lower confidence limits.

TABLE 8.1: Common values for 𝛼 and respective 𝑧𝛼/2 .

𝛼 𝑧𝛼/2
1% 2.58
5% 1.96
10% 1.64

Figure 8.4 illustrates the concept of confidence interval for the mean. Table 8.1
gives the values for 𝑧𝛼/2 for common significance levels, 𝛼.
Notice that ways of reducing the margin of error follow straightforwardly:

• reduction of 𝜎 (if possible),


• increase of the sample size 𝑛,
• reduction of the confidence level 1 − 𝛼.
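A minimal R sketch, with purely hypothetical numbers, computes such an interval.

x.bar <- 10.2; sigma <- 3; n <- 64          # hypothetical sample mean, known sigma, n
me <- qnorm(0.975) * sigma / sqrt(n)        # margin of error at the 95% level
c(lower = x.bar - me, upper = x.bar + me)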

FIGURE 8.4: Confidence interval for the mean.

8.5 CI for the Population Proportion

The confidence interval for a population proportion has an expression very sim-
ilar to the expression for the mean of a sample.
As it happens, however, the normality assumption is no longer necessary. In-
stead, we will require that the sample size is large enough. A rule of thumb for
that criterion is 𝑛𝑝(1 − 𝑝) > 5. Then, we can show that,

𝐸[𝑝]̂ = 𝑝

Also,

$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \approx \sqrt{\frac{p(1-p)}{n}}$$
Hence, we can say that the 𝑍 -score

$$Z = \frac{\hat{p} - p}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}}$$

is very close to a normal distribution. Therefore, as above, we can find a 1 − 𝛼


confidence interval in the following way.

$$
\begin{aligned}
1 - \alpha &= P\left(-z_{\alpha/2} < Z < z_{\alpha/2}\right) \\
&= P\left(-z_{\alpha/2} < \frac{\hat{p} - p}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}} < z_{\alpha/2}\right) \\
&= P\left(-z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < \hat{p} - p < z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \\
&= P\left(-\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < -p < -\hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \\
&= P\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)
\end{aligned}
$$
Proposition 8.2 (Confidence interval for the population proportion). Let 𝑝̂ be the
observed proportion of “successes” in a random sample of 𝑛 observations from a popula-
tion with a proportion of successes 𝑝.
Then, for large enough samples, a confidence interval of (1 − 𝛼) can be found in the
following way.
Use the sample proportion, 𝑝̂, as an unbiased estimator of 𝑝 and as a good estimator of 𝑝
in the true variance to build the confidence interval as

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

where the margin of error is

$$ME = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

The limits of the interval are called the upper and lower confidence limits.
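For instance, with the numbers of the Rosling example of Chapter 6 (𝑝̂ = 0.24 and 𝑛 = 50), a sketch of the 95% confidence interval would be:

p.hat <- 0.24; n <- 50
me <- qnorm(0.975) * sqrt(p.hat * (1 - p.hat) / n)
c(lower = p.hat - me, upper = p.hat + me)   # approximately [0.12; 0.36]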

8.6 Extensions

8.6.1 One-sided Confidence Interval

In some situations, the knowledge that we would like to have about a true pa-
rameter in a population, such as its mean, is the probability that it exceeds some
value or that it falls below some value. These cases call for a one-sided confidence
interval.
The following illustrates that notion for the estimate of the mean of a normally
distributed population. In a similar manner as above, we can show, for a 1 − 𝛼
level of confidence,

$$1 - \alpha = P\left(\mu < \bar{X} + z_{\alpha}\frac{\sigma}{\sqrt{n}}\right)$$

and

$$1 - \alpha = P\left(\bar{X} - z_{\alpha}\frac{\sigma}{\sqrt{n}} < \mu\right)$$

These expressions give the one-sided bounds for the true parameter, given the
level of confidence.

8.6.2 Other extensions

A confidence interval can also be found in the case of unknown variance. In that
case, the value of the variance is replaced by an estimate of it.
The resulting distribution of the standardized value, however, does not follow a
normal distribution but a 𝑡-distribution.
Notice that the 𝑡-distribution approaches the normal distribution when 𝑛 is large
enough (higher than 30).
Part IV

Intermezzo: Sample Size


9 Curse, Blessing & Back

TL;DR A specific relationship is derived between the
margin of error and the sample size when carrying out
tests of hypotheses, for instance on a single proportion.[9.1]
Large sample sizes used to be a daunting issue. But
recent data sets are so big that they can lift the curse.
The medicine, however, comes with another ill as any
difference in the data will appear significant.[9.2]
A large data set on Covid-19 cases in Portugal is used
to illustrate the case.[9.3]

9.1 Sample Size and the Margin of Error

Consider again the standard deviation of the sampling distribution of the sample
proportion under the null,

$$\sigma_{p_0} = \sqrt{\frac{p_0(1-p_0)}{n}}.$$
For any given 𝑛, 𝜎𝑝0 is the highest when 𝑝0 = 0.5. For the remainder of this
section we will then consider that value. This is in order to work against our own
argument that 𝜎𝑝0 can be surprisingly small in today’s samples.
Suppose you want to test whether a sample proportion is around 𝑝0 . For any
given level of significance, 𝛼, you can guarantee whether the sample proportion
is arbitrarily close, i.e., within a given margin, 𝑚, to the true proportion, 𝑝0 , by
choosing a sample size at least as large as 𝑛∗ . Actually, we can show that the
relationship between these variables is given by,
$$n^* = \frac{\left(\Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right)^2 \cdot p_0(1 - p_0)}{m^2}$$
For instance, we can achieve the usual statistical significance level (i.e., with
𝛼 = .05) to check whether the sample proportion is within 𝑚 = 0.05 of the
true proportion 𝑝0 = 0.60 by analyzing 369 observations. Indeed, note that
Φ−1 (0.975) = 1.96 so that we can write

$$\frac{1.96^2 \cdot 0.6(1 - 0.6)}{0.05^2} \approx 369.$$
Here is another example, with the maximal effect on the margin of error by the
true proportion, i.e., when 𝑝0 = 0.5. We can achieve a confidence level of 99%
(i.e., with 𝛼 = .01) to check whether the sample proportion is within 𝑚 = 0.02
of the true proportion 𝑝0 = 0.50 by analyzing 4147 observations. Indeed, note
that Φ−1(0.995) = 2.576 so that we can write

$$\frac{2.576^2 \cdot 0.5(1 - 0.5)}{0.02^2} \approx 4147.$$
Figure 9.1 provides the values of 𝑛∗ for various desired margins of error, 𝑚, and
levels of significance, 𝛼, assuming 𝑝0 = 0.5 so that 𝑛∗ is higher than for any
other 𝑝0 .
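The formula translates directly into a small R helper; this is a sketch, and the function name n.star is mine.

n.star <- function(m, alpha, p0 = 0.5) {
  ceiling(qnorm(1 - alpha / 2)^2 * p0 * (1 - p0) / m^2)
}
n.star(m = 0.05, alpha = 0.05, p0 = 0.6)   # 369, as in the first example
n.star(m = 0.02, alpha = 0.01, p0 = 0.5)   # 4147, as in the second example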
FIGURE 9.1: Minimal 𝑛 for various values of 𝛼 and margins of error, 𝑚, keeping 𝑝0 = 0.5.

9.2 The Curse

For a long time the sample size was considered a curse. The sample sizes required
for reasonable accuracy and significance were desperately too high. Recall, for
instance, that Jakob Bernoulli is believed to end his opus Bernoulli (1713) after
showing that his required sample size was 25550, i.e., more than the actual pop-
ulation.1
The curse was lifted recently with the systematic collection of data and the con-
struction of very large datasets.
Notice, however, that big data has come with its own curse. As shown in Figure
9.1, one needs thousands of observations for establishing very accurately and
significantly a difference with respect to a given proportion. But the current sit-
uation has the problem reversed. The datasets are often so large, that any small
difference between proportions is deemed statistically significant!
¹ The number was inaccurate but the impression remains of too high a demand.

9.3 An Illustration

I use data on confirmed Covid-19 cases in Portugal built by the Portuguese DGS.
An advantage of this dataset over the publicly available data by Our World in
Data2 is that it provides details about the observations, such as gender, age, loca-
tion, etc. This, in turn, allows testing various hypotheses based on sample pro-
portions.

owid <- read_csv(


paste0("https://covid.ourworldindata.org/",
"data/owid-covid-data.csv")) %>%
filter(location == "Portugal")

dgs <- read_excel("data/covid.xlsx") %>%


mutate(date = ymd(data_notificacao),
datet = ymd(data_colheita_amostra)) %>%
rename(gender = sexo_utente,
age = idade_utente_a_data_validacao,
zip = codigo_concelho_morada_utente,
area = descricao_concelho_morada_utente,
type = tipo_teste,
rule = regra_de_confirmacao)

Figure 9.2 plots the evolution of the daily number of covid-19 cases in Portugal,
on a rolling average over 7 days. The total number of observations in the DGS
data is 419’910.
We now run a series of tests on sample proportions. Notice that these are purely
illustrative.

9.3.1 Male and Female Equally Represented?

Since the DGS data is divided by gender, one could ask whether the two genders
are equally represented. Translated into a test of hypothesis, we would have,
2
https://ourworldindata.org/

FIGURE 9.2: Confirmed covid-19 cases in Portugal, daily 7-day rolling moving average.

𝐻0 ∶ 𝑝𝑀 = 0.5, where 𝑀 stands for male. Here is how we could proceed: i.
Calculate the number of observations for each group. ii. Compute the test.

mf.all <- dgs %>%


drop_na(gender) %>%
group_by(gender) %>%
summarise(n = n())

mf.all
## # A tibble: 2 x 2
## gender n
## <chr> <int>
## 1 F 231049
## 2 M 188711

mf.all <- pull(mf.all)


mf.test.05 <- prop.test(x = mf.all[2],
n = sum(mf.all),
p = 0.5)

mf.test.05
##
## 1-sample proportions test with continuity correction
##
## data: mf.all[2] out of sum(mf.all), null probability 0.5
## X-squared = 4270.1, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4480632 0.4510753
## sample estimates:
## p
## 0.4495688

Given the sample size and the observed difference between genders, it seems
impossible to not reject the hypothesis of equal representation.

9.3.2 Male and Female Equally Represented in a Given Month?

As it turns out, the proportions of each group were the closest in June. We could
test whether they were actually equal that month.

mf.june <- dgs %>%


filter(month(date) == 6) %>%
drop_na(gender) %>%
group_by(gender) %>%
summarise(n = n())

mf.june
## # A tibble: 2 x 2
## gender n
## <chr> <int>
## 1 F 5241
## 2 M 4969

mf.june <- pull(mf.june)


mf.test.june <- prop.test(x= mf.june[2],
n=sum(mf.june),
p=0.5)

mf.test.june
##
## 1-sample proportions test with continuity correction
##
## data: mf.june[2] out of sum(mf.june), null probability 0.5
## X-squared = 7.193, df = 1, p-value = 0.007319
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4769426 0.4964270
## sample estimates:
## p
## 0.4866797

Despite the smaller sample size and the close proportions, we must again reject the equal representation of genders in the sample, this time for the June data only.

9.3.3 Old and Young Equally Represented?

“Playing” with the data can lead to surprising results. If we separate the June observations into two age groups, 40 and above versus below 40, then we cannot reject the equal proportion of the age groups in the sample.

oy.june <- dgs %>%
filter(month(date) == 6) %>%
drop_na(age) %>%
mutate(old = case_when(
age >= 40 ~ TRUE,
TRUE ~ FALSE)) %>%
group_by(old) %>%
summarise(n = n())

oy.june
## # A tibble: 2 x 2
## old n
## <lgl> <int>
## 1 FALSE 3725
## 2 TRUE 3758

oy.june <- pull(oy.june)

oy.test <- prop.test(x = oy.june[2],
n = sum(oy.june),
p = 0.5)

oy.test
##
## 1-sample proportions test with continuity correction
##
## data: oy.june[2] out of sum(oy.june), null probability 0.5
## X-squared = 0.13684, df = 1, p-value = 0.7114
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4908114 0.5135963
## sample estimates:
## p
## 0.502205

9.4 Exercises

Exercise 9.1. The following were the technical details for a poll by Obser-
vador/TVI/Pitagórica3 (the author’s translation from this source):
3
https://observador.pt/especiais/sondagem-observador-tvi-pitagorica-maioria-e-a-favor-do-fecho-das-escolas-e-defendia-adiamento-das-eleicoes/

“The field study took place from January 7 to 10 and January 14 to 18. For the sample, 629 interviews were carried out, resulting in a confidence level of 95.5%, with an implicit maximal margin of error of ± 4%.”
Verify that the margin of error was correctly calculated.
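As a hint, the order of magnitude can be checked with a couple of lines of R, assuming the worst-case proportion of 0.5 and a normal approximation (a 95.5% confidence level corresponds to roughly two standard errors):

n <- 629
z <- qnorm(1 - (1 - 0.955) / 2)   # roughly 2 for a 95.5% confidence level
z * sqrt(0.5 * 0.5 / n)           # roughly 0.04, i.e. about 4%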

9.5 Commented R Code

owid <- read_csv(
paste0("https://covid.ourworldindata.org/",
"data/owid-covid-data.csv")) %>%
filter(location == "Portugal")

dgs <- read_excel("data/covid.xlsx") %>%
mutate(date = ymd(data_notificacao),
datet = ymd(data_colheita_amostra)) %>%
rename(gender = sexo_utente,
age = idade_utente_a_data_validacao,
zip = codigo_concelho_morada_utente,
area = descricao_concelho_morada_utente,
type = tipo_teste,
rule = regra_de_confirmacao)

• owid and dgs are the names of the data frames imported.
• read_csv() is from the package readr. It reads the csv file into R. Here, the file is online; I only give its url. NB: not all functions can read directly online. See `Intro R > §readr & readxl'.
• paste0() glues together the two strings. One big string would have the most correct form, but it would not print in this page, so I separated it and glued it again. Bad coding!
• read_excel() is from the package readxl. It reads here a .xlsx file that I have in the data folder, this latter being in the same folder as the Rmd file; otherwise the path would need to be adapted accordingly.
• ymd() is a function from the package lubridate. The letters of the name of the function stand for year-month-day. It is used for a good handling of dates. Advanced stuff.
• mutate() is used to create two variables.
• rename() is self-explanatory. It is always better to have short names for the variables, though sufficiently explicit.

mf.all <- dgs %>%
drop_na(gender) %>%
group_by(gender) %>%
summarise(n = n())

• This is a good example of piping with %>%.
• drop_na() removes the rows of the data frame dgs that have a NA in the variable gender. I do this because I want to compare over the gender.
• group_by() is an extremely important function in dplyr. It does not affect the data directly, but all the next computations in the pipeline will be carried out separately for each group defined by group_by(). The groups defined here are the values of the variable gender.
• summarise() is a function from the package dplyr. It creates a variable which is the result of a function applied to another variable, not to each of its elements individually. Since we grouped the data before, this function applies to each group.
• summarise() creates a variable, called here n, with one value for each group. That value is calculated with n(), which is a function that counts the number of observations in the group.

mf.all <- pull(mf.all)

mf.test.05 <- prop.test(x = mf.all[2],
n = sum(mf.all),
p = 0.5)

• mf.all was created in the previous code as a data frame. It is easier to work with a simple vector. This is why I use pull() in order to obtain a vector, assigned to the same name, mf.all.
• prop.test() is a test in R involving proportions. It uses x, the absolute number of successes in the sample, n, the number of trials, and p, the null proportion for the test.
• The number of successes in this case is the number of males. Since this number is the second value of the vector mf.all, I code it as mf.all[2], since the square brackets [] allow precisely to subset a vector by giving its position.
• The number of trials is the sum of the number of males and the number of females. These are the two numbers in mf.all, so I give the value to n by calculating the sum of mf.all with the function sum().
• p is the probability given in the null hypothesis.

mf.june <- dgs %>%
filter(month(date) == 6) %>%
drop_na(gender) %>%
group_by(gender) %>%
summarise(n = n())

• The code is similar to above, with one novelty.
• The selection of the month in the filter() function is done in a very easy way, apparently: I simply use the function month() on the variable date. The simplicity is due to the package lubridate, to which month() belongs.

mf.june <- pull(mf.june)

mf.test.june <- prop.test(x = mf.june[2],
n = sum(mf.june),
p = 0.5)

• Very similar to the code above.

oy.june <- dgs %>%
filter(month(date) == 6) %>%
drop_na(age) %>%
mutate(old = case_when(
age >= 40 ~ TRUE,
TRUE ~ FALSE)) %>%
group_by(old) %>%
summarise(n = n())

• The functions were all used above already.
• NB: the greater-or-equal sign >=.

oy.june <- pull(oy.june)

oy.test <- prop.test(x = oy.june[2],
n = sum(oy.june),
p = 0.5)

• See above.
10
Field of Fools

TL;DR .

10.1 De Moivre’s Equation

For our purpose, we start by recalling a useful theorem.

Theorem 10.1 (De Moivre's Equation: Variance of the sample mean, $\bar{X}$, of random variables). Let $X_1, X_2, \dots, X_n$ be independent and identically distributed variables with mean $\mu$ and variance $\sigma^2$. Then,

$$Var(\bar{X}) \overset{\text{def}}{=} \sigma^2_{\bar{X}} = \frac{\sigma^2}{n}$$

or, for the standard error,

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$


FIGURE 10.1: The counties with the highest 10 percent age-standardized death
rates for cancer of the kidney/ureter for U.S. males, 1980-89. (Source: Gelman
and Nolan (2017))

Proof.

$$Var(\bar{X}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2}\sum_{i=1}^{n} \sigma^2 = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}$$
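A quick simulation illustrates the theorem: the spread of sample means shrinks like $\sigma/\sqrt{n}$. The numbers below are purely illustrative.

set.seed(1)
sigma <- 10
for (n in c(10, 100, 1000)) {
  # standard deviation of 5000 simulated sample means of size n
  means <- replicate(5000, mean(rnorm(n, mean = 0, sd = sigma)))
  cat("n =", n, "| sd of sample means:", round(sd(means), 3),
      "| sigma/sqrt(n):", round(sigma / sqrt(n), 3), "\n")
}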

Below are a few cases of its mis-application.

10.1.1 Cancer Prone Areas

Consider Figures 10.1 and 10.2 and try to evaluate the characteristics of the areas that are the most and the least prone to the type of cancer described.
Test your explanations for the locations shown in Figure 10.3.

FIGURE 10.2: The counties with the lowest 10 percent age-standardized death rates for cancer of the kidney/ureter for U.S. males, 1980-89. (Source: Gelman and Nolan (2017))

FIGURE 10.3: The counties with both the highest and lowest 10 percent age-
standardized death rates for cancer of the kidney/ureter for U.S. males, 1980-89.
(Source: Wainer (2007))

10.1.2 The Small-Schools Movement

Do small schools improve learning? Figure 10.5 provides a basis for discussing
this point.

FIGURE 10.4: Population versus age-standardized death rates for cancer of the
kidney/ureter for U.S. males, 1980-89. (Source: Wainer (2007))

FIGURE 10.5: Enrollment vs. math score, 5th grade (left) and 11th grade (right).
(Source: Wainer (2007))

FIGURE 10.6: Ten safest and most dangerous American cities for driving, and
ten largest American cities. (Source: Wainer (2007))

10.1.3 Safe Cities

What are the safest and the most dangerous American cities for driving? Con-
sider Figure 10.6 for an answer.

FIGURE 10.7: Data from the National Assessment of Educational Progress.


(Source: Wainer (2007))

10.1.4 Sex Differences in Performance

Are there differences in performance between males and females? Figure 10.7
provides evidence on that question.
Note the following ratio, which follows from De Moivre's equation: when a group of size $c$ is split into two halves (here, males and females), the standard error of each half's mean is inflated by

$$\frac{\sigma_{\bar{X},\,c/2}}{\sigma_{\bar{X},\,c}} = \frac{\sigma/\sqrt{c/2}}{\sigma/\sqrt{c}} = \sqrt{2} \approx 1.4$$

10.2 Law of Small Numbers

Tversky and Kahneman (1971) suggest the following about our belief in the Law
of Small Numbers.

[Form one of the belief:] We submit that people view a sample randomly drawn from a population as
highly representative, that is, similar to the population in all essential characteristics. Consequently, they
expect any two samples drawn from a particular population to be more similar to one another and to the
population than sampling theory predicts, at least for small samples.

When subjects are instructed to generate a random sequence of hypothetical tosses of a fair coin, for
example, they produce sequences where the proportion of heads in any short segment stays far closer to
.50 than the laws of chance would predict.

[Form two of the belief:] Subjects act as if every segment of the random sequence must reflect the true
proportion: if the sequence has strayed from the population proportion, a corrective bias in the other
direction is expected. This has been called the gambler’s fallacy.

Both [forms of the belief] generate expectations about characteristics of samples, and the variability of
these expectations is less than the true variability, at least for small samples.

People’s intuitions about random sampling appear to satisfy the law of small numbers, which asserts that
the law of large numbers applies to small numbers as well.
Part V

Visualizations
11
Data Visualization

Data visualization is an absolutely required skill for any data scientist. It is key
to:

• explore data,
• explain data,
• communicate quantitative information,
• convince specific audiences,
• …

If further evidence were needed to emphasize the current outstanding position of data visualization in business, recall the 2019 acquisition of Tableau, a tech company whose software turns raw data into visualization. It went to Salesforce in a $15.7bn deal.
A treatment of the topic is beyond the scope of this class. For great insights into this “truthful art”, see Cairo (2016). Also, the most revered authority on the topic is arguably Edward Tufte1, “the Leonardo da Vinci of data” (NY Times). For a thorough discussion of the issues, see his classic books Tufte et al. (1990), Tufte (1997), Tufte (2006), the reprint of the masterpiece Tufte (2001) or the both tragic and hilarious essay Tufte (2003).
Instead, we shall restrict ourselves to the techniques we need to illustrate the
theoretical aspects that we touched upon in the class.
Time allowing, this chapter may be complemented with a discussion of guidelines for good data visualization, in particular along the lines of Cairo (2016) and his “Five Qualities of Great Visualizations”, as well as Tufte's examples and recommendations.

1
https://www.edwardtufte.com/tufte/

12
Bars

The main focus of this chapter is to provide illustrations of the tools that we could use to visualize the values calculated in Chapter 6 and Chapter 7, as well as the margins of error in Chapter 8.

12.1 Bars for Proportions

We want to illustrate the proportions in the rosling_responses data set. Notice that we calculate them manually, first for the overall data set.

data("rosling_responses")

rosling_responses %>%
group_by(response) %>%
summarise(count = n()) %>%
mutate(percent = count/ sum(count)) %>%
ggplot(aes(x=response, y= percent)) +
geom_col(alpha=0.5)

FIGURE 12.1: Proportions over all responses.

We probably should distinguish by question.

data("rosling_responses")

rosling_responses %>%
group_by(question, response) %>%
summarise(count = n()) %>%
group_by(question) %>%
mutate(percent = count/ sum(count)) %>%
ggplot(aes(x=question, y= percent, fill=response)) +
geom_col(alpha=0.5, position = "dodge") +
scale_fill_manual(values = c("correct" = "#006400", "incorrect" = "#8B0000"))

FIGURE 12.2: Proportions by question.
FIGURE 12.3: Proportion by question, in facets.

This seems good enough. But here is another extension.

rosling_responses %>%
group_by(question, response) %>%
summarise(count = n()) %>%
group_by(question) %>%
mutate(percent = count/ sum(count)) %>%
ggplot(aes(x=response, y= percent, fill=response)) +
geom_col(alpha=0.5, position = "dodge") +
facet_wrap(.~question) +
scale_fill_manual(values = c("correct" = "#006400", "incorrect" = "#8B0000")) +
theme(legend.position = "none")

12.2 Adding Error Bars to Proportions

We now plot a confidence interval for each proportion following the discussion
in Chapter 8.
FIGURE 12.4: Proportions over all responses with error bars.

rosling_responses %>%
group_by(response) %>%
summarise(count = n()) %>%
mutate(percent = count/ sum(count),
se = sqrt(percent*(1-percent)/count)) %>%
ggplot(aes(x=response, y= percent)) +
geom_col(alpha=0.5) +
geom_errorbar(aes(ymin = percent-1.96*se, ymax = percent+1.96*se ), width = 0.2)

12.3 Bars for Numerical Data

We plot the average weight of newborns over various dimensions.

library(openintro)
data(ncbirths)
set.seed(142)
dfA <- ncbirths %>%
filter(habit == "smoker") %>%
sample_n(size = 50)
dfB <- ncbirths %>%
filter(habit == "nonsmoker") %>%
sample_n(size = 100)
df <- bind_rows(dfA, dfB)

df %>%
group_by(habit) %>%
summarise(m.weight = mean(weight)) %>%
ggplot(aes(x= habit, y=m.weight)) +
geom_col(alpha=0.5)

FIGURE 12.5: Average weight per habit.

Again, it is easy to plot other aesthetics.

df %>%
group_by(habit, gender, whitemom) %>%
summarise(m.weight = mean(weight)) %>%
ggplot(aes(x= habit, y=m.weight, fill = gender)) +
geom_col(alpha=0.5, position= "dodge") +
facet_wrap(.~whitemom) +
theme(legend.position = "bottom") +
xlab("Mother's habit") +
ylab("Average weight (pounds)")

FIGURE 12.6: Average weight per habit and other dimensions.

12.4 Adding Error Bars to Means

We now add a normal confidence interval for the means.

df %>%
group_by(habit) %>%
summarise(m.weight = mean(weight),
se= sd(weight)/sqrt(n()) ) %>%
ggplot(aes(x= habit, y=m.weight)) +
geom_col(alpha=0.5) +
geom_errorbar(aes(ymin=m.weight-1.96*se, ymax= m.weight+1.96*se ), width = 0.5) +
xlab("Mother's habit") +
ylab("Average weight (pounds)")
FIGURE 12.7: Average weight per habit with confidence interval.

12.5 Exercise

Reproduce the following plot, using the given line of code to read in the data.

df <- read_delim("https://pzezula.pages.gwdg.de/data/ChickFlick.dat", delim = "\t")


FIGURE 12.8: Mean arousal per film over gender with confidence interval.
Part VI

Bridge
13
Correlation

TL;DR .

13.1 Bivariate Relationships

One of the most natural exercises in the presence of two variables is to gauge their association. It is a perilous exercise, though, as one is often carried away into the realm of causal explanations.
Our approach here remains within the field of purely descriptive analysis. We introduce various ways of numerically assessing the relationship between two variables.

13.1.1 Visualizing the Relationship

A first important step consists in visualizing the data in a scatter plot. Then, the first eyeball test evaluates whether a line, representing a linear relationship, can be drawn through the cloud of points. Figure 13.1 provides examples. These use the data frame df from the golf.Rdata file extracted from ESPN1 (download here2).
1
https://www.espn.com/golf/statistics
2
https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335657

157
158 13 Correlation

load("data/golf.Rdata")
df
## # A tibble: 70 x 17
## rank surname name age events rounds cutsmade top10 wins cuppoints
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Bryson DeChambeau 27 7 26 7 4 2 1375
## 2 2 Dustin Johnson 36 6 24 6 4 1 1105
## 3 3 Viktor Hovland 23 10 40 10 4 1 1204
## 4 4 Xander Schauffele 27 9 36 9 5 0 1110
## 5 5 Patrick Cantlay 28 9 36 10 4 1 1234
## 6 7 Brooks Koepka 30 9 30 9 4 1 960
## 7 8 Tony Finau 31 10 40 10 5 0 980
## 8 9 Jason Kokrak 35 12 42 10 3 1 841
## 9 10 Justin Thomas 27 9 34 9 4 0 897
## 10 11 Max Homa 30 13 46 10 3 1 909
## # ... with 60 more rows, and 7 more variables: earnings <dbl>, yds-drive <dbl>,
## # drvacc <dbl>, drvetotal <dbl>, greensinreg <dbl>, puttavg <dbl>,
## # savepct <dbl>

p1 <- df %>%
ggplot(aes(x=events, y=rounds )) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x="Events", y="Rounds") +
ggtitle("Strong positive linear relationship")

p2 <- df %>%
ggplot(aes(x=drvacc, y=`yds-drive`)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x="Drive Accuracy", y="Yards per drive") +
ggtitle("Negative linear relationship")

p3 <- df %>%
ggplot(aes(x=age, y=cutsmade )) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x="Age", y="Cuts made") +
ggtitle("Weak (negative) linear relationship")

ggarrange(p1, p2, p3, ncol=2, nrow=2)

FIGURE 13.1: Scatter plots of pairs of variables and their linear relationship.

Notice that the eyeball test should always be attempted. This is because the nu-
merical estimators of the strength of the relationship are summarizing point es-
timates. As such, they can hide various situations. This point was cleverly il-
lustrated by Anscombe (1973) with the help of four scatter plots reproduced in
Figure 13.2.
The particularity of the four plots is that they share the exact same linear relationship between the variables. Obviously, the nature of the real relationship differs greatly.

p1 <- anscombe %>%


ggplot(aes(x=x1, y=y1 )) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)

p2 <- anscombe %>%


ggplot(aes(x=x2, y=y2 )) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)

p3 <- anscombe %>%


ggplot(aes(x=x3, y=y3 )) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)

p4 <- anscombe %>%


ggplot(aes(x=x4, y=y4 )) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)

ggarrange(p1, p2, p3, p4, ncol=2, nrow=2)

Fortunately, current software has rendered this visual examination extremely easy. Figures 13.3 through 13.5 show three possibilities in R.

pairs(df[, c(1, 4:11)])

library(corrgram)   # needed for corrgram()
corrgram(df[, c(1, 4:11)], lower.panel = panel.shade, upper.panel = panel.pts)

library(corrplot)   # needed for corrplot.mixed()
cor_matrix <- cor(df[, c(1, 4:11)], use = 'complete.obs')
corrplot.mixed(cor_matrix, lower = "circle", upper = "number", tl.pos = "lt", diag = "u")
FIGURE 13.2: Anscombe plots.

13.2 Pearson’s Correlation

Pearson's correlation, aka the correlation, is a measure of the linear relationship between two variables.
Its formula is:

$$r_{xy} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum x_i^2 - n\bar{x}^2}\,\sqrt{\sum y_i^2 - n\bar{y}^2}}$$
In a population, it reflects the covariance between two variables, normalized by the product of their standard deviations. It can range from -1 to 1, indicating the direction and the strength of the linear relationship.
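As a quick sanity check, the formula can be computed by hand and compared with cor(), for instance for the events and rounds variables of the golf data used below:

x <- df$events
y <- df$rounds
n <- length(x)
# the formula above, computed by hand
(sum(x * y) - n * mean(x) * mean(y)) /
  (sqrt(sum(x^2) - n * mean(x)^2) * sqrt(sum(y^2) - n * mean(y)^2))
cor(x, y)   # both give about 0.83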
FIGURE 13.3: Assessing associations with base R.

The variables should be continuous and have constant variance across their range. As for other statistics, a large 𝑛 is needed for the normal approximation of the sampling distribution to be reliable.
Estimation in R is straightforward. When assessing the correlation between two
variables, we could use a base R function.

cor(df$events, df$rounds)
## [1] 0.8265256
cor(df$drvacc, df$`yds-drive`)
## [1] -0.522812
cor(df$age, df$cutsmade)
## [1] NA
FIGURE 13.4: Assessing associations with the corrgram package.

cor(df$age, df$cutsmade, use = 'complete.obs')


## [1] -0.1726649

To speed up the process, the comparison can be made over a selection of vari-
ables.

cor(df[, c(1, 4:11)], use = 'complete.obs')


## rank age events rounds cutsmade top10
## rank 1.00000000 0.39342313 0.25508070 0.05448489 0.1277277 -0.7979354
## age 0.39342313 1.00000000 -0.09931708 -0.17140106 -0.1726649 -0.3056268
## events 0.25508070 -0.09931708 1.00000000 0.89715918 0.9105595 -0.2472508
## rounds 0.05448489 -0.17140106 0.89715918 1.00000000 0.9093701 -0.0640426
## cutsmade 0.12772772 -0.17266492 0.91055946 0.90937010 1.0000000 -0.1236086
## top10 -0.79793539 -0.30562679 -0.24725082 -0.06404260 -0.1236086 1.0000000
## wins -0.53002657 -0.06662705 -0.28359085 -0.27057896 -0.2665066 0.3235051
## cuppoints -0.89120057 -0.36103058 -0.31497329 -0.12808014 -0.1805361 0.7652737
## earnings -0.88145066 -0.37700438 -0.38391331 -0.18730479 -0.2478651 0.7810479
## wins cuppoints earnings
## rank -0.53002657 -0.8912006 -0.8814507
## age -0.06662705 -0.3610306 -0.3770044
## events -0.28359085 -0.3149733 -0.3839133
## rounds -0.27057896 -0.1280801 -0.1873048
## cutsmade -0.26650662 -0.1805361 -0.2478651
## top10 0.32350513 0.7652737 0.7810479
## wins 1.00000000 0.7209731 0.6771228
## cuppoints 0.72097309 1.0000000 0.9764573
## earnings 0.67712276 0.9764573 1.0000000

FIGURE 13.5: Assessing associations with the corrplot package.

A natural test to be made is based on


$$H_0: \rho = 0$$

This test can be carried out as follows.

cor.test(df$events, df$rounds)
##
## Pearson's product-moment correlation
##
## data: df$events and df$rounds
## t = 12.108, df = 68, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7341281 0.8888703
## sample estimates:
## cor
## 0.8265256
cor.test(df$drvacc, df$`yds-drive`)
##
## Pearson's product-moment correlation
##
## data: df$drvacc and df$`yds-drive`
## t = -5.0575, df = 68, p-value = 3.435e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6748790 -0.3281503
## sample estimates:
## cor
## -0.522812
cor.test(df$age, df$cutsmade, use = 'complete.obs')
##
## Pearson's product-moment correlation
##
## data: df$age and df$cutsmade
## t = -1.4241, df = 66, p-value = 0.1591
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3948356 0.0685836

## sample estimates:
## cor
## -0.1726649

Again, an overview of the result of this test over many pairs can be done, using
the Hmisc package.

library(Hmisc)
rcorr(as.matrix(df[, c(1, 4:11)]))
## rank age events rounds cutsmade top10 wins cuppoints earnings
## rank 1.00 0.39 0.27 0.08 0.15 -0.77 -0.53 -0.88 -0.88
## age 0.39 1.00 -0.10 -0.17 -0.17 -0.31 -0.07 -0.36 -0.38
## events 0.27 -0.10 1.00 0.83 0.86 -0.32 -0.27 -0.26 -0.37
## rounds 0.08 -0.17 0.83 1.00 0.91 -0.05 -0.28 -0.17 -0.21
## cutsmade 0.15 -0.17 0.86 0.91 1.00 -0.12 -0.27 -0.20 -0.26
## top10 -0.77 -0.31 -0.32 -0.05 -0.12 1.00 0.30 0.67 0.74
## wins -0.53 -0.07 -0.27 -0.28 -0.27 0.30 1.00 0.72 0.68
## cuppoints -0.88 -0.36 -0.26 -0.17 -0.20 0.67 0.72 1.00 0.97
## earnings -0.88 -0.38 -0.37 -0.21 -0.26 0.74 0.68 0.97 1.00
##
## n
## rank age events rounds cutsmade top10 wins cuppoints earnings
## rank 70 68 70 70 70 70 70 70 70
## age 68 68 68 68 68 68 68 68 68
## events 70 68 70 70 70 70 70 70 70
## rounds 70 68 70 70 70 70 70 70 70
## cutsmade 70 68 70 70 70 70 70 70 70
## top10 70 68 70 70 70 70 70 70 70
## wins 70 68 70 70 70 70 70 70 70
## cuppoints 70 68 70 70 70 70 70 70 70
## earnings 70 68 70 70 70 70 70 70 70
##
## P
## rank age events rounds cutsmade top10 wins cuppoints earnings
## rank 0.0009 0.0244 0.5146 0.2264 0.0000 0.0000 0.0000 0.0000
## age 0.0009 0.4204 0.1622 0.1591 0.0113 0.5893 0.0025 0.0015
## events 0.0244 0.4204 0.0000 0.0000 0.0061 0.0261 0.0276 0.0016

## rounds 0.5146 0.1622 0.0000 0.0000 0.6891 0.0184 0.1683 0.0865


## cutsmade 0.2264 0.1591 0.0000 0.0000 0.3122 0.0217 0.0970 0.0291
## top10 0.0000 0.0113 0.0061 0.6891 0.3122 0.0114 0.0000 0.0000
## wins 0.0000 0.5893 0.0261 0.0184 0.0217 0.0114 0.0000 0.0000
## cuppoints 0.0000 0.0025 0.0276 0.1683 0.0970 0.0000 0.0000 0.0000
## earnings 0.0000 0.0015 0.0016 0.0865 0.0291 0.0000 0.0000 0.0000

13.3 Spearman’s Rank Correlation

The idea behind the Spearman’s correlation is to evaluate the association be-
tween the ranks of the observations in two different variables.
This approach makes the test less sensitive to outliers. Of course, it is better suited
when the variables are ordinal in measurement.
Importantly, the Spearman correlation assesses the relationship between two
variables that is not necessarily linear but simply monotonic.
The statistic is, assuming the ranks are distinct (no ties),

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)},$$

where 𝑑𝑖 = rg(𝑋𝑖 ) − rg(𝑌𝑖 ) is the difference between the two ranks of each
observation and 𝑛 is the number of observations.
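A quick numerical check of this formula, on simulated data without ties:

set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
d <- rank(x) - rank(y)
n <- length(x)
1 - 6 * sum(d^2) / (n * (n^2 - 1))
cor(x, y, method = "spearman")   # identical when there are no ties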
Here is an example of use in R that emphasizes the difference with respect to the
Pearson’s coefficient.

df1 <- tibble(x=seq(from = 0, to = 5, by= 0.1 ),


y= x^2)

cor(df1$x, df1$y)
## [1] 0.967074
cor(df1$x, df1$y, method = "spearman")
## [1] 1

A test of hypothesis can similarly be performed.

cor.test(df$events, df$rounds, method = "spearman")


##
## Spearman's rank correlation rho
##
## data: df$events and df$rounds
## S = 10042, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8242995
cor.test(df$drvacc, df$`yds-drive`, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: df$drvacc and df$`yds-drive`
## S = 85395, p-value = 1.38e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.4940984
cor.test(df$age, df$cutsmade, use = 'complete.obs', method = "spearman")
##
## Spearman's rank correlation rho
##
## data: df$age and df$cutsmade
## S = 53527, p-value = 0.8611
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.02161583
14
Observational Versus Experimental Data

TL;DR .

Required Packages

library(magrittr) # pipe operator


library(dplyr) # data frames manipulation
library(ggplot2) # plotting
library(tibble) # alternative to data frame

library(scales) # useful functions: percent


library(palmerpenguins) # data set
data(penguins)
library(correlation) # some useful functions for here

14.1 Descriptive Approach

14.1.1 UCB Admissions


df <- UCBAdmissions %>%


as_tibble() %>%
mutate(cases = sum(n))%>%
filter(Admit == "Admitted") %>%
summarise(Admission = sum(n) /cases) %>%
pull(Admission)
mean.Admission <- df[1]

df <- UCBAdmissions %>%


as_tibble() %>% # convert the table to a data frame
group_by(Gender, Dept) %>%
mutate(cases = sum(n)) %>%
ungroup() %>%
filter(Admit == "Admitted") %>%
group_by(Gender) %>%
summarise(Admission = sum(n)/sum(cases),
N = sum(cases))
df
## # A tibble: 2 x 3
## Gender Admission N
## <chr> <dbl> <dbl>
## 1 Female 0.304 1835
## 2 Male 0.445 2691

df %>%
ggplot(aes(x = Gender, y = Admission, fill = Gender)) +
geom_col() +
geom_text(aes(label = percent(Admission)), vjust = -1) +
labs(y = "Admission rate") +
scale_y_continuous(labels = percent, limits = c(0,0.5)) +
geom_hline(yintercept = mean.Admission, linetype="dashed") +
annotate(geom = "text",x=0.85, y =mean.Admission+0.02, label = paste0("Average admission rate (",per
guides(fill = FALSE)
[Figure: admission rates by gender (Female about 30%, Male about 45%), with the overall average admission rate of 39% shown as a dashed line.]

p1 <- df$Admission[1]
p2 <- df$Admission[2]
n1 <- df$N[1]
n2 <- df$N[2]

test.c <- prop.test(x = c(p1*n1, p2*n2),


n = c(n1, n2),
alternative = "less",
conf.level = 0.9999,
correct = TRUE)
test.c
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(p1 * n1, p2 * n2) out of c(n1, n2)
## X-squared = 91.61, df = 1, p-value < 2.2e-16
## alternative hypothesis: less
## 99.99 percent confidence interval:
## -1.00000000 -0.08768078
## sample estimates:
## prop 1 prop 2
## 0.3035422 0.4451877

14.1.2 Palmer Penguins

penguins %>%
na.omit() %>%
ggplot(aes(x=bill_length_mm, y=bill_depth_mm)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE) +
labs(x="Bill length", y="Bill Depth")

[Figure: bill depth versus bill length for all penguins, with a single (negatively sloped) linear fit.]

r1 <- cor.test(penguins$bill_length_mm, penguins$bill_depth_mm)


r1
##
## Pearson's product-moment correlation
##
## data: penguins$bill_length_mm and penguins$bill_depth_mm
## t = -4.4591, df = 340, p-value = 1.12e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3328072 -0.1323004
## sample estimates:
## cor
## -0.2350529

14.2 Covariates

14.2.1 UCB Admissions, Again

df <- UCBAdmissions %>%


as_tibble() %>%
group_by(Gender, Dept) %>%
mutate(cases = sum(n)) %>%
filter(Admit == "Admitted") %>%
summarise(Admission = sum(n)/sum(cases),
N = sum(cases))
df
## # A tibble: 12 x 4
## # Groups: Gender [2]
## Gender Dept Admission N
## <chr> <chr> <dbl> <dbl>
## 1 Female A 0.824 108
## 2 Female B 0.68 25
## 3 Female C 0.341 593
## 4 Female D 0.349 375
## 5 Female E 0.239 393
## 6 Female F 0.0704 341
## 7 Male A 0.621 825
## 8 Male B 0.630 560
## 9 Male C 0.369 325
## 10 Male D 0.331 417
## 11 Male E 0.277 191
## 12 Male F 0.0590 373

df %>%
ggplot(aes(x=Gender, y=Admission, fill = Gender)) +
geom_col() +
geom_text(aes(label = paste0(percent(Admission), "\n (of ", N, ")") ), vjust = -0.15, size=3) +
labs(y = "Admission rate") +
scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
facet_wrap(~Dept) +
guides(fill = FALSE)

[Figure: admission rates by gender within each department (A–F), with the number of applicants shown below each rate.]

14.2.2 Penguins, Again

penguins %>%
na.omit() %>%
ggplot(aes(x=bill_length_mm, y=bill_depth_mm, color=species)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE) +
labs(x="Bill length", y="Bill Depth")

[Figure: bill depth versus bill length, with points and linear fits coloured by species (Adelie, Chinstrap, Gentoo).]

14.3 Paradox Again

a.

All        Effect   No effect   N    Recovery rate
Drug         20        20       40        50%
No Drug      16        24       40        40%
Total        36        44       80

b.

Male       Effect   No effect   N    Recovery rate
Drug         18        12       30        60%
No Drug       7         3       10        70%
Total        25        15       40

c.

Female     Effect   No effect   N    Recovery rate
Drug          2         8       10        20%
No Drug       9        21       30        30%
Total        11        29       40
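The recovery rates in tables a, b and c can be reproduced with a few lines of dplyr; the column names below are illustrative.

library(dplyr)
library(tibble)

trials <- tribble(
  ~sex,     ~treatment, ~effect, ~no_effect,
  "Male",   "Drug",          18,         12,
  "Male",   "No Drug",        7,          3,
  "Female", "Drug",           2,          8,
  "Female", "No Drug",        9,         21
)

# Aggregated over sexes: the drug looks better (50% vs 40%).
trials %>%
  group_by(treatment) %>%
  summarise(recovery = sum(effect) / sum(effect + no_effect))

# Within each sex: the drug looks worse (60% vs 70% for males, 20% vs 30% for females).
trials %>%
  group_by(sex, treatment) %>%
  summarise(recovery = sum(effect) / sum(effect + no_effect), .groups = "drop")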
15
Statistical Learning

15.1 Statistical Learning

The original postulate is that there exists a relationship between a response variable 𝑌 and, jointly, a set 𝑋 of variables (independent variables, predictors, explanatory variables).
Then, the general form of the relationship between these variables is as follows.

𝑌 = 𝑓(𝑋) + 𝜀

where 𝜀 captures various sources of error.


We will denote by 𝑛 the number of observations, i.e., the number of tuples con-
taining a value of response and a value for each predictor. Also, 𝑝 is the number
of predictors.
It is useful to see the different objects of the equation above.

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = f\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
The goal of statistical learning is to estimate (learn, determine, guess,…) 𝑓()
and/or its main features.
Figure 15.1 illustrates the data learning process by plotting 𝑌 for the values of 𝑋 ,
a unique vector (left) along with the errors measured as the difference between
the observations and the true function (right). Notice that the true function is
known in this case because the data is simulated.


FIGURE 15.1: Instance of simulated Income data along with true 𝑓() and errors.

FIGURE 15.2: Instance of simulated Income data along with true 𝑓() and errors
(two predictors).

The different techniques explored in statistical learning are designed to come as


close as possible to the true, blue line.
Figure 15.2 illustrates the same idea as Figure 15.1, but with a true function over
two variables.

set.seed(2)
income <- read_csv("data/islr/Income1.csv") %>%
select(-X1, -Income) %>%
mutate(Income = 20 + 600* dnorm(Education, 22, 4) + rnorm(length(Education),0,4),
tIncome = 20 + 600* dnorm(Education, 22, 4))

p1 <- income %>%


ggplot(mapping = aes(x=Education, y=Income)) +
geom_point()

p2 <- income %>%


ggplot(mapping = aes(x=Education, y=Income)) +
geom_point() +
stat_function(fun = function(Education) 20 + 600*dnorm(Education, 22, 4)) +
geom_segment(aes(x = Education, y = Income, xend = Education, yend = tIncome, colour = "red"),
data = income) +
theme(legend.position = "none")

grid.arrange(p1, p2, ncol=2)


income %>%
mutate(error = Income - tIncome)
## # A tibble: 30 x 4
## Education Income tIncome error
## <dbl> <dbl> <dbl> <dbl>
## 1 10 24.5 20.7 3.87
## 2 10.4 21.4 20.9 0.502
## 3 10.8 18.4 21.2 -2.84
## 4 11.2 18.0 21.6 -3.65
## 5 11.6 26.3 22.1 4.21
## 6 12.1 19.8 22.8 -3.01
## 7 12.5 17.8 23.5 -5.76
## 8 12.9 23.3 24.5 -1.14
## 9 13.3 21.5 25.6 -4.14
## 10 13.7 27.0 27.1 -0.113
## # ... with 20 more rows

15.2 Use of Statistical Learning

There are two main reasons one would want to estimate 𝑓().

15.2.1 Prediction

On many occasions, the independent variables are known but the response is not. Therefore, 𝑓() can be used to predict these values. These predictions are noted by

$$\hat{Y} = \hat{f}(X)$$

where $\hat{f}$ is the estimated function for 𝑓().

15.2.2 Inference

The estimated $\hat{f}()$ is also used to answer questions about the relationship between the independent variables and the response variable, such as:
• which predictors contribute to the response,
• how much each predictor contributes to the response,
• what is the form of the relationship.

FIGURE 15.3: Wage as a function of various variables.

15.3 Universal Scope

Statistical learning addresses a very large set of issues. This set has been expanding in the last decades thanks to the availability of computing power, data sets, new software and theoretical developments. Here is a very short list of cases handled by data science.

15.3.1 Wage vs Demographic Variables

Determining what demographic variables influence the worker’s wage.

15.3.2 Probability of Heart Attack

Predicting the probability of suffering a heart attack on the basis of demographic, diet and clinical measurements.

FIGURE 15.4: Factors influencing the risk of a heart attack.

FIGURE 15.5: Frequencies for main words in email (to George).

15.3.3 Spam Detection

Devising a spam detection system.

15.3.4 Identifying Hand-Written Numbers

Identifying hand-written numbers of zip codes in letters.


If you are interested in this particular application, I recommend watching this
fantastic video (the first of a series).

FIGURE 15.6: Sample of hand-written numbers.

FIGURE 15.7: LANDSAT images and classification.

15.3.5 Classify LANDSAT Image

Classify the pixels in a LANDSAT image, by usage.



15.4 Ideal $f()$ vs $\hat{f}()$

For a solution to be ideal or optimal, one has to first specify a criterion. In the context of prediction, for instance, a natural criterion arises.
If one wants to predict 𝑌 given 𝑋, i.e., to get $\hat{Y} = \hat{f}(X)$, a common criterion used in statistical learning is the mean-squared error, defined thanks to the squared error

$$\text{squared error} = (Y - \hat{Y})^2$$

It can be shown that, using that criterion, the optimal 𝑓() that minimizes the expected value of the squared error,

$$\min E[(Y - \hat{Y})^2 \,|\, X = x],$$

is given by

$$f(x) = E[Y \,|\, X = x]$$

This function 𝑓(𝑥) is called the regression function.


Of course, the regression function is not known and must be estimated. In other words, we only have $\hat{f}(x)$ instead of 𝑓(𝑥). A very important feature must then be emphasized.

$$E[(Y - \hat{Y})^2 \,|\, X = x] = E[(f(X) + \varepsilon - \hat{f}(X))^2 \,|\, X = x]$$

Carrying the conditioning on 𝑋, we can write

$$E[(Y - \hat{Y})^2] = \underbrace{(f(x) - \hat{f}(x))^2}_{\text{Reducible}} + \underbrace{Var(\varepsilon)}_{\text{Irreducible}}$$

15.5 Important Distinctions



FIGURE 15.8: Linear, smooth non-parametric and rough non-parametric fit (left to right).

15.5.1 Approaches

Parametric methods impose the functional form and estimate the parameters of that function. The simplest and most common of these is the linear model of the form:

𝑓(𝑋) = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝
Non-parametric methods do not impose any functional form. But they have tun-
ing parameters, for instance, the level of smoothness.

15.5.2 Trade-offs

The estimation techniques, as hinted in the discussion above, present the re-
searcher with various trade-offs:

• Accuracy versus interpretability,


• Good versus over or under-fit,
• Parsimony versus all-in.

Depending on the researcher's choices along these axes, there are more or less appropriate techniques.

15.5.3 Types of Statistical Problems

From the examples above, and others, we can distinguish types of statistical
learning problems:

• Regression versus classification problems.


• Supervised versus unsupervised learning.

15.6 Quality of Regression Fit

The most common measure for quality of a regression fit is the mean squared
error, MSE (or the square root of that number), given by

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2$$

At this stage, we must introduce a fundamental feature of the statistical learning philosophy. Any chosen method/technique is given data to learn, the training data. However, the crucial attribute of the method should be measured on data not previously seen, the test data.
In other words, the important measure is the test MSE. We compute it as
$$\text{Ave}\left(y_0 - \hat{f}(x_0)\right)^2$$

where (𝑦0 , 𝑥0 ) are the test observations.
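As a minimal sketch of the idea, with the built-in mtcars data standing in for an arbitrary example:

set.seed(1)
train_id <- sample(nrow(mtcars), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[train_id, ]
test  <- mtcars[-train_id, ]

fit <- lm(mpg ~ hp, data = train)

mean((train$mpg - predict(fit, train))^2)   # training MSE
mean((test$mpg  - predict(fit, test))^2)    # test MSE, computed on unseen data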


The next examples should nourish our intuition about a fundamental trade-off
explained below.

15.7 Bias-Variance Trade-Off

The U-shape of the test MSE curves is an important result in statistical learning.
It derives from the following property. If the true model is 𝑌 = 𝑓(𝑋) + 𝜀 (with
𝑓(𝑥) = 𝐸[𝑌 |𝑋 = 𝑥] ), then we can show

$$E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = Var(\hat{f}(x_0)) + \left(Bias(\hat{f}(x_0))\right)^2 + Var(\varepsilon)$$

FIGURE 15.9: B-V case 1.

FIGURE 15.10: B-V case 2.

$Var(\hat{f}(x_0))$ is how much $\hat{f}()$ would change if estimated with different training data.
$Bias(\hat{f}(x_0))$ is the discrepancy between the estimated function and the true function.
The trade-off exists because more flexible methods do tend to reduce bias, but they increase the volatility of $\hat{f}()$.

FIGURE 15.11: B-V case 3.

FIGURE 15.12: Bias-Variance trade-off.

15.8 Accuracy in Classification Setting

In classification problems, one cannot calculate the MSE. Instead, the most com-
mon metric in the classification setting is the classification error rate.

$$\text{err}(\hat{C}, \text{data}) = \frac{1}{n}\sum_{i=1}^{n} I\left(y_i \neq \hat{C}(x_i)\right)$$

where 𝐼 is an indicator function taking the value 1 or 0. Thus, the error rate is the proportion of incorrect classifications.

$$I(y_i \neq \hat{C}(x)) = \begin{cases} 1 & y_i \neq \hat{C}(x) \\ 0 & y_i = \hat{C}(x) \end{cases}$$

Again in this setting, we often split the data between train and test. We can cal-
culate the Train (Classification) Error but Test (Classification) Error is a better
measure of how well a classifier will work on future unseen data.

$$\text{err}_{\text{trn}}(\hat{C}, \text{train data}) = \frac{1}{n_{\text{trn}}}\sum_{i \in \text{trn}} I\left(y_i \neq \hat{C}(x_i)\right)$$

$$\text{err}_{\text{tst}}(\hat{C}, \text{test data}) = \frac{1}{n_{\text{tst}}}\sum_{i \in \text{tst}} I\left(y_i \neq \hat{C}(x_i)\right)$$

Notice, however, that other criteria can be used depending on the type of error
that one wants to focus on.
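As a minimal sketch, with simulated data and a simple logistic classifier standing in for an arbitrary example:

set.seed(1)
dat <- data.frame(x = rnorm(200))
dat$y <- factor(ifelse(dat$x + rnorm(200) > 0, "A", "B"))
trn <- sample(nrow(dat), 140)

fit <- glm(y ~ x, data = dat[trn, ], family = binomial)
# predicted label: "B" when the estimated probability of "B" exceeds 0.5
pred_lab <- function(newdata) {
  ifelse(predict(fit, newdata, type = "response") > 0.5, "B", "A")
}

mean(dat$y[trn]  != pred_lab(dat[trn, ]))    # training error rate
mean(dat$y[-trn] != pred_lab(dat[-trn, ]))   # test error rate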

15.9 Cross-Validation

Recall the fundamental issue of the training error not being a good guide for the test error. Figure 15.13 illustrates the problem and emphasizes the core trade-off for the optimal level of flexibility/complexity of a model.
This section focuses on how to achieve the minimal test data error thanks to a set of methods based on holding out subsets of the training data.
The core idea is to achieve a better estimate of the error in the test data by evaluating the error in a subsample that was not used to train the model.
For illustration purposes in this section, I will mainly use the following example. We estimate a linear model for the relationship between a dependent variable, 𝑦, and an independent variable, 𝑥. In this example, 𝑦 is the miles per gallon and 𝑥 is the horsepower of the vehicle. Figure 15.14 plots the data at hand.
In order to take into account possible non-linearities, we estimate various models differing by the degree of the polynomial.

FIGURE 15.13: Training versus test data performance.

FIGURE 15.14: Scatter plot of data set.

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_p x_i^p + \varepsilon_i$$

The difficulty here is to establish the best value for 𝑝, as illustrated in Figure
15.15.

FIGURE 15.15: Fits of mpg for various degrees of the polynomial of horsepower.

FIGURE 15.16: Validation set approach.

15.9.1 Validation Set Approach

This is the simplest form of these methods. It consists in:

• randomly divide the available data in two groups, the training and the validation set,
• estimate the model with the training data,
• apply the fitted model to the validation data and calculate the error.

Figure 15.16 illustrates this technique.


For the example at hand, the result can be read in Figure 15.17.
This approach has two main drawbacks:

• it depends greatly on the random samples (training, validation) chosen,


• it estimates the model on less data, reducing its chances of good accuracy.

FIGURE 15.17: Choice of polynomial in the validation set approach.

FIGURE 15.18: LOOCV approach.

15.9.2 Leave-One-Out

This method shares the same approach as above but sets a specific choice for the validation data: each observation, in turn, is the validation data while the rest is the training data. Figure 15.18 illustrates this technique.
Notice that this technique requires 𝑛 computations of the mean square error, one for each observation. The estimate for the test mean square error is simply the average of all of them.

$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \text{MSE}_i$$

This method can be computationally cumbersome. For our example, the results
are given in Figure 15.20.

FIGURE 15.19: 5-fold example of a cross-validation approach.

FIGURE 15.20: Choice of polynomial with LOOCV and 10-fold CV.

15.9.3 𝑘-Fold

A somewhat intermediate solution is the 𝑘-fold cross-validation. The idea is to separate the training data into 𝑘 groups, called folds, with each of them being the validation set in turn. Figure 15.19 illustrates this technique.
The estimate for the test mean square error is simply the average of the mean
square error across the 𝑘 folds.

$$CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \text{MSE}_i$$

Results of a 10-fold CV are given in Figure 15.20.
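As a minimal sketch of such a computation, assuming the mpg–horsepower data is the Auto data set from the ISLR package:

library(ISLR)
set.seed(1)

k <- 10
# randomly assign each observation to one of k folds
folds <- sample(rep(1:k, length.out = nrow(Auto)))

cv_mse <- sapply(1:5, function(degree) {
  mean(sapply(1:k, function(j) {
    fit <- lm(mpg ~ poly(horsepower, degree), data = Auto[folds != j, ])
    mean((Auto$mpg[folds == j] - predict(fit, Auto[folds == j, ]))^2)
  }))
})
cv_mse   # one estimated test MSE per polynomial degree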



15.9.4 Comments

A couple of short notes are the following.

• One could wonder why use 𝑘 < 𝑛, given the speed of our modern computers.
The reason is that the LOOCV method averages over many models that are
based essentially on almost the same data. Hence, the average is over correlated
values, which, in turn increases the variance of the estimate of the test error.
This might not be a desirable feature.
• In general, values of 𝑘 = 5 or 𝑘 = 10 are standard compromises.

• Cross-validation applies similarly to classification problems.

• An essential point with cross-validation is to never use training data in the


validation set, not even indirectly (e.g., in a preliminary choice of variables).

15.10 Ubiquity of Predictions

Predictions are an essential part of the human experience as they permeate our
day-by-day lives:

What will come out of that shaking bush?

— Anonymous hunter’s last thought.

A few simple examples of actual predictions can emphasize the point:

• will a given drug cure a disease?


• how long will a trip last?
• how many units will the firm sell?

But predictions are more ubiquitous than these direct questions indicate. Indeed,
predictions are at the heart of judgment and decision making. Consider the fol-
lowing illustrations:

• “should parole be granted?”: the answer to that question depends on the committee's prediction about the future behavior of an inmate.
• “is this email a spam?”: when services such as Gmail classify email as spam,
they make a prediction about the classification that the user would make, had
he/she read the email.
• “what candidate ought to be hired?”: the hiring process is based on the predic-
tion about each candidate’s future performance.

15.11 Heuristics, Algorithms and AI

Seeing predictions as part of the judgment and decision process puts forward
the documented human fallibility in that matter.
Here, the background reference is the compelling literature on heuristics and bias
to which Kahneman and Tversky are two preeminent contributors.
In short, humans’ mental apparatus is prone to systematic and predictable er-
roneous judgments and decisions because of its reliance on (time-, energy-) effi-
cient but ultimately potentially misleading mental rules (heuristics) and biased
reasoning.
In that perspective, data science is a tool for slow thinking type of judgment. Be-
cause it follows rules (algorithms) that are separated from, though not indepen-
dent of a human judgment, data science is potentially immune to these biases.
We encounter here a common theme: the struggle between humans and machines over intelligence status. As the top contender on the non-human side, AI is often depicted as an almighty force. It might be. Our understanding suggests a less grandiose vision of AI: it is simply the highest point on the scale starting at heuristics and going through algorithms. In other words, the true power of AI is its ability to make the best predictions.

15.12 AI, Not Why: Predicting vs Understanding

By nature, humans crave understanding by drawing causal relationships be-


tween the phenomena that they observe. A discussion of this psychological drive
goes beyond the scope of these notes. It is mentioned here, however, in order to
emphasize a crucial tension arising in the practice of data science, namely the
usual trade-off in predictions between their accuracy and their interpretability
(i.e., their understanding).
The techniques used to treat the data do not always allow for an interpretation
of this treatment. Using a common yet appropriate image, the prediction process
resembles a black-box.
As such, the main deliverable of a prediction process is a statement based on
complex and sophisticated correlations, almost never on causation.
The balance between the advantages and disadvantages of the approach are
context-specific. In some situations, understanding why a consequence followed
(i.e., causation) is not necessary, not possible and sometimes not even desirable.
In other cases, the cause of the effect is required. For instance, applicants would
like to know why their loan was refused by the bank (’s algorithm).1
In the former cases, data science will be associated with the great benefits of
accuracy, and, in the latter, it will cause frustration to the inquiring mind.
Back to AI as the best data scientist as it is able to make the most accurate predic-
tions. It follows from the above remarks that AI, and all the more so for its less
intelligent avatars, will always be bounded by the barrier of ‘why’, i.e., they will
generally not be able to make statements about causation.
Inference is the setting in data science that addresses the causality relationships.
This is a much more difficult setting as it faces two huge hurdles:

• it requires a previous theoretical modeling of a relationship,


• the functional specification of the relationship must be correctly estimated.
1
Also, some web services such as Amazon or Netflix post an approximate reason for their
suggestions.

A problem with that approach is that practitioners often aim at collecting its
sweet fruits of causal statements without paying the harsh price of the modeling
and the specification. As a consequence, the fruits of these empirical investiga-
tions, despite their glowing appearance, are actually not edible and sometimes
even poisonous.
This is why we opt to not attempt incursions in the high spheres of inference but
remain on the ground of predictions.

15.12.1 Plus

• Black-box algorithms can inherit prejudice and other discriminating rules.


• Knowing the context is of utmost importance in any data science project.
• “No free lunch theorem”.
• Some of the figures in these notes are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

15.13 Important Perspective

It is important to understand data science even if you never intend to apply it yourself. Data-analytic
thinking enables you to evaluate proposals for data mining projects. For example, if an employee, a
consultant, or a potential investment target proposes to improve a particular business application by
extracting knowledge from data, you should be able to assess the proposal systematically and decide
whether it is sound or flawed. This does not mean that you will be able to tell whether it will actually
succeed - for data mining projects, that often requires trying - but you should be able to spot obvious
flaws, unrealistic assumptions, and missing pieces. (…)
The consulting firm McKinsey and Company estimates that “there will be a shortage of talent necessary
for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of
140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with
the know-how to use the analysis of big data to make effective decisions.” (Manyika, 2011). Why 10 times
as many managers and analysts than those with deep analytical skills? Surely data scientists aren’t so
difficult to manage that they need 10 managers! The reason is that a business can get leverage from a
data science team for making better decisions in multiple areas of the business. However, as McKinsey is
pointing out, the managers in those areas need to understand the fundamentals of data science to
effectively get that leverage.

— Provost and Fawcett (2013)


Part VII

Linear Regression
16
Simple Linear Regression

TL;DR This chapter introduces the workhorse


of data analysis, namely the linear regression
model.[16.1]
The simple linear regression is the starting point be-
cause it allows for a very useful and telling visualiza-
tion.[16.2]
The fit is obtained by the OLS procedure, whereby the coefficients are chosen so as to minimize the sum of the squares of the residuals.[16.3]
The mathematical derivation is shown but is not part
of the tested material.[16.5]

16.1 A Classic Approach

The linear regression approach is the workhorse of many empirical investigations. It is also the classical method because of its simplicity and its ease of computation (an important argument in times of little or cumbersome computing capabilities).
Various important reasons explain why it is often the first tool in any analyst’s
toolbox.


• It can straightforwardly be extended and produce reasonably good estimates in many applications.
• Despite its simplicity, it allows one to clearly illustrate advanced concepts. In particular, it lays the groundwork for the need for more complicated techniques.

So far, the mean of a population has been treated as a constant, and we have
shown how to use sample data to estimate or to test hypotheses about this con-
stant mean.
In many applications, the mean of a population is not viewed as a constant, but
rather as a variable. For instance:

Mean sale price = $50000 + $17920 × N. Bedrooms

This formula implies that the mean sale price of 1 bedroom homes is $67’920;
the mean sale price of 2 bedroom homes is $85’840, and the mean sale price of 3
bedroom homes is $103’760.
We will now study these situations in which the mean of the population is treated
as a variable, dependent on the value of another variable.
We will learn how to use the sample data to estimate the relationship between
the mean value of one variable, 𝑌 , as it relates to a second variable, 𝑋 .
𝑌 is generally referred to as the response variable, outcome variable, dependent variable, or endogenous variable. 𝑋 is generally referred to as the covariate, regressor, explanatory variable, independent variable, or exogenous variable.
Examples include:

• A manager would like to know what mean level of sales (𝑌 ) can be expected
if the price (𝑋 ) is set at $10 per unit.
• If 250 workers (𝑋 ) are employed in a factory, how many units (𝑌 ) can be pro-
duced during an average day?
• If a developing country increases its fertilizer production (𝑋 ) by 1’000’000 tons,
how much increase in grain production (𝑌 ) should it expect?
• …

The relationship between 𝑋 and 𝑌 can take on linear or non-linear forms. Previ-
ously, we saw how the relationship between two variables can be described by
using scatter plots and correlation coefficients.

In many economic and business problems, a specific functional relationship is needed to obtain numerical results. The straight-line model is the simplest of all models relating a population mean to another variable and, in many cases, it provides an adequate approximation of the desired functional relationship.
The methodology of estimating and using a straight-line relationship is referred to as linear regression analysis.
As a reference, recall that the basis of the linear model is a fit of the data with a
functional form that assumes linearity in the coefficients, 𝛽 ’s, with 𝑝 predictors,
𝑋 ’s, for 𝑛 observations.

$$Y = \underbrace{\beta_0 + \beta_1 X_1}_{\text{Deterministic component}} + \underbrace{\varepsilon}_{\text{Random error}}$$

The response 𝑌 is also assumed to be influenced by shocks/errors captured by


𝜀. The standard deviation of these errors is assumed to be 𝜎. We always assume
that the mean value of the random error equals 0. This is equivalent to assuming
that the mean value of 𝑌 given 𝑋 , 𝐸[𝑌 |𝑋], equals the deterministic component
of the model. In other words, even if the estimated model correctly captured the
deterministic component, individual predictions would still be off because of this
irreducible error.
Notice that the linear function is almost never believed to fit the true data gen-
erating process but, instead, to more or less appropriately approximate it.

16.2 The Simple Linear Regression

This section builds around the example of the simple linear regression of sales on
the amount of TV advertising in the Advertising data set. To fix ideas, the linear
model estimated here is

sales = 𝛽0 + 𝛽1 × TV + 𝜀

16.2.1 Data and Scatter Plot

We start by loading the data and manipulating it to make it usable. Figure 16.1
provides a scatter plot of the data.

library(tidyverse)  # provides read_csv(), the pipe %>% and select()

advertising <- read_csv("data/islr/Advertising.csv") %>%
  select(-X1)
advertising
## # A tibble: 200 x 4
## TV radio newspaper sales
## <dbl> <dbl> <dbl> <dbl>
## 1 230. 37.8 69.2 22.1
## 2 44.5 39.3 45.1 10.4
## 3 17.2 45.9 69.3 9.3
## 4 152. 41.3 58.5 18.5
## 5 181. 10.8 58.4 12.9
## 6 8.7 48.9 75 7.2
## 7 57.5 32.8 23.5 11.8
## 8 120. 19.6 11.6 13.2
## 9 8.6 2.1 1 4.8
## 10 200. 2.6 21.2 10.6
## # ... with 190 more rows

16.2.2 Estimation in R

The estimation of the model is carried out with the function lm from the built-in
stats package. The result of the estimation is an object assigned to a name.

model.slr <- lm(sales ~ TV,
                data = advertising)

The content of this linear regression object is better described with the function
summary.
FIGURE 16.1: Scatter plot of the TV-Sales observations.

summary(model.slr)
##
## Call:
## lm(formula = sales ~ TV, data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
FIGURE 16.2: Linear fit and residuals.

16.2.3 Fitted Values and Residuals

One of the main reasons for explaining the simple linear regression is its graphi-
cal appeal. Indeed, we can see what we are estimating. Figure 16.2 provides such
an illustration of the linear fit as well as its errors in prediction.

16.2.4 Residuals vs Errors/Shocks

A common source of misinterpretation has to do with the difference between


residuals and errors/shocks. Notice the following relationships. On one hand,

Residuals = Data - Fit

whereas,

Shock = Data - Deterministic Component
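As a minimal sketch of this distinction (assuming the advertising data and the
model.slr object estimated above are still in memory), the residuals can be
recomputed directly as data minus fit; the shocks, in contrast, would require
knowing the true deterministic component, which is never observed.

res.manual <- advertising$sales - model.slr$fitted.values
# residuals are, by construction, the data minus the fit
all.equal(unname(res.manual), unname(model.slr$residuals))  # should be TRUE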



16.3 Ordinary Least Squares Procedure

The least squares procedure obtains estimates of the linear equation coefficients
𝛽0 and 𝛽1 by minimizing the sum of the squared residuals 𝑒𝑖 :
$$\sum_{i=1}^{n} \hat{e}_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\bigl(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\bigr)^2$$

The coefficients 𝛽0̂ and 𝛽1̂ are chosen so that this sum is minimized. We use
differential calculus to obtain the coefficient estimators that minimize the sum
of squared residuals.
Early mathematicians struggled with the problem of developing a procedure for
estimating the coefficients for the linear equation. Various procedures have been
developed, but none has proven as useful or as popular as least squares regres-
sion. The coefficients developed using this procedure have very useful statistical
properties.
One way to decide quantitatively how well a straight line fits a set of data is to
note the extent to which the data points deviate from the line. Specifically, we
can calculate the magnitude of the deviations (i.e., the differences between the
observed and the predicted values of 𝑌 ).
These deviations, called residuals, are the vertical distances between observed
and predicted values. Note that for the best fitting line, the sum of residuals
equals 0 and so we square those residuals (or deviations) and compute the sum
of squares of the residuals.
By summing the squared residuals, we place a greater emphasis on large devi-
ations from the line. By shifting a ruler around the graph we can see that it is
possible to find many lines for which the sum of residuals is equal to 0.
However, it can be shown that there is one (and only one) line for which the sum
of the squared residuals is a minimum - the least squares line.

16.4 Finding the Least Squares Line

Assume we have a sample of 𝑛 data points consisting of pairs of values 𝑥 and 𝑦 :


(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), ..., (𝑥𝑛 , 𝑦𝑛 ).
The fitted line which we will calculate on the basis of these 𝑛 data points is writ-
ten as:

𝑦𝑖̂ = 𝛽0̂ + 𝛽1̂ 𝑥𝑖

The “hats” indicate that the symbols are estimates of 𝐸[𝑌 |𝑋], 𝛽0 , and 𝛽1 , respec-
tively.
For a given data point (𝑥𝑖 , 𝑦𝑖 ), the observed value of 𝑌 is 𝑦𝑖 and the predicted
value of 𝑌 would be obtained by substituting 𝑥𝑖 into the prediction equation:

𝑦𝑖̂ = 𝛽0̂ + 𝛽1̂ 𝑥𝑖

The deviation of the 𝑖𝑡ℎ value of 𝑦 from its predicted value is:

(𝑦𝑖 − 𝑦𝑖̂ ) = 𝑦𝑖 − (𝛽0̂ + 𝛽1̂ 𝑥𝑖 )

The sum of the squares of the deviations of the 𝑦 -values about their predicted
values for all the 𝑛 data points is:
$$\sum_{i=1}^{n}\bigl[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\bigr]^2$$

The quantities 𝛽0̂ and 𝛽1̂ that make the sum of the squared residuals a minimum
are called the least squares estimates of the population parameters 𝛽0 and 𝛽1 .
The prediction equation

𝑦 ̂ = 𝛽0̂ + 𝛽1̂ 𝑥

is called the least squares line.



16.4.1 Features of the Least Squares Line

Note the following properties.

• The sum of the residuals equals 0, i.e., the mean of the residuals is 0.
• The sum of squared residuals is smaller than that for any other straight-line
model.

Also, the linear regression provides two important results:

1. Predicted values, 𝑦,̂ of the dependent variable as a function of the independent
variable.
2. Estimated marginal change in the dependent variable, 𝛽1̂ , that results
from a one-unit change in the independent variable.

16.5 Deriving the OLS Estimators

The OLS estimators are obtained from minimizing the sum of the squared resid-
uals:
$$\sum_{i=1}^{n}\hat{e}_i^2 = \sum_{i=1}^{n}\bigl[y_i - \hat{y}_i\bigr]^2 = \sum_{i=1}^{n}\bigl[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\bigr]^2$$

Thus we want to minimize:

$$\min_{\hat{\beta}_0, \hat{\beta}_1} S(\hat{\beta}_0, \hat{\beta}_1) \equiv \sum_{i=1}^{n}\hat{e}_i^2 = \sum_{i=1}^{n}\bigl[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\bigr]^2$$

The first order conditions of the least squares problem are:

$$\frac{\partial S(\hat{\beta}_0, \hat{\beta}_1)}{\partial \hat{\beta}_0} = \sum_{i=1}^{n} 2\bigl[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\bigr](-1) = 0 \qquad (16.1)$$

$$\frac{\partial S(\hat{\beta}_0, \hat{\beta}_1)}{\partial \hat{\beta}_1} = \sum_{i=1}^{n} 2\bigl[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\bigr](-x_i) = 0 \qquad (16.2)$$

Equation (16.1) can be simplified as:

$$\sum_{i=1}^{n} y_i - n\hat{\beta}_0 - \hat{\beta}_1 \sum_{i=1}^{n} x_i = 0 \implies n\bar{y} - n\hat{\beta}_0 - \hat{\beta}_1 n\bar{x} = 0$$

In turn, this implies,

𝛽0̂ = 𝑦 ̄ − 𝛽1̂ 𝑥̄

Using 𝛽0̂ = 𝑦 ̄ − 𝛽1̂ 𝑥̄ in equation (16.2) and simplifying:


$$\sum_{i=1}^{n} x_i y_i - \hat{\beta}_0 \sum_{i=1}^{n} x_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = 0$$
$$\sum_{i=1}^{n} x_i y_i - (\bar{y} - \hat{\beta}_1 \bar{x})\, n\bar{x} - \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = 0$$
$$\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} + \hat{\beta}_1 n\bar{x}^2 - \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = 0$$

Therefore,

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{XY}}{S_X^2} = r\,\frac{S_Y}{S_X}$$
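As an illustration (this sketch is not part of the derivation and assumes the
advertising data and model.slr from Section 16.2 are still in memory), the
estimators can be computed by hand and compared with the lm() output.

x <- advertising$TV
y <- advertising$sales

beta1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0.hat <- mean(y) - beta1.hat * mean(x)

c(beta0.hat, beta1.hat)  # should reproduce 7.032594 and 0.047537
coef(model.slr)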

16.6 Exercises

Exercise 16.1. Consider a regression predicting weight (kg) from height (cm) for
a sample of adult males. What are, respectively, the units of:

• the intercept,
• the slope.

For the following exercises, use the Advertising data set (download here1 ).

Exercise 16.2. Consider a simple linear model,

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖 .

a. Illustrate the fact that the linear fit passes through (𝑥,̄ 𝑦)̄ , i.e., the
point defined by the mean of each variable. In other words, 𝛽0̂ and 𝛽1̂
satisfy:

𝑦 ̄ = 𝛽0̂ + 𝛽1̂ 𝑥̄
Hint: you can access a variable from a data frame, say df, by appending $ and the
name of the variable to df, e.g., df$TV.

b. Illustrate the fact that the sum of the OLS residuals is equal to 0, i.e.,
1
https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335655

𝑛
∑ 𝑒𝑖̂ = 0
𝑖=1

Hint: you can obtain the fitted values of an estimated model, say m1, by ap-
pending $fitted.values to its name, i.e., m1$fitted.values.

c. Illustrate the following result,

𝑠𝑦
𝛽1̂ = 𝑟
𝑠𝑥

where 𝑟 is the correlation between 𝑥 and 𝑦 , and 𝑠 is the sample’s standard devi-
ation of the given variable.
17
Multiple Linear Regression

TL;DR This chapter extends the discussion about


the linear regression model to the case of several ex-
planatory variables.[17.1]
A discussion about the coefficients of the model is
presented.[17.2].

17.1 Multiple Linear Regression Model

In the simple linear regression model we considered that the dependent variable
was a function of a single independent variable.
But, in many practical economic, financial, and managerial situations, the depen-
dent variable is influenced by more than one factor:

Price𝑖 = 𝛽0 + 𝛽1 N. Bedrooms𝑖 + 𝛽2 Size𝑖 + 𝛽3 Year Built𝑖 + 𝜀𝑖

where 𝛽0 is the intercept and 𝛽1 , 𝛽2 and 𝛽3 are the three parameters quantifying
the impact of the N. Bedrooms, Size and Year Built on Price.
Even though the multiple linear regression model is similar to the simple model,
the interpretation of some results is not exactly the same. The “House Price”


model now tries to capture the joint influence of three different factors, isolating
the partial effect of each type.
Its general form is now,

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑘 𝑋𝑘 + 𝜀

Multiple regression enables us to determine the simultaneous effect of several


independent variables on a dependent variable using the least squares principle.
Recall that the regression objectives are at least twofold:

1. To predict the dependent variable, 𝑌 , as a function of several observed


independent variables, 𝑋𝑗 , where 𝑗 = 1, 2, ..., 𝑘 and where 𝑖 = 1, ..., 𝑛
observations:

𝑦𝑖̂ = 𝛽0̂ + 𝛽1̂ 𝑥1𝑖 + 𝛽2̂ 𝑥2𝑖 ⋯ + 𝛽𝑘̂ 𝑥𝑘𝑖

The predicted value depends on the effect of the independent variables individ-
ually and their effect in combination with the other independent variables.

2. To estimate the marginal change in the dependent variable, 𝑌 , that is


related to changes in the independent variables.

The coefficient 𝛽𝑗̂ estimates the change in 𝑌 , given a unit change in 𝑋𝑗 , while
controlling for the effect of the other independent variables.

17.1.1 Partial Effects

Note that the model has now 𝑘 independent variables and 𝑘 + 1 parameters.
The coefficient 𝛽𝑗 gives us the change in the expected value of 𝑌 ,
𝐸[𝑌 |𝑋1 , 𝑋2 , … , 𝑋𝑘 ], resulting from a unit increase in 𝑋𝑗 while holding
the other factors fixed (keeping the other independent variables unchanged):

$$\frac{\partial y}{\partial x_j} = \beta_j$$

17.1.2 Analyzing a Multiple-Regression Model

1. Hypothesize the deterministic component of the model - specify the way


the mean of 𝑌 relates to the independent variables 𝑋1 , 𝑋2 , ..., 𝑋𝑘 .
2. Use the sample data to estimate the unknown parameters 𝛽0 , 𝛽1 , ..., 𝛽𝑘
in the model.
3. Specify the probability distribution of the random-error term 𝜀𝑖 , and es-
timate the standard deviation 𝜎̂ of this distribution.
4. Check that the assumptions about 𝜀𝑖 are satisfied, and make modifica-
tions to the model if necessary.
5. Statistically evaluate the usefulness of the model.
6. When you are satisfied that the model is useful, use it for prediction,
estimation, and other purposes.

17.2 OLS Estimated Model

The unknown regression parameters have to be estimated using a sample with 𝑛


observations of all the 𝑘 independent variables and of the dependent variable:

• 𝛽0̂ : estimator for the intercept 𝛽0

• 𝛽𝑗̂ : estimator for the slopes 𝛽𝑗 (𝑗 = 1, 2, 3, … , 𝑘)

The sample regression model now writes as:

𝑦𝑖̂ = 𝛽0̂ + 𝛽1̂ 𝑥1𝑖 + 𝛽2̂ 𝑥2𝑖 ⋯ + 𝛽𝑘̂ 𝑥𝑘𝑖


where 𝑦𝑖̂ are the estimated/predicted values of the dependent variable.
The estimation residuals are still defined as:

𝑒𝑖̂ = 𝑦𝑖 − 𝑦𝑖̂
The (Ordinary) Least Squares (OLS) estimators are the solution to:
$$\min_{\hat{\beta}_j} \sum_{i=1}^{n}\hat{e}_i^2 = \sum_{i=1}^{n}\bigl[y_i - \hat{y}_i\bigr]^2$$

17.2.1 Two Regressors Illustration

Consider the regression model with only two predictor variables:

𝑦𝑖̂ = 𝛽0̂ + 𝛽1̂ 𝑥1 + 𝛽2̂ 𝑥2

The coefficient estimators are computed using the following equations:

𝛽0̂ = 𝑦 ̄ − 𝛽1̂ 𝑥1̄ − 𝛽2̂ 𝑥2̄


$$\hat{\beta}_1 = \frac{s_y\,(r_{x_1 y} - r_{x_1 x_2}\, r_{x_2 y})}{s_{x_1}\,(1 - r_{x_1 x_2}^2)}$$
$$\hat{\beta}_2 = \frac{s_y\,(r_{x_2 y} - r_{x_1 x_2}\, r_{x_1 y})}{s_{x_2}\,(1 - r_{x_1 x_2}^2)}$$

where

• 𝑟𝑥1 𝑦 is the sample correlation between 𝑋1 and 𝑌 ,


• 𝑟𝑥2 𝑦 is the sample correlation between 𝑋2 and 𝑌 ,
• 𝑟𝑥1 𝑥2 is the sample correlation between 𝑋1 and 𝑋2 ,
• 𝑠𝑥1 is the sample standard deviation for 𝑋1 ,
• 𝑠𝑥2 is the sample standard deviation for 𝑋2 ,
• 𝑠𝑦 is the sample standard deviation for 𝑌 .
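As a hedged illustration of these formulas (the original text does not apply them
to data), the sketch below uses the Advertising data from Chapter 16, taking TV
as 𝑋1 and radio as 𝑋2, and compares the hand-computed coefficients with those
obtained from lm().

y  <- advertising$sales
x1 <- advertising$TV
x2 <- advertising$radio

r.x1y  <- cor(x1, y)
r.x2y  <- cor(x2, y)
r.x1x2 <- cor(x1, x2)

beta1.hat <- sd(y) / sd(x1) * (r.x1y - r.x1x2 * r.x2y) / (1 - r.x1x2^2)
beta2.hat <- sd(y) / sd(x2) * (r.x2y - r.x1x2 * r.x1y) / (1 - r.x1x2^2)
beta0.hat <- mean(y) - beta1.hat * mean(x1) - beta2.hat * mean(x2)

c(beta0.hat, beta1.hat, beta2.hat)
coef(lm(sales ~ TV + radio, data = advertising))  # should match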

17.2.2 Properties of OLS Estimators in Multiple Regression

• The regression line passes through the point of the means, $(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k, \bar{y})$.
• The mean of the residuals is zero: $\bar{\hat{e}} = 0$.
• The sample means of the estimated and of the observed values are equal: $\bar{\hat{y}} = \bar{y}$.
• The sample covariances between the residuals and the independent variables
are zero: $s_{\hat{e}X_j} = 0$.
• The sample covariance between the residuals and the estimated values is zero:
$s_{\hat{e}\hat{Y}} = 0$.

17.3 Exercises

For the following exercises, use the Advertising data set (download here1 ).

Exercise 17.1. Consider a multiple linear regression model,

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + 𝛽3 𝑥3𝑖 + 𝜀𝑖 .

a. Estimate such a model in the data.

b. How do you interpret the coefficients?

c. Illustrate the fact that the linear fit passes through the mean of each
variable.

Hint: you can access a variable from a data frame, say df, by appending $ and the
name of the variable to df, e.g., df$TV.

d. Based on the model that you estimated above, suppose that you want
to illustrate the linear fit in the sales (𝑦 ) - TV (𝑥) quadrant, i.e., with a
single line. What choice does it imply about the other explanatory vari-
ables?
e. Illustrate the fact that the sum of the OLS residuals is equal to 0, i.e.,

𝑛
∑ 𝑒𝑖̂ = 0
𝑖=1

Hint: you can obtain the fitted values of an estimated model, say m1, by ap-
pending $fitted.values to its name, i.e., m1$fitted.values.

1
https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335655
18
Assumptions

TL;DR The linear regression model is valid under


a specific set of assumptions. They are presented
here.[18.1]

18.1 When is the Model Valid?

The OLS model is predicated on the following assumptions.


A0. The relation between 𝑌 and 𝑋 is linear in the parameters.
A1. The random error has zero expected value.
A2. The random error has constant variance.
A3. The random errors are independent, they are not correlated with one another.
A4. The random error is unrelated with the explanatory variable.
A5. The random error is normally distributed.

18.2 Assumption 0

The relation between 𝑌 and 𝑋 is linear in the parameters (in other words, the
𝑌 ’s are linear functions of 𝑋 plus a random error term).

This assumption addresses the functional form of the model. In statistics, a re-
gression model is linear when all terms in the model are either the constant or a
parameter multiplied by an independent variable.
The model equation is built by adding the terms together. These rules constrain
the model to one type:

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖

18.3 Assumption 1

The random error has zero expected value:

𝐸[𝜀𝑖 |𝑋] = 0
In other words, the mean of the probability distribution of 𝜀𝑖 is 0. That is, the
average of the values of 𝜀𝑖 over an infinitely long series of experiments is 0 for
each setting of the independent variable 𝑋 .
This assumption implies that the mean value of 𝑌 , for a given value of 𝑋 , is:

𝐸[𝑦𝑖 |𝑋 = 𝑥𝑖 ] = 𝛽0 + 𝛽1 𝑥𝑖

18.4 Assumption 2

The random error has constant variance:

𝑉 𝑎𝑟[𝜀𝑖 |𝑋] = 𝐸[𝜀2𝑖 ] = 𝜎2


The variance of the probability distribution of 𝜀 is constant for all settings of the
independent variable 𝑋 . For our straight-line model, this assumption means that
the variance of 𝜀 is equal to a constant - say, 𝜎2 - for all values of 𝑋 .
This property is called homoscedasticity, or uniform variance: 𝐸[𝜀2𝑖 ] = 𝜎2 for
(𝑖 = 1, ..., 𝑛). When this assumption is violated, then we have heteroskedasticity


in our model.
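As a quick, informal check of this assumption (a sketch that assumes the model.slr
object of Chapter 16 is still in memory), one can plot the residuals against the
fitted values; a funnel-like pattern in this plot would suggest heteroskedasticity.

plot(model.slr$fitted.values, model.slr$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # residuals should scatter evenly around this line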

18.5 Assumption 3

The random errors 𝜀𝑖 are independent, they are not correlated with one another:

𝐶𝑜𝑣(𝜀𝑖 , 𝜀𝑗 |𝑋) = 𝐸[𝜀𝑖 𝜀𝑗 ] = 0 for 𝑖 ≠ 𝑗

The values of 𝜀 associated with any two observed values of 𝑌 are independent.
That is, the value of 𝜀 associated with one value of 𝑌 has no effect on any of the
values of 𝜀 associated with any other 𝑌 values.

18.6 Assumption 4

The random error is unrelated with 𝑋 :

𝐶𝑜𝑣(𝜀𝑖 , 𝑋𝑖 ) = 0

In other words, the 𝑋 values are fixed numbers, or realizations of the random
variable 𝑋 that are independent of the error terms, 𝜀𝑖 (𝑖 = 1, ..., 𝑛).
If an independent variable is correlated with the error term, we can use the in-
dependent variable to predict the error term, which violates the notion that the
error term represents unpredictable random error.
This assumption is also referred to as exogeneity. When this type of correlation
exists, and the assumption is violated, there is endogeneity.

18.7 Assumption 5

The random error is normally distributed.


19
Goodness of the Fit

TL;DR The criterion for judging the quality of the


fit is presented as the ratio of sample variability ex-
plained by the model over the total sample variability
of the dependent variable.

19.1 Sample Variability

The objective of simple regression is to explain (part of) the variability of a de-
pendent variable 𝑌 by an independent variable 𝑋 . That is to say that part of the
observed changes in 𝑌 result from changes in 𝑋 .

19.1.1 Total Sample Variability (TSS)

The total sample variability of 𝑌 is given by the total sum of squares:


$$TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2$$

The TSS is fixed and independent of the regression coefficients. We can use Figure
19.1 to get an illustration of the TSS which is calculated using the squares of the
distances shown in blue.

FIGURE 19.1: Using the mean as the best fit and the resulting residuals.

FIGURE 19.2: Linear fit and residuals.

19.1.2 Unexplained Sample Variability (RSS)

We can use Figure 19.2 to get an illustration of the RSS which is calculated using
the squares of the distances shown in red.

19.1.3 Explained Sample Variability (ESS)

The variability of the estimated/predicted values of the dependent variable is


given by the explained (by regression) sum of squares.
$$ESS = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$

We expect 0 ≤ 𝐸𝑆𝑆 ≤ 𝑇 𝑆𝑆. The ESS is the sum of the squared vertical distances
between the fitted values on the estimated regression line and the mean of 𝑦.

19.2 Decomposition of the Total Sample Variability

We can decompose the total variability of 𝑌 in an explained component and a


residual component. The deviation of any observed value 𝑦𝑖 around 𝑦 ̄ can be
seen as the sum of:

(i) The deviation of the predicted value 𝑦𝑖̂ around 𝑦 .̄


(ii) The deviation of the observed value 𝑦𝑖 from its predicted value 𝑦𝑖̂ .

(𝑦𝑖 − 𝑦)̄ = (𝑦𝑖̂ − 𝑦)̄ + (𝑦𝑖 − 𝑦𝑖̂ ) = (𝑦𝑖̂ − 𝑦)̄ + 𝑒𝑖̂

Squaring both sides of the equality, summing for all sample elements and sim-
plifying, we obtain the decomposition of 𝑇 𝑆𝑆 into 𝐸𝑆𝑆 and 𝑅𝑆𝑆 .
$$\underbrace{\sum_{i=1}^{n}(y_i - \bar{y})^2}_{TSS} = \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{ESS} + \underbrace{\sum_{i=1}^{n}\hat{e}_i^2}_{RSS}$$

TSS = Total Sum of Squares, measures the variation of the 𝑦𝑖 values around
their mean 𝑦.̄

ESS = Explained Sum of Squares, measures the variation explained by the linear
regression model, that is, the variation attributable to the relationship between
𝑋 and 𝑌 .
RSS = Residual Sum of Squares, measures the amount of variation attributable
to factors other than the relationship between 𝑋 and 𝑌 .
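These three quantities are easy to compute directly. The sketch below (assuming
the advertising data and model.slr from Chapter 16 are still in memory) verifies
the decomposition numerically.

y     <- advertising$sales
y.hat <- model.slr$fitted.values

TSS <- sum((y - mean(y))^2)
ESS <- sum((y.hat - mean(y))^2)
RSS <- sum((y - y.hat)^2)

all.equal(TSS, ESS + RSS)  # the decomposition should hold up to rounding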

19.3 The Coefficient of Determination, 𝑅2

The coefficient of determination (𝑅2 ) measures the proportion of the total vari-
ability of the dependent variable that is explained by the regression model, i.e.,
by the independent variable:

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$
Since 0 ≤ 𝐸𝑆𝑆 ≤ 𝑇 𝑆𝑆

0 ≤ 𝑅2 ≤ 1

The coefficient of determination of the simple regression model is the square of


the coefficient of correlation between the dependent and the independent vari-
able:
$$R^2 = r^2$$

𝑅2 gives us the percentage of the response variable variation that is explained


by a linear model. It is always between 0 and 100%.
𝑅2 is a statistical measure of how close the data are to the fitted regression line.
In general, the higher the 𝑅2 , the better the model fits your data.
However, before you can trust the statistical measures for goodness-of-fit, like
𝑅2 , you should check the residual plots for unwanted patterns that indicate bi-
ased results.

19.3.1 Adjusted 𝑅2

The adjusted coefficient of determination is defined as:

$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k}$$
We use this measure to correct for the fact that nonrelevant independent vari-
ables will result in some small reduction in the error sum of squares.
The adjusted 𝑅2 provides a better comparison between multiple regression mod-
els with different numbers of independent variables.
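A short sketch (reusing TSS and RSS from the sketch above and assuming, as an
interpretation, that 𝑘 counts the estimated coefficients with the intercept
included, i.e., 𝑘 = 2 for the simple model) reproduces the values reported by
summary(model.slr).

n  <- nrow(advertising)
k  <- 2  # assumption: number of coefficients, intercept included
R2 <- 1 - RSS / TSS
adj.R2 <- 1 - (1 - R2) * (n - 1) / (n - k)
c(R2, adj.R2)  # should be close to 0.6119 and 0.6099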

19.4 The Standard Error of the Regression

Recall that the population error 𝜀 is a random variable with zero mean and vari-
ance 𝜎2 . The variance of 𝜀 can be estimated using the residual sum of squares:

$$\hat{\sigma}^2 = \frac{RSS}{n-2} \qquad \text{with} \qquad RSS = \sum_{i=1}^{n}\hat{e}_i^2$$

The global quality of the model fit can also be assessed by the standard error
of the regression, which measures the variation of our observations around the
regression line.

$$\hat{\sigma} = \sqrt{\frac{RSS}{n-2}}$$
The standard error of the regression is also known as the standard error of the esti-
mate. It represents the average distance that the observed values fall from the
regression line. It tells us how wrong the regression model is on average using
the units of the response variable. Smaller values are better because they indicate
that the observations are closer to the fitted line.
Unlike 𝑅2 , we can use the standard error of the regression to assess the preci-
sion of the predictions. Approximately 95% of the observations should fall within
±2 × 𝜎̂ from the regression line, which is also a quick approximation of a 95%
prediction interval. If we want to use a regression model to make predictions,
assessing the standard error of the regression might be more important than as-
sessing 𝑅2 .
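As an illustration (a sketch assuming model.slr and the advertising data from
Chapter 16), the residual standard error can be recomputed by hand and compared
with the value 3.259 reported by summary(model.slr).

RSS       <- sum(model.slr$residuals^2)
n         <- nrow(advertising)
sigma.hat <- sqrt(RSS / (n - 2))
sigma.hat  # should be about 3.259
# roughly 95% of the observations fall within +/- 2 * sigma.hat of the fit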
20
Inference

TL;DR This chapter discusses inference for the co-


efficients of the linear model. The null hypothesis is
that, under the assumptions, the value of the coeffi-
cient is 0.

20.1 Sampling Distributions of the 𝛽̂ ’s

Having developed estimators for the coefficients 𝛽0 and 𝛽1 and for 𝜎2 , we are
ready to make inferences about the population model. Specifically, we are inter-
ested in computing confidence intervals and conducting hypothesis tests for the
parameters of interest.
Recall our population model in the simple regression model,

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖
In short, and given the normality of the errors 𝜀𝑖 , the sampling distributions of
𝛽0̂ and 𝛽1̂ are:

$$\hat{\beta}_0 \sim N\!\left(\beta_0,\ \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{n S_X^2}\right]\right)$$

and

$$\hat{\beta}_1 \sim N\!\left(\beta_1,\ \frac{\sigma^2}{n S_X^2}\right)$$
Under assumptions A.0 to A.4, the OLS estimators are BLUE - they are the best
(lowest variance) among all linear estimators which are unbiased (Gauss-Markov
Theorem).
Notice that 𝜎2 is typically unknown and must be estimated. Hence, the sampling
distribution becomes,

$$\hat{\beta}_1 \sim N\!\left(\beta_1,\ \frac{\hat{\sigma}^2}{n S_X^2}\right)$$

It is important to note that the variance of 𝛽1̂ depends on two important quanti-
ties:

• The distance of the points from the regression line, measured by $\sum \hat{e}_i^2$: higher
values imply greater variance for 𝛽1̂ .
• The total deviation of the 𝑋 values from the mean, which is measured by $S_X^2$:
greater deviations in the 𝑋 values and larger sample sizes result in smaller
variance for 𝛽1̂ .
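As an illustration (a sketch assuming model.slr and the advertising data, and
reading 𝑆𝑋² as the uncorrected sample variance so that 𝑛𝑆𝑋² equals the sum of
squared deviations of 𝑥), the standard error of the slope can be recomputed and
compared with the 0.002691 reported in the regression output of Chapter 16.

x         <- advertising$TV
sigma.hat <- summary(model.slr)$sigma
se.beta1  <- sqrt(sigma.hat^2 / sum((x - mean(x))^2))
se.beta1  # should be about 0.002691
summary(model.slr)$coefficients["TV", "Std. Error"]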

20.2 Estimating 𝜎2

Note, that the variance of both estimators depends on the error variance 𝜎2 , a
population parameter, which is typically unknown.
Therefore, we will need an estimate of 𝜎2 . Previously, we have learned that we
can use

$$\hat{\sigma}^2 = \frac{\sum \hat{e}_i^2}{n-2}$$
as an estimator of 𝜎2 . It can be shown that:

$$E[\hat{\sigma}^2] = E\left[\sum_{i=1}^{n}\hat{e}_i^2/(n-2)\right] = \sigma^2$$

and

$$\frac{(n-2)\,\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{(n-2)}$$

And the distribution of 𝜎̂ is independent of 𝛽0̂ and 𝛽1̂ .

20.3 Inference on the Slopes

In our simple linear regression model 𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖 , 𝑌 is assumed to have
a linear relationship with 𝑋 .
If 𝛽1 = 0 then the term 𝛽1 𝑥𝑖 would drop out of the expression and 𝑌 would not
be linearly related to 𝑋 . In other words, 𝑌 would not continuously increase or
decrease with increases in 𝑋 .
To determine whether there is a linear relationship between 𝑌 and 𝑋 , we can
test the hypothesis:

𝐻0 ∶ 𝛽1 = 0 vs 𝐻𝑎 ∶ 𝛽1 ≠ 0.

Given that 𝛽1̂ is normally distributed, our test statistic will be:

$$t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}}$$

which we will compare against the appropriate critical value from a 𝑡 distribu-
tion with (𝑛 − 2) degrees of freedom.
Inference on the slope will then follow the usual rules for tests of hypothesis for
the mean of a single variable. For instance, if the test is bilateral, as it almost
always is in this context, we have

$$\text{Reject } H_0 \text{ if } \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} < -t_{(n-2),\alpha/2} \quad \text{or} \quad \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} > t_{(n-2),\alpha/2}$$
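A minimal sketch (assuming model.slr and the advertising data from Chapter 16)
reproduces the t value for the TV slope and the ingredients of this decision rule.

beta1.hat <- coef(model.slr)["TV"]
se.beta1  <- summary(model.slr)$coefficients["TV", "Std. Error"]
t.stat    <- beta1.hat / se.beta1  # should be about 17.67, as in summary()

n <- nrow(advertising)
qt(0.975, df = n - 2)  # critical value for a bilateral test at alpha = 5%
2 * pt(abs(t.stat), df = n - 2, lower.tail = FALSE)  # two-sided p-value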
21
Categorical Predictors

TL;DR This chapter discusses the case of categorical


variables as explanatory variables in the regression
model.

21.1 Introduction

Multiple-regression models can also be written to include qualitative (or cate-


gorical) independent variables. In such cases, we must code the values of the
qualitative variable (called levels) as numbers before we can fit the model.
The coded qualitative variables are called dummy (or indicator) variables, since
the numbers assigned to the various levels are arbitrarily selected.

21.1.1 Simplest Illustration

Suppose a female executive at a certain company claims that male executives


earn higher salaries, on average, than female executives with the same educa-
tion, experience, and responsibilities.
To support her claim, she wants to model the salary 𝑦 of an executive, using a


qualitative independent variable representing the gender of the executive (male


or female).
A convenient method of coding the values of a qualitative variable at two levels
involves assigning a value of 1 to one of the levels and a value of 0 to the other.

$$x = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}$$

The choice of which level is assigned to 1 and which is assigned to 0 is arbitrary.


The model then takes the following form:

𝑦 = 𝛽 0 + 𝛽1 𝑥 + 𝜀

𝛽0 represents the mean response associated with the level of the qualitative vari-
able assigned the value 0 (called the base level). In this example, it represents the
mean salary for females 𝜇𝐹 .
𝛽1 represents the difference between the mean response for the level assigned
the value 1 and the mean for the base level. In this example, it represents the
difference between the mean salary for males and the mean salary for females,
𝜇𝑀 − 𝜇 𝐹 .

21.1.2 Including a Dummy with Two Levels

Categorical variables enter a regression model in the form of dummy variables.


For instance, a categorical variable with two levels,

$$D_A = \begin{cases} 1 & \text{if level A} \\ 0 & \text{if level B} \end{cases}$$

would be included in a model as,

𝑦 = 𝛽0 + 𝛽1 𝐷𝐴 + 𝛾𝑍 + 𝜀

where 𝑍 is a set of other relevant regressors and 𝛾 the coefficient associated
with it.

The estimated model then provides predictions for all the levels of the categorical
variable:

$$\hat{y}_A = \hat{\beta}_0 + \hat{\beta}_1 \cdot 1 + \hat{\gamma}Z = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\gamma}Z$$
$$\hat{y}_B = \hat{\beta}_0 + \hat{\beta}_1 \cdot 0 + \hat{\gamma}Z = \hat{\beta}_0 + \hat{\gamma}Z$$

The interpretation of 𝛽1 is therefore key in this context. Other things equal, it is


simply the difference in the mean/prediction between:

• the group for which the dummy 𝐷𝐴 of the categorical variable is 1, 𝐴 in this
case, and,
• the group for which the dummy 𝐷𝐴 of the categorical variable is 0, 𝐵 in this
case.

21.2 Including a Dummy with Multiple Levels

The qualitative independent variable can take multiple levels. As an illustration,


we could add the level 𝐶 to the example above.
The coding of that variable now requires the creation of two dummy variables.

$$D_A = \begin{cases} 1 & \text{if level A} \\ 0 & \text{if not level A} \end{cases}
\qquad
D_B = \begin{cases} 1 & \text{if level B} \\ 0 & \text{if not level B} \end{cases}$$

Notice that the creation of

$$D_C = \begin{cases} 1 & \text{if level C} \\ 0 & \text{if not level C} \end{cases}$$

is superfluous because the observations for level 𝐶 are already implicitly de-
fined, i.e., they are those for which 𝐷𝐴 = 𝐷𝐵 = 0. In that case, we say that the
level 𝐶 is the reference level.
The estimated model now becomes,

𝑦 = 𝛽0 + 𝛽1 𝐷𝐴 + 𝛽2 𝐷𝐵 + 𝛾𝑍 + 𝜀
The estimated model then provides predictions for all the levels of the categorical
variable:

$$\hat{y}_A = \hat{\beta}_0 + \hat{\beta}_1 \cdot 1 + \hat{\beta}_2 \cdot 0 + \hat{\gamma}Z = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\gamma}Z$$
$$\hat{y}_B = \hat{\beta}_0 + \hat{\beta}_1 \cdot 0 + \hat{\beta}_2 \cdot 1 + \hat{\gamma}Z = \hat{\beta}_0 + \hat{\beta}_2 + \hat{\gamma}Z$$
$$\hat{y}_C = \hat{\beta}_0 + \hat{\beta}_1 \cdot 0 + \hat{\beta}_2 \cdot 0 + \hat{\gamma}Z = \hat{\beta}_0 + \hat{\gamma}Z$$

The interpretation of 𝛽1 is therefore key in this context. Other things equal, it is


simply the difference in the mean/prediction between:

• the group for which the dummy 𝐷𝐴 of the categorical variable is 1, 𝐴 in this
case, and,
• the group for the reference level, 𝐶 in this case.

As for 𝛽2 , other things equal, it is simply the difference in the mean/prediction


between:

• the group for which the dummy 𝐷𝐵 of the categorical variable is 1, 𝐵 in this
case, and,
• the group for the reference level, 𝐶 in this case.

21.3 Including Multiple Dummies

If a model requires it, then multiple dummy variables can be included. For in-
stance, we can add to the model above a dummy about the gender,

$$D_G = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}$$

The estimated model becomes,

𝑦 = 𝛽0 + 𝛽1 𝐷𝐴 + 𝛽2 𝐷𝐵 + 𝛽3 𝐷𝐺 + 𝛾𝑍 + 𝜀

Notice that the interpretation of the coefficients is the same as above. However,
we can now create more subgroups, e.g., 𝐴-female, 𝐵-female, 𝐵-male, etc.

21.4 The Dummy Variable Trap

The general principle of dummy variables can be extended to cases where there
are several (but not infinite) discrete groups/categories. In general just define a
dummy variable for each category.
For instance, if number of groups is 3 (North, Midlands, South), then define:

• 𝐷𝑁𝑜𝑟𝑡ℎ = 1 if live in the North, 0 otherwise.


• 𝐷𝑀𝑖𝑑𝑙𝑎𝑛𝑑𝑠 = 1 if live in the Midlands, 0 otherwise.
• 𝐷𝑆𝑜𝑢𝑡ℎ = 1 if live in the South, 0 otherwise.

However, as a rule we always include one less dummy variable in the model
than there are categories; otherwise we will introduce multicollinearity into the
model. For a qualitative variable with 𝑘 levels, use 𝑘 − 1 dummy variables.
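In R, this rule is applied automatically when the categorical variable is stored
as a factor. The sketch below (with a small made-up vector, only for illustration)
shows that model.matrix() creates one dummy less than there are levels, the
first level acting as the reference.

region <- factor(c("North", "Midlands", "South", "North", "South"))
model.matrix(~ region)  # an intercept plus 2 dummies for the 3 levels
# lm(y ~ region, data = ...) uses the same coding, so the dummy variable
# trap is avoided without creating the dummies by hand.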

21.5 Exercises

Exercise 21.1. Consider the dummy variables defined in Section 21.2. Suppose
the estimation of the (true) model finds the following values: 𝛽0̂ = 10, 𝛽1̂ = 5,
𝛽2̂ = 1, and 𝛾̂ = 0.

Now suppose that you estimate a model based on the same variables but where
you include 𝐷𝐶 .

𝑦 = 𝛼0 + 𝛼1 𝐷𝐴 + 𝛼3 𝐷𝐶 + 𝛾𝑍 + 𝜀

What would be the values of 𝛼0̂ , 𝛼1̂ , 𝛼3̂ , and 𝛾̂ ?

Exercise 21.2. We estimate a multiple regression model to explain the price of a


bottle of Portuguese wine, 𝑝, over the 2010-2018 period, based on the following
dummy variables.
Type: red, white, others (e.g., ‘verde’ or ‘rosé’)

• 𝑇1 = 1 if the wine is sold as red wine, 0 otherwise,


• 𝑇2 = 1 if the wine is sold as white wine, 0 otherwise.

Region: Douro, Alentejo, Dão, others

• 𝑅1 = 1 if the wine is from the Douro region, 0 otherwise,


• 𝑅2 = 1 if the wine is from the Alentejo region, 0 otherwise,
• 𝑅3 = 1 if the wine is from the Dão region, 0 otherwise.

Place of sale: specialized wine shop or any other place (e.g., supermarket)

• 𝑆1 = 1 if the wine is sold in a specialized wine shop, 0 otherwise.

The estimated model has the following specification (with 𝜀 being a random
component):

𝑝 = 𝛽0 + 𝛽1 𝑇1 + 𝛽2 𝑇2 + 𝛽3 𝑅1 + 𝛽4 𝑅2 + 𝛽5 𝑅3 + 𝛽6 𝑆1 + 𝜀

The estimated coefficients obtained by OLS are the following:



𝛽0̂ = 5.43, 𝛽1̂ = 1.21, 𝛽2̂ = 1.54, 𝛽3̂ = 0.73, 𝛽4̂ = −0.22, 𝛽5̂ = 0.15, 𝛽6̂ = 1.02.

a. Calculate the estimated price of a bottle of white wine from the Dão
region sold in a supermarket?
b. Calculate the estimated price of a bottle of ‘rosé’ wine from a region
outside Douro, Alentejo or Dão sold in a specialized wine shop?

Now, suppose we add the year of production, 𝑌 , as a continuous variable taking


the values 2010 through 2018.

𝑝 = 𝛽0 + 𝛽1 𝑇1 + 𝛽2 𝑇2 + 𝛽3 𝑅1 + 𝛽4 𝑅2 + 𝛽5 𝑅3 + 𝛽6 𝑆1 + 𝛽7 𝑌 + 𝜀

c. What does the coefficient 𝛽7 measure? How would you interpret its
value?
d. Suggest an alternative way of incorporating in the regression model
a variable based on the year of production. Explain your choice.
22
Simulating Violations of Assumptions

TL;DR This chapter introduces simulations as a tool


to investigate the consequences of violations of the
assumptions for the linear model.

22.1 Introduction

Simulations represent an invaluable tool for understanding the capacities and


the limitations of other tools. This is a topic on its own and would deserve a
larger treatment.
Here, we will use simulations in a simplified manner with a single objective in
mind: evaluate the consequences of using the linear regression model when its
assumptions are not satisfied.
The logic is the following:

• Since the data is simulated, we know the real parameters of the model.
• We estimate a linear regression for that model.
• Finally we evaluate how well the linear regression estimation performed by
comparing various measures to what should be expected if the assumptions of
the linear model were satisfied.

FIGURE 22.1: Scatter plot of simulated data in best case scenario.

22.2 Best Case Scenario

The first exercise is the best case scenario, in which all the assumptions are
satisfied and we try to uncover the true model.

22.2.1 Simulating One Occurrence

I start by illustrating the case with one sample. The data is generated with the
following parameters. Figure 22.1 gives an example of a simulated sample.

set.seed(43)

n <- 50
x <- rnorm(n, mean= 100, sd=25)
epsilon <- rnorm(n, mean=0, sd=10)
y <- 5 + 3*x + epsilon

df <- tibble(y=y, x=x, epsilon=epsilon)

I now try to uncover the model. Sure enough, the results are excellent, meaning
that the linear regression model works wonders when the true model is indeed
linear, see Figure 22.2.

m0 <- lm(y~x, data = df)


summary(m0)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.1306 -6.2598 -0.6175 6.1606 25.4533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.72081 6.23717 2.04 0.0469 *
## x 2.93055 0.05986 48.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.11 on 48 degrees of freedom
## Multiple R-squared: 0.9804, Adjusted R-squared: 0.98
## F-statistic: 2397 on 1 and 48 DF, p-value: < 2.2e-16

22.2.2 Simulating Several Occurrences

Of course, one sample is never enough to establish a result. An approximation


of the whole sampling distribution can be obtained though thousands of simu-
lations.

# simulating several occurrences


set.seed(43)
n.s <- 1e3
beta1.hat <- numeric(n.s)

n <- 50

FIGURE 22.2: Scatter plot of simulated data in best case scenario along with true
relationship (red) and OLS fit (blue).

beta0 <- 5
beta1 <- 3
mean.x <- 100
sd.x <- 20
mean.error <- 0
sd.error <- 10

for (i in 1:n.s){
x <- rnorm(n, mean = mean.x, sd = sd.x)
epsilon <- rnorm(n, mean= mean.error, sd=sd.error)
y <- beta0 + beta1 * x + epsilon
m0 <- lm(y~x)
beta1.hat[i] <- m0$coefficients[2]
}

A summary of the values obtained is the following (see Figure 22.3 for a visual
representation)
FIGURE 22.3: Density estimate for the simulated slope coefficient.

summary(beta1.hat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.759 2.950 2.994 2.994 3.041 3.190

22.2.3 Simulating a Multiple Linear Regression

Nothing prevents us from simulating a richer model. The code is messier because
I want explanatory variables to have some degree of correlation.

# two variables
library(MASS)

n <- 50
my.means <- c(100, 100)
# off-diagonal = cov
# corr <- cov/(sqrt(var1*var2))
sd1 <- 80
sd2 <- 40
r <- 0.6
my.sigma <- matrix(c(sd1^2, r*sd1*sd2, r*sd1*sd2, sd2^2), ncol=2)

X <- mvrnorm(n,
mu = my.means,
Sigma = my.sigma,
empirical = TRUE)
x1 <- X[,1]
x2 <- X[,2]
epsilon <- rnorm(n, mean = 0, sd = 10)
y <- 20 + 1.2*x1 + 7.4*x2 + epsilon

df <- tibble(y=y, x1=x1, x2=x2)

How does the linear regression perform in one sample?

m1 <- lm(y ~ x1 + x2, data=df)


summary(m1)
##
## Call:
## lm(formula = y ~ x1 + x2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.8397 -5.7589 0.9407 5.0554 27.1644
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.00263 3.55227 5.631 9.7e-07 ***
## x1 1.19284 0.02050 58.175 < 2e-16 ***
## x2 7.41380 0.04101 180.787 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.186 on 47 degrees of freedom
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9994
## F-statistic: 3.804e+04 on 2 and 47 DF, p-value: < 2.2e-16
22.3 Omitted Variable Issue

I now turn to illustrations of violations of the basic assumptions, starting with an
omitted variable that is correlated with the included explanatory variable. Since
the omitted variable enters the true model with a positive coefficient, the slope on
the included variable is biased upwards when the correlation is positive and
downwards when it is negative, as the two estimations below illustrate.

22.3.1 𝑟>0

# two variables
library(MASS)
n <- 50
my.means <- c(100, 100)
# off-diagonal = cov
# corr <- cov/(sqrt(var1*var2))
sd1 <- 80
sd2 <- 40
r <- 0.6

my.sigma <- matrix(c(sd1^2, r*sd1*sd2, r*sd1*sd2, sd2^2), ncol=2)

X <- mvrnorm(n,
mu = my.means,
Sigma = my.sigma,
empirical = TRUE)
x1 <- X[,1]
x2 <- X[,2]

epsilon <- rnorm(n, mean = 0, sd = 10)


y <- 20 + 1.2*x1 + 7.4* x2 + epsilon

m.omitted1 <- lm(y ~ x1 )


summary(m.omitted1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -543.91 -166.73 -34.65 204.62 449.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 540.2932 54.7800 9.863 3.97e-13 ***
## x1 3.3811 0.4294 7.873 3.43e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 240.5 on 48 degrees of freedom
## Multiple R-squared: 0.5636, Adjusted R-squared: 0.5545
## F-statistic: 61.99 on 1 and 48 DF, p-value: 3.425e-10

22.3.2 𝑟<0

n <- 50
my.means <- c(100, 100)
sd1 <- 80
sd2 <- 40
r <- -0.6

my.sigma <- matrix(c(sd1^2, r*sd1*sd2, r*sd1*sd2, sd2^2), ncol=2)

X <- mvrnorm(n,
mu = my.means,
Sigma = my.sigma,
empirical = TRUE)
x1 <- X[,1]
x2 <- X[,2]

epsilon <- rnorm(n, mean = 0, sd = 10)


y <- 20 + 1.2*x1 + 7.4* x2 + epsilon
m.omitted2 <- lm(y ~ x1 )


summary(m.omitted2)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -576.76 -150.60 6.75 174.84 547.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 980.4585 54.8334 17.881 <2e-16 ***
## x1 -1.0000 0.4299 -2.326 0.0243 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 240.7 on 48 degrees of freedom
## Multiple R-squared: 0.1013, Adjusted R-squared: 0.0826
## F-statistic: 5.412 on 1 and 48 DF, p-value: 0.02427

22.4 Incorrect Specification Issue

Here, I attempt a simulation of a model where the true relationship among the
variables is not linear.

n <- 50
mean.x <- 5
sd.x <- 5
x <- rnorm(n, mean = mean.x, sd = sd.x)

mean.epsilon <- 0
sd.epsilon <- 7
epsilon <- rnorm(n, mean = mean.epsilon, sd = sd.epsilon)
FIGURE 22.4: Scatter plot of sample with non-linear relationship.

beta0 <- 20
beta1 <- 2

y <- beta0 + beta1 * x^2 + epsilon

df <- tibble(y= y, x =x)

How did the linear regression perform in this case? Poorly, as shown in Figure
22.5.

m0 <- lm(y ~ x)
summary(m0)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.70 -52.02 -19.10 21.39 239.60
##
FIGURE 22.5: Scatter plot of sample with non-linear relationship along with OLS
fit (blue).

## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.849 11.754 4.326 7.65e-05 ***
## x 14.322 1.803 7.945 2.67e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.03 on 48 degrees of freedom
## Multiple R-squared: 0.568, Adjusted R-squared: 0.559
## F-statistic: 63.12 on 1 and 48 DF, p-value: 2.671e-10
23
Relevant Applications

TL;DR This chapter introduces relevant illustrations


of the tools developed in the previous discussions.

23.1 Betting on Hitler

This section is based on Ferguson and Voth (2008) (download here1 ).

23.1.1 Abstract

This paper examines the value of connections between German industry and
the Nazi movement in early 1933. Drawing on previously unused contemporary
sources about management and supervisory board composition and stock re-
turns, we find that one out of seven firms, and a large proportion of the biggest
companies, had substantive links with the National Socialist German Work-
ers’ Party. Firms supporting the Nazi movement experienced unusually high
returns, outperforming unconnected ones by 5% to 8% between January and
March 1933. These results are not driven by sectoral composition and are robust
to alternative estimators and definitions of affiliation.
1
https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335654


23.1.2 Main Explanatory Variable

We systematically assess the value of prior ties with the new regime in 1933. To
do so, we combine two new data series: A new series of monthly stock prices,
collected from official publications of the Berlin stock exchange, and a second
series that uses hitherto unused contemporary data sources, in combination with
previous scholarship, to pin down ties between big business and the Nazis. We
consider both active managers (the Vorstand) and supervisory board members
(Aufsichtsrat).
We thus try to offer a quantitative answer to the question, How much was it
worth to have close, early connections with the Nazi party?
We identify businessmen and firms as connected to the NSDAP if they meet ei-
ther of two criteria. First, if business leaders or firms contributed financially to
the party or to Hitler or Göring, they qualify as connected. Second, certain busi-
nessmen provided political support for the Nazis at crucial moments, serving
on (or helping to finance) various groups that advised the party or Hitler on eco-
nomic policy. We also count the latter as connected. Appendix I lists all relevant
individuals and firms, along with notes on the main scholarly sources for each.
The first group includes early contributors such as Thyssen and Kirdorf. […] In
the second group are businessmen whose ties to the party also pre-dated Feb. 20.
It includes the signatories of a famous petition to President Hindenburg, urging
him to appoint Hitler as Chancellor.

23.1.3 Other Variables

(…) Our definition of Jewish-owned firms follows Mosse’s (1987) as closely as


possible. We attempt to identify “enterprises usually founded by men of Jewish
extraction, with Jews prominent in management and substantially represented
on the board.” Because Mosse focuses on large enterprises, we cross-checked
the firms in our sample against Kaznelson’s (1962) work from the period.26 As
a further safeguard against limited coverage of small firms, we supplement our
data with information from a 1927 series of articles in the Jewish periodical Der
Morgen.
(…) We also examined if the outperformance of Nazi-affiliated firms could be
a result of greater riskiness. Connected firms had a higher average beta. However,
adding the betas to the basic regression setup as an additional explanatory
variable does not change our main result.

23.1.4 Descriptive Statistics

FIGURE 23.1: Descriptive statistics.

23.1.5 Results (Selection)

The lower panel of Table III [Figure 23.2] documents significant outperformance
over the period from mid-January to mid-March. Nazi affiliated firms saw their
prices increase by almost 7% more than the rest.

23.1.6 Robustness Checks

(…) Could our main results be driven by expectations of an increase in armament


production? As we will show below, firms with relevant skills for rearmament
were more likely to form affiliations with the NSDAP. Here, we show that pos-
sible arms suppliers showed excess returns after January 1933. Nevertheless, the
value of NS connections is not affected by controlling for the armament effect.
(…) We investigate the effect of being a potential weapons supplier in case of
future rearmament. To this end, we use a list compiled by the Reichswehr in
1927–28, tabulating firms that were important for general armament production
FIGURE 23.2: Regressions results.

(Hansen 1978, App. 6, 10). As Eq. (1), Table VI, shows, being on the Reichswehr
list produced a positive return of 6.5% after January 30, but the coefficient
is not significant.
(…) In this subsection, we show that our results are not sensitive to alternative
definitions of connection with the Nazi party.
(…) Using 73’815 possible combinations of regressors - including all sector dum-
mies, market capitalization, the dividend yield, the Jewish dummy, size quin-
tiles, Reichswehr association, beta, and twenty dummies of regional origin - the
smallest coefficient we obtain for the Nazi variable is 0.059 (t-statistic 3.1) and
the biggest is 0.11 (t-statistic 5.8). Despite using a large number of possible com-
binations of regressors, we consistently find a statistically significant and eco-
nomically meaningful coefficient.
24
Linear Regression Lab

TL;DR This chapter provides exercises on various


aspects related to estimating linear models in R.

24.1 Simple Linear Regression

24.1.1 Estimation

The main function for estimating linear models in R is lm().

library(MASS)
data(Boston)
# ?Boston
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"

m1 <- lm(medv ~ lstat, data=Boston)


summary(m1)
##


## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16

24.1.2 Names

names(m1)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"

s.m1 <- summary(m1)


names(s.m1)
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"

24.1.3 Prediction

my.values <- data.frame(lstat = c(1, 5, 10))

pred1 <- predict(m1, my.values)


pred1
## 1 2 3
## 33.60379 29.80359 25.05335

24.1.4 Plotting

library(ggplot2)
library(magrittr)
library(dplyr)

Boston %>%
mutate(fit.m = m1$coef[1] + m1$coef[2]*lstat,
fit.f = m1$fitted.values,
fit.p = predict(m1)) %>%
ggplot(aes(x=lstat, y=medv)) +
geom_point() +
geom_line(aes(y=fit.m), color="blue")

Boston %>%
ggplot(aes(x=lstat, y=medv)) +
geom_point() +
geom_smooth(method = "lm")

24.2 Multiple Linear Regression

24.3 Estimation

m2 <- lm(medv ~ lstat + age + crim + tax, data=Boston)


summary(m2)
##
## Call:
## lm(formula = medv ~ lstat + age + crim + tax, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max


## -16.767 -3.890 -1.429 1.900 24.837
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.225282 0.886235 38.619 < 2e-16 ***
## lstat -0.960518 0.051710 -18.575 < 2e-16 ***
## age 0.046510 0.012542 3.708 0.000232 ***
## crim -0.032526 0.039681 -0.820 0.412788
## tax -0.006395 0.002216 -2.886 0.004066 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.103 on 501 degrees of freedom
## Multiple R-squared: 0.5632, Adjusted R-squared: 0.5597
## F-statistic: 161.5 on 4 and 501 DF, p-value: < 2.2e-16

24.3.1 Prediction

my.values <- data.frame(lstat = c(1, 5, 10),


age= c(50, 60, 70),
crim=c(0.5),
tax=c(200))

pred2 <- predict(m2, my.values)


pred2
## 1 2 3
## 34.29491 30.91793 26.58044

24.3.2 Plotting

Boston %>%
mutate(fit.p = predict(m2)) %>%
ggplot(aes(x=lstat, y=medv)) +
geom_point() +
geom_line(aes(y=fit.p), color="blue")


fit.data <- data.frame(lstat=Boston$lstat,


age=mean(Boston$age),
crim=mean(Boston$crim),
tax=mean(Boston$tax))

Boston %>%
mutate(fit.p = predict(m2, fit.data),
fit.ps = predict(m1)) %>%
ggplot(aes(x=lstat, y=medv)) +
geom_point() +
geom_line(aes(y=fit.p), color="blue") +
geom_line(aes(y=fit.ps), color="red")

24.4 Dummy Variables

24.4.1 Estimation

m3 <- lm(medv ~ lstat + chas, data = Boston)


summary(m3)
##
## Call:
## lm(formula = medv ~ lstat + chas, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.782 -3.798 -1.286 1.769 24.870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.09412 0.56067 60.809 < 2e-16 ***
## lstat -0.94061 0.03804 -24.729 < 2e-16 ***
## chas 4.91998 1.06939 4.601 5.34e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.095 on 503 degrees of freedom
## Multiple R-squared: 0.5626, Adjusted R-squared: 0.5608
## F-statistic: 323.4 on 2 and 503 DF, p-value: < 2.2e-16

24.4.2 Plotting

fit.data1 <- data.frame(lstat=Boston$lstat,


chas=1)
fit.data0 <- data.frame(lstat=Boston$lstat,
chas=0)

Boston %>%
mutate(fit.p1 = predict(m3, fit.data1),
fit.p0 = predict(m3, fit.data0)) %>%
ggplot(aes(x=lstat, y=medv)) +
geom_point() +
geom_line(aes(y=fit.p1), color="blue") +
geom_line(aes(y=fit.p0), color="red")

24.4.3 Several Categories

Boston <- Boston %>%


mutate(ctax = case_when(
tax < 250 ~ "low",
tax > 300 ~ "high",
TRUE ~ "medium"
))

m4 <- lm(medv ~ lstat + ctax, data = Boston)


summary(m4)
##
## Call:
## lm(formula = medv ~ lstat + ctax, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.549 -3.995 -1.202 1.972 25.305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.30325 0.68752 48.439 < 2e-16 ***
## lstat -0.90329 0.04123 -21.908 < 2e-16 ***
## ctaxlow 2.94649 0.84310 3.495 0.000516 ***
## ctaxmedium 1.26345 0.73136 1.728 0.084688 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.15 on 502 degrees of freedom
## Multiple R-squared: 0.5556, Adjusted R-squared: 0.5529
## F-statistic: 209.2 on 3 and 502 DF, p-value: < 2.2e-16

24.5 Non-linear Transformations

24.5.1 Estimation

m5 <- lm(medv ~ lstat + I(lstat^2), data = Boston)


summary(m5)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***


## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16

24.5.2 Plotting

Boston %>%
mutate(fit.f = m5$fitted.values) %>%
ggplot(aes(x=lstat, y=medv)) +
geom_point() +
geom_line(aes(y=fit.f), color="blue") +
geom_smooth(method = "lm", se=FALSE, color="red")
Part VIII

Classification
25
Limited Dependent Variables

TL;DR This chapter introduces a method for esti-


mating models among variables when the dependent
variable is a categorical variable. The most common
case is when the dependent variable is a dummy, 0/1,
variable.

25.1 Motivation and Interpretation

This chapter introduces a fundamental departure from the regression methods


where the explained variables are continuous. Indeed, some dependent variables
are qualitative in nature or are only observed for a few discrete values. Examples
include:

• Client’s level of satisfaction,


• Job market participation,
• Choice in a vote,
• …

We describe here a method for building models when the dependent variable is


a categorical variable with two possible values, e.g., 0 or 1, yes or no. It is referred
to as the logistic model.
Extensions to more than two values exist, i.e., extensions with multinomial
variables, either ordered (e.g., number of stars given to a movie) or unordered
(preferred means of transportation). In that sense, the logistic model is a
particular case.
The general group of methods to which the present one belongs is called classi-
fiers. This is because these tools are designed to assign each observation to a class
(or category). One of the most popular classifiers, one that you might have heard
of, is neural networks1 .
A sub-group of classifiers, including the logistic model, achieves the classifica-
tion by first estimating the probabilities of each observation of belonging to each
class and then, naturally, assigning the observation to the class for which the pre-
dicted probability is the highest. In that perspective, this sub-group behaves like
the regression models discussed previously, but with continuous probabilities
as dependent variables. This is why this sub-group is referred to as generalized
linear models.

25.1.1 An Illustrative Case

This subsection introduces the data used to illustrate the techniques in this chap-
ter.

require(ISLR)
require(tidyverse)
data("Default")
as_tibble(Default)
## # A tibble: 10,000 x 4
## default student balance income
## <fct> <fct> <dbl> <dbl>
## 1 No No 730. 44362.
## 2 No Yes 817. 12106.
## 3 No No 1074. 31767.
## 4 No No 529. 35704.
## 5 No No 786. 38463.

1
https://youtu.be/aircAruvnKk
FIGURE 25.1: Some plots on the Default data set.

## 6 No Yes 920. 7492.


## 7 No No 826. 24905.
## 8 No Yes 809. 17600.
## 9 No No 1161. 37469.
## 10 No No 0 29275.
## # ... with 9,990 more rows

Notice that the response variable here is default recording whether or not the
individual defaulted in the credit card. Hence, it is a categorical variable (also
called factor in R).
We start by providing some plots of the data.
Explanatory variables, 𝑋 , explain the 𝑦 = 0 or 𝑦 = 1. However, since no other
value is possible for 𝑦 , the prediction of the model can be interpreted as a
probability. The relationship between the explained variable 𝑦 and the
explanatory 𝑋 could then be illustrated as in Figure 25.2.

FIGURE 25.2: Depiction of a limited dependent variable.
Therefore, we will model the relationship as

𝑝 = 𝑃 (𝑦 = 1|𝑋) = 𝐹 (𝑋𝛽)

25.2 Choice of 𝐹 (⋅)

At this point, it seems crucial to adopt an appropriate, specific function 𝐹 (⋅).


The main constraints include symmetry and, historically, computational ease. The
standard econometric literature has elected two functions for that purpose:

• the normal distribution,


• the logistic distribution.

Beyond these two functions, a candidate is simply the linear regression model,
which we describe first.

FIGURE 25.3: OLS fit for LDV.

25.3 OLS: the Linear Probability Model (LPM)

We might first think of using the OLS estimator to discover the DGP at hand. In
terms of the expression above, we would have:

𝐹 (𝑋𝛽) = 𝑋𝛽

The estimated model would be

𝑦 = 𝑋𝛽 + 𝜀 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑘 𝑋𝑘 + 𝜀 (25.1)

where

𝑦 = { 1 if the condition holds
    { 0 otherwise                         (25.2)

The OLS fit would then look like the one represented in Figure 25.3.
Technically, the estimated values are interpreted as probabilities that are used for
classification. For instance, we could have

prediction = 1 if 𝑦 ̂ > 0.5

25.3.1 LPM Issues: Heteroskedasticity

Notice that the error term is

𝜀 = { 1 − 𝑋𝛽  with probability 𝑋𝛽
    { −𝑋𝛽     with probability 1 − 𝑋𝛽      (25.3)

Hence,

𝑉𝑎𝑟(𝜀) = 𝑋𝛽(1 − 𝑋𝛽)² + (1 − 𝑋𝛽)(−𝑋𝛽)² = 𝑋𝛽(1 − 𝑋𝛽)

Therefore, the variance of the error term in this model is not constant but depends
on 𝑋 , a violation of Assumption 2, see Section ??. The estimate of 𝛽 might be
consistent but won’t be efficient.
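As a quick numerical illustration, here is a minimal sketch (with arbitrary values of 𝑋𝛽) showing that the error variance 𝑋𝛽(1 − 𝑋𝛽) changes with the level of 𝑋𝛽:

# Sketch: the LPM error variance Xb * (1 - Xb) at two arbitrary values of Xb.
lpm_var <- function(Xb) Xb * (1 - Xb)
lpm_var(0.1)  # 0.09
lpm_var(0.5)  # 0.25: the variance depends on X, i.e., heteroskedasticity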

25.3.2 LPM Issues: Linear Increase of Probability

This model assumes constant increases in probability as 𝑋 increases, no matter the level of 𝑋. This assumption is difficult to accept: for instance, a given increase in balance should arguably matter less for the probability of default at very low balances than at balances close to the level where default becomes likely.

25.3.3 LPM Issues: Interpretation as Probability

As seen in Figure 25.3, predicted values of 𝑦 might lie outside the (0, 1) range. Figure 25.4 illustrates that point.

cl.lm <- lm(as.numeric(default) ~ balance, data = Default)


summary(cl.lm)
##
## Call:
## lm(formula = as.numeric(default) ~ balance, data = Default)
##
## Residuals:
## Min 1Q Median 3Q Max
FIGURE 25.4: OLS prediction for Default.

## -0.23533 -0.06939 -0.02628 0.02004 0.99046


##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.248e-01 3.354e-03 275.70 <2e-16 ***
## balance 1.299e-04 3.475e-06 37.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1681 on 9998 degrees of freedom
## Multiple R-squared: 0.1226, Adjusted R-squared: 0.1225
## F-statistic: 1397 on 1 and 9998 DF, p-value: < 2.2e-16

Default %>%
mutate(fit_lm= cl.lm$fitted.values) %>%
ggplot(aes(x=balance, y=default)) +
geom_point() +
geom_line(aes(y=fit_lm), color="blue")

Clearly, the probability interpretation is jeopardized in this kind of case. Also, it calls for a better way to fit the data, through the use of an appropriate cumulative distribution function. For instance, the second fit in Figure 25.5 would be a better choice.

FIGURE 25.5: A possible better fit for LDV.

25.4 Probit and Logit Models

The choice of 𝐹 (⋅) has given birth to two models:

• the probit model relies on a normal distribution,


• the logit model relies on a logistic distribution.

Both models give similar results. In practice, the choice between the two functional forms is often arbitrary.

25.4.1 Probit

For the probit model,


𝑝 = 𝑃(𝑦 = 1|𝑋) = ∫_{−∞}^{𝑋𝛽} (1/√(2𝜋)) 𝑒^{−𝑡²/2} 𝑑𝑡 = ∫_{−∞}^{𝑋𝛽} 𝜙(𝑡) 𝑑𝑡 = Φ(𝑋𝛽)      (25.4)

where 𝜙(⋅) and Φ(⋅) are the pdf and the cdf of the standard normal distribution, respectively.

25.4.2 Logit

For the logit model,

𝑝 = 𝑃(𝑦 = 1|𝑋) = 𝑒^{𝑋𝛽} / (1 + 𝑒^{𝑋𝛽}) = Λ(𝑋𝛽)      (25.5)

where Λ(𝑋𝛽) is the cdf for the logistic distribution. Its pdf is Λ(𝑋𝛽)(1 −
Λ(𝑋𝛽)).
Naturally, the predictions of the model will always lie on an S-shaped curve within the [0, 1] range, no matter the value of the explanatory variable, as shown in Figure 25.6.
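As a small aside, both cdf's are directly available in base R; the minimal sketch below (with a made-up value of 𝑋𝛽 = 0.7) shows how each link maps the linear index into a probability.

Xb <- 0.7                 # made-up value of the linear index
pnorm(Xb)                 # probit: Phi(Xb)
plogis(Xb)                # logit: Lambda(Xb)
exp(Xb) / (1 + exp(Xb))   # identical to plogis(Xb)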
The logit model is often preferred in some fields of research for the following particular feature. We can rewrite the expression above to get

𝑝 / (1 − 𝑝) = 𝑒^{𝑋𝛽}

This ratio is called the odds. If the model estimates 𝑝̂ = 0.2, then the odds are 0.2/(1 − 0.2) = 1/4, i.e., 1 to 4 (a 1-in-5 chance). Moreover, taking logs of the odds, we obtain the log-odds,

log(𝑝 / (1 − 𝑝)) = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑘𝑋𝑘

This last expression, in turn, shows that the logit model is a linear model of the log-odds.
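As an illustration, here is a minimal sketch of how odds could be computed in R, assuming a logit fit of default on balance for the Default data introduced above (the corresponding estimation code is not shown elsewhere in this chapter):

# Sketch: odds and odds ratios from an assumed logit fit on the Default data.
fit_logit <- glm(default ~ balance, data = Default, family = binomial)
p_hat <- predict(fit_logit, type = "response")   # estimated probabilities
odds  <- p_hat / (1 - p_hat)                     # odds for each observation
exp(coef(fit_logit))   # exp() of a coefficient: multiplicative effect on the odds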
FIGURE 25.6: Fit of logistic regression.

FIGURE 25.7: Normal and logistic cdf’s.

25.4.3 Illustration

Figure 25.7 illustrates the form of these two distributions.



25.5 Estimation

Estimation is based on maximum likelihood. This topic, however, goes beyond


the scope of this class.
For the estimation in R, see Section 25.8.

25.6 Marginal Effects

Marginal effects provide a measure of the change in the probability related to a change in the regressor. Clearly, in our non-linear models, the marginal effects will depend on the level of the regressors at which they are evaluated.

𝜕𝑃(𝑦 = 1)/𝜕𝑋 = 𝛽 for the LPM, 𝜙(𝑋𝛽) ⋅ 𝛽 for the probit, and Λ(𝑋𝛽) ⋅ (1 − Λ(𝑋𝛽)) ⋅ 𝛽 for the logit.

Notice that 𝛽 is not the measure of the marginal effect. In fact, the impact of a
variable on the probability depends both on the coefficients and on the actual
values of the variables.
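A minimal sketch of how these marginal effects could be computed at the sample means in R, assuming the logit (m1) and probit (m2) fits estimated in Section 25.8 below are available:

# Sketch: marginal effects at the sample means, following the formulas above.
X_bar <- c(1, colMeans(model.matrix(m1)[, -1]))   # regressor means, 1 for the intercept

xb_logit <- sum(coef(m1) * X_bar)                 # linear index at the means
me_logit <- dlogis(xb_logit) * coef(m1)           # logit: Lambda(Xb)(1 - Lambda(Xb)) * beta

xb_probit <- sum(coef(m2) * X_bar)
me_probit <- dnorm(xb_probit) * coef(m2)          # probit: phi(Xb) * beta
# the entries corresponding to the intercept are not marginal effects and can be ignored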

25.7 Goodness of Fit

There is no uniformly accepted way to compute a measure of the goodness of fit.


The usual 𝑅2 cannot be applied in this context.

25.7.1 Confusion Matrix

A natural prediction for a binary variable is also binary. Define the following

𝑦̂ᵢ* = 1 if 𝑦̂ᵢ > 0.5, and 0 otherwise

A confusion matrix is simply a tabulation of the actual observed 𝑦's versus the 𝑦̂*'s predicted that way.
Typically, you would have

Pred. / Act. 0 1
0 𝑛1 𝑛2
1 𝑛3 𝑛4

where 𝑛1 and 𝑛4 are the numbers of correctly predicted 0's and 1's, respectively, while 𝑛2 and 𝑛3 are the numbers of misclassified observations. Denote by 𝑛 the total number of observations.
Goodness-of-fit measures can then be calculated from the confusion matrix, depending on the perspective most relevant for the case at hand. Among them, we could mention, assuming 0 is the 'positive' category:

• accuracy: (𝑛1 + 𝑛4)/𝑛,
• sensitivity: 𝑛1/(𝑛1 + 𝑛3),
• specificity: 𝑛4/(𝑛2 + 𝑛4).

Again in this setting, we should distinguish between train and test measures and
use cross-validation to better estimate the latter.
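A minimal sketch of these computations in R, using as an illustration the counts of the first confusion matrix reported in Section 25.8 below:

# Sketch: the measures computed by hand from the counts n1, ..., n4.
n1 <- 125; n2 <- 50; n3 <- 5; n4 <- 9
n  <- n1 + n2 + n3 + n4

(n1 + n4) / n    # accuracy
n1 / (n1 + n3)   # sensitivity (0 as the 'positive' category)
n4 / (n2 + n4)   # specificity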

25.8 An Example

We use data on birth weights. Variables are the following:

• low, indicator of birth weight less than 2.5 kg,



• smoke smoking status during pregnancy,


• race mother’s race (‘1’ = white, ‘2’ = black, ‘3’ = other),
• ht history of hypertension,
• ui presence of uterine irritability,
• ftv number of physician visits during the first trimester,
• age mother’s age in years,
• lwt mother’s weight in pounds at last menstrual period.

25.8.1 Linear Fit

library(MASS)   # the birthwt data used below ships with the MASS package
m0 <- lm(low ~ smoke + race + ht + ui + ftv + age + lwt, data = birthwt)


summary(m0)
##
## Call:
## lm(formula = low ~ smoke + race + ht + ui + ftv + age + lwt,
## data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7404 -0.3160 -0.1520 0.4343 0.9076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.423817 0.230256 1.841 0.06731 .
## smoke 0.186092 0.070619 2.635 0.00914 **
## race 0.081438 0.038546 2.113 0.03599 *
## ht 0.377334 0.136621 2.762 0.00634 **
## ui 0.187621 0.091755 2.045 0.04232 *
## ftv 0.005350 0.031379 0.170 0.86481
## age -0.003815 0.006359 -0.600 0.54934
## lwt -0.002328 0.001130 -2.061 0.04073 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4394 on 181 degrees of freedom

## Multiple R-squared: 0.1388, Adjusted R-squared: 0.1055


## F-statistic: 4.167 on 7 and 181 DF, p-value: 0.0002809

25.8.2 Logit Estimation

m1 <- glm(low ~ smoke + race + ht + ui + ftv + age + lwt,


data = birthwt,
family = binomial(link="logit"))
summary(m1)
##
## Call:
## glm(formula = low ~ smoke + race + ht + ui + ftv + age + lwt,
## family = binomial(link = "logit"), data = birthwt)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7426 -0.8398 -0.5698 1.0367 2.1293
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.117505 1.263702 -0.093 0.92592
## smoke 1.040777 0.391484 2.659 0.00785 **
## race 0.471209 0.213123 2.211 0.02704 *
## ht 1.851441 0.689782 2.684 0.00727 **
## ui 0.866535 0.451031 1.921 0.05470 .
## ftv 0.055545 0.169155 0.328 0.74263
## age -0.026944 0.035468 -0.760 0.44746
## lwt -0.013512 0.006547 -2.064 0.03901 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 206.72 on 181 degrees of freedom
## AIC: 222.72

##
## Number of Fisher Scoring iterations: 4

25.8.3 Probit Estimation

m2 <- glm(low ~ smoke + race + ht + ui + ftv + age + lwt,


data = birthwt,
family = binomial(link="probit"))
summary(m2)
##
## Call:
## glm(formula = low ~ smoke + race + ht + ui + ftv + age + lwt,
## family = binomial(link = "probit"), data = birthwt)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7203 -0.8498 -0.5559 1.0437 2.1606
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.087298 0.743832 -0.117 0.90657
## smoke 0.635185 0.228434 2.781 0.00543 **
## race 0.281346 0.124492 2.260 0.02382 *
## ht 1.109928 0.413935 2.681 0.00733 **
## ui 0.537178 0.273867 1.961 0.04983 *
## ftv 0.025796 0.100112 0.258 0.79666
## age -0.016834 0.020911 -0.805 0.42081
## lwt -0.007894 0.003789 -2.084 0.03719 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 206.29 on 181 degrees of freedom
## AIC: 222.29

##
## Number of Fisher Scoring iterations: 5

25.8.4 Confusion Matrices

We can now compare the confusion matrices.

library(caret)
library(e1071)

OLS

predict0 <- predict(m1)


confusionMatrix(data = as.factor(ifelse(predict0>0.5, 1, 0)),
reference = factor(birthwt$low))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 125 50
## 1 5 9
##
## Accuracy : 0.709
## 95% CI : (0.6387, 0.7726)
## No Information Rate : 0.6878
## P-Value [Acc > NIR] : 0.2938
##
## Kappa : 0.1441
##
## Mcnemar's Test P-Value : 2.975e-09
##
## Sensitivity : 0.9615
## Specificity : 0.1525
## Pos Pred Value : 0.7143
## Neg Pred Value : 0.6429
## Prevalence : 0.6878
## Detection Rate : 0.6614

## Detection Prevalence : 0.9259


## Balanced Accuracy : 0.5570
##
## 'Positive' Class : 0
##

Logit

predict1 <- predict(m1, type = "response")


confusionMatrix(data = as.factor(ifelse(predict1>0.5, 1, 0)),
reference = factor(birthwt$low))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 116 41
## 1 14 18
##
## Accuracy : 0.709
## 95% CI : (0.6387, 0.7726)
## No Information Rate : 0.6878
## P-Value [Acc > NIR] : 0.2937923
##
## Kappa : 0.2256
##
## Mcnemar's Test P-Value : 0.0004552
##
## Sensitivity : 0.8923
## Specificity : 0.3051
## Pos Pred Value : 0.7389
## Neg Pred Value : 0.5625
## Prevalence : 0.6878
## Detection Rate : 0.6138
## Detection Prevalence : 0.8307
## Balanced Accuracy : 0.5987
##
## 'Positive' Class : 0
##

Probit

predict2 <- predict(m2, type = "response")


confusionMatrix(data = as.factor(ifelse(predict2>0.5, 1, 0)),
reference = factor(birthwt$low))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 117 41
## 1 13 18
##
## Accuracy : 0.7143
## 95% CI : (0.6442, 0.7775)
## No Information Rate : 0.6878
## P-Value [Acc > NIR] : 0.2415587
##
## Kappa : 0.2356
##
## Mcnemar's Test P-Value : 0.0002386
##
## Sensitivity : 0.9000
## Specificity : 0.3051
## Pos Pred Value : 0.7405
## Neg Pred Value : 0.5806
## Prevalence : 0.6878
## Detection Rate : 0.6190
## Detection Prevalence : 0.8360
## Balanced Accuracy : 0.6025
##
## 'Positive' Class : 0
##
Part IX

Intermezzo
26
Presentations

TL;DR This section gathers a few notes on the presentations that students are asked to give in class and/or for their thesis.

This section gathers a few notes on the presentations that students are asked to give in class. At the outset, please note that I shall limit the discussion to some selected aspects, in particular aspects related to the plan of the presentation. Therefore, I shall not attempt a full discussion of best practices for presentations.

26.1 “Conclude with a Conclusion” Approach

My starting point is a version of a standard benchmark. The reader might have seen a presentation plan close to the one in Figure 26.1. Another illustration of the standard version, in a more humorous tone, is in Figure 26.2.


FIGURE 26.1: Example of usual plan for presentation (Source: wiley.com (6 tips
for giving a fabulous academic presentation)).

26.2 “Say It” Approach

A tentative alternative plan that students are encouraged to follow is the following.

1. Minimal yet sufficient description of the issue that will be addressed in the presentation:
   • go straight to the issue,
   • if possible, avoid funnel-type introductions,
   • the issue/problem must be clearly understandable…

2. Vivid image to help the listener picture the issue:
   • this can be a picture, an anecdote, a particularly telling graph/statistic…

3. The main result and conclusion of the presentation.


4. All the rest you may want to add.

FIGURE 26.2: Another example of usual plan for presentation (Source: http://phdcomics.com/comics/archive.php?comicid=1553).

• It is wise to add some points if one wants to convince the audience of the con-
clusions reached. Usually useful are the following.

– further motivation/ background,


– literature review,
– data description,
– methodology,
– analysis and secondary results,

– robustness checks (what could be wrong… but it is not because the author
checked that the main results are immune to the possible problems),
– comparison with alternative results in the literature,
– implications for general understanding/ policy/ future research,
– Q&A…
Part X

Causality Claims
Why

TL;DR The following set of chapters gather thoughts


about making causal claims.

Causal claims relating variables are of an extreme kind. They manage to be:

• extremely valued, in particular because our brains crave them,
• extremely difficult to obtain in non-experimental sciences,
• and, somewhat paradoxically, extremely useless in the increasingly important domain of data science.

The following chapters gather some thoughts about making causal claims. For a
deep take on the issue, see the recent (!) contributions by Judea Pearl, e.g., Pearl
and Mackenzie (2018).

27
Sample Bias

27.1 The Issue

Sample bias arises when the data/sample was selected in a way that itself affects the answer to the research question, so that the analysis cannot answer the question it set out to address. This typically happens when the selected data are not representative of the population targeted by the research question.
There are several sources for this issue such as,

• non-random sampling,
• self-selection,
• survivorship bias,
• …

The following cases provide some illustrations while showing its relevance and
its ubiquity.

27.2 Non-Random Sampling

27.2.1 Dewey Defeats Truman

27.2.2 Surveys of Friends

Several theses that I came to evaluate contain survey data obtained from Face-
book friends of the author. Clearly, this jeopardizes representativeness.


FIGURE 27.1: President Truman holding a copy of the Chicago Daily Tribune,
November 1948.

27.3 Self-Selection

27.3.1 Lifetime Sexual Partners

When AIDS became a serious concern, in the 1980s, health officials realized the lack of evidence on the sexual behavior of individuals. This knowledge would prove crucial, for instance, to predict the spread of STDs.
Since then, several countries have conducted surveys on that topic, with questions such as how many sexual partners people report having had in their lifetime.
Consider the fact that the response rate is typically below 100%, say 60-70%, because some individuals decide to participate while others decide not to. One should clearly be concerned with potential biases in the calculation of the sampling distribution of any statistic based on the responses of the survey.

27.3.2 Heights

Understanding long-term changes in human well-being is central to understanding the consequences of


economic development. An extensive anthropometric literature purports to show that heights in the
United States declined between the 1830s and the 1890s, which is when the U.S. economy modernized.
Most anthropometric research contends that declining heights reflect the negative health consequences of
industrialization and urbanization.

The apparent decline in heights in the United States, Great Britain, Sweden, and Habsburg - era central
Europe is indeed interesting, yet we question the reliability of the evidence adduced for this apparent
decline. These countries had fundamentally different economies at the time of their height reversals, but
they shared an important feature: they filled their military ranks with volunteers rather than conscripts.
A volunteer sample, which is the predominant type of sample in the literature, is selected in the sense
that such samples contain only individuals who chose to enlist in the military. Elsewhere we have shown
that the problem of inferring changes in population heights from a selected sample of volunteers can be
grave (Bodenhorn, Guinnane, and Mroz 2014). The implications of selection bias render the observed
“shrinking in a growing economy” less of an anomaly (Komlos 1998a). As the economy grows, the
outside option of military service becomes less attractive, especially to the productive and the tall.
Military heights declined because tall people increasingly chose non-military employment. Thus, we
cannot really say whether population heights declined; we can only be confident that the average height
of those willing to enlist in the military declined.

— Bodenhorn et al. (2017)

27.4 Survivorship Bias

27.5 The Tim Ferriss Show1

Consider the brief description offered on the web page of the popular Tim Ferriss Show2 .3

Each episode, I deconstruct world-class performers from eclectic areas (investing, sports, business, art,
1
This is neither an endorsement of the show… nor a critique of the show.
2
https://tim.blog/podcast/
3
https://tim.blog/podcast/

etc.) to extract the tactics, tools, and routines you can use. This includes favorite books, morning
routines, exercise habits, time-management tricks, and much more.

From a statistical point of view, this admitted goal of the show, in italics (my
emphasis), is clearly a doubtful one.
This little video on BBC4 further illustrates the point.

27.5.1 Caveman Effect

The evidence we have about our prehistoric ancestors is based on artifacts that have survived to our day, e.g., cave paintings. But these should not be considered representative of the real life of these people.

4
https://www.bbc.com/reel/video/p088rp00/the-dangers-of-idolising-successful-people
28
Endogeneity

28.1 The Issue

This barbarous term is actually a star in economics. The reason for that is its
rank as Number-One-Threat to the validity of an estimated model. Recall that
its mathematical description amounts to a simple formulation,

𝐶𝑜𝑣(𝜀, 𝑋) ≠ 0

A model suffers from an endogeneity issue when the explanatory variable is cor-
related with the error term. The consequence of that correlation is dramatic. For
instance, in the linear regression model, the estimated coefficient in the defective
model will not converge to the true parameter of the relationship.
There are several causes of endogeneity, including:

• omitted regressor,
• measurement error,
• omitted common source,
• omitted selection,
• simultaneity,
• …

Importantly, notice that this is not primarily a highly technical issue. It is, above all, a defective way of establishing causal claims.


28.2 Omitted Regressor

This is a case that we briefly explored in a simulation (see Section 22.3).


Suppose that the true model is

𝑦 = 𝛽 0 + 𝛽1 𝑥 + 𝛽 2 𝑧 + 𝜀
where 𝜀 is a true random shock. Assume as well that there is some level of correlation between 𝑥 and 𝑧, which we can express as,

𝑧 = 𝛾1 𝑥 + 𝜉
where 𝜉 is a true random shock. Now, suppose one goes along, forgets 𝑧, and estimates

𝑦 = 𝜙 0 + 𝜙1 𝑥 + 𝑢
Substituting, the actual estimated model is

𝑦 = 𝛽0 + 𝛽1𝑥 + 𝛽2(𝛾1𝑥 + 𝜉) + 𝜀

or,

𝑦 = 𝛽0 + (𝛽1 + 𝛽2𝛾1)𝑥 + (𝛽2𝜉 + 𝜀)

where 𝛽1 + 𝛽2𝛾1 is the slope 𝜙1 actually estimated and 𝛽2𝜉 + 𝜀 is the error term 𝑢.

Clearly, 𝜙̂1 ↛ 𝛽1 unless 𝛽2 = 0 (i.e., there is no omitted regressor) or 𝛾1 = 0 (i.e., there is no correlation between 𝑥 and 𝑧).
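A minimal simulation sketch (all numbers are arbitrary) makes the point concrete: the short regression recovers 𝛽1 + 𝛽2𝛾1 rather than 𝛽1.

# Sketch: omitted-regressor bias in a simulation with made-up parameters.
set.seed(123)
n <- 10000
x <- rnorm(n)
z <- 0.8 * x + rnorm(n)             # gamma1 = 0.8: z is correlated with x
y <- 1 + 2 * x + 3 * z + rnorm(n)   # true beta1 = 2, beta2 = 3

coef(lm(y ~ x + z))   # close to the true coefficients
coef(lm(y ~ x))       # coefficient on x close to beta1 + beta2 * gamma1 = 4.4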

28.3 Measurement Error

This case is provided just as an illustration of the bias in the parameters. It is not
the most serious case. Suppose that the true model is

𝑦 = 𝛽 0 + 𝛽 1 𝑥∗ + 𝜀

where 𝜀 is a true random shock. Now, instead of the real 𝑥∗ , one can only obtain
the imperfect measure,

𝑥 = 𝑥∗ + 𝜉

where 𝜉 is a true random shock. Substituting, the actual estimated model is,

𝑦 = 𝛽0 + 𝛽1𝑥 + 𝑢,  with 𝑢 = 𝜀 − 𝛽1𝜉

where the error term now, 𝑢, is no longer independent of 𝑥, making 𝛽1̂ ↛ 𝛽1 in


general.
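Again, a minimal simulation sketch (arbitrary numbers) illustrates the resulting bias, here the classical attenuation towards zero.

# Sketch: attenuation bias from measurement error in the regressor.
set.seed(123)
n      <- 10000
x_star <- rnorm(n)                    # the true regressor
y      <- 1 + 2 * x_star + rnorm(n)   # true beta1 = 2
x      <- x_star + rnorm(n)           # observed with measurement error

coef(lm(y ~ x_star))   # close to 2
coef(lm(y ~ x))        # biased towards 0, here close to 1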

28.4 Omitted Common Source

The relationship between the dependent variable 𝑦 and an explanatory variable


𝑥 cannot be considered as causal if there is a third variable, 𝑧 that causes fully or
partially both 𝑦 and 𝑥.
We can write it as,

𝑦 = 𝛼 0 + 𝛼1 𝑧 + 𝜈
𝑥 = 𝛾 0 + 𝛾1 𝑧 + 𝜉

And the estimated model is the usual

𝑦 = 𝛽 0 + 𝛽1 𝑥 + 𝜀

Another example is when variables grow independently over time. They cannot be judged as the cause of one another simply on the basis of an estimated relationship between them.

28.5 Omitted Selection

When the observations arise from a phenomenon of self-selection, then the esti-
mated relationship cannot be considered as causal.

28.6 Simultaneity

Simultaneity occurs when the supposedly dependent variable itself, simultaneously, influences the independent variable.
We can write it as,

𝑦 = 𝛼 0 + 𝛼1 𝑥 + 𝜈
𝑥 = 𝛾 0 + 𝛾1 𝑦 + 𝜉

This is a clear case of endogeneity. Indeed, 𝜉 affects 𝑥 (second equation), which in turn affects 𝑦 (first equation); hence 𝜉 is correlated with 𝑦, the regressor of the second equation, rendering the estimate of 𝛾1 meaningless.
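A minimal simulation sketch (arbitrary numbers) of the two-equation system above: solving the system for 𝑦 and 𝑥 and then regressing 𝑥 on 𝑦 does not recover 𝛾1.

# Sketch: simultaneity bias with made-up parameters.
set.seed(123)
n  <- 10000
a0 <- 1; a1 <- 0.5   # y equation
g0 <- 2; g1 <- 0.4   # x equation
nu <- rnorm(n); xi <- rnorm(n)

# reduced forms obtained by solving the two equations simultaneously
y <- (a0 + a1 * g0 + a1 * xi + nu) / (1 - a1 * g1)
x <- (g0 + g1 * a0 + g1 * nu + xi) / (1 - a1 * g1)

coef(lm(x ~ y))   # the coefficient on y is far from gamma1 = 0.4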
29
Regression to the Mean

29.1 Tentative Definition

Regression to the mean occurs when observations from two identical distributions are linked to one another. The problem with such a link arises when extreme observations of the first distribution are linked with observations of the second distribution. Since the latter are less likely to be extreme, the unaware reader will think that the two distributions are not identical. To compound the error, the unaware reader will often pick an obvious explanation for the difference and assign it a causal origin. This misinterpretation is a famous fallacy.
Nobel Prize winner Daniel Kahneman has popularized the case of a flight instructor claiming the following:

“On many occasions I have praised flight cadets for clean execution of some aerobatic maneuver, and in
general when they try it again, they do worse. On the other hand, I have often screamed at cadets for bad
execution, and in general they do better the next time. So please don’t tell us that reinforcement works
and punishment does not, because the opposite is the case.”

29.2 Skill & Luck, Always

The first step to avoid the fallacy is to acknowledge the nature of any variable and emphasize its random component. We could then think of any variable 𝑦 as,


𝑦 = 𝑓(𝑋, 𝛽) + 𝜀

where 𝑓(𝑋, 𝛽) is the deterministic component and 𝜀 the random error.

Alternatively, we can use a less technical view,

Outcome = Skill + Luck

29.2.1 Introductory Example

Suppose one wants to analyze the midterm and the endterm grades of the stu-
dents of a class. For instance, one could link these grades, for each student, in a
linear regression model as follows:

e-grade𝑖 = 𝛽0 + 𝛽1 m-grade𝑖 + 𝜀𝑖

where e-grade and m-grade are the grades at the endterm and midterm exams,
respectively, and 𝑖 refers to each student in the class.
Think of the effect of luck on the grade at each test as the variance of the grade around its expected value. Consider two cases about the effect of luck:

1. It is very small.
2. It is not relatively small.

Argue that the first case would result in a slope coefficient 𝛽1 ≈ 1. Argue that
the second case would result in a slope coefficient 𝛽1 < 1. This is more difficult.
Here is a hint. Suppose a student is very lucky at a test. Think of what is likely
to happen at the next test.
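Here is a minimal simulation sketch of the hint above (all numbers are made up): when luck matters, the estimated slope falls clearly below 1.

# Sketch: midterm and endterm grades as skill plus independent luck.
set.seed(123)
n       <- 500
skill   <- rnorm(n, mean = 14, sd = 2)   # each student's underlying level
m_grade <- skill + rnorm(n, sd = 2)      # midterm = skill + luck
e_grade <- skill + rnorm(n, sd = 2)      # endterm = skill + new, independent luck

coef(lm(e_grade ~ m_grade))   # slope around 0.5, well below 1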

29.3 Selected Gallery



29.3.1 Regression to Mediocrity

Fallacious conclusions derived from a regression to the mean plagued the in-
fancy of data analysis. The very name regression comes from these dismal be-
ginnings.
Sir Francis Galton measured human characteristics, e.g., height, and noticed that
when these characteristics were outstanding in parents, they tended to be much
less so in the children. Therefore, he claimed that there was a regression towards
mediocrity in human characteristics.

29.3.2 SI Jinx

Figure 29.1, the magazine's own cover, refers to the Sports Illustrated Jinx: the claim that individuals or teams who appear on the cover of the Sports Illustrated magazine will subsequently experience bad luck.

29.3.3 Hiring Stars

Goyal and Wahal (2008) analyzed how 3,400 retirement plans, endowments, and foundations (plan sponsors) hired and fired firms that manage investment funds over a 10-year period.
Their results can be illustrated by Figure 29.2. The researchers link the hiring/firing decisions to the excess returns of the firms in the various periods before and after that decision. For instance, “-2:0” is the period of 2 years prior to the decision while “0:1” is the year after the decision, etc.
Plan sponsors, despite the important consequences of their choice, are clearly falling for the fallacy.

FIGURE 29.1: Sports Illustrated cover about... its own myth.


FIGURE 29.2: Excess returns and the selection and termination decisions of plan sponsors.
30
“Gold Standard”

30.1 The “Gold Standard”

The “Gold Standard” for causality claims is the randomized controlled trial/experiment (RCT). In these experiments, all the relevant variables are accounted for and, thanks to random assignment across groups, the effect of a studied variable (e.g., a drug) can be pinned down.
RCTs are a topic on their own, including key features such as the “double blinded” requirement, whereby both the subjects and the researchers are unaware of who belongs to each group before the experiment is finalized.
A full discussion of RCTs would be too long for our class. But there is a better reason to only mention them en passant, as a reference. This is because social sciences typically perform observational studies where little can be fully controlled for. Notice, however, the recent Nobel Prize in Economics awarded to Abhijit Banerjee, Esther Duflo and Michael Kremer for their work on some version of RCTs in order to evaluate the best measures to promote economic development (see Banerjee et al. (2011) for further details).

30.2 Approaching the Gold Standard

Economists have developed various techniques to overcome the various prob-


lems jeopardizing causality claims. These are generally advanced tools and their
discussion goes beyond the scope of this text. Suffice to say that their general
ambition is to come close to the Gold Standard.
One of these techniques is called regression discontinuity. While the details are


FIGURE 30.1: Mita border and specific portion analyzed by Dell (2010).

advanced, the intuition is not. In order to establish the effect of a variable, we should find situations where all the remaining influences can be believed to be equal, leaving the observed difference to be the exclusive consequence of the variable of interest.

30.2.1 Mita System

Various authors have studied differences in institutions and their long term im-
pact on economic development. Dell (2010) evaluates the effect of the mita forced
labor system. She uses a regression discontinuity design that is made possible by
the mita border shown in Figure 30.1.

This discrete change suggests a regression discontinuity (RD) approach for evaluating the long-term
effects of the mita, with the mita boundary forming a multidimensional discontinuity in
longitude–latitude space. Because validity of the RD design requires all relevant factors besides
treatment to vary smoothly at the mita boundary, I focus exclusively on the portion that transects the
Andean range in southern Peru. Much of the boundary tightly follows the steep Andean precipice, and
hence has elevation and the ethnic distribution of the population changing discretely at the boundary. In
contrast, elevation, the ethnic distribution, and other observables are statistically identical across the

segment of the boundary on which this study focuses. Moreover, specification checks using detailed
census data on local tribute (tax) rates, the allocation of tribute revenue, and demography—collected just
prior to the mita’s institution in 1573 - do not find differences across this segment.

Results:

Abstract This study utilizes regression discontinuity to examine the long-run impacts of the mita, an
extensive forced mining labor system in effect in Peru and Bolivia between 1573 and 1812. Results
indicate that a mita effect lowers household consumption by around 25% and increases the prevalence of
stunted growth in children by around 6 percentage points in subjected districts today. Using data from
the Spanish Empire and Peruvian Republic to trace channels of institutional persistence, I show that the
mita’s influence has persisted through its impacts on land tenure and public goods provision. Mita
districts historically had fewer large landowners and lower educational attainment. Today, they are less
integrated into road networks and their residents are substantially more likely to be subsistence farmers.

Explanation:

To minimize the competition the state faced in accessing scarce mita labor, colonial policy restricted the
formation of haciendas in mita districts, promoting communal land tenure instead (Garrett (2005),
Larson (1988)). The mita’s effect on hacienda concentration remained negative and significant in 1940.
Second, econometric evidence indicates that a mita effect lowered education historically, and today mita
districts remain less integrated into road networks. Finally, data from the most recent agricultural
census provide evidence that a long-run mita impact increases the prevalence of subsistence farming.
Based on the quantitative and historical evidence, I hypothesize that the long-term presence of large
landowners in non-mita districts provided a stable land tenure system that encouraged public goods
provision.
A
Assignments

A.1 Assignment I

General Instructions

• The goal of this assignment is threefold. First, it checks that the required soft-
ware is properly installed on your machine. Second, it illustrates several com-
ponents of the text editing language, Markdown. Finally, and arguably the
most important, it is a first example of a dynamic document.
• The assignment addresses exclusively the elements of the format of the docu-
ment. This means that it lacks any specific content such as an analysis to carry,
or a question to answer. My apologies for this dry exercise.
• As much as possible, organize your answers in Sections following the present
format.
• This is the only assignment that you will have to do alone.
• Please check Moodle for the submission link and deadline.

Deliverables

This assignment requires that you deliver several files. Please put them in a folder and compress the latter in one of the usual formats (.zip, .rar). The link on Moodle will be set to accept only these compressed files!
Make sure that you include all the required files. If the files are missing, then we
cannot knit your Rmd file. There is a penalty in that case.

If it knits, it ships.


— Alison Hill, blog entry1

Please make sure that it knits on your machine… and in ours! Because of the
task in Section A.1.2, you must knit your document a last time shortly before
submitting it.
Include your pdf document in the deliverables.

A.1.1 Checking Installation on Your Computer

1. The main file of your submission is a Rmd file. Follow the instructions
of the relevant chapter2 of the notes on the introduction to R.
2. Modify the YAML appropriately to a personalized version, e.g., change
the title.
3. Make sure the item ‘author’ in the YAML is filled as follows,

author: " Name - student number"

where Name and student number are your personal information.

4. Add the following item to your YAML (no indentation).

date: '`r format(Sys.time(), "%B %d, %Y, at %H:%M")`'

5. Paste the following three lines at the beginning of your Rmd file. Make
sure that the chunk options required for having the code evaluated,
echoed in the output file, and showing its result are all set to TRUE.

```{r}
getwd()
```

1
https://alison.rbind.io/post/2020-05-28-how-i-teach-r-markdown/
2
https://af-ucp.courses/introR/template.html

The output of the code above is the location of the current file in your computer.
This location will be printed in the output file. It is expected that the location
contains elements referring to your name. If it does not, please write a word to
explain why.
Here is the above code in my file, along with its output. As you can see, it gives
the sought for indication about the author.

getwd()
## [1] "/Users/antoniofidalgo/Dropbox/brm"

A.1.2 Dynamic Number

Check Moodle for the key number, noted kn, on the day of submission.
Your time submission number, noted tsn, is simply the hour at the time of your
submission, in a 0-24 scale. For instance, if you submit your work in the morning
at 09:24, then your tsn is 9. If you submit it at 22:56, then the tsn is 22.
The present document will dynamically refer to the ‘dynamic number’, dn, built as shown in the code below, which you must include in your report.

kn <- # fill with number from Moodle


tsn <- # fill with correct time submission number
td <- # fill with the last two digits of your student number

dn <- kn + tsn + td

Again, every time ‘dynamic number’ or dn is mentioned in this document, it refers


to the value calculated in the chunk above. Of course, the value shown here will
be different from the value in your assignment.

A.1.3 Simple Markdown Table

For this task, please find guidance in the dedicated Section of my introduction to R3.
3
https://af-ucp.courses/introR/mrmd.html

Replicate Table A.1 as a Markdown table.4 The caption of the table must also be included in your copy, see help here5.

TABLE A.1: Table containing various formatting elements.

            Mon     Tue     Wed          Thu        Fri
Morning     Math    (free)  Marketing    B. Ethics  Stats
Break       -       squash  squash       dn=115     run
Afternoon   (free)  (free)  3rd meet.6   (free)     Intro R

A.1.4 Include Graphic

For this task, please find guidance in the dedicated Section of my introduction to R7.
Google a picture related to Kahneman (2011) and include it in this section.
Find the picture in Figure A.1 on Tufte's website8 and include it as well.

A.1.5 Cross-References

Report the following two statements.

There are more than the 345 (3 times dn) people in the picture of Figure A.1 of Section A.1.4.

4
The number of the footnote in your file will likely be different. Also, do not pay attention to
the background colors or the differences that you may observe between the HTML version and
a pdf version.
5
https://af-ucp.courses/introR/mrmd.html#crossrefbook
6
But maybe meeting with colleague.
7
https://af-ucp.courses/introR/mrmd.html
8
https://www.edwardtufte.com/tufte/powerpoint

FIGURE A.1: Book cover of Tufte’s book.

My schedule in Table A.1 of Section A.1.3 allows me to start reading Kahneman’s book on Tuesday.

A.1.6 Citations

Cite the two references above.

The following are two master pieces: Kahneman (2011) and Tufte (2003).

These are books. Look up the reference of the research paper by Reinhart and Rogoff that we saw in the notes9 and quote it.
9
https://af-ucp.courses/introR/error.html

Reinhart and Rogoff (2010) is a very controversial paper.

For this task, you need to create a .bib file as explained in the notes on my intro-
duction to R10 . Do not forget to create a header called “References”.

10
https://af-ucp.courses/introR/mrmd.html
B
Bonus Assignments

You can answer any, both, or none of the following questions. Their goal is twofold: first, to provide practice with the concepts and techniques that we saw together, and, second, to offer you the possibility of increasing your grade through extra effort.
You can work in groups of up to three people.
The deadline for submission is Monday, May 24, 10:59.

B.1 Keep Young and Beautiful

[This question awards up to one and a half extra points on the midterm.]
This Assignment is based on Bland (2009) (download here1 ).
I created simulated data that you must use in this case (download here2 ).
The data set contains observations on two variables: wrinkle.red which measures
the wrinkle reduction at 6 months measured in the individual (in a made-up
unit), and, group which specifies whether the individual used the cream with the
active ingredient or the vehicle.

B.1.1 Task

Use the data as if it was the real data of Watson et al. (2009) to illustrate the
various elements discussed on Bland (2009), p.183, middle column.
1
https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335658
2
https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=335659


This includes computing tests of hypotheses, and other secondary statistics.


Please provide limited but sufficient explanations to show that you understood
the point(s). Use and provide a .Rmd file.
Notice that my interpretation of Watson et al. (2009) is that they wanted to test
whether the cream had a positive effect on wrinkle reduction, while never con-
sidering that the cream could increase wrinkles.

B.2 Grades and Luck

[This question awards up to one and a half extra points on Quiz II.]
For the following exercises, use the grades.csv data set (download here3 ).
The file contains the grades of the students, identified by an ID, for three different tests in a semester: a midterm, a quiz and an endterm. If needed for the interpretation, consider that the tests were taken in that order, i.e., starting with the midterm and ending with the endterm.

a. Read the file into a tibble and assign it to the name df.
b. Estimate the following models with OLS. Show and comment the out-
put of these estimated models.

Model 1:
quiz𝑖 = 𝛽0 + 𝛽1 midterm𝑖 + 𝜀𝑖

Model 2:
endterm𝑖 = 𝛼0 + 𝛼1 midterm𝑖 + 𝜉𝑖

where 𝜀 and 𝜉 are random terms, 𝑖 = 1, … , 𝑛, and 𝑛 is the number of students


in the class.

c. For each model, provide the scatter plot of the data along with the linear
fit. Add a 45-degree line (geom_abline(slope= 1, intercept=0)).
3
https://moodle.lisboa.ucp.pt/mod/resource/view.php?id=336209
FIGURE B.1: Grades at the three tests.

d. Tidy the data such that every row is an observation and every column is
a variable. Assign the tidy tibble to the name t.df. Hint: Figure B.1 gives
hints about the resulting format.

e. Estimate a model with (exclusively) dummy variables. Show the results and briefly comment on the regression output. Interpret and explain the coefficients 𝛿̂1 and 𝛿̂2.

Model 3:

grade = 𝛿0 + 𝛿1 𝐷1 + 𝛿2 𝐷2 + 𝑢

where 𝑢 is an error term and

𝐷1 = 1 if endterm, 0 otherwise

and

𝐷2 = 1 if quiz, 0 otherwise

This requires that you turn the variable test into a factor with as.factor(). It also requires setting the right reference level for the variable with relevel().
C
Practice Quiz Questions

TABLE C.1: Practice quiz questions with elements of solution in this appendix.

Exercise Solution
C.1, C.2, C.3, C.4, C.5 b, e, b, a, b
C.6, C.7, C.8, C.9 a, a, b, a
C.10, C.11, C.12, C.13, C.14 c, a, a, b, d
C.15, C.16, C.17, C.18, C.19 b, c, a, g, b
C.20, C.21, C.22, C.23, C.24 b, b, b, a, a
C.25, C.26, C.27, C.28, C.29
C.30, C.31, C.32, C.33, C.34
C.35
C.36, C.37, C.38, C.39, C.40 c, b, a, c, d
C.41, C.42, C.43, C.44, C.45 e, a, b, n, a
C.46, C.47, C.48, C.49, C.50 a, a, b, a, a
C.51, C.52 b, f
C.53, C.54, C.55, C.56, C.57 a, g, h, b, b
C.58, C.59, C.60, C.61, C.62 c, d, b, a, c
C.63, C.64, C.65, C.66, C.67 c, b, c, c, d

C.1 Quiz I

Exercise C.1. Suppose you’re on a game show, and you’re given the choice of 8
doors. Behind one door is a car; behind the others, goats.
You pick a door, say No. 1, and the host, who knows what’s behind the doors
and cannot open a door with a car, opens 6 doors with a goat. Your door and
door No. 2 remain closed.


He then says to you, “Do you want to pick door No. 2”? What is the probability
of winning the car by picking door No. 2?

a. 0.83
b. 0.88
c. 0.50
d. 0.13
e. 0.24
f. None in the list

Exercise C.2. The p-value of a test is 0.163.

Is there an 𝛼 for which the null hypothesis would be rejected?

a. No, unless the test is one-tailed


b. Yes, 5%
c. No, unless the test is two-tailed
d. Yes, 10%
e. None in the list

Exercise C.3. Suppose that you want to compare a given proportion between
two populations (population 1 and population 2). You obtain a sample from each
population and want to test:

𝐻0 ∶ 𝑝1 − 𝑝2 = 0.25

In that case, using the pooled probability for the standard error, as we saw in
class, is still appropriate?

a. True
b. False

Exercise C.4. Consider the following output of a R function for a test:


1-sample proportions test without continuity correction
data: p.hat1 * n1 out of n1, null probability p.0
X-squared = 2, df = 1, p-value = 0.1573
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.2760839 0.5381856
sample estimates:
   p
 0.4

If 𝛼 = 5%, what decision will you take about 𝐻0 ?



a. Do not reject 𝐻0
b. Reject 𝐻0
c. Cannot be said (information is lacking)

Exercise C.5. The use of a chi-squared distribution for the sampling distribution
of a statistic in a small sample, instead of a normal distribution is justified by its
higher accuracy.

a. True
b. False

Exercise C.6. The construction of the sampling distribution through the use of
permutations is an alternative to the use of analytical results (theorems, etc).

a. True
b. False

Exercise C.7. Suppose that, under the null, the relevant statistic for a two-tailed
test follows a standard normal distribution. Let 𝛼 = 5%.
The statistic in the sample is 0.01. What decision does the test recommend about
𝐻0 ?

a. Do not reject 𝐻0
b. Reject 𝐻0
c. Cannot be said (lack of information)

Exercise C.8. According to the central limit theorem, the sampling distribution
of the mean can be approximated by the normal distribution:

a. As the size of the sample standard deviation decreases.


b. As the sample size (number of observations) gets “large enough.”
c. As the size of the population standard deviation increases.
d. As the number of samples gets “large enough.”

Exercise C.9. Which of the following properties is not true regarding the sam-
pling distribution of the sample mean?

a. By the central limit theorem, the distribution of the sample mean is


normal no matter how large 𝑛 is.

b. 𝐸[𝑋]̄ = 𝜇, i.e., the expected value of the sample mean is the true
mean no matter how large 𝑛 is.
c. When the population being sampled follows a normal distribution,
the distribution of the sample mean is normal no matter how large 𝑛 is.
d. 𝜎𝑋̄ = 𝜎/√𝑛

Exercise C.10 (Quiz I, 20-21). Suppose that a test for COVID-19 uses

𝐻0 ∶ the virus is present and active in the individual

Suppose as well that the test can be modified by increasing the 𝐶𝑇 , from 20 to
25 or even higher. When the 𝐶𝑇 is increased, the presence of even pieces of the
virus, or even dead-virus will be detected and the test will be flagged as positive.
Recall the terminology about the type of errors and their probabilities. What does
a policy of decreasing the 𝐶𝑇 of the test correspond to?

a. increasing 𝛽
b. decreasing 𝛼
c. increasing 𝛼
d. cannot be said
e. decreasing 𝛼 and 𝛽
f. increasing 𝛼 and 𝛽

Exercise C.11 (Quiz I, 20-21). A video on Youtube shows a magician, always


before a fair die is thrown, guessing 20 times in a row the number appearing on
a die.
Extremely sophisticated tests on the video show that no editing whatsoever has been performed on the video.
Which of the following is the most likely explanation of this feat?

a. the magician spent time trying until they achieved the streak
b. the tests on the video were not good enough
c. the magician has psychic powers
d. the die obeyed the voice of the magician
e. this was the result of the skills of the magician at their best, though
it is extremely unlikely that they will be able to repeat the feat

Exercise C.12 (Quiz I, 20-21). I obtain a sample of students answering this ques-
tion.
If I want to test whether the students are better or worse than my grandmother at answering it, what is the most appropriate null hypothesis (for the probability of knowing the correct answer), given that my grandmother knows absolutely nothing about statistics?

a. 𝑝 = 1/5
b. 𝑝 = 1/4
c. 𝑝 = 1/3
d. 𝑝 = 1/2
e. 𝑝 = 2/3
Exercise C.13 (Quiz I, 20-21). A die is known to be fair. For 5 times in a row, it
showed a number larger than 3.
The probability that the next throw (the 6th, in an n=6 experiment) shows a number smaller than or equal to 3 is now 0.5 + 1/√𝑛.

a. True
b. False

Exercise C.14 (Quiz I, 20-21). In R, the names of the functions for random vari-
ables (distributions) follow a pattern, in particular with respect to the first letter.
For instance, Xunif() does the “same” with a uniform distribution as Xchisq()
does with a chi-squared distribution (possibly needing more arguments,
though).
What is the function that gives the probability to the left of a value ‘q’ in a t-
distribution?

a. qt()
b. rt()
c. dt()
d. pt()

Exercise C.15 (Quiz I, 20-21). We take two samples, A and B, from a population
(with known variance). Sample A contains 200 observations and sample B, 250.
By coincidence, the mean of each sample is the same.

We then use each sample to make a test of hypothesis about the true mean in the
population, 𝜇. The null will be: 𝐻0 ∶ 𝜇 = 𝜇0 for some value of 𝜇0 . For instance,
we can use each sample to test whether the true mean in the population is 50.
Generally, in which sample will the null be rejected more often?

a. A
b. B
c. We cannot say: depends on 𝜇0
d. We cannot say: we need more information about the samples
e. [None in the list]

Exercise C.16 (Quiz I, 20-21). Consider the following output of a test in R.


2-sample test for equality of proportions without
continuity correction
data: c(p.hat1 * n1, p.hat2 * n2) out of c(n1, n2)
X-squared = 0.97466, df = 1, p-value = 0.1618
alternative hypothesis: greater
95 percent confidence interval:
-0.0470849 1.0000000
sample estimates:
prop 1 prop 2
0.24 0.18

What is 𝐻0 in this test?

a. 𝑝2 ≤ 𝑝1
b. 𝑝1 = 𝑝2
c. 𝑝1 ≤ 𝑝2
d. 𝑝1 = 𝑝2 = 0
e. cannot be said

Exercise C.17 (Quiz I, 20-21). In a test of hypothesis, what is the minimum value
that the p-value can take?

a. 𝜖 > 0 where 𝜖 arbitrarily small


b. 0
c. cannot say: depends on the situation
d. cannot say: depends on 𝛼

e. cannot say: depends on the sampling distribution

Exercise C.18 (Quiz I, 20-21). When conducting a test of hypothesis, the rejection
region approach and the p-value approach suggest decisions about 𝐻0 that are…

a. only always identical when 𝐻0 is false


b. only always identical when 𝐻0 is true
c. rarely the same
d. always identical when the test is one-tailed, but sometimes different
when the test is two-tailed
e. always identical when the test is two-tailed, but sometimes different
when the test is one-tailed
f. almost always identical
g. always identical
h. (no answer in the list is correct)

Exercise C.19 (Quiz I, 20-21). The fact that virtually everybody can access
spreadsheet-based software, such as Excel, qualifies this tool as a tool for repro-
ducible research.

a. True
b. False

Exercise C.20 (Quiz I, 20-21). Consider the following output of a test in R.


2-sample test for equality of proportions without
continuity correction
data: c(p.hat1 * n1, p.hat2 * n2) out of c(n1, n2)
X-squared = 0.97466, df = 1, p-value = 0.1618
alternative hypothesis: greater
95 percent confidence interval:
-0.0470849 1.0000000
sample estimates:
prop 1 prop 2
0.24 0.18

What decision does the test recommend about 𝐻0 ?

a. reject 𝐻0

b. not reject 𝐻0
c. cannot be said
Exercise C.21 (Quiz I, 20-21). Under the null, the sampling distribution of a test
statistic is a normal distribution with mean 0 and standard deviation 1, i.e., a
standard normal.
In our sample, the value of the test statistic is 1.
Assume 𝛼 = 0.05. What decision does the test recommend about 𝐻0 ?

a. reject 𝐻0
b. not reject 𝐻0
c. cannot be said
Exercise C.22 (Quiz I, 20-21). The standard deviation of the sampling distribu-
tion of the sample mean, i.e., the standard error of the mean, decreases linearly
with 𝑛, the sample size.

a. True
b. False
Exercise C.23 (Quiz I, 20-21). Recall the following question:
Which of the following sequences of X’s and O’s seems more likely to have been
generated by a random process (e.g., flipping a coin)?

• XOXXXOOOOXOXXOOOXXXOX
• XOXOXOOOXXOXOXOOXXXOX

Respondents tend to answer “sequence 2”, which is the wrong answer.


This case illustrates the belief in the law of small numbers, as coined by Kahne-
man and Tversky.

a. True
b. False
Exercise C.24 (Quiz I, 20-21). You estimate a parameter in the population. The
more precise you want your point estimate to be, the larger the sample size has
to be.

a. True
b. False

C.2 Midterm Quiz

Exercise C.25 (Midterm, 20-21). We would like to investigate the rate of return
of two stocks, say of Company A and Company B. For this, we take a random
sample of a large number of days and record the value of the stock for each
company on each of these days. These data are paired.

a. True
b. False

Exercise C.26 (Midterm, 20-21). While manipulating the data on two variables,
𝑥 and 𝑦, the largest (and positive) value of 𝑥 is accidentally multiplied by 1.96.
Which of the following applies for the correlation between 𝑥 and 𝑦 after this
manipulation?

a. the Pearson’s correlation does not change


b. the Spearman’s correlation does not change
c. both Spearman’s and Pearson’s correlations change
d. none of Spearman’s and Pearson’s correlations change
e. [cannot be said]

Exercise C.27 (Midterm, 20-21). Consider a sampling distribution obtained to


compare the means of two relatively large groups by subsetting a large num-
ber of all the possible permutations of the observations in the two groups. That
sampling distribution is exactly symmetric around its mean.

a. True
b. False

Exercise C.28 (Midterm, 20-21). In different samples, I obtain the following Pear-
son correlations that I use in the relevant test. Other things equal, for which is
the null most likely to not be rejected?

a. 0.45
b. -0.25
c. 0.04

FIGURE C.1: Estimation output.

d. -0.78
e. 0.85

Exercise C.29 (Midterm, 20-21). Consider the output in Figure C.1. You read the main
part line by line, i.e., for each variable (smoke, race, ht, ui, etc…). For each line,
there is a test with 𝐻0 ∶ 𝐶𝑜𝑒𝑓. = 0, where “Coef.” stands for coefficient. The
columns towards the right give the 𝑡 score and the p-value (𝑃 > |𝑡|). Which of
the following variables has a coefficient statistically different from 0 given the
model? Use 𝛼 = 1%.

a. smoke
b. race
c. age
d. [none in this list]
e. [cannot be said]

Exercise C.30 (Midterm, 20-21). In a unilateral (one-tail) test of hypothesis, in


contrast to a bilateral test of hypothesis, we do not need to multiply the test
statistic by 2 in order to obtain the p-value.

a. [sentence does not make sense]


b. True
c. False, we must also multiply it by 2

d. True, if 𝐻0 ∶ 𝜇 = 0

Exercise C.31 (Midterm, 20-21). In a random sample of 629 respondents accu-


rately reflecting the Portuguese population, 57% declared that they were in favor
of postponing the presidential election. Based on that data, your friend claims
that the majority of the Portuguese population is in favor of postponing the pres-
idential election. Which of the following is your best evaluation of that claim?
Notice that you could calculate the answer, but you are asked to answer by using
the intuition that you have built based on the relevant elements of the class.

a. Yes, the data in the sample support the claim.


b. I have strong doubts because 57% is not much larger than 50%.
c. The data in the sample clearly contradict the claim.

Exercise C.32 (Midterm, 20-21). Consider the following plot (Figure C.2), intentionally shown without a legend. Of course, this is not a good plot. How many aesthetic mappings were coded?

a. 2
b. 3
c. 4
d. 5
e. 6

Exercise C.33 (Midterm, 20-21). Which of the following would result in an in-
crease of the range of the confidence interval for the estimation of the population
mean?

a. [none in this list]


b. an increase of 𝑛, the number of observations
c. a decrease in 𝜎, the population’s standard deviation
d. an increase of 𝛼, knowing that the confidence level is 1 − 𝛼
e. [all in the list]

Exercise C.34 (Midterm, 20-21). Consider the construction of the confidence in-
terval for a parameter such as the mean from a relatively small sample. For a
given confidence level, the absolute value of 𝑡𝑑𝑓,1−𝛼 is XXX than 𝑧1−𝛼 , making
the confidence interval YYY. What do XXX and YYY stand for, respectively?

FIGURE C.2: Aesthetics mappings.

a. larger, wider
b. larger, narrower
c. smaller, wider
d. smaller, narrower
e. [cannot be said]

Exercise C.35 (Midterm, 20-21). We are interested in testing the mean effect, 𝜇, of a new drug on the inflammation level of an injury.
Arguably, the best formulation of the null and the alternative is

𝐻0 ∶ 𝜇 = 0 vs 𝐻𝑎 ∶ 𝜇 ≠ 0,
and not

𝐻0 ∶ 𝜇 ≥ 0 vs 𝐻𝑎 ∶ 𝜇 < 0.

a. True
b. False

C.3 Quiz II

Exercise C.36 (Quiz II, 20-21). A categorical variable can take three values: north,
center, and south. This is why we create three dummy variables.

• 𝐷1 = 1 if observation is from north, 0 otherwise.


• 𝐷2 = 1 if observation is from center, 0 otherwise.
• 𝐷3 = 1 if observation is from south, 0 otherwise.

We wish to regress a continuous variable 𝑦 on that variable and we know we


must avoid the dummy-trap. This is why we estimate the model:

𝑦 = 𝛽0 + 𝛽1 𝐷1 + 𝛽2 𝐷2 + 𝜀

We find the following coefficients:

𝛽̂0 = 8.62
𝛽̂1 = 2.89
𝛽̂2 = −1.44

Suppose that instead we had estimated the model:

𝑦 = 𝛼0 + 𝛼1 𝐷2 + 𝛼2 𝐷3 + 𝜖

What would be the value of 𝛼0̂ ?

a. 8.62
b. 7.18
c. 11.51

d. 7.17
e. 12.95
f. 5.73
g. 10.06
h. 1.45
i. -4.29
j. -1.45
Exercise C.37 (Quiz II, 20-21). Suppose that you use a statistical model to predict
the value of the variables below. Determine whether the estimation refers to a
regression (R) or a classification problem (C).

1. a worker’s wage,
2. the commute time of workers,
3. the preferred type of transportation of individuals,
4. the self-assessed health level from 1 (very bad) to 5 (very good),
5. the risk of cancer group of individuals.

The answers below of R’s and C’s follow the order of these variables.

a. CRRCC
b. RRCCC
c. CCRCR
d. CCRRR
e. RRCRC
f. RCRCC
g. CCRRC
h. CCCRR
Exercise C.38 (Quiz II, 20-21). We can say that k-fold cross-validation is an ex-
tension/generalization of the simple validation set approach.

a. True
b. False
Exercise C.39 (Quiz II, 20-21). The following gives the estimates of a simple lin-
ear model.

𝑦 ̂ = 3.85 − 2.95𝑥

What is the predicted value of the model when the independent variable is equal
to 12?

a. 3.85
b. 15.85
c. -31.55
d. -35.40
e. 3.60
f. -2.76

Exercise C.40 (Quiz II, 20-21). In a simple linear regression, the slope coefficient
is 1.240 and it has a 𝑡-value of 5.544 when testing the null hypothesis that the
true parameter is 0.
What is the standard error of the slope coefficient?

a. 6.00
b. 4.47
c. 1.00
d. 0.22
e. 6.87

Exercise C.41 (Quiz II, 20-21). In the context of linear regression estimation for a
model of the variable 𝑦 , suppose that the usual significance test on a coefficient
allows us to reject the null hypothesis for the variable 𝑥1 .
My grandmother claims that this result implies that 𝑥1 has a causal effect on (i.e., is a cause of) 𝑦.
What is your best evaluation of my grandmother’s claim?

a. It is correct only if the researcher used cross-validation


b. It is correct if the 𝑅2 is very high
c. It is not correct
d. It is correct
e. It could be true, but it is not necessarily the case

Exercise C.42 (Quiz II, 20-21). Suppose you regress a normal random variable
𝑦 on another explanatory (?) random normal variable 𝑥, i.e., you estimate the
model

𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀
Then, you carry the usual test of the hypothesis: 𝐻0 ∶ 𝛽1 = 0. What is the
probability that you reject 𝐻0 ?

a. 𝛼
b. cannot be determined
c. 𝛽1
d. 𝛽0

Exercise C.43 (Quiz II, 20-21). A researcher’s effort to continuously increase the
ability of their model to fit their data at hand is rewarded with increased accuracy
in predictions.
This statement is generally…

a. True
b. False

Exercise C.44 (Quiz II, 20-21). Consider a regression predicting weight (kg) from
height (cm) for a sample of adult males. What are, respectively, the units of:
A. the correlation coefficient, B. the intercept, C. the slope.

a. A. cm/kg. B. no units. C. cm
b. A. kg. B. kg. C. cm/kg
c. A. no units. B. no units. C. kg/cm
d. A. cm/kg. B. cm. C. kg
e. A. kg. B. kg. C. kg
f. A. kg. B. cm. C. no units
g. A. cm. B. no units. C. kg
h. A. kg/cm. B. kg/cm. C. kg/cm
i. A. kg. B. no units. C. kg/cm
j. A. cm. B. cm. C. cm
k. A. kg/cm. B. kg. C. no units
l. A. cm/kg. B. cm/kg. C. kg/cm
m. A. cm/kg. B. no units. C. kg/cm
n. A. no units. B. kg. C. kg/cm
o. A. no units. B. kg. C. no units.

Exercise C.45 (Quiz II, 20-21). The 𝑅2 of a multiple regression model does not
use cross-validation. Therefore, it is not a reliable measure of the 𝑅2 of the same
regression model in another sample.

a. True
b. False

Exercise C.46 (Quiz II, 20-21). You define a dummy variable 𝐷𝑀 which takes the
value 1 if the individual is a man and 0 otherwise (the individual is a woman).
You then estimate a linear regression model of an individual’s wage using sev-
eral variables, including 𝐷𝑀 . Other things equal, you find that the value of the
coefficient on the dummy variable 𝐷𝑀 is equal to 20.
Now, suppose that instead of 𝐷𝑀 , you had used 𝐷𝑊 , a dummy variable taking
the value 1 if the individual is a woman and 0 otherwise.
Other things equal, what would be the value of the coefficient on the dummy
variable 𝐷𝑊 ?

a. -20
b. It cannot be determined
c. It depends on the intercept

Exercise C.47 (Quiz II, 20-21). The mean squared error (MSE) criterion cannot
correctly be used in the context of classification problems.

a. True
b. False

Exercise C.48 (Quiz II, 20-21). In the Netflix challenge, competitors were asked
to provide a model and an estimation technique in order to predict an aspect of
the clients’ choice based on some training data. Since the competitors could not
access the data on which their model will be evaluated, the problem is one of
unsupervised learning.

a. True
b. False

Exercise C.49 (Quiz II, 20-21). Suppose you assume that the true relationship
between the explained variable, the weight of a baby, (in grams), and the ex-
planatory variable, the baby’s age (in months) is linear. You estimate a simple
linear regression model and obtain 𝛽0̂ and 𝛽1̂ .
You later come to realize that the scale that was used to weigh the babies was
deficient in the sense that it was always 52 grams above the real weight.
Does this result in a bias of your 𝛽1̂ ?

a. No
b. Yes
c. Cannot be determined
Exercise C.50 (Quiz II, 20-21). The following estimated model of relationship
between variables is a linear regression model.

log(𝑦) = 𝛽0 + 𝛽1 log(𝑥) + 𝜀

where log(⋅) is the (non-linear) logarithm function.

a. True
b. False
Exercise C.51 (Quiz II, 20-21). When estimating and presenting the results of a
linear regression model, a high value of the 𝑅2 is a necessary requirement for
the validity of the model and its publication in a good research journal.

a. True
b. False
Exercise C.52 (Quiz II, 20-21). Let 𝑦 be the total weight of the individuals in a
sample. The total number of individuals is denoted 𝑥 and the total number of
kids among these individuals is denoted 𝑤.
We estimated the following linear regressions:

𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀

𝑦 = 𝛾0 + 𝛾1 𝑤 + 𝜖

𝑦 = 𝛼0 + 𝛼1 𝑥 + 𝛼2 𝑤 + 𝜉

What are, respectively, the likely signs of: 𝛽1̂ , 𝛾1̂ , 𝛼1̂ , 𝛼2̂ ?
(A hat ˆ over a variable denotes an estimate obtained through the OLS procedure.)

a. undetermined, undetermined, undetermined, undetermined


b. -,+,-,+
c. +,+,+,+
d. +,+, undetermined, undetermined
e. +,-,+,-
f. +,+,+,-
g. None in the list
h. -,-,-,-

C.4 Endterm Quiz

Exercise C.53 (Candidate Endterm Quiz, 20-21). Consider the linear fit in Figure
C.3, assuming that it represents the true relationship between the variable ‘sales’
and the variable ‘TV’.
What assumption of the linear model seems to be violated?

a. Homoscedasticity
b. Normality of the errors
c. Correlation between errors
d. None in the suggested list

Exercise C.54 (Candidate Endterm Quiz, 20-21). A categorical variable can take
three values: north, center, and south. This is why we create three dummy vari-
ables.

• 𝐷1 = 1 if observation is from north, 0 otherwise.


• 𝐷2 = 1 if observation is from center, 0 otherwise.


FIGURE C.3: Linear fit and residuals, again.

• 𝐷3 = 1 if observation is from south, 0 otherwise.

We wish to regress a continuous variable 𝑦 on that variable and we know we


must avoid the dummy-trap. This is why we estimate the model:

𝑦 = 𝛽0 + 𝛽1 𝐷1 + 𝛽2 𝐷2 + 𝜀

We find the following coefficients: 𝛽0̂ = −9.54, 𝛽1̂ = −1.31, 𝛽2̂ = 4.33.
Suppose that instead we had estimated the model:

𝑦 = 𝛼0 + 𝛼1 𝐷2 + 𝛼2 𝐷3 + 𝜖

What would be the value of 𝛼1̂ ?

a. 3.90
b. -15.18
c. -13.87
d. -5.21
e. -3.02
f. -12.56
g. 5.64

FIGURE C.4: Regression output for exercise with XXX.

h. 3.02
i. -9.54
j. -8.23
Exercise C.55 (Candidate Endterm Quiz, 20-21). Consider the regression output
in Figure C.4. Determine the value of XXX.

a. -4.358
b. 3.762
c. 7.852
d. 2.091
e. -0.782
f. -1.349
g. 2.876
h. none in the suggested list

Exercise C.56 (Candidate Endterm Quiz, 20-21). You use OLS to regress a vari-
able 𝑦 on a variable 𝑥 and find that the coefficient on the variable 𝑥 is not statisti-
cally significant, i.e., we cannot reject the hypothesis that the coefficient is equal
to 0. We do the same test on a large number of samples and the result of the test
is always the same.
This means that there is no relationship between these variables.

a. True
b. False

Exercise C.57 (Candidate Endterm Quiz, 20-21). If we knew the real/true func-
tion relating an explained variable 𝑌 to the set of explanatory variables 𝑋 , then,
given these explanatory variables, we could achieve 0 MSE in the test data.

a. True
b. False

Exercise C.58 (Candidate Endterm Quiz, 20-21). Recall the results of the paper
by Ferguson and Voth (2008) shown in Figure C.5.
Based on the table, call A, the average (log) returns of a firm connected to the Nazi
regime in the January-March 1933 period; call B the average (log) returns of a firm
unconnected to the Nazi regime in the November 1932-January 1933 period. This
implies that you must use the models without any explanatory variable beyond
‘Nazi’.
What is A-B (A minus B, i.e., the difference in the log returns for these two groups
of firms)?

a. 0.1215
b. -0.0343
c. none in the suggested list
d. -0.0522
e. 0.0522
f. 0.0865
g. -0.0673
h. 0.0673
i. 0.0343

Exercise C.59 (Candidate Endterm Quiz, 20-21). In the old times of low com-
puting power, which of the following would be the most affordable method of
cross-validation?

a. none in the suggested list


b. LOOCV
C.4 Endterm Quiz 353

FIGURE C.5: Regressions results.

c. k-fold CV
d. Validation set

Exercise C.60 (Candidate Endterm Quiz, 20-21). Suppose individuals can have
three levels of education: 12 years of schooling, 17 years of schooling, or 21 years
of schooling.
Imagine we want to estimate a model of the wage. We could estimate a model

A (including dummies for 2 of the 3 levels) or model B (including the variable years of schooling). For a given individual, both models would give the same predictions.

a. True
b. False

Exercise C.61 (Candidate Endterm Quiz, 20-21). Suppose we estimate the fol-
lowing model,

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝛽2 𝑥2𝑖 + 𝜀𝑖

where 𝑥2 is the square of the variable 𝑥.


This still belongs to the type of linear models.

a. True
b. False

Exercise C.62 (Candidate Endterm Quiz, 20-21). Suppose we fit a regression line


to predict the number of goals scored in a season by a striker. For a particular
striker, we predict 12.4 goals and the residual for that observation is -1.4.
Does our model over or underestimate this striker’s number of goals?

a. It depends/cannot be determined
b. Underestimate
c. Overestimate

Exercise C.63 (Candidate Endterm Quiz, 20-21). Suppose that you use a statis-
tical model to predict the value of the variables below. Determine whether the
estimation refers to a regression (R) or a classification problem (C).

1. The presence of a smoker in a household


2. The party with the highest vote in an election
3. The number of people in a household
4. The highest academic degree achieved by a worker
5. The VAT level of a product in a shop

The answers below, sequences of R’s and C’s, follow the order of these variables.

a. CRRCC
b. RRCCC
c. CCRCC
d. CCRRR
e. RRCRC
f. RCRCC
g. CCRRC
h. CCCRR

Exercise C.64 (Candidate Endterm Quiz, 20-21). Suppose you estimate a linear
model with a dummy variable 𝐷.

𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝛽2 𝐷 + 𝜀

where 𝑥 and 𝑦 are continuous variables and 𝜀 a random error term.


What is the unit of the 𝛽2 coefficient on that dummy 𝐷?

a. it’s unit free


b. the same as 𝜀
c. the same as 𝛽1
d. none in the suggested list
e. the same as 𝑥

Exercise C.65 (Candidate Endterm Quiz, 20-21). Consider the two graphs in Fig-
ure C.6 showing the explained variable as a function of two explanatory vari-
ables, x1 and x2.
Which of the two (left or right) illustrates better a problem of unsupervised learn-
ing?

a. right
b. none of the two
c. left
d. it depends/cannot be determined

Exercise C.66 (Candidate Endterm Quiz, 20-21). Recall the results of the paper
by Ferguson and Voth (2008) shown in Figure C.5.

FIGURE C.6: One is unsupervised learning.

In all likelihood, for the November 1932-January 1933 period, for how many firms did the authors have information on the ‘Market Cap’ but not on the ‘Dividend Yield’?

a. 57
b. cannot be determined
c. 53
d. 22
e. 67
f. 0
g. 12

Exercise C.67 (Candidate Endterm Quiz, 20-21). Suppose we estimate a simple


linear regression model with OLS. The true relationship between the variables is
positive and it shows so in the 𝑥 (horizontal) - 𝑦 (vertical) quadrant. We obtain
𝛽0̂ and 𝛽1̂ .
Now, suppose we add an outlier to our sample, i.e. an observation with a low
value of 𝑥 but a high value of 𝑦 , and estimate the model again.
Compared to the initial 𝛽0̂ and 𝛽1̂ , the new coefficients will be XXX for 𝛽0̂ and
YYY for 𝛽1̂ , respectively.
What do XXX and YYY stand for, respectively?

a. higher; higher
b. cannot be determined
c. lower; higher
d. higher; lower
e. lower; lower

C.5 Selected Quiz I Solutions

Solution to Exercise C.1

See Section 1.4 for a lengthy explanation.

Solution to Exercise C.2

Recall that 𝛼 is the probability of the Type I error that the researcher is ready to accept when rejecting the null hypothesis, knowing that it could be true.
Since the value of 𝛼 is chosen by the researcher, there is always a value for which the null would be rejected. For instance, depending on the conditions, the researcher could set 𝛼 at 20%. A weird and difficult-to-justify choice, but always a possible one.

Solution to Exercise C.3

As always in a test of hypothesis, we want to determine the sampling distribu-


tion under the null. Generally, the null is telling us that 𝑝1 = 𝑝2 , so it is reason-
able to pool the probabilities. After all, this is consistent with assuming that 𝐻0
is true. In this question, however, this equality is not holding under 𝐻0 . Instead,
under the null, 𝑝1 ≠ 𝑝2 . So it does not make sense to pool the probabilities.

Solution to Exercise C.4

Look at the p-value. It is larger than 𝛼. If the null is true, the probability of observing a test statistic as large as or even larger than the statistic from the sample is too high to reject the null. Hence the test suggests not rejecting 𝐻0.

Solution to Exercise C.5

The sentence does not really make sense. The sampling distribution is derived
analytically, following the assumptions on the distributions of the random vari-
ables involved.

Solution to Exercise C.6

Notice that the analytical results are based on assumptions about the random variables involved. If we are ready to make them to develop a theory, we can also make them and feed them to a computer. The latter will then be able to generate a very large number of samples and obtain the sampling distribution in this way.

Solution to Exercise C.7

The statistic falls very close to the true value under the null. Hence, we will certainly not reject the null. To better see this, recall Figure 4.1. In this question, the sampling distribution is a normal centered at 0. The test statistic is 0.01, i.e., very close to 0. So, there is no chance it falls in the rejection region.
The question aims at evaluating your understanding of the difference between

the score/statistic in the sample and the p-value. The former can range from
minus (virtually) infinity to plus infinity and leads to a rejection of the null if
it falls in the rejection area (i.e., has a small p-value). The latter is a probability,
hence must be between 0 and 1.

Solution to Exercise C.8

See Section 5.2. The condition for the Central Limit Theorem to apply its magic is that the sample size becomes sufficiently large.
The standard deviation in the population does not change and its sample counterpart cannot be expected to change dramatically with 𝑛.

Solution to Exercise C.9

Once again, see Section 5.2. The condition for the Central Limit Theorem to apply
its magic is that the sample size becomes sufficiently large.

Solution to Exercise C.10

The increase of 𝐶𝑇 will result in more cases being flagged as “covid-positive”,


i.e., as saying that the patient has covid. Among these cases are not only the truly
“covid-positive” but also false “covid-positive” cases since “the presence of even
pieces of the virus, or even dead-virus will be detected”. These latter cases are
errors of the test.
An increase in 𝐶𝑇 will therefore result in an overall smaller rejection of the null.
Since 𝛼 is the probability of rejecting the null when the null is true, other things
equal, this probability will also decrease.
To fix ideas, imagine that the CT is so high that everybody is considered positive.
Then, we will never reject the null. But then, we will never make a Type I error.
This means alpha is 0.
In this question, we are asked about the effect of a decrease of 𝐶𝑇. It should be clear that the effect is the opposite of the above, i.e., an increase of 𝛼.
Notice also that the CT is not a statistical concept. It is only a setting of the covid test. When the covid test is made with a high CT, it behaves in a certain way. When it is done with a small CT, it behaves another way.

Solution to Exercise C.11

Notice at the outset that all the options are possible, including the magician hav-
ing psychic powers. (Who knows!?) But the question asks about “the most likely explanation”.
This rules out psychic powers (if anything because otherwise the magician
would certainly be making millions of dollars before Youtube videos) and the
die obeying the voice of the magician.
The technical argument is defeated in the very question: “extremely sophisti-
cated tests” show no sign of editing.
In the end, the most likely explanation is the painful and long task of trying until succeeding. There is no skill involved (what would it be if not psychic powers, which would take us back to the case above) and, if there were, that explanation would be defeated by the second part of the sentence: a skill, by definition, is an ability that can be used more than once.

Solution to Exercise C.12

The null hypothesis should reflect the situation where the student knows nothing
about the question, i.e., they answer randomly.
In that case, the probability of answering correctly is one in 𝑛, 𝑛 being the number
of possible choices.

Solution to Exercise C.13

Obviously, the probabilities in the nth throw do not change because of the results of the n−1 previous throws.

Solution to Exercise C.14

The function that gives the probability to the left of a value ‘q’ in a normal distri-
bution is pnorm(). Following the explanation in the question, the answer is pt().
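To fix ideas, here is a minimal sketch in R (the value of q and the degrees of freedom are arbitrary):

pnorm(q = 1.96) # probability to the left of 1.96 in a standard normal, about 0.975
pt(q = 1.96, df = 10) # same probability in a t distribution with 10 degrees of freedom, slightly smaller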

Solution to Exercise C.15

To understand this answer, we can recall the test for the mean based on the statis-
tic

𝑍 = (𝑋̄ − 𝜇0) / (𝜎/√𝑛)

The larger the 𝑛, the larger the statistic. Hence, the higher the chances that it falls in the rejection region.
Intuitively, the fewer observations, the less sure we will be about the true value. Hence, when testing for a specific number given in 𝐻0, we will be less able to reject the hypothesis that the true value is equal to that given number.
Seen from the other side, imagine we have thousands and thousands of observa-
tions. Then, we will be much more certain about the true value. When testing for
a specific number given in 𝐻0 , if that number is not the mean that we obtained
in the sample, or very very close to it, then we will reject the null.

Solution to Exercise C.16

The first and main hint for this answer is that the R output says that
alternative hypothesis: greater. Therefore, the null involves the opposite, namely
≤. In the R command, we give the first proportion… first and the second after
that. So, the null of the test is 𝑝1 ≤ 𝑝2 .
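As a minimal sketch with made-up counts, such a test could be called in R as follows; the output then reports alternative hypothesis: greater.

prop.test(x = c(60, 45), n = c(100, 100), alternative = "greater") # first proportion, then the second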

Solution to Exercise C.17

Recall that the p-value is the probability, assuming that the statistical model and
𝐻0 are true, that we observe a test statistic as large as or more extreme than the
value we observe in the sample.
Read again… “as large as… the value we observe in the sample”. If we observe
it, then the probability of observing it cannot be zero! It can be extremely small,
yes, but not 0.

Solution to Exercise C.18

The two methods must be equivalent, otherwise we would need to discuss when
their results differ.
Mathematically, using a classic case, it is equivalent to evaluate the following
two comparisons:

• p-value = 2 ∗ 𝑃 (𝑍 > 𝑧) < 𝛼, or


• 𝑧 > 𝑧𝛼/2 ,

where 𝑧 is the test statistic in the sample.

Solution to Exercise C.19

It is not enough that you have the same software. For reproducibility, one needs to be able to obtain the same results in a reasonably easy way, i.e., without needing to check all the cells individually to see if there is a mistake. (This is an argument regarding Excel. Other arguments apply in general, e.g., availability of data, etc.)

Solution to Exercise C.20

The R output shows a p-value larger than 5%, i.e., the test statistic is not too
extreme compared to the threshold that we chose (see 95% confidence). Hence
the test recommends to not reject the null.

Solution to Exercise C.21

The statistic falls relatively close to the true value under the null. Hence, we will certainly not reject the null. To better see this, recall Figure 4.1. In this question, the sampling distribution is a normal centered at 0. The test statistic is 1, i.e., somewhat close to 0. So, there is little chance it falls in the rejection region.
Actually, since the sampling distribution has a standard deviation of 1 (and mean
0), then a test statistic of 1 is exactly 1 standard deviation away from 0. We should

know that this is not in the rejection region. As a benchmark, recall that at the 5%
significance level, the rejection region starts around 2 standard deviations away
from the mean.

Solution to Exercise C.22

Recall that the relationship between the standard deviation of the sampling distribution of the sample mean, 𝜎𝑋̄, and the sample size, 𝑛, is given by

𝜎𝑋̄ = 𝜎/√𝑛

The relationship between 𝜎𝑋̄ and 𝑛 is therefore not linear. It would be if, for instance, we had

𝜎𝑋̄ = 𝜎 − (𝑛/100) 𝜎

Solution to Exercise C.23

The second sequence incorrectly looks more random because it fits the law of small numbers. The latter states that the law of large numbers ought to apply to small samples too.
As evidence of that, consider the first two observations of the second sequence,
i.e., 𝑛 = 2. By the “law of small numbers” we should expect 50%-50% distribu-
tion between X’s and O’s. That’s what we have.
The same applies to 𝑛 = 4, the first 4 observations. By the “law of small num-
bers” we should expect 50%-50% distribution between X’s and O’s. That’s what
we have. Same with 𝑛 = 6.
So, this example illustrates decisions about randomness based on the law of
small numbers.

Solution to Exercise C.24

This result should be pretty intuitive: the larger the sample, the more information we have, and the more precise (and certain) we can be.

Another way of looking at it is by recalling the formula for the margin of error,

𝑀𝐸 = 𝑧𝛼/2 ⋅ 𝜎/√𝑛
We can see that the larger the 𝑛, the smaller the 𝑀 𝐸 .
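A quick sketch in R, with made-up values for 𝜎 and 𝛼 = 0.05, illustrates this:

sigma <- 10 # assumed population standard deviation (made up)
z <- qnorm(1 - 0.05/2) # z_{alpha/2}, about 1.96
z * sigma / sqrt(c(25, 100, 400)) # the margin of error shrinks as n grows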

C.6 Selected Quiz II Solutions

Solution to Exercise C.36

In the second model, 𝛼0̂ will be the predicted value for an observation where 𝐷2
and 𝐷3 are both 0. In other words it’s the predicted value for the variable when
𝐷1 is equal to 1.
From the first model we can calculate the predicted value for the variable when 𝐷1 is equal to 1. It’s 𝛽0 + 𝛽1. Hence, 𝛼0̂ = 𝛽0̂ + 𝛽1̂ = 8.62 + 2.89 = 11.51.

Solution to Exercise C.37

“A worker’s wage” and “the commute time of workers” are measured with a continuous variable. Hence, they would imply a regression problem.
The remaining variables are categorical in nature, even if we can express each
category with a number, e.g., 1 to 5. Hence, they call for a classification tool.

Solution to Exercise C.38

Yes, we can say so. The simple validation set approach separates the train data into two sets, training and validation, using the former to train the models and the latter to estimate the MSE in test data.
The 𝑘-fold validation extends this approach by separating the train data 𝑘 times into two sets, training and validation, using the former to train the models and the latter to estimate the MSE in test data. Since it does so 𝑘 times, the estimated MSE in the test data will be the average of the 𝑘 estimates.

Solution to Exercise C.39

Substitute 𝑥 = 12 in

𝑦̂ = 3.85 − 2.95𝑥

to obtain 𝑦̂ = 3.85 − 2.95 ⋅ 12 = −31.55.

Solution to Exercise C.40

As you can see in Section 20.3,

𝑡 = 𝛽1̂ / 𝑠𝛽1̂

so,

𝑠𝛽1̂ = 𝛽1̂ / 𝑡

Here, 1.240/5.544 = 0.2236652.

Solution to Exercise C.41

Nothing in a linear model, or any other estimated model for that matter, guar-
antees that the relationship is of causal nature. In some rare cases, it could be
causal, but these are really exceptions.

Solution to Exercise C.42

If both variables (𝑦 and 𝑥) are truly random, then the true 𝛽1 is 0. Because of sampling error, however, some samples will have a 𝛽1̂ that is very different from 0, i.e., extreme, and will lead us to reject 𝐻0 ∶ 𝛽1 = 0. How many times these “extreme” cases will happen depends on how we define “extreme”. In a test of hypotheses, this will happen in a proportion 𝛼 of the cases.

Solution to Exercise C.43

False. Fitting the data at hand, i.e., train data, is no good indicator of the model’s
ability to fit test data, i.e., to make predictions.

Solution to Exercise C.44

The correlation coefficient ranges from −1 to 1, and is unit free. This is why it is used to compare the goodness of the fit for various models.
The intercept is the prediction when all the explanatory variables are set to 0. Hence, it must be in the same units as the explained variable, i.e., kg.
A prediction must be in the same unit as the predicted variable. Hence, every 𝛽𝑗 𝑥𝑗 must be in this same unit. In this particular case, 𝑥𝑗 is in cm. Hence, for 𝛽𝑗 𝑥𝑗 to be in kg, it must be the case that 𝛽𝑗 is in kg/cm.

Solution to Exercise C.45

True. The 𝑅2 of the multiple linear regression is calculated only in train data.
Therefore, it is not a reliable estimate for the quality of the fit in test data.

Solution to Exercise C.46

For the two estimated models (one with 𝐷𝑀 and the other with 𝐷𝑊) to give the same predictions for each type of individual, it must be the case that the coefficient on 𝐷𝑀 equals minus the coefficient on 𝐷𝑊.
Notice that in a regression with 𝐷𝑀, the coefficient on 𝐷𝑀 is, other things equal, the difference in wage earned by the male individuals with respect to the female individuals. In a regression with 𝐷𝑊, the coefficient on 𝐷𝑊 is, other things equal, the difference in wage earned by the female individuals with respect to the male individuals.
Hence, it should be clear that the two differences must be equal in absolute value, though with a different sign.

Solution to Exercise C.47

No, it cannot. This is because the MSE uses the numeric difference between the observed value and the prediction for that observation. In classification problems, the observed value is a category, e.g., “Yes/No”, “Train/Car/Bicycle”. Therefore, we cannot meaningfully calculate a difference between these values.

Solution to Exercise C.48

False. The problem is unsupervised learning if the explained variable is not ob-
served. In the Netflix challenge, the competitors had that information. What they
didn’t have was the test data, i.e., the observations including the values of 𝑦 , the
clients’ votes on the movies that the competing models had to predict.

Solution to Exercise C.49

No, it doesn’t, because 𝛽1 is the slope coefficient. A systematic change of this kind shifts all the observations up, but does not affect the slope of the relationship.

Solution to Exercise C.50

It is linear in the log of the variables, but linear nevertheless. To convince your-
self, simply replace log(𝑦) by 𝑤 and log(𝑥) by 𝑧 . Then the model becomes,

𝑤 = 𝛽0 + 𝛽1 𝑧 + 𝜀
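For instance, a minimal sketch in R with simulated data (the coefficients 0.5 and 1.2 are arbitrary) shows that such a model is estimated with the usual lm() call:

set.seed(42)
x <- runif(100, min = 1, max = 10)
log_y <- 0.5 + 1.2 * log(x) + rnorm(100, sd = 0.1) # simulate the log-log relationship
coef(lm(log_y ~ log(x))) # OLS recovers values close to 0.5 and 1.2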

Solution to Exercise C.51

As we saw in our discussion about the paper Ferguson and Voth (2008), a high
𝑅2 is not required for a publication in a prestigious outlet.

Solution to Exercise C.52

The positive value of 𝛽1̂ , 𝛾1̂ and even 𝛼1̂ is simple to understand and is not ques-
tioned.

The difficulty resides in the interpretation of 𝛼2̂ . Recall that a coefficient in the
linear model is the marginal effect of the variable, i.e., when the value of the other
regressors is maintained constant.
Here, if the value of the number of people in the sample is kept constant, then
having more kids in this sample will result in a smaller overall weight, hence a
negative coefficient 𝛼2̂ .
In class, we discussed a similar issue when we related the amount of money in a
wallet with 1. the number of coins in the wallet, and, 2. the number of 1 cent coins
in the wallet. Keeping the number of coins constant, the more 1 cent coins in a
wallet, the lower the amount of money in the wallet. The following simulation
illustrates this point, if you need to “see” it.

n.s <- 1e4 # number of simulations

# tibble() below assumes the tidyverse (or the tibble package) is loaded, as elsewhere in these notes
df <- tibble(sum = numeric(n.s),
             n.coins = numeric(n.s),
             one.c = numeric(n.s))
coins <- c(0.01, 0.02, 0.05, .1, .2, .5, 1, 2)

for (i in 1:n.s) { # in each simulation/sample, do the following
  n.coins <- sample(20:50, 1) # pick a number of coins
  my.coins <- sample(coins, n.coins, replace = TRUE) # randomly select these coins
  df$sum[i] <- sum(my.coins) # amount of money in the sample
  df$n.coins[i] <- n.coins # report number of coins
  df$one.c[i] <- length(my.coins[my.coins == 0.01]) # number of 1 cent coins
}

print(df, n = 30) # first 30 observations


## # A tibble: 10,000 x 3
## sum n.coins one.c
## <dbl> <dbl> <dbl>
## 1 11.8 26 4
## 2 20.8 49 3
## 3 16.2 47 6
## 4 19.3 38 2
## 5 8.81 24 5
## 6 5.82 26 4
## 7 14.5 25 5

## 8 16.1 30 3
## 9 8.3 23 5
## 10 19.0 37 0
## 11 18.4 47 8
## 12 11.0 27 8
## 13 16.9 40 8
## 14 10.7 25 3
## 15 15.8 21 2
## 16 22.0 39 5
## 17 11.3 28 2
## 18 27.2 43 3
## 19 16.3 32 5
## 20 23.0 46 5
## 21 17.2 24 2
## 22 23.3 43 5
## 23 13.5 38 4
## 24 28.4 45 3
## 25 25.5 41 5
## 26 8.34 20 3
## 27 15.3 40 7
## 28 16.4 32 5
## 29 18.2 48 4
## 30 17.0 31 2
## # ... with 9,970 more rows
summary(lm(sum ~ n.coins + one.c, data = df))
##
## Call:
## lm(formula = sum ~ n.coins + one.c, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2013 -2.5711 -0.1084 2.4089 14.5991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.169640 0.154048 1.101 0.271
## n.coins 0.549636 0.004875 112.741 <2e-16 ***

## one.c -0.538538 0.019235 -27.998 <2e-16 ***


## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.77 on 9997 degrees of freedom
## Multiple R-squared: 0.5781, Adjusted R-squared: 0.5781
## F-statistic: 6850 on 2 and 9997 DF, p-value: < 2.2e-16

Solution to Exercise C.53

The normality of the errors could not really be assessed from Figure C.3. It would take another type of plot to see it.
The correlation between errors would be seen through patterns in the residuals,
which we don’t really see.
The key point to notice here is that the residuals seem to have very little variance
for the low values of TV and large variance at the other end. This is a case of non-
constant variance of the errors. In terms of the assumptions that we saw for the
linear model, this corresponds to a violation of the homoscedasticity assumption.

Solution to Exercise C.54

The procedure for this type of problem is to match the predictions for each category across the two models.
The prediction for an observation from the north (𝐷1 = 1) is:

• 𝛽0 + 𝛽1 in the first model,


• 𝛼0 in the second model.

Therefore, 𝛼0 = 𝛽0 + 𝛽1 = −9.54 + (−1.31) = -10.85.


The prediction for an observation from the center (𝐷2 = 1) is:

• 𝛽0 + 𝛽2 in the first model,


• 𝛼0 + 𝛼1 in the second model.

Therefore, 𝛼0 + 𝛼1 = 𝛽0 + 𝛽2. We saw above that 𝛼0 = -10.85. Therefore, 𝛼1 = 𝛽0 + 𝛽2 − 𝛼0 = −5.21 + 10.85 = 5.64.

Solution to Exercise C.55

Recall the result we established for inference on the coefficients.

𝑡 = (𝛽1̂ − 0) / 𝑠𝛽1̂ = 𝛽1̂ / 𝑠𝛽1̂

Figure C.4 shows the 𝑡 and the standard error. It is therefore straightforward to
deduce the value of the coefficient.

𝛽1̂ = 𝑡 ⋅ 𝑠𝛽1̂ = −2.875808

Solution to Exercise C.56

This is not correct. Failing to reject the null in that case does not rule out the possibility of a relationship between the variables, albeit a nonlinear one. For instance, their relationship could be inverted-U shaped and it would typically not be picked up by a linear fit (i.e., the estimated slope would be 0).

Solution to Exercise C.57

False. Knowing the true relationship does not eliminate the random shocks to the relationship, often noted 𝜀. Since these will still occur in the test data, even the model with perfect knowledge of the true function will not make perfect predictions. It just can’t predict the random shocks. Hence, the MSE will never be 0 in the test data.

Solution to Exercise C.58

The paper uses the dummy “Nazi” defined as follows:

𝑁𝑎𝑧𝑖 = 1 if the firm is connected, and 𝑁𝑎𝑧𝑖 = 0 if the firm is unconnected.

We are asked to compare different predictions of the simplest models.



𝑙𝑜𝑔𝑟𝑒𝑡𝑢𝑟𝑛𝑖,𝑝 = 𝛽0 + 𝛽1 𝑁 𝑎𝑧𝑖𝑖,𝑝 + 𝜀𝑖

where 𝑖 refers to the firm and 𝑝 = 𝑝1 if the estimation is for the first period,
while 𝑝 = 𝑝2 if the estimation is for the second period.
The question asks for two values:

𝐴 = 𝑙𝑜𝑔𝑟𝑒𝑡𝑢𝑟𝑛𝑝2 = 𝛽0̂ + 𝛽1̂ (with 𝑁𝑎𝑧𝑖 = 1)

𝐵 = 𝑙𝑜𝑔𝑟𝑒𝑡𝑢𝑟𝑛𝑝1 = 𝛽0̂ (with 𝑁𝑎𝑧𝑖 = 0)

Looking at the tables, A=0.0024 + 0.0697, and B= 0.104. Therefore, A-B= -0.0319.

Solution to Exercise C.59

The most affordable of the list would be the one requiring the fewest computations. This would be the ‘validation set’ method because it only requires computing the MSE once on the validation data, while LOOCV would require 𝑛 computations and k-fold CV would require 𝑘.

Solution to Exercise C.60

No, the two models would typically not give the same predictions. One of the major causes of the difference lies in the fact that model B, with the number of years of schooling, imposes a constant effect of that variable across its values. In other words, in that case, every year of schooling would increase the wage by the same amount, i.e., 𝛽𝑗.
With the dummies, the predictions would be more flexible with respect to the
effect of years of schooling, i.e., they could allow for non-constant effects (e.g., the
group with 21 years of schooling could earn less than the group with 17 years).

Solution to Exercise C.61

Yes, it is still a linear model despite the inclusion of a power-2 term. To be con-
vinced of it, just rename 𝑧 = 𝑥2 and rewrite the model as

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝛽2 𝑧𝑖 + 𝜀𝑖

It will appear as a regular linear model… which it is.


More generally, “Polynomial regression extends the linear model by adding extra
predictors, obtained by raising each of the original predictors to a power. For
example, a cubic regression uses three variables, 𝑋 , 𝑋 2 , and 𝑋 3 , as predictors.
This approach provides a simple way to provide a non-linear fit to data.” (James
et al. (2013).)
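A small sketch in R, with simulated data and arbitrary coefficients, illustrates the point:

set.seed(7)
x <- runif(200, min = 0, max = 10)
y <- 1 + 2 * x - 0.3 * x^2 + rnorm(200)
coef(lm(y ~ x + I(x^2))) # the squared term enters the linear model like any other regressor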

Solution to Exercise C.62

Recall that a residual is the difference between the observed value (data) and the predicted value (fit). If the residual is negative, as it is here, then it means that the fit is larger than the data, i.e., the fit overestimates.

Solution to Exercise C.63

All but one of these are classification problems. The number of people in a household is a numerical variable, implying a regression problem. All the others predict a category: yes/no, party, academic degree, VAT level.

Solution to Exercise C.64

Since 𝐷 is either 0 or 1, it does not have units. But 𝛽2 𝐷 must have units, and
it must be the same as 𝑦 since the right-hand side of the equation is a sum of
elements that must have the same units (we can’t add apples and oranges). To-
gether, these imply that 𝛽2 must have the same units as 𝑦 .
Now, notice that 𝑦 is not one of the options. We must choose one with the same units as 𝑦. In this case, only 𝜀 checks out.

Solution to Exercise C.65

The graph on the left does not offer guidance about the value taken by the ex-
plained variable. Indeed, all the observations appear as a dot in the graph. This
implies that the problem is an unsupervised one.
In contrast, the observations on the right graph indicate, by means of a different shape, the value/category of the explained variable.


FIGURE C.7: Illustrating the effect of an outlier.

Solution to Exercise C.66

For this question, we must compare models 2 and 3. Indeed, model 3 uses both of the variables (‘Market Cap’ and ‘Dividend Yield’) while model 2 only uses ‘Market Cap’. Notice that the two models have different numbers of observations, 𝑁. In all likelihood, this is because the values of one of these variables were missing for some firms.
Model 2 uses 352 observations and model 3 uses 299. This suggests that there
were 53 observations for which the authors have information on the ‘Market
Cap’ but not on the ‘Dividend Yield’.

Solution to Exercise C.67

This question is illustrated by Figure C.7. The red dot is the outlier added.
Clearly, compared to the original fit (blue line), the linear fit after the inclusion
of the outlier (red line) has a higher intercept and a smaller slope.
D Practice Exam Questions

TABLE D.1: Practice exam questions with elements of solution in this appendix.

Exercise Solution
D.1 sol.
D.2 sol.
D.3 sol.
D.4 sol.
D.5 sol.
D.10 sol.
D.11 sol.
D.12 sol.
D.13 sol.

D.1 Midterm

Exercise D.1 (Severe complications at birth [Midterm 19-20]). Table D.2 provides information related to the births and the rate of severe complications at birth (SCB) in three facilities, in 2019.

TABLE D.2: Severe complications at birth (SCB).

Number of births Rate of SCB Color of doors


Clínica São Miguel 203 0.49% light blue
Hospital Santa Maria 1057 1.51% blue
Centro Saúde Alvalade 195 1.53% dark blue


Based on that table, discuss the following two-part proposition.

“Doors in a lighter color have a positive impact on the rate of severe complications at birth. One possible
explanation is that lighter colors provide a calmer atmosphere which reduces stress and its related
adverse effects.”

Exercise D.2 (Minimal p-value). [Difficult.]


Consider the use of permutations and resampling for testing a hypothesis in a
medium-size sample.

a. Explain why a p-value of 0 for the observed test statistic does not
make sense.
b. Still in this context what is the minimal p-value that your procedure
should find?

Exercise D.3 (Attending class [Midterm 19-20]). [Medium difficulty.]


We are interested in testing whether or not attending the lectures is beneficial for
the final grade.
Suppose that we can establish with certitude who attended (say, mainly did) and
who did not (say, almost didn’t) this present class and that we will use the results
of the midterm as a sample for that test (45 students, 28 attended, 17 didn’t).
Moreover, we do not want to rely on any normality distribution along the way.
Explain with sufficient detail how you would make such a test. In particular, be
explicit about 𝐻0 and the details of the procedure.

Exercise D.4 (Reproducibility and data disclosure). A necessary element for re-
producibility in research is the availability of a publication’s original data.

a. Briefly explain why data disclosure is crucial.


b. What do you think is the current situation with respect to that issue?
Explain the possible reasons underpinning your view.

Exercise D.5 (CT for Covid test). Suppose that a test for COVID-19 uses:
𝐻0 ∶ the virus is present and active in the individual.
Suppose as well that the test can be modified by increasing the 𝐶𝑇 , from 20 to
25 or even higher. When the 𝐶𝑇 is increased, the presence of even pieces of the
virus, or even dead-virus will be detected and the test will be flagged as positive.

a. Discuss how the choice of 𝐶𝑇 influences 𝛼, the probability of Type I


error.

b. Discuss how the choice of 𝐶𝑇 influences 𝛽 , the probability of Type


II error.
c. As a political authority, discuss the elements that you would take into
account in order to make your choice about 𝐶𝑇 in relation with the type
of error that you would want to minimize.

Exercise D.6 (Correct test [Midterm 20-21]). In a sample of 627 Portuguese people, a proportion 𝑝̂ = 0.55 of the respondents declare that they are in favor
of a new lockdown.
You want to investigate whether a majority of the Portuguese population as a
whole is in favor of the lockdown.

a. Provide a set of conditions on the way the sample was obtained


that must be satisfied for the sample to be useful in your investigation.
Briefly explain.
b. You understand that 0.55 is your best guess about the proportion in
the population, but there is a margin of error associated with your esti-
mate. Suppose that you had found a proportion in the sample that was
much higher than 0.55, say 0.8. Briefly explain why, other things equal,
your margin of error would be smaller in that second case. (Hint: don’t
think intuition, think formula, think Sections 8.5 and 9.1.)
c. You decide to carry out a proper statistical test to answer the question
of your investigation, namely “is the majority of the Portuguese in fa-
vor of the lockdown”. A friend suggests the formulation below of the
hypotheses. Do you agree with this formulation? If yes, briefly explain.
If not, point out all the elements that you would correct in it. Briefly
explain.

𝐻0 ∶ 𝑝̂ = 0.55 versus 𝐻𝑎 ∶ 𝑝̂ ≠ 0.55

Exercise D.7 (Animal testing [Midterm 20-21]). Suppose that a researcher has
completed a difficult and time-consuming (harmless) experiment on 30 animals.
He has scored and analyzed a large number of variables. His results are generally
inconclusive, but one test (say a comparison of means before and after treatment)
yields a highly significant 𝑡-score, 2.70, which is surprising and could be of major
theoretical significance.

a. Interpret this value of 𝑡 in the context of a test, including by stating


the null and the decision on it that this 𝑡 implies.
b. What type of error is the researcher possibly making if they think that they obtained a fact, i.e., evidence of a major discovery? Explain briefly.
c. Explain why the replication of the experiment on more animals is a
sensible thing to do at this point, i.e., something that you would recom-
mend.

Assume that the researcher has in fact repeated the initial study with 20 addi-
tional animals, and has obtained an insignificant result in the same direction, 𝑡
= 1.02.

d. In that situation, another researcher recommends pooling the two re-


sults and publish the finding. What do you think of this recommenda-
tion?
e. In that situation, another researcher recommends making another
study to find an explanation for the difference between the results. What
do you think of this recommendation?

Exercise D.8 (Reproducible workflow [Midterm 20-21]). All terms below refer
to the discussions we carried in class.

a. Based exclusively on your own (potential) experience, provide a con-


vincing example to illustrate the fact that a non-reproducible workflow
can lead to a very large waste of time, in comparison to a reproducible
workflow.
b. Based exclusively on your own (potential) experience, provide a con-
vincing example to illustrate the fact that a non-reproducible workflow

can lead to a larger number of errors, in comparison to a reproducible


workflow.

Exercise D.9 (Milk consumption [Midterm 20-21]). The average consumption of milk in a sample from country A is 318.76342 kg per person per year. In the sample from country B, the equivalent statistic is 318.76351 kg per person per
year. The standard deviation is identical in both samples and there is no error (of
calculation, reporting,…).

a. Explain how one could conclude that the average consumption is sta-
tistically different between the two countries.
b. The result above about the difference in the means has statistical
significance. What do you think of its economic significance? In other
words, do you think there is a substantial economic difference in milk
consumption between the two countries? Explain briefly.
c. What relation, then, can you make between statistical and economic
significance? Explain briefly.

D.2 Endterm

Exercise D.10 (Default data [Endterm 19-20]). Consider again the ‘Default’ data
set that we analyzed in class. In particular, recall the following variables:

• ‘default’, a binary variable indicating whether the individual defaulted on the


credit card payment or not;
• ‘balance’, a continuous variable measuring the monthly credit card balance of
the individual;
• ‘student’ a binary variable indicating whether the individual is a student “Yes”
or not “No”.

We run two logistic regressions with one explanatory variable:


Model 1 has ‘balance’ as an explanatory variable and the regression output is given in Figure D.1. Model 2 has ‘student’ as an explanatory variable and the regression output is given in Figure D.2.

FIGURE D.1: Summary of Model 1.

We then run a logistic regression with two explanatory variables, ‘balance’ and
‘student’ (studentYes), call it Model 3. Figure D.3 gives the predicted probabili-
ties of default calculated with Model 3 and separated for students (red line) and
non-students (blue line).

a. Interpret and briefly comment on the estimated coefficient for the


explanatory variable of Model 1 and Model 2.
b. In Model 2, the 0.40489 coefficient for the dummy StudentYes means
that the probability of defaulting in the credit card payment increases by
about .40 for students, compared to non-students. True/False? Explain.
c. Given the plot above, based on Model 3, which individuals have the
highest probability of default (for monthly balances, say, between 1500
and 2500)?
d. Compare the curves (of the plot above) for the predicted probabilities
of default given by Model 3. What do they say about the probability of
default for the different types of individuals? Explain.

Exercise D.11 (Predicting grades [Candidate to Endterm 20-21]).


Suppose one wants to analyze the midterm and the endterm grades of the stu-
dents of a class. For instance, one could link these grades, for each student, in a
linear regression model as follows:

FIGURE D.2: Summary of Model 2.

FIGURE D.3: Plot for Model 3.

e-grade𝑖 = 𝛽0 + 𝛽1 m-grade𝑖 + 𝜀𝑖

where e-grade and m-grade are the grades at the endterm and midterm exams,
respectively, and 𝑖 refers to each student in the class.
What would you expect from the estimation of this model? Explain. What would
you say about causality in this model? Explain.

Exercise D.12 (Wine price [Endterm 19-20]). Exercise 21.2 was in this endterm
exam.

Exercise D.13 (Correctly specified model [Endterm 19-20]). A model is correctly specified if, beyond the random shocks, it contains all the relevant variables that determine the outcome, measured by the variable 𝑦, and if the functional form that relates these variables to 𝑦 is the correct one.
Consider the case where you have no theoretical model for the variables explain-
ing 𝑦 .

a. Argue that finding the correctly specified model for 𝑦 thanks to the
data at hand is an elusive quest that is not worth pursuing.

b. Suggest and explain an alternative goal for the modeling of 𝑦 as a


function of other variables.

D.3 Selected Midterm Solutions

Solution to Exercise D.1

The statement linking the colors of the doors and SCB is not warranted. The default hypothesis should be that the rate is the same across facilities. The difference could simply be due to sampling error. Indeed, as we can see, two facilities have much smaller sample sizes, which could account for a larger variance in the sampling distribution of their average rate.
The second statement is an example of an ad hoc explanation, with no statistical support. Yet, it will be seen as likely because we crave explanations, and better wrong ones than none.

Solution to Exercise D.2

The 𝑝-value is the probability of observing a value in the sampling distribution


that is as or more extreme than the observed test statistic, assuming that the sta-
tistical model, including 𝐻0 , is true.

It follows that, since we observed at least one such value, the very test statistic
that we calculated, the probability of observing it cannot be 0.
Then, the smallest 𝑝-value that the procedure should find is 1/𝑁, where 𝑁 is the number of simulations made in the test.
Recall that we sample a large number of simulations, but do not list all the possible samples. Therefore, it is not guaranteed that we draw the simulation that matches the sample that we have at hand. Potentially, we could then obtain a p-value equal to 0. To avoid it, we correct the 𝑝-value by adding 1 to the numerator and the denominator in the calculation of the probability.
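As a small sketch (with a made-up observed statistic and made-up permutation statistics), the corrected p-value could be computed in R as:

obs_stat <- 2.1 # test statistic observed in the sample (made up)
perm_stats <- rnorm(9999) # stand-in for the statistics obtained from the N permutations
(sum(abs(perm_stats) >= abs(obs_stat)) + 1) / (length(perm_stats) + 1) # never exactly 0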

Solution to Exercise D.3

In this case,

𝐻0 ∶ attending the class has no effect on the grade.

This is a two-sided test because the effect of attending the class on the grade could be either positive (presumably expected) or negative (if the class only adds confusion).
If we don’t want to rely on a (theoretical) normal distribution, then a permutation
test could be run.
The idea is to compare the averages of the two groups. If the difference between
the average of the 28 students who attended and the average of the 17 students
who didn’t is not extreme, then we do not reject 𝐻0 .
In order to know if it is extreme, we must obtain the sampling distribution of the
difference of the average between the two groups of these sizes.
For that, we randomly sample any 28 and put them in a group and the remain-
ing 17 are in the other group. We calculate the average for both groups and the
difference between the two.
We repeat this procedure a very large number of times in order to obtain a sampling distribution.
Finally, we check where the observed difference falls in that distribution. If it is extreme, i.e., if its p-value is below the chosen threshold level 𝛼, we reject 𝐻0, otherwise we do not reject it.
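A minimal sketch of this procedure in R, using made-up grades, could look as follows:

set.seed(1)
grades <- round(runif(45, min = 8, max = 18)) # 45 made-up midterm grades
attended <- c(rep(TRUE, 28), rep(FALSE, 17)) # 28 attended, 17 did not
obs_diff <- mean(grades[attended]) - mean(grades[!attended])
perm_diff <- replicate(10000, {
  shuffled <- sample(attended) # random reallocation into groups of 28 and 17
  mean(grades[shuffled]) - mean(grades[!shuffled])
})
(sum(abs(perm_diff) >= abs(obs_diff)) + 1) / (length(perm_diff) + 1) # two-sided p-value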

Solution to Exercise D.4

For research results to be fully reproducible, the analysis code is almost useless
without the data to which it applies. Hence, the availability of the original data
is crucial for reproduction of the results.
Put differently, if one cannot obtain the original data, then any result will depend
on the good faith of the researcher. Rogue cases are not unseen (read here1 or
here2 ).
Currently, it is not always possible to obtain the data of the research (read for in-
stance, here3 for the case in psychology.) Beyond fraud, there are several reasons
explaining the reluctance of researchers in providing their data. For instance,
these might be confidential.
In most cases, however, this has to do with the fact that the data were difficult to obtain. For instance, this happens when the data collection involved expensive equipment or simply years for a researcher to dig through huge archives. Researchers sometimes then find it unfair to simply deliver for free the data that other researchers could use for their publications.
Research in economics is converging towards full availability of data and code (read here4). One compromise with respect to the previous point is that researchers commit themselves to provide the data to the editor of the journal and/or the referees.
Footnote links: 1. https://www.sciencemag.org/news/2018/10/what-massive-database-retracted-papers-reveals-about-science-publishing-s-death-penalty 2. https://www.sciencemag.org/news/2018/08/researcher-center-epic-fraud-remains-enigma-those-who-exposed-him 3. https://psycnet.apa.org/doiLanding?doi=10.1037%2F0003-066X.61.7.726 4. https://www.aeaweb.org/journals/policies/data-code/

Solution to Exercise D.5

a. For this part, please read the explanation provided in the answer to the quiz question here.

b. We can use parts of the reasoning of the previous part to understand


how the change in CT also influences 𝛽 .

Consider the increase of CT. As described, this will result in more cases being flagged as positive. Among those additional cases, the positive test will only be due to the detection of “pieces of the virus, or even dead-virus”. In other words, there will be more cases where we will fail to reject the null, i.e., more cases where we fail to tell the person that they don’t have covid. This represents a wrong diagnosis, an error of Type II (see Table 4.1).

c. The choice of CT is related to the type of error that you want to


avoid/minimize.

In Type I, you may make an error by rejecting the null when actually it is true.
In other words, the person has covid but you do not detect it and send out a
person with covid. Of course, this generates costs related to the larger spread of
the disease since that person will contaminate other people.
In Type II, you may make an error by failing to reject the null when actually it is false. In other words, the person does not have covid but the test flags them as positive, and you send home a person without covid. Of course, this generates costs related to the reduced economic production since that person will not be able to work.

D.4 Selected Endterm Solutions

Solution to Exercise D.10

Here are a few words for each of these questions.

a. balance and studentYes have a significant and positive impact on the prob-
ability of default.
b. False, there is no direct reading of that kind. The coefficient enters into
a non-linear function.
c. The non-students have a higher probability of default.

d. For a given balance, students have a lower probability of default than


non-students, despite the results of the previous estimations.

Solution to Exercise D.11

We would expect the estimation of the model to show a very high correlation
between these two variables. This is because the observations are paired, one for
each student, making it possible that the reasons that determine one variable are
the same that determine the other.
In terms of causality, the most likely relation that we could draw is between a
common source/cause and these two variables. For instance, the student’s skills
could simultaneously explain both the result at the midterm and at the endterm.
Technically, the regression model would not satisfy the usual conditions. This is
because the random shocks to the model would be correlated with the “explana-
tory” variable. To see that, think of any shock affecting the real cause, say the
skills. Now, this shock will affect the e-grade, via 𝜀, and m-grade, rendering these
two correlated.

Solution to Exercise D.12

a. I obtain this result by substituting the values of the dummies in the model.

𝑝 = 𝛽0 + 𝛽1 ⋅ 0 + 𝛽2 ⋅ 1 + 𝛽3 ⋅ 0 + 𝛽4 ⋅ 0 + 𝛽5 ⋅ 1 + 𝛽6 ⋅ 0 + 𝜀

Using the estimated values, 𝑝 = 7.12.

b. Again the same procedure:

𝑝 = 𝛽0 + 𝛽1 ⋅ 0 + 𝛽2 ⋅ 0 + 𝛽3 ⋅ 0 + 𝛽4 ⋅ 0 + 𝛽5 ⋅ 0 + 𝛽6 ⋅ 1 + 𝜀

So, 𝑝 = 6.45.

c. It measures an average yearly increase or decrease in the prices of wine over the whole period.

d. Alternatively, we could use dummies for some years to see the effect
of particular years. This would make more sense given the nature of
the good. Indeed, the price of a wine bottle does not increase/decrease

regularly with time, but is affected by the quality of the wine obtained
in a few particular years.

Solution to Exercise D.13

Here are very succinct elements:

a. We never know what the correct model really is. So, trying to find it is
not worth the effort.

b. Alternatively, we could use the model to make predictions, and check


how it performs on test data.
E Solutions to Selected End-of-Chapter Exercises

TABLE E.1: End-of-chapter exercises with elements of solution in this appendix.

Exercise Solution
3.6 sol.

4.1 sol.

6.1 sol.
6.2 sol.
16.1 sol.
16.2 sol.
17.1 sol.

Solution to Exercise 3.6

The apparent contradiction hinges on the approximation 1.96 versus 2.


In a standard normal, the 𝑧 value 1.96 is a common benchmark because there is
2.5% probability that an observation falls beyond it (i.e., above 1.96).
In the context of hypothesis testing, this implies that even if the null implying a standard normal is true, there is still a 5% chance of finding a test statistic above 1.96 or below -1.96, leading us to reject the null. Hence, there is a 5% chance, or 1 in 20, of making an error (I guess that is what Fisher refers to with “led to follow up a negative result”).
Sometimes, practitioners use the limit 2 as a rule-of-thumb, instead of 1.96. Be-
yond 2, however, there is only 0.02275 density. Hence, using that criterion, we
will incorrectly reject the null in 2 × 0.02275 = 0.0455 of the times, i.e., 4.55%
or 1 in 22 times. Contradiction solved.
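The two probabilities can be checked directly in R:

2 * (1 - pnorm(1.96)) # about 0.05, i.e., 1 in 20
2 * (1 - pnorm(2)) # about 0.0455, i.e., roughly 1 in 22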


Solution to Exercise 4.1

The error of Type I occurs if we reject 𝐻0 when 𝐻0 is actually true. In this case, it would amount to convicting an individual who is actually innocent. This decision would be based on false or incomplete evidence. Hence, to avoid that error, one would need to advance with a conviction only in cases where the evidence is plentiful and extremely good, e.g., footage of the person committing the crime, a confession, etc.
The Type II error occurs when we fail to reject 𝐻0 when 𝐻0 is actually false. In the present case, it would amount to letting a guilty person go free.
There is a relationship between the two types of error. Reducing the first implies
an increase of the second. Indeed, requiring a higher level of evidence for con-
viction means that it will also be more difficult to convict criminals, plentiful and
extremely good evidence being more difficult to obtain.

Solution to Exercise 6.1

The null hypothesis here is a bit different from usual. We don’t want to test
whether the true proportion, 𝑝, is equal to 0.5 or not. Instead, we want to find a
null whose alternative implies accepting that the true proportion is larger than
0.5. Hence, we write,

𝐻0 ∶ 𝑝 ≤ 0.5 vs. 𝐻𝑎 ∶ 𝑝 > 0.5.

To make a link with the generic expressions in the notes, this example has 𝑝0 =
0.5.
We know that, under the null, the CLT implies that the sampling distribution of
the sample proportion, 𝑝̂, is normally distributed around the true value 𝑝0 ,

𝑝̂ ∼̇ 𝑁(𝑝0, 𝑝0(1 − 𝑝0)/𝑛),

or, in the standardized version,

(𝑝̂ − 𝑝0)/√(𝑝0(1 − 𝑝0)/𝑛) = 𝑍 ∼̇ 𝑁(0, 1).

Formally, we would reject 𝐻0 if, under the null, the probability of observing a sample proportion as extreme as or more extreme than ours is too small, as determined by 𝛼 and the type of test (two- or one-tailed). Hence, to evaluate our hypothesis, we start by calculating the z-score of our sample proportion (aka the test statistic):

𝑧 = (𝑝̂ − 𝑝0)/√(𝑝0(1 − 𝑝0)/𝑛).

In our case, we have 𝑝̂ = 0.54, 𝑝0 = 0.5 and 𝑛 = 300 or 𝑛 = 500. We use R to calculate the values implied by the formulas.

library(magrittr)  # provides the %>% pipe used below (assumed not yet loaded)

phat <- 0.54
p0 <- 0.5
n1 <- 300
n2 <- 500

z1 <- ((phat-p0)/sqrt(p0*(1-p0)/n1)) %>% round(2)
z1
## [1] 1.39
z2 <- ((phat-p0)/sqrt(p0*(1-p0)/n2)) %>% round(2)
z2
## [1] 1.79

When 𝑛 = 300,

𝑧1 = (0.54 − 0.5)/√(0.5(1 − 0.5)/300) = 1.39,

and when 𝑛 = 500,

𝑧2 = (0.54 − 0.5)/√(0.5(1 − 0.5)/500) = 1.79.

At this point, as we saw, we can follow two routes: the rejection region approach or the p-value approach.
Rejection region approach
As explained above, this is a one-tailed test. Hence, there is only one rejection region, on the right tail of the Z distribution. And the probability of Type I error, 𝛼, concentrates in that tail.
The limit of the rejection region in this case, with 𝛼 = 0.05, is

Φ−1 (1 − 𝛼) = 𝑧𝛼 = 1.64.
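
This critical value can be verified in R (a small addition of mine, not in the original solution):

qnorm(1 - 0.05)  # 95th percentile of the standard normal
## [1] 1.644854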

As a consequence, we make the following decisions about 𝐻0 .


When 𝑛 = 300, 𝑧1 < 𝑧𝛼 , i.e., 𝑧1 is not in the rejection region, so we do not reject
𝐻0 .
When 𝑛 = 500, 𝑧2 > 𝑧𝛼 , i.e., 𝑧2 is in the rejection region, so we do reject 𝐻0 .
P-value approach
We must evaluate the probability, under the null, i.e., in the 𝑍 distribution, of observing a score as extreme as or more extreme than the test statistic 𝑧𝑖 (𝑖 = 1, 2, depending on the case above),

𝑃 (𝑍 > 𝑧𝑖 ) = 1 − 𝑃 (𝑍 < 𝑧𝑖 )

We use R to calculate these probabilities.

p1 <- 1 - pnorm(q= z1, mean= 0, sd= 1) %>% round(4)
p1
## [1] 0.0823
p2 <- 1 - pnorm(q= z2, mean= 0, sd= 1) %>% round(4)
p2
## [1] 0.0367

Since this is a unilateral test, this probability is already the p-value of the test.
When 𝑛 = 300, p-value = 0.0823 > 0.05 = 𝛼. Therefore, based on this sample, we cannot reject the null of the true proportion being smaller than or equal to 0.5.
When 𝑛 = 500, p-value = 0.0367 < 0.05 = 𝛼. Therefore, based on this sample,
we can reject the null and accept the alternative that the true proportion is larger
than 0.5.

Solution to Exercise 6.2

For 𝑛 = 300.

test.1 <- prop.test(x = phat * n1,
                    n = n1,
                    p = p0,
                    alternative = "greater",
                    correct = FALSE)
test.1
##
## 1-sample proportions test without continuity correction
##
## data: phat * n1 out of n1, null probability p0
## X-squared = 1.92, df = 1, p-value = 0.08293
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
## 0.4925225 1.0000000
## sample estimates:
## p
## 0.54

For 𝑛 = 500.

test.2 <- prop.test(x = phat * n2,
                    n = n2,
                    p = p0,
                    alternative = "greater",
                    correct = FALSE)
test.2
##
## 1-sample proportions test without continuity correction
##
## data: phat * n2 out of n2, null probability p0
## X-squared = 3.2, df = 1, p-value = 0.03682
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
## 0.5032207 1.0000000
## sample estimates:
## p
## 0.54

Notice how the results are fully in accordance with those obtained manually in the previous solution. In particular, the p-values are virtually identical to the analytical ones.
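
As a further check (my own remark, not part of the original exercise), the X-squared statistic reported by prop.test is simply the square of the z-score computed manually, which is why the two routes agree:

sqrt(test.1$statistic)  # about 1.386, the unrounded z-score for n = 300
sqrt(test.2$statistic)  # about 1.789, the unrounded z-score for n = 500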

Solution to Exercise 16.1

The intercept will have the same units as the explained variable. To convince
yourself, consider the case where the slope coefficient or the explanatory variable
is 0. Then the value of the explained variable amounts to the intercept. Hence,
they must have the same units.
As for the slope, notice that the prediction must be in the same unit as the predicted variable. Hence, 𝛽1 𝑥 must be in this same unit. In this particular case, 𝑥 is in cm. Hence, for 𝛽1 𝑥 to be in kg, it must be the case that 𝛽1 is in kg/cm.

Solution to Exercise 16.2

a. I first load the data and estimate the model:

library(readr)  # for read_csv()
df <- read_csv("data/Advertising.csv")
m1 <- lm(sales ~ TV, data = df)
summary(m1)
##
## Call:
## lm(formula = sales ~ TV, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16

Notice that we can extract the coefficients of the model.

b0 <- m1$coefficients[[1]]
b1 <- m1$coefficients[[2]]

Now, I try to verify the given relationship, involving the means of the variables.

m.y <- mean(df$sales)
m.x <- mean(df$TV)
m.y # mean of y
## [1] 14.0225
b0 + b1*m.x # predicted value at mean of x
## [1] 14.0225

b. I first take a short-cut by using the residuals already stored inside m1. Then, as suggested in the hint, I calculate them myself (resid) and find the same result: a mean that is zero up to numerical precision.

mean(m1$residuals)
## [1] -6.464447e-17
resid <- df$sales - m1$fitted.values
mean(resid)
## [1] 8.597723e-18

c. I calculate each of these elements one by one.



r <- cor(df$sales, df$TV)
s.y <- sd(df$sales)
s.x <- sd(df$TV)

And now I can see that the equality holds.

b1
## [1] 0.04753664
r *s.y/s.x
## [1] 0.04753664

Solution to Exercise 17.1

a. Again, loading the data and estimating the model with lm().

library(readr)
df <- read_csv("data/Advertising.csv")
m2 <- lm(sales ~ TV + radio + newspaper, data = df)
summary(m2)
##
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16

b. The coefficients on TV and radio are positive and significant. Advertising in these media is associated with higher sales. As for advertising in the newspapers, there is no such association.

Recall that the coefficient gives the marginal effect on sales, i.e., keeping all the
rest constant. One of the reasons that could explain this lack of association with
newspaper is that advertising campaigns in the newspapers are always run at the
same time as advertising campaigns in other media. The “technical” implication
would be that the “keeping the rest constant” does not really apply, leaving little
room for the model to pick up the effect of newspaper on sales.

c. I proceed as in the solution to Exercise 16.2.

b0 <- m2$coefficients[[1]]
b1 <- m2$coefficients[[2]]
b2 <- m2$coefficients[[3]]
b3 <- m2$coefficients[[4]]

m.sales <- mean(df$sales)
m.tv <- mean(df$TV)
m.radio <- mean(df$radio)
m.newspaper <- mean(df$newspaper)
m.sales
## [1] 14.0225
b0 + b1*m.tv + b2*m.radio + b3*m.newspaper
## [1] 14.0225

d. In order to be able to draw that fit, some (constant) value must be assigned to the other variables. For instance, one could “fix” these other variables at their mean (see the sketch after item e below).
e. Again, as suggested in the hint, I calculate the residuals (resid) and their
mean.

resid <- df$sales - m2$fitted.values
mean(resid)
## [1] 3.551846e-17
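
To make item d concrete, here is a minimal sketch (my own addition, not part of the original solution) of how the fit along TV could be drawn while holding radio and newspaper fixed at their means:

grid <- data.frame(TV = seq(min(df$TV), max(df$TV), length.out = 100),
                   radio = mean(df$radio),
                   newspaper = mean(df$newspaper))
grid$pred <- predict(m2, newdata = grid)  # predicted sales along TV, other regressors fixed
plot(df$TV, df$sales, xlab = "TV", ylab = "sales")
lines(grid$TV, grid$pred)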
F Your Questions

Here is a set of questions that I received and answered by email. They might be of interest to everyone.

F.1 Q

My doubt is directed to the question B.15 to be exact. My answer was d (we cannot say: we need more
information about the samples). The rationale behind my choice was that even though the question states
the variances are known and I’m aware that a bigger sample assures more accurate results, it was not
clear to me how big the variances were within each sample. Therefore, if the variance in the bigger sample
turned out to be very extreme, I guessed the null could be rejected more often in the smaller sample with a
much smaller variance…

While I see virtues in your explanation, I would still stick to the answer provided.
This is for two reasons, one weaker than the other. First, the question uses the term
“generally”, which points at an explanation centered on the only element that
changes for sure across the samples, the number of observations, and somehow
averaging over the possibilities of the other conditions. I’m not too happy with
this argument because it seems to rely heavily on the minutiae of the wording.
Your answer suggests that we should discard the true parameter of the popu-
lation and use the sample estimate instead. I reckon that it is unlikely that we
can know the true variance, but the question specifically says that we do. And
this point matters because it is not reasonable to throw away information. The
standard deviation of the sampling distribution of the mean will be the same for both sample means, except for the effect of 𝑛. This justifies my answer in the quiz.


I could also attempt a complementary explanation (though it would need more proof). Even if we used the sample standard deviation in each sample, after throwing away the information provided, it would still require that the difference between the estimated variances in two non-small samples from the same population be large enough to compensate for the effect of the division by a larger number (√200 vs. √250). True, it could happen for some values of sigma. But, here, the speculation about the value of sigma is not warranted because it is known. And if it is known, it is used in the calculations.
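
To see the effect of 𝑛 numerically, here is a tiny sketch (the value of sigma is arbitrary, chosen only for illustration):

sigma <- 10          # the known population standard deviation (arbitrary value)
sigma / sqrt(200)    # standard error of the sample mean with n = 200
sigma / sqrt(250)    # smaller standard error with n = 250, whatever the value of sigma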

F.2 Q

“green jelly beans” case. (…) what I am confused about now is that, when we are asking the computer to
calculate the risk of the firm with inductive models, we need to give as much data as we can (even the
logo shape and color) which we never consider as a factor of risk. by comparing these two concepts, could
we say that we need to prioritize the most important components regarding our budget and abilities? and
how can we be sure that we are not considering the wrong data? I want to give an example to make it
more clear, imagine that we are considering the most important factors for aggression in people and we
have come up with 2 factors of work stress and sleep time. how are we sure that there is not another factor
which is more important? (sleep time here might be like the color of the jelly - something useless) and
how can we be sure that these are not giving us wrong results?

The topic is a bit advanced, but I certainly wanted to touch on it during my classes, in the second part of the semester, when we will estimate models. Is it an urgent matter? Or is it okay to wait a few weeks? In short, there are two approaches to answer that question. And they depend on what you are trying to do: inference or prediction. In both cases, correct statements are hard to obtain: it’s not only a matter of putting all variables in. The good news is that clear procedures and criteria exist to guarantee that you are doing things properly. And I will talk about them in class.

F.3 Q

I get the fact that if the p-value is bigger than 0.05, we don’t reject the null hypothesis. if I understand
correctly, by expanding the sample size, we can reach a point that we are able to reject h0. so, there would
be a point where if we pass from it by 𝜖, h0 would not be accepted anymore and this point would be the
minimum size of our sample which we need to go for another hypothesis. in the republicans and
democrats example which we discussed in class, is it possible to say that this would be the minimum
population which we can for sure say that if we know the democrats in one state overweight the
republicans (or vice-versa) we can surely interpret the other state too?

I think we should clarify a point. What you say is correct IF you assume that
in the larger sample (increasing 𝑛) the observed difference between the samples
remains the same. In that case, if you keep that difference while the sample grows, there will be a threshold 𝑛∗ beyond which you can reject the null hypothesis, yes.
However, nothing guarantees that this will be the case, i.e., that the difference will be the same in a larger sample. You observe one sample with a difference.
Fine. But it could be the case that another sample would show another differ-
ence between the groups. And this idea is an even more fundamental point to
understand.
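
To illustrate the first paragraph (a sketch of mine, with made-up numbers): holding the observed proportion fixed and only letting 𝑛 grow, the p-value eventually drops below 0.05, here somewhere between 𝑛 = 1000 and 𝑛 = 2000.

phat <- 0.52   # observed proportion, assumed to stay the same as n grows
p0 <- 0.5
for (n in c(500, 1000, 2000, 4000)) {
  z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)
  cat("n =", n, " p-value =", round(1 - pnorm(z), 4), "\n")
}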

F.4 Q

If I have a confidence level of 95%, my margin error is 5%, or am I wrong?



This interpretation is actually not correct. We use the confidence level and the margin of error in the same context but, despite their names, they are not complements: the margin of error is not simply 100% minus the confidence level.
The confidence level of, say, 95% means the following. If you were to repeat the
same estimator in the same case but in a very large number of samples, then 95%
of the confidence intervals obtained would contain the true parameter.
The margin of error expresses the half-width of the interval built around the estimate. To fix ideas, keep in mind that you could be 95% confident that the true parameter is around 60 ± 1%, or 60 ± 2%, or 60 ± 3%… The actual size of this margin depends
on other factors such as the real variance in the population or the number of
observations.
Furthermore, the margin of error is often expressed in absolute terms, not in
percentages. For instance, we could read that the researcher is 95% confident
that the true parameter is 1700 ± 150.
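
For a proportion, for instance, the 95% margin of error could be computed as follows (a sketch with made-up numbers):

p.hat <- 0.6    # estimated proportion (made-up)
n <- 1000
qnorm(0.975) * sqrt(p.hat * (1 - p.hat) / n)  # about 0.03, i.e., 60% plus/minus 3 points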

F.5 Q

The reference here is Section 22.2.1.

I see that the beta1 is close to the value that we expected (3) but the intercept is 12 and not 5, is the model
still good?

The key with simulations is that we know the true model. In this case, there can be only one origin of the discrepancy between the estimated coefficient and the true parameter, namely sampling error… it can just happen that we don’t get exactly the true parameters even if we estimate the correct model.
Notice also that 𝛽0 is rarely of importance.
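
A quick simulation along these lines (my sketch; the true intercept 5 and slope 3 echo the values in the question) shows estimates fluctuating around the true parameters from sample to sample:

set.seed(123)
x <- runif(100, 0, 10)
y <- 5 + 3 * x + rnorm(100, sd = 10)  # true intercept 5, true slope 3, noisy shocks
coef(lm(y ~ x))                       # estimates differ from (5, 3) by sampling error only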

F.6 Q

I am struggling to understand the difference between residuals and errors in practical terms, in a R code
for example. Where can we see this?

Fundamentally, we can never observe the errors. These errors, better understood
as random shocks to a variable 𝑦 , represent an influence on 𝑦 that nothing ac-
counts for. We can also see it as the difference between the value of the variable
𝑦 and the value that it would take on average under the true model for 𝑦. This
“true” model, however, is little short of a chimera: we will never know/see it.
Therefore, the true shock will similarly never be known/seen.
The residuals, on the other hand, can be observed because they are simply the
difference between the value of the variable 𝑦 and the value of the model’s pre-
diction, 𝑦.̂ This, in turn, implies that the residuals depend on the model that is
used to “explain” 𝑦 . For instance, a person earns a wage of 2000 euros. A given
model of wage predicts that, given some characteristics (the 𝑋 variables), she
would earn 1945. Then, the residual for that observation is 55.
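
In R code, the distinction looks as follows (a generic sketch using the built-in cars data, since any fitted lm object would do):

m <- lm(dist ~ speed, data = cars)    # some fitted model
res <- residuals(m)                   # residuals: observable, stored in the fit
res2 <- cars$dist - fitted(m)         # the same thing, computed by hand
all.equal(unname(res), unname(res2))  # TRUE (up to numerical precision)
# The errors (true random shocks), by contrast, have no counterpart in the code:
# they are defined relative to the unknown true model and cannot be extracted.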

F.7 Q

The 𝑅2 is only used in inferences and not predictions, am I correct to assume this? I also do not
understand why sometimes even if the 𝑅2 is small it is still a good model, so when do we look at it?

There is nothing that prevents using the 𝑅2 as a criterion in the context of pre-
diction. As you know, that would just be a very poor indicator. So, in reality, it
is used and reported in exercises that I referred to as inference, yes.

We should always have a look at it, but never give it more importance than it
deserves. This is because the quality of a paper should not be measured with
that criterion only. There are many others, in particular many others that require
thought and context-specific knowledge.
Another way of seeing it is that a high 𝑅2 is no guarantee whatsoever of the
quality of a paper, certainly not a substitute for real thinking.

F.8 Q

Regarding the assumptions, when one of these does not hold it means the model is not good for our data?

When one of the assumptions of a model does not hold, the inference based on that model may be slightly or highly misleading, depending on which assumption is violated and how seriously.
For instance, in case of an endogeneity issue (𝑋 and 𝜀 correlated) then the infer-
ence on the coefficients cannot be trusted. The test of hypothesis may say that
we reject the null but this is no longer reliable.

F.9 Q

What was the conclusion for simulated models when the variables were correlated or when we forgot a
variable in our estimation?
Also, did we say that when we forget a variable in our estimation the variables used become correlated?

The simulation illustrated that if,

• we estimate a model without a variable that truly determines the outcome 𝑦, AND,
• that forgotten variable is correlated with the variable that we keep in the estimated model,

then, the coefficient that we obtain in our defective model will be biased, i.e., the
estimate is not reliable.
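
A minimal sketch of such a simulation (my own code, with made-up parameter values, only to reproduce the logic described above):

set.seed(1)
n <- 1000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)             # x2 is correlated with x1
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n)   # both variables truly determine y
coef(lm(y ~ x1 + x2))  # correct model: slope on x1 close to the true value 2
coef(lm(y ~ x1))       # x2 omitted: slope on x1 biased away from 2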

F.10 Q

In general, I just want to make sure that the explanatory variable is the dependent Y variable and the
explain variable is the independent X variable?

Not exactly. The following table gives the main terms used for these variables.

𝑌 𝑋
dependent independent
explained explanatory
endogenous exogenous
predicted predictor
response regressor
outcome characteristics
… …

F.11 Q

Bias Variance Trade off: What is more important to minimize the bias or the variance? Further, is a
flexible model the same as a complex model? If not what does it mean to be a “flexible” model?
Additionally, how should we understand and interpret the graph in Figure 15.14?

The MSE is the sum of 1. the bias squared and 2. the variance. The key point that Figure 15.14 conveys is that there is a trade-off. When one decreases, the other in-
creases: reducing the bias increases the variance, and reducing variance increases
the bias. So, the optimal solution is when their sum, the MSE, is minimal.
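
In symbols, restating the sentence above,

𝑀𝑆𝐸 = Bias² + Variance,

so that minimizing the MSE amounts to finding the best compromise between the two terms.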
Yes, in this context, the models with more variance are the flexible/ complex/
high degree of polynomial ones.
This is the opposite of rigid models such as the linear model. The latter is rigid because, no matter the shape of the relationship, it will always result in a line (or hyperplane), which may never get close to the data points.

F.12 Q

Cross-Validation Method: Do I understand it correctly that the leave one out approach uses only one
variable as the training data and the rest will be validation data set? How can we estimate a model based
on only one value?

What is left out in the LOOCV is one observation. The model is then estimated
on the 𝑛 − 1 observations (training set) and the estimated model is used to make
a prediction on the left-out observation (validation set).
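
A bare-bones sketch of LOOCV (my own illustration, using R’s built-in cars data rather than a data set from the notes):

errors <- numeric(nrow(cars))
for (i in seq_len(nrow(cars))) {
  fit <- lm(dist ~ speed, data = cars[-i, ])               # train on the n - 1 other observations
  pred <- predict(fit, newdata = cars[i, , drop = FALSE])  # predict the left-out observation
  errors[i] <- (cars$dist[i] - pred)^2
}
mean(errors)  # the LOOCV estimate of the test MSE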

F.13 Q

Further, in this part of the notes you talk a lot about polynomials, does the degree of polynomials refer to
the amount of values we have in each sample?

The degree of the polynomial, 𝑝, is the maximum value of the exponent on a variable used as a regressor. Polynomials are used to better fit the relationship in the data, the rule being: the higher the degree, the more flexible the model, and the closer the fit can get to the data.
“Polynomial regression extends the linear model by adding extra predictors, ob-
tained by raising each of the original predictors to a power. For example, a cubic
regression uses three variables, 𝑋 , 𝑋 2 , and 𝑋 3 , as predictors. This approach
provides a simple way to provide a non-linear fit to data.” (James et al., 2013.)
For instance, suppose one wants to model the relationship between log-wage
and age, i.e., (log)wage profile over the age of the worker. We suspect that the
relationship is non-linear in age. To account for non-linearities, we can add the
square of age, the cube of age, etc, to the estimated model. This is what we call a
polynomial regression.
We could have, 𝑝 = 1,

𝑙𝑤𝑎𝑔𝑒 = 𝛽0 + 𝛽1 𝑎𝑔𝑒 + 𝜀

or 𝑝 = 2,

𝑙𝑤𝑎𝑔𝑒 = 𝛽0 + 𝛽1 𝑎𝑔𝑒 + 𝛽2 𝑎𝑔𝑒2 + 𝜀

or 𝑝 = 3,

𝑙𝑤𝑎𝑔𝑒 = 𝛽0 + 𝛽1 𝑎𝑔𝑒 + 𝛽2 𝑎𝑔𝑒2 + 𝛽3 𝑎𝑔𝑒3 + 𝜀

etc…
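
In R, such models could be estimated as follows (a sketch: the data frame wages with columns age and lwage is simulated here only so that the code runs; in the notes the data come from a wage survey):

set.seed(42)
wages <- data.frame(age = runif(500, 20, 60))
wages$lwage <- 12 + 0.08 * wages$age - 0.0009 * wages$age^2 + rnorm(500, sd = 0.3)

m.p1 <- lm(lwage ~ age, data = wages)                        # p = 1
m.p2 <- lm(lwage ~ age + I(age^2), data = wages)             # p = 2
m.p3 <- lm(lwage ~ age + I(age^2) + I(age^3), data = wages)  # p = 3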
[Figure: six panels, 𝑝 = 1 to 𝑝 = 6, each showing logwage (vertical axis, roughly 11–15) against age (horizontal axis, 20–60) with the fitted polynomial of degree 𝑝.]

FIGURE F.1: Polynomials of age to model logwage.

In Figure F.1, we can “see” how the quality of the fit increases with the
degree of the polynomial 𝑝. The fit goes from a rigid line (𝑝 = 1) to a very flexible
curve with 𝑝 ≥ 3, though the gains in the fit after 𝑝 = 4 do not seem to be large.

F.14 Q

I am not quite sure if I understand the overall picture and difference of residuals, shocks, variances and
standard errors.

Random shocks to a variable 𝑦 represent an influence on 𝑦 that nothing accounts for. We can also see it as the difference between the value of the variable 𝑦 and the
value that it would take on average under the true model for 𝑦 , i.e., given some
regressors 𝑋 . This “true” model, however, is little short of a chimera: we will
never know/see it. Therefore, the true shock will similarly never be known/seen.
The residuals, on the other hand, can be observed because they are simply the
difference between the value of the variable 𝑦 and the value of the model’s pre-
diction, 𝑦.̂ This, in turn, implies that the residuals depend on the model that is
used to “explain” 𝑦 . For instance, a person earns a wage of 2000 euros. A given
model of wage predicts that, given some characteristics (the 𝑋 variables), she
would earn 1945. Then, the residual for that observation is 55.
For these residuals, since we can calculate them, we can also calculate their vari-
ance and their standard error. This latter is actually very useful. We use it as a
measure of the fit (see Section 19.4) and in the standard error for the sampling
distribution of the coefficients (see Section 20.1).

F.15 Q

Does ESS refer to variance and RSS to residuals or does it not have anything to do with each other?

The TSS is proportional to the variance of the variable 𝑦 . The RSS is proportional
to the variance of the residuals, 𝑒.̂
We can show that,

𝑇 𝑆𝑆 = 𝐸𝑆𝑆 + 𝑅𝑆𝑆.

Recall that the residuals are the distance between the observations and the pre-
dicted value, i.e., they measure the failure in the prediction. Hence, the RSS is the
part of the variance of 𝑦 , i.e. of TSS, that the model did not manage to explain.

The ESS, on the other hand, is the part of the variance of 𝑦 , i.e., the part of TSS,
that our model managed to explain.
The measure of fit that we use, the 𝑅2 , is the ratio of variance explained by the
model, i.e.,

𝑅2 = 𝐸𝑆𝑆/𝑇𝑆𝑆.
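
A quick numerical check (my sketch, reusing df and the simple model m1 <- lm(sales ~ TV, data = df) from the solution to Exercise 16.2 above):

tss <- sum((df$sales - mean(df$sales))^2)  # total sum of squares
rss <- sum(residuals(m1)^2)                # residual sum of squares
ess <- tss - rss                           # explained sum of squares
ess / tss                                  # equals the R-squared of the model
summary(m1)$r.squared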
Bibliography

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1):17–21.

Banerjee, A. V., Banerjee, A., and Duflo, E. (2011). Poor economics: A radical re-
thinking of the way to fight global poverty. Public Affairs.

Bernoulli, J. (1713). Ars conjectandi, opus posthumum: accedit tractatus de seriebus infinitis, et epistola Gallice scripta de ludo pilæ reticularis. Impensis Thurnisiorum Fratrum.

Bland, M. (2009). Keep young and beautiful: evidence for an “anti-aging” prod-
uct? Significance, 6(4):182–183.

Bodenhorn, H., Guinnane, T. W., and Mroz, T. A. (2017). Sample-selection biases and the industrialization puzzle. The Journal of Economic History, 77(1):171–207.

Cairo, A. (2016). The truthful art: data, charts, and maps for communication. New
Riders.

Dell, M. (2010). The persistent effects of Peru’s mining mita. Econometrica, 78(6):1863–1903.

Efron, B. and Hastie, T. (2016). Computer age statistical inference, volume 5. Cam-
bridge University Press.

Ferguson, T. and Voth, H.-J. (2008). Betting on Hitler—the value of political connections in Nazi Germany. The Quarterly Journal of Economics, 123(1):101–137.

Fisher, R. A. (1925). Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.

Goyal, A. and Wahal, S. (2008). The selection and termination of investment management firms by plan sponsors. The Journal of Finance, 63(4):1805–1847.


Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8):e124.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An introduction to
statistical learning, volume 112. Springer.

Jones, R. and Tschirner, E. (2015). A frequency dictionary of German: Core vocabulary for learners. Routledge.

Kahneman, D. (2011). Thinking, fast and slow, volume 1. Farrar, Straus and Giroux
New York.

Mauboussin, M. and Bartholdson, K. (2003). On streaks: Perception, probability, and skill. Credit Suisse First Boston’s Consilient Observer, 2(8):22.

Mauboussin, M. J. (2012). The success equation: Untangling skill and luck in business,
sports, and investing. Harvard Business Review Press.

Mlodinow, L. (2009). The drunkard’s walk: How randomness rules our lives. Vintage.

Pearl, J. and Mackenzie, D. (2018). The book of why: the new science of cause and
effect. Basic Books.

Provost, F. and Fawcett, T. (2013). Data Science for Business: What you need to know
about data mining and data-analytic thinking. O’Reilly Media, Inc.

Reinhart, C. M. and Rogoff, K. S. (2010). Growth in a time of debt. American Economic Review, 100(2):573–78.

Rosling, H., Rosling, O., and Rosling Rönnlund, A. (2018). Factfulness: Ten reasons
we’re wrong about the world - and why things are better than you think. Sceptre,
London.

Stigler, S. (2008). Fisher and the 5% level. Chance, 21(4):12–12.

Tetlock, P. E. and Gardner, D. (2016). Superforecasting: The art and science of pre-
diction. Random House.

Tufte, E. R. (1997). Visual Explanations: Images and Quantities, Evidence and Narra-
tive. Cheshire, CT: Graphics Press.

Tufte, E. R. (2001). The visual display of quantitative information. Cheshire, CT: Graphics Press, 2 edition.

Tufte, E. R. (2003). The cognitive style of PowerPoint. Graphics Press Cheshire, CT.

Tufte, E. R. (2006). Beautiful evidence. Cheshire, CT: Graphics Press.

Tufte, E. R., Goeler, N. H., and Benson, R. (1990). Envisioning information. Cheshire, CT: Graphics Press.

Tversky, A. and Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76(2):105.

Tversky, A. and Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90(4):293.

Watson, R., Ogden, S., Cotterell, L., Bowden, J., Bastrilles, J., Long, S., and Grif-
fiths, C. (2009). A cosmetic ‘anti-ageing’ product improves photoaged skin:
a double-blind, randomized controlled trial. British Journal of Dermatology,
161(2):419–426.
