# Applied Managerial Statistics

Steven L. Scott Winter 2005-2006

COPYRIGHT © 2002-2005 by Steven L. Scott. All rights reserved. No part of this work may be reproduced, printed, or stored in any form without prior written permission of the author.

Contents
1 Looking at Data
1.1 Our First Data Set
1.2 Summaries of a Single Variable
1.2.1 Categorical Data
1.2.2 Continuous Data
1.3 Relationships Between Variables
1.4 The Rest of the Course

2 Probability Basics
2.1 Random Variables
2.2 The Probability of More than One Thing
2.2.1 Joint, Conditional, and Marginal Probabilities
2.2.2 Bayes’ Rule
2.2.3 A “Real World” Probability Model
2.3 Expected Value and Variance
2.3.1 Expected Value
2.3.2 Variance
2.3.3 Adding Random Variables
2.4 The Normal Distribution
2.5 The Central Limit Theorem

3 Probability Applications
3.1 Market Segmentation and Decision Analysis
3.1.1 Decision Analysis
3.1.2 Building and Using Market Segmentation Models
3.2 Covariance, Correlation, and Portfolio Theory
3.2.1 Covariance
3.2.2 Measuring the Risk Penalty for Non-Diversified Investments
3.2.3 Correlation, Industry Clusters, and Time Series
3.3 Stock Market Volatility

4 Estimation and Testing
4.1 Populations and Samples
4.2 Sampling Distributions
4.2.1 Example: $\log_{10}$ CEO Total Compensation
4.3 Confidence Intervals
4.3.1 Can we just replace $\sigma$ with $s$?
4.3.2 Example
4.4 Hypothesis Testing: The General Idea
4.4.1 P-values
4.4.2 Hypothesis Testing Example
4.4.3 Statistical Significance
4.5 Some Famous Hypothesis Tests
4.5.1 The One Sample T Test
4.5.2 Methods for Proportions (Categorical Data)
4.5.3 The $\chi^2$ Test

5 Simple Linear Regression
5.1 The Simple Linear Regression Model
5.1.1 Example: The CAPM Model
5.2 Three Common Regression Questions
5.2.1 Is there a relationship?
5.2.2 How strong is the relationship?
5.2.3 What is my prediction for Y and how good is it?
5.3 Checking Regression Assumptions
5.3.1 Nonlinearity
5.3.2 Non-Constant Variance
5.3.3 Dependent Observations
5.3.4 Non-normal residuals
5.4 Outliers, Leverage Points and Influential Points
5.4.1 Outliers
5.4.2 Leverage Points
5.4.3 Influential Points
5.4.4 Strategies for Dealing with Unusual Points
5.5 Review

6 Multiple Linear Regression
6.1 The Basic Model
6.2 Several Regression Questions
6.2.1 Is there any relationship at all? The ANOVA Table and the Whole Model F Test
6.2.2 How Strong is the Relationship? $R^2$
6.2.3 Is an Individual Variable Important? The T Test
6.2.4 Is a Subset of Variables Important? The Partial F Test
6.2.5 Predictions
6.3 Regression Diagnostics: Detecting Problems
6.3.1 Leverage Plots
6.3.2 Whole Model Diagnostics
6.4 Collinearity
6.4.1 Detecting Collinearity
6.4.2 Ways of Removing Collinearity
6.4.3 General Collinearity Advice
6.5 Regression When X is Categorical
6.5.1 Dummy Variables
6.5.2 Factors with Several Levels
6.5.3 Testing Differences Between Factor Levels
6.6 Interactions Between Variables
6.6.1 Interactions Between Continuous and Categorical Variables
6.6.2 General Advice on Interactions
6.7 Model Selection/Data Mining
6.7.1 Model Selection Strategy
6.7.2 Multiple Comparisons and the Bonferroni Rule
6.7.3 Stepwise Regression

7 Further Topics
7.1 Logistic Regression
7.2 Time Series
7.3 More on Probability Distributions
7.3.1 Background
7.3.2 Exponential Waiting Times
7.3.3 Binomial and Poisson Counts
7.3.4 Review
7.4 Planning Studies
7.4.1 Different Types of Studies
7.4.2 Bias, Variance, and Randomization
7.4.3 Surveys
7.4.4 Experiments
7.4.5 Observational Studies
7.4.6 Summary

A JMP Cheat Sheet
A.1 Get familiar with JMP
A.2 Generally Neat Tricks
A.2.1 Dynamic Graphics
A.2.2 Including and Excluding Points
A.2.3 Taking a Subset of the Data
A.2.4 Marking Points for Further Investigation
A.2.5 Changing Preferences
A.2.6 Shift Clicking and Control Clicking
A.3 The Distribution of Y
A.3.1 Continuous Data
A.3.2 Categorical Data
A.4 Fit Y by X
A.4.1 The Two Sample T-Test (or One Way ANOVA)
A.4.2 Contingency Tables/Mosaic Plots
A.4.3 Simple Regression
A.4.4 Logistic Regression
A.5 Multivariate
A.6 Fit Model (i.e. Multiple Regression)
A.6.1 Running a Regression
A.6.2 Once the Regression is Run
A.6.3 Including Interactions and Quadratic Terms
A.6.4 Contrasts
A.6.5 To Run a Stepwise Regression
A.6.6 Logistic Regression

B Some Useful Excel Commands

C The Greek Alphabet

D Tables
D.1 Normal Table
D.2 Quick and Dirty Normal Table
D.3 Cook’s Distance
D.4 Chi-Square Table

Don’t Get Confused
1.1 Standard Deviation vs. Variance
2.1 Understanding Probability Distributions
2.2 The difference between $X_1 + X_2$ and $2X$
3.1 A general formula for the variance of a linear combination
3.2 Correlation vs. Covariance
4.1 Standard Deviation vs. Standard Error
4.2 Which One is the Null Hypothesis?
4.3 The Standard Error of a Sample Proportion
5.1 $R^2$ vs. the p-value for the slope
6.1 Why call it an “ANOVA table?”

The “Don’t Get Confused” call-out boxes highlight points that often cause new statistics students to stumble.

Not on the Test
4.1 Does the size of the population matter?
4.2 What are “Degrees of Freedom?”
4.3 Rationale behind the $\chi^2$ degrees of freedom calculation
5.1 Why sums of squares?
5.2 Box-Cox transformations
5.3 Why leverage is “Leverage”
6.1 How to build a leverage plot
6.2 Making the coefficients sum to zero
6.3 Where does the Bonferroni rule come from?

There is some material that is included because a minority of students are likely to be curious about it. Much of this material has to do with minor technical points or questions of rationale that are not central to the course. The “Not on the Test” call-out boxes explain such material to interested students, while letting others know that they can spend their energy reading elsewhere.

Preface

An MBA statistics course differs from an undergraduate course primarily in terms of the pace at which material is covered. Much of the material in an MBA course can also be found in an undergraduate course, but an MBA course tends to emphasize topics that undergraduates never get to because they spend their time focusing on other things. Unfortunately for MBA students, most statistics textbooks are written with undergraduates in mind. These notes are an attempt to help MBA statistics students navigate through the mountains of material found in typical undergraduate statistics books.

Most undergraduate statistics courses last for a semester and conclude with either one way ANOVA or simple regression. Though it has a similar number of contact hours, our MBA course lasts for eight weeks and covers up to multiple regression. To get to where we need to be at the course’s end we must deviate from the usual undergraduate course and make regression our central theme. In doing so, we condense material that occupies several chapters in an undergraduate textbook into a single chapter on confidence intervals and hypothesis tests. Undergraduate textbooks often present this material in four or more chapters, usually with other material that does not help prepare students to study regression, and in many cases with an undue emphasis on having students do calculations themselves. Our philosophy is that by condensing the non-regression hypothesis testing material into one chapter we can present a more unified view of how hypothesis testing is used in practice. Furthermore, the one-sample problems typically found in these chapters are much less compelling than regression problems, so packing them into one chapter lets us get to the good stuff more quickly. Material such as the two sample t test and the F test from one way ANOVA are presented as special cases of regression, which reduces the number of paradigms that students must master over an eight week term.

A textbook for a course serves three basic functions. A good text concisely presents the ideas that a student must learn. It illustrates those ideas with examples as the ideas are presented. It is also a source of problems and exercises for students to work on to reinforce the ideas from the reading and from lecture. At some point these notes may evolve into a textbook, but they’re not there yet. The notes are evolving into a good presentation of statistical theory, but the hardest part of writing a textbook is developing sufficient numbers of high quality examples and exercises. At present, we have borrowed and adapted examples and exercises from three primary sources, all of which are required or optional course reading:

• Statistical Thinking for Managers, 4th Edition, by Hildebrand and Ott, published by Duxbury Press.
• Business Analysis Using Regression, by Foster, Stine, and Waterman, published by Springer.
• JMP Start Statistics, by John Sall, published by Duxbury.

Each of these sources provides data sets for their problems and examples: H&O from the diskette included with their book, FSW from the internet, and Sall from the CD containing the JMP-IN program. We will distribute the FSW data sets electronically.

Chapter 1 Looking at Data

Any data analysis should begin with “looking at the data.” This sounds like an obvious thing to do, but most of the time you will have so much data that you can’t look at it all at once. This Chapter provides some tools for creating useful summaries of the data so that you can do a cursory examination of a data set without having to literally look at each and every observation. Other goals of this Chapter are to illustrate the limitations of simply “looking at data” as a form of analysis, even with the tools discussed here, and to motivate the material in later Chapters.

1.1 Our First Data Set

Consider the data set forbes94.jmp (provided by Foster et al., 1998), which lists the 800 highest paid CEO’s of 1994, as ranked by Forbes magazine. When you open the data set in JMP (or any other computer software package) you will notice that the data are organized by rows and columns. This is a very common way to organize data. Each row represents a CEO. Each column represents a certain characteristic of each CEO such as how much they were paid in 1994, the CEO’s age, and the industry in which the CEO’s company operates. In general terms each CEO is an observation and each column of the data table is a variable. Database people sometimes refer to observations as records and variables as fields.

Figure 1.1: The first few rows and columns of the CEO data set.

One reason the CEO compensation data set is a good first data set for us to look at is its size. There are 800 observations, which is almost surely too many for you to internalize by simply looking at the individual entries. Some sort of summary measures are needed. A second feature of this data set is that it contains different types of variables. Variables such as the CEO’s age and total compensation are continuous¹ variables, while variables like the CEO’s industry and MBA status (whether or not each CEO has an MBA) are categorical². The distinction between categorical and continuous variables is important because different summary measures are appropriate for categorical and continuous variables. Regardless of whether a variable is categorical or continuous, there are numerical and graphical methods that can be used to describe it (albeit different numerical and graphical methods for different types of variables). These are described below.

¹ For our purposes, continuous variables are numerical variables where the numbers mean something, as opposed to being labels for categorical levels (1=yellow, 2=blue, etc.). There are stricter, and more precise, definitions that could be applied.

² There are actually two different types of categorical variables. Nominal variables are categories like red and blue, with no order. Ordinal variables have levels like “strongly disagree, disagree, agree, . . .” which are ordered, but with no meaningful numerical values. We will treat all categorical variables as nominal.

1.2 Summaries of a Single Variable

1.2.1 Categorical Data

An example of a categorical variable is the CEO’s industry (see Figure 1.2). The different values that a categorical variable can assume are called levels. The most common numerical summary of a categorical variable is a frequency table or contingency table, which simply counts the number of times each level occurred in the data set. It is sometimes easier to interpret the counts as fractions of the total number of observations in the data set, also known as relative frequencies.
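
As a rough illustration of a frequency table and its relative frequencies, here is a minimal Python sketch. The `industries` list is a made-up handful of labels, not the forbes94 data, and the variable names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical industry labels for six CEOs (NOT the real forbes94 data).
industries = ["Finance", "Utilities", "Finance", "Energy", "Finance", "Utilities"]

# Frequency table: the number of times each level occurs.
counts = Counter(industries)

# Relative frequencies: each count as a fraction of the number of observations.
n = len(industries)
relative = {level: count / n for level, count in counts.items()}

# Print levels from most to least frequent, with count and relative frequency.
for level, count in counts.most_common():
    print(f"{level:<10} {count:>3} {relative[level]:.3f}")
```

The relative frequencies always add up to one, which is what makes them easy to compare across data sets of different sizes.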

The choice between viewing the data as frequencies or relative frequencies is largely a matter of personal taste.

If the categorical variable contains many levels then it will be easier to look at a picture of the frequency distribution such as a histogram or a mosaic plot. A histogram is simply a bar-chart depicting a frequency distribution. The bigger the bar, the more frequent the level. Histograms have been around more or less forever, but mosaic plots are relative newcomers in the world of statistical graphics. A mosaic plot works like a pie chart, but it represents relative frequencies as slices of a stick (or a candy bar) instead of slices of a pie. The mosaic plot is very useful because you can put several of them next to each other to compare distributions within different groups (see Figure 1.6). Mosaic plots have two big advantages over pie charts. First, it is easier for people to see linear differences than angular differences (okay, maybe that’s not so big since you’ve been looking at pie charts all your life). The really important advantage of mosaic plots is that you can put several of them next to each other to compare categorical variables for several groups (see Figure 1.6).

Figure 1.2: Graphical and numerical summaries of CEO industries (categorical data). (a) Histogram and Mosaic Plot. (b) Frequency Distribution.

The summaries in Figure 1.2 indicate that Finance is by far the most frequent industry in our data set of the 800 most highly paid CEO’s. The construction industry is the least represented, and you can get a sense of the relative numbers of
CEO’s from the other industries.

1.2.2 Continuous Data

An example of a continuous variable is a CEO’s age or salary. It is easier for many people to think of summaries for continuous data because you can imagine graphing them on a number line, which gives the data a sense of location. For example, you have a sense of how far an 80 year old CEO is from a 30 year old CEO, but it is nonsense to ask how far a Capital Goods CEO is from a Utilities CEO.

Summaries of continuous data fall into two broad categories: measures of central tendency (like the mean and the median) and measures of variability (like standard deviation and range). Another way to classify summaries of continuous data is whether they are based on moments or quantiles.

Moments (mean, variance, and standard deviation)

Moments (a term borrowed from physics) are simply averages. The first moment is the sample mean

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

You certainly know how to take an average, but it is useful to present the formula for it to get you used to some standard notation that is going to come up repeatedly. In this formula (and in most to follow) $n$ represents the sample size. For the CEO data set $n = 800$. The subscript $i$ represents each individual observation in the data set (imagine $i$ assuming each value 1, 2, . . . , 800 in turn). For example, if we are considering CEO ages, then $x_1 = 52$, $x_2 = 62$, $x_3 = 56$, and so on (see Figure 1.1). The summation sign simply says to add up all the numbers in the data set. Thus, this formula says to add up all the numbers in the data set and divide by the sample size, which you already knew. FYI: putting a bar across the top of a letter (like $\bar{x}$, pronounced “x bar”) is standard notation in statistics for “take the average.”

The second moment is the sample variance

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

It is the “second” moment because the thing being averaged is squared.³ The sample variance looks at each observation $x_i$, asks how far it is from the mean ($x_i - \bar{x}$), squares each deviation from the mean (to make it positive), and takes the average.

³ The third moment has something cubed in it, and so forth.
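
The two moment formulas can be sketched in a few lines of Python. This is only an illustration: the first three ages are the ones quoted in the text, the last two are made up, and the function names `sample_mean` and `sample_variance` are our own.

```python
import math

def sample_mean(xs):
    """First moment: add up the observations and divide by the sample size n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

# x1 = 52, x2 = 62, x3 = 56 come from the text; 48 and 57 are hypothetical.
ages = [52, 62, 56, 48, 57]

xbar = sample_mean(ages)    # measured in years
s2 = sample_variance(ages)  # measured in "years squared"
s = math.sqrt(s2)           # the square root is back in years

print(f"mean = {xbar}, variance = {s2}, square root of variance = {s:.2f}")
```

Dividing by `len(xs) - 1` rather than `len(xs)` mirrors the $n-1$ in the formula for $s^2$.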

It would take you a while to try to remember the formula for $s^2$ by rote memorization. However, if you remember that $s^2$ is the “average squared deviation from the mean” then the formula will make more sense and it will be easier to remember.

There are two technical details that cause people to get hung up on the formula for sample variance. First, why divide by $n - 1$ instead of $n$? In any data set with more than a few observations dividing by $n - 1$ instead of $n$ makes almost no difference. We just do it to make math geeks happy for reasons explained (kind of) in the call-out box on page 69. Second, why square each deviation from the mean instead of doing something like just dropping the minus signs? This one is a little deeper. If you’re really curious you can check out page 89 (though you may want to wait a little bit until we get to Chapter 5).

So the sample variance is the “average squared deviation from the mean.” You use the sample variance to measure how spread out the data are. For example, the variance of CEO ages is 47.81. Wait, 47.81 what? Actually, the variance is hard to interpret because when you square each CEO’s $x_i - \bar{x}$ you get an answer in “years squared.” Nobody pretends to know what that means. In practice, the variance is computed en route to computing the standard deviation, which is simply the square root of the variance

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

The standard deviation of CEO ages is 6.9 years, which says that CEO’s are typically about 7 years above or below the average.

Standard deviations are used in two basic ways. The first is to compare the “reliability” of two or more groups. For example: the standard deviation of CEO ages in the Chemicals industry is 3.18 years, while the SD for CEO’s in the Insurance industry is 8.4 years. That means you can expect to find more very old and very young CEO’s in the Insurance industry, while CEO’s in the Chemicals industry tend to be more tightly clustered about the average CEO age in that industry.

The second, and more widespread, use of standard deviations is as a standard unit of measurement to help us decide whether two things are close or far. For example, the standard deviation of CEO total compensation is \$8.3 million. The average compensation was \$2.8 million. It so happens that Michael Eisner made over \$200 million that year, so Michael Eisner was 24 standard deviations above the mean. That, we will soon learn, is a lot of standard deviations.

Quantiles

Quantiles (a fancy word for percentiles) are another method of summarizing a continuous variable. To compute the p’th quantile of a variable simply sort the variable
from smallest to largest and find which number is p% of the way through the data set. If p% of the way through the data set puts you between two numbers, just take the average of those two numbers. The most famous quantiles are the median (50’th percentile), the minimum (0’th percentile), and the maximum (100’th percentile). If you’re given enough well chosen quantiles (say 4 or 5) you can get a pretty good idea of what the variable looks like.

Don’t Get Confused! 1.1 Standard Deviation vs. Variance. Standard deviation and variance both measure how far away from your “best guess” you can expect a typical observation to fall. They measure how spread out a variable is. Variance measures spread on the squared scale. Standard deviation measures spread using the units of the variable.

The main reason people use quantiles to summarize data is to minimize the importance of outliers, which are observations far away from the rest of the data. Sometimes outliers are the most interesting points (people certainly seem to find Michael Eisner’s salary very interesting). Quantiles are useful summaries if you want to limit the influence of outliers, which you may or may not want to do in any given situation. A big outlier like Eisner can have a big impact on averages like the mean and variance (and standard deviation). With Eisner in the sample the mean compensation is \$2.82 million. The mean drops to \$2.57 million with him excluded. Eisner has an even larger impact on the standard deviation, which is \$8.3 million with him in the sample and \$4.3 million without him.

Outliers have virtually no impact on the median, but they do impact the maximum and minimum values. The median CEO compensation is \$1.3 million with or without Michael Eisner. (The maximum CEO compensation with Eisner in the data set is \$202 million. It drops to \$53 million without him.) If you want to use quantiles to measure the spread in the data set it is smart to use something other than the max and min. The first and third quartiles (aka the 25’th and 75’th percentiles) are often used instead. The first and third quartiles are \$787,000 and \$2.5 million regardless of Eisner’s presence.

Graphical Summaries

Boxplots and histograms are the best ways to visualize the distribution of a continuous variable. Figure 1.3 shows the histogram of CEO total compensation, where Michael Eisner is an obvious outlier. Histograms work by chopping the variable into bins, and counting
frequencies for each bin.⁴ For boxplots the top of the box is the upper quartile, i.e. the point 75% of the way through the data. The bottom of the box is the lower quartile, i.e. the point which 25% of the data lies below. Thus the box in a boxplot covers the middle half of the data. The line inside the box is the median. The lines (or “whiskers”) extending from the box are supposed to cover “almost all” the rest of the data⁵. Outliers, i.e. extremely large or small values, are represented as single points.

Figure 1.3: Histogram of CEO total compensation (left panel) and $\log_{10}$ CEO compensation (right panel). Michael Eisner made so much money that we had to write his salary in “scientific notation.” On the log scale the skewness is greatly reduced and Eisner is no longer an outlier.

Histograms usually provide more information than boxplots, though it is easier to see individual outliers in a boxplot. The main advantage of boxplots, like mosaic plots, is that only one dimension of the boxplot means anything (the height of the boxplot in Figure 1.4 means absolutely nothing). Therefore it is much easier to look at several boxplots than it is to look at several histograms. This makes boxplots very useful for comparing the distribution of a continuous variable across several groups. (See Figure 1.7).

⁴ At some point someone came up with a good algorithm for choosing histogram bins, which you shouldn’t waste your time thinking about.

⁵ The rules for how long to make the whiskers are arcane and only somewhat standard. You shouldn’t worry about them.
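
To make the box of a boxplot concrete, the sketch below computes the median and quartiles using the sort-and-count rule from the Quantiles discussion, and contrasts them with the mean and standard deviation when one large outlier is added. The pay figures are hypothetical (in millions of dollars), and the `quantile` function is one simple convention that we assume here; JMP and other packages may interpolate differently.

```python
import math
import statistics

def quantile(xs, p):
    """p'th percentile: sort, go p% of the way through, average if between two values."""
    ordered = sorted(xs)
    n = len(ordered)
    if p <= 0:
        return ordered[0]           # minimum (0'th percentile)
    if p >= 100:
        return ordered[-1]          # maximum (100'th percentile)
    k = p * n / 100                 # position p% of the way through the data
    if k == int(k):                 # falls exactly between two observations
        k = int(k)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[math.ceil(k) - 1]

# Hypothetical compensation figures in $ millions (not the forbes94 data).
without = [0.5, 0.8, 1.0, 1.2, 1.4, 1.9, 2.5, 4.7]
with_outlier = without + [202.0]    # add one Eisner-sized outlier

for label, pay in [("without outlier", without), ("with outlier", with_outlier)]:
    box = (quantile(pay, 25), quantile(pay, 50), quantile(pay, 75))
    print(label, "| quartiles and median:", box,
          "| mean:", round(statistics.mean(pay), 2),
          "| SD:", round(statistics.stdev(pay), 2))
```

In this toy data the box barely moves when the outlier is added, while the mean and standard deviation change dramatically, which is the pattern the text describes for the CEO compensation data.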

Quantiles of CEO age (years):

100.0% (maximum): 81
99.5%: 77
97.5%: 69
90.0%: 64
75.0% (quartile): 61
50.0% (median): 57
25.0% (quartile): 52
10.0%: 48
2.5%: 42
0.5%: 36
0.0% (minimum): 29

Figure 1.4: Numerical and graphical summaries of CEO ages (continuous data). The normal curve is superimposed. The data appear approximately normal. The mean is 56.325 years. The standard deviation is 6.9 years. How well do the quantiles in the data match the predictions from the normal model?

The Normal Curve

Often we can use the normal curve, or “bell curve,” to model the distribution of a continuous variable. Although many continuous variables don’t fit the normal curve very well, a surprising number do. In Chapter 2 we will learn why the normal curve occurs as often as it does. If the histogram of a continuous variable looks approximately like a normal curve then all the information about the variable is contained in its mean and standard deviation (a dramatic data reduction: from 800 numbers down to 2).

The normal curve tells us what fraction of the data set we can expect to see within a certain number of standard deviations away from the mean. In Chapter 2 we will learn how to use the normal curve to make very precise calculations. For now, some of the most often used normal calculations are summarized by the empirical rule, which says that if the normal curve fits well then (approximately): 68% of the data is within ±1 SD of the mean, 95% within ±2 SD and 99.7% within ±3 SD.

To illustrate the empirical rule, consider Figure 1.4, which lists several observed quantiles for the CEO ages. The mean of the data is 56.3 years, and the size of an SD is 6.9 years, so the empirical rule says that about 95% of the data should be within 2 standard deviations of the mean. That means about 2.5% of the data should be more than 2 SD’s above the mean, and a similar amount should be more than 2 SD’s below the mean. So 2 SD’s above the mean is about 70.
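
The arithmetic behind the 2-SD cut-offs can be sketched as follows. The mean and SD are the rounded CEO age figures quoted in the text; the helper name `interval` is our own, and the calculation is only meaningful when the histogram looks roughly normal.

```python
# Summary statistics for CEO ages as quoted in the text (rounded).
mean_age = 56.3
sd_age = 6.9

def interval(mean, sd, k):
    """Return the (low, high) range lying within k standard deviations of the mean."""
    return mean - k * sd, mean + k * sd

# Empirical rule: about 95% of roughly normal data falls within 2 SDs of the mean.
low2, high2 = interval(mean_age, sd_age, 2)

# The remaining 5% is split evenly between the two tails.
tail_share = (1 - 0.95) / 2

print(f"about 95% of CEO ages should fall between {low2:.1f} and {high2:.1f} years")
print(f"about {tail_share:.1%} should be above {high2:.1f} and about {tail_share:.1%} below {low2:.1f}")
```

These predicted cut-offs are what get compared against the observed 2.5% and 97.5% quantiles listed in Figure 1.4.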

The 97.5% quantile is actually 69, which is pretty close to what the normal curve predicted.

Of course you can’t use the empirical rule if the histogram of your data doesn’t look approximately like a normal curve. This is a subjective call which takes some practice to make. Figure 1.5 shows the four most common ways that the data could be non-normal. The distribution can be skewed with a heavy tail trailing off in one direction or the other. The direction of the skewness is the direction of the tail, so CEO compensation is “right skewed” because the tail trails off to the right. A variable can have fat tails like in Figure 1.5(b). You can think of fat tailed distributions as being skewed in both directions. The most common fat tailed distributions in business applications are the distributions of stock returns (closely related to corporate profits). Figure 1.5(c) shows evidence of discreteness. It shows a variable which is “continuous” according to our working definition, but with relatively few distinct values. Finally, Figure 1.5(d) shows a bimodal variable, i.e. a variable whose distribution shows two well-separated clusters.

A more precise way to check whether the normal curve is a good fit is to use a normal quantile plot (aka quantile-quantile plot, or Q-Q plot). This plots the data ordered from smallest to largest versus the corresponding quantiles from a normal curve. If the data looks like a normal the quantile plot should have an approximately straight line. If the dots deviate substantially from a straight line this indicates that the data does not look normal. For more details see the discussion of Q-Q plots on page 41.

1.3 Relationships Between Variables

There are two main reasons to look at variables simultaneously:

• To understand the relationship, e.g. if one variable increases what happens to the other (if I increase the number of production lines what will happen to profit).
• To use one or more variables to predict another, e.g. using Profit, Sales, PE ratio etc. to predict the correct value for a stock. We will then purchase the stock if its current value is under what we think it should be.

If we want to use one variable X to predict another variable Y then we call X the predictor and Y the response. The way we analyze the relationship depends on the types of variables X and Y are. There are four possible situations depending on whether X and Y are categorical or continuous (see page 179). Three are described below. The fourth (when Y is categorical and X is continuous) is best described using a model called logistic regression which we won’t see until Chapter 7.

Figure 1.5: Some non-normal data.
(a) Skewness: CEO Compensation (top 20 outliers removed)
(b) Heavy Tails: Corporate Profits
(c) Discreteness: CEO’s age upon obtaining undergraduate degree (top 5 outliers excluded)
(d) Bimodal: Birth Rates of Different Countries

If X is categorical then it is possible to simply do the analysis you would do for Y separately for each level of X. If X is continuous then this strategy is no longer feasible.

Categorical Y and X

Just as with summarizing a single categorical variable, the main numerical tool for showing the relationship between categorical Y and categorical X is a contingency table. Figure 1.6 shows data collected by an automobile dealership listing the type of car purchased by customers within different age groups. This type of data is often encountered in Marketing applications.

Figure 1.6: Contingency table and mosaic plot for auto choice data.

The primary difference between two-way contingency tables (with two categorical variables) and one-way tables (with a single variable) is that there are more ways to turn the counts in the table into proportions. For example there were 22 people in the 29-38 age group who purchased work vehicles. What does that mean to us? These 22 people represent 8.37% (=22/263) of the total data set. This is known as a joint proportion because it treats X and Y symmetrically. If you really want to think of one variable explaining another, then you want to use conditional proportions instead. The contingency table gives you two groups of conditional proportions because it doesn’t know in advance which variable you want to condition on. For example, if you want to see how automobile preferences vary by age then you want to compute the distribution of TYPE conditional on AGEGROUP. Restrict your attention to just the one row of the contingency table corresponding to 29-38 year olds. What fraction of them bought work vehicles?

There are 133 of them, so the 22 people represent 16.54% of that particular row. Of that same group, 63% purchased family vehicles, and 20% purchased sporty vehicles. To see how auto preferences vary according to age group, compare these “row percentages” for the young, middle, and older age groups. It looks like sporty cars are less attractive to older customers, family cars are more attractive to older customers, and work vehicles have similar appeal across age groups.

You could also condition the other way, by restricting your attention to the column for work vehicles. Of the 44 work vehicles purchased, 22 (50%) were purchased by 29-38 year olds. The younger demographic purchased 39% of work vehicles, while the older demographic purchased only 11%. By comparing these distributions across car type you can see that most family and work cars tend to be purchased by 29-38 year olds, while sporty cars tend to be purchased by 18-28 year olds.

Finally, the margins of the contingency table contain information about the individual X and Y variables. Because of this, when you restrict your attention to a single variable by ignoring other variables you are looking at its marginal distribution. The same terminology is used for continuous variables too. Thus the title of Section 1.2 could have been “looking at marginal distributions.” We can see from the margins of the table that the 29-38 age group was the most frequently observed, and that family cars (a favorite of the 29-38 demographic) were the most often purchased.

Far and away the best way to visualize a contingency table is through a side-by-side mosaic plot like the one in Figure 1.6. The individual mosaic plots show you the conditional distribution of Y (in this case TYPE) for each level of X (in this case AGEGROUP). To see for yourself, open autopref.jmp and construct separate histograms for TYPE within each level of AGEGROUP (use the “by” button in the “Distribution of Y” dialog box). The plot represents the marginal distribution of X by the width of the individual mosaic plots: the 39+ demographic has the thinnest mosaic plot because it has the fewest members. The marginal distribution of the Y variable is a separate mosaic plot serving as the legend to the main plot. Finally, because of the way the marginal distributions are represented, the joint proportions in the contingency table correspond to the area of the individual tiles. Thus you can see from Figure 1.6 that “family cars purchased by 29-38 year olds” is the largest cell of the table. Side-by-side mosaic plots are a VERY effective way of looking at contingency tables.
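The joint and conditional proportions discussed for the auto choice data differ only in the denominator. A minimal sketch, using just the four counts quoted in the text (22, 133, 44, and 263); the variable names are illustrative and are not taken from the data file:

```python
# Counts quoted in the text for the auto choice data.
cell = 22          # 29-38 year olds who bought work vehicles
row_total = 133    # all 29-38 year olds
col_total = 44     # all work vehicles purchased
grand_total = 263  # everyone in the data set

joint = cell / grand_total     # share of the whole table
given_age = cell / row_total   # share of the 29-38 row (a "row percentage")
given_type = cell / col_total  # share of the work-vehicle column

print(f"joint proportion:         {joint:.2%}")
print(f"conditional on AGEGROUP:  {given_age:.2%}")
print(f"conditional on TYPE:      {given_type:.2%}")
```

The same cell count gives three different answers depending on which total it is divided by, which is why it matters to say which variable is being conditioned on.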

Continuous Y and Categorical X

If you want to see how the distribution of a continuous variable varies across several groups you can simply list means and standard deviations (or your favorite quantiles) for each group. Graphically, the best way to do the comparison is with side-by-side boxplots. Consider Figure 1.7, which compares log10 CEO compensation for CEO’s in different industries.

Figure 1.7: Side-by-side boxplots comparing log10 compensation for CEO’s in different industries.

Compensation-wise, the finance CEO’s seem fairly typical of other CEO’s on the list. The aerospace-defense CEO’s are rather well paid, while the forest and utilities CEO’s haven’t done as well. As with mosaic plots, the width of the side-by-side boxplots depicts the marginal distribution of X. Thus the finance industry has the widest boxplot because it is the most frequent industry in our data set.

To convince yourself of the value of side-by-side boxplots, try doing the same comparison with 19 histograms. Yuck! If there are only a few levels (2 or 3) you could look at a histogram for each level (make sure the axes all have the same scale), but beyond that boxplots are the way to go. Multiple histograms are harder to read than side-by-side boxplots because each histogram has a different set of axes.

Continuous Y and X

The best graphical way to show the relationship between two continuous variables is a scatterplot like the one in Figure 1.8. Each dot represents a CEO. Dots on the right are older CEO’s. Dots near the top are highly paid CEO’s. From the Figure it appears that if there is a relationship between a CEO’s age and compensation it isn’t a very strong one.

Figure 1.8: Scatterplot showing log10 compensation vs. age for CEO dataset. The best fitting line and quadratic function are also shown.

Could there be a trend in the data that is just too hard to see in the Figure? We can use regression to compute the straight line that best (see footnote 6) fits the trend in the data. The regression line has a positive slope, which indicates that older CEO’s tend to be paid more than younger CEO’s. Of course, the regression line also raises some questions.

1. The slope of the line isn’t very large. Of course the Figure is plotted on the log scale, and small changes in log compensation can be large changes in terms of real dollars. How large does a slope have to be before we conclude that it isn’t worth considering?
2. Why are we only looking at straight lines? We can also use regression to fit the “best” quadratic function to the data. The linear and quadratic models say very different things about CEO compensation. The linear model says that older CEO’s make more than younger CEO’s. The quadratic model says that a CEO’s earning power peaks and then falls off. Which model should we believe?
3. The regression line only describes the trend in the data. Our previous analyses (such as comparing log10 compensation by industry) actually described the data themselves (both center and spread). Is there some way to numerically describe the entire data set and not just the trend?
4. What if a CEO’s compensation depends on more than one variable?

6 The regression line is “best” according to a specific criterion known as “least squares” which is discussed in Chapter 5.

1.4 The Rest of the Course

The questions listed above are all very important, and we will spend much of the rest of the course understanding the tools that help us answer them. Procedurally, questions 1 and 2 are answered by something called a p-value, which is included in the computer output that you get when you fit a regression. Chapter 4 is largely about helping you understand p-values. To do so you need to know a few basic facts about probability, the subject of Chapters 2 and 3. Section 3.2.3 and Chapter 5 return to the more interesting topic of relationships between variables. Question 3 will be dealt with in Chapter 5 once we learn a little more about the normal curve in Chapter 2. Question 4 may be the greatest limitation of analyses which consist only of “looking at data.” To measure the impact that several X variables have on Y requires that you build a model, which is the subject of Chapter 6. By the end of Chapter 6 you will have a working knowledge of the multiple regression model, which is one of the most flexible and the most widely used models in all of statistics.

Chapter 2

Probability Basics
This Chapter provides an introduction to some basic ideas in probability. The focus in Chapter 1 was on looking at data. Now we want to start thinking about building models for the process that produced the data. Throughout your math education you have learned about one mathematical tool, and then learned about its opposite. You learned about addition, then subtraction. Multiplication, then division. Probability and statistics have a similar relationship. Probability is used to define a model for a process that could have produced the data you are interested in. Statistics then takes your data and tries to estimate the parameters of that model. Probability is a big subject, and it is not the central focus of this course, so we will only sketch some of the main ideas.

The central characters in this Chapter are random variables. Every random variable has a probability distribution that describes the values the random variable is likely to take. While some probability distributions are simple, some of them are complicated. If a probability distribution is too complicated to deal with we may prefer to summarize it with its expected value (also known as its mean) and its variance. One probability distribution that we will be particularly interested in is the normal distribution, which occurs very often. A bit of math known as the central limit theorem (CLT) explains why the normal distribution shows up so much. The CLT says that sums or averages of many independent random variables are approximately normally distributed. The CLT is so important because many of the statistics we care about (such as the sample mean, sample proportion, and regression coefficients) can be viewed as averages.

2.1 Random Variables

Definition

A random variable is a number whose value is determined by the outcome of a random experiment. In effect, a random variable is a number that hasn’t “happened” yet.

Examples
• The diameter of the next observed crank shaft from an automobile production process.
• The number on a roll of a die.
• Tomorrow’s closing value of the Nasdaq.

Notation
Random variables are usually denoted with capital letters like X and Y. The possible values of these random variables are denoted with lower case letters like x and y. Thus, if X is the number of cars my used car lot will sell tomorrow, and if I am interested in the probability of selling three cars, then I will write P(X = 3). Here 3 is a particular value of lower-case x that I specify.

The Distribution of a Random Variable

By definition it is impossible to know exactly what the numerical value of a random variable will be. However, there is a big difference between not knowing a variable’s value and knowing nothing about it. Every random variable has a probability distribution describing the relative likelihood of its possible values. A probability distribution is a list of all the possible values for the random variable and the corresponding probability of that value happening. Values with high probabilities are more likely than values with small probabilities. For example, imagine you own a small used-car lot that is just big enough to hold 3 cars (i.e. you can’t sell more than 3 cars in one day). Let X represent the number of cars sold on a particular day. Then you might face the following probability distribution

x            0     1     2     3
P(X = x)    0.1   0.2   0.4   0.3

From the probability distribution you can compute things like the probability that you sell 2 or more cars, which is 70% (= .4 + .3). Pretty straightforward, really.
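The same calculation can be written as a short sketch: store the distribution as a table from values to probabilities, and add up the probabilities of the values that make up the event. The names `dist` and `prob` are chosen for this illustration only.

```python
# Probability distribution for X, the number of cars sold in a day.
dist = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

# A valid distribution assigns probabilities that sum to 1.
assert abs(sum(dist.values()) - 1.0) < 1e-12


def prob(event):
    """Add up P(X = x) over the values x for which event(x) is true."""
    return sum(p for x, p in dist.items() if event(x))


print(round(prob(lambda x: x >= 2), 2))  # probability of selling 2 or more cars
print(round(prob(lambda x: x == 0), 2))  # probability of selling no cars
```

Any other event about X, such as "at most one car", is handled by changing the condition passed to `prob`.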

Don’t Get Confused! 2.1 Understanding Probability Distributions

One place where students often become confused is the distinction between a random variable X and its distribution P(X = x). You can think of a probability distribution as the histogram for a very large data set. Then think of the random variable X as a randomly chosen observation from that data set. It is often convenient to think of several different random variables with the same probability distribution. For example, let X1, . . . , X10 represent the numbers of dots observed during 10 rolls of a fair die. Each of these random variables has the same distribution P(X = x) = 1/6, for x = 1, 2, . . . , 6. But they are different random variables because each one can assume different values (i.e. you don’t get the same roll for each die).

Where Probabilities Come From

Probabilities can come from four sources.

1. “Classical” symmetry arguments
2. Historical observations
3. Subjective judgments
4. Models

Classical symmetry arguments include statements like “all sides of a fair die are equally likely, so the probability of any one side is 1/6.” They are the oldest of the four methods, but are of mainly mathematical interest and not particularly useful in applied work.

Historical observations are the most obvious way of deriving probabilities. One justification of saying that there is a 40% chance of selling two cars today is that you sold two cars on 40% of past days. A bit of finesse is needed if you wish to compute the probability of some event that you haven’t seen in the past. However, most probability distributions used in practice make use of past data in some form or another.

Subjective judgments are used whenever experts are asked to assess the chance that some event will occur. Subjective probabilities can be valuable starting points when historical information is limited, but they are only as reliable as the “expert” who produces them.

The most common sources of probabilities in business applications are probability models. Models are useful when there are too many potential outcomes to list individually, or when there are too many uncertain quantities to consider simultaneously without some structure. Many of the most common probability models make use of the normal distribution, and its extension the linear regression model. We will discuss these two models at length later in the course.

The categories listed above are not mutually exclusive. For example, probability models usually have parameters which are fit using historical data. Subjective judgment is used when selecting families of models to fit in a given application.

2.2 The Probability of More than One Thing

Things get a bit more complicated if there are several unknown quantities to be modeled. For example, what if there were two car salesmen (Jim and Floyd) working on the lot? Then on any given day you would have two random variables: X , the number of cars that Jim sells, and Y , the number of cars that Floyd sells.

2.2.1 Joint, Conditional, and Marginal Probabilities

The joint distribution of two random variables X and Y is a function of two variables P(x, y) giving the probability that X = x and Y = y. For example, the joint distribution for Jim and Floyd’s sales might be:

                Y (Floyd)
X (Jim)     0     1     2     3
   0       .10   .10   .10   .10
   1       .10   .20   .10   .00
   2       .10   .05   .00   .00
   3       .05   .00   .00   .00

Remember that there are only 3 cars on the lot, so P(x, y) = 0 if x + y > 3. As with the distribution of a single random variable, the joint distribution of two (or more) random variables simply lists all the things that could happen, along with the corresponding probabilities. So in that sense it is no different than the probability distribution of a single random variable; there are just more possible outcomes to consider. Just to be clear, the distribution given above says that the probability of Jim selling two cars on a day that Floyd sells 1 is .05 (i.e. that combination of events will happen about 5% of the time).

Marginal Probabilities

If you were given the joint distribution of two variables, you might decide that one of them was irrelevant for your immediate purpose. For example, Floyd doesn’t care about how many cars Jim sells, he just wants to know how many cars he (Floyd) will sell. That is, Floyd wants to know the marginal distribution of Y. (Likewise, Jim may only care about the marginal distribution of X.) Marginal probabilities are calculated in the obvious way: you simply sum across any variable you want to ignore. The mathematical formula describing the computation looks worse than it actually is

$$P(Y = y) = \sum_x P(X = x, Y = y). \tag{2.1}$$

All this says is the following. “The probability that Floyd sells 0 cars is the probability that he sells 0 cars and Jim sells 0, plus the probability that he sells 0 cars and Jim sells 1, plus . . . .” Even more simply, it says to add down the column of numbers in the joint distribution that correspond to Floyd selling 0 cars. In fact, the name marginal suggests that marginal probabilities are often written on the margins of a joint probability distribution. For example:

                Y (Floyd)
X (Jim)     0     1     2     3
   0       .10   .10   .10   .10    .40
   1       .10   .20   .10   .00    .40
   2       .10   .05   .00   .00    .15
   3       .05   .00   .00   .00    .05
           .35   .35   .20   .10   1.00

The marginal probabilities say that Floyd has a 10% chance (and Jim a 5% chance) of selling three cars on any given day. Notice that if you have the joint distribution you can compute the marginal distributions, but you can’t go the other way around. That makes sense, because the two marginal distributions have only 8 numbers (4 each), while the joint distribution has 16 numbers, so there must be some information loss. Also note that the word marginal means something totally different in probability than it does in economics.

Conditional Probabilities

Each day, Floyd starts out believing that his sales distribution is

Num. Cars (y)    0     1     2     3
Prob            .35   .35   .20   .10

What if Floyd somehow knew that today was one of the days that Jim would sell 0 cars? What should he believe about his sales distribution in light of the new information?

This situation comes up often enough in probability that there is standard notation for it. A vertical bar “|” inside a probability statement separates information which is still uncertain (on the left of the bar) from information which has become known (on the right of the bar). In the current example Floyd wants to know P(Y = y|X = 0). This statement is read: “The probability that Y = y given that X = 0.” The updated probability is called a conditional probability because it has been conditioned on the given information.

How should the updated probability be computed? Imagine that the probabilities in the joint distribution we have been discussing came from a data set describing the last 1000 days of sales. The contingency table of sales counts would look something like

                Y (Floyd)
X (Jim)     0     1     2     3
   0       100   100   100   100    400
   1       100   200   100     0    400
   2       100    50     0     0    150
   3        50     0     0     0     50
           350   350   200   100   1000

If Floyd wants to estimate P(Y|X = 0), he can simply consider the 400 days when Jim sold zero cars, ignoring the rest. That is, he can normalize the (X = 0) row of the table by dividing everything in that row by 400 (instead of dividing by 1000, as he would to get the joint distribution). If Floyd didn’t have the original counts he could still do the normalization; he would simply do it using probabilities instead of counts. This thought experiment justifies the definition of conditional probability

$$P(Y = y \mid X = x) = \frac{P(Y = y, X = x)}{P(X = x)}. \tag{2.2}$$

The equation simply says to take the appropriate row or column of the joint distribution and normalize it so that it sums to 1. Notice that the denominator of equation (2.2) does not depend on y. It is simply a normalizing factor. Also, notice that if you summed the numerator over all possible values of y, you would get P(X = x) in the numerator and denominator, so the answer would be 1.

We can easily compute all of the possible conditional distributions that Floyd would face if he were told X = 0, 1, 2, or 3, and the conditional distributions that Jim would face if he were told Floyd’s sales.

Floyd’s conditional probabilities given Jim’s sales P(Y|X)

                Y (Floyd)
X (Jim)     0     1     2     3
   0       .25   .25   .25   .25   1.00
   1       .25   .50   .25   .00   1.00
   2       .67   .33   .00   .00   1.00
   3      1.00   .00   .00   .00   1.00

Jim’s conditional probabilities given Floyd’s sales P(X|Y)

                Y (Floyd)
X (Jim)     0     1     2     3
   0       .29   .29   .50  1.00
   1       .29   .57   .50   .00
   2       .29   .14   .00   .00
   3       .14   .00   .00   .00
          1.00  1.00  1.00  1.00
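The marginal and conditional distributions for Jim and Floyd can be reproduced from the joint distribution alone. The sketch below is one way to do it, assuming the joint table is stored as a list of rows (one row per value of X); the function names are made up for this illustration.

```python
# Joint distribution P(X = x, Y = y): rows are Jim's sales x, columns are Floyd's sales y.
joint = [
    [0.10, 0.10, 0.10, 0.10],
    [0.10, 0.20, 0.10, 0.00],
    [0.10, 0.05, 0.00, 0.00],
    [0.05, 0.00, 0.00, 0.00],
]

# Marginals: sum across the variable being ignored (equation 2.1).
marginal_x = [sum(row) for row in joint]        # Jim
marginal_y = [sum(col) for col in zip(*joint)]  # Floyd


def conditional_y_given_x(x):
    """P(Y = y | X = x): normalize row x so it sums to 1 (equation 2.2)."""
    return [p / marginal_x[x] for p in joint[x]]


def conditional_x_given_y(y):
    """P(X = x | Y = y): normalize column y so it sums to 1."""
    return [row[y] / marginal_y[y] for row in joint]


print([round(p, 2) for p in marginal_x])
print([round(p, 2) for p in marginal_y])
print([round(p, 2) for p in conditional_y_given_x(0)])
print([round(p, 2) for p in conditional_x_given_y(0)])
```

Each conditional distribution is just one row or column of the joint table divided by its own total, so rounded entries may not add to exactly 1.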

So what does the information that X = 0 mean to Floyd? If we compare his marginal sales distribution to his conditional distribution given X = 0

y    No information    Jim sells 0 cars
0         .35                .25
1         .35                .25
2         .20                .25
3         .10                .25

it appears (unsurprisingly) that Floyd has a better chance of having a big sales day if Jim sells zero cars.

Putting It All Together

Let’s pause to summarize the probability jargon that we’ve introduced in this section. A joint distribution P(X, Y) summarizes how two random variables vary simultaneously. A marginal distribution describes variation in one random variable, ignoring the other. A conditional distribution describes how one random variable varies if the other is held fixed at some specified value. If you are given a joint distribution you can derive any conditional or marginal distributions of interest. However, to compute the joint distribution you need to have the marginal distribution of one variable, and all conditional distributions of the other. This is a consequence of the definition of conditional probability (equation 2.2) which is sometimes stated as the probability multiplication rule.

$$P(X, Y) = P(Y \mid X) P(X) = P(X \mid Y) P(Y) \tag{2.3}$$

Equations (2.2) and (2.3) are the same: just multiply both sides of (2.2) by P(X = x). However, Equation (2.3) is more suggestive of how probability models are actually built. It is usually harder to think about how two (or more) things vary simultaneously than it is to think about how one of them would behave if we knew the other. Thus most probability distributions are created by considering the marginal distribution of X, and then considering the conditional distribution of Y given X. We will illustrate this procedure in Section 2.2.3.

2.2.2 Bayes’ Rule

Probability distributions are a way of summarizing our beliefs about uncertain situations. Those beliefs change when we observe relevant evidence. The method for updating our beliefs to reflect the new evidence is called Bayes’ rule. Suppose we are unsure about a proposition U which can be true or false. For example, maybe U represents the event that tomorrow will be an up day on the stock market, and notU means that tomorrow will be a down day. Historically, 53% of days have been up days, and 47% have been down days, so we start off believing that P(U) = .53 (see footnote 1). But then we find out that the leading firm in the technology sector has filed a very negative earnings report just as the market closed today. Surely that will have an impact on the market tomorrow. Let’s call this new evidence E and compute P(U|E) (“the probability of U given E”), our updated belief about the likelihood of an up day tomorrow in light of the new evidence. Bayes’ rule says that the updated probability is computed using the following formula:

$$\begin{aligned}
P(U \mid E) &= \frac{P(E \mid U) P(U)}{P(E)} \\
&= \frac{P(E \mid U) P(U)}{P(E \mid U) P(U) + P(E \mid \text{not}U) P(\text{not}U)}.
\end{aligned} \tag{2.4}$$

The first line here is just the definition of conditional probability. If you know P(E) and P(U, E) then Bayes’ rule is straightforward to apply. The second line is there in case you don’t have P(E) already computed. You might recognize its denominator as equation (2.1), which we encountered when discussing marginal probabilities, with each joint probability written as a conditional probability times a marginal probability. If not, then you should be able to convince yourself of the relationship P(E) = P(E|U)P(U) + P(E|notU)P(notU) by looking at Figure 2.2.

1 These numbers are based on daily returns from the S&P 500, which are plotted in Figure 2.3 on page 28.

Figure 2.1: The Reverend Thomas Bayes 1702–1761. He’s even older than that Gauss guy in Figure 2.10.

Figure 2.2: Venn diagram illustrating the denominator of Bayes’ rule: The probability of E is the probability of “E and U” plus the probability of “E and NotU”.

An Example Calculation Using Bayes’ Rule

In order to evaluate Bayes’ rule we need to evaluate P(E|U), the probability that we would have seen evidence E if U were true. In our example this is the probability that we would have seen a negative earnings report by the leading technology firm if the next market day were to be an up day. We could obtain this quantity by looking at all the up days in market history and computing the fraction of them that were preceded by negative earnings reports. Suppose that number is P(E|U) = 1% = 0.010. While we’re at it, we may as well compute the percentage of down days (notU) preceded by negative earnings reports. Suppose that number is P(E|notU) = 1.5% = 0.015. It looks like such an earnings report is really unlikely regardless of whether or not we’re in for an up day tomorrow. However, the report is certainly less likely to happen under U than notU. Bayes’ rule tells us that the probability of an up day tomorrow, given the negative earnings report today, is

$$P(U \mid E) = \frac{P(E \mid U) P(U)}{P(E \mid U) P(U) + P(E \mid \text{not}U) P(\text{not}U)} = \frac{(.010)(.53)}{(.010)(.53) + (.015)(.47)} = 0.429.$$

Keeping It All Straight

Bayes’ rule is straightforward mathematically, but it can be confusing because there are several pieces to the formula that are easy to mix up. The formula for Bayes’ rule would be a lot simpler if we didn’t have to worry about the denominator. Notice that

$$P(U \mid E) = \frac{P(E \mid U) P(U)}{P(E \mid U) P(U) + P(E \mid \text{not}U) P(\text{not}U)}$$

and

$$P(\text{not}U \mid E) = \frac{P(E \mid \text{not}U) P(\text{not}U)}{P(E \mid U) P(U) + P(E \mid \text{not}U) P(\text{not}U)}$$

both have the same denominator. When we’re evaluating Bayes’ rule we need to compute P(E|U) and P(E|notU) to get the denominator anyway, so what if we just wrote the calculation as

$$P(U \mid E) \propto P(E \mid U) P(U).$$

The ∝ sign is read “is proportional to,” which just means that there is a constant multiplying factor which is too big a bother to write down. We can recover that factor because the probabilities P(U|E) and P(notU|E) must sum to one. Thus, if the equation for Bayes’ rule seems confusing, you can remember it as the following procedure.

1. Write down all possible values for U in a column on a piece of paper.
2. Next to each value write P(U), the probability of U before you learned about the new evidence. P(U) is sometimes called the prior probability.
3. Next to each prior probability write down the probability of the evidence if U had taken that value. This is sometimes called the likelihood of the evidence.
4. Multiply the prior times the likelihood, and sum over all possible values of U. This sum is the normalizing constant P(E) from equation (2.4).
5. Divide by P(E) to get the posterior probability P(U|E) = P(U)P(E|U)/P(E).

This procedure is summarized in the table below.

        prior   likelihood   Pri*Like   posterior
Up      0.53    0.010        0.00530    0.4291498 = 0.00530/0.01235
Down    0.47    0.015        0.00705    0.5708502 = 0.00705/0.01235
                             -------
                             0.01235

Once you have internalized either equation (2.4) or the ﬁve step procedure listed above you can remember them as: “The posterior probability is proportional to the prior times the likelihood.”
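The five step procedure translates almost line for line into code. The sketch below uses the prior and likelihood values from the stock market example; the dictionary names are chosen for this illustration only.

```python
# Steps 1-2: the possible values of U and their prior probabilities.
prior = {"Up": 0.53, "Down": 0.47}

# Step 3: likelihood of the evidence E (the negative earnings report) under each value.
likelihood = {"Up": 0.010, "Down": 0.015}

# Step 4: multiply prior by likelihood, then sum to get the normalizing constant P(E).
unnormalized = {u: prior[u] * likelihood[u] for u in prior}
p_evidence = sum(unnormalized.values())

# Step 5: divide by P(E) so that the posterior probabilities sum to 1.
posterior = {u: unnormalized[u] / p_evidence for u in prior}

for u in prior:
    print(f"{u:5s} {prior[u]:.2f} {likelihood[u]:.3f} {unnormalized[u]:.5f} {posterior[u]:.7f}")
```

Because nothing in the code depends on there being exactly two values of U, the same steps work when U can take three or more values, as long as a prior and a likelihood are supplied for each.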

Why Bayes’ Rule is Important

The first time you see Bayes’ rule it seems like a piece of trivia. After all, it is nothing more than a restatement of the “multiplication rule” in equation (2.3) (which was a restatement of equation (2.2)). However, it turns out that Bayes’ rule is the foundation of rational decision making, and may well be the e = mc² of the 21st century. One example where Bayes’ theorem has made a huge impact is “artificial intelligence,” which means programming a computer to make intelligent seeming decisions about complex problems. In order to do that you need to have some way to mathematically express what a computer should “believe” about a complex scenario. The computer also needs to “learn” as new information comes in. The computer’s beliefs about the complex scenario are described using a complex probability model. Then Bayes’ theorem is used to update the probability model as the computer “learns” about its surroundings. We will see several examples of Bayesian learning in Chapter 3.

2.2.3 A “Real World” Probability Model

The preceding sections have illustrated some of the issues that can arise when two uncertain quantities are considered. In the interest of simplicity we have dealt mainly with “toy” examples, which can mask some of the issues that come up in more realistic settings. Let’s work on building a realistic probability model for a familiar process: the daily returns of the S&P 500 stock market index. That sounds a bit daunting, so let’s limit the complexity of our task by only considering whether each day’s returns are “up” (positive return) or “down” (negative return). We want our model to compute the probability that the next n days will follow some specified sequence (e.g. with n = 4 we want to compute P(up, up, down, up)), and we want it to work with any value of n.

One thing worth noticing is that the terms joint, conditional, and marginal become a bit ambiguous when there are several random variables floating about. For example, suppose stock market returns over the next 4 days are denoted by X1, . . . , X4. Suppose we’re told that day 1 will be an “Up” day, and we want to consider what happens on days 2 and 3. Then P(X2, X3|X1) is a joint, marginal, and conditional distribution all at the same time. It is “joint” because it considers more than one random thing (X2 and X3). It is “conditional” because something formerly random (X1) is now known. It is “marginal” because it ignores something random (X4).

The second thing we notice is that the “probability multiplication rule” starts to look scary.

Figure 2.3: Daily returns for the S&P 500 market index. The vertical axis excludes a few outliers (notably 10/19/1987) that obscure the pattern evident in the remainder of the data.

When applied to many random variables, the multiplication rule becomes

$$P(X_1, \ldots, X_n) = P(X_1) \times P(X_2 \mid X_1) \times P(X_3 \mid X_2, X_1) \times \cdots \times P(X_{n-1} \mid X_{n-2}, \ldots, X_1) \times P(X_n \mid X_{n-1}, \ldots, X_1). \tag{2.5}$$

That is, you can factor the joint distribution P(X1, . . . , Xn) by multiplying the conditional distributions of each Xi given all previous X’s. Why is that scary? Remember that each of the random variables can only assume one of two values: Up or Down. We can come up with P(X1) simply enough, just by counting how many up and down days there have been in the past. These probabilities turn out to be

x            Down    Up
P(Xi = x)    0.474   0.526

Finding P(X2|X1) is twice as much work: we have to count out how many (UU), (UD), (DU), and (DD) transitions there were. After normalizing the transition counts we get the conditional probabilities

              Xi
Xi−1       Down    Up
Down       0.519   0.481   1.00
Up         0.433   0.567   1.00
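Counting transitions and normalizing them can be sketched in a few lines. The up/down history below is hypothetical, made up purely to show the mechanics; it is not the S&P 500 data, and the names `days`, `transitions`, and `transition_probs` are illustrative.

```python
from collections import Counter

# Hypothetical up/down history, invented for illustration (not the S&P 500 data).
days = list("UUDUDDUUUDUDDUUD")

# Count each (previous day, current day) transition.
transitions = Counter(zip(days, days[1:]))


def transition_probs(counts):
    """Normalize counts within each previous-day state to estimate P(X_i | X_{i-1})."""
    probs = {}
    for prev in "DU":
        total = sum(counts[(prev, cur)] for cur in "DU")
        probs[prev] = {cur: counts[(prev, cur)] / total for cur in "DU"}
    return probs


table = transition_probs(transitions)
for prev in "DU":
    print(prev, {cur: round(p, 3) for cur, p in table[prev].items()})
```

Each row of the resulting table sums to 1 because it is divided by the number of days that started in that row's state, which mirrors the row normalization used for Jim and Floyd's sales counts.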

Finding $P(X_3 \mid X_2, X_1)$ is twice as much work as $P(X_2 \mid X_1)$: we need to find the number of times each pattern (DDD), (DDU), (DUD), (DUU), (UDD), (UDU), (UUD), (UUU) was observed.

[Table: conditional probabilities $P(X_i \mid X_{i-1}, X_{i-2})$, with one row for each $(X_{i-2}, X_{i-1})$ history (Down, Down), (Down, Up), (Up, Down), (Up, Up), columns for $X_i$ = Down and $X_i$ = Up, and row totals of 1.00. The surviving entries are 0.501, 0.499, 0.449, 0.551, 0.539, 0.461, 0.412, and 0.588; their assignment to rows is not recoverable from the source.]

Notice how each additional day we wish to consider doubles the amount of work we need to do to derive our model. This quickly becomes an unacceptable burden. For example, if $n = 20$ we would have to compute over one million conditional probabilities. That is far too many to be practical, especially since there are only 14,000 days in the data set. The obvious solution is to limit the amount of dependence that we are willing to consider. The two most common solutions in practice are to assume independence or Markov dependence.

#### Independence

Two random variables are independent if knowing the numerical value of one does not change the distribution you would use to describe the other. Translated into “probability speak,” independence means that $P(Y \mid X) = P(Y)$. If we were to assume that returns on the S&P 500 were independent, then we could compute the probability that the next three days’ returns are (UUD) (two up days followed by a down day) as follows. The general multiplication rule says that

$$P(X_1, X_2, X_3 = UUD) = P(X_1 = U) \times P(X_2 = U \mid X_1 = U) \times P(X_3 = D \mid X_2 = U, X_1 = U). \tag{2.6}$$

If we assume that $X_1$, $X_2$, and $X_3$ are independent, then $P(X_2 \mid X_1) = P(X_2)$ and $P(X_3 \mid X_1, X_2) = P(X_3)$, so the probability becomes

$$P(UUD) = P(X_1 = U) \times P(X_2 = U) \times P(X_3 = D) = (.526)(.526)(.474) = 0.131. \tag{2.7}$$

The numbers here come from the marginal distribution of $X_1$ given earlier in this section. We have assumed that the marginal distribution does not change over time, which is a common assumption in practice.
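Both points in this passage, the doubling of work under the full multiplication rule and the simplification that independence buys, can be checked with a short sketch. The helper names are hypothetical, and counting one row per conditioning history is one reasonable way (an assumption, not the only way) to arrive at the “over one million” figure for $n = 20$.

```python
from math import prod

# Marginal distribution of a single day's return, as quoted in the text.
P_DAY = {"U": 0.526, "D": 0.474}


def prob_independent(pattern):
    """P(pattern) when each day is an independent draw from P_DAY."""
    return prod(P_DAY[day] for day in pattern)


def histories_to_tabulate(n):
    """Conditioning histories needed by the full multiplication rule for n days.

    The factor P(X_i | X_{i-1}, ..., X_1) needs one row for each of the
    2**(i-1) possible histories, so the total is 1 + 2 + 4 + ... + 2**(n-1).
    """
    return sum(2 ** (i - 1) for i in range(1, n + 1))


p_uud = prob_independent("UUD")
print(round(p_uud, 3))            # matches equation (2.7) to three decimals
print(histories_to_tabulate(20))  # just over one million
```

Under independence only the two numbers in `P_DAY` are needed, no matter how long the pattern is.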

[Figure 2.4: (a) Diameters of automobile crank shafts. (b) International airline passenger traffic. The crank shaft diameters appear to be independent. The airline passenger series exhibits strong dependence.]

Independence is a strong assumption, but it is reasonable in many circumstances. Many of the statistical procedures we will discuss later assume independent observations. You can often plot your data, as we have done in Figure 2.4, to check whether it is reasonable to assume independence. The left panel shows data from a production line which produces crank shafts to go in automobile engines. Each day five shafts are collected and measured during quality control checks. The crank shafts should ideally be 815 thousandths of an inch in diameter, but there will be some variability from shaft to shaft. Some shafts measure greater than 815, and some lower. But it does not seem like one shaft being greater or less than 815 influences whether the next shaft is likely to be greater or less than 815. That’s what it means for random variables to be independent.

Contrast the shaft diameter data set with the airline passenger data set shown in the right panel of Figure 2.4. The airline passenger data series exhibits an upward trend over time, and it also shows a strong seasonal pattern. The passenger counts in any particular month are very close to the counts in neighboring months. This is an example of very strong dependence between the observations in this series.

#### Markov Dependence

Independence makes probability calculations easy, but it is sometimes implausible. If you think that Up days tend to follow Up days on the stock market, and vice versa, then you should feel uncomfortable about assuming the returns to be independent. The simplest way to allow dependence across time is to assume Markov dependence. Mathematically, Markov dependence can be expressed

$$P(X_n \mid X_{n-1}, \ldots, X_1) = P(X_n \mid X_{n-1}). \tag{2.8}$$

Simply put, Markov dependence assumes that today’s value depends on yesterday’s value but not the day before. A sequence of random variables linked by Markov dependence is known as a Markov chain. Let’s suppose that the sequence of S&P 500 returns follows a Markov chain and compute the probability that the next 4 days $X_1, \ldots, X_4$ follow the pattern $UUDU$. The general multiplication rule says that

$$P(UUDU) = P(X_1 = U) \times P(X_2 = U \mid X_1 = U) \times P(X_3 = D \mid X_2 = U, X_1 = U) \times P(X_4 = U \mid X_3 = D, X_2 = U, X_1 = U). \tag{2.9}$$

Markov dependence means that $P(X_3 \mid X_1, X_2) = P(X_3 \mid X_2)$, and $P(X_4 \mid X_3, X_2, X_1) = P(X_4 \mid X_3)$, so the probability becomes

$$P(UUDU) = P(X_1 = U) \times P(X_2 = U \mid X_1 = U) \times P(X_3 = D \mid X_2 = U) \times P(X_4 = U \mid X_3 = D) = (.526)(.567)(.433)(.481) = 0.062. \tag{2.10}$$

Again, the numbers here are based on the marginal and conditional distributions tabulated earlier in this section.

#### Which Model Fits Best?

Now we have an embarrassment of riches. We have two probability models for the S&P 500 series. Which one fits best? There is a financial/economic theory called the random walk hypothesis that suggests the independence model should be the right answer. The random walk hypothesis asserts that markets are efficient, so if there were day-to-day dependence in returns, arbitrageurs would enter and remove it. Even so, the Markov chain model has considerable intuitive appeal.

How can we tell which model fits best? One way is to use Bayes’ rule. The thing we’re uncertain about here is which model is the right one. Let’s call the model $M$. The evidence $E$ that we observe is the sequence of up and down days in the S&P 500 data. To use Bayes’ rule we need the prior probabilities $P(M = \text{Markov})$ and $P(M = \text{Indep})$ as well as the likelihoods $P(E \mid M = \text{Markov})$ and $P(E \mid M = \text{Indep})$. Before looking at the data we might have no reason to believe in one model over another, so maybe $P(M = \text{Markov}) = P(M = \text{Indep}) = .50$. This is clearly a subjective judgment, and we need to check its impact on our final analysis, but let’s go with the 50/50 prior for now. The likelihoods are easy enough to compute. We just extend the computations earlier in this section to cover the whole data set.
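Extending the calculation to a long record means multiplying one factor per day, which is most safely done by adding logarithms. The sketch below is illustrative only: the function names are hypothetical, the probabilities are the rounded values quoted in the text, and it is checked on the short patterns worked above because the full data set is not reproduced here.

```python
import math

# Rounded probabilities quoted in the text.
P_FIRST = {"U": 0.526, "D": 0.474}            # marginal distribution
P_NEXT = {"U": {"U": 0.567, "D": 0.433},      # P(X_i | X_{i-1} = U)
          "D": {"U": 0.481, "D": 0.519}}      # P(X_i | X_{i-1} = D)


def loglik_independent(days):
    """Log-likelihood of a U/D sequence when days are independent."""
    return sum(math.log(P_FIRST[d]) for d in days)


def loglik_markov(days):
    """Log-likelihood of a U/D sequence under the Markov chain model."""
    total = math.log(P_FIRST[days[0]])
    for prev, cur in zip(days, days[1:]):
        total += math.log(P_NEXT[prev][cur])
    return total


# Sanity checks against the hand calculations in equations (2.7) and (2.10).
p_uud = math.exp(loglik_independent("UUD"))
p_uudu = math.exp(loglik_markov("UUDU"))
print(round(p_uud, 3), round(p_uudu, 3))
```

Applied to the full sequence of daily labels, the same two functions would return log-likelihoods directly, without ever forming the tiny raw probabilities.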

We end up with the following likelihoods.

| Model | Likelihood |
|---|---|
| Markov | $e^{-9659}$ |
| Independence | $e^{-9713}$ |

The $e$’s show up because we had to compute the likelihood on the log scale for numerical reasons.[^2] Don’t be put off by the fact that the likelihoods are such small numbers. There are a lot of possible outcomes over the next 14,000 days of the stock market. The chance that you will correctly predict all of them simultaneously is very small (like $e^{-9659}$). What you should observe is that the data are $e^{53}$ times more likely under the Markov model than under the independence model. If we plug these numbers into Bayes’ rule we get

$$P(M = \text{Markov} \mid E) = \frac{(.5)(e^{-9659})}{(.5)(e^{-9659}) + (.5)(e^{-9713})} = \frac{1}{1 + e^{-53}} \approx 1.$$

Or, equivalently,

$$P(M = \text{Indep} \mid E) = \frac{(.5)(e^{-9713})}{(.5)(e^{-9659}) + (.5)(e^{-9713})} = \frac{e^{-53}}{1 + e^{-53}} \approx 0.$$

The evidence in favor of the Markov model is overwhelming. With such strong evidence in the likelihood, the prior probabilities that we chose make little difference. For example, if we were strong believers in the random walk hypothesis we might have had a prior belief that $P(M = \text{Indep}) = .999$ and $P(M = \text{Markov}) = .001$. In that case we would wind up with

$$P(M = \text{Markov} \mid E) = \frac{(.001)e^{-9659}}{(.001)e^{-9659} + (.999)e^{-9713}} = \frac{1}{1 + (999)e^{-53}} \approx \frac{1}{1 + e^{-46}} \approx 1.$$

If there is strong evidence in the data (as there is here) then Bayes’ rule forces rational decision makers to converge on the same conclusion even if they begin with very different prior beliefs.

When we say that $P(M = \text{Markov} \mid E) \approx 1$ there is an implicit assumption that the Markov and Independence models are the only ones to be considered. There are other models that fit these data even better than the Markov chain, but given a choice between the Markov chain and the Independence model, the Markov chain is the clear winner.

[^2]: Computers store numbers using a finite number of 0’s and 1’s. When the stored numbers get so small the computer tends to give up and call the answer “0.” This is an easy problem to get around. Just add log probabilities instead of multiplying raw probabilities.
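The Bayes’ rule arithmetic above, and the underflow problem mentioned in the footnote, can be reproduced with a small sketch. The function name `posterior_markov` is hypothetical. The sketch takes the log-likelihood gap of 53 as quoted in the text; the rounded exponents in the table differ by 54, which presumably reflects rounding of the reported log-likelihoods (an assumption).

```python
import math


def posterior_markov(log_bayes_factor, prior_markov):
    """P(M = Markov | E) when only the Markov and Independence models compete.

    log_bayes_factor = log P(E | Markov) - log P(E | Indep).
    """
    # Posterior odds against the Markov model, kept on the log scale
    # until the very last step.
    log_odds_against = (math.log(1.0 - prior_markov)
                        - math.log(prior_markov)
                        - log_bayes_factor)
    return 1.0 / (1.0 + math.exp(log_odds_against))


# The raw likelihood is too small for a float and underflows to 0.0,
# which is why the likelihoods are reported as powers of e.
raw_likelihood = math.exp(-9659)

LOG_BAYES_FACTOR = 53  # gap between the two log-likelihoods, as quoted in the text
even_prior = posterior_markov(LOG_BAYES_FACTOR, 0.5)
skeptical_prior = posterior_markov(LOG_BAYES_FACTOR, 0.001)
print(raw_likelihood, even_prior, skeptical_prior)
```

Both priors lead to a posterior that is indistinguishable from 1 in floating point, which is the convergence described above.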

## 2.3 Expected Value and Variance

Let’s return to the auto sales probability distribution from Section 2.1. It is pretty simple, but Section 2.2 showed that probability distributions can get sufficiently complicated that we may wish to summarize them somehow instead of working with them directly. We said earlier that you can think of a probability distribution as a histogram of a long series of future data. So it makes sense that we might want to summarize a probability distribution using tools similar to those we used to summarize data sets. The most common summaries of probability distributions are their expected value (aka their mean) and their variance.

### 2.3.1 Expected Value

One way we can guess the value that a random variable will assume is to look at its expected value, $E(X)$. Note we sometimes write $E(X)$ as $\mu$. It means exactly the same thing. The expected value of a random variable is its long run average. If you repeated the experiment a large number of times and took the average of all the observations you got, that average would be about $E(X)$. We can calculate $E(X)$ using the formula

$$E(X) = \sum_x x \, P(X = x).$$

Returning to the used car example, we don’t know how many cars we are going to sell tomorrow, but a good guess is

$$E(X) = (0 \times 0.1) + (1 \times 0.2) + (2 \times 0.4) + (3 \times 0.3) = 1.9.$$

Of course $X$ will not be exactly 1.9, so what is “good” about it? Suppose you face the same probability distribution for sales each day, then think about the average number of cars per day you will sell for the next 1000 days. On about 10% of the days you would sell 0 cars, about 20% of the time you would sell 1 car, etc. Add up the total number of cars you expect to sell (roughly 100 0’s, 200 1’s, 400 2’s, and 300 3’s), and divide by 1000. You get 1.9.

The $E()$ operator is seductive[^3] because it takes something that you don’t know, the random variable $X$, and replaces it with a plain old number like 1.9. Thus it is tempting to stick in 1.9 wherever you see $X$ written. Don’t! Remember that 1.9 is only the long run average for the number of cars sold per day, while $X$ is specifically the number you will sell tomorrow (which you won’t know until tomorrow).

The expected value operator has some nice properties that come in handy when dealing with sums (possibly weighted sums) of random variables. If $a$ and $b$ are known constants (weights) and $X$ and $Y$ are random variables then the following rules apply.

• $E(aX + bY) = aE(X) + bE(Y)$. Example: $E(3X + 4Y) = 3E(X) + 4E(Y)$
• $E(aX + b) = aE(X) + b$. Example: $E(3X + 4) = 3E(X) + 4$

We will illustrate these rules a little later in Section 2.3.3.

[^3]: It is the Austin Powers of operators.

### 2.3.2 Variance

Expected value gives us a guess for $X$. But how good is the guess? If $X$ is always very close to $E(X)$ then it will be a good guess, but it is possible that $X$ is often a long way away from $E(X)$. For example $X$ may be an extremely large value half the time and a very small value the rest of the time. In this case the expected value will be half way between but $X$ is always a long way away. We need a measure of how close $X$ is on average to $E(X)$ so we can judge how good our guess is. This is what we use the variance, $Var(X)$, for. $Var(X)$ is often denoted by $\sigma^2$ but they mean the same thing. It is calculated using the formula

$$Var(X) = \sigma^2 = \sum_x (x - \mu)^2 P(X = x) = E[(X - \mu)^2].$$

Remember that $E()$ just says “take the average,” so the variance of a random variable is the average squared deviation from the mean, just like the variance of a data set. If you like, you can think of $\bar{x}$ and $s^2$ as the mean and variance of past data (i.e. data which has already “happened” and is in your data set), and $E(X)$ and $Var(X)$ as the mean and variance of a long series of future data that you would see if you let your data producing process run on for a long time.

To illustrate the variance calculation, the variance of our auto sales random variable is 0.89, calculated as follows.

| $x$ | $P(X = x)$ | $(x - \mu)$ | $(x - \mu)^2$ | $P(X = x)(x - \mu)^2$ |
|---|---|---|---|---|
| 0 | 0.1 | −1.9 | 3.61 | 0.3610 |
| 1 | 0.2 | −0.9 | 0.81 | 0.1620 |
| 2 | 0.4 | 0.1 | 0.01 | 0.0040 |
| 3 | 0.3 | 1.1 | 1.21 | 0.3630 |
| Total | 1 | | | 0.8900 |
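The two formulas above translate directly into code. This is a minimal sketch using the auto sales distribution from the text; the names `SALES_DIST`, `expected_value`, and `variance` are chosen for illustration.

```python
# Auto sales distribution from the text: P(X = x) for x = 0, 1, 2, 3.
SALES_DIST = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}


def expected_value(dist):
    """E(X): the sum over x of x * P(X = x)."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var(X): the sum over x of (x - mu)^2 * P(X = x)."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())


mu = expected_value(SALES_DIST)
sigma2 = variance(SALES_DIST)
print(round(mu, 4), round(sigma2, 4))
```

The terms summed inside `variance` are the entries in the last column of the table above.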

A standard deviation of 0. Remember that X and Y are random variables.94 cars.3.” Just as in Chapter 1. The smaller it is. What does a variance of . EXPECTED VALUE AND VARIANCE 35 Variance has a number of nice theoretical properties. and they are a lot easier to interpret.89 mean? It means that the average squared distance of X from its mean is 0. If SD (X ) is small then when X “happens” it will be close to E (X ). Rules for Variance We mentioned that variance obeys some nice math rules. the −1 gets squared) 4 Section 3.94.2.94 means (roughly) that the typical distance of X from its mean is about 0.2. variance is hard to interpret because it squares the units of the problem. but it is not very easy to interpret. It involves a new wrinkle called covariance that would just be a distraction at this point. so E (X ) is a good guess for X .89 “cars squared. the better guesser you can be. b = 1) • V ar (X − Y ) = V ar (X ) + V ar (Y ) (a = 1. b = −1. Taking the square root of variance restores the natural units of the problem. Note that the following rule applies to V ar ONLY if X and Y are independent. assuming you’ve already calculated V ar (X ).89 = 0.4 That contrasts with the rules for expected value on page 34. and a and b are known constants. Standard deviations are easy to calculate.3 explains what to do if X and Y are dependent. V ar (aX + bY ) = a2 V ar (X ) + b2 V ar (Y ) Here are some examples: • V ar (3X + 4Y ) = 32 V ar (X ) + 42 V ar (Y ) = 9V ar (X ) + 16V ar (Y ) • V ar (X + Y ) = V ar (X ) + V ar (Y ) (a = 1. For example the standard deviation of the above random variable is √ SD (X ) = 0. which apply all the time. Standard Deviation The standard deviation of a random variable is deﬁned as SD (X ) = σ = V ar (X ). Here’s the main one. There is no threshold for SD (X ) to be considered small. (if X and Y are independent) .

Note that there are no rules for manipulating standard deviation. To determine $SD(aX + bY)$ you must first calculate $Var(aX + bY)$ and then take its square root.

### 2.3.3 Adding Random Variables

Expected value and variance are very useful tools when you want to build more complicated random variables out of simpler ones. For example, suppose we know the distribution of daily car sales (from Section 2.1) and that car sales are independent from one day to the next. We’re interested in the distribution of weekly (5 day) sales, $W$. In principle we could list out all the possible values that $W$ could assume (in this case from 0 to 15), then think of all the possible ways that daily sales could happen. For example, to compute $P(W = 6)$ we would have to consider all the different ways that daily sales could total up to 6 (3 on the first day and 3 on the last, one each day except for two on Thursday, and many, MANY more). It is a hard thing to do because there are so many cases to consider, and this is just a simple toy problem!

It is much easier to figure out the mean and variance of $W$ using the rules from the previous two sections. The trick is to write $W = X_1 + X_2 + X_3 + X_4 + X_5$, where $X_2$ is the number of cars sold on day 2, and similarly for the other $X$’s. Then

$$E(W) = E(X_1 + X_2 + X_3 + X_4 + X_5) = E(X_1) + E(X_2) + E(X_3) + E(X_4) + E(X_5) = 1.9 + 1.9 + 1.9 + 1.9 + 1.9 = 9.5$$

and

$$Var(W) = Var(X_1 + X_2 + X_3 + X_4 + X_5) = Var(X_1) + Var(X_2) + Var(X_3) + Var(X_4) + Var(X_5) = .89 + .89 + .89 + .89 + .89 = 4.45,$$

which means that $SD(W) = \sqrt{4.45} = 2.11$. That was MUCH easier than actually figuring out the distribution of $W$ and tabulating $E(W)$ and $Var(W)$ directly.

What if we cared about the weekly sales figure expressed as a daily average instead of the weekly total? Then we just consider $W/5$, with $E(W/5) = 9.5/5 = 1.9$ and $Var(W/5) = 4.45/5^2 = .178$, so $SD(W/5) = \sqrt{.178} = .422$.

Of course we still don’t know the entire distribution of $W$. All we have are these two useful summaries. It would be nice if there were some way to approximate the distribution of $W$ based just on its mean and variance. We’ll get one in Section 2.5, but we have one more big idea to introduce first.
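The case-by-case tabulation that is painful by hand is a short loop on a computer, and it gives a check on the shortcut. The sketch below builds the exact distribution of $W$ by adding one independent day at a time; the names are hypothetical, and independence between days is assumed, as in the text.

```python
# Daily auto sales distribution from the text.
DAILY_SALES = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}


def add_independent(dist_a, dist_b):
    """Distribution of A + B when A and B are independent."""
    out = {}
    for a, pa in dist_a.items():
        for b, pb in dist_b.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out


def mean_and_variance(dist):
    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    return mu, var


# Start from "zero cars with probability 1" and add five independent days.
weekly = {0: 1.0}
for _ in range(5):
    weekly = add_independent(weekly, DAILY_SALES)

mean_w, var_w = mean_and_variance(weekly)
print(round(mean_w, 4), round(var_w, 4))
print(round(weekly[6], 4))  # P(W = 6), summed over every way to total 6
```

The mean and variance computed from the full table agree with the values obtained above from the rules for $E()$ and $Var()$.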

#### Don’t Get Confused! 2.2: The difference between $X_1 + X_2$ and $2X$

In examples like our weekly auto sales problem many people are tempted to write $W = 5X$ instead of $W = X_1 + \cdots + X_5$. The temptation comes from the fact that all five daily sales figures come from the same distribution, so they have the same mean and the same variance. However, they are not the same random variable, because you’re not going to sell exactly the same number of cars each day. The distinction makes a practical difference in the variance formula (among other places). If $Var(X_1) = \sigma^2$, then $Var(5X_1) = 25\sigma^2$, but $Var(X_1 + \cdots + X_5) = 5\sigma^2$. Is this sensible? Absolutely. If you were gambling in Las Vegas, then $X_1 + \cdots + X_5$ might represent your winnings after 5 \$1 bets while $5X$ would represent your winnings after one \$5 bet. The one big bet is much riskier (i.e. has a larger variance) than the five smaller ones.

## 2.4 The Normal Distribution

Thus far we have restricted our attention to random variables that could assume a denumerable set of values. While that is natural for some situations, we will often wish to model continuously varying phenomena such as fluctuations in the stock market. It is impractical to list out all possible values (and corresponding probabilities) of a continuously varying process. Mathematical probability models are usually used instead. There are many probability models out there,[^5] but the most common is the normal distribution, which we met briefly in Chapter 1. If the distribution of $X$ is normal with mean $\mu$ and standard deviation $\sigma$ then we write $X \sim N(\mu, \sigma)$. This expression is read “$X$ is normally distributed with mean $\mu$ and standard deviation $\sigma$.” Section 7.3 describes some non-normal probability models. If you thought $X$ obeyed one of them you would write $E$ or $P$ (or some other letter) instead of $N$. We will use normal probability calculations at various stages throughout the rest of the course.

[^5]: A few are described in Section 7.3.

#### Calculating probabilities for a normal random variable

The normal table is set up[^6] to answer “what is the probability that $X$ is less than $z$ standard deviations above the mean?” So the first thing that must be done when calculating normal probabilities is to change the units of the problem into “standard deviations above the mean.” You do this by subtracting the mean and dividing by the standard deviation. Thus you replace the probability calculation $P(X < x)$ with the equivalent event

$$P\left(\frac{X - \mu}{\sigma} < \frac{x - \mu}{\sigma}\right) = P(Z < z).$$

Subtracting the mean and dividing by the standard deviation transforms $X$ into a standard normal random variable $Z$. The phrase “standard normal” just means that $Z$ has a mean of 0 and standard deviation 1 (i.e. $Z \sim N(0, 1)$). Subtracting the mean and dividing by the standard deviation changes the units of $x$ from whatever they were (e.g. “cell phone calls”) to the units of $z$, which are “standard deviations above the mean.” Subtracting the mean and dividing by the standard deviation is known as z-scoring. Figure 2.5 illustrates the effect that z-scoring has on the normal distribution.

Procedurally, normal probabilities of the form $P(X \le x)$ can be calculated using the following two-step process.

1. Calculate the z score using the formula $z = \dfrac{x - \mu}{\sigma}$. Note that $z$ is the number of standard deviations that $x$ is above $\mu$.
2. Calculate the probability by finding $z$ in the normal table.

For example, if $X \sim N(3, 2)$ (i.e. $\mu = 3$, $\sigma = 2$) and we want to calculate $P(X \le 5)$ then

$$z = \frac{5 - 3}{2} = \frac{2}{2} = 1.$$

In other words we want to know the probability that $X$ is less than one standard deviation above its mean. When we look up 1 in the normal table we see that $P(X < 5) = P(Z < 1) = 0.8413$.[^7] About 84 in every 100 occurrences of a normal random variable will be less than one standard deviation above its mean.

[Figure 2.5: Z-scoring. The left panel depicts $P(X < 5)$. The right panel depicts $P(Z < 1)$, where 1 is the z-score $(5 - 3)/2$. The only difference between the figures is a centering and rescaling of the axes.]

The table tells us $P(Z < z)$, so if we want $P(Z > z)$ we need to rewrite this as $1 - P(Z < z)$, i.e.

$$P(X > 5) = 1 - P(X < 5) = 1 - 0.8413 = 0.1587.$$

Because the normal distribution is symmetric, $P(Z > z) = P(Z < -z)$. Finally, $P(Z < -z) = P(Z > z) = 1 - P(Z < z)$. Suppose $X \sim N(3, 2)$. For example (draw a pair of pictures like Figure 2.5):

$$P(X > 1) = P(Z > -1) = P(Z < 1) = 0.8413.$$

For example:

$$P(X < 1) = P(Z < -1) = 1 - P(Z < 1) = 0.1587.$$

#### Further examples

1. If $X \sim N(5, \sqrt{3})$ what is $P(3 < X < 6)$? This is the same as asking for $P(X < 6) - P(X < 3)$, so we need to calculate two z scores.

$$z_1 = \frac{6 - 5}{\sqrt{3}} = 0.58, \qquad z_2 = \frac{3 - 5}{\sqrt{3}} = -1.15.$$

Therefore (see Figure 2.6(a)),

$$P(X < 6) - P(X < 3) = P(Z < 0.58) - P(Z < -1.15) = P(Z < 0.58) - P(Z > 1.15) = P(Z < 0.58) - (1 - P(Z < 1.15)) = 0.7190 - (1 - 0.8749) = 0.5939.$$

2. If $X \sim N(5, \sqrt{3})$ what is $P(6 < X < 8)$? This is the same as asking for $P(X < 8) - P(X < 6)$, so we need to calculate two z scores.

$$z_1 = \frac{8 - 5}{\sqrt{3}} = 1.73, \qquad z_2 = \frac{6 - 5}{\sqrt{3}} = 0.58.$$

Therefore (see Figure 2.6(b)),

$$P(X < 8) - P(X < 6) = P(Z < 1.73) - P(Z < 0.58) = 0.9582 - 0.7190 = 0.2392.$$

[Figure 2.6: To compute the probability that a normal random variable is in an interval $(a, b)$, compute $P(X < b)$ and subtract off $P(X < a)$. (a) $P(3 < X < 6)$. (b) $P(6 < X < 8)$.]

[^6]: Different books set up normal tables in different ways. You might have seen a normal table organized differently in a previous course, but the basic idea of how the tables are used is standard.

[^7]: We can afford to be sloppy with $<$ and $\le$ here because the probability that a normal random variable is exactly equal to any fixed number is 0.
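The table lookups above can be cross-checked in software. The sketch below uses `statistics.NormalDist` from the Python standard library (version 3.8 or later) in place of a printed normal table; the helper names are hypothetical. Because the software does not round $z$ to two decimals, its answers can differ from the table-based ones in the third decimal place.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(0.0, 1.0)


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above mu."""
    return (x - mu) / sigma


def prob_less_than(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma), via the two-step z-score recipe."""
    return STANDARD_NORMAL.cdf(z_score(x, mu, sigma))


def prob_between(a, b, mu, sigma):
    """P(a < X < b): compute P(X < b) and subtract off P(X < a)."""
    return prob_less_than(b, mu, sigma) - prob_less_than(a, mu, sigma)


p_below_5 = prob_less_than(5, 3, 2)       # example with X ~ N(3, 2)
p_above_5 = 1 - p_below_5                 # complement rule
p_3_to_6 = prob_between(3, 6, 5, 3 ** 0.5)  # further example 1
p_6_to_8 = prob_between(6, 8, 5, 3 ** 0.5)  # further example 2
print(round(p_below_5, 4), round(p_above_5, 4))
print(round(p_3_to_6, 4), round(p_6_to_8, 4))
```

The second parameter is the standard deviation, matching the $N(\mu, \sigma)$ convention used in this section.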

[Figure 2.7: Normal quantile plot for CEO ages.]

#### Checking the Normality Assumption

The normal distribution comes up so often in a statistics class that it is important to remember that not all distributions are “normal.” The most effective way to check whether a distribution is approximately normal is to use a normal quantile plot, otherwise known as a quantile-quantile plot or a Q-Q plot. Recall from Chapter 1 the data set listing the 800 highest paid CEO’s in 1994. One of the variables that looked approximately normally distributed was CEO ages. Figure 2.7 shows the normal quantile plot for that data.

To form the normal quantile plot, the computer orders the 800 CEO’s from smallest to largest. Then it figures out how small you would expect the smallest observation from 800 normal random variables to be. Then it figures out how small you would expect the next smallest observation to be, and so on. The actual data from each CEO is plotted against the data you would expect to see if the variable was normally distributed. If what you do see is about the same as what you would expect to see if the data were normal, then the dots in the normal quantile plot will follow an approximate straight line.

The most interesting things in a normal quantile plot are the dots, although JMP provides some extra assistance in interpreting the plot. The reference line in the plot shows where you would expect the dots to lie. (If both axes were on the same scale, then this line would be the 45 degree line.) The bowed lines on either side of the reference line are guidelines to help you decide how far the dots can stray from the reference line before you can claim a departure from normality. Don’t take one or two observations on the edge of the plot very seriously. If you notice a strong bend in the middle of the plot then that is evidence of non-normality. Examples of non-normal variables from the CEO data set are shown in Figure 2.8. The histograms for these variables appear in Figure 1.5. It is easier to see departures from normality in normal quantile plots than in histograms or boxplots because histograms and boxplots sometimes mask the behavior of the variable in the tails of the distribution.

[Figure 2.8: Examples of normal quantile plots for non-normal data. (a) Skewness: CEO Compensation (top 20 outliers removed). (b) Heavy Tails: Corporate Profits.]

## 2.5 The Central Limit Theorem

The central limit theorem (CLT) explains why the normal distribution comes up as often as it does. The CLT, which we won’t state formally, says that the sum of several random variables has approximately a normal distribution. It is an approximation that gets better as more variables are included in the sum. How many do you need before the CLT kicks in? The answer depends on how close the individual random variables that you’re adding are to being normal themselves. If they’re highly skewed (like CEO compensation) then you might need a lot. Otherwise, once you’ve got around 30 random variables in the sum you can feel pretty comfortable assuming the sum is normally distributed. Recall our interest in the distribution of weekly car sales from Section 2.3.3.

[Figure 2.9: Distribution of weekly car sales, and a normal approximation.]

Figure 2.9 shows the actual distribution of weekly car sales (the histogram) along with the normal curve with the mean and standard deviation that we derived in Section 2.3.3. The fit is not perfect (the weekly sales distribution is skewed slightly to the left), but it is pretty close. The weekly sales distribution is the sum of only 5 random variables. The normal approximation would fit even better to the distribution of monthly sales.

There are many phenomena in life that are the result of several small random components. The CLT explains why you would expect such phenomena to be normally distributed.

There are a few caveats to the CLT which help explain why not every random variable is normally distributed. The random variables being added are supposed to be independent and all come from the same probability distribution. In practice that isn’t such a big deal. The CLT works as long as the dependence between the variables isn’t too strong (i.e. they can’t all be exactly the same number), and the random variables being added are on similar enough scales that one or two of them don’t dominate the rest.

The normal distribution and the central limit theorem have had a huge impact on science. So much so that Germany placed a picture of the normal curve and its inventor, a German mathematician named Gauss, on their 10 Mark bank note (before they switched to the Euro, of course).

[Figure 2.10: The German 10 Mark bank note, showing the normal distribution (faintly, but right in the center) and Carl Friedrich Gauss (1777–1855), who first derived it.]
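The comparison behind Figure 2.9 can be sketched numerically: build the exact distribution of the sales total and measure how far it sits from the matching normal curve. The names below are hypothetical, and comparing each bar with the normal probability of the interval from $w - 0.5$ to $w + 0.5$ is a choice made for this illustration (a continuity correction), not something stated in the text.

```python
from statistics import NormalDist

# Daily auto sales distribution from the text.
DAILY_SALES = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}


def add_independent(dist_a, dist_b):
    """Distribution of A + B when A and B are independent."""
    out = {}
    for a, pa in dist_a.items():
        for b, pb in dist_b.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out


def total_sales(n_days):
    """Exact distribution of total sales over n_days independent days."""
    total = {0: 1.0}
    for _ in range(n_days):
        total = add_independent(total, DAILY_SALES)
    return total


def largest_gap_from_normal(n_days):
    """Largest absolute gap between an exact bar and its normal counterpart."""
    exact = total_sales(n_days)
    mu = sum(w * p for w, p in exact.items())
    sigma = sum((w - mu) ** 2 * p for w, p in exact.items()) ** 0.5
    approx = NormalDist(mu, sigma)
    # Normal probability assigned to the unit-wide interval around each w.
    return max(abs(p - (approx.cdf(w + 0.5) - approx.cdf(w - 0.5)))
               for w, p in exact.items())


gap_week = largest_gap_from_normal(5)    # weekly sales
gap_month = largest_gap_from_normal(20)  # roughly a month of selling days
print(gap_week, gap_month)
```

The gap for the 20-day total comes out smaller than the gap for the 5-day total, consistent with the remark that the normal approximation would fit monthly sales even better.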

# Chapter 3 Probability Applications

For many students who are learning about probability for the first time, the subject seems abstract and somehow divorced from the “real world.” Nothing could be further from the truth. Probability models play a central role in several business disciplines, including finance, marketing, operations, and economics. This Chapter focuses on applying the probability rules learned in Chapter 2 to problems faced in these disciplines. Obviously we won’t be able to go very deep into each area. Otherwise we won’t have time to learn what probability can tell us about statistics and data analysis, which is a central goal of this course. Our goal in this Chapter is to present a few fundamental problems from basic business disciplines and to see how these problems can be addressed using probability models. In the process we will learn more about probability.

## 3.1 Market Segmentation and Decision Analysis

One of the basic tools in marketing is to identify market segments containing similar groups of potential customers. If you know the defining characteristics of a market segment then (hopefully) you can tailor a marketing strategy to each segment and do better than you could by applying the same strategy to the whole market.

Market segmentation is a good illustration of decision theory, or using probability to help make decisions under uncertain circumstances. The uncertainty comes from the fact that you don’t know with absolute certainty the market segment to which each of your potential customers should belong. However, it is possible to assign each potential customer a distribution describing the probability of segment membership. Presumably this distribution depends on observable characteristics such as age, credit rating, etc. Decision theory is about translating each probability distribution into an action.

To continue with the market segmentation example, you may have developed two marketing strategies for the new gadget your firm has developed. Strategy E targets “Early Adopters,” and Strategy F targets “Followers.” Which approach should you apply to Joe, who is 32 years old, makes between \$60K-80K per year, and owns an iPod?

### 3.1.1 Decision Analysis

Decision analysis is about making a trade off between the cost of making a bad decision and the probability of making a good one. For example, suppose you’ve determined that there is a 20% chance that Joe is an Early Adopter, and an 80% chance that he is a Follower. Joe’s age, income, and iPod ownership were presumably used to arrive at these probabilities. Section 3.1.2 explores segment membership probabilities in greater detail. Intuitively it looks like you should treat Joe as a Follower, because that is the segment he is most likely to belong to. However, if Early Adopters are more valuable customers than Followers, it might make sense to treat Joe as belonging to something other than his most likely category.

#### Actions and Rewards

You can apply one of two strategies to Joe, and Joe can be in one of two categories. (More strategies and categories are possible, but let’s stick to two for now.) In order to use decision analysis you need to know how valuable Joe will be to you under all four combinations. This information is often summarized in a reward matrix. A reward matrix explains what will happen if you choose a particular action when the world happens to be in a given state. For example, suppose each Early Adopter you discover is worth \$1000, and each Follower is worth \$100. Early Adopters are worth more because Followers will eventually copy them. However, to get these rewards you have to treat each group appropriately. If you treat an Early Adopter like a Follower then he may think your product isn’t sufficiently “cutting edge” to warrant his attention. If you treat a Follower like an Early Adopter then he may decide your technology is too complicated. Thus you may have the following reward matrix.

| | Strategy E | Strategy F |
|---|---|---|
| Early Adopter | 1000 | 150 |
| Follower | 10 | 100 |

These are high stakes in that you lose about 90% of the customer’s potential value if you choose the wrong strategy.

Risk Profile

The mechanics of decision theory involve combining the information in the reward matrix with a probability distribution about which state is correct. The result is a risk profile, a set of probability distributions describing the reward that you will experience by taking each action. You have two options with Joe, so the risk profile here involves two probability distributions. Recall that Joe is 80% likely to be a Follower, so if you treat him as an Early Adopter you have an 80% chance of making only \$10, but a 20% chance of making \$1000. If you treat him as a Follower you have an 80% chance of making \$100, but a 20% chance of making \$150. Thus, the risk profile you face is as follows.

| Reward | \$10 | \$100 | \$150 | \$1000 |
|---|---|---|---|---|
| Strategy E | .8 | 0 | 0 | .2 |
| Strategy F | 0 | .8 | .2 | 0 |

Choosing an Action

Once a risk profile is computed, your decision simply boils down to deciding which probability distribution you find most favorable. When decisions involve only a moderate amount of money, the most reasonable way to distinguish among the distributions in your risk profile is by their expected value. The expected return under Strategy E is (.8)(10) + (.2)(1000) = \$208. The expected return under Strategy F is (.8)(100) + (.2)(150) = \$110. So Joe should be treated as an Early Adopter, even though it is much more likely that he is a Follower.

Expected value is a good summary here because you are planning to market to a large population of people like Joe, about 20% of whom are Early Adopters. As you repeat the experiment of advertising to customers like Joe over a large population your average reward per customer will settle down close to the long run average, or expected value.

If a decision involves substantially greater sums of money, such as deciding whether your firm should merge with a large competitor, then expected value should not be the only decision criterion. For example, if the stakes above were changed from dollars to billions of dollars, representing the market value of the firm after the action is taken, then many people will find a guaranteed market value of \$100-150 billion preferable to the high chance of \$10 billion, even if there is a chance of a “home run” turning the company into a trillion dollar venture.

Is this realistic?

As with probability models, decision analysis is as realistic as its inputs. For decision analysis to produce good decisions you need realistic reward information and a believable probability model describing the states of the unknown variables.
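The expected-value comparison for Strategies E and F can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the text: the dictionary layout and the names `state_probs`, `rewards`, and `expected_reward` are assumptions made for the example, while the probabilities and dollar rewards are the ones given for Joe.

```python
# Probability that Joe belongs to each segment (from the text).
state_probs = {"Follower": 0.8, "Early Adopter": 0.2}

# Reward matrix in dollars: rewards[action][true segment] (from the text).
rewards = {
    "E": {"Follower": 10, "Early Adopter": 1000},   # treat Joe as an Early Adopter
    "F": {"Follower": 100, "Early Adopter": 150},   # treat Joe as a Follower
}


def expected_reward(action_rewards, probs):
    """Average the rewards of one action over the segment probabilities."""
    return sum(probs[state] * action_rewards[state] for state in probs)


# Expected reward of each action, and the action with the largest one.
expected = {action: expected_reward(r, state_probs) for action, r in rewards.items()}
best_action = max(expected, key=expected.get)

print(expected, best_action)
```

Choosing by `max` encodes the expected-value criterion only; it would not be appropriate for the high-stakes decisions discussed above, where the whole risk profile matters.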

The entries in the reward matrix often make decision analysis seem artificial. It is certainly not believable that every Early Adopter will be worth exactly \$1,000, for example. However, experts in marketing science have reasonably sophisticated probability models that they can employ to model things like the amount and probability of a customer’s purchase under different sets of conditions. Expected values from these types of models can be used to fill out a reward matrix with numbers that have some scientific validity. If actions are chosen based on the highest expected reward then it makes sense for the entries of the reward matrix to be expected values, because it is “legal” to take averages of averages. The details of the models are too complex to discuss in an introductory class, but they can be found in Marketing elective courses.

Decisions can also involve choosing more than one action. In fact, there is an entire discipline devoted to the theory of complex decisions that expands on the basic principles outlined above. Interested students can learn more in Operations Management elective courses.

3.1.2 Building and Using Market Segmentation Models

Although we will say nothing further about the models used to fill out the reward matrix, we can introduce a bit more realism about the probability of customers belonging to a particular market segment. One way to determine segment membership is simply to ask people if they engage in a particular activity. Collect this type of information from a subset of the market you want to learn about, and also collect information that you can actually observe for individuals in the broader market.

For example,¹ suppose the producers of the television show Nightline want to market their show to different demographic groups. They have an advertising campaign designed to reinforce the opinions of viewers who already watch Nightline, and another to introduce the show to viewers who do not watch it. The producers take a random sample of 75 viewers and get the results shown below.

| Level | Number | Mean | Std Dev |
|---|---|---|---|
| No | 50 | 34.9600 | 5.8238 |
| Yes | 25 | 57.9200 | 11.1502 |

Assuming the broader population looks about like the sample (an assumption we will examine more closely in Chapter 4), we might assume that 66.66% of the population are non-watchers with ages that are distributed approximately N(35, 5.8), and the remaining 33.33% of the population are watchers with ages distributed approximately N(58, 11). That is, we could simply fit normal models to each observed segment.

¹ This example is modified from Albright et al. (2004).

(We could fit other models too, if we knew any and thought they might fit better.)

Now suppose we have information (i.e. age) about a potential subject in front of whom we could place one of the two ads. The age of the potential viewer in question is 42, which is between the mean ages of watchers and non-watchers. What is the chance that he is a Nightline watcher?

This is clearly a job for Bayes’ rule. We have a prior probability of .33 and we need to update this probability to reflect the fact that we know this person to be 42 years old. Let W denote the event that the person is a watcher, and let A denote the person’s age. Bayes’ rule says $P(W \mid A = 42) \propto P(W)\,P(A = 42 \mid W)$ and $P(notW \mid A = 42) \propto P(notW)\,P(A = 42 \mid notW)$. Clearly P(W) = .33 and P(notW) = .66.

To get P(A = 42|W) we have two choices. We could use the techniques described in Section 2.4 to compute P(A < 43) − P(A < 42) for a normal distribution with µ = 58 and σ = 11 (or µ = 35 and σ = 5.8 for notW). We could also approximate this quantity by the height of the normal curve evaluated at age = 42. (See page 185 for an excel command to do this.) The two methods give very similar answers, but the second is more typical. We get P(A = 42|W) = 0.0126 and P(A = 42|notW) = 0.0332. Now Bayes’ rule is straightforward:

| segment | prior | likelihood | prior*like | posterior |
|---|---|---|---|---|
| W | .33 | .0126 | 0.004158 | 0.16 |
| notW | .66 | .0332 | 0.021912 | 0.84 |
| Total | | | 0.02607 | |

Thus there is a 16% chance that the 42 year old subject is a Nightline watcher. It is easy enough to program a computer to do the preceding calculation for several ages and plot the results, which are shown in Figure 3.1. We see that the probability of viewership increases dramatically as Age moves from 40 to 50. Clearly, older viewers tend to watch the program more than younger viewers.

More Complex Settings

The approach outlined above is very flexible. It can obviously be extended to any number of market segments by simply fitting a different model to each segment. It can incorporate several observable characteristics (such as age, income, and geographic region) by developing a joint probability model for the observed characteristics for each market segment. The multivariate normal distribution (which we will not discuss) is a common choice.
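The Bayes’ rule calculation for the Nightline example can be programmed for any age. The sketch below uses the second method from the text (the height of the normal curve) with the segment models N(58, 11) and N(35, 5.8) and the priors .33 and .66. The function names `normal_pdf` and `posterior_watcher` are assumptions made for this illustration, not part of the course materials.

```python
import math


def normal_pdf(x, mean, sd):
    """Height of the normal curve with the given mean and standard deviation at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_watcher(age, prior_w=0.33, prior_not=0.66):
    """P(W | A = age) from prior * likelihood, normalized over the two segments."""
    like_w = normal_pdf(age, 58.0, 11.0)    # watchers: N(58, 11)
    like_not = normal_pdf(age, 35.0, 5.8)   # non-watchers: N(35, 5.8)
    num_w = prior_w * like_w
    num_not = prior_not * like_not
    return num_w / (num_w + num_not)


# Repeat the calculation for several ages, as was done to draw the curve
# of P(W | A) against Age.
for age in (30, 40, 42, 50, 60):
    print(age, round(posterior_watcher(age), 3))
```

Because the two priors are normalized together with the likelihoods, it does not matter that .33 and .66 sum to .99 rather than 1.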

Figure 3.1: (a) Age distributions for Nightline watchers and non-watchers. (b) P(W|A): the probability of watching Nightline, given Age.

3.2 Covariance, Correlation, and Portfolio Theory

Investors like to get the most return they can with the least amount of risk. It is generally accepted that a well diversified investment portfolio lowers an investor’s risk. However there is more to portfolio diversification than simply purchasing multiple stocks. For example, a portfolio consisting only of shares from firms in the same industry cannot be considered diversified. This section seeks to calculate the amount of additional risk incurred by investors who own shares of closely related financial instruments.

3.2.1 Covariance

The covariance between two variables X and Y is defined as

$$Cov(X, Y) = E\big((X - E(X))(Y - E(Y))\big).$$

For context, imagine that X is the return on an investment in one stock, and Y the return on another. If the joint distribution of X and Y is unavailable (as is often the case in practice) the covariance can be estimated from a sample of n pairs $(x_i, y_i)$ using the formula

$$Cov(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$

If Cov(X, Y) is positive we say that X and Y have a positive relationship. This means that when X is above its average then Y tends to be above its average as well. Note that this does not guarantee that any particular Y will be large or small, only that there is a general tendency towards big Y’s being associated with big X’s. If Cov(X, Y) is negative we say that X and Y have a negative relationship. This means that as X increases Y tends to decrease. If X and Y are independent then Cov(X, Y) = 0. Note, however, that a covariance of zero does not necessarily imply that X and Y are independent. A covariance of zero means that there is no linear relationship between X and Y, but there could be a nonlinear relationship (e.g. a quadratic relationship).

One of the most common uses of covariance occurs when calculating the variance of a sum of random variables. Recall that if X and Y are independent then $Var(aX + bY) = a^2 Var(X) + b^2 Var(Y)$. If X and Y are not independent then we can still perform the calculation using Cov(X, Y):

$$Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab\,Cov(X, Y).$$

Notice that if X and Y are independent, then Cov(X, Y) = 0, so the general formula contains the simple formula as a special case. The good news is that you don’t have to remember two different formulas. The bad news is that the formula you have to remember is the more complicated of the two. The sidebar on page 54 contains a trick to help you remember the general variance formula, even if there are more than two random variables involved.

The best way to look at covariances for more than two variables at a time is to put them in a covariance matrix like the one in Table 3.1. The diagonal elements in a covariance matrix are the variances of the individual variables (in this case the variances of monthly stock returns). The off-diagonal elements are the covariances between the variables representing each row and column. A covariance matrix is symmetric about its diagonal because Cov(X, Y) = Cov(Y, X).

3.2.2 Measuring the Risk Penalty for Non-Diversified Investments

Suppose you invest in a stock portfolio by placing $w_1 = 2/3$ of your money in Sears stock and $w_2 = 1/3$ of your money in Penney stock. What is the variance of your stock portfolio? Let S represent the return from Sears and P the return from Penney. The total return on your portfolio is $T = w_1 S + w_2 P$. The variance formula says (using numbers from Table 3.1)

$$\begin{aligned} Var(T) &= w_1^2 Var(S) + w_2^2 Var(P) + 2 w_1 w_2 Cov(S, P) \\ &= (2/3)^2 (0.00554) + (1/3)^2 (0.00510) + 2(2/3)(1/3)(0.00335) = 0.00452. \end{aligned}$$

So $SD(T) = \sqrt{0.00452} = 0.067$. If Cov(S, P) = 0 then you wouldn’t have had to add the factor of 2(2/3)(1/3)(0.00335), which you can think of as a “covariance penalty” for investing in two stocks in the same industry. If Sears and Penney had been uncorrelated, then you would have had Var(T) = 0.00303, or SD(T) = 0.055. Notice that the variance of your stock portfolio is less than the variance of either individual stock, but it is more than it would be if Sears and Penney had been uncorrelated.

What if you wanted to “short sell” Penney in order to buy more shares of Sears? In other words, suppose your portfolio weights were $w_1 = 1.5$ and $w_2 = -0.5$. Then

$$\begin{aligned} Var(T) &= w_1^2 Var(S) + w_2^2 Var(P) + 2 w_1 w_2 Cov(S, P) \\ &= (1.5)^2 (0.00554) + (-.5)^2 (0.00510) + 2(1.5)(-.5)(0.00335) = 0.008715. \end{aligned}$$

So SD(T) = 0.093.

The formula can be extended to as many shares as you like (see the sidebar on page 54). It can be shown that for a portfolio with a large number of shares the factor that determines risk is not the individual variances but the covariance between shares. If all shares had zero covariance the portfolio would also have a variance that approaches zero. Thus covariance explains why it is better to invest in a diversified stock portfolio.

3.2.3 Correlation, Industry Clusters, and Time Series

It is hard to use covariances to measure the strength of the relationship between two variables because covariances depend on the scale on which the variables are measured. If we change the units of the variables we will also change the covariance. This means that we have no idea what a “large covariance” is. Therefore, if we want to measure how strong the relationship between two variables is, we use correlation rather than covariance. The correlation between two variables X and Y is defined as

$$Corr(X, Y) = \frac{Cov(X, Y)}{SD(X)\,SD(Y)}.$$
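The portfolio calculations and the correlation definition can be checked numerically. The sketch below uses only the three numbers quoted in the text for Sears and Penney (variances 0.00554 and 0.00510, covariance 0.00335). The function name `portfolio_variance` and the variable names are assumptions made for the illustration.

```python
import math

# Values quoted in the text for monthly returns.
var_s = 0.00554    # Var(S), Sears
var_p = 0.00510    # Var(P), Penney
cov_sp = 0.00335   # Cov(S, P)


def portfolio_variance(w1, w2, var1, var2, cov12):
    """Var(w1*X + w2*Y) = w1^2 Var(X) + w2^2 Var(Y) + 2 w1 w2 Cov(X, Y)."""
    return w1 ** 2 * var1 + w2 ** 2 * var2 + 2 * w1 * w2 * cov12


# 2/3 in Sears and 1/3 in Penney.
v_long = portfolio_variance(2 / 3, 1 / 3, var_s, var_p, cov_sp)
# Same weights if the two stocks had zero covariance (no "covariance penalty").
v_uncorr = portfolio_variance(2 / 3, 1 / 3, var_s, var_p, 0.0)
# Short selling Penney to buy more Sears.
v_short = portfolio_variance(1.5, -0.5, var_s, var_p, cov_sp)

# Correlation = covariance divided by the product of standard deviations.
corr_sp = cov_sp / (math.sqrt(var_s) * math.sqrt(var_p))

for label, v in (("long", v_long), ("uncorrelated", v_uncorr), ("short", v_short)):
    print(label, round(v, 6), round(math.sqrt(v), 3))
print("corr", round(corr_sp, 2))
```

Note that the short-selling portfolio has a larger variance than either stock alone, even though the covariance term enters with a negative sign, because the squared weight on Sears is 2.25.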

Table 3.1: Correlation and covariance matrices for the monthly returns of eight stocks in three different industries: (a) covariance matrix, (b) correlation matrix. The rows and columns of both matrices are Sears, K-Mart, Penney, Exxon, Amoco, Imp_Oil, Delta, and United.

Correlation is usually denoted with a lower case r or the Greek letter ρ (“rho”). Correlation has a number of very nice properties.

• Correlation is always between −1 and 1.
• A correlation near 1 indicates a strong positive linear relationship.
• A correlation near −1 indicates a strong negative linear relationship.
• Just as for covariance a correlation of zero indicates no linear relationship.
• Correlation is a “unitless measure.” In other words whatever units (feet, inches, miles) we measure X and Y in we get the same correlation (but a different covariance).

Correlations are often placed in a matrix just like covariances. One of the things you can see from the correlation matrix in Table 3.1(b) is that stocks in the same industry are highly correlated with one another, relative to stocks in different industries. For example, the correlation between Sears and Penney is .63, while the correlation between Penney and Imperial Oil is .04. The correlation between Sears and the oil stocks is higher, presumably because Sears has an automotive division and Penney does not. These same relationships are present in the covariance matrix, but they are harder to see because some stocks are more variable than others.

We’ve established what it means for a correlation to be 1 or −1, but what does a correlation of .6 mean? We’re going to have to put that off until we learn about something called $R^2$ in Chapter 5. In the meantime, Figure 3.2 shows the correlations associated with a few scatterplots so that you can get an idea of what a “strong” correlation looks like.

Correlation and Scatterplots

We can get more information from a scatterplot of X and Y than from calculating the correlation. So why bother calculating the correlation? In fact one should always plot the data. However, the correlation provides a quick idea of the relationship.

Don’t Get Confused! 3.1 A general formula for the variance of a linear combination

The formula for the variance of a portfolio which is composed of several securities is as follows:

$$Var\left(\sum_{i=1}^{n} w_i X_i\right) = \sum_{i=1}^{n} \sum_{j=1}^{n} w_i w_j\, Cov(X_i, X_j).$$

This formula isn’t as confusing as it looks. What it says is to write down all the covariances in a big matrix. Multiply each covariance by the product of the relevant portfolio weights, and add up all the answers.

| Portfolio Weights | $w_1$ | $w_2$ | $w_3$ |
|---|---|---|---|
| $w_1$ | $V_1$ | $C_{12}$ | $C_{13}$ |
| $w_2$ | $C_{21}$ | $V_2$ | $C_{23}$ |
| $w_3$ | $C_{31}$ | $C_{32}$ | $V_3$ |

(The body of the table is the covariance matrix.)

If you think about the variance formula using this picture it should (among other things) help you remember the formula for Var(X − Y) when X and Y are correlated. If there are more than two securities then you would probably want to use a computer to evaluate this formula.
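The double sum in the sidebar “A general formula for the variance of a linear combination” is easy to hand to a computer. The sketch below is one possible way to write it, using plain Python lists. The function name `linear_combination_variance` is an assumption for the illustration, and the 2 × 2 matrix holds only the Sears and Penney values quoted in Section 3.2.2.

```python
def linear_combination_variance(weights, cov):
    """Sum of w_i * w_j * Cov(X_i, X_j) over every cell of the covariance matrix."""
    n = len(weights)
    return sum(
        weights[i] * weights[j] * cov[i][j]
        for i in range(n)
        for j in range(n)
    )


# Covariance matrix for (Sears, Penney): variances on the diagonal,
# the covariance in both off-diagonal cells.
cov = [
    [0.00554, 0.00335],
    [0.00335, 0.00510],
]

# The 2/3, 1/3 portfolio from Section 3.2.2.
var_portfolio = linear_combination_variance([2 / 3, 1 / 3], cov)

# Weights (1, -1) give Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y).
var_difference = linear_combination_variance([1, -1], cov)

print(round(var_portfolio, 5), round(var_difference, 5))
```

The same function works unchanged for three or more securities, provided the covariance matrix is square and its rows and columns are in the same order as the weights.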

Figure 3.2: Some plots and their correlations. Each panel is a scatterplot of Y against X: (a) r = 1, (b) r = 1, (c) r = 0, (d) r = 0, (e) r = 0, (f) r = −1, (g) r = .76, (h) r = .54, (i) r = .24.


Don’t Get Confused! 3.2 Correlation vs. Covariance

Correlation and covariance both measure the strength of the linear relationship between two variables. The only difference is that covariance depends on the units of the problem, so a covariance can be any real number. Correlation does not depend on the units of the problem. All correlations are between −1 and 1.

This is especially useful when we have a large number of variables and are trying to understand all the pairwise relationships. One way to do this is to produce a scatterplot matrix where all the pairwise scatterplots are produced on one page. However, a plot like this starts to get too complex to absorb easily once you include about 7 variables. On the other hand a table of correlations can be read easily with many more variables. One can rapidly scan through the table to get a feel for the relationships between variables. Because correlations take up less space than scatterplots, you can include a correlation matrix in a report to give an idea of the relationship. By contrast, including a similar number of scatterplots might overwhelm your readers.

Autocorrelation

Correlation also plays an important role in the study of time series. Recall that in Section 2.2.3 we determined that there was “memory” in the S&P 500 data series, meaning that the Markov model was clearly preferred to a model that assumed up and down days occur independently. Correlation allows us to measure the strength of the day-to-day relationship. How? Correlation measures the strength of the relationship between different variables, but the time series of returns occupies only one column in the data set.

The answer is to introduce “lag” variables. A lag variable is a time series that has been shifted up one row in the data set, as illustrated in Figure 3.3(b). The lag variable at time t has the same value as the original series at time t − 1. Thus the lag variable represents what happened one time period ago. Thus the correlation between the original series and the lag variable can be interpreted as the correlation in the series from one time period to the next. The name autocorrelation emphasizes that the correlation is between present and past values of the same time series, rather than between totally distinct variables.

Notice that one could just as easily shift the series by any number k rows to compute the correlation between the current time period and k time periods ago. A graph of autocorrelations at the first several lags is called the autocorrelation function. Figure 3.3(a) shows the autocorrelation function for the S&P 500 data series.
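The lag-variable idea can be sketched directly: pair each value with the value k periods earlier and compute an ordinary correlation between the two columns. The function names `correlation` and `autocorrelation` and the short `returns` list are made up for this illustration; they are not the S&P 500 data, and statistical packages may use a slightly different normalization for autocorrelations.

```python
import math


def correlation(x, y):
    """Sample correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def autocorrelation(series, k=1):
    """Correlation between the series and its lag-k variable."""
    current = series[k:]    # values at times k, k+1, ...
    lagged = series[:-k]    # values k periods earlier, aligned with `current`
    return correlation(current, lagged)


# Hypothetical daily returns, for illustration only.
returns = [0.004, -0.002, 0.001, 0.006, -0.003, -0.001, 0.002, 0.005, -0.004, 0.000]

# The autocorrelation function is the autocorrelation at the first several lags.
acf = [autocorrelation(returns, k) for k in range(1, 4)]
print([round(r, 3) for r in acf])
```

Dropping the first k rows of the original series and the last k rows of the lagged copy is what keeps the two columns the same length.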


The lag 1 autocorrelation is only .08, which is not very large (relative to the correlations in Figure 3.2, anyway). Thus, while we can be sure that there is some memory present in the time series, the autocorrelations say that the memory is weak.

Figure 3.3: (a) Autocorrelations for the S&P 500 data series. October 19, 1987 has been excluded from the calculation. (b) Illustration of the “lag” variable used to compute the autocorrelations.

3.3 Stock Market Volatility

One of the features of ﬁnancial time series data is that market returns tend to go through periods of high and low volatility. For example, consider Figure 3.4, which plots the daily return for the S&P 500 market index from January 3, 1950 to October 21, 2005.2 Notice that the overall level of the returns is extremely stable, but that there are some time periods when the “wiggles” in the plot are more violent than others. For example, 1995 seems to be a period of low volatility, while 2001 is a period of high volatility. After the fact it is rather clear which time periods belong to “high” and “low” volatility states, but it can be hard to tell whether or not a transition is occurring when the process is observed in real time. Clearly there is a ﬁrst-mover advantage to be had for analysts that can correctly identify the transition. For example, if an analyst is certain the market is entering a high volatility period then
the analyst can move his customers into less risky positions (e.g. more bonds and fewer stocks).

Figure 3.4: Daily returns for the S&P 500 market index. The vertical axis excludes a few outliers (notably 10/19/1987) that obscure the pattern evident in the remainder of the data.

Notice that the analyst is in a sticky situation. He needs to react quickly to volatility changes to serve his customers well, but if he reacts to every “blip” in the market then his clients will become annoyed with him, perhaps thinking he is making pointless trades with their money to collect commissions.

If an analyst has “stock” and “bond” strategies designed to be used during low and high volatility periods he could presumably write down the expected returns under each strategy in a reward matrix. Then the analyst just needs to know when the market’s volatility state changes. One way to model data like Figure 3.4 is to assume that the (unobserved) high/low volatility state follows a Markov chain, and that data from high and low volatility states follow different normal distributions. Because we don’t get to see which days belong to which states, estimating the parameters of a model like this is hard, and requires special software. However, suppose the transition probabilities for the Markov chain and parameters for the normal distributions were estimated to be

| Yesterday \ Today | Lo | Hi |
|---|---|---|
| Lo | .997 | .003 |
| Hi | .005 | .995 |

| State | Mean | SD |
|---|---|---|
| Lo | 0.0 | .007 |
| Hi | 0.0 | .020 |

The transition probabilities suggest that high and low volatility states persist for long periods of time. That is, if today is in a low volatility state then the probability that tomorrow is low volatility is very high (.997), and similarly for high volatility.

Each day the analyst can update the probability that the market is in a high volatility state based on that day’s market return. Denote the volatility state at time t by $S_t$. Suppose that yesterday’s (i.e. time t − 1) probability was $P(S_{t-1} = Hi) = .001$ and that today’s market return was $R_t = .02$. How do we compute the probability for today, $P(S_t \mid R_t)$? This is a Bayes’ rule problem. Yesterday we had a probability distribution describing what would happen today. We need to update that probability distribution based on new information (today’s return).

We can get the marginal distribution for $S_t$ because we have a marginal distribution for $S_{t-1}$ and conditional distributions for $P(S_t \mid S_{t-1})$. That means the joint distribution is

| | $S_t$ = Lo | $S_t$ = Hi |
|---|---|---|
| $S_{t-1}$ = Lo | (.999)(.997) = 0.996003 | (.999)(.003) = 0.002997 |
| $S_{t-1}$ = Hi | (.001)(.005) = 0.000005 | (.001)(.995) = 0.000995 |

Thus $P(S_t = Lo) = 0.996003 + .000005 = .996$ and $P(S_t = Hi) = .004$.

Figure 3.5: Distribution of returns under the low and high volatility states. The dotted vertical line shows today’s data.

To update these probabilities based on today’s return $R_t = .02$ we need to compute $P(R_t = .02)$ using the height of the normal curves at .02 for each volatility state (see Figure 3.5). We use a computer to compute the height of the normal curves,³ and plug them into Bayes’ rule as follows:

| State | prior | likelihood | pri*like | post |
|---|---|---|---|---|
| Low | 0.996 | 0.9 | 0.896 | 0.949 |
| High | 0.004 | 12.1 | 0.048 | 0.051 |
| Total | | | 0.944 | |

Notice what happened. The analyst saw data that looked 12 times more likely to have come from the high volatility state than the low volatility state. But because he knew that yesterday was low volatility, and that volatility states are persistent, he regarded the new evidence skeptically and still believes that it is much more likely that today is a low volatility state than a high one.

³ If you feel uncomfortable doing this you can instead compute the probability that $R_t$ is in a small interval around .02, say (.019, .021). You get different likelihoods, but the ratio between them will be approximately 12:1, so you will get about the same answers out of Bayes’ rule.
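The daily update described in this section has two steps: push yesterday’s state probabilities through the transition probabilities to get today’s prior, then apply Bayes’ rule using the height of each state’s normal curve at today’s return. The sketch below uses the transition probabilities and normal parameters estimated in the text. The names `TRANSITION`, `PARAMS`, `normal_pdf`, and `update` are assumptions for the illustration. Because the code keeps unrounded likelihoods, its posterior is close to, but not exactly, the rounded table values.

```python
import math

# P(today's state | yesterday's state), from the text.
TRANSITION = {
    "Lo": {"Lo": 0.997, "Hi": 0.003},
    "Hi": {"Lo": 0.005, "Hi": 0.995},
}
# (mean, SD) of daily returns in each volatility state, from the text.
PARAMS = {"Lo": (0.0, 0.007), "Hi": (0.0, 0.020)}


def normal_pdf(x, mean, sd):
    """Height of the normal curve at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def update(yesterday, todays_return):
    """One day's update: marginalize over yesterday's state, then apply Bayes' rule."""
    # Prior for today: sum the joint distribution over yesterday's state.
    prior = {
        s: sum(yesterday[y] * TRANSITION[y][s] for y in yesterday)
        for s in PARAMS
    }
    # Prior times likelihood, then normalize so the posterior sums to one.
    unnormalized = {s: prior[s] * normal_pdf(todays_return, *PARAMS[s]) for s in prior}
    total = sum(unnormalized.values())
    return {s: unnormalized[s] / total for s in unnormalized}


posterior = update({"Lo": 0.999, "Hi": 0.001}, 0.02)
print({s: round(p, 3) for s, p in posterior.items()})
```

Feeding each day’s posterior back in as the next day’s `yesterday` argument would repeat the update along the whole series of returns.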

Chapter 4 Principles of Statistical Inference: Estimation and Testing

This Chapter introduces some of the basic ideas used to infer characteristics of the population or process that produced your data based on the limited information in your dataset. The main objective of estimation is to quantify how confident you can be in your estimate using something called a confidence interval. The main goal of hypothesis testing is to determine whether patterns you see in the data are strong enough so that we can be sure they are not just random chance.

The material on probability covered in Chapter 2 plays a very important role in estimation and testing. In particular, the Central Limit Theorem from Section 2.5, which describes how averages behave, plays a very important role because many of the quantities we wish to estimate and theories we wish to test involve averages.

4.1 Populations and Samples

A population is a large collection of individuals, entities, or objects that we would like to study. For example, we might be interested in some feature describing the population of publicly held corporations, or the population of customers who have purchased your product, or the population of red blood cells in your body. Most of the time it will be impossible, or at least highly impractical, to take the measurements we would like for every member of the population. For example, the population may be effectively infinite, such as in quality control problems where the population is all the future goods your production process will ever produce.

A sample is a subset of a population. Not all samples are created equally. In this Chapter and in most that follow we will assume that the sample is a simple random sample. Think of a simple random sample¹ as if the observations were drawn randomly out of a hat. An optional Section (7.4) in the “Further Topics” Chapter discusses the key ingredients of a good sampling scheme and what can go wrong if you have a bad one.

Figure 4.1: A sample is a subset of a population. Sample statistics are used to estimate population parameters.

In Chapter 1 we learned that even if we had the whole population in front of us we would have to summarize it somehow. We can use the summaries from the sample to estimate the population summaries, but we will need some way to distinguish the two types of summaries in our discussions. Population summaries are known as parameters and are denoted with Greek letters (see Appendix C). Sample summaries are known as statistics. It is customary to denote a population mean by µ and a population standard deviation by σ. We have already seen the established notation $\bar{x}$ and s for the sample mean and standard deviation.

By themselves, sample statistics are of little interest because the sample is just a small fraction of the population. However, the magic of statistics (the field of study) is that statistics (the numbers) can tell us something about the population parameters we really care about.

¹ There is a mathematical definition of simple random sampling which involves something called the hypergeometric distribution. It is mind-numbingly dull, even to statisticians.

) A sampling distribution is simply the probability distribution describing a particular sample statistic (like the sample mean) that you get by taking a random sample from the population. Because variance is hard to interpret. SAMPLING DISTRIBUTIONS 63 4.g. 2.2 Sampling Distributions (or. First. because we only get to see one observation from that sampling distribution (e. The same is true for X2 . We need to understand as much as we can about a statistic’s sampling distribution. The observation gets promoted from the random variable X1 to the data point x1 once you actually observe it. we usually look at the . Your data aren’t random anymore. ¯ ) = µ. X3 . If the data in your data set are the result of a random process. then the sample statistics describing your data must be too. but they are the result of a random process. Suppose the population you wish to sample from has mean µ and standard deviation σ (calculated as in Chapter 1). If you randomly select one observation from that population. “What is Random About My Data?”) It is hard for some people to see where randomness and probability enter into statistics. you would get a diﬀerent mean. E (X on page 33. each time we take a sample we ¯ . but “on average” it gives you the right answer. Sometimes the sample mean will be too big. because the smaller V ar (X being close to µ. They look at their data sets and say “These numbers aren’t random. then that one observation is a random variable X1 with expected value µ and standard deviation σ (in the sense of Chapter 2). . The numbers in your data set are ﬁxed numbers. ¯ ) = σ 2 /n because of the rules for variance on page 35. Xn . it is just like any other probability distribution except that it is for a special random variable: a sample statistic. Let’s think about the sampling distribution of X know three key facts. (If you took another sample. They will be the same tomorrow as they are today.4. Don’t let the name confuse you. 
You can show this is true using the rules for expected values 1. The trick to understanding how sample statistics relate to population parameters is to mentally put yourself back in time to just before the data were collected and think about the process of random sampling that produced your data. . However. We know that V ar (X ¯ ) is the better chance X ¯ has of This is important. standard deviation. . We see only one sample mean). They’re right there!” In a sense that is true. The probability distribution of X1 is the histogram that you would plot if you could see the entire population. sometimes it will be too small. What this says is that the sample mean is an unbiased estimate of the population mean. etc. there was a time before the sample was taken when these concrete numbers were random variables. .2.

When we talk about the standard deviation of a statistic we call it the standard error. The standard error of $\bar{X}$ is [2]

$$SE(\bar{X}) = \sigma/\sqrt{n}.$$

Generally speaking, the smaller the standard error of a statistic is, the more we can trust it. The formula for $SE(\bar{X})$ tells us $\bar{X}$ is a better guess for $\mu$ when $\sigma$ is small (the individual observations in the population have a small standard deviation) or $n$ is large (we have a lot of data in our sample). For example, if the standard error of $\bar{X}$ is 1 we know that $\bar{X}$ is typically about 1 unit away from $\mu$.

3. The third key idea is the central limit theorem. It says that the average of several random variables is normal even if the random variables themselves are not normal. Remember that if the data are normally distributed then any individual observation has about a 95% chance of being within 2 standard deviations of $\mu$. If the data are not normally distributed we can't make that statement. The central limit theorem is so important because it says $\bar{X}$ occurs within 2 standard errors of $\mu$ 95% of the time even if the data (the individual observations in the sample or population) are non-normal.

What these three facts tell us is that the number $\bar{x}$ in our data set is the result of one observation drawn from a normal distribution with mean $\mu$ and standard error $\sigma/\sqrt{n}$. We still don't know the numerical values of $\mu$ and $\sigma$, but Sections 4.3 and 4.5.1 show how to use this fact to "back in" to estimates of $\mu$.

[2] The "of $\bar{X}$" is important. We're focusing on $\bar{X}$ right now, but in the coming Chapters there are other statistics we will care about, such as the slope of a regression line. These other statistics have standard errors too, which will come from different formulas than $SE(\bar{X})$.

4.2.1 Example: log10 CEO Total Compensation

To make the idea of a sampling distribution concrete, consider the CEO compensation data set from Chapter 1. Figure 4.2 shows the histogram of log10 CEO compensation for all 800 CEO's in the data set. It is somewhat skewed to the right, so you might not feel comfortable modeling this population using the normal distribution. Imagine the collection of 800 CEO's is a population from which you wish to draw a sample, and that you can afford to obtain information from a sample of only 20 CEO's. In this contrived example we can actually compute the population mean, $\mu = 6.17$.
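The behavior of $\bar{X}$ described above (centered on $\mu$, standard error $\sigma/\sqrt{n}$, roughly normal even when the data are not) can be checked with a small simulation. The sketch below is only an illustration: it uses a made-up right-skewed population of 800 values rather than the CEO data, and the seed, the exponential shape, and every variable name are assumptions chosen for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed population of 800 values (NOT the CEO data).
population = [random.expovariate(1.0) for _ in range(800)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# Draw 1000 samples of size 20 and record each sample mean.
n = 20
sample_means = [statistics.fmean(random.sample(population, n)) for _ in range(1000)]

mean_of_means = statistics.fmean(sample_means)
se_theory = sigma / n ** 0.5                      # sigma / sqrt(n)
se_observed = statistics.pstdev(sample_means)     # spread of the simulated sample means
# Share of sample means that landed within 2 standard errors of mu.
within_2se = sum(abs(m - mu) <= 2 * se_theory for m in sample_means) / len(sample_means)

print(f"population mean      = {mu:.3f}")
print(f"mean of sample means = {mean_of_means:.3f}")
print(f"sigma / sqrt(n)      = {se_theory:.3f}")
print(f"SD of sample means   = {se_observed:.3f}")
print(f"share within 2 SE    = {within_2se:.3f}")
```

Because the samples are drawn without replacement from a finite population, the observed spread should come out slightly below $\sigma/\sqrt{n}$; with 800 units and samples of 20 the difference is small.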

[Figure 4.2: The white histogram bars are log10 CEO compensation, our hypothetical population. The gray histogram bars are the means of 1000 samples of size 20 randomly selected from the population of 800 CEO's. They represent the sampling distribution of $\bar{X}$ in this problem. Horizontal axis: Log10 Total Compensation.]

The Figure also shows a gray histogram which was created by randomly drawing many samples of size 20 from the CEO population. We took the mean of each sample, then plotted a histogram of all those sample means. If you took a random sample of 20 CEO's and constructed its mean, you would get one observation from the gray histogram. The gray histogram, which is the sampling distribution of $\bar{X}$, has all the properties advertised above: it is centered on the population mean of 6.17, it has a much smaller standard deviation than the individual observations in the population (by a factor of $\sqrt{20}$, though you can't tell by just looking at the Figure), and it is normally distributed even though the population is not.

Remember Figure 4.2 fondly, because it is the last time you're going to see an entire population or entire sampling distribution. In practice you only get to see one observation from the gray histogram. The trick is to know enough about how sampling distributions behave so that you have some idea about how far that one $\bar{x}$ from your sample might be away from the population $\mu$ you wish you could see.

4.3 Confidence Intervals

In the previous Section we saw that we could use $\bar{X}$ as a guess for $\mu$, and that the standard error of $\bar{X}$ ($SE(\bar{X}) = \sigma/\sqrt{n}$) gave us an idea of how good our guess is. Confidence intervals build on this idea. A confidence interval gives a range of possible values (e.g. 10 to 20) which we are highly confident $\mu$ lies within. For example if we said that a 95% confidence interval for $\mu$ was 10 to 20 this would mean that we were 95% sure that $\mu$ lies between 10 and 20. The question is, how do we calculate the interval?

Don't Get Confused! 4.1: Standard Deviation vs. Standard Error

• SD measures the spread of the data. If the SD is big then it would be hard for you to guess the value of a single observation drawn at random from the population.

• SE measures the amount of trust you can put in an estimate such as $\bar{X}$. If the standard error of an estimate is small then you can be confident that it is close to the true population quantity it is estimating (e.g. that $\bar{x}$ is close to $\mu$).

A confidence interval for $\mu$ with $\sigma$ known

First assume that the population we are looking at has mean $\mu$ and standard deviation $\sigma$. Then $\bar{X}$ has mean $\mu$ and standard error $\sigma/\sqrt{n}$. The central limit theorem also assures us that $\bar{X}$ is normally distributed. This is useful because we know that a normal random variable is almost always (95% of the time) within 2 standard deviations of its mean. That means $\bar{X}$ will almost always be within $2\sigma/\sqrt{n}$ of $\mu$. So [3] we can be 95% certain that $\mu$ is no more than $2\sigma/\sqrt{n}$ away from $\bar{x}$. Therefore, if we take the interval

$$[\bar{x} - 2\sigma/\sqrt{n},\ \bar{x} + 2\sigma/\sqrt{n}]$$

we have a 95% chance of capturing $\mu$. (In fact if we want to be exactly 95% sure of capturing $\mu$ we only need to use 1.96 rather than 2, but this is a minor point.)

What if we want to be 99% sure or only 90% sure of being correct? If you look in the normal tables you will see that a normal will lie within 2.576 standard deviations of its mean 99% of the time and within 1.645 standard deviations of its mean 90% of the time. Therefore in general we get

$$[\bar{x} - z\sigma/\sqrt{n},\ \bar{x} + z\sigma/\sqrt{n}]$$

where $z$ is 1.645 for a 90% interval, 1.96 for a 95% interval, and 2.576 for a 99% interval. This formula applies to any other certainty level as well. Just look up $z$ in the normal table.

[3] Note the switch. In the previous sentence $\mu$ was known and $\bar{X}$ was not. This sentence marks our move into the "real world" where we see $\bar{x}$ and are trying to guess $\mu$.
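As a hedged illustration of the known-$\sigma$ interval $\bar{x} \pm z\sigma/\sqrt{n}$, the sketch below looks up $z$ with Python's standard-library `statistics.NormalDist` instead of a printed normal table. The function name `z_interval` and the numbers in the usage example are hypothetical; they are not taken from the text.

```python
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for mu when the population sigma is known."""
    # Two-sided interval: put (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / n ** 0.5
    return xbar - margin, xbar + margin

# z multipliers for the three confidence levels discussed in the text.
z_values = {level: NormalDist().inv_cdf(0.5 + level / 2) for level in (0.90, 0.95, 0.99)}
for level, z in z_values.items():
    print(f"{level:.0%} interval: z = {z:.3f}")

# Hypothetical numbers: xbar = 10, sigma = 2, n = 16.
low, high = z_interval(10, 2, 16)
print(f"95% interval: ({low:.2f}, {high:.2f})")
```

Changing `confidence` is all it takes to get an interval at any other certainty level, which mirrors "just look up $z$ in the normal table."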

Not on the test 4.1: Does the size of the population matter?

The formula for the standard error of $\bar{X}$ assumes the individual observations are independent. When taking samples from a finite population, the observations are actually very slightly correlated with one another because each unit in the sample reduces the chances of another unit being in the sample. If you take a simple random sample of size $n$ from a population of size $N$ then it can be shown that

$$SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} \approx \frac{\sigma}{\sqrt{n}}\sqrt{1-\frac{n}{N}}.$$

The extra factor at the end is called the finite population correction factor (FPC). In practice the FPC makes almost no difference unless your sample size is really big or the population is really small. For example, if your population has a billion people in it and you take a HUGE sample of a million people then the FPC is $\sqrt{1 - 1\text{ million}/1\text{ billion}} = 0.9995$. Contrast that with the $1/\sqrt{n}$ factor of $1/\sqrt{1\text{ million}} = .001$ and it soon becomes apparent that the size of the sample is much more important than the size of the population.

4.3.1 Can we just replace $\sigma$ with $s$?

The confidence interval formula for $\mu$ uses $\sigma$, which is the population standard deviation. But if we don't know $\mu$, we probably don't know $\sigma$ either. A natural alternative is to use the sample standard deviation $s$ instead of the population standard deviation $\sigma$. We expect $s$ to be "close" to $\sigma$, so why not use

$$\left[\bar{x} - z\frac{s}{\sqrt{n}},\ \bar{x} + z\frac{s}{\sqrt{n}}\right]?$$

If we simply replace $\sigma$ with $s$ then we ignore the uncertainty introduced by using an estimate ($s$) in place of the true quantity $\sigma$. Our confidence intervals would tend to be too narrow, so they would not be correct as often as they should be. For example if we use $z = 1.96$, so that we would expect to get a 95% confidence interval, the interval may in fact only be correct (cover $\mu$) 80% of the time. To fix this problem we need to make the intervals a little wider so they really are correct 95% of the time. To do this we use something called the t distribution. All you really need to know about the t distribution is

1. You use it instead of the normal when you have to guess the standard deviation.

2. It is very similar to the normal, but with fatter tails.

3. There is a different t distribution for each value of $n$. When $n$ is large [4] the t and normal distributions are almost identical.

The best way to find $t$ is to use a computer, because there is a different t distribution for every value of $n$. There are tables that list some of the more interesting calculations for the t distribution for a variety of degrees of freedom, but we won't bother with them. To make a long story short the confidence interval you should be using is

$$\left[\bar{x} - t\frac{s}{\sqrt{n}},\ \bar{x} + t\frac{s}{\sqrt{n}}\right].$$

Note the difference between $t$ and $z$. Both measure the number of standard errors a random variable is above its mean, but $z$ counts the number of true standard errors, while $t$ counts estimated standard errors.

[Figure 4.3: The normal distribution and the t distribution with 3, 10, and 30 degrees of freedom.]

Figure 4.3 plots the t distribution for a few different sample sizes next to the normal curve. When the sample size is small (there are few "degrees of freedom") the t-distribution has much heavier tails than the normal. To capture 95% of the probability you might have to go well beyond 2 standard errors. As the sample size (DF) grows, the normal and t distributions become very close. Thus if you had a very small sample size your formula for a 95% confidence interval might be $[\bar{x} \pm 3SE(\bar{X})]$ or $[\bar{x} \pm 4SE(\bar{X})]$. As the sample size grows, $s$ becomes a better guess for $\sigma$ and the formula for a 95% confidence interval soon becomes very close to what it would be if $\sigma$ were actually known.

[4] Greater than 30 is an established rule of thumb.

Not on the test 4.2: What are "Degrees of Freedom?"

Suppose you have a sample with $n$ numbers in it and you want to estimate the variance using $s^2$. The first thing you do is calculate the mean, $\bar{x}$. Then you calculate $(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$, square them, and take their average. If you didn't square the deviations from the mean, their average would always be zero! Because the deviations from the mean must sum to zero, they don't represent $n$ numbers worth of information. There are $n$ numbers there, but they are not all "free" because they must obey a constraint. The phrase "degrees of freedom" means the number of "free" numbers available in your data set. Generally speaking, each observation in the data set adds one degree of freedom. Each parameter you must estimate before you can calculate the variance takes one away. That is why we divide by $n - 1$ when calculating $s^2$.

4.3.2 Example

A human resources director for a company wants to know the average annual life insurance expenditure for the members of a union with which she is about to engage in labor negotiations. She is considering whether it would be cost effective for the company to provide a life insurance benefit to the union members. The HR director obtains a random sample of 61 employees who provide the data in Figure 4.4. Find a 95% confidence interval for the union's average life insurance expenditure.

In this case finding the interval is easy, because it is included in the computer output: (429.48, 532.55). How did the computer calculate the interval? It is best to think of it in three steps. First, find the point estimate, or the single best guess for the thing you're trying to estimate. The best point estimate for a population mean is a sample mean, so the point estimate here is $\bar{x} = 481$. Second, compute the standard error of the point estimate, which is $201.2195/\sqrt{61} = 25.7635$. Third, compute the 95% confidence interval as $\bar{x} \pm 2SE(\bar{x})$. If you carry this calculation out by hand you may notice that you don't get exactly the same answer found in the computer output. That's because we used "2" for a number $t$ that the computer looked up on its internal t-table with $\alpha = .05$ and $60 = 61 - 1$ degrees of freedom.
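The three steps can be traced in a few lines of code. This is a sketch rather than a reproduction of the statistical package: the summary statistics are the ones reported in the Figure 4.4 output (mean 481.0164, standard deviation 201.2195, $n = 61$), the multiplier 2 is the by-hand approximation from the text, and the variable names are assumptions.

```python
import math

# Summary statistics from the computer output for the 61 sampled employees.
xbar, s, n = 481.0164, 201.2195, 61

# Step 1: the point estimate is the sample mean, xbar.
# Step 2: the standard error of the point estimate.
se = s / math.sqrt(n)

# Step 3: the by-hand 95% interval, using 2 in place of the exact t value.
by_hand = (xbar - 2 * se, xbar + 2 * se)

# The multiplier implied by the reported upper limit of 532.55.
implied_t = (532.55 - xbar) / se

print(f"SE               = {se:.4f}")
print(f"by-hand interval = ({by_hand[0]:.2f}, {by_hand[1]:.2f})")
print(f"implied t        = {implied_t:.3f}")
```

The by-hand limits differ from the reported (429.48, 532.55) only in the second decimal place, because the implied multiplier is only slightly above 2.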

| Statistic | Value |
|---|---|
| Mean | 481.0164 |
| Std Dev | 201.2195 |
| Std Err Mean | 25.7635 |
| upper 95% Mean | 532.5511 |
| lower 95% Mean | 429.4817 |
| N | 61.0000 |

Figure 4.4: Annual life insurance expenditures for 61 union employees.

The computer's answer is more precise than ours, but we're pretty close. The confidence interval is shown graphically in Figure 4.4 as the diamond in the boxplot. After observing the sample of 61 people, the HR director still doesn't know the overall average life insurance expenditure for the entire union, but a good bet is that it is between \$429 and \$532.

Notice that the diamond covers very little of the histogram. A common misperception that people sometimes have is that 95% confidence intervals are supposed to cover 95% of the data in the population. They're not. A 95% confidence interval is supposed to have a 95% chance of containing the thing you're trying to estimate, in this case the mean of the population.

What if the interval is too wide for the HR director's purpose? There are only two choices for obtaining a shorter interval, and both of them come at a cost. The first is to obtain more data, which will make $SE(\bar{X})$ smaller. However, obtaining more data is costly in terms of time, money, or both. The second option is to accept a larger probability of being wrong (i.e. a larger $\alpha$, the probability that the interval really doesn't contain the mean of the population), by going out fewer SE's from the point estimate. If you want, you can have an interval of zero width, but you will have a 100% chance of being wrong!

How much more data does the HR director need to collect? That depends on how narrow an interval is desired. Right now the interval has a width of about \$100, or a margin of error of about \$50. (The margin of error is the $\pm$ term in a confidence interval.) Suppose the desired margin of error $E$ is $\pm$\$25. The margin of error for a confidence interval is

$$E = t\frac{s}{\sqrt{n}}.$$

The HR director wants a 95% confidence interval, so $t \approx 2$, and we can feel pretty good about $s \approx 200$ based on the current data set. She can then solve for $n \approx 256$. Of course you can solve the formula for $n$ without substituting for the other letters, which gives:

$$n = \left(\frac{ts}{E}\right)^2.$$

Assuming you have done a pilot study or have some other way to make a guess at $s$, this is an easy "back of the envelope" calculation to tell you how much data you need to get an interval with the desired margin of error $E$ and confidence level (determined by $t$).
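The back-of-the-envelope sample size calculation, solving $E = ts/\sqrt{n}$ for $n$, can be wrapped in a one-line helper. The function name `required_sample_size` is hypothetical, and rounding up to a whole number of observations is an assumption added here; the inputs are the text's rough values $t \approx 2$ and $s \approx 200$.

```python
import math

def required_sample_size(t, s, margin):
    """Smallest whole n with t * s / sqrt(n) <= margin, i.e. n = (t*s/E)^2 rounded up."""
    return math.ceil((t * s / margin) ** 2)

n_for_25 = required_sample_size(t=2, s=200, margin=25)  # desired margin of error of $25
n_for_50 = required_sample_size(t=2, s=200, margin=50)  # roughly the current margin of error

print(n_for_25, n_for_50)
```

Halving the margin of error from \$50 to \$25 quadruples the required sample size, because $n$ grows with the square of $1/E$.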

4.4 Hypothesis Testing: The General Idea

Hypothesis testing is about ruling out random chance as a potential cause for patterns in the data set. For example, suppose the human resources manager from Section 4.3.2 needs to be sure that the per-capita insurance premium for union employees is less than \$500. The sample of 61 people has an average premium of \$481. Is that small enough so that we can be sure the population mean is less than \$500? Obviously not, since the 95% confidence interval for the population mean stretches up to \$532. It could easily be the case that $\mu = \$500$ and we just saw an $\bar{x} = \$481$ by random chance.

Our example makes it sound like hypothesis tests are simply an application of confidence intervals. In many instances (the so-called t-tests) the two techniques give you the same answer. However, confidence intervals and hypothesis tests are different in two respects. First, a hypothesis test compares a statistic to a prespecified value, like the HR director's \$500 threshold. Hypothesis tests are often used to compare differences between two statistics, such as two sample means. In such instances the "specified value" is almost always zero. Second, hypothesis tests can handle some problems that confidence intervals cannot. For example, Section 4.5.3 describes a hypothesis test for determining whether two categorical variables are independent. There is no meaningful confidence interval to look at in that context.

Here is how hypothesis tests work, step by step.

1. You begin by assuming a null hypothesis, which is that the population quantity you're testing is actually equal to a specified value, and that any discrepancy you see in your sample statistic is just due to random chance. [5] The null hypothesis is written $H_0$, such as in "$H_0: \mu = \$500$" or "$H_0$: the variables are independent." Remember that a hypothesis test uses the sample to test something about the population. Thus it is incorrect to write $H_0: \bar{x} = 500$ or $H_0: \bar{x} = 481$. You can see $\bar{x}$. You don't need to test for it.

[5] It sounds like the null hypothesis is a pretty boring state of the world. If it were exciting, we wouldn't give it a hum-drum name like "null."

2. The next step is to determine the alternative hypothesis you want to test against. It is written as $H_1$ or $H_a$. Often the alternative hypothesis is simply the opposite of the null hypothesis. For example, $H_1: \mu \neq \$500$ or $H_a$: the variables are related to one another. Sometimes you will only care about an alternative hypothesis in a specified direction. The HR director needs to show that $\mu < \$500$ before she is authorized to offer the life insurance benefit in negotiations, so her natural alternative hypothesis is $H_a: \mu < \$500$.

3. Identify a test statistic that can distinguish between the null and alternative hypotheses. The choice of a test statistic is obvious in many cases. For example, if you are testing a hypothesis about a population mean, a difference between two population means, or the slope of the population regression line, then good test statistics are the sample mean, the difference between two sample means, and the slope of the sample regression line. Some test statistics are a bit more clever. We'll see an example in Section 4.5.3.

Test statistics are often standardized so that they do not depend on the units of the problem. Instead of $\bar{x}$, you might see a test statistic represented as

$$t = \frac{\bar{x} - \mu_0}{SE(\bar{X})},$$

where $SE(\bar{X}) = s/\sqrt{n}$ and $\mu_0$ is the value specified in the null hypothesis, like \$500. A test statistic which has been standardized by subtracting off its hypothesized mean and dividing by its standard error is known by a special name. It is a t-statistic. A t-statistic tells how many standard errors the first thing in its numerator is above the second. Because it has been standardized, you get the same t-statistic regardless of whether $\bar{x}$ is measured in dollars or millions of dollars. You should recognize this standardization as nothing more than the "z-scoring" that we learned about in Section 2.4.

4. The final step of a hypothesis test looks at the test statistic to determine whether it is large or small (i.e. close or far from the value specified in $H_0$). Some people skip this step when working with t statistics, because we have some sense of what a big $t$ looks like (i.e. around $\pm 2$). However, there are other test statistics out there, with names like $\chi^2$ and $F$, where the definition of "big" isn't so obvious. Even with t statistics, the "magic number" needed to declare a test statistic "significantly large" will be different for different sample sizes, especially with small sample sizes of $n < 30$. The p-value is the main tool for measuring the strength of the evidence that a test statistic provides against $H_0$. A p-value is the probability, calculated assuming $H_0$ is true, of observing a test statistic that is as or more extreme than the one in our data set. The smaller the p-value, the stronger the evidence against $H_0$. In our HR director example, if the population mean had been $\mu = 500$ then the probability of seeing a sample mean of 481 or smaller is 0.2321. In other words, it wouldn't be particularly unusual for us to see sample means like \$481 if the true population mean were \$500.

Don't Get Confused! 4.2: Which One is the Null Hypothesis?

There is a bit of art involved in selecting the null hypothesis for a hypothesis test. You will get better at it as you see more tests. Here are some guidelines. First, the null hypothesis has to specify an exact model, because in order to compute a p-value you (or some math nerd you keep around for problems like this) need to be able to figure out what the distribution of your test statistic would be if the null hypothesis were true. For example, $\mu = 815$ or $\beta = 0$ are valid null hypotheses, but $\mu > 815$ is not. Consequently, the null hypothesis is almost always simpler than the alternative. For example, if you are testing the hypotheses "two variables are independent" and "two variables are dependent" then "independent" is the natural null hypothesis because it is simpler. Also, be sure to remember that a hypothesis test is testing a theory about a population or a process, not a sample or a statistic. Thus it is incorrect to write something like $H_0: \bar{x} = 815$ when you really mean $H_0: \mu = 815$.

4.4.1 P-values

Once you have a p-value, hypothesis testing is easy. The rule with p-values is

Small p-value ⇒ Reject $H_0$.

Here's why. If the p-value is very small then there are two possibilities.

1. The null hypothesis is true and we just got very strange data by bad luck.

2. The null hypothesis is false and we should conclude that the alternative is in fact correct.

The p-value measures how "unlucky" we would have to be to see the data in our data set if we were in case 1. If the p-value is small enough (say less than 5%) then we conclude that there's no way we're that unlucky, so we must be in case 2 and we can "reject the null hypothesis." On the other hand if the p-value is not too small then we "fail to reject the null hypothesis." This does not mean that we are sure the null hypothesis is true, just that we don't have enough evidence to be certain it is false.

Roughly speaking here is the language you can use for different p-values.

| p-value | Evidence against $H_0$ |
|---|---|
| p > .1 | None |
| .05 to .1 | Weak |
| .01 to .05 | Moderate |
| < .01 | Strong |

The p-value is not the probability that $H_0$ is true. If it were, we would begin to prefer $H_a$ as soon as $p < .5$ rather than $.05$. Instead, the p-value is a kind of "what if" analysis. It measures how likely it would be to see our data, if $H_0$ had been true. The less likely the data would be if $H_0$ were true, the less comfortable we are with $H_0$. It takes some mental gymnastics to get your mind around the idea, but that's the trade-off with p-values. Using them is easy, understanding exactly what they say is a little harder.

4.4.2 Hypothesis Testing Example

Hypothesis tests have been developed for all sorts of questions: are two means equal? Are two variances equal? Are two distributions the same? In an ideal world you would know the details about how each of these tests worked. However you often may not have time to get into the nitty gritty details of each test. Instead, you can find a hypothesis test with a null and alternative hypothesis that fit your problem, plug your data into a computer, and locate the p-value.

For example, suppose you want to know if two populations have different standard deviations. This might occur in a quality control application, or you might want to compare two stocks to see if one is more volatile than the other. Figure 4.5 shows computer output from a sample of two potato chip manufacturers. To the eye it appears that brand 2 has a higher standard deviation (i.e. is less reliable) than brand 1. Is this a real discrepancy or could it simply be due to random chance? Figure 4.5 presents output from four hypothesis tests. Each test compares the null hypothesis of equal variances to an alternative that the variances are unequal using a slightly different test statistic. Each test has a significant (i.e. small) p-value, which indicates that the variances are not the same (small p-value says to reject the null hypothesis that the variances are the same). Thus the difference we see is too large to be the result of random chance. Brand 1 is more reliable (i.e. has a smaller standard deviation) than brand 2.

| Test | F Ratio | DFNum | DFDen | Prob > F |
|---|---|---|---|---|
| O'Brien[.5] | 9.2582 | 1 | 46 | 0.0043 |
| Brown-Forsythe | 10.9797 | 1 | 46 | 0.0018 |
| Levene | 15.0356 | 1 | 46 | 0.0003 |
| Bartlett | 10.0318 | 1 | | 0.0015 |

Figure 4.5: Potato chip output. P-values are located in the column marked "Prob > F."

Notice the advantage of using p-values. We don't have to know how large a "Brown-Forsythe F ratio" has to be in order to understand the results of the Brown-Forsythe test. We can tell what Brown and Forsythe would say about our problem just by knowing their null and alternative hypothesis and looking at their p-value.

4.4.3 Statistical Significance

When you reject the null hypothesis in a hypothesis test you have found a "statistically significant result." For example, the standard deviations in the potato chip output were found to be significantly different. All that means is that the difference is too large to be strictly due to chance.

You should view statistical significance as a minimum standard. You should ignore small patterns in the data set that fail tests of statistical significance. When you find a statistically significant result, you know the result is "real," but there is no guarantee that it is important to your decision making process. People sometimes refer to this distinction as "statistical significance vs. practical significance."

For example, suppose it is very expensive to adjust potato chip filling machines, and there are industry guidelines stating that bags must be filled to within $\pm 1$ oz. Then the statistically significant difference between the standard deviations of the two potato chip processes is of little practical importance, because both processes are well within the industry limit. Of course if calibrating the filling machines is cheap, then Figure 4.5 says brand 2 should recalibrate.

4.5 Some Famous Hypothesis Tests

Section 4.4.2 seems to argue that all you need to do a hypothesis test is the null hypothesis and the p-value. To a large extent that is true, but there are a few extremely famous hypothesis tests that you should know how to do "by hand" (with the aid of a calculator). This Section presents three tests which come up frequently enough that it is worth your time to learn them. The three tests are: the one sample t-test for a population mean, the z-test for a proportion, and the $\chi^2$ test for independence between two categorical variables.

4.5.1 The One Sample T Test

You use the one sample t-test to test whether the population mean is greater than, less than, or simply not equal to some specified value $\mu_0$. Thus the null hypothesis is always $H_0: \mu = \mu_0$, where you specify a number for $\mu_0$. We have already seen one example of the one sample t-test performed by our friend the HR director. All the one sample t-test does is check whether $\bar{x}$ is close to $\mu_0$ or far away. "Close" and "far" are measured in terms of standard errors using the t-statistic [6]

$$t = \frac{\bar{x} - \mu_0}{SE(\bar{X})},$$

where $SE(\bar{X}) = s/\sqrt{n}$. The only complication comes from the fact that there are three different possible alternative hypotheses, and thus three different possible p-values that could be computed. You only want one of them.

[6] In a hypothesis test you calculate $t$ and use it to compute a p-value. This is the opposite of confidence intervals, which look up $t$ so that it matches a pre-specified probability such as 95%.

One Tail, or Two? (How to Tell and Why it Matters)

The one sample t-test tests the null hypothesis $H_0: \mu = \mu_0$ where $\mu_0$ is our hypothesized value for the mean of the population. The appropriate $H_a$ depends only on the setup of the problem. It does not depend at all on the data in your data set. There are 3 possible alternative hypotheses:

(a) $H_a: \mu \neq \mu_0$. This is called a two tailed (or two sided) test.

(b) $H_a: \mu > \mu_0$. This is called a one tailed test.

(c) $H_a: \mu < \mu_0$. This is called a one tailed test.

[Figure 4.6: The three p-value calculations for the three possible alternative hypotheses in the one sample t-test. Based on the data from Figure 4.4. Panels: (a) p = 0.4642, (b) p = 0.7679, (c) p = 0.2321.]

The null hypothesis determines where the sampling distribution for $\bar{X}$ is centered. The alternative hypothesis determines how you calculate your p-value from the sampling distribution. This can be a bit confusing, but it helps to remember that a p-value provides evidence for rejecting the null hypothesis in favor of a specified alternative. For the one sample t-test you can think of the p-value as the probability of seeing an $\bar{X}$ that supports $H_a$ even more than the $\bar{x}$ in your sample.

Our HR director was trying to show that $\mu < \$500$, so the relevant calculation is the probability that she would see an $\bar{X}$ even smaller than \$481 (her sample mean) if $\mu$ really was \$500 and the random sampling process were repeated again. Her calculation is depicted in Figure 4.6(c). The other two probability calculations in Figure 4.6 are irrelevant to the HR director, but they illustrate how the p-values would be calculated under the other alternative hypotheses. What type of $\bar{x}$ would support $H_a$ if it had been $\mu \neq 500$? An $\bar{x}$ far away from \$500 in either direction. Thus if $H_a$ is $\mu \neq \mu_0$ then the p-value calculates the probability that you would see future $\bar{X}$'s that are even farther away from $\mu_0$ than the one in your sample if $H_0$ were true. Likewise, if $H_a$ is $\mu > \mu_0$ then the p-value is the probability that you would see $\bar{X}$'s even larger than the $\bar{x}$ in your sample if $H_0$ were true and the sampling process were repeated. Obviously, the upper and lower tail p-values must sum to 1, and the two tailed p-value is twice the smaller of the one tailed p-values.

Remember that you choose the relevant alternative hypothesis from the context of the problem without regard to whether you saw $\bar{x} > \mu_0$ or $\bar{x} < \mu_0$. Otherwise, people would only ever do one tailed tests. If you are not sure which alternative hypothesis to use, pick the two tailed test. If you're wrong then all you've done is apply a tougher standard than you needed to. That could keep you out of court when you become CEO. [7]

[7] Mandatory Enron joke of 2003.
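The three p-values for the HR director's data can be approximated without a t-table. The sketch below is one possible way to do it using only the Python standard library: it integrates the t density numerically with Simpson's rule, which is an implementation choice made here and not the method used by the statistical package. The helper names `t_pdf` and `t_cdf` are hypothetical, and the summary statistics are those reported for the 61 employees.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x): Simpson's rule on [0, |x|], then symmetry about zero."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

# Summary statistics for the 61 employees and the hypothesized mean.
xbar, s, n, mu0 = 481.0164, 201.2195, 61, 500
t_stat = (xbar - mu0) / (s / math.sqrt(n))

p_lower = t_cdf(t_stat, n - 1)     # Ha: mu < mu0
p_upper = 1 - p_lower              # Ha: mu > mu0
p_two = 2 * min(p_lower, p_upper)  # Ha: mu != mu0

print(f"t = {t_stat:.4f}")
print(f"lower tail p = {p_lower:.4f}")
print(f"upper tail p = {p_upper:.4f}")
print(f"two tailed p = {p_two:.4f}")
```

The last two lines of arithmetic are the relationships stated in the text: the one tailed p-values sum to 1, and the two tailed p-value is twice the smaller of them.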

Example 1

One of the components used in producing computer chips is a compound designed to make the chips resistant to heat. Industry standards require that a sufficient amount of the compound be used so that the average failing temperature for a chip is 300 degrees (Fahrenheit). The compound is relatively expensive, so manufacturers don't want to use more than is necessary to meet the industry guidelines. Each day a sample of 30 chips is taken from the day's production and tested for heat resistance, which destroys the chips. Suppose, on a given day, the average failure point for the sample of 30 chips is 305 degrees, with a standard deviation of 8 degrees. Does it appear the appropriate heat resistance target is being met?

Let $\mu$ be the average population failure temperature. The null hypothesis is $H_0: \mu = 300$. What is the appropriate alternative? We are interested in deviations from the target temperature on either the positive or negative side, so the best alternative is $H_a: \mu \neq 300$. The standard error of the mean is $8/\sqrt{30} = 1.46$, so the t-statistic is 3.42. What p-value should we calculate? We want to know the probability of seeing sample means at least as far away from 300 degrees as the mean from our sample, so the p-value is

$$P(\bar{X} \geq 305) + P(\bar{X} \leq 295) \approx P(Z > 3.42) + P(Z < -3.42).$$

(The approximate equality is because we're using the normal table instead of the inconvenient t-table. We can use the normal table to construct the p-value because there are at least 30 observations in the data set.) Looking up 3.42 on the normal table gives us a p-value of $.0003 + .0003 = .0006$. If the true heat tolerance in the population of computer chips was 300 degrees, we would only see a sample mean this far from 300 about 6 times out of 10,000. That makes us doubt that the true mean really is 300 degrees. It looks like the average heat tolerance is higher than we need it to be.

Example 2

An important use of the one sample t-test is in "before and after" studies where there are two observations for each subject, one before some treatment was applied and one after. You can use the one sample t-test to determine whether there is any benefit from the treatment by computing the difference (after-before) for each person and testing whether the average difference is zero. This type of test is so important it merits its own name: the paired t-test, even though it is just the one sample t-test applied to differences.
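Before moving on to the paired example, the arithmetic in Example 1 can be checked with a short sketch. It follows the text in using the normal distribution in place of the t distribution (reasonable here because $n = 30$). The helper name `two_sided_test_from_summary` is hypothetical. The same calculation applies unchanged to a column of after-minus-before differences with $\mu_0 = 0$.

```python
from statistics import NormalDist

def two_sided_test_from_summary(xbar, mu0, s, n):
    """Standard error, t-statistic, and normal-approximation two-sided p-value."""
    se = s / n ** 0.5
    t = (xbar - mu0) / se
    # Probability of a standard normal landing at least |t| away from 0, in either tail.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return se, t, p

# Example 1: 30 chips, mean failure point 305 degrees, SD 8, target 300.
se, t, p = two_sided_test_from_summary(xbar=305, mu0=300, s=8, n=30)
print(f"SE = {se:.2f}, t = {t:.2f}, two-sided p = {p:.4f}")
```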

This use of the t-test is so important it merits its own name: the paired t-test, even though it is just the one sample t-test applied to differences.

For a concrete example, suppose engineers at Floogle (a company which designs internet search engines) have developed a new version of their search engine that they would like to market as "our fastest ever." They test the new version against their previous "fastest ever" search engine by running both engines on a suite of 40 test searches, randomly choosing which program goes first on each search to avoid potential biases. The difference in search times (new-old) is recorded for each search in the test suite. The average difference is -3.775 (average search for the new engine was 3.775 milliseconds less than under the old engine). The 40 differences have a standard deviation of 21.4 milliseconds. Do the data support the claim that the new engine is "their fastest ever?"

Let $\mu$ be the difference in average search times between the two search engines. The appropriate null hypothesis here is $\mu = 0$ (no difference in average speed). What should the alternative be? The engineers want to show that the new engine is "significantly faster" than the old one, so the natural alternative hypothesis is $\mu < 0$. The standard error of the differences is $SE(\bar X) = 21.4/\sqrt{40} = 3.38$, so the t-statistic is $t = -3.775/3.38 = -1.12$. Should we compute an upper tail, lower tail, or two tailed p-value? Our alternative hypothesis is $\mu < 0$, so we should compute a lower-tailed p-value. From the normal table (because $n > 30$) we find $p \approx 0.1314$. The large p-value says that the new search engine didn't win the race by a sufficient margin to show that it is faster than the old one.

#### 4.5.2 Methods for Proportions (Categorical Data)

Recall from Chapter 1 that continuous data are often summarized by means, and categorical data are summarized by the proportion of the time they occur. Let $p$ denote the proportion of the population with a specified attribute. The sample proportion is $\hat p$ (pronounced "p-hat"). You compute $\hat p$ by simply dividing the number of "successes" in your sample by the sample size. For example, suppose a car stereo manufacturer wishes to estimate the proportion of its customers who, one year after purchasing a car stereo, would recommend the manufacturer to a friend. Suppose the manufacturer obtains a random sample of 100 customers, 73 of whom respond positively. Then our best guess at $p$ is $\hat p = .73$.

There is good news, and even more good news about proportions. The good news is that proportions are actually a special kind of mean. Imagine that person $i$ is labeled with a number $x_i$, which is 1 if they would recommend our car stereo to a friend, and 0 if they would not. We can calculate the proportion of favorable reviews by averaging all those 0's and 1's. Thus, everything we learned about sample means carries over to sample proportions.
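The claim that a proportion is a mean in disguise can be checked directly. The sketch below rebuilds the car stereo sample as a list of 0's and 1's (the list itself is an assumption; only the counts 73 and 100 come from the text) and averages it. It also checks the fact about 0/1 data that the variance equals $\hat p(1-\hat p)$.

```python
# Car stereo example: 73 of 100 sampled customers would recommend the manufacturer.
# The ordering of the list is arbitrary; only the counts come from the text.
responses = [1] * 73 + [0] * 27   # 1 = would recommend, 0 = would not

n = len(responses)
p_hat = sum(responses) / n        # the sample proportion is just the sample mean

# For 0/1 data, the average squared deviation from the mean equals p_hat * (1 - p_hat).
variance = sum((x - p_hat) ** 2 for x in responses) / n

print(p_hat)
print(variance, p_hat * (1 - p_hat))
```

Both printed variances agree (up to rounding), which is why a guess at $p$ is all you need to get a guess at the standard error.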

In particular, the central limit theorem implies $\hat p$ obeys a normal distribution.

The second bit of good news about proportions is that it is even easier to calculate the standard error of $\hat p$ than it is for $\bar X$. Data which only assume the values 0 and 1 (or sometimes -1 and 1) are called indicator variables or dummy variables. Note the following useful fact about dummy variables. Let

$$X_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p. \end{cases}$$

Then it is easy to show $E(X_i) = p$, and $Var(X_i) = p(1 - p)$. (Try it and see! Hint: in this one special case $X_i^2 = X_i$.) What's so useful about our useful fact? Because proportions are just means in disguise, and we know that $SE(\bar X) = SD(X)/\sqrt{n}$, our "useful fact" says

$$SE(\hat p) = \sqrt{\frac{p(1 - p)}{n}}.$$

So with proportions, if you have a guess at $p$ you also have a guess at $SE(\hat p)$.

##### Don't Get Confused! 4.3: The Standard Error of a Sample Proportion

There are three ways to calculate $SE(\hat p)$ in practice. All of them recognize $SE(\hat p) = \sqrt{p(1 - p)/n}$. They differ in what you should plug in for $p$.

| Guess for $p$ | Used in... | Rationale |
|---|---|---|
| $\hat p$ | Confidence intervals | Best guess for $p$. |
| $p_0$ | Hypothesis testing | You're assuming $H_0$ is true. $H_0$ says $p = p_0$. |
| $1/2$ | Estimating $n$ for a future study | A conservative "worst case" estimate of $p$. Usable even with no data. |

##### Confidence Intervals

To produce a confidence interval for $p$ we use the formula

$$\hat p \pm z\,SE(\hat p).$$

In the confidence interval formula, $SE(\hat p) = \sqrt{\hat p(1 - \hat p)/n}$ and $z$ comes from the normal table [8]. Suppose, for example, that we sample $n = 100$ items from a shipment and find that 15 are defective. What is a 95% confidence interval for the proportion of defective items in the whole shipment? The point estimate is $\hat p = 15/100 = .15$ and its standard error is $SE(\hat p) = \sqrt{(.15)(.85)/100} = 0.0357$, so the confidence interval is $0.15 \pm 1.96 \times 0.0357 = [0.08, 0.22]$. The shipment contains between 8% and 22% defective items (with 95% confidence).

You may notice that, underneath the square root sign, $SE(\hat p)$ is a quadratic function of $p$. The standard error is zero if $p = 0$ or $p = 1$. That makes sense, because if the shipment contained either all defectives or no defectives then every sample we took would give us either $\hat p = 1$ or $\hat p = 0$ with no uncertainty at all. We get the largest $SE(\hat p)$ when $p = 1/2$.

It is useful to assume $p = 1/2$ when you are planning a future study and want to know how much data you need to achieve a specified margin of error. Recall from page 71 that to get a margin of error $E$ you need roughly $n = (ts/E)^2$ observations. If you want a 95% confidence interval then $t \approx 2$, and if you assume $p = 1/2$ then $s$ (the standard deviation of an individual observation) is $\sqrt{p(1 - p)} = \sqrt{(1/2)(1/2)} = 1/2$. Therefore to estimate a proportion to within $\pm E$ you need roughly $n \approx 1/E^2$ observations. To illustrate, suppose you wish to estimate the proportion of people planning to vote for the Democratic candidate in the next election to within $\pm 2\%$. You would want a sample of $n = 1/(.02)^2 = 1/(.0004) = 2500$ voters. To get the margin of error down to $\pm 1\%$ you would need $1/(.0001) = 10{,}000$ voters. Standard errors decrease like $1/\sqrt{n}$, so to cut $SE(\hat p)$ in half you need to quadruple the sample size.

##### Hypothesis Tests

Suppose your plant employs an extremely fault tolerant production process, which allows you to accept a shipment as long as you can be confident that there are fewer than 20% defectives. Should you accept the shipment with 15 defectives in the sample of 100? You could answer with a hypothesis test of $H_0: p = .2$ versus $H_a: p < .2$. Then $SE(\hat p) = \sqrt{(.2)(.8)/100} = .04$ and the test statistic is

$$z = \frac{0.15 - 0.2}{0.04} = -1.25.$$

We used .2 instead of .15 in the formula for $SE(\hat p)$ because hypothesis tests compute p-values by assuming $H_0$ is true, and $H_0$ says $p = .2$.

[8] We use the normal table and not the t table here because one of the t assumptions is that the data are normally distributed, which can't be true when the data are 0's and 1's. Therefore the results for proportions presented in this section are for "large samples" with $n > 30$.
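The shipment calculations can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the helper names `se_proportion` and `sample_size` are made up for illustration, and the normal tail probability is computed with `math.erfc` in place of a printed normal table.

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion, sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

n, defectives = 100, 15
p_hat = defectives / n

# Confidence interval: plug p_hat into the standard error.
se_ci = se_proportion(p_hat, n)
ci = (p_hat - 1.96 * se_ci, p_hat + 1.96 * se_ci)

# Hypothesis test of H0: p = 0.2 versus Ha: p < 0.2: plug p0 into the standard error.
p0 = 0.2
se_test = se_proportion(p0, n)
z = (p_hat - p0) / se_test
# Lower-tail p-value P(Z < z) from the standard normal distribution.
p_value = 0.5 * math.erfc(-z / math.sqrt(2))

def sample_size(margin):
    """Planning a study: assuming p = 1/2, n is roughly 1 / E^2."""
    return round(1 / margin ** 2)

print(ci, z, p_value)
print(sample_size(0.02), sample_size(0.01))
```

Note that the two standard errors differ (about 0.0357 versus 0.04) only because of what is plugged in for $p$.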

Our test statistic says that $\hat p$ is 1.25 standard errors below .2. For the alternative hypothesis $H_a: p < p_0$ the p-value is $P(Z < -1.25) = 0.1056$ from the normal table. If $H_0$ were true and $p = .2$ then we would see $\hat p \le .15$ about 10% of the time, which is not all that unusual. We should reject the shipment, because we can't reject $H_0$.

#### 4.5.3 The $\chi^2$ Test for Independence Between Two Categorical Variables

The final hypothesis test in this Chapter is a little different from the others because there is no confidence interval to which it corresponds. The test is called the $\chi^2$ test. ($\chi$ is the Greek letter "chi", which is pronounced with a hard "k" sound and rhymes with "sky.") The $\chi^2$ test investigates whether there is a relationship between two categorical variables. For example is there a relationship between gender and type of car purchased? Figure 4.7 describes the type of car purchased by a random sample of 263 men and women. In the sample women buy a slightly higher percentage of family cars than men, and men buy a slightly higher percentage of sporty cars than women.

The hypotheses for the $\chi^2$ test are

- $H_0$: The two variables are independent (i.e. no relationship)
- $H_a$: There is some sort of relationship

To decide which we believe we produce a contingency table for the two variables and calculate the number of people we would expect to fall in each cell if the variables really were independent. If observed numbers (i.e. the numbers that actually happened) are very different from what we would expect if $X$ and $Y$ were independent then we will conclude there must be a relationship.

The first step is to compute how many observations we would expect to see in each cell of the table if the variables were actually independent. Recall that if two random variables $X$ and $Y$ are independent then $P(X = x \text{ and } Y = y) = P(X = x)P(Y = y)$. From the table in Figure 4.7 we see that the proportion of men in the sample is .5475 (= 144/263) and the proportion of sporty cars in the sample is .3004 (= 79/263). If TYPE and GENDER were independent we would expect the proportion of "men with sporty cars" in the data set to be (.5475)(.3004), which means we would expect to see 263(.5475)(.3004) = 43.25 observations in that cell of the table. In more general terms,

$$E_i = n p_x p_y = n_x n_y / n,$$

where $E_i$ is the expected cell count in the $i$'th cell of the table, $n_x$ and $n_y$ are the marginal counts for the cell (e.g. number of men and number of sporty cars), and $p_x$ and $p_y$ are the marginal proportions for that cell. The first equation is how you should think about $E_i$ being calculated when someone else (i.e. the computer) does the calculation for you. If you have to do the calculation yourself use the second form, which is a shortcut you get by noticing $p_x = n_x/n$ and $p_y = n_y/n$.

| Test | ChiSquare | Prob>ChiSq |
|---|---|---|
| Likelihood Ratio | 1.420 | 0.4915 |
| Pearson | 1.417 | 0.4924 |

Figure 4.7: Automobile preferences for men and women.

Once you have determined how many observations you would expect to see in each cell of the table, you need to see whether the table you actually observed is close or far from the table you would expect if the variables were independent. The Pearson $\chi^2$ test statistic [9] fits the bill.

$$X^2 = \sum_i \frac{(E_i - O_i)^2}{E_i},$$

where $E_i$ and $O_i$ are the expected and observed counts for each cell in the table. You divide by $E_i$ because you expect cells with bigger counts to also be more variable. Dividing by $E_i$ in each term puts cells with large and small expected counts on a level playing field. Otherwise $X^2$ is just a way to measure the distance between your observed table and the table you would expect if the variables were independent.

How large does $X^2$ need to be to conclude $X$ and $Y$ are related? That depends on the number of rows and columns in the table. If there are $R$ rows and $C$ columns, then you compare $X^2$ to the $\chi^2$ distribution with $(R - 1)(C - 1)$ degrees of freedom. (Imagine chopping off one row and one column of the table and counting the cells in the smaller table.)

[9] There is another chi-square statistic called the "likelihood ratio" statistic, denoted $G^2$, that we won't cover. Even though they are calculated differently, $G^2 \approx X^2$ and both statistics are compared to the same reference distribution.

The $\chi^2$ distribution, like the t distribution, is a probability distribution we think you should know about, but not have to learn tables for. Figure 4.8 shows the $\chi^2$ distribution with 5 degrees of freedom. The $\chi^2$ distribution with $d$ degrees of freedom has mean $d$ and standard deviation $\sqrt{2d}$. Thus for larger tables you need a larger $X^2$ to declare statistical significance, which makes sense because larger tables contribute more terms to the sum. Large $X^2$ values make you want to reject $H_0$ (and say there is a relationship between the two variables), so the p-value for the $\chi^2$ test must be the probability in the upper tail of Figure 4.8. I.e. the probability that you would see an observed table even farther away from the expected table if the variables were truly independent and a second random sample was taken.

In the auto choice example, $X^2 = 1.417$, on (3-1)(2-1)=2 degrees of freedom, for a p-value of .4924. If TYPE and GENDER really were independent, we would see observed tables this far or farther from the expected tables about half the time. The small differences we see between men and women in the sample could very easily be due to random chance. To see what a relationship looks like, recall Figure 1.6 on page 11, which showed the preferences for automobile type across three different age groups. The data in Figure 1.6 produce $X^2 = 29.070$, on 4=(3-1)(3-1) degrees of freedom, yielding a p-value $p < .0001$. There is very strong evidence of a relationship between TYPE and AGEGROUP.

The $\chi^2$ test is a test for the entire table. In principle you could design a hypothesis test to examine subsets of a contingency table, but we will not discuss them here. A good strategy with contingency tables is to perform a $\chi^2$ test first, to see if there is any relationship between the variables in the table. If there is, you can use methods similar to Section 4.5.2 to compare individual proportions.

##### Not on the test 4.3: Rationale behind the $\chi^2$ degrees of freedom calculation

The $\chi^2$ test compares two models. The "degrees of freedom" for the test is the difference in the number of parameters needed to fit each model. The simpler model assumes the two variables are independent. To specify that model you need to estimate $R - 1$ probabilities for the rows and $C - 1$ probabilities for the columns. The "$-1$" comes from the fact that probabilities have to sum to one. The more complicated model assumes that the two variables are dependent, so each cell in the table gets its own probability. That's $RC - 1$ "free" probabilities. A little arithmetic shows that the complicated model has $(R - 1)(C - 1)$ more parameters than the simple model.
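The expected counts, the Pearson statistic, and the degrees of freedom can all be computed by hand from a table of counts. The sketch below is illustrative only: the function names are invented, the `observed` table is hypothetical (the full counts behind Figure 4.7 are not reproduced in this section), and the p-value helper relies on the fact that with exactly 2 degrees of freedom the upper-tail $\chi^2$ probability is $e^{-x/2}$. Other degrees of freedom would need a statistics library or a table.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence: (row total)(column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_x2(table):
    """Pearson statistic: sum over all cells of (E - O)^2 / E."""
    expected = expected_counts(table)
    return sum((e - o) ** 2 / e
               for exp_row, obs_row in zip(expected, table)
               for e, o in zip(exp_row, obs_row))

def degrees_of_freedom(table):
    """(R - 1)(C - 1) for a table with R rows and C columns."""
    return (len(table) - 1) * (len(table[0]) - 1)

def p_value_2df(x2):
    """Upper-tail chi-square probability, valid ONLY for 2 degrees of freedom."""
    return math.exp(-x2 / 2)

# HYPOTHETICAL counts, made up for illustration (not the data behind Figure 4.7).
observed = [[30, 50, 20],
            [35, 40, 25]]
x2 = pearson_x2(observed)
print(x2, degrees_of_freedom(observed), p_value_2df(x2))

# Checks against numbers quoted in the text.
print(144 * 79 / 263)       # expected count of "men with sporty cars", about 43.25
print(p_value_2df(1.417))   # p-value for X^2 = 1.417 on 2 df, about .4924
```

The last two lines reproduce the 43.25 expected count and the .4924 p-value from the auto choice example using only the marginal counts and the reported $X^2$.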

Figure 4.8: The $\chi^2$ distribution with 5 degrees of freedom. p-values come from the upper tail of the distribution.

##### Caveat

There is one important caveat to keep in mind for the $\chi^2$ test. The p-values are generally not trustworthy if the expected number of observations in any cell of the table is less than 5. However, if that is the case then you may form a new table by collapsing one or more of the offending categories into a single category (i.e. replacing "red" and "blue" with "red or blue").


## Chapter 5 Simple Linear Regression

In Chapter 6 we will consider relationships between several variables. The two most common reasons for modeling joint relationships are to understand how one variable will be affected if we change another (for example how will profit change if we increase production) and actually predict one variable using the value of another (e.g. if we price our product at \$X what will be our sales $Y$). There are any number of ways that two variables could be related to each other. We will start by examining the simplest form of relationship, linear or straight line, but it will become clear that we can extend the methods to more complicated relationships.

### 5.1 The Simple Linear Regression Model

The idea behind simple linear regression is very easy. We have two variables $X$ and $Y$ and we think there is an approximate linear relationship between them. If so then we can write the relationship as

$$Y = \beta_0 + \beta_1 X + \epsilon$$

where $\beta_0$ is the intercept term (i.e. the value of $Y$ when $X = 0$), $\beta_1$ is the slope (i.e. the amount that $Y$ increases by when we increase $X$ by 1) and $\epsilon$ is an error term (see Figure 5.1). The error term comes in because we don't really think that there is an exact linear relationship, only an approximate one. A fundamental regression assumption is that the error terms are all normally distributed with the same standard deviation $\sigma$. An equivalent way of writing the linear regression model is

$$Y \sim N(\beta_0 + \beta_1 X, \sigma).$$

Figure 5.1: An illustration of the notation used in the regression equation.

Recall that $Y \sim N(\mu, \sigma)$ means "$Y$ is a normally distributed random variable with mean $\mu$ and standard deviation $\sigma$." Writing the regression model this way helps make the connection to Chapters 2 and 4. All that has changed is that now we're letting $\mu$ depend on some background variable that we're calling $X$.

If we are interested in how $Y$ changes when we change $X$ then we look at the slope $\beta_1$. If we wish to predict $Y$ for a given value of $X$ we use $\hat y = \beta_0 + \beta_1 x$. In practice there is a problem: $\beta_0$ and $\beta_1$ are unknown (just like $\mu$ was unknown in Chapter 4) because we can't see all the $X$'s and $Y$'s in the entire population. Instead we get to see a sample of $X$ and $Y$ pairs and need to use these numbers to guess $\beta_0$ and $\beta_1$.

For any given values of $\beta_0$ and $\beta_1$ we can calculate the residuals $e_i = y_i - \hat y_i$, which measure how far the $y$ that we actually saw is from its prediction. The residuals are signed so that a positive residual means the point is above the regression line, and a negative residual means the point is below the line. We choose the line (i.e. choose $\beta_0$ and $\beta_1$) that makes the sum of the squared residuals as small as possible (we square the residuals to remove the sign). Said another way, regression tries to minimize

$$SSE = e_1^2 + e_2^2 + \cdots + e_n^2.$$

SSE stands for the "sum of squared errors" and the estimates for $\beta_0$ and $\beta_1$ are called $b_0$ and $b_1$. You obtain guesses for $\beta_0$ and $\beta_1$ by minimizing SSE. Note that the line that we get is just a guess for the true line (just as $\bar x$ is a guess for $\mu$). Therefore $b_0$ and $b_1$ are random variables (just like $\bar x$) because if we took a new sample of $X$ and $Y$ values we would get a different line. What about a guess for $\sigma$?

In earlier Chapters, we said that variance was the "average squared deviation from the mean." The same is true here, but in a regression model the "mean" for each observation depends on $X$. So instead of averaging $(y_i - \bar y)^2$, we average $(y_i - \hat y_i)^2$. One other difference is that in a regression model we must estimate two parameters (the slope and intercept) to compute $\hat y_i$, so we use up one more degree of freedom than we did when our best guess for $Y$ was simply $\bar y$. Thus the estimate of $\sigma^2$ is

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{n - 2}.$$

Now look at the sum in the numerator of $s^2$. That's SSE! People sometimes call $s^2$ the mean square error, or MSE, because to calculate the variance you take the mean of the squared errors. Those same people would call $s = \sqrt{s^2}$ the "root" mean square error, or RMSE.

When you estimate a regression model you estimate the three numbers $b_0$, $b_1$, and $s$. The first two tell you where the regression line goes. The last one gives you a sense of how spread out the $Y$'s are around the line.

##### Not on the test 5.1: Why sums of squares?

There are a number of reasons why the sum of squared errors is the model fitting criterion of choice. The first is that it is a relatively easy calculus problem. You could (but thankfully don't have to) take the derivative of SSE with respect to $\beta_0$ and $\beta_1$ (viewing all the $x_i$'s and $y_i$'s as constants) and solve the system of two equations without too much difficulty.

On a deeper level, sums of squares tell us how far one thing is from another. Do you remember the "distance formula" from high school algebra? It says that the squared distance between two ordered pairs $(x_1, y_1)$ and $(x_2, y_2)$ is $d^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2$. The corresponding formula works for ordered triples plotted in 3-space, and for ordered n-tuples in higher dimensions. The sum of squared errors is the squared distance from the observed vector of responses $(y_1, \ldots, y_n)$ to the prediction vector $(\hat y_1, \ldots, \hat y_n)$. By minimizing SSE, regression gets those two vectors to be as close as possible.

#### 5.1.1 Example: The CAPM Model

Figure 5.2 shows the results obtained from regressing monthly stock returns for Sears against a "value weighted" stock index representing the overall performance of the stock market. The "capital asset pricing model" (CAPM) from finance says that you can expect the returns for an individual stock to be linearly related to the overall stock market. The equation of the estimated line describing this relationship can be found in the "Parameter Estimates" portion of the computer output. The first column lists the names of the $X$ variables in the regression. The second column lists their coefficients in the regression equation [1]. So the regression equation is

Sears = −0.003307 + 1.123536(VW Return).

You can find $s$ in the "Summary of Fit" table under the heading "Root Mean Square Error." In this example $s = 0.056324$.

Summary of Fit

| | |
|---|---|
| RSquare | 0.426643 |
| RSquare Adj | 0.424106 |
| Root Mean Square Error | 0.056324 |
| Mean of Response | 0.009779 |
| Observations | 228 |

Parameter Estimates

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | -0.003307 | 0.003864 | -0.86 | 0.3931 |
| VW Return | 1.123536 | 0.086639 | 12.97 | <.0001 |

Figure 5.2: Fitting the CAPM model for Sears stock returns.

The CAPM model refers to the estimated equation as the "security market line." The slope of the security market line is known as the stock's "beta" (a reference to the standard regression notation). It provides information about the stock's volatility relative to the market. If $\beta = 2$ then a one percentage point change in the market return would correspond to a two percentage point change in the return of the individual security. Thus if $\beta > 1$ the stock is more volatile than the market as a whole, and if $\beta < 1$ the stock is less volatile than the market.

You can find regression-based volatility information in the prospectus for your investments. Figure 5.3 shows the volatility measurements from the prospectus of one of Fidelity's more aggressive mutual funds. It lists $\beta$. $R^2$ will be discussed in Section 5.2.2, and "standard deviation" is $s$, which we just discussed.

[1] All statistics programs organize their estimated regression coefficients in exactly this way.
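To make the three estimated numbers $b_0$, $b_1$, and $s$ concrete, here is a minimal sketch of a least squares fit. It uses the standard closed-form solution of the "easy calculus problem" (slope = sum of cross-deviations divided by sum of squared $x$-deviations), which the text mentions but does not write out. The function name `fit_line` and the data are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least squares estimates b0, b1 and the RMSE s for the line y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx                 # slope that minimizes SSE
    b0 = y_bar - b1 * x_bar        # intercept that minimizes SSE
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    s = math.sqrt(sse / (n - 2))   # divide by n - 2: two parameters were estimated
    return b0, b1, s

# HYPOTHETICAL data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, s = fit_line(xs, ys)
print(b0, b1, s)
```

In practice a statistics package reports these numbers for you, as in the "Parameter Estimates" and "Summary of Fit" output above.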

Figure 5.3: Volatility measures for an aggressive Fidelity fund. Source: www.fidelity.com.

### 5.2 Three Common Regression Questions

There are three common questions often asked of a regression model.

1. Can I be sure that $X$ and $Y$ are actually related?
2. If there is a relationship, how strong is it? An equivalent way to phrase this question is, "What proportion of the variation have I explained?"
3. What $Y$ would I predict for a given $X$ value and how sure am I about my prediction?

#### 5.2.1 Is there a relationship?

This question really comes down to asking whether $\beta_1 = 0$, because if the slope is zero $X$ disappears out of the regression model and we get $Y = \beta_0 + \epsilon$. The question is, "Is $b_1$ far enough away from 0 to make us confident that $\beta_1 \ne 0$?" This should sound familiar because it is exactly the same idea we used to test a population mean in the one sample t-test. (Is $\bar x$ far enough from $\mu_0$ that we can be sure $\mu \ne \mu_0$?) We perform the same sort of hypothesis test here. Start with $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$. We calculate $b_1$ and its standard error and then take the ratio

$$t = \frac{b_1}{SE(b_1)}.$$

As before, the t-statistic counts the number of standard errors that $b_1$ is from zero. If the absolute value of $t$ is large then we will reject the null hypothesis and conclude there must be a relationship between $X$ and $Y$. As always, we determine whether $t$ is large enough by looking at the p-value it generates. If the p-value is small we will conclude that there is a relationship. We usually hope that the p-value is small, otherwise we might as well just stop because there is no evidence that $X$ and $Y$ are related at all. The small p-value for the slope in Figure 5.2 leaves no doubt that there is a relationship between Sears stock and the stock market.
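The t ratios in the Sears output are just each estimate divided by its standard error. The sketch below recomputes them from the estimates and standard errors reported in Figure 5.2. As a simplifying assumption it uses the normal distribution for the two-tailed p-value (reasonable here because the sample is large); software would use the t distribution, so the p-values will differ slightly in later decimals. The function names are invented for this example.

```python
import math

def t_ratio(estimate, std_error):
    """Number of standard errors the estimate is from zero."""
    return estimate / std_error

def two_sided_p_normal(t):
    """Two-tailed p-value using the normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2))

# Estimates and standard errors as reported in Figure 5.2.
b0, se_b0 = -0.003307, 0.003864
b1, se_b1 = 1.123536, 0.086639

t_slope = t_ratio(b1, se_b1)
t_intercept = t_ratio(b0, se_b0)

print(round(t_slope, 2), two_sided_p_normal(t_slope))
print(round(t_intercept, 2), two_sided_p_normal(t_intercept))
```

The slope is about 13 standard errors from zero, which is why its p-value is reported as smaller than .0001, while the intercept is less than one standard error from zero.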

The only difference between the t-test for a regression coefficient and the t-test for a mean is that the standard error is computed using a different formula.

$$SE(b_1) = \frac{s}{s_x \sqrt{n - 1}}$$

This formula says that $b_1$ is easier to estimate (its SE is small) if

1. $s$ is small. $s$ is the standard deviation of $Y$ around the regression line, so if the points are tightly clustered around the regression line it is easier to guess the slope.
2. $s_x$ is large. $s_x$ is the standard deviation of the $X$'s. If the $X$'s are highly spread out, it is easier to guess the regression slope.
3. $n$ is large. As with most things, the estimate of the slope improves as more data are available.

You could also use $SE(b_1)$ to construct a confidence interval for $\beta_1$. Just as with means, you can construct a 95% confidence interval for $\beta_1$ as $[b_1 \pm 2SE(b_1)]$. How meaningful this confidence interval is depends on whether you can give $\beta_1$ a meaningful interpretation. In Figure 5.2, $\beta_1$ is the stock's volatility measure, so a confidence interval for $\beta_1$ represents a range of plausible values for the stock's true volatility. The standard error of $b_1$ in Figure 5.2 is about .086, so a 95% confidence interval for Sears stock is roughly $(1.12 \pm .17) = (.95, 1.29)$. The confidence interval contains 1, which says that Sears stock may be no more volatile than the stock market itself.

#### 5.2.2 How strong is the relationship?

Once we have decided that there is some sort of relationship between $X$ and $Y$ we want to know how strong the relationship is. We measure the strength of the relationship between $X$ and $Y$ using a quantity called $R^2$. To calculate $R^2$ we use the formula

$$R^2 = 1 - \frac{SSE}{SST}$$

where $SST = \sum (y_i - \bar y)^2$. Think of SST as the variability you would have if you ignored $X$ and estimated each $y$ with $\bar y$ (the variability of $Y$ about its mean). Think of SSE as the variability left over after fitting the regression (the variability of $Y$ about the regression line). Then $R^2$ calculates the proportion of variability that you have explained using the regression. $R^2$ is always between zero and one. A number close to one indicates a large proportion of the variability has been explained. A number close to zero indicates the regression did not explain much at all!

Note that

$$R^2 \approx 1 - \frac{s_e^2}{s_y^2}$$

where $s_e$ is the standard deviation of the residuals and $s_y$ is the standard deviation of the $Y$'s [2]. For example, if the sample standard deviation of the $y$'s (about the average $y$) is 1, and the standard deviation of the residuals (about the regression line) is 0.5 then $R^2 \approx .75$ because $0.5^2$ is about 75% smaller than $1^2$.

You might be wondering why we wouldn't just use the correlation coefficient we learned about in Section 3.2.3 to measure how closely $X$ and $Y$ are related. In fact it can be shown that $R^2 = r^2$ so the two quantities are equivalent. An advantage of $R^2$ is that it retains its meaning even if there are several $X$ variables, which will be the case in Chapter 6 and in most real life regression applications. Correlations are limited to describing one pair of variables at a time. Of course, correlations tell you the direction of the relationship in addition to the strength of the relationship. That's why we keep them around instead of always using $R^2$.

[2] The approximate equality is because $s_e^2 = SSE/(n - 2)$, while $s_y^2 = SST/(n - 1)$. If $n$ is large then $(n - 1)/(n - 2) \approx 1$.

##### Don't Get Confused! 5.1: $R^2$ vs. the p-value for the slope

$R^2$ tells you how strong the relationship is. The p-value for the slope tells you whether $R^2$ is sufficiently large for you to believe there is any relationship. Put another way, the p-value answers a yes-or-no question about the relationship between $X$ and $Y$. $R^2$ answers a "how much" question. You shouldn't even look at $R^2$ unless the p-value is significant. The smaller the p-value the more sure you can be that there is at least some relationship between the two variables.

#### 5.2.3 What is my prediction for $Y$ and how good is it?

Once we have an estimate for the line, making a prediction of $Y$ for a given $x$ is easy. We just use $\hat y = b_0 + b_1 x$. However, there are two sorts of uncertainty about this prediction.

1. Remember that $b_0$ and $b_1$ are only guesses for $\beta_0$ and $\beta_1$ so $\hat y = b_0 + b_1 x$ is a guess for the population regression line. I.e. $\hat y$ is a guess for the "$\mu$" associated with this particular $x$. We use a confidence interval to determine how good a guess it is.
2. Even if we knew the parameters of the regression model perfectly the individual data points wouldn't lie exactly on the line. A prediction interval provides a range of plausible values for an individual future observation with the specified $x$ value. The interval combines our uncertainty about the location of the population regression line with the variation of the points around the population regression line.

Figure 5.4 illustrates the difference between these two types of predictions. It shows the relationship between a firm's monthly sales and monthly advertising expenses. Suppose you are thinking of setting an advertising policy of spending a certain amount each month for the foreseeable future. You are taking a long term view so you probably care more about the long run monthly average value of sales than you do month to month variation. In that case you want to use a confidence interval. However, if you are considering a one month media blitz and you want to forecast what sales will be if you spend \$X on advertising, then the prediction interval is what you want.

##### Confidence Intervals for the Regression Line

Suppose you are interested in predicting the numerical value of the population regression line at a new point $x^*$ of your choosing. (Translation: suppose you want to estimate the long run monthly average sales if you spend \$2.5 million per month on advertising.) You would compute your confidence interval by first calculating $\hat y = b_0 + b_1 x^*$. Then the interval is $\hat y \pm 2SE(\hat y)$, where

$$SE(\hat y) = s \sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{(n - 1)s_x^2}}.$$

This is another one of those formulas that isn't as bad as it looks when you break it down piece by piece. First, pretend the second term under the square root sign isn't there. Then $SE(\hat y)$ is just $s/\sqrt{n}$ like it was when we were estimating a mean. The ugly second term increases $SE(\hat y)$ when $x^*$ (the $x$ value where you want to make your prediction) moves away from $\bar x$. All that says is that it is harder to guess where the regression line goes when you're far away from $\bar x$. "Far" is defined relative to $s_x$, the standard deviation of the $X$'s. Finally, you have $(x^* - \bar x)^2$ and $s_x^2$ because the formula for variances is based on squared things.

Parameter Estimates

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 66151.758 | 11709.47 | 5.65 | <.0001 |
| ADVEXP | 23.32421 | 6.349464 | 3.67 | 0.0008 |

| RSquare | RMSE | N |
|---|---|---|
| 0.28412 | 14503.29 | 36 |

Mean of ADVEXP: 1804.444. SD of ADVEXP: 386.096.

Figure 5.4: Regression output describing the relationship between sales and advertising expenses, including (a) confidence and (b) prediction intervals. (\$1000's)

The formula for $SE(\hat y)$ is useful because it helps us understand how confidence intervals behave. However, the best way to actually construct a confidence interval is on a computer. Suppose we wanted an interval estimate of long run average monthly sales assuming we spend \$2.5 million per month on advertising. The regression equation says that $\hat y = 66151 + 23.32(2500) = 124{,}451$, or about \$124 million. The 95% confidence interval when $x^* = 2500$ is (114230, 134694), or between \$114 and \$135 million. You can see these numbers in Figure 5.4(a) by going to $x^* = 2500$ on the $X$ axis and looking straight up.

The "ugly second term" is the thing that makes the confidence intervals in Figure 5.4(a) bow away from the regression line far from the average ADVEXP. There is a "bow" effect present in both sets of intervals, but it is more pronounced in (a).
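The confidence interval at $x^* = 2500$ can be approximated by plugging the Figure 5.4 summary numbers into the formula for $SE(\hat y)$. This is a rough sketch, not a reproduction of the software output: the function name `se_fit` is invented, and the multiplier 2 is used as in the formula $\hat y \pm 2SE(\hat y)$. The interval quoted in the text, (114230, 134694), is slightly wider, presumably because the software uses an exact t multiplier a little larger than 2.

```python
import math

def se_fit(s, n, x_star, x_bar, s_x):
    """SE of y-hat at x_star: s * sqrt(1/n + (x_star - x_bar)^2 / ((n - 1) * s_x^2))."""
    return s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / ((n - 1) * s_x ** 2))

# Values as reported in Figure 5.4 (sales and advertising in $1000's).
b0, b1 = 66151.758, 23.32421
s, n = 14503.29, 36
x_bar, s_x = 1804.444, 386.096

x_star = 2500
y_hat = b0 + b1 * x_star
se = se_fit(s, n, x_star, x_bar, s_x)
interval = (y_hat - 2 * se, y_hat + 2 * se)

print(y_hat, se, interval)
```

Setting `x_star` equal to `x_bar` makes the second term vanish, leaving $s/\sqrt{n}$, the narrowest point of the "bow."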

##### Prediction Intervals for Predicting Individual Observations

$SE(\hat y)$ measures how sure you are about the location of the population regression line at a point $x^*$ of your choosing. Predicting an individual $y$ value (let's call it $y^*$) for an observation with that $x^*$ is more difficult, because you have to combine your uncertainty about the location of the regression line with the variation of the $y$'s around the line. If you knew $\beta_0$, $\beta_1$, and $\sigma^2$ then your 95% prediction interval for $y^*$ would be $(\hat y \pm 2\sigma)$. As it is, we recognize that $y = \hat y + \text{residual}$, so

$$Var(y^*) = \underbrace{s^2 \left( \frac{1}{n} + \frac{(x^* - \bar x)^2}{(n - 1)s_x^2} \right)}_{Var(\hat y)} + \underbrace{s^2}_{Var(\text{residual})}.$$

Consequently, the standard error for a predicted $y^*$ value is

$$SE(y^*) = s \sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{(n - 1)s_x^2} + 1}.$$

An ugly looking formula, to be sure. But notice that if you had all the data in the population (i.e. if $n \to \infty$) then it would just be $s$. Also notice that under the square root sign you have the same formula you had for $SE(\hat y)$ plus a "1," which would actually be $s^2$ if you distributed the leading factor of $s$ through the radical. The extra "1" under the square root sign means that prediction intervals are always wider than confidence intervals. Everything else under the square root sign means that prediction intervals are always wider than they would be if you knew the equation of the true population regression line. The 95% prediction interval when ADVEXP=2500 is (93263, 155662), or from \$93 to \$156 million.

##### Interpolation, Extrapolation, and the Intercept

Prediction and confidence intervals both get very wide when $x^*$ is far from $\bar x$. Using your model to make a prediction at an $x^*$ that is well beyond the range of the data is known as extrapolation. Predicting within the range of the data is known as interpolation. Wide confidence and prediction intervals are the regression model's way of warning you that extrapolation is dangerous. An additional problem with extrapolation is that there is no guarantee that the relationship will follow the same pattern outside the observed data that it follows inside. For example a straight line may fit well for the observed data but for larger values of $X$ the true relationship may not be linear. If this is the case then an extrapolated prediction will be even worse than the wide prediction intervals indicate.
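The difference between the two kinds of interval, and the way both widen under extrapolation, can be seen by evaluating the two standard error formulas at several values of $x^*$. This sketch again plugs in the summary numbers reported in Figure 5.4; the function names and the particular $x^*$ values are chosen only for illustration, and the multiplier 2 stands in for the exact t multiplier that software would use.

```python
import math

def se_fit(s, n, x_star, x_bar, s_x):
    """SE for the regression line at x_star (confidence interval)."""
    return s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / ((n - 1) * s_x ** 2))

def se_pred(s, n, x_star, x_bar, s_x):
    """SE for an individual observation at x_star (prediction interval)."""
    return s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / ((n - 1) * s_x ** 2) + 1)

# Values as reported in Figure 5.4 (sales and advertising in $1000's).
s, n = 14503.29, 36
x_bar, s_x = 1804.444, 386.096

# 1804.444 is the mean of ADVEXP; 4000 is an illustrative value far from the mean.
for x_star in (1804.444, 2500, 4000):
    ci_half_width = 2 * se_fit(s, n, x_star, x_bar, s_x)
    pi_half_width = 2 * se_pred(s, n, x_star, x_bar, s_x)
    print(x_star, round(ci_half_width), round(pi_half_width))
```

At every $x^*$ the prediction half-width exceeds the confidence half-width, and both grow as $x^*$ moves away from the mean of ADVEXP.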

The notion of extrapolation explains why we rarely interpret the intercept term in a regression model. The intercept is often described as the expected $Y$ when $X$ is zero. Yet we very often encounter data where all the $X$'s are far from zero, so predicting $Y$ when $X = 0$ is a serious extrapolation. Thus it is better to think of the "intercept" term in a regression model as a parameter that adjusts the height of the regression line than to give it a special interpretation.

5.3 Checking Regression Assumptions

It is important to remember that the regression model is only a model, and the inferences you make from it (i.e. prediction intervals, confidence intervals, and hypothesis tests) are subject to criticism if the model fails to fit the data reasonably well. We can write the regression model as

$$Y \sim N(\hat{Y}, \sigma), \quad \text{where } \hat{Y} = \beta_0 + \beta_1 X.$$

This simple equation contains four specific assumptions, known as the regression assumptions. In order of importance they are:

1. Linearity ($\hat{Y}$ depends on $X$ through a linear equation)
2. Constant variance ($\sigma$ does not depend on $X$)
3. Independence (The equation for $Y$ does not depend on any other $Y$'s)
4. Normality (Observations are normally distributed about the regression line)

This section is about checking for violations of these assumptions, what happens if each one is violated, and possible means of correcting violations.

The main tool for checking the regression assumptions is a residual plot, which plots the residuals vs. either $X$ or $\hat{Y}$. It doesn't matter which, because $\hat{Y}$ is just a linear function of $X$. For simple regressions we typically put $X$ on the X axis of a residual plot. In Chapter 6 there will be several $X$ variables, so we will put $\hat{Y}$ (a one number summary of all the $X$'s) on the X axis instead.

Why look at residuals? The residuals are what is left over after the regression has done all the explaining that it possibly can. Thus if the model has captured all the patterns in the data, the residual plot will just look like random scatter. If there is a pattern in the residual plot, that means there is a pattern in the data that the model missed, which implies that one of the regression assumptions has been violated. When we detect an assumption violation we want to do something to get the pattern out of the residuals and into the model where we can actually use it to make better predictions. The "something" is usually either changing the scale of
the problem by transforming one of the variables, or adding another variable to the model.

5.3.1 Nonlinearity

Nonlinearity simply means that a straight line does a poor job of describing the trend in the data. The hallmark of nonlinearity is a bend in the residual plot. That is, look for a pattern of increasing and then decreasing residuals (e.g. a "U" shape). You can sometimes see strong nonlinear trends in the original scatterplot of $Y$ vs. $X$, but the bend is often easier to see using the residual plot.

Trick: If you are not sure whether you see a bend, try mentally dividing the residual plot into two or three sections. If most of the residuals in one section are on one side of zero, and most in another are on the other side, then you have a bend. You can also test for nonlinearity by fitting a quadratic (i.e. both linear and squared terms) and testing whether the squared term is significant.

Fixing the problem: There are two fixes for nonlinearity.

1. If there is a "U" shape in the scatterplot (or an inverted "U") then fit a quadratic instead of a straight line.
2. If the trend in the data bends, but does not completely change direction, you should consider transforming the $X$ and/or $Y$ variables.

Figure 5.5: "Tukey's Bulging Rule." Some suggestions for transformations you can try to correct for nonlinearity (square roots, logs, reciprocals, squares, and exponentials of $x$ or $y$, arranged in four quadrants). The appropriate transformation depends on how the data are "bulging."

The most popular transformations are those that raise a variable to a power like $X^2$, $X^{1/2}$ (the square root of $X$), or $X^{-1} = 1/X$. There is a sense (see "Not on the test" 5.2) in which $X^0$
corresponds to $\log X$, which for some reason is usually a good transformation to start with.

Not on the test 5.2: Box-Cox transformations

There is a theory for how to transform variables known as Box-Cox transformations. Denote your transformed variable by

$$w = \frac{x^{\alpha} - 1}{\alpha}.$$

If $\alpha \neq 0$ then $w$ is "morally" $x^{\alpha}$. If you remember "L'Hôpital's Rule" from calculus then you can show that as $\alpha \to 0$ then $w \to \log_e x$.

You should think about transformations as stretching or shrinking the axis of the transformed variable. Powers greater than one stretch, and powers less than one shrink. For example, if the trend in the data bends like the upper right quadrant of Figure 5.5 then you can "straighten out" the trend by stretching either the X or Y axes, i.e. by raising them to a power greater than 1. If the trend looks like the lower left quadrant of Figure 5.5 then you want to contract either the X or Y axis. There is something of an art to choosing a good transformation. There is some trial and error involved, and there could be several choices that fit about equally well. Tukey's bulging rule provides some advice on how to get started. Given a choice, we typically prefer transformations that contract the data because they can also limit the influence of unusual data points, which we will discuss in Section 5.4.

The most common way to carry out the transformation is by creating a new column of numbers in your data table containing the transformed variable. Then you can run an ordinary linear regression using the transformed variable.

Nonlinearity Example 1

A chain of liquor stores is test marketing a new product. Some stores in the chain feature the product more prominently than others. Figure 5.6 shows weekly sales figures for several stores plotted against the number of "display feet" of space the product was given on each store's shelves. There is an obvious bend in the relationship between display feet and sales, which you could characterize as "diminishing returns." The bend does not change direction, so we prefer to capture it using a transformation rather than fitting a quadratic. The bend looks like the upper left quadrant of Figure 5.5, so we could straighten it out by either stretching the Y axis or compressing the X axis. Let's start with a transformation from DisplayFeet to LogDisplayFeet. Figure 5.6(c) shows what the scatterplot looks like when we plot Sales vs. LogDisplayFeet, along
with the estimated regression line. The log transformation changes the scale of the problem from one where a "straight line" assumption is not valid to another scale where it is. The curved line in Figure 5.6(a) shows the same estimated regression line but plotted on the original scale.

Figure 5.6: Relationship between amount of display space allocated to a new product and its weekly sales. (a) Original scale showing linear regression and log transform. (b) Residual plot from linear regression. (c) Linear regression on transformed scale.

The log transformation fits the data quite nicely, but there are other transformations that would fit just as well. Figure 5.7 compares computer output for the log and reciprocal ($1/X$) transformations. Graphically, both transformations seem to fit the data pretty well. Numerically, $R^2$ is about the same for both models.⁴ It is a little higher for the reciprocal model, but not enough to get excited about. So which transformation should we prefer?

⁴ You can compare $R^2$ for these models because $Y$ has not been transformed. Once you transform $Y$ then $R^2$ is measured differently for each model. I.e. it doesn't make sense to compare percent variation explained for log(sales) to the percent variation explained for 1/sales.

We want the model to be as interpretable as possible. That means we want a transformation that fits the data well, but also has some economic or physical justification. The question is, what do we think the relationship between Sales and DisplayFeet will look like as DisplayFeet grows? If we think sales will continue to rise, albeit at a slower and slower rate, then the log model makes more sense. If we think that sales will soon plateau, and won't increase no matter how many feet of display space the product receives, then the reciprocal model is more appropriate. Statistics can tell you that both models fit about equally well. The choice between them is based on your understanding of the context of the problem.⁵

⁵ That's good news for you. Your ability to use judgments in situations like this makes you more valuable than a computer. Human judgment, even more than Arnold Schwarzenegger or Keanu Reeves, will keep computers from taking over the earth.

Let's suppose we prefer the reciprocal model. Now that we have it, what can we
do with it?

Log model: RSquare 0.815349, RMSE 41.3082

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 83.560256 | 14.41344 | 5.80 | <.0001 |
| Log(DispFeet) | 138.62089 | 9.833914 | 14.10 | <.0001 |

Reciprocal model: RSquare 0.826487, RMSE 40.04298

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 376.69522 | 9.439455 | 39.91 | <.0001 |
| 1/(DisplayFeet) | -329.7042 | 22.51988 | -14.64 | <.0001 |

Figure 5.7: Comparing the log (heavy line) and reciprocal (lighter line) transformations for the display space data.

Suppose that the primary cost of stocking the product is an opportunity cost of \$50/foot. That is, the other products in the store collectively generate about \$50 in sales per foot of display space. How much of the new product should we stock? If we stock $x$ display feet then our marginal profit will be

$$\pi(x) = \underbrace{\beta_0 + \beta_1/x}_{\text{extra sales revenue}} - \underbrace{50x}_{\text{opportunity cost}}.$$

If you know some calculus⁶ you can figure out the optimal value of $x$ is

$$x = \sqrt{\frac{-\beta_1}{50}}.$$

Our estimate for $\beta_1$ is $b_1 = -329$, so our best guess at the optimal display amount is $\sqrt{329/50} \approx 2.5$ feet. How sure are we about this guess? Well $\beta_1$ is plausibly between $-329 \pm 2(22.5) = (-374, -284)$, so if we plug these numbers into our formula for the optimal $x$ we find it is between 2.38 and 2.73 feet.

⁶ If not then you can hire one of us to do the calculus for you, for an exorbitant fee!
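The display-space calculation can be reproduced in a few lines. This is a minimal sketch, assuming the profit function $\pi(x) = \beta_0 + \beta_1/x - 50x$ from the text: setting its derivative $-\beta_1/x^2 - 50$ to zero gives $x = \sqrt{-\beta_1/50}$. The function and variable names are made up for the illustration; the coefficient values are the unrounded reciprocal-model estimates, so the results differ slightly in the second decimal from the hand calculation that uses rounded numbers.

```python
import math

OPPORTUNITY_COST = 50.0  # dollars per display foot, as stated in the text

def profit(x, b0, b1):
    """Estimated weekly profit: fitted sales b0 + b1/x minus the opportunity cost."""
    return b0 + b1 / x - OPPORTUNITY_COST * x

def optimal_feet(b1):
    """Solve -b1/x**2 - 50 = 0 for x. Requires b1 < 0."""
    return math.sqrt(-b1 / OPPORTUNITY_COST)

# Reciprocal-model estimates and the standard error of the slope.
b0, b1, se_b1 = 376.69522, -329.7042, 22.51988

best = optimal_feet(b1)
low = optimal_feet(b1 + 2 * se_b1)   # slope closer to zero -> fewer feet
high = optimal_feet(b1 - 2 * se_b1)  # slope more negative -> more feet

print(f"best guess: {best:.2f} feet, plausible range: {low:.2f} to {high:.2f} feet")
print(f"estimated profit at the best guess: {profit(best, b0, b1):.2f}")
```

Because $\beta_1$ is negative, the profit curve is concave, so the stationary point found here is a maximum rather than a minimum.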

For budgeting purposes we want to know the long run average weekly sales we can expect if we give the product its optimal space. Our best guess for weekly sales is $376.69522 - 329.7042/2.5 = \$244.81$, but that could range between \$232.62 and \$257.00. These numbers come from our confidence interval for the regression line at $x^* = 2.5$, which we obtained from the computer. Will it be profitable to stock the item over the long run? Stocking the product at 2.5 feet of display space will cost us \$125 per week in opportunity cost, so we estimate that the long run average weekly profit is somewhere between \$107.62 and \$132.00. That means we're pretty sure the product will be profitable over the long haul. Finally, the 95% prediction interval for a given week's sales with 2.5 feet of display space goes from \$163 to \$326. If a store sells less than \$163 of the product in a given week they might receive a visit from the district manager to see what the problem is. If they sell more than \$326 then they've done really well.

Nonlinearity Example 2

While we've been learning about correlation and regression, our HR director from Chapter 4 has been off negotiating with her labor union. In Chapter 4 the HR director determined that a life insurance benefit was too expensive to include in management's initial offer. The negotiating team thought they had a deal, but when it was presented to the union for a vote the deal faced stiff resistance from the oldest union members. Now she's wondering if it is time to offer the benefit, despite the cost, in order to disproportionately sway the older union members.

Figure 5.8 shows a scatterplot from a simple random sample of union members that the HR director obtained before the labor negotiations. The plot shows the relationship between a union member's age and the amount they pay in life insurance. The HR director fits a linear regression of LifeCost on Age and notices that the regression line has a positive slope with a significant p-value. That means older employees pay more for life insurance, so the oldest employees ought to support her proposed benefit, right?

It would, if we believed in the linear relationship. Examine the residual plot in Figure 5.8. Most of the residuals in the middle third of the plot are positive, and most of the residuals in the other two thirds are negative. That is evidence of a nonlinear relationship.

The quadratic model from Figure 5.8 is a much better fit to these data. Notice that the quadratic term has a significant p-value, which says that it is doing a useful amount of explaining. Also notice that $R^2$ is much higher for the quadratic model than for the linear model. In Chapter 6 we will learn that $R^2$ ALWAYS increases when you add a variable to the model (as we did with the quadratic term). Thus we couldn't justify the quadratic model if $R^2$ were only a little bit higher. The significant p-value for the quadratic term says that $R^2$ went
up by enough to justify the term's inclusion. Figure 5.9 plots the residuals from the quadratic model. The residuals are more evenly scattered throughout the plot, particularly at the edges, which makes us comfortable that we've captured the bend.

Linear model: RSquare 0.073527, RMSE 195.3152

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 258.87651 | 105.6601 | 2.45 | 0.0173 |
| Age | 5.0732059 | 2.3444 | 2.16 | 0.0345 |

Quadratic model: RSquare 0.300978, RMSE 171.1107

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 360.62865 | 95.48353 | 3.78 | 0.0004 |
| Age | 4.6138293 | 2.05667 | 2.24 | 0.0287 |
| (Age-43.787)^2 | -0.717539 | 0.16517 | -4.34 | <.0001 |

Figure 5.8: Output from linear and quadratic regression models for insurance premium regressed on age. The residual plot is from the linear model.

The quadratic model differs from the linear model because it says that older members pay more for life insurance up to a point, then their monthly premiums begin to decline. The equation for the quadratic model is

$$\text{Cost} = 360.63 + 4.6(\text{Age}) - 0.72(\text{Age} - 43.787)^2.$$

The computer centers the quadratic term around the average age in the data set (i.e. $\bar{x} = 43.787$) to prevent something called collinearity that we will learn about in Chapter 6. A bit of calculus (or a HUGE consulting fee) shows that if you have a quadratic equation written as $a(x - \bar{x})^2 + bx + c$ then the optimal value of $x$ is $\bar{x} - (b/2a)$. Thus our regression model predicts that 47 year old union employees pay the most for life insurance. Perhaps the older employees locked into their life insurance premiums long ago.
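The turning point of the centered quadratic can be checked numerically. The sketch below assumes the rounded coefficients from the fitted equation, Cost = 360.63 + 4.6(Age) − 0.72(Age − 43.787)², and applies $\bar{x} - b/(2a)$; the function names are made up for the illustration.

```python
def turning_point(a, b, x_bar):
    """x at which a*(x - x_bar)**2 + b*x + c peaks (a < 0) or bottoms out (a > 0)."""
    return x_bar - b / (2 * a)

# Rounded coefficients from the fitted quadratic equation in the text.
a = -0.72       # coefficient on (Age - 43.787)^2
b = 4.6         # coefficient on Age
c = 360.63      # intercept
x_bar = 43.787  # average age, used to center the squared term

def predicted_cost(age):
    """Fitted life insurance cost at a given age."""
    return c + b * age + a * (age - x_bar) ** 2

peak_age = turning_point(a, b, x_bar)
print(f"predicted cost peaks at about age {peak_age:.1f}")
print(f"fitted cost at 37, at the peak, and at 57: "
      f"{predicted_cost(37):.0f}, {predicted_cost(peak_age):.0f}, {predicted_cost(57):.0f}")
```

Since the squared term has a negative coefficient, the fitted cost falls off on both sides of the peak, which is the pattern the linear model could not capture.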

The old geezers standing in the way of a deal probably wouldn't be swayed by a life insurance benefit.

Figure 5.9: Residuals from the quadratic model fit to the insurance data.

5.3.2 Non-Constant Variance

Non-constant variance⁸ means that points at some values of $X$ are more tightly clustered around the regression line than at other values of $X$. To check for non-constant variance look for a funnel shape in the residual plot, i.e. residuals close to zero to start with and then further from zero for larger $X$ values. Often as $X$ increases the variance of the errors will also increase.

⁸ Also known as heteroscedasticity, although the term is currently out of favor with statisticians because we've learned that 8 syllable words make people not like us. At least we're assuming that's the reason.

Trick: Try mentally dividing the plot in half. If the residuals on one half seem tightly clustered, while the residuals on the other half seem more scattered, then you have non-constant variance.

Consequence: There are two consequences of non-constant variance. First, points with low variance should be weighted more heavily than points with high variance, so the regression coefficients are not estimated as precisely as they could be. More seriously, all the formulas for prediction intervals, confidence intervals, and hypothesis tests that we like to use for regression depend on $s$. If the residuals have non-constant variance then $s$ is meaningless because it doesn't make sense to summarize the variance with a single number.

Fixing the Problem: If $Y$ is always positive, try transforming it to $\log(Y)$. There is another solution called weighted least squares which you may read about, but we will not discuss.
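The "divide the plot in half" trick can also be done with numbers instead of by eye. The following is an informal sketch, not a formal test for non-constant variance: it sorts the residuals by $X$, splits them at the middle, and compares the standard deviation of each half. The data are hypothetical and the function name is made up for the illustration.

```python
import statistics

def spread_by_half(x, residuals):
    """Return (sd of residuals in the lower half of x, sd in the upper half)."""
    pairs = sorted(zip(x, residuals))   # order the residuals by their x value
    mid = len(pairs) // 2
    left = [r for _, r in pairs[:mid]]
    right = [r for _, r in pairs[mid:]]
    return statistics.stdev(left), statistics.stdev(right)

# Hypothetical residuals that fan out as x grows (a funnel shape).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
residuals = [0.5, -0.4, 0.6, -0.7, 0.5, -3.0, 4.0, -5.0, 6.0, -6.5]

sd_left, sd_right = spread_by_half(x, residuals)
print(f"left sd = {sd_left:.2f}, right sd = {sd_right:.2f}, "
      f"ratio = {sd_right / sd_left:.1f}")
```

A ratio far from 1 suggests that a single $s$ is a poor summary of the scatter around the line. How large a ratio should worry you is a judgment call, especially with few observations.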

Figure 5.10: Cleaning data. (a) Regression line fit to raw data and $\log y$ transformation. (b) Regression of $\log y$ on $\log x$. (c) Regression of $\sqrt{y}$ on $\sqrt{x}$.

Example

Goldilocks Housekeeping Service is a contractor specializing in cleaning large office buildings. They want to build a model to forecast the number of rooms that can be cleaned by $X$ cleaning crews. They have collected the data shown in Figure 5.10. The data exhibit obvious non-constant variance (the points on the right are far more scattered than those on the left). The company tries to fix the problem by transforming to log(rooms), but that creates a non-linear effect because there was roughly a straight line trend to begin with. Linearity is restored when the company regresses log(rooms) on log(crews), but then it looks like the non-constant variance goes the other way because the points on the left side of Figure 5.10(b) seem more scattered than those on the right. Goldilocks replaces the log transformations with square roots, which don't contract things as much as logs do. Figure 5.10(c) shows the results, which look just right. When plotted on the data's original scale, the regression line for $\sqrt{\text{rooms}}$ vs. $\sqrt{\text{crews}}$ looks almost identical to the line for rooms vs. crews.

So what has Goldilocks gained by addressing the non-constant variance in the data? Consider Figure 5.11. The left panel shows the prediction intervals from the data modeled on the raw scale. Notice how the prediction intervals are much wider than the spread of the data when the number of crews is small, and narrower than they need to be when the number of crews is large. Figure 5.11(b) shows prediction intervals constructed on the transformed scale where the constant variance assumption is much more reasonable. The prediction intervals in Figure 5.11(b) track the variation in the data much more closely than in panel (a).
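One way to see why intervals built on a square-root scale follow a funnel shape is to back-transform a few of them by hand. The sketch below uses hypothetical predictions and a hypothetical, constant half-width on the root scale (none of these numbers come from the cleaning data), and squares the endpoints to return to the original scale.

```python
def back_transform(center, half_width, power=2):
    """Map an interval built on a root scale back to the original scale."""
    lower = center - half_width
    upper = center + half_width
    if lower < 0:
        raise ValueError("interval extends below zero on the root scale")
    return lower ** power, center ** power, upper ** power

# Hypothetical predictions on the square-root scale, all with the same half-width.
for root_prediction in (3.0, 6.0, 9.0):
    low, mid, high = back_transform(root_prediction, half_width=1.0)
    print(f"prediction {mid:5.1f}: interval ({low:5.1f}, {high:5.1f}), "
          f"width {high - low:4.1f}")
```

Equal-width intervals on the transformed scale become wider on the original scale as the prediction grows, which is the behavior wanted when the scatter in the data grows with $X$.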

Figure 5.11: The effect of non-constant variance on prediction intervals. Panel (a) is fit on the original scale. The model in panel (b) is fit on the transformed scale, then plotted on the original scale.

5.3.3 Dependent Observations

"Dependent observations" usually means that if one residual is large then the next one is too. To find evidence of dependent observations, look for tracking in the residual plot. The best way to describe tracking is when you see a residual far above zero, it takes many small steps to get to a residual below zero, and vice versa. The technical name for this "tracking" pattern is autocorrelation. The good news is that autocorrelation only occurs with time series data, so if you don't have time series data you don't have to worry about this one.

Trick: The $X$ variable in your plot must represent time in order to see tracking. This becomes even more important when we deal with several $X$ variables in multiple regression. Unfortunately, it can sometimes be difficult to distinguish between autocorrelation and non-linearity.

Consequence: If you see tracking in the residual plot then that means today's residual could be used to predict tomorrow's residual. This means you could be doing a better job than you're doing by just including the long run trend (i.e. a regression where your $X$ variable is time).

The obvious thing to do here is to put "today's" residual in the model somehow. We will learn how to put both trend and autocorrelation in the model when we learn about multiple regression in Chapter 6. For now, the best way to deal with autocorrelation may be to simply regress $y_t$ on the lag variable $y_{t-1}$. You can think of a lag variable as "yesterday's $y$." Regressing
a variable on its lag is sometimes called an autoregression.

Figure 5.12: Cell phone subscriber data (a) on raw scale with log and square root transformations, (b) after $y^{1/4}$ transformation, (c) residuals from panel (b).

Example

Figure 5.12 plots the number of cell phone subscribers, industry wide, every six months from December 1984 to December 1995. Nonlinearity is the most serious problem with fitting a linear regression to these data. Figure 5.12(a) shows the results of fitting a linear regression to $\log y$, which bends too severely, and $\sqrt{y}$, which doesn't bend quite enough. Figure 5.12(b) shows the scatterplot obtained by transforming Subscribers to the 1/4 power. That's not a particularly interpretable transformation, but it definitely fixes our problem with nonlinearity. After all, it has $R^2 = .997$, the highest we've seen so far.

Figure 5.12(c) shows the residual plot obtained after fitting a linear model to Figure 5.12(b). There is definite tracking in the residual plot, which is evidence of autocorrelation.

If you wanted to forecast the number of cell phone subscribers in June 1996 (the next time period, period 24) using these data then it looks like a regression based on Figure 5.12(b) would do a very good job. However you could do even better if you could also incorporate the pattern in Figure 5.12(c), where tomorrow's residual is correlated with today's residual. That pattern says that you should probably increase your forecast for period 24. That's because the residuals for the last few time periods have been positive, so the next residual is likely to be positive as well.

Consider Figure 5.13, which shows output from the regression of subscribers to the 1/4 power on a lag variable. Both models have a very high $R^2$. The model based on the long run trend over time has $R^2 = .997$, while the autoregression has $R^2 = .999$. That doesn't seem like a big change in $R^2$, but notice that $s$ in the autoregression is only half as large as it is in the "long run trend" model. Perhaps the best reason to prefer the lag model is that when you plot its residuals vs. time,
as in Figure 5.13(b), there is a much weaker pattern than the residual plot in Figure 5.12(c).

| Model | RSquare | RMSE | N |
|---|---|---|---|
| Lag variable | 0.999063 | 0.51842 | 22 |
| Trend | 0.996987 | 0.972197 | 23 |

Figure 5.13: Output for the lag model compared to the trend model in Figure 5.12. (a) Subscribers to the 1/4 power vs. its lag. (b) Plot of residuals vs. time. (c) Summary of fit for the lag model. (d) Summary of fit for the trend model.

| Model | Point Estimate | Lower 95% Prediction | Upper 95% Prediction |
|---|---|---|---|
| Lag | 39.347 | 37.023 | 41.778 |
| Trend | 34.276 | 30.498 | 38.394 |

Table 5.1: Point and interval forecasts (in millions) for the number of cell phone subscribers in June 1996 based on the trend model and the autoregression.

Table 5.1 gives the predictions for each model, obtained by calculating point estimates and prediction intervals on the $y^{1/4}$ scale, then raising them to the fourth power. The small differences on the fourth root scale translate into predictions that differ by several million subscribers. The autoregression predicts more subscribers than the trend model, which is probably accurate given that the last few residuals in Figure 5.12(c) are above the regression line. The prediction intervals for the autoregression are tighter than the trend model. It looks like the autoregression
can predict to within about ±2 million subscribers, whereas the trend model can predict to within about ±4 million. The period 24 point estimate for the trend model is only very slightly larger than the actual observed value for period 23.

5.3.4 Non-normal residuals

Normality of the residuals is the least important of the four regression assumptions. That's because there is a "central limit theorem" for regression coefficients just like the one for means. Therefore your intervals and tests for $b_1$ and the confidence interval for the regression line are all okay, even if the residuals are not normally distributed. Non-normal residuals are only a problem when you wish to infer something about individual observations (e.g. prediction intervals). If the residuals are non-normal then you can't assume that individual observations will be within $\pm 2\sigma$ of the population regression line 95% of the time.

You check the normality of the residuals using a normal quantile plot, just like you would any other variable. Unfortunately, JMP doesn't make a normal quantile plot of the residuals by default. You have to save them as a new variable and make the plot yourself.

Figure 5.14: Number of seats in a car as a function of its weight (left panel). Normal quantile plot of residuals (right panel).

Example

One place where you can expect to find non-normal residuals is when modeling a discrete response. For example, Figure 5.14 shows that the number of seats in a car tends to be larger for heavier cars. The slope of the line is positive, and it has a significant p-value. The p-value is trustworthy, but it would be difficult to use this model to predict the number of seats in a 3000 pound car. You could estimate
the average number of seats per car in a population of 3000 pound cars, but for an individual car the number of seats is certainly not normally distributed, so our technique for constructing a prediction interval wouldn't make sense.

5.4 Outliers, Leverage Points and Influential Points

In addition to violations of formal regression assumptions, you should also check to see if your analysis is dominated by one or two unusual points. There are two types of unusual data points that you should be aware of.

1. An outlier is a point with an unusual $Y$ value for its $X$ value. That is, outliers are points with big residuals.
2. A high leverage point is an observation with an $X$ that is far away from the other $X$'s.

An influential point, the subject of Section 5.4.3, is an observation whose removal from the data set would substantially change the fitted line. It is possible to have a point that is both a high leverage point and an outlier. Such points are guaranteed to have substantial influence over the regression line.

Figure 5.15: Outliers (two in panel (a)) are points with big residuals. Leverage points (one in panel (b)) have unusual $X$ values. None of the points influence the fitted regression lines. Both plots show regression lines with and without the unusual points.

Outliers and high leverage points affect your regression model differently. Regardless of whether an outlier or high leverage point influences the fitted line, these observations can make a serious impact on the standard errors we use
The two outliers in Figure 5. Keep in mind that you expect about 5% of the residuals to be more than ±2 residual standard deviations from zero. and for an individual observation y ∗ . and only 0. However. RMSE is roughly a factor of 4. How big does a residual have to be before we call it an outlier? There is no hard and fast rule. .16 shows the computer output for the two regression lines in Figure 5.15(a) shows the relationship between average house price (\$thousands) and per-capita monthly income (\$hundreds) for 50 ZIP codes obtained from the 1990 Census. The outliers have very little eﬀect on the regression line.660 vs.15(a).5 larger if the outliers are included in the data set. but the slope of the line is nearly unchanged. the residual standard deviation. Outliers inﬂate s. particularly linearity. That means high leverage points make us more sure about our inferences. OUTLIERS. Indeed the estimated regression equations are very similar. Because SE (b1 ) is increased. We have discussed three types of standard errors in this Chapter: for the slope of the line. LEVERAGE POINTS AND INFLUENTIAL POINTS 111 in constructing conﬁdence intervals. between the high leverage point and the rest of the data. which increases standard errors and makes us less sure about things. That extra certainty comes at a cost: we can’t check the assumptions of the model.5.4. even though the slope is highly signiﬁcant without the outliers. . but you should measure the residuals relative to s.069) and the standard error for the slope. High leverage points inﬂate sx . the t-statistic and the p-value for the slope are insigniﬁcant when the outliers are included. The intern entered “0” for the observations where the house price was unavailable.15(a) shows regression lines ﬁt with and without the outliers generated by the intern recording 0 for the house prices. They decrease the intercept somewhat. The eﬀects of the increased RMSE can be seen in R2 (. the standard deviation of the x’s. 
for the y value of the regression line at a point x∗ . which is directly proportional to s. Outliers typically do not aﬀect the ﬁtted line very much unless they are also high leverage points.4. prediction intervals. Figure 5. and sx .3% of the points to be more than ±3s from zero. SE (b1 ) = √ s n−1 SE (ˆ y) = s 1 (x∗ − x ¯)2 + 2 n sx (n − 1) SE (y ∗ ) = s 1 (x∗ − x ¯)2 + 2 +1 n sx (n − 1) sx Note that all three standard errors depend on s.15(a) are observations with typical income levels. and hypothesis tests. so they are not high leverage points. The data were collected by an intern who was unable to locate the average house price for two of the ZIP codes. 5. which is in the denominator of all three standard error formulas. Figure 5.1 Outliers Figure 5.
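Measuring residuals relative to $s$ is easy to automate. The sketch below is an illustration on hypothetical residuals (28 ordinary ones and two large ones), not the house price data. It computes $s$ with $n - 2$ degrees of freedom, as in simple regression, and flags residuals more than three residual standard deviations from zero. The function name and the threshold argument are assumptions made for the example; as noted above, there is no hard and fast cutoff.

```python
import math

def flag_outliers(residuals, threshold=3.0):
    """Return (indices of residuals beyond threshold * s, s) for a simple regression."""
    n = len(residuals)
    # Residual standard deviation s: n - 2 degrees of freedom (slope and intercept).
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    flagged = [i for i, r in enumerate(residuals) if abs(r) > threshold * s]
    return flagged, s

# Hypothetical residuals: 28 ordinary values followed by two large ones.
residuals = [1.0, -1.0] * 14 + [-15.0, 15.0]

flagged, s = flag_outliers(residuals)
print(f"s = {s:.2f}; flagged positions: {flagged}")
print(f"largest residual is {max(abs(r) for r in residuals) / s:.1f} "
      f"residual standard deviations from zero")
```

Notice that the outliers themselves inflate $s$, so a point has to be quite extreme before it exceeds $3s$ in a small data set. Refitting without the flagged points and comparing the two values of $s$ shows how much the outliers were costing you in certainty.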

Full Data Set: RSquare 0.069002, RMSE 46.51617, N 50

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 139.91994 | 40.26643 | 3.47 | 0.0011 |
| INCOME | 5.3208392 | 2.821008 | 1.89 | 0.0653 |

Outliers Excluded: RSquare 0.660041, RMSE 10.66548, N 48

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 137.53823 | 9.256721 | 14.86 | <.0001 |
| INCOME | 6.1341704 | 0.649089 | 9.45 | <.0001 |

Figure 5.16: Regression output for the house price data.

The two 0's in Figure 5.15(a) are between 4 and 5 residual standard deviations from zero, so they are clearly outliers. Generally speaking, outliers don't change the fitted line by very much unless they are also high leverage points, but they can affect the certainty of our inferences a great deal.

5.4.2 Leverage Points

High leverage points get their name from their ability to exert undue "leverage" on the regression line. Imagine each data point is attached to the line by a spring. The farther a point is from the regression line the more force its spring is exerting. When an observation has an extreme $X$ value, all the other observations collectively act like the fulcrum of a lever centered at $\bar{x}$. The farther a high leverage point is from $\bar{x}$, the less work its spring has to do to pull the line towards it, just like it was pulling on a very long lever.

There is one month in Figure 5.15 where the value weighted stock market index lost roughly 20% of its value. That is the infamous 1987 crash where the Dow fell over 500 points in a single day. That would be a large one day decline today (when the Dow is at about 9000). It was catastrophic in 1987, when the Dow was trading at just over 2000, and people were marveling that it was that high.

October 1987 may have been disastrous for the stock market, but it is pretty innocuous as a data point. Figure 5.17 shows the computer output for the models fit with and without October 1987. The slope and intercept of the line barely move when the point is added or deleted. RMSE is virtually unchanged, though $R^2$ is a little higher. The high leverage point has the expected effect on $SE(b_1)$, but the effect is minor. The point is a high leverage point. October 1987 happens to have a $Y$ value which is right where the regression line predicts it would be, so its spring isn't pulling the
426643 RMSE 0. .0001 RSquare 0. 2/n). denoted hi . The next Section shows an of a high leverage point that moves the line a lot.3 Inﬂuential Points This last example of unusual points shows an instance of a high leverage point that does move the line a lot.380581 RMSE 0.86 0.003917 -0.17 plots the leverages for the CAPM data set.0890647 0. but you should know that it depends entirely on xi . LEVERAGE POINTS AND INFLUENTIAL POINTS Full Data: Term Estimate Intercept -0.3931 0. that can be calculated to determine the leverage of each point.002627 0. October 1987 stands out as a high leverage point. Some computer programs (but not JMP) warn you if a data point has hi greater than three times the average leverage (i.4.17: Computer output for the CAPM model ﬁt with and without the high leverage point.056311 N 227 Summary of leverages: Mean 0.086639 12.003307 VW Return 1.76 Prob>|t| 0.123536 113 Std Error t Ratio Prob>|t| 0. The farther xi is from x ¯.97 <. OUTLIERS.003864 -0.67 VW Return 1.4. Sure enough. even in a “Not on the test” box.056324 N 228 Leverage Point Excluded Term Estimate Std Error t Ratio Intercept -0.0001 RSquare 0. Figure 5.5031 <. line very hard.0087719 N 228 Figure 5. There is an actual number.9 You don’t need to know the formula for hi (but see page 114 if you’re interested). Each hi is always between 1/n and 1. the more leverage for the point.5. 9 The letter h is used for leverages because they are computed from something called the “hat matrix” which is beyond our scope.092626 11. and the hi ’s for all the points sum to 2. 5.e.

A construction company that builds beach cottages typically builds them between 500 and 1000 square feet in size. Recently an order was placed for a 3500 square foot "cottage," which yielded a healthy profit. The company wants to explore whether they should be building more large cottages. Data from the last 18 cottages built by the company (roughly a year's work) are shown in Figure 5.18.

5.4.4 Strategies for Dealing with Unusual Points

When a point appreciably changes your conclusions you can perform your analyses with and without the point and report both results. Or use transformations to work on a scale where the point is not as influential. For example if a point has a large x value and we transform x by taking log(X) then log(X) will not be nearly as large.

Okay to delete unusual points if:
- The point was recorded in error, or
- The point has a big impact on the model, and you only want to use the model to predict "typical" future observations.

Not okay to delete unusual points:
- Just because they don't fit your model. Fit the model to the data, not vice versa.
- When you want to predict future observations like the unusual point (e.g. large cottages).

Not on the test 5.3: Why leverage is "Leverage"

For simple regression the formula for leverage is

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{(n-1)s_x^2}.$$

You may recognize this formula as the thing under the square root sign in the formula for $SE(\hat{y})$. For multiple regression the formula for $h_i$ becomes sufficiently complicated that we can't write it down without "matrix algebra," which is beyond our scope.

The intuition behind leverage is that the fitted regression line is very strongly drawn to high leverage points. That means high leverage points tend to have smaller residuals than typical points, so the residual for a high leverage point ought to have smaller variance than the other points. It turns out that $Var(e_i) = s^2(1 - h_i)$, where $e_i$ is the i'th residual. If an observation had the maximum leverage of 1 then $Var(e_i) = 0$ because its pull on the fitted line is so strong that the line is forced to go directly through the leverage point. Most observations have $h_i$ close to 0, so that $Var(e_i) \approx s^2$.
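The leverage formula for simple regression is easy to check numerically. The sketch below is only an illustration: the cottage sizes are made-up values (seventeen ordinary cottages plus one 3500 square foot order), not the company's data, and the function name `leverages` is ours. It computes each $h_i$, confirms that the leverages sum to 2, and applies the "three times the average leverage" warning rule.

```python
from statistics import mean, variance

def leverages(x):
    """Leverage h_i = 1/n + (x_i - xbar)^2 / ((n - 1) * s_x^2) for simple regression."""
    n = len(x)
    xbar = mean(x)
    sxx = (n - 1) * variance(x)  # (n - 1) * s_x^2, the sum of squared deviations
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Hypothetical cottage sizes in square feet (NOT the data behind Figure 5.18):
# seventeen ordinary cottages plus one very large one.
sizes = [500, 550, 600, 620, 650, 700, 720, 750, 780, 800,
         820, 850, 880, 900, 950, 980, 1000, 3500]
h = leverages(sizes)
cutoff = 3 * (2 / len(sizes))  # three times the average leverage of 2/n

for size, hi in zip(sizes, h):
    flag = "  <-- high leverage" if hi > cutoff else ""
    print(f"{size:5d}  h = {hi:.4f}{flag}")
print("sum of leverages:", round(sum(h), 6))
```

With these made-up sizes only the large cottage is flagged, which is the same qualitative picture the text describes for the real data.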

[Plots: panels (a), (b), (c)]

Point Included:
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    -416.8594    1437.015     -0.29     0.7755
SqFeet       9.7505469    1.295848      7.52     <.0001
RSquare 0.779667    RMSE 3570.379    N 18

Point Excluded:
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    2245.4005    4237.249      0.53     0.6039
SqFeet       6.1370246    5.556627      1.10     0.2868
RSquare 0.075205    RMSE 3633.591    N 17

Figure 5.18: Computer output for cottages data with and without the high leverage point. The first two panels plot confidence intervals for the regression line fit (a) with and (b) without the high leverage point. Panel (c) plots $h_i$ for the regression: max $h_i$ = .94680.

5.5 Review

- Correlation and covariance measure the strength of the linear relationship between two variables. Regression models the actual relationship.
- The slope is the most interesting part of the regression.
- It is hard to test whether a correlation is zero, but easy to test whether a population line has zero slope (which amounts to the same thing).


Chapter 6
Multiple Linear Regression

Multiple regression is the workhorse that handles most statistical analyses in the "real world." On one level the multiple regression model is very simple. It is just the simple regression model from Chapter 5 with a few extra terms added to $\hat{y}$. However multiple regression can be more complicated than simple regression because you can't plot the data to "eyeball" relationships as easily as you could with only one Y and one X.

The most common uses of multiple regression fall into three broad categories:

1. Predicting new observations based on specific characteristics.
2. Identifying which of several factors are important determinants of Y.
3. Determining whether the relationship between Y and a specific X persists after controlling for other background variables.

These tasks are obviously related. Often all three are part of the same analysis.

One thing you may notice about this Chapter is that there are many fewer formulas. Multiple regression formulas are most often written in terms of "matrix algebra," which is beyond the scope of these notes. The good news is that the computer understands all the matrix algebra needed to produce standard errors and prediction intervals.

6.1 The Basic Model

Multiple linear regression is very similar to simple linear regression. The key difference is that instead of trying to make a prediction for the response Y based on only one predictor X we use many predictors which we call $X_1, X_2, \ldots, X_p$. In most real life situations Y does not depend on just one predictor so a multiple regression can considerably improve the accuracy of our prediction.

Recall the simple linear regression model is

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i.$$

Multiple regression is similar except that it incorporates all the X variables,

$$Y_i = \beta_0 + \beta_1 X_{i,1} + \cdots + \beta_p X_{i,p} + \varepsilon_i$$

where $X_{i,1}$ indicates the ith observation from variable 1 etc. We have all the same assumptions as for simple linear regression, i.e. $\varepsilon_i \sim N(0, \sigma)$ and independent. The multiple regression model can also be stated as

$$Y_i \sim N(\beta_0 + \beta_1 X_{i,1} + \cdots + \beta_p X_{i,p},\ \sigma),$$

which highlights how the multiple regression model fits in with Chapters 2, 4, and 5.

The regression coefficients should be interpreted as the expected increase in Y if we increase the corresponding predictor by 1 and hold all other variables constant. To illustrate, examine the regression output from Figure 6.1(a), in which a car's fuel consumption (number of gallons required to go 1000 miles) is regressed on the car's horsepower, weight (in pounds), engine displacement, and number of cylinders. Figure 6.1(a) estimates a car's fuel consumption using the following equation

$$Fuel = 11.49 + 0.0089\,Weight + 0.080\,HP + 0.17\,Cylinders + 0.0014\,Disp.$$

You can think of each coefficient as the marginal cost of each variable in terms of fuel consumption. For example, each additional unit of horsepower costs 0.080 extra gallons per 1000 miles.

Figure 6.1(b) provides output from a simple regression where Horsepower is the only explanatory variable. In the simple regression it looks like each additional unit of horsepower costs .18 gallons per 1000 miles, over twice as much as in the multiple regression! The two numbers are different because they are measuring different things. The Horsepower coefficient in the multiple regression is asking "how much extra fuel is needed if we add one extra horsepower without changing the car's weight, engine displacement, or number of cylinders?" The simple regression doesn't look at the other variables. If the only thing you know about a car is its horsepower, then it looks like each horsepower costs .18 gallons of gas. Some of that .18 gallons is directly attributable to horsepower, but some of it is attributable to the fact that cars with greater horsepower also tend to be heavier cars. Horsepower acts as a proxy for weight (and possibly other variables) in the simple regression.
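To make the "marginal cost" reading of the coefficients concrete, here is a small sketch that evaluates the fitted equation from Figure 6.1(a). The coefficients are the rounded values quoted in the text; the car itself (3000 lb, 150 HP, 6 cylinders, 200 units of displacement) is a hypothetical example chosen only for illustration, as is the function name `predicted_fuel`.

```python
# Coefficients as quoted in the text's fitted equation for Figure 6.1(a).
COEF = {"intercept": 11.49, "weight": 0.0089, "hp": 0.080,
        "cylinders": 0.17, "displacement": 0.0014}

def predicted_fuel(weight, hp, cylinders, displacement):
    """Predicted fuel consumption in gallons per 1000 miles."""
    return (COEF["intercept"]
            + COEF["weight"] * weight
            + COEF["hp"] * hp
            + COEF["cylinders"] * cylinders
            + COEF["displacement"] * displacement)

# A hypothetical car; the values are assumptions, not taken from the data set.
base = predicted_fuel(weight=3000, hp=150, cylinders=6, displacement=200)
more_hp = predicted_fuel(weight=3000, hp=151, cylinders=6, displacement=200)

print(f"predicted fuel: {base:.2f} gal/1000mi")
print(f"cost of one extra horsepower: {more_hp - base:.3f} gal/1000mi")
```

Holding weight, cylinders, and displacement fixed while changing only horsepower is exactly the "hold all other variables constant" interpretation described above.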

(a)
Summary of Fit
RSquare 0.852963    RMSE 3.387067    N 111

ANOVA Table
Source   DF    Sum of Sq.   Mean Square   F Ratio    Prob > F
Model      4   7054.3564    1763.59       153.7269   <.0001
Error    106   1216.0559      11.47
Total    110   8270.4123

Parameter Estimates
Term           Estimate    Std Error   t Ratio   Prob>|t|
Intercept      11.49468    2.099403      5.48    <.0001
Weight(lb)     0.0089106   0.001066      8.36    <.0001
Horsepower     0.0804557   0.013982      5.75    <.0001
Cylinders      0.1745649   0.553151      0.32    0.7529
Displacement   0.0014316   0.013183      0.11    0.9137

(b)
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    25.324465   1.494657     16.94    <.0001
Horsepower   0.1808646   0.011501     15.73    <.0001
RSquare 0.690198    RMSE 4.877156    N 113

Figure 6.1: (a) Multiple regression output for a car's fuel consumption (gal/1000mi) regressed on the car's weight (lbs), horsepower, engine displacement and number of cylinders. (b) Simple regression output including only HP.

6.2 Several Regression Questions

Just as with simple linear regression there are several important questions we are going to be interested in answering.

1. Does the entire collection of X's explain anything at all? That is, can we be sure that at least one of the predictors is useful?
2. How good a job does the regression do overall?
3. Does a particular X variable explain something that the others don't?
4. Does a specific subset of the X's do a useful amount of explaining?
5. What Y value would you predict for an observation with a given set of X values and how accurate is the prediction?

6.2.1 Is there any relationship at all? The ANOVA Table and the Whole Model F Test

This question is answered by a hypothesis test known as the "whole model F test." In practical terms the whole model F test asks whether any of the X's help explain Y. The null hypothesis is "no." The alternative hypothesis is "at least one X is helpful." That is, the null hypothesis is $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$ versus the alternative $H_a$: at least one $\beta$ not equal to zero.

An F statistic compares a regression model with several X's to a simpler model with fewer X's (i.e. with the coefficients of some of the X's set to zero). The "simpler model" for the whole model F test has all the slopes set to 0.

So how does an F statistic compare the two models? Remember that the job of a regression line is to minimize the sum of squared errors (SSE). The F statistic checks whether including the X's in the regression reduces SSE by enough to justify their inclusion. In the special case of a regression with no X's in it at all the sum of squared errors is called SST (which stands for "total sum of squares") instead of SSE. That is,

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2.$$

Despite its name, we still think of SST as a sum of squared errors because if all the slopes in a regression are zero then the intercept is just $\bar{y}$. SST measures Y's variance about the average Y, ignoring X. Thus $SST/(n-1)$ is the sample variance of Y from Chapter 1. By contrast,

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

measures the amount of variation left after we use the X's to make a prediction for Y. In other words, $SSE/(n-p-1)$ is the variance of Y around the regression line (aka the variance of the residuals). These are the same quantities that we saw in Chapter 5 except that now we are using several X's to calculate $\hat{y}$.

If SST is the variation that we started out with and SSE is the variation that we ended up with then the difference $SST - SSE$ must represent the amount explained by the regression. We call this quantity the model sum of squares, SSM. It turns out that

$$SSM = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2.$$

If SSM is large then the regression has explained a lot, so we should say that at least one of the X's helps explain Y. How large does it need to be? Clearly we need to standardize SSM somehow. For example if Y measures heights in feet we can make SSM 144 times as large simply by measuring in inches instead (think about why). To get around this problem we divide SSM by our estimate for $\sigma^2$, i.e.

$$s^2 = MSE = \frac{SSE}{n-p-1}$$

where MSE stands for "mean square error."[1]

1 It turns out that we've been using this rule all along. Chapters 1 and 4 had p = 0 so we divided by n − 1. In Chapter 5 we had p = 1, so we divided by n − 2. Now we have p predictors in the model so we divide SSE by n − p − 1.

We also expect SSM to be larger if there are more X's in the model, even if they have no relationship to Y. To get around this problem we also divide SSM by the number of predictors, i.e.

$$MSM = \frac{SSM}{p}$$

where MSM stands for the "Mean Square explained by the Model." (Okay, that's a dumb name. We didn't make it up, but it helps remind you of MSE, to which MSM is compared.) You can think about MSM as the amount of explaining that the model achieves per degree of freedom (aka per X variable in the model).

If we combine these two ideas together, i.e. dividing SSM by the MSE and also by p, we get the F statistic

$$F = \frac{MSM}{MSE} = \frac{SSM/p}{MSE}.$$

The F statistic is independent of units. When F is large we know that at least one of the X variables helps predict Y. The computer calculates a p-value to help us determine whether F is large enough.[2] To compute the p-value the computer needs to know how many degrees of freedom were used in the numerator of F (i.e. for computing MSM) and how many were in the denominator (for computing MSE). The F-statistic in Figure 6.1 has 4 DF in its numerator and 106 in its denominator.

2 The p-value here is the probability that we would see an F statistic even larger than the one we saw, if $H_0$ were true and we collected another data set.

All these sums of squares are summarized in something called an ANOVA (Analysis of Variance) table.

Source   df          SS    MS                      F         p
Model    p           SSM   MSM = SSM/p             MSM/MSE   p-value
Error    n − p − 1   SSE   MSE = SSE/(n − p − 1)
Total    n − 1       SST
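The arithmetic behind the ANOVA table can be reproduced in a few lines. The sketch below takes the two sums of squares printed in Figure 6.1(a) and rebuilds the remaining entries; the function name `anova_table` and the dictionary layout are our own choices for illustration, not part of any statistics package.

```python
def anova_table(ssm, sse, n, p):
    """Return the pieces of the ANOVA table for a regression with p predictors."""
    msm = ssm / p              # mean square for the model: explaining per degree of freedom
    mse = sse / (n - p - 1)    # mean square error: the estimate of sigma^2
    return {
        "SST": ssm + sse,      # total sum of squares
        "MSM": msm,
        "MSE": mse,
        "F": msm / mse,
        "df_model": p,
        "df_error": n - p - 1,
        "df_total": n - 1,
    }

# Sums of squares read from Figure 6.1(a): n = 111 cars, p = 4 predictors.
table = anova_table(ssm=7054.3564, sse=1216.0559, n=111, p=4)
for name, value in table.items():
    print(f"{name:9s} {value:.4f}")
```

The resulting F ratio agrees with the 153.7269 printed in the figure up to rounding of the inputs.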

For example, the ANOVA table in Figure 6.1 tells us that a model with 4 variables had been fit, the number of observations was 111 (111 − 1 = 110), our estimate for $\sigma^2$ is 11.47 (MSE) and MSM is 1763.59. Furthermore, MSM was significantly larger than MSE (the F ratio is 153.7269) and the probability of this happening if none of the 4 variables had any relationship to Y was less than 0.0001. The small p-value says that at least one of the variables is helping to explain Y.

One way to think about the ANOVA table is as the "balance sheet" for the regression model. Think of each degree of freedom as money. You "spend" degrees of freedom by putting additional X variables into the model. The "sum of squares" column explains what you got for your money. If you don't use any degrees of freedom, then you will have SSE = SST. In the cars example, we had 110 degrees of freedom to spend, and we spent 4 of them to move 7054 of our variability from the "unexplained" box (SSE) to the "explained" box (SSM) of the table. We are left with 1216 which remains unexplained.

The "mean square" column tries to decide if we got a good deal for our money. The model mean square (MSM) answers the question, "How much explaining, on average, did you get per degree of freedom that you spent?" If MSM is large then our degrees of freedom were well spent. As usual, the definition of "large" depends on the scale of the problem, so we have to standardize it somehow. It turns out that the right way to do this is to divide by the variance of the residuals (MSE), which gives us the F-ratio.

Don't Get Confused! 6.1: Why call it an "ANOVA table?"

The name "ANOVA table" misleads some people into thinking that the table says something about variances. Actually, the object of an ANOVA table is to say something about means, or in this case a regression model, which is a model for the mean of each Y given each observation's X's. It is called an ANOVA table because it uses variances to see whether our model for means is doing a good job.

6.2.2 How Strong is the Relationship? $R^2$

The ANOVA table and whole model F test try to determine whether any variability has been explained by the regression. $R^2$ estimates how much variability has been explained. Explaining variability works in exactly the same way as for simple regression. We still look at

$$R^2 = 1 - \frac{SSE}{SST} = \frac{SSM}{SST}.$$
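As a quick check on the two equivalent forms of $R^2$, the following sketch plugs in the sums of squares from the ANOVA table in Figure 6.1(a). The variable names are ours; the numbers are the ones printed in the figure.

```python
# Sums of squares from the ANOVA table in Figure 6.1(a).
ssm, sse = 7054.3564, 1216.0559
sst = ssm + sse                  # explained + unexplained = total

r2_from_sse = 1 - sse / sst      # share of variability NOT left unexplained
r2_from_ssm = ssm / sst          # share of variability explained

print(f"SST = {sst:.4f}")
print(f"R^2 = {r2_from_sse:.6f} (via SSE), {r2_from_ssm:.6f} (via SSM)")
```

Both forms give the same value, which matches the RSquare entry in the figure's Summary of Fit up to rounding.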

If $R^2$ is close to 1 then our predictors have explained a large proportion of the variability in Y. If $R^2$ is close to zero then the predictors do not help us much. Notice that correlation is not a meaningful way of describing the collective relationship between Y and all the X's, because correlation only deals with pairs of variables. This is one of the reasons we use $R^2$ because it gives an overall measure of the relationship.

One fact about $R^2$ that should be kept in mind is that even if you add a variable that has no relationship to Y the new $R^2$ will be higher! In fact if you add as many predictors as data points you will get $R^2 = 1$. This may sound good but in fact it usually means that any future predictions that you make are terrible. We'll show an example in Section 6.7 where we can add enough garbage variables to "predict the stock market" extremely well (high $R^2$) for data in our data set but do a lousy job predicting future data.

6.2.3 Is an Individual Variable Important? The T Test

One of the first steps in a regression analysis is to perform the whole model F test described in Section 6.2.1. If its p-value is not small then we might as well stop because we have no evidence that regression is helping. However, if the p-value is small, so that we can conclude that at least one of the predictors is helping, then the question becomes "which ones?"

The main tool for determining whether an individual variable is important is the t-test for testing $H_0: \beta_i = 0$. This is the same old t-test from Chapters 4 and 5 that computes how many SE's each coefficient is away from zero. All variables with small p-values are probably useful and should be kept. However, because of issues like collinearity (discussed in Section 6.4) some of the variables with large p-values might be useful as well.[3] That is, an apparently insignificant variable might become significant if another insignificant variable is dropped. This leads into model selection and data mining ideas which we discuss in Section 6.7. For now let's just say that the right way to drop insignificant variables from the model is to do it one at a time. That way you can be sure that you don't accidentally throw away something valuable.

3 There is another issue, called multiple comparisons, that we will ignore for now but pick up again in Section 6.7.

When you test an individual coefficient in a multiple regression you're asking whether that coefficient's variable explains something that the other variables don't. For example, in Figure 6.1 the coefficient of Cylinders appears insignificant. Does that mean that the number of cylinders in a car's engine has nothing to do with its fuel consumption? Of course not. We all know that 8 cylinder cars use more gas than 4 cylinder cars. The large p-value for Cylinders is saying that if you already know a car's weight, horsepower, and displacement then you don't need to know Cylinders too. (Actually you don't need Displacement either, but you don't know that until you drop Cylinders first.)

The order that you enter the X variables into the regression has no effect on the p-values of the individual coefficients. Each p-value calculation is done conditional on all the other variables in the model, regardless of the order in which they were entered into the computer.

It is desirable to get insignificant variables out of the regression model. If a variable is not statistically significant then you don't have enough information in the data set to accurately estimate its slope. Therefore if you include such a variable in the model all you're doing is adding noise to your predictions. To illustrate this point we re-ran the regression from Figure 6.1 after setting aside 25 randomly chosen observations (i.e. we excluded them from the regression). We used the fitted regression model to predict fuel consumption for the observations we set aside, and we computed the residuals from these predictions. Then we dropped the insignificant variables Cylinders and Displacement (dropping them one at a time, to make sure Cylinders did not become significant when Displacement was dropped) and made the same calculations. The two regressions produced the following results for the 25 "holdout" observations:

variables in model    SD(residuals)
all four              3.57
only significant      3.48

The model with more variables does worse than the model based only on significant variables. Even experienced regression users sometimes forget that keeping insignificant variables in the model does nothing but add noise.

6.2.4 Is a Subset of Variables Important? The Partial F Test

Sometimes you want to examine whether a group of X's, taken together, does a useful amount of explaining. The test for doing this is called the partial F test. The partial F test works almost exactly the same way as the whole model F test. The difference is that the whole model F test compares a big regression model (with many X's) to the mean of Y. The partial F test compares a big regression model to a smaller regression model (with fewer X's).

To calculate the partial F statistic you will need the ANOVA tables from the full (big) model and the null (small) model. The formula for the partial F statistic is

$$F = \frac{\Delta SSE / \Delta DF}{MSE_{\text{full}}}.$$
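The partial F formula can be sketched directly from two ANOVA tables. The numbers below are the ones reported in Table 6.1 for the Sears stock example (full model with three predictors, null model with one). The p-value helper is an assumption on our part: it uses a closed form for the F distribution's upper tail that is valid only when the numerator has exactly 2 degrees of freedom, which happens to be the case here. For any other numerator DF you would need statistical software.

```python
def partial_f(sse_null, sse_full, df_num, mse_full):
    """Partial F statistic: (change in SSE / change in DF) / MSE of the full model."""
    return ((sse_null - sse_full) / df_num) / mse_full

def f_pvalue_two_numerator_df(f, df_den):
    """Upper tail probability of an F(2, df_den) random variable.

    Special case only: this closed form holds when the numerator DF is exactly 2.
    """
    return (1 + 2 * f / df_den) ** (-df_den / 2)

# Values read from Table 6.1: the null model drops 2 of the 3 predictors.
f_stat = partial_f(sse_null=0.47700979, sse_full=0.46210223,
                   df_num=3 - 1, mse_full=0.004018)
p_value = f_pvalue_two_numerator_df(f_stat, df_den=115)

print(f"partial F = {f_stat:.4f}")
print(f"p-value   = {p_value:.4f}")
```

Note that the denominator uses the MSE of the full model, as printed (rounded) in the table, so the result matches the text's hand calculation.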

(a) Full Model
Source   DF      Sum of Squares   Mean Square    F Ratio    Prob > F
Model    [3]     0.46388379       0.154628       38.4811    <.0001
Error    115     [0.46210223]     [[0.004018]]
Total    118     0.92598602

(b) Null Model
Source   DF      Sum of Squares   Mean Square    F Ratio    Prob > F
Model    [1]     0.44897622       0.448976       110.1240   <.0001
Error    117     [0.47700979]     0.004077
Total    118     0.92598602

Table 6.1: ANOVA tables for the partial F test. Singly boxed items, shown as [ ], are used in the numerator of the test. Doubly boxed items, shown as [[ ]], are used in the denominator.

Note the similarity between the partial F statistic and the whole model F statistic. In the whole model F we called $\Delta SSE = SSM$ and $\Delta DF = p$. Also notice that the whole model F test and the partial F test use the MSE from the big regression model, not the small one.

For example, Table 6.1(a) contains the ANOVA table for the regression of Sears stock returns on the returns from IBM stock, the VW stock index, and the S&P 500. Table 6.1(b) is the ANOVA table for the regression with just the VW stock index as a predictor. IBM and S&P500 have been dropped. The question is whether both IBM and the S&P500 can be safely dropped from the regression. The partial F statistic is

$$F = \frac{(0.47700979 - 0.46210223)/(3 - 1)}{0.004018} = 1.855097.$$

To find the p-value for this partial F statistic you need to know how many degrees of freedom were used to calculate the numerator and the denominator. The numerator DF is simply the difference in the number of parameters for the two models. Here the numerator DF is 3 − 1 = 2. The denominator DF is the number of degrees of freedom used to calculate MSE. In our example the denominator DF is 115. By plugging the F statistic and its relevant degrees of freedom into a computer which knows about the F distribution, we find that the p-value for our F statistic is p = 0.1610887. This is pretty large as p-values go. In particular it is larger than .05, so we can't reject the null hypothesis that the variables being tested have zero coefficients. The big p-value says it is safe to drop both IBM and S&P500 from the model.

The partial F test is the most general test we have for regression coefficients. It can test any subset of coefficients you like. If the "subset" is just a single coefficient then the partial F test is equivalent to the t test from Section 6.2.3.[4] If the "subset" is everything then the partial F test is the same as the whole model F test. The most common use of the partial F test is when one of your X variables is a categorical variable with several levels, which we will see in Section 6.5. The partial F test is particularly relevant for categorical X's because each categorical variable gets split into a collection of simpler "dummy variables" which we will either exclude or enter into the model as a group.

4 In this special case you get $F = t^2$, where t is the t-statistic for the one variable you're testing.

6.2.5 Predictions

Finally to make a prediction we just use

$$\hat{Y} = b_0 + b_1 X_1 + \cdots + b_p X_p$$

where $b_0, \ldots, b_p$ are the estimated coefficients. As with simple linear regression, we know that this guess at Y is not exactly correct, so we want an interval that lets us know how sure we can be about the estimate. If we are trying to guess the average value of Y (i.e. just trying to guess the population regression line) then we calculate a confidence interval. If we want to predict a single point (which has extra variance because even if we know the true regression line the point will not be exactly on it) we use a prediction interval. The prediction interval is always wider than the confidence interval.

The standard error formulas for obtaining prediction and confidence intervals in multiple regression are sufficiently complicated that we won't show them.[5] The intervals are easy enough to obtain using a computer (see page 181 for JMP instructions).

5 The regression output omits information that you would need to create the intervals without a computer's help. The estimated regression coefficients are correlated with one another. To compute the intervals yourself you would need the covariance matrix describing their relationship. Then you could use the general variance formula on page 54.

Here are prediction and confidence intervals for the car example in Figure 6.1. We dropped the insignificant variables from the model, so now the only factors are Weight and HP. Suppose we want to estimate the fuel consumption for a 4000 pound car with 200HP.

Interval      Gal/1000mi        ⇒ mi/gal
Confidence    (63.45, 66.62)    (15.01, 15.76)
Prediction    (57.92, 72.15)    (13.86, 17.27)
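The mi/gal column of the interval table follows from the gal/1000mi column by converting the endpoints. A minimal sketch of that conversion is below; the interval endpoints are the ones given in the text for the 4000 pound, 200 HP car, and the function name is ours. Because miles per gallon is the reciprocal of gallons per mile, the low and high endpoints trade places.

```python
def gal_per_1000mi_to_mpg(interval):
    """Convert a (low, high) interval in gal/1000mi to a (low, high) interval in mi/gal."""
    low, high = interval
    # mi/gal = 1000 / (gallons per 1000 miles); taking reciprocals reverses the order.
    return (1000 / high, 1000 / low)

# Interval endpoints in gal/1000mi, as given in the text's interval table.
confidence = (63.45, 66.62)
prediction = (57.92, 72.15)

for name, interval in [("confidence", confidence), ("prediction", prediction)]:
    lo, hi = gal_per_1000mi_to_mpg(interval)
    print(f"{name}: ({lo:.2f}, {hi:.2f}) mi/gal")
```

This endpoint-by-endpoint conversion is legitimate here because the transformation is monotone; only the interval endpoints, not the standard errors, are being transformed.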

look at the residuals. The deﬁnition of a high leverage point changes a bit. 4.3.1 Leverage Plots Leverage plots show you the relationship between Y and X after the other X ’s in the model have been “accounted for. Constant variance of error terms.3 Regression Diagnostics: Detecting Problems Just as in simple regression. Independence of error terms. Outliers also have the same deﬁnition in multiple regression as in simple regression. it is possible for a car to have rather high (but not extreme) horsepower and rather low (but not extreme weight). An outlier is a point with a big residual.2 describes a variety of “whole model” tools. Normality of error terms. or you can use tools that look at one variable at a time. Neither variable is extreme by itself. Multiple regression uses the same assumptions as simple regression. 2.1 describes the best “one variable at a time” tool. As with simple regression the main tool to solve any violations is to transform the data. known as a leverage plot. Then we transformed them back to the familiar MPG units in which we are used to measuring fuel economy by simply transforming the endpoints of each interval.3.” See the call-out box on page 129 for a more . Linearity. 6. so the best place to look for outliers is in a residual plot. You can either use tools that work with the entire model at once.3. REGRESSION DIAGNOSTICS: DETECTING PROBLEMS 127 The conﬁdence and prediction intervals were constructed on the scale of the Y variable in the multiple regression model (chosen to avoid violating regression assumptions).6. you should check your multiple regression model for assumption violations and for the possible impact of outliers and high leverage points. 1.3. Multiple regression uses the same types of tools as simple regression to check for assumptions violations i. 3. In multiple regression a high leverage point is an observation with an unusual combination of X values. 
There are two diﬀerent families of tools for detecting assumption violation and unusual point issues in multiple regression models. For example. or added variable plot. 6.e. but the car can still be a high leverage point because it is an unusual HP-weight combination. Section 6. It looks like our car is going to be a gas guzzler. Section 6.

technical definition.

Figure 6.2: (a) Leverage plot and (b) scatterplot showing the relationship between a car's fuel consumption and number of cylinders.

Figure 6.2 compares the leverage plot showing the relationship between fuel consumption and the number of cylinders in a car's engine, and the corresponding scatterplot. The leverage plot is from a model that also includes Horsepower, Weight, and Displacement. The leverage plot and scatterplot reinforce what we said about Cylinders in Section 6.2.3. In a simple regression there is a relatively strong relationship between fuel consumption and Cylinders, but the relationship is better described by other variables in the multiple regression.

Points on the right of the leverage plot are cars with more cylinders than we would have expected given the other variables in the model. Points near the top of the leverage plot use more gas than we would have expected given the other variables in the model (weight, HP, displacement).

The lines that JMP produces are there to help you visualize the hypothesis test for whether each β = 0. You can get the same information from the table of parameter estimates, so the lines are not particularly helpful. However, the true value of a plot is that the points in the plot let you see more than you could in a simple numerical summary. When you look at a leverage plot, look at the points on the plot, not the lines.

Leverage plots are great places to look for bends and funnels which may indicate regression assumption violations. They are also great places to look for high leverage points.[6] Leverage plots can be read just like scatterplots, but they are better than scatterplots because they control for the impact of the other variables in the regression. Figure 6.2(a) shows no evidence of any assumption violations or unusual points. It simply shows that Cylinders is an unimportant variable in our model.

[6] The word "leverage" means different things in "leverage plot" and "high leverage point."

6.3.1 How to build a leverage plot (Not on the test)

Here is how you would build a leverage plot if the computer didn't do it for you. For concreteness imagine we are creating the leverage plot for Cylinders shown in Figure 6.2.

1. Regress Y on all the X variables in the model except for the X in the current leverage plot (e.g. everything but cylinders). The residuals from this regression are the information about Y that is not explained by the other X's in the model.

2. Now regress the current X on all the other X's in the model (e.g. cylinders on weight, HP, and displacement). The residuals from this regression are the information in the current X that couldn't be predicted by the other X's already in the model.

By plotting the first set of residuals against the second, the leverage plot shows just the portion of the relationship between the Y and X that isn't already explained by the other X's in the model.

A minor detail: because a leverage plot is plotting one set of residuals against another, you might expect both axes to have a mean of zero. When JMP creates leverage plots it adds the mean of X and Y back into the axes, so that the plot is created on the scale of the unadjusted variables. That's a nice touch, but if I had to create a leverage plot myself I probably wouldn't bother adding the means back in.

6.3.2 Whole Model Diagnostics

The only real drawback to leverage plots is that if you have many variables in your regression model then there will be many leverage plots for you to examine. This Section describes some "whole model" tools that you can use instead of or in conjunction with leverage plots. Leverage plots and "whole model" diagnostics complement one another. Some people prefer to look at these tools first to suggest things they should look for in leverage plots. Which ones you look at first is largely a matter of personal preference.
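The two-step recipe for building a leverage plot can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the data are simulated (the names `weight`, `hp`, and `cyl` are stand-ins, not the car data set), and the helper functions `ols_residuals` and `leverage_plot_points` are hypothetical names, not JMP functions. The sketch also does not add the means back in.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals from a least-squares regression of y on X (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

def leverage_plot_points(X, y, j):
    """Coordinates of the leverage plot for column j of X."""
    others = np.delete(X, j, axis=1)
    y_resid = ols_residuals(others, y)         # step 1: Y on the other X's
    x_resid = ols_residuals(others, X[:, j])   # step 2: current X on the other X's
    return x_resid, y_resid

# Simulated stand-in data (an assumption, not the car data set).
rng = np.random.default_rng(0)
n = 200
weight = rng.normal(size=n)
hp = 0.8 * weight + 0.6 * rng.normal(size=n)
cyl = 0.7 * weight + 0.5 * hp + 0.5 * rng.normal(size=n)
X = np.column_stack([weight, hp, cyl])
y = 2.0 + 1.5 * weight + 1.0 * hp + 0.0 * cyl + rng.normal(size=n)

x_res, y_res = leverage_plot_points(X, y, j=2)

# Slope of the line through the leverage plot points.
slope_in_plot = (x_res @ y_res) / (x_res @ x_res)

# Coefficient of the same variable in the full multiple regression.
A = np.column_stack([np.ones(n), X])
full_beta, *_ = np.linalg.lstsq(A, y, rcond=None)
```

In this sketch `slope_in_plot` agrees with `full_beta[3]`, the coefficient of the third predictor in the full model, which is why the points in a leverage plot show exactly the relationship that the multiple regression coefficient summarizes.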

Figure 6.3: (a) Residual plot and (b) whole model leverage plot for the regression of fuel consumption on Weight, Horsepower, Cylinders, and Displacement.

The Residual Plot: Residuals vs. Fitted Values

When someone says "the" residual plot in multiple regression they are talking about a plot of the residuals versus Ŷ, the fitted Y values. Why Ŷ? In Chapter 5 the residual plot was a plot of residuals versus X. In multiple regression there are several X's, and Ŷ is a convenient one-number summary of them.

You use the residual plot in multiple regression in exactly the same way as in simple regression. A bend is a sign of nonlinearity and a funnel shape is a sign of non-constant variance. If you see evidence of nonlinearity then look at the leverage plots for each variable to see which one is responsible. If the bend is present in several leverage plots then try transforming Y. Otherwise try transforming the X variable responsible for the bend, or add a quadratic term in that variable (see page 182). The residual plot for the regression in Figure 6.1 is shown in Figure 6.3(a). The residuals look fine.

The (other) Residual Plot: Residuals vs. Time

If your regression uses time series data you should also plot the residuals vs. time. Examine the plot for "tracking" (aka autocorrelation) just like in Chapter 5.
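As a rough numerical companion to the two residual plots, the sketch below computes fitted values and residuals and then summarizes "tracking" with the lag-1 autocorrelation of the residuals. Everything here is an assumption made for illustration: the data are simulated, and `fit_and_residuals` and `lag1_autocorrelation` are hypothetical helper names. A single number is no substitute for looking at the plot.

```python
import numpy as np

def fit_and_residuals(X, y):
    """Least-squares fit with an intercept; returns (fitted values, residuals)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ beta
    return fitted, y - fitted

def lag1_autocorrelation(resid):
    """Correlation between each residual and the one just before it in time."""
    r = resid - resid.mean()
    return float((r[1:] @ r[:-1]) / (r @ r))

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 2))
signal = 1.0 + X @ np.array([2.0, -1.0])

# Case 1: independent errors, so the residuals should not track.
y_indep = signal + rng.normal(size=n)

# Case 2: errors that follow an AR(1) pattern, so the residuals track.
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.normal()
y_track = signal + e

fitted_i, resid_i = fit_and_residuals(X, y_indep)
fitted_t, resid_t = fit_and_residuals(X, y_track)

r_indep = lag1_autocorrelation(resid_i)   # close to zero
r_track = lag1_autocorrelation(resid_t)   # clearly positive
```

Plotting `resid_i` against `fitted_i` gives "the" residual plot; plotting `resid_t` in time order gives the residuals vs. time plot, where the tracking shows up as long runs of residuals on the same side of zero.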

Normal Quantile Plot of Residuals

The last residual plot worth looking at in a multiple regression is a normal quantile plot of the residuals. Remember that the normality of the residuals is ONLY important if you want to use the regression to predict individual Y values (in which case it is very important). Otherwise a central limit theorem effect shows up to save the day. You interpret a normal quantile plot of the residuals in exactly the same way you interpret a normal quantile plot of any other variable: if the points are close to a straight line on the plot then the residuals are close to normally distributed.

An Unhelpful Plot: The Whole Model Leverage Plot

This plot (see Figure 6.3(b)) isn't really useful for diagnosing problems with the model, which is too bad because JMP puts it front and center in the multiple regression output. The whole model leverage plot is a plot of the actual Y value vs. Ŷ, the predicted Y. You can think of it as a picture of R², because if R² = 1 then all the points would lie on a straight line. The confidence bands that JMP puts around the line are a picture of the whole model F test. You can get the same information from the whole model leverage plot and the residual plot, but the whole model leverage plot makes regression assumption violations harder to detect because the "baseline" is a 45 degree line instead of a horizontal line. Our advice is to ignore this plot.

Leverage (for finding high leverage points)

High leverage points were easy to spot in simple linear regression. You just look for a point with an unusual X value. However, when there are many predictors it is possible to have a point that is not unusual for any particular X but is still in an unusual spot overall. Consider Figure 6.4(a), which plots Weight vs. HP for the car data set. The general trend is that heavier cars have more horsepower, however notice the two cars marked with a × (Ford Mustang) and a + (Chevy Corvette). These cars have a lot of horsepower, but not that much more than several other cars. However, they are lighter than other cars with similarly large horsepower. Notice that the same two cars show up on the right edge of the HP leverage plot, because they have more HP than one would expect for cars of their same weight. When we compute the leverage of each point in the regression model[7] the Mustang and Corvette show up as the points with the most leverage because they have the most unusual combination of weight and HP.

[7] See page 181 for JMP tips.
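The idea that a point can be ordinary in each X separately but unusual in combination can be checked numerically. The sketch below is a minimal illustration with made-up numbers (not the car data): it computes each point's leverage as the corresponding diagonal entry of the hat matrix, using a QR decomposition, and the helper name `leverages` is hypothetical.

```python
import numpy as np

def leverages(X):
    """Leverage of each row: diagonal of the hat matrix for X plus an intercept."""
    A = np.column_stack([np.ones(X.shape[0]), X])
    Q, _ = np.linalg.qr(A)           # hat matrix H = Q Q^T
    return np.sum(Q**2, axis=1)      # diagonal entries of H

# Made-up data: 99 "cars" where heavier cars have more horsepower ...
weight = np.linspace(-2.0, 2.0, 99)
hp = weight + 0.3 * np.sin(5.0 * weight)

# ... plus one car that is light but powerful. Neither value is extreme
# on its own, but the combination is unusual.
weight = np.append(weight, -1.0)
hp = np.append(hp, 1.5)

h = leverages(np.column_stack([weight, hp]))
most_leveraged = int(np.argmax(h))   # index of the point with the most leverage
```

Here `most_leveraged` is the index of the added car, even though its weight and its horsepower both sit inside the range of the other cars. The leverages in `h` add up to 3, the number of coefficients in the model (two slopes plus the intercept).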

Figure 6.4: (a) Scatterplot of Weight vs. HP, (b) histogram of leverages, and (c) leverage plot for HP for the car data set. The maximum leverage in panel (b) is .138. The average is .027, out of 111 observations.

Recall that we denote the leverage of the ith data point by $h_i$. A larger $h_i$ means a point with more leverage. As with simple regression we have $1/n \le h_i \le 1$. In multiple regression we have $\sum_{i=1}^{n} h_i = p + 1$. Thus if we added all the numbers in Figure 6.4(b) we would get 3 (two X variables plus the intercept). Some computer programs (though not JMP) warn you about points where $h_i$ is greater than 3× the average leverage (i.e. $h_i > 3(p + 1)/n$). Rather than use such a hard rule, we suggest plotting a histogram of leverages like Figure 6.4(b) and taking a closer look at points on the right that appear extreme.

Cook's Distance: measuring a point's influence (Definitely not on the test.)

Each point in the data set has an associated Cook's distance, which measures how "far" (in a funky statistical sense) the regression line would move if the point were deleted. Luckily, the computer does not actually have to delete each point and fit a new regression to calculate the Cook's distance. It can use a formula instead. The formula is complicated,[8] but it only depends on two things: the point's leverage and the size of its residual. High leverage points with big residuals have large Cook's distances. A large Cook's distance means the point is influential for the fitted regression equation.

[8] If you must know the formula it is $d_i = h_i e_i^2 / \left(p\, s^2 (1 - h_i)^2\right)$.

There is no formal hypothesis test to say when a Cook's distance is "too large." Some guidance may be obtained from the appropriate F table, which will depend on the number of variables in the model and the sample size. A table of suggested cutoff values is available in Appendix D.3. Don't take the cutoff values in Appendix D.3 too seriously. They are only there to give you guidance about what a "large" Cook's

However. Figure 6. If you think the point was recorded in error or represents a phenomenon you do not wish to model then you may consider deleting it.4 Collinearity Another issue that comes up in multiple regression is collinearity. 6. If the point legitimately belongs in the data set then you might consider transforming the scale of the model so that the point is no longer inﬂuential. In practice. the model can’t be sure how much of the credit for explaining Y belongs to “height in feet” and how much to “height in inches.5 highlights the Mustang and Corvette that attracted our attention as high leverage points in Figure 6. The classic example is regressing Y on someone’s height in feet.” 9 Some people call it “multicollinearity. and on their height in inches. COLLINEARITY 133 Figure 6. distance looks like.3.5 says that these two points are not inﬂuential.6.4. See the footnote on page 104. None of the Cook’s distances look particularly large when compared to the table in Appendix D.4. Figure 6. Cook’s distances are used in much the same way as leverages (i.5 shows the Cook’s distances from the car data set. Even if “height” is an important predictor. . High leverage points have the opportunity to inﬂuence the ﬁtted regression a great deal.e. you plot a histogram and look for extreme values). Figure 6. If you identify a point with a large Cook’s distance you should try to determine what makes it unusual. Attempt to ﬁnd the point in the leverage plots for the regression so that you can determine the impact it is having on your analysis.” which we don’t like because it is an eight syllable word. The largest Cook’s distance is well below the lowest threshold in Appendix D.3 with 3 model parameters (two slopes and the intercept) and about 100 observations.5: Cook’s distances for the car data.9 where two or more predictor variables are closely related to one another.

so the standard errors for both coefficients become inflated. If all the X's in your model are highly collinear, you could even have a very high R² and significant F, but none of the individual variables shows up as significant.

(a)
Term        Estimate     Std Error   t Ratio   Prob>|t|   VIF
Intercept    0.0134394   0.007179     1.87     0.0638     .
VW           2.5729753   1.028958     2.50     0.0138     76.618188
SP500       -1.56224     1.066683    -1.46     0.1458     78.399527
IBM          0.2070931   0.132193     1.57     0.1200     1.7847938
PACGE        0.0798655   0.130059     0.61     0.5404     1.1560317

(b)
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept    0.0195894   0.006075     3.22     0.0016
VW           1.2392068   0.118087    10.49     <.0001

Figure 6.6: Regression output for Walmart stock regressed on (a) several stocks and stock indices, (b) only the value weighted stock index.

Physically, you can understand collinearity by imagining that you are trying to rest a sheet of plywood on a number of cinder-blocks (the plywood represents the regression surface and the blocks the individual data points). If the blocks are fairly evenly distributed under the plywood it will be very stable, i.e. you can stand on it anywhere and it will not move. However, if you place all the blocks in a straight line along one of the diagonals of the plywood and stand on one of the opposite corners the plywood will move (and you will fall, don't try this at home). The second situation is an example of collinearity where two X variables are very closely related so they lie close to a straight line and you are trying to rest the regression plane on these points.

6.4.1 Detecting Collinearity

The best way to detect collinearity is to look at the variance inflation factor (VIF).[10] We can calculate the VIF for each variable using the formula

$$\mathrm{VIF}(b_i) = \frac{1}{1 - R^2_{X_i \mid X_{-i}}}$$

where $R^2_{X_i \mid X_{-i}}$ is the R² we would get if we regressed $X_i$ on all the other X's. If $R^2_{X_i \mid X_{-i}}$ is close to one, then the other X's can accurately predict $X_i$. In that case, the computer can't tell if Y is being explained by $X_i$ or by some combination of the other X's that is virtually the same thing as $X_i$.

[10] See page 181 for JMP tips on calculating VIF's.

The VIF is the factor by which the variance of a coefficient is increased relative to a regression where there was no collinearity. A VIF of 4 means that the standard error for that coefficient is twice as large as it would be if there were no collinearity. If the VIF is 9 then the standard error is 3 times larger than it would have otherwise been. As a rough rule if the VIF is around 4 you should pay attention (i.e. do something about it if you can, and it is not too much trouble). If it is around 100 you should start to worry (i.e. don't even think about this as your final regression model). If a variable can be perfectly predicted by other X's, like "height in feet" and "height in inches" then $R^2_{X_i \mid X_{-i}} = 1$, so $\mathrm{VIF}(b_i) = \infty$ and you will get an error message when the computer divides by zero calculating $SE(b_i)$.

In Figure 6.6(a) the standard error of VW is about 1.03, with a VIF of around 76. In Figure 6.6(b), where there is no collinearity because there is only one X, the standard error of VW is about .118. Thus, the standard error of VW in the multiple regression is about $\sqrt{76} \approx 8.7$ times as large as it is in the simple regression (with no collinearity).

Because collinearity is a strong relationship among the X variables, you might think that the correlation matrix would be a good place to look for collinearity problems. It isn't bad, but VIF's are better because it is possible for collinearity to exist between three or more variables where no individual pair of variables is highly correlated. The correlation matrix only shows relationships between one pair of variables at a time, so it can fail to alert you to the problem.

6.4.2 Ways of Removing Collinearity

Collinearity happens when you have redundant predictors. Some common ways of removing collinearity are:

• Drop one of the redundant predictors. If two or more X's provide the same information, why bother keeping all of them? In most cases it won't matter which of the collinear variables you drop: that's what redundant means! In Figure 6.6 just pick one of the two stock market indices. They provide virtually identical information, so it doesn't matter which one you pick.

• Combine them into a single predictor in some interpretable way. For example if you are trying to use CEO compensation to predict performance of a company you may have both "base salary" and "other benefits" in the model. These two variables may be highly correlated so they could be combined into a single variable "total compensation" by adding them together.

• Transform the collinear variables. If one of your variables can be interpreted as

"size" then you can use it in a ratio to put other variables on equal footing. For example, a chain of supermarkets might have a data set with each store's total sales and number of customers. You would expect these to be collinear because more customers generally translates into larger sales. Consider replacing total sales with sales/customer (the average sale amount for each customer). In the car example HP and Weight are correlated with r = .8183. Using Weight and HP/pound reduces the correlation to .2546.

Taking the difference between two collinear variables can also help, but it isn't very interpretable unless the two variables have the same units. For example, VW − S&P500 makes sense as the amount by which VW outperformed the S&P 500 in a given month, but TotalSales − Customers wouldn't make much sense.

A third example of a transformation that reduces collinearity is centering polynomials. Recall the quadratic regression of the size of an employee's life insurance premium on the employee's age from Chapter 5 (page 103). The same output is reproduced in Figure 6.7 with and without centering. The centered version produces exactly the same fitted equation (notice that RMSE is exactly the same), but with much less collinearity.

(a)
Term             Estimate      Std Error   t Ratio   Prob>|t|   VIF
Intercept        -1015.103     307.5207    -3.30     0.0017     .
Age                 67.451425   14.50506    4.65     <.0001     49.872
Age*Age             -0.717539    0.165171  -4.34     <.0001     49.872
RSq 0.300978   RMSE 171.1107   N 61

(b)
Term             Estimate      Std Error   t Ratio   Prob>|t|   VIF
Intercept          360.62865    95.48353    3.78     0.0004     .
Age                  4.6138293   2.056672   2.24     0.0287     1.0027
(Age-43.78)^2       -0.717539    0.165171  -4.34     <.0001     1.0027
RSq 0.300978   RMSE 171.1107   N 61

Figure 6.7: Output for a quadratic regression (a) without centering (b) with the squared term centered.

6.4.3 General Collinearity Advice

Collinearity is undesirable, but not disastrous. It is a less important problem than regression assumption violations or extreme outliers. You will rarely want to omit a significant variable simply because it has a high VIF. Such a variable is significant

despite the high VIF. If you can remove the collinearity using transformations then the significant variable will be even more significant.

Finally, keep in mind that it is easier to accidentally engage in extrapolation when using a model fit by collinear X's. In the context of multiple regression, extrapolation means predicting Y using an unusual combination of X variables. For example, consider the prediction interval for the monthly return on Walmart stock based on a model with both VW and the S&P500 shown in Table 6.2. The RMSE from the regression is about .06, so if the regression equation were well estimated we would expect the margin of error for a prediction interval to be about ±.12. If we predict Walmart's return with VW = S&P = 0.1, a typical combination of values, then we get a prediction interval not much wider than ±.12. The same is true when we predict with VW = S&P = 0.0. However, when we try to predict Walmart's return when VW = 0.1 and S&P = 0.0 the prediction interval is almost twice as wide. Even though .1 and 0 are typical values for both variables, the (0.1, 0.0) combination is unusual.

VW     S&P    Lower PI   Upper PI
0.0    0.0    -0.1119    0.1421     (Interpolation)
0.1    0.1     0.0061    0.2642     (Interpolation)
0.1    0.0     0.0296    0.4923     (Extrapolation)

Table 6.2: Prediction intervals for the return on Walmart stock using a regression on VW and the S&P 500.

6.5 Regression When X is Categorical

The two previous Sections dealt with problems you can encounter with multiple regression. Now we return to ways of applying the multiple regression model to real problems. We often encounter data sets containing categorical variables. A categorical variable in a regression is often called a factor. The possible values of a factor are called its levels. For example, Sex is a two-level factor with Male and Female as its levels. Section 6.5.1 describes how to include two-level factors in a regression model using dummy variables. Section 6.5.2 explains how to extend the dummy variable idea to multi-level factors.

6.5.1 Dummy Variables

A factor with only two levels can be incorporated into a regression model by creating a dummy variable, which is simply a way of creating numerical values out of

categorical data. For example, suppose we wanted to compare the salaries of male and female managers. Then we would create dummy variables like

$$\text{Sex[Female]} = \begin{cases} 1 & \text{if subject } i \text{ is Female} \\ 0 & \text{otherwise} \end{cases} \qquad \text{Sex[Male]} = \begin{cases} 1 & \text{if subject } i \text{ is Male} \\ 0 & \text{otherwise.} \end{cases}$$

A regression model using both these variables would look like

$$\hat{Y}_i = \beta_0 + \beta_1 \text{Sex[Male]} + \beta_2 \text{Sex[Female]} = \begin{cases} \beta_0 + \beta_1 & \text{if subject } i \text{ is Male} \\ \beta_0 + \beta_2 & \text{if subject } i \text{ is Female.} \end{cases}$$

Estimating this model is problematic because the two dummy variables are perfectly collinear.[11] In order to fit the model we must constrain the dummy variables' coefficients somehow. The two most popular methods are setting one of the coefficients to zero (i.e. removing one of the dummy variables from the model) or forcing the coefficients to sum to zero.

Figure 6.8: JMP will make dummy variables for you. You don't have to make them by hand as shown here with "SexCodes".

Creating your own dummy variables is a lot of work when you can let the computer do it for you. When you include a categorical variable in a regression, such as Sex in Figure 6.8, JMP automatically creates Sex[Female] and Sex[Male] and includes both variables using the "sum to zero" convention. To use the "leave one out" convention you must create your own dummy variable, such as SexCodes in Figure 6.8, and use it in the regression instead.[12]

[11] Sex[Female] = 1 − Sex[Male]
[12] We doubt you will want to do this.
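The two conventions can be compared on a toy example. The sketch below uses seven made-up salaries (in thousands of dollars, not the data behind Figure 6.8) and NumPy least squares. The sum-to-zero column is coded +1 for women and −1 for men, which is one common way to impose that constraint. The leave-one-out column plays the role of a hand-made variable like SexCodes, assumed here to be 1 for men and 0 for women.

```python
import numpy as np

# Made-up salaries in thousands of dollars (an assumption for illustration).
sex = np.array(["F", "F", "F", "M", "M", "M", "M"])
salary = np.array([138.0, 141.0, 142.0, 143.0, 145.0, 144.0, 146.0])

female = (sex == "F").astype(float)
male = (sex == "M").astype(float)
ones = np.ones(len(sex))

# Intercept plus BOTH dummies: the columns are perfectly collinear,
# so the design matrix has rank 2 instead of 3.
both = np.column_stack([ones, male, female])
rank_both = np.linalg.matrix_rank(both)

# "Leave one out": keep only the dummy for men.
X_loo = np.column_stack([ones, male])
b_loo, *_ = np.linalg.lstsq(X_loo, salary, rcond=None)

# "Sum to zero": one column coded +1 for women and -1 for men.
X_stz = np.column_stack([ones, female - male])
b_stz, *_ = np.linalg.lstsq(X_stz, salary, rcond=None)

mean_f = salary[sex == "F"].mean()
mean_m = salary[sex == "M"].mean()
```

Under leave-one-out, `b_loo[0]` is the average salary for women and `b_loo[1]` is the men-minus-women difference. Under sum-to-zero, `b_stz[0]` is the midpoint of the two group averages and `b_stz[1]` is the amount women sit above or below that midpoint. Both fits give identical predicted salaries.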

(a)
Term            Estimate     Std Error   t Ratio   Prob>|t|
Intercept       142.28851    0.883877    160.98    <.0001
Sex[female]      -1.821839   0.883877     -2.06    0.0405
"Sex[male]        1.821839   0.883877      2.06    0.0405"

(b)
Term            Estimate     Std Error   t Ratio   Prob>|t|
Intercept       140.46667    1.43514      97.88    <.0001
SexCodes          3.6436782  1.767753      2.06    0.0405

(c)
Diff.   Estimate  -3.6437    t-Test       -2.061
        Std Error  1.7678    DF            218
        Lower 95% -7.1278    Prob > |t|    0.0405
        Upper 95% -0.1596

Figure 6.9: Regression output comparing male and female salaries based on (a) the sum-to-zero constraint, (b) the leave-one-out constraint. Panel (c) compares men's and women's salaries using the two-sample t-test.

Figure 6.9 shows regression output illustrating the two conventions. In panel (a) the intercept term is a baseline. The average women's salary is 1.82 (thousand dollars) below the baseline. The average men's salary is 1.82 above the baseline. That means the difference between the average salaries of men and women is 1.82 × 2 = 3.64. The difference is statistically significant (p = .0405), but just barely. The line for Sex[Male] is in quotes because it is not always reported by the computer,[13] although the coefficient is easy to figure out if the computer doesn't report it. The coefficients for Sex[Male] and Sex[Female] must sum to zero, so one coefficient is just −1 times the other.

[13] To see it you have to ask for "expanded estimates." See page 182.

Figure 6.9(b) shows the same regression using the "leave one out" convention. In panel (b) the intercept represents the average salary for women (where SexCodes=0), and the coefficient of SexCodes represents the average difference between men's and women's salaries. The difference is significant, with exactly the same p-value as in panel (a).

Both regressions in Figure 6.9 describe exactly the same relationship. They just parameterize it differently. Both models say that the average salary for women is $140,466 and that the average salary for men is $3,643 higher. Furthermore, all the p-values, t-statistics, and other important statistical summaries of the relationship are the same for the two models. In that sense, it does not matter which dummy variables convention you use, as long as you know which one you are using. The

"leave one out" convention is a little easier if you have to create your own dummy variables by hand. The "sum to zero" convention is nice because it treats all the levels of a categorical variable symmetrically. Henceforth we will use JMP's "sum to zero" convention.

The last panel in Figure 6.9 shows that you get the same answer when you compare the sample means using regression or using the two sample t-test. The two sample t-test turns out to be a special case of regression. The advantage of comparing men's and women's salaries using regression is that we can control for other background variables. For example, the subjects in Figure 6.8 come from a properly randomized survey, so the statistically significant difference between their salaries generalizes to the population from which the survey was drawn. However, the subjects were not randomly assigned to be men or women,[14] so we can't say that the difference between men's and women's salaries is because of gender differences. There are other factors that could explain the salary difference, such as men and women having different amounts of experience.

[14] Obviously very difficult to do, which is why we left it as an exercise in Chapter 4.

Term          Estimate     Std Error   t Ratio   Prob>|t|
Intercept     135.01757    1.881391    71.76     <.0001
Sex[female]     0.0146588  0.949791     0.02     0.9877
YearsExper      0.7496472  0.173053     4.33     <.0001

[Plot: Salary vs. YearsExper]

Figure 6.10: Regression output comparing men's and women's salaries after controlling for years of experience. Adding dummy variables to a regression gives women (◦ solid line) and men (dashed line) different intercepts. The lines are parallel because the dummy variables do not affect the slope.

Figure 6.10 shows output for a regression of Salary on Sex and YearsExper. Now the baseline for comparison is a regression line Ŷ = 135 + .75 YearsExper. A woman adds .014 to the regression line to find her expected salary. A man with the same number of years of experience subtracts .014 from the line to find his expected salary (remember that the coefficients for Sex must sum to zero). That is only a $28 difference. Note the change from Figure 6.9. The large p-value for Sex in Figure 6.10

says that, once we control for YearsExper, Sex does not have a significant effect on Salary. Said another way, multiple regression estimates the impact of changing one variable while holding another variable constant. Therefore the coefficient of Sex can be viewed as comparing the salaries of men and women with the same number of years of experience. Because the p-value for Sex is insignificant, the model says that men and women with the same experience are being paid about the same.

You can think about the contributions from the Sex dummy variable as adding or subtracting something from the intercept term, which means that the regression lines for men and women are parallel. Section 6.6 shows how to expand the model if you want to consider non-parallel lines.

6.5.2 Factors with Several Levels

A factor with L levels can be included in a multiple regression by splitting it into L dummy variables. For example, suppose each observation in the data represents a pair of pants ordered from a clothing catalog. The color of each pair can be "natural," "dusty sage," "light beige," "bone," or "heather." To include Color in the regression, simply make a dummy variable for each color, like

$$X[DS] = \begin{cases} 1 & \text{if dusty sage} \\ 0 & \text{otherwise.} \end{cases}$$

Just as with the male/female dummy variables in Section 6.5.1, we must constrain the coefficients of the L dummy variables in order to avoid a perfect, regression killing collinearity. The same constraints from Section 6.5.1 apply here as well. We can either leave one dummy variable out of the model or constrain the coefficients to sum to zero. The "sum-to-zero" constraint is more appealing when dealing with multi-level factors.

The output from a regression on a multi-level factor can appear daunting because each factor level introduces a coefficient. For example, Figure 6.11 gives output from a regression of log10 CEO compensation on the CEO's age, industry, and his or her company's profits. Industry is a categorical variable with 19 different levels. Three more coefficients are added by the intercept, Age, and Profits, so there are 22 regression coefficients in total. It helps to remember that all but one of the dummy variables will be zero for any given observation. Therefore, only 4 of the 22 coefficients are needed for any given CEO. To illustrate, let's estimate the log10 compensation for a 60 year old CEO in the entertainment industry whose company made 72 million dollars in profit.

************ Expanded Estimates **************
Nominal factors expanded to all levels
Term             Estimate     Std Error   t Ratio   Prob>|t|
Intercept         5.7275756   0.118957    48.15     <.0001
Ind[AeroDef]      0.1476387   0.088284     1.67     0.0949
Ind[Business]     0.0156033   0.072287     0.22     0.8292
Ind[Cap. gds]    -0.078135    0.085355    -0.92     0.3603
Ind[Chem]        -0.020486    0.075052    -0.27     0.7850
Ind[CompComm]     0.0704833   0.049318     1.43     0.1534
Ind[Constr]       0.1317273   0.116558     1.13     0.2588
Ind[Consumer]     0.0601128   0.053002     1.13     0.2571
Ind[Energy]       ...         ...         -0.97     ...
Ind[Entmnt]       0.1106206   0.07352      1.50     0.1328
Ind[Finance]     -0.116069    0.03313     -3.50     0.0005
Ind[Food]         ...         ...         -0.97     ...
Ind[Forest]      -0.123455    0.085613    -1.44     0.1497
Ind[Health]       0.0093132   0.056013     0.17     0.8680
Ind[Insurance]    0.0116971   0.052934     0.22     0.8252
Ind[Metals]      -0.089493    0.087778    -1.02     0.3083
Ind[Retailing]    0.0509543   0.058207     0.88     0.3816
Ind[Transport]    0.0759199   0.095629     0.79     0.4275
Ind[Travel]       0.1955346   0.09896      1.98     0.0485
Ind[Utility]     -0.346568    0.049066    -7.06     <.0001
Age               0.0080405   0.002075     3.87     0.0001
Profits           0.0001485   0.000024     6.25     <.0001

RSquare 0.15152   RMSE 0.38498   N 786

ANOVA
Source   DF    Sum Sq.     Mean Sq.   F Ratio   Prob > F
Model    20    20.24733    1.01237    6.8306    <.0001
Error    765   113.38040   0.14821
Total    785   133.62774

Effect Tests
Source    Nparm   DF   Sum Sq.     F Ratio    Prob>F
Ind       18      18   11.63339     4.3607    <.0001
Age       1       1     2.224885   15.0117    0.0001
Profits   1       1     5.794551   39.0970    <.0001

Figure 6.11: Output for the regression of log10 CEO compensation on the CEO's age, industry, and corporate profits.

$$\hat{Y} = 5.73 + .11 + .00804 \times 60 + .000149 \times 72 = 6.333.$$

(Profit in the data table is recorded in millions of dollars.[15]) The same CEO in the chemicals industry could expect a log10 compensation of

$$\hat{Y} = 5.73 - .02 + .00804 \times 60 + .000149 \times 72 = 6.203.$$

That doesn't look like a big difference, but remember we're on the log scale. The first CEO expects to make $10^{6.333}$ = $2,152,782, the second, $10^{6.203}$ = $1,595,879, about half a million dollars less.

[15] Profit in the data table is recorded in millions of dollars.

If you like, you can interpret Figure 6.11 as 19 different regression lines with the same slopes for Age and Profit but different intercepts for each industry. Each coefficient of Ind[X] in Figure 6.11 represents the amount to be added to the regression equation for a CEO in industry X. The Ind[Utility] variable is left out of the usual parameter estimates table. We can either find it in the "expanded estimates" table (easy) or compute it from the other Ind[x] coefficients (hard).

Testing a Multilevel Factor

To test the significance of a multi-level factor, you test whether the collection of dummy variables that represent that factor, taken as a whole, do a useful amount

because each variable only introduces one coeﬃcient into the model. b1 + · · · + b19 = 0. which says . is p < . of explaining. with 18 and 765 degrees of freedom.5. The other dummy variables are    1 if subject i is in the aerospace-defense industry Ind[AeroDef] = −1 if subject i is in the utilities industry   0 otherwise. but to deﬁne all the others a little diﬀerently. The partial F test from Section 6. .6.63. the computer only estimates the ﬁrst 18 b’s.4. For example. the computer uses the formula from Section 6. .11 is the level left out. 11.2.2 Making the coeﬃcients sum to zero The computer uses a trick to force the coeﬃcients of the dummy variables to sum to zero. REGRESSION WHEN X IS CATEGORICAL Not on the test 6. If b1 . By using this trick. and derives b19 .2. because he gets a -1 on every dummy variable in the model. The computer didn’t actually estimate a parameter for it. . always assigning −1 to the level that doesn’t get its own dummy variable. These 18 dummy variables reduce SSE by 11.14821 is the MSE from the ANOVA table.14821 where 0. Clearly.3607 0. suppose the Utility industry in Figure 6.4 answers that question.11. compared to the model with just Age and Proﬁt. The trick is to leave one of the dummy variables out. The p-value generated by this partial F statistic.11. That explains why you don’t see Ind[Utility] in the regular parameter estimates box. b2 .63/18 F = = 4.0001. . To calculate the F statistic. then a CEO in the utilities industry adds b19 = −1(b1 + b2 + · · · + b18 ) to the regression equation. b18 are the coeﬃcients for the other 18 industries. The partial F tests for Age and Proﬁts are equivalent to the t-tests for those variables. The partial F test for the industry variable Ind examines whether the 18 “free” dummy variables reduce SSE by a suﬃcient amount to justify their inclusion. What makes this such a neat trick is that all “19” dummy variables behave as desired (as if one of them was 1 and the rest were 0) in Figure 6. 
   1 if subject i is in the ﬁnance industry Ind[Finance] = −1 if subject i is in the utilities industry   0 otherwise 143 and so forth. The partial F test for each variable in the model is shown in the “Eﬀects Test” table in Figure 6.

that at least some of the dummy variables are helpful, after controlling for Age and Profits. Once you see that a categorical factor is helpful, there is nothing that says you have to keep all the individual dummy variables in the model. However, if you want to drop insignificant dummy variables from the model you will need to create all the individual dummy variables by hand. That is a big hassle that doesn't seem to have a big payoff. We make an exception to the "no garbage variables" rule when it comes to individual dummy variables that are part of a factor which is significant overall.

6.5.3 Testing Differences Between Factor Levels

Once you determine that there is a difference between a subset of the factor levels, you will often want to run subsequent tests to see where the differences are. The t-statistics for the individual dummy variables are set up to compare the dummy variable to the baseline regression, not to compare two dummy variables to one another. The tool you use to compare dummy variables is called a contrast. A contrast is a weighted average of dummy variable coefficients where some of the weights are positive and some are negative. The weights can be anything you like as long as the positive and negative weights each sum to 1. Often the positive components of the contrast are equally weighted, and likewise for the negative components.

For example, suppose you wanted to compare average CEO compensation in the Finance and Insurance industries with average compensation in the Entertainment, Retailing, and Consumer Goods industries, after controlling for Age and Profits.[16] Number these five industries 1 through 5. The contrast you want to compute is[17]

$$0.5\beta_1 + 0.5\beta_2 - 0.333\beta_3 - 0.333\beta_4 - 0.333\beta_5$$

where βi is the coefficient of the dummy variable for the ith industry. The results of this contrast are shown in the box below.

    Estimate   Std Error   t Ratio   Prob>|t|   SS
    -0.126     0.0474      -2.658    0.008      1.047

    Sum of Squares   Numerator DF   Denominator DF   F Ratio        Prob > F
    1.0469633847     1              765              7.0640690154   0.0080286945

The small p-value says that the difference is significant. The estimate is negative, which says that the average compensation in the Entertainment/Retail/Consumer industries (which have negative weights) is larger than the average in the Finance/Insurance industries.

[16] We don't know why someone might want to do this, but it illustrates the point. A more meaningful example appears in Section 6.6.1.
[17] See page 182 for computer tips.
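The contrast arithmetic above can be sketched in a few lines of Python. This is only an illustration: the coefficient values are hypothetical placeholders, not estimates from the CEO data, and the function name `contrast` is our own. Only the weights, the "each side sums to 1" rule, and the reported t ratio come from the text.

```python
def contrast(weights, coefs):
    """Weighted combination of dummy-variable coefficients."""
    positive = sum(w for w in weights if w > 0)
    negative = -sum(w for w in weights if w < 0)
    # The positive weights and the negative weights should each sum to 1.
    assert abs(positive - 1) < 1e-9 and abs(negative - 1) < 1e-9
    return sum(w * b for w, b in zip(weights, coefs))

# Finance, Insurance versus Entertainment, Retailing, Consumer Goods.
weights = [0.5, 0.5, -1 / 3, -1 / 3, -1 / 3]

# HYPOTHETICAL coefficients, chosen only to make the arithmetic visible.
hypothetical_coefs = [0.10, 0.06, 0.25, 0.15, 0.20]
estimate = contrast(weights, hypothetical_coefs)

# A contrast is a single comparison, so its F ratio is the square of its
# t ratio. Using the rounded t ratio printed in the box above:
t_ratio = -2.658
f_ratio = t_ratio ** 2
```

With these made-up coefficients the estimate is negative, which would be read the same way as in the text: the group with negative weights has the larger average. Squaring the rounded t ratio gives roughly the F ratio shown in the box.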

6.6 Interactions Between Variables

Now that we can incorporate either categorical or continuous X's, multiple regression looks like a very flexible tool. It gets even more flexible when we introduce the idea of interaction, where the relationship between Y and X depends on another X. Mathematically, interaction means that the slope of X1 depends on X2, and vice-versa. Interaction can be a hard idea to internalize because there are three variables involved. The practical implications of interactions become much easier to understand if you think of the regression slopes as meaningful "real world" quantities instead of the generic "one unit change in X leads to a ..."

Figure 6.12: (a) typical regression, (b) collinearity, (c) interaction, (d) autocorrelation. Collinearity is a relationship between two or more X's. Interaction is when the strength of the relationship between X1 and Y depends on X2. Autocorrelation is when Y depends on previous Y's.

For example, the CEO compensation data set has a variable containing the percentage of a company's stock owned by the CEO. Suppose we wish to examine the effect of stock ownership on CEO compensation. From Section 6.5.2 we know that log10 compensation also depends on profits.[18] The standard way of including Profit and Stock ownership in a regression model is

$$\hat{Y} = \beta_0 + \beta_1\,\text{Stock} + \beta_2\,\text{Profit}.$$

This model says that increasing a CEO's ownership of his company's Stock by 1% increases his expected compensation by β1, regardless of the company's profit level. That seems wrong to us. Instead it seems like CEOs with more stock should do even better if they run profitable companies than if they don't. Said another way, we think the coefficient of Stock should be larger if the CEO's company is profitable, and smaller if it is not.

[18] and on Age, Industry, and presumably other variables as well. Ignore them for the moment.

Consider the regression model where

$$\hat{Y} = \beta_0 + \beta_1\,\text{Stock} + \beta_2\,\text{Profit} + \beta_3\,(\text{Stock})(\text{Profit}).$$

The term (Stock)(Profit) is called the interaction between Stock and Profit. It is simply the product of the two variables. The individual variables Stock and Profit are called main effects or first order effects of the interaction. To determine the interaction's effect on the "slope" of Stock, ignore all terms in the model that don't have Stock in them, then factor Stock out of all that do.[19] The slope of Stock is

$$\text{``}\beta_{\text{Stock}}\text{''} = \beta_1 + \beta_3\,\text{Profit}.$$

If β3 is positive then the "slope" of Stock is larger when Profit is high, and smaller when Profit is low. The interaction can be interpreted the other way as well:

$$\text{``}\beta_{\text{Profit}}\text{''} = \beta_2 + \beta_3\,\text{Stock}$$

so if β3 > 0 then the model says that the profits of a CEO's company have a larger impact on a CEO's compensation if the CEO owns a lot of stock.

[19] If you know about partial derivatives, we're just taking the partial derivative with respect to Stock, treating all other variables as constants.

The key to understanding interactions is to come up with a "real world" meaning for the slopes of the variables involved in the interaction. In the current example, βStock is the impact that stock ownership has on a CEO's overall compensation. If β3 > 0 then stock ownership has a more positive impact on compensation for CEOs of more profitable companies, and a more negative impact on compensation for CEOs of money losing companies.

Figure 6.13 tests our theory about the relationship between Stock, Profit, and log10 compensation by incorporating the Stock*Profit interaction into the regression from Figure 6.11.

    Term                                 Estimate    Std Error   t Ratio   Prob>|t|
    Intercept                            5.7103165   0.118212    48.31     <.0001
    Ind[aerosp]                          0.1351741   0.087518     1.54     0.1229
    (other industry dummy variables omitted)
    Ind[Travel]                          0.2064925   0.098087     2.11     0.0356
    Age                                  0.0087606   0.002067     4.24     <.0001
    Profits                              0.0001199   0.000029     4.15     <.0001
    Stock%                              -0.009774    0.002402    -4.07     <.0001
    (Stock%-2.17974)*(Profit-244.202)   -0.000013    0.000009    -1.40     0.1630

Figure 6.13: Regression of log10 CEO compensation on Age, Industry, Profit, the percent of a company's stock owned by the CEO, and the Profit*Stock interaction.

The first thing we notice is that the interaction term is insignificant (p = .1630). That means that the effect of stock ownership on CEO compensation is about the same for CEOs of highly profitable companies and CEOs of companies with poor profits. So much for our theory! We also notice that the coefficient of Stock is negative. Perhaps CEOs who own large amounts of stock choose to forgo huge compensation packages in the hopes that their existing shares will increase in value. The leverage plot for Stock seems to fit with that explanation. It seems we had a flaw in our logic. CEOs that own a lot of stock obviously do well when their companies are profitable, but an increase in the value of their existing shares does not count as compensation, so it won't show up in our data.

Just for practice, let's try to interpret what the interaction term in Figure 6.13 says, even though it is insignificant and should be dropped from the model. Before creating the interaction term, the computer centers Stock and Profit around their average values to avoid introducing excessive collinearity into the model.[20] Figure 6.13 estimates the slope of Stock as

$$b_{\text{Stock}} = -0.009774 - 0.000013\,(\text{Profit} - 244.202).$$

The estimate says that each million dollars of profit over the average of \$244 million decreases the slope of Stock by 0.000013. Notice that the mean of Stock (2.17974), when you multiply out the interaction term, multiplies Profit, not Stock, so it doesn't affect bStock. The estimated slope of Profit is

$$b_{\text{Profit}} = 0.0001199 - 0.000013\,(\text{Stock} - 2.17974),$$

which says that each percent of the company's stock owned by the CEO, over the average of 2.18%, decreases the slope of Profit by 0.000013.

[20] See page 136.

6.6.1 Interactions Between Continuous and Categorical Variables

Interactions say that the slope for one variable depends on the value of another variable. When one variable is continuous and the other is categorical, interactions mean that there is a different slope for each level of the categorical variable.

For example, suppose a factory supervisor oversees three managers, conveniently named a, b, and c. The supervisor decides to base each manager's annual performance review on the typical amount of time (in man-hours) required for them to complete a production run. The supervisor obtains a random sample of 20 production runs from each manager. To make the comparison fair, the size (number of items produced) of each production run is also recorded.

The supervisor regresses RunTime on RunSize and Manager to obtain the regression output in Figure 6.14(a). The regression estimates the fixed cost of starting up a production run at 176 man-hours, and the marginal cost of each item produced is about .25 man-hours (i.e. it takes each person about 15 minutes to produce an item, once the process is up and running). The fixed cost for manager a is 38.4 man-hours above the baseline. Manager b is 14.7 man-hours below the baseline, and manager c is 23.8 man-hours below the baseline. The effects test shows that the Manager variable is significant, so the differences between the managers are too large to be just random chance.

    (a) Expanded Estimates (nominal factors expanded to all levels)
    Term          Estimate     Std Error   t Ratio   Prob>|t|
    Intercept     176.70882    5.658644    31.23     <.0001
    Manager[a]     38.409663   3.005923    12.78     <.0001
    Manager[b]    -14.65115    3.031379    -4.83     <.0001
    Manager[c]    -23.75851    2.995898    -7.93     <.0001
    Run Size        0.243369   0.025076     9.71     <.0001

    (b) Effect Tests
    Source      Nparm   DF   Sum of Sq.   F Ratio   Prob > F
    Manager     2       2    44773.996    83.4768   <.0001
    Run Size    1       1    25260.250    94.1906   <.0001

    (c) Expanded Estimates
    Term                  Estimate     Std Err.   t Ratio   Prob>|t|
    Intercept             179.59191    5.619643   31.96     <.0001
    Manager[a]             38.188168   2.900342   13.17     <.0001
    Manager[b]            -13.5381     2.936288   -4.61     <.0001
    Manager[c]            -24.65007    2.887839   -8.54     <.0001
    Run Size                0.2344284  0.024708    9.49     <.0001
    Mgr[a](RS-209.317)      0.072836   0.035263    2.07     0.0437
    Mgr[b](RS-209.317)     -0.09765    0.037178   -2.63     0.0112
    Mgr[c]*RS               0.0248147  0.032207    0.77     0.4444

    (d) Effect Test
    Source       Nparm   DF   Sum Sq.     F Ratio   Prob > F
    Manager      2       2    43981.452   89.6934   <.0001
    Run Size     1       1    22070.614   90.0192   <.0001
    Manager*RS   2       2     1778.661    3.6273   0.0333

Figure 6.14: Evaluating three managers based on run-time (a-b) without interactions (c-d) with the interaction between manager and run size. Manager a (broken line), Manager b (+, heavy line), Manager c (×, light line).

Upon determining that there actually is a difference between the three managers, the supervisor computed two contrasts (shown in Figure 6.15), indicating that the setup time for manager a is significantly above the average times of managers b and c on runs of similar size, and that the difference between managers b and c is not statistically significant. The supervisor decides to put manager a on probation, and give managers b and c their usual bonus.

    a    b      c      Estimate   Std Error   t Ratio   Prob>|t|   SS
    1    -0.5   -0.5   57.614     4.5089      12.778    <.0001     43788

    a    b      c      Estimate   Std Error   t Ratio   Prob>|t|   SS
    0    1      -1     9.1074     5.2243      1.7433    0.0868     814.99

Figure 6.15: Contrasts based on the model in Figure 6.14(a).

While meeting with the managers to discuss the annual review, the supervisor learned that manager b used a different production method than the other two managers. Manager b's method was optimized to reduce the marginal cost of each unit produced. The supervisor began to worry that he hadn't judged manager b fairly, because his analysis assumed that all three managers had the same marginal cost of .243 man-hours per item. If manager b's method was truly effective (which the supervisor isn't sure about and would like to test), manager b would have an advantage on large jobs that the supervisor's analysis ignored. Unhappy with his current model, the supervisor calls you to help.

The supervisor's question suggests an interaction between the Manager variable and the RunSize variable. Interactions say that the slope of one variable (in this case RunSize) depends on another (in this case Manager). So in the current example an interaction simply means each manager has his own slope, or even more specific to this example: his own marginal cost of production. After hiring you to explain all this (and paying you a huge consulting fee) the supervisor decides to run a new regression that includes the interaction between Manager and RunSize, shown in Figure 6.14(c). The interaction term enters into the model as a product of RunSize with all the Manager[x] dummy variables, after centering the continuous variable RunSize to limit collinearity.[21] Because the interaction enters into the model in the form of several variables at once, the right way to test its significance is through the partial F test shown in the "Effects Test" table. The interaction is significant with p = .0333, which says that at least one of the managers has a different slope (marginal cost) than the other two.

[21] The centering term is not present in the Mgr[c] line of the computer output, but that is a typo on the part of the computer programmers. The output should read Mgr[c](RS-209.317).

The regression equations for the three managers are:

$$\text{RunTime} = \begin{cases} 179.59 + 38.19 + .234\,\text{RunSize} + .073\,(\text{RS} - 209.3) & \text{if Manager a} \\ 179.59 - 13.54 + .234\,\text{RunSize} - .098\,(\text{RS} - 209.3) & \text{if Manager b} \\ 179.59 - 24.65 + .234\,\text{RunSize} + .025\,(\text{RS} - 209.3) & \text{if Manager c.} \end{cases}$$

The baseline marginal cost per unit of production is .2344 man-hours per unit. Manager a adds .073 man-hours per unit, manager b reduces the baseline marginal cost by .098 man-hours per unit, and manager c adds .025 man-hours per unit. Notice that the coefficients of the dummy variables in the interaction terms sum to zero just like the coefficients of the Manager main effects. By adding like terms we find that manager a's slope is 0.307, manager b's is 0.136, and manager c's is 0.259. JMP doesn't provide a method equivalent to a contrast for testing interaction terms, but clearly manager b has a smaller slope (i.e. smaller marginal cost per item produced) than the other two.

6.6.2 General Advice on Interactions

Students often ask how you know if you need to include an interaction term. The simple answer is that you try it in the model and look at its p-value, just like any other variable. The trick is having some idea which interaction to check, which requires some contextual knowledge about the problem. There is no plot or statistic you look at to say "I need an interaction here." The variables involved in an interaction may or may not be related to one another. You should think to try an interaction term whenever you think that changing a variable (either categorical or continuous) changes the dynamics of the other variables.

Interactions are relatively rare in regression analysis. Many people don't bother to check for them because of the intuition required for the search. However, when you find a strong one it is a real bonus because you've learned something interesting about the process you're studying. To develop that intuition you should think about the economic interpretation of the coefficients as marginal costs, marginal revenues, etc.

Finally, when fitting models with interactions remember the hierarchical principle which says that if you have an interaction in the model, you should also include the main effects of each variable, even if they are not significant. The reason is that we interpret interactions as the effect that one variable has on the slope of another. That interpretation becomes more difficult if the slope of either variable is excluded from the model. The exception to the hierarchical principle is when the variable you create by multiplying two other variables is interpretable on its own merits. In that case you have the option of keeping or dropping the main effects as conditions warrant.

6.7 Model Selection/Data Mining

Model selection is the process of trying to decide which are the important variables to include in your model. This is probably the most difficult part of a typical regression problem. Model selection is one of the key components in the field of "Data Mining" which you have probably heard about before. Data mining has arisen because of the huge amounts of data that are now becoming available as a result of new technologies such as bar code scanners. Every time you go to the supermarket and your products are scanned at the checkout that information is sent to a central location. There are literally terabytes of information recorded every day. This information can be used to plan marketing campaigns, promotions, etc. However, there is so much data that it is virtually impossible to manually decide which of the thousands of recorded variables are important for any given response and which are not. The field of data mining has sprung up to try to answer this problem (and related ones). However, the problem has existed in statistics for a long time before data mining arrived. There are some established procedures for deciding on the best variables, which we discuss below. In the interests of fairness we should point out that this is such a hard problem that answering it is somewhat more of an art than a science!

6.7.1 Model Selection Strategy

When deciding on the "best" regression model you have several things to keep your eye on. You should check for (1) regression assumption violations, (2) unusual points, and (3) insignificant variables in the model. You should also keep your eye on the VIF's of the variables in the model to be aware of any collinearity issues.

When faced with a mountain of potential X variables to sort through, a good strategy is to start off with what significant predictors you can find by trial and error. Each time you fit a model, glance at the diagnostics in the computer output to see if you can spot any glaring assumption violations or unusual points. Try to economically interpret the coefficients of the models you fit ("β3 is the cost of raw materials"). Being able to interpret your coefficients this way will help you understand the model better, make it easier for you to explain the model to other people, and perhaps give you some insight on interactions that you might want to test for. This stage of the analysis generally requires quite a bit of work and a lot of practice. However, there is really no good substitute for this kind of approach.

Once you've gone through a few iterations and are more or less happy with the situation regarding regression assumptions and unusual points, you eventually get to the stage where a computer can help you figure out which variables will provide significant p-values. Be careful that you don't jump to this step too soon, though. Computers are really good at churning through variables looking for significant p-values, but they are really bad at identifying and fixing assumption violations, suggesting interpretable transformations to remove collinearity problems, and making judgments about what to do with unusual points.
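As a concrete instance of reading coefficients as "real world" quantities, the per-manager marginal costs from the interaction model of Section 6.6.1 can be recomputed by adding like terms. The sketch below uses the rounded coefficients quoted in the regression equations for the three managers; the function names and the run size of 400 items are our own illustrative choices, and 400 may lie outside the range of the supervisor's data.

```python
# Rounded coefficients from the regression equations in Section 6.6.1.
INTERCEPT = 179.59
RUN_SIZE_SLOPE = 0.234          # baseline marginal cost, man-hours per item
MEAN_RUN_SIZE = 209.3           # centering constant used in the interaction
MANAGER_EFFECT = {"a": 38.19, "b": -13.54, "c": -24.65}
INTERACTION = {"a": 0.073, "b": -0.098, "c": 0.025}

def marginal_cost(manager):
    """Slope of RunSize for one manager: baseline plus interaction term."""
    return RUN_SIZE_SLOPE + INTERACTION[manager]

def predicted_run_time(manager, run_size):
    """Predicted man-hours for a run of the given size."""
    return (INTERCEPT + MANAGER_EFFECT[manager]
            + RUN_SIZE_SLOPE * run_size
            + INTERACTION[manager] * (run_size - MEAN_RUN_SIZE))

slopes = {m: round(marginal_cost(m), 3) for m in "abc"}

# Both sets of manager coefficients sum to (essentially) zero.
main_effect_sum = sum(MANAGER_EFFECT.values())
interaction_sum = sum(INTERACTION.values())

# HYPOTHETICAL run size, to see how a smaller slope plays out on a big job.
b_large = predicted_run_time("b", 400)
c_large = predicted_run_time("c", 400)
```

The `slopes` dictionary reproduces the three marginal costs quoted in the text. Under these rounded coefficients, manager b's smaller slope gives a lower predicted run time than manager c's on the hypothetical 400-item job, even though manager c has the lower fixed cost.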

6.7.2 Multiple Comparisons and the Bonferroni Rule

At several points throughout this Chapter we have urged caution when looking at p-values for individual variables. The primary reason is the problem of multiple comparisons. Our .05 rule for p-values meant that if an X was a garbage variable, we had only a 5% chance of mistaking it for a significant one. In previous Chapters each problem had only one hypothesis test for us to do. Regression analysis involves many different hypothesis tests and a lot more trial and error, which means there is a much greater opportunity for garbage variables to sneak into the regression model by chance. Consequently, our .05 rule for p-values no longer protects us from spurious significant results as well as it did before.

One obvious solution is to enact a tougher standard for declaring significance based on the p-values of individual variables in a multiple regression. The question is "how much tougher?" Unfortunately there is no way of knowing, but there is a conservative procedure known as the Bonferroni adjustment which we know to be stricter than we need to guarantee a particular level of significance. The Bonferroni adjustment says that if you are going to select from p potential X variables, then replace the .05 threshold for significance with .05/p. Thus if there were 10 potential X variables we should use .005 as the significance threshold instead of .05.

The Bonferroni rule is a rough guide. It is important to remember that it is tougher than it needs to be, simply because we have no way of knowing precisely how much we should relax it to maintain a "true" .05 significance level. At some point (a practical limit is .0001, the smallest p-value that the computer will print) the Bonferroni adjusted threshold gets low enough that we don't penalize it any further.

Not on the test 6.3 Where does the Bonferroni rule come from?

Suppose X1, ..., Xp are all variables with no relationship to Y. Now suppose we run a regression and pick out only the X's that have significant individual p-values. Each Xi has a probability α of spuriously making it into the model, where our usual rule is α = .05. From very basic probability we know that if A and B are two events, then

$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \le P(A) + P(B).$$

Therefore, the probability that at least one of the Xi's makes it into the model is

$$P(X_1 \text{ or } X_2 \text{ or } \cdots \text{ or } X_p) \le P(X_1) + \cdots + P(X_p) = p\alpha.$$

Thus if we replace the .05 rule with a .005 rule, then we can look at 10 t-statistics and still maintain only a 5% chance of allowing at least one garbage variable into the regression.

6.7.3 Stepwise Regression

In regression problems with only a few X variables it is best to select the variables in the regression model by hand. However, if you have a mountain of X variables staring you in the face you're going to need some help. Some sort of automated approach is required to choose a manageable subset of the variables to consider in more detail. There are three common approaches, collectively called stepwise regression.

• Forward Selection. With this procedure we start with no variables in the model and add to the model the variable with lowest p-value. We then add the variable with next lowest p-value (conditional on the first variable being in the model). This approach is continued until the p-value for the next variable to add is above some threshold (e.g. 5%, adjusted by the Bonferroni rule) at which point we stop. The threshold must be chosen by the user.

• Backward Selection. With backward selection we start with all the possible variables in the model. We then remove the variable with largest p-value. Then we remove the variable with next largest p-value (given the new model with the first variable removed). This procedure continues until all remaining variables have a p-value below some threshold, again chosen by the user.

• Mixed Selection. This is a combination of forward and backward selection. We start with no variables in the model and as with forward selection add the variable with lowest p-value. The procedure continues in this way except that if at any point any of the variables in the model have a p-value above a certain threshold they are removed (the backward part). These forward and backward steps continue until all variables in the model have a low enough p-value and all variables outside the model have a large p-value if added. Notice that this procedure requires selecting two thresholds. If the two thresholds are close to one another the procedure can enter a cycle where including a variable makes a previously included variable insignificant. Dropping that variable makes the previous variable insignificant, which makes the first variable significant again. If you encounter such a cycle then simply choose a tougher threshold for including variables in the model.

JMP (or most other statistical packages) will automatically perform these procedures for you.[22] The only inputs you need to provide are the data and the thresholds. Note that the default thresholds that JMP provides are TERRIBLE. They are much larger than .05, when they should be much smaller. When using the stepwise procedure remember to protect yourself by setting tough significance thresholds.

[22] See page 182 for computer tips.

To illustrate, we simulated 10 normally distributed random variables, totally independently, for use in a regression to predict the stock market (60 monthly observations of the VW index from the early 1990's). The last 12 months were set aside, and the remaining 48 were used to fit the model. Obviously our X variables are complete gibberish, as is reflected in the ANOVA table for the regression of VW on X1, ..., X10 shown in Figure 6.16. The second ANOVA table in Figure 6.16 was obtained using the Mixed stepwise procedure, with JMP's default thresholds, on all possible interaction and quadratic terms for the 10 X variables. Notice R2 = .92! The whole model F test is highly significant.

Figure 6.16(a) shows a time series plot of the predictions from this regression model on the same plot as the actual data. For the 48 months that the model was fit on, the predicted and actual series track nearly perfectly. For the last 12 months, which were not used to fit the model, the two series are completely unrelated to one another. Figure 6.16(b) makes the same point using a different view of the data. It plots the residuals versus the predicted returns for all 60 observations. At least half of the 12 points not used in fitting the model could be considered outliers.

The point of this exercise is that stepwise regression is a "greedy" algorithm for choosing X's to be in the model. Give it a chance and it will include all sorts of spurious variables that happen to fit just by chance. When we tried our same experiment using .001 as a threshold (the lowest threshold that JMP allows), we ended up with an "empty" model, which in this case is the right answer.
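The Bonferroni arithmetic of Section 6.7.2 can be checked with a small simulation. The sketch below rests on a simplifying assumption that is ours, not the text's: the p-values of garbage variables are treated as independent draws from a uniform distribution on (0, 1). Real regression p-values are usually not independent, which is one reason the Bonferroni rule is only a conservative guide. The function names and the number of trials are illustrative.

```python
import random

def bonferroni_threshold(alpha, n_candidates, floor=0.0001):
    """alpha / p, but never below the practical limit mentioned in the text."""
    return max(alpha / n_candidates, floor)

def chance_of_spurious_hit(threshold, n_candidates, n_trials=20000, seed=0):
    """Fraction of simulated searches in which at least one garbage
    variable has a p-value below the threshold.

    ASSUMPTION: garbage-variable p-values are independent Uniform(0, 1).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        if any(rng.random() < threshold for _ in range(n_candidates)):
            hits += 1
    return hits / n_trials

# Ten garbage candidates judged by the usual .05 rule ...
naive_rate = chance_of_spurious_hit(0.05, 10)

# ... and by the Bonferroni-adjusted .005 rule.
adjusted = bonferroni_threshold(0.05, 10)
adjusted_rate = chance_of_spurious_hit(adjusted, 10)
```

Under the independence assumption the exact rates are 1 − 0.95^10 (about 40%) for the .05 rule and 1 − 0.995^10 (just under 5%) for the .005 rule, and the simulated fractions should land near those values.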

    ANOVA Table for ordinary regression on 10 "Random Noise" variables
    Source   DF   Sum of Squares   Mean Square   F Ratio
    Model    10   0.00409569       0.000410      0.2906
    Error    36   0.05074203       0.001410      Prob > F
    Total    46   0.05483773                     0.9791

    Mixed Stepwise with JMP's Defaults (all possible interactions considered)
    Source   DF   Sum of Squares   Mean Square   F Ratio
    Model    26   0.05058174       0.001945      9.1422
    Error    20   0.00425599       0.000213      Prob > F
    Total    46   0.05483773                     <.0001

    RSquare   0.922389
    RMSE      0.014588
    N         47

    (a) Time series plot of predicted and actual values. (b) Residuals versus predicted returns.

Figure 6.16: Regression output for the value weighted stock market index regressed on 10 variables simulated from "random noise."
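The summary statistics in Figure 6.16 all follow from the sums of squares and degrees of freedom, so they can be re-derived as a sanity check. The sketch below uses only the numbers printed in the two ANOVA tables; the function name `anova_summary` is our own.

```python
import math

def anova_summary(ss_model, df_model, ss_error, df_error):
    """Re-derive F, R-square, and RMSE from an ANOVA table."""
    ms_model = ss_model / df_model      # mean square for the model
    ms_error = ss_error / df_error      # mean square error
    return {
        "f_ratio": ms_model / ms_error,
        "r_square": ss_model / (ss_model + ss_error),
        "rmse": math.sqrt(ms_error),
    }

# Ordinary regression on the 10 noise variables.
noise_fit = anova_summary(0.00409569, 10, 0.05074203, 36)

# Mixed stepwise with JMP's default thresholds.
stepwise_fit = anova_summary(0.05058174, 26, 0.00425599, 20)
```

The stepwise fit reproduces the F ratio, RSquare, and RMSE shown in the figure. The ordinary regression explains well under a tenth of the variation, while the stepwise search over interactions and quadratic terms pushes RSquare above .9 using nothing but noise.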


Chapter 7

Further Topics

7.1 Logistic Regression

Often we wish to understand the relationship between X and Y where X is continuous but Y is categorical. For example a credit card company may wish to understand what factors (X variables) affect whether a customer will default or whether a customer will accept an offer of a new card. Both "default" and "accept new card" are categorical responses. Suppose we let

$$Y_i = \begin{cases} 1 & \text{if the } i\text{th customer defaults} \\ 0 & \text{if not} \end{cases}$$

Why not just fit the regular regression model,

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

and predict Y as

$$\hat{Y} = b_0 + b_1 X?$$

There are two problems with this prediction. First, if Ŷ = 0.5, for example, we know this is an incorrect prediction because Y is either 0 or 1. Second, depending on the value of X, Ŷ can range anywhere from negative to positive infinity. Clearly this makes no sense because Y can only take on the values 0 or 1.

A New Model

The first problem can be overcome by treating Ŷ as a guess, not for Y itself, but for the probability that Y equals 1, i.e. P(Y = 1). So, for example, if Ŷ = 0.5 this would indicate that the probability of a person with this value of X defaulting is 50%. Thus any prediction between 0 and 1 now makes sense. However, this does not solve the second problem because a probability less than zero or greater than one still has no sensible interpretation. The problem is that a straight line relationship between X and P(Y = 1) is not correct. The true relationship will always be between 0 and 1. We need to use a different function, i.e. not linear. There are many possible functions but the one that is used most often is the following

$$p = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

where p = P(Y = 1). This curve has an S shape. Notice that as β1X gets close to infinity p gets close to one (but never goes past one) and when β1X gets close to negative infinity p gets close to zero (but never goes below zero). Hence, no matter what X is we will get a sensible prediction. If we rearrange this equation we get

$$\frac{p}{1 - p} = e^{\beta_0 + \beta_1 X}$$

p/(1 − p) is called the odds. It can go anywhere from 0 to infinity. A number close to zero indicates that the probability of a 1 (default) is close to zero while a number close to infinity indicates a high probability of default. Finally, by taking logs of both sides we get

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X$$

The left hand side is called the log odds or "logit." Notice that there is a linear relationship between the log odds and X.

Fitting the Model

Just as with regular regression β0 and β1 are unknown. Therefore to understand the relationship between X and Y and make future predictions for Y we need to estimate them using b0 and b1. These estimates are produced using a method called "Maximum Likelihood" which is a little different from "Least Squares." However, the details are not important because most of the ideas are very similar. We still have the same questions and problems as with standard regression. For example b1 is only a guess for β1 so it has a standard error which we can use to construct a confidence interval. We are still interested in testing the null hypothesis H0 : β1 = 0 because this corresponds to "no relationship between X and Y." We now use a χ2 (chi-square) statistic rather than t but you still look at the p-value to see whether you should reject H0 and conclude that there is a relationship.

The interpretation of β1 is a little more difficult than with standard regression. If β1 is positive then increasing X will increase p = P(Y = 1) and vice versa for β1 negative. However, the effect of increasing X by one on the probability is less clear. Increasing X by one changes the log odds by β1. Equivalently it multiplies the odds by $e^{\beta_1}$. However, the effect this will have on p depends on what the probability is to start with.

Making Predictions

Suppose we fit the model using average debt as a predictor and get b0 = −3 and b1 = .001. Then for a person with 0 average debt we would predict that the probability they defaulted on the loan would be

$$p = \frac{e^{-3 + 0.001 \times 0}}{1 + e^{-3 + 0.001 \times 0}} = \frac{e^{-3}}{1 + e^{-3}} = 0.047$$

On the other hand a person with \$2,000 in average debt would have a probability of default of

$$p = \frac{e^{-3 + 0.001 \times 2{,}000}}{1 + e^{-3 + 0.001 \times 2{,}000}} = \frac{e^{-1}}{1 + e^{-1}} = 0.269$$

Multiple Logistic Regression

The logistic regression model can easily be extended to as many X variables as we like. Instead of using

$$p = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

we use

$$p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$

or equivalently

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.$$

The same questions from multiple regression reappear. Do the variables overall help to predict Y? (Look at the p-value for the "whole model test.") Which of the individual variables help? (Look at the individual p-values.) What effect does each X have on Y? (Look at the signs on the coefficients.) In the credit card example it looked like average balance and income had an effect on the probability of default. Higher average balances caused a higher probability (b1 was positive) and higher incomes caused a lower probability (b2 was negative). Therefore we are especially concerned about people with high balances and low incomes.
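The single-predictor calculations under "Making Predictions" can be reproduced with a short Python function. This is a sketch for checking the arithmetic, not a model-fitting routine: the coefficients b0 = −3 and b1 = .001 are the ones assumed in the text, and the function name is our own. It uses the form 1/(1 + e^(−z)), which is algebraically the same as e^z/(1 + e^z).

```python
import math

def logistic_probability(intercept, slopes, xs):
    """P(Y = 1) for given coefficients and predictor values."""
    log_odds = intercept + sum(b * x for b, x in zip(slopes, xs))
    # 1 / (1 + e^-z) equals e^z / (1 + e^z).
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

p_no_debt = logistic_probability(-3, [0.001], [0])
p_2000_debt = logistic_probability(-3, [0.001], [2000])

# Each extra dollar of debt multiplies the odds by e^b1, so 2,000 extra
# dollars multiply the odds by e^(0.001 * 2000) = e^2.
odds_multiplier = odds(p_2000_debt) / odds(p_no_debt)
```

The two probabilities round to the 0.047 and 0.269 shown above. The odds multiplier is the same wherever you start, while the change in the probability itself is not, which is the point made in the text about interpreting β1.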

Making predictions with several variables

Suppose we fit the model using average debt and income as predictors and get b0 = 1, b1 = .001, and b2 = −.0001. Then for a person with 0 average debt and income of \$50,000 we would predict that the probability they defaulted on the loan would be

$$p = \frac{e^{1 + 0.001 \times 0 - 0.0001 \times 50{,}000}}{1 + e^{1 + 0.001 \times 0 - 0.0001 \times 50{,}000}} = \frac{e^{-4}}{1 + e^{-4}} = 0.0180$$

On the other hand a person with \$2,000 in average debt and income of \$30,000 would have a probability of default of

$$p = \frac{e^{1 + 0.001 \times 2{,}000 - 0.0001 \times 30{,}000}}{1 + e^{1 + 0.001 \times 2{,}000 - 0.0001 \times 30{,}000}} = \frac{e^{0}}{1 + e^{0}} = 0.5$$

7.2 Time Series

The Difference Between Time Series and Cross Sectional Data

• Cross Sectional Data
  – Car's weight predicts/explains fuel consumption.
  – Two different variables, no concept of time.

• Time Series
  – Past number of deliveries explains future number of deliveries.
  – Same variable, different times.

Autocorrelation

When we have data measured over time we often denote the Y variable as Yt where t indicates the time that Y was measured at. The "lag variable" is then just Yt−1, i.e. all the Y's shifted back by one. A standard assumption in regression is that the Y's are all independent of each other. In time series data this assumption is often violated. If there is a correlation between "yesterday's" Y, i.e. Yt−1, and "today's" Y, i.e. Yt, this is called "autocorrelation." Autocorrelation means today's residual is correlated with yesterday's residual. An easy way to spot it is to plot the residuals against time. Tracking, i.e. a pattern where the residuals follow each other, is evidence of autocorrelation. In cross sectional data there is no time component which means no autocorrelation. On the other hand time series data almost always has some autocorrelation.
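The idea of a lag variable and the correlation between "yesterday" and "today" can be sketched in plain Python. Both series below are made up for illustration and the function name is our own; the correlation is the ordinary Pearson correlation between the series and its lag.

```python
import math

def lag1_correlation(values):
    """Pearson correlation between each value and the one before it."""
    today = values[1:]        # Y_t for t = 2, ..., n
    yesterday = values[:-1]   # the lag variable Y_{t-1}
    n = len(today)
    mean_today = sum(today) / n
    mean_yesterday = sum(yesterday) / n
    cov = sum((a - mean_today) * (b - mean_yesterday)
              for a, b in zip(today, yesterday))
    var_today = sum((a - mean_today) ** 2 for a in today)
    var_yesterday = sum((b - mean_yesterday) ** 2 for b in yesterday)
    return cov / math.sqrt(var_today * var_yesterday)

# HYPOTHETICAL series: one that "tracks" (neighbours move together)
# and one that flips sign every period.
tracking = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4]
alternating = [1, -1, 1, -1, 1, -1]

r_tracking = lag1_correlation(tracking)
r_alternating = lag1_correlation(alternating)
```

The tracking series gives a positive lag-one correlation and the alternating series a strongly negative one. Applied to regression residuals instead of raw Y's, the same calculation is the simple autocorrelation check described in this section.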

Impact of Autocorrelation

• On predictions. The past contains additional information about the future not captured by the current regression model. Today’s residual can help predict tomorrow’s residual. This means that we should incorporate the previous (lagged) Y’s in the regression model to produce a better estimate of today’s Y.
• On parameter estimates. Yesterday’s Y can be used to predict today’s Y. As a consequence today’s Y provides less new information than an independent observation. Another way to say this is that the “Equivalent Sample Size” is smaller. For example, 100 Y’s that have autocorrelation may only provide as much information as 80 independent Y’s. Since we have less information the parameter estimates are less certain than in an independent sample.

Testing for Autocorrelation

One test for autocorrelation is to use the “Durbin-Watson Statistic.” It is calculated using the following formula

$$DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$

It compares the variation of residuals about the lagged residuals (the numerator) to the variation in the residuals (denominator). The Durbin-Watson statistic assumes values between 0 and 4. A value close to 0 indicates strong positive autocorrelation. A value close to 2 indicates no autocorrelation and a value close to 4 indicates strong negative autocorrelation. Basically look for values a long way from 2. It can be shown that

$$DW \approx 2 - 2r$$

where r is the autocorrelation between the residuals. Notice that if the correlation is zero the DW statistic should be close to 2. Therefore, an alternative to the Durbin-Watson statistic is to simply calculate the correlation between the residuals and the lagged residuals.

What to do when you Detect Autocorrelation

The easiest way to deal with autocorrelation is to incorporate the lagged residuals in the regression as a new X variable. In other words fit the model with whatever X variables you want to use, save the residuals, lag them and refit the model including the lagged residuals. This will generally improve the accuracy of the predictions as well as remove the autocorrelation.
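To make the Durbin-Watson formula and the approximation $DW \approx 2 - 2r$ concrete, here is a small Python sketch. The two residual series are made-up numbers chosen only to show “tracking” and “alternating” behaviour; they do not come from any data set in these notes, and the function names are illustrative.

```python
import math

def durbin_watson(residuals):
    """Sum of squared changes in the residuals divided by the sum of squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def lag1_correlation(residuals):
    """Correlation between the residuals e_t and the lagged residuals e_{t-1}."""
    current, lagged = residuals[1:], residuals[:-1]
    n = len(current)
    mean_c, mean_l = sum(current) / n, sum(lagged) / n
    cov = sum((c - mean_c) * (l - mean_l) for c, l in zip(current, lagged))
    var_c = sum((c - mean_c) ** 2 for c in current)
    var_l = sum((l - mean_l) ** 2 for l in lagged)
    return cov / math.sqrt(var_c * var_l)

# Hypothetical residuals that follow each other (positive autocorrelation).
tracking = [1.0, 1.2, 0.9, 0.5, 0.1, -0.4, -0.8, -1.1, -0.9, -0.5, 0.0, 0.4, 0.6]
# Hypothetical residuals that flip sign every period (negative autocorrelation).
alternating = [1.0, -1.0] * 5

for name, series in (("tracking", tracking), ("alternating", alternating)):
    dw = durbin_watson(series)
    r = lag1_correlation(series)
    print(name, round(dw, 2), round(r, 2), round(2 - 2 * r, 2))
```

The tracking series should give a value well below 2 and the alternating series a value close to 4, with $2 - 2r$ landing near DW in both cases. With short series like these the approximation is only rough.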

Short Memory versus Long Memory

Autocorrelation is called a “short memory” phenomenon because it implies that the Y’s are affected by or remember what happened in the recent past. “Long memory” can also be important in making future predictions. Some examples of long memory are:

• Trend. This is a long run change in the average value of Y. For example industry growth. To model a long term trend in the data we treat time as another predictor. Just as with any other predictor we need to check what sort of relationship there is between it and Y (i.e. none, linear, non-linear) and transform as appropriate.
• Seasonal. This is a predictable pattern with a fixed period. For example if you are selling outdoor furniture you would expect your summer sales to be higher than winter sales irrespective of any long term trend. To model seasonal variation we should incorporate a categorical variable indicating the season. For example if we felt that sales may depend on the quarter, i.e. Winter, Spring, Summer or Autumn, we would add a categorical variable indicating which of these 4 time periods each of the data points corresponds to. On the other hand we may feel that each month has a different value, in which case we would add a categorical variable with 12 levels, one for each month. When we fit the model JMP will automatically create the correct dummy variables to code for each of the seasons.
• Cyclical. These are long run patterns with no fixed period. An example is the “boom and bust” business cycle where the entire economy goes through a boom period where everything is expanding and then a bust where everything is contracting. Predicting the length of such periods is very difficult and unless you have a lot of data it can be hard to differentiate between cycles and long run trends. Since we have so little time on time series data we won’t worry about cyclical variation in this class.

By incorporating time, seasons and lagged residuals in the regression model we can deal with a large range of possible time series problems. As with any other predictors we should look at the appropriate p-values and plots to check whether the variables are necessary and none of the regression assumptions have been violated.

7.3 More on Probability Distributions

This section explains more about some standard probability distributions other than the normal distribution. You may encounter some of these distributions in your operations class.

7.3.1 Background

There are two basic types of random variables: discrete and continuous. Discrete random variables are typically integer valued. That is: they can be 0, 1, 2, . . . . Continuous random variables are real valued, like 3.141759. If a random variable is discrete you can calculate its probability function $P(X = x)$.

All random variables can be characterized by their cumulative distribution function (or CDF) $F(x) = Pr(X \le x)$. CDF’s have three basic properties: they start at 0, they stop at 1, and they never decrease as you move from left to right. We are already familiar with one CDF: the normal table! If a random variable is continuous you can take the derivative of its CDF to get its probability density function (or PDF). The normal PDF and CDF are plotted in Figure 7.1. If you want to understand a random variable it is easier to think about its PDF because the PDF looks like a histogram. However, the CDF is a handy thing to have around, because you can calculate the probability that your random variable X is in any interval $(a, b]$ by $F(b) - F(a)$, just like we did with the normal table in Section 2.4.

Figure 7.1: PDF (left) and CDF (right) for the normal distribution.
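As a rough illustration of how the CDF is used, the sketch below computes $F(b) - F(a)$ for a normal random variable and checks numerically that the slope of the CDF matches the PDF. It relies on the standard identity that the normal CDF can be written with the error function, available as `math.erf`; the function names are made up for this example.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F(x) = Pr(X <= x) for a normal random variable, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density function."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def interval_probability(a, b, mu=0.0, sigma=1.0):
    """Pr(a < X <= b) = F(b) - F(a)."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

# Probability that a standard normal falls between -1.96 and 1.96.
print(round(interval_probability(-1.96, 1.96), 3))  # 0.95

# The PDF is the derivative of the CDF: compare a numerical slope with the density.
h = 1e-5
slope_at_zero = (normal_cdf(h) - normal_cdf(-h)) / (2 * h)
print(slope_at_zero, normal_pdf(0.0))
```

This replaces the table lookup from Section 2.4 with a direct calculation, and the same pattern works for any distribution whose CDF you can evaluate.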

7.3.2 Exponential Waiting Times

The shorthand notation to say that X is an exponential random variable with rate λ is $X \sim E(\lambda)$. The exponential distribution has the density function

$$f(x) = \lambda e^{-\lambda x}$$

and CDF

$$F(x) = 1 - e^{-\lambda x}$$

for $x > 0$. These are plotted in Figure 7.2. One famous property of the exponential distribution is that it is memoryless. That is, if X is an exponential random variable with rate λ, and you’ve been waiting for a day for the

Figure 7.2: The exponential distribution (a) PDF (b) CDF.

7.3.3 Binomial and Poisson Counts

The binomial probability function is

$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$$
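The binomial probability function is easy to evaluate directly. The sketch below uses `math.comb` (Python 3.8 or later) for the binomial coefficient; the values n = 10 and p = 0.3 are hypothetical and chosen only for illustration.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) = (n choose x) * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Hypothetical example: n = 10 trials, each succeeding with probability p = 0.3.
n, p = 10, 0.3
probabilities = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(probabilities)                                   # should be very close to 1
mean = sum(x * px for x, px in enumerate(probabilities))     # should be very close to n * p
print(round(total, 6), round(mean, 6))
```

Summing the probabilities over every possible count is a quick sanity check that the function has been typed in correctly, and the weighted average of the counts recovers the expected value n·p.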

Figure 7.3: Poisson (a) probability function (b) CDF with λ = 3.

7.3.4 Review

7.4 Planning Studies

7.4.1 Different Types of Studies

In this Chapter we talk about three different types of studies, which are primarily distinguished by their goal. The concept of randomization plays an important role in study design. Different types of randomization are required to achieve the different goals.

Experiment

The goal of an experiment is to determine whether one action (called a treatment) causes another (the effect) to occur. For example, does receiving a coupon in the mail cause a customer to change the type of beer he buys in the grocery store? In an experiment, sample units (the basic unit of study: people, cars, supermarket transactions) are randomly assigned to one or more treatment levels (e.g. bottles/cans) and an outcome measure is recorded. Experiments can be designed to simultaneously test more than one type of treatment (bottles/cans and 12-pack/6-pack).

Survey

The goal of a survey is to describe a population by collecting a small sample from the population. Surveys are more interested in painting an accurate picture of the population than determining causal relationships between

n) Table 7. A survey in which the entire population is included is called a census. As a result they are the most common. Variance.4. but observational studies are the least expensive type of study to perform. bias is the diﬀerence between the expected value of the sample statistic the study is designed to produce and the population parameter that the statistic estimates.1: Summary of random variables variables. σ ) Exponential E (λ) λe−λx 1 − e−λx 1/λ 1/λ2 continuous CHAPTER 7. time consuming. To say that a study is biased means that there is something systematically wrong with the way the study is conducted. Conducting a census is expensive. There can also be some subtle problems with a census. p) px (1 − p)n−x tables np np(1 − p) discrete 1 √ 1 e− 2 2πσ (x−µ)2 σ2 tables µ σ2 continuous lots of stuﬀ tables λ λ discrete counts (no known max) waiting times counts (known max. The key issue in a survey is how to decide which units from the population are to be included in the sample. Mathematically speaking. and the data survey data should be analyzed with the randomization strategy in mind. A carefully conducted survey can sometimes produce more accurate results than a census! Observational Study Like an experiment. For example. in a survey obtained by simple random sampling. The best surveys randomly select units from the overall population. and it may even be impossible for practical or ethical reasons. However.166 Distribution: Notation f (x) CDF E (X ) V ar (X ) Discrete / Continuous Model for Normal N (µ. FURTHER TOPICS Poisson P o(λ) λx −λ x! e n x Binomial B (n.2 Bias. the goal of an observational study is to draw causal conclusions. such as what to do if some people refuse to respond. we know . bias and variance. 7. the sample units in an observational study are not under the control of the study designer. and Randomization Whenever a study is performed to estimate some quantity there are two possible problems. 
The conclusions drawn from an observational study are always subject to criticism. Note that there are many diﬀerent strategies that may be employed to randomly select units for inclusion in a survey.

that $E(\bar{X}) = \mu$ so the bias is zero. If the bias is zero we call the estimator unbiased.

If an unbiased study were performed many times, sometimes its estimates would be too large and sometimes they would be too small, but on average they would be correct. In the context of study design, “variance” simply means that if you performed the study again you would get a slightly different answer. As long as a study is unbiased, we have very effective tools (confidence intervals and hypothesis tests) at our disposal for measuring variance. If these tools indicate that we have not measured a phenomenon accurately enough, then we can reduce the variance by simply gathering a larger sample.

Ideally we want to produce an estimate that has both low bias and low variance. Unfortunately bias is often hard (and sometimes impossible) to detect. If an estimator is biased it will get the wrong answer even if we have no variance. In large studies (where the variance is small because n is so large) bias is usually a much worse problem than variance. That is why randomization is so important in designing studies. Randomization is a tool for turning bias (which we can’t control) into variance (which we can). We can’t control bias, but we can control variance by collecting more data.

7.4.3 Surveys

Randomization in Surveys

The whole idea of a survey is to generalize the results from your sample to a larger population. You can only do so if you are sure that your sample is representative of the population. Randomly selecting units from the population is vital to ensuring the representativeness of your sample. This is counter-intuitive to some people. You may think that if you carefully decide which units should be included in the sample then you can be sure it is representative of the population. However, bias can sneak into a deterministic sampling scheme in some very subtle ways.

A famous example is the Dewey/Truman U.S. presidential election, where the Gallup organization used “quota sampling” in its polling. The survey was conducted through personal interviews, where each interviewer was given a quota describing the characteristics of who they should interview: (12 white males over 40, 7 Asian females under 30). However, there is a limit to the precision that can be placed on quotas. The interviewers followed the quotas, but still managed to show a bias towards interviewing republicans who planned to vote for Dewey. In fact Truman, the democratic candidate, won the election.
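The claim that a larger sample shrinks variance but does nothing for bias can be seen in a small simulation. Everything in this sketch is hypothetical: the population of 10,000 voters, the 48% true share, and the assumption that supporters of one candidate are 1.5 times as likely to be interviewed. It is meant only to illustrate the idea, not to model the 1948 polls.

```python
import itertools
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 voters: 1 = plans to vote for candidate A.
population = [1] * 4800 + [0] * 5200
true_share = sum(population) / len(population)  # 0.48

# Hypothetical flawed scheme: supporters of A are 1.5 times as likely to be interviewed.
cum_weights = list(itertools.accumulate(1.5 if v == 1 else 1.0 for v in population))

def random_sample_share(n):
    """Share supporting A in a simple random sample of size n."""
    return sum(random.sample(population, n)) / n

def biased_sample_share(n):
    """Share supporting A when selection favours A supporters (drawn with replacement)."""
    return sum(random.choices(population, cum_weights=cum_weights, k=n)) / n

def summarize(estimator, n, repetitions=200):
    """Mean and standard deviation of the estimate over many repeated studies."""
    estimates = [estimator(n) for _ in range(repetitions)]
    return statistics.mean(estimates), statistics.stdev(estimates)

results = {}
for n in (100, 2500):
    results[("random", n)] = summarize(random_sample_share, n)
    results[("biased", n)] = summarize(biased_sample_share, n)
    print(n, results[("random", n)], results[("biased", n)])
```

In both schemes the standard deviation of the estimate falls as n grows. The simple random sample stays centered on the true share, while the biased scheme settles near 0.72/1.24, or about 0.58, no matter how large the sample is.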

Figure 7.4: The famous picture of Harry S. Truman after he defeated Thomas E. Dewey in the 1948 U.S. Presidential election.

Steps in a Survey

Define “population”

Clearly defining the population you want to study is important because it defines the “sampling unit,” the basic unit of analysis in the study. If you are not clear about the population you want to study, then it may not be obvious how you should sample from the population. For example, do you want to randomly sample transactions or accounts (which might contain several transactions)? If you sample the wrong thing you can end up with “size bias” where larger units (e.g. transactions occurring in busy accounts) have a higher probability of being selected than smaller units (transactions occurring in small accounts).

Construct sampling frame

The sampling frame is a list of “almost all” units in the population. Sometimes the sampling frame is an explicit list. Sometimes it is implicit, such as “the people watching CNN right now.” Explicit sampling frames give your results greater credibility. “Frame coverage bias” occurs when there are some units in the population that do not appear on the sampling frame. The sampling frame may have some information about the units in the population, but it does not contain the information needed for your study. For example, the U.S. Census is often used as a sampling frame.

Select sample

There are several possible strategies that can be employed to randomly sample units from the sampling frame. The simplest is a “simple random sample.” Simple random sampling is equivalent to drawing units out of a hat. There are other sampling methods, such as stratified random sampling,

cluster sampling, two stage sampling, and many others. The thing that determines whether a survey is “scientific” is whether the sampling probabilities are known (i.e. the probability of each individual in the population appearing in the sample). If the sampling probabilities are unknown, then the survey very likely suffers from “selection bias.” A common form of selection bias (called self-selection) occurs when people (on an implicit sampling frame) are encouraged to phone in or log on to the internet and voice their opinions. Typically, only the strongest opinions are heard. Another form of selection bias occurs with a “convenience sample” composed of the units that are easy for the study designer to observe. We experience selection bias from convenience samples every day. Have you ever wondered how “Boy Band X” can have the number one hit record in the nation, but you don’t know anyone who owns a copy? Your circle of friends is a convenience sample. Convenience samples occur very often when the first n records are extracted from a large database. Those records might be ordered in some important way that you haven’t thought of. For example, employment records might have the most senior employees listed first.

Collect data

Particularly when you are surveying people, just because you have selected a unit to appear in your survey does not mean they will agree to do so. “Non-response bias” occurs when people that don’t answer your questions are systematically different from those people that do. The fraction of people who respond to your survey questions is called the “response rate.” The response rate is typically highest for surveys conducted in person. It is much lower for phone surveys, and lower still for surveys conducted by mail. Many survey agencies actually offer financial incentives for people to participate in the survey in order to increase the response rate.

Analyze data

The data analysis must be done with the sampling scheme in mind. The techniques we have learned (and will continue to learn) are appropriate for a simple random sample. If applied to data from another sampling scheme (stratified sampling, cluster sampling, etc.) they can produce biased results. The methods we know about can be modified to handle other sampling schemes, typically by assigning each observation a weight which can be computed using knowledge of the sampling strategy. We will not discuss these weighting schemes.

Other Types of Random Sampling

Cluster sampling. Stratified sampling. Two stage sampling.

7.4.4 Experiments

Unlike surveys, the theory of experiments does not concern itself with generalizing the results of the experiment to a larger population. The goal of an experiment is to infer a causal relationship between two variables. The first variable (X) is called a treatment and is set by the experimenter. The second variable (Y) is called the response and is measured by the experimenter. Experiments can be conducted with several different treatment and response variables simultaneously. The study of the right way to organize and analyze complicated experiments falls under a sub-field of statistics called “experimental design.”

Because the goal of an experiment is different than that of a survey, a different type of randomization is required. Surveys randomly select which units are to be included in the study to ensure that the survey is representative of the larger population. Experiments randomly assign units to treatment levels to make sure there are no systematic differences between the different groups the experiment is designed to compare. For example, in testing a new drug experimental subjects are randomly assigned to one of two groups. The treatment group is given the drug, while the control group is given a placebo. We then compare the two groups to see if the treatment seems to be helping.

Issues in Experimental Design

Confounding

This is where, for example, all women are given the drug and all men are given the placebo. If we then find that the treatment group did better than the control group we have a problem because we can’t tell whether this is because of the drug or because of the gender.

Lurking Variables

You would obviously never conduct an experiment by assigning all the men to one treatment level and all the women to another. But what if there were some other systematic difference between the treatment and control groups that was not so obvious to you? Then any difference you observe between the groups might be because of that other variable. For example the fact that people who smoke get cancer does not prove that smoking causes cancer. It is possible that there is an unobserved variable (e.g. a defective gene) that causes people to both smoke and develop cancer. The variable is unobserved (lurking) so it is difficult to tell for sure. The best way to overcome

this problem is to randomly assign people to groups. Random assignment ensures that any systematic differences among individuals, even those that you might not know to account for, are evenly spread between two groups. If units are randomly assigned to treatment levels, then the only systematic difference between the groups being compared is the treatment assignment. In that case you can be confident that a statistically significant difference between the groups was caused by the treatment.

Placebo Effect

People in the treatment group may do better simply because they think they should! This problem can be eliminated by not telling subjects which group they are in, and by not telling observers which group they are measuring. This is called a double blind trial.

7.4.5 Observational Studies

An observational study is an experiment where randomization is impossible or undesirable. For example it is not ethical to randomly make some people smoke and make others not smoke. Studies involving smoking and cancer rates are observational studies because the subjects decide whether or not they will smoke. Observational studies are always subject to criticism due to possible lurking variables. Another way to say this is that observational studies suffer from “omitted variable bias.”

Omitted variable bias is illustrated by Simpson’s paradox, which simply says that conditioning an analysis on a lurking variable can change the conclusion. Here is an example. We are comparing the batting averages for two baseball players. (Batting averages are the fraction of the time that a player hits the ball, times 1000. So someone who bats 250 gets a hit 25% of the time.) First we look at the overall batting average for each player for the entire season.

| Player | Whole Season |
|---|---|
| A | 251 |
| B | 286 |

It appears that player B is the better batter. However, if we look at the averages for the two halves of the season we get a completely different conclusion.

| Player | First Half | Second Half |
|---|---|---|
| A | 300 | 250 |
| B | 290 | 200 |

How is it possible that player A can have a higher average in both halves of the season but a lower average overall? Upon closer examination of the numbers we see that

| Player | First Half Hits | First Half at-Bats | Second Half Hits | Second Half at-Bats |
|---|---|---|---|---|
| A | 3 | 10 | 100 | 400 |
| B | 58 | 200 | 2 | 10 |

The batting average for both players is lower during the second half of the season (maybe because of better pitching, or worse weather). Most of player A’s attempts came during the second half, and vice-versa for player B. Therefore if we condition on which half of the season it is we get a quite different conclusion.

Here is another example based on a study of admissions to Berkeley graduate programs.

| Gender | Admission: Yes | Admission: No | Totals | % Admitted |
|---|---|---|---|---|
| Men | 3738 | 4704 | 8442 | 44.3 |
| Women | 1494 | 2827 | 4321 | 34.6 |
| Totals | 5232 | 7531 | 12763 | |

44.3% of men were admitted while only 34.6% of women were. On the surface there appears to be strong evidence of gender bias in the admission process. However, look at the numbers if we do a comparison on a department by department basis.

| Program | Men: No. Applicants | Men: % admitted | Women: No. Applicants | Women: % admitted |
|---|---|---|---|---|
| A | 825 | 62 | 108 | 82 |
| B | 560 | 63 | 25 | 68 |
| C | 325 | 37 | 593 | 34 |
| D | 417 | 33 | 375 | 35 |
| E | 191 | 28 | 393 | 24 |
| F | 373 | 6 | 341 | 7 |
| Total | 2691 | 45 | 1835 | 30 |

The percentage of women admitted is generally higher in each department. Only departments C and E have slightly lower rates for women. Department A seems if anything to be biased in favor of women. How is this possible? Notice that men tend to apply in greater numbers to the “easy” programs (A and B) while women go for the harder ones (C, D, E and F). Hence the reason for the lower percentage

admitted is simply a result of women going for the harder programs. Unfortunately, statistics cannot determine whether this means men are smart or lazy.

Put in a transition here that explains the implications of observational studies on multiple regression and vice-versa.

Using linear regression has several advantages over the two sample t-test. The first is that linear regression allows you to incorporate other variables (once we learn about multiple regression). The two sample t-test does not allow this. Recall that possible lurking variables often make it hard to determine if an apparent difference between, say, Males and Females, is caused by gender or some other lurking variables. Using linear regression we can incorporate any other possible lurking variables. If a difference between genders is still apparent after including possible lurking variables then this suggests (though does not prove) that the difference is really caused by gender. For example with the compensation example males were paid significantly more than Females. However, when we incorporated the level of responsibility we got

$$Y_i \approx 112.8 + 6.06\,\text{Position} + \begin{cases} 1.86 & \text{if the } i\text{th person is Female} \\ -1.86 & \text{if the } i\text{th person is Male} \end{cases}$$

$$\phantom{Y_i} = 6.06\,\text{Position} + \begin{cases} 114.7 & \text{if the } i\text{th person is Female} \\ 110.9 & \text{if the } i\text{th person is Male} \end{cases}$$

So the conclusion was reversed. It turns out that it is simply that there are more women at a lower level of responsibility that causes it to seem that women are being discriminated against. This of course leaves open the question as to why women are at a lower level of responsibility.

The second reason for using linear regression is that it facilitates a comparison of more than two groups whereas the two sample t-test can only handle two groups. However, the two sample t-test is commonly used in practice so it is important that you understand how it works.

7.4.6 Summary

• Surveys: Random selection ensures survey is representative. Randomized surveys can generalize their results to the population.
• Experiments: Random treatment assignment prevents lurking variables from interfering with causal conclusions. Randomized experiments allow you to conclude that differences in the outcome variable are caused by different treatments.
• Observational Studies: No randomization is possible. Control for as many

things as you can to silence your critics. If a relationship persists perhaps it’s real. Do an experiment (if possible/ethical) to verify.
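The batting-average example of Simpson’s paradox from Section 7.4.5 can be reproduced from the hits and at-bats given there. The sketch below is only a check of that arithmetic; the dictionary layout and function names are choices made for this illustration.

```python
# (hits, at-bats) for each player in each half of the season, as given in Section 7.4.5.
records = {
    "A": {"first": (3, 10), "second": (100, 400)},
    "B": {"first": (58, 200), "second": (2, 10)},
}

def batting_average(hits, at_bats):
    """Fraction of at-bats that were hits, times 1000, rounded to a whole number."""
    return round(1000 * hits / at_bats)

def season_average(player):
    """Pool both halves of the season before dividing."""
    hits = sum(h for h, _ in records[player].values())
    at_bats = sum(ab for _, ab in records[player].values())
    return batting_average(hits, at_bats)

for player in ("A", "B"):
    first = batting_average(*records[player]["first"])
    second = batting_average(*records[player]["second"])
    print(player, first, second, season_average(player))
# A 300 250 251
# B 290 200 286
```

Player A is ahead in each half, yet behind over the whole season, because most of A’s at-bats fall in the half where both players hit worse. Pooling the data hides that lurking variable.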

Congratulations!!!

You’ve completed statistics (unless you skipped to the back to see how it all ends, in which case: get back to work)! You may be wondering what happens next? No, we mean after the drinking binge.

Statistics classes tend to evoke one of two reactions: either you’re praying that you never see this stuff again, or you’re intrigued and would like to learn more. If you’re in the first camp I’ve got some bad news for you. Be prepared to see regression analysis used in several of your remaining Core classes and second year electives. The good news is that all your hard work learning the material here means that these other classes won’t look nearly as frightening.

For those of you who found this course interesting, you should know that it was a REAL statistics course. It qualifies you to go compete with the Wharton’s and Chicago’s of the world for quantitative jobs and summer internships. If you want to see more, here are some courses you should consider.

Data Mining: Arif Ansari

Data mining is the process of automating information discovery. The course is focused on developing a thorough understanding of how business data can be efficiently stored and analyzed to generate valuable business information. Business applications are emphasized.

The amount of data collected is growing at a phenomenal rate. The users of the data are expecting more sophisticated information from them. A marketing manager is no longer satisfied with a simple listing of marketing contacts, but wants detailed information about customers’ past purchases as well as predictions of future purchases. Simple structured query language queries are not adequate to support these increased demands for information. Data mining steps in to solve these needs. In this course you will learn the various techniques used in data mining like Decision trees, Neural networks, CART, Association rules etc. This course gives you hands-on experience on how to apply data mining techniques to real world business problems.

Data Mining is especially useful to a marketing organization, because it allows you to profile customers to a level not possible before. Distributors of mass mailers today generally all use data mining tools. In a few years data mining will be a requirement of marketing organizations.

IOM 522 - Time Series Analysis for Forecasting

Professor Delores Conway, winner of the 1998 University Associates award for excellence in teaching.

Forecasts of consumer demand, corporate revenues, earnings, capital expenditures, and other items are essential for marketing, finance, accounting and operations. This course emphasizes the usefulness of regression, smoothing and Box-Jenkins forecasting procedures for analyzing time series data and developing forecasts. Topics include the concept of stationarity, autoregressive and moving average models, identification and estimation of models, prediction and assessment of model forecasts, seasonal models, and intervention analysis. Students obtain practical experience using ForecastX (a state-of-the-art, Excel based package) to analyze data and develop actual forecasts. The analytical skills learned from the class are sophisticated and marketable, with wide application.

Appendix A

JMP Cheat Sheet

This guide will tell you the JMP commands for implementing the techniques discussed in class. This document is not intended to explain statistical concepts or give detailed descriptions of JMP output. Therefore, don’t worry if you come across an unfamiliar term. If you don’t recognize a term like “variance inflation factor” then we probably just haven’t gotten that far in class.

A.1 Get familiar with JMP.

You should familiarize yourself with the basic features of JMP by reading the first three Chapters of the JMP manual. You don’t have to get every detail, especially on all the Formula Editor functions, but you should get the main idea about how stuff works. Some of the things you should be able to do are:

• Open a JMP data table.
• Understand the basic structure of data tables. Rows are observations, or sample units. Columns are variables, or pieces of information about each observation.
• Identify the modeling type (continuous, nominal, ordinal) of each variable. Be able to change the modeling type if needed.
• Make a new variable in the data table using JMP’s formula editor. For example, take the log of a variable.
• Use JMP’s tools to copy a table or graph and paste it into Word (or your favorite word processor).
• Use JMP’s online help system to answer questions for you.

A.2 Generally Neat Tricks

A.2.1 Dynamic Graphics

JMP graphics are dynamic, which means you can select an item or set of items in one graph and they will be selected in all other graphs and in the data table.

A.2.2 Including and Excluding Points

Sometimes you will want to determine the impact of a small number of points (maybe just a single point) on your analysis. You can exclude a point from an analysis by selecting

the point in any graph or in the data table and choosing Rows ⇒ Exclude/Unexclude. Note that excluding a point will remove it from all future numerical calculations, but the point may still appear in graphs. To eliminate the selected point from graphs, select Rows ⇒ Hide/Unhide. You can re-admit excluded and/or hidden points by selecting them and choosing Rows ⇒ Exclude/Unexclude a second time. An easy way to select all excluded points is by double clicking the “Excluded” line in the lower left portion of the data table. You can also choose Rows ⇒ Row Selection ⇒ Select Excluded from the menu.

A.2.3 Taking a Subset of the Data

The easiest way to take a subset of the data is by selecting the observations you want to include in the subset and choosing Tables ⇒ Subset from the menu. For example, suppose you are working with a data set of CEO salaries and you only want to investigate CEO’s in the finance industry. Make a histogram of the industry variable (a categorical variable in the data set), and click on the “Finance” histogram bar. Then choose Tables ⇒ Subset from the menu and a new data table will be created with just the finance CEO’s.

A.2.4 Marking Points for Further Investigation

Sometimes you notice an unusual point in one graph and you want to see if the same point is unusual in other graphs as well. An easy way to do this is to select the point in the graph, and then right click on it. Choose “Markers” and select the plotting character you want for the point. The point will appear with the same plotting character in all other graphs.

A.2.5 Changing Preferences

There are many ways you can customize JMP. You can select different toolbars, set different defaults for the analysis platforms (distribution of Y, fit Y by X, etc.), choose a default location to search for data files, and several other things. Choose File ⇒ Preferences from the menu to explore all the things you can change.

A.2.6 Shift Clicking and Control Clicking

Sometimes you may want to select several variables from a list or select several points on a graph. You can accomplish this by holding down the shift or control keys as you make your selections. Note that shift and control clicking in JMP works just like it does in other Windows applications.

A.3 The Distribution of Y

All the following instructions assume you have launched the “Distribution of Y” analysis platform.

A.3.1 Continuous Data

By default you will see a histogram, boxplot, a list of quantiles, and a list of moments (mean, standard deviation, etc.) including a 95% confidence interval for the mean.

1. One Sample T Test. Click the little red triangle on the gray bar over the variable whose mean you want to do the T-test for. Select “Test Mean.” A dialog box pops up. Enter the mean you wish to use in the null hypothesis in the first field. Leave other fields blank. Click OK.

2. Normal Quantile Plot (or Q-Q Plot). Click the little red triangle on the gray bar over the variable you want the Q-Q plot for. Select “Normal Quantile Plot.”

A.3.2 Categorical Data

Sometimes categorical data are presented in terms of counts, instead of a big data set containing categorical information for each observation. To enter this type of data into JMP you will need to make two columns. The first is a categorical variable that lists the levels appearing in the data set. For example, the variable Race may contain the levels “Black, White, Asian, Hispanic, etc.” The second is a numerical list revealing how many times each level appeared in the data set. Suppose you name this variable counts. When you launch the Distribution of Y analysis platform, select Race as the Y variable and enter counts in the “Frequency” field.

A.4 Fit Y by X

All the following instructions assume you have launched the “Fit Y by X” analysis platform. The appropriate analysis will be determined for you by the modeling type (continuous, ordinal, or nominal) of the variables you select as Y and X.

A.4.1 The Two Sample T-Test (or One Way ANOVA)

In the Fit Y by X dialog box, select the variable whose means you wish to test as “Y.” Select the categorical variable identifying group membership as “X.” For example, to test whether a significant salary difference exists between men and women, select “Salary” as the Y variable and “Sex” as the X variable. A dotplot of the data will appear.

1. How to do a T-Test. Click on the little red triangle on the gray bar over the dotplot. Select “Means/ANOVA/T-test.”
2. Manipulating the Data. Sometimes the data in the data table must be manipulated into the form that JMP expects in order to do the two sample T test. The “Tables” menu contains two options that are sometimes useful. “Stack” will take two or more columns of data and stack them into a single column. “Split” will take a single column of data and split it into several columns.
3. Display Options. The “Display Options” sub-menu under the little red triangle controls the features of the dotplot. Use this menu to add side-by-side boxplots, means diamonds, etc., to connect the means of each subgroup, or to limit over-plotting by adding random jitter to each point’s X value.

A.4.2 Contingency Tables/Mosaic Plots

In the Fit Y by X dialog box select one categorical variable as Y and another as X. As far as the contingency table is concerned, it doesn’t matter which is which. The mosaic plot will put the X variable along the horizontal axis and the Y variable on the vertical axis (as you would expect).

1. Entering Tables Directly Into JMP. Sometimes categorical data are presented in terms of counts, instead of a big data set containing categorical information for each observation. To enter this type of data into JMP you will need to make three columns. The first is a categorical variable that lists the levels appearing in the X variable. For example, the variable Race may contain the levels “Black, White, Asian, Hispanic, etc.” The second is a list of the levels in the Y variable. For example, the variable Sex may contain the levels “Male, Female.” Each level of the X variable must be paired with each level of the Y variable. For example, the first row in the data table might be “Black, Male.” The second row might be “Black, Female.” The final column is a numerical list revealing how many times each combination of levels appeared in the data set. Suppose you name this variable counts. When you launch the Fit Y by X analysis platform, select Race as the X variable, Sex as the Y variable, and enter counts in the “Frequency” field.
2. Display Options for a Contingency Table. By default, contingency tables show counts, total %, row %, and column %. You can add or remove listings from cells of the contingency table by using the little red triangle in the gray bar above the table.

A.4.3 Simple Regression

In the Fit Y by X dialog box select the continuous variable you want to explain as Y and the continuous variable you want to use to do the explaining as X. A scatterplot will appear. The commands listed below all begin by choosing an option from the little red triangle on the gray bar above the scatterplot.

1. Fitting Regression Lines. Select “Fit Line” from the little red triangle.
2. Fitting Non-Linear Regressions.
(a) Transformations. Choose “Fit Special” from the little red triangle. A dialog box appears. Select the transformation you want to use for Y, and for X. Click Okay. If the transformation you want to use does not appear on the menu you will have to do it “by hand” using JMP’s formula editor. Simply create a new column in the data set and fill the new column with the transformed data. Then use this new column as the appropriate X or Y in a linear regression.
(b) Polynomials. Choose “Fit Polynomial” from the little red triangle. You should only use degree greater than two if you have a strong theoretical reason to do so.
(c) Splines. Choose “Fit Spline” from the little red triangle. You will have to specify how wiggly a spline you want to see. Trial and error is the best way to do this.
3. The Regression Manipulation Menu. You may want to ask JMP for more details about your regression after it has been fit. Each regression you fit causes an additional little red triangle to appear below the scatterplot. Use this little red triangle to:
• Save residuals and predicted values
• Plot residuals
• Plot confidence and prediction curves

A.4.4 Logistic Regression

In the Fit Y by X dialog box choose a nominal variable as Y and a continuous variable as X. There are no special options for you to manipulate. You have more options when you fit a logistic regression using the Fit Model platform.

A.5 Multivariate

Launch the multivariate platform and select the continuous variables you want to examine.

1. Correlation Matrix. You should see this by default.
2. Covariance Matrix. You have to ask for this using the little red triangle.
3. Scatterplot Matrix. You may or may not see this by default. You can add or remove the scatterplot matrix using the little red triangle. You read each plot in the scatterplot matrix as you would any other scatterplot. To determine the axes of the scatterplot matrix you must examine the diagonal of the matrix. The column the plot is in determines the X axis, while the plot’s row in the matrix determines the Y axis. The ellipses in each plot would contain about 95% of the data if both X and Y were normally distributed. Skinny, tilted ellipses are a graphical depiction of a strong correlation. Ellipses that are almost circles are a graphical depiction of weak correlation.

A.6 Fit Model (i.e. Multiple Regression)

The Fit Model platform is what you use to run sophisticated regression models with several X variables.

A.6.1 Running a Regression

Choose the Y variable and the X variables you want to consider in the Fit Model dialog box. Then click “Run Model.”

A.6.2 Once the Regression is Run

1. Variance Inflation Factors. Right click on the Parameter Estimates box and choose Columns ⇒ VIF in the menu that appears.
2. Confidence Intervals for Individual Coefficients. Right click on the Parameter Estimates box and choose Columns ⇒ Lower 95% in the menu that appears. Repeat to get the Upper 95%, completing the interval.
3. Save Columns. This menu lives under the big gray bar governing the whole regression. Options under this menu will save a new column to your data table. Use this menu to obtain:
• Cook’s Distance
• Leverage (or “Hats”)
• Residuals
• Predicted (or “Fitted”) Values

• Confidence and Prediction Intervals
The Save Columns menu is especially useful for making new predictions from your regression. Before you fit your model, add a row of data including the X variables you want to use in your prediction. Leave the Y variable blank. When you save predicted values and intervals JMP should save them to the row of “fake data” as well.
4. Row Diagnostics. This menu lives under the big gray bar governing the whole regression. Options under this menu will add new tables or graphs to your regression output. Use this menu to obtain:
• Residual Plot (Residuals by Predicted Values)
• The Durbin Watson Statistic
• Plotting Residuals by Row
5. Expanded Estimates. The expanded estimates box reveals the same information as the parameter estimates box, but categorical variables are expanded to include the default categories. To obtain the expanded estimates box select Estimates ⇒ Expanded Estimates from the little red triangle on the big gray bar governing the whole regression.

A.6.3 Including Interactions and Quadratic Terms

To include a variable as a quadratic, go to the Fit Model dialog box. Select the variable you want to include as a quadratic. Then click on the “Macros” menu and select “Polynomial to Degree.” The “Degree” field under the Macros button controls the degree of the polynomial. The degree field shows “2” by default, so the Polynomial to Degree macro will create a quadratic.

There are two basic ways to include an interaction term. The first is to select the two variables you want to include as an interaction (perhaps by holding down the Shift or Control key) and hitting the “Cross” button. The second is to select two or more variables that you want to use in an interaction (using the Shift or Control key) and select Macros ⇒ Factorial to Degree.

A.6.4 Contrasts

To test a contrast in a multiple regression, go to the leverage plot for the categorical variable whose levels you want to test. Click “LS Means Contrast.” In the table that pops up click the +/− signs next to the levels you want to test until you get the weights you want. Then click “Done” to compute the results of the test.

A.6.5 To Run a Stepwise Regression

In the Fit Model dialog box, change the “Personality” to “Stepwise.” Enter all the X variables you wish to consider (including any interactions; see the instructions on interactions given above). Then click “Run Model.” The Stepwise Regression dialog box appears. Change “Direction” to “Mixed” and make the probability to enter and the probability to leave small numbers. The probability to enter should be less than or equal to the probability to leave. Then click “Go.” When JMP settles on a model, click “Make Model” to get to the familiar regression dialog box. You may want to change “Personality” to “Effect Leverage” (if necessary) so that you will get the leverage plots.


One of the odd things about JMP’s stepwise regression procedure is that it creates an unusual coding scheme for dummy variables. Suppose you have a categorical variable called color, with levels Red, Blue, Green, and Yellow. JMP’s stepwise procedure may create a variable named something like color[Red&Yellow-Blue&Green]. This variable assumes the value 1 if the color is red or yellow. It assumes the value -1 if the color is blue or green. This type of dummy variable compares colors that are either red or yellow to colors that are either blue or green.
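The coding described above can be sketched in a few lines of Python. This is only an illustration of the +1/−1 scheme for the example variable; the function name `stepwise_code` and its arguments are hypothetical and are not part of JMP.

```python
def stepwise_code(color, plus=("Red", "Yellow"), minus=("Blue", "Green")):
    """Illustrative +1/-1 coding for a variable like color[Red&Yellow-Blue&Green].

    Levels listed before the minus sign get +1, levels after it get -1.
    """
    if color in plus:
        return 1
    if color in minus:
        return -1
    raise ValueError(f"unknown level: {color!r}")


# Code a small, made-up column of colors.
colors = ["Red", "Blue", "Yellow", "Green"]
coded = [stepwise_code(c) for c in colors]
print(coded)
```

A regression coefficient on a column coded this way measures half the difference between the average response in the red-or-yellow group and the blue-or-green group, holding the other X variables fixed.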

A.6.6 Logistic Regression

Logistic regression with several X variables works just like regression with several X variables. Just choose a binary (categorical) variable as Y in the Fit Model dialog box. Check the little red triangle on the big gray bar to see the options you have for logistic regression. You can save the following for each item in your data set: the probability an observation with the observed X would fall in each level of Y, the value of the linear predictor, and the most likely level of Y for an observation with those X values.
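The three saved quantities are related by the logistic function. The sketch below shows that relationship for a binary Y; the coefficients, the level names, and the choice of which level the linear predictor refers to are all assumptions made for illustration, so check the sign convention in your own JMP output.

```python
import math


def logistic_summary(x_values, intercept, slopes, levels=("No", "Yes")):
    """Return the linear predictor, level probabilities, and most likely level.

    Assumes (hypothetically) that the linear predictor is the log odds of
    the second level in `levels`.
    """
    eta = intercept + sum(b * x for b, x in zip(slopes, x_values))
    p_second = 1.0 / (1.0 + math.exp(-eta))
    probs = {levels[0]: 1.0 - p_second, levels[1]: p_second}
    most_likely = max(probs, key=probs.get)
    return eta, probs, most_likely


# Made-up coefficients: intercept -1.0 and one slope of 0.5, evaluated at x = 4.
eta, probs, most_likely = logistic_summary([4.0], intercept=-1.0, slopes=[0.5])
print(eta, probs, most_likely)
```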


Appendix B

Some Useful Excel Commands
Excel contains functions for calculating probabilities from the normal, T , χ2 and F distributions. Each distribution also has an inverse function. You use the regular function when you have a potential value for the random variable and you want to compute a probability. You use the inverse function when you have a probability and you want to know the value to which it corresponds.

Normal Distribution
Normdist(x, mean, sd, cumulative) If cumulative is TRUE this function returns the probability that a normal random variable with the given mean and standard deviation is less than x. If cumulative is FALSE, this function gives the height of the normal curve evaluated at x.

• Example: The salaries of workers in a factory are normally distributed with mean \$40,000 and standard deviation \$7,500. What is the probability that a randomly chosen worker from the factory makes less than \$53,000? Normdist(53000, 40000, 7500, TRUE).

Norminv(p, mean, sd) returns the p’th quantile of the specified normal distribution. That is, it returns a value x such that a normal random variable has probability p of being less than x.

• Example: In the factory described above find the 25th and 75th salary percentiles. 25th: Norminv(.25, 40000, 7500). 75th: Norminv(.75, 40000, 7500).
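If Excel is not at hand, the same factory calculations can be checked with Python’s standard library. This is a sketch of equivalent calls, not a description of how Excel computes its answers; the variable names are illustrative.

```python
from statistics import NormalDist

# Salaries: normal with mean 40000 and standard deviation 7500.
salaries = NormalDist(mu=40000, sigma=7500)

# Counterpart of Normdist(53000, 40000, 7500, TRUE).
p_below_53000 = salaries.cdf(53000)

# Counterpart of Normdist(53000, 40000, 7500, FALSE): height of the curve.
height_at_53000 = salaries.pdf(53000)

# Counterparts of Norminv(.25, 40000, 7500) and Norminv(.75, 40000, 7500).
q25 = salaries.inv_cdf(0.25)
q75 = salaries.inv_cdf(0.75)

print(round(p_below_53000, 4), round(q25, 2), round(q75, 2))
```

Because the normal curve is symmetric, the two percentiles sit the same distance below and above the mean.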


T-Distribution
Tdist(t, df, tails) The tails argument is either 1 or 2. If tails=1 then this function returns the probability that a T random variable with df degrees of freedom is greater than t. If tails=2 it returns twice that probability, which is the two tailed p-value. For reasons known only to Bill Gates, the t argument cannot be negative. Note that you have to standardize the t statistic yourself before using this function, as there are no mean and sd arguments like there are in Normdist.

• Example: For a test of H0 : µ = 3 vs. Ha : µ ≠ 3 we get a t statistic of 1.76. There are 73 observations in the data set. What is the p-value? Tdist(1.76, 72, 2) (df = 73 − 1 and tails = 2 because it is a two tailed test.)

• Example: For a test of H0 : µ = 3 vs. Ha : µ ≠ 3 we get a t statistic of -1.76. There are 73 observations in the data set. What is the p-value? Tdist(1.76, 72, 2) (The T distribution is symmetric so ignoring the negative sign makes no difference.)

• Example: For a test of H0 : µ = 3 vs. Ha : µ > 3 we get a t statistic of 1.76. There are 73 observations in the data set. What is the p-value? Tdist(1.76, 72, 1) (The p-value here is the probability above 1.76.)

• Example: For a test of H0 : µ = 3 vs. Ha : µ > 3 we get a t statistic of -1.76. There are 73 observations in the data set. What is the p-value? =1-Tdist(1.76, 72, 1) (The p-value here is the probability above -1.76, which by symmetry equals the probability below 1.76.)

Tinv(p, df) Returns the value of t that you would need to see in a two tailed test to get a p-value of p.

• Example: We have 73 observations. How large a t statistic would we have to see to reject a two tailed test at the .13 level? Tinv(.13, 72)

• Example: With 73 observations what t would give us a p-value of .05 for the test H0 : µ = 17 vs. Ha : µ > 17? Because of the alternative hypothesis, the p-value is the area to the right of t. Thus the answer here is the value of t that would give a p-value of .10 on a two tailed test: Tinv(.10, 72).
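To see what Tdist is doing, the sketch below reproduces its behavior with Python’s standard library by integrating the T density numerically (Simpson’s rule). The function names `t_pdf` and `tdist` are illustrative, and the numerical method is a choice made here for simplicity; it is not a claim about how Excel computes the value.

```python
import math


def t_pdf(x, df):
    """Density of a T random variable with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def tdist(t, df, tails, steps=2000):
    """Mimic Tdist: upper tail probability for t >= 0, multiplied by tails."""
    if t < 0:
        raise ValueError("t must be non-negative, as in Excel's Tdist")
    # Simpson's rule for the area under the density between 0 and t
    # (steps must be even).
    h = t / steps
    area = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    # Half the probability lies above 0; subtract the part between 0 and t.
    return tails * (0.5 - area)


two_tailed = tdist(1.76, 72, 2)       # like Tdist(1.76, 72, 2)
one_tailed = tdist(1.76, 72, 1)       # like Tdist(1.76, 72, 1)
lower_side = 1 - tdist(1.76, 72, 1)   # like =1-Tdist(1.76, 72, 1)
print(two_tailed, one_tailed, lower_side)
```

The two tailed value is exactly twice the one tailed value, which is why a one tailed cutoff at level .05 corresponds to a two tailed cutoff at level .10.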

χ2 (chi-square) distribution
Chidist(x, df) Returns the probability to the right of x on the chi-square distribution with the specified degrees of freedom.

• Example: A χ2 test statistic turns out to be 12.7 on 9 degrees of freedom. What is the p-value? Chidist(12.7, 9)

ChiInv(p, df) p is the probability in the right tail of the χ2 distribution. This function returns the value of the corresponding χ2 statistic.

• Example: In a χ2 test on 9 degrees of freedom, how large must the test statistic be in order to get a p-value of .02? ChiInv(.02, 9)

F Distribution

Fdist(F, NumDF, DenomDF) Returns the p value from the F distribution with NumDF in the numerator and DenomDF in the denominator.

• Example: If F = 12.7, the numerator df = 3 and the denominator df = 102, what is the p-value? Fdist(12.7, 3, 102)

Finv(p, NumDF, DenomDF) Returns the value of the F statistic needed to achieve a p-value of p.

• Example: If there were 3 numerator and 102 denominator degrees of freedom, how large an F statistic would be needed to get a p-value of .05? Finv(.05, 3, 102)
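The right tail probability that Chidist returns can be approximated with Python’s standard library by integrating the chi-square density. This is a rough sketch under the assumption that df is at least 2 (the density is unbounded at zero for df = 1, which this simple integration does not handle); the function name `chidist` is illustrative only.

```python
import math


def chidist(x, df, steps=2000):
    """Right tail probability of a chi-square variable; assumes df >= 2."""
    def pdf(u):
        if u == 0:
            # The density at zero is 1/2 when df = 2 and 0 when df > 2.
            return 0.5 if df == 2 else 0.0
        log_f = ((df / 2 - 1) * math.log(u) - u / 2
                 - (df / 2) * math.log(2) - math.lgamma(df / 2))
        return math.exp(log_f)

    # Simpson's rule for the area to the left of x (steps must be even).
    h = x / steps
    area = pdf(0) + pdf(x)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    return 1 - area * h / 3


p_value = chidist(12.7, 9)   # like Chidist(12.7, 9)
print(p_value)
```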


Appendix C The Greek Alphabet

lower case   upper case   letter
α            A            alpha
β            B            beta
γ            Γ            gamma
δ            ∆            delta
ε            E            epsilon
ζ            Z            zeta
η            H            eta
θ            Θ            theta
ι            I            iota
κ            K            kappa
λ            Λ            lambda
µ            M            mu
ν            N            nu
ξ            Ξ            xi
o            O            omicron
π            Π            pi
ρ            P            rho
σ            Σ            sigma
τ            T            tau
υ            Υ            upsilon
φ            Φ            phi
χ            X            chi
ψ            Ψ            psi
ω            Ω            omega


Appendix D Tables

D.1 Normal Table

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2   0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3   0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.5   0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
1.6   0.9452  0.9463  0.9474  0.9484  0.9495  0.9505  0.9515  0.9525  0.9535  0.9545
1.7   0.9554  0.9564  0.9573  0.9582  0.9591  0.9599  0.9608  0.9616  0.9625  0.9633
1.8   0.9641  0.9649  0.9656  0.9664  0.9671  0.9678  0.9686  0.9693  0.9699  0.9706
1.9   0.9713  0.9719  0.9726  0.9732  0.9738  0.9744  0.9750  0.9756  0.9761  0.9767
2.0   0.9772  0.9778  0.9783  0.9788  0.9793  0.9798  0.9803  0.9808  0.9812  0.9817
2.1   0.9821  0.9826  0.9830  0.9834  0.9838  0.9842  0.9846  0.9850  0.9854  0.9857
2.2   0.9861  0.9864  0.9868  0.9871  0.9875  0.9878  0.9881  0.9884  0.9887  0.9890
2.3   0.9893  0.9896  0.9898  0.9901  0.9904  0.9906  0.9909  0.9911  0.9913  0.9916
2.4   0.9918  0.9920  0.9922  0.9925  0.9927  0.9929  0.9931  0.9932  0.9934  0.9936
2.5   0.9938  0.9940  0.9941  0.9943  0.9945  0.9946  0.9948  0.9949  0.9951  0.9952
2.6   0.9953  0.9955  0.9956  0.9957  0.9959  0.9960  0.9961  0.9962  0.9963  0.9964
2.7   0.9965  0.9966  0.9967  0.9968  0.9969  0.9970  0.9971  0.9972  0.9973  0.9974
2.8   0.9974  0.9975  0.9976  0.9977  0.9977  0.9978  0.9979  0.9979  0.9980  0.9981
2.9   0.9981  0.9982  0.9982  0.9983  0.9984  0.9984  0.9985  0.9985  0.9986  0.9986
3.0   0.9987  0.9987  0.9987  0.9988  0.9988  0.9989  0.9989  0.9989  0.9990  0.9990
3.1   0.9990  0.9991  0.9991  0.9991  0.9992  0.9992  0.9992  0.9992  0.9993  0.9993
3.2   0.9993  0.9993  0.9994  0.9994  0.9994  0.9994  0.9994  0.9995  0.9995  0.9995
3.3   0.9995  0.9995  0.9995  0.9996  0.9996  0.9996  0.9996  0.9996  0.9996  0.9997
3.4   0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9998
3.5   0.9998  0.9998  0.9998  0.9998  0.9998  0.9998  0.9998  0.9998  0.9998  0.9998

The body of the table contains the probability that a N(0, 1) random variable is less than z. The left margin of the table contains the first two digits of z. The top row of the table contains the third digit of z.

[Figure: standard normal density curve, horizontal axis Z from −4 to 4]

D.2 Quick and Dirty Normal Table

z      Pr(Z<z) | z      Pr(Z<z) | z     Pr(Z<z) | z     Pr(Z<z)
-4.00  0.0000  | -2.00  0.0228  | 0.00  0.5000  | 2.00  0.9772
-3.95  0.0000  | -1.95  0.0256  | 0.05  0.5199  | 2.05  0.9798
-3.90  0.0000  | -1.90  0.0287  | 0.10  0.5398  | 2.10  0.9821
-3.85  0.0001  | -1.85  0.0322  | 0.15  0.5596  | 2.15  0.9842
-3.80  0.0001  | -1.80  0.0359  | 0.20  0.5793  | 2.20  0.9861
-3.75  0.0001  | -1.75  0.0401  | 0.25  0.5987  | 2.25  0.9878
-3.70  0.0001  | -1.70  0.0446  | 0.30  0.6179  | 2.30  0.9893
-3.65  0.0001  | -1.65  0.0495  | 0.35  0.6368  | 2.35  0.9906
-3.60  0.0002  | -1.60  0.0548  | 0.40  0.6554  | 2.40  0.9918
-3.55  0.0002  | -1.55  0.0606  | 0.45  0.6736  | 2.45  0.9929
-3.50  0.0002  | -1.50  0.0668  | 0.50  0.6915  | 2.50  0.9938
-3.45  0.0003  | -1.45  0.0735  | 0.55  0.7088  | 2.55  0.9946
-3.40  0.0003  | -1.40  0.0808  | 0.60  0.7257  | 2.60  0.9953
-3.35  0.0004  | -1.35  0.0885  | 0.65  0.7422  | 2.65  0.9960
-3.30  0.0005  | -1.30  0.0968  | 0.70  0.7580  | 2.70  0.9965
-3.25  0.0006  | -1.25  0.1056  | 0.75  0.7734  | 2.75  0.9970
-3.20  0.0007  | -1.20  0.1151  | 0.80  0.7881  | 2.80  0.9974
-3.15  0.0008  | -1.15  0.1251  | 0.85  0.8023  | 2.85  0.9978
-3.10  0.0010  | -1.10  0.1357  | 0.90  0.8159  | 2.90  0.9981
-3.05  0.0011  | -1.05  0.1469  | 0.95  0.8289  | 2.95  0.9984
-3.00  0.0013  | -1.00  0.1587  | 1.00  0.8413  | 3.00  0.9987
-2.95  0.0016  | -0.95  0.1711  | 1.05  0.8531  | 3.05  0.9989
-2.90  0.0019  | -0.90  0.1841  | 1.10  0.8643  | 3.10  0.9990
-2.85  0.0022  | -0.85  0.1977  | 1.15  0.8749  | 3.15  0.9992
-2.80  0.0026  | -0.80  0.2119  | 1.20  0.8849  | 3.20  0.9993
-2.75  0.0030  | -0.75  0.2266  | 1.25  0.8944  | 3.25  0.9994
-2.70  0.0035  | -0.70  0.2420  | 1.30  0.9032  | 3.30  0.9995
-2.65  0.0040  | -0.65  0.2578  | 1.35  0.9115  | 3.35  0.9996
-2.60  0.0047  | -0.60  0.2743  | 1.40  0.9192  | 3.40  0.9997
-2.55  0.0054  | -0.55  0.2912  | 1.45  0.9265  | 3.45  0.9997
-2.50  0.0062  | -0.50  0.3085  | 1.50  0.9332  | 3.50  0.9998
-2.45  0.0071  | -0.45  0.3264  | 1.55  0.9394  | 3.55  0.9998
-2.40  0.0082  | -0.40  0.3446  | 1.60  0.9452  | 3.60  0.9998
-2.35  0.0094  | -0.35  0.3632  | 1.65  0.9505  | 3.65  0.9999
-2.30  0.0107  | -0.30  0.3821  | 1.70  0.9554  | 3.70  0.9999
-2.25  0.0122  | -0.25  0.4013  | 1.75  0.9599  | 3.75  0.9999
-2.20  0.0139  | -0.20  0.4207  | 1.80  0.9641  | 3.80  0.9999
-2.15  0.0158  | -0.15  0.4404  | 1.85  0.9678  | 3.85  0.9999
-2.10  0.0179  | -0.10  0.4602  | 1.90  0.9713  | 3.90  1.0000
-2.05  0.0202  | -0.05  0.4801  | 1.95  0.9744  | 3.95  1.0000

The table gives the probability that a N(0, 1) random variable is less than z. It is less precise than Table D.1 because z increments by .05 rather than .01, but it may be easier to use.
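Entries of a table like this one can be regenerated, or extended to a finer grid, with Python’s standard library. The sketch below is one way to do it; the function name `quick_normal_table` and its default arguments are illustrative choices.

```python
from statistics import NormalDist


def quick_normal_table(start=-4.0, stop=3.95, step=0.05):
    """Return (z, Pr(Z < z)) pairs for a standard normal Z."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    n = round((stop - start) / step) + 1
    rows = []
    for i in range(n):
        z = round(start + i * step, 2)  # rounding avoids floating point drift
        rows.append((z, round(std_normal.cdf(z), 4)))
    return rows


table = dict(quick_normal_table())
for z in (-2.05, 0.0, 1.95):
    print(f"{z:6.2f}  {table[z]:.4f}")
```

Passing step=0.01 and start=0.0 gives values on the finer grid used by a full normal table.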

9651 0.6961 0.194 APPENDIX D.0459 1.9658 0.37 0.7941 0.394 0.0395 1.7108 0.4066 5 0.6021 0.1948 0.4571 0.8709 0.9332 0.4792 0.696 0.0063 1.6498 0.9828 0.0366 1.2658 0.5062 0.5795 0.6813 The numbers in the table are approximate cutoﬀ values for Cook’s distances with the speciﬁed number of model parameters (intercept + number of slopes).9828 0.4553 0.0232 1.5496 0.3 Cook’s Distance Shocking 0.0349 1.5058 0.6793 30 0.0159 0.3351 0.6272 0. TABLES D.9705 0.367 0.2282 3 0.9276 0.4306 0.1912 0.674 0.5314 0.7434 0.5623 10 0.6681 0.2647 0.544 9 0.777 0.5215 8 0.8449 0.9544 0.9761 0.6771 29 0.7389 0.698 0.6287 0.8397 0.6606 23 0.1055 0.0331 1.5979 0.7733 0.0645 0.7517 0.973 0.9789 0.3351 0.0199 1.9617 0.653 21 0.5114 0.6292 0.0166 0.5552 0.0441 1.5685 0.9464 0.4583 0.467 0.705 0.7555 0.9785 Number of Params 10 Odd Surprising 1 0.9407 0.7207 0.7342 0.4858 0. and nearest sample size.5738 0.5578 0.5406 0.6438 0.4625 0.9457 0.6392 0.5691 0.016 1.8451 0.9566 0.5973 0.6334 0.6634 0.7651 0.1944 0.536 0.7458 0.6568 0.4998 0.6567 0.2551 0.9405 0.6936 0.9556 0.4115 0.0431 1.9734 0.7132 0.0116 1.6347 0.5094 0.7693 0.3033 0.4856 0.981 0.6436 0.6013 13 0.6569 22 0.9242 0.0261 1.7477 0.3199 0.5807 0.5472 0.5903 12 0.5209 0.5891 0.0467 Sample Size 100 Odd Surprising Shocking 0.9837 0.5703 0.6487 20 0.7511 0.6748 28 0.529 0.9741 0.8762 0.6698 26 0.2232 0.3357 4 0.1065 0.1054 0.031 1.7892 0.9778 0.7561 0.4121 0.7173 0. An “odd” point is one about which you are mildly curious.6264 16 0.0642 0.6206 0.3641 0.6592 0.6329 17 0.6191 15 0.6204 0.4448 0.4987 0.6125 0.9697 0.98 0.4309 0.6116 0.542 0.5453 0.4568 0.9769 0.3218 0.9751 0.4043 0.6387 18 0.6724 27 0.9749 0.3405 0.9639 0.6546 0.7435 0.4563 6 0.6686 0.6784 0.6669 25 0.6629 0.9624 0.9764 0.9716 0.6364 0.6636 0. 
A “shocking” point clearly inﬂuences the ﬁtted regression.0158 0.6499 0.606 0.644 19 0.0677 2 0.0408 1.042 1.6754 0.9186 0.6446 0.4678 0.6841 0.7233 0.4931 0.9498 0.7341 0.516 0.4931 7 0.9592 0.6792 0.9844 1000 Odd Surprising Shocking 0.5391 0. .5251 0.5326 0.6862 0.5163 0.9072 0.5429 0.9923 1 1.4005 0.9534 0.5931 0.4897 0.4772 0.8988 0.2236 0.5113 0.9691 0.9348 0.8974 0.7402 0.5447 0.5457 0.6108 14 0.7289 0.5918 0.9675 0.5244 0.6173 0.6876 0.9777 0.9593 0.0381 1.9514 0.759 0.9319 0.892 0.045 1.4139 0.9705 0.7624 0.9718 0.4356 0.4684 0.7037 0.9675 0.5775 11 0.7607 0.6639 24 0.7277 0.6136 0.6504 0.0287 1.982 0.9127 0.

34 13.19 50.87 30.25 7.42 37.00 26.56 36.26 32.70 39.96 48.92 39.D.24 10.32 124.28 15.73 26.09 40.33 124.31 23.20 Probability 0.13 18.4 DF 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 40 50 60 70 80 90 100 Chi-Square Table 0.96 55.56 37.57 38.54 24.78 9.18 52.47 20.03 22.21 149.07 58.77 25.59 50.36 14.88 31.81 36.33 3.73 51.99 17.07 12.17 74.02 13. 0.88 29.66 99.62 30.51 25.31 43.15 67.08 90.31 45.85 29.11 41.61 112.74 135. CHI-SQUARE TABLE 195 D.97 109.41 29.62 161.17 64.22 27.06 95.14 30.40 86.38 35.14 61.50 0.05 55.64 46.82 45.64 42.65 38.81 18.76 67.40 85.81 32.30 59.07 15.81 10.68 21.59 28.80 48.57 118.45 0.51 22.14 31.55 19.28 49.79 42.53 101.58 44.89 63.93 40.05 0.31 19.71 4.63 9.17 36.93 47.99 27.20 28.38 53.51 16.84 5.38 100.27 18.84 137.81 21.99 7.75 27.36 39.70 73.87 42.67 23.68 15.41 34.06 22.92 18.66 66.68 25.53 96.81 63.4.48 56.12 135.28 18.26 45.29 41.9 2.00 33.09 21.43 112.13 40.26 51.79 52.99 0.10 0.32 26.21 11.72 35.74 37.62 54.9999 15.69 76.95 Probability 0.77 55.56 43.83 33.98 44.83 13.30 27.12 27.82 16.67 33.53 36.81 9.64 12.27 49.89 40.01 33.999 6.50 79.09 16.56 49.15 0.20 34.46 24.10 23.25 40.58 32.59 31.58 107.92 35.15 88.49 11.69 29.12 37.52 57.61 60.61 6.77 148.50 122.88 113.34 The table shows the value that a chi-square random variable must attain so that the speciﬁed amount of probability lies to its left.36 23.59 14.41 32.48 20.21 24.25 0.34 42.00 0 2 4 6 8 Chi Square (Number in Table Body) 10 12 .19 37.62 82.42 21.89 58.67 63.31 46.15 124.91 34.


Bibliography

Albright, S. C., Winston, W. L., and Zappe, C. (2004). Data Analysis for Managers with Microsoft Excel, 2nd Edition. Brooks/Cole–Thomson Learning.

Foster, D. P., Stine, R. A., and Waterman, R. P. (1998). Basic Business Statistics. Springer.
