1. Descriptive statistics 2. Foundations of inferential statistics 3. Estimation and conﬁdence intervals 4. Testing statistical hypotheses 5. Regression analysis

Regression analysis

5.1 Correlation 5.2 Simple linear regression 5.3 Multiple regression

All these notions can be extended to the case with multiple predictors...

Statistics

Example: We can use two predictors for Intel: S&P500 and inflation.

[Figure: panels labeled Intel, SP500 and Inflation]

5.3.1 Multiple regression equation

We extend the regression theory to k explanatory variables.

Definition: multiple linear regression equation

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where
  $x_{i1}, \ldots, x_{ik}$ are observable variables,
  $\beta_0, \beta_1, \ldots, \beta_k$ are fixed and unknown parameters,
  $\varepsilon_1, \ldots, \varepsilon_n \sim$ i.i.d. $N(0, \sigma^2)$,
  $\sigma > 0$ is a fixed and unknown parameter.

Definition: The least squares (LS) estimators of $\beta_0, \ldots, \beta_k$ are

$$(\hat\beta_0, \ldots, \hat\beta_k) = \arg\min_{\beta_0, \ldots, \beta_k} \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\bigr)^2.$$

Remark: explicit formulas for these estimators are available, but require a matrix form of the regression model, $Y = X\beta + \varepsilon$, with

$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$

The LS estimator of $\beta$ minimizes $(Y - X\beta)^T (Y - X\beta)$ and is

$$\hat\beta = (X^T X)^{-1} X^T Y.$$

No need to learn this slide by heart: we will use Excel to estimate the parameters.

Veronika Czellar, HEC Paris, Statistics
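The matrix formula for the LS estimator can be checked outside Excel. The following is a minimal sketch in plain Python, assuming toy data generated without noise from y = 1 + 2 x1 + 3 x2; the helper names `solve_linear_system` and `least_squares` are illustrative and not part of the course material. It solves the normal equations (X^T X) b = X^T Y rather than forming the inverse explicitly.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (predictors are collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def least_squares(predictors, y):
    """Return [b0, b1, ..., bk] solving the normal equations (X^T X) b = X^T y."""
    X = [[1.0] + list(row) for row in predictors]  # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data (an assumption for illustration): y = 1 + 2*x1 + 3*x2, no noise.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]
beta_hat = least_squares(predictors, y)
print(beta_hat)  # close to [1, 2, 3] up to rounding error
```

Because the toy response is an exact linear function of the predictors, the estimates reproduce the generating coefficients; with real data such as the Intel example they would only approximate the unknown parameters.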

Regressing Intel on S&P500 and inflation:

[Excel regression output]

The R Square in the Excel output is higher than in the case of one predictor (S&P500 only).

Question: What does R Square mean in the multiple regression?

5.3.2 Evaluating a multiple regression equation

Definition: The coefficient of multiple determination $R^2$ is defined by

$$R^2 = \frac{\sum_{i=1}^{n} (\hat y_i - \bar y)^2}{S_{yy}},$$

where $S_{yy} = \sum_{i=1}^{n} (y_i - \bar y)^2$ and $\hat y_i$ are the fitted values

$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik}.$$

Proposition:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{S_{yy}}.$$
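The two expressions for R^2 can be compared numerically. This is a small sketch, assuming a toy data set with a single predictor x = 1, 2, 3, 4 whose LS fit is y-hat = -0.5 + 2.2 x; the function names are hypothetical. The two forms agree only when the fitted values come from a least squares fit with an intercept.

```python
def r_squared_explained(y, fitted):
    """R^2 as explained variation over total variation S_yy."""
    y_bar = sum(y) / len(y)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return sum((fi - y_bar) ** 2 for fi in fitted) / s_yy


def r_squared_residual(y, fitted):
    """R^2 as one minus residual variation over total variation S_yy."""
    y_bar = sum(y) / len(y)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / s_yy


# Toy data (an assumption): LS fitted values of y on x = 1, 2, 3, 4.
y = [2, 4, 5, 9]
fitted = [1.7, 3.9, 6.1, 8.3]
r2_explained = r_squared_explained(y, fitted)
r2_residual = r_squared_residual(y, fitted)
print(r2_explained, r2_residual)  # both about 0.9308
```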

Properties of the coefficient of multiple determination

1. It can range from 0 to 1. A value near 0 indicates little linear association between the set of independent variables and the dependent variable. A value near 1 means a strong association.
2. R^2 cannot go down when an extra predictor is added to the model and it will generally increase.
3. R^2 can almost always be made very close to 1 by using a model with k quite close to n, even if many of the predictors would contribute only marginally to variation in y.

Definition: The adjusted $R^2$ is defined by

$$\text{Adjusted } R^2 = 1 - \frac{n-1}{n-k-1} \cdot \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{S_{yy}}.$$

Properties of the adjusted R^2

1. Adjusted R^2 is smaller than R^2.
2. Adjusted R^2 penalizes the addition of extraneous predictors to the model.
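Since the residual sum of squares divided by S_yy equals 1 - R^2, the adjusted R^2 can be computed from R^2, n and k alone. The sketch below assumes illustrative values (R^2 = 0.5, n = 21) and a hypothetical function name; it shows how the penalty grows with the number of predictors when R^2 is held fixed.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (n - 1) / (n - k - 1) * (1 - R^2)."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)


# Illustrative numbers (assumptions): same R^2, different model sizes.
adj_two_predictors = adjusted_r_squared(0.5, n=21, k=2)   # 1 - (20/18) * 0.5
adj_five_predictors = adjusted_r_squared(0.5, n=21, k=5)  # 1 - (20/15) * 0.5
print(adj_two_predictors, adj_five_predictors)  # about 0.444 and 0.333
```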

Question: High values of R^2 suggest that the model fit is a useful one. But how large should this value be before we draw this conclusion?

5.3.3 Testing the global utility of the multiple regression

$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_a$: at least one among $\beta_1, \ldots, \beta_k$ is not zero

Model utility F test:

$$F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)} \overset{H_0}{\sim} F_{(k,\, n-k-1)}.$$
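The F statistic depends on the data only through R^2, n and k, so it is easy to compute by hand or in a few lines of code. The following sketch uses illustrative values (R^2 = 0.5, n = 23, k = 2) that are assumptions, not the Intel output; the critical value quoted in the comment is the usual table value for the F distribution with (2, 20) degrees of freedom at the 0.05 level.

```python
def model_utility_f(r2, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    if not 0 <= r2 < 1:
        raise ValueError("R^2 must lie in [0, 1)")
    if k < 1 or n - k - 1 <= 0:
        raise ValueError("need k >= 1 and n > k + 1")
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Illustrative numbers (assumptions): R^2 = 0.5 with n = 23 and k = 2.
f_stat = model_utility_f(0.5, n=23, k=2)
print(f_stat)  # about 10

# Compare with the tabulated 0.05 critical value of F(2, 20), about 3.49:
# a statistic near 10 would lead to rejecting H0 at the 0.05 level.
```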

Model utility F test for the Intel example with two predictors:

[Excel regression output]

Warning: If the F test results in the rejection of H0, it does not mean that all predictors are useful.

5.3.4 Evaluating individual regression coefficients

The t tests can be extended to the multivariate case. For any given $j \in \{0, \ldots, k\}$, we can test

$H_0: \beta_j = 0$, $\quad H_a: \beta_j \neq 0$

using a t test:

$$T_{\beta_j} = \frac{\hat\beta_j}{SE(\hat\beta_j)} \overset{H_0}{\sim} t_{n-k-1},$$

where $SE(\hat\beta_j)$ is the standard error of the coefficient j (and has the matrix form $\sqrt{\hat\sigma^2 \, [(X^T X)^{-1}]_{jj}}$), and

$$\hat\sigma = \sqrt{\frac{1}{n-k-1} \sum_{i=1}^{n} (y_i - \hat y_i)^2}$$

is called the multiple standard error of estimate.

Example: Do an individual test of each independent variable for the Intel regression with two predictors. Which variable would you consider eliminating? Use the 0.05 significance level.
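The standard errors and t statistics can be reproduced from the matrix form. Below is a minimal plain-Python sketch, assuming small made-up data with two predictors; the helper names `invert` and `coefficient_t_tests` are illustrative. It returns the estimates, their standard errors and the t statistics. P-values are left out because the Python standard library has no t distribution; each |t| would be compared with a tabulated critical value of t with n - k - 1 degrees of freedom.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def coefficient_t_tests(predictors, y):
    """Return (beta_hat, standard errors, t statistics) for an LS fit."""
    X = [[1.0] + list(row) for row in predictors]  # intercept column first
    n, p = len(X), len(X[0])                       # p = k + 1
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sigma2 = sse / (n - p)                         # n - k - 1 degrees of freedom
    se = [math.sqrt(sigma2 * inv[j][j]) for j in range(p)]
    t = [b / s for b, s in zip(beta, se)]
    return beta, se, t


# Made-up data (an assumption): roughly y = 1 + 2*x1 + 3*x2 plus small noise.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5), (6, 4)]
y = [9.2, 7.9, 19.1, 17.8, 26.3, 24.8]
beta_hat, std_errors, t_stats = coefficient_t_tests(predictors, y)
for j, (b, s, t) in enumerate(zip(beta_hat, std_errors, t_stats)):
    print(f"beta_{j}: estimate={b:.3f}  SE={s:.3f}  t={t:.2f}")
```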


Remark: if there is more than one nonsignificant variable, we should delete only one variable at a time. Each time we delete a variable, we need to rerun the regression equation and check the remaining variables. This method is called the backward stepwise regression method.
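The procedure in the remark can be sketched as a loop. This sketch rests on two assumptions: the refitting step is passed in as a function `p_values_for` (hypothetical) that returns the p-value of each remaining predictor, and at each round the predictor with the largest p-value is the one deleted, which is a common convention that the remark itself does not specify. The p-values in the lookup table are invented for illustration only.

```python
def backward_elimination(names, p_values_for, alpha=0.05):
    """Drop one nonsignificant predictor per round, refitting after each deletion."""
    current = list(names)
    while current:
        p_values = p_values_for(current)          # rerun the regression
        worst = max(current, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:              # everything left is significant
            break
        current.remove(worst)                     # delete only one variable
    return current


# Invented p-values (assumption) standing in for two successive regression runs.
HYPOTHETICAL_P_VALUES = {
    frozenset({"x1", "x2", "x3"}): {"x1": 0.001, "x2": 0.40, "x3": 0.20},
    frozenset({"x1", "x3"}): {"x1": 0.002, "x3": 0.03},
}


def lookup(names):
    """Stand-in for refitting the model on the given predictors."""
    return HYPOTHETICAL_P_VALUES[frozenset(names)]


kept = backward_elimination(["x1", "x2", "x3"], lookup)
print(kept)  # ['x1', 'x3']
```

In this invented example x3 looks nonsignificant in the full model but becomes significant once x2 is removed, which is why variables are deleted one at a time rather than all at once.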

5.3.5 Transformed variables

We can also include transformed variables or mixtures of variables in a multiple regression model.

Example: Global warming is the increase in the average temperature of Earth's near-surface air and oceans since the mid-20th century and its projected continuation. It is well known that climate change is influenced by human CO2 emissions. Global CO2 emissions totalled 31.1 billion tons in 2009, 37 percent above those in 1990. Global data GlobalAirpollution.txt for more than 65 countries was released in August 2010 and is available on the CERINA Plan website (and on the course website as well).

Example continued

We would like to investigate the impact of GDP per capita and population growth on the increase of CO2 emissions.

  Year2008: emissions of CO2 in 2008 (in million tons, $y_{i,2008}$)
  Year2009: emissions of CO2 in 2009 (in million tons, $y_{i,2009}$)
  GDP2009realgrowth: GDP real growth rate (in %, $x_{i,1}$)
  PopGrowth2009: population growth rate (in %, $x_{i,2}$)
  SquareGDP2009: squared GDP2009realgrowth ($x_{i,1}^2$)
  SquarePopGrowth2009: squared PopGrowth2009 ($x_{i,2}^2$)

Fit the following model:

$$\frac{y_{i,2009}}{y_{i,2008}} = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,1}^2 + \beta_4 x_{i,2}^2 + \varepsilon_i, \qquad i = 1, \ldots, 65.$$
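Building the transformed columns is a purely mechanical step. The sketch below assumes each country is held as a dictionary keyed by the variable names listed above; the two records contain invented numbers, not values from GlobalAirpollution.txt, and the function names are hypothetical.

```python
def design_row(record):
    """Row of the design matrix: intercept, x1, x2, x1^2, x2^2."""
    x1 = record["GDP2009realgrowth"]
    x2 = record["PopGrowth2009"]
    return [1.0, x1, x2, x1 ** 2, x2 ** 2]


def response(record):
    """Response of the model: ratio of 2009 to 2008 CO2 emissions."""
    return record["Year2009"] / record["Year2008"]


# Invented records (assumption), only to show the shape of the computation.
records = [
    {"Year2008": 100.0, "Year2009": 110.0,
     "GDP2009realgrowth": 2.0, "PopGrowth2009": 1.5},
    {"Year2008": 50.0, "Year2009": 45.0,
     "GDP2009realgrowth": -3.0, "PopGrowth2009": 0.5},
]
X = [design_row(r) for r in records]
y = [response(r) for r in records]
print(X[0], y[0])
```

The model stays linear in the parameters even though it is quadratic in the predictors, so the squared columns are treated like any other predictor when the regression is run.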



5.3.6 Dummy variables

We can also include a dummy variable as a predictor, which takes the values 0 or 1 to indicate the absence or presence of some categorical effect.

Example: CEO salaries (see NorthwestCEOsalaries.txt on course website)

5.3.7 Qualitative variables

A categorical (or qualitative) variable is a predictor that takes a finite number d of possible values. Only d − 1 categories are added to the regression model.

Example: prices of LCD televisions (see LCD.txt on course website, and exercise 5.12)
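A categorical predictor with d values can be turned into d - 1 dummy columns, with the omitted category serving as the baseline absorbed by the intercept. The following sketch assumes hypothetical category labels "A", "B", "C" (not taken from LCD.txt) and an illustrative function name.

```python
def dummy_code(values, baseline=None):
    """Encode a categorical variable with d levels as d - 1 dummy (0/1) columns."""
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]                      # first level is the reference
    kept = [level for level in levels if level != baseline]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows


# Hypothetical labels (assumption): three categories, so two dummy columns.
columns, rows = dummy_code(["B", "A", "C", "A"])
print(columns)  # ['B', 'C']
print(rows)     # [[1, 0], [0, 0], [0, 1], [0, 0]]
```

Including all d dummies together with the intercept would make the columns of X linearly dependent, so X^T X could not be inverted; dropping one category avoids this.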

5.3.8 Interaction variables

In some cases, it can be useful to add interaction terms, which are products of at least two variables.

Example: CEO salaries
The product between the woman dummy and sales is an interaction term.
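To see what the interaction term does, it helps to write the model out for each value of the dummy. The following is a sketch under assumed notation: the response is taken to be the CEO's salary, $w_i$ denotes the woman dummy and $s_i$ denotes sales; these symbols are illustrative and do not appear in the data file.

```latex
\text{salary}_i = \beta_0 + \beta_1 w_i + \beta_2 s_i + \beta_3 (w_i \cdot s_i) + \varepsilon_i

% Men (w_i = 0):
\text{salary}_i = \beta_0 + \beta_2 s_i + \varepsilon_i

% Women (w_i = 1):
\text{salary}_i = (\beta_0 + \beta_1) + (\beta_2 + \beta_3)\, s_i + \varepsilon_i
```

The dummy alone shifts the intercept by $\beta_1$, while the interaction term lets the slope on sales differ by $\beta_3$ between the two groups; testing $H_0: \beta_3 = 0$ asks whether the slopes are the same.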

Model assumptions in multiple regression can be verified in the same way as in simple linear regression (see 5.2.8).

However, there is an additional requirement in multiple regression: predictors should not be correlated.

5.3.9 Multicollinearity

Multicollinearity exists when independent variables are correlated. Several clues that indicate problems with multicollinearity:

  An independent variable known to be an important predictor ends up being not significant.
  A regression coefficient that should have a positive sign turns out to be negative, or vice versa.
  When an independent variable is added or removed, there is a drastic change in the values of the remaining coefficients.
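Besides the clues listed above, a direct first check is to compute the correlation between each pair of predictors before running the regression. The sketch below is a plain-Python version of the sample correlation coefficient, applied to made-up columns; the function name and the data are assumptions for illustration.

```python
import math


def correlation(x, z):
    """Sample (Pearson) correlation coefficient between two predictors."""
    n = len(x)
    mean_x, mean_z = sum(x) / n, sum(z) / n
    s_xz = sum((a - mean_x) * (b - mean_z) for a, b in zip(x, z))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_zz = sum((b - mean_z) ** 2 for b in z)
    return s_xz / math.sqrt(s_xx * s_zz)


# Made-up predictors (assumption): x2 is an exact multiple of x1, x3 is not related.
x1 = [1, 2, 3, 4]
x2 = [2, 4, 6, 8]
x3 = [1, -1, -1, 1]
r_12 = correlation(x1, x2)
r_13 = correlation(x1, x3)
print(r_12, r_13)  # about 1 and 0
```

A correlation close to 1 or -1 between two predictors suggests that they carry nearly the same information, so one of them is a natural candidate for removal.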

For further details about linear regression, see

  Kutner, Nachtsheim, Neter and Li (2005), Applied Linear Statistical Models, 5th ed., McGraw-Hill.
  Fox (2008), Applied Regression Analysis and Generalized Linear Models, 2nd ed., Sage Publications.

