
Multiple Linear Regression Review

Outline

- Simple Linear Regression
- Multiple Regression
- Understanding the Regression Output
- Coefficient of Determination R²
- Validating the Regression Model

Linear Regression: An Example


Appleglo first-year data:

Region          First-Year Advertising        First-Year Sales
                Expenditures ($ millions), x  ($ millions), y
Maine                    1.8                        104
New Hampshire            1.2                         68
Vermont                  0.4                         39
Massachusetts            0.5                         43
Connecticut              2.5                        127
Rhode Island             2.5                        134
New York                 1.5                         87
New Jersey               1.2                         77
Pennsylvania             1.6                        102
Delaware                 1.0                         65
Maryland                 1.5                        101
West Virginia            0.7                         46
Virginia                 1.0                         52
Ohio                     0.8                         33

[Scatter plot: First-Year Sales ($ millions) vs. Advertising Expenditures ($ millions)]
Questions:
a) How to relate advertising expenditure to sales?
b) What are the expected first-year sales if advertising expenditure is $2.2 million?
c) How confident is your estimate? How good is the fit?

The Basic Model: Simple Linear Regression
Data: (x1, y1), (x2, y2), . . . , (xn, yn)

Model of the population: Yi = β0 + β1 xi + εi, where ε1, ε2, . . . , εn are i.i.d. random variables, ~ N(0, σ).

This is the true relation between Y and x, but we do not know β0 and β1 and have to estimate them based on the data.

Comments:
- E(Yi | xi) = β0 + β1 xi
- SD(Yi | xi) = σ
- The relationship is linear: it is described by a line.
- β0 = baseline value of Y (i.e., value of Y if x is 0)
- β1 = slope of the line (average change in Y per unit change in x)

How do we choose the line that best fits the data?


[Scatter plot: First-Year Sales ($M) vs. Advertising Expenditures ($M), with the fitted regression line. The residual ei is the vertical distance from each data point (xi, yi) to its fitted value (xi, ŷi). Intercept b0 = 13.82; slope b1 = 48.60.]

Best choices: b0 = 13.82, b1 = 48.60

Regression coefficients: b0 and b1 are estimates of β0 and β1.


Regression estimate for Y at xi: ŷi = b0 + b1 xi (prediction)

Residual (error): ei = yi − ŷi

The best regression line is the one that chooses b0 and b1 to minimize the total error (the residual sum of squares):

SSR = Σi=1..n ei² = Σi=1..n (yi − ŷi)²
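These estimates can be reproduced directly from the least-squares formulas. A minimal sketch in Python (the variable names are our own; the data comes from the Appleglo table above):

```python
import numpy as np

# Appleglo data: advertising expenditures (x) and first-year sales (y), $ millions
x = np.array([1.8, 1.2, 0.4, 0.5, 2.5, 2.5, 1.5, 1.2, 1.6, 1.0, 1.5, 0.7, 1.0, 0.8])
y = np.array([104, 68, 39, 43, 127, 134, 87, 77, 102, 65, 101, 46, 52, 33])

# Least-squares formulas for simple regression: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)            # approximately 13.82 and 48.60, as on the slide

# Question (b) from the example: predicted sales at $2.2 million of advertising
print(b0 + b1 * 2.2)     # approximately 120.7 ($ millions)
```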

Example: Sales of Nature-Bar ($ million)

Region         Sales   Advertising   Promotions   Competitor's Sales
Selkirk        101.8       1.3           0.2            20.40
Susquehanna     44.4       0.7           0.2            30.50
Kittery        108.3       1.4           0.3            24.60
Acton           85.1       0.5           0.4            19.60
Finger Lakes    77.1       0.5           0.6            25.50
Berkshire      158.7       1.9           0.4            21.70
Central        180.4       1.2           1.0             6.80
Providence      64.2       0.4           0.4            12.60
Nashua          74.6       0.6           0.5            31.30
Dunster        143.4       1.3           0.6            18.60
Endicott       120.6       1.6           0.8            19.90
Five-Towns      69.7       1.0           0.3            25.60
Waldeboro       67.8       0.8           0.2            27.40
Jackson        106.7       0.6           0.5            24.30
Stowe          119.6       1.1           0.3            13.70

Multiple Regression


In general, there are many factors in addition to advertising expenditures that affect sales. Multiple regression allows more than one x variable.

Independent variables: x1, x2, . . . , xk (k of them)

Data: (y1, x11, x21, . . . , xk1), . . . , (yn, x1n, x2n, . . . , xkn)

Population model: Yi = β0 + β1 x1i + . . . + βk xki + εi, where ε1, ε2, . . . , εn are i.i.d. random variables, ~ N(0, σ).

Regression coefficients: b0, b1, . . . , bk are estimates of β0, β1, . . . , βk.

Regression estimate of yi: ŷi = b0 + b1 x1i + . . . + bk xki

Goal: choose b0, b1, . . . , bk to minimize the residual sum of squares. I.e., minimize:

SSR = Σi=1..n ei² = Σi=1..n (yi − ŷi)²
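The same least-squares objective extends to several x variables by solving a linear system. A minimal sketch (variable names ours; data from the Nature-Bar table above):

```python
import numpy as np

# Nature-Bar data from the table above
sales       = np.array([101.8, 44.4, 108.3, 85.1, 77.1, 158.7, 180.4, 64.2,
                        74.6, 143.4, 120.6, 69.7, 67.8, 106.7, 119.6])
advertising = np.array([1.3, 0.7, 1.4, 0.5, 0.5, 1.9, 1.2, 0.4,
                        0.6, 1.3, 1.6, 1.0, 0.8, 0.6, 1.1])
promotions  = np.array([0.2, 0.2, 0.3, 0.4, 0.6, 0.4, 1.0, 0.4,
                        0.5, 0.6, 0.8, 0.3, 0.2, 0.5, 0.3])
competitors = np.array([20.4, 30.5, 24.6, 19.6, 25.5, 21.7, 6.8, 12.6,
                        31.3, 18.6, 19.9, 25.6, 27.4, 24.3, 13.7])

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones(len(sales)), advertising, promotions, competitors])

# lstsq minimizes SSR = sum of squared residuals over (b0, b1, b2, b3)
b, ssr, rank, _ = np.linalg.lstsq(X, sales, rcond=None)
print(b)    # close to (65.71, 48.98, 59.65, -1.84), as in the Excel output below
print(ssr)  # residual sum of squares, close to 3407.5
```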

Regression Output (from Excel)


Regression Statistics
Multiple R          0.913
R Square            0.833
Adjusted R Square   0.787
Standard Error      17.600
Observations        15

Analysis of Variance
             df   Sum of Squares   Mean Square      F      Significance F
Regression    3      16997.537       5665.85      18.290       0.000
Residual     11       3407.473        309.77
Total        14      20405.009

                    Coefficients   Standard Error   t Statistic   P-value   Lower 95%   Upper 95%
Intercept               65.71          27.73            2.37       0.033       4.67       126.74
Advertising             48.98          10.66            4.60       0.000      25.52        72.44
Promotions              59.65          23.63            2.53       0.024       7.66       111.65
Competitors Sales       -1.84           0.81           -2.26       0.040      -3.63        -0.047
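Output like the above can also be reproduced outside Excel. A sketch using the statsmodels package (our choice of tool, not the course's), with the Nature-Bar data packed row by row:

```python
import numpy as np
import statsmodels.api as sm

# Nature-Bar rows: sales, advertising, promotions, competitor's sales
data = np.array([
    [101.8, 1.3, 0.2, 20.4], [ 44.4, 0.7, 0.2, 30.5], [108.3, 1.4, 0.3, 24.6],
    [ 85.1, 0.5, 0.4, 19.6], [ 77.1, 0.5, 0.6, 25.5], [158.7, 1.9, 0.4, 21.7],
    [180.4, 1.2, 1.0,  6.8], [ 64.2, 0.4, 0.4, 12.6], [ 74.6, 0.6, 0.5, 31.3],
    [143.4, 1.3, 0.6, 18.6], [120.6, 1.6, 0.8, 19.9], [ 69.7, 1.0, 0.3, 25.6],
    [ 67.8, 0.8, 0.2, 27.4], [106.7, 0.6, 0.5, 24.3], [119.6, 1.1, 0.3, 13.7],
])
y, X = data[:, 0], sm.add_constant(data[:, 1:])

# OLS reproduces the whole table: coefficients, standard errors, t-statistics,
# p-values, 95% confidence intervals, R^2, and the ANOVA F-statistic
print(sm.OLS(y, X).fit().summary())
```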

Understanding Regression Output


1) Regression coefficients: b0, b1, . . . , bk are estimates of β0, β1, . . . , βk based on the sample data. Fact: E[bj] = βj.

Example:
- b0 = 65.705 (its interpretation is context dependent)
- b1 = 48.979 (an additional $1 million in advertising is expected to result in an additional $49 million in sales)
- b2 = 59.654 (an additional $1 million in promotions is expected to result in an additional $60 million in sales)
- b3 = -1.838 (an increase of $1 million in competitor sales is expected to decrease sales by $1.8 million)

Understanding Regression Output, Continued


2) Standard error: s is an estimate of σ, the SD of each εi. It is a measure of the amount of noise in the model. Example: s = 17.60.

3) Degrees of freedom: #cases − #parameters; relates to the over-fitting phenomenon.

4) Standard errors of the coefficients: sb0, sb1, . . . , sbk. They are just the standard deviations of the estimates b0, b1, . . . , bk. They are useful in assessing the quality of the coefficient estimates and validating the model.

R² takes values between 0 and 1 (it is a percentage). R² = 0.833 in our Nature-Bar example.

[Scatter plot: First-Year Sales ($ millions) vs. Advertising Expenditures ($ millions) for the Appleglo data]

[Scatter plot: Y vs. X with all points exactly on a line] R² = 1; the x values account for all variation in the Y values.

[Scatter plot: Y vs. X with no visible relationship] R² = 0; the x values account for none of the variation in the Y values.

Understanding Regression Output, Continued


5) Coefficient of determination: R²
It is a measure of the overall quality of the regression. Specifically, it is the percentage of the total variation exhibited in the yi data that is accounted for by the sample regression line.

The sample mean of Y: ȳ = (y1 + y2 + . . . + yn)/n

Total variation in Y = Σi=1..n (yi − ȳ)²

Residual (unaccounted) variation in Y = Σi=1..n ei² = Σi=1..n (yi − ŷi)²

R² = (variation accounted for by the x variables) / (total variation)
   = 1 − (variation not accounted for by the x variables) / (total variation)
   = 1 − Σi=1..n (yi − ŷi)² / Σi=1..n (yi − ȳ)²
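Applying the definition to the Appleglo simple regression, as a minimal sketch (variable names ours):

```python
import numpy as np

# Appleglo data and the fitted line from earlier (b0 = 13.82, b1 = 48.60)
x = np.array([1.8, 1.2, 0.4, 0.5, 2.5, 2.5, 1.5, 1.2, 1.6, 1.0, 1.5, 0.7, 1.0, 0.8])
y = np.array([104, 68, 39, 43, 127, 134, 87, 77, 102, 65, 101, 46, 52, 33])
y_hat = 13.82 + 48.60 * x

# R^2 = 1 - (residual variation) / (total variation)
residual_variation = np.sum((y - y_hat) ** 2)
total_variation = np.sum((y - y.mean()) ** 2)
print(1 - residual_variation / total_variation)   # about 0.93 for this data
```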

Coefficient of Determination: R²

A high R² means that most of the variation we observe in the yi data can be attributed to their corresponding x values: a desired property.

In simple regression, R² is higher when the data points are better aligned along a line. But beware of outliers: see the Anscombe example.

How high an R² is good enough depends on the situation (for example, the intended use of the regression, and the complexity of the problem).

Users of regression tend to be fixated on R², but it's not the whole story. It is important that the regression model is valid.

Coefficient of Determination: R²

One should not include x variables unrelated to Y in the model just to make R² fictitiously high. (With more x variables there will be more freedom in choosing the bi's to make the residual variation closer to 0.)

Multiple R is just the square root of R².

Validating the Regression Model


Assumptions about the population:
Yi = β0 + β1 x1i + . . . + βk xki + εi (i = 1, . . . , n)
ε1, ε2, . . . , εn are i.i.d. random variables, ~ N(0, σ)

1) Linearity
If k = 1 (simple regression), one can check visually from the scatter plot.
Sanity checks: do the signs of the coefficients make sense? Is there a reason to expect non-linearity?

2) Normality of εi
Plot a histogram of the residuals (ei = yi − ŷi).
Usually, results are fairly robust with respect to this assumption.
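A quick way to run this check, sketched with matplotlib (our choice of tool) on the Appleglo residuals:

```python
import numpy as np
import matplotlib.pyplot as plt

# Residuals from the Appleglo fit (b0 = 13.82, b1 = 48.60 from earlier)
x = np.array([1.8, 1.2, 0.4, 0.5, 2.5, 2.5, 1.5, 1.2, 1.6, 1.0, 1.5, 0.7, 1.0, 0.8])
y = np.array([104, 68, 39, 43, 127, 134, 87, 77, 102, 65, 101, 46, 52, 33])
residuals = y - (13.82 + 48.60 * x)

# Look for a roughly bell-shaped histogram centered at 0
plt.hist(residuals, bins=6)
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.show()
```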

3) Heteroscedasticity
Do the error terms have constant standard deviation? (i.e., is SD(εi) = σ for all i?)
Check scatter plots of the residuals vs. Y and the x variables.
[Two scatter plots of residuals vs. Advertising Expenditures: in one, the residuals are spread evenly around 0 across the range (no evidence of heteroscedasticity); in the other, the spread of the residuals changes with x (evidence of heteroscedasticity)]

May be fixed by introducing a transformation, or by introducing or eliminating some independent variables.
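The corresponding plot for the Appleglo fit, as a sketch (a funnel shape in such a plot would suggest heteroscedasticity):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.8, 1.2, 0.4, 0.5, 2.5, 2.5, 1.5, 1.2, 1.6, 1.0, 1.5, 0.7, 1.0, 0.8])
y = np.array([104, 68, 39, 43, 127, 134, 87, 77, 102, 65, 101, 46, 52, 33])
residuals = y - (13.82 + 48.60 * x)

# Constant vertical spread around 0 is good; a funnel shape is a warning sign
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Advertising Expenditures ($M)")
plt.ylabel("Residual")
plt.show()
```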

4) Autocorrelation: Are the error terms independent? Plot the residuals in order and check for patterns.
[Two time plots of residuals vs. observation order: in one, the residuals bounce randomly around 0 (no evidence of autocorrelation); in the other, they drift in a systematic pattern (evidence of autocorrelation)]

Autocorrelation may be present if observations have a natural sequential order (for example, time). May be fixed by introducing a variable or transforming a variable.
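To see what the tell-tale pattern looks like, here is a small simulated illustration (not course data): errors built with an AR(1)-style dependence drift in long same-sign runs instead of bouncing randomly.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated residuals with AR(1)-style dependence: each leans on its predecessor
rng = np.random.default_rng(0)
resid = np.zeros(20)
for t in range(1, 20):
    resid[t] = 0.8 * resid[t - 1] + rng.normal(0, 2)

# Plotted in order, autocorrelated residuals show runs of same-sign values
plt.plot(range(1, 21), resid, marker="o")
plt.axhline(0, linestyle="--")
plt.xlabel("Observation order")
plt.ylabel("Residual")
plt.show()
```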

Pitfalls and Issues


1) Overspecification: including too many x variables to make R² fictitiously high.

Rule of thumb: we should maintain that n >= 5(k+2).


2) Extrapolating beyond the range of data

[Plot: fitted regression line extended beyond the observed range of the Advertising data]

Validating the Regression Model


3) Multicollinearity
Occurs when two of the x variables are strongly correlated.
Can give very wrong estimates for the βi's.
Tell-tale signs:
- Regression coefficients (bi's) have the wrong sign.
- Addition/deletion of an independent variable results in large changes of the regression coefficients.
- Regression coefficients (bi's) are not significantly different from 0.
May be fixed by deleting one or more independent variables (see the correlation sketch after this list).
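A quick first screen is the pairwise correlation matrix of the x variables. A sketch on the Nature-Bar predictors (variable names ours):

```python
import numpy as np

# Nature-Bar predictors from the earlier table
advertising = np.array([1.3, 0.7, 1.4, 0.5, 0.5, 1.9, 1.2, 0.4,
                        0.6, 1.3, 1.6, 1.0, 0.8, 0.6, 1.1])
promotions  = np.array([0.2, 0.2, 0.3, 0.4, 0.6, 0.4, 1.0, 0.4,
                        0.5, 0.6, 0.8, 0.3, 0.2, 0.5, 0.3])
competitors = np.array([20.4, 30.5, 24.6, 19.6, 25.5, 21.7, 6.8, 12.6,
                        31.3, 18.6, 19.9, 25.6, 27.4, 24.3, 13.7])

# Rows/columns: advertising, promotions, competitors.
# Off-diagonal entries near +1 or -1 would flag possible multicollinearity.
print(np.corrcoef([advertising, promotions, competitors]).round(2))
```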

Example
Student   Graduate   College
Number      GPA        GPA     GMAT
  1         4.0        3.9      640
  2         4.0        3.9      644
  3         3.1        3.1      557
  4         3.1        3.2      550
  5         3.0        3.0      547
  6         3.5        3.5      589
  7         3.1        3.0      533
  8         3.5        3.5      600
  9         3.1        3.2      630
 10         3.2        3.2      548
 11         3.8        3.7      600
 12         4.1        3.9      633
 13         2.9        3.0      546
 14         3.7        3.7      602
 15         3.8        3.8      614
 16         3.9        3.9      644
 17         3.6        3.7      634
 18         3.1        3.0      572
 19         3.3        3.2      570
 20         4.0        3.9      656
 21         3.1        3.1      574
 22         3.7        3.7      636
 23         3.7        3.7      635
 24         3.9        4.0      654
 25         3.8        3.8      633


Regression Output


Regression with College GPA and GMAT:

R Square          0.96
Standard Error    0.08
Observations      25

              Coefficients   Standard Error
Intercept        0.09540         0.28451
College GPA      1.12870         0.10233
GMAT            -0.00088         0.00092

What happened? College GPA and GMAT are highly correlated!

Correlations:
            Graduate   College   GMAT
Graduate       1
College        0.98        1
GMAT           0.86        0.90      1

Eliminate GMAT:

R Square          0.958
Standard Error    0.08
Observations      25

              Coefficients   Standard Error
Intercept       -0.1287          0.1604
College GPA      1.0413          0.0455

Regression Models

In linear regression, we choose the best coefficients b0, b1, . . . , bk as the estimates for β0, β1, . . . , βk. We know that on average each bj hits the right target βj. However, we also want to know how confident we can be about our estimates.


Back to Regression Output


Regression Statistics
Multiple R          0.913
R Square            0.833
Adjusted R Square   0.787
Standard Error      17.600
Observations        15

Analysis of Variance
             df   Sum of Squares   Mean Square
Regression    3      16997.537       5665.85
Residual     11       3407.473        309.77
Total        14      20405.009

                    Coefficients   Standard Error   t Statistic   P-value   Lower 95%   Upper 95%
Intercept               65.71          27.73            2.37       0.033       4.67       126.74
Advertising             48.98          10.66            4.60       0.000      25.52        72.44
Promotions              59.65          23.63            2.53       0.024       7.66       111.65
Competitors Sales       -1.84           0.81           -2.26       0.040      -3.63        -0.047

Regression Output Analysis


1) Degrees of freedom (dof):
Residual dof = n − (k+1). (We used up (k+1) degrees of freedom in forming the (k+1) sample estimates b0, b1, . . . , bk.)

2) Standard errors of the coefficients: sb0, sb1, . . . , sbk. They are just the SDs of the estimates b0, b1, . . . , bk.

Fact: before we observe bj and sbj, the quantity (bj − βj) / sbj obeys a t-distribution with dof = (n − k − 1), the same dof as the residual.

We will use this fact to assess the quality of our estimates bj. What is a 95% confidence interval for βj? Does the interval contain 0? Why do we care about this?


3) t-statistic: tj = bj / sbj
A measure of the statistical significance of each individual xj in accounting for the variability in Y.

Let c be the number for which P(−c < T < c) = α%, where T obeys a t-distribution with dof = (n − k − 1).

If |tj| > c, then the α% C.I. for βj does not contain zero. In this case, we are α% confident that βj is different from zero.
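Checking this against the Nature-Bar output, as a sketch using scipy (the numbers 48.98, 10.66, and dof = 11 are read off the output table above):

```python
from scipy import stats

# Advertising coefficient from the Nature-Bar output: b1 = 48.98, s_b1 = 10.66, dof = 11
b1, s_b1, dof = 48.98, 10.66, 11

t_stat = b1 / s_b1                    # about 4.60, as in the t Statistic column
c = stats.t.ppf(0.975, dof)           # about 2.201 for a 95% C.I.
print(b1 - c * s_b1, b1 + c * s_b1)   # about 25.5 and 72.4: the Lower/Upper 95% columns

# Two-sided p-value; about 0.0008, printed as 0.000 in the output
print(2 * (1 - stats.t.cdf(abs(t_stat), dof)))
```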

Example: Executive Compensation


Number   Pay ($1,000)   Years in position   Change in Stock Price (%)   Change in Sales (%)   MBA?
  1         1,530               7                      48                        89            YES
  2         1,117               6                      35                        19            YES
  3           602               3                       9                        24            NO
  4         1,170               6                      37                         8            YES
  5         1,086               6                      34                        28            NO
  6         2,536               9                      81                       -16            YES
  7           300               2                     -17                       -17            NO
  8           670               2                     -15                       -67            YES
  9           250               0                     -52                        49            NO
 10         2,413              10                     109                       -27            YES
 11         2,707               7                      44                        26            YES
 12           341               1                      28                        -7            NO
 13           734               4                      10                        -7            NO
 14         2,368               8                      16                        -4            NO


Dummy variables: Often, some of the explanatory variables in a regression are categorical rather than numeric.

If we think that whether an executive has an MBA or not affects his/her pay, we create a dummy variable and let it be 1 if the executive has an MBA and 0 otherwise.

If we think the season of the year is an important factor in determining sales, how do we create dummy variables? How many? What is the problem with creating 4 dummy variables?

In general, if there are m categories that an x variable can belong to, then we need to create m−1 dummy variables for it (see the sketch below).
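One way to build such dummies, sketched with pandas (the season data here is hypothetical, made up for illustration):

```python
import pandas as pd

# Hypothetical season labels for a handful of observations
seasons = pd.Series(["winter", "spring", "summer", "fall", "winter", "summer"])

# drop_first=True keeps m - 1 = 3 dummies out of the 4 seasons; the dropped
# category ("fall" here, the first alphabetically) becomes the baseline
dummies = pd.get_dummies(seasons, prefix="season", drop_first=True)
print(dummies)
```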

OILPLUS Data


       Month              heating oil   temperature
 1     August, 1989          24.83           73
 2     September, 1989       24.69           67
 3     October, 1989         19.31           57
 4     November, 1989        59.71           43
 5     December, 1989        99.67           26
 6     January, 1990         49.33           41
 7     February, 1990        59.38           38
 8     March, 1990           55.17           46
 9     April, 1990           55.52           54
10     May, 1990             25.94           60
11     June, 1990            20.69           71
12     July, 1990            24.33           75
13     August, 1990          22.76           74
14     September, 1990       24.69           66
15     October, 1990         22.76           61
16     November, 1990        50.59           49
17     December, 1990        79.00           41


[Scatter plot: Heating Oil Consumption (1,000 gallons) vs. Average Temperature (degrees Fahrenheit)]

[Scatter plot: oil consumption vs. inverse temperature (1/temperature)]

Transformed data (first six months):

heating oil   temperature   inverse temperature
   24.83           73             0.0137
   24.69           67             0.0149
   19.31           57             0.0175
   59.71           43             0.0233
   99.67           26             0.0385
   49.33           41             0.0244
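The transformation itself is one line. A sketch fitting consumption against 1/temperature (variable names ours; data from the OILPLUS table above):

```python
import numpy as np

# OILPLUS data from the table above
temperature = np.array([73, 67, 57, 43, 26, 41, 38, 46, 54, 60, 71, 75, 74, 66, 61, 49, 41])
heating_oil = np.array([24.83, 24.69, 19.31, 59.71, 99.67, 49.33, 59.38, 55.17, 55.52,
                        25.94, 20.69, 24.33, 22.76, 24.69, 22.76, 50.59, 79.00])

# The linearizing transformation: x = 1 / temperature
inv_temp = 1.0 / temperature

# Ordinary simple regression on the transformed variable
b1 = np.sum((inv_temp - inv_temp.mean()) * (heating_oil - heating_oil.mean())) \
     / np.sum((inv_temp - inv_temp.mean()) ** 2)
b0 = heating_oil.mean() - b1 * inv_temp.mean()
print(b0, b1)
```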


The Practice of Regression

- Choose which independent variables to include in the model, based on common sense and context-specific knowledge.
- Collect data (create dummy variables if necessary).
- Run the regression: the easy part.
- Analyze the output and make changes in the model: this is where the action is.
- Test the regression result on out-of-sample data.

The Post-Regression Checklist


1) Statistics checklist:
- Calculate the correlation between pairs of x variables: watch for evidence of multicollinearity.
- Check the signs of the coefficients: do they make sense?
- Check the 95% C.I.s (use the t-statistics as a quick scan): are the coefficients significantly different from zero?
- R²: the overall quality of the regression, but not the only measure.

2) Residual checklist:
- Normality: look at a histogram of the residuals.
- Heteroscedasticity: plot the residuals against each x variable.
- Autocorrelation: if the data has a natural order, plot the residuals in order and check for a pattern.


The Grand Checklist


- Linearity: scatter plot, common sense, and knowing your problem; transform, including interactions, if useful.
- t-statistics: are the coefficients significantly different from zero? Look at the width of the confidence intervals. F-tests for subsets, equality of coefficients.
- R²: is it reasonably high in the context?
- Influential observations, outliers in predictor space, dependent-variable space.
- Normality: plot a histogram of the residuals. Studentized residuals.
- Heteroscedasticity: plot residuals against each x variable; transform if necessary; Box-Cox transformations.
- Autocorrelation: time series plot.
- Multicollinearity: compute correlations of the x variables; do the signs of the coefficients agree with intuition? Principal components.
- Missing values.

