Lecture 8
Outline
- Simple Linear Regression
- Multiple Regression
- Understanding the Regression Output
- Coefficient of Determination R²
- Validating the Regression Model
Questions: a) How do we relate advertising expenditure to sales? b) What are the expected first-year sales if advertising expenditure is $2.2 million? c) How confident are we in that estimate? How good is the fit?
The Basic Model: Simple Linear Regression
Data: (x1, y1), (x2, y2), ..., (xn, yn)
Model of the population: Yi = β0 + β1 xi + εi, where ε1, ε2, ..., εn are i.i.d. random variables, ~ N(0, σ).
This is the true relation between Y and x, but we do not know β0 and β1 and have to estimate them based on the data.
Comments:
- E(Yi | xi) = β0 + β1 xi
- SD(Yi | xi) = σ
- The relationship is linear: it is described by a line.
- β0 = baseline value of Y (i.e., value of Y if x is 0)
- β1 = slope of the line (average change in Y per unit change in x)
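As a minimal sketch of what this population model says, one can simulate data from it. The parameter values and variable names below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical parameter values, for illustration only.
beta0, beta1, sigma = 13.8, 48.6, 5.0
rng = np.random.default_rng(0)

n = 100
x = rng.uniform(0.0, 1.0, size=n)     # e.g., advertising expenditures ($M)
eps = rng.normal(0.0, sigma, size=n)  # i.i.d. N(0, sigma) error terms
y = beta0 + beta1 * x + eps           # the population model Yi = b0 + b1*xi + eps_i

# E(Y | x) is the line beta0 + beta1 * x; the errors scatter points around it.
print(y[:3])
```

In practice we only observe the (xi, yi) pairs; β0, β1, and σ are unknown and must be estimated.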
[Figure: sales vs. advertising expenditures ($M) with the fitted regression line; intercept b0 = 13.82, slope b1 = 48.60. The residual ei is the vertical gap between the observed point (xi, yi) and the fitted point (xi, ŷi).]
Residual (error): ei = yi − ŷi.
The best regression line is the one that chooses b0 and b1 to minimize the total error, the residual sum of squares:

SSR = Σ eᵢ² = Σ (yᵢ − ŷᵢ)²   (sum over i = 1, ..., n)
region          sales   advertising  promotions  competitor's sales
Selkirk         101.8   1.3          0.2         20.40
Susquehanna      44.4   0.7          0.2         30.50
Kittery         108.3   1.4          0.3         24.60
Acton            85.1   0.5          0.4         19.60
Finger Lakes     77.1   0.5          0.6         25.50
Berkshire       158.7   1.9          0.4         21.70
Central         180.4   1.2          1.0          6.80
Providence       64.2   0.4          0.4         12.60
Nashua           74.6   0.6          0.5         31.30
Dunster         143.4   1.3          0.6         18.60
Endicott        120.6   1.6          0.8         19.90
Five-Towns       69.7   1.0          0.3         25.60
Waldeboro        67.8   0.8          0.2         27.40
Jackson         106.7   0.6          0.5         24.30
Stowe           119.6   1.1          0.3         13.70
Independent variables: x1, x2, ..., xk (k of them)
Data: (y1, x11, x21, ..., xk1), ..., (yn, x1n, x2n, ..., xkn)
Population model: Yi = β0 + β1 x1i + ... + βk xki + εi, where ε1, ε2, ..., εn are i.i.d. random variables, ~ N(0, σ)
Fitted values: ŷi = b0 + b1 x1i + ... + bk xki
Goal: choose b0, b1, ..., bk to minimize the residual sum of squares, i.e., minimize:

SSR = Σ eᵢ² = Σ (yᵢ − ŷᵢ)²   (sum over i = 1, ..., n)
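Minimizing the SSR over all the coefficients at once is exactly what a numerical least-squares routine does. A sketch using a few rows of made-up data (names and numbers illustrative only):

```python
import numpy as np

# Toy data (hypothetical): sales vs. advertising and promotions.
X = np.array([[1.3, 0.2],
              [0.7, 0.2],
              [1.4, 0.3],
              [1.9, 0.4],
              [1.2, 1.0]])
y = np.array([101.8, 44.4, 108.3, 158.7, 180.4])

# Prepend a column of ones so the intercept b0 is estimated too.
A = np.column_stack([np.ones(len(X)), X])

# Least squares: choose b to minimize ||y - A b||^2, i.e., the SSR.
b, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

y_hat = A @ b
ssr = np.sum((y - y_hat) ** 2)
print(b, ssr)
```

By construction, this fit can never have a larger SSR than the model that predicts ȳ for every observation.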
               Coefficients  Standard Error  t-Statistic  p-value  Lower 95%  Upper 95%
Intercept          65.71          27.73          2.37      0.033       4.67     126.74
Advertising        48.98          10.66          4.60      0.000      25.52      72.44
Promotions         59.65          23.63          2.53      0.024       7.66     111.65
Compet. Sales      -1.84           0.81         -2.26      0.040      -3.63     -0.047
R² takes values between 0 and 1 (it is the percentage of the variation in Y explained by the regression). R² = 0.833 in our Appleglo example.
[Two scatter plots of Y vs. X, illustrating different degrees of fit.]
R² = (variation accounted for by x variables) / (total variation)
   = 1 − (variation not accounted for by x variables) / (total variation)
   = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)²   (sums over i = 1, ..., n)
A high R² means that most of the variation we observe in the yi data can be attributed to their corresponding x values, a desirable property. In simple regression, R² is higher when the data points are better aligned along a line, but outliers can mislead (recall the Anscombe example, where very different data sets yield the same summary statistics). How high an R² is good enough depends on the situation (for example, the intended use of the regression and the complexity of the problem). Users of regression tend to fixate on R², but it is not the whole story: it is also important that the regression model is valid.
One should not include x variables unrelated to Y in the model just to make R² fictitiously high. (With more x variables there is more freedom in choosing the bj's to drive the residual variation toward 0.) Multiple R is just the square root of R².
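The definition above translates directly into code. A minimal sketch, with hypothetical observed and fitted values standing in for some regression's output:

```python
import numpy as np

# Hypothetical observed values and fitted values from some regression.
y = np.array([30.0, 35.0, 52.0, 61.0, 78.0, 90.0])
y_hat = np.array([28.0, 37.7, 52.2, 61.9, 76.4, 90.9])

ss_res = np.sum((y - y_hat) ** 2)     # variation not accounted for by x
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
r2 = 1.0 - ss_res / ss_tot

print(round(r2, 3))
```

Because the fitted values here track the data closely, the resulting R² is close to 1.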
2) Normality of εi
Plot a histogram of the residuals (ei = yi − ŷi).
Usually, results are fairly robust with respect to this assumption.
3) Heteroscedasticity: Do the error terms have constant standard deviation (i.e., SD(εi) = σ for all i)? Check scatter plots of the residuals vs. ŷ and vs. the x variables.
[Two plots of residuals vs. advertising expenditures: one shows no evidence of heteroscedasticity (residuals evenly scattered); the other shows evidence of heteroscedasticity (residual spread changing with x).]
May be fixed by introducing a transformation, or by introducing or eliminating some independent variables.
4) Autocorrelation: Are the error terms independent? Plot the residuals in order (e.g., time order) and check for patterns.
[Two time plots of residuals: one shows no evidence of autocorrelation (no pattern over time); the other shows evidence of autocorrelation (a clear pattern over time).]
Autocorrelation may be present if observations have a natural sequential order (for example, time). May be fixed by introducing a variable or transforming a variable.
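Beyond eyeballing the time plot, a crude numeric check is the correlation between consecutive residuals (the lag-1 autocorrelation). A sketch with two synthetic residual series, one independent and one with an obvious pattern (both hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical residual series: one i.i.d., one with a smooth pattern over time.
e_indep = rng.normal(0.0, 1.0, size=50)
t = np.arange(50)
e_auto = np.sin(t / 4.0) + rng.normal(0.0, 0.2, size=50)

def lag1_corr(e):
    """Correlation between consecutive residuals (lag-1 autocorrelation)."""
    return np.corrcoef(e[:-1], e[1:])[0, 1]

print(lag1_corr(e_indep), lag1_corr(e_auto))
```

For independent errors the lag-1 correlation should be near zero; a value far from zero is a warning sign, consistent with the pattern seen in the time plot.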
Example
Student Number  Graduate GPA  College GPA  GMAT
 1              4.0           3.9          640
 2              4.0           3.9          644
 3              3.1           3.1          557
 4              3.1           3.2          550
 5              3.0           3.0          547
 6              3.5           3.5          589
 7              3.1           3.0          533
 8              3.5           3.5          600
 9              3.1           3.2          630
10              3.2           3.2          548
11              3.8           3.7          600
12              4.1           3.9          633
13              2.9           3.0          546
14              3.7           3.7          602
15              3.8           3.8          614
16              3.9           3.9          644
17              3.6           3.7          634
18              3.1           3.0          572
19              3.3           3.2          570
20              4.0           3.9          656
21              3.1           3.1          574
22              3.7           3.7          636
23              3.7           3.7          635
24              3.9           4.0          654
25              3.8           3.8          633
What happened? GMAT is highly correlated with college GPA, so the two predictors carry largely the same information (multicollinearity), which destabilizes the coefficient estimates. Fix: eliminate GMAT from the model.
In linear regression, we choose the best coefficients b0, b1, ..., bk as the estimates for β0, β1, ..., βk. We know that on average each bj hits the right target βj. However, we also want to know how confident we can be in our estimates.
               Coefficients  Standard Error  t-Statistic  p-value
Intercept          65.71          27.73          2.37      0.033
Advertising        48.98          10.66          4.60      0.000
Promotions         59.65          23.63          2.53      0.024
Compet. Sales      -1.84           0.81         -2.26      0.040
2) Standard errors of the coefficients: s_b0, s_b1, ..., s_bk.
They are just the SDs of the estimates b0, b1, ..., bk.
Fact: before we observe bj and s_bj,

(bj − βj) / s_bj

obeys a t-distribution with dof = n − k − 1, the same dof as the residual. We will use this fact to assess the quality of our estimates bj. What is a 95% confidence interval for βj? Does the interval contain 0? Why do we care about this?
3) t-Statistic: tj = bj / s_bj
A measure of the statistical significance of each individual xj in accounting for the variability in Y.
Let c be the number for which P(−c < T < c) = 95%, where T obeys a t-distribution with dof = n − k − 1.
If |tj| > c, then the 95% C.I. for βj does not contain zero; in that case, we are 95% confident that βj is different from zero.
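This computation can be sketched with scipy's t-distribution, using the Advertising row of the regression output above (n = 15 regions, k = 3 x variables):

```python
from scipy import stats

# From the regression output: coefficient and standard error for Advertising.
b_j, s_bj = 48.98, 10.66
n, k = 15, 3
dof = n - k - 1

# Critical value c with P(-c < T < c) = 95% for a t-distribution with dof.
c = stats.t.ppf(0.975, dof)

t_j = b_j / s_bj                        # t-statistic
ci = (b_j - c * s_bj, b_j + c * s_bj)   # 95% confidence interval for beta_j

print(round(t_j, 2), ci, abs(t_j) > c)
```

Since |t_j| exceeds c, the 95% interval excludes zero and Advertising is statistically significant at that level.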
Dummy variables: often, some of the explanatory variables in a regression are categorical rather than numeric. If we think that whether an executive has an MBA affects his/her pay, we create a dummy variable and let it be 1 if the executive has an MBA and 0 otherwise. If we think the season of the year is an important factor in determining sales, how do we create dummy variables? How many? What is the problem with creating 4 dummy variables? In general, if an x variable can belong to one of m categories, we create m − 1 dummy variables for it.
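The m − 1 rule can be sketched with pandas (data entirely hypothetical). Using all 4 season dummies would be redundant: they would always sum to 1, duplicating the intercept column (perfect collinearity), so one category is dropped:

```python
import pandas as pd

# Hypothetical data with a categorical "season" variable (m = 4 categories).
df = pd.DataFrame({
    "sales": [120, 95, 80, 140, 115, 90],
    "season": ["winter", "spring", "summer", "fall", "winter", "spring"],
})

# drop_first=True keeps m - 1 = 3 dummies; the dropped category ("fall",
# alphabetically first) becomes the baseline absorbed by the intercept.
dummies = pd.get_dummies(df["season"], prefix="season", drop_first=True)
X = pd.concat([df.drop(columns="season"), dummies], axis=1)

print(X.columns.tolist())
```

Each coefficient on a remaining dummy is then interpreted as the average difference from the baseline category.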
[Example: oil consumption (details not recoverable from the slide).]
- Choose which independent variables to include in the model, based on common sense and context-specific knowledge.
- Collect data (create dummy variables if necessary).
- Run the regression: the easy part.
- Analyze the output and make changes in the model: this is where the action is.
Missing Values