Outline
Linear Regression: An Example
[Figure: Appleglo first-year sales scatter plot]
The Basic Model: Simple Linear Regression
Model of the population: Yi = β0 + β1 xi + εi
ε1, ε2, . . . , εn are i.i.d. random variables, N(0, σ )
This is the true relation between Y and x, but we do
not know β 0 and β 1 and have to estimate them based
on the data.
Comments:
• E(Yi | xi) = β0 + β1 xi
• SD(Yi | xi) = σ
• Relationship is linear – described by a “line”
• β 0 = “baseline” value of Y (i.e., value of Y if x is 0)
• β 1 = “slope” of line (average change in Y per unit change in x)
How do we choose the line that “best” fits the data?
[Figure: Sales vs. Advertising Expenditures ($M) scatter plot, with a data point (xi, ŷi) on the fitted line]
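The standard answer is least squares: choose b0 and b1 to minimize the sum of squared vertical distances between the data points and the line. A minimal sketch, assuming made-up data (the `fit_line` helper and the numbers below are illustrative, not the Appleglo or Nature-Bar figures):

```python
def fit_line(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # Least-squares slope: Sxy / Sxx
    b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
          / sum((a - x_bar) ** 2 for a in x))
    b0 = y_bar - b1 * x_bar  # line passes through (x_bar, y_bar)
    return b0, b1

# Made-up data: advertising ($M) vs. sales; exactly linear for clarity.
x = [0.5, 1.0, 1.5, 2.0]
y = [60.0, 80.0, 100.0, 120.0]
b0, b1 = fit_line(x, y)  # b0 = 40.0, b1 = 40.0
```

With perfectly linear data the fitted line recovers the exact intercept and slope; with real data it gives the closest line in the squared-error sense.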
Example: Sales of Nature-Bar ($ million)
Multiple Regression
• In general, there are many factors in addition to advertising expenditures that affect sales.
• Multiple regression allows more than one x variable.
Regression Output (from Excel)
Regression Statistics
Multiple R 0.913
R Square 0.833
Adjusted R Square 0.787
Standard Error 17.600
Observations 15
Analysis of Variance
             df   Sum of Squares   Mean Square        F   Significance F
Regression    3        16997.537       5665.85   18.290            0.000
Residual     11         3407.473        309.77
Total        14        20405.009

             Coefficients   Standard Error   t-Statistic   P-value   Lower 95%   Upper 95%
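The summary statistics above are all functions of the ANOVA sums of squares. As a sanity check, a short computation using the numbers from the table reproduces each reported value:

```python
import math

# Sums of squares and degrees of freedom from the Excel output above.
ss_reg, ss_res, ss_tot = 16997.537, 3407.473, 20405.009
df_reg, df_res = 3, 11  # k = 3 predictors, n = 15 observations

r_square = ss_reg / ss_tot                       # reported: 0.833
adj_r_square = 1 - (ss_res / df_res) / (ss_tot / (df_reg + df_res))  # 0.787
std_error = math.sqrt(ss_res / df_res)           # reported: 17.600
f_stat = (ss_reg / df_reg) / (ss_res / df_res)   # reported: 18.290
```

Note that R Square is the regression sum of squares as a fraction of the total, while the Standard Error is the square root of the residual mean square.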
Understanding Regression Output
Example:
b0 = 65.705 (its interpretation is context-dependent).
Understanding Regression Output, Continued
3) Degrees of freedom: #cases − #parameters; relates to the over-fitting phenomenon
[Figure: Appleglo example (R2 = 0.833), Sales vs. Advertising Expenditures ($ Millions)]
[Figure: two scatter plots of Y vs. X. Left: R2 = 1, the x values account for all variation in the Y values. Right: R2 = 0, the x values account for none of the variation in the Y values]
Understanding Regression Output, Continued
5) Coefficient of determination: R2
• It is a measure of the overall quality of the regression.
• Specifically, it is the percentage of total variation exhibited in
the yi data that is accounted for by the sample regression line.
The sample mean of Y: ȳ = (y1 + y2 + . . . + yn)/n

Total variation in Y = Σi=1..n (yi − ȳ)²

Residual (unaccounted) variation in Y = Σi=1..n ei² = Σi=1..n (yi − ŷi)²
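These two quantities give R2 directly: it is one minus the fraction of total variation left unexplained. A minimal sketch, assuming made-up y values and fitted values (not from the course data):

```python
# Made-up observed values and fitted values from some regression line.
y     = [2.0, 4.0, 5.0, 9.0]
y_hat = [2.5, 3.5, 6.0, 8.0]

y_bar = sum(y) / len(y)
total_var    = sum((yi - y_bar) ** 2 for yi in y)          # Σ (yi - ȳ)²
residual_var = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # Σ ei²

# R² = fraction of total variation accounted for by the regression line.
r_squared = 1 - residual_var / total_var
```

A value near 1 means the residual variation is a small fraction of the total; a value near 0 means the line explains almost nothing.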
Coefficient of Determination: R2
Validating the Regression Model
Assumptions about the population:
Yi = β0 + β1x1i + . . . + βkxki + εi (i = 1, . . . , n)
ε1, ε2, . . . , εn are i.i.d. random variables, N(0, σ)
1) Linearity
2) Normality of εi
• Plot a histogram of the residuals (ei = yi − ŷi).
• Usually, results are fairly robust with respect to this assumption.
3) Heteroscedasticity
• Do error terms have constant standard deviation? (i.e., SD(εi) = σ for all i?)
• Check scatter plot of residuals vs. Y and x variables.
[Figure: residuals vs. Advertising Expenditures]
4) Autocorrelation: Are error terms independent?
• Plot residuals in order and check for patterns.
[Figure: residuals plotted in order for two data sets, one with a visible pattern and one without]
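A numeric companion to the ordered-residual plot: the lag-1 correlation of the residuals should be near zero when the error terms are independent. A sketch, assuming a made-up residual series (not from the course data):

```python
def lag1_autocorr(e):
    """Correlation between consecutive residuals; near 0 suggests independence."""
    n = len(e)
    e_bar = sum(e) / n
    num = sum((e[i] - e_bar) * (e[i - 1] - e_bar) for i in range(1, n))
    den = sum((ei - e_bar) ** 2 for ei in e)
    return num / den

# Made-up residuals with a wave-like pattern: strong positive autocorrelation.
trended = [3, 2, 1, 0, -1, -2, -3, -2, -1, 0, 1, 2]
r = lag1_autocorr(trended)  # clearly positive, flagging a pattern
```

If the plot shows residuals drifting or oscillating in order, this statistic will be far from zero, which is the numeric version of "check for patterns."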
Pitfalls and Issues
1) Overspecification
[Figure: Sales vs. Advertising, an overspecified model fit]
Validating the Regression Model
3) Multicollinearity
• Tell-tale signs:
  - Regression coefficients (bi's) have the "wrong" sign.
  - Regression coefficients (bi's) not significantly different from 0.
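The usual screen for multicollinearity is the pairwise correlation between x variables. A sketch, assuming made-up data in which x2 is nearly a multiple of x1 (the `corr` helper is an illustrative Pearson correlation, not a course function):

```python
def corr(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

# Made-up predictors: x2 is roughly 2 * x1, so they carry the same information.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9]
r = corr(x1, x2)  # close to 1: a multicollinearity warning
```

When two predictors are this correlated, the regression cannot cleanly attribute variation in Y to one or the other, which is what produces wrong signs and insignificant coefficients.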
Example
Regression Output
R Square 0.96
Standard Error 0.08 What happened?
Observations 25
Regression Models
Back to Regression Output
Regression Statistics
Multiple R 0.913
R Square 0.833
Adjusted R Square 0.787
Standard Error 17.600
Observations 15
Analysis of Variance
             df   Sum of Squares   Mean Square
Regression    3        16997.537       5665.85
Residual     11         3407.473        309.77
Total        14        20405.009
Regression Output Analysis
1) Degrees of freedom (dof)
• Residual dof = n − (k+1) (We used up (k+1) degrees of freedom in forming the (k+1) sample estimates b0, b1, . . . , bk.)
3) t-Statistic: tj = bj / sbj
• A measure of the statistical significance of each individual xj
in accounting for the variability in Y.
• If |tj| > c, then the α% C.I. for βj does not contain zero.
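The t-statistic rule and the confidence-interval rule are the same test, since the interval is bj ± c · sbj. A sketch, assuming made-up values for the coefficient and its standard error (c = 2.201 is the two-sided 95% critical t value for 11 dof):

```python
# Made-up coefficient and standard error for illustration.
b_j, s_bj = 48.979, 10.658
c = 2.201  # two-sided 95% critical t value, 11 residual dof

t_j = b_j / s_bj                       # the reported t-Statistic
lo, hi = b_j - c * s_bj, b_j + c * s_bj  # 95% C.I. for beta_j
excludes_zero = lo > 0 or hi < 0

# |t_j| > c  holds exactly when the interval excludes zero.
```

So a coefficient is "significant" at a given level precisely when its confidence interval at that level stays on one side of zero.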
Example: Executive Compensation
. . . . . .
Dummy variables:
OILPLUS data
Month heating oil temperature
1 August, 1989 24.83 73
2 September, 1989 24.69 67
3 October, 1989 19.31 57
4 November, 1989 59.71 43
5 December, 1989 99.67 26
6 January, 1990 49.33 41
7 February, 1990 59.38 38
8 March, 1990 55.17 46
9 April, 1990 55.52 54
10 May, 1990 25.94 60
11 June, 1990 20.69 71
12 July, 1990 24.33 75
13 August, 1990 22.76 74
14 September, 1990 24.69 66
15 October, 1990 22.76 61
16 November, 1990 50.59 49
17 December, 1990 79.00 41
. . . . . .
[Figure: Heating Oil Consumption (1,000 gallons) vs. temperature, a nonlinear relationship]
[Figure: oil consumption vs. inverse temperature, roughly linear]
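The second plot suggests the fix: consumption is nonlinear in temperature but roughly linear in 1/temperature, so regress on the transformed variable. A sketch using the first twelve OILPLUS rows above (the `fit_line` helper is an illustrative least-squares routine, not from the course materials):

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (b0, b1)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b1 = (sum((a - xb) * (b - yb) for a, b in zip(x, y))
          / sum((a - xb) ** 2 for a in x))
    return yb - b1 * xb, b1

# Months 1-12 of the OILPLUS table.
temperature = [73, 67, 57, 43, 26, 41, 38, 46, 54, 60, 71, 75]
heating_oil = [24.83, 24.69, 19.31, 59.71, 99.67, 49.33,
               59.38, 55.17, 55.52, 25.94, 20.69, 24.33]

inv_temp = [1.0 / t for t in temperature]  # the transformed predictor
b0, b1 = fit_line(inv_temp, heating_oil)
# b1 > 0: as temperature drops, 1/t grows and predicted consumption rises.
```

This is the general pattern for handling nonlinearity: transform an x variable so the relationship becomes linear, then apply ordinary regression to the transformed data.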
The Practice of Regression
The Post-Regression Checklist
1) Statistics checklist:
Calculate the correlation between pairs of x variables
– watch for evidence of multicollinearity
Check signs of coefficients – do they make sense?
2) Residual checklist:
Normality – look at histogram of residuals
Heteroscedasticity – plot residuals with each x variable
Autocorrelation – if data has a natural order, plot residuals in
order and check for a pattern
The Grand Checklist
• Linearity: scatter plot, common sense, and knowing your problem,
transform including interactions if useful
• t-statistics: are the coefficients significantly different from zero?
Look at width of confidence intervals
• F-tests for subsets, equality of coefficients
• R2: is it reasonably high in the context?
• Influential observations, outliers in predictor space, dependent-variable space
• Heteroscedasticity: plot residuals with each x variable, transform if necessary, Box-Cox transformations
• Autocorrelation: "time series plot"
• Multicollinearity: compute correlations of the x variables, do signs of coefficients agree with intuition?
• Principal Components
• Missing Values