
A PowerPoint Presentation Package to Accompany

Applied Statistics in Business & Economics, 5th edition
David P. Doane and Lori E. Seward

Prepared by Lloyd R. Jaisingh

McGraw-Hill/Irwin Copyright © 2015 by The McGraw-Hill Companies, Inc. All rights reserved.
Simple Regression

Chapter Contents

12.1 Visual Displays and Correlation Analysis


12.2 Simple Regression
12.3 Regression Models
12.4 Ordinary Least Squares Formulas
12.5 Tests for Significance
12.6 Analysis of Variance: Overall Fit
12.7 Confidence and Prediction Intervals for Y
Simple Regression

Chapter Contents

12.8 Residual Tests


12.9 Unusual Observations
12.10 Other Regression Problems (Optional)
Simple Regression

Chapter Learning Objectives (LO’s)

LO12-1: Calculate and test a correlation coefficient for significance.


LO12-2: Interpret a regression equation and use it to make predictions.
LO12-3: Explain the form and assumptions of a simple regression model.
LO12-4: Explain the least squares method, apply formulas for coefficients,
and interpret R2.
LO12-5: Construct confidence intervals and test hypotheses for the slope
and intercept.
LO12-6: Interpret the ANOVA table and use it to compute F, R2, and the
standard error.
Simple Regression

Chapter Learning Objectives (LO’s)

LO12-7: Distinguish between confidence and prediction intervals for Y.


LO12-8: Calculate residuals and perform tests of regression
assumptions.
LO12-9: Identify unusual residuals and tell when they are outliers.
LO12-10: Define leverage and identify high-leverage observations.
LO12-11: Improve data conditioning and use transformations if needed
(Optional).
12.1 Visual Displays and Correlation
Analysis
Visual Displays

Begin the analysis of bivariate data (i.e., two variables) with a scatter plot.

A scatter plot
- displays each observed data pair (xi, yi) as a dot on an X/Y grid.
- indicates visually the strength of the relationship between the two variables.

Sample Scatter Plot


12.1 Visual Displays and Correlation
LO12-1
Analysis
LO12-1: Calculate and test a correlation coefficient for
significance.
Correlation Coefficient, r
The sample correlation coefficient (r) measures the degree of linearity
in the relationship between X and Y:

r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² · Σ(yi − ȳ)² ]

Note: -1 ≤ r ≤ +1
r = 0 indicates no linear relationship
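
A minimal Python sketch of this formula (the data values are hypothetical,
not from the text):

    import numpy as np

    x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
    y = np.array([3.1, 4.8, 5.2, 7.9, 9.4])

    # r = sum((x-xbar)(y-ybar)) / sqrt(sum((x-xbar)^2) * sum((y-ybar)^2))
    dx, dy = x - x.mean(), y - y.mean()
    r = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())
    print(round(r, 4))   # always between -1 and +1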
12.1 Visual Displays and
LO12-1
Correlation Analysis
Scatter Plots Showing Various Correlation Values
12.1 Visual Displays and Correlation
LO12-1
Analysis
Tests for Significant Correlation Using Student's t

Step 1: State the Hypotheses. Determine whether you are using a one- or
two-tailed test and the level of significance (α).
H0: ρ = 0    H1: ρ ≠ 0

Step 2: Specify the Decision Rule. For degrees of freedom df = n − 2, look
up the critical value tα in Appendix D.

Note: r is an estimate of the population correlation coefficient ρ (rho).
12.1 Visual Displays and Correlation
LO12-1
Analysis
Tests for Significant Correlation Using Student's t

Step 3: Calculate the Test Statistic

tcalc = r √(n − 2) / √(1 − r²),   with df = n − 2

Step 4: Make the Decision. Reject H0 if tcalc > tα/2 or if tcalc < −tα/2.
Also, reject H0 if the p-value ≤ α.
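
A minimal Python sketch of Steps 2-4 (the values of r, n, and α are
hypothetical):

    import math
    from scipy import stats

    r, n, alpha = 0.532, 25, 0.05
    t_calc = r * math.sqrt((n - 2) / (1 - r**2))       # Step 3: test statistic
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)      # Step 2: critical value, df = n - 2
    p_value = 2 * stats.t.sf(abs(t_calc), df=n - 2)    # two-tailed p-value
    reject = abs(t_calc) > t_crit or p_value <= alpha  # Step 4: decision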
12.1 Visual Displays and Correlation
LO12-1
Analysis
Critical Value for Correlation Coefficient (Tests for Significance)
Equivalently, you can calculate the critical value for the correlation
coefficient using

rcrit = tα/2 / √(tα/2² + n − 2)

This method gives a benchmark for the correlation coefficient.
However, it yields no p-value and is inflexible if you change your mind
about α.
MegaStat uses this method, giving two-tail critical values for α = 0.05
and α = 0.01.
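
A minimal sketch of this benchmark (n and α are hypothetical values); the
formula follows from solving the t statistic for r:

    import math
    from scipy import stats

    n, alpha = 25, 0.05
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    r_crit = t_crit / math.sqrt(t_crit**2 + n - 2)   # reject H0 if |r| > r_crit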
LO12-2 12.2 Simple Regression
LO12-2: Interpret the slope and intercept of a regression equation
and use it to make predictions.
What is Simple Regression?

Simple Regression analyzes the relationship between two variables.


It specifies one dependent (response) variable and one independent
(predictor) variable.
The hypothesized relationship here will be linear, of the form Y = slope
× X + y-intercept.
LO12-2 12.2 Simple Regression

Interpreting an Estimated Regression Equation: Examples


LO12-2 12.2 Simple Regression
Prediction Using Regression: Examples
LO12-2 12.2 Simple Regression

LO12-3 12.3 Regression Models
LO12-3: Explain the form and assumptions of a simple
regression model.

Model and Parameters

The assumed model for a linear relationship is


y = β0 + β1x + ε.
The relationship holds for all pairs (xi , yi ).
The error term ε is not observable and is assumed to be independently normally
distributed with mean 0 and standard deviation σ.
The unknown parameters are β0 (the intercept) and β1 (the slope).
LO12-3 12.3 Regression Models

Model and Parameters

The fitted model or regression model is used to predict the expected value
of Y for a given value of X:

ŷ = b0 + b1x

The fitted coefficients are b0 (the estimated intercept) and b1 (the
estimated slope).
LO12-3 12.3 Regression Models

Fitting a Regression on a Scatter Plot

A more precise method is to let Excel calculate the estimates. We enter
observations on the independent variable x1, x2, . . ., xn and the
dependent variable y1, y2, . . ., yn into separate columns, and let Excel
fit the regression equation, as illustrated in Figure 12.6. Excel will
choose the regression coefficients so as to produce a good fit.
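
Outside Excel, the same least squares fit can be sketched in Python (the
columns below are hypothetical, not the Figure 12.6 data):

    import numpy as np

    x = np.array([100, 132, 150, 178, 195, 210])        # hypothetical horsepower
    y = np.array([40.2, 38.9, 37.5, 35.1, 34.0, 32.8])  # hypothetical MPG
    b1, b0 = np.polyfit(x, y, deg=1)                    # least squares slope, intercept
    print(f"yhat = {b0:.3f} + {b1:.4f} x")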
LO12-3 12.3 Regression Models

Slope and Intercept Interpretations


Figure 12.6 (previous slide) shows a sample of miles per gallon and
horsepower for 15 engines. The Excel graph and its fitted regression
equation are also shown.
Slope Interpretation: The slope of -0.0785 says that for each additional
unit of engine horsepower, the miles per gallon decreases by 0.0785 mile.
This estimated slope is a statistic because a different sample might yield a
different estimate of the slope.
Intercept Interpretation: The intercept value of 49.216 suggests that
when the engine has no horsepower, the fuel efficiency would be quite
high. However, the intercept has little meaning in this case, not only
because zero horsepower makes no logical sense, but also because
extrapolating to x = 0 is beyond the range of the observed data.
LO12-4 12.4 Ordinary Least Squares (OLS)
Formulas
LO12-4: Explain the least squares method, apply
formulas for coefficients, and interpret R2.
Slope and Intercept

The ordinary least squares method (OLS) estimates the slope and
intercept of the regression line so that the sum of squared residuals is
minimized, which ensures the best fit.
The sum of the residuals = 0.

The sum of the squared residuals is SSE.


LO12-4 12.4 Ordinary Least Squares (OLS)
Formulas
Slope and Intercept
The OLS estimator for the slope is:

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²   or, equivalently,   b1 = r (sy / sx)

The OLS estimator for the intercept is:

b0 = ȳ − b1x̄
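
A minimal Python sketch of these two formulas (hypothetical data):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 9.9])

    xbar, ybar = x.mean(), y.mean()
    b1 = ((x - xbar) * (y - ybar)).sum() / ((x - xbar)**2).sum()  # slope
    b0 = ybar - b1 * xbar                                         # intercept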
LO12-4 12.4 Ordinary Least Squares (OLS)
Formulas
Slope and Intercept

The OLS estimators b0 and b1 are unbiased and consistent* estimators of the
true parameters β0 and β1.

*Recall from Chapter 8 that an unbiased estimator's expected value is the true parameter
and that a consistent estimator approaches ever closer to the true parameter as the sample
size increases.
LO12-4 12.4 Ordinary Least Squares (OLS)
Formulas
Assessing Fit
We want to explain the total variation in Y around its mean (SST, the
total sum of squares):

SST = Σ(yi − ȳ)²

The regression sum of squares (SSR) is the explained variation in Y:

SSR = Σ(ŷi − ȳ)²
LO12-4 12.4 Ordinary Least Squares (OLS)
Formulas
Assessing Fit
The error sum of squares (SSE) is the unexplained variation in Y:

SSE = Σ(yi − ŷi)²

If the fit is good, SSE will be relatively small compared to SST.
A perfect fit is indicated by SSE = 0.
The magnitude of SSE depends on n and on the units of measurement.
LO12-4 12.4 Ordinary Least Squares (OLS)
Formulas
Coefficient of Determination
R2 is a measure of relative fit based on a comparison of SSR and SST:

R2 = SSR / SST = 1 − SSE / SST,   with 0 ≤ R2 ≤ 1

Often expressed as a percent, an R2 = 1 (i.e., 100%) indicates a perfect fit.
In simple regression, R2 = (r)2
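
A minimal Python sketch of the SST/SSR/SSE decomposition and R2
(hypothetical data):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 9.9])
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
    b0 = y.mean() - b1 * x.mean()
    yhat = b0 + b1 * x

    sst = ((y - y.mean())**2).sum()     # total variation
    ssr = ((yhat - y.mean())**2).sum()  # explained variation
    sse = ((y - yhat)**2).sum()         # unexplained variation
    r2 = ssr / sst                      # equals 1 - sse/sst; here also (r)^2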
LO12-5 12.5 Tests for Significance
LO12-5: Construct confidence intervals and test
hypotheses for the slope and intercept.
Standard Error of Regression
The standard error (s) is an overall measure of model fit:

s = √( SSE / (n − 2) )

If the fitted model's predictions are perfect (SSE = 0), then s = 0. Thus, a
small s indicates a better fit.
Used to construct confidence intervals.
The magnitude of s depends on the units of measurement of Y and on data
magnitude.
LO12-5 12.5 Tests for Significance
Confidence Intervals for Slope and Intercept
Standard error of the slope and intercept:

sb1 = s / √Σ(xi − x̄)²
sb0 = s √( 1/n + x̄² / Σ(xi − x̄)² )
LO12-5 12.5 Tests for Significance
Confidence Intervals for Slope and Intercept

Confidence intervals for the true slope and intercept (df = n − 2):

b1 − tα/2 sb1 ≤ β1 ≤ b1 + tα/2 sb1
b0 − tα/2 sb0 ≤ β0 ≤ b0 + tα/2 sb0

Note: One can use Excel, Minitab, MegaStat or other software to compute
these intervals and do hypothesis tests relating to linear regression.
LO12-5 12.5 Tests for Significance
Hypothesis Tests
Is the true slope different from zero? If β1 = 0, then X cannot influence Y
and the regression model collapses to a constant β0 plus random error.

The hypotheses (for zero slope and/or intercept) to be tested are:

H0: β1 = 0    H1: β1 ≠ 0    (and similarly for β0)

Test statistic: tcalc = b1 / sb1 (or b0 / sb0), with df = n − 2.
Reject H0 if |tcalc| > tα/2 or if the p-value ≤ α.
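
A minimal Python sketch of the slope test and its confidence interval
(hypothetical data, α = 0.05):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 9.9])
    n = len(x)
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s = np.sqrt((resid**2).sum() / (n - 2))        # standard error of the estimate
    s_b1 = s / np.sqrt(((x - x.mean())**2).sum())  # standard error of the slope
    t_calc = b1 / s_b1                             # test H0: beta1 = 0, df = n - 2
    t_crit = stats.t.ppf(0.975, df=n - 2)          # two-tailed, alpha = 0.05
    ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)  # 95% CI for the true slope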
LO12-6 12.6 Analysis of Variance: Overall Fit

LO12-6: Interpret the ANOVA table and use it to calculate F, R2, and the
standard error.
Decomposition of Variance
The decomposition of variance may be written as

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²,   i.e., SST = SSR + SSE.
LO12-6 12.6 Analysis of Variance: Overall Fit

LO12-6: Interpret the ANOVA table and use it to calculate F, R2, and
the standard error.
F Test for Overall Fit
To test a regression for overall significance, we use an F test to compare
the explained (SSR) and unexplained (SSE) sums of squares:

Fcalc = MSR / MSE = (SSR / 1) / (SSE / (n − 2)),   with df1 = 1, df2 = n − 2.
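
A minimal Python sketch of the F test (hypothetical data):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 9.9])
    n = len(x)
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
    b0 = y.mean() - b1 * x.mean()
    yhat = b0 + b1 * x
    ssr = ((yhat - y.mean())**2).sum()
    sse = ((y - yhat)**2).sum()
    f_calc = (ssr / 1) / (sse / (n - 2))    # MSR / MSE, df = (1, n - 2)
    p_value = stats.f.sf(f_calc, 1, n - 2)  # in simple regression, F = t^2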
12.7 Confidence and Prediction
LO12-7
Intervals for Y
LO12-7: Distinguish between confidence and prediction
intervals for Y.
How to Construct an Interval Estimate for Y
Confidence interval for the conditional mean of Y:

ŷ ± tα/2 · s · √( 1/n + (x − x̄)² / Σ(xi − x̄)² )

Prediction interval for an individual Y:

ŷ ± tα/2 · s · √( 1 + 1/n + (x − x̄)² / Σ(xi − x̄)² )

Prediction intervals are wider than confidence intervals because individual
Y values vary more than the mean of Y.
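
A minimal Python sketch of both intervals at a chosen x0 (hypothetical
data; note the extra "1 +" makes the prediction interval wider):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 9.9])
    n = len(x)
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(((y - (b0 + b1 * x))**2).sum() / (n - 2))
    t = stats.t.ppf(0.975, df=n - 2)
    sxx = ((x - x.mean())**2).sum()

    x0 = 3.5
    yhat0 = b0 + b1 * x0
    half_ci = t * s * np.sqrt(1/n + (x0 - x.mean())**2 / sxx)      # mean of Y
    half_pi = t * s * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / sxx)  # individual Y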
LO12-8 12.8 Residual Tests
LO12-8: Calculate residuals and perform tests of
regression assumptions.
Three Important Assumptions
1. The errors are normally distributed.
2. The errors have constant variance (i.e., they are homoscedastic).
3. The errors are independent (i.e., they are nonautocorrelated).

Violation of Assumption 1: Non-normal Errors


Non-normality of errors is a mild violation since the regression
parameter estimates b0 and b1 and their variances remain unbiased
and consistent.
Confidence intervals for the parameters may be untrustworthy
because the normality assumption is used to justify using Student's t
distribution.
12.8 Residual Tests
LO12-8

Non-normal Errors
A large sample size would compensate.
Outliers could pose serious problems.

Normal Probability Plot

The Normal Probability Plot tests the assumption
H0: Errors are normally distributed
H1: Errors are not normally distributed
If H0 is true, the residual probability plot should be linear,
as shown in the example.
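
A minimal Python sketch of a normal probability plot of residuals (the
residual values are hypothetical):

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    residuals = np.array([-0.3, 0.5, -0.1, 0.2, -0.4, 0.1, 0.0, -0.2, 0.3, -0.1])
    stats.probplot(residuals, dist="norm", plot=plt)  # roughly linear if H0 holds
    plt.show()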
12.8 Residual Tests
LO12-8

What to Do About Non-Normality?

1. Trim outliers only if they clearly are mistakes.


2. Increase the sample size if possible.
3. Try a logarithmic transformation of both X and Y.

Violation of Assumption 2: Nonconstant Variance


The ideal condition is that the error variance is constant (i.e., errors
are homoscedastic).
12.8 Residual Tests
LO12-8

Violation of Assumption 2: Nonconstant Variance

Heteroscedastic (nonconstant) errors increase or decrease with X.


In the most common form of heteroscedasticity, the variances of the
estimators are likely to be understated.
This results in overstated t statistics and artificially narrow confidence
intervals.

Tests for Heteroscedasticity

Plot the residuals against X. Ideally, there is no pattern in the
residuals moving from left to right.
12.8 Residual Tests
LO12-8

Tests for Heteroscedasticity


The “fan-out” pattern of increasing residual variance is the most common
pattern indicating heteroscedasticity.
12.8 Residual Tests
LO12-8

What to Do About Heteroscedasticity?

Transform both X and Y, for example, by taking logs.


Although it can widen the confidence intervals for the coefficients,
heteroscedasticity does not bias the estimates.

Violation of Assumption 3: Autocorrelated Errors


Autocorrelation is a pattern of non-independent errors.
In a first-order autocorrelation, et is correlated with et-1.
The estimated variances of the OLS estimators are biased, resulting in
confidence intervals that are too narrow, overstating the model’s fit.
12.8 Residual Tests
LO12-8

Runs Test for Autocorrelation


In the runs test, count the number of the residual’s sign reversals (i.e., how often
does the residual cross the zero centerline?).
If the pattern is random, the number of sign changes should be about n/2.
Fewer than n/2 sign changes would suggest positive autocorrelation.
More than n/2 sign changes would suggest negative autocorrelation.

Durbin-Watson (DW) Test

Tests for autocorrelation under the hypotheses
H0: Errors are non-autocorrelated
H1: Errors are autocorrelated
The DW statistic will range from 0 to 4:
DW < 2 suggests positive autocorrelation
DW = 2 suggests no autocorrelation (ideal)
DW > 2 suggests negative autocorrelation
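
A minimal Python sketch of both checks on a residual series (the residuals
are hypothetical); DW = Σ(et − et−1)² / Σet²:

    import numpy as np

    e = np.array([0.4, 0.3, -0.2, -0.5, -0.1, 0.2, 0.6, -0.3, 0.1, -0.2])
    sign_changes = (np.sign(e[1:]) != np.sign(e[:-1])).sum()  # compare with n/2
    dw = ((e[1:] - e[:-1])**2).sum() / (e**2).sum()           # 0..4, near 2 is ideal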
12.8 Residual Tests
LO12-8

What to Do About Autocorrelation?


Transform both variables using the method of first differences, in which both
variables are redefined as changes. Then we regress the changes in Y against
the changes in X.

Although it can widen the confidence interval for the coefficients,


autocorrelation does not bias the estimates.
LO12-9 12.9 Unusual Observations
LO12-9: Identify unusual residuals and tell when they are
outliers.
Standardized Residuals
One can use Excel, Minitab, MegaStat or other software to compute
standardized residuals.
If the absolute value of any standardized residual is at least 2, then it is classified
as unusual.
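
A minimal Python sketch of this screening rule (hypothetical residuals; one
simple version divides each residual by the standard error of the estimate,
though packages may use leverage-adjusted, i.e. studentized, versions):

    import numpy as np

    e = np.array([0.4, -1.9, 0.2, 0.3, -0.5, 2.6, -0.1, -0.8])  # hypothetical residuals
    s = np.sqrt((e**2).sum() / (len(e) - 2))  # standard error of the estimate
    standardized = e / s
    unusual = np.abs(standardized) >= 2       # classified as unusual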
LO12-10 12.9 Unusual Observations
LO12-10: Define leverage and identify high leverage
observations.
High Leverage
A high leverage statistic indicates that the observation is far from the
mean of X.
These observations are influential because they are at the "end of the
lever."
The leverage for observation i is denoted hi:

hi = 1/n + (xi − x̄)² / Σ(xj − x̄)²
12.9 Unusual Observations
LO12-10

High Leverage

A leverage that exceeds 3/n is unusual.
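
A minimal Python sketch of the leverage formula and the 3/n screening rule
(hypothetical data; the last x value sits far from the mean):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 15.0])
    n = len(x)
    h = 1/n + (x - x.mean())**2 / ((x - x.mean())**2).sum()  # leverage h_i
    high_leverage = h > 3/n                                  # flag unusual observations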


LO12-11 12.10 Other Regression Problems
(optional)
LO12-11: Improve data conditioning and use
transformations if needed (optional).
Outliers
Outliers may be caused by
- an error in recording data (to fix: delete the observation(s))
- impossible data (to fix: delete the data)
- an observation that has been influenced by an unspecified "lurking"
variable that should have been controlled but wasn't (to fix: formulate a
multiple regression model that includes the lurking variable).
12.10 Other Regression Problems
LO12-11
(optional)
Model Misspecification
If a relevant predictor has been omitted, then the model is misspecified.
Use multiple regression instead of bivariate regression.

Ill-Conditioned Data
Well-conditioned data values are of the same general order of magnitude.
Ill-conditioned data have unusually large or small data values and can
cause loss of regression accuracy or awkward estimates.
12.10 Other Regression Problems
LO12-11
(optional)
Ill-Conditioned Data
Avoid mixing magnitudes by adjusting the magnitude of your data before
running the regression.

Spurious Correlation
In a spurious correlation two variables appear related because of the way
they are defined.
This problem is called the size effect or problem of totals.
12.10 Other Regression Problems
LO12-11
(optional)

Model Form and Variable Transforms


Sometimes a nonlinear model is a better fit than a linear model.
Excel offers many model forms.
Variables may be transformed (e.g., logarithmic or exponential functions)
in order to provide a better fit.
Log transformations reduce heteroscedasticity.
Nonlinear models may be difficult to interpret.
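
A minimal Python sketch of one such transformation (hypothetical data):
fitting a log-log model, ln(y) = b0 + b1 ln(x), by least squares:

    import numpy as np

    x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
    y = np.array([2.0, 3.1, 4.9, 7.8, 12.1])
    b1, b0 = np.polyfit(np.log(x), np.log(y), deg=1)  # slope, intercept in log space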
