
INFERENCE FOR REGRESSION PART 1

Topics Outline
Review of Least Squares Regression Line
The Linear Regression Model
Confidence Intervals for the Intercept and the Slope
Testing the Hypothesis of No Linear Relationship
Inference about Prediction
Residuals
Conditions for Regression Inference
Review of Least Squares Regression Line
In simple linear regression, we consider a data set consisting of the paired observations
$(x_1, y_1), \ldots, (x_n, y_n)$. Our goal is to investigate how the two quantitative variables x and y,
corresponding to the data values $x_i$ and $y_i$, are related. We are also interested in predicting a
future response y from information about x.
The correlation coefficient r measures the direction and strength of the linear relationship between
two quantitative variables. Values of r close to $-1$ or $+1$ indicate a strong negative or positive
linear relationship, respectively.
The least-squares regression line of the response variable y on the explanatory variable x is the line
$$\hat{y} = a + bx$$
that minimizes the sum of the squares of the vertical distances of the data points $(x_i, y_i)$
from the line. The slope
$$b = r \, \frac{s_y}{s_x}$$
of the regression line, where $s_x$ and $s_y$ are the sample standard deviations of x and y, is the rate
at which the predicted response $\hat{y}$ changes along the line as the explanatory variable x changes.
Specifically, b is the change in $\hat{y}$ when x increases by 1.
The intercept of the regression line
$$a = \bar{y} - b\bar{x}$$
is the predicted response when the explanatory variable x = 0. This prediction is of no statistical
interest unless x can actually take values near 0.
The coefficient of determination $r^2$ is the square of the correlation coefficient r.
It measures the fraction of the variation in the response variable y that is explained by the least
squares regression on the explanatory variable x.
The least squares regression line can be used to predict the value of the response variable y for a
given value of the explanatory variable x by substituting this x into the equation of the line.


Example 1
Car plant electricity usage
The manager of a car plant wishes to investigate how the plant's electricity usage depends upon
the plant's production, based on the data for each month of the previous year:
| Month | x: Production ($ million) | y: Electricity usage (million kWh) |
|---|---|---|
| January | 4.51 | 2.48 |
| February | 3.58 | 2.26 |
| March | 4.31 | 2.47 |
| April | 5.06 | 2.77 |
| May | 5.64 | 2.99 |
| June | 4.99 | 3.05 |
| July | 5.29 | 3.18 |
| August | 5.83 | 3.46 |
| September | 4.70 | 3.03 |
| October | 5.61 | 3.26 |
| November | 4.90 | 2.67 |
| December | 4.20 | 2.53 |

[Figure: scatterplot "Car Plant Electricity Usage" of Electricity usage (million kWh) against Production ($ million), with the fitted line y = 0.4988x + 0.409 and $R^2$ = 0.8021.]

The scatterplot shows a positive linear relationship, with no extreme outliers or potentially
influential observations. Higher levels of production do tend to require higher levels of electricity.
The correlation coefficient $r = \sqrt{r^2} = \sqrt{0.8021} = 0.896$ is high, indicating a strong linear
relationship between Production and Electricity. The equation of the least squares regression line is
$$\hat{y} = a + bx = 0.409 + 0.499x$$
Because $r^2 = 0.8021$, about 80% of the variation in Electricity usage is explained by Production levels.
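The figures quoted above can be reproduced from the monthly data. The following is a minimal sketch in plain Python, assuming the slope and intercept formulas from the review section; the helper names (`mean`, `sample_sd`, `correlation`) are illustrative and not part of the handout.

```python
import math

# Monthly data from Example 1 (January to December).
production = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]   # $ million
electricity = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]  # million kWh


def mean(values):
    """Arithmetic mean."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Sample correlation coefficient r."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / ((len(xs) - 1) * sample_sd(xs) * sample_sd(ys))


r = correlation(production, electricity)
b = r * sample_sd(electricity) / sample_sd(production)  # slope: b = r * s_y / s_x
a = mean(electricity) - b * mean(production)            # intercept: a = y-bar - b * x-bar

print(f"r = {r:.3f}, r^2 = {r * r:.4f}")
print(f"fitted line: y-hat = {a:.3f} + {b:.3f} x")
```

The printed values should agree, up to rounding, with the r, $r^2$, a and b reported in the example.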
Is the observed relationship statistically significant?

The Linear Regression Model


Regression analysis is used primarily to predict the values of the response variable y based on the
values of the explanatory variable x. To assess the accuracy of these predictions, we need to
consider the mathematical model for linear regression.
Figure 1 provides a summary of the estimation process for simple linear regression.
The mathematical model for linear regression analysis assumes that the observed data points
$(x_1, y_1), \ldots, (x_n, y_n)$ constitute a random sample from a population. We suppose that in the
population there is an underlying linear relationship between the explanatory variable x and the
response variable y:
$$y = \alpha + \beta x + \varepsilon$$
where $\varepsilon$ is a random variable referred to as the error (or residual) term. The error term accounts
for the variability in y that cannot be explained by the linear relationship between x and y.
The random variable $\varepsilon$ is assumed to have a mean of zero and standard deviation $\sigma$.
A consequence of this assumption is that the mean of y is equal to:
$$\mu_y = \alpha + \beta x$$
This is the equation of the true regression line.


The unknown parameters $\alpha$ (true intercept) and $\beta$ (true slope), which determine the relationship
between x and y, can be estimated from the data set $(x_1, y_1), \ldots, (x_n, y_n)$.
It can be shown that the estimators a and b from the least squares method are the
best linear unbiased estimators of $\alpha$ and $\beta$ (whatever that means!).
The estimation of $\alpha$ and $\beta$ is a statistical process much like the estimation of the population
mean $\mu$ using the sample statistic $\bar{x}$. In regression, $\alpha$ and $\beta$ are two unknown parameters
of interest, and the coefficients a and b obtained from the least squares line are the sample statistics
used to estimate these parameters.
The third unknown parameter, the standard deviation $\sigma$ of the error $\varepsilon$, can also be estimated
from the data set. Recall that the residuals (errors) are the vertical deviations of the data points
from the least-squares line:
residual = (observed y) − (predicted y) = $y - \hat{y}$
There are n residuals, one for each data point, and their mean is 0.
The estimate of $\sigma$ is given by the sample standard deviation of the residuals
$$s = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(\text{residual}_i - 0)^2} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
and is referred to as the regression standard error (or standard error of estimate).
The regression standard error for our example is s = 0.173. (See Excel output on the last page.)
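As a check on this value, here is a short sketch that computes the residuals and the regression standard error directly from the data. It takes the coefficients a and b from the Excel output as given; the variable names are illustrative.

```python
import math

# Monthly data from Example 1 (January to December).
production = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
electricity = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

# Least-squares coefficients as reported in the Excel output.
a, b = 0.409048, 0.498830

# residual = (observed y) - (predicted y)
residuals = [y - (a + b * x) for x, y in zip(production, electricity)]
n = len(residuals)

mean_residual = sum(residuals) / n                     # should be (almost exactly) 0
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # divisor n - 2, not n - 1

print(f"mean residual = {mean_residual:.6f}")
print(f"regression standard error s = {s:.3f}")
```

The divisor is n − 2 rather than n − 1 because two parameters (intercept and slope) were estimated before the residuals were formed.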

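[Figure 1 (diagram). Regression model: $y = \alpha + \beta x + \varepsilon$, with true regression line $\mu_y = \alpha + \beta x$ and regression parameters $\alpha$, $\beta$, $\sigma$ (the standard deviation of $\varepsilon$). Sample data: $(x_1, y_1), \ldots, (x_n, y_n)$. From the sample data, compute the sample statistics a, b, s. The values of a, b, s provide estimates of $\alpha$, $\beta$, $\sigma$ and the estimated regression line $\hat{y} = a + bx$.]

Figure 1 The estimation process in simple linear regression.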


Confidence Intervals for the Intercept and the Slope

If we repeated the experiment many times with the same $x_i$'s, we would get different $y_i$'s each time,
due to random errors. Therefore, we would also get different values for the least squares
estimators a and b of the population parameters $\alpha$ and $\beta$. Indeed, a and b are sample statistics
that have their own sampling distributions.
Let $SE_a$ and $SE_b$ be estimates of the standard errors (i.e. standard deviations) of a and b,
respectively. It can be shown that the level C confidence intervals for the intercept $\alpha$ and the
slope $\beta$ are given by the following confidence limits:
$\alpha$: $a \pm t^* SE_a$
$\beta$: $b \pm t^* SE_b$
Here $t^*$ is the critical value for the t(n − 2) density curve with area C between $-t^*$ and $t^*$.
Note: All t procedures in simple linear regression have n − 2 degrees of freedom.
Example 1 (Continued)
For our example (see Excel output),
a = 0.4090, $SE_a$ = 0.3860
b = 0.4988, $SE_b$ = 0.0784
There are 12 data points, so the degrees of freedom are n − 2 = 10.
For 95% confidence and df = 10, the t-Table gives t* = 2.228.
95% CI for $\alpha$:
$$a \pm t^* SE_a = 0.4090 \pm (2.228)(0.3860) = 0.4090 \pm 0.86 = -0.45 \text{ to } 1.27$$
Hence, the true value of the intercept $\alpha$ lies in the interval from −0.45 to 1.27,
and this statement is made with 95% confidence.
Note: Inferences for the population intercept $\alpha$ are rarely of practical importance.
95% CI for $\beta$:
$$b \pm t^* SE_b = 0.4988 \pm (2.228)(0.0784) = 0.4988 \pm 0.1747 = 0.32 \text{ to } 0.67$$
Thus the management of the car plant can be 95% confident that within the range of the data set,
the mean electricity usage increases by somewhere between a third of a million kilowatt-hours
and two thirds of a million kilowatt-hours for every additional $1 million of production.
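The two intervals can be recomputed from the raw data. The sketch below is one way to do it in plain Python. The handout takes $SE_a$ and $SE_b$ from the Excel output; the formulas used for them here, $SE_b = s/\sqrt{\sum (x_i - \bar{x})^2}$ and $SE_a = s\sqrt{1/n + \bar{x}^2/\sum (x_i - \bar{x})^2}$, are the standard textbook expressions and are an addition, not something stated in the text. The critical value t* = 2.228 is hard-coded from the t-Table, and the helper `interval` is hypothetical.

```python
import math

# Monthly data from Example 1 (January to December).
production = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
electricity = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(production)
x_bar = sum(production) / n
y_bar = sum(electricity) / n
sxx = sum((x - x_bar) ** 2 for x in production)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(production, electricity))

b = sxy / sxx            # algebraically the same as b = r * s_y / s_x
a = y_bar - b * x_bar
s = math.sqrt(sum((y - (a + b * x)) ** 2
                  for x, y in zip(production, electricity)) / (n - 2))

# Standard errors of the coefficients (textbook formulas, assumed here).
se_b = s / math.sqrt(sxx)
se_a = s * math.sqrt(1 / n + x_bar ** 2 / sxx)

t_star = 2.228  # t-Table value for 95% confidence and df = n - 2 = 10


def interval(estimate, standard_error, t_crit):
    """Return (lower, upper) limits: estimate -/+ t_crit * standard_error."""
    margin = t_crit * standard_error
    return estimate - margin, estimate + margin


ci_intercept = interval(a, se_a, t_star)
ci_slope = interval(b, se_b, t_star)

print(f"SE_a = {se_a:.4f}, SE_b = {se_b:.4f}")
print(f"95% CI for intercept: {ci_intercept[0]:.2f} to {ci_intercept[1]:.2f}")
print(f"95% CI for slope:     {ci_slope[0]:.2f} to {ci_slope[1]:.2f}")
```

The limits should match the Lower 95% and Upper 95% columns of the Excel output to about three decimal places; small differences come from rounding t*.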


Testing the Hypothesis of No Linear Relationship


One of the first things we want to do upon obtaining the sample regression equation
$$\hat{y} = a + bx$$
is to test its slope b. If there is no (linear) relationship between the variables x and y,
then the slope of the regression equation would be expected to be zero.
If b = 0, then $\hat{y} = a$ and thus x is useless as a predictor of y.
Recall that $\beta$ is unknown and represents the slope of the true unknown regression line
$$\mu_y = \alpha + \beta x$$
while b is the estimate of the slope obtained by fitting a line to the data set.
Hence, we can determine the existence of a statistically significant relationship between the x and y
variables by testing whether $\beta$ (the true slope) is equal to 0.
The null and alternative hypotheses are stated as follows:
$H_0: \beta = 0$ (There is no linear relationship.)
$H_a: \beta \neq 0$ (There is a linear relationship.)
If the null hypothesis is rejected, we would conclude that there is evidence of a linear relationship.
It can be shown that the test statistic is
$$t = \frac{b}{SE_b}$$
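Example 1 (Continued) To test the hypothesis
$H_0: \beta = 0$
$H_a: \beta \neq 0$
we calculate the test statistic (see also Excel output):
$$t = \frac{b}{SE_b} = \frac{0.4988}{0.0784} \approx 6.37$$
(Excel, working with unrounded values, reports t = 6.3666.)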

The t-Table shows that the two-sided P-value for the t distribution with 10 degrees of freedom is
smaller than 0.001. (Excel gives P-value = 0.000082.)
We reject $H_0$ and conclude that the slope of the population regression line is not 0.
In other words, the data provide very strong evidence that the distribution of
electricity usage does depend upon the level of production.
An alternative to testing the existence of a linear relationship between the x and y variables is to set
up a confidence interval for $\beta$ and to determine whether the hypothesized value ($\beta = 0$) is
included in the interval. The 95% confidence interval for $\beta$ is 0.32 to 0.67.
Because this interval does not contain 0, we conclude (at the 5% significance level) that there is a
significant linear relationship between x and y. Had the interval included 0, we could not have
concluded that a (linear) relationship exists between the variables.
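The test statistic and its two-sided P-value can be reproduced without statistical software. The sketch below uses only the Python standard library: it takes b and $SE_b$ from the Excel output and approximates the P-value by integrating the t density numerically with Simpson's rule. The function names are illustrative, and the numerical integration is a stand-in for the t-Table or Excel, not the method used in the handout.

```python
import math

# Slope, its standard error, and degrees of freedom (n - 2) from the Excel output.
b, se_b, df = 0.498830, 0.078352, 10

t_stat = b / se_b  # test statistic for H0: beta = 0


def t_density(t, dof):
    """Density of Student's t distribution with dof degrees of freedom."""
    const = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return const * (1 + t * t / dof) ** (-(dof + 1) / 2)


def two_sided_p_value(t, dof, steps=20000):
    """P(|T| >= |t|): one minus the area between -|t| and |t| (Simpson's rule)."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, dof) + t_density(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, dof)
    central_area = 2 * (h / 3) * total  # density is symmetric about 0
    return 1 - central_area


p_value = two_sided_p_value(t_stat, df)
reject_at_5_percent = t_stat > 2.228  # compare with t* for df = 10 from the t-Table

print(f"t = {t_stat:.4f}")
print(f"two-sided P-value = {p_value:.6f}")
print("reject H0 at the 5% level" if reject_at_5_percent else "do not reject H0")
```

Comparing |t| with the critical value t* = 2.228 gives the same decision as checking whether the 95% confidence interval for the slope contains 0.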

Inference about Prediction


There are several reasons for building a linear regression. One, of course, is to predict response
values (y's) at one or more values of the explanatory variable x.
Example 1 (Continued)
If the monthly production is x* = $5 million, then the plant manager can predict that the
electricity usage will be
$$\hat{y}^* = 0.409 + (0.4988)(5) = 2.903 \text{ million kWh}$$
How accurate is this prediction likely to be?
Can we supply this prediction with a margin of error?
Given a specified value of the explanatory variable x*, which is not necessarily one of the values
$x_1, \ldots, x_n$, we can construct two fundamentally different kinds of intervals.
1. Confidence interval for the expected (mean) response $E(y^*) = \mu_{y^*} = \alpha + \beta x^*$:
$$\hat{y}^* \pm t^* SE_{mean}, \qquad SE_{mean} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
This confidence interval expresses our uncertainty about the regression line.
If we knew $\alpha$ and $\beta$, then we would know the regression line exactly and our confidence
interval would be one point.
2. Prediction interval for an individual (future) response y*:
$$\hat{y}^* \pm t^* SE_{ind}, \qquad SE_{ind} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
This prediction interval expresses our uncertainty about the regression line and the fact that
there are errors in the data. If we knew $\alpha$ and $\beta$, we would know the regression line exactly,
but the length of our prediction interval would not shrink to zero, since the error term $\varepsilon^*$ in
$y^* = \alpha + \beta x^* + \varepsilon^*$ always has a fixed variance $\sigma^2$.

In both intervals, $t^*$ is the critical value for the t(n − 2) density curve with area C between $-t^*$ and $t^*$,
and
$$s = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
is the regression standard error.
Both intervals are centered at $\hat{y}^*$ and have the usual form
point estimate ± (critical value)(standard error)
$$\hat{y}^* \pm t^* SE$$
However, the prediction interval is wider than the confidence interval because it is harder to
predict one individual response than to predict a mean response.
Individuals are always more variable than averages!
Excel's Regression tool does not have an option for computing confidence and prediction intervals.
These intervals can be computed using formulas along with the output of the Regression tool.
Example 1 (Continued)
For our example, $\hat{y}^*$ = 2.903, t* = 2.228, $SE_{mean}$ = 0.0507 and $SE_{ind}$ = 0.1802.
The 95% confidence interval for the mean response $\mu_{y^*} = \alpha + \beta x^*$ at the value x* = 5 is
$$\hat{y}^* \pm t^* SE_{mean} = 2.903 \pm (2.228)(0.0507) = 2.903 \pm 0.113$$
or 2.79 to 3.02.
This interval implies that with a monthly production of $5 million, the mean electricity usage is
between about 2.8 and 3 million kWh.
A 95% prediction interval for a future response at the value x* = 5 is
$$\hat{y}^* \pm t^* SE_{ind} = 2.903 \pm (2.228)(0.1802) = 2.903 \pm 0.401$$
or 2.50 to 3.30.
This prediction interval indicates that if next month's production target is $5 million,
then with 95% confidence next month's electricity usage will be somewhere between 2.5 and 3.3
million kWh.
Thus, while we are 95% confident that the expected or average electricity usage in a month with
$5 million of production lies somewhere between 2.8 and 3.0 million kWh, the electricity usage in a
particular month with $5 million of production is predicted, with the same confidence, to be
somewhere between 2.5 and 3.3 million kWh.
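Since the intervals have to be computed by formula, a small script can stand in for the spreadsheet work. The following sketch evaluates both standard errors and both 95% intervals at x* = 5 using the formulas for $SE_{mean}$ and $SE_{ind}$ given above. The function name `intervals_at` is hypothetical, and t* = 2.228 is again hard-coded from the t-Table.

```python
import math

# Monthly data from Example 1 (January to December).
production = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
electricity = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(production)
x_bar = sum(production) / n
y_bar = sum(electricity) / n
sxx = sum((x - x_bar) ** 2 for x in production)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(production, electricity))

b = sxy / sxx
a = y_bar - b * x_bar
s = math.sqrt(sum((y - (a + b * x)) ** 2
                  for x, y in zip(production, electricity)) / (n - 2))


def intervals_at(x_star, t_crit):
    """Return (y_hat, se_mean, se_ind, confidence interval, prediction interval)."""
    y_hat = a + b * x_star
    leverage = 1 / n + (x_star - x_bar) ** 2 / sxx
    se_mean = s * math.sqrt(leverage)        # uncertainty about the line only
    se_ind = s * math.sqrt(1 + leverage)     # line uncertainty plus one new error term
    ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
    pi = (y_hat - t_crit * se_ind, y_hat + t_crit * se_ind)
    return y_hat, se_mean, se_ind, ci, pi


y_hat, se_mean, se_ind, ci, pi = intervals_at(x_star=5.0, t_crit=2.228)

print(f"y-hat* = {y_hat:.3f} million kWh")
print(f"SE_mean = {se_mean:.4f}, SE_ind = {se_ind:.4f}")
print(f"95% CI for mean response:   {ci[0]:.2f} to {ci[1]:.2f}")
print(f"95% PI for single response: {pi[0]:.2f} to {pi[1]:.2f}")
```

Because $SE_{ind}^2 = s^2 + SE_{mean}^2$, the prediction interval is always the wider of the two, and both widen as x* moves away from $\bar{x}$.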


Residuals
The residuals $(y - \hat{y})$ give useful information about the contribution of individual data points to
the overall pattern of scatter. Residual values show how much the observed values differ from
the fitted values. If a particular residual is positive, the corresponding data point is above the
line; if it is negative, the point is below the line. The only time a residual is zero is when the
point lies directly on the line.
Example 1 (Continued)
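There are twelve residuals:

| Observation | Residual |
|---|---|
| 1 | −0.18 |
| 2 | 0.07 |
| 3 | −0.09 |
| 4 | −0.16 |
| 5 | −0.23 |
| 6 | 0.15 |
| 7 | 0.13 |
| 8 | 0.14 |
| 9 | 0.28 |
| 10 | 0.05 |
| 11 | −0.18 |
| 12 | 0.03 |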

We can construct a residual plot by plotting the residuals against the explanatory variable x or the
predicted (also called fitted) values $\hat{y}$. In a residual plot, the residual = 0 line represents the
position of the least-squares line in the scatterplot of y against x. (See Excel output.)
Residual plots are the primary tool for determining whether the assumed regression model is
appropriate.
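The numbers behind a residual plot are easy to tabulate. The sketch below lists, for each month of Example 1, the fitted value, the residual, and whether the data point lies above or below the fitted line. It assumes the coefficients from the Excel output; the column layout and names are illustrative only.

```python
# Monthly data from Example 1.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
production = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
electricity = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

# Least-squares coefficients as reported in the Excel output.
a, b = 0.409048, 0.498830

rows = []
for month, x, y in zip(months, production, electricity):
    fitted = a + b * x
    residual = y - fitted  # positive: point above the line; negative: below
    position = "above" if residual > 0 else "below" if residual < 0 else "on"
    rows.append((month, x, fitted, residual, position))

print("month      x  fitted  residual")
for month, x, fitted, residual, position in rows:
    print(f"{month:<5}{x:7.2f}{fitted:8.3f}{residual:10.2f}  ({position} the line)")
```

Plotting the residual column against x (or against the fitted column) gives the residual plot described above.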
Conditions for Regression Inference
An important step in determining whether the assumed linear regression model $y = \alpha + \beta x + \varepsilon$
is appropriate involves testing for the significance of the relationship between the explanatory and
response variables. The tests of significance in regression analysis are based on four assumptions
about the error term $\varepsilon$.
Figure 2 illustrates the regression model assumptions and their implications. Note that in this
graphical interpretation, the mean response $\mu_y$ moves along a straight line as the explanatory
variable x changes. The normal curves show how the observed response y will vary when x is held
fixed at different values. All of the curves have the same standard deviation $\sigma$, so the variability
of y is the same for all values of x.

Figure 2 Assumptions for the linear regression model.



Here are the four conditions for regression inference, their implications and how to check if the
conditions are satisfied.
1. Linearity
Condition: The error term $\varepsilon$ is a random variable with mean 0.
Implication
Because $\alpha$ and $\beta$ are constants, the mean of y is
$$\mu_y = \alpha + \beta x$$
implying a linear relationship between x and y.
How to check
Look for curved patterns or other departures from a straight-line overall pattern in the residual plot.
(You can also use the original scatterplot, but the residual plot magnifies any effects.)
Example 1
The scatterplot and the residual plot both show a linear relationship.
2. Independence
Condition: The values of $\varepsilon$ are statistically independent.
Implication
The value of $\varepsilon$ for a particular value of x is not related to the value of $\varepsilon$ for any other value of x.
Thus, the value of y for a particular value of x is not related to the value of y for any other value of x.
How to check
Signs of dependence in the residual plot are a bit subtle. In general, if the residual plot displays
a random pattern with no apparent trends, cycles, alternations, or clumping, it is reasonable to
conclude that the independence assumption holds.
Example 1
The residual plot shows a random variation around the residual = 0 line.
3. Normality
Condition: The error term $\varepsilon$ is a normally distributed random variable
(with mean 0 and standard deviation $\sigma$).
Implication
Because y is a linear function of $\varepsilon$, y is also a normally distributed random variable
(with mean $\mu_y = \alpha + \beta x$ and standard deviation $\sigma$).
How to check
Check for clear skewness or other major departures from normality in the histogram of the residuals.
Or, check if the points in the normal probability plot (Q-Q plot) are far from a 45° line.
Example 1
The histogram of the residuals does not show any important deviations from normality.


4. Equal spread
Condition: The standard deviation $\sigma$ of $\varepsilon$ is the same for all values of x.
Implication
The standard deviation of y about the regression line equals $\sigma$ and is the same for all
values of x.

How to check
Look at the scatter of the residuals above and below the residual = 0 line in the
residual plot. The scatter should be roughly the same from one end to the other.
Example 1
The residual plot shows no unusual variation in the scatter of the residuals above and
below the line as x varies.
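For the normality check in condition 3, the coordinates of a normal probability (Q-Q) plot can be computed with the standard library alone. This is a rough sketch under two assumptions that are not in the handout: the plotting positions (i − 0.5)/n, which are one common convention among several, and the division of each residual by s so that the points can be compared with a 45° line.

```python
from statistics import NormalDist

# Monthly data from Example 1 (January to December).
production = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
electricity = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

# Coefficients and regression standard error from the Excel output.
a, b, s = 0.409048, 0.498830, 0.172948

# Residuals divided by s (assumed scaling), sorted from smallest to largest.
scaled_residuals = sorted((y - (a + b * x)) / s for x, y in zip(production, electricity))
n = len(scaled_residuals)

# Theoretical standard normal quantiles at plotting positions (i - 0.5) / n (assumed convention).
standard_normal = NormalDist()  # mean 0, standard deviation 1
theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

qq_pairs = list(zip(theoretical, scaled_residuals))

print("normal quantile  scaled residual")
for quantile, residual in qq_pairs:
    print(f"{quantile:15.3f}  {residual:15.3f}")
```

If the normality condition is reasonable, the pairs should fall roughly along the line where the two columns are equal. With only 12 observations, moderate wobble around that line is to be expected.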
Example 2
The following figure shows some general patterns that might be observed in any residual plot.

[Figure: three example residual plots.]
Good pattern: residuals are randomly scattered.
Curved pattern: the relationship is not linear.
Change in variability: $\sigma$ is not equal for all values of x.


Excel Output for Car Plant Electricity Usage


Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.895606 |
| R Square | 0.802109 |
| Adjusted R Square | 0.782320 |
| Standard Error | 0.172948 |
| Observations | 12 |
|  | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 0.409048 | 0.385991 | 1.059736 | 0.314190 | -0.450992 | 1.269089 |
| Production | 0.498830 | 0.078352 | 6.366551 | 0.000082 | 0.324252 | 0.673409 |

[Figure: "Residuals versus Production", residuals (vertical axis, −0.30 to 0.30) plotted against Production (horizontal axis, 3.5 to 5.5).]

[Figure: "Histogram of the Residuals", Frequency (0 to 3) against Residual (−0.2 to 0.3).]