Inference for Regression (Part 1)

Topics Outline

Review of Least Squares Regression Line

The Linear Regression Model

Confidence Intervals for the Intercept and the Slope

Testing the Hypothesis of No Linear Relationship

Inference about Prediction

Residuals

Conditions for Regression Inference

Review of Least Squares Regression Line

In simple linear regression, we consider a data set consisting of the paired observations (x₁, y₁), …, (xₙ, yₙ). Our goal is to investigate how the two quantitative variables x and y, corresponding to the data values xᵢ and yᵢ, are related. We are also interested in predicting a future response y from information about x.

The correlation coefficient r measures the direction and strength of the linear relationship between two quantitative variables. Values of r close to −1 or +1 indicate a strong negative or positive linear relationship.

The least-squares regression line of the response variable y on the explanatory variable x is the line

ŷ = a + bx

that minimizes the sum of the squares of the vertical distances of the data points (xᵢ, yᵢ) from the line. The slope

b = r (s_y / s_x)

(where s_x and s_y are the sample standard deviations of x and y) is the rate at which the predicted response ŷ changes along the line as the explanatory variable x changes. Specifically, b is the change in ŷ when x increases by 1.

The intercept of the regression line

a = ȳ − b x̄

is the predicted response when the explanatory variable x = 0. This prediction is of no statistical interest unless x can actually take values near 0.

The coefficient of determination r² is the square of the correlation coefficient r. It measures the fraction of the variation in the response variable y that is explained by the least-squares regression on the explanatory variable x.

The least-squares regression line can be used to predict the value of the response variable y for a given value of the explanatory variable x by substituting this x into the equation of the line.
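These formulas translate directly into code. The following is a minimal Python sketch (the function name least_squares_line is just for illustration, not from the text) of the slope and intercept computations using NumPy:

```python
import numpy as np

def least_squares_line(x, y):
    """Fit yhat = a + b*x by least squares, using the formulas
    b = r * (s_y / s_x) and a = ybar - b * xbar from the text."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]               # correlation coefficient
    b = r * y.std(ddof=1) / x.std(ddof=1)     # slope (ddof=1: sample st. dev.)
    a = y.mean() - b * x.mean()               # intercept
    return a, b, r
```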


Example 1

Car plant electricity usage

The manager of a car plant wishes to investigate how the plant's electricity usage depends upon the plant's production, based on data for each month of the previous year:

Month        Production, x   Electricity usage, y
             ($ million)     (million kWh)
January          4.51            2.48
February         3.58            2.26
March            4.31            2.47
April            5.06            2.77
May              5.64            2.99
June             4.99            3.05
July             5.29            3.18
August           5.83            3.46
September        4.70            3.03
October          5.61            3.26
November         4.90            2.67
December         4.20            2.53

[Scatterplot of Electricity usage (million kWh, 2 to 3.5) against Production ($ million, 3.5 to 5.5), with the fitted trendline y = 0.4988x + 0.409 and R² = 0.8021.]

The scatterplot shows a positive linear relationship, with no extreme outliers or potentially

influential observations. Higher levels of production do tend to require higher levels of electricity.

The correlation coefficient r = √r² = √0.8021 = 0.896 is high, indicating a strong linear relationship between Production and Electricity usage. The equation of the least-squares regression line is

ŷ = a + bx = 0.409 + 0.499x

Because r² = 0.8021, about 80% of the variation in Electricity usage is explained by Production levels.
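As a check, these values can be reproduced from the monthly data. A short sketch using NumPy (np.polyfit is one of several ways to obtain the least-squares coefficients):

```python
import numpy as np

# Monthly Production ($ million) and Electricity usage (million kWh)
x = np.array([4.51, 3.58, 4.31, 5.06, 5.64, 4.99,
              5.29, 5.83, 4.70, 5.61, 4.90, 4.20])
y = np.array([2.48, 2.26, 2.47, 2.77, 2.99, 3.05,
              3.18, 3.46, 3.03, 3.26, 2.67, 2.53])

b, a = np.polyfit(x, y, 1)        # slope and intercept of the fitted line
r = np.corrcoef(x, y)[0, 1]
print(f"yhat = {a:.3f} + {b:.4f}x, r^2 = {r**2:.4f}")
# Expected, matching the text: yhat = 0.409 + 0.4988x, r^2 = 0.8021
```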

Is the observed relationship statistically significant?


The Linear Regression Model

Regression analysis is used primarily to predict the values of the response variable y based on the values of the explanatory variable x. To assess the accuracy of these predictions, we need to consider the mathematical model for linear regression.

Figure 1 provides a summary of the estimation process for simple linear regression.

The mathematical model for linear regression analysis assumes that the observed data points (x₁, y₁), …, (xₙ, yₙ) constitute a random sample from a population. We suppose that in the population there is an underlying linear relationship between the explanatory variable x and the response variable y:

y = α + βx + ε

where ε is a random variable referred to as the error (or residual) term. The error term accounts for the variability in y that cannot be explained by the linear relationship between x and y. The random variable ε is assumed to have a mean of zero and standard deviation σ. A consequence of this assumption is that the mean of y is equal to

μ_y = α + βx

The unknown parameters α (the true intercept) and β (the true slope), which determine the relationship between x and y, can be estimated from the data set (x₁, y₁), …, (xₙ, yₙ). It can be shown that the estimators a and b from the least squares method are the best linear unbiased estimators of α and β (whatever that means!).

The estimation of α and β is a statistical process much like the estimation of μ using the sample statistic x̄. In regression, α and β are two unknown parameters of interest, and the coefficients a and b obtained from the least-squares line are the sample statistics used to estimate these parameters.

The third unknown parameter, the standard deviation σ of the error ε, can also be estimated from the data set. Recall that the residuals (errors) are the vertical deviations of the data points from the least-squares line:

residual = (observed y) − (predicted y) = y − ŷ

There are n residuals, one for each data point, and their mean is 0.

The estimate of σ is given by the sample standard deviation of the residuals,

s = √[ Σ(residualᵢ − 0)² / (n − 2) ] = √[ Σ(yᵢ − ŷᵢ)² / (n − 2) ]

and is referred to as the regression standard error (or standard error of estimate).

The regression standard error for our example is s = 0.173. (See the Excel output on the last page.)
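In code, s falls out of the residuals in two lines; a minimal sketch (the function name is illustrative), to be used with the x, y arrays and coefficients a, b from the earlier sketch:

```python
import numpy as np

def regression_standard_error(x, y, a, b):
    """s = sqrt( sum (y_i - yhat_i)^2 / (n - 2) )."""
    resid = y - (a + b * x)                    # residuals
    return np.sqrt(np.sum(resid**2) / (len(y) - 2))

# With the car plant data and coefficients above: s is approx. 0.173
```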


[Figure 1. The estimation process for simple linear regression. The regression model y = α + βx + ε has unknown parameters α, β, and σ. From the sample data (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) we compute the sample statistics a, b, and s, which provide estimates of α, β, and σ and give the estimated regression line ŷ = a + bx.]


Confidence Intervals for the Intercept and the Slope

If we did the experiment many times with the same xᵢ's, we would get different yᵢ's each time, due to random errors. Therefore, we would also get different values for the least-squares estimators a and b of the population parameters α and β. Indeed, a and b are sample statistics that have their own sampling distributions.

Let SEa and SEb be estimates of the standard errors (i.e., standard deviations) of a and b, respectively. It can be shown that the level C confidence intervals for the intercept α and the slope β are given by the following confidence limits:

α:  a ± t* SEa        β:  b ± t* SEb

Here t* is the critical value for the t(n − 2) density curve with area C between −t* and t*.

Note: All t procedures in simple linear regression have n − 2 degrees of freedom.

Example 1 (Continued)

For our example (see the Excel output),

a = 0.4090    b = 0.4988    SEa = 0.3860    SEb = 0.0784

For 95% confidence and df = n − 2 = 10, the t-table gives t* = 2.228.

95% CI for α:   a ± t* SEa = 0.4090 ± (2.228)(0.3860) = 0.4090 ± 0.8600,   or   −0.45 to 1.27

Thus α lies in the interval from −0.45 to 1.27, and this statement is made with 95% confidence.

Note: Inferences for the population intercept α are of little practical interest here, because a production level of x = 0 lies well outside the range of the data.

95% CI for β:   b ± t* SEb = 0.4988 ± (2.228)(0.0784) = 0.4988 ± 0.1747,   or   0.32 to 0.67

Thus the management of the car plant can be 95% confident that, within the range of the data set, the mean electricity usage increases by somewhere between a third of a million kilowatt-hours and two thirds of a million kilowatt-hours for every additional $1 million of production.
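These intervals are easy to reproduce with SciPy's t distribution (scipy.stats.t.ppf returns the critical value t*); a sketch using the standard errors quoted above:

```python
from scipy import stats

a, SEa = 0.4090, 0.3860              # intercept estimate and its standard error
b, SEb = 0.4988, 0.0784              # slope estimate and its standard error
t_star = stats.t.ppf(0.975, df=10)   # approx. 2.228 for 95% confidence, df = n - 2

print(f"95% CI for alpha: {a - t_star*SEa:.2f} to {a + t_star*SEa:.2f}")  # -0.45 to 1.27
print(f"95% CI for beta:  {b - t_star*SEb:.2f} to {b + t_star*SEb:.2f}")  #  0.32 to 0.67
```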


Testing the Hypothesis of No Linear Relationship

One of the first things we want to do upon obtaining the sample regression equation

ŷ = a + bx

is to test its slope b. If there is no (linear) relationship between the variables x and y, then the slope of the regression equation would be expected to be zero. If b = 0, then ŷ = a and thus x is useless as a predictor of y.

Recall that β is unknown and represents the slope of the true unknown regression line

μ_y = α + βx

while b is the estimate of the slope obtained by fitting a line to the data set. Hence, we can determine the existence of a statistically significant relationship between the variables x and y by testing whether β (the true slope) is equal to 0. The null and alternative hypotheses are stated as follows:

H₀: β = 0  (There is no linear relationship.)
Hₐ: β ≠ 0  (There is a linear relationship.)

If the null hypothesis is rejected, we would conclude that there is evidence of a linear relationship. It can be shown that the test statistic is

t = b / SEb

which follows the t(n − 2) distribution when H₀ is true.

Example 1 (Continued)

To test the hypotheses

H₀: β = 0
Hₐ: β ≠ 0

we compute

t = b / SEb = 0.4988 / 0.0784 = 6.37

The t-table shows that the two-sided P-value for the t distribution with 10 degrees of freedom is smaller than 0.001. (Excel gives P-value = 0.000082.) We reject H₀ and conclude that the slope of the population regression line is not 0. In other words, the data provide very strong evidence that electricity usage does depend upon the level of production.
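The same test in Python (scipy.stats.t.sf is the upper-tail probability, doubled for a two-sided P-value):

```python
from scipy import stats

b, SEb, df = 0.4988, 0.0784, 10
t = b / SEb                          # test statistic, approx. 6.37
p = 2 * stats.t.sf(abs(t), df)       # two-sided P-value, approx. 0.00008
print(f"t = {t:.2f}, P-value = {p:.6f}")
```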

An alternative to testing for the existence of a linear relationship between the variables x and y is to set up a confidence interval for β and to determine whether the hypothesized value (β = 0) is included in the interval. The 95% confidence interval for β is 0.32 to 0.67. Because this interval does not contain 0, we conclude that there is a significant linear relationship between x and y. Had the interval included 0, the conclusion would have been that no (linear) relationship exists between the variables.


Inference about Prediction

There are several reasons for building a linear regression model. One, of course, is to predict response values (y's) at one or more values of the explanatory variable x.

Example 1 (Continued)

If the monthly production is x* = $5 million, then the plant manager can predict that the electricity usage will be

ŷ* = 0.409 + 0.4988(5) = 2.903 million kWh

Can we supply this prediction with a margin of error?

Given a specified value of the explanatory variable x*, which is not necessarily one of the values x₁, …, xₙ, we can construct two fundamentally different kinds of intervals.

1. Confidence interval for the expected (mean) response E(y*) = α + βx*:

ŷ* ± t* SEmean,   where   SEmean = s √[ 1/n + (x* − x̄)² / Σ(xᵢ − x̄)² ]

This confidence interval expresses our uncertainty about the regression line. If we knew α and β, then we would know the regression line exactly, and our confidence interval would shrink to a single point.

2. Prediction interval for an individual (future) response y*:

ŷ* ± t* SEind,   where   SEind = s √[ 1 + 1/n + (x* − x̄)² / Σ(xᵢ − x̄)² ]

This prediction interval expresses our uncertainty about the regression line and the fact that there are errors in the data. If we knew α and β, we would know the regression line exactly, but the length of our prediction interval would not shrink to zero, since the error term ε* in

y* = α + βx* + ε*

always has a fixed variance σ².

In both intervals, t* is the critical value for the t(n − 2) density curve with area C between −t* and t*, and

s = √[ Σ(yᵢ − ŷᵢ)² / (n − 2) ]

Both intervals are centered at ŷ* and have the usual form

point estimate ± (critical value)(standard error),   i.e.,   ŷ* ± t* · SE

However, the prediction interval is wider than the confidence interval because it is harder to predict one individual response than to predict a mean response. Individuals are always more variable than averages!

Excel's Regression tool does not have an option for computing confidence and prediction intervals. These intervals can be computed using the formulas above along with the output of the Regression tool.
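As a sketch of such a computation (the helper name intervals_at is illustrative; x and y are the Example 1 arrays from the earlier sketch):

```python
import numpy as np
from scipy import stats

def intervals_at(x_star, x, y, conf=0.95):
    """Confidence interval for the mean response and prediction interval
    for an individual response at x_star, using the formulas above."""
    n = len(x)
    b, a = np.polyfit(x, y, 1)
    s = np.sqrt(np.sum((y - (a + b * x))**2) / (n - 2))   # regression std. error
    t_star = stats.t.ppf((1 + conf) / 2, df=n - 2)
    lev = 1/n + (x_star - x.mean())**2 / np.sum((x - x.mean())**2)
    se_mean = s * np.sqrt(lev)           # SE for the mean response
    se_ind = s * np.sqrt(1 + lev)        # SE for an individual response
    y_hat = a + b * x_star
    return ((y_hat - t_star * se_mean, y_hat + t_star * se_mean),
            (y_hat - t_star * se_ind,  y_hat + t_star * se_ind))

# intervals_at(5.0, x, y) gives roughly (2.79, 3.02) and (2.50, 3.30).
```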

Example 1 (Continued)

For our example, ŷ* = 2.903, t* = 2.228, SEmean = 0.0507, and SEind = 0.1802.

A 95% confidence interval for the mean response corresponding to the value x* = 5 is

ŷ* ± t* SEmean = 2.903 ± (2.228)(0.0507) = 2.903 ± 0.113,   or   2.79 to 3.02

This interval implies that with a monthly production of $5 million, the mean electricity usage is between about 2.8 and 3 million kWh.

A 95% prediction interval for a future response to the value x* = 5 is

ŷ* ± t* SEind = 2.903 ± (2.228)(0.1802) = 2.903 ± 0.401,   or   2.50 to 3.30

This prediction interval indicates that if next month's production target is $5 million, then with 95% confidence next month's electricity usage will be somewhere between 2.5 and 3.3 million kWh.

Thus, while the expected or average electricity usage in a month with $5 million of production lies (with 95% confidence) somewhere between 2.8 and 3.0 million kWh, the electricity usage in a particular month with $5 million of production will be somewhere between 2.5 and 3.3 million kWh.


Residuals

The residuals (y − ŷ) give useful information about the contribution of individual data points to

the overall pattern of scatter. Residual values show how much the observed values differ from

the fitted values. If a particular residual is positive, the corresponding data point is above the

line; if it is negative, the point is below the line. The only time a residual is zero is when the

point lies directly on the line.

Example 1 (Continued)

There are twelve residuals:

Observation   Residual
1             −0.18
2              0.07
3             −0.09
4             −0.16
5             −0.23
6              0.15
7              0.13
8              0.14
9              0.28
10             0.05
11            −0.18
12             0.03

We can construct a residual plot by plotting the residuals against the explanatory variable x or the predicted (also called fitted) values ŷ. In a residual plot, the residual = 0 line represents the position of the least-squares line in the scatterplot of y against x. (See the Excel output.)

Residual plots are the primary tool for determining whether the assumed regression model is

appropriate.
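A minimal sketch of such a residual plot for Example 1, with NumPy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([4.51, 3.58, 4.31, 5.06, 5.64, 4.99,
              5.29, 5.83, 4.70, 5.61, 4.90, 4.20])   # Production
y = np.array([2.48, 2.26, 2.47, 2.77, 2.99, 3.05,
              3.18, 3.46, 3.03, 3.26, 2.67, 2.53])   # Electricity usage

b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)              # one residual per month; they sum to ~0

plt.scatter(x, resid)
plt.axhline(0, linestyle="--")       # the residual = 0 line
plt.xlabel("Production ($ million)")
plt.ylabel("Residual (million kWh)")
plt.show()
```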

Conditions for Regression Inference

An important step in determining whether the assumed linear regression model

y = α + βx + ε

is appropriate involves testing for the significance of the relationship between the explanatory and response variables. The tests of significance in regression analysis are based on four assumptions about the error term ε.

Figure 2 illustrates the regression model assumptions and their implications. Note that in this graphical interpretation, the mean response μ_y moves along a straight line as the explanatory variable x changes. The normal curves show how the observed response y will vary when x is held fixed at different values. All of the curves have the same standard deviation σ, so the variability of y is the same for all values of x.


Here are the four conditions for regression inference, their implications, and how to check whether each condition is satisfied.

1. Linearity

Condition: The error term ε has a mean of 0 for all values of x.

Implication
Because the mean of ε is 0, the mean of y is μ_y = α + βx, implying a linear relationship between x and y.

How to check

Look for curved patterns or other departures from a straight-line overall pattern in the residual plot.

(You can also use the original scatterplot, but the residual plot magnifies any effects.)

Example 1

The scatterplot and the residual plot both show a linear relationship.

2. Independence

Condition: The values of the error term ε are independent of one another.

Implication
The value of ε for a particular value of x is not related to the value of ε for any other value of x. Thus, the value of y for a particular value of x is not related to the value of y for any other value of x.

How to check

Signs of dependence in the residual plot are a bit subtle. In general, if the residual plot displays

a random pattern with no apparent trends, cycles, alternations, or clumping, it is reasonable to

conclude that the independence assumption holds.

Example 1

The residual plot shows random variation around the residual = 0 line.

3. Normality

Condition: The error term ε is a normally distributed random variable (with mean 0 and standard deviation σ).

Implication
Because y is a linear function of ε, y is also a normally distributed random variable (with mean μ_y = α + βx and standard deviation σ).

How to check

Check for clear skewness or other major departures from normality in the histogram of the residuals.

Or, check whether the points in the normal probability plot (Q-Q plot) fall far from a 45° line.

Example 1

The histogram of the residuals does not show any important deviations from normality.


4. Equal spread

Condition: The standard deviation σ of ε is the same for all values of x.

Implication
The standard deviation of y about the regression line equals σ and is the same for all values of x.

How to check

Look at the scatter of the residuals above and below the residual = 0 line in the

residual plot. The scatter should be roughly the same from one end to the other.

Example 1

The residual plot shows no unusual variation in the scatter of the residuals above and

below the line as x varies.
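For the normality and equal-spread checks, a histogram and a normal probability (Q-Q) plot of the residuals take only a few lines; a sketch assuming the resid array from the residual-plot sketch above:

```python
import matplotlib.pyplot as plt
from scipy import stats

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(resid, bins=5)                        # look for clear skewness
ax1.set_xlabel("Residual")
stats.probplot(resid, dist="norm", plot=ax2)   # points near the line suggest normality
plt.show()
```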

Example 2

The following figure shows some general patterns that might be observed in any residual plot.

Good pattern: the residuals are randomly scattered.

Curved pattern: the relationship is not linear.

Change in variability: the spread of y about the regression line is not equal for all values of x.


Excel Regression Output

Regression Statistics
Multiple R           0.895606
R Square             0.802109
Adjusted R Square    0.782320
Standard Error       0.172948
Observations         12

             Coefficients   Standard Error   t Stat      P-value     Lower 95%    Upper 95%
Intercept    0.409048       0.385991         1.059736    0.314190    -0.450992    1.269089
Production   0.498830       0.078352         6.366551    0.000082     0.324252    0.673409

[Residual plot: residuals (from −0.30 to 0.30 million kWh) plotted against Production (3.5 to 5.5 $ million), scattered randomly about the residual = 0 line.]

[Histogram of the residuals: frequencies from 0 to 3 over bins from −0.2 to 0.3, with no marked departures from normality.]
