3 views

Uploaded by eric

ppt

- ClassOf1_regression_prediction_intervals_8
- SBE9_TB16
- Regression Analysis in Adult Age Estimation
- Application of Measurement Models to Impedance Spectroscopy
- Curve Fitting Method
- HW1_N
- 1-s2.0-S2212567115004360-main
- Writing an Honours Thesis - Joe Wolfe - UNSW
- Young Children Job Satisfaction
- dd-12.pdf
- ch11
- Prediction of Construction Project Performance using Regression Analysis and Artificial Neural Network
- Demo Conjoint
- Module 5 Lecture 4 Final
- CH11-isbe.pdf
- Regresi Used
- A RELATIONAL STUDY OF TRANSFORMATIONAL LEADERSHIP BEHAVIOUR OF SCHOOL PRINCIPALS AND THE SCHOOL CLIMATE IN HARYANA
- Multiple Regression Paper 2
- Exploring Relationships s Pss
- fake data

You are on page 1of 47

and Statistics

Thirteenth Edition

Chapter 12

Linear Regression and

Correlation

Introduction

• In Chapter 11, we used ANOVA to

investigate the effect of various factor-level

combinations (treatments) on a response x.

• Our objective was to see whether the

treatment means were different.

• In Chapters 12 and 13, we investigate a

response y which is affected by various

independent variables, xi.

• Our objective is to use the information

provided by the xi to predict the value of y.

Example

• Let y be a student’s college achievement,

measured by his/her GPA. This might be a

function of several variables:

– x1 = rank in high school class

– x2 = high school’s overall rating

– x3 = high school GPA

– x4 = SAT scores

• We want to predict y using knowledge of x1,

x2, x3 and x4.

Example

• Let y be the monthly sales revenue for a

company. This might be a function of

several variables:

– x1 = advertising expenditure

– x2 = time of year

– x3 = state of economy

– x4 = size of inventory

• We want to predict y using knowledge of x1,

x2, x3 and x4.

Some Questions

• Which of the independent variables are

useful and which are not?

• How could we create a prediction

equation to allow us to predict y using

knowledge of x1, x2, x3 etc?

• How good is this prediction?

We start with the simplest case, in which the

response y is a function of a single

independent variable, x.

A Simple Linear Model

• In Chapter 3, we used the equation of

a line to describe the relationship between

y and x for a sample of n pairs, (x, y).

• If we want to describe the relationship

between y and x for the whole

population, there are two models we can

choose

•Deterministic Model: y = a + bx

•Probabilistic Model:

–y = deterministic model + random error

–y = a + bx + e

A Simple Linear Model

• Since the bivariate measurements that

we observe do not generally fall exactly

on a straight line, we choose to use:

• Probabilistic Model:

– y = a + bx + e

– E(y) = a + bx

Points deviate from the

line of means by an amount

e where e has a normal

distribution with mean 0 and

variance s2.

The Random Error

• The line of means, E(y) = a + bx , describes

average value of y for any fixed value of x.

• The population of measurements is

generated as y deviates from

the population line

by e. We estimate a

and b using sample

information.

The Method of

Least Squares

• The equation of the best-fitting line

is calculated using a set of n pairs (xi, yi).

•We choose our

estimates a and b to

estimate a and b so

Bestthe

that fitting line :yˆ a + bx

vertical

distances

Choosea and of the

b to minimize

points from the 2line,

SSE ( y

are minimized. ˆ

y ) ( y a bx ) 2

Least Squares

Estimators

Calculatethe sumsof squares:

( x)2

( y ) 2

Sxx x 2

Syy y

2

n n

( x)( y )

Sxy xy

n

Bestfitting line : yˆ a + bx where

S xy

b and a y bx

S xx

Example

The table shows the math achievement test

scores for a random sample of n = 10 college

freshmen, along with their final calculus grades.

Student 1 2 3 4 5 6 7 8 9 10

Math test, x 39 43 21 64 57 47 28 75 34 52

Calculus grade, y 65 78 52 82 92 89 73 98 56 75

x 460 y 760

Use your

calculator to find x 23634 y 59816

2 2

sums of squares.

xy 36854

x 46 y 76

Example

(460) 2

Sxx 23634 2474

10

(760) 2

Syy 59816 2056

10

(460)(760)

Sxy 36854 1894

10

1894

b .76556 and a 76 .76556(46) 40.78

2474

Bestfitting line : yˆ 40.78 + .77 x

The Analysis of Variance

• The total variation in the experiment is

measured by the total sum of squares:

Total SS S yy ( y y )2

The Total SS is divided into two parts:

SSR (sum of squares for regression):

measures the variation explained by using

x in the model.

SSE (sum of squares for error):

measures the leftover variation not

explained by x.

The Analysis of Variance

We calculate

( S xy ) 2 18942

SSR

S xx 2474

1449.9741

SSE TotalSS - SSR

( S xy ) 2

S yy

S xx

2056 1449.9741

606.0259

The ANOVA Table

Total df = n -1 Mean Squares

Regression df = 1 MSR = SSR/(1)

Error df = n –1 – 1 = n - 2

MSE = SSE/(n-2)

Source df SS MS F

Regression 1 SSR SSR/(1) MSR/MS

E

Error n-2 SSE SSE/(n-2)

Total n -1 Total SS

The Calculus Problem

( S xy ) 2 18942

SSR 1449.9741

S xx 2474

( S xy ) 2

SSE Total SS - SSR S yy

S xx

2056 1449.9741 606.0259

Source df SS MS F

Regression 1 1449.9741 1449.9741 19.14

Error 8 606.0259 75.7532

Total 9 2056.0000

Testing the Usefulness

of the Model

• The first question to ask is whether the

independent variable x is of any use in

predicting y.

• If it is not, then the value of y does not

change, regardless of the value of x. This

implies that the slope of the line, b, is zero.

H 0 : b 0 versus H a : b 0

Testing the

Usefulness of the Model

• The test statistic is function of b, our best

estimate of b. Using MSE as the best

estimate of the random variation s2, we

obtain a t statistic.

b0

Test statistic: t which has a t distribution

MSE

S xx

MSE

with df n 2 or a confidenceinterval: b ta / 2

S xx

The Calculus Problem

• Is there a significant relationship between

the calculus grades andTheretheis test

a significant

scores at

linear relationship

the 5% level of significance?

between the calculus

grades and the test

H 0 : b 0 versusH a : b scores

0 for the

b0 .7656 population

0 of college

t 4.38

MSE/ S xx 75.7532 /freshmen.

2474

Reject H 0 when |t| > 2.306. Since t = 4.38 falls

into the rejection region, H 0 is rejected .

The F Test

• You can test the overall usefulness of

the model using an F test. If the model

is useful, MSR will be large compared to

the unexplained variation, MSE.

MSR This test is

Test Statistic: F exactly

MSE equivalent to

Reject H 0 if F Fa with1 and n - 2 df . the t-test, with

t2 = F.

Minitab Output

Least squares

To test H 0 : b 0line

regression

Regression Analysis: y versus x

The regression equation is y = 40.8 + 0.766 x

Predictor Coef SE Coef T P

Constant 40.784 8.507 4.79 0.001

x 0.7656 0.1750 4.38 0.002

Analysis of Variance

Source DF SS MS F P

Regression 1 1450.0 1450.0 19.14 0.002

Residual Error 8 606.0 75.8

Total 9 2056.0

Regression

MSE coefficients, a and b t 2

F

Measuring the Strength

of the Relationship

• If the independent variable x is of useful in

predicting y, you will want to know how well

the model fits.

• The strength of the relationship between x

and y can be measured using:

S xy

Correlation coefficient : r

S xx S yy

2

S xy

SSR

Coefficient of determination : r 2

S xx S yy Total SS

Measuring the Strength

of the Relationship

• Since Total SS = SSR + SSE, r2 measures

the proportion of the total variation in the

responses that can be explained by using

the independent variable x in the model.

the percent reduction the total variation by

using the regression equation rather than

just using the sample mean y-bar to

estimate y.

SSR

For the calculus problem, r2 = .705 or r

2

Interpreting a

Significant Regression

• Even if you do not reject the null hypothesis

that the slope of the line equals 0, it does

not necessarily mean that y and x are

unrelated.

• Type II error—falsely declaring that the

slope is 0 and that x and y are unrelated.

• It may happen that y and x are perfectly

related in a nonlinear way.

Some Cautions

• You may have fit the wrong model.

• Extrapolation—predicting values of y

outside the range of the fitted data.

• Causality—Do not conclude that x causes

y. There may be an unknown variable at

work!

Checking the

Regression Assumptions

• Remember that the results of a regression

analysis are only valid when the necessary

assumptions have been satisfied.

1. The relationship between x and y is linear,

given by y = a + bx + e.

2. The random error terms e are independent

and, for any value of x, have a normal

distribution with mean 0 and variance s 2.

Diagnostic Tools

• We use the same diagnostic tools

used in Chapter 11 to check the

normality assumption and the

assumption of equal variances.

1. Normal probability plot of

residuals

2. Plot of residuals versus fit or

residuals versus variables

Residuals

• The residual error is the “leftover”

variation in each data point after the

variation explained by the regression

model has been removed.

Residual yi yˆi or yi a bxi

these residuals should be normal, with

mean 0 and variance s2.

Normal Probability Plot

If the normality assumption is valid,

the plot should resemble a straight

line, sloping upward to the right.

Normal Probability Plot of the Residuals

(response is y)

99

95

90

80

70

Percent

60

50

40

30

20

10

1

-20 -10 0 10 20

Residual

Residuals versus Fits

If the equal variance assumption is

valid, the plot should appear as a

random scatter around the zero

Residuals Versus the Fitted Values

(response is y)

center line. 15

10

5

residuals.

Residual

-5

-10

60 70 80 90 100

Fitted Value

Estimation and Prediction

• Once you have

determined that the regression line is

useful

used the diagnostic plots to check for

violation of the regression assumptions.

• YouEstimate

are readythetoaverage

use the value

regression line to

of y for

a given value of x

Predict a particular value of y for a

given value of x.

Estimation and Prediction

Estimating a

particular value of y

when x = x0

Estimating the

average value of

y when x = x0

Estimation and Prediction

• The best estimate of either E(y) or y for

a given value x = x0 is

yˆ a + bx0

• Particular values of y are more difficult to

predict, requiring a wider range of values in the

prediction interval.

Estimation and

Prediction

To estimatethe averagevalueof y when x x0 :

1 ( x0 x ) 2

yˆ ta / 2 MSE +

n S xx

To predict a particularvalueof y when x x0 :

1 ( x0 x ) 2

yˆ ta / 2 MSE 1 + +

n S xx

The Calculus Problem

• Estimate the average calculus grade for

students whose achievement score is 50

with a 95% confidence interval.

Calculate yˆ 40.78424 + .76556(50) 79.06

1 (50 46) 2

yˆ 2.306 75.7532 +

10 2474

79.06 6.55 or 72.51to 85.61.

The Calculus Problem

• Estimate the calculus grade for a particular

student whose achievement score is 50 with

a 95% confidence interval.

1 (50 46) 2

yˆ 2.306 75.75321 + +

10 2474

Notice how

79.06 21.11 or 57.95 to 100.17.much wider

this interval is!

Minitab Output

Confidence and prediction

intervals when x = 50

Predicted Values for New Observations

New Obs Fit SE Fit 95.0% CI 95.0% PI

1 79.06 2.84 (72.51, 85.61) (57.95,100.17)

New Obs x

1 50.0 Fitted Line Plot

y = 40.78 + 0.7656 x

Regression

95% CI

95% PI

110

bands are always 100

S

R-Sq

8.70363

70.5%

R-Sq(adj) 66.8%

80

y

confidence bands. 70

60

40

narrowest when x = 30

20 30 40 50 60 70 80

x

x-bar.

Correlation Analysis

• The strength of the relationship between x

and y is measured using the coefficient of

correlation:

S xy

Correlation coefficient : r

S xx S yy

• Recall from Chapter 3 that

(1) -1 r 1 (2) r and b have the same sign

(3) r 0 means no linear relationship

(4) r 1 or –1 means a strong (+) or (-)

relationship

Example

The table shows the heights and weights of

n = 10 randomly selected college football

players.

Player 1 2 3 4 5 6 7 8 9 10

Height, x 73 71 75 72 72 75 67 69 71 69

Weight, y 185 175 200 210 190 195 150 170 180 175

calculator to find

328

the sums and r .8261

sums of squares. (60.4)(2610)

Football Players

Scatterplot of Weight vs Height

210

200

190

Weight

180

r = .8261

170

160

Strong positive

150

correlation

66 67 68 69 70 71 72 73 74 75

Height

As the player’s

height increases,

so does his

weight.

Some Correlation Patterns

• Use the Exploring Correlation applet to

r = 0; No r = .931; Strong

explore

correlationsome correlation patterns:

positive correlation

r = 1; Linear

relationship r = -.67; Weaker

negative correlation

Inference using r

• The population coefficient of correlation is

called r (“rho”). We can test for a significant

correlation between x and y using a t test:

exactly

n2 equivalent to

Test Statistic: t r the t-test for

1 r 2

the slope b0.

Reject H 0 if t ta / 2 or t ta / 2 with n - 2 df .

r .8261 Example

Is there a significant positive correlation

between weight and height in the

population of all college football players?

H0 : r 0 n2

Test Statistic: t r

Ha : r 0 1 r 2

8

Use the t-table with n-2 = 8 df .8261 4.15

to bound the p-value as p- 1 .8261 2

significant positive correlation.

Key Concepts

I. A Linear Probabilistic Model

1. When the data exhibit a linear relationship, the

appropriate model is y a + b x + e .

2. The random error e has a normal distribution

with mean 0 and variance s2.

II. Method of Least Squares

1. Estimates a and b, for a and b, are chosen to

minimize SSE, the sum of the squared

yˆ a + bx

deviations about the regression line,

.

2. The least squares estimates are b = Sxy/Sxx and

a y bx.

Key Concepts

III. Analysis of Variance

1. Total SS SSR + SSE, where Total SS Syy and

SSR (Sxy)2 / Sxx.

2. The best estimate of s 2 is MSE SSE / (n 2).

1. A test for the significance of the linear regression—H0

: b 0 can be implemented using one of two test

statistics:

b MSR

t or F

MSE / S xx MSE

Key Concepts

2. The strength of the relationship between x and y can

be measured using

SSR

R

2

Total SS

which gets closer to 1 as the relationship gets

stronger.

3. Use residual plots to check for nonnormality,

inequality of variances, and an incorrectly fit model.

4. Confidence intervals can be constructed to estimate

the intercept a and slope b of the regression line and

to estimate the average value of y, E( y ), for a given

value of x.

5. Prediction intervals can be constructed to predict a

particular observation, y, for a given value of x. For a

given x, prediction intervals are always wider than

confidence intervals.

Key Concepts

V. Correlation Analysis

1. Use the correlation coefficient to measure the

relationship between x and y when both

variables are random:

S xy

r

S xx S yy

2. The sign of r indicates the direction of the

relationship; r near 0 indicates no linear

relationship, and r near 1 or 1 indicates a

strong linear relationship.

3. A test of the significance of the correlation

coefficient is identical to the test of the slope b.

- ClassOf1_regression_prediction_intervals_8Uploaded byClassOf1.com
- SBE9_TB16Uploaded byHuong Tran
- Regression Analysis in Adult Age EstimationUploaded byamapink94
- Application of Measurement Models to Impedance SpectroscopyUploaded byYu Shu Hearn
- Curve Fitting MethodUploaded byAnonymous ya6gBBwHJF
- HW1_NUploaded byejh2163
- 1-s2.0-S2212567115004360-mainUploaded byAlexandra Crăciunoiu
- Writing an Honours Thesis - Joe Wolfe - UNSWUploaded byrizkibizniz
- dd-12.pdfUploaded byDushyant Patel
- Young Children Job SatisfactionUploaded byLee Hou Yew
- ch11Uploaded byPhía trước là bầu trời
- Prediction of Construction Project Performance using Regression Analysis and Artificial Neural NetworkUploaded byIJRASETPublications
- Demo ConjointUploaded byLuigui Meza Galdos
- Module 5 Lecture 4 FinalUploaded bytejap314
- CH11-isbe.pdfUploaded byasa
- Regresi UsedUploaded byLutfi Untung Angga Laksana
- A RELATIONAL STUDY OF TRANSFORMATIONAL LEADERSHIP BEHAVIOUR OF SCHOOL PRINCIPALS AND THE SCHOOL CLIMATE IN HARYANAUploaded byAnonymous CwJeBCAXp
- Multiple Regression Paper 2Uploaded byRashid Ayubi
- Exploring Relationships s PssUploaded byNicoleta Poiana
- fake dataUploaded byAssad Bilal
- Comparison of Regression and Kriging Techniques for Mapping the Average Annual Precipitation of Turkey-Bostan Et Al., 2012Uploaded byParag Jyoti Dutta
- 7 bivariate edaUploaded byAman Sodi
- Regrerssion in BusinessUploaded byshagunparmar
- Boschken, H. L. (2002). Urban spatial form and policy outcomes in public agencies.UAR.pdfUploaded byHenil Dudhia
- McNichols, Maureen F, %3B Research Design Issues in Earnings Management StudiesUploaded byAdi Setiawan
- Linear RegressionUploaded bySher Ali Khan
- Diagnosing Model Issues and ReportingUploaded byNathan Otten
- SriUploaded byhipzul watoni
- HKSA AKTIVITASUploaded byNusaibah Romadhoni
- Transportation Statistics: table 04 08Uploaded byBTS

- Marshall Islands letter to International Maritime OrganizationUploaded byclimatehomescribd
- 5-1Uploaded byMarco Olivetto
- Tunnel.pdfUploaded byea53
- adj (1)Uploaded byRamji Killani
- UpToDate Massive Blood TransfusionUploaded byAlice Huii
- Office furniture manufacturer marketing planUploaded byPalo Alto Software
- 07-12-30+Workshop+AttendanceUploaded byricky_soni32
- Radio Communication linkUploaded byPavithra P
- Digital Signal ProcessingUploaded byPECMURUGAN
- Doe_JMPUploaded byEmilio Hipola
- Hemorrhoids Pathophysiology to TreatmentUploaded byKen Cheung
- WADE7 Lecture 04Uploaded byhaha_le12
- 3. the Three Levels of Strategic PlanningUploaded byAnnie Rachel
- Xincheng P., Super High Strength_High Performance Concrete, 2013Uploaded byMario A Villada Ríos
- 166.full.pdfUploaded byYosua Chobat Max
- Police Log October 25, 2016Uploaded byMansfieldMAPolice
- Using SAP Cloud Platform IntegrationUploaded byRavi
- Jonathon Bradford CV2Uploaded byJono Bradford
- Chetan Arunrao Rekhate Cv 15032012 (1)Uploaded bySani Tipare
- BrochuresUploaded byAlexander Rodriguez
- DD CEN TS 01591-4-2007Uploaded byk1gabitzu9789
- lesson plan 1 (1)Uploaded byapi-287920249
- Bye Bye (Paper) BritannicaUploaded bystefi idlab
- s/w architectureUploaded byrajianand2
- 06 Catalog Krisbow9 Cutting ToolUploaded byEnrique da Matta
- Turgeon FrenchBeads 2001Uploaded byFrancois G. Richard
- Declaration-Penalty-Matrix.pdfUploaded byrohit jaiswal
- Detroit Chapter 9 Bankruptcy List of Creditors March 4, 2014Uploaded byBeverly Tran
- 2014_Journal of Applied fUploaded byTim Wong
- Business anaysis and decision makingUploaded byMischal Menden