This action might not be possible to undo. Are you sure you want to continue?

**and Correlation Analysis
**

Goals

After this, you should be able to:

• Calculate and interpret the simple correlation

between two variables

• Determine whether the correlation is significant

• Calculate and interpret the simple linear

regression equation for a set of data

• Understand the assumptions behind

regression analysis

• Determine whether a regression model is

significant

Goals

After this, you should be able to:

• Calculate and interpret confidence

intervals for the regression coefficients

• Recognize regression analysis

applications for purposes of prediction

and description

• Recognize some potential problems if

regression analysis is used incorrectly

• Recognize nonlinear relationships

between two variables

(continued)

Scatter Plots and Correlation

• A scatter plot (or scatter diagram) is used

to show the relationship between two

variables

• Correlation analysis is used to measure

strength of the association (linear

relationship) between two variables

–Only concerned with strength of the

relationship

–No causal effect is implied

Scatter Plot Examples

y

x

y

x

y

y

x

x

Linear relationships Curvilinear relationships

Scatter Plot Examples

y

x

y

x

y

y

x

x

Strong relationships Weak relationships

(continued)

Scatter Plot Examples

y

x

y

x

No relationship

(continued)

Correlation Coefficient

• The population correlation coefficient

ρ (rho) measures the strength of the

association between the variables

• The sample correlation coefficient r is

an estimate of ρ and is used to

measure the strength of the linear

relationship in the sample

observations

(continued)

Features of ρ and r

• Unit free

• Range between -1 and 1

• The closer to -1, the stronger the

negative linear relationship

• The closer to 1, the stronger the positive

linear relationship

• The closer to 0, the weaker the linear

relationship

r = +.3 r = +1

Examples of Approximate

r Values

y

x

y

x

y

x

y

x

y

x

r = -1 r = -.6 r = 0

Calculating the

Correlation Coefficient

¿ ¿

¿

÷ ÷

÷ ÷

=

] ) y y ( ][ ) x x ( [

) y y )( x x (

r

2 2

where:

r = Sample correlation coefficient

n = Sample size

x = Value of the independent variable

y = Value of the dependent variable

¿ ¿ ¿ ¿

¿ ¿ ¿

÷ ÷

÷

=

] ) y ( ) y ( n ][ ) x ( ) x ( n [

y x xy n

r

2 2 2 2

Sample correlation coefficient:

or the algebraic equivalent:

Calculation Example

Tree

Height

Trunk

Diamete

r

y x xy y

2

x

2

35 8 280 1225 64

49 9 441 2401 81

27 7 189 729 49

33 6 198 1089 36

60 13 780 3600 169

21 7 147 441 49

45 11 495 2025 121

51 12 612 2601 144

E=321 E=73 E=3142 E=14111 E=713

0

10

20

30

40

50

60

70

0 2 4 6 8 10 12 14

0.886

] (321) ][8(14111) (73) [8(713)

(73)(321) 8(3142)

] y) ( ) y ][n( x) ( ) x [n(

y x xy n

r

2 2

2 2 2 2

=

÷ ÷

÷

=

÷ ÷

÷

=

¿ ¿ ¿ ¿

¿ ¿ ¿

Trunk Diameter, x

Tree

Height,

y

Calculation Example

(continued)

r = 0.886 → relatively strong positive

linear association between x and y

Excel Output

Tree Height Trunk Diameter

Tree Height 1

Trunk Diameter 0.886231 1

Excel Correlation Output

Tools / data analysis / correlation…

Correlation between

Tree Height and Trunk Diameter

Significance Test for

Correlation

• Hypotheses

H

0

: ρ = 0 (no correlation)

H

A

: ρ ≠ 0 (correlation exists)

• Test statistic

– (with n – 2 degrees of freedom)

2 n

r 1

r

t

2

÷

÷

=

Example: Produce Stores

Is there evidence of a linear relationship

between tree height and trunk diameter at

the .05 level of significance?

H

0

: ρ

= 0 (No correlation)

H

1

: ρ ≠ 0 (correlation exists)

o =.05 , df = 8 - 2 = 6

4.68

2 8

.886 1

.886

2 n

r 1

r

t

2 2

=

÷

÷

=

÷

÷

=

4.68

2 8

.886 1

.886

2 n

r 1

r

t

2 2

=

÷

÷

=

÷

÷

=

Example: Test Solution

Conclusion:

There is

evidence of a

linear relationship

at the 5% level of

significance

Decision:

Reject H

0

Reject H

0

Reject H

0

o/2=.025

-t

α/2

Do not reject H

0

0

t

α/2

o/2=.025

-2.4469 2.4469

4.68

d.f. = 8-2 = 6

Introduction to Regression Analysis

• Regression analysis is used to:

– Predict the value of a dependent variable based

on the value of at least one independent variable

– Explain the impact of changes in an independent

variable on the dependent variable

Dependent variable: the variable we wish to explain

Independent variable: the variable used to explain the

dependent variable

Simple Linear Regression

Model

• Only one independent variable, x

• Relationship between x and y is

described by a linear function

• Changes in y are assumed to be

caused by changes in x

Types of Regression Models

Positive Linear Relationship

Negative Linear Relationship

Relationship NOT Linear

No Relationship

ε x β β y

1 0

+ + =

Linear component

Population Linear Regression

The population regression model:

Population

y intercept

Population

Slope

Coefficient

Random

Error

term, or

residual

Dependent

Variable

Independent

Variable

Random Error

component

Linear Regression Assumptions

• Error values (ε) are statistically independent

• Error values are normally distributed for any given value

of x

• The probability distribution of the errors is normal

• The probability distribution of the errors has constant

variance

• The underlying relationship between the x variable and

the y variable is linear

Population Linear Regression

(continued)

Random Error

for this x value

y

x

Observed Value

of y for x

i

Predicted Value

of y for x

i

ε x β β y

1 0

+ + =

x

i

Slope = β

1

Intercept = β

0

ε

i

x b b y

ˆ

1 0 i

+ =

The sample regression line provides an estimate of

the population regression line

Estimated Regression Model

Estimate of

the regression

intercept

Estimate of the

regression slope

Estimated

(or predicted)

y value

Independent

variable

The individual random error terms e

i

have a mean of zero

Least Squares Criterion

• b

0

and b

1

are obtained by finding the

values of b

0

and b

1

that minimize the

sum of the squared residuals

2

1 0

2 2

x)) b (b (y

) y

ˆ

(y e

+ ÷ =

÷ =

¿

¿ ¿

The Least Squares Equation

• The formulas for b

1

and b

0

are:

algebraic equivalent:

¿

¿

¿

¿ ¿

÷

÷

=

n

x

x

n

y x

xy

b

2

2

1

) (

¿

¿

÷

÷ ÷

=

2

1

) (

) )( (

x x

y y x x

b

x b y b

1 0

÷ =

and

• b

0

is the estimated average value of

y when the value of x is zero

• b

1

is the estimated change in the

average value of y as a result of a

one-unit change in x

Interpretation of the

Slope and the Intercept

Finding the Least Squares

Equation

• The coefficients b

0

and b

1

will

usually be found using computer

software, such as Excel or Minitab

• Other regression measures will also

be computed as part of computer-

based regression analysis

Simple Linear Regression Example

• A real estate agent wishes to examine the relationship

between the selling price of a home and its size

(measured in square feet)

• A random sample of 10 houses is selected

– Dependent variable (y) = house price in $1000s

– Independent variable (x) = square feet

Sample Data for House Price

Model

House Price in $1000s

(y)

Square Feet

(x)

245 1400

312 1600

279 1700

308 1875

199 1100

219 1550

405 2350

324 2450

319 1425

255 1700

Regression Using Excel

• Tools / Data Analysis /

Regression

Excel Output

Regression Statistics

Multiple R 0.76211

R Square 0.58082

Adjusted R

Square 0.52842

Standard Error 41.33032

Observations 10

ANOVA

df SS MS F

Significance

F

Regression 1 18934.9348

18934.934

8

11.084

8 0.01039

Residual 8 13665.5652 1708.1957

Total 9 32600.5000

Coefficien

ts Standard Error t Stat

P-

value Lower 95%

Upper

95%

Intercept 98.24833 58.03348 1.69296

0.1289

2 -35.57720

232.0738

6

Square Feet 0.10977 0.03297 3.32938

0.0103

9 0.03374 0.18580

The regression equation is:

feet) (square 0.10977 98.24833 price house + =

0

50

100

150

200

250

300

350

400

450

0 500 1000 1500 2000 2500 3000

Square Feet

H

o

u

s

e

P

r

i

c

e

(

$

1

0

0

0

s

)

Graphical Presentation

• House price model: scatter plot and

regression line

feet) (square 0.10977 98.24833 price house + =

Slope

= 0.10977

Intercept

= 98.248

Interpretation of the

Intercept, b

0

• b

0

is the estimated average value of Y

when the value of X is zero (if x = 0 is in

the range of observed x values)

– Here, no houses had 0 square feet, so b

0

=

98.24833 just indicates that, for houses

within the range of sizes observed,

$98,248.33 is the portion of the house price

not explained by square feet

feet) (square 0.10977 98.24833 price house + =

Interpretation of the

Slope Coefficient, b

1

• b

1

measures the estimated change in

the average value of Y as a result of

a one-unit change in X

– Here, b

1

= .10977 tells us that the average

value of a house increases by .10977($1000)

= $109.77, on average, for each additional

one square foot of size.

feet) (square 0.10977 98.24833 price house + =

Least Squares Regression

Properties

• The sum of the residuals from the least

squares regression line is 0 ( )

• The sum of the squared residuals is a

minimum (minimized )

• The simple regression line always passes

through the mean of the y variable and the

mean of the x variable

• The least squares coefficients are unbiased

estimates of β

0

and β

1

.

0 ) ˆ ( = ÷

¿

y y

2

)

ˆ

( y y

¿

÷

Explained and Unexplained

Variation

• Total variation is made up of two

parts:

SSR SSE SST + =

Total sum

of Squares

Sum of Squares

Regression

Sum of

Squares Error

¿

÷ =

2

) y y ( SST

¿

÷ =

2

) y

ˆ

y ( SSE

¿

÷ =

2

) y y

ˆ

( SSR

where:

= Average value of the dependent variable

y = Observed values of the dependent variable

= Estimated value of y for the given x value

yˆ

y

• SST = total sum of squares

– Measures the variation of the y

i

values

around their mean y

• SSE = error sum of squares

– Variation attributable to factors other than the

relationship between x and y

• SSR = regression sum of squares

– Explained variation attributable to the

relationship between x and y

(continued)

Explained and Unexplained

Variation

(continued)

X

i

y

x

y

i

SST = ¿(y

i

- y)

2

SSE = ¿(y

i

- y

i

)

2

.

SSR = ¿(y

i

- y)

2

.

_

_

_

Explained and Unexplained

Variation

y

.

y

y

_

y

.

• The coefficient of determination is the

portion of the total variation in the

dependent variable that is explained by

variation in the independent variable

• The coefficient of determination is also

called R-squared and is denoted as R

2

Coefficient of Determination, R

2

SST

SSR

R =

2

1 R 0

2

s s

where

Coefficient of determination

Coefficient of Determination, R

2

squares of sum total

regression by explained squares of sum

SST

SSR

R = =

2

(continued)

Note: In the single independent variable case, the coefficient

of determination is

where:

R

2

= Coefficient of determination

r = Simple correlation coefficient

2 2

r R =

R

2

= +1

Examples of Approximate

R

2

Values

y

x

y

x

R

2

= 1

R

2

= 1

Perfect linear relationship

between x and y:

100% of the variation in y is

explained by variation in x

Examples of Approximate

R

2

Values

y

x

y

x

0 < R

2

< 1

Weaker linear relationship

between x and y:

Some but not all of the

variation in y is explained

by variation in x

Examples of Approximate

R

2

Values

R

2

= 0

No linear relationship

between x and y:

The value of Y does not

depend on x. (None of the

variation in y is explained

by variation in x)

y

x

R

2

= 0

Excel Output

Regression Statistics

Multiple R 0.76211

R Square 0.58082

Adjusted R Square 0.52842

Standard Error 41.33032

Observations 10

ANOVA

df SS MS F Significance F

Regression 1 18934.9348 18934.9348 11.0848 0.01039

Residual 8 13665.5652 1708.1957

Total 9 32600.5000

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386

Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580

58.08% of the variation in

house prices is explained by

variation in square feet

0.58082

32600.5000

18934.9348

SST

SSR

R

2

= = =

Standard Error of Estimate

• The standard deviation of the variation of

observations around the regression line is

estimated by

1 ÷ ÷

=

c

k n

SSE

s

Where

SSE = Sum of squares error

n = Sample size

k = number of independent variables in the model

The Standard Deviation of the

Regression Slope

• The standard error of the regression slope

coefficient (b

1

) is estimated by

¿

¿ ¿

÷

=

÷

=

n

x) (

x

s

) x (x

s

s

2

2

ε

2

ε

b

1

where:

= Estimate of the standard error of the least squares slope

= Sample standard error of the estimate

1

b

s

2 n

SSE

s

ε

÷

=

Excel Output

Regression Statistics

Multiple R 0.76211

R Square 0.58082

Adjusted R Square 0.52842

Standard Error 41.33032

Observations 10

ANOVA

df SS MS F Significance F

Regression 1 18934.9348 18934.9348 11.0848 0.01039

Residual 8 13665.5652 1708.1957

Total 9 32600.5000

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386

Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580

41.33032 s

ε

=

0.03297 s

1

b

=

Comparing Standard Errors

y

y y

x

x

x

y

x

1

b

s small

1

b

s large

c

s small

c

s large

Variation of observed y values

from the regression line

Variation in the slope of regression

lines from different possible samples

Inference about the Slope: t Test

• t test for a population slope

– Is there a linear relationship between x and

y?

• Null and alternative hypotheses

– H

0

: β

1

= 0 (no linear relationship)

– H

1

: β

1

= 0 (linear relationship does exist)

• Test statistic

–

–

1

b

1 1

s

β b

t

÷

=

2 n d.f. ÷ =

where:

b

1

= Sample regression slope

coefficient

β

1

= Hypothesized slope

s

b1

= Estimator of the standard

error of the slope

House Price

in $1000s

(y)

Square Feet

(x)

245 1400

312 1600

279 1700

308 1875

199 1100

219 1550

405 2350

324 2450

319 1425

255 1700

(sq.ft.) 0.1098 98.25 price house + =

Estimated Regression Equation:

The slope of this model is 0.1098

Does square footage of the house

affect its sales price?

Inference about the Slope:

t Test

(continued)

Inferences about the Slope:

t Test Example

H

0

: β

1

= 0

H

A

: β

1

= 0

Test Statistic: t = 3.329

There is sufficient evidence

that square footage affects

house price

From Excel output:

Reject H

0

Coefficient

s

Standard

Error t Stat

P-

value

Intercept 98.24833 58.03348

1.6929

6

0.1289

2

Square

Feet 0.10977 0.03297

3.3293

8

0.0103

9

1

b

s

t b

1

Decision:

Conclusion:

Reject H

0

Reject H

0

o/2=.025

-t

α/2

Do not reject H

0

0

t

α/2

o/2=.025

-2.3060 2.3060 3.329

d.f. = 10-2 = 8

Regression Analysis for

Description

Confidence Interval Estimate of the Slope:

Excel Printout for House Prices:

At 95% level of confidence, the confidence interval for

the slope is (0.0337, 0.1858)

1

b /2 1

s t b

o

±

Coefficient

s

Standard

Error t Stat P-value Lower 95%

Upper

95%

Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386

Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580

d.f. = n - 2

Regression Analysis for

Description

Since the units of the house price variable is

$1000s, we are 95% confident that the average

impact on sales price is between $33.70 and

$185.80 per square foot of house size

Coefficient

s

Standard

Error t Stat P-value Lower 95%

Upper

95%

Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386

Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580

This 95% confidence interval does not include 0.

Conclusion: There is a significant relationship between

house price and square feet at the .05 level of significance

Confidence Interval for

the Average y, Given x

Confidence interval estimate for the

mean of y given a particular x

p

Size of interval varies according

to distance away from mean, x

¿

÷

÷

+ ±

o

2

2

p

ε /2

) x (x

) x (x

n

1

s t y

ˆ

Confidence Interval for

an Individual y, Given x

Confidence interval estimate for an

Individual value of y given a particular x

p

¿

÷

÷

+ + ±

o

2

2

p

ε /2

) x (x

) x (x

n

1

1 s t y

ˆ

This extra term adds to the interval width to reflect

the added uncertainty for an individual case

Interval Estimates

for Different Values of x

y

x

Prediction Interval

for an individual y,

given x

p

x

p x

Confidence

Interval for

the mean of

y, given x

p

House Price

in $1000s

(y)

Square Feet

(x)

245 1400

312 1600

279 1700

308 1875

199 1100

219 1550

405 2350

324 2450

319 1425

255 1700

(sq.ft.) 0.1098 98.25 price house + =

Estimated Regression Equation:

Example: House Prices

Predict the price for a house

with 2000 square feet

317.85

0) 0.1098(200 98.25

(sq.ft.) 0.1098 98.25 price house

=

+ =

+ =

Example: House Prices

Predict the price for a house

with 2000 square feet:

The predicted price for a house with 2000

square feet is 317.85($1,000s) = $317,850

(continued)

Estimation of Mean Values:

Example

Find the 95% confidence interval for the average

price of 2,000 square-foot houses

Predicted Price Y

i

= 317.85 ($1,000s)

.

Confidence Interval Estimate for E(y)|x

p

37.12 317.85

) x (x

) x (x

n

1

s t y

ˆ

2

2

p

ε α/2

± =

÷

÷

+ ±

¿

The confidence interval endpoints are 280.66 -- 354.90,

or from $280,660 -- $354,900

Estimation of Individual

Values: Example

Find the 95% confidence interval for an individual

house with 2,000 square feet

Predicted Price Y

i

= 317.85 ($1,000s)

.

Prediction Interval Estimate for y|x

p

102.28 317.85

) x (x

) x (x

n

1

1 s t y

ˆ

2

2

p

ε α/2

± =

÷

÷

+ + ±

¿

The prediction interval endpoints are 215.50 -- 420.07,

or from $215,500 -- $420,070

Residual Analysis

• Purposes

–Examine for linearity assumption

–Examine for constant variance for all

levels of x

–Evaluate normal distribution

assumption

• Graphical Analysis of Residuals

–Can plot residuals vs. x

–Can create histogram of residuals to

check for normality

Residual Analysis for Linearity

Not Linear

Linear

x

r

e

s

i

d

u

a

l

s

x

y

x

y

x

r

e

s

i

d

u

a

l

s

Residual Analysis for

Constant Variance

Non-constant variance

Constant variance

x x

y

x

x

y

r

e

s

i

d

u

a

l

s

r

e

s

i

d

u

a

l

s

House Price Model Residual Plot

-60

-40

-20

0

20

40

60

80

0 1000 2000 3000

Square Feet

R

e

s

i

d

u

a

l

s

Excel Output

RESIDUAL OUTPUT

Predicted

House

Price Residuals

1 251.92316 -6.923162

2 273.87671 38.12329

3 284.85348 -5.853484

4 304.06284 3.937162

5 218.99284 -19.99284

6 268.38832 -49.38832

7 356.20251 48.79749

8 367.17929 -43.17929

9 254.6674 64.33264

10 284.85348 -29.85348

Summary

• Introduced correlation analysis

• Discussed correlation to measure the

strength of a linear association

• Introduced simple linear regression

analysis

• Calculated the coefficients for the simple

linear regression equation

• measures of variation (R

2

and s

ε

)

• Addressed assumptions of regression

and correlation

Summary

• Described inference about the slope

• Addressed estimation of mean values

and prediction of individual values

• Discussed residual analysis

(continued)

R software regression

• yx=c(245,1400,312,1600,279,1700,308,1875,19

9,1100,219,1550,405,2350,324,2450,319,1425,

255,1700)

• mx=matrix(yx,10,2, byrow=T)

• hprice=mx[,1]

• sqft=mx[,2]

• reg1=lm(hprice~sqft)

• summary(reg1)

• plot(reg1)

- DNVGL Industry Outlook Report 2017 Digital Spreads
- First Announcement ICSOT 2017-1
- Vodafone_23.06.2016
- NavalFEA Brochure PDF
- Recognized Organizations Performance Table
- Bone
- HP Visa Restrictions Index 140728
- ALP Response and Stability
- Overview of Experimental Mechanics
- Course Outline
- 2011_2012_JBSD_Def_Life
- Reliability of offshore structures
- Mathcad T Joint Final 1
- Exporting MIT
- Analysis of Numerical Schemes
- Platform Support Ansys 15.0 Detailed Summary
- professional ethics
- Stochastic processes
- Assignment 2
- Cauchy Characteristic Function
- OrcaFlex Demonstration
- Photoelasticity Circular Polariscope
- Virtual Polariscope
- Snake and Ladder problem
- Wave Dynamics Introduction

Sign up to vote on this title

UsefulNot usefulLinear Regression Analysis

Linear Regression Analysis

- Powerpoint - Regression and Correlation Analysis
- Multiple Regression
- Correlation Ppt
- Simple Regression
- Regression Analysis
- SPSS-Survival-Manual
- Regression
- Linear Regression
- FM4-2008
- Regression and Correlation
- DDA2013 Week4 WindowsExcel2003 Regression
- Chapter 17
- 09 Inference for Regression Part1
- Chapter 17
- Regression and Correlation Analysis
- Simple Linear Regression[1]
- BB ABS 10th Univariate and Bivariate Corr SR MR 81Pages Final
- Statistics-Linear Regression and Correlation Analysis
- QM II Formula Sheet
- ANR
- AAKERch19
- Linear Regression Analysis and Least Square Methods
- Chapter 5,6 Regression Analysis
- Review 2013 Regression
- DADM-Correlation and Regression
- BBS10 Ppt Mtb Ch13 Simple Linear Regression
- spss 簡單線性回歸分析 spss simple regression.pdf
- Regression and Correlation
- Statistics 578 Assignment 5 Homework
- chapter 8 (2)
- Linear Regression and Correlation Analysis PPT @ BEC DOMS