3 views

Uploaded by tesfu

vvv

- ECON 330- Practice Questions
- Applied Regression
- Scatter Diagram
- Paired Sample T-Test
- Statistical Interpretation and Discussion for a set of results
- Leadership and Hierarchical Level (for LODJ - Resubmission)
- Cap 16 Construccion de Modelos
- Correlational Research
- General Regression Analysis
- polyfit.pdf
- Multiple Linear Regression
- Regression and Correlation
- Cph 14
- RegannA14NOV05Refresh21OCT10
- MBA2038 1T10 Wk4 Forecasting
- Development Of Mathematical Models To Forecasting The Monthly Precipitation
- Regression
- Stat
- QMM_EPGDM_5.pptx
- Regression

You are on page 1of 50

and Correlation

1

17.1 Introduction

In this chapter we employ Regression Analysis

to examine the relationship among quantitative

variables.

The technique is used to predict the value of one

variable (the dependent variable - y)based on

the value of other variables (independent

variables x1, x2,xk.)

2

17.2 The Model

The first order linear model

y b0 b1x

b0 and b1 are unknown,

y = dependent variable y therefore, are estimated

from the data.

x = independent variable

b0 = y-intercept

Rise

b1 = slope of the line b1 = Rise/Run

Run

= error variable b0

x

3

17.3 Estimating the Coefficients

The estimates are determined by

drawing a sample from the population of interest,

calculating sample statistics.

producing a straight line that cuts into the data.

y w The question is:

w Which straight line fits best?

w

w

w w w w w

w w w w w

w

x 4

The best line is the one that minimizes

the sum of squared vertical differences

between the points and the line.

Sum of squared differences = (2 - 1)2 + (4 - 2)2 +(1.5 - 3)2 + (3.2 - 4)2 = 6.89

Sum of squared differences = (2 -2.5)2 + (4 - 2.5)2 + (1.5 - 2.5)2 + (3.2 - 2.5)2 = 3.99

Let us compare two lines

4 (2,4)

w The second line is horizontal

3 w (4,3.2)

2.5

2

(1,2) w The smaller the sum of

w (3,1.5)

1 squared differences

the better the fit of the

1 2 3 4

line to the data. 5

To calculate the estimates of the coefficients The regression equation that estimates

that minimize the differences between the data the equation of the first order linear model

points and the line, use the formulas: is:

cov( X , Y )

b1

s 2x y b 0 b1x

b 0 y b1 x

6

Example 17.1 Relationship between odometer

reading and a used cars selling price.

A car dealer wants to find Car Odometer Price

the relationship between 1 37388 5318

2 44758 5061

the odometer reading and 3 45833 5008

the selling price of used cars. 4 30862 5795

5 31705 5784

A random sample of 100 cars 6 34010 5359

is selected, and the data . . .

. . .

recorded. . . .

Find the regression line.

Independent variable x

Dependent variable y

7

Solution

Solving by hand

To calculate b0 and b1 we need to calculate several

statistics first;

x 36,009.45; s 2x

2

( x x)

i

43,528,688

n 1

y 5,411.41; cov(X, Y)

( x x)(y

i i y)

1,356,256

n 1

where n = 100.

cov(X, Y) 1,356,256

b1 .0312

s 2x 43,528,688

b 0 y b1x 5411.41 ( .0312)(36,009.45) 6,533

y b 0 b1x 6,533 .0312 x 8

Using the computer (see file Xm17-01.xls)

Tools > Data analysis > Regression > [Shade the y range and the x range] > OK

SUMMARY OUTPUT

6000

5500

Price

Regression Statistics

Multiple R 0.806308 5000

R Square 0.650132

Adjusted R Square

0.646562

4500

Standard Error

151.5688 19000 29000 39000 49000

Observations 100 Odometer

y 6,533 .0312x

ANOVA

df SS MS F Significance F

Regression 1 4183528 4183528 182.1056 4.4435E-24

Residual 98 2251362 22973.09

Total 99 6434890

Intercept 6533.383 84.51232 77.30687 1.22E-89

Odometer -0.03116 0.002309 -13.4947 4.44E-24 9

6533

6000

5500

Price

5000

4500

0 No data 19000 29000 39000 49000

Odometer

y 6,533 .0312x

For each additional mile on the odometer,

the price decreases by an average of $0.0312

Do not interpret the intercept as the

Price of cars that have not been driven

10

17.4 Error Variable: Required Conditions

Four requirements involving the distribution of must

be satisfied.

The probability distribution of is normal.

The mean of is zero: E() = 0.

The standard deviation of is s for all values of x.

The set of errors associated with different values of y are

all independent.

11

From the first three assumptions we have:

y is normally distributed with mean

E(y) = b0 + b1x, and a constant standard

deviation s

E(y|x3)

The standard deviation remains constant,

b0 + b1x3 m3

E(y|x2)

b0 + b1x2 m2

but the mean value changes with x E(y|x1)

b0 + b1x1 m1

x1 x2 x3

12

17.5 Assessing the Model

The least squares method will produce a

regression line whether or not there is a linear

relationship between x and y.

Consequently, it is important to assess how well

the linear model fits the data.

Several methods are used to assess the model:

Testing and/or estimating the coefficients.

Using descriptive measurements.

13

Sum of squares for errors

This is the sum of differences between the points

and the regression line.

It can serve as a measure of how well the line fits the

n

data. SSE ( y i y i ) 2 .

i 1

cov( X , Y )

SSE (n 1)s 2Y

s 2x

technique we employ to assess the model.

14

Standard error of estimate

The mean error is equal to zero.

If s is small the errors tend to be close to zero

(close to the mean error). Then, the model fits the

data well.

Therefore, we can, use s as a measure of the

suitability of using a linear model.

An unbiased estimator of s2 is given by s2

S tan dard Error of Estimate

SSE

s

n2 15

Example 17.2

Calculate the standard error of estimate for example

17.1, and describe what does it tell you about the

model fit?

Solution

s 2Y

( y i y i ) 2

6,434,890

64,999

Calculated before

n 1 99

2

cov( X , Y ) ( 1,356,256 )

SSE (n 1)s 2Y 2

99(64,999) 2,252,363

sx 43,528,688

Thus, It is hard to assess the model based

SSE 2,251,363 on s even when compared with the

s 151.6

n2 98 mean value of y.

s 151.6, y 5,411.4 16

Testing the slope

When no linear relationship exists between two

variables, the regression line should be horizontal.

q

q

q

qq q

q

q

q q

q q

Different inputs (x) yield Different inputs (x) yield

different outputs (y). the same output (y).

17

We can draw inference about b1 from b1 by testing

H0: b1 = 0

H1: b1 = 0 (or < 0,or > 0)

The test statistic is

b1 b1 s

t where s b1

s b1 (n 1)s 2x

The standard error of b1.

Student t distribution with d.f. = n-2.

18

Solution

Solving by hand

To compute t we need the values of b1 and sb1.

b1 .312

s 151.6

s b1 .00231

(n 1)s 2x (99)(43,528,688

b1 b1 .312 0 There is overwhelming evidence to infer

t 13.49

s b1 .00231 that the odometer reading affects the

auction selling price.

Using the computer

Coefficients Standard Error t Stat P-value

Intercept 6533.383035 84.51232199 77.30687 1.22E-89

Odometer -0.031157739 0.002308896 -13.4947 4.44E-24

19

Coefficient of determination

When we want to measure the strength of the linear

relationship, we use the coefficient of determination.

2

[cov( X , Y )] SSE

R

2

or R 1

2

s x2 s 2y ( yi y) 2

20

To understand the significance of this coefficient

note:

Overall variability in y

The error

21

Two data points (x1,y1) and (x2,y2) of a certain sample are shown.

y2

y1

x1 x2

Total variation in y = Variation explained by the + Unexplained variation (error)

regression line)

( y1 y ) 2 ( y 2 y ) 2 (y1 y) 2 (y 2 y) 2 (y1 y1 ) 2 (y 2 y 2 ) 2

22

Variation in y = SSR + SSE

that is explained by the variation in x.

2

R 1

SSE

( y i y ) 2 SSE

SSR

(y i y) 2

(y y)

i

2

(y i y)2

R2 = 1: Perfect match between the line and the data points.

R2 = 0: There are no linear relationship between x and y.

23

Example 17.4

Find the coefficient of determination for example 17.1;

what does this statistic tell you about the model?

Solution

2 [cov( X, Y)]2 [ 1,356 ,256 ]2

Solving by hand; R ( 43 ,528 ,688 )( 64 ,999 )

.6501

s 2x s 2y

Using the computer

From the regression output we have

Regression Statistics

Multiple R 0.8063 65% of the variation in the auction

R Square 0.6501 selling price is explained by the

Adjusted R Square 0.6466 variation in odometer reading. The

rest (35%) remains unexplained by

Standard Error 151.57 this model.

Observations 100 24

17.6 Finance Application: Market Model

regression is the market model.

It is assumed that rate of return on a stock (R) is

linearly related to the rate of return on the overall

market.

R = b0 + b1Rm +

Rate of return on a particular stock Rate of return on some major stock index

The beta coefficient measures how sensitive the stocks rate

of return is to changes in the level of the overall market.

25

Example 17.5 The market model

SUMMARY OUTPUT

Estimate the market model for

Regression Statistics Nortel, a stock traded in the

Multiple R 0.560079 Toronto Stock Exchange.

R Square 0.313688

Adjusted R Square

0.301855 Data consisted of monthly

Standard Error

0.063123 percentage return for Nortel

Observations 60

and monthly percentage return

ANOVA for all the stocks.

This is a measure of the dfstocks SS This is aMS measure of the

F total risk embedded

Significance F

marketRegression

related risk. In this sample,

1 in the Nortel

0.10563 0.10563stock,26.50969

that is market-related.

3.27E-06

Residual 58 0.231105 0.003985

for each 1% increase in the TSE Specifically, 31.37% of the variation in Nortels

Total 59 0.336734

return, the average increase in return are explained by the variation in the

Nortels return is .8877%.

CoefficientsStandard TSEs

Error returns.

t Stat P-value

Intercept 0.012818 0.008223 1.558903 0.12446

TSE 0.887691 0.172409 5.148756 3.27E-06

26

17.7 Using the Regression Equation

Before using the regression model, we need to

assess how well it fits the data.

If we are satisfied with how well the model fits

the data, we can use it to make predictions for y.

Illustration

Predict the selling price of a three-year-old Taurus

with 40,000 miles on the odometer (Example 17.1).

y 6533 .0312x 6533 .0312( 40,000) 5,285

27

Prediction interval and confidence interval

Two intervals can be used to discover how closely

the predicted value will match the true value of y.

Prediction interval - for a particular value of y,

Confidence interval - for the expected value of y.

The prediction interval The confidence interval

1 ( x g x) 2 1 ( x g x) 2

y t 2 s 1 y t 2 s

n ( x i x) 2 n ( x i x) 2

28

Example 17.6 interval estimates for the car

auction price

Provide an interval estimate for the bidding price on a

Ford Taurus with 40,000 miles on the odometer.

Solution

The dealer would like to predict the price of a single car

1 ( x g x) 2

y t 2 s 1

n ( x i x) 2

t.025,98

1 ( 40,000 36,009)2

[ 6533 .0312( 40000)] 1.984(151.6) 1 5,285 303

100

4,309,340,160 29

The car dealer wants to bid on a lot of 250 Ford

Tauruses, where each car has been driven for about

40,000 miles.

Solution

The dealer needs to estimate the mean price per car.

1 ( x g x)2

The confidence interval (95%) = y t 2 s

n

( x i x)2

1 ( 40,000 36,009) 2

[ 6533 .0312( 40000)] 1.984(151.6) 5,285 35

100

4,309,340,160

30

The effect of the given value of x on the interval

As xg moves away from x the interval becomes

longer. That is, the shortest interval is found at x.

y b0 b1x g 1 ( x x ) 2

The yconfidence

t 2 s interval

g

when xg =nx

( x x)2

i

y( x g x 1)

y( x g x 1) The yconfidence 1 12

t 2 s interval

x

when xg = n 1

( xi x)2

x 2 x 1 x 1 x 2 1

The confidence interval 22

x y t 2 s

( x( x2)1)xx21 ( x 12))xx 12

when xg = xn 2

( xi x)2

31

17.8 Coefficient of correlation

The coefficient of correlation is used to measure the

strength of association between two variables.

The coefficient values range between -1 and 1.

If r = -1 (negative association) or r = +1 (positive

association) every point falls on the regression line.

If r = 0 there is no linear pattern.

The coefficient can be used to test for linear

relationship between two variables.

32

Testing the coefficient of correlation

When there are no linear relationship between two

variables, r = 0.

The hypotheses are:

H0: r = 0 Y

H1: r = 0

X

The test statistic is:

The statistic is Student t distributed

n2

t r with d.f. = n - 2, provided the variables

1 r2 are bivariate normally distributed.

cov (X, Y)

calculated by r

sxsy 33

Example 17.7 Testing for linear relationship

relationship exists in the data of example 17.1.

Solution

We test H0: r = 0 The value of the t statistic is

H1: r 0. n2

t r 13.49

2

Solving by hand: 1 r

The rejection region is Conclusion: There is sufficient

|t| > t/2,n-2 = t.025,98 = 1.984 or so. evidence at = 5% to infer that

The sample coefficient of there are linear relationship

between the two variables.

correlation r=cov(X,Y)/sxsy=-.806

34

Spearman rank correlation coefficient

relationship exists between variables in cases where

at least one variable is ranked, or

both variables are quantitative but the normality

requirement is not satisfied

35

The hypotheses are:

H0: rs = 0

H1: rs = 0

The test statistic is

cov (a, b)

rs

s a sb

a and b are the ranks of the data.

For a large sample (n > 30) rs is approximately

normally distributed

z rs n 1

36

Example 17.8

relationship between

aptitude test score given prior to hiring, and

performance rating three months after starting work.

A random sample of 20 production workers was

selected. The test scores as well as performance

rating was recorded.

37

Aptitude Performance

Employee test rating Solution

1 59 3

2 47 2

The problem objective is

3 58 4 to analyze the

4 66 3 relationship between

5 77 2

. . . two variables.

. . .

Performance rating is

. . .

Scores range from 0 to 100 Scores range from 1 to 5 ranked.

The hypotheses are: The test statistic is rs,

H0: rs = 0 and the rejection region

H1: rs = 0 is |rs| > rcritical (taken from

the Spearman rank

correlation table). 38

Aptitude Performance

Employee test Rank(a) rating Rank(b)

1 59 9 3 10.5

2 47 3 2 3.5 Ties are broken

3 58 8 4 17 by averaging the

4 66 14 3 10.5 ranks.

5 77 20 2 3.5

. . . . .

. . . . .

. . . . .

Conclusion:

Solving by hand

Do not

Rank each

reject thevariable separately.At 5% significance

null hypothesis.

level there is insufficient

Calculate sa = 5.92; sevidence to infer that the

b =5.50; cov(a,b) = 12.34

two variable

Thus rs =are related tos one

cov(a,b)/[s another.

a b] = .379.

The critical value for = .05 and n = 20 is .450. 39

17.9 Regression Diagnostics - I

The three conditions required for the validity of

the regression analysis are:

the error variable is normally distributed.

the error variance is constant for all values of x.

The errors are independent of each other.

How can we diagnose violations of these

conditions?

40

Residual Analysis

we can identify violations of the required conditions

Example 17.1 - continued

Nonnormality.

Use Excel to obtain the standardized residual histogram.

Examine the histogram and look for a bell shaped diagram with

mean close to zero.

41

RESIDUAL OUTPUT A Partial list of For each residual we calculate

Standard residuals the standard deviation as follows:

Observation Residuals Standard Residuals

1 -50.45749927 -0.334595895 sri s 1 hi w here

2 -77.82496482 -0.516076186

3 -97.33039568 -0.645421421 1 ( xi x)2

hi

4

5

223.2070978

238.4730715

1.480140312

1.58137268

n

( x j x)2

Standardized residual i =

Residual i / Standard deviation

40

30

20

We can also apply the Lilliefors test

10 or the c2 test of normality.

0

-2.5 -1.5 -0.5 0.5 1.5 2.5 More

42

Heteroscedasticity

violated we have heteroscedasticity.

+

^y

++

Residual

+

+ + + ++

+

+ + + ++ + +

+ + + +

+ + + ++ +

+ + + + y^

+ + ++ +

+ + +

+ + ++

+ + ++

43

When the requirement of a constant variance is

not violated we have homoscedasticity.

+

^y

++

Residual

+ +

+ + + ++

+

+ + + +

+ ++ + +

+ +

+ + + ++ ++ +

+ + + y^ ++

+ + + ++ +

+ + + + +

+ +++

+ +++

+

The spread of the data points

does not change much.

44

When the requirement of a constant variance is

not violated we have homoscedasticity.

++

^y ++ +

++ ++

Residual

+ +++

+ + +++ +

+ +++

+ + + +

+ ++ +

+ + +

+ ++

+ + + + ++

+ + y^ +

+ + +

+ + + ++ +

+ ++

+ ++

As far as the even spread, this is

a much better situation

45

Nonindependence of error variables

over time.

Examining the residuals over time, no pattern should

be observed if the errors are independent.

When a pattern is detected, the errors are said to be

autocorrelated.

Autocorrelation can be detected by graphing the

residuals against time.

46

Patterns in the appearance of the residuals

over time indicates that autocorrelation exists.

Residual Residual

+

+ ++

+

+ + +

+ + +

0 + 0 + +

+ Time Time

+ + + + + +

+

+

+ ++ +

+

Note the runs of positive residuals, Note the oscillating behavior of the

replaced by runs of negative residuals residuals around zero.

47

Outliers

An outlier is an observation that is unusually small or

large.

Several possibilities need to be investigated when an

outlier is observed:

There was an error in recording the value.

The point does not belong in the sample.

The observation is valid.

Identify outliers from the scatter diagram.

It is customary to suspect an observation is an outlier if

its |standard residual| > 2 48

An outlier An influential observation

+++++++++++

+ +

+ but, some outliers

+ +

+ +

may be very influential

+

+ + + +

+

+ +

+

in the regression line

49

Procedure for regression diagnostics

Gather data for the two variables in the model.

Draw the scatter diagram to determine whether a

linear model appears to be appropriate.

Check the required conditions for the errors.

Assess the model fit.

If the model fits the data, use the regression

equation.

50

- ECON 330- Practice QuestionsUploaded byRehan Hasan
- Applied RegressionUploaded bybijugeorge1
- Scatter DiagramUploaded byVivek Garg
- Paired Sample T-TestUploaded byAre Meer
- Statistical Interpretation and Discussion for a set of resultsUploaded byAmmaar Sayyid
- Leadership and Hierarchical Level (for LODJ - Resubmission)Uploaded bymc170201936 Aimal Khushal
- Cap 16 Construccion de ModelosUploaded byJorge Enriquez
- Correlational ResearchUploaded byAnnaliza Gando
- General Regression AnalysisUploaded byAbir Majumdar
- polyfit.pdfUploaded bydarafeh
- Multiple Linear RegressionUploaded byMarlene G Padigos
- Regression and CorrelationUploaded byApporva Malik
- Cph 14Uploaded bySulistyono
- RegannA14NOV05Refresh21OCT10Uploaded byFelipe Fernandez
- MBA2038 1T10 Wk4 ForecastingUploaded byChristine Cueva
- Development Of Mathematical Models To Forecasting The Monthly PrecipitationUploaded byAJER JOURNAL
- RegressionUploaded byhanik i
- StatUploaded bymujie
- QMM_EPGDM_5.pptxUploaded bymanish gupta
- RegressionUploaded byPhemmy Mushroom
- OutputUploaded byFarhatunnisa Afrilliana
- elmesUploaded by1234khus
- virohUploaded byVerlind Angel's
- Basic ConceptsUploaded byhsrinivas_7
- CorrelationUploaded byNurul Ikhsan
- 20170227110204 CorrelationUploaded bydaissy
- 7.Nadia FinalUploaded byGomiz Eddiez
- New Report of music for the massesUploaded byneil stag
- ECO 231 Empirical Project Mets AttendanceUploaded bymulcahk1
- 1-s2.0-S0889540609009329-mainUploaded byFiorella Loli Pinto

- A Trust-Based Approach to Control Privacy Exposure InUploaded bytesfu
- nazaritalooki2013Uploaded bytesfu
- Digital EditingUploaded bytesfu
- Project WorkUploaded bytesfu
- quiz 5nUploaded bytesfu
- quiz2Uploaded bytesfu
- Beam Stiff Ness Calculation @BUploaded bytesfu
- beam stiff ness calculation.xlsxUploaded bytesfu
- BERE.xlsxUploaded bytesfu
- CncUploaded bytesfu
- Java NotesUploaded bytesfu
- ReadmeUploaded bytesfu
- res.titleUploaded bytesfu
- Daily Schedule Planner 02Uploaded bytesfu
- Cover PageUploaded bytesfu
- Stephen H.docxUploaded bytesfu
- ProfilingUploaded bytesfu
- ToUploaded bytesfu
- Reference for AecUploaded bytesfu
- DocumentUploaded bytesfu
- Tomasulo2Uploaded bytesfu
- frilab12hexUploaded bytesfu
- How to ns-2Uploaded bytesfu
- 41-5 Iind Ed Ece Pwr Electronics Ch 2-2 Apllications of Semiconductor DevicesUploaded bytesfu
- Assignment1 Network Based Scientific CalculatoreUploaded bytesfu
- Lect 3Uploaded byAkanksha Batra
- Daily Schedule Planner 02Uploaded bytesfu
- Chapter 2 OsUploaded bytesfu
- Unit4-Optimal Network Design of Communication NetwoksUploaded bytesfu
- Unit5-Energy Efficiency in Wireless Networks.pptUploaded bytesfu

- How We Say Numbers and Symbols in EnglishUploaded byTim Cooper
- SUBGRADE CBRUploaded byOlaolu Olalekan
- A Few Remarks on Some Key Factors in Analysing Source Texts. a Response to Anna TrosborgUploaded byMatin
- 3. Research Objectives and HypothesisUploaded byGinoong Lenino Marxismo
- IMPROVING THE MF-TDMA SPECTRAL EFFICIENCY THROUGH SYMBOL SYNCHRONOUS OPERATIONUploaded byVinod Gowda
- Lecture 5 OOP ExercisesUploaded byMina Fawzy
- Workability of Fresh Concrete by Slump TestUploaded bylearnafren
- Corrosion Inhibitor Advice _ Ready to Buy Online ChemicalsUploaded byAhmed Taher
- Principles of Semiconductor Devices - ZeghbroeckUploaded byapi-3717353
- m 2000003Uploaded byVinayak Bhatt
- Mass, Volume and Density without NumbersUploaded byKeith Josephs
- Nanotech SyllabusUploaded byserf007
- Transformational Change VsUploaded bynorousta
- MEC442_KJM492 (9)Uploaded byafnanhanany
- 7. Seminar Report - CopyUploaded byVinu Vikky
- Ceo Profile PptUploaded byValusha Saldanha
- scanpath_cad99Uploaded byirinuca12
- A653.(11)Uploaded byCarlos Ramirez Baltazar
- 3D Woven FabricUploaded byCrystal Newman
- Recoverability for Holevo's just-as-good fidelityUploaded byMark M. Wilde
- 10.1.1.406.9574Uploaded byvita
- Revised Membership Guidelines and Application FormUploaded byvetteyap
- Www.sefindiaUploaded bySivaAgathamudi
- Forces and Motion TestUploaded byBrennen
- Design 9Uploaded byemaster1
- Test 4 Review f18Uploaded byMaitri Barot
- ch03Uploaded byNancy Rah
- Application Form for KEF ( for Fellows)Uploaded byranjithc1980
- HTML (2)Uploaded byचन्द्र विलाश भूर्तेल
- Is Code for FurnitureUploaded byFaisalAhmedShaikh