Simple Linear Regression and Correlation
A model with one independent variable, x, and one dependent variable, y, of the form y = β0 + β1x + ε is referred to as simple linear regression. We are interested in estimating β0 and β1 from the data we collect.
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.
[Figure: scatter plot of house price versus house size]
[Figure: scatter plots of scores on Test 1 versus Test 2 and on Test B1 versus Test B2]
[Figure: scatter plots of weight versus height]
Deterministic Model: an equation or set of equations that allows us to fully determine the value of the dependent variable from the values of the independent variables. For example:
House price: y = $25,000 + ($75/ft²)(x)
Area of a circle: A = πr²
Probabilistic Model: a method used to capture the randomness that is part of a real-life process.
y = 25,000 + 75x + ε
e.g., do all houses of the same size (measured in square feet) sell for exactly the same price?
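A minimal sketch of the difference between the two models, not from the original slides: the error standard deviation of $20,000 and the house sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_price(size_sqft):
    """Deterministic model: price is fully determined by house size."""
    return 25_000 + 75 * np.asarray(size_sqft)

def probabilistic_price(size_sqft, sigma=20_000):
    """Probabilistic model: same line plus a random error term epsilon.
    sigma is an assumed error standard deviation, for illustration only."""
    epsilon = rng.normal(loc=0.0, scale=sigma, size=np.shape(size_sqft))
    return 25_000 + 75 * np.asarray(size_sqft) + epsilon

sizes = np.array([2000, 2000, 2000])      # three houses of identical size
print(deterministic_price(sizes))         # identical prices: 175000 each
print(probabilistic_price(sizes))         # three different selling prices
```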
The first-order linear model: y = β0 + β1x + ε, where y is the dependent variable, x is the independent variable, β0 is the y-intercept, β1 is the slope of the line, and ε is the error variable.
This is an application of the least squares method: it produces the straight line that minimizes the sum of the squared differences between the points and the line.
[Figure: data points, fitted line, and the errors (vertical differences) minimized by least squares]
Least Squares Line
See if you can estimate the y-intercept and slope from these data.
Recall: Data → Statistics → Information
Data points:
x: 1  2  3  4  5  6
y: 6  1  9  5  17  12
Least squares line: y = .934 + 2.114x
Least Squares Line (continued)

 X      Y     Y − Ȳ    (X − X̄)(Y − Ȳ)   (X − X̄)²
 1      6    −2.333         5.833          6.250
 2      1    −7.333        11.000          2.250
 3      9     0.667        −0.333          0.250
 4      5    −3.333        −1.667          0.250
 5     17     8.667        13.000          2.250
 6     12     3.667         9.167          6.250
Sum    50     0.000        37.000         17.500

From the column totals: b1 = 37.000 / 17.500 = 2.114 and b0 = Ȳ − b1·X̄ = 8.333 − (2.114)(3.5) = .934.
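The worked table above can be reproduced in a few lines of code. This is a minimal sketch using NumPy (not part of the original slides); the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

x_bar, y_bar = x.mean(), y.mean()

# Sums of squares and cross-products, as in the worked table
S_xy = np.sum((x - x_bar) * (y - y_bar))   # 37.0
S_xx = np.sum((x - x_bar) ** 2)            # 17.5

b1 = S_xy / S_xx                           # slope ≈ 2.114
b0 = y_bar - b1 * x_bar                    # intercept ≈ 0.934
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```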
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.7007
R Square            0.4910   ← the proportion of the variation in Y that can be explained by your regression model
Adjusted R Square   0.3637
Standard Error      4.5029   ← will use later
Observations        6

ANOVA
              df      SS            MS            F             Significance F
Regression     1      78.22857143   78.22857143   3.858149366   0.120968388
Residual       4      81.1047619    20.27619048
Total          5     159.3333333

Significance F is the same as the p-value for H0: the regression model is "no good."

              Coefficients   Standard Error   t Stat        P-value
Intercept     0.933333333    4.19198025       0.222647359   0.834716871
X Variable 1  2.114285714    1.076401159      1.96421724    0.120968388   ← tests H0: β1 = 0
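The same summary numbers can be reproduced outside Excel. A minimal sketch using scipy.stats.linregress (the use of SciPy is an assumption; the slides themselves use Excel):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

res = stats.linregress(x, y)
print(f"Intercept      {res.intercept:.4f}")   # 0.9333
print(f"Slope (X Var)  {res.slope:.4f}")       # 2.1143
print(f"Multiple R     {res.rvalue:.4f}")      # 0.7007
print(f"R Square       {res.rvalue ** 2:.4f}") # 0.4910
print(f"p-value        {res.pvalue:.4f}")      # 0.1210 (same as Significance F)
print(f"Slope std err  {res.stderr:.4f}")      # 1.0764
```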
Excel: Plotted Regression Model
You will need to play around with the chart options to get the plot to look good.
[Figure: scatter plot of Y and Predicted Y]
Required Conditions
For these regression methods to be valid, the following four conditions for the error variable (ε) must be met:
The probability distribution of ε is normal.
The mean of the distribution is 0; that is, E(ε) = 0.
The standard deviation of ε is σε, which is a constant regardless of the value of x.
The value of ε associated with any particular value of y is independent of the ε associated with any other value of y.
Standard Error
The standard error of estimate is sε = √(SSE/(n − 2)). If sε is small, the fit is excellent and the linear model should be used for forecasting. If sε is large, the model is poor.
But what is small and what is large?
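A minimal sketch computing the standard error of estimate for the six-point data set used earlier; the result matches the "Standard Error" in the Excel output above.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
SSE = np.sum(residuals ** 2)           # ≈ 81.105
s_eps = np.sqrt(SSE / (len(x) - 2))    # ≈ 4.503, the "Standard Error" in the Excel output
print(SSE, s_eps)
```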
Standard Error
Judge the value of sε by comparing it to the sample mean of the dependent variable (ȳ). In this example, sε = .3265 and ȳ = 14.841, so (relatively speaking) sε appears to be small; hence our linear regression model of car price as a function of odometer reading is good.
Testing the Slope (Excel output does this for you)
If no linear relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have a slope of zero.
We want to see if there is a linear relationship, i.e., we want to see if the slope (β1) is something other than zero. Our research hypothesis becomes:
H1: β1 ≠ 0
Thus the null hypothesis becomes:
H0: β1 = 0 (already discussed!)
If the error variable (ε) is normally distributed, the test statistic t = (b1 − β1)/s_b1 has a Student t-distribution with n − 2 degrees of freedom. The rejection region depends on whether we're doing a one- or two-tail test (a two-tail test is most typical).
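A minimal sketch of this test for the six-point data set used earlier (the numbers match the Excel output shown above); SciPy is assumed to be available.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)
n = len(x)

S_xx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
b0 = y.mean() - b1 * x.mean()

SSE = np.sum((y - (b0 + b1 * x)) ** 2)
s_eps = np.sqrt(SSE / (n - 2))           # standard error of estimate
s_b1 = s_eps / np.sqrt(S_xx)             # standard error of the slope ≈ 1.0764

t_stat = b1 / s_b1                                 # under H0: beta1 = 0, t ≈ 1.964
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-tail p-value ≈ 0.121
print(t_stat, p_value)
```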
Example 17.4
Test to determine if the slope is significantly different from 0 (at the 5% significance level). We want to test:
H1: β1 ≠ 0
H0: β1 = 0 (if the null hypothesis is true, no linear relationship exists)
The rejection region is: |t| > tα/2, n−2 = t.025, n−2 = 1.984
Example 17.4
COMPUTE
[Excel regression output for Example 17.4; the annotation points to the p-value for the Odometer coefficient]
We see that the t statistic for odometer (i.e., the slope, b1) is −13.49, whose absolute value is greater than t-critical = 1.984. We also note that the p-value is 0.000. There is overwhelming evidence to infer that a linear relationship between odometer reading and price exists.
Hence, applying the confidence interval estimator of the slope, b1 ± tα/2, n−2 · s_b1:
we estimate that the slope coefficient lies between −.0768 and −.0570.
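For the small data set used earlier, the same interval estimator can be computed directly. A minimal sketch; the 95% confidence level is an assumption.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

res = stats.linregress(x, y)
t_crit = stats.t.ppf(0.975, df=len(x) - 2)     # two-tail, 95% confidence
lcl = res.slope - t_crit * res.stderr
ucl = res.slope + t_crit * res.stderr
print(f"b1 = {res.slope:.3f}, 95% CI: ({lcl:.3f}, {ucl:.3f})")   # roughly -0.87 to 5.10
```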
Coefficient of Determination
Tests thus far have shown whether a linear relationship exists; it is also useful to measure the strength of the relationship. This is done by calculating the coefficient of determination, R².
The coefficient of determination is the square of the coefficient of correlation (r); hence R² = r². (r will be computed shortly; this relationship holds only for models with one independent variable.)
Coefficient of Determination
R² has a value of .6483. This means that 64.83% of the variation in the auction selling prices (y) is explained by your regression model; the remaining 35.17% is unexplained, i.e., due to error.
Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions. In general, the higher the value of R², the better the model fits the data.
R² = 1: perfect match between the line and the data points.
R² = 0: no linear relationship between x and y.
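For the six-point example, R² can be computed from the sums of squares. A minimal sketch (the value matches the Excel output shown earlier):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SSE = np.sum((y - y_hat) ** 2)      # unexplained variation
SST = np.sum((y - y.mean()) ** 2)   # total variation
R2 = 1 - SSE / SST                  # ≈ 0.4910
r = np.corrcoef(x, y)[0, 1]         # coefficient of correlation ≈ 0.7007
print(R2, r ** 2)                   # R² equals r² for one independent variable
```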
ANOVA table for simple linear regression:

Source        degrees of freedom   Sums of Squares   Mean Squares        F-Statistic
Regression           1                  SSR          MSR = SSR/1         F = MSR/MSE
Residual           n − 2                SSE          MSE = SSE/(n − 2)
Total              n − 1                SST
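A minimal sketch that builds the ANOVA quantities for the six-point data set; F and its p-value match the Excel output above. SciPy is assumed for the F-distribution.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)   # 159.333
SSE = np.sum((y - y_hat) ** 2)      #  81.105
SSR = SST - SSE                     #  78.229

MSR = SSR / 1
MSE = SSE / (n - 2)
F = MSR / MSE                       # ≈ 3.858
p = stats.f.sf(F, 1, n - 2)         # ≈ 0.121, the "Significance F"
print(F, p)
```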
Prediction Interval
The prediction interval is used when we want to predict one particular value of the dependent variable, given a specific value of the independent variable:
ŷ ± tα/2, n−2 · sε · √(1 + 1/n + (xg − x̄)² / Σ(xi − x̄)²)
Confidence Interval Estimator for Mean of Y
The confidence interval estimate for the expected value of y (mean of Y) is used when we want to predict an interval we are pretty sure contains the true regression line. In this case, we are estimating the mean of y given a value of x:
ŷ ± tα/2, n−2 · sε · √(1/n + (xg − x̄)² / Σ(xi − x̄)²)
(Technically this formula is used for infinitely large populations. However, we can interpret our problem as attempting to determine the average selling price of all Ford Tauruses, all with 40,000 miles on the odometer)
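A minimal sketch computing both intervals for the six-point data set at an arbitrary illustration value xg = 4; both the value of xg and the 95% level are assumptions.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)
n = len(x)

S_xx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
b0 = y.mean() - b1 * x.mean()
s_eps = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
t_crit = stats.t.ppf(0.975, df=n - 2)

x_g = 4.0                             # the given value of x (illustrative)
y_hat = b0 + b1 * x_g

# Prediction interval for one particular value of y
half_pi = t_crit * s_eps * np.sqrt(1 + 1/n + (x_g - x.mean()) ** 2 / S_xx)
# Confidence interval for the mean of y
half_ci = t_crit * s_eps * np.sqrt(1/n + (x_g - x.mean()) ** 2 / S_xx)

print(f"prediction interval: {y_hat:.2f} +/- {half_pi:.2f}")
print(f"CI for mean of y:    {y_hat:.2f} +/- {half_ci:.2f}")  # narrower, as noted below
```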
The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level. This is because there is less error in estimating a mean value as opposed to predicting an individual value.
Regression Diagnostics
There are three conditions that are required in order to perform a regression analysis:
The error variable must be normally distributed,
The error variable must have a constant variance, and
The errors must be independent of each other.
How can we diagnose violations of these conditions? Residual analysis, that is, examining the differences between the actual data points and those predicted by the linear equation.
Nonnormality
We can take the residuals and put them into a histogram to visually check for normality;
we're looking for a bell-shaped histogram with the mean close to zero [our old test for normality].
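A minimal sketch of that check, using the residuals of the six-point fit and matplotlib (an assumption; the slides use Excel). With only six observations the histogram is purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

plt.hist(residuals, bins=5)        # look for a roughly bell-shaped histogram
plt.axvline(0, linestyle="--")     # centred near zero
plt.xlabel("residual")
plt.ylabel("frequency")
plt.show()
```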
Heteroscedasticity
When the requirement of a constant variance is violated, we have a condition of heteroscedasticity.
Heteroscedasticity
If the variance of the error variable (ε) is not constant, then we have heteroscedasticity. Here's the plot of the residuals against the predicted values of y:
[Figure: residuals plotted against the predicted values of y]
There doesn't appear to be a change in the spread of the plotted points, therefore no heteroscedasticity.
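A minimal sketch of that residual-versus-predicted plot for the six-point fit (matplotlib assumed; variable names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
residuals = y - y_hat

# A funnel shape (spread growing or shrinking with y-hat) would suggest heteroscedasticity.
plt.scatter(y_hat, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("predicted y")
plt.ylabel("residual")
plt.show()
```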
Outliers
Possible reasons for the existence of outliers include:
There was an error in recording the value.
The point should not have been included in the sample.
Perhaps the observation is indeed valid.
Outliers can be easily identified from a scatter plot. If the absolute value of the standardized residual is > 2, we suspect the point may be an outlier and investigate further (see the sketch below). Outliers need to be dealt with, since they can easily influence the least squares line.
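A minimal sketch of the standardized-residual screen for the six-point data set. Standardizing by sε alone is one common convention; that choice is an assumption, since the slides do not show the formula.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)
s_eps = np.sqrt(np.sum(residuals ** 2) / (n - 2))

standardized = residuals / s_eps                     # simple standardization by s_eps
suspects = np.where(np.abs(standardized) > 2)[0]
print(standardized.round(2))
print("possible outliers at indices:", suspects)     # none flagged in this tiny example
```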