Simple Linear Regression and Correlation
A model with one independent variable, x, and one dependent variable, y, of the form y = β0 + β1x + ε is referred to as simple linear regression. We are interested in estimating β0 and β1 from the data we collect.
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.
[Figure: scatter plot of house price versus house size]
[Figure: scatter plots of scores on Test 1 versus Test 2 and on Test B1 versus Test B2]
[Figure: scatter plots of weight versus height]
Deterministic Model: an equation or set of equations that allows us to fully determine the value of the dependent variable from the values of the independent variables. For example:
House price: y = $25,000 + ($75/ft²)(x)
Area of a circle: A = πr²
Probabilistic Model: a method used to capture the randomness that is part of a real-life process.
y = 25,000 + 75x + ε
e.g., do all houses of the same size (measured in square feet) sell for exactly the same price?
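A minimal sketch of the difference between the two models, not from the original slides: the error standard deviation of $20,000 and the house sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_price(size_sqft):
    """Deterministic model: price is fully determined by house size."""
    return 25_000 + 75 * np.asarray(size_sqft)

def probabilistic_price(size_sqft, sigma=20_000):
    """Probabilistic model: same line plus a random error term epsilon.
    sigma is an assumed error standard deviation, for illustration only."""
    epsilon = rng.normal(loc=0.0, scale=sigma, size=np.shape(size_sqft))
    return 25_000 + 75 * np.asarray(size_sqft) + epsilon

sizes = np.array([2000, 2000, 2000])      # three houses of identical size
print(deterministic_price(sizes))         # identical prices: 175000 each
print(probabilistic_price(sizes))         # three different selling prices
```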
The first-order linear model: y = β0 + β1x + ε, where y is the dependent variable, x is the independent variable, β0 is the y-intercept, β1 is the slope of the line, and ε is the error variable.
This is an application of the least squares method: it produces the straight line that minimizes the sum of the squared differences between the points and the line.
[Figure: data points, fitted line, and the errors (vertical differences) minimized by least squares]
Least Squares Line
See if you can estimate the y-intercept and slope from these data.
Recall: Data → Statistics → Information
Data points:
x: 1  2  3  4  5  6
y: 6  1  9  5  17  12
Least squares line: y = .934 + 2.114x
Least Squares Line (continued)

 X      Y     Y − Ȳ    (X − X̄)(Y − Ȳ)   (X − X̄)²
 1      6    −2.333         5.833          6.250
 2      1    −7.333        11.000          2.250
 3      9     0.667        −0.333          0.250
 4      5    −3.333        −1.667          0.250
 5     17     8.667        13.000          2.250
 6     12     3.667         9.167          6.250
Sum    50     0.000        37.000         17.500

From the column totals: b1 = 37.000 / 17.500 = 2.114 and b0 = Ȳ − b1·X̄ = 8.333 − (2.114)(3.5) = .934.
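The worked table above can be reproduced in a few lines of code. This is a minimal sketch using NumPy (not part of the original slides); the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

x_bar, y_bar = x.mean(), y.mean()

# Sums of squares and cross-products, as in the worked table
S_xy = np.sum((x - x_bar) * (y - y_bar))   # 37.0
S_xx = np.sum((x - x_bar) ** 2)            # 17.5

b1 = S_xy / S_xx                           # slope ≈ 2.114
b0 = y_bar - b1 * x_bar                    # intercept ≈ 0.934
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```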
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.7007
R Square            0.4910   ← the proportion of the variation in Y that can be explained by your regression model
Adjusted R Square   0.3637
Standard Error      4.5029   ← will use later
Observations        6

ANOVA
              df      SS            MS            F             Significance F
Regression     1      78.22857143   78.22857143   3.858149366   0.120968388
Residual       4      81.1047619    20.27619048
Total          5     159.3333333

Significance F is the same as the p-value for H0: the regression model is "no good."

              Coefficients   Standard Error   t Stat        P-value
Intercept     0.933333333    4.19198025       0.222647359   0.834716871
X Variable 1  2.114285714    1.076401159      1.96421724    0.120968388   ← tests H0: β1 = 0
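The same summary numbers can be reproduced outside Excel. A minimal sketch using scipy.stats.linregress (the use of SciPy is an assumption; the slides themselves use Excel):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

res = stats.linregress(x, y)
print(f"Intercept      {res.intercept:.4f}")   # 0.9333
print(f"Slope (X Var)  {res.slope:.4f}")       # 2.1143
print(f"Multiple R     {res.rvalue:.4f}")      # 0.7007
print(f"R Square       {res.rvalue ** 2:.4f}") # 0.4910
print(f"p-value        {res.pvalue:.4f}")      # 0.1210 (same as Significance F)
print(f"Slope std err  {res.stderr:.4f}")      # 1.0764
```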
Excel: Plotted Regression Model
You will need to play around with the chart options to get the plot to look good.
[Figure: scatter plot of Y and Predicted Y]
Required Conditions
For these regression methods to be valid, the following four conditions for the error variable (ε) must be met:
The probability distribution of ε is normal.
The mean of the distribution is 0; that is, E(ε) = 0.
The standard deviation of ε is σε, which is a constant regardless of the value of x.
The value of ε associated with any particular value of y is independent of the ε associated with any other value of y.
Standard Error
The standard error of estimate is sε = √(SSE/(n − 2)). If sε is small, the fit is excellent and the linear model should be used for forecasting. If sε is large, the model is poor.
But what is small and what is large?
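A minimal sketch computing the standard error of estimate for the six-point data set used earlier; the result matches the "Standard Error" in the Excel output above.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
SSE = np.sum(residuals ** 2)           # ≈ 81.105
s_eps = np.sqrt(SSE / (len(x) - 2))    # ≈ 4.503, the "Standard Error" in the Excel output
print(SSE, s_eps)
```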
Standard Error
Judge the value of sε by comparing it to the sample mean of the dependent variable (ȳ). In this example, sε = .3265 and ȳ = 14.841, so (relatively speaking) sε appears to be small; hence our linear regression model of car price as a function of odometer reading is good.
Testing the Slope (Excel output does this for you)
If no linear relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have a slope of zero.
We want to see if there is a linear relationship, i.e., we want to see if the slope (β1) is something other than zero. Our research hypothesis becomes:
H1: β1 ≠ 0
Thus the null hypothesis becomes:
H0: β1 = 0 (already discussed!)
If the error variable (ε) is normally distributed, the test statistic t = (b1 − β1)/s_b1 has a Student t-distribution with n − 2 degrees of freedom. The rejection region depends on whether we're doing a one- or two-tail test (a two-tail test is most typical).
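A minimal sketch of this test for the six-point data set used earlier (the numbers match the Excel output shown above); SciPy is assumed to be available.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)
n = len(x)

S_xx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
b0 = y.mean() - b1 * x.mean()

SSE = np.sum((y - (b0 + b1 * x)) ** 2)
s_eps = np.sqrt(SSE / (n - 2))           # standard error of estimate
s_b1 = s_eps / np.sqrt(S_xx)             # standard error of the slope ≈ 1.0764

t_stat = b1 / s_b1                                 # under H0: beta1 = 0, t ≈ 1.964
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-tail p-value ≈ 0.121
print(t_stat, p_value)
```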
Example 17.4
Test to determine if the slope is significantly different from 0 (at the 5% significance level). We want to test:
H1: β1 ≠ 0
H0: β1 = 0 (if the null hypothesis is true, no linear relationship exists)
The rejection region is: |t| > tα/2, n−2 = t.025, n−2 = 1.984
Example 17.4
COMPUTE
[Excel regression output for Example 17.4; the annotation points to the p-value for the Odometer coefficient]
We see that the t statistic for odometer (i.e., the slope, b1) is −13.49, whose absolute value is greater than t-critical = 1.984. We also note that the p-value is 0.000. There is overwhelming evidence to infer that a linear relationship between odometer reading and price exists.
Hence, applying the confidence interval estimator of the slope, b1 ± tα/2, n−2 · s_b1:
we estimate that the slope coefficient lies between −.0768 and −.0570.
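For the small data set used earlier, the same interval estimator can be computed directly. A minimal sketch; the 95% confidence level is an assumption.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

res = stats.linregress(x, y)
t_crit = stats.t.ppf(0.975, df=len(x) - 2)     # two-tail, 95% confidence
lcl = res.slope - t_crit * res.stderr
ucl = res.slope + t_crit * res.stderr
print(f"b1 = {res.slope:.3f}, 95% CI: ({lcl:.3f}, {ucl:.3f})")   # roughly -0.87 to 5.10
```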
Coefficient of Determination
Tests thus far have shown whether a linear relationship exists; it is also useful to measure the strength of the relationship. This is done by calculating the coefficient of determination, R².
The coefficient of determination is the square of the coefficient of correlation (r); hence R² = r². (r will be computed shortly; this relationship holds only for models with one independent variable.)
Coefficient of Determination
R² has a value of .6483. This means that 64.83% of the variation in the auction selling prices (y) is explained by your regression model; the remaining 35.17% is unexplained, i.e., due to error.
Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions. In general, the higher the value of R², the better the model fits the data.
R² = 1: perfect match between the line and the data points.
R² = 0: no linear relationship between x and y.
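For the six-point example, R² can be computed from the sums of squares. A minimal sketch (the value matches the Excel output shown earlier):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SSE = np.sum((y - y_hat) ** 2)      # unexplained variation
SST = np.sum((y - y.mean()) ** 2)   # total variation
R2 = 1 - SSE / SST                  # ≈ 0.4910
r = np.corrcoef(x, y)[0, 1]         # coefficient of correlation ≈ 0.7007
print(R2, r ** 2)                   # R² equals r² for one independent variable
```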
ANOVA table for simple linear regression:

Source        degrees of freedom   Sums of Squares   Mean Squares        F-Statistic
Regression           1                  SSR          MSR = SSR/1         F = MSR/MSE
Residual           n − 2                SSE          MSE = SSE/(n − 2)
Total              n − 1                SST
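A minimal sketch that builds the ANOVA quantities for the six-point data set; F and its p-value match the Excel output above. SciPy is assumed for the F-distribution.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)   # 159.333
SSE = np.sum((y - y_hat) ** 2)      #  81.105
SSR = SST - SSE                     #  78.229

MSR = SSR / 1
MSE = SSE / (n - 2)
F = MSR / MSE                       # ≈ 3.858
p = stats.f.sf(F, 1, n - 2)         # ≈ 0.121, the "Significance F"
print(F, p)
```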
Prediction Interval
The prediction interval is used when we want to predict one particular value of the dependent variable, given a specific value of the independent variable:
ŷ ± tα/2, n−2 · sε · √(1 + 1/n + (xg − x̄)² / Σ(xi − x̄)²)
Confidence Interval Estimator for Mean of Y
The confidence interval estimate for the expected value of y (mean of Y) is used when we want to predict an interval we are pretty sure contains the true regression line. In this case, we are estimating the mean of y given a value of x:
ŷ ± tα/2, n−2 · sε · √(1/n + (xg − x̄)² / Σ(xi − x̄)²)
(Technically this formula is used for infinitely large populations. However, we can interpret our problem as attempting to determine the average selling price of all Ford Tauruses, all with 40,000 miles on the odometer)
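A minimal sketch computing both intervals for the six-point data set at an arbitrary illustration value xg = 4; both the value of xg and the 95% level are assumptions.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)
n = len(x)

S_xx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
b0 = y.mean() - b1 * x.mean()
s_eps = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
t_crit = stats.t.ppf(0.975, df=n - 2)

x_g = 4.0                             # the given value of x (illustrative)
y_hat = b0 + b1 * x_g

# Prediction interval for one particular value of y
half_pi = t_crit * s_eps * np.sqrt(1 + 1/n + (x_g - x.mean()) ** 2 / S_xx)
# Confidence interval for the mean of y
half_ci = t_crit * s_eps * np.sqrt(1/n + (x_g - x.mean()) ** 2 / S_xx)

print(f"prediction interval: {y_hat:.2f} +/- {half_pi:.2f}")
print(f"CI for mean of y:    {y_hat:.2f} +/- {half_ci:.2f}")  # narrower, as noted below
```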
The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level. This is because there is less error in estimating a mean value as opposed to predicting an individual value.
Regression Diagnostics
There are three conditions that are required in order to perform a regression analysis:
The error variable must be normally distributed,
The error variable must have a constant variance, and
The errors must be independent of each other.
How can we diagnose violations of these conditions? Residual analysis, that is, examining the differences between the actual data points and those predicted by the linear equation.
Nonnormality
We can take the residuals and put them into a histogram to visually check for normality;
we're looking for a bell-shaped histogram with the mean close to zero [our old test for normality].
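A minimal sketch of that check, using the residuals of the six-point fit and matplotlib (an assumption; the slides use Excel). With only six observations the histogram is purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

plt.hist(residuals, bins=5)        # look for a roughly bell-shaped histogram
plt.axvline(0, linestyle="--")     # centred near zero
plt.xlabel("residual")
plt.ylabel("frequency")
plt.show()
```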
Heteroscedasticity
When the requirement of a constant variance is violated, we have a condition of heteroscedasticity.
Heteroscedasticity
If the variance of the error variable (ε) is not constant, then we have heteroscedasticity. Here's the plot of the residuals against the predicted values of y:
[Figure: residuals plotted against the predicted values of y]
There doesn't appear to be a change in the spread of the plotted points, therefore no heteroscedasticity.
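A minimal sketch of that residual-versus-predicted plot for the six-point fit (matplotlib assumed; variable names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
residuals = y - y_hat

# A funnel shape (spread growing or shrinking with y-hat) would suggest heteroscedasticity.
plt.scatter(y_hat, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("predicted y")
plt.ylabel("residual")
plt.show()
```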
Outliers
Possible reasons for the existence of outliers include:
There was an error in recording the value.
The point should not have been included in the sample.
Perhaps the observation is indeed valid.
Outliers can be easily identified from a scatter plot. If the absolute value of the standardized residual is > 2, we suspect the point may be an outlier and investigate further (see the sketch below). Outliers need to be dealt with, since they can easily influence the least squares line.
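A minimal sketch of the standardized-residual screen for the six-point data set. Standardizing by sε alone is one common convention; that choice is an assumption, since the slides do not show the formula.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)
s_eps = np.sqrt(np.sum(residuals ** 2) / (n - 2))

standardized = residuals / s_eps                     # simple standardization by s_eps
suspects = np.where(np.abs(standardized) > 2)[0]
print(standardized.round(2))
print("possible outliers at indices:", suspects)     # none flagged in this tiny example
```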