Correlation and
Regression
• 9.1 Correlation
• 9.2 Linear Regression
• 9.3 Measures of Regression and Prediction Intervals
• 9.4 Multiple Regression
Correlation
Correlation
• A relationship between two variables.
• The data can be represented by ordered pairs (x, y), where
  x is the independent (or explanatory) variable and
  y is the dependent (or response) variable.
Example:
x     1     2     3     4     5
y    –4    –2    –1     0     2
[Scatter plot of the (x, y) pairs listed above]
[Scatter plot panels: No Correlation; Nonlinear Correlation]
© 2012 Pearson Education, Inc. All rights reserved. 7 of 84
Example: Constructing a Scatter Plot
An economist wants to determine whether there is a linear
relationship between a country’s gross domestic product (GDP)
and carbon dioxide (CO2) emissions. The data are shown in
the table. Display the data in a scatter plot and determine whether
there appears to be a positive or negative linear correlation or no
linear correlation. (Source: World Bank and U.S. Energy Information
Administration)

GDP (trillions of $), x     CO2 emission (millions of metric tons), y
1.6                         428.2
3.6                         828.8
4.9                         1214.2
1.1                         444.6
0.9                         264.0
2.9                         415.3
2.7                         571.8
2.3                         454.9
1.6                         358.7
1.5                         573.5
Solution: Constructing a Scatter Plot
[Scatter plot: CO2 emission (millions of metric tons), y, against GDP (trillions of $), x]
From the scatter plot, it appears that the variables have a
positive linear correlation.
Correlation Coefficient
Correlation coefficient
• A measure of the strength and the direction of a linear
relationship between two variables.
• The symbol r represents the sample correlation
coefficient.
• A formula for r is
$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} $$
  where n is the number of data pairs.
[Number line marked at –1, 0, and 1]
• If r = –1, there is a perfect negative correlation.
• If r is close to 0, there is no linear correlation.
• If r = 1, there is a perfect positive correlation.
[Scatter plots illustrating different values of r]
• r = –0.91: strong negative correlation
• r = 0.88: strong positive correlation
• r = 0.42: weak positive correlation
• r = 0.07: nonlinear correlation
Calculating a Correlation Coefficient
In Words                                                   In Symbols
1. Find the sum of the x-values.                           ∑x
2. Find the sum of the y-values.                           ∑y
3. Multiply each x-value by its corresponding y-value
   and find the sum.                                       ∑xy
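The sums in these steps feed directly into the formula for r. As a minimal sketch, assuming the GDP and CO2 data from the scatter plot example, the calculation might look like this in Python (the function name `correlation_coefficient` is illustrative, not from the slides):

```python
from math import sqrt

# GDP (trillions of $) and CO2 emissions (millions of metric tons)
# from the scatter plot example.
x = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
y = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]

def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r computed from the raw sums."""
    n = len(xs)
    sum_x = sum(xs)                              # sum of x-values
    sum_y = sum(ys)                              # sum of y-values
    sum_xy = sum(a * b for a, b in zip(xs, ys))  # sum of products xy
    sum_x2 = sum(a * a for a in xs)              # sum of squared x-values
    sum_y2 = sum(b * b for b in ys)              # sum of squared y-values
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

r = correlation_coefficient(x, y)
print(f"r = {r:.3f}")
```

A value of r near 1 for these data would agree with the positive linear correlation seen in the scatter plot.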
[Table excerpt: columns labeled by level of significance; rows labeled by number of pairs of data in sample]
• Two-tailed test
H0: ρ = 0 (no significant correlation)
Ha: ρ ≠ 0 (significant correlation)
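One common way to carry out this two-tailed test is a t-test with n − 2 degrees of freedom, using the standardized test statistic t = r / sqrt((1 − r²)/(n − 2)). The sketch below is an illustration under stated assumptions: r ≈ 0.882 and n = 10 are taken to be the values for the GDP and CO2 data, and the critical value 2.306 (α = 0.05, 8 degrees of freedom) is read from a standard t-table.

```python
from math import sqrt

def t_statistic(r, n):
    """Standardized test statistic for H0: rho = 0 (n - 2 degrees of freedom)."""
    return r / sqrt((1 - r ** 2) / (n - 2))

# Assumed inputs: r is about 0.882 for the n = 10 GDP and CO2 data pairs.
r, n = 0.882, 10
# Assumed critical value from a t-table: two-tailed, alpha = 0.05, d.f. = 8.
t_critical = 2.306

t = t_statistic(r, n)
reject_h0 = abs(t) > t_critical  # reject H0 when t falls in a rejection region
print(f"t = {t:.3f}, reject H0: {reject_h0}")
```

If H0 is rejected, there is enough evidence at the chosen level of significance to conclude that the correlation is significant.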
Linear Regression
Residuals
Residual
• The difference between the observed y-value and the
predicted y-value for a given x-value on the line.
For a given x-value,
di = (observed y-value) – (predicted y-value)
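[Figure: scatter plot with a regression line. Each residual d1 through d6 is the vertical distance between an observed y-value and the predicted y-value on the line.]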
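To make the definition concrete, the following sketch computes the residuals for the GDP and CO2 data. It assumes the regression equation ŷ = 196.152x + 102.289 that these slides state for that data set; the helper name `predicted` is hypothetical.

```python
# GDP (trillions of $) and CO2 emissions (millions of metric tons).
x = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
y = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]

def predicted(x_value):
    """Predicted y-value on the line y-hat = 196.152x + 102.289 (assumed)."""
    return 196.152 * x_value + 102.289

# d_i = (observed y-value) - (predicted y-value)
residuals = [yi - predicted(xi) for xi, yi in zip(x, y)]

for xi, di in zip(x, residuals):
    print(f"x = {xi:.1f}, residual = {di:9.3f}")

# For a least-squares line the residuals sum to zero; with rounded
# coefficients the sum is only approximately zero.
print(f"sum of residuals = {sum(residuals):.4f}")
```

Positive residuals correspond to points above the line and negative residuals to points below it.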
Regression Line
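• ŷ = mx + b, where
$$ m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}, \qquad b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n} $$
• $\bar{y}$ is the mean of the y-values in the data
• $\bar{x}$ is the mean of the x-values in the data
• The regression line always passes through the point $(\bar{x}, \bar{y})$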
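The slope and intercept formulas can be checked numerically. This is a minimal sketch using the GDP and CO2 data from the scatter plot example; the function name `regression_line` is an illustrative choice.

```python
# GDP (trillions of $) and CO2 emissions (millions of metric tons).
x = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
y = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]

def regression_line(xs, ys):
    """Return slope m and y-intercept b of the line y-hat = mx + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n  # b = y-bar - m * x-bar
    return m, b

m, b = regression_line(x, y)
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

print(f"y-hat = {m:.3f}x + {b:.3f}")
# The line passes through the point (x-bar, y-bar).
print(f"line at x-bar: {m * x_bar + b:.3f}, y-bar: {y_bar:.3f}")
```

The coefficients should agree, after rounding, with the regression equation quoted for these data in the prediction example.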
[Graph of the regression line ŷ = 12.481x + 33.683]
Example: Predicting y-Values Using
Regression Equations
The regression equation for the gross domestic products
(in trillions of dollars) and carbon dioxide emissions (in
millions of metric tons) data is ŷ = 196.152x + 102.289.
Use this equation to predict the expected carbon dioxide
emissions for the following gross domestic products.
(Recall from section 9.1 that x and y have a significant
linear correlation.)
1. 1.2 trillion dollars
2. 2.0 trillion dollars
3. 2.5 trillion dollars
Solution: Predicting y-Values Using
Regression Equations
ŷ = 196.152x + 102.289
1. 1.2 trillion dollars
ŷ = 196.152(1.2) + 102.289 ≈ 337.671
When the gross domestic product is $1.2 trillion, the
CO2 emissions are about 337.671 million metric tons.
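The same substitution applies to each of the three GDP values in the example. A short sketch, assuming only the regression equation given above (the function name `predict_co2` is hypothetical):

```python
def predict_co2(gdp_trillions):
    """Predicted CO2 emissions (millions of metric tons) from
    y-hat = 196.152x + 102.289, where x is GDP in trillions of dollars."""
    return 196.152 * gdp_trillions + 102.289

# GDP values from the example, in trillions of dollars.
gdp_values = [1.2, 2.0, 2.5]
predictions = [predict_co2(g) for g in gdp_values]

for g, p in zip(gdp_values, predictions):
    print(f"GDP = ${g} trillion -> about {p:.3f} million metric tons")
```

Predictions are meaningful only for x-values in or close to the range of the data, which here runs from 0.9 to 4.9 trillion dollars.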
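Total deviation = $y_i - \bar{y}$
Explained deviation = $\hat{y}_i - \bar{y}$
Unexplained deviation = $y_i - \hat{y}_i$
[Figure: a data point $(x_i, y_i)$, the point $(x_i, \hat{y}_i)$ on the regression line, and the horizontal line $y = \bar{y}$, with the total, explained, and unexplained deviations marked as vertical distances]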
Variation About a Regression Line
Total variation
• The sum of the squares of the differences between the
y-value of each ordered pair and the mean of y.
$$ \text{Total variation} = \sum \left(y_i - \bar{y}\right)^2 $$
Explained variation
• The sum of the squares of the differences between
each predicted y-value and the mean of y.
$$ \text{Explained variation} = \sum \left(\hat{y}_i - \bar{y}\right)^2 $$
Unexplained variation
• The sum of the squares of the differences between the
y-value of each ordered pair and each corresponding
predicted y-value.
$$ \text{Unexplained variation} = \sum \left(y_i - \hat{y}_i\right)^2 $$
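The three variations can be computed side by side. The sketch below fits the least-squares line to the GDP and CO2 data and then evaluates each sum of squares; variable names are illustrative. For a least-squares line, the total variation equals the explained variation plus the unexplained variation.

```python
# GDP (trillions of $) and CO2 emissions (millions of metric tons).
x = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
y = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept of y-hat = mx + b.
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
m = (n * sum_xy - sum(x) * sum(y)) / (n * sum_x2 - sum(x) ** 2)
b = y_bar - m * x_bar
y_hat = [m * xi + b for xi in x]

total = sum((yi - y_bar) ** 2 for yi in y)
explained = sum((yh - y_bar) ** 2 for yh in y_hat)
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

print(f"total variation       = {total:.3f}")
print(f"explained variation   = {explained:.3f}")
print(f"unexplained variation = {unexplained:.3f}")
print(f"explained + unexplained = {explained + unexplained:.3f}")
```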
Coefficient of determination
• The ratio of the explained variation to the total
variation.
• Denoted by r²
$$ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} $$
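As a quick check of the definition, this sketch computes r² as the ratio of explained to total variation for the GDP and CO2 data. The function name `coefficient_of_determination` is an assumption made for illustration.

```python
# GDP (trillions of $) and CO2 emissions (millions of metric tons).
x = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
y = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]

def coefficient_of_determination(xs, ys):
    """r^2 = explained variation / total variation for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    m = (n * sum_xy - sum(xs) * sum(ys)) / (n * sum_x2 - sum(xs) ** 2)
    b = y_bar - m * x_bar
    explained = sum((m * a + b - y_bar) ** 2 for a in xs)
    total = sum((v - y_bar) ** 2 for v in ys)
    return explained / total

r_squared = coefficient_of_determination(x, y)
print(f"r^2 = {r_squared:.3f}")
print(f"about {100 * r_squared:.1f}% of the variation in y is explained by x")
```

The remaining share of the variation is unexplained, meaning it is due to other factors or to sampling error.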
$$ s_e = \sqrt{\frac{\sum \left(y_i - \hat{y}_i\right)^2}{n - 2}} = \sqrt{\frac{152{,}916.020898}{10 - 2}} \approx 138.255 $$
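The arithmetic in this step is easy to reproduce. A minimal sketch, taking the unexplained variation 152,916.020898 and n = 10 directly from the calculation above (the function name `standard_error_of_estimate` is illustrative):

```python
from math import sqrt

def standard_error_of_estimate(unexplained_variation, n):
    """s_e = sqrt(unexplained variation / (n - 2))."""
    return sqrt(unexplained_variation / (n - 2))

# Values taken from the worked calculation above.
s_e = standard_error_of_estimate(152916.020898, 10)
print(f"s_e = {s_e:.3f}")
```

The closer the observed y-values are to the predicted y-values, the smaller the standard error of estimate.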
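4. Find the margin of error E.
$$ E = t_c s_e \sqrt{1 + \frac{1}{n} + \frac{n\left(x_0 - \bar{x}\right)^2}{n\sum x^2 - \left(\sum x\right)^2}} $$
$$ \approx (2.306)(138.255)\sqrt{1 + \frac{1}{10} + \frac{10(3.5 - 2.31)^2}{10(67.35) - (23.1)^2}} \approx 349.424 $$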
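The margin of error can be reproduced from the quantities in this step. In the sketch below, t_c = 2.306, s_e = 138.255, n = 10, x0 = 3.5, ∑x = 23.1, and ∑x² = 67.35 come from the calculation above. Pairing E with the point estimate from ŷ = 196.152x + 102.289 to form an interval is an assumption based on the regression equation quoted earlier for the same data.

```python
from math import sqrt

def margin_of_error(t_c, s_e, n, x0, sum_x, sum_x2):
    """Margin of error E for a prediction interval at x = x0."""
    x_bar = sum_x / n
    spread = n * sum_x2 - sum_x ** 2
    return t_c * s_e * sqrt(1 + 1 / n + n * (x0 - x_bar) ** 2 / spread)

# Values from the worked calculation above.
E = margin_of_error(t_c=2.306, s_e=138.255, n=10, x0=3.5,
                    sum_x=23.1, sum_x2=67.35)
print(f"E = {E:.3f}")

# Assumed point estimate from y-hat = 196.152x + 102.289 at x0 = 3.5.
y_hat = 196.152 * 3.5 + 102.289
lower, upper = y_hat - E, y_hat + E
print(f"prediction interval: {lower:.3f} < y < {upper:.3f}")
```

The interval widens as x0 moves away from the mean of the x-values, because the (x0 − x̄)² term under the square root grows.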
Multiple Regression