Professional Documents
Culture Documents
2-1
POPULATION
Random Variables X, Y: numerical
Definition: Population Linear Correlation Coefficient of X, Y
σXY
ρ =
σX σY
FACT: −1 ≤ ρ ≤ +1
SAMPLE, size n
600
Example: r = = 0.907 strong, positive linear correlation
250 1750
FACT: −1 ≤ r ≤ +1
Any set of data points (xi, yi), i = 1, 2, …, n, having r > 0 (likewise, r < 0) is said
to have a positive linear correlation (likewise, negative linear correlation).
The linear correlation can be strong, moderate, or weak, depending on the
magnitude. The closer r is to +1 (likewise, −1), the more strongly the points
follow a straight line having some positive (likewise, negative) slope. The closer
r is to 0, the weaker the linear correlation; if r = 0, then EITHER the points are
uncorrelated (see 7.1), OR they are correlated, but nonlinearly (e.g., Y = X 2).
Exercise: Draw a scatterplot of the following n = 7 data points, and compute r.
(−3, 9), (−2, 4), (−1, 1), (0, 0), (1, 1), (2, 4), (3, 9)
Ismor Fischer, 5/29/2012 7.2-2
sxy
(Pearson’s) Sample Linear Correlation Coefficient r =
sx sy
uncorrelated
r
−1 − 0.8 − 0.5 0 + 0.5 + 0.8 +1
r = 0, but X and Y are r > 0 in each of the two r > 0, only due to the effect of one
correlated, nonlinearly individual subgroups, influential outlier; if removed,
but r < 0 when combined then data are uncorrelated (r = 0)
Ismor Fischer, 5/29/2012 7.2-3
Statistical Inference for ρ
Suppose we now wish to conduct a formal test of…
Test Statistic
r n−2
T = ~ tn − 2
1 − r2
.907 3
Example: p-value = 2 PT3 ≥ = 2 P(T3 ≥ 3.733) = 2(.017) = .034
1 − (.907)2
As p < α = .05, the null hypothesis of no linear correlation can be rejected at this level.
Comments:
Defining the numerator “sums of squares” Sxx = (n – 1) sx2, Syy = (n – 1) sy2, and
S xy
Sxy = (n – 1) sxy, the correlation coefficient can also be written as r = .
S xx S yy
The general null hypothesis H0: ρ = ρ0 requires a more complicated Z-test, which
first applies the so-called Fisher transformation, and will not be presented here.
The assumption on X and Y is that their joint distribution is bivariate normal, which
is difficult to check fully in practice. However, a consequence of this assumption is
that X and Y are linearly uncorrelated (i.e., ρ = 0) if and only if X and Y are
independent. That is, it overlooks the possibility that X and Y might have a
nonlinear correlation. The moral: ρ – and therefore the Pearson sample linear
correlation coefficient r calculated above – only captures the strength of linear
correlation. A more sophisticated measure, the multiple correlation coefficient,
can detect nonlinear correlation, or correlation in several variables. Also, the
nonparametric Spearman rank-correlation coefficient can be used as a substitute.
Correlation does not imply causation! (E.g., X = “children’s foot size” is indeed
positively correlated with Y = “IQ score,” but is this really cause-and-effect????) The
ideal way to establish causality is via a well-designed randomized clinical trial, but
this is not always possible, or even desirable. (E.g., X = smoking vs. Y = lung cancer)
Ismor Fischer, 5/29/2012 7.2-4
k = 2 parameters,
“regression coefficients”
Predictor Variable,
Explanatory Variable
Y = β0 + β1 X + ε
Ŷ = βˆ0 + βˆ1 X
intercept = b0 b1 = slope
Y
εˆ
Yˆ
That is, given the “response vector” Y, we wish to find the linear estimate Yˆ that
makes the magnitude of the difference εˆ = Y – Yˆ as small as possible.
Ismor Fischer, 5/29/2012 7.2-5
Y = β0 + β1 X + ε ⇒ Yˆ = β̂0 + β̂1 X
How should we define the line that “best” fits the data, and obtain its coefficients β̂0 and β̂1 ?
For any line, errors εi , i = 1, 2, …, n, can be estimated by the residuals εˆi = ei = yi − yˆi .
yi − (xi, yi)
en
ei = yi − yˆi
(x, y)
e1 e3
e2
| X
xi
The least squares regression line is the unique line that minimizes the
n n
Error (or Residual) Sum of Squares SSError = ∑ ei 2 =
i =1
∑
i =1
( yi − yˆi )2 .
sxy
Slope: β̂1 = b1 = s 2
x
Ŷ = b0 + b1 X
Intercept: β̂0 = b0 = y − b1 x
600
Example (cont’d): Slope b1 = 250 = 2.4 Intercept b0 = 240 − (2.4)(80) = 48
Therefore, the least squares regression line is given by the equation Ŷ = 48 + 2.4 X.
Ismor Fischer, 5/29/2012 7.2-6
^
Y = 48 + 2.4 X
+2
+16
−20
+18 −16
Note that the sum of the residuals is equal to zero. But the sum of their squares,
2
εˆ = SSError = (+18)2 + (−16)2 + (−20)2 + (+16)2 + (+2)2 = 1240
is, by construction, the smallest such value of all possible regression lines that could
have been used to estimate the data. Note also that the center of mass (80, 240)
lies on the least squares regression line.
1 ( x )2
For β0: b0 ± tn − 2, α/2 ⋅ se n + Sxx
1
For β1: b1 ± tn − 2, α/2 ⋅ se
Sxx
Test Statistic
b0 − β0 n Sxx
For β0: T =
Sxx + n ( x )2 ~ tn – 2
se
b1 − β1
For β1: T = Sxx ~ tn – 2
se
SS
where se2 = n −Error
2
is the so-called standard error of estimate, and Sxx = (n – 1) sx2.
(Note: se2 is also written as MSE or MSError, the “mean square error” of the
regression; see ANOVA below.)
Example: Calculate the p-value of the slope parameter β1, under…
1240
First, se2 = 3 = 413.333, so se = 20.331. And Sxx = (4)(250) = 1000. So…
2.4 − 0
p-value = 2 P T3 ≥ 20.331 1000 = 2 P(T3 ≥ 3.733) = 2 (.017) = .034
As p < α = .05, the null hypothesis of no linear association can be rejected at this level.
Note that the T-statistic (3.733), and hence the resulting p-value (.034), is identical to
the test of significance of the linear correlation coefficient H0: ρ = 0 conducted above!
Exercise: Calculate the 95% confidence interval for β1, and use it to test H0: β1 = 0.
Ismor Fischer, 5/29/2012 7.2-8
linear model that allows us to derive an estimate of the response variable via
Yˆ = b0 + b1 X , for any value X = x* (with certain restrictions to be discussed later), i.e.,
yˆ = b0 + b1 x * . There are two standard possible interpretations for this fitted value. First,
ŷ can be regarded simply as a “predicted value” of the response variable Y, for a
randomly selected individual from the specific normally-distributed population
corresponding to X = x*, and can be improved via a so-called prediction interval.
.025
Y
.025
X
x*
Exercise: Confirm that the 95% prediction interval for ŷ = 228 (when x* = 75) is
(156.3977, 299.6023).
The second interpretation is that ŷ can be regarded as a point estimate of the conditional
mean µY | X = x* of this population, and can be improved via a confidence interval.
.025
Y
X
x*
Exercise: Confirm that the 95% confidence interval for ŷ = 228 (when x* = 75) is
(197.2133, 258.6867).
Note: Both approaches are based on the fact that there is, in principle, variability in the
coefficients b0 and b1 themselves, from one sample of n data points to another. Thus, for
fixed x*, the object yˆ = b0 + b1 x * can actually be treated as a random variable in its own
right, with a computable sampling distribution.
Comments:
Note that, because individual responses have greater variability than mean
responses (recall the Central Limit Theorem, for example), we expect prediction
intervals to be wider than the corresponding confidence intervals, and indeed, this
is the case. The two formulas differ by a term of “1 +” in the standard error of the
former, resulting in a larger margin of error.
Note also from the formulas that both types of interval are narrowest when x* = x ,
and grow steadily wider as x* moves farther away from x . (This is evident in the
graph of the 95% confidence intervals above.) Great care should be taken if x* is
outside the domain of sample values! For example, when fat grams x = 0, the
linear model predicts an unrealistic cholesterol level of ŷ = 48, and the margin of
error is uselessly large. The linear model is not a good predictor there.
Ismor Fischer, 5/29/2012 7.2-11
ANOVA Formulation
As with comparison of multiple treatment means (§6.3.3), regression can also be
interpreted in the general context of analysis of variance. That is, because
Response = Model + Error,
it follows that the total variation in the original response data can be partitioned into a
source of variation due to the model, plus a source of variation for whatever remains.
We now calculate the three “Sums of Squares (SS)” that measure the variation of the
system and its two component sources, and their associated degrees of freedom (df).
1. Total Sum of Squares = sum of the squared deviations of each observed response
value yi from the mean response value y .
SSTotal = (210 – 240)2 + (200 – 240)2 + (220 – 240)2 + (280 – 240)2 + (290 – 240)2 = 7000
dfTotal = 5 – 1 = 4 Reason: n data values – 1
SSTotal 7000
Note that, by definition, sy2 = df = 4 = 1750, as given in the beginning of this
Total
example in 7.1.
2. Regression Sum of Squares = sum of the squared deviations of each fitted response
value yˆi from the mean response value y .
SSReg = (192 – 240)2 + (216 – 240)2 + (240 – 240)2 + (264 – 240)2 + (288 – 240)2 = 5760
dfReg = 1 Reason: As the regression model is linear, its degrees of freedom =
one less than the k = 2 parameters we are trying to estimate (β0 and β1).
3. Error Sum of Squares = sum of the squared deviations of each observed response
yi from its corresponding fitted response yˆi (i.e., the sum of the squared residuals).
SSError = (210 – 192)2 + (200 – 216)2 + (220 – 240)2 + (280 – 264)2 + (290 – 288)2 = 1240
dfError = 5 – 2 = 3 Reason: n data values – k regression parameters in model
ANOVA Table
Test Statistic
“Sum of Squares” “Mean Squares” (F1, 3 distribution)
SS MSReg
Source df SS MS = df F = MS p-value
Err
Total 4 7000 −
According to this F-test, we can reject…
at the α = .05 significance level, which is consistent with our earlier findings.
F1, 3
.034
13.94
Comment: Again, note that 13.94 = (± 3.733)2, i.e., F1, 3 = t32 ⇒ equivalent tests.
Ismor Fischer, 5/29/2012 7.2-13
How well does the model fit? Out of a total response variation of 7000, the linear
regression model accounts for 5760, with the remaining 1240 unaccounted for
(perhaps explainable by a better model, or simply due to random chance). We can
SSReg
therefore assess how well the model fits the data by calculating the ratio SS =
Total
5760
7000 = 0.823. That is, 82.3% of the total response variation is due to the linear
association between the variables, as determined by the least squares regression
line, with the remaining 17.7% unaccounted for. (Note: This does NOT mean that
82.3% of the original data points lie on the line. This is clearly false; from the
scatterplot, it is clear that none of the points lies on the regression line!)
Moreover, note that 0.823 = (0.907)2 = r2, the square of the correlation coefficient
calculated before! This relation is true in general…
Coefficient of Determination
SSReg SSErr
r2 = SS = 1 − SS
Total Total
Show agreement of your answers, and that, despite a value of r2 very close to 1, the
exact association between X and Y is actually a nonlinear one. Compare the linear
estimate of Y when X = 5, with its exact value.
Also see Appendix > Geometric Viewpoint > Least Squares Approximation.
Ismor Fischer, 5/29/2012 7.2-14
variance σ 2 – i.e, yi ~ N( µY | X = x , σ) –
i
x1 x2 … xn
Ismor Fischer, 5/29/2012 7.2-15
Check: Residual plot (residuals ei vs. fitted values yˆi ) for a general random appearance,
evenly distributed about zero. (Can also check the normal probability plot.)
0 0 0 0
Nonlinear trend can often be described with a polynomial regression model, e.g.,
Y = β0 + β1 X + β2 X 2 + ε. If a residual plot resembles the last figure, this is a
possible indication that more than one predictor variable may be necessary to
explain the response, e.g., Y = β0 + β1 X1 + β2 X 2 + ε, multiple linear regression.
Nonconstant variance can be handled by Weighted Least Squares (WLS) –
versus Ordinary Least Squares (OLS) above – or by using a transformation of
the data, which can also alleviate nonlinearity, as well as violations of the third
assumption that the errors are normally distributed.
Ismor Fischer, 5/29/2012 7.2-16
Example: Regress Y = “human age (years)” on X = “dog age (years),” based on the
following n = 20 data points, for adult dogs 23-34 lbs.: Sadie
X 1 2 3 4 5 6 7 8 9 10
Y 15 21 27 32 37 42 46 51 55 59
11 12 13 14 15 16 17 18 19 20
63 67 71 76 80 85 91 97 103 111
^
Y = 12.1 + 4.7 X
Residuals:
Min 1Q Median 3Q Max
-2.61353 -1.57124 0.08947 1.16654 4.87143
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.06842 0.87794 13.75 5.5e-11 ***
X 4.70301 0.07329 64.17 < 2e-16 ***
The residual plot exhibits a clear nonlinear trend, despite the excellent fit of the
linear model. It is possible to take this into account using, say, a cubic (i.e.,
third-degree) polynomial, but this then begs the question: How complicated
should we make the regression model?