and Regression
[Figure: two scatterplots of Y versus X]
Developing the Correlation Formula
Recall sample variance from much earlier.
$$S_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$
It makes sense to try the same formula with one X deviation and one Y deviation. This is called covariance.
$$S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$
(X, Y) example: (0, 0) (1, –2) (2, –2) (3, –5) (4, –6)
Covariance = –3.75
The unit of covariance is the product of the X unit and the Y unit, such as foot-pounds.
Often this makes no sense.
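The covariance example above can be checked by hand or with a few lines of code. The sketch below applies the covariance formula directly to the five (X, Y) pairs; the function name `sample_covariance` is an illustrative choice, not something from the course materials.

```python
# Sample covariance for the five (X, Y) pairs in the example.
xs = [0, 1, 2, 3, 4]
ys = [0, -2, -2, -5, -6]


def sample_covariance(xs, ys):
    """Sum of (x - x_bar)(y - y_bar), divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


cov_xy = sample_covariance(xs, ys)
print(cov_xy)  # -3.75
```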
Developing the Correlation Formula
To solve the unit problem and make –1 ≤ r ≤ +1,
we divide by both standard deviations.
Correlation can be defined as
r = Covariance / [sd(X) * sd(Y)]
Example: r for little set on prior page = –0.968
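As a check on the definition, here is a minimal sketch that divides the covariance by both sample standard deviations for the five-point set. The helper names are illustrative assumptions.

```python
import math

xs = [0, 1, 2, 3, 4]
ys = [0, -2, -2, -5, -6]


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# r = Covariance / [sd(X) * sd(Y)]
r = sample_covariance(xs, ys) / (sample_sd(xs) * sample_sd(ys))
print(round(r, 3))  # -0.968
```

Because the units of X and Y cancel in the division, r has no unit.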
In real life, correlation is found by computer or
calculator, but the book provides a very messy
formula. We’ll use the calculator instead.
$$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{\left[n(\sum x^2) - (\sum x)^2\right]\left[n(\sum y^2) - (\sum y)^2\right]}}$$
Use the Calculator!
Six (x, y) pairs:
(2,13), (7,21), (9,23), (1,14), (5,15), (12,21)
∑ x = 36, ∑ y = 107, ∑ x² = 304, ∑ y² = 2001, ∑ xy = 721
[Figure: scatterplot of y versus x with fitted values]
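For anyone who wants to see what the calculator is doing, this sketch plugs the five summary sums for the six pairs into the book's computational formula for r. The variable names are illustrative.

```python
import math

# Summary sums for the six (x, y) pairs.
n = 6
sum_x, sum_y = 36, 107
sum_x2, sum_y2 = 304, 2001
sum_xy = 721

# r = [n(Σxy) - (Σx)(Σy)] / sqrt([n(Σx²) - (Σx)²] * [n(Σy²) - (Σy)²])
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r = numerator / denominator
print(round(r, 3))  # 0.874
```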
This makes test value
$$t = \frac{r - 0}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
Correlation Test Example
For the earlier dataset, assume X and Y come from
normally distributed populations.
Let α = 0.10 and do a two-tailed test.
② Reject if $t < -t_{0.10/2}$ or $t > t_{0.10/2}$, with (6 – 2) = 4 DF.
That means t < –2.132 or t > 2.132.
③ Test value
$$t = \frac{0.874 - 0}{\sqrt{\dfrac{1 - 0.874^2}{6 - 2}}} = \frac{0.874}{0.4859 / 2} = 3.597$$
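The arithmetic in step ③ can be reproduced with the sketch below. The critical value 2.132 is copied from step ② rather than computed, so the code needs no statistics library; the variable names are illustrative.

```python
import math

r = 0.874           # sample correlation for the six pairs
n = 6               # sample size
t_critical = 2.132  # from the slide: t table, 4 DF, 0.05 in each tail

# t = (r - 0) / sqrt((1 - r^2) / (n - 2))
standard_error = math.sqrt((1 - r ** 2) / (n - 2))
t = (r - 0) / standard_error

# Two-tailed rejection rule from step 2.
reject_null = t < -t_critical or t > t_critical

print(round(t, 3))   # 3.597
print(reject_null)   # True
```

Since 3.597 is greater than 2.132, the test value falls in the rejection region stated in step ②.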
[Figure: scatterplot of Y versus X]
Formal Regression Line (r = 0.768)
Dataset Sample Size 9
[Figure: scatterplot of Y versus X]
Building the Regression Line
Residual e = actual y – predicted y’
[most other sources use ŷ for predicted value]
The least squares regression line minimizes the sum
of squared residuals.
Unlike correlation, regression depends on which variable is Y
and which is X; swapping them gives a different line.
The regression line always passes through the center
of the distribution (x̅, y̅).
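The two properties above can be illustrated numerically. This sketch uses the six (x, y) pairs from the calculator example and the deviation form of the slope, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², which is a standard equivalent of the least squares formula but is not the form shown in these slides. The function name `sse` and the nudge sizes are arbitrary illustrative choices.

```python
xs = [2, 7, 9, 1, 5, 12]
ys = [13, 21, 23, 14, 15, 21]


def sse(a, b):
    """Sum of squared residuals for the line y' = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope (deviation form) and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

best = sse(a, b)

# Nudging the intercept or slope in either direction makes the fit worse.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.01), (0.0, -0.01)]
all_worse = all(sse(a + da, b + db) > best for da, db in nudges)
print(all_worse)  # True

# The fitted line passes through (x_bar, y_bar).
print(abs((a + b * x_bar) - y_bar) < 1e-9)  # True
```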
The Ugly Truth (Formula)
The Bluman book states y’ = a + bx
BUT TI calculators generally use y’ = ax + b
$$\text{Intercept } a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$$
$$\text{Slope } b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$
When using the calculator, be sure you know which
letter means intercept and slope.
Practice with the calculator!
This WILL be on the exam!
Regression Computation Example
Six (x,y) pairs: (2,13), (7,21), (9,23), (1,14), (5,15), (12,21)
∑ x = 36, ∑ y = 107, ∑ x² = 304, ∑ y² = 2001, ∑ xy = 721
Regression line y’ = 12.447 + 0.898 x
If you reverse y and x, you get x’ = (–9.176) + 0.851 y
For x = 2, predicted value y’ = 12.447 + 0.898 (2) = 14.243
and residual = 13 – 14.243 = –1.243
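The numbers in this example can be reproduced from the five summary sums. The sketch below uses the book's convention y’ = a + bx, then swaps the roles of x and y to get the reversed line. As on the slide, the prediction uses the coefficients rounded to three decimals. The variable names are illustrative.

```python
n = 6
sum_x, sum_y = 36, 107
sum_x2, sum_y2 = 304, 2001
sum_xy = 721

# Regress y on x: y' = a + b*x
denom = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
b = (n * sum_xy - sum_x * sum_y) / denom
print(round(a, 3), round(b, 3))  # 12.447 0.898

# Regress x on y (roles swapped): x' = a_rev + b_rev*y
denom_rev = n * sum_y2 - sum_y ** 2
a_rev = (sum_x * sum_y2 - sum_y * sum_xy) / denom_rev
b_rev = (n * sum_xy - sum_x * sum_y) / denom_rev
print(round(a_rev, 3), round(b_rev, 3))  # -9.176 0.851

# Prediction and residual for the observed point (2, 13),
# using the rounded coefficients as the slide does.
y_pred = round(a, 3) + round(b, 3) * 2
residual = 13 - y_pred
print(round(y_pred, 3), round(residual, 3))  # 14.243 -1.243
```

The two slopes differ (0.898 versus 0.851), which shows that the reversed line is not just the original line solved for x.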
Videos for correlation also show how to find intercept and slope.
[Figure: scatterplot of y versus x with fitted values]
Interpretations
Intercept is the predicted value for x = 0, which may
be a hypothetical value if x = 0 makes no sense.
We need to avoid extrapolation, trying to make
predictions outside the range of the data.
Slope is the marginal change in predicted y’ when
predictor x increases by one unit.
An outlier in regression is a point outside the pattern
of the rest of the data.
When any point has a substantial effect on the
regression line equation, the point is considered
influential. Outliers are often influential.
Outliers and Influential Points
Anscombe’s Quartet: All four regression lines are y’ = 3 + 0.5 x
X1: no issues
X2: not linear
X3: outlier
X4: influential point
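To see how much a single point can move the line, the sketch below fits the third Anscombe dataset with and without its outlier. The data values are the published ones from Anscombe's 1973 paper, and it is an assumption that they match the X3 panel on the slide. The helper `fit_line` is an illustrative name; it uses the same intercept and slope formulas as the regression slides.

```python
def fit_line(xs, ys):
    """Return (intercept a, slope b) for the line y' = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b


# Anscombe's third dataset (assumed to be the "X3: outlier" panel).
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

a_all, b_all = fit_line(x3, y3)

# Refit after dropping the outlier at (13, 12.74).
kept = [(x, y) for x, y in zip(x3, y3) if (x, y) != (13, 12.74)]
a_trim, b_trim = fit_line([x for x, _ in kept], [y for _, y in kept])

print(round(a_all, 2), round(b_all, 2))    # 3.0 0.5
print(round(a_trim, 2), round(b_trim, 2))  # 4.01 0.35
```

Removing one point out of eleven changes the slope from about 0.5 to about 0.35, so in this dataset the outlier is also influential.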
Interpreting Variation
Thinking back, we earlier computed variance, which
had total variation as the numerator.
Total variation for response Y is $\sum (y - \bar{y})^2$
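As a small illustration, this sketch computes the total variation of y for the six (x, y) pairs used in the earlier examples, then divides by n − 1 to recover the sample variance. Using that dataset here is an illustrative choice.

```python
ys = [13, 21, 23, 14, 15, 21]

n = len(ys)
y_bar = sum(ys) / n

# Total variation: sum of squared deviations of y from its mean.
total_variation = sum((y - y_bar) ** 2 for y in ys)

# Sample variance is total variation divided by n - 1.
variance = total_variation / (n - 1)

print(round(total_variation, 3))  # 92.833
print(round(variance, 3))         # 18.567
```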