Correlation and Regression
(the end of the journey)

Lecture Notes #10
OSU Statistics 2013
Summer 2017
Adam Molnar
Two Dimensions
• Every technique so far has had only one variable, although sometimes there was more than one sample.
• Our final chapter begins to examine relationships between two variables.
• We’ve already graphed two-dimensional data with the scatterplot.
Review: Terms for Scatterplots
• Horizontal X axis: Variable that is known, controlled, or potentially explanatory. Called the predictor, explanatory, or independent variable.
• Vertical Y axis: Variable that is unknown, uncontrolled, or the response. Called the response or dependent variable.
• Often (not always) there is a potential cause (X) and effect (Y) relationship.
  • Final exam grade (Y) as a function of midterm grade (X)
  • Plant height (Y) as a function of amount of fertilizer (X)
About Relationships
• In algebra, Y = f(X) was a fixed, error-free statement.
• Unlike mathematics, statistics is not error-free. Real data rarely fall exactly on a straight line or curve.
• We still use Y = f(X), but our models include uncertainty, like the holes in the target.
• New model: Y = f(X) + error (simulated in the short sketch below)
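To see what Y = f(X) + error means in practice, here is a minimal Python sketch (assuming NumPy is installed; the straight-line f and the noise level are illustrative choices of mine, not from the notes) that simulates data from the model:

    import numpy as np

    rng = np.random.default_rng(1)                 # seeded for reproducibility
    x = np.linspace(0, 10, 40)                     # known / controlled predictor
    f = lambda t: 2 + 0.5 * t                      # a hypothetical "true" relationship
    y = f(x) + rng.normal(scale=1.5, size=x.size)  # Y = f(X) + error
    print(np.corrcoef(x, y)[0, 1])                 # strong but imperfect linear relationship

Even though f is an exact line, the simulated (x, y) points scatter around it because of the error term.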
Correlation
• Correlation is a number that measures the strength of the linear relationship between two variables.
• There are several different correlation coefficients; we will compute the most common, the Pearson product-moment correlation coefficient.
• Pearson correlation is appropriate for two quantitative variables, NOT categorical variables.
• For population data, the symbol is ρ (“rho”). For sample data, the symbol is r.
Relationship Types
[Figure: example scatterplots illustrating different relationship types]
Properties of Correlation
• Correlation does not depend on which variable is labeled X or Y.
• Correlation is unitless; it does not depend on the scale of X and Y.
• Correlation ranges between –1 and +1: –1 ≤ r ≤ +1.
• The sign of the correlation is the sign of the perceived slope.
• Correlation measures the strength of the linear relationship.
  • Near +1: Near straight line upward
  • Near –1: Near straight line downward
  • Near 0: Little linear relationship, but may have nonlinear fit
• Correlation is sensitive to outliers.
• For visual practice, play the little matching game at http://istics.net/Correlations/
Correlation Graph Strength Examples
[Two scatterplots of Y against X, each on 0.0–1.0 axes. Left: correlation = –0.54 (tighter, more line-like). Right: correlation = –0.20.]
Developing the Correlation Formula
• Recall sample variance from much earlier:
  $s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• It makes sense to try one X and one Y. This is called covariance:
  $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
• (X, Y) example: (0, 0) (1, –2) (2, –2) (3, –5) (4, –6)
• Covariance = –3.75
• The unit of covariance is the product of the X and Y units, such as foot-pounds. Often this makes no physical sense.
Developing the Correlation Formula
• To solve the unit problem and make –1 ≤ r ≤ +1, we divide by both standard deviations.
• Correlation can be defined as r = Covariance / [sd(X) * sd(Y)].
• Example: r for the little set on the prior page = –0.968 (verified in the sketch below)
• In real life, correlation is found by computer or calculator, but the book provides a very messy formula. We’ll use the calculator instead.
  $r = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$
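As a cross-check on those numbers, here is a minimal Python sketch (assuming NumPy; variable names are mine) computing the sample covariance and the definition-based r for the five-point example:

    import numpy as np

    # the five (X, Y) pairs from the covariance example
    x = np.array([0, 1, 2, 3, 4])
    y = np.array([0, -2, -2, -5, -6])

    # sample covariance: sum of deviation products over (n - 1)
    s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

    # Pearson r = covariance / [sd(X) * sd(Y)], using sample (ddof=1) SDs
    r = s_xy / (x.std(ddof=1) * y.std(ddof=1))
    print(s_xy, r)   # -3.75 and about -0.968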
Use the Calculator!
• Six (x, y) pairs: (2,13), (7,21), (9,23), (1,14), (5,15), (12,21)
• ∑x = 36, ∑y = 107, ∑x² = 304, ∑y² = 2001, ∑xy = 721
• Correlation r = 474 / 542.3062 = +0.874
[Scatterplot of the six pairs with fitted line]
• TI-83, TI-84: https://www.youtube.com/watch?v=7v1-2kiGAEY
  Put X in L1, Y in L2. Also use a+bx, option #8, not #4.
• TI-36X Pro: https://www.youtube.com/watch?v=9ktyhFt4i1M
• TI-34: https://www.youtube.com/watch?v=Wi1lWuPWF70
• TI-30XS: https://www.youtube.com/watch?v=cwVeEd4N87g
• TI-30X IIS: https://www.youtube.com/watch?v=vNLvd_pJlv8
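If no TI calculator is handy, the messy raw-sum formula scripts easily. A minimal Python sketch (standard library only; names are mine) for the six pairs above:

    import math

    pairs = [(2, 13), (7, 21), (9, 23), (1, 14), (5, 15), (12, 21)]
    n   = len(pairs)
    sx  = sum(x for x, _ in pairs)       # 36
    sy  = sum(y for _, y in pairs)       # 107
    sxx = sum(x * x for x, _ in pairs)   # 304
    syy = sum(y * y for _, y in pairs)   # 2001
    sxy = sum(x * y for x, y in pairs)   # 721

    num = n * sxy - sx * sy                                            # 474
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)  # 542.3062
    print(num / den)                                                   # about +0.874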
Hypothesis Testing for Correlation
• If the two populations are normally distributed, we can conduct a hypothesis test to see if the population correlation is significantly different from zero.
• These are usually 2-tailed tests, but can be 1-tailed.
① H0: ρ = 0   H1: ρ ≠ 0
• The test is a t-test with (n – 2) degrees of freedom.
• Conditions:
  • A) Good random sample from large population
  • B1) Normally distributed populations for X and Y
    There is NO B2) large sample size option!
  • C-r) Approximate linear relationship with no outliers
Correlation Test Value
• Recall the basic t test formula: $t = \dfrac{\text{Observed} - \text{Null}}{s / \sqrt{n}}$
• Observed value is sample correlation r.
• Null value is 0.
• For the standard deviation, replace s by $\sqrt{1 - r^2}$.
• For technical reasons, we use $\sqrt{n - 2}$, not $\sqrt{n}$.
• This makes test value $t = \dfrac{r - 0}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$
Correlation Test Example
• For the earlier dataset, assume X and Y come from normally distributed populations.
• Let α = 0.10 and do a two-tailed test.
② Reject if $t < -t_{0.10/2}$ or $t > t_{0.10/2}$ with (6 – 2) = 4 DF. That means t < –2.132 or t > 2.132.
③ Test value $t = \dfrac{0.874 - 0}{\sqrt{\dfrac{1 - 0.874^2}{6 - 2}}} = \dfrac{0.874}{0.4859/2} = 3.597$
④ Since 3.597 > 2.132, we reject the null hypothesis.
⑤ Based on the data, the population correlation is statistically significantly different from 0.
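The same decision can be scripted. A minimal Python sketch (assuming SciPy is installed; names are mine) reproducing the test value, the critical value, and the decision:

    import math
    from scipy import stats

    r, n, alpha = 0.874, 6, 0.10
    t = r / math.sqrt((1 - r ** 2) / (n - 2))    # 3.597
    crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # 2.132 with 4 DF
    p = 2 * stats.t.sf(abs(t), df=n - 2)         # two-tailed p-value, about 0.02
    print(t, crit, abs(t) > crit)                # True -> reject H0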
Correlation Test Practice: Heights
• Height and self esteem are roughly normally distributed.
• N = 69 middle school boys, good random sample, no outliers
• Sample r = +0.180
• Is there a statistically significant relationship?
• Let α = 0.10.
Correlation Test Practice
① H0: ρ = 0   H1: ρ ≠ 0
② Reject if $t < -t_{0.10/2}$ or $t > t_{0.10/2}$ with (69 – 2) = 67 DF, when t < –1.669 or t > 1.669.
③ Test value $t = \dfrac{0.180 - 0}{\sqrt{\dfrac{1 - 0.180^2}{69 - 2}}} = \dfrac{0.180}{0.1202} = 1.498$
④ Since 1.498 < 1.669, we do not reject the null hypothesis.
⑤ Based on the data, the population correlation between height and self esteem is not significantly different from 0.
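A quick script check of the practice problem (again assuming SciPy; SciPy's 67-DF critical value agrees with the table's 1.669 to two decimals):

    import math
    from scipy import stats

    r, n, alpha = 0.180, 69, 0.10
    t = r / math.sqrt((1 - r ** 2) / (n - 2))    # about 1.498
    crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # about 1.668
    print(t, crit, abs(t) > crit)                # False -> do not reject H0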
Correlation and Causation
• Recall from chapter 1, “Correlation does not equal Causation”. At least, not automatically.
• Sometimes X causes Y.
• Sometimes Y really causes X, and we mislabeled.
• It could just be a coincidence, and we made a false rejection Type I error.
Beware the Lurking Variable
• A lurking variable is a variable not included in the analysis that has an effect on the relationship.
• Ice cream sales are positively correlated with homicides. (Really!)
• … but both are positively correlated with temperature.
• … and temperature really doesn’t cause homicides. In warmer weather, people interact more on the streets.
• A problem can have a complex inter-relationship between many variables.
From Correlation to Regression
• In the past section, we tested correlation, a numeric measure of the linear relationship between X and Y.
• If we fit a formal line, the line of best fit is called the least squares regression line.
• Dr. Bluman says that regression analysis should be undertaken only when correlation is statistically significant. Use this on the homework.
• Almost all other sources construct regression lines without restriction.
Where to Draw the Line?
[Scatterplot: dataset of sample size 9, Y against X, with no line drawn]
Formal Regression Line (r = 0.768)
[The same sample-size-9 scatterplot with the least squares regression line drawn through it]
Building the Regression Line
• Residual e = actual y – predicted y’ [most other sources use ŷ for the predicted value]
• The least squares regression line minimizes the sum of squared residuals.
• Now the choice of Y and X matters, unlike correlation; the regression line is not symmetric.
• The regression line always passes through the center of the distribution, (x̄, ȳ).
The Ugly Truth (Formula)
• The Bluman book states y’ = a + bx, BUT TI calculators generally use y’ = ax + b.
• Intercept $a = \dfrac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$
• Slope $b = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$
• When using the calculator, be sure you know which letter means intercept and which means slope.
• Practice with the calculator! This WILL be on the exam! (A code version of the formulas appears below.)
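For checking calculator output by hand, a minimal Python sketch of the two formulas above (the function name is mine; the sums are the same ones used for correlation):

    def regression_from_sums(n, sx, sy, sxx, sxy):
        """Least squares intercept a and slope b for y' = a + b*x."""
        den = n * sxx - sx ** 2
        a = (sy * sxx - sx * sxy) / den
        b = (n * sxy - sx * sy) / den
        return a, b

    # the six-pair example from the next slide: n = 6 with the sums listed there
    print(regression_from_sums(6, 36, 107, 304, 721))  # (12.447..., 0.8977...)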
Regression Computation Example
• Six (x,y) pairs: (2,13), (7,21), (9,23), (1,14), (5,15), (12,21)
• ∑x = 36, ∑y = 107, ∑x² = 304, ∑y² = 2001, ∑xy = 721
• Regression line y’ = 12.447 + 0.898x
• If you reverse y and x, you get x’ = (–9.176) + 0.851y
• For x = 2, predicted value y’ = 12.447 + 0.898(2) = 14.243 and residual = 13 – 14.243 = –1.243
• Videos for correlation also show how to find intercept and slope.
[Scatterplot of the six pairs with fitted line]
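The same numbers fall out of NumPy's built-in fitter; a minimal sketch (assuming NumPy) verifying the line, the reversed line, and the residual at x = 2:

    import numpy as np

    x = np.array([2, 7, 9, 1, 5, 12])
    y = np.array([13, 21, 23, 14, 15, 21])

    b, a = np.polyfit(x, y, 1)    # polyfit returns the highest power first
    print(a, b)                   # 12.447 and 0.898 -> y' = 12.447 + 0.898x

    b2, a2 = np.polyfit(y, x, 1)  # roles reversed
    print(a2, b2)                 # -9.176 and 0.851 -> x' = -9.176 + 0.851y

    y_hat = a + b * 2             # prediction at x = 2
    print(y_hat, 13 - y_hat)      # 14.243 and residual -1.243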
Interpretations
• Intercept is the predicted value for x = 0, which may be a hypothetical value if x = 0 makes no sense.
• We need to avoid extrapolation, trying to make predictions outside the range of the data.
• Slope is the marginal change in predicted y’ when predictor x increases by one unit.
• An outlier in regression is a point outside the pattern of the rest of the data.
• When any point has a substantial effect on the regression line equation, the point is considered influential. Outliers are often influential.
Outliers and Influential Points
• Anscombe’s Quartet: All four regression lines are y’ = 3 + 0.5x
  • X1: no issues
  • X2: not linear
  • X3: outlier
  • X4: influential point
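Anscombe's quartet ships with seaborn's sample datasets, so the claim is easy to verify; a minimal sketch (assuming seaborn is installed and its sample data can be downloaded):

    import numpy as np
    import seaborn as sns

    df = sns.load_dataset("anscombe")   # four datasets labeled I through IV
    for name, g in df.groupby("dataset"):
        b, a = np.polyfit(g["x"], g["y"], 1)
        r = np.corrcoef(g["x"], g["y"])[0, 1]
        print(name, round(a, 2), round(b, 3), round(r, 3))
    # every group prints roughly a = 3.0, b = 0.5, r = 0.816

Identical lines and correlations, yet only the first dataset is well described by a line; always plot the data.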
Interpreting Variation
• Thinking back, we earlier computed variance, which had total variation as the numerator.
• Total variation for response Y is $\sum (y - \bar{y})^2$
• Total variation can be divided into
  • Explained variation $\sum (y' - \bar{y})^2$
  • Unexplained or residual variation $\sum (y - y')^2$
• Coefficient of determination is correlation squared, R² = explained variation / total variation. It is the proportion of variation in y due to variation in x.
• 1 – R² = the coefficient of nondetermination. (The decomposition is checked in the sketch below.)
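A minimal Python check (assuming NumPy; names mine) that, on the six-pair dataset, the two pieces add to the total and explained/total equals r²:

    import numpy as np

    x = np.array([2, 7, 9, 1, 5, 12])
    y = np.array([13, 21, 23, 14, 15, 21])
    b, a = np.polyfit(x, y, 1)
    y_hat = a + b * x                              # predicted values y'

    total       = ((y - y.mean()) ** 2).sum()      # total variation
    explained   = ((y_hat - y.mean()) ** 2).sum()  # explained variation
    unexplained = ((y - y_hat) ** 2).sum()         # residual variation

    print(total, explained + unexplained)          # equal, up to rounding
    print(explained / total)                       # R^2, about 0.874^2 = 0.764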
Determination Practice
• For the earlier graph, correlation r = 0.768.
• Coefficient of determination r² = (0.768)² = 0.590.
• Coefficient of non-determination 1 – r² = 1 – 0.590 = 0.410.
• With height and self esteem, r = 0.180.
• Coefficient of determination r² = (0.180)² = 0.032.
• Coefficient of non-determination 1 – r² = 1 – 0.032 = 0.968.
