Correlation and Regression
(the end of the journey)

Lecture Notes #10
OSU Statistics 2013
Summer 2017
Adam Molnar
Two Dimensions
• Every technique so far has had only one variable, although sometimes there was more than one sample.
• Our final chapter begins to examine relationships between two variables.
• We’ve already graphed two-dimensional data with the scatterplot.
Review: Terms for Scatterplots
• Horizontal X axis: Variable that is known, controlled, or potentially explanatory. Called the predictor, explanatory, or independent variable.
• Vertical Y axis: Variable that is unknown, uncontrolled, or the response. Called the response or dependent variable.
• Often (not always) there is a potential cause (X) and effect (Y) relationship.
  • Final exam grade (Y) as a function of midterm grade (X)
  • Plant height (Y) as a function of amount of fertilizer (X)
About Relationships
• In algebra, Y = f(X) was a fixed, error-free statement.
• Unlike mathematics, statistics is not error-free. Real data rarely fall exactly on a straight line or curve.
• We still use Y = f(X), but our models include uncertainty, like the holes in the target.
• New model: Y = f(X) + error (simulated in the short sketch below)
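To see what Y = f(X) + error means in practice, here is a minimal Python sketch (assuming NumPy is installed; the straight-line f and the noise level are illustrative choices of mine, not from the notes) that simulates data from the model:

    import numpy as np

    rng = np.random.default_rng(1)                 # seeded for reproducibility
    x = np.linspace(0, 10, 40)                     # known / controlled predictor
    f = lambda t: 2 + 0.5 * t                      # a hypothetical "true" relationship
    y = f(x) + rng.normal(scale=1.5, size=x.size)  # Y = f(X) + error
    print(np.corrcoef(x, y)[0, 1])                 # strong but imperfect linear relationship

Even though f is an exact line, the simulated (x, y) points scatter around it because of the error term.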
Correlation
• Correlation is a number that measures the strength of the linear relationship between two variables.
• There are several different correlation coefficients; we will compute the most common, the Pearson product-moment correlation coefficient.
• Pearson correlation is appropriate for two quantitative variables, NOT categorical variables.
• For population data, the symbol is ρ (“rho”). For sample data, the symbol is r.
Relationship Types
[Figure: example scatterplots illustrating different relationship types]
Properties of Correlation
• Correlation does not depend on which variable is labeled X or Y.
• Correlation is unitless; it does not depend on the scale of X and Y.
• Correlation ranges between –1 and +1: –1 ≤ r ≤ +1.
• The sign of the correlation is the sign of the perceived slope.
• Correlation measures the strength of the linear relationship.
  • Near +1: Near straight line upward
  • Near –1: Near straight line downward
  • Near 0: Little linear relationship, but may have nonlinear fit
• Correlation is sensitive to outliers.
• For visual practice, play the little matching game at http://istics.net/Correlations/
Correlation Graph Strength Examples
[Two scatterplots of Y against X, each on 0.0–1.0 axes. Left: correlation = –0.54 (tighter, more line-like). Right: correlation = –0.20.]
Developing the Correlation Formula
• Recall sample variance from much earlier:
  $s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• It makes sense to try one X and one Y. This is called covariance:
  $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
• (X, Y) example: (0, 0) (1, –2) (2, –2) (3, –5) (4, –6)
• Covariance = –3.75
• The unit of covariance is the product of the X and Y units, such as foot-pounds. Often this makes no physical sense.
Developing the Correlation Formula
• To solve the unit problem and make –1 ≤ r ≤ +1, we divide by both standard deviations.
• Correlation can be defined as r = Covariance / [sd(X) * sd(Y)].
• Example: r for the little set on the prior page = –0.968 (verified in the sketch below)
• In real life, correlation is found by computer or calculator, but the book provides a very messy formula. We’ll use the calculator instead.
  $r = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$
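As a cross-check on those numbers, here is a minimal Python sketch (assuming NumPy; variable names are mine) computing the sample covariance and the definition-based r for the five-point example:

    import numpy as np

    # the five (X, Y) pairs from the covariance example
    x = np.array([0, 1, 2, 3, 4])
    y = np.array([0, -2, -2, -5, -6])

    # sample covariance: sum of deviation products over (n - 1)
    s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

    # Pearson r = covariance / [sd(X) * sd(Y)], using sample (ddof=1) SDs
    r = s_xy / (x.std(ddof=1) * y.std(ddof=1))
    print(s_xy, r)   # -3.75 and about -0.968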
Use the Calculator!
• Six (x, y) pairs: (2,13), (7,21), (9,23), (1,14), (5,15), (12,21)
• ∑x = 36, ∑y = 107, ∑x² = 304, ∑y² = 2001, ∑xy = 721
• Correlation r = 474 / 542.3062 = +0.874
[Scatterplot of the six pairs with fitted line]
• TI-83, TI-84: https://www.youtube.com/watch?v=7v1-2kiGAEY
  Put X in L1, Y in L2. Also use a+bx, option #8, not #4.
• TI-36X Pro: https://www.youtube.com/watch?v=9ktyhFt4i1M
• TI-34: https://www.youtube.com/watch?v=Wi1lWuPWF70
• TI-30XS: https://www.youtube.com/watch?v=cwVeEd4N87g
• TI-30X IIS: https://www.youtube.com/watch?v=vNLvd_pJlv8
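If no TI calculator is handy, the messy raw-sum formula scripts easily. A minimal Python sketch (standard library only; names are mine) for the six pairs above:

    import math

    pairs = [(2, 13), (7, 21), (9, 23), (1, 14), (5, 15), (12, 21)]
    n   = len(pairs)
    sx  = sum(x for x, _ in pairs)       # 36
    sy  = sum(y for _, y in pairs)       # 107
    sxx = sum(x * x for x, _ in pairs)   # 304
    syy = sum(y * y for _, y in pairs)   # 2001
    sxy = sum(x * y for x, y in pairs)   # 721

    num = n * sxy - sx * sy                                            # 474
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)  # 542.3062
    print(num / den)                                                   # about +0.874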
Hypothesis Testing for Correlation
• If the two populations are normally distributed, we can conduct a hypothesis test to see if the population correlation is significantly different from zero.
• These are usually 2-tailed tests, but can be 1-tailed.
① H0: ρ = 0   H1: ρ ≠ 0
• The test is a t-test with (n – 2) degrees of freedom.
• Conditions:
  • A) Good random sample from large population
  • B1) Normally distributed populations for X and Y
    There is NO B2) large sample size option!
  • C-r) Approximate linear relationship with no outliers
Correlation Test Value
• Recall the basic t test formula: $t = \dfrac{\text{Observed} - \text{Null}}{s / \sqrt{n}}$
• Observed value is sample correlation r.
• Null value is 0.
• For the standard deviation, replace s by $\sqrt{1 - r^2}$.
• For technical reasons, we use $\sqrt{n - 2}$, not $\sqrt{n}$.
• This makes test value $t = \dfrac{r - 0}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$
Correlation Test Example
• For the earlier dataset, assume X and Y come from normally distributed populations.
• Let α = 0.10 and do a two-tailed test.
② Reject if $t < -t_{0.10/2}$ or $t > t_{0.10/2}$ with (6 – 2) = 4 DF. That means t < –2.132 or t > 2.132.
③ Test value $t = \dfrac{0.874 - 0}{\sqrt{\dfrac{1 - 0.874^2}{6 - 2}}} = \dfrac{0.874}{0.4859/2} = 3.597$
④ Since 3.597 > 2.132, we reject the null hypothesis.
⑤ Based on the data, the population correlation is statistically significantly different from 0.
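The same decision can be scripted. A minimal Python sketch (assuming SciPy is installed; names are mine) reproducing the test value, the critical value, and the decision:

    import math
    from scipy import stats

    r, n, alpha = 0.874, 6, 0.10
    t = r / math.sqrt((1 - r ** 2) / (n - 2))    # 3.597
    crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # 2.132 with 4 DF
    p = 2 * stats.t.sf(abs(t), df=n - 2)         # two-tailed p-value, about 0.02
    print(t, crit, abs(t) > crit)                # True -> reject H0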
Correlation Test Practice: Heights
• Height and self esteem are roughly normally distributed.
• N = 69 middle school boys, good random sample, no outliers
• Sample r = +0.180
• Is there a statistically significant relationship?
• Let α = 0.10.
Correlation Test Practice
① H0: ρ = 0   H1: ρ ≠ 0
② Reject if $t < -t_{0.10/2}$ or $t > t_{0.10/2}$ with (69 – 2) = 67 DF, when t < –1.669 or t > 1.669.
③ Test value $t = \dfrac{0.180 - 0}{\sqrt{\dfrac{1 - 0.180^2}{69 - 2}}} = \dfrac{0.180}{0.1202} = 1.498$
④ Since 1.498 < 1.669, we do not reject the null hypothesis.
⑤ Based on the data, the population correlation between height and self esteem is not significantly different from 0.
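A quick script check of the practice problem (again assuming SciPy; SciPy's 67-DF critical value agrees with the table's 1.669 to two decimals):

    import math
    from scipy import stats

    r, n, alpha = 0.180, 69, 0.10
    t = r / math.sqrt((1 - r ** 2) / (n - 2))    # about 1.498
    crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # about 1.668
    print(t, crit, abs(t) > crit)                # False -> do not reject H0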
Correlation and Causation
• Recall from chapter 1, “Correlation does not equal Causation”. At least, not automatically.
• Sometimes X causes Y.
• Sometimes Y really causes X, and we mislabeled.
• It could just be a coincidence, and we made a false rejection Type I error.
Beware the Lurking Variable
• A lurking variable is a variable not included in the analysis that has an effect on the relationship.
• Ice cream sales are positively correlated with homicides. (Really!)
• … but both are positively correlated with temperature.
• … and temperature really doesn’t cause homicides. In warmer weather, people interact more on the streets.
• A problem can have a complex inter-relationship between many variables.
From Correlation to Regression
• In the past section, we tested correlation, a numeric measure of the linear relationship between X and Y.
• If we fit a formal line, the line of best fit is called the least squares regression line.
• Dr. Bluman says that regression analysis should be undertaken only when correlation is statistically significant. Use this on the homework.
• Almost all other sources construct regression lines without restriction.
Where to Draw the Line?
[Scatterplot: dataset of sample size 9, Y against X, with no line drawn]
Formal Regression Line (r = 0.768)
[The same sample-size-9 scatterplot with the least squares regression line drawn through it]
Building the Regression Line
• Residual e = actual y – predicted y’ [most other sources use ŷ for the predicted value]
• The least squares regression line minimizes the sum of squared residuals.
• Now the choice of Y and X matters, unlike correlation; the regression line is not symmetric.
• The regression line always passes through the center of the distribution, (x̄, ȳ).
The Ugly Truth (Formula)
• The Bluman book states y’ = a + bx, BUT TI calculators generally use y’ = ax + b.
• Intercept $a = \dfrac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$
• Slope $b = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$
• When using the calculator, be sure you know which letter means intercept and which means slope.
• Practice with the calculator! This WILL be on the exam! (A code version of the formulas appears below.)
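For checking calculator output by hand, a minimal Python sketch of the two formulas above (the function name is mine; the sums are the same ones used for correlation):

    def regression_from_sums(n, sx, sy, sxx, sxy):
        """Least squares intercept a and slope b for y' = a + b*x."""
        den = n * sxx - sx ** 2
        a = (sy * sxx - sx * sxy) / den
        b = (n * sxy - sx * sy) / den
        return a, b

    # the six-pair example from the next slide: n = 6 with the sums listed there
    print(regression_from_sums(6, 36, 107, 304, 721))  # (12.447..., 0.8977...)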
Regression Computation Example
• Six (x,y) pairs: (2,13), (7,21), (9,23), (1,14), (5,15), (12,21)
• ∑x = 36, ∑y = 107, ∑x² = 304, ∑y² = 2001, ∑xy = 721
• Regression line y’ = 12.447 + 0.898x
• If you reverse y and x, you get x’ = (–9.176) + 0.851y
• For x = 2, predicted value y’ = 12.447 + 0.898(2) = 14.243 and residual = 13 – 14.243 = –1.243
• Videos for correlation also show how to find intercept and slope.
[Scatterplot of the six pairs with fitted line]
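The same numbers fall out of NumPy's built-in fitter; a minimal sketch (assuming NumPy) verifying the line, the reversed line, and the residual at x = 2:

    import numpy as np

    x = np.array([2, 7, 9, 1, 5, 12])
    y = np.array([13, 21, 23, 14, 15, 21])

    b, a = np.polyfit(x, y, 1)    # polyfit returns the highest power first
    print(a, b)                   # 12.447 and 0.898 -> y' = 12.447 + 0.898x

    b2, a2 = np.polyfit(y, x, 1)  # roles reversed
    print(a2, b2)                 # -9.176 and 0.851 -> x' = -9.176 + 0.851y

    y_hat = a + b * 2             # prediction at x = 2
    print(y_hat, 13 - y_hat)      # 14.243 and residual -1.243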
Interpretations
• Intercept is the predicted value for x = 0, which may be a hypothetical value if x = 0 makes no sense.
• We need to avoid extrapolation, trying to make predictions outside the range of the data.
• Slope is the marginal change in predicted y’ when predictor x increases by one unit.
• An outlier in regression is a point outside the pattern of the rest of the data.
• When any point has a substantial effect on the regression line equation, the point is considered influential. Outliers are often influential.
Outliers and Influential Points
• Anscombe’s Quartet: All four regression lines are y’ = 3 + 0.5x
  • X1: no issues
  • X2: not linear
  • X3: outlier
  • X4: influential point
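Anscombe's quartet ships with seaborn's sample datasets, so the claim is easy to verify; a minimal sketch (assuming seaborn is installed and its sample data can be downloaded):

    import numpy as np
    import seaborn as sns

    df = sns.load_dataset("anscombe")   # four datasets labeled I through IV
    for name, g in df.groupby("dataset"):
        b, a = np.polyfit(g["x"], g["y"], 1)
        r = np.corrcoef(g["x"], g["y"])[0, 1]
        print(name, round(a, 2), round(b, 3), round(r, 3))
    # every group prints roughly a = 3.0, b = 0.5, r = 0.816

Identical lines and correlations, yet only the first dataset is well described by a line; always plot the data.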
Interpreting Variation
• Thinking back, we earlier computed variance, which had total variation as the numerator.
• Total variation for response Y is $\sum (y - \bar{y})^2$
• Total variation can be divided into
  • Explained variation $\sum (y' - \bar{y})^2$
  • Unexplained or residual variation $\sum (y - y')^2$
• Coefficient of determination is correlation squared, R² = explained variation / total variation. It is the proportion of variation in y due to variation in x.
• 1 – R² = the coefficient of nondetermination. (The decomposition is checked in the sketch below.)
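A minimal Python check (assuming NumPy; names mine) that, on the six-pair dataset, the two pieces add to the total and explained/total equals r²:

    import numpy as np

    x = np.array([2, 7, 9, 1, 5, 12])
    y = np.array([13, 21, 23, 14, 15, 21])
    b, a = np.polyfit(x, y, 1)
    y_hat = a + b * x                              # predicted values y'

    total       = ((y - y.mean()) ** 2).sum()      # total variation
    explained   = ((y_hat - y.mean()) ** 2).sum()  # explained variation
    unexplained = ((y - y_hat) ** 2).sum()         # residual variation

    print(total, explained + unexplained)          # equal, up to rounding
    print(explained / total)                       # R^2, about 0.874^2 = 0.764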
Determination Practice
• For the earlier graph, correlation r = 0.768.
• Coefficient of determination r² = (0.768)² = 0.590.
• Coefficient of non-determination 1 – r² = 1 – 0.590 = 0.410.
• With height and self esteem, r = 0.180.
• Coefficient of determination r² = (0.180)² = 0.032.
• Coefficient of non-determination 1 – r² = 1 – 0.032 = 0.968.
