
Correlation

Agenda
• Correlation
– Definition & properties
– Calculating r
– Hypothesis test
– Confidence Intervals
Projects
• Check the notes on your project grade!!!
• If you need major changes:
– Talk to me or grad TA (ASAP)!!!
• If no major changes:
– Start collecting data and working on preliminary
analysis
WHAT NOW?!?!
• SDS 322E Data Science with R and Python
– You will learn modern data manipulation and
visualization techniques in R (and python) as well
as more advanced statistical concepts in applied
biomedical contexts.
– Grad school?
– Research?
Example
• You want to know if drinking diet soda relates to obesity. How
could you study this?
– Experimental or observational design?
– How do you display variables?

ID      Sodas/wk   BMI
1       10         27
2       14         24
3       7          32
4       5          20
5       10         29
6       0          14
…
100     7          31
mean    8.26       27.10
s       6.16       5.44
Example
• Scatterplot
Describe: Two Numeric Variables
• One numeric variable:
– Measure of center and spread
• Two numeric variables:
– Can still give a measure of center & spread for
each one independently
– Is there a way to describe them together? How
they relate to each other?
Correlation Coefficient “r”
• How are two numeric variables associated?
– Gives the strength and direction of their linear
relationship
– NO UNITS
Pearson (Product-Moment)
Correlation Coefficient
• ρ (population)
• r (sample)
-1 ≤ r ≤ 1
• Strength: How close is |r| to 1?
– Strong linear correlations have |r| close to 1
– Weak linear correlations have |r| close to 0
• Direction: + or -
– Positive: as x increases, so does y
– Negative: as x increases, y decreases
Strength of correlation:
0 < |r| < 0.3 = weak
0.3 < |r| < 0.5 = medium
0.5 < |r| < 1 = strong
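The rule of thumb above can be sketched in Python. The function name `describe_strength` is made up for illustration, the cutoffs are applied to |r| so that negative correlations are classified too, and the handling of values exactly on a cutoff (0.3 or 0.5) is an assumption, since the slide leaves that open.

```python
def describe_strength(r):
    """Label a correlation using the slide's rule-of-thumb cutoffs on |r|."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)
    if size < 0.3:
        return "weak"
    if size < 0.5:    # assumption: exactly 0.3 counts as medium
        return "medium"
    return "strong"   # assumption: exactly 0.5 counts as strong

# Three example values; 0.944 is the viper correlation used later in the deck
labels = [describe_strength(r) for r in (0.1, -0.4, 0.944)]
print(labels)
```

These labels are only a convention; what counts as a "strong" correlation can differ by field.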

NOTE: Slope doesn’t matter!


r indicates how close the points are
to a line.
Correlation

Properties that impact r :

1. Direction
2. Strength
3. Linearity
4. Presence of outliers
Correlation Properties
• 1. Direction
• 2. Strength
Correlation Properties

3. Importance of Linearity:
r = 0.0004

Correlation ≠
Association
Correlation Properties

4. Importance of Outliers:
r = 0.602

Without outlier:
r = 0.971
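The pull of a single outlier can be sketched in Python. The six (x, y) points and the extra outlying point below are made up for illustration; they are not the data behind the r = 0.602 and r = 0.971 plot. `pearson_r` is a helper name chosen here.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)

# Hypothetical, nearly linear data
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
r_without = pearson_r(x, y)

# Same data plus one hypothetical point far from the line
r_with = pearson_r(x + [10], y + [0.0])

print(f"without outlier: r = {r_without:.3f}")
print(f"with outlier:    r = {r_with:.3f}")
```

With only a handful of points, one observation far from the line can erase most of the correlation, which is why a scatterplot should be checked before r is reported.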
What’s the deal with r?
• Things that DON’T CHANGE r
– 1. r[x,y] = r[y,x]
• e.g. correlation between weight and height is the same
as between height and weight
What’s the deal with r?
• Things that DON’T change r

• 2. r[x,y] = r[ax,y] = r[x,by] = r[cx,dy]
• Can multiply x and/or y by a positive
constant and r stays the same (multiplying
just one of them by a negative constant
flips the sign of r)
• e.g. correlation between weight
(lbs) and height (inches) is the
same as between weight (kg)
and height (cm)
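Both "doesn't change r" properties can be checked numerically. This is a small sketch in Python: the five weight and height values are hypothetical, `pearson_r` is a helper name chosen here, and the unit conversions (1 lb ≈ 0.4536 kg, 1 in = 2.54 cm) are the usual ones.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)

# Hypothetical measurements for five people
weight_lb = [120, 150, 165, 180, 200]
height_in = [62, 66, 68, 71, 74]

r_imperial = pearson_r(weight_lb, height_in)
r_swapped = pearson_r(height_in, weight_lb)           # property 1: order of x and y

weight_kg = [w * 0.4536 for w in weight_lb]
height_cm = [h * 2.54 for h in height_in]
r_metric = pearson_r(weight_kg, height_cm)            # property 2: positive constants

# Multiplying one variable by a negative constant flips the sign of r
r_flipped = pearson_r([-w for w in weight_lb], height_in)

print(round(r_imperial, 4), round(r_swapped, 4), round(r_metric, 4), round(r_flipped, 4))
```

The first three values agree because z-scores are unchanged by a positive rescaling; the last has the same size with the opposite sign.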
Other important points
• Restriction of Range Problem
– Try to include the whole span of x values (“range”)
– Restricting it to one section of x’s will change your
correlation
[Figure: Scatterplot of Age and Height for Females. x-axis: Age (0 to 45); y-axis: Height (in).]
Correl all: 0.78
Correl age 0-12: 0.99
Correl age 15-45: 0.02
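The restriction-of-range effect can be reproduced with a small sketch in Python. The ages and heights below are invented to mimic the shape of the age and height plot (fast growth in childhood, then a plateau); they are not the data behind the slide, so the printed correlations will only resemble the slide's values.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)

# Hypothetical data: (age in years, height in inches)
child_age = [0, 2, 4, 6, 8, 10, 12]
child_height = [20, 34, 40, 45, 50, 55, 60]
adult_age = [15, 20, 25, 30, 35, 40, 45]
adult_height = [64, 65, 63, 65, 64, 65, 64]

r_all = pearson_r(child_age + adult_age, child_height + adult_height)
r_children = pearson_r(child_age, child_height)
r_adults = pearson_r(adult_age, adult_height)

print(f"all ages:   r = {r_all:.2f}")
print(f"ages 0-12:  r = {r_children:.2f}")
print(f"ages 15-45: r = {r_adults:.2f}")
```

The same two variables give very different correlations depending on which slice of x values is sampled.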
Correlation and Error
• We try to measure as accurately as possible
– Always some error
– Measurement error: (true value – measured value)
• Increasing measurement error will (usually)
reduce the absolute value of the correlation
coefficient
– attenuation
If you’re bored
• guessthecorrelation.com

• https://www.google.com/trends/correlate/
Calculating r
• Cigarettes: tar and nicotine
Cigarette Data

Brand            Tar (mg/cig)   Nicotine (mg/cig)
Now              2              0.2
TRUE             6              0.6
Vantage          8              0.7
Newport Stripe   11             0.9
Lucky Strike     13             1.1
Marlboro         16             1.2
mean             9.33           0.78
s                5.05           0.37

[Figure: scatterplot of the cigarette data. x-axis: Tar (mg/cig); y-axis: Nicotine (mg/cig).]
Calculating r
• Step 1: z-score for each tar value: z(tar) = (tar − mean) / s
– Now: (2 − 9.33) / 5.05 = -1.451
– TRUE: (6 − 9.33) / 5.05 = -0.659
• Step 2: z-score for each nicotine value: z(nic) = (nicotine − mean) / s
– Now: (0.2 − 0.78) / 0.37 = -1.568
– TRUE: (0.6 − 0.78) / 0.37 = -0.486
• Step 3: multiply the two z-scores for each brand
– Now: (-1.451)(-1.568) = 2.275
– TRUE: (-0.659)(-0.486) = 0.321
• Step 4: add up the products and divide by (n − 1)

Brand            Tar (mg/cig)   Nicotine (mg/cig)   z(tar)    z(nic)    z(tar)*z(nic)
Now              2              0.2                 -1.451    -1.568    2.275
TRUE             6              0.6                 -0.659    -0.486    0.321
Vantage          8              0.7                 -0.263    -0.216    0.057
Newport Stripe   11             0.9                 0.331     0.324     0.107
Lucky Strike     13             1.1                 0.727     0.865     0.629
Marlboro         16             1.2                 1.321     1.135     1.499
mean             9.33           0.78
s                5.05           0.37                                    Sum = 4.888

• r = 4.888 / (n − 1) = 4.888 / 5 = .978
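The z-score calculation for the cigarette data can be sketched in Python. The helper name `r_from_z` is made up here. The first call uses the rounded means and standard deviations printed in the table, which is how the hand calculation arrives at .978; the second call lets Python compute them at full precision.

```python
from statistics import mean, stdev

tar = [2, 6, 8, 11, 13, 16]
nicotine = [0.2, 0.6, 0.7, 0.9, 1.1, 1.2]

def r_from_z(x, y, mean_x, s_x, mean_y, s_y):
    """Sum the products of paired z-scores, then divide by n - 1."""
    products = [((a - mean_x) / s_x) * ((b - mean_y) / s_y) for a, b in zip(x, y)]
    return sum(products) / (len(x) - 1)

# Rounded summary statistics, as printed in the table
r_slide = r_from_z(tar, nicotine, 9.33, 5.05, 0.78, 0.37)

# Unrounded mean and sample standard deviation (n - 1 in the denominator)
r_full = r_from_z(tar, nicotine, mean(tar), stdev(tar), mean(nicotine), stdev(nicotine))

print(f"rounded inputs:   r = {r_slide:.3f}")
print(f"unrounded inputs: r = {r_full:.3f}")
```

The rounded inputs give about .978, matching the table, while the unrounded inputs give about .990. With only six observations, rounding s to two decimals is enough to move r in the second decimal place, so it is worth carrying extra digits when calculating by hand.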
Calculating r (again)
• The gas mileage of an automobile first increases
and then decreases as the speed increases.
Suppose this relationship is very regular (as
shown in the following table), with speed in mph
and mileage in miles per gallon.
Speed (MPH)   Mileage (MPG)
20            24
30            28
40            30
50            28
60            24
mean   40       26.8
s      15.811   2.683
Try it!
• The gas mileage of an automobile first increases
and then decreases as the speed increases.
Suppose this relationship is very regular (as
shown in the following table), with speed in mph
and mileage in miles per gallon.

Speed (MPH)   Mileage (MPG)   z(sp)    z(mile)   z(sp)z(mile)
20            24              -1.265   -1.043    1.320
30            28              -0.632   0.447     -0.283
40            30              0.000    1.193     0.000
50            28              0.632    0.447     0.283
60            24              1.265    -1.043    -1.320
Relationship?
[Figure: Relation Between MPH and MPG. x-axis: Speed (MPH); y-axis: Mileage (MPG).]
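The speed and mileage exercise can be worked through in Python as a check. This sketch uses the five table rows and the same z-score method as the cigarette example; the variable names are chosen here for illustration.

```python
from statistics import mean, stdev

speed = [20, 30, 40, 50, 60]      # mph
mileage = [24, 28, 30, 28, 24]    # miles per gallon

# z-scores using the sample standard deviation
z_speed = [(s - mean(speed)) / stdev(speed) for s in speed]
z_mileage = [(m - mean(mileage)) / stdev(mileage) for m in mileage]

# Products of paired z-scores, summed and divided by n - 1
products = [zs * zm for zs, zm in zip(z_speed, z_mileage)]
r = sum(products) / (len(speed) - 1)

for row in zip(speed, mileage, z_speed, z_mileage, products):
    print("{:>3} {:>3} {:>7.3f} {:>7.3f} {:>7.3f}".format(*row))
print(f"r = {r:.3f}")
```

The positive and negative products cancel, so r comes out at zero (up to floating-point rounding) even though mileage is completely determined by speed. The relationship is strong but curved, and r only measures linear association.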
Correlation Matrices
• Way to report the relations among several
numeric variables
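A correlation matrix can be built by computing r for every pair of variables. The sketch below is in plain Python; `tar` and `nicotine` are the six cigarette brands from the earlier example, while `co` is a hypothetical third variable added only so that there is a 3 × 3 matrix to show.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)

data = {
    "tar": [2, 6, 8, 11, 13, 16],
    "nicotine": [0.2, 0.6, 0.7, 0.9, 1.1, 1.2],
    "co": [3, 8, 9, 10, 14, 15],   # hypothetical values, not from the slides
}
names = list(data)

# matrix[i][j] is the correlation between variable i and variable j
matrix = [[pearson_r(data[a], data[b]) for b in names] for a in names]

print(" " * 10 + "".join(f"{name:>10}" for name in names))
for name, row in zip(names, matrix):
    print(f"{name:<10}" + "".join(f"{value:>10.3f}" for value in row))
```

The diagonal is always 1 (each variable with itself) and the matrix is symmetric because r[x,y] = r[y,x], so reports often show only the lower or upper triangle.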
Correlation Hypothesis Test
• Are the two variables significantly related?
– Is the correlation significant?
Correlation Hypothesis Test: Steps
• 1. Assumptions
• 2. Hypotheses
• 3. Calculate t
– need r and SEr
• 4. Find t*
• 5. Conclusion
Example
• Are the length and weight of vipers
significantly, linearly related? r = 0.944
Correlation Hypothesis Test
• Assumptions
– 1. Random sample
– 2. Independent observations
– 3. x, y come from a bivariate normal distribution
Check: Bivariate Normal Distribution

Bivariate Normal Distribution:


Check:
1. Relation between x and y is linear
2. Frequency distributions of x and y
separately are normal
3. Scatterplot of x and y is circular or
elliptical

Violations of Bivariate Normality
• Common issues
Did we meet the assumption?
Step 2: Hypotheses
• Are the two variables significantly linearly
related?

• HO: ρ = 0
• HA: ρ ≠ 0

• Remember: ρ is the correlation parameter (r is
the sample estimate)
Example
• Step 1: Assumptions
• Step 2: Hypotheses
– HO: there is no linear relationship between the
length and weight of vipers
• HO: ρ = 0
– HA: There is a linear relationship between the
length and weight of vipers
• HA: ρ ≠ 0
Collect Data, Find r and SEr

• r = .944

• SEr = √((1 − r²) / (n − 2)) = .1247 (with n = 9)
Step 3: Find t
• Good news, everyone!

• We can use a t distribution!


– Run a one-sample t-test!
• t = .944 / .1247 = 7.57
Step 4: Find t*
• t = 7.57
• Critical Value (t*)
– NOTE: df are n – 2
– two-tailed
• t* .05(2),7 = 2.365
Step 5: Conclusion
• |t| > t*
• REJECT THE NULL
– Reject HO: ρ = 0
• There is a significant, linear relationship between the
length and weight of vipers (t=7.57, df=7, p<.05)
• The correlation is significant!
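Steps 3 to 5 for the viper example can be sketched in Python. The function name `correlation_t_test` is made up here. The standard error uses the usual formula √((1 − r²) / (n − 2)), which reproduces the .1247 on the slide; n = 9 is inferred from df = 7, and the critical value t* is taken from the slide rather than computed.

```python
from math import sqrt

def correlation_t_test(r, n):
    """Return (SE of r, t statistic, df) for testing H0: rho = 0."""
    df = n - 2
    se = sqrt((1 - r**2) / df)
    return se, r / se, df

r, n = 0.944, 9     # n = 9 is inferred from df = 7
t_star = 2.365      # two-tailed critical value (alpha = .05, df = 7) from the slide

se, t, df = correlation_t_test(r, n)
reject_null = abs(t) > t_star

print(f"SEr = {se:.4f}, t = {t:.2f}, df = {df}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Because the decision only compares |t| with t*, the same function works for negative correlations.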
I want more!
• Confidence intervals!
• r = .944
• t* = 2.365
• SEr = 0.1247

– “We are 95% confident that the true population
correlation lies between _____ and _____.”
.65 < ρ < 1.24
Confidence Intervals: PROBLEM!
0.65 < ρ < 1.24

PROBLEM! r and ρ must both lie between -1 and 1

So just change the bounds of your CI:

0.65 < ρ < 1
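The interval and the fix for the out-of-range bound can be sketched in Python. The function name `confidence_interval_for_rho` is made up here; it follows the approach on these slides (r ± t* × SEr, then cut the bounds off at -1 and 1) and is not meant as a general-purpose method.

```python
def confidence_interval_for_rho(r, se, t_star):
    """r ± t* × SE, with the bounds cut off at -1 and 1."""
    lower = r - t_star * se
    upper = r + t_star * se
    return max(lower, -1.0), min(upper, 1.0)

# Viper example: r, SEr and t* as given on the slides
lower, upper = confidence_interval_for_rho(r=0.944, se=0.1247, t_star=2.365)
print(f"{lower:.2f} < rho < {upper:.2f}")
```

The unclipped upper bound would be about 1.24, which is impossible for a correlation, so it is replaced by 1.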
Try It!
• Is the correlation between diet sodas and BMI
significant?
• Assumptions
– random sample
– independent observations
– bivariate normality
Hypotheses
• HO: there is no linear relationship between
diet soda and BMI
– HO: ρ = 0
• HA: There is a linear relationship between diet
soda and BMI
– HA: ρ ≠ 0
Collect Data, Find r and SE
• r = .516
• n = 100
• SEr = 0.0865
• t = .516 / .0865 = 5.965
Find t*; Conclusion
• t = 5.965
• t* = 1.984

• |t| > t*
– Reject the null
– There is a significant, linear relationship between
diet soda consumption and BMI (t=5.965, df=98, p<.05)
I want more!
• Confidence intervals!
• r = .516
• t* = 1.984
• SEr = 0.0865

– “We are 95% confident that the true population
correlation lies between _____ and _____.”
.344 < ρ < .688
Note: If CI contains 0, then correlation is NOT signif!
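The link between the t-test and the confidence interval can be checked for the diet soda example with a short Python sketch. The values of r, n and t* come from the slides; the standard error uses the usual formula √((1 − r²) / (n − 2)), and the variable names are chosen here.

```python
from math import sqrt

r, n, t_star = 0.516, 100, 1.984   # values from the diet soda example

se = sqrt((1 - r**2) / (n - 2))
t = r / se
lower, upper = r - t_star * se, r + t_star * se

# Two ways of reaching the same decision
significant_by_t = abs(t) > t_star
ci_excludes_zero = lower > 0 or upper < 0

print(f"SEr = {se:.4f}, t = {t:.3f}")
print(f"{lower:.3f} < rho < {upper:.3f}")
print(f"|t| > t*: {significant_by_t}, CI excludes 0: {ci_excludes_zero}")
```

The t statistic comes out near 5.963 rather than 5.965 because the standard error is not rounded before dividing. The two decisions always agree when the interval is built as r ± t* × SEr, since |r / SEr| > t* is the same condition as 0 falling outside that interval.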
Correlation

Summary –
• r can tell us the strength and direction of
the linear relationship between X and Y
• You should confirm linearity and look out
for outliers or funneling
• You can run a one-sample t-test to see if r is
significant (non-zero)
• You can create a CI for ρ


IMPORTANT POINT!

Correlation does not imply Causation


*even if a very strong r

Instead: correlation, association, relation(ship)


Spurious Correlations
• People who drowned by falling into a pool and number
of films Nicolas Cage appeared in
r = 0.666004
• Per capita consumption of cheese and number of
people who died by becoming tangled in their
bedsheets
r = 0.947091
• Honey producing bee colonies and juvenile arrests for
possession of marijuana
r = -.933389
• Divorce rate in Mississippi and murders by bodily force
r = .890472
Think about it!
• In the example we did in class, we calculated
the correlation between length (in cm) and
weight (in g) of Vipera berus snakes (r =
0.944). If we converted the snake lengths into
inches by multiplying by 0.394 and converted
the weights into ounces by multiplying by
0.035, what will the new correlation be?
Explain.
Squarecap!
• In their study of hyena laughter, researchers
wanted to know whether certain audio
properties of calls are associated with the age of
the hyenas. They collected data on the age (in
years) and frequency (in Hz) of 16 hyenas and
calculated an r = -0.402.
• Assumptions?
• Hypotheses?
• 1. SE
• 2. Test statistic t
• 3. If t*=2.145, conclusion?
Squarecap!
• In their study of hyena laughter, researchers wanted to
know whether certain audio properties of calls are
associated with the age of the hyenas. They collected
data on the age (in years) and frequency (in Hz) of 16
hyenas and calculated an r=-0.402.
• Suppose the researchers calculate the same
correlation, r = -0.402, but it was based on a random
sample of 32 hyenas. Would this have changed your
conclusion above?
• 4. New test statistic
• 5. If t* = 2.042, conclusion?
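One way to check your work on the hyena questions is a short Python sketch like the one below. It assumes the usual standard error formula √((1 − r²) / (n − 2)) and uses the t* values given in the questions; the function name `t_for_r` is made up here.

```python
from math import sqrt

def t_for_r(r, n):
    """Return (SE of r, t statistic) for testing H0: rho = 0."""
    se = sqrt((1 - r**2) / (n - 2))
    return se, r / se

r = -0.402
results = {}
for n, t_star in [(16, 2.145), (32, 2.042)]:
    se, t = t_for_r(r, n)
    decision = "reject H0" if abs(t) > t_star else "fail to reject H0"
    results[n] = (se, t, decision)
    print(f"n = {n}: SEr = {se:.3f}, t = {t:.3f}, t* = {t_star} -> {decision}")
```

The same r leads to a different conclusion with the larger sample: doubling n shrinks the standard error, which pushes |t| past the critical value.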
Coming Up (TTh11)…
• Friday
– Discussion 3 is due by 11:59pm to canvas
• Tuesday
– Class: Regression
• Wednesday
– Pre-lab and Lab 8

• WORK ON PRELIMINARY ANALYSIS!!!


Coming Up (TTh11)…
• Friday
– Discussion 3 is due by 11:59pm to canvas
• Tuesday
– Class: Regression
– Pre-lab and Lab 8

• WORK ON PRELIMINARY ANALYSIS!!!


Coming Up (MW)…
• Tuesday
– Pre-lab and Lab 8
• Wednesday
– Class: Regression
• Friday
– PRELIMINARY ANALYSIS DUE!!!
