
Linear Correlation Analysis

Spring 2005

Superstitions
Walking under a ladder. Opening an umbrella indoors.

Empirical Evidence
Consumption of ice cream and drownings are generally positively correlated. Can we reduce the number of drownings if we prohibit ice cream sales in the summer?

3 kinds of relationships between variables


Association or Correlation or Covary
Both variables tend to be high or low (positive relationship) or one tends to be high when the other is low (negative relationship). Variables do not have independent & dependent roles.

Prediction
Variables are assigned independent and dependent roles. Both variables are observed. There is a weak causal implication that the independent predictor variable is the cause and the dependent variable is the effect.

Causal
Variables are assigned independent and dependent roles. The independent variable is manipulated and the dependent variable is observed. Strong causal statements are allowed.

General Overview of Correlational Analysis

The purpose is to measure the strength of a linear relationship between 2 variables. A correlation coefficient does not ensure causation (i.e., that a change in X causes a change in Y). X is typically the Input, Measured, or Independent variable. Y is typically the Output, Predicted, or Dependent variable. If, as X increases, there is a predictable shift in the values of Y, a correlation exists.

General Properties of Correlation Coefficients


Values can range between +1 and -1. The value of the correlation coefficient represents the scatter of points on a scatterplot. You should be able to look at a scatterplot and estimate what the correlation would be, and to look at a correlation coefficient and visualize the scatterplot.

Perfect Linear Correlation


Occurs when all the points in a scatterplot fall exactly along a straight line.

Positive Correlation (Direct Relationship)


As the value of X increases, the value of Y also increases. Larger values of X tend to be paired with larger values of Y (and, consequently, smaller values of X tend to be paired with smaller values of Y).

Negative Correlation (Inverse Relationship)


As the value of X increases, the value of Y decreases. Small values of X tend to be paired with large values of Y (and vice versa).

Non-Linear Correlation
As the value of X increases, the value of Y changes in a non-linear manner

No Correlation
As the value of X changes, Y does not change in a predictable manner. Large values of X seem just as likely to be paired with small values of Y as with large values of Y

Interpretation
Depends on what the purpose of the study is but here is a general guideline...

Value = magnitude of the relationship
Sign = direction of the relationship

Some of the many Types of Correlation Coefficients
(there are lots more; these are the ones we will focus on this semester, all of which are included in SPSS's Bivariate Correlation procedure)

Name            X variable               Y variable
Pearson r       Interval/Ratio           Interval/Ratio
Spearman rho    Ordinal                  Ordinal
Kendall's Tau   Ordinal                  Ordinal
Phi             Dichotomous              Dichotomous
Intraclass R    Interval/Ratio (Test)    Interval/Ratio (Retest)

The Pearson Product-Moment Correlation (r)


Named after Karl Pearson
(1857-1936)

Both X and Y measured at the Interval/Ratio level. The most widely used coefficient in the literature.

The Pearson Product-Moment Correlation (r)


A measure of the extent to which paired scores occupy the same or opposite positions within their own distributions

From: Pagano (1994)

Computing Pearson r: Hand Calculation

Computing Pearson r in EXCEL

Step #1
Step #2: Insert Function (Pearson)
Step #3: Select X and Y data
Step #4: Format output


Subject   X   Y
A         1   2
B         3   5
C         4   3
D         6   7
E         7   5

Pearson r = 0.73
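The same value can be verified outside Excel or SPSS. A minimal Python sketch (scipy is not part of the original slides) applies the definitional formula and the library routine to the five subjects:

```python
import numpy as np
from scipy import stats

# The five subjects from the slides.
x = np.array([1, 3, 4, 6, 7], dtype=float)
y = np.array([2, 5, 3, 7, 5], dtype=float)

# Hand formula: r = sum((x - xbar)(y - ybar)) / sqrt(sum((x - xbar)^2) * sum((y - ybar)^2))
dx, dy = x - x.mean(), y - y.mean()
r_hand = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# scipy returns the same coefficient plus a two-tailed p-value.
r, p = stats.pearsonr(x, y)

print(round(r_hand, 3), round(r, 3), round(p, 3))  # 0.731 0.731 0.161
```

Both routes reproduce the r = 0.73 shown above, and the p-value matches the SPSS output further below.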

Computing Pearson r in SPSS

Step #1
Step #2: Analyze-Correlate-Bivariate
Step #3: Select X and Y data
Step #4: Means + SDs

Computing Pearson r in SPSS: Output

Output #1:
Descriptive Statistics
        Mean   Std. Deviation   N
VARX    4.20   2.387            5
VARY    4.40   1.949            5

Output #2:
Correlations
                               VARX   VARY
VARX   Pearson Correlation     1      .731
       Sig. (2-tailed)         .      .161
       N                       5      5
VARY   Pearson Correlation     .731   1
       Sig. (2-tailed)         .161   .
       N                       5      5

Interpretation
r = 0.73, p = .161

The researchers found a moderate, but not significant, relationship between X and Y.

SAMPLE SIZE: One of the many issues involved with the interpretation of correlation coefficients

Descriptive Statistics
        Mean   Std. Deviation   N
VARX    4.20   2.179            25
VARY    4.40   1.780            25

Correlations
                               VARX     VARY
VARX   Pearson Correlation     1        .731**
       Sig. (2-tailed)         .        .000
       N                       25       25
VARY   Pearson Correlation     .731**   1
       Sig. (2-tailed)         .000     .
       N                       25       25

**. Correlation is significant at the 0.01 level (2-tailed).

Interpretation
r = 0.73, p = .000

The researchers found a significant, moderate relationship between X and Y.
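The dependence of the p-value on sample size can be checked directly: the p-value for a Pearson r comes from a t statistic with n - 2 degrees of freedom. A minimal sketch using the slides' r = .731 (the function name is mine):

```python
import numpy as np
from scipy import stats

def r_pvalue(r, n):
    """Two-tailed p-value for a Pearson r: t = r*sqrt((n-2)/(1-r^2)), df = n-2."""
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

# Same r, very different conclusions as n grows.
print(round(r_pvalue(0.731, 5), 3))   # ~0.16: not significant
print(r_pvalue(0.731, 25) < 0.001)    # True: highly significant
```

The same r = .73 that was not significant with 5 subjects is highly significant with 25.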

How can this be?

The distribution of Pearson r is not symmetrically shaped as r approaches 1


(see http://davidmlane.com/hyperstat/A98696.html for more information)

Examining the 95% confidence interval for r
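A sketch of how such an interval is commonly computed, via the Fisher z transformation (the function name and the use of the n = 25 example are my own choices):

```python
import numpy as np
from scipy import stats

def r_confint(r, n, level=0.95):
    """Confidence interval for a Pearson r via the Fisher z transformation."""
    z = np.arctanh(r)                          # r mapped to an approximately normal scale
    se = 1.0 / np.sqrt(n - 3)                  # standard error on the z scale
    zcrit = stats.norm.ppf(1 - (1 - level) / 2)
    return float(np.tanh(z - zcrit * se)), float(np.tanh(z + zcrit * se))

lo, hi = r_confint(0.731, 25)
print(round(lo, 2), round(hi, 2))  # roughly 0.47 to 0.87
```

Note that the interval is not symmetric around .731, which reflects the skewed sampling distribution of r described above.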

An additional way to Interpret Pearson r


Coefficient of Determination (r²)

The proportion of the variability of Y accounted for by X.

[Venn diagram: overlapping circles for the variability of X and the variability of Y; the area of overlap represents the proportion of the variability of Y accounted for by X, expressed as a %]
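As a worked example with the slides' r = .731:

```python
r = 0.731
r_squared = r ** 2
# About 53% of the variability of Y is accounted for by X.
print(round(r_squared, 3))  # 0.534
```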

Correlation Identification Practice


Let's see if you can identify the value of the correlation coefficient from a scatterplot. Click to begin.

Outliers

Observations that clearly appear to be out of range of the other observations.

[Scatterplot of Variable Y vs. Variable X (both axes 0-100): r = 0.97]

[Scatterplot of Variable Y vs. Variable X (both axes 0-100): r = 0.72]

What to do with Outliers

You are stuck with them unless...

Check to see if there has been a data entry error. If so, fix the data.
Check to see if these values are plausible. Is this score within the minimum and maximum score possible? If values are impossible, delete the data. Report how many scores were deleted.
Examine other variables for these subjects to see if you can find an explanation for these scores being so different from the rest. You might be able to delete them if your reasoning is sound.
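The impact of a single outlier is easy to demonstrate with made-up numbers (a hypothetical illustration, not the slides' dataset):

```python
import numpy as np
from scipy import stats

# Ten well-behaved points with a strong linear trend...
x = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], dtype=float)
y = np.array([12, 18, 35, 37, 52, 55, 73, 75, 88, 95], dtype=float)

# ...plus one implausible observation far out of range.
x_out = np.append(x, 100.0)
y_out = np.append(y, 5.0)

r_clean, _ = stats.pearsonr(x, y)
r_with, _ = stats.pearsonr(x_out, y_out)
print(round(r_clean, 2), round(r_with, 2))  # the single outlier drags r down sharply
```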

Correlation & Attenuation

Restricting the range of scores can have a large impact on a correlation coefficient.

[Scatterplot of Variable Y vs. Variable X (both axes 0-100), r = 0.72, with the points divided into LOW, MEDIUM, and HIGH ranges of X]

Low Group: r = 0.55

[Scatterplots of the LOW range of Variable X, shown on the full 0-100 axes and zoomed in (X 0-35, Y 0-45)]

Medium Group: r = 0.86

[Scatterplots of the MEDIUM range of Variable X, shown on the full 0-100 axes and zoomed in (X 20-70, Y 20-80)]

High Group: r = 0.67

[Scatterplots of the HIGH range of Variable X, shown on the full 0-100 axes and zoomed in (X 60-100, Y 60-100)]

Using all of the data: r = 0.72

[Scatterplot of Variable Y vs. Variable X (both axes 0-100) with the three ranges marked: LOW r = 0.55, MEDIUM r = 0.86, HIGH r = 0.67]
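Range restriction can be reproduced with simulated data. A sketch, assuming nothing about the slides' actual dataset:

```python
import numpy as np
from scipy import stats

# Simulate X uniform on 0-100 and Y linearly related to X plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 100, 300)
y = 0.7 * x + rng.normal(0, 25, 300)

r_all, _ = stats.pearsonr(x, y)

# Restrict the sample to the LOW third of the X range and recompute.
low = x < 33
r_low, _ = stats.pearsonr(x[low], y[low])

print(round(r_all, 2), round(r_low, 2))  # the restricted-range r is noticeably smaller
```

The relationship between X and Y is the same everywhere; only the range of X sampled has changed, yet r drops.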

Here's another problem with interpreting correlation coefficients that you should watch out for...

[Scatterplot of Y variable vs. X variable with men and women plotted separately. All data combined: r = +0.89. Men only: r = -0.21. Women only: r = +0.22.]
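This subgroup effect is also easy to simulate (hypothetical data, not the slides' figure): two groups with no within-group relationship, offset from each other on both variables, produce a strong combined correlation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Within each group, X and Y are independent draws (true r = 0).
women_x = rng.normal(40, 8, 50)
women_y = rng.normal(40, 8, 50)
men_x = rng.normal(90, 8, 50)
men_y = rng.normal(90, 8, 50)

r_women, _ = stats.pearsonr(women_x, women_y)
r_men, _ = stats.pearsonr(men_x, men_y)
r_all, _ = stats.pearsonr(np.concatenate([women_x, men_x]),
                          np.concatenate([women_y, men_y]))

# The offset between the groups alone creates a strong combined correlation.
print(round(r_women, 2), round(r_men, 2), round(r_all, 2))
```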

Reporting a set of Correlation Coefficients in a table

Complete correlation matrix. Notice redundancy.

Lower triangular correlation matrix. Values are not repeated. There is also an upper triangular matrix!

Spearman Rho (rs)


Named after Charles E. Spearman (1863-1945) Assumptions:
Data consist of a random sample of n pairs of numeric or non-numeric observations that can be ranked. Each pair of observations represents two measurements taken on the same object or individual.
Photo from: http://www.york.ac.uk/depts/maths/histstat/people/sources.htm

Why choose Spearman rho instead of a Pearson r?


Both X and Y are measured at the ordinal level.
Sample size is small.
X and Y are measured at the interval/ratio level, but are not normally distributed (e.g., severely skewed).
X and Y do not follow a bivariate normal distribution.

What is a Bivariate Normal Distribution?

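A numerical sketch of the idea: draw pairs from a bivariate normal distribution with a chosen correlation and confirm that the sample correlation comes out close to it.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.7
mean = [0.0, 0.0]
cov = [[1.0, rho],
       [rho, 1.0]]  # unit variances, so the off-diagonal entry is the correlation

xy = rng.multivariate_normal(mean, cov, size=5000)
r = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
print(round(r, 2))  # close to 0.7
```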

Sample Problem
Pincherle and Robinson (1974) note a marked inter-observer variation in blood pressure readings. They found that doctors who read high on systolic tended to read high on diastolic. Table 1 shows the mean systolic and diastolic blood pressure reading by 14 doctors. Research question: What is the strength of the relationship between the two variables?

Pincherle, G. & Robinson, D. (1974). Mean blood pressure and its relation to other factors determined at a routine executive health examination. J. Chronic Dis., 27, 245-260.

Table 1. Mean blood pressure readings, millimeters mercury, by doctor.

Doctor ID   Systolic   Diastolic
1           141.8      89.7
2           140.2      74.4
3           131.8      83.5
4           132.5      77.8
5           135.7      85.8
6           141.2      86.5
7           143.9      89.4
8           140.2      89.3
9           140.8      88.0
10          131.7      82.2
11          130.8      84.6
12          135.6      84.4
13          143.6      86.3
14          133.2      85.9

Research question: What is the strength of the relationship between the two variables?
Option #1: Compute a Pearson r.
If you do not feel these data meet the assumptions of the Pearson r, then
Option #2: Convert the data to ranks and compute a Spearman rho.
We will go over how to check the assumptions on Wednesday when we talk about Regression.

Computation of Spearman Rho: Step #1

Rank each X relative to all other observed values of X, from smallest to largest in order of magnitude. The rank of the ith value of X is denoted by R(Xi), and R(Xi) = 1 if Xi is the smallest observed value of X. Follow the same procedure for the Y variable.

Table 1 (continued). Mean blood pressure readings, millimeters mercury, by doctor, with ranks and rank differences.

Doctor ID   Systolic   Diastolic   R(systolic)   R(diastolic)   d      d²
1           141.8      89.7        12            14             -2     4
2           140.2      74.4        8.5           1              7.5    56.25
3           131.8      83.5        3             4              -1     1
4           132.5      77.8        4             2              2      4
5           135.7      85.8        7             7              0      0
6           141.2      86.5        11            10             1      1
7           143.9      89.4        14            13             1      1
8           140.2      89.3        8.5           12             -3.5   12.25
9           140.8      88.0        10            11             -1     1
10          131.7      82.2        2             3              -1     1
11          130.8      84.6        1             6              -5     25
12          135.6      84.4        6             5              1      1
13          143.6      86.3        13            9              4      16
14          133.2      85.9        5             8              -3     9

Σd² = 132.50
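The ranking step and the coefficient itself can be checked in Python (a sketch using scipy; Spearman rho is literally a Pearson r applied to the ranks):

```python
import numpy as np
from scipy import stats

systolic = np.array([141.8, 140.2, 131.8, 132.5, 135.7, 141.2, 143.9,
                     140.2, 140.8, 131.7, 130.8, 135.6, 143.6, 133.2])
diastolic = np.array([89.7, 74.4, 83.5, 77.8, 85.8, 86.5, 89.4,
                      89.3, 88.0, 82.2, 84.6, 84.4, 86.3, 85.9])

# Step #1: rank each variable; ties share the average rank
# (the two 140.2 systolic readings both get 8.5).
rank_sys = stats.rankdata(systolic)
rank_dia = stats.rankdata(diastolic)

# Spearman rho = Pearson r computed on the ranks.
rho_by_hand, _ = stats.pearsonr(rank_sys, rank_dia)
rho, p = stats.spearmanr(systolic, diastolic)
print(round(rho, 3))  # 0.708, in line with the SPSS table
```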

Computing Spearman Rho using SPSS

Analyze-Correlate-Bivariate

Spearman's rho Correlations
                                      SYSTOLIC   DIASTOLI
SYSTOLIC   Correlation Coefficient    1.000      .708**
           Sig. (2-tailed)            .          .005
           N                          14         14
DIASTOLI   Correlation Coefficient    .708**     1.000
           Sig. (2-tailed)            .005       .
           N                          14         14

**. Correlation is significant at the .01 level (2-tailed).

Kendall's Tau (τ, T, or t)

Named after Sir Maurice G. Kendall (1907-1983)

Based on the ranks of observations.
Values range between -1 and +1.
Computation is more tedious than rs.
Defined as the probability of concordance minus the probability of discordance.
Typically will yield a different value than rs.

To find out more about this statistic, see http://www2.chass.ncsu.edu/garson/pa765/assocordinal.htm
Photo from: http://www.york.ac.uk/depts/maths/histstat/people/sources.htm
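For the same blood-pressure data, scipy's implementation (tau-b by default, which, like SPSS, adjusts for ties) reproduces the SPSS value:

```python
import numpy as np
from scipy import stats

systolic = np.array([141.8, 140.2, 131.8, 132.5, 135.7, 141.2, 143.9,
                     140.2, 140.8, 131.7, 130.8, 135.6, 143.6, 133.2])
diastolic = np.array([89.7, 74.4, 83.5, 77.8, 85.8, 86.5, 89.4,
                      89.3, 88.0, 82.2, 84.6, 84.4, 86.3, 85.9])

# scipy's kendalltau computes tau-b, the ties-adjusted variant.
tau, p = stats.kendalltau(systolic, diastolic)
print(round(tau, 3))  # 0.486
```

Note that tau (.486) differs from rho (.708) for the same data, as the slide above says it typically will.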

Comparison of values for the Blood Pressure Data

Pearson Correlations
                                   SYSTOLIC   DIASTOLI
SYSTOLIC   Pearson Correlation     1          .418
           Sig. (2-tailed)         .          .136
           N                       14         14
DIASTOLI   Pearson Correlation     .418       1
           Sig. (2-tailed)         .136       .
           N                       14         14

Spearman's rho
                                      SYSTOLIC   DIASTOLI
SYSTOLIC   Correlation Coefficient    1.000      .708**
           Sig. (2-tailed)            .          .005
           N                          14         14
DIASTOLI   Correlation Coefficient    .708**     1.000
           Sig. (2-tailed)            .005       .
           N                          14         14

**. Correlation is significant at the .01 level (2-tailed).

Kendall's tau_b
                                      SYSTOLIC   DIASTOLI
SYSTOLIC   Correlation Coefficient    1.000      .486*
           Sig. (2-tailed)            .          .016
           N                          14         14
DIASTOLI   Correlation Coefficient    .486*      1.000
           Sig. (2-tailed)            .016       .
           N                          14         14

*. Correlation is significant at the .05 level (2-tailed).

Types of Correlation Coefficients

Pearson "Family"
Name                     Symbol   X variable          Y variable
Pearson Product-moment   r        Interval/Ratio      Interval/Ratio
Spearman rho             rs       Ordinal             Ordinal
Phi                      φ        True Dichotomous    True Dichotomous
Point Biserial           rpb      True Dichotomous    Interval/Ratio
Rank-Biserial            rrb      True Dichotomous    Ordinal

Non-Pearson "Family"
Name            Symbol   X variable           Y variable
Kendall's Tau   τ        Ordinal              Ordinal
Biserial        rb       Forced Dichotomous   Interval/Ratio
Tetrachoric     rt       Forced Dichotomous   Forced Dichotomous

Definitions:
True Dichotomous: A variable that is nominal and has only two levels.
Forced Dichotomous: The variable is assumed to have an underlying normal distribution, but is forced to be a dichotomous variable (e.g., Rich/Poor, Happy/Sad, Smart/Not Smart, etc.)

From: http://www.oandp.org/jpo/library/1996_03_105.asp

Nonparametric tests should not be substituted for parametric tests when parametric tests are more appropriate. Nonparametric tests should be used when the assumptions of parametric tests cannot be met, when very small numbers of data are used, and when no basis exists for assuming certain types or shapes of distributions (9). Nonparametric tests are used if data can only be classified, counted, or ordered; for example, rating staff on performance or comparing results from manual muscle tests. These tests should not be used in determining precision or accuracy of instruments because the tests are lacking in both areas.

From: http://www.unesco.org/webworld/idams/advguide/Chapt4_2.htm

Pearson correlation is unduly influenced by outliers, unequal variances, non-normality, and nonlinearity. An important competitor of the Pearson correlation coefficient is Spearman's rank correlation coefficient. This latter correlation is calculated by applying the Pearson correlation formula to the ranks of the data rather than to the actual data values themselves. In so doing, many of the distortions that plague the Pearson correlation are reduced considerably.

For more information about the effect of ties on Spearman Rho, see

Conover, W. J., & Iman, R. L. (1978). Approximations of the critical region for Spearman's rho with and without ties present. Communications in Statistics, B7(3), 269-282.
