Introduction to Correlation
The Pearson product-moment correlation coefficient measures the
degree of association between two interval (or better)-level
variables. Examples: the relationship between daily consumption
of fat calories and body weight; between attitudes towards smoking
and attitudes towards consumption of alcohol; or between student
achievement and dollars per student spent by the school district.
Sometimes both of the variables are treated as “dependent,”
meaning that we haven’t ordered them causally. Sometimes one
of the variables, X, is treated as independent and the other, Y, as
dependent. Which of these is dependent and which is independent
depends on your theory of the relationship.
The correlation coefficient, Pearson’s r, ranges between +1 and -1
where +1 is a perfect positive association (people who get high
scores on X also get high scores on Y) and -1 is a perfect negative
association (people who get high scores on X get low scores on Y).
A correlation near zero indicates that there is little or no linear
relationship between scores on the two variables.
Related Measures of Association
The correlation coefficient is related to other types of
measures of association:
The partial correlation, which measures the degree of
association between two variables when the effect
on them of a third variable is removed: what is the
relationship between student achievement and
dollars per student spent by the school district when
the effect of parents’ SES is removed?
The multiple correlation, which measures the degree
to which one variable is correlated with two or more
other variables: how well can I predict student
achievement knowing mean school district
expenditure per pupil and parent SES?
Other Related Measures
The squared Pearson’s correlation coefficient,
usually called R squared or the coefficient of
determination, tells us how much of the
variation in Y, the dependent variable, can be
explained by variation in X, the independent
variable; for example, how much of the
variation in student achievement can be
explained by dollars per student expenditure by
the school district?
The quantity 1 − R² is sometimes called the
coefficient of non-determination; it is an
estimate of the proportion of variance in the
dependent variable that is not explained by the
independent variable
Scatterplot: Visual Representation of the
Relationship Measured by the Correlation
Coefficient
The scatterplot is a figure that plots two measures against
each other for every case on which both have been taken
(for example, people who have filled out a survey of their
attitudes toward smoking and another survey about their
attitudes toward drinking)
In a scatterplot, one of the variables (usually the
independent variable) is plotted along the horizontal or X
axis and the other is plotted along the vertical or Y axis
Each point in a scatterplot corresponds to the scores (X,Y)
for an individual case (a person, for example) where X is
the score that person was assigned or obtained on one
variable and Y is the score they attained on the other
The linear relationship between X and Y is stronger the
more closely the swarm of points in the scatterplot
approximates a diagonal “line” across the graph
An Example of a Scatterplot
In this scatterplot, computer anxiety scores (openness to computing)
are plotted against the Y (vertical) axis and computer self-efficacy
scores are plotted along the X (horizontal) axis. For example, the
person to whom the arrow is pointing had a score of about 17 on the
openness scale and about 162 on the self-efficacy scale. What were ...
[Figure: scatterplot of Openness to computing (Y axis) against Computer Self-efficacy (X axis)]
... the relationship is not perfect, there is a tendency ...
... you might want to find out if a line that represented a
curvilinear relationship (in this case an inverted U) was a better
fit, but we’ll leave that question for another time. The line that
represents this relationship best mathematically is called a
“regression line” and the point at which the mathematically
best fitting line crosses the y axis is called the “intercept”
Various Types of Associations
[Figure: example scatterplots. Panel captions: “Strong negative relationship between X and Y; points tightly clustered” and “Essentially no relationship between X and Y”. Axis labels: Total Years of Education, Number of Children.]
How is the Correlation Coefficient
Computed?
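The conceptual formula for the
correlation coefficient is a little
daunting, but it looks like this:
$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\left[\sum (X - \bar{X})^2\right]\left[\sum (Y - \bar{Y})^2\right]}} $$
The equivalent computing formula, which works from raw sums, is:
$$ r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{\left[N \sum X^2 - (\sum X)^2\right]\left[N \sum Y^2 - (\sum Y)^2\right]}} $$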
A negative
relationship: The
more shy you are
(the farther you are
along the X axis),
the fewer speeches
you give (the lower
you are on the Y
axis)
Computational Example of r for the
relationship between Shyness and
Speeches
Shyness (X)   Speeches (Y)   XY    X²    Y²
0             8              0     0     64
2             10             20    4     100
3             4              12    9     16
6             6              36    36    36
9             1              9     81    1
10            3              30    100   9
Sum: 30       32             107   230   226
$$ r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{\left[N \sum X^2 - (\sum X)^2\right]\left[N \sum Y^2 - (\sum Y)^2\right]}} = \frac{(6 \times 107) - (30)(32)}{\sqrt{\left[6(230) - 30^2\right]\left[6(226) - 32^2\right]}} $$
r = -.797 (note crossproducts
term in the numerator is
negative) and R-square = .635
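As a cross-check on the hand calculation, here is a minimal Python sketch that applies both the computing formula and the conceptual (deviation-score) formula to the six Shyness/Speeches cases. The function and variable names are illustrative choices, not part of the slides or of SPSS.

```python
import math

# Shyness (X) and Speeches (Y) scores for the six cases in the example
shyness = [0, 2, 3, 6, 9, 10]
speeches = [8, 10, 4, 6, 1, 3]


def pearson_r_computing(x, y):
    """Pearson's r from raw sums (the computing formula)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


def pearson_r_conceptual(x, y):
    """Pearson's r from deviations about the means (the conceptual formula)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cross_products / math.sqrt(ss_x * ss_y)


r = pearson_r_computing(shyness, speeches)
print(round(r, 3))       # -0.797
print(round(r ** 2, 3))  # 0.635
```

Both functions should return the same value up to floating-point rounding; the computing formula is simply easier to do by hand because it avoids subtracting the mean from every score.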
SPSS vs. the Hand Calculation: It’s
a Lot Quicker
Now let’s try computing the coefficient with that same
data in SPSS
Go to Analyze/Correlate/Bivariate, and move Shyness
and Speeches into the Variables box. Click Pearson,
one-tailed, and OK. Did you get the same result as
the hand calculation?
Correlations
                                  Shyness   Speeches
Shyness    Pearson Correlation    1         -.797*
           Sig. (1-tailed)                  .029
           N                      6         6
Speeches   Pearson Correlation    -.797*    1
           Sig. (1-tailed)        .029
           N                      6         6
*. Correlation is significant at the 0.05 level (1-tailed).
Using SPSS to Test a Hypothesis about the Strength of
Association between Two Interval or Ratio Level
Variables: Correlation Coefficient
Download the file called World95.sav
We are going to test the strength of the association between
population density (the variable is “number of people per square
kilometer”) and “average female life expectancy,” based on data
from 109 cases (109 countries, with each country a case). Our
hypothesis is that the association will be negative; that is, as
population density increases, female life expectancy will decrease
In SPSS Data Editor, go to Analyze/ Correlate/ Bivariate
Move the two variables, “number of people per square kilometer” and
“average female life expectancy” into the variables box
Under correlation coefficients, select Pearson
Under Tests of Significance, click one-tailed (we are making a
directional prediction, so we will only accept as significant results in
the “negative” 5% of the distribution)
Click “flag significant results”
Click Options, and under Statistics, select Means and standard
deviations, then Continue, then OK
Compare your output to the next slide
SPSS Output for Bivariate
Correlation
Descriptive Statistics
                                              Number of people   Average female
                                              / sq. kilometer    life expectancy
Number of people      Pearson Correlation     1                  .128
/ sq. kilometer       Sig. (1-tailed)         .                  .093
                      N                       109                109
Average female        Pearson Correlation     .128               1
life expectancy       Sig. (1-tailed)         .093               .
                      N                       109                109
Significance Test of Pearson’s r
Significance of r is tested with a t-statistic
with N-2 degrees of freedom, where
$$ t = \frac{r \sqrt{N - 2}}{\sqrt{1 - r^2}} $$
SPSS provides the results of the
t test of the significance of r for
you. Can also consult table F in ...
The hypothesis that the proportion of its people living in cities would be positively
associated with a country’s rate of male literacy was confirmed (r = .587, DF=83, p < .01,
one-tailed).
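The t formula for r can be sketched in a few lines of Python. This is only an illustration: `t_for_r` is a made-up helper name, and the two calls use the r and N values reported in these slides (N = 85 is inferred from DF = 83). It computes the t statistic only; the p-value still comes from SPSS or a t table.

```python
import math


def t_for_r(r, n):
    """t statistic for testing Pearson's r, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Population density vs. female life expectancy: r = .128, N = 109
t_density = t_for_r(0.128, 109)

# Percent living in cities vs. male literacy: r = .587, DF = 83, so N = 85
t_literacy = t_for_r(0.587, 85)

print(round(t_density, 2), round(t_literacy, 2))
```

The second value should come out close to the t of 6.608 that SPSS reports for the slope in the regression output for the same two variables, since with a single predictor the two tests are equivalent; the small difference comes from r being rounded to three decimals.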
The Regression Model
Regression takes us a step beyond correlation in that
not only are we concerned with the strength of the
association, but we want to be able to describe its
nature with sufficient precision to be able to make
predictions
To be able to make predictions, we need to be able to
characterize one of the variables in the relationship as
independent and the other as dependent
For example, in the relationship (male literacy--% of
people living in the cities), the causal order seems
pretty obvious. Literacy rates are not likely to produce
urbanization, but urbanization is probably causally
prior to increases in literacy rates
Regression model, cont’d
A regression equation is used to predict the value of a
dependent variable, Y, in this case a country’s male
literacy rate, on the basis of some constant a that
applies to all cases, plus some amount b which is
applied to each individual value of X (the country’s %
of people living in cities), plus some error term e that
is unique to the individual case and unpredictable: Y
= a + bX + e (male literacy = a + b(percent urban)
+ e)
You can think of regression as an equation which best
fits a scatter of points representing the plot of X and Y
against each other
Calculating the Regression Line
What line best describes the
relationship depicted in the
scattergram?
The formula for the regression line
is Y = a + bX + e where Y is (in
this case) a country’s score on
male literacy and X is the
country’s % of people living in
cities, a is the y-intercept or
constant (the point where the line
crosses the Y axis, or, the value of
Y when X is zero if X is a variable
for which there is zero amount of
the property) and b is the slope of
the regression line (the amount
by which male literacy changes for
each unit change in percent living
in cities)
We can use the formula to predict
Y given that we know X
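The slides leave the hand computation of a and b to SPSS. As a rough illustration only, the standard least-squares formulas b = (N∑XY − ∑X∑Y) / (N∑X² − (∑X)²) and a = Ȳ − bX̄ can be applied to the small Shyness/Speeches data set used in the correlation example. These formulas and the function name `least_squares_line` are not taken from the slides.

```python
def least_squares_line(x, y):
    """Return (intercept a, slope b) of the least-squares line Y = a + bX."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * (sum_x / n)
    return intercept, slope


# Shyness (X) and Speeches (Y) from the correlation example
shyness = [0, 2, 3, 6, 9, 10]
speeches = [8, 10, 4, 6, 1, 3]

a, b = least_squares_line(shyness, speeches)
print(round(a, 3), round(b, 4))

# Predicted number of speeches for a person with a shyness score of 5
predicted = a + b * 5
```

The slope is negative, matching the negative correlation found for these data: each additional point of shyness lowers the predicted number of speeches.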
Calculating the Regression Line,
cont’d
We will not do the hand computations for b, the slope of
the regression line, or a, the intercept. Let’s use
another SPSS method for finding not only the correlation
coefficient, Pearson’s r, but also the regression equation
(i.e., find the intercept, a, and the slope of the
regression line, b)
In Data Editor, go to Analyze/Regression/Linear
Put the dependent variable, male literacy, into
the Dependent box
Move the independent variable, percentage of
people living in cities, into the Independent(s)
box, and click OK
Compare your output to the next slide
Finding the Intercept (Constant) and Slope (β
or unstandardized regression coefficient)
Coefficients(a)
                                  Unstandardized Coefficients   Standardized Coefficients
Model                             B         Std. Error          Beta       t         Sig.
1  (Constant)                     52.372    4.378                          11.961    .000
   People living in cities (%)    .495      .075                .587       6.608     .000
a. Dependent Variable: Males who read (%)
The intercept, or a (sometimes called β0): the B value in the (Constant) row, 52.372
The slope, or β: the B value in the People living in cities (%) row, .495
Beta weight (.587): the slope when X and Y are expressed in standard score units
The regression equation for predicting Y (male literacy) is Y = a + (b)X, or Y =
52.372 +.495X, so if we wanted to predict the male literacy rate in country j we
would multiply its percentage of people living in cities by .495, and add the
constant, 52.372. Compare this to the scatterplot. Does it look right?
When scores on X and Y are available as Z scores, and are expressed in the same
standardized units, then there is no intercept (constant) because you don’t have to
make an adjustment for the differences in scale between X and Y, and so the
equation for the regression line just becomes Y = (b) X, or in this case Y = .587 X,
where .587 is the standardized version of b (note that it’s also the value of r, but
only when there is just the one X variable and not multiple independent variables)
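To make the prediction step concrete, here is a small sketch that plugs the SPSS coefficients into both the raw-score and the standardized equation. The constant and function names are illustrative, and the input of 50% urban is an arbitrary value, not a country from World95.sav.

```python
# Coefficients taken from the SPSS Coefficients table
INTERCEPT_A = 52.372   # (Constant), unstandardized
SLOPE_B = 0.495        # People living in cities (%), unstandardized
BETA = 0.587           # standardized coefficient (Beta)


def predict_male_literacy(pct_urban):
    """Predicted male literacy (%) from percent of people living in cities."""
    return INTERCEPT_A + SLOPE_B * pct_urban


def predict_literacy_z(z_urban):
    """Predicted literacy z score from the percent-urban z score (no intercept)."""
    return BETA * z_urban


# Arbitrary example: a country with 50% of its people living in cities
print(round(predict_male_literacy(50), 3))

# A country one standard deviation above the mean on percent urban
print(predict_literacy_z(1.0))
```

Note that the standardized prediction is in standard deviation units of male literacy, not in percentage points.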
More Output from Linear
Regression
Model Summary
Model            Sum of Squares   df   Mean Square   F        Sig.
1  Regression    12104.898        1    12104.898     43.668   .000a
   Residual      23007.879        83   277.203
   Total         35112.776        84
a. Predictors: (Constant), People living in cities (%)
b. Dependent Variable: Males who read (%)
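The sums of squares in this table tie back to the earlier results. As a hedged sketch, using the standard relationships R² = SS regression / SS total and F = MS regression / MS residual (these formulas are not spelled out in the slides), the reported values can be checked like this:

```python
# Sums of squares and degrees of freedom as reported by SPSS
ss_regression = 12104.898
ss_residual = 23007.879
ss_total = 35112.776
df_regression = 1
df_residual = 83

# Proportion of variance in male literacy explained by percent urban
r_squared = ss_regression / ss_total

# Mean squares and the F ratio
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

print(round(r_squared, 3), round(f_ratio, 2))
```

With one predictor, R² should match the square of the correlation (.587² ≈ .345) and F should match the square of the t for the slope (6.608² ≈ 43.67), apart from rounding.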
Control Variables                                                    People living   Males who   Gross domestic
                                                                     in cities (%)   read (%)    product / capita
-none-(a)          People living in cities (%)   Correlation             1.000         .587          .591
                                                 Significance (1-tailed)  .            .000          .000
                                                 df                       0            83            83
                   Males who read (%)            Correlation             .587          1.000         .417
                                                 Significance (1-tailed) .000          .             .000
                                                 df                       83           0             83
                   Gross domestic product /      Correlation             .591          .417          1.000
                   capita                        Significance (1-tailed) .000          .000          .
                                                 df                       83           83            0
Gross domestic     People living in cities (%)   Correlation             1.000         .464
product / capita                                 Significance (1-tailed)  .            .000
                                                 df                       0            82
                   Males who read (%)            Correlation             .464          1.000
                                                 Significance (1-tailed) .000          .
                                                 df                       82           0
a. Cells contain zero-order (Pearson) correlations.
Upper block (-none-): zero-order r, including the zero-order r of the control variable with X and Y
Lower block: r when effect of GDP is removed
Note that the partial correlation of % people living in cities and male literacy is
only .4644 when GDP is held constant, whereas the zero-order correlation you
obtained previously was .5871. So clearly GDP is a control variable which
influences the relationship between % of people living in cities and male literacy,
although the % living in cities-literacy relationship is still significant even with
GDP removed.
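The partial correlation can be reproduced from the three zero-order correlations in the table with the standard first-order partial correlation formula, r_xy.z = (r_xy − r_xz·r_yz) / √[(1 − r_xz²)(1 − r_yz²)]. The formula is not given in the slides, and `partial_r` is an illustrative name, so treat this as a sketch rather than a description of what SPSS does internally.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y with Z held constant."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator


# Zero-order correlations from the SPSS table
r_urban_literacy = 0.587   # % living in cities with male literacy
r_urban_gdp = 0.591        # % living in cities with GDP per capita
r_literacy_gdp = 0.417     # male literacy with GDP per capita

r_partial = partial_r(r_urban_literacy, r_urban_gdp, r_literacy_gdp)
print(round(r_partial, 3))
```

The result should land very close to the .4644 SPSS reports; any small gap comes from using correlations rounded to three decimals.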