
CORRELATION

Introduction to Correlation

 The Pearson product-moment correlation coefficient measures the degree of association between two interval (or better)-level variables: for example, the relationship between daily consumption of fat calories and body weight, between attitudes towards smoking and attitudes towards consumption of alcohol, or between student achievement and dollars per student spent by the school district
 Sometimes both of the variables are treated as “dependent,” meaning that we haven’t ordered them causally. Sometimes one of the variables, X, is treated as independent and the other, Y, as dependent. Which is dependent and which is independent depends on your theory of the relationship
 The correlation coefficient, Pearson’s r, ranges between +1 and -1, where +1 is a perfect positive association (people who get high scores on X also get high scores on Y) and -1 is a perfect negative association (people who get high scores on X get low scores on Y). A correlation near zero indicates that there is no linear relationship between scores on the two variables
Related Measures of Association

 The correlation coefficient is related to other types of measures of association:
 The partial correlation, which measures the degree of association between two variables when the effects on them of a third variable are removed: what is the relationship between student achievement and dollars per student spent by the school district when the effect of parents’ SES is removed?
 The multiple correlation, which measures the degree to which one variable is correlated with two or more other variables: how well can I predict student achievement knowing mean school district expenditure per pupil and parent SES?
Other Related Measures

 The squared Pearson’s correlation coefficient, usually called R squared or the coefficient of determination, tells us how much of the variation in Y, the dependent variable, can be explained by variation in X, the independent variable; for example, how much of the variation in student achievement can be explained by dollars per student expenditure by the school district?
 The quantity 1 – R² is sometimes called the coefficient of non-determination, and it is an estimate of the proportion of variance in the dependent variable that is not explained by the IV
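The arithmetic here is simple enough to sketch in a few lines of Python (these slides work in SPSS, so this is purely illustrative; the r value is taken from the shyness/speeches example later in the deck):

```python
# R-square (coefficient of determination) and 1 - R-square (coefficient
# of non-determination) derived from a correlation coefficient.
r = -0.797                         # Pearson's r from the Shyness/Speeches example
r_squared = r ** 2                 # proportion of variance in Y explained by X
non_determination = 1 - r_squared  # proportion of variance left unexplained

print(round(r_squared, 3))          # 0.635
print(round(non_determination, 3))  # 0.365
```

Note that squaring removes the sign, so X and Y "share" 63.5% of their variance whether the relationship is positive or negative.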
Scatterplot: Visual Representation of the
Relationship Measured by the Correlation Coefficient

 The scatterplot is a figure that plots cases for which two measures have been taken (for example, people who have filled out a survey of their attitudes toward smoking and another survey about their attitudes toward drinking) against each other
 In a scatterplot, one of the variables (usually the independent
variable) is plotted along the horizontal or X axis and the other is
plotted along the vertical or Y axis
 Each point in a scatterplot corresponds to the scores (X,Y) for an
individual case (a person, for example) where X is the score that
person was assigned or obtained on one variable and Y is the
score they attained on the other
 The strength of the linear relationship between X and Y is
stronger as the swarm of points in the scatterplot more closely
approximates a diagonal “line” across the graph
An Example of a Scatterplot

[Scatterplot: Computer Self-efficacy (X axis, 20–180) vs. Openness to Computing (Y axis, 0–30)]

In this scatterplot, computer anxiety scores (openness to computing) are plotted against the Y (vertical) axis and computer self-efficacy scores are plotted along the X (horizontal) axis. For example, the person to whom the arrow is pointing had a score of about 17 on the openness scale and about 162 on the self-efficacy scale. What were the scores on the two scales of the person with the star next to his point?
Scatterplot Allows You to Visualize the
Relationship between Variables
[Scatterplot: Computer Self-efficacy (X axis) vs. Openness to Computing (Y axis)]

The purpose of the scatterplot is to visualize the relationship between the two variables represented by the horizontal and vertical axes. Note that although the relationship is not perfect, there is a tendency for higher values of openness to computing to be associated with larger values of computer self-efficacy, suggesting that as openness increases, self-efficacy increases. This indicates that there is a positive correlation
Drawing A Possible Regression Line

[Scatterplot with fitted line: Computer Self-efficacy (X axis, 20–180) vs. Openness to Computing (Y axis, 0–30)]

Let’s draw a line through the swarm of points that best “fits” the data set (minimizes the distance between the line and each of the points). This imposes a linear description of the relationship between the two variables, when sometimes you might want to find out if a line that represented a curvilinear relationship (in this case an inverted U) was a better fit, but we’ll leave that question for another time. The line that represents this relationship best mathematically is called a “regression line,” and the point at which the mathematically best-fitting line crosses the Y axis is called the “intercept”
Various Types of Associations

[Three scatterplots with fitted lines; one panel plots Number of Children (X axis, -2 to 8) against Total Years of Education (Y axis, 0 to 30)]

Strong negative relationship between X and Y; points tightly clustered around line; nonlinear trend at lower weights

Essentially no relationship between X and Y; points loosely clustered around line

Positive relationship between X and Y
How is the Correlation Coefficient
Computed?
 The conceptual formula for the
correlation coefficient is a little
daunting, but it looks like this:
r = ∑(X – X̄)(Y – Ȳ) / √( [∑(X – X̄)²] [∑(Y – Ȳ)²] )

Where X is a person’s or case’s score on the independent variable, Y is a person’s or case’s score on the
dependent variable, and X-bar and Y-bar are the means of the scores on the independent and dependent
variables, respectively. The quantity in the numerator is called the sum of the crossproducts (SP). The
quantity in the denominator is the square root of the product of the sum of squares for both variables (SSx
and SSy)
Meaning of Crossproducts

 The notion of the crossproducts is not too difficult to understand. When we have a positive relationship between two variables, a person who is high on one of the variables will also score high on the other. And it follows that if his or her score on X is larger than the mean of variable X, then if there is a positive relationship his or her score on Y will be larger than the mean of Y. And this should hold for all or most of the cases
 When the crossproducts are negative (when for example the typical person
who scores higher than the mean on X scores lower than the mean on Y)
then there still may be a relationship but it is a negative relationship
 Thus the sign of the crossproducts (positive or negative) in the numerator of
the formula for r tells us whether the relationship is positive or negative
 You can think of the formula for r as the ratio of (a) how much score
deviation the two distributions (X and Y) have in common to (b) the
maximum amount of score deviation they could have in common
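As an illustration (the slides themselves do all computation in SPSS), the conceptual formula can be written directly as the ratio of the sum of crossproducts, SP, to the maximum deviation the two distributions could share; the data are the shyness/speeches scores from the worked example later in the deck:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Conceptual (deviation-score) formula: r = SP / sqrt(SSx * SSy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # sum of crossproducts
    ssx = sum((x - mx) ** 2 for x in xs)                   # sum of squares for X
    ssy = sum((y - my) ** 2 for y in ys)                   # sum of squares for Y
    return sp / sqrt(ssx * ssy)

x = [0, 2, 3, 6, 9, 10]   # Shyness
y = [8, 10, 4, 6, 1, 3]   # Speeches
print(round(pearson_r(x, y), 3))   # -0.797; a negative SP makes r negative
```

The sign of r comes entirely from SP in the numerator, since the denominator is always positive.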
Computing Formula for Pearson’s r

 The conceptual formula for Pearson’s r is rarely used to compute it. You will find a nice illustration here of a computing formula and a brief example
Here is another computing formula:

r = [N ∑XY – ∑X ∑Y] / √( [N ∑X² – (∑X)²] [N ∑Y² – (∑Y)²] )

We will do an example using this computing formula next, so let’s download the correlation.sav data set
Scatterplot for the Correlation.sav data
set
 Open the correlation.sav file in SPSS
 Go to Graphs/Chart Builder/OK
 Under Choose From, select Scatter/Dot (top leftmost icon) and double click to move it into the preview window
 Drag Shyness onto the X axis box
 Drag Speeches onto the Y axis box and click OK
 In the Output viewer, double click on the chart to bring
up the Chart Editor; go to Elements and select “Fit
Line at Total,” then select “linear” and click Close
ScatterPlot of Shyness and Speeches

A negative relationship:
The more shy you are
(the farther you are
along the X axis), the
fewer speeches you
give (the lower you are
on the Y axis)
Computational Example of r for the
relationship between Shyness and Speeches

Shyness (X)   Speeches (Y)   XY    X²    Y²
0             8              0     0     64
2             10             20    4     100
3             4              12    9     16
6             6              36    36    36
9             1              9     81    1
∑ = 30        32             107   230   226   (last row: X = 10, Y = 3, XY = 30, X² = 100, Y² = 9)

r = [N ∑XY – ∑X ∑Y] / √( [N ∑X² – (∑X)²] [N ∑Y² – (∑Y)²] )

r = [(6 × 107) – 30(32)] / √( [6(230) – 30²] [6(226) – 32²] )

r = -.797 (note the crossproducts term in the numerator is negative) and R-square = .635
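The same arithmetic can be checked with a short Python sketch of the raw-score computing formula (illustrative only; the slides do this check in SPSS next):

```python
from math import sqrt

def pearson_r_raw(xs, ys):
    """Raw-score computing formula:
    r = (N*SumXY - SumX*SumY) / sqrt((N*SumX2 - SumX^2)(N*SumY2 - SumY^2))."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

x = [0, 2, 3, 6, 9, 10]   # Shyness
y = [8, 10, 4, 6, 1, 3]   # Speeches
r = pearson_r_raw(x, y)
print(round(r, 3))        # -0.797, matching the hand calculation
print(round(r * r, 3))    # R-square = 0.635
```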
SPSS vs. the Hand Calculation: It’s a Lot
Quicker
 Now let’s try computing the coefficient with that same data in
SPSS
 Go to Analyze/Correlate/Bivariate, and move Shyness and
Speeches into the Variables box. Click Pearson, one-tailed,
and OK. Did you get the same result as the hand calculation?

Correlations

                                   Shyness   Speeches
Shyness    Pearson Correlation     1         -.797*
           Sig. (1-tailed)                   .029
           N                       6         6
Speeches   Pearson Correlation     -.797*    1
           Sig. (1-tailed)         .029
           N                       6         6

*. Correlation is significant at the 0.05 level (1-tailed).

Using SPSS to Test a Hypothesis about the Strength of
Association between Two Interval or Ratio Level Variables:
Correlation Coefficient

 Download the file called World95.sav


 We are going to test the strength of the association between population density (the variable is “number of people per square kilometer”) and “average female life expectancy,” based on data from 109 cases (109 countries, with each country a case). Our hypothesis is that the association will be negative; that is, as population density increases, female life expectancy will decrease
 In SPSS Data Editor, go to Analyze/ Correlate/ Bivariate
 Move the two variables, “number of people per square kilometer” and
“average female life expectancy” into the variables box
 Under correlation coefficients, select Pearson
 Under Tests of Significance, click one-tailed (we are making a directional prediction, so we will only accept as significant results in the “negative” 5% of the distribution)
 Click “flag significant results”
 Click Options, and under Statistics, select Means and standard deviations,
then Continue, then OK
 Compare your output to the next slide
SPSS Output for Bivariate Correlation

Descriptive Statistics

                                   Mean      Std. Deviation   N
Number of people / sq. kilometer   203.415   675.7052         109
Average female life expectancy     70.16     10.572           109

Correlations

                                         Number of people   Average female
                                         / sq. kilometer    life expectancy
Number of people    Pearson Correlation  1                  .128
/ sq. kilometer     Sig. (1-tailed)      .                  .093
                    N                    109                109
Average female      Pearson Correlation  .128               1
life expectancy     Sig. (1-tailed)      .093               .
                    N                    109                109
Significance Test of Pearson’s r

Significance of r is tested with a t-statistic with N – 2 degrees of freedom, where

t = r √(N – 2) / √(1 – r²)

[Correlations table repeated from the previous slide, with the Sig. (1-tailed) values indicated]

SPSS provides the results of the t test of the significance of r for you. You can also consult Table F in Levin and Fox.
Write a sentence which states your findings. Report the correlation coefficient, r, R2 (the percent of
variance in y accounted for by x), the significance level, and N, as well as the means on each of the
two variables. Indicate whether or not your hypothesis was supported.
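For illustration (SPSS already reports the p value for you), the t statistic can be computed directly from r and N; the numbers below are the density/life-expectancy results from the output above:

```python
from math import sqrt

def t_for_r(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Population density vs. female life expectancy: r = .128, N = 109
t = t_for_r(0.128, 109)
print(round(t, 2))   # about 1.34 on 107 df -- not significant at the .05 level
```

This matches the SPSS one-tailed significance of .093, which exceeds .05, so the hypothesis of a negative association is not supported (indeed, the observed r is positive).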
A Hypothesis to Test

 Now, test the following hypothesis: Countries in which there is a larger proportion of people living in cities (urban) will have a higher proportion of males who read (lit_male) (not “people who read”)
 Write up your result
 Compare to next slide
Writing up the Results of Your Test

Descriptive Statistics

                              Mean    Std. Deviation   N
People living in cities (%)   56.53   24.203           108
Males who read (%)            78.73   20.445           85

Correlations

                                                  People living   Males who
                                                  in cities (%)   read (%)
People living in cities (%)  Pearson Correlation  1               .587**
                             Sig. (1-tailed)      .               .000
                             N                    108             85
Males who read (%)           Pearson Correlation  .587**          1
                             Sig. (1-tailed)      .000            .
                             N                    85              85

**. Correlation is significant at the 0.01 level (1-tailed).

Note that the correlation of a variable with itself equals 1, which appears in all the main-diagonal cells.

The hypothesis that the proportion of its people living in cities would be positively associated with a country’s rate of male literacy was supported (r = .587, df = 83, p < .01, one-tailed).
The Regression Model

 Regression takes us a step beyond correlation in that not only are we concerned with the strength of the association, but we want to be able to describe its nature with sufficient precision to be able to make predictions
 To be able to make predictions, we need to be able to
characterize one of the variables in the relationship as
independent and the other as dependent
 For example, in the relationship (male literacy--% of people living in the cities), the causal order seems pretty obvious. Literacy rates are not likely to produce urbanization, but urbanization is probably causally prior to increases in literacy rates
Regression model, cont’d

 A regression equation is used to predict the value of a dependent variable, Y, in this case a country’s male literacy rate, on the basis of some constant a that applies to all cases, plus some amount b which is applied to each individual value of X (the country’s % of people living in cities), plus some error term e that is unique to the individual case and unpredictable: Y = a + bX + e (male literacy = a + b(percent urban) + e)
 You can think of regression as an equation which best fits a
scatter of points representing the plot of X and Y against each
other
Calculating the Regression Line

 What line best describes the relationship depicted in the scattergram?
 The formula for the regression line is
Y = a + bX + e where Y is (in this case)
a country’s score on male literacy and
X is the country’s % of people living in
cities, a is the y-intercept or constant
(the point where the line crosses the Y
axis, or, the value of Y when X is zero if
X is a variable for which there is zero
amount of the property) and b is the
slope of the regression line (the
amount by which male literacy
changes for each unit change in
percent living in cities)
 We can use the formula to predict Y
given that we know X
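The slides leave the computation of a and b to SPSS, but for illustration only, here is a minimal Python sketch of the least-squares formulas (b = SP / SSx, a = Ȳ – bX̄), applied to the earlier shyness/speeches data:

```python
def regression_line(xs, ys):
    """Least-squares slope and intercept: b = SP / SSx, a = Ybar - b * Xbar."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # sum of crossproducts
    ssx = sum((x - mx) ** 2 for x in xs)                   # sum of squares for X
    b = sp / ssx
    a = my - b * mx
    return a, b

# Shyness (X) and Speeches (Y) from the earlier worked example
a, b = regression_line([0, 2, 3, 6, 9, 10], [8, 10, 4, 6, 1, 3])
print(round(a, 3), round(b, 3))   # intercept roughly 8.65, slope roughly -0.66
```

The negative slope says that each one-point increase in shyness predicts about two-thirds of a speech fewer, consistent with r = -.797.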
Calculating the Regression Line, cont’d

 We will not do the hand computations for b, the slope of the regression line, or a, the intercept. Let’s use another SPSS method for finding not only the correlation coefficient, Pearson’s r, but also the regression equation (e.g., find the intercept, a, and the slope of the regression line, b)
 In Data Editor, go to Analyze/Regression/Linear
 Put the dependent variable, male literacy, into the
Dependent box
 Move the independent variable, percentage of people
living in cities, into the Independent(s) box, and click OK
 Compare your output to the next slide
Finding the Intercept (Constant) and Slope (B, the
unstandardized regression coefficient)

The intercept, or a (sometimes called β0), is the B value for the Constant; the slope, or b, is the B value for the predictor. The Beta column is the beta weight, the slope when X and Y are expressed in standard score units.

Coefficients(a)

                               Unstandardized Coefficients   Standardized Coefficients
Model                          B         Std. Error          Beta     t        Sig.
1  (Constant)                  52.372    4.378                        11.961   .000
   People living in cities (%) .495      .075                .587     6.608    .000

a. Dependent Variable: Males who read (%)
The regression equation for predicting Y (male literacy) is Y = a + bX, or Y = 52.372 + .495X, so if we wanted to predict the male literacy rate in country j we would multiply its percentage of people living in cities by .495, and add the constant, 52.372. Compare this to the scatterplot. Does it look right?

When scores on X and Y are available as Z scores, and are expressed in the same standardized
units, then there is no intercept (constant) because you don’t have to make an adjustment for
the differences in scale between X and Y, and so the equation for the regression line just
becomes Y = (b) X, or in this case Y = .587 X, where .587 is the standardized version of b (note
that it’s also the value of r, but only when there is just the one X variable and not multiple
independent variables)
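The claim that the standardized slope equals r when there is a single predictor can be checked with a small Python sketch (illustrative; it uses the shyness/speeches data rather than the World95 file):

```python
from math import sqrt

def standardize(vals):
    """Convert scores to z-scores (mean 0, SD 1; population SD used here)."""
    n = len(vals)
    m = sum(vals) / n
    sd = sqrt(sum((v - m) ** 2 for v in vals) / n)
    return [(v - m) / sd for v in vals]

def slope(xs, ys):
    """Least-squares slope b = SP / SSx."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    return sp / ssx

x = [0, 2, 3, 6, 9, 10]   # Shyness
y = [8, 10, 4, 6, 1, 3]   # Speeches
beta = slope(standardize(x), standardize(y))
print(round(beta, 3))     # equals Pearson's r for these data, -0.797
```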
More Output from Linear Regression

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .587(a)   .345       .337                16.649

a. Predictors: (Constant), People living in cities (%)

The correlation coefficient and the coefficient of determination. The coefficient of determination, or R-square, is the proportion of variance in the dependent variable which can be accounted for by the independent variable. Adjusted R-square is an adjustment made to R-square when you have a lot of independent variables or predictors in the equation or have complexities like cubic terms; with only one predictor the adjustment is minor
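Adjusted R-square is conventionally computed as 1 – (1 – R²)(n – 1)/(n – k – 1), where k is the number of predictors; a short Python check reproduces the model summary above:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square: 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n is the number of cases and k the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Model summary from the slides: R-square = .345, N = 85 cases, 1 predictor
print(round(adjusted_r2(0.345, 85, 1), 3))   # 0.337, matching the SPSS output
```

With one predictor and 85 cases the adjustment shaves less than one percentage point off R-square, which is why the slide calls it minor.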
F test of the regression equation: More Output
from Linear Regression, Cont’d

ANOVA(b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  12104.898        1    12104.898     43.668   .000(a)
   Residual    23007.879        83   277.203
   Total       35112.776        84

a. Predictors: (Constant), People living in cities (%)
b. Dependent Variable: Males who read (%)

If the independent variable, X, were of no value in predicting Y, the best estimate of Y would
be the mean of Y. To see how much better our calculated regression line is as a predictor of
Y than the simple mean of Y, we calculate the sum of squares for the regression line and
then a residual sum of squares (variance left over after the regression line has done its work
as a predictor) which shows how well or how badly the regression line fits the actual
obtained scores on Y. If the residual mean square is large compared to the regression mean
square, the value of F would be low and the resulting F ratio may not be significant. If the F
ratio is statistically significant it suggests that we can reject the hypothesis that our
predictor, β, is zero in the population, and say that the regression line is a good fit to the
data
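The F ratio in the table can be reproduced from the sums of squares with a few lines of Python (a check, not part of the SPSS workflow):

```python
# Sums of squares and degrees of freedom from the ANOVA table in the slides
ss_reg, df_reg = 12104.898, 1    # regression
ss_res, df_res = 23007.879, 83   # residual

ms_reg = ss_reg / df_reg         # regression mean square
ms_res = ss_res / df_res         # residual mean square
F = ms_reg / ms_res              # F ratio

print(round(F, 3))         # 43.668, matching the SPSS ANOVA table
print(round(F ** 0.5, 3))  # with one predictor, sqrt(F) = t for the slope (6.608)
```

The second line illustrates the connection between the ANOVA and Coefficients tables: with a single predictor, the F test of the equation and the t test of the slope are the same test.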
Partial Correlation

 What is the relationship between a country’s percentage of people living in cities (X2, the IV) and male literacy rate (Y, the DV) when the effects of gross domestic product (X1, another potential IV or control variable) are removed?
 That is, what happens when you statistically remove that portion of
the variance that both percentage of people living in cities (X2) and
gross domestic product (X1) have in common with each other and
with Y, male literacy rate , e.g. compute a partial correlation?
 What you want to see is whether the correlation between male literacy rate and percentage of people living in cities, which you determined to be .587 in your previous analyses, is lower when the effects of gross domestic product are “partialled out”
Using SPSS to Compute a Partial
Correlation
 A partial correlation is the relationship between two variables after
removing the overlap with a third variable completely from both
variables. In the diagram below, this would be the relationship between
male literacy (Y) and percentage living in cities (X2), after removing the
influence of gross domestic product (X1) on both literacy and
percentage living in cities

[Venn diagram of overlapping variance among Y (male literacy), X1 (GDP), and X2 (% living in cities), with overlap sections labeled a through d]

In the calculation of the partial correlation coefficient rYX2.X1, the area of interest is section a, and the effects removed are those in b, c, and d; the partial correlation is the relationship of X2 and Y after the influence of X1 is completely removed from both variables. When only the effect of X1 on X2 is removed, this is called a part correlation; the part correlation first removes from X2 all variance which may be accounted for by X1 (sections c and b), then correlates the remaining unique component of X2 with the dependent variable, Y
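The first-order partial correlation can be computed from the three zero-order correlations; a Python sketch follows. Note that the zero-order correlations of GDP with the other two variables are illustrative placeholders, since the slides report only r = .587 for urbanization and literacy:

```python
from math import sqrt

def partial_r(r_yx2, r_yx1, r_x1x2):
    """First-order partial correlation r(Y, X2 . X1): the X2-Y correlation
    with X1's overlap removed from both variables."""
    return (r_yx2 - r_yx1 * r_x1x2) / sqrt((1 - r_yx1 ** 2) * (1 - r_x1x2 ** 2))

# r_yx2 = .587 comes from the slides; the other two values are hypothetical
print(round(partial_r(0.587, 0.50, 0.40), 3))
```

With these placeholder values the partial r drops below the zero-order .587, the same qualitative pattern the SPSS Partial Corr output shows when GDP is controlled.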
Computing the Partial Correlation in
SPSS
 Go to Analyze/Correlate/Partial
 Move % People living in cities and Males who read into
the Variables box
 Put Gross Domestic Product into the Controlling for box
 Select one-tailed test and check display actual
significance level
 Under Options, select zero-order correlations
 Click Continue and then OK
 Compare your output to the next slide
Comparing Partial to Zero-Order Correlation: Effect of
Controlling for GDP on Relationship Between Percent Living in
Cities and Male Literacy

[Partial Corr output: the zero-order correlations of the control variable (GDP) with X and Y, and the partial r between % living in cities and male literacy when the effect of GDP is removed]

Note that the partial correlation of % people living in cities and male literacy is only .4644 when
GDP is held constant, where the zero order correlation you obtained previously was .5871. So
clearly GDP is a control variable which influences the relationship between % of people living in
cities and male literacy, although the % living in cities-literacy relationship is still significant
even with GDP removed
