Introduction to Correlation
The scatterplot is a figure that plots cases for which two
measures have been taken (for example, people who have filled
out a survey of their attitudes toward smoking and another
about their attitudes toward drinking) against each other
In a scatterplot, one of the variables (usually the independent
variable) is plotted along the horizontal or X axis and the other is
plotted along the vertical or Y axis
Each point in a scatterplot corresponds to the scores (X,Y) for an
individual case (a person, for example) where X is the score that
person was assigned or obtained on one variable and Y is the
score they attained on the other
The linear relationship between X and Y is stronger as the
swarm of points in the scatterplot more closely approximates a
diagonal “line” across the graph
An Example of a Scatterplot
[Figure: scatterplot with Computer Self-efficacy on the X axis (20–180) and Openness to Computing on the Y axis; one starred point marks a case with a score of about 10]
Scatterplot Allows You to Visualize the
Relationship between Variables
The purpose of the scatterplot is to let you see how the two variables covary: in this example, higher values of computer self-efficacy tend to be associated with higher values of openness to computing. A curvilinear function (in this case an inverted U) might be a better fit, but we’ll leave that question aside; the line drawn through the points is the mathematically best-fitting straight line.
[Figure: scatterplot of Computer Self-efficacy (X axis, 20–180) against Openness to Computing (Y axis) with fitted line]
[Figure: three example scatterplots]
- Strong negative relationship between X and Y; points tightly clustered around the line; nonlinear trend at lower weights
- Essentially no relationship between X and Y; points loosely clustered around the line (Total Years of Education plotted against Number of Children)
- Positive relationship between X and Y
How is the Correlation Coefficient
Computed?
The conceptual formula for the
correlation coefficient is a little
daunting, but it looks like this:
r = ∑(X – X̄)(Y – Ȳ) / √( [∑(X – X̄)²] [∑(Y – Ȳ)²] )
Where X is a person’s or case’s score on the independent variable, Y is a person’s or case’s score on the
dependent variable, and X-bar and Y-bar are the means of the scores on the independent and dependent
variables, respectively. The quantity in the numerator is called the sum of the crossproducts (SP). The
quantity in the denominator is the square root of the product of the sums of squares for the two variables
(SSX and SSY).
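The conceptual formula can be checked numerically. Below is a minimal Python sketch (the function name `pearson_r` is my own); it uses the shyness/speeches data from the worked example later in this document.

```python
# Pearson's r via the conceptual (deviation-score) formula:
# r = SP / sqrt(SSx * SSy)

def pearson_r(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sp  = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # sum of crossproducts (SP)
    ssx = sum((x - x_bar) ** 2 for x in xs)                       # sum of squares for X
    ssy = sum((y - y_bar) ** 2 for y in ys)                       # sum of squares for Y
    return sp / (ssx * ssy) ** 0.5

# Shyness (X) and number of speeches given (Y)
shyness  = [0, 2, 3, 6, 9, 10]
speeches = [8, 10, 4, 6, 1, 3]
print(round(pearson_r(shyness, speeches), 3))  # -0.797
```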
Meaning of Crossproducts
r = [N∑XY – ∑X∑Y] / √( [N∑X² – (∑X)²] [N∑Y² – (∑Y)²] )
A negative relationship:
The more shy you are
(the farther you are
along the X axis), the
fewer speeches you
give (the lower you are
on the Y axis)
Computational Example of r for the
relationship between Shyness and Speeches
Shyness (X)   Speeches (Y)    XY     X²     Y²
     0              8           0      0     64
     2             10          20      4    100
     3              4          12      9     16
     6              6          36     36     36
     9              1           9     81      1
    10              3          30    100      9
∑   30             32         107    230    226

r = [N∑XY – ∑X∑Y] / √( [N∑X² – (∑X)²] [N∑Y² – (∑Y)²] )
  = [(6 × 107) – (30 × 32)] / √( [6(230) – 30²] [6(226) – 32²] )
  = –318 / √(480 × 332)

r = –.797 (note the crossproducts term in the numerator is negative) and R² = .635
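The hand calculation above can be reproduced step by step. This sketch applies the computational formula to the same six cases:

```python
# Check the hand calculation with the computational formula:
# r = (N*sum(XY) - sumX*sumY) / sqrt((N*sum(X^2) - (sumX)^2) * (N*sum(Y^2) - (sumY)^2))

x = [0, 2, 3, 6, 9, 10]   # shyness
y = [8, 10, 4, 6, 1, 3]   # speeches
n = len(x)

sum_x, sum_y   = sum(x), sum(y)                                # 30, 32
sum_xy         = sum(a * b for a, b in zip(x, y))              # 107
sum_x2, sum_y2 = sum(a * a for a in x), sum(b * b for b in y)  # 230, 226

num = n * sum_xy - sum_x * sum_y   # (6 * 107) - (30 * 32) = -318
den = ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
r = num / den
print(round(r, 3), round(r * r, 3))  # -0.797 0.635
```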
SPSS vs. the Hand Calculation: It’s a Lot
Quicker
Now let’s try computing the coefficient with that same data in
SPSS
Go to Analyze/Correlate/Bivariate, and move Shyness and
Speeches into the Variables box. Click Pearson, one-tailed,
and OK. Did you get the same result as the hand calculation?
Correlations
                               Shyness   Speeches
Shyness   Pearson Correlation     1       -.797*
          Sig. (1-tailed)                  .029
          N                       6          6
Descriptive Statistics
                                  Mean    Std. Deviation     N
Average female life expectancy    70.16       10.572        109

Correlations
                                           Number of people   Average female
                                           / sq. kilometer    life expectancy
Number of people     Pearson Correlation         1                 .128
/ sq. kilometer      Sig. (1-tailed)             .                 .093
                     N                          109                109
Average female       Pearson Correlation       .128                 1
life expectancy      Sig. (1-tailed)           .093                  .
                     N                          109                109
Significance Test of Pearson’s r

t = r √(N – 2) / √(1 – r²)

SPSS provides the results of the t test of the significance of r for each correlation: here, population density and average female life expectancy (r = .128, one-tailed p = .093, N = 109).
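The t statistic that underlies this significance test can be computed directly. A minimal sketch, applied to the shyness/speeches example (r = –.797, N = 6, so df = N – 2 = 4; SPSS reported a one-tailed p of .029 for that example):

```python
import math

# t test of the significance of r: t = r * sqrt(N - 2) / sqrt(1 - r^2)
r, n = -0.797, 6
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 2))  # -2.64, evaluated against the t distribution with df = 4
```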
Write a sentence which states your findings. Report the correlation coefficient r, R² (the percent of
variance in Y accounted for by X), the significance level, and N, as well as the means on each of the
two variables. Indicate whether or not your hypothesis was supported.
A Hypothesis to Test
The hypothesis that the proportion of its people living in cities would be positively associated with a
country’s rate of male literacy was confirmed (r = .587, DF=83, p < .01, one-tailed).
The Regression Model

[SPSS output: coefficients table giving the constant, a, and the slope, b (standardized: β)]
The regression equation for predicting Y (male literacy) is Y = a + (b)X, or Y = 52.372 +.495X, so if
we wanted to predict the male literacy rate in country j we would multiply its percentage of
people living in cities by .495, and add the constant, 52.372. Compare this to the scatterplot.
Does it look right?
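Plugging a value into the fitted equation makes the prediction step concrete. A minimal sketch using the coefficients quoted above (the function name is my own):

```python
# Predicting male literacy from % of people living in cities,
# using the fitted equation Y = a + b*X with a = 52.372 and b = .495

def predict_literacy(pct_urban):
    return 52.372 + 0.495 * pct_urban

# e.g., a country with 60% of its people living in cities:
print(round(predict_literacy(60), 3))  # 82.072
```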
When scores on X and Y are available as Z scores, they are expressed in the same standardized
units, so there is no intercept (constant): no adjustment is needed for the difference in scale
between X and Y. The equation for the regression line then becomes Y = (b)X, or in this case
Y = .587X, where .587 is the standardized version of b (note that it is also the value of r, but
only when there is just the one X variable and not multiple independent variables)
More Output from Linear Regression
ANOVA
Model             Sum of Squares    df    Mean Square       F      Sig.
1   Regression       12104.898       1     12104.898     43.668    .000
    Residual         23007.878      83       277.203
    Total            35112.776      84
If the independent variable, X, were of no value in predicting Y, the best estimate of Y would
be the mean of Y. To see how much better our calculated regression line is as a predictor of
Y than the simple mean of Y, we calculate the sum of squares for the regression line and
then a residual sum of squares (the variance left over after the regression line has done its
work as a predictor), which shows how well or how badly the regression line fits the actually
obtained scores on Y. If the residual mean square is large compared to the regression mean
square, the F ratio will be low and may not be significant. If the F ratio is statistically
significant, we can reject the null hypothesis that the slope, β, is zero in the population and
say that the regression line is a good fit to the data.
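The F ratio can be rebuilt from the sums of squares in the ANOVA table above; the residual row follows by subtraction (Total – Regression):

```python
# How the F ratio in the ANOVA table is built from the sums of squares.
ss_regression, df_regression = 12104.898, 1
ss_total, df_total           = 35112.776, 84

ss_residual = ss_total - ss_regression   # 23007.878
df_residual = df_total - df_regression   # 83

ms_regression = ss_regression / df_regression
ms_residual   = ss_residual / df_residual      # ~277.203
f_ratio = ms_regression / ms_residual
print(round(f_ratio, 3))  # 43.668
```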
Partial Correlation
[SPSS output: the zero-order correlations of the control variable (GDP) with X and Y, and the partial r between X and Y when the effect of GDP is removed]
Note that the partial correlation of % of people living in cities and male literacy is only .4644 when
GDP is held constant, whereas the zero-order correlation you obtained previously was .5871. So
clearly GDP is a control variable which influences the relationship between % of people living in
cities and male literacy, although the cities-literacy relationship is still significant even with GDP
removed.
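A first-order partial correlation can be computed from the three zero-order correlations. In the sketch below, r_xy = .5871 comes from the text, but the correlations of each variable with GDP (r_xz, r_yz) are not reported in this document, so the values used here are hypothetical, for illustration only:

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

r_xy = 0.5871          # zero-order r: % urban with male literacy (from the text)
r_xz, r_yz = 0.5, 0.4  # HYPOTHETICAL correlations of each variable with GDP
print(round(partial_r(r_xy, r_xz, r_yz), 4))  # ~0.49 with these made-up values
```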