Chapter 18
Two variable statistics
Contents:
A Correlation
B Measuring correlation
C Least squares regression
Investigation: Spearman’s rank order correlation coefficient
D The χ² test of independence
OPENING PROBLEM
The height and weight of a squad of hockey players is to be investigated.
The raw data is given:
Player Height Weight Player Height Weight Player Height Weight
A 183 85 H 167 67 O 167 72
B 170 74 I 169 74 P 171 69
C 174 76 J 163 67 Q 170 76
D 168 69 K 161 69 R 174 71
E 167 68 L 172 74 S 162 71
F 177 74 M 160 64 T 187 84
G 162 62 N 160 62 U 175 77
A CORRELATION
Correlation refers to the relationship or association between two variables.
In looking at the correlation between two variables we should follow these steps:
Step 1: Look at the scatterplot for any pattern.
Step 2: Look at the spread of points to make a judgement about the strength of the
correlation. For positive relationships we would classify the following scatterplots as:
Looking at the scatterplot for the Opening Problem we can say that ‘there appears to be a
not very strong positive correlation between the hockey players’ heights and weights. The
relationship appears to be linear with no possible outliers’.
CAUSATION
Correlation between two variables does not necessarily mean that one variable causes the
other. Consider the following:
The arm length and running speed of a sample of young children were measured and a strong,
positive correlation was found to exist between the variables.
Does this mean that short arms cause a reduction in running speed or that a high running
speed causes your arms to grow long?
These are obviously nonsense assumptions and the strong positive correlation between the
variables is attributed to the fact that both arm length and running speed are closely
related to a third variable, age. Arm length increases with age as does running speed
(up to a certain age).
When variables are related so that if one is changed the other changes then we can say a
causal relationship exists between the variables.
In cases where this is not apparent, there is no justification, based on high correlation alone,
to conclude that changes in one variable cause the changes in the other.
EXERCISE 18A
1 For each of the scatterplots below state:
i whether there is positive, negative or no association between the variables
ii whether the relationship between the variables appears to be linear or not
iii the strength of the association (zero, weak, moderate or strong).
[Scatterplots a to f, each with x on the horizontal axis and y on the vertical axis]
3 The results of a group of students for a Maths test and a Science test are compared:
Student A B C D E F G H I J
Maths test 64 67 69 70 73 74 77 82 84 85
Science test 68 73 68 75 78 73 77 84 86 89
Construct a scatterplot for the data. (Make the scale on both axes from 60 to 90.)
4 The scores awarded by two judges at a diving competition are shown
in the table.
Competitor P Q R S T U V W X Y
Judge A 5 6.5 8 9 4 2.5 7 5 6 3
Judge B 6 7 8.5 9 5 4 7.5 5 7 4.5
a Construct a scatterplot for this data with Judge A’s scores on the
horizontal axis and Judge B’s scores on the vertical axis.
b Copy and complete the following comments on the scatterplot:
There appears to be ............, .............. correlation between Judge A’s scores and
Judge B’s scores. This means that as Judge A’s scores increase, Judge B’s scores
..................
Regression is the method of fitting a line to a set of data and then finding the equation
of the line.
[Scatterplot: weight (kg), 50 to 80, on the vertical axis against height, 150 to 200, on the horizontal axis]
‘By eye’ we draw a line which best fits with about equal numbers of points on either side
(but not necessarily so). The average distances away from the line should balance.
We now select two points on this line and use $\frac{y - b}{x - a} = m$ to find the equation.
In our hockey players data these are (160, 64) and (190, 88).
Now $m \approx \frac{88 - 64}{190 - 160} \approx \frac{24}{30} \approx 0.8$
So, the equation is $\frac{y - 64}{x - 160} \approx 0.8$
$\therefore \; y - 64 \approx 0.8(x - 160)$
$\therefore \; y - 64 \approx 0.8x - 128$
i.e., $y \approx 0.8x - 64$
The problem with this method is that the answer will vary from one person to another.
Selecting the line and choosing two points on it can be very difficult.
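The two-point calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `line_through` is hypothetical, and the two points are the ones read off the by-eye line in the text.

```python
def line_through(p, q):
    """Return (m, c) for the line y = m*x + c through the points p and q."""
    (a, b), (x2, y2) = p, q
    if x2 == a:
        raise ValueError("vertical line: the gradient is undefined")
    m = (y2 - b) / (x2 - a)  # gradient from the two chosen points
    c = b - m * a            # rearranged from y - b = m(x - a)
    return m, c


# The two points chosen on the by-eye line for the hockey players data.
m, c = line_through((160, 64), (190, 88))
print(f"gradient = {m:.1f}, y-intercept = {c:.1f}")
```

A different reader choosing a slightly different pair of points would get a different gradient and intercept, which is exactly the weakness of the by-eye method.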
B MEASURING CORRELATION
When dealing with linear association we can use the concept known as correlation to measure
the strength and direction of association.
The correlation coefficient (r) lies between −1 and 1.
POSITIVE CORRELATION
An association between two variables is described as a positive correlation if an
increase in one variable results in an increase in the other in an approximately
linear manner.
The strength of the association is best measured with the correlation coefficient (r) that
ranges between 0 and 1 for positive correlation.
An r value of 0 suggests that there is no linear association present (or no correlation).
An r value of 1 suggests that there is a perfect linear association present (or perfect positive
correlation).
The correlation between the height and the weight of people is positive and lies between 0
and +1. It is not an example of perfect positive correlation because, for example, not all short
people are of light weight. However, taller people are generally heavier than shorter people.
The r values in between 0 and 1 represent varying degrees of linearity.
Scatter diagrams for positive correlation:
The scales on each of the four graphs are the same.
NEGATIVE CORRELATION
The strength of the association is best measured with the correlation coefficient (r) that
ranges between 0 and −1 for negative correlation.
An r value of −1 suggests that there is a perfect linear association present (or perfect negative
correlation).
Scatter diagrams for negative correlation:
The second of these formulae is useful as it does not require the means of the X and Y
distributions, x̄ and ȳ, to be found.
Example 1
A chemical fertiliser company wishes to determine the extent of correlation
between ‘quantity of compound X used’ and ‘lawn growth’ per day.
Find the Pearson’s correlation coefficient between the two variables.
Lawn Compound X (g) Lawn growth (mm)
A 1 3
B 2 3
C 4 6
D 5 8
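As a rough check on Example 1, the sketch below computes Pearson's r from the sums alone, using the form that needs only n, Σx, Σy, Σxy, Σx² and Σy². The function name `pearson_r_from_sums` is illustrative and not part of the text. For this table the sums are Σx = 12, Σy = 20, Σxy = 73, Σx² = 46 and Σy² = 118.

```python
from math import sqrt


def pearson_r_from_sums(xs, ys):
    """Pearson's r using only n and the sums of x, y, xy, x^2 and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    s_xy = sum_xy - sum_x * sum_y / n  # numerator of r
    s_xx = sum_x2 - sum_x ** 2 / n     # spread of the x values
    s_yy = sum_y2 - sum_y ** 2 / n     # spread of the y values
    return s_xy / sqrt(s_xx * s_yy)


compound_x = [1, 2, 4, 5]  # grams of compound X, lawns A to D
growth = [3, 3, 6, 8]      # lawn growth in mm, lawns A to D
r = pearson_r_from_sums(compound_x, growth)
print(round(r, 3))  # 0.969
```

Here r = 13/√180, which is close to 1 and so suggests a strong positive linear association between the amount of compound X and lawn growth.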
Example 2
In attempting to find if there is any association between average speed in the
metropolitan area and age of drivers, a device was fitted to cars of drivers of
different ages.
The results are shown in the scatterplot.
The r-value for this association is +0.027.
Describe the association.
[Scatterplot: average speed, 50 to 70, on the vertical axis against age, 20 to 90, on the horizontal axis]
Example 3
Wydox have been trying out a new chemical to control the number of lawn beetles
in the soil. Determine the extent of the correlation between the quantity of chemical
used and the number of surviving lawn beetles per square metre of lawn.
Lawn Amount of chemical (g) Number of surviving lawn beetles
A 2 11
B 5 6
C 6 4
D 3 6
E 9 3
r ≈ −0.859
[Scatterplot: x (chemical) on the horizontal axis, marked at 5 and 10; vertical axis marked at 5 and 10]
Clearly we have a moderate negative association between the amount of chemical used and
the number of surviving lawn beetles. Generally, the more chemical used, the fewer
beetles survive.
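The r-value quoted for Example 3 can be checked with a short sketch. It uses the standard deviations-from-the-mean form of Pearson's r, which is an assumption about presentation rather than the working shown in the text. The function name `pearson_r_from_deviations` is illustrative.

```python
from math import sqrt


def pearson_r_from_deviations(xs, ys):
    """Pearson's r from the deviations of each value from its mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


chemical = [2, 5, 6, 3, 9]  # grams of chemical, lawns A to E
beetles = [11, 6, 4, 6, 3]  # surviving lawn beetles per square metre
r = pearson_r_from_deviations(chemical, beetles)
print(round(r, 3))  # -0.859
```

With means x̄ = 5 and ȳ = 6 the calculation reduces to r = −29/√1140, which agrees with the value given in the example.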
EXERCISE 18B.1
1 Consider the three graphs given below:
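[Graphs A, B and C: graph A shows the points (1, 2), (2, 3) and (3, 4); graph B shows the points (1, 2), (2, 1) and (3, 0); graph C has axis marks at 1 and 2.]
Clearly A shows perfect positive linear correlation, B shows perfect negative linear
correlation and C shows no correlation.
a For each set of points find r using
$r = \dfrac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}} \; \sqrt{\sum y^2 - \dfrac{(\sum y)^2}{n}}}$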
b Comment on the value of r for each graph.
Enter the data and simply read off r, r² and the degree of strength of the correlation.
Click on the icon to find an easy to use two variable analysis computer package.
a find r²
b describe the association between these variables.
[Scatterplot: all crashes, 30000 to 55000, on the horizontal axis; vertical axis 6000 to 7000]
2 In an investigation to examine the association between the tread depth (y mm) and
the number of kilometres travelled (x thousand), a sample of 8 tyres of the same
brand was taken and the results are given below.
[Diagram: tyre cross-section showing depth of tread]
Kilometres (x thousand) 14 17 24 34 35 37 38 39
Tread depth (y mm) 5.7 6.5 4.0 3.0 1.9 2.7 1.9 2.3
a Draw a scatterplot of the data. b Calculate r and r² for the tabled data.
c Describe the association between tread depth and the number of kilometres travelled
for this brand of tyre.
3 A restaurateur believes that during March the number of people wanting dinner (y) is
related to the temperature at noon (x °C). Over a period of a fortnight the number of
diners and the noon temperature were recorded.
Temperature (x °C) 23 25 28 30 30 27 25 28 32 31 33 29 27
Number of diners (y) 57 64 62 75 69 58 61 78 80 67 84 73 76
4 Tomatoes are sprayed with a pesticide-fertiliser mix. The figures below give the yield of
tomatoes per bush for various spray concentrations.
5 It has long been thought that frosty conditions are necessary to ‘set’
the fruit of cherries and apples. The following data shows annual
cherry yield and number of frosts data for a cherry growing farm
over a 7 year period.
7 World War II saw a peak in the production of aeroplanes. One specific type of plane
that was made was the fighter plane. It was used in aerial combat and also to shoot at
enemy on the ground. The table below contains information on the maximum speed and
maximum altitude obtainable (ceiling) for nineteen fighter planes. Maximum speed is
given in km/h ÷ 1000. Ceiling is given in m ÷ 1000.