Chapter 18
Two variable statistics

Contents:
A Correlation
B Measuring correlation
C Least squares regression
  Investigation: Spearman’s rank order correlation coefficient
D The χ² test of independence
Review set 18A
Review set 18B
572 TWO VARIABLE STATISTICS (Chapter 18)

TWO VARIABLE ANALYSIS


Often a statistician will want to know how two variables are associated or related.
To find such a relationship the first step is to construct and observe a scatterplot.
On the horizontal axis we put the independent variable and on the vertical axis the
dependent variable. A typical scatterplot could look like this:

[Figure: a scatterplot with 'independent variable' on the horizontal axis and 'dependent variable' on the vertical axis.]

or this:

[Figure: a scatterplot with 'height' on the horizontal axis and 'weight' on the vertical axis. Margin note: The weight of a person is usually dependent on their height.]
OPENING PROBLEM
The heights and weights of a squad of hockey players are to be investigated.
The raw data is given:

Player  Height  Weight    Player  Height  Weight    Player  Height  Weight
A       183     85        H       167     67        O       167     72
B       170     74        I       169     74        P       171     69
C       174     76        J       163     67        Q       170     76
D       168     69        K       161     69        R       174     71
E       167     68        L       172     74        S       162     71
F       177     74        M       160     64        T       187     84
G       162     62        N       160     62        U       175     77

Things to think about:

• Are the variables categorical or quantitative?
• What is the dependent variable?
• What would the scatterplot look like? Are the points close to being linear?
• Does an increase in the independent variable generally cause an increase or a
  decrease in the dependent variable?
• Is there a way of indicating the strength of a linear connection for the variables?
• How can we find the equation of ‘the line of best fit’ and how can we use it?

The scatterplot for the Opening Problem is drawn below. Height is the independent
variable and is represented on the horizontal axis.
Notice that in general, as the height increases so does the weight.

[Figure: scatterplot of weight (kg), from 50 to 100, against height (cm), from 150 to 200, for the hockey players.]
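As a rough numerical check on the upward trend noted above, the following Python sketch (not part of the original text) sorts the players by height and compares the mean weight of the ten shortest players with that of the ten tallest. The variable and function names are illustrative only, and the heights are assumed to be in centimetres.

```python
# Opening Problem data: (height, weight) for players A to U.
# Heights are assumed to be in cm and weights in kg.
players = [
    (183, 85), (170, 74), (174, 76), (168, 69), (167, 68), (177, 74), (162, 62),
    (167, 67), (169, 74), (163, 67), (161, 69), (172, 74), (160, 64), (160, 62),
    (167, 72), (171, 69), (170, 76), (174, 71), (162, 71), (187, 84), (175, 77),
]

# Sort by height, then split off the ten shortest and the ten tallest players.
by_height = sorted(players)
shortest, tallest = by_height[:10], by_height[-10:]


def mean_weight(group):
    """Mean of the weights in a list of (height, weight) pairs."""
    return sum(weight for _, weight in group) / len(group)


shorter_mean = mean_weight(shortest)
taller_mean = mean_weight(tallest)
print(f"ten shortest: {shorter_mean:.1f} kg, ten tallest: {taller_mean:.1f} kg")
```

A larger mean for the taller group is consistent with a positive association, but it says nothing about how close the points are to a straight line.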

A CORRELATION
Correlation refers to the relationship or association between two variables.

In looking at the correlation between two variables we should follow these steps:
Step 1: Look at the scatterplot for any pattern.

For a generally upward trend we say that the correlation is positive, and in this case
an increase in the independent variable means that the dependent variable generally
increases.

For a generally downward trend we say that the correlation is negative, and in this case
an increase in the independent variable means that the dependent variable generally
decreases.

For randomly scattered points (with no upward or downward trend) there is usually no
correlation.

Step 2: Look at the spread of points to make a judgement about the strength of the
correlation. For positive relationships we would classify the following scatterplots as:

[Figure: three scatterplots with an upward trend, labelled strong, moderate and weak.]

Similarly there are strength classifications for negative relationships:

[Figure: three scatterplots with a downward trend, labelled strong, moderate and weak.]



Step 3: Look at the pattern of points to see whether or not it is linear.


[Figure: two scatterplots. Left: These points appear to be roughly linear. Right: These points do not appear to be linear.]

Step 4: Look for and investigate any outliers.

These appear as isolated points away from the main body of data.
Outliers should be investigated as sometimes they are mistakes made in recording the
data or plotting it.
Genuine extraordinary data should be included.

[Figure: scatterplot with one isolated point labelled 'outlier' and another point labelled 'not an outlier'.]

Looking at the scatterplot for the Opening Problem we can say that ‘there appears to be a
not very strong positive correlation between the hockey players’ heights and weights. The
relationship appears to be linear with no possible outliers’.

CAUSATION
Correlation between two variables does not necessarily mean that one variable causes the
other. Consider the following:
The arm length and running speed of a sample of young children were measured and a strong,
positive correlation was found to exist between the variables.
Does this mean that short arms cause a reduction in running speed or that a high running
speed causes your arms to grow long?
These are obviously nonsense assumptions and the strong positive correlation between the
variables is attributed to the fact that both arm length and running speed are closely
related to a third variable, age. Arm length increases with age as does running speed
(up to a certain age).
When variables are related so that if one is changed the other changes then we can say
a causal relationship exists between the variables.
In cases where this is not apparent, there is no justification, based on high correlation alone,
to conclude that changes in one variable cause the changes in the other.

EXERCISE 18A
1 For each of the scatterplots below state:
i whether there is positive, negative or no association between the variables
ii whether the relationship between the variables appears to be linear or not
iii the strength of the association (zero, weak, moderate or strong).
[Figure: six scatterplots, labelled a to f, each showing y plotted against x.]

2 Copy and complete the following:


a If the variables x and y are positively associated then as x increases y .............
b If there is negative correlation between the variables m and n then as m increases,
n ..............
c If there is no association between two variables then the points on the scatterplot
appear to be ............. ...............

3 The results of a group of students for a Maths test and a Science test are compared:

Student A B C D E F G H I J
Maths test 64 67 69 70 73 74 77 82 84 85
Science test 68 73 68 75 78 73 77 84 86 89
Construct a scatterplot for the data. (Make the scale on both axes from 60 to 90.)
4 The scores awarded by two judges at a diving competition are shown in the table.

Competitor  P    Q    R    S    T    U    V    W    X    Y
Judge A     5    6.5  8    9    4    2.5  7    5    6    3
Judge B     6    7    8.5  9    5    4    7.5  5    7    4.5


a Construct a scatterplot for this data with Judge A’s scores on the
horizontal axis and Judge B’s scores on the vertical axis.
b Copy and complete the following comments on the scatterplot:
There appears to be ............, .............. correlation between Judge A’s scores and
Judge B’s scores. This means that as Judge A’s scores increase, Judge B’s scores
..................

5 a What is meant by the independent and dependent variables?


b Give another name for each of the variables in a.
c When graphing, which variable is placed on the horizontal axis?

6 For the following scatterplots comment on:


i the existence of any pattern (positive, negative or no association)
ii the relationship strength (zero, weak, moderate or strong)
iii whether or not the relationship is linear
iv whether or not there are any outliers.
[Figure: six scatterplots, labelled a to f, each showing y plotted against x.]

7 What is meant by causation?

LINE OF BEST FIT

Regression is the method of fitting a line to a set of data and then finding the equation
of the line.

The line is often called the model.


The regression line is often called ‘the line of best fit’ and can be used to predict a value of
the dependent variable for a given value of the independent variable. There are several ways
to fit a straight line to a data set. We will examine two of them:
• The line of best fit ‘by eye’.
• The ‘least squares’ regression line (linear regression).

THE ‘BY EYE’ METHOD: (Hockey player data)


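[Figure: scatterplot of weight (kg), from 50 to 100, against height (cm), from 150 to 200, for the hockey players, with a line of best fit drawn by eye. Margin note: Your guess is as good as mine!]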

‘By eye’ we draw a line which best fits with about equal numbers of points on either side
(but not necessarily so). The average distances away from the line should balance.
We now select two points on this line and use $\frac{y - b}{x - a} = m$ to find the equation.
In our hockey players data these are (160, 64) and (190, 88).

Now
$$m = \frac{88 - 64}{190 - 160} = \frac{24}{30} = 0.8$$

So, the equation is
$$\frac{y - 64}{x - 160} = 0.8$$
$$\therefore \; y - 64 = 0.8(x - 160)$$
$$\therefore \; y - 64 = 0.8x - 128$$
$$\text{i.e., } y = 0.8x - 64$$
The problem with this method is that the answer will vary from one person to another.
Selecting the line and choosing two points on it can be very difficult.
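
The two-point calculation above can be sketched in Python. This is only an illustration: the function name `line_through` is hypothetical, the points (160, 64) and (190, 88) are the ones read from the by-eye line in the text, and the 180 cm player used for the prediction is an invented example.

```python
def line_through(p, q):
    """Return (gradient, y_intercept) of the straight line through p and q."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        raise ValueError("a vertical line has no gradient")
    m = (y2 - y1) / (x2 - x1)   # gradient, from (y - b)/(x - a) = m
    c = y1 - m * x1             # rearrange y - y1 = m(x - x1) into y = mx + c
    return m, c


# The two points chosen on the by-eye line for the hockey data.
m, c = line_through((160, 64), (190, 88))
print(f"y = {m:.1f}x + ({c:.1f})")

# Hypothetical use: estimate the weight of a player who is 180 cm tall.
predicted = m * 180 + c
print(f"predicted weight: {predicted:.0f} kg")
```

A different pair of points on a different by-eye line would give a different gradient and intercept, which is exactly the weakness of the method.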

B MEASURING CORRELATION
When dealing with linear association we can use the concept known as correlation to measure
the strength and direction of association.
The correlation coefficient (r) lies between −1 and 1.

POSITIVE CORRELATION
An association between two variables is described as a positive correlation if an
increase in one variable results in an increase in the other in an approximately
linear manner.

The strength of the association is best measured with the correlation coefficient (r) that
ranges between 0 and 1 for positive correlation.
An r value of 0 suggests that there is no linear association present (or no correlation).
An r value of 1 suggests that there is a perfect linear association present (or perfect positive
correlation).
The correlation between the height and the weight of people is positive and lies between 0
and +1. It is not an example of perfect positive correlation because, for example, not all short
people are of light weight. However, taller people are generally heavier than shorter people.
The r values in between 0 and 1 represent varying degrees of linearity.
Scatter diagrams for positive correlation:
The scales on each of the four graphs are the same.

[Figure: four scatterplots of y against x, labelled r = +1, r = +0.8, r = +0.5 and r = +0.2.]



NEGATIVE CORRELATION

An association between two variables is described as a negative correlation if an
increase in one variable results in a decrease in the other in an approximately
linear manner.

The strength of the association is best measured with the correlation coefficient (r) that
ranges between 0 and −1 for negative correlation.
An r value of −1 suggests that there is a perfect linear association present (or perfect negative
correlation).
Scatter diagrams for negative correlation:

[Figure: four scatterplots of y against x, labelled r = −1, r = −0.8, r = −0.5 and r = −0.2.]

PEARSON’S CORRELATION COEFFICIENT


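Pearson’s correlation coefficient, for finding the degree of linearity between two random
variables X and Y, given n ordered pairs $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ of
data is:

$$r = \frac{s_{xy}}{s_x s_y}$$

where

$$s_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n} = \frac{1}{n}\left(\sum xy - \frac{(\sum x)(\sum y)}{n}\right)$$

$$s_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{\frac{1}{n}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)}$$

$$s_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n}} = \sqrt{\frac{1}{n}\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}$$

$s_{xy}$ is called the covariance of X and Y
$s_x$ is the standard deviation of X
$s_y$ is the standard deviation of Y

The factors of $\frac{1}{n}$ in $s_{xy}$ and in $s_x s_y$ cancel. So,

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}}
\quad \text{or} \quad
r = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}}\,\sqrt{\sum y^2 - \dfrac{(\sum y)^2}{n}}}$$

The second of these formulae is useful as it does not require the means of the X and Y
distributions, $\bar{x}$ and $\bar{y}$, to be found.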
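
To see that the two forms of the formula agree, here is a minimal Python sketch of both. The function names and the small data set are hypothetical, invented only to exercise the formulas; they do not come from the text.

```python
from math import sqrt


def pearson_r_deviations(xs, ys):
    """r from deviations about the means (the first form of the formula)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    dev_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    dev_xx = sum((x - x_bar) ** 2 for x in xs)
    dev_yy = sum((y - y_bar) ** 2 for y in ys)
    return dev_xy / (sqrt(dev_xx) * sqrt(dev_yy))


def pearson_r_sums(xs, ys):
    """r from the raw sums only (the second form); no means are needed."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
    return numerator / denominator


# Hypothetical data, invented purely to compare the two functions.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_dev = pearson_r_deviations(xs, ys)
r_sum = pearson_r_sums(xs, ys)
print(round(r_dev, 4), round(r_sum, 4))
```

Both functions fail with a division by zero if all the x values (or all the y values) are equal, because r is undefined when one variable has no spread.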

Example 1
A chemical fertiliser company wishes to determine the extent of correlation
between ‘quantity of compound X used’ and ‘lawn growth’ per day.
Find the Pearson’s correlation coefficient between the two variables.

Lawn   Compound X (g)   Lawn growth (mm)
A      1                3
B      2                3
C      4                6
D      5                8

n = 4 (four pairs of data values)

         x     y     xy    x²    y²
         1     3     3     1     9
         2     3     6     4     9
         4     6     24    16    36
         5     8     40    25    64
totals:  12    20    73    46    118

Σx = 12, Σy = 20, Σxy = 73, Σx² = 46, Σy² = 118

$$r = \frac{73 - \dfrac{12 \times 20}{4}}{\sqrt{46 - \dfrac{12^2}{4}}\,\sqrt{118 - \dfrac{20^2}{4}}}$$
$$\therefore \; r = \frac{13}{\sqrt{10}\,\sqrt{18}}$$
$$\therefore \; r \approx 0.969$$

[Figure: scatterplot of growth (y) against compound X, with the horizontal axis marked 2, 4, 6 and the vertical axis marked 2, 4, 6, 8.]

So, • a positive r means that as x (the mass of chemical compound) increases, then
      so does y (the lawn growth)
    • r close to 1 indicates a very strong positive correlation.
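
The working in Example 1 can be reproduced with a short Python sketch. The variable names are illustrative; the data and the formula are those of the example.

```python
from math import sqrt

# Example 1 data: compound X (g) and lawn growth (mm) for lawns A to D.
xs = [1, 2, 4, 5]
ys = [3, 3, 6, 8]
n = len(xs)

# Build the x, y, xy, x^2 and y^2 columns, then total each column.
rows = [(x, y, x * y, x * x, y * y) for x, y in zip(xs, ys)]
sum_x, sum_y, sum_xy, sum_x2, sum_y2 = (sum(col) for col in zip(*rows))
print("totals:", sum_x, sum_y, sum_xy, sum_x2, sum_y2)

# Substitute the totals into the sums form of Pearson's formula.
numerator = sum_xy - sum_x * sum_y / n
denominator = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
r = numerator / denominator
print(f"r = {r:.3f}")
```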

Example 2
In attempting to find if there is any association between average speed in the
metropolitan area and age of drivers, a device was fitted to cars of drivers of
different ages.
The results are shown in the scatterplot.
The r-value for this association is +0.027.
Describe the association.

[Figure: scatterplot of average speed (vertical axis marked 50, 60, 70) against age (horizontal axis marked 20 to 90).]

As r is close to zero, there is essentially no linear correlation between the two variables.



Example 3
Wydox have been trying out a new chemical to control the number of lawn beetles
in the soil. Determine the extent of the correlation between the quantity of chemical
used and the number of surviving lawn beetles per square metre of lawn.

Lawn   Amount of chemical (g)   Number of surviving lawn beetles
A      2                        11
B      5                        6
C      6                        4
D      3                        6
E      9                        3

There are five data points ∴ n = 5.

         x     y     xy    x²    y²
         2     11    22    4     121
         5     6     30    25    36
         6     4     24    36    16
         3     6     18    9     36
         9     3     27    81    9
totals   25    30    121   155   218

Σx = 25, Σy = 30, Σxy = 121, Σx² = 155, Σy² = 218

$$\therefore \; r = \frac{121 - \dfrac{25 \times 30}{5}}{\sqrt{155 - \dfrac{25^2}{5}}\,\sqrt{218 - \dfrac{30^2}{5}}} \approx -0.859$$

[Figure: scatterplot of y (beetles) against x (chemical), with both axes marked at 5 and 10.]

Clearly we have a moderate negative association between the amount of chemical used
and the number of surviving lawn beetles. Generally, the more chemical used, the fewer
beetles survive.
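
As a cross-check on Example 3, the same value of r can be obtained from the covariance and the two standard deviations, using r = s_xy / (s_x s_y). The sketch below is one possible way to do this in Python. It assumes the population form of the standard deviation (dividing by n), which is what `statistics.pstdev` computes.

```python
from statistics import mean, pstdev

# Example 3 data: chemical used (g) and surviving beetles per square metre.
xs = [2, 5, 6, 3, 9]
ys = [11, 6, 4, 6, 3]
n = len(xs)

x_bar, y_bar = mean(xs), mean(ys)

# Covariance s_xy, dividing by n as in the definition of s_xy.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

# pstdev is the population standard deviation, which also divides by n.
s_x, s_y = pstdev(xs), pstdev(ys)

r = s_xy / (s_x * s_y)
print(f"s_xy = {s_xy:.2f}, r = {r:.3f}")
```

The negative covariance is what gives r its negative sign, since the standard deviations are always positive.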

EXERCISE 18B.1
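1 Consider the three graphs given below:

[Figure: three graphs of y against x. Graph A shows the points (1, 2), (2, 3) and (3, 4). Graph B shows the points (1, 2), (2, 1) and (3, 0). Graph C has both axes marked at 1 and 2; its points are not labelled.]

Clearly A shows perfect positive linear correlation
        B shows perfect negative linear correlation
        C shows no correlation.

a For each set of points find r using
$$r = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}}\,\sqrt{\sum y^2 - \dfrac{(\sum y)^2}{n}}}$$
b Comment on the value of r for each graph.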

2 Find Pearson’s correlation coefficient for random variables X and Y where:
a s_x = 14.7, s_y = 19.2, and s_xy = 136.8
b the standard deviation of the X distribution is 8.71, the standard deviation of the Y
distribution is 13.23 and the covariance of X and Y is −9.26
c Σx = 65, Σy = 141, Σxy = 1165, Σx² = 505, Σy² = 2745, n = 11

3 The scores awarded by two judges at a diving competition are shown in the table.

Competitor  P    Q    R    S    T    U    V    W
Judge A     7    6    8.5  8    5    5    6    7.5
Judge B     6    5    9    8    4.5  3.5  6.5  7

a Construct a scatterplot for this data with Judge A’s scores on
the horizontal axis and Judge B’s scores on the vertical axis.
b Copy and complete the following comments on the scatterplot:
There appears to be ............, .............. correlation between Judge A’s scores and
Judge B’s scores. This means that as Judge A’s scores increase, Judge B’s scores
..................
c Calculate and interpret Pearson’s correlation coefficient.

From this point onwards we will use technology to find r.

THE COEFFICIENT OF DETERMINATION (r²)

To help describe the strength of association we calculate the coefficient of
determination (r²). This is simply the square of the correlation coefficient (r)
and as such the direction of association is eliminated.

Many texts vary on the advice they give. We suggest the rule of thumb given below
when describing the strength of linear association.

value                strength of association
r² = 0               no correlation
0 < r² < 0.25        very weak correlation
0.25 ≤ r² < 0.50     weak correlation
0.50 ≤ r² < 0.75     moderate correlation
0.75 ≤ r² < 0.90     strong correlation
0.90 ≤ r² < 1        very strong correlation
r² = 1               perfect correlation
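
The rule of thumb in the table above can be written as a small Python function. This is only a sketch: the name `describe_strength` is hypothetical, and attaching the direction from the sign of r is a convenience added here, since r² by itself carries no direction.

```python
def describe_strength(r):
    """Describe a correlation coefficient r using the r^2 rule of thumb."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    r_squared = r * r
    if r_squared == 0:
        return "no correlation"
    if r_squared < 0.25:
        strength = "very weak"
    elif r_squared < 0.50:
        strength = "weak"
    elif r_squared < 0.75:
        strength = "moderate"
    elif r_squared < 0.90:
        strength = "strong"
    elif r_squared < 1:
        strength = "very strong"
    else:
        strength = "perfect"
    # r^2 loses the sign, so the direction is taken from r itself.
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


print(describe_strength(0.969))    # r from Example 1
print(describe_strength(-0.859))   # r from Example 3
```

Applied to the r values of Examples 1 and 3, the function gives the same descriptions as the worked solutions.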

USING TECHNOLOGY FOR THE CORRELATION COEFFICIENT


We will calculate r for the data set: x 1 2 3 4 5 6 7
y 5 8 10 13 16 18 20

CALCULATING r USING A GRAPHICS CALCULATOR


Enter the data and find r, r² and hence determine the strength of the correlation.
Click on the calculator icon of your choice to find detailed instructions.

CALCULATING r USING A STATISTICS PACKAGE

Enter the data and simply read off r, r² and the degree of strength of the correlation.
Click on the icon to find an easy to use two variable analysis computer package.

CALCULATING r USING MS EXCEL

Enter the data in columns A and B as shown and type into cell D2 say,
=CORREL(A1:A7,B1:B7)
to get r ≈ 0.998
The formula in cell D3 will calculate r².

[Figure: spreadsheet screenshot showing the data in columns A and B.]
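
For readers without a spreadsheet, the same check can be sketched in Python. The helper `correl` below is a hypothetical function named after the spreadsheet one and written out by hand from the sums form of Pearson’s formula. Python 3.10 and later also provide `statistics.correlation`, which should give the same value.

```python
from math import sqrt

# The data set used in this section.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [5, 8, 10, 13, 16, 18, 20]


def correl(xs, ys):
    """Pearson's r computed from raw sums (hypothetical helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
    return numerator / denominator


r = correl(xs, ys)
r_squared = r ** 2
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```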

EXERCISE 18B.2

1 The scatterplot below shows the association between the number of car crashes in
which a casualty occurred and total number of car crashes in each year from 1972 to
1994. Given that the r value is 0.49:
a find r²
b describe the association between these variables.

[Figure: scatterplot titled ‘Casualty crashes’ v ‘All crashes’, with casualty crashes (6000 to 10 000) on the vertical axis and all crashes (30 000 to 55 000) on the horizontal axis.]

2 In an investigation to examine the association between the tread depth (y mm) and the
number of kilometres travelled (x thousand), a sample of 8 tyres of the same brand was
taken and the results are given below.

[Figure: tyre cross-section showing the depth of tread.]

Kilometres (x thousand)   14    17    24    34    35    37    38    39
Tread depth (y mm)        5.7   6.5   4.0   3.0   1.9   2.7   1.9   2.3

a Draw a scatterplot of the data. b Calculate r and r² for the tabled data.
c Describe the association between tread depth and the number of kilometres travelled
for this brand of tyre.

3 A restaurateur believes that during March the number of people wanting dinner (y) is
related to the temperature at noon (x °C). Over a period of a fortnight the number of
diners and the noon temperature were recorded.

Temperature (x °C)     23  25  28  30  30  27  25  28  32  31  33  29  27
Number of diners (y)   57  64  62  75  69  58  61  78  80  67  84  73  76

a Draw a scatterplot of the data. b Calculate r and r² for the data.
c Describe the association between number of diners and noon temperature for the
restaurant in question.

4 Tomatoes are sprayed with a pesticide-fertiliser mix. The figures below give the yield of
tomatoes per bush for various spray concentrations.

Spray concentration (x mL/L)     3    5    6     8     9     11
Yield of tomatoes per bush (y)   67   90   103   120   124   150
a Draw a scatterplot for this data. b Determine the r and r2 values.
c Describe the association between yield and spray concentration.

5 It has long been thought that frosty conditions are necessary to ‘set’
the fruit of cherries and apples. The following data shows annual
cherry yield and number of frosts data for a cherry growing farm
over a 7 year period.

Number of frosts (x)      27    23    7     37    32    14    16
Cherry yield (y tonnes)   5.6   4.8   3.1   7.2   6.1   3.7   3.8
a Draw a scatterplot for this data. b Determine the r and r² values.
c Describe the association between cherry yield and the number of frosts.

6 In 2002 a business advertised starting salaries for recently graduated university
students depending on whether they held a Bachelor’s degree or a PhD, as shown below.

Field                   Bachelor’s degree (x)   PhD (y)
Chemical engineer       38 250                  48 750
Computer coder          41 750                  68 270
Electrical engineer     38 250                  56 750
Sociologist             32 750                  38 300
Applied mathematician   43 000                  72 600
Accountant              38 550                  46 000

a Draw a scatterplot of the data. b Determine r and r².
c Describe the association between starting salaries for Bachelor’s degrees and
starting salaries for PhDs.

7 World War II saw a peak in the production of aeroplanes. One specific type of plane
that was made was the fighter plane. It was used in aerial combat and also to shoot at
the enemy on the ground. The table below contains information on the maximum speed and
maximum altitude obtainable (ceiling) for nineteen fighter planes. Maximum speed is
given in km/h ÷ 1000. Ceiling is given in m ÷ 1000.

max. speed   ceiling    max. speed   ceiling    max. speed   ceiling
0.46         8.84       0.68         10.66      0.67         12.49
0.42         10.06      0.72         11.27      0.57         10.66
0.53         10.97      0.71         12.64      0.44         10.51
0.53         9.906      0.66         11.12      0.67         11.58
0.49         9.448      0.78         12.80      0.70         11.73
0.53         10.36      0.73         11.88      0.52         10.36
0.68         11.73

a Draw a scatterplot for this data. b Determine the r and r² values.
c Describe the association between maximum speed and ceiling.