Chapter 18

Statistics and Probability
Statistics
In previous books in this series, we have looked at the measures of central
tendency, such as the mean and the median.
In this chapter, we discuss two measurements of spread – the interquartile range
and standard deviation. The representation of numerical data by boxplots is
also introduced.
In our study of statistics up to now, we have often associated one measurement
with an item. For example, the height of each person in a class, the number of
possessions obtained by a player in a football match or the number of marks
obtained by a student in a test.
In the last two sections of this chapter, we look at associating a pair of numbers
with an item, for example, the height and weight of a person or the age and
salary of an employee. This is called bivariate data.
When a measurement is collected or recorded at successive intervals of time, it
is referred to as time-series data. This type of bivariate data is also introduced in
this chapter.
ICE-EM Mathematics 10 3ed ISBN 978-1-108-40434-1 © The University of Melbourne / AMSI 2017 Cambridge University Press
Photocopying is restricted under law and this material must not be transferred to another party.
18A The median and the interquartile range
The median has been introduced and discussed in earlier books in this series. We review it here,
because it is the measure of central tendency used when working with the interquartile range as a
measurement of spread.
Median
We often see the median value being used to describe the housing market in a city. The median is the
‘middle value’ when all values are arranged in numerical order.
Here are 13 numbers in numerical order:
2, 2, 3, 3, 3, 4, 5, 11, 13, 18, 18, 19, 21
This data set has an odd number of values. The middle value is 5, since it has the same number of
values on either side of it. Hence, the median of this data set is 5.
Here is a set of 12 numbers, arranged in numerical order:
1, 3, 4, 4, 5, 7, 9, 11, 13, 13, 19, 21
This data set has an even number of values. The middle values are 7 and 9. We take the average of
7 and 9 to calculate the median.
\[ \text{Median} = \frac{7 + 9}{2} = 8 \]
Hence, the median of this data set is 8, even though this value does not occur in the data set.
Median
• When a data set has an odd number of values and they are arranged in numerical order,
the median is the middle value.
• When a data set has an even number of values and they are arranged in numerical order,
the median is the average of the two middle values.
• When a data set with n items is arranged in numerical order, the median lies in the $\left(\frac{n + 1}{2}\right)$th position.
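The rules in the box above can be sketched in a few lines of Python. This is only an illustration: the function name `median` and the variable names are choices made here, not part of the text. The data are the two ordered sets discussed above.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # arrange the values in numerical order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd number of values: the single middle value
        return ordered[middle]
    # even number of values: the average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_set = [2, 2, 3, 3, 3, 4, 5, 11, 13, 18, 18, 19, 21]
even_set = [1, 3, 4, 4, 5, 7, 9, 11, 13, 13, 19, 21]
print(median(odd_set))   # 5
print(median(even_set))  # 8.0
```

Sorting first means the function also works when the data are not given in order.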
Example 1
Calculate the median of the data sets.

a 33 35 43 29 53 39 45
b 5 7 9 5 12 10
Solution
a To locate the median, first put the values in numerical order. This gives:
29 33 35 39 43 45 53
Median = 39
b Again, the values are placed in numerical order.

5 5 7 9 10 12
\[ \text{Median} = \frac{7 + 9}{2} = 8 \]
Quartiles and the interquartile range

The interquartile range (IQR) measures the spread of the middle 50% of the data in an ordered
data set.
We use the interquartile range to see how closely the data are grouped around the median. When we
calculate the interquartile range, we organise the data into quartiles, each containing 25% of the
data. The word ‘quartile’ is related to ‘quarter’.
Olivia has been playing Sudoku on the internet. Her last 11 games were all rated ‘diabolical’, and her
times, correct to the nearest minute and arranged in ascending order, were:
8, 12, 14, 14, 16, 18, 19, 19, 25, 78, 523
The range of these times is 523 – 8 = 515.
Clearly the range does not give a clear picture of Olivia’s considerable skills, because the last two
times, 78 and 523, are outliers. An outlier is a single data value far away from the rest of the data.
That is, it is much larger or much smaller than all of the other values. Outliers have a huge influence
on the value of both the mean and the range. (In fact, the time of 78 minutes occurred when Olivia
left the game running over dinner, and the time of 523 minutes occurred when Olivia left the game
running overnight.)
Because of situations like this, the interquartile range is often a better measure of the spread of the
data than the range. Here is the procedure for finding it.
Step 1: Find the median. Divide the data into two equal groups. Omit the median (middle value) if
there is an odd number of values. In Olivia’s case, there are 11 values so, omitting the median
18, the two groups of 5 are:
8, 12, 14, 14, 16 and 19, 19, 25, 78, 523
Step 2: The lower quartile is the median of the lower set of values. In Olivia’s case, the lower
quartile is 14.
Step 3: The upper quartile is the median of the upper set of values. In Olivia’s case, the upper
quartile is 25.
Step 4: The interquartile range is the difference between the two quartiles. In Olivia’s case:
Interquartile range = 25 − 14
= 11
Thus, the middle 50% of Olivia’s times have a spread of 11 minutes.
Notice that the interquartile range is unaffected by the lower quarter and the upper quarter of the
values. Hence, the large sizes of two of Olivia’s times, when she left the game running to eat dinner
and to sleep, do not affect the interquartile range.
The calculations begin slightly differently when there is an even number of results. For example,
suppose that Olivia played one more game, which she solved in 22 minutes.
There are now 12 results to arrange in ascending order:
8, 12, 14, 14, 16, 18, 19, 19, 22, 25, 78, 523
Step 1: Since there is an even number of results, we divide them into two equal groups of 6. (The median lies ‘between’ the 6th and 7th member of the ordered data set. That is, in the $\frac{12 + 1}{2} = 6.5$th position.)
8, 12, 14, 14, 16, 18 and 19, 19, 22, 25, 78, 523
Step 2: The lower quartile is now $\frac{14 + 14}{2} = 14$.
Step 3: The upper quartile is now $\frac{22 + 25}{2} = 23\tfrac{1}{2}$.
Step 4: The interquartile range is now $23\tfrac{1}{2} - 14 = 9\tfrac{1}{2}$.
In this case, the middle 50% of Olivia’s times have a spread of $9\tfrac{1}{2}$ minutes.
The minimum, maximum, median and the two quartiles are sometimes called the five-number
summary. Sometimes the lower quartile is called the first quartile, because it marks the first quarter
of the ordered data. The median is then the second quartile, although this term is seldom used. The
upper quartile is called the third quartile.
We denote the lower quartile by Q1 and the upper quartile by Q3. We sometimes use the abbreviation
IQR for the interquartile range.
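Steps 1 to 4 can be sketched in Python as follows. The names `median` and `five_number_summary` are illustrative choices, not from the text, and the code assumes the method of this section (split the ordered data into two halves, omitting the median when the number of values is odd). Calculators and software sometimes use slightly different quartile conventions, so their answers may differ.

```python
def median(ordered):
    """Median of a list that is already in numerical order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # lower group
    upper = ordered[n - half:]    # upper group (the median is omitted when n is odd)
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


times_11 = [8, 12, 14, 14, 16, 18, 19, 19, 25, 78, 523]   # Olivia's 11 games
times_12 = times_11 + [22]                                 # with the extra game
for times in (times_11, times_12):
    low, q1, med, q3, high = five_number_summary(times)
    print("Q1 =", q1, "median =", med, "Q3 =", q3, "IQR =", q3 - q1)
```

For the 11 games this gives an interquartile range of 11, and for the 12 games it gives 9.5, in agreement with the working above.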
Example 2
Find the interquartile range of the data set:

26 19 25 13 24 23 23 25 20 28 23
Solution
First arrange in order and locate the median.

13 19 20 23 23 23 24 25 25 26 28
There are 11 data values. Since $\frac{11 + 1}{2} = 6$, the median is the 6th value, which is 23.
The lower group contains 5 values. The 3rd value is 20. So the lower quartile is 20.
Similarly, the upper quartile is 25.
Thus, interquartile range = 25 − 20 = 5
That is, the middle 50% of data values have a spread of 5.
Example 3
For the stem-and-leaf plot below, find the median and the quartiles.

2 | 4 6 7 8 9
3 | 0 1 1 3 4 6 7
4 | 1 4 5 5 7 8 9
5 | 0 1 2
3 | 4 means 34.
Solution
There are 22 data values. First locate the median to divide the data into two equal groups.
The median lies in the $\frac{22 + 1}{2} = 11.5$th position of the ordered set. The 11th value is 36 and
the 12th value is 37, so the median is 36.5.
The lower group contains 11 values. The 6th value is 30. So the lower quartile is 30.
Similarly, the upper quartile is 47.
Measures of spread
• The range is the difference between the highest and lowest values in a data set.
• The interquartile range measures the spread of the middle 50% of the data in an ordered
data set.
• To calculate the interquartile range, find the difference between the upper quartile Q3
and the lower quartile Q1.
Exercise 18A
1 (Example 2) Find the range and interquartile range of each data set.
a 7 5 15 10 13 3 20 7 15
b 8 5 1 7 5 7 8 10 5 7
c 40646794
d 3 13 8 11 1 18 5 13
2 (Example 3) Locate the median and the quartiles for each of the following stem-and-leaf plots. State the
interquartile range for each data set.
a
2 | 0 1 2 4 4 7 7 9
3 | 1 1 1 2 2 4 6 6 7 8 9
4 | 0 1 2 2 4
3 | 2 means 32
b
5 | 4 4 6 7 7 9
6 | 1 4 4 4 6 7 8
7 | 1 5 7 8 9 9
8 | 0 1 1 2 3 4 6
9 | 1 3 4 5
6 | 1 means 61
3 Find the mean, the mode, the median and the interquartile range of this data set.
Value 0 1 2 3 4 5 6 7 8 9 10
Frequency 5 2 0 7 1 8 4 6 0 2 11
4 Complete the following table for the positions of the median and the quartiles for data sets
of 100 and 101 items. (Note: A position of 8.5 means it is between the eighth and ninth data
values.)
   Number of data items   Lower quartile position   Median position   Upper quartile position
a  100
b  101
5 The stem-and-leaf plot below gives the height in centimetres of 20 students in a class.
14 | 4 5 6
15 | 0 1 2 8
16 | 0 0 1 2 4 5 7
17 | 2 6 7 8
18 | 0 2
15 | 1 means 151
a What is the range of the height of students in the class?
b What is the median height of students in the class?
c What is the interquartile range?
6 The stem-and-leaf plot below gives the lengths in centimetres of 15 leaves that have fallen from a tree. The
values are given correct to one decimal place. Find the interquartile range of the leaf lengths.
4 | 4
5 | 5 1 8 4 4
6 | 3 1 2 4
7 | 7 2 7
8 |
9 | 4 3
9 | 4 means 9.4
7 The following figures are the amounts a family spent on food each week for 13 weeks.
$148 $143 $152 $149 $158
$155 $147 $152 $158 $139
$143 $150 $141
a Find the median, upper quartile and lower quartile.
b Find the interquartile range of the amounts spent.
8 Write down two sets of seven whole numbers with minimum data value 3, lower quartile 5,
median 10, upper quartile 12 and maximum data value 13.
9 The median is always between the two quartiles. Is the mean always between the two
quartiles? If not, give an example of seven whole numbers where the mean is above the
upper quartile and an example where the mean is below the lower quartile.
10 a For a data set, the minimum value is 8 and the range is 27. Find the maximum value.
b For a particular data set, the upper quartile is 25.6, and the interquartile range is 11.9. Find
the lower quartile.
18B Boxplots
A useful way of displaying the maximum value and the minimum value, the upper and lower
quartiles and the median of a data set (the five-number summary) is a boxplot.
[Figure: a boxplot drawn above a scale, with the minimum, lower quartile (Q1), median, upper quartile (Q3) and maximum labelled.]
The rectangle is called the box.

The horizontal lines from the lower and upper quartiles to the minimum and maximum are called the
whiskers. In a boxplot, the box itself indicates the location of the middle 50% of the data.
Boxplots are especially useful for large data sets. A boxplot is a visual summary of some of the main
features of the data set. Boxplots are also useful for comparing related data sets – see Questions 9, 10
and 11 in Exercise 18B.
Example 4
The weights of 20 students are recorded here. The weights are given to the nearest kilogram.
48 52 54 54 55 58 58 61 62 63 63 64 65 66 66 67 69 70 72 79
a Find the median, upper quartile, lower quartile and interquartile range.
b Draw a boxplot for this data.
Solution
a There are 20 data values. Therefore, the median $= \frac{63 + 63}{2} = 63$ kg
Divide the data into two equal groups of 10.
48 52 54 54 55 58 58 61 62 63 63 64 65 66 66 67 69 70 72 79
The lower quartile $= \frac{55 + 58}{2} = 56.5$ kg
The upper quartile $= \frac{66 + 67}{2} = 66.5$ kg
The interquartile range = 66.5 − 56.5 = 10 kg
b [Figure: boxplot on a scale from 40 to 80, with minimum 48 kg, lower quartile 56.5 kg, median 63 kg, upper quartile 66.5 kg and maximum 79 kg.]
Exercise 18B
1 The boxplot below shows the price (in $) of 20 different brands of sports shirts.
[Figure: boxplot of the shirt prices on a scale from 10 to 50.]
What is the cost of the most expensive and least expensive sports shirt?
2 The boxplot below gives information regarding the annual salaries (in thousands of dollars)
of employees in a large company.
[Figure: boxplot of the salaries on a scale from 40 to 180.]
a What is the lowest salary?

b What is the range of the salaries?
c What is the median salary?
d What is the interquartile range?
3 The boxplot below gives information about the marks out of 100 obtained by a group of
40 people on a general knowledge quiz.
[Figure: boxplot of the quiz marks on a scale from 40 to 100.]
a What was the lowest mark obtained on the quiz?

b What was the median mark obtained on the quiz?
c What was the range of marks?
d What was the interquartile range?
4 Construct a boxplot for the data set given in Exercise 18A, question 2b.
5 (Example 4) The pulse rates of 21 adult females are recorded.
60 61 67 68 69 70 70 70 73 74 75 75 76 77 77 78 79 80 81 89 90
a Find the median, upper quartile, lower quartile and interquartile range.
b Draw a boxplot for this data.
6 In a boxplot for a large data set, approximately what percentage of the data set is:
a below the median?
b below the lower quartile?
c in the box?
d in each whisker?
7 In a boxplot, is one whisker always longer than the other?
8 In a boxplot, why is the median not always in the centre of the box?
9 Here are two boxplots drawn on the one scale.
[Figure: boxplots of Data set A and Data set B on a common scale from 10 to 50.]
Which data set has:

a the greater median?
b the greater range?
c the greater interquartile range?
d the greater largest data value?
10 Students in two classes sat the same mathematics test. Their results are shown in the two
boxplots below.
[Figure: boxplots of the results for Class A and Class B on a common scale from 10 to 50.]
a Which class had the higher median mark?

b Which class had the higher interquartile range?
c In which class was the highest mark for the test obtained?
d In which class was the lowest mark for the test obtained?
e Which class did better on the test? Give reasons for your choice. (Class discussion)
11 The ratings for a number of television programs on Channel A, Channel B and Channel C
were collated. The information is shown in the boxplots below. (If a program has a rating of
14, it means that 14% of the viewing audience watched that particular program.)
[Figure: boxplots of the ratings for Channel A, Channel B and Channel C on a common scale from 5 to 25.]
a Write down the approximate values of the median, quartiles and maximum and
minimum values for each channel.
b Which channel has the largest interquartile range?
c If the winning channel is the one with the highest rated program, which channel is the
winner? Which is second? Which is third?
d If the winning channel is the one with the largest median, rank the channels.
e Can you find a criterion that makes Channel C the winning channel?
18C Boxplots, histograms and outliers
It is common to use a form of the boxplot that is designed to illustrate any possible outliers in the
data. Outliers are unusual, or ‘freak’, values that differ greatly in magnitude from the majority of
data values.
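[Figure: a boxplot with Q1, the median and Q3 labelled, and an outlier shown as a separate marker beyond the end of a whisker.]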
• Any point that is more than 1.5 IQRs away from the end of the box is classified as an outlier. That
is, if a data value is greater than Q3 + 1.5 × IQR or less than Q1 − 1.5 × IQR, it is considered to be
an outlier. An outlier is indicated by a marker, as shown in the diagram above.
• The whiskers end at the highest and lowest data values that lie within 1.5 IQRs from the ends of
the box.
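The two rules above can be sketched in Python. The function names `median` and `outlier_report` are illustrative, not from the text, and the quartiles are found by the halves method of Section 18A. The data are Olivia's 11 Sudoku times from that section.

```python
def median(ordered):
    """Median of a list that is already in numerical order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def outlier_report(values):
    """Apply the 1.5 x IQR rule and report the limits, whisker ends and outliers."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])          # median of the lower group
    q3 = median(ordered[n - half:])      # median of the upper group
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr
    high_limit = q3 + 1.5 * iqr
    inside = [x for x in ordered if low_limit <= x <= high_limit]
    outliers = [x for x in ordered if x < low_limit or x > high_limit]
    return {
        "limits": (low_limit, high_limit),
        "whiskers": (inside[0], inside[-1]),   # the whiskers end at these data values
        "outliers": outliers,
    }


sudoku_times = [8, 12, 14, 14, 16, 18, 19, 19, 25, 78, 523]
print(outlier_report(sudoku_times))
```

Here Q1 = 14 and Q3 = 25, so the limits are −2.5 and 41.5. The times 78 and 523 are outliers, and the right-hand whisker ends at 25.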
Comparing a boxplot to the histogram of the same data

In ICE-EM Mathematics Year 9 we looked at different shapes of histograms and the distributions of
data, and in particular we used the terms symmetric, positively skewed and negatively skewed to
describe the shapes.
[Figure: three histograms showing a symmetric distribution, a negatively skewed distribution and a positively skewed distribution.]
The following examples look at representing data with histograms and boxplots.
Example 5
The house prices of 50 houses sold in a town over a period of two years are recorded. The
prices are in thousands of dollars.
110, 110, 120, 130, 140, 150, 150, 170, 170, 170, 180, 190, 200, 210, 210, 230, 270, 270,
290, 310, 340, 340, 340, 340, 350, 360, 360, 365, 365, 400, 400, 400, 400, 410, 430,
440, 450, 460, 460, 460, 460, 564, 678, 678, 750, 760, 904, 1320, 2350, 2350
a Find the quartiles, the median and the interquartile range.
b Calculate 1.5 × IQR.
c Name the outliers.
d Draw a histogram and boxplot of this information. The boxplot should show outliers.
e i Calculate the mean, including the outliers.
ii Calculate the mean, not including the outliers.
Solution
a The data has been given in ascending order. There are 50 data values. The median is the
mean of the 25th and 26th values.
Median = $355 000
Q1 is the median of the lower set of 25 values. This is the 13th value.
Q1 = $200 000
Q3 is the median of the upper set of 25 values. This is the 38th value.
Q3 = $460 000
IQR = Q3 − Q1 = $260 000
b 1.5 × IQR = 1.5 × (Q3 − Q1) = $390 000
Hence, a value is an outlier if it is greater than 460 000 + 390 000 = $850 000
or less than 200 000 − 390 000 = −$190 000. No house price is negative, so only the high values can be outliers here.
c The outliers are $904 000, $1 320 000, $2 350 000 and $2 350 000.
d [Figure: boxplot of the house prices on a scale from 0 to 2500 (thousands of dollars), with the outliers marked.]
(Note: The right-hand whisker ends with the value $760 000)
[Figure: histogram of the house prices (thousands of dollars).]
The classes are $100 000 to $199 000, $200 000 to $299 000 etc.
e i Mean with outliers = $449 300, to the nearest $100.
ii Mean without outliers = $337 800, to the nearest $100.
It could be said that the distribution has a positive skew. The left-hand whisker is short. Most
of the values lie in the interval from $100 000 to $500 000.
Example 6
The waiting times in seconds at a ticket counter were as follows:

0, 0, 3, 5, 5, 5, 9, 10, 12, 13, 16, 17, 18, 18, 21, 22, 23, 23, 24, 24, 24, 24, 24, 25,
25, 25, 26, 26, 27, 28, 29, 28, 29, 29, 28, 30, 31, 31, 31, 32, 34, 34, 33, 33, 33, 34, 34, 33,
34, 35, 35, 35, 36, 36, 37, 38, 39, 38, 39, 39, 38, 40, 41, 41, 52
a Find Q1, the median, Q3 and the IQR.
b Draw a boxplot, showing outliers.
c Draw a histogram.
d Comment on the shape of the histogram and the boxplot.
Solution
a Q1 = 22.5, median = 29, Q3 = 34.5, IQR = Q3 − Q1 = 12
b Q3 + 1.5 × IQR = 34.5 + 1.5 × 12 = 52.5
Q1 − 1.5 × IQR = 22.5 − 1.5 × 12 = 4.5
Therefore, the values 0, 0 and 3 are considered to be outliers.
[Figure: boxplot of the waiting times in seconds, with the outliers marked.]
c [Figure: histogram of waiting time in seconds, with classes 0–4, 5–9, 10–14, 15–19, 20–24, 25–29, 30–34, 35–39, 40–44, 45–49 and 50–54.]
d There is a negative skew. The right-hand whisker is short. The left-hand whisker is longer,
indicating a tailing off of the data values. The values 0, 0 and 3 are outliers.
Example 7
Fifty-four lengths of wire are cut off by a machine. The resulting lengths measured in cm are
as shown:
103, 104, 105, 106, 106, 106, 107, 107, 107, 107, 107, 108, 108, 108, 108, 108, 108,
108, 108, 109, 109, 109, 109, 109, 109, 109, 109, 110, 110, 110, 110, 110, 110, 110,
110, 110, 111, 111, 111, 111, 111, 111, 111, 112, 112, 112, 112, 113, 113, 113, 113,
114, 115, 116
a Find Q1, the median, Q3 and the IQR.
b Draw a boxplot, showing outliers.
c Draw a histogram.
Solution
a Q1 = 108 cm, median = 109.5 cm, Q3 = 111 cm and IQR = 3 cm
b [Figure: boxplot of the lengths of wires in cm on a scale from 102 to 116, with the outliers marked.]
c [Figure: histogram of the lengths of wires in cm, with one column for each length from 103 to 116.]
d The histogram is symmetric. The whiskers on the boxplot are of equal length. The values
103 cm and 116 cm are outliers.
Exercise 18C
1 (Examples 5, 6) The heights, measured in centimetres, of 25 students in a class are:
170 175 133 153 164 189 143 133 167 145 150 164 169
159 177 186 173 164 177 168 142 155 153 167 166
a Find Q1, the median and Q3.
b Find the interquartile range.
c Draw a boxplot, showing any outliers.
2 (Example 7) The annual incomes of 30 people, given correct to the nearest $1000, are:
54 000 67 000 92 000 78 000 54 000 87 000 102 000 112 000
132 000 45 000 256 000 89 000 78 000 98 000 34 000 75 000
65 000 100 000 34 000 68 000 79 000 81 000 82 000 103 000
21 000 345 000 98 000 67 000 105 000 98 000
a Find Q1, the median and Q3.
b Find the interquartile range.
c Draw a boxplot, showing any outliers.
3 Match each histogram a–c with its boxplot i–iii and describe the shape of the data distribution.
[Figures: three histograms a, b and c with classes 50–59, 60–69, 70–79, 80–89, 90–99, 100–109 and 110–119, and three boxplots i, ii and iii on a scale from 50 to 120.]
4 Consider the data shown in the stem-and-leaf plot.

15 | 6 8
16 | 9 9
17 | 0 1 3 3 4 5 8 8 9 9
18 | 0 0 1 3 3 4 7 7 8 8
19 | 1 2 3
a Draw a histogram.
b Find Q1, the median, Q3 and the IQR.
c Draw the boxplot.
5 The lower and upper quartiles for a data set are 116 and 134. Which of the following data
values would be classified as an outlier?
a 190 b 60 c 150
6 The speeds of 20 cars measured on a city street were recorded.

40 14 3 26 20 31 42 36 17 24
28 33 27 29 24 51 11 35 5 24
a Construct a stem-and-leaf diagram.
b Construct a boxplot.
c Comment on the shape of the distribution of data.
7 The reaction times (in milliseconds) of 20 people are listed here.

38 31 36 39 35 25 35 44 43 44
46 34 62 22 42 48 31 30 45 40
a Find the median, Q1, Q3 and the interquartile range.
b Construct a boxplot.
c Identify any outliers.
8 The weight loss (in kilograms) of 20 randomly selected people undertaking a special diet
over three weeks is:
8 5 10 6 6 12 4 5 5 6
8 13 7 7 7 6 6 4 5 5
a Construct a dotplot of the data.
b Construct a boxplot of the data.
c Comment on the shape.
18D The mean and the standard deviation
Mean
The mean of a data set is a measure of its centre. The mean is calculated by adding together all the
data values and then dividing the resulting sum by the number of data values.
\[ \text{Mean} = \frac{\text{sum of values}}{\text{number of values}} \]
A more common name for the mean is ‘average’. We use the symbol $\bar{x}$ to denote the mean.
For a set of data $x_1, x_2, x_3, \dots, x_n$,
\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} \]
Example 8
A student obtained the following marks in seven tests:

43, 35, 41, 29, 33, 39 and 42
Calculate the mean mark correct to two decimal places.
Solution
\[ \bar{x} = \frac{43 + 35 + 41 + 29 + 33 + 39 + 42}{7} \approx 37.43 \]
(Correct to two decimal places.)
For larger sets of data, a frequency table can be prepared. Let $f_1$ be the frequency of the data
item $x_1$, let $f_2$ be the frequency of the data item $x_2$ and so on. In this case we can write:
\[ \bar{x} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_s x_s}{f_1 + f_2 + \dots + f_s} \]
The numerator is the sum of the data items and the denominator is the number of data items.
Example 9
The following information gives the number of children in each of 20 families. Calculate the
mean number of children per family.
Number of children xi Frequency fi
0 4
1 5
2 7
3 4
Solution
Add in a column for fi xi .

Number of children xi Frequency fi fi xi
0 4 0
1 5 5
2 7 14
3 4 12
Total = 20 Total = 31
\[ \bar{x} = \frac{31}{20} = 1.55 \]
It is obviously impossible for a family to have 1.55 children. The mean is not necessarily a
member of the data set.
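The frequency-table calculation in Example 9 can be sketched in Python. The function name `mean_from_frequencies` and the dictionary layout (data value mapped to its frequency) are illustrative choices, not from the text.

```python
def mean_from_frequencies(table):
    """Mean of data given as a table mapping each value x_i to its frequency f_i."""
    total = sum(f * x for x, f in table.items())   # sum of the data items
    count = sum(table.values())                    # number of data items
    return total / count


# Number of children -> number of families (Example 9)
children = {0: 4, 1: 5, 2: 7, 3: 4}
print(mean_from_frequencies(children))   # 1.55
```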
Standard deviation
The standard deviation of a set of data is a measure of how far the data values are spread out from
the mean. The difference between each data item and the mean is called the deviation of the data
value. The sum of the deviations is zero, which will be proved in question 10 of Exercise 18D.
The standard deviation is calculated from the squares of the deviations.
Here are the steps in finding the standard deviation:
• Calculate the mean.
• Square each of the deviations.
• Sum these squares.
• Divide the sum of the squares by the number of data values.
• Take the square root of the value obtained.
This is given by the formula:
\[ \sigma = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n}} \]
where the $x_i$ are the data values, $\bar{x}$ is the mean and n is the number of data values.
We will use the Greek letter σ (sigma) to denote the standard deviation of a data set.
Example 10
Find the standard deviation, correct to two decimal places, for the data set.
5, 7, 11, 13, 14
Solution
\[ \bar{x} = \frac{5 + 7 + 11 + 13 + 14}{5} = 10 \]
\[ \sigma^2 = \frac{(5 - 10)^2 + (7 - 10)^2 + (11 - 10)^2 + (13 - 10)^2 + (14 - 10)^2}{5} = \frac{25 + 9 + 1 + 9 + 16}{5} = \frac{60}{5} = 12 \]
Hence, $\sigma = \sqrt{12} \approx 3.46$ (Correct to two decimal places.)
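The steps for finding the standard deviation can be sketched in Python. The function name `standard_deviation` is an illustrative choice, and the code divides by n, as in this book.

```python
import math


def standard_deviation(values):
    """Standard deviation found by dividing the sum of squared deviations by n."""
    n = len(values)
    mean = sum(values) / n                                  # calculate the mean
    squared_deviations = [(x - mean) ** 2 for x in values]  # square each deviation
    variance = sum(squared_deviations) / n                  # sum the squares, divide by n
    return math.sqrt(variance)                              # take the square root


print(round(standard_deviation([5, 7, 11, 13, 14]), 2))   # 3.46
```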
When calculating the standard deviation from a frequency table, we can use the following formula:
\[ \sigma = \sqrt{\frac{f_1 (x_1 - \bar{x})^2 + f_2 (x_2 - \bar{x})^2 + f_3 (x_3 - \bar{x})^2 + \dots + f_s (x_s - \bar{x})^2}{f_1 + f_2 + \dots + f_s}} \]
When frequencies are taken into account, we can see that this is the same formula as above.
We can calculate the standard deviation with an extended frequency table with five columns. Fill in
the first three columns, then calculate $\bar{x}$. Fill in the other two columns and then calculate σ.
Example 11
Calculate the mean and standard deviation of the set of values, correct to two decimal places.
1, 3, 4, 5, 7, 3, 6, 9, 9, 4, 5, 2, 5, 7
Solution
$x_i$     $f_i$     $f_i x_i$     $(x_i - \bar{x})$     $f_i (x_i - \bar{x})^2$
1         1         1             −4                    16
2         1         2             −3                    9
3         2         6             −2                    8
4         2         8             −1                    2
5         3         15            0                     0
6         1         6             1                     1
7         2         14            2                     8
9         2         18            4                     32
Total     14        70                                  76
\[ \bar{x} = \frac{70}{14} = 5 \qquad \sigma = \sqrt{\frac{76}{14}} \approx 2.33 \]
(Correct to two decimal places.)
Note: The sum of the deviations $x_i - \bar{x}$ is zero. Hence, the average of the deviations is not useful.
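The extended frequency table of Example 11 can be reproduced with a short Python sketch. The variable names are illustrative, and `collections.Counter` is used here only as a convenient way of building the frequency column.

```python
import math
from collections import Counter

values = [1, 3, 4, 5, 7, 3, 6, 9, 9, 4, 5, 2, 5, 7]
table = Counter(values)                       # maps each value x_i to its frequency f_i

n = sum(table.values())                       # total of the f_i column
mean = sum(f * x for x, f in table.items()) / n
sum_of_squares = sum(f * (x - mean) ** 2 for x, f in table.items())
sigma = math.sqrt(sum_of_squares / n)

print(n, mean, sum_of_squares)   # 14 5.0 76.0
print(round(sigma, 2))           # 2.33
```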
Mean and standard deviation

• The mean of a set of data is denoted by $\bar{x}$.
• The standard deviation of a data set is a measure of spread and is denoted by the Greek
letter σ.
• There are two formulas for the standard deviation.
When the data is in a list:
\[ \sigma = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n}} \]
When the data is in a frequency table:
\[ \sigma = \sqrt{\frac{f_1 (x_1 - \bar{x})^2 + f_2 (x_2 - \bar{x})^2 + f_3 (x_3 - \bar{x})^2 + \dots + f_s (x_s - \bar{x})^2}{f_1 + f_2 + \dots + f_s}} \]
It is clear that the larger the standard deviation, the more spread out the data are about the mean.
For example, here is a bar chart of the data in Example 11, and also another set of 14 data items
where the data are not as spread out but have the same mean.
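[Figure: two bar charts of frequency against data value (1 to 9). Left: the data in Example 11, with $\bar{x} = 5$ and $\sigma \approx 2.33$. Right: the second data set, with $\bar{x} = 5$ and $\sigma \approx 1.25$.]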
In the following section we will see how the standard deviation may be used to make comparisons
between data sets.
Use of calculators
Many calculators and spreadsheets have a built-in facility for calculating the standard deviation of a
set of data.
To save time, we recommend using this facility for all but the simplest data sets. In particular, if $\bar{x}$ is
not an integer, then calculating σ is very tedious.
It should be noted that in this book we calculate the standard deviation by dividing the sum of
the squares of the deviations by n, the number of data items, and taking the square root. There is
also another type of standard deviation that is obtained by dividing the sum of the squares of the
deviations by n − 1, and taking the square root. Many calculators offer both versions. Sometimes
they are denoted by symbols such as $\sigma_n$ and $\sigma_{n-1}$. In this book, we only use $\sigma_n$.
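Python's standard `statistics` module is one example of software that offers both versions: `pstdev` divides by n and `stdev` divides by n − 1. The sketch below compares them on the data of Example 10. The variable names are illustrative.

```python
import statistics

data = [5, 7, 11, 13, 14]

sigma_n = statistics.pstdev(data)          # divides by n (the version used in this book)
sigma_n_minus_1 = statistics.stdev(data)   # divides by n - 1

print(round(sigma_n, 2))           # 3.46
print(round(sigma_n_minus_1, 2))   # 3.87
```

The version that divides by n − 1 is always the larger of the two, so it is worth checking which one a calculator or program is reporting.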
Exercise 18D
Give all answers correct to two decimal places unless otherwise specified.
1 (Example 8) During a 13-week football season, the number of kicks obtained by a particular player each
week is:
18, 18, 20, 26, 10, 8, 21, 14, 16, 14, 12 and 16
Calculate the mean number of kicks obtained by the player.
2 The daily maximum temperature was recorded in two different cities for a week. The results
are shown below.
City A: 28, 31, 34, 32, 31, 29, 28
City B: 26, 32, 36, 38, 37, 29, 25
Which city had the greater mean daily maximum temperature?
3 The average of 5 masses is 67 kg. If a mass of 25 kg is added, what is the average of the
6 masses?
4 During a term, a student has an average of 46 marks after the first four tests and his average
for the next six tests is 38 marks. What is his average for the ten tests?
5 (Example 10) a Calculate, correct to two decimal places, the mean and standard deviation for the
data sets.
i 2, 4, 8, 10, 2, 9, 3, 8, 2, 2 ii 3, 6, 4, 5, 6, 7, 3, 4, 6, 6
b Comment on the results from part a.
6 (Example 11) Complete the following extended frequency table to calculate the mean and standard
deviation of the given data set.
$x_i$     $f_i$     $f_i x_i$     $(x_i - \bar{x})$     $f_i (x_i - \bar{x})^2$

1 2
2 7
3 6
4 1
5 2
6 2
Total = Total = Total =
7 Use a calculator to find, correct to two decimal places, the mean and standard deviation for
each data set.
a 3, 6, 7, 5, 8, 5, 10, 12, 13, 12, 6, 9, 12, 14, 15
b 8, 10, 12, 14, 16, 17, 19, 12, 11, 10, 14, 16, 18, 19
8 Twenty students sat a test and their results are given in the stem-and-leaf plot below.
1 | 2 2 8 9
2 | 2 4 5 6 8
3 | 0 2 6 8 8 9
4 | 0 1 2 3 6
1 | 2 means 12
a Calculate their mean mark.

b How many students obtained a mark higher than the mean mark?
c Find the standard deviation of their marks.
9 Twenty people completed a test worth 10 marks. Their scores are shown in the frequency
table below.
Score 0 1 2 3 4 5 6 7 8 9 10
Number of people 0 2 0 1 1 2 4 6 0 2 2
a Calculate the mean mark.

b How many students obtained a mark lower than the mean mark?
c Find the standard deviation of their marks.
10 a Prove that the sum of the deviations for the data set a, b, c is zero.
b Prove that the sum of the deviations of any data set is zero.
18E Interpreting the standard deviation
Consider the data sets 4, 5, 6, 7, 8 and 2, 4, 6, 8, 10.
Both the data sets have a mean and median of 6. However, when we apply the formula for σ, it can be
observed that the standard deviation for the second data set is $2\sqrt{2}$, which is twice the standard
deviation of the first data set, $\sqrt{2}$. This reflects the difference in spread between the two data sets. That
is, even though both have evenly distributed values, the spread of data from the mean is twice as great
in the second data set as compared to the first.
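The two standard deviations can be checked from the formula for σ in Section 18D. In the working below, $\sigma_1$ and $\sigma_2$ are labels introduced here (not used in the text) for the first and second data sets, and each deviation is measured from the mean of 6.

```latex
\begin{align*}
\sigma_1 &= \sqrt{\frac{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2}{5}}
          = \sqrt{\frac{10}{5}} = \sqrt{2} \approx 1.41 \\
\sigma_2 &= \sqrt{\frac{(-4)^2 + (-2)^2 + 0^2 + 2^2 + 4^2}{5}}
          = \sqrt{\frac{40}{5}} = \sqrt{8} = 2\sqrt{2} \approx 2.83
\end{align*}
```

Each deviation in the second set is double the corresponding deviation in the first, so each squared deviation is four times as large and the standard deviation doubles.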
Intervals about the mean

In the following we will look at a ‘symmetric’ set of data which ‘tails off’ as you move away from
the mean in either direction.
The stem-and-leaf plot on the next page gives the incomes, in thousands of dollars, of 134 people.
0 | 8 8 9
1 | 0 0 2 2 3
2 | 4 4 4 4 4 4 8 8 8 8 8 8 8
3 | 1 1 1 2 2 4 4 4 4 6 6 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9
4 | 1 1 1 1 1 2 2 2 3 3 3 4 4 4 4 4 5 5 5 5 6 6 7 7 7 7 7 7 8 8 9 9 9 9 9 9 9 9 9
5 | 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 3 3 3 4 4 4 4 4 5 7 7
6 | 3 3 3 3 3 6 6 6 6 9 9 9 9
7 | 7 7 8 9 9
8 | 6 6 6
7 | 7 means $77 000
The mean is 45.1, the median is 45.5, and the standard deviation is 16.1.
We next consider intervals centred on the mean.
$\bar{x} + \sigma = 45.1 + 16.1 = 61.2$ and $\bar{x} - \sigma = 45.1 - 16.1 = 29.0$
We can observe from the plot above that there are 92 values between 29 and 61; hence, the percentage of
values within one standard deviation of the mean is 68.7%. Also,
$\bar{x} + 2\sigma = 45.1 + 2 \times 16.1 = 77.3$ and $\bar{x} - 2\sigma = 45.1 - 2 \times 16.1 = 12.9$
There are 121 values between 13 and 77. Thus, the percentage of values within two
standard deviations of the mean is 90.3%.
[Figure: histogram of the incomes with classes 0–9, 10–19, 20–29, 30–39, 40–49, 50–59, 60–69, 70–79 and 80–89, with the intervals $\bar{x} - \sigma$ to $\bar{x} + \sigma$ and $\bar{x} - 2\sigma$ to $\bar{x} + 2\sigma$ marked.]
We have seen that about 69% of the data is within one standard deviation of the mean and about 90%
of the data is within two standard deviations of the mean.
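These counts can be checked with a short Python sketch. The dictionary `stem_plot` is a transcription of the stem-and-leaf plot above (one character per leaf), and the names `incomes` and `share_within` are illustrative choices made here. The standard deviation is found with `statistics.pstdev`, which divides by n.

```python
import statistics

# Stem-and-leaf plot of the 134 incomes: stem -> leaves, one character per leaf.
stem_plot = {
    0: "889",
    1: "00223",
    2: "4444448888888",
    3: "11122444466667777788888999",
    4: "111112223334444455556677777788999999999",
    5: "000001111111122233344444577",
    6: "3333366669999",
    7: "77899",
    8: "666",
}
incomes = [10 * stem + int(leaf) for stem, leaves in stem_plot.items() for leaf in leaves]

mean = statistics.mean(incomes)
sigma = statistics.pstdev(incomes)


def share_within(k):
    """Count and percentage of incomes within k standard deviations of the mean."""
    low, high = mean - k * sigma, mean + k * sigma
    count = sum(low <= x <= high for x in incomes)
    return count, 100 * count / len(incomes)


print(round(mean, 1), round(sigma, 1))   # 45.1 16.1
for k in (1, 2):
    count, percent = share_within(k)
    print(k, count, round(percent, 1))   # 1 92 68.7, then 2 121 90.3
```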
Histograms similar to this one occur frequently. In most cases like these the median and the mean
are very close.
Example 12
David plays golf every Friday. He has recorded his score each Friday for five years, and has
found that his mean score for all his games is 85 and the standard deviation of his scores is 5.2.
Find the range of scores that lie within:
a one standard deviation of the mean b two standard deviations of the mean
Solution
a $\bar{x} + \sigma = 85 + 5.2 = 90.2$ and $\bar{x} - \sigma = 85 - 5.2 = 79.8$

So the range of scores within one standard deviation of the mean is 80 to 90.
b $\bar{x} + 2\sigma = 85 + 10.4 = 95.4$ and $\bar{x} - 2\sigma = 85 - 10.4 = 74.6$
So the range of scores within two standard deviations of the mean is 75 to 95.
A remarkable result known as Chebyshev’s inequality states that, for any set of data, if we take an
interval between $\bar{x} - k\sigma$ and $\bar{x} + k\sigma$, then all values can lie outside this interval for $0 < k \le 1$, but
for $k > 1$, at most $\frac{1}{k^2}$ of the data can lie outside this interval.
So, for example, taking $k = 2$, not more than $\frac{1}{4}$ of the data can be outside this interval.
So at least 75% of the data must lie inside this interval.
σ – 2σ x x + 2σ
at least 75% of the data
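As a rough illustration of the inequality, the sketch below computes the guaranteed minimum fraction of data inside the interval for a given k, and applies it to the golf scores of Example 12 (mean 85, standard deviation 5.2). The function names `chebyshev_min_inside` and `interval_about_mean` are hypothetical helpers introduced here, not notation from the text.

```python
def chebyshev_min_inside(k):
    """Smallest fraction of any data set guaranteed to lie within k standard deviations."""
    if k <= 1:
        # For 0 < k <= 1 the inequality gives no guarantee at all.
        return 0.0
    return 1 - 1 / k**2


def interval_about_mean(mean, sd, k):
    """End points of the interval from mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd


low, high = interval_about_mean(85, 5.2, 2)
guaranteed = chebyshev_min_inside(2)
print(f"At least {guaranteed:.0%} of the scores lie between {low:.1f} and {high:.1f}")
```

Whatever the shape of David's scores, at least three-quarters of them must lie between 74.6 and 95.4.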
Using the standard deviation to compare data

To compare values from different data sets with approximately the same shape, it is useful to
consider where they are positioned relative to their respective means. This can be achieved by using
their respective standard deviations, and calculating where these values lie in terms of the number of
standard deviations above or below the mean.
Example 13
Gus scored 14 in a maths test and 14 in an English test. The scores of each student in the
maths and English classes are listed below. In which test did Gus perform better, relative to
the class results?
Maths test: 10, 13, 18, 17, 12, 16, 9, 8, 7, 11, 10, 12
English test: 15, 17, 18, 19, 18, 17, 19, 16, 14, 15, 14, 12
Solution
Maths test: x̄ = 143/12 ≈ 11.92, σ ≈ 3.38
English test: x̄ ≈ 16.17, σ ≈ 2.11
It can be seen that in the maths test Gus scored about 0.6 of a standard deviation above the mean
((14 − 11.92)/3.38 ≈ 0.6) and in the English test Gus scored about 1 standard deviation below the
mean ((14 − 16.17)/2.11 ≈ −1). So Gus has done better relative to the class in the maths test.
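The comparison in Example 13 can be sketched in Python as follows. The helper name `deviations_from_mean` is an assumption made for this illustration, and σ is taken to be the population standard deviation, which matches the values quoted in the solution.

```python
from statistics import fmean, pstdev

maths = [10, 13, 18, 17, 12, 16, 9, 8, 7, 11, 10, 12]
english = [15, 17, 18, 19, 18, 17, 19, 16, 14, 15, 14, 12]


def deviations_from_mean(value, data):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - fmean(data)) / pstdev(data)


maths_position = deviations_from_mean(14, maths)
english_position = deviations_from_mean(14, english)

print(f"Maths:   {maths_position:+.2f} standard deviations from the mean")
print(f"English: {english_position:+.2f} standard deviations from the mean")
```

The larger of the two results identifies the test in which the mark is better relative to the class.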
Exercise 18E
1 Find the mean and standard deviation of each set of data.

a 5, 6, 6, 7, 8, 9, 22
b 11, 7, 8, 9, 8, 10, 10
c 1, 3, 7, 9, 11, 15, 17
Compare the sets of data using their means and standard deviations.
2 (See Example 12) The mean and standard deviation of each set of data are given. Find the range of values that
is within:
i one standard deviation of the mean ii two standard deviations of the mean
a x̄ = 35, σ = 2.5
b x̄ = 40, σ = 5
c x̄ = 35, σ = 8
3 (See Example 13) The mathematics and English marks for a class of 15 students are given below.
Mathematics: 12, 16, 14, 19, 17, 18, 15, 15, 19, 20, 14, 18, 19, 15, 11
English: 10, 13, 16, 19, 20, 19, 18, 16, 15, 14, 17, 11, 15, 18, 17
a Calculate, correct to two decimal places, the mean and standard deviation for each set
of marks.
b If a student scored 16 for the mathematics test and 14 for the English test, which is the
better mark relative to the class results?
4 The following table lists the marks of several students on different tests in English and
mathematics. Compare the English and mathematics marks of each student.
Student      Subject       Mark   Mean   Standard deviation
a David      English       15     17     2
             Mathematics   13     17     3
b Akira      English       42     30     6
             Mathematics   39     25     8
c Katherine  English       70     75     5
             Mathematics   65     70     10
d Daniel     English       70     55     9
             Mathematics   69     62     7
5 The bar charts of three sets of data are shown.
[Bar charts i, ii and iii: horizontal axis from 1 to 11, vertical axis from 0 to 4]
a For each set of data, calculate the mean and the standard deviation.
b Add 5 onto each data item in each of i, ii and iii and state the mean and
standard deviation of each new set of data.
c Multiply each data item in each of i, ii and iii by 2 and state the mean and standard
deviation of each new set of data.
6 (There is no arithmetic required in the following.)
Make up a list of 10 numbers so that the standard deviation is as large as possible and:
a every number is either 1 or 5 b every number is either 1 or 9
c every number is either 1 or 5 or 9, and at least two of them are 5
7 Repeat question 6, but this time so the standard deviation is as small as possible.
8 An employer has 29 employees whose weekly salaries have x̄ = $429 and σ = $1.53. The
employer decides to give a flat $100 raise to every employee.
a What would be the change to the average annual salary paid by the employer?
b Would there be a change in the standard deviation?
c What would be the change in total weekly payments to employees?
18F Time-series data

A time series is a set of data that has been obtained by taking repeated measurements over time.
Maximum daily temperatures, average weekly wages, quarterly sales figures of a company and
annual population of a city are all examples of a time series.
To represent the information obtained in a time series pictorially, a graph is drawn in which:
• the horizontal axis represents time
• the vertical axis represents the quantity that is being measured at regular intervals
• adjacent plotted points are joined by line intervals.
Example 14
The mean daily maximum temperature was measured each month in a particular city.
Month                            Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Mean daily max. temp (°C)        29.2  28.9  28.1  26.4  23.5  21.2  20.6  21.7  23.8  25.7  27.4  28.7
a Represent this information on a time-series plot.
b Briefly comment on the annual variation in daily maximum temperature.
Solution
a To construct a time-series plot, the months are placed on the horizontal axis and the vertical
axis will represent the mean daily maximum temperature. The points are plotted and joined by lines.
The following time-series plot is obtained.
[Time-series plot: Month (J to D) on the horizontal axis, Temperature (°C) from 20 to 30 on the vertical axis]
b There is a gradual decrease in the mean daily maximum temperature over the months January,
February and March. During April, May and June, the mean daily maximum temperature falls quite
quickly to a minimum during July. For the remainder of the year, there is a steady increase in
the mean daily maximum temperature each month.
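The description in part b can be supported by looking at the change from each month to the next. The Python sketch below is one possible way to do this; the variable names are introduced here for illustration only.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
TEMPS = [29.2, 28.9, 28.1, 26.4, 23.5, 21.2,
         20.6, 21.7, 23.8, 25.7, 27.4, 28.7]

# Change in mean daily maximum temperature from each month to the next,
# rounded to one decimal place to match the precision of the data.
changes = [round(later - earlier, 1) for earlier, later in zip(TEMPS, TEMPS[1:])]

# Month with the lowest mean daily maximum temperature.
coolest_month = MONTHS[TEMPS.index(min(TEMPS))]

for month, next_month, change in zip(MONTHS, MONTHS[1:], changes):
    print(f"{month} to {next_month}: {change:+.1f} °C")
print("Coolest month:", coolest_month)
```

Negative changes mark the months in which the temperature is falling and positive changes the months in which it is rising, so the sign pattern shows where the minimum occurs.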
Exercise 18F
1 (See Example 14) a Construct a time-series plot for the average rainfall (in cm) in a particular city, which is
given in the table below.
Rainfall (in cm) 16.2 17.5 14.2 9.1 9.6 7.1 6.2 4.1 3.3 9.3 9.6 12.6
b Use the time-series plot to write a brief description of how the rainfall varies in this
particular city.
2 The table below gives the annual profit (in $ million) of a particular company over a 10-year
period. Construct a time-series plot of the information.
Year 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998
Profit ($ million) 1.2 1.8 2.4 2.2 2.6 3.1 3.2 3.4 3.6 4.0
3 The table below gives the number of births that occurred in a hospital each month for a year.
Number of births 52 46 43 40 31 32 26 27 24 20 26 26
a Represent this information on a time-series plot.
b Briefly describe how the number of births recorded each month changed over the year.
4 The table below gives the position of a particular football team in a competition of 12 teams
at the completion of each round throughout the season.
Round 1 2 3 4 5 6 7 8 9 10 11
Position 10 12 11 9 8 6 5 5 4 5 5
Round 12 13 14 15 16 17 18 19 20 21 22
Position 6 4 4 3 4 3 5 7 6 9 8
a Represent this information on a time-series plot.
b Briefly describe the progress of the team throughout the season.
5 The data below shows the quarterly sales of a department store over a period of three years.
The quarters are labelled 1 to 12 in the corresponding time-series graph.
Sales quarter   Sales $’000
2009–1          45
2009–2          63
2009–3          67
2009–4          43
2010–1          51
2010–2          69
2010–3          75
2010–4          39
2011–1          55
2011–2          71
2011–3          79
2011–4          49
[Time-series graph: Quarter (1 to 12) on the horizontal axis, Sales $’000 (0 to 90) on the vertical axis]
a In which quarter of each year are the sales figures the worst?
b In which quarter of each year are the sales figures the best?
c Are the sales figures improving? Compare the sales figures for the first quarter of each
year and do the same for the other quarters.
6 The table below gives the quarterly sales figures for a car dealer for the period 2009–2011.
Number of sales Q1 Q2 Q3 Q4
2009 72 62 90 98
2010 87 78 112 111
2011 90 84 132 117
a Represent this information on a time-series plot.
b Briefly describe how the car sales have altered over the given time period.
c Does it appear that the car dealer is able to sell more cars in a particular period
each year?
18G Bivariate data
We often want to know if there is a relationship between the items in two different data sets.
• Is there a relationship between children’s ages and their heights?
• Is there a relationship between people’s heights and weights?
• Is there a relationship between students’ marks in English and their marks in mathematics?
In each of the above, two pieces of information are to be collected from each person in the
investigation and then the two data sets are to be compared. When two pieces of information are
collected from each subject in an investigation, we are then concerned with bivariate data.
A scatter graph or scatter plot is a type of display that uses coordinates to display values for two
variables for a set of data. The data is displayed as a collection of points, each having the value of
one variable determining the position of the horizontal coordinate and the value of the other variable
determining the position of the vertical coordinate.
Example 15
The age (in years) and height (in cm) of each person in a group were recorded. The data obtained
is shown in the table below. Present the information in the table on a scatter plot.
Person     Age (years)   Height (cm)
Alan       12            145
Brianna    14            140
Chiyo      15            160
Danielle   14            150
Ezra       10            130
Frankie    11            135
Solution
The variables under consideration are age and height. The horizontal axis represents the age
and the vertical axis represents height. The axes are broken (using a break symbol) to allow us
to focus on the data points.
[Scatter plot: Age (years) from 10 to 15 on the horizontal axis, Height (cm) from 125 to 160 on the vertical axis, with points A (12, 145), B (14, 140), C (15, 160), D (14, 150), E (10, 130) and F (11, 135)]
In this scatter plot, it is noted that points towards the top-right of the plot represent
individuals who are older and taller. Points in the bottom-right represent individuals who are
older but shorter than the rest of the group. The bottom-left of the plot represents people
who are younger and shorter, while the top-left portion of the graph represents individuals who
are younger but taller than the rest of the group.
We can see from the general trend of the points, which is upward as we move to the right,
that the height of a child increases as the child grows older (for children in this data set).
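One way to make the regions of the scatter plot concrete is to compare each point with the average age and average height of the group. This is an assumption made for the sketch below (the text only says 'the rest of the group'), and the function name `region` is hypothetical.

```python
from statistics import fmean

# (age in years, height in cm) for each person in Example 15.
people = {
    "Alan": (12, 145),
    "Brianna": (14, 140),
    "Chiyo": (15, 160),
    "Danielle": (14, 150),
    "Ezra": (10, 130),
    "Frankie": (11, 135),
}

mean_age = fmean(age for age, _ in people.values())
mean_height = fmean(height for _, height in people.values())


def region(age, height):
    """Name the part of the plot a point falls in, relative to the group means (assumed)."""
    vertical = "top" if height > mean_height else "bottom"
    horizontal = "right" if age > mean_age else "left"
    return f"{vertical}-{horizontal}"


regions = {name: region(age, height) for name, (age, height) in people.items()}
for name, where in regions.items():
    print(f"{name}: {where}")
```

Under this assumption, Brianna falls in the bottom-right (older but shorter) and Alan in the top-left (younger but taller), in line with the description above.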
Example 16
The second-hand price and age of a particular model of car are recorded in the table below,
and the points plotted on a scatter plot.
Age of car (years)   Second-hand price ($)
1                    22 000
2                    19 500
2                    18 700
3                    16 400
3                    17 000
3                    16 800
4                    15 800
4                    15 950
5                    14 800
6                    12 500
6                    12 000
6                    12 800
7                    12 200
7                    11 580
8                    10 500
8                    9200
8                    8600
9                    5700
10                   4850
11                   4500
[Scatter plot: Age of car (years) from 0 to 12 on the horizontal axis, Second-hand price ($) from 0 to 25 000 on the vertical axis]
a Describe the points in the top-left of the plot.

b Describe the points in the bottom-right of the plot.
c Describe the trend.
Solution
a The top-left of the scatter plot has points corresponding to relatively new second-hand
cars with higher prices.
b The bottom-right of the scatter plot has points corresponding to older second-hand cars
with lower prices.
c As the age of the car increases the value decreases.
Exercise 18G
1 (See Example 15) The table below gives the marks obtained by 10 students in a mathematics examination and
an English examination.
Mathematics mark 72 50 96 58 86 94 78 66 85 78
English mark 78 64 70 46 88 72 70 62 72 74
Represent this information on a scatter plot, using the horizontal axis to represent the
mathematics marks and the vertical axis to represent the English marks.
Break the axes so that the vertical axis starts near 40 and the horizontal axis starts near 50.
2 The table below gives the average monthly rainfall, in mm, and the average number of rainy
days per month for twelve different cities in Australia.
Average rainfall (in mm) 161 175 142 90 96 71 62 41 33 93 96 126

Average number of rainy days 13 14 14 11 10 7 7 6 7 10 10 12
a Represent this information on a scatter plot. Use the horizontal axis to represent
average monthly rainfall and the vertical axis to represent the average number of rainy
days per month.
b Give a brief description of the relationship between rainy days and average rainfall.
3 The table below gives the amount of carbohydrates, in grams, and the amount of fat, in
grams, in 100 g of a number of breakfast cereals.
Carbohydrates (in g) 88.7 67.0 77.5 61.7 86.8 32.4 72.4 77.1 86.5
Fat (in g) 0.3 1.3 2.8 7.6 1.2 5.7 9.4 10.0 0.7
a Represent this information on a scatter plot. Use the x-axis to represent the amount of
carbohydrates and the y-axis to represent the amount of fat.
b Does there appear to be any relationship between the carbohydrate content and the fat
content?
4 (See Example 16) The table below gives the IQ of a number of adults and the time, in seconds, for them to
complete a simple puzzle.
IQ 115 118 110 103 120 104 124 116 110

Time (in seconds) 14 15 21 27 11 25 9 16 18
a Represent this information on a scatter plot. Use the x-axis to represent IQ and the
y-axis to represent the time taken to complete the puzzle.
b Is there any trend in the data?
5 The table below gives the number of kicks and the number of handballs obtained by each
player in an AFL team in a particular match.
Player 1 2 3 4 5 6 7 8 9 10 11
Number of kicks 3 20 7 19 7 6 2 9 7 26 3
Number of handballs 8 11 11 6 4 6 3 1 3 3 8
Player 12 13 14 15 16 17 18 19 20 21 22
Number of kicks 12 17 6 11 14 5 1 21 6 13 4
Number of handballs 4 5 0 3 8 3 0 11 0 17 11
a Represent this information on a scatter plot. Use the x-axis to represent the number of
kicks and the y-axis to represent the number of handballs.
b Does your scatter plot support the claim, ‘the more kicks a player obtains, the more
handballs he gives’? Explain your answer.
6 The table below gives the number of ‘goals for’ (scored by the team) and the number of
‘goals against’ (scored by the opposing team) for each team in a soccer competition.
Team A B C D E F G H I J K L
Goals for 36 45 22 26 20 59 24 41 23 43 32 41
Goals against 31 16 33 26 64 16 53 42 47 21 49 14
a Represent this information on a scatter plot. Use the x-axis to represent ‘goals for’ and
the y-axis to represent ‘goals against’.
b Use your scatter plot to answer the following questions.
i Which team is the best team in the competition? Why?
ii Which team is the worst team in the competition? Why?
iii Which of team J and team H is better? Why?
7 The scatter plot below gives information about the height and weight of a number of
people. Annabelle’s height and weight are represented by the point A.
[Scatter plot: Height (cm) on the horizontal axis, Weight (kg) on the vertical axis, showing point A and points i to viii]
Write down the point that represents each of the following people.
a Barry, who is heavier and taller than Annabelle
b Chandra, who is shorter but heavier than Annabelle
c Dario, who is the same height as Barry but a little heavier
d Edwina, who is shorter and lighter than Chandra
e Frederick, who is the same weight as Barry but a bit taller
f George, who is the same height as Annabelle but heavier
g Harriet, who is the same weight as Annabelle but shorter
h Ivan, who is the tallest person in the group
8 The scatter plot below gives the marks obtained by students in two tests.
John’s marks on the tests are represented by the point J.
[Scatter plot: Test 1 on the horizontal axis, Test 2 on the vertical axis, showing point J and points i to viii]
Which point represents each of the following students?
a Alex, who got the top mark in both tests
b Bao, who got the top mark in Test 1 but not in Test 2
c Charlene, who did better in Test 1 than John, but not as well on Test 2
d Drago, who did not do as well as Charlene on either test
e Eddie, who got the same mark as John for Test 2, but did not do as well as John on Test 1
f Francis, who got the same mark as John for Test 1, but did better than John on Test 2
g Georgina, who got the lowest mark for Test 1
h Harvir, who had the greatest discrepancy between his two marks
9 The test results of a group of 9 students are recorded in the table and plotted on a scatter plot.
A line has been drawn through the ‘middle of the points’.
Test 1   Test 2
53       54
70       67
53       55
81       81
85       82
51       51
52       53
76       78
75       77
[Scatter plot: Test 1 (40 to 100) on the horizontal axis, Test 2 (40 to 100) on the vertical axis, with a line drawn through the middle of the points]
The equation for this line is Test 2 = 0.95 × Test 1 + 3.85.

a Use this equation to predict the Test 2 mark of a student if their mark on Test 1 was:
i 53 ii 54 iii 34 iv 84 v 67
b Use this equation to predict the Test 1 mark of a student if their mark on Test 2 was:
i 53 ii 54 iii 34 iv 84 v 67
18H Line of best fit
Consider the four scatter plots below. A trend line or ‘line of best fit’ has been fitted to each ‘by
eye’. It is constructed by first noting the general trend, increasing or decreasing. A line (or curve) is
then drawn through the middle of the scatter plot following that upwards or downwards trend, with
a roughly equal number of points above and below the line. The distances of the points from the line
must also be taken into account.
[Four scatter plots, each with a line or curve of best fit:
I Body mass (grams) against heart mass (grams)
II Second-hand price ($) against age of car (years)
III Performance level against time spent preparing (hours)
IV Time to complete (seconds) against age (years)]
Observations
• Graphs I and III show an increasing trend whilst graphs II and IV show a decreasing trend.
• Graph II shows a strong linear relationship between the variables and all points are in close
proximity to the line of best fit. However, graphs I and IV show moderately strong linear
relationships between the variables.
• Graph III shows a non-linear relationship between variables and a ‘curve’ of best fit is suggested.
The other graphs display a linear relationship.
In this section, only linear relationships will be studied. To determine the equation of the line of best
fit we draw on skills that were introduced in Chapter 4.
Example 17
Consider the scatter plot below showing the relationship between the number of ice-creams sold by
a vendor each day during the month of February and the maximum temperature for the day.
[Scatter plot: Maximum temperature (°C) from 15 to 40 on the horizontal axis, Number of ice-creams sold from 0 to 90 on the vertical axis]
a Draw a line of best fit by eye.

b Determine the equation of the line.
c Use the equation to predict the number of ice-creams the vendor will sell on a 35°C day.
d Use the equation to predict the maximum temperature of the day if the vendor sells 58
ice-creams.
Solution
a [The scatter plot with a line of best fit drawn by eye through the points]
Note: Small variations in the placement of the line of best fit are expected using this
technique.
b Use the point–gradient form, y − y₁ = m(x − x₁), to find the equation of the line.
Note: The grid lines can assist you to find two points on the line. For improved accuracy,
ensure they are not too close together.
Choose (34, 70) and (22, 50). (Other selections are possible.)
m = (70 − 50)/(34 − 22) = 20/12 = 5/3
y − 50 = (5/3)(x − 22)
y = (5/3)x + 50 − 110/3
y = (5/3)x + 40/3
Interpreting this equation in the given context, we get:
Number of ice-creams sold = (5/3) × (maximum temperature °C) + 40/3
c Number of ice-creams sold = (5/3) × 35 + 40/3
= 71 2/3
≈ 72 (Round up to the nearest integer.)
d 58 = (5/3) × (maximum temperature °C) + 40/3
174 = 5 × (maximum temperature °C) + 40 (multiplying all terms by 3)
∴ maximum temperature = (174 − 40)/5 = 26.8°C
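The working in parts b to d can be checked with exact fractions. The Python sketch below uses the two points chosen in the solution; the function name `line_through` is introduced here for illustration, and a different choice of points on the line would give a slightly different equation.

```python
from fractions import Fraction


def line_through(p, q):
    """Gradient m and y-intercept c of the straight line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    m = Fraction(y2 - y1, x2 - x1)   # gradient = rise / run
    c = y1 - m * x1                  # from y - y1 = m(x - x1) with x = 0
    return m, c


m, c = line_through((22, 50), (34, 70))

# Part c: predicted number of ice-creams sold on a 35 °C day.
sold_at_35 = m * 35 + c

# Part d: maximum temperature predicted when 58 ice-creams are sold.
temp_for_58 = (58 - c) / m

print("gradient:", m, " intercept:", c)
print("ice-creams at 35 °C:", float(sold_at_35))
print("temperature for 58 ice-creams:", float(temp_for_58))
```

Using `Fraction` keeps the gradient 5/3 exact, so no rounding is introduced before the final step.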
Interpolation versus extrapolation

When we use the line of best fit to make predictions of values within the range of data already obtained
it is called interpolation. In the example above, the predicted number of ice-creams sold, based on a
maximum temperature of 35°C was interpolation. This is because 35°C lies between the minimum
(17°C) and maximum (38°C) recorded temperatures. The same can be said for predicting the maximum
temperature based on a sale of 58 ice-creams.
Extrapolation is the term used for making predictions outside the range of values already
obtained. Extrapolation should be performed with a degree of caution, since there is no guarantee
the noted relationship between variables will continue beyond the observed range.
For example, using the equation, number of ice-creams sold = (5/3) × (maximum temperature °C) + 40/3,
to predict the maximum temperature when 20 ice-creams are sold is an act of extrapolation. The
predicted maximum temperature of 4°C may not be feasible.
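The distinction can be sketched as a simple range check on the temperature. In the Python sketch below, the recorded range of 17°C to 38°C and the equation of the line are taken from the example above, while the function names are hypothetical and the check is applied to temperatures only, which is a simplification.

```python
from fractions import Fraction

# Line of best fit from Example 17: sales = (5/3) * temperature + 40/3.
GRADIENT = Fraction(5, 3)
INTERCEPT = Fraction(40, 3)

# Minimum and maximum recorded temperatures quoted in the text (°C).
OBSERVED_MIN, OBSERVED_MAX = 17, 38


def temperature_for(sales):
    """Solve sales = GRADIENT * temperature + INTERCEPT for the temperature."""
    return (sales - INTERCEPT) / GRADIENT


def prediction_type(temperature):
    """Classify a temperature as inside or outside the range of recorded values."""
    if OBSERVED_MIN <= temperature <= OBSERVED_MAX:
        return "interpolation"
    return "extrapolation"


for sales in (58, 20):
    temperature = temperature_for(sales)
    print(f"{sales} ice-creams -> {float(temperature)} °C ({prediction_type(temperature)})")
```

A temperature of 35°C lies inside the recorded range, whereas the 4°C obtained for 20 ice-creams lies well outside it.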
Lines of best fit by other techniques

You may have noticed that creating a line of best fit by eye is prone to variation and discrepancy. This
is not desirable if we need to be consistent and accurate with fitting a line to data. Fortunately, there are
several alternative approaches to drawing a line of best fit. The approach commonly used is called the
least squares method.
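The text does not describe the least squares method, so the following Python sketch is offered only as an illustration of the standard formulas (gradient equal to the sum of products of deviations divided by the sum of squared deviations in x). It uses the nine pairs of test marks from Exercise 18G, question 9; the function name `least_squares_line` is an assumption of this sketch.

```python
def least_squares_line(xs, ys):
    """Gradient and intercept of the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x  # the line passes through (mean_x, mean_y)
    return gradient, intercept


test1 = [53, 70, 53, 81, 85, 51, 52, 76, 75]
test2 = [54, 67, 55, 81, 82, 51, 53, 78, 77]

gradient, intercept = least_squares_line(test1, test2)
print(f"Test 2 = {gradient:.2f} × Test 1 + {intercept:.2f}")
```

Unlike a line drawn by eye, this calculation gives the same line every time for the same data, and for these marks it comes out very close to the line quoted in that exercise.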
Line of best fit

• Drawing a line of best fit by eye consists of tracing the trend of the scatter plot with a
straight line, ensuring that there are roughly equal numbers of points above and below
the line, with distance of points from the line taken into account.
• Once two points have been identified on the straight line, the equation of the line can be
determined using the point–gradient form.
• Interpolation is making predictions using data that lies within the range of observed
values.
• Extrapolation is making predictions using data that lies outside the range of observed
values. Caution must be used when predicting values based on extrapolation.
Exercise 18H
1 Copy these scatter plots and draw a line of best fit by eye through each.
[Scatter plots i, ii, iii and iv]
2 In the scatter plots in question 1, comment on the following.

a Do the scatter plots display an increasing or decreasing trend?
b What is the strength of the relationships between y and x?
3 Data was collected on 100 adults comparing shoe size and height. Shoe sizes ranged from
6 to 13. An equation relating height (in cm) to shoe size was determined to be:
height = 127.18 + 4.84 × shoe size
Use this equation to predict (to the nearest cm) the height of a person whose shoe size is as
follows. Are you interpolating or extrapolating?
a size 7 b size 12 c size 14
4 A line of best fit for a scatter plot, relating the weight of a pumpkin (kg) to the number of
seeds it contains, was found to pass through the points (1, 300) and (7, 540). Assume weight
is on the x-axis.
a Find the equation of the line of best fit.
b Use your equation to estimate the number of seeds a pumpkin contains that
weighs 5.2 kg.
c Use your equation to estimate the weight of a pumpkin containing 600 seeds.
5 (See Example 17) A class of Year 10 PE students were asked to run a lap of the school’s oval. Their times were
recorded and compared against their fitness levels, which had been previously analysed and
placed on a scale of 1 to 10. The teacher then drew a line of best fit over the scatter plot as
shown.
[Scatter plot with a line of best fit: Fitness level (0 to 10) on the horizontal axis, Time (seconds) from 50 to 80 on the vertical axis]
a Determine the equation of the line of best fit.

b Use the equation to predict the time it would take a Year 10 PE student to run a lap of
the oval if that student has a fitness level of 3. Leave your answer correct to one decimal
place.
c Use the equation to predict the fitness level of a Year 10 PE student if a lap of the oval is
run in 62 seconds.
d Are these predictions examples of interpolation or extrapolation? Explain your answer.
6 State the problems with making predictions using the lines of best fit in the following
scatter plots.
[Scatter plots a and b]
7 Consider the time series below, showing a company’s profit for consecutive financial years
over a 10-year period. ‘Year 1’ marks the financial year 1988–1989, ‘Year 2’ marks the
financial year 1989–1990, and so on. ‘Year 10’ marks the financial year 1997–1998.
[Time-series plot: Year number (0 to 12) on the horizontal axis, Profit ($ million) from 0 to 4.5 on the vertical axis]
Create a line of best fit on the time series and use it to predict the company’s profits, to the
nearest $100 000, in the financial year 1998–1999. (Predicting future values in a time series
based on previously observed values is called forecasting.) Is your answer an example of
interpolation or extrapolation?
Review exercise
1 The stem-and-leaf plot below gives the times in which a class of 26 Year 10 students
ran 100 m.
11 | 5
12 | 3 4 6 9
13 | 0 0 2 6 8
14 | 0 1 2 4 7 9 9
15 | 1 2 4 5 5 5
16 | 3 4
17 |
18 | 2
15 | 1 means 15.1 seconds
a What is the range of times to run 100 m in the class?
b What is the median time to run 100 m in the class?
c What is the interquartile range?
d Would the median time change if the fastest and slowest times were removed?
2 The ‘life’ of alkaline batteries is compared through continuous use in a standard product.
40 Grade A and 40 Grade B batteries are tested in this way. Their results are shown in
the two boxplots below.
[Parallel boxplots for Grade B (upper) and Grade A (lower) batteries; axis: Battery life (hours), from 15 to 35]
a State the median battery life for the Grade A and Grade B batteries.
b State the range in battery life for the Grade A and Grade B batteries.
c State the interquartile range for the Grade A and Grade B batteries.
d Determine the number of Grade A and Grade B batteries lasting longer than 29 hours.
e Describe the shape of data distributions for the Grade A and Grade B battery life.
f Under what criterion is the Grade B battery ‘better’ than the Grade A battery in this test?
3 The following data are the speeds of 45 semi-trailers passing a given point on an
interstate highway. The speeds are measured in km/h.
88 90 93 94 95 96 98 100 100 100 100 100
101 102 102 102 103 103 103 104 105 106 106 107
109 109 110 110 110 112 113 114 116 117 118 120
120 121 128 130 130 139 141 144 150
a Construct a dotplot of the data.
b Construct a boxplot of the data.
c Comment on the shape.
4 The number of times 35 randomly chosen Year 10 students go online in the course of a
school day was recorded. The results are shown in the frequency table below.
Times online 0 1 2 3 4 5 6 7
Number of students 8 3 5 6 7 5 0 1
a Calculate the mean number of times students in this random sample go online.
b Find the standard deviation of the number of times students go online, correct to
two decimal places.
c Find the range of times online that lie within one standard deviation of the mean.
d If every student in this sample went online one more time than what was recorded,
determine the effect on the mean and standard deviation.
5 Kathryn scored 78% on both her history and mathematics tests. Both tests had a class
mean of 70%, but history had a standard deviation of 8% and mathematics had a
standard deviation of 12%. In which test did Kathryn perform better relative to the rest of
the class?
6 The table below gives the quarterly sales figures for a Melbourne swimwear shop in the
period 2014–2016.
Sales $’000 January–March April–June July–September October–December

2014 33 16 5 21
2015 35 19 8 26
2016 44 22 10 30
a Represent this information on a time-series plot. (Use numbers 1 to 12 to mark the
quarters.)
b In which quarter of each year are the sales figures the best?
c Describe briefly how the quarterly sales figures change over time. Are the sales
figures improving?
7 In an all-female class of Year 10 students, the length of each student’s tibia (shin bone) and
her height (both in centimetres) were recorded and graphed below. A line of best fit was drawn.
[Scatter plot with a line of best fit: Tibia length (cm) from 30 to 46 on the horizontal axis, Height (cm) from 140 to 190 on the vertical axis]
a Determine the equation of the line of best fit.

b Use the equation to predict the height of a Year 10 female with tibia length of 44 cm.
c Use the equation to predict the tibia length of a 145 cm tall Year 10 female.
d Are these predictions examples of interpolation or extrapolation? Explain your
answer.