
MT130 Principles of Statistics

Introduction: looking at data

1.1 What's it about

Data are produced in enormous quantities. The aim of statistics (from a
practical point of view) is to take those data, to put them into a form
which is informative (so they become information), and then use the
information to make decisions. So we start in this chapter by considering
how to present data and gain simple information; in chapters 2 to 4 we
develop the idea of probability distributions, in chapter 5 we see how
we can fit them to the data, and in chapters 7 to 9 we see what
decisions can be made based on this.

1.2 True or false?

1.2.1
The boxes for Puma sports shoes say "Average contents 2". Is this useful
information?
Most people have more than the average number of feet. True or
false?

1.2.2
What does it mean to say that the probability of rain in London
tomorrow is 79%?

1.2.3
Dan is being tried for murder on a remote island. The only evidence against him is the
forensic evidence: blood samples taken from the scene of the crime match Dan's.
The chance of such a match happening if Dan were in fact innocent, and the
match were just a coincidence, is calculated (by the prosecution's expert witness) as
1 in 10000. The defence's expert witness, however, points out that the crime could
have been committed by any one of the 40000 people on the island. If you were on
the jury, would you find Dan guilty? What do you think is the probability of his guilt?

In any case you need to consider: is it worse to convict an innocent
person or to let a guilty person go free? This will lead us to the idea of
Type I errors and Type II errors in chapter 9.
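One hedged way to put numbers on the question in 1.2.3 is Bayes' rule. The sketch below rests on two assumptions that are not in the text: before the blood evidence each of the 40000 islanders is taken to be equally likely to be the culprit, and the culprit's blood is taken to match with certainty. Other assumptions would give other answers.

```python
# Sketch for 1.2.3 under two assumptions that the text does not state:
#   (a) before the blood evidence, each islander is equally likely to be the culprit;
#   (b) the culprit's blood matches the sample with certainty.
population = 40_000
p_match_if_innocent = 1 / 10_000

prior_guilty = 1 / population
prior_innocent = 1 - prior_guilty

# Bayes' rule: P(guilty | match)
numerator = 1.0 * prior_guilty
denominator = numerator + p_match_if_innocent * prior_innocent
p_guilty_given_match = numerator / denominator

# Expected number of innocent islanders whose blood would match by coincidence
expected_innocent_matches = (population - 1) * p_match_if_innocent

print(f"P(guilty | match) = {p_guilty_given_match:.3f}")
print(f"expected coincidental matches = {expected_innocent_matches:.2f}")
```

Under these assumptions about four innocent islanders would be expected to match as well, so the match on its own leaves the probability of guilt near 1 in 5.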

1.2.4

LONDON (Reuters) - House prices rose at the quickest pace in more than four years
during the three months to October, a survey showed on Thursday, supporting
evidence of a resilient housing market in spite of rising borrowing costs.
The Royal Institution of Chartered Surveyors said its house price balance rose to +48.1
in October from +45.7 in the three months to September -- the strongest reading since
September 2002 and more than double the long-run average of +21.
The sales-to-stock ratio, which economists say is a more reliable gauge of the health
of the market, rose to +40.9 from +39.1 -- the highest ratio in over two years.
"Even after last week's interest rate rise, surveyors are still confident that the housing
market will remain buoyant," said RICS spokesman Ian Perry.
"The market is unlikely to feel cold winds from high finance costs until mid-year at the
earliest as economic conditions are favourable."
The RICS data chimes with other property surveys which have portrayed a robust
market following August's interest rate rise to 4.75 percent, but analysts have said the
market may flag next year after last week's rise in borrowing costs.
The Bank of England hiked rates to 5 percent -- their highest level in five years -- but
did not make explicit mention of the buoyant house market when it gave reasons for
the move.

This is from the BBC news page for 16 November 2006. Most readers will
not know what the figures mean, and no explanation is given.

1.3 Misusing data

Deliberately or inadvertently, it is possible to start from correct data and
produce misleading information.

1.3.1 Misleading pictures 1

We want to illustrate the gender ratio in a class: suppose 58% are male
and 42% are female. If they are represented by squares, and the
lengths of the sides of the squares are proportional to the numbers, then
we get these.
[Figure: two squares with sides in the ratio 58 : 42]
But we would probably look at the areas, in which case
the first square is almost twice the area of the second: in fact
\[ \frac{0.58^2}{0.42^2} \approx 1.907. \]
On the other hand, if we draw them so
that the areas are in the correct ratio, we get
[Figure: two squares with areas in the ratio 58 : 42]
which is better (though different people will perceive the difference in
various ways).
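The arithmetic behind the two pictures can be checked with a short sketch. The variable names are illustrative only; the proportions are those quoted in the text.

```python
import math

male, female = 0.58, 0.42

# Sides proportional to the proportions: the areas are then in the ratio of the squares.
area_ratio_if_sides_proportional = male**2 / female**2

# To make the areas proportional instead, the sides must be in the ratio of the square roots.
side_ratio_for_correct_areas = math.sqrt(male / female)

print(round(area_ratio_if_sides_proportional, 3))
print(round(side_ratio_for_correct_areas, 3))
```

So an honest picture needs sides in a ratio noticeably smaller than 58 : 42, because the eye reads area rather than length.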

1.3.2 Misleading pictures 2


Suppose we look at the monthly sales of some product. We might be
given these plots:
[Figure: monthly sales, with the vertical axis running only from 1080 to 1140]

Here it looks as if sales are booming.


[Figure: the same monthly sales, with the vertical axis running from 200 to 1200]

But using this scale the increase is seen to be much more modest.
Which of these is better depends on the purpose and on the reader: for
example, whether the aim is to convince a backer that sales are booming,
or to provide a tool for prediction.

1.4 Types of data and samples

Consider the example in 1.2.4 above, about house prices. Suppose it is
agreed that the rate of increase of house prices in some area (say
Surrey) should be found. How would it be measured? House prices
need to be measured at different times. But there are so many houses
sold in the area: a conservative estimate is about 5000 per month. It
would be too time-consuming to find all of them, so we use a sample,
a selection of those we are interested in. We can then use the sample
to estimate the quantities we want for the whole set. The whole set of
interest is called the population. An important issue is: how accurate
will the estimate based on the sample be?
Just as in MT181 you met different types of numbers, we find different
types of data. If we are measuring (say) the heights of a group of
plants, or the weights of animals, then they can be any real non-negative
number. If we are considering currency rates, then in
practice they are given to a fixed number of decimal places, but they
are so close together that we can assume that they also can be any
real non-negative number, and if we are looking at changes, then they
can be any real number.
If on the other hand we are looking at the number of goals scored in
football matches in a league, or the number of people entering a shop
in a day, then they will always be non-negative integers. There may be
a limit on the value, too: if y is the number of heads that occur if a coin
is tossed 12 times, then y can only take integer values such that
$0 \le y \le 12$.
All these are quantitative variates: the first group are continuous and
the second group are discrete. We shall look at continuous variates in
chapter 2 and discrete variates in chapter 3. But variates can also be
qualitative: we can look for example at the gender of insects (M or F),
or nationality (UK, French, Pakistani, ...). A survey of a television
programme may involve sampling the views of members of the
population in the form "Brilliant", "Good", "Satisfactory", "Rubbish". Sometimes
qualitative results can be converted to quantitative ones, and this is
dealt with in MT230.
Take another item from the BBC on 16 November 2006.

Average European 'is overweight'


The Maltese and the Greeks are the heavyweights of Europe,
figures from the European Commission reveal.
The Italians and French are the most trim, while the average Briton -
like the average European - is slightly over the ideal weight.

Obesity, which is linked to a range of health problems, including
heart disease, is a growing problem across much of the
developed world.
The Commission plans to launch a strategy to tackle obesity
next year.
Obesity is measured by calculating body mass index (BMI).
A BMI of between 18.5 and 25 is considered healthy, between 25 and
30 overweight and above 30 obese.

EUROPE'S HEAVYWEIGHTS

Malta - 26.6
Greece - 25.9
Finland - 25.8
Luxembourg - 25.7
Hungary - 25.6
Cyprus - 25.6
Lithuania - 25.5
Slovenia - 25.5
Denmark - 25.5
UK - 25.4



The latest figures show that the average citizen in 20 EU countries,
including the UK - where the average BMI is 25.4 - is overweight.
The average person in the other five, including Italy and France,
is officially healthy.
Poor self-awareness
The Maltese recorded the highest average BMI at 26.6, the
Italians the lowest, at 24.3.
Despite the figures, only 38% of EU citizens consider themselves
to be overweight, according to the poll of about 1,000 people
in each of the 25 EU member states.
Most blamed a sedentary life for restricting their scope to be
healthier.

SLIMMEST COUNTRIES

Italy - 24.3
France - 24.5
Austria - 24.8
Poland - 24.8
Netherlands - 24.9
Slovakia - 25.0
Belgium - 25.1
Latvia - 25.1
Estonia - 25.2
Czech Rep - 25.2
Body Mass Index figures

Less than one third said they engaged in regular "intensive"
physical activity.
And a large majority - 85% - said public authorities should play a
stronger role in fighting obesity.
EU Health and Consumer Protection Commissioner Markos
Kyprianou said: "This survey provides us with valuable insights into
the concerns of EU citizens on health and nutrition."

The item also explains how the BMI is calculated. How can they be
sure that the national figures are correct? Clearly they have not
measured everyone in the EU, so they must have sampled. Was the
choice a representative sample? In the last section on self-awareness
they say that they sampled 1000 people from each EU country.
Doesn't that mean that the result will be much more reliable for Malta,
with fewer than 400000 inhabitants, than for Germany, with over 82000000
inhabitants? How much better? In chapter 7 we look at the reliability
of such samples.

1.5 Measuring the location and dispersion of data 1

Take first the marks of just five students (taken from last year's MT171
test): 70, 84, 61, 48, 60. The minimum is 48, the maximum is 84, and the
obvious measure is the mark of the average student, the central value
when the marks are put in order (48, 60, 61, 70, 84),
which is 61. This is the median. Now consider these figures for the sales
of a company over 30 weeks. This uses the Data Display instruction in
MINITAB.

Data Display
Sales
2817 2913 3040 2539 2323 2763 2898 2834 2786 3095 2411 2972 3405 2852 2756
2754 2492 2804 3263 2902 2705 2724 3114 2400 3260 2658 3194 2672 2716 3069

If we arrange the data in increasing order, we find


2323 2400 2411 2492 2539 2658 2672 2705 2716 2724 2754 2756 2763 2786 2804
2817 2834 2852 2898 2902 2913 2972 3040 3069 3095 3114 3194 3260 3263 3405

We can see that the minimum is 2323 and the maximum 3405, and the
central value is a measure of the location of the values. But in this case
there isn't a central value, since we have an even number (30) of them:
if we had 29 then the fifteenth value would be in the middle. The
best we can do is take the average of the two in the middle, here the
fifteenth and sixteenth, 2804 and 2817, which is 2810.5. This is the
median.
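The rule just described (middle value for an odd count, average of the two middle values for an even count) can be sketched as follows. The function name `median` is illustrative; the data are copied from the text.

```python
marks = [70, 84, 61, 48, 60]
sales = [2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
         2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
         2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069]

def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                           # odd count: a single central value
    return (ordered[middle - 1] + ordered[middle]) / 2   # even count: average the central pair

print(median(marks))
print(median(sales))
```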
To get an idea of the spread around this, we can take the values one
quarter and three quarters of the way up the ordered list, giving the lower quartile
$Q_1$ and the upper quartile $Q_3$ (we can also call the median $Q_2$). One
quarter of the way from the first to the thirtieth here is position
$\frac{1}{4}(1 + 30) = 7.75$, so here we take a value between the seventh and
eighth in the list, $\frac{1}{4}(1 \times 2672 + 3 \times 2705) = 2696.75$. In the same way the
upper quartile is at position $\frac{3}{4}(1 + N)$, here 23.25, so we take a point between
the 23rd and 24th values, $\frac{1}{4}(3 \times 3040 + 1 \times 3069) = 3047.25$. The inter-quartile
range is $Q_3 - Q_1$, the range containing half the values, here 350.5.
We can also define percentiles: the $p$th percentile is the value $p\%$ of
the way up the list. $Q_1$ is the 25th percentile, the median is the 50th
percentile, and $Q_3$ is the 75th percentile.
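A minimal sketch of the position rule used above: the $p$th percentile is read off at position $p(N + 1)/100$ in the ordered list, interpolating linearly between neighbours. The function name `percentile` is illustrative. Other packages use other conventions, so results elsewhere may differ slightly; the standard library's `statistics.quantiles` with its default method is included as a cross-check.

```python
import statistics

sales = [2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
         2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
         2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069]

def percentile(values, p):
    """Value p% of the way up the ordered list, at 1-based position p(N + 1)/100."""
    ordered = sorted(values)
    position = p / 100 * (len(ordered) + 1)
    lower = int(position)            # whole part of the position
    fraction = position - lower      # how far towards the next value
    if lower < 1 or lower > len(ordered) or (fraction > 0 and lower == len(ordered)):
        raise ValueError("position falls outside the data")
    if fraction == 0:
        return ordered[lower - 1]
    return (1 - fraction) * ordered[lower - 1] + fraction * ordered[lower]

q1, q2, q3 = (percentile(sales, p) for p in (25, 50, 75))
print(q1, q2, q3, q3 - q1)

# Cross-check against the standard library (default "exclusive" method)
print(statistics.quantiles(sales, n=4))
```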
Another measure of location is the mode, the most frequently
occurring value. We shall not need to use it.

1.6 Visual representations of data

There are several ways of getting a quick idea of the location and
dispersion of the figures. One is the use of a boxplot, as here.
[Figure: Boxplot of Sales; vertical axis Sales, ticks from 2500 to 3500]

Here the line runs from the maximum (here 3405) down to the minimum
(here 2323). The box runs from the upper quartile (here 3047.25, one
quarter of the way from 3040 to 3069) to the lower quartile (here
2696.75, three quarters of the way from 2672 to 2705). The line across
the middle is at the median (here 2810.5, halfway from 2804 to 2817).
So half of the values lie within the box, and we can see the range
easily.
We can also use a histogram. A set of intervals is chosen, and the
number of data values in each interval plotted. It is important to
choose the intervals so that a reader gets a good picture: too many
intervals and the picture is confusing, too few and information is lost.
The advantage of a histogram over a boxplot is that you get a feel for
the distribution of the data values.
[Figure: Histogram of Sales; horizontal axis Sales, ticks from 2400 to 3400; vertical axis Frequency, from 0 to 7]

It is clear that this is better than the version below, in which too
much information has been lost by making the intervals too large.
[Figure: Histogram of Sales with wider intervals; horizontal axis ticks from 2400 to 3200; vertical axis Frequency, from 0 to 18]
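The effect of the interval width can be seen without any plotting by counting the values in each interval. In this sketch the starting points and widths (100 from 2300, and 400 from 2200) are assumptions chosen for illustration; they are not necessarily the intervals MINITAB used for the histograms.

```python
sales = [2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
         2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
         2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069]

def frequency_table(values, start, width):
    """Count the values in each interval [lower, lower + width); needs start <= min(values)."""
    counts = {}
    lower = start
    while lower <= max(values):
        counts[lower] = 0
        lower += width
    for v in values:
        counts[start + ((v - start) // width) * width] += 1
    return counts

def show(counts, width):
    """Print a crude text histogram, one star per data value."""
    for lower, count in counts.items():
        print(f"{lower}-{lower + width - 1}: {'*' * count}")

narrow = frequency_table(sales, start=2300, width=100)   # assumed intervals
wide = frequency_table(sales, start=2200, width=400)     # assumed intervals

show(narrow, 100)
print()
show(wide, 400)
```

With the wide intervals more than half of the values land in a single bar, so the shape of the distribution can no longer be seen.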

More detailed information is available from a stem-and-leaf display.
This is MINITAB's version.
Stem-and-Leaf Display: Sales
Stem-and-leaf of Sales N = 30
Leaf Unit = 10
1 23 2
4 24 019
5 25 3
7 26 57
14 27 0125568
(5) 28 01359
11 29 017
8 30 469
5 31 19
3 32 66
1 33
1 34 0

Here the centre column gives the first two significant figures of the data
values, and each figure in the right hand column gives the next figure
of each value. So in the first row there is one value close to 2320, and
in the second row there are three values, close to 2400, 2410 and 2490.
The left hand column has one figure in brackets, which marks the row
containing the median; the number in brackets is the number of values
in that row. The other figures in the left hand column are the
cumulative numbers of values, counted up from the smallest value for
rows before the median and down from the largest for rows after it. Note
that as well as giving the figures, the shape of the right hand side gives
a feel for the distribution. As with histograms, the unit needs to be
chosen to give the most useful information. If the unit here is changed
to 100 instead, we get this, which is much less helpful.
Stem-and-Leaf Display: Sales
Stem-and-leaf of Sales N = 30
Leaf Unit = 100
(22) 2 3444566777777788888999
8 3 00011224
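The stems and leaves of such a display can be built as in the sketch below. It is not MINITAB's algorithm: the function name `stem_and_leaf` is illustrative, the cumulative counts in the left hand column are omitted, and values are truncated (not rounded) to a whole number of leaf units, which is consistent with the displays shown.

```python
sales = [2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
         2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
         2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069]

def stem_and_leaf(values, leaf_unit):
    """Return (stem, leaves) rows; each leaf is one digit, in units of leaf_unit."""
    rows = {}
    for v in sorted(values):
        units = v // leaf_unit            # truncate to whole leaf units
        stem, leaf = divmod(units, 10)    # last digit is the leaf, the rest is the stem
        rows.setdefault(stem, []).append(str(leaf))
    # Include stems with no leaves so that gaps in the data remain visible
    return [(stem, "".join(rows.get(stem, [])))
            for stem in range(min(rows), max(rows) + 1)]

for unit in (10, 100):
    print(f"Leaf Unit = {unit}")
    for stem, leaves in stem_and_leaf(sales, unit):
        print(f"{stem:>3} {leaves}")
```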

1.7 Measuring the location and dispersion of data 2

The arithmetic average of the values is the obvious measure of
location. If our data values are $y_1, y_2, \ldots, y_n$, then this is defined by
\[ \bar{y} = \frac{1}{n}\sum_{j=1}^{n} y_j = \frac{1}{n}(y_1 + y_2 + \cdots + y_n). \tag{1.1} \]
This is the sample mean. In the example above,
\[ \bar{y} = \frac{1}{30}(2817 + 2913 + \cdots + 3069) = \frac{85131}{30} = 2837.7. \]
We next consider a measure of the dispersion around the sample
mean. The sample variance $s^2$ is given by
\[ s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(y_j - \bar{y})^2. \tag{1.2} \]
The sample standard deviation $s$ is (not surprisingly) the square root of
that:
\[ s = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}(y_j - \bar{y})^2}. \tag{1.3} \]

For the example above,
\[ s^2 = \frac{1}{29}\left[(2817 - 2837.7)^2 + (2913 - 2837.7)^2 + \cdots + (3069 - 2837.7)^2\right] = \frac{2052556}{29} = 70778 \]
and $s = \sqrt{70778} = 266.0$.
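Definitions (1.1) to (1.3) translate directly into code. This is a plain sketch using the sales figures from the text; the standard library functions `statistics.variance` and `statistics.stdev`, which also divide by $n - 1$, are used as a cross-check.

```python
import math
import statistics

sales = [2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
         2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
         2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069]

n = len(sales)
mean = sum(sales) / n                                   # (1.1)
sum_sq_dev = sum((y - mean) ** 2 for y in sales)        # sum of squared deviations
variance = sum_sq_dev / (n - 1)                         # (1.2)
sd = math.sqrt(variance)                                # (1.3)

print(f"mean = {mean:.1f}, s^2 = {variance:.0f}, s = {sd:.1f}")

# Cross-check with the standard library
print(statistics.variance(sales), statistics.stdev(sales))
```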

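This is a helpful result:
\[ \sum_{j=1}^{n}(y_j - \bar{y})^2 = \sum_{j=1}^{n} y_j^2 - n\bar{y}^2. \tag{1.4} \]
To prove this, multiply out:
\[ \sum_{j=1}^{n}(y_j - \bar{y})^2 = \sum_{j=1}^{n} y_j^2 - 2\bar{y}\sum_{j=1}^{n} y_j + n\bar{y}^2, \]
since all the last terms are equal to $\bar{y}^2$ and there are $n$ of them, and
$\sum_{j=1}^{n} y_j = n\bar{y}$. In the example, $\sum_{j=1}^{n} y_j^2 = 243628795$, so that
\[ \sum_{j=1}^{n}(y_j - \bar{y})^2 = 243628795 - \frac{85131^2}{30} = 2052556 \]
as before.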
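Result (1.4) can be checked numerically. The sketch below uses exact fractions so that the direct sum and the shortcut can be compared without rounding error; the variable names are illustrative.

```python
from fractions import Fraction

sales = [2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
         2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
         2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069]

n = len(sales)
total = sum(sales)
sum_of_squares = sum(y * y for y in sales)
mean = Fraction(total, n)                         # exact sample mean

direct = sum((y - mean) ** 2 for y in sales)      # left-hand side of (1.4)
shortcut = sum_of_squares - n * mean ** 2         # right-hand side of (1.4)

print(total, sum_of_squares)
print(float(direct), float(shortcut))
```

In floating-point arithmetic the shortcut subtracts two large, nearly equal numbers, so the direct form is usually the safer one to compute with.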
MINITAB will produce descriptive statistics: a set of useful information.
Here we have
Descriptive Statistics: Sales
Variable   N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3
Sales     30   0  2837.7     48.6  266.0   2323.0  2696.8  2810.5  3047.3

Variable  Maximum
Sales      3405.0

Here $N$ is the number of data values, $N^*$ is the number of gaps in the
data, the sample mean, standard deviation and median we have
already seen, the maximum and minimum are obvious, $Q_1$ and $Q_3$ are
the lower and upper quartiles, and the standard error of the mean is
the standard deviation divided by $\sqrt{N}$: here $266/\sqrt{30} = 48.56$. It will crop up
in chapter 7.
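The standard error quoted in the MINITAB output can be reproduced as follows. This is a small sketch using the standard library; the variable names are illustrative.

```python
import math
import statistics

sales = [2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
         2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
         2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069]

n = len(sales)
sd = statistics.stdev(sales)          # sample standard deviation (divisor n - 1)
se_mean = sd / math.sqrt(n)           # standard error of the mean

print(f"StDev = {sd:.1f}, SE Mean = {se_mean:.1f}")
```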

1.8 Grouped data

If the size of the sample is large, it may be better to summarize the data
by giving the frequencies in certain classes instead: that is, the number
of data values in each class or interval. Consider two examples.

1.8.1
The number of goals scored in 100 matches in a league were recorded
as follows:
Number of goals scored in match   Frequency
0                                  9
1                                 23
2                                 33
3                                 22
4                                 12
5                                  1
In this case a histogram is appropriate.
[Figure: histogram "Goals in each match"; vertical axis Frequency, ticks from 0 to 35]

We can find the median, which is 2; the mean is
\[ \frac{1}{100}(9 \times 0 + 23 \times 1 + 33 \times 2 + 22 \times 3 + 12 \times 4 + 1 \times 5) = 2.08. \]
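For a frequency table the mean weights each value by its frequency, and the median is found by counting through the cumulative frequencies. A minimal sketch, with illustrative names and the table from the text:

```python
# value -> frequency, copied from the table of goals per match
goals = {0: 9, 1: 23, 2: 33, 3: 22, 4: 12, 5: 1}

n = sum(goals.values())
mean = sum(value * freq for value, freq in goals.items()) / n

def value_at_rank(table, rank):
    """Value held by the observation with the given 1-based rank in the ordered data."""
    cumulative = 0
    for value in sorted(table):
        cumulative += table[value]
        if cumulative >= rank:
            return value
    raise ValueError("rank exceeds the sample size")

if n % 2 == 1:
    median = value_at_rank(goals, (n + 1) // 2)
else:
    # even count: average the two central observations (here the 50th and 51st)
    median = (value_at_rank(goals, n // 2) + value_at_rank(goals, n // 2 + 1)) / 2

print(n, mean, median)
```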


1.8.2
Now consider 120 customers in a supermarket: the amounts they spend
are recorded. Here the minimum spent was 1.19 and the maximum
was 78.11. The amounts were divided into classes which run from
0.01 to 10.00, 10.01 to 20.00 and so on. The figures are
Amount spent     Frequency   Midpoint of interval
 0.01 - 10.00       20           5.00
10.01 - 20.00       35          15.00
20.01 - 30.00       17          25.00
30.01 - 40.00       13          35.00
40.01 - 50.00       14          45.00
50.01 - 60.00       12          55.00
60.01 - 70.00        4          65.00
70.01 - 80.00        5          75.00
We can approximate the mean by assuming that all the amounts in
each interval were equal to the midpoint, so we get as an estimate of
the mean 28.58. Since this is a sample anyway, it is likely to be as
good as anything. In the same way we can estimate the standard
deviation as 19.78. But is this the appropriate method? It is clear from
the histogram that the distribution is not symmetric, and so perhaps we
should make use of that. We could take the median, which is 25, with
the lower quartile 15 and the upper quartile 45, instead of the mean
and standard deviation.
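The figures quoted above can be reproduced from the table by treating every amount as if it sat at the midpoint of its class. In this sketch the quartile positions are taken as $(n + 1)/4$, $(n + 1)/2$ and $3(n + 1)/4$ as in section 1.5; rounding each position up to a whole rank is a simplification that works here because the neighbouring ranks fall in the same class.

```python
import math

# (midpoint, frequency) for each class, copied from the table
classes = [(5.0, 20), (15.0, 35), (25.0, 17), (35.0, 13),
           (45.0, 14), (55.0, 12), (65.0, 4), (75.0, 5)]

n = sum(freq for _, freq in classes)
mean = sum(mid * freq for mid, freq in classes) / n
variance = sum(freq * (mid - mean) ** 2 for mid, freq in classes) / (n - 1)
sd = math.sqrt(variance)

def midpoint_at_rank(rank):
    """Midpoint of the class containing the observation with the given 1-based rank."""
    cumulative = 0
    for mid, freq in classes:
        cumulative += freq
        if cumulative >= rank:
            return mid
    raise ValueError("rank exceeds the sample size")

# Simplification: no interpolation within or between classes is attempted.
q1, median, q3 = (midpoint_at_rank(math.ceil(p * (n + 1))) for p in (0.25, 0.5, 0.75))

print(f"mean = {mean:.2f}, sd = {sd:.2f}")
print(q1, median, q3)
```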
We could continue by estimating the standard deviation. In this case,
however, it is unwise for two reasons. The quartiles (and the percentiles
generally) are more useful, and the estimation of the variance is
numerically unreliable.
[Figure: histogram "supermarket sales"; horizontal axis ticks from 10 to 70; vertical axis Frequency, from 0 to 40]

Finally two questions:
1. Why use the apparently complicated formula for the sample
standard deviation $s = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}(y_j - \bar{y})^2}$ when one could use the
interquartile range, or just the average deviation from the mean
$\frac{1}{n}\sum_{j=1}^{n}|y_j - \bar{y}|$? The answer is that we can use calculus easily on $s^2$,
but not on the other formulae, and differentiating quadratic
functions leads to linear formulae which are very easy to handle.
2. Why the $\frac{1}{n-1}$ factor rather than $\frac{1}{n}$ in (1.2)? The answer is rather
subtle, and will be explained later when we meet the idea of
unbiasedness.
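As a hedged illustration of the answer to the first question, consider choosing a single number $c$ to make the sum of squared deviations as small as possible. The function $S(c)$ is introduced only for this sketch and does not appear in the notes.

```latex
S(c) = \sum_{j=1}^{n} (y_j - c)^2, \qquad
\frac{dS}{dc} = -2\sum_{j=1}^{n} (y_j - c) = -2\left(\sum_{j=1}^{n} y_j - nc\right) = 0
\;\Longrightarrow\; c = \frac{1}{n}\sum_{j=1}^{n} y_j = \bar{y}.
```

The derivative is linear in $c$, so the minimising value drops out at once and is the sample mean. The corresponding sum of absolute deviations, $\sum_{j=1}^{n} |y_j - c|$, is not differentiable at the data values, so the same approach does not go through as directly.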
