
2.3 Measures of location
The graphic procedures described earlier help us to visualize the pattern of the data. To
obtain a more objective description and comparison of datasets, we must go a step further
and obtain numerical values for the location or centre of the data and the amount of variability
present.

Since data is normally obtained by sampling from a large population, our discussion of
numerical measures in this section will be constrained to data arising in that context. We
begin by presenting a convenient way of representing sampled data before looking at
different measures of location or centre in the data.

Notations

To present the ideas and associated calculations for measures of location effectively, it is
convenient to represent a sample dataset by symbols. A sample consisting of 𝑛
observations can be represented as 𝑥1 , 𝑥2 , … , 𝑥𝑛 where 𝑛 denotes the number of observations
in the data, and 𝑥1 , 𝑥2 , … represent the first observation, second observation and so on. For
instance, the five measurements 5.2, 0.5, 2.3, 5.5 and 3.5 can be represented in symbols by
𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 , 𝑥5 where 𝑥1 = 5.2, 𝑥2 = 0.5, 𝑥3 = 2.3, 𝑥4 = 5.5 and 𝑥5 = 3.5. With this
notation, we are able to define measures of location as follows:

2.3.1 The Arithmetic Mean


This is the most popular measure of central location in the data. Laypersons know it as the
“average”; statisticians call it the arithmetic mean, or simply the mean. It is
defined as the sum of the measurements 𝑥1 , 𝑥2 , … , 𝑥𝑛 divided by 𝑛, and is denoted by 𝑥̅ . That is,

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where the index 𝑖 runs over 1, 2, … , 𝑛.

The computation of the sample mean and its interpretation are illustrated in the example below.

Example 2.9

The birthweights (in kg) of five babies born in a hospital on a certain day are 3.2, 2.4, 4.5, 3.1
and 2.8. Obtain the sample mean of the data and interpret it.

Solution:
The mean birthweight for the data is

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{3.2 + 2.4 + 4.5 + 3.1 + 2.8}{5} = \frac{16}{5} = 3.2 \text{ kg}$$

This means that the birthweights in this sample are centred around 3.2 kg.
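
The calculation in Example 2.9 can be checked with a short Python sketch. The function name `arithmetic_mean` is only an illustrative choice and is not part of the original notes.

```python
# Birthweights (kg) of the five babies in Example 2.9.
birthweights = [3.2, 2.4, 4.5, 3.1, 2.8]


def arithmetic_mean(values):
    """Return the sum of the observations divided by their number."""
    return sum(values) / len(values)


mean_weight = arithmetic_mean(birthweights)
print(f"Mean birthweight: {mean_weight:.1f} kg")
```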

Properties of the arithmetic mean:

1. It always exists (i.e. it can be calculated for any kind of numerical data)
2. It is always unique (i.e. any set of data has one and only one mean)
3. It takes into account each observation (or item) in the data
4. It is generally affected by extreme (very small or very large) values in the data

The term ‘arithmetic mean’ is mainly used to distinguish this mean from two others used in
special cases: the harmonic and geometric means.

2.3.2 The Harmonic Mean


The harmonic mean is a very specific type of average. It is generally used when averaging
rates and ratios, such as speeds. We calculate it as follows:

$$\bar{x}_H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$

Example 2.10

What is the harmonic mean of 12, 4, 5, 6, 3?

Solution

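$$\bar{x}_H = \frac{5}{\frac{1}{12} + \frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{3}} = \frac{5}{1.0333} \approx 4.84$$

Note: For positive values, the harmonic mean is never greater than the arithmetic mean (here 4.84 against 6). The two are equal only when all the values are equal.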
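
The calculation in Example 2.10 can be checked with the following Python sketch. The helper `harmonic_mean_manual` is an illustrative name; `statistics.harmonic_mean` and `statistics.mean` are from the Python standard library and are used only as a cross-check.

```python
from statistics import harmonic_mean, mean

values = [12, 4, 5, 6, 3]


def harmonic_mean_manual(xs):
    """n divided by the sum of the reciprocals of the observations."""
    return len(xs) / sum(1 / x for x in xs)


h = harmonic_mean_manual(values)
print(f"Harmonic mean:   {h:.2f}")
print(f"Library value:   {harmonic_mean(values):.2f}")
print(f"Arithmetic mean: {mean(values):.2f}")
```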

2.3.3 The Geometric Mean


The geometric mean is usually used for growth rates, such as population growth or interest rates.
While the arithmetic mean adds items, the geometric mean multiplies them. The geometric mean
is defined only for positive numbers.

Formally, the geometric mean is defined as “…the nth root of the product of n numbers.” In
other words, for a set of numbers 𝑥1 , 𝑥2 , … , 𝑥𝑛 the geometric mean is:

$$\bar{x}_G = \left( \prod_{i=1}^{n} x_i \right)^{1/n} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$$
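
As a rough illustration of the definition, the Python sketch below multiplies the values and takes the n-th root. The growth factors are hypothetical values chosen for illustration, and `geometric_mean_manual` is an assumed helper name; `statistics.geometric_mean` (Python 3.8 or later) serves as a cross-check.

```python
import math
from statistics import geometric_mean

# Hypothetical yearly growth factors (5%, 10% and 2% growth).
growth_factors = [1.05, 1.10, 1.02]


def geometric_mean_manual(xs):
    """n-th root of the product of n positive numbers."""
    if any(x <= 0 for x in xs):
        raise ValueError("the geometric mean needs positive numbers")
    return math.prod(xs) ** (1 / len(xs))


g = geometric_mean_manual(growth_factors)
print(f"Geometric mean growth factor: {g:.4f}")
```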

Example 2.11

The average person’s monthly salary in a certain town jumped from BWP2,500 to BWP5,000
over the course of ten years. Using the geometric mean, what is the average yearly increase?

Solution:

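Step 1: Express the total change as a growth factor.

$$\frac{5000}{2500} = 2$$

This factor is the product of the ten yearly growth factors.

Step 2: Take the 10th root of that product. This is the geometric mean of the ten yearly growth factors.

$$\bar{x}_G = \sqrt[10]{2} \approx 1.0718$$

Therefore, according to the geometric mean, the salary increased by about 7.18% per year on
average. Note that √(2500 × 5000) ≈ 3535.53 is the geometric mean of the two salaries
themselves. It describes a typical salary level over the period, not a yearly increase.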

2.3.4 The Weighted Mean


The arithmetic mean may not be an appropriate measure when the quantities averaged are not
equally important or significant. The relative weights need to be accounted for. We do so by
calculating the weighted mean given as follows:

$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum w \cdot x}{\sum w}$$

Here, ∑ 𝑤 ∙ 𝑥 is the sum of the products obtained by multiplying each 𝑥 by the corresponding
weight, and ∑ 𝑤 is the sum of the weights.

Example 2.12

The batting averages of five leading batters in American baseball, together with their times
at bat, are given below:

Batter            Batting Average    Times at Bat
N. Garciaparra    0.321              156
M. Ramirez        0.308              568
J. Damon          0.304              621
D. Ortiz          0.301              582
K. Millar         0.297              508

Obtain the weighted mean in the data using the times at bat as weights.

Solution:

$$\bar{x}_w = \frac{156(0.321) + 568(0.308) + \cdots + 508(0.297)}{156 + 568 + \cdots + 508} = \frac{739.862}{2435} \approx 0.304$$

Note: When the weights are all equal, the weighted mean is equivalent to the arithmetic
mean.
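
The weighted mean in Example 2.12 can be reproduced with a small Python sketch. The function name `weighted_mean` and the tuple layout of `batting` are illustrative assumptions, not part of the original notes.

```python
# (batter, batting average, times at bat) from Example 2.12.
batting = [
    ("N. Garciaparra", 0.321, 156),
    ("M. Ramirez", 0.308, 568),
    ("J. Damon", 0.304, 621),
    ("D. Ortiz", 0.301, 582),
    ("K. Millar", 0.297, 508),
]


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


averages = [avg for _, avg, _ in batting]
at_bats = [n for _, _, n in batting]

xw = weighted_mean(averages, at_bats)
print(f"Weighted mean batting average: {xw:.3f}")
```

Passing equal weights, for example `[1] * 5`, gives the ordinary arithmetic mean of the five averages.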

2.3.5 Grand mean of combined data


This arises when we must find the overall mean, or grand mean, of 𝑘 sets of data having the
means 𝑥̅1 , 𝑥̅2 , … , 𝑥̅𝑘 and consisting of 𝑛1 , 𝑛2 , … , 𝑛𝑘 measurements or observations respectively.

$$\bar{\bar{x}} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum n \cdot \bar{x}}{\sum n}$$

Here the weights are the sizes of the samples, the numerator is the total of all the
measurements or observations, and the denominator is the number of observations in the
combined samples.

Example 2.13

In a physics class, there are 14 freshmen, 25 sophomores, and 16 juniors. Given that the
freshmen averaged 76 in the final examination, the sophomores averaged 83, and the juniors
averaged 89, what is the mean grade for the entire class?

Solution

From our example we have 𝑛1 = 14, 𝑛2 = 25, 𝑛3 = 16 and 𝑥̅1 = 76, 𝑥̅2 = 83, 𝑥̅3 = 89.

Therefore,

$$\bar{\bar{x}} = \frac{14(76) + 25(83) + 16(89)}{14 + 25 + 16} = \frac{4563}{55} \approx 82.96$$
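
The grand mean in Example 2.13 is a weighted mean with the group sizes as weights, which the following Python sketch makes explicit. The function name `grand_mean` is an illustrative assumption.

```python
def grand_mean(sizes, means):
    """Combine group means, weighting each by its group size."""
    total = sum(n * m for n, m in zip(sizes, means))  # total of all observations
    return total / sum(sizes)                         # divided by combined sample size


# Freshmen, sophomores and juniors from Example 2.13.
class_mean = grand_mean([14, 25, 16], [76, 83, 89])
print(f"Mean grade for the entire class: {class_mean:.2f}")
```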

2.3.6 The Median


Another measure of location (or centre) is the middle value known as the ‘median’. The
median of a set of 𝑛 measurements 𝑥1 , 𝑥2 , … , 𝑥𝑛 is the middle value when measurements are
arranged from smallest to largest.

Roughly speaking, the median is the value that divides the ordered data into two equal halves (50%
below the median and 50% above it). If 𝑛 is an odd number, there is a unique middle value and
it is the median. If 𝑛 is an even number, there are two middle values and the median is
defined as their average. Put differently, we use the median position to establish the median
value. After ranking the data from the smallest to the largest value, the median occurs at position

$$\tilde{x}_{pos} = \frac{n+1}{2}$$

When this position is not a whole number, the median is the average of the two ordered values
on either side of it.

Example 2.14

The numbers of days that the first six heart transplant patients at GPH survived after their
operations were 15, 3, 46, 623, 126 and 64. Calculate the median survival time.

Solution:

Step 1: Reorder the data from the smallest to largest values as follows:

3, 15, 46, 64, 126, 623

Step 2: Establish the median position

$$\tilde{x}_{pos} = \frac{n+1}{2} = \frac{6+1}{2} = 3.5$$

Step 3: From above, the median lies between the 3rd and 4th ordered values, which are 46 and 64.

Thus, the median is

$$\tilde{x} = \frac{46 + 64}{2} = 55 \text{ days}$$

Properties of the median

- It is not affected by extremely large or small values in the data. Thus, the median is
likely to be more sensible than the mean when the distribution is asymmetric. For example,
the mean of the survival times above (about 146 days) is inflated by the one large survival
time of 623 days, while the median is only 55 days.
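
The median rule and its resistance to extreme values can be illustrated with the survival data from Example 2.14. This is a minimal Python sketch; `sample_median` is an illustrative name, and `statistics.median` and `statistics.mean` from the standard library are used for comparison.

```python
from statistics import mean, median


def sample_median(values):
    """Middle ordered value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # unique middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the two middle values


survival_days = [15, 3, 46, 623, 126, 64]

print(f"Median: {sample_median(survival_days):.1f} days")
print(f"Mean:   {mean(survival_days):.1f} days")
```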

2.3.7 Percentiles
If the number of observations is large (say, more than 30), the notion of the median can be
extended by dividing the ordered data into percentiles.

The sample 100 𝑝-th percentile is a value such that, after the data are ordered from smallest
to largest, at least 100 𝑝% of the observations are at or below this value and at least
100 (1 − 𝑝)% are at or above this value.

If 𝑝 = 0.5, then 100 (0.5) gives the 50th percentile. At least half of the observations
are equal to or smaller than the 50th percentile, and at least half are equal to or larger than it.

If 𝑝 = 0.25, then 100 (0.25) gives the 25th percentile. At least one-fourth of the
observations are equal to or smaller than the 25th percentile, and at least three-fourths are
equal to or larger than it.

To calculate the percentiles, the following operating rules are necessary:

i. Order the data from smallest to largest.
ii. Determine the product (sample size) × (proportion) = 𝑛𝑝.

- If 𝑛𝑝 is not an integer, round it up to the next integer and take the ordered value at that
position.
- If 𝑛𝑝 is an integer, say 𝑘, calculate the average of the 𝑘-th and (𝑘 + 1)-th ordered values.

The 25th, 50th and 75th percentiles are also known as quartiles:

Lower (first) quartile: Q1 = 25th percentile

Second quartile (or median): Q2 = 50th percentile

Upper (third) quartile: Q3 = 75th percentile

Example 2.15

The ordered lengths of phone calls (in minutes) are given below:
1.6 1.7 1.8 1.8 1.9 2.1 2.5 3.0 3.0 4.4

4.5 4.5 5.9 7.1 7.4 7.5 7.7 8.6 9.3 9.5

12.7 15.3 15.5 15.9 15.9 16.1 16.5 17.3 17.5 19.0

19.4 22.5 23.5 24.0 31.7 32.8 43.5 53.3

Obtain quartiles to summarise the lengths of phone calls in this data.

Solution

i. First quartile (Q1): with 𝑝 = 0.25, 𝑛𝑝 = 38 × 0.25 = 9.5. Since 9.5 is not an
integer, we round up to 10. The 10th ordered observation is 4.4, so
Q1 = 4.4 minutes.
ii. Second quartile (Q2): with 𝑝 = 0.50, 𝑛𝑝 = 38 × 0.50 = 19. Since 19 is an
integer, we average the 19th and 20th ordered observations to obtain the second quartile
(or median). Thus, Q2 = (9.3 + 9.5)/2 = 9.4 minutes.
iii. Third quartile (Q3): with 𝑝 = 0.75, 𝑛𝑝 = 38 × 0.75 = 28.5. Since 28.5 is
not an integer, we round up to 29. The 29th ordered observation is
17.5, so Q3 = 17.5 minutes.

Using the same dataset, we can also obtain the 90th percentile as follows:

Here, 𝑝 = 0.90 ⇒ 𝑛𝑝 = 38 × 0.90 = 34.2. Rounding up gives 35, and the 35th ordered
observation is 31.7. Thus, the 90th percentile = 31.7 minutes.
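
The operating rules for percentiles can be sketched in Python as below, using the phone-call data of Example 2.15. The function name `sample_percentile` is an illustrative assumption. It implements only the 𝑛𝑝 rule described in this section, which differs from the interpolation methods that statistical libraries use by default, so library results may not match exactly.

```python
import math

# Ordered phone-call lengths (minutes) from Example 2.15.
calls = [
    1.6, 1.7, 1.8, 1.8, 1.9, 2.1, 2.5, 3.0, 3.0, 4.4,
    4.5, 4.5, 5.9, 7.1, 7.4, 7.5, 7.7, 8.6, 9.3, 9.5,
    12.7, 15.3, 15.5, 15.9, 15.9, 16.1, 16.5, 17.3, 17.5, 19.0,
    19.4, 22.5, 23.5, 24.0, 31.7, 32.8, 43.5, 53.3,
]


def sample_percentile(values, p):
    """100p-th sample percentile using the np rule from this section."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    ordered = sorted(values)
    position = len(ordered) * p
    nearest = round(position)
    # Compare with a tolerance so that floating-point noise in n * p
    # does not hide a product that is really an integer.
    if math.isclose(position, nearest):
        # np is an integer k: average the k-th and (k+1)-th ordered values.
        return (ordered[nearest - 1] + ordered[nearest]) / 2
    # np is not an integer: round up and take that ordered value.
    return ordered[math.ceil(position) - 1]


for p in (0.25, 0.50, 0.75, 0.90):
    print(f"{100 * p:.0f}th percentile: {sample_percentile(calls, p):.1f} minutes")
```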

2.3.8 The Mode


The mode is another statistical measure used to describe the middle (or centre) of the data. It is
defined as the value that appears most often in the data.

Example 2.16

Find the mode in the data given below.

22, 24, 23, 24, 27, 25, 24, 20, 24, 26, 23, 21, 24, 25, 23, 28, 24, 26, 25

Solution:

The most frequent value (mode) in the data is 24.
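
Counting frequencies by hand is error-prone for longer lists, so the mode in Example 2.16 can be checked with a short Python sketch. The function name `modes` is an illustrative assumption; it returns a list because a dataset can have more than one most frequent value.

```python
from collections import Counter

data = [22, 24, 23, 24, 27, 25, 24, 20, 24, 26,
        23, 21, 24, 25, 23, 28, 24, 26, 25]


def modes(values):
    """Return every value that attains the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes(data))
```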


2.3.8 Description of Grouped Data
In this section we consider measures of location (or centre) for grouped datasets.

2.3.8.1 Mean of Grouped Data

To give the general formula for the mean of a distribution with 𝑘 classes, let us denote the
successive class marks (or midpoints) by 𝑥1 , 𝑥2 , … , 𝑥𝑘 and the corresponding frequencies by
𝑓1 , 𝑓2 , … , 𝑓𝑘 , so that the sum of all the measurements is approximated by

$$\sum f \cdot x = f_1 x_1 + f_2 x_2 + \cdots + f_k x_k$$

and the mean of the distribution is given by

$$\bar{x} = \frac{\sum f \cdot x}{n}, \quad \text{where } n = f_1 + f_2 + \cdots + f_k$$

Example 2.17

Find the mean of the distribution of students’ marks given in Example 2.3.

Solution

Class Boundaries    Class Mark (𝑥)    Class Frequency (𝑓)    𝑓𝑥
29.5-39.5           34.5              1                      34.5
39.5-49.5           44.5              1                      44.5
49.5-59.5           54.5              10                     545
59.5-69.5           64.5              8                      516
69.5-79.5           74.5              14                     1 043
79.5-89.5           84.5              14                     1 183
89.5-99.5           94.5              2                      189
Total                                 50                     3 555

From above, the mean of the students’ marks is given as follows:

$$\bar{x} = \frac{\sum f x}{n} = \frac{3555}{50} = 71.1 \approx 71$$

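
The grouped-mean formula can be checked with a small Python sketch that uses the class marks and frequencies from the table in Example 2.17. The function name `grouped_mean` is an illustrative assumption.

```python
# Class marks (midpoints) and class frequencies from Example 2.17.
class_marks = [34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 94.5]
frequencies = [1, 1, 10, 8, 14, 14, 2]


def grouped_mean(marks, freqs):
    """Sum of f * x over the classes, divided by the total frequency n."""
    sum_fx = sum(f * x for x, f in zip(marks, freqs))
    return sum_fx / sum(freqs)


print(f"Mean mark: {grouped_mean(class_marks, frequencies):.1f}")
```
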
2.3.8.2 The Median of Grouped Data
The method for finding the median of grouped data is slightly different from that for
ungrouped data. We use the following steps to obtain the median of grouped data.

Step 1: Construct the cumulative frequency distribution

Step 2: Decide which class contains the median. The median class is the first class whose
cumulative frequency is at least 𝑛⁄2.

Step 3: Find the median by using the following formula:

$$\text{Median} = L_m + \left( \frac{\frac{n}{2} - CF}{f_m} \right) \times c$$

Where;

𝑛 = the total frequency

𝐶𝐹 = the cumulative frequency before the median class

𝑓𝑚 = the frequency of the median class

𝑐 = class width

𝐿𝑚 = the lower boundary of the median class

Example 2.18

Find the median of the distribution of students’ marks given in Example 2.3.

Solution

Step 1: Construct the cumulative frequency distribution

Class Boundaries    Class Frequency    Cumulative Frequency
29.5-39.5           1                  1
39.5-49.5           1                  2
49.5-59.5           10                 12
59.5-69.5           8                  20
69.5-79.5           14                 34
79.5-89.5           14                 48
89.5-99.5           2                  50
Total               50

Step 2: Decide on the median class

$$\frac{n}{2} = \frac{50}{2} = 25$$

The first class with a cumulative frequency of at least 25 is the 5th class (69.5-79.5), so the
median is in the 5th class.

Step 3: Calculate the median (𝑥̃)

$$\tilde{x} = L_m + \left( \frac{\frac{n}{2} - CF}{f_m} \right) \times c = 69.5 + \left( \frac{25 - 20}{14} \right) \times 10 \approx 73.1$$

2.3.8.3 Quartiles of Grouped Data


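Using the same method of calculation as for the median, we can obtain 𝑄1 and 𝑄3 as
follows:

First quartile:

$$Q_1 = L_{Q_1} + \left( \frac{\frac{n}{4} - CF}{f_{Q_1}} \right) \times c$$

Third quartile:

$$Q_3 = L_{Q_3} + \left( \frac{\frac{3n}{4} - CF}{f_{Q_3}} \right) \times c$$

Here 𝐿 and 𝑓 are the lower boundary and the frequency of the class containing the quartile,
and 𝐶𝐹 is the cumulative frequency before that class.

Using the same data as in Example 2.18, the first quartile (𝑄1 ) and third quartile (𝑄3 ) can be
determined as follows:

$$\frac{n}{4} = \frac{50}{4} = 12.5$$

The first class with a cumulative frequency of at least 12.5 is the 4th class, so 𝑄1 is in the
4th class.

Therefore,

$$Q_1 = 59.5 + \left( \frac{12.5 - 12}{8} \right) \times 10 \approx 60.1$$

$$\frac{3n}{4} = \frac{3(50)}{4} = 37.5$$

The first class with a cumulative frequency of at least 37.5 is the 6th class, so 𝑄3 is in the
6th class.

Therefore,

$$Q_3 = 79.5 + \left( \frac{37.5 - 34}{14} \right) \times 10 = 82$$
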
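
The grouped median and quartile formulas share one interpolation step, with 𝑛⁄2, 𝑛⁄4 or 3𝑛⁄4 as the target cumulative frequency. The Python sketch below makes that explicit. The function name `grouped_quantile` and its argument layout are illustrative assumptions, not part of the original notes.

```python
# Class boundaries and frequencies for the students' marks.
boundaries = [(29.5, 39.5), (39.5, 49.5), (49.5, 59.5), (59.5, 69.5),
              (69.5, 79.5), (79.5, 89.5), (89.5, 99.5)]
frequencies = [1, 1, 10, 8, 14, 14, 2]


def grouped_quantile(bounds, freqs, p):
    """Interpolate the value below which a proportion p of the data lies."""
    target = p * sum(freqs)          # n/2 for the median, n/4 for Q1, 3n/4 for Q3
    cum_before = 0                   # CF: cumulative frequency before the class
    for (lower, upper), f in zip(bounds, freqs):
        # First non-empty class whose cumulative frequency reaches the target.
        if f > 0 and cum_before + f >= target:
            width = upper - lower
            return lower + (target - cum_before) / f * width
        cum_before += f
    raise ValueError("p must lie between 0 and 1")


print(f"Q1:     {grouped_quantile(boundaries, frequencies, 0.25):.1f}")
print(f"Median: {grouped_quantile(boundaries, frequencies, 0.50):.1f}")
print(f"Q3:     {grouped_quantile(boundaries, frequencies, 0.75):.1f}")
```
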
2.3.8.4 The Mode of Grouped Data
For grouped data, the modal class is the class with the highest frequency. To find
the mode in such datasets we use the following formula:

$$\text{Mode} = L_{mo} + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) \times c$$

Where

𝑐 = class width

∆1 = the difference between the frequency of the modal class and the frequency of the class
before it.

∆2 = the difference between the frequency of the modal class and the frequency of the class
after it.

𝐿𝑚𝑜 = the lower boundary of the modal class

Example 2.19

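Find the mode of the grouped frequency table of students’ marks below (from Example 2.3).

Class Boundaries    Class Frequency
29.5-39.5           1
39.5-49.5           1
49.5-59.5           10
59.5-69.5           8
69.5-79.5           14
79.5-89.5           14
89.5-99.5           2
Total               50

Solution

Two adjacent classes, 69.5-79.5 and 79.5-89.5, share the highest frequency of 14. The
calculation below treats them together as a single modal class 69.5-89.5 of width 20, with the
class 59.5-69.5 before it and the class 89.5-99.5 after it.

𝐿𝑚𝑜 = 69.5, ∆1 = (14 − 8) = 6, ∆2 = (14 − 2) = 12 and 𝑐 = 20

Then,

$$\text{Mode} = 69.5 + \left( \frac{6}{6 + 12} \right) \times 20 \approx 76.2$$
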
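
The grouped-mode formula can be sketched in Python as follows. The function name `grouped_mode` and its parameter names are illustrative assumptions. The call uses the values from Example 2.19, including its class width of 20.

```python
def grouped_mode(lower, f_mode, f_before, f_after, width):
    """Mode = L_mo + delta1 / (delta1 + delta2) * c."""
    delta1 = f_mode - f_before   # modal frequency minus frequency of class before
    delta2 = f_mode - f_after    # modal frequency minus frequency of class after
    if delta1 + delta2 == 0:
        raise ValueError("neighbouring classes have the same frequency as the modal class")
    return lower + delta1 / (delta1 + delta2) * width


mode = grouped_mode(lower=69.5, f_mode=14, f_before=8, f_after=2, width=20)
print(f"Mode: {mode:.1f}")
```
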
Alternatively, the mode can be obtained from the histogram as follows:

Step 1: Identify the modal class and the bar representing it

Step 2: Draw two crossing lines inside the modal bar: one from its top-left corner to the
top-left corner of the bar after it, and one from its top-right corner to the top-right corner
of the bar before it.

Step 3: Drop a perpendicular from the intersection of the two lines until it touches the
horizontal axis.

Step 4: Read the mode from the horizontal axis

2.4 Measures of dispersion


Measuring variability is an important aspect in statistical inference. We look at some of the
widely used measures of variability in this section.

2.4.1 The Range


This is defined as the difference between the largest and smallest values in a set of data. It
gives the length of the interval covered by the observations. Thus, it is simply expressed as
follows:

𝑅𝑎𝑛𝑔𝑒 = 𝑙𝑎𝑟𝑔𝑒𝑠𝑡 𝑣𝑎𝑙𝑢𝑒 – 𝑠𝑚𝑎𝑙𝑙𝑒𝑠𝑡 𝑣𝑎𝑙𝑢𝑒

Example 2.20

Calculate the range for the lengths of phone calls (in minutes) given in Example 2.15

Solution:

In the data, the smallest value is 1.6 minutes and the largest value is 53.3
minutes.

Thus, the length of interval covered by these values is

𝑅𝑎𝑛𝑔𝑒 = 53.3 – 1.6 = 51.7 𝑚𝑖𝑛𝑢𝑡𝑒𝑠

As a measure of spread, the range has two attractive properties:

- It is extremely easy to compute
- It is extremely easy to interpret

The disadvantages of the range as a measure of spread or dispersion include the following:

- It is too sensitive to the existence of very large or very small values in the dataset.
- It also ignores the information present in the scatter of the intermediate points.

To avoid using a measure that may be thrown far off the mark by one or two
wild or unusual observations, a compromise is made by using the interquartile range (IQR).
The interquartile range measures the interval between the first and third quartiles,
which contains the middle half of the observations.

𝐼𝑛𝑡𝑒𝑟𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝑟𝑎𝑛𝑔𝑒 (𝐼𝑄𝑅) = 𝑡ℎ𝑖𝑟𝑑 𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒 – 𝑓𝑖𝑟𝑠𝑡 𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒

Advantages of using interquartile range as a measure of dispersion:

- It is not disturbed if a small fraction of the observations are very large or very small.

Example 2.21

Using the data in Example 2.15, the interquartile range for the lengths of phone calls is given by:

𝑖𝑛𝑡𝑒𝑟𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝑟𝑎𝑛𝑔𝑒 (𝐼𝑄𝑅) = 𝑄3 − 𝑄1

= 17.5 – 4.4 = 13.1 𝑚𝑖𝑛𝑢𝑡𝑒𝑠

Note: In some instances, statisticians may use the semi-interquartile range,

$$\text{semi-interquartile range} = \frac{1}{2}(Q_3 - Q_1),$$

which is also referred to as the quartile deviation.
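
The sensitivity of the range to a single extreme value can be seen in a short Python sketch using the phone-call data of Example 2.15. The quartile values are taken from that example rather than recomputed, and the variable names are illustrative assumptions.

```python
# Ordered phone-call lengths (minutes) from Example 2.15.
calls = [
    1.6, 1.7, 1.8, 1.8, 1.9, 2.1, 2.5, 3.0, 3.0, 4.4,
    4.5, 4.5, 5.9, 7.1, 7.4, 7.5, 7.7, 8.6, 9.3, 9.5,
    12.7, 15.3, 15.5, 15.9, 15.9, 16.1, 16.5, 17.3, 17.5, 19.0,
    19.4, 22.5, 23.5, 24.0, 31.7, 32.8, 43.5, 53.3,
]

data_range = max(calls) - min(calls)

# Quartiles as found in Example 2.15.
q1, q3 = 4.4, 17.5
iqr = q3 - q1
semi_iqr = iqr / 2

# Dropping only the single longest call shrinks the range noticeably.
range_without_longest = max(calls[:-1]) - min(calls[:-1])

print(f"Range:                    {data_range:.1f} minutes")
print(f"Range without longest:    {range_without_longest:.1f} minutes")
print(f"Interquartile range:      {iqr:.1f} minutes")
print(f"Semi-interquartile range: {semi_iqr:.2f} minutes")
```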

2.4.2 The standard deviation and the variance

The variation of data points can be reflected by their deviation from the mean (𝑥̅ ) as follows:

𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 = 𝑂𝑏𝑠𝑒𝑟𝑣𝑎𝑡𝑖𝑜𝑛 – (𝑠𝑎𝑚𝑝𝑙𝑒 𝑚𝑒𝑎𝑛)

= 𝑥 − 𝑥̅

For instance, the data set 3, 1, 6, 4, 2, 8 has the mean 𝑥̅ = (3 + 1 + 6 + 4 + 2 + 8)/6 = 4. To
calculate the deviations from this mean (𝑥̅ ), we subtract 4 from each observation. See the
table below:

Observation    Deviation
𝑥              𝑥 − 𝑥̅
3              -1
1              -3
6              2
4              0
2              -2
8              4

To obtain a measure of spread from these deviations, we must first eliminate their signs
before averaging. Otherwise the sum of the deviations would be zero:

$$\sum (\text{deviations}) = \sum (x_i - \bar{x}) = 0$$

To eliminate the signs, we square the deviations. A measure of spread (or dispersion) called
the sample variance is then constructed by adding the squared deviations and dividing the
total by the number of observations minus one. In other words,

The sample variance of 𝑛 observations is expressed as follows:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

Note: The denominator is 𝑛 − 1 rather than 𝑛. This is the number of degrees of freedom
associated with 𝑠². Using the table above, the variance can be calculated as follows:

Observation    Deviation        (Deviation)²
𝑥              𝑥 − 𝑥̅            (𝑥 − 𝑥̅ )²
3              -1               1
1              -3               9
6              2                4
4              0                0
2              -2               4
8              4                16
∑ 𝑥 = 24       ∑(𝑥 − 𝑥̅ ) = 0    ∑(𝑥 − 𝑥̅ )² = 34

Therefore,

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{34}{5} = 6.8$$

Because the variance is expressed in squared units, we need to transform this
measure of variability into the same units as the data. We take its square root and obtain the
standard deviation. Thus, the standard deviation serves as a more basic measure of variability
than the variance. Given the variance, we can express the sample standard deviation as
follows:

$$s = \sqrt{\text{variance}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

Taking our example above, the sample standard deviation would be:

$$s = \sqrt{6.8} \approx 2.61$$

Alternatively, we can compute the sample variance using the formula below:

$$s^2 = \frac{1}{n-1} \left[ \sum x_i^2 - \frac{\left( \sum x_i \right)^2}{n} \right]$$
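
Both forms of the sample variance can be compared in a short Python sketch using the data set 3, 1, 6, 4, 2, 8 from the deviation table. The function names are illustrative assumptions; `statistics.variance` and `statistics.stdev` from the standard library, which also divide by 𝑛 − 1, are used as a cross-check.

```python
import math
from statistics import stdev, variance

data = [3, 1, 6, 4, 2, 8]


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Shortcut form: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


s2 = sample_variance(data)
s = math.sqrt(s2)
print(f"Variance:           {s2:.2f}")
print(f"Shortcut variance:  {sample_variance_shortcut(data):.2f}")
print(f"Standard deviation: {s:.2f}")
```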

For grouped data

The sample variance is expressed as follows:

$$s^2 = \frac{1}{n-1} \left[ \sum f x^2 - \frac{\left( \sum f x \right)^2}{n} \right]$$

For example, the sample variance of the grouped data in Example 2.3 can be computed as shown
in the table below:

Class Mark (𝑥)    𝑓     𝑓𝑥       𝑥²          𝑓𝑥²
34.5              1     34.5     1190.25     1190.25
44.5              1     44.5     1980.25     1980.25
54.5              10    545      2970.25     29702.5
64.5              8     516      4160.25     33282
74.5              14    1 043    5550.25     77703.5
84.5              14    1 183    7140.25     99963.5
94.5              2     189      8930.25     17860.5
Total             50    3 555                261682.5

Therefore, the sample variance would be:

$$s^2 = \frac{1}{n-1} \left[ \sum f x^2 - \frac{\left( \sum f x \right)^2}{n} \right]$$

$$= \frac{1}{49} \left[ 261682.5 - \frac{(3555)^2}{50} \right]$$

$$= \frac{1}{49} (261682.5 - 252760.5) = \frac{8922}{49} \approx 182.1$$

And the sample standard deviation would be:

$$s = \sqrt{182.1} \approx 13.5$$
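
The grouped-data variance can be checked with the Python sketch below, which recomputes ∑𝑓𝑥 and ∑𝑓𝑥² directly from the class marks and frequencies. The function name `grouped_variance` is an illustrative assumption.

```python
import math

# Class marks (midpoints) and class frequencies for the students' marks.
class_marks = [34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 94.5]
frequencies = [1, 1, 10, 8, 14, 14, 2]


def grouped_variance(marks, freqs):
    """Sample variance of grouped data from class marks and frequencies."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(marks, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(marks, freqs))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


s2_grouped = grouped_variance(class_marks, frequencies)
print(f"Variance:           {s2_grouped:.1f}")
print(f"Standard deviation: {math.sqrt(s2_grouped):.1f}")
```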
