3 Measures of location
The graphical procedures described earlier help us to visualize the pattern of the data. To
obtain a more objective description and comparison of datasets, we must go a step further
and obtain numerical values for the location or centre of the data and the amount of variability
present.
Since data is normally obtained by sampling from a large population, our discussion of
numerical measures in this section will be constrained to data arising in that context. We
begin by presenting a convenient way of representing sampled data before looking at
different measures of location or centre in the data.
Notations
To present the ideas and associated calculations for measures of location effectively, it is
convenient to represent a sample dataset by symbols. A sample consisting of 𝑛
observations can be represented as 𝑥1 , 𝑥2 , … , 𝑥𝑛 where 𝑛 denotes the number of observations
in the data, and 𝑥1 , 𝑥2 , … represent the first observation, second observation and so on. For
instance, five measurements 5.2, 0.5, 2.3, 5.5 and 3.5 can be represented in symbols by
𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 , 𝑥5 where 𝑥1 = 5.2, 𝑥2 = 0.5, 𝑥3 = 2.3, 𝑥4 = 5.5 and 𝑥5 = 3.5. With this
notation, we are able to define measures of location as follows:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \quad \text{where } i = 1, 2, \ldots, n$$
The computation of the sample mean and its interpretation are illustrated in the example below.
Example 2.9
The birthweights (in kg) of five babies born in a hospital on a certain day are 3.2, 2.4, 4.5, 3.1
and 2.8. Obtain the sample mean of the data and interpret it.
Solution:
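The mean birthweight for the data is
$$\bar{x} = \frac{3.2 + 2.4 + 4.5 + 3.1 + 2.8}{5} = \frac{16.0}{5} = 3.2 \text{ kg}$$
That is, on average a baby born in the hospital on that day weighed 3.2 kg.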
The mean has the following properties:
1. It always exists (i.e. it can be calculated for any kind of numerical data)
2. It is always unique (i.e. any set of data has one and only one mean)
3. It takes into account each observation (or item) in the data
4. It is generally affected by extreme (very small or very large) values in the data
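The sample mean of Example 2.9 and its sensitivity to extreme values can be checked with a short sketch. The birthweights are those of Example 2.9; the extra value of 9.0 kg is a hypothetical observation added only to illustrate property 4.

```python
import statistics

# Birthweights (kg) from Example 2.9
birthweights = [3.2, 2.4, 4.5, 3.1, 2.8]
sample_mean = statistics.mean(birthweights)

# Hypothetical extreme observation, added only to illustrate property 4
with_outlier = birthweights + [9.0]
mean_with_outlier = statistics.mean(with_outlier)

print(f"sample mean            = {sample_mean:.2f} kg")
print(f"mean with extreme value = {mean_with_outlier:.2f} kg")
```

A single extreme value pulls the mean upwards even though the other five observations are unchanged.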
The term ‘arithmetic mean’ is mainly used to distinguish this mean from two others used in
special cases: the harmonic mean and the geometric mean.
$$\bar{x}_H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
Example 2.10
Solution
$$\bar{x}_H = \frac{5}{\frac{1}{12} + \frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{3}} = \frac{5}{1.0333} \approx 4.84$$
Note: For positive observations, the harmonic mean is never greater than the arithmetic mean; the two are equal only when all observations are identical.
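As a rough check of the harmonic mean formula, the sketch below applies the definition to the five values that appear in the solution of Example 2.10 and compares the result with Python's standard-library function. The variable names are illustrative only.

```python
import statistics

# Values appearing in the solution of Example 2.10
values = [12, 4, 5, 6, 3]

# Definition: n divided by the sum of the reciprocals
n = len(values)
harmonic_by_formula = n / sum(1 / x for x in values)

# Same quantity from the standard library, plus the arithmetic mean for comparison
harmonic_by_library = statistics.harmonic_mean(values)
arithmetic = statistics.mean(values)

print(f"harmonic mean   = {harmonic_by_formula:.4f}")
print(f"arithmetic mean = {arithmetic:.4f}")
```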
Formally, the geometric mean is defined as “…the nth root of the product of n numbers.” In
other words, for a set of numbers $\{x_i\}_{i=1}^{N}$ the geometric mean is:
$$\bar{x}_G = \left(\prod_{i=1}^{N} x_i\right)^{1/N} = \sqrt[N]{x_1 \cdot x_2 \cdots x_N}$$
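A minimal sketch of this definition follows, assuming all values are positive. The helper name `geometric_mean` is introduced here for illustration, and the two inputs are the salary figures quoted in Example 2.11.

```python
import math
import statistics

def geometric_mean(values):
    """N-th root of the product of N positive numbers."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.prod(values) ** (1 / len(values))

# Salary figures (BWP) quoted in Example 2.11
salaries = [2500, 5000]
gm = geometric_mean(salaries)

print(f"geometric mean = {gm:.2f}")
print(f"library check  = {statistics.geometric_mean(salaries):.2f}")
```

For long lists or very large values the direct product can overflow; `statistics.geometric_mean` is a safer choice there because it is designed to handle such inputs.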
Example 2.11
The average person’s monthly salary in a certain town jumped from BWP2,500 to BWP5,000
over the course of ten years. Using the geometric mean, what is the average yearly increase?
Solution:
Step 1: Find the geometric mean of the two salaries.
$$\bar{x}_G = \sqrt{2500 \times 5000} \approx 3535.53$$
Step 2: Divide by 10 (to get the average increase over ten years).
$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum w \cdot x}{\sum w}$$
Here, ∑ 𝑤 ∙ 𝑥 is the sum of the products obtained by multiplying each 𝑥 by the corresponding
weight, and ∑ 𝑤 is the sum of weights.
Example 2.12
The combined batting averages of the five leading batters in American baseball are given
below:
Obtain the weighted mean of the data using the times at bat as weights.
Solution:
Note: When the weights are all equal, the weighted mean is equivalent to the arithmetic
mean.
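The weighted mean formula can be sketched as below. The batting averages and times at bat are hypothetical figures invented for illustration (they are not the table of Example 2.12), and `weighted_mean` is an illustrative helper name.

```python
def weighted_mean(values, weights):
    """Sum of w*x divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight

# Hypothetical batting averages and times at bat (illustration only)
averages = [0.300, 0.250, 0.320]
at_bats = [400, 500, 100]

weighted = weighted_mean(averages, at_bats)
unweighted = weighted_mean(averages, [1, 1, 1])  # equal weights: arithmetic mean

print(f"weighted mean   = {weighted:.3f}")
print(f"arithmetic mean = {unweighted:.3f}")
```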
$$\bar{\bar{x}} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum n \cdot \bar{x}}{\sum n}$$
Where the weights are the sizes of the samples, the numerator is the total of all the
measurements or observations, and the denominator is the number of observations in the
combined samples.
Example 2.13
In a physics class, there are 14 freshmen, 25 sophomores, and 16 juniors. Given that the
freshmen averaged 76 in the final examination, the sophomores averaged 83, and the juniors
averaged 89, what is the mean grade for the entire class?
Solution
Therefore,
$$\bar{\bar{x}} = \frac{14(76) + 25(83) + 16(89)}{14 + 25 + 16} = \frac{4563}{55} \approx 82.96$$
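The grand mean for Example 2.13 can be reproduced with a short sketch that weights each group mean by its group size. The function name `grand_mean` is an illustrative choice.

```python
def grand_mean(sizes, means):
    """Combine group means, weighting each by the size of its group."""
    if len(sizes) != len(means):
        raise ValueError("sizes and means must have the same length")
    total = sum(n * m for n, m in zip(sizes, means))  # total of all observations
    return total / sum(sizes)                         # divided by the combined size

# Example 2.13: freshmen, sophomores, juniors
sizes = [14, 25, 16]
means = [76, 83, 89]

class_mean = grand_mean(sizes, means)
print(f"mean grade for the entire class = {class_mean:.2f}")
```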
Roughly speaking, the median is the value that divides the ordered data into two equal halves (50%
below the median and 50% above it). If 𝑛 is an odd number, there is a unique middle value and
it is the median. If 𝑛 is an even number, there are two middle values and the median is
defined as their average. Put differently, we use the median position to establish the median
value. After ranking the data from the smallest to the largest value, the median occurs at position
$$\frac{n+1}{2}$$
Example 2.14
The number of days the first six heart transplant patients at GPH survived their operations
were 15, 3, 46, 623, 126, 64. Calculate the median of the survival time.
Solution:
Step 1: Reorder the data from the smallest to largest values as follows:
3, 15, 46, 64, 126, 623
Step 2: Find the median position:
$$\tilde{x}_{pos} = \frac{n+1}{2} = \frac{6+1}{2} = 3.5$$
Step 3: From above, the median lies between the 3rd and 4th ordered values.
Thus, the median is $\tilde{x} = \frac{46 + 64}{2} = 55$ days.
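The steps of Example 2.14 can be traced in code. This is a minimal sketch using the survival times given in the example, with the standard-library `statistics.median` as a cross-check.

```python
import statistics

# Survival times (days) from Example 2.14
survival_days = [15, 3, 46, 623, 126, 64]

ordered = sorted(survival_days)   # Step 1: rank the data
n = len(ordered)
position = (n + 1) / 2            # Step 2: median position

# Step 3: middle value for odd n, average of the two middle values for even n
if n % 2 == 1:
    median = ordered[n // 2]
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(f"ordered data    = {ordered}")
print(f"median position = {position}")
print(f"median          = {median} days")
```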
2.3.7 Percentiles
If the number of observations is large (say >30), the notion of median can be extended by
dividing ordered data into percentiles.
The sample 100𝑝-th percentile is a value such that, after the data are ordered from smallest
to largest, at least 100𝑝% of the observations are at or below this value and at least 100(1 − 𝑝)%
are at or above this value.
If 𝑝 = 0.5, then 100(0.5) = 50th percentile. This implies that at least half of the observations
are equal to or smaller than the 50th percentile and at least half are equal to or larger than it.
If 𝑝 = 0.25, then 100(0.25) = 25th percentile. This implies that at least one-fourth of the
observations are equal to or smaller than the 25th percentile and at least three-fourths are equal to or
larger than it.
To calculate the sample 100𝑝-th percentile from 𝑛 ordered observations:
- If 𝑛𝑝 is not an integer, round it up to the next integer and find the corresponding ordered
value.
- If 𝑛𝑝 is an integer, say 𝑘, calculate the average of the 𝑘-th and (𝑘 + 1)-th ordered values.
The 25th, 50th and 75th percentiles are also known as the first, second and third quartiles.
Example 2.15
The ordered lengths of phone calls (in minutes) are given below:
1.6 1.7 1.8 1.8 1.9 2.1 2.5 3.0 3.0 4.4
4.5 4.5 5.9 7.1 7.4 7.5 7.7 8.6 9.3 9.5
12.7 15.3 15.5 15.9 15.9 16.1 16.5 17.3 17.5 19.0
Solution
i. First quartile (Q1): with 𝑝 = 0.25, 𝑛𝑝 = 38 × 0.25 = 9.5. Since 9.5 is not an
integer, we round up to 10. The 10th ordered observation is 4.4, which
implies that Q1 = 4.4 minutes.
ii. Second quartile (Q2): with 𝑝 = 0.50, 𝑛𝑝 = 38 × 0.50 = 19. Since 19 is an
integer, we average the 19th and 20th ordered observations to obtain the second quartile
(or median). Thus, Q2 = (9.3 + 9.5)/2 = 9.4 minutes.
iii. Third quartile (Q3): with 𝑝 = 0.75, 𝑛𝑝 = 38 × 0.75 = 28.5. Since 28.5 is
not an integer, we round up to 29. The 29th ordered observation is
17.5, which implies that Q3 = 17.5 minutes.
Using the same dataset, we can also obtain the 90th percentile as follows:
Here, 𝑝 = 0.90 ⇒ 𝑛𝑝 = 38 × 0.90 = 34.2. Rounding up gives 35, thus the 90th percentile is
the 35th ordered observation, 31.7 minutes.
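The percentile rule used in Example 2.15 can be sketched as follows. The function names are illustrative, the rule is assumed to apply for 0 < 𝑝 < 1, and the small list passed to `sample_percentile` is hypothetical data. Statistical packages often use other interpolation rules, so their results may differ slightly from this one.

```python
import math

def percentile_positions(n, p):
    """1-based positions of the ordered values that define the 100p-th percentile."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    k = n * p
    if abs(k - round(k)) < 1e-9:     # np is an integer k: use the k-th and (k+1)-th values
        k = int(round(k))
        return (k, k + 1)
    return (math.ceil(k),)           # np is not an integer: round up

def sample_percentile(ordered, p):
    """Percentile of already-ordered data, averaging when two positions are returned."""
    picked = [ordered[pos - 1] for pos in percentile_positions(len(ordered), p)]
    return sum(picked) / len(picked)

# Positions for n = 38, as in Example 2.15
for p in (0.25, 0.50, 0.75, 0.90):
    print(p, percentile_positions(38, p))

# Hypothetical ordered data, for illustration only
print(sample_percentile([1, 2, 3, 4, 5, 6, 7, 8], 0.25))
```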
Example 2.16
22, 24, 23, 24, 27, 25, 24, 20, 24, 26, 23, 21, 24, 25, 23, 28, 24, 26, 25
Solution:
To give the general formula for the mean of a distribution with 𝑘 classes, let us denote the
successive class marks (or midpoints) by 𝑥1 , 𝑥2 , … , 𝑥𝑘 and the corresponding frequencies by
𝑓1 , 𝑓2 , … , 𝑓𝑘 such that the sum of all the measurements is given by
$$\sum f \cdot x = f_1 \cdot x_1 + f_2 \cdot x_2 + \cdots + f_k \cdot x_k$$
The mean is then
$$\bar{x} = \frac{\sum f \cdot x}{n}, \quad \text{where } n = f_1 + f_2 + \cdots + f_k$$
Example 2.17
Find the mean of the distribution of students’ marks given in Example 2.3.
Solution
$$\bar{x} = \frac{\sum f x}{n} = \frac{3548}{50} = 70.96 \approx 71$$
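A minimal sketch of the grouped-mean formula is given below. Because the frequency table of Example 2.3 is not reproduced in this section, the midpoints and frequencies in the code are hypothetical; only the totals ∑𝑓𝑥 = 3548 and 𝑛 = 50 come from Example 2.17.

```python
def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum of f*x divided by n = sum of f."""
    if len(midpoints) != len(frequencies):
        raise ValueError("midpoints and frequencies must have the same length")
    n = sum(frequencies)
    return sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Hypothetical class marks and frequencies (illustration only)
midpoints = [34.5, 44.5, 54.5]
frequencies = [2, 3, 5]
illustrative_mean = grouped_mean(midpoints, frequencies)

# Totals quoted in Example 2.17
example_mean = 3548 / 50

print(f"illustrative grouped mean = {illustrative_mean}")
print(f"Example 2.17 mean         = {example_mean}")
```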
2.3.8.2 The Median of Grouped Data
The method for finding the median location of grouped data is slightly different from that of
ungrouped data. We use the following steps to obtain the median of grouped data.
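Step 2: Decide which class contains the median. The median class (class median) is the first class whose
cumulative frequency is at least 𝑛⁄2.
$$Median = L_m + \left(\frac{\frac{n}{2} - CF}{f_m}\right) \times c$$
Where;
𝐿𝑚 = the lower boundary of the median class
𝐶𝐹 = the cumulative frequency of the classes before the median class
𝑓𝑚 = the frequency of the median class
𝑐 = class width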
Example 2.18
Find the median of the distribution of students’ marks given in Example 2.3.
Solution
And
$$\text{Third quartile: } Q_3 = L_{Q_3} + \left(\frac{\frac{3n}{4} - CF}{f_{Q_3}}\right) \times c$$
Using the same data as in Example 2.18, the first quartile (𝑄1) and third quartile (𝑄3) can be
determined as follows:
The position of 𝑄1 is 𝑛/4 = 50/4 = 12.5, which implies that 𝑄1 is in the 4th class.
Therefore,
$$Q_1 = 59.5 + \left(\frac{12.5 - 12}{8}\right) \times 10 = 60.125 \approx 60$$
The position of 𝑄3 is 3𝑛/4 = 3(50)/4 = 37.5, which implies that 𝑄3 is in the 6th class.
Therefore,
$$Q_3 = 79.5 + \left(\frac{37.5 - 34}{14}\right) \times 10 = 82$$
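The interpolation used for the grouped median and quartiles follows one pattern, sketched below with the figures from the quartile calculations above. The function name and parameter names are illustrative. Passing 𝑛/2 as the position gives the grouped median in the same way.

```python
def grouped_quantile(lower_boundary, position, cum_freq_before, class_freq, class_width):
    """Interpolate within the class that contains the required position.

    position        : n/4 for Q1, n/2 for the median, 3n/4 for Q3
    cum_freq_before : cumulative frequency of the classes before that class
    """
    return lower_boundary + (position - cum_freq_before) / class_freq * class_width

n = 50  # number of students in the grouped data

# Figures taken from the worked quartile calculations
q1 = grouped_quantile(59.5, n / 4, 12, 8, 10)
q3 = grouped_quantile(79.5, 3 * n / 4, 34, 14, 10)

print(f"Q1 = {q1}")
print(f"Q3 = {q3}")
```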
2.3.8.4 The Mode of Grouped Data
For grouped data, the class mode (or modal class) is the class with the highest frequency. To find
the mode in such datasets we use the following formula:
$$Mode = L_{mo} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) \times c$$
Where
𝑐 = class width
∆1 = the difference between the frequency of the class mode and the frequency of the class before
it.
∆2 = the difference between the frequency of the class mode and the frequency of the class after
it.
𝐿𝑚𝑜 = the lower boundary of the class mode
Example 2.19
Find the mode of the grouped frequency table showing student marks below (from Example 2.3).
Solution
Then,
$$Mode = 69.5 + \left(\frac{6}{6 + 12}\right) \times 20 \approx 76$$
Alternatively, the mode can be obtained from the histogram as follows:
Step 1: Identify the modal class and the bar representing it.
Step 2: Draw two crossing lines from the neighbouring class boundaries.
Step 3: Drop a perpendicular from the intersection of the two lines until it touches the
horizontal axis. The point where it meets the axis is the mode.
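The grouped-mode formula can be sketched as a small function. The inputs below are the figures that appear in the solution of Example 2.19 (lower boundary 69.5, ∆1 = 6, ∆2 = 12 and the width of 20 used there); the function name is an illustrative choice.

```python
def grouped_mode(lower_boundary, delta1, delta2, class_width):
    """Mode of grouped data by interpolation inside the modal class.

    delta1 : modal-class frequency minus the frequency of the class before it
    delta2 : modal-class frequency minus the frequency of the class after it
    """
    if delta1 + delta2 == 0:
        raise ValueError("delta1 + delta2 must not be zero")
    return lower_boundary + delta1 / (delta1 + delta2) * class_width

# Figures as they appear in the solution of Example 2.19
mode = grouped_mode(69.5, 6, 12, 20)
print(f"mode = {mode:.2f}")
```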
Example 2.20
Calculate the range for the lengths of phone calls (in minutes) given in Example 2.15
Solution:
In the data, the smallest value is 1.6 minutes and the largest value is 53.3
minutes. The range is therefore 53.3 − 1.6 = 51.7 minutes.
- It is too sensitive to the existence of very large or very small values in the dataset.
- It also ignores the information present in the scatter of the intermediate points.
To avoid the problem of using a measure that may be thrown far off the mark by one or two
wild or unusual observations, a compromise is made by using the interquartile range (IQR).
The interquartile range measures the interval between the first and third quartiles,
which contains the central half of the observations.
- It is not disturbed if a small fraction of the observations is very large or very small.
Example 2.21
Using the data in Example 2.15, the interquartile range for the length of phone calls is given by:
$$IQR = Q_3 - Q_1 = 17.5 - 4.4 = 13.1 \text{ minutes}$$
Note: In some instances, statisticians may use the semi-interquartile range $= \frac{1}{2}(Q_3 - Q_1)$.
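The range, interquartile range and semi-interquartile range for the phone-call data can be computed from the summary values already quoted in Examples 2.15 and 2.20. This is a minimal sketch; the variable names are illustrative.

```python
# Summary values quoted in Examples 2.15 and 2.20 (minutes)
smallest, largest = 1.6, 53.3
q1, q3 = 4.4, 17.5

data_range = largest - smallest   # sensitive to the single largest call
iqr = q3 - q1                     # spread of the central half of the data
semi_iqr = iqr / 2

print(f"range    = {data_range:.1f} minutes")
print(f"IQR      = {iqr:.1f} minutes")
print(f"semi-IQR = {semi_iqr:.2f} minutes")
```

The range is roughly four times the IQR here, which reflects how strongly the largest call length influences it.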
The variation of data points can be reflected by their deviations from the mean (𝑥̅ ) as follows:
Deviation = 𝑥 − 𝑥̅
To obtain a measure of spread from these deviations, we must first eliminate their signs
before averaging. Otherwise the sum of the deviations would be zero:
$$\sum (\text{deviations}) = \sum (x_i - \bar{x}) = 0$$
To eliminate the signs, a measure of spread (or dispersion) called the sample variance
is constructed by adding the squared deviations and dividing the total by the number
of observations minus one. In other words,
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Note: The denominator is 𝑛 − 1 rather than 𝑛. This is the degrees of freedom associated with
𝑠². Using the table above, the variance can be calculated as follows:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{35}{5} = 7$$
Because the variance has its units expressed in squared form, we need to transform this
measure of variability into the same units as the data. We take its square root to obtain the
standard deviation. Thus, the standard deviation serves as a more basic measure of variability than
the variance. Given the variance, we can express the sample standard deviation as
follows:
$$s = \sqrt{\text{variance}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Taking our example above, the sample standard deviation would be:
$$s = \sqrt{7} \approx 2.65$$
Alternatively, we can compute the sample variance using the formula below:
$$s^2 = \frac{1}{n - 1}\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right]$$
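The definitional formula and the computational formula for the sample variance give the same result, as the sketch below suggests. The six observations are hypothetical values chosen for illustration (they are not the table referred to in the text), and the function names are illustrative.

```python
import math
import statistics

def variance_by_definition(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def variance_by_shortcut(data):
    """Computational form: [sum(x^2) - (sum(x))^2 / n] / (n - 1)."""
    n = len(data)
    return (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

# Hypothetical observations (illustration only)
data = [4, 7, 9, 10, 12, 6]

s2_definition = variance_by_definition(data)
s2_shortcut = variance_by_shortcut(data)
s = math.sqrt(s2_definition)  # standard deviation, in the units of the data

print(f"variance (definition) = {s2_definition}")
print(f"variance (shortcut)   = {s2_shortcut}")
print(f"standard deviation    = {s:.3f}")
```

The shortcut form avoids computing each deviation, but with very large values it can lose precision through cancellation, so the definitional form (or `statistics.variance`) is the safer default in code.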
For grouped data, the corresponding formula is
$$s^2 = \frac{1}{n - 1}\left[\sum f x^2 - \frac{\left(\sum f x\right)^2}{n}\right]$$
For example, the sample variance of grouped data in example 2.3 can be computed as shown
in the table below:
$$s^2 = \frac{1}{49}\left[261682.5 - \frac{(3548)^2}{50}\right] = \frac{1}{49}(261682.5 - 251766.1) = 202.4$$
$$s = \sqrt{202.4} \approx 14.2$$
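The grouped-data calculation above can be checked from its totals alone. This sketch takes ∑𝑓𝑥 = 3548, ∑𝑓𝑥² = 261682.5 and 𝑛 = 50 from the worked solution; the function name is an illustrative choice.

```python
import math

def grouped_variance(sum_fx, sum_fx2, n):
    """Sample variance of grouped data from the totals sum(f*x), sum(f*x^2) and n."""
    if n < 2:
        raise ValueError("at least two observations are needed")
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

# Totals taken from the worked solution for the grouped data of Example 2.3
s2 = grouped_variance(sum_fx=3548, sum_fx2=261682.5, n=50)
s = math.sqrt(s2)

print(f"variance           = {s2:.1f}")
print(f"standard deviation = {s:.1f}")
```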