
CHAPTER THREE

Numerical Summary Measures

Numerical Summarizing Data
12/19/2013 1 Yusuf H (BSc, MPHE)
• A frequency distribution gives a general picture of
the distribution of a variable
• But it cannot indicate the average (middle) value
or the spread of the values

• Although these techniques are extremely
useful, they do not allow us to make concise,
quantitative statements that characterize the
distribution of values as a whole
Measures of Central Tendency (MCT)

• Computed from the data of a sample or
population
• Convey information regarding the average value
• Facilitate the description or comparison of
data sets
• The goal of measures of central tendency is to
come up with the one single number that best
describes a distribution of scores.
Characteristics of a good MCT
A MCT is good or satisfactory if it possesses the following
characteristics.
1. It should be based on all the observations
2. It should not be affected by the extreme values
3. It should be as close to the majority of values as
possible
4. It should have a definite value
5. It should not be subjected to complicated and
tedious calculations
6. It should be capable of further algebraic
treatment

• The most common MCT include:
– Arithmetic Mean
– Median
– Mode
– Others
1. Arithmetic Mean
a) Ungrouped data
• The arithmetic mean is the "average" of the data set
and by far the most widely used measure of central
location

• The sample mean is the sum of all the observations
divided by the number of observations:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

The heart rates for n=10 patients were as follows (beats per
minute):
167, 120, 150, 125, 150, 140, 40, 136, 120, 150
What is the arithmetic mean for the heart rate of these patients?
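The question above can be worked through with a short sketch. The variable names are illustrative only; the data are the ten heart rates listed in the example.

```python
# Heart rates (beats per minute) for the n = 10 patients in the example.
heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

# Arithmetic mean = sum of the observations / number of observations.
total = sum(heart_rates)
n = len(heart_rates)
sample_mean = total / n

print(f"sum = {total}, n = {n}, mean = {sample_mean} beats per minute")
```

The sum is 1298, so the mean is 129.8 beats per minute. Note that the single low value of 40 pulls the mean below most of the other observations.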
b) Grouped data
In calculating the mean from grouped data, we assume that all values falling into a
particular class interval are located at the mid-point of the interval. It is calculated as
follows:

\[ \bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i} \]

where,
k = the number of class intervals
m_i = the mid-point of the i-th class interval
f_i = the frequency of the i-th class interval
Example. Compute the mean age of 169 subjects from the grouped data given below.

Class interval   Mid-point (mi)   Frequency (fi)   mifi
10-19            14.5               4                58.0
20-29            24.5              66              1617.0
30-39            34.5              47              1621.5
40-49            44.5              36              1602.0
50-59            54.5              12               654.0
60-69            64.5               4               258.0
Total            __               169              5810.5

Mean = 5810.5/169 = 34.38 years
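As a check on the grouped-data calculation, here is a minimal sketch of the mid-point formula. The function name `grouped_mean` is an illustrative choice, not part of the original slides.

```python
# Mid-points and frequencies of the six age classes in the example table.
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i)."""
    weighted_sum = sum(m * f for m, f in zip(midpoints, frequencies))
    return weighted_sum / sum(frequencies)

weighted_sum = sum(m * f for m, f in zip(midpoints, frequencies))
mean_age = grouped_mean(midpoints, frequencies)
print(f"sum(m*f) = {weighted_sum}, n = {sum(frequencies)}, mean = {mean_age:.2f} years")
```

The weighted sum is 5810.5 over 169 subjects, which gives a mean of about 34.38 years.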
Properties of the Arithmetic Mean.

• Uniqueness: a data set has one and only one mean
• Simplicity
• Affected by each value
• Extreme values influence the mean
• It is the most amenable to algebraic treatment
2. Median

• Divides the data set into two equal parts
• Alternative measure of central location next to
mean.
a) Ungrouped data
• Odd number:
– The median will be the middle value when all values are
arranged in order of magnitude.
• Even number:
– There is no single middle value but two middle
observations.
– In this case the median is the mean of these two middle
observations, when all observations have been arranged
in the order of their magnitude.

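If the observations are ordered from smallest to
largest, then the median is defined as follows:
• the ((n + 1)/2)th observation if n is odd
• the average of the (n/2)th and (n/2 + 1)th observations if n is even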
• Compute the sample median for the birth weight
data. First arrange the sample in ascending order.
2069 2581 2759 2828 2834 2841 3031 3101 3200 3245
3248 3260 3265 3314 3323 3484 3541 3609 3649 4146
• Since n = 20 is even, median = average of the
10th and 11th observations =
(3245 + 3248)/2 = 3246.5 g.
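The odd/even rule for the median can be sketched as follows. The helper `sample_median` is a hypothetical name; Python's standard `statistics.median` follows the same rule and is used here only as a cross-check.

```python
from statistics import median

# Birth weights (g) from the example; sorting is done inside the function,
# so the input order does not matter.
birth_weights_g = [
    2069, 2581, 2759, 2834, 2828, 2841, 3031, 3101, 3200, 3245,
    3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146,
]

def sample_median(values):
    """Middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

result = sample_median(birth_weights_g)
print(result, median(birth_weights_g))
```

For these 20 observations both calls return 3246.5, the average of the 10th and 11th ordered values.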
• The median is a better description (than the
mean) of the majority when the distribution is
skewed
• Example
– Data: 14, 89, 93, 95, 96
– Skewness is reflected in the outlying low value of 14
– The sample mean is 77.4
– The median is 93
b) Grouped data

• In calculating the median from grouped
data, we assume that the values within a
class-interval are evenly distributed
through the interval.
• The first step is to locate the class interval
in which the median is located, using the
following procedure.
• Find n/2 and identify the first class interval whose
cumulative frequency is at least n/2; this interval
contains the median.
• Then, use the following formula.

\[ \tilde{x} = L_m + \left( \frac{\frac{n}{2} - F_c}{f_m} \right) W \]

where,
L_m = lower true class boundary of the interval containing the median
F_c = cumulative frequency of the interval immediately preceding the median class
interval
f_m = frequency of the interval containing the median
W = class interval width
n = total number of observations

Example. Compute the median age of 169 subjects from
the grouped data.

n/2 = 169/2 = 84.5

Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
10-19            14.5               4                 4
20-29            24.5              66                70
30-39            34.5              47               117
40-49            44.5              36               153
50-59            54.5              12               165
60-69            64.5               4               169
Total                             169
• n/2 = 84.5 falls in the 3rd class interval (cumulative frequency 117)
• Lower true class boundary = 29.5, upper true class boundary = 39.5
• Frequency of the class (fm) = 47
• (n/2 – Fc) = 84.5 – 70 = 14.5

• Median = 29.5 + (14.5/47) × 10 = 32.58 ≈ 33 years
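The grouped-median steps can be sketched as a small function. The name `grouped_median` and its argument layout are assumptions made for illustration; the true class boundaries (9.5, 19.5, ...) are derived from the class limits in the table.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped data: L_m + ((n/2 - F_c) / f_m) * W."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("frequency table is empty")
    half = n / 2
    cumulative = 0  # F_c: cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, frequencies):
        if f > 0 and cumulative + f >= half:
            # First class whose cumulative frequency reaches n/2.
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("median class not found")

# Age data from the example: lower true class boundaries and frequencies.
lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
frequencies = [4, 66, 47, 36, 12, 4]

median_age = grouped_median(lower_boundaries, frequencies, width=10)
print(f"median = {median_age:.2f} years")
```

This reproduces 29.5 + (14.5/47) × 10, about 32.6 years (32.58 when truncated to two decimals, as in the worked example).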
Properties of the median
• Uniqueness
• It is an average of position
• Simplicity.
• Affected more by the number of items than by extreme
values
• Insensitive to very large or very small values, unlike
the mean.
• Weakness: it is determined mainly by the middle points
Quartiles

• Just as the median is the value above and
below which lie half the set of data, one can
define measures above or below which lie
other fractional parts of the data.

• The median divides the data into two equal
parts (50%)

a) The first quartile (Q1): 25% of all the
ranked observations are less than Q1.

b) The second quartile (Q2): 50% of all the ranked
observations are less than Q2. The second
quartile is the median.

c) The third quartile (Q3): 75% of all the ranked
observations are less than Q3.


Percentiles
• Percentiles divide the ranked data into 100 equal parts.
• Percentiles are less sensitive to outliers and
not greatly affected by the sample size (n).
• Commonly used percentiles:
→ 10, 20, ….. 90% (deciles)
→ 20, 40, ….. 80% (quintiles)
→ 25, 50, 75% (quartiles)
→ 33.3, 66.7% (tertiles)
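A brief sketch of how quartiles and deciles might be obtained in practice, using the birth weight data from the median example. It relies on `statistics.quantiles` from the Python standard library with its default interpolation method; other textbooks and packages use slightly different conventions, so Q1 and Q3 can differ a little between tools.

```python
from statistics import quantiles

# Birth weights (g) from the median example.
birth_weights_g = [
    2069, 2581, 2759, 2834, 2828, 2841, 3031, 3101, 3200, 3245,
    3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146,
]

# n=4 returns the three cut points Q1, Q2, Q3 (quartiles).
q1, q2, q3 = quantiles(birth_weights_g, n=4)

# n=10 returns the nine cut points at 10%, 20%, ..., 90% (deciles).
deciles = quantiles(birth_weights_g, n=10)

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
print(f"number of decile cut points = {len(deciles)}")
```

Whatever the interpolation convention, the second quartile coincides with the median of the sample.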

3. Mode
• Value which occurs most frequently
• If all observations are different, no mode
• More than one mode can occur, bimodal (two-
peaks).
• Less amenable (responsive) to mathematical
treatment.
• The mode is not often used in biological or medical
data.

a) Ungrouped data

• It is a value which occurs most frequently in
a set of values.
• If all the values are different there is no
mode; on the other hand, a set of values
may have more than one mode.
Some distributions have more than one mode:
• Unimodal: A distribution with one mode
• Bimodal: A distribution with two modes
• Trimodal: A distribution with three modes

• Empirical formula:

\[ \text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median}) \]
• Example
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4 “Unimodal”
• Example
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes = 2 & 5
• This distribution is said to be “bi-modal”
• Example
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different
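The three examples can be reproduced with a small helper. The function `modes` is a hypothetical name; it follows the convention used on these slides that a data set whose values are all different has no mode.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs once, so no value is more typical than another.
        return []
    return sorted(v for v, c in counts.items() if c == highest)

unimodal = modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6])
bimodal = modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8])
no_mode = modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12])

print(unimodal, bimodal, no_mode)
```

The first data set has the single mode 4, the second has the two modes 2 and 5, and the third returns an empty list.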
b) Grouped data

• To find the mode of grouped data, we
usually refer to the modal class, where the
modal class is the class interval with the
highest frequency.
• If a single value for the mode of grouped
data must be specified, it is taken as the
mid-point of the modal class interval.

Properties of mode
• An average of position
• Not affected by extreme values
• The most typical value of the distribution
• Not capable of mathematical treatment

Which MCT to use?
• The mean can be used for discrete and
continuous data
• The median is appropriate for discrete and
continuous data as well, but can also be used
for ordinal data
• The mode can be used for all types of data,
but may be especially useful for nominal and
ordinal measurements
• Skewness: If extremely low or extremely high observations are
present in a distribution, then the mean tends to shift towards
those scores.
• Types of Distributions
(a) Symmetric and unimodal distribution — Mean,
median, and mode should all be approximately the
same
[Figure: symmetric, unimodal distribution with the mean, median and mode at the same point]
(b) Skewed to the right (positively skewed) —
Mean is sensitive to extreme values, so median
might be more appropriate
[Figure: positively skewed distribution with the mode, median and mean marked in that order from left to right]
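(c) Skewed to the left (negatively skewed): as in (b),
the mean is pulled toward the extreme (here, low) values,
so the median might be more appropriate
[Figure: negatively skewed distribution with the mean, median and mode marked in that order from left to right]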
Measures of Central Tendency
The Shape of Distributions
• With perfectly bell
shaped distributions, the
mean, median, and mode
are identical.
• With positively skewed
data, the mode is lowest,
followed by the median
and mean.
• With negatively skewed
data, the mean is lowest,
followed by the median
and mode.
4. Geometric mean (GM)

If x_1, x_2, ..., x_n are n positive observed values, then

\[ GM = \sqrt[n]{\prod_{i=1}^{n} x_i} \]

and

\[ \log GM = \frac{\sum_{i=1}^{n} \log x_i}{n}. \]
The geometric mean is generally used with data measured on a logarithmic scale, such
as titers of anti-neutrophil immunoglobulin G.
Geometric mean…
• The GM is preferable to the AM if the series of
observations contains one or more unusually large
values.
• It is obtained by taking the nth root of the product
of “n” values, i.e., if the values of the observations are
denoted by x1, x2, …, xn then GM = ⁿ√((x1)(x2)…(xn)).
• GM = ⁿ√((x1)(x2)…(xn)) = {(x1)(x2)…(xn)}^(1/n)
• Log GM = log {(x1)(x2)…(xn)}^(1/n)
• = (1/n) log {(x1)(x2)…(xn)}
• = (1/n) {log(x1) + log(x2) + … + log(xn)}
• = Σ(log xi)/n
• The logarithm of the geometric mean is equal to the
arithmetic mean of the logarithms of individual values.
• Example: The geometric mean may be calculated
for the following parasite counts per 100 fields
of thick films.
• 7 8 3 14 2 1 440 15 52 6 2 1 1 25
• 12 6 9 2 1 6 7 3 4 70 20 200 2 50
• 21 15 10 120 8 4 70 3 1 103 20 90 1 237

• GM = ⁴²√(7 × 8 × 3 × … × 1 × 237)
• log GM = 1/42 (log 7 + log 8 + log 3 + … + log 237)
• = 1/42 (0.8451 + 0.9031 + 0.4771 + … + 2.3747)
• = 1/42 (41.9985)
• = 0.9999 ≈ 1.0000
• The anti-log of 0.9999 is about 9.998 ≈ 10 and this
is the required geometric mean. By contrast,
the arithmetic mean, which is inflated by the
high values of 440, 237 and 200, is 39.8 ≈ 40.
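The log-based calculation can be checked with a few lines of code. This is a sketch using base-10 logarithms as in the worked example; because it does not round each logarithm to four figures, the last digits may differ slightly from the hand calculation.

```python
import math

# Parasite counts per 100 fields of thick film (42 observations).
counts = [
    7, 8, 3, 14, 2, 1, 440, 15, 52, 6, 2, 1, 1, 25,
    12, 6, 9, 2, 1, 6, 7, 3, 4, 70, 20, 200, 2, 50,
    21, 15, 10, 120, 8, 4, 70, 3, 1, 103, 20, 90, 1, 237,
]

n = len(counts)

# log GM = (sum of log x_i) / n, then take the anti-log.
mean_log = sum(math.log10(x) for x in counts) / n
geometric_mean = 10 ** mean_log

arithmetic_mean = sum(counts) / n

print(f"n = {n}")
print(f"geometric mean  = {geometric_mean:.2f}")
print(f"arithmetic mean = {arithmetic_mean:.2f}")
```

The geometric mean comes out very close to 10, while the arithmetic mean is about 39.8, which illustrates how a few very large counts inflate the arithmetic mean.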
5. Weighted mean (WM)
• In a weighted mean, separate outcomes have
separate influences.

• The influence attached to an outcome is the
weight.

• Familiar is the calculation of a course grade as
a weighted average of scores on separate
outcomes.
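To make the idea concrete, here is a minimal sketch of a weighted mean for a course grade. The scores and weights are hypothetical values chosen for illustration and do not come from the original slides; the function name `weighted_mean` is also an assumption.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical course: assignment, mid-term and final exam scores (out of 100)
# with weights of 20%, 30% and 50% respectively.
scores = [80, 70, 90]
weights = [20, 30, 50]

course_grade = weighted_mean(scores, weights)
unweighted = sum(scores) / len(scores)

print(f"weighted mean = {course_grade}, unweighted mean = {unweighted}")
```

With these made-up numbers the weighted mean is 82.0 while the unweighted mean is 80.0, because the final exam carries the largest weight.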


Example:
Which measure of central tendency is best with a given set of
data?

• Two factors are important in making this
decision:
– The shape of the distribution of the
observations
– The scale of measurement
Measures of Dispersion
Consider the following two sets of data:

A: 177 193 195 209 226 Mean = 200

B: 192 197 200 202 209 Mean = 200
Two or more sets may have the same mean and/or
median but they may be quite different.
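A short sketch comparing the two data sets shows how different they are despite the equal means. The range and sample standard deviation used here are measures introduced later in this chapter; the dictionary layout is purely illustrative.

```python
from statistics import mean, stdev

data_sets = {
    "A": [177, 193, 195, 209, 226],
    "B": [192, 197, 200, 202, 209],
}

summaries = {}
for name, values in data_sets.items():
    summaries[name] = {
        "mean": mean(values),
        "range": max(values) - min(values),
        "sd": stdev(values),  # sample SD, divisor n - 1
    }
    s = summaries[name]
    print(f"{name}: mean = {s['mean']}, range = {s['range']}, SD = {s['sd']:.2f}")
```

Both sets have a mean of 200, but set A has a range of 49 against 17 for set B, and a correspondingly larger standard deviation.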
• MCT are not enough to give a clear
understanding about the distribution of the
data.

• We need to know something about the
variability or spread of the values —
whether they tend to be clustered close
together, or spread out over a broad range


Measures of Dispersion…
• Measures that quantify the variation or dispersion
of a set of data from its central location

• Dispersion refers to the variety exhibited by the
values of the data.

• The amount of dispersion is small when the values are close
together.

• If all the values are the same, there is no dispersion


• Measures of dispersion include:
– Range
– Inter-quartile range
– Variance
– Standard deviation
– Coefficient of variation

1. Range (R)
• The difference between the largest and
smallest observations in a sample.

• Range = Maximum value – Minimum value

• Example –
– Data values: 5, 9, 12, 16, 23, 34, 37, 42
– Range = 42-5 = 37
• A data set with a higher range exhibits more
variability

Properties of range
• It is the simplest, crude measure and can be easily
understood
• It takes into account only two values, which causes it
to be a poor measure of dispersion
• Very sensitive to extreme observations
• Extreme values may be unreliable (most likely
faulty)
• Not suitable for mathematical treatment

2. Interquartile range (IQR)

• Indicates the spread of the middle 50% of
the observations, and used with median

IQR = Q3 − Q1



• Example:
• Suppose the first and third quartiles for
weights of girls 12 months of age are 8.8 kg
and 10.2 kg, respectively.
IQR = 10.2 kg – 8.8 kg = 1.4 kg
i.e., 50% of the infant girls weigh between 8.8
and 10.2 kg.

Properties of IQR:
• It is a simple and versatile measure
• It encloses the central 50% of the observations
• It is not based on all observations but only on
two specific values
• Since it excludes the lowest and highest 25%
values, it is not affected by extreme values


3. Variance (σ², s²)

• The variance is the average of the squares of
the deviations taken from the mean.
• A good measure of dispersion makes use of all
the data
• The variance measures the deviation of each observation
from the mean
• It does so by averaging the squares
of the deviations from the mean.
• Variance is used to measure the dispersion of
values relative to the mean.
• When values are close to their mean (narrow
range) the dispersion is less than when there
is scattering over a wide range.
– Population variance = σ²
– Sample variance = S²

a) Ungrouped data
- Let X_1, X_2, ..., X_N be the measurements on N
population units, then:

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]

where

\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} \]

is the population mean.
A sample variance is calculated for a sample of individual
values (X1, X2, … Xn) and uses the sample mean (x̄) rather
than the population mean µ, with n − 1 in the denominator:

\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1} \]

b) Grouped data




\[ S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1} \]

where
m_i = the mid-point of the i-th class interval
f_i = the frequency of the i-th class interval
\bar{x} = the sample mean
k = the number of class intervals
Properties of Variance:

• The main disadvantage of variance is that its unit is the
square of the unit of the original measurement values


• A variance of a distribution of weight is not expressed in
kg, but in kg²

e.g., mean weight = 36.5 kg, s² = 257 kg²





4. Standard deviation (σ, s)
• It is the square root of the variance.
• This produces a measure having the same scale as
that of the individual values.
• Most commonly used



\[ \sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2} \]
Example
• Areas of sprayable surfaces with DDT from a sample
of 15 houses are as follows (m²):

101 105 110 114 115 124 125 125
130 133 135 136 137 140 145

• Find the variance and standard deviation of the
above distribution.
• The mean of the sample is 125 m².
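One way to carry out this calculation is sketched below, applying the sample variance formula with the n − 1 divisor. The variable names are illustrative.

```python
# Sprayable surface areas (m^2) for the sample of 15 houses.
areas = [101, 105, 110, 114, 115, 124, 125, 125,
         130, 133, 135, 136, 137, 140, 145]

n = len(areas)
x_bar = sum(areas) / n

# Sum of squared deviations from the sample mean.
sum_sq_dev = sum((x - x_bar) ** 2 for x in areas)

sample_variance = sum_sq_dev / (n - 1)   # units: (m^2)^2
sample_sd = sample_variance ** 0.5       # units: m^2

print(f"mean = {x_bar}, sum of squared deviations = {sum_sq_dev}")
print(f"variance = {sample_variance:.2f}, SD = {sample_sd:.2f}")
```

The sum of squared deviations is 2502, giving a variance of 2502/14 ≈ 178.71 and a standard deviation of about 13.37 m².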
Example. Compute the variance and SD of the age of 169 subjects
from the grouped data.

Class interval   mi     fi    (mi − Mean)   (mi − Mean)²   (mi − Mean)² fi
10-19            14.5     4    -19.88        395.21          1580.86
20-29            24.5    66     -9.88         97.61          6442.55
30-39            34.5    47      0.12          0.01             0.68
40-49            44.5    36     10.12        102.41          3686.92
50-59            54.5    12     20.12        404.81          4857.77
60-69            64.5     4     30.12        907.21          3628.86
Total                   169                                 20197.63

(table entries rounded to two decimals)

Mean = 5810.5/169 = 34.38 years
S² = 20197.63/(169 − 1) = 120.22
SD = √S² = √120.22 = 10.96 years
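The grouped variance can be verified with a short sketch that works from the class mid-points and frequencies. The function name `grouped_variance` is hypothetical, and the code uses the unrounded mean 5810.5/169, so intermediate values may differ slightly from a hand calculation based on a rounded mean.

```python
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

def grouped_variance(midpoints, frequencies):
    """Sample variance of grouped data: sum((m_i - mean)^2 * f_i) / (n - 1)."""
    n = sum(frequencies)
    mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    sum_sq = sum((m - mean) ** 2 * f for m, f in zip(midpoints, frequencies))
    return sum_sq / (n - 1)

variance_age = grouped_variance(midpoints, frequencies)
sd_age = variance_age ** 0.5

print(f"variance = {variance_age:.2f}, SD = {sd_age:.2f} years")
```

This gives a variance of about 120.2 and a standard deviation of about 10.96 years.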
Properties of SD
• The SD has the advantage of being expressed in the
same units of measurement as the mean

• SD is considered to be the best measure of dispersion
and is used widely
• However, if the units of measurement of the variables in
two data sets are not the same, then their variability
can’t be compared by comparing the values of SD.

5. Coefficient of variation (CV)

• When two data sets have different units of
measurements, or their means differ
sufficiently in size, the CV should be used as
a measure of dispersion.
• It is the best measure to compare the
variability of two series (sets) of
observations.
• Data with a smaller CV are considered more
consistent.

• “Cholesterol is more variable than systolic blood
pressure”

              SD         Mean        CV (%)
SBP           15 mmHg    130 mmHg    11.5
Cholesterol   40 mg/dl   200 mg/dl   20.0

• CV is the ratio of the SD to the mean multiplied by 100.
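The two CV values in the table can be reproduced with a one-line formula. The function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(sd, mean):
    """CV (%) = SD / mean * 100; unit-free, so different scales are comparable."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100

cv_sbp = coefficient_of_variation(sd=15, mean=130)          # systolic blood pressure
cv_cholesterol = coefficient_of_variation(sd=40, mean=200)  # cholesterol, mg/dl

print(f"CV(SBP) = {cv_sbp:.1f}%, CV(cholesterol) = {cv_cholesterol:.1f}%")
```

Because the units cancel in the ratio, the two percentages (about 11.5% and 20.0%) can be compared directly even though the original measurements are on different scales.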
Characteristics of a Distribution
• A distribution is characterized by:
– Location = where the average value lies
– Dispersion = how spread out the values are
– Modality = number of peaks
– Skewness = whether or not a distribution is
symmetric
Which Measures to Use?
• For symmetrically distributed data, mean and
SD are used to summarize the data.
• However, in skewed distributions it is
preferable to use median and quartiles
• Median and quartiles are not easily
influenced by extreme values as are means
and SDs.
Exercise
• The following frequency distribution table shows the forced
expiratory volume in one second (FEV1) for 13 adults suffering
from asthma.

1. Find the mean, median and mode.
2. Find Range, quartiles and IQR.
3. Calculate variance and SD.
4. Suppose the FEV1 value for subject 11 was mistakenly
recorded as 40.2 instead of 4.02; what will happen
to the mean, median and mode, respectively?