
04 : Measures of Dispersion (1)

04. Measures of Dispersion

The mean of all three curves is the same, but curve A has less spread (or variability)
than curve B, and curve B has less variability than curve C. If we measure only the mean of these
three distributions, we will miss an important difference among the three curves. To improve our
understanding of the pattern of the data, we must also measure its dispersion.
Dispersion provides additional information that enables us to judge the reliability of our measure of
central tendency. A wide spread of values away from the centre may indicate an unacceptable
risk. A quantity that measures this characteristic is called a measure of dispersion, scatter or
variability. The main measures are
(1) Range
The range R is defined as the difference between the largest value $x_{\max}$ and the smallest value $x_{\min}$ in a set of data, i.e.
$$R = x_{\max} - x_{\min} = x_n - x_0$$
where $x_0$ and $x_n$ denote the smallest and largest observations.
The main disadvantage of the range is that it depends on only two values (the extreme values), so it may be
seriously affected by a single unusual observation. It is therefore an unsatisfactory measure of dispersion.
However, it is appropriately used in statistical quality control charts of manufactured products,
daily temperatures, stock prices etc. The range is an absolute measure of dispersion. Its relative
measure, known as the co-efficient of dispersion, is defined as
$$\text{co-efficient of dispersion} = \frac{x_n - x_0}{x_n + x_0}$$
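The two range-based measures can be sketched in a few lines of Python. The data set `sample` below is hypothetical and is not taken from these notes; the function names are illustrative only.

```python
def data_range(values):
    """Absolute measure: R = x_max - x_min."""
    return max(values) - min(values)


def coefficient_of_dispersion(values):
    """Relative measure: (x_max - x_min) / (x_max + x_min)."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest)


# Hypothetical observations, used only to illustrate the formulas.
sample = [12, 15, 11, 19, 14]

print(data_range(sample))                 # 19 - 11 = 8
print(coefficient_of_dispersion(sample))  # 8 / 30
```

Because both functions look only at the two extreme values, changing any interior observation leaves the results unchanged, which is the weakness described above.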
(2) Inter-quartile Range
The measure of variability that overcomes the dependency on extreme values is the inter-quartile
range (IQR), defined as the difference between the third and first quartiles:
$$\text{IQR} = Q_3 - Q_1$$
In other words, the interquartile range is the range of the middle 50% of the data.
Half of this range is called the semi-interquartile range or the quartile deviation (Q.D);
symbolically,
$$\text{Q.D} = \frac{Q_3 - Q_1}{2}$$

Muhammad Naeem Sandhu, Assistant Professor, Department of Mathematics, University of Engineering and Technology, Lahore

For the data on monthly starting salaries, the quartiles are Q3 = 3600 and Q1 = 3465. Thus,
the interquartile range is 3600 − 3465 = 135.
The quartile deviation is also an absolute measure of dispersion. Its relative measure,
called the co-efficient of quartile deviation or the coefficient of semi-interquartile range, is
defined as
$$\text{co-efficient of quartile deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
which is used for comparing the variation in two or more sets of data.
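As a quick check of the quartile-based measures, the following sketch uses only the two quartiles quoted above for the salary data (Q1 = 3465, Q3 = 3600). The variable names are illustrative.

```python
# Quartiles quoted in the notes for the monthly starting salary data.
q1, q3 = 3465, 3600

iqr = q3 - q1                      # interquartile range: 135
quartile_deviation = iqr / 2       # semi-interquartile range: 67.5
coeff_qd = (q3 - q1) / (q3 + q1)   # relative measure: 135 / 7065

print(iqr, quartile_deviation, round(coeff_qd, 4))
```

The coefficient is a pure number, so it can be compared across data sets measured in different units.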
(3) Mean Deviation
The mean deviation (M.D) of a set of data is defined as the arithmetic mean (A.M) of the absolute deviations
measured from the mean, the median or the mode. The algebraic signs are disregarded
because the sum of the deviations of the observations from their mean is always zero.
Measured from the mean,
$$\text{M.D} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$
For grouped data with k classes having mid points $x_1, x_2, \ldots, x_k$ and
corresponding frequencies $f_1, f_2, \ldots, f_k$, where $\sum_{i=1}^{k} f_i = n$, the mean deviation of the sample is
given by
$$\text{M.D} = \frac{\sum_{i=1}^{k} f_i |x_i - \bar{x}|}{n}$$
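A minimal sketch of both mean deviation formulas, measured from the mean, is given below. The small data sets are hypothetical and were chosen so the arithmetic can be followed by hand.

```python
def mean_deviation(values):
    """M.D = sum(|x_i - mean|) / n for ungrouped data."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


def mean_deviation_grouped(midpoints, freqs):
    """M.D = sum(f_i * |x_i - mean|) / n for grouped data, n = sum(f_i)."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    return sum(f * abs(x - mean) for x, f in zip(midpoints, freqs)) / n


# Hypothetical ungrouped data: mean 6, absolute deviations 4, 2, 0, 2, 4.
print(mean_deviation([2, 4, 6, 8, 10]))                 # 12 / 5 = 2.4

# Hypothetical grouped data: mean 20, weighted deviations 10 + 0 + 10.
print(mean_deviation_grouped([10, 20, 30], [1, 2, 1]))  # 20 / 4 = 5.0
```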
(4) Population Variance and Standard Deviation
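The variance is the average of the squares of the distances of the values from the mean.
The symbol for the population variance is $\sigma^2$ ($\sigma$ is the Greek lowercase letter sigma). The
symbolic definition of the population variance is
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \ \text{(for ungrouped data)} \quad \text{and} \quad \sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{\sum f_i} \ \text{(for grouped data)}$$
An alternative formula is
$$\sigma^2 = \frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2 \ \text{(for ungrouped data)} \quad \text{and} \quad \sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \left(\frac{\sum f_i x_i}{\sum f_i}\right)^2 \ \text{(for grouped data)}$$
The positive square root of the variance is called the standard deviation. Symbolically,
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \ \text{(for ungrouped data)} \quad \text{and} \quad \sigma = \sqrt{\frac{\sum f_i (x_i - \mu)^2}{\sum f_i}} \ \text{(for grouped data)}$$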
The standard deviation is a very important concept that serves as a basic measure of
variability. A smaller value of the standard deviation indicates that most of the observations in
the data are close to the mean while a larger value implies that the observations are scattered
widely about the mean.
The standard deviation is found by taking the positive square root of the
variance computed above. It is an absolute measure of dispersion. Its relative measure, called the coefficient of
standard deviation, is defined as
$$\text{Coefficient of S.D.} = \frac{\text{Standard Deviation}}{\text{Mean}}$$
(5) Sample Variance and Standard Deviation
In most cases the purpose of calculating a statistic is to estimate the corresponding
parameter. For example, the sample mean $\bar{x}$ is used to estimate the population mean $\mu$. The
expression
$$\frac{\sum (x_i - \bar{x})^2}{n}$$
does not give the best estimate of the population variance because, when the population is
large and the sample is small (usually less than 30), the variance computed by this formula
usually underestimates the population variance. Therefore, instead of dividing by n, find the
variance of the sample by dividing by n − 1, giving a slightly larger value and an unbiased
estimate of the population variance. The formula for the sample variance, denoted by $s^2$, is
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
and the standard deviation of a sample (denoted by s) is
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
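The difference between the two divisors can be seen in a short sketch. The data set `data` is hypothetical; the hand-written functions are compared against `statistics.pvariance` and `statistics.variance` from the Python standard library, which use divisors n and n − 1 respectively.

```python
import math
import statistics


def population_variance(values):
    """Divide the sum of squared deviations by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def sample_variance(values):
    """Divide the sum of squared deviations by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Hypothetical data: mean 6, squared deviations 4, 4, 0, 1, 1 (sum 10).
data = [4.0, 8.0, 6.0, 5.0, 7.0]

print(population_variance(data))         # 10 / 5 = 2.0
print(sample_variance(data))             # 10 / 4 = 2.5
print(math.sqrt(sample_variance(data)))  # sample standard deviation s

# The standard library agrees with the hand-written versions.
print(statistics.pvariance(data), statistics.variance(data))
```

For the same data the sample variance is always the population formula multiplied by n/(n − 1), so the gap shrinks as n grows.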
(6) Properties of Variance
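i). Var(a) = 0, where a is a constant
ii). Var(X + a) = Var(X) = $\sigma^2$
iii). Var(aX) = $a^2$ Var(X)
iv). Var(X ± Y) = Var(X) + Var(Y), when X and Y are independent
v). Let $\bar{x}_1$ and $s_1^2$ be the mean and variance of $n_1$ observations and $\bar{x}_2$ and $s_2^2$ be the mean
and variance of $n_2$ observations ($n_1$ and $n_2$ are sufficiently large). If $\bar{x}$ is the mean and $S^2$ the
variance of the combined $n_1 + n_2$ observations, prove that
$$S^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2} + \frac{n_1(\bar{x}_1 - \bar{x})^2}{n_1 + n_2} + \frac{n_2(\bar{x}_2 - \bar{x})^2}{n_1 + n_2}$$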
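Property (v) can be checked numerically. The sketch below assumes that the group variances are computed with divisor n (in which case the identity holds exactly) and uses two small hypothetical groups; `combined_variance` is an illustrative name, not a library function.

```python
def pvar(values):
    """Variance with divisor n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def combined_variance(n1, mean1, var1, n2, mean2, var2):
    """Variance of the pooled n1 + n2 observations from group summaries."""
    total = n1 + n2
    grand_mean = (n1 * mean1 + n2 * mean2) / total
    within = (n1 * var1 + n2 * var2) / total
    between = (n1 * (mean1 - grand_mean) ** 2
               + n2 * (mean2 - grand_mean) ** 2) / total
    return within + between


# Hypothetical groups used only to verify the identity.
g1 = [2, 4, 6]
g2 = [1, 3, 5, 7, 9]

from_summaries = combined_variance(
    len(g1), sum(g1) / len(g1), pvar(g1),
    len(g2), sum(g2) / len(g2), pvar(g2),
)
direct = pvar(g1 + g2)

print(from_summaries, direct)  # both equal 6.234375
```

The first term is the average variability inside the groups and the remaining terms measure how far each group mean lies from the combined mean.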
Examples (1)
The breaking strength of test pieces of a certain alloy is given as under
95 103 97 130 96 73 78 95 89 68
82 79 69 67 83 108 94 87 93 117
Calculate the average breaking strength of the alloy and the standard deviation.


Solution

Breaking Strength (X)    X²       Breaking Strength (X)    X²
67                       4489     93                       8649
68                       4624     94                       8836
69                       4761     95                       9025
73                       5329     95                       9025
78                       6084     96                       9216
79                       6241     97                       9409
82                       6724     103                      10609
83                       6889     108                      11664
87                       7569     117                      13689
89                       7921     130                      16900
Total: ΣX = 1803, ΣX² = 167653

$$\text{Mean} = \frac{\sum X}{n} = \frac{1803}{20} = 90.15$$
$$\sigma^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2 = \frac{167653}{20} - \left(\frac{1803}{20}\right)^2 = 8382.65 - 8127.0225 = 255.6275$$
$$\sigma = \sqrt{255.6275} = 15.99$$
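The arithmetic of this example can be reproduced with a short script. It uses the 20 breaking strengths listed in the example and the alternative (sum of squares) formula with divisor n, as in the solution above.

```python
import math

# Breaking strengths of the 20 test pieces from Examples (1).
strengths = [95, 103, 97, 130, 96, 73, 78, 95, 89, 68,
             82, 79, 69, 67, 83, 108, 94, 87, 93, 117]

n = len(strengths)
total = sum(strengths)                        # sum of X
total_sq = sum(x * x for x in strengths)      # sum of X squared

mean = total / n
variance = total_sq / n - (total / n) ** 2    # divisor n
std_dev = math.sqrt(variance)

print(total, total_sq)        # 1803 167653
print(mean)                   # 90.15
print(round(variance, 4))     # 255.6275
print(round(std_dev, 2))      # 15.99
```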
Problems (Variance and Standard Deviation)
(1) For three sections of statistics class consisting of 32, 28, and 40 students, the mean
grades on the final exams were 83, 80 and 76 with standard deviations 5, 6 and 4.
Find combined mean and standard deviation of the class.
(2) By multiplying each number 3, 6, 2, 1, 7, 5 by 2 and then adding 5, we obtain 11,
17, 9, 7, 19, 15. What is the relationship between the variances and the means of the
two sets?
(3) The first of two samples has 100 items with mean 15 and variance 9. If the whole
group has 250 items with mean 15.6 and S.D = $\sqrt{13.44}$, find the standard deviation
of the second group. (4.15b)
(4) Two brands of cigarettes are compared to determine the variance of the difference
D in their nicotine content. Let X be the nicotine content of brand A, which has a variance
of 5 mg², and Y be the nicotine content of brand B, which has a variance of 4 mg²,
i.e. D = X − Y. It is assumed that X and Y are independent. What is the variance of D?


Examples (2)
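(In case of grouped data) Find the variance and standard deviation.
Classes      65-85   85-105   105-125   125-145   145-165   165-185   185-205
Frequency    9       10       17        10        5         4         5

Solution

Classes     xi     fi    fixi    fixi²
65-85       75     9     675     50625
85-105      95     10    950     90250
105-125     115    17    1955    224825
125-145     135    10    1350    182250
145-165     155    5     775     120125
165-185     175    4     700     122500
185-205     195    5     975     190125
Total              60    7380    980700

$$\sigma^2 = \frac{\sum f_i x_i^2}{n} - \left(\frac{\sum f_i x_i}{n}\right)^2 = \frac{980700}{60} - \left(\frac{7380}{60}\right)^2 = 16345 - 15129 = 1216$$
$$\sigma = \sqrt{1216} = 34.87$$
If the sample formula with divisor n − 1 = 59 is used instead, $s^2 = 1216 \times 60/59 = 1236.61$ and s = 35.17.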
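The grouped-data calculation can be sketched as follows, using the class mid points and frequencies of this example. Dividing by n = 60 gives a variance of 1216, while dividing by n − 1 = 59 gives about 1236.61, so the choice of divisor should be stated with the answer.

```python
import math

# Class mid points and frequencies from Examples (2).
midpoints = [75, 95, 115, 135, 155, 175, 195]
freqs = [9, 10, 17, 10, 5, 4, 5]

n = sum(freqs)                                               # 60
sum_fx = sum(f * x for x, f in zip(midpoints, freqs))        # 7380
sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))   # 980700

pop_variance = sum_fx2 / n - (sum_fx / n) ** 2     # divisor n
sample_variance = pop_variance * n / (n - 1)       # divisor n - 1

print(n, sum_fx, sum_fx2)
print(pop_variance, round(math.sqrt(pop_variance), 2))           # 1216.0 34.87
print(round(sample_variance, 2), round(math.sqrt(sample_variance), 2))
```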

(7) Coefficient of Variation


The variability of two or more sets of data cannot be compared unless we
have a relative measure of dispersion. For this purpose, Karl Pearson (1857-1938) introduced a
relative measure of variation, known as the co-efficient of variation (C.V), which expresses the
standard deviation as a percentage of the arithmetic mean of the data set. It is defined as
$$\text{C.V} = \frac{s}{\bar{x}} \times 100$$
The coefficient of variation allows you to compare standard deviations when the units are
different, for example, when a manager wants to compare the standard deviations of two different
variables, such as the number of sales per salesperson over a 3-month period and the
commissions made by these salespeople.
Examples (3)
The mean of the number of sales of cars over a 3-month period is 87, and the standard
deviation is 5. The mean of the commissions is $5225, and the standard deviation is $773.
Compare the variations of the two.
Solution
The coefficients of variation are
$$\text{C.V} = \frac{s}{\bar{x}} \times 100 = \frac{5}{87} \times 100 = 5.7\% \ \text{(sales)}$$
$$\text{C.V} = \frac{s}{\bar{x}} \times 100 = \frac{773}{5225} \times 100 = 14.8\% \ \text{(commissions)}$$
Since the coefficient of variation is larger for commissions, the commissions are more
variable than the sales.
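The comparison in this example can be reproduced with a small helper. The function name `coefficient_of_variation` is illustrative; the means and standard deviations are the ones given in the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100


cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)

print(round(cv_sales, 1))        # 5.7
print(round(cv_commissions, 1))  # 14.8
print(cv_commissions > cv_sales)  # True: commissions are more variable
```

Because the units cancel in the ratio, a count of cars and an amount in dollars can be compared directly.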
Exercise
The lengths (in feet) of the main span of the longest suspension bridges in the United
States and the rest of the world are shown below. Which set of data is more variable?
United States: 4205, 4200, 3800, 3500, 3478, 2800, 2800, 2310
World: 6570, 5538, 5328, 4888, 4626, 4544, 4518, 3970 (Bluman Ex. 3.2, 29)
(8) Range Rule of Thumb
The range can be used to approximate the standard deviation. The approximation is called
the range rule of thumb.
A rough estimate of the standard deviation is
$$s \approx \frac{\text{range}}{4}$$
For example, the standard deviation for the data set 5, 8, 8, 9, 10, 12, and 13 is 2.7, and the range
is 13 − 5 = 8. The range rule of thumb gives s ≈ 8/4 = 2.
A note of caution should be mentioned here. The range rule of thumb is only an
approximation and should be used when the distribution of data values is unimodal and roughly
symmetric.
The range rule of thumb can be used to estimate the largest and smallest data values of a
data set. The smallest data value will be approximately 2 standard deviations below the mean,
and the largest data value will be approximately 2 standard deviations above the mean of the data
set. The mean for the previous data set is 9.3; hence,
Smallest data value ≈ $\bar{x} - 2s$ = 9.3 − 2(2.7) = 3.9
Largest data value ≈ $\bar{x} + 2s$ = 9.3 + 2(2.7) = 14.7
Notice that the smallest data value was 5, and the largest data value was 13. Again, these
are rough approximations. For many data sets, almost all data values will fall within 2 standard
deviations of the mean. Better approximations can be obtained by using Chebyshev’s theorem
and the empirical rule.
Chebyshev’s theorem, developed by the Russian mathematician Chebyshev (1821–1894),
specifies the proportions of the spread in terms of the standard deviation.

(9) Chebyshev’s Theorem


At least $1 - 1/k^2$ of the data values must be within k standard deviations of the mean,
where k is any value greater than 1 (k is not necessarily an integer).
Some of the implications of this theorem, with k = 2, 3, 4 standard deviations, follow.

• At least 0.75, or 75%, of the data values must be within k = 2 standard deviations
of the mean
• At least 0.89, or 89%, of the data values must be within k = 3 standard deviations
of the mean
• At least 0.94, or 94%, of the data values must be within k = 4 standard deviations
of the mean

For the example in which variable 1 has a mean of 70 and a standard deviation of 1.5, at
least three-fourths, or 75%, of the data values fall between 67 and 73. These values are found by
adding 2 standard deviations to the mean and subtracting 2 standard deviations from the mean, as
shown:
70 + 2(1.5) = 70 + 3 = 73
and 70 − 2(1.5) = 70 − 3 = 67
Furthermore, the theorem states that at least eight-ninths, or 88.89%, of the data values
will fall within 3 standard deviations of the mean. This result is found by letting k = 3 and
substituting in the expression.
$$1 - \frac{1}{k^2} = 1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} = 88.89\%$$
For variable 1, at least eight-ninths, or 88.89%, of the data values fall between 65.5 and
74.5, since
70 + 3(1.5) = 70 + 4.5 = 74.5
and 70 − 3(1.5) = 70 − 4.5 = 65.5
For variable 2, at least three-fourths, or 75%, of the data values fall between 50 and 90.
Again, these values are found by adding and subtracting, respectively, 2 standard deviations to
and from the mean.
70 + 2(10) = 70 + 20 = 90
and 70 − 2(10) = 70 − 20 = 50
For variable 2, at least eight-ninths, or 88.89%, of the data values fall between 40 and 100.
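The bounds and intervals quoted for variables 1 and 2 can be reproduced with the sketch below. The function names are illustrative; the means and standard deviations (70 and 1.5, 70 and 10) are those used in the text.

```python
def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


def chebyshev_interval(mean, std_dev, k):
    """Interval mean +/- k standard deviations."""
    return (mean - k * std_dev, mean + k * std_dev)


for k in (2, 3, 4):
    print(k, round(chebyshev_bound(k), 4))   # 0.75, 0.8889, 0.9375

print(chebyshev_interval(70, 1.5, 2))   # (67.0, 73.0)  variable 1, 75%
print(chebyshev_interval(70, 1.5, 3))   # (65.5, 74.5)  variable 1, 88.89%
print(chebyshev_interval(70, 10, 2))    # (50, 90)      variable 2, 75%
print(chebyshev_interval(70, 10, 3))    # (40, 100)     variable 2, 88.89%
```

The bound depends only on k, so both variables share the same percentages even though their intervals differ in width.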


Exercise
The mean price of houses in a certain neighborhood is $50,000, and the standard
deviation is $10,000. Find the price range for which at least 75% of the houses will sell.
Final Remarks
This theorem can be applied to any distribution regardless of its shape, as shown in the above figure.
(10) The Empirical Rule
Chebyshev’s theorem applies to any distribution regardless of its shape. However, when a
distribution is bell-shaped (or what is called normal), the following statements, which make up
the empirical rule, are true.
Approximately 68% of the data values will fall within 1 standard deviation of the mean.
Approximately 95% of the data values will fall within 2 standard deviations of the mean.
Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.
For example, suppose that the scores on a national achievement exam have a mean of 480
and a standard deviation of 90. If these scores are normally distributed, then approximately 68%
will fall between 390 and 570 (480 + 90 = 570 and 480 − 90 = 390). Approximately 95% of the
scores will fall between 300 and 660 (480 + 2(90) = 660 and 480 − 2(90) = 300). Approximately
99.7% will fall between 210 and 750 (480 + 3(90) = 750 and 480 − 3(90) = 210). See Figure.
(The empirical rule will be explained in greater detail in the topic Normal Distribution.)
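The three intervals of the empirical rule can be tabulated with a short sketch. It assumes a bell-shaped distribution and uses the exam figures from the text (mean 480, standard deviation 90); the dictionary `EMPIRICAL_COVERAGE` simply stores the approximate percentages stated in the rule.

```python
# Approximate coverage for 1, 2 and 3 standard deviations (empirical rule).
EMPIRICAL_COVERAGE = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_intervals(mean, std_dev):
    """Return {k: (lower, upper)} for k = 1, 2, 3 standard deviations."""
    return {k: (mean - k * std_dev, mean + k * std_dev)
            for k in EMPIRICAL_COVERAGE}


intervals = empirical_intervals(480, 90)
for k, (low, high) in intervals.items():
    print(f"about {EMPIRICAL_COVERAGE[k]}% of scores between {low} and {high}")
```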

Exercise
The mean of a distribution is 20 and the standard deviation is 2. Use Chebyshev’s
theorem.
a. At least what percentage of the values will fall between 10 and 30?
b. At least what percentage of the values will fall between 12 and 28? (Bluman ch. 3)


Exercise
The Energy Information Administration reported that the mean retail price per gallon of
regular grade gasoline was $2.30 (Energy Information Administration, February 27, 2006).
Suppose that the standard deviation was $0.10 and that the retail price per gallon has a bell-shaped
distribution.
a. What percentage of regular grade gasoline sold between $2.20 and $2.40 per gallon?
b. What percentage of regular grade gasoline sold between $2.20 and $2.50 per gallon?
c. What percentage of regular grade gasoline sold for more than $2.50 per gallon?
(Sweeny Chap 3 )
(11) Exploratory Data Analysis
Exploratory data analysis enables us to use simple arithmetic and easy-to-draw pictures to
summarize data. In this section we continue exploratory data analysis by considering five-number
summaries and box plots. In a five-number summary, the following five numbers are used to summarize the data:
1. Smallest value
2. First quartile (Q1)
3. Median (Q2)
4. Third quartile (Q3)
5. Largest value

Examples (4)

The easiest way to develop a five-number summary is to first place the data in ascending
order. Then it is easy to identify the smallest value, the three quartiles, and the largest value. The
monthly starting salaries shown in the above table for a sample of 12 business school graduates
are repeated here in ascending order.

The median is 3505 and the quartiles are Q1 = 3465 and Q3 = 3600. Reviewing the data
shows a smallest value of 3310 and a largest value of 3925. Thus the five-number summary for
the salary data is 3310, 3465, 3505, 3600, 3925. Approximately one-fourth, or 25%, of the
observations are between adjacent numbers in a five-number summary.
(12) Box Plot
A box plot is a graphical summary of data that is based on a five-number summary. A
key to the development of a box plot is the computation of the median and the quartiles, Q1 and
Q3. The interquartile range, IQR = Q3 − Q1, is also used. The following figure is the box plot for
the monthly starting salary data. The steps used to construct the box plot follow.
1. A box is drawn with the ends of the box located at the first and third quartiles. For the
salary data, Q1 = 3465 and Q3 = 3600. This box contains the middle 50% of the data.
2. A vertical line is drawn in the box at the location of the median (3505 for the salary data).
3. By using the interquartile range, IQR = Q3 − Q1, limits are located. The limits for the
box plot are 1.5(IQR) below Q1 and 1.5(IQR) above Q3. For the salary data, IQR = Q3 − Q1 =
3600 − 3465 = 135. Thus, the limits are 3465 − 1.5(135) = 3262.5 and 3600 + 1.5(135) = 3802.5.
Data outside these limits are considered outliers.
4. The dashed lines in the figure are called whiskers. The whiskers are drawn from the ends of
the box to the smallest and largest values inside the limits computed in step 3. Thus, the whiskers
end at salary values of 3310 and 3730.
5. Finally, the location of each outlier is shown with the symbol *. In the figure we see one
outlier, 3925.
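The limit calculation for the salary box plot can be sketched as below. Only values quoted in the text are used (the quartiles and the salaries 3310, 3730 and 3925); the full list of 12 salaries is not reproduced in these notes, so it is not assumed here.

```python
# Quartiles quoted in the notes for the salary data.
q1, q3 = 3465, 3600

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr   # 3262.5
upper_limit = q3 + 1.5 * iqr   # 3802.5


def is_outlier(value):
    """A value is an outlier if it lies outside the 1.5 * IQR limits."""
    return value < lower_limit or value > upper_limit


print(lower_limit, upper_limit)
for salary in (3310, 3730, 3925):   # values mentioned in the text
    print(salary, is_outlier(salary))
```

Only 3925 falls outside the limits, which matches the single outlier marked on the box plot, while 3310 and 3730 are where the whiskers end.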


Exercise
The nine measurements that follow are furnace temperatures recorded on
successive batches in a semiconductor manufacturing process (units are °F): 953, 950,
948, 955, 951, 949, 957, 954, 955.
(a) Calculate the sample mean, sample variance, and standard deviation.
(b) Find the median. How much could the largest temperature measurement
increase without changing the median value?
(c) Construct a box plot of the data.


(13) Measures of Skewness and Kurtosis


A fundamental task in many statistical analyses is to characterize the location and
variability of a data set. A further characterization of the data includes skewness and kurtosis.
(13.1) Skewness
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A
distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
If a curve is symmetrical, then the number of values below the mean equals the number of
values above the mean, at matching distances from it. This is called symmetry.
Skewness is the degree of asymmetry (departure from symmetry) of a distribution.

In a symmetrical distribution, the mean, median and mode coincide.


If the frequency curve of a distribution has a longer tail to the right of the central
maximum than to the left, the distribution is said to be skewed to the right or to have positive
skewness.

In a positively skewed distribution, the mean exceeds the mode.

If the frequency curve of a distribution has a longer tail to the left of the central
maximum than to the right, the distribution is said to be skewed to the left or to have negative
skewness.

In a negatively skewed distribution, the mean is smaller than the mode.


For univariate data, the formula for skewness is
$$\text{Skewness} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^3 / N}{s^3}$$
where $\bar{X}$ is the mean, s is the standard deviation, and N is the number of data points.


Note that in computing the skewness, s is computed with N in the denominator rather
than N − 1.
Many software programs actually compute the adjusted Fisher-Pearson coefficient of
skewness,
$$\text{Skewness} = \frac{\sqrt{N(N-1)}}{N-2} \cdot \frac{\sum_{i=1}^{N} (X_i - \bar{X})^3 / N}{s^3}$$
This is an adjustment for sample size. The adjustment approaches 1 as N gets large. For
reference, the adjustment factor is 1.49 for N = 5, 1.19 for N = 10, 1.08 for N = 20, 1.05 for N =
30, and 1.02 for N = 100.
Karl Pearson proposed the following formula to measure skewness:
$$\text{Skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}}$$
Bowley introduced the following measure of skewness:
$$\text{Quartile coefficient of skewness} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$
This measure is equal to zero when the quartiles are equidistant from the median; the
distribution is then symmetrical. It is positive when the upper quartile is farther from the median than
the lower quartile; the distribution is then positively skewed. It is negative when the
lower quartile is farther from the median than the upper quartile; the distribution is then negatively skewed.
For a perfectly symmetrical curve, this measure is zero.
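Two of the quantities in this section are easy to check numerically. The sketch below computes the sample-size adjustment factor sqrt(N(N − 1))/(N − 2), which reproduces the factors quoted above, and Bowley's quartile coefficient for the salary quartiles used elsewhere in these notes (Q1 = 3465, Q2 = 3505, Q3 = 3600). The function names are illustrative.

```python
import math


def skewness_adjustment(n):
    """Sample-size factor of the adjusted Fisher-Pearson coefficient."""
    return math.sqrt(n * (n - 1)) / (n - 2)


def bowley_skewness(q1, q2, q3):
    """Quartile coefficient of skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)


for n in (5, 10, 20, 30, 100):
    print(n, round(skewness_adjustment(n), 2))   # 1.49 1.19 1.08 1.05 1.02

# Salary quartiles from the notes: upper quartile is farther from the median.
print(round(bowley_skewness(3465, 3505, 3600), 4))   # 55 / 135, positive
```

Bowley's measure always lies between −1 and +1, since the numerator can never exceed the interquartile range in absolute value.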
Problems (Skewness)
(1) What can you say about the skewness in each of the following cases?
(i) The median is 26.01, while the two quartiles are 13.73 and 38.29.
(ii) Mean = 140 and mode = 148.7
(iii) Mean = 129.5 and median = 128.7
(iv) The first three moments about the value 16 are respectively  0.35,
2.9 and 1.93
(2) Which of the following is correct in a positively skewed and in a negatively
skewed distribution?
(i) The arithmetic mean is greater than the mode.
(ii) The arithmetic mean is less than the mode.
(iii) The arithmetic mean is greater than the median.
(iv) The median is greater than the mode.
(3) The lengths of stay on the cancer floor of Apolo Hospital were organized into a
frequency distribution. The mean length of stay was 28 days, the median 25 days
and the modal length 23 days. The standard deviation was computed to be 4.2 days.


(13.2) Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a
normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data
sets with low kurtosis tend to have light tails, or lack of outliers. A uniform distribution would be
the extreme case.
The histogram is an effective graphical technique for showing both the skewness and
kurtosis of a data set.
Kurtosis is the degree of peakedness of a distribution. A distribution having a relatively high
peak is called Lepto-Kurtic whereas a distribution that is flat topped is called Platy-Kurtic. A
frequency curve which is neither very high peaked nor very flat topped is called Meso-kurtic; the
Normal curve is Meso-kurtic.

For univariate data, the formula for kurtosis is
$$\text{Kurtosis} = b_2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^4 / N}{s^4}$$
where $\bar{X}$ is the mean, s is the standard deviation, and N is the number of data points.
The kurtosis of a normal distribution is 3 (Meso-kurtic); for a Lepto-Kurtic distribution $b_2 > 3$ and for
a Platy-Kurtic distribution $b_2 < 3$.
Another measure of kurtosis is
$$\text{Percentile coefficient of kurtosis} = k = \frac{\text{Q.D}}{P_{90} - P_{10}}$$
where Q.D = quartile deviation = $\frac{Q_3 - Q_1}{2}$.


Examples (5)
Grouped data for the heights of 100 randomly selected male students are given below.
Height (inches)    Class Marks, X    Frequency, f
59.5 − 62.5        61                5
62.5 − 65.5        64                18
65.5 − 68.5        67                42
68.5 − 71.5        70                27
71.5 − 74.5        73                8
Now
$$\bar{x} = \frac{61 \times 5 + 64 \times 18 + 67 \times 42 + 70 \times 27 + 73 \times 8}{100} = 67.45$$
For Skewness,

Class marks, x    Frequency, f    xf      (x − x̄)    (x − x̄)² f    (x − x̄)³ f
61                5               305     −6.45       208.01         −1341.68
64                18              1152    −3.45       214.25         −739.15
67                42              2814    −0.45       8.51           −3.83
70                27              1890    2.55        175.57         447.70
73                8               584     5.55        246.42         1367.63
Σ                 100             6745                852.75         −269.33

$$\text{Variance} = \frac{\sum (X_i - \bar{X})^2 f}{N} = \frac{852.75}{100} = 8.5275$$
$$\text{Standard Deviation} = s = \sqrt{8.5275} = 2.92$$
$$\text{Skewness} = \frac{\sum (X_i - \bar{X})^3 f / N}{s^3} = \frac{-269.33/100}{(2.92)^3} = \frac{-2.6933}{24.90} = -0.108$$
This means that the distribution is slightly negatively skewed.


For Kurtosis,

Class Mark, x    Frequency, f    x − x̄     (x − x̄)⁴ f
61               5               −6.45      8653.84
64               18              −3.45      2550.05
67               42              −0.45      1.72
70               27              2.55       1141.63
73               8               5.55       7590.35
Σ                100                        19937.60

$$\text{Kurtosis} = \frac{\sum (X_i - \bar{X})^4 f / N}{s^4} = \frac{19937.60/100}{(8.5275)^2} = \frac{199.3760}{72.72} = 2.74 < 3$$
This means the frequency curve is flatter than the normal curve, that is, Platy-Kurtic.
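The moment calculations for the height data can be reproduced with the sketch below. It uses the class marks and frequencies of this example and computes the standard deviation with divisor N, as the skewness and kurtosis formulas in these notes require; `central_moment` is an illustrative helper name.

```python
# Class marks and frequencies of the height data in Examples (5).
class_marks = [61, 64, 67, 70, 73]
freqs = [5, 18, 42, 27, 8]

n = sum(freqs)
mean = sum(f * x for x, f in zip(class_marks, freqs)) / n


def central_moment(order):
    """Frequency-weighted r-th moment about the mean, divisor N."""
    return sum(f * (x - mean) ** order
               for x, f in zip(class_marks, freqs)) / n


m2 = central_moment(2)   # variance
m3 = central_moment(3)
m4 = central_moment(4)

skewness = m3 / m2 ** 1.5    # divide by s cubed
kurtosis = m4 / m2 ** 2      # divide by s to the fourth power

print(round(mean, 2), round(m2, 4))   # 67.45 8.5275
print(round(skewness, 3))             # -0.108
print(round(kurtosis, 2))             # 2.74
```

Both moments must be divided by the matching power of s before they are interpreted: the raw values −2.6933 and 199.376 are not themselves the skewness and kurtosis.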


Impact of Sample Size on Skewness and Kurtosis


The 5,000-point dataset above was used to explore what happens to skewness and
kurtosis based on sample size. For example, suppose we wanted to determine the skewness and
kurtosis for a sample size of 5. 5 results were randomly selected from the data set above and the
two statistics calculated. This was repeated for the sample sizes shown in Table 1.

Notice how different the results are when the sample size is small compared to the
"true" skewness and kurtosis for the 5,000 results. For a sample size of 25, the skewness was
−0.356 compared to the true value of 0.007, while the kurtosis was −0.025. Both signs are opposite
to those of the true values, which would lead to wrong conclusions about the shape of the distribution.
There appears to be a lot of variation in the results based on sample size.
