Professional Documents
Culture Documents
i=1
x
i
.
The sample variance is dened by
s
2
=
1
(n 1)
n
i=1
(x
i
x)
2
.
The sample standard deviation is
s =
_
s
2
.
The sample variance s
2
and the sample standard deviation s both are measures of the variability or
scatteredness of data values around the sample mean x. Larger the variance, more is the spread.
We note that s
2
and s are both nonnegative. One question we may ask is why not just take the sum
of the differences (x
i
x) as a measure of variation? The answer lies in the following result which
shows that if we add up all deviations about the sample mean, we always get a zero value.
Theorem 1.5.1 For a given set of measurements x
1
, x
2
, . . . , x
n
, let x be the sample mean. Then
n
i=1
(x
i
x) = 0.
1.5 Numerical Description of Data 27
Proof. Since x = (1/n)
n
i=1
x
i
, we have
n
i=1
x
i
= nx. Now
n
i=1
(x
i
x ) =
n
i=1
x
i
i=1
x
= nx nx = 0.
Thus although there may be a large variation in the data values,
n
i=1
(x
i
x) as a measure of spread
would always be zero, implying no variability. So it is not useful as a measure of variability.
Sometimes we can simplify the calculation of the sample variance s
2
by using the following
computational formula:
s
2
=
_
_
n
i=1
x
2
i
1
n
_
n
i=1
x
i
_
2
_
_
(n 1)
.
If the data set has a large variation with some extreme values (called outliers), the mean may not
be a very good measure of the center. For example, average salary may not be a good indicator of
the nancial well-being of the employees of a company if there is a huge difference in pay between
support personnel and management personnel. In that case, one could use the median as a measure
of the center, roughly 50% of data fall below and 50% above. The median is less sensitive to extreme
data values.
Denition 1.5.2 For a data set, the median is the middle number of the ordered data set. If the data set has
an even number of elements, then the median is the average of the middle two numbers. The lower quartile
is the middle number of the half of the data below the median, and the upper quartile is the middle number
of the half of the data above the median. We will denote
Q
1
= lower quartile
Q
2
= M = middle quartile (median)
Q
3
= upper quartile
The difference between the quartiles is called interquartile range (IQR).
IQR = Q
3
Q
1
.
A possible outlier (mild outlier) will be any data point that lies below
Q
1
1.5(IQR) or above Q
3
+1.5(IQR).
Note that the IQR is unaffected by the positions of those observations in the smallest 25% or the
largest 25% of the data.
Mode is another commonly used measure of central tendency. A mode indicates where the data tend
to concentrate most.
28 CHAPTER 1 Descriptive Statistics
Denition 1.5.3 Mode is the most frequently occurring member of the data set. If all the data values are
different, then by denition, the data set has no mode.
Example 1.5.1
The following data give the time in months from hire to promotion to manager for a random sample of 25
software engineers from all software engineers employed by a large telecommunications rm.
5 7 229 453 12 14 18 14 14 483
22 21 25 23 24 34 37 34 49 64
47 67 69 192 125
Calculate the mean, median, mode, variance, and standard deviation for this sample.
Solution
The sample mean is
x =
1
n
n
i=1
x
i
= 83.28 months.
To obtain the median, first arrange the data in ascending order:
5 7 12 14 14 14 18 21 22 23
24 25 34 34 37 47 49 64 67 69
125 192 229 453 483
Now the median is the thirteenth number which is 34 months.
Since 14 occurs most often (thrice), the mode is 14 months.
The sample variance is
s
2
=
1
n 1
n
i=1
(x
i
x)
2
=
1
24
_
(5 83.28)
2
+ +(125 83.28)
2
_
= 16,478.
and the sample standard deviation is, s =
s
2
=128.36 months. Thus, we have sample mean x = 83.28
months, median=34 months, and mode = 14 months. Note that the mean is very much different from the
other two measures of center because of a few large data values. Also, the sample variance s
2
=16,478
months, and the sample standard deviation s =128.36 months.
Example 1.5.2
For the data of Example 1.5.1, nd lower and upper quartiles, median, and interquartile range (IQR). Check
for any outliers.
1.5 Numerical Description of Data 29
Solution
Arrange the data in an ascending order.
5 7 12 14 14 14 18 21 22 23
24 25 34 34 37 47 49 64 67 69
125 192 229 453 483
Then the median M is the middle (13th) data value, M = Q
2
= 34. The lower quartile is the middle number
below the median, Q
1
= [(14 +18)/2] = 16. The upper quartile, Q
3
= [(67 +69)/2] = 68.
The interquartile range, (IQR) = Q
3
Q
1
= 68 16 = 52.
To test for outliers, compute
Q
1
1.5(IQR) = 16 1.5(52) = 62
and
Q
3
+1.5(IQR) = 68 +1.5(52) = 146.
Then all the data that fall above 146 are possible outliers. None is below 62. Therefore the outliers are 192,
229, 453, and 483.
We have remarked earlier that the mean as a measure of central location is greatly affected by the
extreme values or outliers. Arobust measure of central location (a measure that is relatively unaffected
by outliers) is the trimmed mean. For 0 1, a 100% trimmed mean is found as follows: Order
the data, and then discard the lowest 100%and the highest 100%of the data values. Find the mean
of the rest of the data values. We denote the 100% trimmed mean by x
i=1
|x
i
x|
n
.
Calculation of average deviation is simple and straightforward.
1.5.1 Numerical Measures for Grouped Data
When we encounter situations where the data are grouped in the form of a frequency table (see
Section 1.4), we no longer have individual data values. Hence, we cannot use the formulas in De-
nition 1.5.1. The following formulas will give approximate values for x and s
2
. Let the grouped data
have l classes, with m
i
being the midpoint and f
i
being the frequency of class i, i = 1, 2, . . . , l. Let
n =
l
i=1
f
i
.
Denition 1.5.4 The mean for a sample of size n,
x =
1
n
l
i=1
f
i
m
i
,
where m
i
is the midpoint of the class i and f
i
is the frequency of the class i.
Similarly the sample variance,
s
2
=
1
n 1
n
i=1
f
i
(m
i
x)
2
=
m
2
i
f
i
i
f
i
m
i
_
2
n
n 1
.
The following example illustrates how we calculate the sample mean for a grouped data.
Example 1.5.4
The grouped data in Table 1.8 represent the number of children frombirth through the end of the teenage
years in a large apartment complex. Find the mean, variance, and standard deviation for these data:
Table 1.8 Number of Children and Their Age Group
Class 03 47 811 1215 1619
Frequency 7 4 19 12 8