# CHAPTER 3

Numerical Descriptive Measures
USING STATISTICS: Evaluating the Performance of Mutual Funds

3.1 MEASURES OF CENTRAL TENDENCY, VARIATION, AND SHAPE
The Mean
The Median
The Mode
Quartiles
The Geometric Mean
The Range
The Interquartile Range
The Variance and the Standard Deviation
The Coefficient of Variation
Z Scores
Shape
Visual Explorations: Exploring Descriptive Statistics
Microsoft Excel Descriptive Statistics Output
Minitab Descriptive Statistics Output

3.2 NUMERICAL DESCRIPTIVE MEASURES FOR A POPULATION
The Population Mean
The Population Variance and Standard Deviation
The Empirical Rule
The Chebychev Rule

3.3 COMPUTING NUMERICAL DESCRIPTIVE MEASURES FROM A FREQUENCY DISTRIBUTION

3.4 EXPLORATORY DATA ANALYSIS
The Five-Number Summary
The Box-and-Whisker Plot

3.5 THE COVARIANCE AND THE COEFFICIENT OF CORRELATION
The Covariance
The Coefficient of Correlation

3.6 PITFALLS IN NUMERICAL DESCRIPTIVE MEASURES AND ETHICAL ISSUES

A.3 USING SOFTWARE FOR DESCRIPTIVE STATISTICS
A.3.1 Microsoft Excel
A.3.2 Minitab
A.3.3 (CD-ROM Topic) SPSS

LEARNING OBJECTIVES
In this chapter, you learn:
• To describe the properties of central tendency, variation,
and shape in numerical data
• To calculate descriptive summary measures for a population
• To construct and interpret a box-and-whisker plot
• To describe the covariance and the coefficient of correlation


USING STATISTICS
Evaluating the Performance of Mutual Funds
Return to the study of mutual funds introduced in Chapter 2. You want to
decide which types of mutual funds to invest in. In the last chapter you
learned how to present data in tables and charts. However, when dealing
with numerical data, such as the return on investments in mutual funds in
2003, you also need to summarize the data, and ask statistical questions.
What is the central tendency for returns of the various funds? For example, what is the mean return in 2003 for the low-risk, average-risk, and
high-risk mutual funds? How much variability is present in the returns?
Are the returns for high-risk funds more variable than for average-risk
funds or low-risk funds? How can you use this information when deciding
what mutual funds to invest in?

For numerical variables, you need more than the visual picture of a variable that you get from the graphs discussed in Chapter 2. For example, for the 2003 returns, you would like to determine not only whether the riskier funds had a higher 2003 return, but whether they also had greater variation, and how the returns for each risk group were distributed. You also want to examine whether there is a relationship between the expense ratio and the 2003 return. Reading this chapter will allow you to learn about some of the methods to measure:

• central tendency, the extent to which all of the data values group around a central value
• variation, the amount of dispersion or scattering of values away from a central point
• shape, the pattern of the distribution of values from the lowest value to the highest value

You will also learn about the covariance and the coefficient of correlation that help measure the
strength of the association between two numerical variables.

3.1 MEASURES OF CENTRAL TENDENCY, VARIATION, AND SHAPE
You can characterize any set of data by measuring its central tendency, variation, and shape.
Most sets of data show a distinct central tendency to group around a central point. When people talk about an “average value” or the “middle value” or the most popular or frequent value,
they are talking informally about the mean, median, and mode, three measures of central tendency.
Variation measures the spread or dispersion of values in a data set. One simple measure of
variation is the range, the difference between the highest and lowest value. More commonly
used in statistics are the standard deviation and variance, two measures explained later in this
section. The shape of a data set represents a pattern of all the values from the lowest to highest
value. As you will learn later in this section, many data sets have a pattern that looks approximately like a bell, with a peak of values somewhere in the middle.


The Mean
The arithmetic mean (typically referred to as the mean) is the most common measure of central tendency. The mean is the only common measure in which all the values play an equal role.
The mean serves as a “balance point” in a set of data (like the fulcrum on a seesaw). You calculate the mean by adding together all the values in a data set and then dividing that sum by the
number of values in the data set.
The symbol $\bar{X}$, called X bar, is used to represent the mean of a sample. For a sample containing n values, the equation for the mean of a sample is written as

$$\bar{X} = \frac{\text{sum of the values}}{\text{number of values}}$$

Using the series $X_1, X_2, \ldots, X_n$ to represent the set of n values and n to represent the number of values, the equation becomes:

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

By using summation notation (discussed fully in Appendix B), you replace the numerator $X_1 + X_2 + \cdots + X_n$ by the term $\sum_{i=1}^{n} X_i$ that means sum all the $X_i$ values from the first X value, $X_1$, to the last X value, $X_n$, to form Equation (3.1), a formal definition of the sample mean.

SAMPLE MEAN
The sample mean is the sum of the values divided by the number of values.

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{3.1}$$

where
$\bar{X}$ = sample mean
n = number of values or sample size
$X_i$ = ith value of the variable X
$\sum_{i=1}^{n} X_i$ = summation of all $X_i$ values in the sample


Because all the values play an equal role, a mean will be greatly affected by any value that
is greatly different from the others in the data set. When you have such extreme values, you
should avoid using the mean.
The mean can suggest what is a “typical” or central value for a data set. For example, if you
knew the typical time it takes you to get ready in the morning, you might be able to better plan
your morning and minimize any excessive lateness (or earliness) going to your destination.
Suppose you define the time to get ready as the time in minutes (rounded to the nearest minute)
from when you get out of bed to when you leave your home. You collect the times shown below
for 10 consecutive work days:
Day:              1   2   3   4   5   6   7   8   9  10
Time (minutes):  39  29  43  52  39  44  40  31  44  35


TIMES

The mean time is 39.6 minutes, computed as follows:

$$\bar{X} = \frac{\text{sum of the values}}{\text{number of values}} = \frac{\sum_{i=1}^{n} X_i}{n}$$

$$\bar{X} = \frac{39 + 29 + 43 + 52 + 39 + 44 + 40 + 31 + 44 + 35}{10} = \frac{396}{10} = 39.6$$

Even though no one day in the sample actually had the value 39.6 minutes, allotting about 40
minutes to get ready would be a good rule for planning your mornings, but only because the 10
days do not contain extreme values.
Contrast this to the case in which the value on day four was 102 minutes instead of 52 minutes. This extreme value would cause the mean to rise to 44.6 minutes as follows:
$$\bar{X} = \frac{\text{sum of the values}}{\text{number of values}} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{446}{10} = 44.6$$

The one extreme value has increased the mean by more than 10% from 39.6 to 44.6 minutes. In
contrast to the original mean that was in the “middle,” greater than 5 of the get-ready times
(and less than the 5 other times), the new mean is greater than 9 of the 10 get-ready times. The
extreme value has caused the mean to be a poor measure of central tendency.
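
The effect of a single extreme value on the mean can be checked with a short Python sketch. It applies Equation (3.1) to the get-ready times from the text; the function name `sample_mean` and the variable names are illustrative choices, not part of the original material.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("the sample must contain at least one value")
    return sum(values) / len(values)

# Get-ready times (minutes) for the 10 consecutive work days in the text.
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
mean_original = sample_mean(times)

# Same data, but with day 4 changed from 52 to 102 minutes.
times_extreme = times.copy()
times_extreme[3] = 102
mean_extreme = sample_mean(times_extreme)

print(round(mean_original, 1))  # 39.6
print(round(mean_extreme, 1))   # 44.6
```

Only one of the ten values changed, yet the mean moved by 5 minutes, which is the sensitivity to extreme values described above.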
EXAMPLE 3.1

THE MEAN 2003 RETURN FOR SMALL CAP MUTUAL FUNDS WITH HIGH RISK
The 121 mutual funds that are part of the “Using Statistics” scenario (see page 72) are classified
according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid
cap, and large cap). Compute the mean 2003 return for the small cap mutual funds with high risk.
SOLUTION The mean 2003 return for the small cap mutual funds with high risk (MUTUALFUNDS2004) is 51.53, calculated as follows:

$$\bar{X} = \frac{\text{sum of the values}}{\text{number of values}} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{463.8}{9} = 51.53$$

The ordered array for the nine small cap mutual funds with high risk is:
37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5

Four of these returns are below the mean of 51.53 and five of these returns are above the mean.


The Median
The median is the value that splits a ranked set of data into two equal parts. The median is not
affected by extreme values, so you can use the median when extreme values are present.
The median is the middle value in a set of data that has been ordered from lowest to highest
value.
To calculate the median for a set of data, you first rank the values from smallest to largest.
Then use Equation (3.2) to compute the rank of the value that is the median.
MEDIAN
50% of the values are smaller than the median and 50% of the values are larger than the
median.

$$\text{Median} = \frac{n + 1}{2}\ \text{ranked value} \tag{3.2}$$

You compute the median value by following one of two rules:

Rule 1 If there are an odd number of values in the data set, the median is the middle
ranked value.
Rule 2 If there are an even number of values in the data set, then the median is the average
of the two middle ranked values.

To compute the median for the sample of 10 times to get ready in the morning, you rank the
daily times as follows:
Ranked values:  29  31  35  39  39  40  43  44  44  52
Ranks:           1   2   3   4   5   6   7   8   9  10

Median = 39.5
Because the result of dividing n + 1 by 2 is (10 + 1)/2 = 5.5 for this sample of 10, you must use
Rule 2 and average the fifth and sixth ranked values, 39 and 40. Therefore, the median is 39.5.
The median of 39.5 means that for half of the days, the time to get ready is less than or equal to
39.5 minutes, and for half of the days the time to get ready is greater than or equal to 39.5 minutes. The median time to get ready of 39.5 minutes is very close to the mean time to get ready
of 39.6 minutes.
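
The two median rules can be sketched in Python as follows. This is a minimal illustration of Equation (3.2) with Rule 1 and Rule 2; the function name `sample_median` is an assumption made for this example.

```python
def sample_median(values):
    """Return the median: the (n + 1)/2 ranked value of the sorted data."""
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("the sample must contain at least one value")
    middle = n // 2
    if n % 2 == 1:
        # Rule 1: odd number of values, take the middle ranked value.
        return ranked[middle]
    # Rule 2: even number of values, average the two middle ranked values.
    return (ranked[middle - 1] + ranked[middle]) / 2

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
print(sample_median(times))  # 39.5
```

For the 10 get-ready times, the fifth and sixth ranked values (39 and 40) are averaged, matching the hand calculation.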

EXAMPLE 3.2

COMPUTING THE MEDIAN FROM AN ODD-SIZED SAMPLE
The 121 mutual funds that are part of the “Using Statistics” scenario (see page 72) are classified according to the risk level of the mutual funds (low, average, and high) and type (small
cap, mid cap, and large cap). Compute the median 2003 return for the nine small cap mutual
funds with high risk. MUTUALFUNDS2004
SOLUTION Because the result of dividing n + 1 by 2 is (9 + 1)/2 = 5 for this sample of nine,
using Rule 1, the median is the fifth ranked value. The percentage returns in 2003 for the nine
small cap mutual funds with high risk are ranked from the smallest to the largest:

Ranked values:  37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5
Ranks:             1     2     3     4     5     6     7     8     9

Median = 53.8

The median return is 53.8. Half the small cap high-risk mutual funds have returns equal to or below 53.8 and half have returns equal to or above 53.8.

The Mode
The mode is the value in a set of data that appears most frequently. Like the median and unlike the mean, extreme values do not affect the mode. You should use the mode only for descriptive purposes as it is more variable from sample to sample than either the mean or the median. Often there is no mode or there are several modes in a set of data. For example, consider the time to get ready data shown below.

29  31  35  39  39  40  43  44  44  52

There are two modes, 39 minutes and 44 minutes, since each of these values occurs twice.

EXAMPLE 3.3

COMPUTING THE MODE
A systems manager in charge of a company’s network keeps track of the number of server failures that occur in a day. Compute the mode for the following data that represents the number of server failures in a day for the past two weeks.

1  3  0  3  26  2  7  4  0  2  3  3  6  3

SOLUTION The ordered array for these data is

0  0  1  2  2  3  3  3  3  3  4  6  7  26

Because 3 appears five times, more times than any other value, the mode is 3. Thus, the systems manager can say that the most common occurrence is having three server failures in a day. For this data set, the median is also equal to 3 while the mean is equal to 4.5. The extreme value 26 is an outlier. For these data, the median and the mode better measure central tendency than the mean.

A set of data will have no mode if none of the values is “most typical.” Example 3.4 presents a data set with no mode.

EXAMPLE 3.4

DATA WITH NO MODE
Compute the mode for the 2003 return for the small cap mutual funds with high risk. MUTUALFUNDS2004

SOLUTION The ordered array for these data is

37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5

These data have no mode. None of the values is most typical because each value appears once.
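
The three situations described above (one mode, several modes, no mode) can be reproduced with a small Python sketch. The helper name `modes` and its convention of returning an empty list when every value appears once are assumptions made for this illustration.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value appears once, so none is "most typical".
        return []
    return sorted(v for v, c in counts.items() if c == highest)

failures = [1, 3, 0, 3, 26, 2, 7, 4, 0, 2, 3, 3, 6, 3]
times = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]
returns = [37.3, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]

print(modes(failures))  # [3]
print(modes(times))     # [39, 44]
print(modes(returns))   # []
```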

Quartiles
Quartiles split a set of data into four equal parts: the first quartile Q1 divides the smallest 25.0% of the values from the other 75.0% that are larger. The second quartile Q2 is the median: 50.0% of the values are smaller than the median and 50.0% are larger. The third quartile Q3 divides the smallest 75.0% of the values from the largest 25.0%. Equations (3.3) and (3.4) define the first and third quartiles.¹

FIRST QUARTILE Q1
25.0% of the values are smaller than Q1, the first quartile, and 75.0% are larger than the first quartile Q1.

$$Q_1 = \frac{n + 1}{4}\ \text{ranked value} \tag{3.3}$$

THIRD QUARTILE Q3
75.0% of the values are smaller than the third quartile Q3, and 25.0% are larger than the third quartile Q3.

$$Q_3 = \frac{3(n + 1)}{4}\ \text{ranked value} \tag{3.4}$$

Use the following rules to calculate the quartiles:

• Rule 1 If the result is a whole number, then the quartile is equal to that ranked value. For example, if the sample size n = 7, the first quartile Q1 is equal to the (7 + 1)/4 = second ranked value.
• Rule 2 If the result is a fractional half (2.5, 4.5, etc.), then the quartile is equal to the average of the corresponding ranked values. For example, if the sample size n = 9, the first quartile Q1 is equal to the (9 + 1)/4 = 2.5 ranked value, halfway between the second ranked value and the third ranked value.
• Rule 3 If the result is neither a whole number nor a fractional half, you round the result to the nearest integer and select that ranked value. For example, if the sample size n = 10, the first quartile Q1 is equal to the (10 + 1)/4 = 2.75 ranked value. Round 2.75 to 3 and use the third ranked value.

To illustrate the computation of the quartiles for the time-to-get-ready data, rank the data from smallest to largest.

Ranked values:  29  31  35  39  39  40  43  44  44  52
Ranks:           1   2   3   4   5   6   7   8   9  10

The first quartile is the (n + 1)/4 = (10 + 1)/4 = 2.75 ranked value. Using the third rule for quartiles, you round up to the third ranked value. The third ranked value for the get-ready time data is 35 minutes. You interpret the first quartile of 35 to mean that on 25% of the days the time to get ready is less than or equal to 35 minutes, and on 75% of the days, the time to get ready is greater than or equal to 35 minutes.

The third quartile is the 3(n + 1)/4 = 3(10 + 1)/4 = 8.25 ranked value. Using the third rule for quartiles, you round this down to the eighth ranked value. The eighth ranked value for the get-ready time data is 44 minutes. You interpret this to mean that on 75% of the days, the time to get ready is less than or equal to 44 minutes, and on 25% of the days, the time to get ready is greater than or equal to 44 minutes.

¹ The Q1, median, and Q3 are also the 25th, 50th, and 75th percentile, respectively. Equations (3.2), (3.3), and (3.4) can be expressed generally in terms of finding percentiles: $(p \times 100)$th percentile $= p \times (n + 1)$ ranked value.
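
The three quartile rules translate into a short Python sketch. Note that these rules are the ones stated in this section; statistical software often uses other interpolation conventions, so library results may differ. The function names `ranked_value` and `quartiles` are illustrative, and `Fraction` is used only to keep the rank positions exact.

```python
from fractions import Fraction

def ranked_value(ranked, position):
    """Return the value at a 1-based rank position, following Rules 1 to 3."""
    lower = int(position)                       # whole-number part of the rank
    if position.denominator == 1:               # Rule 1: whole number
        return ranked[lower - 1]
    if position.denominator == 2:               # Rule 2: fractional half
        return (ranked[lower - 1] + ranked[lower]) / 2
    return ranked[round(position) - 1]          # Rule 3: nearest integer rank

def quartiles(values):
    """Return (Q1, Q3) as the (n + 1)/4 and 3(n + 1)/4 ranked values."""
    ranked = sorted(values)
    n = len(ranked)
    if n < 3:
        raise ValueError("this sketch expects at least three values")
    q1 = ranked_value(ranked, Fraction(n + 1, 4))
    q3 = ranked_value(ranked, Fraction(3 * (n + 1), 4))
    return q1, q3

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
print(quartiles(times))  # (35, 44)
```

For the 10 get-ready times, both positions (2.75 and 8.25) fall under Rule 3, giving the third and eighth ranked values.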

EXAMPLE 3.5

COMPUTING THE QUARTILES
The 121 mutual funds that are part of the “Using Statistics” scenario (see page 72) are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap). Compute the first quartile (Q1) and third quartile (Q3) 2003 return for the small cap mutual funds with high risk. MUTUALFUNDS2004

SOLUTION Ranked from smallest to largest, the percentage return in 2003 for the nine small cap mutual funds with high risk is:

Ranked value:  37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5
Ranks:            1     2     3     4     5     6     7     8     9

For these data

$$Q_1 = \frac{n + 1}{4}\ \text{ranked value} = \frac{9 + 1}{4}\ \text{ranked value} = 2.5\ \text{ranked value}$$

Therefore, using the second rule, Q1 is the 2.5 ranked value, halfway between the second ranked value and the third ranked value. Since the second ranked value is 39.2 and the third ranked value is 44.2, the first quartile Q1 is halfway between 39.2 and 44.2. Thus,

$$Q_1 = \frac{39.2 + 44.2}{2} = 41.7$$

To find the third quartile Q3

$$Q_3 = \frac{3(n + 1)}{4}\ \text{ranked value} = \frac{3(9 + 1)}{4}\ \text{ranked value} = 7.5\ \text{ranked value}$$

Therefore, using the second rule, Q3 is the 7.5 ranked value, halfway between the seventh ranked value and the eighth ranked value. Since the seventh ranked value is 59.3 and the eighth ranked value is 62.4, the third quartile Q3 is halfway between 59.3 and 62.4. Thus,

$$Q_3 = \frac{59.3 + 62.4}{2} = 60.85$$

The first quartile of 41.7 indicates that 25% of the returns in 2003 for small cap high-risk funds are below or equal to 41.7 and 75% are greater than or equal to 41.7. The third quartile of 60.85 indicates that 75% of the returns in 2003 for small cap high-risk funds are below or equal to 60.85 and 25% are greater than or equal to 60.85.

The Geometric Mean
The geometric mean measures the rate of change of a variable over time. The geometric mean and the geometric rate of return measure the status of an investment over time. Equation (3.5) defines the geometric mean.

GEOMETRIC MEAN
The geometric mean is the nth root of the product of n values

$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n} \tag{3.5}$$

Equation (3.6) defines the geometric mean rate of return.

GEOMETRIC MEAN RATE OF RETURN

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1 \tag{3.6}$$

where
$R_i$ is the rate of return in time period i

To illustrate using these measures, consider an investment of \$100,000 that declined to a value of \$50,000 at the end of year 1 and then rebounded back to its original \$100,000 value at the end of year 2. The rate of return for this investment for the two-year period is 0, because the starting and ending value of the investment is unchanged. However, the arithmetic mean of the yearly rates of return of this investment is

$$\bar{X} = \frac{(-0.50) + (1.00)}{2} = 0.25 \text{ or } 25\%$$

since the rate of return for year 1 is

$$R_1 = \left(\frac{50{,}000 - 100{,}000}{100{,}000}\right) = -0.50 \text{ or } -50\%$$

and the rate of return for year 2 is

$$R_2 = \left(\frac{100{,}000 - 50{,}000}{50{,}000}\right) = 1.00 \text{ or } 100\%$$

Using Equation (3.6), the geometric mean rate of return for the two years is

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2)]^{1/n} - 1 = [(1 + (-0.50)) \times (1 + (1.0))]^{1/2} - 1 = [(0.50) \times (2.0)]^{1/2} - 1 = [1.0]^{1/2} - 1 = 1 - 1 = 0$$

Thus, the geometric mean rate of return more accurately reflects the (zero) change in the value of the investment for the two-year period than does the arithmetic mean.
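
The contrast between the arithmetic mean and the geometric mean rate of return can be checked with a few lines of Python. This is a minimal sketch of Equation (3.6); the function name is illustrative, and it assumes every rate is greater than −1 (an investment cannot lose more than 100%) and Python 3.8 or later for `math.prod`.

```python
import math

def geometric_mean_rate_of_return(rates):
    """Return [(1 + R1) x ... x (1 + Rn)]^(1/n) - 1 for rates given as decimals."""
    if not rates:
        raise ValueError("at least one rate of return is required")
    growth = math.prod(1 + r for r in rates)   # compounded growth factor
    return growth ** (1 / len(rates)) - 1

# The $100,000 investment: -50% in year 1, +100% in year 2.
yearly_rates = [-0.50, 1.00]

arithmetic = sum(yearly_rates) / len(yearly_rates)
geometric = geometric_mean_rate_of_return(yearly_rates)

print(arithmetic)  # 0.25
print(geometric)   # 0.0
```

The geometric result of 0 agrees with the fact that the investment ends where it started, while the arithmetic mean of 25% does not.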

EXAMPLE 3.6

COMPUTING THE GEOMETRIC MEAN RATE OF RETURN
The percentage change in the NASDAQ Composite Index was −31.53% in 2002 and +50.01% in 2003. Compute the geometric rate of return.

SOLUTION Using Equation (3.6), the geometric mean rate of return in the NASDAQ Composite Index for the two years is

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2)]^{1/n} - 1 = [(1 + (-0.3153)) \times (1 + (0.5001))]^{1/2} - 1 = [(0.6847) \times (1.5001)]^{1/2} - 1 = [1.0271]^{1/2} - 1 = 1.0135 - 1 = 0.0135$$

The geometric rate of return in the NASDAQ Composite Index for the two years is 1.35%.

The Range
The range is the simplest numerical descriptive measure of variation in a set of data.

RANGE
The range is equal to the largest value minus the smallest value.

$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}} \tag{3.7}$$

To determine the range of the times to get ready, you rank the data from smallest to largest:

29  31  35  39  39  40  43  44  44  52

Using Equation (3.7), the range is 52 − 29 = 23 minutes. The range of 23 minutes indicates that the largest difference between any two days in the time to get ready in the morning is 23 minutes.

EXAMPLE 3.7

COMPUTING THE RANGE IN THE 2003 RETURN OF SMALL CAP HIGH-RISK MUTUAL FUNDS
The 121 mutual funds that are part of the “Using Statistics” scenario (see page 72) are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap). Compute the range of the 2003 return for the small cap mutual funds with high risk. MUTUALFUNDS2004

SOLUTION Ranked from the smallest to the largest, the 2003 return for the nine small cap mutual funds with high risk is:

37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5

Therefore, using Equation (3.7), the range = 66.5 − 37.3 = 29.2. The largest difference between any two returns for the small cap mutual funds with high risk is 29.2.

The range measures the total spread in the set of data. Although the range is a simple measure of total variation in the data, it does not take into account how the data are distributed between the smallest and largest values. In other words, the range does not indicate if the values are evenly distributed throughout the data set, clustered near the middle, or clustered near one or both extremes. Thus, using the range as a measure of variation when at least one value is an extreme value is misleading.

The Interquartile Range
The interquartile range (also called midspread) is the difference between the third and first quartiles in a set of data.

INTERQUARTILE RANGE
The interquartile range is the difference between the third quartile and the first quartile.

$$\text{Interquartile range} = Q_3 - Q_1 \tag{3.8}$$

The interquartile range measures the spread in the middle 50% of the data; therefore, it is not influenced by extreme values. To determine the interquartile range of the times to get ready

29  31  35  39  39  40  43  44  44  52

you use Equation (3.8) and the earlier results, Q1 = 35 and Q3 = 44.

Interquartile range = 44 − 35 = 9 minutes

Therefore, the interquartile range in the time to get ready is 9 minutes. The interval 35 to 44 is often referred to as the middle fifty.

Because the interquartile range does not consider any value smaller than Q1 or larger than Q3, it cannot be affected by extreme values. Summary measures such as the median, Q1, Q3, and the interquartile range, which cannot be influenced by extreme values, are called resistant measures.

EXAMPLE 3.8

COMPUTING THE INTERQUARTILE RANGE FOR THE 2003 RETURN OF SMALL CAP HIGH-RISK MUTUAL FUNDS
The 121 mutual funds that are part of the “Using Statistics” scenario (see page 72) are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap). Compute the interquartile range of the 2003 return for the small cap mutual funds with high risk. MUTUALFUNDS2004

SOLUTION Ranked from smallest to largest, the 2003 return for the nine small cap mutual funds with high risk is:

37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5

Using Equation (3.8) and the earlier results on page 78, Q1 = 41.7 and Q3 = 60.85.

Interquartile range = 60.85 − 41.7 = 19.15

Therefore, the interquartile range in the 2003 return is 19.15.
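
The difference between the range and a resistant measure such as the interquartile range can be seen by changing one value. The sketch below applies Equations (3.7) and (3.8) to the get-ready times, then repeats the calculation with day 4 changed from 52 to 102 minutes (the extreme value used earlier in the discussion of the mean). The helper `quartile` is a compact, illustrative implementation of the three quartile rules from this section, not a library function.

```python
from fractions import Fraction

def quartile(ranked, k):
    """Return Qk (k = 1 or 3) as the k(n + 1)/4 ranked value of sorted data."""
    position = Fraction(k * (len(ranked) + 1), 4)
    lower = int(position)
    if position.denominator == 1:                # Rule 1: whole number
        return ranked[lower - 1]
    if position.denominator == 2:                # Rule 2: fractional half
        return (ranked[lower - 1] + ranked[lower]) / 2
    return ranked[round(position) - 1]           # Rule 3: nearest rank

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def interquartile_range(values):
    """Third quartile minus first quartile."""
    ranked = sorted(values)
    return quartile(ranked, 3) - quartile(ranked, 1)

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
times_extreme = [39, 29, 43, 102, 39, 44, 40, 31, 44, 35]

print(value_range(times), interquartile_range(times))                  # 23 9
print(value_range(times_extreme), interquartile_range(times_extreme))  # 73 9
```

The range more than triples, while the interquartile range stays at 9 minutes.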

The Variance and the Standard Deviation
Although the range and the interquartile range are measures of variation, they do not take into consideration how the values distribute or cluster between the extremes. Two commonly used measures of variation that take into account how all the values in the data are distributed are the variance and the standard deviation. These statistics measure the “average” scatter around the mean: how larger values fluctuate above it and how smaller values distribute below it.

A simple measure of variation around the mean might take the difference between each value and the mean and then sum these differences. However, if you did that, you would find that because the mean is the balance point in a set of data, for every set of data these differences would sum to zero. One measure of variation that would differ from data set to data set would square the difference between each value and the mean and then sum these squared differences. In statistics, this quantity is called a sum of squares (or SS). This sum is then divided by the number of values minus 1 (for sample data) to get the sample variance ($S^2$). The square root of the sample variance is the sample standard deviation (S).

Because the sum of squares is a sum of squared differences that by the rules of arithmetic will always be nonnegative, neither the variance nor the standard deviation can ever be negative. For most sets of data, the variance and standard deviation will be a positive value, although both of these statistics will be zero if there is no variation at all in a set of data and each value in the sample is the same.

For a sample containing n values, $X_1, X_2, X_3, \ldots, X_n$, the sample variance (given by the symbol $S^2$) is

$$S^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}$$

Equation (3.9) expresses the equation using summation notation.

SAMPLE VARIANCE
The sample variance is the sum of the squared differences around the mean divided by the sample size minus one.

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \tag{3.9}$$

where
$\bar{X}$ = mean
n = sample size
$X_i$ = ith value of the variable X
$\sum_{i=1}^{n} (X_i - \bar{X})^2$ = summation of all the squared differences between the $X_i$ values and $\bar{X}$

SAMPLE STANDARD DEVIATION
The sample standard deviation is the square root of the sum of the squared differences around the mean divided by the sample size minus one.

$$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \tag{3.10}$$

If the denominator were n instead of n − 1, Equation (3.9) [and the inner term in Equation (3.10)] would calculate the average of the squared differences around the mean. However, n − 1 is used because of certain desirable mathematical properties possessed by the statistic $S^2$ that make it appropriate for statistical inference (which will be discussed in Chapter 7). As the sample size increases, the difference between dividing by n or n − 1 becomes smaller and smaller.

You will most likely use the sample standard deviation as your measure of variation [defined in Equation (3.10)]. Unlike the sample variance, which is a squared quantity, the standard deviation is always a number that is in the same units as the original sample data. The standard deviation helps you to know how a set of data clusters or distributes around its mean. For almost all sets of data, the majority of the observed values lie within an interval of plus and minus one standard deviation above and below the mean. Therefore, knowledge of the mean and the standard deviation usually helps define where at least the majority of the data values are clustering.

To hand-calculate the sample variance $S^2$ and the sample standard deviation S:

Step 1: Compute the difference between each value and the mean.
Step 2: Square each difference.
Step 3: Add the squared differences.
Step 4: Divide this total by n − 1 to get the sample variance.
Step 5: Take the square root of the sample variance to get the sample standard deviation.

Table 3.1 shows the first four steps for calculating the variance and standard deviation for the getting ready times data with a mean ($\bar{X}$) equal to 39.6 (see page 74 for the calculation of the mean). The second column of Table 3.1 shows Step 1. The third column of Table 3.1 shows Step 2. The sum of the squared differences (Step 3) is shown at the bottom of Table 3.1. This total is then divided by 10 − 1 = 9 to compute the variance (Step 4).

TABLE 3.1 Computing the Variance of the Getting Ready Times ($\bar{X}$ = 39.6)

Time (X)    Step 1: (Xi − X̄)    Step 2: (Xi − X̄)²
   39             −0.60                  0.36
   29            −10.60                112.36
   43              3.40                 11.56
   52             12.40                153.76
   39             −0.60                  0.36
   44              4.40                 19.36
   40              0.40                  0.16
   31             −8.60                 73.96
   44              4.40                 19.36
   35             −4.60                 21.16
            Step 3: Sum:               412.40
            Step 4: Divide by (n − 1):  45.82
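
The five hand-calculation steps map directly onto a short Python sketch. The function names are illustrative; the comments mark which step each line corresponds to. (The standard library functions `statistics.variance` and `statistics.stdev` compute the same n − 1 quantities and could be used as a cross-check.)

```python
import math

def sample_variance(values):
    """Return the sample variance S^2 using the n - 1 denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample variance needs at least two values")
    mean = sum(values) / n
    differences = [x - mean for x in values]   # Step 1: difference from the mean
    squared = [d ** 2 for d in differences]    # Step 2: square each difference
    sum_of_squares = sum(squared)              # Step 3: add the squared differences
    return sum_of_squares / (n - 1)            # Step 4: divide by n - 1

def sample_std_dev(values):
    """Return the sample standard deviation S."""
    return math.sqrt(sample_variance(values))  # Step 5: take the square root

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
print(round(sample_variance(times), 2))  # 45.82
print(round(sample_std_dev(times), 2))   # 6.77
```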

You can also calculate the variance by substituting values for the terms in Equation (3.9):

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{(39 - 39.6)^2 + (29 - 39.6)^2 + \cdots + (35 - 39.6)^2}{10 - 1} = \frac{412.4}{9} = 45.82$$

Because the variance is in squared units (in squared minutes for these data), to compute the standard deviation you take the square root of the variance. Using Equation (3.10) on page 82, the sample standard deviation S is

$$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} = \sqrt{45.82} = 6.77$$

This indicates that the get-ready times in this sample are clustering within 6.77 minutes around the mean of 39.6 minutes (i.e., clustering between $\bar{X} - 1S = 32.83$ and $\bar{X} + 1S = 46.37$). In fact, 7 out of 10 get-ready times lie within this interval.

Using the second column of Table 3.1, you can also calculate the sum of the differences between each value and the mean to be zero. For any set of data, this sum will always be zero:

$$\sum_{i=1}^{n} (X_i - \bar{X}) = 0 \quad \text{for all sets of data}$$

This property is one of the reasons that the mean is used as the most common measure of central tendency.

EXAMPLE 3.9

COMPUTING THE VARIANCE AND STANDARD DEVIATION OF THE 2003 RETURN OF SMALL CAP HIGH-RISK MUTUAL FUNDS
The 121 mutual funds that are part of the “Using Statistics” scenario (see page 72) are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap). Compute the variance and standard deviation of the 2003 return for the small cap mutual funds with high risk. MUTUALFUNDS2004

SOLUTION Table 3.2 illustrates the computation of the variance and standard deviation for the return in 2003 for the small cap mutual funds with high risk. Using Equation (3.9) on page 82

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{(44.5 - 51.53)^2 + (39.2 - 51.53)^2 + \cdots + (66.5 - 51.53)^2}{9 - 1} = \frac{891.16}{8} = 111.395$$

TABLE 3.2 Computing the Variance of the 2003 Return for the Small Cap Mutual Funds with High Risk ($\bar{X}$ = 51.5333)

Return 2003    Step 1: (Xi − X̄)    Step 2: (Xi − X̄)²
   44.5            −7.0333                49.4678
   39.2           −12.3333               152.1111
   62.4            10.8667               118.0844
   59.3             7.7667                60.3211
   56.6             5.0667                25.6711
   53.8             2.2667                 5.1378
   37.3           −14.2333               202.5878
   44.2            −7.3333                53.7778
   66.5            14.9667               224.0011
               Step 3: Sum:              891.16
               Step 4: Divide by (n − 1): 111.395

Using Equation (3.10) on page 82, the sample standard deviation S is

$$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} = \sqrt{111.395} = 10.55$$

The standard deviation of 10.55 indicates that the 2003 returns for the small cap mutual funds with high risk are clustering within 10.55 around the mean of 51.53 (i.e., clustering between $\bar{X} - 1S = 40.98$ and $\bar{X} + 1S = 62.08$). In fact, 55.6% (5 out of 9) of the 2003 returns lie within this interval.

The following summarizes the characteristics of the range, interquartile range, variance, and standard deviation.

• The more spread out, or dispersed, the data are, the larger the range, interquartile range, variance, and standard deviation.
• The more concentrated, or homogeneous the data are, the smaller the range, interquartile range, variance, and standard deviation.
• If the values are all the same (so that there is no variation in the data), the range, interquartile range, variance, and standard deviation will all equal zero.
• None of the measures of variation (the range, interquartile range, standard deviation, and variance) can ever be negative.

The Coefficient of Variation
Unlike the previous measures of variation presented, the coefficient of variation is a relative measure of variation that is always expressed as a percentage rather than in terms of the units of the particular data. The coefficient of variation, denoted by the symbol CV, measures the scatter in the data relative to the mean.

COEFFICIENT OF VARIATION
The coefficient of variation is equal to the standard deviation divided by the mean, multiplied by 100%.

$$CV = \left(\frac{S}{\bar{X}}\right) 100\% \tag{3.11}$$

where
S = sample standard deviation
$\bar{X}$ = sample mean

For the sample of 10 get-ready times, since $\bar{X} = 39.6$ and S = 6.77, the coefficient of variation is

$$CV = \left(\frac{S}{\bar{X}}\right) 100\% = \left(\frac{6.77}{39.6}\right) 100\% = 17.10\%$$

For the get-ready times, the standard deviation is 17.1% of the size of the mean.

You will find the coefficient of variation very useful when comparing two or more sets of data that are measured in different units as Example 3.10 illustrates.

EXAMPLE 3.10

COMPARING TWO COEFFICIENTS OF VARIATION WHEN TWO VARIABLES HAVE DIFFERENT UNITS OF MEASUREMENT
The operations manager of a package delivery service is deciding on whether to purchase a new fleet of trucks. When packages are stored in the trucks in preparation for delivery, you need to consider two major constraints: the weight (in pounds) and the volume (in cubic feet) for each item. The operations manager samples 200 packages, and finds that the mean weight is 26.0 pounds, with a standard deviation of 3.9 pounds, and the mean volume is 8.8 cubic feet, with a standard deviation of 2.2 cubic feet. How can the operations manager compare the variation of the weight and the volume?

SOLUTION Because the measurement units differ for the weight and volume constraints, the operations manager should compare the relative variability in the two types of measurements. For weight, the coefficient of variation is

$$CV_W = \left(\frac{3.9}{26.0}\right) 100\% = 15\%$$

For volume, the coefficient of variation is

$$CV_V = \left(\frac{2.2}{8.8}\right) 100\% = 25.0\%$$

Thus, relative to the mean, the package volume is much more variable than the package weight.

Z Scores
An extreme value or outlier is a value located far away from the mean. Z scores are useful in identifying outliers. The larger the Z score, the farther the distance from the value to the mean. The Z score is the difference between the value and the mean, divided by the standard deviation.

Z SCORES

$$Z = \frac{X - \bar{X}}{S} \qquad (3.12)$$

For the time to get ready in the morning data, the mean is 39.6 minutes and the standard deviation is 6.77 minutes. The time to get ready on the first day is 39.0 minutes. You compute the Z score for day 1 from

$$Z = \frac{X - \bar{X}}{S} = \frac{39.0 - 39.6}{6.77} = -0.09$$

Table 3.3 shows the Z scores for all 10 days. The largest Z score is 1.83 for day 4, on which the time to get ready was 52 minutes. The lowest Z score is −1.57 for day 2, on which the time to get ready was 29 minutes. As a general rule, a Z score is considered an outlier if it is less than −3.0 or greater than +3.0. None of the times met that criterion to be considered outliers.

TABLE 3.3 Z Scores for the 10 Get-Ready Times

| Time (X) | Z Score |
|---|---|
| 39 | −0.09 |
| 29 | −1.57 |
| 43 | 0.50 |
| 52 | 1.83 |
| 39 | −0.09 |
| 44 | 0.65 |
| 40 | 0.06 |
| 31 | −1.27 |
| 44 | 0.65 |
| 35 | −0.68 |
| Mean | 39.6 |
| Standard deviation | 6.77 |

EXAMPLE 3.11 COMPUTING THE Z SCORES OF THE 2003 RETURN OF SMALL CAP HIGH-RISK MUTUAL FUNDS

The 121 mutual funds that are part of the “Using Statistics” scenario (see page 72) are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap). Compute the Z scores of the 2003 return for the small cap mutual funds with high risk. MUTUALFUNDS2004

SOLUTION Table 3.4 illustrates the Z scores of the 2003 return for the small cap mutual funds with high risk. The largest Z score is 1.42 for a percentage return of 66.5. The lowest Z score is −1.35 for a percentage return of 37.3. As a general rule, a Z score is considered an outlier if it is less than −3.0 or greater than +3.0. None of the percentage returns met that criterion to be considered outliers.
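The coefficient of variation in Equation (3.11) and the Z scores in Equation (3.12) can be sketched in a few lines of Python. This is only an illustration, not part of the text's Excel or Minitab workflow: the function names `coefficient_of_variation` and `z_scores` are hypothetical, and the data are the 10 get-ready times from Table 3.3.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample CV as a percentage: (S / X-bar) * 100, as in Equation (3.11)."""
    return stdev(values) / mean(values) * 100

def z_scores(values):
    """Z = (X - X-bar) / S for each value, as in Equation (3.12).

    stdev() is the sample standard deviation (divides by n - 1).
    """
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]

# Get-ready times in minutes, taken from Table 3.3
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

cv = coefficient_of_variation(times)
z = z_scores(times)

# General rule from the text: a Z score below -3.0 or above +3.0 flags an outlier
outliers = [x for x, zi in zip(times, z) if abs(zi) > 3.0]

print(f"mean = {mean(times):.1f}, S = {stdev(times):.2f}, CV = {cv:.1f}%")
print("Z scores:", [round(zi, 2) for zi in z])
print("outliers:", outliers)
```

Because the Z score divides by S, it is unit-free in the same way the coefficient of variation is, which is what makes the ±3.0 rule usable for minutes and for percentage returns alike.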

TABLE 3.4 Z Scores of the 2003 Return for the Small Cap Mutual Funds with High Risk

| Return 2003 | Z Scores |
|---|---|
| 44.5 | −0.67 |
| 39.2 | −1.17 |
| 62.4 | 1.03 |
| 59.3 | 0.74 |
| 56.6 | 0.48 |
| 53.8 | 0.21 |
| 37.3 | −1.35 |
| 44.2 | −0.69 |
| 66.5 | 1.42 |
| Mean | 51.53 |
| Standard Deviation | 10.55 |

Shape

A third important property that describes a set of numerical data is shape. Shape is the pattern of the distribution of data values throughout the entire range of all the values. A distribution will either be symmetrical, when low and high values balance each other out, or skewed, not symmetrical and showing an imbalance of low values or high values.

Shape influences the relationship of the mean to the median in the following ways:

• Mean < median: negative, or left-skewed
• Mean = median: symmetric, or zero skewness
• Mean > median: positive, or right-skewed

Figure 3.1 depicts three data sets, each with a different shape.

FIGURE 3.1 A Comparison of Three Data Sets Differing in Shape (Panel A: Negative, or left-skewed; Panel B: Symmetrical; Panel C: Positive, or right-skewed)

The data in panel A are negative, or left-skewed. In this panel, most of the values are in the upper portion of the distribution. There is a long tail and distortion to the left that is caused by some extremely small values. These extremely small values pull the mean downward so that the mean is less than the median.

The data in panel B are symmetrical. Each half of the curve is a mirror image of the other half of the curve. The low and high values on the scale balance, and the mean equals the median.

The data in panel C are positive, or right-skewed. In this panel, most of the values are in the lower portion of the distribution. There is a long tail on the right of the distribution and a distortion to the right that is caused by some extremely large values. These extremely large values pull the mean upward so that the mean is greater than the median.

Visual Explorations: Exploring Descriptive Statistics

Use the Visual Explorations Descriptive Statistics procedure to see the effect of changing data values on measures of central tendency, variation, and shape. Open the Visual Explorations.xla macro workbook and select VisualExplorations → Descriptive Statistics from the Microsoft Excel menu bar. Read the instructions in the popup box (see illustration on page 89) and click OK to examine a dot scale diagram for the sample of 10 get-ready times used throughout this chapter.

Experiment by entering an extreme value such as 10 minutes into one of the tinted cells of column A. Which measures are affected by this change? Which ones are not? You can flip between the “before” and “after” diagrams by repeatedly pressing Ctrl-Z (undo) followed by Ctrl-Y (redo) to help see the changes the extreme value caused in the diagram.

Microsoft Excel Descriptive Statistics Output

The Microsoft Excel Data Analysis ToolPak generates the mean, median, mode, standard deviation, variance, range, minimum, maximum, and count (sample size) on a single worksheet, all of which have been discussed in this section. In addition, Excel computes the standard error, along with statistics for kurtosis and skewness. The standard error is the standard deviation divided by the square root of the sample size and will be discussed in Chapter 7. Skewness measures the lack of symmetry in the data and is based on a statistic that is a function of the cubed differences around the mean. A skewness value of zero indicates a symmetric distribution. Kurtosis measures the relative concentration of values in the center of the distribution as compared with the tails and is based on the differences around the mean raised to the fourth power. This measure is not discussed in this text (see reference 2).

FIGURE 3.2 Microsoft Excel Descriptive Statistics of the 2003 Returns Based on Risk Level

From Figure 3.2, the Excel descriptive statistics output for the 2003 return of the funds based on risk level, there appear to be slight differences in the 2003 percentage return for the three risk levels. High-risk funds had a slightly higher mean and median than did low-risk and average-risk funds. There was very little difference in the standard deviations of the three groups.

Minitab Descriptive Statistics Output

For descriptive statistics, Minitab computes the sample size (labeled as N), the mean, median, standard deviation (labeled StDev), coefficient of variation (labeled CoefVar), minimum, maximum, range, first and third quartiles, and interquartile range (labeled IQR), all of which have been discussed in this section.

FIGURE 3.3 Minitab Descriptive Statistics of the 2003 Returns Based on Risk Level

From Figure 3.3, the Minitab descriptive statistics output for the 2003 return of the funds based on risk level, there appear to be slight differences in the 2003 percentage return for the three risk levels. High-risk funds had a slightly higher mean, median, and quartiles than did low-risk and average-risk funds. There was very little difference in the standard deviations or interquartile ranges of the three groups.

PROBLEMS FOR SECTION 3.1

Learning the Basics

3.1 The following is a set of data from a sample of n = 5:
7 4 9 8 2
a. Compute the mean, median, and mode.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Compute the Z scores. Are there any outliers?
d. Describe the shape of the data set.

3.2 The following is a set of data from a sample of n = 6:
7 4 9 7 3 12
a. Compute the mean, median, and mode.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Compute the Z scores. Are there any outliers?
d. Describe the shape of the data set.

3.3 The following set of data is from a sample of n = 7:
12 7 4 9 0 7 3
a. Compute the mean, median, and mode.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Describe the shape of the data set.

3.4 The following is a set of data from a sample of n = 5:
7 −5 −8 7 9
a. Compute the mean, median, and mode.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Describe the shape of the data set.

For each variable (calories and fat),
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the variance, standard deviation, range, interquartile range, coefficient of variation, and Z scores. Are there any outliers? Explain.
c. Are the data skewed? If so, how?
d. Based on the results of (a) through (c), what conclusions can you reach concerning the calories and fat in iced coffee drinks at Dunkin’ Donuts and Starbucks?

3.11 The following data represent the daily hotel cost and rental car cost for 20 U.S. cities during a week in October 2003. HOTEL-CAR

| City | Hotel | Cars |
|---|---|---|
| San Francisco | 205 | 47 |
| Los Angeles | 179 | 41 |
| Seattle | 185 | 49 |
| Phoenix | 210 | 38 |
| Denver | 128 | 32 |
| Dallas | 145 | 48 |
| Houston | 177 | 49 |
| Minneapolis | 117 | 41 |
| Chicago | 221 | 56 |
| St. Louis | 159 | 41 |
| New Orleans | 205 | 50 |
| Detroit | 128 | 32 |
| Cleveland | 165 | 34 |
| Atlanta | 180 | 46 |
| Orlando | 198 | 41 |
| Miami | 158 | 40 |
| Pittsburgh | 132 | 39 |
| Boston | 283 | 67 |
| New York | 269 | 69 |
| Washington, D.C. | 204 | 40 |

Source: Extracted from The Wall Street Journal, October 10, 2003, W4.

For each variable (hotel cost and rental car cost),
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the variance, standard deviation, range, interquartile range, coefficient of variation, and Z scores. Are there any outliers? Explain.
c. Are the data skewed? If so, how?
d. Based on the results of (a) through (c), what conclusions can you reach concerning the daily cost of a hotel and rental car?

3.12 The cost of 14 models of 3-megapixel digital cameras at a camera specialty store during 2003 was as follows. CAMERA
340 450 450 280 220 340 290 370 400 310 340 430 270 380
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the variance, standard deviation, range, interquartile range, coefficient of variation, and Z scores. Are there any outliers? Explain.
c. Are the data skewed? If so, how?
d. Based on the results of (a) through (c), what conclusions can you reach concerning the price of 3-megapixel digital cameras at a camera specialty store during 2003?

3.13 A software development and consulting firm located in the Phoenix metropolitan area develops software for supply chain management systems using systematic software reuse. Instead of starting from scratch when writing and developing new custom software systems, the firm uses a database of reusable components totaling more than 2,000,000 lines of code collected from 10 years of continuous reuse effort. Eight analysts at the firm were asked to estimate the reuse rate when developing a new software system. The following data are given as a percentage of the total code written for a software system that is part of the reuse database. REUSE
50 62.5 37.5 75.0 45.0 47.5 15.0 25.0
Source: M. A. Rothenberger and K. J. Dooley, “A Performance Measure for Software Reuse Projects,” Decision Sciences, 30 (Fall 1999), 1131–1153.
a. Compute the mean, median, and mode.
b. Compute the range, variance, and standard deviation.
c. Interpret the summary measures calculated in (a) and (b).

3.14 A manufacturer of flashlight batteries took a sample of 13 batteries from a day’s production and used them continuously until they were drained. The numbers of hours they were used until failure were: BATTERIES
342 426 317 545 264 451 1,049 631 512 266 492 562 298
a. Compute the mean, median, and mode. Looking at the distribution of times to failure, which measures of location do you think are most appropriate and which least appropriate to use for these data? Why?
b. Calculate the range, variance, and standard deviation.
c. What would you advise if the manufacturer wanted to be able to say in advertisements that these batteries “should last 400 hours”? (Note: There is no right answer to this question; the point is to consider how to make such a statement precise.)
d. Suppose that the first value was 1,342 instead of 342. Repeat (a) through (c), using this value. Comment on the difference in the results.

3.15 A bank branch located in a commercial district of a city has developed an improved process for serving customers during the noon to 1:00 P.M. lunch period. The waiting time in minutes (defined as the time the customer enters the line to

68 5.90 −9. and the money market deposit.19 3. b.66 5.64 4. and Shape when he or she reaches the teller window) of all customers during this hour is recorded over a period of one week.44 −6.79 a. A random sample of 15 customers is selected.20 The time period from 2000 to 2003 saw a great deal of volatility in the value of metals. 2004. Compare the results of (b) to those of problems 3.79 8. and the Wilshire 5000 Index. 2004. Year Platinum Gold Silver 2003 2002 2001 2000 34. Passenger car sales increased 61% in 2002 and 55% in 2003 (Peter Wonacott.” On the basis of the results of (a) and (b).89 Source: Extracted from The Wall Street Journal.38 5. Are there any outliers? Explain.8 24. Compare the results of (b) to those of problems 3. Variation. and silver from 2000 to 2003.01 8.40 −20. As a customer walks into the branch office during the lunch hour. gold.74 3.02 1. a. Year One Year 30 Month Money Market 2003 2002 2001 2000 1. a.3 −23. What conclusions can you reach concerning the geometric rates of return of the four stock indexes? c.12 6. the Standard & Poor’s 500. standard deviation. evaluate the accuracy of this statement. interquartile range. and the results are as follows: BANK1 4. The waiting time in minutes (defined as the time the customer enters the line to the time he or she reaches the teller window) of all customers during these hours is recorded over a period of one week. and coefficient of variation. and Z scores.9 Source: Extracted from The Wall Street Journal. What conclusions can you reach concerning the geometric rates of return of the three deposits? c. and third quartile.5 24.5 −21. Compute the variance.35 10. and silver. b.0 −5.0 5. Compute the variance.M.18 (b) and 3. 3.98 3.20 (b).21 5.03 −3.09 Source: Extracted from The Wall Street Journal. The branch manager replies.2 24. Calculate the geometric rate of return for the Dow Jones Industrial Index. the Standard & Poor’s 500. 
range.17 China is the fastest-growing market for passenger car sales and fourth biggest after the United States.1: Measures of Central Tendency.” The Wall Street Journal. gold. What conclusions can you reach concerning the geometric rates of return of the three metals? c.93 3. b. and the money market deposit from 2000 to 2003. b. range.82 8. is also concerned with the noon to 1 P. 2004. and the results are as follows: BANK2 9.19 The time period from 2000 to 2003 saw a great deal of volatility in the value of investments. standard deviation. January 2.20 26.50 6. the Russell 2000 Index. January 2.61 1.76 2.97 −10. Compute the mean.18 (b) and 3.” On the basis of the results of (a) and (b). evaluate the accuracy of this statement. Japan. The data in the following table BANKRETURN represent the total rate of return of the one-year certificate of deposit. Calculate the geometric rate of return for the one-year certificate of deposit. Compute the geometric mean rate of increase.46 1. The data in the following table STOCKRETURN represent the total rate of return of the Dow Jones Industrial Index.34 3. 3.60 5.20 1. a.73 2. and third quartile.40 −22.90 −10.08 6.20 (b).55 3. first quartile. how? d.02 29. January 2.64 0. As a customer walks into the branch office during the lunch hour.73 3. he asks the branch manager how long he can expect to wait.54 3.40 −21. Year DJIA SP500 Russell2000 Wilshire5000 2003 2002 2001 2000 25. “Almost certainly less than five minutes.91 5.02 5.10 45. the 30-month certificate of deposit.10 −11.61.16 Suppose that another branch.77 2. and Germany. 2004.47 a. “Almost certainly less than five minutes. lunch hour. (Hint: Denote an increase of 61% as R1 = 0. Are the data skewed? If so.49 6. median. c.2 1.46 6. the Russell 2000 Index. 3. Compute the mean. interquartile range. c.01 −5. the 30-month certificate of deposit.3 19.97 5. median.5 1.19 (b) and 3.5 −3. how? d. . and the Wilshire 5000 Index from 2000 to 2003. 
The data in the following table METALRETURN represent the total rate of return for platinum.) SELF Test 3. she asks the branch manager how long she can expect to wait. first quartile. “A Fear Amid China’s Car Boom. The branch manager replies.58 −1. 3. A random sample of 15 customers is selected.19 (b).90 8. b.02 5.17 9. A17).10 0. Calculate the geometric rate of return for platinum. Are there any outliers? Explain.30 −15. February 2. Compare the results of (b) to those of problems 3. Are the data skewed? If so.13 4. coefficient of variation.18 The time period from 2000 to 2003 saw a great deal of volatility in the value of stocks.20 4. located in a residential area.

3.2 NUMERICAL DESCRIPTIVE MEASURES FOR A POPULATION

Section 3.1 presented various statistics that described the properties of central tendency, variation, and shape for a sample. If your data set represents numerical measurements for an entire population, you need to calculate and interpret parameters, summary measures for a population. In this section, you will learn about three descriptive population parameters: the population mean, population variance, and population standard deviation.

To help illustrate these parameters, first review Table 3.5 that contains the five biggest bond funds (in terms of total assets) as of March 1, 2004. The 52-week return for each of these funds is also listed. LARGEST BONDS

TABLE 3.5 2003 Return for the Population Consisting of the Five Largest Bond Funds

| Bond Fund | 52-Week Return (in %) |
|---|---|
| Vanguard GNMA | 3.8 |
| Vanguard Total Bond Index | 6.5 |
| Pimco Total Return Admin | 7.0 |
| Pimco Total Return Instl | 7.3 |
| America Bond Fund | 12.9 |

Source: Extracted from The Wall Street Journal, March 25, 2004, C2.

The Population Mean

The population mean is represented by the symbol µ, the Greek lowercase letter mu. Equation (3.13) defines the population mean.

POPULATION MEAN
The population mean is the sum of the values in the population divided by the population size N.

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad (3.13)$$

where
µ = population mean
$X_i$ = ith value of the variable X
$\sum_{i=1}^{N} X_i$ = summation of all $X_i$ values in the population

To compute the mean return for the population of bond funds given in Table 3.5, use Equation (3.13).

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{3.8 + 6.5 + 7.0 + 7.3 + 12.9}{5} = \frac{37.5}{5} = 7.5$$

Thus, the mean 2003 return for these bond funds is 7.5%.

The Population Variance and Standard Deviation

The population variance and the population standard deviation measure variation in a population. Like the related sample statistics, the population standard deviation is the square root of the population variance. The symbol σ², the Greek lowercase letter sigma squared, represents the population variance and the symbol σ, the Greek lowercase letter sigma, represents the population standard deviation. Equations (3.14) and (3.15) define these parameters. The denominators for the right-side terms in these equations use N and not the (n − 1) term that is used in the equations for the sample variance and standard deviation [see Equations (3.9) and (3.10) on page 82].

POPULATION VARIANCE
The population variance is the sum of the squared differences around the population mean divided by the population size N.

$$\sigma^2 = \frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N} \qquad (3.14)$$

where
µ = population mean
$X_i$ = ith value of the variable X
$\sum_{i=1}^{N}(X_i - \mu)^2$ = summation of all the squared differences between the $X_i$ values and µ

POPULATION STANDARD DEVIATION

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}} \qquad (3.15)$$

To compute the population variance for the data of Table 3.5 on page 94, you use Equation (3.14).

$$\sigma^2 = \frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N} = \frac{(3.8 - 7.5)^2 + (6.5 - 7.5)^2 + (7.0 - 7.5)^2 + (7.3 - 7.5)^2 + (12.9 - 7.5)^2}{5}$$

$$= \frac{13.69 + 1.00 + 0.25 + 0.04 + 29.16}{5} = \frac{44.14}{5} = 8.828$$
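The difference between the population formulas (divide by N) and the sample formulas (divide by n − 1) can be checked with Python's standard `statistics` module. This is a minimal sketch, assuming the five returns are the Table 3.5 values used in the computation above; the variable names are illustrative only.

```python
from statistics import fmean, pvariance, pstdev, variance

# 52-week returns (in %) of the five bond funds, treated as the whole population
returns = [3.8, 6.5, 7.0, 7.3, 12.9]

mu = fmean(returns)                # population mean, Equation (3.13)
sigma2 = pvariance(returns, mu)    # population variance, divides by N (Equation 3.14)
sigma = pstdev(returns, mu)        # population standard deviation (Equation 3.15)

# For contrast only: the sample variance divides by n - 1 instead of N
s2 = variance(returns)

print(f"mu = {mu:.1f}")
print(f"sigma^2 = {sigma2:.3f}, sigma = {sigma:.2f}")
print(f"sample variance (n - 1 denominator) = {s2:.3f}")
```

With only five values the choice of denominator matters noticeably: the same sum of squared differences, 44.14, gives 8.828 when divided by N = 5 but 11.035 when divided by n − 1 = 4.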

Thus, the variance of the returns is 8.828 squared percentage return. The squared units make the variance hard to interpret. You should use the standard deviation that uses the original units of the data (percentage return). From Equation (3.15),

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}} = \sqrt{8.828} = 2.97$$

Therefore, the typical 2003 return differs from the mean of 7.5 by approximately 2.97. This large amount of variation suggests that these large bond funds produce results that differ greatly.

The Empirical Rule

In most data sets, a large portion of the values tend to cluster somewhat near the median. In right-skewed data sets, this clustering occurs to the left of the mean, that is, at a value less than the mean. In left-skewed data sets, the values tend to cluster to the right of the mean, that is, at a value greater than the mean. In symmetrical data sets, where the median and mean are the same, the values often tend to cluster around the median and mean, producing a bell-shaped distribution. You can use the empirical rule to examine the variability in bell-shaped distributions:

• Approximately 68% of the values are within a distance of ±1 standard deviation from the mean.
• Approximately 95% of the values are within a distance of ±2 standard deviations from the mean.
• Approximately 99.7% are within a distance of ±3 standard deviations from the mean.

The empirical rule helps you measure how the values distribute above and below the mean. This can help you to identify outliers when analyzing a set of numerical data. The empirical rule implies that for bell-shaped distributions only about one out of 20 values will be beyond two standard deviations from the mean in either direction. As a general rule, you can consider values not found in the interval µ ± 2σ as potential outliers. The rule also implies that only about three in 1,000 will be beyond three standard deviations from the mean. Therefore, values not found in the interval µ ± 3σ are almost always considered outliers. For heavily skewed data sets, or those not appearing bell-shaped for any other reason, the Chebyshev rule discussed on page 97 should be applied instead of the empirical rule.

EXAMPLE 3.12 USING THE EMPIRICAL RULE

A population of 12-ounce cans of cola is known to have a mean fill-weight of 12.06 ounces and a standard deviation of 0.02. The population is also known to be bell-shaped. Describe the distribution of fill-weights. Is it very likely that a can will contain less than 12 ounces of cola?

SOLUTION

$$\mu \pm \sigma = 12.06 \pm 0.02 = (12.04,\ 12.08)$$
$$\mu \pm 2\sigma = 12.06 \pm 2(0.02) = (12.02,\ 12.10)$$
$$\mu \pm 3\sigma = 12.06 \pm 3(0.02) = (12.00,\ 12.12)$$

Using the empirical rule, approximately 68% of the cans will contain between 12.04 and 12.08 ounces, approximately 95% will contain between 12.02 and 12.10 ounces, and approximately 99.7% will contain between 12.00 and 12.12 ounces. Therefore, it is highly unlikely that a can will contain less than 12 ounces.
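The three intervals in Example 3.12 follow mechanically from µ and σ, so they are easy to tabulate in code. The sketch below is illustrative only; the helper name `empirical_rule_intervals` is hypothetical, and the 68%, 95%, and 99.7% figures are the approximate coverages stated by the empirical rule for bell-shaped data, not values computed from the data.

```python
def empirical_rule_intervals(mu, sigma):
    """Return {k: (low, high, approx_percent)} for k = 1, 2, 3 standard deviations.

    The percentages are the empirical-rule approximations and apply only
    when the distribution is approximately bell-shaped.
    """
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {k: (mu - k * sigma, mu + k * sigma, pct) for k, pct in coverage.items()}

# Cola fill-weights from Example 3.12: mean 12.06 ounces, standard deviation 0.02
intervals = empirical_rule_intervals(12.06, 0.02)

for k, (low, high, pct) in intervals.items():
    print(f"within {k} standard deviation(s): about {pct}% "
          f"between {low:.2f} and {high:.2f} ounces")
```

Since 12.00 ounces is the lower end of the ±3σ interval, only about half of the remaining 0.3% of cans would be expected to fall below it, which is why the example calls an underfilled can highly unlikely.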

The Chebyshev Rule

The Chebyshev rule (reference 1) states that for any data set, regardless of shape, the percentage of values that are found within distances of k standard deviations from the mean must be at least

$$\left(1 - \frac{1}{k^2}\right) \times 100\%$$

You can use this rule for any value of k greater than 1. Consider k = 2. The Chebyshev rule states that at least [1 − (1/2)²] × 100% = 75% of the values must be found within ±2 standard deviations of the mean.

The Chebyshev rule is very general and applies to any type of distribution. The rule indicates at least what percentage of the values fall within a given distance from the mean. However, if the data set is approximately bell-shaped, the empirical rule will more accurately reflect the greater concentration of data close to the mean. Table 3.6 compares the Chebyshev and empirical rules.

TABLE 3.6 How Data Vary Around the Mean

% of Values Found in Intervals Around the Mean

| Interval | Chebyshev (for any distribution) | Empirical Rule (bell-shaped distribution) |
|---|---|---|
| (µ − σ, µ + σ) | At least 0% | Approximately 68% |
| (µ − 2σ, µ + 2σ) | At least 75% | Approximately 95% |
| (µ − 3σ, µ + 3σ) | At least 88.89% | Approximately 99.7% |

EXAMPLE 3.13 USING THE CHEBYSHEV RULE

As in Example 3.12, a population of 12-ounce cans of cola is known to have a mean fill-weight of 12.06 ounces and a standard deviation of 0.02. However, the shape of the population is unknown and you cannot assume that it is bell-shaped. Describe the distribution of fill-weights. Is it very likely that a can will contain less than 12 ounces of cola?

SOLUTION

$$\mu \pm \sigma = 12.06 \pm 0.02 = (12.04,\ 12.08)$$
$$\mu \pm 2\sigma = 12.06 \pm 2(0.02) = (12.02,\ 12.10)$$
$$\mu \pm 3\sigma = 12.06 \pm 3(0.02) = (12.00,\ 12.12)$$

Because the distribution may be skewed, you cannot use the empirical rule. Using the Chebyshev rule, you cannot say anything about the percentage of cans containing between 12.04 and 12.08 ounces. You can state that at least 75% of the cans will contain between 12.02 and 12.10 ounces, and at least 88.89% will contain between 12.00 and 12.12 ounces. Therefore, between 0 and 11.11% of the cans contain less than 12 ounces.

You can use these two rules for understanding how data are distributed around the mean when you have sample data. In each case, use the value you calculated for $\bar{X}$ in place of µ and the value you calculated for S in place of σ. The results you compute using the sample statistics are approximations since you used sample statistics ($\bar{X}$, S) and not population parameters (µ, σ).
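The Chebyshev bound is a one-line formula, so the figures in Table 3.6 and Example 3.13 can be reproduced directly. This is a small sketch with a hypothetical function name, `chebyshev_min_percent`; it accepts k = 1 only to show the trivial 0% bound that appears in Table 3.6.

```python
def chebyshev_min_percent(k):
    """Minimum percentage of values within k standard deviations of the mean.

    Implements (1 - 1/k^2) * 100%. The bound is informative only for k > 1;
    k = 1 gives the trivial bound of 0%.
    """
    if k < 1:
        raise ValueError("k must be at least 1 for the Chebyshev bound")
    return (1 - 1 / k**2) * 100

# Cola fill-weights from Example 3.13: shape of the population unknown
mu, sigma = 12.06, 0.02

for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    print(f"k = {k}: at least {chebyshev_min_percent(k):.2f}% "
          f"between {low:.2f} and {high:.2f} ounces")

# At most this percentage can lie outside mu +/- 3 sigma, so at most this
# percentage of cans can contain less than 12.00 ounces.
max_below_12 = 100 - chebyshev_min_percent(3)
print(f"at most {max_below_12:.2f}% of cans below 12 ounces")
```

The bound is deliberately conservative: it holds for every distribution, which is why it is so much weaker than the empirical rule's 99.7% for the same ±3σ interval.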

a. Compute the mean. In addition.8 10.24 Consider a population of 1.3 13.5 12. whichever is appropriate.5 11.5 9. are there any outliers? Explain.5 (Q1) and 10. Are you surprised at the results in (b)? 3.3 10.7 11. and standard deviation for the population. Compute the population mean. c.2 Learning the Basics 3. Index Bond Fund of America A Franklin Calif.7 11. Compute the variance and standard deviation for this population. b.22 The following is a set of data for a population with N = 10: 7 5 6 6 6 4 8 6 9 3 a.7 10.9 a. and standard deviation for this population. at least 93.5 16.0 8.20 and that σ.5 10. ±2.2 11.5 9. within ±2 standard deviations of the mean? c. Using the results in (c).75% of these funds are expected to have one-year total returns between what two amounts? 3.21 The following is a set of data for a population with N = 10: 7 5 11 8 3 6 2 1 9 8 a.2 10.25 The following table ASSETS represents the assets in billions of dollars of the five largest bond funds. and within ±3 standard deviations of the mean? c. According to the Chebyshev rule.9 9.8 10.6 9.5 10.3 10.1 12.3 8. the mean one-year total percentage return achieved by all the funds.23 The following data represent the quarterly sales tax receipts (in thousands of dollars) submitted to the comptroller of the Village of Fair Lake for the period ending March 2004 by all 50 business establishments in that locale: TAX SELF Test 10. Compute the variance and standard deviation for this population.7 12.5 (Q3). is 2. How have the results changed? 3. to further explain the variation in this data set. what percentage of these funds is expected to be a. Compute the mean. b.024 mutual funds that primarily invested in large companies.1 9.6 a. Interpret the standard deviation. suppose you determined that the range in the one-year total returns is from −2. Compute the population standard deviation.1 6. is 8. A Vanguard Short-Term Corp.98 CHAPTER THREE Numerical Descriptive Measures PROBLEMS FOR SECTION 3. 
According to the empirical rule.2 15.4 10.8 10.5 9. within ±1 standard deviation of the mean? PH Grade ASSIST b. What proportion of these businesses have quarterly sales tax receipts within ±1. Compute the mean for this population.0 7.0 8. Use the empirical rule or the Chebyshev rule. You determined that µ. Assets (Billions \$) 19. According to the Chebyshev rule.3 11.5 7. c. or ±3 standard deviations of the mean? d.0 11. the standard deviation.75.9 14. respectively.1 and that the quartiles are.0 12. b.1 11. PH Grade ASSIST 3.6 10. d.0 to 17. Compare and contrast your findings with what would be expected on the basis of the empirical rule.5 7.8 7. Compare and contrast your findings versus what would be expected based on the empirical rule. PH Grade ASSIST Applying the Concepts 3. variance. Compute the population mean.8 13.27 The data in the file DOWRETURN give the 10-year annualized return (1994–2003) for the 30 companies in the Dow Jones Industrials.3 12.7 11. or ±3 standard deviations of the mean? c.6 8. what percentage of these funds are expected to be within ±1.0 12.4 5.6 9. a. Is there a lot of variability in the assets of the bond funds? 3.5 9.0 13. variance. ±2. b. b.1 12. Are you surprised at the results in (b)? d.8 8. Compute the mean for this population of the five largest bond funds. b. Tax-Free Inc.9 6. Compute the population standard deviation.6 11. within ±2 standard deviations of the mean. 5.26 The data in the file ENERGY contains the per capita energy consumption in kilowatt hours for each of the 50 states and the District of Columbia during 1999. . Do (a) through (c) with the District of Columbia removed. Interpret this parameter. Interpret this number. Bond Fund Vanguard GNMA Vanguard Total Bond Mkt. What proportion of these states has average per capita energy consumption within ±1 standard deviation of the mean. Interpret these parameters.

3.3 COMPUTING NUMERICAL DESCRIPTIVE MEASURES FROM A FREQUENCY DISTRIBUTION

Sometimes you have only a frequency distribution, not the raw data. When this occurs, you can compute approximations to the mean and the standard deviation.

When you have data from a sample that has been summarized into a frequency distribution, you can compute an approximation of the mean by assuming that all values within each class interval are located at the midpoint of the class.

APPROXIMATING THE MEAN FROM A FREQUENCY DISTRIBUTION

$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} \qquad (3.16)$$

where
$\bar{X}$ = sample mean
n = number of values or sample size
c = number of classes in the frequency distribution
$m_j$ = midpoint of the jth class
$f_j$ = number of values in the jth class

To calculate the standard deviation from a frequency distribution, you assume that all values within each class interval are located at the midpoint of the class.

APPROXIMATING THE STANDARD DEVIATION FROM A FREQUENCY DISTRIBUTION

$$S = \sqrt{\frac{\sum_{j=1}^{c}(m_j - \bar{X})^2 f_j}{n - 1}} \qquad (3.17)$$

Example 3.14 illustrates the computation of the mean and the standard deviation from a frequency distribution.

EXAMPLE 3.14 APPROXIMATING THE MEAN AND STANDARD DEVIATION FROM A FREQUENCY DISTRIBUTION

Consider the frequency distribution of the 2003 return of growth funds (Table 3.7). Compute the mean and standard deviation.

TABLE 3.7 Frequency Distribution of the 2003 Return for Growth Mutual Funds

| Annual Percentage 2003 Return | Frequency |
|---|---|
| 10 but less than 20 | 2 |
| 20 but less than 30 | 9 |
| 30 but less than 40 | 13 |
| 40 but less than 50 | 15 |
| 50 but less than 60 | 5 |
| 60 but less than 70 | 5 |
| Total | 49 |
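Equations (3.16) and (3.17) translate directly into code once each class is replaced by its midpoint. The following is a minimal sketch, assuming each class is given as a (lower limit, upper limit, frequency) triple; the function name `grouped_mean_and_sd` is hypothetical, and the data are the classes of Table 3.7.

```python
from math import sqrt

def grouped_mean_and_sd(classes):
    """Approximate the sample mean and standard deviation from grouped data.

    classes: list of (lower_limit, upper_limit, frequency) tuples.
    Every value in a class is assumed to sit at the class midpoint,
    as in Equations (3.16) and (3.17).
    """
    n = sum(f for _, _, f in classes)
    midpoints = [((lo + hi) / 2, f) for lo, hi, f in classes]
    x_bar = sum(m * f for m, f in midpoints) / n
    sum_sq = sum((m - x_bar) ** 2 * f for m, f in midpoints)
    s = sqrt(sum_sq / (n - 1))
    return x_bar, s

# Table 3.7: 2003 return for growth mutual funds
table_3_7 = [
    (10, 20, 2),
    (20, 30, 9),
    (30, 40, 13),
    (40, 50, 15),
    (50, 60, 5),
    (60, 70, 5),
]

x_bar, s = grouped_mean_and_sd(table_3_7)
print(f"approximate mean = {x_bar:.2f}, approximate S = {s:.2f}")
```

The results are approximations because the raw returns are unknown; how close they come to the true sample mean and standard deviation depends on how evenly the values are spread within each class.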

SOLUTION The computations that you need to calculate the approximations of the mean and standard deviation of the 2003 return for growth mutual funds are summarized in Table 3.8.

TABLE 3.8 Computations Needed to Calculate the Approximations of the Mean and Standard Deviation of the 2003 Return for Growth Mutual Funds

| Percentage Return | Number of Funds ($f_j$) | Midpoint ($m_j$) | $m_j f_j$ | $(m_j - \bar{X})$ | $(m_j - \bar{X})^2$ | $(m_j - \bar{X})^2 f_j$ |
|---|---|---|---|---|---|---|
| 10 but less than 20 | 2 | 15 | 30 | −25.51 | 650.7601 | 1,301.5202 |
| 20 but less than 30 | 9 | 25 | 225 | −15.51 | 240.5601 | 2,165.0409 |
| 30 but less than 40 | 13 | 35 | 455 | −5.51 | 30.3601 | 394.6813 |
| 40 but less than 50 | 15 | 45 | 675 | 4.49 | 20.1601 | 302.4015 |
| 50 but less than 60 | 5 | 55 | 275 | 14.49 | 209.9601 | 1,049.8005 |
| 60 but less than 70 | 5 | 65 | 325 | 24.49 | 599.7601 | 2,998.8005 |
| Total | 49 | | 1,985 | | | 8,212.2449 |

Using Equations (3.16) and (3.17) on page 99,

$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} = \frac{1{,}985}{49} = 40.51$$

and

$$S = \sqrt{\frac{\sum_{j=1}^{c}(m_j - \bar{X})^2 f_j}{n - 1}} = \sqrt{\frac{8{,}212.2449}{49 - 1}} = \sqrt{171.08843} = 13.08$$

PROBLEMS FOR SECTION 3.3

Learning the Basics

3.28 Given the following frequency distribution for n = 100:

| Class Intervals | Frequency |
|---|---|
| 0 to under 10 | 10 |
| 10 to under 20 | 20 |
| 20 to under 30 | 40 |
| 30 to under 40 | 20 |
| 40 to under 50 | 10 |
| Total | 100 |

Approximate
a. the mean.
b. the standard deviation.

3.29 Given the following frequency distribution for n = 100:

| Class Intervals | Frequency |
|---|---|
| 0 to under 10 | 40 |
| 10 to under 20 | 25 |
| 20 to under 30 | 15 |
| 30 to under 40 | 15 |
| 40 to under 50 | 5 |
| Total | 100 |

Approximate
a. the mean.
b. the standard deviation.

2 98. approximate the a. do you think the mean and the standard deviation of the accounts receivable have changed substantially from March to April? Explain.4 97.32 The following data represent the distribution of the ages of employees within two different divisions of a publishing company.0 For U.0 1. do U. Age of Employees (Years) 20—Under 30 30—Under 40 40—Under 50 50—Under 60 60—Under 70 A Frequency B Frequency 8 17 11 8 2 15 32 20 4 0 For each of the two divisions (A and B).0 12.S.0 3. mean.4 75. . approximate the a. mean. d.000 \$10.0 100.0 32.0 32 54 61 68 68 70 71 72 44..7 94. Two independent samples of 50 accounts were selected for each of the two months. Construct a frequency distribution for each group. variation.000 to under \$8.4 U. EXPLORATORY DATA ANALYSIS Section 3.000 \$8. 3.0 92.S.S. On the basis of the results of (b) and (c). On the basis of the results of (a).0 4. c.000 \$6. On the basis of the results of (a) and (b).4 94.and foreign-made automobiles seem to differ in their braking distance? Explain.0 44.000 \$4. do you think there are differences in the age distribution between the two divisions? Explain.-Made Automobile Models “Less Than” Braking Indicated Values Distance (in Ft) Number Percentage 210 220 230 240 (continued) 0 1 2 3 0.4: Exploratory Data Analysis Applying the Concepts 3. On the basis of (a) and (b).000 Total For each month.-Made Automobile Models “Less Than” Braking Indicated Values Distance (in Ft) Number Percentage 250 260 270 280 290 300 310 320 4 8 11 17 21 23 25 25 101 Foreign-Made Automobile Models “Less Than” Indicated Values Number Percentage 16. b. On the basis of the results of (a). standard deviation.S.000 \$2.and foreign-made automobiles a. approximate the standard deviation of the braking distance.4 Foreign-Made Automobile Models “Less Than” Indicated Values Number Percentage 0 1 4 19 0.0 8. b. 
Another way of describing numerical data is thrpough exploratory data analysis that includes the five-number summary and the box-and-whisker plot (references 5 and 6).6 26.. and shape.000 to under \$10.0 84.4 5.31 The following table contains the cumulative frequency distributions and cumulative percentage distributions of braking distance (in feet) at 80 miles per hour for a sample of 25 U. 3. b. standard deviation.30 A wholesale appliance distributing firm wished to study its accounts receivable for two successive months.0 84.6 100.000 to under \$6. c.3.000 to under \$4. c.1 discussed sample statistics for numerical data that are measures of central tendency.S. approximate the mean of the braking distance.0 100.000 to under \$12.0 68. The results are summarized in the following table: Frequency Distributions for Accounts Receivable Amount March Frequency April Frequency 6 13 17 10 4 0 50 10 14 13 10 0 3 50 \$0 to under \$2.-manufactured automobile models and for a sample of 72 foreign-made automobile models in a recent year: U.

The Five-Number Summary

A five-number summary that consists of

Xsmallest   Q1   Median   Q3   Xlargest

provides a way to determine the shape of the distribution. Table 3.9 explains how the relationships among the "five numbers" allow you to recognize the shape of a data set.

TABLE 3.9 Relationships among the Five-Number Summary and the Type of Distribution

| Comparison | Left-Skewed | Symmetric | Right-Skewed |
|---|---|---|---|
| Distance from Xsmallest to the median versus the distance from the median to Xlargest. | The distance from Xsmallest to the median is greater than the distance from the median to Xlargest. | Both distances are the same. | The distance from Xsmallest to the median is less than the distance from the median to Xlargest. |
| Distance from Xsmallest to Q1 versus the distance from Q3 to Xlargest. | The distance from Xsmallest to Q1 is greater than the distance from Q3 to Xlargest. | Both distances are the same. | The distance from Xsmallest to Q1 is less than the distance from Q3 to Xlargest. |
| Distance from Q1 to the median versus the distance from the median to Q3. | The distance from Q1 to the median is greater than the distance from the median to Q3. | Both distances are the same. | The distance from Q1 to the median is less than the distance from the median to Q3. |

For the sample of 10 get-ready times, the smallest value is 29 minutes and the largest value is 52 minutes (see pages 75 and 77). Calculations done previously in section 3.1 show that the median = 39.5, the first quartile = 35, and the third quartile = 44. Therefore, the five-number summary is

29   35   39.5   44   52

The distance from Xsmallest to the median (39.5 − 29 = 10.5) is slightly less than the distance from the median to Xlargest (52 − 39.5 = 12.5). The distance from Xsmallest to Q1 (35 − 29 = 6) is slightly less than the distance from Q3 to Xlargest (52 − 44 = 8). Therefore, the get-ready times are slightly right-skewed.

EXAMPLE 3.15 COMPUTING THE FIVE-NUMBER SUMMARY OF THE 2003 PERCENTAGE RETURN OF SMALL CAP HIGH-RISK MUTUAL FUNDS

The 121 mutual funds that are part of the "Using Statistics" scenario (see page 72) are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap). Compute the five-number summary of the 2003 return for the small cap mutual funds with high risk. MUTUALFUNDS2004

3 = 16.8 − 37.8 60. AVERAGE RISK. the five-number summary is 37.3. The box-and-whisker plot of the get-ready times in Figure 3. MUTUALFUNDS2004 . The right whisker is slightly longer than the left whisker. Therefore. Construct the box-and-whisker plot of the 2003 return for lowrisk. Figure 3.e. the upper 25% of the data are represented by a whisker connecting the right side of the box to Xlargest.8 = 12.7) is less than the distance from Xsmallest to the median (53. The distance from Xsmallest to Q1 (41.3 and the largest value is 66. The lower 25% of the data are represented by a line (i..16 THE BOX-AND-WHISKER PLOT OF THE 2003 PERCENTAGE RETURN OF LOW-RISK. the median = 53. the smallest value in the data set is 37. the box contains the middle 50% of the values in the distribution.4: Exploratory Data Analysis 103 SOLUTION From previous computations for the 2003 return for the small cap mutual funds with high risk (see pages 76 and 78).4 illustrates the box-and-whisker plot for the get-ready times.4 indicates very slight rightskewness since the distance between the median and the highest value is slightly more than the distance between the lowest value and the median.5. the results are inconsistent. This indicates slight right-skewness. AND HIGH-RISK MUTUAL FUNDS The 121 mutual funds that are part of the “Using Statistics” scenario (see page 72) are classified according to the risk level of the mutual funds (low.65).8.85 = 5.5 The distance from the median to Xlargest (66.3 41.5 − 60.7 53. and large cap).7.7 − 37.5 − 53. average. mid cap. average-risk.85 66.85. and the third quartile = 60. Similarly. and high) and type (small cap.4) is slightly less than the distance from Q3 to Xlargest (66. FIGURE 3. EXAMPLE 3. The Box-and-Whisker Plot A box-and-whisker plot provides a graphical representation of the data based on the fivenumber summary. Thus. and high-risk mutual funds. In addition. 
a whisker) connecting the left side of the box to the location of the smallest value.3 = 4. Therefore. The vertical line at the left side of the box represents the location of Q1 and the vertical line at the right side of the box represents the location of Q3. Xsmallest.5). the first quartile = 41. This indicates left skewness.4 Box-and-Whisker Plot of the Time to Get Ready Xsmallest 20 25 30 Q1 35 Median 40 Time (minutes) Xlargest Q3 45 50 55 The vertical line drawn within the box represents the median.

SOLUTION Figure 3.5 is the Minitab box-and-whisker plot of the 2003 return for low-risk, average-risk, and high-risk mutual funds. Minitab displays the box-and-whisker plot vertically from bottom (low) to top (high). The asterisk (*) for the average-risk fund represents the presence of outlier values.² The median percentage return and the quartiles are higher for the high-risk funds than for the low-risk and average-risk funds. The low-risk funds appear to be slightly right-skewed since the upper whisker is longer than the lower whisker. The average-risk funds are right-skewed due to the extremely large return of one fund (78). The high-risk funds appear left-skewed because of the long lower whisker, but the median return is closer to the first quartile than to the third quartile.

² If there are outliers, the whiskers in the Minitab box-and-whisker plot extend to 1.5 times the interquartile range beyond the quartiles or to the highest value.

FIGURE 3.5 Minitab Box-and-Whisker Plot of the 2003 Return for Low-Risk, Average-Risk, and High-Risk Mutual Funds

Figure 3.6 demonstrates the relationship between the box-and-whisker plot and the polygon for four different types of distributions. (Note: The area under each polygon is split into quartiles corresponding to the five-number summary for the box-and-whisker plot.)

FIGURE 3.6 Box-and-Whisker Plots and Corresponding Polygons for Four Distributions (Panel A: Bell-shaped distribution; Panel B: Left-skewed distribution; Panel C: Right-skewed distribution; Panel D: Rectangular distribution)
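The three comparisons of Table 3.9 can be applied mechanically to any five-number summary. The sketch below is an illustration only: the function names `compare_distances` and `shape_indications` and the tolerance used to decide that two distances are "the same" are assumptions introduced here, not part of the text. It is applied to the two five-number summaries given earlier in this section.

```python
def compare_distances(lower_distance, upper_distance, tol=1e-9):
    """Classify one Table 3.9 comparison.

    lower_distance is the distance on the low side, upper_distance the
    distance on the high side. The tolerance is an assumed convention for
    treating two floating-point distances as equal.
    """
    if abs(lower_distance - upper_distance) <= tol:
        return "symmetric"
    return "left-skewed" if lower_distance > upper_distance else "right-skewed"


def shape_indications(x_smallest, q1, median, q3, x_largest):
    """Return what each of the three Table 3.9 comparisons indicates."""
    return {
        "Xsmallest-to-median vs median-to-Xlargest":
            compare_distances(median - x_smallest, x_largest - median),
        "Xsmallest-to-Q1 vs Q3-to-Xlargest":
            compare_distances(q1 - x_smallest, x_largest - q3),
        "Q1-to-median vs median-to-Q3":
            compare_distances(median - q1, q3 - median),
    }


# Five-number summaries taken from the text
get_ready = shape_indications(29, 35, 39.5, 44, 52)
small_cap_high_risk = shape_indications(37.3, 41.7, 53.8, 60.85, 66.5)

for name, result in [("get-ready times", get_ready),
                     ("small cap high-risk funds", small_cap_high_risk)]:
    print(name)
    for comparison, indication in result.items():
        print(f"  {comparison}: {indication}")
```

For the small cap high-risk funds the comparisons disagree with one another, which is the inconsistency noted in the solution to Example 3.15. The comparisons are a rough guide to shape, not a formal test.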

. b.20 4. you used scatter diagrams to visually examine the relationship between two numerical variables.” Copyright © 2000 by Consumers Union of U.08 6. Yonkers.64 4.40 The following data represent the bounced check fee (in dollars) for a sample of 23 banks for direct-deposit customers who maintain a \$100 balance and the monthly service fee (in dollars) for direct-deposit customers if their accounts fall below the minimum required balance of \$1500 for a sample of 26 banks.01 8. Dooley.46 6.47 a.5 15.5 Burgers 19 31 34 35 39 39 43 Chicken 7 9 15 16 16 18 22 25 27 33 39 Source: Extracted from “Quick Bites.13 4. and describe the shape of the distribution for the burgers and chicken items.55 3. and the results are as follows: BANK1 4. March 2001.17 illustrates its use. Construct the box-and-whisker plot for the burgers and the chicken items. June 2000. located in a residential area.19 3. List the five-number summary of the waiting time at the two bank branches.77 2.” Copyright © 2001 by Consumers Union of U..10 0.49 6.35 10.” Decision Sciences..38 5. A random sample of 15 customers is selected. b.02 5. a.91 5.79 8. Equation (3. List the five-number summary for the burgers and for the chicken items. Construct the box-and-whisker plot of the bounced check fee and the monthly service fee. 1131–1153. List the five-number summary of the bounced check fee and of the monthly service fee.73 3. Rothenberger. c.5 75. c.0 Source: M.66 5. Should you compare the two bank branches? Explain. A random sample of 15 customers is selected. b. the firm uses a database of reusable components totaling more than 2.68 5. The waiting time in minutes (defined as the time the customer enters the line until he or she reaches the teller window) of all customers during these hours is recorded over a period of one week.42 A bank branch located in a commercial district of a city has developed an improved process for serving customers during the noon to 1:00 P.21 5. In this section. 
List the five-number summary. 46. Inc.S.5. Yonkers. 30(Fall 1999).17 9. The Covariance The covariance measures the strength of the linear relationship between two numerical variables (X and Y). Form the box-and-whisker plot and describe the shape of the data.106 CHAPTER THREE Numerical Descriptive Measures 3.0 25.34 3.79 Another branch. Adapted with permission from Consumer Reports.90 8.12 6. and K.02 5. the covariance and the coefficient of correlation that measure the strength of the relationship between two numerical variables are discussed.82 8.18) defines the sample covariance and Example 3. is also concerned with the noon to 1 P.50 6. lunch period. THE COVARIANCE AND THE COEFFICIENT OF CORRELATION In section 2. The waiting time in minutes (operationally defined as the time the customer enters the line to the time he or reaches the teller window) of all customers during this hour is recorded over a period of one week. b. and the results are as follows: BANK2 9. What similarities and differences are there in the distribution of the waiting time at the two bank branches? d.0 47. A. a.54 3. What similarities and differences are there in the distributions for the burgers and the chicken items? 3. J.41 The following data represent the total fat for burgers and chicken items from a sample of fast-food chains. BANKCOST1 BANKCOST2 26 28 20 20 21 22 25 25 18 25 15 20 18 20 25 25 22 30 30 30 15 20 29 12 8 5 5 6 6 10 10 9 7 10 7 7 5 0 10 6 9 12 0 5 10 8 5 5 9 Source: Extracted from “The New Face of Banking.000. “A Performance Measure for Software Reuse Projects. c. FASTFOOD 3.000 lines of code collected from 10 years of continuous reuse effort. Eight analysts at the firm were asked to estimate the reuse rate when developing a new software system. 3. The following data are given as a percentage of the total code written for a software system that is part of the reuse database. REUSE 50 62. Adapted with permission from Consumer Reports. 
What similarities and differences are there in the distributions for the bounced check fee and the monthly service fee? 3. NY 10703–1057.M.0 45. lunch hour.39 A software development and consulting firm located in the Phoenix metropolitan area develops software for supply chain management systems using systematic software reuse. a. Inc. NY 10703–1057. Construct the box-and-whisker plot and describe the shape of the distribution of the two bank branches. Instead of starting from scratch when writing and developing new custom software systems.M.S..5 37..

THE SAMPLE COVARIANCE

$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \qquad (3.18)$$

EXAMPLE 3.17 COMPUTING THE SAMPLE COVARIANCE

Consider the expense ratio and the 2003 return for the small cap high-risk funds. Compute the sample covariance.

SOLUTION Table 3.10 presents the expense ratio and 2003 return for the small cap high-risk funds and Figure 3.7 contains a Microsoft Excel worksheet that calculates the covariance for these data. The Calculations area of Figure 3.7 breaks down Equation (3.18) into a set of smaller calculations. From cell C17, or by using Equation (3.18) directly, the covariance is 1.19738.

$$\text{cov}(X, Y) = \frac{9.579}{9 - 1} = 1.19738$$

TABLE 3.10 Expense Ratio and 2003 Return for the Small Cap High-Risk Funds

| Expense Ratio | 2003 Return |
|---|---|
| 1.25 | 37.3 |
| 0.72 | 39.2 |
| 1.57 | 44.2 |
| 1.40 | 44.5 |
| 1.33 | 53.8 |
| 1.61 | 56.6 |
| 1.68 | 59.3 |
| 1.42 | 62.4 |
| 1.20 | 66.5 |

FIGURE 3.7 Microsoft Excel Worksheet for the Covariance between Expense Ratio and 2003 Return for the Small Cap High-Risk Funds
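As a cross-check on the worksheet, Equation (3.18) can be evaluated directly. This is a minimal sketch rather than the book's method: the function name `sample_covariance` is an illustrative assumption, and the pairing of expense ratios with returns is the one reconstructed for Table 3.10, which reproduces the numerator of 9.579 reported in the solution.

```python
def sample_covariance(x, y):
    """Sample covariance per Equation (3.18).

    Hypothetical helper written for illustration; x and y are paired values.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross_products / (n - 1)


# Table 3.10: small cap high-risk funds (pairing as reconstructed above)
expense_ratio = [1.25, 0.72, 1.57, 1.40, 1.33, 1.61, 1.68, 1.42, 1.20]
return_2003 = [37.3, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]

cov_xy = sample_covariance(expense_ratio, return_2003)
print(f"cov(X, Y) = {cov_xy:.4f}")
```

The result agrees with the 1.19738 reported from cell C17, up to rounding in the last displayed digit.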

The covariance has a major flaw as a measure of the linear relationship between two numerical variables. Since the covariance can have any value, you are unable to determine the relative strength of the relationship. To better determine the relative strength of the relationship, you need to compute the coefficient of correlation.

The Coefficient of Correlation

The coefficient of correlation measures the relative strength of a linear relationship between two numerical variables. The values of the coefficient of correlation range from −1 for a perfect negative correlation to +1 for a perfect positive correlation. Perfect means that if the points were plotted in a scatter diagram, all the points could be connected with a straight line. When dealing with population data for two numerical variables, the Greek letter ρ is used as the symbol for the coefficient of correlation. Figure 3.8 illustrates three different types of association between two variables.

FIGURE 3.8 Types of Association between Variables (Panel A: Perfect negative correlation, r = −1; Panel B: No correlation, r = 0; Panel C: Perfect positive correlation, r = +1)

In panel A of Figure 3.8 there is a perfect negative linear relationship between X and Y. Thus, the coefficient of correlation ρ equals −1, and when X increases, Y decreases in a perfectly predictable manner. Panel B shows a situation in which there is no relationship between X and Y. In this case, the coefficient of correlation ρ equals 0, and as X increases, there is no tendency for Y to increase or decrease. Panel C illustrates a perfect positive relationship where ρ equals +1. In this case, Y increases in a perfectly predictable manner when X increases.

When you have sample data, the sample coefficient of correlation r is calculated. When using sample data, you are unlikely to have a sample coefficient of exactly +1, 0, or −1. Figure 3.9 on page 109 presents scatter diagrams along with their respective sample coefficients of correlation r for six data sets, each of which contains 100 values of X and Y.

In panel A, the coefficient of correlation r is −0.9. You can see that for small values of X there is a very strong tendency for Y to be large. Likewise, the large values of X tend to be paired with small values of Y. The data do not all fall on a straight line, so the association between X and Y cannot be described as perfect. The data in panel B have a coefficient of correlation equal to −0.6, and the small values of X tend to be paired with large values of Y. The linear relationship between X and Y in panel B is not as strong as in panel A. Thus, the coefficient of correlation in panel B is not as negative as in panel A. In panel C the linear relationship between X and Y is very weak, r = −0.3, and there is only a slight tendency for the small values of X to be paired with the larger values of Y. Panels D through F depict data sets that have positive coefficients of correlation because small values of X tend to be paired with small values of Y, and the large values of X tend to be associated with large values of Y.

In the discussion of Figure 3.9, the relationships were deliberately described as tendencies and not as cause-and-effect. This wording was used on purpose. Correlation alone cannot prove that there is a causation effect, that is, that the change in the value of one variable caused the change in the other variable. A strong correlation can be produced simply by chance, by the effect of a third variable not considered in the calculation of the correlation, or by a cause-and-effect relationship. You would need to perform additional analysis to determine which of these three situations actually produced the correlation. Therefore, you can say that causation implies correlation, but correlation alone does not imply causation.

FIGURE 3.9 Six Scatter Diagrams Created from Minitab and Their Sample Coefficients of Correlation r (Panels A through F)

Equation (3.19) defines the sample coefficient of correlation r and Example 3.18 illustrates its use.

THE SAMPLE COEFFICIENT OF CORRELATION

$$r = \frac{\text{cov}(X, Y)}{S_X S_Y} \qquad (3.19)$$

where

$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

$$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$

$$S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$

Example 3.18 illustrates the computation of the sample coefficient of correlation using Equation (3.19).

EXAMPLE 3.18 COMPUTING THE SAMPLE COEFFICIENT OF CORRELATION

Consider the expense ratio and the 2003 return for the small cap high-risk funds. From Figure 3.10 and Equation (3.19), compute the sample coefficient of correlation.

SOLUTION

$$r = \frac{\text{cov}(X, Y)}{S_X S_Y} = \frac{1.19738}{(0.287663)(10.554383)} = 0.3943786$$

FIGURE 3.10 Microsoft Excel Worksheet for the Sample Coefficient of Correlation r between the Expense Ratio and the 2003 Return for Small Cap High-Risk Funds
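Equation (3.19) can also be evaluated without a worksheet. The following is a sketch under stated assumptions: the helper names `sample_sd`, `sample_cov`, and `sample_correlation` are illustrative, and the data are the nine expense ratio and 2003 return pairs for the small cap high-risk funds, in the pairing that reproduces the covariance, S_X, and S_Y values quoted in the solution.

```python
from math import sqrt


def sample_sd(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_cov(x, y):
    """Sample covariance with the n - 1 divisor, Equation (3.18)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Sample coefficient of correlation r, Equation (3.19)."""
    return sample_cov(x, y) / (sample_sd(x) * sample_sd(y))


# Small cap high-risk funds: expense ratio and 2003 return
expense_ratio = [1.25, 0.72, 1.57, 1.40, 1.33, 1.61, 1.68, 1.42, 1.20]
return_2003 = [37.3, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]

s_x = sample_sd(expense_ratio)
s_y = sample_sd(return_2003)
r = sample_correlation(expense_ratio, return_2003)
print(f"S_X = {s_x:.4f}, S_Y = {s_y:.4f}, r = {r:.3f}")
# prints: S_X = 0.2877, S_Y = 10.5544, r = 0.394
```

Because r divides the covariance by both standard deviations, it is unit-free and always lies between −1 and +1, which is what makes it comparable across data sets in a way the covariance is not.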

D1) that discussed investment in foreign stocks stated that the coefficient of correlation between the return on investment of U. . Those mutual funds with the highest expense ratios tend to be associated with the highest 2003 returns. It only indicates the tendencies present in the data. stocks and these five other types of investments can you make? b. stocks and International Large Cap stocks was 0. b.” The Wall Street Journal. Adapted with permission from Consumer Reports. the coefficient of correlation indicates the linear relationship..53. 2003. bonds and Emerging market stocks was −0.58. Compare the results of (a) to those of problem 3. bonds and these five other types of investments can you make? b.S. c. What conclusions about the strength of the relationship between the return on investment of U. U. or SPSS.S. PROBLEMS FOR SECTION 3. between two numerical variables. U. U.46 The following data COFFEEDRINK represent the calories and fat (in grams) of 16-ounce iced coffee drinks at Dunkin’ Donuts and Starbucks: Product Calories Fat Dunkin’ Donuts Iced Mocha Swirl latte (whole milk) Starbucks Coffee Frappuccino blended coffee Dunkin’ Donuts Coffee Coolatta (cream) Starbucks Iced Coffee Mocha Expresso (whole milk and whipped cream) Starbucks Mocha Frappuccino blended coffee (whipped cream) Starbucks Chocolate Brownie Frappuccino blended coffee (whipped cream) Starbucks Chocolate Frappuccino Blended Crème (whipped cream) 240 260 350 8.0 420 16.S.03.S.5 22. Compare the results of (a) to those of problem 3.71.49 can be solved manually or by using Microsoft Excel..e.3.S. November 26.S.5 Learning the Basics 3.45 (a).394.48. as indicated by a coefficient of correlation. bonds and International Small Cap stocks was −0. Clements. stocks and Emerging market debt was 0.S..20.e. 3.0 Source: Extracted from “Coffee as Candy at Dunkin’Donuts and Starbucks. stocks and International Small Cap stocks was 0. Inc.0 350 20. U. Clements. 9. U. S. November 26. a.44–3. 
past performance does not guarantee future performance. In summary. bonds and International Bonds was 0. stocks and International Bonds was 0.” The Wall Street Journal. the larger values of X are typically paired with the larger values of Y) or negatively correlated (i. Yonkers. the linear relationship between the two variables is stronger. bonds and Emerging market debt was 0.0 510 22. Compute the coefficient of correlation. U.18. U.44 (a). 2003. the larger values of X are typically paired with the smaller values of Y). Compute the covariance. stocks and Emerging market stocks was 0. little or no linear relationship exists.0 530 19.5: The Covariance and the Coefficient of Correlation 111 The expense ratio and the 2003 return for the small cap high-risk funds are positively correlated.44 A recent article (J. When the coefficient of correlation is near 0. NY 10703–1057. Applying the Concepts Problems 3. r = 0. bonds and International Large Cap stocks was −0. The sign of the coefficient of correlation indicates whether the data are positively correlated (i. “Why Investors Should Put up to 30% of Their Stock Portfolio in Foreign Funds.S. or association. As with all investments. D1) that discussed investment in foreign bonds stated that the coefficient of correlation between the return on investment of U. Those mutual funds with the lowest expense ratios tend to be associated with the lowest 2003 returns. What conclusions about the strength of the relationship between the return on investment of U. The existence of a strong correlation does not imply a causation effect.S. “Why Investors Should Put up to 30% of Their Stock Portfolio in Foreign Funds.13. June 2004.80..S.45 A recent article (J.” Copyright © 2004 by Consumers Union of U. How strong is the relationship between X and Y? Explain.S. You can only say that this is what tended to happen in the sample. When the coefficient of correlation gets closer to +1 or −1. Minitab.0 3. 
You cannot assume that having a low expense ratio caused the low 2003 return. 3.S. U. 3.10.43 The following is a set of data from a sample of n = 11 items: X 7 5 8 Y 21 15 24 3 6 10 12 4 9 15 18 9 18 30 36 12 27 45 54 a. This relationship is fairly weak. a.

KEY FORMULAS

Sample Mean

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad (3.1)$$

Median

$$\text{Median} = \frac{n + 1}{2}\ \text{ranked value} \qquad (3.2)$$

First Quartile Q1

$$Q_1 = \frac{n + 1}{4}\ \text{ranked value} \qquad (3.3)$$

Third Quartile Q3

$$Q_3 = \frac{3(n + 1)}{4}\ \text{ranked value} \qquad (3.4)$$

Geometric Mean

$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n} \qquad (3.5)$$

Geometric Mean Rate of Return

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1 \qquad (3.6)$$

Range

$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}} \qquad (3.7)$$

Interquartile Range

$$\text{Interquartile range} = Q_3 - Q_1 \qquad (3.8)$$

Sample Variance

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \qquad (3.9)$$

Sample Standard Deviation

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \qquad (3.10)$$

Coefficient of Variation

$$CV = \left(\frac{S}{\bar{X}}\right) 100\% \qquad (3.11)$$

Z Scores

$$Z = \frac{X - \bar{X}}{S} \qquad (3.12)$$

Population Mean

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad (3.13)$$

Population Variance

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \qquad (3.14)$$

Population Standard Deviation

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \qquad (3.15)$$

Approximating the Mean from a Frequency Distribution

$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} \qquad (3.16)$$

Approximating the Standard Deviation from a Frequency Distribution

$$S = \sqrt{\frac{\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j}{n - 1}} \qquad (3.17)$$

Sample Covariance

$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \qquad (3.18)$$

Sample Coefficient of Correlation

$$r = \frac{\text{cov}(X, Y)}{S_X S_Y} \qquad (3.19)$$

KEY TERMS

arithmetic mean 73
box-and-whisker plot 103
central tendency 72
Chebyshev rule 97
coefficient of correlation 108
coefficient of variation 85
covariance 106
dispersion 72
empirical rule 96
extreme value 86
five-number summary 102
geometric mean 79
interquartile range 81
left-skewed 88
mean 73
median 75
midspread 81
mode 76
outlier 86
population mean 94
population standard deviation 95
population variance 97
Q1: first quartile 77
Q2: second quartile 77

5 grams of tea in a bag.45 5.61 5. interquartile range. Construct a box-and-whisker plot.59 What is meant by the property of shape? 3.50 5. on average.53 How do you interpret the first quartile. 3. two problems arise.53 5.40 5.67 5.68–3. Interpret the measures of central tendency and variation within the context of this problem.53 5.77 5. We recommend that you solve problems 3.58 5. First.Chapter Review Problems Q3: third quartile 77 quartiles 77 range 80 resistant measures 81 right-skewed 88 sample coefficient of correlation 109 CHAPTER sample covariance 106 sample mean 73 sample standard deviation sample variance 82 shape 72 skewed 88 spread 72 REVIEW Checking Your Understanding 3.54 5. there are 5. interquartile range.65 5. and the extremely fast filling operation of the machine (approximately 170 bags a minute).40 5.49 5.57 5. possible requests for additional medical information and medical exams. median.55 What does the Z score measure? 3.5 grams of tea in a bag? If you were in charge of this process. Second. variance. c.25 5.55 5. variance.57 5. The following table provides the weight in grams of a sample of 50 tea bags produced in one hour by a single machine.32 5. differences in the density of the tea. The ability to deliver approved policies to customers in a timely manner is critical to the profitability of this service to the bank. Is the company meeting the requirement set forth on the label that.52 What are the differences among the mean. The approval process consists of underwriting.51 5.44 5. Compute the mean.57 5.51 What is meant by the property of central tendency? 3. Are the data skewed? If so. if any. and a policy compilation stage during which the policy pages are generated and sent to the bank for delivery.60 How do the covariance and the coefficient of correlation differ? Applying the Concepts You can solve problems 3.58 How do the empirical rule and the Chebychev rule differ? 3. first quartile. 
and what are the advantages and disadvantages of each? 3. a random sample of 27 approved policies was selected and the following total processing time in days was recorded: INSURANCE 73 19 16 64 28 28 31 90 60 56 31 56 22 18 45 48 17 17 17 91 92 63 50 51 69 16 17 .56 5. how? e. the company is giving away product.67 manually or by using Microsoft Excel.86 using Microsoft Excel.44 5.47 5.58 5. For this product. Getting an exact amount of tea in a bag is problematic 115 standard deviation 82 sum of squares 82 symmetrical 88 variance 82 variation 76 Z scores 86 82 PROBLEMS because of variation in the temperature and humidity inside the factory.36 a. Compute the range.56 5. would you try to make concerning the distribution of weights in the individual bags? 3. standard deviation. customers may not be able to brew the tea to be as strong as they wish.53 5. and coefficient of variation. and coefficient of variation.54 5.61 5.50 5.47 5. and mode.52 5. Minitab. TEABAGS 5.61–3. there are 5. or SPSS.46 5. a medical information bureau check.34 5. savings banks are permitted to sell a form of life insurance called Savings Bank Life Insurance (SBLI).40 5.45 5.55 5. During a period of one month.29 5. and third quartile? 3. If the bags are underfilled. or SPSS.62 5.42 5. and third quartile.44 5.41 5. standard deviation. Minitab. If the average amount of tea in a bag exceeds the label weight. on average.62 In New York State.53 5. what changes. Why should the company producing the tea bags be concerned about the central tendency and variation? d. the company may be in violation of the truth-in-labeling laws. and what are the advantages and disadvantages of each? 3.54 What is meant by the property of variation? 3. median. the label weight on the package indicates that.67 5.50 5.61 A quality characteristic of interest for a tea-bag-filling process is the weight of the tea in the individual bags.58 5. 
b.63 5.42 5.56 What are the differences among the various measures of variation such as the range. which includes a review of the application.32 5.50 What are the properties of a set of numerical data? 3. median.57 How does the empirical rule help explain the ways in which the values in a set of numerical data cluster and distribute? 3.

a. Compute the mean, median, first quartile, and third quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a box-and-whisker plot. Are the data skewed? If so, how?
d. What would you tell a customer who enters the bank to purchase this type of insurance policy and asks how long the approval process takes?

3.63 One of the major measures of the quality of service provided by any organization is the speed with which it responds to customer complaints. A large family-held department store selling furniture and flooring, including carpet, had undergone a major expansion in the past several years. In particular, the flooring department had expanded from 2 installation crews to an installation supervisor, a measurer, and 15 installation crews. A sample of 50 complaints concerning carpet installation was selected during a recent year. The following data represent the number of days between the receipt of the complaint and the resolution of the complaint. FURNITURE

54 5 35 137 31 27 152 2 123 81
74 27 11 19 126 110 110 29 61 35
94 31 26 5 12 4 165 32 29 28
29 26 25 1 14 13 13 10 5 27
4 52 30 22 36 26 20 23 33 68

a. Compute the mean, median, first quartile, and third quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a box-and-whisker plot. Are the data skewed? If so, how?
d. On the basis of the results of (a) through (c), if you had to tell the president of the company how long a customer should expect to wait to have a complaint resolved, what would you say? Explain.

3.64 A manufacturing company produces steel housings for electrical equipment. The main component part of the housing is a steel trough that is made out of a 14-gauge steel coil. It is produced using a 250-ton progressive punch press with a wipe-down operation putting two 90-degree forms in the flat steel to make the trough. The distance from one side of the form to the other is critical because of weatherproofing in outdoor applications. The company requires that the width of the trough be between 8.31 inches and 8.61 inches. The following are the widths of the troughs in inches for a sample of n = 49. TROUGH

8.312 8.343 8.317 8.383 8.348 8.410 8.351
8.373 8.481 8.422 8.476 8.382 8.484 8.403
8.414 8.419 8.385 8.465 8.498 8.447 8.436
8.413 8.489 8.414 8.481 8.415 8.479 8.429
8.458 8.462 8.460 8.444 8.429 8.460 8.412
8.420 8.410 8.405 8.323 8.420 8.396 8.447
8.405 8.439 8.411 8.427 8.420 8.498 8.409

a. Calculate the mean, median, range, and standard deviation for the width. Interpret these measures of central tendency and variability.
b. List the five-number summary.
c. Construct a box-and-whisker plot and describe the shape.
d. What can you conclude about the number of troughs that will meet the company’s requirements of troughs being between 8.31 and 8.61 inches wide?

3.65 The manufacturing company in problem 3.64 also produces electric insulators. If the insulators break when in use, a short-circuit is likely to occur. To test the strength of the insulators, destructive testing is carried out to determine how much force is required to break the insulators. Force is measured by observing how many pounds must be applied to the insulator before it breaks. The data from 30 insulators from this experiment are as follows: FORCE

1,870 1,728 1,656 1,610 1,634 1,784 1,522 1,696 1,592 1,662
1,866 1,764 1,734 1,662 1,734 1,774 1,550 1,756 1,762 1,866
1,820 1,744 1,788 1,688 1,810 1,752 1,680 1,810 1,652 1,736

a. Calculate the mean, median, range, and standard deviation for the force variable.
b. Interpret the measures of central tendency and variability in (a).
c. Construct a box-and-whisker plot and describe the shape.
d. What can you conclude about the strength of the insulators if the company requires a force measurement of at least 1,500 pounds?

3.66 Problems with a telephone line that prevent a customer from receiving or making calls are disconcerting to both the customer and the telephone company. The following data represent samples of 20 problems reported to two different offices of a telephone company and the time to clear these problems (in minutes) from the customers’ lines: PHONE

Central Office I Time to Clear Problems (minutes)
1.48 1.75 0.78 2.85 0.52 1.60 4.15 3.97 1.48 3.10
1.02 0.53 0.93 1.60 0.80 1.05 6.32 3.93 5.45 0.97

Central Office II Time to Clear Problems (minutes)
7.55 3.75 0.10 1.10 0.60 0.52 3.30 2.10 0.58 4.02
3.75 0.65 1.92 0.60 1.53 4.23 0.08 1.48 1.65 0.72

For each of the two central office locations:
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a side-by-side box-and-whisker plot. Are the data skewed? If so, how?
d. On the basis of the results of (a) through (c), are there any differences between the two central offices? Explain.
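The summary measures requested in parts (a) and (b) of these problems can be checked with software. The following is a minimal Python sketch, not part of the original text, that applies the standard library's statistics module to the FURNITURE data of problem 3.63. The helper name summarize is hypothetical. The quartiles come from statistics.quantiles with method="exclusive", which interpolates between ranked values, so a hand calculation that rounds the ranked position may give slightly different quartiles.

```python
import statistics

# Days between receipt and resolution of each complaint
# (FURNITURE data, problem 3.63; sample of n = 50).
days = [
    54, 5, 35, 137, 31, 27, 152, 2, 123, 81,
    74, 27, 11, 19, 126, 110, 110, 29, 61, 35,
    94, 31, 26, 5, 12, 4, 165, 32, 29, 28,
    29, 26, 25, 1, 14, 13, 13, 10, 5, 27,
    4, 52, 30, 22, 36, 26, 20, 23, 33, 68,
]


def summarize(sample):
    """Return sample (not population) descriptive measures as a dict."""
    mean = statistics.mean(sample)
    # Assumption: interpolated quartiles; the textbook's rounding rules
    # for ranked positions can produce slightly different values.
    q1, _, q3 = statistics.quantiles(sample, n=4, method="exclusive")
    std_dev = statistics.stdev(sample)  # divides by n - 1
    return {
        "n": len(sample),
        "mean": mean,
        "median": statistics.median(sample),
        "q1": q1,
        "q3": q3,
        "range": max(sample) - min(sample),
        "iqr": q3 - q1,
        "variance": statistics.variance(sample),  # divides by n - 1
        "std_dev": std_dev,
        "cv_percent": std_dev / mean * 100,
    }


summary = summarize(days)
for name, value in summary.items():
    print(f"{name:>10}: {value:.2f}")
```

For these 50 complaints the sketch reports a mean of 43.04 days and a median of 28.5 days. A mean well above the median is consistent with right-skewed data, which bears on part (c).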

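For a problem that compares two samples, such as problem 3.66, it can help to line up the five-number summaries before drawing the side-by-side box-and-whisker plot. The sketch below is illustrative only and assumes the 20 clearing times per office as listed for the PHONE data. The function name five_number_summary is hypothetical, and the quartiles are interpolated by statistics.quantiles, so they may differ slightly from values found with the textbook's ranked-position rules.

```python
import statistics

# Time to clear problems, in minutes (PHONE data, problem 3.66).
office_1 = [
    1.48, 1.75, 0.78, 2.85, 0.52, 1.60, 4.15, 3.97, 1.48, 3.10,
    1.02, 0.53, 0.93, 1.60, 0.80, 1.05, 6.32, 3.93, 5.45, 0.97,
]
office_2 = [
    7.55, 3.75, 0.10, 1.10, 0.60, 0.52, 3.30, 2.10, 0.58, 4.02,
    3.75, 0.65, 1.92, 0.60, 1.53, 4.23, 0.08, 1.48, 1.65, 0.72,
]


def five_number_summary(sample):
    """Return (smallest, Q1, median, Q3, largest) for a sample."""
    # Assumption: interpolated quartiles (method="exclusive").
    q1, median, q3 = statistics.quantiles(sample, n=4, method="exclusive")
    return (min(sample), q1, median, q3, max(sample))


for label, sample in (("Central Office I", office_1),
                      ("Central Office II", office_2)):
    smallest, q1, median, q3, largest = five_number_summary(sample)
    mean = statistics.mean(sample)
    print(f"{label}: min={smallest:.2f} Q1={q1:.2f} median={median:.3f} "
          f"Q3={q3:.2f} max={largest:.2f} mean={mean:.3f}")
```

The medians come out at 1.54 minutes for Central Office I and 1.505 minutes for Central Office II. In both samples the mean exceeds the median, which is one rough indication of right skewness to weigh when answering part (c).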
3.67 In many manufacturing processes the term “work-in-process” (often abbreviated WIP) is used. In a book manufacturing plant the WIP represents the time it takes for sheets from a press to be folded, gathered, sewn, tipped on end sheets, and bound. The following data represent samples of 20 books at each of two production plants and the processing time (operationally defined as the time in days from when the books came off the press to when they were packed in cartons) for these jobs. WIP

Plant A
5.62 5.29 16.25 10.92 11.46 21.62 8.45 8.58 5.41 11.42
11.62 7.29 7.50 7.96 4.42 10.50 7.58 9.29 7.54 8.92

Plant B
9.54 11.46 16.62 12.62 25.75 15.41 14.29 13.13 13.71 10.04
5.75 12.46 9.17 13.21 6.00 2.33 14.25 5.37 6.25 9.71

For each of the two plants:
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a side-by-side box-and-whisker plot. Are the data skewed? If so, how?
d. On the basis of the results of (a) through (c), are there any differences between the two plants? Explain.

3.68 The data contained in the file CEREALS consist of the cost in dollars per ounce, calories, fiber in grams, and sugar in grams for 33 breakfast cereals.
Source: Extracted from Copyright 1999 by Consumers Union of U.S., Inc., Yonkers, NY 10703–1057. Adapted with permission from Consumer Reports, October 1999, 33–34.
For each variable:
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a box-and-whisker plot. Are the data skewed? If so, how?
d. What conclusions can you reach concerning the cost per ounce in cents, calories, fiber in grams, and the sugar in grams for the 33 breakfast cereals?

3.69 State budget cuts forced a rise in tuition at public universities during the 2003–2004 academic year. The data in the file TUITION include the difference in tuition between 2002–2003 and 2003–2004 for in-state students and out-of-state students.
a. Compute the mean, median, first quartile, and third quartile for the difference in tuition between 2002–2003 and 2003–2004 for in-state students and out-of-state students.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation for the difference in tuition between 2002–2003 and 2003–2004 for in-state students and out-of-state students.
c. Construct a box-and-whisker plot of the difference in tuition between 2002–2003 and 2003–2004 for in-state students and out-of-state students. Are the data skewed? If so, how?
d. What conclusions can you reach concerning the difference in tuition between 2002–2003 and 2003–2004 for in-state students and out-of-state students?

3.70 Do marketing promotions, such as bobble-head giveaways, increase attendance at Major League Baseball games? An article in Sport Marketing Quarterly reported on the effectiveness of marketing promotions (T. C. Boyd and T. C. Krehbiel, “Promotion Timing in Major League Baseball and the Stacking Effects of Factors that Increase Game Attractiveness,” Sport Marketing Quarterly, 12(2003), 173–183). The data file ROYALS includes the following variables for the Kansas City Royals during the 2002 baseball season:
GAME = Home games in the order they were played
ATTENDANCE = Paid attendance for the game
PROMOTION: Y = a promotion was held, N = no promotion was held
a. Calculate the mean and standard deviation of attendance for the 43 games where promotions were held and for the 37 games without promotions.
b. Construct a five-number summary for the 43 games where promotions were held and for the 37 games without promotions.
c. Construct a graphical display containing two box-and-whisker plots, one for the 43 games where promotions were held and one for the 37 games without promotions.
d. Discuss the results of (a) through (c) and comment on the effectiveness of promotions at Royals’ games during the 2002 season.

3.71 The data contained in the file PETFOOD2 consist of the cost per serving, cups per can, protein in grams, and fat in grams for 97 varieties of dry and canned dog and cat food.
Source: Extracted from Copyright 1998 by Consumers Union of U.S., Inc., Yonkers, NY 10703–1057. Adapted with permission from Consumer Reports, February 1998, 18–19.
For the four types of food (dry dog food, canned dog food, dry cat food, and canned cat food), for the variables of cost per serving, protein in grams, and fat in grams:
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a side-by-side box-and-whisker plot for the four types (dry dog food, canned dog food, dry cat food, and canned cat food). Are the data for any of the types of food skewed? If so, how?