Very 10
Somewhat 14
None 6
Class Limits    f
5–9             3
10–14           6
15–19           8
20–24           8
25–29           5
Relative frequency and percentage could also be
calculated for the grouped data.
iPods Sold f Relative frequency percentage
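As a minimal sketch of how the relative frequency and percentage columns could be filled in, the Python snippet below assumes the class limits and frequencies from the grouped table above (3, 6, 8, 8, 5). The variable names are illustrative only.

```python
# Class limits and frequencies copied from the grouped table above.
classes = ["5-9", "10-14", "15-19", "20-24", "25-29"]
freqs = [3, 6, 8, 8, 5]

total = sum(freqs)  # total number of observations

# Relative frequency = class frequency / total.
# Percentage = relative frequency * 100.
rel_freqs = [f / total for f in freqs]
percentages = [100 * r for r in rel_freqs]

print(f"{'Class':<8}{'f':>4}{'Rel. freq.':>12}{'Percent':>10}")
for c, f, r, p in zip(classes, freqs, rel_freqs, percentages):
    print(f"{c:<8}{f:>4}{r:>12.3f}{p:>10.1f}")
```

The relative frequencies always sum to 1 and the percentages to 100, which is a quick check on the arithmetic.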
[Figure: horizontal bar chart of frequency (axis from 0 to 16) for the categories None, Somewhat, and Very]
A Pareto chart is a bar chart used to separate the “vital few” from the “trivial many.”
These charts are based on the Pareto Principle, which states that 20
percent of the problems have 80 percent of the impact. The 20
percent of the problems are the “vital few” and the remaining
problems are the “trivial many.” A Pareto chart can help you:
•Separate the few major problems from the many possible problems
so you can focus your improvement efforts.
•Arrange data according to priority or importance.
•Determine which problems are most important,
using data, not perception.
Steps
•Arrange the data from the largest to smallest
according to frequency.
•Draw and label the x and y axes.
•Draw the bars corresponding to the frequencies.
Example
Twenty-five army inductees were given a blood
test to determine their blood type.
Raw Data:
A,B,B,AB,O,O,O,B,AB,B,B,B,O,A,O,A,O,O,O,AB,AB,
A,O,B,A
Pareto chart analysis for counts
Blood type   Frequency   Cum. Freq.   Percentage   Cum. Percent
O            9           9            36           36
B            7           16           28           64
A            5           21           20           84
AB           4           25           16           100
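The table above can be reproduced from the raw data with a short Python sketch. It counts each blood type, sorts from largest to smallest frequency, and accumulates the counts and percentages. The variable names are illustrative.

```python
from collections import Counter

# Raw blood types of the 25 inductees, as listed in the example.
raw = ("A,B,B,AB,O,O,O,B,AB,B,B,B,O,A,O,A,O,O,O,AB,AB,"
       "A,O,B,A").split(",")

counts = Counter(raw)
n = len(raw)

# Arrange the categories from largest to smallest frequency.
ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

rows = []
cum = 0
for blood_type, freq in ordered:
    cum += freq  # cumulative frequency so far
    rows.append((blood_type, freq, cum, 100 * freq / n, 100 * cum / n))

for row in rows:
    print("{:<3}{:>4}{:>4}{:>6.0f}{:>6.0f}".format(*row))
```

The sorted order of `rows` is also the left-to-right order of the bars in the Pareto chart.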
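[Figure: Pareto Chart for Blood group; frequency axis from 0 to 25]
[Figure: pie chart with slices Very 33%, Somewhat 47%, None 20%]
[Figure: chart of frequency against the class boundaries 9.5, 14.5, 19.5, 24.5, 29.5]
[Figure: chart with vertical axis from 0 to 30 against the class boundaries 9.5, 14.5, 19.5, 24.5, 29.5]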
5 | 027
6 | 14589
7 | 112256799
8 | 0134677
9 | 223568
Example
The following data give the monthly rents paid by a
sample of 30 households selected from a small
town. Construct a stem-and-leaf display for these
data.
880 1081 721 1075 1023 775 1235 750 965 960
1210 985 1231 932 850 825 1000 915 1191 1035
1151 630 1175 952 1100 1140 750 1140 1370 1280
 6 | 30
 7 | 75 50 21 50
 8 | 80 25 50
 9 | 32 52 15 60 85 65
10 | 23 81 35 75 00
11 | 91 51 40 75 40 00
12 | 10 31 35 80
13 | 70
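A stem-and-leaf display like the one above can be built mechanically. The Python sketch below assumes the stem is the hundreds part of each rent and the leaf is the last two digits. Unlike the display above, it also ranks the leaves within each stem.

```python
from collections import defaultdict

# Monthly rents of the 30 households from the example.
rents = [880, 1081, 721, 1075, 1023, 775, 1235, 750, 965, 960,
         1210, 985, 1231, 932, 850, 825, 1000, 915, 1191, 1035,
         1151, 630, 1175, 952, 1100, 1140, 750, 1140, 1370, 1280]

# Split each value into stem (hundreds) and leaf (last two digits).
stems = defaultdict(list)
for r in rents:
    stem, leaf = divmod(r, 100)
    stems[stem].append(leaf)

# Print one row per stem, with leaves in increasing order.
for stem in sorted(stems):
    leaves = " ".join(f"{leaf:02d}" for leaf in sorted(stems[stem]))
    print(f"{stem:>2} | {leaves}")
```

Ranking the leaves makes it easier to read off the median and quartiles directly from the display.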
Box plot
A recently created graphic display, called a boxplot,
highlights the summary information in the
quartiles. The center half of the data, from the first
to the third quartile, is represented by a
rectangle (box) with the median indicated by a bar.
A line extends from Q3 to the maximum value and
another from Q1 to the minimum.
The construction of a box-and-whisker plot
(sometimes called, simply, a boxplot) makes use of
the quartiles of a data set and may be
accomplished by following these five steps:
1. Represent the variable of interest on the horizontal axis.
2. Draw a box in the space above the horizontal axis in such a way
that the left end of the box aligns with the first quartile and the
right end of the box aligns with the third quartile.
3. Divide the box into two parts by a vertical line that aligns with
the median.
4. Draw a horizontal line called a whisker from the left end of the
box to a point that aligns with the smallest measurement in the
data set.
5. Draw another horizontal line, or whisker, from the right end of
the box to a point that aligns with the largest measurement in
the data set.
Examination of a box-and-whisker plot for a set of data reveals
information regarding the amount of spread, location of
concentration, and symmetry of the data.
Example: In an epidemiological study, the total
organochlorines and PCBs present in milk samples
were recorded from 40 donors in Colorado.
(Source: Pesticides Monitoring Journal, June 1973.)
The measurements were ordered from lowest to
highest. For the data set, construct a box plot.
27 43 52 53 53 53 61 63 63 65
68 70 72 75 83 95 96 97 101 105
110 115 115 115 115 126 127 134 145 152
153 182 190 197 197 282 322 322 342 521
Sort: 27 43 52 53 53 53 61 63 63 65 68 70 72 75 83 95
96 97 101 105 110 115 115 115 115 126 127 134 145 152 153 182
190 197 197 282 322 322 342 521
Min. 1st Qu. Median Mean 3rd Qu. Max.
27.00 65.75 107.50 133.90 152.75 521.00
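The summary values above can be checked with Python's standard `statistics` module. This is a sketch under one assumption: quartile conventions differ between textbooks and packages, and `method="exclusive"` (quartile k at position k(n+1)/4 in the ordered data) is chosen here only because it reproduces the figures shown.

```python
import statistics

# The 40 ordered milk-sample measurements from the example.
milk = [27, 43, 52, 53, 53, 53, 61, 63, 63, 65,
        68, 70, 72, 75, 83, 95, 96, 97, 101, 105,
        110, 115, 115, 115, 115, 126, 127, 134, 145, 152,
        153, 182, 190, 197, 197, 282, 322, 322, 342, 521]

# Assumption: the "exclusive" convention, which matches the summary above.
q1, median, q3 = statistics.quantiles(milk, n=4, method="exclusive")

summary = {
    "min": min(milk),
    "q1": q1,
    "median": median,
    "mean": statistics.mean(milk),
    "q3": q3,
    "max": max(milk),
}
print(summary)
```

The box of the boxplot spans `q1` to `q3`, the bar inside it sits at `median`, and the whiskers reach `min` and `max`.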
Outliers or Extreme Values
Values that are very small or very large relative to
the majority of the values in a data set are called
outliers or extreme values.
An outlier is an observation whose value, x, either
exceeds the value of the third quartile by a
magnitude greater than 1.5(IQR) or is less than the
value of the first quartile by a magnitude greater
than 1.5(IQR).
That is, an observation of x > Q3 + 1.5(IQR) or an
observation of x < Q1 - 1.5(IQR) is called an outlier.
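As an illustration of the 1.5(IQR) rule, the Python sketch below applies it to the 40 milk-sample measurements from the boxplot example. The quartile convention (`method="exclusive"`) is an assumption; a different convention would shift the fences slightly.

```python
import statistics

# The 40 ordered milk-sample measurements from the boxplot example.
milk = [27, 43, 52, 53, 53, 53, 61, 63, 63, 65,
        68, 70, 72, 75, 83, 95, 96, 97, 101, 105,
        110, 115, 115, 115, 115, 126, 127, 134, 145, 152,
        153, 182, 190, 197, 197, 282, 322, 322, 342, 521]

# Assumed quartile convention: position k(n+1)/4 in the ordered data.
q1, _, q3 = statistics.quantiles(milk, n=4, method="exclusive")
iqr = q3 - q1

# Fences from the definition: Q1 - 1.5(IQR) and Q3 + 1.5(IQR).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in milk if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)
```

With these data the lower fence is negative, so only the large values can be flagged. The value 282 falls just inside the upper fence and is not an outlier under this rule.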
Assignment
1. Evans et al. examined the effect of velocity on ground
reaction forces (GRF) in dogs with lameness from a
torn cranial cruciate ligament. The dogs were walked
and trotted over a force platform, and the GRF was
recorded during a certain phase of their performance.
The table below contains 20 measurements of force
where each value shown is the mean of five force
measurements per dog when trotting. Construct a
boxplot for the data set.
37 6 20 5 25 30 24 10 12 20
24 8 26 15 13 22 72 80 96 33
84 86 70 40 92 36 28 90 36 32
72 45 38 18 9
3. The following table gives the frequency distribution of the
number of credit cards possessed by 80 adults.
Number of Credit Cards    Number of Adults
0–3                       18
4–7                       26
8–11                      22
12–15                     11
16–19                     3
a. Prepare a cumulative frequency distribution.
b. Calculate the cumulative relative frequencies and
cumulative percentages for all classes.
c. Find the percentage of these adults who possess 7 or fewer
credit cards.
d. Draw an ogive for the cumulative percentage distribution.
e. Using the ogive, find the percentage of adults who possess
10 or fewer credit cards.
4. Nixon Corporation manufactures computer monitors. The
following data are the numbers of computer monitors produced
at the company for a sample of 30 days.
24 32 27 23 33 33 29 25 23 28
21 26 31 22 27 33 27 23 28 29
31 35 34 22 26 28 23 35 31 27
Region   1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
East     20        25        90        20
West     30        35        40        30
North    45        45        50        40
[Figure: bar chart of East, West, and North for 1st Qtr to 4th Qtr; vertical axis from 0 to 100]
B. Stacked bar charts
Stacked bar charts are similar to grouped bar charts
in that they are used to display information about
the sub-groups that make up the different
categories.
In stacked bar charts the bars representing the
sub-groups are placed on top of each other to
make a single column, or end to end to make a
single bar. The overall height or length of the bar
shows the total size of the category, whilst
different colours or shadings are used to indicate
the relative contribution of the different
sub-groups.
Consider the example of a stacked bar chart for
agricultural produce in some parts of Nigeria.
Agricultural produce in some parts of Nigeria
Region   1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
East     20        25        90        20
West     30        35        40        30
North    45        45        50        40
[Figure: stacked bar chart of East, West, and North for 1st Qtr to 4th Qtr; vertical axis from 0 to 200]
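To make the stacking concrete, the Python sketch below computes where each sub-group's segment starts and the overall bar height for each quarter. It assumes the East segment is drawn at the bottom, then West, then North. The stacking order and the variable names are illustrative assumptions.

```python
quarters = ["1st Qtr", "2nd Qtr", "3rd Qtr", "4th Qtr"]

# Produce figures from the table; insertion order is the assumed stacking order.
produce = {
    "East":  [20, 25, 90, 20],
    "West":  [30, 35, 40, 30],
    "North": [45, 45, 50, 40],
}

# Each region's segment starts where the segments below it end.
bottoms = {}
running = [0] * len(quarters)
for region, values in produce.items():
    bottoms[region] = list(running)
    running = [b + v for b, v in zip(running, values)]

totals = running  # overall height of each stacked bar

for q, t in zip(quarters, totals):
    print(q, t)
```

The largest total (180, in the 3rd quarter) explains why the stacked chart needs a vertical axis running to 200.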
DESCRIPTIVE STATISTICS:
MEASURES OF CENTRAL TENDENCY
A descriptive measure computed from the data of
a sample is called a statistic.
A descriptive measure computed from the data of
a population is called a parameter.
Mean for population data: $\mu = \frac{\sum x}{N}$
Mean for sample data: $\bar{x} = \frac{\sum x}{n}$
$\bar{x} = \frac{\sum x}{n} = \frac{1368}{6} = 228$
Thus, the mean 2008 sales of these six companies was 228.
Properties of the Mean
The arithmetic mean possesses certain properties,
some desirable and some not so desirable. These
properties include the following:
1. Uniqueness. For a given set of data there is one and
only one arithmetic mean.
2. Simplicity. The arithmetic mean is easily understood
and easy to compute.
3. Since each and every value in a set of data enters
into the computation of the mean, it is affected by
each value. Extreme values, therefore, have an
influence on the mean and, in some cases, can so
distort it that it becomes undesirable as a measure of
central tendency.
Class Activity
1. The following are the ages (in years) of all eight employees
of a small company:
53 32 61 27 39 44 49 57
Find the mean age of these employees.
2. The following data give the 2006–07 team salaries for 20
teams of the English Premier League, arguably the best-
known soccer league in the world. The salaries are given in
the order in which the teams finished during the 2006–07
season. The salaries are in millions of British pounds (note
that the approximate value of 1 British pound was $1.95
during the 2006–07 season, so the team salaries range from
$34.3 million to $259 million). (Source: BBC, May 28, 2008.)
92.3 132.8 77.6 89.7 43.8 38.4 30.7
29.8 36.9 36.7 43.2 38.3 62.5 36.4
44.2 35.2 27.5 22.4 34.3 17.6
Find the mean for these data.
Assignment
1. Johnson et al. performed a retrospective review of 50 fetuses
that underwent open fetal myelomeningocele closure. The data
below show the gestational age in weeks of the 50 fetuses
undergoing the procedure.
25 25 26 27 29 29 29 30 30 31
32 32 32 33 33 33 33 34 34 34
35 35 35 35 35 35 35 35 35 36
36 36 36 36 36 36 36 36 36 36
36 36 36 36 36 36 36 36 37 37
(a) Construct a stem-and-leaf plot for these gestational ages.
(b) Based on the stem-and-leaf plot, what one word would you
use to describe the nature of the data?
(c) Why do you think the stem-and-leaf plot looks the way it
does?
(d) Compute the mean.
2. The purpose of a study by Tam et al. was to investigate the
wheelchair maneuvering in individuals with lower-level
spinal cord injury (SCI) and healthy controls. Subjects used a
modified wheelchair to incorporate a rigid seat surface to
facilitate the specified experimental measurements.
Interface pressure measurement was recorded by using a
high-resolution pressure-sensitive mat with a spatial
resolution of 4 sensors per square centimeter taped on the
rigid seat support. During static sitting conditions, average
pressures were recorded under the ischial tuberosities. The
data for measurements of the left ischial tuberosity (in mm
Hg) for the SCI and control group are shown below.
Control: 131, 115, 124, 131, 122, 117, 88, 114, 150, 169.
SCI: 60, 150, 130, 180, 163, 130, 121, 119, 130, 148.
Find the mean for the controls and the SCI group.
Other Means
We must not think that the arithmetic mean is the
only important mean. The geometric mean and
harmonic mean are both important in some areas of
engineering.
The geometric mean is defined as the nth root of
the product of n observations:
$\bar{x} = \sqrt[n]{x_1 x_2 x_3 \cdots x_n}$
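Harmonic mean
The harmonic mean of n observations is n divided by the sum of
their reciprocals. For the five values 8, 16, 30, 18, and 22:
$\bar{x} = \frac{5}{\frac{1}{8} + \frac{1}{16} + \frac{1}{30} + \frac{1}{18} + \frac{1}{22}}$
$\bar{x} = \frac{5}{0.3218} = 15.5376$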
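The three means can be compared on the same five values with Python's standard `statistics` module, as in the sketch below. Computed without intermediate rounding, the harmonic mean comes out near 15.5355. The 15.5376 above results from rounding the denominator to 0.3218 first.

```python
import statistics

values = [8, 16, 30, 18, 22]

arithmetic = statistics.mean(values)
geometric = statistics.geometric_mean(values)  # nth root of the product
harmonic = statistics.harmonic_mean(values)    # n / sum of reciprocals

# The harmonic mean written out directly from its definition.
harmonic_by_hand = len(values) / sum(1 / x for x in values)

print(arithmetic, geometric, harmonic)
```

For positive data the harmonic mean is never larger than the geometric mean, which in turn is never larger than the arithmetic mean.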
Median
If all the items with which we are concerned are sorted in
order of increasing magnitude (size), from the smallest to
the largest, then the median is the middle item.
As is obvious from the definition of the median, it divides a
ranked data set into two equal parts. The calculation of the
median consists of the following two steps:
1. Rank the data set in increasing order.
2. Find the middle term. The value of this term is the
median.
Note that if the number of observations in a data set is odd,
then the median is given by the value of the middle term in
the ranked data.
However, if the number of observations is even, then the
median is given by the average of the values of the two
middle terms.
Example
The following data give the prices (in thousands of
dollars) of seven houses selected from all houses sold
last month in a city.
312 257 421 289 526 374 497
Find the median.
First, we rank the given data in increasing order as
follows:
257 289 312 374 421 497 526
Since there are seven homes in this data set and the
middle term is the fourth term, the median is given by
the value of the fourth term in the ranked data.
257 289 312 374 421 497 526
Thus, the median price of a house is 374, or $374,000.
The following table gives the 2008 profits (rounded to billions of dollars) of 12 companies selected
from all over the world.
Company    2008 Profits (billions of dollars)
Merck & Co 8
IBM 12
Unilever 7
Microsoft 17
Petrobras 14
Exxon Mobil 45
Lukoil 10
AT&T 13
Nestlé 17
Vodafone 13
Deutsche Bank 9
China Mobile 11
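The two-step procedure for the median can be sketched in Python as follows. It is applied to the seven house prices (odd number of observations) and to the twelve company profits in the table above (even number of observations). The function name `median` is illustrative.

```python
import statistics

house_prices = [312, 257, 421, 289, 526, 374, 497]       # odd n
profits = [8, 12, 7, 17, 14, 45, 10, 13, 17, 13, 9, 11]  # even n

def median(data):
    """Rank the data, then take the middle term (or average the two middle terms)."""
    ranked = sorted(data)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

print(median(house_prices))  # 374
print(median(profits))       # 12.5
```

The profit of 45 for Exxon Mobil is far larger than the rest, yet it has no effect on the median beyond its rank, which is why the median is often preferred when extreme values are present.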
The Mode
The mode of a set of values is that value which occurs
most frequently.
If all the values are different there is no mode; on the
other hand, a set of values may have more than one
mode.
Example
The following data give the speeds (in miles per
hour) of eight cars that were stopped on I-95 for
speeding violations.
77 82 74 81 79 84 74 78
Find the mode.
In this data set, 74 occurs twice, and each of the
remaining values occurs only once. Because 74
occurs with the highest frequency, it is the mode.
Therefore,
Mode is 74 miles per hour
A data set with only one value occurring with the
highest frequency has only one mode. The data
set in this case is called unimodal.
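A minimal Python sketch of finding the mode is shown below. It follows the convention stated above that a data set whose values are all different has no mode, and it returns every value tied for the highest frequency. The function name `modes` is illustrative.

```python
from collections import Counter

def modes(data):
    """Return all values with the highest frequency, or [] if no value repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # all values different: no mode
    return [value for value, count in counts.items() if count == highest]

speeds = [77, 82, 74, 81, 79, 84, 74, 78]
print(modes(speeds))  # [74]
```

A result with one element corresponds to a unimodal data set. A result with several elements means the data set has more than one mode.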
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
where $\sigma^2$ is the population variance and $s^2$ is the
sample variance.
The quantity $x_i - \mu$ or $x_i - \bar{x}$ in the above formulas is
called the deviation of the x value from the mean. The
sum of the deviations of the x values from the mean is
always zero.
Likewise, the following are what we will call the
basic formulas that are used to calculate the
standard deviation,
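$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
where $\sigma$ is the population standard deviation and
s is the sample standard deviation.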
For example, suppose the midterm scores of a
sample of four students are 82, 95, 67, and 92,
respectively. Then, the mean score for these four
students is
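$\bar{x} = \frac{336}{4} = 84$
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
$s^2 = \frac{(-2)^2 + (11)^2 + (-17)^2 + (8)^2}{4 - 1}$
$s^2 = \frac{478}{3}$
$s^2 = 159.33$
$SD = \sqrt{s^2}$
$SD = \sqrt{159.33}$
$SD = 12.62$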
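The same calculation can be traced step by step in Python. The sketch below uses the basic (deviation) formula for the sample variance and checks the result against the standard `statistics` module. The variable names are illustrative.

```python
import math
import statistics

scores = [82, 95, 67, 92]  # midterm scores of the four students

n = len(scores)
mean = sum(scores) / n

# Deviations of each score from the mean; these always sum to zero.
deviations = [x - mean for x in scores]

# Basic formula: sum of squared deviations divided by n - 1.
sample_var = sum(d ** 2 for d in deviations) / (n - 1)
sample_sd = math.sqrt(sample_var)

print(mean, deviations, round(sample_var, 2), round(sample_sd, 2))
```

`statistics.variance(scores)` and `statistics.stdev(scores)` use the same n - 1 divisor, so they should agree with `sample_var` and `sample_sd`.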
Short-Cut Formulas for the Variance and
Standard Deviation for Ungrouped Data
The standard deviation is obtained by taking the
positive square root of the variance.
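$\sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$, equivalently $\sigma^2 = \frac{\sum x^2 - N\mu^2}{N}$
$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$, equivalently $s^2 = \frac{\sum x^2 - n\bar{x}^2}{n - 1}$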
The following table gives the 2008 market values
(rounded to billions of dollars) of five international
companies.
Company Market Value (billions of dollars)
PepsiCo 75
Google 107
PetroChina 271
Johnson & Johnson 138
Intel 71
Find the variance and standard deviation for these
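data.
$\sum x = 662$
$\sum x^2 = 114600$
$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$
$s^2 = \frac{114600 - \frac{(662)^2}{5}}{5 - 1}$
$s^2 = \frac{114600 - 87648.8}{4}$
$s^2 = 6737.80$
$s = \sqrt{s^2}$
$s = \sqrt{6737.8}$
$s = 82.08$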
Thus, the variance of the market values of these five
companies is 6737.80 (in squared billions of dollars) and
the standard deviation is $82.08 billion.
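The short-cut calculation for the five market values can be sketched in Python as follows. It needs only the sum of the values and the sum of their squares, and the result is compared with the `statistics` module as a check. The variable names are illustrative.

```python
import math
import statistics

market_values = [75, 107, 271, 138, 71]  # billions of dollars

n = len(market_values)
sum_x = sum(market_values)                  # sum of x
sum_x2 = sum(x * x for x in market_values)  # sum of x squared

# Short-cut formula for the sample variance.
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)
s = math.sqrt(s2)

print(sum_x, sum_x2, round(s2, 2), round(s, 2))
```

The short-cut form is convenient by hand. In floating-point arithmetic it subtracts two large, nearly equal numbers, so for data with a large mean and a small spread the deviation-based formula is usually the safer choice.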
Observation
•The values of the variance and the standard deviation
are never negative. That is, the numerator in the
formula for the variance should never produce a
negative value. Usually the values of the variance and
standard deviation are positive, but if a data set has
no variation, then the variance and standard deviation
are both zero.
•The measurement units of variance are always the
square of the measurement units of the original data.
This is so because the original values are squared to
calculate the variance.
Coefficient of variation CV
One disadvantage of the standard deviation as a
measure of dispersion is that it is a measure of
absolute variability and not of relative variability.
Sometimes we may need to compare the variability of
two different data sets that have different units of
measurement. The coefficient of variation is a measure of
relative variability suited to such comparisons. The
coefficient of variation, denoted by CV, expresses the
standard deviation as a percentage of the
mean and is computed as follows:
For population data: $CV = \frac{\sigma}{\mu} \times 100\%$