Statistics & Probability PDF
Chapter 29
Statistics is the field of study where data are collected for the purpose of drawing conclusions and making inferences. Descriptive statistics is the summarization and description of data, and inferential statistics is the estimation, prediction, and/or generalization about the population based on the data from a sample.

Four elements are essential to inferential statistical problems:

1. Population is the collection of all elements of interest to the decision-maker. The size of the population is usually denoted by N. Very often, the population is so large that a complete census is out of the question. Sometimes, not even a small population can be examined entirely because it may be destructive or prohibitively expensive to obtain the data. Under these situations, we draw inferences based upon a part of the population (called a sample).
2. Sample is a subset of data randomly selected from a population. The size of a sample is usually denoted by n.
3. Statistical inference is an estimation, prediction or generalization about the population based on the information from the sample.
4. Reliability is the measurement of the “goodness” of the inference.

Only the first two elements will be discussed in this chapter. Numerical characteristics of a population are called parameters of the population. The corresponding numerical characteristics calculated from a sample are called sample statistics.

In general, data can be classified as either qualitative or quantitative.

Qualitative Data
Qualitative data can be categorized or summarized.

Example:
As of September 2, 2003, the total membership of AACE International is 4,307. This can be classified according to member types:

members and associates   4,036
students                   147
honorary                   124
total                    4,307

Or according to geographical distribution:

U.S.         3,509
Canada         480
Caribbean       28
Asia           158
Africa          29
Europe          48
Australia       55
total        4,307

Quantitative Data
The description of quantitative data is more complex. It can be described graphically or numerically.
STATISTICS & PROBABILITY AACE INTERNATIONAL
Numerical methods for describing quantitative data include the following:

• measures of location (central tendency),
  -mean (average),
  -median, and
  -mode;
• measures of dispersion,
  -range,
  -variance, and
  -standard deviation;
• relative standing,
  -percentile, and
  -Z-score.

Example:
Many companies invest in training their employees. The following average training hours per employee are selected from “The 100 Best Companies to Work For” (Fortune, January 20, 2003). Numbers are rounded to the nearest 5 for convenience.

145   50   35   50   25  160   30   40   40   20
 40   20   30   95   40   50   30   50   20   35
 70   35   40   45   30   45   70   45   40  140
 50   70   20   30   40   40   30   60   40   50
 60   55   50   35   40   30   45   75   20   40

The data will first be divided into smaller equal intervals (classes). The number of observations that fall into each class, the frequency, is then counted. These classes should not overlap and there should be enough classes to include all the data. There may be open-ended intervals when the first class contains no lower limit or the last data class contains no upper limit. The number of classes depends on the number of observations in the data set. In practice, a frequency distribution usually has from five to twenty classes.

Stem   Leaf                                            f (frequency)   f/n (relative frequency)
2      5, 0, 0, 0, 0, 0                                 6               6/50
3      5, 0, 0, 0, 5, 5, 0, 0, 0, 5, 0                 11              11/50
4      0, 0, 0, 0, 0, 5, 5, 5, 0, 0, 0, 0, 0, 5, 0     15              15/50
5      0, 0, 0, 0, 0, 0, 5, 0                           8               8/50
6      0, 0                                             2               2/50
7      0, 0, 0, 5                                       4               4/50
8                                                       0               0
9+     95, 145, 160, 140                                4               4/50
total                                                  50              50/50

[Histogram: Frequency (6, 11, 15, 8, 2, 4, 0, 4) plotted against Training Hours (classes 20, 30, 40, 50, 60, 70, 80, 90+)]

Figure 29.1—Stem, Leaf, Frequency and Relative Frequency Distributions
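The class counts for the training-hours example can be reproduced with a short script. This is a minimal sketch, assuming classes ten hours wide (20-29, 30-39, and so on) with one open-ended class for 90 hours and above; the helper name `class_label` is illustrative and not part of the text.

```python
from collections import Counter

# Average training hours per employee, as listed in the example (n = 50).
hours = [
    145, 50, 35, 50, 25, 160, 30, 40, 40, 20,
    40, 20, 30, 95, 40, 50, 30, 50, 20, 35,
    70, 35, 40, 45, 30, 45, 70, 45, 40, 140,
    50, 70, 20, 30, 40, 40, 30, 60, 40, 50,
    60, 55, 50, 35, 40, 30, 45, 75, 20, 40,
]

def class_label(x):
    """Map a value to its class: 20-29 -> '2', ..., 80-89 -> '8', 90 and up -> '9+'."""
    return "9+" if x >= 90 else str(x // 10)

classes = ["2", "3", "4", "5", "6", "7", "8", "9+"]
counts = Counter(class_label(x) for x in hours)
n = len(hours)

# Frequency f and relative frequency f/n for every class, including empty ones.
freq_table = {c: (counts.get(c, 0), counts.get(c, 0) / n) for c in classes}

for c, (f, rel) in freq_table.items():
    print(f"{c:>2}  f={f:2d}  f/n={rel:.2f}")
```

The open-ended last class keeps the four large observations (95, 140, 145, 160) from forcing many empty classes.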
Stem Leaf
4 5
The simplest measure of the variability of a data set is its range.

Figure 29.5—Leftward Skewness
accident. The alternative might be to calculate the average absolute deviation. However, this measure is rarely used because it is difficult to handle algebraically and does not have the nice mathematical properties possessed by the variance.

Variance: The average of the squared deviations from the mean.

The population variance is denoted by

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{\sum x^2 - N\mu^2}{N}$$

The sample variance is denoted by

$$s^2 = \frac{\sum (x - \bar{x})^2}{n-1} = \frac{\sum x^2 - n\bar{x}^2}{n-1}$$

The sample standard deviation s of this example is

$$s = \sqrt{882.95} = 29.71 \text{ hours}$$

The standard deviation can be approximated by range/6. Some may prefer to use range/4.

Measures of Relative Standing
Another measure of interest is the relative location of a particular observation within a data set.

pth percentile: In any data set, the pth percentile is the number such that exactly p percent of the measurements fall below it and (100-p) percent fall above it when the data are arranged in ascending or descending order.

The first (lower) quartile: the 25th percentile
The third (upper) quartile: the 75th percentile

And you guessed it; the second (middle) quartile is the median.

For the above training hours example, the 80th percentile is 60 hours.

Another measure of relative standing is the famous z-score.

A z-score is the number of standard deviations a point is above or below the mean of a set of data.

The population z-score for a measurement x is z = (x - µ)/σ

The sample z-score for a measurement x is z = (x - x̄)/s

If a random variable can take on only a countable number of values, then we call it a discrete random variable. For example, the number of sales made by a salesperson in a given day.

Random variables that can assume any value within some interval or intervals are called continuous random variables. For example, the length of time an employee is late for work.

Probability Distribution: A probability distribution is expressed in a table listing all possible values that a random variable can take on together with the associated probabilities.

Notice that the random variable itself will be denoted by X, while the small x denotes a particular value of X. The symbol p(x) = Pr( X=x ) means the probability that the experiment yields the value x.
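The sample variance and standard deviation quoted for the training-hours example can be checked directly. The sketch below uses Python's standard `statistics` module and also evaluates the shortcut form of the sample variance formula; the `z_score` helper is an illustrative name, not something defined in the text.

```python
import math
import statistics

# Training hours from the example (n = 50).
hours = [
    145, 50, 35, 50, 25, 160, 30, 40, 40, 20,
    40, 20, 30, 95, 40, 50, 30, 50, 20, 35,
    70, 35, 40, 45, 30, 45, 70, 45, 40, 140,
    50, 70, 20, 30, 40, 40, 30, 60, 40, 50,
    60, 55, 50, 35, 40, 30, 45, 75, 20, 40,
]

n = len(hours)
x_bar = statistics.mean(hours)

# Definition: sum of squared deviations from the mean, divided by n - 1.
s2 = statistics.variance(hours)

# Shortcut form: (sum of x^2 - n * mean^2) / (n - 1).
s2_shortcut = (sum(x * x for x in hours) - n * x_bar ** 2) / (n - 1)

s = math.sqrt(s2)

def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) the center."""
    return (x - center) / spread

print(f"mean = {x_bar:.1f}, s^2 = {s2:.2f}, s = {s:.2f}")
print(f"z-score of 145 hours = {z_score(145, x_bar, s):.2f}")
```

Both forms of the variance give the same value up to floating-point rounding, which is a quick way to catch a data-entry slip.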
In a discrete probability distribution, p(x) ≥ 0 for all values of x and ∑p(x) = 1.

Example: Two coins are tossed. Let X be the number of heads that appear.

x       0     1     2
p(x)   1/4   2/4   1/4

The probabilities p(x) can be interpreted as long-run relative frequencies. For instance, if two coins were flipped many, many times, we could anticipate obtaining two tails (X=0) about one-fourth of the time, one head and one tail (X=1) one half of the time and two heads one-fourth of the time. Therefore, the probability distribution for a random variable is a theoretical model for the relative frequency distribution of a population.

Like the frequency distribution, the mean and standard deviation of a probability distribution need to be calculated to describe the central location and spread of the probability distribution.

The predicted long-range average of a discrete random variable X, often called the expected value (or mean) of X, is defined by

$$\mu = E(X) = \sum x\,p(x)$$

The variance of X is defined by

$$\sigma^2 = E[(X - \mu)^2] = \sum (x - \mu)^2\,p(x)$$

The standard deviation σ is the square root of the variance.

Suppose you work for an insurance company and sell an individual 10-year $100,000 term life insurance coverage at an annual premium of $240. Actuarial tables show that the probability of death during the next year for a person of your customer’s age, sex, health, etc., is .001. What is the expected gain to the company for a policy of this type?

Probability distribution of X

Gain (X)           Event             Probability
$240               customer lives    .999
$240 – 100,000     customer dies     .001

If the customer lives, the company keeps the $240 premium. If the customer dies, the company must pay $100,000 and will have a net “gain” of $(240 - 100,000). The expected gain is therefore

µ = E(X) = (240)(.999) + (240 - 100,000)(.001) = 240(.999 + .001) - 100,000(.001) = $140

Please note that for each policy sold, the insurance company is taking a risk of either gaining $240 or losing $99,760. However, if the company were to sell a very large number of such insurance policies to customers possessing the characteristics described above, the company would on the average net $140 per policy written.

There are several theoretical discrete probability distributions that have extensive applications in decision-making. One will be introduced in this section.
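The expected-value and variance definitions can be applied to both examples on this page, the two-coin toss and the insurance policy. This is a minimal sketch in which each distribution is written as a dictionary mapping a value x to its probability p(x); the function names are illustrative.

```python
from math import isclose, sqrt

def expected_value(dist):
    """mu = sum of x * p(x) over all values x."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """sigma^2 = sum of (x - mu)^2 * p(x) over all values x."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

# Number of heads when two coins are tossed.
coins = {0: 0.25, 1: 0.50, 2: 0.25}

# Company's gain on one policy: keep the $240 premium, or pay out $100,000.
policy = {240: 0.999, 240 - 100_000: 0.001}

for name, dist in (("coins", coins), ("policy", policy)):
    # A valid discrete distribution has probabilities that sum to 1.
    assert isclose(sum(dist.values()), 1.0)
    mu = expected_value(dist)
    sigma = sqrt(variance(dist))
    print(f"{name}: mean = {mu:.2f}, standard deviation = {sigma:.2f}")
```

For the policy, the standard deviation is far larger than the $140 mean, which is the numerical form of the remark that each individual policy is a risk even though the long-run average is positive.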
Pr(Accepting the lot) = Pr(x < 2) = Pr(x ≤ 1) = Pr(x = 0) + Pr(x = 1)

$$= \binom{25}{0}(.1)^0(.9)^{25} + \binom{25}{1}(.1)^1(.9)^{24} = .27121$$

(b) If p = .01; q = .99

$$\Pr(x \le 1) = \binom{25}{0}(.01)^0(.99)^{25} + \binom{25}{1}(.01)^1(.99)^{24} = .77782 + .19642 = .97424$$

The two measurements indicate that, under the proposed quality control plan, there is a 73 percent chance that you will reject the lot if in fact 10 percent (p=0.1) of the fuses manufactured are defective. If only 1 percent of the fuses are defective, the chance of re-inspection is very small (less than 3 percent).

Because there is no area over a point, the probability associated with any particular value of x, say, x=a, is equal to zero. Hence, Pr(a≤x≤b)=Pr(a<x<b). In other words, the probability is the same regardless of whether the endpoints of the interval are included. The total area under the curve, which is the total probability for x, equals 1.

The areas under most probability density functions are obtained by the use of calculus or other numerical methods. This is often a difficult procedure. However, as with commonly used discrete probability distributions, tables exist for finding probabilities under commonly used continuous probability distributions.

Similar to the requirements for a discrete probability distribution, we require

$$f(x) \ge 0 \text{ for all } x \quad \text{and} \quad \int f(x)\,dx = 1$$

The Normal Distribution
The most important continuous distribution in statistical decision making is the normal distribution. It is important for the following reasons:
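Returning to the lot-acceptance calculation earlier on this page, the two binomial probabilities can be verified numerically. The sketch below assumes, as the exponents in that calculation indicate, a sample of 25 fuses with the lot accepted when fewer than two are defective; the function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k defectives in a sample of n, each defective with probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def prob_accept(n, p, max_defects=1):
    """Probability that the sample contains at most max_defects defectives."""
    return sum(binom_pmf(k, n, p) for k in range(max_defects + 1))

for p in (0.10, 0.01):
    accept = prob_accept(25, p)
    print(f"p = {p:.2f}: Pr(accept) = {accept:.5f}, Pr(reject) = {1 - accept:.5f}")
```

Evaluating `prob_accept` over a range of p values would trace out how quickly the plan's acceptance probability falls as lot quality worsens.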
Figure 29.8—A Normal Probability Distribution

where
µ = mean of the random variable X
σ = standard deviation of X
e = 2.71828…
π = 3.14159…

is a normally distributed variable with mean zero and standard deviation 1. The probability distribution of Z is called the standard normal distribution. Notice that z gives the number of standard deviations that a value of x lies above or below the mean. By using the Z score, all normal distributions can be transformed to the standard normal distribution. We can say that if X is N(µ, σ²), then Z = (X - µ)/σ is N(0,1).
Example: The actual amount of coffee grounds that a filling machine puts into “6-ounce” jars varies from jar to jar, and it may be assumed to be a normal random variable with a standard deviation of 0.04 ounce. If the jar contains less than 6 ounces, it is considered unacceptable. Determine the mean fill of the machine so that only 1 percent of the jars will be unacceptable.

Solution: Let x be the amount of coffee in the jar. We are given σ = 0.04.

We are asked to find the average fill, µ, such that Pr(x<6) = .01

$$\Pr(x < 6) = \Pr\left(\frac{x - \mu}{\sigma} < \frac{6 - \mu}{\sigma}\right) = \Pr\left(z < \frac{6 - \mu}{.04}\right) = .01$$

From the above table we find that Pr(z < –2.326) = .01; therefore,

$$\frac{6 - \mu}{.04} = -2.326, \qquad \mu = 6.093$$

If the average fill is set at 6.093 ounces, only 1 percent of the jars will contain less than 6 ounces.

2. The average project duration to build a greenfield manufacturing plant is 26 months with a standard deviation of two months, assuming the project critical path follows a normal distribution. Your company is planning to build a similar greenfield manufacturing plant. The senior management is interested in the following three schedule outcomes:
a. What is the chance to complete the project between 24 and 28 months?
b. What is the likelihood of completing the project in 24 months?
c. What is the risk that the project duration would exceed 30 months?

Solutions:

1. a. Mean is the average work-hours/unit of the 200 observations.

$$\mu = \frac{\sum x}{N} = \frac{(6 \times 6) + (7 \times 11) + (8 \times 27) + (9 \times 47) + (10 \times 52) + (11 \times 44) + (12 \times 9) + (13 \times 4)}{200}$$
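The coffee-fill result can be checked without a printed table. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) in place of the table lookup; the function name `required_mean` is an illustrative choice.

```python
from statistics import NormalDist

def required_mean(lower_limit, sigma, max_fraction_below):
    """Mean fill such that only max_fraction_below of jars fall under lower_limit."""
    # z value with the requested area to its left on the standard normal curve.
    z = NormalDist().inv_cdf(max_fraction_below)
    # Solve (lower_limit - mu) / sigma = z for mu.
    return lower_limit - z * sigma

mu = required_mean(lower_limit=6.0, sigma=0.04, max_fraction_below=0.01)
print(f"required mean fill = {mu:.3f} ounces")

# Cross-check: with this mean, the fraction of jars below 6 ounces is 1 percent.
print(f"Pr(x < 6) = {NormalDist(mu, 0.04).cdf(6.0):.4f}")
```

The same function answers variations of the question, for example a tighter tolerance of 0.5 percent or a machine with a different standard deviation.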
2. a. The project duration between 24 and 28 months is within one standard deviation from the average (mean) duration of 26 months. Since the probability within one standard deviation from the mean is 0.68, this project has a 68 percent chance to be completed between 24 and 28 months.

c. Thirty-month duration is four months longer than the 26-month mean duration, which is two standard deviations above the mean. From the standard normal distribution table, the probability exceeding two standard deviations is 0.023. Thus, there is only a 2 percent risk that the project duration would exceed 30 months.
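The two schedule probabilities above can be confirmed with the standard library's `statistics.NormalDist`, used here as a stand-in for the standard normal table. This is a sketch under the stated assumption that the duration is normal with mean 26 months and standard deviation 2 months.

```python
from statistics import NormalDist

# Project duration in months, assumed normally distributed.
duration = NormalDist(mu=26, sigma=2)

# (a) Chance of finishing between 24 and 28 months (within one standard deviation).
p_24_to_28 = duration.cdf(28) - duration.cdf(24)

# (c) Risk of exceeding 30 months (more than two standard deviations above the mean).
p_over_30 = 1 - duration.cdf(30)

print(f"Pr(24 <= duration <= 28) = {p_24_to_28:.3f}")
print(f"Pr(duration > 30)        = {p_over_30:.3f}")
```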
REFERENCES
AACE INTERNATIONAL BASIC CONCEPTS IN DESCRIPTIVE STATISTICS
Chapter 30
The manager of the construction process must make decisions daily that affect the operation of an individual project as well as the company as a whole. Rarely does a manager have the intuition to make decisions that avoid serious or continued error without input from past field and company performance. The successful company collects information so that when analyzed, good decisions can be made. Statistics constitutes all methods useful for the analysis of this information. In general, statistical methods are of two types and subsequent purpose:

1. descriptive statistics, which allow the cost engineer to organize, summarize, interpret, and communicate quantitative information obtained from observations; and
2. inferential statistics, which allow the cost engineer to go beyond the data collected from a small sample to formulate tentative conclusions about the population from which the sample was taken.

This chapter examines some of the basic concepts and procedures that are a part of descriptive statistics.

After completing this chapter, the readers should be able to

• understand the basic concepts and procedures of descriptive statistics, such as frequency distributions, frequency graphs, the normal curve, and cumulative probability curve.

FREQUENCY DISTRIBUTIONS

A concrete contractor engaged in installing foundation footings and walls needs to be able to predict with some accuracy the time it takes to install formwork. Not only will a knowledge of past performance help the contractor predict future performance useful in estimating and scheduling, this knowledge also will provide an internal benchmark for cost control purposes once the project is in progress. Before these predictions can be made, the contractor must know more about past performance. Thus, data about similar forming techniques is collected from 20 projects completed over the last several years. This data is summarized in Table 30.1.
The data above is hard to interpret in its present form. It must be organized so that the data yields meaning to the manager.
Frequency Distribution—A frequency distribution is an organization of measures or observations that lists the class (in this case, the productivity rate as measured in labor hours per square foot of contact area) and the frequency or the number of times this production rate was achieved. In Table 30.2, the data has been rearranged by listing the data from high productivity to low productivity (column 1), and the number of times this production rate occurred (column 2).

Interpretation of the cumulative frequency distribution (column 3) indicates that 6 of the 20 productivity rates fall below (too many hours expended per square foot of contact area) the required rate of .055 HRS/SFCA. In order to be within budget and on time for more than 2/3 of the projects, the data indicates that the contractor has two choices: (1) bid future work at a higher rate per square foot of contact area, or (2) implement process changes to increase production on future projects.
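A frequency distribution of this kind can be built from the raw observations in a few lines. The sketch below uses the 20 production rates as they appear in the chapter's ordered listing of scores; the layout of the printed table is an assumption and is not meant to reproduce Table 30.2 column for column.

```python
from collections import Counter
from itertools import accumulate

# The 20 observed production rates in HRS/SFCA (fewer hours = higher productivity).
rates = ([0.04] + [0.045] * 3 + [0.05] * 7 + [0.055] * 3
         + [0.06] * 2 + [0.065] * 3 + [0.07])

required = 0.055  # required rate in HRS/SFCA

freq = Counter(rates)
classes = sorted(freq)  # from high productivity (low hours) to low productivity
cum = list(accumulate(freq[c] for c in classes))

print("rate    f   cumulative f")
for c, cf in zip(classes, cum):
    print(f"{c:.3f}  {freq[c]:2d}   {cf:2d}")

# Projects that needed more hours per square foot than the required rate.
worse = sum(1 for r in rates if r > required)
print(f"{worse} of {len(rates)} rates fall short of {required} HRS/SFCA")
```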
FREQUENCY GRAPHS
Once the contractor finds that 30 percent of the projects yield an unacceptable production rate, this information is conveyed to project management and labor. Rather than show the information in tabular form, the contractor decides to present the information graphically. Graphs convey the essential characteristics of a frequency distribution in a pictorial form. Graphical information is much more pleasing to view than tables, and so provides an effective medium for communicating frequency distribution information to others.

Frequency graphs have two characteristics in common: (1) one axis that represents all possible scores or classes within a distribution, and (2) one axis that represents the frequency of occurrence of that score or class. Frequency distributions are represented graphically via the histogram or the frequency polygon.

Histogram—In a histogram, the frequency of each score or class is represented as a vertical bar. For the production rate data, a histogram would be produced as shown in Figure 30.1. When developing the histogram, the 3/4 rule should be applied. That is, the highest frequency should be laid out so that the height is approximately 3/4 the length of the horizontal axis. Otherwise, the viewer may obtain the wrong impression based on graph appearance rather than on graph data. The bar width should be the same as the “real limit” of a class. For example, suppose the production rates were rounded off to the nearest .005, then the real productivity rate for class .05 would fall between .0475 and .0525. Thus, the width of the vertical bar would be .005 and extend from .0475 to .0525 on the graph. In addition, the graph should be titled in a descriptive fashion to indicate what the graph is showing.

Frequency Polygon—For the frequency polygon, the vertical and horizontal axes are laid out the same way as for the histogram, but instead of drawing a vertical bar, a point is plotted at the exact score (or midpoint of the interval) and at a height corresponding to the frequency of that score (or interval). These points are then connected by a straight line, which results in a polygon. See Figure 30.2.

The reason for constructing histograms and frequency polygons is to reveal how scores are distributed along the score scale. That is, the form of the distribution is shown. A distribution is symmetrical if one side is a mirror image of the other. If not, it is asymmetrical. Asymmetrical curves can be skewed either positively or negatively. See Figure 30.3.

For negative skewness, the tail travels to the left; for positive skewness, the tail travels to the right. The production rate frequency polygon, Figure 30.3, indicates a mild skew in the positive direction. Another noticeable feature of a polygon is the number of humps or high points. If only one high point, then the curve is unimodal. If two humps, then the curve is bimodal. For three humps, it is trimodal, and so on. The production rate frequency polygon, Figure 30.2, shows 2 humps, one much higher than the other.

The frequency graph's information should generate curiosity among the viewers. In this case, the following questions arise, “Why the second hump at .065 HRS/SFCA? What causes the variation? Can we isolate this cause and fix it?” For example, suppose that an examination of job data reveals that the projects with the lower productivity rate (higher number of hours per square foot of contact area installed) occurred where formwork was stacked. One might conclude that the stacking activity requires more hours related to square feet of contact area than if no stacking occurred. Thus, the outcome of the analysis would be to have two budget rates—one for when panels are not stacked and one for when they are. Based on the frequency distribution, viable budget rates would be .050 HRS/SFCA for no stacking and .065 HRS/SFCA for stacking. Each rate reflects a mode of the frequency distribution.

GLOSSARY TERMS IN THIS CHAPTER
frequency distribution ◆ standard deviation ◆ statistics

[Figure: frequency graph of the production rates; vertical axis FREQUENCY (0 to 8), horizontal axis production rate (0.035 to 0.075)]
MEASURES OF CENTRAL TENDENCY

The frequency distributions can be characterized by certain statistics. One type of statistic is called the index of central tendency, or average, and represents the general location of a distribution of measures on the measurement scale. There are three commonly used indexes of central tendency—the mode, the median, and the mean.

Mode—The mode is the simplest measure of central tendency. It is merely the score value or measure that occurs most often in a distribution of scores. For the production rate distribution, the score occurring most often is .050 HRS/SFCA. Hence, the mode for this distribution is .050.

Median—The median is the middle point in a distribution. Half of the distribution is above this point and half is below. To find the median one arranges the scores in order. For example, consider the following observations of concrete test cylinders: 2700 psi, 2750 psi, 2965 psi, 3100 psi, 3130 psi, 3480 psi, and 3500 psi. The score, 3100 psi, is the middle point. There are three observations above and below the median of 3100 psi.

When the number of scores is even or there is a repetition of a certain score, then the location of the median requires computation. Consider the production rate distribution shown in Figure 30.4. Since there are 20 scores, the median is the point below which 10 cases fall.

If one counts from the left of the distribution, one finds that the median falls between the sixth and seventh .05 in the distribution. Since there is an additional .05 lying above the tenth score which is also .05, one cannot say that .05 is the median. Looking at the distribution, there are 4 scores that fall below .05, and 9 scores that lie above .05. Thus, .05 would not fit the definition of the median. But, one knows that the median falls somewhere within the interval .005, somewhere between .0475 and .0525. One can locate the median within an interval by applying the formula shown in Figure 30.5.

Mean—The best known and most reliable measure of central tendency is the mean. The mean is the arithmetic average of a group of scores. Thus, for the production rate distribution containing the 20 observations in Table 30.1, one would compute the mean as shown in Table 30.3.

Comparison of Mean, Median, and Mode—If the contractor wants to know what production rate occurred most often, the mode would be calculated. However, the mode is a crude and unstable measure of central tendency and is generally not used to describe a distribution. Usually the median or the mean is used. However, there is an important difference between the median and mean. The median is a rank or a position statistic unaffected by the numerical size of the individual scores, while the mean is sensitive to the size of the individual scores in a distribution, including extreme scores.

[Figure: frequency graph of the production rates; vertical axis FREQUENCY (0 to 8), horizontal axis production rate (0.040 to 0.075)]

[Figure: skewed curves; panel titles: Positive, Negative]
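The three measures of central tendency discussed above can be computed for the 20 production rates as follows. This is a minimal sketch: the mode and mean come from Python's `statistics` module, while `grouped_median` is an illustrative helper that applies the interval formula Mdn = L + [(N/2 - cfb)/fw]i, assuming scores rounded to the nearest .005.

```python
from statistics import mean, mode

# The 20 observed production rates in HRS/SFCA, in ascending order.
rates = ([0.04] + [0.045] * 3 + [0.05] * 7 + [0.055] * 3
         + [0.06] * 2 + [0.065] * 3 + [0.07])

def grouped_median(scores, interval):
    """Median located inside its class interval: L + ((N/2 - cfb) / fw) * i."""
    n = len(scores)
    for value in sorted(set(scores)):
        cfb = sum(1 for s in scores if s < value)   # cumulative frequency below
        fw = sum(1 for s in scores if s == value)   # frequency within the interval
        if cfb + fw >= n / 2:                       # the median falls in this class
            lower_real_limit = value - interval / 2
            return lower_real_limit + ((n / 2 - cfb) / fw) * interval
    raise ValueError("scores must not be empty")

print(f"mode   = {mode(rates):.3f}")
print(f"median = {grouped_median(rates, 0.005):.5f}")
print(f"mean   = {mean(rates):.5f}")
```

The mean exceeds the median, which in turn exceeds the mode, matching the mild positive skew noted for this distribution.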
.04, .045, .045, .045, .05, .05, .05, .05, .05, .05, .05, .055, .055, .055, .06, .06, .065, .065, .065, .07
10 scores <--|--> 10 scores

Mdn = L + [(N/2 - cfb)/fw]i = .0475 + [(20/2 - 4)/7](.005) = .05179

where: Mdn = the median
       N = total number of cases in the distribution
       L = lower real limit
       cfb = the cumulative frequency below
       fw = the frequency of cases within the median interval
       i = the interval size

Figure 30.5—Calculation of the Median Production Rate

Table 30.3—Calculation of the Mean Production Rate

RATE (HRS/SFCA)   FREQUENCY (F)   PRODUCT
.040               1              .040
.045               3              .135
.050               7              .350
.055               3              .165
.060               2              .120
.065               3              .195
.070               1              .070
TOTALS            20             1.075
Mean = Sum of scores/N = 1.075/20 = .05375

distribution is skewed (that is, scores are concentrated more at one end or the other), then the curve will not be symmetrical, and the three measures of central tendency will not be equal. Note that the median lies between the mode and mean in all skewed distributions. If negatively skewed, the median is higher than the mean. If positively skewed, the median is lower than the mean.

The mean is the most stable or reliable measure of central tendency. If one were to draw a sample from the total population, the mean would show less fluctuation from sample to sample than the median. Thus, if one wanted to infer some characteristic about a population from a sample, the mean would yield the most reliable estimate of the population parameter. Or, stated another way, if the contractor wanted to bid the next job based on past experience, the best estimate (that rate with the least amount of error) of the actual production rate would be the mean of .05375 HRS/SFCA.

MEASURE OF VARIABILITY

The measures of central tendency (mean, median, and mode) provide a concise index of the average value of a set of scores or measures. However, there is more to be known about a distribution of scores than this one characteristic. The amount of variability or spread of the scores within the distribution is also an important characteristic to know about a given distribution. For example, suppose the production rates varied from .02 to .09, a spread of .07, rather than .04 to .07, a spread of .03; yet, in both instances the mean was .05375. If the contractor used the mean production rate to bid the work, the distribution with the greater spread would yield less confidence in the accuracy of the mean rate for any one particular project than if the spread were small. In addition, the real amount of error would be greater.

The Range—The simplest measure of variability is the range. The range is defined as the difference between the lowest and highest score in a distribution of scores. Figure 30.6 shows the calculation of the range.

Range = Xh - Xl = .07 - .04 = .03

where: Xh = the highest score
       Xl = the lowest score

Figure 30.6—Calculation of the Range for the Production Rates

The range is not considered a stable measure of variability because the value can change greatly with the change in a single score within the distribution, either the high or low score. In addition, there may be frequent or large gaps in the distribution, which the range does not reflect, because it only uses two scores—the high and low. Thus, the range is only useful as a quick estimate of variability.

Quartile Deviation—The quartile deviation is more stable than the range because it is based on the spread of the scores through the center of the distribution rather than through the two extremes. The quartile deviation is the measure which is half the distance between the 1st and 3rd quartiles. The first quartile (Q1) is the score that sets off the lowest 25 percent of the scores while the third quartile (Q3) sets off the upper 25 percent of the scores. The interval from Q1 to Q3 contains the middle 50 percent of the scores in a distribution and is called the inter-quartile range. Then, this distance is divided by 2 to give the average distance from the median to each of the quartiles. This is called the quartile deviation (QD). For the production rates, the calculations are shown in Figure 30.7.

Since the quartile deviation is an index that reflects the spread of scores throughout the middle part of the distribution, it should be used whenever extreme scores may distort the data. Thus, the median and the quartile deviation are both insensitive to extreme scores in the distribution and should be used accordingly.

Standard Deviation—The major disadvantage of the quartile deviation is that it does not take into account the value of each of the raw scores in the distribution. A more reliable indicator of the spread of a distribution can be found by determining the amount each score deviates from the mean of the distribution. In most instances, the contractor will be computing a sample statistic rather than a population statistic. The production rates do not include the entire population of rates for every job past, present, and future. Thus, the 20 rates are a sample of all work the contractor does. The sample standard deviation can be calculated from the frequency distribution shown in Table 30.4.

The standard deviation can be computed from raw scores with the use of an inexpensive hand-held calculator that has statistical functions built in. One simply enters the raw scores in the STAT mode on the calculator. Then, a few key strokes will yield such information as the sample mean and the sample standard deviation.

where:
Q1 = first quartile
Q3 = third quartile
L = the lower limit of the interval within which the first quartile lies or the third quartile lies
N = the number of cases
cfb = the cumulative frequency below the interval containing either the first quartile or third quartile
fw = the frequency of cases within the interval containing either the first or third quartile
i = the interval size
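The range and quartile deviation for the 20 production rates can be sketched in code. The quartile formula used here is an assumption: it mirrors the interval formula given for the median, with N/4 and 3N/4 in place of N/2, which is consistent with the symbols L, N, cfb, fw and i defined above but is not copied from Figure 30.7. The helper name `grouped_quantile` is illustrative.

```python
# The 20 observed production rates in HRS/SFCA.
rates = ([0.04] + [0.045] * 3 + [0.05] * 7 + [0.055] * 3
         + [0.06] * 2 + [0.065] * 3 + [0.07])

def grouped_quantile(scores, fraction, interval):
    """Assumed form: L + ((fraction * N - cfb) / fw) * i, as for the median."""
    target = fraction * len(scores)
    for value in sorted(set(scores)):
        cfb = sum(1 for s in scores if s < value)   # cumulative frequency below
        fw = sum(1 for s in scores if s == value)   # frequency within the interval
        if cfb + fw >= target:
            lower_limit = value - interval / 2
            return lower_limit + ((target - cfb) / fw) * interval
    raise ValueError("scores must not be empty")

# Range: highest score minus lowest score.
score_range = max(rates) - min(rates)

q1 = grouped_quantile(rates, 0.25, 0.005)
q3 = grouped_quantile(rates, 0.75, 0.005)

# Quartile deviation: half the inter-quartile range.
qd = (q3 - q1) / 2

print(f"range = {score_range:.3f}")
print(f"Q1 = {q1:.5f}, Q3 = {q3:.5f}, QD = {qd:.5f}")
```

Changing a single extreme score, for example replacing .07 with .09, alters the range but leaves Q1, Q3 and QD untouched, which illustrates why the quartile deviation is described as the more stable measure.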
THE NORMAL CURVE

Recall that a symmetrical, bell-shaped frequency distribution is known as the normal curve. The normal curve is unimodal with mean, median, and mode at the same point. See Figure 30.8.

In actuality, the normal curve is a theoretical curve, which by definition can take many shapes, but in all of those shapes, the shape is symmetrical, and the curve is unimodal. This theoretical curve is important because many physical and psychological phenomena resemble the normal curve when shown in a frequency distribution.

Properties—The important properties of normal curves are (1) the curve is symmetrical with its maximum height at the mean; (2) the mean, median, and mode fall at the same point; (3) the height of the curve decreases to the left and to the right of the mean at an accelerated rate, which forms the convex portion of the curve until reaching one standard deviation above or below the mean at which point the decrease decelerates and the curve becomes concave; and (4) the theoretical range of the curve is plus infinity to minus infinity, but for all practical purposes so little of the curve falls below -3 standard deviations or above +3 standard deviations that for most frequency distributions these are the practical limits of the curve.

z-scores—Though two or more frequency distributions may approximate normality yet differ in terms of their means and standard deviations, any normal distribution can be transformed into a distribution of standard scores. These scores are known as z-scores. The distribution is known as the standard normal curve.

[Figure: frequency curve; vertical axis Frequency]

The z-score is computed from a sample score by applying the formula, z = (X - X̄)/s, where X̄ is the mean of the distribution, X is a raw score from the distribution, and s is the sample standard deviation. For example, assume that the production rate frequency distribution approximates the normal curve. Then, a rate of .060 would yield the following:

z = (.060 - .05375)/.009 = .694 or .69

When all scores from a distribution are transformed to their corresponding z-scores, the result is a distribution of scores with a mean of 0 and a standard deviation of 1. Thus, the score above is .694 standard deviation above the mean. One can determine what this z-score indicates by entering published tables. A portion of one such table is shown in Table 30.5.

Based on the observed data, the contractor concludes that for the z-score of .69 (column 1), the area (proportion of scores) from the mean to .69 is .2549 or 25.49 percent (column 2), the area below the score of .69 is .7549 or 75.49 percent (column 3), the area above the score of .69 is .2451 or 24.51 percent (column 4), and the .69 score is found on the curve at the y-ordinate of .3144 (column 5). Thus, the contractor could interpret the score by saying that 75.49 percent of all the possible production rates will fall below .060 HRS/SFCA.

The contractor also can use the standard normal curve table to determine a specific production rate to be used in estimating and scheduling. Assume that the contractor wants to select a rate based on past experience that will be greater than 95 percent of the rates possible. From the table, the contractor finds the area in larger portion (column 3) closest to .95 or 95 percent. From the table, .9505 is found, with a corresponding z-score of 1.65. (Typically, for the kinds of analysis the contractor will perform, interpolation is not necessary.)

1.65 = (X - .05375)/.009
X = .0686 HRS/SFCA

When one compares this rate (.0686) with that shown on
(1)        (2)               (3)               (4)               (5)
z-score    Area from         Area in           Area in           y ordinate
           mean to z-score   larger portion    smaller portion   at z-score
.67        .2486             .7486             .2514             .3187
.68        .2517             .7517             .2483             .3166
.69        .2549             .7549             .2451             .3144
.70        .2580             .7580             .2420             .3123
.71        .2611             .7611             .2389             .3101
.72        .2642             .7642             .2358             .3079
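The table lookup for z = .69 and the 95 percent production rate can be reproduced with `statistics.NormalDist` from the Python standard library. The sketch takes the mean (.05375) and sample standard deviation (.009) used in the chapter as given; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()        # mean 0, standard deviation 1
mean_rate, s = 0.05375, 0.009    # HRS/SFCA, values used in the chapter

# z-score of a .060 HRS/SFCA production rate, rounded as in the table.
z = (0.060 - mean_rate) / s
z_rounded = round(z, 2)

below = std_normal.cdf(z_rounded)
row = (
    z_rounded,                   # column 1: z-score
    below - 0.5,                 # column 2: area from mean to z-score
    below,                       # column 3: area in larger portion
    1 - below,                   # column 4: area in smaller portion
    std_normal.pdf(z_rounded),   # column 5: y ordinate at z-score
)
print("  ".join(f"{v:.4f}" for v in row))

# Rate that exceeds about 95 percent of possible rates, using z = 1.65.
rate_95 = mean_rate + 1.65 * s
print(f"95 percent rate = {rate_95:.4f} HRS/SFCA")
```

Calling `std_normal.inv_cdf(0.95)` would give the unrounded z value if interpolation were ever needed.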
CUMULATIVE PROBABILITY CURVE

One key purpose of the frequency polygon is to examine how the data is distributed. From this examination, one gets an indication of whether the scores are normally distributed or not. If they are not, then the cumulative probability curve can be applied to the data. An examination of the frequency polygon in Figure 30.2 reveals that the production rate data is not normally distributed. The curve plots the cumulative percentage distribution data (Table 30.2, column 4) as illustrated in Figure 30.9.

Note that the curve shows the percentage of scores (crude measure of probability) where the production rate will fall below a certain value. For example, the contractor may wish to know what the probability is that the production rate will not meet or exceed the acceptable rate. From the cumulative probability curve, one reads that 30 percent of the production rates are .060 HRS/SFCA or higher; thus, 30 percent of the observed rates failed to yield the acceptable rate of .055 HRS/SFCA.

It is important to note that the construction of the cumulative probability curve is dependent upon the question asked. In this case, the contractor wanted to determine the probability of failure. The contractor could just as easily have asked for the probability of success. In this instance, the data in Table 30.2 would have been rearranged from low production (high score) to high production (low score). The cumulative percentage frequency would be found as shown in Table 30.6.

[Figure: cumulative probability curve; vertical axis CUMULATIVE PERCENTAGE (0 to 100 %), horizontal axis production rate (0.035 to 0.075)]
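The two readings of the cumulative probability curve, failure and success, are complements of one another and can be computed from the raw rates. This is a minimal sketch using the 20 production rates listed in the chapter; the function names are illustrative and the percentages are observed proportions, which the text treats as a crude measure of probability.

```python
# The 20 observed production rates in HRS/SFCA.
rates = ([0.04] + [0.045] * 3 + [0.05] * 7 + [0.055] * 3
         + [0.06] * 2 + [0.065] * 3 + [0.07])

acceptable = 0.055  # acceptable production rate in HRS/SFCA
n = len(rates)

def pct_at_or_above(threshold):
    """Cumulative percentage of rates at or above the threshold (more hours expended)."""
    return 100 * sum(1 for r in rates if r >= threshold) / n

def pct_at_or_below(threshold):
    """Cumulative percentage of rates at or below the threshold (fewer hours expended)."""
    return 100 * sum(1 for r in rates if r <= threshold) / n

failure = pct_at_or_above(0.060)       # rates worse than the acceptable rate
success = pct_at_or_below(acceptable)  # rates that meet the acceptable rate

print(f"probability of failure = {failure:.0f} percent")
print(f"probability of success = {success:.0f} percent")
```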
The resultant cumulative probability curve is shown in Figure 30.10. Note that the successful production rate is .055 HRS/SFCA. Thus, from the curve, the probability of success is 70 percent.

[Figure: cumulative probability curve; vertical axis CUMULATIVE PERCENTAGE (0 to 100 %), horizontal axis production rate (0.035 to 0.075)]

3. Montgomery, D. C., and G. C. Runger. 1994. Applied Statistics and Probability for Engineers. New York: John Wiley & Sons, Inc.
4. Spiegel, M. R. 1989. Schaum’s Outline of Statistics. New York: McGraw-Hill, Inc.

RECOMMENDED READING